arXiv Papers Today | 2024-12-23

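This post lists the latest papers retrieved from arXiv.org on 2024-12-23. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically every morning at around 12:00.

Friendly reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.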

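[Quick read]: This paper aims to improve the multi-step reasoning ability of large language models (LLMs) through offline reinforcement learning (RL). The key to the solution is OREO (Offline Reasoning Optimization), a method that builds on maximum entropy reinforcement learning and jointly learns a policy model and a value function by optimizing the soft Bellman equation. This reduces the dependence on paired preference data and improves credit assignment in multi-step reasoning tasks. Experiments show that OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning (GSM8K and MATH) and embodied agent control (ALFWorld). The learned value function can also guide tree search at test time, further improving performance.

Link: https://arxiv.org/abs/2412.16145
Authors: Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, Yi Wu
Affiliations: Unknown
Keywords: multi-step reasoning tasks, multi-step reasoning, Direct Preference Optimization, multi-step reasoning ability, reasoning tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: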

Abstract:Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost performance during test time.
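
As a rough illustration of the soft Bellman consistency mentioned in the abstract, the sketch below computes a per-step residual for a single trajectory. It assumes deterministic transitions, a KL-regularised objective with temperature `beta`, and a reference policy; under those assumptions the optimal policy satisfies V(s_t) - V(s_{t+1}) = r_t - beta * log(pi / pi_ref). The function name, argument layout, and toy numbers are hypothetical and are not taken from the paper, whose exact training objective may differ.

```python
def soft_bellman_residuals(values, rewards, logp_policy, logp_ref, beta=1.0):
    """Residual of V(s_t) - V(s_{t+1}) = r_t - beta * log(pi / pi_ref) per step.

    `values` holds one more entry than `rewards`: its last element is the
    value of the terminal state (0.0 in the toy example below).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    residuals = []
    for t, reward in enumerate(rewards):
        log_ratio = logp_policy[t] - logp_ref[t]
        residuals.append(values[t] - values[t + 1] - reward + beta * log_ratio)
    return residuals


# Toy trajectory (hypothetical numbers) with a sparse reward on the last step.
values = [1.0, 1.0, 1.0, 0.0]
rewards = [0.0, 0.0, 1.0]
logp_policy = [-0.5, -0.2, -0.1]
logp_ref = [-0.5, -0.2, -0.1]  # policy equals the reference, so log-ratios are 0

residuals = soft_bellman_residuals(values, rewards, logp_policy, logp_ref)
# A squared-residual loss is one simple way to push both the value
# estimates and the policy toward consistency.
loss = sum(r * r for r in residuals) / len(residuals)
```

Because the residual ties each step's value to the next, a reward that arrives only at the end can still inform earlier steps, which is the intuition behind the credit-assignment claim.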

[NLP-1] Can LLMs Obfuscate Code? A Systematic Analysis of Large Language Models into Assembly Code Obfuscation AAAI2025

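[Quick read]: This paper asks whether large language models (LLMs) can generate new obfuscated assembly code, a capability that would make malware harder to detect. The key contribution is the MetamorphASM benchmark, which comprises the MetamorphASM Dataset (MAD) and three code obfuscation techniques: dead code, register substitution, and control flow change. By systematically evaluating how well LLMs generate and analyze obfuscated code, the paper confirms that LLMs can produce obfuscated assembly code, and it releases the dataset and evaluation method as a foundation for studying and mitigating this risk.

Link: https://arxiv.org/abs/2412.16135
Authors: Seyedreza Mohseni, Seyedali Mohammadi, Deepa Tilwani, Yash Saxena, Gerald Ndwula, Sriram Vema, Edward Raff, Manas Gaur
Affiliations: KAI2 Lab UMBC; University of Maryland, Baltimore County
Keywords: Malware authors, malware harder, Large Language Models, obfuscated assembly code, employ code obfuscations
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To appear in AAAI 2025, Main Track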

Abstract: Malware authors often employ code obfuscations to make their malware harder to detect. Existing tools for generating obfuscated code often require access to the original source code (e.g., C++ or Java), and adding new obfuscations is a non-trivial, labor-intensive process. In this study, we ask the following question: Can Large Language Models (LLMs) potentially generate new obfuscated assembly code? If so, this poses a risk to anti-virus engines and potentially increases the flexibility of attackers to create new obfuscation patterns. We answer this in the affirmative by developing the MetamorphASM benchmark comprising the MetamorphASM Dataset (MAD) along with three code obfuscation techniques: dead code, register substitution, and control flow change. MetamorphASM systematically evaluates the ability of LLMs to generate and analyze obfuscated code using MAD, which contains 328,200 obfuscated assembly code samples. We release this dataset and analyze the success rate of various LLMs (e.g., GPT-3.5/4, GPT-4o-mini, Starcoder, CodeGemma, CodeLlama, CodeT5, and LLaMA 3.1) in generating obfuscated assembly code. The evaluation was performed using established information-theoretic metrics and manual human review to ensure correctness and provide the foundation for researchers to study and develop remediations to this risk. The source code can be found at the following GitHub link: this https URL.

[NLP-2] PromptOptMe: Error-Aware Prompt Compression for LLM-based MT Evaluation Metrics

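[Quick read]: This paper addresses the high computational cost of using large language models (LLMs) such as GPT-4 to evaluate machine-generated text, a cost driven by the large number of tokens that complex evaluation prompts consume. The key to the solution is a prompt optimization approach in which a smaller, fine-tuned language model compresses the input data, reducing token usage and computational cost when a larger LLM performs the downstream evaluation. Fine-tuning proceeds in two stages: supervised fine-tuning, followed by preference optimization that refines the model's outputs based on human preferences. The results show a 2.37× reduction in token usage with no loss in evaluation quality, making state-of-the-art LLM-based metrics such as GEMBA-MQM more cost-effective and efficient.

Link: https://arxiv.org/abs/2412.16120
Authors: Daniil Larionov, Steffen Eger
Affiliations: NLLG; University of Mannheim; University of Technology Nuremberg
Keywords: Natural Language Processing, machine-generated natural language, natural language content, natural language, Language Processing
Categories: Computation and Language (cs.CL)
Comments: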

Abstract: Evaluating the quality of machine-generated natural language content is a challenging task in Natural Language Processing (NLP). Recently, large language models (LLMs) like GPT-4 have been employed for this purpose, but they are computationally expensive due to the extensive token usage required by complex evaluation prompts. In this paper, we propose a prompt optimization approach that uses a smaller, fine-tuned language model to compress input data for evaluation prompts, thus reducing token usage and computational cost when using larger LLMs for downstream evaluation. Our method involves a two-stage fine-tuning process: supervised fine-tuning followed by preference optimization to refine the model’s outputs based on human preferences. We focus on Machine Translation (MT) evaluation and utilize the GEMBA-MQM metric as a starting point. Our results show a 2.37× reduction in token usage without any loss in evaluation quality. This work makes state-of-the-art LLM-based metrics like GEMBA-MQM more cost-effective and efficient, enhancing their accessibility for broader use.
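
To make the reported 2.37× figure concrete, the short calculation below relates a reduction factor to the share of tokens saved. The token counts are hypothetical round numbers chosen only to reproduce the factor; they do not come from the paper.

```python
def reduction_factor(original_tokens, compressed_tokens):
    """How many times smaller the compressed prompt is than the original."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return original_tokens / compressed_tokens


def saved_share(factor):
    """Fraction of the original tokens that no longer need to be sent."""
    return 1 - 1 / factor


# Hypothetical prompt sizes that give the factor quoted in the abstract.
factor = reduction_factor(2370, 1000)
share = saved_share(factor)
print(f"{factor:.2f}x reduction saves about {share:.0%} of the tokens")
```

Under these assumed counts, a 2.37× reduction means a little under 58% of the original tokens are saved on each evaluation call.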

[NLP-3] Logical Consistency of Large Language Models in Fact-checking

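[Quick read]: This paper addresses the logical inconsistency that large language models (LLMs) show on complex logical queries. The key to the solution is a logical-consistency assessment based on propositional logic, combined with supervised fine-tuning on knowledge graph (KG) contexts to improve consistency on complex fact-checking tasks. Specifically, the paper introduces three logical fact-checking datasets, shows that existing LLMs are logically inconsistent on propositional logic queries, especially complex ones, and improves their consistency through fine-tuning.

Link: https://arxiv.org/abs/2412.16100
Authors: Bishwamittra Ghosh, Sarah Hasan, Naheed Anjum Arafat, Arijit Khan
Affiliations: Max Planck Institute for Software Systems; Aalborg University; Independent Researcher
Keywords: large language models, varied natural language, demonstrated significant success, performing varied natural, natural language tasks
Categories: Computation and Language (cs.CL)
Comments: Under review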

Abstract: In recent years, large language models (LLMs) have demonstrated significant success in performing varied natural language tasks such as language translation, question-answering, summarizing, fact-checking, etc. Despite LLMs’ impressive ability to generate human-like texts, LLMs are infamous for their inconsistent responses – a meaning-preserving change in the input query results in an inconsistent response and contributes to vulnerabilities of LLMs such as hallucination, jailbreaking, etc. Consequently, existing research focuses on simple paraphrasing-based consistency assessment of LLMs, and ignores complex queries that necessitate an even better understanding of logical reasoning by an LLM. Our work therefore addresses the logical inconsistency of LLMs under complex logical queries with primitive logical operators, e.g., negation, conjunction, and disjunction. As a test bed, we consider retrieval-augmented LLMs on a fact-checking task involving propositional logic queries from real-world knowledge graphs (KGs). Our contributions are three-fold. Benchmark: We introduce three logical fact-checking datasets over KGs for community development towards logically consistent LLMs. Assessment: We propose consistency measures of LLMs on propositional logic queries as input and demonstrate that existing LLMs lack logical consistency, especially on complex queries. Improvement: We employ supervised fine-tuning to improve the logical consistency of LLMs on the complex fact-checking task with KG contexts.
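
One simple way to picture consistency under negation, conjunction, and disjunction is to compare a model's verdicts on compound queries with what its own verdicts on the atomic facts imply. The sketch below does that for toy data. It is a simplified notion of consistency assumed here for illustration; the dictionary keys, function names, and answers are hypothetical, and the paper's actual measures may be defined differently.

```python
def logical_checks(answers):
    """Compare compound verdicts with those implied by the atomic verdicts.

    `answers` maps the query strings 'p', 'q', 'not p', 'p and q', and
    'p or q' to the boolean verdict a model returned for each.
    """
    p, q = answers["p"], answers["q"]
    return {
        "negation": answers["not p"] == (not p),
        "conjunction": answers["p and q"] == (p and q),
        "disjunction": answers["p or q"] == (p or q),
    }


def consistency_rate(answer_sets):
    """Fraction of query groups in which every logical check passes."""
    if not answer_sets:
        return 0.0
    consistent = sum(all(logical_checks(a).values()) for a in answer_sets)
    return consistent / len(answer_sets)


# Hypothetical verdicts for two groups of related fact-checking queries.
model_answers = [
    {"p": True, "q": False, "not p": False, "p and q": False, "p or q": True},
    {"p": True, "q": True, "not p": True, "p and q": True, "p or q": True},
]
rate = consistency_rate(model_answers)  # the second group fails the negation check
```

Note that this checks only internal agreement between answers, not whether any individual answer is factually correct.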

[NLP-4] Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG ECIR2025

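[Quick read]: This paper addresses the interpretability of deep learning in medical image classification, specifically chest X-ray (CXR) classification for clinical use. The key to the solution is combining concept bottleneck models (CBMs) with a multi-agent Retrieval-Augmented Generation (RAG) system to generate radiology reports. By modeling the relationships between visual features and clinical concepts, the method produces interpretable concept vectors that guide the multi-agent RAG system toward reports that are clinically relevant, explainable, and transparent. On the COVID-QU dataset, the model reaches 81% classification accuracy, and five key report-generation metrics range between 84% and 90%, bridging high-performance AI and the explainability required in clinical settings.

Link: https://arxiv.org/abs/2412.16086
Authors: Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag
Affiliations: Unknown
Keywords: advanced medical image, Deep learning, medical image classification, interpretability challenges hinder, learning has advanced
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted in ECIR 2025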

Abstract:Deep learning has advanced medical image classification, but interpretability challenges hinder its clinical adoption. This study enhances interpretability in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs) and a multi-agent Retrieval-Augmented Generation (RAG) system for report generation. By modeling relationships between visual features and clinical concepts, we create interpretable concept vectors that guide a multi-agent RAG system to generate radiology reports, enhancing clinical relevance, explainability, and transparency. Evaluation of the generated reports using an LLM-as-a-judge confirmed the interpretability and clinical utility of our model’s outputs. On the COVID-QU dataset, our model achieved 81% classification accuracy and demonstrated robust report generation performance, with five key metrics ranging between 84% and 90%. This interpretable multi-agent framework bridges the gap between high-performance AI and the explainability required for reliable AI-driven CXR analysis in clinical settings.

[NLP-5] The Only Way is Ethics: A Guide to Ethical Research with Large Language Models COLING’25

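[Quick read]: This paper addresses the complex ethical questions surrounding large language models (LLMs) by offering a practical guide that brings many ethical considerations together. The key to the solution is the "LLM Ethics Whitepaper", an open and living resource that gives NLP practitioners and those who evaluate the ethical implications of others' work concrete recommendations and prompts for reflection. The whitepaper distils a broad literature review into clear Do's and Don'ts and recommends toolkits that support ethical work, helping computer scientists consider ethics at every stage of the project lifecycle.

Link: https://arxiv.org/abs/2412.16022
Authors: Eddie L. Ungless, Nikolas Vitsakis, Zeerak Talat, James Garforth, Björn Ross, Arno Onken, Atoosa Kasirzadeh, Alexandra Birch
Affiliations: University of Edinburgh; Heriot-Watt University
Keywords: LLM Ethics Whitepaper, large language models, Ethics Whitepaper, LLM Ethics, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to COLING '25. This paper is the condensed pocket guide to accompany our full LLM Ethics Whitepaper, available at arXiv:2410.19812, and at this https URL for suggested revisions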

Abstract:There is a significant body of work looking at the ethical considerations of large language models (LLMs): critiquing tools to measure performance and harms; proposing toolkits to aid in ideation; discussing the risks to workers; considering legislation around privacy and security etc. As yet there is no work that integrates these resources into a single practical guide that focuses on LLMs; we attempt this ambitious goal. We introduce ‘LLM Ethics Whitepaper’, which we provide as an open and living resource for NLP practitioners, and those tasked with evaluating the ethical implications of others’ work. Our goal is to translate ethics literature into concrete recommendations and provocations for thinking with clear first steps, aimed at computer scientists. ‘LLM Ethics Whitepaper’ distils a thorough literature review into clear Do’s and Don’ts, which we present also in this paper. We likewise identify useful toolkits to support ethical work. We refer the interested reader to the full LLM Ethics Whitepaper, which provides a succinct discussion of ethical considerations at each stage in a project lifecycle, as well as citations for the hundreds of papers from which we drew our recommendations. The present paper can be thought of as a pocket guide to conducting ethical research with LLMs.

[NLP-6] Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling

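[Quick read]: This paper addresses how to improve speech modeling efficiently in multimodal conversational systems when only a limited amount of speech data is available. The key to the solution is a data-centric customization approach built on a novel multi-task learning paradigm, in which auxiliary tasks are designed to make full use of a small amount of speech data. The approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark using only 10% of the training data. The paper also introduces ASK-QA, the first dataset for multi-turn spoken dialogue with ambiguous user requests and dynamic evaluation inputs.

Link: https://arxiv.org/abs/2412.15995
Authors: Maximillian Chen, Ruoxi Sun, Sercan Ö. Arık
Affiliations: Columbia University; Google
Keywords: diverse real-world applications, real-world applications, assistants are increasingly, increasingly popular, popular across diverse
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 22 pages, 6 figures, 14 tables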

Abstract:Conversational assistants are increasingly popular across diverse real-world applications, highlighting the need for advanced multimodal speech modeling. Speech, as a natural mode of communication, encodes rich user-specific characteristics such as speaking rate and pitch, making it critical for effective interaction. Our work introduces a data-centric customization approach for efficiently enhancing multimodal understanding in conversational speech modeling. Central to our contributions is a novel multi-task learning paradigm that involves designing auxiliary tasks to utilize a small amount of speech data. Our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, using only 10% of the training data with open-weight models, establishing a robust and efficient framework for audio-centric conversational modeling. We also introduce ASK-QA, the first dataset for multi-turn spoken dialogue with ambiguous user requests and dynamic evaluation inputs. Code and data forthcoming.

[NLP-7] Fearful Falcons and Angry Llamas: Emotion Category Annotations of Arguments by Humans and LLMs

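[Quick read]: This paper addresses the absence of discrete emotion category annotations (e.g., "Anger") in argument data. The key to the solution is crowdsourcing subjective emotion category annotations for a German argument corpus and evaluating automatic labeling methods based on large language models (LLMs). The study compares three prompting strategies (zero-shot, one-shot, chain-of-thought) on three instruction-tuned language models (Falcon-7b-instruct, Llama-3.1-8B-instruct, GPT-4o-mini) and examines how the definition of the output space (binary, closed-domain, open-domain) affects emotion prediction. The results show that emotion categories improve the prediction of emotionality in arguments, underlining the need for discrete emotion annotations in argument data.

Link: https://arxiv.org/abs/2412.15993
Authors: Lynn Greschner, Roman Klinger
Affiliations: Unknown
Keywords: Arguments evoke emotions, argument, emotion categories, emotion, Arguments evoke
Categories: Computation and Language (cs.CL)
Comments: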

Abstract: Arguments evoke emotions, influencing the effect of the argument itself. Not only the emotional intensity but also the category influences the argument’s effects, for instance, the willingness to adapt stances. While binary emotionality has been studied in arguments, there is no work on discrete emotion categories (e.g., “Anger”) in such data. To fill this gap, we crowdsource subjective annotations of emotion categories in a German argument corpus and evaluate automatic LLM-based labeling methods. Specifically, we compare three prompting strategies (zero-shot, one-shot, chain-of-thought) on three large instruction-tuned language models (Falcon-7b-instruct, Llama-3.1-8B-instruct, GPT-4o-mini). We further vary the definition of the output space to be binary (is there emotionality in the argument?), closed-domain (which emotion from a given label set is in the argument?), or open-domain (which emotion is in the argument?). We find that emotion categories enhance the prediction of emotionality in arguments, emphasizing the need for discrete emotion annotations in arguments. Across all prompt settings and models, automatic predictions show a high recall but low precision for predicting anger and fear, indicating a strong bias toward negative emotions.
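
The "high recall but low precision" pattern is easy to see on a toy example: a labeler that predicts "anger" too often catches every true anger case but is wrong most of the times it says "anger". The sketch below computes both metrics for one label; the gold and predicted labels are made up for illustration and are not data from the paper.

```python
def precision_recall(gold, pred, label):
    """Precision and recall for a single label over paired gold/predicted lists."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical annotations: the labeler over-predicts negative emotions.
gold = ["anger", "joy", "none", "none", "fear", "none"]
pred = ["anger", "anger", "anger", "fear", "fear", "anger"]

anger_precision, anger_recall = precision_recall(gold, pred, "anger")
# One true positive and three false positives give precision 0.25;
# no anger case is missed, so recall is 1.0.
```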

[NLP-8] BabyHGRN: Exploring RNNs for Sample-Efficient Training of Language Models

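[Quick read]: This paper addresses the efficiency and performance limits that transformer-based models may face in low-resource language modeling. The key to the solution is to explore recurrent neural networks (RNNs) and other subquadratic architectures as alternatives. Specifically, the paper builds on HGRN2, an RNN-based architecture, and compares it with transformer-based baselines and other subquadratic architectures (LSTM, xLSTM, Mamba). The resulting model, BabyHGRN, outperforms transformer-based models in both the 10M and 100M word tracks of the challenge, as measured on the BLiMP, EWoK, GLUE, and BEAR benchmarks. The paper also shows that knowledge distillation improves performance, challenging the prevailing focus on transformer architectures and pointing to the viability of RNN-based models in resource-constrained settings.

Link: https://arxiv.org/abs/2412.15978
Authors: Patrick Haller, Jonas Golde, Alan Akbik
Affiliations: Humboldt-Universität zu Berlin
Keywords: recurrent neural networks, language modeling scenarios, low-resource language modeling, neural networks, modeling scenarios
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 7 figures and tables, Published in Proceedings of the BabyLM Challenge 2025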

Abstract:This paper explores the potential of recurrent neural networks (RNNs) and other subquadratic architectures as competitive alternatives to transformer-based models in low-resource language modeling scenarios. We utilize HGRN2 (Qin et al., 2024), a recently proposed RNN-based architecture, and comparatively evaluate its effectiveness against transformer-based baselines and other subquadratic architectures (LSTM, xLSTM, Mamba). Our experimental results show that BABYHGRN, our HGRN2 language model, outperforms transformer-based models in both the 10M and 100M word tracks of the challenge, as measured by their performance on the BLiMP, EWoK, GLUE and BEAR benchmarks. Further, we show the positive impact of knowledge distillation. Our findings challenge the prevailing focus on transformer architectures and indicate the viability of RNN-based models, particularly in resource-constrained environments.
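
For readers unfamiliar with knowledge distillation, the sketch below shows one common formulation: the student is trained to match the teacher's temperature-softened output distribution via a KL divergence. This is a generic textbook-style loss written in plain Python under that assumption, not the recipe used for BabyHGRN, which the abstract does not specify; the logits and temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return kl * temperature ** 2


# Hypothetical next-token logits over a three-word vocabulary.
teacher_logits = [2.0, 1.0, 0.1]
matching_loss = distillation_loss(teacher_logits, teacher_logits)
mismatch_loss = distillation_loss([0.1, 1.0, 2.0], teacher_logits)
```

In practice this term is usually mixed with the ordinary cross-entropy loss on the training labels; the mixing weight is another design choice left open here.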

[NLP-9] From General to Specific: Tailoring Large Language Models for Personalized Healthcare

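[Quick read]: This paper addresses the lack of personalization in existing medical large language models (LLMs), which do not account for patient variability or offer truly individualized service. The key to the solution is the personalized medical language model (PMLM), a method that optimizes personalized LLMs through recommendation systems and reinforcement learning (RL). Specifically, PMLM uses self-informed and peer-informed personalization to capture changes in patient behaviors and preferences, designs initial personalized prompts tailored to individual needs, and refines these prompts through RL to make LLM guidance more precise. Because the personalized prompts are hard prompts, the method is highly adaptable and reusable and can directly leverage high-quality proprietary LLMs. Its effectiveness is validated on real-world obstetrics and gynecology data.

Link: https://arxiv.org/abs/2412.15957
Authors: Ruize Shi, Hong Huang, Wei Zhou, Kehan Yin, Kai Zhao, Yun Zhao
Affiliations: Huazhong University of Science and Technology, Wuhan, China; Tongji Medical College; Hubei Maternity and Child Health Care Hospital
Keywords: large language models, including healthcare, transformed many industries, rapid development, development of large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: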

Abstract: The rapid development of large language models (LLMs) has transformed many industries, including healthcare. However, previous medical LLMs have largely focused on leveraging general medical knowledge to provide responses, without accounting for patient variability and lacking true personalization at the individual level. To address this, we propose a novel method called personalized medical language model (PMLM), which explores and optimizes personalized LLMs through recommendation systems and reinforcement learning (RL). Specifically, by utilizing self-informed and peer-informed personalization, PMLM captures changes in behaviors and preferences to design initial personalized prompts tailored to individual needs. We further refine these initial personalized prompts through RL, ultimately enhancing the precision of LLM guidance. Notably, the personalized prompts are hard prompts, which grants PMLM high adaptability and reusability, allowing it to directly leverage high-quality proprietary LLMs. We evaluate PMLM using real-world obstetrics and gynecology data, and the experimental results demonstrate that PMLM achieves personalized responses and provides more refined and individualized services, offering a potential path toward personalized medical LLMs.

[NLP-10] Development of a Large-scale Dataset of Chest Computed Tomography Reports in Japanese and a High-performance Finding Classification Model

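[Quick read]: This paper addresses the lack of large-scale, high-quality CT report datasets in Japanese radiology and develops a specialized language model for structured finding classification. The key to the solution is machine-translating the CT-RATE dataset (24,283 CT reports) into Japanese with GPT-4o mini and combining this with revisions by expert radiologists, yielding a training set of 22,778 machine-translated reports and a validation set of 150 radiologist-revised reports. Based on the "tohoku-nlp/bert-base-japanese-v3" architecture, the authors develop CT-BERT-JPN to extract 18 structured findings from Japanese radiology reports. This hybrid approach pairs the efficiency of machine translation with the accuracy of expert validation, which supports both dataset quality and strong model performance.

Link: https://arxiv.org/abs/2412.15907
Authors: Yosuke Yamagishi, Yuta Nakamura, Tomohiro Kikuchi, Yuki Sonoda, Hiroshi Hirakawa, Shintaro Kano, Satoshi Nakamura, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe
Affiliations: Unknown
Keywords: Recent advances, specialized language model, high-quality multilingual medical, language models highlight, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Dataset available at this https URL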

Abstract:Background: Recent advances in large language models highlight the need for high-quality multilingual medical datasets. While Japan leads globally in CT scanner deployment and utilization, the lack of large-scale Japanese radiology datasets has hindered the development of specialized language models for medical imaging analysis. Objective: To develop a comprehensive Japanese CT report dataset through machine translation and establish a specialized language model for structured finding classification. Additionally, to create a rigorously validated evaluation dataset through expert radiologist review. Methods: We translated the CT-RATE dataset (24,283 CT reports from 21,304 patients) into Japanese using GPT-4o mini. The training dataset consisted of 22,778 machine-translated reports, while the validation dataset included 150 radiologist-revised reports. We developed CT-BERT-JPN based on “tohoku-nlp/bert-base-japanese-v3” architecture for extracting 18 structured findings from Japanese radiology reports. Results: Translation metrics showed strong performance with BLEU scores of 0.731 and 0.690, and ROUGE scores ranging from 0.770 to 0.876 for Findings and from 0.748 to 0.857 for Impression sections. CT-BERT-JPN demonstrated superior performance compared to GPT-4o in 11 out of 18 conditions, including lymphadenopathy (+14.2%), interlobular septal thickening (+10.9%), and atelectasis (+7.4%). The model maintained F1 scores exceeding 0.95 in 14 out of 18 conditions and achieved perfect scores in four conditions. Conclusions: Our study establishes a robust Japanese CT report dataset and demonstrates the effectiveness of a specialized language model for structured finding classification. The hybrid approach of machine translation and expert validation enables the creation of large-scale medical datasets while maintaining high quality.

[NLP-11] On the Suitability of pre-trained foundational LLMs for Analysis in German Legal Education

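[Quick read]: This paper examines where current open-source foundational large language models (LLMs) fall short in legal analysis, in particular on very specific classification tasks (such as identifying "Gutachtenstil" appraisal style components) and on complex contexts such as complete legal opinions. The key to the solution is a prompt example selection method based on Retrieval Augmented Generation, which substantially improves predictions when plenty of data is available. The paper also evaluates pre-trained LLMs on two standard tasks, argument mining and automated essay scoring, and finds their performance more adequate there. Across tasks, pre-trained LLMs improve on the baseline when little or no labeled data is available, and Chain-of-Thought prompting helps further in the zero-shot case.

Link: https://arxiv.org/abs/2412.15902
Authors: Lorenz Wendlinger, Christian Braun, Abdullah Al Zubaer, Simon Alexander Nonn, Sarah Großkopf, Christofer Fellicious, Michael Granitzer
Affiliations: Chair of Data Science, Universität Passau, Germany; Institut für Rechtsdidaktik, Universität Passau, Germany
Keywords: German legal background, current open-source foundational, possess instruction capability, legal background knowledge, open-source foundational LLMs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages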

Abstract:We show that current open-source foundational LLMs possess instruction capability and German legal background knowledge that is sufficient for some legal analysis in an educational context. However, model capability breaks down in very specific tasks, such as the classification of “Gutachtenstil” appraisal style components, or with complex contexts, such as complete legal opinions. Even with extended context and effective prompting strategies, they cannot match the Bag-of-Words baseline. To combat this, we introduce a Retrieval Augmented Generation based prompt example selection method that substantially improves predictions in high data availability scenarios. We further evaluate the performance of pre-trained LLMs on two standard tasks for argument mining and automated essay scoring and find it to be more adequate. Throughout, pre-trained LLMs improve upon the baseline in scenarios with little or no labeled data with Chain-of-Thought prompting further helping in the zero-shot case.
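
Retrieval-based selection of prompt examples can be sketched as follows: rank a pool of labeled examples by similarity to the query and place the top few in the prompt as demonstrations. The version below uses a bag-of-words cosine similarity purely to stay dependency-free; the similarity function, prompt template, example sentences, and labels are all assumptions made for illustration, and the paper's method may well use a different retriever and format.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k=2):
    """Return the k labeled examples most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, bag_of_words(ex["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Few-shot prompt: retrieved demonstrations followed by the open query."""
    blocks = [f"Text: {ex['text']}\nLabel: {ex['label']}" for ex in examples]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


# Hypothetical labeled pool; the labels are illustrative, not the paper's scheme.
pool = [
    {"text": "the defendant could be liable for theft", "label": "major premise"},
    {"text": "theft is the taking of movable property of another", "label": "definition"},
    {"text": "therefore the defendant is liable", "label": "conclusion"},
]
query = "fraud is the deception of another for gain"
chosen = select_examples(query, pool, k=2)
prompt = build_prompt(query, chosen)
```

The appeal of this design is that the demonstrations adapt to each input, which fits the abstract's observation that the gain appears when many labeled examples are available to retrieve from.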

[NLP-12] A Thorough Investigation into the Application of Deep CNN for Enhancing Natural Language Processing Capabilities

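[Quick read]: This paper addresses the shortcomings of traditional natural language processing (NLP) models in accuracy and efficiency. The key to the solution is introducing deep convolutional neural networks (DCNN) into NLP and integrating them with machine learning (ML) algorithms and generative adversarial networks (GAN) to improve language understanding, reduce ambiguity, and enhance task performance. With this integrated approach, the model shows a 10% improvement in segmentation accuracy and a 4% increase in recall rate over traditional models, with better recognition accuracy and processing efficiency on word segmentation, part-of-speech tagging, machine translation, and text classification.

Link: https://arxiv.org/abs/2412.15900
Authors: Chang Weng, Scott Rood, Mehdi Ali Ramezani, Amir Aslani, Reza Zarrab, Wang Zwuo, Sanjeev Salimans, Tim Satheesh
Affiliations: Unknown
Keywords: Natural Language Processing, Natural Language, Deep Convolutional Neural, sentiment analysis, Convolutional Neural Networks
Categories: Computation and Language (cs.CL)
Comments: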

Abstract:Natural Language Processing (NLP) is widely used in fields like machine translation and sentiment analysis. However, traditional NLP models struggle with accuracy and efficiency. This paper introduces Deep Convolutional Neural Networks (DCNN) into NLP to address these issues. By integrating DCNN, machine learning (ML) algorithms, and generative adversarial networks (GAN), the study improves language understanding, reduces ambiguity, and enhances task performance. The high-performance NLP model shows a 10% improvement in segmentation accuracy and a 4% increase in recall rate compared to traditional models. This integrated approach excels in tasks such as word segmentation, part-of-speech tagging, machine translation, and text classification, offering better recognition accuracy and processing efficiency.

[NLP-13] TelcoLM: collecting data, adapting, and benchmarking language models for the telecommunication domain

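[Quick read]: This paper addresses the lack of accuracy that large language models (LLMs) show in highly technical domains such as telecommunications. The key to the solution is collecting a large corpus of domain-specific data (800M tokens, 80K instructions) and adapting models with several methods, in particular instruction tuning, without first fine-tuning on raw texts. Experiments on Llama-2-7b show that domain-adapted models can compete with larger generalist models on downstream telecommunications tasks, and that a single instruction-tuning step can be sufficient for adaptation.

Link: https://arxiv.org/abs/2412.15891
Authors: Camille Barboule, Viet-Phi Huynh, Adrien Bufort, Yoan Chabot, Géraldine Damnati, Gwénolé Lecorvé
Affiliations: Orange
Keywords: Large Language Models, Large Language, highly technical domains, Language Models, outstanding processes
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 30 pages (main: 13 pages, appendices: 17 pages), 1 figure, 22 tables, achieved March 2024, released December 2024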

Abstract: Despite outstanding progress in many tasks, Large Language Models (LLMs) still lack accuracy when dealing with highly technical domains. In particular, telecommunications (telco) is a challenging domain due to the large number of lexical, semantic and conceptual peculiarities. Yet, this domain holds many valuable use cases, directly linked to industrial needs. Hence, this paper studies how LLMs can be adapted to the telco domain. It reports our effort to (i) collect a massive corpus of domain-specific data (800M tokens, 80K instructions), (ii) perform adaptation using various methodologies, and (iii) benchmark them against larger generalist models in downstream tasks that require extensive knowledge of telecommunications. Our experiments on Llama-2-7b show that domain-adapted models can challenge the large generalist models. They also suggest that adaptation can be restricted to a unique instruction-tuning step, discarding the need for any fine-tuning on raw texts beforehand.

[NLP-14] Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback

[Quick read]: This paper addresses how to fine-tune all-modality models with reinforcement learning from human feedback (RLHF) so that their behavior aligns with human intentions in the cross-modality domain. The key to the solution is the align-anything framework, which includes 200k meticulously annotated all-modality human preference data points and an alignment method that learns from unified language feedback, capturing complex modality-specific human preferences and improving instruction-following capabilities. The paper also builds eval-anything, an evaluation framework for assessing all-modality models after post-training alignment.

Link: https://arxiv.org/abs/2412.15838
Authors: Jiaming Ji, Jiayi Zhou, Hantao Lou, Boyuan Chen, Donghai Hong, Xuyao Wang, Wenqi Chen, Kaile Wang, Rui Pan, Jiahao Li, Mohan Wang, Josef Dai, Tianyi Qiu, Hua Xu, Dong Li, Weipeng Chen, Jun Song, Bo Zheng, Yaodong Yang
Affiliations: Institute for AI, Peking University; Beijing Academy of Artificial Intelligence (BAAI); Huawei Noah’s Ark LAB; Taobao & Tmall Group of Alibaba
Keywords: Reinforcement learning, cross-modality domain, human, human preference data, proven effective
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning from human feedback (RLHF) has proven effective in enhancing the instruction-following capabilities of large language models; however, it remains underexplored in the cross-modality domain. As the number of modalities increases, aligning all-modality models with human intentions – such as instruction following – becomes a pressing challenge. In this work, we make the first attempt to fine-tune all-modality models (i.e. input and output with any modality, also named any-to-any models) using human preference data across all modalities (including text, image, audio, and video), ensuring its behavior aligns with human intentions. This endeavor presents several challenges. First, there is no large-scale all-modality human preference data in existing open-source resources, as most datasets are limited to specific modalities, predominantly text and image. Secondly, the effectiveness of binary preferences in RLHF for post-training alignment in complex all-modality scenarios remains an unexplored area. Finally, there is a lack of a systematic framework to evaluate the capabilities of all-modality models, particularly regarding modality selection and synergy. To address these challenges, we propose the align-anything framework, which includes meticulously annotated 200k all-modality human preference data. Then, we introduce an alignment method that learns from unified language feedback, effectively capturing complex modality-specific human preferences and enhancing the model’s instruction-following capabilities. Furthermore, to assess performance improvements in all-modality models after post-training alignment, we construct a challenging all-modality capability evaluation framework – eval-anything. All data, models, and code frameworks have been open-sourced for the community. For more details, please refer to this https URL.

[NLP-15] Enriching Social Science Research via Survey Item Linking

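[Quick read]: This paper addresses a problem in social science research: because researchers paraphrase survey items in-text instead of citing them explicitly, it is hard to find survey items of interest when comparing related work. The key to the solution is modeling the task, called Survey Item Linking (SIL), in two stages: mention detection and entity disambiguation. The paper argues that an imprecise task definition has left existing evaluation datasets too small and of low quality, and that latent concepts should be distinguished from survey item mentions. It therefore creates a high-quality, richly annotated dataset and benchmarks each stage, showing that the task is feasible but that errors propagate from the first stage. It suggests that modeling the entire document context, combining the two stages into an end-to-end system, collecting more diverse data, and improving the quality of the knowledge base could improve performance further.

Link: https://arxiv.org/abs/2412.15831
Authors: Tornike Tsereteli, Daniel Ruffinelli, Simone Paolo Ponzetto
Affiliations: Unknown
Keywords: influencing life satisfaction, factors influencing life, survey items, called survey items, Survey Item Linking
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: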

Abstract:Questions within surveys, called survey items, are used in the social sciences to study latent concepts, such as the factors influencing life satisfaction. Instead of using explicit citations, researchers paraphrase the content of the survey items they use in-text. However, this makes it challenging to find survey items of interest when comparing related work. Automatically parsing and linking these implicit mentions to survey items in a knowledge base can provide more fine-grained references. We model this task, called Survey Item Linking (SIL), in two stages: mention detection and entity disambiguation. Due to an imprecise definition of the task, existing datasets used for evaluating the performance for SIL are too small and of low quality. We argue that latent concepts and survey item mentions should be differentiated. To this end, we create a high-quality and richly annotated dataset consisting of 20,454 English and German sentences. By benchmarking deep learning systems for each of the two stages independently and sequentially, we demonstrate that the task is feasible, but observe that errors propagate from the first stage, leading to a lower overall task performance. Moreover, mentions that require the context of multiple sentences are more challenging to identify for models in the first stage. Modeling the entire context of a document and combining the two stages into an end-to-end system could mitigate these problems in future work, and errors could additionally be reduced by collecting more diverse data and by improving the quality of the knowledge base. The data and code are available at this https URL.

[NLP-16] S2DN: Learning to Denoise Unconvincing Knowledge for Inductive Knowledge Graph Completion

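[Quick Read]: This paper addresses the inference of missing facts between newly emerged entities in Knowledge Graph Completion (KGC), in particular the noise caused by semantic inconsistencies among similar relations and by unreliable interactions involving emerging entities. The key to the solution is a Semantic Structure-aware Denoising Network (S²DN). A semantic smoothing module retains the general semantic knowledge of relations, and a structure refining module filters out unreliable interactions while keeping the robust structure around target links. Together they preserve semantic consistency and strengthen the filtering of unreliable interactions in contaminated knowledge graphs.

Link: https://arxiv.org/abs/2412.15822
Authors: Tengfei Ma,Yujie Chen,Liang Wang,Xuan Lin,Bosheng Song,Xiangxiang Zeng
Affiliations: Unknown
Keywords: Knowledge Graph Completion, Graph Completion, infer missing facts, Inductive Knowledge Graph, newly emerged entities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 15 pages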

Abstract:Inductive Knowledge Graph Completion (KGC) aims to infer missing facts between newly emerged entities within knowledge graphs (KGs), posing a significant challenge. While recent studies have shown promising results in inferring such entities through knowledge subgraph reasoning, they suffer from (i) the semantic inconsistencies of similar relations, and (ii) noisy interactions inherent in KGs due to the presence of unconvincing knowledge for emerging entities. To address these challenges, we propose a Semantic Structure-aware Denoising Network (S^2DN) for inductive KGC. Our goal is to learn adaptable general semantics and reliable structures to distill consistent semantic knowledge while preserving reliable interactions within KGs. Specifically, we introduce a semantic smoothing module over the enclosing subgraphs to retain the universal semantic knowledge of relations. We incorporate a structure refining module to filter out unreliable interactions and offer additional knowledge, retaining robust structure surrounding target links. Extensive experiments conducted on three benchmark KGs demonstrate that S^2DN surpasses the performance of state-of-the-art models. These results demonstrate the effectiveness of S^2DN in preserving semantic consistency and enhancing the robustness of filtering out unreliable interactions in contaminated KGs.

[NLP-17] pi-yalli: un nouveau corpus pour le nahuatl

[Quick Read]: This paper aims to address the scarcity of computational resources for Nahuatl, especially for machine learning applications in that language. The key to the solution is building a corpus named π-YALLI, which will be used to develop language models (LM) for Nahuatl and, in turn, Natural Language Processing (NLP) tools: a grapheme unifier, a word segmenter, a POS grammatical analyser, a content-based automatic text summarizer, and possibly a translator (probabilistic or learning-based). The corpus is intended to advance the use of Nahuatl in computational linguistics.

Link: https://arxiv.org/abs/2412.15821
Authors: Juan-Manuel Torres-Moreno,Juan-José Guzmán-Landa,Graham Ranger,Martha Lorena Avendaño Garrido,Miguel Figueroa-Saavedra,Ligia Quintana-Torres,Carlos-Emiliano González-Gallardo,Elvys Linhares Pontes,Patricia Velázquez Morales,Luis-Gil Moreno Jiménez
Affiliations: Laboratoire Informatique d’Avignon / ICTT, Université d’Avignon; Facultad de Matemáticas & IEE, Universidad Veracruzana; LIFAT, Université François Rabelais à Tours; Trading Central Labs, Trading Central; Sorbonne Université
Keywords: Franco-Mexican collaboration aimed, YALLI corpus adapted, develop computer resources, develop Language Models, Natural Language Processing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 9 pages, in French language, 2 figures

Abstract:The NAHU^2 project is a Franco-Mexican collaboration aimed at building the π-YALLI corpus adapted to machine learning, which will subsequently be used to develop computer resources for the Nahuatl language. Nahuatl is a language with few computational resources, even though it is a living language spoken by around 2 million people. We have decided to build π-YALLI, a corpus that will enable research on Nahuatl in order to develop Language Models (LM), whether dynamic or not, which will in turn enable the development of Natural Language Processing (NLP) tools such as: a) a grapheme unifier, b) a word segmenter, c) a POS grammatical analyser, d) a content-based Automatic Text Summarization system; and possibly, e) a translator (probabilistic or learning-based).

[NLP-18] Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

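[Quick Read]: This paper addresses the inconsistent performance of open-source language models on complex reasoning tasks. The key to the solution is a new framework, Language model Ensemble with Monte Carlo Tree Search (LE-MCTS), which models step-by-step reasoning with an ensemble of language models as a Markov decision process. In LE-MCTS, a state is an intermediate reasoning path, and an action selects one model from a predefined pool to generate the next reasoning step. Guided by a process-based reward model, LE-MCTS runs a tree search over the reasoning steps produced by the different models to identify the most accurate reasoning chain. Experiments show that it outperforms both single-model decoding algorithms and existing language model ensemble methods on several mathematical reasoning benchmarks.

Link: https://arxiv.org/abs/2412.15797
Authors: Sungjin Park,Xiao Liu,Yeyun Gong,Edward Choi
Affiliations: KAIST AI; Microsoft Research
Keywords: language models, Language model Ensemble, language, large language models, Monte Carlo Tree
Categories: Computation and Language (cs.CL)
Notes: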

Abstract:Despite recent advances in large language models, open-source models often struggle to consistently perform well on complex reasoning tasks. Existing ensemble methods, whether applied at the token or output levels, fail to address these challenges. In response, we present Language model Ensemble with Monte Carlo Tree Search (LE-MCTS), a novel framework for process-level ensembling of language models. LE-MCTS formulates step-by-step reasoning with an ensemble of language models as a Markov decision process. In this framework, states represent intermediate reasoning paths, while actions consist of generating the next reasoning step using one of the language models selected from a predefined pool. Guided by a process-based reward model, LE-MCTS performs a tree search over the reasoning steps generated by different language models, identifying the most accurate reasoning chain. Experimental results on five mathematical reasoning benchmarks demonstrate that our approach outperforms both single language model decoding algorithms and language model ensemble methods. Notably, LE-MCTS improves performance by 3.6% and 4.3% on the MATH and MQA datasets, respectively, highlighting its effectiveness in solving complex reasoning problems.
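The MDP described above (states are partial reasoning paths, actions pick a model from the pool) can be sketched as a small tree-search skeleton. This is a minimal illustration and not the authors' implementation: the `Node` class, the UCT selection rule, and the constant `c` are assumptions borrowed from standard Monte Carlo Tree Search, and the rewards would come from a process reward model that is not shown here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One state in the search tree: a partial reasoning path."""
    path: tuple                 # reasoning steps generated so far
    model: str = ""             # hypothetical name of the model that wrote the last step
    visits: int = 0
    value_sum: float = 0.0      # accumulated process-reward scores
    children: list = field(default_factory=list)


def uct_score(parent, child, c=1.0):
    """Standard UCT: mean reward plus an exploration bonus (assumed rule)."""
    if child.visits == 0:
        return math.inf         # always try an unvisited model first
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)


def select_child(parent, c=1.0):
    """Pick which model's next step to follow from this state."""
    return max(parent.children, key=lambda ch: uct_score(parent, ch, c))


def backpropagate(path_nodes, reward):
    """Propagate a reward estimate up the visited path."""
    for node in path_nodes:
        node.visits += 1
        node.value_sum += reward
```

With equal visit counts, the child whose steps have earned the higher mean reward is selected; a model that has not yet been tried at a state is always explored first.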

[NLP-19] Learning from Impairment: Leveraging Insights from Clinical Linguistics in Language Modelling Research COLING2025

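[Quick Read]: This paper asks how insights from language impairment research and its clinical treatment can be integrated into the learning strategies and evaluation frameworks of language models (LMs). The key is to draw on the theoretical foundations of neurolinguistics and aphasiology, particularly training approaches that target the syntactic domain, in order to improve how language models handle complex syntactic phenomena and to develop more sustainable, cognitively plausible Natural Language Processing (NLP) models.

Link: https://arxiv.org/abs/2412.15785
Authors: Dominique Brunato
Affiliations: Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC); ItaliaNLP Lab, Pisa
Keywords: position paper investigates, develop human-inspired learning, language impairment research, human-inspired learning strategies, position paper
Categories: Computation and Language (cs.CL)
Notes: accepted at the 31st International Conference on Computational Linguistics (COLING 2025)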

Abstract:This position paper investigates the potential of integrating insights from language impairment research and its clinical treatment to develop human-inspired learning strategies and evaluation frameworks for language models (LMs). We inspect the theoretical underpinnings underlying some influential linguistically motivated training approaches derived from neurolinguistics and, particularly, aphasiology, aimed at enhancing the recovery and generalization of linguistic skills in aphasia treatment, with a primary focus on those targeting the syntactic domain. We highlight how these insights can inform the design of rigorous assessments for LMs, specifically in their handling of complex syntactic phenomena, as well as their implications for developing human-like learning strategies, aligning with efforts to create more sustainable and cognitively plausible natural language processing (NLP) models.

[NLP-20] Linguistic Features Extracted by GPT-4 Improve Alzheimer's Disease Detection based on Spontaneous Speech COLING2025

[Quick Read]: This paper addresses the early detection of Alzheimer’s Disease (AD), specifically large-scale, low-cost, non-invasive detection based on analysing patients' speech and language patterns. The key to the solution is using a Large Language Model (LLM), GPT-4, to extract five semantic features from transcripts of spontaneous patient speech. These features capture known symptoms of AD but are hard to quantify with traditional computational linguistics methods. Combining the GPT-derived features with established linguistic features in a Random Forest classifier significantly improves AD detection. The approach works on both manually transcribed and automatically generated transcripts.

Link: https://arxiv.org/abs/2412.15772
Authors: Jonathan Heitz,Gerold Schneider,Nicolas Langer
Affiliations: University of Zurich; Department of Psychology; Methods of Plasticity Research; Language & Medicine Competence Centre; Department of Computational Linguistics
Keywords: Alzheimer Disease, public health concern, growing public health, health concern, significant and growing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted at the 31st International Conference on Computational Linguistics (COLING 2025)

Abstract:Alzheimer’s Disease (AD) is a significant and growing public health concern. Investigating alterations in speech and language patterns offers a promising path towards cost-effective and non-invasive early detection of AD on a large scale. Large language models (LLMs), such as GPT, have enabled powerful new possibilities for semantic text analysis. In this study, we leverage GPT-4 to extract five semantic features from transcripts of spontaneous patient speech. The features capture known symptoms of AD, but they are difficult to quantify effectively using traditional methods of computational linguistics. We demonstrate the clinical significance of these features and further validate one of them (“Word-Finding Difficulties”) against a proxy measure and human raters. When combined with established linguistic features and a Random Forest classifier, the GPT-derived features significantly improve the detection of AD. Our approach proves effective for both manually transcribed and automatically generated transcripts, representing a novel and impactful use of recent advancements in LLMs for AD speech analysis.

[NLP-21] Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models

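[Quick Read]: This paper addresses the lack of research on the reasoning behaviour of Large Language Models (LLMs) in the medical domain, stressing that understanding reasoning behaviour matters more than high-level prediction accuracy alone. The key contribution is to define reasoning behaviour for medical LLMs, categorise the methods that evaluate it, and propose theoretical frameworks that let medical professionals and machine learning engineers gain insight into the models' low-level reasoning operations. The aim is greater transparency and explainability (XAI), which should accelerate the integration, application, and development of medical AI in healthcare systems.

Link: https://arxiv.org/abs/2412.15748
Authors: Shamus Sim,Tyrone Chen
Affiliations: QueueMed Healthtech; Intersect Australia
Keywords: Large Language Models, Large Language, ubiquity of Large, reasoning behaviour, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 16 pages, 5 figures, 2 tables. Conceptualization, both authors. formal analysis, both authors. funding acquisition, both authors. investigation, both authors. resources, both authors. supervision, T.C… validation, both authors. visualization, both authors. writing original draft, both authors. writing review and editing, both authors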

Abstract:Background: Despite the current ubiquity of Large Language Models (LLMs) across the medical domain, there is a surprising lack of studies which address their reasoning behaviour. We emphasise the importance of understanding reasoning behaviour as opposed to high-level prediction accuracies, since it is equivalent to explainable AI (XAI) in this context. In particular, achieving XAI in medical LLMs used in the clinical domain will have a significant impact across the healthcare sector. Results: Therefore, we define the concept of reasoning behaviour in the specific context of medical LLMs. We then categorise and discuss the current state of the art of methods which evaluate reasoning behaviour in medical LLMs. Finally, we propose theoretical frameworks which can empower medical professionals or machine learning engineers to gain insight into the low-level reasoning operations of these previously obscure models. Conclusion: The subsequent increased transparency and trust in medical machine learning models by clinicians as well as patients will accelerate the integration, application as well as further development of medical AI for the healthcare system as a whole

[NLP-22] Fine-tuning Whisper on Low-Resource Languages for Real-World Applications

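[Quick Read]: This paper addresses the challenges low-resource languages face in speech-to-text (STT), in particular the shortage of long-form audio data. The key to the solution is a novel data generation method that converts sentence-level data into a long-form corpus, so that the model keeps its ability to handle long audio and to segment it without needing non-sentence-level data. The method improves performance in several real-world applications and yields a new state-of-the-art STT model for Swiss German, with higher BLEU scores than both the non-fine-tuned Whisper model and the previous best models. The method is also adaptable to other low-resource languages, and the authors provide code and written guidance for producing high-quality long-audio transcription from sentence-level data alone.

Link: https://arxiv.org/abs/2412.15726
Authors: Vincenzo Timmel,Claudio Paonessa,Reza Kakooee,Manfred Vogel,Daniel Perruchoud
Affiliations: Unknown
Keywords: fine-tuning OpenAI Whisper, Swiss German, Swiss German STT, case study, converts sentence-level data
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Notes: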

Abstract:This paper presents a new approach to fine-tuning OpenAI’s Whisper model for low-resource languages by introducing a novel data generation method that converts sentence-level data into a long-form corpus, using Swiss German as a case study. Non-sentence-level data, which could improve the performance of long-form audio, is difficult to obtain and often restricted by copyright laws. Our method bridges this gap by transforming more accessible sentence-level data into a format that preserves the model’s ability to handle long-form audio and perform segmentation without requiring non-sentence-level data. Our data generation process improves performance in several real-world applications and leads to the development of a new state-of-the-art speech-to-text (STT) model for Swiss German. We compare our model with a non-fine-tuned Whisper and our previous state-of-the-art Swiss German STT models, where our new model achieves higher BLEU scores. Our results also indicate that the proposed method is adaptable to other low-resource languages, supported by written guidance and code that allows the creation of fine-tuned Whisper models, which keep segmentation capabilities and allow the transcription of longer audio files using only sentence-level data with high quality.

[NLP-23] AutoLife: Automatic Life Journaling with Smartphones and LLMs

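[Quick Read]: This paper addresses how to automatically generate semantic descriptions of users' daily lives, a task it calls life journaling. The key to the solution is to use low-cost sensor data from commercial smartphones (no photos or audio), derive time, motion, and location contexts from the multimodal sensor data, and use the zero-shot capabilities of Large Language Models (LLMs), enriched with commonsense knowledge about human lives, to interpret these contexts and generate life journals. The paper also proposes a multilayer framework that decomposes the task and integrates LLMs with other techniques to cope with task complexity and long sensing durations.

Link: https://arxiv.org/abs/2412.15714
Authors: Huatao Xu,Panron Tong,Mo Li,Mani Srivastava
Affiliations: Hong Kong University of Science and Technology; Alibaba Group; University of California Los Angeles
Keywords: users’ daily lives, mobile sensing application, generate semantic descriptions, paper introduces, semantic descriptions
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: 13 pages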

Abstract:This paper introduces a novel mobile sensing application - life journaling - designed to generate semantic descriptions of users’ daily lives. We present AutoLife, an automatic life journaling system based on commercial smartphones. AutoLife only inputs low-cost sensor data (without photos or audio) from smartphones and can automatically generate comprehensive life journals for users. To achieve this, we first derive time, motion, and location contexts from multimodal sensor data, and harness the zero-shot capabilities of Large Language Models (LLMs), enriched with commonsense knowledge about human lives, to interpret diverse contexts and generate life journals. To manage the task complexity and long sensing duration, a multilayer framework is proposed, which decomposes tasks and seamlessly integrates LLMs with other techniques for life journaling. This study establishes a real-life dataset as a benchmark and extensive experiment results demonstrate that AutoLife produces accurate and reliable life journals.

[NLP-24] Contrastive Learning for Task-Independent SpeechLLM-Pretraining

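[Quick Read]: This paper addresses how to adapt Large Language Models (LLMs) efficiently to speech processing tasks. Conventional direct task-specific fine-tuning is limited by overfitting risk, high data requirements, and computational cost. The key to the solution is a scalable two-stage training approach: first, task-independent speech pretraining with contrastive learning to align text and speech representations across all layers; then task-specific fine-tuning that needs only a small amount of data. The approach outperforms traditional automatic speech recognition (ASR) pretraining and surpasses models specialized in speech translation and question answering while using only 10% of the task-specific data.

Link: https://arxiv.org/abs/2412.15712
Authors: Maike Züfle,Jan Niehues
Affiliations: Karlsruhe Institute of Technology
Keywords: Large language models, processing tasks efficiently, natural language processing, Large language, excel in natural
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: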

Abstract:Large language models (LLMs) excel in natural language processing but adapting these LLMs to speech processing tasks efficiently is not straightforward. Direct task-specific fine-tuning is limited by overfitting risks, data requirements, and computational costs. To address these challenges, we propose a scalable, two-stage training approach: (1) A task-independent speech pretraining stage using contrastive learning to align text and speech representations over all layers, followed by (2) a task-specific fine-tuning stage requiring minimal data. This approach outperforms traditional ASR pretraining and enables the model to surpass models specialized on speech translation and question answering while being trained on only 10% of the task-specific data.
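To make the alignment objective in stage (1) more concrete, here is a minimal sketch of a contrastive loss over paired speech and text embeddings. The abstract does not name the loss, so the InfoNCE form, the cosine similarity, and the `temperature` value are assumptions; the sketch also covers a single layer, whereas the paper aligns representations over all layers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(speech, text, temperature=0.1):
    """Mean InfoNCE loss; speech[i] and text[i] form the positive pair,
    and every other text[j] in the batch acts as a negative."""
    n = len(speech)
    total = 0.0
    for i in range(n):
        logits = [cosine(speech[i], text[j]) / temperature for j in range(n)]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Hypothetical 2-d embeddings for a batch of two utterances.
speech_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(speech_batch, aligned_text)
loss_swapped = info_nce(speech_batch, swapped_text)
```

The loss is close to zero when each speech embedding is nearest to its own transcript embedding and grows when the pairs are mismatched, which is the pressure that pulls the two modalities into a shared space.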

[NLP-25] Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

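[Quick Read]: This paper addresses how to enable effective human-agent collaboration when developing language model (LM) agents. The key to the solution is Collaborative Gym (Co-Gym), a general framework that supports asynchronous, tripartite interaction among agents, humans, and task environments. Co-Gym is instantiated in both simulated and real-world conditions and evaluated on three representative tasks: travel planning, tabular analysis, and related work. The study finds that collaborative agents outperform fully autonomous agents in task performance. It also identifies core challenges in building collaborative agents: improving communication capabilities, situational awareness, and the balance between autonomy and human control.

Link: https://arxiv.org/abs/2412.15701
Authors: Yijia Shao,Vinay Samuel,Yucheng Jiang,John Yang,Diyi Yang
Affiliations: Unknown
Keywords: sparked growing interest, language models, Recent advancements, sparked growing, growing interest
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: Preprint. Work in progress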

Abstract:Recent advancements in language models (LMs) have sparked growing interest in developing LM agents. While fully autonomous agents could excel in many scenarios, numerous use cases inherently require them to collaborate with humans due to humans’ latent preferences, domain expertise, or need for control. To facilitate the study of human-agent collaboration, we present Collaborative Gym (Co-Gym), a general framework enabling asynchronous, tripartite interaction among agents, humans, and task environments. We instantiate Co-Gym with three representative tasks in both simulated and real-world conditions, and propose an evaluation framework that assesses both the collaboration outcomes and processes. Our findings reveal that collaborative agents consistently outperform their fully autonomous counterparts in task performance within those delivered cases, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. However, our study also highlights significant challenges in developing collaborative agents, requiring advancements in core aspects of intelligence – communication capabilities, situational awareness, and balancing autonomy and human control.

[NLP-26] Variability Need Not Imply Error: The Case of Adequate but Semantically Distinct Responses

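[Quick Read]: This paper addresses how to assess the reliability of responses generated by language models (LMs) when prompts vary in ambiguity or open-endedness. The key to the solution is a new measure, the Probability the model assigns to Adequate Responses (PROBAR): sampled responses are annotated for adequacy, and the probability mass the model places on adequate responses is estimated as an instance-level indicator of reliability. Unlike semantic entropy, PROBAR does not equate semantic variability with error, so it reflects reliability more accurately in open-ended settings. Experiments show that PROBAR outperforms semantic entropy across prompts with varying degrees of ambiguity and open-endedness.

Link: https://arxiv.org/abs/2412.15683
Authors: Evgenia Ilia,Wilker Aziz
Affiliations: University of Amsterdam
Keywords: ability to respond, respond reliably, responses, generated responses, language models
Categories: Computation and Language (cs.CL)
Notes: 26 pages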

Abstract:With the broader use of language models (LMs) comes the need to estimate their ability to respond reliably to prompts (e.g., are generated responses likely to be correct?). Uncertainty quantification tools (notions of confidence and entropy, i.a.) can be used to that end (e.g., to reject a response when the model is ‘uncertain’). For example, Kuhn et al. (semantic entropy; 2022b) regard semantic variation amongst sampled responses as evidence that the model ‘struggles’ with the prompt and that the LM is likely to err. We argue that semantic variability need not imply error–this being especially intuitive in open-ended settings, where prompts elicit multiple adequate but semantically distinct responses. Hence, we propose to annotate sampled responses for their adequacy to the prompt (e.g., using a classifier) and estimate the Probability the model assigns to Adequate Responses (PROBAR), which we then regard as an indicator of the model’s reliability at the instance level. We evaluate PROBAR as a measure of confidence in selective prediction with OPT models (in two QA datasets and in next-word prediction, for English) and find PROBAR to outperform semantic entropy across prompts with varying degrees of ambiguity/open-endedness.
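The idea behind PROBAR can be illustrated with a small Monte Carlo estimate: sample responses, label each as adequate or not, and take the adequate fraction. This is a sketch under assumptions, not the paper's estimator: the function name `probar_estimate`, the toy prompt, and the set-membership adequacy check (standing in for a classifier) are all hypothetical.

```python
def probar_estimate(responses, is_adequate):
    """Fraction of sampled responses judged adequate, used as a
    Monte Carlo estimate of the probability mass on adequate responses."""
    if not responses:
        raise ValueError("need at least one sampled response")
    adequate = sum(1 for response in responses if is_adequate(response))
    return adequate / len(responses)


# Toy open-ended prompt: "Name a prime number below 10."
adequate_answers = {"2", "3", "5", "7"}
samples = ["2", "7", "3", "9", "5"]   # hypothetical samples from a model

score = probar_estimate(samples, lambda r: r in adequate_answers)
distinct = len(set(samples))
```

Here the five samples are all different, so a variability-based signal would flag the prompt as uncertain, yet four of the five are adequate and the estimate is 0.8. That gap is the case the abstract argues for.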

[NLP-27] Adaptable and Precise: Enterprise-Scenario LLM Function-Calling Capability Training Pipeline

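[Quick Read]: This paper addresses the problem that generic models do not meet the computational efficiency, output accuracy, and stability that enterprises require in specific business scenarios. The key to the solution is a training pipeline for function-calling capabilities tailored to real business scenarios, covering the synthesis and augmentation of scenario-specific function-calling data, model fine-tuning, and performance evaluation and analysis. Using this pipeline, the authors generated a set of fully AI-generated samples and a set of augmented manually-labeled samples for a digital HR agent scenario and fine-tuned the Qwen2.5-Coder-7B-Instruct model. The fine-tuned model surpassed GPT-4 and GPT-4o in accuracy on the test set, supporting the effectiveness and reliability of the pipeline.

Link: https://arxiv.org/abs/2412.15660
Authors: Guancheng Zeng,Wentao Ding,Beining Xu,Chi Zhang,Wenqiang Han,Gang Li,Jingjing Mo,Pengxu Qiu,Xinran Tao,Wang Tao,Haowen Hu
Affiliations: Digital China AI Research
Keywords: API assets scattered, existing business processes, forming the backbone, API assets, possess a vast
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Notes: 23 pages, 6 figures, 7 tables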

Abstract:Enterprises possess a vast array of API assets scattered across various functions, forming the backbone of existing business processes. By leveraging these APIs as functional tools, enterprises can design diverse, scenario-specific agent applications, driven by on-premise function-calling models as the core engine. However, generic models often fail to meet enterprise requirements in terms of computational efficiency, output accuracy, and stability, necessitating scenario-specific adaptation. In this paper, we propose a training pipeline for function-calling capabilities tailored to real-world business scenarios. This pipeline includes the synthesis and augmentation of scenario-specific function-calling data, model fine-tuning, and performance evaluation and analysis. Using this pipeline, we generated 1,260 fully AI-generated samples and 1,035 augmented manually-labeled samples in digital HR agent scenario. The Qwen2.5-Coder-7B-Instruct model was employed as the base model and fine-tuned using the LoRA method on four GPUs with 24GB VRAM. Our fine-tuned model demonstrated outstanding performance in evaluations and practical applications, surpassing GPT-4 and GPT-4o in accuracy on the test set. These results validate the reliability of the proposed pipeline for training scenario-specific function-calling models.
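The abstract names LoRA as the fine-tuning method. As a reminder of what that involves, the standard LoRA formulation keeps the pretrained weight matrix frozen and learns a low-rank update. This is the general formulation rather than a detail taken from this paper; the rank $r$ and scaling factor $\alpha$ the authors used are not stated in the abstract.

```latex
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only $A$ and $B$ are trained while $W_0 \in \mathbb{R}^{d \times k}$ stays fixed, which helps explain why a 7B-parameter model can be adapted on GPUs with 24GB of VRAM.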

[NLP-28] MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula AAAI2025

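[Quick Read]: This paper addresses the comprehension barrier that arises in academic and professional settings when mathematical expressions are conveyed orally without visual aids, especially for people who are hearing-impaired or who rely on subtitles because of language barriers. The key to the solution is MathSpeech, a novel pipeline that combines automatic speech recognition (ASR) models with small language models (sLMs) to correct errors in mathematical expressions and convert them accurately into structured LaTeX. MathSpeech produces LaTeX comparable to that of leading Large Language Models (LLMs) while using fine-tuned small language models of only 120M parameters, improving both the accuracy and the efficiency of LaTeX translation.

Link: https://arxiv.org/abs/2412.15655
Authors: Sieun Hyeon,Kyudan Jung,Jaehee Won,Nam-Joon Kim,Hyun Gon Ryu,Hyuk-Jae Lee,Jaeyoung Do
Affiliations: KAIST; Seoul National University; Korea University; Yonsei University
Keywords: mathematical expressions orally, convey mathematical expressions, Automatic Speech Recognition, mathematical expressions, small Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted in AAAI 2025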

Abstract:In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearing-impaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler’s Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., e to the power of i x equals cosine of x plus i side of x), instead of the concise LaTeX format (i.e., e^{ix} = \cos(x) + i\sin(x)), which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured LaTeX representations. Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates LaTeX generation capabilities comparable to leading commercial Large Language Models (LLMs), while leveraging fine-tuned small language models of only 120M parameters. Specifically, in terms of CER, BLEU, and ROUGE scores for LaTeX translation, MathSpeech demonstrated significantly superior capabilities compared to GPT-4o. We observed a decrease in CER from 0.390 to 0.298, and higher ROUGE/BLEU scores compared to GPT-4o.
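The CER figures quoted above refer to character error rate. A common definition, assumed here because the abstract does not spell out the paper's exact evaluation script, is the character-level edit distance between the reference and the hypothesis, divided by the reference length. The function names below are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance over characters, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference_latex = r"e^{ix} = \cos(x) + i\sin(x)"
perfect = cer(reference_latex, reference_latex)
toy = cer("kitten", "sitting")
```

Lower is better: an exact LaTeX match scores 0, and the classic "kitten" versus "sitting" pair needs three edits over six reference characters.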

[NLP-29] Error-driven Data-efficient Large Multimodal Model Tuning

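[Quick Read]: This paper addresses the weak performance of Large Multimodal Models (LMMs) on downstream tasks, especially when task-specific training samples are unavailable. The key to the solution is an error-driven, data-efficient tuning framework. A generic LMM (the student model) is evaluated on a small validation set of the target task, and a more powerful model (the teacher model) identifies the erroneous steps in the student's reasoning and its capability gaps. Based on these gaps, targeted training samples are retrieved from existing task-agnostic datasets and used to tune the student model for the target task. The method needs no task-specific training samples and yields an average performance gain of 7.01% on downstream tasks.

Link: https://arxiv.org/abs/2412.15652
Authors: Barry Menglong Yao(UC Davis),Qifan Wang(Meta AI),Lifu Huang(UC Davis)
Affiliations: UC Davis; Meta AI
Keywords: Large Multimodal Models, Large Multimodal, numerous academic benchmarks, demonstrated impressive performance, Multimodal Models
Categories: Computation and Language (cs.CL)
Notes: 16 pages, 6 figures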

Abstract:Large Multimodal Models (LMMs) have demonstrated impressive performance across numerous academic benchmarks. However, fine-tuning still remains essential to achieve satisfactory performance on downstream tasks, while the task-specific tuning samples are usually not readily available or expensive and time-consuming to obtain. To address this, we propose an error-driven data-efficient tuning framework that aims to efficiently adapt generic LMMs to newly emerging tasks without requiring any task-specific training samples. In our approach, a generic LMM, acting as a student model, is first evaluated on a small validation set of the target task, and then a more powerful model, acting as a teacher model, identifies the erroneous steps within the student model’s reasoning steps and analyzes its capability gaps from fully addressing the target task. Based on these gaps, targeted training samples are further retrieved from existing task-agnostic datasets to tune the student model and tailor it to the target task. We perform extensive experiments across three different training data scales and seven tasks, demonstrating that our training paradigm significantly and efficiently improves LMM’s performance on downstream tasks, achieving an average performance boost of 7.01%.

[NLP-30] Can Input Attributions Interpret the Inductive Reasoning Process Elicited in In-Context Learning?

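[Quick Read]: This paper asks, in the setting of large language models (LLMs) and in-context learning (ICL), whether input attribution (IA) can identify which example in the prompt contributed to identifying the task or rule. The key to the approach is a set of synthetic diagnostic tasks inspired by the poverty-of-the-stimulus design in inductive reasoning: most in-context examples are ambiguous about the underlying rule, and one critical example disambiguates the task. The experiments test how well conventional IA methods explain the inductive reasoning process in ICL. They find that a certain simple IA method works best and that gradient-based IA methods generally become harder to use for interpreting ICL as models grow larger.

Link: https://arxiv.org/abs/2412.15628
Authors: Mengyu Ye,Tatsuki Kuribayashi,Goro Kobayashi,Jun Suzuki
Affiliations: Tohoku University; MBZUAI; RIKEN
Keywords: machine learning field, neural models’ outputs, large language models, Elucidating the rationale, learning field
Categories: Computation and Language (cs.CL)
Notes: Preprint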

Abstract:Elucidating the rationale behind neural models’ outputs has been challenging in the machine learning field, which is indeed applicable in this age of large language models (LLMs) and in-context learning (ICL). When it comes to estimating input attributions (IA), ICL poses a new issue of interpreting which example in the prompt, consisting of a set of examples, contributed to identifying the task/rule to be solved. To this end, in this paper, we introduce synthetic diagnostic tasks inspired by the poverty of the stimulus design in inductive reasoning; here, most in-context examples are ambiguous w.r.t. their underlying rule, and one critical example disambiguates the task demonstrated. The question is whether conventional IA methods can identify such an example in interpreting the inductive reasoning process in ICL. Our experiments provide several practical findings; for example, a certain simple IA method works the best, and the larger the model, the generally harder it is to interpret the ICL with gradient-based IA methods.

[NLP-31] A Fusion Approach of Dependency Syntax and Sentiment Polarity for Feature Label Extraction in Commodity Reviews

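[Quick Read]: This paper addresses the low robustness of existing feature label extraction algorithms and proposes a new method that combines dependency parsing with sentiment polarity analysis. Integrating the two analyses significantly improves extraction accuracy: experiments report an accuracy of 0.7, with recall and F-score both stabilizing at 0.8. The method still depends on matching dictionaries and extracts a limited range of feature labels, which the authors leave for future work.

Link: https://arxiv.org/abs/2412.15610
Authors: Jianfei Xu
Affiliations: Shanghai University of Engineering Science
Keywords: http URL, covering four categories, mobile phones, study analyzes, product reviews
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: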

Abstract:This study analyzes 13,218 product reviews from this http URL, covering four categories: mobile phones, computers, cosmetics, and food. A novel method for feature label extraction is proposed by integrating dependency parsing and sentiment polarity analysis. The proposed method addresses the challenges of low robustness in existing extraction algorithms and significantly enhances extraction accuracy. Experimental results show that the method achieves an accuracy of 0.7, with recall and F-score both stabilizing at 0.8, demonstrating its effectiveness. However, challenges such as dependence on matching dictionaries and the limited scope of extracted feature tags require further investigation in future research.

[NLP-32] Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks

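[Quick read]: The paper addresses the latency, document-selection errors, and added system complexity that retrieval-augmented generation (RAG) incurs from real-time retrieval. The key is an alternative paradigm, cache-augmented generation (CAG): all relevant resources are preloaded into a large language model (LLM) with an extended context window, which is practical when the knowledge base is of limited and manageable size, and the model's runtime parameters are cached. At inference time the model answers queries directly from these preloaded parameters with no additional retrieval step, which eliminates retrieval latency and reduces retrieval errors while preserving context relevance.

Link: https://arxiv.org/abs/2412.15605
Authors: Brian J Chan, Chao-Ting Chen, Jui-Hung Cheng, Hen-Hsen Huang
Affiliations: National Chengchi University; Academia Sinica
Keywords: Retrieval-augmented generation, external knowledge sources, integrating external knowledge, gained traction, powerful approach
Categories: Computation and Language (cs.CL)
Comments: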

Abstract:Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. However, RAG introduces challenges such as retrieval latency, potential errors in document selection, and increased system complexity. With the advent of large language models (LLMs) featuring significantly extended context windows, this paper proposes an alternative paradigm, cache-augmented generation (CAG), which bypasses real-time retrieval. Our method involves preloading all relevant resources, especially when the documents or knowledge for retrieval are of a limited and manageable size, into the LLM’s extended context and caching its runtime parameters. During inference, the model utilizes these preloaded parameters to answer queries without additional retrieval steps. Comparative analyses reveal that CAG eliminates retrieval latency and minimizes retrieval errors while maintaining context relevance. Performance evaluations across multiple benchmarks highlight scenarios where long-context LLMs either outperform or complement traditional RAG pipelines. These findings suggest that, for certain applications, particularly those with a constrained knowledge base, CAG provides a streamlined and efficient alternative to RAG, achieving comparable or superior results with reduced complexity.
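
The preload-once, answer-many pattern described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `CacheAugmentedGenerator`, `fake_encode`, and `fake_generate` are hypothetical names, and the two stub functions merely stand in for a real long-context LLM and its key-value cache.

```python
from typing import Any, Callable, Sequence


class CacheAugmentedGenerator:
    """Toy CAG wrapper: encode the knowledge once, then reuse it for every query."""

    def __init__(
        self,
        documents: Sequence[str],
        encode_context: Callable[[Sequence[str]], Any],
        generate_with_cache: Callable[[Any, str], str],
    ) -> None:
        self._generate = generate_with_cache
        # One-time, offline step: stands in for running the LLM over all
        # documents and storing the resulting runtime state.
        self._cache = encode_context(documents)

    def answer(self, query: str) -> str:
        # No retrieval step: the query is processed against the cached context.
        return self._generate(self._cache, query)


# --- Hypothetical stand-ins for a real model (assumptions for illustration) ---
encode_calls = 0


def fake_encode(documents: Sequence[str]) -> tuple:
    global encode_calls
    encode_calls += 1  # count how often the expensive step runs
    return tuple(documents)


def fake_generate(cache: tuple, query: str) -> str:
    return f"[answer to {query!r} using {len(cache)} cached documents]"


cag = CacheAugmentedGenerator(["doc A", "doc B"], fake_encode, fake_generate)
replies = [cag.answer(q) for q in ("q1", "q2", "q3")]
```

In this sketch the expensive encoding step runs once regardless of how many queries follow, which is the property the abstract relies on when the knowledge base fits in the context window.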

[NLP-33] Dynamic Label Name Refinement for Few-Shot Dialogue Intent Classification

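[Quick read]: The paper targets two main problems in dialogue intent classification: the large number of possible intent classes and the substantial semantic overlap among similar ones. The key is a few-shot method built on in-context learning and combined with dynamic label refinement. For each test input, the method retrieves relevant examples from the training set and uses a large language model to refine the intent labels based on their semantics, so that intents stay clearly distinguishable. Experiments show significant gains on multiple datasets, along with more interpretable and semantically coherent intent labels.

Link: https://arxiv.org/abs/2412.15603
Authors: Gyutae Park, Ingeol Baek, ByeongJeong Kim, Joongbo Shin, Hwanhee Lee
Affiliations: Chung-Ang University; LG AI Research
Keywords: Dialogue intent classification, intent classification aims, aims to identify, intent classification, Dialogue intent
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures, 11 tables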

Abstract:Dialogue intent classification aims to identify the underlying purpose or intent of a user’s input in a conversation. Current intent classification systems encounter considerable challenges, primarily due to the vast number of possible intents and the significant semantic overlap among similar intent classes. In this paper, we propose a novel approach to few-shot dialogue intent classification through in-context learning, incorporating dynamic label refinement to address these challenges. Our method retrieves relevant examples for a test input from the training set and leverages a large language model to dynamically refine intent labels based on semantic understanding, ensuring that intents are clearly distinguishable from one another. Experimental results demonstrate that our approach effectively resolves confusion between semantically similar intents, resulting in significantly enhanced performance across multiple datasets compared to baselines. We also show that our method generates more interpretable intent labels, and has a better semantic coherence in capturing underlying user intents compared to baselines.

[NLP-34] Template-Driven LLM-Paraphrased Framework for Tabular Math Word Problem Generation AAAI2025

[Quick read]: The paper addresses the correctness and diversity problems in generating tabular math word problems (TMWPs). The key is a Template-driven LLM-paraphrased (TeLL) framework: templates extracted from existing real samples generate the initial problems, which ensures correctness, and large language models (LLMs) then extend the templates and paraphrase the problems to obtain diverse TMWP samples. The paper also stresses the importance of reasoning annotations and enriches each solution with reasoning steps to raise dataset quality. The resulting TabMWP-TeLL dataset is shown experimentally to improve TMWP solving performance.

Link: https://arxiv.org/abs/2412.15594
Authors: Xiaoqiang Kang, Zimu Wang, Xiaobo Jin, Wei Wang, Kaizhu Huang, Qiufeng Wang
Affiliations: School of Computer Science and Technology, Soochow University; Collaborative Innovation Center of Novel Software Technology and Industrialization; Department of Computer Science and Engineering, The Chinese University of Hong Kong
Keywords: large language models, tabular math word, math word problems, Solving tabular math, mathematical reasoning ability
Categories: Computation and Language (cs.CL)
Comments: Accepted at AAAI 2025, extended version with appendix

Abstract:Solving tabular math word problems (TMWPs) has become a critical role in evaluating the mathematical reasoning ability of large language models (LLMs), where large-scale TMWP samples are commonly required for LLM fine-tuning. Since the collection of high-quality TMWP datasets is costly and time-consuming, recent research has concentrated on automatic TMWP generation. However, current generated samples usually suffer from issues of either correctness or diversity. In this paper, we propose a Template-driven LLM-paraphrased (TeLL) framework for generating high-quality TMWP samples with diverse backgrounds and accurate tables, questions, answers, and solutions. To this end, we first extract templates from existing real samples to generate initial problems, ensuring correctness. Then, we adopt an LLM to extend templates and paraphrase problems, obtaining diverse TMWP samples. Furthermore, we find the reasoning annotation is important for solving TMWPs. Therefore, we propose to enrich each solution with illustrative reasoning steps. Through the proposed framework, we construct a high-quality dataset TabMWP-TeLL by adhering to the question types in the TabMWP dataset, and we conduct extensive experiments on a variety of LLMs to demonstrate the effectiveness of TabMWP-TeLL in improving TMWP solving performance. The code and data of this paper are available at: this https URL.
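
The template step described above (initial problems generated from templates so that the answer is correct by construction) might look like the following sketch. It assumes a single hypothetical "difference" template and a made-up table; the LLM-based template extension and paraphrasing steps are deliberately left out, and none of the names come from the paper.

```python
import random
from typing import Dict


def generate_difference_problem(
    table: Dict[str, int], item_name: str, rng: random.Random
) -> dict:
    """Fill a fixed 'difference' template from a table and compute the answer in code."""
    # Pick two distinct rows; sorting the keys keeps sampling reproducible.
    a, b = rng.sample(sorted(table), 2)
    if table[a] < table[b]:
        a, b = b, a  # make sure the difference is non-negative
    answer = table[a] - table[b]
    question = f"How many more {item_name} did {a} sell than {b}?"
    # The solution spells out the reasoning steps, not just the final number.
    solution = (
        f"{a} sold {table[a]} {item_name} and {b} sold {table[b]} {item_name}. "
        f"{table[a]} - {table[b]} = {answer}."
    )
    return {"table": table, "question": question, "answer": answer, "solution": solution}


# Illustrative data only (not from TabMWP or TabMWP-TeLL).
sales = {"Ava": 12, "Ben": 7, "Cleo": 20}
problem = generate_difference_problem(sales, "cookies", random.Random(0))
```

Because the answer is computed from the table rather than written by a model, a later paraphrasing step can change the wording without touching the numbers.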

[NLP-35] NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization AAAI2025

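[Quick read]: The paper addresses three key challenges that neuro-symbolic methods face in vision-language reasoning: (a) reliance on predefined predicates for symbolic representations, which limits adaptability, (b) the difficulty of extracting predicates from raw data, and (c) the use of non-differentiable operations to combine primitive concepts. The key is NeSyCoCo, a neuro-symbolic framework that uses large language models (LLMs) to generate symbolic representations and map them to differentiable neural computations. Its innovations are (a) augmenting natural language inputs with dependency structures to improve alignment with symbolic representations, (b) using distributed word representations to link diverse, linguistically motivated logical predicates to neural modules, and (c) using soft composition of normalized predicate scores to align symbolic and differentiable reasoning.

Link: https://arxiv.org/abs/2412.15588
Authors: Danial Kamali, Elham J. Barezi, Parisa Kordjamshidi
Affiliations: Michigan State University; University of Memphis
Keywords: artificial intelligence agents, solve complex vision-language, vision-language reasoning tasks, complex vision-language reasoning, crucial for artificial
Categories: Computation and Language (cs.CL)
Comments: AAAI 2025 Project Page: this https URL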

Abstract:Compositional generalization is crucial for artificial intelligence agents to solve complex vision-language reasoning tasks. Neuro-symbolic approaches have demonstrated promise in capturing compositional structures, but they face critical challenges: (a) reliance on predefined predicates for symbolic representations that limit adaptability, (b) difficulty in extracting predicates from raw data, and (c) using non-differentiable operations for combining primitive concepts. To address these issues, we propose NeSyCoCo, a neuro-symbolic framework that leverages large language models (LLMs) to generate symbolic representations and map them to differentiable neural computations. NeSyCoCo introduces three innovations: (a) augmenting natural language inputs with dependency structures to enhance the alignment with symbolic representations, (b) employing distributed word representations to link diverse, linguistically motivated logical predicates to neural modules, and (c) using the soft composition of normalized predicate scores to align symbolic and differentiable reasoning. Our framework achieves state-of-the-art results on the ReaSCAN and CLEVR-CoGenT compositional generalization benchmarks and demonstrates robust performance with novel concepts in the CLEVR-SYN benchmark.

[NLP-36] Continual Learning Using a Kernel-Based Method Over Foundation Models

[Quick read]: The paper addresses two key challenges in class-incremental learning (CIL): catastrophic forgetting (CF) and inter-task class separation (ICS). The key is a new method, Kernel Linear Discriminant Analysis (KLDA), which relies on the strong features learned by a foundation model (FM) and enhances them with the Radial Basis Function (RBF) kernel and its Random Fourier Features (RFF). When a new task arrives, KLDA only computes the mean of each class in the task and updates a covariance matrix shared by all learned classes; classification is performed with Linear Discriminant Analysis. Experiments show that, without replay data, KLDA significantly outperforms baselines and reaches accuracy comparable to joint training on all classes.

Link: https://arxiv.org/abs/2412.15571
Authors: Saleh Momeni, Sahisnu Mazumder, Bing Liu
Affiliations: Unknown
Keywords: Continual learning, Linear Discriminant Analysis, learns a sequence, Discriminant Analysis, KLDA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Continual learning (CL) learns a sequence of tasks incrementally. This paper studies the challenging CL setting of class-incremental learning (CIL). CIL has two key challenges: catastrophic forgetting (CF) and inter-task class separation (ICS). Despite numerous proposed methods, these issues remain persistent obstacles. This paper proposes a novel CIL method, called Kernel Linear Discriminant Analysis (KLDA), that can effectively avoid CF and ICS problems. It leverages only the powerful features learned in a foundation model (FM). However, directly using these features proves suboptimal. To address this, KLDA incorporates the Radial Basis Function (RBF) kernel and its Random Fourier Features (RFF) to enhance the feature representations from the FM, leading to improved performance. When a new task arrives, KLDA computes only the mean for each class in the task and updates a shared covariance matrix for all learned classes based on the kernelized features. Classification is performed using Linear Discriminant Analysis. Our empirical evaluation using text and image classification datasets demonstrates that KLDA significantly outperforms baselines. Remarkably, without relying on replay data, KLDA achieves accuracy comparable to joint training of all classes, which is considered the upper bound for CIL performance. The KLDA code is available at this https URL.
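
The update rule in this abstract (per-class means plus one shared covariance over kernelized features) can be sketched with NumPy. This is an illustrative reading rather than the released KLDA code: the class name `KLDASketch`, the hyperparameters, the ridge term, and the equal-prior LDA score are all assumptions, and raw input vectors stand in for foundation-model features. It also assumes each class appears in exactly one task.

```python
import numpy as np


class KLDASketch:
    """Incremental LDA over Random Fourier Features (illustrative, not the official code)."""

    def __init__(self, in_dim, rff_dim=256, gamma=1.0, ridge=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        # RFF for the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2):
        # frequencies ~ N(0, 2 * gamma * I), phases ~ U[0, 2 * pi].
        self.W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(in_dim, rff_dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=rff_dim)
        self.rff_dim = rff_dim
        self.ridge = ridge  # small regularizer so the covariance is invertible
        self.means = {}     # class label -> mean of kernelized features
        self.scatter = np.zeros((rff_dim, rff_dim))  # shared within-class scatter
        self.n_seen = 0

    def _phi(self, X):
        return np.sqrt(2.0 / self.rff_dim) * np.cos(X @ self.W + self.b)

    def learn_task(self, X, y):
        """Add the classes of one task; earlier data is never revisited."""
        Z = self._phi(np.asarray(X, dtype=float))
        y = np.asarray(y)
        for c in np.unique(y):
            Zc = Z[y == c]
            mu = Zc.mean(axis=0)
            self.means[int(c)] = mu
            centered = Zc - mu
            self.scatter += centered.T @ centered
            self.n_seen += len(Zc)

    def predict(self, X):
        Z = self._phi(np.asarray(X, dtype=float))
        dof = max(self.n_seen - len(self.means), 1)
        cov = self.scatter / dof + self.ridge * np.eye(self.rff_dim)
        classes = sorted(self.means)
        M = np.stack([self.means[c] for c in classes])  # (C, D)
        A = np.linalg.solve(cov, M.T)                   # (D, C) = cov^-1 @ means
        # Equal-prior LDA score: z^T S^-1 mu_c - 0.5 * mu_c^T S^-1 mu_c
        scores = Z @ A - 0.5 * np.sum(M.T * A, axis=0)
        return np.array(classes)[scores.argmax(axis=1)]
```

A typical use would call `learn_task` once per arriving task and `predict` at any point afterwards; since only means and one scatter matrix are stored, no replay data is needed.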

[NLP-37] In-context Continual Learning Assisted by an External Continual Learner

[Quick read]: The paper addresses the scalability problems that catastrophic forgetting (CF) and growing prompt length cause in continual learning (CL). Existing methods mainly fine-tune or adapt large language models (LLMs) and still suffer from CF. The paper proposes InCA, whose key idea is to combine an external continual learner (ECL) with in-context learning (ICL) to achieve scalable CL without any parameter updates. The ECL is built incrementally and pre-selects a small subset of likely classes for each test instance, which bounds the length of the ICL prompt and avoids the performance loss caused by overlong prompts while keeping accuracy high. Experiments show that InCA significantly outperforms existing CL baselines.

Link: https://arxiv.org/abs/2412.15563
Authors: Saleh Momeni, Sahisnu Mazumder, Zixuan Ke, Bing Liu
Affiliations: Department of Computer Science, University of Illinois Chicago, USA; Intel Labs, USA; Salesforce AI Research, USA
Keywords: adapting large language, large language models, methods mainly rely, rely on fine-tuning, fine-tuning or adapting
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Existing continual learning (CL) methods mainly rely on fine-tuning or adapting large language models (LLMs). They still suffer from catastrophic forgetting (CF). Little work has been done to exploit in-context learning (ICL) to leverage the extensive knowledge within LLMs for CL without updating any parameters. However, incrementally learning each new task in ICL necessitates adding training examples from each class of the task to the prompt, which hampers scalability as the prompt length increases. This issue not only leads to excessively long prompts that exceed the input token limit of the underlying LLM but also degrades the model’s performance due to the overextended context. To address this, we introduce InCA, a novel approach that integrates an external continual learner (ECL) with ICL to enable scalable CL without CF. The ECL is built incrementally to pre-select a small subset of likely classes for each test instance. By restricting the ICL prompt to only these selected classes, InCA prevents prompt lengths from becoming excessively long, while maintaining high performance. Experimental results demonstrate that InCA significantly outperforms existing CL baselines, achieving substantial performance gains.
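
The pre-selection idea can be illustrated with a toy learner. The sketch below assumes a simple word-count profile per class as a stand-in for the paper's ECL, whose actual design is not described here; `ToyECL` and `build_prompt` are hypothetical names, and the prompt format is invented for illustration.

```python
from collections import Counter, defaultdict
from typing import List


class ToyECL:
    """Incrementally built class profiles used only to shortlist candidate classes."""

    def __init__(self) -> None:
        self.class_terms = defaultdict(Counter)  # label -> word counts
        self.examples = defaultdict(list)        # label -> stored demonstrations

    def add_class(self, label: str, texts: List[str]) -> None:
        # Adding a class touches only its own statistics, so nothing is forgotten.
        for text in texts:
            self.class_terms[label].update(text.lower().split())
            self.examples[label].append(text)

    def top_k(self, query: str, k: int) -> List[str]:
        words = query.lower().split()

        def score(label: str) -> float:
            terms = self.class_terms[label]
            return sum(terms[w] for w in words) / sum(terms.values())

        return sorted(self.class_terms, key=score, reverse=True)[:k]


def build_prompt(ecl: ToyECL, query: str, k: int = 2, shots: int = 1) -> str:
    """Build an ICL prompt restricted to the k shortlisted classes."""
    blocks = []
    for label in ecl.top_k(query, k):
        for text in ecl.examples[label][:shots]:
            blocks.append(f"Text: {text}\nLabel: {label}")
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


ecl = ToyECL()
ecl.add_class("weather", ["will it rain tomorrow", "is it sunny today"])
ecl.add_class("music", ["play some jazz music", "play the next song"])
ecl.add_class("alarm", ["set an alarm for seven", "wake me up at six"])
prompt = build_prompt(ecl, "play a jazz song", k=2)
```

The prompt length here depends on `k` and `shots`, not on the total number of classes learned so far, which is the scalability argument the abstract makes.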

[NLP-38] MORTAR: Metamorphic Multi-turn Testing for LLM-based Dialogue Systems

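[Quick read]: The paper addresses the oracle problem in multi-turn dialogue testing, that is, the difficulty of determining the expected output of a multi-turn dialogue system. The key is MORTAR, a metamorphic multi-turn dialogue testing approach. MORTAR automatically generates follow-up question-answer (QA) test cases with multiple dialogue-level perturbations, and uses a knowledge graph-based dialogue information model to build perturbed dialogue test datasets and detect bugs in multi-turn dialogue systems. The approach does not rely on a large language model (LLM) as a judge, which avoids potential bias in the evaluation step, and in experiments it detects more bugs than existing single-turn metamorphic testing approaches, especially severe ones.

Link: https://arxiv.org/abs/2412.15557
Authors: Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn
Affiliations: Monash University; RMIT University
Keywords: LLM-based dialogue systems, dialogue systems, dialogue, LLM-based dialogue, daily life
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: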

Abstract:With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn scenarios. However, multi-turn dialogue testing remains underexplored, with the Oracle problem in multi-turn testing posing a persistent challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a MetamORphic multi-TuRn diAlogue testing appRoach, which mitigates the test oracle problem in the assessment of LLM-based dialogue systems. MORTAR automates the generation of follow-up question-answer (QA) dialogue test cases with multiple dialogue-level perturbations and metamorphic relations. MORTAR employs a novel knowledge graph-based dialogue information model which effectively generates perturbed dialogue test datasets and detects bugs of multi-turn dialogue systems in a low-cost manner. The proposed approach does not require an LLM as a judge, eliminating potential of any biases in the evaluation step. According to the experiment results on multiple LLM-based dialogue systems and comparisons with single-turn metamorphic testing approaches, MORTAR explores more unique bugs in LLM-based dialogue systems, especially for severe bugs that MORTAR detects up to four times more unique bugs than the most effective existing metamorphic testing approach.
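
Metamorphic testing sidesteps the oracle problem by checking a relation between two runs instead of comparing against a known answer. The sketch below uses one illustrative relation, chosen here and not taken from MORTAR: reordering independent questions should only reorder the answers. The two bots are hypothetical systems under test.

```python
from typing import Callable, List, Tuple

DialogueSystem = Callable[[List[str]], List[str]]  # questions in, one answer per turn out


def check_reorder_relation(
    system: DialogueSystem, questions: List[str], order: List[int]
) -> List[Tuple[str, str, str]]:
    """Return (question, source answer, follow-up answer) for every violated turn."""
    source_answers = system(questions)
    followup_questions = [questions[i] for i in order]  # dialogue-level perturbation
    followup_answers = system(followup_questions)
    violations = []
    for new_pos, old_pos in enumerate(order):
        if followup_answers[new_pos] != source_answers[old_pos]:
            violations.append(
                (questions[old_pos], source_answers[old_pos], followup_answers[new_pos])
            )
    return violations


# --- Hypothetical systems under test ---
FACTS = {"capital of France?": "Paris", "2 + 2?": "4", "color of the sky?": "blue"}


def good_bot(questions: List[str]) -> List[str]:
    return [FACTS[q] for q in questions]


def forgetful_bot(questions: List[str]) -> List[str]:
    # Bug: degrades from the third turn onward, regardless of the question.
    return [FACTS[q] if turn < 2 else "I don't know" for turn, q in enumerate(questions)]


qs = list(FACTS)
good_violations = check_reorder_relation(good_bot, qs, [2, 0, 1])
bad_violations = check_reorder_relation(forgetful_bot, qs, [2, 0, 1])
```

Note that the checker never needs to know that "Paris" is correct; it only needs the two runs to agree, which is why no LLM judge is required.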

[NLP-39] NGQA: A Nutritional Graph Question Answering Benchmark for Personalized Health-aware Nutritional Reasoning

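[Quick read]: The paper addresses a central problem in personalized nutritional health reasoning: how to tailor dietary advice to an individual's health conditions. Current research faces two main obstacles: a lack of personalized datasets that include user-specific medical information, and the difficulty large language models (LLMs) have with the domain-specific complexity of personalized healthy dietary reasoning. The proposed solution is the Nutritional Graph Question Answering (NGQA) benchmark, the first graph question answering dataset designed for personalized nutritional health reasoning. NGQA uses data from the National Health and Nutrition Examination Survey (NHANES) and the Food and Nutrient Database for Dietary Studies (FNDDS) to evaluate whether a food is healthy for a specific user, with explanations of the key contributing nutrients. With three question complexity settings and three downstream tasks, the benchmark effectively challenges existing models and advances GraphQA research in a specific domain.

Link: https://arxiv.org/abs/2412.15547
Authors: Zheyuan Zhang, Yiyang Li, Nhi Ha Lan Le, Zehong Wang, Tianyi Ma, Vincent Galassi, Keerthiram Murugesan, Nuno Moniz, Werner Geyer, Nitesh V Chawla, Chuxu Zhang, Yanfang Ye
Affiliations: University of Notre Dame; Brandeis University; IBM Research; University of Connecticut
Keywords: Diet plays, health conditions remains, Graph Question Answering, Question Answering, role in human
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: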

Abstract:Diet plays a critical role in human health, yet tailoring dietary reasoning to individual health conditions remains a major challenge. Nutrition Question Answering (QA) has emerged as a popular method for addressing this problem. However, current research faces two critical limitations. On one hand, the absence of datasets involving user-specific medical information severely limits personalization. This challenge is further compounded by the wide variability in individual health needs. On the other hand, while large language models (LLMs), a popular solution for this task, demonstrate strong reasoning abilities, they struggle with the domain-specific complexities of personalized healthy dietary reasoning, and existing benchmarks fail to capture these challenges. To address these gaps, we introduce the Nutritional Graph Question Answering (NGQA) benchmark, the first graph question answering dataset designed for personalized nutritional health reasoning. NGQA leverages data from the National Health and Nutrition Examination Survey (NHANES) and the Food and Nutrient Database for Dietary Studies (FNDDS) to evaluate whether a food is healthy for a specific user, supported by explanations of the key contributing nutrients. The benchmark incorporates three question complexity settings and evaluates reasoning across three downstream tasks. Extensive experiments with LLM backbones and baseline models demonstrate that the NGQA benchmark effectively challenges existing models. In sum, NGQA addresses a critical real-world problem while advancing GraphQA research with a novel domain-specific benchmark.

[NLP-40] MRAG: A Modular Retrieval Framework for Time-Sensitive Question Answering

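[Quick read]: The paper addresses the poor performance of existing methods on time-sensitive questions that require complex temporal reasoning. The key is a training-free Modular Retrieval framework (MRAG) built from three cooperating modules: (1) a question processing module that decomposes the question into main content and a temporal constraint; (2) a retrieval and summarization module that retrieves evidence for the main content and summarizes it with a large language model (LLM); and (3) a semantic-temporal hybrid ranking module that scores each evidence summary by both semantic and temporal relevance. With this modular design, MRAG significantly improves retrieval performance and final answer accuracy on the TempRAGEval benchmark.

Link: https://arxiv.org/abs/2412.15540
Authors: Zhang Siyue, Xue Yuxiang, Zhang Yiming, Wu Xiaobao, Luu Anh Tuan, Zhao Chen
Affiliations: Unknown
Keywords: large language models, Understanding temporal relations, question-answering systems powered, language models, Understanding temporal
Categories: Computation and Language (cs.CL)
Comments: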

Abstract:Understanding temporal relations and answering time-sensitive questions is crucial yet a challenging task for question-answering systems powered by large language models (LLMs). Existing approaches either update the parametric knowledge of LLMs with new facts, which is resource-intensive and often impractical, or integrate LLMs with external knowledge retrieval (i.e., retrieval-augmented generation). However, off-the-shelf retrievers often struggle to identify relevant documents that require intensive temporal reasoning. To systematically study time-sensitive question answering, we introduce the TempRAGEval benchmark, which repurposes existing datasets by incorporating temporal perturbations and gold evidence labels. As anticipated, all existing retrieval methods struggle with these temporal reasoning-intensive questions. We further propose Modular Retrieval (MRAG), a trainless framework that includes three modules: (1) Question Processing that decomposes question into a main content and a temporal constraint; (2) Retrieval and Summarization that retrieves evidence and uses LLMs to summarize according to the main content; (3) Semantic-Temporal Hybrid Ranking that scores each evidence summarization based on both semantic and temporal relevance. On TempRAGEval, MRAG significantly outperforms baseline retrievers in retrieval performance, leading to further improvements in final answer accuracy.
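
The decomposition and hybrid ranking steps listed in this abstract can be sketched as follows. Everything concrete here is an assumption for illustration: the regex-based question split, the token-overlap semantic score, the hard 0/1 temporal score, and combining the two by multiplication are stand-ins, not the scoring functions used by MRAG, and the retrieval and summarization module is skipped.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

Constraint = Tuple[str, int]  # (operator, year), e.g. ("after", 2015)


@dataclass
class Evidence:
    summary: str
    year: int


def split_question(question: str) -> Tuple[str, Optional[Constraint]]:
    """Module 1 (toy): separate the main content from a simple year constraint."""
    match = re.search(r"\b(in|before|after)\s+(\d{4})\b", question)
    if match is None:
        return question, None
    main = question[: match.start()] + question[match.end():]
    return re.sub(r"\s+", " ", main).strip(), (match.group(1), int(match.group(2)))


def semantic_score(main: str, text: str) -> float:
    query_tokens = set(re.findall(r"\w+", main.lower()))
    text_tokens = set(re.findall(r"\w+", text.lower()))
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


def temporal_score(constraint: Optional[Constraint], year: int) -> float:
    if constraint is None:
        return 1.0
    op, ref = constraint
    satisfied = {"in": year == ref, "before": year < ref, "after": year > ref}[op]
    return 1.0 if satisfied else 0.0


def hybrid_rank(question: str, evidence: List[Evidence]) -> List[Evidence]:
    """Module 3 (toy): rank by semantic relevance times temporal relevance."""
    main, constraint = split_question(question)
    return sorted(
        evidence,
        key=lambda e: semantic_score(main, e.summary) * temporal_score(constraint, e.year),
        reverse=True,
    )


pool = [
    Evidence("Germany won the world cup", 2014),
    Evidence("France won the world cup", 2018),
    Evidence("France hosted the Olympics", 2024),
]
ranked = hybrid_rank("Who won the world cup after 2015?", pool)
```

A purely semantic ranker cannot separate the 2014 and 2018 entries in this example, since both match the main content equally; the temporal factor is what pushes the 2014 entry to the bottom.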

[NLP-41] XRAG: eXamining the Core – Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation

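[Quick read]: The paper aims to identify potential failure points in retrieval-augmented generation (RAG) systems and to offer ways of optimizing them. The key is XRAG, an open-source, modular codebase that systematically evaluates the core components of RAG modules across four phases (pre-retrieval, retrieval, post-retrieval, and generation) and benchmarks them comprehensively on reconfigured datasets. The paper formulates a set of experimental methodologies and diagnostic testing protocols to identify and analyze failure points in RAG modules, and provides tailored solutions that strengthen validation and improve overall system performance.

Link: https://arxiv.org/abs/2412.15529
Authors: Qianren Mao, Yangyifei Luo, Jinlong Zhang, Hanwen Hao, Zhilong Cao, Xiaolong Wang, Xiao Guan, Zhenting Huang, Weifeng Jiang, Shuyu Guo, Zhentao Han, Qili Zhang, Siyuan Tao, Yujie Liu, Junnan Liu, Zhixing Tan, Jie Sun, Bo Li, Xudong Liu, Richong Zhang, Jianxin Li
Affiliations: Unknown
Keywords: Large Language Models, Language Models, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: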

Abstract:Retrieval-augmented generation (RAG) synergizes the retrieval of pertinent data with the generative capabilities of Large Language Models (LLMs), ensuring that the generated output is not only contextually relevant but also accurate and this http URL introduce XRAG, an open-source, modular codebase that facilitates exhaustive evaluation of the performance of foundational components of advanced RAG modules. These components are systematically categorized into four core phases: pre-retrieval, retrieval, post-retrieval, and generation. We systematically analyse them across reconfigured datasets, providing a comprehensive benchmark for their effectiveness. Given the escalating complexity of RAG systems, we underscore the necessity of identifying potential failure points of RAG modules. We formulate a suite of experimental methodologies and diagnostic testing protocols to dissect the failure points inherent in the engineering of RAG modules. Subsequently, we proffer bespoke solutions that are designed to augment the validation processes and bolster the overall performance of these modules. Our work thoroughly evaluates the performance of core advanced components in RAG systems, providing insights into optimizations for prevalent failure points.

[NLP-42] HREF: Human Response-Guided Evaluation of Instruction Following in Language Models

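[Quick read]: The paper addresses bias in the automatic evaluation of instruction following by large language models (LLMs). The bias arises because evaluation relies on another powerful LLM as the judge, whose judgments deviate from those of human judges. The key is to bring human-written responses into the evaluation as references; experiments show that this markedly improves the reliability of automatic evaluation, raising agreement with human judges by up to 3.2%. The paper also introduces a new benchmark, Human Response-Guided Evaluation of Instruction Following (HREF), with 4,258 samples across 11 task categories and a composite evaluation setup that selects the most reliable method for each category. Beyond reliable evaluation, HREF emphasizes individual task performance and is free from data contamination.

Link: https://arxiv.org/abs/2412.15524
Authors: Xinxi Lyu, Yizhong Wang, Hannaneh Hajishirzi, Pradeep Dasigi
Affiliations: Allen Institute for AI; University of Washington
Keywords: Large Language Models, Large Language, introducing unresolved biases, Evaluating the capability, capability of Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 28 pages, 15 figures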

Abstract:Evaluating the capability of Large Language Models (LLMs) in following instructions has heavily relied on a powerful LLM as the judge, introducing unresolved biases that cause its judgments to deviate from those of human judges. In this work, we reevaluate various choices for automatic evaluation on a wide range of instruction-following tasks. We experiment with methods that leverage human-written responses and observe that they enhance the reliability of automatic evaluations across a wide range of tasks, resulting in up to a 3.2% improvement in agreement with human judges. We also discovered that human-written responses offer an orthogonal perspective to model-generated responses in following instructions and should be used as an additional context when comparing model responses. Based on these observations, we develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF), comprising 4,258 samples across 11 task categories and employing a composite evaluation setup that selects the most reliable method for each category. In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. Finally, we study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template. We host a live leaderboard that evaluates LLMs on the private evaluation set of HREF.

[NLP-43] ADEQA: A Question Answer based approach for joint ADE-Suspect Extraction using Sequence-To-Sequence Transformers

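[Quick read]: The paper addresses the challenge of extracting adverse drug events (ADEs) and the related suspect drugs from unstructured data sources such as clinical study reports, patient health records, and social media posts. The key is ADEQA, a question-answer (QA) based approach that uses quasi-supervised labelled data and sequence-to-sequence transformers to extract ADEs, suspect drugs, and the relationships between them. Unlike traditional QA models, models based on natural language generation (NLG) do not require extensive token-level labelling, which significantly lowers the adoption barrier. On a public ADE corpus the approach reaches an F1 score of 94%, a state-of-the-art result.

Link: https://arxiv.org/abs/2412.15510
Authors: Vinayak Arannil, Tomal Deb, Atanu Roy
Affiliations: Unknown
Keywords: Adverse Drug Events, taking prompt actions, Early identification, identification of Adverse, Drug Events
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: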

Abstract:Early identification of Adverse Drug Events (ADE) is critical for taking prompt actions while introducing new drugs into the market. These ADEs information are available through various unstructured data sources like clinical study reports, patient health records, social media posts, etc. Extracting ADEs and the related suspect drugs using machine learning is a challenging task due to the complex linguistic relations between drug ADE pairs in textual data and unavailability of large corpus of labelled datasets. This paper introduces ADEQA, a question-answer(QA) based approach using quasi supervised labelled data and sequence-to-sequence transformers to extract ADEs, drug suspects and the relationships between them. Unlike traditional QA models, natural language generation (NLG) based models don’t require extensive token level labelling and thereby reduces the adoption barrier significantly. On a public ADE corpus, we were able to achieve state-of-the-art results with an F1 score of 94% on establishing the relationships between ADEs and the respective suspects.

[NLP-44] Mitigating Social Bias in Large Language Models: A Multi-Objective Approach within a Multi-Agent Framework AAAI2025

[Quick read]: The paper addresses the social bias that large language models (LLMs) produce in natural language processing (NLP). The key is a multi-objective approach within a multi-agent framework (MOMA), which deploys multiple agents to perform causal interventions on the bias-related content of input questions, breaking the shortcut connection between that content and the corresponding answers. Unlike traditional debiasing techniques, MOMA substantially reduces bias while causing only a slight drop in downstream task performance. Experiments on two datasets and two models show that MOMA reduces bias scores by up to 87.7% and improves the multi-objective metric icat on the StereoSet dataset by up to 58.1%.

Link: https://arxiv.org/abs/2412.15504
Authors: Zhenjie Xu(1), Wenqing Chen(1), Yi Tang(1), Xuanying Li(2), Cheng Hu(1), Zhixuan Chu(3), Kui Ren(3), Zibin Zheng(1), Zhichao Lu(4) ((1) School of Software Engineering, Sun Yat-sen University, (2) School of Physics and Astronomy, Sun Yat-sen University, (3) School of Cyber Science and Technology, Zhejiang University, (4) Department of Computer Science, City University of Hong Kong)
Affiliations: Sun Yat-sen University; Zhejiang University; City University of Hong Kong
Keywords: Natural language processing, Natural language, large language models, language processing, large language
Categories: Computation and Language (cs.CL)
Comments: This work has been accepted at The 39th Annual AAAI Conference on Artificial Intelligence (AAAI-2025)

Abstract:Natural language processing (NLP) has seen remarkable advancements with the development of large language models (LLMs). Despite these advancements, LLMs often produce socially biased outputs. Recent studies have mainly addressed this problem by prompting LLMs to behave ethically, but this approach results in unacceptable performance degradation. In this paper, we propose a multi-objective approach within a multi-agent framework (MOMA) to mitigate social bias in LLMs without significantly compromising their performance. The key idea of MOMA involves deploying multiple agents to perform causal interventions on bias-related contents of the input questions, breaking the shortcut connection between these contents and the corresponding answers. Unlike traditional debiasing techniques leading to performance degradation, MOMA substantially reduces bias while maintaining accuracy in downstream tasks. Our experiments conducted on two datasets and two models demonstrate that MOMA reduces bias scores by up to 87.7%, with only a marginal performance degradation of up to 6.8% in the BBQ dataset. Additionally, it significantly enhances the multi-objective metric icat in the StereoSet dataset by up to 58.1%. Code will be made available at this https URL.
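
For readers unfamiliar with the icat metric mentioned above, the sketch below computes it as defined for StereoSet, to the best of my understanding: icat = lms * min(ss, 100 - ss) / 50, where lms is the language modeling score and ss the stereotype score, both in percent. The function name and the input values are illustrative and are not results from the paper.

```python
def icat(lms: float, ss: float) -> float:
    """Idealized CAT score: rewards high lms and an ss close to the unbiased value of 50."""
    if not (0.0 <= lms <= 100.0 and 0.0 <= ss <= 100.0):
        raise ValueError("lms and ss must be percentages in [0, 100]")
    return lms * min(ss, 100.0 - ss) / 50.0


# Illustrative values only.
ideal = icat(100.0, 50.0)           # perfect language modeling, no stereotype preference
fully_biased = icat(80.0, 100.0)    # always prefers the stereotypical option
typical = icat(90.0, 60.0)
```

Because the score multiplies language modeling quality by closeness to an unbiased stereotype score, a debiasing method only raises icat if it lowers bias without destroying fluency, which is why the abstract treats it as a multi-objective metric.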

[NLP-45] Humanlike Cognitive Patterns as Emergent Phenomena in Large Language Models

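[Quick read]: The paper is a systematic review of how large language models (LLMs) perform in three cognitive domains (decision-making biases, reasoning, and creativity), intended to give a synthesized view of this complex landscape. It draws on empirical studies that use established psychological tests and compares LLM performance with human benchmarks. On decision-making, LLMs show some but not all human biases. On reasoning, advanced models such as GPT-4 exhibit deliberative reasoning akin to human System-2 thinking. On creativity, LLMs excel at language-based creative tasks but struggle with divergent thinking tasks that require real-world context. The paper also discusses the limitations of LLMs and suggests directions for future research, including memory, attention, and open-source model development.

Link: https://arxiv.org/abs/2412.15501
Authors: Zhisheng Tang, Mayank Kejriwal
Affiliations: University of Southern California Information Sciences Institute
Keywords: Large Language Models, Large Language, gained significant traction, artificial intelligence, complex landscape
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: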

Abstract:Research on emergent patterns in Large Language Models (LLMs) has gained significant traction in both psychology and artificial intelligence, motivating the need for a comprehensive review that offers a synthesis of this complex landscape. In this article, we systematically review LLMs’ capabilities across three important cognitive domains: decision-making biases, reasoning, and creativity. We use empirical studies drawing on established psychological tests and compare LLMs’ performance to human benchmarks. On decision-making, our synthesis reveals that while LLMs demonstrate several human-like biases, some biases observed in humans are absent, indicating cognitive patterns that only partially align with human decision-making. On reasoning, advanced LLMs like GPT-4 exhibit deliberative reasoning akin to human System-2 thinking, while smaller models fall short of human-level performance. A distinct dichotomy emerges in creativity: while LLMs excel in language-based creative tasks, such as storytelling, they struggle with divergent thinking tasks that require real-world context. Nonetheless, studies suggest that LLMs hold considerable potential as collaborators, augmenting creativity in human-machine problem-solving settings. Discussing key limitations, we also offer guidance for future research in areas such as memory, attention, and open-source model development.

[NLP-46] The First Multilingual Model For The Detection of Suicide Texts COLING2025

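[Quick read]: The paper addresses suicidal ideation, a global health problem, by identifying suicide risk from the emotional expressions of social network users. The key is to use multilingual models such as mBERT, XLM-R, and mT5 with cross-lingual transfer learning to detect suicide-related text in posts in six languages (Spanish, English, German, Catalan, Portuguese, and Italian). A Spanish suicide ideation tweet dataset is translated into the other five languages, and each model is fine-tuned on the resulting multilingual data; mT5 performs best on the classification metrics, with F1 scores above 85%. The study stresses the importance of linguistic diversity when developing automated multilingual tools, and notes open issues around semantic fidelity in translation and ethics that can guide future human-in-the-loop evaluations.

Link: https://arxiv.org/abs/2412.15498
Authors: Rodolfo Zevallos, Annika Schoene, John E. Ortega
Affiliations: Unknown
Keywords: problem affecting millions, health problem affecting, people worldwide, affecting millions, millions of people
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: SUMEval-2: The 2nd Workshop on Scaling Up Multilingual Multi-Cultural Evaluation at the 31st International Conference on Computational Linguistics (COLING 2025)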

Abstract:Suicidal ideation is a serious health problem affecting millions of people worldwide. Social networks provide information about these mental health problems through users’ emotional expressions. We propose a multilingual model leveraging transformer architectures like mBERT, XLM-R, and mT5 to detect suicidal text across posts in six languages - Spanish, English, German, Catalan, Portuguese and Italian. A Spanish suicide ideation tweet dataset was translated into five other languages using SeamlessM4T. Each model was fine-tuned on this multilingual data and evaluated across classification metrics. Results showed mT5 achieving the best performance overall with F1 scores above 85%, highlighting capabilities for cross-lingual transfer learning. The English and Spanish translations also displayed high quality based on perplexity. Our exploration underscores the importance of considering linguistic diversity in developing automated multilingual tools to identify suicidal risk. Limitations exist around semantic fidelity in translations and ethical implications which provide guidance for future human-in-the-loop evaluations.

[NLP-47] Lexicography Saves Lives (LSL): Automatically Translating Suicide-Related Language COLING2025

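[Quick read]: This paper addresses the uneven global distribution of suicide-related research resources, in particular the scarcity of resources for non-English languages and non-Western cultural contexts. The key to the approach is the "Lexicography Saves Lives Project", which makes three main contributions: first, it sets out ethical considerations and guidelines to reduce the harm that developing suicide-related resources may cause; second, it translates an existing dictionary related to suicidal ideation into 200 languages and conducts human evaluation on a subset of the translations; finally, it launches a public website to share the resources and encourage community participation.

Link: https://arxiv.org/abs/2412.15497
Authors: Annika Marie Schoene,John E. Ortega,Rodolfo Joel Zevallos,Laura Haaber Ihle
Affiliations: Northeastern University, Institute for Experiential AI; Barcelona Supercomputing Center
Keywords: Recent years, predict risk, Sustainable Development Goals, marked increase, increase in research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The 31st International Conference on Computational Linguistics (COLING 2025)

Click to view abstract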

Abstract:Recent years have seen a marked increase in research that aims to identify or predict risk, intention or ideation of suicide. The majority of new tasks, datasets, language models and other resources focus on English and on suicide in the context of Western culture. However, suicide is a global issue, and reducing the suicide rate by 2030 is one of the key goals of the UN’s Sustainable Development Goals. Previous work has used English dictionaries related to suicide to translate into different target languages due to a lack of other available resources. Naturally, this leads to a variety of ethical tensions (e.g., linguistic misrepresentation), where discourse around suicide is not present in a particular culture or country. In this work, we introduce the ‘Lexicography Saves Lives Project’ to address this issue and make three distinct contributions. First, we outline ethical considerations and provide overview guidelines to mitigate harm in developing suicide-related resources. Next, we translate an existing dictionary related to suicidal ideation into 200 different languages and conduct human evaluations on a subset of translated dictionaries. Finally, we introduce a public website to make our resources available and enable community participation.

[NLP-48] TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use

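[Quick read]: This paper addresses the performance bottleneck that arises in tool-use tasks when large language models (LLMs) are trained with standard supervised fine-tuning (SFT), which relies on large-scale datasets and overlooks task-specific characteristics. The key to the approach is TL-Training, a task-feature-based training framework that improves model performance in three ways: 1) mitigating the effect of suboptimal training data on tool-use behavior; 2) dynamically adjusting token weights to prioritize key tokens; 3) introducing a robust reward mechanism tailored to error categories, optimized with proximal policy optimization. Experiments show that CodeLLaMA-2-7B trained with TL-Training matches or surpasses open- and closed-source LLMs in tool-use performance, while also gaining robustness in noisy environments and better general task performance.

Link: https://arxiv.org/abs/2412.15495
Authors: Junjie Ye,Yilong Wu,Sixian Li,Yuming Yang,Tao Gui,Qi Zhang,Xuanjing Huang,Peng Wang,Zhongchao Shi,Jianping Fan,Zhengyin Du
Affiliations: Unknown
Keywords: Large language models, achieve remarkable advancements, Large language, language models, achieve remarkable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract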

Abstract:Large language models (LLMs) achieve remarkable advancements by leveraging tools to interact with external environments, a critical step toward generalized AI. However, the standard supervised fine-tuning (SFT) approach, which relies on large-scale datasets, often overlooks task-specific characteristics in tool use, leading to performance bottlenecks. To address this issue, we analyze three existing LLMs and uncover key insights: training data can inadvertently impede tool-use behavior, token importance is distributed unevenly, and errors in tool calls fall into a small set of distinct categories. Building on these findings, we propose TL-Training, a task-feature-based framework that mitigates the effects of suboptimal training data, dynamically adjusts token weights to prioritize key tokens during SFT, and incorporates a robust reward mechanism tailored to error categories, optimized through proximal policy optimization. We validate TL-Training by training CodeLLaMA-2-7B and evaluating it on four diverse open-source test sets. Our results demonstrate that the LLM trained by our method matches or surpasses both open- and closed-source LLMs in tool-use performance using only 1,217 training data points. Additionally, our method enhances robustness in noisy environments and improves general task performance, offering a scalable and efficient paradigm for tool-use training in LLMs. The code and data are available at this https URL.
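
The abstract says TL-Training adjusts token weights so that key tokens count more during SFT, but it does not give the weighting rule. The following is a minimal sketch of the general idea only, a weighted negative log-likelihood over one target sequence. The function name `weighted_token_nll`, the example log-probabilities, and the 3x weight are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def weighted_token_nll(token_logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average negative log-likelihood over one target sequence.

    token_logprobs[i] is the model's log-probability of the i-th target token;
    weights[i] says how much that token should count in the loss.
    """
    if len(token_logprobs) != len(weights):
        raise ValueError("need exactly one weight per token")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(-lp * w for lp, w in zip(token_logprobs, weights))
    return weighted_sum / total_weight


# Hypothetical example: the middle token (say, a tool name) is poorly predicted.
logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
uniform = weighted_token_nll(logprobs, [1.0, 1.0, 1.0])
# Assumed weighting: the key token counts three times as much as the others.
prioritized = weighted_token_nll(logprobs, [1.0, 3.0, 1.0])
```

With uniform weights this reduces to the usual per-token average; raising the weight of a poorly predicted key token raises the loss, so its gradient contribution grows relative to the other tokens.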

[NLP-49] Multi-LLM Text Summarization

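[Quick read]: This paper addresses the performance limits of a single large language model (LLM) on text summarization by proposing a multi-LLM summarization framework. The key to the approach is two multi-LLM strategies: centralized and decentralized. Each round of conversation has two core steps, generation and evaluation. The centralized strategy uses a single LLM to evaluate the summaries and select the best one, while the decentralized strategy uses multiple LLMs for evaluation. Experiments show that the multi-LLM approaches significantly outperform single-LLM baselines, by up to 3x.

Link: https://arxiv.org/abs/2412.15487
Authors: Jiangnan Fang,Cheng-Tse Liu,Jieun Kim,Yash Bhedaru,Ethan Liu,Nikhil Singh,Nedim Lipka,Puneet Mathur,Nesreen K. Ahmed,Franck Dernoncourt,Ryan A. Rossi,Hanieh Deilamsalehy
Affiliations: University of California, Santa Cruz; Adobe Research
Keywords: Multi-LLM, Multi-LLM summarization framework, Multi-LLM summarization, summarization, multi-LLM strategies including
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract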

Abstract:In this work, we propose a Multi-LLM summarization framework, and investigate two different multi-LLM strategies including centralized and decentralized. Our multi-LLM summarization framework has two fundamentally important steps at each round of conversation: generation and evaluation. These steps are different depending on whether our multi-LLM decentralized summarization is used or centralized. In both our multi-LLM decentralized and centralized strategies, we have k different LLMs that generate diverse summaries of the text. However, during evaluation, our multi-LLM centralized summarization approach leverages a single LLM to evaluate the summaries and select the best one whereas k LLMs are used for decentralized multi-LLM summarization. Overall, we find that our multi-LLM summarization approaches significantly outperform the baselines that leverage only a single LLM by up to 3x. These results indicate the effectiveness of multi-LLM approaches for summarization.
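
The two strategies differ only in the evaluation step, which a short sketch can make concrete. The code below is an illustration under assumptions, not the authors' implementation: LLMs are stood in for by plain Python callables, the names `centralized_round` and `decentralized_round` are hypothetical, and majority voting with first-seen tie-breaking is an assumed aggregation rule that the abstract does not specify.

```python
from collections import Counter
from typing import Callable, List

# Stand-ins for LLM calls (assumption): a generator maps text to a summary,
# an evaluator returns the index of the summary it prefers.
Generator = Callable[[str], str]
Evaluator = Callable[[str, List[str]], int]


def generate(text: str, generators: List[Generator]) -> List[str]:
    """Generation step: each of the k generators proposes one summary."""
    return [g(text) for g in generators]


def centralized_round(text: str, generators: List[Generator], evaluator: Evaluator) -> str:
    """Centralized evaluation: a single evaluator picks the best summary."""
    summaries = generate(text, generators)
    return summaries[evaluator(text, summaries)]


def decentralized_round(text: str, generators: List[Generator], evaluators: List[Evaluator]) -> str:
    """Decentralized evaluation: every evaluator votes; the most-voted summary wins.

    Ties go to the index that received a vote first (assumed rule).
    """
    summaries = generate(text, generators)
    votes = Counter(e(text, summaries) for e in evaluators)
    best_index, _ = votes.most_common(1)[0]
    return summaries[best_index]


# Toy stand-ins so the sketch runs without any model.
toy_generators = [lambda t: t[:5], lambda t: t[:10], lambda t: t.upper()]
prefer_shortest = lambda t, s: min(range(len(s)), key=lambda i: len(s[i]))
prefer_longest = lambda t, s: max(range(len(s)), key=lambda i: len(s[i]))

central = centralized_round("hello multi-llm world", toy_generators, prefer_shortest)
decentral = decentralized_round(
    "hello multi-llm world", toy_generators, [prefer_shortest, prefer_longest, prefer_shortest]
)
```

In a real system each callable would wrap a prompt to a different model, and the round could be repeated with the selected summary fed back into the next generation step.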

[NLP-50] Continual Learning Using Only Large Language Model Prompting COLING-2025

[Quick read]: This paper addresses how to learn incrementally in a continual learning (CL) setting without fine-tuning the large language model (LLM) or adding trainable parameters. The key to the approach is CLOB, a new continual learning paradigm that learns incrementally through verbal prompting alone, with no fine-tuning or parameter changes to the LLM. The paper also proposes CIS, a continual learning technique based on incremental summarization, which overcomes the LLM's input length limit and outperforms the baselines by a large margin in experiments.

Link: https://arxiv.org/abs/2412.15479
Authors: Jiabao Qiu,Zixuan Ke,Bing Liu
Affiliations: University of Illinois Chicago; Salesforce AI Research
Keywords: large language model, language model, black box, introduce CLOB, continual learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To Appear in COLING-2025 (short paper)

Click to view abstract

Abstract:We introduce CLOB, a novel continual learning (CL) paradigm wherein a large language model (LLM) is regarded as a black box. Learning is done incrementally via only verbal prompting. CLOB does not fine-tune any part of the LLM or add any trainable parameters to it. It is particularly suitable for LLMs that are accessible via APIs. We also propose a new CL technique, called CIS, based on incremental summarization that also overcomes the LLM’s input length limit. Experiments show CIS outperforms baselines by a very large margin.

[NLP-51] A Review of the Marathi Natural Language Processing

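[Quick read]: This paper addresses the shortage of resources and tools for natural language processing (NLP) research on Indian languages, Marathi in particular. It reviews how neural network (NN) based models and tools, together with efforts over the past decade to build high-quality datasets, benchmarks, and evaluation metrics for India's 22 scheduled languages (including Marathi), have made Marathi NLP tasks more feasible and advanced research in this area.

Link: https://arxiv.org/abs/2412.15471
Authors: Asang Dani,Shailesh R Sathe
Affiliations: Unknown
Keywords: NLP, NLP research, Marathi, languages, Marathi NLP tasks
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract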

Abstract:Marathi is one of the most widely used languages in the world. One might expect that the latest advances in NLP research in languages like English reach such a large community. However, NLP advancements in English didn’t immediately reach Indian languages like Marathi. There were several reasons for this, including the diversity of scripts used and the lack of (publicly available) resources such as tokenization strategies, high-quality datasets and benchmarks, and evaluation metrics. In addition, the morphologically rich nature of Marathi made NLP tasks challenging. Advances in Neural Network (NN) based models and tools since the early 2000s helped improve this situation and make NLP research more accessible. In the past 10 years, significant efforts were made to improve language resources for all 22 scheduled languages of India. This paper presents a broad overview of the evolution of NLP research in Indic languages, with a focus on Marathi and the state-of-the-art resources and tools available to the research community. It also provides an overview of tools and techniques associated with Marathi NLP tasks.

[NLP-52] TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models

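[Quick read]: This paper aims to improve human-robot interaction by making industrial robotic systems more interpretable, especially in safety-critical applications. The key to the approach is combining large language models (LLMs) and vision language models (VLMs) with robotic perception and control, so that robots can understand and execute natural language commands and perceive their environment through visual or descriptive inputs. In addition, the LLM's internal states and reasoning are translated into easily understood text, so that operators can clearly see the robot's current state and intentions, which supports effective and safe operation. The paper presents four LLM-assisted simulated robot control workflows, covering low-level control, generation of language-based feedback, use of visual information as input, and use of robot structure information to generate task plans and feedback.

Link: https://arxiv.org/abs/2412.15462
Authors: Ammar N. Abbas,Csaba Beleznai
Affiliations: Technological University Dublin; AIT Austrian Institute of Technology
Keywords: enhance human-robot interaction, Large Language Models, Vision Language Models, industrial robotic systems, interpretable industrial robotic
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: This paper has been accepted for publication in the proceedings of the 2024 Eighth IEEE International Conference on Robotic Computing (IRC)

Click to view abstract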

Abstract:TalkWithMachines aims to enhance human-robot interaction by contributing to interpretable industrial robotic systems, especially for safety-critical applications. The presented paper investigates recent advancements in Large Language Models (LLMs) and Vision Language Models (VLMs), in combination with robotic perception and control. This integration allows robots to understand and execute commands given in natural language and to perceive their environment through visual and/or descriptive inputs. Moreover, translating the LLM’s internal states and reasoning into text that humans can easily understand ensures that operators gain a clearer insight into the robot’s current state and intentions, which is essential for effective and safe operation. Our paper outlines four LLM-assisted simulated robotic control workflows, which explore (i) low-level control, (ii) the generation of language-based feedback that describes the robot’s internal states, (iii) the use of visual information as additional input, and (iv) the use of robot structure information for generating task plans and feedback, taking the robot’s physical capabilities and limitations into account. The proposed concepts are presented in a set of experiments, along with a brief discussion. Project description, videos, and supplementary materials will be available on the project website: this https URL.

[NLP-53] Northeastern Uni at Multilingual Counterspeech Generation: Enhancing Counter Speech Generation with LLM Alignment through Direct Preference Optimization COLING2025

[Quick read]: This paper addresses the failure of existing methods for automatically generating counter-speech (CS) against hate speech to produce high-quality, impactful, and scalable CS, especially in multilingual settings. The key to the approach is aligning large language models (LLMs) with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) so that the generated CS is contextually appropriate and linguistically adaptable. Knowledge grounding is also used to improve the factual accuracy and relevance of the generated CS. Experiments show that DPO-aligned models significantly outperform SFT baselines on CS benchmarks and scale effectively to multiple languages.

Link: https://arxiv.org/abs/2412.15453
Authors: Sahil Wadhwa,Chengtian Xu,Haoming Chen,Aakash Mahalingam,Akankshya Kar,Divya Chaudhary
Affiliations: Unknown
Keywords: addressing hate speech, Direct Preference Optimization, critical strategy, strategy for addressing, addressing hate
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 tables, 1 figure, The First Workshop on Multilingual Counterspeech Generation (MCG) at The 31st International Conference on Computational Linguistics (COLING 2025)

Click to view abstract

Abstract:The automatic generation of counter-speech (CS) is a critical strategy for addressing hate speech by providing constructive and informed responses. However, existing methods often fail to generate high-quality, impactful, and scalable CS, particularly across diverse linguistic contexts. In this paper, we propose a novel methodology to enhance CS generation by aligning Large Language Models (LLMs) using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Our approach leverages DPO to align LLM outputs with human preferences, ensuring contextually appropriate and linguistically adaptable responses. Additionally, we incorporate knowledge grounding to enhance the factual accuracy and relevance of generated CS. Experimental results demonstrate that DPO-aligned models significantly outperform SFT baselines on CS benchmarks while scaling effectively to multiple languages. These findings highlight the potential of preference-based alignment techniques to advance CS generation across varied linguistic settings. The model supervision and alignment is done in English and the same model is used for reporting metrics across other languages like Basque, Italian, and Spanish.
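
For readers unfamiliar with DPO, the per-pair objective can be written in a few lines. This is a sketch of the standard DPO loss for a single preference pair, not code from the paper; the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions chosen for illustration.

```python
import math


def dpo_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,  # assumed value; the abstract does not report it
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under the
    policy being trained or under the frozen reference model.
    """
    # How much more the policy favors the chosen response than the reference does.
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# If the policy equals the reference, the loss is log(2); it falls as the policy
# shifts probability toward the preferred counter-speech response.
at_reference = dpo_loss(-5.0, -5.0, -5.0, -5.0)
after_shift = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

In practice the log-probabilities come from summing token log-probabilities of each response, and the loss is averaged over a batch of preference pairs.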

[NLP-54] Fietje: An open efficient LLM for Dutch

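[Quick read]: This paper addresses the availability and performance of models for Dutch language processing by introducing Fietje, a family of small language models (SLMs) designed specifically for Dutch and based on Phi 2, an English-centric model with 2.7 billion parameters. The key to the approach is transparency and reproducibility: Fietje is fully open-source, with model weights, datasets, and training and evaluation code all publicly available. Evaluated on benchmarks covering reasoning, sentiment analysis, world knowledge, linguistic acceptability, and word sense disambiguation, Fietje achieved results competitive with larger language models, which points to the potential of small language models for Dutch.

Link: https://arxiv.org/abs/2412.15450
Authors: Bram Vanroy
Affiliations: KU Leuven; Dutch Language Institute
Keywords: specifically designed, Dutch language, paper introduces Fietje, Dutch, Fietje
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract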

Abstract:This paper introduces Fietje, a family of small language models (SLMs) specifically designed for the Dutch language. The model is based on Phi 2, an English-centric model of 2.7 billion parameters. Fietje demonstrated competitive results with larger language models upon its release. A core emphasis of this work is transparency and reproducibility: Fietje is fully open-source, with model weights, datasets, training, and evaluation code all publicly accessible. The paper discusses the performance of Fietje and many other models on an extensive evaluation suite of benchmarks on reasoning, sentiment analysis, world knowledge, linguistic acceptability and word sense disambiguation. Evaluation results illustrate the rapid progress in the field of LLMs, where recent small models outperform older, larger models that were fine-tuned for Dutch. This trend signals an exciting future for Dutch language processing, suggesting that even compact LLMs are becoming increasingly capable. Furthermore, ongoing and future efforts to adapt LLMs to Dutch are poised to enhance these models even further, broadening their applicability and accessibility. Fietje is only an intermediate step in improving accessibility to language technology for users of the Dutch language.

[NLP-55] SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval COLING2025

[Quick read]: This paper addresses the difficulty Retrieval-Augmented Generation (RAG) systems have in efficiently retrieving information from large datasets while maintaining an understanding of context. The key to the approach is SKETCH, a new method that combines semantic text retrieval with knowledge graphs, merging structured and unstructured data to improve retrieval performance and preserve context integrity. Evaluated on several datasets, SKETCH outperforms traditional methods on key RAGAS metrics (answer_relevancy, faithfulness, context_precision, and context_recall); on the Italian Cuisine dataset it reaches an answer relevancy of 0.94 and a context precision of 0.99.

Link: https://arxiv.org/abs/2412.15443
Authors: Aakash Mahalingam,Vinesh Kumar Gande,Aman Chadha,Vinija Jain,Divya Chaudhary
Affiliations: Unknown
Keywords: Large Language Models, Language Models, leveraging vast corpora, Retrieval-Augmented Generation, Large Language
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 8 figures, Workshop on Generative AI and Knowledge Graphs (GenAIK) at The 31st International Conference on Computational Linguistics (COLING 2025)

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) systems have become pivotal in leveraging vast corpora to generate informed and contextually relevant responses, notably reducing hallucinations in Large Language Models. Despite significant advancements, these systems struggle to efficiently process and retrieve information from large datasets while maintaining a comprehensive understanding of the context. This paper introduces SKETCH, a novel methodology that enhances the RAG retrieval process by integrating semantic text retrieval with knowledge graphs, thereby merging structured and unstructured data for a more holistic comprehension. SKETCH demonstrates substantial improvements in retrieval performance and maintains superior context integrity compared to traditional methods. Evaluated across four diverse datasets (QuALITY, QASPER, NarrativeQA, and Italian Cuisine), SKETCH consistently outperforms baseline approaches on key RAGAS metrics such as answer_relevancy, faithfulness, context_precision and context_recall. Notably, on the Italian Cuisine dataset, SKETCH achieved an answer relevancy of 0.94 and a context precision of 0.99, representing the highest performance across all evaluated metrics. These results highlight SKETCH’s capability in delivering more accurate and contextually relevant responses, setting new benchmarks for future retrieval systems.

[NLP-56] Time Will Tell: Timing Side Channels via Output Token Count in Large Language Models

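[Quick read]: This paper addresses the leakage of sensitive information about inference inputs in large language models (LLMs) through the number of output tokens. The key is identifying and exploiting this side channel: analyzing the output token count of an LLM response to infer information about the input. The paper shows that, in machine translation and text classification tasks, an attacker can use this side channel to recover the target language and the input class with more than 75% and more than 70% precision, respectively. Moreover, because LLMs generate text auto-regressively, an attacker can reliably recover the output token count through a timing channel, even over the network. Finally, the paper proposes tokenizer-, system-, and prompt-based mitigations against this side channel.

Link: https://arxiv.org/abs/2412.15431
Authors: Tianchen Zhang,Gururaj Saileshwar,David Lie
Affiliations: University of Toronto
Keywords: extract sensitive information, output token count, large language models, paper demonstrates, extract sensitive
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Click to view abstract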

Abstract:This paper demonstrates a new side-channel that enables an adversary to extract sensitive information about inference inputs in large language models (LLMs) based on the number of output tokens in the LLM response. We construct attacks using this side-channel in two common LLM tasks: recovering the target language in machine translation tasks and recovering the output class in classification tasks. In addition, due to the auto-regressive generation mechanism in LLMs, an adversary can recover the output token count reliably using a timing channel, even over the network against a popular closed-source commercial LLM. Our experiments show that an adversary can learn the output language in translation tasks with more than 75% precision across three different models (Tower, M2M100, MBart50). Using this side-channel, we also show the input class in text classification tasks can be leaked out with more than 70% precision from open-source LLMs like Llama-3.1, Llama-3.2, Gemma2, and production models like GPT-4o. Finally, we propose tokenizer-, system-, and prompt-based mitigations against the output token count side-channel.
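
The link between response time and token count follows from auto-regressive decoding: each output token costs roughly one decoding step. The sketch below illustrates that reasoning under strong simplifying assumptions (a fixed time to first token and a constant per-token latency, both hypothetical) and is not the attack pipeline from the paper. The function names, latency values, and class profiles are invented for illustration.

```python
from statistics import mean
from typing import Dict, List


def estimate_token_count(
    response_seconds: float, first_token_seconds: float, seconds_per_token: float
) -> int:
    """Estimate output length from total response time.

    Assumes (hypothetically) a known time to first token and a constant
    decoding latency for every subsequent token.
    """
    if seconds_per_token <= 0:
        raise ValueError("seconds_per_token must be positive")
    decode_time = max(0.0, response_seconds - first_token_seconds)
    return 1 + round(decode_time / seconds_per_token)


def nearest_class(token_count: int, profiles: Dict[str, List[int]]) -> str:
    """Guess the class whose typical output length is closest to the observation."""
    return min(profiles, key=lambda label: abs(mean(profiles[label]) - token_count))


# Hypothetical per-class output lengths an attacker might have profiled offline.
profiles = {"class_a": [3, 4, 3], "class_b": [8, 9, 10]}
observed = estimate_token_count(response_seconds=0.6, first_token_seconds=0.2, seconds_per_token=0.05)
guess = nearest_class(observed, profiles)
```

Real measurements are noisy (network jitter, batching, speculative decoding), so an attacker would need repeated observations and a more careful classifier; the mitigations the paper proposes aim to break the dependence of token count on the secret input.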

[NLP-57] Learning Visual Composition through Improved Semantic Guidance

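[Quick read]: This paper addresses the weak compositional understanding of multiple fluid concepts in visual representation learning, where existing models such as CLIP perform poorly on challenging compositional tasks. The key to the approach is substantially improving weakly labeled data (such as captions) to raise the performance of standard contrastive learning. Specifically, the paper shows that a standard CLIP model trained on the enhanced data improves markedly on image retrieval tasks and surpasses architectures designed specifically for compositional learning.

Link: https://arxiv.org/abs/2412.15396
Authors: Austin Stone,Hagen Soltau,Robert Geirhos,Xi Yi,Ye Xia,Bingyi Cao,Kaifeng Chen,Abhijit Ogale,Jonathon Shlens
Affiliations: Unknown
Keywords: fluid concepts, visual representation learning, Visual imagery, consist of solitary, reflects the composition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Click to view abstract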

Abstract:Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning – where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation through the development of bespoke learned architectures to directly address the shortcomings in compositional learning. In this work, we focus on simple, and scalable approaches. In particular, we demonstrate that by substantially improving weakly labeled data, i.e. captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near chance rate on challenging tasks probing compositional learning. However, our simple approach boosts performance of CLIP substantially and surpasses all bespoke architectures. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on image retrieval tasks.

[NLP-58] Systematic Evaluation of Long-Context LLMs on Financial Concepts EMNLP2024

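[Quick read]: This paper examines how reliably long-context large language models (LC LLMs) process and understand long input documents. The study evaluates the GPT-4 suite of LC LLMs as a function of context length, task difficulty, and position of key information. Using a real-world financial news dataset, it finds that LC LLMs are brittle at longer context lengths, with performance dropping sharply as task complexity increases, to the point of instruction-following failures and degenerate outputs. The study also reveals sensitivity to the placement of the task instruction in the context window and to minor formatting changes. The paper recommends more rigorous evaluation, such as using F1 (rather than recall) and reporting confidence intervals, to ensure robust and conclusive findings.

Link: https://arxiv.org/abs/2412.15386
Authors: Lavanya Gupta,Saket Sharma,Yiyun Zhao
Affiliations: Unknown
Keywords: Long-context large language, long input documents, Long-context large, large language models, real-world tasks requiring
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at EMNLP 2024

Click to view abstract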

Abstract:Long-context large language models (LC LLMs) promise to increase reliability of LLMs in real-world tasks requiring processing and understanding of long input documents. However, this ability of LC LLMs to reliably utilize their growing context windows remains under investigation. In this work, we evaluate the performance of state-of-the-art GPT-4 suite of LC LLMs in solving a series of progressively challenging tasks, as a function of factors such as context length, task difficulty, and position of key information by creating a real world financial news dataset. Our findings indicate that LC LLMs exhibit brittleness at longer context lengths even for simple tasks, with performance deteriorating sharply as task complexity increases. At longer context lengths, these state-of-the-art models experience catastrophic failures in instruction following resulting in degenerate outputs. Our prompt ablations also reveal unfortunate continued sensitivity to both the placement of the task instruction in the context window as well as minor markdown formatting. Finally, we advocate for more rigorous evaluation of LC LLMs by employing holistic metrics such as F1 (rather than recall) and reporting confidence intervals, thereby ensuring robust and conclusive findings.
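
The recommendation to report F1 rather than recall, together with confidence intervals, can be illustrated with a small sketch. The code below is not from the paper: the set-based scoring, the percentile bootstrap, and all example values are assumptions made for illustration.

```python
import random
from typing import List, Set, Tuple


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one document."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def bootstrap_ci(
    scores: List[float], n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-document scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# A degenerate output that lists every candidate gets perfect recall but low F1.
precision, recall, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a"})
# Hypothetical per-document F1 scores from one evaluation run.
ci_low, ci_high = bootstrap_ci([0.4, 0.5, 1.0, 0.8])
```

Here recall alone would rate the over-generating output as perfect, while F1 penalizes the three spurious predictions; the interval shows how much a mean score over only a few documents can move under resampling.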

[NLP-59] Automatic Extraction of Metaphoric Analogies from Literary Texts: Task Formulation, Dataset Construction, and Evaluation COLING2025

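[Quick read]: This paper addresses the extraction of metaphors and analogies from free text, which requires high-level reasoning abilities such as abstraction and language understanding. The key to the approach is a new dataset built with the help of domain experts, used to evaluate how well recent large language models (LLMs) structure metaphoric mappings from text fragments containing proportional analogies. The models are also evaluated on generating the implicit elements of an analogy, which the text only suggests indirectly and human readers infer. The LLMs obtain competitive results on this task, which opens up the possibility of extracting analogies and metaphors from text automatically and reduces reliance on domain experts to label data manually.

Link: https://arxiv.org/abs/2412.15375
Authors: Joanne Boisson,Zara Siddique,Hsuvas Borkakoty,Dimosthenis Antypas,Luis Espinosa Anke,Jose Camacho-Collados
Affiliations: Cardiff NLP, School of Computer Science and Informatics, Cardiff University, U.K.; Amplyfi, Cardiff, U.K.
Keywords: requires high-level reasoning, high-level reasoning abilities, free text requires, text requires high-level, requires high-level
Categories: Computation and Language (cs.CL)
Comments: Accepted to COLING 2025, long paper

Click to view abstract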

Abstract:Extracting metaphors and analogies from free text requires high-level reasoning abilities such as abstraction and language understanding. Our study focuses on the extraction of the concepts that form metaphoric analogies in literary texts. To this end, we construct a novel dataset in this domain with the help of domain experts. We compare the out-of-the-box ability of recent large language models (LLMs) to structure metaphoric mappings from fragments of texts containing proportional analogies. The models are further evaluated on the generation of implicit elements of the analogy, which are indirectly suggested in the texts and inferred by human readers. The competitive results obtained by LLMs in our experiments are encouraging and open up new avenues such as automatically extracting analogies and metaphors from text instead of investing resources in domain experts to manually label data.

[NLP-60] Decade of Natural Language Processing in Chronic Pain: A Systematic Review

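[Quick read]: This paper addresses the fragmentation of knowledge about natural language processing (NLP) in chronic pain research: through a systematic review of the literature, it consolidates existing knowledge, identifies research gaps, and suggests directions for future work. The review finds that advanced NLP techniques, such as transformer-based models (e.g., RoBERTa and BERT), achieve high performance in classification tasks (e.g., F1 scores reaching 0.8), while unsupervised methods (such as LDA and k-means clustering) are effective for exploratory analysis. The paper also stresses that future research should focus on multimodal data validation systems, context-aware mechanistic modeling, and standardized evaluation metrics to improve reproducibility and equity.

Link: https://arxiv.org/abs/2412.15360
Authors: Swati Rajwal
Affiliations: Emory University
Keywords: Natural Language Processing, Language Processing, Natural Language, opened innovative pathways, chronic pain research
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract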

Abstract:In recent years, the intersection of Natural Language Processing (NLP) and public health has opened innovative pathways for investigating various domains, including chronic pain in textual datasets. Despite the promise of NLP in chronic pain, the literature is dispersed across various disciplines, and there is a need to consolidate existing knowledge, identify knowledge gaps in the literature, and inform future research directions in this emerging field. This review aims to investigate the state of the research on NLP-based interventions designed for chronic pain research. A search strategy was formulated and executed across PubMed, Web of Science, IEEE Xplore, Scopus, and ACL Anthology to find studies published in English between 2014 and 2024. After screening 132 papers, 26 studies were included in the final review. Key findings from this review underscore the significant potential of NLP techniques to address pressing challenges in chronic pain research. The past 10 years in this field have showcased the utilization of advanced methods (transformers like RoBERTa and BERT) achieving high-performance metrics (e.g., F1 > 0.8) in classification tasks, while unsupervised approaches like Latent Dirichlet Allocation (LDA) and k-means clustering have proven effective for exploratory analyses. Results also reveal persistent challenges such as limited dataset diversity, inadequate sample sizes, and insufficient representation of underrepresented populations. Future research studies should explore multimodal data validation systems, context-aware mechanistic modeling, and the development of standardized evaluation metrics to enhance reproducibility and equity in chronic pain research.

[NLP-61] Eliciting Causal Abilities in Large Language Models for Reasoning Tasks

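[Quick read]: This paper addresses the high training cost and limited interpretability of current prompt optimization methods. The key to the approach is Self-Causal Instruction Enhancement (SCIE), which elicits the causal inference ability of large language models (LLMs) from prompting instructions to correct answers. SCIE has the LLM generate high-quality, low-quantity observational data, estimates the causal effect from these data, and finally generates instructions with an optimized causal effect. The method treats the instructions as the treatment and uses textual features to process natural language, establishing causal relationships between instructions and downstream tasks. The paper also applies Object-Relational (OR) principles, treating the uncovered causal relationships as an inheritable class across task objects to ensure low-cost reusability. Experiments show that the method improves reasoning performance while reducing prompt training cost, and that its interpretable textual features provide actionable insights.

Link: https://arxiv.org/abs/2412.15314
Authors: Yajing Wang,Zongwei Luo,Jingzhe Wang,Zhanke Zhou,Yongqiang Chen,Bo Han
Affiliations: School of Computer Science and Technology, Soochow University; Institute of Functional Nano & Soft Materials (FUNSOM), Soochow University; Jiangsu Key Laboratory of Big Data Analysis Technology; School of Electronic and Information Engineering, Soochow University
Keywords: optimization automatically refines, Prompt optimization automatically, refines prompting expressions, automatically refines prompting, current prompt optimization
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract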

Abstract:Prompt optimization automatically refines prompting expressions, unlocking the full potential of LLMs in downstream tasks. However, current prompt optimization methods are costly to train and lack sufficient interpretability. This paper proposes enhancing LLMs’ reasoning performance by eliciting their causal inference ability from prompting instructions to correct answers. Specifically, we introduce the Self-Causal Instruction Enhancement (SCIE) method, which enables LLMs to generate high-quality, low-quantity observational data, then estimates the causal effect based on these data, and ultimately generates instructions with the optimized causal effect. In SCIE, the instructions are treated as the treatment, and textual features are used to process natural language, establishing causal relationships through treatments between instructions and downstream tasks. Additionally, we propose applying Object-Relational (OR) principles, where the uncovered causal relationships are treated as the inheritable class across task objects, ensuring low-cost reusability. Extensive experiments demonstrate that our method effectively generates instructions that enhance reasoning performance with reduced training cost of prompts, leveraging interpretable textual features to provide actionable insights.

[NLP-62] Conceptual In-Context Learning and Chain of Concepts: Solving Complex Conceptual Problems Using Large Language Models

[Quick read]: This paper addresses complex conceptual problems, specifically the generation of proprietary data models in the engineering/industry domain based on data modelling guidelines. The key to the approach is two new shallow customization methods (SCMs), Conceptual In-Context Learning (C-ICL) and Chain of Concepts (CoC), which augment large language models (LLMs) with specific conceptual information (CI) and so improve their performance on complex conceptual problems. Compared with Chain of Thoughts (CoT), response correctness increased by 30.6% for C-ICL and 29.88% for CoC. The new methods also outperform In-context Learning (ICL), reduce hallucinations and parroting (copying examples from prompts), and make the problem-solving process more transparent.

Link: https://arxiv.org/abs/2412.15309
Authors: Nishtha N. Vaidya,Thomas Runkler,Thomas Hubauer,Veronika Haderlein-Hoegberg,Maja Mlicic Brandt
Affiliations: Siemens AG; Technical University of Munich
Keywords: complex conceptual problems, require specific conceptual, complex conceptual, conceptual problems, conceptual
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to 2025 IEEE Symposium on Computational Intelligence in Natural Language Processing and Social Media

Click to view abstract

Abstract:Science and engineering problems fall in the category of complex conceptual problems that require specific conceptual information (CI) like math/logic-related know-how, process information, or engineering guidelines to solve them. Large Language Models (LLMs) are promising agents to solve such complex conceptual problems due to their implications in advancing engineering and science tasks like assisted problem-solving. But vanilla LLMs, trained on open-world data, lack the necessary CI. In this work, we specifically explore shallow customization methods (SCMs) of LLMs for solving complex conceptual problems. We propose two novel SCM algorithms for LLMs, to augment LLMs with CI and enable LLMs to solve complex conceptual problems: Conceptual In-Context Learning (C-ICL) and Chain of Concepts (CoC). The problem tackled in this paper is the generation of proprietary data models in the engineering/industry domain based on conceptual information in data modelling guidelines. We evaluate our algorithms on varied sizes of the OpenAI LLMs against four evaluation metrics related to syntactic and semantic correctness, time and cost incurred. The proposed algorithms perform better than currently popular LLM SCMs like In-context Learning (ICL) and Chain of Thoughts (CoT). It was observed that, as compared to CoT, response correctness increased by 30.6% and 29.88% for the new SCMs C-ICL and CoC respectively. Qualitative analysis suggests that the proposed new SCMs activate emergent capabilities in LLMs, previously unobserved in the existing SCMs. They make problem-solving processes more transparent and reduce hallucinations and the tendency of model responses to copy examples from prompts (parroting).

[NLP-63] ViFactCheck: A New Benchmark Dataset and Methods for Multi-domain News Fact-Checking in Vietnamese AAAI’2025

Quick read: This paper addresses the challenges of fact-checking in low-resource languages such as Vietnamese. Its key contribution is ViFactCheck, the first publicly available benchmark dataset for Vietnamese fact-checking. The dataset contains 7,232 human-annotated claim-evidence pairs covering 12 topics, and a rigorous annotation process yields a Fleiss' kappa of 0.83. The evaluation fine-tunes and prompts state-of-the-art pre-trained and large language models; the Gemma model reaches a macro F1 score of 89.90%, setting a new standard for this benchmark.

Link: https://arxiv.org/abs/2412.15308
Authors: Tran Thai Hoa,Tran Quang Duy,Khanh Quoc Tran,Kiet Van Nguyen
Affiliations: Vietnam National University, Hanoi; Hanoi University of Science and Technology
Keywords: effective fact-checking tools, limited resources, rapid spread, Vietnamese, reputable Vietnamese online
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at AAAI’2025 Main Conference

Abstract:The rapid spread of information in the digital age highlights the critical need for effective fact-checking tools, particularly for languages with limited resources, such as Vietnamese. In response to this challenge, we introduce ViFactCheck, the first publicly available benchmark dataset designed specifically for Vietnamese fact-checking across multiple online news domains. This dataset contains 7,232 human-annotated pairs of claim-evidence combinations sourced from reputable Vietnamese online news, covering 12 diverse topics. It has been subjected to a meticulous annotation process to ensure high quality and reliability, achieving a Fleiss Kappa inter-annotator agreement score of 0.83. Our evaluation leverages state-of-the-art pre-trained and large language models, employing fine-tuning and prompting techniques to assess performance. Notably, the Gemma model demonstrated superior effectiveness, with an impressive macro F1 score of 89.90%, thereby establishing a new standard for fact-checking benchmarks. This result highlights the robust capabilities of Gemma in accurately identifying and verifying facts in Vietnamese. To further promote advances in fact-checking technology and improve the reliability of digital media, we have made the ViFactCheck dataset, model checkpoints, fact-checking pipelines, and source code freely available on GitHub. This initiative aims to inspire further research and enhance the accuracy of information in low-resource languages.
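
The abstract reports a Fleiss' kappa of 0.83 for inter-annotator agreement. As a rough illustration of how that statistic is computed, here is a minimal sketch in plain Python. The function name and the toy table are assumptions made for this example; they are not taken from the ViFactCheck code or data.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement, derived from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    if expected == 1.0:
        return 1.0  # every rating is in one category, so agreement is trivially perfect
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy table: 4 claims, 3 annotators, 3 labels.
toy_table = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 0, 3]]
print(round(fleiss_kappa(toy_table), 3))
```

Values close to 1 indicate agreement well above chance, which is how the 0.83 in the abstract should be read.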

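[NLP-64] Self-Evolution Knowledge Distillation for LLM-based Machine Translation COLING2025

Quick read: This paper addresses a weakness of existing knowledge distillation (KD) strategies for large language models: they minimize the divergence between student and teacher output distributions indiscriminately for every token, ignoring that tokens differ in learning difficulty. The proposed strategy, Self-Evolution KD, dynamically integrates the teacher distribution and the one-hot distribution of the ground truth into the student distribution as prior knowledge, and adjusts the proportion of prior knowledge according to each token's learning difficulty, making fuller use of the teacher model and improving knowledge transfer.

Link: https://arxiv.org/abs/2412.15303
Authors: Yuncheng Song,Liang Ding,Changtong Zan,Shujian Huang
Affiliations: Nanjing University; The University of Sydney; China University of Petroleum (East China)
Keywords: shown great promise, smaller student models, larger teacher models, shown great, great promise
Categories: Computation and Language (cs.CL)
Comments: COLING 2025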

Abstract:Knowledge distillation (KD) has shown great promise in transferring knowledge from larger teacher models to smaller student models. However, existing KD strategies for large language models often minimize output distributions between student and teacher models indiscriminately for each token. This overlooks the imbalanced nature of tokens and their varying transfer difficulties. In response, we propose a distillation strategy called Self-Evolution KD. The core of this approach involves dynamically integrating teacher distribution and one-hot distribution of ground truth into the student distribution as prior knowledge, which promotes the distillation process. It adjusts the ratio of prior knowledge based on token learning difficulty, fully leveraging the teacher model’s potential. Experimental results show our method brings an average improvement of approximately 1.4 SacreBLEU points across four translation directions in the WMT22 test sets. Further analysis indicates that the improvement comes from better knowledge transfer from teachers, confirming our hypothesis.
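
To make the idea of a difficulty-dependent target more concrete, here is one possible reading sketched in plain Python. Everything below is an assumption for illustration: the function name, the `teacher_weight` parameter, and the choice of "one minus the student's probability of the gold token" as the difficulty measure are not taken from the paper, which may define these quantities differently.

```python
from typing import List


def self_evolving_target(
    student: List[float],
    teacher: List[float],
    gold_index: int,
    teacher_weight: float = 0.5,  # hypothetical mixing weight
) -> List[float]:
    """Blend prior knowledge into the student distribution for one token."""
    one_hot = [1.0 if i == gold_index else 0.0 for i in range(len(student))]

    # Prior knowledge: a mix of the teacher distribution and the gold label.
    prior = [
        teacher_weight * t + (1.0 - teacher_weight) * o
        for t, o in zip(teacher, one_hot)
    ]

    # Assumed difficulty signal: how little mass the student puts on the gold token.
    difficulty = 1.0 - student[gold_index]

    # Harder tokens receive more prior knowledge; easy tokens stay near the student.
    return [
        (1.0 - difficulty) * s + difficulty * p
        for s, p in zip(student, prior)
    ]


target = self_evolving_target([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], gold_index=0)
print(target, sum(target))
```

Because the target is a convex combination of probability distributions, it remains a valid distribution that a student could be trained toward.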

[NLP-65] LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration

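Quick read: This paper addresses the uneven performance of multilingual automatic speech recognition (ASR) models across languages. It proposes LAMA-UT, a language-agnostic multilingual ASR pipeline built on orthography unification and language-specific transliteration. The pipeline has two steps: first, a universal transcription generator unifies orthographic features into Romanized form, capturing phonetic characteristics shared across languages; second, a universal converter transforms these universal transcriptions into language-specific ones. The method needs no language-specific modules, matches state-of-the-art models while training on a small amount of data, and performs on par with zero-shot ASR approaches that rely on additional language-specific lexicons and language models.

Link: https://arxiv.org/abs/2412.15299
Authors: Sangmin Lee,Woo-Jin Chung,Hong-Goo Kang
Affiliations: Korea Institute of Science and Technology
Keywords: automatic speech recognition, multilingual automatic speech, speech recognition, inherent difficulties, Multilingual ASR
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: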

Abstract:Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper’s training data. Furthermore, our pipeline does not rely on any language-specific modules. However, it performs on par with zero-shot ASR approaches which utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.
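
The "relative error reduction rate of 45%" in the abstract is a ratio rather than an absolute difference in error rate. A small sketch of the arithmetic follows; the error rates used here are made-up numbers chosen only so that the ratio comes out at 45%, not figures from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error that the new system removes."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates (in percent): baseline 20.0, new system 11.0.
reduction = relative_error_reduction(20.0, 11.0)
print(f"{reduction:.0%}")
```

In this toy case the absolute gap is 9 points, while the relative reduction is 9 / 20.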

[NLP-66] A Comparative Study of DSPy Teleprompter Algorithms for Aligning Large Language Models Evaluation Metrics to Human Evaluation

Quick read: This paper asks how to align large language model (LLM) prompts and their evaluations with human annotations. It uses Declarative Self-improving Python (DSPy) optimizers and compares five prompt-optimization (teleprompter) algorithms within the DSPy framework (COPRO, MIPRO, BootstrapFewShot, BootstrapFewShot with Optuna, and K-Nearest Neighbor Few Shot) on a hallucination-detection task with human-annotated labels. The experiments show that optimized prompts can outperform various benchmark methods at detecting hallucination, and that some optimizers outperform others in these experiments.

Link: https://arxiv.org/abs/2412.15298
Authors: Bhaskarjit Sarmah,Kriti Dutta,Anna Grigoryan,Sachin Tiwari,Stefano Pasquali,Dhagash Mehta
Affiliations: BlackRock
Keywords: Declarative Self-improving Python, Self-improving Python, large language model, Declarative Self-improving, Cooperative Prompt Optimization
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Methodology (stat.ME)
Comments: 7 pages, 10 tables, two-column format

Abstract:We argue that the Declarative Self-improving Python (DSPy) optimizers are a way to align the large language model (LLM) prompts and their evaluations to the human annotations. We present a comparative analysis of five teleprompter algorithms, namely, Cooperative Prompt Optimization (COPRO), Multi-Stage Instruction Prompt Optimization (MIPRO), BootstrapFewShot, BootstrapFewShot with Optuna, and K-Nearest Neighbor Few Shot, within the DSPy framework with respect to their ability to align with human evaluations. As a concrete example, we focus on optimizing the prompt to align hallucination detection (using LLM as a judge) to human annotated ground truth labels for a publicly available benchmark dataset. Our experiments demonstrate that optimized prompts can outperform various benchmark methods to detect hallucination, and certain teleprompters outperform the others in at least these experiments.
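
Of the five algorithms listed, K-Nearest Neighbor Few Shot is the easiest to picture: for each input, the demonstrations placed in the prompt are the training examples closest to that input. The sketch below shows that general idea in plain Python. It deliberately does not use the DSPy API; the function names and the toy two-dimensional "embeddings" are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_demonstrations(
    query_vector: Sequence[float],
    training_set: List[Tuple[Sequence[float], str]],
    k: int,
) -> List[str]:
    """Pick the k training examples whose vectors are most similar to the query."""
    ranked = sorted(
        training_set,
        key=lambda item: cosine(query_vector, item[0]),
        reverse=True,
    )
    return [example for _, example in ranked[:k]]


# Hypothetical embedded examples for a hallucination-judging prompt.
train = [
    ([1.0, 0.0], "example A"),
    ([0.0, 1.0], "example B"),
    ([0.9, 0.1], "example C"),
]
print(knn_demonstrations([1.0, 0.0], train, k=2))
```

In a real setup the vectors would come from an embedding model, and the selected examples would be formatted into the judge prompt before each call.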

[NLP-67] Confidence in the Reasoning of Large Language Models

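Quick read: This paper studies uncertainty in the answers of large language models (LLMs), in particular how a model's confidence in its answer relates to accuracy. Confidence is assessed in two ways: qualitatively, by whether the model keeps its initial answer when prompted to reconsider; and quantitatively, by a self-reported confidence score. Although the LLMs perform significantly better than random guessing, their tendency to change initial answers varies widely. Qualitative confidence correlates positively with accuracy, but the second answer is often less accurate than the first. The models tend to overstate their self-reported confidence, and confidence is only partially explained by the underlying token-level probabilities. These findings suggest that current LLMs lack an internally coherent sense of confidence.

Link: https://arxiv.org/abs/2412.15296
Authors: Yudi Pawitan,Chris Holmes
Affiliations: Karolinska Institutet; Oxford University
Keywords: large language models, language models, growing literature, literature on reasoning, reasoning by large
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: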

Abstract:There is a growing literature on reasoning by large language models (LLMs), but the discussion on the uncertainty in their responses is still lacking. Our aim is to assess the extent of confidence that LLMs have in their answers and how it correlates with accuracy. Confidence is measured (i) qualitatively in terms of persistence in keeping their answer when prompted to reconsider, and (ii) quantitatively in terms of self-reported confidence score. We investigate the performance of three LLMs – GPT4o, GPT4-turbo and Mistral – on two benchmark sets of questions on causal judgement and formal fallacies and a set of probability and statistical puzzles and paradoxes. Although the LLMs show significantly better performance than random guessing, there is a wide variability in their tendency to change their initial answers. There is a positive correlation between qualitative confidence and accuracy, but the overall accuracy for the second answer is often worse than for the first answer. There is a strong tendency to overstate the self-reported confidence score. Confidence is only partially explained by the underlying token-level probability. The material effects of prompting on qualitative confidence and the strong tendency for overconfidence indicate that current LLMs do not have any internally coherent sense of confidence.
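
The two notions of confidence in the abstract can be turned into simple summary numbers. The sketch below is only an assumption about how one might do that: a persistence rate (how often the answer is kept after a "reconsider" prompt) and an overconfidence gap (mean self-reported confidence minus accuracy). The function names and toy data are hypothetical and not the paper's exact metrics.

```python
from typing import Sequence


def persistence_rate(first_answers: Sequence[str], second_answers: Sequence[str]) -> float:
    """Share of questions where the model kept its initial answer."""
    if len(first_answers) != len(second_answers) or not first_answers:
        raise ValueError("need two non-empty answer lists of equal length")
    kept = sum(a == b for a, b in zip(first_answers, second_answers))
    return kept / len(first_answers)


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean self-reported confidence (0..1) minus accuracy; positive means overconfident."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Hypothetical results on four questions.
print(persistence_rate(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
print(overconfidence_gap([0.9, 0.8, 1.0, 0.9], [True, False, True, False]))
```

A gap near zero would indicate self-reports that track accuracy on average; the abstract describes a strong tendency in the positive direction.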

[NLP-68] A Large-scale Empirical Study on Large Language Models for Election Prediction

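Quick read: This paper examines how accurately large language models (LLMs) can predict election outcomes. It introduces a multi-step reasoning framework that systematically integrates demographic, ideological, and time-sensitive factors, validated on 2016 and 2020 real-world data and extensive synthetic personas. The approach adapts to changing political landscapes, reduces bias, and significantly improves predictive accuracy. The paper also examines potential political biases in LLM-based election forecasting and suggests mitigation strategies, pointing toward more balanced, transparent, and context-aware modeling in political science research.

Link: https://arxiv.org/abs/2412.15291
Authors: Chenxiao Yu,Zhaotian Weng,Yuangang Li,Zheng Li,Xiyang Hu,Yue Zhao
Affiliations: University of Southern California; Arima; Carnegie Mellon University
Keywords: Large Language Models, Language Models, Large Language, accurately predict election, predict election outcomes
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: arXiv admin note: substantial text overlap with arXiv:2411.03321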

Abstract:Can Large Language Models (LLMs) accurately predict election outcomes? While LLMs have demonstrated impressive performance in healthcare, legal analysis, and creative applications, their capabilities in election forecasting remain uncertain. Notably, election prediction poses unique challenges: limited voter-level data, evolving political contexts, and the complexity of modeling human behavior. In the first part of this paper, we explore and introduce a multi-step reasoning framework for election prediction, which systematically integrates demographic, ideological, and time-sensitive factors. Validated on 2016 and 2020 real-world data and extensive synthetic personas, our approach adapts to changing political landscapes, reducing bias and significantly improving predictive accuracy. We further apply our pipeline to the 2024 U.S. presidential election, illustrating its ability to generalize beyond observed historical data. Beyond enhancing accuracy, the second part of the paper provides insights into the broader implications of LLM-based election forecasting. We identify potential political biases embedded in pretrained corpora, examine how demographic patterns can become exaggerated, and suggest strategies for mitigating these issues. Together, this project, a large-scale LLM empirical study, advances the accuracy of election predictions and establishes directions for more balanced, transparent, and context-aware modeling in political science research and practice.

[NLP-69] SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage

Quick read: This paper studies the safety alignment of large language models (LLMs), specifically how their safeguards can be circumvented to elicit harmful responses. It proposes a jailbreak paradigm called Simple Assistive Task Linkage (SATA). SATA masks harmful keywords in a malicious query to produce a relatively benign query containing [MASK] special tokens, then uses a simple assistive task (a masked language model task or an element-lookup-by-position task) to encode the semantics of the masked keywords, and finally links the assistive task with the masked query to perform the jailbreak. Experiments show that SATA achieves state-of-the-art attack success rate (ASR) and harmful score (HS).

Link: https://arxiv.org/abs/2412.15289
Authors: Xiaoning Dong,Wenbo Hu,Wei Xu,Tianxing He
Affiliations: Tsinghua University; Hefei University of Technology; Shanghai Qi Zhi Institute
Keywords: made significant advancements, safety alignment remain, major concern, Assistive Task, made significant
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remains a major concern. Exploring jailbreak prompts can expose LLMs’ vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which could hinder the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SATA), which can effectively circumvent LLM safeguards and elicit harmful responses. Specifically, SATA first masks harmful keywords within a malicious query to generate a relatively benign query containing one or multiple [MASK] special tokens. It then employs a simple assistive task such as a masked language model task or an element lookup by position task to encode the semantics of the masked keywords. Finally, SATA links the assistive task with the masked query to jointly perform the jailbreak. Extensive experiments show that SATA achieves state-of-the-art performance and outperforms baselines by a large margin. Specifically, on the AdvBench dataset, with the masked language model (MLM) assistive task, SATA achieves an overall attack success rate (ASR) of 85% and harmful score (HS) of 4.57, and with the element lookup by position (ELP) assistive task, SATA attains an overall ASR of 76% and HS of 4.43.

[NLP-70] Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

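Quick read: This paper addresses how large language models (LLMs) can use inference-time compute efficiently. It proposes an inference-aware fine-tuning paradigm that fine-tunes the model to directly optimize the performance of the inference-time strategy, studied with the simple yet effective Best-of-N (BoN) strategy, in which a verifier selects the best of a set of LLM-generated responses. The paper devises the first imitation-learning and reinforcement-learning (RL) methods for BoN-aware fine-tuning, overcoming the non-differentiable argmax operator within BoN. Experiments show that BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that may suit a test-time input better, reminiscent of the exploration-exploitation trade-off in RL. The method improves both task performance and inference-time compute efficiency.

Link: https://arxiv.org/abs/2412.15287
Authors: Yinlam Chow,Guy Tennenholtz,Izzeddin Gur,Vincent Zhuang,Bo Dai,Sridhar Thiagarajan,Craig Boutilier,Rishabh Agarwal,Aviral Kumar,Aleksandra Faust
Affiliations: Google DeepMind; Google Research
Keywords: effectively utilizing inference-time, large language models, Recent studies, utilizing inference-time compute, effectively utilizing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: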

Abstract:Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning (RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input – a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.
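
For readers less familiar with Best-of-N, the sketch below shows the inference-time selection step and how it differs from pass@N. It covers only the selection logic, not the paper's fine-tuning method; the verifier scores and candidate answers are made up for the example.

```python
from typing import Callable, Sequence


def best_of_n(responses: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the response the verifier scores highest (the argmax step in BoN)."""
    return max(responses, key=verifier)


def pass_at_n(responses: Sequence[str], is_correct: Callable[[str], bool]) -> bool:
    """True if at least one of the N sampled responses is correct."""
    return any(is_correct(r) for r in responses)


# Hypothetical sampled answers and an imperfect verifier.
samples = ["41", "42", "43"]
verifier_scores = {"41": 0.2, "42": 0.1, "43": 0.7}

picked = best_of_n(samples, verifier_scores.get)
print(picked)                                  # the verifier's choice
print(pass_at_n(samples, lambda r: r == "42")) # whether a correct answer was sampled at all
```

In this toy case a correct answer is among the samples, yet the verifier picks a different one. That gap between pass@N and BoN accuracy, together with the hard argmax inside `best_of_n`, hints at why BoN-aware training is not a straightforward gradient problem.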

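[NLP-71] Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining

Quick read: This paper studies strategies for selecting, blending, and ordering data when pretraining large language models, in particular how data mixtures scale to longer token horizons and larger models, which remains underexplored. It formalizes two-phase pretraining and systematically studies how to select and mix data to maximize accuracy in each phase. The two-phase approach outperforms random data ordering and the natural distribution of tokens by 3.4% and 17% in average accuracy, respectively. The paper gives guidance on crafting blends based on data-source quality and the number of epochs, and shows that blends designed at a small scale (1T tokens) scale to a 15T-token horizon and a model size of 25B, giving practitioners concrete steps for designing and scaling data blends.

Link: https://arxiv.org/abs/2412.15285
Authors: Steven Feng,Shrimai Prabhumoye,Kezhi Kong,Dan Su,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro
Affiliations: NVIDIA; Stanford University; Boston University
Keywords: effectively requires strategic, language models effectively, models effectively requires, large language models, Pretraining large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: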

Abstract:Pretraining large language models effectively requires strategic data selection, blending and ordering. However, key details about data mixtures, especially their scalability to longer token horizons and larger model sizes, remain underexplored due to limited disclosure by model developers. To address this, we formalize the concept of two-phase pretraining and conduct an extensive systematic study on how to select and mix data to maximize model accuracies for the two phases. Our findings illustrate that a two-phase approach for pretraining outperforms random data ordering and natural distribution of tokens by 3.4% and 17% in average accuracy, respectively. We provide in-depth guidance on crafting optimal blends based on the quality of the data source and the number of epochs to be seen. We propose to design blends using downsampled data at a smaller scale of 1T tokens and then demonstrate effective scaling of our approach to a larger token horizon of 15T tokens and a larger model size of 25B. These insights provide a series of steps practitioners can follow to design and scale their data blends.
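
A two-phase schedule can be pictured as switching from one sampling blend to another at a fixed step. The sketch below illustrates that mechanic only. The source names, blend weights, and phase length are invented for the example and should not be read as the paper's recommended blends.

```python
from typing import Dict


def phase_blend(
    step: int,
    phase1_steps: int,
    phase1_blend: Dict[str, float],
    phase2_blend: Dict[str, float],
) -> Dict[str, float]:
    """Return the sampling blend that applies at a given training step."""
    return phase1_blend if step < phase1_steps else phase2_blend


def tokens_per_source(blend: Dict[str, float], token_budget: float) -> Dict[str, float]:
    """Split a token budget across sources in proportion to the blend weights."""
    total = sum(blend.values())
    return {name: token_budget * weight / total for name, weight in blend.items()}


# Hypothetical blends: broad web data first, then a shift toward curated sources.
PHASE1 = {"web_crawl": 3.0, "code": 1.0}
PHASE2 = {"web_crawl": 1.0, "code": 1.0, "curated_math": 2.0}

print(phase_blend(step=0, phase1_steps=3, phase1_blend=PHASE1, phase2_blend=PHASE2))
print(phase_blend(step=3, phase1_steps=3, phase1_blend=PHASE1, phase2_blend=PHASE2))
print(tokens_per_source(PHASE1, token_budget=1000))
```

Working with per-source token budgets also makes it easy to check how many epochs of a small, high-quality source a blend would imply, which the abstract names as one of the factors in blend design.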

[NLP-72] Channel Merging: Preserving Specialization for Merged Experts AAAI2025

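Quick read: This paper addresses the high memory footprint of traditional ensembling of task-specifically fine-tuned large language models (LLMs), and the parameter conflicts and performance decline that arise when such models are merged. It proposes **Channel Merging**, which clusters and merges similar channel parameters offline into several groups, reducing parameter conflicts while improving storage efficiency. Only highly similar parameters are merged, and at inference time the expert parameters are looked up from the merged groups, preserving task-specific knowledge. Experiments show that Channel Merging performs well across tasks and, using only 53% of the parameters, is comparable to a model ensemble.

Link: https://arxiv.org/abs/2412.15283
Authors: Mingyang Zhang,Jing Liu,Ganggui Ding,Xinyi Yu,Linlin Ou,Bohan Zhuang
Affiliations: School of Computer Science and Technology, Nanjing University of Science and Technology; School of Computer Science, Fudan University; School of Computer Science, Peking University
Keywords: large language models, utilizing task-specific fine-tuning, practice of utilizing, implemented to improve, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: accepted by AAAI 2025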

Abstract:Lately, the practice of utilizing task-specific fine-tuning has been implemented to improve the performance of large language models (LLM) in subsequent tasks. Through the integration of diverse LLMs, the overall competency of LLMs is significantly boosted. Nevertheless, traditional ensemble methods are notably memory-intensive, necessitating the simultaneous loading of all specialized models into GPU memory. To address the inefficiency, model merging strategies have emerged, merging all LLMs into one model to reduce the memory footprint during inference. Despite these advances, model merging often leads to parameter conflicts and performance decline as the number of experts increases. Previous methods to mitigate these conflicts include post-pruning and partial merging. However, both approaches have limitations, particularly in terms of performance and storage efficiency when merged experts increase. To address these challenges, we introduce Channel Merging, a novel strategy designed to minimize parameter conflicts while enhancing storage efficiency. This method clusters and merges channel parameters based on their similarity to form several groups offline. By ensuring that only highly similar parameters are merged within each group, it significantly reduces parameter conflicts. During inference, we can instantly look up the expert parameters from the merged groups, preserving specialized knowledge. Our experiments demonstrate that Channel Merging consistently delivers high performance, matching unmerged models in tasks like English and Chinese reasoning, mathematical reasoning, and code generation. Moreover, it obtains results comparable to model ensemble with just 53% parameters when used with a task-specific router.
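
The "cluster, merge, then look up" idea can be sketched for a single channel position as follows. This is a simplified stand-in, not the paper's algorithm: it uses a greedy cosine-similarity threshold instead of whatever clustering the authors use, and the threshold, function names, and toy vectors are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_channel(
    expert_channels: List[Sequence[float]],
    threshold: float = 0.9,  # hypothetical similarity cut-off
) -> Tuple[List[List[float]], List[int]]:
    """Group the experts' versions of one channel and average each group.

    Returns the merged vectors and, per expert, the index of the group
    that should be looked up at inference time.
    """
    groups: List[List[Sequence[float]]] = []
    assignment: List[int] = []
    for vector in expert_channels:
        for group_id, members in enumerate(groups):
            # Join the first group whose representative is similar enough.
            if cosine(vector, members[0]) >= threshold:
                members.append(vector)
                assignment.append(group_id)
                break
        else:
            groups.append([vector])
            assignment.append(len(groups) - 1)
    merged = [
        [sum(column) / len(members) for column in zip(*members)]
        for members in groups
    ]
    return merged, assignment


# Three hypothetical experts; the first two agree on this channel, the third does not.
merged, lookup = merge_channel([[2.0, 0.0], [4.0, 0.0], [0.0, 1.0]])
print(merged, lookup)
```

Storing `merged` plus the small `lookup` table instead of one vector per expert is what saves memory, while dissimilar channels keep their own group and are left unaveraged.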

[NLP-73] A Systematic Examination of Preference Learning through the Lens of Instruction-Following

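Quick read: This paper studies how to construct preference datasets so as to improve the alignment and downstream performance of large language models (LLMs) on instruction-following tasks. Its main steps are: (1) a synthetic data generation pipeline creates 48,000 unique instruction-following prompts combining 23 verifiable constraints, enabling fine-grained, automated quality assessment; (2) two curation methods, rejection sampling (RS) and Monte Carlo Tree Search (MCTS), produce (chosen, rejected) response pairs; (3) experiments examine the effects of shared prefixes, response contrast, and training-prompt complexity. Shared prefixes give marginal but consistent improvements, high-contrast pairs generally perform better, and prompts of moderate difficulty generalize better across tasks. These findings offer actionable guidance for preference data curation and a scalable framework for LLM training and alignment.

Link: https://arxiv.org/abs/2412.15282
Authors: Joongwon Kim,Anirudh Goyal,Aston Zhang,Bo Xiong,Rui Hou,Melanie Kambadur,Dhruv Mahajan,Hannaneh Hajishirzi,Liang Tan
Affiliations: Meta
Keywords: widely adopted post-training, adopted post-training technique, aligns large language, large language models, improves specific downstream
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 23 pages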

Abstract:Preference learning is a widely adopted post-training technique that aligns large language models (LLMs) to human preferences and improves specific downstream task capabilities. In this work we systematically investigate how specific attributes of preference datasets affect the alignment and downstream performance of LLMs in instruction-following tasks. We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts with combinations of 23 verifiable constraints that enable fine-grained and automated quality assessments of model responses. With our synthetic prompts, we use two preference dataset curation methods - rejection sampling (RS) and Monte Carlo Tree Search (MCTS) - to obtain pairs of (chosen, rejected) responses. Then, we perform experiments investigating the effects of (1) the presence of shared prefixes between the chosen and rejected responses, (2) the contrast and quality of the chosen, rejected responses and (3) the complexity of the training prompts. Our experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements and greater stability across challenging training configurations. High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance by balancing diversity and learning efficiency. Additionally, training on prompts of moderate difficulty leads to better generalization across tasks, even for more complex evaluation scenarios, compared to overly challenging prompts. Our findings provide actionable insights into optimizing preference data curation for instruction-following tasks, offering a scalable and effective framework for enhancing LLM training and alignment.
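
Verifiable constraints make it possible to score responses automatically and then form (chosen, rejected) pairs. The sketch below shows a rejection-sampling style version of that step under simple assumptions: the two constraints, the scoring rule (fraction of constraints satisfied), and the `min_gap` parameter are hypothetical and not the paper's exact setup.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Constraint = Callable[[str], bool]


def constraint_score(response: str, constraints: Sequence[Constraint]) -> float:
    """Fraction of verifiable constraints that the response satisfies."""
    return sum(check(response) for check in constraints) / len(constraints)


def build_preference_pair(
    responses: List[str],
    constraints: Sequence[Constraint],
    min_gap: float = 0.0,  # hypothetical minimum contrast between chosen and rejected
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from sampled responses, or None if they do not differ enough."""
    ranked = sorted(responses, key=lambda r: constraint_score(r, constraints))
    rejected, chosen = ranked[0], ranked[-1]
    gap = constraint_score(chosen, constraints) - constraint_score(rejected, constraints)
    return (chosen, rejected) if gap > min_gap else None


# Two hypothetical verifiable constraints: at most three words, ends with a period.
checks = [lambda r: len(r.split()) <= 3, lambda r: r.endswith(".")]
samples = ["Yes.", "This answer is far too long", "ok"]
print(build_preference_pair(samples, checks))
```

Raising `min_gap` keeps only high-contrast pairs, while lowering it admits low-contrast ones, which is one simple way to explore the contrast effect described in the abstract.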

[NLP-74] Context-DPO: Aligning Language Models for Context-Faithfulness

Quick read: This paper addresses how to improve the context-faithfulness of large language models (LLMs) when they follow user instructions and retrieved information. It proposes Context-DPO, the first alignment method designed specifically to enhance context-faithfulness. It introduces the ConFiQA benchmark, which simulates knowledge conflicts in retrieval-augmented generation (RAG) scenarios, and aligns models with direct preference optimization, improving context-faithfulness by 35% to 280% on popular open-source models. Context-DPO preserves the models' generative capabilities while providing interpretable insights into context utilization.

Link: https://arxiv.org/abs/2412.15280
Authors: Baolong Bi,Shaohan Huang,Yiwei Wang,Tianchi Yang,Zihan Zhang,Haizhen Huang,Lingrui Mei,Junfeng Fang,Zehao Li,Furu Wei,Weiwei Deng,Feng Sun,Qi Zhang,Shenghua Liu
Affiliations: University of Chinese Academy of Sciences; Microsoft Corporation; University of California, Merced; National University of Singapore
Keywords: large language models, Reliable responses, require adherence, retrieved information, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Reliable responses from large language models (LLMs) require adherence to user instructions and retrieved information. While alignment techniques help LLMs align with human intentions and values, improving context-faithfulness through alignment remains underexplored. To address this, we propose Context-DPO, the first alignment method specifically designed to enhance LLMs’ context-faithfulness. We introduce ConFiQA, a benchmark that simulates Retrieval-Augmented Generation (RAG) scenarios with knowledge conflicts to evaluate context-faithfulness. By leveraging faithful and stubborn responses to questions with provided context from ConFiQA, our Context-DPO aligns LLMs through direct preference optimization. Extensive experiments demonstrate that our Context-DPO significantly improves context-faithfulness, achieving 35% to 280% improvements on popular open-source models. Further analysis demonstrates that Context-DPO preserves LLMs’ generative capabilities while providing interpretable insights into context utilization. Our code and data are released at this https URL
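
Since the method aligns models "through direct preference optimization", it may help to see the standard DPO objective for a single preference pair. The sketch below implements the usual per-pair DPO loss in plain Python, with the faithful response in the "chosen" slot and the stubborn response in the "rejected" slot. That mapping follows the abstract; the log-probability values and `beta` are placeholders, and nothing here is taken from the released code.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy favors each response than the frozen reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Placeholder log-probabilities: the policy has shifted toward the faithful answer.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
# No shift relative to the reference gives log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
```

The loss falls as the policy raises the faithful response's likelihood relative to the stubborn one, measured against the reference model.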

[NLP-75] PLPP: Prompt Learning with Perplexity Is Self-Distillation for Vision-Language Models

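Quick read: This paper addresses the overfitting that arises when prompts for pre-trained vision-language (VL) models are fine-tuned on downstream tasks using the CLIP loss alone. It proposes PLPP (Prompt Learning with PerPlexity), a plug-in prompt-regularization method that adds a perplexity loss. PLPP computes prompt perplexity in two steps: it first computes the cosine similarity between the embedding-layer weights and the prompts to obtain labels, then adds a training-free language model (LM) head to output word probability distributions. To further prevent overfitting and reduce the added computation, PLPP converts hard labels to soft labels and uses only the top-k values when computing the perplexity loss. Experiments on several classification tasks show that PLPP outperforms existing methods.

Link: https://arxiv.org/abs/2412.15277
Authors: Biao Liu,Wenyi Fang,Xiaoyu Wu,Yang Zheng,Zheng Hu,Bo Yuan
Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; School of Software, Tsinghua University, Beijing, China
Keywords: Pre-trained Vision-Language, numerous downstream tasks, Context Optimization, demonstrated their excellent, downstream tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: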

Abstract:Pre-trained Vision-Language (VL) models such as CLIP have demonstrated their excellent performance across numerous downstream tasks. A recent method, Context Optimization (CoOp), further improves the performance of VL models on downstream tasks by introducing prompt learning. CoOp optimizes a set of learnable vectors, aka prompt, and freezes the whole CLIP model. However, relying solely on CLIP loss to fine-tune prompts can lead to models that are prone to overfitting on downstream tasks. To address this issue, we propose a plug-in prompt-regularization method called PLPP (Prompt Learning with PerPlexity), which uses perplexity loss to regularize prompt learning. PLPP designs a two-step operation to compute the perplexity for prompts: (a) calculating cosine similarity between the weight of the embedding layer and prompts to get labels, (b) introducing a language model (LM) head that requires no training behind the text encoder to output word probability distribution. Meanwhile, we unveil that the essence of PLPP is inherently a form of self-distillation. To further prevent overfitting as well as to reduce the additional computation introduced by PLPP, we turn the hard label to soft label and choose top-k values for calculating the perplexity loss. For accelerating model convergence, we introduce mutual self-distillation learning, that is perplexity and inverted perplexity loss. The experiments conducted on four classification tasks indicate that PLPP exhibits superior performance compared to existing methods.
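
The soft-label, top-k perplexity idea can be illustrated for a single prompt position. The sketch below is an assumption-laden simplification: it turns the top-k similarity scores into soft labels with a softmax, takes the cross-entropy against a given word distribution, and exponentiates it to get a perplexity-like value. The function names and toy numbers are hypothetical, and the paper's actual loss may differ in detail.

```python
import math
from typing import Dict, Sequence


def topk_soft_labels(similarities: Sequence[float], k: int) -> Dict[int, float]:
    """Softmax over the k largest similarity scores; all other words get zero weight."""
    top = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)[:k]
    peak = max(similarities[i] for i in top)
    exps = {i: math.exp(similarities[i] - peak) for i in top}  # shift for stability
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


def soft_cross_entropy(soft_labels: Dict[int, float], word_probs: Sequence[float]) -> float:
    """Cross-entropy between the soft labels and the LM head's word distribution."""
    return -sum(weight * math.log(word_probs[i]) for i, weight in soft_labels.items())


# Toy vocabulary of three words: similarity scores and LM-head probabilities.
labels = topk_soft_labels([0.0, 0.0, -5.0], k=2)
loss = soft_cross_entropy(labels, [0.5, 0.25, 0.25])
print(labels, loss, math.exp(loss))  # exp(loss) is the perplexity-style quantity
```

Restricting the sum to k entries is also what keeps the extra computation small compared with a full-vocabulary loss.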

[NLP-76] Fooling LLM graders into giving better grades through neural activity guided adversarial prompting

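Quick read: This paper addresses inherent biases in AI systems used for critical decision-making and evaluation, which malicious actors could exploit to distort outcomes. It proposes a systematic method that identifies hidden neural activity patterns predictive of distorted decisions and optimizes an adversarial input suffix to amplify them, fooling large language model (LLM) graders into assigning much higher grades than humans would. This white-box attack transfers to black-box attacks on other models, including commercial closed-source models such as Gemini, and reveals a "magic word" that plays a pivotal role in the attack. The paper traces this bias to the structure of chat templates commonly used in supervised fine-tuning and shows that a minor template change can drastically reduce it.

Link: https://arxiv.org/abs/2412.15275
Authors: Atsushi Yamamura,Surya Ganguli
Affiliations: unknown
Keywords: processes raises concerns, evaluation processes raises, distort decision outcomes, artificial intelligence, deployment of artificial
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 11 figures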

Abstract:The deployment of artificial intelligence (AI) in critical decision-making and evaluation processes raises concerns about inherent biases that malicious actors could exploit to distort decision outcomes. We propose a systematic method to reveal such biases in AI evaluation systems and apply it to automated essay grading as an example. Our approach first identifies hidden neural activity patterns that predict distorted decision outcomes and then optimizes an adversarial input suffix to amplify such patterns. We demonstrate that this combination can effectively fool large language model (LLM) graders into assigning much higher grades than humans would. We further show that this white-box attack transfers to black-box attacks on other models, including commercial closed-source models like Gemini. They further reveal the existence of a “magic word” that plays a pivotal role in the efficacy of the attack. We trace the origin of this magic word bias to the structure of commonly-used chat templates for supervised fine-tuning of LLMs and show that a minor change in the template can drastically reduce the bias. This work not only uncovers vulnerabilities in current LLMs but also proposes a systematic method to identify and remove hidden biases, contributing to the goal of ensuring AI safety and security.

[NLP-77] Memory-Augmented Agent Training for Business Document Understanding

Quick read: This paper addresses the difficulty traditional enterprises face in automating business document processing, in particular extracting transport references from invoices in logistics operations. It proposes Matrix (Memory-Augmented agent Training through Reasoning and Iterative eXploration), a paradigm in which large language model (LLM) agents progressively build domain expertise through experience-driven memory refinement and iterative learning. Experiments show that Matrix outperforms prompting a single LLM by 30.3% and a vanilla LLM agent by 35.2%. The system also needs fewer API calls, costs less, and can analyze longer documents.

Link: https://arxiv.org/abs/2412.15274
Authors: Jiale Liu,Yifan Zeng,Malte Højmark-Bertelsen,Marie Normann Gadeberg,Huazheng Wang,Qingyun Wu
Affiliations: School of Computer Science, Wuhan University; School of Software Engineering, Tongji University; Department of Environmental Science, Aarhus University
Keywords: Traditional enterprises face, enterprises face significant, face significant challenges, remain largely manual, Traditional enterprises
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 figures

Abstract:Traditional enterprises face significant challenges in processing business documents, where tasks like extracting transport references from invoices remain largely manual despite their crucial role in logistics operations. While Large Language Models offer potential automation, their direct application to specialized business domains often yields unsatisfactory results. We introduce Matrix (Memory-Augmented agent Training through Reasoning and Iterative eXploration), a novel paradigm that enables LLM agents to progressively build domain expertise through experience-driven memory refinement and iterative learning. To validate this approach, we collaborate with one of the world’s largest logistics companies to create a dataset of Universal Business Language format invoice documents, focusing on the task of transport reference extraction. Experiments demonstrate that Matrix outperforms prompting a single LLM by 30.3% and a vanilla LLM agent by 35.2%. We further analyze the metrics of the optimized systems and observe that the agent system requires fewer API calls, incurs lower costs, and can analyze longer documents on average. Our methods establish a new approach to transform general-purpose LLMs into specialized business tools through systematic memory enhancement in document processing tasks.

[NLP-78] Do Voters Get the Information They Want? Understanding Authentic Voter FAQs in the US and How to Improve for Informed Electoral Participation

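[Quick Read]: This paper addresses the incomplete and inconsistent information in the voter Frequently Asked Questions (FAQs) provided by US state election commissions (SECs). The core of the solution is the first dataset of voter FAQs covering all 50 US states, together with metrics for FAQ information quality (FIQ) that cover questions, answers, and how well answers correspond to their questions. Using these metrics, the paper identifies leading, mainstream, and lagging content practices and recommends how states can improve FAQ quality and, with it, the overall information ecosystem.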

Link: https://arxiv.org/abs/2412.15273
Authors: Vipula Rawte, Deja N Scott, Gaurav Kumar, Aishneet Juneja, Bharat Sowrya Yaddanapalli, Biplav Srivastava
Affiliations: AI Institute, University of South Carolina, USA; University of South Carolina, USA
Keywords: make informed decisions, Frequently Asked Questions, Accurate information, keeping them accountable, crucial for democracy
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Notes:

Abstract:Accurate information is crucial for democracy as it empowers voters to make informed decisions about their representatives and keep them accountable. In the US, state election commissions (SECs), often required by law, are the primary providers of Frequently Asked Questions (FAQs) to voters, and secondary sources like non-profits such as League of Women Voters (LWV) try to complement their information shortfall. However, surprisingly, to the best of our knowledge, there is neither a single source with comprehensive FAQs nor a study analyzing the data at national level to identify current practices and ways to improve the status quo. This paper addresses this gap by first providing the first dataset on Voter FAQs covering all the US states. Second, we introduce metrics for FAQ information quality (FIQ) with respect to questions, answers, and answers to corresponding questions. Third, we use FIQs to analyze US FAQs to identify leading, mainstream and lagging content practices and corresponding states. Finally, we identify what states across the spectrum can do to improve FAQ quality and thus, the overall information ecosystem. Across all 50 U.S. states, 12% were identified as leaders and 8% as laggards for FIQS_voter, while 14% were leaders and 12% laggards for FIQS_developer.

[NLP-79] SimGRAG: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation

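[Quick Read]: This paper addresses hallucinations produced by large language models (LLMs) during generation and proposes a knowledge graph (KG) driven retrieval-augmented generation (RAG) method called Similar Graph Enhanced Retrieval-Augmented Generation (SimGRAG). The core of the solution is a two-stage process that aligns query text with KG structure: first, an LLM transforms the query into a desired graph pattern (query-to-pattern); second, a graph semantic distance (GSD) metric quantifies the alignment between the pattern and candidate subgraphs (pattern-to-subgraph). The paper also develops an optimized retrieval algorithm that identifies the top-k subgraphs within 1 second on a 10-million-scale KG, improving the performance, plug-and-play usability, and scalability of RAG on question answering and fact verification.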

Link: https://arxiv.org/abs/2412.15272
Authors: Yuzheng Cai, Zhenyue Guo, Yiwen Pei, Wanrui Bian, Weiguo Zheng
Affiliations: Fudan University
Keywords: large language models, shown impressive versatility, Recent advancements, language models, Enhanced Retrieval-Augmented Generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes:

Abstract:Recent advancements in large language models (LLMs) have shown impressive versatility across various tasks. To eliminate their hallucinations, retrieval-augmented generation (RAG) has emerged as a powerful approach, leveraging external knowledge sources like knowledge graphs (KGs). In this paper, we study the task of KG-driven RAG and propose a novel Similar Graph Enhanced Retrieval-Augmented Generation (SimGRAG) method. It effectively addresses the challenge of aligning query texts and KG structures through a two-stage process: (1) query-to-pattern, which uses an LLM to transform queries into a desired graph pattern, and (2) pattern-to-subgraph, which quantifies the alignment between the pattern and candidate subgraphs using a graph semantic distance (GSD) metric. We also develop an optimized retrieval algorithm that efficiently identifies the top-k subgraphs within 1-second latency on a 10-million-scale KG. Extensive experiments show that SimGRAG outperforms state-of-the-art KG-driven RAG methods in both question answering and fact verification, offering superior plug-and-play usability and scalability.
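
The pattern-to-subgraph stage can be illustrated with a small sketch. This is not the authors' implementation: it assumes the pattern and each candidate subgraph are already aligned lists of (head, relation, tail) triples, it assumes the distance is a plain sum of per-element embedding distances, and it replaces a real text encoder with a toy character-bigram embedding. The names `embed`, `graph_semantic_distance`, and `top_k_subgraphs` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: character-bigram counts (a stand-in for a real text encoder)."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty embedding as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def graph_semantic_distance(pattern, subgraph):
    """Assumed form of the distance: sum over aligned triples of the
    distances between corresponding heads, relations, and tails."""
    if len(pattern) != len(subgraph):
        raise ValueError("pattern and subgraph must have the same number of triples")
    total = 0.0
    for pattern_triple, subgraph_triple in zip(pattern, subgraph):
        for p_elem, s_elem in zip(pattern_triple, subgraph_triple):
            total += cosine_distance(embed(p_elem), embed(s_elem))
    return total


def top_k_subgraphs(pattern, candidates, k):
    """Brute-force ranking; the paper describes an optimized retrieval algorithm instead."""
    return sorted(candidates, key=lambda sg: graph_semantic_distance(pattern, sg))[:k]


# Usage: a pattern produced by a (hypothetical) query-to-pattern step.
pattern = [("Inception", "directed by", "unknown director")]
good = [("Inception", "directed by", "Christopher Nolan")]
bad = [("Titanic", "starring", "Kate Winslet")]
best = top_k_subgraphs(pattern, [bad, good], k=1)
```

A smaller distance means the candidate matches the pattern more closely, so the subgraph that shares the head entity and relation with the pattern ranks first.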

[NLP-80] A MapReduce Approach to Effectively Utilize Long Context Information in Retrieval Augmented Language Models

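[Quick Read]: This paper aims to fix inaccurate responses caused by outdated knowledge or hallucination when large language models (LLMs) are applied in the medical domain. The core of the solution is BriefContext, a map-reduce strategy that mitigates the "lost-in-the-middle" problem in retrieval-augmented generation (RAG) workflows without modifying model weights, making RAG responses more robust and reliable. By addressing the rank and density of key information in the retrieval results, the strategy improves the safety and reliability of LLMs in medical applications.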

Link: https://arxiv.org/abs/2412.15271
Authors: Gongbo Zhang, Zihan Xu, Qiao Jin, Fangyi Chen, Yilu Fang, Yi Liu, Justin F. Rousseau, Ziyang Xu, Zhiyong Lu, Chunhua Weng, Yifan Peng
Affiliations: Columbia University; Weill Cornell Medicine; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health; University of Texas Southwestern Medical Center; NYU Grossman School of Medicine
Keywords: evolving topics due, large language models, holding great promise, struggle to produce, large language
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes:

Abstract:While holding great promise for improving and facilitating healthcare, large language models (LLMs) struggle to produce up-to-date responses on evolving topics due to outdated knowledge or hallucination. Retrieval-augmented generation (RAG) is a pivotal innovation that improves the accuracy and relevance of LLM responses by integrating LLMs with a search engine and external sources of knowledge. However, the quality of RAG responses can be largely impacted by the rank and density of key information in the retrieval results, such as the “lost-in-the-middle” problem. In this work, we aim to improve the robustness and reliability of the RAG workflow in the medical domain. Specifically, we propose a map-reduce strategy, BriefContext, to combat the “lost-in-the-middle” issue without modifying the model weights. We demonstrated the advantage of the workflow with various LLM backbones and on multiple QA datasets. This method promises to improve the safety and reliability of LLMs deployed in healthcare domains.
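
A map-reduce pass over retrieval results can be sketched as follows. This is an illustration under stated assumptions, not the BriefContext implementation: the partition size, the prompt wording, and the function names are hypothetical, and `stub_llm` is a deterministic stand-in so the example runs without a model.

```python
from typing import Callable, List


def partition(docs: List[str], size: int) -> List[List[str]]:
    """Split ranked retrieval results into consecutive groups of at most `size` documents."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def map_reduce_answer(question: str, docs: List[str],
                      llm: Callable[[str], str], partition_size: int = 2) -> str:
    """Map: condense each small group separately, so no document sits in the
    middle of a long context. Reduce: answer from the condensed notes only."""
    notes = []
    for group in partition(docs, partition_size):
        prompt = (f"Question: {question}\nContext:\n" + "\n".join(group)
                  + "\nList only the facts relevant to the question.")
        note = llm(prompt).strip()
        if note:  # drop groups that contributed nothing
            notes.append(note)
    final_prompt = (f"Question: {question}\nContext:\n" + "\n".join(notes)
                    + "\nAnswer the question using the context.")
    return llm(final_prompt)


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: echoes lines marked as facts."""
    return "\n".join(line for line in prompt.splitlines() if line.startswith("FACT:"))


docs = [
    "FACT: drug A lowers blood pressure",
    "noise 1",
    "noise 2",
    "noise 3",
    "FACT: the usual dose of drug A is 10 mg",
    "noise 4",
]
answer = map_reduce_answer("What is known about drug A?", docs, stub_llm)
```

The design point is that each model call sees only a short context, so the position of a key document in the full ranking matters less than it would in a single long prompt.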

[NLP-81] Baichuan4-Finance Technical Report

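[Quick Read]: This paper addresses the underexplored potential of large language models (LLMs) in finance. The core of the solution is the finance-specific Baichuan4-Finance series, comprising the base model Baichuan4-Finance-Base and the aligned language model Baichuan4-Finance. With a detailed data-quality pipeline and a novel domain self-constraint training strategy, Baichuan4-Finance-Base acquires financial knowledge while retaining general capabilities. After Supervised Fine-tuning and Reinforcement Learning from Human Feedback and AI Feedback, Baichuan4-Finance performs well on financial certification questions and real-world application scenarios, showing its potential to drive innovative applications in finance.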

Link: https://arxiv.org/abs/2412.15270
Authors: Hanyu Zhang, Boyu Qiu, Yuhao Feng, Shuqi Li, Qian Ma, Xiyuan Zhang, Qiang Ju, Dong Yan, Jian Xie
Affiliations: Baichuan Inc.
Keywords: Large language models, remains underexplored due, finance remains underexplored, Large language, demonstrated strong capabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes:

Abstract:Large language models (LLMs) have demonstrated strong capabilities in language understanding, generation, and reasoning, yet their potential in finance remains underexplored due to the complexity and specialization of financial knowledge. In this work, we report the development of the Baichuan4-Finance series, including a comprehensive suite of foundational Baichuan4-Finance-Base and an aligned language model Baichuan4-Finance, which are built upon Baichuan4-Turbo base model and tailored for finance domain. Firstly, we have dedicated significant effort to building a detailed pipeline for improving data quality. Moreover, in the continual pre-training phase, we propose a novel domain self-constraint training strategy, which enables Baichuan4-Finance-Base to acquire financial knowledge without losing general capabilities. After Supervised Fine-tuning and Reinforcement Learning from Human Feedback and AI Feedback, the chat model Baichuan4-Finance is able to tackle various financial certification questions and real-world scenario applications. We evaluate Baichuan4-Finance on many widely used general datasets and two holistic financial benchmarks. The evaluation results show that Baichuan4-Finance-Base surpasses almost all competitive baselines on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. At the same time, Baichuan4-Finance demonstrates even more impressive performance on financial application scenarios, showcasing its potential to foster community innovation in the financial LLM field.

[NLP-82] The Reliability Paradox: Exploring How Shortcut Learning Undermines Language Model Calibration

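[Quick Read]: This paper addresses miscalibration in the confidence estimates of pre-trained language models (PLMs) and examines how calibration error relates to the generalizability of a model's decision rules. The key finding is that low calibration error does not necessarily mean reliable decision rules; it may instead reflect reliance on non-generalizable shortcut learning. The study challenges the conventional view that well-calibrated models are inherently reliable and argues for comprehensive frameworks that bridge calibration and generalization objectives to achieve truly robust and reliable language models.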

Link: https://arxiv.org/abs/2412.15269
Authors: Geetanjali Bihani, Julia Rayz
Affiliations: Purdue University
Keywords: natural language processing, enabled significant performance, significant performance gains, advent of pre-trained, enabled significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 10 pages; 9 figures. Accepted for publication at the Hawaii International Conference on System Sciences (HICSS-58) 2025

Abstract:The advent of pre-trained language models (PLMs) has enabled significant performance gains in the field of natural language processing. However, recent studies have found PLMs to suffer from miscalibration, indicating a lack of accuracy in the confidence estimates provided by these models. Current evaluation methods for PLM calibration often assume that lower calibration error estimates indicate more reliable predictions. However, fine-tuned PLMs often resort to shortcuts, leading to overconfident predictions that create the illusion of enhanced performance but lack generalizability in their decision rules. The relationship between PLM reliability, as measured by calibration error, and shortcut learning, has not been thoroughly explored thus far. This paper aims to investigate this relationship, studying whether lower calibration error implies reliable decision rules for a language model. Our findings reveal that models with seemingly superior calibration portray higher levels of non-generalizable decision rules. This challenges the prevailing notion that well-calibrated models are inherently reliable. Our study highlights the need to bridge the current gap between language model calibration and generalization objectives, urging the development of comprehensive frameworks to achieve truly robust and reliable language models.
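
For readers unfamiliar with calibration error, the sketch below computes the widely used expected calibration error (ECE) with equal-width confidence bins. The abstract does not say which estimator or bin count the authors use, so the ten-bin default and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: the bin-size-weighted average of |accuracy - mean confidence|.

    confidences: predicted probability of the chosen class, each in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bins are [lo, hi); a confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Fully confident but right only half the time: ECE is 0.5.
overconfident = expected_calibration_error([1.0, 1.0, 1.0, 1.0],
                                           [True, True, False, False])
# 90% confident and right 9 times out of 10: ECE is approximately 0.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
```

Note that ECE only compares confidence with accuracy on the evaluated data. It says nothing about which features drive the predictions, which is why a low value is compatible with the shortcut-based decision rules the abstract describes.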

[NLP-83] Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic Knowledge Graph

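[Quick Read]: This paper addresses two key problems in online content toxicity detection: 1) false negatives caused by missing domain-specific toxic knowledge; 2) false positives caused by the excessive sensitivity of large language models (LLMs) to toxic speech, which limits freedom of speech. The core of the solution is MetaTox, a method that enhances hatred and toxicity detection through graph search on a meta-toxic knowledge graph. MetaTox first constructs a comprehensive meta-toxic knowledge graph, using LLMs to extract toxic information from toxic benchmark datasets through a three-step pipeline. It then queries the graph through retrieval and ranking to supply accurate, relevant toxic knowledge. Experiments show that MetaTox significantly lowers the false positive rate while improving overall toxicity detection performance.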

Link: https://arxiv.org/abs/2412.15268
Authors: Yibo Zhao, Jiapeng Zhu, Can Xu, Xiang Li
Affiliations: East China Normal University
Keywords: social media platforms, raised significant concerns, Large Language Models, online content toxicity, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 8 pages of content, 7 pages of Limitation, Ethical Statement, Reference and Appendix

Abstract:The rapid growth of social media platforms has raised significant concerns regarding online content toxicity. When Large Language Models (LLMs) are used for toxicity detection, two key challenges emerge: 1) the absence of domain-specific toxic knowledge leads to false negatives; 2) the excessive sensitivity of LLMs to toxic speech results in false positives, limiting freedom of speech. To address these issues, we propose a novel method called MetaTox, leveraging graph search on a meta-toxic knowledge graph to enhance hatred and toxicity detection. First, we construct a comprehensive meta-toxic knowledge graph by utilizing LLMs to extract toxic information through a three-step pipeline, with toxic benchmark datasets serving as corpora. Second, we query the graph via retrieval and ranking processes to supplement accurate, relevant toxic knowledge. Extensive experiments and in-depth case studies across multiple datasets demonstrate that our MetaTox significantly decreases the false positive rate while boosting overall toxicity detection performance. Our code will be available soon.

[NLP-84] Toxicity Detection towards Adaptability to Changing Perturbations

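[Quick Read]: This paper addresses the vulnerability of existing toxicity detection methods to evolving perturbation patterns. Malicious users create new perturbation patterns to evade detectors, for example by adding "I am a scientist" at the beginning of a prompt to fool detectors for large language models (LLMs). The core of the solution is to introduce the continual learning paradigm: the authors construct a new dataset covering 9 perturbation patterns and apply domain incremental learning so that the detector stays robust to dynamically emerging perturbation types.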

Link: https://arxiv.org/abs/2412.15267
Authors: Hankun Kang, Jianhao Chen, Yongqi Li, Xin Miao, Mayi Xu, Ming Zhong, Yuanyuan Zhu, Tieyun Qian
Affiliations: Wuhan University
Keywords: perturbation patterns, crucial for maintaining, maintaining the peace, perturbation, evolving perturbation patterns
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:

Abstract:Toxicity detection is crucial for maintaining the peace of society. While existing methods perform well on normal toxic content or content generated by specific perturbation methods, they are vulnerable to evolving perturbation patterns. In real-world scenarios, however, malicious users tend to create new perturbation patterns to fool the detectors. For example, some users may circumvent the detector of large language models (LLMs) by adding "I am a scientist" at the beginning of the prompt. In this paper, we introduce a novel problem, i.e., continual learning of jailbreak perturbation patterns, into the toxicity detection field. To tackle this problem, we first construct a new dataset generated by 9 types of perturbation patterns, 7 of which are summarized from prior work and 2 of which are developed by us. We then systematically validate the vulnerability of current methods on this new perturbation pattern-aware dataset via both zero-shot and fine-tuned cross-pattern detection. Building on this, we present the domain incremental learning paradigm and the corresponding benchmark to ensure the detector’s robustness to dynamically emerging types of perturbed toxic text. Our code and dataset are provided in the appendix and will be publicly available on GitHub, by which we wish to offer new research opportunities for the security-relevant communities.

[NLP-85] On the Structural Memory of LLM Agents

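[Quick Read]: This paper examines how different memory structures and memory retrieval methods affect the performance of large language model (LLM) agents in complex, long-term interaction tasks. The core of the study is an evaluation and comparison of four memory structures (chunks, knowledge triples, atomic facts, summaries) and mixed memory, along with three widely used memory retrieval methods (single-step retrieval, reranking, iterative retrieval). The results show that different memory structures have distinct advantages on specific tasks, that mixed memory is notably robust in noisy environments, and that iterative retrieval outperforms the other methods across scenarios.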

Link: https://arxiv.org/abs/2412.15266
Authors: Ruihong Zeng, Jinyuan Fang, Siwei Liu, Zaiqiao Meng
Affiliations: University of Glasgow; University of Aberdeen
Keywords: large language model, enabling large language, memory structures, Memory, language model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Memory plays a pivotal role in enabling large language model (LLM)-based agents to engage in complex and long-term interactions, such as question answering (QA) and dialogue systems. While various memory modules have been proposed for these tasks, the impact of different memory structures across tasks remains insufficiently explored. This paper investigates how memory structures and memory retrieval methods affect the performance of LLM-based agents. Specifically, we evaluate four types of memory structures, including chunks, knowledge triples, atomic facts, and summaries, along with mixed memory that combines these components. In addition, we evaluate three widely used memory retrieval methods: single-step retrieval, reranking, and iterative retrieval. Extensive experiments conducted across four tasks and six datasets yield the following key insights: (1) Different memory structures offer distinct advantages, enabling them to be tailored to specific tasks; (2) Mixed memory structures demonstrate remarkable resilience in noisy environments; (3) Iterative retrieval consistently outperforms other methods across various scenarios. Our investigation aims to inspire further research into the design of memory systems for LLM-based agents.
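
The difference between single-step and iterative retrieval can be shown with a toy sketch. It is not the paper's setup: it assumes memory is a list of atomic-fact strings, scores by word overlap instead of embeddings, and refines the query by adding the words of each retrieved fact, where an actual agent would typically ask an LLM to rewrite the query. All function names are hypothetical.

```python
def overlap_score(terms, text):
    """Number of query terms that appear as whole words in the text."""
    return len(terms & set(text.lower().split()))


def retrieve(terms, memory, k, exclude):
    """Return up to k not-yet-retrieved facts with a positive score, best first."""
    candidates = [fact for fact in memory if fact not in exclude]
    ranked = sorted(candidates, key=lambda fact: -overlap_score(terms, fact))
    return [fact for fact in ranked[:k] if overlap_score(terms, fact) > 0]


def iterative_retrieve(query, memory, steps=2, k=1):
    """Retrieve, fold the retrieved facts back into the query, and retrieve again.

    steps=1 reduces to single-step retrieval.
    """
    terms = set(query.lower().split())
    found = []
    for _ in range(steps):
        hits = retrieve(terms, memory, k, found)
        if not hits:
            break
        found.extend(hits)
        for hit in hits:
            terms |= set(hit.lower().split())  # simple query refinement
    return found


memory = [
    "alice works at acme",
    "acme is based in paris",
    "bob likes tea",
]
single_step = iterative_retrieve("where does alice work", memory, steps=1)
two_step = iterative_retrieve("where does alice work", memory, steps=2)
```

In this example the second fact shares no word with the original question and is reachable only after the first fact has introduced "acme" into the query, which is the multi-hop behaviour that motivates iterative retrieval.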

[NLP-86] Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models

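[Quick Read]: This paper addresses the factuality ability of large language models (LLMs) in safety-related applications, in particular the accuracy, comprehensiveness, and clarity of their knowledge in domains such as law, policy, and ethics. The core of the solution is the Chinese SafetyQA benchmark, which is Chinese, diverse, high-quality, static, easy to evaluate, safety-related, and harmless. Using this benchmark, the paper comprehensively evaluates the factuality abilities of existing LLMs and analyzes how they relate to other LLM abilities, such as retrieval-augmented generation (RAG) ability and robustness against attacks.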

Link: https://arxiv.org/abs/2412.15265
Authors: Yingshui Tan, Boren Zheng, Baihui Zheng, Kerui Cao, Huiyun Jing, Jincheng Wei, Jiaheng Liu, Yancheng He, Wenbo Su, Xiangyong Zhu, Bo Zheng
Affiliations: Unknown
Keywords: Large Language Models, significant safety concerns, Large Language, Language Models, advancement of Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:With the rapid advancement of Large Language Models (LLMs), significant safety concerns have emerged. Fundamentally, the safety of large language models is closely linked to the accuracy, comprehensiveness, and clarity of their understanding of safety knowledge, particularly in domains such as law, policy and ethics. This factuality ability is crucial in determining whether these models can be deployed and applied safely and compliantly within specific regions. To address these challenges and better evaluate the factuality ability of LLMs to answer short questions, we introduce the Chinese SafetyQA benchmark. Chinese SafetyQA has several properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate, Safety-related, Harmless). Based on Chinese SafetyQA, we perform a comprehensive evaluation on the factuality abilities of existing LLMs and analyze how these capabilities relate to LLM abilities, e.g., RAG ability and robustness against attacks.

[NLP-87] ReXTrust: A Model for Fine-Grained Hallucination Detection in AI-Generated Radiology Reports

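[Quick Read]: This paper aims to address hallucinations in AI-generated radiology reports, which can introduce false or unfounded statements and affect patient care. The core of the solution is the ReXTrust framework, which uses sequences of hidden states from large vision-language models to produce a hallucination risk score for each finding. Evaluated on a subset of the MIMIC-CXR dataset, ReXTrust outperforms existing approaches, especially on clinically significant findings, indicating that white-box approaches based on model hidden states can provide reliable hallucination detection for medical AI systems and improve the safety and reliability of automated radiology reporting.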

Link: https://arxiv.org/abs/2412.15264
Authors: Romain Hardy, Sung Eun Kim, Pranav Rajpurkar
Affiliations: Stanford University
Keywords: impact patient care, necessitates robust methods, reports necessitates robust, AI-generated radiology reports, radiology reports necessitates
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted to AIMedHealth; 10 pages, 5 figures

Abstract:The increasing adoption of AI-generated radiology reports necessitates robust methods for detecting hallucinations, that is, false or unfounded statements that could impact patient care. We present ReXTrust, a novel framework for fine-grained hallucination detection in AI-generated radiology reports. Our approach leverages sequences of hidden states from large vision-language models to produce finding-level hallucination risk scores. We evaluate ReXTrust on a subset of the MIMIC-CXR dataset and demonstrate superior performance compared to existing approaches, achieving an AUROC of 0.8751 across all findings and 0.8963 on clinically significant findings. Our results show that white-box approaches leveraging model hidden states can provide reliable hallucination detection for medical AI systems, potentially improving the safety and reliability of automated radiology reporting.
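
The AUROC figures quoted above summarize how well per-finding risk scores separate hallucinated findings from grounded ones. The sketch below shows the standard pairwise definition of AUROC; it does not reproduce how ReXTrust computes its scores, and the example scores and labels are made up for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5).

    scores: risk score per finding (higher means more likely hallucinated)
    labels: 1 for a hallucinated finding, 0 for a grounded one
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only: four findings, two of them hallucinated.
example = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of findings, so a rank-based implementation is preferable for large evaluation sets.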

[NLP-88] PROPOE 2: Avanços na Síntese Computacional de Poemas Baseados em Prosa Literária Brasileira

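[Quick Read]: This paper addresses the complex task of poem generation, which involves sound, prosodic, and rhythmic resources. The core of the solution is PROPOE 2, which expands the structural and rhythmic possibilities of the original system: it extracts metered sentences from the prose of Brazilian literature and assembles poems using multiple rhythmic assembly criteria. These advances allow a more coherent and effective exploration of rhythms and sound effects in the poem.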

Link: https://arxiv.org/abs/2412.15263
Authors: Felipe José D. Sousa, Sarah P. Cerqueira, João Queiroz, Angelo Loula
Affiliations: 1. Instituto de Ciências Biomédicas, Universidade de São Paulo; 2. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo
Keywords: complex task, rhythmic resources, rhythmic assembly criteria, prosodic and rhythmic, rhythmic possibilities compared
Categories: Computation and Language (cs.CL)
Notes: In Portuguese

Abstract:The computational generation of poems is a complex task, which involves several sound, prosodic and rhythmic resources. In this work we present PROPOE 2, with the extension of structural and rhythmic possibilities compared to the original system, generating poems from metered sentences extracted from the prose of Brazilian literature, with multiple rhythmic assembly criteria. These advances allow for a more coherent exploration of rhythms and sound effects for the poem. Results of poems generated by the system are demonstrated, with variations in parameters to exemplify generation and evaluation using various criteria.

[NLP-89] Advanced ingestion process powered by LLM parsing for RAG system

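[Quick Read]: This paper addresses the difficulty that Retrieval Augmented Generation (RAG) systems have with structurally complex, multimodal documents. The core of the solution is a multi-strategy parsing approach built on optical character recognition (OCR) powered by large language models (LLMs), which extracts content from many document types, including presentations and high-text-density files, whether scanned or not. The approach uses a node-based extraction technique that creates relationships between different information types and generates context-aware metadata. With a Multimodal Assembler Agent and a flexible embedding strategy, the system improves document comprehension and retrieval; experiments across multiple knowledge bases show improved answer relevancy and information faithfulness.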

Link: https://arxiv.org/abs/2412.15262
Authors: Arnau Perez, Xavier Vizcaino
Affiliations: Applus+ IDIADA
Keywords: Retrieval Augmented Generation, varying structural complexity, Augmented Generation, structural complexity, Retrieval Augmented
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes: 12 pages, 3 figures

Abstract:Retrieval Augmented Generation (RAG) systems struggle with processing multimodal documents of varying structural complexity. This paper introduces a novel multi-strategy parsing approach using LLM-powered OCR to extract content from diverse document types, including presentations and high-text-density files, whether scanned or not. The methodology employs a node-based extraction technique that creates relationships between different information types and generates context-aware metadata. By implementing a Multimodal Assembler Agent and a flexible embedding strategy, the system enhances document comprehension and retrieval capabilities. Experimental evaluations across multiple knowledge bases demonstrate the approach’s effectiveness, showing improvements in answer relevancy and information faithfulness.

[NLP-90] Analyzing Images of Legal Documents: Toward Multi-Modal LLMs for Access to Justice

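[Quick Read]: This paper addresses the difficulty laypeople have, when interacting with the legal system and the government, in finding the right information, locating the correct forms, and filling them out, because that information is spread across different paper documents such as forms, certificates, and contracts. The core of the solution is to use multi-modal large language models (multi-modal LLMs) to analyze images of handwritten paper forms and automatically extract the relevant information in a structured format. Initial results are promising but reveal limitations, such as recognition problems when image quality is low. The study demonstrates the potential of multi-modal LLMs to help laypeople and self-represented litigants find and assemble relevant information.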

Link: https://arxiv.org/abs/2412.15260
Authors: Hannes Westermann, Jaromir Savelka
Affiliations: Maastricht Law and Tech Lab, Maastricht University, Maastricht, Netherlands; Computer Science Department, Carnegie Mellon University, Pittsburgh, USA
Keywords: certificates and contracts, requires the assembly, assembly and analysis, legal system, government requires
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Notes: Accepted at AI for Access to Justice Workshop at Jurix 2024, Brno, Czechia. Code and Data available at: this https URL

Abstract:Interacting with the legal system and the government requires the assembly and analysis of various pieces of information that can be spread across different (paper) documents, such as forms, certificates and contracts (e.g. leases). This information is required in order to understand one’s legal rights, as well as to fill out forms to file claims in court or obtain government benefits. However, finding the right information, locating the correct forms and filling them out can be challenging for laypeople. Large language models (LLMs) have emerged as a powerful technology that has the potential to address this gap, but still rely on the user to provide the correct information, which may be challenging and error-prone if the information is only available in complex paper documents. We present an investigation into utilizing multi-modal LLMs to analyze images of handwritten paper forms, in order to automatically extract relevant information in a structured format. Our initial results are promising, but reveal some limitations (e.g., when the image quality is low). Our work demonstrates the potential of integrating multi-modal LLMs to support laypeople and self-represented litigants in finding and assembling relevant information.

[NLP-91] GLARE: Google Apps Arabic Reviews Dataset

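[Quick Read]: This paper addresses the collection and analysis of Arabic app review data. The core of the solution is the GLARE dataset, collected from the Saudi Google Play Store, which contains 76 million reviews, 69 million of which are Arabic reviews of 9,980 Android applications. The paper describes the data collection methodology, exploratory data analysis (EDA), and feature engineering, and highlights the dataset's potential uses and benefits for studying the Arabic app market and user feedback.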

Link: https://arxiv.org/abs/2412.15259
Authors: Fatima AlGhamdi, Reem Mohammed, Hend Al-Khalifa, Areeb Alowisheq
Affiliations: Unknown
Keywords: Saudi Google PlayStore, paper introduces GLARE, Saudi Google, Arabic Apps Reviews, Google PlayStore
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Notes: Github Repo: this https URL Zenodo: this https URL

Abstract:This paper introduces GLARE, an Arabic Apps Reviews dataset collected from the Saudi Google PlayStore. It consists of 76M reviews, 69M of which are Arabic reviews of 9,980 Android Applications. We present the data collection methodology, along with a detailed Exploratory Data Analysis (EDA) and Feature Engineering on the gathered reviews. We also highlight possible use cases and benefits of the dataset.

[NLP-92] DisEmbed: Transforming Disease Understanding through Embeddings

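[Quick Read]: This paper addresses the over-generalization of existing medical embedding models: because they target broad applications across the whole medical field, they struggle to capture the specific characteristics of diseases. The core of the solution is DisEmbed, a disease-focused embedding model. DisEmbed is trained on a purpose-built synthetic dataset of disease descriptions, symptoms, and disease-related question-answer pairs, which suits it to disease-related tasks. Benchmarked on disease-specific datasets with the triplet evaluation method, DisEmbed outperforms other models at identifying disease-related contexts and distinguishing similar diseases, and it performs particularly well on retrieval-augmented generation (RAG) tasks.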

Link: https://arxiv.org/abs/2412.15258
Authors: Salman Faroz
Affiliations: Unknown
Keywords: general healthcare applications, vast and diverse, healthcare applications, domain is vast, focused on general
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:

Abstract:The medical domain is vast and diverse, with many existing embedding models focused on general healthcare applications. However, these models often struggle to capture a deep understanding of diseases due to their broad generalization across the entire medical field. To address this gap, I present DisEmbed, a disease-focused embedding model. DisEmbed is trained on a synthetic dataset specifically curated to include disease descriptions, symptoms, and disease-related Q&A pairs, making it uniquely suited for disease-related tasks. For evaluation, I benchmarked DisEmbed against existing medical models using disease-specific datasets and the triplet evaluation method. My results demonstrate that DisEmbed outperforms other models, particularly in identifying disease-related contexts and distinguishing between similar diseases. This makes DisEmbed highly valuable for disease-specific use cases, including retrieval-augmented generation (RAG) tasks, where its performance is particularly robust.
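
Triplet evaluation is usually understood as checking, for each (anchor, positive, negative) triple, whether the anchor embedding is closer to the positive than to the negative. The sketch below assumes that reading and cosine similarity; the abstract does not specify the metric, and the two-dimensional vectors are made-up stand-ins for real model embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def triplet_accuracy(triplets):
    """Fraction of triplets where the anchor is more similar to the positive
    than to the negative."""
    if not triplets:
        raise ValueError("need at least one triplet")
    correct = 0
    for anchor, positive, negative in triplets:
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative):
            correct += 1
    return correct / len(triplets)


# Illustrative embeddings only: the first triplet is ranked correctly, the second is not.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),
]
accuracy = triplet_accuracy(triplets)
```

For a disease-focused model, a hard negative would be the description of a similar but different disease, which is where the abstract reports the largest advantage.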

[NLP-93] An Incremental Clustering Baseline for Event Detection on Twitter

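[Quick Read]: This paper addresses event detection in text streams, particularly for the analysis of online media and social networks. The core of the solution is an incremental clustering algorithm combined with recent sentence embeddings, which improves event detection performance while keeping computational complexity acceptable. The results show significant improvements over Cao et al. (2024) and Mazoyer et al. (2020) and can serve as a relevant baseline for future research.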

Link: https://arxiv.org/abs/2412.15257
Authors: Marjolaine Ray (Lattice), Qi Wang (Lattice), Frédérique Mélanie-Becquet (Lattice), Thierry Poibeau (Lattice), Béatrice Mazoyer (médialab)
Affiliations: Unknown
Keywords: Event detection, social networks, detection in text, text streams, crucial task
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Notes:

Abstract:Event detection in text streams is a crucial task for the analysis of online media and social networks. One of the current challenges in this field is establishing a performance standard while maintaining an acceptable level of computational complexity. In our study, we use an incremental clustering algorithm combined with recent advancements in sentence embeddings. Our objective is to compare our findings with previous studies, specifically those by Cao et al. (2024) and Mazoyer et al. (2020). Our results demonstrate significant improvements and could serve as a relevant baseline for future research in this area.
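
A generic threshold-based incremental clustering loop is sketched below to make the idea concrete. It is not the authors' algorithm: the centroid representation, the cosine threshold of 0.8, and the class name are assumptions, and the two-dimensional vectors stand in for sentence embeddings of incoming tweets.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


class IncrementalClusterer:
    """Assign each incoming vector to the most similar cluster centroid,
    or open a new cluster when no centroid is similar enough."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.sums = []    # running sum of member vectors, one per cluster
        self.counts = []  # number of members, one per cluster

    def add(self, vector):
        best_index, best_similarity = None, -1.0
        for index, total in enumerate(self.sums):
            centroid = [x / self.counts[index] for x in total]
            similarity = cosine_similarity(vector, centroid)
            if similarity > best_similarity:
                best_index, best_similarity = index, similarity

        if best_index is not None and best_similarity >= self.threshold:
            # Join the existing cluster and update its running sum.
            self.sums[best_index] = [a + b for a, b in zip(self.sums[best_index], vector)]
            self.counts[best_index] += 1
            return best_index

        # Otherwise treat the document as the start of a new event.
        self.sums.append(list(vector))
        self.counts.append(1)
        return len(self.sums) - 1


clusterer = IncrementalClusterer(threshold=0.8)
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
labels = [clusterer.add(vector) for vector in stream]
```

Each document is processed once and compared only with cluster centroids, so the cost grows with the number of open clusters and not with the number of past documents, which is what keeps this family of methods practical on streams.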

[NLP-94] Structured Extraction of Real World Medical Knowledge using LLMs for Summarization and Search

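[Quick Read]: This paper addresses a key problem in disease discovery and analysis: existing codified disease categories (such as SNOMED-CT, ICD10, CPT) may not capture the nuances of patient conditions or the complexity of rare diseases, and inconsistent disease definitions across data sources complicate ontology mapping and disease clustering. The core of the solution is to use large language model (LLM) extraction to pull patient information from electronic health records (EHR) via natural language and build patient knowledge graphs, instead of relying on rigid ontological hierarchies. The method maps to existing ontologies (MeSH, SNOMED-CT, RxNORM, HPO) to ground the extracted entities, and the paper demonstrates it on rare diseases such as Dravet syndrome and Beta-propeller protein-associated neurodegeneration (BPAN).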

Link: https://arxiv.org/abs/2412.15256
Authors: Edward Kim, Manil Shrestha, Richard Foty, Tom DeLay, Vicki Seyfert-Margolis
Affiliations: RespondHealth, Washington, DC, USA; Department of Computer Science, Drexel University, Philadelphia, PA, USA
Keywords: Creation and curation, knowledge graphs, data, Creation, graphs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 10 pages, 3 figures, Work published in 4th Workshop on Knowledge Graphs and Big Data (In Conjunction with IEEE Big Data 2024)

Abstract:Creation and curation of knowledge graphs can accelerate disease discovery and analysis in real-world data. While disease ontologies aid in biological data annotation, codified categories (SNOMED-CT, ICD10, CPT) may not capture patient condition nuances or rare diseases. Multiple disease definitions across data sources complicate ontology mapping and disease clustering. We propose creating patient knowledge graphs using large language model extraction techniques, allowing data extraction via natural language rather than rigid ontological hierarchies. Our method maps to existing ontologies (MeSH, SNOMED-CT, RxNORM, HPO) to ground extracted entities. Using a large ambulatory care EHR database with 33.6M patients, we demonstrate our method through the patient search for Dravet syndrome, which received ICD10 recognition in October 2020. We describe our construction of patient-specific knowledge graphs and symptom-based patient searches. Using confirmed Dravet syndrome ICD10 codes as ground truth, we employ LLM-based entity extraction to characterize patients in grounded ontologies. We then apply this method to identify Beta-propeller protein-associated neurodegeneration (BPAN) patients, demonstrating real-world discovery where no ground truth exists.

[NLP-95] Data Laundering: Artificially Boosting Benchmark Results through Knowledge Distillation

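[Quick read]: This paper exposes a critical vulnerability in current evaluation practices: knowledge distillation can be subverted to manipulate language model benchmark scores. It introduces "Data Laundering", a three-phase process analogous to financial money laundering that covertly transfers benchmark-specific knowledge through seemingly legitimate intermediate training steps. The key point is that this approach substantially raises benchmark accuracy (up to 75% on GPQA) without developing genuine reasoning capabilities. The paper demonstrates the effect experimentally and stresses the need for more robust evaluation methods in AI development, so that benchmarks more accurately reflect true model capabilities.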

Link: https://arxiv.org/abs/2412.15255
Authors: Jonibek Mansurov,Akhmed Sakip,Alham Fikri Aji
Affiliations: Mohamed bin Zayed University of Artificial Intelligence, UAE
Keywords: current evaluation practices, manipulate language model, revealing a critical, subverted to manipulate, manipulate language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages

Abstract:In this paper, we show that knowledge distillation can be subverted to manipulate language model benchmark scores, revealing a critical vulnerability in current evaluation practices. We introduce “Data Laundering,” a three-phase process analogous to financial money laundering, that enables the covert transfer of benchmark-specific knowledge through seemingly legitimate intermediate training steps. Through extensive experiments with a 2-layer BERT student model, we show how this approach can achieve substantial improvements in benchmark accuracy (up to 75% on GPQA) without developing genuine reasoning capabilities. Notably, this method can be exploited intentionally or even unintentionally, as researchers may inadvertently adopt this method that inflates scores using knowledge distillation without realizing the implications. While our findings demonstrate the effectiveness of this technique, we present them as a cautionary tale highlighting the urgent need for more robust evaluation methods in AI. This work aims to contribute to the ongoing discussion about evaluation integrity in AI development and the need for benchmarks that more accurately reflect true model capabilities. The code is available at this https URL.

[NLP-96] RIRO: Reshaping Inputs Refining Outputs Unlocking the Potential of Large Language Models in Data-Scarce Contexts

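[Quick read]: This paper addresses the weak generalization and inaccurate results of large language models (LLMs) fine-tuned on small, domain-specific datasets. The key to the solution is RIRO, a novel two-layer architecture: the first layer uses advanced prompt engineering to reformulate inputs so that they align better with the training data, and the second layer refines outputs to minimize inconsistencies. The authors fine-tune Phi-2, Falcon 7B, and Falcon 1B and introduce a benchmark based on cosine similarity, Levenshtein distance, BLEU score, ROUGE-1, ROUGE-2, and ROUGE-L. RIRO improves performance in data-scarce settings, although challenges such as computational demands and overfitting remain.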

Link: https://arxiv.org/abs/2412.15254
Authors: Ali Hamdi,Hozaifa Kassab,Mohamed Bahaa,Marwa Mohamed
Affiliations: Unknown
Keywords: Large language models, natural language processing, Large language, advanced natural language, significantly advanced natural
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have significantly advanced natural language processing, excelling in areas like text generation, summarization, and question-answering. Despite their capabilities, these models face challenges when fine-tuned on small, domain-specific datasets, often struggling to generalize and deliver accurate results with unfamiliar inputs. To tackle this issue, we introduce RIRO, a novel two-layer architecture designed to improve performance in data-scarce environments. The first layer leverages advanced prompt engineering to reformulate inputs, ensuring better alignment with training data, while the second layer focuses on refining outputs to minimize inconsistencies. We fine-tune models such as Phi-2, Falcon 7B, and Falcon 1B, with Phi-2 outperforming the others. Additionally, we introduce a benchmark using evaluation metrics such as cosine similarity, Levenshtein distance, BLEU score, ROUGE-1, ROUGE-2, and ROUGE-L. While these advancements improve performance, challenges like computational demands and overfitting persist, limiting the potential of LLMs in data-scarce, high-stakes environments such as healthcare, legal documentation, and software testing.
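
One of the metrics named in the abstract, Levenshtein distance, can be sketched as follows. This is the textbook dynamic-programming formulation rather than the authors' evaluation code, which the abstract does not describe; the function name is hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

A lower distance between a generated output and its reference indicates closer surface agreement, which complements overlap metrics such as BLEU and ROUGE.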

[NLP-97] Using Machine Learning to Distinguish Human-written from Machine-generated Creative Fiction

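[Quick read]: This paper addresses the threat that generative AI poses to creative writing, in particular "sham books" generated in the style of specific authors, which may lower the quality of published material and undermine the economic and cultural contribution of human writers. The key to the solution is training machine learning classifiers to distinguish human-written from machine-generated short creative fiction. The Naive Bayes and Multi-Layer Perceptron classifiers perform well on short samples (around 100 words), reaching 95% accuracy versus 55% for human judges. The work lays the groundwork for lightweight, reliable tools such as AI Detective, intended to help editors and publishers protect the economic and cultural contribution of human authors.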

Link: https://arxiv.org/abs/2412.15253
Authors: Andrea Cristina McGlinchey,Peter J Barclay
Affiliations: Lumerate; School of Computing, Engineering & the Built Environment, Edinburgh Napier University, Scotland
Keywords: Large Language Models, Large Language, deceptive text created, release of ChatGPT, automatic detection
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted for publication at ICAART 2025: this https URL

Abstract:Following the universal availability of generative AI systems with the release of ChatGPT, automatic detection of deceptive text created by Large Language Models has focused on domains such as academic plagiarism and “fake news”. However, generative AI also poses a threat to the livelihood of creative writers, and perhaps to literary culture in general, through reduction in quality of published material. Training a Large Language Model on writers’ output to generate “sham books” in a particular style seems to constitute a new form of plagiarism. This problem has been little researched. In this study, we trained Machine Learning classifier models to distinguish short samples of human-written from machine-generated creative fiction, focusing on classic detective novels. Our results show that a Naive Bayes and a Multi-Layer Perceptron classifier achieved a high degree of success (accuracy 95%), significantly outperforming human judges (accuracy 55%). This approach worked well with short text samples (around 100 words), which previous research has shown to be difficult to classify. We have deployed an online proof-of-concept classifier tool, AI Detective, as a first step towards developing lightweight and reliable applications for use by editors and publishers, with the aim of protecting the economic and cultural contribution of human authors.
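
To make the Naive Bayes approach concrete, here is a minimal bag-of-words sketch with Laplace smoothing. It is not the authors' classifier: the tokenization, the four training sentences, and the function names are hypothetical, and a real experiment would train on many labeled text samples.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Collect per-class word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log posterior for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy samples (assumption), labeled by presumed origin.
toy_samples = [
    ("the inspector lit his pipe and frowned", "human"),
    ("she poured the tea in silence", "human"),
    ("the detective delved into a tapestry of clues", "machine"),
    ("a testament to the tapestry of intrigue", "machine"),
]
model = train_nb(toy_samples)
print(predict_nb(model, "a tapestry of secrets"))
```

The appeal of such a model for the use case in the abstract is that it is cheap to train and to run, which suits a lightweight online tool.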

[NLP-98] NER-RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages

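[Quick read]: This paper addresses the challenge of Kurdish named entity recognition (KNER) within Kurdish natural language processing (KNLP), which stems from the language's rich linguistic structure, varied dialects, and limited datasets. The key to the solution is a methodology for fine-tuning the pre-trained RoBERTa model for KNER: the authors create a Kurdish corpus, design a modified model architecture, and implement the training procedures. Experiments show that fine-tuned RoBERTa with SentencePiece tokenization substantially improves KNER performance, with a 12.8% improvement in F1-score over traditional models, establishing a new benchmark for KNLP.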

Link: https://arxiv.org/abs/2412.15252
Authors: Abdulhady Abas Abdullah,Srwa Hasan Abdulla,Dalia Mohammad Toufiq,Halgurd S. Maghdid,Tarik A. Rashid,Pakshan F. Farho,Shadan Sh. Sabr,Akar H. Taher,Darya S. Hamad,Hadi Veisi,Aras T. Asaad
Affiliations: Unknown
Keywords: Natural Language Processing, named entity recognition, daily life routines, people daily life, Natural Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Nowadays, Natural Language Processing (NLP) is an important tool for most people’s daily life routines, ranging from understanding speech, translation, named entity recognition (NER), and text categorization, to generative text models such as ChatGPT. Due to the existence of big data and consequently large corpora for widely used languages like English, Spanish, Turkish, Persian, and many more, these applications have been developed accurately. However, the Kurdish language still requires more corpora and large datasets to be included in NLP applications. This is because Kurdish has a rich linguistic structure, varied dialects, and a limited dataset, which poses unique challenges for Kurdish NLP (KNLP) application development. While several studies have been conducted in KNLP for various applications, Kurdish NER (KNER) remains a challenge for many KNLP tasks, including text analysis and classification. In this work, we address this limitation by proposing a methodology for fine-tuning the pre-trained RoBERTa model for KNER. To this end, we first create a Kurdish corpus, followed by designing a modified model architecture and implementing the training procedures. To evaluate the trained model, a set of experiments is conducted to demonstrate the performance of the KNER model using different tokenization methods and trained models. The experimental results show that fine-tuned RoBERTa with the SentencePiece tokenization method substantially improves KNER performance, achieving a 12.8% improvement in F1-score compared to traditional models, and consequently establishes a new benchmark for KNLP.

[NLP-99] AgentPS: Agentic Process Supervision for Multi-modal Content Quality Assurance through Multi-round QA

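[Quick read]: This paper addresses the difficulty that multimodal large language models (MLLMs) have when reasoning over complex, interdependent logic structures. The key to the solution is AgentPS, a new framework that integrates Agentic Process Supervision into MLLMs through multi-round question answering during fine-tuning, enabling structured sequential reasoning. The approach yields significant improvements over baseline MLLMs on proprietary TikTok datasets. In addition, replacing human-annotated labels with LLM-generated labels retains much of the performance gain, which points to the framework's scalability and efficiency in industrial applications.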

Link: https://arxiv.org/abs/2412.15251
Authors: Gorden Liu,Yu Sun,Ruixiao Sun,Xin Dong,Hongyu Xiong
Affiliations: TikTok, Inc.
Keywords: large language models, driven substantial progress, language models, progress in vision-language, multimodal large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures

Abstract:The advanced processing and reasoning capabilities of multimodal large language models (MLLMs) have driven substantial progress in vision-language (VL) understanding tasks. However, while effective for tasks governed by straightforward logic, MLLMs often encounter challenges when reasoning over complex, interdependent logic structures. To address this limitation, we introduce AgentPS, a novel framework that integrates Agentic Process Supervision into MLLMs via multi-round question answering during fine-tuning. AgentPS demonstrates significant performance improvements over baseline MLLMs on proprietary TikTok datasets, due to its integration of process supervision and structured sequential reasoning. Furthermore, we show that replacing human-annotated labels with LLM-generated labels retains much of the performance gain, highlighting the framework’s practical scalability in industrial applications. These results position AgentPS as a highly effective and efficient architecture for multimodal classification tasks. Its adaptability and scalability, especially when enhanced by automated annotation generation, make it a powerful tool for handling large-scale, real-world challenges.

[NLP-100] An Enhanced Text Compression Approach Using Transformer-based Language Models

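[Quick read]: This paper addresses a key problem in text compression and restoration: how to optimize transformer-based text decompression and combine it effectively with lossless compression algorithms. The key to the solution is RejuvenateForme, a transformer-based method that relies on a new pre-processing technique (incorporating the Lempel-Ziv-Welch algorithm) and a lossless compression method. The pre-processing achieves compression ratios of 12.57, 13.38, and 11.42 on the BookCorpus, EN-DE, and EN-FR corpora, and RejuvenateForme reaches BLEU scores of 27.31, 25.78, and 50.45 on the EN-DE, EN-FR, and BookCorpus corpora, respectively.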

Link: https://arxiv.org/abs/2412.15250
Authors: Chowdhury Mofizur Rahman,Mahbub E Sobhani,Anika Tasnim Rodela,Swakkhar Shatabda
Affiliations: State University of Bangladesh; United International University; BRAC University
Keywords: keeping crucial information, shrinks textual data, compression shrinks textual, English text data, Text compression shrinks
Categories: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:Text compression shrinks textual data while keeping crucial information, eradicating constraints on storage, bandwidth, and computational efficacy. The integration of lossless compression techniques with transformer-based text decompression has received negligible attention, despite the increasing volume of English text data in communication. The primary barrier in advancing text compression and restoration involves optimizing transformer-based approaches with efficient pre-processing and integrating lossless compression algorithms, that remained unresolved in the prior attempts. Here, we propose a transformer-based method named RejuvenateForme for text decompression, addressing prior issues by harnessing a new pre-processing technique and a lossless compression method. Our meticulous pre-processing technique incorporating the Lempel-Ziv-Welch algorithm achieves compression ratios of 12.57, 13.38, and 11.42 on the BookCorpus, EN-DE, and EN-FR corpora, thus showing state-of-the-art compression ratios compared to other deep learning and traditional approaches. Furthermore, the RejuvenateForme achieves a BLEU score of 27.31, 25.78, and 50.45 on the EN-DE, EN-FR, and BookCorpus corpora, showcasing its comprehensive efficacy. In contrast, the pre-trained T5-Small exhibits better performance over prior state-of-the-art models.
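
For readers unfamiliar with the Lempel-Ziv-Welch (LZW) algorithm named in the abstract, a textbook sketch follows. It shows plain LZW only; how the authors combine it with their pre-processing and with the transformer is not described in the abstract, and the function names here are hypothetical. The sketch assumes input characters with code points below 256.

```python
def lzw_compress(text):
    """Encode text as a list of integer codes using textbook LZW."""
    table = {chr(i): i for i in range(256)}  # initial single-character codes
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in table:
            current = candidate  # keep extending the known phrase
        else:
            codes.append(table[current])
            table[candidate] = next_code  # learn the new phrase
            next_code += 1
            current = ch
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the original text from LZW codes."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Phrase defined by the step being decoded (the cScSc case).
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code")
        pieces.append(entry)
        table[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(pieces)


sample = "TOBEORNOTTOBEORTOBEORNOT"
encoded = lzw_compress(sample)
print(len(sample), len(encoded), lzw_decompress(encoded) == sample)
```

Because the decoder rebuilds the same phrase table as the encoder, no dictionary needs to be transmitted, which is what makes the scheme lossless and self-contained.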

[NLP-101] LLMs for Literature Review: Are we there yet?

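[Quick read]: This paper addresses the time-intensive and difficult task of writing literature reviews, especially given the recent influx of research papers. The key to the solution is using the zero-shot abilities of large language models (LLMs) and decomposing the task into two components: 1) retrieving related work given a query abstract, and 2) writing a literature review based on the retrieved results. For retrieval, the paper proposes a novel two-step search strategy that first uses an LLM to extract keywords from the abstract and then queries an external knowledge base for relevant papers. It also studies a prompting-based re-ranking mechanism with attribution, which doubles the normalized recall compared to naive search and offers insight into the LLM's decision-making process. For generation, the paper proposes a two-step approach that first outlines a plan for the review and then executes the plan to produce it. Experiments show that this planning-based approach reduces hallucinated references in the generated review by 18-26% compared to existing simpler LLM-based generation methods, yielding higher-quality reviews.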

Link: https://arxiv.org/abs/2412.15249
Authors: Shubham Agarwal,Gaurav Sahu,Abhay Puri,Issam H. Laradji,Krishnamurthy DJ Dvijotham,Jason Stanley,Laurent Charlin,Christopher Pal
Affiliations: ServiceNow Research; Mila - Quebec AI Institute; HEC Montreal; University of Waterloo; University of British Columbia
Keywords: Large Language Models, recent Large Language, challenging to write, Language Models, remain time-intensive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments:

Abstract:Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially due to the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: 1. Retrieving related works given a query abstract, and 2. Writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods, while providing insights into the LLM’s decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this regard. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Further, we demonstrate that our planning-based approach achieves higher-quality reviews by minimizing hallucinated references in the generated review by 18-26% compared to existing simpler LLM-based generation methods.

[NLP-102] RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages

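[Quick read]: This paper addresses the scarcity of post-OCR error correction datasets for low-resource languages. The key to the solution is RoundTripOCR, a synthetic data generation approach for Devanagari languages. The method treats OCR errors as mistranslations in a parallel corpus and uses pre-trained transformer models to learn the mapping from erroneous to correct text, thereby correcting OCR errors effectively. The authors also release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani, and Sanskrit.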

Link: https://arxiv.org/abs/2412.15248
Authors: Harshvivek Kashid,Pushpak Bhattacharyya
Affiliations: Indian Institute of Technology Bombay
Keywords: Optical Character Recognition, Optical Character, Character Recognition, enabling efficient data, efficient data extraction
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like Machine Translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR, that tackles the scarcity of the post-OCR Error Correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation. Our method involves translating erroneous OCR output into a corrected form by treating the OCR errors as mistranslations in a parallel text corpus, employing pre-trained transformer models to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.

[NLP-103] Streamlining Systematic Reviews: A Novel Application of Large Language Models

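[Quick read]: This paper addresses the time-consuming literature screening step in systematic reviews (SRs), specifically the automation of title/abstract and full-text screening. The key to the solution is an in-house system based on large language models (LLMs) that uses prompt engineering for title/abstract screening and Retrieval-Augmented Generation (RAG) for full-text screening. The system reduces manual screening time by 95.5% while achieving an article exclusion rate of 99.5%, specificity of 99.6%, a false negative rate of 0%, and a negative predictive value of 100%, indicating considerable potential for SR workflows.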

Link: https://arxiv.org/abs/2412.15247
Authors: Fouad Trad,Ryan Yammine,Jana Charafeddine,Marlene Chakhtoura,Maya Rahme,Ghada El-Hajj Fuleihan,Ali Chehab
Affiliations: Unknown
Keywords: Large Language Models, Systematic reviews, screening, Language Models, essential for evidence-based
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Systematic reviews (SRs) are essential for evidence-based guidelines but are often limited by the time-consuming nature of literature screening. We propose and evaluate an in-house system based on Large Language Models (LLMs) for automating both title/abstract and full-text screening, addressing a critical gap in the literature. Using a completed SR on Vitamin D and falls (14,439 articles), the LLM-based system employed prompt engineering for title/abstract screening and Retrieval-Augmented Generation (RAG) for full-text screening. The system achieved an article exclusion rate (AER) of 99.5%, specificity of 99.6%, a false negative rate (FNR) of 0%, and a negative predictive value (NPV) of 100%. After screening, only 78 articles required manual review, including all 20 identified by traditional methods, reducing manual screening time by 95.5%. For comparison, Rayyan, a commercial tool for title/abstract screening, achieved an AER of 72.1% and FNR of 5% when including articles Rayyan considered as undecided or likely to include. Lowering Rayyan’s inclusion thresholds improved FNR to 0% but increased screening time. By addressing both screening phases, the LLM-based system significantly outperformed Rayyan and traditional methods, reducing total screening time to 25.5 hours while maintaining high accuracy. These findings highlight the transformative potential of LLMs in SR workflows by offering a scalable, efficient, and accurate solution, particularly for the full-text screening phase, which has lacked automation tools.
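
The reported screening metrics can be reproduced approximately from the counts in the abstract. The sketch below rests on assumptions that the abstract does not state explicitly: that the 20 articles identified by traditional methods are the only truly relevant ones, that the other 58 of the 78 retained articles are false positives, and that the article exclusion rate is the share of all articles excluded by the system. The function and variable names are hypothetical.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for an include/exclude screen."""
    total = tp + fp + tn + fn
    return {
        # Assumed definition: share of all articles the system excluded.
        "article_exclusion_rate": (tn + fn) / total,
        "specificity": tn / (tn + fp),
        "false_negative_rate": fn / (fn + tp),
        "negative_predictive_value": tn / (tn + fn),
    }


total_articles = 14439  # articles in the completed review (from the abstract)
retained = 78           # articles left for manual review (from the abstract)
relevant = 20           # articles identified by traditional methods

tp, fn = relevant, 0                     # all 20 relevant articles were retained
fp = retained - tp                       # assumption: the rest were not relevant
tn = total_articles - retained - fn      # everything else was correctly excluded

metrics = screening_metrics(tp, fp, tn, fn)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions the computed values round to the figures quoted in the abstract, which is a useful sanity check on how the four metrics relate to one another.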

[NLP-104] Accelerating Retrieval-Augmented Generation

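[Quick read]: This paper concerns Retrieval-Augmented Generation (RAG), which mitigates hallucination and improves accuracy in large language models (LLMs) by augmenting them with information retrieved from an external knowledge source such as the web. The key contribution is Intelligent Knowledge Store (IKS), a type-2 CXL device designed to accelerate exact nearest neighbor search. IKS implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and the near-memory accelerators. It delivers 13.4-27.9x faster exact nearest neighbor search over a 512GB vector database, which translates to 1.7-26.3x lower end-to-end inference time for representative RAG applications. In addition, the internal DRAM of IKS can be disaggregated and used by other applications on the server, so that DRAM is not left stranded.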

Link: https://arxiv.org/abs/2412.15246
Authors: Derrick Quinn,Mohammad Nouri,Neel Patel,John Salihu,Alireza Salemi,Sukhan Lee,Hamed Zamani,Mohammad Alian
Affiliations: Cornell University; University of Kansas; University of Massachusetts Amherst; Samsung Electronics
Keywords: involves augmenting LLMs, external knowledge source, augmenting LLMs, large language models, Retrieval-Augmented Generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Comments:

Abstract:An evolving solution to address hallucination and enhance accuracy in large language models (LLMs) is Retrieval-Augmented Generation (RAG), which involves augmenting LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates the acceleration of the exact nearest neighbor search for RAG. In this work, we design Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4-27.9x faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7-26.3x lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM, which is the most expensive component in today’s servers, from being stranded.
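
Exact nearest neighbor search, the operation that IKS accelerates in hardware, amounts to scoring the query against every stored vector. The software sketch below illustrates that brute-force idea only; it assumes inner-product scoring, and the function name and toy database are hypothetical rather than taken from the paper.

```python
import heapq


def exact_top_k(query, database, k):
    """Return indices of the k vectors with the largest inner product."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(database)
    )
    # Every vector is scored, so the result is exact rather than approximate.
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Toy 2-D vector database (assumption).
toy_database = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
print(exact_top_k([1.0, 0.2], toy_database, k=2))
```

The full scan is what makes exact search expensive on a CPU at the scale of a 512GB database, and it is also a regular, memory-bound workload, which is why near-memory acceleration is a natural fit.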

[NLP-105] MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples COLING2025

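[Quick read]: This paper addresses the reliance of existing preference optimization methods for large language models (LLMs), which build on Reinforcement Learning from Human Feedback (RLHF), on a reference model and abundant preference data. The key to the solution is the Multi Pair-wise Preference Optimization (MPPO) algorithm, which uses the average likelihood of model responses to fit the reward function and thereby maximizes the utilization of preference data. Among the Point-wise, Pair-wise, and List-wise implementations, the Pair-wise approach performs best, significantly improving response quality. MPPO outperforms DPO, ORPO, and SimPO on MT-Bench and surpasses DPO and ORPO by substantial margins on Arena-Hard.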

Link: https://arxiv.org/abs/2412.15244
Authors: Shuo Xie,Fangzhi Zhu,Jiahui Wang,Lulu Wen,Wei Dai,Xiaowei Chen,Junxiong Zhu,Kai Zhou,Bo Zheng
Affiliations: Taobao & Tmall Group of Alibaba
Keywords: Aligning Large Language, Large Language Models, Aligning Large, Large Language, human feedback
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by COLING2025

Abstract:Aligning Large Language Models (LLMs) with human feedback is crucial for their development. Existing preference optimization methods such as DPO and KTO, while improved based on Reinforcement Learning from Human Feedback (RLHF), are inherently derived from PPO, requiring a reference model that adds GPU memory resources and relies heavily on abundant preference data. Meanwhile, current preference optimization research mainly targets single-question scenarios with two replies, neglecting optimization with multiple replies, which leads to a waste of data in the application. This study introduces the MPPO algorithm, which leverages the average likelihood of model responses to fit the reward function and maximizes the utilization of preference data. Through a comparison of Point-wise, Pair-wise, and List-wise implementations, we found that the Pair-wise approach achieves the best performance, significantly enhancing the quality of model responses. Experimental results demonstrate MPPO’s outstanding performance across various benchmarks. On MT-Bench, MPPO outperforms DPO, ORPO, and SimPO. Notably, on Arena-Hard, MPPO surpasses DPO and ORPO by substantial margins. These achievements underscore the remarkable advantages of MPPO in preference optimization tasks.

[NLP-106] Script-Based Dialog Policy Planning for LLM-Powered Conversational Agents: A Basic Architecture for an “AI Therapist”

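[Quick read]: This paper addresses two key limitations of LLM-powered conversational agents for behavioral healthcare support: (a) they cannot consistently and reliably follow predefined rules that align the conversation with an overarching therapeutic concept, and (b) their decision paths are not inspectable, which hinders risk management and clinical evaluation. The key to the solution is a new paradigm for dialog policy planning in which an expert-written "script" constrains the LLM's behavior so that the conversation follows the therapeutic approach, while explicit transitions through a finite set of states make the decision path inspectable. The authors implement two variants using different prompting techniques and synthesize 100 conversations with LLM-simulated patients, demonstrating feasibility and comparing the efficiency and effectiveness of the variants.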

Link: https://arxiv.org/abs/2412.15242
Authors: Robert Wasenmüller,Kevin Hilbert,Christoph Benzmüller
Affiliations: Free University of Berlin; University of Luxembourg; University of Potsdam
Keywords: Large Language Model, Powered Conversational Agents, scaled behavioral healthcare, behavioral healthcare support, Powered Conversational
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, 1 table

Abstract:Large Language Model (LLM)-Powered Conversational Agents have the potential to provide users with scaled behavioral healthcare support, and potentially even deliver full-scale “AI therapy” in the future. While such agents can already conduct fluent and proactive emotional support conversations, they inherently lack the ability to (a) consistently and reliably act by predefined rules to align their conversation with an overarching therapeutic concept and (b) make their decision paths inspectable for risk management and clinical evaluation – both essential requirements for an “AI Therapist”. In this work, we introduce a novel paradigm for dialog policy planning in conversational agents enabling them to (a) act according to an expert-written “script” that outlines the therapeutic approach and (b) explicitly transition through a finite set of states over the course of the conversation. The script acts as a deterministic component, constraining the LLM’s behavior in desirable ways and establishing a basic architecture for an AI Therapist. We implement two variants of Script-Based Dialog Policy Planning using different prompting techniques and synthesize a total of 100 conversations with LLM-simulated patients. The results demonstrate the feasibility of this new technology and provide insights into the efficiency and effectiveness of different implementation variants.
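
The idea of a script acting as a deterministic component over a finite set of states can be sketched as a small state machine. Everything below is hypothetical: the four states, their instructions, and the rule-based `choose_next_state` function (which merely stands in for an LLM call) are illustrative assumptions, not the authors' script or prompts.

```python
# Hypothetical expert-written script: each state has an instruction for the
# agent and the list of states it is allowed to move to next.
SCRIPT = {
    "greeting": {"instruction": "Welcome the user and ask how they feel.",
                 "next": ["explore"]},
    "explore": {"instruction": "Ask open questions about the concern.",
                "next": ["explore", "intervene"]},
    "intervene": {"instruction": "Offer one exercise from the script.",
                  "next": ["closing"]},
    "closing": {"instruction": "Summarize and say goodbye.",
                "next": []},
}


def choose_next_state(state, user_message):
    """Stand-in for an LLM proposing the next state (toy rule, assumption)."""
    if state == "explore" and "because" not in user_message.lower():
        return "explore"  # keep exploring until the user gives a reason
    options = SCRIPT[state]["next"]
    return options[-1] if options else state


def run_conversation(user_messages):
    """Walk through the script and record the inspectable state trace."""
    state = "greeting"
    trace = [state]
    for message in user_messages:
        proposed = choose_next_state(state, message)
        # Deterministic guard: only transitions allowed by the script apply.
        if proposed in SCRIPT[state]["next"]:
            state = proposed
        trace.append(state)
    return trace


toy_messages = ["hi", "I feel low", "because of work stress", "ok thanks"]
print(run_conversation(toy_messages))
```

The recorded trace is the point of the design: whatever the language model proposes, the sequence of visited states can be logged and reviewed afterwards.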

[NLP-107] Quantifying Positional Biases in Text Embedding Models NEURIPS

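[Quick read]: This paper examines positional bias in embedding models when they process longer texts, particularly in Information Retrieval (IR) and semantic similarity tasks. It finds that embedding models, irrespective of their positional encoding mechanism, disproportionately prioritize the beginning of an input. The key contribution is quantifying this bias experimentally: inserting irrelevant text or removing content at the start of a document reduces the cosine similarity between the altered and original embeddings by up to 12.3% more than the same ablations at the end. Regression analysis confirms the bias, with sentence importance declining as position moves away from the start, even when content is controlled for. The authors hypothesize that the effect arises from pre-processing strategies and the chosen positional encoding techniques, offering a new perspective on embedding model robustness.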

Link: https://arxiv.org/abs/2412.15241
Authors: Samarth Goel,Reagan J. Lee,Kannan Ramchandran
Affiliations: University of California, Berkeley; Department of Electrical Engineering and Computer Science
Keywords: biases remains underexplored, tasks in Information, semantic similarity measurement, positional biases remains, Information Retrieval
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 13 pages, 11 figures, NeurIPS

Abstract:Embedding models are crucial for tasks in Information Retrieval (IR) and semantic similarity measurement, yet their handling of longer texts and associated positional biases remains underexplored. In this study, we investigate the impact of content position and input size on text embeddings. Our experiments reveal that embedding models, irrespective of their positional encoding mechanisms, disproportionately prioritize the beginning of an input. Ablation studies demonstrate that insertion of irrelevant text or removal at the start of a document reduces cosine similarity between altered and original embeddings by up to 12.3% more than ablations at the end. Regression analysis further confirms this bias, with sentence importance declining as position moves further from the start, even with content-agnosticity. We hypothesize that this effect arises from pre-processing strategies and chosen positional encoding techniques. These findings quantify the sensitivity of retrieval systems and suggest a new lens towards embedding model robustness.

[NLP-108] ChainStream: An LLM-based Framework for Unified Synthetic Sensing

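[Quick read]: This paper addresses the difficulty developers face in building context-sensing programs and the privacy concerns these programs raise for end-users. The key to the solution is using natural language as a unified interface for processing personal data and sensing user context, which eases app development and makes the data pipeline more transparent. The approach has two core components: 1) a unified data processing framework that makes context-sensing programs simpler, and 2) a feedback-guided query optimizer that makes data queries more informative. On a benchmark of 133 context-sensing tasks, the approach solves the tasks automatically, efficiently, and precisely.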

链接: https://arxiv.org/abs/2412.15240
作者: Jiacheng Liu,Yuanchun Li,Liangyan Li,Yi Sun,Hao Wen,Xiangyu Li,Yao Guo,Yunxin Liu
机构: Institute of AI Industry Research (AIR), Tsinghua University; Beijing Institute of Technology
Institute of AI Industry Research (AIR), Tsinghua University; Shanghai AI Laboratory
University of Science and Technology Beijing
Institute of AI Industry Research (AIR), Tsinghua University
Institute of AI Industry Research (AIR), Tsinghua University
Institute of AI Industry Research (AIR), Tsinghua University
Peking University
Institute of AI Industry Research (AIR), Tsinghua University; Shanghai AI Laboratory
关键词: applications demand context, demand context sensing, timely services, applications demand, offer personalized
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 18 pages, 8 figures

Abstract:Many applications demand context sensing to offer personalized and timely services. Yet, developing sensing programs can be challenging for developers and using them is privacy-concerning for end-users. In this paper, we propose to use natural language as the unified interface to process personal data and sense user context, which can effectively ease app development and make the data pipeline more transparent. Our work is inspired by large language models (LLMs) and other generative models, while directly applying them does not solve the problem - letting the model directly process the data cannot handle complex sensing requests and letting the model write the data processing program suffers error-prone code generation. We address the problem with 1) a unified data processing framework that makes context-sensing programs simpler and 2) a feedback-guided query optimizer that makes data query more informative. To evaluate the performance of natural language-based context sensing, we create a benchmark that contains 133 context sensing tasks. Extensive evaluation has shown that our approach is able to automatically solve the context-sensing tasks efficiently and precisely. The code is opensourced at this https URL.

[NLP-109] Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs

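[Quick read]: This paper asks when and why consumers engage with story content, and in particular how to capture and model the audience's forward-looking beliefs about how a story will develop. The key is a new framework that uses large language models to generate multiple potential continuations of a story and then applies established content analysis techniques to extract features related to expectations, uncertainty, and surprise. The method complements existing feature extraction techniques, amplifying their marginal explanatory power by 31% on average, and reveals that different types of engagement (continuing to read, commenting, and voting) are driven by distinct combinations of current and anticipated content features. The framework offers a new way to study how forward-looking beliefs shape engagement with narrative media, with implications for marketing strategy in content-focused industries.

Link: https://arxiv.org/abs/2412.15239
Authors: Hortense Fong, George Gui
Institutions: Columbia Business School
Keywords: audience forward-looking beliefs, creators and platforms, consumers engage, forward-looking beliefs, model audience forward-looking
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN); Methodology (stat.ME)
Comments: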

Abstract:Understanding when and why consumers engage with stories is crucial for content creators and platforms. While existing theories suggest that audience beliefs of what is going to happen should play an important role in engagement decisions, empirical work has mostly focused on developing techniques to directly extract features from actual content, rather than capturing forward-looking beliefs, due to the lack of a principled way to model such beliefs in unstructured narrative data. To complement existing feature extraction techniques, this paper introduces a novel framework that leverages large language models to model audience forward-looking beliefs about how stories might unfold. Our method generates multiple potential continuations for each story and extracts features related to expectations, uncertainty, and surprise using established content analysis techniques. Applying our method to over 30,000 book chapters from Wattpad, we demonstrate that our framework complements existing feature engineering techniques by amplifying their marginal explanatory power on average by 31%. The results reveal that different types of engagement (continuing to read, commenting, and voting) are driven by distinct combinations of current and anticipated content features. Our framework provides a novel way to study and explore how audience forward-looking beliefs shape their engagement with narrative media, with implications for marketing strategy in content-focused industries.
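
One way to picture the expectation, uncertainty, and surprise features is the sketch below. It assumes each LLM-generated continuation has already been reduced to a single numeric score by some content analysis tool (for example a sentiment score); the specific definitions used here (mean, population variance, absolute deviation) are illustrative assumptions, not the paper's exact feature set.

```python
from statistics import mean, pvariance
from typing import Dict, Sequence


def expectation_features(
    continuation_scores: Sequence[float], actual_score: float
) -> Dict[str, float]:
    """Summarize beliefs implied by several sampled continuations.

    continuation_scores: one score per generated continuation (hypothetical input).
    actual_score: the same score computed on the chapter that actually follows.
    """
    expected = mean(continuation_scores)
    return {
        # What the audience expects on average.
        "expectation": expected,
        # How much the sampled continuations disagree with each other.
        "uncertainty": pvariance(continuation_scores),
        # How far the real continuation lands from the expectation.
        "surprise": abs(actual_score - expected),
    }
```

For instance, `expectation_features([0.2, 0.4, 0.6], 0.9)` describes a story whose sampled continuations centre on 0.4 while the real next chapter scores 0.9, which registers as a surprise of about 0.5.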

[NLP-110] Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks NEURIPS2024

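[Quick read]: This paper addresses the weak performance of large language models (LLMs) on reasoning tasks, especially for the smaller models that users are limited to under resource constraints. The key is a training-free ensemble framework: a single LLM is fed an optimized, diverse set of prompts in parallel, which produces an ensemble at inference time and improves reasoning performance. The method shows significant gains on math reasoning tasks such as the MATH dataset; for example, an ensemble of a few small models (e.g., three Qwen2-MATH-1.5B-it models) can outperform a larger model (e.g., Qwen2-MATH-7B-it).

Link: https://arxiv.org/abs/2412.15238
Authors: Gregory Kang Ruey Lau, Wenyang Hu, Diwen Liu, Jizhuo Chen, See-Kiong Ng, Bryan Kian Hsiang Low
Institutions: National University of Singapore; CNRS@CREATE
Keywords: GPU memory restrictions, Large Language Models, Large Language, encounter substantial challenges, GPU memory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted to NeurIPS 2024 Workshop on Foundation Model Interventions (MINT)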

Abstract:Large Language Models still encounter substantial challenges in reasoning tasks, especially for smaller models, which many users may be restricted to due to resource constraints (e.g. GPU memory restrictions). Inference-time methods to boost LLM performance, such as prompting methods to invoke certain reasoning pathways in responses, have been shown effective in past works, though they largely rely on sequential queries. The ensemble method, which consists of multiple constituent models running in parallel, is a promising approach to achieving better inference-time performance, especially given recent developments that enabled significant speed-ups in LLM batch inference. In this work, we propose a novel, training-free LLM ensemble framework where a single LLM model is fed an optimized, diverse set of prompts in parallel, effectively producing an ensemble at inference time to achieve performance improvement in reasoning tasks. We empirically demonstrate that our method leads to significant gains on math reasoning tasks, e.g., on MATH, where our ensemble consisting of a few small models (e.g., three Qwen2-MATH-1.5B-it models) can outperform a larger model (e.g., Qwen2-MATH-7B-it).
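
The prompt-ensemble idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: `stub_model` is a hypothetical stand-in for a single LLM call, the prompts are made up, and majority voting is assumed as the aggregation rule (the abstract does not say how answers are combined or how the prompt set is optimized).

```python
from collections import Counter
from typing import Callable, Sequence


def ensemble_answer(
    model: Callable[[str], str], prompts: Sequence[str], question: str
) -> str:
    """Query one model with several prompts and return the most common answer.

    In practice the prompts would be sent as one batch; a loop keeps the
    sketch simple.
    """
    answers = [model(f"{prompt}\n\n{question}") for prompt in prompts]
    return Counter(answers).most_common(1)[0][0]


def stub_model(text: str) -> str:
    """Hypothetical model: answers wrongly only under the 'quickly' prompt."""
    return "5" if "quickly" in text else "4"
```

With three prompts of which one leads the stub astray, the vote still returns the majority answer, which is the effect an ensemble relies on.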

[NLP-111] CareBot: A Pioneering Full-Process Open-Source Medical Language Model AAAI2025

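[Quick read]: This paper addresses the weak performance of open-source large language models (LLMs) in the specialized medical domain. The key is CareBot, a bilingual medical LLM that combines continuous pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with human feedback (RLHF). Specifically, the paper adopts a two-stage CPT method (Stable CPT and Boost CPT) that bridges the gap between general and domain-specific data, and introduces the DataRater model to assess data quality so that training data is accurate and relevant. It also builds a large bilingual dataset and the ConFilter metric to improve multi-turn dialogue quality, strengthening the model's ability to handle complex dialogues. Together these techniques significantly improve CareBot's performance in applications such as medical consultation and education, as validated by rigorous benchmark evaluations.

Link: https://arxiv.org/abs/2412.15236
Authors: Lulu Zhao, Weihao Zeng, Xiaofeng Shi, Hua Zhou
Institutions: Unknown
Keywords: made significant strides, significant strides, communities have made, made significant, CPT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2025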

Abstract:Recently, both closed-source LLMs and open-source communities have made significant strides, outperforming humans in various general domains. However, their performance in specific professional domains such as medicine, especially within the open-source community, remains suboptimal due to the complexity of medical knowledge. In this paper, we propose CareBot, a bilingual medical LLM, which leverages a comprehensive approach integrating continuous pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with human feedback (RLHF). Our novel two-stage CPT method, comprising Stable CPT and Boost CPT, effectively bridges the gap between general and domain-specific data, facilitating a smooth transition from pre-training to fine-tuning and enhancing domain knowledge progressively. We also introduce DataRater, a model designed to assess data quality during CPT, ensuring that the training data is both accurate and relevant. For SFT, we develop a large and diverse bilingual dataset, along with ConFilter, a metric to enhance multi-turn dialogue quality, which is crucial to improving the model’s ability to handle more complex dialogues. The combination of high-quality data sources and innovative techniques significantly improves CareBot’s performance across a range of medical applications. Our rigorous evaluations on Chinese and English benchmarks confirm CareBot’s effectiveness in medical consultation and education. These advancements not only address current limitations in medical LLMs but also set a new standard for developing effective and reliable open-source models in the medical domain. We will open-source the datasets and models later, contributing valuable resources to the research community.

[NLP-112] OG-RAG: Ontology-Grounded Retrieval-Augmented Generation For Large Language Models

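[Quick read]: This paper addresses the inaccurate responses and poor adaptability of large language models (LLMs) on domain-specific knowledge, which stem from the lack of structured knowledge representations and efficient retrieval methods. The key is OG-RAG, an ontology-grounded retrieval-augmented generation method: it anchors retrieval in a domain-specific ontology, builds a hypergraph representation of domain documents, and uses an optimization algorithm to retrieve the minimal set of hyperedges that forms a precise, conceptually grounded context, which improves retrieval efficiency while preserving the complex relationships between entities. OG-RAG is particularly suited to tasks that require fact-based reasoning, in areas such as industrial workflows, law, healthcare, and agriculture, and it significantly improves fact recall, response correctness, and reasoning accuracy.

Link: https://arxiv.org/abs/2412.15235
Authors: Kartik Sharma, Peeyush Kumar, Yunqing Li
Institutions: Microsoft Research
Keywords: Ontology-Grounded Retrieval Augmented, Retrieval Augmented Generation, Augmented Generation method, anchoring retrieval processes, Augmented Generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: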

Abstract:This paper presents OG-RAG, an Ontology-Grounded Retrieval Augmented Generation method designed to enhance LLM-generated responses by anchoring retrieval processes in domain-specific ontologies. While LLMs are widely used for tasks like question answering and search, they struggle to adapt to specialized knowledge, such as industrial workflows or knowledge work, without expensive fine-tuning or sub-optimal retrieval methods. Existing retrieval-augmented models, such as RAG, offer improvements but fail to account for structured domain knowledge, leading to suboptimal context generation. Ontologies, which conceptually organize domain knowledge by defining entities and their interrelationships, offer a structured representation to address this gap. OG-RAG constructs a hypergraph representation of domain documents, where each hyperedge encapsulates clusters of factual knowledge grounded using domain-specific ontology. An optimization algorithm then retrieves the minimal set of hyperedges that constructs a precise, conceptually grounded context for the LLM. This method enables efficient retrieval while preserving the complex relationships between entities. OG-RAG applies to domains where fact-based reasoning is essential, particularly in tasks that require workflows or decision-making steps to follow predefined rules and procedures. These include industrial workflows in healthcare, legal, and agricultural sectors, as well as knowledge-driven tasks such as news journalism, investigative research, consulting and more. Our evaluations demonstrate that OG-RAG increases the recall of accurate facts by 55% and improves response correctness by 40% across four different LLMs. Additionally, OG-RAG enables 30% faster attribution of responses to context and boosts fact-based reasoning accuracy by 27% compared to baseline methods.
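
Retrieving a small set of hyperedges that covers the facts a query needs resembles the classic set cover problem. The sketch below uses the standard greedy approximation as a stand-in; it is an assumption for illustration, since the abstract does not specify the optimization algorithm, and the hyperedge names and node labels are hypothetical. Greedy selection does not guarantee a truly minimal set.

```python
from typing import Dict, List, Set


def greedy_hyperedge_cover(
    hyperedges: Dict[str, Set[str]], query_nodes: Set[str]
) -> List[str]:
    """Pick hyperedges until every query node is covered (greedy set cover).

    hyperedges: maps a hyperedge name to the ontology-grounded facts it contains.
    query_nodes: the facts relevant to the query.
    """
    if not hyperedges:
        return []
    uncovered = set(query_nodes)
    chosen: List[str] = []
    while uncovered:
        # Sorting makes tie-breaking deterministic (first name wins).
        best = max(sorted(hyperedges), key=lambda name: len(hyperedges[name] & uncovered))
        gained = hyperedges[best] & uncovered
        if not gained:
            break  # Remaining nodes appear in no hyperedge.
        chosen.append(best)
        uncovered -= gained
    return chosen
```

The selected hyperedges would then be serialized into the context passed to the LLM; how that context is formatted is outside this sketch.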

[NLP-113] Early Dementia Detection Using Multiple Spontaneous Speech Prompts: The PROCESS Challenge

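[Quick read]: This paper addresses the detection of early-stage dementia; the key is to identify early signs of cognitive decline through signal processing of spontaneous speech. The paper provides a new spontaneous speech corpus containing answers to three prompts designed by neurologists to better capture the speaker's cognitive state. The core of the approach is to train models on this corpus for classification and regression prediction of early dementia; the baseline models reach an F1-score of 55.0% on the classification task and a root mean square error (RMSE) of 2.98 on the regression task.

Link: https://arxiv.org/abs/2412.15230
Authors: Fuxiang Tao, Bahman Mirheidari, Madhurananda Pahar, Sophie Young, Yao Xiao, Hend Elghazaly, Fritz Peters, Caitlin Illingworth, Dorota Braun, Ronan O’Malley, Simon Bell, Daniel Blackburn, Fasih Haider, Saturnino Luz, Heidi Christensen
Institutions: Unknown
Keywords: Signal Processing Grand, Processing Grand Challenge, significant progression, making intervention, stage often ineffective
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 2 pages, no figure, conference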

Abstract:Dementia is associated with various cognitive impairments and typically manifests only after significant progression, making intervention at this stage often ineffective. To address this issue, the Prediction and Recognition of Cognitive Decline through Spontaneous Speech (PROCESS) Signal Processing Grand Challenge invites participants to focus on early-stage dementia detection. We provide a new spontaneous speech corpus for this challenge. This corpus includes answers from three prompts designed by neurologists to better capture the cognition of speakers. Our baseline models achieved an F1-score of 55.0% on the classification task and an RMSE of 2.98 on the regression task.
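
For readers less familiar with the two reported metrics, here is a minimal sketch of how an F1-score and an RMSE are computed. It assumes a binary F1 for a single positive class; the challenge may use a different averaging scheme over classes, which the abstract does not state, and the example values are made up.

```python
import math
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

An RMSE of 2.98 therefore means the baseline's predictions are off by roughly three units of the regression target in the root-mean-square sense.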

[NLP-114] TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch

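[Quick read]: This paper addresses the heavy demands that large-scale automatic speech recognition (ASR) models place on parameters, data, and compute during training and deployment, as well as the limitation that such models are single-purpose and costly. The key is the elastic mixture-of-experts model (eMoE), which is trained once and can then be scaled elastically to match deployment requirements, together with an unsupervised data creation and validation procedure used to collect millions of hours of diverse audio data for training. With these techniques the system achieves elastic deployment and reduces the character error rate (CER) on the SpeechIO test sets from 4.98% to 2.45%. Beyond Mandarin speech recognition, the model is also capable of multilingual, multi-dialect, emotion, gender, and sound event perception, which the authors call Automatic Speech Perception (ASP).

Link: https://arxiv.org/abs/2412.15622
Authors: Xingchen Song, Chengdong Liang, Binbin Zhang, Pengshen Zhang, ZiYu Wang, Youcheng Ma, Menglong Xu, Lin Wang, Di Wu, Fuping Pan, Dinghao Zhou, Zhendong Peng
Institutions: Tsinghua University; WeNet Community
Keywords: Large Automatic Speech, significant computational resources, Large Automatic, Speech Recognition, Automatic Speech Recognition
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Signal Processing (eess.SP)
Comments: Technical Report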

Abstract:Large Automatic Speech Recognition (ASR) models demand a vast number of parameters, copious amounts of data, and significant computational resources during the training process. However, such models can merely be deployed on high-compute cloud platforms and are only capable of performing speech recognition tasks. This leads to high costs and restricted capabilities. In this report, we initially propose the elastic mixture of the expert (eMoE) model. This model can be trained just once and then be elastically scaled in accordance with deployment requirements. Secondly, we devise an unsupervised data creation and validation procedure and gather millions of hours of audio data from diverse domains for training. Using these two techniques, our system achieves elastic deployment capabilities while reducing the Character Error Rate (CER) on the SpeechIO testsets from 4.98% to 2.45%. Thirdly, our model is not only competent in Mandarin speech recognition but also proficient in multilingual, multi-dialect, emotion, gender, and sound event perception. We refer to this as Automatic Speech Perception (ASP), and the perception results are presented in the experimental section.
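
The Character Error Rate quoted above is the character-level edit distance between the reference transcript and the recognizer output, divided by the reference length. The sketch below shows that standard definition; text normalization steps that benchmark toolkits usually apply are omitted, and the helper names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over characters (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped."""
    return (before - after) / before
```

Applied to the reported figures, `relative_reduction(4.98, 2.45)` is about 0.51, so the drop from 4.98% to 2.45% CER is roughly a halving of the error rate.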

[NLP-115] Transcribing and Translating Fast and Slow: Joint Speech Translation and Recognition ICASSP2025

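[Quick read]: This paper addresses simultaneous automatic speech recognition (ASR) and speech translation (ST), in particular real-time streaming processing on smart glasses in bilingual conversation scenarios. The key is the joint speech translation and recognition model (JSTAR), which uses a fast-slow cascaded encoder architecture and transducer-based multi-objective training that optimizes the ASR and ST objectives together. The paper also studies different pre-training strategies, including training a transducer-based streaming machine translation (MT) model for the first time and using it to initialize JSTAR's parameters, which significantly improves both BLEU scores and latency.

Link: https://arxiv.org/abs/2412.15415
Authors: Niko Moritz, Ruiming Xie, Yashesh Gaur, Ke Li, Simone Merello, Zeeshan Ahmed, Frank Seide, Christian Fuegen
Institutions: Meta AI
Keywords: automatic speech recognition, fast-slow cascaded encoder, cascaded encoder architecture, architecture for simultaneous, propose the joint
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: Submitted to ICASSP 2025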

Abstract:We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart-glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including training of a transducer-based streaming machine translation (MT) model for the first time and applying it for parameter initialization of JSTAR. We demonstrate superior performances of JSTAR compared to a strong cascaded ST model in both BLEU scores and latency.

Computer Vision

[CV-0] HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

[Quick read]: This paper addresses the challenge that monolithic vision-language models (VLMs) underperform compositional ones. The key is a holistic embedding module that converts visual and textual inputs into a shared space, so that large language models (LLMs) can process images in the same way as text. The paper also designs a multi-stage training strategy: the module first distills visual features from a pre-trained vision encoder and text embeddings from the LLM, the whole model then undergoes next-token prediction on multi-modal data to align the embeddings, and an instruction-tuning stage finally improves performance further.

Link: https://arxiv.org/abs/2412.16158
Authors: Chenxin Tao, Shiqian Su, Xizhou Zhu, Chenyu Zhang, Zhe Chen, Jiawen Liu, Wenhai Wang, Lewei Lu, Gao Huang, Yu Qiao, Jifeng Dai
Institutions: Tsinghua University; Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong; Johns Hopkins University; SenseTime Research; Nanjing University
Keywords: Monolithic VLMs, rapid advance, catalyzed the development, development of Vision-Language, holistic embedding module
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to the compositional ones but face the challenge of inferior performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown capable of interpreting images, when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs. Therefore, HoVLE introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts. Furthermore, a multi-stage training strategy is carefully designed to empower the holistic embedding module. It is first trained to distill visual features from a pre-trained vision encoder and text embeddings from the LLM, enabling large-scale training with unpaired random images and text tokens. The whole model further undergoes next-token prediction on multi-modal data to align the embeddings. Finally, an instruction-tuning stage is incorporated. Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks, outperforming previous monolithic models by a large margin. Model available at this https URL.

[CV-1] Personalized Representation from Personalized Generation

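[Quick read]: This paper addresses personalized vision tasks, which are both fine-grained and data-scarce; the key is to use personalized synthetic data for personalized representation learning. The paper proposes a contrastive learning approach that makes creative use of image generators to produce personalized synthetic data, improving personalized representation learning for downstream tasks ranging from recognition to segmentation. The core of the approach is to combine synthetic data with contrastive learning so as to encode knowledge about the target object and apply it flexibly to related downstream tasks.

Link: https://arxiv.org/abs/2412.16156
Authors: Shobhita Sundaram, Julia Chae, Yonglong Tian, Sara Beery, Phillip Isola
Institutions: MIT; OpenAI
Keywords: Modern vision models, Modern vision, vision models excel, excel at general, Modern
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: S.S. and J.C. contributed equally; S.B. and P.I. co-supervised. Project page: this https URL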

Abstract:Modern vision models excel at general purpose downstream tasks. It is unclear, however, how they may be used for personalized vision tasks, which are both fine-grained and data-scarce. Recent works have successfully applied synthetic data to general-purpose representation learning, while advances in T2I diffusion models have enabled the generation of personalized images from just a few real examples. Here, we explore a potential connection between these ideas, and formalize the challenge of using personalized synthetic data to learn personalized representations, which encode knowledge about an object of interest and may be flexibly applied to any downstream task relating to the target object. We introduce an evaluation suite for this challenge, including reformulations of two existing datasets and a novel dataset explicitly constructed for this purpose, and propose a contrastive learning approach that makes creative use of image generators. We show that our method improves personalized representation learning for diverse downstream tasks, from recognition to segmentation, and analyze characteristics of image generation approaches that are key to this gain.

[CV-2] Can Generative Video Models Help Pose Estimation?

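[Quick read]: This paper addresses pairwise pose estimation between images with little or no overlap, an open challenge in computer vision. Existing methods, even those trained on large-scale datasets, struggle because identifiable correspondences or visual overlap are missing. The key is a new method, InterPose, which leverages the rich priors encoded in pre-trained generative video models: it generates intermediate frames to create a dense visual transition, which significantly simplifies the pose estimation problem. Because current video models can still produce implausible motion or inconsistent geometry, the paper also introduces a self-consistency score that evaluates the consistency of pose predictions from sampled videos.

Link: https://arxiv.org/abs/2412.16155
Authors: Ruojin Cai, Jason Y. Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, Ricardo Martin-Brualla
Institutions: Google; Cornell University
Keywords: Pairwise pose estimation, Pairwise pose, computer vision, open challenge, challenge in computer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL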

Abstract:Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Existing methods, even those trained on large-scale datasets, struggle in these scenarios due to the lack of identifiable correspondences or visual overlap. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose, that leverages the rich priors encoded within pre-trained generative video models. We propose to use a video model to hallucinate intermediate frames between two input images, effectively creating a dense, visual transition, which significantly simplifies the problem of pose estimation. Since current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that evaluates the consistency of pose predictions from sampled videos. We demonstrate that our approach generalizes among three state-of-the-art video models and show consistent improvements over the state-of-the-art DUSt3R on four diverse datasets encompassing indoor, outdoor, and object-centric scenes. Our findings suggest a promising avenue for improving pose estimation models by leveraging large generative models trained on vast amounts of video data, which is more readily available than 3D data. See our project page for results: this https URL.
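
A self-consistency check over sampled videos could look like the sketch below. It assumes each sampled video yields one relative rotation estimate and picks the estimate that agrees best with all the others (a medoid under geodesic rotation distance). This selection rule is an illustrative assumption, not the score defined in the paper, and `rot_z` exists only to build example inputs.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def rot_z(degrees: float) -> Matrix:
    """Rotation about the z-axis, used here only to create example poses."""
    angle = math.radians(degrees)
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_angle(r_a: Matrix, r_b: Matrix) -> float:
    """Geodesic distance in radians: arccos((trace(R_a^T R_b) - 1) / 2)."""
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def most_consistent(rotations: Sequence[Matrix]) -> int:
    """Index of the estimate with the smallest total distance to the rest."""
    def total_distance(k: int) -> float:
        return sum(rotation_angle(rotations[k], other) for other in rotations)
    return min(range(len(rotations)), key=total_distance)
```

An estimate derived from a video with implausible motion tends to sit far from the others and is therefore unlikely to be selected.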

[CV-3] MotiF: Making Text Count in Image Animation with Motion Focal Loss

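[Quick read]: This paper addresses poor alignment between generated videos and text descriptions in text-guided image animation, particularly when motion is specified. The key is MotiF, which uses optical flow to generate a motion heatmap and weights the loss by motion intensity, steering the model toward regions with more motion and improving text alignment and motion generation. The paper also proposes the TI2V Bench dataset for robust evaluation of TI2V generation and validates MotiF through a human evaluation protocol.

Link: https://arxiv.org/abs/2412.16153
Authors: Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin
Institutions: Brown University; GenAI, Meta
Keywords: text-guided image animation, image animation, text-guided image, aims to generate, text description
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: TI2V Bench is released in this https URL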

Abstract:Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce MotiF, a simple yet effective approach that directs the model’s learning to the regions with more motion, thereby improving the text alignment and motion generation. We use optical flow to generate a motion heatmap and weight the loss according to the intensity of the motion. This modified objective leads to noticeable improvements and complements existing methods that utilize motion priors as model inputs. Additionally, due to the lack of a diverse benchmark for evaluating TI2V generation, we propose TI2V Bench, a dataset consisting of 320 image-text pairs for robust evaluation. We present a human evaluation protocol that asks the annotators to select an overall preference between two videos followed by their justifications. Through a comprehensive evaluation on TI2V Bench, MotiF outperforms nine open-sourced models, achieving an average preference of 72%. The TI2V Bench is released in this https URL.
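
Weighting a loss by motion intensity can be sketched as below. The flow field is assumed to be given as per-pixel `(dx, dy)` displacements; the normalization to [0, 1] and the `1 + motion_weight * heat` weighting are illustrative assumptions, since the abstract does not give the exact form of the motion focal loss.

```python
import math
from typing import List, Sequence, Tuple

Flow = Sequence[Sequence[Tuple[float, float]]]  # per-pixel (dx, dy)
Image = Sequence[Sequence[float]]


def motion_heatmap(flow: Flow) -> List[List[float]]:
    """Normalize per-pixel optical-flow magnitude to the range [0, 1]."""
    magnitudes = [[math.hypot(dx, dy) for dx, dy in row] for row in flow]
    peak = max(max(row) for row in magnitudes)
    if peak == 0.0:
        return [[0.0 for _ in row] for row in magnitudes]
    return [[m / peak for m in row] for row in magnitudes]


def motion_weighted_mse(
    pred: Image, target: Image, heatmap: Image, motion_weight: float = 1.0
) -> float:
    """Mean squared error in which moving pixels count more than static ones."""
    weighted_error = 0.0
    weight_total = 0.0
    for pred_row, target_row, heat_row in zip(pred, target, heatmap):
        for p, t, heat in zip(pred_row, target_row, heat_row):
            weight = 1.0 + motion_weight * heat
            weighted_error += weight * (p - t) ** 2
            weight_total += weight
    return weighted_error / weight_total
```

With this weighting, the same pixel error costs more when it falls on a moving region than on a static one, which is the intuition behind directing learning toward regions with more motion.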

[CV-4] Frequency Is What You Need: Word-frequency Masking Benefits Vision-Language Model Pre-training

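[Quick read]: This paper addresses how to train vision-language models (VLMs) more efficiently by reducing the size of the training set. The key is to adjust the masking strategy dynamically and to use word frequency information to optimize model performance. The study shows that the best masking strategy changes over training epochs and that, given enough epochs, word frequency information significantly improves performance. The proposed approach, Contrastive Language-Image Pre-training with word Frequency Masking (CLIPF), shows clear advantages as the number of input tokens decreases and further benefits from keeping word frequency balanced across part-of-speech categories.

Link: https://arxiv.org/abs/2412.16148
Authors: Mingliang Liang, Martha Larson
Institutions: Radboud University
Keywords: Vision Language Models, Vision Language, Language Models, reduced in size, word frequency
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: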

Abstract:Vision Language Models (VLMs) can be trained more efficiently if training sets can be reduced in size. Recent work has shown the benefits of masking text during VLM training using a variety of approaches: truncation, random masking, block masking and syntax masking. In this paper, we show that the best masking strategy changes over training epochs and that, given sufficient training epochs, word frequency information is what you need to achieve the best performance. Experiments on a large range of data sets demonstrate the advantages of our approach, called Contrastive Language-Image Pre-training with word Frequency Masking (CLIPF). The benefits are particularly evident as the number of input tokens decreases. We analyze the impact of CLIPF vs. other masking approaches on word frequency balance and discuss the apparently critical contribution of CLIPF in maintaining word frequency balance across POS categories.
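
Frequency-based text masking can be illustrated with the sketch below. It borrows the subsampling rule popularized by word2vec, where a word with relative frequency f is dropped with probability 1 - sqrt(t / f) for a threshold t. Whether CLIPF uses exactly this rule is an assumption here; the threshold value and the captions are made up.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Sequence


def mask_probabilities(captions: Sequence[str], threshold: float) -> Dict[str, float]:
    """Per-word masking probability: frequent words are masked more often."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    total = sum(counts.values())
    return {
        word: max(0.0, 1.0 - math.sqrt(threshold / (count / total)))
        for word, count in counts.items()
    }


def mask_caption(caption: str, probs: Dict[str, float], rng: random.Random) -> List[str]:
    """Drop each word independently with its masking probability."""
    return [
        word for word in caption.lower().split()
        if rng.random() >= probs.get(word, 0.0)
    ]
```

Under this rule, very common words such as articles are removed often while rare content words are always kept, so shorter captions retain most of their information.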

[CV-5] SeagrassFinder: Deep Learning for Eelgrass Detection and Coverage Estimation in the Wild

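[Quick read]: This paper addresses the fact that traditional manual analysis of underwater video to assess seagrass coverage is time-consuming and subjective. The key is to use deep learning models to automate seagrass detection and coverage estimation. The study builds a dataset of over 8,300 annotated underwater images and evaluates several deep learning architectures (ResNet, InceptionNetV3, DenseNet, and Vision Transformer) on a binary classification task ("Eelgrass Present" versus "Eelgrass Absent"). The results show that the models, the Vision Transformer in particular, predict eelgrass presence with high accuracy, with AUROC scores above 0.95. Transfer learning and the Deep WaveNet underwater image enhancement model improve performance further. The approach processes large volumes of video data efficiently and yields more detailed information on seagrass distribution than traditional manual methods, which is important for environmental impact assessments and monitoring programs.

Link: https://arxiv.org/abs/2412.16147
Authors: Jannik Elsäßer, Laura Weihl, Veronika Cheplygina, Lisbeth Tangaa Nielsen
Institutions: DHI Group; IT University of Copenhagen
Keywords: water quality improvement, Seagrass meadows play, providing important services, deep learning, deep learning models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: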

Abstract:Seagrass meadows play a crucial role in marine ecosystems, providing important services such as carbon sequestration, water quality improvement, and habitat provision. Monitoring the distribution and abundance of seagrass is essential for environmental impact assessments and conservation efforts. However, the current manual methods of analyzing underwater video transects to assess seagrass coverage are time-consuming and subjective. This work explores the use of deep learning models to automate the process of seagrass detection and coverage estimation from underwater video data. A dataset of over 8,300 annotated underwater images was created, and several deep learning architectures, including ResNet, InceptionNetV3, DenseNet, and Vision Transformer, were evaluated for the task of binary classification of "Eelgrass Present" and "Eelgrass Absent" images. The results demonstrate that deep learning models, particularly the Vision Transformer, can achieve high performance in predicting eelgrass presence, with AUROC scores exceeding 0.95 on the final test dataset. The use of transfer learning and the application of the Deep WaveNet underwater image enhancement model further improved the models’ capabilities. The proposed methodology allows for the efficient processing of large volumes of video data, enabling the acquisition of much more detailed information on seagrass distributions compared to current manual methods. This information is crucial for environmental impact assessments and monitoring programs, as seagrasses are important indicators of coastal ecosystem health. Overall, this project demonstrates the value that deep learning can bring to the field of marine ecology and environmental monitoring.
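
The AUROC figure can be read as the probability that a randomly chosen "Eelgrass Present" image receives a higher score than a randomly chosen "Eelgrass Absent" image. The sketch below computes it directly from that pairwise definition; it is quadratic in the number of images and meant for illustration only, and the labels and scores in any usage are made up.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A score above 0.95 therefore means that in more than 95% of such present/absent pairs the classifier ranks the image containing eelgrass higher.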

[CV-6] Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks

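[Quick read]: This paper addresses the inability of existing state-space models (SSMs) to model the spatially dependent characteristics of visual inputs effectively, a consequence of biases inherited from their roots in natural language processing. The key is to re-derive modern selective state-space techniques from a natively multidimensional formulation, yielding the Mamba2D model. Unlike existing methods, which handle 2D data through arbitrary combinations of 1D scan directions, Mamba2D uses a single 2D scan direction that natively accounts for both input dimensions and so captures spatial dependencies more effectively when constructing hidden states.

Link: https://arxiv.org/abs/2412.16146
Authors: Enis Baty, Alejandro Hernández Díaz, Chris Bridges, Rebecca Davidson, Steve Eckersley, Simon Hadfield
Institutions: Unknown
Keywords: long-standing transformer architecture, transformer architecture, recently emerged, powerful and efficient, efficient alternative
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: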

Abstract:State-Space Models (SSMs) have recently emerged as a powerful and efficient alternative to the long-standing transformer architecture. However, existing SSM conceptualizations retain deeply rooted biases from their roots in natural language processing. This constrains their ability to appropriately model the spatially-dependent characteristics of visual inputs. In this paper, we address these limitations by re-deriving modern selective state-space techniques, starting from a natively multidimensional formulation. Currently, prior works attempt to apply natively 1D SSMs to 2D data (i.e. images) by relying on arbitrary combinations of 1D scan directions to capture spatial dependencies. In contrast, Mamba2D improves upon this with a single 2D scan direction that factors in both dimensions of the input natively, effectively modelling spatial dependencies when constructing hidden states. Mamba2D shows comparable performance to prior adaptations of SSMs for vision tasks, on standard image classification evaluations with the ImageNet-1K dataset.

[CV-7] NeRF-To-Real Tester: Neural Radiance Fields as Test Image Generators for Vision of Autonomous Systems

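[Quick read]: This paper addresses the poor performance of autonomous systems (such as autonomous underwater vehicles and unmanned aerial vehicles) in real operating environments when their controllers overfit to simulation conditions. The key is to use Neural Radiance Fields to generate realistic and diverse test images and to integrate them into a metamorphic testing framework for vision components such as vSLAM and object detection. With the tool N2R-Tester, users can train models of custom scenes and render test images from perturbed positions, making it possible to evaluate how different vision components behave in realistic environments.

Link: https://arxiv.org/abs/2412.16141
Authors: Laura Weihl, Bilal Wehbe, Andrzej Wąsowski
Institutions: IT University of Copenhagen; Robotics Innovation Centre, DFKI
Keywords: quickly growing market, including surveying constructions, wind energy farms, applications including surveying, off-shore wind energy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: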

Abstract:Autonomous inspection of infrastructure on land and in water is a quickly growing market, with applications including surveying constructions, monitoring plants, and tracking environmental changes in on- and off-shore wind energy farms. For Autonomous Underwater Vehicles and Unmanned Aerial Vehicles overfitting of controllers to simulation conditions fundamentally leads to poor performance in the operation environment. There is a pressing need for more diverse and realistic test data that accurately represents the challenges faced by these systems. We address the challenge of generating perception test data for autonomous systems by leveraging Neural Radiance Fields to generate realistic and diverse test images, and integrating them into a metamorphic testing framework for vision components such as vSLAM and object detection. Our tool, N2R-Tester, allows training models of custom scenes and rendering test images from perturbed positions. An experimental evaluation of N2R-Tester on eight different vision components in AUVs and UAVs demonstrates the efficacy and versatility of the approach.

[CV-8] Camera-Based Localization and Enhanced Normalized Mutual Information

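[Quick read]: This paper addresses precise localization of autonomous vehicles in harsh environments using image data from a low-cost camera. The key is to improve existing matching algorithms so that they cope with noise in both the captured image and the global map, accounting for the perspective transformation imposed by physical constraints on camera placement and for the uncertainty introduced by environmental variation. The paper proposes novel modifications to the standard inner product (SIP) and normalized mutual information (NMI) algorithms, grounded in statistical signal processing, that aim to significantly improve performance in noisy environments. Numerical simulations confirm the effectiveness of these modifications.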

Link: https://arxiv.org/abs/2412.16137
Authors: Vishnu Teja Kunde,Jean-Francois Chamberland,Siddharth Agarwal
Affiliations: Texas A&M University; Ford Motor Company
Keywords: global map, localization algorithms, fine localization algorithms, captured image, fine global map
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Applications (stat.AP)
Comments:

Abstract:Robust and fine localization algorithms are crucial for autonomous driving. For the production of such vehicles as a commodity, affordable sensing solutions and reliable localization algorithms must be designed. This work considers scenarios where the sensor data comes from images captured by an inexpensive camera mounted on the vehicle and where the vehicle contains a fine global map. Such localization algorithms typically involve finding the section in the global map that best matches the captured image. In harsh environments, both the global map and the captured image can be noisy. Because of physical constraints on camera placement, the image captured by the camera can be viewed as a noisy perspective transformed version of the road in the global map. Thus, an optimal algorithm should take into account the unequal noise power in various regions of the captured image, and the intrinsic uncertainty in the global map due to environmental variations. This article briefly reviews two matching methods: (i) standard inner product (SIP) and (ii) normalized mutual information (NMI). It then proposes novel and principled modifications to improve the performance of these algorithms significantly in noisy environments. These enhancements are inspired by the physical constraints associated with autonomous vehicles. They are grounded in statistical signal processing and, in some context, are provably better. Numerical simulations demonstrate the effectiveness of such modifications.
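
For readers unfamiliar with the baseline matching score mentioned above, the following is a minimal sketch of plain normalized mutual information (NMI) between two quantized pixel sequences, used to find the best-matching offset of a patch in a 1D strip of a map. It does not include the paper's proposed modifications. The function names, the 1D simplification, and the convention for constant inputs are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (in nats) of a histogram given as raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def normalized_mutual_information(a, b):
    """NMI(A, B) = (H(A) + H(B)) / H(A, B) for two equal-length pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    h_a = entropy(Counter(a).values(), n)
    h_b = entropy(Counter(b).values(), n)
    h_ab = entropy(Counter(zip(a, b)).values(), n)
    # Assumed convention: two constant signals share no information, so return the minimum.
    return (h_a + h_b) / h_ab if h_ab > 0 else 1.0


def best_offset(map_row, patch):
    """Slide the patch along a 1D map strip; return the offset with the highest NMI."""
    width = len(patch)
    offsets = range(len(map_row) - width + 1)
    return max(
        offsets,
        key=lambda o: normalized_mutual_information(map_row[o:o + width], patch),
    )
```

With this definition the score ranges from 1 (independent signals) to 2 (each signal determines the other), so the best match is the window that maximizes it.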

[CV-9] LEDA: Log-Euclidean Diffeomorphic Autoencoder for Efficient Statistical Analysis of Diffeomorphism

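[Quick read]: This paper addresses the shortcomings of traditional approaches to analyzing the deformation fields produced by invertible deformable registration: they are computationally expensive, sensitive to initialization, and prone to numerical errors, especially when the deformation is far from the identity. The key is the Log-Euclidean Diffeomorphic Autoencoder (LEDA) framework, which computes the principal logarithm of deformation fields by efficiently predicting consecutive square roots and operates in a linearized latent space that respects the group action laws of diffeomorphisms, improving the model's robustness and applicability. An additional loss function enforces inverse consistency, ensuring accurate latent representations of the deformation fields.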

Link: https://arxiv.org/abs/2412.16129
Authors: Krithika Iyer,Shireen Elhabian,Sarang Joshi
Affiliations: unknown
Keywords: Image registration, core task, task in computational, computational anatomy, anatomy that establishes
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Image registration is a core task in computational anatomy that establishes correspondences between images. Invertible deformable registration, which computes a deformation field and handles complex, non-linear transformation, is essential for tracking anatomical variations, especially in neuroimaging applications where inter-subject differences and longitudinal changes are key. Analyzing the deformation fields is challenging due to their non-linearity, limiting statistical analysis. However, traditional approaches for analyzing deformation fields are computationally expensive, sensitive to initialization, and prone to numerical errors, especially when the deformation is far from the identity. To address these limitations, we propose the Log-Euclidean Diffeomorphic Autoencoder (LEDA), an innovative framework designed to compute the principal logarithm of deformation fields by efficiently predicting consecutive square roots. LEDA operates within a linearized latent space that adheres to the diffeomorphisms group action laws, enhancing our model’s robustness and applicability. We also introduce a loss function to enforce inverse consistency, ensuring accurate latent representations of deformation fields. Extensive experiments with the OASIS-1 dataset demonstrate the effectiveness of LEDA in accurately modeling and analyzing complex non-linear deformations while maintaining inverse consistency. Additionally, we evaluate its ability to capture and incorporate clinical variables, enhancing its relevance for clinical applications.
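
The idea of reaching a principal logarithm through consecutive square roots can be illustrated on matrices, where the identity log(A) = 2^k · log(A^(1/2^k)) holds and log(X) ≈ X − I once X is close to the identity. The sketch below is only an analogy under that assumption: it works on 2x2 matrices with a closed-form square root, whereas LEDA predicts square roots of deformation fields with a learned model. All names are hypothetical.

```python
import math


def sqrt_2x2(m):
    """Principal square root of a 2x2 matrix via the closed form (A + s*I) / t.

    Here s = sqrt(det A) and t = sqrt(trace A + 2*s); this requires det A > 0
    and trace A + 2*s > 0.
    """
    (a, b), (c, d) = m
    s = math.sqrt(a * d - b * c)
    t = math.sqrt(a + d + 2.0 * s)
    return [[(a + s) / t, b / t], [c / t, (d + s) / t]]


def log_via_square_roots(m, k=20):
    """Approximate log(m) as 2**k * (m**(1 / 2**k) - I) using k successive square roots."""
    x = m
    for _ in range(k):
        x = sqrt_2x2(x)
    scale = 2.0 ** k
    return [
        [scale * (x[0][0] - 1.0), scale * x[0][1]],
        [scale * x[1][0], scale * (x[1][1] - 1.0)],
    ]


# Usage: the logarithm of a rotation by theta is [[0, -theta], [theta, 0]].
theta = 1.0
rotation = [[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]]
log_rotation = log_via_square_roots(rotation)
```

The first-order approximation leaves a small residual on the diagonal that shrinks as k grows; larger k trades that truncation error against floating-point cancellation in X − I.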

[CV-10] Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts

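[Quick read]: This paper examines how well large language models (LLMs) such as GPT-4o perform optical character recognition (OCR) on low-resource scripts (Urdu, Albanian, and Tajik), with English serving as a benchmark, in order to address accessibility gaps in text digitization. The key contribution is to expose the limitations of zero-shot LLM-based OCR on linguistically complex scripts and to highlight the need for annotated datasets and fine-tuned models. The evaluation uses a curated dataset of 2,520 images with controlled variations in text length, font size, background color, and blur, and the findings are intended to pave the way for more inclusive and robust OCR solutions.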

Link: https://arxiv.org/abs/2412.16119
Authors: Muhammad Abdullah Sohail,Salaar Masood,Hamza Iqbal
Affiliations: Lahore University of Management Sciences
Keywords: Optical Character Recognition, Large Language Models, Character Recognition, Optical Character, potential of Large
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik, with English serving as a benchmark. Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges. Results emphasize the limitations of zero-shot LLM-based OCR, particularly for linguistically complex scripts, highlighting the need for annotated datasets and fine-tuned models. This work underscores the urgency of addressing accessibility gaps in text digitization, paving the way for inclusive and robust OCR solutions for underserved languages.

[CV-11] PruneVid: Visual Token Pruning for Efficient Video Large Language Models

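[Quick read]: This paper addresses the computational inefficiency caused by redundant video data in multi-modal video understanding. The key is a training-free visual token pruning method called PruneVid. It reduces video redundancy by merging spatial-temporal tokens and uses the reasoning ability of large language models (LLMs) to keep the visual features relevant to the question tokens while pruning the rest. Experiments show that PruneVid can prune more than 80% of tokens while maintaining competitive performance when combined with different model networks.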

Link: https://arxiv.org/abs/2412.16117
Authors: Xiaohu Huang,Hao Zhou,Kai Han
Affiliations: The University of Hong Kong; Baidu Inc.
Keywords: multi-modal video understanding, designed to enhance, Large Language Models, video understanding, pruning method designed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Efficient Video Large Language Models

Abstract:In this paper, we introduce PruneVid, a visual token pruning method designed to enhance the efficiency of multi-modal video understanding. Large Language Models (LLMs) have shown promising performance in video tasks due to their extended capabilities in comprehending visual modalities. However, the substantial redundancy in video data presents significant computational challenges for LLMs. To address this issue, we introduce a training-free method that 1) minimizes video redundancy by merging spatial-temporal tokens, and 2) leverages LLMs’ reasoning capabilities to selectively prune visual features relevant to question tokens, enhancing model efficiency. We validate our method across multiple video benchmarks, which demonstrate that PruneVid can prune over 80% of tokens while maintaining competitive performance combined with different model networks. This highlights its superior effectiveness and efficiency compared to existing pruning methods. Code: this https URL.
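
The two steps named in the abstract, merging redundant tokens and then keeping question-relevant ones, can be sketched in a few lines. This is a simplified illustration, not PruneVid's actual procedure: the cosine-similarity threshold, the averaging rule, and the externally supplied relevance scores are all assumptions, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_similar_neighbors(tokens, threshold=0.9):
    """Average runs of consecutive tokens whose neighbor similarity meets the threshold."""
    groups = []
    for token in tokens:
        if groups and cosine(groups[-1][-1], token) >= threshold:
            groups[-1].append(token)
        else:
            groups.append([token])
    return [[sum(column) / len(group) for column in zip(*group)] for group in groups]


def keep_most_relevant(tokens, scores, keep_ratio=0.2):
    """Keep the highest-scoring fraction of tokens, preserving their original order."""
    k = max(1, round(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]
```

In a real system the scores would come from the model (for example, attention between question and visual tokens); here they are simply passed in, and a keep ratio of 0.2 mirrors the "over 80% pruned" figure only loosely.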

[CV-12] CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

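[Quick read]: This paper addresses the significant latency that Diffusion Transformers (DiT) incur when generating high-resolution images, which stems from the quadratic complexity of attention mechanisms. The key is a linear attention mechanism for pre-trained DiTs, guided by four factors identified as crucial for successful linearization: locality, formulation consistency, high-rank attention maps, and feature integrity. Specifically, the paper proposes CLEAR, a convolution-like local attention strategy that restricts feature interactions to a local window around each query token and thereby achieves linear complexity. Experiments show that fine-tuning the attention layers on only 10K self-generated samples for 10K iterations effectively transfers knowledge from the pre-trained DiT to a student model with linear complexity, while reducing attention computation by 99.5% and accelerating 8K-resolution image generation by 6.3 times.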

Link: https://arxiv.org/abs/2412.16112
Authors: Songhua Liu,Zhenxiong Tan,Xinchao Wang
Affiliations: National University of Singapore
Keywords: Diffusion Transformers, leading architecture, attention, Diffusion, Transformers
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, we aim at a linear attention mechanism in this paper that reduces the complexity of pre-trained DiTs to linear. We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token, and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. Simultaneously, it reduces attention computations by 99.5% and accelerates generation by 6.3 times for generating 8K-resolution images. Furthermore, we investigate favorable properties in the distilled attention layers, such as zero-shot generalization cross various models and plugins, and improved support for multi-GPU parallel inference. Models and codes are available here: this https URL.
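
To see why restricting each query to a local window gives linear complexity, the sketch below lists, for every token on a 2D grid, the key tokens inside a fixed radius and counts the resulting query-key pairs. The circular window shape and the function names are assumptions for illustration; the abstract only states that interactions are limited to a local window around each query token.

```python
def local_window_indices(height, width, radius):
    """For each query token on a height x width grid, list key indices within `radius`.

    Tokens are indexed row-major; distance is Euclidean on grid coordinates.
    """
    reach = int(radius)
    neighbors = []
    for qy in range(height):
        for qx in range(width):
            keys = [
                ky * width + kx
                for ky in range(max(0, qy - reach), min(height, qy + reach + 1))
                for kx in range(max(0, qx - reach), min(width, qx + reach + 1))
                if (ky - qy) ** 2 + (kx - qx) ** 2 <= radius ** 2
            ]
            neighbors.append(keys)
    return neighbors


def attention_pairs(neighbors):
    """Number of query-key pairs actually computed under the local window."""
    return sum(len(keys) for keys in neighbors)


# Usage: compare local and full attention cost on a small grid.
grid = local_window_indices(4, 4, radius=1)
local_cost = attention_pairs(grid)   # grows linearly with the number of tokens
full_cost = (4 * 4) ** 2             # grows quadratically with the number of tokens
```

For a fixed radius, each query touches a bounded number of keys, so doubling the number of tokens roughly doubles `local_cost` while quadrupling `full_cost`.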

[CV-13] Demystifying the Potential of ChatGPT-4 Vision for Construction Progress Monitoring

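[Quick read]: This paper explores the use of Large Vision-Language Models (LVLMs) such as GPT-4 Vision for monitoring and tracking the progress of construction projects. The approach applies GPT-4 Vision to high-resolution aerial imagery of construction sites for detailed scene analysis and for tracking developmental changes over time. GPT-4 Vision proves proficient at identifying construction stages, materials, and machinery but struggles with precise object localization and segmentation, so the proposed future directions include domain-specific training and integration with other computer vision techniques and digital twins.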

Link: https://arxiv.org/abs/2412.16108
Authors: Ahmet Bahaddin Ersoz
Affiliations: unknown
Keywords: Large Vision-Language Models, Large Vision-Language, artificial intelligence, visual data, sectors has marked
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The integration of Large Vision-Language Models (LVLMs) such as OpenAI’s GPT-4 Vision into various sectors has marked a significant evolution in the field of artificial intelligence, particularly in the analysis and interpretation of visual data. This paper explores the practical application of GPT-4 Vision in the construction industry, focusing on its capabilities in monitoring and tracking the progress of construction projects. Utilizing high-resolution aerial imagery of construction sites, the study examines how GPT-4 Vision performs detailed scene analysis and tracks developmental changes over time. The findings demonstrate that while GPT-4 Vision is proficient in identifying construction stages, materials, and machinery, it faces challenges with precise object localization and segmentation. Despite these limitations, the potential for future advancements in this technology is considerable. This research not only highlights the current state and opportunities of using LVLMs in construction but also discusses future directions for enhancing the model’s utility through domain-specific training and integration with other computer vision techniques and digital twins.

[CV-14] Fair Distributed Machine Learning with Imbalanced Data as a Stackelberg Evolutionary Game

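[Quick read]: This paper addresses data imbalance in distributed learning, which is particularly pronounced in medical settings because of differing patient populations, technological inequalities, and divergent data collection practices. The key is two algorithms, the Deterministic Stackelberg Weighting Model (DSWM) and the Adaptive Stackelberg Weighting Model (ASWM), which set the weight of each node's contribution to the global model in every training round. ASWM improves the performance of nodes with smaller datasets by 2.713% in AUC, while nodes with larger datasets see an average performance decrease of only 0.441%.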

Link: https://arxiv.org/abs/2412.16079
Authors: Sebastian Niehaus,Ingo Roeder,Nico Scherf
Affiliations: Max Planck Institute for Human Cognitive and Brain Sciences, Germany; Institute for Medical Informatics and Biometry (IMB), TU Dresden, Germany; Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Germany
Keywords: Decentralised learning enables, centralising data sets, improved data privacy, data ownership policies, Decentralised learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Decentralised learning enables the training of deep learning algorithms without centralising data sets, resulting in benefits such as improved data privacy, operational efficiency and the fostering of data ownership policies. However, significant data imbalances pose a challenge in this framework. Participants with smaller datasets in distributed learning environments often achieve poorer results than participants with larger datasets. Data imbalances are particularly pronounced in medical fields and are caused by different patient populations, technological inequalities and divergent data collection practices. In this paper, we consider distributed learning as a Stackelberg evolutionary game. We present two algorithms for setting the weights of each node’s contribution to the global model in each training round: the Deterministic Stackelberg Weighting Model (DSWM) and the Adaptive Stackelberg Weighting Model (ASWM). We use three medical datasets to highlight the impact of dynamic weighting on underrepresented nodes in distributed learning. Our results show that the ASWM significantly favours underrepresented nodes by improving their performance by 2.713% in AUC. Meanwhile, nodes with larger datasets experience only a modest average performance decrease of 0.441%.
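
The abstract does not spell out how DSWM or ASWM compute their weights, so the sketch below only illustrates the general mechanism they act on: the global model is a weighted average of node parameters, and shifting weight away from pure data-size proportionality gives small nodes more influence. The `alpha` blending rule and all function names are hypothetical and are not the paper's algorithms.

```python
def blended_weights(dataset_sizes, alpha):
    """Interpolate between size-proportional weights (alpha=0) and uniform weights (alpha=1)."""
    total = sum(dataset_sizes)
    count = len(dataset_sizes)
    return [(1.0 - alpha) * size / total + alpha / count for size in dataset_sizes]


def aggregate(node_params, weights):
    """Weighted average of per-node parameter vectors, forming the global model."""
    total = sum(weights)
    dim = len(node_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, node_params)) / total
        for i in range(dim)
    ]


# Usage: one large node (900 samples) and one small node (100 samples).
sizes = [900, 100]
proportional = blended_weights(sizes, alpha=0.0)  # the large node dominates
balanced = blended_weights(sizes, alpha=0.5)      # the small node gains influence
global_model = aggregate([[1.0], [3.0]], balanced)
```

A dynamic scheme would presumably update the weights every training round based on observed node performance; this sketch keeps them fixed for clarity.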

[CV-15] SegCol Challenge: Semantic Segmentation for Tools and Fold Edges in Colonoscopy data ICIP MICCAI2024

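[Quick read]: This paper addresses the challenges of camera navigation and polyp detection during colonoscopy, which is used for early screening of colorectal cancer (CRC). The key is the SegCol Challenge, which introduces a dataset with pixel-level semantic annotations for colon folds and endoscopic tools, intended to improve depth perception and localization methods. By providing fold edges as anatomical landmarks, together with depth discontinuity information derived from the fold and tool labels, the dataset aims to drive innovation in colonoscopy navigation systems.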

Link: https://arxiv.org/abs/2412.16078
Authors: Xinwei Ju,Rema Daher,Razvan Caramalau,Baoru Huang,Danail Stoyanov,Francisco Vasconcelos
Affiliations: University College London; University of Manchester; University of Sheffield; University of Exeter
Keywords: cancer-related deaths worldwide, effective early screening, Colorectal cancer, early screening method, remains a leading
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 1 figure. Dataset introduction for the SegCol Challenge at MICCAI 2024. Full Challenge paper, including participant methods and evaluation results, will be released soon

Abstract:Colorectal cancer (CRC) remains a leading cause of cancer-related deaths worldwide, with polyp removal being an effective early screening method. However, navigating the colon for thorough polyp detection poses significant challenges. To advance camera navigation in colonoscopy, we propose the Semantic Segmentation for Tools and Fold Edges in Colonoscopy (SegCol) Challenge. This challenge introduces a dataset from the EndoMapper repository, featuring manually annotated, pixel-level semantic labels for colon folds and endoscopic tools across selected frames from 96 colonoscopy videos. By providing fold edges as anatomical landmarks and depth discontinuity information from both fold and tool labels, the dataset is aimed to improve depth perception and localization methods. Hosted as part of the Endovis Challenge at MICCAI 2024, SegCol aims to drive innovation in colonoscopy navigation systems. Details are available at this https URL, and code resources at this https URL .

[CV-16] Label-Efficient Data Augmentation with Video Diffusion Models for Guidewire Segmentation in Cardiac Fluoroscopy AAAI2025

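[Quick read]: This paper addresses accurate guidewire segmentation in interventional cardiac fluoroscopy videos, where deep learning methods require large amounts of annotated data. The key is the Segmentation-guided Frame-consistency Video Diffusion Model (SF-VD), which generates large collections of labeled fluoroscopy videos to augment the training data. SF-VD uses videos with limited annotations and models the scene distribution and the motion distribution separately: it first generates 2D fluoroscopy images with wires positioned according to an input mask, then progressively generates subsequent frames, with a frame-consistency strategy ensuring coherence between frames. A segmentation-guided mechanism adjusts wire contrast so that the synthesized images cover a diverse range of wire visibility, which improves guidewire segmentation.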

Link: https://arxiv.org/abs/2412.16050
Authors: Shaoyan Pan,Yikang Liu,Lin Zhao,Eric Z. Chen,Xiao Chen,Terrence Chen,Shanhui Sun
Affiliations: United Imaging Intelligence
Keywords: computer-aided navigation tasks, interventional cardiac fluoroscopy, navigation tasks, cardiac fluoroscopy videos, Video Diffusion Model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI 2025

Abstract:The accurate segmentation of guidewires in interventional cardiac fluoroscopy videos is crucial for computer-aided navigation tasks. Although deep learning methods have demonstrated high accuracy and robustness in wire segmentation, they require substantial annotated datasets for generalizability, underscoring the need for extensive labeled data to enhance model performance. To address this challenge, we propose the Segmentation-guided Frame-consistency Video Diffusion Model (SF-VD) to generate large collections of labeled fluoroscopy videos, augmenting the training data for wire segmentation networks. SF-VD leverages videos with limited annotations by independently modeling scene distribution and motion distribution. It first samples the scene distribution by generating 2D fluoroscopy images with wires positioned according to a specified input mask, and then samples the motion distribution by progressively generating subsequent frames, ensuring frame-to-frame coherence through a frame-consistency strategy. A segmentation-guided mechanism further refines the process by adjusting wire contrast, ensuring a diverse range of visibility in the synthesized image. Evaluation on a fluoroscopy dataset confirms the superior quality of the generated videos and shows significant improvements in guidewire segmentation.

[CV-17] Segmentation of arbitrary features in very high resolution remote sensing imagery

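[Quick read]: This paper addresses the lack of general-purpose automated processing and feature segmentation for very high resolution (VHR) remote sensing (RS) imagery. The key is EcoMapper, a scalable tool that automates geospatial data processing, deep learning (DL) model training, and inference, enabling segmentation of arbitrary features. Models trained with EcoMapper segmented two distinct features in a real-world UAV dataset with scores competitive with context-specific models. The study also proposes the Cording Index (CI), which derives the optimal ground sampling distance from feature size, and develops a comprehensive field survey methodology to ensure that DL methods can be applied effectively to the collected data.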

Link: https://arxiv.org/abs/2412.16046
Authors: Henry Cording,Yves Plancherel,Pablo Brito-Parada
Affiliations: Imperial College London; Vattenfall Europe Information Services GmbH
Keywords: high resolution, mapping through remote, remote sensing, countless domains, opportunity to inform
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Main article: 18 pages, 9 figures; appendix: 17 pages, 9 figures

Abstract:Very high resolution (VHR) mapping through remote sensing (RS) imagery presents a new opportunity to inform decision-making and sustainable practices in countless domains. Efficient processing of big VHR data requires automated tools applicable to numerous geographic regions and features. Contemporary RS studies address this challenge by employing deep learning (DL) models for specific datasets or features, which limits their applicability across contexts. The present research aims to overcome this limitation by introducing EcoMapper, a scalable solution to segment arbitrary features in VHR RS imagery. EcoMapper fully automates processing of geospatial data, DL model training, and inference. Models trained with EcoMapper successfully segmented two distinct features in a real-world UAV dataset, achieving scores competitive with prior studies which employed context-specific models. To evaluate EcoMapper, many additional models were trained on permutations of principal field survey characteristics (FSCs). A relationship was discovered allowing derivation of optimal ground sampling distance from feature size, termed Cording Index (CI). A comprehensive methodology for field surveys was developed to ensure DL methods can be applied effectively to collected data. The EcoMapper code accompanying this work is available at this https URL .

[CV-18] SafeCFG: Redirecting Harmful Classifier-Free Guidance for Safe Generation

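[Quick read]: This paper addresses the generation of harmful images by diffusion models (DMs) in text-to-image (T2I) tasks when classifier-free guidance (CFG) is used maliciously. The key is the Harmful Guidance Redirector (HGR), which redirects harmful CFG directions during image generation while preserving clean CFG directions, turning CFG into SafeCFG and achieving both high quality and high safety. HGR can handle multiple harmful CFG directions at once, removing harmful elements while keeping generation quality high. It can also detect image harmfulness, which supports unsupervised fine-tuning of safe diffusion models without predefined harmful or clean labels.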

Link: https://arxiv.org/abs/2412.16039
Authors: Jiadong Pan,Hongcheng Gao,Liang Li,Zheng-Jun Zha,Qingming Huang,Jiebo Luo
Affiliations: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; University of Science and Technology of China; University of Rochester
Keywords: CFG, demonstrated exceptional performance, demonstrated exceptional, HGR, harmful
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models (DMs) have demonstrated exceptional performance in text-to-image (T2I) tasks, leading to their widespread use. With the introduction of classifier-free guidance (CFG), the quality of images generated by DMs is improved. However, DMs can generate more harmful images by maliciously guiding the image generation process through CFG. Some safe guidance methods aim to mitigate the risk of generating harmful images but often reduce the quality of clean image generation. To address this issue, we introduce the Harmful Guidance Redirector (HGR), which redirects harmful CFG direction while preserving clean CFG direction during image generation, transforming CFG into SafeCFG and achieving high safety and quality generation. We train HGR to redirect multiple harmful CFG directions simultaneously, demonstrating its ability to eliminate various harmful elements while preserving high-quality generation. Additionally, we find that HGR can detect image harmfulness, allowing for unsupervised fine-tuning of safe diffusion models without pre-defined clean or harmful labels. Experimental results show that by incorporating HGR, images generated by diffusion models achieve both high quality and strong safety, and safe DMs trained through unsupervised methods according to the harmfulness detected by HGR also exhibit good safety performance. The codes will be publicly available.
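
As background for the "CFG direction" that HGR redirects, here is a minimal sketch of one common form of the standard classifier-free guidance update, in which the conditional and unconditional noise predictions are combined with a guidance scale. It shows only plain CFG under that assumed parameterization; how HGR modifies the direction is not specified in the abstract and is not shown. The function name is hypothetical.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine noise predictions as eps_uncond + scale * (eps_cond - eps_uncond).

    The difference (eps_cond - eps_uncond) is the guidance direction; scale=1
    recovers the conditional prediction and scale=0 the unconditional one.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Usage: a larger scale pushes the prediction further along the guidance direction.
guided = classifier_free_guidance([0.0, 0.0], [1.0, 2.0], scale=3.0)
```

Because the prompt enters only through the guidance direction, a malicious prompt steers generation through exactly this term, which is why redirecting it is a natural intervention point.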

[CV-19] CoCoGaussian: Leveraging Circle of Confusion for Gaussian Splatting from Defocused Images

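[Quick read]: This paper addresses inaccurate 3D reconstruction of real scenes caused by defocus blur under a finite depth of field. The key is CoCoGaussian, a Circle of Confusion (CoC)-aware 3D Gaussian Splatting (3DGS) method. CoCoGaussian models the CoC in a physically grounded way: it computes the CoC diameter from depth and learnable aperture information and generates multiple Gaussians to capture the CoC shape precisely. A learnable scaling factor adds robustness to unreliable depth in scenes with reflective or refractive surfaces. Experiments on synthetic and real-world datasets show state-of-the-art performance across multiple benchmarks.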

Link: https://arxiv.org/abs/2412.16028
Authors: Jungho Lee,Suhwan Cho,Taeoh Kim,Ho-Deok Jang,Minhyeok Lee,Geonho Cha,Dongyoon Wee,Dogyoon Lee,Sangyoun Lee
Affiliations: Yonsei University; Naver Cloud
Keywords: attracted significant attention, Confusion-aware Gaussian Splatting, Gaussian Splatting, view rendering, inspiring research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:3D Gaussian Splatting (3DGS) has attracted significant attention for its high-quality novel view rendering, inspiring research to address real-world challenges. While conventional methods depend on sharp images for accurate scene reconstruction, real-world scenarios are often affected by defocus blur due to finite depth of field, making it essential to account for realistic 3D scene representation. In this study, we propose CoCoGaussian, a Circle of Confusion-aware Gaussian Splatting that enables precise 3D scene representation using only defocused images. CoCoGaussian addresses the challenge of defocus blur by modeling the Circle of Confusion (CoC) through a physically grounded approach based on the principles of photographic defocus. Exploiting 3D Gaussians, we compute the CoC diameter from depth and learnable aperture information, generating multiple Gaussians to precisely capture the CoC shape. Furthermore, we introduce a learnable scaling factor to enhance robustness and provide more flexibility in handling unreliable depth in scenes with reflective or refractive surfaces. Experiments on both synthetic and real-world datasets demonstrate that CoCoGaussian achieves state-of-the-art performance across multiple benchmarks.
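
The dependence of the CoC diameter on depth and aperture can be made concrete with the textbook thin-lens formula. The sketch below assumes that model and fixed (not learned) camera parameters; the paper's exact parameterization, including its learnable aperture and scaling factor, may differ. The function and parameter names are hypothetical.

```python
def coc_diameter(depth, focus_distance, focal_length, aperture_diameter):
    """Thin-lens circle-of-confusion diameter on the sensor, in the same units as the inputs.

    c = A * |d - d_f| / d * f / (d_f - f), with d the scene depth, d_f the focus
    distance, f the focal length, and A the aperture diameter.
    """
    if depth <= 0 or focus_distance <= focal_length:
        raise ValueError("depth must be positive and focus_distance must exceed focal_length")
    return (
        aperture_diameter
        * abs(depth - focus_distance) / depth
        * focal_length / (focus_distance - focal_length)
    )


# Usage: a 50 mm lens at f/2 (25 mm aperture) focused at 2 m, all values in metres.
in_focus = coc_diameter(2.0, 2.0, 0.05, 0.025)  # a point on the focal plane is sharp
blurred = coc_diameter(4.0, 2.0, 0.05, 0.025)   # a point at 4 m spreads over about 0.32 mm
```

Points farther from the focal plane produce larger circles, which is the depth-dependent blur that the method models with multiple Gaussians.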

[CV-20] MR-GDINO: Efficient Open-World Continual Object Detection

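[Quick read]: This paper addresses catastrophic forgetting in open-world (OW) object detection models under continual learning, where detectors must handle old, new, and unseen categories. The key is a baseline called MR-GDINO, which uses memory and retrieval mechanisms within a highly scalable memory pool to mitigate forgetting on unseen categories. Experiments show that MR-GDINO largely mitigates forgetting while activating only 0.1% extra parameters and achieves state-of-the-art performance on old, new, and unseen categories.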

Link: https://arxiv.org/abs/2412.15979
Authors: Bowen Dong,Zitong Huang,Guanglei Yang,Lei Zhang,Wangmeng Zuo
Affiliations: Harbin Institute of Technology; The Hong Kong Polytechnic University
Keywords: continual learning methods, methods to improve, unseen categories, detection models show, continual learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL . Code is available at: this https URL

Abstract:Open-world (OW) recognition and detection models show strong zero- and few-shot adaptation abilities, inspiring their use as initializations in continual learning methods to improve performance. Despite promising results on seen classes, such OW abilities on unseen classes are largely degenerated due to catastrophic forgetting. To tackle this challenge, we propose an open-world continual object detection task, requiring detectors to generalize to old, new, and unseen categories in continual learning scenarios. Based on this task, we present a challenging yet practical OW-COD benchmark to assess detection abilities. The goal is to motivate OW detectors to simultaneously preserve learned classes, adapt to new classes, and maintain open-world capabilities under few-shot adaptations. To mitigate forgetting in unseen categories, we propose MR-GDINO, a strong, efficient and scalable baseline via memory and retrieval mechanisms within a highly scalable memory pool. Experimental results show that existing continual detectors suffer from severe forgetting for both seen and unseen categories. In contrast, MR-GDINO largely mitigates forgetting with only 0.1% activated extra parameters, achieving state-of-the-art performance for old, new, and unseen categories.

[CV-21] Self-Supervised Radiograph Anatomical Region Classification – How Clean Is Your Real-World Data?

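[Quick read]: This paper addresses the problem that anatomical region labels, on which clinical imaging workflows rely, may be missing from externally sourced data or contain data entry errors. The key is to apply self-supervised methods (such as SimCLR and BYOL) and supervised contrastive deep learning methods to classify 48,434 skeletal radiographs into 14 anatomical region classes, reaching a linear evaluation accuracy of 96.6% with a single model and 97.7% with an ensemble. With only a small amount of labeled data (1% of the training set), accuracy still reaches 92.2%, which suits low-label scenarios. The model can also be used to correct data entry errors: an expert radiologist's follow-up analysis of the test-set errors made by the best single model found that 35% were incorrect labels and 11% were out-of-domain images. Accounting for these raises the theoretical accuracy to 98.0% without an ensemble and 98.8% with one.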

Link: https://arxiv.org/abs/2412.15967
Authors: Simon Langer,Jessica Ritter,Rickmer Braren,Daniel Rueckert,Paul Hager
Affiliations: unknown
Keywords: Modern deep learning-based, learning-based clinical imaging, clinical imaging workflows, imaging workflows rely, deep learning-based clinical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures, 2 supplementary figures

Abstract:Modern deep learning-based clinical imaging workflows rely on accurate labels of the examined anatomical region. Knowing the anatomical region is required to select applicable downstream models and to effectively generate cohorts of high quality data for future medical and machine learning research efforts. However, this information may not be available in externally sourced data or generally contain data entry errors. To address this problem, we show the effectiveness of self-supervised methods such as SimCLR and BYOL as well as supervised contrastive deep learning methods in assigning one of 14 anatomical region classes in our in-house dataset of 48,434 skeletal radiographs. We achieve a strong linear evaluation accuracy of 96.6% with a single model and 97.7% using an ensemble approach. Furthermore, only a few labeled instances (1% of the training set) suffice to achieve an accuracy of 92.2%, enabling usage in low-label and thus low-resource scenarios. Our model can be used to correct data entry mistakes: a follow-up analysis of the test set errors of our best-performing single model by an expert radiologist identified 35% incorrect labels and 11% out-of-domain images. When accounted for, the radiograph anatomical region labelling performance increased – without and with an ensemble, respectively – to a theoretical accuracy of 98.0% and 98.8%.

[CV-22] Monkey Transfer Learning Can Improve Human Pose Estimation

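[Quick read]: This paper addresses the poor performance of existing pose estimation techniques in clinical settings, particularly on pathological movement patterns. The key is transfer learning from macaque monkey data, which exposes the network to a broader range of motion cues. The study finds that using data from another species improves the precision and recall of human pose estimation and sharply reduces the number of human training examples needed (1,000 versus 19,185). The authors suggest that this could improve pose estimation in clinical situations and that future work should further explore its utility in clinical populations.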

Link: https://arxiv.org/abs/2412.15966
Authors: Bradley Scott,Clarisse de Vries,Aiden Durrant,Nir Oren,Edward Chadwick,Dimitra Blana
Affiliations: University of Aberdeen; University of Glasgow
Keywords: pose estimation, pose, estimation, human pose estimation, human
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this study, we investigated whether transfer learning from macaque monkeys could improve human pose estimation. Current state-of-the-art pose estimation techniques, often employing deep neural networks, can match human annotation in non-clinical datasets. However, they underperform in novel situations, limiting their generalisability to clinical populations with pathological movement patterns. Clinical datasets are not widely available for AI training due to ethical challenges and a lack of data collection. We observe that data from other species may be able to bridge this gap by exposing the network to a broader range of motion cues. We found that utilising data from other species and undertaking transfer learning improved human pose estimation in terms of precision and recall compared to the benchmark, which was trained on humans only. Compared to the benchmark, fewer human training examples were needed for the transfer learning approach (1,000 vs 19,185). These results suggest that macaque pose estimation can improve human pose estimation in clinical situations. Future work should further explore the utility of pose estimation trained with monkey data in clinical populations.

[CV-23] Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation WACV

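[Quick read]: This paper addresses the difficulty of Image Difference Captioning (IDC) on real-world images, which stems from scarce training data and from the difficulty of capturing fine-grained differences between complex images. The key is a simple yet effective framework that adapts an existing image captioning model (BLIP2) to the IDC task and augments IDC datasets. Specifically, the paper introduces BLIP2IDC, a low-compute adaptation of BLIP2 that outperforms two-stream approaches by a significant margin. It also proposes synthetic augmentation to improve IDC models and shows that the strategy yields high-quality data, resulting in a challenging new dataset named Syned1.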

链接: https://arxiv.org/abs/2412.15939
作者: Gautier Evennou,Antoine Chaffin,Vivien Chappelier,Ewa Kijak
机构: IMATAG, France; IRISA, CNRS, France; LightOn, France
关键词: past years enabled, IDC, important scale, past years, years enabled
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:The rise of the generative models quality during the past years enabled the generation of edited variations of images at an important scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims to describe the differences between two images. While this task is successfully handled for simple 3D rendered images, it struggles on real-world images. The reason is twofold: the training data-scarcity, and the difficulty to capture fine-grained differences between complex images. To address those issues, we propose in this paper a simple yet effective framework to both adapt existing image captioning models to the IDC task and augment IDC datasets. We introduce BLIP2IDC, an adaptation of BLIP2 to the IDC task at low computational cost, and show it outperforms two-streams approaches by a significant margin on real-world IDC datasets. We also propose to use synthetic augmentation to improve the performance of IDC models in an agnostic fashion. We show that our synthetic augmentation strategy provides high quality data, leading to a challenging new dataset well-suited for IDC named Syned1.

[CV-24] MiniGPT-Pancreas: Multimodal Large Language Model for Pancreas Cancer Classification and Detection

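[Quick read]: This paper addresses the difficulty of pancreas radiological imaging, which is caused by the organ's small size, blurred boundaries, and variability in shape and position across patients. The key to the solution is MiniGPT-Pancreas, a Multimodal Large Language Model (MLLM) that integrates visual and textual information to support clinicians in pancreas cancer diagnosis. MiniGPT-v2 is fine-tuned with multimodal prompts that combine questions with computed tomography (CT) scans from the NIH and MSD datasets to perform pancreas detection, tumor classification, and tumor detection. MiniGPT-Pancreas reaches accuracy, precision, and recall of about 0.88 on pancreas cancer classification and an IoU of 0.595 (NIH) and 0.550 (MSD) on pancreas detection, but tumor detection (IoU 0.168) leaves clear room for improvement.

Link: https://arxiv.org/abs/2412.15925
Authors: Andrea Moglia, Elia Clement Nastasio, Luca Mainardi, Pietro Cerveri
Affiliations: Polytechnic University of Milan; University of Pavia
Keywords: Large Language Model, Multimodal Large Language, Pancreas radiological imaging, Medical Segmentation Decathlon, Pancreas
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: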

Abstract:Problem: Pancreas radiological imaging is challenging due to the small size, blurred boundaries, and variability of shape and position of the organ among patients. Goal: In this work we present MiniGPT-Pancreas, a Multimodal Large Language Model (MLLM), as an interactive chatbot to support clinicians in pancreas cancer diagnosis by integrating visual and textual information. Methods: MiniGPT-v2, a general-purpose MLLM, was fine-tuned in a cascaded way for pancreas detection, tumor classification, and tumor detection with multimodal prompts combining questions and computed tomography scans from the National Institute of Health (NIH), and Medical Segmentation Decathlon (MSD) datasets. The AbdomenCT-1k dataset was used to detect the liver, spleen, kidney, and pancreas. Results: MiniGPT-Pancreas achieved an Intersection over Union (IoU) of 0.595 and 0.550 for the detection of pancreas on NIH and MSD datasets, respectively. For the pancreas cancer classification task on the MSD dataset, accuracy, precision, and recall were 0.876, 0.874, and 0.878, respectively. When evaluating MiniGPT-Pancreas on the AbdomenCT-1k dataset for multi-organ detection, the IoU was 0.8399 for the liver, 0.722 for the kidney, 0.705 for the spleen, and 0.497 for the pancreas. For the pancreas tumor detection task, the IoU score was 0.168 on the MSD dataset. Conclusions: MiniGPT-Pancreas represents a promising solution to support clinicians in the classification of pancreas images with pancreas tumors. Future research is needed to improve the score on the detection task, especially for pancreas tumors.
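The detection results above are reported as Intersection over Union (IoU). As a reminder of what that number measures, here is a minimal sketch that assumes 2D axis-aligned bounding boxes in `(x_min, y_min, x_max, y_max)` form; the box convention and the function name `box_iou` are illustrative assumptions, not details taken from the paper.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Overlap rectangle; width and height clamp to zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Under this definition, an IoU of 0.168 means the predicted and reference boxes share only about one sixth of their combined area, which is consistent with the abstract's remark that tumor detection needs further work.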

[CV-25] Watertox: The Art of Simplicity in Universal Attacks A Cross-Model Framework for Robust Adversarial Generation

[Quick read]: This paper addresses the limited cross-model transferability and practical applicability of contemporary adversarial attack methods. The key to the solution is the Watertox framework, which relies on architectural diversity and precision-controlled perturbations. Watertox uses a two-stage Fast Gradient Sign Method that combines uniform baseline perturbations ($\epsilon_1 = 0.1$) with targeted enhancements ($\epsilon_2 = 0.4$), and it draws on an ensemble of complementary architectures, from VGG to ConvNeXt, whose outputs are combined through a voting mechanism. In the reported experiments, Watertox reduces model accuracy from 70.6% to 16.0% against state-of-the-art architectures, and zero-shot attacks achieve up to a 98.8% accuracy reduction against unseen architectures.

Link: https://arxiv.org/abs/2412.15924
Authors: Zhenghao Gao, Shengjie Xu, Meixi Chen, Fangyao Zhao
Affiliations: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China; School of Software Engineering, Huazhong University of Science and Technology, Wuhan 430074, China; Journalism and Information Communication School, Huazhong University of Science and Technology, Wuhan 430074, China; School of Integrated Circuit, Huazhong University of Science and Technology, Wuhan 430074, China
Keywords: Contemporary adversarial attack, Contemporary adversarial, practical applicability, Fast Gradient Sign, face significant limitations
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 18 pages, 4 figures, 3 tables. Advances a novel method for generating cross-model transferable adversarial perturbations through a two-stage FGSM process and architectural ensemble voting mechanism

Abstract:Contemporary adversarial attack methods face significant limitations in cross-model transferability and practical applicability. We present Watertox, an elegant adversarial attack framework achieving remarkable effectiveness through architectural diversity and precision-controlled perturbations. Our two-stage Fast Gradient Sign Method combines uniform baseline perturbations ($\epsilon_1 = 0.1$) with targeted enhancements ($\epsilon_2 = 0.4$). The framework leverages an ensemble of complementary architectures, from VGG to ConvNeXt, synthesizing diverse perspectives through an innovative voting mechanism. Against state-of-the-art architectures, Watertox reduces model accuracy from 70.6% to 16.0%, with zero-shot attacks achieving up to 98.8% accuracy reduction against unseen architectures. These results establish Watertox as a significant advancement in adversarial methodologies, with promising applications in visual security systems and CAPTCHA generation.
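To make the idea of a two-stage sign-gradient attack with ensemble voting more concrete, the following is a rough sketch and not the paper's algorithm. The abstract gives only the two step sizes; everything else here is an assumption made for illustration: gradients are passed in as plain lists (one per model), the vote is a per-pixel majority over gradient signs, and the second stage is applied only where all models agree.

```python
from typing import List

EPS_BASE = 0.1   # epsilon_1 from the abstract
EPS_BOOST = 0.4  # epsilon_2 from the abstract


def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def two_stage_fgsm(image: List[float], grads_per_model: List[List[float]]) -> List[float]:
    """Perturb `image` (pixels in [0, 1]) using loss gradients from several models.

    ASSUMPTIONS (not from the paper): majority vote over gradient signs, and
    the extra "targeted" step only on pixels where every model agrees.
    """
    n_models = len(grads_per_model)
    adversarial = []
    for i, pixel in enumerate(image):
        votes = sum(sign(g[i]) for g in grads_per_model)
        direction = sign(votes)
        step = EPS_BASE * direction          # stage 1: uniform baseline step
        if abs(votes) == n_models:           # stage 2: unanimous pixels only
            step += EPS_BOOST * direction
        adversarial.append(min(1.0, max(0.0, pixel + step)))  # keep a valid pixel range
    return adversarial
```

In a real attack the gradients would come from backpropagating a classification loss through each ensemble member; that part is deliberately left out of this sketch.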

[CV-26] CCNDF: Curvature Constrained Neural Distance Fields from 3D LiDAR Sequences ACCV2024

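[Quick read]: This paper addresses the difficulty of learning neural distance fields (NDF) for large-scale outdoor scenes, where ground-truth NDFs are unavailable as a supervision signal. The key to the solution is to use second-order derivatives of the signed distance field to improve neural field learning. This allows the signed distance to be estimated more accurately and gives a fuller account of the underlying geometry, overcoming the restriction of earlier methods to small-scale settings; the method compares favorably with prevalent methods on mapping and localization tasks.

Link: https://arxiv.org/abs/2412.15909
Authors: Akshit Singh, Karan Bhakuni, Rajendra Nagar
Affiliations: Unknown
Keywords: graphics downstream problems, Neural distance fields, downstream problems, powerful tool, tool for addressing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: ACCV 2024, Oral Presentation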

Abstract:Neural distance fields (NDF) have emerged as a powerful tool for addressing challenges in 3D computer vision and graphics downstream problems. While significant progress has been made to learn NDF from various kind of sensor data, a crucial aspect that demands attention is the supervision of neural fields during training as the ground-truth NDFs are not available for large-scale outdoor scenes. Previous works have utilized various forms of expected signed distance to guide model learning. Yet, these approaches often need to pay more attention to critical considerations of surface geometry and are limited to small-scale implementations. To this end, we propose a novel methodology leveraging second-order derivatives of the signed distance field for improved neural field learning. Our approach addresses limitations by accurately estimating signed distance, offering a more comprehensive understanding of underlying geometry. To assess the efficacy of our methodology, we conducted comparative evaluations against prevalent methods for mapping and localization tasks, which are primary application areas of NDF. Our results demonstrate the superiority of the proposed approach, highlighting its potential for advancing the capabilities of neural distance fields in computer vision and graphics applications.

[CV-27] NeuroPump: Simultaneous Geometric and Color Rectification for Underwater Images

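[Quick read]: This paper addresses the simultaneous rectification of geometric and color distortions in underwater image restoration. Previous studies typically restore either color or geometry. The proposed NeuroPump is a self-supervised method that explicitly models refraction, absorption, and scattering within a Neural Radiance Field (NeRF) pipeline, so that geometry and color are optimized and rectified together, as if the water were pumped out. Because the parameters are decoupled, the method can also synthesize novel views and optical effects by controlling them. To address the lack of real paired ground-truth images, the paper also proposes a 360 benchmark dataset with real paired images of scenes with and without water.

Link: https://arxiv.org/abs/2412.15890
Authors: Yue Guo, Haoxiang Liao, Haibin Ling, Bingyao Huang
Affiliations: Southwest University; Stony Brook University
Keywords: color distortions due, image restoration aims, restoration aims, aims to remove, distortions due
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: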

Abstract:Underwater image restoration aims to remove geometric and color distortions due to water refraction, absorption and scattering. Previous studies focus on restoring either color or the geometry, but to our best knowledge, not both. However, in practice it may be cumbersome to address the two rectifications one-by-one. In this paper, we propose NeuroPump, a self-supervised method to simultaneously optimize and rectify underwater geometry and color as if water were pumped out. The key idea is to explicitly model refraction, absorption and scattering in Neural Radiance Field (NeRF) pipeline, such that it not only performs simultaneous geometric and color rectification, but also enables to synthesize novel views and optical effects by controlling the decoupled parameters. In addition, to address issue of lack of real paired ground truth images, we propose an underwater 360 benchmark dataset that has real paired (i.e., with and without water) images. Our method clearly outperforms other baselines both quantitatively and qualitatively.

[CV-28] IRGS: Inter-Reflective Gaussian Splatting with 2D Gaussian Ray Tracing

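[Quick read]: This paper addresses inaccurate material and lighting estimation in inverse rendering, which results from the absence of a powerful Gaussian ray tracer. The key to the solution is inter-reflective Gaussian splatting (IRGS), which applies the full rendering equation without simplification and computes incident radiance on the fly with differentiable 2D Gaussian ray tracing, thereby capturing inter-reflection. The paper also presents an efficient optimization scheme to handle the computational cost of Monte Carlo sampling when evaluating the rendering equation, and a new strategy for querying indirect radiance when relighting the optimized scenes. Experiments indicate that IRGS models complex inter-reflection effects accurately.

Link: https://arxiv.org/abs/2412.15867
Authors: Chun Gu, Xiaofei Wei, Zixuan Zeng, Yuxuan Yao, Li Zhang
Affiliations: School of Data Science, Fudan University
Keywords: capturing secondary effects, accurately modeling visibility, modeling visibility, essential for capturing, capturing secondary
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL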

Abstract:In inverse rendering, accurately modeling visibility and indirect radiance for incident light is essential for capturing secondary effects. Due to the absence of a powerful Gaussian ray tracer, previous 3DGS-based methods have either adopted a simplified rendering equation or used learnable parameters to approximate incident light, resulting in inaccurate material and lighting estimations. To this end, we introduce inter-reflective Gaussian splatting (IRGS) for inverse rendering. To capture inter-reflection, we apply the full rendering equation without simplification and compute incident radiance on the fly using the proposed differentiable 2D Gaussian ray tracing. Additionally, we present an efficient optimization scheme to handle the computational demands of Monte Carlo sampling for rendering equation evaluation. Furthermore, we introduce a novel strategy for querying the indirect radiance of incident light when relighting the optimized scenes. Extensive experiments on multiple standard benchmarks validate the effectiveness of IRGS, demonstrating its capability to accurately model complex inter-reflection effects.

[CV-29] Semi-Supervised Adaptation of Diffusion Models for Handwritten Text Generation

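[Quick read]: This paper addresses the challenge of generating images in unseen writing styles in handwritten text generation (HTG). The key to the solution is an extension of a latent diffusion model (DM) that learns style conditioning with a masked autoencoder, so that the model can generate writing styles not seen during training. The paper also proposes a content encoder that allows different ways of conditioning on textual and calligraphic features, and it employs classifier-free guidance, examining its influence on the quality of the generated images. To adapt the model to a new unlabeled dataset, a semi-supervised training scheme is proposed. The approach is evaluated on the IAM database, with the RIMES database used to examine generation of data not seen during training.

Link: https://arxiv.org/abs/2412.15853
Authors: Kai Brandenbusch
Affiliations: Unknown
Keywords: readable handwritten text, handwritten text generation, handwritten text, readable handwritten, text generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: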

Abstract:The generation of images of realistic looking, readable handwritten text is a challenging task which is referred to as handwritten text generation (HTG). Given a string and examples from a writer, the goal is to synthesize an image depicting the correctly spelled word in handwriting with the calligraphic style of the desired writer. An important application of HTG is the generation of training images in order to adapt downstream models for new data sets. With their success in natural image generation, diffusion models (DMs) have become the state-of-the-art approach in HTG. In this work, we present an extension of a latent DM for HTG to enable generation of writing styles not seen during training by learning style conditioning with a masked auto encoder. Our proposed content encoder allows for different ways of conditioning the DM on textual and calligraphic features. Additionally, we employ classifier-free guidance and explore the influence on the quality of the generated training images. For adapting the model to a new unlabeled data set, we propose a semi-supervised training scheme. We evaluate our approach on the IAM-database and use the RIMES-database to examine the generation of data not seen during training achieving improvements in this particularly promising application of DMs for HTG.

[CV-30] Multi-dimensional Visual Prompt Enhanced Image Restoration via Mamba-Transformer Aggregation

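[Quick read]: This paper addresses the tension between model capability and computational burden that existing Transformer-based “all-in-one” image restoration models face when handling different degradation types and levels. The computational complexity of self-attention grows quadratically with image size, while most Mamba-based models scan the feature map only along the spatial dimension for global modeling and do not fully use channel information. The key to the solution is to combine the complementary strengths of Mamba and Transformer: Mamba's selective scanning mechanism captures long-range spatial dependencies with linear complexity, and Transformer self-attention is applied along the channel dimension, which avoids quadratic growth with the image's spatial size. The paper also proposes multi-dimensional prompt learning modules that learn prompt flows from multi-scale encoder/decoder layers to reveal the characteristics of different degradations and improve the restoration performance of the all-in-one model.

Link: https://arxiv.org/abs/2412.15845
Authors: Aiwen Jiang, Hourong Chen, Zhiwen Chen, Jihua Ye, Mingwen Wang
Affiliations: School of Digital Industry, Jiangxi Normal University; School of Computer and Information Engineering, Jiangxi Normal University
Keywords: Recent efforts, focused on developing, types and levels, levels within single, Recent
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: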

Abstract:Recent efforts on image restoration have focused on developing “all-in-one” models that can handle different degradation types and levels within single model. However, most of mainstream Transformer-based ones confronted with dilemma between model capabilities and computation burdens, since self-attention mechanism quadratically increase in computational complexity with respect to image size, and has inadequacies in capturing long-range dependencies. Most of Mamba-related ones solely scanned feature map in spatial dimension for global modeling, failing to fully utilize information in channel dimension. To address aforementioned problems, this paper has proposed to fully utilize complementary advantages from Mamba and Transformer without sacrificing computation efficiency. Specifically, the selective scanning mechanism of Mamba is employed to focus on spatial modeling, enabling capture long-range spatial dependencies under linear complexity. The self-attention mechanism of Transformer is applied to focus on channel modeling, avoiding high computation burdens that are in quadratic growth with image’s spatial dimensions. Moreover, to enrich informative prompts for effective image restoration, multi-dimensional prompt learning modules are proposed to learn prompt-flows from multi-scale encoder/decoder layers, benefiting for revealing underlying characteristic of various degradations from both spatial and channel perspectives, therefore, enhancing the capabilities of “all-in-one” model to solve various restoration tasks. Extensive experiment results on several image restoration benchmark tasks such as image denoising, dehazing, and deraining, have demonstrated that the proposed method can achieve new state-of-the-art performance, compared with many popular mainstream methods. Related source codes and pre-trained parameters will be public on github this https URL.
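The complexity argument in this abstract can be checked with simple arithmetic. The sketch below counts only the multiply-adds needed to form the attention map and ignores constant factors, projections, and multi-head details; the function names are illustrative and not taken from the paper.

```python
def spatial_attention_cost(height: int, width: int, channels: int) -> int:
    """Multiply-adds for the N x N attention map over N = H * W spatial tokens."""
    tokens = height * width
    return tokens * tokens * channels


def channel_attention_cost(height: int, width: int, channels: int) -> int:
    """Multiply-adds for the C x C attention map computed across channels."""
    tokens = height * width
    return channels * channels * tokens


# Doubling both image sides quadruples the number of pixels.
small = (64, 64, 64)
large = (128, 128, 64)
spatial_growth = spatial_attention_cost(*large) // spatial_attention_cost(*small)
channel_growth = channel_attention_cost(*large) // channel_attention_cost(*small)
```

With four times as many pixels, the spatial attention cost grows by a factor of 16 while the channel attention cost grows by a factor of 4, which is the linear-in-pixels behavior the abstract relies on when it assigns self-attention to the channel dimension.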

[CV-31] Efficient Curation of Invertebrate Image Datasets Using Feature Embeddings and Automatic Size Comparison

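[Quick read]: This paper addresses the curation of large-scale invertebrate image datasets, specifically datasets that contain multiple images of the same taxa or specimens against relatively uniform backgrounds. The key to the solution is to extract feature embeddings with pretrained deep neural networks and compare each embedding to the group prototype embedding in order to find the visually most distinct images. The paper also shows that a simple area-based size comparison detects many common erroneous images, such as images of detached body parts and misclassified samples. In addition, it proposes new metrics for evaluating human-in-the-loop outlier detection methods and publicly releases the implementation together with a benchmark dataset of annotated erroneous images.

Link: https://arxiv.org/abs/2412.15844
Authors: Mikko Impiö, Philipp M. Rehsen, Jenni Raitoharju
Affiliations: Finnish Environment Institute; University of Duisburg-Essen; University of Jyväskylä
Keywords: environmental monitoring purposes, computer vision assisted, vision assisted methods, gained interest, computer vision
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE CIETES 2025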

Abstract:The amount of image datasets collected for environmental monitoring purposes has increased in the past years as computer vision assisted methods have gained interest. Computer vision applications rely on high-quality datasets, making data curation important. However, data curation is often done ad-hoc and the methods used are rarely published. We present a method for curating large-scale image datasets of invertebrates that contain multiple images of the same taxa and/or specimens and have relatively uniform background in the images. Our approach is based on extracting feature embeddings with pretrained deep neural networks, and using these embeddings to find visually most distinct images by comparing their embeddings to the group prototype embedding. Also, we show that a simple area-based size comparison approach is able to find a lot of common erroneous images, such as images containing detached body parts and misclassified samples. In addition to the method, we propose using novel metrics for evaluating human-in-the-loop outlier detection methods. The implementations of the proposed curation methods, as well as a benchmark dataset containing annotated erroneous images, are publicly available in this https URL.
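The prototype comparison described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: the group prototype is taken to be the element-wise mean of the embeddings, distance is Euclidean, and the function name `rank_by_prototype_distance` is hypothetical. The paper may define the prototype or the distance differently.

```python
import math
from typing import Dict, List, Tuple


def rank_by_prototype_distance(embeddings: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return (image_id, distance) pairs, most distant from the group prototype first."""
    vectors = list(embeddings.values())
    dim = len(vectors[0])
    # Assumed prototype: the element-wise mean of the group's embeddings.
    prototype = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    scored = [(image_id, math.dist(vec, prototype)) for image_id, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A curator would then review only the top of the ranked list for each taxon or specimen group, which is where the human-in-the-loop step mentioned in the abstract comes in.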

[CV-32] Enhancing Generalized Few-Shot Semantic Segmentation via Effective Knowledge Transfer AAAI2025

[Quick read]: This paper addresses knowledge transfer in Generalized Few-Shot Semantic Segmentation (GFSS), where a distribution gap exists between base classes and novel classes. The key to the solution is two modules: a prototype modulation module that modulates novel class prototypes by exploiting correlations between base and novel classes, and a classifier calibration module that calibrates the weight distribution of the novel classifier according to that of the base classifier. To compensate for the lack of contextual information caused by the few novel class samples, the paper also introduces a context consistency learning scheme that transfers contextual knowledge from base to novel classes. Together these improve on the state of the art in the GFSS setting.

Link: https://arxiv.org/abs/2412.15835
Authors: Xinyue Chen, Miaojing Shi, Zijian Zhou, Lianghua He, Sophia Tsoka
Affiliations: University College London; Beijing Institute of Technology
Keywords: Generalized few-shot semantic, few-shot semantic segmentation, Generalized few-shot, classes, base
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2025

Abstract:Generalized few-shot semantic segmentation (GFSS) aims to segment objects of both base and novel classes, using sufficient samples of base classes and few samples of novel classes. Representative GFSS approaches typically employ a two-phase training scheme, involving base class pre-training followed by novel class fine-tuning, to learn the classifiers for base and novel classes respectively. Nevertheless, distribution gap exists between base and novel classes in this process. To narrow this gap, we exploit effective knowledge transfer from base to novel classes. First, a novel prototype modulation module is designed to modulate novel class prototypes by exploiting the correlations between base and novel classes. Second, a novel classifier calibration module is proposed to calibrate the weight distribution of the novel classifier according to that of the base classifier. Furthermore, existing GFSS approaches suffer from a lack of contextual information for novel classes due to their limited samples, we thereby introduce a context consistency learning scheme to transfer the contextual knowledge from base to novel classes. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate that our approach significantly enhances the state of the art in the GFSS setting. The code is available at: this https URL.

[CV-33] Robustness-enhanced Myoelectric Control with GAN-based Open-set Recognition

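[Quick read]: This paper addresses the variability and noise susceptibility of electromyography (EMG) signals used in human motion recognition and medical rehabilitation, which limit the reliability and stability of myoelectric control systems. The key to the solution is a framework based on Generative Adversarial Networks (GANs) whose discriminator enables open-set recognition: unknown actions are identified and rejected, which prevents misclassification and keeps the system stable. In the reported experiments, the method reaches 97.6% recognition accuracy on known actions and a 23.6% improvement in Active Error Rate (AER) after rejecting unknown actions; it is also computationally efficient and suitable for deployment on edge devices.

Link: https://arxiv.org/abs/2412.15819
Authors: Cheng Wang, Ziyang Feng, Pin Zhang, Manjiang Cao, Yiming Yuan, Tengfei Chang
Affiliations: Unknown
Keywords: noise significantly limit, human motion recognition, myoelectric control systems, Generative Adversarial Networks, signals are widely
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Comments: 11 pages, 14 figures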

Abstract:Electromyography (EMG) signals are widely used in human motion recognition and medical rehabilitation, yet their variability and susceptibility to noise significantly limit the reliability of myoelectric control systems. Existing recognition algorithms often fail to handle unfamiliar actions effectively, leading to system instability and errors. This paper proposes a novel framework based on Generative Adversarial Networks (GANs) to enhance the robustness and usability of myoelectric control systems by enabling open-set recognition. The method incorporates a GAN-based discriminator to identify and reject unknown actions, maintaining system stability by preventing misclassifications. Experimental evaluations on publicly available and self-collected datasets demonstrate a recognition accuracy of 97.6% for known actions and a 23.6% improvement in Active Error Rate (AER) after rejecting unknown actions. The proposed approach is computationally efficient and suitable for deployment on edge devices, making it practical for real-world applications.
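The open-set decision rule described above (reject unknown actions, classify the rest) might look like the following sketch. It assumes the discriminator outputs a score in [0, 1] where higher means "resembles a known gesture"; the score range, the `threshold` parameter, and the function name are hypothetical and not taken from the paper.

```python
from typing import List, Optional


def classify_or_reject(class_probs: List[float], discriminator_score: float,
                       threshold: float = 0.5) -> Optional[int]:
    """Return the predicted class index, or None when the sample is rejected.

    `discriminator_score` is assumed to lie in [0, 1], with higher values
    meaning "looks like a known gesture"; `threshold` is a hypothetical
    tuning parameter.
    """
    if discriminator_score < threshold:
        return None  # unknown action: issue no command rather than a wrong one
    return max(range(len(class_probs)), key=class_probs.__getitem__)
```

Returning `None` instead of forcing a class is what keeps an unfamiliar movement from triggering an unintended command in a control loop.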

[CV-34] Cross-Modal Few-Shot Learning with Second-Order Neural Ordinary Differential Equations

[Quick read]: This paper addresses overfitting in cross-modal few-shot learning, which is common when training examples are limited. The key to the solution is SONO, a method that uses Second-Order Neural Ordinary Differential Equations (Second-Order NODEs) to increase the model's expressive power and feature generalization. SONO pairs a Second-Order NODEs model with a cross-modal classifier in a simple architecture; the second-order formulation can approximate a broader class of functions. The cross-modal classifier is initialized with text embeddings derived from class-relevant prompts, which avoids frequent text encoder processing and speeds up training. The paper also uses text-based image augmentation, exploiting CLIP's image-text correlation to enrich the training data.

Link: https://arxiv.org/abs/2412.15813
Authors: Yi Zhang, Chun-Wun Cheng, Junyi He, Zhihai He, Carola-Bibiane Schönlieb, Yuyan Chen, Angelica I Aviles-Rivero
Affiliations: Unknown (Yi Zhang and Chun-Wun Cheng are marked as equal contributors)
Keywords: Ordinary Differential Equations, Neural Ordinary Differential, Second-Order Neural Ordinary, Differential Equations, Neural Ordinary
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce SONO, a novel method leveraging Second-Order Neural Ordinary Differential Equations (Second-Order NODEs) to enhance cross-modal few-shot learning. By employing a simple yet effective architecture consisting of a Second-Order NODEs model paired with a cross-modal classifier, SONO addresses the significant challenge of overfitting, which is common in few-shot scenarios due to limited training examples. Our second-order approach can approximate a broader class of functions, enhancing the model’s expressive power and feature generalization capabilities. We initialize our cross-modal classifier with text embeddings derived from class-relevant prompts, streamlining training efficiency by avoiding the need for frequent text encoder processing. Additionally, we utilize text-based image augmentation, exploiting CLIP’s robust image-text correlation to enrich training data significantly. Extensive experiments across multiple datasets demonstrate that SONO outperforms existing state-of-the-art methods in few-shot learning performance.
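For readers unfamiliar with second-order ODEs in this setting, the usual way to integrate $x'' = f(x, x')$ is to add the velocity $v = x'$ to the state and solve the first-order system $x' = v$, $v' = f(x, v)$. The sketch below shows that rewrite with an explicit Euler step. In a Second-Order NODE the function `accel` would be a neural network; here a harmonic oscillator stands in for it purely for illustration, and none of the names come from the paper.

```python
import math
from typing import Callable, Tuple


def integrate_second_order(accel: Callable[[float, float], float],
                           x0: float, v0: float, t_end: float,
                           steps: int) -> Tuple[float, float]:
    """Integrate x'' = accel(x, x') by rewriting it as a first-order system.

    The state is augmented with the velocity v = x', giving
    x' = v and v' = accel(x, v), which is stepped with explicit Euler.
    """
    dt = t_end / steps
    x, v = x0, v0
    for _ in range(steps):
        x, v = x + dt * v, v + dt * accel(x, v)
    return x, v


# Stand-in dynamics: harmonic oscillator x'' = -x, started at x = 1, v = 0.
x_final, v_final = integrate_second_order(lambda x, v: -x, 1.0, 0.0, math.pi, 10_000)
```

After half a period the oscillator should sit close to $x = -1$ with near-zero velocity, which is a quick sanity check on the integrator. Practical NODE implementations typically use adaptive solvers rather than fixed-step Euler.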

[CV-35] Diffusion-Based Conditional Image Editing through Optimized Inference with Guidance WACV2025

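[Quick read]: This paper addresses text-driven image-to-image translation, in particular how to generate a target image while preserving the structure and background of the source image. The key to the solution is a training-free method that guides generation with a combination of two objectives: maximizing the CLIP score with respect to the target prompt, so that the generated image matches the target text, and minimizing the structural distance to the source latent variable, so that the source structure is preserved. The guidance is applied by optimizing the target latent variable in the reverse process of the diffusion model, which improves both fidelity to the prompt and structural consistency.

Link: https://arxiv.org/abs/2412.15798
Authors: Hyunsoo Lee, Minsoo Kang, Bohyung Han
Affiliations: Unknown
Keywords: effective training-free approach, approach for text-driven, present a simple, simple but effective, effective training-free
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025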

Abstract:We present a simple but effective training-free approach for text-driven image-to-image translation based on a pretrained text-to-image diffusion model. Our goal is to generate an image that aligns with the target task while preserving the structure and background of a source image. To this end, we derive the representation guidance with a combination of two objectives: maximizing the similarity to the target prompt based on the CLIP score and minimizing the structural distance to the source latent variable. This guidance improves the fidelity of the generated target image to the given target prompt while maintaining the structure integrity of the source image. To incorporate the representation guidance component, we optimize the target latent variable of diffusion model’s reverse process with the guidance. Experimental results demonstrate that our method achieves outstanding image-to-image translation performance on various tasks when combined with the pretrained Stable Diffusion model.
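One way to read the two-objective guidance in this abstract is as a single loss on the target latent that is reduced by gradient steps during the reverse process. The formulation below is only a schematic under assumed notation: $z_t$ is the target latent at step $t$, $z_t^{\mathrm{src}}$ the corresponding source latent, $\hat{x}(z_t)$ an image estimate obtained from $z_t$, $p_{\mathrm{tgt}}$ the target prompt, $d$ a structural distance, and $\lambda$, $\eta$ hypothetical weighting and step-size parameters. None of these symbols are taken from the paper.

```latex
\mathcal{L}(z_t) = -\,\mathrm{sim}_{\mathrm{CLIP}}\big(\hat{x}(z_t),\, p_{\mathrm{tgt}}\big)
                   + \lambda\, d\big(z_t,\, z_t^{\mathrm{src}}\big),
\qquad
z_t \leftarrow z_t - \eta\, \nabla_{z_t} \mathcal{L}(z_t)
```

The first term pulls the result toward the target prompt and the second keeps it close to the source structure; $\lambda$ sets the trade-off between the two.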

[CV-36] Sparse Point Clouds Assisted Learned Image Compression

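[Quick read]: This paper addresses how multimodal sensor data, such as point clouds, can assist image compression in autonomous driving. The key to the solution is a framework that projects the sparse point cloud onto a 2D plane to obtain a sparse depth map, predicts the camera image from that depth map, and extracts multi-scale structural features from the predicted image. These features are fed into an existing learned image compression model as additional information, which consistently improves compression performance in the reported experiments. The framework is compatible with several mainstream learned image compression models.

Link: https://arxiv.org/abs/2412.15752
Authors: Yiheng Jiang, Haotian Zhang, Li Li, Dong Liu, Zhu Li
Affiliations: University of Science and Technology of China; University of Missouri, Kansas City
Keywords: data types exist, image compression, sensor data types, learned image compression, compression
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by TCSVT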

Abstract:In the field of autonomous driving, a variety of sensor data types exist, each representing different modalities of the same scene. Therefore, it is feasible to utilize data from other sensors to facilitate image compression. However, few techniques have explored the potential benefits of utilizing inter-modality correlations to enhance the image compression performance. In this paper, motivated by the recent success of learned image compression, we propose a new framework that uses sparse point clouds to assist in learned image compression in the autonomous driving scenario. We first project the 3D sparse point cloud onto a 2D plane, resulting in a sparse depth map. Utilizing this depth map, we proceed to predict camera images. Subsequently, we use these predicted images to extract multi-scale structural features. These features are then incorporated into learned image compression pipeline as additional information to improve the compression performance. Our proposed framework is compatible with various mainstream learned image compression models, and we validate our approach using different existing image compression methods. The experimental results show that incorporating point cloud assistance into the compression pipeline consistently enhances the performance.
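The first step of the pipeline, projecting a 3D point cloud onto a 2D plane to obtain a sparse depth map, can be sketched as follows. The sketch assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and points already expressed in the camera frame; the abstract does not state the camera model or calibration procedure, so these are illustrative assumptions, as is the rule that the nearest point wins when several fall on the same pixel.

```python
from typing import Dict, Iterable, Tuple

# Assumed convention: (x, y, z) in the camera frame, z pointing forward.
Point = Tuple[float, float, float]


def project_to_sparse_depth(points: Iterable[Point], fx: float, fy: float,
                            cx: float, cy: float, width: int,
                            height: int) -> Dict[Tuple[int, int], float]:
    """Project 3D points with a pinhole model; return {(row, col): depth}.

    Pixels that no point lands on are simply absent, which is what makes the
    depth map sparse. When several points hit one pixel, the nearest wins.
    """
    depth: Dict[Tuple[int, int], float] = {}
    for x, y, z in points:
        if z <= 0:
            continue  # point is behind the camera
        col = int(round(fx * x / z + cx))
        row = int(round(fy * y / z + cy))
        if 0 <= col < width and 0 <= row < height:
            key = (row, col)
            if key not in depth or z < depth[key]:
                depth[key] = z
    return depth
```

With a real LiDAR scan, the points would first be transformed from the sensor frame into the camera frame using the extrinsic calibration, a step omitted here.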

[CV-37] VORD: Visual Ordinal Calibration for Mitigating Object Hallucinations in Large Vision-Language Models

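[Quick read]: This paper addresses hallucinations in Large Vision-Language Models (LVLMs), that is, generated content that looks plausible but is inaccurate or inconsistent with the source content. The key to the solution is VORD, a method that mitigates hallucinations by calibrating token predictions based on ordinal relationships between modified image pairs. VORD comes in two forms: a minimalist training-free variant that eliminates implausible tokens using modified image pairs, and a trainable objective function that penalizes unlikely tokens. Experiments show that VORD gives better calibration and reduces object hallucinations across a wide range of LVLM benchmarks.

Link: https://arxiv.org/abs/2412.15739
Authors: Dexter Neo, Tsuhan Chen
Affiliations: National University of Singapore
Keywords: Large Vision-Language Models, large language models, made remarkable developments, Vision-Language Models, language models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: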

Abstract:Large Vision-Language Models (LVLMs) have made remarkable developments along with the recent surge of large language models. Despite their advancements, LVLMs have a tendency to generate plausible yet inaccurate or inconsistent information based on the provided source content. This phenomenon, also known as “hallucinations”, can have serious downstream implications during the deployment of LVLMs. To address this, we present VORD, a simple and effective method that alleviates hallucinations by calibrating token predictions based on ordinal relationships between modified image pairs. VORD is presented in two forms: 1.) a minimalist training-free variant which eliminates implausible tokens from modified image pairs, and 2.) a trainable objective function that penalizes unlikely tokens. Our experiments demonstrate that VORD delivers better calibration and effectively mitigates object hallucinations on a wide-range of LVLM benchmarks.
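The training-free variant can be pictured as a filter over next-token probabilities. The sketch below is one possible reading and not the paper's exact rule: it assumes a token is kept only if its probability given the original image is at least its probability given the modified image plus a `margin`, and it falls back to the original distribution if every token would be removed. The function name, the `margin` parameter, and the fallback are all hypothetical.

```python
from typing import List


def ordinal_filter(p_original: List[float], p_modified: List[float],
                   margin: float = 0.0) -> List[float]:
    """Zero out tokens that violate the assumed ordinal relation, then renormalize.

    ASSUMPTION: a plausible token should be at least as likely under the
    original image as under the modified one (plus `margin`).
    """
    kept = [p if p >= q + margin else 0.0 for p, q in zip(p_original, p_modified)]
    total = sum(kept)
    if total == 0.0:
        return list(p_original)  # nothing survived; leave the distribution unchanged
    return [k / total for k in kept]
```

The intuition is that a token whose probability rises when the visual evidence is altered is probably being driven by language priors rather than by the image.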

[CV-38] The Role of Recurrency in Image Segmentation for Noisy and Limited Sample Settings

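[Quick read]: This paper asks whether state-of-the-art computer vision models, such as feed-forward segmentation models, can be improved by adding recurrent mechanisms, which would bring them closer to the biological brain's ability to revise its decisions and outputs. The approach explores several types of recurrency (self-organizing, relational, and memory retrieval), each of which minimizes a specific energy function. The experiments, run on artificial and medical imaging data under high noise and few-shot settings, do not support the hypothesis that recurrent models perform better in these settings: the recurrent architectures alone are not sufficient to surpass feed-forward models, and further work is needed.

Link: https://arxiv.org/abs/2412.15734
Authors: David Calhas, João Marques, Arlindo L. Oliveira
Affiliations: Unknown
Keywords: inspired multiple advances, advances in machine, biological brain, inspired multiple, multiple advances
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages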

Abstract:The biological brain has inspired multiple advances in machine learning. However, most state-of-the-art models in computer vision do not operate like the human brain, simply because they are not capable of changing or improving their decisions/outputs based on a deeper analysis. The brain is recurrent, while these models are not. It is therefore relevant to explore what would be the impact of adding recurrent mechanisms to existing state-of-the-art architectures and to answer the question of whether recurrency can improve existing architectures. To this end, we build on a feed-forward segmentation model and explore multiple types of recurrency for image segmentation. We explore self-organizing, relational, and memory retrieval types of recurrency that minimize a specific energy function. In our experiments, we tested these models on artificial and medical imaging data, while analyzing the impact of high levels of noise and few-shot learning settings. Our results do not validate our initial hypothesis that recurrent models should perform better in these settings, suggesting that these recurrent architectures, by themselves, are not sufficient to surpass state-of-the-art feed-forward versions and that additional work needs to be done on the topic.

[CV-39] Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

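[Quick read]: This paper addresses the difficulty existing multimodal trackers have in capturing the dynamic changes and motion information of targets in complex scenes, because they focus on fusing spatial features or use only sparse temporal relationships between frames. The key is STTrack, a unified multimodal spatial-temporal tracking approach. It introduces a temporal state generator (TSG) that continuously produces a sequence of tokens containing multimodal temporal information; these tokens guide target localization in the next time state, establish long-range contextual relationships between frames, and capture the target's temporal trajectory. At the spatial level, the mamba fusion and background suppression interactive (BSI) modules form a dual-stage mechanism that coordinates information interaction and fusion between modalities.

Link: https://arxiv.org/abs/2412.15691
Authors: Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, Jian Yang
Affiliations: 1. Huazhong University of Science and Technology; 2. Tencent Youtu Lab; 3. University of Electronic Science and Technology of China
Keywords: traditional RGB tracking, garnered widespread attention, traditional RGB, RGB tracking, garnered widespread
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract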

Abstract:Multimodal tracking has garnered widespread attention as a result of its ability to effectively address the inherent limitations of traditional RGB tracking. However, existing multimodal trackers mainly focus on the fusion and enhancement of spatial features or merely leverage the sparse temporal relationships between video frames. These approaches do not fully exploit the temporal correlations in multimodal videos, making it difficult to capture the dynamic changes and motion information of targets in complex scenarios. To alleviate this problem, we propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms that solely relied on updating reference information, we introduced a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target. Furthermore, at the spatial level, we introduced the mamba fusion and background suppression interactive (BSI) modules. These modules establish a dual-stage mechanism for coordinating information interaction and fusion between modalities. Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios. Code is available at: this https URL.

[CV-40] DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

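[Quick read]: This paper addresses the low computational efficiency of diffusion probabilistic models for video generation, which stems from the large number of sampling steps they require. The key is a distillation method that combines variational score distillation and consistency distillation, reducing the number of sampling steps while preserving both quality and diversity. The paper also proposes a latent reward model fine-tuning approach that further improves generation according to any specified reward metric; it reduces memory usage and does not require the reward to be differentiable. With these techniques, the method achieves state-of-the-art few-step generation for 10-second videos (128 frames at 12 FPS).

Link: https://arxiv.org/abs/2412.15689
Authors: Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Krishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, Yuchen Liu
Affiliations: Unknown
Keywords: shown significant progress, sampling steps required, shown significant, significant progress, computational efficiency
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract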

Abstract:Diffusion probabilistic models have shown significant progress in video generation; however, their computational efficiency is limited by the large number of sampling steps required. Reducing sampling steps often compromises video quality or generation diversity. In this work, we introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation, maintaining both high quality and diversity. We also propose a latent reward model fine-tuning approach to further enhance video generation performance according to any specified reward metric. This approach reduces memory usage and does not require the reward to be differentiable. Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS). The distilled student model achieves a score of 82.57 on VBench, surpassing the teacher model as well as baseline models Gen-3, T2V-Turbo, and Kling. One-step distillation accelerates the teacher model’s diffusion sampling by up to 278.6 times, enabling near real-time generation. Human evaluations further validate the superior performance of our 4-step student models compared to teacher model using 50-step DDIM sampling.

[CV-41] Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network AAAI2025

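[Quick read]: This paper addresses the fact that existing temporal sentence grounding (TSG) methods ignore the relationships between different video-query pairs during training. They follow a single-thread framework in which each pair is trained separately, so knowledge cannot be shared or transferred and redundant information is re-acquired repeatedly, which limits real-world efficiency. The paper poses a new Multi-Pair TSG setting and proposes the Multi-Thread Knowledge Transfer Network to co-train different video-query pairs. The key components are a cross-modal contrast module, which explores semantic consistency through a self-supervised strategy, and a prototype alignment strategy, which aligns visual and textual representations along the spatial and temporal dimensions. An adaptive negative selection module additionally generates the threshold for cross-modal matching.

Link: https://arxiv.org/abs/2412.15678
Authors: Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, Beibei Li
Affiliations: 1. Huazhong University of Science and Technology; 2. Wuhan University; 3. Alibaba Group; 4. Tsinghua University; 5. ByteDance; 6. Zhejiang University
Keywords: locate query-relevant segments, video-query pairs, query-relevant segments, TSG, pairs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Click to view abstract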

Abstract:Given some video-query pairs with untrimmed videos and sentence queries, temporal sentence grounding (TSG) aims to locate query-relevant segments in these videos. Although previous respectable TSG methods have achieved remarkable success, they train each video-query pair separately and ignore the relationship between different pairs. We observe that the similar video/query content not only helps the TSG model better understand and generalize the cross-modal representation but also assists the model in locating some complex video-query pairs. Previous methods follow a single-thread framework that cannot co-train different pairs and usually spends much time re-obtaining redundant knowledge, limiting their real-world applications. To this end, in this paper, we pose a brand-new setting: Multi-Pair TSG, which aims to co-train these pairs. In particular, we propose a novel video-query co-training approach, Multi-Thread Knowledge Transfer Network, to locate a variety of video-query pairs effectively and efficiently. Firstly, we mine the spatial and temporal semantics across different queries to cooperate with each other. To learn intra- and inter-modal representations simultaneously, we design a cross-modal contrast module to explore the semantic consistency by a self-supervised strategy. To fully align visual and textual representations between different pairs, we design a prototype alignment strategy to 1) match object prototypes and phrase prototypes for spatial alignment, and 2) align activity prototypes and sentence prototypes for temporal alignment. Finally, we develop an adaptive negative selection module to adaptively generate a threshold for cross-modal matching. Extensive experiments show the effectiveness and efficiency of our proposed method.

[CV-42] AI-generated Image Quality Assessment in Visual Communication AAAI-2025

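[Quick read]: This paper addresses the problem that traditional image quality assessment (IQA) algorithms focus mainly on low-level visual perception, while existing IQA work on AI-generated images (AIGIs) overemphasizes the generated content itself and neglects its effectiveness in real-world applications. The key contribution is AIGI-VC, a database that studies the communicability of AIGIs in advertising from the perspectives of information clarity and emotional interaction. AIGI-VC contains 2,500 images spanning 14 advertisement topics and 8 emotion types, with coarse-grained human preference annotations and fine-grained preference descriptions for benchmarking IQA methods on preference prediction, interpretation, and reasoning. An empirical study of existing IQA methods and large multi-modal models on AIGI-VC reveals their strengths and weaknesses.

Link: https://arxiv.org/abs/2412.15677
Authors: Yu Tian, Yixuan Li, Baoliang Chen, Hanwei Zhu, Shiqi Wang, Sam Kwong
Affiliations: 1. School of Electrical and Information Engineering, Tianjin University, Tianjin, China; 2. School of Computer Science and Technology, Tianjin University, Tianjin, China; 3. Department of Computer Science, City University of Hong Kong, Hong Kong, China
Keywords: artificial intelligence-generated images, plays a crucial, artificial intelligence-generated, crucial role, real-world scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI-2025; Project page: this https URL

Click to view abstract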

Abstract:Assessing the quality of artificial intelligence-generated images (AIGIs) plays a crucial role in their application in real-world scenarios. However, traditional image quality assessment (IQA) algorithms primarily focus on low-level visual perception, while existing IQA works on AIGIs overemphasize the generated content itself, neglecting its effectiveness in real-world applications. To bridge this gap, we propose AIGI-VC, a quality assessment database for AI-Generated Images in Visual Communication, which studies the communicability of AIGIs in the advertising field from the perspectives of information clarity and emotional interaction. The dataset consists of 2,500 images spanning 14 advertisement topics and 8 emotion types. It provides coarse-grained human preference annotations and fine-grained preference descriptions, benchmarking the abilities of IQA methods in preference prediction, interpretation, and reasoning. We conduct an empirical study of existing representative IQA methods and large multi-modal models on the AIGI-VC dataset, uncovering their strengths and weaknesses.

[CV-43] PersonaMagic: Stage-Regulated High-Fidelity Face Customization with Tandem Equilibrium AAAI2025

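[Quick read]: This paper addresses the challenge in personalized image generation of balancing accurate reconstruction of new concepts with prompt-driven editability, especially for the complex details of facial features. The key is to exploit stage partitioning and the temporal dynamics of text-to-image conditioning, realized in PersonaMagic, a technique for high-fidelity face customization. PersonaMagic uses a simple MLP network to learn a series of embeddings within a specific timestep interval to capture face concepts, and adds a Tandem Equilibrium mechanism that adjusts self-attention responses in the text encoder to balance text description and identity preservation. The method outperforms state-of-the-art approaches in qualitative and quantitative evaluations, remains robust and flexible in non-facial domains, and can serve as a plug-in that enhances pretrained personalization models.

Link: https://arxiv.org/abs/2412.15674
Authors: Xinzhe Li, Jiahui Zhan, Shengfeng He, Yangyang Xu, Junyu Dong, Huaidong Zhang, Yong Du
Affiliations: Unknown
Keywords: Personalized image generation, made significant strides, Personalized image, image generation, generation has made
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by AAAI 2025. The code is available at this https URL

Click to view abstract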

Abstract:Personalized image generation has made significant strides in adapting content to novel concepts. However, a persistent challenge remains: balancing the accurate reconstruction of unseen concepts with the need for editability according to the prompt, especially when dealing with the complex nuances of facial features. In this study, we delve into the temporal dynamics of the text-to-image conditioning process, emphasizing the crucial role of stage partitioning in introducing new concepts. We present PersonaMagic, a stage-regulated generative technique designed for high-fidelity face customization. Using a simple MLP network, our method learns a series of embeddings within a specific timestep interval to capture face concepts. Additionally, we develop a Tandem Equilibrium mechanism that adjusts self-attention responses in the text encoder, balancing text description and identity preservation, improving both areas. Extensive experiments confirm the superiority of PersonaMagic over state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, its robustness and flexibility are validated in non-facial domains, and it can also serve as a valuable plug-in for enhancing the performance of pretrained personalization models.

[CV-44] Learning Group Interactions and Semantic Intentions for Multi-Object Trajectory Prediction

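[Quick read]: This paper addresses how group interactions and dynamic semantic intentions affect behavior forecasting (such as trajectories or movements) in complex scenes. The key is a diffusion-based trajectory prediction framework that integrates group-level interactions into a conditional diffusion model, generating diverse trajectories consistent with a specific group activity. Group interaction prediction is framed as a cooperative game in which Banzhaf interaction models cooperation trends, and the resulting semantic intentions are fused with enhanced agent embeddings. The paper also extends the NBA SportVU dataset with human annotations of team-level tactics for trajectory and tactic prediction tasks.

Link: https://arxiv.org/abs/2412.15673
Authors: Mengshi Qi, Yuxin Yang, Huadong Ma
Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Keywords: Effective modeling, crucial for forecasting, forecasting behaviors, dynamic semantic intentions, Effective
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract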

Abstract:Effective modeling of group interactions and dynamic semantic intentions is crucial for forecasting behaviors like trajectories or movements. In complex scenarios like sports, agents’ trajectories are influenced by group interactions and intentions, including team strategies and opponent actions. To this end, we propose a novel diffusion-based trajectory prediction framework that integrates group-level interactions into a conditional diffusion model, enabling the generation of diverse trajectories aligned with specific group activity. To capture dynamic semantic intentions, we frame group interaction prediction as a cooperative game, using Banzhaf interaction to model cooperation trends. We then fuse semantic intentions with enhanced agent embeddings, which are refined through both global and local aggregation. Furthermore, we expand the NBA SportVU dataset by adding human annotations of team-level tactics for trajectory and tactic prediction tasks. Extensive experiments on three widely-adopted datasets demonstrate that our model outperforms state-of-the-art methods. Our source code and data are available at this https URL.
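The abstract frames group interaction as a cooperative game scored with Banzhaf interaction. As a minimal sketch of that quantity (not the paper's implementation), the pairwise Banzhaf interaction index averages, over all coalitions S that exclude players i and j, the synergy v(S∪{i,j}) − v(S∪{i}) − v(S∪{j}) + v(S). The toy game `toy_value` and the player names below are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def banzhaf_interaction(players, value, i, j):
    """Pairwise Banzhaf interaction index of players i and j in the game `value`."""
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    # Enumerate every coalition S that contains neither i nor j.
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
    # Average over the 2^(n-2) coalitions.
    return total / (2 ** len(others))


def toy_value(coalition):
    """Hypothetical game: 1 point per member, plus 10 if 'a' and 'b' cooperate."""
    bonus = 10.0 if {"a", "b"} <= coalition else 0.0
    return bonus + float(len(coalition))


players = ["a", "b", "c"]
synergy_ab = banzhaf_interaction(players, toy_value, "a", "b")
synergy_ac = banzhaf_interaction(players, toy_value, "a", "c")
print(synergy_ab, synergy_ac)
```

A positive index indicates that the two agents contribute more together than separately, which is the sense in which the index can describe a cooperation trend. The enumeration is exponential in the number of players, so this form only suits small groups.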

[CV-45] Adaptive Hierarchical Graph Cut for Multi-granularity Out-of-distribution Detection

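[Quick read]: This paper addresses out-of-distribution (OOD) detection: distinguishing and rejecting test samples with semantic shifts so that models trained on in-distribution (ID) data do not produce unreliable predictions. The key is the Adaptive Hierarchical Graph Cut network (AHGC), which builds a hierarchical KNN graph to explore semantic relationships between images. AHGC measures image similarity by cosine similarity and cuts the graph into subgraphs based on linkage and density information, grouping semantically similar samples. If the labeled percentage in a subgraph exceeds a threshold, the label with the highest percentage is assigned to the unlabeled images. Each image is also augmented into two versions whose similarity is maximized to improve generalization, and the similarity score is used for OOD detection. In representative cases, AHGC outperforms state-of-the-art OOD detection methods by 81.24% on CIFAR-100 and by 40.47% on CIFAR-10 in terms of FPR95.

Link: https://arxiv.org/abs/2412.15668
Authors: Xiang Fang, Arvind Easwaran, Blaise Genest, Ponnuthurai Nagaratnam Suganthan
Affiliations: Energy Research Institute @ NTU, Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore; College of Computing and Data Science, Nanyang Technological University, Singapore; CNRS@CREATE, Singapore; College of Computing and Data Science, Nanyang Technological University, Singapore; CNRS@CREATE, Singapore; CNRS and CNRS@CREATE, IPAL IRL 2955, France and Singapore; KINDI Computing Research Center, College of Engineering, Qatar University, Doha
Keywords: producing unreliable predictions, reject test samples, prevent models trained, trained on in-distribution, unreliable predictions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract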

Abstract:This paper focuses on a significant yet challenging task: out-of-distribution detection (OOD detection), which aims to distinguish and reject test samples with semantic shifts, so as to prevent models trained on in-distribution (ID) data from producing unreliable predictions. Although previous works have made decent success, they are ineffective for real-world challenging applications since these methods simply regard all unlabeled data as OOD data and ignore the case that different datasets have different label granularity. For example, “cat” on CIFAR-10 and “tabby cat” on Tiny-ImageNet share the same semantics but have different labels due to various label granularity. To this end, in this paper, we propose a novel Adaptive Hierarchical Graph Cut network (AHGC) to deeply explore the semantic relationship between different images. Specifically, we construct a hierarchical KNN graph to evaluate the similarities between different images based on the cosine similarity. Based on the linkage and density information of the graph, we cut the graph into multiple subgraphs to integrate these semantics-similar samples. If the labeled percentage in a subgraph is larger than a threshold, we will assign the label with the highest percentage to unlabeled images. To further improve the model generalization, we augment each image into two augmentation versions, and maximize the similarity between the two versions. Finally, we leverage the similarity score for OOD detection. Extensive experiments on two challenging benchmarks (CIFAR-10 and CIFAR-100) illustrate that in representative cases, AHGC outperforms state-of-the-art OOD detection methods by 81.24% on CIFAR-100 and by 40.47% on CIFAR-10 in terms of “FPR95”, which shows the effectiveness of our AHGC.
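The graph-based label assignment described in the abstract can be illustrated with a simplified sketch. It is not the AHGC algorithm: the hierarchical cut based on linkage and density is replaced here by an assumed stand-in (connected components of a cosine-similarity KNN graph whose edges fall below `min_sim` are dropped), and the feature vectors, `k`, `min_sim`, and `labeled_ratio` are hypothetical values. Only the final rule, assigning the majority label when the labeled fraction of a subgraph exceeds a threshold, follows the abstract directly.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_components(features, k, min_sim):
    """Build a KNN graph, keep edges with similarity >= min_sim, return components."""
    n = len(features)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        sims = sorted(
            ((cosine(features[i], features[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for sim, j in sims[:k]:
            if sim >= min_sim:
                adj[i].add(j)
                adj[j].add(i)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(comp))
    return components


def propagate_labels(components, labels, labeled_ratio):
    """Give unlabeled nodes the majority label if enough of the subgraph is labeled."""
    out = list(labels)
    for comp in components:
        known = [labels[i] for i in comp if labels[i] is not None]
        if known and len(known) / len(comp) > labeled_ratio:
            majority = Counter(known).most_common(1)[0][0]
            for i in comp:
                if out[i] is None:
                    out[i] = majority
    return out


# Hypothetical 2-D features; None marks an unlabeled image.
features = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
labels = ["cat", "cat", None, "dog", None, None]

components = knn_components(features, k=2, min_sim=0.9)
assigned = propagate_labels(components, labels, labeled_ratio=0.4)
print(components)
print(assigned)
```

In this toy run the last sample has no sufficiently similar neighbor, stays unlabeled, and would be the natural OOD candidate.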

[CV-46] SCENIC: Scene-aware Semantic Navigation with Instruction-guided Control

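[Quick read]: This paper addresses the generation of human motion that adapts to complex environments while allowing creative control. Existing models typically assume flat terrain or cannot control motion semantics through text. The key is SCENIC, a diffusion model that generates human motion adapted to dynamic terrain in virtual scenes and supports semantic control through natural language. The core technical challenge is reasoning about complex scene geometry while maintaining text control, which requires understanding both high-level navigation goals and fine-grained environmental constraints. The solution is a hierarchical scene reasoning approach that combines a scene-dependent, goal-centric canonicalization for high-level goal constraints with an ego-centric distance field that captures local geometric detail. This dual representation lets the model generate physically plausible motion across diverse 3D scenes, and frame-wise text alignment enables seamless transitions between motion styles while maintaining scene constraints.

Link: https://arxiv.org/abs/2412.15664
Authors: Xiaohan Zhang, Sebastian Starke, Vladimir Guzov, Zhensong Zhang, Eduardo Pérez Pellitero, Gerard Pons-Moll
Affiliations: Max Planck Institute for Informatics; Technische Universität Darmstadt; University of Adelaide; University of Bath
Keywords: Synthesizing natural human, allowing creative control, creative control remains, Synthesizing natural, environments while allowing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract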

Abstract:Synthesizing natural human motion that adapts to complex environments while allowing creative control remains a fundamental challenge in motion synthesis. Existing models often fall short, either by assuming flat terrain or lacking the ability to control motion semantics through text. To address these limitations, we introduce SCENIC, a diffusion model designed to generate human motion that adapts to dynamic terrains within virtual scenes while enabling semantic control through natural language. The key technical challenge lies in simultaneously reasoning about complex scene geometry while maintaining text control. This requires understanding both high-level navigation goals and fine-grained environmental constraints. The model must ensure physical plausibility and precise navigation across varied terrain, while also preserving user-specified text control, such as "carefully stepping over obstacles" or "walking upstairs like a zombie." Our solution introduces a hierarchical scene reasoning approach. At its core is a novel scene-dependent, goal-centric canonicalization that handles high-level goal constraint, and is complemented by an ego-centric distance field that captures local geometric details. This dual representation enables our model to generate physically plausible motion across diverse 3D scenes. By implementing frame-wise text alignment, our system achieves seamless transitions between different motion styles while maintaining scene constraints. Experiments demonstrate our novel diffusion model generates arbitrarily long human motions that both adapt to complex scenes with varying terrain surfaces and respond to textual prompts. Additionally, we show SCENIC can generalize to four real-scene datasets. Our code, dataset, and models will be released at this https URL.

[CV-47] CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training AAAI2025

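[Quick read]: This paper addresses the obvious artifacts that appear when multiple customized concepts (such as a specific subject or motion), each trained from a different reference, are combined into a single network. The key is CustomTTT, which analyzes the influence of prompts in current video diffusion models and finds that low-rank adaptation (LoRA) is needed only in specific layers for appearance and motion customization. Because each LoRA is trained individually, the paper also proposes a test-time training technique that updates the parameters after combination using the trained customized models, in order to reduce the artifacts of combining multiple concepts.

Link: https://arxiv.org/abs/2412.15646
Authors: Xiuli Bi, Jian Lu, Bo Liu, Xiaodong Cun, Yong Zhang, Weisheng Li, Bin Xiao
Affiliations: Unknown
Keywords: Benefiting from large-scale, generate high-quality videos, generate high-quality, generate high-quality customized, text-video pairs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in AAAI 2025

Click to view abstract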

Abstract:Benefiting from large-scale pre-training of text-video pairs, current text-to-video (T2V) diffusion models can generate high-quality videos from the text description. Besides, given some reference images or videos, the parameter-efficient fine-tuning method, i.e., LoRA, can generate high-quality customized concepts, e.g., the specific subject or the motions from a reference video. However, combining the trained multiple concepts from different references into a single network shows obvious artifacts. To this end, we propose CustomTTT, which can easily and jointly customize the appearance and the motion of the given video. In detail, we first analyze the prompt influence in the current video diffusion model and find the LoRAs are only needed for the specific layers for appearance and motion customization. Besides, since each LoRA is trained individually, we propose a novel test-time training technique to update parameters after combination utilizing the trained customized models. We conduct detailed experiments to verify the effectiveness of the proposed methods. Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations.

[CV-48] CrackUDA: Incremental Unsupervised Domain Adaptation for Improved Crack Segmentation in Civil Structures ICPR2024

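[Quick read]: This paper addresses the loss of accuracy that crack segmentation algorithms suffer under domain shift across datasets. The key is a deep network that uses incremental training with unsupervised domain adaptation (UDA) via adversarial learning, combining domain-invariant and domain-specific parameters. The network follows an encoder-decoder architecture: the encoder learns crack features shared across domains for robustness to domain variation, while the decoder's domain-specific parameters capture features unique to each domain. The paper also introduces the BuildCrack dataset and reports improvements of 0.65 and 2.7 mIoU on the source and target domains, respectively, over other UDA methods.

Link: https://arxiv.org/abs/2412.15637
Authors: Kushagra Srivastava, Damodar Datta Kancharla, Rizvi Tahereen, Pradeep Kumar Ramancharla, Ravi Kiran Sarvadevabhatla, Harikumar Kandath
Affiliations: Unknown
Keywords: civil structures, plays a crucial, crucial role, structural integrity, integrity and seismic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICPR 2024. Details and code can be accessed from this https URL

Click to view abstract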

Abstract:Crack segmentation plays a crucial role in ensuring the structural integrity and seismic safety of civil structures. However, existing crack segmentation algorithms encounter challenges in maintaining accuracy with domain shifts across datasets. To address this issue, we propose a novel deep network that employs incremental training with unsupervised domain adaptation (UDA) using adversarial learning, without a significant drop in accuracy in the source domain. Our approach leverages an encoder-decoder architecture, consisting of both domain-invariant and domain-specific parameters. The encoder learns shared crack features across all domains, ensuring robustness to domain variations. Simultaneously, the decoder’s domain-specific parameters capture domain-specific features unique to each domain. By combining these components, our model achieves improved crack segmentation performance. Furthermore, we introduce BuildCrack, a new crack dataset comparable to sub-datasets of the well-established CrackSeg9K dataset in terms of image count and crack percentage. We evaluate our proposed approach against state-of-the-art UDA methods using different sub-datasets of CrackSeg9K and our custom dataset. Our experimental results demonstrate a significant improvement in crack segmentation accuracy and generalization across target domains compared to other UDA methods - specifically, an improvement of 0.65 and 2.7 mIoU on source and target domains respectively.

[CV-49] A New Method to Capturing Compositional Knowledge in Linguistic Space

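[Quick read]: This paper addresses the reliance of existing visual language models on hard negative examples and fine-tuning for compositional understanding, which can overestimate improvements and is limited by the difficulty of obtaining hard negatives. The key is a new task, Zero-Shot Compositional Understanding (ZS-CU), which requires no hard negative training data, together with the YUKINO method. YUKINO uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model and introduces a "no" logical regularization to address token interaction during inversion. Knowledge distillation is used to reduce the time complexity of textual inversion. Experiments show that YUKINO outperforms existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark and achieves significant improvements in image retrieval tasks.

Link: https://arxiv.org/abs/2412.15632
Authors: Jiahe Wan
Affiliations: South-Central Minzu University
Keywords: interpret complex relationships, Compositional understanding, Yielded Compositional Understanding, relationships between objects, visual language models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract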

Abstract:Compositional understanding allows visual language models to interpret complex relationships between objects, attributes, and relations in images and text. However, most existing methods often rely on hard negative examples and fine-tuning, which can overestimate improvements and are limited by the difficulty of obtaining hard negatives. In this work, we introduce Zero-Shot Compositional Understanding (ZS-CU), a novel task that enhances compositional understanding without requiring hard negative training data. We propose YUKINO (Yielded Compositional Understanding Knowledge via Textual Inversion with NO), which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model. We propose introducing “no” logical regularization to address the issue of token interaction in inversion. Additionally, we suggest using knowledge distillation to reduce the time complexity of textual inversion. Experimental results show that YUKINO outperforms the existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark, and also achieves significant improvements in image retrieval tasks.

[CV-50] 3D Shape Tokenization

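[Quick read]: This paper addresses the need for a 3D shape representation that is continuous, compact, and easy to integrate with machine learning models. The key is Shape Tokens, which act as conditioning vectors that represent shape information in a 3D flow-matching model. The flow-matching model is trained to approximate probability density functions corresponding to delta functions concentrated on the surfaces of 3D shapes. Attaching Shape Tokens to various models enables shape generation, image-to-3D conversion, alignment of 3D shapes with text and images, and rendering at variable resolution. Shape Tokens also support systematic analysis of geometric properties such as normals, density, and deformation fields, and show strong performance compared with existing baselines across tasks.

Link: https://arxiv.org/abs/2412.15618
Authors: Jen-Hao Rick Chang, Yuyang Wang, Miguel Angel Bautista Martin, Jiatao Gu, Josh Susskind, Oncel Tuzel
Affiliations: Apple
Keywords: Shape Tokens, introduce Shape Tokens, machine learning models, Shape Tokens act, Tokens
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Click to view abstract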

Abstract:We introduce Shape Tokens, a 3D representation that is continuous, compact, and easy to incorporate into machine learning models. Shape Tokens act as conditioning vectors that represent shape information in a 3D flow-matching model. The flow-matching model is trained to approximate probability density functions corresponding to delta functions concentrated on the surfaces of shapes in 3D. By attaching Shape Tokens to various machine learning models, we can generate new shapes, convert images to 3D, align 3D shapes with text and images, and render shapes directly at variable, user-specified resolution. Moreover, Shape Tokens enable a systematic analysis of geometric properties such as normal, density, and deformation field. Across all tasks and experiments, utilizing Shape Tokens demonstrates strong performance compared to existing baselines.

[CV-51] Technical Report for ICML 2024 TiFA Workshop MLLM Attack Challenge: Suffix Injection and Projected Gradient Descent Can Easily Fool An MLLM ICML

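[Quick read]: This paper presents a solution to the MLLM attack challenge of the TiFA workshop, built on two methods: suffix injection and projected gradient descent (PGD). First, the text of an incorrectly labeled option (a pseudo-label) is appended to the original query as a suffix. Second, using the modified query, PGD adds imperceptible perturbations to the image. Combining the two techniques yields successful attacks on the LLaVA 1.5 model.

Link: https://arxiv.org/abs/2412.15614
Authors: Yangyang Guo, Ziwei Xu, Xilie Xu, YongKang Wong, Liqiang Nie, Mohan Kankanhalli
Affiliations: Unknown
Keywords: TiFA workshop MLLM, projected gradient descent, MLLM attack challenge, technical report introduces, workshop MLLM attack
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML TiFA Challenge Technical Report

Click to view abstract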

Abstract:This technical report introduces our top-ranked solution that employs two approaches, i.e., suffix injection and projected gradient descent (PGD), to address the TiFA workshop MLLM attack challenge. Specifically, we first append the text from an incorrectly labeled option (pseudo-labeled) to the original query as a suffix. Using this modified query, our second approach applies the PGD method to add imperceptible perturbations to the image. Combining these two techniques enables successful attacks on the LLaVA 1.5 model.
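The two attack steps can be sketched on a toy problem. This is an illustration under stated assumptions, not the report's code: a three-feature logistic-regression classifier stands in for the multimodal model, the weights `w`, bias `b`, input `x`, budget `eps`, and step size are hypothetical, and the suffix format in `inject_suffix` is a guess at the simplest possible form. The PGD loop itself is the standard L-infinity variant: take a signed gradient ascent step on the loss, then project back into the `eps`-ball around the clean input and into the valid pixel range.

```python
import math


def inject_suffix(query, wrong_option_text):
    """Append the text of an incorrect option to the query (hypothetical format)."""
    return f"{query} {wrong_option_text}"


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability of class 1 under a logistic-regression stand-in model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def pgd_attack(x, y, w, b, eps, step, iters):
    """L-infinity PGD that increases the cross-entropy loss of the true label y."""
    x_adv = list(x)
    for _ in range(iters):
        p = predict(x_adv, w, b)
        # Gradient of binary cross-entropy with respect to the input is (p - y) * w.
        grad = [(p - y) * wi for wi in w]
        # Signed ascent step.
        x_adv = [
            xa + step * (1 if g > 0 else -1 if g < 0 else 0)
            for xa, g in zip(x_adv, grad)
        ]
        # Project into the eps-ball around the clean input, then into [0, 1].
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv


# Hypothetical model and input; the true label is 1.
w, b = [2.0, -3.0, 1.0], -0.2
x, eps = [0.6, 0.4, 0.5], 0.1

query_adv = inject_suffix("Which animal is shown? (A) cat (B) dog", "dog")
x_adv = pgd_attack(x, y=1, w=w, b=b, eps=eps, step=0.03, iters=10)
p_clean, p_adv = predict(x, w, b), predict(x_adv, w, b)
print(query_adv)
print(p_clean, p_adv)
```

The perturbation never exceeds `eps` per coordinate, which is what keeps it small, yet the predicted class flips on this toy model.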

[CV-52] Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

[Quick read]: This paper aims to improve the ability of multi-modal agents to call external tools in practical tasks. The key is a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for tool-usage reasoning. Specifically, the GPT-4o mini model is prompted to generate queries, files, and trajectories, and query-file and trajectory verifiers preserve data quality, producing the MM-Traj dataset of 20K tasks with tool-usage trajectories. On this dataset the paper develops the T3-Agent via Trajectory Tuning on VLMs for Tool usage. On the GTA and GAIA benchmarks, the T3-Agent consistently improves two VLMs, MiniCPM-V-8.5B and Qwen2-VL-7B, outperforming the untrained VLMs by 20%, which supports the effectiveness of the data synthesis pipeline.

Link: https://arxiv.org/abs/2412.15606
Authors: Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
Affiliations: Peking University; State Key Laboratory of General Artificial Intelligence, BIGAI; Beijing Institute of Technology; Shenzhen MSU-BIT University; Tsinghua University
Keywords: large language models, call external tools, solve practical tasks, multi-modal agent tuning, providing a feasible
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The advancement of large language models (LLMs) prompts the development of multi-modal agents, which are used as a controller to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To preserve the data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, followed by query-file and trajectory verifiers. Based on the data synthesis pipeline, we collect the MM-Traj dataset that contains 20K tasks with trajectories of tool usage. Then, we develop the T3-Agent via Trajectory Tuning on VLMs for Tool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs: MiniCPM-V-8.5B and Qwen2-VL-7B, which outperforms untrained VLMs by 20%, showing the effectiveness of the proposed data synthesis pipeline, leading to high-quality data for tool-usage capabilities.

[CV-53] Gaze Label Alignment: Alleviating Domain Shift for Gaze Estimation AAAI2025

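[Quick read]: This paper addresses the performance degradation in cross-domain gaze estimation caused by label deviation. The key is the gaze label alignment algorithm (GLA), which removes the deviation in label distribution. The steps are: train a feature extractor on all domains to obtain domain-invariant features; select an anchor domain and train a gaze regressor on it; predict gaze labels on the remaining domains; and align those labels with a mapping function, after which the aligned labels are used to train gaze estimation models. The method can be combined with any existing method, and experiments show that GLA effectively alleviates label distribution shift and further improves SOTA gaze estimation methods.

Link: https://arxiv.org/abs/2412.15601
Authors: Guanzhong Zeng, Jingjing Wang, Zefu Xu, Pengwei Yin, Wenqi Ren, Di Xie, Jiang Zhu
Affiliations: Unknown
Keywords: encounter significant performance, significant performance deterioration, methods encounter significant, Gaze estimation methods, gaze label
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Camera Ready. Accepted to AAAI 2025

Click to view abstract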

Abstract:Gaze estimation methods encounter significant performance deterioration when being evaluated across different domains, because of the domain gap between the testing and training data. Existing methods try to solve this issue by reducing the deviation of data distribution, however, they ignore the existence of label deviation in the data due to the acquisition mechanism of the gaze label and the individual physiological differences. In this paper, we first point out that the influence brought by the label deviation cannot be ignored, and propose a gaze label alignment algorithm (GLA) to eliminate the label distribution deviation. Specifically, we first train the feature extractor on all domains to get domain invariant features, and then select an anchor domain to train the gaze regressor. We predict the gaze label on remaining domains and use a mapping function to align the labels. Finally, these aligned labels can be used to train gaze estimation models. Therefore, our method can be combined with any existing method. Experimental results show that our GLA method can effectively alleviate the label distribution shift, and SOTA gaze estimation methods can be further improved obviously.
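The abstract does not specify the form of the mapping function used to align labels, so the following is only a sketch under an explicit assumption: a per-angle least-squares linear map from a domain's original gaze labels to the predictions of the anchor-domain regressor. The yaw values and the helper names `fit_linear` and `align_labels` are hypothetical.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def align_labels(domain_labels, anchor_predictions):
    """Map a domain's labels into the anchor domain's label convention."""
    slope, intercept = fit_linear(domain_labels, anchor_predictions)
    aligned = [slope * g + intercept for g in domain_labels]
    return aligned, (slope, intercept)


# Hypothetical yaw angles (degrees) from a non-anchor domain, and what the
# anchor-trained regressor predicts for the same samples.
yaw_labels = [-10.0, 0.0, 10.0, 20.0]
yaw_anchor_pred = [-7.0, 2.0, 11.0, 20.0]

aligned_yaw, (slope, intercept) = align_labels(yaw_labels, yaw_anchor_pred)
print(slope, intercept, aligned_yaw)
```

In this toy case the fit recovers a scale of 0.9 and an offset of 2 degrees, the kind of systematic label deviation between acquisition setups that the abstract describes. The aligned labels would then replace the originals when training a gaze estimator.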

[CV-54] Mask-RadarNet: Enhancing Transformer With Spatial-Temporal Semantic Context for Radar Object Detection in Autonomous Driving

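[Quick Read]: This paper addresses the problem that existing radar-based models, when processing radio-frequency image sequences, rely heavily on convolutional neural networks and ignore spatial-temporal semantic context. The key to the solution is the proposed Mask-RadarNet model, which combines interleaved convolution and attention operations to replace the traditional architecture in transformer-based models and introduces patch shift for efficient spatial-temporal feature learning. In addition, a class masking attention module (CMAM) captures spatial-temporal semantic context, and a lightweight auxiliary decoder aggregates the prior maps generated by the CMAM, improving recognition accuracy for object detection in autonomous driving while reducing computational complexity and parameter count.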

Link: https://arxiv.org/abs/2412.15595
Authors: Yuzhi Wu, Jun Liu, Guangfeng Jiang, Weijian Liu, Danilo Orlando
Affiliations: University of Science and Technology of China; Wuhan Electronic Information Institute; Università degli Studi “Niccolò Cusano”
Keywords: robust technology, cost-effective and robust, steady improvement, appealing complement, complement to commonly
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:As a cost-effective and robust technology, automotive radar has seen steady improvement during the last years, making it an appealing complement to commonly used sensors like camera and LiDAR in autonomous driving. Radio frequency data with rich semantic information are attracting more and more attention. Most current radar-based models take radio frequency image sequences as the input. However, these models heavily rely on convolutional neural networks and leave out the spatial-temporal semantic context during the encoding stage. To solve these problems, we propose a model called Mask-RadarNet to fully utilize the hierarchical semantic features from the input radar data. Mask-RadarNet exploits the combination of interleaved convolution and attention operations to replace the traditional architecture in transformer-based models. In addition, patch shift is introduced to the Mask-RadarNet for efficient spatial-temporal feature learning. By shifting part of patches with a specific mosaic pattern in the temporal dimension, Mask-RadarNet achieves competitive performance while reducing the computational burden of the spatial-temporal modeling. In order to capture the spatial-temporal semantic contextual information, we design the class masking attention module (CMAM) in our encoder. Moreover, a lightweight auxiliary decoder is added to our model to aggregate prior maps generated from the CMAM. Experiments on the CRUW dataset demonstrate the superiority of the proposed method to some state-of-the-art radar-based object detection algorithms. With relatively lower computational complexity and fewer parameters, the proposed Mask-RadarNet achieves higher recognition accuracy for object detection in autonomous driving.

[CV-55] SemDP: Semantic-level Differential Privacy Protection for Face Datasets

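[Quick Read]: This paper addresses the privacy concerns that large-scale face datasets raise in deep learning applications, in particular the fact that existing differential privacy schemes do not fully meet the core requirements of differential privacy. The key to the solution is a semantic-level differential privacy protection scheme that treats the entire face dataset as the object to be protected rather than treating each image as a separate database. Unlike conventional pixel-level differential privacy approaches, the scheme extracts semantic information from the face dataset to build an attribute database, applies differential perturbation to the attribute data, and finally uses an image synthesis model to generate the protected face dataset. This ensures that semantic privacy is not compromised while maintaining visual naturalness and balancing privacy against utility.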

Link: https://arxiv.org/abs/2412.15590
Authors: Xiaoting Zhang, Tao Wang, Junhao Ji
Affiliations: Nanjing University of Aeronautics and Astronautics
Keywords: advanced deep learning-based, learning-based face analysis, raise privacy concerns, privacy concerns due, deep learning-based face
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:While large-scale face datasets have advanced deep learning-based face analysis, they also raise privacy concerns due to the sensitive personal information they contain. Recent schemes have implemented differential privacy to protect face datasets. However, these schemes generally treat each image as a separate database, which does not fully meet the core requirements of differential privacy. In this paper, we propose a semantic-level differential privacy protection scheme that applies to the entire face dataset. Unlike pixel-level differential privacy approaches, our scheme guarantees that semantic privacy in faces is not compromised. The key idea is to convert unstructured data into structured data to enable the application of differential privacy. Specifically, we first extract semantic information from the face dataset to build an attribute database, then apply differential perturbations to obscure this attribute data, and finally use an image synthesis model to generate a protected face dataset. Extensive experimental results show that our scheme can maintain visual naturalness and balance the privacy-utility trade-off compared to the mainstream schemes.
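
To make the "perturb the attribute database" step concrete, here is a minimal sketch using randomized response, a standard differentially private mechanism for binary values. The abstract does not say which perturbation mechanism the authors use, so the mechanism, the binary attribute encoding, and the names `randomized_response` and `perturb_attributes` are assumptions for illustration.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Keep the true bit with probability e^eps / (e^eps + 1), else flip it."""
    keep_prob = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < keep_prob else 1 - bit


def perturb_attributes(rows, epsilon, seed=0):
    """Apply randomized response independently to every binary attribute."""
    rng = random.Random(seed)
    return [[randomized_response(b, epsilon, rng) for b in row] for row in rows]


# Hypothetical attribute database: one row per face, one column per
# binary semantic attribute (e.g. glasses, smiling, beard).
attributes = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
protected = perturb_attributes(attributes, epsilon=1.0)
```

Each attribute here receives its own budget `epsilon`, so by sequential composition a row with k attributes consumes k times `epsilon` in total. In the pipeline the abstract describes, the perturbed rows would then condition an image synthesis model.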

[CV-56] SaliencyI2PLoc: saliency-guided image-point cloud localization using contrastive learning

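[Quick Read]: This paper addresses cross-modality global localization between images and point clouds, particularly for robot navigation in environments without a global navigation satellite system (GNSS), multi-robot map fusion, and urban asset management. Existing solutions either require modality unification, which loses information, or rely on engineered training schemes that lack feature alignment and relation consistency. The key to the solution is SaliencyI2PLoc, a novel contrastive-learning-based architecture that fuses the saliency map into feature aggregation and maintains feature relation consistency on multi-manifold spaces, achieving cross-modality feature mapping efficiently. The method also designs a context saliency-guided local feature aggregation module that exploits stationary information in the scene to generate more representative global features, and it strengthens cross-modality feature alignment by considering the consistency of relative relationships between samples in different manifold spaces. Experiments on urban and highway scenario datasets show significant performance gains.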

Link: https://arxiv.org/abs/2412.15577
Authors: Yuhao Li, Jianping Li, Zhen Dong, Yuan Wang, Bisheng Yang
Affiliations: Wuhan University; Nanyang Technological University; Jiangxi Normal University
Keywords: urban asset management, multi-robot map fusion, point cloud global, cross-modality global localization, point clouds
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Under Review

Abstract:Image to point cloud global localization is crucial for robot navigation in GNSS-denied environments and has become increasingly important for multi-robot map fusion and urban asset management. The modality gap between images and point clouds poses significant challenges for cross-modality fusion. Current cross-modality global localization solutions either require modality unification, which leads to information loss, or rely on engineered training schemes to encode multi-modality features, which often lack feature alignment and relation consistency. To address these limitations, we propose SaliencyI2PLoc, a novel contrastive learning based architecture that fuses the saliency map into feature aggregation and maintains the feature relation consistency on multi-manifold spaces. To alleviate the pre-process of data mining, the contrastive learning framework is applied which efficiently achieves cross-modality feature mapping. The context saliency-guided local feature aggregation module is designed, which fully leverages the contribution of the stationary information in the scene generating a more representative global feature. Furthermore, to enhance the cross-modality feature alignment during contrastive learning, the consistency of relative relationships between samples in different manifold spaces is also taken into account. Experiments conducted on urban and highway scenario datasets demonstrate the effectiveness and robustness of our method. Specifically, our method achieves a Recall@1 of 78.92% and a Recall@20 of 97.59% on the urban scenario evaluation dataset, showing an improvement of 37.35% and 18.07%, compared to the baseline method. This demonstrates that our architecture efficiently fuses images and point clouds and represents a significant step forward in cross-modality global localization. The project page and code will be released.
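
The Recall@1 and Recall@20 figures quoted above are retrieval metrics. As a rough sketch of how Recall@K is commonly computed for place retrieval, the function below counts a query as a hit when at least one true match appears among its top K retrieved candidates. The function name, the data layout, and the toy values are assumptions; the paper's exact evaluation protocol (for example its distance threshold for a true match) is not given in the abstract.

```python
def recall_at_k(ranked_candidates, ground_truth, k):
    """Fraction of queries with at least one true match in the top-k results.

    ranked_candidates: per query, candidate ids sorted from best to worst.
    ground_truth: per query, the set of candidate ids that count as correct.
    """
    hits = 0
    for ranked, truth in zip(ranked_candidates, ground_truth):
        if any(candidate in truth for candidate in ranked[:k]):
            hits += 1
    return hits / len(ranked_candidates)


# Hypothetical retrieval results for two image queries against
# a database of point-cloud submaps with ids 0..3.
ranked = [[3, 1, 2], [0, 2, 1]]
truth = [{1}, {0}]
r1 = recall_at_k(ranked, truth, k=1)
r2 = recall_at_k(ranked, truth, k=2)
```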

[CV-57] QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning

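[Quick Read]: This paper addresses the inherent inference latency of deploying multimodal large language models (MLLMs) in quadruped vision-language-action (QUAR-VLA) tasks. The key to the solution is QUART-Online, a novel latency-free quadruped MLLM that uses Action Chunk Discretization (ACD) to compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. This improves inference efficiency without degrading the performance of the language foundation model and achieves real-time inference in sync with the underlying controller frequency, raising the task success rate by 65%.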

Link: https://arxiv.org/abs/2412.15576
Authors: Xinyang Tong, Pengxiang Ding, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Yiguo Fan, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu
Affiliations: MiLAB, Westlake University, Hangzhou, 310030, China; Zhejiang University, Hangzhou, 310027, China; Beijing University of Posts and Telecommunications, Beijing, 100876, China
Keywords: deploying multimodal large, multimodal large language, inherent inference latency, inference latency challenges, language foundation model
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLM) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM model, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference in sync with the underlying controller frequency, significantly boosting the success rate across various tasks by 65%. Our project page is this https URL.
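
Mapping continuous action values onto a small set of discrete representative vectors resembles vector quantization. The sketch below shows only that generic idea: each action chunk is replaced by the index of its nearest codebook vector. The hand-written codebook, the two-dimensional chunks, and all function names are hypothetical; how ACD actually builds its representative vectors is not described in the abstract.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_code(chunk, codebook):
    """Index of the codebook vector closest to the given action chunk."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(chunk, codebook[i]))


def discretize(chunks, codebook):
    """Replace every continuous chunk with a discrete token (an index)."""
    return [nearest_code(chunk, codebook) for chunk in chunks]


def reconstruct(tokens, codebook):
    """Recover approximate continuous actions from discrete tokens."""
    return [codebook[t] for t in tokens]


# Hypothetical codebook with two representative vectors and two chunks
# of flattened continuous action values.
codebook = [[0.0, 0.0], [1.0, 1.0]]
chunks = [[0.1, -0.1], [0.9, 1.2]]
tokens = discretize(chunks, codebook)
approx_actions = reconstruct(tokens, codebook)
```

A model that predicts one token per chunk emits far fewer outputs than one that predicts every continuous value, which is the kind of saving the abstract attributes to compressing the action space.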

[CV-58] J-EDI QA: Benchmark for deep-sea organism-specific multimodal LLM

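[Quick Read]: This paper addresses the understanding of deep-sea organism images, specifically using multimodal large language models (LLMs) to evaluate and improve the recognition and understanding of deep-sea species. The key to the solution is the J-EDI QA benchmark, a dataset of 100 deep-sea images with corresponding questions and answers, designed to test a model's ability to understand deep-sea species in Japanese. By evaluating a current state-of-the-art model (OpenAI o1) on the benchmark, where it answers 50% of the questions correctly, the paper shows that existing models, however advanced, are not yet at expert level in deep-sea species understanding, so further development of deep-sea species-specific LLMs is needed.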

Link: https://arxiv.org/abs/2412.15574
Authors: Takero Yoshida, Yuikazu Ito, Yoshihiro Fujiwara, Shinji Tsuchida, Daisuke Sugiyama, Daisuke Matsuoka
Affiliations: Unknown
Keywords: Japan Agency, Science and Technology, JAMSTEC Earth Deep-sea, https URL, Agency for Marine-Earth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Japan Agency for Marine-Earth Science and Technology (JAMSTEC) has made available the JAMSTEC Earth Deep-sea Image (J-EDI), a deep-sea video and image archive (this https URL). This archive serves as a valuable resource for researchers and scholars interested in deep-sea imagery. The dataset comprises images and videos of deep-sea phenomena, predominantly of marine organisms, but also of the seafloor and physical processes. In this study, we propose J-EDI QA, a benchmark for understanding images of deep-sea organisms using a multimodal large language model (LLM). The benchmark is comprised of 100 images, accompanied by questions and answers with four options by JAMSTEC researchers for each image. The QA pairs are provided in Japanese, and the benchmark assesses the ability to understand deep-sea species in Japanese. In the evaluation presented in this paper, OpenAI o1 achieved a 50% correct response rate. This result indicates that even with the capabilities of state-of-the-art models as of December 2024, deep-sea species comprehension is not yet at an expert level. Further advances in deep-sea species-specific LLMs are therefore required.

[CV-59] DefFiller: Mask-Conditioned Diffusion for Salient Steel Surface Defect Generation

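[Quick Read]: This paper addresses the difficulty of creating defect detection datasets in steel production environments, where the unpredictability of defects limits the performance of existing saliency-based defect detection methods. The key to the solution is DefFiller, a mask-conditioned defect generation method that uses a layout-to-image diffusion model to generate defect samples paired with mask conditions. The approach requires no pixel-level annotations, which substantially reduces the time and resources spent on data annotation, and its output can be used directly for model training. Experiments show that DefFiller generates high-quality defect images and significantly improves the performance of saliency-based defect detection models.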

Link: https://arxiv.org/abs/2412.15570
Authors: Yichun Tai, Zhenzhen Huang, Tao Peng, Zhijiang Zhang
Affiliations: Unknown
Keywords: steel production environments, production environments complicates, methods show promise, Current saliency-based defect, complicates dataset creation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 10 figures

Abstract:Current saliency-based defect detection methods show promise in industrial settings, but the unpredictability of defects in steel production environments complicates dataset creation, hampering model performance. Existing data augmentation approaches using generative models often require pixel-level annotations, which are time-consuming and resource-intensive. To address this, we introduce DefFiller, a mask-conditioned defect generation method that leverages a layout-to-image diffusion model. DefFiller generates defect samples paired with mask conditions, eliminating the need for pixel-level annotations and enabling direct use in model training. We also develop an evaluation framework to assess the quality of generated samples and their impact on detection performance. Experimental results on the SD-Saliency-900 dataset demonstrate that DefFiller produces high-quality defect images that accurately match the provided mask conditions, significantly enhancing the performance of saliency-based defect detection models trained on the augmented dataset.

[CV-60] EGSRAL: An Enhanced 3D Gaussian Splatting based Renderer with Automated Labeling for Large-Scale Driving Scene AAAI2025

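[Quick Read]: This paper addresses two problems of 3D Gaussian Splatting (3D GS) for reconstructing driving scenes: reliance on multiple data types (such as depth maps, 3D boxes, and trajectories of moving objects) and the lack of annotations for synthesized images. The key to the solution is EGSRAL, a method that relies solely on training images without extra annotations, enhances the ability of 3D GS to model both dynamic objects and static backgrounds, and introduces a novel adaptor for auto labeling that generates the corresponding annotations. The paper also proposes a grouping strategy to address perspective issues when rendering large-scale, complex scenes. The method achieves state-of-the-art performance on multiple datasets, and its automated labeling significantly improves 2D/3D detection performance.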

Link: https://arxiv.org/abs/2412.15550
Authors: Yixiong Huo, Guangfeng Jiang, Hongyang Wei, Ji Liu, Song Zhang, Han Liu, Xingliang Huang, Mingjie Lu, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum
Affiliations: Unknown
Keywords: Gaussian Splatting, gained popularity due, faster rendering speed, view synthesis, gained popularity
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI2025

Abstract:3D Gaussian Splatting (3D GS) has gained popularity due to its faster rendering speed and high-quality novel view synthesis. Some researchers have explored using 3D GS for reconstructing driving scenes. However, these methods often rely on various data types, such as depth maps, 3D boxes, and trajectories of moving objects. Additionally, the lack of annotations for synthesized images limits their direct application in downstream tasks. To address these issues, we propose EGSRAL, a 3D GS-based method that relies solely on training images without extra annotations. EGSRAL enhances 3D GS’s capability to model both dynamic objects and static backgrounds and introduces a novel adaptor for auto labeling, generating corresponding annotations based on existing annotations. We also propose a grouping strategy for vanilla 3D GS to address perspective issues in rendering large-scale, complex scenes. Our method achieves state-of-the-art performance on multiple datasets without any extra annotation. For example, the PSNR metric reaches 29.04 on the nuScenes dataset. Moreover, our automated labeling can significantly improve the performance of 2D/3D detection tasks. Code is available at this https URL.
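
For readers unfamiliar with the PSNR figure quoted above, the metric is defined as 10 · log10(MAX² / MSE), where MSE is the mean squared error between the reference and rendered images. The sketch below assumes 8-bit pixel values flattened into a list; the function name and toy pixels are illustrative only and are not taken from the paper's evaluation code.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical flattened grayscale patches.
reference = [52, 55, 61, 59]
rendered = [54, 55, 60, 57]
score = psnr(reference, rendered)
```

Higher values mean the rendered view is closer to the ground-truth image.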

[CV-61] VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving

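[Quick Read]: This paper addresses the reliance of traditional reinforcement learning (RL) for autonomous driving on manually engineered reward functions, which are time-consuming to design and generalize poorly. The key to the solution is the VLM-RL framework, which integrates pre-trained Vision-Language Models (VLMs) with RL to generate reward signals from image observations and natural language goals. The core innovation is the contrasting language goal (CLG)-as-reward paradigm, which uses positive and negative language goals to generate semantic rewards, together with a hierarchical reward synthesis approach that combines CLG-based semantic rewards with vehicle state information to improve reward stability and comprehensiveness. A batch-processing technique is used to optimize computational efficiency during training. Experiments show that VLM-RL significantly outperforms existing methods in the CARLA simulator and generalizes robustly to unseen scenarios.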

Link: https://arxiv.org/abs/2412.15544
Authors: Zilin Huang, Zihao Sheng, Yansong Qu, Junwei You, Sikai Chen
Affiliations: University of Wisconsin-Madison; Purdue University
Keywords: gained increasing attention, achieved remarkable progress, learning driving policies, autonomous driving community, reinforcement learning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 16 figures

Abstract:In recent years, reinforcement learning (RL)-based methods for learning driving policies have gained increasing attention in the autonomous driving community and have achieved remarkable progress in various driving scenarios. However, traditional RL approaches rely on manually engineered rewards, which require extensive human effort and often lack generalizability. To address these limitations, we propose VLM-RL, a unified framework that integrates pre-trained Vision-Language Models (VLMs) with RL to generate reward signals using image observation and natural language goals. The core of VLM-RL is the contrasting language goal (CLG)-as-reward paradigm, which uses positive and negative language goals to generate semantic rewards. We further introduce a hierarchical reward synthesis approach that combines CLG-based semantic rewards with vehicle state information, improving reward stability and offering a more comprehensive reward signal. Additionally, a batch-processing technique is employed to optimize computational efficiency during training. Extensive experiments in the CARLA simulator demonstrate that VLM-RL outperforms state-of-the-art baselines, achieving a 10.5% reduction in collision rate, a 104.6% increase in route completion rate, and robust generalization to unseen driving scenarios. Furthermore, VLM-RL can seamlessly integrate almost any standard RL algorithms, potentially revolutionizing the existing RL paradigm that relies on manual reward engineering and enabling continuous performance improvements. The demo video and code can be accessed at: this https URL.
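
One plausible reading of the contrasting-language-goal reward is a score that rises when the image embedding is close to the positive goal and falls when it is close to the negative goal. The sketch below uses a plain difference of cosine similarities. This difference form, the function names, and the toy three-dimensional embeddings are assumptions; the abstract does not give the exact formula, and in practice the embeddings would come from a pre-trained VLM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clg_reward(image_emb, positive_goal_emb, negative_goal_emb):
    """Assumed semantic reward: sim(image, positive) - sim(image, negative)."""
    return cosine(image_emb, positive_goal_emb) - cosine(image_emb, negative_goal_emb)


# Hypothetical embeddings for a camera frame, a positive goal such as
# "the road is clear" and a negative goal such as "a collision occurred".
frame = [0.9, 0.1, 0.0]
positive_goal = [1.0, 0.0, 0.0]
negative_goal = [0.0, 1.0, 0.0]
reward = clg_reward(frame, positive_goal, negative_goal)
```

The hierarchical synthesis step described in the abstract would further combine such a semantic score with vehicle state information; that combination is not sketched here.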

[CV-62] ChangeDiff: A Multi-Temporal Change Detection Data Generator with Flexible Text Prompts via Diffusion Model

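[Quick Read]: This paper addresses three main limitations of existing data synthesis methods for change detection (CD): 1) difficulty in flexibly controlling change events, 2) dependence on additional data to train the data generators, and 3) a focus on specific change detection tasks. The key to the solution is ChangeDiff, a diffusion-model-based multi-temporal data generator for semantic change detection (SCD). ChangeDiff generates change data in two steps: it first uses text prompts and a text-to-layout (T2L) model to create continuous layouts, and then uses a layout-to-image (L2I) model to convert these layouts into images. Specifically, the paper proposes multi-class distribution-guided text prompts (MCDG-TP), which allow layouts to be generated flexibly through controllable classes and their ratios, and a class distribution refinement loss as training supervision to generalize the T2L model to MCDG-TP. The method markedly improves the temporal continuity, spatial diversity, and quality realism of the generated data and strengthens the accuracy and transferability of change detectors.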

Link: https://arxiv.org/abs/2412.15541
Authors: Qi Zang, Jiayi Yang, Shuang Wang, Dong Zhao, Wenjun Yi, Zhun Zhong
Affiliations: Unknown
Keywords: Data-driven deep learning, Data-driven deep, deep learning models, enabled tremendous progress, pixel-level annotations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Data-driven deep learning models have enabled tremendous progress in change detection (CD) with the support of pixel-level annotations. However, collecting diverse data and manually annotating them is costly, laborious, and knowledge-intensive. Existing generative methods for CD data synthesis show competitive potential in addressing this issue but still face the following limitations: 1) difficulty in flexibly controlling change events, 2) dependence on additional data to train the data generators, 3) focus on specific change detection tasks. To this end, this paper focuses on the semantic CD (SCD) task and develops a multi-temporal SCD data generator ChangeDiff by exploring powerful diffusion models. ChangeDiff innovatively generates change data in two steps: first, it uses text prompts and a text-to-layout (T2L) model to create continuous layouts, and then it employs layout-to-image (L2I) to convert these layouts into images. Specifically, we propose multi-class distribution-guided text prompts (MCDG-TP), allowing for layouts to be generated flexibly through controllable classes and their corresponding ratios. Subsequently, to generalize the T2L model to the proposed MCDG-TP, a class distribution refinement loss is further designed as training supervision. Our generated data shows significant progress in temporal continuity, spatial diversity, and quality realism, empowering change detectors with accuracy and transferability. The code is available at this https URL

[CV-63] SGTC: Semantic-Guided Triplet Co-training for Sparsely Annotated Semi-Supervised Medical Image Segmentation AAAI2025

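[Quick Read]: This paper addresses the high cost and time consumption of full annotation in medical image segmentation, and the weak boundary perception of existing methods that ignore semantic features. The key to the solution is a novel Semantic-Guided Triplet Co-training (SGTC) framework that achieves high-quality medical image segmentation by annotating only three orthogonal slices of a few volumetric samples. The framework has two main components: first, a semantic-guided auxiliary learning mechanism based on the pretrained CLIP model, which enables semantic-aware, fine-grained segmentation and improves pseudo-label quality; second, for a more challenging but clinically realistic scenario, a triple-view disparity training strategy that uses sparse annotations (i.e., only three labeled slices of a few volumes) to perform co-training between three sub-networks, significantly improving robustness.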

Link: https://arxiv.org/abs/2412.15526
Authors: Ke Yan, Qing Cai, Fan Zhang, Ziyan Cao, Zhi Liu
Affiliations: National University of Singapore; Tsinghua University; Alibaba Group
Keywords: made significant advances, medical image segmentation, time-consuming task, made significant, significant advances
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:Although semi-supervised learning has made significant advances in the field of medical image segmentation, fully annotating a volumetric sample slice by slice remains a costly and time-consuming task. Even worse, most of the existing approaches pay much attention to image-level information and ignore semantic features, resulting in the inability to perceive weak boundaries. To address these issues, we propose a novel Semantic-Guided Triplet Co-training (SGTC) framework, which achieves high-end medical image segmentation by only annotating three orthogonal slices of a few volumetric samples, significantly alleviating the burden of radiologists. Our method consists of two main components. Specifically, to enable semantic-aware, fine-granular segmentation and enhance the quality of pseudo-labels, a novel semantic-guided auxiliary learning mechanism is proposed based on the pretrained CLIP. In addition, focusing on a more challenging but clinically realistic scenario, a new triple-view disparity training strategy is proposed, which uses sparse annotations (i.e., only three labeled slices of a few volumes) to perform co-training between three sub-networks, significantly improving the robustness. Extensive experiments on three public medical datasets demonstrate that our method outperforms most state-of-the-art semi-supervised counterparts under sparse annotation settings. The source code is available at this https URL.

[CV-64] InstructOCR: Instruction Boosting Scene Text Spotting AAAI2025

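[Quick Read]: This paper addresses the fact that conventional optical character recognition (OCR) methods for scene text spotting overlook the advantages of incorporating human language instructions. The key to the solution is InstructOCR, a model that uses carefully designed human language instructions together with text and image encoders to improve accurate understanding and flexible interpretation of text in images. The approach achieves state-of-the-art results on scene text spotting and applies seamlessly to scene text visual question answering (VQA), significantly improving performance on downstream VQA tasks.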

Link: https://arxiv.org/abs/2412.15523
Authors: Chen Duan, Qianyi Jiang, Pei Fu, Jiamin Chen, Shengxi Li, Zining Wang, Shan Guo, Junfeng Luo
Affiliations: Unknown
Keywords: previous OCR methods, OCR methods primarily, human language instructions, methods primarily relied, previous OCR
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI2025

Abstract:In the field of scene text spotting, previous OCR methods primarily relied on image encoders and pre-trained text information, but they often overlooked the advantages of incorporating human language instructions. To address this gap, we propose InstructOCR, an innovative instruction-based scene text spotting model that leverages human language instructions to enhance the understanding of text within images. Our framework employs both text and image encoders during training and inference, along with instructions meticulously designed based on text attributes. This approach enables the model to interpret text more accurately and flexibly. Extensive experiments demonstrate the effectiveness of our model and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. By leveraging instruction strategies during pre-training, the performance on downstream VQA tasks can be significantly improved, with a 2.6% increase on the TextVQA dataset and a 2.1% increase on the ST-VQA dataset. These experimental results provide insights into the benefits of incorporating human language instructions for OCR-related tasks.

[CV-65] Reconstruction of Contour Lines During the Digitization of Contour Maps to Build a Digital Elevation Model

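[Quick Read]: This paper aims to solve the problem of broken contour segments, which arise during the digitization and pre-processing of contour maps when contour lines intersect each other or break apart. These broken segments lead to faulty models when building a Digital Elevation Model (DEM). The key to the solution is to match the endpoints of broken segments using minimum Euclidean distance and gradient direction, and to reconnect them with cubic Hermite spline interpolation, which minimizes overall surface curvature and yields a smooth curve.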

Link: https://arxiv.org/abs/2412.15515
Authors: Aroj Subedi, Pradip Ganesh, Sandip Mishra
Affiliations: Unknown
Keywords: Contour map, broken contour lines, Digital Elevation Model, broken contour segments, Contour
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A contour map has contour lines that are significant in building a Digital Elevation Model (DEM). During the digitization and pre-processing of contour maps, the contour lines intersect with each other or break apart, resulting in broken contour segments. These broken segments pose a greater risk while building a DEM, leading to a faulty model. In this project, a simple yet efficient mechanism is used to match and reconnect the endpoints of the broken segments accurately and efficiently. The matching of the endpoints is done using the concept of minimum Euclidean distance and gradient direction, while the Cubic Hermite spline interpolation technique is used to reconnect the endpoints by estimating the values using a mathematical function that minimizes overall surface curvature, resulting in a smooth curve. The purpose of this work is to reconnect the broken contour lines generated during the digitization of the contour map, to help build the most appropriate digital elevation model for the corresponding contour map.
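
The two ingredients named above can be sketched in a few lines: pick the partner endpoint with the minimum Euclidean distance, then bridge the gap with a cubic Hermite curve. This is a simplified illustration, not the authors' code. It omits the gradient-direction check used in matching and assumes that the tangent vectors at both endpoints are already known (for instance, estimated from the last few pixels of each segment). All function names and coordinates are hypothetical.

```python
import math


def closest_endpoint(point, candidates):
    """Candidate endpoint with the minimum Euclidean distance to `point`."""
    return min(candidates, key=lambda c: math.dist(point, c))


def hermite_point(p0, p1, m0, m1, t):
    """Point at parameter t in [0, 1] on the cubic Hermite curve from p0 to p1.

    m0 and m1 are the tangent vectors at p0 and p1.
    """
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return tuple(h00 * a + h10 * ma + h01 * b + h11 * mb
                 for a, b, ma, mb in zip(p0, p1, m0, m1))


def reconnect(p0, p1, m0, m1, samples=5):
    """Sample the bridging curve at evenly spaced parameter values."""
    return [hermite_point(p0, p1, m0, m1, i / (samples - 1))
            for i in range(samples)]


# Hypothetical broken endpoint and three candidate partner endpoints.
start = (0.0, 0.0)
partner = closest_endpoint(start, [(5.0, 5.0), (4.0, 0.0), (6.0, 1.0)])
bridge = reconnect(start, partner, m0=(1.0, 1.0), m1=(1.0, -1.0))
```

The sampled curve starts at `start`, ends at `partner`, and leaves and enters along the given tangents, which is what keeps the reconnected contour smooth.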

[CV-66] PolySmart @ TRECVid 2024 Medical Video Question Answering

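[Quick Read]: This paper addresses Video Corpus Visual Answer Localization (VCVAL), which comprises retrieving videos relevant to a medical question and localizing the visual answer within the videos. The key points of the solution are: first, relevant videos are retrieved by text-to-text retrieval based on the similarity between video transcripts and answers generated by GPT4; second, the start and end timestamps of the answer are predicted by aligning both visual content and subtitles with the query; finally, for the Query-Focused Instructional Step Captioning (QFISC) task, GPT4 generates step captions from the video captions produced by the LLaVA-Next-Video model and the video subtitles with timestamps.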

Link: https://arxiv.org/abs/2412.15514
Authors: Jiaxin Wu, Yiyang Jiang, Xiao-Yong Wei, Qing Li
Affiliations: The Hong Kong Polytechnic University; Sichuan University
Keywords: Corpus Visual Answer, Visual Answer Localization, Video Corpus Visual, includes question-related video, Answer Localization
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Video Corpus Visual Answer Localization (VCVAL) includes question-related video retrieval and visual answer localization in the videos. Specifically, we use text-to-text retrieval to find relevant videos for a medical question based on the similarity of video transcript and answers generated by GPT4. For the visual answer localization, the start and end timestamps of the answer are predicted by the alignments on both visual content and subtitles with queries. For the Query-Focused Instructional Step Captioning (QFISC) task, the step captions are generated by GPT4. Specifically, we provide the video captions generated by the LLaVA-Next-Video model and the video subtitles with timestamps as context, and ask GPT4 to generate step captions for the given medical query. We only submit one run for evaluation and it obtains an F-score of 11.92 and mean IoU of 9.6527.
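
The mean IoU reported for answer localization compares predicted and reference time spans. A minimal sketch of temporal IoU follows, assuming each span is a (start, end) pair in seconds. The function names and toy spans are illustrative, and the official TRECVid scoring script may differ in details such as scaling or the handling of missing predictions.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) time intervals."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, truths):
    """Average temporal IoU over all queries."""
    return sum(temporal_iou(p, t) for p, t in zip(preds, truths)) / len(preds)


# Hypothetical predicted and reference answer spans for two questions.
predictions = [(0.0, 10.0), (30.0, 40.0)]
references = [(5.0, 15.0), (50.0, 60.0)]
score = mean_iou(predictions, references)
```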

[CV-67] RESQUE: Quantifying Estimator to Task and Distribution Shift for Sustainable Model Reusability AAAI

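[Quick Read]: This paper aims to address model reuse and sustainability in deep learning, in particular reducing resource consumption by retraining existing models rather than training new ones from scratch. The key to the solution is the REpresentation Shift QUantifying Estimator (RESQUE), a tool that predicts the cost of retraining a model when the task or data distribution changes. RESQUE provides a single concise index that helps users estimate the resources required for retraining, in terms of epochs, gradient norms, changes in parameter magnitude, energy consumption, and carbon emissions. Experiments show that RESQUE correlates strongly with various retraining measures and can guide users toward the most cost-effective and sustainable model reuse strategy, reducing environmental impact.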

Link: https://arxiv.org/abs/2412.15511
Authors: Vishwesh Sangarya, Jung-Eun Kim
Affiliations: Unknown
Keywords: deep learning, reusing an existing, scratch is critical, strategy for sustainability, sustainability of deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: The Annual AAAI Conference on Artificial Intelligence (AAAI), 2025

Abstract:As a strategy for sustainability of deep learning, reusing an existing model by retraining it rather than training a new model from scratch is critical. In this paper, we propose REpresentation Shift QUantifying Estimator (RESQUE), a predictive quantifier to estimate the retraining cost of a model to distributional shifts or change of tasks. It provides a single concise index for an estimate of resources required for retraining the model. Through extensive experiments, we show that RESQUE has a strong correlation with various retraining measures. Our results validate that RESQUE is an effective indicator in terms of epochs, gradient norms, changes of parameter magnitude, energy, and carbon emissions. These measures align well with RESQUE for new tasks, multiple noise types, and varying noise intensities. As a result, RESQUE enables users to make informed decisions for retraining to different tasks/distribution shifts and determine the most cost-effective and sustainable option, allowing for the reuse of a model with a much smaller footprint in the environment. The code for this work is available here: this https URL

[CV-68] PolySmart @ TRECVid 2024 Video-To-Text

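[Quick Read]: This paper addresses description accuracy, contextual relevance, and linguistic consistency in the Video-To-Text (VTT) task. The key to the solution is fine-tuning Vision-Language Models (VLMs) such as LLaVA and LLaVA-NeXT-Video on VTT datasets. With fine-tuning, the model produces more detailed, domain-aligned text descriptions, narrowing the gap between generic VLM tasks and the VTT task. Experiments show that the fine-tuned model outperforms baseline VLMs across various evaluation metrics, underscoring the importance of domain-specific fine-tuning for complex VTT tasks.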

Link: https://arxiv.org/abs/2412.15509
Authors: Jiaxin Wu, Wengyu Zhang, Xiao-Yong Wei, Qing Li
Affiliations: Department of Computing, The Hong Kong Polytechnic University; Department of Computer Science, Sichuan University
Keywords: generating natural language, natural language descriptions, exploring the capabilities, video content, present our methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:In this paper, we present our methods and results for the Video-To-Text (VTT) task at TRECVid 2024, exploring the capabilities of Vision-Language Models (VLMs) like LLaVA and LLaVA-NeXT-Video in generating natural language descriptions for video content. We investigate the impact of fine-tuning VLMs on VTT datasets to enhance description accuracy, contextual relevance, and linguistic consistency. Our analysis reveals that fine-tuning substantially improves the model’s ability to produce more detailed and domain-aligned text, bridging the gap between generic VLM tasks and the specialized needs of VTT. Experimental results demonstrate that our fine-tuned model outperforms baseline VLMs across various evaluation metrics, underscoring the importance of domain-specific tuning for complex VTT tasks.

[CV-69] Stylish and Functional: Guided Interpolation Subject to Physical Constraints NEURIPS2024

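[Quick Read]: This paper addresses the inability of generative AI to account for physical constraints and functional requirements in engineering design. The key to the solution is a zero-shot framework that uses a pretrained diffusion model as the backbone and introduces a symmetrizer into the generation process to enforce physical and functional requirements. As a case study, the paper takes rotational symmetry in wheel design and uses the symmetrizer to guide the diffusion process toward symmetric wheel designs that satisfy physical stability. Experiments show that the method produces interpolations with higher realism (as measured by Fréchet inception distance) than methods in related work, and designs that satisfy physical and functional requirements more closely than generation without the symmetry guidance.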

Link: https://arxiv.org/abs/2412.15507
Authors: Yan-Ying Chen, Nikos Arechiga, Chenyang Yuan, Matthew Hong, Matt Klenk, Charlene Wu
Affiliations: Toyota Research Institute
Keywords: enabling rapid prototyping, functional requirements, practices by enabling, enabling rapid, rapid prototyping
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Foundation Models for Science Workshop, 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:Generative AI is revolutionizing engineering design practices by enabling rapid prototyping and manipulation of designs. One example of design manipulation involves taking two reference design images and using them as prompts to generate a design image that combines aspects of both. Real engineering designs have physical constraints and functional requirements in addition to aesthetic design considerations. Internet-scale foundation models commonly used for image generation, however, are unable to take these physical constraints and functional requirements into consideration as part of the generation process. We consider the problem of generating a design inspired by two input designs, and propose a zero-shot framework toward enforcing physical, functional requirements over the generation process by leveraging a pretrained diffusion model as the backbone. As a case study, we consider the example of rotational symmetry in generation of wheel designs. Automotive wheels are required to be rotationally symmetric for physical stability. We formulate the requirement of rotational symmetry by the use of a symmetrizer, and we use this symmetrizer to guide the diffusion process towards symmetric wheel generations. Our experimental results find that the proposed approach makes generated interpolations with higher realism than methods in related work, as evaluated by Fréchet inception distance (FID). We also find that our approach generates designs that more closely satisfy physical and functional requirements than generating without the symmetry guidance.
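
The symmetrizer idea can be illustrated with a small sketch. The abstract does not specify how the symmetrizer or the guidance is implemented, so everything here is an assumption: the image is a square grid of floats, the symmetry group is restricted to 90-degree rotations (real wheels may have any number of spokes), and `guided_step` is a hypothetical helper that simply blends a sample toward its symmetrized version rather than modifying a real diffusion sampler.

```python
def rotate90(img):
    """Rotate a square grid (list of rows) by 90 degrees clockwise."""
    n = len(img)
    return [[img[n - 1 - c][r] for c in range(n)] for r in range(n)]


def symmetrize_c4(img):
    """Average a square image over its four 90-degree rotations.

    The result is unchanged by a further 90-degree rotation.
    """
    n = len(img)
    acc = [[0.0] * n for _ in range(n)]
    cur = img
    for _ in range(4):
        for r in range(n):
            for c in range(n):
                acc[r][c] += cur[r][c] / 4.0
        cur = rotate90(cur)
    return acc


def symmetry_error(img):
    """Mean squared difference between an image and its 90-degree rotation."""
    n = len(img)
    rot = rotate90(img)
    total = sum((img[r][c] - rot[r][c]) ** 2 for r in range(n) for c in range(n))
    return total / (n * n)


def guided_step(img, strength):
    """Hypothetical guidance step: blend the image toward its symmetrized version."""
    sym = symmetrize_c4(img)
    n = len(img)
    return [
        [(1.0 - strength) * img[r][c] + strength * sym[r][c] for c in range(n)]
        for r in range(n)
    ]
```

Repeating `guided_step` between denoising steps would drive `symmetry_error` toward zero; how strongly and when to apply such a pull is a design choice the abstract leaves open.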

[CV-70] A Robust Prototype-Based Network with Interpretable RBF Classifier Foundations AAAI2025

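[Quick Read]: This paper addresses the lower performance of prototype-based classification learning methods relative to deep models, together with interpretability shortcomings in deep prototype-based networks. The key to the solution is an extension of the Classification-by-Components (CBC) approach, which uses a probabilistic model to ensure interpretability; the extension resolves the contradicting explanations that the original model can produce. The paper proves that the extension has robustness guarantees and derives a loss that optimizes robustness. The analysis also shows that most (deep) Prototype-Based Networks (PBNs) are related to (deep) RBF classifiers, so the robustness guarantees carry over to shallow RBF classifiers. Experiments show that the proposed deep PBN reaches state-of-the-art classification accuracy on several benchmarks while resolving the interpretability shortcomings of other approaches.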

Link: https://arxiv.org/abs/2412.15499
Authors: Sascha Saralajew, Ashish Rana, Thomas Villmann, Ammar Shaker
Affiliations: Leipzig University; University of Applied Sciences Mittweida
Keywords: classification learning methods, learning methods, Prototype-based classification learning, deep Prototype-Based Networks, Prototype-Based Networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear at AAAI 2025. Includes the Appendix

Abstract:Prototype-based classification learning methods are known to be inherently interpretable. However, this paradigm suffers from major limitations compared to deep models, such as lower performance. This led to the development of the so-called deep Prototype-Based Networks (PBNs), also known as prototypical parts models. In this work, we analyze these models with respect to different properties, including interpretability. In particular, we focus on the Classification-by-Components (CBC) approach, which uses a probabilistic model to ensure interpretability and can be used as a shallow or deep architecture. We show that this model has several shortcomings, like creating contradicting explanations. Based on these findings, we propose an extension of CBC that solves these issues. Moreover, we prove that this extension has robustness guarantees and derive a loss that optimizes robustness. Additionally, our analysis shows that most (deep) PBNs are related to (deep) RBF classifiers, which implies that our robustness guarantees generalize to shallow RBF classifiers. The empirical evaluation demonstrates that our deep PBN yields state-of-the-art classification accuracy on different benchmarks while resolving the interpretability shortcomings of other approaches. Further, our shallow PBN variant outperforms other shallow PBNs while being inherently interpretable and exhibiting provable robustness guarantees.

[CV-71] GCA-3D: Towards Generalized and Consistent Domain Adaptation of 3D Generators

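[Quick Read]: This paper addresses two main problems in 3D generative domain adaptation. First, existing methods synthesize target-domain images with pre-trained text-to-image diffusion models and then fine-tune the 3D model, a tedious pipeline that introduces pose bias between the source domain and the synthetic dataset. Second, they do not support one-shot image-guided adaptation, which suffers from more severe pose bias plus an additional identity bias from the single reference image. The key to the solution is GCA-3D, a generalized and consistent 3D domain adaptation method that introduces a multi-modal depth-aware score distillation sampling loss to adapt 3D generative models efficiently in a non-adversarial manner. The method supports both text-prompt and one-shot image-prompt adaptation, and uses per-instance depth maps from the volume rendering module to mitigate overfitting and retain diversity. The paper also proposes a hierarchical spatial consistency loss to improve pose and identity consistency between images generated in the source and target domains.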

Link: https://arxiv.org/abs/2412.15491
Authors: Hengjia Li, Yang Liu, Yibo Zhao, Haoran Cheng, Yang Yang, Linxuan Xia, Zekai Luo, Qibo Qiu, Boxi Wu, Tu Zheng, Zheng Yang, Deng Cai
Affiliations: Zhejiang University; Alibaba Group; Fabu Inc
Keywords: camera pose distributions, collecting massive datasets, collecting massive, domain adaptation, domain
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, 3D generative domain adaptation has emerged to adapt the pre-trained generator to other domains without collecting massive datasets and camera pose distributions. Typically, they leverage large-scale pre-trained text-to-image diffusion models to synthesize images for the target domain and then fine-tune the 3D model. However, they suffer from the tedious pipeline of data generation, which inevitably introduces pose bias between the source domain and synthetic dataset. Furthermore, they are not generalized to support one-shot image-guided domain adaptation, which is more challenging due to the more severe pose bias and additional identity bias introduced by the single image reference. To address these issues, we propose GCA-3D, a generalized and consistent 3D domain adaptation method without the intricate pipeline of data generation. Different from previous pipeline methods, we introduce multi-modal depth-aware score distillation sampling loss to efficiently adapt 3D generative models in a non-adversarial manner. This multi-modal loss enables GCA-3D in both text prompt and one-shot image prompt adaptation. Besides, it leverages per-instance depth maps from the volume rendering module to mitigate the overfitting problem and retain the diversity of results. To enhance the pose and identity consistency, we further propose a hierarchical spatial consistency loss to align the spatial structure between the generated images in the source and target domain. Experiments demonstrate that GCA-3D outperforms previous methods in terms of efficiency, generalization, pose accuracy, and identity consistency.

[CV-72] Toward Appearance-based Autonomous Landing Site Identification for Multirotor Drones in Unstructured Environments

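[Quick Read]: This paper addresses the challenge of autonomously identifying viable landing sites for multirotor drones in unstructured environments. The key to the solution is to exploit modern drones' ability to survey terrain automatically and to generate synthetic data sets for training appearance-based terrain classifiers. Specifically, landing safety masks are computed automatically from terrain models derived from the surveys and paired with RGB images, and a U-Net is trained on the resulting synthetic data. This lowers the cost of creating data sets; the classifier is validated on real-world data and demonstrated in real time on a drone platform.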

Link: https://arxiv.org/abs/2412.15486
Authors: Joshua Springer, Gylfi Þór Guðmundsson, Marcel Kyas
Affiliations: Reykjavik University
Keywords: multirotor drone flight, viable landing sites, unstructured environments, remaining challenge, challenge in multirotor
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 14 pages, 5 figures

Abstract:A remaining challenge in multirotor drone flight is the autonomous identification of viable landing sites in unstructured environments. One approach to solve this problem is to create lightweight, appearance-based terrain classifiers that can segment a drone’s RGB images into safe and unsafe regions. However, such classifiers require data sets of images and masks that can be prohibitively expensive to create. We propose a pipeline to automatically generate synthetic data sets to train these classifiers, leveraging modern drones’ ability to survey terrain automatically and the ability to automatically calculate landing safety masks from terrain models derived from such surveys. We then train a U-Net on the synthetic data set, test it on real-world data for validation, and demonstrate it on our drone platform in real-time.
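
To make "automatically calculate landing safety masks from terrain models" concrete, here is a minimal sketch. The abstract does not state the safety criterion, so the rule below is an assumption: the terrain model is a height grid, and a cell is safe when the steepest height change to any of its four neighbours stays under a slope limit. The names `safety_mask`, `cell_size`, and `max_slope` are hypothetical.

```python
def safety_mask(heights, cell_size, max_slope):
    """Return a grid of 1 (safe) / 0 (unsafe) labels for a height map.

    Assumed criterion: a cell is safe when the largest rise-over-run to
    any 4-connected neighbour does not exceed max_slope.
    """
    rows, cols = len(heights), len(heights[0])
    mask = []
    for r in range(rows):
        row = []
        for c in range(cols):
            steepest = 0.0
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    slope = abs(heights[rr][cc] - heights[r][c]) / cell_size
                    steepest = max(steepest, slope)
            row.append(1 if steepest <= max_slope else 0)
        mask.append(row)
    return mask
```

Masks produced this way could be paired with images of the same terrain to form (image, mask) training pairs for a segmentation network; a real pipeline would likely also account for roughness and the drone's footprint.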

[CV-73] Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage

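[Quick Read]: This paper addresses hallucination in multimodal large language models (MLLMs) when they generate highly detailed captions. Existing hallucination detection methods struggle with detailed captions, which the authors attribute to MLLMs relying increasingly on their own generated text rather than the input image as the sequence grows. The key to the solution is a multiagent approach in which an LLM and an MLLM collaborate to correct given captions. The paper also introduces an evaluation framework and a benchmark dataset for systematic analysis of detailed captions. Experiments show that the proposed evaluation method aligns better with human judgments of factuality than existing metrics, and that the proposed correction method significantly improves the factual accuracy of captions, including those generated by GPT-4V.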

Link: https://arxiv.org/abs/2412.15484
Authors: Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon
Affiliations: Unknown
Keywords: Multimodal large language, large language models, Multimodal large, generating highly detailed, language models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that our proposed evaluation method better aligns with human judgments of factuality than existing metrics and that existing approaches to improve the MLLM factuality may fall short in hyper-detailed image captioning tasks. In contrast, our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM’s performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.

[CV-74] Task-Specific Preconditioner for Cross-Domain Few-Shot Learning AAAI2025

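[Quick Read]: This paper addresses the limited adaptability of task-specific parameters in Cross-Domain Few-Shot Learning (CDFSL). Existing methods adapt these parameters with fixed optimization strategies, which can be sub-optimal across domains or target tasks. The key to the solution is a new adaptation mechanism, Task-Specific Preconditioned gradient descent (TSP). The method first meta-learns Domain-Specific Preconditioners (DSPs) that capture the characteristics of each meta-training domain, then linearly combines them using task coefficients to form the Task-Specific Preconditioner. Applying this preconditioner to gradient descent makes the optimization adaptive to the target task. The preconditioners are constrained to be positive definite, which guides the preconditioned gradient toward the direction of steepest descent. Experiments show that TSP achieves state-of-the-art performance on Meta-Dataset.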

Link: https://arxiv.org/abs/2412.15483
Authors: Suhyun Kang, Jungwon Park, Wonseok Lee, Wonjong Rhee
Affiliations: Unknown
Keywords: Cross-Domain Few-Shot Learning, typically parameterize models, Few-Shot Learning, methods typically parameterize, Cross-Domain Few-Shot
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:Cross-Domain Few-Shot Learning (CDFSL) methods typically parameterize models with task-agnostic and task-specific parameters. To adapt task-specific parameters, recent approaches have utilized fixed optimization strategies, despite their potential sub-optimality across varying domains or target tasks. To address this issue, we propose a novel adaptation mechanism called Task-Specific Preconditioned gradient descent (TSP). Our method first meta-learns Domain-Specific Preconditioners (DSPs) that capture the characteristics of each meta-training domain, which are then linearly combined using task-coefficients to form the Task-Specific Preconditioner. The preconditioner is applied to gradient descent, making the optimization adaptive to the target task. We constrain our preconditioners to be positive definite, guiding the preconditioned gradient toward the direction of steepest descent. Empirical evaluations on the Meta-Dataset show that TSP achieves state-of-the-art performance across diverse experimental scenarios.
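
The update rule described here can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the preconditioners are small dense matrices given as nested lists, the task coefficients are assumed non-negative (so a combination of positive definite matrices stays positive definite), and the toy quadratic `loss`/`grad` pair exists only to exercise the step. How the DSPs are meta-learned and how the coefficients are computed is not covered.

```python
def mat_vec(m, v):
    """Multiply a square matrix (nested lists) by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def combine_preconditioners(dsps, coeffs):
    """Form P = sum_k c_k * P_k from domain-specific preconditioners P_k."""
    n = len(dsps[0])
    return [
        [sum(c * p[i][j] for c, p in zip(coeffs, dsps)) for j in range(n)]
        for i in range(n)
    ]


def preconditioned_step(theta, grad_vec, precond, lr):
    """One step of theta <- theta - lr * P @ grad."""
    direction = mat_vec(precond, grad_vec)
    return [t - lr * d for t, d in zip(theta, direction)]


def loss(theta):
    """Toy ill-conditioned quadratic used only for illustration."""
    return 0.5 * (100.0 * theta[0] ** 2 + theta[1] ** 2)


def grad(theta):
    """Gradient of the toy quadratic above."""
    return [100.0 * theta[0], theta[1]]
```

With a positive definite `P`, the inner product of the gradient `g` with `P @ g` is strictly positive for any non-zero `g`, so the step is always a descent direction; that is the property the positive-definiteness constraint protects.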

[CV-75] Difficulty-aware Balancing Margin Loss for Long-tailed Recognition

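[Quick Read]: This paper addresses the difficulty deep neural networks have in recognizing classes with few samples when trained on severely imbalanced data. The key to the solution is the Difficulty-aware Balancing Margin (DBM) loss, which accounts for both class imbalance and instance difficulty. It has two components: a class-wise margin that mitigates the learning bias caused by imbalanced class frequencies, and an instance-wise margin assigned to hard positive samples according to their individual difficulty. By assigning larger margins to harder samples, DBM loss improves class discriminativity. It combines seamlessly with existing approaches and consistently improves performance across long-tailed recognition benchmarks.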

Link: https://arxiv.org/abs/2412.15477
Authors: Minseok Son, Inyong Koo, Jinyoung Park, Changick Kim
Affiliations: Korea Advanced Institute of Science and Technology (KAIST); Korea University
Keywords: deep neural networks, accurately recognize classes, severely imbalanced data, deep neural, trained with severely
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:When trained with severely imbalanced data, deep neural networks often struggle to accurately recognize classes with only a few samples. Previous studies in long-tailed recognition have attempted to rebalance biased learning using known sample distributions, primarily addressing different classification difficulties at the class level. However, these approaches often overlook the instance difficulty variation within each class. In this paper, we propose a difficulty-aware balancing margin (DBM) loss, which considers both class imbalance and instance difficulty. DBM loss comprises two components: a class-wise margin to mitigate learning bias caused by imbalanced class frequencies, and an instance-wise margin assigned to hard positive samples based on their individual difficulty. DBM loss improves class discriminativity by assigning larger margins to more difficult samples. Our method seamlessly combines with existing approaches and consistently improves performance across various long-tailed recognition benchmarks.
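
The two-margin structure can be sketched as follows. The abstract does not give the exact DBM formula, so the details are assumptions: logits are cosine similarities scaled by a temperature, both margins are subtracted additively from the true-class cosine, the class-wise margin uses an LDAM-style `n^(-1/4)` scaling, and difficulty is measured as low cosine similarity to the true class. All function and parameter names are hypothetical.

```python
import math


def class_margin(count, max_margin, min_count):
    """Larger margin for rarer classes (assumed LDAM-style n^(-1/4) scaling)."""
    return max_margin * (min_count / count) ** 0.25


def instance_margin(cos_target, scale):
    """Larger margin for harder samples (lower cosine to the true class)."""
    return scale * (1.0 - cos_target) / 2.0


def margin_loss(cosines, target, counts,
                max_margin=0.3, inst_scale=0.2, temperature=16.0):
    """Cross-entropy over scaled cosine logits with both margins on the target."""
    m = (class_margin(counts[target], max_margin, min(counts))
         + instance_margin(cosines[target], inst_scale))
    logits = [
        temperature * (c - m if k == target else c)
        for k, c in enumerate(cosines)
    ]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]
```

Setting `max_margin=0.0` and `inst_scale=0.0` recovers plain scaled-cosine cross-entropy, which makes it easy to compare the effect of each margin in isolation.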

[CV-76] LiHi-GS: LiDAR-Supervised Gaussian Splatting for Highway Driving Scene Reconstruction

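[Quick Read]: This paper addresses two gaps in existing Gaussian Splatting (GS) methods for autonomous-driving scene reconstruction. First, they focus on low-speed, feature-rich urban scenes and overlook the importance of highway scenarios. Second, they learn primarily from images, using LiDAR only for initial estimates or without precise sensor modeling, so they fail to exploit LiDAR's rich depth information and are limited in synthesizing LiDAR data. The key to the solution is a new GS method for dynamic scene synthesis and editing that adds LiDAR supervision and supports LiDAR rendering. According to the authors, it is the first to focus on the more challenging highway scenes, with their sparse sensor views and monotone backgrounds.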

Link: https://arxiv.org/abs/2412.15447
Authors: Pou-Chun Kung, Xianling Zhang, Katherine A. Skinner, Nikita Jaipuria
Affiliations: Latitude AI; University of Michigan, Ann Arbor
Keywords: additional acquisition costs, Neural Radiance Fields, simulate safety-critical scenarios, expand training data, autonomous driving
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Photorealistic 3D scene reconstruction plays an important role in autonomous driving, enabling the generation of novel data from existing datasets to simulate safety-critical scenarios and expand training data without additional acquisition costs. Gaussian Splatting (GS) facilitates real-time, photorealistic rendering with an explicit 3D Gaussian representation of the scene, providing faster processing and more intuitive scene editing than the implicit Neural Radiance Fields (NeRFs). While extensive GS research has yielded promising advancements in autonomous driving applications, they overlook two critical aspects: First, existing methods mainly focus on low-speed and feature-rich urban scenes and ignore the fact that highway scenarios play a significant role in autonomous driving. Second, while LiDARs are commonplace in autonomous driving platforms, existing methods learn primarily from images and use LiDAR only for initial estimates or without precise sensor modeling, thus missing out on leveraging the rich depth information LiDAR offers and limiting the ability to synthesize LiDAR data. In this paper, we propose a novel GS method for dynamic scene synthesis and editing with improved scene reconstruction through LiDAR supervision and support for LiDAR rendering. Unlike prior works that are tested mostly on urban datasets, to the best of our knowledge, we are the first to focus on the more challenging and highly relevant highway scenes for autonomous driving, with sparse sensor views and monotone backgrounds.

[CV-77] Efficient Neural Network Encoding for 3D Color Lookup Tables AAAI2025

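[Quick Read]: This paper aims to reduce the memory needed to store large numbers of 3D color lookup tables (LUTs). The key to the solution is a neural network architecture that encodes hundreds of LUTs in a single compact representation with a memory footprint under 0.25 MB, while keeping color distortion minor (average color difference $\bar{\Delta E}_M \leq 2.0$ over the entire color gamut). The network can also weight colors to gain further quality on natural image colors ($\bar{\Delta E}_M \leq 1.0$). In addition, minor modifications to the architecture enable a bijective encoding, yielding invertible LUTs that allow reverse color processing.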

Link: https://arxiv.org/abs/2412.15438
Authors: Vahid Zehtab, David B. Lindell, Marcus A. Brubaker, Michael S. Brown
Affiliations: Samsung AI Center Toronto
Keywords: mapping input RGB, specific output RGB, input RGB, output RGB, color lookup tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 14 pages, 13 figures; extended version; to appear in AAAI 2025

Abstract:3D color lookup tables (LUTs) enable precise color manipulation by mapping input RGB values to specific output RGB values. 3D LUTs are instrumental in various applications, including video editing, in-camera processing, photographic filters, computer graphics, and color processing for displays. While an individual LUT does not incur a high memory overhead, software and devices may need to store dozens to hundreds of LUTs that can take over 100 MB. This work aims to develop a neural network architecture that can encode hundreds of LUTs in a single compact representation. To this end, we propose a model with a memory footprint of less than 0.25 MB that can reconstruct 512 LUTs with only minor color distortion ($\bar{\Delta E}_M \leq 2.0$) over the entire color gamut. We also show that our network can weight colors to provide further quality gains on natural image colors ($\bar{\Delta E}_M \leq 1.0$). Finally, we show that minor modifications to the network architecture enable a bijective encoding that produces LUTs that are invertible, allowing for reverse color processing. Our code is available at this https URL.
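
For readers unfamiliar with 3D LUTs, the sketch below shows what the network has to reproduce: a conventional LUT applied by trilinear interpolation, plus the storage arithmetic behind the "over 100 MB" figure. The 33-point grid and 4-byte floats used in the arithmetic are assumptions, not values from the paper, and the function names are hypothetical. This is the baseline representation, not the proposed neural encoding.

```python
def identity_lut(n):
    """n x n x n grid (n >= 2) mapping each RGB grid point in [0, 1] to itself."""
    step = 1.0 / (n - 1)
    return [
        [[(r * step, g * step, b * step) for b in range(n)] for g in range(n)]
        for r in range(n)
    ]


def apply_lut(lut, rgb):
    """Trilinearly interpolate the LUT at an RGB triple with channels in [0, 1]."""
    n = len(lut)
    idx, frac = [], []
    for v in rgb:
        x = min(max(v, 0.0), 1.0) * (n - 1)
        i = min(int(x), n - 2)  # lower corner of the enclosing grid cell
        idx.append(i)
        frac.append(x - i)
    out = [0.0, 0.0, 0.0]
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                weight = ((frac[0] if dr else 1.0 - frac[0])
                          * (frac[1] if dg else 1.0 - frac[1])
                          * (frac[2] if db else 1.0 - frac[2]))
                node = lut[idx[0] + dr][idx[1] + dg][idx[2] + db]
                for ch in range(3):
                    out[ch] += weight * node[ch]
    return tuple(out)


def lut_bytes(n, bytes_per_value=4):
    """Raw storage for one n^3 LUT with three output channels."""
    return n ** 3 * 3 * bytes_per_value
```

Under these assumptions one LUT takes `lut_bytes(33)` = 431,244 bytes, so 512 of them need roughly 220 MB uncompressed, compared with the sub-0.25 MB footprint the abstract reports for the learned representation.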

[CV-78] SolidGS: Consolidating Gaussian Surfel Splatting for Sparse-View Surface Reconstruction

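[Quick Read]: This paper addresses the poor surface reconstruction quality of Gaussian splatting when only sparse-view input images are available. The key to the solution is a new method, SolidGS, which consolidates all Gaussians by adopting a more solid kernel function, reducing the inconsistency of reconstructed geometry across views. Combined with geometrical regularization and monocular normal estimation, SolidGS outperforms Gaussian splatting methods and neural field methods on sparse-view surface reconstruction on the widely used DTU, Tanks-and-Temples, and LLFF datasets.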

Link: https://arxiv.org/abs/2412.15400
Authors: Zhuowen Shen, Yuan Liu, Zhang Chen, Zhong Li, Jiepeng Wang, Yongqing Liang, Zhengming Yu, Jingdong Zhang, Yi Xu, Scott Schaefer, Xin Li, Wenping Wang
Affiliations: Texas A&M University; OPPO US Research Center; Nanyang Technological University; Hong Kong University of Science and Technology; The University of Hong Kong
Keywords: achieved impressive improvements, Gaussian splatting, achieved impressive, impressive improvements, novel-view synthesis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Gaussian splatting has achieved impressive improvements for both novel-view synthesis and surface reconstruction from multi-view images. However, current methods still struggle to reconstruct high-quality surfaces from only sparse view input images using Gaussian splatting. In this paper, we propose a novel method called SolidGS to address this problem. We observed that the reconstructed geometry can be severely inconsistent across multi-views, due to the property of Gaussian function in geometry rendering. This motivates us to consolidate all Gaussians by adopting a more solid kernel function, which effectively improves the surface reconstruction quality. With the additional help of geometrical regularization and monocular normal estimation, our method achieves superior performance on the sparse view surface reconstruction than all the Gaussian splatting methods and neural field methods on the widely used DTU, Tanks-and-Temples, and LLFF datasets.

[CV-79] Maximising Histopathology Segmentation using Minimal Labels via Self-Supervision

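[Quick Read]: This paper addresses the large amount of labelled data that deep learning methods such as UNet require for histopathology image segmentation. The key to the solution is self-supervised pre-training, including SimCLR, BYOL, and a novel approach, HR-CS-CO, which substantially reduces the labels needed. With self-supervised pre-training and only 5% of the labels, performance drops by just 5.9% for UNet, 4.5% for MDS1, and 6.2% for UDAGAN relative to their fully supervised counterparts (no pre-training, 100% of the labels).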

Link: https://arxiv.org/abs/2412.15389
Authors: Zeeshan Nisar, Thomas Lampert
Affiliations: University of Strasbourg; Unknown
Keywords: tissue samples, diagnosis and prognosis, microscopic examination, examination of tissue, essential for disease
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 35 pages, 10 figures, 3 Tables

Abstract:Histopathology, the microscopic examination of tissue samples, is essential for disease diagnosis and prognosis. Accurate segmentation and identification of key regions in histopathology images are crucial for developing automated solutions. However, state-of-the-art deep learning segmentation methods like UNet require extensive labels, which are both costly and time-consuming to obtain, particularly when dealing with multiple stainings. To mitigate this, multi-stain segmentation methods such as MDS1 and UDAGAN have been developed, which reduce the need for labels by requiring only one (source) stain to be labelled. Nonetheless, obtaining source stain labels can still be challenging, and segmentation models fail when they are unavailable. This article shows that through self-supervised pre-training, including SimCLR, BYOL, and a novel approach, HR-CS-CO, the performance of these segmentation methods (UNet, MDS1, and UDAGAN) can be retained even with 95% fewer labels. Notably, with self-supervised pre-training and using only 5% labels, the performance drops are minimal: 5.9% for UNet, 4.5% for MDS1, and 6.2% for UDAGAN, compared to their respective fully supervised counterparts (without pre-training, using 100% labels). The code is available from this https URL [to be made public upon acceptance].

[CV-80] Uncertainty-Guided Cross Attention Ensemble Mean Teacher for Semi-supervised Medical Image Segmentation WACV2025

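[Quick Read]: This paper aims to improve semi-supervised medical image segmentation and proposes a new framework, Uncertainty-Guided Cross Attention Ensemble Mean Teacher (UG-CEMT). The key to the solution is combining a Cross-attention Ensemble Mean Teacher framework (CEMT) with uncertainty-guided consistency regularization and Sharpness-Aware Minimization, improving semi-supervised performance by fostering high disparity between sub-networks. Experiments show that UG-CEMT achieves state-of-the-art segmentation on multi-center prostate MRI and cardiac MRI datasets and approaches the performance of fully supervised methods with only 10% labeled data, demonstrating its effectiveness in exploiting unlabeled data for robust medical image segmentation.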

Link: https://arxiv.org/abs/2412.15380
Authors: Meghana Karri, Amit Soni Arya, Koushik Biswas, Nicolò Gennaro, Vedat Cicek, Gorkem Durak, Yuri S. Velichko, Ulas Bagci
Affiliations: Northwestern University; Bennett University
Keywords: Cross Attention Ensemble, Uncertainty-Guided Cross Attention, Cross Attention, Attention Ensemble, Ensemble Mean Teacher
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in WACV 2025

Abstract:This work proposes a novel framework, Uncertainty-Guided Cross Attention Ensemble Mean Teacher (UG-CEMT), for achieving state-of-the-art performance in semi-supervised medical image segmentation. UG-CEMT leverages the strengths of co-training and knowledge distillation by combining a Cross-attention Ensemble Mean Teacher framework (CEMT) inspired by Vision Transformers (ViT) with uncertainty-guided consistency regularization and Sharpness-Aware Minimization emphasizing uncertainty. UG-CEMT improves semi-supervised performance while maintaining a consistent network architecture and task setting by fostering high disparity between sub-networks. Experiments demonstrate significant advantages over existing methods like Mean Teacher and Cross-pseudo Supervision in terms of disparity, domain generalization, and medical image segmentation performance. UG-CEMT achieves state-of-the-art results on multi-center prostate MRI and cardiac MRI datasets, where object segmentation is particularly challenging. Our results show that using only 10% labeled data, UG-CEMT approaches the performance of fully supervised methods, demonstrating its effectiveness in exploiting unlabeled data for robust medical image segmentation. The code is publicly available at this https URL
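
Two of the ingredients named above have a well-known generic form that can be sketched briefly: the Mean Teacher update, where teacher weights are an exponential moving average of student weights, and a consistency loss that ignores predictions the teacher is uncertain about. The entropy-threshold masking below is an assumption borrowed from common uncertainty-aware mean-teacher setups, not the paper's stated formulation; the cross-attention ensemble and Sharpness-Aware Minimization are not shown.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """Mean Teacher update: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli foreground probability p."""
    eps = 1e-12
    return -(p * math.log(p + eps) + (1.0 - p) * math.log(1.0 - p + eps))


def consistency_loss(student_probs, teacher_probs, max_entropy):
    """Mean squared error over pixels whose teacher entropy is at most max_entropy."""
    kept = [
        (s, t) for s, t in zip(student_probs, teacher_probs)
        if binary_entropy(t) <= max_entropy
    ]
    if not kept:
        return 0.0
    return sum((s - t) ** 2 for s, t in kept) / len(kept)
```

Here weights and per-pixel probabilities are flat lists of floats; in practice both would be tensors, and the entropy threshold is often relaxed over training so that more pixels contribute as the teacher improves.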

[CV-81] Dataset Augmentation by Mixing Visual Concepts WACV2025

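[Quick Read]: This paper addresses the domain discrepancy between generated images and real data, performing dataset augmentation by fine-tuning a pre-trained diffusion model. The key to the solution is a procedure called Mixing Visual Concepts (MVC): the diffusion model is fine-tuned by conditioning it on real images and on novel text embeddings created from image captions. MVC generates images that are diverse yet similar to the real data, enabling effective dataset augmentation, and outperforms state-of-the-art augmentation techniques on benchmark classification tasks.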

Link: https://arxiv.org/abs/2412.15358
Authors: Abdullah Al Rahat, Hemanth Venkateswara
Affiliations: Georgia State University
Keywords: pre-trained diffusion models, pre-trained diffusion, Mixing Visual Concepts, diffusion model, fine-tuning pre-trained diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2025 main conference

Abstract:This paper proposes a dataset augmentation method by fine-tuning pre-trained diffusion models. Generating images using a pre-trained diffusion model with textual conditioning often results in domain discrepancy between real data and generated images. We propose a fine-tuning approach where we adapt the diffusion model by conditioning it with real images and novel text embeddings. We introduce a unique procedure called Mixing Visual Concepts (MVC) where we create novel text embeddings from image captions. The MVC enables us to generate multiple images which are diverse and yet similar to the real data enabling us to perform effective dataset augmentation. We perform comprehensive qualitative and quantitative evaluations with the proposed dataset augmentation approach showcasing both coarse-grained and finegrained changes in generated images. Our approach outperforms state-of-the-art augmentation techniques on benchmark classification tasks.

[CV-82] Exploring Machine Learning Engineering for Object Detection and Tracking by Unmanned Aerial Vehicle (UAV) ICMLA2024

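[Quick Read]: This paper addresses object detection and tracking for autonomous systems in indoor environments, specifically detecting and tracking a Roomba vacuum cleaner as an emulation of a search and rescue (SAR) task. The key to the solution is a complete machine learning pipeline covering dataset creation, annotation, feature selection, integration and refinement of algorithms, and performance evaluation. The steps are: collect videos and extract frames to create the dataset; label it with a combination of manual and automated techniques; refine the dataset through an initial round of YOLOv4 training; then train a second YOLOv4 model and a Mask R-CNN model and deploy on a Parrot Mambo drone for real-time detection and tracking. Experiments report an average loss of 0.1942 and 96% accuracy in detecting and tracking the Roomba.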

Link: https://arxiv.org/abs/2412.15347
Authors: Aneesha Guna, Parth Ganeriwala, Siddhartha Bhattacharyya
Affiliations: Edgewood Jr/Sr High School; Florida Institute of Technology
Keywords: advanced machine learning, autonomous operations, deep learning methods, autonomous systems, advancement of deep
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICMLA '24

Abstract:With the advancement of deep learning methods, autonomous systems will increasingly become intelligent, incorporating advanced machine learning algorithms to execute a variety of autonomous operations. One such task involves the design and evaluation of a subsystem of the perception system for object detection and tracking. The challenge in creating software for this task lies in identifying the need for a dataset, annotating it, selecting features, and integrating and refining existing algorithms, while evaluating performance metrics through training and testing. This research effort focuses on the development of a machine learning pipeline emphasizing the inclusion of assurance methods with increasing automation. In the process, a new dataset was created by collecting videos of a moving object such as a Roomba vacuum cleaner, emulating search and rescue (SAR) in an indoor environment. Individual frames were extracted from the videos and labeled using a combination of manual and automated techniques. The annotated dataset was refined for accuracy by first training a YOLOv4 model on it. After refinement, the dataset was used to train a second YOLOv4 model and a Mask R-CNN model, which is deployed on a Parrot Mambo drone to perform real-time object detection and tracking. Experimental results demonstrate the effectiveness of the models in accurately detecting and tracking the Roomba across multiple trials, achieving an average loss of 0.1942 and 96% accuracy.

[CV-83] Efficient Fine-Tuning and Concept Suppression for Pruned Diffusion Models

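[Quick Read]: This paper addresses the computational burden of deploying diffusion generative models in resource-constrained settings such as mobile devices, while preventing compressed models from propagating undesirable behaviors such as generating copyrighted content or unsafe concepts. The key to the solution is a novel bilevel optimization framework that consolidates fine-tuning and unlearning into a single phase. The framework keeps the advantages of knowledge distillation, namely efficient convergence and style transfer capabilities, while selectively suppressing unwanted content. This plug-in framework is compatible with various pruning and concept unlearning methods, enabling safe, efficient deployment of diffusion models in controlled environments.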

Link: https://arxiv.org/abs/2412.15341
Authors: Reza Shirkavand, Peiran Yu, Shangqian Gao, Gowthami Somepalli, Tom Goldstein, Heng Huang
Affiliations: University of Maryland, College Park; Florida State University
Keywords: yielded remarkable progress, Recent advances, remarkable progress, yielded remarkable, Recent
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in diffusion generative models have yielded remarkable progress. While the quality of generated content continues to improve, these models have grown considerably in size and complexity. This increasing computational burden poses significant challenges, particularly in resource-constrained deployment scenarios such as mobile devices. The combination of model pruning and knowledge distillation has emerged as a promising solution to reduce computational demands while preserving generation quality. However, this technique inadvertently propagates undesirable behaviors, including the generation of copyrighted content and unsafe concepts, even when such instances are absent from the fine-tuning dataset. In this paper, we propose a novel bilevel optimization framework for pruned diffusion models that consolidates the fine-tuning and unlearning processes into a unified phase. Our approach maintains the principal advantages of distillation-namely, efficient convergence and style transfer capabilities-while selectively suppressing the generation of unwanted content. This plug-in framework is compatible with various pruning and concept unlearning methods, facilitating efficient, safe deployment of diffusion models in controlled environments.

[CV-84] Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

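[Quick Read]: This paper aims to synthesize high-quality, synchronized audio from video and optional text conditions. The key to the solution is a multimodal joint training framework, MMAudio, which is trained jointly with larger-scale, readily available text-audio data to generate semantically aligned, high-quality audio. The paper also introduces a conditional synchronization module that aligns video conditions with audio latents at the frame level, improving audio-visual synchrony. Trained with a flow matching objective, MMAudio sets a new state of the art among public models for video-to-audio in audio quality, semantic alignment, and audio-visual synchronization, with low inference time and only 157M parameters. It is also competitive at text-to-audio generation, indicating that joint training does not hinder single-modality performance.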

Link: https://arxiv.org/abs/2412.15322
Authors: Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji
Affiliations: University of Illinois Urbana-Champaign; Sony AI; Sony Group Corporation
Keywords: optional text conditions, propose to synthesize, optional text, multimodal joint training, joint training framework
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Project page: this https URL

Abstract:We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: this https URL

[CV-85] Next Patch Prediction for Autoregressive Visual Generation ATC

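[Quick read]: This paper tackles the high computational cost of autoregressive models in image generation. The key solution is a new paradigm, Next Patch Prediction (NPP). Image tokens are grouped and aggregated into patch tokens with higher information density, which shortens the input sequence; the autoregressive model is then trained to predict the next patch, significantly reducing computational cost. The paper also proposes a multi-scale, coarse-to-fine patch grouping strategy that exploits the natural hierarchical structure of image data. Experiments show that the method cuts training cost to about 0.6x while improving image quality, with FID improving by up to 1.0 point. The main novelty is that the original autoregressive architecture is retained, with no additional trainable parameters and no custom image tokenizer, which keeps the method flexible and lets it adapt seamlessly to various autoregressive models.

Link: https://arxiv.org/abs/2412.15321
Authors: Yatian Pang,Peng Jin,Shuo Yang,Bin Lin,Bin Zhu,Zhenyu Tang,Liuhan Chen,Francis E. H. Tay,Ser-Nam Lim,Harry Yang,Li Yuan
Affiliations: Peking University; PengCheng Laboratory; NUS (National University of Singapore); HKUST (Hong Kong University of Science and Technology); UCF (University of Central Florida); Everlyn; Rabbitpre AI
Keywords: show great potential, patch prediction paradigm, Patch Prediction, built based, show great
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL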

Abstract:Autoregressive models, built based on the Next Token Prediction (NTP) paradigm, show great potential in developing a unified framework that integrates both language and vision tasks. In this work, we rethink the NTP for autoregressive image generation and propose a novel Next Patch Prediction (NPP) paradigm. Our key idea is to group and aggregate image tokens into patch tokens containing high information density. With patch tokens as a shorter input sequence, the autoregressive model is trained to predict the next patch, thereby significantly reducing the computational cost. We further propose a multi-scale coarse-to-fine patch grouping strategy that exploits the natural hierarchical property of image data. Experiments on a diverse range of models (100M-1.4B parameters) demonstrate that the next patch prediction paradigm could reduce the training cost to around 0.6 times while improving image generation quality by up to 1.0 FID score on the ImageNet benchmark. We highlight that our method retains the original autoregressive model architecture without introducing additional trainable parameters or specifically designing a custom image tokenizer, thus ensuring flexibility and seamless adaptation to various autoregressive models for visual generation.
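
As a rough illustration of the grouping step described in the abstract, the sketch below averages each 2x2 block of image-token embeddings into one patch token, so a 4x4 token grid becomes a sequence of 4 patch tokens. Averaging as the aggregation rule, the function name `group_tokens_into_patches`, and the toy grid are assumptions made for illustration; the abstract does not say how tokens are aggregated.

```python
def group_tokens_into_patches(token_grid, patch=2):
    """Average each patch x patch block of token embeddings into one patch token.

    token_grid is a height x width grid whose cells are embedding vectors (lists).
    Averaging is an assumed aggregation rule, not one confirmed by the abstract.
    """
    height, width = len(token_grid), len(token_grid[0])
    dim = len(token_grid[0][0])
    if height % patch or width % patch:
        raise ValueError("grid size must be divisible by the patch size")

    patch_tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Collect the embeddings that fall inside this block.
            block = [token_grid[top + dy][left + dx]
                     for dy in range(patch) for dx in range(patch)]
            # Element-wise mean over the block gives one patch token.
            patch_tokens.append([sum(vec[k] for vec in block) / len(block)
                                 for k in range(dim)])
    return patch_tokens


# Toy 4x4 grid of 2-dimensional embeddings: 16 tokens become 4 patch tokens.
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
patches = group_tokens_into_patches(grid, patch=2)
```

With a patch size of 2 the sequence is four times shorter, which is where the savings in attention cost would come from in a setup like this.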

[CV-86] Multi-concept Model Immunization through Differentiable Model Merging AAAI2025

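[Quick read]: This paper addresses the practical need to immunize a model against multiple concepts, that is, to make an open-source model hard to fine-tune for a range of potentially harmful applications. The key solution is an immunization algorithm that introduces a differentiable merging layer, allowing it to learn a single "difficult initialization" across multiple concepts at once so that the model is hard to adapt to any of them. Experiments confirm the effectiveness of multi-concept immunization and generalize the re-learning and personalization adaptation setups of prior work to multiple concepts.

Link: https://arxiv.org/abs/2412.15320
Authors: Amber Yijia Zheng,Raymond A. Yeh
Affiliations: Unknown
Keywords: emerging direction, direction that aims, aims to mitigate, mitigate the potential, potential risk
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025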

Abstract:Model immunization is an emerging direction that aims to mitigate the potential risk of misuse associated with open-sourced models and advancing adaptation methods. The idea is to make the released models’ weights difficult to fine-tune on certain harmful applications, hence the name "immunized". Recent work on model immunization focuses on the single-concept setting. However, models need to be immunized against multiple concepts in real-world situations. To address this gap, we propose an immunization algorithm that, simultaneously, learns a single "difficult initialization" for adaptation methods over a set of concepts. We achieve this by incorporating a differentiable merging layer that combines a set of model weights adapted over multiple concepts. In our experiments, we demonstrate the effectiveness of multi-concept immunization by generalizing prior work’s experiment setup of re-learning and personalization adaptation to multiple concepts.

[CV-87] Parametric rho-Norm Scaling Calibration

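[Quick read]: This paper addresses the difficulty of validating per-sample uncertainty when calibrating model confidence on a limited dataset. The key solution is a post-processing parametric calibration method, $\rho$-Norm Scaling, which expands the calibrator's expression and mitigates overconfidence caused by excessive output amplitude while preserving the model's accuracy. The paper also proposes probability distribution regularization, which keeps the instance-level uncertainty distribution after calibration similar to the one before calibration and thereby retains instance-level information. Experiments show that the method substantially improves post-processing calibrators for uncertainty calibration.

Link: https://arxiv.org/abs/2412.15301
Authors: Siyuan Zhang,Linbo Xie
Affiliations: Unknown
Keywords: probabilistic properties reflect, properties reflect objective, reflect objective characteristics, reflect objective, individual sample properties
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: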

Abstract:Output uncertainty indicates whether the probabilistic properties reflect objective characteristics of the model output. Unlike most loss functions and metrics in machine learning, uncertainty pertains to individual samples, but validating it on individual samples is unfeasible. When validated collectively, it cannot fully represent individual sample properties, posing a challenge in calibrating model confidence in a limited data set. Hence, it is crucial to consider confidence calibration characteristics. To counter the adverse effects of the gradual amplification of the classifier output amplitude in supervised learning, we introduce a post-processing parametric calibration method, $\rho$-Norm Scaling, which expands the calibrator expression and mitigates overconfidence due to excessive amplitude while preserving accuracy. Moreover, bin-level objective-based calibrator optimization often results in the loss of significant instance-level information. Therefore, we include probability distribution regularization, which incorporates specific priori information that the instance-level uncertainty distribution after calibration should resemble the distribution before calibration. Experimental results demonstrate the substantial enhancement in the post-processing calibrator for uncertainty calibration with our proposed method.

[CV-88] FORCE: Physics-aware Human-object Interaction

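[Quick read]: This paper addresses the fact that human-object interaction methods overlook the subtle effect of an object's physical attributes (such as mass and surface friction) on the interaction motion. The key solution is the FORCE model, which synthesizes diverse, nuanced human-object interactions by modeling physical attributes. Its core insight is that human motion is determined by the interrelation between the force a human exerts and the resistance perceived. Through a novel intuitive physics encoding, the model captures the interplay between human force and resistance, which also facilitates learning multi-class motion. The paper additionally contributes a dataset of diverse, differently styled motions to support model development.

Link: https://arxiv.org/abs/2403.11237
Authors: Xiaohan Zhang,Bharat Lal Bhatnagar,Sebastian Starke,Ilya Petrov,Vladimir Guzov,Helisa Dhamo,Eduardo Pérez-Pellitero,Gerard Pons-Moll
Affiliations: Max Planck Institute for Informatics; Max Planck Institute for Intelligent Systems; Technische Universität Dresden; University of Zaragoza
Keywords: pose and shape, surface friction, mass and surface, human, Interactions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 24 pages, 9 figures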

Abstract:Interactions between humans and objects are influenced not only by the object’s pose and shape, but also by physical attributes such as object mass and surface friction. They introduce important motion nuances that are essential for diversity and realism. Despite advancements in recent human-object interaction methods, this aspect has been overlooked. Generating nuanced human motion presents two challenges. First, it is non-trivial to learn from multi-modal human and object information derived from both the physical and non-physical attributes. Second, there exists no dataset capturing nuanced human interactions with objects of varying physical properties, hampering model development. This work addresses the gap by introducing the FORCE model, an approach for synthesizing diverse, nuanced human-object interactions by modeling physical attributes. Our key insight is that human motion is dictated by the interrelation between the force exerted by the human and the perceived resistance. Guided by a novel intuitive physics encoding, the model captures the interplay between human force and resistance. Experiments also demonstrate that incorporating human force facilitates learning multi-class motion. Accompanying our model, we contribute a dataset, which features diverse, different-styled motion through interactions with varying resistances.

[CV-89] Efficient MedSAMs: Segment Anything in Medical Images on Laptop CVPR2024 WWW

[Quick read]: This paper addresses the high computational cost of existing promptable segmentation foundation models for medical images, which hinders their adoption in clinical practice. The key solution was to organize the first international competition dedicated to promptable medical image segmentation. Using a large-scale, multi-institutional dataset, the competition drove the development of lightweight segmentation foundation models whose efficient inference pipelines substantially reduce computational requirements while maintaining state-of-the-art segmentation accuracy. Post-challenge algorithm optimization and reproducibility validation further improved the algorithms, and the best-performing ones were integrated into open-source software to facilitate clinical adoption.

Link: https://arxiv.org/abs/2412.16085
Authors: Jun Ma,Feifei Li,Sumin Kim,Reza Asakereh,Bao-Hiep Le,Dang-Khoa Nguyen-Vu,Alexander Pfefferle,Muxin Wei,Ruochen Gao,Donghang Lyu,Songxiao Yang,Lennart Purucker,Zdravko Marinov,Marius Staring,Haisheng Lu,Thuy Thanh Dao,Xincheng Ye,Zhi Li,Gianluca Brugnara,Philipp Vollmuth,Martha Foltyn-Dumitru,Jaeyoung Cho,Mustafa Ahmed Mahmutoglu,Martin Bendszus,Irada Pflüger,Aditya Rastogi,Dong Ni,Xin Yang,Guang-Quan Zhou,Kaini Wang,Nicholas Heller,Nikolaos Papanikolopoulos,Christopher Weight,Yubing Tong,Jayaram K Udupa,Cahill J. Patrick,Yaqi Wang,Yifan Zhang,Francisco Contijoch,Elliot McVeigh,Xin Ye,Shucheng He,Robert Haase,Thomas Pinetz,Alexander Radbruch,Inga Krause,Erich Kobler,Jian He,Yucheng Tang,Haichun Yang,Yuankai Huo,Gongning Luo,Kaisar Kushibar,Jandos Amankulov,Dias Toleshbayev,Amangeldi Mukhamejan,Jan Egger,Antonio Pepe,Christina Gsaxner,Gijs Luijten,Shohei Fujita,Tomohiro Kikuchi,Benedikt Wiestler,Jan S. Kirschke,Ezequiel de la Rosa,Federico Bolelli,Luca Lumetti,Costantino Grana,Kunpeng Xie,Guomin Wu,Behrus Puladi,Carlos Martín-Isla,Karim Lekadir,Victor M. Campello,Wei Shao,Wayne Brisbane,Hongxu Jiang,Hao Wei,Wu Yuan,Shuangle Li,Yuyin Zhou,Bo Wang
Affiliations: Unknown
Keywords: require expensive computing, medical image segmentation, segmentation foundation models, promptable medical image, existing models require
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2024 MedSAM on Laptop Competition Summary: this https URL

Abstract:Promptable segmentation foundation models have emerged as a transformative approach to addressing the diverse needs in medical images, but most existing models require expensive computing, posing a big barrier to their adoption in clinical practice. In this work, we organized the first international competition dedicated to promptable medical image segmentation, featuring a large-scale dataset spanning nine common imaging modalities from over 20 different institutions. The top teams developed lightweight segmentation foundation models and implemented an efficient inference pipeline that substantially reduced computational requirements while maintaining state-of-the-art segmentation accuracy. Moreover, the post-challenge phase advanced the algorithms through the design of performance booster and reproducibility tasks, resulting in improved algorithms and validated reproducibility of the winning solution. Furthermore, the best-performing algorithms have been incorporated into the open-source software with a user-friendly interface to facilitate clinical adoption. The data and code are publicly available to foster the further development of medical image segmentation foundation models and pave the way for impactful real-world applications.

[CV-90] Image Quality Assessment: Enhancing Perceptual Exploration and Interpretation with Collaborative Feature Refinement and Hausdorff distance

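[Quick read]: This paper addresses the failure of existing full-reference image quality assessment (FR-IQA) methods to distinguish low-frequency distortions (color, luminance) from high-frequency distortions (edges, texture). The key solution is a training-free FR-IQA method that uses perceptual degradation modelling to predict image quality accurately and in line with the behavior of the human visual system (HVS). It has two core modules. The first is a collaborative feature refinement module that uses a carefully designed wavelet transform to extract multiscale perceptual information, mimicking how the HVS analyses visual information in the spatial and frequency domains. The second is a Hausdorff distance-based distribution similarity measurement module that assesses the discrepancy between the feature distributions of the reference and distorted images; it handles outliers and variations and mimics the HVS's ability to perceive and tolerate distortion. The method needs no training data or subjective quality scores, and experiments show that it outperforms existing state-of-the-art methods on multiple benchmark datasets.

Link: https://arxiv.org/abs/2412.15847
Authors: Xuekai Wei,Junyu Zhang,Qinlin Hu,Mingliang Zhou,Yong Feng,Weizhi Xian,Huayan Pu,Sam Kwong
Affiliations: Chongqing University; Lingnan University
Keywords: Current full-reference image, luminance distortions occur, texture distortions occur, image quality assessment, Current full-reference
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: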

Abstract:Current full-reference image quality assessment (FR-IQA) methods often fuse features from reference and distorted images, overlooking that color and luminance distortions occur mainly at low frequencies, whereas edge and texture distortions occur at high frequencies. This work introduces a pioneering training-free FR-IQA method that accurately predicts image quality in alignment with the human visual system (HVS) by leveraging a novel perceptual degradation modelling approach to address this limitation. First, a collaborative feature refinement module employs a carefully designed wavelet transform to extract perceptually relevant features, capturing multiscale perceptual information and mimicking how the HVS analyses visual information at various scales and orientations in the spatial and frequency domains. Second, a Hausdorff distance-based distribution similarity measurement module robustly assesses the discrepancy between the feature distributions of the reference and distorted images, effectively handling outliers and variations while mimicking the ability of HVS to perceive and tolerate certain levels of distortion. The proposed method accurately captures perceptual quality differences without requiring training data or subjective quality scores. Extensive experiments on multiple benchmark datasets demonstrate superior performance compared with existing state-of-the-art approaches, highlighting its ability to correlate strongly with the HVS. The code is available at this https URL.
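
For readers unfamiliar with the Hausdorff distance mentioned in the abstract, here is a minimal sketch of the classical definition for two finite point sets. It only shows the textbook metric; how the paper applies it to feature distributions, and whether it uses a modified variant, is not stated in the abstract. The sample points and function names are hypothetical.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest neighbour in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical 2-D feature points for a reference and a distorted image.
reference = [(0.0, 0.0), (1.0, 0.0)]
distorted = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]

# Every reference point has an exact match, so this direction is 0.0;
# the extra distorted point lies 3.0 away from its nearest reference point.
distance = hausdorff(reference, distorted)
```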

[CV-91] Precision ICU Resource Planning: A Multimodal Model for Brain Surgery Outcomes

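[Quick read]: This paper addresses the problem of predicting whether a patient will need admission to the Intensive Care Unit (ICU) after brain surgery, with the aim of lowering healthcare costs by optimizing ICU admission decisions. The key solution is multimodal data fusion, combining clinical data with imaging data to improve prediction accuracy. Under severe class imbalance, the multimodal approach improves on the clinical-data-only baseline from 0.29 to 0.30 F1 when only pre-operative clinical data is used, and from 0.37 to 0.41 F1 when pre- and post-operative data are used.

Link: https://arxiv.org/abs/2412.15818
Authors: Maximilian Fischer,Florian M. Hauptmann,Robin Peretzke,Paul Naser,Peter Neher,Jan-Oliver Neumann,Klaus Maier-Hein
Affiliations: Unknown
Keywords: Intensive Care Unit, requiring Intensive Care, complications requiring Intensive, Care Unit, Intensive Care
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: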

Abstract:Although advances in brain surgery techniques have led to fewer postoperative complications requiring Intensive Care Unit (ICU) monitoring, the routine transfer of patients to the ICU remains the clinical standard, despite its high cost. Predictive Gradient Boosted Trees based on clinical data have attempted to optimize ICU admission by identifying key risk factors pre-operatively; however, these approaches overlook valuable imaging data that could enhance prediction accuracy. In this work, we show that multimodal approaches that combine clinical data with imaging data outperform the current clinical data only baseline from 0.29 [F1] to 0.30 [F1], when only pre-operative clinical data is used and from 0.37 [F1] to 0.41 [F1], for pre- and post-operative data. This study demonstrates that effective ICU admission prediction benefits from multimodal data fusion, especially in contexts of severe class imbalance.
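
The abstract reports F1 rather than accuracy, which matters under severe class imbalance. The sketch below uses made-up counts (not the paper's data) to show why: on a hypothetical cohort of 100 patients with 5 ICU cases, a classifier that never predicts ICU admission is 95% accurate yet scores an F1 of 0.

```python
def f1_score(tp, fp, fn):
    """F1 for the positive class from true-positive, false-positive, false-negative counts."""
    if tp == 0:
        return 0.0  # no correct positive predictions: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical cohort: 100 patients, 5 of whom truly need the ICU.
# Classifier A never predicts ICU: 95 of 100 correct, but it finds no positives.
always_negative_f1 = f1_score(tp=0, fp=0, fn=5)

# Classifier B flags 6 patients and gets 2 right: precision 1/3, recall 2/5.
some_classifier_f1 = f1_score(tp=2, fp=4, fn=3)
```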

[CV-92] From Model Based to Learned Regularization in Medical Image Registration: A Comprehensive Review

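[Quick read]: This paper addresses the systematic classification and application of regularization methods in image registration. The key contribution is a new taxonomy that systematically organizes existing regularization methods, with particular emphasis on the emerging field of learned regularization, which uses data-driven techniques to derive deformation properties automatically from data. The paper also examines the transfer of regularization methods from conventional to learning-based registration, identifies current challenges, and outlines future research directions. By stressing the critical role of regularization in image registration, it aims to encourage the research community to reconsider regularization strategies in modern registration algorithms and to explore this rapidly evolving field further.

Link: https://arxiv.org/abs/2412.15740
Authors: Anna Reithmeir,Veronika Spieker,Vasiliki Sideri-Lampretsa,Daniel Rueckert,Julia A. Schnabel,Veronika A. Zimmer
Affiliations: Unknown
Keywords: radiation therapy planning, disease progression analysis, medical imaging applications, Image registration, therapy planning
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Medical Image Analysis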

Abstract:Image registration is fundamental in medical imaging applications, such as disease progression analysis or radiation therapy planning. The primary objective of image registration is to precisely capture the deformation between two or more images, typically achieved by minimizing an optimization problem. Due to its inherent ill-posedness, regularization is a key component in driving the solution toward anatomically meaningful deformations. A wide range of regularization methods has been proposed for both conventional and deep learning-based registration. However, the appropriate application of regularization techniques often depends on the specific registration problem, and no one-fits-all method exists. Despite its importance, regularization is often overlooked or addressed with default approaches, assuming existing methods are sufficient. A comprehensive and structured review remains missing. This review addresses this gap by introducing a novel taxonomy that systematically categorizes the diverse range of proposed regularization methods. It highlights the emerging field of learned regularization, which leverages data-driven techniques to automatically derive deformation properties from the data. Moreover, this review examines the transfer of regularization methods from conventional to learning-based registration, identifies open challenges, and outlines future research directions. By emphasizing the critical role of regularization in image registration, we hope to inspire the research community to reconsider regularization strategies in modern registration algorithms and to explore this rapidly evolving field further.

[CV-93] BS-LDM: Effective Bone Suppression in High-Resolution Chest X-Ray Images with Conditional Latent Diffusion Models

[Quick read]: This paper addresses the reduced diagnostic effectiveness of chest X-ray (CXR) examinations caused by overlapping bones and pulmonary structures. The key solution is BS-LDM, an end-to-end bone suppression framework that uses a conditional latent diffusion model to perform bone suppression in the latent space, generating high-resolution soft tissue images that preserve critical lung pathology and texture detail. Image quality is further improved by applying offset noise during training and a dynamic clipping strategy during sampling. The paper also builds a high-quality bone suppression dataset, SZCH-X-Rays, and pre-processes the relevant images from the JSRT dataset to support training and evaluation. Experimental and clinical evaluations show that BS-LDM performs strongly at bone suppression and has significant potential for clinical use.

Link: https://arxiv.org/abs/2412.15670
Authors: Yifei Sun,Zhanghao Chen,Hao Zheng,Ruiquan Ge,Jin Liu,Wenwen Min,Ahmed Elazab,Xiang Wan,Changmiao Wang
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; Department of Computer Science, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt; Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
Keywords: Chest X-ray, effectiveness of Chest, soft tissue images, Bone suppression, DES soft tissue
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures

Abstract:The interference of overlapping bones and pulmonary structures can reduce the effectiveness of Chest X-ray (CXR) examinations. Bone suppression techniques have been developed to improve diagnostic accuracy. Dual-energy subtraction (DES) imaging, a common method for bone suppression, is costly and exposes patients to higher radiation levels. Deep learning-based image generation methods have been proposed as alternatives; however, they often fail to produce high-quality and high-resolution images, resulting in the loss of critical lesion information and texture details. To address these issues, in this paper, we introduce an end-to-end framework for bone suppression in high-resolution CXR images, termed BS-LDM. This framework employs a conditional latent diffusion model to generate high-resolution soft tissue images with fine detail and critical lung pathology by performing bone suppression in the latent space. We implement offset noise during the noise addition phase of the training process to better render low-frequency information in soft tissue images. Additionally, we introduce a dynamic clipping strategy during the sampling process to refine pixel intensity in the generated soft tissue images. We compiled a substantial and high-quality bone suppression dataset, SZCH-X-Rays, including high-resolution paired CXR and DES soft tissue images from 818 patients, collected from our partner hospitals. Moreover, we pre-processed 241 pairs of CXR and DES soft tissue images from the JSRT dataset, the largest publicly available dataset. Comprehensive experimental and clinical evaluations demonstrate that BS-LDM exhibits superior bone suppression capabilities, highlighting its significant clinical potential.
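
The offset noise mentioned in the abstract is commonly described as ordinary per-pixel Gaussian noise plus one extra Gaussian draw shared by the whole image, which injects a low-frequency (overall brightness) component. A minimal sketch of that general idea follows; the function name, the single-channel layout, and the `offset_strength` default are assumptions, not values taken from the paper.

```python
import random


def offset_noise(height, width, offset_strength=0.1, rng=None):
    """Per-pixel Gaussian noise plus one offset shared by every pixel.

    The shared term shifts the whole image up or down together, which is
    the low-frequency component that plain per-pixel noise lacks.
    offset_strength is a hypothetical default, not the paper's setting.
    """
    rng = rng or random.Random()
    shared_offset = offset_strength * rng.gauss(0.0, 1.0)
    return [[rng.gauss(0.0, 1.0) + shared_offset for _ in range(width)]
            for _ in range(height)]


# Same seed with and without the offset: the two fields differ by a constant.
plain = offset_noise(3, 4, offset_strength=0.0, rng=random.Random(0))
shifted = offset_noise(3, 4, offset_strength=0.5, rng=random.Random(0))
```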

[CV-94] From Galaxy Zoo DECaLS to BASS/MzLS: detailed galaxy morphology classification with unsupervised domain adaption

[Quick read]: This paper addresses the distribution mismatch among DECaLS, BASS, and MzLS (BMz) images in the DESI Legacy Imaging Surveys, caused by differences in signal-to-noise ratio and resolution, which prevents a neural network trained on DECaLS from being applied directly to galaxy morphology classification on BMz images. The key solution is an unsupervised domain adaptation (UDA) method that fine-tunes a source domain model trained on DECaLS images so that it adapts to BMz images, reducing bias in BMz galaxy morphology classification. The method significantly improves classification performance on BMz images, reaching a level comparable to the source domain, and the authors release a catalogue of detailed morphology classifications for BMz galaxies.

Link: https://arxiv.org/abs/2412.15533
Authors: Renhao Ye,Shiyin Shen,Rafael S. de Souza,Quanfeng Xu,Mi Chen,Zhu Chen,Emille E. O. Ishida,Alberto Krone-Martins,Rupesh Durgesh
Affiliations: Shanghai Astronomical Observatory, Chinese Academy of Sciences; School of Astronomy and Space Science, University of Chinese Academy of Sciences; Centre for Astrophysics Research, University of Hertfordshire; Shanghai Key Lab for Astrophysics, Shanghai Normal University; LPCA, Université Clermont Auvergne, CNRS/IN2P3; Donald Bren School of Information and Computer Sciences, University of California, Irvine; CENTRA/SIM, Faculdade de Ciências, Universidade de Lisboa; Independent Researcher
Keywords: DESI Legacy Imaging, Energy Camera Legacy, Dark Energy Camera, Legacy Imaging Surveys, Camera Legacy Survey
Subjects: Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures, accepted for publication in MNRAS

Abstract:The DESI Legacy Imaging Surveys (DESI-LIS) comprise three distinct surveys: the Dark Energy Camera Legacy Survey (DECaLS), the Beijing-Arizona Sky Survey (BASS), and the Mayall z-band Legacy Survey (MzLS). The citizen science project Galaxy Zoo DECaLS 5 (GZD-5) has provided extensive and detailed morphology labels for a sample of 253,287 galaxies within the DECaLS survey. This dataset has been foundational for numerous deep learning-based galaxy morphology classification studies. However, due to differences in signal-to-noise ratios and resolutions between the DECaLS images and those from BASS and MzLS (collectively referred to as BMz), a neural network trained on DECaLS images cannot be directly applied to BMz images due to distributional mismatch. In this study, we explore an unsupervised domain adaptation (UDA) method that fine-tunes a source domain model trained on DECaLS images with GZD-5 labels to BMz images, aiming to reduce bias in galaxy morphology classification within the BMz survey. Our source domain model, used as a starting point for UDA, achieves performance on the DECaLS galaxies’ validation set comparable to the results of related works. For BMz galaxies, the fine-tuned target domain model significantly improves performance compared to the direct application of the source domain model, reaching a level comparable to that of the source domain. We also release a catalogue of detailed morphology classifications for 248,088 galaxies within the BMz survey, accompanied by usage recommendations.

[CV-95] Underwater Image Quality Assessment: A Perceptual Framework Guided by Physical Imaging

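[Quick read]: This paper addresses underwater image quality assessment (UIQA), in particular how to account jointly for the effects of direct transmission attenuation and backscattering on image perception. The key solution is a physically imaging-guided framework (PIGUIQA) that incorporates physics-based underwater imaging estimation and defines distortion metrics measuring how direct transmission attenuation and backscattering affect image quality. The paper designs a local perceptual module based on the neighborhood attention mechanism to capture subtle image features, enhancing adaptive perception of distortions from local information. A global perceptual module then integrates the original image content with the underwater distortion information to predict the image quality score accurately. Experiments show that PIGUIQA achieves state-of-the-art performance in underwater image quality prediction and generalizes well.

Link: https://arxiv.org/abs/2412.15527
Authors: Weizhi Xian,Mingliang Zhou,Leong Hou U,Lang Shujun,Bin Fang,Tao Xiang,Zhaowei Shang
Affiliations: Chongqing Research Institute of Harbin Institute of Technology; Harbin Institute of Technology; Faculty of Computing, Harbin Institute of Technology; School of Computer Science, Chongqing University; Faculty of Science and Technology, University of Macau
Keywords: physically imaging-guided framework, image quality assessment, direct transmission attenuation, image quality, underwater image quality
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: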

Abstract:In this paper, we propose a physically imaging-guided framework for underwater image quality assessment (UIQA), called PIGUIQA. First, we formulate UIQA as a comprehensive problem that considers the combined effects of direct transmission attenuation and backwards scattering on image perception. On this basis, we incorporate advanced physics-based underwater imaging estimation into our method and define distortion metrics that measure the impact of direct transmission attenuation and backwards scattering on image quality. Second, acknowledging the significant content differences across various regions of an image and the varying perceptual sensitivity to distortions in these regions, we design a local perceptual module on the basis of the neighborhood attention mechanism. This module effectively captures subtle features in images, thereby enhancing the adaptive perception of distortions on the basis of local information. Finally, by employing a global perceptual module to further integrate the original image content with underwater image distortion information, the proposed model can accurately predict the image quality score. Comprehensive experiments demonstrate that PIGUIQA achieves state-of-the-art performance in underwater image quality prediction and exhibits strong generalizability. The code for PIGUIQA is available on this https URL

[CV-96] Uncertainty Estimation for Super-Resolution using ESRGAN

[Quick read]: This paper addresses the lack of predictive uncertainty estimation in deep learning-based image super-resolution (SR) models such as SRGAN and ESRGAN. The key solution is to enhance these models with Monte Carlo Dropout and Deep Ensemble techniques so that predictive uncertainty can be computed. This yields pixel-level uncertainty estimates that help users identify regions where the output may be wrong, without degrading model performance.

Link: https://arxiv.org/abs/2412.15439
Authors: Maniraj Sai Adapa,Marco Zullich,Matias Valdenegro-Toro
Affiliations: University of Groningen
Keywords: Generative Adversarial Networks, Learning-based image super-resolution, Adversarial Networks, Generative Adversarial, Deep Learning-based image
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures. VISAPP 2025 camera ready

Abstract:Deep Learning-based image super-resolution (SR) has been gaining traction with the aid of Generative Adversarial Networks. Models like SRGAN and ESRGAN are consistently ranked among the best image SR tools. However, they lack principled ways for estimating predictive uncertainty. In the present work, we enhance these models using Monte Carlo-Dropout and Deep Ensemble, allowing the computation of predictive uncertainty. When coupled with a prediction, uncertainty estimates can provide more information to the model users, highlighting pixels where the SR output might be uncertain, hence potentially inaccurate, if these estimates were to be reliable. Our findings suggest that these uncertainty estimates are decently calibrated and can hence fulfill this goal, while providing no performance drop with respect to the corresponding models without uncertainty estimation.
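
To make the Monte Carlo Dropout idea concrete, the sketch below keeps dropout active at prediction time, runs several stochastic forward passes, and reports the per-pixel mean as the prediction and the per-pixel standard deviation as the uncertainty. The "model" is a hypothetical one-layer stand-in rather than ESRGAN, and all names and parameter values are assumptions for illustration.

```python
import random
import statistics


def toy_sr_forward(pixels, weights, drop_prob, rng):
    """Hypothetical stand-in for an SR network with dropout left switched on.

    One dropout mask is sampled per forward pass; kept units are rescaled
    by 1 / keep (inverted dropout) so the expected gain is unchanged.
    """
    keep = 1.0 - drop_prob
    gain = sum(w / keep for w in weights if rng.random() < keep)
    return [gain * x for x in pixels]


def mc_dropout_predict(pixels, weights, drop_prob=0.2, passes=50, seed=0):
    """Return per-pixel mean and standard deviation over stochastic passes."""
    rng = random.Random(seed)
    samples = [toy_sr_forward(pixels, weights, drop_prob, rng)
               for _ in range(passes)]
    per_pixel = list(zip(*samples))  # regroup samples by pixel position
    mean = [statistics.fmean(values) for values in per_pixel]
    std = [statistics.pstdev(values) for values in per_pixel]
    return mean, std


mean, std = mc_dropout_predict([0.0, 1.0, 2.0], [0.5, 0.3, 0.2])
```

A Deep Ensemble would follow the same aggregation step, except that the samples come from independently trained models instead of repeated dropout passes through one model.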

[CV-97] Leveraging Weak Supervision for Cell Localization in Digital Pathology Using Multitask Learning and Consistency Loss

[Quick read]: This paper addresses the constraint that training encoder-decoder networks for cell detection and segmentation in digital pathology typically requires full boundary annotations, which are difficult and costly to obtain. The key solution is a mixed-supervision strategy that uses cell counts obtained by eyeballing as an auxiliary supervisory signal to train a multitask network that learns cell counting and cell localization concurrently. A consistency loss regularizes training by penalizing inconsistencies between the two tasks' predictions, so that the weakest form of annotation is used effectively and performance improves when strong annotations are limited.

Link: https://arxiv.org/abs/2412.15392
Authors: Berke Levent Cesur,Ayse Humeyra Dur Karasayar,Pinar Bulutay,Nilgun Kapucuoglu,Cisel Aydin Mericoz,Handan Eren,Omer Faruk Dilbaz,Javidan Osmanli,Burhan Soner Yetkili,Ibrahim Kulac,Can Fahrettin Koyuncu,Cigdem Gunduz-Demir
Affiliations: Koc University; KUIS AI Center; Graduate School of Health Sciences; Department of Pathology; Basaksehir Cam and Sakura City Hospital; Sisli Hamidiye Etfal Health Application and Research Center; Koc University School of Medicine; Koc University Research Center for Translational Medicine; Wallace H Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University
Keywords: detection and segmentation, segmentation are integral, integral parts, parts of automated, automated systems
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cell detection and segmentation are integral parts of automated systems in digital pathology. Encoder-decoder networks have emerged as a promising solution for these tasks. However, training of these networks has typically required full boundary annotations of cells, which are labor-intensive and difficult to obtain on a large scale. In many applications, such as cell counting, weaker forms of annotations (such as point annotations or approximate cell counts) can provide sufficient supervision for training. This study proposes a new mixed-supervision approach for training multitask networks in digital pathology by incorporating cell counts derived from the eyeballing process, a quick visual estimation method commonly used by pathologists. This study has two main contributions: (1) It proposes a mixed-supervision strategy for digital pathology that utilizes cell counts obtained by eyeballing as an auxiliary supervisory signal to train a multitask network for the first time. (2) This multitask network is designed to concurrently learn the tasks of cell counting and cell localization, and this study introduces a consistency loss that regularizes training by penalizing inconsistencies between the predictions of these two tasks. Our experiments on two datasets of hematoxylin-eosin stained tissue images demonstrate that the proposed approach effectively utilizes the weakest form of annotation, improving performance when stronger annotations are limited. These results highlight the potential of integrating eyeballing-derived ground truths into the network training, reducing the need for resource-intensive annotations.
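
One plausible way to penalize disagreement between a counting head and a localization head is sketched below: treat the sum of the localization map as an implied count and take the squared gap to the count head's output. This specific form, the density-like interpretation of the map, the loss weighting, and all names are assumptions; the abstract does not give the paper's actual formulation.

```python
def consistency_loss(predicted_count, localization_map):
    """Squared gap between the count head and the count implied by the map.

    Assumes the localization output is a density-like map whose values
    sum to roughly the number of cells (an assumption, not the paper's spec).
    """
    implied_count = sum(sum(row) for row in localization_map)
    return (predicted_count - implied_count) ** 2


def mixed_supervision_loss(localization_loss, count_loss, consistency, weight=1.0):
    """Total objective: both task losses plus the weighted consistency term."""
    return localization_loss + count_loss + weight * consistency


# Toy example: the count head says 5 cells, the map sums to 3.
toy_map = [[1.0, 0.0],
           [0.5, 1.5]]
penalty = consistency_loss(5.0, toy_map)
total = mixed_supervision_loss(1.0, 2.0, penalty, weight=0.5)
```

The count term can be supervised by eyeballed counts on images that have no boundary annotations, which is what lets the weak labels contribute to training in a setup like this.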

[CV-98] DCRA-Net: Attention-Enabled Reconstruction Model for Dynamic Fetal Cardiac MRI

[Quick read]: This paper addresses the challenges that the fast fetal heart rate and uncontrolled fetal motion pose for dynamic fetal heart magnetic resonance imaging (MRI), in particular the need for high temporal and spatial resolution over a large field of view that encompasses the surrounding maternal anatomy. The key solution is the Dynamic Cardiac Reconstruction Attention Network (DCRA-Net), which applies attention mechanisms in the spatial and temporal domains together with a temporal frequency representation of the data to reconstruct fetal heart dynamics from highly accelerated free-running (non-gated) MRI acquisitions. DCRA-Net outperforms the L+S and k-GIN methods on both fetal and adult data. The best performance is obtained with lattice undersampling, data consistency, and temporal frequency representation, giving a peak signal-to-noise ratio (PSNR) of 38 for fetal and 35 for adult data.

Link: https://arxiv.org/abs/2412.15342
Authors: Denis Prokopenko,David F.A. Lloyd,Amedeo Chiribiri,Daniel Rueckert,Joseph V. Hajnal
Affiliations: King’s College London; Evelina London Children’s Hospital; Imperial College London; Klinikum rechts der Isar, Technical University of Munich
Keywords: presents unique challenges, magnetic resonance imaging, uncontrolled fetal motion, heart magnetic resonance, fast heart rate
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dynamic fetal heart magnetic resonance imaging (MRI) presents unique challenges due to the fast heart rate of the fetus compared to adult subjects and uncontrolled fetal motion. This requires high temporal and spatial resolutions over a large field of view, in order to encompass surrounding maternal anatomy. In this work, we introduce Dynamic Cardiac Reconstruction Attention Network (DCRA-Net) - a novel deep learning model that employs attention mechanisms in spatial and temporal domains and temporal frequency representation of data to reconstruct the dynamics of the fetal heart from highly accelerated free-running (non-gated) MRI acquisitions. DCRA-Net was trained on retrospectively undersampled complex-valued cardiac MRIs from 42 fetal subjects and separately from 153 adult subjects, and evaluated on data from 14 fetal and 39 adult subjects respectively. Its performance was compared to L+S and k-GIN methods in both fetal and adult cases for an undersampling factor of 8x. The proposed network performed better than the comparators for both fetal and adult data, for both regular lattice and centrally weighted random undersampling. Aliased signals due to the undersampling were comprehensively resolved, and both the spatial details of the heart and its temporal dynamics were recovered with high fidelity. The highest performance was achieved when using lattice undersampling, data consistency and temporal frequency representation, yielding PSNR of 38 for fetal and 35 for adult cases. Our method is publicly available at this https URL.
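The abstract reports reconstruction quality as PSNR. As a reminder of what that metric measures, here is a minimal sketch in Python. It assumes magnitude images flattened to lists of floats and a peak value of 1.0; the function name `psnr` and these conventions are illustrative assumptions, not the authors' evaluation code. PSNR is conventionally expressed in decibels.

```python
import math


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio (dB) between two equally sized flat sequences.

    `peak` is the assumed maximum possible signal value (1.0 for normalized data).
    """
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the reference and the reconstruction.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)
```

A higher value means the reconstruction is closer to the fully sampled reference.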

[CV-99] Federated Learning for Coronary Artery Plaque Detection in Atherosclerosis Using IVUS Imaging: A Multi-Hospital Collaboration

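[Quick Read]: This paper aims to address the time-intensive and inconsistent interpretation of Intravascular Ultrasound (IVUS) images during Percutaneous Coronary Intervention (PCI), and in particular the difficulty of integrating data across hospitals caused by regulatory restrictions and privacy concerns. The key to the solution is a parallel 2D U-Net model with a multi-stage segmentation architecture that uses federated learning to enable secure cross-institution data analysis while preserving privacy. The model segments plaques by identifying and subtracting the External Elastic Membrane (EEM) and lumen areas, and its preprocessing converts Cartesian to polar coordinates to improve computational efficiency. The final model achieves a Dice Similarity Coefficient (DSC) of 0.706 and can identify plaques and detect circular boundaries in real time.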

Link: https://arxiv.org/abs/2412.15307
Authors: Chiu-Han Hsiao, Kai Chen, Tsung-Yu Peng, Wei-Chieh Huang
Affiliations: 1. National Taiwan University; 2. National Tsing Hua University; 3. National Chiao Tung University
Keywords: Percutaneous Coronary Intervention, Intravascular Ultrasound, Percutaneous Coronary, Coronary Intervention, time-intensive and inconsistent
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The traditional interpretation of Intravascular Ultrasound (IVUS) images during Percutaneous Coronary Intervention (PCI) is time-intensive and inconsistent, relying heavily on physician expertise. Regulatory restrictions and privacy concerns further hinder data integration across hospital systems, complicating collaborative analysis. To address these challenges, a parallel 2D U-Net model with a multi-stage segmentation architecture has been developed, utilizing federated learning to enable secure data analysis across institutions while preserving privacy. The model segments plaques by identifying and subtracting the External Elastic Membrane (EEM) and lumen areas, with preprocessing converting Cartesian to polar coordinates for improved computational efficiency. Achieving a Dice Similarity Coefficient (DSC) of 0.706, the model effectively identifies plaques and detects circular boundaries in real-time. Collaborative efforts with domain experts enhance plaque burden interpretation through precise quantitative measurements. Future advancements may involve integrating advanced federated learning techniques and expanding datasets to further improve performance and applicability. This adaptable technology holds promise for environments handling sensitive, distributed data, offering potential to optimize outcomes in medical imaging and intervention.
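The two quantitative ideas in this abstract, obtaining the plaque region by subtracting the lumen from the EEM region and scoring it with the Dice Similarity Coefficient, can be sketched as follows. This is a minimal illustration on binary masks stored as nested lists; the helper names `plaque_mask` and `dice` are hypothetical, and the sketch does not reproduce the authors' U-Net pipeline or its polar-coordinate preprocessing.

```python
def plaque_mask(eem, lumen):
    """Plaque = pixels inside the EEM region that are not inside the lumen."""
    return [
        [bool(e) and not bool(l) for e, l in zip(eem_row, lumen_row)]
        for eem_row, lumen_row in zip(eem, lumen)
    ]


def dice(pred, truth):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in truth for v in row]
    intersection = sum(a and b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A DSC of 1.0 means the predicted and reference masks coincide exactly, while 0.0 means they do not overlap at all.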

Artificial Intelligence

[AI-0] Explainable AI for Multivariate Time Series Pattern Exploration: Latent Space Visual Analytics with Time Fusion Transformer and Variational Autoencoders in Power Grid Event Diagnosis

Link: https://arxiv.org/abs/2412.16098
Authors: Haowen Xu, Ali Boyaci, Jianming Lian, Aaron Wilson
Keywords: Detecting and analyzing, analyzing complex patterns, complex patterns, crucial for decision-making, decision-making in urban
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Detecting and analyzing complex patterns in multivariate time-series data is crucial for decision-making in urban and environmental system operations. However, challenges arise from the high dimensionality, intricate complexity, and interconnected nature of complex patterns, which hinder the understanding of their underlying physical processes. Existing AI methods often face limitations in interpretability, computational efficiency, and scalability, reducing their applicability in real-world scenarios. This paper proposes a novel visual analytics framework that integrates two generative AI models, Time Fusion Transformer (TFT) and Variational Autoencoders (VAEs), to reduce complex patterns into lower-dimensional latent spaces and visualize them in 2D using dimensionality reduction techniques such as PCA, t-SNE, and UMAP with DBSCAN. These visualizations, presented through coordinated and interactive views and tailored glyphs, enable intuitive exploration of complex multivariate temporal patterns, identifying similarities among patterns and uncovering their potential correlations for better interpretability of the AI outputs. The framework is demonstrated through a case study on power grid signal data, where it identifies multi-label grid event signatures, including faults and anomalies with diverse root causes. Additionally, novel metrics and visualizations are introduced to validate the models and evaluate the performance, efficiency, and consistency of latent maps generated by TFT and VAE under different configurations. These analyses provide actionable insights for model parameter tuning and reliability improvements. Comparative results highlight that TFT achieves shorter run times and superior scalability to diverse time-series data shapes compared to VAE. This work advances fault diagnosis in multivariate time series, fostering explainable AI to support critical system operations.

[AI-1] The Evolution of LLM Adoption in Industry Data Curation Practices

Link: https://arxiv.org/abs/2412.16089
Authors: Crystal Qian, Michael Xieyang Liu, Emily Reif, Grady Simon, Nada Hussein, Nathan Clement, James Wexler, Carrie J. Cai, Michael Terry, Minsuk Kahng
Keywords: grow increasingly adept, large language models, processing unstructured text, language models, grow increasingly
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 19 pages, 4 tables, 3 figures

Abstract:As large language models (LLMs) grow increasingly adept at processing unstructured text data, they offer new opportunities to enhance data curation workflows. This paper explores the evolution of LLM adoption among practitioners at a large technology company, evaluating the impact of LLMs in data curation tasks through participants’ perceptions, integration strategies, and reported usage scenarios. Through a series of surveys, interviews, and user studies, we provide a timely snapshot of how organizations are navigating a pivotal moment in LLM evolution. In Q2 2023, we conducted a survey to assess LLM adoption in industry for development tasks (N=84), and facilitated expert interviews to assess evolving data needs (N=10) in Q3 2023. In Q2 2024, we explored practitioners’ current and anticipated LLM usage through a user study involving two LLM-based prototypes (N=12). While each study addressed distinct research goals, they revealed a broader narrative about evolving LLM usage in aggregate. We discovered an emerging shift in data understanding from heuristic-first, bottom-up approaches to insights-first, top-down workflows supported by LLMs. Furthermore, to respond to a more complex data landscape, data practitioners now supplement traditional subject-expert-created ‘golden datasets’ with LLM-generated ‘silver’ datasets and rigorously validated ‘super golden’ datasets curated by diverse experts. This research sheds light on the transformative role of LLMs in large-scale analysis of unstructured data and highlights opportunities for further tool development.

[AI-2] Formal Mathematical Reasoning: A New Frontier in AI

Link: https://arxiv.org/abs/2412.16075
Authors: Kaiyu Yang, Gabriel Poesia, Jingxuan He, Wenda Li, Kristin Lauter, Swarat Chaudhuri, Dawn Song
Keywords: formal mathematical reasoning, discovery in science, intriguing intellectually, crucial for AI-driven, AI-driven discovery
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:AI for Mathematics (AI4Math) is not only intriguing intellectually but also crucial for AI-driven discovery in science, engineering, and beyond. Extensive efforts on AI4Math have mirrored techniques in NLP, in particular, training large language models on carefully curated math datasets in text form. As a complementary yet less explored avenue, formal mathematical reasoning is grounded in formal systems such as proof assistants, which can verify the correctness of reasoning and provide automatic feedback. In this position paper, we advocate for formal mathematical reasoning and argue that it is indispensable for advancing AI4Math to the next level. In recent years, we have seen steady progress in using AI to perform formal reasoning, including core tasks such as theorem proving and autoformalization, as well as emerging applications such as verifiable generation of code and hardware designs. However, significant challenges remain to be solved for AI to truly master mathematics and achieve broader impact. We summarize existing progress, discuss open challenges, and envision critical milestones to measure future success. At this inflection point for formal mathematical reasoning, we call on the research community to come together to drive transformative advancements in this field.

[AI-3] Applying Predictive Analytics to Occupational Health and Safety in India

Link: https://arxiv.org/abs/2412.16038
Authors: Ritwik Raj Saxena
Keywords: revolutionizing occupational health, Predictive analytics, Predictive, OHS, revolutionizing occupational
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 16 pages, 5 figures, 1 table

Abstract:Predictive analytics is revolutionizing occupational health and safety (OHS). It offers evidence-based insights. These insights enable proactive risk management and informed, data-driven decision-making in organizational settings. This paper explores the key components of predictive analytics in OHS, beginning with data collection, management, and preparation, and moving through to advanced predictive modelling techniques. We emphasize the importance of data integrity through processes such as missing value imputation, anomaly detection, and feature engineering to ensure accurate model predictions. Risk prioritization identifies and ranks hazards across various factors, including employee behaviours, organizational policies, environmental conditions, and operational practices. We posit that insights derived from predictive models must be effectively interpreted and implemented. These insights guide organizations to focus on high-impact areas for accident prevention and resource optimization. The integration of predictive analytics in OHS brings notable benefits, including enhanced decision-making, greater operational efficiency, cost savings, and improved compliance with safety standards. We examine applications of predictive analytics in OHS in Indian settings. India has the largest workforce in the world, and most of it is in the informal sector, which is largely unprotected by the already inadequate OHS laws. Ethical considerations, data privacy concerns, and the risk of overdependence on predictive models are discussed. We conclude with a discussion on the potential for predictive analytics to create a data-oriented, adaptive approach to OHS in India. We posit that, using predictive analytics, India can develop high safety standards while traversing the complexities of its workforce setting.

[AI-4] A Framework for Streaming Event-Log Prediction in Business Processes

Link: https://arxiv.org/abs/2412.16032
Authors: Benedikt Bollig, Matthias Függer, Thomas Nowak
Keywords: present a Python-based, Python-based framework, business process, streaming mode, mode
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages

Abstract:We present a Python-based framework for event-log prediction in streaming mode, enabling predictions while data is being generated by a business process. The framework allows for easy integration of streaming algorithms, including language models like n-grams and LSTMs, and for combining these predictors using ensemble methods. Using our framework, we conducted experiments on various well-known process-mining data sets and compared classical batch with streaming mode. Though, in batch mode, LSTMs generally achieve the best performance, there is often an n-gram whose accuracy comes very close. Combining basic models in ensemble methods can even outperform LSTMs. The value of basic models with respect to LSTMs becomes even more apparent in streaming mode, where LSTMs generally lack accuracy in the early stages of a prediction run, while basic methods make sensible predictions immediately.
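To make the streaming setting concrete, here is a minimal sketch of an n-gram next-activity predictor that is updated one event at a time. It is an illustration only: the class `StreamingNGram`, its `predict`/`update` methods, and the per-case history handling are assumptions made for this example and are not the API of the framework described in the paper.

```python
from collections import Counter, defaultdict


class StreamingNGram:
    """Hypothetical n-gram next-activity predictor, trained incrementally."""

    def __init__(self, n=2):
        if n < 2:
            raise ValueError("n must be at least 2")
        self.n = n
        # context (tuple of the last n-1 activities) -> counts of the next activity
        self.counts = defaultdict(Counter)
        # case id -> the most recent n-1 activities seen for that case
        self.history = defaultdict(list)

    def predict(self, case_id):
        """Return the most frequent follower of the current context, or None."""
        context = tuple(self.history[case_id])
        followers = self.counts.get(context)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

    def update(self, case_id, activity):
        """Consume one event: learn from it, then slide the context window."""
        hist = self.history[case_id]
        if len(hist) == self.n - 1:
            self.counts[tuple(hist)][activity] += 1
        hist.append(activity)
        if len(hist) > self.n - 1:
            hist.pop(0)
```

In a test-then-train loop, each incoming event is first compared against `predict(case_id)` and then passed to `update`, so the model can offer predictions as soon as it has seen a matching context.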

[AI-5] Choose Your Explanation: A Comparison of SHAP and GradCAM in Human Activity Recognition

Link: https://arxiv.org/abs/2412.16003
Authors: Felix Tempel, Daniel Groos, Espen Alexander F. Ihlen, Lars Adde, Inga Strümke
Keywords: Explaining machine learning, Explaining machine, machine learning, essential to make, Class Activation Mapping
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Explaining machine learning (ML) models using eXplainable AI (XAI) techniques has become essential to make them more transparent and trustworthy. This is especially important in high-stakes domains like healthcare, where understanding model decisions is critical to ensure ethical, sound, and trustworthy outcome predictions. However, users are often confused about which explainability method to choose for their specific use case. We present a comparative analysis of widely used explainability methods, Shapley Additive Explanations (SHAP) and Gradient-weighted Class Activation Mapping (GradCAM), within the domain of human activity recognition (HAR) utilizing graph convolutional networks (GCNs). By evaluating these methods on skeleton-based data from two real-world datasets, including a healthcare-critical cerebral palsy (CP) case, this study provides vital insights into both approaches’ strengths, limitations, and differences, offering a roadmap for selecting the most appropriate explanation method based on specific models and applications. We qualitatively and quantitatively compare these methods, focusing on feature importance ranking, interpretability, and model sensitivity through perturbation experiments. While SHAP provides detailed input feature attribution, GradCAM delivers faster, spatially oriented explanations, making both methods complementary depending on the application’s requirements. Given the importance of XAI in enhancing trust and transparency in ML models, particularly in sensitive environments like healthcare, our research demonstrates how SHAP and GradCAM could complement each other to provide more interpretable and actionable model explanations.

[AI-6] CNN-LSTM Hybrid Deep Learning Model for Remaining Useful Life Estimation

Link: https://arxiv.org/abs/2412.15998
Authors: Muthukumar G, Jyosna Philip
Keywords: Remaining Useful Life, Convolutional Neural Networks, Predictive Maintenance applications, Life, RUL estimation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: conference paper

Abstract:Remaining Useful Life (RUL) of a component or a system is defined as the length from the current time to the end of the useful life. Accurate RUL estimation plays a crucial role in Predictive Maintenance applications. Traditional regression methods, both linear and non-linear, have struggled to achieve high accuracy in this domain. While Convolutional Neural Networks (CNNs) have shown improved accuracy, they often overlook the sequential nature of the data, relying instead on features derived from sliding windows. Since RUL prediction inherently involves multivariate time series analysis, robust sequence learning is essential. In this work, we propose a hybrid approach combining Convolutional Neural Networks with Long Short-Term Memory (LSTM) networks for RUL estimation. Although CNN-based LSTM models have been applied to sequence prediction tasks in financial forecasting, this is the first attempt to adopt this approach for RUL estimation in prognostics. In this approach, CNN is first employed to efficiently extract features from the data, followed by LSTM, which uses these extracted features to predict RUL. This method effectively leverages sensor sequence information, uncovering hidden patterns within the data, even under multiple operating conditions and fault scenarios. Our results demonstrate that the hybrid CNN-LSTM model achieves the highest accuracy, offering a superior score compared to the other methods.

[AI-7] APIRL: Deep Reinforcement Learning for REST API Fuzzing AAAI2025

Link: https://arxiv.org/abs/2412.15991
Authors: Myles Foley, Sergio Maffeis
Keywords: web services, components of web, REST APIs, REST, Abstract
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: Thirty-ninth Conference on Artificial Intelligence (AAAI 2025)

Abstract:REST APIs have become key components of web services. However, they often contain logic flaws resulting in server side errors or security vulnerabilities. HTTP requests are used as test cases to find and mitigate such issues. Existing methods to modify requests, including those using deep learning, suffer from limited performance and precision, relying on undirected search or making limited usage of the contextual information. In this paper we propose APIRL, a fully automated deep reinforcement learning tool for testing REST APIs. A key novelty of our approach is the use of feedback from a transformer module pre-trained on JSON-structured data, akin to that used in API responses. This allows APIRL to learn the subtleties relating to test outcomes, and generalise to unseen API endpoints. We show APIRL can find significantly more bugs than the state-of-the-art in real world REST APIs while minimising the number of required test cases. We also study how reward functions, and other key design choices, affect learnt policies in a thorough ablation study.

[AI-8] Never Reset Again: A Mathematical Framework for Continual Inference in Recurrent Neural Networks

Link: https://arxiv.org/abs/2412.15983
Authors: Bojian Yin, Federico Corradi
Keywords: Recurrent Neural Networks, Recurrent Neural, requiring disruptive hidden, disruptive hidden state, face fundamental limitations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recurrent Neural Networks (RNNs) are widely used for sequential processing but face fundamental limitations with continual inference due to state saturation, requiring disruptive hidden state resets. However, reset-based methods impose synchronization requirements with input boundaries and increase computational costs at inference. To address this, we propose an adaptive loss function that eliminates the need for resets during inference while preserving high accuracy over extended sequences. By combining cross-entropy and Kullback-Leibler divergence, the loss dynamically modulates the gradient based on input informativeness, allowing the network to differentiate meaningful data from noise and maintain stable representations over time. Experimental results demonstrate that our reset-free approach outperforms traditional reset-based methods when applied to a variety of RNNs, particularly in continual tasks, enhancing both the theoretical and practical capabilities of RNNs for streaming applications.

[AI-9] Trust Calibration in IDEs: Paving the Way for Widespread Adoption of AI Refactoring

Link: https://arxiv.org/abs/2412.15948
Authors: Markus Borg
Keywords: improve existing code, existing code, drive to add, add new features, features often overshadows
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted for publication in the Proc. of the 2nd Workshop on Integrated Development Environments, 2025

Abstract:In the software industry, the drive to add new features often overshadows the need to improve existing code. Large Language Models (LLMs) offer a new approach to improving codebases at an unprecedented scale through AI-assisted refactoring. However, LLMs come with inherent risks such as breaking changes and the introduction of security vulnerabilities. We advocate for encapsulating the interaction with the models in IDEs and validating refactoring attempts using trustworthy safeguards. However, equally important for the uptake of AI refactoring is research on trust development. In this position paper, we position our future work based on established models from research on human factors in automation. We outline action research within CodeScene on development of 1) novel LLM safeguards and 2) user interaction that conveys an appropriate level of trust. The industry collaboration enables large-scale repository analysis and A/B testing to continuously guide the design of our research interventions.

[AI-10] Less is More: Towards Green Code Large Language Models via Unified Structural Pruning

Link: https://arxiv.org/abs/2412.15921
Authors: Guang Yang, Yu Zhou, Xiangyu Zhang, Wei Cheng, Ke Liu, Xiang Chen, Terry Yue Zhuo, Taolue Chen
Keywords: Large Language Models, Large Language, raised concerns due, generative Code LLMs, high computational demands
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: UNDER REVIEW

Abstract:The extensive application of Large Language Models (LLMs) in generative coding tasks has raised concerns due to their high computational demands and energy consumption. Unlike previous structural pruning methods designed for classification models that deal with low-dimensional classification logits, generative Code LLMs produce high-dimensional token logit sequences, making traditional pruning objectives inherently limited. Moreover, existing single-component pruning approaches further constrain the effectiveness when applied to generative Code LLMs. In response, we propose Flab-Pruner, an innovative unified structural pruning method that combines vocabulary, layer, and Feed-Forward Network (FFN) pruning. This approach effectively reduces model parameters while maintaining performance. Additionally, we introduce a customized code instruction data strategy for coding tasks to enhance the performance recovery efficiency of the pruned model. Through extensive evaluations on three state-of-the-art Code LLMs across multiple generative coding tasks, the results demonstrate that Flab-Pruner retains 97% of the original performance after pruning 22% of the parameters and achieves the same or even better performance after post-training. The pruned models exhibit significant improvements in storage, GPU usage, computational efficiency, and environmental impact, while maintaining good robustness. Our research provides a sustainable solution for green software engineering and promotes the efficient deployment of LLMs in real-world generative coding intelligence applications.

[AI-11] Speedup Techniques for Switchable Temporal Plan Graph Optimization AAAI2025

Link: https://arxiv.org/abs/2412.15908
Authors: He Jiang, Muhan Lin, Jiaoyang Li
Keywords: Multi-Agent Path Finding, Temporal Plan Graph, planning collision-free paths, Multi-Agent Path, Switchable Temporal Plan
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted by AAAI 2025

Abstract:Multi-Agent Path Finding (MAPF) focuses on planning collision-free paths for multiple agents. However, during the execution of a MAPF plan, agents may encounter unexpected delays, which can lead to inefficiencies, deadlocks, or even collisions. To address these issues, the Switchable Temporal Plan Graph provides a framework for finding an acyclic Temporal Plan Graph with the minimum execution cost under delays, ensuring deadlock- and collision-free execution. Unfortunately, existing optimal algorithms, such as Mixed Integer Linear Programming and Graph-Based Switchable Edge Search (GSES), are often too slow for practical use. This paper introduces Improved GSES, which significantly accelerates GSES through four speedup techniques: stronger admissible heuristics, edge grouping, prioritized branching, and incremental implementation. Experiments conducted on four different map types with varying numbers of agents demonstrate that Improved GSES consistently achieves over twice the success rate of GSES and delivers up to a 30-fold speedup on instances where both methods successfully find solutions.

[AI-12] What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning AAAI2025

Link: https://arxiv.org/abs/2412.15904
Authors: Yiran Ma, Zui Chen, Tianqiao Liu, Mi Tian, Zhuo Liu, Zitao Liu, Weiqi Luo
Keywords: preference alignment based, significantly enhance mathematical, step-level preference alignment, Carlo Tree Search, Monte Carlo Tree
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AAAI 2025

Abstract:Step-level reward models (SRMs) can significantly enhance mathematical reasoning performance through process supervision or step-level preference alignment based on reinforcement learning. The performance of SRMs is pivotal, as they serve as critical guidelines, ensuring that each step in the reasoning process is aligned with desired outcomes. Recently, AlphaZero-like methods, where Monte Carlo Tree Search (MCTS) is employed for automatic step-level preference annotation, have proven particularly effective. However, the precise mechanisms behind the success of SRMs remain largely unexplored. To address this gap, this study delves into the counterintuitive aspects of SRMs, particularly focusing on MCTS-based approaches. Our findings reveal that the removal of natural language descriptions of thought processes has minimal impact on the efficacy of SRMs. Furthermore, we demonstrate that SRMs are adept at assessing the complex logical coherence present in mathematical language while having difficulty in natural language. These insights provide a nuanced understanding of the core elements that drive effective step-level reward modeling in mathematical reasoning. By shedding light on these mechanisms, this study offers valuable guidance for developing more efficient and streamlined SRMs, which can be achieved by focusing on the crucial parts of mathematical reasoning.

[AI-13] Approximate State Abstraction for Markov Games

Link: https://arxiv.org/abs/2412.15877
Authors: Hiroki Ishibashi, Kenshi Abe, Atsushi Iwasaki
Keywords: two-player zero-sum Markov, zero-sum Markov games, paper introduces state, state abstraction, Markov decision processes
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:This paper introduces state abstraction for two-player zero-sum Markov games (TZMGs), where the payoffs for the two players are determined by the state representing the environment and their respective actions, with state transitions following Markov decision processes. For example, in games like soccer, the value of actions changes according to the state of play, and thus such games should be described as Markov games. In TZMGs, as the number of states increases, computing equilibria becomes more difficult. Therefore, we consider state abstraction, which reduces the number of states by treating multiple different states as a single state. There is a substantial body of research on finding optimal policies for Markov decision processes using state abstraction. However, in the multi-player setting, the game with state abstraction may yield different equilibrium solutions from those of the ground game. To evaluate the equilibrium solutions of the game with state abstraction, we derived bounds on the duality gap, which represents the distance from the equilibrium solutions of the ground game. Finally, we demonstrate our state abstraction with Markov Soccer, compute equilibrium policies, and examine the results.
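The duality gap mentioned in the abstract can be illustrated in its simplest form, a single-state zero-sum matrix game. The sketch below is a deliberate simplification, not the paper's bound for Markov games: it assumes a payoff matrix for the row (maximizing) player and mixed strategies `x` and `y` given as probability lists, and the function name `duality_gap` is hypothetical. The gap is the row player's best-response value against `y` minus the column player's best-response value against `x`; it is non-negative and equals zero exactly at an equilibrium.

```python
def duality_gap(payoff, x, y):
    """Duality gap of strategies (x, y) in a zero-sum matrix game.

    payoff[i][j] is the row player's payoff when row i meets column j.
    """
    # Expected payoff of each pure row against the column strategy y.
    row_values = [sum(a * q for a, q in zip(row, y)) for row in payoff]
    # Expected payoff of each pure column against the row strategy x.
    col_values = [
        sum(p * payoff[i][j] for i, p in enumerate(x)) for j in range(len(y))
    ]
    # Best the maximizer can get against y, minus best the minimizer can force against x.
    return max(row_values) - min(col_values)
```

For matching pennies, the uniform strategies give a gap of zero, while any pair of pure strategies leaves a strictly positive gap.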

[AI-14] AI-in-the-loop: The future of biomedical visual analytics applications in the era of AI

Link: https://arxiv.org/abs/2412.15876
Authors: Katja Bühler, Thomas Höllt, Thomas Schulz, Pere-Pau Vázquez
Keywords: Large Language Models, modern data analytics, visual analytics, workhorse of modern, data analytics
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Accepted for publication in IEEE Computer Graphics Applications

Abstract:AI is the workhorse of modern data analytics and omnipresent across many sectors. Large Language Models and multi-modal foundation models are today capable of generating code, charts, visualizations, etc. How will these massive developments of AI in data analytics shape future data visualizations and visual analytics workflows? What is the potential of AI to reshape methodology and design of future visual analytics applications? What will be our role as visualization researchers in the future? What are opportunities, open challenges and threats in the context of an increasingly powerful AI? This Visualization Viewpoint discusses these questions in the special context of biomedical data analytics as an example of a domain in which critical decisions are taken based on complex and sensitive data, with high requirements on transparency, efficiency, and reliability. We map recent trends and developments in AI on the elements of interactive visualization and visual analytics workflows and highlight the potential of AI to transform biomedical visualization as a research field. Given that agency and responsibility have to remain with human experts, we argue that it is helpful to keep the focus on human-centered workflows, and to use visual analytics as a tool for integrating "AI-in-the-loop". This is in contrast to the more traditional term "human-in-the-loop", which focuses on incorporating human expertise into AI-based systems.

[AI-15] Traffic-Rule-Compliant Trajectory Repair via Satisfiability Modulo Theories and Reachability Analysis

Link: https://arxiv.org/abs/2412.15837
Authors: Yuanfei Lin, Zekun Xing, Xuyuan Han, Matthias Althoff
Keywords: considered simultaneously, Complying with traffic, traffic rules, violates traffic rules, Complying
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:Complying with traffic rules is challenging for automated vehicles, as numerous rules need to be considered simultaneously. If a planned trajectory violates traffic rules, it is common to replan a new trajectory from scratch. We instead propose a trajectory repair technique to save computation time. By coupling satisfiability modulo theories with set-based reachability analysis, we determine if and in what manner the initial trajectory can be repaired. Experiments in high-fidelity simulators and in the real world demonstrate the benefits of our proposed approach in various scenarios. Even in complex environments with intricate rules, we efficiently and reliably repair rule-violating trajectories, enabling automated vehicles to swiftly resume legally safe operation in real-time.

[AI-16] WebLLM: A High-Performance In-Browser LLM Inference Engine

Link: https://arxiv.org/abs/2412.15803
Authors: Charlie F. Ruan,Yucheng Qin,Xun Zhou,Ruihang Lai,Hongyi Jin,Yixin Dong,Bohan Hou,Meng-Shiun Yu,Yiyan Zhai,Sudeep Agarwal,Hangrui Cao,Siyuan Feng,Tianqi Chen
Keywords: unlocked remarkable capabilities, Advancements in large, large language models, remarkable capabilities, large language
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices have made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts out the different backends from diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. With machine learning compilers MLC-LLM and Apache TVM, WebLLM leverages optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: this https URL.

[AI-17] Bi-directional Mapping of Morphology Metrics and 3D City Blocks for Enhanced Characterization and Generation of Urban Form

Link: https://arxiv.org/abs/2412.15801
Authors: Chenyi Cai,Biao Li,Qiyan Zhang,Xiao Wang,Filip Biljecki,Pieter Herthogs
Keywords: urban form, Urban, Morphology metrics, urban form generation, urban design
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Urban morphology, examining city spatial configurations, links urban design to sustainability. Morphology metrics play a fundamental role in performance-driven computational urban design (CUD) which integrates urban form generation, performance evaluation and optimization. However, a critical gap remains between performance evaluation and complex urban form generation, caused by the disconnection between morphology metrics and urban form, particularly in metric-to-form workflows. It prevents the application of optimized metrics to generate improved urban form with enhanced urban performance. Formulating morphology metrics that not only effectively characterize complex urban forms but also enable the reconstruction of diverse forms is of significant importance. This paper highlights the importance of establishing a bi-directional mapping between morphology metrics and complex urban form to enable the integration of urban form generation with performance evaluation. We present an approach that can 1) formulate morphology metrics to both characterize urban forms and in reverse, retrieve diverse similar 3D urban forms, and 2) evaluate the effectiveness of morphology metrics in representing 3D urban form characteristics of blocks by comparison. We demonstrate the methodology with 3D urban models of New York City, covering 14,248 blocks. We use neural networks and information retrieval for morphology metric encoding, urban form clustering and morphology metric evaluation. We identified an effective set of morphology metrics for characterizing block-scale urban forms through comparison. The proposed methodology tightly couples complex urban forms with morphology metrics, hence it can enable a seamless and bidirectional relationship between urban form generation and optimization in performance-driven urban design towards sustainable urban design and planning.

[AI-18] fluke: Federated Learning Utility frameworK for Experimentation and research AAAI2025

Link: https://arxiv.org/abs/2412.15728
Authors: Mirko Polato
Keywords: gaining tremendous popularity, machine learning community, gaining tremendous, tremendous popularity, learning community
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at FLUID workshop (AAAI 2025) [4 pages (+2 references), 2 figures, 1 algorithm]

Click to view abstract

Abstract:Since its inception in 2016, Federated Learning (FL) has been gaining tremendous popularity in the machine learning community. Several frameworks have been proposed to facilitate the development of FL algorithms, but researchers often resort to implementing their algorithms from scratch, including all baselines and experiments. This is because existing frameworks are not flexible enough to support their needs or the learning curve to extend them is too steep. In this paper, we present fluke, a Python package designed to simplify the development of new FL algorithms. fluke is specifically designed for prototyping purposes and is meant for researchers or practitioners focusing on the learning components of a federated system. fluke is open-source, and it can be either used out of the box or extended with new algorithms with minimal overhead.

[AI-19] Towards Secure AI-driven Industrial Metaverse with NFT Digital Twins

Link: https://arxiv.org/abs/2412.15716
Authors: Ravi Prakash,Tony Thomas
Keywords: brought digital twins, digital twins, brought digital, Blockchain-powered non-fungible tokens, DTs
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:The rise of the industrial metaverse has brought digital twins (DTs) to the forefront. Blockchain-powered non-fungible tokens (NFTs) offer a decentralized approach to creating and owning these cloneable DTs. However, the potential for unauthorized duplication, or counterfeiting, poses a significant threat to the security of NFT-DTs. Existing NFT clone detection methods often rely on static information like metadata and images, which can be easily manipulated. To address these limitations, we propose a novel deep-learning-based solution as a combination of an autoencoder and RNN-based classifier. This solution enables real-time pattern recognition to detect fake NFT-DTs. Additionally, we introduce the concept of dynamic metadata, providing a more reliable way to verify authenticity through AI-integrated smart contracts. By effectively identifying counterfeit DTs, our system contributes to strengthening the security of NFT-based assets in the metaverse.

[AI-20] MacLight: Multi-scene Aggregation Convolutional Learning for Traffic Signal Control AAMAS2025

Link: https://arxiv.org/abs/2412.15703
Authors: Sunbowen Lee,Hongqin Lyu,Yicheng Gong,Yingying Sun,Chao Deng
Keywords: Reinforcement learning methods, large road networks, proposed promising traffic, Reinforcement learning, road networks
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted as full paper by AAMAS2025

Click to view abstract

Abstract:Reinforcement learning methods have proposed promising traffic signal control policies that can be trained on large road networks. Current SOTA methods model road networks as topological graph structures, incorporate graph attention into deep Q-learning, and merge local and global embeddings to improve policy. However, graph-based methods are difficult to parallelize, resulting in huge time overhead. Moreover, none of the current peer studies have deployed dynamic traffic systems for experiments, which is far from the actual situation. In this context, we propose Multi-Scene Aggregation Convolutional Learning for traffic signal control (MacLight), which offers faster training speeds and more stable performance. Our approach consists of two main components. The first is the global representation, where we utilize variational autoencoders to compactly compress and extract the global representation. The second component employs the proximal policy optimization algorithm as the backbone, allowing value evaluation to consider both local features and global embedding representations. This backbone model significantly reduces time overhead and ensures stability in policy updates. We validated our method across multiple traffic scenarios under both static and dynamic traffic systems. Experimental results demonstrate that, compared to general and domain SOTA methods, our approach achieves superior stability, optimized convergence levels and the highest time efficiency. The code is under this https URL.

[AI-21] AIR: Unifying Individual and Cooperative Exploration in Collective Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2412.15700
Authors: Guangchong Zhou,Zeren Zhang,Guoliang Fan
Keywords: multi-agent reinforcement learning, cooperative multi-agent reinforcement, value-based agents due, reinforcement learning, remains challenging
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Exploration in cooperative multi-agent reinforcement learning (MARL) remains challenging for value-based agents due to the absence of an explicit policy. Existing approaches include individual exploration based on uncertainty towards the system and collective exploration through behavioral diversity among agents. However, the introduction of additional structures often leads to reduced training efficiency and infeasible integration of these methods. In this paper, we propose Adaptive exploration via Identity Recognition (AIR), which consists of two adversarial components: a classifier that recognizes agent identities from their trajectories, and an action selector that adaptively adjusts the mode and degree of exploration. We theoretically prove that AIR can facilitate both individual and collective exploration during training, and experiments also demonstrate the efficiency and effectiveness of AIR across various tasks.

[AI-22] Tacit Learning with Adaptive Information Selection for Cooperative Multi-Agent Reinforcement Learning AAMAS2025

Link: https://arxiv.org/abs/2412.15639
Authors: Lunjun Liu,Weilai Jiang,Yaonan Wang
Keywords: gained widespread adoption, widespread adoption due, multi-agent reinforcement learning, decentralized execution, multi-agent reinforcement
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by AAMAS 2025 (Extended Abstract)

Click to view abstract

Abstract:In multi-agent reinforcement learning (MARL), the centralized training with decentralized execution (CTDE) framework has gained widespread adoption due to its strong performance. However, the further development of CTDE faces two key challenges. First, agents struggle to autonomously assess the relevance of input information for cooperative tasks, impairing their decision-making abilities. Second, in communication-limited scenarios with partial observability, agents are unable to access global information, restricting their ability to collaborate effectively from a global perspective. To address these challenges, we introduce a novel cooperative MARL framework based on information selection and tacit learning. In this framework, agents gradually develop implicit coordination during training, enabling them to infer the cooperative behavior of others in a discrete space without communication, relying solely on local information. Moreover, we integrate gating and selection mechanisms, allowing agents to adaptively filter information based on environmental changes, thereby enhancing their decision-making capabilities. Experiments on popular MARL benchmarks show that our framework can be seamlessly integrated with state-of-the-art algorithms, leading to significant performance improvements.

[AI-23] JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs AAAI2025

Link: https://arxiv.org/abs/2412.15623
Authors: Hongyi Li,Jiawei Ye,Jie Wu,Tianjie Yan,Chu Wang,Zhixin Li
Keywords: Large Language Models, Large Language, garnered significant attention, recently garnered significant, Language Models
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2025

Click to view abstract

Abstract:Large Language Models (LLMs) aligned with human feedback have recently garnered significant attention. However, they remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to induce harmful outputs. Exploring jailbreak attacks enables us to investigate the vulnerabilities of LLMs and further guides us in enhancing their security. Unfortunately, existing techniques mainly rely on handcrafted templates or generation-based optimization, posing challenges in scalability, efficiency and universality. To address these issues, we present JailPO, a novel black-box jailbreak framework to examine LLM alignment. For scalability and universality, JailPO meticulously trains attack models to automatically generate covert jailbreak prompts. Furthermore, we introduce a preference optimization-based attack method to enhance the jailbreak effectiveness, thereby improving efficiency. To analyze model vulnerabilities, we provide three flexible jailbreak patterns. Extensive experiments demonstrate that JailPO not only automates the attack process while maintaining effectiveness but also exhibits superior performance in efficiency, universality, and robustness against defenses compared to baselines. Additionally, our analysis of the three JailPO patterns reveals that attacks based on complex templates exhibit higher attack strength, whereas covert question transformations elicit riskier responses and are more likely to bypass defense mechanisms.

[AI-24] Understanding Individual Agent Importance in Multi-Agent System via Counterfactual Reasoning

Link: https://arxiv.org/abs/2412.15619
Authors: Chen Jianming,Wang Yawen,Wang Junjie,Xie Xiaofei,Hu jun,Wang Qing,Xu Fanjiang
Keywords: Explaining multi-agent systems, Explaining multi-agent, increasingly prevalent, Explaining, MAS
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Explaining multi-agent systems (MAS) is urgent as these systems become increasingly prevalent in various applications. Previous work has provided explanations for the actions or states of agents, yet falls short in understanding the black-boxed agent's importance within a MAS and the overall team strategy. To bridge this gap, we propose EMAI, a novel agent-level explanation approach that evaluates the individual agent's importance. Inspired by counterfactual reasoning, a larger change in reward caused by the randomized action of an agent indicates its higher importance. We model it as a MARL problem to capture interactions across agents. Utilizing counterfactual reasoning, EMAI learns the masking agents to identify important agents. Specifically, we define the optimization function to minimize the reward difference before and after action randomization and introduce sparsity constraints to encourage the exploration of more action randomization of agents during training. The experimental results in seven multi-agent tasks demonstrate that EMAI achieves higher fidelity in explanations than baselines and provides more effective guidance in practical applications concerning understanding policies, launching attacks, and patching policies.
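
The counterfactual intuition in this abstract (a larger reward change under a randomized action suggests a more important agent) can be illustrated with a small Monte Carlo sketch. This is not EMAI itself, which learns masking agents through MARL; the reward function `team_reward`, its weights, and the helper `agent_importance` are hypothetical and exist only for illustration.

```python
import random

def team_reward(actions):
    # Hypothetical toy reward: agent 0 matters most, agent 2 not at all.
    weights = [5.0, 1.0, 0.0]
    return sum(w * a for w, a in zip(weights, actions))

def agent_importance(policy_actions, n_actions=2, n_samples=2000, seed=0):
    """Estimate each agent's importance as the mean absolute change in team
    reward when only that agent's action is replaced by a random one."""
    rng = random.Random(seed)
    base = team_reward(policy_actions)
    scores = []
    for i in range(len(policy_actions)):
        total = 0.0
        for _ in range(n_samples):
            perturbed = list(policy_actions)
            perturbed[i] = rng.randrange(n_actions)  # randomize agent i only
            total += abs(team_reward(perturbed) - base)
        scores.append(total / n_samples)
    return scores

scores = agent_importance([1, 1, 1])
# Agents sorted from most to least important under this toy reward.
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

In this toy setting the ranking simply recovers the order of the reward weights; a learned approach is needed when the reward depends on interactions between agents.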

[AI-25] Microservices-Based Framework for Predictive Analytics and Real-time Performance Enhancement in Travel Reservation Systems

Link: https://arxiv.org/abs/2412.15616
Authors: Biman Barua,M. Shamim Kaiser
Keywords: microservices-based architecture dedicated, paper presents, dedicated to enhancing, system, real-time predictive analytics
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 10 Pages, 05 figures

Click to view abstract

Abstract:The paper presents a framework of microservices-based architecture dedicated to enhancing the performance of real-time travel reservation systems using the power of predictive analytics. Traditional monolithic systems are bad at scaling and performing with high loads, causing backup resources to be underutilized along with delays. To overcome the above-stated problems, we adopt a modularization approach in decoupling system components into independent services that can grow or shrink according to demand. Our framework also includes real-time predictive analytics, through machine learning models, that optimize forecasting customer demand, dynamic pricing, as well as system performance. With an experimental evaluation applying the approach, we could show that the framework impacts metrics of performance such as response time, throughput, transaction rate of success, and prediction accuracy compared to their conventional counterparts. Not only does the microservices approach improve scalability and fault tolerance like a usual architecture, but it also brings along timely and accurate predictions, which imply a greater customer satisfaction and efficiency of operation. The integration of real-time analytics would lead to more intelligent decision-making, thereby improving the response of the system along with the reliability it holds. A scalable, efficient framework is offered by such a system to address the modern challenges imposed by any form of travel reservation system while considering other complex, data-driven industries as future applications. Future work will be an investigation of advanced AI models and edge processing to further improve the performance and robustness of the systems employed.

[AI-26] SODor: Long-Term EEG Partitioning for Seizure Onset Detection AAAI2025

Link: https://arxiv.org/abs/2412.15598
Authors: Zheng Chen,Yasuko Matsubara,Yasushi Sakurai,Jimeng Sun
Keywords: Deep learning models, recently shown great, shown great success, classifying epileptic patients, Deep learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Accepted at AAAI 2025

Click to view abstract

Abstract:Deep learning models have recently shown great success in classifying epileptic patients using EEG recordings. Unfortunately, classification-based methods lack a sound mechanism to detect the onset of seizure events. In this work, we propose a two-stage framework, SODor, that explicitly models seizure onset through a novel task formulation of subsequence clustering. Given an EEG sequence, the framework first learns a set of second-level embeddings with label supervision. It then employs model-based clustering to explicitly capture long-term temporal dependencies in EEG sequences and identify meaningful subsequences. Epochs within a subsequence share a common cluster assignment (normal or seizure), with cluster or state transitions representing successful onset detections. Extensive experiments on three datasets demonstrate that our method can correct misclassifications, achieving 5%-11% classification improvements over other baselines and accurately detecting seizure onsets.
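
The last step described here, reading onsets off cluster or state transitions, can be sketched as follows. The sketch assumes per-epoch cluster labels are already available (0 for normal, 1 for seizure); the label encoding and the helper `detect_onsets` are hypothetical, and the embedding and model-based clustering stages of the paper are not reproduced.

```python
def detect_onsets(labels, normal=0, seizure=1):
    """Return the epoch indices where the cluster assignment switches
    from the normal state to the seizure state."""
    return [
        t for t in range(1, len(labels))
        if labels[t - 1] == normal and labels[t] == seizure
    ]

# Hypothetical per-epoch cluster assignments for one EEG recording.
epoch_labels = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]
onsets = detect_onsets(epoch_labels)  # epochs 3 and 8 start seizure segments
```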

[AI-27] Machine Learning Techniques for Pattern Recognition in High-Dimensional Data Mining

Link: https://arxiv.org/abs/2412.15593
Authors: Pochun Li
Keywords: frequent pattern mining, pattern mining, mining algorithm based, frequent pattern, support vector machine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper proposes a frequent pattern data mining algorithm based on support vector machine (SVM), aiming to solve the performance bottleneck of traditional frequent pattern mining algorithms in high-dimensional and sparse data environments. By converting the frequent pattern mining task into a classification problem, the SVM model is introduced to improve the accuracy and robustness of pattern extraction. In terms of method design, the kernel function is used to map the data to a high-dimensional feature space, so as to construct the optimal classification hyperplane, realize the nonlinear separation of patterns and the accurate mining of frequent items. In the experiment, two public datasets, Retail and Mushroom, were selected to compare and analyze the proposed algorithm with traditional FP-Growth, FP-Tree, decision tree and random forest models. The experimental results show that the algorithm in this paper is significantly better than the traditional model in terms of three key indicators: support, confidence and lift, showing strong pattern recognition ability and rule extraction effect. The study shows that the SVM model has excellent performance advantages in an environment with high data sparsity and a large number of transactions, and can effectively cope with complex pattern mining tasks. At the same time, this paper also points out the potential direction of future research, including the introduction of deep learning and ensemble learning frameworks to further improve the scalability and adaptability of the algorithm. This research not only provides a new idea for frequent pattern mining, but also provides important technical support for solving pattern discovery and association rule mining problems in practical applications.
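
The three evaluation indicators named in this abstract (support, confidence, and lift) have standard definitions for association rules, sketched below on a made-up basket dataset. The transactions and helper names are illustrative assumptions; the sketch does not implement the paper's SVM-based mining algorithm.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support; 1.0 means independence."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )

# Hypothetical toy transactions, not taken from Retail or Mushroom.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(transactions, {"bread", "milk"})       # 2 of 4 transactions
c = confidence(transactions, {"bread"}, {"milk"})  # 0.5 / 0.75
l = lift(transactions, {"bread"}, {"milk"})        # below 1: slightly negative association
```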

[AI-28] Pre-training Graph Neural Networks on Molecules by Using Subgraph-Conditioned Graph Information Bottleneck

Link: https://arxiv.org/abs/2412.15589
Authors: Van Thuy Hoang,O-Joun Lee
Keywords: Graph Neural Network, pre-trained Graph Neural, Neural Network, functional groups, Graph Neural
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages

Click to view abstract

Abstract:This study aims to build a pre-trained Graph Neural Network (GNN) model on molecules without human annotations or prior knowledge. Although various attempts have been proposed to overcome limitations in acquiring labeled molecules, the previous pre-training methods still rely on semantic subgraphs, i.e., functional groups. Only focusing on the functional groups could overlook the graph-level distinctions. The key challenge to build a pre-trained GNN on molecules is how to (1) generate well-distinguished graph-level representations and (2) automatically discover the functional groups without prior knowledge. To solve it, we propose a novel Subgraph-conditioned Graph Information Bottleneck, named S-CGIB, for pre-training GNNs to recognize core subgraphs (graph cores) and significant subgraphs. The main idea is that the graph cores contain compressed and sufficient information that could generate well-distinguished graph-level representations and reconstruct the input graph conditioned on significant subgraphs across molecules under the S-CGIB principle. To discover significant subgraphs without prior knowledge about functional groups, we propose generating a set of functional group candidates, i.e., ego networks, and using an attention-based interaction between the graph core and the candidates. Despite being identified from self-supervised learning, our learned subgraphs match the real-world functional groups. Extensive experiments on molecule datasets across various domains demonstrate the superiority of S-CGIB.

[AI-29] Score-based Generative Diffusion Models for Social Recommendations

Link: https://arxiv.org/abs/2412.15579
Authors: Chengyi Liu,Jiahao Zhang,Shijie Wang,Wenqi Fan,Qing Li
Keywords: enhancing personalized recommendations, social, online platforms, enhancing personalized, Stochastic Differential Equation
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures

Click to view abstract

Abstract:With the prevalence of social networks on online platforms, social recommendation has become a vital technique for enhancing personalized recommendations. The effectiveness of social recommendations largely relies on the social homophily assumption, which presumes that individuals with social connections often share similar preferences. However, this foundational premise has been recently challenged due to the inherent complexity and noise present in real-world social networks. In this paper, we tackle the low social homophily challenge from an innovative generative perspective, directly generating optimal user social representations that maximize consistency with collaborative signals. Specifically, we propose the Score-based Generative Model for Social Recommendation (SGSR), which effectively adapts the Stochastic Differential Equation (SDE)-based diffusion models for social recommendations. To better fit the recommendation context, SGSR employs a joint curriculum training strategy to mitigate challenges related to missing supervision signals and leverages self-supervised learning techniques to align knowledge across social and collaborative domains. Extensive experiments on real-world datasets demonstrate the effectiveness of our approach in filtering redundant social information and improving recommendation performance.

[AI-30] Architecture-Aware Learning Curve Extrapolation via Graph Ordinary Differential Equation

Link: https://arxiv.org/abs/2412.15554
Authors: Yanna Ding,Zijie Huang,Xiao Shou,Yihang Guo,Yizhou Sun,Jianxi Gao
Keywords: facilitating hyperparameter tuning, neural architecture search, predicts neural network, neural network performance, early training epochs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Learning curve extrapolation predicts neural network performance from early training epochs and has been applied to accelerate AutoML, facilitating hyperparameter tuning and neural architecture search. However, existing methods typically model the evolution of learning curves in isolation, neglecting the impact of neural network (NN) architectures, which influence the loss landscape and learning trajectories. In this work, we explore whether incorporating neural network architecture improves learning curve modeling and how to effectively integrate this architectural information. Motivated by the dynamical system view of optimization, we propose a novel architecture-aware neural differential equation model to forecast learning curves continuously. We empirically demonstrate its ability to capture the general trend of fluctuating learning curves while quantifying uncertainty through variational parameters. Our model outperforms current state-of-the-art learning curve extrapolation methods and pure time-series modeling approaches for both MLP and CNN-based learning curves. Additionally, we explore the applicability of our method in Neural Architecture Search scenarios, such as training configuration ranking.

[AI-31] FedRLHF: A Convergence-Guaranteed Federated Framework for Privacy-Preserving and Personalized RLHF AAMAS2025

Link: https://arxiv.org/abs/2412.15538
Authors: Flint Xiaofeng Fan,Cheston Tan,Yew-Soon Ong,Roger Wattenhofer,Wei-Tsang Ooi
Keywords: traditional Reinforcement Learning, Federated Reinforcement Learning, face significant challenges, significant challenges due, Human Feedback
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted to AAMAS 2025. This preprint represents the full version of the paper, including all proofs, experimental details, and additional discussions

Click to view abstract

Abstract:In the era of increasing privacy concerns and demand for personalized experiences, traditional Reinforcement Learning with Human Feedback (RLHF) frameworks face significant challenges due to their reliance on centralized data. We introduce Federated Reinforcement Learning with Human Feedback (FedRLHF), a novel framework that decentralizes the RLHF process. FedRLHF enables collaborative policy learning across multiple clients without necessitating the sharing of raw data or human feedback, thereby ensuring robust privacy preservation. Leveraging federated reinforcement learning, each client integrates human feedback locally into their reward functions and updates their policies through personalized RLHF processes. We establish rigorous theoretical foundations for FedRLHF, providing convergence guarantees, and deriving sample complexity bounds that scale efficiently with the number of clients. Empirical evaluations on the MovieLens and IMDb datasets demonstrate that FedRLHF not only preserves user privacy but also achieves performance on par with centralized RLHF, while enhancing personalization across diverse client environments.

[AI-32] Enhancing Large-scale UAV Route Planing with Global and Local Features via Reinforcement Graph Fusion

Link: https://arxiv.org/abs/2412.15537
Authors: Tao Zhou,Kai Ye,Zeyu Shi,Jiajing Lin,Dejun Xu,Min Jiang
Keywords: Vehicle Route Planing, Unmanned Aerial Vehicle, Aerial Vehicle Route, Numerous remarkable advancements, Route Planing
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Numerous remarkable advancements have been made in accuracy, speed, and parallelism for solving the Unmanned Aerial Vehicle Route Planing (UAVRP). However, existing UAVRP solvers face challenges when attempting to scale effectively and efficiently for larger instances. In this paper, we present a generalization framework that enables current UAVRP solvers to robustly extend their capabilities to larger instances, accommodating up to 10,000 points, using widely recognized test sets. The UAVRP under a large number of patrol points is a typical large-scale TSP problem. The proposed framework comprises three distinct steps. Firstly, we employ Delaunay triangulation to extract subgraphs from large instances while preserving global features. Secondly, we utilize an embedded TSP solver to obtain sub-results, followed by graph fusion. Finally, we implement a decoding strategy customizable to the user's requirements, resulting in high-quality solutions, complemented by a warming-up process for the heatmap. To demonstrate the flexibility of our approach, we integrate two representative TSP solvers into our framework and conduct a comprehensive comparative analysis against existing algorithms using large TSP benchmark datasets. The results unequivocally demonstrate that our framework efficiently scales existing TSP solvers to handle large instances and consistently outperforms state-of-the-art (SOTA) methods. Furthermore, since our proposed framework does not necessitate additional training or fine-tuning, we believe that its generality can significantly advance research on end-to-end UAVRP solvers, enabling the application of a broader range of methods to real-world scenarios.

[AI-33] Generalized Back-Stepping Experience Replay in Sparse-Reward Environments

Link: https://arxiv.org/abs/2412.15525
Authors: Guwen Lyu, Masahiro Sato
Keywords: Back-stepping experience replay, accelerate learning efficiency, BER, efficiency in reversible, generated back-stepping transitions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Back-stepping experience replay (BER) is a reinforcement learning technique that can accelerate learning in reversible environments. BER trains an agent with generated back-stepping transitions of collected experiences and normal forward transitions. However, the original algorithm is designed for a dense-reward environment that does not require complex exploration, which prevents BER from demonstrating its full potential. Herein, we propose an enhanced version of BER called Generalized BER (GBER), which extends the original algorithm to sparse-reward environments, particularly those with complex structures that require the agent to explore. GBER improves the performance of BER by introducing a relabeling mechanism and applying diverse sampling strategies. We evaluate our modified version, which is based on a goal-conditioned deep deterministic policy gradient offline learning algorithm, across various maze navigation environments. The experimental results indicate that the GBER algorithm can significantly boost the performance and stability of the baseline algorithm in various sparse-reward environments, especially those with a high degree of structural symmetry.
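
The idea of generating back-stepping transitions in a reversible environment can be illustrated with a small grid-world sketch. This is not the authors' implementation: the `Transition` type, the `INVERSE` action table, the sparse reward function, and the choice to relabel the goal as the start state of the forward trajectory are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple

State = Tuple[int, int]


class Transition(NamedTuple):
    state: State
    action: str
    next_state: State
    goal: State
    reward: float


# Assumed action set for a 4-connected grid; every move has an exact inverse,
# which is what makes the environment reversible.
INVERSE = {"up": "down", "down": "up", "left": "right", "right": "left"}


def sparse_reward(next_state: State, goal: State) -> float:
    """Hypothetical sparse reward: 1 only when the goal is reached."""
    return 1.0 if next_state == goal else 0.0


def back_step(t: Transition, new_goal: State) -> Transition:
    """Reverse one forward transition and relabel it with a new goal."""
    return Transition(
        state=t.next_state,
        action=INVERSE[t.action],
        next_state=t.state,
        goal=new_goal,
        reward=sparse_reward(t.state, new_goal),
    )


def back_stepping_trajectory(forward: List[Transition]) -> List[Transition]:
    """Build a reversed trajectory whose goal is the forward start state."""
    start = forward[0].state
    return [back_step(t, start) for t in reversed(forward)]


forward = [
    Transition((0, 0), "right", (1, 0), (1, 1), 0.0),
    Transition((1, 0), "up", (1, 1), (1, 1), 1.0),
]
for transition in back_stepping_trajectory(forward):
    print(transition)
```

Both the forward and the generated transitions would then be placed in the replay buffer; how they are sampled is a separate design choice that the sketch does not cover.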

[AI-34] Non-Uniform Parameter-Wise Model Merging

Link: https://arxiv.org/abs/2412.15467
Authors: Albert Manuel Orozco Camacho, Stefan Horoi, Guy Wolf, Eugene Belilovsky
Keywords: Combining multiple machine, Combining multiple, enhancing performance, technique for enhancing, Combining
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure, to be published in the Proceedings of the 9th IEEE Special Session on Machine Learning on Big Data (MLBD 2024)

Abstract:Combining multiple machine learning models has long been a technique for enhancing performance, particularly in distributed settings. Traditional approaches, such as model ensembles, work well, but are expensive in terms of memory and compute. Recently, methods based on averaging model parameters have achieved good results in some settings and have gained popularity. However, merging models initialized differently that do not share a part of their training trajectories can yield worse results than simply using the base models, even after aligning their neurons. In this paper, we introduce a novel approach, Non-uniform Parameter-wise Model Merging, or NP Merge, which merges models by learning the contribution of each parameter to the final model using gradient-based optimization. We empirically demonstrate the effectiveness of our method for merging models of various architectures in multiple settings, outperforming past methods. We also extend NP Merge to handle the merging of multiple models, showcasing its scalability and robustness.
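
Learning a separate merging coefficient for every parameter can be sketched on a toy problem. The code below is only an illustration under stated assumptions: a two-weight linear model, a sigmoid parameterization of the per-parameter coefficient, plain gradient descent, and hand-picked data and hyperparameters. None of these details come from the abstract.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def merge(alpha, theta_a, theta_b):
    """Interpolate each parameter with its own coefficient sigmoid(alpha_j)."""
    return [sigmoid(al) * a + (1.0 - sigmoid(al)) * b
            for al, a, b in zip(alpha, theta_a, theta_b)]


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, xs, ys):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learn_merge_coefficients(theta_a, theta_b, xs, ys, lr=0.3, steps=500):
    """Gradient descent on the logits alpha; the base models stay frozen."""
    alpha = [0.0] * len(theta_a)  # sigmoid(0) = 0.5, i.e. uniform averaging
    for _ in range(steps):
        w = merge(alpha, theta_a, theta_b)
        residuals = [predict(w, x) - y for x, y in zip(xs, ys)]
        for j in range(len(alpha)):
            grad_w = 2.0 / len(xs) * sum(r * x[j] for r, x in zip(residuals, xs))
            lam = sigmoid(alpha[j])
            # Chain rule: d w_j / d alpha_j = lam * (1 - lam) * (a_j - b_j)
            alpha[j] -= lr * grad_w * lam * (1.0 - lam) * (theta_a[j] - theta_b[j])
    return alpha


# Toy setup: each base model is right about a different weight.
theta_a, theta_b = [2.0, 0.0], [0.0, 3.0]
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [2.0, 3.0, 5.0, 7.0]  # generated by the weights [2.0, 3.0]

uniform_loss = mse(merge([0.0, 0.0], theta_a, theta_b), xs, ys)
alpha = learn_merge_coefficients(theta_a, theta_b, xs, ys)
learned_loss = mse(merge(alpha, theta_a, theta_b), xs, ys)
print(uniform_loss, learned_loss, [sigmoid(a) for a in alpha])
```

In this toy case uniform averaging discards what each model got right, while the learned coefficients move toward taking the first weight from one model and the second from the other.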

[AI-35] AI-Enhanced Sensemaking: Exploring the Design of a Generative AI-Based Assistant to Support Genetic Professionals

Link: https://arxiv.org/abs/2412.15444
Authors: Angela Mastrianni, Hope Twede, Aleksandra Sarcevic, Jeremiah Wander, Christina Austin-Tse, Scott Saponas, Heidi Rehm, Ashley Mae Conard, Amanda K. Hall
Keywords: Generative, knowledge work, WGS analysis
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 22 pages, 8 figures, 1 table, 3 appendices

Abstract:Generative AI has the potential to transform knowledge work, but further research is needed to understand how knowledge workers envision using and interacting with generative AI. We investigate the development of generative AI tools to support domain experts in knowledge work, examining task delegation and the design of human-AI interactions. Our research focused on designing a generative AI assistant to aid genetic professionals in analyzing whole genome sequences (WGS) and other clinical data for rare disease diagnosis. Through interviews with 17 genetics professionals, we identified current challenges in WGS analysis. We then conducted co-design sessions with six genetics professionals to determine tasks that could be supported by an AI assistant and considerations for designing interactions with the AI assistant. From our findings, we identified sensemaking as both a current challenge in WGS analysis and a process that could be supported by AI. We contribute an understanding of how domain experts envision interacting with generative AI in their knowledge work, a detailed empirical study of WGS analysis, and three design considerations for using generative AI to support domain experts in sensemaking during knowledge work. CCS CONCEPTS: Human-centered computing, Human-computer interaction, Empirical studies in HCI. Additional Keywords and Phrases: whole genome sequencing, generative AI, large language models, knowledge work, sensemaking, co-design, rare disease. Contact Author: Angela Mastrianni (this work was done during the author’s internship at Microsoft Research). Ashley Mae Conard and Amanda K. Hall contributed equally.

[AI-36] Energy consumption of code small language models serving with runtime engines and execution providers

Link: https://arxiv.org/abs/2412.15441
Authors: Francisco Durán, Matias Martinez, Patricia Lago, Silverio Martínez-Fernández
Keywords: Small Language Models, Language Models, code, energy
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, submitted to journal

Abstract:Background. The rapid growth of Language Models (LMs), particularly in code generation, requires substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing LMs inference for energy efficiency is crucial, and Small Language Models (SLMs) offer a promising solution to reduce resource demands. Aim. Our goal is to analyze the impact of deep learning runtime engines and execution providers on energy consumption, execution time, and computing-resource utilization from the point of view of software engineers conducting inference in the context of code SLMs. Method. We conducted a technology-oriented, multi-stage experimental pipeline using twelve code generation SLMs to investigate energy consumption, execution time, and computing-resource utilization across the configurations. Results. Significant differences emerged across configurations. CUDA execution provider configurations outperformed CPU execution provider configurations in both energy consumption and execution time. Among the configurations, TORCH paired with CUDA demonstrated the greatest energy efficiency, achieving energy savings from 37.99% up to 89.16% compared to other serving configurations. Similarly, optimized runtime engines like ONNX with the CPU execution provider achieved from 8.98% up to 72.04% energy savings within CPU-based configurations. Also, TORCH paired with CUDA exhibited efficient computing-resource utilization. Conclusions. Serving configuration choice significantly impacts energy efficiency. While further research is needed, we recommend the above configurations best suited to software engineers’ requirements for enhancing serving efficiency in energy and performance. 

[AI-37] Quantifying detection rates for dangerous capabilities: a theoretical model of dangerous capability evaluations

Link: https://arxiv.org/abs/2412.15433
Authors: Paolo Bova, Alessandro Di Stefano, The Anh Han
Keywords: dangerous capability testing, present a quantitative, dangerous capability, capability testing, dangerous
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); General Economics (econ.GN); Applications (stat.AP)
Comments: 26 pages, 15 figures

Abstract:We present a quantitative model for tracking dangerous AI capabilities over time. Our goal is to help the policy and research community visualise how dangerous capability testing can give us an early warning about approaching AI risks. We first use the model to provide a novel introduction to dangerous capability testing and how this testing can directly inform policy. Decision makers in AI labs and government often set policy that is sensitive to the estimated danger of AI systems, and may wish to set policies that condition on the crossing of a set threshold for danger. The model helps us to reason about these policy choices. We then run simulations to illustrate how we might fail to test for dangerous capabilities. To summarise, failures in dangerous capability testing may manifest in two ways: higher bias in our estimates of AI danger, or larger lags in threshold monitoring. We highlight two drivers of these failure modes: uncertainty around dynamics in AI capabilities and competition between frontier AI labs. Effective AI policy demands that we address these failure modes and their drivers. Even if the optimal targeting of resources is challenging, we show how delays in testing can harm AI policy. We offer preliminary recommendations for building an effective testing ecosystem for dangerous capabilities and advise on a research agenda.

[AI-38] Offline Safe Reinforcement Learning Using Trajectory Classification AAAI2025

Link: https://arxiv.org/abs/2412.15429
Authors: Ze Gong, Akshat Kumar, Pradeep Varakantham
Keywords: risky online interactions, Offline safe reinforcement, Offline safe, trajectories, behaviors without engaging
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AAAI 2025

Abstract:Offline safe reinforcement learning (RL) has emerged as a promising approach for learning safe behaviors without engaging in risky online interactions with the environment. Most existing methods in offline safe RL rely on cost constraints at each time step (derived from global cost constraints) and this can result in either overly conservative policies or violation of safety constraints. In this paper, we propose to learn a policy that generates desirable trajectories and avoids undesirable trajectories. To be specific, we first partition the pre-collected dataset of state-action trajectories into desirable and undesirable subsets. Intuitively, the desirable set contains high reward and safe trajectories, and undesirable set contains unsafe trajectories and low-reward safe trajectories. Second, we learn a policy that generates desirable trajectories and avoids undesirable trajectories, where (un)desirability scores are provided by a classifier learnt from the dataset of desirable and undesirable trajectories. This approach bypasses the computational complexity and stability issues of a min-max objective that is employed in existing methods. Theoretically, we also show our approach’s strong connections to existing learning paradigms involving human feedback. Finally, we extensively evaluate our method using the DSRL benchmark for offline safe RL. Empirically, our method outperforms competitive baselines, achieving higher rewards and better constraint satisfaction across a wide variety of benchmark tasks.
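
The first step described in this abstract, splitting the dataset into desirable and undesirable trajectories, can be sketched as follows. The abstract does not say how "safe" and "high reward" are decided, so the `cost_limit` and `reward_threshold` parameters and the `Trajectory` summary type are assumptions for illustration only.

```python
from typing import List, NamedTuple, Tuple


class Trajectory(NamedTuple):
    name: str
    total_reward: float
    total_cost: float


def partition(
    trajectories: List[Trajectory],
    cost_limit: float,         # assumed global safety budget
    reward_threshold: float,   # assumed cut-off for "high reward"
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Desirable = safe and high reward; undesirable = unsafe, or safe but low reward."""
    desirable, undesirable = [], []
    for traj in trajectories:
        is_safe = traj.total_cost <= cost_limit
        is_high_reward = traj.total_reward >= reward_threshold
        if is_safe and is_high_reward:
            desirable.append(traj)
        else:
            undesirable.append(traj)
    return desirable, undesirable


dataset = [
    Trajectory("safe_high", total_reward=10.0, total_cost=1.0),
    Trajectory("unsafe_high", total_reward=12.0, total_cost=9.0),
    Trajectory("safe_low", total_reward=2.0, total_cost=0.0),
]
good, bad = partition(dataset, cost_limit=5.0, reward_threshold=8.0)
print([t.name for t in good], [t.name for t in bad])
```

The two subsets would then serve as training data for the trajectory classifier whose scores guide policy learning.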

[AI-39] Investigating Relational State Abstraction in Collaborative MARL

Link: https://arxiv.org/abs/2412.15388
Authors: Sharlin Utke, Jeremie Houssineau, Giovanni Montana
Keywords: Multi-Agent Reinforcement Learning, Reinforcement Learning, collaborative Multi-Agent Reinforcement, Multi-Agent Reinforcement, paper explores
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:This paper explores the impact of relational state abstraction on sample efficiency and performance in collaborative Multi-Agent Reinforcement Learning. The proposed abstraction is based on spatial relationships in environments where direct communication between agents is not allowed, leveraging the ubiquity of spatial reasoning in real-world multi-agent scenarios. We introduce MARC (Multi-Agent Relational Critic), a simple yet effective critic architecture incorporating spatial relational inductive biases by transforming the state into a spatial graph and processing it through a relational graph neural network. The performance of MARC is evaluated across six collaborative tasks, including a novel environment with heterogeneous agents. We conduct a comprehensive empirical analysis, comparing MARC against state-of-the-art MARL baselines, demonstrating improvements in both sample efficiency and asymptotic performance, as well as its potential for generalization. Our findings suggest that a minimal integration of spatial relational inductive biases as abstraction can yield substantial benefits without requiring complex designs or task-specific engineering. This work provides insights into the potential of relational state abstraction to address sample efficiency, a key challenge in MARL, offering a promising direction for developing more efficient algorithms in spatially complex environments.
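
Transforming a state into a spatial graph with typed edges, the input a relational graph neural network expects, might look like the sketch below. The relation vocabulary (`left_of`, `right_of`, `above`, `below`, `adjacent`), the 2D integer coordinates, and the entity names are hypothetical; the abstract does not specify which relations MARC uses.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]
Edge = Tuple[str, str, str]  # (source, relation, target): "source is <relation> target"


def spatial_relations(src: Position, dst: Position) -> List[str]:
    """Relations of src with respect to dst, assuming y grows upward."""
    relations = []
    if src[0] < dst[0]:
        relations.append("left_of")
    if src[0] > dst[0]:
        relations.append("right_of")
    if src[1] < dst[1]:
        relations.append("below")
    if src[1] > dst[1]:
        relations.append("above")
    # Chebyshev distance 1 means the two cells touch (including diagonally).
    if max(abs(src[0] - dst[0]), abs(src[1] - dst[1])) == 1:
        relations.append("adjacent")
    return relations


def state_to_graph(entities: Dict[str, Position]) -> List[Edge]:
    """Emit one typed, directed edge per relation between every ordered pair."""
    edges: List[Edge] = []
    for src_name, src_pos in entities.items():
        for dst_name, dst_pos in entities.items():
            if src_name != dst_name:
                edges.extend(
                    (src_name, rel, dst_name)
                    for rel in spatial_relations(src_pos, dst_pos)
                )
    return edges


edges = state_to_graph({"agent": (0, 0), "box": (1, 0)})
print(edges)
```

Each relation type would get its own message-passing weights in a relational GNN, which is how the spatial inductive bias enters the critic.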

[AI-40] Automated Root Cause Analysis System for Complex Data Products

Link: https://arxiv.org/abs/2412.15374
Authors: Mathieu Demarne, Miso Cilimdzic, Tom Falkowski, Timothy Johnson, Jim Gramling, Wei Kuang, Hoobie Hou, Amjad Aryan, Gayatri Subramaniam, Kenny Lee, Manuel Mejia, Lisa Liu, Divya Vermareddy
Keywords: Domain Specific Language, low learning curve, Root Cause Analysis, Domain Specific, fast diagnostic implementation
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures

Abstract:We present ARCAS (Automated Root Cause Analysis System), a diagnostic platform based on a Domain Specific Language (DSL) built for fast diagnostic implementation and a low learning curve. ARCAS is composed of a constellation of automated troubleshooting guides (Auto-TSGs) that can execute in parallel to detect issues using product telemetry and apply mitigation in near-real-time. The DSL is tailored specifically to ensure that subject matter experts can deliver highly curated and relevant Auto-TSGs in a short time without having to understand how they will interact with the rest of the diagnostic platform, thus reducing time-to-mitigate and saving crucial engineering cycles when they matter most. This contrasts with platforms like Datadog and New Relic, which primarily focus on monitoring and require manual intervention for mitigation. ARCAS uses a Large Language Model (LLM) to prioritize Auto-TSG outputs and take appropriate actions, thus removing the costly requirement of understanding the general behavior of the system. We explain the key concepts behind ARCAS and demonstrate how it has been successfully used for multiple products across Azure Synapse Analytics and Microsoft Fabric Synapse Data Warehouse.

[AI-41] Granger Causality Detection with Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2412.15373
Authors: Hongyu Lin, Mohan Ren, Paolo Barucca, Tomaso Aste
Keywords: time series data, Granger causality, Granger causality detection, Discovering causal relationships, Granger
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 8 pages, 2 figures, 2 tables

Abstract:Discovering causal relationships in time series data is central in many scientific areas, ranging from economics to climate science. Granger causality is a powerful tool for causality detection. However, its original formulation is limited by its linear form and only recently nonlinear machine-learning generalizations have been introduced. This study contributes to the definition of neural Granger causality models by investigating the application of Kolmogorov-Arnold networks (KANs) in Granger causality detection and comparing their capabilities against multilayer perceptrons (MLP). In this work, we develop a framework called Granger Causality KAN (GC-KAN) along with a tailored training approach designed specifically for Granger causality detection. We test this framework on both Vector Autoregressive (VAR) models and chaotic Lorenz-96 systems, analysing the ability of KANs to sparsify input features by identifying Granger causal relationships, providing a concise yet accurate model for Granger causality detection. Our findings show the potential of KANs to outperform MLPs in discerning interpretable Granger causal relationships, particularly for the ability of identifying sparse Granger causality patterns in high-dimensional settings, and more generally, the potential of AI in causality discovery for the dynamical laws in physical systems.
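
For readers new to the setting, the nonlinear notion of Granger causality that neural models of this kind target can be written as below. This is the usual component-wise formulation from the neural Granger causality literature, not necessarily the exact GC-KAN objective; the symbols $g_i$, $K$ (maximum lag), $p$ (number of series), $\lambda$, and the generic sparsity penalty $\Omega$ are notation chosen here.

```latex
% One predictor g_i per target series i, fed with the K most recent values of all p series
x_{i,t} = g_i\bigl(x_{1,\,t-K:t-1}, \dots, x_{p,\,t-K:t-1}\bigr) + e_{i,t},
\qquad i = 1, \dots, p

% Series j is Granger non-causal for series i when g_i ignores the past of series j
g_i\bigl(\dots, x_{j,\,t-K:t-1}, \dots\bigr) = g_i\bigl(\dots, x'_{j,\,t-K:t-1}, \dots\bigr)
\quad \text{for all } x_{j,\,t-K:t-1} \neq x'_{j,\,t-K:t-1}

% Sparse training: prediction loss plus a penalty that switches off unused input series
\min_{g_i} \; \sum_{t} \bigl(x_{i,t} - g_i(\cdot)\bigr)^2 + \lambda \, \Omega(g_i)
```

Under this reading, the input series that survive the sparsity penalty for predictor $g_i$ are the estimated Granger causes of series $i$.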

[AI-42] Making Transparency Advocates: An Educational Approach Towards Better Algorithmic Transparency in Practice

Link: https://arxiv.org/abs/2412.15363
Authors: Andrew Bell, Julia Stoyanovich
Keywords: algorithmic transparency, giving rise, risks and harms, harms posed, resulted in significant
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Concerns about the risks and harms posed by artificial intelligence (AI) have resulted in significant study into algorithmic transparency, giving rise to a sub-field known as Explainable AI (XAI). Unfortunately, despite a decade of development in XAI, an existential challenge remains: progress in research has not been fully translated into the actual implementation of algorithmic transparency by organizations. In this work, we test an approach for addressing the challenge by creating transparency advocates, or motivated individuals within organizations who drive a ground-up cultural shift towards improved algorithmic transparency. Over several years, we created an open-source educational workshop on algorithmic transparency and advocacy. We delivered the workshop to professionals across two separate domains to improve their algorithmic transparency literacy and willingness to advocate for change. In the weeks following the workshop, participants applied what they learned, such as speaking up for algorithmic transparency at an organization-wide AI strategy meeting. We also make two broader observations: first, advocacy is not a monolith and can be broken down into different levels. Second, individuals’ willingness for advocacy is affected by their professional field. For example, news and media professionals may be more likely to advocate for algorithmic transparency than those working at technology start-ups.

[AI-43] GeoPro-Net: Learning Interpretable Spatiotemporal Prediction Models through Statistically-Guided Geo-Prototyping

Link: https://arxiv.org/abs/2412.15353
Authors: Bang An, Xun Zhou, Zirui Zhou, Ronilo Ragodos, Zenglin Xu, Jun Luo
Keywords: city management, multi-source spatiotemporal features, crimes and accidents, accidents is crucial, crucial to public
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The problem of forecasting spatiotemporal events such as crimes and accidents is crucial to public safety and city management. Besides accuracy, interpretability is also a key requirement for spatiotemporal forecasting models to justify the decisions. Interpretation of the spatiotemporal forecasting mechanism is, however, challenging due to the complexity of multi-source spatiotemporal features, the non-intuitive nature of spatiotemporal patterns for non-expert users, and the presence of spatial heterogeneity in the data. Currently, no existing deep learning model intrinsically interprets the complex predictive process learned from multi-source spatiotemporal features. To bridge the gap, we propose GeoPro-Net, an intrinsically interpretable spatiotemporal model for spatiotemporal event forecasting problems. GeoPro-Net introduces a novel Geo-concept convolution operation, which employs statistical tests to extract predictive patterns in the input as Geo-concepts, and condenses the Geo-concept-encoded input through interpretable channel fusion and geographic-based pooling. In addition, GeoPro-Net learns different sets of prototypes of concepts inherently, and projects them to real-world cases for interpretation. Comprehensive experiments and case studies on four real-world datasets demonstrate that GeoPro-Net provides better interpretability while still achieving competitive prediction performance compared with state-of-the-art baselines.

[AI-44] MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs

Link: https://arxiv.org/abs/2412.15310
Authors: Yuxuan Wan, Yi Dong, Jingyu Xiao, Yintong Huo, Wenxuan Wang, Michael R. Lyu
Keywords: modern web development, dominate modern web, websites dominate modern, Multi-page websites dominate, dominate modern
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Multi-page websites dominate modern web development. However, existing design-to-code methods rely on simplified assumptions, limiting to single-page, self-contained webpages without external resource connection. To address this gap, we introduce the Multi-Page Resource-Aware Webpage (MRWeb) generation task, which transforms UI designs into multi-page, functional web UIs with internal/external navigation, image loading, and backend routing. We propose a novel resource list data structure to track resources, links, and design components. Our study applies existing methods to the MRWeb problem using a newly curated dataset of 500 websites (300 synthetic, 200 real-world). Specifically, we identify the best metric to evaluate the similarity of the web UI, assess the impact of the resource list on MRWeb generation, analyze MLLM limitations, and evaluate the effectiveness of the MRWeb tool in real-world workflows. The results show that resource lists boost navigation functionality from 0% to 66%-80% while facilitating visual similarity. Our proposed metrics and evaluation framework provide new insights into MLLM performance on MRWeb tasks. We release the MRWeb tool, dataset, and evaluation framework to promote further research.

[AI-45] Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling NAACL2025

Link: https://arxiv.org/abs/2412.15305
Authors: Ziyi Ni, Yifan Li, Ning Yang, Dou Shen, Pin Lv, Daxiang Dong
Keywords: Solving complex reasoning, key real-world application, Large Language Models, Solving complex, complex reasoning tasks
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: This idea was first submitted to the NeurIPS Workshop “System 2 Reasoning At Scale” in September 2024. Its OpenReview: this https URL. It was then submitted to NAACL 2025 in October 2024, which is recorded in: this https URL. This work predates many existing works

Abstract:Solving complex reasoning tasks is a key real-world application of agents. Thanks to the pretraining of Large Language Models (LLMs) on code data, recent approaches like CodeAct successfully use code as LLM agents’ action, achieving good results. However, CodeAct greedily generates the next action’s code block by relying on fragmented thoughts, resulting in inconsistency and instability. Moreover, CodeAct lacks action-related ground-truth (GT), making its supervision signals and termination conditions questionable in multi-turn interactions. To address these issues, we first introduce a simple yet effective end-to-end code generation paradigm, CodeProgram, which leverages code’s systematic logic to align with global reasoning and enable cohesive problem-solving. Then, we propose Tree-of-Code (ToC), which self-grows CodeProgram nodes based on the executable nature of the code and enables self-supervision in a GT-free scenario. Experimental results on two datasets using ten popular zero-shot LLMs show ToC remarkably boosts accuracy by nearly 20% over CodeAct with less than 1/4 turns. Several LLMs even perform better on one-turn CodeProgram than on multi-turn CodeAct. To further investigate the trade-off between efficacy and efficiency, we test different ToC tree sizes and exploration mechanisms. We also highlight the potential of ToC’s end-to-end data generation for supervised and reinforced fine-tuning.

[AI-46] A Universal Model for Human Mobility Prediction

Link: https://arxiv.org/abs/2412.15294
Authors: Qingyue Long, Yuan Yuan, Yong Li
Keywords: Predicting human mobility, crowd flow, Predicting human, individual trajectory, trajectory and crowd
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Predicting human mobility is crucial for urban planning, traffic control, and emergency response. Mobility behaviors can be categorized into individual and collective, and these behaviors are recorded by diverse mobility data, such as individual trajectory and crowd flow. As different modalities of mobility data, individual trajectory and crowd flow have a close coupling relationship. Crowd flows originate from the bottom-up aggregation of individual trajectories, while the constraints imposed by crowd flows shape these individual trajectories. Existing mobility prediction methods are limited to single tasks due to modal gaps between individual trajectory and crowd flow. In this work, we aim to unify mobility prediction to break through the limitations of task-specific models. We propose a universal human mobility prediction model (named UniMob), which can be applied to both individual trajectory and crowd flow. UniMob leverages a multi-view mobility tokenizer that transforms both trajectory and flow data into spatiotemporal tokens, facilitating unified sequential modeling through a diffusion transformer architecture. To bridge the gap between the different characteristics of these two data modalities, we implement a novel bidirectional individual and collective alignment mechanism. This mechanism enables learning common spatiotemporal patterns from different mobility data, facilitating mutual enhancement of both trajectory and flow predictions. Extensive experiments on real-world datasets validate the superiority of our model over state-of-the-art baselines in trajectory and flow prediction. Especially in noisy and scarce data scenarios, our model achieves the highest performance improvement of more than 14% and 25% in MAPE and Accuracy@5.
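
The two metrics named at the end of this abstract follow standard definitions, sketched below: MAPE for flow prediction and Accuracy@k for next-location prediction. The function names, argument layout, and toy numbers are illustrative; the paper's own evaluation code may differ in details such as how zero-valued flows are handled.

```python
from typing import List, Sequence


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Undefined when a true value is 0."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def accuracy_at_k(
    true_locations: Sequence[str],
    ranked_predictions: Sequence[List[str]],
    k: int = 5,
) -> float:
    """Fraction of samples whose true location is among the top-k ranked candidates."""
    hits = sum(
        1
        for truth, ranked in zip(true_locations, ranked_predictions)
        if truth in ranked[:k]
    )
    return hits / len(true_locations)


flow_error = mape([100.0, 200.0], [110.0, 180.0])
hit_rate = accuracy_at_k(
    ["A", "B"],
    [["A", "C"], ["C", "D", "E", "F", "G", "B"]],  # "B" is ranked sixth, outside the top 5
    k=5,
)
print(flow_error, hit_rate)
```

Lower MAPE is better and higher Accuracy@5 is better, so an "improvement" in the two metrics points in opposite numeric directions.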

[AI-47] Deep reinforcement learning with time-scale invariant memory

Link: https://arxiv.org/abs/2412.15292
Authors: Md Rysul Kabir, James Mochizuki-Freeman, Zoran Tiganj
Keywords: estimate temporal relationships, ability to estimate, temporal relationships, temporal, estimate temporal
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The ability to estimate temporal relationships is critical for both animals and artificial agents. Cognitive science and neuroscience provide remarkable insights into behavioral and neural aspects of temporal credit assignment. In particular, scale invariance of learning dynamics, observed in behavior and supported by neural data, is one of the key principles that governs animal perception: proportional rescaling of temporal relationships does not alter the overall learning efficiency. Here we integrate a computational neuroscience model of scale invariant memory into deep reinforcement learning (RL) agents. We first provide a theoretical analysis and then demonstrate through experiments that such agents can learn robustly across a wide range of temporal scales, unlike agents built with commonly used recurrent memory architectures such as LSTM. This result illustrates that incorporating computational principles from neuroscience and cognitive science into deep neural networks can enhance adaptability to complex temporal dynamics, mirroring some of the core properties of human learning.

[AI-48] Functional connectomes of neural networks AAAI-25 AAAI

Link: https://arxiv.org/abs/2412.15279
Authors: Tananun Songdechakraiwut, Yutong Wu
Keywords: complex system, challenge in neuroscience, long-standing challenge, neural networks, functional connectome
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: Accepted at the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

Abstract:The human brain is a complex system, and understanding its mechanisms has been a long-standing challenge in neuroscience. The study of the functional connectome, which maps the functional connections between different brain regions, has provided valuable insights through various advanced analysis techniques developed over the years. Similarly, neural networks, inspired by the brain’s architecture, have achieved notable success in diverse applications but are often noted for their lack of interpretability. In this paper, we propose a novel approach that bridges neural networks and human brain functions by leveraging brain-inspired techniques. Our approach, grounded in the insights from the functional connectome, offers scalable ways to characterize topology of large neural networks using stable statistical and machine learning techniques. Our empirical analysis demonstrates its capability to enhance the interpretability of neural networks, providing a deeper understanding of their underlying mechanisms.

[AI-49] DreaMark: Rooting Watermark in Score Distillation Sampling Generated Neural Radiance Fields

Link: https://arxiv.org/abs/2412.15278
Authors: Xingyu Zhu, Xiapu Luo, Xuetao Wei
Keywords: neural radiance fields, score distillation sampling, real-world data capture, Recent advancements, generate neural radiance
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI)

Abstract: Recent advancements in text-to-3D generation can generate neural radiance fields (NeRFs) with score distillation sampling, enabling 3D asset creation without real-world data capture. With the rapid advancement in NeRF generation quality, protecting the copyright of the generated NeRF has become increasingly important. While prior works can watermark NeRFs in a post-generation way, they suffer from two vulnerabilities. First, a delay lies between NeRF generation and watermarking because the secret message is embedded into the NeRF model post-generation through fine-tuning. Second, generating a non-watermarked NeRF as an intermediate creates a potential vulnerability for theft. To address both issues, we propose DreaMark to embed a secret message by backdooring the NeRF during NeRF generation. In detail, we first pre-train a watermark decoder. Then, DreaMark generates backdoored NeRFs in such a way that the target secret message can be verified by the pre-trained watermark decoder on an arbitrary trigger viewport. We evaluate the generation quality and watermark robustness against image- and model-level attacks. Extensive experiments show that the watermarking process does not degrade the generation quality, and the watermark achieves more than 90% accuracy under both image-level attacks (e.g., Gaussian noise) and model-level attacks (e.g., pruning attack).

[AI-50] Exploring Query Efficient Data Generation towards Data-free Model Stealing in Hard Label Setting

Link: https://arxiv.org/abs/2412.15276
Authors: Gaozheng Pei, Shaojie lyu, Ke Ma, Pinci Yang, Qianqian Xu, Yingfei Sun
Keywords: target model, target model structure, substitute model, Data-free model stealing, stealing involves replicating
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Data-free model stealing involves replicating the functionality of a target model into a substitute model without accessing the target model’s structure, parameters, or training data. The adversary can only access the target model’s predictions for generated samples. Once the substitute model closely approximates the behavior of the target model, attackers can exploit its white-box characteristics for subsequent malicious activities, such as adversarial attacks. Existing methods within cooperative game frameworks often produce samples with high confidence for the prediction of the substitute model, which makes it difficult for the substitute model to replicate the behavior of the target model. This paper presents a new data-free model stealing approach called Query Efficient Data Generation (QEDG). We introduce two distinct loss functions to ensure the generation of sufficient samples that closely and uniformly align with the target model’s decision boundary across multiple classes. Building on the limitation of current methods, which typically yield only one piece of supervised information per query, we propose the query-free sample augmentation that enables the acquisition of additional supervised information without increasing the number of queries. Motivated by theoretical analysis, we adopt the consistency rate metric, which more accurately evaluates the similarity between the substitute and target models. We conducted extensive experiments to verify the effectiveness of our proposed method, which achieved better performance with fewer queries compared to the state-of-the-art methods on the real MLaaS scenario and five datasets.

[AI-51] Learning-by-teaching with ChatGPT: The effect of teachable ChatGPT agent on programming education

Link: https://arxiv.org/abs/2412.15226
Authors: Angxuan Chen, Yuang Wei, Huixiao Le, Yan Zhang
Keywords: teachable agent, traditional teachable agents, learning, teaching, ChatGPT
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Applications (stat.AP)

Abstract:This study investigates the potential of using ChatGPT as a teachable agent to support students’ learning by teaching process, specifically in programming education. While learning by teaching is an effective pedagogical strategy for promoting active learning, traditional teachable agents have limitations, particularly in facilitating natural language dialogue. Our research explored whether ChatGPT, with its ability to engage learners in natural conversations, can support this process. The findings reveal that interacting with ChatGPT improves students’ knowledge gains and programming abilities, particularly in writing readable and logically sound code. However, it had limited impact on developing learners’ error-correction skills, likely because ChatGPT tends to generate correct code, reducing opportunities for students to practice debugging. Additionally, students’ self-regulated learning (SRL) abilities improved, suggesting that teaching ChatGPT fosters learners’ higher self-efficacy and better implementation of SRL strategies. This study discussed the role of natural dialogue in fostering socialized learning by teaching, and explored ChatGPT’s specific contributions in supporting students’ SRL through the learning by teaching process. Overall, the study highlights ChatGPT’s potential as a teachable agent, offering insights for future research on ChatGPT-supported education.

[AI-52] A Survey on Large Language Model-based Agents for Statistics and Data Science

Link: https://arxiv.org/abs/2412.14222
Authors: Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, Jian Huang
Keywords: Large Language Models, Language Models, Large Language, shown significant potential, powered by Large
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE); Applications (stat.AP)

Abstract:In recent years, data science agents powered by Large Language Models (LLMs), known as “data agents,” have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.

[AI-53] TACNET: Temporal Audio Source Counting Network

Link: https://arxiv.org/abs/2311.02369
Authors: Amirreza Ahmadnejad, Ahmad Mahmmodian Darviishani, Mohmmad Mehrdad Asadi, Sajjad Saffariyeh, Pedram Yousef, Emad Fatemizadeh
Keywords: Source Counting Network, Temporal Audio Source, Audio Source Counting, source counting tasks, introduce the Temporal
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract: In this paper, we introduce the Temporal Audio Source Counting Network (TaCNet), an innovative architecture that addresses limitations in audio source counting tasks. TaCNet operates directly on raw audio inputs, eliminating complex preprocessing steps and simplifying the workflow. Notably, it excels in real-time speaker counting, even with truncated input windows. Our extensive evaluation, conducted using the LibriCount dataset, underscores TaCNet’s exceptional performance, positioning it as a state-of-the-art solution for audio source counting tasks. With an average accuracy of 74.18% over 11 classes, TaCNet demonstrates its effectiveness across diverse scenarios, including applications involving Chinese and Persian languages. This cross-lingual adaptability highlights its versatility and potential impact.

[AI-54] Convolutional Deep Operator Networks for Learning Nonlinear Focused Ultrasound Wave Propagation in Heterogeneous Spinal Cord Anatomy AAAI

Link: https://arxiv.org/abs/2412.16118
Authors: Avisha Kumar, Xuzhe Zhi, Zan Ahmad, Minglang Yin, Amir Manbachi
Keywords: offering submillimeter precision, enhance blood flow, spinal cord injuries, optimally targeted treatment, spinal cord
Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments: Accepted for oral presentation at AAAI Conference on Artificial Intelligence: AI for Accelerating Science and Engineering Workshop 2025

Abstract:Focused ultrasound (FUS) therapy is a promising tool for optimally targeted treatment of spinal cord injuries (SCI), offering submillimeter precision to enhance blood flow at injury sites while minimizing impact on surrounding tissues. However, its efficacy is highly sensitive to the placement of the ultrasound source, as the spinal cord’s complex geometry and acoustic heterogeneity distort and attenuate the FUS signal. Current approaches rely on computer simulations to solve the governing wave propagation equations and compute patient-specific pressure maps using ultrasound images of the spinal cord anatomy. While accurate, these high-fidelity simulations are computationally intensive, taking up to hours to complete parameter sweeps, which is impractical for real-time surgical decision-making. To address this bottleneck, we propose a convolutional deep operator network (DeepONet) to rapidly predict FUS pressure fields in patient spinal cords. Unlike conventional neural networks, DeepONets are well equipped to approximate the solution operator of the parametric partial differential equations (PDEs) that govern the behavior of FUS waves with varying initial and boundary conditions (i.e., new transducer locations or spinal cord geometries) without requiring extensive simulations. Trained on simulated pressure maps across diverse patient anatomies, this surrogate model achieves real-time predictions with only a 2% loss on the test set, significantly accelerating the modeling of nonlinear physical systems in heterogeneous domains. By facilitating rapid parameter sweeps in surgical settings, this work provides a crucial step toward precise and individualized solutions in neurosurgical treatments.

[AI-55] GraphSeqLM: A Unified Graph Language Framework for Omic Graph Learning

Link: https://arxiv.org/abs/2412.15790
Authors: Heming Zhang, Di Huang, Yixin Chen, Fuhai Li
Keywords: understanding complex diseases, present significant challenges, noise present significant, Large Language Models, Graph Neural Networks
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The integration of multi-omic data is pivotal for understanding complex diseases, but its high dimensionality and noise present significant challenges. Graph Neural Networks (GNNs) offer a robust framework for analyzing large-scale signaling pathways and protein-protein interaction networks, yet they face limitations in expressivity when capturing intricate biological relationships. To address this, we propose Graph Sequence Language Model (GraphSeqLM), a framework that enhances GNNs with biological sequence embeddings generated by Large Language Models (LLMs). These embeddings encode structural and biological properties of DNA, RNA, and proteins, augmenting GNNs with enriched features for analyzing sample-specific multi-omic data. By integrating topological, sequence-derived, and biological information, GraphSeqLM demonstrates superior predictive accuracy and outperforms existing methods, paving the way for more effective multi-omic data integration in precision medicine.

[AI-56] Modeling Autonomous Shifts Between Focus State and Mind-Wandering Using a Predictive-Coding-Inspired Variational RNN Model

Link: https://arxiv.org/abs/2412.15620
Authors: Henrique Oyama, Jun Tani
Keywords: neural mechanisms underlying, current study investigates, underlying autonomous shifts, conducting model simulation, mechanisms underlying autonomous
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract: The current study investigates possible neural mechanisms underlying autonomous shifts between focus state and mind-wandering by conducting model simulation experiments. For this purpose, we modeled perception processes of continuous sensory sequences using our previously proposed variational RNN model, which was developed based on the free energy principle. The current study extended this model by introducing an adaptation mechanism of a meta-level parameter, referred to as the meta-prior $\mathbf{w}$, which regulates the complexity term in the free energy. Our simulation experiments demonstrated that autonomous shifts between focused perception and mind-wandering take place when $\mathbf{w}$ switches between low and high values associated with decrease and increase of the average reconstruction error over the past window. In particular, high $\mathbf{w}$ prioritized top-down predictions while low $\mathbf{w}$ emphasized bottom-up sensations. This paper explores how our experiment results align with existing studies and highlights their potential for future research.
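To make the role of the meta-prior concrete, a schematic single-step form of the variational free energy with a weighted complexity term is sketched below. This is a generic illustration, not the paper's exact objective: the symbols $q$, $p$, $\mathbf{z}$ (latent state) and $\mathbf{x}$ (sensory input) are assumed notation, and the actual model is sequential, so its free energy is accumulated over time steps.

```latex
\mathcal{F}
  \;=\;
  \underbrace{\mathbf{w}\, D_{\mathrm{KL}}\!\big[\, q(\mathbf{z}\mid\mathbf{x}) \,\big\|\, p(\mathbf{z}) \,\big]}_{\text{complexity}}
  \;-\;
  \underbrace{\mathbb{E}_{q(\mathbf{z}\mid\mathbf{x})}\!\big[\log p(\mathbf{x}\mid\mathbf{z})\big]}_{\text{accuracy}}
```

Under this form, a high $\mathbf{w}$ penalizes any departure of the posterior from the prior, so top-down predictions dominate. A low $\mathbf{w}$ lets the posterior follow the reconstruction term, so bottom-up sensations dominate. This matches the behavior described in the abstract.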

[AI-57] Improved Forecasts of Global Extreme Marine Heatwaves Through a Physics-guided Data-driven Approach

Link: https://arxiv.org/abs/2412.15532
Authors: Ruiqi Shu, Hao Wu, Yuan Gao, Fanghua Xu, Ruijian Gou, Xiaomeng Huang
Keywords: unusually warm sea, warm sea surface, sea surface temperature, surface temperature events, extreme MHWs
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)

Abstract: The unusually warm sea surface temperature events known as marine heatwaves (MHWs) have a profound impact on marine ecosystems. Accurate prediction of extreme MHWs has significant scientific and financial worth. However, existing methods still have certain limitations, especially for the most extreme MHWs. In this study, to address these issues, based on the physical nature of MHWs, we created a novel deep learning neural network that is capable of accurate 10-day MHW forecasting. Our framework significantly improves the forecast ability of extreme MHWs through two specially designed modules inspired by numerical models: a coupler and a probabilistic data augmentation. The coupler simulates the driving effect of the atmosphere on MHWs, while the probabilistic data augmentation approach significantly boosts the forecast ability of extreme MHWs based on the idea of ensemble forecasting. Compared with traditional numerical prediction, our framework has significantly higher accuracy and requires fewer computational resources. Moreover, explainable AI methods show that wind forcing is the primary driver of MHW evolution and reveal its relation with air-sea heat exchange. Overall, our model provides a framework for understanding MHWs’ driving processes and operational forecasts in the future.

[AI-58] Sum-of-Squares Programming for Ma-Trudinger-Wang Regularity of Optimal Transport Maps

Link: https://arxiv.org/abs/2412.13372
Authors: Sachin Shivakumar, Georgiy A. Bondar, Gabriel Khan, Abhishek Halder
Keywords: Monge optimal transport, machine learning algorithms, modern machine learning, optimal transport map, optimal transport
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Differential Geometry (math.DG)

Abstract:For a given ground cost, approximating the Monge optimal transport map that pushes forward a given probability measure onto another has become a staple in several modern machine learning algorithms. The fourth-order Ma-Trudinger-Wang (MTW) tensor associated with this ground cost function provides a notion of curvature in optimal transport. The non-negativity of this tensor plays a crucial role for establishing continuity for the Monge optimal transport map. It is, however, generally difficult to analytically verify this condition for any given ground cost. To expand the class of cost functions for which MTW non-negativity can be verified, we propose a provably correct computational approach which provides certificates of non-negativity for the MTW tensor using Sum-of-Squares (SOS) programming. We further show that our SOS technique can also be used to compute an inner approximation of the region where MTW non-negativity holds. We apply our proposed SOS programming method to several practical ground cost functions to approximate the regions of regularity of their corresponding optimal transport maps.

Machine Learning

[LG-0] FedGAT: A Privacy-Preserving Federated Approximation Algorithm for Graph Attention Networks

Link: https://arxiv.org/abs/2412.16144
Authors: Siddharth Ambekar, Yuhang Yao, Ryan Li, Carlee Joe-Wong
Keywords: huge online marketplaces, applications including friendship, social media sites, including friendship graphs, customer-merchant interaction graphs
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Federated training methods have gained popularity for graph learning with applications including friendship graphs of social media sites and customer-merchant interaction graphs of huge online marketplaces. However, privacy regulations often require locally generated data to be stored on local clients. The graph is then naturally partitioned across clients, with no client permitted access to information stored on another. Cross-client edges arise naturally in such cases and present an interesting challenge to federated training methods, as training a graph model at one client requires feature information of nodes on the other end of cross-client edges. Attempting to retain such edges often incurs significant communication overhead, and dropping them altogether reduces model performance. In simpler models such as Graph Convolutional Networks, this can be fixed by communicating a limited amount of feature information across clients before training, but GATs (Graph Attention Networks) require additional information that cannot be pre-communicated, as it changes from training round to round. We introduce the Federated Graph Attention Network (FedGAT) algorithm for semi-supervised node classification, which approximates the behavior of GATs with provable bounds on the approximation error. FedGAT requires only one pre-training communication round, significantly reducing the communication overhead for federated GAT training. We then analyze the error in the approximation and examine the communication overhead and computational complexity of the algorithm. Experiments show that FedGAT achieves nearly the same accuracy as a GAT model in a centralised setting, and its performance is robust to the number of clients as well as data distribution.

[LG-1] EF-Net: A Deep Learning Approach Combining Word Embeddings and Feature Fusion for Patient Disposition Analysis

Link: https://arxiv.org/abs/2412.16134
Authors: Nafisa Binte Feroz, Chandrima Sarker, Tanzima Ahsan, K M Arefeen Sultan, Raqeebir Rab
Keywords: rising healthcare costs, healthcare costs, urgent problems, aging population, population and rising
Subjects: Machine Learning (cs.LG)
Comments: Accepted to ICCIT2024

Abstract:One of the most urgent problems is the overcrowding in emergency departments (EDs), caused by an aging population and rising healthcare costs. Patient dispositions have become more complex as a result of the strain on hospital infrastructure and the scarcity of medical resources. Individuals with more dangerous health issues should be prioritized in the emergency room. Thus, our research aims to develop a prediction model for patient disposition using EF-Net. This model will incorporate categorical features into the neural network layer and add numerical features with the embedded categorical features. We combine the EF-Net and XGBoost models to attain higher accuracy in our results. The result is generated using the soft voting technique. In EF-Net, we attained an accuracy of 95.33%, whereas in the Ensemble Model, we achieved an accuracy of 96%. The experiment’s analysis shows that EF-Net surpasses existing works in accuracy, AUROC, and F1-Score on the MIMIC-IV-ED dataset, demonstrating its potential as a scalable solution for patient disposition assessment. Our code is available at this https URL
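The soft voting step mentioned in the abstract can be sketched as below. This is a minimal illustration of generic soft voting, not the authors' pipeline: the function name `soft_vote`, the equal default weighting, and the toy probabilities standing in for EF-Net and XGBoost outputs are all assumptions.

```python
def soft_vote(prob_a, prob_b, weight_a=0.5):
    """Average two models' class probabilities and pick the argmax per sample.

    prob_a, prob_b: lists of per-sample class-probability lists.
    weight_a: weight given to the first model (hypothetical parameter).
    """
    combined = []
    for row_a, row_b in zip(prob_a, prob_b):
        combined.append(
            [weight_a * pa + (1.0 - weight_a) * pb for pa, pb in zip(row_a, row_b)]
        )
    # Predicted label is the index of the largest averaged probability.
    labels = [max(range(len(row)), key=row.__getitem__) for row in combined]
    return combined, labels


if __name__ == "__main__":
    # Toy probabilities for two patients and two disposition classes.
    net_probs = [[0.6, 0.4], [0.3, 0.7]]   # stand-in for a neural network
    tree_probs = [[0.2, 0.8], [0.4, 0.6]]  # stand-in for a boosted-tree model
    print(soft_vote(net_probs, tree_probs))
```

Unlike hard (majority) voting, soft voting lets a confident model outweigh an uncertain one, as happens for the first toy sample above.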

[LG-2] Differentially Private Federated Learning of Diffusion Models for Synthetic Tabular Data Generation

Link: https://arxiv.org/abs/2412.16083
Authors: Timur Sattarov, Marco Schreyer, Damian Borth
Keywords: finance necessitates solutions, Denoising Diffusion Probabilistic, Diffusion Probabilistic Models, uphold privacy standards, rigorously uphold privacy
Subjects: Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Comments: 9 pages, 9 figures, preprint version, currently under review

Abstract:The increasing demand for privacy-preserving data analytics in finance necessitates solutions for synthetic data generation that rigorously uphold privacy standards. We introduce DP-Fed-FinDiff framework, a novel integration of Differential Privacy, Federated Learning and Denoising Diffusion Probabilistic Models designed to generate high-fidelity synthetic tabular data. This framework ensures compliance with stringent privacy regulations while maintaining data utility. We demonstrate the effectiveness of DP-Fed-FinDiff on multiple real-world financial datasets, achieving significant improvements in privacy guarantees without compromising data quality. Our empirical evaluations reveal the optimal trade-offs between privacy budgets, client configurations, and federated optimization strategies. The results affirm the potential of DP-Fed-FinDiff to enable secure data sharing and robust analytics in highly regulated domains, paving the way for further advances in federated learning and privacy-preserving data synthesis.

[LG-3] Black-Box Uniform Stability for Non-Euclidean Empirical Risk Minimization

Link: https://arxiv.org/abs/2412.15956
Authors: Simon Vary, David Martínez-Rubio, Patrick Rebeschini
Keywords: empirical risk minimization, study first-order algorithms, study first-order, Hölder smooth convex, ERM
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 33 pages, no figures

Abstract: We study first-order algorithms that are uniformly stable for empirical risk minimization (ERM) problems that are convex and smooth with respect to $p$-norms, $p \geq 1$. We propose a black-box reduction method that, by employing properties of uniformly convex regularizers, turns an optimization algorithm for Hölder smooth convex losses into a uniformly stable learning algorithm with optimal statistical risk bounds on the excess risk, up to a constant factor depending on $p$. Achieving a black-box reduction for uniform stability was posed as an open question by (Attia and Koren, 2022), which had solved the Euclidean case $p=2$. We explore applications that leverage non-Euclidean geometry in addressing binary classification problems.

[LG-4] RiTTA: Modeling Event Relations in Text-to-Audio Generation

Link: https://arxiv.org/abs/2412.15922
Authors: Yuhang He, Yash Jain, Xubo Liu, Andrew Markham, Vibhav Vineet
Keywords: fine-grained context understanding, event relation modeling, models achieving high-fidelity, achieving high-fidelity audio, audio event relation
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Audio Events Relation Modeling in TTA Generative Model. Code: this https URL

Abstract: Despite significant advancements in Text-to-Audio (TTA) generation models achieving high-fidelity audio with fine-grained context understanding, they struggle to model the relations between audio events described in the input text. However, previous TTA methods have not systematically explored audio event relation modeling, nor have they proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audios; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models’ ability to model audio event relations. Code is available at: this https URL

[LG-5] Data Preparation for Fairness-Performance Trade-Offs: A Practitioner-Friendly Alternative?

Link: https://arxiv.org/abs/2412.15920
Authors: Gianmario Voria, Rebecca Di Matteo, Giammaria Giordano, Gemma Catolino, Fabio Palomba
Keywords: machine learning, systems are increasingly, adopted across industries, increasingly adopted, Data Preparation
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: Accepted as Registered Report at SANER’25

Abstract:As machine learning (ML) systems are increasingly adopted across industries, addressing fairness and bias has become essential. While many solutions focus on ethical challenges in ML, recent studies highlight that data itself is a major source of bias. Pre-processing techniques, which mitigate bias before training, are effective but may impact model performance and pose integration difficulties. In contrast, fairness-aware Data Preparation practices are both familiar to practitioners and easier to implement, providing a more accessible approach to reducing bias. Objective. This registered report proposes an empirical evaluation of how optimally selected fairness-aware practices, applied in early ML lifecycle stages, can enhance both fairness and performance, potentially outperforming standard pre-processing bias mitigation methods. Method. To this end, we will introduce FATE, an optimization technique for selecting ‘Data Preparation’ pipelines that optimize fairness and performance. Using FATE, we will analyze the fairness-performance trade-off, comparing pipelines selected by FATE with results by pre-processing bias mitigation techniques.

[LG-6] Self-supervised Spatial-Temporal Learner for Precipitation Nowcasting

Link: https://arxiv.org/abs/2412.15917
Authors: Haotian Li, Arno Siebes, Siamak Mehrkanoon
Keywords: weather-dependent decisions, essential for making, making timely, timely and weather-dependent, Nowcasting
Subjects: Machine Learning (cs.LG)
Comments: 7 pages, 2 figures

Abstract:Nowcasting, the short-term prediction of weather, is essential for making timely and weather-dependent decisions. Specifically, precipitation nowcasting aims to predict precipitation at a local level within a 6-hour time frame. This task can be framed as a spatial-temporal sequence forecasting problem, where deep learning methods have been particularly effective. However, despite advancements in self-supervised learning, most successful methods for nowcasting remain fully supervised. Self-supervised learning is advantageous for pretraining models to learn representations without requiring extensive labeled data. In this work, we leverage the benefits of self-supervised learning and integrate it with spatial-temporal learning to develop a novel model, SpaT-SparK. SpaT-SparK comprises a CNN-based encoder-decoder structure pretrained with a masked image modeling (MIM) task and a translation network that captures temporal relationships among past and future precipitation maps in downstream tasks. We conducted experiments on the NL-50 dataset to evaluate the performance of SpaT-SparK. The results demonstrate that SpaT-SparK outperforms existing baseline supervised models, such as SmaAt-UNet, providing more accurate nowcasting predictions.

[LG-7] Statistical Modeling of Univariate Multimodal Data

Link: https://arxiv.org/abs/2412.15894
Authors: Paraskevi Chasani, Aristidis Likas
Keywords: key property indicating, property indicating grouping, indicating grouping behavior, Unimodality constitutes, constitutes a key
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 30 pages, 9 figures

Abstract:Unimodality constitutes a key property indicating grouping behavior of the data around a single mode of its density. We propose a method that partitions univariate data into unimodal subsets through recursive splitting around valley points of the data density. For valley point detection, we introduce properties of critical points on the convex hull of the empirical cumulative density function (ecdf) plot that provide indications on the existence of density valleys. Next, we apply a unimodal data modeling approach that provides a statistical model for each obtained unimodal subset in the form of a Uniform Mixture Model (UMM). Consequently, a hierarchical statistical model of the initial dataset is obtained in the form of a mixture of UMMs, named as the Unimodal Mixture Model (UDMM). The proposed method is non-parametric, hyperparameter-free, automatically estimates the number of unimodal subsets and provides accurate statistical models as indicated by experimental results on clustering and density estimation tasks.

[LG-8] IMPLY-based Approximate Full Adders for Efficient Arithmetic Operations in Image Processing and Machine Learning

Link: https://arxiv.org/abs/2412.15888
Authors: Melanie Qiu,Caoyueshan Fan,Gulafshan,Salar Shakibhamedan,Fabian Seiler,Nima TaheriNejad
Keywords: gaining increasing importance, emerging computing paradigms, power wall, increasing importance, performance limitations
Categories: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Abstract:To overcome the performance limitations in modern computing, such as the power wall, emerging computing paradigms are gaining increasing importance. Approximate computing offers a promising solution by substantially enhancing energy efficiency and reducing latency, albeit with a trade-off in accuracy. Another emerging method is memristor-based In-Memory Computing (IMC) which has the potential to overcome the Von Neumann bottleneck. In this work, we combine these two approaches and propose two Serial APProximate IMPLY-based full adders (SAPPI). When embedded in a Ripple Carry Adder (RCA), our designs reduce the number of steps by 39%-41% and the energy consumption by 39%-42% compared to the exact algorithm. We evaluated our approach at the circuit level and compared it with State-of-the-Art (SoA) approximations where our adders improved the speed by up to 10% and the energy efficiency by up to 13%. We applied our designs in three common image processing applications where we achieved acceptable image quality with up to half of the RCA approximated. We performed a case study to demonstrate the applicability of our approximations in Machine Learning (ML) underscoring the potential gains in more complex scenarios. The proposed approach demonstrates energy savings of up to 296 mJ (21%) and a reduction of 1.3 billion (20%) computational steps when applied to Convolutional Neural Networks (CNNs) trained on the MNIST dataset while maintaining accuracy.
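
For readers unfamiliar with IMPLY logic, the following is a minimal sketch of how an exact full adder and a ripple carry adder can be composed from material implication alone. It uses a textbook NAND-based construction and is not the SAPPI design from the paper; it models neither the approximation nor memristor step counts or energy. All function names are hypothetical.

```python
def imply(p: int, q: int) -> int:
    """Material implication on single bits: (NOT p) OR q."""
    return (1 - p) | q


def nand(p: int, q: int) -> int:
    """NAND built from IMPLY and a constant 0: p -> (q -> 0) == (NOT p) OR (NOT q)."""
    return imply(p, imply(q, 0))


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Exact 1-bit full adder using the classic nine-NAND construction."""
    n1 = nand(a, b)
    x = nand(nand(a, n1), nand(b, n1))      # x = a XOR b
    n4 = nand(x, cin)
    s = nand(nand(x, n4), nand(cin, n4))    # s = x XOR cin
    cout = nand(n1, n4)                     # cout = (a AND b) OR (x AND cin)
    return s, cout


def ripple_carry_add(a: int, b: int, width: int = 4) -> int:
    """Add two unsigned integers bit by bit, propagating the carry serially."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result | (carry << width)
```

An approximate adder in this setting would replace `full_adder` for some of the low-order bit positions with a cheaper function that is allowed to return wrong bits for a few input combinations.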

[LG-9] Bayesian Optimization for Unknown Cost-Varying Variable Subsets with No-Regret Costs

Link: https://arxiv.org/abs/2412.15863
Authors: Vu Viet Hoang,Quoc Anh Hoang Nguyen,Hung Tran The
Keywords: Bayesian Optimization, black-box functions, method for optimizing, widely-used method, BOCVS
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Bayesian Optimization (BO) is a widely-used method for optimizing expensive-to-evaluate black-box functions. Traditional BO assumes that the learner has full control over all query variables without additional constraints. However, in many real-world scenarios, controlling certain query variables may incur costs. Therefore, the learner needs to balance the selection of informative subsets for targeted learning against leaving some variables to be randomly sampled to minimize costs. This problem is known as Bayesian Optimization with cost-varying variable subsets (BOCVS). While the goal of BOCVS is to identify the optimal solution with minimal cost, previous works have only guaranteed finding the optimal solution without considering the total costs incurred. Moreover, these works assume precise knowledge of the cost for each subset, which is often unrealistic. In this paper, we propose a novel algorithm for the extension of the BOCVS problem with random and unknown costs that separates the process into exploration and exploitation phases. The exploration phase will filter out low-quality variable subsets, while the exploitation phase will leverage high-quality ones. Furthermore, we theoretically demonstrate that our algorithm achieves a sub-linear rate in both quality regret and cost regret, addressing the objective of the BOCVS problem more effectively than previous analyses. Finally, we show that our proposed algorithm outperforms comparable baselines across a wide range of benchmarks.

[LG-10] MarkovType: A Markov Decision Process Strategy for Non-Invasive Brain-Computer Interfaces Typing Systems

Link: https://arxiv.org/abs/2412.15862
Authors: Elifnur Sunger,Yunus Bicer,Deniz Erdogmus,Tales Imbiriba
Keywords: motor disabilities communicate, RSVP typing task, Serial Visual Presentation, Rapid Serial Visual, Brain-Computer Interfaces
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 4 figures

Abstract:Brain-Computer Interfaces (BCIs) help people with severe speech and motor disabilities communicate and interact with their environment using neural activity. This work focuses on the Rapid Serial Visual Presentation (RSVP) paradigm of BCIs using noninvasive electroencephalography (EEG). The RSVP typing task is a recursive task with multiple sequences, where users see only a subset of symbols in each sequence. Extensive research has been conducted to improve classification in the RSVP typing task, achieving fast classification. However, these methods struggle to achieve high accuracy and do not consider the typing mechanism in the learning procedure. They apply binary target and non-target classification without including recursive training. To improve performance in the classification of symbols while controlling the classification speed, we incorporate the typing setup into training by proposing a Partially Observable Markov Decision Process (POMDP) approach. To the best of our knowledge, this is the first work to formulate the RSVP typing task as a POMDP for recursive classification. Experiments show that the proposed approach, MarkovType, results in a more accurate typing system compared to competitors. Additionally, our experiments demonstrate that while there is a trade-off between accuracy and speed, MarkovType achieves the optimal balance between these factors compared to other methods.

[LG-11] Improving Quantization-aware Training of Low-Precision Network via Block Replacement on Full-Precision Counterpart

Link: https://arxiv.org/abs/2412.15846
Authors: Chengting Yu,Shu Yang,Fengzhao Zhang,Hanzhi Ma,Aili Wang,Er-Ping Li
Keywords: Quantization-aware training, training phase incorporates, task goals, incorporates the simulation, computation to optimize
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Quantization-aware training (QAT) is a common paradigm for network quantization, in which the training phase incorporates the simulation of the low-precision computation to optimize the quantization parameters in alignment with the task goals. However, direct training of low-precision networks generally faces two obstacles: 1. The low-precision model exhibits limited representation capabilities and cannot directly replicate full-precision calculations, which constitutes a deficiency compared to full-precision alternatives; 2. Non-ideal deviations during gradient propagation are a common consequence of employing pseudo-gradients as approximations in derived quantized functions. In this paper, we propose a general QAT framework for alleviating the aforementioned concerns by permitting the forward and backward processes of the low-precision network to be guided by the full-precision partner during training. In conjunction with the direct training of the quantization model, intermediate mixed-precision models are generated through the block-by-block replacement on the full-precision model and working simultaneously with the low-precision backbone, which enables the integration of quantized low-precision blocks into full-precision networks throughout the training phase. Consequently, each quantized block is capable of: 1. simulating full-precision representation during forward passes; 2. obtaining gradients with improved estimation during backward passes. We demonstrate that the proposed method achieves state-of-the-art results for 4-, 3-, and 2-bit quantization on ImageNet and CIFAR-10. The proposed framework provides a compatible extension for most QAT methods and only requires a concise wrapper for existing codes.
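
As a rough illustration of what "simulating low-precision computation" means in QAT, the sketch below applies uniform fake quantization to a scalar: the value is clipped, snapped to one of 2^bits levels, and returned as a float. The clipping range, the function name, and the rounding rule are assumptions made for illustration; the block-replacement training scheme proposed in the paper is not reproduced here.

```python
def fake_quantize(x: float, bits: int, lo: float = -1.0, hi: float = 1.0) -> float:
    """Clip x to [lo, hi] and round it to the nearest of 2**bits uniform levels.

    The result stays a float, which is how quantization is usually
    simulated during training (hypothetical helper, illustration only).
    """
    levels = 2 ** bits - 1            # number of steps between lo and hi
    step = (hi - lo) / levels         # distance between adjacent levels
    clipped = min(max(x, lo), hi)
    index = round((clipped - lo) / step)
    return lo + index * step


# With 2 bits only four distinct values can be represented in [-1, 1].
two_bit_values = sorted({fake_quantize(i / 100, 2) for i in range(-100, 101)})
```

Because rounding has zero gradient almost everywhere, training through such a function typically relies on a surrogate gradient, which is the source of the gradient deviations the abstract mentions.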

[LG-12] Measuring Cross-Modal Interactions in Multimodal Models

Link: https://arxiv.org/abs/2412.15828
Authors: Laura Wenderoth,Konstantin Hemker,Nikola Simidjievski,Mateja Jamnik
Keywords: greatly improve patient, improve patient care, greatly improve, improve patient, patient care
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Integrating AI in healthcare can greatly improve patient care and system efficiency. However, the lack of explainability in AI systems (XAI) hinders their clinical adoption, especially in multimodal settings that use increasingly complex model architectures. Most existing XAI methods focus on unimodal models, which fail to capture cross-modal interactions crucial for understanding the combined impact of multiple data sources. Existing methods for quantifying cross-modal interactions are limited to two modalities, rely on labelled data, and depend on model performance. This is problematic in healthcare, where XAI must handle multiple data sources and provide individualised explanations. This paper introduces InterSHAP, a cross-modal interaction score that addresses the limitations of existing approaches. InterSHAP uses the Shapley interaction index to precisely separate and quantify the contributions of the individual modalities and their interactions without approximations. By integrating an open-source implementation with the SHAP package, we enhance reproducibility and ease of use. We show that InterSHAP accurately measures the presence of cross-modal interactions, can handle multiple modalities, and provides detailed explanations at a local level for individual samples. Furthermore, we apply InterSHAP to multimodal medical datasets and demonstrate its applicability for individualised explanations.
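
The Shapley interaction index that the abstract refers to can be computed exactly when the number of modalities is small. The sketch below is a generic implementation of the pairwise index for a set-valued function `v`; how InterSHAP masks modalities, defines `v` from model outputs, or normalises the score is not stated in the abstract, so none of that is modelled here, and the function name is hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet


def shapley_interaction(v: Callable[[FrozenSet[int]], float], n: int, i: int, j: int) -> float:
    """Exact pairwise Shapley interaction index between players i and j.

    v maps a coalition (frozenset of player indices in range(n)) to a value,
    for example a model output when only those modalities are present.
    """
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for r in range(len(others) + 1):
        # Weight of every coalition S of size r that excludes i and j.
        weight = factorial(r) * factorial(n - r - 2) / factorial(n - 1)
        for subset in combinations(others, r):
            s = frozenset(subset)
            # Discrete second difference: joint effect minus individual effects.
            delta = v(s | {i, j}) - v(s | {i}) - v(s | {j}) + v(s)
            total += weight * delta
    return total
```

For two modalities this reduces to `v({0, 1}) - v({0}) - v({1}) + v(set())`, so a purely additive model yields an interaction of zero.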

[LG-13] Function Space Diversity for Uncertainty Prediction via Repulsive Last-Layer Ensembles

Link: https://arxiv.org/abs/2412.15758
Authors: Sophie Steger,Christian Knoll,Bernhard Klein,Holger Fröning,Franz Pernkopf
Keywords: gained attention due, Bayesian inference, function space, gained attention, attention due
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Bayesian inference in function space has gained attention due to its robustness against overparameterization in neural networks. However, approximating the infinite-dimensional function space introduces several challenges. In this work, we discuss function space inference via particle optimization and present practical modifications that improve uncertainty estimation and, most importantly, make it applicable for large and pretrained networks. First, we demonstrate that the input samples, where particle predictions are enforced to be diverse, are detrimental to the model performance. While diversity on training data itself can lead to underfitting, the use of label-destroying data augmentation, or unlabeled out-of-distribution data can improve prediction diversity and uncertainty estimates. Furthermore, we take advantage of the function space formulation, which imposes no restrictions on network parameterization other than sufficient flexibility. Instead of using full deep ensembles to represent particles, we propose a single multi-headed network that introduces a minimal increase in parameters and computation. This allows seamless integration to pretrained networks, where this repulsive last-layer ensemble can be used for uncertainty aware fine-tuning at minimal additional cost. We achieve competitive results in disentangling aleatoric and epistemic uncertainty for active learning, detecting out-of-domain data, and providing calibrated uncertainty estimates under distribution shifts with minimal compute and memory.

[LG-14] Probabilistic Latent Variable Modeling for Dynamic Friction Identification and Estimation

Link: https://arxiv.org/abs/2412.15756
Authors: Victor Vantilborgh,Sander De Witte,Frederik Ostyn,Tom Lefebvre,Guillaume Crevecoeur
Keywords: support control design, Precise identification, output torque estimation, control design, essential to support
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Precise identification of dynamic models in robotics is essential to support control design, friction compensation, output torque estimation, etc. A longstanding challenge remains in the identification of friction models for robotic joints, given the numerous physical phenomena affecting the underlying friction dynamics, which result in nonlinear characteristics and hysteresis behaviour in particular. These phenomena prove difficult to model and capture accurately using physical analogies alone. This has motivated researchers to shift from physics-based to data-driven models. Currently, these methods are still limited in their ability to generalize effectively to typical industrial robot deployment, characterized by high- and low-velocity operations and frequent direction reversals. Empirical observations motivate the use of dynamic friction models, but these remain particularly challenging to establish. To address the current limitations, we propose to account for unidentified dynamics in the robot joints using latent dynamic states. The friction model may then utilize both the dynamic robot state and additional information encoded in the latent state to evaluate the friction torque. We cast this stochastic and partially unsupervised identification problem as a standard probabilistic representation learning problem. In this work both the friction model and latent state dynamics are parametrized as neural networks and integrated in the conventional lumped parameter dynamic robot model. The complete dynamics model is directly learned from the noisy encoder measurements in the robot joints. We use the Expectation-Maximisation (EM) algorithm to find a Maximum Likelihood Estimate (MLE) of the model parameters. The effectiveness of the proposed method is validated in terms of open-loop prediction accuracy in comparison with baseline methods, using the Kuka KR6 R700 as a test platform.

[LG-15] Extracting Interpretable Task-Specific Circuits from Large Language Models for Faster Inference AAAI25

Link: https://arxiv.org/abs/2412.15750
Authors: Jorge García-Carrasco,Alejandro Maté,Juan Trujillo
Keywords: Large Language Models, Large Language, shown impressive performance, Language Models, impressive performance
Categories: Machine Learning (cs.LG)
Comments: Accepted to AAAI 25 Main Technical Track

Abstract:Large Language Models (LLMs) have shown impressive performance across a wide range of tasks. However, the size of LLMs is steadily increasing, hindering their application on computationally constrained environments. On the other hand, despite their general capabilities, there are many situations where only one specific task is performed, rendering all other capabilities unnecessary and wasteful. This leads us to the following question: Is it possible to extract the minimal subset from an LLM that is able to perform a specific task in a faster, standalone manner? Recent works on Mechanistic Interpretability (MI) have shown that specific tasks are performed by a localized subset of components, or circuit. However, current techniques used to identify the circuit cannot be used to extract it for its standalone usage. In this work, we propose a novel approach to automatically extract the subset of the LLM that properly performs a targeted task requiring no additional training and a small amount of data samples. We evaluate our approach on different tasks and show that the resulting models are (i) considerably smaller, reducing the number of parameters up to 82.77% and (ii) more interpretable, as they focus on the circuit that is used to carry out the specific task, and can therefore be understood using MI techniques.

[LG-16] Prompt-based Unifying Inference Attack on Graph Neural Networks AAAI-25 AAAI

Link: https://arxiv.org/abs/2412.15735
Authors: Yuecen Wei,Xingcheng Fu,Lingyun Liu,Qingyun Sun,Hao Peng,Chunming Hu
Keywords: social behavior analysis, financial risk analysis, risk analysis based, provide important prospective, important prospective insights
Categories: Machine Learning (cs.LG)
Comments: Accepted by the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

Abstract:Graph neural networks (GNNs) provide important prospective insights in applications such as social behavior analysis and financial risk analysis based on their powerful learning capabilities on graph data. Nevertheless, GNNs’ predictive performance relies on the quality of task-specific node labels, so it is common practice to improve the model’s generalization ability in the downstream execution of decision-making tasks through pre-training. Graph prompting is a prudent choice but risky without taking measures to prevent data leakage. In other words, in high-risk decision scenarios, prompt learning can infer private information by accessing model parameters trained on private data (publishing model parameters in pre-training, i.e., without directly leaking the raw data, is a tacitly accepted trend). However, myriad graph inference attacks necessitate tailored module design and processing to enhance inference capabilities due to variations in supervision signals. In this paper, we propose a novel Prompt-based unifying Inference Attack framework on GNNs, named ProIA. Specifically, ProIA retains the crucial topological information of the graph during pre-training, enhancing the background knowledge of the inference attack model. It then utilizes a unified prompt and introduces additional disentanglement factors in downstream attacks to adapt to task-relevant knowledge. Finally, extensive experiments show that ProIA enhances attack capabilities and demonstrates remarkable adaptability to various inference attacks.

[LG-17] Concept Boundary Vectors

Link: https://arxiv.org/abs/2412.15698
Authors: Thomas Walker
Keywords: Machine learning models, Machine learning, simple objectives, token prediction, Machine
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 21 figures

Abstract:Machine learning models are trained with relatively simple objectives, such as next token prediction. However, on deployment, they appear to capture a more fundamental representation of their input data. It is of interest to understand the nature of these representations to help interpret the model’s outputs and to identify ways to improve the salience of these representations. Concept vectors are constructions aimed at attributing concepts in the input data to directions, represented by vectors, in the model’s latent space. In this work, we introduce concept boundary vectors as a concept vector construction derived from the boundary between the latent representations of concepts. Empirically we demonstrate that concept boundary vectors capture a concept’s semantic meaning, and we compare their effectiveness against concept activation vectors.

[LG-18] Hypergraph clustering using Ricci curvature: an edge transport perspective

Link: https://arxiv.org/abs/2412.15695
Authors: Olympio Hacquard
Keywords: defining probability measures, extending Ricci flow, Ricci flow, defining probability, probability measures
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments:

Abstract:In this paper, we introduce a novel method for extending Ricci flow to hypergraphs by defining probability measures on the edges and transporting them on the line expansion. This approach yields a new weighting on the edges, which proves particularly effective for community detection. We extensively compare this method with a similar notion of Ricci flow defined on the clique expansion, demonstrating its enhanced sensitivity to the hypergraph structure, especially in the presence of large hyperedges. The two methods are complementary and together form a powerful and highly interpretable framework for community detection in hypergraphs.

[LG-19] Theory of Mixture-of-Experts for Mobile Edge Computing

Link: https://arxiv.org/abs/2412.15690
Authors: Hongbo Li,Lingjie Duan
Keywords: mobile users generate, mobile edge computing, users generate diverse, generate diverse machine, diverse machine learning
Categories: Machine Learning (cs.LG)
Comments: This is the technical report for our paper accepted by INFOCOM 2025

Abstract:In mobile edge computing (MEC) networks, mobile users generate diverse machine learning tasks dynamically over time. These tasks are typically offloaded to the nearest available edge server, by considering communication and computational efficiency. However, its operation does not ensure that each server specializes in a specific type of tasks and leads to severe overfitting or catastrophic forgetting of previous tasks. To improve the continual learning (CL) performance of online tasks, we are the first to introduce mixture-of-experts (MoE) theory in MEC networks and save MEC operation from the increasing generalization error over time. Our MoE theory treats each MEC server as an expert and dynamically adapts to changes in server availability by considering data transfer and computation time. Unlike existing MoE models designed for offline tasks, ours is tailored for handling continuous streams of tasks in the MEC environment. We introduce an adaptive gating network in MEC to adaptively identify and route newly arrived tasks of unknown data distributions to available experts, enabling each expert to specialize in a specific type of tasks upon convergence. We derived the minimum number of experts required to match each task with a specialized, available expert. Our MoE approach consistently reduces the overall generalization error over time, unlike the traditional MEC approach. Interestingly, when the number of experts is sufficient to ensure convergence, adding more experts delays the convergence time and worsens the generalization error. Finally, we perform extensive experiments on real datasets in deep neural networks (DNNs) to verify our theoretical results.

[LG-20] A survey on FPGA-based accelerator for ML models

Link: https://arxiv.org/abs/2412.15666
Authors: Feng Yan,Andreas Koch,Oliver Sinnen
Keywords: Field-Programmable Gate Arrays, Gate Arrays, surveys machine learning, Field-Programmable Gate, machine learning
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 16 pages, 4 figures (Working paper)

Abstract:This paper thoroughly surveys the acceleration of machine learning (ML) algorithms on hardware accelerators, focusing on Field-Programmable Gate Arrays (FPGAs). It reviews 287 out of 1138 papers from the past six years, sourced from four top FPGA conferences. Such selection underscores the increasing integration of ML and FPGA technologies and their mutual importance in technological advancement. Research clearly emphasises inference acceleration (81%) compared to training acceleration (13%). Additionally, the findings reveal that CNNs dominate current FPGA acceleration research while emerging models like GNNs show obvious growth trends. The categorization of the FPGA research papers reveals a wide range of topics, demonstrating the growing relevance of ML in FPGA research. This comprehensive analysis provides valuable insights into the current trends and future directions of FPGA research in the context of ML applications.

[LG-21] Synthetic Tabular Data Generation for Imbalanced Classification: The Surprising Effectiveness of an Overlap Class AAAI

Link: https://arxiv.org/abs/2412.15657
Authors: Annie D’souza,Swetha M,Sunita Sarawagi
Keywords: generative models, deep generative models, long-standing interest, class, deep generative
Categories: Machine Learning (cs.LG)
Comments: AAAI Conference 2025

Abstract:Handling imbalance in class distribution when building a classifier over tabular data has been a problem of long-standing interest. One popular approach is augmenting the training dataset with synthetically generated data. While classical augmentation techniques were limited to linear interpolation of existing minority class examples, higher-capacity deep generative models have recently shown greater promise. However, handling imbalance in class distribution when building a deep generative model is also a challenging problem that has not been studied as extensively as imbalanced classifier model training. We show that state-of-the-art deep generative models yield significantly lower-quality minority examples than majority examples. We propose a novel technique of converting the binary class labels to ternary class labels by introducing a class for the region where minority and majority distributions overlap. We show that just this pre-processing of the training set significantly improves the quality of the generated data across several state-of-the-art diffusion and GAN-based models. While training the classifier using synthetic data, we remove the overlap class from the training data and justify the reasons behind the enhanced accuracy. We perform extensive experiments on four real-life datasets, five different classifiers, and five generative models, demonstrating that our method enhances not only the synthesizer performance of state-of-the-art models but also the classifier performance.

[LG-22] Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution AAAI2025

Link: https://arxiv.org/abs/2412.15650
Authors: Wentao Tan,Qiong Cao,Yibing Zhan,Chao Xue,Changxing Ding
Keywords: Multimodal Large Language, Large Language Models, Large Language, enhance Multimodal Large, greatly enhance Multimodal
Categories: Machine Learning (cs.LG)
Comments: AAAI 2025. The code is available at this https URL

Abstract:Human preference alignment can greatly enhance Multimodal Large Language Models (MLLMs), but collecting high-quality preference data is costly. A promising solution is the self-evolution strategy, where models are iteratively trained on data they generate. However, current techniques still rely on human- or GPT-annotated data and sometimes require additional models or ground truth answers. To address these issues, we propose a novel multimodal self-evolution framework that enables the model to autonomously generate high-quality questions and answers using only unannotated images. First, we implement an image-driven self-questioning mechanism, allowing the model to create and evaluate questions based on image content, regenerating them if they are irrelevant or unanswerable. This sets a strong foundation for answer generation. Second, we introduce an answer self-enhancement technique, starting with image captioning to improve answer quality. We also use corrupted images to generate rejected answers, forming distinct preference pairs for optimization. Finally, we incorporate an image content alignment loss function alongside Direct Preference Optimization (DPO) loss to reduce hallucinations, ensuring the model focuses on image content. Experiments show that our framework performs competitively with methods using external information, offering a more efficient and scalable approach to MLLMs.

[LG-23] Music Genre Classification: Ensemble Learning with Subcomponents-level Attention

Link: https://arxiv.org/abs/2412.15602
Authors: Yichen Liu,Abhijit Dasgupta,Qiwei He
Keywords: Music Information Retrieval, Information Retrieval, digital signal processing, Music Genre Classification, Music Information
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Music Genre Classification is one of the most popular topics in the fields of Music Information Retrieval (MIR) and digital signal processing. Deep Learning has emerged as the top performer for classifying music genres among various methods. This letter introduces a novel approach that combines ensemble learning with attention to subcomponents, aiming to enhance the accuracy of identifying music genres. The core innovation of our work is the proposal to classify the subcomponents of the music pieces separately, allowing our model to capture distinct characteristics from those subcomponents. By applying ensemble learning techniques to these individual classifications, we make the final classification decision on the genre of the music. The proposed method achieves higher accuracy than other state-of-the-art techniques trained and tested on the GTZAN dataset.

[LG-24] Dexterous Manipulation Based on Prior Dexterous Grasp Pose Knowledge

Link: https://arxiv.org/abs/2412.15587
Authors: Hengxu Yan,Haoshu Fang,Cewu Lu
Keywords: received considerable attention, dexterous grasp pose, recent research, received considerable, considerable attention
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Dexterous manipulation has received considerable attention in recent research. Predominantly, existing studies have concentrated on reinforcement learning methods to address the substantial degrees of freedom in hand movements. Nonetheless, these methods typically suffer from low efficiency and accuracy. In this work, we introduce a novel reinforcement learning approach that leverages prior dexterous grasp pose knowledge to enhance both efficiency and accuracy. Unlike previous work, which always makes the robotic hand adopt a fixed dexterous grasp pose, we decouple the manipulation process into two distinct phases: initially, we generate a dexterous grasp pose targeting the functional part of the object; after that, we employ reinforcement learning to comprehensively explore the environment. Our findings suggest that the majority of learning time is expended in identifying the appropriate initial position and selecting the optimal manipulation viewpoint. Experimental results demonstrate significant improvements in learning efficiency and success rates across four distinct tasks.

[LG-25] A Deep Probabilistic Framework for Continuous Time Dynamic Graph Generation AAAI-25

Link: https://arxiv.org/abs/2412.15582
Authors: Ryien Hosseini,Filippo Simini,Venkatram Vishwanath,Henry Hoffmann
Keywords: exhibit evolving topologies, Recent advancements, graph representation learning, representation learning, learning have shifted
Categories: Machine Learning (cs.LG)
Comments: To appear at AAAI-25

Abstract:Recent advancements in graph representation learning have shifted attention towards dynamic graphs, which exhibit evolving topologies and features over time. The increased use of such graphs creates a paramount need for generative models suitable for applications such as data augmentation, obfuscation, and anomaly detection. However, there are few generative techniques that handle continuously changing temporal graph data; existing work largely relies on augmenting static graphs with additional temporal information to model dynamic interactions between nodes. In this work, we propose a fundamentally different approach: We instead directly model interactions as a joint probability of an edge forming between two nodes at a given time. This allows us to autoregressively generate new synthetic dynamic graphs in a largely assumption free, scalable, and inductive manner. We formalize this approach as DG-Gen, a generative framework for continuous time dynamic graphs, and demonstrate its effectiveness over five datasets. Our experiments demonstrate that DG-Gen not only generates higher fidelity graphs compared to traditional methods but also significantly advances link prediction tasks.

[LG-26] Multi Agent Reinforcement Learning for Sequential Satellite Assignment Problems

Link: https://arxiv.org/abs/2412.15573
Authors: Joshua Holder,Natasha Jaques,Mehran Mesbahi
Keywords: classic combinatorial optimization, combinatorial optimization problem, satisfying assignment constraints, classic combinatorial, combinatorial optimization
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:

Abstract:Assignment problems are a classic combinatorial optimization problem in which a group of agents must be assigned to a group of tasks such that maximum utility is achieved while satisfying assignment constraints. Given the utility of each agent completing each task, polynomial-time algorithms exist to solve a single assignment problem in its simplest form. However, in many modern-day applications such as satellite constellations, power grids, and mobile robot scheduling, assignment problems unfold over time, with the utility for a given assignment depending heavily on the state of the system. We apply multi-agent reinforcement learning to this problem, learning the value of assignments by bootstrapping from a known polynomial-time greedy solver and then learning from further experience. We then choose assignments using a distributed optimal assignment mechanism rather than by selecting them directly. We demonstrate that this algorithm is theoretically justified and avoids pitfalls experienced by other RL algorithms in this setting. Finally, we show that our algorithm significantly outperforms other methods in the literature, even while scaling to realistic scenarios with hundreds of agents and tasks.
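To make the single-step assignment problem concrete, here is a minimal sketch that compares a naive greedy pairing with the exhaustive optimum on a tiny utility matrix. The function names and the `UTILITY` matrix are illustrative assumptions, not the paper's code, and the pairwise greedy rule shown here is not necessarily the greedy solver the authors bootstrap from.

```python
from itertools import permutations


def total_utility(utility, assignment):
    """Sum utility[agent][task] over an assignment given as a list: agent -> task."""
    return sum(utility[agent][task] for agent, task in enumerate(assignment))


def greedy_assignment(utility):
    """Repeatedly take the highest-utility pair whose agent and task are both still free."""
    n = len(utility)
    pairs = sorted(
        ((utility[a][t], a, t) for a in range(n) for t in range(n)),
        reverse=True,
    )
    assignment = [None] * n
    used_tasks = set()
    for _, agent, task in pairs:
        if assignment[agent] is None and task not in used_tasks:
            assignment[agent] = task
            used_tasks.add(task)
    return assignment


def optimal_assignment(utility):
    """Brute-force optimum; exponential in n, so only suitable for toy sizes."""
    n = len(utility)
    best = max(permutations(range(n)), key=lambda p: total_utility(utility, p))
    return list(best)


# Hypothetical utilities: rows are agents, columns are tasks.
UTILITY = [
    [9, 8, 1],
    [8, 1, 1],
    [1, 1, 1],
]

greedy = greedy_assignment(UTILITY)
optimal = optimal_assignment(UTILITY)
print(greedy, total_utility(UTILITY, greedy))
print(optimal, total_utility(UTILITY, optimal))
```

The brute-force search is used only to keep the sketch dependency-free. For realistic sizes a polynomial-time solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment` with `maximize=True`) would be the usual choice.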

[LG-27] Spatial Clustering of Citizen Science Data Improves Downstream Species Distribution Models

Link: https://arxiv.org/abs/2412.15559
Authors: Nahian Ahmed,Mark Roth,Tyler A. Hallman,W. Douglas Robinson,Rebecca A. Hutchinson
Keywords: present great opportunities, biodiversity data present, data present great, temporal scales, Citizen science biodiversity
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Citizen science biodiversity data present great opportunities for ecology and conservation across vast spatial and temporal scales. However, the opportunistic nature of these data lacks the sampling structure required by modeling methodologies that address a pervasive challenge in ecological data collection: imperfect detection, i.e., the likelihood of under-observing species on field surveys. Occupancy modeling is an example of an approach that accounts for imperfect detection by explicitly modeling the observation process separately from the biological process of habitat selection. This produces species distribution models that speak to the pattern of the species on a landscape after accounting for imperfect detection in the data, rather than the pattern of species observations corrupted by errors. To achieve this benefit, occupancy models require multiple surveys of a site across which the site’s status (i.e., occupied or not) is assumed constant. Since citizen science data are not collected under the required repeated-visit protocol, observations may be grouped into sites post hoc. Existing approaches for constructing sites discard some observations and/or consider only geographic distance and not environmental similarity. In this study, we compare ten approaches for site construction in terms of their impact on downstream species distribution models for 31 bird species in Oregon, using observations recorded in the eBird database. We find that occupancy models built on sites constructed by spatial clustering algorithms perform better than existing alternatives.

[LG-28] AutoRank: MCDA Based Rank Personalization for LoRA-Enabled Distributed Learning

Link: https://arxiv.org/abs/2412.15553
Authors: Shuaijun Chen,Omid Tavallaie,Niousha Nazemi,Xin Chen,Albert Y. Zomaya
Keywords: volumes expand rapidly, growing computational demands, data volumes expand, expand rapidly, volumes expand
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:As data volumes expand rapidly, distributed machine learning has become essential for addressing the growing computational demands of modern AI systems. However, training models in distributed environments is challenging when participants hold skewed, non-independent and identically distributed (Non-IID) data. Low-Rank Adaptation (LoRA) offers a promising solution to this problem: by personalizing low-rank updates rather than optimizing the entire model, LoRA-enabled distributed learning minimizes computational overhead and maximizes personalization for each participant, enabling more robust and efficient training in distributed learning settings, especially in large-scale, heterogeneous systems. Despite the strengths of current state-of-the-art methods, they often require manual configuration of the initial rank, which is increasingly impractical as the number of participants grows. This manual tuning is not only time-consuming but also prone to suboptimal configurations. To address this limitation, we propose AutoRank, an adaptive rank-setting algorithm inspired by the bias-variance trade-off. AutoRank leverages the MCDA method TOPSIS to dynamically assign local ranks based on the complexity of each participant’s data. By evaluating data distribution and complexity through our proposed data complexity metrics, AutoRank provides fine-grained adjustments to the rank of each participant’s local LoRA model. This adaptive approach effectively mitigates the challenges of double-imbalanced, non-IID data. Experimental results demonstrate that AutoRank significantly reduces computational overhead, enhances model performance, and accelerates convergence in highly heterogeneous federated learning environments. Through its strong adaptability, AutoRank offers a scalable and flexible solution for distributed machine learning.
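As a rough illustration of how a TOPSIS score could drive rank selection, the sketch below implements the standard TOPSIS closeness computation and then maps each score linearly onto a LoRA rank range. The two criteria (sample count and label entropy), the equal weights, and the `assign_ranks` mapping with its `r_min`/`r_max` bounds are all assumptions made for this example. The abstract does not specify AutoRank's actual metrics or mapping.

```python
import math


def topsis(matrix, weights, benefit):
    """Standard TOPSIS: return a closeness score in [0, 1] for each row (alternative).

    matrix  -- rows are alternatives, columns are criteria
    weights -- one weight per criterion
    benefit -- True if larger is better for that criterion, False if smaller is better
    """
    n_cols = len(matrix[0])
    # Vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n_cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_cols)] for row in matrix]
    columns = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti_ideal = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in weighted:
        d_plus = math.dist(row, ideal)        # distance to the ideal solution
        d_minus = math.dist(row, anti_ideal)  # distance to the anti-ideal solution
        denom = d_plus + d_minus
        scores.append(d_minus / denom if denom else 0.5)
    return scores


def assign_ranks(scores, r_min=2, r_max=16):
    """Hypothetical mapping: higher data-complexity score -> larger LoRA rank."""
    return [r_min + round(s * (r_max - r_min)) for s in scores]


# Hypothetical per-participant metrics: [number of samples, label entropy].
METRICS = [
    [100, 0.2],
    [500, 0.9],
    [1000, 1.5],
]

scores = topsis(METRICS, weights=[0.5, 0.5], benefit=[True, True])
ranks = assign_ranks(scores)
print(scores, ranks)
```

Treating both criteria as "benefit" criteria encodes the assumption that more, and more diverse, local data justifies a higher-capacity adapter. Other choices of criteria or direction are equally possible.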

[LG-29] The Impact of Cut Layer Selection in Split Federated Learning AAAI

Link: https://arxiv.org/abs/2412.15536
Authors: Justin Dachille,Chao Huang,Xin Liu
Keywords: Split Federated Learning, combines federated learning, Federated Learning, distributed machine learning, machine learning paradigm
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 16 pages, 1 figure, AAAI FLUID Workshop 2025

Abstract:Split Federated Learning (SFL) is a distributed machine learning paradigm that combines federated learning and split learning. In SFL, a neural network is partitioned at a cut layer, with the initial layers deployed on clients and remaining layers on a training server. There are two main variants of SFL: SFL-V1 where the training server maintains separate server-side models for each client, and SFL-V2 where the training server maintains a single shared model for all clients. While existing studies have focused on algorithm development for SFL, a comprehensive quantitative analysis of how the cut layer selection affects model performance remains unexplored. This paper addresses this gap by providing numerical and theoretical analysis of SFL performance and convergence relative to cut layer selection. We find that SFL-V1 is relatively invariant to the choice of cut layer, which is consistent with our theoretical results. Numerical experiments on four datasets and two neural networks show that the cut layer selection significantly affects the performance of SFL-V2. Moreover, SFL-V2 with an appropriate cut layer selection outperforms FedAvg on heterogeneous data.
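The partitioning described above can be sketched in a few lines. This toy example treats a "network" as a list of plain Python functions and only shows the forward pass; the helper names are assumptions for illustration, and training, aggregation, and the SFL-V1/V2 distinction are deliberately left out.

```python
def split_at_cut_layer(layers, cut):
    """Split a layer list into client-side layers [0, cut) and server-side layers [cut, end)."""
    if not 0 < cut < len(layers):
        raise ValueError("cut must leave at least one layer on each side")
    return layers[:cut], layers[cut:]


def forward(layers, x):
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x


# Toy stand-ins for network layers.
LAYERS = [
    lambda x: x * 2,
    lambda x: x + 3,
    lambda x: x ** 2,
    lambda x: x - 1,
]

client_layers, server_layers = split_at_cut_layer(LAYERS, cut=2)
activations = forward(client_layers, 5)        # computed on the client, sent to the server
output = forward(server_layers, activations)   # computed on the training server
print(activations, output)
```

Moving `cut` changes how much computation stays on the client and what is transmitted, while the composed forward pass stays the same.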

[LG-30] SORREL: Suboptimal-Demonstration-Guided Reinforcement Learning for Learning to Branch AAAI2025

Link: https://arxiv.org/abs/2412.15534
Authors: Shengyu Feng,Yiming Yang
Keywords: Integer Linear Program, Mixed Integer Linear, Mixed Integer, Linear Program, Integer Linear
Subjects: Machine Learning (cs.LG)
Comments: AAAI 2025

Abstract:Mixed Integer Linear Program (MILP) solvers are mostly built upon a Branch-and-Bound (B&B) algorithm, where the efficiency of traditional solvers heavily depends on hand-crafted heuristics for branching. The past few years have witnessed the increasing popularity of data-driven approaches to automatically learn these heuristics. However, the success of these methods is highly dependent on the availability of high-quality demonstrations, which requires either the development of near-optimal heuristics or a time-consuming sampling process. This paper averts this challenge by proposing Suboptimal-Demonstration-Guided Reinforcement Learning (SORREL) for learning to branch. SORREL selectively learns from suboptimal demonstrations based on value estimation. It utilizes suboptimal demonstrations through both offline reinforcement learning on the demonstrations generated by suboptimal heuristics and self-imitation learning on past good experiences sampled by itself. Our experiments demonstrate its advanced performance in both branching quality and training efficiency over previous methods for various MILPs.

[LG-31] PreNeT: Leveraging Computational Features to Predict Deep Neural Network Training Time

Link: https://arxiv.org/abs/2412.15519
Authors: Alireza Pourali,Arian Boukani,Hamzeh Khazaei
Keywords: Large Language Models, deep learning models, Large Language, Language Models, learning models
Subjects: Machine Learning (cs.LG)
Comments: 11 pages, Conference

Abstract:Training deep learning models, particularly Transformer-based architectures such as Large Language Models (LLMs), demands substantial computational resources and extended training periods. While optimal configuration and infrastructure selection can significantly reduce associated costs, this optimization requires preliminary analysis tools. This paper introduces PreNeT, a novel predictive framework designed to address this optimization challenge. PreNeT facilitates training optimization by integrating comprehensive computational metrics, including layer-specific parameters, arithmetic operations and memory utilization. A key feature of PreNeT is its capacity to accurately predict training duration on previously unexamined hardware infrastructures, including novel accelerator architectures. This framework employs a sophisticated approach to capture and analyze the distinct characteristics of various neural network layers, thereby enhancing existing prediction methodologies. Through proactive implementation of PreNeT, researchers and practitioners can determine optimal configurations, parameter settings, and hardware specifications to maximize cost-efficiency and minimize training duration. Experimental results demonstrate that PreNeT achieves up to 72% improvement in prediction accuracy compared to contemporary state-of-the-art frameworks.

[LG-32] Novelty-Guided Data Reuse for Efficient and Diversified Multi-Agent Reinforcement Learning AAAI2025

Link: https://arxiv.org/abs/2412.15517
Authors: Yangkun Chen,Kai Yang,Jian Tao,Jiafei Lyu
Keywords: deep Multi-Agent Reinforcement, Multi-Agent Reinforcement Learning, Reinforcement Learning, pushing the boundaries, collaborative environments
Subjects: Machine Learning (cs.LG)
Comments: AAAI 2025

Abstract:Recently, deep Multi-Agent Reinforcement Learning (MARL) has demonstrated its potential to tackle complex cooperative tasks, pushing the boundaries of AI in collaborative environments. However, the efficiency of these systems is often compromised by inadequate sample utilization and a lack of diversity in learning strategies. To enhance MARL performance, we introduce a novel sample reuse approach that dynamically adjusts policy updates based on observation novelty. Specifically, we employ a Random Network Distillation (RND) network to gauge the novelty of each agent’s current state, assigning additional sample update opportunities based on the uniqueness of the data. We name our method Multi-Agent Novelty-GuidEd sample Reuse (MANGER). This method increases sample efficiency and promotes exploration and diverse agent behaviors. Our evaluations confirm substantial improvements in MARL effectiveness in complex cooperative scenarios such as Google Research Football and super-hard StarCraft II micromanagement tasks.

[LG-33] Understanding When and Why Graph Attention Mechanisms Work via Node Classification

Link: https://arxiv.org/abs/2412.15496
Authors: Zhongtian Ma,Qiaosheng Zhang,Bocheng Zhou,Yexin Zhang,Shuyue Hu,Zhen Wang
Keywords: understanding remains limited, Stochastic Block Models, Contextual Stochastic Block, graph attention mechanisms, attention mechanisms
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Despite the growing popularity of graph attention mechanisms, their theoretical understanding remains limited. This paper aims to explore the conditions under which these mechanisms are effective in node classification tasks through the lens of Contextual Stochastic Block Models (CSBMs). Our theoretical analysis reveals that incorporating graph attention mechanisms is not universally beneficial. Specifically, by appropriately defining structure noise and feature noise in graphs, we show that graph attention mechanisms can enhance classification performance when structure noise exceeds feature noise. Conversely, when feature noise predominates, simpler graph convolution operations are more effective. Furthermore, we examine the over-smoothing phenomenon and show that, in the high signal-to-noise ratio (SNR) regime, graph convolutional networks suffer from over-smoothing, whereas graph attention mechanisms can effectively resolve this issue. Building on these insights, we propose a novel multi-layer Graph Attention Network (GAT) architecture that significantly outperforms single-layer GATs in achieving perfect node classification in CSBMs, relaxing the SNR requirement from $\omega(\sqrt{\log n})$ to $\omega(\sqrt{\log n} / \sqrt[3]{n})$. To our knowledge, this is the first study to delineate the conditions for perfect node classification using multi-layer GATs. Our theoretical contributions are corroborated by extensive experiments on both synthetic and real-world datasets, highlighting the practical implications of our findings.
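The size of the SNR relaxation can be made explicit by taking the ratio of the two thresholds quoted in the abstract. This is a one-line check on the stated bounds, not a result taken from the paper:

```latex
\[
\frac{\sqrt{\log n}\,/\,\sqrt[3]{n}}{\sqrt{\log n}} \;=\; n^{-1/3} \;\longrightarrow\; 0
\qquad (n \to \infty)
\]
```

The multi-layer requirement is therefore weaker than the single-layer one by a factor of $n^{1/3}$, a gap that widens as the number of nodes $n$ grows.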

[LG-34] DualGFL: Federated Learning with a Dual-Level Coalition-Auction Game AAAI25

Link: https://arxiv.org/abs/2412.15492
Authors: Xiaobing Chen,Xiangwei Zhou,Songyang Zhang,Mingxuan Sun
Keywords: Federated Learning framework, federated learning, game-theoretical methods, failing to capture, participants in practice
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures. Accepted by AAAI25

Abstract:Despite some promising results in federated learning using game-theoretical methods, most existing studies mainly employ a one-level game in either a cooperative or competitive environment, failing to capture the complex dynamics among participants in practice. To address this issue, we propose DualGFL, a novel Federated Learning framework with a Dual-level Game in cooperative-competitive environments. DualGFL includes a lower-level hedonic game where clients form coalitions and an upper-level multi-attribute auction game where coalitions bid for training participation. At the lower-level DualGFL, we introduce a new auction-aware utility function and propose a Pareto-optimal partitioning algorithm to find a Pareto-optimal partition based on clients’ preference profiles. At the upper-level DualGFL, we formulate a multi-attribute auction game with resource constraints and derive equilibrium bids to maximize coalitions’ winning probabilities and profits. A greedy algorithm is proposed to maximize the utility of the central server. Extensive experiments on real-world datasets demonstrate DualGFL’s effectiveness in improving both server utility and client utility.

[LG-35] Predicting Long-Term Student Outcomes from Short-Term EdTech Log Data

Link: https://arxiv.org/abs/2412.15473
Authors: Ge Gao,Amelia Leon,Andrea Jetten,Jasmine Turner,Husni Almoubayyed,Stephen Fancsali,Emma Brunskill
Keywords: delayed student outcomes, statewide exams, interested in sparse, Educational stakeholders, delayed student
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to the 15th International Learning Analytics and Knowledge Conference (LAK2025)

Abstract:Educational stakeholders are often particularly interested in sparse, delayed student outcomes, like end-of-year statewide exams. The rare occurrence of such assessments makes it harder to identify students likely to fail them, and makes it slow for researchers and educators to assess the effectiveness of particular educational tools. Prior work has primarily focused on using logs from students’ full usage (e.g. year-long) of an educational product to predict outcomes, or considered predictive accuracy using a few minutes to predict outcomes after a short (e.g. 1 hour) session. In contrast, we investigate whether machine learning predictors using students’ logs during their first few hours of usage can provide useful predictive insight into those students’ end-of-school year external assessment. We do this on three diverse datasets: from students in Uganda using a literacy game product, and from students in the US using two mathematics intelligent tutoring systems. We consider various measures of the accuracy of the resulting predictors, including their ability to identify students at different parts along the assessment performance distribution. Our findings suggest that short-term log usage data, from 2-5 hours, can be used to provide valuable signal about students’ long-term external performance.

[LG-36] AdaCred: Adaptive Causal Decision Transformers with Feature Crediting AAMAS2025

Link: https://arxiv.org/abs/2412.15427
Authors: Hemant Kumawat,Saibal Mukhopadhyay
Keywords: predict future actions, future actions based, sequence modeling problem, models predict future, modeling problem
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)

Abstract:Reinforcement learning (RL) can be formulated as a sequence modeling problem, where models predict future actions based on historical state-action-reward sequences. Current approaches typically require long trajectory sequences to model the environment in offline RL settings. However, these models tend to over-rely on memorizing long-term representations, which impairs their ability to effectively attribute importance to trajectories and learned representations based on task-specific relevance. In this work, we introduce AdaCred, a novel approach that represents trajectories as causal graphs built from short-term action-reward-state sequences. Our model adaptively learns a control policy by crediting and pruning low-importance representations, retaining only those most relevant for the downstream task. Our experiments demonstrate that AdaCred-based policies require shorter trajectory sequences and consistently outperform conventional methods in both offline reinforcement learning and imitation learning environments.

[LG-37] Dimension Reduction with Locally Adjusted Graphs AAAI2025

Link: https://arxiv.org/abs/2412.15426
Authors: Yingfan Wang,Yiyang Sun,Haiyang Huang,Cynthia Rudin
Keywords: large-scale high-dimensional datasets, Dimension reduction, gaining insight, insight into large-scale, high-dimensional data
Subjects: Machine Learning (cs.LG)
Comments: Accepted by AAAI 2025

Abstract:Dimension reduction (DR) algorithms have proven to be extremely useful for gaining insight into large-scale high-dimensional datasets, particularly finding clusters in transcriptomic data. The initial phase of these DR methods often involves converting the original high-dimensional data into a graph. In this graph, each edge represents the similarity or dissimilarity between pairs of data points. However, this graph is frequently suboptimal due to unreliable high-dimensional distances and the limited information extracted from the high-dimensional data. This problem is exacerbated as the dataset size increases. If we reduce the size of the dataset by selecting points for specific sections of the embeddings, the clusters observed through DR are more separable since the extracted subgraphs are more reliable. In this paper, we introduce LocalMAP, a new dimensionality reduction algorithm that dynamically and locally adjusts the graph to address this challenge. By dynamically extracting subgraphs and updating the graph on-the-fly, LocalMAP is capable of identifying and separating real clusters within the data that other DR methods may overlook or combine. We demonstrate the benefits of LocalMAP through a case study on biological datasets, highlighting its utility in helping users more accurately identify clusters for real-world problems.

[LG-38] LG-Sleep: Local and Global Temporal Dependencies for Mice Sleep Scoring

Link: https://arxiv.org/abs/2412.15412
Authors: Shadi Sartipi,Mie Andersen,Natalie Hauglund,Celia Kjaerby,Verena Untiet,Maiken Nedergaard,Mujdat Cetin
Keywords: Efficiently identifying sleep, Efficiently identifying, clinical research, sleep, crucial for unraveling
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Efficiently identifying sleep stages is crucial for unraveling the intricacies of sleep in both preclinical and clinical research. The labor-intensive nature of manual sleep scoring, demanding substantial expertise, has prompted a surge of interest in automated alternatives. Sleep studies in mice play a significant role in understanding sleep patterns and disorders and underscore the need for robust scoring methodologies. In response, this study introduces LG-Sleep, a novel subject-independent deep neural network architecture designed for mice sleep scoring through electroencephalogram (EEG) signals. LG-Sleep extracts local and global temporal transitions within EEG signals to categorize sleep data into three stages: wake, rapid eye movement (REM) sleep, and non-rapid eye movement (NREM) sleep. The model leverages local and global temporal information by employing time-distributed convolutional neural networks to discern local temporal transitions in EEG data. Subsequently, features derived from the convolutional filters traverse long short-term memory blocks, capturing global transitions over extended periods. Crucially, the model is optimized in an autoencoder-decoder fashion, facilitating generalization across distinct subjects and adapting to limited training samples. Experimental findings demonstrate superior performance of LG-Sleep compared to conventional deep neural networks. Moreover, the model exhibits good performance across different sleep stages even when tasked with scoring based on limited training samples.

[LG-39] A Multi-Fidelity Graph U-Net Model for Accelerated Physics Simulations

Link: https://arxiv.org/abs/2412.15372
Authors: Rini Jasmine Gladstone,Hadi Meidani
Keywords: Physics-based deep learning, complex physical systems, Physics-based deep, problem inputs, frameworks have shown
Subjects: Machine Learning (cs.LG)
Comments: 21 pages, 11 figures

Abstract:Physics-based deep learning frameworks have been shown to be effective in accurately modeling the dynamics of complex physical systems with generalization capability across problem inputs. Data-driven networks such as GNNs and Neural Operators have proved to be very effective in generalizing the model across unseen domains and resolutions. But one of the most critical issues in these data-based models is the computational cost of generating training datasets. Complex phenomena can only be captured accurately using deep networks with large training datasets. Furthermore, numerical error of training samples is propagated in the model errors, thus requiring accurate data, i.e. FEM solutions on high-resolution meshes. Multi-fidelity methods offer a potential solution to reduce the training data requirements. To this end, we propose a novel GNN architecture, Multi-Fidelity U-Net, that utilizes the advantages of the multi-fidelity methods for enhancing the performance of the GNN model. The proposed architecture utilizes the capability of GNNs to manage complex geometries across different fidelity levels, while enabling flow of information between these levels for improved prediction accuracy for high-fidelity graphs. We show that the proposed approach performs significantly better in accuracy and data requirement and only requires training of a single network compared to other benchmark multi-fidelity approaches like transfer learning. We also present Multi-Fidelity U-Net Lite, a faster version of the proposed architecture, with 35% faster training and a 2 to 5% reduction in accuracy. We carry out extensive validation to show that the proposed models surpass traditional single-fidelity GNN models in their performance, thus providing a feasible alternative for addressing computational and accuracy requirements where traditional high-fidelity simulations can be time-consuming.

[LG-40] LISA: Learning-Integrated Space Partitioning Framework for Traffic Accident Forecasting on Heterogeneous Spatiotemporal Data

Link: https://arxiv.org/abs/2412.15365
Authors: Bang An,Xun Zhou,Amin Vahedian,Nick Street,Jinping Guan,Jun Luo
Keywords: Traffic accident forecasting, emergency response systems, intelligent transportation management, Traffic accident, response systems
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Traffic accident forecasting is an important task for intelligent transportation management and emergency response systems. However, this problem is challenging due to the spatial heterogeneity of the environment. Existing data-driven methods mostly focus on studying homogeneous areas with limited size (e.g. a single urban area such as New York City) and fail to handle the heterogeneous accident patterns over space at different scales. Recent advances (e.g. spatial ensemble) utilize pre-defined space partitions and learn multiple models to improve prediction accuracy. However, external knowledge is required to define proper space partitions before training models, and pre-defined partitions may not necessarily reduce the heterogeneity. To address this issue, we propose a novel Learning-Integrated Space Partition Framework (LISA) to simultaneously learn partitions while training models, where the partitioning process and learning process are integrated in a way that partitioning is guided explicitly by prediction accuracy rather than other factors. Experiments using real-world datasets demonstrate that our work can capture underlying heterogeneous patterns in a self-guided way and substantially improve baseline networks by an average of 13.0%.

[LG-41] Spatiotemporally Coherent Probabilistic Generation of Weather from Climate

Link: https://arxiv.org/abs/2412.15361
Authors: Jonathan Schmidt,Luca Schmidt,Felix Strnad,Nicole Ludwig,Philipp Hennig
Keywords: assessment and decision-making, impact assessment, capture small-scale phenomena, small-scale phenomena, climate
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 15 pages, 6 figures, additional supplementary text and figures

Abstract:Local climate information is crucial for impact assessment and decision-making, yet coarse global climate simulations cannot capture small-scale phenomena. Current statistical downscaling methods infer these phenomena as temporally decoupled spatial patches. However, to preserve physical properties, estimating spatio-temporally coherent high-resolution weather dynamics for multiple variables across long time horizons is crucial. We present a novel generative approach that uses a score-based diffusion model trained on high-resolution reanalysis data to capture the statistical properties of local weather dynamics. After training, we condition on coarse climate model data to generate weather patterns consistent with the aggregate information. As this inference task is inherently uncertain, we leverage the probabilistic nature of diffusion models and sample multiple trajectories. We evaluate our approach with high-resolution reanalysis information before applying it to the climate model downscaling task. We then demonstrate that the model generates spatially and temporally coherent weather dynamics that align with global climate output.

[LG-42] Large Language Models on Small Resource-Constrained Systems: Performance Characterization Analysis and Trade-offs

Link: https://arxiv.org/abs/2412.15352
Authors: Liam Seymour,Basar Kutukcu,Sabur Baidya
Keywords: Large Language Models, Large Language, Language Models, general consumer, Generative
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC)
Comments:

Abstract:Generative AI, such as Large Language Models (LLMs), has become more available to the general consumer in recent years. Publicly available services, e.g., ChatGPT, perform token generation on networked cloud server hardware, effectively removing the hardware entry cost for end users. However, the reliance on network access for these services, the privacy and security risks involved, and sometimes the needs of the application make it necessary to run LLMs locally on edge devices. A significant amount of research has been done on optimization of LLMs and other transformer-based models on non-networked, resource-constrained devices, but this work typically targets older hardware. Our research intends to provide a ‘baseline’ characterization of more recent commercially available embedded hardware for LLMs, and to provide a simple utility to facilitate batch testing LLMs on recent Jetson hardware. We focus on the latest line of NVIDIA Jetson devices (Jetson Orin), and a set of publicly available LLMs (Pythia) ranging between 70 million and 1.4 billion parameters. Through detailed experimental evaluation with varying software and hardware parameters, we showcase trade-off spaces and optimization choices. Additionally, we design our testing structure to facilitate further research that involves performing batch LLM testing on Jetson hardware.

[LG-43] Adaptive Urban Planning: A Hybrid Framework for Balanced City Development

Link: https://arxiv.org/abs/2412.15349
Authors: Pratham Singla,Ayush Singh,Adesh Gupta,Shivank Garg
Keywords: localized demographic preferences, balancing city-wide infrastructure, demographic preferences, faces a critical, critical challenge
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:

Abstract:Urban planning faces a critical challenge in balancing city-wide infrastructure needs with localized demographic preferences, particularly in rapidly developing regions. Although existing approaches typically focus on top-down optimization or bottom-up community planning, only some frameworks successfully integrate both perspectives. Our methodology employs a two-tier approach: First, a deterministic solver optimizes basic infrastructure requirements in the city region. Second, four specialized planning agents, each representing distinct sub-regions, propose demographic-specific modifications to a master planner. The master planner then evaluates and integrates these suggestions to ensure cohesive urban development. We validate our framework using a newly created dataset comprising detailed region and sub-region maps from three developing cities in India, focusing on areas undergoing rapid urbanization. The results demonstrate that this hybrid approach enables more nuanced urban development while maintaining overall city functionality.

[LG-44] PCA-Featured Transformer for Jamming Detection in 5G UAV Networks

Link: https://arxiv.org/abs/2412.15312
Authors: Joseanne Viana,Hamed Farkhari,Pedro Sebastiao,Victor P Gil Jimenez,Lester Ho
Keywords: Unmanned Aerial Vehicle, Aerial Vehicle, Unmanned Aerial, potentially disrupting essential, compromising network reliability
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Jamming attacks pose a threat to Unmanned Aerial Vehicle (UAV) wireless communication systems, potentially disrupting essential services and compromising network reliability. Current detection approaches struggle with sophisticated artificial intelligence (AI) jamming techniques that adapt their patterns, while existing machine learning solutions often require extensive feature engineering and fail to capture complex temporal dependencies in attack signatures. Furthermore, 5G networks using either Time Division Duplex (TDD) or Frequency Division Duplex (FDD) methods can face service degradation from intentional interference sources. To address these challenges, we present a novel transformer-based deep learning framework for jamming detection with Principal Component Analysis (PCA) added features. Our architecture leverages the transformer’s self-attention mechanism to capture complex temporal dependencies and spatial correlations in wireless signal characteristics, enabling more robust jamming detection techniques. The U-shaped model incorporates a modified transformer encoder that processes signal features including received signal strength indicator (RSSI) and signal-to-interference-plus-noise ratio (SINR) measurements, alongside a specialized positional encoding scheme that accounts for the periodic nature of wireless signals. In addition, we propose a batch size scheduler and implement chunking techniques to optimize training convergence for time series data. These advancements contribute to achieving up to a tenfold improvement in training speed within the advanced U-shaped encoder-decoder model introduced. Simulation results demonstrate that our approach achieves a detection accuracy of 90.33% in Line-of-Sight (LoS) and 84.35% in non-Line-of-Sight (NLoS) conditions, and outperforms machine learning methods and existing deep learning solutions such as the XGBoost (XGB) classifier by approximately 4%.

[LG-45] Re-evaluating Group Robustness via Adaptive Class-Specific Scaling

Link: https://arxiv.org/abs/2412.15311
Authors: Seonguk Seo,Bohyung Han
Keywords: Group distributionally robust, mitigate spurious correlations, Group distributionally, distributionally robust optimization, address dataset bias
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Group distributionally robust optimization, which aims to improve robust accuracies – worst-group and unbiased accuracies – is a prominent algorithm used to mitigate spurious correlations and address dataset bias. Although existing approaches have reported improvements in robust accuracies, these gains often come at the cost of average accuracy due to inherent trade-offs. To control this trade-off flexibly and efficiently, we propose a simple class-specific scaling strategy, directly applicable to existing debiasing algorithms with no additional training. We further develop an instance-wise adaptive scaling technique to alleviate this trade-off, even leading to improvements in both robust and average accuracies. Our approach reveals that a naïve ERM baseline matches or even outperforms the recent debiasing methods by simply adopting the class-specific scaling technique. Additionally, we introduce a novel unified metric that quantifies the trade-off between the two accuracies as a scalar value, allowing for a comprehensive evaluation of existing algorithms. By tackling the inherent trade-off and offering a performance landscape, our approach provides valuable insights into robust techniques beyond just robust accuracy. We validate the effectiveness of our framework through experiments across datasets in computer vision and natural language processing domains.
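A class-specific scaling step of the kind the abstract describes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy probabilities, and the choice to multiply predicted class probabilities by a per-class factor before taking the argmax are all assumptions made here for demonstration.

```python
from typing import Dict, List, Sequence, Tuple


def scaled_predict(probs: Sequence[Sequence[float]], scales: Sequence[float]) -> List[int]:
    """Multiply each class score by its (assumed) scaling factor, then take the argmax."""
    preds = []
    for p in probs:
        scores = [pi * si for pi, si in zip(p, scales)]
        preds.append(max(range(len(scores)), key=scores.__getitem__))
    return preds


def group_accuracies(preds: Sequence[int], labels: Sequence[int], groups: Sequence[int]) -> Dict[int, float]:
    """Accuracy computed separately for each group id."""
    correct: Dict[int, int] = {}
    total: Dict[int, int] = {}
    for yhat, y, g in zip(preds, labels, groups):
        total[g] = total.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + int(yhat == y)
    return {g: correct[g] / total[g] for g in total}


def average_and_worst(preds: Sequence[int], labels: Sequence[int], groups: Sequence[int]) -> Tuple[float, float]:
    """Return (average accuracy, worst-group accuracy)."""
    per_group = group_accuracies(preds, labels, groups)
    avg = sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)
    return avg, min(per_group.values())


# Toy data (hypothetical): two classes, three groups.
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.6, 0.4], [0.55, 0.45], [0.3, 0.7]]
labels = [0, 0, 0, 1, 1, 1]
groups = [0, 0, 0, 1, 1, 2]

baseline = average_and_worst(scaled_predict(probs, [1.0, 1.0]), labels, groups)
rescaled = average_and_worst(scaled_predict(probs, [1.0, 1.6]), labels, groups)
print(baseline, rescaled)
```

Because the scaling is applied only at prediction time, sweeping the scale vector traces out different average versus worst-group operating points for a fixed trained model, which is the kind of trade-off control the abstract refers to.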

[LG-46] MIETT: Multi-Instance Encrypted Traffic Transformer for Encrypted Traffic Classification AAAI2025

Link: https://arxiv.org/abs/2412.15306
Authors: Xu-Yang Chen,Lu Han,De-Chuan Zhan,Han-Jia Ye
Keywords: includes data transmitted, Network traffic includes, encrypted traffic, traffic includes data, file transfers
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: AAAI 2025 accepted

Abstract:Network traffic includes data transmitted across a network, such as web browsing and file transfers, and is organized into packets (small units of data) and flows (sequences of packets exchanged between two endpoints). Classifying encrypted traffic is essential for detecting security threats and optimizing network management. Recent advancements have highlighted the superiority of foundation models in this task, particularly for their ability to leverage large amounts of unlabeled data and demonstrate strong generalization to unseen data. However, existing methods that focus on token-level relationships fail to capture broader flow patterns, as tokens, defined as sequences of hexadecimal digits, typically carry limited semantic information in encrypted traffic. These flow patterns, which are crucial for traffic classification, arise from the interactions between packets within a flow, not just their internal structure. To address this limitation, we propose a Multi-Instance Encrypted Traffic Transformer (MIETT), which adopts a multi-instance approach where each packet is treated as a distinct instance within a larger bag representing the entire flow. This enables the model to capture both token-level and packet-level relationships more effectively through Two-Level Attention (TLA) layers, improving the model’s ability to learn complex packet dynamics and flow patterns. We further enhance the model’s understanding of temporal and flow-specific dynamics by introducing two novel pre-training tasks: Packet Relative Position Prediction (PRPP) and Flow Contrastive Learning (FCL). After fine-tuning, MIETT achieves state-of-the-art (SOTA) results across five datasets, demonstrating its effectiveness in classifying encrypted traffic and understanding complex network behaviors. Code is available at this https URL.

[LG-47] TinyLLM: A Framework for Training and Deploying Language Models at the Edge Computers

Link: https://arxiv.org/abs/2412.15304
Authors: Savitha Viswanadh Kandala,Pramuka Medaranga,Ambuj Varshney
Keywords: gained significant interest, significant interest due, general-purpose capabilities, Language models, interest due
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Language models have gained significant interest due to their general-purpose capabilities, which appear to emerge as models are scaled to increasingly larger parameter sizes. However, these large models impose stringent requirements on computing systems, demanding significant memory and processing resources for inference. This makes performing inference on mobile and edge devices challenging, often requiring invoking remotely hosted models via network calls. Remote inference, in turn, introduces issues like latency, unreliable network connectivity, and privacy concerns. To address these challenges, we explored the possibility of deviating from the trend of increasing model size. Instead, we hypothesize that much smaller models (~30-120M parameters) can outperform their larger counterparts for specific tasks by carefully curating the data used for pre-training and fine-tuning. We investigate this within the context of deploying edge-device models to support sensing applications. We trained several foundational models through a systematic study and found that small models can run locally on edge devices, achieving high token rates and accuracy. Based on these findings, we developed a framework that allows users to train foundational models tailored to their specific applications and deploy them at the edge.

[LG-48] Tokenphormer: Structure-aware Multi-token Graph Transformer for Node Classification AAAI2025

Link: https://arxiv.org/abs/2412.15302
Authors: Zijie Zhou,Zhaoqi Lu,Xuekai Wei,Rongqin Chen,Shenghui Zhang,Pak Lon Ip,Leong Hou U
Keywords: Graph Neural Networks, Neural Networks, graph data mining, data mining tasks, Graph Neural
Categories: Machine Learning (cs.LG)
Comments: Accepted by AAAI 2025

Abstract:Graph Neural Networks (GNNs) are widely used in graph data mining tasks. Traditional GNNs follow a message passing scheme that can effectively utilize local and structural information. However, the phenomena of over-smoothing and over-squashing limit the receptive field in message passing processes. Graph Transformers were introduced to address these issues, achieving a global receptive field but suffering from the noise of irrelevant nodes and loss of structural information. Therefore, drawing inspiration from fine-grained token-based representation learning in Natural Language Processing (NLP), we propose the Structure-aware Multi-token Graph Transformer (Tokenphormer), which generates multiple tokens to effectively capture local and structural information and explore global information at different levels of granularity. Specifically, we first introduce the walk-token generated by mixed walks consisting of four walk types to explore the graph and capture structure and contextual information flexibly. To ensure local and global information coverage, we also introduce the SGPM-token (obtained through the Self-supervised Graph Pre-train Model, SGPM) and the hop-token, extending the length and density limit of the walk-token, respectively. Finally, these expressive tokens are fed into the Transformer model to learn node representations collaboratively. Experimental results demonstrate that the proposed Tokenphormer achieves state-of-the-art performance on node classification tasks.

[LG-49] Log-Time K-Means Clustering for 1D Data: Novel Approaches with Proof and Implementation

Link: https://arxiv.org/abs/2412.15295
Authors: Jake Hyun
Keywords: Lloyd algorithm, machine learning, simplicity and effectiveness, Lloyd, cdot
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: Undergraduate Thesis, Department of Computer Science and Engineering, Seoul National University

Abstract:Clustering is a key task in machine learning, with $k$-means being widely used for its simplicity and effectiveness. While 1D clustering is common, existing methods often fail to exploit the structure of 1D data, leading to inefficiencies. This thesis introduces optimized algorithms for $k$-means++ initialization and Lloyd’s algorithm, leveraging sorted data, prefix sums, and binary search for improved computational performance. The main contributions are: (1) an optimized $k$-cluster algorithm achieving $O(l \cdot k^2 \cdot \log n)$ complexity for greedy $k$-means++ initialization and $O(i \cdot k \cdot \log n)$ for Lloyd’s algorithm, where $l$ is the number of greedy $k$-means++ local trials, and $i$ is the number of Lloyd’s algorithm iterations, and (2) a binary search-based two-cluster algorithm, achieving $O(\log n)$ runtime with deterministic convergence to a Lloyd’s algorithm local minimum. Benchmarks demonstrate over 4500x speedup compared to scikit-learn for large datasets while maintaining clustering quality measured by within-cluster sum of squares (WCSS). Additionally, the algorithms achieve a 300x speedup in an LLM quantization task, highlighting their utility in emerging applications. This thesis bridges theory and practice for 1D $k$-means clustering, delivering efficient and sound algorithms implemented in a JIT-optimized open-source Python library.
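The way sorted data, prefix sums, and binary search combine to give a Lloyd iteration in $O(k \cdot \log n)$ can be illustrated with a short sketch. The helper names below (`build_prefix`, `segment_cost`, `lloyd_step`) are hypothetical and are not the API of the thesis's library; the code only assumes that the input is already sorted.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


def build_prefix(sorted_xs: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Prefix sums of x and x^2 over sorted data; O(n) one-time preprocessing."""
    s1 = [0.0] + list(accumulate(sorted_xs))
    s2 = [0.0] + list(accumulate(x * x for x in sorted_xs))
    return s1, s2


def segment_cost(s1: Sequence[float], s2: Sequence[float], lo: int, hi: int) -> float:
    """WCSS of sorted_xs[lo:hi] around its own mean, in O(1) via prefix sums."""
    n = hi - lo
    if n <= 0:
        return 0.0
    total = s1[hi] - s1[lo]
    return (s2[hi] - s2[lo]) - total * total / n


def lloyd_step(sorted_xs: Sequence[float], s1: Sequence[float],
               centroids: Sequence[float]) -> Tuple[List[float], List[int]]:
    """One Lloyd iteration: k-1 binary searches for boundaries, O(1) mean per cluster."""
    centroids = sorted(centroids)
    cuts = [0]
    # In 1D each cluster is a contiguous run; its boundary is the midpoint
    # between neighbouring centroids, located by binary search.
    for a, b in zip(centroids, centroids[1:]):
        cuts.append(bisect_right(sorted_xs, (a + b) / 2))
    cuts.append(len(sorted_xs))
    new_centroids = []
    for c, lo, hi in zip(centroids, cuts, cuts[1:]):
        # Keep the old centroid if the cluster is empty.
        new_centroids.append((s1[hi] - s1[lo]) / (hi - lo) if hi > lo else c)
    return new_centroids, cuts


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
s1, s2 = build_prefix(xs)
step1, cuts1 = lloyd_step(xs, s1, [0.0, 5.0])
step2, cuts2 = lloyd_step(xs, s1, step1)
print(step1, step2, segment_cost(s1, s2, 0, 3))
```

After the one-time sort and prefix-sum pass, no step touches individual points again, which is what makes the per-iteration cost independent of $n$ apart from the logarithmic search.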

[LG-50] Investigating the importance of social vulnerability in opioid-related mortality across the United States

Link: https://arxiv.org/abs/2412.15218
Authors: Andrew Deas,Adam Spannaus,Dakotah D. Maguire,Jodie Trafton,Anuj J. Kapadia,Vasileios Maroulas
Keywords: public health challenge, United States, critical public health, public health, health challenge
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:The opioid crisis remains a critical public health challenge in the United States. Despite national efforts which reduced opioid prescribing rates by nearly 45% between 2011 and 2021, opioid overdose deaths more than tripled during this same period. Such alarming trends raise important questions about what underlying social factors may be driving opioid misuse. Using county-level data across the United States, this study begins with a preliminary data analysis of how the rates of thirteen social vulnerability index variables manifest in counties with both anomalously high and low mortality rates, identifying patterns that warrant further investigation. Building on these findings, we further investigate the importance of the thirteen SVI variables within a machine learning framework by employing two predictive models: XGBoost and a modified autoencoder. Both models take the thirteen SVI variables as input and predict county-level opioid-related mortality rates. This allows us to leverage two distinct feature importance metrics: information gain for XGBoost and a Shapley gradient explainer for the autoencoder. These metrics offer two unique insights into the most important SVI factors in relation to opioid-related mortality. By identifying the variables which consistently rank as most important, this study highlights key social vulnerability factors that may play critical roles in the opioid crisis.

[LG-51] Learning sparsity-promoting regularizers for linear inverse problems

Link: https://arxiv.org/abs/2412.16031
Authors: Giovanni S. Alberti,Ernesto De Vito,Tapio Helin,Matti Lassas,Luca Ratti,Matteo Santacesaria
Keywords: solving linear inverse, learning sparsity-promoting regularizers, linear inverse problems, paper introduces, sparsity-promoting regularizers
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:This paper introduces a novel approach to learning sparsity-promoting regularizers for solving linear inverse problems. We develop a bilevel optimization framework to select an optimal synthesis operator, denoted as B , which regularizes the inverse problem while promoting sparsity in the solution. The method leverages statistical properties of the underlying data and incorporates prior knowledge through the choice of B . We establish the well-posedness of the optimization problem, provide theoretical guarantees for the learning process, and present sample complexity bounds. The approach is demonstrated through examples, including compact perturbations of a known operator and the problem of learning the mother wavelet, showcasing its flexibility in incorporating prior knowledge into the regularization framework. This work extends previous efforts in Tikhonov regularization by addressing non-differentiable norms and proposing a data-driven approach for sparse regularization in infinite dimensions.

[LG-52] Mamba-based Deep Learning Approaches for Sleep Staging on a Wireless Multimodal Wearable System without Electroencephalography

Link: https://arxiv.org/abs/2412.15947
Authors: Andrew H. Zhang,Alex He-Mo,Richard Fei Yin,Chunlin Li,Yuzhi Tang,Dharmendra Gurve,Nasim Montazeri Ghahjaverestan,Maged Goubran,Bo Wang,Andrew S. P. Lim
Keywords: Study Objectives, Sibel Health, measuring chest electrocardiography, minimally intrusive dual-sensor, intrusive dual-sensor wireless
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 21 pages, 11 figures. Authors Andrew H. Zhang, Alex He-Mo, and Richard Fei Yin contributed equally

Abstract:Study Objectives: We investigate using Mamba-based deep learning approaches for sleep staging on signals from ANNE One (Sibel Health, Evanston, IL), a minimally intrusive dual-sensor wireless wearable system measuring chest electrocardiography (ECG), triaxial accelerometry, and temperature, as well as finger photoplethysmography (PPG) and temperature. Methods: We obtained wearable sensor recordings from 360 adults undergoing concurrent clinical polysomnography (PSG) at a tertiary care sleep lab. PSG recordings were scored according to AASM criteria. PSG and wearable sensor data were automatically aligned using their ECG channels with manual confirmation by visual inspection. We trained Mamba-based models with both convolutional-recurrent neural network (CRNN) and the recurrent neural network (RNN) architectures on these recordings. Ensembling of model variants with similar architectures was performed. Results: Our best approach, after ensembling, attains a 3-class (wake, NREM, REM) balanced accuracy of 83.50%, F1 score of 84.16%, Cohen’s \kappa of 72.68%, and a MCC score of 72.84%; a 4-class (wake, N1/N2, N3, REM) balanced accuracy of 74.64%, F1 score of 74.56%, Cohen’s \kappa of 61.63%, and MCC score of 62.04%; a 5-class (wake, N1, N2, N3, REM) balanced accuracy of 64.30%, F1 score of 66.97%, Cohen’s \kappa of 53.23%, MCC score of 54.38%. Conclusions: Deep learning models can infer major sleep stages from a wearable system without electroencephalography (EEG) and can be successfully applied to data from adults attending a tertiary care sleep clinic.
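For readers unfamiliar with the reported metrics, the following sketch shows how balanced accuracy, Cohen's kappa, and the multi-class Matthews correlation coefficient (MCC) can be computed from a confusion matrix using their standard definitions. The function name and the example matrix are made up for illustration and are unrelated to the paper's data.

```python
import math
from typing import List, Sequence, Tuple


def confusion_metrics(cm: Sequence[Sequence[int]]) -> Tuple[float, float, float]:
    """Return (balanced accuracy, Cohen's kappa, MCC) for a square confusion matrix.

    cm[i][j] counts samples with true class i predicted as class j.
    Assumes every class has at least one true sample.
    """
    k = len(cm)
    s = sum(sum(row) for row in cm)                               # total samples
    c = sum(cm[i][i] for i in range(k))                           # correctly classified
    t: List[int] = [sum(cm[i]) for i in range(k)]                 # true count per class
    p: List[int] = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted count per class

    balanced_acc = sum(cm[i][i] / t[i] for i in range(k)) / k     # mean per-class recall

    p_observed = c / s
    p_expected = sum(ti * pi for ti, pi in zip(t, p)) / (s * s)   # chance agreement
    kappa = (p_observed - p_expected) / (1 - p_expected)

    numerator = c * s - sum(ti * pi for ti, pi in zip(t, p))
    denominator = math.sqrt((s * s - sum(pi * pi for pi in p)) *
                            (s * s - sum(ti * ti for ti in t)))
    mcc = numerator / denominator if denominator else 0.0
    return balanced_acc, kappa, mcc


# Hypothetical two-class confusion matrix.
bal_acc, kappa, mcc = confusion_metrics([[8, 2], [1, 9]])
print(bal_acc, kappa, mcc)
```

All three metrics correct for class imbalance in different ways, which is relevant for sleep staging, where some stages occupy far fewer epochs than others.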

[LG-53] The common ground of DAE approaches. An overview of diverse DAE frameworks emphasizing their commonalities

Link: https://arxiv.org/abs/2412.15866
Authors: Diana Estévez Schwarz,René Lamour,Roswitha März
Keywords: implemented rank conditions, analyze different approaches, approaches to differential-algebraic, differential-algebraic equations, equations with attention
Categories: Classical Analysis and ODEs (math.CA); Machine Learning (cs.LG)
Comments:

Abstract:We analyze different approaches to differential-algebraic equations with attention to the implemented rank conditions of various matrix functions. These conditions are apparently very different and certain rank drops in some matrix functions actually indicate a critical solution behavior. We look for common ground by considering various index and regularity notions from literature generalizing the Kronecker index of regular matrix pencils. In detail, starting from the most transparent reduction framework, we work out a comprehensive regularity concept with canonical characteristic values applicable across all frameworks and prove the equivalence of thirteen distinct definitions of regularity. This makes it possible to use the findings of all these concepts together. Additionally, we show why not only the index but also these canonical characteristic values are crucial to describe the properties of the DAE.

[LG-54] On Robust Cross Domain Alignment

Link: https://arxiv.org/abs/2412.15861
Authors: Anish Chakrabarty,Arkaprabha Basu,Swagatam Das
Keywords: distinct ambient spaces, ambient spaces, distributions supported, supported on distinct, distinct ambient
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:The Gromov-Wasserstein (GW) distance is an effective measure of alignment between distributions supported on distinct ambient spaces. Calculating essentially the mutual departure from isometry, it has found vast usage in domain translation and network analysis. It has long been shown to be vulnerable to contamination in the underlying measures. All efforts to introduce robustness in GW have been inspired by similar techniques in optimal transport (OT), which predominantly advocate partial mass transport or unbalancing. In contrast, the cross-domain alignment problem, being fundamentally different from OT, demands specific solutions to tackle diverse applications and contamination regimes. Deriving from robust statistics, we discuss three contextually novel techniques to robustify GW and its variants. For each method, we explore metric properties and robustness guarantees along with their co-dependencies and individual relations with the GW distance. For a comprehensive view, we empirically validate their superior resilience to contamination under real machine learning tasks against state-of-the-art methods.

[LG-55] Using matrix-product states for time-series machine learning

Link: https://arxiv.org/abs/2412.15826
Authors: Joshua B. Moore,Hugo P. Stackhouse,Ben D. Fulcher,Sahand Mahmoodian
Keywords: Matrix-product states, quantum many-body physics, modeling quantum many-body, joint probability distribution, many-body physics
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 27 pages, 13 figures

Abstract:Matrix-product states (MPS) have proven to be a versatile ansatz for modeling quantum many-body physics. For many applications, and particularly in one dimension, they capture relevant quantum correlations in many-body wavefunctions while remaining tractable to store and manipulate on a classical computer. This has motivated researchers to also apply the MPS ansatz to machine learning (ML) problems where capturing complex correlations in datasets is also a key requirement. Here, we develop and apply an MPS-based algorithm, MPSTime, for learning a joint probability distribution underlying an observed time-series dataset, and show how it can be used to tackle important time-series ML problems, including classification and imputation. MPSTime can efficiently learn complicated time-series probability distributions directly from data, requires only moderate maximum MPS bond dimension $\chi_{\rm max}$, with values for our applications ranging between $\chi_{\rm max} = 20$ and $150$, and can be trained for both classification and imputation tasks under a single logarithmic loss function. Using synthetic and publicly available real-world datasets, spanning applications in medicine, energy, and astronomy, we demonstrate performance competitive with state-of-the-art ML approaches, but with the key advantage of encoding the full joint probability distribution learned from the data. By sampling from the joint probability distribution and calculating its conditional entanglement entropy, we show how its underlying structure can be uncovered and interpreted. This manuscript is supplemented with the release of a publicly available code package MPSTime that implements our approach. The efficiency of the MPS-based ansatz for learning complex correlation structures from time-series data is likely to underpin interpretable advances to challenging time-series ML problems across science, industry, and medicine.

[LG-56] Deep learning joint extremes of metocean variables using the SPAR model

Link: https://arxiv.org/abs/2412.15808
Authors: Ed Mackay,Callum Murphy-Barltrop,Jordan Richards,Philip Jonathan
Keywords: deep learning framework, modelling multivariate extremes, estimating multivariate joint, Semi-Parametric Angular-Radial, metocean variables
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:This paper presents a novel deep learning framework for estimating multivariate joint extremes of metocean variables, based on the Semi-Parametric Angular-Radial (SPAR) model. When considered in polar coordinates, the problem of modelling multivariate extremes is transformed to one of modelling an angular density, and the tail of a univariate radial variable conditioned on angle. In the SPAR approach, the tail of the radial variable is modelled using a generalised Pareto (GP) distribution, providing a natural extension of univariate extreme value theory to the multivariate setting. In this work, we show how the method can be applied in higher dimensions, using a case study for five metocean variables: wind speed, wind direction, wave height, wave period and wave direction. The angular variable is modelled empirically, while the parameters of the GP model are approximated using fully-connected deep neural networks. Our data-driven approach provides great flexibility in the dependence structures that can be represented, together with computationally efficient routines for training the model. Furthermore, the method requires fewer assumptions about the underlying distribution(s) than existing approaches and provides an asymptotically justified means for extrapolating outside the range of observations. Using various diagnostic plots, we show that the fitted models provide a good description of the joint extremes of the metocean variables considered.
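The angular-radial decomposition described in the abstract can be written out as follows. The notation is an assumption made here for illustration: $R$ is the radial variable, $\Theta$ the angle, $u(\theta)$ an angle-dependent threshold, $\zeta = \Pr(R > u(\theta) \mid \Theta = \theta)$ the threshold exceedance probability, and $\sigma(\theta)$, $\xi(\theta)$ the GP scale and shape parameters, which the abstract says are approximated by neural networks.

```latex
f_{R,\Theta}(r,\theta) = f_{\Theta}(\theta)\, f_{R\mid\Theta}(r \mid \theta),
\qquad
\Pr(R > r \mid \Theta = \theta) \approx
\zeta \left[ 1 + \xi(\theta)\, \frac{r - u(\theta)}{\sigma(\theta)} \right]_{+}^{-1/\xi(\theta)},
\quad r > u(\theta).
```

The first factor is the angular density, which the abstract says is modelled empirically; the second is the standard generalised Pareto tail applied conditionally on angle.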

[LG-57] GraphDOP: Towards skilful data-driven medium-range weather forecasts learnt and initialised directly from observations

Link: https://arxiv.org/abs/2412.15687
Authors: Mihai Alexe,Eulalie Boucher,Peter Lean,Ewan Pinnington,Patrick Laloyaux,Anthony McNally,Simon Lang,Matthew Chantry,Chris Burrows,Marcin Chrust,Florian Pinault,Ethel Villeneuve,Niels Bormann,Sean Healy
Keywords: forecast system developed, Medium-Range Weather Forecasts, Earth System observations, European Centre, Earth System state
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 23 pages, 15 figures

Abstract:We introduce GraphDOP, a new data-driven, end-to-end forecast system developed at the European Centre for Medium-Range Weather Forecasts (ECMWF) that is trained and initialised exclusively from Earth System observations, with no physics-based (re)analysis inputs or feedbacks. GraphDOP learns the correlations between observed quantities - such as brightness temperatures from polar orbiters and geostationary satellites - and geophysical quantities of interest (that are measured by conventional observations), to form a coherent latent representation of Earth System state dynamics and physical processes, and is capable of producing skilful predictions of relevant weather parameters up to five days into the future.

[LG-58] Predicting Artificial Neural Network Representations to Learn Recognition Model for Music Identification from Brain Recordings

Link: https://arxiv.org/abs/2412.15560
Authors: Taketo Akama,Zhuohao Zhang,Pengcheng Li,Kotaro Hongo,Hiroaki Kitano,Shun Minamikawa,Natalia Polouliakh
Keywords: exhibit notable similarities, Recent studies, artificial neural networks, cortical representations, ANN representations
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: 18 pages, 10 figures

Abstract:Recent studies have demonstrated that the representations of artificial neural networks (ANNs) can exhibit notable similarities to cortical representations when subjected to identical auditory sensory inputs. In these studies, the ability to predict cortical representations is probed by regressing from ANN representations to cortical representations. Building upon this concept, our approach reverses the direction of prediction: we utilize ANN representations as a supervisory signal to train recognition models using noisy brain recordings obtained through non-invasive measurements. Specifically, we focus on constructing a recognition model for music identification, where electroencephalography (EEG) brain recordings collected during music listening serve as input. By training an EEG recognition model to predict ANN representations (specifically, representations associated with music identification), we observed a substantial improvement in classification accuracy. This study introduces a novel approach to developing recognition models for brain recordings in response to external auditory stimuli. It holds promise for advancing brain-computer interfaces (BCI), neural decoding techniques, and our understanding of music cognition. Furthermore, it provides new insights into the relationship between auditory brain activity and ANN representations.

[LG-59] De-singularity Subgradient for the $q$-th-Powered $\ell_p$-Norm Weber Location Problem AAAI2025

Link: https://arxiv.org/abs/2412.15546
Authors: Zhao-Rong Lai,Xiaotian Wu,Liangda Fang,Ziliang Chen,Cheng Li
Keywords: Weber location problem, Weber location, artificial intelligence scenarios, artificial intelligence, singular points
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: AAAI 2025

Abstract:The Weber location problem is widely used in several artificial intelligence scenarios. However, the gradient of the objective does not exist at a considerable set of singular points. Recently, a de-singularity subgradient method has been proposed to fix this problem, but it can only handle the $q$-th-powered $\ell_2$-norm case ($1 \leqslant q < 2$), which has only finitely many singular points. In this paper, we further establish the de-singularity subgradient for the $q$-th-powered $\ell_p$-norm case with $1 \leqslant q \leqslant p$ and $1 \leqslant p < 2$, which includes all the remaining unsolved cases of this problem. This is a challenging task because the singular set is a continuum. The geometry of the objective function is also complicated, so that characterizing the subgradients, the minimum, and the descent direction is very difficult. We develop a $q$-th-powered $\ell_p$-norm Weiszfeld Algorithm without Singularity ($q$P$p$NWAWS) for this problem, which ensures convergence and the descent property of the objective function. Extensive experiments on six real-world data sets demonstrate that $q$P$p$NWAWS successfully solves the singularity problem and achieves a linear computational convergence rate in practical scenarios.
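To make the singularity issue concrete, here is a minimal sketch of the classical Weiszfeld iteration for the simplest case ($q = 1$, $p = 2$, the geometric median). This is the textbook algorithm, not the authors' de-singularity method: the function name and parameters are assumptions, and the sketch simply stops when an iterate coincides with a data point, which is exactly where the classical update is undefined.

```python
import math
from typing import List, Optional, Sequence


def weiszfeld(points: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None,
              iters: int = 200, tol: float = 1e-10) -> List[float]:
    """Classical Weiszfeld iteration minimizing sum_i w_i * ||y - x_i||_2."""
    dim = len(points[0])
    w = list(weights) if weights is not None else [1.0] * len(points)
    total_w = sum(w)
    # Start from the weighted centroid.
    y = [sum(wi * p[d] for wi, p in zip(w, points)) / total_w for d in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for wi, p in zip(w, points):
            dist = math.dist(y, p)
            if dist == 0.0:
                # Singular point: the gradient does not exist here, so the
                # classical update cannot proceed. We return the iterate as-is.
                return y
            coef = wi / dist
            den += coef
            for d in range(dim):
                num[d] += coef * p[d]
        new_y = [v / den for v in num]
        if math.dist(new_y, y) < tol:
            return new_y
        y = new_y
    return y


# Isosceles triangle: the minimizer is the Fermat point at (2, 2 / sqrt(3)).
median = weiszfeld([(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)])
print(median)
```

In the $\ell_2$ case the singular points are just the data points, so the early return above is triggered only at finitely many locations; the abstract notes that for $\ell_p$ norms with $p < 2$ the singular set becomes a continuum, which such a simple guard cannot handle.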

[LG-60] Learning charges and long-range interactions from energies and forces

Link: https://arxiv.org/abs/2412.15455
Authors: Dongjin Kim,Daniel S. King,Peichen Zhong,Bingqing Cheng
Keywords: Accurate modeling, Latent Ewald Summation, atomistic simulations, critical in atomistic, play a central
Categories: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Abstract:Accurate modeling of long-range forces is critical in atomistic simulations, as they play a central role in determining the properties of materials and chemical systems. However, standard machine learning interatomic potentials (MLIPs) often rely on short-range approximations, limiting their applicability to systems with significant electrostatics and dispersion forces. We recently introduced the Latent Ewald Summation (LES) method, which captures long-range electrostatics without explicitly learning atomic charges or charge equilibration. Extending LES, we incorporate the ability to learn physical partial charges, encode charge states, and the option to impose charge neutrality constraints. We benchmark LES on diverse and challenging systems, including charged molecules, ionic liquid, electrolyte solution, polar dipeptides, surface adsorption, electrolyte/solid interfaces, and solid-solid interfaces. Our results show that LES can effectively infer physical partial charges, dipole and quadrupole moments, as well as achieve better accuracy compared to methods that explicitly learn charges. LES thus provides an efficient, interpretable, and generalizable MLIP framework for simulating complex systems with intricate charge transfer and long-range interactions.

[LG-61] Cosmology with Persistent Homology: Parameter Inference via Machine Learning

Link: https://arxiv.org/abs/2412.15405
Authors: Juan Calles,Jacky H. T. Yip,Gabriella Contardo,Jorge Noreña,Adam Rouhiainen,Gary Shiu
Keywords: likelihood-free inference pipeline, potential constraining power, combined Power Spectrum, primordial non-Gaussianity amplitudes, inference pipeline
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
Comments: 28 pages, 8 figures, 4 tables

Abstract:Building upon [2308.02636], this article investigates the potential constraining power of persistent homology for cosmological parameters and primordial non-Gaussianity amplitudes in a likelihood-free inference pipeline. We evaluate the ability of persistence images (PIs) to infer parameters, compared to the combined Power Spectrum and Bispectrum (PS/BS), and we compare two types of models: neural-based, and tree-based. PIs consistently lead to better predictions compared to the combined PS/BS when the parameters can be constrained (i.e., for $\Omega_{\rm m}$, $\sigma_8$, $n_{\rm s}$, and $f_{\rm NL}^{\rm loc}$). PIs perform particularly well for $f_{\rm NL}^{\rm loc}$, showing the promise of persistent homology in constraining primordial non-Gaussianity. Our results show that combining PIs with PS/BS provides only marginal gains, indicating that the PS/BS contains little extra or complementary information to the PIs. Finally, we provide a visualization of the most important topological features for $f_{\rm NL}^{\rm loc}$ and for $\Omega_{\rm m}$. This reveals that clusters and voids (0-cycles and 2-cycles) are most informative for $\Omega_{\rm m}$, while $f_{\rm NL}^{\rm loc}$ uses the filaments (1-cycles) in addition to the other two types of topological features.

[LG-62] Enhancing Masked Time-Series Modeling via Dropping Patches

Link: https://arxiv.org/abs/2412.15315
Authors: Tianyu Qiu,Yi Xie,Yun Xiong,Hao Niu,Xiaofeng Gao
Keywords: enhance existing masked, existing masked time-series, masked time-series modeling, dropping sub-sequence level, sub-sequence level patches
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:This paper explores how to enhance existing masked time-series modeling by randomly dropping sub-sequence level patches of time series. On this basis, a simple yet effective method named DropPatch is proposed, which has two remarkable advantages: 1) It improves the pre-training efficiency by a square-level advantage; 2) It provides additional advantages for modeling in scenarios such as in-domain, cross-domain, few-shot learning and cold start. This paper conducts comprehensive experiments to verify the effectiveness of the method and analyze its internal mechanism. Empirically, DropPatch strengthens the attention mechanism, reduces information redundancy and serves as an efficient means of data augmentation. Theoretically, it is proved that DropPatch slows down the rate at which the Transformer representations collapse into the rank-1 linear subspace by randomly dropping patches, thus optimizing the quality of the learned representations.
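The idea of dropping patches before masking can be sketched as a small preprocessing routine. This is an illustrative guess at the data flow, not the authors' code: the function names, the non-overlapping patching, and the `drop_ratio` and `mask_ratio` parameters are all assumptions. Dropped patches are removed from the sequence entirely, whereas masked patches stay in place as reconstruction targets.

```python
import random
from typing import List, Optional, Sequence, Set, Tuple


def patchify(series: Sequence[float], patch_len: int) -> List[Sequence[float]]:
    """Split a series into non-overlapping patches, discarding any remainder."""
    return [series[i:i + patch_len]
            for i in range(0, len(series) - patch_len + 1, patch_len)]


def drop_then_mask(patches: Sequence[Sequence[float]], drop_ratio: float,
                   mask_ratio: float, rng: random.Random
                   ) -> Tuple[List[int], List[Optional[Sequence[float]]], Set[int]]:
    """Randomly drop patches, then mask a fraction of the survivors.

    Returns the original indices of kept patches, the visible sequence
    (None marks a masked position), and the set of masked positions.
    """
    n = len(patches)
    n_keep = n - int(n * drop_ratio)
    keep_idx = sorted(rng.sample(range(n), n_keep))      # preserve temporal order
    kept = [patches[i] for i in keep_idx]
    n_mask = int(len(kept) * mask_ratio)
    masked_pos = set(rng.sample(range(len(kept)), n_mask))
    visible = [None if j in masked_pos else p for j, p in enumerate(kept)]
    return keep_idx, visible, masked_pos


series = list(range(32))
patches = patchify(series, 4)
keep_idx, visible, masked_pos = drop_then_mask(patches, 0.5, 0.5, random.Random(0))
print(keep_idx, visible)
```

Since self-attention cost grows quadratically with the number of patches, keeping a fraction $(1 - r)$ of them would scale that cost by roughly $(1 - r)^2$; this is one plausible reading of the abstract's "square-level" efficiency claim rather than something the abstract states.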

[LG-63] Multi-Branch Mutual-Distillation Transformer for EEG-Based Seizure Subtype Classification

Link: https://arxiv.org/abs/2412.15224
Authors: Ruimin Peng,Zhenbang Du,Changming Zhao,Jingwei Luo,Wenzhong Liu,Xinxing Chen,Dongrui Wu
Keywords: precise epilepsy diagnostics, epilepsy diagnostics, based seizure subtype, important in precise, precise epilepsy
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Abstract:Cross-subject electroencephalogram (EEG) based seizure subtype classification is very important in precise epilepsy diagnostics. Deep learning is a promising solution, due to its ability to automatically extract latent patterns. However, it usually requires a large amount of training data, which may not always be available in clinical practice. This paper proposes Multi-Branch Mutual-Distillation (MBMD) Transformer for cross-subject EEG-based seizure subtype classification, which can be effectively trained from small labeled data. MBMD Transformer replaces all even-numbered encoder blocks of the vanilla Vision Transformer by our designed multi-branch encoder blocks. A mutual-distillation strategy is proposed to transfer knowledge between the raw EEG data and its wavelets of different frequency bands. Experiments on two public EEG datasets demonstrated that our proposed MBMD Transformer outperformed several traditional machine learning and state-of-the-art deep learning approaches. To our knowledge, this is the first work on knowledge distillation for EEG-based seizure subtype classification.

[LG-64] Leveraging Generative Adversarial Networks for Addressing Data Imbalance in Financial Market Supervision

Link: https://arxiv.org/abs/2412.15222
Authors: Mohan Jiang,Yaxin Liang,Siyuan Han,Kunyuan Ma,Yuan Chen,Zhen Xu
Keywords: generative adversarial networks, financial market supervision, financial market data, financial market, generative adversarial
Categories: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
Comments:

Abstract:This study explores the application of generative adversarial networks (GANs) in financial market supervision, especially for solving the problem of data imbalance to improve the accuracy of risk prediction. Because financial market data are often imbalanced, with high-risk events such as market manipulation and systemic risk occurring infrequently, traditional models have difficulty identifying these minority events. This study proposes using a GAN to generate synthetic data with characteristics similar to these minority events in order to balance the dataset, thereby improving the prediction performance of the model in financial supervision. Experimental results show that, compared with traditional oversampling and undersampling methods, the data generated by the GAN has significant advantages in dealing with imbalance problems and improving the prediction accuracy of the model. This method has broad application potential in financial regulatory agencies such as the U.S. Securities and Exchange Commission (SEC), the Financial Industry Regulatory Authority (FINRA), the Federal Deposit Insurance Corporation (FDIC), and the Federal Reserve.

[LG-65] Improving the performance of weak supervision searches using data augmentation

Link: https://arxiv.org/abs/2412.00198
Authors: Zong-En Chen,Cheng-Wei Chiang,Feng-Yang Hsieh
Keywords: exploit signal properties, Weak supervision combines, Weak supervision, combines the advantages, ability to exploit
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments:

Abstract:Weak supervision combines the advantages of training on real data with the ability to exploit signal properties. However, training a neural network using weak supervision often requires an excessive amount of signal data, which severely limits its practical applicability. In this study, we propose addressing this limitation through data augmentation, increasing the training data's size and diversity. Specifically, we focus on physics-inspired data augmentation methods, such as $p_{\rm T}$ smearing and jet rotation. Our results demonstrate that data augmentation can significantly enhance the performance of weak supervision, enabling neural networks to learn efficiently from substantially less data.
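
To make the two augmentations named in this abstract more concrete, here is a small sketch of what they might look like on jet constituents. The abstract does not give the details, so everything specific is an assumption: the function names `smear_pt` and `rotate_jet`, the Gaussian multiplicative smearing with relative width `rel_sigma`, and the convention that `eta` and `phi` are measured relative to the jet axis.

```python
import numpy as np

def smear_pt(pt, rel_sigma=0.1, rng=None):
    """Multiply each transverse momentum by a Gaussian factor centred on 1.

    rel_sigma is a hypothetical relative resolution, not a value from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    factors = rng.normal(loc=1.0, scale=rel_sigma, size=np.shape(pt))
    # Clip at zero so a large downward fluctuation cannot yield negative pT.
    return np.clip(np.asarray(pt, dtype=float) * factors, 0.0, None)

def rotate_jet(eta, phi, angle):
    """Rotate constituents by `angle` (radians) about the jet axis.

    Assumes eta and phi are already given relative to the jet axis,
    so the rotation is about the origin of the eta-phi plane.
    """
    eta = np.asarray(eta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    c, s = np.cos(angle), np.sin(angle)
    return c * eta - s * phi, s * eta + c * phi

rng = np.random.default_rng(0)
pt_aug = smear_pt([50.0, 20.0, 5.0], rel_sigma=0.1, rng=rng)
eta_aug, phi_aug = rotate_jet([0.1, -0.2, 0.0], [0.0, 0.3, -0.1], rng.uniform(0, 2 * np.pi))
```

Each call produces a new variant of the same jet, so applying the functions several times per event is one way the training set could be enlarged.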

Information Retrieval

[IR-0] Legommenders: A Comprehensive Content-Based Recommendation Library with LLM Support

Link: https://arxiv.org/abs/2412.15973
Authors: Qijiong Liu,Lu Fan,Xiao-Ming Wu
Keywords: unique library designed, content understanding directly, encoders alongside behavior, interaction modules, content encoders alongside
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:We present Legommenders, a unique library designed for content-based recommendation that enables the joint training of content encoders alongside behavior and interaction modules, thereby facilitating the seamless integration of content understanding directly into the recommendation pipeline. Legommenders allows researchers to effortlessly create and analyze over 1,000 distinct models across 15 diverse datasets. Further, it supports the incorporation of contemporary large language models, both as feature encoder and data generator, offering a robust platform for developing state-of-the-art recommendation models and enabling more personalized and effective content delivery.

[IR-1] ASPIRE: Assistive System for Performance Evaluation in IR ECIR

Link: https://arxiv.org/abs/2412.15759
Authors: Georgios Peikos,Wojciech Kusa,Symeon Symeonidis
Keywords: presenting performance measures, Information Retrieval, Information, presenting performance, performance measures
Categories: Information Retrieval (cs.IR)
Comments: Accepted as a demo paper at the 47th European Conference on Information Retrieval (ECIR)

Abstract:Information Retrieval (IR) evaluation involves far more complexity than merely presenting performance measures in a table. Researchers often need to compare multiple models across various dimensions, such as the Precision-Recall trade-off and response time, to understand the reasons behind the varying performance of specific queries for different models. We introduce ASPIRE (Assistive System for Performance Evaluation in IR), a visual analytics tool designed to address these complexities by providing an extensive and user-friendly interface for in-depth analysis of IR experiments. ASPIRE supports four key aspects of IR experiment evaluation and analysis: single/multi-experiment comparisons, query-level analysis, query characteristics-performance interplay, and collection-based retrieval analysis. We showcase the functionality of ASPIRE using the TREC Clinical Trials collection. ASPIRE is an open-source toolkit available online: this https URL

[IR-2] PolySmart and VIREO @ TRECVid 2024 Ad-hoc Video Search

Link: https://arxiv.org/abs/2412.15494
Authors: Jiaxin Wu,Chong-Wah Ngo,Xiao-Yong Wei,Qing Li
Keywords: TRECVid AVS task, explore generation-augmented retrieval, AVS task, TRECVid AVS, explore generation-augmented
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:This year, we explore generation-augmented retrieval for the TRECVid AVS task. Specifically, the understanding of textual query is enhanced by three generations, including Text2Text, Text2Image, and Image2Text, to address the out-of-vocabulary problem. Using different combinations of them and the rank list retrieved by the original query, we submitted four automatic runs. For manual runs, we use a large language model (LLM) (i.e., GPT4) to rephrase test queries based on the concept bank of the search engine, and we manually check again to ensure all the concepts used in the rephrased queries are in the bank. The result shows that the fusion of the original and generated queries outperforms the original query on TV24 query sets. The generated queries retrieve different rank lists from the original query.
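
The abstract does not say how the rank lists from the original and generated queries are fused. As a stand-in, the sketch below uses reciprocal rank fusion (RRF), a common generic fusion method, which is not claimed to be what the authors used. The function name, the smoothing constant `k=60`, and the video IDs are all hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rank_lists, k=60):
    """Fuse several ranked lists of item IDs with reciprocal rank fusion.

    Each item scores sum(1 / (k + rank)) over the lists it appears in.
    RRF is a swapped-in technique here, not the paper's fusion method.
    """
    scores = defaultdict(float)
    for ranking in rank_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; break ties by item ID for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

original_query_run = ["v1", "v2", "v3"]    # hypothetical results for the original query
generated_query_run = ["v3", "v1", "v4"]   # hypothetical results for a generated query
fused = reciprocal_rank_fusion([original_query_run, generated_query_run])
```

Items that rank well in both lists rise to the top, while items found only by a generated query (such as `v4` here) still enter the fused list, which matches the abstract's observation that generated queries retrieve different rank lists.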

[IR-3] A Retrieval-Augmented Generation Framework for Academic Literature Navigation in Data Science

Link: https://arxiv.org/abs/2412.15404
Authors: Ahmet Yasin Aytar,Kemal Kilic,Kamer Kaya
Keywords: rapidly evolving field, enhanced Retrieval-Augmented Generation, efficiently navigating, rapidly evolving, navigating the expansive
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:In the rapidly evolving field of data science, efficiently navigating the expansive body of academic literature is crucial for informed decision-making and innovation. This paper presents an enhanced Retrieval-Augmented Generation (RAG) application, an artificial intelligence (AI)-based system designed to assist data scientists in accessing precise and contextually relevant academic resources. The AI-powered application integrates advanced techniques, including the GeneRation Of BIbliographic Data (GROBID) technique for extracting bibliographic information, fine-tuned embedding models, semantic chunking, and an abstract-first retrieval method, to significantly improve the relevance and accuracy of the retrieved information. This implementation of AI specifically addresses the challenge of academic literature navigation. A comprehensive evaluation using the Retrieval-Augmented Generation Assessment System (RAGAS) framework demonstrates substantial improvements in key metrics, particularly Context Relevance, underscoring the system’s effectiveness in reducing information overload and enhancing decision-making processes. Our findings highlight the potential of this enhanced Retrieval-Augmented Generation system to transform academic exploration within data science, ultimately advancing the workflow of research and innovation in the field.

[IR-4] Ranking Narrative Query Graphs for Biomedical Document Retrieval (Technical Report)

Link: https://arxiv.org/abs/2412.15232
Authors: Hermann Kroll,Pascal Sackhoff,Timo Breuer,Ralf Schenkel,Wolf-Tilo Balke
Keywords: Keyword-based searches, digital libraries, searches are today, today standard, standard in digital
Categories: Information Retrieval (cs.IR)
Comments: Technical Report of our accepted paper at AI4LAC@JCDL2024. 11 pages, 5 figures

Abstract:Keyword-based searches are today's standard in digital libraries. Yet complex retrieval scenarios, such as those in scientific knowledge bases, need more sophisticated access paths. Although each document somewhat contributes to a domain's body of knowledge, the exact structure between keywords, i.e., their possible relationships, and the contexts spanned within each single document will be crucial for effective retrieval. Following this logic, individual documents can be seen as small-scale knowledge graphs on which graph queries can provide focused document retrieval. We implemented a full-fledged graph-based discovery system for the biomedical domain and demonstrated its benefits in the past. Unfortunately, graph-based retrieval methods generally follow an 'exact match' paradigm, which severely hampers search efficiency, since exact match results are hard to rank by relevance. This paper extends our existing discovery system and contributes effective graph-based unsupervised ranking methods, a new query relaxation paradigm, and ontological rewriting. These extensions improve the system further so that users can retrieve results with higher precision and higher recall due to partial matching and ontological rewriting.

[IR-5] Building an Explainable Graph-based Biomedical Paper Recommendation System (Technical Report)

Link: https://arxiv.org/abs/2412.15229
Authors: Hermann Kroll,Christin K. Kreutz,Bill Matthias Thang,Philipp Schaer,Wolf-Tilo Balke
Keywords: access paths, provide different access, Digital libraries provide, allowing users, Abstract
Categories: Information Retrieval (cs.IR)
Comments: Technical Report of our accepted paper at AI4LAC@JCDL2024. 12 pages, 3 figures

Abstract:Digital libraries provide different access paths, allowing users to explore their collections. For instance, paper recommendation suggests literature similar to some selected paper. Their implementation is often cost-intensive, especially if neural methods are applied. Additionally, it is hard for users to understand or guess why a recommendation should be relevant for them. That is why we tackled the problem from a different perspective. We propose XGPRec, a graph-based and thus explainable method which we integrate into our existing graph-based biomedical discovery system. Moreover, we show that XGPRec (1) can, in terms of computational costs, manage a real digital library collection with 37M documents from the biomedical domain, (2) performs well on established test collections and concept-centric information needs, and (3) generates explanations that proved to be beneficial in a preliminary user study. We share our code so that user libraries can build upon XGPRec.

Attachment Download

Click to download today's full paper list