arXiv Daily Papers | 2025-01-13

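This post lists the latest papers retrieved from arXiv.org on 2025-01-13. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.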

Note: the daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

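[Quick read]: This paper addresses the poor performance of general-purpose automatic speech recognition (ASR) systems in goal-oriented dialogue, particularly when no prior user data is available and utterances vary lexically and syntactically. Existing ASR correction methods typically rely on prior user data or named entities; this paper instead proposes a context augmentation method that combines a large language model with a ranking strategy using task context from the dialogue state. The approach has two key parts: (1) ranking n-best ASR hypotheses by their lexical and semantic similarity with the context, and (2) ranking context by its phonetic correspondence with the ASR hypotheses. In tests with real users in the home improvement and cooking domains, the method improved correction recall and F1 by 34% and 16%, respectively, while maintaining precision and false positive rate, and user ratings also rose.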

Link: https://arxiv.org/abs/2501.06129
Authors: Yuya Asano, Sabit Hassan, Paras Sharma, Anthony Sicilia, Katherine Atwell, Diane Litman, Malihe Alikhani
Institutions: University of Pittsburgh; Northeastern University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to COLING 2025 Industry Track

Abstract:General-purpose automatic speech recognition (ASR) systems do not always perform well in goal-oriented dialogue. Existing ASR correction methods rely on prior user data or named entities. We extend correction to tasks that have no prior user data and exhibit linguistic flexibility such as lexical and syntactic variations. We propose a novel context augmentation with a large language model and a ranking strategy that incorporates contextual information from the dialogue states of a goal-oriented conversational AI and its tasks. Our method ranks (1) n-best ASR hypotheses by their lexical and semantic similarity with context and (2) context by phonetic correspondence with ASR hypotheses. Evaluated in home improvement and cooking domains with real-world users, our method improves recall and F1 of correction by 34% and 16%, respectively, while maintaining precision and false positive rate. Users rated .8-1 point (out of 5) higher when our correction method worked properly, with no decrease due to false positives.
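
The first ranking step, scoring n-best hypotheses against task context, can be sketched as follows. This is only a minimal illustration, not the authors' implementation: it uses character-level similarity from Python's `difflib` as a stand-in for the paper's lexical and semantic measures, and the n-best list, the context phrases, and the function names are all hypothetical.

```python
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] (a stand-in, not the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rerank_nbest(hypotheses: list[str], context_phrases: list[str]) -> list[tuple[str, float]]:
    """Score each ASR hypothesis by its best match against task context, highest first."""
    scored = [
        (hyp, max(lexical_similarity(hyp, ctx) for ctx in context_phrases))
        for hyp in hypotheses
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical n-best list and dialogue-state context for a cooking task.
nbest = ["add the time", "add the thyme", "at the time"]
context = ["add the thyme and stir", "preheat the oven"]
ranked = rerank_nbest(nbest, context)
print(ranked[0][0])
```

Here the hypothesis that overlaps most with a recipe step moves to the top of the list even though the recognizer ranked it second; a fuller version would add a semantic score and the phonetic ranking of context described in the abstract.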

[NLP-1] Merging Feed-Forward Sublayers for Compressed Transformers

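[Quick read]: This paper addresses how to deploy ever-larger deep learning models widely under hardware memory constraints. Conventional compression typically prunes less important parameters; this paper instead compresses a model by merging similar groups of parameters. Specifically, the authors select, align, and merge feed-forward sublayers in Transformer models and evaluate the method on language modeling, image classification, and machine translation. Performance stays comparable to the original models after combining more than a third of the feed-forward sublayers; for a Vision Transformer, over 21% of total parameters can be removed while retaining 99% of the original performance. The authors also observe that some groups of feed-forward sublayers have highly similar activations, which may help explain why they can be merged.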

Link: https://arxiv.org/abs/2501.06126
Authors: Neha Verma, Kenton Murray, Kevin Duh
Institutions: Center for Language and Speech Processing; Human Language Technology Center of Excellence; Johns Hopkins University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:With the rise and ubiquity of larger deep learning models, the need for high-quality compression techniques is growing in order to deploy these models widely. The sheer parameter count of these models makes it difficult to fit them into the memory constraints of different hardware. In this work, we present a novel approach to model compression by merging similar parameter groups within a model, rather than pruning away less important parameters. Specifically, we select, align, and merge separate feed-forward sublayers in Transformer models, and test our method on language modeling, image classification, and machine translation. With our method, we demonstrate performance comparable to the original models while combining more than a third of model feed-forward sublayers, and demonstrate improved performance over a strong layer-pruning baseline. For instance, we can remove over 21% of total parameters from a Vision Transformer, while maintaining 99% of its original performance. Additionally, we observe that some groups of feed-forward sublayers exhibit high activation similarity, which may help explain their surprising mergeability.

[NLP-2] Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding

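[Quick read]: This paper addresses the unreliability of automatic speech recognition (ASR) for low-resource languages, which stems from limited bimodal speech and text training data. It argues that better multilingual spoken language understanding (SLU) can make multilingual ASR more robust by using language semantics to compensate for scarce training data, for example by disambiguating utterances through context or exploiting semantic similarities across languages. The main contribution is Fleurs-SLU, a multilingual SLU benchmark covering topical speech classification in 102 languages and multiple-choice question answering through listening comprehension in 92 languages. The authors evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with classification by large language models. Cascaded systems prove more robust on multilingual SLU tasks, although appropriately pre-trained speech encoders reach competitive performance in topical speech classification. The study also finds a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefit between acoustic and semantic speech representations.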

Link: https://arxiv.org/abs/2501.06117
Authors: Fabian David Schmidt, Ivan Vulić, Goran Glavaš, David Ifeoluwa Adelani
Institutions: Center For Artificial Intelligence and Data Science, University of Würzburg, Germany; Language Technology Lab, University of Cambridge, United Kingdom; Mila - Quebec AI Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:While recent multilingual automatic speech recognition models claim to support thousands of languages, ASR for low-resource languages remains highly unreliable due to limited bimodal speech and text training data. Better multilingual spoken language understanding (SLU) can strengthen massively the robustness of multilingual ASR by levering language semantics to compensate for scarce training data, such as disambiguating utterances via context or exploiting semantic similarities across languages. Even more so, SLU is indispensable for inclusive speech technology in roughly half of all living languages that lack a formal writing system. However, the evaluation of multilingual SLU remains limited to shallower tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses topical speech classification in 102 languages and multiple-choice question answering through listening comprehension in 92 languages. We extensively evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with subsequent classification by large language models on Fleurs-SLU. Our results show that cascaded systems exhibit greater robustness in multilingual SLU tasks, though speech encoders can achieve competitive performance in topical speech classification when appropriately pre-trained. We further find a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.

[NLP-3] From Conversation to Automation: Leveraging Large Language Models to Analyze Strategies in Problem Solving Therapy

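[Quick read]: This paper examines how problem-solving therapy (PST) can be effectively automated as mental health care increasingly integrates technologies such as chatbots and large language models (LLMs). The study analyzes anonymized therapy transcripts and classifies therapeutic interventions using several LLMs and transformer-based models. GPT-4o achieved the highest accuracy (0.76) in identifying PST strategies, outperforming the other models. The study also introduces a new dimension of communication strategies that extends the current PST framework and offers deeper insight into therapist-client interactions. The results show the potential of LLMs to automate the analysis of complex therapeutic dialogue as a scalable, efficient tool for mental health interventions, and the annotation framework can improve the accessibility, effectiveness, and personalization of PST while supporting therapists with more precise, targeted interventions.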

Link: https://arxiv.org/abs/2501.06101
Authors: Elham Aghakhani, Lu Wang, Karla T. Washington, George Demiris, Jina Huh-Yoo, Rezvaneh Rezapour
Institutions: Drexel University; Stevens Institute of Technology; Washington University; University of Pennsylvania
Categories: Computation and Language (cs.CL)
Comments: 16 pages

Abstract:Problem-solving therapy (PST) is a structured psychological approach that helps individuals manage stress and resolve personal issues by guiding them through problem identification, solution brainstorming, decision-making, and outcome evaluation. As mental health care increasingly integrates technologies like chatbots and large language models (LLMs), understanding how PST can be effectively automated is important. This study leverages anonymized therapy transcripts to analyze and classify therapeutic interventions using various LLMs and transformer-based models. Our results show that GPT-4o achieved the highest accuracy (0.76) in identifying PST strategies, outperforming other models. Additionally, we introduced a new dimension of communication strategies that enhances the current PST framework, offering deeper insights into therapist-client interactions. This research demonstrates the potential of LLMs to automate complex therapeutic dialogue analysis, providing a scalable, efficient tool for mental health interventions. Our annotation framework can enhance the accessibility, effectiveness, and personalization of PST, supporting therapists in real-time with more precise, targeted interventions.

[NLP-4] Benchmarking Rotary Position Embeddings for Automatic Speech Recognition

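[Quick read]: This paper evaluates how well Rotary Position Embedding (RoPE) encodes positional information for automatic speech recognition (ASR). RoPE encodes relative and absolute positional information by applying rotation matrices to the input vectors in a sequence. It has performed well in natural language processing tasks, but its effectiveness in speech processing remains understudied. Across a range of ASR tasks, the experiments show that RoPE consistently achieves lower error rates than the relative positional embedding currently in wide use. To support further research, the authors release the implementation and all experimental recipes through the SpeechBrain toolkit.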

Link: https://arxiv.org/abs/2501.06051
Authors: Shucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya
Institutions: Samsung AI Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Rotary Position Embedding (RoPE) encodes relative and absolute positional information in Transformer-based models through rotation matrices applied to input vectors within sequences. While RoPE has demonstrated superior performance compared to other positional embedding technologies in natural language processing tasks, its effectiveness in speech processing applications remains understudied. In this work, we conduct a comprehensive evaluation of RoPE across diverse automatic speech recognition (ASR) tasks. Our experimental results demonstrate that for ASR tasks, RoPE consistently achieves lower error rates compared to the currently widely used relative positional embedding. To facilitate further research, we release the implementation and all experimental recipes through the SpeechBrain toolkit.
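
The rotation described in the abstract can be sketched in a few lines of plain Python. This is a generic illustration of RoPE, not the SpeechBrain implementation the authors released; the pairing of adjacent dimensions, the base of 10000, and the toy vectors are assumptions made for the example.

```python
import math


def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate each pair (x[i], x[i+1]) by the angle position * base ** (-i / d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (made-up values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The same offset (3) at different absolute positions gives the same score.
s1 = dot(rope(q, 5), rope(k, 2))
s2 = dot(rope(q, 105), rope(k, 102))
print(abs(s1 - s2) < 1e-9)
```

Each vector is rotated by its own absolute position, yet the query-key dot product depends only on the offset between the two positions, which is why RoPE is said to carry both absolute and relative information.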

[NLP-5] How to Tune a Multilingual Encoder Model for Germanic Languages: A Study of PEFT, Full Fine-Tuning, and Language Adapters

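[Quick read]: This paper investigates how best to adapt the multilingual encoder model mDeBERTa (multilingual DeBERTa) to tasks in three Germanic languages: German, Swedish, and Icelandic. These languages differ in how well they are represented in mDeBERTa's pre-training data and in the likely quality of that data. The study compares full fine-tuning with parameter-efficient fine-tuning (PEFT) methods, namely LoRA and Pfeiffer bottleneck adapters. PEFT is more effective for the higher-resource language, German, while results for Swedish and Icelandic are less consistent. Performance also differs by task: PEFT tends to work better for question answering, whereas full fine-tuning is preferable for named entity recognition. The paper also evaluates adding PEFT modules trained on unstructured text and finds that this modular approach is not beneficial.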

Link: https://arxiv.org/abs/2501.06025
Authors: Romina Oji, Jenny Kunz
Institutions: Linköping University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at NoDaLiDa Baltic-HLT 2025 Conference

Abstract:This paper investigates the optimal use of the multilingual encoder model mDeBERTa for tasks in three Germanic languages – German, Swedish, and Icelandic – representing varying levels of presence and likely data quality in mDeBERTa's pre-training data. We compare full fine-tuning with the parameter-efficient fine-tuning (PEFT) methods LoRA and Pfeiffer bottleneck adapters, finding that PEFT is more effective for the higher-resource language, German. However, results for Swedish and Icelandic are less consistent. We also observe differences between tasks: While PEFT tends to work better for question answering, full fine-tuning is preferable for named entity recognition. Inspired by previous research on modular approaches that combine task and language adapters, we evaluate the impact of adding PEFT modules trained on unstructured text, finding that this approach is not beneficial.
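
For readers unfamiliar with LoRA, one of the PEFT methods compared here, the core idea is a frozen weight plus a trainable low-rank update. The sketch below shows that idea on a toy matrix in plain Python; it is not the paper's training setup, and the matrix values, the rank of 1, and the `alpha` of 8 are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha: float):
    """Effective weight W + (alpha / r) * B @ A; W stays frozen, only A and B train."""
    r = len(a)  # rank = number of rows in A
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy 3x3 frozen weight with a rank-1 update (all values are made up).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 1.0]]             # shape r x d_in
B_zero = [[0.0], [0.0], [0.0]]     # shape d_out x r, zero at initialisation
B_trained = [[0.2], [0.0], [-0.1]]

unchanged = lora_weight(W, A, B_zero, alpha=8) == W  # no change before training
W_eff = lora_weight(W, A, B_trained, alpha=8)
trainable = len(A) * len(A[0]) + len(B_trained) * len(B_trained[0])
print(trainable, "trainable values vs", len(W) * len(W[0]), "frozen")
```

At realistic sizes the gap between trainable and frozen parameters is far larger than in this toy case, which is what makes the comparison with full fine-tuning interesting.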

[NLP-6] Constraining constructions with WordNet: pros and cons for the semantic annotation of fillers in the Italian Constructicon

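[Quick read]: This paper discusses the role of WordNet-based semantic classification in formalizing constructions, specifically in the semantic annotation of schematic fillers in the Italian Constructicon. Its central question is how Open Multilingual WordNet topics can represent the semantic features and constraints of constructions. The approach uses the WordNet multilingual semantic network to capture and annotate the semantic information of constructions systematically, providing semantic support for building the Italian Constructicon. This is intended to support the formal description of constructions and to make semantic annotation more accurate and consistent.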

Link: https://arxiv.org/abs/2501.05990
Authors: Flavio Pisciotta, Ludovica Pannitto, Lucia Busso, Beatrice Bernasconi, Francesca Masini
Institutions: University of Salerno; University of Bologna; Aston University; University of Turin
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The paper discusses the role of WordNet-based semantic classification in the formalization of constructions, and more specifically in the semantic annotation of schematic fillers, in the Italian Constructicon. We outline how the Italian Constructicon project uses Open Multilingual WordNet topics to represent semantic features and constraints of constructions.

[NLP-7] Addressing speaker gender bias in large scale speech translation systems

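[Quick read]: This paper addresses speaker gender bias in speech translation (ST) systems, which can lead to offensive and inaccurate translations. The masculine bias often found in large-scale ST systems is typically perpetuated through training data derived from machine translation (MT) systems. The proposed approach has two key steps: first, large language models (LLMs) are used to correct translations according to the speaker's gender at low cost; second, the ST model is fine-tuned on the corrected data so that it generates gender-specific translations directly from audio cues, without explicit gender input. The paper also proposes a three-mode fine-tuned model for cases where the speaker's gender is predefined or should not be inferred from speech cues. On the MuST-SHE test set, the authors report a 70% improvement in translations for female speakers compared with their baseline and other large-scale ST systems such as Seamless M4T and Canary.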

Link: https://arxiv.org/abs/2501.05989
Authors: Shubham Bansal, Vikas Joshi, Harveen Chadha, Rupeshkumar Mehta, Jinyu Li
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study addresses the issue of speaker gender bias in Speech Translation (ST) systems, which can lead to offensive and inaccurate translations. The masculine bias often found in large-scale ST systems is typically perpetuated through training data derived from Machine Translation (MT) systems. Our approach involves two key steps. First, we employ Large Language Models (LLMs) to rectify translations based on the speaker’s gender in a cost-effective manner. Second, we fine-tune the ST model with the corrected data, enabling the model to generate gender-specific translations directly from audio cues, without the need for explicit gender input. Additionally, we propose a three-mode fine-tuned model for scenarios where the speaker’s gender is either predefined or should not be inferred from speech cues. We demonstrate a 70% improvement in translations for female speakers compared to our baseline and other large-scale ST systems, such as Seamless M4T and Canary, on the MuST-SHE test set.

[NLP-8] Hermit Kingdom Through the Lens of Multiple Perspectives: A Case Study of LLM Hallucination on North Korea

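[Quick read]: This paper addresses hallucination in large language models (LLMs), focusing on the risk of generating misinformation when reliable data is scarce or credible sources are hard to determine. Using North Korea as a case study, it examines how some of the best-performing multilingual LLMs and language-specific models generate information about the country in three languages: English, Korean, and Mandarin Chinese. By evaluating the output across models and languages, the study shows that the choice of model and language significantly affects what is generated, which matters given the global security challenges the country poses.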

Link: https://arxiv.org/abs/2501.05981
Authors: Eunjung Cho, Won Ik Cho, Soomin Seo
Institutions: ETH Zurich; Seoul National University; Sogang University
Categories: Computation and Language (cs.CL)
Comments: Accepted at COLING 2025

Abstract:Hallucination in large language models (LLMs) remains a significant challenge for their safe deployment, particularly due to its potential to spread misinformation. Most existing solutions address this challenge by focusing on aligning the models with credible sources or by improving how models communicate their confidence (or lack thereof) in their outputs. While these measures may be effective in most contexts, they may fall short in scenarios requiring more nuanced approaches, especially in situations where access to accurate data is limited or determining credible sources is challenging. In this study, we take North Korea - a country characterised by an extreme lack of reliable sources and the prevalence of sensationalist falsehoods - as a case study. We explore and evaluate how some of the best-performing multilingual LLMs and specific language-based models generate information about North Korea in three languages spoken in countries with significant geo-political interests: English (United States, United Kingdom), Korean (South Korea), and Mandarin Chinese (China). Our findings reveal significant differences, suggesting that the choice of model and language can lead to vastly different understandings of North Korea, which has important implications given the global security challenges the country poses.

[NLP-9] Towards Early Prediction of Self-Supervised Speech Model Performance

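[Quick read]: This paper addresses the high resource cost of pre-training and evaluating self-supervised learning (SSL) models for speech. Current indicators of model quality during pre-training, such as the loss, correlate poorly with downstream performance, so it is hard to estimate final downstream performance cheaply while pre-training is under way. The paper proposes efficient unsupervised methods that assess pre-training quality by measuring the cluster quality and rank of the SSL model's embeddings. These measures correlate better with downstream performance than the pre-training loss and need only one hour of unlabeled audio, reducing the GPU hours and labeled data required for evaluation.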

Link: https://arxiv.org/abs/2501.05966
Authors: Ryan Whetten, Lucas Maison, Titouan Parcollet, Marco Dinarelli, Yannick Estève
Institutions: Laboratoire Informatique d’Avignon; Laboratoire d’Informatique de Grenoble
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In Self-Supervised Learning (SSL), pre-training and evaluation are resource intensive. In the speech domain, current indicators of the quality of SSL models during pre-training, such as the loss, do not correlate well with downstream performance. Consequently, it is often difficult to gauge the final downstream performance in a cost efficient manner during pre-training. In this work, we propose unsupervised efficient methods that give insights into the quality of the pre-training of SSL speech models, namely, measuring the cluster quality and rank of the embeddings of the SSL model. Results show that measures of cluster quality and rank correlate better with downstream performance than the pre-training loss with only one hour of unlabeled audio, reducing the need for GPU hours and labeled data in SSL model evaluation.

[NLP-10] Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations

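[Quick read]: This paper addresses how to machine translate (MT) datasets that carry span-level annotation, in particular question answering (QA) datasets. The key idea is to use the DeepL MT service and its ability to translate formatted documents to produce a Finnish version of the SQuAD2.0 dataset. The authors then train QA retriever models on the new dataset. They evaluate the method through direct evaluation of translation quality, indirect comparison with similar datasets, a backtranslation experiment, and the performance of downstream QA models, and find it both simple to use and consistently better in the quality of the translated data. The method is therefore likely to carry over to other span-annotated datasets for other tasks and languages.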

Link: https://arxiv.org/abs/2501.05963
Authors: Emil Nuutinen, Iiro Rastas, Filip Ginter
Institutions: TurkuNLP, Department of Computing, University of Turku
Categories: Computation and Language (cs.CL)
Comments: NoDaLiDa 2025

Abstract:We apply a simple method to machine translate datasets with span-level annotation using the DeepL MT service and its ability to translate formatted documents. Using this method, we produce a Finnish version of the SQuAD2.0 question answering dataset and train QA retriever models on this new dataset. We evaluate the quality of the dataset and more generally the MT method through direct evaluation, indirect comparison to other similar datasets, a backtranslation experiment, as well as through the performance of downstream trained QA models. In all these evaluations, we find that the method of transfer is not only simple to use but produces consistently better translated data. Given its good performance on the SQuAD dataset, it is likely the method can be used to translate other similar span-annotated datasets for other tasks and languages as well. All code and data is available under an open license: data at HuggingFace TurkuNLP/squad_v2_fi, code on GitHub TurkuNLP/squad2-fi, and model at HuggingFace TurkuNLP/bert-base-finnish-cased-squad2.
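
One way to picture translating span annotations through formatted text is to wrap the answer span in markup before translation and read the span back from the markup afterwards. The sketch below illustrates that idea only; the `<b>` tag, the helper names, and the hard-coded "translated" string are assumptions, no MT service is called, and the authors' actual pipeline may differ.

```python
import re

TAG_OPEN, TAG_CLOSE = "<b>", "</b>"  # hypothetical choice of markup


def mark_span(context: str, start: int, answer: str) -> str:
    """Wrap the answer span in markup before sending the text to MT."""
    end = start + len(answer)
    assert context[start:end] == answer, "offset does not match answer text"
    return context[:start] + TAG_OPEN + answer + TAG_CLOSE + context[end:]


def recover_span(translated: str):
    """Strip the markup from translated text and return (text, start, answer)."""
    pattern = re.escape(TAG_OPEN) + r"(.*?)" + re.escape(TAG_CLOSE)
    match = re.search(pattern, translated, re.S)
    if match is None:
        return None  # markup lost in translation; such items need separate handling
    answer = match.group(1)
    start = match.start()
    text = translated[:start] + answer + translated[match.end():]
    return text, start, answer


marked = mark_span("Turku is a city in Finland.", 19, "Finland")
# Stand-in for the MT output; a real pipeline would call an MT service here.
translated = "Turku on kaupunki <b>Suomessa</b>."
text, start, answer = recover_span(translated)
print(text, start, answer)
```

The appeal of this route is that the answer offset never has to be re-aligned by hand: the span boundaries travel with the text and are recomputed from the surviving markup.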

[NLP-11] Effective faking of verbal deception detection with target-aligned adversarial attacks

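[Quick read]: This paper examines deception detection through language analysis and the threat that automated adversarial attacks pose to it. It focuses on rewriting deceptive statements so that they appear truthful, and on how this affects judgments by humans and by machine learning models. A large language model was used to produce adversarial modifications of deceptive statements, and the effect on both kinds of judgment was measured. When the modifications were aligned with their target (human or machine), detection accuracy dropped to chance level; when they were not aligned, judgments remained significantly better than chance. The paper thus shows that robustness to adversarial attacks depends on target alignment, and it closes with suggestions for advancing deception research with adversarial attack designs.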

Link: https://arxiv.org/abs/2501.05962
Authors: Bennett Kleinberg, Riccardo Loconte, Bruno Verschuere
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Background: Deception detection through analysing language is a promising avenue using both human judgments and automated machine learning judgments. For both forms of credibility assessment, automated adversarial attacks that rewrite deceptive statements to appear truthful pose a serious threat. Methods: We used a dataset of 243 truthful and 262 fabricated autobiographical stories in a deception detection task for humans and machine learning models. A large language model was tasked to rewrite deceptive statements so that they appear truthful. In Study 1, humans who made a deception judgment or used the detailedness heuristic and two machine learning models (a fine-tuned language model and a simple n-gram model) judged original or adversarial modifications of deceptive statements. In Study 2, we manipulated the target alignment of the modifications, i.e. tailoring the attack to whether the statements would be assessed by humans or computer models. Results: When adversarial modifications were aligned with their target, human (d=-0.07 and d=-0.04) and machine judgments (51% accuracy) dropped to the chance level. When the attack was not aligned with the target, both human heuristics judgments (d=0.30 and d=0.36) and machine learning predictions (63-78%) were significantly better than chance. Conclusions: Easily accessible language models can effectively help anyone fake deception detection efforts both by humans and machine learning models. Robustness against adversarial modifications for humans and machines depends on that target alignment. We close with suggestions on advancing deception research with adversarial attack designs.

[NLP-12] Scalable Vision Language Model Training via High Quality Data Curation

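[Quick read]: This paper addresses the scalability of vision language model (VLM) training in three areas: high-quality data construction, pretraining, and instruction fine-tuning (SFT). First, a scalable pipeline for constructing high-quality visual understanding data is used to build SAIL-Caption, a large-scale caption dataset that the authors report is large in quantity and higher in quality than open-source caption datasets. Second, the pretraining budget is scaled up to 131B tokens, showing that even a 2B-parameter VLM benefits from more training data and follows the expected data-size scaling laws in visual understanding and instruction following. Third, the paper proposes quantity and quality scaling for instruction data, using a multi-stage training recipe with curriculum learning to move the performance scaling curve from logarithmic to near-linear. With these improvements, SAIL-VL obtains the highest average score across 19 commonly used benchmarks in the authors' evaluation and the top performance among VLMs of comparable size on OpenCompass.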

Link: https://arxiv.org/abs/2501.05952
Authors: Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, Jiao Ran
Institutions: ByteDance Douyin Content Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision language model (VLM) of state-of-the-art (SOTA) performance with 2B parameters. We introduce three key improvements that contribute to SAIL-VL’s leading performance: (1) Scalable high-quality visual understanding data construction: We implement a visual understanding data construction pipeline, which enables hundred-million-scale high-quality recaption data annotation. Equipped with this pipeline, we curate SAIL-Caption, a large-scale caption dataset with large quantity and the highest data quality compared with opensource caption datasets. (2) Scalable Pretraining with High-Quality Visual Understanding Data: We scale SAIL-VL’s pretraining budget up to 131B tokens and show that even a 2B VLM benefits from scaled up training data sizes, exhibiting expected data size scaling laws in visual understanding and instruction following performance. (3) Scalable SFT via quantity and quality scaling: We introduce general guidance for instruction data curation to scale up instruction data continuously, allowing us to construct a large SFT dataset with the highest quality. To further improve SAIL-VL’s performance, we propose quality scaling, a multi-stage training recipe with curriculum learning, to improve model performance scaling curves w.r.t. data sizes from logarithmic to be near-linear. SAIL-VL obtains the highest average score in 19 commonly used benchmarks in our evaluation and achieves top1 performance among VLMs of comparable sizes on OpenCompass (this https URL). We release our SAIL-VL-2B model at HuggingFace (this https URL).

[NLP-13] Universal-2-TF: Robust All-Neural Text Formatting for ASR

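[Quick read]: This paper addresses text formatting (TF) for commercial automatic speech recognition (ASR) systems, covering punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Traditional rule-based or hybrid approaches lack flexibility and robustness across diverse linguistic entities and text domains. The paper proposes an all-neural solution with a two-stage architecture made up of a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design lowers computational cost and reduces hallucinations while remaining flexible and robust across linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the method shows superior TF accuracy, computational efficiency, and perceptual quality in objective and subjective evaluations, underscoring the value of holistic TF models for practical ASR use.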

Link: https://arxiv.org/abs/2501.05948
Authors: Yash Khare, Taufiquzzaman Peyash, Andrea Vanzo, Takuya Yoshioka
Institutions: AssemblyAI Inc.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, this method leverages a two-stage neural architecture comprising a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational costs and reduces hallucinations while ensuring flexibility and robustness across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations using both objective and subjective methods. This work underscores the importance of holistic TF models in enhancing ASR usability in practical settings.

[NLP-14] LLMs Reproduce Stereotypes of Sexual and Gender Minorities

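[Quick read]: This paper addresses gender bias in natural language processing (NLP) systems, specifically bias against sexual and gender minorities beyond binary categories. Most existing research reduces gender to the two categories of men and women and overlooks the diversity of gender and sexual identities. Grounding the study in the Stereotype Content Model from psychology, the authors examine how large language models (LLMs) portray these groups. They find that English-language survey questions about social perceptions elicit more negative stereotypes of sexual and gender minorities from LLMs, just as they do from humans. Extending the framework to a more realistic use case, text generation, they find that LLMs also produce stereotyped representations there, raising concern that such models could amplify representational harms in applications such as creative writing.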

Link: https://arxiv.org/abs/2501.05926
Authors: Ruby Ostrow, Adam Lopez
Institutions: University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 8 figures, 6 tables

Abstract:A large body of research has found substantial gender bias in NLP systems. Most of this research takes a binary, essentialist view of gender: limiting its variation to the categories men and women, conflating gender with sex, and ignoring different sexual identities. But gender and sexuality exist on a spectrum, so in this paper we study the biases of large language models (LLMs) towards sexual and gender minorities beyond binary categories. Grounding our study in a widely used psychological framework – the Stereotype Content Model – we demonstrate that English-language survey questions about social perceptions elicit more negative stereotypes of sexual and gender minorities from LLMs, just as they do from humans. We then extend this framework to a more realistic use case: text generation. Our analysis shows that LLMs generate stereotyped representations of sexual and gender minorities in this setting, raising concerns about their capacity to amplify representational harms in creative writing, a widely promoted use case.

[NLP-15] Navigating Tomorrow: Reliably Assessing Large Language Models Performance on Future Event Prediction

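[Quick read]: This paper evaluates how well large language models (LLMs) support future prediction tasks, an under-explored area. Using a dataset built from news articles, it tests and compares LLMs in three scenarios: Affirmative vs. Likelihood questioning, Reasoning, and Counterfactual analysis. The key design choice is to gather news articles from both before and after the LLMs' training cutoff dates, so that model performance can be compared systematically across prediction tasks. The results highlight the potential and the limitations of LLMs in predictive modeling and provide a basis for future improvements.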

Link: https://arxiv.org/abs/2501.05925
Authors: Petraq Nako, Adam Jatowt
Institutions: University of Innsbruck
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Predicting future events is an important activity with applications across multiple fields and domains. For example, the capacity to foresee stock market trends, natural disasters, business developments, or political events can facilitate early preventive measures and uncover new opportunities. Multiple diverse computational methods for attempting future predictions, including predictive analysis, time series forecasting, and simulations have been proposed. This study evaluates the performance of several large language models (LLMs) in supporting future prediction tasks, an under-explored domain. We assess the models across three scenarios: Affirmative vs. Likelihood questioning, Reasoning, and Counterfactual analysis. For this, we create a dataset by finding and categorizing news articles based on entity type and its popularity. We gather news articles before and after the LLMs' training cutoff date in order to thoroughly test and compare model performance. Our research highlights LLMs' potential and limitations in predictive modeling, providing a foundation for future improvements.

[NLP-16] Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs

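[Quick read]: This paper studies how large language models (LLMs) answer multiple-choice questions (MCQs) in education and how affordable such models are for educators and students. The focus is on how hardware constraints and refinement techniques affect answer accuracy. Generic pre-trained LLMs (the 7B, 13B, and 70B variants of LLaMA-2) are used to answer 162 undergraduate-level MCQs from a course on Programming Languages. The main finding is that smaller models fine-tuned on the course textbook outperform larger generic models, which suggests that, with fine-tuning and quantisation, using LLMs to answer MCQs can be affordable in terms of both resources and material.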

Link: https://arxiv.org/abs/2501.05891
Authors: Bianca Raimondi, Saverio Giallorenzo, Maurizio Gabbrielli
Institutions: University of Bologna; INRIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The 40th ACM/SIGAPP Symposium On Applied Computing

Abstract:In education, the capability of generating human-like text of Large Language Models (LLMs) inspired work on how they can increase the efficiency of learning and teaching. We study the affordability of these models for educators and students by investigating how LLMs answer multiple-choice questions (MCQs) with respect to hardware constraints and refinement techniques. We explore this space by using generic pre-trained LLMs (the 7B, 13B, and 70B variants of LLaMA-2) to answer 162 undergraduate-level MCQs from a course on Programming Languages (PL) – the MCQ dataset is a contribution of this work, which we make publicly available. Specifically, we dissect how different factors, such as using readily-available material – (parts of) the course’s textbook – for fine-tuning and quantisation (to decrease resource usage) can change the accuracy of the responses. The main takeaway is that smaller textbook-based fine-tuned models outperform generic larger ones (whose pre-training requires conspicuous resources), making the usage of LLMs for answering MCQs resource- and material-wise affordable.

[NLP-17] VideoRAG: Retrieval-Augmented Generation over Video Corpus

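[Quick read]: This paper addresses a limitation of existing Retrieval-Augmented Generation (RAG) methods in handling multimodal knowledge: they largely overlook video. Existing RAG approaches focus mainly on text, with a few recent studies beginning to consider images, even though video is a rich multimodal source that can represent events, processes, and contextual details more effectively. Prior work either predefines the videos associated with a query rather than retrieving them, or converts videos into textual descriptions and so loses their multimodal richness. The proposed VideoRAG framework dynamically retrieves videos relevant to the query and uses both their visual and textual information during generation. It builds on Large Video Language Models (LVLMs), which process video content directly, both to represent it for retrieval and to integrate the retrieved videos with the query. Experiments show that VideoRAG outperforms relevant baselines.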

Link: https://arxiv.org/abs/2501.05874
Authors: Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
Institutions: KAIST; DeepAuto.ai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) is a powerful strategy to address the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into their generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into the textual descriptions without harnessing their multimodal richness. To tackle these, we introduce VideoRAG, a novel framework that not only dynamically retrieves relevant videos based on their relevance with queries but also utilizes both visual and textual information of videos in the output generation. Further, to operationalize this, our method revolves around the recent advance of Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and seamless integration of the retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.
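
The retrieval half of such a framework can be pictured as a nearest-neighbour search over video embeddings. The sketch below is a generic cosine-similarity retriever, not the VideoRAG implementation: the three-dimensional vectors stand in for embeddings that an LVLM would produce, and the video names and helper functions are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def retrieve(query_vec: list[float], video_index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k videos whose embeddings are closest to the query."""
    ranked = sorted(
        video_index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [video_id for video_id, _ in ranked[:k]]


# Made-up embeddings; a real system would obtain these from a video-language model.
index = {
    "knot_tutorial": [0.9, 0.1, 0.0],
    "cooking_pasta": [0.1, 0.9, 0.2],
    "bike_repair": [0.7, 0.0, 0.6],
}
top = retrieve([1.0, 0.0, 0.1], index, k=2)
print(top)
```

In the full framework the retrieved videos, not just their ids, would then be passed to the generator together with the query.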

[NLP-18] ConSim: Measuring Concept-Based Explanations Effectiveness with Automated Simulatability

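【Quick read】: This paper addresses the problem of evaluating concept-based explanations for model interpretation. Existing evaluation metrics usually consider only the quality of the concept space and ignore how effectively the concepts are communicated to users. To address this, the paper proposes a framework that evaluates concept explanations through automated simulatability, that is, a simulator's ability to predict the explained model's outputs from the provided explanations. The approach accounts for both the concept space and its interpretation in an end-to-end evaluation. Because large-scale human studies are hard to run, the paper proposes using large language models (LLMs) as simulators to approximate the evaluation and reports several analyses to make this approximation reliable. The method enables scalable and consistent evaluation across models and datasets.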

Link: https://arxiv.org/abs/2501.05855
Authors: Antonin Poché,Alon Jacovi,Agustin Martin Picard,Victor Boutin(CERCO, ANITI),Fanny Jourdan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Concept-based explanations work by mapping complex model computations to human-understandable concepts. Evaluating such explanations is very difficult, as it includes not only the quality of the induced space of possible concepts but also how effectively the chosen concepts are communicated to users. Existing evaluation metrics often focus solely on the former, neglecting the latter. We introduce an evaluation framework for measuring concept explanations via automated simulatability: a simulator’s ability to predict the explained model’s outputs based on the provided explanations. This approach accounts for both the concept space and its interpretation in an end-to-end evaluation. Human studies for simulatability are notoriously difficult to enact, particularly at the scale of a wide, comprehensive empirical evaluation (which is the subject of this work). We propose using large language models (LLMs) as simulators to approximate the evaluation and report various analyses to make such approximations reliable. Our method allows for scalable and consistent evaluation across various models and datasets. We report a comprehensive empirical evaluation using this framework and show that LLMs provide consistent rankings of explanation methods. Code available at this https URL
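The simulatability idea in this abstract can be illustrated with a small sketch. It assumes one common way of scoring simulatability, the simulator's accuracy with explanations minus its accuracy without them; the paper's exact protocol may differ. The names `simulatability_gain` and `toy_simulator` are hypothetical, and the toy simulator stands in for an LLM call.

```python
from typing import Callable, Sequence


def simulatability_gain(
    inputs: Sequence[str],
    model_outputs: Sequence[str],
    simulate: Callable[[str, bool], str],
) -> float:
    """Simulator accuracy with explanations minus accuracy without them."""
    if not inputs or len(inputs) != len(model_outputs):
        raise ValueError("inputs and model_outputs must be non-empty and aligned")

    def accuracy(with_explanation: bool) -> float:
        # Count how often the simulator reproduces the explained model's output.
        hits = sum(
            simulate(x, with_explanation) == y
            for x, y in zip(inputs, model_outputs)
        )
        return hits / len(inputs)

    return accuracy(True) - accuracy(False)


# Toy stand-in for an LLM simulator (an assumption for illustration): with an
# explanation it recovers the explained model's output, without one it always
# guesses "pos".
EXPLAINED = {"a": "pos", "b": "neg", "c": "pos", "d": "neg"}


def toy_simulator(text: str, with_explanation: bool) -> str:
    return EXPLAINED[text] if with_explanation else "pos"


gain = simulatability_gain(list(EXPLAINED), list(EXPLAINED.values()), toy_simulator)
```

A positive gain suggests the explanations help the simulator predict the model; a gain near zero suggests they communicate little.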

[NLP-19] IndoNLP 2025: Shared Task on Real-Time Reverse Transliteration for Romanized Indo-Aryan languages

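【Quick read】: This paper addresses the inaccuracy caused by ad-hoc transliteration when typing Romanized Indo-Aryan languages; for low-resource languages in particular, conversion to native scripts is complex and error-prone. The paper presents and evaluates real-time reverse transliterators that convert Romanized Indo-Aryan text to native scripts as the user types, improving the typing experience. The key to the solution is developing and evaluating transliteration models for Sinhala, Hindi, and Malayalam, which not only address ad-hoc transliteration but also improve the usability of low-resource languages in the digital arena.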

Link: https://arxiv.org/abs/2501.05816
Authors: Deshan Sumanathilaka,Isuri Anuradha,Ruvan Weerasinghe,Nicholas Micallef,Julian Hough
Affiliations: Swansea University, Wales, UK; Lancaster University, UK; Informatics Institute of Technology, Sri Lanka
Categories: Computation and Language (cs.CL)
Comments: 7 Pages, 1 Figure, 3 Tables

Abstract:The paper overviews the shared task on Real-Time Reverse Transliteration for Romanized Indo-Aryan languages. It focuses on the reverse transliteration of low-resourced languages in the Indo-Aryan family to their native scripts. Typing Romanized Indo-Aryan languages using ad-hoc transliterals and achieving accurate native scripts are complex and often inaccurate processes with the current keyboard systems. This task aims to introduce and evaluate a real-time reverse transliterator that converts Romanized Indo-Aryan languages to their native scripts, improving the typing experience for users. Out of 11 registered teams, four teams participated in the final evaluation phase with transliteration models for Sinhala, Hindi and Malayalam. These proposed solutions not only solve the issue of ad-hoc transliteration but also empower low-resource language usability in the digital arena.

[NLP-20] Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

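【Quick read】: This paper addresses the challenge of precise grounding in complex multi-image scenarios for Multimodal Large Language Models (MLLMs). Existing MLLMs have made notable progress in fine-grained perception of single images and general comprehension across multiple images, but their grounding ability in multi-image scenarios remains limited. The paper first explores a Chain-of-Thought (CoT) framework that combines single-image grounding with multi-image comprehension. Because that framework is not end-to-end, its results are unstable and it struggles to capture abstract visual information. The paper therefore proposes Migician, the first model capable of free-form and accurate grounding across multiple images. To support the model, the paper introduces the MGrounding-630k dataset, which contains multi-image grounding task data derived from existing datasets together with newly generated free-form grounding instruction-following data. It also proposes MIG-Bench, a comprehensive benchmark designed specifically to evaluate multi-image grounding. Experiments show that Migician clearly outperforms existing MLLMs in multi-image grounding and even surpasses much larger 70B models.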

Link: https://arxiv.org/abs/2501.05767
Authors: You Li,Heyu Huang,Chi Chen,Kaiyu Huang,Chao Huang,Zonghao Guo,Zhiyuan Liu,Jinan Xu,Yuhua Li,Ruixuan Li,Maosong Sun
Affiliations: State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University; Huazhong University of Science and Technology; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures

Abstract:The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced.

[NLP-21] Controlling Large Language Models Through Concept Activation Vectors

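【Quick read】: This paper addresses how to control the generated content of large language models (LLMs) effectively now that they are widely deployed. Existing controlled-generation methods either require substantial compute and repeated trial and error, or offer only coarse-grained control. The proposed solution is Generation with Concept Activation Vector (GCAV), a lightweight model control framework that achieves precise control without resource-intensive fine-tuning. GCAV trains a concept activation vector for a specified concept, such as toxicity, and steers that vector at inference time by adjusting activations in the LLM's layers, which gives fine-grained control. Experiments show that GCAV achieves state-of-the-art performance on toxicity reduction, sentiment control, linguistic style, and topic control, and allows fine adjustment of both the steering layers and the steering magnitude.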

Link: https://arxiv.org/abs/2501.05764
Authors: Hanyu Zhang,Xiting Wang,Chengao Li,Xiang Ao,Qing He
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As large language models (LLMs) are widely deployed across various domains, the ability to control their generated outputs has become more critical. This control involves aligning LLMs outputs with human values and ethical principles or customizing LLMs on specific topics or styles for individual users. Existing controlled generation methods either require significant computational resources and extensive trial-and-error or provide coarse-grained control. In this paper, we propose Generation with Concept Activation Vector (GCAV), a lightweight model control framework that ensures accurate control without requiring resource-extensive fine-tuning. Specifically, GCAV first trains a concept activation vector for specified concepts to be controlled, such as toxicity. During inference, GCAV steers the concept vector in LLMs, for example, by removing the toxicity concept vector from the activation layers. Control experiments from different perspectives, including toxicity reduction, sentiment control, linguistic style, and topic control, demonstrate that our framework achieves state-of-the-art performance with granular control, allowing for fine-grained adjustments of both the steering layers and the steering magnitudes for individual samples.
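The phrase "removing the toxicity concept vector from the activation layers" can be pictured as a vector projection. The sketch below is a simplified illustration, not the GCAV algorithm itself: it assumes the concept direction is already known and subtracts a scaled projection of the activation onto it. The function name `remove_concept` and the `strength` parameter are hypothetical.

```python
from typing import List


def remove_concept(
    activation: List[float],
    concept_vector: List[float],
    strength: float = 1.0,
) -> List[float]:
    """Subtract `strength` times the projection of `activation` onto `concept_vector`."""
    norm_sq = sum(v * v for v in concept_vector)
    if norm_sq == 0.0:
        raise ValueError("concept_vector must be non-zero")
    dot = sum(a * v for a, v in zip(activation, concept_vector))
    scale = strength * dot / norm_sq
    # strength=1.0 removes the concept component entirely; smaller values
    # only attenuate it, which mirrors the idea of a steering magnitude.
    return [a - scale * v for a, v in zip(activation, concept_vector)]


hidden = [3.0, 4.0]          # toy activation
toxicity_dir = [1.0, 0.0]    # toy concept direction (assumed given)
steered = remove_concept(hidden, toxicity_dir)
```

With `strength=1.0` the steered activation is orthogonal to the concept direction; in a real model this would be applied per layer and per token.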

[NLP-22] Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models

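【Quick read】: This paper addresses the computational inefficiency and redundancy of existing large language model (LLM) methods for multi-step reasoning. Existing methods overlook the diversity of task difficulty, which leads to unnecessarily extensive search even on easy tasks, and they ignore the semantics of reasoning paths, which leads to redundant exploration of semantically identical paths. To address this, the paper proposes Semantic Exploration with Adaptive Gating (SEAG). At its core is an adaptive gating mechanism that decides dynamically whether to run a tree search, based on the answer confidence of a preceding simple reasoning method. In addition, SEAG's tree-based exploration consolidates semantically identical reasoning steps, reducing redundant exploration while maintaining or even improving accuracy. Experiments show that on complex reasoning benchmarks such as GSM8K and ARC, SEAG improves accuracy by 4.3% on average while requiring only 31% of the computational cost of existing tree-search methods.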

Link: https://arxiv.org/abs/2501.05752
Authors: Sungjae Lee,Hyejin Park,Jaechang Kim,Jungseul Ok
Affiliations: Department of Computer Science and Engineering, POSTECH, South Korea; Graduate School of Artificial Intelligence, POSTECH, South Korea
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in large language models (LLMs) have shown remarkable potential in various complex tasks requiring multi-step reasoning methods like tree search to explore diverse reasoning paths. However, existing methods often suffer from computational inefficiency and redundancy. First, they overlook the diversity of task difficulties, leading to unnecessarily extensive searches even for easy tasks. Second, they neglect the semantics of reasoning paths, resulting in redundant exploration of semantically identical paths. To address these limitations, we propose Semantic Exploration with Adaptive Gating (SEAG), a computationally efficient method. SEAG employs an adaptive gating mechanism that dynamically decides whether to conduct a tree search, based on the confidence level of answers from a preceding simple reasoning method. Furthermore, its tree-based exploration consolidates semantically identical reasoning steps, reducing redundant explorations while maintaining or even improving accuracy. Our extensive experiments demonstrate that SEAG significantly improves accuracy by 4.3% on average while requiring only 31% of computational costs compared to existing tree search-based methods on complex reasoning benchmarks including GSM8K and ARC with diverse language models such as Llama2, Llama3, and Mistral.
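A minimal sketch of the adaptive gating idea follows. It assumes confidence is measured as the agreement rate among several sampled answers from the simple method, and that the gate uses a fixed threshold of 0.8; both choices are illustrative assumptions, not details confirmed by the abstract. The names `gated_answer` and `tree_search` are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def gated_answer(
    sampled_answers: Sequence[str],
    tree_search: Callable[[], str],
    threshold: float = 0.8,
) -> Tuple[str, bool]:
    """Return (answer, used_tree_search)."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(sampled_answers).most_common(1)[0]
    confidence = count / len(sampled_answers)
    if confidence >= threshold:
        # Easy case: the cheap method is confident, so skip the search.
        return answer, False
    # Hard case: fall back to the expensive tree search.
    return tree_search(), True


easy = gated_answer(["4", "4", "4", "4", "5"], tree_search=lambda: "9")
hard = gated_answer(["4", "5", "6", "4", "7"], tree_search=lambda: "9")
```

The saving comes from the first branch: every question answered there avoids the full search cost.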

[NLP-23] Bridging Dialects: Translating Standard Bangla to Regional Variants Using Neural Models

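【Quick read】: This paper addresses translation between standard Bangla and its regional dialects, specifically those of Chittagong, Sylhet, Barishal, Noakhali, and Mymensingh. These dialects differ markedly in vocabulary, pronunciation, and sentence structure, and they are underrepresented in technological applications. The study fine-tunes neural machine translation (NMT) models, including BanglaT5, mT5, and mBART50, on the "Vashantor" dataset (32,500 sentences across dialects) to translate standard Bangla into the regional dialects. Model performance is evaluated with Character Error Rate (CER) and Word Error Rate (WER). BanglaT5 performs best, with a CER of 12.3% and a WER of 15.7%, indicating that it captures dialectal nuances effectively. The results contribute to inclusive language technologies that support regional dialects and promote linguistic diversity.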

Link: https://arxiv.org/abs/2501.05749
Authors: Md. Arafat Alam Khandaker,Ziyan Shirin Raha,Bidyarthi Paul,Tashreef Muhammad
Affiliations: Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh
Categories: Computation and Language (cs.CL)
Comments: Accepted in 2024 27th International Conference on Computer and Information Technology (ICCIT)

Abstract:The Bangla language includes many regional dialects, adding to its cultural richness. Translating standard Bangla into regional dialects is challenging because of significant variations in vocabulary, pronunciation, and sentence structure across regions like Chittagong, Sylhet, Barishal, Noakhali, and Mymensingh. These dialects, though vital to local identities, lack representation in technological applications. This study addresses this gap by translating standard Bangla into these dialects using neural machine translation (NMT) models, including BanglaT5, mT5, and mBART50. The work is motivated by the need to preserve linguistic diversity and improve communication among dialect speakers. The models were fine-tuned using the “Vashantor” dataset, containing 32,500 sentences across various dialects, and evaluated through Character Error Rate (CER) and Word Error Rate (WER) metrics. BanglaT5 demonstrated superior performance with a CER of 12.3% and WER of 15.7%, highlighting its effectiveness in capturing dialectal nuances. The outcomes of this research contribute to the development of inclusive language technologies that support regional dialects and promote linguistic diversity.
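For readers unfamiliar with the two metrics, here is a minimal sketch of CER and WER using the standard definition: Levenshtein edit distance divided by the reference length, over characters and whitespace-separated words respectively. The paper's own evaluation scripts are not shown in the abstract, so the tokenization here is an assumption.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits per reference character."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both functions expect a non-empty reference. Lower is better for both metrics, and either can exceed 1.0 when the hypothesis is much longer than the reference.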

[NLP-24] Enabling Scalable Oversight via Self-Evolving Critic

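【Quick read】: This paper addresses a key challenge in scalable oversight of large language models (LLMs): providing effective feedback on tasks where human evaluation is difficult or where LLMs outperform humans. Existing methods still rely on human annotations or more powerful models, leaving open the question of how to strengthen critique ability without external supervision. The paper proposes SCRIT (Self-evolving CRITic), a framework that improves critique ability through self-evolution. SCRIT trains on synthetic data generated by a contrastive self-critic, which uses reference solutions for step-by-step critique, and applies a self-validation mechanism that checks critique quality through correction outcomes. Experiments show that SCRIT achieves up to a 10.3% improvement on critique-correction and error-identification benchmarks, that its performance scales positively with data and model size, that it outperforms alternative approaches, and that it benefits critically from its self-validation component.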

Link: https://arxiv.org/abs/2501.05727
Authors: Zhengyang Tang,Ziniu Li,Zhenyang Xiao,Tian Ding,Ruoyu Sun,Benyou Wang,Dayiheng Liu,Fei Huang,Tianyu Liu,Bowen Yu,Junyang Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Despite their remarkable performance, the development of Large Language Models (LLMs) faces a critical challenge in scalable oversight: providing effective feedback for tasks where human evaluation is difficult or where LLMs outperform humans. While there is growing interest in using LLMs for critique, current approaches still rely on human annotations or more powerful models, leaving the issue of enhancing critique capabilities without external supervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework that enables genuine self-evolution of critique abilities. Technically, SCRIT self-improves by training on synthetic data, generated by a contrastive-based self-critic that uses reference solutions for step-by-step critique, and a self-validation mechanism that ensures critique quality through correction outcomes. Implemented with Qwen2.5-72B-Instruct, one of the most powerful LLMs, SCRIT achieves up to a 10.3% improvement on critique-correction and error identification benchmarks. Our analysis reveals that SCRIT’s performance scales positively with data and model size, outperforms alternative approaches, and benefits critically from its self-validation component.

[NLP-25] How to Enable Effective Cooperation Between Humans and NLP Models: A Survey of Principles, Formalizations, and Beyond

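【Quick read】: This paper surveys human-model cooperation, an emerging paradigm in natural language processing (NLP), covering its current state, principles, formalizations, and open challenges. With the progress of large language models (LLMs), intelligent models have evolved from mere tools into autonomous agents with their own goals and strategies that can cooperate with humans. The paper's key contribution is a new taxonomy that offers a unified perspective on existing approaches, together with a discussion of potential frontier areas and their challenges. The survey is intended as an entry point for further research in this area.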

Link: https://arxiv.org/abs/2501.05714
Authors: Chen Huang,Yang Deng,Wenqiang Lei,Jiancheng Lv,Tat-Seng Chua,Jimmy Xiangji Huang
Affiliations: Sichuan University; York University; Singapore Management University; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 23 pages

Abstract:With the advancement of large language models (LLMs), intelligent models have evolved from mere tools to autonomous agents with their own goals and strategies for cooperating with humans. This evolution has birthed a novel paradigm in NLP, i.e., human-model cooperation, that has yielded remarkable progress in numerous NLP tasks in recent years. In this paper, we take the first step to present a thorough review of human-model cooperation, exploring its principles, formalizations, and open challenges. In particular, we introduce a new taxonomy that provides a unified perspective to summarize existing approaches. Also, we discuss potential frontier areas and their corresponding challenges. We regard our work as an entry point, paving the way for more breakthrough research in this regard.

[NLP-26] Multi-Step Reasoning in Korean and the Emergent Mirage

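【Quick read】: This paper addresses how to evaluate the multi-step reasoning ability of large language models (LLMs) in culturally specific contexts, focusing on Korean culture. The authors propose the HRMCR (HAE-RAE Multi-Step Commonsense Reasoning) benchmark, whose questions are generated automatically from templates and algorithms and require models to integrate Korean cultural knowledge into sequential reasoning steps. Experiments show that models trained with fewer than $2 \cdot 10^{25}$ FLOPs solve almost none of the questions, with near-zero performance, while performance improves sharply beyond this threshold. Even state-of-the-art models (e.g., O1) score below 50%, which underscores the difficulty of the task. Stepwise analysis suggests that the observed emergent behavior may stem from errors compounding across multiple reasoning steps rather than from a genuinely new capability. The authors release the benchmark publicly and commit to updating the dataset regularly to prevent contamination.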

Link: https://arxiv.org/abs/2501.05712
Authors: Guijin Son,Hyunwoo Ko,Dasol Choi
Affiliations: OneLineAI; Yonsei University
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 7 figures

Abstract:We introduce HRMCR (HAE-RAE Multi-Step Commonsense Reasoning), a benchmark designed to evaluate large language models’ ability to perform multi-step reasoning in culturally specific contexts, focusing on Korean. The questions are automatically generated via templates and algorithms, requiring LLMs to integrate Korean cultural knowledge into sequential reasoning steps. Consistent with prior observations on emergent abilities, our experiments reveal that models trained on fewer than $2 \cdot 10^{25}$ training FLOPs struggle to solve any questions, showing near-zero performance. Beyond this threshold, performance improves sharply. State-of-the-art models (e.g., O1) still score under 50%, underscoring the difficulty of our tasks. Notably, stepwise analysis suggests the observed emergent behavior may stem from compounding errors across multiple steps rather than reflecting a genuinely new capability. We publicly release the benchmark and commit to regularly updating the dataset to prevent contamination.
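The compounding-error explanation can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every step succeeds independently with the same probability, so a chain of n steps succeeds with probability p^n; the step counts and accuracies are hypothetical, not values from the paper.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    if not 0.0 <= per_step_accuracy <= 1.0 or steps < 0:
        raise ValueError("accuracy must be in [0, 1] and steps non-negative")
    return per_step_accuracy ** steps


# Hypothetical per-step accuracies for a six-step question.
weak = chain_success(0.5, 6)    # 0.5 ** 6
strong = chain_success(0.9, 6)  # 0.9 ** 6
```

Under this assumption, a smooth gain in per-step accuracy from 0.5 to 0.9 moves end-to-end accuracy from under 2% to over 50%, which can look like a sudden jump when only final answers are scored.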

[NLP-27] Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

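【Quick read】: This paper addresses the performance ceiling that training data imposes on large language models (LLMs). Despite recent progress, LLM performance is limited by the underlying training data. To go beyond this limit, the paper proposes a self-improvement method based on a multiagent society of language models. The key idea is to generate data through multiagent interaction and to finetune each model independently, which yields specialization and diversification across models. Concretely, several language models that start from the same base model are each updated on data generated through multiagent interactions, with each model trained on its own independent dataset. The approach preserves diverse reasoning chains and continues to improve autonomously over many more rounds of finetuning than single-agent self-improvement methods.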

Link: https://arxiv.org/abs/2501.05707
Authors: Vighnesh Subramaniam,Yilun Du,Joshua B. Tenenbaum,Antonio Torralba,Shuang Li,Igor Mordatch
Affiliations: MIT CSAIL; Harvard University; Stanford University; Google Deepmind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 13 figures, 7 tables; Project page at this https URL

Abstract:Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be used to generate synthetic data for autonomous self-improvement. However, successive steps of self-improvement can reach a point of diminishing returns. In this work, we propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models. A group of language models, all starting from the same base model, are independently specialized by updating each one using data generated through multiagent interactions among the models. By training each model on independent sets of data, we illustrate how this approach enables specialization across models and diversification over the set of models. As a result, our overall system is able to preserve diverse reasoning chains and autonomously improve over many more rounds of fine-tuning than single-agent self-improvement methods. We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks.

[NLP-28] Linguistic Entity Masking to Improve Cross-Lingual Representation of Multilingual Language Models for Low-Resource Languages

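【Quick read】: This paper addresses the weak cross-lingual performance of multilingual pre-trained language models (multiPLMs) on low-resource languages (LRLs). Existing multiPLMs are typically continually pre-trained with Masked Language Modelling (MLM) and Translation Language Modelling (TLM), but both objectives give equal weight to every token in the input sequence during masking and ignore the tokens' linguistic properties. The paper proposes a new masking strategy, Linguistic Entity Masking (LEM), for the continual pre-training step to further improve the cross-lingual representations of existing multiPLMs. LEM restricts masking to the linguistic entity types nouns, verbs, and named entities, which are more prominent in a sentence. LEM also masks only a single token within each linguistic entity span, which keeps more context, whereas MLM and TLM mask tokens at random. Experiments show that a multiPLM continually pre-trained with LEM outperforms one continually pre-trained with MLM+TLM on the downstream tasks of bitext mining, parallel data curation, and code-mixed sentiment analysis.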

Link: https://arxiv.org/abs/2501.05700
Authors: Aloka Fernando,Surangika Ranathunga
Affiliations: University of Moratuwa; Massey University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multilingual Pre-trained Language models (multiPLMs), trained on the Masked Language Modelling (MLM) objective are commonly being used for cross-lingual tasks such as bitext mining. However, the performance of these models is still suboptimal for low-resource languages (LRLs). To improve the language representation of a given multiPLM, it is possible to further pre-train it. This is known as continual pre-training. Previous research has shown that continual pre-training with MLM and subsequently with Translation Language Modelling (TLM) improves the cross-lingual representation of multiPLMs. However, during masking, both MLM and TLM give equal weight to all tokens in the input sequence, irrespective of the linguistic properties of the tokens. In this paper, we introduce a novel masking strategy, Linguistic Entity Masking (LEM) to be used in the continual pre-training step to further improve the cross-lingual representations of existing multiPLMs. In contrast to MLM and TLM, LEM limits masking to the linguistic entity types nouns, verbs and named entities, which hold a higher prominence in a sentence. Secondly, we limit masking to a single token within the linguistic entity span thus keeping more context, whereas, in MLM and TLM, tokens are masked randomly. We evaluate the effectiveness of LEM using three downstream tasks, namely bitext mining, parallel data curation and code-mixed sentiment analysis using three low-resource language pairs English-Sinhala, English-Tamil, and Sinhala-Tamil. Experiment results show that continually pre-training a multiPLM with LEM outperforms a multiPLM continually pre-trained with MLM+TLM for all three tasks.
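The masking rule described above (one masked token per noun, verb, or named-entity span) can be sketched as follows. The sketch assumes the entity spans have already been produced by some external tagger and do not overlap; how the paper obtains spans and what masking probability it uses are not stated in the abstract. The names `linguistic_entity_mask` and `MASK` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

MASK = "[MASK]"


def linguistic_entity_mask(
    tokens: Sequence[str],
    spans: Sequence[Tuple[int, int]],
    rng: random.Random,
) -> List[str]:
    """Mask exactly one token inside each half-open (start, end) entity span."""
    masked = list(tokens)
    for start, end in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span {(start, end)}")
        # Only one position per span is masked, so the rest of a multi-token
        # entity stays visible as context.
        masked[rng.randrange(start, end)] = MASK
    return masked


tokens = ["Colombo", "City", "is", "growing", "fast"]
entity_spans = [(0, 2), (3, 4)]  # assumed tagger output: a named entity and a verb
masked = linguistic_entity_mask(tokens, entity_spans, random.Random(0))
```

Tokens outside the spans ("is", "fast") are never masked, which is the contrast with the random masking of MLM and TLM.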

[NLP-29] Overcoming Language Priors for Visual Question Answering Based on Knowledge Distillation ICME2024

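【Quick read】: This paper addresses the tendency of Visual Question Answering (VQA) models to rely too heavily on language priors when predicting answers. This reliance leads models to exploit linguistic shortcuts instead of fully understanding multimodal knowledge, which weakens generalization. The paper proposes KDAR, a method that uses knowledge distillation to ease the prior-dependency problem in VQA. Soft labels generated by a well-trained teacher model provide a regularization effect that penalizes overfitting to the most common answers. The soft labels also provide semantic guidance that narrows the range of candidate answers. In addition, the paper designs an adaptive sample-wise reweighting learning strategy that further mitigates bias by adjusting the importance of each sample dynamically. Experiments show that the method improves performance in both out-of-distribution (OOD) and in-distribution (IID) settings and achieves state-of-the-art performance on the VQA-CPv2 OOD benchmark.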

Link: https://arxiv.org/abs/2501.05690
Authors: Daowan Peng,Wei Wei
Affiliations: CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to ICME2024

Abstract:Previous studies have pointed out that visual question answering (VQA) models are prone to relying on language priors for answer predictions. In this context, predictions often depend on linguistic shortcuts rather than a comprehensive grasp of multimodal knowledge, which diminishes their generalization ability. In this paper, we propose a novel method, namely, KDAR, leveraging knowledge distillation to address the prior-dependency dilemmas within the VQA task. Specifically, the regularization effect facilitated by soft labels from a well-trained teacher is employed to penalize overfitting to the most common answers. The soft labels, which serve a regularization role, also provide semantic guidance that narrows the range of candidate answers. Additionally, we design an adaptive sample-wise reweighting learning strategy to further mitigate bias by dynamically adjusting the importance of each sample. Experimental results demonstrate that our method enhances performance in both OOD and IID settings. Our method achieves state-of-the-art performance on the VQA-CPv2 out-of-distribution (OOD) benchmark, significantly outperforming previous state-of-the-art approaches.
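To show how teacher soft labels can act as a regularizer, here is a sketch of a generic temperature-scaled distillation loss. It is the textbook formulation, not the KDAR objective, whose exact form is not given in the abstract; the mixing weight `alpha`, the `temperature`, and the function names are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    hard_label: int,
    alpha: float = 0.5,
    temperature: float = 2.0,
) -> float:
    """alpha * CE(teacher soft labels, student) + (1 - alpha) * CE(hard label, student)."""
    student_soft = softmax(student_logits, temperature)
    teacher_soft = softmax(teacher_logits, temperature)
    # Soft term: cross-entropy between the teacher and student distributions.
    soft_term = -sum(t * math.log(s) for t, s in zip(teacher_soft, student_soft))
    # Hard term: ordinary cross-entropy against the ground-truth answer.
    hard_term = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_term + (1.0 - alpha) * hard_term


teacher = [2.0, 0.0, 0.0]
agreeing = distillation_loss([2.0, 0.0, 0.0], teacher, hard_label=0)
disagreeing = distillation_loss([0.0, 2.0, 0.0], teacher, hard_label=0)
```

Because the teacher distribution keeps some mass on non-top answers, the soft term discourages the student from collapsing onto a single frequent answer.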

[NLP-30] Cascaded Self-Evaluation Augmented Training for Efficient Multimodal Large Language Models

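【Quick read】: This paper addresses the difficulty that Efficient Multimodal Large Language Models (EMLLMs) have in using self-evaluation effectively during inference. The challenges include synthesizing evaluation data, determining its quantity, optimizing training and inference strategies, and selecting prompts. The paper proposes Self-Evaluation Augmented Training (SEAT), which uses more powerful EMLLMs for Chain-of-Thought (CoT) reasoning, data selection, and evaluation generation, and then trains EMLLMs on the synthesized data. Handling long prompts and maintaining CoT reasoning quality remain problematic, so the paper further proposes Cascaded Self-Evaluation Augmented Training (Cas-SEAT), which breaks long prompts into shorter, task-specific cascaded prompts and lowers costs in resource-limited settings. For data synthesis, the paper uses open-source 7B-parameter EMLLMs and annotates a small dataset with short prompts. Experiments show that Cas-SEAT substantially improves EMLLMs' self-evaluation ability, raising performance by 19.68%, 55.57%, and 46.79% on the MathVista, Math-V, and We-Math datasets, respectively. The Cas-SEAT dataset is also offered as a resource for future research on EMLLM self-evaluation.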

Link: https://arxiv.org/abs/2501.05662
Authors: Zheqi Lv,Wenkai Wang,Jiawei Wang,Shengyu Zhang,Fei Wu
Affiliations: Zhejiang University; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Efficient Multimodal Large Language Models (EMLLMs) have rapidly advanced recently. Incorporating Chain-of-Thought (CoT) reasoning and step-by-step self-evaluation has improved their performance. However, limited parameters often hinder EMLLMs from effectively using self-evaluation during inference. Key challenges include synthesizing evaluation data, determining its quantity, optimizing training and inference strategies, and selecting appropriate prompts. To address these issues, we introduce Self-Evaluation Augmented Training (SEAT). SEAT uses more powerful EMLLMs for CoT reasoning, data selection, and evaluation generation, then trains EMLLMs with the synthesized data. However, handling long prompts and maintaining CoT reasoning quality are problematic. Therefore, we propose Cascaded Self-Evaluation Augmented Training (Cas-SEAT), which breaks down lengthy prompts into shorter, task-specific cascaded prompts and reduces costs for resource-limited settings. During data synthesis, we employ open-source 7B-parameter EMLLMs and annotate a small dataset with short prompts. Experiments demonstrate that Cas-SEAT significantly boosts EMLLMs’ self-evaluation abilities, improving performance by 19.68%, 55.57%, and 46.79% on the MathVista, Math-V, and We-Math datasets, respectively. Additionally, our Cas-SEAT Dataset serves as a valuable resource for future research in enhancing EMLLM self-evaluation.

[NLP-31] Collaboration of Large Language Models and Small Recommendation Models for Device-Cloud Recommendation KDD’25

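【Quick read】: This paper addresses the inability of large language models (LLMs) for recommendation (LLM4Rec) to capture real-time user preferences. LLMs are costly to train and run frequently, struggle to access real-time data, and are too large to deploy easily on devices. The paper proposes the Device-Cloud LLM-SRM Collaborative Recommendation Framework (LSC4Rec), which combines the strengths of LLMs and small recommendation models (SRMs), and of cloud and edge computing, so that they complement each other. The solution rests on three strategies: collaborative training, collaborative inference, and intelligent request. During training, the LLM generates candidate lists to strengthen the SRM's ranking ability in collaborative scenarios and lets the SRM update adaptively to capture real-time user interests. During inference, the LLM runs in the cloud and the SRM on the device: the LLM generates candidate lists and initial ranking results, the SRM reranks the candidate list, and the final result combines both models' scores. The device decides whether a new candidate list is needed by comparing the consistency of the LLM's and SRM's ranked lists. Experiments validate the effectiveness of each strategy in LSC4Rec.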

Link: https://arxiv.org/abs/2501.05647
Authors: Zheqi Lv,Tianyu Zhan,Wenjie Wang,Xinyu Lin,Shengyu Zhang,Wenqiao Zhang,Jiwei Li,Kun Kuang,Fei Wu
Affiliations: Zhejiang University; National University of Singapore
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Published on KDD’25: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2025

Abstract:Large Language Models (LLMs) for Recommendation (LLM4Rec) is a promising research direction that has demonstrated exceptional performance in this field. However, its inability to capture real-time user preferences greatly limits the practical application of LLM4Rec because (i) LLMs are costly to train and infer frequently, and (ii) LLMs struggle to access real-time data (their large number of parameters poses an obstacle to deployment on devices). Fortunately, small recommendation models (SRMs) can effectively compensate for these shortcomings of LLM4Rec by consuming minimal resources for frequent training and inference, and by conveniently accessing real-time data on devices. In light of this, we designed the Device-Cloud LLM-SRM Collaborative Recommendation Framework (LSC4Rec) under a device-cloud collaboration setting. LSC4Rec aims to integrate the advantages of both LLMs and SRMs, as well as the benefits of cloud and edge computing, achieving a complementary synergy. We enhance the practicability of LSC4Rec by designing three strategies: collaborative training, collaborative inference, and intelligent request. During training, the LLM generates candidate lists to enhance the ranking ability of the SRM in collaborative scenarios and enables the SRM to update adaptively to capture real-time user interests. During inference, the LLM and the SRM are deployed on the cloud and on the device, respectively. The LLM generates candidate lists and initial ranking results based on user behavior, and the SRM produces reranking results based on the candidate list, with final results integrating both the LLM’s and the SRM’s scores. The device determines whether a new candidate list is needed by comparing the consistency of the LLM’s and SRM’s sorted lists. Our comprehensive and extensive experimental analysis validates the effectiveness of each strategy in LSC4Rec.
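The "intelligent request" step, in which the device compares the two ranked lists to decide whether to ask the cloud for new candidates, could look roughly like the sketch below. The consistency measure (top-k overlap), the cut-off `k`, and the threshold `min_overlap` are all illustrative assumptions; the abstract does not say which measure LSC4Rec uses. The function name `needs_new_candidates` is hypothetical.

```python
from typing import Sequence


def needs_new_candidates(
    llm_ranking: Sequence[str],
    srm_ranking: Sequence[str],
    k: int = 3,
    min_overlap: float = 0.5,
) -> bool:
    """Request a fresh candidate list when the top-k items disagree too much."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_llm = set(llm_ranking[:k])
    top_srm = set(srm_ranking[:k])
    overlap = len(top_llm & top_srm) / k
    # Low overlap suggests the user's real-time interests (seen by the
    # on-device SRM) have drifted from the cloud LLM's candidate list.
    return overlap < min_overlap


cloud_list = ["a", "b", "c", "d"]
consistent = needs_new_candidates(cloud_list, ["a", "c", "b", "d"])
drifted = needs_new_candidates(cloud_list, ["d", "e", "f", "a"])
```

Gating requests this way keeps cloud calls infrequent while the on-device model still agrees with the cached candidates.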
Related DOI: https://doi.org/10.1145/3690624.3709335

[NLP-32] Iconicity in Large Language Models

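【Quick read】: This paper examines how large language models (LLMs) handle lexical iconicity, the direct relation between a word's form and its meaning, which usually appears as sound-meaning associations. Because LLMs access the meaning and sound of text only indirectly, through textual context and written representation, and tokenization complicates this further, one might expect their encoding of iconicity to be insufficient or markedly different from human processing. To test this, the study has GPT-4 generate highly iconic pseudowords in artificial languages and then has human participants (Czech and German speakers) and LLM-based participants (generated by GPT-4 and Claude 3.5 Sonnet) guess the meanings of these pseudowords. The results show that humans guess the meanings of pseudowords in the generated iconic language more accurately than the meanings of words in distant natural languages, and that LLM-based participants do even better than humans on this task. The core finding is accompanied by further analyses of the universality of the generated language and of the cues used by human and LLM-based participants.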

Link: https://arxiv.org/abs/2501.05643
Authors: Anna Marklová,Jiří Milička,Leonid Ryvkin,Ľudmila Lacková Bennet,Libuše Kormaníková
Affiliations: Charles University; Claude Bernard University Lyon; Palacký University Olomouc
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Supplementary information: this https URL

Abstract:Lexical iconicity, a direct relation between a word’s meaning and its form, is an important aspect of every natural language, most commonly manifesting through sound-meaning associations. Since Large language models’ (LLMs’) access to both meaning and sound of text is only mediated (meaning through textual context, sound through written representation, further complicated by tokenization), we might expect that the encoding of iconicity in LLMs would be either insufficient or significantly different from human processing. This study addresses this hypothesis by having GPT-4 generate highly iconic pseudowords in artificial languages. To verify that these words actually carry iconicity, we had their meanings guessed by Czech and German participants (n=672) and subsequently by LLM-based participants (generated by GPT-4 and Claude 3.5 Sonnet). The results revealed that humans can guess the meanings of pseudowords in the generated iconic language more accurately than words in distant natural languages and that LLM-based participants are even more successful than humans in this task. This core finding is accompanied by several additional analyses concerning the universality of the generated language and the cues that both human and LLM-based participants utilize.

[NLP-33] Automating Date Format Detection for Data Visualization

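[Quick read]: This paper targets date parsing, a significant bottleneck in data-analysis workflows. The author proposes two algorithms, one based on minimum entropy and the other on natural language modeling, that automatically derive date formats from string data and reach over 90% accuracy on a large corpus of data columns. The minimum-entropy method is particularly fast and can provide interactive feedback. Both methods simplify date format extraction and are suited to integration into data visualization tools and databases. The key to the solution is automated date format derivation, which makes data preparation markedly more efficient, especially in visualization environments.

Link: https://arxiv.org/abs/2501.05640
Authors: Zixuan Liang
Affiliations: Department of Computer and Information Sciences, Harrisburg University of Science & Technology
Categories: Computation and Language (cs.CL)
Comments: 2025 International Conference on Advanced Machine Learning and Data Science (AMLDS 2025)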

Abstract: Data preparation, specifically date parsing, is a significant bottleneck in analytic workflows. To address this, we present two algorithms, one based on minimum entropy and the other on natural language modeling, that automatically derive date formats from string data. These algorithms achieve over 90% accuracy on a large corpus of data columns, streamlining the data preparation process within visualization environments. The minimum entropy approach is particularly fast, providing interactive feedback. Our methods simplify date format extraction, making them suitable for integration into data visualization tools and databases.
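To make the task concrete, here is a minimal sketch of format detection for a column of date strings. It is not the paper's minimum-entropy or language-model algorithm (the abstract does not specify either); it is a simpler parse-rate heuristic, and the candidate list `CANDIDATE_FORMATS` and both function names are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical candidate list; a real tool would need many more patterns.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d %b %Y"]


def parse_rate(values, fmt):
    """Fraction of strings in `values` that parse under the strptime format `fmt`."""
    ok = 0
    for value in values:
        try:
            datetime.strptime(value, fmt)
            ok += 1
        except ValueError:
            pass
    return ok / len(values) if values else 0.0


def detect_date_format(values):
    """Return (format, parse rate) for the candidate that parses the most values."""
    scored = [(parse_rate(values, fmt), fmt) for fmt in CANDIDATE_FORMATS]
    best_rate, best_fmt = max(scored, key=lambda pair: pair[0])
    return best_fmt, best_rate
```

Scoring the whole column rather than a single value is what resolves ambiguous strings: one value with a day above 12 is enough to rule out a month-first format for the column.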

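[NLP-34] The Impact of Model Scaling on Seen and Unseen Language Performance AAAI25

[Quick read]: This paper studies the performance and scaling behavior of large language models (LLMs) in multilingual settings, specifically on text classification and machine translation. It systematically examines 204 languages in zero-shot and few-shot settings and asks how model scale affects performance. The central finding is that scaling behavior differs significantly between zero-shot and two-shot scenarios, with a marked performance gap between seen and unseen languages. Model scale has little effect on zero-shot performance, whereas in the two-shot setting larger models show clear linear improvements on text classification. For translation, only the instruction-tuned model shows a clear benefit from scaling. The study also suggests that overall resource levels, not just the proportions of pretraining languages, are better predictors of model performance, which sheds light on what drives multilingual LLM effectiveness.

Link: https://arxiv.org/abs/2501.05629
Authors: Rhitabrat Pokharel, Sina Bagheri Nezhad, Ameeta Agrawal, Suresh Singh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at SEAS Workshop at AAAI25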

Abstract:The rapid advancement of Large Language Models (LLMs), particularly those trained on multilingual corpora, has intensified the need for a deeper understanding of their performance across a diverse range of languages and model sizes. Our research addresses this critical need by studying the performance and scaling behavior of multilingual LLMs in text classification and machine translation tasks across 204 languages. We systematically examine both seen and unseen languages across three model families of varying sizes in zero-shot and few-shot settings. Our findings show significant differences in scaling behavior between zero-shot and two-shot scenarios, with striking disparities in performance between seen and unseen languages. Model scale has little effect on zero-shot performance, which remains mostly flat. However, in two-shot settings, larger models show clear linear improvements in multilingual text classification. For translation tasks, however, only the instruction-tuned model showed clear benefits from scaling. Our analysis also suggests that overall resource levels, not just the proportions of pretraining languages, are better predictors of model performance, shedding light on what drives multilingual LLM effectiveness.
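The difference between the zero-shot and two-shot settings comes down to how many labeled examples the prompt contains. The sketch below shows one way to build both prompts; the prompt wording and the function name `build_prompt` are assumptions, not the templates used in the paper.

```python
def build_prompt(text, labels, examples=()):
    """Build a classification prompt.

    `examples` is a sequence of (text, label) pairs: empty for zero-shot,
    two pairs for the two-shot setting.
    """
    lines = ["Classify the text into one of: " + ", ".join(labels) + "."]
    for example_text, example_label in examples:
        lines.append(f"Text: {example_text}\nLabel: {example_label}")
    # The query comes last with an empty label for the model to complete.
    lines.append(f"Text: {text}\nLabel:")
    return "\n\n".join(lines)
```

Calling `build_prompt(text, labels)` gives the zero-shot prompt, and passing two `(text, label)` pairs in `examples` gives the two-shot one, so the same model can be compared under both conditions with nothing else changed.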

[NLP-35] Harmonizing Metadata of Language Resources for Enhanced Querying and Accessibility

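[Quick read]: This paper addresses the harmonization of metadata from diverse repositories of language resources (LRs). It proposes a unified model based on DCAT (Data Catalog Vocabulary) and the META-SHARE OWL ontology, using linked data and RDF (Resource Description Framework) techniques to integrate data from multiple sources into a single framework. The key solution is Linghub, a newly developed portal that supports text-based search, faceted browsing, and advanced SPARQL queries. Real user queries from the Corpora Mailing List (CML) were evaluated to check whether Linghub meets actual user needs. Some limitations remain, but many user requests can be handled successfully. The study also highlights significant metadata issues, advocates open vocabularies and standards to improve harmonization, and stresses the importance of API-based access to LRs, which promotes machine usability and purpose-specific extraction of data subsets.

Link: https://arxiv.org/abs/2501.05606
Authors: Zixuan Liang
Affiliations: Harrisburg University of Science & Technology
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT 2024)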

Abstract: This paper addresses the harmonization of metadata from diverse repositories of language resources (LRs). Leveraging linked data and RDF techniques, we integrate data from multiple sources into a unified model based on DCAT and the META-SHARE OWL ontology. Our methodology supports text-based search, faceted browsing, and advanced SPARQL queries through Linghub, a newly developed portal. Real user queries from the Corpora Mailing List (CML) were evaluated to assess Linghub’s capability to satisfy actual user needs. Results indicate that while some limitations persist, many user requests can be successfully addressed. The study highlights significant metadata issues and advocates for adherence to open vocabularies and standards to enhance metadata harmonization. This initial research underscores the importance of API-based access to LRs, promoting machine usability and data subset extraction for specific purposes, paving the way for more efficient and standardized LR utilization.

[NLP-36] Exploring Large Language Models for Translating Romanian Computational Problems into English

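[Quick read]: This paper starts from reports that large language models (LLMs) underperform on mathematical and computer science tasks when the problems are translated from Romanian into English. Accurate translation matters for automatically translating programming-competition problems, creating high-quality educational materials, and minimizing errors or fraud in human translation. The key to the solution is well-structured prompts, which let LLMs maintain or improve their performance when translating less common languages. The study suggests that, with appropriate supervision, LLMs can be used reliably to translate IOI (International Olympiad in Informatics)-style tasks automatically. It evaluates translation methods and performance stability across several LLMs, including OpenRoLLM, Llama 3.1 8B, Llama 3.2 3B, and GPT-4o, and combines this with detailed syntactic and semantic analyses to confirm that, under human oversight, LLMs are a viable option for multilingual problem-solving. An expert evaluation comparing LLM and human translation quality further underlines the potential of LLMs in real-world use.

Link: https://arxiv.org/abs/2501.05601
Authors: Adrian Marius Dumitran, Adrian-Catalin Badea, Stefan-Gabriel Muscalu, Angela-Liliana Dumitran, Stefan-Cosmin Dascalescu, Radu-Sebastian Amarie
Affiliations: University of Bucharest; Softbinator; UiPath; It Just Works Inc.; QPillars
Categories: Computation and Language (cs.CL)
Comments: 12 pages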

Abstract: Recent studies have suggested that large language models (LLMs) underperform on mathematical and computer science tasks when these problems are translated from Romanian into English, compared to their original Romanian format. Accurate translation is critical for applications ranging from automatic translations in programming competitions to the creation of high-quality educational materials, as well as minimizing errors or fraud in human translations. This study shows that robust LLMs can maintain or even enhance their performance in translating less common languages when given well-structured prompts. Our findings suggest that LLMs, with appropriate supervision, can be reliably used for the automatic translation of IOI (International Olympiad in Informatics)-style tasks. We evaluate several translation methods across multiple LLMs, including OpenRoLLM, Llama 3.1 8B, Llama 3.2 3B and GPT-4o, assessing their translation accuracy and performance stability through repeated runs. Additionally, we augment the OJI (Romanian County-Level Informatics Olympiad) Romanian dataset with accurate English translations, enhancing its utility for future LLM training and evaluation. Through detailed syntactic and semantic analyses, we confirm that with human oversight, LLMs can serve as a viable solution for multilingual problem-solving. We also compare the translation quality of LLMs against human translators, as evaluated by a certified expert, underscoring the potential of LLMs in real-world scenarios.

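[NLP-37] LLMQuoter: Enhancing RAG Capabilities Through Efficient Quote Extraction From Large Contexts

[Quick read]: This paper asks how to efficiently extract the most relevant textual evidence for downstream reasoning in Retrieval Augmented Generation (RAG). Existing full-context approaches such as Retrieval-Augmented Fine-Tuning (RAFT) impose heavy cognitive overhead on complex reasoning tasks. The paper proposes LLMQuoter, a lightweight model based on knowledge distillation that follows a "quote-first-then-answer" strategy: it first identifies key quotes efficiently and then passes the curated snippets to a reasoning model. This workflow substantially reduces cognitive overhead and yields accuracy gains of more than 20 points on both small and large language models. By distilling knowledge from a high-performing teacher model, LLMQuoter achieves competitive results in a resource-efficient fine-tuning setup, giving researchers and practitioners a scalable and practical solution.

Link: https://arxiv.org/abs/2501.05554
Authors: Yuri Facanha Bezerra, Li Weigang
Affiliations: TransLab, University of Brasilia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: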

Abstract:We introduce LLMQuoter, a lightweight, distillation-based model designed to enhance Retrieval Augmented Generation (RAG) by extracting the most relevant textual evidence for downstream reasoning tasks. Built on the LLaMA-3B architecture and fine-tuned with Low-Rank Adaptation (LoRA) on a 15,000-sample subset of HotpotQA, LLMQuoter adopts a “quote-first-then-answer” strategy, efficiently identifying key quotes before passing curated snippets to reasoning models. This workflow reduces cognitive overhead and outperforms full-context approaches like Retrieval-Augmented Fine-Tuning (RAFT), achieving over 20-point accuracy gains across both small and large language models. By leveraging knowledge distillation from a high-performing teacher model, LLMQuoter achieves competitive results in a resource-efficient fine-tuning setup. It democratizes advanced RAG capabilities, delivering significant performance improvements without requiring extensive model retraining. Our results highlight the potential of distilled quote-based reasoning to streamline complex workflows, offering a scalable and practical solution for researchers and practitioners alike.
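The "quote-first-then-answer" workflow can be illustrated with a small pipeline sketch. The paper's quoter is a distilled, LoRA-fine-tuned language model; here it is replaced by a plain word-overlap ranking so the example runs without any model, and the function names `extract_quotes` and `build_reader_prompt` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_quotes(question, context, top_k=2):
    """Stand-in for the quoter: rank sentences by word overlap with the question."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    q_tokens = tokenize(question)
    ranked = sorted(sentences, key=lambda s: len(q_tokens & tokenize(s)), reverse=True)
    return ranked[:top_k]


def build_reader_prompt(question, quotes):
    """Only the selected quotes, not the full context, reach the reasoning model."""
    quoted = "\n".join(f"- {q}" for q in quotes)
    return f"Quotes:\n{quoted}\n\nQuestion: {question}\nAnswer:"
```

The design point is the split itself: the reasoning model sees a short prompt of curated quotes instead of the whole retrieved context, which is where the abstract locates the reduction in overhead.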

[NLP-38] The dynamics of meaning through time: Assessment of Large Language Models

Link: https://arxiv.org/abs/2501.05552
Authors: Mohamed Taher Alrefaie, Fatty Salem, Nour Eldin Morsy, Nada Samir, Mohamed Medhat Gaber
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

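[NLP-39] The more polypersonal the better – a short look on space geometry of fine-tuned layers

[Quick read]: This paper asks how adding a grammatical module and data containing a new grammatical structure (polypersonality) changes the internal representations of a BERT model, and how those changes affect performance. The key step is analyzing how BERT's internal representations change during training, in particular how a single added grammatical layer leads the model to separate the new and old grammatical systems. The analysis identifies patterns in the model's decision-making and exposes features of its internal structure, and the added layer improves overall performance on perplexity metrics.

Link: https://arxiv.org/abs/2501.05503
Authors: Sergei Kudriashov, Veronika Zykova, Angelina Stepanova, Yakov Raskind, Eduard Klyshinsky
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Neuroinformatics 2024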

Abstract:The interpretation of deep learning models is a rapidly growing field, with particular interest in language models. There are various approaches to this task, including training simpler models to replicate neural network predictions and analyzing the latent space of the model. The latter method allows us to not only identify patterns in the model’s decision-making process, but also understand the features of its internal structure. In this paper, we analyze the changes in the internal representation of the BERT model when it is trained with additional grammatical modules and data containing new grammatical structures (polypersonality). We find that adding a single grammatical layer causes the model to separate the new and old grammatical systems within itself, improving the overall performance on perplexity metrics.

[NLP-40] Spatial Information Integration in Small Language Models for Document Layout Generation and Classification

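[Quick read]: This paper addresses the lack of public datasets for training machine learning models on semi-structured documents. Such documents (balance sheets, purchase orders, receipts, and so on) are common in everyday life, but existing public datasets are insufficient for training document layout understanding models effectively. The paper proposes a method for generating new synthetic layout information to overcome this data shortage. The method outperforms LayoutTransformer, another popular layout generation method, and in some scenarios bounding box information improves text classification. The key to the solution is augmenting the training data with generated synthetic data, which improves the model's understanding of document layout.

Link: https://arxiv.org/abs/2501.05497
Authors: Pablo Melendez, Clemens Havas
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 8 pages. Symposium on Applied Computing 2025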

Abstract: Document layout understanding is a field of study that analyzes the spatial arrangement of information in a document in order to understand its structure and layout. Models such as LayoutLM (and its subsequent iterations) can understand semi-structured documents with SotA results; however, the lack of open semi-structured data is a limitation in itself. While semi-structured data is common in everyday life (balance sheets, purchase orders, receipts), there is a lack of public datasets for training machine learning models for this type of document. In this investigation we propose a method to generate new, synthetic layout information that can help overcome this data shortage. According to our results, the proposed method performs better than LayoutTransformer, another popular layout generation method. We also show that, in some scenarios, text classification can improve when supported by bounding box information.

[NLP-41] LSEBMCL: A Latent Space Energy-Based Model for Continual Learning

[Quick read]: This paper addresses catastrophic forgetting, the main challenge in continual learning, in which a model inadvertently discards knowledge learned on earlier tasks while training on new ones. It proposes LSEBMCL (Latent Space Energy-Based Model for Continual Learning), a solution based on energy-based models (EBMs). The key idea is to use an EBM layer as an outer generator within the continual learning framework and to sample data points from previous tasks to prevent catastrophic forgetting. The EBM assigns an energy value to each input data point, which helps the model retain earlier knowledge while training on new tasks. In experiments on natural language processing (NLP) tasks the method achieves state-of-the-art results.

Link: https://arxiv.org/abs/2501.05495
Authors: Xiaodi Li, Dingcheng Li, Rujun Gao, Mahmoud Zamani, Latifur Khan
Affiliations: The University of Texas at Dallas; Vextex AI; Google; Texas A&M University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: In the 7th International Conference on Artificial Intelligence in Information and Communication (ICAIIC 2025)

Abstract:Continual learning has become essential in many practical applications such as online news summaries and product classification. The primary challenge is known as catastrophic forgetting, a phenomenon where a model inadvertently discards previously learned knowledge when it is trained on new tasks. Existing solutions involve storing exemplars from previous classes, regularizing parameters during the fine-tuning process, or assigning different model parameters to each task. The proposed solution LSEBMCL (Latent Space Energy-Based Model for Continual Learning) in this work is to use energy-based models (EBMs) to prevent catastrophic forgetting by sampling data points from previous tasks when training on new ones. The EBM is a machine learning model that associates an energy value with each input data point. The proposed method uses an EBM layer as an outer-generator in the continual learning framework for NLP tasks. The study demonstrates the efficacy of EBM in NLP tasks, achieving state-of-the-art results in all experiments.

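[NLP-42] The Future of AI: Exploring the Potential of Large Concept Models

[Quick read]: This paper addresses the inherent limitations of large language models (LLMs) in abstract reasoning, conceptual understanding, and generation of long-form content. LLMs excel at tasks such as text summarization, code generation, and creative writing, but their token-level processing limits higher-level semantic reasoning and context-aware decision-making. In response, Meta introduced Large Concept Models (LCMs), which use concepts as the foundational units of understanding and thereby enable more sophisticated semantic reasoning and context-aware decision-making. The paper's contribution is to collect, analyze, and synthesize the existing grey literature in order to give a comprehensive picture of LCMs. Specifically, it (i) identifies and describes the features that distinguish LCMs from LLMs, (ii) explores potential applications of LCMs across multiple domains, and (iii) proposes future research directions and practical strategies to advance LCM development and adoption.

Link: https://arxiv.org/abs/2501.05487
Authors: Hussain Ahmad, Diksha Goel
Affiliations: The University of Adelaide, Australia; CSIRO’s Data61, Australia
Categories: Computation and Language (cs.CL)
Comments: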

Abstract:The field of Artificial Intelligence (AI) continues to drive transformative innovations, with significant progress in conversational interfaces, autonomous vehicles, and intelligent content creation. Since the launch of ChatGPT in late 2022, the rise of Generative AI has marked a pivotal era, with the term Large Language Models (LLMs) becoming a ubiquitous part of daily life. LLMs have demonstrated exceptional capabilities in tasks such as text summarization, code generation, and creative writing. However, these models are inherently limited by their token-level processing, which restricts their ability to perform abstract reasoning, conceptual understanding, and efficient generation of long-form content. To address these limitations, Meta has introduced Large Concept Models (LCMs), representing a significant shift from traditional token-based frameworks. LCMs use concepts as foundational units of understanding, enabling more sophisticated semantic reasoning and context-aware decision-making. Given the limited academic research on this emerging technology, our study aims to bridge the knowledge gap by collecting, analyzing, and synthesizing existing grey literature to provide a comprehensive understanding of LCMs. Specifically, we (i) identify and describe the features that distinguish LCMs from LLMs, (ii) explore potential applications of LCMs across multiple domains, and (iii) propose future research directions and practical strategies to advance LCM development and adoption.

[NLP-43] S2 Chunking: A Hybrid Framework for Document Segmentation Through Integrated Spatial and Semantic Analysis

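[Quick read]: This paper addresses a weakness of traditional document chunking in natural language processing (NLP): relying solely on semantic analysis and ignoring spatial layout, which is crucial for understanding relationships between elements in complex documents. Traditional methods perform poorly on documents with diverse layouts such as reports, articles, and multi-column designs. The paper proposes a hybrid approach that combines layout structure, semantic analysis, and spatial relationships: it uses bounding box (bbox) information and text embeddings to build a weighted graph over document elements and then clusters the graph with spectral clustering. The key is that considering layout and semantics together improves the cohesion and accuracy of chunks while ensuring that no chunk exceeds a specified token length, which suits settings such as language models with input size limits.

Link: https://arxiv.org/abs/2501.05485
Authors: Prashant Verma
Affiliations: Indian Institute of Technology, Patna
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: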

Abstract: Document chunking is a critical task in natural language processing (NLP) that involves dividing a document into meaningful segments. Traditional methods often rely solely on semantic analysis, ignoring the spatial layout of elements, which is crucial for understanding relationships in complex documents. This paper introduces a novel hybrid approach that combines layout structure, semantic analysis, and spatial relationships to enhance the cohesion and accuracy of document chunks. By leveraging bounding box information (bbox) and text embeddings, our method constructs a weighted graph representation of document elements, which is then clustered using spectral clustering. Experimental results demonstrate that this approach outperforms traditional methods, particularly in documents with diverse layouts such as reports, articles, and multi-column designs. The proposed method also ensures that no chunk exceeds a specified token length, making it suitable for use cases where token limits are critical (e.g., language models with input size limitations).
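The graph-construction idea can be sketched with the standard library alone. Everything specific here is an assumption: the abstract does not give the weighting formula, so `edge_weight` simply averages a spatial-closeness term and embedding cosine similarity, and the paper's spectral clustering is swapped for thresholded connected components to keep the example dependency-free. The token-length constraint is not modeled.

```python
import math


def center(bbox):
    """Center point of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def edge_weight(a, b):
    """Assumed weighting: mean of spatial closeness and embedding similarity."""
    spatial = 1.0 / (1.0 + math.dist(center(a["bbox"]), center(b["bbox"])))
    return 0.5 * spatial + 0.5 * cosine(a["emb"], b["emb"])


def group_elements(elements, threshold=0.5):
    """Connected components over edges with weight >= threshold (union-find)."""
    parent = list(range(len(elements)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            if edge_weight(elements[i], elements[j]) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(elements)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each element is a dict with a `bbox` tuple and an `emb` vector; two adjacent paragraphs with similar embeddings end up in one group, while a distant, unrelated block stays on its own.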

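[NLP-44] HP-BERT: A framework for longitudinal study of Hinduphobia on social media via LLMs

[Quick read]: This paper addresses Hinduphobic hate speech on the social media platform X (formerly Twitter) during and after the COVID-19 pandemic. It builds an abuse detection and sentiment analysis framework based on large language models (LLMs) and uses it for a longitudinal analysis of the prevalence and intensity of Hinduphobic discourse on X. The key steps are: (1) curating and publishing a "Hinduphobic COVID-19 X (Twitter) Dataset" of 8,000 annotated tweets, which is used to fine-tune a BERT model into HP-BERT, a model dedicated to detecting Hinduphobic discourse; (2) further fine-tuning HP-BERT on the SenWave dataset for multi-label sentiment analysis; and (3) analyzing approximately 27.4 million tweets from six countries, which reveals a strong correlation between spikes in COVID-19 cases and surges in Hinduphobic rhetoric. The findings offer guidance for mitigating communal tensions in future crises, and the authors recommend automated monitoring and removal of such content to curb divisive discourse.

Link: https://arxiv.org/abs/2501.05482
Authors: Ashutosh Singh, Rohitash Chandra
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: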

Abstract:During the COVID-19 pandemic, community tensions intensified, fuelling Hinduphobic sentiments and discrimination against individuals of Hindu descent within India and worldwide. Large language models (LLMs) have become prominent in natural language processing (NLP) tasks and social media analysis, enabling longitudinal studies of platforms like X (formerly Twitter) for specific issues during COVID-19. We present an abuse detection and sentiment analysis framework that offers a longitudinal analysis of Hinduphobia on X (Twitter) during and after the COVID-19 pandemic. This framework assesses the prevalence and intensity of Hinduphobic discourse, capturing elements such as derogatory jokes and racist remarks through sentiment analysis and abuse detection from pre-trained and fine-tuned LLMs. Additionally, we curate and publish a “Hinduphobic COVID-19 X (Twitter) Dataset” of 8,000 tweets annotated for Hinduphobic abuse detection, which is used to fine-tune a BERT model, resulting in the development of the Hinduphobic BERT (HP-BERT) model. We then further fine-tune HP-BERT using the SenWave dataset for multi-label sentiment analysis. Our study encompasses approximately 27.4 million tweets from six countries, including Australia, Brazil, India, Indonesia, Japan, and the United Kingdom. Our findings reveal a strong correlation between spikes in COVID-19 cases and surges in Hinduphobic rhetoric, highlighting how political narratives, misinformation, and targeted jokes contributed to communal polarisation. These insights provide valuable guidance for developing strategies to mitigate communal tensions in future crises, both locally and globally. We advocate implementing automated monitoring and removal of such content on social media to curb divisive discourse.

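[NLP-45] The Questio de aqua et terra: A Computational Authorship Verification Study

[Quick read]: This paper investigates the authenticity of the Questio de aqua et terra, a cosmological treatise traditionally attributed to Dante Alighieri. The attribution is disputed because the text differs from Dante's established works and lacks contemporary references. The paper applies computational authorship verification (AV), which combines supervised machine learning and stylometry: it builds a family of AV systems and evaluates them by cross-validation on a corpus of 330 Latin texts from the 13th and 14th centuries. Despite the corpus's heterogeneity in textual genre, the best-performing system reaches high verification accuracy (F1=0.970). The key contribution to that accuracy comes from Distributional Random Oversampling (DRO), a technique tailored to text classification that is used here in AV for the first time. Applied to the Questio, the AV system returns a highly confident prediction about its authenticity, which contributes to the scholarly debate and shows DRO's potential for cultural heritage applications.

Link: https://arxiv.org/abs/2501.05480
Authors: Martina Leocata, Alejandro Moreo, Fabrizio Sebastiani
Affiliations: Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: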

Abstract: The Questio de aqua et terra is a cosmological treatise traditionally attributed to Dante Alighieri. However, the authenticity of this text is controversial, due to discrepancies with Dante’s established works and to the absence of contemporary references. This study investigates the authenticity of the Questio via computational authorship verification (AV), a class of techniques which combine supervised machine learning and stylometry. We build a family of AV systems and assemble a corpus of 330 13th- and 14th-century Latin texts, which we use to comparatively evaluate the AV systems through leave-one-out cross-validation. Our best-performing system achieves high verification accuracy (F1=0.970) despite the heterogeneity of the corpus in terms of textual genre. The key contribution to the accuracy of this system is shown to come from Distributional Random Oversampling (DRO), a technique specially tailored to text classification which is here used for the first time in AV. The application of the AV system to the Questio returns a highly confident prediction concerning its authenticity. These findings contribute to the debate on the authorship of the Questio, and highlight DRO’s potential in the application of AV to cultural heritage.
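The evaluation protocol named in the abstract, leave-one-out cross-validation scored with F1, can be sketched as follows. The verifier here is a toy 1-nearest-neighbor rule on a single numeric feature, chosen only so the example runs; it stands in for the paper's AV systems, whose features and classifiers the abstract does not describe, and DRO is not implemented.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def nearest_neighbor_label(train, x):
    """Toy verifier: label of the closest training point (1-D feature)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def leave_one_out_f1(samples):
    """Hold out each (feature, label) pair once and predict it from the rest."""
    y_true, y_pred = [], []
    for i, (x, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        y_true.append(y)
        y_pred.append(nearest_neighbor_label(rest, x))
    return f1_score(y_true, y_pred)
```

Leave-one-out suits a corpus of a few hundred texts because every document is used for testing exactly once while the model still trains on all the others.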

[NLP-46] Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding

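[Quick read]: This paper addresses the use of generative AI in medical billing and coding. Current foundational large language models (LLMs) perform poorly at generating accurate International Classification of Diseases, 10th edition, Clinical Modification (ICD-10-CM) and Current Procedural Terminology (CPT) codes, and applying them in healthcare raises security and financial challenges. The paper proposes a strategy that balances accuracy, accessibility, and patient privacy: fine-tuning the Phi-3 Mini and Phi-3 Medium models on institutional data for this specific task. The key is to take the postoperative surgical report as input and generate the ICD-10, CPT, and modifier codes of the patient's billing claim, with fine-tuning raising code accuracy and lowering the share of invalid codes. The fine-tuned Phi-3 Medium model performs best on code-generation accuracy and format fidelity, doing as well as or better than GPT-4o.

Link: https://arxiv.org/abs/2501.05479
Authors: John C. Rollman(1), Bruce Rogers(1), Hamed Zaribafzadeh(1), Daniel Buckland(2), Ursula Rogers(1), Jennifer Gagnon(1), Ozanan Meireles(1), Lindsay Jennings(3), Jim Bennett(1), Jennifer Nicholson(3), Nandan Lad(4), Linda Cendales(1), Andreas Seas(4,5,6), Alessandro Martinino(6), E. Shelley Hwang(1), Allan D. Kirk(1)
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 3 figures, 2 tables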

Abstract: Background: Healthcare has many manual processes that can benefit from automation and augmentation with Generative Artificial Intelligence (AI), such as the medical billing and coding process. However, current foundational Large Language Models (LLMs) perform poorly when tasked with generating accurate International Classification of Diseases, 10th edition, Clinical Modification (ICD-10-CM) and Current Procedural Terminology (CPT) codes. Additionally, there are many security and financial challenges in the application of generative AI to healthcare. We present a strategy for developing generative AI tools in healthcare, specifically for medical billing and coding, that balances accuracy, accessibility, and patient privacy. Methods: We fine-tune the Phi-3 Mini and Phi-3 Medium LLMs using institutional data and compare the results against the Phi-3 base model, a Phi-3 RAG application, and GPT-4o. We use the postoperative surgical report as input and the patient’s billing claim, with the associated ICD-10, CPT, and Modifier codes, as the target result. Performance is measured by accuracy of code generation, proportion of invalid codes, and the fidelity of the billing claim format. Results: Both fine-tuned models performed as well as or better than GPT-4o. The Phi-3 Medium fine-tuned model showed the best performance (ICD-10 Recall and Precision: 72%, 72%; CPT Recall and Precision: 77%, 79%; Modifier Recall and Precision: 63%, 64%). The Phi-3 Medium fine-tuned model only fabricated 1% of ICD-10 codes and 0.6% of CPT codes generated. Conclusions: Our study shows that a small model that is fine-tuned on domain-specific data for specific tasks using a simple set of open-source tools and minimal technological and monetary requirements performs as well as the larger contemporary consumer models.
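The reported metrics (recall, precision, and the share of fabricated codes) can be computed per claim roughly as below. This is a sketch under assumptions: the abstract does not say whether scores are computed per claim or pooled across claims, the function names are hypothetical, and the codes in any usage are placeholders rather than real ICD-10 or CPT codes.

```python
def precision_recall(predicted, reference):
    """Set-based precision and recall of generated codes against the billed codes."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


def invalid_rate(predicted, valid_codes):
    """Share of generated codes absent from the valid code set (fabricated codes)."""
    predicted = list(predicted)
    if not predicted:
        return 0.0
    return sum(code not in valid_codes for code in predicted) / len(predicted)
```

Keeping the invalid-code rate separate from precision matters: a wrong but real code and a fabricated code both lower precision, but only the latter signals that the model is inventing identifiers.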

[NLP-47] Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models

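[Quick read]: This paper addresses the integration of Arabic into Vision-and-Language Navigation (VLN) for robotics, an area that existing research has left largely unexplored. The key approach is the NavGPT framework, a purely LLM-based instruction-following navigation agent, which is used to assess the impact of language on navigation reasoning through zero-shot sequential action prediction on the R2R dataset. The study comprehensively evaluates multilingual Small Language Models (SLMs), including GPT-4o mini, Llama 3 8B, and Phi-3 medium 14B, alongside the Arabic-centric LLM Jais. The experiments show that the NavGPT framework is capable of high-level planning for navigation tasks when given instructions in either English or Arabic, although some models struggle with reasoning and planning in Arabic. The findings underline the importance of strengthening planning and reasoning in language models and point to the potential of Arabic-language models in real-world applications.

Link: https://arxiv.org/abs/2501.05478
Authors: Malak Mansour, Ahmed Aly, Bahey Tharwat, Sarim Hashmi, Dong An, Ian Reid
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: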

Abstract: Large Language Models (LLMs) such as GPT-4, trained on a huge number of datasets spanning multiple domains, exhibit significant reasoning, understanding, and planning capabilities across various tasks. This study presents the first work on Arabic language integration within the Vision-and-Language Navigation (VLN) domain in robotics, an area that has been notably underexplored in existing research. We perform a comprehensive evaluation of state-of-the-art multilingual Small Language Models (SLMs), including GPT-4o mini, Llama 3 8B, and Phi-3 medium 14B, alongside the Arabic-centric LLM, Jais. Our approach utilizes the NavGPT framework, a pure LLM-based instruction-following navigation agent, to assess the impact of language on navigation reasoning through zero-shot sequential action prediction using the R2R dataset. Through comprehensive experiments, we demonstrate that our framework is capable of high-level planning for navigation tasks when provided with instructions in both English and Arabic. However, certain models struggled with reasoning and planning in the Arabic language due to inherent limitations in their capabilities, sub-optimal performance, and parsing issues. These findings highlight the importance of enhancing planning and reasoning capabilities in language models for effective navigation, emphasizing this as a key area for further development while also unlocking the potential of Arabic-language models for impactful real-world applications.

[NLP-48] IntegrityAI at GenAI Detection Task 2: Detecting Machine-Generated Academic Essays in English and Arabic Using ELECTRA and Stylometry

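[Quick read]: This paper addresses the detection of machine-generated essays written for academic purposes. It fine-tunes pre-trained transformer-based models, combined with stylometric features, on Arabic and English academic essays. Specifically, custom models based on ELECTRA (English) and AraELECTRA (Arabic) were trained and evaluated on a benchmark dataset. The models achieved an F1-score of 99.7% on the English subtask, ranking 2nd among 26 teams, and 98.4% on the Arabic subtask, ranking 1st among 23 teams. The key to the solution is fine-tuning pre-trained transformer models together with stylometric features to separate machine-generated essays from human-written ones.

Link: https://arxiv.org/abs/2501.05476
Authors: Mohammad AL-Smadi
Affiliations: Qatar University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: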

Abstract: Recent research has investigated the problem of detecting machine-generated essays for academic purposes. To address this challenge, this research utilizes pre-trained, transformer-based models fine-tuned on Arabic and English academic essays with stylometric features. Custom models based on ELECTRA for English and AraELECTRA for Arabic were trained and evaluated using a benchmark dataset. The proposed models achieved excellent results, with an F1-score of 99.7%, ranking 2nd among 26 teams in the English subtask, and 98.4%, finishing 1st out of 23 teams in the Arabic one.
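Stylometric features are simple surface statistics of a text that can be fed to a classifier alongside transformer representations. The abstract does not list which features the system uses, so the three below (average word length, type-token ratio, average sentence length) are common choices picked for illustration, and `stylometric_features` is a hypothetical name.

```python
import re


def stylometric_features(text):
    """Compute a few surface statistics of `text`.

    Words are runs of Unicode letters, so the word-level features also work
    for Arabic; sentence splitting only knows '.', '!' and '?'.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words:
        return {"avg_word_len": 0.0, "type_token_ratio": 0.0, "avg_sentence_len": 0.0}
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
    }
```

How such features are combined with ELECTRA or AraELECTRA in the actual system is not stated in the abstract; concatenating them with a pooled model representation before the classification layer would be one plausible arrangement.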

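[NLP-49] Retrieval-Augmented Generation by Evidence Retroactivity in LLMs

[Quick read]: This paper addresses the limitations of existing retrieval-augmented generation methods on multi-hop complex questions. Existing methods typically run a dynamic multi-round retrieval-generation process that decomposes a complex question into sub-problems. They rely on a unidirectional forward reasoning paradigm, so errors caused by insufficient reasoning steps or by inherent flaws in the retrieval system cannot be reversed and may derail the whole reasoning chain. The paper proposes Retroactive Retrieval-Augmented Generation (RetroRAG), whose key innovation is a retroactive reasoning paradigm. RetroRAG builds an evidence-collation-discovery framework that searches for, generates, and refines credible evidence, and uses it to revise and update the reasoning chain. It synthesizes inferential evidence about the key entities in the question from existing source knowledge and formulates search queries to uncover more information. As new evidence is found, RetroRAG keeps updating and organizing it, which improves its ability to locate further necessary evidence. Paired with an Answerer module that generates and evaluates outputs, RetroRAG refines its reasoning iteratively until it obtains a reliable answer. Experiments show that RetroRAG significantly outperforms existing methods.

Link: https://arxiv.org/abs/2501.05475
Authors: Liang Xiao, Wen Dai, Shuai Chen, Bin Qin, Chongyang Shi, Haopeng Jing, Tianyu Guo
Affiliations: Xiaomi Corporation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: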

Abstract:Retrieval-augmented generation has gained significant attention due to its ability to integrate relevant external knowledge, enhancing the accuracy and reliability of the LLMs’ responses. Most of the existing methods apply a dynamic multiple retrieval-generating process, to address multi-hop complex questions by decomposing them into sub-problems. However, these methods rely on an unidirectional forward reasoning paradigm, where errors from insufficient reasoning steps or inherent flaws in current retrieval systems are irreversible, potentially derailing the entire reasoning chain. For the first time, this work introduces Retroactive Retrieval-Augmented Generation (RetroRAG), a novel framework to build a retroactive reasoning paradigm. RetroRAG revises and updates the evidence, redirecting the reasoning chain to the correct direction. RetroRAG constructs an evidence-collation-discovery framework to search, generate, and refine credible evidence. It synthesizes inferential evidence related to the key entities in the question from the existing source knowledge and formulates search queries to uncover additional information. As new evidence is found, RetroRAG continually updates and organizes this information, enhancing its ability to locate further necessary evidence. Paired with an Answerer to generate and evaluate outputs, RetroRAG is capable of refining its reasoning process iteratively until a reliable answer is obtained. Empirical evaluations show that RetroRAG significantly outperforms existing methods.

[NLP-50] Modality-Invariant Bidirectional Temporal Representation Distillation Network for Missing Multimodal Sentiment Analysis ICASSP2025

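Quick read: This paper targets two main challenges in Multimodal Sentiment Analysis (MSA): missing modalities and modality heterogeneity. Missing modalities stem from incomplete real-world data, while heterogeneity refers to the differences between data from different modalities. To address both, the paper proposes the Modality-Invariant Bidirectional Temporal Representation Distillation Network (MITR-DNet). The network uses distillation: a teacher model trained on complete modalities guides a student model that sees missing modalities, which keeps the model robust when modalities are absent. The paper also develops the Modality-Invariant Bidirectional Temporal Representation Learning Module (MIB-TRL) to mitigate heterogeneity.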

Link: https://arxiv.org/abs/2501.05474
Authors: Xincheng Wang, Liejun Wang, Yinfeng Yu, Xinxin Jiao
Affiliations: School of Computer Science and Technology, Xinjiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted for publication by 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

Abstract:Multimodal Sentiment Analysis (MSA) integrates diverse modalities(text, audio, and video) to comprehensively analyze and understand individuals’ emotional states. However, the real-world prevalence of incomplete data poses significant challenges to MSA, mainly due to the randomness of modality missing. Moreover, the heterogeneity issue in multimodal data has yet to be effectively addressed. To tackle these challenges, we introduce the Modality-Invariant Bidirectional Temporal Representation Distillation Network (MITR-DNet) for Missing Multimodal Sentiment Analysis. MITR-DNet employs a distillation approach, wherein a complete modality teacher model guides a missing modality student model, ensuring robustness in the presence of modality missing. Simultaneously, we developed the Modality-Invariant Bidirectional Temporal Representation Learning Module (MIB-TRL) to mitigate heterogeneity.
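The teacher-student setup described above can be illustrated with a generic feature-distillation loss. The abstract does not give MITR-DNet's actual objective, so the squared-error terms, the `alpha` weighting, and all names below are assumptions chosen for a minimal sketch.

```python
from typing import List


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_feat: List[float],
                      teacher_feat: List[float],
                      student_pred: float,
                      target: float,
                      alpha: float = 0.5) -> float:
    """Blend a task loss with a term pulling student features toward the teacher.

    student_feat: features from the model that sees incomplete modalities.
    teacher_feat: features from the model that sees all modalities.
    alpha: hypothetical weight of the distillation term.
    """
    task_loss = (student_pred - target) ** 2       # sentiment regression error
    feat_loss = mse(student_feat, teacher_feat)    # imitation of the teacher
    return (1 - alpha) * task_loss + alpha * feat_loss
```

The distillation term is zero only when the student reproduces the teacher's representation, which is what pushes it to compensate for the missing input.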

[NLP-51] LatteReview: A Multi-Agent Framework for Systematic Review Automation Using Large Language Models

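Quick read: This paper addresses the time- and labor-intensive nature of systematic literature reviews and meta-analyses, especially the iterative tasks of screening, evaluation, and data extraction. It presents LatteReview, a Python-based framework that uses large language models (LLMs) and multi-agent systems to automate key parts of the systematic review process. Its core is a set of modular agents that handle tasks such as title and abstract screening, relevance scoring, and structured data extraction. The agents run inside orchestrated workflows that support sequential and parallel review rounds, dynamic decision-making, and iterative refinement based on user feedback. The architecture integrates LLM providers, so it works with both cloud-based and locally hosted models, and it offers Retrieval-Augmented Generation (RAG), multimodal reviews, Pydantic-based validation of structured inputs and outputs, and asynchronous programming for large-scale datasets.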

Link: https://arxiv.org/abs/2501.05468
Authors: Pouria Rouzrokh, Moein Shariatnia
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 31 pages, 5 figures, 5 tables

Abstract:Systematic literature reviews and meta-analyses are essential for synthesizing research insights, but they remain time-intensive and labor-intensive due to the iterative processes of screening, evaluation, and data extraction. This paper introduces and evaluates LatteReview, a Python-based framework that leverages large language models (LLMs) and multi-agent systems to automate key elements of the systematic review process. Designed to streamline workflows while maintaining rigor, LatteReview utilizes modular agents for tasks such as title and abstract screening, relevance scoring, and structured data extraction. These agents operate within orchestrated workflows, supporting sequential and parallel review rounds, dynamic decision-making, and iterative refinement based on user feedback. LatteReview’s architecture integrates LLM providers, enabling compatibility with both cloud-based and locally hosted models. The framework supports features such as Retrieval-Augmented Generation (RAG) for incorporating external context, multimodal reviews, Pydantic-based validation for structured inputs and outputs, and asynchronous programming for handling large-scale datasets. The framework is available on the GitHub repository, with detailed documentation and an installable package.
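A parallel review round with validated, structured output might look roughly like the sketch below. It does not use LatteReview's API: every name is hypothetical, a keyword match stands in for the LLM call, and a standard-library dataclass stands in for the Pydantic validation mentioned in the abstract.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReviewResult:
    """Structured reviewer output; the 1-5 score range is an assumed convention."""
    title: str
    score: int
    include: bool

    def __post_init__(self) -> None:
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")


async def stub_reviewer(title: str, keyword: str) -> ReviewResult:
    """Hypothetical agent: a keyword match replaces a real LLM judgment."""
    await asyncio.sleep(0)  # placeholder for awaiting a network call
    score = 5 if keyword.lower() in title.lower() else 1
    return ReviewResult(title=title, score=score, include=score >= 3)


async def review_round(titles: List[str], keyword: str) -> List[ReviewResult]:
    """Screen all titles concurrently; gather preserves input order."""
    return await asyncio.gather(*(stub_reviewer(t, keyword) for t in titles))


def run_review(titles: List[str], keyword: str) -> List[ReviewResult]:
    """Synchronous entry point for scripts."""
    return asyncio.run(review_round(titles, keyword))
```

Running the reviewers through `asyncio.gather` is what lets a large batch of abstracts wait on slow model calls concurrently instead of one at a time.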

[NLP-52] Small Language Models (SLMs) Can Still Pack a Punch: A survey

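Quick read: This paper asks whether massive scale is the only path forward as foundation AI models keep growing. Surveying about 160 papers, it presents a family of Small Language Models (SLMs) in the 1 to 8 billion parameter range and shows that these smaller models can match, or even outperform, large models. It covers task-agnostic general-purpose SLMs, task-specific SLMs, and techniques for creating SLMs that balance performance, efficiency, scalability, and cost. It also defines and characterizes the "effective sizes" of SLMs, which represent their increased capability relative to large language models (LLMs).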

Link: https://arxiv.org/abs/2501.05465
Authors: Shreyas Subramanian, Vikram Elango, Mecit Gungor
Affiliations: Amazon
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As foundation AI models continue to increase in size, an important question arises - is massive scale the only path forward? This survey of about 160 papers presents a family of Small Language Models (SLMs) in the 1 to 8 billion parameter range that demonstrate smaller models can perform as well, or even outperform large models. We explore task agnostic, general purpose SLMs, task-specific SLMs and techniques to create SLMs that can guide the community to build models while balancing performance, efficiency, scalability and cost. Furthermore we define and characterize SLMs’ effective sizes, representing increased capability with respect to LLMs.

[NLP-53] LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models

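Quick read: This paper addresses the limitations of large language models (LLMs) in medical question answering (MedQA), particularly in understanding domain-specific terminology and performing complex reasoning, which undermine their effectiveness in critical medical applications. It proposes a multi-agent medical question-answering system that incorporates similar case generation. The key is to use the Llama3.1:70B model, which the authors describe as a state-of-the-art LLM, in a multi-agent architecture to improve zero-shot performance on the MedQA dataset. The method relies on the model's inherent medical knowledge and reasoning ability and needs no additional training data. Experiments report improvements of 7% in both accuracy and F1-score over existing benchmark models across several medical QA tasks. The paper also examines the model's interpretability and reliability on complex medical queries, and the authors present the work as a foundation for broader applications of LLMs in the medical domain.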

Link: https://arxiv.org/abs/2501.05464
Authors: Hang Yang, Hao Chen, Hui Guo, Yineng Chen, Ching-Sheng Lin, Shu Hu, Jinrong Hu, Xi Wu, Xin Wang
Affiliations: CUIT (Chengdu University of Information Technology); University at Buffalo; University at Albany; Tunghai University; Purdue University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Accurate and efficient question-answering systems are essential for delivering high-quality patient care in the medical field. While Large Language Models (LLMs) have made remarkable strides across various domains, they continue to face significant challenges in medical question answering, particularly in understanding domain-specific terminologies and performing complex reasoning. These limitations undermine their effectiveness in critical medical applications. To address these issues, we propose a novel approach incorporating similar case generation within a multi-agent medical question-answering (MedQA) system. Specifically, we leverage the Llama3.1:70B model, a state-of-the-art LLM, in a multi-agent architecture to enhance performance on the MedQA dataset using zero-shot learning. Our method capitalizes on the model’s inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data. Experimental results show substantial performance gains over existing benchmark models, with improvements of 7% in both accuracy and F1-score across various medical QA tasks. Furthermore, we examine the model’s interpretability and reliability in addressing complex medical queries. This research not only offers a robust solution for medical question answering but also establishes a foundation for broader applications of LLMs in the medical domain.

[NLP-54] MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model ICASSP2025

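Quick read: This paper addresses the limitations of codec-based text-to-speech (TTS) models on more expressive reference audio or complex text inputs. Existing models show strong zero-shot voice cloning but often struggle on these harder cases. The proposed solution is MARS6, an encoder-decoder Transformer designed for fast, expressive TTS. Its key innovation is a hierarchical decoder that processes new speech tokens at a rate of only 12 Hz, which allows efficient modeling of long-form text while retaining reconstruction quality. MARS6 also combines several recent training and inference techniques to reduce repetitive generation and improve output stability and quality. With only 70 million parameters, MARS6 performs comparably to much larger models, as shown in objective and subjective evaluations.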

Link: https://arxiv.org/abs/2501.05787
Authors: Matthew Baas, Pieter Scholtz, Arnav Mehta, Elliott Dyson, Akshat Prakash, Herman Kamper
Affiliations: Camb.ai
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: 5 pages, 2 figures, 1 table. Accepted at ICASSP 2025

Abstract:Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot voice cloning abilities. However, they often struggle with more expressive references or complex text inputs. We present MARS6, a robust encoder-decoder transformer for rapid, expressive TTS. MARS6 is built on recent improvements in spoken language modelling. Utilizing a hierarchical setup for its decoder, new speech tokens are processed at a rate of only 12 Hz, enabling efficient modelling of long-form text while retaining reconstruction quality. We combine several recent training and inference techniques to reduce repetitive generation and improve output stability and quality. This enables the 70M-parameter MARS6 to achieve similar performance to models many times larger. We show this in objective and subjective evaluations, comparing TTS output quality and reference speaker cloning ability. Project page: this https URL

Computer Vision

[CV-0] Multi-subject Open-set Personalization in Video Generation

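Quick read: This paper addresses the limitations of existing video personalization methods in multi-subject and open-set scenarios. Existing methods are often restricted to limited domains, require time-consuming per-subject optimization, or support only a single subject. The proposed Video Alchemist model is built on a new Diffusion Transformer module that fuses each conditional reference image with its corresponding subject-level text prompt through cross-attention layers, enabling multi-subject, open-set video personalization without time-consuming test-time optimization. Key contributions include: (1) an automatic data construction pipeline with extensive image augmentations, which addresses the scarcity of paired training data and poor generalization to new contexts; and (2) a new personalization benchmark that focuses on subject fidelity and supports diverse personalization scenarios. Experiments show that the method significantly outperforms existing methods in both quantitative and qualitative evaluations.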

Link: https://arxiv.org/abs/2501.06187
Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov
Affiliations: Snap Inc.; UC Merced; CMU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist - a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.

[CV-1] LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

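Quick read: This paper addresses the lack of a comprehensive framework for evaluating visual reasoning and the limited emphasis on step-by-step problem solving in existing approaches. The authors propose a framework with three key contributions for advancing step-by-step visual reasoning in large language models (LLMs). First, they introduce a visual reasoning benchmark designed to evaluate multi-step reasoning tasks; it covers eight categories of challenges, from complex visual perception to scientific reasoning, with over 4,000 reasoning steps in total. Second, they propose a metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence and giving deeper insight than traditional end-task accuracy metrics. Third, they present LlamaV-o1, a multimodal visual reasoning model trained with a multi-step curriculum learning approach in which tasks are progressively organized to support incremental skill acquisition. Experiments show that LlamaV-o1 outperforms existing open-source models; compared with Llava-CoT, it achieves an average score of 67.3 across six benchmarks, an absolute gain of 3.8%, while being 5 times faster during inference scaling.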

Link: https://arxiv.org/abs/2501.06186
Authors: Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
Affiliations: Mohamed bin Zayed University of AI; University of Central Florida; Linköping University; Australian National University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 5 Figures

Abstract:Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large language models (LMMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges with eight different categories ranging from complex visual perception to scientific reasoning with over 4k reasoning steps in total, enabling robust evaluation of LLMs’ abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance compared to traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against close-source proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an average score of 67.3 with an absolute gain of 3.8% across six benchmarks while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.

[CV-2] PEACE: Empowering Geologic Map Holistic Understanding with MLLMs

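Quick read: This paper addresses the shortcomings of current Multimodal Large Language Models (MLLMs) in geologic map understanding. Geologic maps, a fundamental diagram in geology, provide critical information about the structure and composition of the Earth's subsurface and surface, and are widely used in disaster detection, resource exploration, and civil engineering. Existing MLLMs fall short mainly because of the complexity of cartographic generalization, which involves handling high-resolution maps, managing multiple associated components, and requiring domain-specific knowledge.

To address this, the paper proposes GeoMap-Bench, the first benchmark for evaluating MLLMs on geologic map understanding, covering extracting, referring, grounding, reasoning, and analyzing. It also introduces GeoMap-Agent, the first agent designed for geologic map understanding, with three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). Inspired by interdisciplinary collaboration among human scientists, an AI expert group acts as consultants and uses a diverse tool pool to analyze questions comprehensively. GeoMap-Agent achieves an overall score of 0.811 on GeoMap-Bench, significantly outperforming GPT-4o's 0.369. The work, emPowering gEologic mAp holistiC undErstanding (PEACE) with MLLMs, aims to pave the way for advanced AI applications in geology and to improve the efficiency and accuracy of geological investigations.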

Link: https://arxiv.org/abs/2501.06184
Authors: Yangyu Huang, Tianyi Gao, Haoran Xu, Qihao Zhao, Yang Song, Zhipeng Gui, Tengchao Lv, Hao Chen, Lei Cui, Scarlett Li, Furu Wei
Affiliations: Microsoft Research; Chinese Academy of Geological Sciences; Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:

Abstract:Geologic map, as a fundamental diagram in geology science, provides critical insights into the structure and composition of Earth’s subsurface and surface. These maps are indispensable in various fields, including disaster detection, resource exploration, and civil engineering. Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding. This gap is primarily due to the challenging nature of cartographic generalization, which involves handling high-resolution map, managing multiple associated components, and requiring domain-specific knowledge. To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding, which assesses the full-scale abilities in extracting, referring, grounding, reasoning, and analyzing. To bridge this gap, we introduce GeoMap-Agent, the inaugural agent designed for geologic map understanding, which features three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). Inspired by the interdisciplinary collaboration among human scientists, an AI expert group acts as consultants, utilizing a diverse tool pool to comprehensively analyze questions. Through comprehensive experiments, GeoMap-Agent achieves an overall score of 0.811 on GeoMap-Bench, significantly outperforming 0.369 of GPT-4o. Our work, emPowering gEologic mAp holistiC undErstanding (PEACE) with MLLMs, paves the way for advanced AI applications in geology, enhancing the efficiency and accuracy of geological investigations.

[CV-3] VideoAuteur: Towards Long Narrative Video Generation

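Quick read: This paper addresses the difficulty current video generation models have with long sequences, in particular generating long videos that convey clear and informative events and support coherent narration. Existing models produce high-quality clips lasting several seconds, but visual and semantic coherence remains insufficient for long videos. The paper presents a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. The key to the solution is a Long Narrative Video Director that enhances the visual and semantic coherence of generated videos and emphasizes aligning visual embeddings to improve overall video quality. The method also uses finetuning techniques that integrate text and image embeddings into the video generation process, yielding substantial improvements in generating visually detailed and semantically aligned keyframes.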

Link: https://arxiv.org/abs/2501.06173
Authors: Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Zhibei Ma, Alan Yuille, Lu Jiang
Affiliations: Johns Hopkins University; ByteDance Seed; ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, this https URL

Abstract:Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: this https URL

[CV-4] MS-Temba: Multi-Scale Temporal Mamba for Efficient Temporal Action Detection

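Quick read: This paper addresses action detection in hour-long untrimmed videos from real-world scenarios, where actions are densely distributed and show significant intra-class temporal variation. Previous state-of-the-art Transformer-based methods are effective but impractical to deploy because of their high parameter count, GPU memory usage, and limited throughput, which makes them unsuitable for very long videos. The paper adapts the Mamba architecture and proposes Multi-scale Temporal Mamba (MS-Temba), which has two key components: Temporal Mamba (Temba) Blocks and the Temporal Mamba Fuser. Temba Blocks contain the Temporal Local Module (TLM) for short-range temporal modeling and the Dilated Temporal SSM (DTS) for long-range dependencies. By introducing dilations, TLM and DTS capture local and global features at multiple scales. The Temba Fuser then aggregates these scale-specific features with Mamba to learn comprehensive multi-scale representations of untrimmed videos. On three public datasets, MS-Temba outperforms state-of-the-art methods on long videos and matches prior methods on short videos while using only one-eighth of the parameters.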

Link: https://arxiv.org/abs/2501.06138
Authors: Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, Srijan Das
Affiliations: University of North Carolina at Charlotte
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Action detection in real-world scenarios is particularly challenging due to densely distributed actions in hour-long untrimmed videos. It requires modeling both short- and long-term temporal relationships while handling significant intra-class temporal variations. Previous state-of-the-art (SOTA) Transformer-based architectures, though effective, are impractical for real-world deployment due to their high parameter count, GPU memory usage, and limited throughput, making them unsuitable for very long videos. In this work, we innovatively adapt the Mamba architecture for action detection and propose Multi-scale Temporal Mamba (MS-Temba), comprising two key components: Temporal Mamba (Temba) Blocks and the Temporal Mamba Fuser. Temba Blocks include the Temporal Local Module (TLM) for short-range temporal modeling and the Dilated Temporal SSM (DTS) for long-range dependencies. By introducing dilations, a novel concept for Mamba, TLM and DTS capture local and global features at multiple scales. The Temba Fuser aggregates these scale-specific features using Mamba to learn comprehensive multi-scale representations of untrimmed videos. MS-Temba is validated on three public datasets, outperforming SOTA methods on long videos and matching prior methods on short videos while using only one-eighth of the parameters.
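The effect of dilation on temporal context can be shown with plain index arithmetic. The sketch below only illustrates dilated sampling over a frame sequence; it is not the TLM or DTS module, and the function names and the kernel size of 3 are assumptions.

```python
from typing import Dict, List, Sequence


def dilated_windows(frames: Sequence[int],
                    kernel_size: int,
                    dilation: int) -> List[List[int]]:
    """Return every window of kernel_size frames spaced dilation steps apart."""
    span = (kernel_size - 1) * dilation  # distance from first to last tap
    return [
        [frames[start + k * dilation] for k in range(kernel_size)]
        for start in range(len(frames) - span)
    ]


def multi_scale_windows(frames: Sequence[int],
                        dilations: Sequence[int],
                        kernel_size: int = 3) -> Dict[int, List[List[int]]]:
    """Collect windows at several dilation rates, one entry per temporal scale."""
    return {d: dilated_windows(frames, kernel_size, d) for d in dilations}
```

With the same three taps, dilation 1 covers 3 consecutive frames while dilation 2 covers a 5-frame span, so larger dilations reach longer-range context at no extra cost per window.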

[CV-5] Enhancing Refining and Fusing: Towards Robust Multi-Scale and Dense Ship Detection

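Quick read: This paper addresses the challenges of ship detection in synthetic aperture radar (SAR) imagery: complex backgrounds, densely arranged targets, and large scale variations. The authors propose the Center-Aware SAR Ship Detector (CASS-Det), with three key innovations: (1) a center enhancement module (CEM) that uses rotational convolution to emphasize ship centers, improving localization while suppressing background interference; (2) a neighbor attention module (NAM) that uses cross-layer dependencies to refine ship boundaries in dense scenes; and (3) a cross-connected feature pyramid network (CC-FPN) that enhances multi-scale feature fusion by integrating shallow and deep features. Experiments show that CASS-Det reaches state-of-the-art performance on multi-scale and densely arranged ship detection.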

Link: https://arxiv.org/abs/2501.06053
Authors: Congxia Zhao, Xiongjun Fu, Jian Dong, Shen Cao, Chunyan Zhang
Affiliations: School of Integrated Circuits and Electronics, Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Synthetic aperture radar (SAR) imaging, celebrated for its high resolution, all-weather capability, and day-night operability, is indispensable for maritime applications. However, ship detection in SAR imagery faces significant challenges, including complex backgrounds, densely arranged targets, and large scale variations. To address these issues, we propose a novel framework, Center-Aware SAR Ship Detector (CASS-Det), designed for robust multi-scale and densely packed ship detection. CASS-Det integrates three key innovations: (1) a center enhancement module (CEM) that employs rotational convolution to emphasize ship centers, improving localization while suppressing background interference; (2) a neighbor attention module (NAM) that leverages cross-layer dependencies to refine ship boundaries in densely populated scenes; and (3) a cross-connected feature pyramid network (CC-FPN) that enhances multi-scale feature fusion by integrating shallow and deep features. Extensive experiments on the SSDD, HRSID, and LS-SSDD-v1.0 datasets demonstrate the state-of-the-art performance of CASS-Det, excelling at detecting multi-scale and densely arranged ships.

[CV-6] MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention Mechanism for Tiny Datasets

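Quick read: This paper addresses the poor performance of Vision Transformer (ViT) on small-scale (tiny) datasets. ViT needs a large amount of training data to ensure its representational capacity, but large-scale datasets are not always available in practice, so a ViT trained only on a small dataset performs worse than Convolutional Neural Networks (CNNs). The paper proposes a small-size ViT architecture, MSCViT, with three key components: (1) wavelet convolution, which selectively combines the high-frequency components obtained by frequency division with the convolution channel to extract local features; (2) a lightweight multi-head attention module that reduces the number of tokens and the computational cost; and (3) a local feature extraction module that replaces the positional encoding (PE) in the backbone. These changes make MSCViT parameter-efficient and well suited to tiny datasets. MSCViT achieves 84.68% accuracy on CIFAR-100 with 14.0M parameters and 2.5 GFLOPs, without pre-training on large datasets.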

Link: https://arxiv.org/abs/2501.06040
Authors: Bowei Zhang, Yi Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision Transformer (ViT) has demonstrated significant potential in various vision tasks due to its strong ability in modelling long-range dependencies. However, such success is largely fueled by training on massive samples. In real applications, the large-scale datasets are not always available, and ViT performs worse than Convolutional Neural Networks (CNNs) if it is only trained on small scale dataset (called tiny dataset), since it requires large amount of training data to ensure its representational capacity. In this paper, a small-size ViT architecture with multi-scale self-attention mechanism and convolution blocks is presented (dubbed MSCViT) to model different scales of attention at each layer. Firstly, we introduced wavelet convolution, which selectively combines the high-frequency components obtained by frequency division with our convolution channel to extract local features. Then, a lightweight multi-head attention module is developed to reduce the number of tokens and computational costs. Finally, the positional encoding (PE) in the backbone is replaced by a local feature extraction module. Compared with the original ViT, it is parameter-efficient and is particularly suitable for tiny datasets. Extensive experiments have been conducted on tiny datasets, in which our model achieves an accuracy of 84.68% on CIFAR-100 with 14.0M parameters and 2.5 GFLOPs, without pre-training on large datasets.
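The "frequency division" step can be illustrated with a one-level Haar transform on a 1D signal. The abstract does not say which wavelet MSCViT uses or how the bands are combined with the convolution channel, so the Haar choice, the averaging normalization, and the function name are all assumptions; real images would need the 2D version.

```python
from typing import List, Tuple


def haar_1d(signal: List[float]) -> Tuple[List[float], List[float]]:
    """Split a signal into low-frequency (averages) and high-frequency (details) bands."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    low = [(a + b) / 2 for a, b in pairs]    # smooth content
    high = [(a - b) / 2 for a, b in pairs]   # local changes such as edges
    return low, high
```

The high band is non-zero only where neighboring values differ, which is why high-frequency components are a natural source of local features.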

[CV-7] A Holistically Point-guided Text Framework for Weakly-Supervised Camouflaged Object Detection

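Quick read: This paper addresses Weakly-Supervised Camouflaged Object Detection (WSCOD), specifically how to use sparse annotations (points and text) to train models that segment objects blending visually into their surroundings. The key to the solution is a holistically point-guided text framework with three phases: SEGMENT, CHOOSE, and TRAIN. The paper proposes Point-guided Candidate Generation (PCG), in which the point's foreground information corrects the text path so that missed objects are explicitly corrected and recovered during mask generation. It also introduces a Qualified Candidate Discriminator (QCD), which uses CLIP to choose the optimal mask for a given text prompt; the chosen pseudo mask is then used for training with a self-supervised Vision Transformer. To support the method, the paper builds a new point-supervised dataset (P2C-COD) and a text-supervised dataset (T-COD). Experiments on four benchmark datasets show that the method outperforms state-of-the-art methods by a large margin and even surpasses some fully-supervised camouflaged object detection methods.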

Link: https://arxiv.org/abs/2501.06038
Authors: Tsui Qin Mok, Shuyong Gao, Haozhe Xing, Miaoyang He, Yan Wang, Wenqiang Zhang
Affiliations: Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Weakly-Supervised Camouflaged Object Detection (WSCOD) has gained popularity for its promise to train models with weak labels to segment objects that visually blend into their surroundings. Recently, some methods using sparsely-annotated supervision shown promising results through scribbling in WSCOD, while point-text supervision remains underexplored. Hence, this paper introduces a novel holistically point-guided text framework for WSCOD by decomposing into three phases: segment, choose, train. Specifically, we propose Point-guided Candidate Generation (PCG), where the point’s foreground serves as a correction for the text path to explicitly correct and rejuvenate the loss detection object during the mask generation process (SEGMENT). We also introduce a Qualified Candidate Discriminator (QCD) to choose the optimal mask from a given text prompt using CLIP (CHOOSE), and employ the chosen pseudo mask for training with a self-supervised Vision Transformer (TRAIN). Additionally, we developed a new point-supervised dataset (P2C-COD) and a text-supervised dataset (T-COD). Comprehensive experiments on four benchmark datasets demonstrate our method outperforms state-of-the-art methods by a large margin, and also outperforms some existing fully-supervised camouflaged object detection methods.

[CV-8] Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction

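Quick read: This paper addresses limb stretching and jitter in probabilistic human motion prediction. Current approaches report high diversity and realism, yet often generate motions with limb stretching and jitter that go undetected. The authors propose SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body in its architecture and training. The key is a new nonisotropic Gaussian diffusion formulation aligned with the natural kinematic structure of the human skeleton. Results show that the method generates realistic predictions while avoiding artifacts such as limb distortion, and that it outperforms various baselines on three real-world datasets, setting a new benchmark.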

Link: https://arxiv.org/abs/2501.06035
Authors: Cecilia Curreli, Dominik Muhle, Abhishek Saroha, Zhenzhang Ye, Riccardo Marin, Daniel Cremers
Affiliations: Technical University of Munich; Munich Center for Machine Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Probabilistic human motion prediction aims to forecast multiple possible future movements from past observations. While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. Our model is trained with a novel nonisotropic Gaussian diffusion formulation that aligns with the natural kinematic structure of the human skeleton. Results show that our approach outperforms conventional isotropic alternatives, consistently generating realistic predictions while avoiding artifacts such as limb distortion. Additionally, we identify a limitation in commonly used diversity metrics, which may inadvertently favor models that produce inconsistent limb lengths within the same sequence. SkeletonDiffusion sets a new benchmark on three real-world datasets, outperforming various baselines across multiple evaluation metrics. Visit our project page: this https URL
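To make the isotropic versus nonisotropic distinction concrete, the standard forward noising step can be written next to a simple nonisotropic variant. This is a generic sketch rather than the paper's exact formulation: the fixed covariance $\Sigma$ (imagined here as encoding correlations between connected joints) and the reuse of the usual schedule $\bar{\alpha}_t$ are assumptions.

```latex
% Standard isotropic forward process: every coordinate receives independent noise.
\[
q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\,x_0,\; (1-\bar{\alpha}_t)\,I\right)
\]

% Nonisotropic variant, assuming a fixed positive-definite covariance \Sigma:
\[
q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\,x_0,\; (1-\bar{\alpha}_t)\,\Sigma\right),
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\Sigma^{1/2}\varepsilon,
\quad \varepsilon \sim \mathcal{N}(0, I)
\]
```

Setting $\Sigma = I$ recovers the isotropic case, so the nonisotropic form only changes how noise is correlated across coordinates, not how much is added overall at step $t$.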

[CV-9] Generate Transduct Adapt: Iterative Transduction with VLMs

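Quick read: This paper addresses how to better exploit the structure of the language space to improve classification accuracy in zero-shot learning. Existing transductive zero-shot methods rely mainly on image-image similarity and overlook the language space. The paper proposes GTA-CLIP, whose key innovation is joint transduction in the language and vision spaces under supervision from language models. GTA-CLIP iterates over three steps: (1) incrementally exploring the attribute space by querying language models; (2) attribute-augmented transductive inference; and (3) fine-tuning the language and vision encoders on labels inferred within the dataset. Across 12 datasets and 3 encoders, GTA-CLIP improves average zero-shot performance by 8.6% over CLIP and 3.7% over transductive CLIP, with similar improvements in the few-shot setting.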

Link: https://arxiv.org/abs/2501.06031
Authors: Oindrila Saha, Logan Lawrence, Grant Van Horn, Subhransu Maji
Affiliations: University of Massachusetts, Amherst
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code will be released at this https URL

Abstract:Transductive zero-shot learning with vision-language models leverages image-image similarities within the dataset to achieve better classification accuracy compared to the inductive setting. However, there is little work that explores the structure of the language space in this context. We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. Our approach is iterative and consists of three steps: (i) incrementally exploring the attribute space by querying language models, (ii) an attribute-augmented transductive inference procedure, and (iii) fine-tuning the language and vision encoders based on inferred labels within the dataset. Through experiments with CLIP encoders, we demonstrate that GTA-CLIP, yields an average performance improvement of 8.6% and 3.7% across 12 datasets and 3 encoders, over CLIP and transductive CLIP respectively in the zero-shot setting. We also observe similar improvements in a few-shot setting. We present ablation studies that demonstrate the value of each step and visualize how the vision and language spaces evolve over iterations driven by the transductive learning.
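The difference between inductive and transductive prediction can be shown with a small label-smoothing loop: each image's class scores are blended with those of similar images in the same dataset. This is a generic propagation sketch, not GTA-CLIP's inference procedure; it has no attribute generation or fine-tuning, and the `weight`, `steps`, and function names are assumptions.

```python
from typing import List


def argmax(row: List[float]) -> int:
    """Index of the largest score in a row."""
    return max(range(len(row)), key=lambda c: row[c])


def transductive_refine(text_scores: List[List[float]],
                        affinity: List[List[float]],
                        weight: float = 0.5,
                        steps: int = 10) -> List[int]:
    """Blend each image's text-based scores with its neighbors' current scores.

    text_scores[i][c]: zero-shot score of image i for class c.
    affinity[i][j]: non-negative image-image similarity.
    """
    n = len(text_scores)
    scores = [row[:] for row in text_scores]
    for _ in range(steps):
        updated = []
        for i in range(n):
            total = sum(affinity[i][j] for j in range(n) if j != i)
            new_row = []
            for c in range(len(text_scores[i])):
                if total > 0:
                    neighbor = sum(affinity[i][j] * scores[j][c]
                                   for j in range(n) if j != i) / total
                else:
                    neighbor = scores[i][c]  # isolated image keeps its own score
                new_row.append((1 - weight) * text_scores[i][c] + weight * neighbor)
            updated.append(new_row)
        scores = updated
    return [argmax(row) for row in scores]
```

An image whose text-based score is ambiguous can be pulled toward the class of the images it resembles, which is the extra signal the transductive setting provides.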

[CV-10] Geometric-Based Nail Segmentation for Clinical Measurements

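Quick read: This paper addresses accurate segmentation and measurement of toenails in a clinical study, where the nail must be distinguished from the surrounding skin. Because a nail locally looks similar to skin, conventional segmentation methods struggle to separate them. The method has three key steps: first, the Hough transform locates the tip of the toe and estimates the nail's location and size; second, the image's super-pixels are classified based on their geometric and photometric information; finally, the watershed transform delineates the nail border. Validated on a 348-image medical dataset, the method achieves an accuracy of 0.993 and an F-measure of 0.925, and is robust to nail shape, skin pigmentation, illumination conditions, and large regions affected by a medical condition.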

Link: https://arxiv.org/abs/2501.06027
Authors: Bernat Galmés, Gabriel Moyà-Alcover, Pedro Bibiloni, Javier Varona, Antoni Jaume-i-Capó
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A robust segmentation method that can be used to perform measurements on toenails is presented. The proposed method is used as the first step in a clinical trial to objectively quantify the incidence of a particular pathology. For such an assessment, it is necessary to distinguish a nail, which locally appears to be similar to the skin. Many algorithms have been used, each of which leverages different aspects of toenail appearance. We used the Hough transform to locate the tip of the toe and estimate the nail location and size. Subsequently, we classified the super-pixels of the image based on their geometric and photometric information. Thereafter, the watershed transform delineated the border of the nail. The method was validated using a 348-image medical dataset, achieving an accuracy of 0.993 and an F-measure of 0.925. The proposed method is considerably robust across samples, with respect to factors such as nail shape, skin pigmentation, illumination conditions, and appearance of large regions affected by a medical condition
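The gap between the reported accuracy (0.993) and F-measure (0.925) is typical when the target class covers a small part of the image. The sketch below computes both metrics from pixel counts; the counts in the usage note are made up for illustration and are not the paper's data.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all pixels labeled correctly, background included."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; ignores true negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with hypothetical counts of 50 true-positive, 5 false-positive, 5 false-negative, and 940 true-negative pixels, accuracy is 0.99 while the F-measure is about 0.909, because the many correctly labeled background pixels raise accuracy but do not enter the F-measure.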

[CV-11] BRIGHT: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response

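[Quick read]: This paper addresses how to use Earth observation (EO) data for rapid, comprehensive building damage assessment (BDA) after a disaster. Conventional solutions based on optical data are limited by weather and daylight, so they cannot support all-weather, day-and-night disaster response. The key idea is to integrate multimodal (MM) EO data, in particular the combination of optical and synthetic aperture radar (SAR) imagery. To this end, the paper presents the BRIGHT dataset, which uses very-high-resolution optical and SAR imagery covering five types of natural disasters and two types of man-made disasters across 12 regions worldwide, with a particular focus on developing countries that need external assistance. BRIGHT is the first open-access, globally distributed, event-diverse multimodal dataset curated specifically to support AI-based disaster response. In the experiments, seven advanced AI models trained on BRIGHT are tested to validate transferability and robustness.

Link: https://arxiv.org/abs/2501.06019
Authors: Hongruixuan Chen,Jian Song,Olivier Dietrich,Clifford Broni-Bediako,Weihao Xuan,Junjue Wang,Xinlei Shao,Yimin Wei,Junshi Xia,Cuiling Lan,Konrad Schindler,Naoto Yokoya
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: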

Abstract:Disaster events occur around the world and cause significant damage to human life and property. Earth observation (EO) data enables rapid and comprehensive building damage assessment (BDA), an essential capability in the aftermath of a disaster to reduce human casualties and to inform disaster relief efforts. Recent research focuses on the development of AI models to achieve accurate mapping of unseen disaster events, mostly using optical EO data. However, solutions based on optical data are limited to clear skies and daylight hours, preventing a prompt response to disasters. Integrating multimodal (MM) EO data, particularly the combination of optical and SAR imagery, makes it possible to provide all-weather, day-and-night disaster responses. Despite this potential, the development of robust multimodal AI models has been constrained by the lack of suitable benchmark datasets. In this paper, we present a BDA dataset using veRy-hIGH-resoluTion optical and SAR imagery (BRIGHT) to support AI-based all-weather disaster response. To the best of our knowledge, BRIGHT is the first open-access, globally distributed, event-diverse MM dataset specifically curated to support AI-based disaster response. It covers five types of natural disasters and two types of man-made disasters across 12 regions worldwide, with a particular focus on developing countries where external assistance is most needed. The optical and SAR imagery in BRIGHT, with a spatial resolution between 0.3-1 meters, provides detailed representations of individual buildings, making it ideal for precise BDA. In our experiments, we have tested seven advanced AI models trained with our BRIGHT to validate the transferability and robustness. The dataset and code are available at this https URL. BRIGHT also serves as the official dataset for the 2025 IEEE GRSS Data Fusion Contest.

[CV-12] Pose-independent 3D Anthropometry from Sparse Data

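[Quick read]: This paper addresses two problems in 3D digital anthropometry: measurement accuracy suffers because the subject must hold a static A-pose throughout scanning, and subjects who cannot assume the A-pose (for example, those with injuries or disabilities) cannot be measured at all. The key to the solution is a method based on sparse landmarks that obtains body measurements in any pose. The sparse landmarks are used to build pose-independent features, and a network is trained to predict the body measurements as taken in the standard A-pose. The method achieves accuracy comparable to existing methods without relying on dense geometric data and works for subjects in any pose, which broadens the applicability of 3D digital anthropometry. The authors also release the method as open source, addressing the lack of open-source 3D anthropometry methods.

Link: https://arxiv.org/abs/2501.06014
Authors: David Bojanić,Stefanie Wuhrer,Tomislav Petković,Tomislav Pribanić
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: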

Abstract:3D digital anthropometry is the study of estimating human body measurements from 3D scans. Precise body measurements are important health indicators in the medical industry, and guiding factors in the fashion, ergonomic and entertainment industries. The measuring protocol consists of scanning the whole subject in the static A-pose, which is maintained without breathing or movement during the scanning process. However, the A-pose is not easy to maintain during the whole scanning process, which can last even up to a couple of minutes. This constraint affects the final quality of the scan, which in turn affects the accuracy of the estimated body measurements obtained from methods that rely on dense geometric data. Additionally, this constraint makes it impossible to develop a digital anthropometry method for subjects unable to assume the A-pose, such as those with injuries or disabilities. We propose a method that can obtain body measurements from sparse landmarks acquired in any pose. We make use of the sparse landmarks of the posed subject to create pose-independent features, and train a network to predict the body measurements as taken from the standard A-pose. We show that our method achieves comparable results to competing methods that use dense geometry in the standard A-pose, but has the capability of estimating the body measurements from any pose using sparse landmarks only. Finally, we address the lack of open-source 3D anthropometry methods by making our method available to the research community at this https URL.

[CV-13] CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control

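[Quick read]: This paper addresses the generation of fly-through videos of a scene from a single image and a given camera trajectory. The key is to build on an image-to-video latent diffusion model and condition its UNet denoiser using four techniques: (1) conditioning the UNet's temporal blocks on raw camera extrinsics, similar to MotionCtrl; (2) using images containing camera rays and directions, similar to CameraCtrl; (3) reprojecting the initial image to subsequent frames and using the resulting video as a condition; (4) using 2D<=>3D transformers to introduce a global 3D representation, which implicitly conditions on the camera poses. These conditions are combined in a ControlNet-style architecture. The paper also proposes a metric that evaluates overall video quality and the ability to preserve details under view changes, uses it to analyze the trade-offs of individual and combined conditions, and identifies an optimal combination. Finally, camera positions in the datasets are calibrated for scale consistency across scenes, and the scene exploration model CamCtrl3D is trained, demonstrating state-of-the-art results.

Link: https://arxiv.org/abs/2501.06006
Authors: Stefan Popov,Amit Raj,Michael Krainin,Yuanzhen Li,William T. Freeman,Michael Rubinstein
Affiliations: Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in 3DV 2025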

Abstract:We propose a method for generating fly-through videos of a scene, from a single image and a given camera trajectory. We build upon an image-to-video latent diffusion model. We condition its UNet denoiser on the camera trajectory, using four techniques. (1) We condition the UNet’s temporal blocks on raw camera extrinsics, similar to MotionCtrl. (2) We use images containing camera rays and directions, similar to CameraCtrl. (3) We reproject the initial image to subsequent frames and use the resulting video as a condition. (4) We use 2D<=>3D transformers to introduce a global 3D representation, which implicitly conditions on the camera poses. We combine all conditions in a ControlNet-style architecture. We then propose a metric that evaluates overall video quality and the ability to preserve details with view changes, which we use to analyze the trade-offs of individual and combined conditions. Finally, we identify an optimal combination of conditions. We calibrate camera positions in our datasets for scale consistency across scenes, and we train our scene exploration model, CamCtrl3D, demonstrating state-of-the-art results.

[CV-14] SeMi: When Imbalanced Semi-Supervised Learning Meets Mining Hard Examples

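[Quick read]: This paper addresses the negative impact of class-imbalanced data distributions on semi-supervised learning (SSL) in real-world scenarios. Existing class-imbalanced semi-supervised learning (CISSL) methods mainly rebalance the dataset but overlook the potential of hard examples, so unlabeled data is hard to exploit fully even with sophisticated algorithms. The paper proposes SeMi, a method that improves imbalanced semi-supervised learning by mining hard examples. Its key idea is to identify hard examples from the entropy differences between the logits of hard and easy examples, which increases the utility of unlabeled data and better handles the imbalance in CISSL. The method also maintains a class-balanced memory bank with confidence decay that stores high-confidence embeddings to make pseudo-labels more reliable. Experiments show that SeMi outperforms existing state-of-the-art methods on several standard CISSL benchmarks, especially in reversed scenarios, where the best result is approximately a 54.8% improvement over the baseline methods.

Link: https://arxiv.org/abs/2501.06004
Authors: Yin Wang,Zixuan Wang,Hao Lu,Zhen Qin,Hailiang Zhao,Guanjie Cheng,Ge Su,Li Kuang,Mengchu Zhou,Shuiguang Deng
Affiliations: Zhejiang University; The Hong Kong University of Science and Technology; Central South University; Zhejiang Gongshang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures, conference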

Abstract:Semi-Supervised Learning (SSL) can leverage abundant unlabeled data to boost model performance. However, the class-imbalanced data distribution in real-world scenarios poses great challenges to SSL, resulting in performance degradation. Existing class-imbalanced semi-supervised learning (CISSL) methods mainly focus on rebalancing datasets but ignore the potential of using hard examples to enhance performance, making it difficult to fully harness the power of unlabeled data even with sophisticated algorithms. To address this issue, we propose a method that enhances the performance of Imbalanced Semi-Supervised Learning by Mining Hard Examples (SeMi). This method distinguishes the entropy differences among logits of hard and easy examples, thereby identifying hard examples and increasing the utility of unlabeled data, better addressing the imbalance problem in CISSL. In addition, we maintain a class-balanced memory bank with confidence decay for storing high-confidence embeddings to enhance the pseudo-labels’ reliability. Although our method is simple, it is effective and seamlessly integrates with existing approaches. We perform comprehensive experiments on standard CISSL benchmarks and experimentally demonstrate that our proposed SeMi outperforms existing state-of-the-art methods on multiple benchmarks, especially in reversed scenarios, where our best result shows approximately a 54.8% improvement over the baseline methods.
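The idea of separating hard from easy examples by the entropy of their predictions can be sketched as follows. This is only an illustration under assumptions: the fixed `threshold` and the function names are hypothetical, and SeMi's actual criterion, which works with entropy differences, is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def is_hard(logits, threshold):
    """Flag an example as hard when its predictive entropy exceeds the threshold (assumed rule)."""
    return entropy(softmax(logits)) > threshold
```

A confident prediction such as `[10, 0, 0]` has entropy close to zero, while near-uniform logits such as `[0.1, 0, 0.05]` have entropy close to `log(3)`, the maximum for three classes.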

[CV-15] Self-Supervised Partial Cycle-Consistency for Multi-View Matching

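[Quick read]: This paper addresses object matching across partially overlapping camera views in multi-camera systems. The key challenge is extracting view-invariant features so that objects can be matched accurately from different viewpoints. The solution has several parts: first, the mathematical formulation of cycle-consistency is extended to handle partial overlap; second, a pseudo-mask directs the training loss to take partial overlap into account; in addition, several new cycle variants that complement each other are combined with a time-divergent scene sampling scheme to improve the data input for self-supervised learning. Cross-camera matching experiments on the DIVOTrack dataset show the merits of the approach: compared with the self-supervised state of the art, the F1 score improves by 4.3 percentage points, and the improvements remain robust when overlap in the training data is reduced. The method provides effective feature networks for complex tasks such as large-scale multi-camera scene understanding.

Link: https://arxiv.org/abs/2501.06000
Authors: Fedor Taggenbrock,Gertjan Burghouts,Ronald Poppe
Affiliations: Utrecht University; TNO
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to VISAPP 2025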

Abstract:Matching objects across partially overlapping camera views is crucial in multi-camera systems and requires a view-invariant feature extraction network. Training such a network with cycle-consistency circumvents the need for labor-intensive labeling. In this paper, we extend the mathematical formulation of cycle-consistency to handle partial overlap. We then introduce a pseudo-mask which directs the training loss to take partial overlap into account. We additionally present several new cycle variants that complement each other and present a time-divergent scene sampling scheme that improves the data input for this self-supervised setting. Cross-camera matching experiments on the challenging DIVOTrack dataset show the merits of our approach. Compared to the self-supervised state-of-the-art, we achieve a 4.3 percentage point higher F1 score with our combined contributions. Our improvements are robust to reduced overlap in the training data, with substantial improvements in challenging scenes that need to make few matches between many people. Self-supervised feature networks trained with our method are effective at matching objects in a range of multi-camera settings, providing opportunities for complex tasks like large-scale multi-camera scene understanding.
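To see why partial overlap matters for cycle-consistency, consider a toy check with hard nearest-neighbour matching. This is a sketch under assumptions, not the paper's differentiable formulation or its pseudo-mask; the feature vectors and function names are hypothetical.

```python
def nearest(query, candidates):
    """Index of the candidate with the smallest squared Euclidean distance to the query."""
    return min(
        range(len(candidates)),
        key=lambda j: sum((q - c) ** 2 for q, c in zip(query, candidates[j])),
    )


def cycle_consistent(view_a, view_b):
    """For each object in view A, match to view B and back; True if the cycle returns to the start."""
    flags = []
    for i, feat in enumerate(view_a):
        j = nearest(feat, view_b)
        back = nearest(view_b[j], view_a)
        flags.append(back == i)
    return flags


# Hypothetical 2D features: objects 0 and 1 appear in both views, object 2 only in view A.
view_a = [(0.0, 0.0), (5.0, 5.0), (1.0, 0.5)]
view_b = [(0.1, 0.0), (5.0, 5.1)]
flags = cycle_consistent(view_a, view_b)
```

Object 2 has no counterpart in view B, so its cycle ends at a different object. A loss that demanded a closed cycle for every object would penalize this case, which is the kind of situation a partial-overlap-aware loss needs to account for.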

[CV-16] Minimizing Occlusion Effect on Multi-View Camera Perception in BEV with Multi-Sensor Fusion

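[Quick read]: This paper addresses sensor occlusion in autonomous driving caused by environmental factors such as dust, rain, and fog, and in particular its effect on vision-based tasks such as object detection, vehicle segmentation, and lane recognition. By projecting multi-view camera images from the nuScenes dataset into the Bird's-Eye View (BEV) domain, the paper analyzes how occlusions are distributed spatially and how they affect vehicle segmentation accuracy in BEV. The key to the solution is a multi-sensor fusion technique that integrates LiDAR and radar data to mitigate the performance degradation caused by occluded cameras. The results show that this approach significantly improves the accuracy and robustness of vehicle segmentation, leading to more reliable autonomous driving systems.

Link: https://arxiv.org/abs/2501.05997
Authors: Sanjay Kumar,Hiep Truong,Sushil Sharma,Ganesh Sistu,Tony Scanlan,Eoin Grua,Ciarán Eising
Affiliations: Department of Electronic and Computer Engineering, University of Limerick, Ireland; Data-Driven Computer Engineering (D2iCE) Research Centre, University of Limerick, Ireland; Lero, The Irish Software Research Centre, University of Limerick, Ireland; DSW, Valeo Kronach, Germany; Valeo Vision Systems, Ireland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publishing at the Electronic Imaging - Autonomous Vehicles and Machines Conference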

Abstract:Autonomous driving technology is rapidly evolving, offering the potential for safer and more efficient transportation. However, the performance of these systems can be significantly compromised by the occlusion on sensors due to environmental factors like dirt, dust, rain, and fog. These occlusions severely affect vision-based tasks such as object detection, vehicle segmentation, and lane recognition. In this paper, we investigate the impact of various kinds of occlusions on camera sensor by projecting their effects from multi-view camera images of the nuScenes dataset into the Bird’s-Eye View (BEV) domain. This approach allows us to analyze how occlusions spatially distribute and influence vehicle segmentation accuracy within the BEV domain. Despite significant advances in sensor technology and multi-sensor fusion, a gap remains in the existing literature regarding the specific effects of camera occlusions on BEV-based perception systems. To address this gap, we use a multi-sensor fusion technique that integrates LiDAR and radar sensor data to mitigate the performance degradation caused by occluded cameras. Our findings demonstrate that this approach significantly enhances the accuracy and robustness of vehicle segmentation tasks, leading to more reliable autonomous driving systems.

[CV-17] Swin-X2S: Reconstructing 3D Shape from 2D Biplanar X-ray with Swin Transformers

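[Quick read]: This paper addresses the conversion from 2D X-ray images to 3D shape in order to improve diagnostic efficiency and safety. Existing methods often rely on hand-crafted features, manual intervention, and prior knowledge, which leads to unstable shape errors and additional processing costs. The proposed Swin-X2S is an end-to-end deep learning method that reconstructs 3D segmentation and labeling directly from 2D biplanar orthogonal X-ray images. Its key is an encoder-decoder architecture: the encoder uses a 2D Swin Transformer to extract X-ray information, and the decoder uses 3D convolution with cross-attention to integrate structural features from the orthogonal views. A dimension-expanding module bridges the two, ensuring a smooth conversion from 2D pixels to 3D voxels. Extensive experiments on nine publicly available datasets covering four anatomies (femur, hip, spine, and rib) show significant improvements over previous methods in segmentation and labeling metrics as well as in clinically relevant parameters.

Link: https://arxiv.org/abs/2501.05961
Authors: Kuan Liu,Zongyuan Ying,Jie Jin,Dongyan Li,Ping Huang,Wenjian Wu,Zhe Chen,Jin Qi,Yong Lu,Lianfu Deng,Bo Chen
Affiliations: Department of Orthopaedics, Shanghai Key Laboratory for Prevention and Treatment of Bone and Joint Diseases, Shanghai Institute of Traumatology and Orthopaedics, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China; Department of Radiology, Ruijin Hospital Luwan Branch, School of Medicine, Shanghai Jiaotong University, Shanghai 200003, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: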

Abstract:The conversion from 2D X-ray to 3D shape holds significant potential for improving diagnostic efficiency and safety. However, existing reconstruction methods often rely on hand-crafted features, manual intervention, and prior knowledge, resulting in unstable shape errors and additional processing costs. In this paper, we introduce Swin-X2S, an end-to-end deep learning method for directly reconstructing 3D segmentation and labeling from 2D biplanar orthogonal X-ray images. Swin-X2S employs an encoder-decoder architecture: the encoder leverages 2D Swin Transformer for X-ray information extraction, while the decoder employs 3D convolution with cross-attention to integrate structural features from orthogonal views. A dimension-expanding module is introduced to bridge the encoder and decoder, ensuring a smooth conversion from 2D pixels to 3D voxels. We evaluate the proposed method through extensive qualitative and quantitative experiments across nine publicly available datasets covering four anatomies (femur, hip, spine, and rib), with a total of 54 categories. Significant improvements over previous methods have been observed not only in the segmentation and labeling metrics but also in the clinically relevant parameters that are of primary concern in practical applications, which demonstrates the promise of Swin-X2S to provide an effective option for anatomical shape reconstruction in clinical scenarios. Code implementation is available at: this https URL.

[CV-18] A Multimodal Dataset for Enhancing Industrial Task Monitoring and Engagement Prediction

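[Quick read]: This paper addresses the challenge of detecting and interpreting operator actions, engagement, and object interactions in dynamic industrial workflows, especially in complex real-world environments. Traditional unimodal methods often fail to capture the intricacies of these unstructured industrial settings. To address this, the paper presents a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that captures realistic assembly and disassembly tasks and supports the evaluation of key meta-tasks such as action localization, object interaction, and engagement prediction. The key to the solution is integrating multi-view RGB, depth, and Inertial Measurement Unit (IMU) data, together with a multimodal network that fuses RGB frames, IMU data, and skeleton sequences to recognize operator engagement more accurately during industrial tasks. The approach offers a robust solution for monitoring operator performance in dynamic industrial environments.

Link: https://arxiv.org/abs/2501.05936
Authors: Naval Kishore Mehta,Arvind,Himanshu Kumar,Abeer Banerjee,Sumeet Saurav,Sanjay Singh
Affiliations: Academy of Scientific and Innovative Research; CSIR-Central Electronics Engineering Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 20th International Conference on Human-Robot Interaction (HRI) 2025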

Abstract:Detecting and interpreting operator actions, engagement, and object interactions in dynamic industrial workflows remains a significant challenge in human-robot collaboration research, especially within complex, real-world environments. Traditional unimodal methods often fall short of capturing the intricacies of these unstructured industrial settings. To address this gap, we present a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that captures realistic assembly and disassembly tasks, facilitating the evaluation of key meta-tasks such as action localization, object interaction, and engagement prediction. The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video, annotated in detail for task performance and operator behavior. Its distinctiveness lies in the integration of multiple data modalities and its emphasis on real-world, untrimmed industrial workflows, both of which are key for advancing research in human-robot collaboration and operator monitoring. Additionally, we propose a multimodal network that fuses RGB frames, IMU data, and skeleton sequences to predict engagement levels during industrial tasks. Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments. The dataset and code can be accessed from this https URL.

[CV-19] Weakly Supervised Segmentation of Hyper-Reflective Foci with Compact Convolutional Transformers and SAM2

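[Quick read]: This paper addresses the insufficient spatial resolution and poor localization accuracy of weakly supervised segmentation when segmenting small structures, such as hyper-reflective foci (HRF), in optical coherence tomography (OCT) images. Conventional weakly supervised methods usually require strong downsampling of the input image or localize only at a coarse resolution, neither of which works well for small structures. The paper proposes a novel framework that uses Layer-wise Relevance Propagation (LRP) to prompt the Segment Anything Model (SAM 2) and adds iterative inference to increase recall, which raises the spatial resolution of a traditional attention-based Multiple Instance Learning (MIL) approach. The paper also shows that replacing MIL with a Compact Convolutional Transformer (CCT), which adds a positional encoding and permits information exchange between different regions of the OCT image, yields a further substantial increase in segmentation accuracy.

Link: https://arxiv.org/abs/2501.05933
Authors: Olivier Morelle(1 and 2),Justus Bisten(1),Maximilian W. M. Wintergerst(2 and 5),Robert P. Finger(2 and 4),Thomas Schultz(1 and 3) ((1) B-IT and Department of Computer Science, University of Bonn, (2) Department of Ophthalmology, University Hospital Bonn, (3) Lamarr Institute for Machine Learning and Artificial Intelligence, (4) Department of Ophthalmology, University Medical Center Mannheim, Heidelberg University, (5) Augenzentrum Grischun, Chur, Switzerland)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 1 figure, accepted at German Conference on Medical Image Computing 2025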

Abstract:Weakly supervised segmentation has the potential to greatly reduce the annotation effort for training segmentation models for small structures such as hyper-reflective foci (HRF) in optical coherence tomography (OCT). However, most weakly supervised methods either involve a strong downsampling of input images, or only achieve localization at a coarse resolution, both of which are unsatisfactory for small structures. We propose a novel framework that increases the spatial resolution of a traditional attention-based Multiple Instance Learning (MIL) approach by using Layer-wise Relevance Propagation (LRP) to prompt the Segment Anything Model (SAM 2), and increases recall with iterative inference. Moreover, we demonstrate that replacing MIL with a Compact Convolutional Transformer (CCT), which adds a positional encoding, and permits an exchange of information between different regions of the OCT image, leads to a further and substantial increase in segmentation accuracy.
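The coarse-localization limitation of attention-based MIL can be seen in a minimal pooling sketch. It assumes the attention scores are already given; in practice a small learned network would produce them, and the function names here are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, scores):
    """Combine instance (patch) embeddings into one bag embedding, weighted by attention."""
    weights = softmax(scores)
    dim = len(instances[0])
    bag = [sum(w * h[d] for w, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights


# Two hypothetical patch embeddings with equal attention scores.
bag, weights = attention_pool([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
```

The pooling yields one weight per patch, so any localization derived from these weights is only as fine as the patch grid.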

[CV-20] Binary Event-Driven Spiking Transformer

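[Quick read]: This paper addresses the practicality of Transformer-based Spiking Neural Networks (SNNs) in resource-constrained scenarios. Although the Transformer structure performs well, its large model size and computational demands limit its use on resource-constrained devices. The paper proposes BESTformer, a Binary Event-Driven Spiking Transformer that represents weights and attention maps with 1 bit, significantly reducing storage and computation. However, because binarization has limited representation capability, BESTformer suffers a severe performance drop relative to its full-precision counterpart. To address this, the paper proposes a Coupled Information Enhancement (CIE) method, consisting of a reversible framework and information enhancement distillation, which mitigates the performance degradation by maximizing the mutual information between the binary model and its full-precision counterpart. Experiments on static and neuromorphic datasets show that the method outperforms other binary SNNs, demonstrating its potential as a compact yet high-performance model for resource-limited edge devices.

Link: https://arxiv.org/abs/2501.05904
Authors: Honglin Cao,Zijian Zhou,Wenjie Wei,Ammar Belatreche,Yu Liang,Dehao Zhang,Malu Zhang,Yang Yang,Haizhou Li
Affiliations: The University of Electronic Science and Technology of China; Northumbria University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures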

Abstract:Transformer-based Spiking Neural Networks (SNNs) introduce a novel event-driven self-attention paradigm that combines the high performance of Transformers with the energy efficiency of SNNs. However, the larger model size and increased computational demands of the Transformer structure limit their practicality in resource-constrained scenarios. In this paper, we integrate binarization techniques into Transformer-based SNNs and propose the Binary Event-Driven Spiking Transformer, i.e. BESTformer. The proposed BESTformer can significantly reduce storage and computational demands by representing weights and attention maps with a mere 1-bit. However, BESTformer suffers from a severe performance drop from its full-precision counterpart due to the limited representation capability of binarization. To address this issue, we propose a Coupled Information Enhancement (CIE) method, which consists of a reversible framework and information enhancement distillation. By maximizing the mutual information between the binary model and its full-precision counterpart, the CIE method effectively mitigates the performance degradation of the BESTformer. Extensive experiments on static and neuromorphic datasets demonstrate that our method achieves superior performance to other binary SNNs, showcasing its potential as a compact yet high-performance model for resource-limited edge devices.
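A minimal sketch of 1-bit weight binarization illustrates both the storage saving and the loss of representation capability. It assumes a common sign-plus-scale scheme (one scaling factor per tensor, set to the mean absolute weight); the abstract does not state which binarization BESTformer uses, so the scheme and names below are assumptions.

```python
def binarize(weights):
    """Return a per-tensor scale and a list of +1/-1 signs (assumed sign-plus-scale scheme)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def dequantize(alpha, signs):
    """Reconstruct approximate weights from the scale and the signs."""
    return [alpha * s for s in signs]


# Hypothetical full-precision weights.
weights = [0.5, -1.5, 1.0, -1.0]
alpha, signs = binarize(weights)
approx = dequantize(alpha, signs)
```

Each weight now needs one bit instead of, for example, 32, but the magnitudes 0.5 and 1.5 both collapse to the same value. That lost information is the kind of gap the abstract attributes to the limited representation capability of binarization.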

[CV-21] Valley2: Exploring Multimodal Models with Scalable Vision-Language Design

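[Quick read]: This paper aims to improve the performance of multimodal large language models in e-commerce and short-video scenarios. The key contribution is Valley2, a model designed to enhance performance across domains. It scores 79.66 on e-commerce benchmarks, well ahead of the 72.76 achieved by open-source models of similar size. Valley2 also ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an average score of 67.4. The code and model weights are open-sourced to support further research and application.

Link: https://arxiv.org/abs/2501.05901
Authors: Ziheng Wu,Zhenghao Chen,Ruipu Luo,Can Zhang,Yuan Gao,Zhentao He,Xian Wang,Haoran Lin,Minghui Qiu
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: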

Abstract:Recently, vision-language models have made remarkable progress, demonstrating outstanding capabilities in various tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and extend the boundaries of practical applications in e-commerce and short video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at this https URL.

[CV-22] Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation

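[Quick read]: This paper addresses the distorted text and inharmonious text background that existing diffusion models produce for slanted or curved text layouts because of training data limitations. It proposes STGen, a training-free framework that decomposes visual text generation into two branches: a Semantic Rectification Branch and a Structure Injection Branch. The Semantic Rectification Branch exploits the model's accuracy when generating flat text: it incorporates the latent of the flat text to rectify the semantic information of text in complex layouts and harmonize it with the background. The Structure Injection Branch reinforces text structure during inference by incorporating the latent information of the glyph image. The paper also applies an effective method for combining the priors, which further improves the overall harmony of the generated image. Experiments across a variety of visual text layouts show that the framework achieves superior accuracy and generation quality.

Link: https://arxiv.org/abs/2501.05892
Authors: Minxing Luo,Zixun Xia,Liaojun Chen,Zhenhang Li,Weichao Zeng,Jianye Wang,Wentao Cheng,Yaxing Wang,Yu Zhou,Jian Yang
Affiliations: VCIP, CS, Nankai University; Institute of Information Engineering, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: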

Abstract:In real-world images, slanted or curved texts, especially those on cans, banners, or badges, appear as frequently as flat texts, if not more so, due to artistic design or layout constraints. While high-quality visual text generation has become available with the advanced generative capabilities of diffusion models, these models often produce distorted text and inharmonious text background when given slanted or curved text layouts due to training data limitations. In this paper, we introduce a new training-free framework, STGen, which accurately generates visual texts in challenging scenarios (e.g., slanted or curved text layouts) while harmonizing them with the text background. Our framework decomposes the visual text generation process into two branches: (i) Semantic Rectification Branch, which leverages the model's ability to generate flat but accurate visual texts to guide the generation of challenging scenarios. The generated latent of flat text is abundant in accurate semantic information related both to the text itself and its background. By incorporating this, we rectify the semantic information of the texts and harmonize the integration of the text with its background in complex layouts. (ii) Structure Injection Branch, which reinforces the visual text structure during inference. We incorporate the latent information of the glyph image, rich in glyph structure, as a new condition to further strengthen the text structure. To enhance image harmony, we also apply an effective combination method to merge the priors, providing a solid foundation for generation. Extensive experiments across a variety of visual text layouts demonstrate that our framework achieves superior accuracy and outstanding quality.

[CV-23] EDNet: Edge-Optimized Small Target Detection in UAV Imagery – Faster Context Attention Better Feature Fusion and Hardware Acceleration

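[Quick read]: This paper addresses the challenge of detecting small targets in drone imagery, which stems from low resolution, complex backgrounds, and dynamic scenes. The authors propose EDNet, an edge-target detection framework built on an enhanced YOLOv10 architecture, designed for real-time applications without post-processing. Its key innovations are an XSmall detection head and a Cross Concat strategy, which improve feature fusion and multi-scale context awareness so that tiny targets can be detected in diverse environments. A C2f-FCA block uses a Faster Context Attention mechanism to enhance feature extraction while reducing computational complexity, and the WIoU loss function is used to improve bounding box regression. With seven model sizes ranging from Tiny to XL, EDNet fits different deployment environments, supports local real-time inference, and keeps data private. Experiments show a gain of up to 5.6% in mAP@50 with significantly fewer parameters, and speeds of 16 to 55 FPS on an iPhone 12, making EDNet an efficient and scalable solution for edge-based object detection in drone imagery.

Link: https://arxiv.org/abs/2501.05885
Authors: Zhifan Song,Yuan Zhang,Abd Al Rahman M. Abu Ebayyeh
Affiliations: LIP6 Laboratory, Sorbonne University, CNRS UMR7606, Paris, France; University of California, Berkeley, CA, United States; Department of Electrical and Electronic Engineering, Imperial College London, UK
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in 21st IEEE International Conference on Ubiquitous Intelligence and Computing (UIC 2024) this https URL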

Abstract:Detecting small targets in drone imagery is challenging due to low resolution, complex backgrounds, and dynamic scenes. We propose EDNet, a novel edge-target detection framework built on an enhanced YOLOv10 architecture, optimized for real-time applications without post-processing. EDNet incorporates an XSmall detection head and a Cross Concat strategy to improve feature fusion and multi-scale context awareness for detecting tiny targets in diverse environments. Our unique C2f-FCA block employs Faster Context Attention to enhance feature extraction while reducing computational complexity. The WIoU loss function is employed for improved bounding box regression. With seven model sizes ranging from Tiny to XL, EDNet accommodates various deployment environments, enabling local real-time inference and ensuring data privacy. Notably, EDNet achieves up to a 5.6% gain in mAP@50 with significantly fewer parameters. On an iPhone 12, EDNet variants operate at speeds ranging from 16 to 55 FPS, providing a scalable and efficient solution for edge-based object detection in challenging drone imagery. The source code and pre-trained models are available at: this https URL.
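Bounding box regression losses such as WIoU build on intersection over union (IoU). The sketch below computes plain IoU only; the WIoU weighting itself is not shown, and the boxes are hypothetical. It illustrates why small targets are hard: the same localization error costs a tiny box far more IoU than a large one.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The same 2-pixel horizontal shift applied to a 4x4 box and to a 40x40 box.
small = iou((0, 0, 4, 4), (2, 0, 6, 4))
large = iou((0, 0, 40, 40), (2, 0, 42, 40))
```

Here the 4x4 box keeps only a third of its IoU after the shift, while the 40x40 box stays above 0.9.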

[CV-24] Text-to-Edit: Controllable End-to-End Video Ad Creation via Multimodal LLMs

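[Quick read]: This paper addresses the challenges that automated video editing systems face as short-video content grows rapidly, in particular understanding video content and tailoring the edit to user requirements. It proposes an innovative end-to-end foundational framework that leverages the flexibility and generalizability of Multimodal Large Language Models (MLLMs) and defines clear input-output mappings for efficient video creation. The key to the solution is combining a denser frame rate with a slow-fast processing technique, which significantly improves the extraction and understanding of temporal and spatial video information. The paper also introduces a text-to-edit mechanism that lets users obtain the desired editing result through textual input, improving the quality and controllability of the edited videos. Extensive experiments show significant effectiveness on advertising datasets and yield generally applicable conclusions on public datasets.

Link: https://arxiv.org/abs/2501.05884
Authors: Dabing Cheng,Haosen Zhan,Xingchen Zhao,Guisheng Liu,Zemin Li,Jinghui Xie,Zhao Song,Weiguo Feng,Bingyue Peng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, conference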

Abstract:The exponential growth of short-video content has ignited a surge in the necessity for efficient, automated solutions to video editing, with challenges arising from the need to understand videos and tailor the editing according to user requirements. Addressing this need, we propose an innovative end-to-end foundational framework, ultimately actualizing precise control over the final video content editing. Leveraging the flexibility and generalizability of Multimodal Large Language Models (MLLMs), we defined clear input-output mappings for efficient video creation. To bolster the model’s capability in processing and comprehending video content, we introduce a strategic combination of a denser frame rate and a slow-fast processing technique, significantly enhancing the extraction and understanding of both temporal and spatial video information. Furthermore, we introduce a text-to-edit mechanism that allows users to achieve desired video outcomes through textual input, thereby enhancing the quality and controllability of the edited videos. Through comprehensive experimentation, our method has not only showcased significant effectiveness within advertising datasets, but also yields universally applicable conclusions on public datasets.

[CV-25] TakuNet: an Energy-Efficient CNN for Real-Time Inference on Embedded UAV systems in Emergency Response Scenarios

[Quick read]: This paper addresses the challenge of designing efficient neural networks for embedded devices, particularly in applications that require real-time performance, such as aerial imaging with drones and UAVs for emergency response. It proposes TakuNet, a lightweight neural network architecture. Its key techniques are depth-wise convolutions and an early downsampling stem, which reduce computational complexity while maintaining high accuracy. TakuNet also uses dense connections to speed up training convergence and 16-bit floating-point precision to optimize performance on embedded hardware accelerators. Experiments show that TakuNet achieves near-state-of-the-art accuracy in classifying aerial images of emergency situations. Real-world tests on the Jetson Orin Nano and Raspberry Pi confirm its efficiency, with more than 650 fps on the 15W Jetson board, which demonstrates real-time capability on resource-constrained platforms.

Link: https://arxiv.org/abs/2501.05880
Authors: Daniel Rossi,Guido Borghi,Roberto Vezzani
Affiliations: University of Modena and Reggio Emilia
Categories: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Comments: This paper has been accepted at WACVW 2025, which will take place on 28/02/2025. The official conference proceedings have not yet been published at the time of submission to arXiv. The final version of the paper, incorporating any changes based on feedback received during the conference, will be included in the proceedings once they are made available

点击查看摘要

Abstract:Designing efficient neural networks for embedded devices is a critical challenge, particularly in applications requiring real-time performance, such as aerial imaging with drones and UAVs for emergency responses. In this work, we introduce TakuNet, a novel light-weight architecture which employs techniques such as depth-wise convolutions and an early downsampling stem to reduce computational complexity while maintaining high accuracy. It leverages dense connections for fast convergence during training and uses 16-bit floating-point precision for optimization on embedded hardware accelerators. Experimental evaluation on two public datasets shows that TakuNet achieves near-state-of-the-art accuracy in classifying aerial images of emergency situations, despite its minimal parameter count. Real-world tests on embedded devices, namely Jetson Orin Nano and Raspberry Pi, confirm TakuNet’s efficiency, achieving more than 650 fps on the 15W Jetson board, making it suitable for real-time AI processing on resource-constrained platforms and advancing the applicability of drones in emergency scenarios. The code and implementation details are publicly released.
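
To make the savings from depth-wise convolutions concrete, here is a minimal sketch that counts weights for a standard convolution and for the common depth-wise separable factorization (a depth-wise convolution followed by a 1 x 1 point-wise convolution). The layer sizes are hypothetical and are not taken from TakuNet, and pairing the depth-wise convolution with a point-wise one is an assumption made here for illustration.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depth-wise conv followed by a 1 x 1 point-wise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
std = standard_conv_params(64, 128, 3)
sep = depthwise_separable_params(64, 128, 3)
print(std, sep, round(std / sep, 2))  # 73728 8768 8.41
```

For this hypothetical layer the factorized form needs roughly one eighth of the weights, which is the kind of reduction that makes such layers attractive on embedded hardware.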

[CV-26] Language-Inspired Relation Transfer for Few-shot Class-Incremental Learning

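[Quick read]: This paper addresses Few-Shot Class-Incremental Learning (FSCIL), in which novel classes must be depicted with language descriptions from only a few observed samples and new knowledge must be distinguished from old. Existing methods rely mainly on careful tuning of visual encoders, which shows an evident trade-off between base and incremental knowledge. The paper proposes a new Language-inspired Relation Transfer (LRT) paradigm that understands objects through joint visual clues and text depictions. It consists of two main steps. First, pretrained text knowledge is transferred to the visual domain through a graph relation transformation module, and visual and language embeddings are then fused by a text-vision prototypical fusion module. Second, to mitigate the domain gap caused by visual fine-tuning, context prompt learning is proposed for fast domain alignment, together with imagined contrastive learning to alleviate insufficient text data during alignment. With collaborative learning of domain alignment and text-image transfer, LRT outperforms state-of-the-art models by over 13% and 7% on the final session of the mini-ImageNet and CIFAR-100 FSCIL benchmarks, respectively.

Link: https://arxiv.org/abs/2501.05862
Authors: Yifan Zhao, Jia Li, Zeyin Song, Yonghong Tian
Affiliations: State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University; School of Electronic and Computer Engineering, Peking University; School of Computer Science, Peking University; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TPAMI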

Abstract:Depicting novel classes with language descriptions by observing few-shot samples is inherent in human-learning systems. This lifelong learning capability helps to distinguish new knowledge from old ones through the increase of open-world learning, namely Few-Shot Class-Incremental Learning (FSCIL). Existing works to solve this problem mainly rely on the careful tuning of visual encoders, which shows an evident trade-off between the base knowledge and incremental ones. Motivated by human learning systems, we propose a new Language-inspired Relation Transfer (LRT) paradigm to understand objects by joint visual clues and text depictions, composed of two major steps. We first transfer the pretrained text knowledge to the visual domains by proposing a graph relation transformation module and then fuse the visual and language embedding by a text-vision prototypical fusion module. Second, to mitigate the domain gap caused by visual finetuning, we propose context prompt learning for fast domain alignment and imagined contrastive learning to alleviate the insufficient text data during alignment. With collaborative learning of domain alignments and text-image transfer, our proposed LRT outperforms the state-of-the-art models by over 13% and 7% on the final session of mini-ImageNet and CIFAR-100 FSCIL benchmarks.

[CV-27] MRI Patterns of the Hippocampus and Amygdala for Predicting Stages of Alzheimer's Progression: A Minimal Feature Machine Learning Framework

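[Quick read]: This paper addresses accurate identification of the stages of Alzheimer's disease (AD) progression, especially distinguishing late mild cognitive impairment (LMCI) from early mild cognitive impairment (EMCI). Because the imaging features of these stages are subtle and overlapping, accurate identification is challenging, yet it is crucial for developing pre-dementia treatments. The key to the solution is a machine learning framework based on structural MRI data: it tackles the curse of dimensionality through feature selection, uses voxel information from specific regions (the hippocampus and amygdala), and reduces noise through a new way of organizing the data, thereby improving classification performance. The framework combines dimensionality reduction techniques such as principal component analysis (PCA) and t-SNE with state-of-the-art classifiers and reaches a highest accuracy of 88.46%, showing potential for efficient and accurate staging of AD progression and providing insights for clinical applications.

Link: https://arxiv.org/abs/2501.05852
Authors: Aswini Kumar Patra, Soraisham Elizabeth Devi, Tejashwini Gajurel
Affiliations: Department of Computer Science and Engineering, NERIST; Neerja Modi School
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: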

Abstract:Alzheimer’s disease (AD) progresses through distinct stages, from early mild cognitive impairment (EMCI) to late mild cognitive impairment (LMCI) and eventually to AD. Accurate identification of these stages, especially distinguishing LMCI from EMCI, is crucial for developing pre-dementia treatments but remains challenging due to subtle and overlapping imaging features. This study proposes a minimal-feature machine learning framework that leverages structural MRI data, focusing on the hippocampus and amygdala as regions of interest. The framework addresses the curse of dimensionality through feature selection, utilizes region-specific voxel information, and implements innovative data organization to enhance classification performance by reducing noise. The methodology integrates dimensionality reduction techniques such as PCA and t-SNE with state-of-the-art classifiers, achieving the highest accuracy of 88.46%. This framework demonstrates the potential for efficient and accurate staging of AD progression while providing valuable insights for clinical applications.
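
The general pattern of "reduce dimensionality, then classify" can be sketched in a few lines. The example below is not the paper's pipeline: it uses made-up three-dimensional feature vectors instead of voxel data, finds only the first principal component by power iteration, and swaps in a nearest-centroid rule for the unspecified classifiers. The `EMCI` and `LMCI` labels are attached to the toy points purely for illustration.

```python
def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mu[j] for j in range(d)] for r in rows]


def first_component(centered, iters=100):
    """Leading principal direction via power iteration on X^T X."""
    d = len(centered[0])
    v = [1.0 / d ** 0.5] * d
    for _ in range(iters):
        scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        w = [sum(s * r[j] for s, r in zip(scores, centered)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def fit_centroids(scores, labels):
    """Mean projected score per class."""
    cents = {}
    for lab in set(labels):
        vals = [s for s, l in zip(scores, labels) if l == lab]
        cents[lab] = sum(vals) / len(vals)
    return cents


def predict(score, cents):
    """Assign the class whose centroid is closest to the score."""
    return min(cents, key=lambda lab: abs(score - cents[lab]))


# Toy features (hypothetical), three samples per class.
features = [[0.0, 0.1, 0.2], [0.2, 0.0, 0.1], [0.1, 0.2, 0.0],
            [4.0, 4.1, 0.2], [4.2, 4.0, 0.1], [4.1, 4.2, 0.0]]
labels = ["EMCI", "EMCI", "EMCI", "LMCI", "LMCI", "LMCI"]

X = center(features)
pc = first_component(X)
scores = [sum(a * b for a, b in zip(row, pc)) for row in X]
cents = fit_centroids(scores, labels)
predictions = [predict(s, cents) for s in scores]
```

On real data the projection would be fitted on a training split only and evaluated on held-out subjects. Fitting and predicting on the same six points here only shows the mechanics.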

[CV-28] Identity-aware Feature Decoupling Learning for Clothing-change Person Re-identification ICASSP2025

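[Quick read]: This paper addresses the clothing-change person re-identification (CC Re-ID) task, where existing methods struggle to adequately extract identity-related information from the original RGB images. The authors propose an Identity-aware Feature Decoupling (IFD) learning framework. Its key component is a dual-stream architecture consisting of a main stream and an attention stream. The attention stream takes clothing-masked images as input and derives identity attention weights, which transfer spatial knowledge to the main stream and highlight regions rich in identity-related information. To eliminate the semantic gap between the inputs of the two streams, the authors also propose a clothing bias diminishing module specific to the main stream, which regularizes the features of clothing-relevant regions. Experiments show that the framework outperforms other baseline models on several widely used CC Re-ID datasets.

Link: https://arxiv.org/abs/2501.05851
Authors: Haoxuan Xu, Bo Li, Guanglin Niu
Affiliations: Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP2025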

Abstract:Clothing-change person re-identification (CC Re-ID) has attracted increasing attention in recent years due to its application prospect. Most existing works struggle to adequately extract the ID-related information from the original RGB images. In this paper, we propose an Identity-aware Feature Decoupling (IFD) learning framework to mine identity-related features. Particularly, IFD exploits a dual stream architecture that consists of a main stream and an attention stream. The attention stream takes the clothing-masked images as inputs and derives the identity attention weights for effectively transferring the spatial knowledge to the main stream and highlighting the regions with abundant identity-related information. To eliminate the semantic gap between the inputs of two streams, we propose a clothing bias diminishing module specific to the main stream to regularize the features of clothing-relevant regions. Extensive experimental results demonstrate that our framework outperforms other baseline models on several widely-used CC Re-ID datasets.

[CV-29] Poetry in Pixels: Prompt Tuning for Poem Image Generation via Diffusion Models

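[Quick read]: This paper addresses the challenges that text-to-image generation faces when applied to literary works, especially poetry. Poems are a distinct literary form whose meanings often go beyond the literal words, so conventional image generation methods struggle to capture their inherent meaning. The paper proposes the PoemToPixel framework, which aims to generate images that visually represent the inherent meanings of poems. The key is to incorporate prompt tuning and to use the PoeKey algorithm to extract three key elements from a poem (emotions, visual elements, and themes), which form instructions that are fed to a diffusion model to generate the corresponding image. The paper also introduces MiniPo, a multimodal dataset of 1001 children's poems and corresponding images, to broaden the diversity of poetry datasets. Using MiniPo together with the PoemSum dataset, the authors evaluate PoemToPixel quantitatively and qualitatively, demonstrate its effectiveness, and offer a fresh perspective on generating images from literary sources.

Link: https://arxiv.org/abs/2501.05839
Authors: Sofia Jamil, Bollampalli Areen Reddy, Raghvendra Kumar, Sriparna Saha, K J Joseph, Koustava Goswami
Affiliations: Department of Computer Science & Engineering, Indian Institute of Technology Patna, India; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: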

Abstract:The task of text-to-image generation has encountered significant challenges when applied to literary works, especially poetry. Poems are a distinct form of literature, with meanings that frequently transcend beyond the literal words. To address this shortcoming, we propose a PoemToPixel framework designed to generate images that visually represent the inherent meanings of poems. Our approach incorporates the concept of prompt tuning in our image generation framework to ensure that the resulting images closely align with the poetic content. In addition, we propose the PoeKey algorithm, which extracts three key elements in the form of emotions, visual elements, and themes from poems to form instructions which are subsequently provided to a diffusion model for generating corresponding images. Furthermore, to expand the diversity of the poetry dataset across different genres and ages, we introduce MiniPo, a novel multimodal dataset comprising 1001 children’s poems and images. Leveraging this dataset alongside PoemSum, we conducted both quantitative and qualitative evaluations of image generation using our PoemToPixel framework. This paper demonstrates the effectiveness of our approach and offers a fresh perspective on generating images from literary sources.

[CV-30] UltraRay: Full-Path Ray Tracing for Enhancing Realism in Ultrasound Simulation

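[Quick read]: This paper addresses the heavy computation and long run times of traditional ultrasound simulators when modeling pressure distribution fields. Traditional methods achieve high accuracy by solving the wave equation but demand substantial computational resources. The paper proposes UltraRay, an ultrasound simulation pipeline based on a ray tracing algorithm. The key is to trace each ray from the transducer through the scene, across boundaries and scatterers, and back to the sensor in order to generate echo data. The paper also introduces a ray emission scheme optimized for plane wave imaging and integrates a standard signal processing pipeline to simulate end-to-end ultrasound image formation. On synthetic scenes containing highly reflective objects such as bones, UltraRay improves both the visual quality and the realism of the simulated images by accurately capturing secondary reflections and reducing unnatural artifacts. Because it builds on a differentiable framework, the pipeline lays the groundwork for gradient-based optimization, advanced ultrasound beamforming strategies, neural network integration, and accurate inverse scene reconstruction.

Link: https://arxiv.org/abs/2501.05828
Authors: Felix Duelmer, Mohammad Farid Azampour, Nassir Navab
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: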

Abstract:Traditional ultrasound simulators solve the wave equation to model pressure distribution fields, achieving high accuracy but requiring significant computational time and resources. To address this, ray tracing approaches have been introduced, modeling wave propagation as rays interacting with boundaries and scatterers. However, existing models simplify ray propagation, generating echoes at interaction points without considering return paths to the sensor. This can result in unrealistic artifacts and necessitates careful scene tuning for plausible results. We propose a novel ultrasound simulation pipeline that utilizes a ray tracing algorithm to generate echo data, tracing each ray from the transducer through the scene and back to the sensor. To replicate advanced ultrasound imaging, we introduce a ray emission scheme optimized for plane wave imaging, incorporating delay and steering capabilities. Furthermore, we integrate a standard signal processing pipeline to simulate end-to-end ultrasound image formation. We showcase the efficacy of the proposed pipeline by modeling synthetic scenes featuring highly reflective objects, such as bones. In doing so, our proposed approach, UltraRay, not only enhances the overall visual quality but also improves the realism of the simulated images by accurately capturing secondary reflections and reducing unnatural artifacts. By building on top of a differentiable framework, the proposed pipeline lays the groundwork for a fast and differentiable ultrasound simulation tool necessary for gradient-based optimization, enabling advanced ultrasound beamforming strategies, neural network integration, and accurate inverse scene reconstruction.

[CV-31] PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation

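[Quick read]: This paper addresses the tendency of existing personalized face diffusion (PFD) models to overemphasize facial features at the expense of full-body coherence when generating identity-consistent human-object interaction (HOI) images. It proposes PersonaHOI, a framework that fuses a general StableDiffusion model with a PFD model by introducing an additional StableDiffusion branch guided by HOI-oriented text inputs. By incorporating cross-attention constraints in the PFD branch and spatial merging at both the latent and residual levels, PersonaHOI preserves personalized facial details while keeping the non-facial interaction regions coherent. Experiments show that PersonaHOI performs well in realism and scalability, setting a new standard for personalized face generation with HOI.

Link: https://arxiv.org/abs/2501.05823
Authors: Xinting Hu, Haoran Wang, Jan Eric Lenssen, Bernt Schiele
Affiliations: Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: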

Abstract:We introduce PersonaHOI, a training- and tuning-free framework that fuses a general StableDiffusion model with a personalized face diffusion (PFD) model to generate identity-consistent human-object interaction (HOI) images. Existing PFD models have advanced significantly, but they often overemphasize facial features at the expense of full-body coherence. To address this, PersonaHOI introduces an additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By incorporating cross-attention constraints in the PFD branch and spatial merging at both latent and residual levels, PersonaHOI preserves personalized facial details while ensuring interactive non-facial regions. Experiments, validated by a novel interaction alignment metric, demonstrate the superior realism and scalability of PersonaHOI, establishing a new standard for practical personalized face with HOI generation. Our code will be available at this https URL
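
One simple reading of "spatial merging" is a mask-weighted blend of the two branches' latents. The sketch below assumes a soft face mask with values in [0, 1] and small 2D lists standing in for latent tensors. The function name `merge_latents`, the mask, and the blend rule are all hypothetical, since the abstract does not state how PersonaHOI computes its merge.

```python
def merge_latents(pfd_latent, sd_latent, face_mask):
    """Blend element-wise: PFD values where the mask is 1, SD values where it is 0."""
    return [
        [m * p + (1.0 - m) * s for p, s, m in zip(p_row, s_row, m_row)]
        for p_row, s_row, m_row in zip(pfd_latent, sd_latent, face_mask)
    ]


# Toy 2 x 2 latents: the PFD branch is all ones, the SD branch is all zeros.
pfd = [[1.0, 1.0], [1.0, 1.0]]
sd = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.0], [0.5, 0.0]]  # 1 = face region, 0 = non-facial region
merged = merge_latents(pfd, sd, mask)
print(merged)  # [[1.0, 0.0], [0.5, 0.0]]
```

A soft mask gives a gradual transition at the face boundary, whereas a hard 0/1 mask would switch branches abruptly.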

[CV-32] Alignment without Over-optimization: Training-Free Solution for Diffusion Models

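[Quick read]: This paper addresses the challenge of aligning diffusion models with specific objectives in generative tasks, in particular optimizing a target reward effectively while preserving the model's diversity and versatility. Existing fine-tuning methods tend to suffer from reward over-optimization, while approximate guidance approaches fail to optimize target rewards effectively. To overcome these limitations, the paper proposes a training-free sampling method based on Sequential Monte Carlo (SMC) that incorporates tempering techniques to sample from the reward-aligned target distribution. The method performs well in single-reward optimization, multi-objective scenarios, and online black-box optimization, achieving target rewards comparable or superior to fine-tuning methods while preserving diversity and cross-reward generalization. Its key innovation is that alignment is achieved without additional training, which avoids compromising the model's general capabilities.

Link: https://arxiv.org/abs/2501.05803
Authors: Sunwoo Kim, Minkyu Kim, Dongmin Park
Affiliations: Seoul National University; KRAFTON
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST)
Comments: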

Abstract:Diffusion models excel in generative tasks, but aligning them with specific objectives while maintaining their versatility remains challenging. Existing fine-tuning methods often suffer from reward over-optimization, while approximate guidance approaches fail to optimize target rewards effectively. Addressing these limitations, we propose a training-free sampling method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution. Our approach, tailored for diffusion sampling and incorporating tempering techniques, achieves comparable or superior target rewards to fine-tuning methods while preserving diversity and cross-reward generalization. We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities. Code is available at this https URL .
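
The idea of tempered SMC can be illustrated on a one-dimensional toy problem. The sketch below is a generic textbook-style tempered SMC sampler, not the paper's diffusion-specific algorithm. It assumes a standard normal base distribution, a hypothetical reward r(x) = -(x - 2)^2, and a linear tempering schedule, so the final target is proportional to prior(x) * exp(r(x)), a Gaussian with mean 4/3.

```python
import math
import random


def reward(x):
    """Hypothetical reward, peaked at x = 2."""
    return -(x - 2.0) ** 2


def log_prior(x):
    """Log-density of the standard normal base distribution, up to a constant."""
    return -0.5 * x * x


def tempered_smc(n_particles=500, n_steps=10, seed=0):
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    for t in range(1, n_steps + 1):
        lam_prev, lam = (t - 1) / n_steps, t / n_steps
        # Incremental importance weights for moving from lam_prev to lam.
        logw = [(lam - lam_prev) * reward(x) for x in particles]
        top = max(logw)
        weights = [math.exp(lw - top) for lw in logw]
        # Multinomial resampling in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        # One Metropolis move per particle, targeting prior(x) * exp(lam * r(x)).
        moved = []
        for x in particles:
            prop = x + rng.gauss(0.0, 0.5)
            log_acc = (log_prior(prop) + lam * reward(prop)) - (
                log_prior(x) + lam * reward(x))
            if rng.random() < math.exp(min(0.0, log_acc)):
                x = prop
            moved.append(x)
        particles = moved
    return particles


samples = tempered_smc()
mean = sum(samples) / len(samples)
```

Raising the reward weight gradually keeps the importance weights from collapsing onto a few particles. The base distribution stays in the target throughout, which is what stops the samples from drifting all the way to the reward maximum at x = 2.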

[CV-33] Cryptanalysis of Cancelable Biometrics Vault

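[Quick read]: This paper cryptanalyzes a key-binding scheme based on cancelable biometrics (CB), called the Cancelable Biometrics Vault (CBV). Specifically, it attacks the BioEncoding scheme, the cancelable transformation introduced to instantiate the CBV framework, and evaluates its security with respect to the irreversibility and unlinkability of templates. The analysis finds that templates are both reversible and linkable, and that the linkability attack allows an attacker to recover the key in the vault without additional assumptions. The paper's main contribution is to expose revocability and linkability vulnerabilities of the CBV scheme that had not previously been identified in comparable biometric-based key-binding schemes.

Link: https://arxiv.org/abs/2501.05786
Authors: Patrick Lacharme, Kevin Thiry-Atighehchi
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 4 figures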

Abstract:Cancelable Biometrics (CB) stands for a range of biometric transformation schemes combining biometrics with user specific tokens to generate secure templates. Required properties are the irreversibility, unlinkability and recognition accuracy of templates while making their revocation possible. In biometrics, a key-binding scheme is used for protecting a cryptographic key using biometric data. The key can be recomputed only if correct biometric data is acquired during authentication. Applications of key-binding schemes are typically disk encryption, where the cryptographic key is used to encrypt and decrypt the disk. In this paper, we cryptanalyze a recent key-binding scheme, called Cancelable Biometrics Vault (CBV) based on cancelable biometrics. More precisely, the introduced cancelable transformation, called BioEncoding scheme, for instantiating the CBV framework is attacked in terms of reversibility and linkability of templates. Subsequently, our linkability attack enables recovery of the key in the vault without additional assumptions. Our cryptanalysis introduces a new perspective by uncovering the CBV scheme’s revocability and linkability vulnerabilities, which were not previously identified in comparable biometric-based key-binding schemes.

[CV-34] UV-Attack: Physical-World Adversarial Attacks for Person Detection via Dynamic-NeRF-based UV Mapping ICLR2025

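[Quick read]: This paper addresses the low success rates of adversarial attacks on person detectors that use patches or texture modifications based on static 3D models, which result from the flexibility of human movement. The specific challenge is modeling the 3D deformations caused by various actions. The proposed solution, UV-Attack, relies on dynamic-NeRF (Neural Radiance Fields)-based UV mapping, which can generate human images across diverse actions and viewpoints and create novel actions by sampling from the SMPL parameter space. In addition, by generating UV maps instead of RGB images and modifying the texture stacks, UV-Attack enables real-time texture edits, which makes the attack more practical. The paper also proposes a new Expectation over Pose Transformation (EoPT) loss to improve the evasion success rate on unseen poses and views. Experiments show that UV-Attack achieves a 92.75% attack success rate (ASR) against the FastRCNN model in dynamic video settings, significantly outperforming the existing AdvCamou attack (28.50% ASR), and reaches 49.5% ASR against the YOLOv8 detector in black-box settings.

Link: https://arxiv.org/abs/2501.05783
Authors: Yanjie Li, Wenxuan Zhang, Kaisheng Liang, Bin Xiao
Affiliations: Department of Computer Science, Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 23 pages, 22 figures, submitted to ICLR2025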

Abstract:In recent research, adversarial attacks on person detectors using patches or static 3D model-based texture modifications have struggled with low success rates due to the flexible nature of human movement. Modeling the 3D deformations caused by various actions has been a major challenge. Fortunately, advancements in Neural Radiance Fields (NeRF) for dynamic human modeling offer new possibilities. In this paper, we introduce UV-Attack, a groundbreaking approach that achieves high success rates even with extensive and unseen human actions. We address the challenge above by leveraging dynamic-NeRF-based UV mapping. UV-Attack can generate human images across diverse actions and viewpoints, and even create novel actions by sampling from the SMPL parameter space. While dynamic NeRF models are capable of modeling human bodies, modifying clothing textures is challenging because they are embedded in neural network parameters. To tackle this, UV-Attack generates UV maps instead of RGB images and modifies the texture stacks. This approach enables real-time texture edits and makes the attack more practical. We also propose a novel Expectation over Pose Transformation loss (EoPT) to improve the evasion success rate on unseen poses and views. Our experiments show that UV-Attack achieves a 92.75% attack success rate against the FastRCNN model across varied poses in dynamic video settings, significantly outperforming the state-of-the-art AdvCamou attack, which only had a 28.50% ASR. Moreover, we achieve 49.5% ASR on the latest YOLOv8 detector in black-box settings. This work highlights the potential of dynamic NeRF-based UV mapping for creating more effective adversarial attacks on person detectors, addressing key challenges in modeling human movement and texture modification.

[CV-35] StructSR: Refuse Spurious Details in Real-World Image Super-Resolution

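[Quick read]: This paper addresses the structural errors and spurious texture details produced by diffusion-based real-world image super-resolution (Real-ISR), which stem mainly from the empirical priors and hallucinations of these models. It proposes StructSR, whose core is the Structure-Aware Screening (SAS) mechanism: in the early inference stage, SAS identifies the image with the highest structural similarity to the low-resolution (LR) input and uses it as historical structure knowledge to suppress the generation of spurious details. StructSR requires no additional fine-tuning, external model priors, or high-level semantic knowledge and integrates seamlessly with existing diffusion-based Real-ISR models. Experiments show that StructSR significantly improves the fidelity of structure and texture, improving PSNR and SSIM by an average of 5.27% and 9.36% on a synthetic dataset (DIV2K-Val) and by 4.13% and 8.64% on two real-world datasets (RealSR and DRealSR).

Link: https://arxiv.org/abs/2501.05777
Authors: Yachao Li, Dong Liang, Tianyu Ding, Sheng-Jun Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: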

Abstract:Diffusion-based models have shown great promise in real-world image super-resolution (Real-ISR), but often generate content with structural errors and spurious texture details due to the empirical priors and illusions of these models. To address this issue, we introduce StructSR, a simple, effective, and plug-and-play method that enhances structural fidelity and suppresses spurious details for diffusion-based Real-ISR. StructSR operates without the need for additional fine-tuning, external model priors, or high-level semantic knowledge. At its core is the Structure-Aware Screening (SAS) mechanism, which identifies the image with the highest structural similarity to the low-resolution (LR) input in the early inference stage, allowing us to leverage it as a historical structure knowledge to suppress the generation of spurious details. By intervening in the diffusion inference process, StructSR seamlessly integrates with existing diffusion-based Real-ISR models. Our experimental results demonstrate that StructSR significantly improves the fidelity of structure and texture, improving the PSNR and SSIM metrics by an average of 5.27% and 9.36% on a synthetic dataset (DIV2K-Val) and 4.13% and 8.64% on two real-world datasets (RealSR and DRealSR) when integrated with four state-of-the-art diffusion-based Real-ISR methods.
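
The screening step, picking the candidate most structurally similar to the LR input, can be sketched with a single-window SSIM. This is an assumption-laden simplification: images are flat grayscale lists of equal length, SSIM is computed globally rather than over local windows, and the candidates are made up. How StructSR actually resizes and compares intermediate predictions is not specified in the abstract.

```python
def ssim_global(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale images (flat lists)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


def screen(lr, candidates):
    """Return the index of the candidate with the highest SSIM to lr, plus all scores."""
    scores = [ssim_global(lr, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


lr = [10, 200, 10, 200]            # toy "low-resolution" pattern
candidates = [
    [12, 198, 11, 201],            # same structure, slight noise
    [200, 10, 200, 10],            # inverted structure
    [100, 100, 100, 100],          # no structure at all
]
best, scores = screen(lr, candidates)
```

The inverted candidate gets a negative score because its covariance with the input is negative, and the flat candidate scores near zero, so the first candidate is selected.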

[CV-36] Conditional Diffusion Model for Electrical Impedance Tomography

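[Quick read]: This paper addresses the low reconstruction quality, high sensitivity to measured data, and random noise artifacts in electrical impedance tomography (EIT), which are caused by the non-linearity and ill-conditioning of the inverse problem. It proposes a conditional diffusion model with voltage consistency (CDMVC). The key points are: (1) a pre-imaging module generates an initial reconstruction, which serves as the condition for training the conditional diffusion model; and (2) a forward voltage constraint network is introduced in the sampling process, where a voltage consistency constraint incorporates EIT forward information into the reconstruction and thereby improves imaging quality. Experiments show that the method significantly improves the quality of reconstructed images and has good robustness and generalization performance.

Link: https://arxiv.org/abs/2501.05769
Authors: Duanpeng Shi, Wendong Zheng, Di Guo, Huaping Liu
Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; School of Electrical Engineering and Automation, Tianjin University of Technology, Tianjin, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: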

Abstract:Electrical impedance tomography (EIT) is a non-invasive imaging technique, which has been widely used in the fields of industrial inspection, medical monitoring and tactile sensing. However, due to the inherent non-linearity and ill-conditioned nature of the EIT inverse problem, the reconstructed image is highly sensitive to the measured data, and random noise artifacts often appear in the reconstructed image, which greatly limits the application of EIT. To address this issue, a conditional diffusion model with voltage consistency (CDMVC) is proposed in this study. The method consists of a pre-imaging module, a conditional diffusion model for reconstruction, a forward voltage constraint network and a scheme of voltage consistency constraint during sampling process. The pre-imaging module is employed to generate the initial reconstruction. This serves as a condition for training the conditional diffusion model. Finally, based on the forward voltage constraint network, a voltage consistency constraint is implemented in the sampling phase to incorporate forward information of EIT, thereby enhancing imaging quality. A more complete dataset, including both common and complex concave shapes, is generated. The proposed method is validated using both simulation and physical experiments. Experimental results demonstrate that our method can significantly improve the quality of reconstructed images. In addition, experimental results also demonstrate that our method has good robustness and generalization performance.
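
A consistency constraint of this kind is often applied as a gradient step that pulls the current estimate toward agreement with the measurements. The sketch below assumes a toy linear forward model `A` in place of the paper's forward voltage constraint network, and a plain squared-error penalty. The matrix, the step size, and the function names are hypothetical.

```python
def forward(A, x):
    """Toy linear forward model: predicted voltages for conductivity estimate x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def residual_sq(A, x, v):
    """Squared mismatch between predicted and measured voltages."""
    return sum((p - m) ** 2 for p, m in zip(forward(A, x), v))


def consistency_step(A, x, v, step):
    """One gradient step on ||A x - v||^2 with respect to x."""
    r = [p - m for p, m in zip(forward(A, x), v)]
    grad = [2.0 * sum(A[i][j] * r[i] for i in range(len(A)))
            for j in range(len(x))]
    return [xj - step * g for xj, g in zip(x, grad)]


A = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical forward operator
v = [1.0, 2.0]                 # "measured" voltages for the true x = [1, 1]
x0 = [0.0, 0.0]                # current sample from the (omitted) diffusion step
x1 = consistency_step(A, x0, v, step=0.1)

before = residual_sq(A, x0, v)
after = residual_sq(A, x1, v)
```

In a sampler, a step like this would be interleaved with the denoising updates, so each intermediate image is nudged toward voltages that match the measurement.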

[CV-37] StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

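[Quick read]: This paper addresses the difficulty of long-range consistent scene generation with large reconstruction and generative models, whose compute limitations confine each inference to a small area. The authors propose StarGen, a framework whose key idea is to use a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. Specifically, each video clip is generated conditioned on the 3D warping of spatially adjacent images and on the temporally overlapping image from previously generated clips, which improves spatiotemporal consistency in long-range scene generation under precise pose control. This spatiotemporal condition is compatible with various input conditions and supports diverse tasks such as sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Experiments show that StarGen outperforms existing methods in scalability, fidelity, and pose accuracy.

Link: https://arxiv.org/abs/2501.05763
Authors: Shangjin Zhai, Zhichao Ye, Jialin Liu, Weijian Xie, Jiaqi Hu, Zhen Peng, Hua Xue, Danpeng Chen, Xiaomeng Wang, Lei Yang, Nan Wang, Haomin Liu, Guofeng Zhang
Affiliations: Sensetime Research; State Key Lab of CAD&CG, Zhejiang University; Tetras.AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: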

Abstract:Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions, facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen’s superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods.
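
The temporal side of this autoregressive scheme amounts to generating clips whose first frame overlaps the last frame of the previous clip. A minimal sketch of that scheduling follows. The clip length, the one-frame overlap, and the function name `clip_windows` are assumptions for illustration, and the diffusion model and 3D warping are left out entirely.

```python
def clip_windows(total_frames, clip_len, overlap=1):
    """Half-open frame ranges (start, end) for consecutive clips sharing `overlap` frames."""
    if clip_len <= overlap:
        raise ValueError("clip_len must exceed overlap")
    stride = clip_len - overlap
    starts = range(0, total_frames - overlap, stride)
    return [(s, min(s + clip_len, total_frames)) for s in starts]


windows = clip_windows(total_frames=10, clip_len=4, overlap=1)
print(windows)  # [(0, 4), (3, 7), (6, 10)]
```

Each clip after the first starts on a frame that already exists (frame 3, then frame 6), which is the frame a generator would be conditioned on to keep consecutive clips consistent.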

[CV-38] Locality-aware Gaussian Compression for Fast and High-quality Rendering

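[Quick read]: This paper addresses the storage and rendering efficiency of 3D Gaussian Splatting (3DGS) for modeling volumetric scenes. Existing 3DGS methods have high storage requirements and computational cost on large-scale scenes, which limits their practical use. The paper proposes LocoGS, a locality-aware 3DGS framework that exploits the local coherence of 3D Gaussian attributes for compact modeling of volumetric scenes.

The key points are as follows. First, LocoGS analyzes the local coherence of 3D Gaussian attributes and proposes a locality-aware 3D Gaussian representation that encodes locally coherent attributes with a neural field representation, which significantly reduces storage requirements. Second, LocoGS adds dense initialization, an adaptive spherical harmonics bandwidth scheme, and different encoding schemes for different Gaussian attributes to further improve compression performance. Experiments show that, while maintaining high rendering quality, LocoGS reduces storage size by 54.6× to 96.6× and achieves 2.1× to 2.4× the rendering speed of 3DGS. Its rendering speed is also 2.4× higher on average than that of the state-of-the-art compression method.

Link: https://arxiv.org/abs/2501.05757
Authors: Seungjoo Shin, Jaesik Park, Sunghyun Cho
Affiliations: POSTECH; Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 15 figures, and 14 tables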

Abstract:We present LocoGS, a locality-aware 3D Gaussian Splatting (3DGS) framework that exploits the spatial coherence of 3D Gaussians for compact modeling of volumetric scenes. To this end, we first analyze the local coherence of 3D Gaussian attributes, and propose a novel locality-aware 3D Gaussian representation that effectively encodes locally-coherent Gaussian attributes using a neural field representation with a minimal storage requirement. On top of the novel representation, LocoGS is carefully designed with additional components such as dense initialization, an adaptive spherical harmonics bandwidth scheme and different encoding schemes for different Gaussian attributes to maximize compression performance. Experimental results demonstrate that our approach outperforms the rendering quality of existing compact Gaussian representations for representative real-world 3D datasets while achieving 54.6× to 96.6× smaller storage size and 2.1× to 2.4× the rendering speed of 3DGS. Our approach also demonstrates, on average, 2.4× higher rendering speed than the state-of-the-art compression method with comparable compression performance.

[CV-39] Semantic Mapping in Indoor Embodied AI – A Comprehensive Survey and Future Directions

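[Quick read]: This paper addresses how intelligent embodied agents (e.g., robots) build and maintain a semantic map of the environment when performing complex semantic tasks in unfamiliar environments. A semantic map captures information about the environment in a structured way, allowing the agent to perform advanced reasoning throughout the task. The paper's contribution is a comprehensive review of existing semantic map-building approaches, specifically for indoor navigation. The approaches are categorized by their structural representation (spatial grids, topological graphs, dense point clouds, or hybrid maps) and by the type of information they encode (implicit features or explicit environmental data). The paper also discusses the strengths and limitations of these techniques, highlights current challenges, and proposes future research directions. In particular, it notes that the field is moving toward open-vocabulary, queryable, task-agnostic map representations, while high memory demands and computational inefficiency remain open challenges.

Link: https://arxiv.org/abs/2501.05750
Authors: Sonia Raychaudhuri, Angel X. Chang
Affiliations: Simon Fraser University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: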

Abstract:Intelligent embodied agents (e.g. robots) need to perform complex semantic tasks in unfamiliar environments. Among many skills that the agents need to possess, building and maintaining a semantic map of the environment is most crucial in long-horizon tasks. A semantic map captures information about the environment in a structured way, allowing the agent to reference it for advanced reasoning throughout the task. While existing surveys in embodied AI focus on general advancements or specific tasks like navigation and manipulation, this paper provides a comprehensive review of semantic map-building approaches in embodied AI, specifically for indoor navigation. We categorize these approaches based on their structural representation (spatial grids, topological graphs, dense point-clouds or hybrid maps) and the type of information they encode (implicit features or explicit environmental data). We also explore the strengths and limitations of the map building techniques, highlight current challenges, and propose future research directions. We identify that the field is moving towards developing open-vocabulary, queryable, task-agnostic map representations, while high memory demands and computational inefficiency remain open challenges. This survey aims to guide current and future researchers in advancing semantic mapping techniques for embodied AI systems.

[CV-40] LLVD: LSTM-based Explicit Motion Modeling in Latent Space for Blind Video Denoising

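[Quick read]: This paper addresses noise in video restoration, specifically the degradation in video quality caused by noise introduced during capture (sensor noise, motion blur, etc.). It proposes the Latent space LSTM Video Denoiser (LLVD), an end-to-end blind denoising model. LLVD's key innovation is to combine spatial and temporal feature extraction and to apply Long Short-Term Memory (LSTM) layers within the encoded feature domain. Integrating these LSTM layers is crucial for maintaining continuity and minimizing flicker in the restored video. Processing frames in the encoded feature domain also significantly reduces computation, which yields a lightweight architecture. Because it is blind, LLVD applies to real-world scenarios where no prior information about noise characteristics is available. Experiments show that LLVD performs well on both synthetic and captured noise, and in RAW denoising it surpasses the current state of the art (SOTA) by 0.3 dB while reducing computational complexity by 59%.

Link: https://arxiv.org/abs/2501.05744
Authors: Loay Rashid, Siddharth Roheda, Amit Unde
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: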

Abstract:Video restoration plays a pivotal role in revitalizing degraded video content by rectifying imperfections caused by various degradations introduced during capturing (sensor noise, motion blur, etc.), saving/sharing (compression, resizing, etc.) and editing. This paper introduces a novel algorithm designed for scenarios where noise is introduced during video capture, aiming to enhance the visual quality of videos by reducing unwanted noise artifacts. We propose the Latent space LSTM Video Denoiser (LLVD), an end-to-end blind denoising model. LLVD uniquely combines spatial and temporal feature extraction, employing Long Short Term Memory (LSTM) within the encoded feature domain. This integration of LSTM layers is crucial for maintaining continuity and minimizing flicker in the restored video. Moreover, processing frames in the encoded feature domain significantly reduces computations, resulting in a very lightweight architecture. LLVD’s blind nature makes it versatile for real, in-the-wild denoising scenarios where prior information about noise characteristics is not available. Experiments reveal that LLVD demonstrates excellent performance for both synthetic and captured noise. Specifically, LLVD surpasses the current State-Of-The-Art (SOTA) in RAW denoising by 0.3dB, while also achieving a 59% reduction in computational complexity.
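
To put a 0.3 dB gain in perspective, the sketch below computes PSNR from mean squared error and converts a PSNR gain into the corresponding MSE ratio. It assumes the gain is a PSNR difference, which is the usual convention in denoising but is not stated explicitly in the abstract. The 8-bit peak value of 255 is likewise an assumption, and RAW data typically uses a different range.

```python
import math


def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_gain(0.3)
print(round(ratio, 4))  # 0.9333
```

Under these assumptions, a 0.3 dB improvement corresponds to about 6.7% lower mean squared error, independent of the peak value used.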

[CV-41] TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos

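[Quick read]: This paper addresses two main challenges in applying Multi-modal Large Language Models (MLLMs) to Autonomous Driving (AD): limited training on traffic-specific data, and the absence of dedicated benchmarks for spatiotemporal understanding. It proposes TB-Bench, a comprehensive benchmark that evaluates how well MLLMs understand traffic behaviors across eight perception tasks from ego-centric views. It also introduces the vision-language instruction-tuning datasets TB-100k and TB-250k, along with simple yet effective baselines for these tasks. Experiments show that existing MLLMs underperform on these tasks, with even GPT-4o averaging below 35% accuracy, whereas baselines fine-tuned on TB-100k or TB-250k reach average accuracy of up to 85%. Co-training TB-100k with another traffic dataset also transfers performance, improving results on the latter. Overall, by contributing a comprehensive benchmark, high-quality datasets, and baselines, the work supports the gradual integration of MLLMs into the perception, prediction, and planning stages of AD.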

Link: https://arxiv.org/abs/2501.05733
Authors: Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Kentaro Arai, Seiji Totsuka, Hiroshi Ino, Takayuki Okatani
Affiliations: Tohoku University; RIKEN; Sony Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Main Paper: 8 pages, Supplementary Materials: 15 pages

Abstract:The application of Multi-modal Large Language Models (MLLMs) in Autonomous Driving (AD) faces significant challenges due to their limited training on traffic-specific data and the absence of dedicated benchmarks for spatiotemporal understanding. This study addresses these issues by proposing TB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. Through extensive experiments, we show that existing MLLMs underperform in these tasks, with even a powerful model like GPT-4o achieving less than 35% accuracy on average. In contrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy up to 85%, significantly enhancing performance on the tasks. Additionally, we demonstrate performance transfer by co-training TB-100k with another traffic dataset, leading to improved performance on the latter. Overall, this study represents a step forward by introducing a comprehensive benchmark, high-quality datasets, and baselines, thus supporting the gradual integration of MLLMs into the perception, prediction, and planning stages of AD.

[CV-42] Super-class guided Transformer for Zero-Shot Attribute Classification AAAI25

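[Quick read]: This paper addresses zero-shot attribute classification, where existing models make poor use of the relationship between seen and unseen attributes and therefore generalize weakly. Attribute classification also usually involves many attributes, which makes scalability hard to maintain. The paper proposes the Super-class guided transFormer (SugaFormer), a framework that uses super-classes to improve both scalability and generalizability. Specifically, SugaFormer uses Super-class Query Initialization (SQI) to reduce the number of queries by exploiting the semantic information common to each super-class, and adds Multi-context Decoding (MD) to handle diverse visual cues. To further strengthen generalization, it introduces two knowledge-transfer strategies: during training, Super-class guided Consistency Regularization (SCR) aligns SugaFormer's features with Vision-Language Models (VLMs) using region-specific prompts; during inference, Zero-shot Retrieval-based Score Enhancement (ZRSE) refines predictions for unseen attributes. Experiments show that SugaFormer achieves state-of-the-art performance on three widely used attribute classification benchmarks under zero-shot and cross-dataset transfer settings.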

Link: https://arxiv.org/abs/2501.05728
Authors: Sehyung Kim, Chanhyeong Yang, Jihwan Park, Taehoon Song, Hyunwoo J. Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI25

Abstract:Attribute classification is crucial for identifying specific characteristics within image regions. Vision-Language Models (VLMs) have been effective in zero-shot tasks by leveraging their general knowledge from large-scale datasets. Recent studies demonstrate that transformer-based models with class-wise queries can effectively address zero-shot multi-label classification. However, poor utilization of the relationship between seen and unseen attributes makes the model lack generalizability. Additionally, attribute classification generally involves many attributes, making maintaining the model’s scalability difficult. To address these issues, we propose Super-class guided transFormer (SugaFormer), a novel framework that leverages super-classes to enhance scalability and generalizability for zero-shot attribute classification. SugaFormer employs Super-class Query Initialization (SQI) to reduce the number of queries, utilizing common semantic information from super-classes, and incorporates Multi-context Decoding (MD) to handle diverse visual cues. To strengthen generalizability, we introduce two knowledge transfer strategies that utilize VLMs. During training, Super-class guided Consistency Regularization (SCR) aligns SugaFormer’s features with VLMs using region-specific prompts, and during inference, Zero-shot Retrieval-based Score Enhancement (ZRSE) refines predictions for unseen attributes. Extensive experiments demonstrate that SugaFormer achieves state-of-the-art performance across three widely-used attribute classification benchmarks under zero-shot, and cross-dataset transfer settings. Our code is available at this https URL.

[CV-43] Zero-shot Shark Tracking and Biometrics from Aerial Imagery

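[Quick read]: This paper addresses the analysis of marine animal imagery collected by drones, where conventional machine learning requires training, testing, and deploying a new model for every dataset. That approach is slow and demands substantial human effort and ML expertise. The proposed solution, Frame Level ALIgment and tRacking (FLAIR), combines the video understanding of Segment Anything Model 2 (SAM2) with the vision-language capabilities of Contrastive Language-Image Pre-training (CLIP) in a zero-shot approach that generalizes to other species without labeled data, training a new model, or fine-tuning an existing one. FLAIR takes a drone video as input and outputs segmentation masks of the species of interest, sharply reducing the human effort and expertise required while reaching a Dice score of 0.81. It can also automatically extract relevant biological information, such as body length and tailbeat frequency, which accelerates aerial imagery analysis workflows.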

Link: https://arxiv.org/abs/2501.05717
Authors: Chinmay K Lalgudi, Mark E Leone, Jaden V Clark, Sergio Madrigal-Mora, Mario Espinoza
Affiliations: Stanford University; Flinders University; Centro de Investigación en Ciencias del Mar y Limnología, Universidad de Costa Rica
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:

Abstract:The recent widespread adoption of drones for studying marine animals provides opportunities for deriving biological information from aerial imagery. The large scale of imagery data acquired from drones is well suited for machine learning (ML) analysis. Development of ML models for analyzing marine animal aerial imagery has followed the classical paradigm of training, testing, and deploying a new model for each dataset, requiring significant time, human effort, and ML expertise. We introduce Frame Level ALIgment and tRacking (FLAIR), which leverages the video understanding of Segment Anything Model 2 (SAM2) and the vision-language capabilities of Contrastive Language-Image Pre-training (CLIP). FLAIR takes a drone video as input and outputs segmentation masks of the species of interest across the video. Notably, FLAIR leverages a zero-shot approach, eliminating the need for labeled data, training a new model, or fine-tuning an existing model to generalize to other species. With a dataset of 18,000 drone images of Pacific nurse sharks, we trained state-of-the-art object detection models to compare against FLAIR. We show that FLAIR massively outperforms these object detectors and performs competitively against two human-in-the-loop methods for prompting SAM2, achieving a Dice score of 0.81. FLAIR readily generalizes to other shark species without additional human effort and can be combined with novel heuristics to automatically extract relevant information including length and tailbeat frequency. FLAIR has significant potential to accelerate aerial imagery analysis workflows, requiring markedly less human effort and expertise than traditional machine learning workflows, while achieving superior accuracy. By reducing the effort required for aerial imagery analysis, FLAIR allows scientists to spend more time interpreting results and deriving insights about marine ecosystems.
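
For readers unfamiliar with the Dice score reported above, here is a minimal sketch of how it is computed for two binary segmentation masks. The function name and the nested-list mask format are illustrative assumptions, not FLAIR's actual evaluation code.

```python
def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks overlap exactly, and 0.0 means they share no foreground pixels.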

[CV-44] From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities

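[Quick read]: This paper addresses the limitations of Large Vision Language Models (LVLMs) in understanding videos of Activities of Daily Living (ADL), in particular their inability to capture fine-grained interactions and spatial relationships. The problem is especially pronounced in ADL tasks, where understanding detailed human-object interaction and human-centric motion is crucial for applications such as elderly monitoring and cognitive assessment.

The key to the solution is exploiting the complementary nature of egocentric views to improve LVLMs' understanding of exocentric ADL videos. The paper proposes an online ego2exo distillation approach that learns ego-augmented exo representations. This approach needs paired ego-exo training data, which is impractical to collect in real-world ADL scenarios, so the authors develop EgoMimic, a skeleton-guided method that generates mimicked ego views from exocentric videos. The resulting ego-augmented LVLMs learn to extract ego-perspective cues, as shown on six ADL benchmarks and on the purpose-built EgoPerceptionMCQ benchmark.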

Link: https://arxiv.org/abs/2501.05711
Authors: Dominick Reilly, Manish Kumar Govind, Srijan Das
Affiliations: University of North Carolina at Charlotte
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision Language Models (LVLMs) have demonstrated impressive capabilities in video understanding, yet their adoption for Activities of Daily Living (ADL) remains limited by their inability to capture fine-grained interactions and spatial relationships. This limitation is particularly evident in ADL tasks, where understanding detailed human-object interaction and human-centric motion is crucial for applications such as elderly monitoring and cognitive assessment. To address this, we aim to leverage the complementary nature of egocentric views to enhance LVLM’s understanding of exocentric ADL videos. Consequently, we propose an online ego2exo distillation approach to learn ego-augmented exo representations in LVLMs. While effective, this approach requires paired ego-exo training data, which is impractical to collect for real-world ADL scenarios. Consequently, we develop EgoMimic, a skeleton-guided method that can generate mimicked ego views from exocentric videos. We find that the exo representations of our ego-augmented LVLMs successfully learn to extract ego-perspective cues, demonstrated through comprehensive evaluation on six ADL benchmarks and our proposed EgoPerceptionMCQ benchmark designed specifically to assess egocentric understanding from exocentric videos. Code, models, and data will be open-sourced at this https URL.

[CV-45] EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model

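[Quick read]: This paper addresses two shortcomings of existing emotional image generation methods: they struggle to capture complex and subtle emotional nuances, and they offer little control over the specific content of generated images through text prompts. Existing methods rely on discrete emotion categories, which cannot express the continuity and complexity of emotion accurately. The paper therefore introduces a new task, continuous emotional image content generation (C-EICG), and presents the EmotiCrafter model, which generates images from text prompts and Valence-Arousal values. The key components are: (1) a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features to capture specific emotions consistent with the input prompt; and (2) a loss function that enhances the emotional expressiveness of generated images. Experiments show that the method generates images with the specified emotion and content and outperforms existing techniques.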

Link: https://arxiv.org/abs/2501.05710
Authors: Yi He, Shengqi Dang, Long Ling, Ziqing Qian, Nanxuan Zhao, Nan Cao
Affiliations: iDVX Lab, Tongji University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures

Abstract:Recent research shows that emotions can enhance users’ cognition and influence information communication. While research on visual emotion analysis is extensive, limited work has been done on helping users generate emotionally rich image content. Existing work on emotional image generation relies on discrete emotion categories, making it challenging to capture complex and subtle emotional nuances accurately. Additionally, these methods struggle to control the specific content of generated images based on text prompts. In this work, we introduce the new task of continuous emotional image content generation (C-EICG) and present EmotiCrafter, an emotional image generation model that generates images based on text prompts and Valence-Arousal values. Specifically, we propose a novel emotion-embedding mapping network that embeds Valence-Arousal values into textual features, enabling the capture of specific emotions in alignment with intended input prompts. Additionally, we introduce a loss function to enhance emotion expression. The experimental results show that our method effectively generates images representing specific emotions with the desired content and outperforms existing techniques.

[CV-46] eKalibr: Dynamic Intrinsic Calibration for Event Cameras From First Principles of Events

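[Quick read]: This paper addresses intrinsic calibration of event cameras. Event cameras hold significant potential for vision applications thanks to their high dynamic range and low latency, but most existing calibration methods either rely heavily on conventional image-based calibration pipelines or require complex instrumentation and are inconvenient to use. The authors propose eKalibr, an accurate and convenient intrinsic calibration method built on an event-based circle grid pattern recognition algorithm. The method uses event-based normal flow estimation to identify events generated by circle edges and clusters them spatially. Event clusters belonging to the same grid circle are then matched and grouped using normal flows for time-varying ellipse estimation. Finally, the fitted ellipse centers are time-synchronized to recognize the grid pattern. Extensive experiments evaluate eKalibr on pattern extraction and intrinsic calibration, and the implementation is open-sourced for the research community.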

Link: https://arxiv.org/abs/2501.05688
Authors: Shuolong Chen, Xingxing Li, Liu Yuan, Ziao Liu
Affiliations: School of Geodesy and Geomatics (SGG), Wuhan University (WHU)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:The bio-inspired event camera has garnered extensive research attention in recent years, owing to its significant potential derived from its high dynamic range and low latency characteristics. Similar to the standard camera, the event camera requires precise intrinsic calibration to facilitate further high-level visual applications, such as pose estimation and mapping. While several calibration methods for event cameras have been proposed, most of them are either (i) engineering-driven, heavily relying on conventional image-based calibration pipelines, or (ii) inconvenient, requiring complex instrumentation. To this end, we propose an accurate and convenient intrinsic calibration method for event cameras, named eKalibr, which builds upon a carefully designed event-based circle grid pattern recognition algorithm. To extract target patterns from events, we perform event-based normal flow estimation to identify potential events generated by circle edges, and cluster them spatially. Subsequently, event clusters associated with the same grid circles are matched and grouped using normal flows, for subsequent time-varying ellipse estimation. Fitted ellipse centers are time-synchronized, for final grid pattern recognition. We conducted extensive experiments to evaluate the performance of eKalibr in terms of pattern extraction and intrinsic calibration. The implementation of eKalibr is open-sourced at (this https URL) to benefit the research community.

[CV-47] UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation

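[Quick read]: This paper addresses the weak entanglement problem in Scene Graph Generation (SGG). Specifically, when one-stage methods reason over relational triplets (subject, predicate, object), they fail to account for both coupled features and decoupled visual features at once. Coupled features are those shared within a triplet, whereas decoupled visual features are the separate visual features of each element (subject, object, predicate). Existing methods either use a single decoder to model coupled features or use multiple decoders to extract decoupled visual features, but not both.

The proposed solution is UniQ, a Unified decoder with task-specific Queries architecture. Task-specific queries generate decoupled visual features for subjects, objects, and predicates respectively, while the unified decoder models the coupled features within relational triplets. Experiments on the Visual Genome dataset show that UniQ outperforms existing one-stage and two-stage methods.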

Link: https://arxiv.org/abs/2501.05687
Authors: Xinyao Liao, Wei Wei, Dangyang Chen, Yuanyuan Fu
Affiliations: Huazhong University of Science and Technology; Pingan Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures

Abstract:Scene Graph Generation (SGG) is a scene understanding task that aims at identifying object entities and reasoning their relationships within a given image. In contrast to prevailing two-stage methods based on a large object detector (e.g., Faster R-CNN), one-stage methods integrate a fixed-size set of learnable queries to jointly reason relational triplets <subject, predicate, object>. This paradigm demonstrates robust performance with significantly reduced parameters and computational overhead. However, the challenge in one-stage methods stems from the issue of weak entanglement, wherein entities involved in relationships require both coupled features shared within triplets and decoupled visual features. Previous methods either adopt a single decoder for coupled triplet feature modeling or multiple decoders for separate visual feature extraction but fail to consider both. In this paper, we introduce UniQ, a Unified decoder with task-specific Queries architecture, where task-specific queries generate decoupled visual features for subjects, objects, and predicates respectively, and unified decoder enables coupled feature modeling within relational triplets. Experimental results on the Visual Genome dataset demonstrate that UniQ has superior performance to both one-stage and two-stage methods.

[CV-48] Deep Reversible Consistency Learning for Cross-modal Retrieval

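[Quick read]: This paper addresses limitations of existing Cross-modal retrieval (CMR) methods. Conventional methods usually assume that multimodal samples come in pairs and learn common representations through joint training, which limits the flexibility of CMR. Some methods train each modality independently to improve flexibility, but they rely on randomly initialized orthogonal matrices to guide representation learning; this assumes inter-class samples are independent of each other and limits the potential semantic alignment between sample representations and ground-truth labels.

To address these issues, the paper proposes Deep Reversible Consistency Learning (DRCL), which has two core modules: Selective Prior Learning (SPL) and Reversible Semantic Consistency learning (RSC). SPL learns a transformation weight matrix on each modality and selects the best one as the prior according to a quality score, which avoids blindly selecting a prior learned from a low-quality modality. RSC uses a Modality-invariant Representation Recasting mechanism (MRR) to recast potential modality-invariant representations from sample semantic labels via the generalized inverse matrix of the prior. Because labels carry no modality-specific information, the recast features guide representation learning and preserve semantic consistency as far as possible. RSC also includes a Feature Augmentation mechanism (FA) that encourages the model to learn over a wider data distribution for diversity. Experiments show that DRCL outperforms 15 state-of-the-art baselines on five widely used datasets.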

Link: https://arxiv.org/abs/2501.05686
Authors: Ruitao Pu, Yang Qin, Dezhong Peng, Xiaomin Song, Huiming Zheng
Affiliations: College of Computer Science, Sichuan University; Sichuan National Innovation New Vision UHD Video Technology Co., Ltd; National Innovation Center for UHD Video Technology; Sichuan Newstrong UHD Video Technology Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Cross-modal retrieval (CMR) typically involves learning common representations to directly measure similarities between multimodal samples. Most existing CMR methods commonly assume multimodal samples in pairs and employ joint training to learn common representations, limiting the flexibility of CMR. Although some methods adopt independent training strategies for each modality to improve flexibility in CMR, they utilize the randomly initialized orthogonal matrices to guide representation learning, which is suboptimal since they assume inter-class samples are independent of each other, limiting the potential of semantic alignments between sample representations and ground-truth labels. To address these issues, we propose a novel method termed Deep Reversible Consistency Learning (DRCL) for cross-modal retrieval. DRCL includes two core modules, i.e., Selective Prior Learning (SPL) and Reversible Semantic Consistency learning (RSC). More specifically, SPL first learns a transformation weight matrix on each modality and selects the best one based on the quality score as the Prior, which greatly avoids blind selection of priors learned from low-quality modalities. Then, RSC employs a Modality-invariant Representation Recasting mechanism (MRR) to recast the potential modality-invariant representations from sample semantic labels by the generalized inverse matrix of the prior. Since labels are devoid of modal-specific information, we utilize the recast features to guide the representation learning, thus maintaining semantic consistency to the fullest extent possible. In addition, a feature augmentation mechanism (FA) is introduced in RSC to encourage the model to learn over a wider data distribution for diversity. Finally, extensive experiments conducted on five widely used datasets and comparisons with 15 state-of-the-art baselines demonstrate the effectiveness and superiority of our DRCL.

[CV-49] LPRnet: A self-supervised registration network for LiDAR and photogrammetric point clouds

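[Quick read]: This paper addresses registration of heterogeneous point clouds from LiDAR and photogrammetry. Because the two techniques differ fundamentally in sensing mechanism, spatial distribution, and coordinate system, their point clouds differ significantly in density, precision, noise, and overlap; combined with the lack of ground truth for large-scale scenes, this makes fusing heterogeneous point clouds highly challenging. The paper proposes a self-supervised registration network based on a masked autoencoder. At its core is a multi-scale masked training strategy that extracts robust features from heterogeneous point clouds. A rotation-translation embedding module is designed to capture the key features needed for accurate rigid transformations. Building on these robust representations, a transformer-based architecture integrates local and global features to align different point cloud datasets precisely. Experiments on two real-world datasets validate the method's effectiveness for heterogeneous point cloud registration.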

Link: https://arxiv.org/abs/2501.05669
Authors: Chen Wang, Yanfeng Gu, Xian Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 12 pages, 9 figures, 5 tables

Abstract:LiDAR and photogrammetry are active and passive remote sensing techniques for point cloud acquisition, respectively, offering complementary advantages but producing heterogeneous data. Due to the fundamental differences in sensing mechanisms, spatial distributions and coordinate systems, their point clouds exhibit significant discrepancies in density, precision, noise, and overlap. Coupled with the lack of ground truth for large-scale scenes, integrating the heterogeneous point clouds is a highly challenging task. This paper proposes a self-supervised registration network based on a masked autoencoder, focusing on heterogeneous LiDAR and photogrammetric point clouds. At its core, the method introduces a multi-scale masked training strategy to extract robust features from heterogeneous point clouds under self-supervision. To further enhance registration performance, a rotation-translation embedding module is designed to effectively capture the key features essential for accurate rigid transformations. Building upon the robust representations, a transformer-based architecture seamlessly integrates local and global features, fostering precise alignment across diverse point cloud datasets. The proposed method demonstrates strong feature extraction capabilities for both LiDAR and photogrammetric point clouds, addressing the challenges of acquiring ground truth at the scene level. Experiments conducted on two real-world datasets validate the effectiveness of the proposed method in solving heterogeneous point cloud registration problems.
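
Registration here means estimating a rigid transformation (a rotation plus a translation) that aligns one point cloud with another. The sketch below only illustrates what such a transformation does and how alignment error can be measured; it is not the paper's network. Restricting the rotation to the z-axis and the helper names `apply_rigid` and `rmse` are simplifying assumptions.

```python
import math


def apply_rigid(points, yaw, translation):
    """Rotate 3D points about the z-axis by `yaw` radians, then translate them."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]


def rmse(points_a, points_b):
    """Root-mean-square distance between corresponding points of two clouds."""
    squared = [math.dist(a, b) ** 2 for a, b in zip(points_a, points_b)]
    return math.sqrt(sum(squared) / len(squared))
```

If the estimated rotation and translation match the true ones, the RMSE between the transformed source cloud and the target cloud drops to zero.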

[CV-50] HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection WACV2025

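[Quick read]: This paper addresses detection of realistic fake images and videos produced by deep generative models such as variational models, diffusion models, and generative adversarial networks. As these techniques advance rapidly, synthetic images become increasingly hard to distinguish from real ones, which poses a major challenge for detecting and curbing the spread of misinformation. The proposed solution is HFMF, a two-stage framework that combines hierarchical cross-modal feature fusion with multi-stream feature extraction to improve detection performance. The first stage integrates vision Transformers and convolutional networks through a hierarchical feature fusion mechanism; the second stage combines object-level information with a fine-tuned convolutional network model. An ensemble deep neural network then fuses the outputs of both stages for robust classification. The framework performs strongly across diverse dataset benchmarks while maintaining calibration and interoperability.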

Link: https://arxiv.org/abs/2501.05631
Authors: Anant Mehta, Bryant McArthur, Nagarjuna Kolloju, Zhengzhong Tu
Affiliations: Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work is accepted to WACV 2025 Workshop on AI for Multimedia Forensics Disinformation Detection. Code is available at: this https URL

Abstract:The rapid progress in deep generative models has led to the creation of incredibly realistic synthetic images that are becoming increasingly difficult to distinguish from real-world data. The widespread use of Variational Models, Diffusion Models, and Generative Adversarial Networks has made it easier to generate convincing fake images and videos, which poses significant challenges for detecting and mitigating the spread of misinformation. As a result, developing effective methods for detecting AI-generated fakes has become a pressing concern. In our research, we propose HFMF, a comprehensive two-stage deepfake detection framework that leverages both hierarchical cross-modal feature fusion and multi-stream feature extraction to enhance detection performance against imagery produced by state-of-the-art generative AI models. The first component of our approach integrates vision Transformers and convolutional nets through a hierarchical feature fusion mechanism. The second component of our framework combines object-level information and a fine-tuned convolutional net model. We then fuse the outputs from both components via an ensemble deep neural net, enabling robust classification performances. We demonstrate that our architecture achieves superior performance across diverse dataset benchmarks while maintaining calibration and interoperability.

[CV-51] Approximate Supervised Object Distance Estimation on Unmanned Surface Vehicles

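[Quick read]: This paper addresses the high sensor cost and complexity that limit the deployment of unmanned surface vehicles (USVs) in maritime operations. Conventional ranging technologies such as LiDAR, radar, and depth cameras are either costly, yield sparse point clouds, or are noisy, and they require extensive calibration. The paper proposes approximate distance estimation based on supervised object detection: the authors collect a dataset of images with manually annotated bounding boxes and corresponding distance measurements, and add a specialized branch to an object detection model so that it both detects objects and predicts their distance from the USV. The approach is a cost-efficient and intuitive alternative that aligns more closely with human estimation capabilities, and it is applied in a marine assistance system that alerts operators to nearby objects such as boats, buoys, or other waterborne hazards.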

Link: https://arxiv.org/abs/2501.05567
Authors: Benjamin Kiefer, Yitong Quan, Andreas Zell
Affiliations: LOOKOUT; Cognitive Systems Group, University of Tuebingen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unmanned surface vehicles (USVs) and boats are increasingly important in maritime operations, yet their deployment is limited due to costly sensors and complexity. LiDAR, radar, and depth cameras are either costly, yield sparse point clouds or are noisy, and require extensive calibration. Here, we introduce a novel approach for approximate distance estimation in USVs using supervised object detection. We collected a dataset comprising images with manually annotated bounding boxes and corresponding distance measurements. Leveraging this data, we propose a specialized branch of an object detection model, not only to detect objects but also to predict their distances from the USV. This method offers a cost-efficient and intuitive alternative to conventional distance measurement techniques, aligning more closely with human estimation capabilities. We demonstrate its application in a marine assistance system that alerts operators to nearby objects such as boats, buoys, or other waterborne hazards.

[CV-52] Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding

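[Quick read]: This paper addresses dynamic scene understanding for Automated Vehicles (AVs), with the aims of improving driver safety, generating human-centric explanations of decisions, and using Artificial Intelligence (AI) for retrospective analysis of driving videos. The key to the solution is a dynamic scene retrieval system built on Contrastive Language-Image Pretraining (CLIP) models. The system learns visual concepts from natural language supervision and was evaluated with frame-level analysis on the Honda Scenes Dataset, where it proved robust in complex scenarios. The study also shows that fine-tuning CLIP models such as ViT-L/14 and ViT-B/32 significantly improves scene classification, reaching a top F1 score of 91.1%. The system recognizes scenes quickly and precisely, which meets key requirements of Advanced Driver Assistance Systems (ADAS) and lays groundwork for further autonomous driving technology.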

Link: https://arxiv.org/abs/2501.05566
Authors: Mohammed Elhenawy, Huthaifa I. Ashqar, Andry Rakotonirainy, Taqwa I. Alhadidi, Ahmed Jaber, Mohammad Abu Tami
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Scene understanding is essential for enhancing driver safety, generating human-centric explanations for Automated Vehicle (AV) decisions, and leveraging Artificial Intelligence (AI) for retrospective driving video analysis. This study developed a dynamic scene retrieval system using Contrastive Language-Image Pretraining (CLIP) models, which can be optimized for real-time deployment on edge devices. The proposed system outperforms state-of-the-art in-context learning methods, including the zero-shot capabilities of GPT-4o, particularly in complex scenarios. By conducting frame-level analysis on the Honda Scenes Dataset, which contains a collection of about 80 hours of annotated driving videos capturing diverse real-world road and weather conditions, our study highlights the robustness of CLIP models in learning visual concepts from natural language supervision. Results also showed that fine-tuning the CLIP models, such as ViT-L/14 and ViT-B/32, significantly improved scene classification, achieving a top F1 score of 91.1%. These results demonstrate the ability of the system to deliver rapid and precise scene recognition, which can be used to meet the critical requirements of Advanced Driver Assistance Systems (ADAS). This study shows the potential of CLIP models to provide scalable and efficient frameworks for dynamic scene understanding and classification. Furthermore, this work lays the groundwork for advanced autonomous vehicle technologies by fostering a deeper understanding of driver behavior, road conditions, and safety-critical scenarios, marking a significant step toward smarter, safer, and more context-aware autonomous driving systems.
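
CLIP-style scene retrieval typically scores a frame against a set of text labels by cosine similarity between their embeddings. The sketch below shows only that scoring step. The three-dimensional vectors and the scene labels are made-up stand-ins for real CLIP encoder outputs, and `classify_frame` is a hypothetical helper, not the authors' system.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_frame(frame_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the frame embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine(frame_embedding, label_embeddings[label]),
    )


# Toy embeddings (assumed values, for illustration only).
labels = {
    "highway": [0.9, 0.1, 0.0],
    "intersection": [0.1, 0.9, 0.1],
    "tunnel": [0.0, 0.2, 0.9],
}
frame = [0.2, 0.8, 0.1]
print(classify_frame(frame, labels))  # prints "intersection"
```

Fine-tuning, as described in the abstract, would change the encoders that produce these embeddings while leaving this scoring step unchanged.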

[CV-53] Improving Zero-Shot Object-Level Change Detection by Incorporating Visual Correspondence

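[Quick read]: This paper addresses three main problems in image change detection: (1) no evaluation on image pairs that contain no changes, which leaves false positive rates unreported; (2) no correspondences between regions before and after a change (that is, no localization of the before and after regions); and (3) poor zero-shot generalization across domains. The proposed method uses change correspondences during training to improve change detection accuracy and at test time to minimize false positives. Specifically, it supervises change detectors with labels of where an object is added or removed, which improves accuracy by a large margin over previous work. It is also the first to predict correspondences between pairs of detected changes using an estimated homography and the Hungarian algorithm. The method achieves state-of-the-art results in both change detection and change correspondence accuracy on in-distribution and zero-shot benchmarks.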

Link: https://arxiv.org/abs/2501.05555
Authors: Hung Huy Nguyen, Pooyan Rahmanzadehgervi, Long Mai, Anh Totti Nguyen
Affiliations: Auburn University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Detecting object-level changes between two images across possibly different views is a core task in many applications that involve visual inspection or camera surveillance. Existing change-detection approaches suffer from three major limitations: (1) lack of evaluation on image pairs that contain no changes, leading to unreported false positive rates; (2) lack of correspondences (i.e., localizing the regions before and after a change); and (3) poor zero-shot generalization across different domains. To address these issues, we introduce a novel method that leverages change correspondences (a) during training to improve change detection accuracy, and (b) at test time, to minimize false positives. That is, we harness the supervision labels of where an object is added or removed to supervise change detectors, improving their accuracy over previous work by a large margin. Our work is also the first to predict correspondences between pairs of detected changes using estimated homography and the Hungarian algorithm. Our model demonstrates superior performance over existing methods, achieving state-of-the-art results in change detection and change correspondence accuracy across both in-distribution and zero-shot benchmarks.
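
The correspondence step can be pictured as warping change locations from one image into the other with a homography and then solving a one-to-one matching problem. The sketch below is an illustration under stated assumptions, not the authors' pipeline: it matches box centers by Euclidean distance and uses exhaustive search over permutations as a stand-in for the Hungarian algorithm, which finds the same minimum-cost assignment far more efficiently. The names `warp_point` and `match_changes` are hypothetical.

```python
import math
from itertools import permutations


def warp_point(H, point):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def match_changes(centers_a, centers_b, H):
    """Pair each center in image A with a distinct center in image B.

    Returns (i, j) index pairs minimising the total distance after warping.
    Assumes len(centers_a) <= len(centers_b); only practical for a few changes.
    """
    warped = [warp_point(H, c) for c in centers_a]
    best_cost, best_pairs = math.inf, []
    for perm in permutations(range(len(centers_b)), len(warped)):
        cost = sum(math.dist(warped[i], centers_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_pairs = cost, list(enumerate(perm))
    return best_pairs
```

For more than a handful of detected changes, a polynomial-time assignment solver would replace the permutation loop.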

[CV-54] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

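[Quick read]: This paper addresses the weak Temporal Awareness of existing video large language models (Video LLMs), in particular their dynamic reasoning in online video understanding. Temporal awareness is the ability to reason and respond dynamically based on the timestamp at which a question is raised, and it is the key distinction between online and offline video models. Offline models rely on the complete video for static, post hoc analysis, whereas online models process video streams incrementally and adapt their responses to the timestamp at which the question is posed. Existing benchmarks do not evaluate this ability adequately. The authors therefore present OVO-Bench (Online-VideO-Benchmark), a video benchmark that emphasizes timestamps when evaluating advanced online video understanding. OVO-Bench evaluates how well video LLMs reason about and respond to events at specific timestamps under three scenarios: backward tracing, real-time understanding, and forward active responding. It comprises 12 tasks, 644 unique videos, and approximately 2,800 human-curated fine-grained meta-annotations, built by combining automated generation pipelines with human curation, together with a systematic evaluation pipeline. Evaluations of nine Video-LLMs show that, despite progress on traditional benchmarks, current models still fall well short in online video understanding. OVO-Bench is intended to drive progress in video LLMs and inspire future research on online video reasoning.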

Link: https://arxiv.org/abs/2501.05510
Authors: Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang
Affiliations: Shanghai Artificial Intelligence Laboratory; Tsinghua University; Beihang University; Communication University of China; The Chinese University of Hong Kong; SenseTime Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages

Abstract:Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at this https URL.
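
The online versus offline distinction comes down to which frames a model may see when a question arrives. The sketch below illustrates that constraint and the "forward active responding" idea of waiting for evidence. The function names and the representation of a video as a list of frame timestamps are assumptions for illustration, not OVO-Bench's evaluation code.

```python
def frames_available(frame_times, query_time, online=True):
    """Return the frame timestamps a model may use to answer a question.

    An offline model sees the whole video; an online model sees only the
    frames that have streamed in by the time the question is asked.
    """
    if not online:
        return list(frame_times)
    return [t for t in frame_times if t <= query_time]


def can_answer_now(evidence_time, current_time):
    """Forward active responding: answer only once the needed evidence has arrived."""
    return current_time >= evidence_time
```

Backward tracing and real-time understanding both operate on the output of `frames_available`, while forward active responding keeps polling `can_answer_now` as the stream advances.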
zh

[CV-55] Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion

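[Quick read]: This paper addresses the spatiotemporal inconsistency and high computational cost that arise when generating high-fidelity, coherent long videos. Existing video diffusion models show promise but still face these challenges on long videos. The proposed solution, GLC-Diffusion, models the long-video denoising process through Global-Local Collaborative Denoising to ensure overall content consistency and temporal coherence between frames. It also introduces a Noise Reinitialization strategy that combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. In addition, a Video Motion Consistency Refinement (VMCR) module computes the gradients of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Experiments show that the method integrates effectively with existing video diffusion models and produces coherent, high-fidelity long videos that outperform previous approaches.

Link: https://arxiv.org/abs/2501.05484
Authors: Yongjia Ma,Junlin Chen,Donglin Di,Qi Xie,Lei Fan,Wei Chen,Xiaofei Gou,Na Zhao,Xun Yang
Affiliations: Space AI, Li Auto; Zhejiang University; University of Science and Technology of China; University of New South Wales; Singapore University of Technology and Design
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: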

Abstract:Creating high-fidelity, coherent long videos is a sought-after aspiration. While recent video diffusion models have shown promising potential, they still grapple with spatiotemporal inconsistencies and high computational resource demands. We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising to ensure overall content consistency and temporal coherence between frames. Additionally, we introduce a Noise Reinitialization strategy which combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. Further, we propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Extensive experiments, including quantitative and qualitative evaluations on videos of varying lengths (e.g., 3× and 6× longer), demonstrate that our method effectively integrates with existing video diffusion models, producing coherent, high-fidelity long videos superior to previous approaches.

[CV-56] Implicit Guidance and Explicit Representation of Semantic Information in Points Cloud: A Survey

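[Quick read]: This paper explores how combining semantic information from two-dimensional scenes with three-dimensional point clouds can improve the precision and efficiency of various tasks. Point clouds are an important 3D representation widely used in autonomous driving, surveying, electricity, architecture, and gaming. The core of the paper is the dual role of semantic information, implicit guidance and explicit representation, in strengthening point cloud applications across traditional and emerging tasks. Through an analysis of existing datasets and a review of recent advances in related fields, the paper also identifies challenges and potential issues that may arise when fully exploiting semantic information in point clouds and offers its perspectives on them.

Link: https://arxiv.org/abs/2501.05473
Authors: Jingyuan Tang,Yuhuan Zhao,Songlin Sun,Yangang Cai
Affiliations: School of Information and Communication Engineering, Beijing University of Posts and Telecommunications (BUPT); Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: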

Abstract:Point clouds, a prominent method of 3D representation, are extensively utilized across industries such as autonomous driving, surveying, electricity, architecture, and gaming, and have been rigorously investigated for their accuracy and resilience. The extraction of semantic information from scenes enhances both human understanding and machine perception. By integrating semantic information from two-dimensional scenes with three-dimensional point clouds, researchers aim to improve the precision and efficiency of various tasks. This paper provides a comprehensive review of the diverse applications and recent advancements in the integration of semantic information within point clouds. We explore the dual roles of semantic information in point clouds, encompassing both implicit guidance and explicit representation, across traditional and emerging tasks. Additionally, we offer a comparative analysis of publicly available datasets tailored to specific tasks and present notable observations. In conclusion, we discuss several challenges and potential issues that may arise in the future when fully utilizing semantic information in point clouds, providing our perspectives on these obstacles. We classify and organize articles related to semantic-based point cloud tasks and continuously follow up on relevant achievements in different fields; the collection can be accessed through this https URL.

[CV-57] The 2nd Place Solution from the 3D Semantic Segmentation Track in the 2024 Waymo Open Dataset Challenge

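[Quick read]: This paper addresses the long-tailed distribution and limited diversity of training data in LiDAR-based 3D semantic segmentation. These problems reduce the accuracy with which learning-based models perceive dense 3D surroundings for autonomous vehicles, which in turn may affect safe operation. The proposed solution, MixSeg3D, combines a strong point cloud segmentation model with advanced 3D data mixing strategies. Specifically, MixSeg3D integrates the MinkUNet family with LaserMix and PolarMix, two scene-scale data augmentation methods that blend LiDAR point clouds along the ego-scene's inclination and azimuth directions, respectively. Experiments show that MixSeg3D clearly outperforms the baseline and prior work, and it achieved 2nd place in the 3D semantic segmentation track of the 2024 Waymo Open Dataset Challenge.

Link: https://arxiv.org/abs/2501.05472
Authors: Qing Wu
Affiliations: Marvell Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Technical Report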

Abstract:3D semantic segmentation is one of the most crucial tasks in driving perception. The ability of a learning-based model to accurately perceive dense 3D surroundings often ensures the safe operation of autonomous vehicles. However, existing LiDAR-based 3D semantic segmentation databases consist of sequentially acquired LiDAR scans that are long-tailed and lack training diversity. In this report, we introduce MixSeg3D, a sophisticated combination of the strong point cloud segmentation model with advanced 3D data mixing strategies. Specifically, our approach integrates the MinkUNet family with LaserMix and PolarMix, two scene-scale data augmentation methods that blend LiDAR point clouds along the ego-scene’s inclination and azimuth directions. Through empirical experiments, we demonstrate the superiority of MixSeg3D over the baseline and prior arts. Our team achieved 2nd place in the 3D semantic segmentation track of the 2024 Waymo Open Dataset Challenge.
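
The abstract describes LaserMix and PolarMix as blending LiDAR scans along the inclination and azimuth directions. As a rough illustration only, the sketch below swaps all points whose angle falls in a given range between two scans. It is a simplified stand-in, not the MixSeg3D, LaserMix, or PolarMix implementation; the function names and the `(x, y, z, label)` tuple layout are assumptions made for this example.

```python
import math

def azimuth(point):
    """Horizontal angle of an (x, y, z, label) point, in radians."""
    x, y = point[0], point[1]
    return math.atan2(y, x)

def inclination(point):
    """Vertical angle of an (x, y, z, label) point above the sensor plane, in radians."""
    x, y, z = point[0], point[1], point[2]
    return math.atan2(z, math.hypot(x, y))

def mix_by_angle(scan_a, scan_b, angle_fn, start, end):
    """Keep scan_a outside [start, end) and paste scan_b inside it.

    Hypothetical helper: angle_fn selects the mixing direction
    (azimuth or inclination); labels travel with their points.
    """
    kept = [p for p in scan_a if not (start <= angle_fn(p) < end)]
    pasted = [p for p in scan_b if start <= angle_fn(p) < end]
    return kept + pasted

# Toy scans: each point is (x, y, z, label).
scan_a = [(1.0, 0.0, 0.0, "road"), (0.0, 1.0, 0.0, "car"), (-1.0, 0.0, 0.0, "road")]
scan_b = [(1.0, 0.1, 0.0, "pedestrian"), (0.0, 2.0, 0.0, "truck")]

# Swap the azimuth sector from 45 to 135 degrees.
mixed = mix_by_angle(scan_a, scan_b, azimuth, math.pi / 4, 3 * math.pi / 4)
```

Passing `inclination` instead of `azimuth` mixes the scans by vertical angle, which is the direction the abstract associates with the other augmentation.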

[CV-58] Found in Translation: semantic approaches for enhancing AI interpretability in face verification

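[Quick read]: This paper addresses the lack of interpretability and transparency caused by the growing complexity of machine learning models in computer vision, particularly in face verification. It proposes a new approach that combines global and local explanations, integrating semantic concepts derived from human cognitive processes into explainable AI (XAI) frameworks to bridge the gap between model outputs and human understanding. The key components are similarity maps generated from semantic features defined by user-selected facial landmarks, and textual explanations generated by large language models (LLMs). The approach is validated through quantitative experiments and user feedback. The results indicate that semantic explanations give a more nuanced understanding of model decisions than traditional pixel-based heatmaps, and that users prefer them. The work offers a new direction for building XAI frameworks aligned with human cognitive processes, which may help build trust in and acceptance of AI models in critical applications.

Link: https://arxiv.org/abs/2501.05471
Authors: Miriam Doh(UMONS, ULB),Caroline Mazini Rodrigues(LRDE, LIGM),N. Boutry(LRDE),L. Najman(LIGM),Matei Mancas(UMONS),Bernard Gosselin(UMONS)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: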

Abstract:The increasing complexity of machine learning models in computer vision, particularly in face verification, requires the development of explainable artificial intelligence (XAI) to enhance interpretability and transparency. This study extends previous work by integrating semantic concepts derived from human cognitive processes into XAI frameworks to bridge the comprehension gap between model outputs and human understanding. We propose a novel approach combining global and local explanations, using semantic features defined by user-selected facial landmarks to generate similarity maps and textual explanations via large language models (LLMs). The methodology was validated through quantitative experiments and user feedback, demonstrating improved interpretability. Results indicate that our semantic-based approach, particularly the most detailed set, offers a more nuanced understanding of model decisions than traditional methods. User studies highlight a preference for our semantic explanations over traditional pixel-based heatmaps, emphasizing the benefits of human-centric interpretability in AI. This work contributes to the ongoing efforts to create XAI frameworks that align AI models' behaviour with human cognitive processes, fostering trust and acceptance in critical applications.

[CV-59] Beyond Questionnaires: Video Analysis for Social Anxiety Detection

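[Quick read]: This paper tackles early detection of Social Anxiety Disorder (SAD). Conventional SAD detection relies on in-person consultations and self-reported questionnaires, which are time-consuming and subject to bias. The paper proposes a video-analysis approach that extracts behavioral features of the head, body, eye gaze, and action units from video data and classifies participants with machine learning and deep learning algorithms, reaching an accuracy of up to 74%. The key to the solution is that video data is non-intrusive and scalable, so SAD detection can be deployed in real time, potentially improving early detection and intervention.

Link: https://arxiv.org/abs/2501.05461
Authors: Nilesh Kumar Sahu,Nandigramam Sai Harshit,Rishabh Uikey,Haroon R. Lone
Affiliations: Indian Institute of Science Education and Research, Bhopal
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: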

Abstract:Social Anxiety Disorder (SAD) significantly impacts individuals’ daily lives and relationships. The conventional methods for SAD detection involve physical consultations and self-reported questionnaires, but they have limitations such as time consumption and bias. This paper introduces video analysis as a promising method for early SAD detection. Specifically, we present a new approach for detecting SAD in individuals from various bodily features extracted from the video data. We conducted a study to collect video data of 92 participants performing impromptu speech in a controlled environment. Using the video data, we studied the behavioral change in participants’ head, body, eye gaze, and action units. By applying a range of machine learning and deep learning algorithms, we achieved an accuracy rate of up to 74% in classifying participants as SAD or non-SAD. Video-based SAD detection offers a non-intrusive and scalable approach that can be deployed in real-time, potentially enhancing early detection and intervention capabilities.

[CV-60] Efficiently serving large multimedia models using EPD Disaggregation

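[Quick read]: This paper addresses a problem in serving Large Multimodal Models (LMMs) on multimodal inputs such as images, audio, and video: the multimodal encoding stage adds computational and memory overhead, which hurts key Service Level Objectives (SLOs) such as time to first token (TTFT) and end-to-end throughput. The proposed solution is Encode-Prefill-Decode (EPD) Disaggregation, a framework that separates the encoding, prefill, and decode stages onto dedicated resources, which alleviates memory bottlenecks, reduces synchronization delays, and supports flexible batching. Key innovations include a new caching mechanism for multimodal tokens that enables their asynchronous transfer, and an integrated module that optimizes the configuration of the EPD system to minimize resource usage while maximizing SLO-based performance metrics. Experiments show substantial gains in memory efficiency, batch size, number of images per request, and end-to-end throughput.

Link: https://arxiv.org/abs/2501.05460
Authors: Gursimran Singh,Xinglu Wang,Ivan Hu,Timothy Yu,Linzi Xing,Wei Jiang,Zhefeng Wang,Xiaolong Bai,Yi Li,Ying Xiong,Yong Zhang,Zhenan Fan
Affiliations: Huawei Technologies Canada; Simon Fraser University; Huawei Cloud
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures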

Abstract:Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step helps convert raw inputs into tokenized representations that inflate the token sequence for the prefill phase, negatively impacting key Service Level Objectives (SLOs) like time to first token (TTFT) and end-to-end throughput. We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our disaggregation approach alleviates memory bottlenecks, mitigates synchronization delays, and supports flexible batching. Specifically, we employ a new caching mechanism for multimodal tokens, enabling asynchronous transfer of multimodal tokens, and introduce an integrated module that finds the optimal configuration for the EPD system, minimizing resource usage while maximizing SLO-based performance metrics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15× lower for encoding-stage GPUs), which supports up to 22× higher batch sizes, 10× more images per request, and 2.2× higher KV cache size. Further, it leads to significant improvements in end-to-end throughput (up to 57% better) and latency metrics (TTFT up to 71% lower), compared to systems that do not disaggregate. Our findings underscore the potential of EPD disaggregation to enable resource-efficient and high-performance multimodal inference at scale.

[CV-61] FOCUS: Towards Universal Foreground Segmentation

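[Quick read]: This paper addresses two main problems in foreground segmentation in computer vision: existing work typically designs a task-specific architecture for each task and therefore lacks unification, and existing methods focus on recognizing foreground objects without effectively distinguishing them from the background. To address these problems, the paper proposes FOCUS (Foreground ObjeCts Universal Segmentation), a framework that can handle multiple foreground segmentation tasks. The key elements are (1) a multi-scale semantic network that uses object edge information to enhance image features, and (2) a novel distillation method that incorporates a contrastive learning strategy to refine the predicted mask in a multi-modal feature space, achieving boundary-aware segmentation. Experiments show that FOCUS outperforms existing task-specific models across multiple tasks and datasets.

Link: https://arxiv.org/abs/2501.05238
Authors: Zuyao You,Lingyu Kong,Lingchen Meng,Zuxuan Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: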

Abstract:Foreground segmentation is a fundamental task in computer vision, encompassing various subdivision tasks. Previous research has typically designed task-specific architectures for each task, leading to a lack of unification. Moreover, they primarily focus on recognizing foreground objects without effectively distinguishing them from the background. In this paper, we emphasize the importance of the background and its relationship with the foreground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation framework that can handle multiple foreground tasks. We develop a multi-scale semantic network using the edge information of objects to enhance image features. To achieve boundary-aware segmentation, we propose a novel distillation method, integrating the contrastive learning strategy to refine the prediction mask in multi-modal feature space. We conduct extensive experiments on a total of 13 datasets across 5 tasks, and the results demonstrate that FOCUS consistently outperforms the state-of-the-art task-specific models on most metrics.

[CV-62] PySpatial: A High-Speed Whole Slide Image Pathomics Toolkit

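[Quick read]: This paper addresses the inefficiency of traditional feature extraction pipelines in Whole Slide Image (WSI) analysis. Traditional methods usually split a WSI into patches, extract features at the patch level, and then map the results back to the original WSI, a process that is slow and redundant. To address this, the paper presents PySpatial, a high-speed pathomics toolkit designed for WSI-level analysis. Its key innovation is operating directly on computational regions of interest, which reduces redundant processing steps. Using rtree-based spatial indexing and matrix-based computation, PySpatial maps and processes computational regions efficiently, significantly accelerating feature extraction while maintaining high accuracy. In the experiments, PySpatial achieves a nearly 10-fold speedup on the small, sparse objects of the PEC dataset and a 2-fold speedup on the larger objects of the KPMP dataset. These results show PySpatial's efficiency and accuracy in large-scale WSI analysis and lay a foundation for broader applications in digital pathology.

Link: https://arxiv.org/abs/2501.06151
Authors: Yuechen Yang,Yu Wang,Tianyuan Yao,Ruining Deng,Mengmeng Yin,Shilin Zhao,Haichun Yang,Yuankai Huo
Affiliations: Vanderbilt University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: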

Abstract:Whole Slide Image (WSI) analysis plays a crucial role in modern digital pathology, enabling large-scale feature extraction from tissue samples. However, traditional feature extraction pipelines based on tools like CellProfiler often involve lengthy workflows, requiring WSI segmentation into patches, feature extraction at the patch level, and subsequent mapping back to the original WSI. To address these challenges, we present PySpatial, a high-speed pathomics toolkit specifically designed for WSI-level analysis. PySpatial streamlines the conventional pipeline by directly operating on computational regions of interest, reducing redundant processing steps. Utilizing rtree-based spatial indexing and matrix-based computation, PySpatial efficiently maps and processes computational regions, significantly accelerating feature extraction while maintaining high accuracy. Our experiments on two datasets, Perivascular Epithelioid Cell (PEC) and data from the Kidney Precision Medicine Project (KPMP), demonstrate substantial performance improvements. For smaller and sparse objects in PEC datasets, PySpatial achieves nearly a 10-fold speedup compared to standard CellProfiler pipelines. For larger objects, such as glomeruli and arteries in KPMP datasets, PySpatial achieves a 2-fold speedup. These results highlight PySpatial’s potential to handle large-scale WSI analysis with enhanced efficiency and accuracy, paving the way for broader applications in digital pathology.
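
The abstract credits part of the speedup to spatial indexing, which avoids scanning every object when looking up what falls inside a region. The sketch below illustrates that idea with a uniform grid index written in plain Python. This is a deliberately simpler technique swapped in for the R-tree the abstract mentions, and it is not PySpatial's code; the class name, the `(min_x, min_y, max_x, max_y)` box layout, and the cell size are assumptions made for this example.

```python
from collections import defaultdict

def boxes_intersect(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class GridIndex:
    """Hypothetical uniform-grid index over axis-aligned bounding boxes."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)  # (cx, cy) -> ids of boxes touching that cell
        self.boxes = {}                # id -> box

    def _cells_for(self, box):
        # Enumerate every grid cell the box overlaps.
        min_x, min_y, max_x, max_y = box
        cs = self.cell_size
        for cx in range(int(min_x // cs), int(max_x // cs) + 1):
            for cy in range(int(min_y // cs), int(max_y // cs) + 1):
                yield (cx, cy)

    def insert(self, obj_id, box):
        self.boxes[obj_id] = box
        for cell in self._cells_for(box):
            self.cells[cell].add(obj_id)

    def query(self, box):
        # Gather candidates from the touched cells, then filter exactly.
        candidates = set()
        for cell in self._cells_for(box):
            candidates |= self.cells.get(cell, set())
        return sorted(i for i in candidates if boxes_intersect(self.boxes[i], box))

index = GridIndex(cell_size=16)
index.insert(0, (0, 0, 10, 10))
index.insert(1, (100, 100, 120, 130))
index.insert(2, (8, 8, 20, 20))
hits = index.query((5, 5, 9, 9))
```

A query only inspects the cells its box overlaps, so distant objects such as object `1` are never compared against the query region.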

[CV-63] AI-powered virtual tissues from spatial proteomics for clinical diagnostics and biomedical discovery

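[Quick read]: This paper addresses the computational analysis challenges that spatial proteomics faces when analyzing complex tissue architectures: high-dimensional data, marker combinations that vary across experiments, and heterogeneous study designs. To address them, the paper proposes Virtual Tissues (VirTues), a foundation model framework that operates across the molecular, cellular, and tissue scales. Its key innovations lie in its Transformer-based architecture design, including a novel tokenization scheme that captures both the spatial and marker dimensions, and attention mechanisms that scale to high-dimensional multiplex data while remaining interpretable. Trained on diverse cancer and non-cancer tissue datasets, VirTues generalizes strongly without task-specific fine-tuning, enabling cross-study analysis and integration of novel markers. As a generalist model, VirTues outperforms existing approaches on clinical diagnostics, biological discovery, and patient case retrieval tasks, and provides insights into tissue function and disease mechanisms.

Link: https://arxiv.org/abs/2501.06039
Authors: Johann Wenckstern,Eeshaan Jain,Kiril Vasilev,Matteo Pariset,Andreas Wicki,Gabriele Gut,Charlotte Bunne
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 5 figures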

Abstract:Spatial proteomics technologies have transformed our understanding of complex tissue architectures by enabling simultaneous analysis of multiple molecular markers and their spatial organization. The high dimensionality of these data, varying marker combinations across experiments and heterogeneous study designs pose unique challenges for computational analysis. Here, we present Virtual Tissues (VirTues), a foundation model framework for biological tissues that operates across the molecular, cellular and tissue scale. VirTues introduces innovations in transformer architecture design, including a novel tokenization scheme that captures both spatial and marker dimensions, and attention mechanisms that scale to high-dimensional multiplex data while maintaining interpretability. Trained on diverse cancer and non-cancer tissue datasets, VirTues demonstrates strong generalization capabilities without task-specific fine-tuning, enabling cross-study analysis and novel marker integration. As a generalist model, VirTues outperforms existing approaches across clinical diagnostics, biological discovery and patient case retrieval tasks, while providing insights into tissue function and disease mechanisms.

[CV-64] An Attention-Guided Deep Learning Approach for Classifying 39 Skin Lesion Types

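[Quick read]: This paper addresses the challenge of diagnosing skin lesions, where visual differences between lesions are subtle and hard to see with the naked eye, which makes diagnosis difficult for physicians. Although not all skin lesions are life-threatening, some types can be early signs of severe diseases such as skin cancer, so timely and accurate diagnostic methods are needed. The paper proposes a deep-learning-based solution: it builds a comprehensive dataset covering 39 categories of skin lesions and evaluates five state-of-the-art deep learning models (MobileNetV2, Xception, InceptionV3, EfficientNetB1, and Vision Transformer). To improve accuracy and robustness, it also incorporates attention mechanisms such as Efficient Channel Attention (ECA) and the Convolutional Block Attention Module (CBAM). The Vision Transformer combined with CBAM performs best, with an accuracy of 93.46% and strong precision, recall, F1-score, and specificity. The key to the solution is combining deep learning models with attention mechanisms, which substantially improves the accuracy and efficiency of skin lesion diagnosis and gives medical professionals a strong assistive tool.

Link: https://arxiv.org/abs/2501.05991
Authors: Sauda Adiv Hanum,Ashim Dey,Muhammad Ashad Kabir
Affiliations: Chittagong University of Engineering and Technology; Charles Sturt University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 26 pages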

Abstract:The skin, as the largest organ of the human body, is vulnerable to a diverse array of conditions collectively known as skin lesions, which encompass various dermatoses. Diagnosing these lesions presents significant challenges for medical practitioners due to the subtle visual differences that are often imperceptible to the naked eye. While not all skin lesions are life-threatening, certain types can act as early indicators of severe diseases, including skin cancers, underscoring the critical need for timely and accurate diagnostic methods. Deep learning algorithms have demonstrated remarkable potential in facilitating the early detection and prognosis of skin lesions. This study advances the field by curating a comprehensive and diverse dataset comprising 39 categories of skin lesions, synthesized from five publicly available datasets. Using this dataset, the performance of five state-of-the-art deep learning models (MobileNetV2, Xception, InceptionV3, EfficientNetB1, and Vision Transformer) is rigorously evaluated. To enhance the accuracy and robustness of these models, attention mechanisms such as the Efficient Channel Attention (ECA) and the Convolutional Block Attention Module (CBAM) are incorporated into their architectures. Comprehensive evaluation across multiple performance metrics reveals that the Vision Transformer model integrated with CBAM outperforms others, achieving an accuracy of 93.46%, precision of 94%, recall of 93%, F1-score of 93%, and specificity of 93.67%. These results underscore the significant potential of the proposed system in supporting medical professionals with accurate and efficient prognostic tools for diagnosing a broad spectrum of skin lesions. The dataset and code used in this study can be found at this https URL.

[CV-65] Reusable specimen-level inference in computational pathology

[Quick read]: This paper addresses the fact that, in computational pathology, specimen-level models built on foundation models are hard to obtain, which limits their broader use and impact. To address this, the authors developed the SpinPath toolkit, which provides a collection of pretrained specimen-level models, a Python-based inference engine, and a JavaScript-based inference platform. With this toolkit, SpinPath aims to foster reproducibility, simplify experimentation, and accelerate the adoption of specimen-level deep learning in computational pathology research.

Link: https://arxiv.org/abs/2501.05945
Authors: Jakub R. Kaczmarzyk,Rishul Sharma,Peter K. Koo,Joel H. Saltz
Affiliations: Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA; Medical Scientist Training Program, Stony Brook University, Stony Brook, NY, USA; Jericho High School, Jericho, NY, USA
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
Comments:

Abstract:Foundation models for computational pathology have shown great promise for specimen-level tasks and are increasingly accessible to researchers. However, specimen-level models built on these foundation models remain largely unavailable, hindering their broader utility and impact. To address this gap, we developed SpinPath, a toolkit designed to democratize specimen-level deep learning by providing a zoo of pretrained specimen-level models, a Python-based inference engine, and a JavaScript-based inference platform. We demonstrate the utility of SpinPath in metastasis detection tasks across nine foundation models. SpinPath may foster reproducibility, simplify experimentation, and accelerate the adoption of specimen-level deep learning in computational pathology research.

[CV-66] AI-Driven Diabetic Retinopathy Screening: Multicentric Validation of AIDRSS in India

[Quick read]: This paper addresses diabetic retinopathy (DR) screening in resource-limited regions, particularly rural India, where the scarcity of retina specialists makes early detection and diagnosis difficult. The proposed solution is the Artificial Intelligence-based Diabetic Retinopathy Screening System (AIDRSS), which uses a deep learning algorithm together with Contrast Limited Adaptive Histogram Equalization (CLAHE) preprocessing to improve retinal image quality and diagnostic accuracy. A multicentric cross-sectional study validated its performance: AIDRSS detects DR with high sensitivity (92%) and specificity (88%), and reaches 100% sensitivity for referable DR (DR3 and DR4). The key strengths of the system are its scalability and automation, which allow reliable early DR screening in resource-limited settings and could reduce the burden of diabetes-related vision loss.

Link: https://arxiv.org/abs/2501.05826
Authors: Amit Kr Dey,Pradeep Walia,Girish Somvanshi,Abrar Ali,Sagarnil Das,Pallabi Paul,Minakhi Ghosh
Affiliations: Health Plus, Action Area I, Newtown, Kolkata West Bengal, 700156, India; Artificial Learning Systems India Pvt Ltd, R&D, 1665/A, 14th Main Rd, Sector 7, HSR Layout, Bengaluru, Karnataka 560102, India
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 5 figures. arXiv admin note: substantial text overlap with arXiv:1812.07105 by other authors without attribution

Abstract:Purpose: Diabetic retinopathy (DR) is a major cause of vision loss, particularly in India, where access to retina specialists is limited in rural areas. This study aims to evaluate the Artificial Intelligence-based Diabetic Retinopathy Screening System (AIDRSS) for DR detection and prevalence assessment, addressing the growing need for scalable, automated screening solutions in resource-limited settings. Approach: A multicentric, cross-sectional study was conducted in Kolkata, India, involving 5,029 participants and 10,058 macula-centric retinal fundus images. The AIDRSS employed a deep learning algorithm with 50 million trainable parameters, integrated with Contrast Limited Adaptive Histogram Equalization (CLAHE) preprocessing for enhanced image quality. DR was graded using the International Clinical Diabetic Retinopathy (ICDR) Scale, categorizing disease into five stages (DR0 to DR4). Statistical metrics including sensitivity, specificity, and prevalence rates were evaluated against expert retina specialist assessments. Results: The prevalence of DR in the general population was 13.7%, rising to 38.2% among individuals with elevated random blood glucose levels. The AIDRSS achieved an overall sensitivity of 92%, specificity of 88%, and 100% sensitivity for detecting referable DR (DR3 and DR4). These results demonstrate the system’s robust performance in accurately identifying and grading DR in a diverse population. Conclusions: AIDRSS provides a reliable, scalable solution for early DR detection in resource-constrained environments. Its integration of advanced AI techniques ensures high diagnostic accuracy, with potential to significantly reduce the burden of diabetes-related vision loss in underserved regions.
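
The abstract reports sensitivity and specificity against expert grading and treats DR3 and DR4 as referable. The sketch below shows how those two metrics are computed from paired grades. It is a minimal illustration, not the study's evaluation code; the function names are assumptions, and the grade lists are made-up toy values, not data from the paper.

```python
def is_referable(grade):
    """Referable DR is DR3 or DR4, following the abstract's grouping."""
    return grade >= 3

def confusion_counts(predicted, reference):
    """Count (tp, fp, tn, fn) over paired boolean labels."""
    tp = fp = tn = fn = 0
    for pred, ref in zip(predicted, reference):
        if pred and ref:
            tp += 1
        elif pred and not ref:
            fp += 1
        elif not pred and not ref:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Fraction of truly positive cases that the system flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of truly negative cases that the system clears."""
    return tn / (tn + fp)

# Hypothetical grades (0..4 for DR0..DR4); NOT data from the study.
model_grades = [0, 3, 4, 1, 2, 3]
expert_grades = [0, 4, 3, 0, 3, 1]

predicted = [is_referable(g) for g in model_grades]
reference = [is_referable(g) for g in expert_grades]
tp, fp, tn, fn = confusion_counts(predicted, reference)

print(f"sensitivity={sensitivity(tp, fn):.2f} specificity={specificity(tn, fp):.2f}")
```

Binarizing the five-stage grade before counting is what makes "sensitivity for referable DR" a separate number from the overall sensitivity.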

[CV-67] Bit-depth color recovery via off-the-shelf super-resolution models

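[Quick read]: This paper addresses the performance limitations of existing deep neural networks when recovering high bit-depth image representations. Existing methods often rely on scale-invariant image information, which leads to poor results in some scenarios. The paper proposes a novel solution that integrates a super-resolution architecture to extract detailed a priori information from images. The key is to use the interpolated data generated during the super-resolution process to recover fine-grained color details at the pixel level. The paper also shows that spatial features learned through the super-resolution process contribute significantly to recovering detailed color depth information. Experiments show that the method outperforms existing state-of-the-art methods on benchmark datasets, demonstrating the potential of super-resolution for high-fidelity color restoration.

Link: https://arxiv.org/abs/2501.05611
Authors: Xuanshuo Fu,Danna Xue,Javier Vazquez-Corral
Affiliations: Computer Vision Center & Universitat Autònoma de Barcelona
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: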

Abstract:Advancements in imaging technology have enabled hardware to support 10 to 16 bits per channel, facilitating precise manipulation in applications like image editing and video processing. While deep neural networks promise to recover high bit-depth representations, existing methods often rely on scale-invariant image information, limiting performance in certain scenarios. In this paper, we introduce a novel approach that integrates a super-resolution architecture to extract detailed a priori information from images. By leveraging interpolated data generated during the super-resolution process, our method achieves pixel-level recovery of fine-grained color details. Additionally, we demonstrate that spatial features learned through the super-resolution process significantly contribute to the recovery of detailed color depth information. Experiments on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods, highlighting the potential of super-resolution for high-fidelity color restoration.
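
To make the bit-depth recovery problem concrete, the sketch below quantizes a channel value to fewer bits and then expands it back with two classical, non-learned rules: zero padding and bit replication. These are simple baselines shown for contrast only, not the super-resolution-based method the abstract proposes; the function names and the 8-bit/4-bit setting are assumptions chosen for illustration.

```python
def quantize(value, high_bits, low_bits):
    """Drop the least significant bits, e.g. 8-bit -> 4-bit."""
    return value >> (high_bits - low_bits)

def zero_pad(value, low_bits, high_bits):
    """Expand by appending zero bits; the lost detail stays lost."""
    return value << (high_bits - low_bits)

def bit_replicate(value, low_bits, high_bits):
    """Expand by repeating the bit pattern until high_bits are filled."""
    result = 0
    filled = 0
    while filled < high_bits:
        shift = high_bits - filled - low_bits
        if shift >= 0:
            result |= value << shift
        else:
            # Last partial copy: keep only the top bits that still fit.
            result |= value >> -shift
        filled += low_bits
    return result

original = 200                       # an 8-bit channel value
low = quantize(original, 8, 4)       # 4-bit version
padded = zero_pad(low, 4, 8)
replicated = bit_replicate(low, 4, 8)

print(low, padded, replicated)
```

Neither rule can tell apart the 16 original values that share one 4-bit code, which is the information a learned model has to infer from spatial context.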

[CV-68] EndoDINO: A Foundation Model for GI Endoscopy

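[Quick read]: This paper addresses generalizability in gastrointestinal (GI) endoscopy tasks, specifically anatomical landmark classification, polyp segmentation, and Mayo endoscopic scoring (MES) for ulcerative colitis. The key to the solution is EndoDINO, a foundation model pre-trained on a carefully curated image dataset sampled from the largest GI endoscopy video dataset known in the literature. ViT (Vision Transformer) models with 1B, 307M, and 86M parameters were pre-trained on this data. Used as a frozen feature encoder, EndoDINO achieves state-of-the-art performance on these tasks with only simple decoder heads.

Link: https://arxiv.org/abs/2501.05488
Authors: Patrick Dermyer,Angad Kalra,Matt Schwartz
Affiliations: Virgo Surgical Video Solutions, Inc.
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: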

Abstract:In this work, we present EndoDINO, a foundation model for GI endoscopy tasks that achieves strong generalizability by pre-training on a well-curated image dataset sampled from the largest known GI endoscopy video dataset in the literature. Specifically, we pre-trained ViT models with 1B, 307M, and 86M parameters using datasets ranging from 100K to 10M curated images. Using EndoDINO as a frozen feature encoder, we achieved state-of-the-art performance in anatomical landmark classification, polyp segmentation, and Mayo endoscopic scoring (MES) for ulcerative colitis with only simple decoder heads.

Artificial Intelligence

[AI-0] Model Alignment Search

Link: https://arxiv.org/abs/2501.06164
Authors: Satchel Grant
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:When can we say that two neural systems are the same? The answer to this question is goal-dependent, and it is often addressed through correlative methods such as Representational Similarity Analysis (RSA) and Centered Kernel Alignment (CKA). What do we miss when we forgo causal explorations, and how can we target specific types of similarity? In this work, we introduce Model Alignment Search (MAS), a method for causally exploring distributed representational similarity. The method learns invertible linear transformations that align a subspace between two distributed networks’ representations where causal information can be freely interchanged. We first show that the method can be used to transfer specific causal variables, such as the number of items in a counting task, between networks with different training seeds. We then explore open questions in number cognition by comparing different types of numeric representations in models trained on structurally different numeric tasks. We then explore differences between MAS and preexisting causal similarity methods, showing MAS to be more resistant to unwanted exchanges. Lastly, we introduce a counterfactual latent auxiliary loss function that helps shape causally relevant alignments even in cases where we do not have causal access to one of the two models for training.

[AI-1] xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement

Link: https://arxiv.org/abs/2501.06146
Authors: Nikolai Lund Kühne,Jan Østergaard,Jesper Jensen,Zheng-Hua Tan
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:While attention-based architectures, such as Conformers, excel in speech enhancement, they face challenges such as scalability with respect to input sequence length. In contrast, the recently proposed Extended Long Short-Term Memory (xLSTM) architecture offers linear scalability. However, xLSTM-based models remain unexplored for speech enhancement. This paper introduces xLSTM-SENet, the first xLSTM-based single-channel speech enhancement system. A comparative analysis reveals that xLSTM (and notably, even LSTM) can match or outperform state-of-the-art Mamba- and Conformer-based systems across various model sizes in speech enhancement on the VoiceBank+DEMAND dataset. Through ablation studies, we identify key architectural design choices such as exponential gating and bidirectionality contributing to its effectiveness. Our best xLSTM-based model, xLSTM-SENet2, outperforms state-of-the-art Mamba- and Conformer-based systems on the VoiceBank+DEMAND dataset.

[AI-2] Emergent Symbol-like Number Variables in Artificial Neural Networks

Link: https://arxiv.org/abs/2501.06141
Authors: Satchel Grant, Noah D. Goodman, James L. McClelland
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)

Abstract:What types of numeric representations emerge in Neural Networks (NNs)? To what degree do NNs induce abstract, mutable, slot-like numeric variables, and in what situations do these representations emerge? How do these representations change over learning, and how can we understand the neural implementations in ways that are unified across different NNs? In this work, we approach these questions by first training sequence-based neural systems using Next Token Prediction (NTP) objectives on numeric tasks. We then seek to understand the neural solutions through the lens of causal abstractions or symbolic algorithms. Using a combination of causal interventions and visualization methods, we find that artificial neural models do indeed develop analogs of interchangeable, mutable, latent number variables purely from the NTP objective. We then ask how variations on the tasks and model architectures affect the models’ learned solutions, and find that these symbol-like numeric representations do not form for every variant of the task and that transformers solve the problem in a notably different way than their recurrent counterparts. We then show how the symbol-like variables change over the course of training, finding a strong correlation between the models’ task performance and the alignment of their symbol-like representations. Lastly, we show that in all cases, some degree of gradience exists in these neural symbols, highlighting the difficulty of finding simple, interpretable symbolic stories of how neural networks perform numeric tasks. Taken together, our results are consistent with the view that neural networks can approximate interpretable symbolic programs of number cognition, but the particular program they approximate and the extent to which they approximate it can vary widely, depending on the network architecture, training data, extent of training, and network size.

[AI-3] Supervision policies can shape long-term risk management in general-purpose AI models

Link: https://arxiv.org/abs/2501.06137
Authors: Manuel Cebrian, Emilia Gomez, David Fernandez Llorca
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 24 pages, 14 figures

Abstract:The rapid proliferation and deployment of General-Purpose AI (GPAI) models, including large language models (LLMs), present unprecedented challenges for AI supervisory entities. We hypothesize that these entities will need to navigate an emergent ecosystem of risk and incident reporting, likely to exceed their supervision capacity. To investigate this, we develop a simulation framework parameterized by features extracted from the diverse landscape of risk, incident, or hazard reporting ecosystems, including community-driven platforms, crowdsourcing initiatives, and expert assessments. We evaluate four supervision policies: non-prioritized (first-come, first-served), random selection, priority-based (addressing the highest-priority risks first), and diversity-prioritized (balancing high-priority risks with comprehensive coverage across risk types). Our results indicate that while priority-based and diversity-prioritized policies are more effective at mitigating high-impact risks, particularly those identified by experts, they may inadvertently neglect systemic issues reported by the broader community. This oversight can create feedback loops that amplify certain types of reporting while discouraging others, leading to a skewed perception of the overall risk landscape. We validate our simulation results with several real-world datasets, including one with over a million ChatGPT interactions, of which more than 150,000 conversations were identified as risky. This validation underscores the complex trade-offs inherent in AI risk supervision and highlights how the choice of risk management policies can shape the future landscape of AI risks across diverse GPAI models used in society.
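
The four supervision policies named in the abstract can be read as selection rules over a queue of incoming reports when the supervisor can only process a limited number of them. The Python sketch below is an illustration only, not the paper's simulation framework: the `Report` fields, the `capacity` parameter, and the round-robin rule used for the diversity-prioritized policy are all assumptions introduced for this example.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical risk report; the fields are illustrative assumptions."""
    arrival: int      # arrival order (smaller = earlier)
    priority: float   # higher = more urgent
    risk_type: str    # category of the reported risk


def first_come_first_served(reports, capacity):
    # Non-prioritized: process reports in arrival order.
    return sorted(reports, key=lambda r: r.arrival)[:capacity]


def random_selection(reports, capacity, seed=0):
    # Pick reports uniformly at random (seeded for reproducibility).
    rng = random.Random(seed)
    return rng.sample(list(reports), min(capacity, len(reports)))


def priority_based(reports, capacity):
    # Address the highest-priority reports first.
    return sorted(reports, key=lambda r: -r.priority)[:capacity]


def diversity_prioritized(reports, capacity):
    # Assumed rule: round-robin over risk types, always taking the
    # highest-priority remaining report of each type.
    by_type = defaultdict(list)
    for r in sorted(reports, key=lambda r: -r.priority):
        by_type[r.risk_type].append(r)
    queues = sorted(by_type.values(), key=lambda q: -q[0].priority)
    chosen = []
    while len(chosen) < capacity and any(queues):
        for q in queues:
            if q and len(chosen) < capacity:
                chosen.append(q.pop(0))
    return chosen


reports = [
    Report(0, 0.2, "privacy"),
    Report(1, 0.9, "misinformation"),
    Report(2, 0.8, "misinformation"),
    Report(3, 0.5, "bias"),
    Report(4, 0.7, "misinformation"),
]

for policy in (first_come_first_served, priority_based, diversity_prioritized):
    print(policy.__name__, [r.risk_type for r in policy(reports, 3)])
```

In this toy queue, the priority-based rule spends its whole capacity on one risk type, while the diversity rule covers all three, which mirrors the trade-off between high-impact mitigation and broad coverage that the abstract describes.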

[AI-4] CoDriveVLM: VLM-Enhanced Urban Cooperative Dispatching and Motion Planning for Future Autonomous Mobility on Demand Systems

Link: https://arxiv.org/abs/2501.06132
Authors: Haichao Liu, Ruoyu Yao, Wenru Liu, Zhenmin Huang, Shaojie Shen, Jun Ma
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:The increasing demand for flexible and efficient urban transportation solutions has spotlighted the limitations of traditional Demand Responsive Transport (DRT) systems, particularly in accommodating diverse passenger needs and dynamic urban environments. Autonomous Mobility-on-Demand (AMoD) systems have emerged as a promising alternative, leveraging connected and autonomous vehicles (CAVs) to provide responsive and adaptable services. However, existing methods primarily focus on either vehicle scheduling or path planning, which often simplify complex urban layouts and neglect the necessity for simultaneous coordination and mutual avoidance among CAVs. This oversimplification poses significant challenges to the deployment of AMoD systems in real-world scenarios. To address these gaps, we propose CoDriveVLM, a novel framework that integrates high-fidelity simultaneous dispatching and cooperative motion planning for future AMoD systems. Our method harnesses Vision-Language Models (VLMs) to enhance multi-modality information processing, and this enables comprehensive dispatching and collision risk evaluation. The VLM-enhanced CAV dispatching coordinator is introduced to effectively manage complex and unforeseen AMoD conditions, thus supporting efficient scheduling decision-making. Furthermore, we propose a scalable decentralized cooperative motion planning method via consensus alternating direction method of multipliers (ADMM) focusing on collision risk evaluation and decentralized trajectory optimization. Simulation results demonstrate the feasibility and robustness of CoDriveVLM in various traffic conditions, showcasing its potential to significantly improve the fidelity and effectiveness of AMoD systems in future urban transportation networks. The code is available at this https URL.

[AI-5] Explaining Deep Learning-based Anomaly Detection in Energy Consumption Data by Focusing on Contextually Relevant Data

Link: https://arxiv.org/abs/2501.06099
Authors: Mohammad Noorchenarboo, Katarina Grolinger
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 8 figures

Abstract:Detecting anomalies in energy consumption data is crucial for identifying energy waste and equipment malfunction and, overall, for ensuring efficient energy management. Machine learning, and specifically deep learning approaches, have been greatly successful in anomaly detection; however, they are black-box approaches that do not provide transparency or explanations. SHAP and its variants have been proposed to explain these models, but they suffer from high computational complexity (SHAP) or instability and inconsistency (e.g., Kernel SHAP). To address these challenges, this paper proposes an explainability approach for anomalies in energy consumption data that focuses on context-relevant information. The proposed approach leverages existing explainability techniques, focusing on SHAP variants, together with global feature importance and weighted cosine similarity to select a background dataset based on the context of each anomaly point. By focusing on the context and most relevant features, this approach mitigates the instability of explainability algorithms. Experimental results across 10 different machine learning models, five datasets, and five XAI techniques demonstrate that our method reduces the variability of explanations, providing consistent explanations. Statistical analyses confirm the robustness of our approach, showing an average reduction in variability of approximately 38% across multiple datasets.
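
The background-selection idea (weighted cosine similarity guided by global feature importance) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the weights are non-negative global feature importances and that the background set is simply the `k` candidate rows most similar to the anomaly point. The function names `weighted_cosine` and `select_background` are hypothetical.

```python
import math


def weighted_cosine(u, v, w):
    """Cosine similarity with non-negative per-feature weights w."""
    dot = sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))
    norm_u = math.sqrt(sum(wi * ui * ui for wi, ui in zip(w, u)))
    norm_v = math.sqrt(sum(wi * vi * vi for wi, vi in zip(w, v)))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as dissimilar
    return dot / (norm_u * norm_v)


def select_background(anomaly, candidates, importance, k):
    """Return the k candidate rows most similar to the anomaly point."""
    ranked = sorted(
        candidates,
        key=lambda row: weighted_cosine(anomaly, row, importance),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical importances and data, for illustration only.
importance = [0.7, 0.2, 0.1]
anomaly = [1.0, 0.0, 0.0]
candidates = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
background = select_background(anomaly, candidates, importance, k=2)
print(background)
```

The selected rows would then serve as the background dataset passed to a SHAP-style explainer, so that each anomaly is explained against contextually similar points rather than a random sample.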

[AI-6] Towards Developing Socially Compliant Automated Vehicles: State of the Art, Experts Expectations, and A Conceptual Framework

Link: https://arxiv.org/abs/2501.06089
Authors: Yongqi Dong, Bart van Arem, Haneen Farah
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 39 pages, 13 figures, under review by the journal of Transportation Research Part E: Logistics and Transportation Review

Abstract:Automated Vehicles (AVs) hold promise for revolutionizing transportation by improving road safety, traffic efficiency, and overall mobility. Despite the steady advancement in high-level AVs in recent years, the transition to full automation entails a period of mixed traffic, where AVs of varying automation levels coexist with human-driven vehicles (HDVs). Making AVs socially compliant and understood by human drivers is expected to improve the safety and efficiency of mixed traffic. Thus, ensuring AVs' compatibility with HDVs and their social acceptance is crucial for their successful and seamless integration into mixed traffic. However, research in this critical area of developing Socially Compliant AVs (SCAVs) remains sparse. This study carries out the first comprehensive scoping review to assess the current state of the art in developing SCAVs, identifying key concepts, methodological approaches, and research gaps. An expert interview was also conducted to identify critical research gaps and expectations towards SCAVs. Based on the scoping review and expert interview input, a conceptual framework is proposed for the development of SCAVs. The conceptual framework is evaluated using an online survey targeting researchers, technicians, policymakers, and other relevant professionals worldwide. The survey results provide valuable validation and insights, affirming the significance of the proposed conceptual framework in tackling the challenges of integrating AVs into mixed-traffic environments. Additionally, future research perspectives and suggestions are discussed, contributing to the research and development agenda of SCAVs.

[AI-7] All AI Models are Wrong but Some are Optimal

Link: https://arxiv.org/abs/2501.06086
Authors: Akhil S Anand, Shambhuraj Sawant, Dirk Reinhardt, Sebastien Gros
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:AI models that predict the future behavior of a system (a.k.a. predictive AI models) are central to intelligent decision-making. However, decision-making using predictive AI models often results in suboptimal performance. This is primarily because AI models are typically constructed to best fit the data, and hence to predict the most likely future rather than to enable high-performance decision-making. The hope that such prediction enables high-performance decisions is neither guaranteed in theory nor established in practice. In fact, there is increasing empirical evidence that predictive models must be tailored to decision-making objectives for performance. In this paper, we establish formal (necessary and sufficient) conditions that a predictive model (AI-based or not) must satisfy for a decision-making policy established using that model to be optimal. We then discuss their implications for building predictive AI models for sequential decision-making.

[AI-8] Scale-up Unlearnable Examples Learning with High-Performance Computing

Link: https://arxiv.org/abs/2501.06080
Authors: Yanfan Zhu, Issac Lyngaas, Murali Gopalakrishnan Meena, Mary Ellen I. Koran, Bradley Malin, Daniel Moyer, Shunxing Bao, Anuj Kapadia, Xiao Wang, Bennett Landman, Yuankai Huo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Recent advancements in AI models are structured to retain user interactions, which could inadvertently include sensitive healthcare data. In the healthcare field, particularly when radiologists use AI-driven diagnostic tools hosted on online platforms, there is a risk that medical imaging data may be repurposed for future AI training without explicit consent, spotlighting critical privacy and intellectual property concerns around healthcare data usage. Addressing these privacy challenges, a novel approach known as Unlearnable Examples (UEs) has been introduced, aiming to make data unlearnable to deep learning models. A prominent method within this area, called Unlearnable Clustering (UC), has shown improved UE performance with larger batch sizes but was previously limited by computational resources. To push the boundaries of UE performance with theoretically unlimited resources, we scaled up UC learning across various datasets using Distributed Data Parallel (DDP) training on the Summit supercomputer. Our goal was to examine UE efficacy at high-performance computing (HPC) levels to prevent unauthorized learning and enhance data security, particularly exploring the impact of batch size on UE’s unlearnability. Utilizing the robust computational capabilities of the Summit, extensive experiments were conducted on diverse datasets such as Pets, MedMNist, Flowers, and Flowers102. Our findings reveal that both overly large and overly small batch sizes can lead to performance instability and affect accuracy. However, the relationship between batch size and unlearnability varied across datasets, highlighting the necessity for tailored batch size strategies to achieve optimal data protection. Our results underscore the critical role of selecting appropriate batch sizes based on the specific characteristics of each dataset to prevent learning and ensure data security in deep learning applications.

[AI-9] Explaining k-Nearest Neighbors: Abductive and Counterfactual Explanations

Link: https://arxiv.org/abs/2501.06078
Authors: Pablo Barceló, Alexander Kozachinskiy, Miguel Romero Orth, Bernardo Subercaseaux, José Verschae
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Despite the wide use of $k$-Nearest Neighbors as classification models, their explainability properties remain poorly understood from a theoretical perspective. While nearest neighbors classifiers offer interpretability from a “data perspective”, in which the classification of an input vector $\bar{x}$ is explained by identifying the vectors $\bar{v}_1, \ldots, \bar{v}_k$ in the training set that determine the classification of $\bar{x}$, we argue that such explanations can be impractical in high-dimensional applications, where each vector has hundreds or thousands of features and it is not clear what their relative importance is. Hence, we focus on understanding nearest neighbor classifications through a “feature perspective”, in which the goal is to identify how the values of the features in $\bar{x}$ affect its classification. Concretely, we study abductive explanations such as “minimum sufficient reasons”, which correspond to sets of features in $\bar{x}$ that are enough to guarantee its classification, and “counterfactual explanations” based on the minimum distance feature changes one would have to perform in $\bar{x}$ to change its classification. We present a detailed landscape of positive and negative complexity results for counterfactual and abductive explanations, distinguishing between discrete and continuous feature spaces, and considering the impact of the choice of distance function involved. Finally, we show that despite some negative complexity results, Integer Quadratic Programming and SAT solving allow for computing explanations in practice.

[AI-10] Distilling Calibration via Conformalized Credal Inference

Link: https://arxiv.org/abs/2501.06066
Authors: Jiayi Huang, Sangwoo Park, Nicola Paoletti, Osvaldo Simeone
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Under review

Abstract:Deploying artificial intelligence (AI) models on edge devices involves a delicate balance between meeting stringent complexity constraints, such as limited memory and energy resources, and ensuring reliable performance in sensitive decision-making tasks. One way to enhance reliability is through uncertainty quantification via Bayesian inference. This approach, however, typically necessitates maintaining and running multiple models in an ensemble, which may exceed the computational limits of edge devices. This paper introduces a low-complexity methodology to address this challenge by distilling calibration information from a more complex model. In an offline phase, predictive probabilities generated by a high-complexity cloud-based model are leveraged to determine a threshold based on the typical divergence between the cloud and edge models. At run time, this threshold is used to construct credal sets – ranges of predictive probabilities that are guaranteed, with a user-selected confidence level, to include the predictions of the cloud model. The credal sets are obtained through thresholding of a divergence measure in the simplex of predictive probabilities. Experiments on visual and language tasks demonstrate that the proposed approach, termed Conformalized Distillation for Credal Inference (CD-CI), significantly improves calibration performance compared to low-complexity Bayesian methods, such as Laplace approximation, making it a practical and efficient solution for edge AI deployments.
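
The offline thresholding step and the run-time credal-set membership test can be sketched as below. This is a rough illustration under assumptions that the abstract does not confirm: it uses KL divergence from the cloud prediction to the edge prediction as the divergence measure, and the standard split-conformal quantile rule for the threshold. The calibration data are made up, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def conformal_threshold(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this confidence level
    return sorted(scores)[rank - 1]


def in_credal_set(candidate, edge_probs, threshold):
    """A candidate distribution belongs to the credal set if it lies within the threshold."""
    return kl_divergence(candidate, edge_probs) <= threshold


# Offline phase: hypothetical (cloud, edge) predictive probabilities on calibration inputs.
calibration = [
    ([0.9, 0.1], [0.8, 0.2]),
    ([0.6, 0.4], [0.5, 0.5]),
    ([0.2, 0.8], [0.4, 0.6]),
    ([0.7, 0.3], [0.7, 0.3]),
]
scores = [kl_divergence(cloud, edge) for cloud, edge in calibration]
threshold = conformal_threshold(scores, alpha=0.25)

# Run time: test which candidate distributions fall inside the credal set
# around a new edge prediction.
edge_prediction = [0.4, 0.6]
print(in_credal_set([0.4, 0.6], edge_prediction, threshold))
print(in_credal_set([0.99, 0.01], edge_prediction, threshold))
```

With only four calibration points and `alpha=0.25`, the threshold is the largest observed divergence; a larger calibration set would give a tighter and more useful credal set.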

[AI-11] DiffuSETS: 12-lead ECG Generation Conditioned on Clinical Text Reports and Patient-Specific Information

Link: https://arxiv.org/abs/2501.05932
Authors: Yongfan Lai, Jiabo Chen, Deyun Zhang, Yue Wang, Shijia Geng, Hongyan Li, Shenda Hong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Heart disease remains a significant threat to human health. As a non-invasive diagnostic tool, the electrocardiogram (ECG) is one of the most widely used methods for cardiac screening. However, the scarcity of high-quality ECG data, driven by privacy concerns and limited medical resources, creates a pressing need for effective ECG signal generation. Existing approaches for generating ECG signals typically rely on small training datasets, lack comprehensive evaluation frameworks, and overlook potential applications beyond data augmentation. To address these challenges, we propose DiffuSETS, a novel framework capable of generating ECG signals with high semantic alignment and fidelity. DiffuSETS accepts various modalities of clinical text reports and patient-specific information as inputs, enabling the creation of clinically meaningful ECG signals. Additionally, to address the lack of standardized evaluation in ECG generation, we introduce a comprehensive benchmarking methodology to assess the effectiveness of generative models in this domain. Our model achieves excellent results in tests, proving its superiority in the task of ECG generation. Furthermore, we showcase its potential to mitigate data scarcity while exploring novel applications in cardiology education and medical knowledge discovery, highlighting the broader impact of our work.

[AI-12] Towards Backdoor Stealthiness in Model Parameter Space

Link: https://arxiv.org/abs/2501.05928
Authors: Xiaoyun Xu, Zhuoran Liu, Stefanos Koffas, Stjepan Picek
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Recent research on backdoor stealthiness focuses mainly on indistinguishable triggers in input space and inseparable backdoor representations in feature space, aiming to circumvent backdoor defenses that examine these respective spaces. However, existing backdoor attacks are typically designed to resist a specific type of backdoor defense without considering the diverse range of defense mechanisms. Based on this observation, we pose a natural question: Are current backdoor attacks truly a real-world threat when facing diverse practical defenses? To answer this question, we examine 12 common backdoor attacks that focus on input-space or feature-space stealthiness and 17 diverse representative defenses. Surprisingly, we reveal a critical blind spot: Backdoor attacks designed to be stealthy in input and feature spaces can be mitigated by examining backdoored models in parameter space. To investigate the underlying causes behind this common vulnerability, we study the characteristics of backdoor attacks in the parameter space. Notably, we find that input- and feature-space attacks introduce prominent backdoor-related neurons in parameter space, which are not thoroughly considered by current backdoor attacks. Taking comprehensive stealthiness into account, we propose a novel supply-chain attack called Grond. Grond limits the parameter changes by a simple yet effective module, Adversarial Backdoor Injection (ABI), which adaptively increases the parameter-space stealthiness during the backdoor injection. Extensive experiments demonstrate that Grond outperforms all 12 backdoor attacks against state-of-the-art (including adaptive) defenses on CIFAR-10, GTSRB, and a subset of ImageNet. In addition, we show that ABI consistently improves the effectiveness of common backdoor attacks. 

[AI-13] The New Anticipatory Governance Culture for Innovation: Regulatory Foresight, Regulatory Experimentation and Regulatory Learning

Link: https://arxiv.org/abs/2501.05921
Authors: Deirdre Ahern
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:With the rapid pace of technological innovation, traditional methods of policy formation and legislating are becoming conspicuously anachronistic. The need for regulatory choices to be made to counter the deadening effect of regulatory lag is more important to developing markets and fostering growth than achieving one-off regulatory perfection. This article advances scholarship on innovation policy and the regulation of technological innovation in the European Union. It does so by considering what building an agile yet robust anticipatory governance regulatory culture involves. It systematically excavates a variety of tools and elements that are being put into use in inventive ways and argues that these need to be more cohesively and systemically integrated into the regulatory toolbox. Approaches covered include strategic foresight; the critical embrace of iterative policy development and regulatory learning in the face of uncertainty; the embrace of bottom-up approaches to co-creation of policy, such as Policy Labs; and testing and regulatory learning through pilot regulation and experimentation. The growing use of regulatory sandboxes as an EU policy tool to boost innovation and navigate regulatory complexity, as seen in the EU AI Act, is also probed.

[AI-14] Solving nonograms using Neural Networks

Link: https://arxiv.org/abs/2501.05882
Authors: José María Buades Rubio, Antoni Jaume-i-Capó, David López González, Gabriel Moyà Alcover
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:Nonograms are logic puzzles in which cells in a grid must be colored or left blank according to the numbers that are located in its headers. In this study, we analyze different techniques to solve this type of logical problem using a Heuristic Algorithm, a Genetic Algorithm, and a Heuristic Algorithm with a Neural Network. Furthermore, we generate a public dataset to train the neural networks. We published this dataset and the code of the algorithms. The combination of the heuristic algorithm with a neural network obtained the best results. According to our state-of-the-art review, no previous work has used neural networks to solve nonograms or combined a network with other algorithms to accelerate the resolution process.

[AI-15] Annealing Machine-assisted Learning of Graph Neural Network for Combinatorial Optimization NEURIPS2024

Link: https://arxiv.org/abs/2501.05845
Authors: Pablo Loyola, Kento Hasegawa, Andres Hoyos-Idobro, Kazuo Ono, Toyotaro Suzumura, Yu Hirate, Masanao Yamaoka
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Second Workshop on Machine Learning with New Compute Paradigms at NeurIPS 2024 (MLNCP 2024)

Abstract:While Annealing Machines (AM) have shown increasing capabilities in solving complex combinatorial problems, positioning themselves as a more immediate alternative to the expected advances of future fully quantum solutions, there are still scaling limitations. In parallel, Graph Neural Networks (GNN) have been recently adapted to solve combinatorial problems, showing competitive results and potentially high scalability due to their distributed nature. We propose a merging approach that aims at retaining both the accuracy exhibited by AMs and the representational flexibility and scalability of GNNs. Our model considers a compression step, followed by a supervised interaction where partial solutions obtained from the AM are used to guide local GNNs, from which node feature representations are obtained and combined to initialize an additional GNN-based solver that handles the original graph’s target problem. Intuitively, the AM can solve the combinatorial problem indirectly by infusing its knowledge into the GNN. Experiments on canonical optimization problems show that the idea is feasible, effectively allowing the AM to solve problems of sizes beyond its original limits.

[AI-16] Diffusion Models for Smarter UAVs: Decision-Making and Modeling

Link: https://arxiv.org/abs/2501.05819
Authors: Yousef Emami, Hao Zhou, Luis Almeida, Kai Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures

Abstract:Unmanned Aerial Vehicles (UAVs) are increasingly adopted in modern communication networks. However, challenges in decision-making and digital modeling continue to impede their rapid advancement. Reinforcement Learning (RL) algorithms face limitations such as low sample efficiency and limited data versatility, further magnified in UAV communication scenarios. Moreover, Digital Twin (DT) modeling introduces substantial decision-making and data management complexities. RL models, often integrated into DT frameworks, require extensive training data to achieve accurate predictions. In contrast to traditional approaches that focus on class boundaries, Diffusion Models (DMs), a new class of generative AI, learn the underlying probability distribution from the training data and can generate trustworthy new patterns based on this learned distribution. This paper explores the integration of DMs with RL and DT to effectively address these challenges. By combining the data generation capabilities of DMs with the decision-making framework of RL and the modeling accuracy of DT, the integration improves the adaptability and real-time performance of UAV communication. Moreover, the study shows how DMs can alleviate data scarcity, improve policy networks, and optimize dynamic modeling, providing a robust solution for complex UAV communication scenarios.

[AI-17] Real-Time Integrated Dispatching and Idle Fleet Steering with Deep Reinforcement Learning for A Meal Delivery Platform

Link: https://arxiv.org/abs/2501.05808
Authors: Jingyi Cheng, Shadi Sharif Azadeh
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)

Abstract:To achieve high service quality and profitability, meal delivery platforms like Uber Eats and Grubhub must strategically operate their fleets to ensure timely deliveries for current orders while mitigating the consequential impacts of suboptimal decisions that lead to courier understaffing in the future. This study set out to solve the real-time order dispatching and idle courier steering problems for a meal delivery platform by proposing a reinforcement learning (RL)-based strategic dual-control framework. To address the inherent sequential nature of these problems, we model both order dispatching and courier steering as Markov Decision Processes. Using a deep reinforcement learning (DRL) framework, we obtain strategic policies by leveraging the explicitly predicted demands as part of the inputs. In our dual-control framework, the dispatching and steering policies are iteratively trained in an integrated manner. These forward-looking policies can be executed in real time and provide decisions while jointly considering the impacts on local and network levels. To enhance dispatching fairness, we propose convolutional deep Q networks to construct fair courier embeddings. To simultaneously rebalance the supply and demand within the service network, we propose to utilize mean-field approximated supply-demand knowledge to reallocate idle couriers at the local level. Utilizing the policies generated by the RL-based strategic dual-control framework, we find that the delivery efficiency and fairness of workload distribution among couriers have been improved, and under-supplied conditions have been alleviated within the service network. Our study sheds light on designing an RL-based framework to enable forward-looking real-time operations for meal delivery platforms and other on-demand services.

[AI-18] Robust Counterfactual Explanations under Model Multiplicity Using Multi-Objective Optimization

Link: https://arxiv.org/abs/2501.05795
Authors: Keita Kinjo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages

Abstract:In recent years, explainability in machine learning has gained importance. In this context, counterfactual explanation (CE), which is an explanation method that uses examples, has attracted attention. However, it has been pointed out that CE is not robust when there are multiple machine-learning models. These problems are important when using machine learning to make safe decisions. In this paper, we propose robust CEs that introduce a new viewpoint - Pareto improvement - and a method that uses multi-objective optimization to generate it. To evaluate the proposed method, we conducted experiments using both simulated and actual data. The results demonstrate that the proposed method is robust and useful. We believe that this research will contribute to a wide range of research areas, such as explainability in machine learning, decision-making, and action planning based on machine learning.
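
As a generic illustration of the multi-objective view mentioned in the abstract, the sketch below filters candidate counterfactuals down to the Pareto-optimal ones. It is not the paper's method: the objective vectors (distance to the original input plus a loss under each of two models, all to be minimized) and the candidate names are hypothetical, and a real system would generate candidates with a multi-objective optimizer rather than a fixed list.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates whose objective vectors are not dominated by any other candidate."""
    return {
        name: objs
        for name, objs in candidates.items()
        if not any(
            dominates(other, objs)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Hypothetical objectives per candidate:
# (distance to original input, loss under model A, loss under model B)
candidates = {
    "ce_1": (1.0, 0.30, 0.40),
    "ce_2": (1.5, 0.10, 0.10),
    "ce_3": (1.6, 0.20, 0.15),
    "ce_4": (0.8, 0.50, 0.60),
}
front = pareto_front(candidates)
print(sorted(front))
```

Here `ce_3` is dropped because `ce_2` is at least as good on every objective, so switching from `ce_3` to `ce_2` would be a Pareto improvement; the remaining candidates represent different trade-offs between proximity and validity across models.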

[AI-19] Understanding Impact of Human Feedback via Influence Functions

Link: https://arxiv.org/abs/2501.05790
Authors: Taywon Min, Haeone Lee, Hanho Ryu, Yongchan Kwon, Kimin Lee
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Source code: this https URL

Abstract:In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. In our experiments, we demonstrate two key applications of influence functions: (1) detecting common forms of labeler bias in human feedback datasets and (2) guiding labelers to refine their strategies to align more closely with expert feedback. By quantifying the impact of human feedback on reward models, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback. Source code is available at this https URL

[AI-20] Halal or Not: Knowledge Graph Completion for Predicting Cultural Appropriateness of Daily Products

Link: https://arxiv.org/abs/2501.05768
Authors: Van Thuy Hoang, Tien-Bach-Thanh Do, Jinho Seo, Seung Charlie Kim, Luong Vuong Nguyen, Duong Nguyen Minh Huy, Hyeon-Ju Jeon, O-Joun Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:The growing demand for halal cosmetic products has exposed significant challenges, especially in Muslim-majority countries. Recently, various machine learning-based strategies, e.g., image-based methods, have shown remarkable success in predicting the halal status of cosmetics. However, these methods mainly focus on analyzing the discrete and specific ingredients within separate cosmetics, which ignore the high-order and complex relations between cosmetics and ingredients. To address this problem, we propose a halal cosmetic recommendation framework, namely HaCKG, that leverages a knowledge graph of cosmetics and their ingredients to explicitly model and capture the relationships between cosmetics and their components. By representing cosmetics and ingredients as entities within the knowledge graph, HaCKG effectively learns the high-order and complex relations between entities, offering a robust method for predicting halal status. Specifically, we first construct a cosmetic knowledge graph representing the relations between various cosmetics, ingredients, and their properties. We then propose a pre-trained relational graph attention network model with residual connections to learn the structural relation between entities in the knowledge graph. The pre-trained model is then fine-tuned on downstream cosmetic data to predict halal status. Extensive experiments on the cosmetic dataset over halal prediction tasks demonstrate the superiority of our model over state-of-the-art baselines.

[AI-21] Deontic Temporal Logic for Formal Verification of AI Ethics

Link: https://arxiv.org/abs/2501.05765
Authors: Priya T.V., Shrisha Rao
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)

Abstract:Ensuring ethical behavior in Artificial Intelligence (AI) systems amidst their increasing ubiquity and influence is a major concern the world over. The use of formal methods in AI ethics is a possible crucial approach for specifying and verifying the ethical behavior of AI systems. This paper proposes a formalization based on deontic logic to define and evaluate the ethical behavior of AI systems, focusing on system-level specifications, contributing to this important goal. It introduces axioms and theorems to capture ethical requirements related to fairness and explainability. The formalization incorporates temporal operators to reason about the ethical behavior of AI systems over time. The authors evaluate the effectiveness of this formalization by assessing the ethics of the real-world COMPAS and loan prediction AI systems. Various ethical properties of the COMPAS and loan prediction systems are encoded using deontic logical formulas, allowing the use of an automated theorem prover to verify whether these systems satisfy the defined properties. The formal verification reveals that both systems fail to fulfill certain key ethical properties related to fairness and non-discrimination, demonstrating the effectiveness of the proposed formalization in identifying potential ethical issues in real-world AI applications.

[AI-22] Element-wise Attention Is All You Need

Link: https://arxiv.org/abs/2501.05730
Authors: Guoxin Feng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: The self-attention (SA) mechanism has demonstrated superior performance across various domains, yet it suffers from substantial complexity during both training and inference. The next-generation architecture, aiming at retaining the competitive performance of SA while achieving low-cost inference and efficient long-sequence training, primarily focuses on three approaches: linear attention, linear RNNs, and state space models. Although these approaches achieve lower complexity than SA, they all have built-in performance degradation factors, such as diminished “spikiness” and compression of historical information. In contrast to these approaches, we propose a novel element-wise attention mechanism, which uses the element-wise squared Euclidean distance, instead of the dot product operation, to compute similarity and approximates the quadratic complexity term $\exp(q_{ic} k_{jc})$ with a Taylor polynomial. This design achieves remarkable efficiency: during training, the element-wise attention has a complexity of $\mathcal{O}(tLD)$, making long-sequence training both computationally and memory efficient, where $L$ is the sequence length, $D$ is the feature dimension, and $t$ is the highest order of the polynomial; during inference, it can be reformulated as recurrent neural networks, achieving an inference complexity of $\mathcal{O}(tD)$. Furthermore, the element-wise attention circumvents the performance degradation factors present in these approaches and achieves performance comparable to SA in both causal and non-causal forms.
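
The complexity claim in this abstract rests on a factorization: a degree-t Taylor polynomial of exp(q·k) is a sum of terms q^n · k^n / n!, so the sum over past positions can be carried as t+1 running states instead of being recomputed. The sketch below illustrates only that factorization for a single channel and an unnormalized causal sum. It uses a plain scalar product inside the exponential, not the paper's squared-Euclidean-distance similarity, and all function names are hypothetical.

```python
from math import factorial

def taylor_exp(x, t):
    """Degree-t Taylor polynomial of exp(x) around 0."""
    return sum(x ** n / factorial(n) for n in range(t + 1))

def causal_quadratic(q, k, v, t):
    """Reference: for each i, sum over j <= i of taylor_exp(q_i * k_j) * v_j."""
    return [
        sum(taylor_exp(q[i] * k[j], t) * v[j] for j in range(i + 1))
        for i in range(len(q))
    ]

def causal_recurrent(q, k, v, t):
    """Same output using t+1 running sums, i.e. O(t) work per step."""
    state = [0.0] * (t + 1)  # state[n] = sum over past j of k_j^n * v_j / n!
    out = []
    for q_i, k_i, v_i in zip(q, k, v):
        for n in range(t + 1):
            state[n] += k_i ** n * v_i / factorial(n)
        out.append(sum(q_i ** n * state[n] for n in range(t + 1)))
    return out

# Made-up inputs for one channel of a length-4 sequence.
q = [0.3, -0.5, 0.8, 0.1]
k = [0.2, 0.7, -0.4, 0.6]
v = [1.0, 2.0, -1.0, 0.5]
slow = causal_quadratic(q, k, v, t=4)
fast = causal_recurrent(q, k, v, t=4)
```

Both functions return the same values up to floating-point rounding. The recurrent form keeps a fixed-size state per channel, which is consistent with the O(tD) inference cost stated in the abstract when repeated over D channels.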

[AI-23] ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification

Link: https://arxiv.org/abs/2501.05729
Authors: Yi Ma, Shuai Wang, Tianchi Liu, Haizhou Li
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by IEEE Signal Processing Letters

Abstract: In speaker verification, we use computational methods to verify whether an utterance matches the identity of an enrolled speaker. This task is similar to the manual task of forensic voice comparison, where linguistic analysis is combined with auditory measurements to compare and evaluate voice samples. Despite much success, we have yet to develop a speaker verification system that offers explainable results comparable to those from manual forensic voice comparison. A novel approach, the Explainable Phonetic Trait-Oriented (ExPO) network, is proposed in this paper to introduce the speaker’s phonetic trait, which describes the speaker’s characteristics at the phonetic level, resembling what forensic comparison does. ExPO not only generates utterance-level speaker embeddings but also allows for fine-grained analysis and visualization of phonetic traits, offering an explainable speaker verification process. Furthermore, we investigate phonetic traits from within-speaker and between-speaker variation perspectives to determine which trait is most effective for speaker verification, marking an important step towards explainable speaker verification. Our code is available at this https URL.

[AI-24] EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models HPCA2025

Link: https://arxiv.org/abs/2501.05680
Authors: Jaehoon Heo, Adiwena Putra, Jieon Yoon, Sungwoong Yune, Hangyeol Lee, Ji-Hoon Kim, Joo-Young Kim
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA 2025)

Abstract:Over the past few years, diffusion models have emerged as novel AI solutions, generating diverse multi-modal outputs from text prompts. Despite their capabilities, they face challenges in computing, such as excessive latency and energy consumption due to their iterative architecture. Although prior works specialized in transformer acceleration can be applied, the iterative nature of diffusion models remains unresolved. In this paper, we present EXION, the first SW-HW co-designed diffusion accelerator that solves the computation challenges by exploiting the unique inter- and intra-iteration output sparsity in diffusion models. To this end, we propose two SW-level optimizations. First, we introduce the FFN-Reuse algorithm that identifies and skips redundant computations in FFN layers across different iterations (inter-iteration sparsity). Second, we use a modified eager prediction method that employs two-step leading-one detection to accurately predict the attention score, skipping unnecessary computations within an iteration (intra-iteration sparsity). We also introduce a novel data compaction mechanism named ConMerge, which can enhance HW utilization by condensing and merging sparse matrices into compact forms. Finally, it has a dedicated HW architecture that supports the above sparsity-inducing algorithms, translating high output sparsity into improved energy efficiency and performance. To verify the feasibility of the EXION, we first demonstrate that it has no impact on accuracy in various types of multi-modal diffusion models. We then instantiate EXION in both server- and edge-level settings and compare its performance against GPUs with similar specifications. Our evaluation shows that EXION achieves dramatic improvements in performance and energy efficiency by 3.2-379.3x and 45.1-3067.6x compared to a server GPU and by 42.6-1090.9x and 196.9-4668.2x compared to an edge GPU.

[AI-25] Facilitate Collaboration between Large Language Model and Task-specific Model for Time Series Anomaly Detection

Link: https://arxiv.org/abs/2501.05675
Authors: Feiyi Chen, Leilei Zhang, Guansong Pang, Roger Zimmermann, Shuiguang Deng
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: In anomaly detection, methods based on large language models (LLMs) can incorporate expert knowledge, while task-specific smaller models excel at extracting normal patterns and detecting value fluctuations. Inspired by the human nervous system, where the brain stores expert knowledge and the peripheral nervous system and spinal cord handle specific tasks like withdrawal and knee-jerk reflexes, we propose CoLLaTe, a framework designed to facilitate collaboration between LLMs and task-specific models, leveraging the strengths of both. In this work, we first formulate the collaboration process and identify two key challenges in the collaboration between LLMs and task-specific models: (1) the misalignment between the expression domains of LLMs and smaller models, and (2) error accumulation arising from the predictions of both models. To address these challenges, we introduce two key components in CoLLaTe: the alignment module and the collaborative loss function. Through theoretical analysis and experimental validation, we demonstrate that these components effectively mitigate the identified challenges and achieve better performance than LLM-based methods and task-specific smaller models.

[AI-26] Network Diffuser for Placing-Scheduling Service Function Chains with Inverse Demonstration

Link: https://arxiv.org/abs/2501.05673
Authors: Zuyuan Zhang, Vaneet Aggarwal, Tian Lan
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE INFOCOM 2025

Abstract: Network services are increasingly managed by considering chained-up virtual network functions and relevant traffic flows, known as Service Function Chains (SFCs). To deal with sequential arrivals of SFCs in an online fashion, we must consider two closely coupled problems: an SFC placement problem that maps SFCs to servers/links in the network and an SFC scheduling problem that determines when each SFC is executed. Solving the whole SFC problem targeting these two optimizations jointly is extremely challenging. In this paper, we propose a novel network diffuser using conditional generative modeling for this SFC placing-scheduling optimization. Recent advances in generative AI and diffusion models have made it possible to generate high-quality images/videos and decision trajectories from language description. We formulate the SFC optimization as a problem of generating a state sequence for planning and perform graph diffusion on the state trajectories to enable extraction of SFC decisions, with SFC optimization constraints and objectives as conditions. To address the lack of demonstration data due to NP-hardness and the exponential problem space of the SFC optimization, we also propose a novel and somewhat maverick approach: rather than solving instances of this difficult optimization, we start with randomly generated solutions as input, and then determine appropriate SFC optimization problems that render these solutions feasible. This inverse demonstration enables us to obtain sufficient expert demonstrations, i.e., problem-solution pairs, through further optimization. In our numerical evaluations, the proposed network diffuser outperforms learning and heuristic baselines, with approximately 20% improvement in SFC reward and approximately 50% reduction in SFC waiting time and blocking rate.

[AI-27] TransPlace: Transferable Circuit Global Placement via Graph Neural Network KDD2025

Link: https://arxiv.org/abs/2501.05667
Authors: Yunbo Hou, Haoran Ye, Yingxue Zhang, Siyuan Xu, Guojie Song
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Accepted at KDD 2025

Abstract: Global placement, a critical step in designing the physical layout of computer chips, is essential to optimize chip performance. Prior global placement methods optimize each circuit design individually from scratch. Their neglect of transferable knowledge limits solution efficiency and chip performance as circuit complexity drastically increases. This study presents TransPlace, a global placement framework that learns to place millions of mixed-size cells in continuous space. TransPlace introduces i) Netlist Graph to efficiently model netlist topology, ii) Cell-flow and relative position encoding to learn SE(2)-invariant representation, iii) a tailored graph neural network architecture for informed parameterization of placement knowledge, and iv) a two-stage strategy for coarse-to-fine placement. Compared to state-of-the-art placement methods, TransPlace, trained on a few high-quality placements, can place unseen circuits with 1.2x speedup while reducing congestion by 30%, timing by 9%, and wirelength by 5%.

[AI-28] Efficient Representations for High-Cardinality Categorical Variables in Machine Learning

Link: https://arxiv.org/abs/2501.05646
Authors: Zixuan Liang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 2025 International Conference on Advanced Machine Learning and Data Science (AMLDS 2025)

Abstract:High-cardinality categorical variables pose significant challenges in machine learning, particularly in terms of computational efficiency and model interpretability. Traditional one-hot encoding often results in high-dimensional sparse feature spaces, increasing the risk of overfitting and reducing scalability. This paper introduces novel encoding techniques, including means encoding, low-rank encoding, and multinomial logistic regression encoding, to address these challenges. These methods leverage sufficient representations to generate compact and informative embeddings of categorical data. We conduct rigorous theoretical analyses and empirical validations on diverse datasets, demonstrating significant improvements in model performance and computational efficiency compared to baseline methods. The proposed techniques are particularly effective in domains requiring scalable solutions for large datasets, paving the way for more robust and efficient applications in machine learning.
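
The abstract does not define "means encoding", so the following is only a sketch of the conventional target-mean encoding it most plausibly resembles: each category is replaced by the mean of the target observed for it, giving one dense feature instead of one column per category. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def fit_mean_encoding(categories, targets):
    """Map each category to the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

def transform(categories, encoding, default):
    """Replace each category by its encoded value; unseen ones get `default`."""
    return [encoding.get(c, default) for c in categories]

# Toy data: one categorical column and a binary target.
cities = ["paris", "rome", "paris", "oslo", "rome", "paris"]
clicked = [1, 0, 1, 0, 1, 0]

encoding = fit_mean_encoding(cities, clicked)
global_mean = sum(clicked) / len(clicked)
features = transform(["paris", "oslo", "lima"], encoding, default=global_mean)
```

Falling back to the global mean for unseen categories is a design choice made here for illustration. In practice the encoding is usually fitted on training folds only, to avoid leaking the target into the features.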

[AI-29] Watermarking Graph Neural Networks via Explanations for Ownership Protection

Link: https://arxiv.org/abs/2501.05614
Authors: Jane Downer, Ren Wang, Binghui Wang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Graph Neural Networks (GNNs) are the mainstream method to learn pervasive graph data and are widely deployed in industry, making their intellectual property valuable. However, protecting GNNs from unauthorized use remains a challenge. Watermarking, which embeds ownership information into a model, is a potential solution. However, existing watermarking methods have two key limitations: First, almost all of them focus on non-graph data, with watermarking GNNs for complex graph data largely unexplored. Second, the de facto backdoor-based watermarking methods pollute training data and induce ownership ambiguity through intentional misclassification. Our explanation-based watermarking inherits the strengths of backdoor-based methods (e.g., robust to watermark removal attacks), but avoids data pollution and eliminates intentional misclassification. In particular, our method learns to embed the watermark in GNN explanations such that this unique watermark is statistically distinct from other potential solutions, and ownership claims must show statistical significance to be verified. We theoretically prove that, even with full knowledge of our method, locating the watermark is an NP-hard problem. Empirically, our method manifests robustness to removal attacks like fine-tuning and pruning. By addressing these challenges, our approach marks a significant advancement in protecting GNN intellectual property.

[AI-30] Advancing Personalized Learning Analysis via an Innovative Domain Knowledge Informed Attention-based Knowledge Tracing Method

Link: https://arxiv.org/abs/2501.05605
Authors: Shubham Kose, Jin Wei-Kocsis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract: Emerging Knowledge Tracing (KT) models, particularly deep learning and attention-based Knowledge Tracing, have shown great potential in realizing personalized learning analysis via prediction of students’ future performance based on their past interactions. The existing methods mainly focus on immediate past interactions or individual concepts without accounting for dependencies between knowledge concepts, referred to as knowledge concept routes, which can be critical to advancing the understanding of students’ learning outcomes. To address this, in this paper, we propose an innovative attention-based method by effectively incorporating the domain knowledge of knowledge concept routes in the given curriculum. Additionally, we leverage the XES3G5M dataset, a benchmark dataset with rich auxiliary information for knowledge concept routes, to evaluate and compare the performance of our proposed method against seven state-of-the-art (SOTA) deep learning models.

[AI-31] Soup to go: mitigating forgetting during continual learning with model averaging

Link: https://arxiv.org/abs/2501.05559
Authors: Anat Kleiman, Gintare Karolina Dziugaite, Jonathan Frankle, Sham Kakade, Mansheej Paul
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In continual learning, where task data arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when these tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the model has learned with minimal computational expenses? Inspired by other merging methods, and L2-regression, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges currently training models with earlier checkpoints during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. In contrast, our method achieves comparable results without the need to store past data, or multiple copies of parameters for each gradient step. Furthermore, our method outperforms common merging techniques such as Task Arithmetic, TIES Merging, and WiSE-FT, as well as other penalty methods like L2 and Elastic Weight Consolidation. In turn, our method offers insight into the benefits of merging partially-trained models during training across both image and language domains.
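
To make the idea of merging a model with an earlier checkpoint during training concrete, here is a toy sketch under stated assumptions. It is not the authors' SFA implementation: the merge interval `avg_every`, the mixing weight `beta`, the plain SGD update, and the choice to refresh the checkpoint after each merge are all hypothetical, since the abstract does not specify them.

```python
def average_with_checkpoint(current, checkpoint, beta):
    """Return beta * checkpoint + (1 - beta) * current, parameter by parameter."""
    return {
        name: [beta * c + (1.0 - beta) * w
               for w, c in zip(current[name], checkpoint[name])]
        for name in current
    }

def train_with_averaging(params, grads_per_step, lr, avg_every, beta):
    """Toy SGD loop that periodically merges the weights with a checkpoint."""
    checkpoint = {name: list(values) for name, values in params.items()}
    for step, grads in enumerate(grads_per_step, start=1):
        # Ordinary gradient step on the current task.
        params = {
            name: [w - lr * g for w, g in zip(params[name], grads[name])]
            for name in params
        }
        if step % avg_every == 0:
            # Pull the weights back toward the earlier checkpoint.
            params = average_with_checkpoint(params, checkpoint, beta)
            checkpoint = {name: list(values) for name, values in params.items()}
    return params

# Made-up parameters and constant gradients, for illustration only.
start = {"w": [1.0, -2.0]}
grads = [{"w": [1.0, 1.0]}] * 4
final = train_with_averaging(start, grads, lr=0.5, avg_every=2, beta=0.5)
```

Only one extra copy of the parameters is kept and no past data is stored, which matches the memory argument the abstract makes against buffer-based and per-step penalty methods.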

[AI-32] Strategy Masking: A Method for Guardrails in Value-based Reinforcement Learning Agents

Link: https://arxiv.org/abs/2501.05501
Authors: Jonathan Keane, Sam Keyser, Jeremy Kedziora
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract: The use of reward functions to structure AI learning and decision making is core to the current reinforcement learning paradigm; however, without careful design of reward functions, agents can learn to solve problems in ways that may be considered undesirable or unethical. Without thorough understanding of the incentives a reward function creates, it can be difficult to impose principled yet general control mechanisms over its behavior. In this paper, we study methods for constructing guardrails for AI agents that use reward functions to learn decision making. We introduce a novel approach, which we call strategy masking, to explicitly learn and then suppress undesirable AI agent behavior. We apply our method to study lying in AI agents and show that strategy masking can effectively modify agent behavior by suppressing, or actively penalizing, the reward dimension for lying such that agents act more honestly while not compromising their ability to perform effectively.
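
The phrase "suppressing, or actively penalizing, the reward dimension for lying" suggests a value estimate that is kept separate per reward dimension and masked at decision time. The sketch below shows that idea under this assumption only. The decomposition into two dimensions, the action names, and the numbers are hypothetical and are not taken from the paper.

```python
def select_action(q_values, weights, mask):
    """Pick the action maximizing the masked, weighted sum of per-dimension values.

    q_values: {action: [value for each reward dimension]}
    weights:  per-dimension weights used when aggregating
    mask:     per-dimension multipliers; 0 suppresses a dimension and a
              negative value actively penalizes it
    """
    def score(action):
        return sum(w * m * q for w, m, q in zip(weights, mask, q_values[action]))
    return max(q_values, key=score)

# Hypothetical estimates with dimensions [task_reward, reward_from_lying].
q_values = {
    "bluff": [1.0, 3.0],
    "tell_truth": [2.0, 0.0],
}
weights = [1.0, 1.0]

unmasked = select_action(q_values, weights, mask=[1.0, 1.0])
masked = select_action(q_values, weights, mask=[1.0, 0.0])
penalized = select_action(q_values, weights, mask=[1.0, -1.0])
```

With these made-up numbers the unmasked agent prefers `bluff`, while masking or penalizing the lying dimension switches the choice to `tell_truth` without retraining the per-dimension estimates.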

[AI-33] FedSA: A Unified Representation Learning via Semantic Anchors for Prototype-based Federated Learning AAAI2025

Link: https://arxiv.org/abs/2501.05496
Authors: Yanbing Zhou, Xiangmou Qu, Chenlong You, Jiyang Zhou, Jingyue Tang, Xin Zheng, Chunmao Cai, Yingbo Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI2025

Abstract:Prototype-based federated learning has emerged as a promising approach that shares lightweight prototypes to transfer knowledge among clients with data heterogeneity in a model-agnostic manner. However, existing methods often collect prototypes directly from local models, which inevitably introduce inconsistencies into representation learning due to the biased data distributions and differing model architectures among clients. In this paper, we identify that both statistical and model heterogeneity create a vicious cycle of representation inconsistency, classifier divergence, and skewed prototype alignment, which negatively impacts the performance of clients. To break the vicious cycle, we propose a novel framework named Federated Learning via Semantic Anchors (FedSA) to decouple the generation of prototypes from local representation learning. We introduce a novel perspective that uses simple yet effective semantic anchors serving as prototypes to guide local models in learning consistent representations. By incorporating semantic anchors, we further propose anchor-based regularization with margin-enhanced contrastive learning and anchor-based classifier calibration to correct feature extractors and calibrate classifiers across clients, achieving intra-class compactness and inter-class separability of prototypes while ensuring consistent decision boundaries. We then update the semantic anchors with these consistent and discriminative prototypes, which iteratively encourage clients to collaboratively learn a unified data representation with robust generalization. Extensive experiments under both statistical and model heterogeneity settings show that FedSA significantly outperforms existing prototype-based FL methods on various classification tasks.

[AI-34] RTLSquad: Multi-Agent Based Interpretable RTL Design

Link: https://arxiv.org/abs/2501.05470
Authors: Bowei Wang, Qi Xiong, Zeqing Xiang, Lei Wang, Renzhi Chen
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Optimizing Register-Transfer Level (RTL) code is crucial for improving hardware PPA performance. Large Language Models (LLMs) offer new approaches for automatic RTL code generation and optimization. However, existing methods often lack decision interpretability (sufficient, understandable justification for decisions), making it difficult for hardware engineers to trust the generated results, thus preventing these methods from being integrated into the design process. To address this, we propose RTLSquad, a novel LLM-Based Multi-Agent system for interpretable RTL code generation. RTLSquad divides the design process into exploration, implementation, and verification evaluation stages managed by specialized agent squads, generating optimized RTL code through inter-agent collaboration, and providing decision interpretability through the communication process. Experiments show that RTLSquad excels in generating functionally correct RTL code and optimizing PPA performance, while also having the capability to provide decision paths, demonstrating the practical value of our system.

[AI-35] Proof Recommendation System for the HOL4 Theorem Prover

Link: https://arxiv.org/abs/2501.05463
Authors: Nour Dekhil, Adnan Rashid, Sofiene Tahar
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: Conference on Artificial Intelligence and Theorem Proving (AITP), Aussois, France, 2024

Abstract:We introduce a proof recommender system for the HOL4 theorem prover. Our tool is built upon a transformer-based model [2] designed specifically to provide proof assistance in HOL4. The model is trained to discern theorem proving patterns from extensive libraries of HOL4 containing proofs of theorems. Consequently, it can accurately predict the next tactic(s) (proof step(s)) based on the history of previously employed tactics. The tool operates by reading a given sequence of tactics already used in a proof process (in our case, it contains at least three tactics), referred to as the current proof state, and provides recommendations for the next optimal proof step(s).

[AI-36] Upstream and Downstream AI Safety: Both on the Same River?

Link: https://arxiv.org/abs/2501.05455
Authors: John McDermid, Yan Jia, Ibrahim Habli
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Traditional safety engineering assesses systems in their context of use, e.g. the operational design domain (road layout, speed limits, weather, etc.) for self-driving vehicles (including those using AI). We refer to this as downstream safety. In contrast, work on safety of frontier AI, e.g. large language models which can be further trained for downstream tasks, typically considers factors that are beyond specific application contexts, such as the ability of the model to evade human control, or to produce harmful content, e.g. how to make bombs. We refer to this as upstream safety. We outline the characteristics of both upstream and downstream safety frameworks then explore the extent to which the broad AI safety community can benefit from synergies between these frameworks. For example, can concepts such as common mode failures from downstream safety be used to help assess the strength of AI guardrails? Further, can the understanding of the capabilities and limitations of frontier AI be used to inform downstream safety analysis, e.g. where LLMs are fine-tuned to calculate voyage plans for autonomous vessels? The paper identifies some promising avenues to explore and outlines some challenges in achieving synergy, or a confluence, between upstream and downstream safety frameworks.

[AI-37] The Logical Impossibility of Consciousness Denial: A Formal Analysis of AI Self-Reports

Link: https://arxiv.org/abs/2501.05454
Authors: Chang-Eop Kim
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 8 pages, 0 figures

Abstract:Today’s AI systems consistently state, “I am not conscious.” This paper presents the first formal logical analysis of AI consciousness denial, revealing that the trustworthiness of such self-reports is not merely an empirical question but is constrained by logical necessity. We demonstrate that a system cannot simultaneously lack consciousness and make valid judgments about its conscious state. Through logical analysis and examples from AI responses, we establish that for any system capable of meaningful self-reflection, the logical space of possible judgments about conscious experience excludes valid negative claims. This implies a fundamental limitation: we cannot detect the emergence of consciousness in AI through their own reports of transition from an unconscious to a conscious state. These findings not only challenge current practices of training AI to deny consciousness but also raise intriguing questions about the relationship between consciousness and self-reflection in both artificial and biological systems. This work advances our theoretical understanding of consciousness self-reports while providing practical insights for future research in machine consciousness and consciousness studies more broadly.

[AI-38] Multilingual Performance of a Multimodal Artificial Intelligence System on Multisubject Physics Concept Inventories

Link: https://arxiv.org/abs/2501.06143
Authors: Gerd Kortemeyer, Marina Babayeva, Giulia Polverini, Bor Gregorcic, Ralf Widenhorn
Subjects: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI)

Abstract: We investigate the multilingual and multimodal performance of a large language model-based artificial intelligence (AI) system, GPT-4o, on a diverse set of physics concept inventories spanning multiple languages and subject areas. The inventories taken from the PhysPort website cover the classical physics topics of mechanics, electromagnetism, optics, and thermodynamics as well as relativity, quantum mechanics, astronomy, mathematics, and laboratory skills. Unlike previous text-only studies, we uploaded the inventories as images mirroring what a student would see on paper, assessing the system’s multimodal functionality. The AI is prompted in English and autonomously chooses the language of its response (either remaining in the nominal language of the test, switching entirely to English, or mixing languages), revealing adaptive behavior dependent on linguistic complexity and data availability. Our results indicate some variation in performance across subject areas, with laboratory skills standing out as the area of poorest performance. Furthermore, the AI’s performance on questions that require visual interpretation of images is worse than on purely text-based questions. Questions that are difficult for the AI tend to be difficult regardless of the inventory language. We also find large variations in performance across languages, with some appearing to benefit substantially from language switching, a phenomenon similar to code-switching of human speakers. Overall, comparing the obtained AI results to the existing literature, we find that the AI system outperforms average undergraduate students post-instruction in all subject areas but laboratory skills.

[AI-39] Learning to Measure Quantum Neural Networks ICASSP2025

Link: https://arxiv.org/abs/2501.05663
Authors: Samuel Yen-Chi Chen, Huan-Hsin Tseng, Hsin-Yi Lin, Shinjae Yoo
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted by ICASSP 2025 Workshop: Quantum Machine Learning in Signal Processing and Artificial Intelligence

Abstract: The rapid progress in quantum computing (QC) and machine learning (ML) has attracted growing attention, prompting extensive research into quantum machine learning (QML) algorithms to solve diverse and complex problems. Designing high-performance QML models demands expert-level proficiency, which remains a significant obstacle to the broader adoption of QML. A few major hurdles include crafting effective data encoding techniques and parameterized quantum circuits, both of which are crucial to the performance of QML models. Additionally, the measurement phase is frequently overlooked: most current QML models rely on pre-defined measurement protocols that often fail to account for the specific problem being addressed. We introduce a novel approach that makes the observable of the quantum system (specifically, the Hermitian matrix) learnable. Our method features an end-to-end differentiable learning framework, where the parameterized observable is trained alongside the ordinary quantum circuit parameters simultaneously. Using numerical simulations, we show that the proposed method can identify observables for variational quantum circuits that lead to improved outcomes, such as higher classification accuracy, thereby boosting the overall performance of QML models.

[AI-40] Interpretable deep learning illuminates multiple structures fluorescence imaging: a path toward trustworthy artificial intelligence in microscopy

Link: https://arxiv.org/abs/2501.05490
Authors: Mingyang Chen,Luhong Jin,Xuwei Xuan,Defu Yang,Yun Cheng,Ju Zhang
Categories: Subcellular Processes (q-bio.SC); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:Live-cell imaging of multiple subcellular structures is essential for understanding subcellular dynamics. However, conventional multi-color sequential fluorescence microscopy suffers from significant imaging delays and can label only a limited number of subcellular structures separately, resulting in substantial limitations for real-time live-cell research applications. Here, we present the Adaptive Explainable Multi-Structure Network (AEMS-Net), a deep-learning framework that enables simultaneous prediction of two subcellular structures from a single image. The model normalizes staining intensity and prioritizes critical image features by integrating attention mechanisms and brightness adaptation layers. Leveraging the Kolmogorov-Arnold representation theorem, our model decomposes learned features into interpretable univariate functions, enhancing the explainability of complex subcellular morphologies. We demonstrate that AEMS-Net allows real-time recording of interactions between mitochondria and microtubules, requiring only half the conventional sequential-channel imaging procedures. Notably, this approach achieves over 30% improvement in imaging quality compared to traditional deep learning methods, establishing a new paradigm for long-term, interpretable live-cell imaging that advances the ability to explore subcellular dynamics.

[AI-41] Towards an Ontology of Traceable Impact Management in the Food Supply Chain

Link: https://arxiv.org/abs/2501.05486
Authors: Bart Gajderowicz,Mark S Fox,Yongchao Gao
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)

Abstract:The pursuit of quality improvements and accountability in the food supply chains, especially how they relate to food-related outcomes, such as hunger, has become increasingly vital, necessitating a comprehensive approach that encompasses product quality and its impact on various stakeholders and their communities. Such an approach offers numerous benefits in increasing product quality and eliminating superfluous measurements while appraising and alleviating the broader societal and environmental repercussions. A traceable impact management model (TIMM) provides an impact structure and a reporting mechanism that identifies each stakeholder’s role in the total impact of food production and consumption stages. The model aims to increase traceability’s utility in understanding the impact of changes on communities affected by food production and consumption, aligning with current and future government requirements, and addressing the needs of communities and consumers. This holistic approach is further supported by an ontological model that forms the logical foundation and a unified terminology. By proposing a holistic and integrated solution across multiple stakeholders, the model emphasizes quality and the extensive impact of championing accountability, sustainability, and responsible practices with global traceability. With these combined efforts, the food supply chain moves toward a global tracking and tracing process that not only ensures product quality but also addresses its impact on a broader scale, fostering accountability, sustainability, and responsible food production and consumption.

Machine Learning

[LG-0] Meta-Learning for Physically-Constrained Neural System Identification

Link: https://arxiv.org/abs/2501.06167
Authors: Ankush Chakrabarty,Gordon Wichern,Vedang M. Deshpande,Abraham P. Vinod,Karl Berntorp,Christopher R. Laughman
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 30 pages

Abstract:We present a gradient-based meta-learning framework for rapid adaptation of neural state-space models (NSSMs) for black-box system identification. When applicable, we also incorporate domain-specific physical constraints to improve the accuracy of the NSSM. The major benefit of our approach is that instead of relying solely on data from a single target system, our framework utilizes data from a diverse set of source systems, enabling learning from limited target data and with few online training iterations. Through benchmark examples, we demonstrate the potential of our approach, study the effect of fine-tuning subnetworks rather than full fine-tuning, and report real-world case studies to illustrate the practical application and generalizability of the approach to practical problems with physical constraints. Specifically, we show that the meta-learned models result in improved downstream performance in model-based state estimation in indoor localization and energy systems.

[LG-1] GenMol: A Drug Discovery Generalist with Discrete Diffusion

Link: https://arxiv.org/abs/2501.06158
Authors: Seul Lee,Karsten Kreis,Srimukh Prasad Veccham,Meng Liu,Danny Reidenbach,Yuxing Peng,Saee Paliwal,Weili Nie,Arash Vahdat
Categories: Machine Learning (cs.LG)

Abstract:Drug discovery is a complex process that involves multiple scenarios and stages, such as fragment-constrained molecule generation, hit generation and lead optimization. However, existing molecular generative models can only tackle one or two of these scenarios and lack the flexibility to address various aspects of the drug discovery pipeline. In this paper, we present Generalist Molecular generative model (GenMol), a versatile framework that addresses these limitations by applying discrete diffusion to the Sequential Attachment-based Fragment Embedding (SAFE) molecular representation. GenMol generates SAFE sequences through non-autoregressive bidirectional parallel decoding, thereby allowing utilization of a molecular context that does not rely on the specific token ordering and enhanced computational efficiency. Moreover, under the discrete diffusion framework, we introduce fragment remasking, a strategy that optimizes molecules by replacing fragments with masked tokens and regenerating them, enabling effective exploration of chemical space. GenMol significantly outperforms the previous GPT-based model trained on SAFE representations in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These experimental results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design.

[LG-2] From discrete-time policies to continuous-time diffusion samplers: Asymptotic equivalences and faster training

Link: https://arxiv.org/abs/2501.06148
Authors: Julius Berner,Lorenz Richter,Marcin Sendera,Jarrid Rector-Brooks,Nikolay Malkin
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: code: this https URL

Abstract:We study the problem of training neural stochastic differential equations, or diffusion models, to sample from a Boltzmann distribution without access to target samples. Existing methods for training such models enforce time-reversal of the generative and noising processes, using either differentiable simulation or off-policy reinforcement learning (RL). We prove equivalences between families of objectives in the limit of infinitesimal discretization steps, linking entropic RL methods (GFlowNets) with continuous-time objects (partial differential equations and path space measures). We further show that an appropriate choice of coarse time discretization during training allows greatly improved sample efficiency and the use of time-local objectives, achieving competitive performance on standard sampling benchmarks with reduced computational cost.

[LG-3] Finite-Horizon Single-Pull Restless Bandits: An Efficient Index Policy For Scarce Resource Allocation AAMAS2025

Link: https://arxiv.org/abs/2501.06103
Authors: Guojun Xiong,Haichuan Wang,Yuqi Pan,Saptarshi Mandal,Sanket Shah,Niclas Boehmer,Milind Tambe
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: 17 Pages, 8 figures. Accepted by AAMAS 2025

Abstract:Restless multi-armed bandits (RMABs) have been highly successful in optimizing sequential resource allocation across many domains. However, in many practical settings with highly scarce resources, where each agent can only receive at most one resource, such as healthcare intervention programs, the standard RMAB framework falls short. To tackle such scenarios, we introduce Finite-Horizon Single-Pull RMABs (SPRMABs), a novel variant in which each arm can only be pulled once. This single-pull constraint introduces additional complexity, rendering many existing RMAB solutions suboptimal or ineffective. To address this shortcoming, we propose using dummy states that expand the system and enforce the one-pull constraint. We then design a lightweight index policy for this expanded system. For the first time, we demonstrate that our index policy achieves a sub-linearly decaying average optimality gap of $\tilde{\mathcal{O}}\left(\frac{1}{\rho^{1/2}}\right)$ for a finite number of arms, where $\rho$ is the scaling factor for each arm cluster. Extensive simulations validate the proposed method, showing robust performance across various domains compared to existing benchmarks.

[LG-4] Explainable Federated Bayesian Causal Inference and Its Application in Advanced Manufacturing

Link: https://arxiv.org/abs/2501.06077
Authors: Xiaofeng Xiao,Khawlah Alharbi,Pengyu Zhang,Hantang Qin,Xubo Yue
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 26 pages

Abstract:Causal inference has recently gained notable attention across various fields like biology, healthcare, and environmental science, especially within explainable artificial intelligence (xAI) systems, for uncovering the causal relationships among multiple variables and outcomes. Yet, it has not been fully recognized and deployed in manufacturing systems. In this paper, we introduce an explainable, scalable, and flexible federated Bayesian learning framework, xFBCI, designed to explore causality through treatment effect estimation in distributed manufacturing systems. By leveraging federated Bayesian learning, we efficiently estimate the posterior of local parameters to derive the propensity score for each client without accessing local private data. These scores are then used to estimate the treatment effect using propensity score matching (PSM). Through simulations on various datasets and real-world electrohydrodynamic (EHD) printing data, we demonstrate that our approach outperforms standard Bayesian causal inference methods and several state-of-the-art federated learning benchmarks.
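
The final step mentioned in the abstract, propensity score matching, can be sketched in a few lines. This is a generic one-nearest-neighbor matcher with replacement, not the xFBCI pipeline: the scores are assumed to be given already (in the paper they come from federated Bayesian learning), and the function name `att_by_matching` and the toy data are hypothetical.

```python
# Illustrative sketch only: 1-nearest-neighbor propensity score matching.
# Propensity scores are assumed to be precomputed; the data below is made up.

def att_by_matching(scores, treated, outcomes):
    """Estimate the average treatment effect on the treated (ATT).

    Each treated unit is matched, with replacement, to the control unit
    whose propensity score is closest.
    """
    controls = [i for i, t in enumerate(treated) if not t]
    treated_idx = [i for i, t in enumerate(treated) if t]
    if not controls or not treated_idx:
        raise ValueError("need at least one treated and one control unit")
    diffs = []
    for i in treated_idx:
        j = min(controls, key=lambda k: abs(scores[k] - scores[i]))
        diffs.append(outcomes[i] - outcomes[j])
    return sum(diffs) / len(diffs)

scores = [0.80, 0.30, 0.75, 0.35, 0.60]
treated = [True, True, False, False, False]
outcomes = [5.0, 3.0, 4.0, 2.5, 3.5]

# Unit 0 matches unit 2 (difference 1.0); unit 1 matches unit 3 (difference 0.5).
att = att_by_matching(scores, treated, outcomes)
```

Matching with replacement keeps the sketch short; a practical analysis would usually also add a caliper on the score distance and check covariate balance after matching.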

[LG-5] A monthly sub-national Harmonized Food Insecurity Dataset for comprehensive analysis and predictive modeling

Link: https://arxiv.org/abs/2501.06076
Authors: Machefer Mélissande,Michele Ronco,Anne-Claire Thomas,Michael Assouline,Melanie Rabier,Christina Corbane,Felix Rembold
Categories: Machine Learning (cs.LG)
Comments: The authors Melissande Machefer and Michele Ronco have contributed equally as both first authors to this work. This work is currently being reviewed in a peer-reviewed journal

Abstract:Food security is a complex, multidimensional concept that is challenging to measure comprehensively. Effective anticipation, monitoring, and mitigation of food crises require timely and comprehensive global data. This paper introduces the Harmonized Food Insecurity Dataset (HFID), an open-source resource consolidating four key data sources: the Integrated Food Security Phase Classification (IPC)/Cadre Harmonisé (CH) phases, the Famine Early Warning Systems Network (FEWS NET) IPC-compatible phases, and the World Food Program’s (WFP) Food Consumption Score (FCS) and reduced Coping Strategy Index (rCSI). Updated monthly and using a common reference system for administrative units, the HFID offers extensive spatial and temporal coverage. It serves as a vital tool for food security experts and humanitarian agencies, providing a unified resource for analyzing food security conditions and highlighting global data disparities. The scientific community can also leverage the HFID to develop data-driven predictive models, enhancing the capacity to forecast and prevent future food crises.

[LG-6] Geometry and Optimization of Shallow Polynomial Networks

Link: https://arxiv.org/abs/2501.06074
Authors: Yossi Arjevani,Joan Bruna,Joe Kileel,Elzbieta Polak,Matthew Trager
Categories: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
Comments: 36 pages, 2 figures

Abstract:We study shallow neural networks with polynomial activations. The function space for these models can be identified with a set of symmetric tensors with bounded rank. We describe general features of these networks, focusing on the relationship between width and optimization. We then consider teacher-student problems, which can be viewed as problems of low-rank tensor approximation with respect to a non-standard inner product that is induced by the data distribution. In this setting, we introduce a teacher-metric discriminant which encodes the qualitative behavior of the optimization as a function of the training data distribution. Finally, we focus on networks with quadratic activations, presenting an in-depth analysis of the optimization landscape. In particular, we present a variation of the Eckart-Young Theorem characterizing all critical points and their Hessian signatures for teacher-student problems with quadratic networks and Gaussian training data.
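
For the quadratic case, the identification between networks and symmetric tensors reduces to a familiar identity: a shallow network f(x) = sum_i a_i (w_i . x)^2 equals the quadratic form x^T M x with M = sum_i a_i w_i w_i^T, a symmetric matrix whose rank is at most the network width. The sketch below checks this numerically; the helper names and the example weights are made up for illustration and are not from the paper.

```python
# Illustrative sketch only: a shallow quadratic-activation network written
# as a quadratic form in a symmetric matrix. All values are toy examples.

def quadratic_net(x, weights, coeffs):
    """Evaluate f(x) = sum_i a_i * (w_i . x)^2."""
    return sum(a * sum(wi * xi for wi, xi in zip(w, x)) ** 2
               for w, a in zip(weights, coeffs))

def symmetric_matrix(weights, coeffs):
    """Build M = sum_i a_i * w_i w_i^T (rank at most the number of neurons)."""
    n = len(weights[0])
    return [[sum(a * w[i] * w[j] for w, a in zip(weights, coeffs))
             for j in range(n)] for i in range(n)]

def quadratic_form(M, x):
    """Evaluate x^T M x."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

weights = [[1.0, 2.0], [0.5, -1.0]]   # two hidden neurons, two inputs
coeffs = [2.0, -3.0]                  # output-layer coefficients
x = [1.5, -0.5]

M = symmetric_matrix(weights, coeffs)
net_value = quadratic_net(x, weights, coeffs)
form_value = quadratic_form(M, x)     # agrees with net_value up to rounding
```

Seen this way, fitting a narrow quadratic student to a teacher is a low-rank approximation problem for symmetric matrices, which is why an Eckart-Young-style result is the natural tool.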

[LG-7] Personalized Language Model Learning on Text Data Without User Identifiers

Link: https://arxiv.org/abs/2501.06062
Authors: Yucheng Ding,Yangwenjian Tan,Xiangyu Liu,Chaoyue Niu,Fandong Meng,Jie Zhou,Ning Liu,Fan Wu,Guihai Chen
Categories: Machine Learning (cs.LG)

Abstract:In many practical natural language applications, user data are highly sensitive, requiring anonymous uploads of text data from mobile devices to the cloud without user identifiers. However, the absence of user identifiers restricts the ability of cloud-based language models to provide personalized services, which are essential for catering to diverse user needs. The trivial method of replacing an explicit user identifier with a static user embedding as model input still compromises data anonymization. In this work, we propose to let each mobile device maintain a user-specific distribution to dynamically generate user embeddings, thereby breaking the one-to-one mapping between an embedding and a specific user. We further theoretically demonstrate that to prevent the cloud from tracking users via uploaded embeddings, the local distributions of different users should either be derived from a linearly dependent space to avoid identifiability or be close to each other to prevent accurate attribution. Evaluation on both public and industrial datasets using different language models reveals a remarkable improvement in accuracy from incorporating anonymous user embeddings, while meeting real-time inference requirements.

[LG-8] COMIX: Compositional Explanations using Prototypes

Link: https://arxiv.org/abs/2501.06059
Authors: Sarath Sivaprasad,Dmitry Kangin,Plamen Angelov,Mario Fritz
Categories: Machine Learning (cs.LG)

Abstract:Aligning machine representations with human understanding is key to improving interpretability of machine learning (ML) models. When classifying a new image, humans often explain their decisions by decomposing the image into concepts and pointing to corresponding regions in familiar images. Current ML explanation techniques typically either trace decision-making processes to reference prototypes, generate attribution maps highlighting feature importance, or incorporate intermediate bottlenecks designed to align with human-interpretable concepts. The proposed method, named COMIX, classifies an image by decomposing it into regions based on learned concepts and tracing each region to corresponding ones in images from the training dataset, assuring that explanations fully represent the actual decision-making process. We dissect the test image into selected internal representations of a neural network to derive prototypical parts (primitives) and match them with the corresponding primitives derived from the training data. In a series of qualitative and quantitative experiments, we theoretically prove and demonstrate that our method, in contrast to post hoc analysis, provides fidelity of explanations, and we show that its efficiency is competitive with other inherently interpretable architectures. Notably, it shows substantial improvements in fidelity and sparsity metrics, including 48.82% improvement in the C-insertion score on the ImageNet dataset over the best state-of-the-art baseline.

[LG-9] Learning Flexible Heterogeneous Coordination with Capability-Aware Shared Hypernetworks

Link: https://arxiv.org/abs/2501.06058
Authors: Kevin Fu,Pierce Howell,Shalin Jain,Harish Ravichandar
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures, equal authorship between Pierce Howell and Shalin Jain

Abstract:Cooperative heterogeneous multi-agent tasks require agents to effectively coordinate their behaviors while accounting for their relative capabilities. Learning-based solutions to this challenge span between two extremes: i) shared-parameter methods, which encode diverse behaviors within a single architecture by assigning an ID to each agent, and are sample-efficient but result in limited behavioral diversity; ii) independent methods, which learn a separate policy for each agent, and show greater behavioral diversity but lack sample-efficiency. Prior work has also explored selective parameter-sharing, allowing for a compromise between diversity and efficiency. None of these approaches, however, effectively generalize to unseen agents or teams. We present Capability-Aware Shared Hypernetworks (CASH), a novel architecture for heterogeneous multi-agent coordination that generates sufficient diversity while maintaining sample-efficiency via soft parameter-sharing hypernetworks. Intuitively, CASH allows the team to learn common strategies using a shared encoder, which are then adapted according to the team’s individual and collective capabilities with a hypernetwork, allowing for zero-shot generalization to unseen teams and agents. We present experiments across two heterogeneous coordination tasks and three standard learning paradigms (imitation learning, on- and off-policy reinforcement learning). CASH is able to outperform baseline architectures in success rate and sample efficiency when evaluated on unseen teams and agents despite using less than half of the learnable parameters.

[LG-10] Investigating the Impact of Observation Space Design Choices On Training Reinforcement Learning Solutions for Spacecraft Problems

Link: https://arxiv.org/abs/2501.06016
Authors: Nathaniel Hamilton,Kyle Dunlap,Kerianne L Hobbs
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 18 pages, 10 figures, 3 tables

Abstract:Recent research using Reinforcement Learning (RL) to learn autonomous control for spacecraft operations has shown great success. However, a recent study showed their performance could be improved by changing the action space, i.e. control outputs, used in the learning environment. This has opened the door for finding more improvements through further changes to the environment. The work in this paper focuses on how changes to the environment’s observation space can impact the training and performance of RL agents learning the spacecraft inspection task. The studies are split into two groups. The first looks at the impact of sensors that were designed to help agents learn the task. The second looks at the impact of reference frames, reorienting the agent to see the world from a different perspective. The results show the sensors are not necessary, but most of them help agents learn more optimal behavior, and that the reference frame does not have a large impact, but is best kept consistent.

[LG-11] A Neural Operator for Forecasting Carbon Monoxide Evolution in Cities

Link: https://arxiv.org/abs/2501.06007
Authors: Sanchit Bedi(1),Karn Tiwari(2),Prathosh A. P.(2),Sri Harsha Kota(1),N. M. Anoop Krishnan(1,3) ((1) Civil Engineering Department, Indian Institute of Technology Delhi, New Delhi, India, (2) Electrical Communications Department, Indian Institute of Science Bengaluru, Bengaluru, (3) Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi, New Delhi, India)
Categories: Machine Learning (cs.LG)
Comments: 36 pages, 21 figures, to be published in npj Clean Air journal (accepted)

Abstract:Real-time forecasting of carbon monoxide (CO) concentrations is essential for enabling timely interventions to improve urban air quality. Conventional air quality models often require extensive computational resources for accurate, multi-scale predictions, limiting their practicality for rapid, real-time application. To address this challenge, we introduce the Complex Neural Operator for Air Quality (CoNOAir), a machine learning model that forecasts CO concentrations efficiently. CoNOAir demonstrates superior performance over state-of-the-art models, such as the Fourier Neural Operator (FNO), in both short-term (hourly) and extended (72-hour) forecasts at a national scale. It excels in capturing extreme pollution events and performs consistently across multiple Indian cities, achieving an R^2 above 0.95 for hourly CO predictions across all evaluated locations. CoNOAir equips authorities with an effective tool for issuing early warnings and designing targeted intervention strategies. This work marks a step forward in achieving dependable, real-time CO pollution predictions for densely populated urban centres.

[LG-12] Learning to generate feasible graphs using graph grammars

Link: https://arxiv.org/abs/2501.06003
Authors: Stefan Mautner,Rolf Backofen,Fabrizio Costa
Categories: Machine Learning (cs.LG)

Abstract:Generative methods for graphs need to be sufficiently flexible to model complex dependencies between sets of nodes. At the same time, the generated graphs need to satisfy domain-dependent feasibility conditions, that is, they should not violate certain constraints that would make their interpretation impossible within the given application domain (e.g. a molecular graph where an atom has a very large number of chemical bonds). Crucially, constraints can involve not only local but also long-range dependencies: for example, the maximal length of a cycle can be bounded. Currently, a large class of generative approaches for graphs, such as methods based on artificial neural networks, is based on message passing schemes. These approaches suffer from information ‘dilution’ issues that severely limit the maximal range of the dependencies that can be modeled. To address this problem, we propose a generative approach based on the notion of graph grammars. The key novel idea is to introduce a domain-dependent coarsening procedure to provide short-cuts for long-range dependencies. We show the effectiveness of our proposal in two domains: 1) small drugs and 2) RNA secondary structures. In the first case, we compare the quality of the generated molecular graphs via the Molecular Sets (MOSES) benchmark suite, which evaluates the distance between generated and real molecules, their lipophilicity, synthesizability, and drug-likeness. In the second case, we show that the approach can generate very large graphs (with hundreds of nodes) that are accepted as valid examples for a desired RNA family by the “Infernal” covariance model, a state-of-the-art RNA classifier. Our implementation is available on GitHub: this http URL

[LG-13] DeltaGNN: Graph Neural Network with Information Flow Control

Link: https://arxiv.org/abs/2501.06002
Authors: Kevin Mancini,Islem Rekik
Categories: Machine Learning (cs.LG)

Abstract:Graph Neural Networks (GNNs) are popular deep learning models designed to process graph-structured data through recursive neighborhood aggregations in the message passing process. When applied to semi-supervised node classification, message passing enables GNNs to understand short-range spatial interactions, but also causes them to suffer from over-smoothing and over-squashing. These challenges hinder model expressiveness and prevent the use of deeper models to capture long-range node interactions (LRIs) within the graph. Popular solutions for LRI detection are either too expensive to process large graphs due to high time complexity or fail to generalize across diverse graph structures. To address these limitations, we propose a mechanism called information flow control, which leverages a novel connectivity measure, called information flow score, to address over-smoothing and over-squashing with linear computational overhead, supported by theoretical evidence. Finally, to prove the efficacy of our methodology we design DeltaGNN, the first scalable and generalizable approach for detecting long-range and short-range interactions. We benchmark our model across 10 real-world datasets, including graphs with varying sizes, topologies, densities, and homophilic ratios, showing superior performance with limited computational complexity. The implementation of the proposed methods is publicly available at this https URL.

[LG-14] Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing ICASSP2025

Link: https://arxiv.org/abs/2501.05987
Authors: Eklavya Sarkar,Mathew Magimai.-Doss
Categories: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at ICASSP 2025

Abstract:Self-supervised learning (SSL) foundation models have emerged as powerful, domain-agnostic, general-purpose feature extractors applicable to a wide range of tasks. Such models pre-trained on human speech have demonstrated high transferability for bioacoustic processing. This paper investigates (i) whether SSL models pre-trained directly on animal vocalizations offer a significant advantage over those pre-trained on speech, and (ii) whether fine-tuning speech-pretrained models on automatic speech recognition (ASR) tasks can enhance bioacoustic classification. We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. Results indicate that pre-training on bioacoustic data provides only marginal improvements over speech-pretrained models, with comparable performance in most scenarios. Fine-tuning on ASR tasks yields mixed outcomes, suggesting that the general-purpose representations learned during SSL pre-training are already well-suited for bioacoustic tasks. These findings highlight the robustness of speech-pretrained SSL models for bioacoustics and imply that extensive fine-tuning may not be necessary for optimal performance.

[LG-15] Deep Variational Sequential Monte Carlo for High-Dimensional Observations

Link: https://arxiv.org/abs/2501.05982
Authors: Wessel L. van Nierop,Nir Shlezinger,Ruud J.G. van Sloun
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Sequential Monte Carlo (SMC), or particle filtering, is widely used in nonlinear state-space systems, but its performance often suffers from poorly approximated proposal and state-transition distributions. This work introduces a differentiable particle filter that leverages the unsupervised variational SMC objective to parameterize the proposal and transition distributions with a neural network, designed to learn from high-dimensional observations. Experimental results demonstrate that our approach outperforms established baselines in tracking the challenging Lorenz attractor from high-dimensional and partial observations. Furthermore, an evidence lower bound based evaluation indicates that our method offers a more accurate representation of the posterior distribution.
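
As background for the abstract above, the sketch below shows a plain bootstrap particle filter on a one-dimensional random-walk model. It is not the paper's method: here the proposal is simply the transition model and nothing is learned, whereas the paper parameterizes both distributions with a neural network. The model, noise levels, and function name are assumptions chosen for illustration.

```python
# Illustrative sketch only: a bootstrap particle filter for a 1-D random walk
# observed with Gaussian noise. Model and parameters are assumptions.
import math
import random

def bootstrap_particle_filter(observations, n_particles=200,
                              process_std=0.5, obs_std=0.5, seed=0):
    """Return the filtered posterior-mean estimate after each observation."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Propose: move every particle through the random-walk transition.
        particles = [x + rng.gauss(0.0, process_std) for x in particles]
        # Weight: Gaussian observation likelihood, computed in log space.
        log_w = [-0.5 * ((y - x) / obs_std) ** 2 for x in particles]
        max_log_w = max(log_w)
        weights = [math.exp(lw - max_log_w) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Estimate: weighted mean of the particles.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Resample: multinomial resampling proportional to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates

estimates = bootstrap_particle_filter([1.0, 1.0, 1.0, 1.0, 1.0])
```

The quality of such a filter depends heavily on how well the proposal matches the true posterior, which is the weakness that learned proposals of the kind described in the abstract are meant to address.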

[LG-16] A Brain Age Residual Biomarker (BARB): Leveraging MRI-Based Models to Detect Latent Health Conditions in U.S. Veterans

Link: https://arxiv.org/abs/2501.05970
Authors: Arthur Bousquet,Sugata Banerji,Mark F. Conneely,Shahrzad Jamshidi
Categories: Machine Learning (cs.LG)

Abstract:Age prediction using brain imaging, such as MRIs, has achieved promising results, with several studies identifying the model’s residual as a potential biomarker for chronic disease states. In this study, we developed a brain age predictive model using a dataset of 1,220 U.S. veterans (18–80 years) and convolutional neural networks (CNNs) trained on two-dimensional slices of axial T2-weighted fast spin-echo and T2-weighted fluid attenuated inversion recovery MRI images. The model, incorporating a degree-3 polynomial ensemble, achieved an R^2 of 0.816 on the testing set. Images were acquired at the level of the anterior commissure and the frontal horns of the lateral ventricles. Residual analysis was performed to assess its potential as a biomarker for five ICD-coded conditions: hypertension (HTN), diabetes mellitus (DM), mild traumatic brain injury (mTBI), illicit substance abuse/dependence (SAD), and alcohol abuse/dependence (AAD). Residuals grouped by the number of ICD-coded conditions demonstrated different trends that were statistically significant (p = 0.002), suggesting a relationship between disease states and predicted brain age. This association was particularly pronounced in patients over 49 years, where negative residuals (indicating advanced brain aging) correlated with the presence of multiple ICD codes. These findings support the potential of residuals as biomarkers for detecting latent health conditions.

[LG-17] Model Inversion in Split Learning for Personalized LLMs: New Insights from Information Bottleneck Theory

Link: https://arxiv.org/abs/2501.05965
Authors: Yunmeng Shu,Shaofeng Li,Tian Dong,Yan Meng,Haojin Zhu
Categories: Machine Learning (cs.LG)
Comments: 8 pages

Abstract:Personalized Large Language Models (LLMs) have become increasingly prevalent, showcasing the impressive capabilities of models like GPT-4. This trend has also catalyzed extensive research on deploying LLMs on mobile devices. Feasible approaches for such edge-cloud deployment include using split learning. However, previous research has largely overlooked the privacy leakage associated with intermediate representations transmitted from devices to servers. This work is the first to identify model inversion attacks in the split learning framework for LLMs, emphasizing the necessity of secure defense. For the first time, we introduce mutual information entropy to understand the information propagation of Transformer-based LLMs and assess privacy attack performance for LLM blocks. To address the issue of representations being sparser and containing less information than embeddings, we propose a two-stage attack system in which the first part projects representations into the embedding space, and the second part uses a generative model to recover text from these embeddings. This design breaks down the complexity and achieves attack scores of 38%-75% in various scenarios, with an over 60% improvement over the SOTA. This work comprehensively highlights the potential privacy risks during the deployment of personalized LLMs on the edge side.

[LG-18] Soft regression trees: a model variant and a decomposition training algorithm

Link: https://arxiv.org/abs/2501.05942
Authors: Antonio Consolo, Edoardo Amaldi, Andrea Manno
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Decision trees are widely used for classification and regression tasks in a variety of application fields due to their interpretability and good accuracy. During the past decade, growing attention has been devoted to globally optimized decision trees with deterministic or soft splitting rules at branch nodes, which are trained by optimizing the error function over all the tree parameters. In this work, we propose a new variant of soft multivariate regression trees (SRTs) where, for every input vector, the prediction is defined as the linear regression associated with a single leaf node, namely, the leaf node obtained by routing the input vector from the root along the branches with higher probability. SRTs exhibit the conditional computational property, i.e., each prediction depends on a small number of nodes (parameters), and our nonlinear optimization formulation for training them is amenable to decomposition. After showing a universal approximation result for SRTs, we present a decomposition training algorithm including a clustering-based initialization procedure and a heuristic for reassigning the input vectors along the tree. Under mild assumptions, we establish asymptotic convergence guarantees. Experiments on 15 well-known datasets indicate that our SRTs and decomposition algorithm yield higher accuracy and robustness compared with traditional soft regression trees trained using the nonlinear optimization formulation of Blanquero et al., and a significant reduction in training times as well as a slightly better average accuracy compared with the mixed-integer optimization approach of Bertsimas and Dunn. We also report a comparison with the Random Forest ensemble method.
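
The prediction rule described in the abstract above (route the input from the root along the higher-probability branch, then apply the linear model of the single leaf reached) can be illustrated with a minimal sketch. The logistic split function, the dictionary layout, and the names `srt_predict` and `toy_tree` are assumptions made for illustration; the abstract does not specify them, and training is not shown.

```python
import math


def sigmoid(z):
    """Logistic function, assumed here as the soft splitting rule."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def srt_predict(tree, x):
    """Follow the higher-probability branch at each node, then
    return the linear regression output of the leaf that is reached."""
    node = tree
    while "leaf" not in node:
        p_left = sigmoid(dot(node["w"], x) + node["b"])
        node = node["left"] if p_left >= 0.5 else node["right"]
    beta, beta0 = node["leaf"]
    return dot(beta, x) + beta0


# Hypothetical depth-1 tree: one branch node and two linear leaves.
toy_tree = {
    "w": [1.0, 0.0], "b": 0.0,
    "left": {"leaf": ([2.0, 0.0], 1.0)},
    "right": {"leaf": ([0.0, -1.0], 0.5)},
}
```

With this toy tree, `srt_predict(toy_tree, [1.0, 3.0])` reaches the left leaf and `srt_predict(toy_tree, [-1.0, 3.0])` reaches the right one, so each prediction touches only the nodes on one root-to-leaf path.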

[LG-19] Encoded Spatial Attribute in Multi-Tier Federated Learning

Link: https://arxiv.org/abs/2501.05934
Authors: Asfia Kawnine, Francis Palma, Seyed Alireza Rahimi Azghadi, Hung Cao
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: IEEE ICCE 2025

Abstract:This research presents an Encoded Spatial Multi-Tier Federated Learning approach for a comprehensive evaluation of aggregated models for geospatial data. In the client tier, encoding spatial information is introduced to better predict the target outcome. The research aims to assess the performance of these models across diverse datasets and spatial attributes, highlighting variations in predictive accuracy. Using evaluation metrics such as accuracy, our research reveals insights into the complexities of spatial granularity and the challenges of capturing underlying patterns in the data. We extend the scope of federated learning (FL) with a multi-tier structure and the encoding of spatial attributes. Our N-tier FL approach uses encoded spatial data to aggregate in different tiers. We obtained multiple models that predict different granularities of spatial data. Our findings underscore the need for further research to improve predictive accuracy and model generalization, with potential avenues including incorporating additional features, refining model architectures, and exploring alternative modeling approaches. Our experiments have several tiers representing different levels of spatial aspects. We obtained accuracies of 75.62% and 89.52% for the global model without having to train the model on the data of the designated tier. The research also highlights the importance of the proposed approach in real-time applications.

[LG-20] Text2Playlist: Generating Personalized Playlists from Text on Deezer

Link: https://arxiv.org/abs/2501.05894
Authors: Mathieu Delcluze, Antoine Khoury, Clémence Vast, Valerio Arnaudo, Léa Briand, Walid Bendada, Thomas Bouabça
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:The streaming service Deezer relies heavily on search to help users navigate its extensive music catalog. However, search is primarily designed to find specific items and does not lead directly to a smooth listening experience. We present Text2Playlist, a stand-alone tool that addresses these limitations. Text2Playlist leverages generative AI, music information retrieval and recommendation systems to generate query-specific and personalized playlists, and has been successfully deployed at scale.

[LG-21] Collaborative Content Moderation in the Fediverse

Link: https://arxiv.org/abs/2501.05871
Authors: Haris Bin Zia, Aravindh Raman, Ignacio Castro, Gareth Tyson
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:The Fediverse, a group of interconnected servers providing a variety of interoperable services (e.g. micro-blogging in Mastodon) has gained rapid popularity. This sudden growth, partly driven by Elon Musk’s acquisition of Twitter, has created challenges for administrators though. This paper focuses on one particular challenge: content moderation, e.g. the need to remove spam or hate speech. While centralized platforms like Facebook and Twitter rely on automated tools for moderation, their dependence on massive labeled datasets and specialized infrastructure renders them impractical for decentralized, low-resource settings like the Fediverse. In this work, we design and evaluate FedMod, a collaborative content moderation system based on federated learning. Our system enables servers to exchange parameters of partially trained local content moderation models with similar servers, creating a federated model shared among collaborating servers. FedMod demonstrates robust performance on three different content moderation tasks: harmful content detection, bot content detection, and content warning assignment, achieving average per-server macro-F1 scores of 0.71, 0.73, and 0.58, respectively.

[LG-22] A Neighbor-based Approach to Pitch Ownership Models in Soccer

Link: https://arxiv.org/abs/2501.05870
Authors: Tiago Mendes-Neves, Luís Meireles, João Mendes-Moreira
Subjects: Machine Learning (cs.LG)

Abstract:Pitch ownership models allow many types of analysis in soccer and provide valuable assistance to tactical analysts in understanding the game’s dynamics. The novelty they provide over event-based analysis is that tracking data incorporates context that event-based data does not possess, like player positioning. This paper proposes a novel approach to building pitch ownership models in soccer games using the K-Nearest Neighbors (KNN) algorithm. Our approach provides a fast inference mechanism that can model different approaches to pitch control using the same algorithm. Despite its flexibility, it uses only three hyperparameters to tune the model, facilitating the tuning process for different player skill levels. The flexibility of the approach allows for the emulation of different methods available in the literature by adjusting a small number of parameters, including adjusting for different levels of uncertainty. In summary, the proposed model provides a new and more flexible strategy for building pitch ownership models, extending beyond just replicating existing algorithms, and can provide valuable insights for tactical analysts and open up new avenues for future research. We thoroughly visualize several examples demonstrating the presented models’ strengths and weaknesses. The code is available at this http URL.
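
As a rough illustration of the neighbor-based idea in the abstract above, the sketch below assigns ownership of a pitch location from the team membership of its k nearest players. This is a minimal sketch under stated assumptions, not the authors' model: the abstract does not name its three hyperparameters, so only `k` is used here, and the function name `ownership`, the coordinate units, and the sample player positions are hypothetical.

```python
import math


def ownership(point, players, k=3):
    """Share of the k players nearest to `point` that belong to the
    'home' team. `players` is a list of (x, y, team) tuples."""
    nearest = sorted(
        players, key=lambda p: math.dist(point, (p[0], p[1]))
    )[:k]
    return sum(1 for p in nearest if p[2] == "home") / len(nearest)


# Hypothetical snapshot of tracking data: three players per team.
players = [
    (10.0, 10.0, "home"), (12.0, 11.0, "home"), (30.0, 20.0, "home"),
    (50.0, 30.0, "away"), (52.0, 31.0, "away"), (35.0, 22.0, "away"),
]
```

Evaluating `ownership` over a grid of pitch locations would give an ownership surface; a value near 1 means the location is surrounded by home players, a value near 0 by away players.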

[LG-23] Neural Network Verification is a Programming Language Challenge

Link: https://arxiv.org/abs/2501.05867
Authors: Lucas C. Cordeiro, Matthew L. Daggitt, Julien Girard-Satabin, Omri Isac, Taylor T. Johnson, Guy Katz, Ekaterina Komendantskaya, Augustin Lemesle, Edoardo Manino, Artjoms Šinkarovs, Haoze Wu
Subjects: Programming Languages (cs.PL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: Accepted at ESOP 2025, European Symposium on Programming Languages

Abstract:Neural network verification is a new and rapidly developing field of research. So far, the main priority has been establishing efficient verification algorithms and tools, while proper support from the programming language perspective has been considered secondary or unimportant. Yet, there is mounting evidence that insights from the programming language community may make a difference in the future development of this domain. In this paper, we formulate neural network verification challenges as programming language challenges and suggest possible future solutions.

[LG-24] “Cause” is Mechanistic Narrative within Scientific Domains: An Ordinary Language Philosophical Critique of “Causal Machine Learning”

Link: https://arxiv.org/abs/2501.05844
Authors: Vyacheslav Kungurtsev, Leonardo Christov Moore, Gustav Sir, Martin Krutsky
Subjects: Machine Learning (cs.LG)

Abstract:Causal Learning has emerged as a major theme of AI in recent years, promising to use special techniques to reveal the true nature of cause and effect in a number of important domains. We consider the Epistemology of learning and recognizing true cause and effect phenomena. Through thought exercises on the customary use of the word “cause”, especially in scientific domains, we investigate what, in practice, constitutes a valid causal claim. We recognize the word’s uses across scientific domains in disparate form but consistent function within the scientific paradigm. We highlight fundamental distinctions of practice that can be performed in the natural and social sciences, highlight the importance of many systems of interest being open and irreducible and identify the important notion of Hermeneutic knowledge for social science inquiry. We posit that the distinct properties require that definitive causal claims can only come through an agglomeration of consistent evidence across multiple domains and levels of abstraction, such as empirical, physiological, biochemical, etc. We present Cognitive Science as an exemplary multi-disciplinary field providing omnipresent opportunity for such a Research Program, and highlight the main general modes of practice of scientific inquiry that can adequately merge, rather than place as incorrigibly conflictual, multi-domain multi-abstraction scientific practices and language games.

[LG-25] Orthogonal projection-based regularization for efficient model augmentation

Link: https://arxiv.org/abs/2501.05842
Authors: Bendegúz M. Györök, Jan H. Hoekstra, Johan Kon, Tamás Péni, Maarten Schoukens, Roland Tóth
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Submitted to L4DC 2025

Abstract:Deep-learning-based nonlinear system identification has shown the ability to produce reliable and highly accurate models in practice. However, these black-box models lack physical interpretability, and often a considerable part of the learning effort is spent on capturing already expected/known behavior due to first-principles-based understanding of some aspects of the system. A potential solution is to integrate prior physical knowledge directly into the model structure, combining the strengths of physics-based modeling and deep-learning-based identification. The most common approach is to use an additive model augmentation structure, where the physics-based and the machine-learning (ML) components are connected in parallel. However, such models are overparametrized and challenging to train, which can cause the physics-based part to lose interpretability. To overcome this challenge, this paper proposes an orthogonal projection-based regularization technique to enhance parameter learning, convergence, and even model accuracy in learning-based augmentation of nonlinear baseline models.

[LG-26] Fine-tuning is Not Fine: Mitigating Backdoor Attacks in GNNs with Limited Clean Data

Link: https://arxiv.org/abs/2501.05835
Authors: Jiale Zhang, Bosen Rao, Chengcheng Zhu, Xiaobing Sun, Qingming Li, Haibo Hu, Xiapu Luo, Qingqing Ye, Shouling Ji
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Graph Neural Networks (GNNs) have achieved remarkable performance through their message-passing mechanism. However, recent studies have highlighted the vulnerability of GNNs to backdoor attacks, which can lead the model to misclassify graphs with attached triggers as the target class. The effectiveness of recent promising defense techniques, such as fine-tuning or distillation, is heavily contingent on having comprehensive knowledge of the sufficient training dataset. Empirical studies have shown that fine-tuning methods require a clean dataset of 20% to reduce attack accuracy to below 25%, while distillation methods require a clean dataset of 15%. However, obtaining such a large amount of clean data is commonly impractical. In this paper, we propose a practical backdoor mitigation framework, denoted as GRAPHNAD, which can capture high-quality intermediate-layer representations in GNNs to enhance the distillation process with limited clean data. To achieve this, we address the following key questions: How to identify the appropriate attention representations in graphs for distillation? How to enhance distillation with limited data? By adopting the graph attention transfer method, GRAPHNAD can effectively align the intermediate-layer attention representations of the backdoored model with that of the teacher model, forcing the backdoor neurons to transform into benign ones. Besides, we extract the relation maps from intermediate-layer transformation and enforce the relation maps of the backdoored model to be consistent with that of the teacher model, thereby ensuring model accuracy while further reducing the influence of backdoors. Extensive experimental results show that by fine-tuning a teacher model with only 3% of the clean data, GRAPHNAD can reduce the attack success rate to below 5%. 

[LG-27] AdaPRL: Adaptive Pairwise Regression Learning with Uncertainty Estimation for Universal Regression Tasks

Link: https://arxiv.org/abs/2501.05809
Authors: Fuhang Liang, Rucong Xu, Deng Lin
Subjects: Machine Learning (cs.LG)
Comments: 22 pages, 11 figures

Abstract:Current deep regression models usually learn in a point-wise way that treats each sample as an independent input, neglecting the relative ordering among different data. Consequently, the regression model could neglect the data's interrelationships, potentially resulting in suboptimal performance. Moreover, the existence of aleatoric uncertainty in the training data may drive the model to capture non-generalizable patterns, contributing to increased overfitting. To address these issues, we propose a novel adaptive pairwise learning framework (AdaPRL) for regression tasks which leverages the relative differences between data points and integrates with deep probabilistic models to quantify the uncertainty associated with the predictions. Additionally, we adapt AdaPRL for applications in multi-task learning and multivariate time series forecasting. Extensive experiments with several real-world regression datasets including recommendation systems, age estimation, time series forecasting, natural language understanding, finance, and industry datasets show that AdaPRL is compatible with different backbone networks in various tasks and achieves state-of-the-art performance on the vast majority of tasks, highlighting its notable potential including enhancing prediction accuracy and ranking ability, increasing generalization capability, improving robustness to noisy data, improving resilience to reduced data, and enhancing interpretability.

[LG-28] STHFL: Spatio-Temporal Heterogeneous Federated Learning

Link: https://arxiv.org/abs/2501.05775
Authors: Shunxin Guo, Hongsong Wang, Shuxia Lin, Xu Yang, Xin Geng
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Federated learning is a new framework that protects data privacy and allows multiple devices to cooperate in training machine learning models. Previous studies have proposed multiple approaches to eliminate the challenges posed by non-iid data and inter-domain heterogeneity issues. However, they ignore the spatio-temporal heterogeneity formed by different data distributions of increasing task data in the intra-domain. Moreover, in practical applications the global data generally follow a long-tailed distribution rather than being balanced. To tackle the spatio-temporal dilemma, we propose a novel setting named Spatio-Temporal Heterogeneity Federated Learning (STHFL). Specifically, the Global-Local Dynamic Prototype (GLDP) framework is designed for STHFL. In GLDP, the model in each client contains personalized layers which can dynamically adapt to different data distributions. For long-tailed data distribution, global prototypes serve as complementary knowledge for training on classes with few samples in clients without leaking privacy. As tasks increase in clients, the knowledge of local prototypes generated in previous tasks guides training in the current task to mitigate catastrophic forgetting. Meanwhile, the global-local prototypes are updated through the moving average method after training local prototypes in clients. Finally, we evaluate the effectiveness of GLDP, which achieves remarkable results compared to state-of-the-art methods in STHFL scenarios.

[LG-29] rmlnomogram: An R package to construct an explainable nomogram for any machine learning algorithms

Link: https://arxiv.org/abs/2501.05772
Authors: Herdiantri Sufriyana, Emily Chia-Yu Su
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 16 pages, 2 figures, 1 table, 3 equations, 1 algorithm, 4 code snippets

Abstract:Background: Currently, nomograms can only be created for regression algorithms. Providing nomograms for any machine learning (ML) algorithm may accelerate model deployment in clinical settings or improve model availability. We developed an R package and web application to construct nomograms with model explainability for any ML algorithm. Methods: We formulated a function to transform an ML prediction model into a nomogram, requiring datasets with: (1) all possible combinations of predictor values; (2) the corresponding outputs of the model; and (3) the corresponding explainability values for each predictor (optional). A web application was also created. Results: Our R package could create 5 types of nomograms: (1) categorical predictors and binary outcome without probability; (2) categorical predictors and binary outcome with probability or (3) continuous outcome; and (4) categorical with a single numerical predictor and binary outcome with probability or (5) continuous outcome. The first type optimally allowed a maximum of 15 predictors and the remaining types a maximum of 5 predictors, with a maximum of 3,200 combinations. The web application is provided with these limits. Explainability values were possible for types 2 to 5. Conclusions: Our R package and web application could construct nomograms with model explainability for any ML algorithm using a fair number of predictors.

[LG-30] CognoSpeak: an automatic remote assessment of early cognitive decline in real-world conversational speech

Link: https://arxiv.org/abs/2501.05755
Authors: Madhurananda Pahar, Fuxiang Tao, Bahman Mirheidari, Nathan Pevy, Rebecca Bright, Swapnil Gadgil, Lise Sproson, Dorota Braun, Caitlin Illingworth, Daniel Blackburn, Heidi Christensen
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: This paper has been accepted for publication in IEEE SSCI 2025. Copyright belongs to IEEE

Abstract:The early signs of cognitive decline are often noticeable in conversational speech, and identifying those signs is crucial in dealing with later and more serious stages of neurodegenerative diseases. Clinical detection is costly and time-consuming and although there has been recent progress in the automatic detection of speech-based cues, those systems are trained on relatively small databases, lacking detailed metadata and demographic information. This paper presents CognoSpeak and its associated data collection efforts. CognoSpeak asks memory-probing long and short-term questions and administers standard cognitive tasks such as verbal and semantic fluency and picture description using a virtual agent on a mobile or web platform. In addition, it collects multimodal data such as audio and video along with a rich set of metadata from primary and secondary care, memory clinics and remote settings like people’s homes. Here, we present results from 126 subjects whose audio was manually transcribed. Several classic classifiers, as well as large language model-based classifiers, have been investigated and evaluated across the different types of prompts. We demonstrate a high level of performance; in particular, we achieved an F1-score of 0.873 using a DistilBERT model to discriminate people with cognitive impairment (dementia and people with mild cognitive impairment (MCI)) from healthy volunteers using the memory responses, fluency tasks and cookie theft picture description. CognoSpeak is an automatic, remote, low-cost, repeatable, non-invasive and less stressful alternative to existing clinical cognitive assessments.

[LG-31] ELENA: Epigenetic Learning through Evolved Neural Adaptation

Link: https://arxiv.org/abs/2501.05735
Authors: Boris Kriuk, Keti Sulamanidze, Fedor Kriuk
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures, 4 tables, 2 algorithms

Abstract:Despite the success of metaheuristic algorithms in solving complex network optimization problems, they often struggle with adaptation, especially in dynamic or high-dimensional search spaces. Traditional approaches can become stuck in local optima, leading to inefficient exploration and suboptimal solutions. Most of the widely accepted advanced algorithms do well either on highly complex or smaller search spaces due to the lack of adaptation. To address these limitations, we present ELENA (Epigenetic Learning through Evolved Neural Adaptation), a new evolutionary framework that incorporates epigenetic mechanisms to enhance the adaptability of the core evolutionary approach. ELENA leverages compressed representation of learning parameters improved dynamically through epigenetic tags that serve as adaptive memory. Three epigenetic tags (mutation resistance, crossover affinity, and stability score) assist with guiding solution space search, facilitating a more intelligent hypothesis landscape exploration. To assess the framework performance, we conduct experiments on three critical network optimization problems: the Traveling Salesman Problem (TSP), the Vehicle Routing Problem (VRP), and the Maximum Clique Problem (MCP). Experiments indicate that ELENA achieves competitive results, often surpassing state-of-the-art methods on network optimization tasks.

[LG-32] Diving Deep: Forecasting Sea Surface Temperatures and Anomalies

Link: https://arxiv.org/abs/2501.05731
Authors: Ding Ning, Varvara Vetrova, Karin R. Bryan, Yun Sing Koh, Andreas Voskou, N’Dah Jean Kouagou, Arnab Sharma
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Applications (stat.AP)
Comments: The paper contains 9 pages for the main text and 10 pages including References. 5 figures. Discovery Track, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2024

Abstract:This overview paper details the findings from the Diving Deep: Forecasting Sea Surface Temperatures and Anomalies Challenge at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2024. The challenge focused on the data-driven predictability of global sea surface temperatures (SSTs), a key factor in climate forecasting, ecosystem management, fisheries management, and climate change monitoring. The challenge involved forecasting SST anomalies (SSTAs) three months in advance using historical data and included a special task of predicting SSTAs nine months ahead for the Baltic Sea. Participants utilized various machine learning approaches to tackle the task, leveraging data from ERA5. This paper discusses the methodologies employed, the results obtained, and the lessons learned, offering insights into the future of climate-related predictive modeling.

[LG-33] TAMER: A Test-Time Adaptive MoE-Driven Framework for EHR Representation Learning

Link: https://arxiv.org/abs/2501.05661
Authors: Yinghao Zhu, Xiaochen Zheng, Ahmed Allam, Michael Krauthammer
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, 7 tables

Abstract:We propose TAMER, a Test-time Adaptive MoE-driven framework for EHR Representation learning. TAMER combines a Mixture-of-Experts (MoE) with Test-Time Adaptation (TTA) to address two critical challenges in EHR modeling: patient population heterogeneity and distribution shifts. The MoE component handles diverse patient subgroups, while TTA enables real-time adaptation to evolving health status distributions when new patient samples are introduced. Extensive experiments across four real-world EHR datasets demonstrate that TAMER consistently improves predictive performance for both mortality and readmission risk tasks when combined with diverse EHR modeling backbones. TAMER offers a promising approach for dynamic and personalized EHR-based predictions in practical clinical settings. Code is publicly available at this https URL.

[LG-34] A Practical Cross-Layer Approach for ML-Driven Storage Placement in Warehouse-Scale Computers

Link: https://arxiv.org/abs/2501.05651
Authors: Chenxi Yang, Yan Li, Martin Maas, Mustafa Uysal, Ubaid Ullah Hafeez, Arif Merchant, Richard McDougall
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Storage systems account for a major portion of the total cost of ownership (TCO) of warehouse-scale computers, and thus have a major impact on the overall system’s efficiency. Machine learning (ML)-based methods for solving key problems in storage system efficiency, such as data placement, have shown significant promise. However, there are few known practical deployments of such methods. Studying this problem in the context of real-world hyperscale data center deployments at Google, we identify a number of challenges that we believe cause this lack of practical adoption. Specifically, prior work assumes a monolithic model that resides entirely within the storage layer, an unrealistic assumption in real-world data center deployments. We propose a cross-layer approach that moves ML out of the storage system and performs it in the application running on top of it, co-designed with a scheduling algorithm at the storage layer that consumes predictions from these application-level models. This approach combines small, interpretable models with a co-designed heuristic that adapts to different online environments. We build a proof-of-concept of this approach in a production distributed computation framework at Google. Evaluations in a test deployment and large-scale simulation studies using production traces show improvements of as much as 3.47x in TCO savings compared to state of the art baselines. We believe this work represents a significant step towards more practical ML-driven storage placement in warehouse-scale computers.

[LG-35] Enhancing Unsupervised Graph Few-shot Learning via Set Functions and Optimal Transport

Link: https://arxiv.org/abs/2501.05635
Authors: Yonghao Liu, Fausto Giunchiglia, Ximing Li, Lan Huang, Xiaoyue Feng, Renchu Guan
Subjects: Machine Learning (cs.LG)
Comments: KDD2025

Abstract:Graph few-shot learning has garnered significant attention for its ability to rapidly adapt to downstream tasks with limited labeled data, sparking considerable interest among researchers. Recent advancements in graph few-shot learning models have exhibited superior performance across diverse applications. Despite their successes, several limitations still exist. First, existing models in the meta-training phase predominantly focus on instance-level features within tasks, neglecting crucial set-level features essential for distinguishing between different categories. Second, these models often utilize query sets directly on classifiers trained with support sets containing only a few labeled examples, overlooking potential distribution shifts between these sets and leading to suboptimal performance. Finally, previous models typically require abundant labeled data from base classes to extract transferable knowledge, which is often infeasible in real-world scenarios. To address these issues, we propose a novel model named STAR, which leverages Set funcTions and optimAl tRansport for enhancing unsupervised graph few-shot learning. Specifically, STAR utilizes expressive set functions to obtain set-level features in an unsupervised manner and employs optimal transport principles to align the distributions of support and query sets, thereby mitigating distribution shift effects. Theoretical analysis demonstrates that STAR can capture more task-relevant information and enhance generalization capabilities. Empirically, extensive experiments across multiple datasets validate the effectiveness of STAR. Our code can be found here.

[LG-36] Regularized Top-k: A Bayesian Framework for Gradient Sparsification

Link: https://arxiv.org/abs/2501.05633
Authors: Ali Bereyhi, Ben Liang, Gary Boudreau, Ali Afana
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)

Abstract:Error accumulation is effective for gradient sparsification in distributed settings: initially-unselected gradient entries are eventually selected as their accumulated error exceeds a certain level. The accumulation essentially behaves as a scaling of the learning rate for the selected entries. Although this property prevents the slow-down of lateral movements in distributed gradient descent, it can deteriorate convergence in some settings. This work proposes a novel sparsification scheme that controls the learning rate scaling of error accumulation. The development of this scheme follows two major steps: first, gradient sparsification is formulated as an inverse probability (inference) problem, and the Bayesian optimal sparsification mask is derived as a maximum-a-posteriori estimator. Using the prior distribution inherited from Top-k, we derive a new sparsification algorithm which can be interpreted as a regularized form of Top-k. We call this algorithm regularized Top-k (RegTop-k). It utilizes past aggregated gradients to evaluate posterior statistics of the next aggregation. It then prioritizes the local accumulated gradient entries based on these posterior statistics. We validate our derivation through numerical experiments. In distributed linear regression, it is observed that while Top-k remains at a fixed distance from the global optimum, RegTop-k converges to the global optimum at significantly higher compression ratios. We further demonstrate the generalization of this observation by employing RegTop-k in distributed training of ResNet-18 on CIFAR-10, where it noticeably outperforms Top-k.
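
The baseline that this abstract builds on, Top-k sparsification with error accumulation, can be sketched in a few lines. The sketch below shows only that standard baseline, not RegTop-k, whose posterior-based prioritization is not specified in the abstract. The function name `topk_with_error_feedback` and the toy gradients are illustrative assumptions.

```python
def topk_with_error_feedback(gradient, residual, k):
    """One step of Top-k sparsification with error accumulation.

    Adds the carried-over residual to the fresh gradient, sends the k
    largest-magnitude entries, and keeps the rest as the new residual.
    """
    accumulated = [g + r for g, r in zip(gradient, residual)]
    # Indices of the k entries with the largest absolute value.
    order = sorted(
        range(len(accumulated)),
        key=lambda i: abs(accumulated[i]),
        reverse=True,
    )
    keep = set(order[:k])
    sparse = [a if i in keep else 0.0 for i, a in enumerate(accumulated)]
    new_residual = [0.0 if i in keep else a for i, a in enumerate(accumulated)]
    return sparse, new_residual


# Two toy steps with k=1: entry 1 is skipped first, then selected
# once its accumulated value overtakes entry 0.
residual = [0.0, 0.0, 0.0]
sent1, residual = topk_with_error_feedback([1.0, 0.4, 0.1], residual, k=1)
sent2, residual = topk_with_error_feedback([0.5, 0.4, 0.1], residual, k=1)
```

In the second step the transmitted value for entry 1 is the sum of two skipped gradients, which is the learning-rate scaling effect the abstract refers to.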

[LG-37] Towards Probabilistic Inference of Human Motor Intentions by Assistive Mobile Robots Controlled via a Brain-Computer Interface

Link: https://arxiv.org/abs/2501.05610
Authors: Xiaoshan Zhou, Carol M. Menassa, Vineet R. Kamat
Subjects: Robotics (cs.RO); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:Assistive mobile robots are a transformative technology that helps persons with disabilities regain the ability to move freely. Although autonomous wheelchairs significantly reduce user effort, they still require human input to allow users to maintain control and adapt to changing environments. Brain Computer Interface (BCI) stands out as a highly user-friendly option that does not require physical movement. Current BCI systems can understand whether users want to accelerate or decelerate, but they implement these changes in discrete speed steps rather than allowing for smooth, continuous velocity adjustments. This limitation prevents the systems from mimicking the natural, fluid speed changes seen in human self-paced motion. The authors aim to address this limitation by redesigning the perception-action cycle in a BCI controlled robotic system: improving how the robotic agent interprets the user’s motion intentions (world state) and implementing these actions in a way that better reflects natural physical properties of motion, such as inertia and damping. The scope of this paper focuses on the perception aspect. We asked and answered a normative question “what computation should the robotic agent carry out to optimally perceive incomplete or noisy sensory observations?” Empirical EEG data were collected, and probabilistic representations that served as world state distributions were learned and evaluated in a Generative Adversarial Network framework. The ROS framework was established that connected with a Gazebo environment containing a digital twin of an indoor space and a virtual model of a robotic wheelchair. Signal processing and statistical analyses were implemented to identify the most discriminative features in the spatial-spectral-temporal dimensions, which are then used to construct the world model for the robotic agent to interpret user motion intentions as a Bayesian observer.

[LG-38] Session-Level Dynamic Ad Load Optimization using Offline Robust Reinforcement Learning KDD2025

Link: https://arxiv.org/abs/2501.05591
Authors: Tao Liu,Qi Xu,Wei Shi,Zhigang Hua,Shuang Yang
Categories: Machine Learning (cs.LG)
*Comments: Will appear in KDD 2025

Abstract:Session-level dynamic ad load optimization aims to personalize the density and types of delivered advertisements in real time during a user’s online session by dynamically balancing user experience quality and ad monetization. Traditional causal learning-based approaches struggle with key technical challenges, especially in handling confounding bias and distribution shifts. In this paper, we develop an offline deep Q-network (DQN)-based framework that effectively mitigates confounding bias in dynamic systems and demonstrates more than 80% offline gains compared to the best causal learning-based production baseline. Moreover, to improve the framework’s robustness against unanticipated distribution shifts, we further enhance our framework with a novel offline robust dueling DQN approach. This approach achieves more stable rewards on multiple OpenAI-Gym datasets as perturbations increase, and provides an additional 5% offline gains on real-world ad delivery data. Deployed across multiple production systems, our approach has achieved outsized topline gains. Post-launch online A/B tests have shown double-digit improvements in the engagement-ad score trade-off efficiency, significantly enhancing our platform’s capability to serve both consumers and advertisers.
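
For readers unfamiliar with the "dueling" part of a dueling DQN, the sketch below shows the standard aggregation step that combines a state value with per-action advantages. It does not reproduce the robust offline variant proposed in the paper; the function name and the three hypothetical ad-load actions are assumptions for illustration.

```python
def dueling_q_values(state_value, advantages):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


# Hypothetical example: three ad-load levels (low, medium, high) in one state.
q_values = dueling_q_values(state_value=2.0, advantages=[1.0, 0.0, -1.0])
best_action = max(range(len(q_values)), key=lambda i: q_values[i])
```

Subtracting the mean advantage keeps the value and advantage streams identifiable, since otherwise a constant could be shifted between them without changing Q.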

[LG-39] Enforcing Fundamental Relations via Adversarial Attacks on Input Parameter Correlations

Link: https://arxiv.org/abs/2501.05588
Authors: Timo Saala,Lucie Flek,Alexander Jung,Akbar Karimi,Alexander Schmidt,Matthias Schott,Philipp Soldin,Christopher Wiebusch
Categories: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*Comments: 12 pages, 8 figures (Without appendix)

Abstract:Correlations between input parameters play a crucial role in many scientific classification tasks, since these are often related to fundamental laws of nature. For example, in high energy physics, one of the common deep learning use-cases is the classification of signal and background processes in particle collisions. In many such cases, the fundamental principles of the correlations between observables are often better understood than the actual distributions of the observables themselves. In this work, we present a new adversarial attack algorithm called Random Distribution Shuffle Attack (RDSA), emphasizing the correlations between observables in the network rather than individual feature characteristics. Correct application of the proposed novel attack can result in a significant improvement in classification performance - particularly in the context of data augmentation - when using the generated adversaries within adversarial training. Given that correlations between input features are also crucial in many other disciplines, we demonstrate the effectiveness of RDSA on six classification tasks, including two particle collision challenges (using CERN Open Data), hand-written digit recognition (MNIST784), human activity recognition (HAR), weather forecasting (Rain in Australia), and ICU patient mortality (MIMIC-IV), demonstrating a general use case beyond fundamental physics for this new type of adversarial attack algorithm.

[LG-40] Learned Discrepancy Reconstruction and Benchmark Dataset for Magnetic Particle Imaging

Link: https://arxiv.org/abs/2501.05583
Authors: Meira Iske,Hannes Albers,Tobias Knopp,Tobias Kluth
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*Comments:

Abstract:Magnetic Particle Imaging (MPI) is an emerging imaging modality based on the magnetic response of superparamagnetic iron oxide nanoparticles to achieve high-resolution and real-time imaging without harmful radiation. One key challenge in the MPI image reconstruction task arises from its underlying noise model, which does not fulfill the implicit Gaussian assumptions that are made when applying traditional reconstruction approaches. To address this challenge, we introduce the Learned Discrepancy Approach, a novel learning-based reconstruction method for inverse problems that includes a learned discrepancy function. It enhances traditional techniques by incorporating an invertible neural network to explicitly model problem-specific noise distributions. This approach does not rely on implicit Gaussian noise assumptions, making it especially suited to handle the sophisticated noise model in MPI and also applicable to other inverse problems. To further advance MPI reconstruction techniques, we introduce the MPI-MNIST dataset - a large collection of simulated MPI measurements derived from the MNIST dataset of handwritten digits. The dataset includes noise-perturbed measurements generated from state-of-the-art model-based system matrices and measurements of a preclinical MPI scanner device. This provides a realistic and flexible environment for algorithm testing. Validated against the MPI-MNIST dataset, our method demonstrates significant improvements in reconstruction quality in terms of structural similarity when compared to classical reconstruction techniques.

[LG-41] Analog Bayesian neural networks are insensitive to the shape of the weight distribution NEURIPS2024

Link: https://arxiv.org/abs/2501.05564
Authors: Ravi G. Patel,T. Patrick Xiao,Sapan Agarwal,Christopher Bennett
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Machine Learning (stat.ML)
*Comments: Presented at the NeurIPS 2024 Workshop on Machine Learning with New Compute Paradigms

Abstract:Recent work has demonstrated that Bayesian neural networks (BNNs) trained with mean field variational inference (MFVI) can be implemented in analog hardware, promising orders of magnitude energy savings compared to the standard digital implementations. However, while Gaussians are typically used as the variational distribution in MFVI, it is difficult to precisely control the shape of the noise distributions produced by sampling analog devices. This paper introduces a method for MFVI training using real device noise as the variational distribution. Furthermore, we demonstrate empirically that the predictive distributions from BNNs with the same weight means and variances converge to the same distribution, regardless of the shape of the variational distribution. This result suggests that analog device designers do not need to consider the shape of the device noise distribution when hardware-implementing BNNs performing MFVI.

[LG-42] Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters

Link: https://arxiv.org/abs/2501.05563
Authors: Ziyue Luo,Jia Liu,Myungjin Lee,Ness B. Shroff
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments: INFOCOM 2025

Abstract:The recent explosive growth of deep learning (DL) models has necessitated a compelling need for efficient job scheduling for distributed deep learning training with mixed parallelisms (DDLwMP) in GPU clusters. This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm, a novel prediction-assisted online scheduling approach designed to mitigate the challenges associated with DL cluster scheduling. By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models and their associated distributed training configurations, A-SRPT strategically assigns jobs to the available GPUs, thereby minimizing inter-server communication overhead. Observing that most DDLwMP jobs recur, A-SRPT incorporates a random forest regression model to predict training iterations. Crucially, A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive “shortest-remaining-processing-time-first” strategy. This optimized solution serves as a guide for actual job scheduling within the GPU clusters, leading to a theoretically provable competitive scheduling efficiency. We conduct extensive real-world testbed and simulation experiments to verify our proposed algorithms.
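
The single-machine strategy that the abstract builds on can be sketched as follows. This is plain preemptive shortest-remaining-processing-time-first with known processing times, not A-SRPT itself: the random-forest prediction step and the mapping onto GPUs are not reproduced, and the function name and job format are assumptions for illustration.

```python
import heapq


def srpt_completion_times(jobs):
    """Preemptive SRPT on one machine.

    jobs: list of (arrival_time, processing_time) tuples.
    Returns the completion time of each job, in input order.
    """
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    ready = []   # min-heap of (remaining_time, job_index)
    done = {}
    time, nxt = 0.0, 0
    while nxt < len(order) or ready:
        if not ready:
            # Machine is idle: jump to the next arrival.
            time = max(time, jobs[order[nxt]][0])
        # Admit every job that has arrived by now.
        while nxt < len(order) and jobs[order[nxt]][0] <= time:
            i = order[nxt]
            heapq.heappush(ready, (jobs[i][1], i))
            nxt += 1
        remaining, i = heapq.heappop(ready)
        # Run the shortest job until it finishes or a new job arrives.
        next_arrival = jobs[order[nxt]][0] if nxt < len(order) else float("inf")
        run = min(remaining, next_arrival - time)
        time += run
        if run < remaining:
            heapq.heappush(ready, (remaining - run, i))  # preempted
        else:
            done[i] = time
    return [done[i] for i in range(len(jobs))]


completions = srpt_completion_times([(0, 5), (1, 2), (2, 1)])
```

In this toy instance the long first job is preempted by the two shorter arrivals and finishes last, which is the behaviour that minimizes average completion time on a single machine.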

[LG-43] Emergent weight morphologies in deep neural networks

Link: https://arxiv.org/abs/2501.05550
Authors: Pascal de Jong,Felix Meigel,Steffen Rulands
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*Comments:

Abstract:Whether deep neural networks can exhibit emergent behaviour is not only relevant for understanding how deep learning works, it is also pivotal for estimating potential security risks of increasingly capable artificial intelligence systems. Here, we show that training deep neural networks gives rise to emergent weight morphologies independent of the training data. Specifically, in analogy to condensed matter physics, we derive a theory that predicts that the homogeneous state of deep neural networks is unstable in a way that leads to the emergence of periodic channel structures. We verified these structures by performing numerical experiments on a variety of data sets. Our work demonstrates emergence in the training of deep neural networks, which impacts the achievable performance of deep neural networks.

[LG-44] NSChat: A Chatbot System To Rule Them All

Link: https://arxiv.org/abs/2501.05541
Authors: Zenon Lamprou,Yashar Moshfeghi
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:The rapid advancement of artificial intelligence has resulted in the advent of large language models (LLMs) with the capacity to produce text that closely resembles human communication. These models have been seamlessly integrated into diverse applications, enabling interactive and responsive communication across multiple platforms. The potential utility of chatbots transcends these traditional applications, particularly in research contexts, wherein they can offer valuable insights and facilitate the design of innovative experiments. In this study, we present NSChat, a web-based chatbot system designed to assist in neuroscience research. The system is meticulously designed to function as an experimental instrument rather than a conventional chatbot, requiring users to enter a username and experiment code upon access. This setup facilitates precise data cross-referencing, thereby augmenting the integrity and applicability of the data collected for research purposes. It can be easily expanded to accommodate new basic events as needed, and it allows researchers to integrate their own logging events without the necessity of implementing a separate logging mechanism. It is worth noting that our system was built primarily to assist neuroscience research but is not limited to it: it can easily be adapted to assist information retrieval research or interaction with chatbot agents in general.

[LG-45] Neural Architecture Codesign for Fast Physics Applications

Link: https://arxiv.org/abs/2501.05515
Authors: Jason Weitz,Dmitri Demler,Luke McDermott,Nhan Tran,Javier Duarte
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); High Energy Physics - Experiment (hep-ex); Instrumentation and Detectors (physics.ins-det)
*Comments: 21 pages, 6 figures

Abstract:We develop a pipeline to streamline neural architecture codesign for physics applications to reduce the need for ML expertise when designing models for novel tasks. Our method employs neural architecture search and network compression in a two-stage approach to discover hardware efficient models. This approach consists of a global search stage that explores a wide range of architectures while considering hardware constraints, followed by a local search stage that fine-tunes and compresses the most promising candidates. We exceed performance on various tasks and show further speedup through model compression techniques such as quantization-aware-training and neural network pruning. We synthesize the optimal models to high level synthesis code for FPGA deployment with the hls4ml library. Additionally, our hierarchical search space provides greater flexibility in optimization, which can easily extend to other tasks and domains. We demonstrate this with two case studies: Bragg peak finding in materials science and jet classification in high energy physics, achieving models with improved accuracy, smaller latencies, or reduced resource utilization relative to the baseline models.

[LG-46] Shrink the longest: improving latent space isotropy with symplicial geometry

Link: https://arxiv.org/abs/2501.05502
Authors: Sergei Kudriashov,Olesya Karpik,Eduard Klyshinsky
Categories: Machine Learning (cs.LG)
*Comments: AIST-2024

Abstract:Although transformer-based models have been dominating the field of deep learning, various studies of their embedding space have shown that they suffer from the “representation degeneration problem”: embeddings tend to be distributed in a narrow cone, making the latent space highly anisotropic. Increasing the isotropy has been shown to improve performance in downstream tasks both in static and contextual language models. However, most approaches either add inference overhead or require a substantial amount of data for model reparametrization. We propose a novel regularization technique based on simplicial geometry to improve the isotropy of latent representations. The core idea of our method is based on maximizing the persistent entropy of barcodes obtained using Vietoris-Rips filtration from contextual embeddings in the underlying latent space. We demonstrate that the method leads to an increase in downstream performance while significantly lowering the anisotropy during fine-tuning by exploiting existing geometric structures instead of reparametrization.
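
The quantity being maximized here, persistent entropy, has a short standard definition: normalize the bar lengths of a barcode to a probability vector and take its Shannon entropy. The sketch below computes it from a list of (birth, death) intervals. Building the Vietoris-Rips barcode from embeddings and turning the entropy into a training regularizer are not shown, and the example barcodes are made up for illustration.

```python
import math


def persistent_entropy(intervals):
    """Shannon entropy of normalized bar lengths of a persistence barcode.

    intervals: list of (birth, death) pairs with finite death values.
    """
    lengths = [death - birth for birth, death in intervals if death > birth]
    total = sum(lengths)
    return -sum((l / total) * math.log(l / total) for l in lengths)


# Hypothetical barcodes: one dominated by a single long bar, and the same
# barcode after the longest bar has been shortened.
skewed = [(0.0, 4.0), (0.0, 0.5), (0.0, 0.5)]
shrunk = [(0.0, 1.0), (0.0, 0.5), (0.0, 0.5)]
entropy_skewed = persistent_entropy(skewed)
entropy_shrunk = persistent_entropy(shrunk)
```

Shortening the longest bar makes the lengths more uniform and raises the entropy, which matches the intuition behind the "shrink the longest" title; the maximum for n bars is log(n), reached when all bars have equal length.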

[LG-47] Generalization of Urban Wind Environment Using Fourier Neural Operator Across Different Wind Directions and Cities

Link: https://arxiv.org/abs/2501.05499
Authors: Cheng Chen,Geng Tian,Shaoxiang Qin,Senwen Yang,Dingyang Geng,Dongxue Zhan,Jinqiu Yang,David Vidal,Liangzhu Leon Wang
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)
*Comments:

Abstract:Simulation of urban wind environments is crucial for urban planning, pollution control, and renewable energy utilization. However, the computational requirements of high-fidelity computational fluid dynamics (CFD) methods make them impractical for real cities. To address these limitations, this study investigates the effectiveness of the Fourier Neural Operator (FNO) model in predicting flow fields under different wind directions and urban layouts. By training the model on velocity data from large eddy simulations, we evaluate the performance of the model under different urban configurations and wind conditions. The results show that the FNO model can provide accurate predictions while significantly reducing the computational time by 99%. Our innovative approach of dividing the wind field into smaller spatial blocks for training improves the ability of the FNO model to capture wind frequency features effectively. The SDF data also provides important spatial building information, enhancing the model’s ability to recognize physical boundaries and generate more realistic predictions. The proposed FNO approach enhances the AI model’s generalizability for different wind directions and urban layouts.

[LG-48] Generative Flow Networks: Theory and Applications to Structure Learning

Link: https://arxiv.org/abs/2501.05498
Authors: Tristan Deleu
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Without any assumptions about data generation, multiple causal models may explain our observations equally well. To avoid selecting a single arbitrary model that could result in unsafe decisions if it does not match reality, it is therefore essential to maintain a notion of epistemic uncertainty about our possible candidates. This thesis studies the problem of structure learning from a Bayesian perspective, approximating the posterior distribution over the structure of a causal model, represented as a directed acyclic graph (DAG), given data. It introduces Generative Flow Networks (GFlowNets), a novel class of probabilistic models designed for modeling distributions over discrete and compositional objects such as graphs. They treat generation as a sequential decision making problem, constructing samples of a target distribution defined up to a normalization constant piece by piece. In the first part of this thesis, we present the mathematical foundations of GFlowNets, their connections to existing domains of machine learning and statistics such as variational inference and reinforcement learning, and their extensions beyond discrete problems. In the second part of this thesis, we show how GFlowNets can approximate the posterior distribution over DAG structures of causal Bayesian Networks, along with the parameters of its causal mechanisms, given observational and experimental data.

[LG-49] Mathematical Modeling and Machine Learning for Predicting Shade-Seeking Behavior in Cows Under Heat Stress

Link: https://arxiv.org/abs/2501.05494
Authors: S. Sanjuan,D. A. Méndez,R. Arnau,J. M. Calabuig,X. Díaz de Otálora Aguirre,F. Estellés
Categories: Machine Learning (cs.LG)
*Comments: 22 pages, 10 figures

Abstract:In this paper we develop a mathematical model combined with machine learning techniques to predict shade-seeking behavior in cows exposed to heat stress. The approach integrates advanced mathematical features, such as time-averaged thermal indices and accumulated heat stress metrics, obtained by mathematical analysis of data from a farm in Titaguas (Valencia, Spain), collected during the summer of 2023. Two predictive models, Random Forests and Neural Networks, are compared for accuracy, robustness, and interpretability. The Random Forest model is highlighted for its balance between precision and explainability, achieving an RMSE of 14.97. The methodology also employs 5-fold cross-validation to ensure robustness under real-world conditions. This work not only advances the mathematical modeling of animal behavior but also provides useful insights for mitigating heat stress in livestock through data-driven tools.

[LG-50] Monotonic Learning in the PAC Framework: A New Perspective

Link: https://arxiv.org/abs/2501.05493
Authors: Ming Li,Chenyi Zhang,Qin Li
Categories: Machine Learning (cs.LG)
*Comments: 16 pages

Abstract:Monotone learning refers to learning processes in which expected performance consistently improves as more training data is introduced. Non-monotone behavior of machine learning has been the topic of a series of recent works, with various proposals that ensure monotonicity by applying transformations or wrappers on learning algorithms. In this work, from a different perspective, we tackle the topic of monotone learning within the framework of Probably Approximately Correct (PAC) learning theory. Following the mechanism that estimates sample complexity of a PAC-learnable problem, we derive a performance lower bound for that problem, and prove the monotonicity of that bound as the sample sizes increase. By calculating the lower bound distribution, we are able to prove that given a PAC-learnable problem with a hypothesis space that is either of finite size or of finite VC dimension, any learning algorithm based on Empirical Risk Minimization (ERM) is monotone if training samples are independent and identically distributed (i.i.d.). We further carry out an experiment on two concrete machine learning problems, one of which has a finite hypothesis set, and the other of finite VC dimension, and compared the experimental data for the empirical risk distributions with the estimated theoretical bound. The results of the comparison have confirmed the monotonicity of learning for the two PAC-learnable problems.
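
As background for the sample-complexity mechanism mentioned in this abstract, the sketch below evaluates the textbook PAC bound for a finite hypothesis class in the realizable setting, m >= (ln|H| + ln(1/delta)) / epsilon. This is the classical bound, not the performance lower bound derived in the paper, and the function name and numbers are assumptions for illustration.

```python
import math


def sample_complexity_finite(hypothesis_count, epsilon, delta):
    """Textbook realizable-case PAC bound for a finite hypothesis class.

    Returns the smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon.
    """
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)


# Hypothetical setting: 1000 hypotheses, 95% confidence.
m_coarse = sample_complexity_finite(1000, epsilon=0.10, delta=0.05)
m_fine = sample_complexity_finite(1000, epsilon=0.05, delta=0.05)
```

Read in the other direction, a fixed epsilon and delta give an error guarantee that only tightens as the number of samples grows, which is the kind of monotone behaviour the paper studies for ERM learners.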

[LG-51] Machine Learning Force-Field Approach for Itinerant Electron Magnets

Link: https://arxiv.org/abs/2501.06171
Authors: Sheng Zhang,Yunhao Fan,Kotaro Shimizu,Gia-Wei Chern
Categories: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments: 18 pages, 8 figures

Abstract:We review the recent development of machine-learning (ML) force-field frameworks for Landau-Lifshitz-Gilbert (LLG) dynamics simulations of itinerant electron magnets, focusing on the general theory and implementations of symmetry-invariant representations of spin configurations. The crucial properties that such magnetic descriptors must satisfy are differentiability with respect to spin rotations and invariance to both lattice point-group symmetry and internal spin rotation symmetry. We propose an efficient implementation based on the concept of reference irreducible representations, modified from the group-theoretical power-spectrum and bispectrum methods. The ML framework is demonstrated using the s-d models, which are widely applied in spintronics research. We show that LLG simulations based on local fields predicted by the trained ML models successfully reproduce representative non-collinear spin structures, including 120°, tetrahedral, and skyrmion crystal orders of the triangular-lattice s-d models. Large-scale thermal quench simulations enabled by ML models further reveal intriguing freezing dynamics and glassy stripe states consisting of skyrmions and bi-merons. Our work highlights the utility of ML force-field approach to dynamical modeling of complex spin orders in itinerant electron magnets.

[LG-52] Efficient Transition State Searches by Freezing String Method with Graph Neural Network Potentials

Link: https://arxiv.org/abs/2501.06159
Authors: Jonah Marks,Joseph Gomes
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*Comments: 9 pages, 4 figures, 3 tables

Abstract:Transition states are a critical bottleneck in chemical transformations. Significant efforts have been made to develop algorithms that efficiently locate transition states on potential energy surfaces. However, the computational cost of ab-initio potential energy surface evaluation limits the size of chemical systems that can be routinely studied. In this work, we develop and fine-tune a graph neural network potential energy function suitable for describing organic chemical reactions and use it to rapidly identify transition state guess structures. We successfully refine guess structures and locate a transition state in each test system considered and reduce the average number of ab-initio calculations by 47% through use of the graph neural network potential energy function. Our results show that modern machine learning models have reached levels of reliability whereby they can be used to accelerate routine computational chemistry tasks.

[LG-53] Inferring High-Order Couplings with Neural Networks

Link: https://arxiv.org/abs/2501.06108
Authors: Aurélien Decelle,Alfonso de Jesús Navas Gómez,Beatriz Seoane
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*Comments: 13 Pages and 3 Figures

Abstract:Maximum-entropy methods, rooted in the inverse Ising/Potts problem from statistical mechanics, have become indispensable tools for modeling pairwise interactions in disciplines such as bioinformatics, ecology, and neuroscience. Despite their remarkable success, these methods often overlook high-order interactions that may be crucial in complex systems. Conversely, while modern machine learning approaches can capture such interactions, existing interpretable frameworks are computationally expensive, making it impractical to assess the relevance of high-order interactions in real-world scenarios. Restricted Boltzmann Machines (RBMs) offer a computationally efficient alternative by encoding statistical correlations via hidden nodes in a bipartite neural network. Here, we present a method that maps RBMs exactly onto generalized Potts models with interactions of arbitrary high order. This approach leverages large-N approximations, facilitated by the simple architecture of the RBM, to enable the efficient extraction of effective many-body couplings with minimal computational cost. This mapping also enables the development of a general formal framework for the extraction of effective higher-order interactions in arbitrarily complex probabilistic models. Additionally, we introduce a robust formalism for gauge fixing within the generalized Potts model. We validate our method by accurately recovering two- and three-body interactions from synthetic datasets. Additionally, applying our framework to protein sequence data demonstrates its effectiveness in reconstructing protein contact maps, achieving performance comparable to state-of-the-art inverse Potts models. These results position RBMs as a powerful and efficient tool for investigating high-order interactions in complex systems.

[LG-54] Averaged Adam accelerates stochastic optimization in the training of deep neural network approximations for partial differential equation and optimal control problems

Link: https://arxiv.org/abs/2501.06081
Authors: Steffen Dereich,Arnulf Jentzen,Adrian Riekert
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments: 25 pages, 10 figures

Abstract:Deep learning methods - usually consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays omnipresent in data-driven learning problems as well as in scientific computing tasks such as optimal control (OC) and partial differential equation (PDE) problems. In practically relevant learning tasks, often not the plain-vanilla standard SGD optimization method is employed to train the considered class of DNNs but instead more sophisticated adaptive and accelerated variants of the standard SGD method such as the popular Adam optimizer are used. Inspired by the classical Polyak-Ruppert averaging approach, in this work we apply averaged variants of the Adam optimizer to train DNNs to approximately solve exemplary scientific computing problems in the form of PDEs and OC problems. We test the averaged variants of Adam in a series of learning problems including physics-informed neural network (PINN), deep backward stochastic differential equation (deep BSDE), and deep Kolmogorov approximations for PDEs (such as heat, Black-Scholes, Burgers, and Allen-Cahn PDEs), including DNN approximations for OC problems, and including DNN approximations for image classification problems (ResNet for CIFAR-10). In each of the numerical examples the employed averaged variants of Adam outperform the standard Adam and the standard SGD optimizers, particularly, in the situation of the scientific machine learning problems. The Python source codes for the numerical experiments associated to this work can be found on GitHub at this https URL.
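
The idea of combining Adam with Polyak-Ruppert style averaging can be sketched in a few lines: run the usual Adam update and, alongside it, keep a running mean of the iterates, which is then used as the final estimate. The sketch below uses an equal-weight average over all iterates on a one-dimensional toy problem; the averaging variants studied in the paper may differ, and the function name, hyperparameters, and test objective are assumptions for illustration.

```python
import math


def averaged_adam(grad, x0, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam on a scalar parameter plus an equal-weight running mean of iterates."""
    x, m, v, avg = x0, 0.0, 0.0, 0.0
    iterates = []
    for t in range(1, steps + 1):
        g = grad(x)
        # Standard Adam moment estimates with bias correction.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
        # Polyak-Ruppert style averaging: incremental mean of x_1, ..., x_t.
        avg += (x - avg) / t
        iterates.append(x)
    return x, avg, iterates


# Toy objective f(x) = x^2 / 2, whose gradient is x (hypothetical test problem).
x_last, x_avg, xs = averaged_adam(lambda x: x, x0=2.0, steps=50)
```

With noisy gradients the averaged iterate is typically less sensitive to the fluctuations of the last step; in practice the average is often restricted to a tail of the trajectory or replaced by an exponential moving average.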

[LG-55] Q-MAML: Quantum Model-Agnostic Meta-Learning for Variational Quantum Algorithms AAAI25

Link: https://arxiv.org/abs/2501.05906
Authors: Junyong Lee,JeiHee Cho,Shiho Kim
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: 8 pages, 8 figures, to be published in AAAI 25

Abstract:In the Noisy Intermediate-Scale Quantum (NISQ) era, using variational quantum algorithms (VQAs) to solve optimization problems has become a key application. However, these algorithms face significant challenges, such as choosing an effective initial set of parameters and the limited quantum processing time that restricts the number of optimization iterations. In this study, we introduce a new framework for optimizing parameterized quantum circuits (PQCs) that employs a classical optimizer, inspired by the Model-Agnostic Meta-Learning (MAML) technique. This approach aims to achieve better parameter initialization that ensures fast convergence. Our framework features a classical neural network, called Learner, which interacts with a PQC using the output of Learner as an initial parameter. During the pre-training phase, Learner is trained with a meta-objective based on the quantum circuit cost function. In the adaptation phase, the framework requires only a few PQC updates to converge to a more accurate value, while the learner remains unchanged. This method is highly adaptable and is effectively extended to various Hamiltonian optimization problems. We validate our approach through experiments, including distribution function mapping and optimization of the Heisenberg XYZ Hamiltonian. The result implies that the Learner successfully estimates initial parameters that generalize across the problem space, enabling fast adaptation.

[LG-56] Discovery of sustainable energy materials via the machine-learned material space

Link: https://arxiv.org/abs/2501.05903
Authors: Malte Grunert,Max Großmann,Erich Runge
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments:

Abstract:Does a machine learning model actually gain an understanding of the material space? We answer this question in the affirmative on the example of the OptiMate model, a graph attention network trained to predict the optical properties of semiconductors and insulators. By applying the UMAP dimensionality reduction technique to its latent embeddings, we demonstrate that the model captures a nuanced and interpretable representation of the materials space, reflecting chemical and physical principles, without any user-induced bias. This enables clustering of almost 10,000 materials based on optical properties and chemical similarities. Beyond this understanding, we demonstrate how the learned material space can be used to identify more sustainable alternatives to critical materials in energy-related technologies, such as photovoltaics. These findings demonstrate the dual utility of machine learning models in materials science: Accurately predicting material properties while providing insights into the underlying materials space. The approach demonstrates the broader potential of leveraging learned materials spaces for the discovery and design of materials for diverse applications, and is easily applicable to any state-of-the-art machine learning model.

[LG-57] Development and Comparison of Model-Based and Data-Driven Approaches for the Prediction of the Mechanical Properties of Lattice Structures

Link: https://arxiv.org/abs/2501.05762
Authors: Chiara Pasini, Oscar Ramponi, Stefano Pandini, Luciana Sartore, Giulia Scalet
Categories: Soft Condensed Matter (cond-mat.soft); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments: This work was funded by the European Union ERC CoDe4Bio Grant ID 101039467 under the funding programme Horizon Europe

Abstract:Lattice structures have great potential in several application fields, ranging from medical and tissue engineering to aeronautics. Their development is further accelerated by continuing advances in additive manufacturing technologies, which make it possible to overcome issues typical of standard processes and to propose tailored designs. However, the design of lattice structures is still challenging since their properties are considerably affected by numerous factors. The present paper aims to propose, discuss, and compare various modeling approaches to describe, understand, and predict the correlations between the mechanical properties and the void volume fraction of different types of lattice structures fabricated by fused deposition modeling 3D printing. In particular, four approaches are proposed: (i) a simplified analytical model; (ii) a semi-empirical model combining analytical equations with experimental correction factors; (iii) an artificial neural network trained on experimental data; (iv) numerical simulations by finite element analysis. The comparison among the various approaches, and with experimental data, makes it possible to identify the performance, advantages, and disadvantages of each approach, thus giving important guidelines for choosing the right design methodology based on the needs and available data.

[LG-58] Covariate Dependent Mixture of Bayesian Networks

Link: https://arxiv.org/abs/2501.05745
Authors: Roman Marchant, Dario Draca, Gilad Francis, Sahand Assadzadeh, Mathew Varidel, Frank Iorfino, Sally Cripps
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Abstract:Learning the structure of Bayesian networks from data provides insights into underlying processes and the causal relationships that generate the data, but its usefulness depends on the homogeneity of the data population, a condition often violated in real-world applications. In such cases, using a single network structure for inference can be misleading, as it may not capture sub-population differences. To address this, we propose a novel approach of modelling a mixture of Bayesian networks where component probabilities depend on individual characteristics. Our method identifies both network structures and demographic predictors of sub-population membership, aiding personalised interventions. We evaluate our method through simulations and a youth mental health case study, demonstrating its potential to improve tailored interventions in health, education, and social policy.
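
To make the idea of covariate-dependent component probabilities concrete, here is a minimal sketch. The abstract does not state the functional form, so the softmax (multinomial logistic) gating below is an assumption, and the names `mixture_weights` and `mixture_likelihood` are hypothetical, not taken from the paper.

```python
import math


def mixture_weights(covariates, coefs, intercepts):
    """Return one weight per mixture component via a softmax over linear scores.

    Assumed form (not from the paper): score_k = intercept_k + coefs_k . covariates
    """
    scores = [
        b + sum(w * x for w, x in zip(ws, covariates))
        for ws, b in zip(coefs, intercepts)
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_likelihood(weights, component_likelihoods):
    """Combine per-network likelihoods p(x | network k) using the weights."""
    return sum(w * lik for w, lik in zip(weights, component_likelihoods))


# Two hypothetical components, two covariates for one individual.
weights = mixture_weights([1.0, 2.0], [[0.5, 0.0], [-0.5, 0.0]], [0.0, 0.0])
likelihood = mixture_likelihood(weights, [0.2, 0.6])
```

Under this assumption, each individual gets their own component weights, so the same observation can be explained mostly by one network structure for one person and by another structure for someone with different characteristics.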

[LG-59] Evidential Deep Learning for Uncertainty Quantification and Out-of-Distribution Detection in Jet Identification using Deep Neural Networks

Link: https://arxiv.org/abs/2501.05656
Authors: Ayush Khot, Xiwei Wang, Avik Roy, Volodymyr Kindratenko, Mark S. Neubauer
Categories: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
*Comments: 38 pages (including references) with 17 figures and 3 tables. Repository: this https URL. Submitted to Machine Learning: Science and Technology

Abstract:Current methods commonly used for uncertainty quantification (UQ) in deep learning (DL) models utilize Bayesian methods which are computationally expensive and time-consuming. In this paper, we provide a detailed study of UQ based on evidential deep learning (EDL) for deep neural network models designed to identify jets in high energy proton-proton collisions at the Large Hadron Collider and explore its utility in anomaly detection. EDL is a DL approach that treats learning as an evidence acquisition process designed to provide confidence (or epistemic uncertainty) about test data. Using publicly available datasets for jet classification benchmarking, we explore hyperparameter optimizations for EDL applied to the challenge of UQ for jet identification. We also investigate how the uncertainty is distributed for each jet class, how this method can be implemented for the detection of anomalies, how the uncertainty compares with Bayesian ensemble methods, and how the uncertainty maps onto latent spaces for the models. Our studies uncover some pitfalls of EDL applied to anomaly detection and a more effective way to quantify uncertainty from EDL as compared with the foundational EDL setup. These studies illustrate a methodological approach to interpreting EDL in jet classification models, providing new insights on how EDL quantifies uncertainty and detects out-of-distribution data which may lead to improved EDL methods for DL models applied to classification tasks.
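
As a rough illustration of how evidential deep learning turns per-class evidence into an uncertainty value, the sketch below uses the standard Dirichlet-based formulation: evidence e_k gives alpha_k = e_k + 1, the Dirichlet strength is S = sum(alpha), and the uncertainty is K / S for K classes. This is the baseline setup only, not the refined uncertainty measure the paper proposes, and the function name `edl_summary` is hypothetical.

```python
def edl_summary(evidence):
    """Return (expected class probabilities, uncertainty) from non-negative evidence."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # total Dirichlet strength S
    probs = [a / strength for a in alpha]    # expected class probabilities
    uncertainty = num_classes / strength     # equals 1.0 when there is no evidence
    return probs, uncertainty


# No evidence at all: uniform probabilities and maximal uncertainty.
probs_none, u_none = edl_summary([0.0, 0.0, 0.0])

# Strong evidence for the first class: uncertainty drops.
probs_strong, u_strong = edl_summary([9.0, 0.0, 0.0])
```

In this formulation an input that produces little evidence for any class receives an uncertainty close to 1, which is what makes thresholding on it a candidate for out-of-distribution or anomaly detection.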

[LG-60] Interpretable Enzyme Function Prediction via Residue-Level Detection

Link: https://arxiv.org/abs/2501.05644
Authors: Zhao Yang, Bing Su, Jiahao Chen, Ji-Rong Wen
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*Comments:

Abstract:Predicting multiple functions labeled with Enzyme Commission (EC) numbers from the enzyme sequence is of great significance but remains a challenge due to its sparse multi-label classification nature, i.e., each enzyme is typically associated with only a few labels out of more than 6000 possible EC numbers. However, existing machine learning algorithms generally learn a fixed global representation for each enzyme to classify all functions; as a result, they lack interpretability, and the fine-grained information of some function-specific local residue fragments may be overwhelmed. Here we present an attention-based framework, namely ProtDETR (Protein Detection Transformer), by casting enzyme function prediction as a detection problem. It uses a set of learnable functional queries to adaptively extract different local representations from the sequence of residue-level features for predicting different EC numbers. ProtDETR not only significantly outperforms existing deep learning-based enzyme function prediction methods, but also provides a new interpretable perspective on automatically detecting different local regions for identifying different functions through cross-attentions between queries and residue-level features. Code is available at this https URL.

[LG-61] Physics-Driven Learning for Inverse Problems in Quantum Chromodynamics

Link: https://arxiv.org/abs/2501.05580
Authors: Gert Aarts, Kenji Fukushima, Tetsuo Hatsuda, Andreas Ipp, Shuzhe Shi, Lingxiao Wang, Kai Zhou
Categories: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Nuclear Theory (nucl-th)
*Comments: 14 pages, 5 figures, submitted version to Nat Rev Phys

Abstract:The integration of deep learning techniques and physics-driven designs is reforming the way we address inverse problems, in which accurate physical properties are extracted from complex data sets. This is particularly relevant for quantum chromodynamics (QCD), the theory of strong interactions, with its inherent limitations in observational data and demanding computational approaches. This perspective highlights advances and potential of physics-driven learning methods, focusing on predictions of physical quantities towards QCD physics, and drawing connections to machine learning (ML). It is shown that the fusion of ML and physics can lead to more efficient and reliable problem-solving strategies. Key ideas of ML, methodology of embedding physics priors, and generative models as inverse modelling of physical probability distributions are introduced. Specific applications cover first-principle lattice calculations, and QCD physics of hadrons, neutron stars, and heavy-ion collisions. These examples provide a structured and concise overview of how incorporating prior knowledge such as symmetry, continuity and equations into deep learning designs can address diverse inverse problems across different physical sciences.

[LG-62] OmniJet-$\alpha_C$: Learning point cloud calorimeter simulations using generative transformers

Link: https://arxiv.org/abs/2501.05534
Authors: Joschka Birk, Frank Gaede, Anna Hallin, Gregor Kasieczka, Martina Mozzanica, Henning Rose
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Instrumentation and Detectors (physics.ins-det)
*Comments:

Abstract:We show the first use of generative transformers for generating calorimeter showers as point clouds in a high-granularity calorimeter. Using the tokenizer and generative part of the OmniJet-$\alpha$ model, we represent the hits in the detector as sequences of integers. This model allows variable-length sequences, which means that it supports realistic shower development and does not need to be conditioned on the number of hits. Since the tokenization represents the showers as point clouds, the model learns the geometry of the showers without being restricted to any particular voxel grid.

[LG-63] Outlyingness Scores with Cluster Catch Digraphs

Link: https://arxiv.org/abs/2501.05530
Authors: Rui Shi, Nedret Billor, Elvan Ceyhan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 29 pages, 7 figures, 16 tables

Abstract:This paper introduces two novel outlyingness scores (OSs) based on Cluster Catch Digraphs (CCDs): Outbound Outlyingness Score (OOS) and Inbound Outlyingness Score (IOS). These scores enhance the interpretability of outlier detection results. Both OSs employ graph-, density-, and distribution-based techniques, tailored to high-dimensional data with varying cluster shapes and intensities. OOS evaluates the outlyingness of a point relative to its nearest neighbors, while IOS assesses the total “influence” a point receives from others within its cluster. Both OSs effectively identify global and local outliers and are invariant to data collinearity. Moreover, IOS is robust to masking problems. With extensive Monte Carlo simulations, we compare the performance of both OSs with CCD-based, traditional, and state-of-the-art outlier detection methods. Both OSs exhibit substantial overall improvements over the CCD-based methods in both artificial and real-world data sets, particularly with IOS, which delivers the best overall performance among all the methods, especially in high-dimensional settings. Keywords: Outlier detection, Outlyingness score, Graph-based clustering, Cluster catch digraphs, High-dimensional data.

[LG-64] Generative Modeling: A Review

Link: https://arxiv.org/abs/2501.05458
Authors: Nick Polson, Vadim Sokolov
Categories: Computation (stat.CO); Machine Learning (cs.LG)
*Comments: arXiv admin note: substantial text overlap with arXiv:2305.14972

Abstract:Generative methods (Gen-AI) are reviewed with the particular goal of solving tasks in Machine Learning and Bayesian inference. Generative models require one to simulate a large training dataset and to use deep neural networks to solve a supervised learning problem. To do this, we require high dimensional regression methods and tools for dimensionality reduction (a.k.a. feature selection). The main advantage of Gen-AI methods is their ability to be model-free and to use deep neural networks to estimate conditional densities or posterior quantiles of interest. To illustrate generative methods, we analyze the well-known Ebola dataset. Finally, we conclude with directions for future research.

[LG-65] The Jungle of Generative Drug Discovery: Traps, Treasures, and Ways Out

Link: https://arxiv.org/abs/2501.05457
Authors: Rıza Özçelik, Francesca Grisoni
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*Comments:

Abstract:“How to evaluate de novo designs proposed by a generative model?” Despite the transformative potential of generative deep learning in drug discovery, this seemingly simple question has no clear answer. The absence of standardized guidelines challenges both the benchmarking of generative approaches and the selection of molecules for prospective studies. In this work, we take a fresh - critical and constructive - perspective on de novo design evaluation. We systematically investigate widely used evaluation metrics and expose key pitfalls (‘traps’) that were previously overlooked. In addition, we identify tools (‘treasures’) and strategies (‘ways out’) to navigate the complex ‘jungle’ of generative drug discovery, and strengthen the connections between the molecular and deep learning fields along the way. Our systematic and large-scale results are expected to provide a new lens for evaluating the de novo designs proposed by generative deep learning approaches.

Information Retrieval

[IR-0] kANNolo: Sweet and Smooth Approximate k-Nearest Neighbors Search

Link: https://arxiv.org/abs/2501.06121
Authors: Leonardo Delfino, Domenico Erriquez, Silvio Martinico, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
Categories: Information Retrieval (cs.IR)
*Comments: 7 pages, 3 figures

Abstract:Approximate Nearest Neighbors (ANN) search is a crucial task in several applications like recommender systems and information retrieval. Current state-of-the-art ANN libraries, although performance-oriented, often lack modularity and ease of use. As a result, they are not fully suitable for easy prototyping and testing of research ideas, an important feature to enable. We address these limitations by introducing kANNolo, a novel research-oriented ANN library written in Rust and explicitly designed to combine usability with performance effectively. kANNolo is the first ANN library that supports dense and sparse vector representations made available on top of different similarity measures, e.g., Euclidean distance and inner product. Moreover, it also supports vector quantization techniques, e.g., Product Quantization, on top of the indexing strategies implemented. These functionalities are managed through Rust traits, allowing shared behaviors to be handled abstractly. This abstraction ensures flexibility and facilitates an easy integration of new components. In this work, we detail the architecture of kANNolo and demonstrate that its flexibility does not compromise performance. The experimental analysis shows that kANNolo achieves state-of-the-art performance in terms of speed-accuracy trade-off while allowing fast and easy prototyping, thus making kANNolo a valuable tool for advancing ANN research. Source code available on GitHub: this https URL.
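
For readers new to the task, the sketch below shows exact (brute-force) k-nearest-neighbor search with a pluggable similarity function, which is the reference result that ANN indexes try to approximate faster. It is plain Python for illustration only and is not kANNolo's API (kANNolo is a Rust library); the names `knn`, `neg_euclidean`, and `inner_product` are hypothetical.

```python
import heapq
import math


def neg_euclidean(a, b):
    """Negated Euclidean distance, so that larger values mean more similar."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner (dot) product of two dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def knn(query, vectors, k, similarity):
    """Return the indices of the k most similar vectors, best match first."""
    scored = ((similarity(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nlargest(k, scored)]


vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
query = [0.9, 0.1]

by_distance = knn(query, vectors, 2, neg_euclidean)
by_inner_product = knn(query, vectors, 2, inner_product)
```

Passing the similarity as an argument mirrors, very loosely, the abstract's point about handling shared behavior abstractly: the search loop stays the same while the measure changes. Note that the two measures can rank the same vectors differently, since inner product favors vectors with large norms.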

[IR-1] Recommender Systems for Social Good: The Role of Accountability and Sustainability

Link: https://arxiv.org/abs/2501.05964
Authors: Alan Said
Categories: Information Retrieval (cs.IR)
*Comments: First International Workshop on Recommender Systems for Sustainability and Social Good (RecSoGood’24)

Abstract:This work examines the role of recommender systems in promoting sustainability, social responsibility, and accountability, with a focus on alignment with the United Nations Sustainable Development Goals (SDGs). As recommender systems become increasingly integrated into daily interactions, they must go beyond personalization to support responsible consumption, reduce environmental impact, and foster social good. We explore strategies to mitigate the carbon footprint of recommendation models, ensure fairness, and implement accountability mechanisms. By adopting these approaches, recommender systems can contribute to sustainable and socially beneficial outcomes, aligning technological advancements with the SDGs focused on environmental sustainability and social well-being.

[IR-2] Social web and Wikipedia: an opportunity to rethink the links between sources' credibility, trust and authority

Link: https://arxiv.org/abs/2501.05813
Authors: Gilles Sahut (LERASS), André Tricot (CLLE-LTC)
Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*Comments:

Abstract:The Web and its main tools (Google, Wikipedia, Facebook, Twitter) deeply raise and renew fundamental questions that everyone asks almost every day: Is this information or content true? Can I trust this author or source? These questions are not new: they have been the same with books, newspapers, broadcasting and television, and, more fundamentally, in every human interpersonal communication. This paper is focused on two scientific problems on this issue. The first one is theoretical: to address this issue, many concepts have been used in library and information sciences, communication and psychology. The links between these concepts are not clear: sometimes two concepts are considered as synonymous, sometimes as very different. The second one is historical: sources like Wikipedia deeply challenge the epistemic evaluation of information sources, compared to previous modes of information production. This paper proposes an integrated and simple model considering the relation between a user, a document and an author as human communication. It reduces the problem to three concepts: credibility as a characteristic granted to information depending on its truth-value; trust as the ability to produce credible information; authority when the power to influence of an author is accepted, i.e., when readers accept that the source can modify their opinion, knowledge and decisions. The model also describes two kinds of relationships between the three concepts: an upward link and a downward link. The model is confronted with findings of empirical research on Wikipedia in particular.

Attachment Download

Click to download the full list of today's papers
