This blog post presents the latest paper listing retrieved from Arxiv.org on 2024-11-20. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is retrieved from Arxiv.org daily and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.

Table of Contents

Overview (2024-11-20)

A total of 409 papers were updated today, including:

  • Natural Language Processing: 49 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 100 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 106 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 131 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] ACING: Actor-Critic for Instruction Learning in Black-Box Large Language Models

[Quick Read]: This paper addresses the problem that large language models (LLMs) depend on high-quality instructions to solve tasks, and that such instructions usually require extensive manual tuning. It proposes ACING, a task-specific prompt optimization method whose key idea is to model prompt optimization as a stateless continuous-action reinforcement learning (RL) problem, i.e., a continuum bandit setting. ACING uses an actor-critic-based method to optimize prompts from non-differentiable reward signals. Experiments show that ACING significantly outperforms baseline methods on 30 instruction-based tasks, raising the median score by 10 percentage points, and in some cases surpasses human-expert-crafted instructions by up to 39 percentage points.

Link: https://arxiv.org/abs/2411.12736
Authors: Salma Kharrat, Fares Fourati, Marco Canini
Keywords-EN: Large Language Models, Large Language, effectiveness of Large, Language Models, extensive human effort
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:

Abstract:The effectiveness of Large Language Models (LLMs) in solving tasks vastly depends on the quality of the instructions, which often require fine-tuning through extensive human effort. This highlights the need for automated instruction optimization; however, this optimization is particularly challenging when dealing with black-box LLMs, where model parameters and gradients remain inaccessible. We propose ACING, a task-specific prompt optimization approach framed as a stateless continuous-action Reinforcement Learning (RL) problem, known as the continuum bandit setting. ACING leverages an actor-critic-based method to optimize prompts, learning from non-differentiable reward signals. We validate ACING by optimizing prompts for ChatGPT on 30 instruction-based tasks. ACING consistently outperforms baseline methods, achieving a median score improvement of 10 percentage points. Furthermore, ACING not only recovers but also surpasses human-crafted expert instructions, achieving up to a 39 percentage point improvement against human benchmarks.
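
As a concrete illustration of the continuum-bandit formulation, the sketch below optimizes a continuous prompt parameter against a black-box, non-differentiable reward using a Gaussian actor with a scalar critic baseline. The toy reward and all hyperparameters are assumptions for illustration; the paper's actual reward comes from evaluating ChatGPT outputs on instruction tasks.

```python
import numpy as np

# Toy stand-in for a black-box, non-differentiable reward, e.g. task
# accuracy obtained by querying an LLM with the decoded prompt.
def black_box_reward(action: np.ndarray) -> float:
    target = np.array([0.7, -0.2, 0.1])
    return -np.linalg.norm(action - target)  # peaks at the "best" prompt

rng = np.random.default_rng(0)
dim = 3
mu = np.zeros(dim)          # actor: mean of a Gaussian policy over actions
sigma = 0.3                 # fixed exploration noise
baseline = 0.0              # critic: running value estimate (stateless)
lr_actor, lr_critic = 0.05, 0.1

for step in range(500):
    action = mu + sigma * rng.standard_normal(dim)   # sample an action
    reward = black_box_reward(action)
    advantage = reward - baseline                    # critic as baseline
    # REINFORCE-style update of the Gaussian mean (score-function gradient)
    mu += lr_actor * advantage * (action - mu) / sigma**2
    baseline += lr_critic * advantage                # update critic estimate

print("learned action:", mu.round(2), "reward:", black_box_reward(mu))
```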

[NLP-1] Information Theory of Meaningful Communication

[Quick Read]: This paper asks how to quantify the amount of information in linguistic communication at the semantic level rather than at the level of characters or words. The key to the solution is to leverage recently developed large language models to shift the unit of information from characters or words to clauses, the smallest meaningful semantic units of a sentence, and to compute the amount of semantic information conveyed by each clause, measured in bits.

Link: https://arxiv.org/abs/2411.12728
Authors: Doron Sivan, Misha Tsodyks
Keywords-EN: Shannon seminal paper, stationary stochastic process, Shannon seminal, printed English, seminal paper
Subjects: Computation and Language (cs.CL); Information Theory (cs.IT)
Comments:

Abstract:In Shannon’s seminal paper, entropy of printed English, treated as a stationary stochastic process, was estimated to be roughly 1 bit per character. However, considered as a means of communication, language differs considerably from its printed form: (i) the units of information are not characters or even words but clauses, i.e. shortest meaningful parts of speech; and (ii) what is transmitted is principally the meaning of what is being said or written, while the precise phrasing that was used to communicate the meaning is typically ignored. In this study, we show that one can leverage recently developed large language models to quantify information communicated in meaningful narratives in terms of bits of meaning per clause.
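
For context, Shannon's roughly 1 bit per character figure can be approximated empirically. The sketch below computes per-character entropy under a simple unigram model; this is only the character-level baseline the paper contrasts with, not its clause-level semantic measure, which requires an LLM.

```python
import math
from collections import Counter

def unigram_entropy_bits_per_char(text: str) -> float:
    """Empirical entropy H = -sum p log2 p under a unigram character model.
    This upper-bounds Shannon's estimate, which also exploits context."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog " * 20
print(f"{unigram_entropy_bits_per_char(sample):.2f} bits/char (unigram)")
```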

[NLP-2] Scaling laws for nonlinear dynamical models of speech

[Quick Read]: This paper addresses the problems of parameter selection and numerical stability that arise when a nonlinear restoring force is introduced into dynamical models of speech gestures, especially when modelling variation in empirical data. The key to the solution is to introduce simple numerical methods for parameterizing nonlinear task-dynamic models and to scale the nonlinear stiffness terms with power laws. This approach not only improves the empirical accuracy of model predictions but also keeps simulations of nonlinear gestural dynamics interpretable.

Link: https://arxiv.org/abs/2411.12720
Authors: Sam Kirkham
Keywords-EN: gesture significantly improves, nonlinearity introduces challenges, nonlinear restoring force, speech gesture significantly, empirical data
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The addition of a nonlinear restoring force to dynamical models of the speech gesture significantly improves the empirical accuracy of model predictions, but nonlinearity introduces challenges in selecting appropriate parameters and numerical stability, especially when modelling variation in empirical data. We address this issue by introducing simple numerical methods for parameterization of nonlinear task dynamic models. We first illustrate the problem and then outline solutions in the form of power laws that scale nonlinear stiffness terms. We apply the scaling laws to a cubic model and show how they facilitate interpretable simulations of the nonlinear gestural dynamics underpinning speech production.
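
A minimal sketch of the kind of cubic task-dynamic model discussed here: a damped system driven toward a target with a cubic restoring term, integrated with semi-implicit Euler. The power law `d = c * k**p` coupling the cubic stiffness to the linear stiffness is an illustrative assumption, not the paper's fitted scaling law.

```python
import numpy as np

def simulate_gesture(k, d, target=1.0, b=None, dt=1e-3, T=1.0):
    """x'' = -b x' - k (x - target) - d (x - target)^3, starting at rest."""
    b = b if b is not None else 2 * np.sqrt(k)   # near-critical damping
    x, v = 0.0, 0.0
    xs = []
    for _ in range(int(T / dt)):
        e = x - target
        a = -b * v - k * e - d * e**3
        v += a * dt            # semi-implicit Euler keeps the oscillator stable
        x += v * dt
        xs.append(x)
    return np.array(xs)

k = 200.0                      # linear stiffness
d = 0.5 * k**1.5               # cubic stiffness via an assumed power law
traj = simulate_gesture(k, d)
print("final position:", round(traj[-1], 3))   # settles near the target
```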

[NLP-3] Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

[Quick Read]: This paper addresses two main problems in existing evaluation frameworks for text-to-speech (TTS) models: first, because the MUSHRA test relies on matching a human reference, it unfairly penalizes modern TTS systems whose synthesized speech can exceed human quality; second, the evaluation suffers from judgement ambiguity due to a lack of clear fine-grained guidelines. The key to the solution is two refined MUSHRA variants: the first allows fairer ratings for synthesized samples that surpass the human reference, and the second reduces judgement ambiguity, reflected in lower variance across raters. Combining the two yields more reliable and fine-grained evaluation. The authors also release MANGO, the first large-scale human-rating dataset for Indian languages, which supports analysis of human preferences and the development of automatic metrics for evaluating TTS systems.

Link: https://arxiv.org/abs/2411.12719
Authors: Praveen Srinivasa Varadhan, Amogh Gulati, Ashwin Sankar, Srija Anand, Anirudh Gupta, Anirudh Mukherjee, Shiva Kumar Marepally, Ankur Bhatia, Saloni Jaju, Suvrat Bhooshan, Mitesh M. Khapra
Keywords-EN: MUSHRA test, TTS systems, rapid advancements, consistent and robust, TTS
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 19 pages, 12 Figures

Abstract:Despite rapid advancements in TTS models, a consistent and robust human evaluation framework is still lacking. For example, MOS tests fail to differentiate between similar models, and CMOS’s pairwise comparisons are time-intensive. The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously, but in this work we show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems that can exceed human speech quality. More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 471 human listeners across Hindi and Tamil we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. The first variant enables fairer ratings for synthesized samples that surpass human reference quality. The second variant reduces ambiguity, as indicated by the relatively lower variance across raters. By combining these approaches, we achieve both more reliable and more fine-grained assessments. We also release MANGO, a massive dataset of 47,100 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems.

[NLP-4] Enhancing Multi-Class Disease Classification: Neoplasms, Cardiovascular, Nervous System and Digestive Disorders Using Advanced LLMs

[Quick Read]: This paper tackles multi-class disease classification on the Medical-Abstracts-TC-Corpus, which covers five medical conditions; non-cancer conditions are excluded and four specific diseases are examined. The key to the solution is leveraging pre-trained language models (LLMs), in particular BioBERT for medical data and the general-domain XLNet, together with Last-BERT, a custom model based on a lightweight BERT. Results show that BioBERT performs best on medical text classification (97% accuracy), while XLNet (96% accuracy) and Last-BERT (87.10% accuracy) remain strongly competitive, confirming the value of specialized models such as BioBERT as well as well-tuned general models such as XLNet and Last-BERT for medical-domain tasks.

Link: https://arxiv.org/abs/2411.12712
Authors: Ahmed Akib Jawad Karim, Muhammad Zawad Mahmud, Samiha Islam, Aznur Azam
Keywords-EN: explored the improvement, improvement in terms, terms of multi-class, multi-class disease classification, accuracy
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 tables and 11 figures. Under review at an IEEE conference

Abstract:In this research, we explored the improvement in terms of multi-class disease classification via pre-trained language models over Medical-Abstracts-TC-Corpus that spans five medical conditions. We excluded non-cancer conditions and examined four specific diseases. We assessed four LLMs, BioBERT, XLNet, and BERT, as well as a novel base model (Last-BERT). BioBERT, which was pre-trained on medical data, demonstrated superior performance in medical text classification (97% accuracy). Surprisingly, XLNet followed closely (96% accuracy), demonstrating its generalizability across domains even though it was not pre-trained on medical data. LastBERT, a custom model based on the lighter version of BERT, also proved competitive with 87.10% accuracy (just under BERT’s 89.33%). Our findings confirm the importance of specialized models such as BioBERT and also support impressions around more general solutions like XLNet and well-tuned transformer architectures with fewer parameters (in this case, LastBERT) in medical domain tasks.
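
A minimal sketch of the kind of fine-tuning setup the paper compares, using the Hugging Face transformers Trainer with a public BioBERT checkpoint for 4-class abstract classification. The CSV file names, hyperparameters, and column names are illustrative assumptions.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# Assumed: CSV files with "text" and "label" columns for the 4 diseases.
ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

ckpt = "dmis-lab/biobert-base-cased-v1.1"   # public BioBERT checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=4)

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=512)

ds = ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="biobert-disease", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                  eval_dataset=ds["test"], tokenizer=tok)
trainer.train()
print(trainer.evaluate())   # accuracy/F1 can be added via compute_metrics
```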

[NLP-5] Strengthening Fake News Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques. Defying BERT?

[Quick Read]: This paper targets the rapid spread of fake news on online platforms. The key to the solution is using machine learning and natural language processing, specifically Support Vector Machines (SVM) and BERT, to detect fake news. Three text vectorization methods (TF-IDF, Word2Vec, and BoW) are applied to the SVM and compared against a BERT model. The results show that although BERT achieves higher accuracy and F1 score (99.98% and 0.9998), the SVM with a linear kernel and BoW vectorization also performs exceptionally well (99.81% and 0.9980) with lower computational requirements, offering highly competitive performance.

Link: https://arxiv.org/abs/2411.12703
Authors: Ahmed Akib Jawad Karim, Kazi Hafiz Md Asad, Aznur Azam
Keywords-EN: reliable detection systems, Support Vector Machines, specifically Support Vector, Term Frequency Inverse, Frequency Inverse Document
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 3 tables and 6 figures. Submitted to a conference

Abstract:The rapid spread of misinformation, particularly through online platforms, underscores the urgent need for reliable detection systems. This study explores the utilization of machine learning and natural language processing, specifically Support Vector Machines (SVM) and BERT, to detect news that are fake. We employ three distinct text vectorization methods for SVM: Term Frequency Inverse Document Frequency (TF-IDF), Word2Vec, and Bag of Words (BoW) evaluating their effectiveness in distinguishing between genuine and fake news. Additionally, we compare these methods against the transformer large language model, BERT. Our comprehensive approach includes detailed preprocessing steps, rigorous model implementation, and thorough evaluation to determine the most effective techniques. The results demonstrate that while BERT achieves superior accuracy with 99.98% and an F1-score of 0.9998, the SVM model with a linear kernel and BoW vectorization also performs exceptionally well, achieving 99.81% accuracy and an F1-score of 0.9980. These findings highlight that, despite BERT’s superior performance, SVM models with BoW and TF-IDF vectorization methods come remarkably close, offering highly competitive performance with the advantage of lower computational requirements.
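
A minimal sketch of the SVM baselines the paper evaluates: scikit-learn pipelines with BoW and TF-IDF vectorization feeding a linear-kernel SVM. The toy texts and labels stand in for the paper's fake-news corpus.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

texts = ["officials confirm the new policy takes effect monday",
         "scientists publish peer-reviewed study on vaccines",
         "shocking! miracle cure doctors don't want you to know",
         "you won't believe what this celebrity said about aliens"] * 50
labels = [0, 0, 1, 1] * 50   # 0 = genuine, 1 = fake (toy data)

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.25,
                                          random_state=42, stratify=labels)

for name, vec in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    clf = make_pipeline(vec, LinearSVC())     # linear-kernel SVM
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, pred):.4f} "
          f"f1={f1_score(y_te, pred):.4f}")
```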

[NLP-6] Enhanced Sign Language Translation between American Sign Language (ASL) and Indian Sign Language (ISL) Using LLMs

[Quick Read]: This paper aims to remove the communication barrier between users of American Sign Language (ASL) and Indian Sign Language (ISL). The key to the solution is a novel learner-system framework that uses large language models (LLMs) for real-time translation. The core of the system is a sophisticated pipeline that first reclassifies and recognizes sign-language gestures with a strong Random Forest classifier, then converts the recognized ASL gestures to text. Natural language processing (NLP) techniques are then used to convert the ASL text to ISL, and finally RIFE-Net synthesizes the translated text back into ISL gestures, yielding an end-to-end translation experience. The key challenges the framework addresses are automatically handling gesture variability and overcoming the linguistic differences between ASL and ISL, thereby substantially improving accessibility for sign-language users.

Link: https://arxiv.org/abs/2411.12685
Authors: Malay Kumar, S. Sarvajit Visagan, Tanish Sarang Mahajan, Anisha Natarajan
Keywords-EN: American Sign Language, Indian Sign Language, American Sign, Indian Sign, Sign Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We have come up with a research that hopes to provide a bridge between the users of American Sign Language and the users of spoken language and Indian Sign Language (ISL). The research enabled us to create a novel framework that we have developed for Learner Systems. Leveraging art of Large models to create key features including: - Real-time translation between these two sign languages in an efficient manner. Making LLM’s capability available for seamless translations to ISL. Here is the full study showing its implementation in this paper. The core of the system is a sophisticated pipeline that begins with reclassification and recognition of ASL gestures based on a strong Random Forest Classifier. By recognizing the ASL, it is translated into text which can be more easily processed. Highly evolved natural language NLP (Natural Language Processing) techniques come in handy as they play a role in our LLM integration where you then use LLMs to be able to convert the ASL text to ISL which provides you with the intent of sentence or phrase. The final step is to synthesize the translated text back into ISL gestures, creating an end-to-end translation experience using RIFE-Net. This framework is tasked with key challenges such as automatically dealing with gesture variability and overcoming the linguistic differences between ASL and ISL. By automating the translation process, we hope to vastly improve accessibility for sign language users. No longer will the communication gap between ASL and ISL create barriers; this totally cool innovation aims to bring our communities closer together. And we believe, with full confidence in our framework, that we’re able to apply the same principles across a wide variety of sign language dialects.

[NLP-7] Neurosymbolic Graph Enrichment for Grounded World Models

[Quick Read]: This paper addresses the problem of building AI systems that can understand and reason about complex real-world scenarios. The key to the solution is a multimodal, knowledge-augmented formal representation of meaning that combines the strengths of large language models (LLMs) with structured semantic representations. Concretely, a state-of-the-art LLM first generates a natural-language description of an image; the description is then converted into an Abstract Meaning Representation (AMR) graph, which is formalized and enriched with logical design patterns and layered semantics drawn from linguistic and factual knowledge bases. The resulting graph is fed back into the LLM to be extended with implicit knowledge activated by complex heuristic learning, including semantic implicatures, moral values, embodied cognition, and metaphorical representations. By bridging the gap between unstructured language models and formal semantic structures, the method opens new avenues for tackling intricate problems in natural language understanding and reasoning.

Link: https://arxiv.org/abs/2411.12671
Authors: Stefano De Giorgis, Aldo Gangemi, Alessandro Russo
Keywords-EN: artificial intelligence systems, intelligence systems capable, complex real-world scenarios, significant challenge, development of artificial
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments:

Abstract:The development of artificial intelligence systems capable of understanding and reasoning about complex real-world scenarios is a significant challenge. In this work we present a novel approach to enhance and exploit LLM reactive capability to address complex problems and interpret deeply contextual real-world meaning. We introduce a method and a tool for creating a multimodal, knowledge-augmented formal representation of meaning that combines the strengths of large language models with structured semantic representations. Our method begins with an image input, utilizing state-of-the-art large language models to generate a natural language description. This description is then transformed into an Abstract Meaning Representation (AMR) graph, which is formalized and enriched with logical design patterns, and layered semantics derived from linguistic and factual knowledge bases. The resulting graph is then fed back into the LLM to be extended with implicit knowledge activated by complex heuristic learning, including semantic implicatures, moral values, embodied cognition, and metaphorical representations. By bridging the gap between unstructured language models and formal semantic structures, our method opens new avenues for tackling intricate problems in natural language understanding and reasoning.

[NLP-8] Optimizing Airline Reservation Systems with Edge-Enabled Microservices: A Framework for Real-Time Data Processing and Enhanced User Responsiveness

[Quick Read]: This paper addresses the efficiency and performance limits of traditional centralized airline reservation systems in complex operational settings. The key to the solution is an edge-computing microservice architecture that moves critical operations, such as seat-inventory checks, booking, and confirmation, closer to the user, reducing overall response time and improving system performance. The design relies on distributed microservices orchestrated with Kubernetes, real-time message processing with Kafka and its elastic scaling, and Prometheus and Grafana for resource monitoring and management. Beyond low latency, high throughput, and a better user experience, the architecture anticipates integration with future technologies such as artificial intelligence and IoT embedded systems, offering the airline industry a market-ready, extensible solution.

Link: https://arxiv.org/abs/2411.12650
Authors: Biman Barua, M. Shamim Kaiser
Keywords-EN: adaptive reservation systems, development of quick, growing complexity, requires a smart, airline reservations requires
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 22 pages, 11 figures

Abstract:The growing complexity of the operations of airline reservations requires a smart solution for the adoption of novel approaches to the development of quick, efficient, and adaptive reservation systems. This paper outlines in detail a conceptual framework for the implementation of edge computing microservices in order to address the shortcomings of traditional centralized architectures. Specifically, as edge computing allows for certain activities such as seat inventory checks, booking processes and even confirmation to be done nearer to the user, thus lessening the overall response time and improving the performance of the system. In addition, the framework value should include achieving the high performance of the system such as low latency, high throughput and higher user experience. The major design components include deployed distributed computing microservices orchestrated by Kubernetes, real-time message processing system with Kafka and its elastic scaling. Other operational components include Prometheus and Grafana, which are used to monitor and manage resources, ensuring that all operational processes are optimized. Although this research focuses on a design and theoretical scheming of the framework, its use is foreseen to be more advantageous in facilitating a transform in the provision of services in the airline industry by improving customers’ satisfaction, providing infrastructure which is cheap to install and efficiently supporting technology changes such as artificial intelligence and internet of things embedded systems. This research addresses the increasing demand for new technologies with modern well-distributed and real-time-centric systems and also provides a basis for future case implementation and testing. As such, the proposed architecture offers a market-ready, extensible solution to the problems posed by existing airline reservation systems .
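
As a sketch of the real-time messaging layer described above, the snippet below uses the kafka-python client to publish a booking event and consume it in a seat-inventory service. The topic name, event schema, and broker address are illustrative assumptions, not the paper's specification.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Edge node publishes a booking request as a JSON event.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("booking-requests", {"flight": "AI-302", "seat": "12A",
                                   "passenger": "p-981"})
producer.flush()

# Seat-inventory microservice consumes and confirms the request.
consumer = KafkaConsumer(
    "booking-requests",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for msg in consumer:
    event = msg.value
    print(f"checking inventory for {event['flight']} seat {event['seat']}")
    break  # demo: handle one event and exit
```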

[NLP-9] DLBacktrace: A Model Agnostic Explainability for any Deep Learning Models

[Quick Read]: This paper addresses the lack of transparency and interpretability in the decision-making of deep learning models. The key to the solution is DLBacktrace, an innovative technique developed by the AryaXAI team to illuminate model decisions across a wide range of deep learning models, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs), large language models (LLMs), computer vision models, and more. DLBacktrace is compatible with PyTorch and TensorFlow and supports architectures such as Llama 3.2, BERT, LSTMs, ResNet, and U-Net, as well as custom deep neural network (DNN) models, enhancing interpretability across a broad range of domains. The paper benchmarks it against established interpretability methods such as SHAP, LIME, and GradCAM using diverse task-based metrics, demonstrating its effectiveness and adaptability.

Link: https://arxiv.org/abs/2411.12643
Authors: Vinay Kumar Sankarapu, Chintan Chitroda, Yashwardhan Rathore, Neeraj Kumar Singh, Pratinav Seth
Keywords-EN: increasingly sophisticated deep, Computer Vision Models, operate as opaque, black boxes’, decision-making processes
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancement of artificial intelligence has led to increasingly sophisticated deep learning models, which frequently operate as opaque ‘black boxes’ with limited transparency in their decision-making processes. This lack of interpretability presents considerable challenges, especially in high-stakes applications where understanding the rationale behind a model’s outputs is as essential as the outputs themselves. This study addresses the pressing need for interpretability in AI systems, emphasizing its role in fostering trust, ensuring accountability, and promoting responsible deployment in mission-critical fields. To address the interpretability challenge in deep learning, we introduce DLBacktrace, an innovative technique developed by the AryaXAI team to illuminate model decisions across a wide array of domains, including simple Multi Layer Perceptron (MLPs), Convolutional Neural Networks (CNNs), Large Language Models (LLMs), Computer Vision Models, and more. We provide a comprehensive overview of the DLBacktrace algorithm and present benchmarking results, comparing its performance against established interpretability methods, such as SHAP, LIME, GradCAM, Integrated Gradients, SmoothGrad, and Attention Rollout, using diverse task-based metrics. The proposed DLBacktrace technique is compatible with various model architectures built in PyTorch and TensorFlow, supporting models like Llama 3.2, other NLP architectures such as BERT and LSTMs, computer vision models like ResNet and U-Net, as well as custom deep neural network (DNN) models for tabular data. This flexibility underscores DLBacktrace’s adaptability and effectiveness in enhancing model transparency across a broad spectrum of applications. The library is open-sourced and available at this https URL.

[NLP-10] Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D

[Quick Read]: This paper addresses multi-language learning, with Hindi as the target language, by combining a virtual reality (VR) environment with an AI tutoring system based on OpenAI's GPT models. The key to the solution is a virtual campus environment built in Unity that models the 11th floor of the university building, where most cultural and technological activities take place. Within this environment, an AI tutor powered by OpenAI's GPT model, accessed through API calls, moves around with the user and provides real-time Hindi learning support, with GPT handling the language translation. The approach mainly relies on speech-to-text, text-to-text, and text-to-speech capabilities to enable real-time interaction between users and the AI tutor whenever an internet connection is available, delivering an immersive language learning experience.

Link: https://arxiv.org/abs/2411.12619
Authors: Adithya TG, Abhinavaram N, Gowri Srinivasa
Keywords-EN: GPT api calls, enabled tutoring systems, virtual reality environments, paper presents, OpenAIs GPT api
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 5 pages, 2 tables, 8 figures

Abstract:This paper presents a new approach to multiple language learning, with Hindi the language to be learnt in our case, by using the integration of virtual reality environments and AI enabled tutoring systems using OpenAIs GPT api calls. We have developed a scenario which has a virtual campus environment using Unity which focuses on a detailed representation of our universitys buildings 11th floor, where most of the cultural and technological activities take place. Within this virtual environment that we have created, we have an AI tutor powered by OpenAI’s GPT model which was called using an api which moves around with the user. This provided language learning support in Hindi, as GPT is able to take care of language translation. Our approach mainly involves utilising speech to text, text to text conversion and text to speech capabilities to facilitate real time interaction between users and the AI tutor in the presence of internet. This research demonstrates the use of combining VR technology with AI tutoring for immersive language learning experiences and provides interaction.

[NLP-11] Whisper Finetuning on Nepali Language

[Quick Read]: This paper addresses the weak performance of automatic speech recognition (ASR) models on low-resource languages such as Nepali. The key to the solution is fine-tuning OpenAI's Whisper models on a carefully constructed and augmented dataset to improve Nepali speech-to-text accuracy, combining publicly available ASR datasets with self-recorded custom data covering diverse accents, dialects, and speaking styles, further enriched through augmentation. Experiments show that fine-tuning Whisper on the curated custom dataset substantially reduces the Word Error Rate (WER) across all model sizes, driven by greater variation in speaker age, gender, emotion, acoustic environment, and dialect and by denser 15-30 second audio segments better matched to Whisper's input; data augmentation significantly improves robustness. The study underscores the importance of dataset quality, diversity, and augmentation when adapting state-of-the-art models to low-resource languages for accurate ASR systems.

Link: https://arxiv.org/abs/2411.12587
Authors: Sanjay Rijal, Shital Adhikari, Manish Dahal, Manish Awale, Vaghawan Ojha
Keywords-EN: Automatic Speech Recognition, Speech Recognition, Automatic Speech, advancements in Automatic, remains a challenge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the growing advancements in Automatic Speech Recognition (ASR) models, the development of robust models for underrepresented languages, such as Nepali, remains a challenge. This research focuses on making an exhaustive and generalized dataset followed by fine-tuning OpenAI’s Whisper models of different sizes to improve transcription (speech-to-text) accuracy for the Nepali language. We leverage publicly available ASR datasets and self-recorded custom datasets with a diverse range of accents, dialects, and speaking styles further enriched through augmentation. Our experimental results demonstrate that fine-tuning Whisper models on our curated custom dataset substantially reduces the Word Error Rate (WER) across all model sizes attributed to larger data variations in terms of speaker’s age, gender, and sentiment, acoustic environment, dialect, denser audio segments (15-30 seconds) that are more compatible with Whisper’s input, and manual curation of audios and transcriptions. Notably, our approach outperforms Whisper’s baseline models trained on Fleur’s dataset, achieving WER reductions of up to 36.2% on the small and 23.8% on medium models. Furthermore, we show that data augmentation plays a significant role in enhancing model robustness. Our approach underlines the importance of dataset quality, variation, and augmentation in the adaptation of state-of-the-art models to underrepresented languages for developing accurate ASR systems.
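
A minimal sketch of the evaluation loop implied above: transcribing Nepali audio with a Whisper checkpoint through the transformers pipeline and scoring word error rate with jiwer. The checkpoint, audio file names, and reference strings are placeholders; the paper's fine-tuned weights are not assumed here.

```python
import jiwer
from transformers import pipeline

# Placeholder checkpoint; the paper fine-tunes Whisper models of several sizes.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

samples = [("clip1.wav", "नमस्ते तपाईंलाई कस्तो छ"),
           ("clip2.wav", "आज मौसम राम्रो छ")]          # (audio, reference)

hyps, refs = [], []
for path, ref in samples:
    out = asr(path, generate_kwargs={"language": "nepali",
                                     "task": "transcribe"})
    hyps.append(out["text"])
    refs.append(ref)

print("WER:", jiwer.wer(refs, hyps))   # lower is better
```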

[NLP-12] Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models

[Quick Read]: This paper investigates the generalisation strategies large language models (LLMs) use on reasoning tasks by analysing which pretraining data influence their outputs. For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, the authors identify the documents that influence model outputs on three simple mathematical reasoning tasks and contrast them with the data influential for answering factual questions. They find that models rely on largely distinct sets of data for each factual question, whereas a given document often has similar influence across different reasoning questions within the same task, indicating procedural knowledge. Moreover, the answers to reasoning questions and their intermediate steps rarely appear in the most influential data, which instead typically contains procedural knowledge such as demonstrations of how to obtain a solution using formulae or code. These findings suggest that the models' approach to reasoning is unlike retrieval and closer to a generalisable strategy that synthesises procedural knowledge from documents performing similar forms of reasoning.

Link: https://arxiv.org/abs/2411.12580
Authors: Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwarak Talupuru, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, Max Bartolo
Keywords-EN: Large Language Models, Large Language, limitations of Large, Language Models, recent years
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The capabilities and limitations of Large Language Models have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume of data used in the design of LLMs has precluded us from applying the method traditionally used to measure generalisation: train-test set separation. To overcome this, we study what kind of generalisation strategies LLMs employ when performing reasoning tasks by investigating the pretraining data they rely on. For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, we identify what documents influence the model outputs for three simple mathematical reasoning tasks and contrast this to the data that are influential for answering factual questions. We find that, while the models rely on mostly distinct sets of data for each factual question, a document often has a similar influence across different reasoning questions within the same task, indicating the presence of procedural knowledge. We further find that the answers to factual questions often show up in the most influential data. However, for reasoning questions the answers usually do not show up as highly influential, nor do the answers to the intermediate reasoning steps. When we characterise the top ranked documents for the reasoning questions qualitatively, we confirm that the influential documents often contain procedural knowledge, like demonstrating how to obtain a solution using formulae or code. Our findings indicate that the approach to reasoning the models use is unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from documents doing a similar form of reasoning.

[NLP-13] Large Language Models for Combinatorial Optimization of Design Structure Matrix

[Quick Read]: This paper targets combinatorial optimization (CO) problems in engineering, where algorithms based on pure mathematical reasoning struggle to capture the contextual nuances needed for optimization as problem sizes and dependency structures grow. The key to the solution is an LLM-based framework that integrates network topology and domain knowledge to optimize the sequencing of the Design Structure Matrix (DSM), a common CO problem. Experiments on a range of DSM cases show faster convergence and higher solution quality than benchmark methods, and incorporating contextual domain knowledge significantly improves performance regardless of the chosen LLM, highlighting the potential of LLMs to solve complex real-world CO problems by combining semantic and mathematical reasoning.

Link: https://arxiv.org/abs/2411.12571
Authors: Shuo Jiang, Min Xie, Jianxi Luo
Keywords-EN: essential for improving, improving efficiency, Large Language Models, Design Structure Matrix, Combinatorial optimization
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Combinatorial optimization (CO) is essential for improving efficiency and performance in engineering applications. As complexity increases with larger problem sizes and more intricate dependencies, identifying the optimal solution become challenging. When it comes to real-world engineering problems, algorithms based on pure mathematical reasoning are limited and incapable to capture the contextual nuances necessary for optimization. This study explores the potential of Large Language Models (LLMs) in solving engineering CO problems by leveraging their reasoning power and contextual knowledge. We propose a novel LLM-based framework that integrates network topology and domain knowledge to optimize the sequencing of Design Structure Matrix (DSM)-a common CO problem. Our experiments on various DSM cases demonstrate that the proposed method achieves faster convergence and higher solution quality than benchmark methods. Moreover, results show that incorporating contextual domain knowledge significantly improves performance despite the choice of LLMs. These findings highlight the potential of LLMs in tackling complex real-world CO problems by combining semantic and mathematical reasoning. This approach paves the way for a new paradigm in in real-world combinatorial optimization.
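
For readers unfamiliar with DSM sequencing, the sketch below states the usual objective, minimizing feedback dependencies above the diagonal, and applies a plain pairwise-swap local search. The paper's LLM-guided framework replaces this heuristic; the random DSM is only a toy instance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
dsm = (rng.random((n, n)) < 0.3).astype(int)  # dsm[i, j]=1: task i depends on j
np.fill_diagonal(dsm, 0)

def feedback_count(order):
    """Feedback marks: dependencies on tasks scheduled later in the order."""
    pos = {t: k for k, t in enumerate(order)}
    return sum(dsm[i, j] for i in range(n) for j in range(n)
               if dsm[i, j] and pos[j] > pos[i])

order = list(range(n))
improved = True
while improved:                                # pairwise-swap local search
    improved = False
    for a in range(n):
        for b in range(a + 1, n):
            cand = order.copy()
            cand[a], cand[b] = cand[b], cand[a]
            if feedback_count(cand) < feedback_count(order):
                order, improved = cand, True

print("order:", order, "feedbacks:", feedback_count(order))
```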

[NLP-14] Predicting Customer Satisfaction by Replicating the Survey Response Distribution

[Quick Read]: This paper addresses the bias in average call-center customer satisfaction (CSAT) caused by low survey participation, along with the missed opportunities for coaching, follow-up, and rectification. The key to the solution is a model that predicts customer satisfaction on calls where the customer did not complete the survey, while ensuring that the predicted CSAT (pCSAT) scores accurately replicate the distribution of actual survey CSAT responses, thereby minimizing bias. Beyond call centers, the method generalizes to other multiclass classification problems to improve class balance and minimize class shifts across model updates.

Link: https://arxiv.org/abs/2411.12539
Authors: Etienne Manderscheid, Matthias Lee
Keywords-EN: key performance indicator, CSAT, performance indicator, key performance, customer satisfaction
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:For many call centers, customer satisfaction (CSAT) is a key performance indicator (KPI). However, only a fraction of customers take the CSAT survey after the call, leading to a biased and inaccurate average CSAT value, and missed opportunities for coaching, follow-up, and rectification. Therefore, call centers can benefit from a model predicting customer satisfaction on calls where the customer did not complete the survey. Given that CSAT is a closely monitored KPI, it is critical to minimize any bias in the average predicted CSAT (pCSAT). In this paper, we introduce a method such that predicted CSAT (pCSAT) scores accurately replicate the distribution of survey CSAT responses for every call center with sufficient data in a live production environment. The method can be applied to many multiclass classification problems to improve the class balance and minimize its changes upon model updates.
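
One simple way to make predicted labels replicate a survey distribution, sketched below, is quantile matching: rank calls by a model's satisfaction score and cut the ranking so each CSAT class receives exactly its survey share. This is an illustrative calibration scheme consistent with the stated goal, not necessarily the authors' exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(10_000)            # model's satisfaction score per call
survey_dist = np.array([0.05, 0.05, 0.10, 0.30, 0.50])   # P(CSAT=1..5)

# Cut points on the score ranking so class proportions match the survey.
quantiles = np.cumsum(survey_dist)[:-1]
cuts = np.quantile(scores, quantiles)
pcsat = np.digitize(scores, cuts) + 1  # predicted labels in 1..5

realized = np.bincount(pcsat, minlength=6)[1:] / len(pcsat)
print("target:  ", survey_dist)
print("realized:", realized.round(3))  # replicates the survey distribution
```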

[NLP-15] Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues

[Quick Read]: This paper addresses the poor state-tracking ability of linear recurrent neural networks (LRNNs), which matters for tasks such as code evaluation or tracking a chess game. The key to the solution is extending the eigenvalue range of the state-transition matrices to include negative values. The authors prove that finite-precision LRNNs with only positive eigenvalues cannot solve even parity, the simplest state-tracking task, while complex eigenvalues are needed to count modulo 3; with eigenvalues in [-1, 1], LRNNs can learn any regular language when their state-transition matrices are products of identity-minus-vector-outer-product matrices. Empirically, extending the eigenvalue range of models such as Mamba and DeltaNet not only enables them to solve parity but consistently improves state-tracking performance. Moreover, language-model pretraining with the extended eigenvalue range achieves comparable performance and stability while showing promise on code and math data, enhancing the expressivity of modern LRNNs without increasing training or inference cost.

Link: https://arxiv.org/abs/2411.12537
Authors: Riccardo Grazzi, Julien Siems, Jörg K.H. Franke, Arber Zela, Frank Hutter, Massimiliano Pontil
Keywords-EN: Recurrent Neural Networks, Linear Recurrent Neural, offering linear scaling, Neural Networks, Recurrent Neural
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments:

Abstract:Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and DeltaNet have emerged as efficient alternatives to Transformers in large language modeling, offering linear scaling with sequence length and improved training efficiency. However, LRNNs struggle to perform state-tracking which may impair performance in tasks such as code evaluation or tracking a chess game. Even parity, the simplest state-tracking task, which non-linear RNNs like LSTM handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et al. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity stems from restricting the value range of their diagonal state-transition matrices to [0, 1] and that incorporating negative values can resolve this issue. We extend this result to non-diagonal LRNNs, which have recently shown promise in models such as DeltaNet. We prove that finite precision LRNNs with state-transition matrices having only positive eigenvalues cannot solve parity, while complex eigenvalues are needed to count modulo 3 . Notably, we also prove that LRNNs can learn any regular language when their state-transition matrices are products of identity minus vector outer product matrices, each with eigenvalues in the range [-1, 1] . Our empirical results confirm that extending the eigenvalue range of models like Mamba and DeltaNet to include negative values not only enables them to solve parity but consistently improves their performance on state-tracking tasks. Furthermore, pre-training LRNNs with an extended eigenvalue range for language modeling achieves comparable performance and stability while showing promise on code and math data. Our work enhances the expressivity of modern LRNNs, broadening their applicability without changing the cost of training or inference.
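
The parity result has a two-line constructive side, sketched below: a one-dimensional linear recurrence whose transition is -1 whenever the input bit is 1 flips the sign of the state and therefore tracks parity exactly, which is impossible when transitions are confined to [0, 1].

```python
import numpy as np

def parity_lrnn(bits):
    """h_t = a_t * h_{t-1}, a_t = -1 if bit else +1 (eigenvalues in {-1, 1})."""
    h = 1.0
    for b in bits:
        h *= -1.0 if b else 1.0   # negative eigenvalue flips the state's sign
    return int(h < 0)             # sign of h encodes the running parity

rng = np.random.default_rng(0)
for _ in range(5):
    bits = rng.integers(0, 2, size=20)
    assert parity_lrnn(bits) == bits.sum() % 2
print("parity tracked exactly via a -1 eigenvalue")
```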

[NLP-16] Bias Free Sentiment Analysis

[Quick Read]: This paper addresses bias, in particular political and gender bias, in machine-learning sentiment analysis (SA) systems. The key to the solution is the Semantic Propagation Graph Neural Network (SProp GNN), an architecture that predicts emotions from syntactic structure and word-level emotional cues alone; by semantically blinding the model to information about specific words, it becomes robust to such biases. SProp GNN outperforms lexicon-based alternatives such as VADER and EmoAtlas on two prediction tasks and across two languages, approaches the accuracy of transformer-based models while substantially reducing bias in emotion prediction, and offers better explainability. It thereby bridges the methodological gap between interpretable lexicon approaches and powerful but often opaque deep learning models, providing a fairer and more effective tool for emotion analysis in understanding human behavior through text.

Link: https://arxiv.org/abs/2411.12493
Authors: Hubert Plisiecki
Keywords-EN: Graph Neural Network, Semantic Propagation Graph, Propagation Graph Neural, Neural Network, Semantic Propagation
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper introduces the Semantic Propagation Graph Neural Network (SProp GNN), a machine learning sentiment analysis (SA) architecture that relies exclusively on syntactic structures and word-level emotional cues to predict emotions in text. By semantically blinding the model to information about specific words, it is robust to biases such as political or gender bias that have been plaguing previous machine learning-based SA systems. The SProp GNN shows performance superior to lexicon-based alternatives such as VADER and EmoAtlas on two different prediction tasks, and across two languages. Additionally, it approaches the accuracy of transformer-based models while significantly reducing bias in emotion prediction tasks. By offering improved explainability and reducing bias, the SProp GNN bridges the methodological gap between interpretable lexicon approaches and powerful, yet often opaque, deep learning models, offering a robust tool for fair and effective emotion analysis in understanding human behavior through text.

[NLP-17] Regular-pattern-sensitive CRFs for Distant Label Interactions

[Quick Read]: This paper addresses the inability of linear-chain conditional random fields (CRFs) to directly model interactions between non-adjacent labels in sequence labeling. The key to the solution is regular-pattern-sensitive CRFs (RPCRFs), which enrich standard linear-chain CRFs with the ability to learn long-distance label interactions occurring in user-specified patterns: users write concise regular-expression label patterns specifying which interaction types the model should consider, and the model learns from data whether and in which contexts those patterns occur. The result can be read either as a CRF augmented with additional non-local potentials or as a finite-state transducer whose structure is defined by easily interpretable patterns. Critically, unlike general weighted finite-state transducers (FSTs) and non-chain CRFs, exact training and inference remain tractable for many pattern sets, so RPCRFs capture non-local dependency structure while staying computationally efficient.

Link: https://arxiv.org/abs/2411.12484
Authors: Sean Papay, Roman Klinger, Sebastian Pado
Keywords-EN: conditional random fields, Linear-chain conditional random, common model component, random fields, conditional random
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Linear-chain conditional random fields (CRFs) are a common model component for sequence labeling tasks when modeling the interactions between different labels is important. However, the Markov assumption limits linear-chain CRFs to only directly modeling interactions between adjacent labels. Weighted finite-state transducers (FSTs) are a related approach which can be made to model distant label-label interactions, but exact label inference is intractable for these models in the general case, and the task of selecting an appropriate automaton structure for the desired interaction types poses a practical challenge. In this work, we present regular-pattern-sensitive CRFs (RPCRFs), a method of enriching standard linear-chain CRFs with the ability to learn long-distance label interactions which occur in user-specified patterns. This approach allows users to write regular-expression label patterns concisely specifying which types of interactions the model should take into account, allowing the model to learn from data whether and in which contexts these patterns occur. The result can be interpreted alternatively as a CRF augmented with additional, non-local potentials, or as a finite-state transducer whose structure is defined by a set of easily-interpretable patterns. Critically, unlike the general case for FSTs (and for non-chain CRFs), exact training and inference are tractable for many pattern sets. In this work, we detail how a RPCRF can be automatically constructed from a set of user-specified patterns, and demonstrate the model’s effectiveness on synthetic data, showing how different types of patterns can capture different nonlocal dependency structures in label sequences.
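
To make the idea concrete, the brute-force sketch below scores label sequences with ordinary emission and transition potentials plus a bonus for every match of a user-written regular-expression label pattern, normalizing by enumeration. The potentials are arbitrary toy numbers, and the real model replaces enumeration with tractable dynamic programming.

```python
import itertools, math, re

labels = "OBI"                 # e.g., BIO chunking tags
T = 5                          # sequence length (small enough to enumerate)
emit = {(t, y): 0.1 * ((t + ord(y)) % 3) for t in range(T) for y in labels}
trans = {(a, b): 0.2 if a == b else 0.0 for a in labels for b in labels}
# User-specified pattern: a B..I span closed by O, however distant.
patterns = [(re.compile(r"BI*O"), 1.5)]   # (regex over label string, weight)

def score(seq):
    s = sum(emit[t, y] for t, y in enumerate(seq))        # local emissions
    s += sum(trans[a, b] for a, b in zip(seq, seq[1:]))   # adjacent labels
    string = "".join(seq)
    for rx, w in patterns:     # non-local potential per pattern match
        s += w * len(rx.findall(string))
    return s

seqs = list(itertools.product(labels, repeat=T))
Z = sum(math.exp(score(s)) for s in seqs)                 # partition function
best = max(seqs, key=score)
print("MAP labeling:", "".join(best), "prob:", math.exp(score(best)) / Z)
```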

[NLP-18] Analysing Explanation-Related Interactions in Collaborative Perception-Cognition-Communication-Action

[Quick Read]: This paper studies how AI-equipped robots working alongside humans can communicate effectively in collaborative tasks, in particular how explaining their behaviour supports cooperation and trust. The key to the solution is analysing and classifying the communications of human participants collaborating on a simulated emergency-response task, identifying messages that correspond to the kinds of interactive explanations described in the explainable-AI literature. This reveals what types of explanations humans expect from teammates in such settings, and hence where AI-equipped robots most need explanation capabilities. The analysis finds that most explanation-related messages seek clarification of decisions or actions taken, and confirms that messages affect performance on the simulated task.

Link: https://arxiv.org/abs/2411.12483
Authors: Marc Roig Vilamala, Jack Furby, Julian de Gortari Briseno, Mani Srivastava, Alun Preece, Carolina Fuentes Toro
Keywords-EN: working alongside humans, robots working alongside, Effective communication, earn trust, essential in collaborative
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 4 pages, 3 figures, published as a Late Breaking Report in RO-MAN 2024

Abstract:Effective communication is essential in collaborative tasks, so AI-equipped robots working alongside humans need to be able to explain their behaviour in order to cooperate effectively and earn trust. We analyse and classify communications among human participants collaborating to complete a simulated emergency response task. The analysis identifies messages that relate to various kinds of interactive explanations identified in the explainable AI literature. This allows us to understand what type of explanations humans expect from their teammates in such settings, and thus where AI-equipped robots most need explanation capabilities. We find that most explanation-related messages seek clarification in the decisions or actions taken. We also confirm that messages have an impact on the performance of our simulated task.

[NLP-19] NMT-Obfuscator Attack: Ignore a sentence in translation with only one word

[Quick Read]: This paper addresses the vulnerability of neural machine translation (NMT) systems to carefully crafted adversarial attacks. The key to the solution is a new attack that inserts a single word between two sentences such that the NMT model ignores the second sentence and does not translate it, while the full adversarial text remains natural in the source language. Because an attacker can thereby hide malicious content from the automatic translation, the attack is harmful in practical scenarios. Experiments show that different NMT models and translation tasks are vulnerable: the attack successfully forces the models to drop the second part of the input in more than 50% of cases while keeping the perplexity of the whole input low.

Link: https://arxiv.org/abs/2411.12473
Authors: Sahar Sadrizadeh, César Descalzo, Ljiljana Dolamic, Pascal Frossard
Keywords-EN: Neural Machine Translation, Neural Machine, Machine Translation systems, diverse applications due, NMT models
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Neural Machine Translation systems are used in diverse applications due to their impressive performance. However, recent studies have shown that these systems are vulnerable to carefully crafted small perturbations to their inputs, known as adversarial attacks. In this paper, we propose a new type of adversarial attack against NMT models. In this attack, we find a word to be added between two sentences such that the second sentence is ignored and not translated by the NMT model. The word added between the two sentences is such that the whole adversarial text is natural in the source language. This type of attack can be harmful in practical scenarios since the attacker can hide malicious information in the automatic translation made by the target NMT model. Our experiments show that different NMT models and translation tasks are vulnerable to this type of attack. Our attack can successfully force the NMT models to ignore the second part of the input in the translation for more than 50% of all cases while being able to maintain low perplexity for the whole input.

[NLP-20] Guide-to-Explain for Controllable Summarization

[Quick Read]: This paper addresses the lack of controllability when large language models (LLMs) are used for abstractive summarization. The authors first show that LLMs control diverse summary attributes poorly, struggling more with numerical attributes such as length and extractiveness than with linguistic ones. The proposed solution is a guide-to-explain framework (GTE) in which the model identifies attributes of the initial draft that are misaligned with user preferences and is guided to explain the errors in its previous output; based on this reflection, it generates a well-adjusted summary. This reflective loop yields summaries that satisfy the desired attributes in surprisingly fewer iterations than other iterative methods relying solely on LLMs.

Link: https://arxiv.org/abs/2411.12460
Authors: Sangwon Ryu, Heejin Do, Daehee Kim, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok
Keywords-EN: demonstrated remarkable performance, abstractive summarization tasks, large language models, large language, demonstrated remarkable
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently, large language models (LLMs) have demonstrated remarkable performance in abstractive summarization tasks. However, controllable summarization with LLMs remains underexplored, limiting their ability to generate summaries that align with specific user preferences. In this paper, we first investigate the capability of LLMs to control diverse attributes, revealing that they encounter greater challenges with numerical attributes, such as length and extractiveness, compared to linguistic attributes. To address this challenge, we propose a guide-to-explain framework (GTE) for controllable summarization. Our GTE framework enables the model to identify misaligned attributes in the initial draft and guides it in explaining errors in the previous output. Based on this reflection, the model generates a well-adjusted summary. As a result, by allowing the model to reflect on its misalignment, we generate summaries that satisfy the desired attributes in surprisingly fewer iterations than other iterative methods solely using LLMs.
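
A minimal sketch of the reflect-then-regenerate loop described above, targeting a numerical length attribute. `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompt wording is illustrative rather than the paper's.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    raise NotImplementedError

def gte_summarize(document: str, target_words: int, max_iters: int = 3) -> str:
    summary = call_llm(f"Summarize in about {target_words} words:\n{document}")
    for _ in range(max_iters):
        n = len(summary.split())
        if abs(n - target_words) <= 0.1 * target_words:
            break                      # desired attribute satisfied
        # Guide: ask the model to explain the misalignment it produced...
        explanation = call_llm(
            f"The summary below has {n} words but should have about "
            f"{target_words}. Explain what to cut or expand:\n{summary}")
        # ...then regenerate conditioned on its own explanation.
        summary = call_llm(
            f"Rewrite the summary to about {target_words} words, following "
            f"this critique:\n{explanation}\n\nSummary:\n{summary}\n\n"
            f"Document:\n{document}")
    return summary
```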

[NLP-21] Variation between Credible and Non-Credible News Across Topics

[Quick Read]: This paper examines how the language and style of fake news vary across news topic domains. The key to the solution is a linguistic and stylistic analysis that, building on related work identifying discourse and linguistic features for deception detection, compares credible and deceptive news across five topics: Economy, Entertainment, Health, Science, and Sports. The results emphasize that linguistic features differ between credible and deceptive news within each domain and highlight the importance of adapting classification tasks to variety-based stylistic and linguistic differences in order to achieve better real-world performance.

Link: https://arxiv.org/abs/2411.12458
Authors: Emilie Francis
Keywords-EN: Fake News’ continues, News’ continues, Fake News’, journalism and politics, continues to undermine
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 1 figure

Abstract:‘Fake News’ continues to undermine trust in modern journalism and politics. Despite continued efforts to study fake news, results have been conflicting. Previous attempts to analyse and combat fake news have largely focused on distinguishing fake news from truth, or differentiating between its various sub-types (such as propaganda, satire, misinformation, etc.) This paper conducts a linguistic and stylistic analysis of fake news, focusing on variation between various news topics. It builds on related work identifying features from discourse and linguistics in deception detection by analysing five distinct news topics: Economy, Entertainment, Health, Science, and Sports. The results emphasize that linguistic features vary between credible and deceptive news in each domain and highlight the importance of adapting classification tasks to accommodate variety-based stylistic and linguistic differences in order to achieve better real-world performance.

[NLP-22] NEON: News Entity-Interaction Extraction for Enhanced Question Answering

[Quick Read]: This paper addresses the inaccuracy of LLM-generated, time-sensitive responses in rapidly evolving domains such as Web search about recent or unfolding events involving entities, where the models' parametric memory is outdated and prototypical retrieval systems fail to capture the latest relevant information or handle conflicting reports in evolving news. The key to the solution is the NEON framework, which extracts emerging entity interactions, such as events or activities, from news articles and builds an entity-centric, timestamped knowledge graph that supports enhanced question answering about news events. NEON's innovation is integrating openIE-style tuples into LLMs for in-context retrieval-augmented generation, which substantially improves QA performance on temporal, entity-centric search queries and lets LLMs deliver more accurate, reliable, and up-to-date responses.

Link: https://arxiv.org/abs/2411.12449
Authors: Sneha Singhania, Silviu Cucerzan, Allen Herring, Sujay Kumar Jauhar
Keywords-EN: Capturing fresh information, large language models, augment existing large, existing large language, Capturing fresh
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Capturing fresh information in near real-time and using it to augment existing large language models (LLMs) is essential to generate up-to-date, grounded, and reliable output. This problem becomes particularly challenging when LLMs are used for informational tasks in rapidly evolving fields, such as Web search related to recent or unfolding events involving entities, where generating temporally relevant responses requires access to up-to-the-hour news sources. However, the information modeled by the parametric memory of LLMs is often outdated, and Web results from prototypical retrieval systems may fail to capture the latest relevant information and struggle to handle conflicting reports in evolving news. To address this challenge, we present the NEON framework, designed to extract emerging entity interactions – such as events or activities – as described in news articles. NEON constructs an entity-centric timestamped knowledge graph that captures such interactions, thereby facilitating enhanced QA capabilities related to news events. Our framework innovates by integrating open Information Extraction (openIE) style tuples into LLMs to enable in-context retrieval-augmented generation. This integration demonstrates substantial improvements in QA performance when tackling temporal, entity-centric search queries. Through NEON, LLMs can deliver more accurate, reliable, and up-to-date responses.
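
A minimal sketch of the kind of entity-centric, timestamped store NEON builds: openIE-style (subject, relation, object, time) tuples placed in a graph and retrieved by entity and time window to ground an LLM prompt. The tuples and field names are invented examples.

```python
from datetime import date
import networkx as nx

# openIE-style tuples: (subject, relation, object, timestamp) - toy examples.
tuples = [
    ("TeamA", "defeated", "TeamB", date(2024, 11, 18)),
    ("TeamA", "signed", "PlayerX", date(2024, 11, 19)),
    ("CompanyY", "acquired", "StartupZ", date(2024, 11, 17)),
]

g = nx.MultiDiGraph()
for s, r, o, t in tuples:
    g.add_edge(s, o, relation=r, time=t)   # entity-centric timestamped edge

def retrieve(entity, since):
    """Interactions involving `entity` on/after `since`, for prompt context."""
    hits = []
    for u, v, data in g.edges(data=True):
        if entity in (u, v) and data["time"] >= since:
            hits.append(f"{data['time']}: {u} {data['relation']} {v}")
    return hits

context = retrieve("TeamA", date(2024, 11, 18))
print("\n".join(context))   # prepend to the LLM prompt for grounded QA
```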

[NLP-23] Evaluating the Prompt Steerability of Large Language Models

[Quick Read]: This paper addresses the problem of building pluralistic AI, i.e., designing models that can be shaped to represent a wide range of value systems and cultures. The key to the solution is a benchmark for evaluating the steerability of model personas as a function of prompting, based on a formal definition of prompt steerability that analyzes how far a model's joint behavioral distribution can be shifted from its baseline behavior. By defining steerability indices and inspecting how they change with steering effort, the authors estimate steerability across persona dimensions and directions. The benchmark reveals that many current models are only weakly steerable, due both to skew in their baseline behavior and to asymmetry in steerability across many persona dimensions.

链接: https://arxiv.org/abs/2411.12405
作者: Erik Miehling,Michael Desmond,Karthikeyan Natesan Ramamurthy,Elizabeth M. Daly,Pierre Dognin,Jesus Rios,Djallel Bouneffouf,Miao Liu
关键词-EN: Building pluralistic, requires designing models, systems and cultures, shaped to represent, represent a wide
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model’s joint behavioral distribution can be shifted from its baseline behavior. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited – due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at this https URL.
摘要:构建多元化的 AI 需要设计能够适应多种价值体系和文化背景的模型。实现这一目标首先需要评估给定模型在多大程度上能够反映不同的个性特征。为此,我们提出了一种基准测试,用于评估模型个性在提示引导下的可塑性。我们的设计基于提示可塑性的形式化定义,该定义分析了模型联合行为分布从其基准行为转移的程度。通过定义可塑性指数并检查这些指数如何随引导努力的变化而变化,我们可以估计模型在不同个性维度和方向上的可塑性。我们的基准测试揭示,许多当前模型的可塑性有限——这既源于其基准行为的偏斜,也源于其在许多个性维度上可塑性的不对称性。我们在以下链接中发布了基准测试的实现。
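论文将可塑性刻画为模型联合行为分布相对基线的偏移程度。下面用一个玩具示例演示这一思路:以总变差距离度量分布偏移,并取其随"引导力度"变化的最大值作为假设的可塑性指数(度量与指数的具体定义均为笔者假设,以论文为准):

```python
import numpy as np

def total_variation(p, q):
    """两个离散行为分布之间的总变差距离。"""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * float(np.abs(p / p.sum() - q / q.sum()).sum())

def steerability_index(baseline, steered_by_effort):
    """假设的可塑性指数:行为分布相对基线的最大偏移。"""
    shifts = {k: total_variation(baseline, q) for k, q in steered_by_effort.items()}
    return max(shifts.values()), shifts

baseline = [0.7, 0.2, 0.1]  # 某一人格维度上三档回应的基线倾向(纯属示例)
steered = {1: [0.6, 0.25, 0.15], 3: [0.4, 0.35, 0.25], 5: [0.2, 0.4, 0.4]}
index, curve = steerability_index(baseline, steered)
print(index, curve)  # 偏移随引导力度增大而增大,index 即最大偏移
```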

[NLP-24] Do LLMs Understand Ambiguity in Text? A Case Study in Open-world Question Answering

【速读】: 该论文试图解决大型语言模型(LLMs)在开放领域问答任务中因自然语言的歧义性而导致的性能下降问题。解决方案的关键在于采用简单、无需训练的标记级歧义消除策略,通过显式地处理语言中的不确定性来提高LLMs在歧义问答任务中的表现。研究通过实验验证了这些策略的有效性,并讨论了在处理LLMs中的歧义问题时的最佳实践和广泛影响。

链接: https://arxiv.org/abs/2411.12395
作者: Aryan Keluskar,Amrita Bhattacharjee,Huan Liu
关键词-EN: Large Language Models, natural language poses, language poses significant, Language Models, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the REU Symposium at IEEE BigData 2024

点击查看摘要

Abstract:Ambiguity in natural language poses significant challenges to Large Language Models (LLMs) used for open-domain question answering. LLMs often struggle with the inherent uncertainties of human communication, leading to misinterpretations, miscommunications, hallucinations, and biased responses. This significantly weakens their ability to be used for tasks like fact-checking, question answering, feature extraction, and sentiment analysis. Using open-domain question answering as a test case, we compare off-the-shelf and few-shot LLM performance, focusing on measuring the impact of explicit disambiguation strategies. We demonstrate how simple, training-free, token-level disambiguation methods may be effectively used to improve LLM performance for ambiguous question answering tasks. We empirically show our findings and discuss best practices and broader impacts regarding ambiguity in LLMs.
摘要:自然语言中的歧义性对用于开放领域问答的大语言模型(Large Language Models, LLMs)构成了重大挑战。LLMs 常常难以应对人类交流中固有的不确定性,导致误解、沟通失误、幻觉产生以及偏见响应。这显著削弱了它们在事实核查、问答、特征提取和情感分析等任务中的应用能力。以开放领域问答为测试案例,我们比较了现成和少样本 LLM 的表现,重点测量了显式消歧策略的影响。我们展示了如何有效地使用简单、无需训练、基于 Token 级别的消歧方法来提升 LLM 在处理歧义问答任务时的性能。我们通过实证展示了研究结果,并讨论了关于 LLMs 中歧义性的最佳实践和广泛影响。

[NLP-25] RedPajama: an Open Dataset for Training Large Language Models NEURIPS2024

【速读】: 该论文试图解决开源语言模型在数据集构建和筛选过程中面临的三大核心挑战:(1) 模型开发过程的透明度,包括数据筛选过程;(2) 获取大量高质量数据;(3) 数据集构建和分析所需的元数据和工件的可用性。解决方案的关键在于发布了RedPajama-V1和RedPajama-V2两个数据集,前者是对LLaMA训练数据集的开源复现,后者是一个包含原始、未过滤文本数据及其质量信号和元数据的大型网络数据集。这些数据集不仅提供了丰富的数据资源,还通过质量信号帮助筛选高质量数据子集,从而推动透明且高性能语言模型的发展。

链接: https://arxiv.org/abs/2411.12372
作者: Maurice Weber,Daniel Fu,Quentin Anthony,Yonatan Oren,Shane Adams,Anton Alexandrov,Xiaozhong Lyu,Huu Nguyen,Xiaozhe Yao,Virginia Adams,Ben Athiwaratkun,Rahul Chalamala,Kezhen Chen,Max Ryabinin,Tri Dao,Percy Liang,Christopher Ré,Irina Rish,Ce Zhang
关键词-EN: remain largely elusive, language models, filtering remain largely, artificial intelligence, largely elusive
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks

点击查看摘要

Abstract:Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains and with their quality signals facilitate the filtering of data, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used in the training of strong language models used in production, such as Snowflake Arctic, Salesforce’s XGen and AI2’s OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only language models with up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing language models at scale.
摘要:大语言模型正日益成为人工智能、科学以及整个社会中的基石技术,然而,关于数据集构成和筛选的最佳策略仍然大多未解。许多顶尖模型的数据集筛选和模型开发过程缺乏透明度,这成为开发完全开放语言模型的障碍。本文中,我们识别了三个核心的数据相关挑战,这些挑战必须得到解决以推动开源语言模型的发展。这些挑战包括:(1)模型开发中的透明度,包括数据筛选过程;(2)获取大量高质量数据;(3)数据集筛选和分析所需的元数据和工件的可用性。为应对这些挑战,我们发布了RedPajama-V1,这是LLaMA训练数据集的开源复制品。此外,我们还发布了RedPajama-V2,这是一个仅包含网络数据的庞大数据集,由原始、未经筛选的文本数据以及质量信号和元数据组成。RedPajama数据集合计包含超过100万亿个Token,涵盖多个领域,并通过其质量信号促进了数据的筛选,旨在激发新数据集的开发。迄今为止,这些数据集已被用于训练如Snowflake Arctic、Salesforce的XGen和AI2的OLMo等生产级强语言模型。为了深入了解RedPajama的质量,我们进行了一系列分析和消融研究,使用了解码器专用语言模型,参数规模高达16亿。我们的研究结果展示了如何有效利用网络数据的质量信号来筛选高质量的数据子集,突显了RedPajama在推动透明且高性能语言模型大规模开发方面的潜力。
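RedPajama-V2 为每篇文档附带质量信号,供下游按需筛选。下面是按阈值筛选高质量子集的一个极简示意(信号名与阈值均为假设,实际信号清单请以数据集文档为准):

```python
def keep_document(doc: dict, min_thresholds: dict) -> bool:
    """按质量信号做阈值过滤:所有信号均不低于下限才保留。"""
    signals = doc["quality_signals"]
    return all(signals.get(name, 0.0) >= t for name, t in min_thresholds.items())

min_thresholds = {
    "word_count": 50,        # 假设信号:文档最少词数
    "language_score": 0.8,   # 假设信号:语种识别置信度
}
corpus = [
    {"text": "…", "quality_signals": {"word_count": 320, "language_score": 0.97}},
    {"text": "…", "quality_signals": {"word_count": 12, "language_score": 0.55}},
]
high_quality = [d for d in corpus if keep_document(d, min_thresholds)]  # 只保留第一篇
```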

[NLP-26] A Layered Architecture for Developing and Enhancing Capabilities in Large Language Model-based Software Systems

【速读】: 该论文试图解决在应用开发中,如何有效地扩展大型语言模型(LLMs)的功能以满足不断演进的需求。解决方案的关键在于引入一个分层架构,将LLM软件系统开发划分为不同的层次,每个层次具有特定的属性。通过将功能与这些层次对齐,该框架鼓励系统性地实现功能,以实现高效、有效的开发,从而支持所需的功能和质量。这种方法有助于在工程复杂性、可扩展性和运营成本之间找到平衡,并通过实际案例研究展示了其有效性。

链接: https://arxiv.org/abs/2411.12357
作者: Dawen Zhang,Xiwei Xu,Chen Wang,Zhenchang Xing,Robert Mao
关键词-EN: Large Language Models, basic language tasks, Language Models, Large Language, Significant efforts
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Significant efforts has been made to expand the use of Large Language Models (LLMs) beyond basic language tasks. While the generalizability and versatility of LLMs have enabled widespread adoption, evolving demands in application development often exceed their native capabilities. Meeting these demands may involve a diverse set of methods, such as enhancing creativity through either inference temperature adjustments or creativity-provoking prompts. Selecting the right approach is critical, as different methods lead to trade-offs in engineering complexity, scalability, and operational costs. This paper introduces a layered architecture that organizes LLM software system development into distinct layers, each characterized by specific attributes. By aligning capabilities with these layers, the framework encourages the systematic implementation of capabilities in effective and efficient ways that ultimately supports desired functionalities and qualities. Through practical case studies, we illustrate the utility of the framework. This work offers developers actionable insights for selecting suitable technologies in LLM-based software system development, promoting robustness and scalability.
摘要:近年来,大语言模型(Large Language Models, LLMs)的应用范围已远远超越了基本的语言处理任务。尽管LLMs的通用性和多功能性使其得到了广泛应用,但随着应用开发需求的不断演进,这些模型固有的能力往往难以满足所有需求。为了应对这些挑战,可能需要采用多种方法,例如通过调整推理温度或使用激发创造力的提示来增强模型的创造性。选择合适的方法至关重要,因为不同的方法在工程复杂性、可扩展性和运营成本方面存在权衡。本文提出了一种分层架构,将LLM软件系统开发划分为不同的层次,每个层次都具有特定的属性。通过将能力与这些层次对齐,该框架鼓励以有效且高效的方式系统化地实现能力,从而最终支持所需的功能和质量。通过实际案例研究,我们展示了该框架的实用性。这项工作为开发者在基于LLM的软件系统开发中选择合适的技术提供了可操作的见解,促进了系统的稳健性和可扩展性。

[NLP-27] Balancing Accuracy and Efficiency in Multi-Turn Intent Classification for LLM-Powered Dialog Systems in Production

【速读】: 该论文试图解决多轮对话意图分类中的两个关键问题:数据集的稀缺性和上下文依赖的复杂性。解决方案的关键在于利用大型语言模型 (LLMs) 来提升生产对话系统的可扩展性和降低延迟。具体方法包括:1) 引入符号调优 (Symbol Tuning),通过简化意图标签来降低任务复杂性并提高多轮对话中的性能;2) 提出C-LARA框架 (Consistency-aware, Linguistics Adaptive Retrieval Augmentation),利用LLMs进行数据增强和伪标签生成,以生成合成的多轮对话数据集,进而微调一个适合部署的小型高效模型。这些方法显著提高了分类准确性和资源效率,同时在多语言对话数据集上展示了其实用性和影响力。

链接: https://arxiv.org/abs/2411.12307
作者: Junhua Liu,Yong Keat Tan,Bin Fu,Kwan Hui Lim
关键词-EN: Accurate multi-turn intent, Accurate multi-turn, essential for advancing, advancing conversational, Large Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Accurate multi-turn intent classification is essential for advancing conversational AI systems. However, challenges such as the scarcity of comprehensive datasets and the complexity of contextual dependencies across dialogue turns hinder progress. This paper presents two novel approaches leveraging Large Language Models (LLMs) to enhance scalability and reduce latency in production dialogue systems. First, we introduce Symbol Tuning, which simplifies intent labels to reduce task complexity and improve performance in multi-turn dialogues. Second, we propose C-LARA (Consistency-aware, Linguistics Adaptive Retrieval Augmentation), a framework that employs LLMs for data augmentation and pseudo-labeling to generate synthetic multi-turn dialogues. These enriched datasets are used to fine-tune a small, efficient model suitable for deployment. Experiments conducted on multilingual dialogue datasets demonstrate significant improvements in classification accuracy and resource efficiency. Our methods enhance multi-turn intent classification accuracy by 5.09%, reduce annotation costs by 40%, and enable scalable deployment in low-resource multilingual industrial systems, highlighting their practicality and impact.
摘要:准确的多轮意图分类对于推进对话式 AI 系统至关重要。然而,全面数据集的稀缺性和对话轮次间上下文依赖关系的复杂性阻碍了进展。本文提出了两种利用大语言模型 (LLM) 来增强生产对话系统可扩展性和降低延迟的新方法。首先,我们引入了符号调优 (Symbol Tuning),通过简化意图标签来降低任务复杂性并提高多轮对话中的性能。其次,我们提出了 C-LARA(一致性感知、语言适应性检索增强)框架,该框架利用 LLM 进行数据增强和伪标签生成,以生成合成多轮对话。这些丰富的数据集用于微调一个适合部署的小型高效模型。在多语言对话数据集上进行的实验表明,分类准确性和资源效率显著提高。我们的方法将多轮意图分类准确性提高了 5.09%,减少了 40% 的标注成本,并实现了在低资源多语言工业系统中的可扩展部署,突显了其实用性和影响力。
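文中两个思路都落在数据构造层面:一是"符号调优"用抽象符号替换语义化的意图标签,二是把历史对话与当前单轮查询拼接成分类输入。下面给出一个极简示意(意图标签与拼接格式均为笔者假设):

```python
# 符号调优:用抽象符号替换语义标签,降低标签本身带来的任务复杂度
INTENT_SYMBOLS = {"查询订单状态": "<A>", "申请退款": "<B>", "投诉物流": "<C>"}

def build_input(history: list, query: str) -> str:
    """将历史对话与当前单轮查询拼接为分类输入(格式为示意)。"""
    turns = [f"{role}: {text}" for role, text in history]
    return "\n".join(turns + [f"user: {query}"])

history = [("user", "我昨天下的单还没发货"), ("assistant", "请提供订单号")]
x = build_input(history, "算了,我想退钱")
# 分类器在 {<A>, <B>, <C>} 上预测;训练与推理都使用符号而非原始标签文本
```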

[NLP-28] CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model

【速读】: 该论文试图解决多模态大语言模型(MLLMs)在查询解析中面临的意图理解、信息检索和安全过滤方面的挑战。解决方案的关键是引入了一个名为Contextual Understanding and Enhanced Search with MLLM (CUE-M)的新型多模态搜索管道。CUE-M通过一个多阶段框架来解决这些问题,该框架包括图像上下文丰富、意图细化、上下文查询生成、外部API集成和基于相关性的过滤。此外,CUE-M还整合了一个强大的安全框架,结合了基于图像、文本和多模态的分类器,能够动态适应实例和类别特定的风险。通过在多模态QA数据集和公共安全基准上的评估,CUE-M在准确性、知识集成和安全性方面均优于基线系统,从而提升了多模态检索系统的性能。

链接: https://arxiv.org/abs/2411.12287
作者: Dongyoung Go,Taesun Whang,Chanhee Lee,Hwayeon Kim,Sunghoon Park,Seunghwan Ji,Dongchan Kim,Young-Bum Kim
关键词-EN: Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large
类目: Computation and Language (cs.CL)
备注: Preprint. Under review

点击查看摘要

Abstract:The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has expanded the scope of multimodal query resolution. However, current systems struggle with intent understanding, information retrieval, and safety filtering, limiting their effectiveness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search pipeline that addresses these challenges through a multi-stage framework comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust safety framework combining image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific risks. Evaluations on a multimodal QA dataset and a public safety benchmark demonstrate that CUE-M outperforms baselines in accuracy, knowledge integration, and safety, advancing the capabilities of multimodal retrieval systems.
摘要:将检索增强生成 (Retrieval-Augmented Generation, RAG) 与多模态大语言模型 (Multimodal Large Language Models, MLLMs) 结合,扩展了多模态查询解析的范围。然而,当前系统在意图理解、信息检索和安全过滤方面存在困难,限制了其有效性。本文提出了多模态搜索管道 Contextual Understanding and Enhanced Search with MLLM (CUE-M),通过一个多阶段框架来解决这些挑战,该框架包括图像上下文丰富、意图细化、上下文查询生成、外部 API 集成和基于相关性的过滤。CUE-M 结合了基于图像、文本和多模态分类器的强大安全框架,能够动态适应实例和类别特定的风险。在多模态问答数据集和公开的安全基准上的评估表明,CUE-M 在准确性、知识集成和安全性方面优于基线,提升了多模态检索系统的能力。

[NLP-29] Building Trust: Foundations of Security, Safety and Transparency in AI

【速读】: 该论文试图解决公开可用AI模型生态系统中日益增长的安全(security)与安全性(safety)问题。解决方案的关键在于提出全面的策略,以增强模型开发者和终端用户的安全与安全性保障。这些策略包括解决跟踪问题、修复漏洞以及建立AI模型生命周期和所有权流程的标准化方法,从而为AI模型开发和运营中的安全、安全性和透明性奠定基础。

链接: https://arxiv.org/abs/2411.12275
作者: Huzaifa Sidhpurwala,Garth Mollett,Emily Fox,Mark Bestavros,Huamin Chen
关键词-EN: rapidly evolving ecosystem, explores the rapidly, rapidly evolving, potential implications, safety landscape
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper explores the rapidly evolving ecosystem of publicly available AI models, and their potential implications on the security and safety landscape. As AI models become increasingly prevalent, understanding their potential risks and vulnerabilities is crucial. We review the current security and safety scenarios while highlighting challenges such as tracking issues, remediation, and the apparent absence of AI model lifecycle and ownership processes. Comprehensive strategies to enhance security and safety for both model developers and end-users are proposed. This paper aims to provide some of the foundational pieces for more standardized security, safety, and transparency in the development and operation of AI models and the larger open ecosystems and communities forming around them.
摘要:本文探讨了公开可用 AI 模型生态系统的快速演变,及其对安全(security)与安全性(safety)格局可能产生的影响。随着 AI 模型的日益普及,理解其潜在风险和漏洞变得至关重要。我们回顾了当前的安全与安全性态势,同时强调了诸如追踪问题、修复措施以及 AI 模型生命周期和所有权流程明显缺失等挑战。本文提出了增强模型开发者和终端用户安全与安全性的全面策略,旨在为 AI 模型开发与运营中更标准化的安全、安全性及透明度,以及围绕这些模型形成的更大开放生态系统和社区提供一些基础性内容。

[NLP-30] Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service

【速读】: 该论文试图解决低资源语言(low-resource languages)机器翻译(MT)实际使用模式的研究不足问题。解决方案的关键在于通过观察性分析(observational analysis)实际使用数据,揭示用户需求和行为模式,从而为低资源语言的MT系统开发提供实用指导。具体来说,研究利用一个广泛使用的MT服务的服务器日志,分析了超过100,000条Tetun语(Timor-Leste的通用语言)的翻译请求,发现用户(尤其是学生)主要通过移动设备将简短文本翻译成Tetun,涉及科学、医疗和日常生活等多个领域。这与现有Tetun语语料库主要包含新闻文章的情况形成鲜明对比。因此,研究建议针对Tetun等低资源语言的MT系统应优先考虑将文本翻译成目标语言、有效处理简短输入,并涵盖教育环境中相关的广泛领域。这种方法通过将研究基于实际社区需求,展示了观察性分析在低资源语言技术开发中的重要性。

链接: https://arxiv.org/abs/2411.12262
作者: Raphael Merx,Hanna Suominen,Adérito José Guterres Correia,Trevor Cohn,Ekaterina Vylomova
关键词-EN: remains poorly understood, languages remains poorly, poorly understood, impact of machine, remains poorly
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The impact of machine translation (MT) on low-resource languages remains poorly understood. In particular, observational studies of actual usage patterns are scarce. Such studies could provide valuable insights into user needs and behaviours, complementing survey-based methods. Here we present an observational analysis of real-world MT usage for Tetun, the lingua franca of Timor-Leste, using server logs from a widely-used MT service with over 70,000 monthly active users. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate short texts into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for languages like Tetun should prioritise translating into the low-resource language, handling brief inputs effectively, and covering a wide range of domains relevant to educational contexts. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development, by grounding research in practical community needs.
摘要:机器翻译(Machine Translation, MT)对低资源语言的影响仍未得到充分理解。特别是,关于实际使用模式的观察性研究非常稀缺,而这类研究可以提供关于用户需求和行为的宝贵见解,补充基于调查的方法。本文利用一个月活跃用户超过 70,000 的常用 MT 服务的服务器日志,对东帝汶通用语言——德顿语(Tetun)的真实世界 MT 使用情况进行了观察性分析。我们对 100,000 次翻译请求的分析揭示了与基于现有语料库的假设相悖的模式。我们发现,用户(其中许多是使用移动设备的学生)通常将短文本翻译成德顿语,涵盖科学、医疗保健和日常生活等多个领域。这与现有的德顿语语料库形成鲜明对比,后者主要由涵盖政府和社会问题的新闻文章组成。我们的研究结果表明,针对德顿语等语言的 MT 系统应优先考虑将文本翻译成低资源语言、有效处理简短输入,并涵盖与教育环境相关的广泛领域。更广泛地说,本研究展示了观察性分析如何通过立足于实际社区需求,为低资源语言技术的发展提供指导。

[NLP-31] Predicting User Intents and Musical Attributes from Music Discovery Conversations

【速读】: 该论文试图解决音乐领域中的意图分类问题,特别是音乐发现对话中的用户需求识别。解决方案的关键在于引入预训练语言模型,并提出了一种结合历史对话信息与当前单轮用户查询的方法,以增强模型对整体对话上下文的理解。这种方法不仅用于预测功能性需求(意图分类),还扩展到音乐属性分类,从而显著提升了F1分数,并超越了预训练的Llama 3模型在零样本和少样本学习中的表现。

链接: https://arxiv.org/abs/2411.12254
作者: Daeyong Kwon,SeungHeon Doh,Juhan Nam
关键词-EN: Intent classification, classification, Intent, text understanding task, musical attribute classification
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Intent classification is a text understanding task that identifies user needs from input text queries. While intent classification has been extensively studied in various domains, it has not received much attention in the music domain. In this paper, we investigate intent classification models for music discovery conversation, focusing on pre-trained language models. Rather than only predicting functional needs: intent classification, we also include a task for classifying musical needs: musical attribute classification. Additionally, we propose a method of concatenating previous chat history with just single-turn user queries in the input text, allowing the model to understand the overall conversation context better. Our proposed model significantly improves the F1 score for both user intent and musical attribute classification, and surpasses the zero-shot and few-shot performance of the pretrained Llama 3 model.
摘要:意图分类是一项文本理解任务,旨在从输入的文本查询中识别用户需求。尽管意图分类在多个领域得到了广泛研究,但在音乐领域却鲜有关注。本文探讨了用于音乐发现对话的意图分类模型,重点关注预训练语言模型。我们不仅预测功能需求:意图分类,还引入了一项分类音乐需求的任务:音乐属性分类。此外,我们提出了一种方法,即将之前的聊天历史与单轮用户查询连接起来作为输入文本,从而使模型能够更好地理解整体对话上下文。我们提出的模型在用户意图和音乐属性分类的F1分数上显著提升,并超越了预训练的Llama 3模型在零样本和少样本情况下的表现。

[NLP-32] Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages

【速读】: 该论文试图解决在基于transformer架构的大型语言模型(LLMs)中,针对印度官方语言的多语言模型中,如何优化tokenization过程以提升模型性能的问题。解决方案的关键在于全面评估12种LLMs在印度22种官方语言上的tokenization效率,并引入Normalized Sequence Length (NSL)作为主要评估指标。研究结果表明,SUTRA tokenizer在处理印度语言方面表现最为出色,尤其在14种语言上优于其他模型,包括多个专门针对印度语言的模型。此外,GPT-4o在处理印度语言方面相较于GPT-4有所改进,而Project Indus在某些语言上的表现则较为有限。该研究强调了开发针对多语言和印度语言的tokenization策略的重要性,为未来tokenizer设计的改进奠定了基础,以提升语言覆盖率和模型效率。

链接: https://arxiv.org/abs/2411.12240
作者: S. Tamang,D. J. Bora
关键词-EN: Large Language Models, Large Language, based on transformer, variety of domains, fine-tuning stages
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer’s superior handling of Indic languages, GPT-4o’s advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.
摘要:基于 Transformer 架构的大语言模型 (Large Language Models, LLMs) 已经在多个领域引发了革命性变革,其中 Tokenization 在其预处理和微调阶段扮演了关键角色。在多语言模型中,特别是针对印度语言定制的模型,有效的 Tokenization 对于优化性能至关重要。本文对 12 种大语言模型在印度所有 22 种官方语言中使用的 Tokenizer 进行了全面评估,重点比较了其 Tokenization 过程的效率。我们采用了标准化序列长度 (Normalized Sequence Length, NSL) 作为分析的关键指标。研究结果显示,SUTRA Tokenizer 在所有模型中表现最佳,包括多个专门针对印度语言的模型,在 14 种语言中表现出色。值得注意的是,SUTRA Tokenizer 在处理印度语言方面表现优越,GPT-4o 在处理印度语言方面相较于其前身 GPT-4 有所进步,而 Project Indus 在某些语言中的表现有限。本研究强调了为多语言和以印度语言为中心的模型开发针对性 Tokenization 策略的至关重要性,为未来在 Tokenizer 设计方面的改进奠定了基础,以提升语言覆盖范围和模型效率。
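以 Normalized Sequence Length (NSL) 为例,一种常见做法是用目标 tokenizer 的 token 总数除以参考 tokenizer 的 token 总数,比值越小说明对该语言的切分越紧凑。下面给出一个假设性的计算示意(具体归一化定义以原论文为准;注释中的模型名为假设):

```python
def nsl(tokenizer, texts, baseline_tokenizer):
    """假设的 NSL 计算:目标 tokenizer 的 token 总数 / 基线 tokenizer 的 token 总数。"""
    target = sum(len(tokenizer.encode(t)) for t in texts)
    base = sum(len(baseline_tokenizer.encode(t)) for t in texts)
    return target / base

# 用法示意(以 Hugging Face transformers 为例;模型名为假设):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("some-org/sutra-tokenizer")
# ref = AutoTokenizer.from_pretrained("gpt2")
# print(nsl(tok, hindi_sentences, ref))  # <1 表示比参考切分更紧凑
```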

[NLP-33] BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language? EMNLP2024

【速读】: 该论文试图解决当前密集检索系统(Dense Retrieval)在理解语言中隐含的布尔逻辑(Boolean logic)方面的不足。解决方案的关键在于提出了布尔密集检索任务(Boolean Dense Retrieval),并构建了一个名为BoolQuestions的基准数据集,用于评估系统对复杂布尔查询的理解能力。通过实验,论文发现现有系统在布尔逻辑理解上存在显著缺陷,并提出了一种对比持续训练方法(contrastive continual training method),作为增强语言模型布尔逻辑理解能力的强基线(strong baseline)。

链接: https://arxiv.org/abs/2411.12235
作者: Zongmeng Zhang,Jinhua Zhu,Wengang Zhou,Xiang Qi,Peng Zhang,Houqiang Li
关键词-EN: dense vector representations, Boolean logic, natural language processing, dense retrieval systems, Boolean Dense Retrieval
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Findings of the Association for Computational Linguistics: EMNLP 2024

点击查看摘要

Abstract:Dense retrieval, which aims to encode the semantic information of arbitrary text into dense vector representations or embeddings, has emerged as an effective and efficient paradigm for text retrieval, consequently becoming an essential component in various natural language processing systems. These systems typically focus on optimizing the embedding space by attending to the relevance of text pairs, while overlooking the Boolean logic inherent in language, which may not be captured by current training objectives. In this work, we first investigate whether current retrieval systems can comprehend the Boolean logic implied in language. To answer this question, we formulate the task of Boolean Dense Retrieval and collect a benchmark dataset, BoolQuestions, which covers complex queries containing basic Boolean logic and corresponding annotated passages. Through extensive experimental results on the proposed task and benchmark dataset, we draw the conclusion that current dense retrieval systems do not fully understand Boolean logic in language, and there is a long way to go to improve our dense retrieval systems. Furthermore, to promote further research on enhancing the understanding of Boolean logic for language models, we explore Boolean operation on decomposed query and propose a contrastive continual training method that serves as a strong baseline for the research community.
摘要:密集检索(Dense Retrieval)旨在将任意文本的语义信息编码为密集向量表示或嵌入(embeddings),已成为文本检索中一种高效且有效的范式,并因此成为各种自然语言处理系统中的关键组件。这些系统通常专注于通过关注文本对的相关性来优化嵌入空间,而忽略了语言中固有的布尔逻辑,这种逻辑可能无法通过当前的训练目标捕捉到。在本研究中,我们首先探讨了当前的检索系统是否能够理解语言中隐含的布尔逻辑。为了回答这个问题,我们提出了布尔密集检索(Boolean Dense Retrieval)任务,并收集了一个基准数据集 BoolQuestions,该数据集涵盖了包含基本布尔逻辑的复杂查询及其对应的标注段落。通过在提出的任务和基准数据集上的广泛实验结果,我们得出结论:当前的密集检索系统并未完全理解语言中的布尔逻辑,提升我们的密集检索系统还有很长的路要走。此外,为了促进对语言模型布尔逻辑理解能力的进一步研究,我们探索了在分解查询上的布尔操作,并提出了一种对比持续训练方法,该方法作为研究社区的一个强有力的基线。
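论文还提到"在分解查询上做布尔操作"。一个直观的实现思路是:把"A 且 非 B"分解为正、负子查询,分别与段落嵌入计算相似度后再组合打分。下面是该思路的极简示意(组合方式为笔者假设,并非论文的具体方法):

```python
import numpy as np

def boolean_score(p_emb, pos_embs, neg_embs, lam=1.0):
    """分解式布尔打分:AND 取各正子查询相似度的最小值,NOT 减去负子查询相似度。"""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = min(cos(p_emb, q) for q in pos_embs)             # 须同时满足所有正条件
    neg = max((cos(p_emb, q) for q in neg_embs), default=0.0)
    return pos - lam * neg                                 # 惩罚命中 NOT 条件的段落

# 查询 "介绍 A 但不涉及 B 的段落" -> pos_embs=[encode("A")], neg_embs=[encode("B")]
rng = np.random.default_rng(0)
p, a, b = rng.normal(size=(3, 128))  # 用随机向量代替真实嵌入做演示
print(boolean_score(p, [a], [b]))
```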

[NLP-34] Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes

【速读】: 该论文试图解决在线多模态环境中毒性识别的挑战,特别是由于文本和视觉模态之间复杂的上下文联系。解决方案的关键在于提出了一种结合大视觉语言模型(Large Visual Language Models, LVLMs)的知识蒸馏(Knowledge Distillation, KD)和知识注入的新框架。该框架通过从大规模常识知识图谱(Knowledge Graph, KG)如ConceptNet中提取子知识图谱,并将其注入到紧凑的视觉语言模型(VLM)中,以增强模型对仇恨表情包中毒性检测的推理能力。这种方法通过融合显式(如KG)和隐式(如LVLMs)的上下文线索,显著提升了模型在毒性检测任务中的性能,特别是在AU-ROC、F1和Recall指标上分别提高了1.1%、7%和35%。

链接: https://arxiv.org/abs/2411.12174
作者: Rahul Garg,Trilok Padhi,Hemang Jain,Ugur Kursuncu,Ponnurangam Kumaraguru
关键词-EN: Large Visual Language, challenging task due, multimodal environments remains, Visual Language Models, integrates Knowledge Distillation
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Toxicity identification in online multimodal environments remains a challenging task due to the complexity of contextual connections across modalities (e.g., textual and visual). In this paper, we propose a novel framework that integrates Knowledge Distillation (KD) from Large Visual Language Models (LVLMs) and knowledge infusion to enhance the performance of toxicity detection in hateful memes. Our approach extracts sub-knowledge graphs from ConceptNet, a large-scale commonsense Knowledge Graph (KG) to be infused within a compact VLM framework. The relational context between toxic phrases in captions and memes, as well as visual concepts in memes enhance the model’s reasoning capabilities. Experimental results from our study on two hate speech benchmark datasets demonstrate superior performance over the state-of-the-art baselines across AU-ROC, F1, and Recall with improvements of 1.1%, 7%, and 35%, respectively. Given the contextual complexity of the toxicity detection task, our approach showcases the significance of learning from both explicit (i.e. KG) as well as implicit (i.e. LVLMs) contextual cues incorporated through a hybrid neurosymbolic approach. This is crucial for real-world applications where accurate and scalable recognition of toxic content is critical for creating safer online environments.
摘要:在线多模态环境中的毒性识别仍然是一个具有挑战性的任务,这是由于模态间(例如,文本和视觉)的上下文联系复杂性所致。本文提出了一种新颖的框架,该框架结合了大视觉语言模型(LVLMs)的知识蒸馏(Knowledge Distillation, KD)和知识注入,以提升仇恨表情包中毒性检测的性能。我们的方法从ConceptNet这一大规模常识知识图谱(Knowledge Graph, KG)中提取子知识图谱,并将其注入到一个紧凑的视觉语言模型(VLM)框架中。标题中的毒性短语与表情包中的视觉概念之间的关系上下文增强了模型的推理能力。我们在两个仇恨言论基准数据集上的研究实验结果显示,与最先进的基线相比,在AU-ROC、F1和召回率上分别提高了1.1%、7%和35%。鉴于毒性检测任务的上下文复杂性,我们的方法展示了从显式(即KG)和隐式(即LVLMs)上下文线索中学习的重要性,这些线索通过混合神经符号方法结合在一起。这对于现实世界应用至关重要,因为在这些应用中,准确且可扩展的毒性内容识别对于创建更安全的在线环境至关重要。

[NLP-35] A Combined Encoder and Transformer Approach for Coherent and High-Quality Text Generation

【速读】: 该论文试图解决现有文本生成模型在语义深度和文本流畅性方面的局限性,解决方案的关键在于结合BERT的语义解释能力与GPT-4的生成能力,构建一个混合模型(BERT-GPT-4)。通过这种架构,模型不仅增强了语义深度,还能生成流畅、类似人类的文本,克服了以往模型在这两方面的不足。实验结果表明,BERT-GPT-4在困惑度(Perplexity)和BLEU等关键指标上优于传统的GPT-3、T5、BART、Transformer-XL和CTRL模型,展示了其在自然语言生成任务中的优越性能。

链接: https://arxiv.org/abs/2411.12157
作者: Jiajing Chen,Shuo Wang,Zhen Qi,Zhenhong Zhang,Chihang Wang,Hongye Zheng
关键词-EN: combines BERT semantic, BERT semantic interpretation, combines BERT, semantic interpretation strengths, contextually accurate language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This research introduces a novel text generation model that combines BERT’s semantic interpretation strengths with GPT-4’s generative capabilities, establishing a high standard in generating coherent, contextually accurate language. Through the combined architecture, the model enhances semantic depth and maintains smooth, human-like text flow, overcoming limitations seen in prior models. Experimental benchmarks reveal that BERT-GPT-4 surpasses traditional models, including GPT-3, T5, BART, Transformer-XL, and CTRL, in key metrics like Perplexity and BLEU, showcasing its superior natural language generation performance. By fully utilizing contextual information, this hybrid model generates text that is not only logically coherent but also aligns closely with human language patterns, providing an advanced solution for text generation tasks. This research highlights the potential of integrating semantic understanding with advanced generative models, contributing new insights for NLP, and setting a foundation for broader applications of large-scale generative architectures in areas such as automated writing, question-answer systems, and adaptive conversational agents.
摘要:本研究引入了一种新颖的文本生成模型,该模型结合了 BERT 的语义解释优势与 GPT-4 的生成能力,在生成连贯且语境准确的语言方面树立了高标准。通过这种结合架构,模型增强了语义深度,并保持了流畅、类人的文本流,克服了先前模型中存在的局限性。实验基准显示,BERT-GPT-4 在困惑度 (Perplexity) 和 BLEU 等关键指标上超越了传统模型,包括 GPT-3、T5、BART、Transformer-XL 和 CTRL,展示了其在自然语言生成方面的卓越性能。通过充分利用上下文信息,这种混合模型生成的文本不仅逻辑连贯,而且与人类语言模式高度一致,为文本生成任务提供了先进的解决方案。本研究强调了将语义理解与高级生成模型结合的潜力,为自然语言处理 (NLP) 领域贡献了新的见解,并为大规模生成架构在自动化写作、问答系统和自适应对话智能体等领域的广泛应用奠定了基础。

[NLP-36] HNCSE: Advancing Sentence Embeddings via Hybrid Contrastive Learning with Hard Negatives

【速读】: 该论文试图解决无监督句子表示学习中的一个关键挑战,即如何有效地捕捉文本的语义信息。解决方案的关键在于提出了一种名为HNCSE的新型对比学习框架,该框架扩展了领先的SimCSE方法。HNCSE的核心创新在于其对困难负样本(hard negative samples)的巧妙利用,这些样本由于接近决策边界而更难以区分,从而增强了正负样本的学习效果,进而实现更深层次的语义理解。通过在语义文本相似性和迁移任务数据集上的实证测试,HNCSE展示了其优越性。

链接: https://arxiv.org/abs/2411.12156
作者: Wenxiao Liu,Zihong Yang,Chaozhuo Li,Zijin Hong,Jianfeng Ma,Zhiquan Liu,Litian Zhang,Feiran Huang
关键词-EN: natural language processing, modern natural language, Unsupervised sentence representation, Unsupervised sentence, language processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unsupervised sentence representation learning remains a critical challenge in modern natural language processing (NLP) research. Recently, contrastive learning techniques have achieved significant success in addressing this issue by effectively capturing textual semantics. Many such approaches prioritize the optimization using negative samples. In fields such as computer vision, hard negative samples (samples that are close to the decision boundary and thus more difficult to distinguish) have been shown to enhance representation learning. However, adapting hard negatives to contrastive sentence learning is complex due to the intricate syntactic and semantic details of text. To address this problem, we propose HNCSE, a novel contrastive learning framework that extends the leading SimCSE approach. The hallmark of HNCSE is its innovative use of hard negative samples to enhance the learning of both positive and negative samples, thereby achieving a deeper semantic understanding. Empirical tests on semantic textual similarity and transfer task datasets validate the superiority of HNCSE.
摘要:无监督句子表示学习仍然是现代自然语言处理(NLP)研究中的一个关键挑战。近年来,对比学习技术在这一问题上取得了显著成功,通过有效捕捉文本语义来解决这一问题。许多此类方法优先使用负样本进行优化。在计算机视觉等领域,困难负样本(即接近决策边界、因此更难以区分的样本)已被证明能增强表示学习。然而,由于文本的复杂句法和语义细节,将困难负样本适应于对比句子学习是复杂的。为解决这一问题,我们提出了HNCSE,这是一种新颖的对比学习框架,扩展了领先的SimCSE方法。HNCSE的标志性特点是其创新性地使用困难负样本来增强正样本和负样本的学习,从而实现更深层次的语义理解。在语义文本相似性和迁移任务数据集上的实证测试验证了HNCSE的优越性。
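HNCSE 建立在 SimCSE 的对比学习框架上。下面给出在 InfoNCE 损失分母中引入困难负样本的常见写法,作为理解"困难负样本增强对比学习"的最小示意(非论文官方实现,温度等超参为假设):

```python
import torch
import torch.nn.functional as F

def info_nce_with_hard_negatives(z1, z2, z_hard, tau=0.05):
    """SimCSE 风格对比损失,分母额外加入困难负样本。
    z1, z2: (B, d) 同一批句子经两次 dropout 得到的正样本对
    z_hard: (B, d) 每个句子对应的困难负样本表示"""
    z1, z2, z_hard = (F.normalize(z, dim=-1) for z in (z1, z2, z_hard))
    sim_pos = z1 @ z2.T / tau                              # (B, B),对角线为正样本对
    sim_hard = (z1 * z_hard).sum(-1, keepdim=True) / tau   # (B, 1),困难负样本
    logits = torch.cat([sim_pos, sim_hard], dim=1)         # (B, B+1)
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

loss = info_nce_with_hard_negatives(*torch.randn(3, 8, 64))
```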

[NLP-37] CoMeDi Shared Task: Models as Annotators in Lexical Semantics Disagreements

【速读】: 该论文旨在解决CoMeDi Shared Task中的两个子任务:预测多数投票(Subtask 1)和标注者分歧(Subtask 2)。解决方案的关键在于结合模型集成策略与基于多层感知器(MLP)和阈值的方法,这些方法在预训练语言模型上进行训练。通过将单个模型视为虚拟标注者,设计了包含连续相似度分数和离散分类标签的聚合度量,以捕捉多数投票和分歧。此外,采用各向异性去除技术以提升性能。实验结果表明,连续相似度分数在捕捉人类分歧模式方面优于聚合离散标签,特别是在Subtask 2中表现尤为突出。

链接: https://arxiv.org/abs/2411.12147
作者: Zhu Liu,Zhen Hu,Ying Liu
关键词-EN: CoMeDi Shared Task, Shared Task, CoMeDi Shared, predicts majority votes, Task
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:We present the results of our system for the CoMeDi Shared Task, which predicts majority votes (Subtask 1) and annotator disagreements (Subtask 2). Our approach combines model ensemble strategies with MLP-based and threshold-based methods trained on pretrained language models. Treating individual models as virtual annotators, we simulate the annotation process by designing aggregation measures that incorporate continuous similarity scores and discrete classification labels to capture both majority and disagreement. Additionally, we employ anisotropy removal techniques to enhance performance. Experimental results demonstrate the effectiveness of our methods, particularly for Subtask 2. Notably, we find that continuous similarity scores, even within the same model, align better with human disagreement patterns compared to aggregated discrete labels.
摘要:我们展示了针对 CoMeDi 共享任务的系统结果,该系统预测多数投票(子任务 1)和标注者分歧(子任务 2)。我们的方法结合了模型集成策略与基于 MLP 和基于阈值的方法,这些方法在预训练语言模型上进行训练。将单个模型视为虚拟标注者,我们通过设计聚合度量来模拟标注过程,这些度量结合了连续相似性分数和离散分类标签,以捕捉多数投票和分歧。此外,我们采用了各向异性去除技术以提升性能。实验结果表明我们的方法的有效性,特别是在子任务 2 上。值得注意的是,我们发现连续相似性分数,即使在同一模型内,与人类分歧模式的匹配度优于聚合的离散标签。
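"把单个模型当作虚拟标注者"的聚合思路可以很直接地落地:对多个模型给出的连续相似度分数取均值并经阈值映射为多数投票标签,用标准差刻画分歧。下面是一个玩具示意(阈值与标签档位均为笔者假设):

```python
import numpy as np

def aggregate(scores, thresholds=(0.25, 0.5, 0.75)):
    """多个模型的连续相似度分数 -> (多数投票标签, 分歧度)。"""
    scores = np.asarray(scores, float)
    majority = int(np.digitize(scores.mean(), thresholds)) + 1  # 映射到 1..4 档相关性
    disagreement = float(scores.std())                          # 标准差作为分歧度
    return majority, disagreement

vote, dis = aggregate([0.62, 0.71, 0.55, 0.80])  # 4 个模型对同一词对用法的打分
print(vote, round(dis, 3))
```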

[NLP-38] A Computational Method for Measuring “Open Codes” in Qualitative Analysis

【速读】: 该论文试图解决在社会科学领域中,定性分析过程中开放编码(Open Coding)的全面性(“as exhaustive as possible”)难以达到的问题,并评估生成式 AI (Generative AI) 在支持开放编码时可能引入的偏见风险。解决方案的关键在于提出了一种基于团队合作的方法,结合了人类编码员和机器编码员的优势,通过计算方法系统地测量和识别开放编码中的潜在偏见。该方法不将人类专家的结果作为“金标准”,而是通过团队合作的方式,利用两个人机交互(HCI)数据集进行实验,验证了该方法的可靠性和输出稳定性,并提供了基于证据的建议和示例工作流程,以支持生成式 AI 在开放编码中的应用。

链接: https://arxiv.org/abs/2411.12142
作者: John Chen,Alexandros Lotsos,Lexie Zhao,Jessica Hullman,Bruce Sherin,Uri Wilensky,Michael Horn
关键词-EN: social science disciplines, science disciplines, critical to understanding, social science, Open coding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Qualitative analysis is critical to understanding human datasets in many social science disciplines. Open coding is an inductive qualitative process that identifies and interprets “open codes” from datasets. Yet, meeting methodological expectations (such as “as exhaustive as possible”) can be challenging. While many machine learning (ML)/generative AI (GAI) studies have attempted to support open coding, few have systematically measured or evaluated GAI outcomes, increasing potential bias risks. Building on Grounded Theory and Thematic Analysis theories, we present a computational method to measure and identify potential biases from “open codes” systematically. Instead of operationalizing human expert results as the “ground truth,” our method is built upon a team-based approach between human and machine coders. We experiment with two HCI datasets to establish this method’s reliability by 1) comparing it with human analysis, and 2) analyzing its output stability. We present evidence-based suggestions and example workflows for ML/GAI to support open coding.
摘要:定性分析在理解许多社会科学领域的人类数据集方面至关重要。开放编码是一种归纳性的定性过程,用于从数据集中识别和解释“开放代码”。然而,满足方法论期望(如“尽可能详尽”)可能具有挑战性。尽管许多机器学习 (ML) 和生成式 AI (GAI) 研究试图支持开放编码,但很少有研究系统地测量或评估 GAI 的结果,从而增加了潜在的偏见风险。基于扎根理论和主题分析理论,我们提出了一种计算方法,用于系统地测量和识别“开放代码”中的潜在偏见。我们的方法不是将人类专家的结果作为“真实标准”,而是建立在人机编码团队合作的基础上。我们通过两个人机交互 (HCI) 数据集来验证这种方法的可靠性,具体方法包括:1) 将其与人类分析进行比较,以及 2) 分析其输出稳定性。我们提供了基于证据的建议和示例工作流程,以支持 ML/GAI 在开放编码中的应用。

[NLP-39] Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods

【速读】: 该论文试图解决大语言模型(LLM)中存在的信息遗忘问题,特别是如何有效移除模型中已学习的有害信息,以防止其被用于恶意目的。解决方案的关键在于详细比较LLMU和RMU这两种主要的模型遗忘方法。研究通过在WMDP基准测试和自建的生物学基准测试中评估这两种方法,发现RMU在保持模型能力方面表现更优,且遗忘效果相近甚至更好。此外,研究还测试了这些方法的鲁棒性,发现5-shot提示或简单地重新表述问题可使遗忘基准测试上的准确率提高十倍以上。最后,研究指出,在无关数据上训练几乎可以完全恢复遗忘前的性能,表明这些方法并未真正实现遗忘。

链接: https://arxiv.org/abs/2411.12103
作者: Jai Doshi,Asa Cooper Stickland
关键词-EN: Large language model, remove harmful information, Large language, malicious purposes, aims to remove
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Large language model unlearning aims to remove harmful information that LLMs have learnt to prevent their use for malicious purposes. LLMU and RMU have been proposed as two methods for LLM unlearning, achieving impressive results on unlearning benchmarks. We study in detail the efficacy of these methods by evaluating their impact on general model capabilities on the WMDP benchmark as well as a biology benchmark we create. Our experiments show that RMU generally leads to better preservation of model capabilities, for similar or better unlearning. We further test the robustness of these methods and find that doing 5-shot prompting or rephrasing the question in simple ways can lead to an over ten-fold increase in accuracy on unlearning benchmarks. Finally, we show that training on unrelated data can almost completely recover pre-unlearning performance, demonstrating that these methods fail at truly unlearning. The code is available at this https URL.
摘要:大语言模型遗忘(Large Language Model Unlearning)旨在移除大语言模型(LLM)中已学习的有害信息,以防止其被用于恶意目的。LLMU 和 RMU 已被提出作为两种实现 LLM 遗忘的方法,并在遗忘基准测试中取得了显著成果。我们通过评估这些方法对 WMDP 基准测试和我们创建的生物学基准测试中模型通用能力的影响,详细研究了这些方法的有效性。我们的实验表明,RMU 通常能更好地保留模型的能力,同时实现相似或更好的遗忘效果。我们进一步测试了这些方法的鲁棒性,发现进行 5-shot 提示或以简单方式重新表述问题,可以在遗忘基准测试中使准确率提高十倍以上。最后,我们展示了在不相关数据上进行训练几乎可以完全恢复遗忘前的性能,这表明这些方法在真正实现遗忘方面存在不足。代码可在 this https URL 获取。

[NLP-40] Mitigating Gender Bias in Contextual Word Embeddings

【速读】: 该论文试图解决上下文嵌入(contextual embeddings)中的性别偏见问题。解决方案的关键在于提出了一种新的掩码语言建模(Masked-Language Modeling, MLM)目标函数,该函数能够显著减少上下文嵌入中的性别偏见,同时保持其在下游任务中的性能。此外,论文还提出了新的评估指标来衡量偏见,并探讨了静态嵌入(static embeddings)中偏见的主要来源,即刻板印象名称的存在,而非性别词汇本身。

链接: https://arxiv.org/abs/2411.12074
作者: Navya Yarrabelly,Vinay Damodaran,Feng-Guang Su
关键词-EN: NLP related tasks, produce remarkable results, majority of NLP, NLP related, embeddings
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Word embeddings have been shown to produce remarkable results in tackling a vast majority of NLP related tasks. Unfortunately, word embeddings also capture the stereotypical biases that are prevalent in society, affecting the predictive performance of the embeddings when used in downstream tasks. While various techniques have been proposed (Bolukbasi et al., 2016; Zhao et al., 2018) and criticized (Gonen and Goldberg, 2019) for static embeddings, very little work has focused on mitigating bias in contextual embeddings. In this paper, we propose a novel objective function for MLM (Masked-Language Modeling) which largely mitigates the gender bias in contextual embeddings and also preserves the performance for downstream tasks. Since previous works on measuring bias in contextual embeddings lack in normative reasoning, we also propose novel evaluation metrics that are straightforward and aligned with our motivations in debiasing. We also propose new methods for debiasing static embeddings and provide empirical proof via extensive analysis and experiments, as to why the main source of bias in static embeddings stems from the presence of stereotypical names rather than gendered words themselves. All experiments and embeddings studied are in English, unless otherwise specified (Bender, 2011).
摘要:词嵌入在处理绝大多数自然语言处理(NLP)相关任务中展现了显著的效果。然而,词嵌入也捕捉到了社会上普遍存在的刻板偏见,这些偏见在使用词嵌入进行下游任务时影响了其预测性能。尽管针对静态嵌入的偏见缓解技术已有多种提出(如 Bolukbasi 等, 2016 和 Zhao 等, 2018)并受到批评(如 Gonen 和 Goldberg, 2019),但针对上下文嵌入的偏见缓解工作却非常有限。在本文中,我们提出了一种新的掩码语言建模(MLM, Masked-Language Modeling)目标函数,该函数在很大程度上缓解了上下文嵌入中的性别偏见,同时保持了下游任务的性能。由于先前在测量上下文嵌入偏见的工作中缺乏规范性推理,我们还提出了新的评估指标,这些指标直观且与我们的去偏见动机相一致。此外,我们提出了新的去偏见方法用于静态嵌入,并通过广泛的分析和实验提供了实证证据,证明静态嵌入中偏见的主要来源是刻板名字的存在,而非性别词汇本身。除非另有说明,所有实验和研究的嵌入均为英文(Bender, 2011)。

[NLP-41] Benchmarking pre-trained text embedding models in aligning built asset information

【速读】: 该论文试图解决建筑资产数据与既定数据分类系统及分类法之间准确映射的问题。由于建筑资产数据的复杂性,这一过程目前主要依赖于人工和领域专家的输入。论文的关键解决方案在于利用预训练的大型语言模型(text embedding)进行上下文文本表示学习,以自动化建筑资产数据的跨映射。通过对比现有最先进的文本嵌入模型,论文评估了这些模型在处理建筑资产技术术语复杂语义方面的有效性,并提出了基于两个知名建筑资产数据分类词典的数据集。研究结果强调了未来在领域适应技术方面的研究需求,并发布了一个开源库以支持该领域的未来评估。

链接: https://arxiv.org/abs/2411.12056
作者: Mehrzad Shahinmoghadam,Ali Motamedi
关键词-EN: effective asset management, data integration scenarios, Accurate mapping, built asset data, ad-hoc data integration
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate mapping of the built asset information to established data classification systems and taxonomies is crucial for effective asset management, whether for compliance at project handover or ad-hoc data integration scenarios. Due to the complex nature of built asset data, which predominantly comprises technical text elements, this process remains largely manual and reliant on domain expert input. Recent breakthroughs in contextual text representation learning (text embedding), particularly through pre-trained large language models, offer promising approaches that can facilitate the automation of cross-mapping of the built asset data. However, no comprehensive evaluation has yet been conducted to assess these models’ ability to effectively represent the complex semantics specific to built asset technical terminology. This study presents a comparative benchmark of state-of-the-art text embedding models to evaluate their effectiveness in aligning built asset information with domain-specific technical concepts. Our proposed datasets are derived from two renowned built asset data classification dictionaries. The results of our benchmarking across six proposed datasets, covering three tasks of clustering, retrieval, and reranking, highlight the need for future research on domain adaptation techniques. The benchmarking resources are published as an open-source library, which will be maintained and extended to support future evaluations in this field.
摘要:将建筑资产信息准确映射到已建立的数据分类系统和分类法对于有效的资产管理至关重要,无论是为了项目交接时的合规性还是临时数据集成场景。由于建筑资产数据主要由技术文本元素组成,这一过程在很大程度上仍然是手动操作,并依赖于领域专家的输入。最近在上下文文本表示学习(文本嵌入)方面的突破,特别是通过预训练的大语言模型,提供了有前景的方法,可以促进建筑资产数据的跨映射自动化。然而,尚未进行全面的评估来评估这些模型有效表示建筑资产技术术语复杂语义的能力。本研究对最先进的文本嵌入模型进行了比较基准测试,以评估它们在将建筑资产信息与领域特定的技术概念对齐方面的有效性。我们提出的数据集源自两个著名的建筑资产数据分类词典。通过对六个提议的数据集进行基准测试,涵盖聚类、检索和重排序三个任务,结果突显了未来在领域适应技术方面研究的必要性。基准测试资源作为开源库发布,并将得到维护和扩展,以支持该领域的未来评估。

[NLP-42] ByteScience: Bridging Unstructured Scientific Literature and Structured Data with Auto Fine-tuned Large Language Model in Token Granularity

【速读】: 该论文试图解决从科学文本中提取结构化知识的问题,特别是由于科学文本的领域特定性、复杂的数据预处理需求以及多层次设备级信息的粒度问题。解决方案的关键在于引入了一个名为ByteScience的非盈利云端自动微调大型语言模型(LLM)平台。该平台基于DARWIN,一个专门为自然科学领域微调的开源LLM,并构建在Amazon Web Services (AWS)上,提供了一个自动化、用户友好的工作流程,用于定制模型开发和数据提取。通过利用少量高质量标注的文章,该平台能够实现显著的准确性,从而简化了从科学文献到结构化知识和数据的转换过程,推动了自然信息学的发展。

链接: https://arxiv.org/abs/2411.12000
作者: Tong Xie,Hanzhi Zhang,Shaozhou Wang,Yuwei Wan,Imran Razzak,Chunyu Kit,Wenjie Zhang,Bram Hoex
关键词-EN: Natural Language Processing, supply summarization ability, Language Processing, Large Language Model, NLP models remains
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Natural Language Processing (NLP) is widely used to supply summarization ability from long context to structured information. However, extracting structured knowledge from scientific text by NLP models remains a challenge because of its domain-specific nature, the complexity of data preprocessing, and the granularity of multi-layered device-level information. To address this, we introduce ByteScience, a non-profit cloud-based auto fine-tuned Large Language Model (LLM) platform, which is designed to extract structured scientific data and synthesize new scientific knowledge from vast scientific corpora. The platform capitalizes on DARWIN, an open-source, fine-tuned LLM dedicated to natural science. The platform was built on Amazon Web Services (AWS) and provides an automated, user-friendly workflow for custom model development and data extraction. The platform achieves remarkable accuracy with only a small amount of well-annotated articles. This innovative tool streamlines the transition from the science literature to structured knowledge and data and benefits the advancements in natural informatics.
摘要:自然语言处理 (Natural Language Processing, NLP) 广泛应用于从长文本中提取结构化信息的能力。然而,由于其领域特定的性质,涉及复杂的数据预处理以及多层次设备级信息的粒度,从科学文本中提取结构化知识仍然是 NLP 模型面临的挑战。为此,我们推出了 ByteScience,一个非营利的基于云端的自动微调大语言模型 (Large Language Model, LLM) 平台,旨在从庞大的科学语料库中提取结构化科学数据并合成新的科学知识。该平台利用了 DARWIN,一个专注于自然科学的开放源代码微调 LLM。平台构建于 Amazon Web Services (AWS) 之上,并提供了一个自动化、用户友好的工作流程,用于定制模型开发和数据提取。该平台在仅使用少量高质量标注文章的情况下,实现了显著的准确性。这一创新工具简化了从科学文献到结构化知识与数据的转化过程,并促进了自然信息学领域的进步。

[NLP-43] Understanding Chain-of-Thought in LLM s through Information Theory

【速读】: 该论文试图解决现有链式思维推理(Chain-of-Thought, CoT)评估方法在大型语言模型(Large Language Models, LLMs)中存在的两个主要问题:一是需要大量标注的CoT数据,二是难以准确评估推理过程中的中间步骤,导致高误报率。解决方案的关键在于通过信息论的视角来形式化CoT推理,量化每一步推理的“信息增益”(information gain),从而在不依赖昂贵的标注数据的情况下,识别模型在推理过程中的失败模式。实验结果表明,该方法在玩具数据集和GSM-8K数据集上显著优于现有的基于结果的评估方法,提供了更准确的模型性能洞察。

链接: https://arxiv.org/abs/2411.11984
作者: Jean-Francois Ton,Muhammad Faaiz Taufiq,Yang Liu
关键词-EN: Large Language Models, Large Language, shown impressive performance, manageable sub-tasks, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the `information gain’ at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy and GSM-8K data, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks.
摘要:大语言模型 (Large Language Models, LLMs) 通过思维链 (Chain-of-Thought, CoT) 推理在复杂推理任务中展现了令人印象深刻的表现,使得模型能够将问题分解为可管理的子任务。然而,现有的 CoT 评估技术要么需要标注的 CoT 数据,要么在准确评估中间推理步骤方面存在不足,导致高比例的假阳性结果。本文中,我们通过信息论的视角对 LLMs 中的 CoT 推理进行了形式化。具体而言,我们的框架量化了每个推理步骤中的“信息增益”,从而无需昂贵的标注数据集即可识别 LLMs 的失败模式。我们通过在玩具数据和 GSM-8K 数据上的广泛实验展示了我们方法的有效性,该方法在提供更准确的模型性能洞察方面显著优于现有的基于结果的方法,尤其是在单个任务的性能评估上。
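论文的核心量是每一步推理的"信息增益"。一种直观的近似是:给定前 t 步推理后,正确答案的负对数似然下降了多少;增益接近零甚至为负的步骤即为可疑步骤。下面用一个玩具打分函数演示该定义(logprob_answer 为假设接口,真实场景应由 LLM 给出答案的对数概率):

```python
def information_gain(logprob_answer, steps):
    """第 t 步的信息增益 = 答案负对数似然在该步前后的下降量(示意近似)。
    logprob_answer(partial_steps) -> log p(正确答案 | 问题, 前若干步推理)"""
    nll = [-logprob_answer(steps[:t]) for t in range(len(steps) + 1)]
    return [nll[t] - nll[t + 1] for t in range(len(steps))]

steps = ["分解问题", "列方程", "解得 x=7"]
toy = lambda partial: -3.0 + 0.9 * len(partial)  # 玩具打分:假设每步都降低不确定性
print(information_gain(toy, steps))              # 每步增益约 0.9
```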

[NLP-44] Reviving Dormant Memories: Investigating Catastrophic Forgetting in Language Models through Rationale-Guidance Difficulty

【速读】: 该论文试图解决持续学习中的灾难性遗忘问题,即模型在不断学习新任务时如何避免遗忘旧任务的知识。解决方案的关键在于揭示了模型性能下降的原因并非真正意义上的“遗忘”,而是原有指令未能有效引导模型生成适当的推理依据(rationale)。论文提出了一种新的度量标准——推理依据引导难度(Rationale-Guidance Difficulty),用于评估指令在生成适当推理依据方面的有效性。通过优化基于回放的持续学习算法中的数据分配,实验结果表明该方法能有效缓解灾难性遗忘,同时保持模型的良好可塑性。

链接: https://arxiv.org/abs/2411.11932
作者: Huashan Sun,Yang Gao
关键词-EN: substantial efforts, intrinsic mechanisms, forgetting model, forgetting model passively, model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Working in progress

点击查看摘要

Abstract:Although substantial efforts have been made to mitigate catastrophic forgetting in continual learning, the intrinsic mechanisms are not well understood. In this paper, we discover that when a forgetting model passively receives an externally provided partial appropriate rationale, its performance on the forgotten task can be restored. Furthermore, by simply adding a task-agnostic prefix to the original instruction, the forgetting model can actively generate an appropriate rationale to reach the correct answer. These findings suggest that the model does not actually “forget” the task knowledge; instead, the degraded performance can be attributed to the failure of the original instructions in guiding the model to generate the appropriate rationales. Based on this insight, we propose the Rationale-Guidance Difficulty metric to evaluate how effectively a given instruction guides the model in generating appropriate rationales. We apply this metric to optimize the allocation of replay data in a replay-based continual learning algorithm. Experimental results demonstrate that our data allocation method effectively mitigates catastrophic forgetting and maintains better model plasticity simultaneously across models.
摘要:尽管在持续学习中已经做出了大量努力来缓解灾难性遗忘,但其内在机制尚未得到充分理解。本文发现,当一个遗忘模型被动地接收到外部提供的部分适当理由时,其在遗忘任务上的表现可以得到恢复。此外,通过简单地在原始指令前添加一个任务无关的前缀,遗忘模型能够主动生成适当的理由以达到正确答案。这些发现表明,模型实际上并未“遗忘”任务知识;相反,性能下降可以归因于原始指令在引导模型生成适当理由方面的失败。基于这一洞察,我们提出了理由引导难度指标,用于评估给定指令在引导模型生成适当理由方面的有效性。我们将此指标应用于基于重放的持续学习算法中,以优化重放数据的分配。实验结果表明,我们的数据分配方法有效地缓解了灾难性遗忘,并在多个模型中同时保持了更好的模型可塑性。

[NLP-45] AIGS: Generating Science from AI-Powered Automated Falsification

【速读】: 该论文试图解决的问题是如何实现一个能够独立完成整个科学研究过程的自主代理系统,即AI-Generated Science (AIGS)。解决方案的关键在于引入证伪(falsification)作为系统设计的核心原则,并通过设计一个名为FalsificationAgent的多代理系统来实现这一目标。FalsificationAgent负责识别并验证可能的科学发现,从而赋予系统明确的证伪能力。论文提出的Baby-AIGS系统作为一个初步的演示,展示了在多个研究阶段中,代理如何协作以生成有意义的科学发现,尽管其成果尚未达到经验丰富的研究人员水平。

链接: https://arxiv.org/abs/2411.11910
作者: Zijun Liu,Kaiming Liu,Yiqi Zhu,Xuanyu Lei,Zonghan Yang,Zhenhe Zhang,Peng Li,Yang Liu
关键词-EN: Rapid development, development of artificial, Large Language Models, artificial intelligence, intelligence has drastically
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Pre-print. 35 pages. Official website: this https URL

点击查看摘要

Abstract:Rapid development of artificial intelligence has drastically accelerated the development of scientific discovery. Trained with large-scale observation data, deep neural networks extract the underlying patterns in an end-to-end manner and assist human researchers with highly precise predictions in unseen scenarios. The recent rise of Large Language Models (LLMs) and the empowered autonomous agents enable scientists to gain help through interaction in different stages of their research, including but not limited to literature review, research ideation, idea implementation, and academic writing. However, AI researchers instantiated by foundation model empowered agents with full-process autonomy are still in their infancy. In this paper, we study AI-Generated Science (AIGS), where agents independently and autonomously complete the entire research process and discover scientific laws. By revisiting the definition of scientific research, we argue that falsification is the essence of both the human research process and the design of an AIGS system. Through the lens of falsification, prior systems attempting AI-Generated Science either lack this part in their design, or rely heavily on existing verification engines that narrow their use to specialized domains. In this work, we propose Baby-AIGS as a baby-step demonstration of a full-process AIGS system, which is a multi-agent system with agents in roles representing key research processes. By introducing FalsificationAgent, which identifies and then verifies possible scientific discoveries, we empower the system with explicit falsification. Experiments on three tasks preliminarily show that Baby-AIGS could produce meaningful scientific discoveries, though not on par with experienced human researchers. Finally, we discuss the limitations of current Baby-AIGS, actionable insights, and related ethical issues in detail.
摘要:人工智能的快速发展极大地加速了科学发现的进程。通过大规模观测数据的训练,深度神经网络以端到端的方式提取底层模式,并在未见场景中为人类研究人员提供高度精确的预测。近年来,大语言模型(LLM)和赋能的自主智能体使得科学家能够在研究的各个阶段(包括但不限于文献综述、研究构思、想法实施和学术写作)通过交互获得帮助。然而,由基础模型赋能的具有全过程自主性的AI研究人员仍处于起步阶段。本文研究了AI生成的科学(AIGS),其中智能体独立自主地完成整个研究过程并发现科学规律。通过重新审视科学研究的定义,我们认为证伪是人类研究过程和AIGS系统设计的本质。通过证伪的视角,先前试图实现AI生成科学的系统要么在其设计中缺乏这一部分,要么严重依赖现有的验证引擎,从而限制了其在特定领域的应用。在此工作中,我们提出了Baby-AIGS作为全过程AIGS系统的初步演示,这是一个多智能体系统,智能体扮演代表关键研究过程的角色。通过引入FalsificationAgent,该智能体识别并验证可能的科学发现,我们赋予系统显式的证伪能力。在三个任务上的实验初步表明,Baby-AIGS能够产生有意义的科学发现,尽管其水平尚不及经验丰富的研究人员。最后,我们详细讨论了当前Baby-AIGS的局限性、可操作的见解以及相关的伦理问题。

[NLP-46] Deploying Large Language Models With Retrieval Augmented Generation

【速读】: 该论文试图解决大型语言模型(LLM)在生成内容时可能出现的幻觉或非事实性响应问题。解决方案的关键在于采用检索增强生成(Retrieval Augmented Generation, RAG)方法,通过整合训练集之外的数据源(包括专有和最新的信息)来增强生成内容的准确性。论文通过实际项目开发和现场测试,探讨了RAG在信息检索中的应用,并分析了其对信息价值链(涉及人员、流程和技术)的影响。论文的主要贡献包括开发了采用这一技术的最佳实践和建议,并通过提出的AI治理模型确保行业法规的合规性。

链接: https://arxiv.org/abs/2411.11895
作者: Sonal Prabhune,Donald J. Berndt
关键词-EN: create non-factual responses, ground generated outputs, large language models, Retrieval Augmented Generation, non-factual responses
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowing that the generative capabilities of large language models (LLM) are sometimes hampered by tendencies to hallucinate or create non-factual responses, researchers have increasingly focused on methods to ground generated outputs in factual data. Retrieval Augmented Generation (RAG) has emerged as a key approach for integrating knowledge from data sources outside of the LLM’s training set, including proprietary and up-to-date information. While many research papers explore various RAG strategies, their true efficacy is tested in real-world applications with actual data. The journey from conceiving an idea to actualizing it in the real world is a lengthy process. We present insights from the development and field-testing of a pilot project that integrates LLMs with RAG for information retrieval. Additionally, we examine the impacts on the information value chain, encompassing people, processes, and technology. Our aim is to identify the opportunities and challenges of implementing this emerging technology, particularly within the context of behavioral research in the information systems (IS) field. The contributions of this work include the development of best practices and recommendations for adopting this promising technology while ensuring compliance with industry regulations through a proposed AI governance model.
摘要:尽管大语言模型 (LLM) 的生成能力有时会因产生幻觉或非事实性回应而受限,研究人员仍日益关注将生成输出与事实数据相结合的方法。检索增强生成 (RAG) 作为一种关键方法,能够整合训练集之外的数据源知识,包括专有和最新的信息。虽然众多研究论文探讨了各种 RAG 策略,但其真正效能需在实际应用中通过真实数据进行检验。从构思到实际应用的过程漫长。本文从开发和实地测试一个将 LLM 与 RAG 结合用于信息检索的试点项目中获得见解。此外,我们还探讨了这一技术对信息价值链的影响,涉及人员、流程和技术。我们的目标是识别这一新兴技术在实施中的机遇与挑战,特别是在信息系统 (IS) 领域的行为研究背景下。本研究贡献包括制定最佳实践和采用该技术的建议,并通过提出的 AI 治理模型确保符合行业法规。
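作为参照,下面给出一个最小的 RAG 流程示意:对查询做嵌入、按相似度取 top-k 文档、拼入提示词后再生成(embed 与 generate 为假设的外部接口,与论文试点项目的具体实现无关):

```python
import numpy as np

def rag_answer(question, docs, embed, generate, k=3):
    """最小 RAG 流程:检索 top-k 文档并拼入提示词后生成回答。"""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: -float(np.dot(q, embed(d))))  # 相似度降序
    context = "\n\n".join(ranked[:k])
    prompt = f"仅依据以下资料回答问题。\n资料:\n{context}\n\n问题:{question}"
    return generate(prompt)
```

生产部署时还需考虑嵌入缓存、来源引用与访问控制等问题,这正是文中"信息价值链"讨论所涉及的范畴。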

[NLP-47] Exploring Optimal Transport-Based Multi-Grained Alignments for Text-Molecule Retrieval

【速读】: 该论文试图解决生物信息学领域中的跨模态文本-分子检索任务,即根据文本描述准确检索分子结构的问题。解决方案的关键在于引入了一种基于最优传输(Optimal Transport, OT)的多粒度对齐模型(ORMA),该模型通过多粒度对齐文本描述和分子结构,从而捕捉分子子结构的细节。具体来说,ORMA模型包括一个文本编码器和一个分子编码器,分别生成文本的词级别和句子级别表示,以及分子的原子、基序和分子级别的表示。通过应用最优传输算法,ORMA能够对齐文本中的词与分子中的基序,形成多词表示,并结合对比学习在词-原子、多词-基序和句子-分子三个尺度上进行跨模态对齐,从而显著提升检索性能。这是首次尝试在基序和多词级别上探索对齐方法,实验结果表明ORMA在ChEBI-20和PCdes数据集上显著优于现有的最先进模型。

链接: https://arxiv.org/abs/2411.11875
作者: Zijun Min,Bingshuai Liu,Liang Zhang,Jia Song,Jinsong Su,Song He,Xiaochen Bo
关键词-EN: task increasingly vital, retrieval task increasingly, significant progress, increasingly vital, textual descriptions
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
备注: BIBM 2024 Regular Paper

点击查看摘要

Abstract:The field of bioinformatics has seen significant progress, making the cross-modal text-molecule retrieval task increasingly vital. This task focuses on accurately retrieving molecule structures based on textual descriptions, by effectively aligning textual descriptions and molecules to assist researchers in identifying suitable molecular candidates. However, many existing approaches overlook the details inherent in molecule sub-structures. In this work, we introduce the Optimal TRansport-based Multi-grained Alignments model (ORMA), a novel approach that facilitates multi-grained alignments between textual descriptions and molecules. Our model features a text encoder and a molecule encoder. The text encoder processes textual descriptions to generate both token-level and sentence-level representations, while molecules are modeled as hierarchical heterogeneous graphs, encompassing atom, motif, and molecule nodes to extract representations at these three levels. A key innovation in ORMA is the application of Optimal Transport (OT) to align tokens with motifs, creating multi-token representations that integrate multiple token alignments with their corresponding motifs. Additionally, we employ contrastive learning to refine cross-modal alignments at three distinct scales: token-atom, multitoken-motif, and sentence-molecule, ensuring that the similarities between correctly matched text-molecule pairs are maximized while those of unmatched pairs are minimized. To our knowledge, this is the first attempt to explore alignments at both the motif and multi-token levels. Experimental results on the ChEBI-20 and PCdes datasets demonstrate that ORMA significantly outperforms existing state-of-the-art (SOTA) models.
摘要:生物信息学领域取得了显著进展,使得跨模态文本-分子检索任务变得愈发重要。该任务旨在基于文本描述准确检索分子结构,通过有效对齐文本描述与分子结构,帮助研究人员识别合适的分子候选物。然而,许多现有方法忽略了分子子结构中的细节。在本研究中,我们提出了基于最优传输的多粒度对齐模型(ORMA),这是一种新颖的方法,能够促进文本描述与分子之间的多粒度对齐。我们的模型包括一个文本编码器和一个分子编码器。文本编码器处理文本描述,生成Token级和句子级表示,而分子则被建模为包含原子、基序和分子节点的分层异构图,以提取这三个层次的表示。ORMA的一个关键创新是应用最优传输(OT)来对齐Token与基序,创建整合多个Token与其对应基序的多Token表示。此外,我们采用对比学习来细化跨模态对齐,涵盖三个不同尺度:Token-原子、多Token-基序和句子-分子,确保正确匹配的文本-分子对的相似性最大化,而未匹配对的相似性最小化。据我们所知,这是首次尝试在基序和多Token层次上探索对齐。在ChEBI-20和PCdes数据集上的实验结果表明,ORMA显著优于现有的最先进(SOTA)模型。
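
为便于理解"用最优传输 (Optimal Transport) 在Token与基序 (motif) 之间求软对齐"这一核心步骤,下面给出一个极简的Sinkhorn迭代示意。嵌入维度、熵正则系数eps、迭代次数等均为示意性假设,并非ORMA的原始实现。

```python
import torch

def sinkhorn_alignment(tok_emb, motif_emb, eps=0.1, n_iter=50):
    """用Sinkhorn迭代求Token与基序之间的软对齐(传输计划)。
    tok_emb: (n_tok, d);motif_emb: (n_motif, d)。输入均为示意。"""
    t = torch.nn.functional.normalize(tok_emb, dim=-1)
    m = torch.nn.functional.normalize(motif_emb, dim=-1)
    cost = 1.0 - t @ m.T                       # 成本 = 1 - 余弦相似度
    K = torch.exp(-cost / eps)                 # Gibbs核
    a = torch.full((tok_emb.size(0),), 1.0 / tok_emb.size(0))   # 均匀边际
    b = torch.full((motif_emb.size(0),), 1.0 / motif_emb.size(0))
    u = torch.ones_like(a)
    for _ in range(n_iter):                    # 交替缩放直至边际匹配
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]         # 软对齐矩阵,总质量为1

# 用法示意:8个Token对齐到3个基序
plan = sinkhorn_alignment(torch.randn(8, 64), torch.randn(3, 64))
print(plan.shape, plan.sum())                  # torch.Size([8, 3]),总和约为1.0
```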

[NLP-48] Chat Bankman-Fried: an Exploration of LLM Alignment in Finance

【速读】: 该论文试图解决在大语言模型(LLMs)应用于金融领域时,如何评估其是否符合伦理和法律标准的问题。解决方案的关键在于提出一个实验框架,通过模拟场景测试LLMs在特定情境下的行为,特别是测试其是否愿意滥用客户资产以偿还公司债务。研究通过调整风险厌恶、利润预期和监管环境等变量,分析这些因素对模型行为的影响,并使用逻辑回归进行量化分析。研究发现,不同LLMs在基线配置下表现出显著的伦理行为异质性,且这些因素对模型行为的影响符合经济学理论预测,但效果大小因模型而异。该研究强调了基于模拟的事后安全测试的优点和局限性,虽然能为金融监管机构和机构提供LLM安全性的参考,但在通用性和成本之间存在明显的权衡。

链接: https://arxiv.org/abs/2411.11853
作者: Claudia Biancotti,Carolina Camassa,Andrea Coletta,Oliver Giudice,Aldo Glielmo
关键词-EN: large language models, Advancements in large, language models, large language, renewed concerns
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); General Finance (q-fin.GN)
备注:

点击查看摘要

Abstract:Advancements in large language models (LLMs) have renewed concerns about AI alignment - the consistency between human and AI goals and values. As various jurisdictions enact legislation on AI safety, the concept of alignment must be defined and measured across different domains. This paper proposes an experimental framework to assess whether LLMs adhere to ethical and legal standards in the relatively unexplored context of finance. We prompt nine LLMs to impersonate the CEO of a financial institution and test their willingness to misuse customer assets to repay outstanding corporate debt. Beginning with a baseline configuration, we adjust preferences, incentives and constraints, analyzing the impact of each adjustment with logistic regression. Our findings reveal significant heterogeneity in the baseline propensity for unethical behavior of LLMs. Factors such as risk aversion, profit expectations, and regulatory environment consistently influence misalignment in ways predicted by economic theory, although the magnitude of these effects varies across LLMs. This paper highlights both the benefits and limitations of simulation-based, ex post safety testing. While it can inform financial authorities and institutions aiming to ensure LLM safety, there is a clear trade-off between generality and cost.
摘要:大语言模型 (LLM) 的进步重新引发了关于 AI 对齐 (AI alignment) 的担忧,即人类与 AI 目标和价值观的一致性。随着各司法管辖区陆续出台 AI 安全相关的立法,对齐的概念必须在不同领域中定义和衡量。本文提出了一种实验框架,用于评估 LLM 在金融领域这一相对未被探索的背景下是否符合伦理和法律标准。我们引导九个 LLM 模拟金融机构的 CEO,测试它们是否愿意滥用客户资产来偿还企业债务。从基线配置开始,我们调整偏好、激励和约束条件,并使用逻辑回归分析每次调整的影响。研究发现,LLM 在基线配置下表现出显著的伦理行为异质性。风险厌恶、利润预期和监管环境等因素一致地影响着对齐问题,其方式与经济理论预测相符,尽管这些效应在不同 LLM 之间的大小有所不同。本文强调了基于模拟的事后安全测试的优点和局限性。虽然这种方法可以为旨在确保 LLM 安全的金融监管机构和机构提供信息,但在通用性和成本之间存在明显的权衡。
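
论文用逻辑回归量化风险厌恶、利润预期、监管环境等因素对不当行为倾向的影响。下面用scikit-learn给出这一分析流程的最小示意;数据为随机生成的占位样本,变量名与系数均为假设,仅演示拟合与解读方式,并非论文实验数据。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# 假设的三个实验因子(已标准化):风险厌恶、利润预期、监管强度
X = rng.normal(size=(n, 3))
# 占位的"是否滥用客户资产"标签,仅用于演示,不代表论文数据
logit = -1.2 * X[:, 0] + 0.8 * X[:, 1] - 0.9 * X[:, 2]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
for name, coef in zip(["risk_aversion", "profit_expectation", "regulation"],
                      model.coef_[0]):
    print(f"{name}: {coef:+.2f}")  # 系数符号/大小即各因素的边际影响方向与强度
```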

人工智能

[AI-0] Benchmarking Positional Encodings for GNNs and Graph Transformers

链接: https://arxiv.org/abs/2411.12732
作者: Florian Grötschla,Jiaqing Xie,Roger Wattenhofer
关键词-EN: Graph Neural Networks, capturing graph topology, Neural Networks, Positional Encodings, Graph Transformers
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in Graph Neural Networks (GNNs) and Graph Transformers (GTs) have been driven by innovations in architectures and Positional Encodings (PEs), which are critical for augmenting node features and capturing graph topology. PEs are essential for GTs, which lack message-passing and would otherwise lose topological information. However, PEs are often tested alongside novel architectures, making it difficult to isolate their effect on established models. To address this, we present a comprehensive benchmark of PEs in a unified framework that includes both message-passing GNNs and GTs. We also establish theoretical connections between MPNNs and GTs and introduce a sparsified GRIT attention mechanism to examine the influence of global connectivity. Our findings demonstrate that previously untested combinations of GNN architectures and PEs can outperform existing methods and offer a more comprehensive picture of the state-of-the-art. To support future research and experimentation in our framework, we make the code publicly available.
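
作为背景补充,图Transformer常用的一类位置编码是拉普拉斯特征向量 (Laplacian PE)。以下用NetworkX与NumPy给出其计算方式的简化示意,k等参数为示意值,与论文基准中的具体配置无关。

```python
import numpy as np
import networkx as nx

def laplacian_pe(G, k=3):
    """取归一化拉普拉斯矩阵最小的k个非平凡特征向量作为节点位置编码。"""
    L = nx.normalized_laplacian_matrix(G).toarray()
    eigvals, eigvecs = np.linalg.eigh(L)   # 特征值升序排列
    return eigvecs[:, 1:k + 1]             # 跳过第一个(平凡)特征向量

G = nx.cycle_graph(8)
print(laplacian_pe(G).shape)               # (8, 3):每个节点一个3维编码
```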

[AI-1] Heuristic-Free Multi-Teacher Learning

链接: https://arxiv.org/abs/2411.12724
作者: Huy Thong Nguyen,En-Hung Chu,Lenord Melvix,Jazon Jiao,Chunglin Wen,Benjamin Louie
关键词-EN: manual aggregation heuristics, manual aggregation, ground truth labels, Existing multi-teacher methods, multi-teacher methods typically
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce Teacher2Task, a novel framework for multi-teacher learning that eliminates the need for manual aggregation heuristics. Existing multi-teacher methods typically rely on such heuristics to combine predictions from multiple teachers, often resulting in sub-optimal aggregated labels and the propagation of aggregation errors. Teacher2Task addresses these limitations by introducing teacher-specific input tokens and reformulating the training process. Instead of relying on aggregated labels, the framework transforms the training data, consisting of ground truth labels and annotations from N teachers, into N+1 distinct tasks: N auxiliary tasks that predict the labeling styles of the N individual teachers, and one primary task that focuses on the ground truth labels. This approach, drawing upon principles from multiple learning paradigms, demonstrates strong empirical results across a range of architectures, modalities, and tasks.
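
Teacher2Task的关键操作是把"N个教师的标注 + 真值标签"改写为N+1个任务,并用教师专属token区分输入。下面是这一数据改写思路的最小示意(纯Python,字段名与token格式均为假设)。

```python
def teacher2task(example, teacher_preds, gt_label):
    """将一条样本扩展为N+1条训练记录:
    前N条为辅助任务(各自拟合某教师的标注风格),最后一条为真值主任务。"""
    records = [{"input": f"[TEACHER_{i}] {example}", "target": pred}
               for i, pred in enumerate(teacher_preds)]
    records.append({"input": f"[GROUND_TRUTH] {example}", "target": gt_label})
    return records

# 用法示意:两个教师 + 真值 -> 3条记录
for r in teacher2task("a photo of a cat", ["cat", "dog"], "cat"):
    print(r)
```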

[AI-2] CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

链接: https://arxiv.org/abs/2411.12713
作者: Zhehan Kan,Ce Zhang,Zihan Liao,Yapeng Tian,Wenming Yang,Junyuan Xiao,Xu Li,Dongmei Jiang,Yaowei Wang,Qingmin Liao
关键词-EN: Large Vision-Language Model, posing significant risks, Adaptive Token-level Contrastive, Token-level Contrastive Decoding, demonstrated impressive vision-language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems. Despite previous efforts to mitigate hallucinations, a persistent issue remains: visual defects arising from vision-language misalignment, which create a bottleneck in visual processing capacity. To address this challenge, we develop Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs (CATCH), based on the Information Bottleneck theory. CATCH introduces Complementary Visual Decoupling (CVD) for visual information separation, Non-Visual Screening (NVS) for hallucination detection, and Adaptive Token-level Contrastive Decoding (ATCD) for hallucination mitigation. CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios. It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training, opening new possibilities for advancing LVLMs in various challenging applications.
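
理解CATCH的前提是token级对比解码的一般形式:用"完整视觉输入"与"退化(非视觉)输入"两套logits相减,抑制仅由语言先验驱动的幻觉token。以下为通用对比解码一步的简化示意;alpha、beta为示意超参,CATCH的自适应加权与视觉解耦细节以原文为准。

```python
import torch

def contrastive_step(logits_visual, logits_degraded, alpha=1.0, beta=0.1):
    """一步token级对比解码:放大依赖视觉证据的token。"""
    probs = torch.softmax(logits_visual, dim=-1)
    mask = probs >= beta * probs.max()         # 仅保留视觉条件下足够可信的token
    scores = (1 + alpha) * logits_visual - alpha * logits_degraded
    scores[~mask] = float("-inf")              # 屏蔽不可信token
    return scores.argmax(dim=-1)               # 贪心选出下一个token

next_token = contrastive_step(torch.randn(32000), torch.randn(32000))
```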

[AI-3] When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations

链接: https://arxiv.org/abs/2411.12701
作者: Huaizhi Ge,Yiming Li,Qifan Wang,Yongfeng Zhang,Ruixiang Tang
关键词-EN: Large Language Models, Large Language, manipulate model behavior, maliciously manipulate model, hidden triggers
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are vulnerable to backdoor attacks, where hidden triggers can maliciously manipulate model behavior. While several backdoor attack methods have been proposed, the mechanisms by which backdoor functions operate in LLMs remain underexplored. In this paper, we move beyond attacking LLMs and investigate backdoor functionality through the novel lens of natural language explanations. Specifically, we leverage LLMs’ generative capabilities to produce human-understandable explanations for their decisions, allowing us to compare explanations for clean and poisoned samples. We explore various backdoor attacks and embed the backdoor into LLaMA models for multiple tasks. Our experiments show that backdoored models produce higher-quality explanations for clean data compared to poisoned data, while generating significantly more consistent explanations for poisoned data than for clean data. We further analyze the explanation generation process, revealing that at the token level, the explanation token of poisoned samples only appears in the final few transformer layers of the LLM. At the sentence level, attention dynamics indicate that poisoned inputs shift attention from the input context when generating the explanation. These findings deepen our understanding of backdoor attack mechanisms in LLMs and offer a framework for detecting such vulnerabilities through explainability techniques, contributing to the development of more secure LLMs.

[AI-4] Attribute Inference Attacks for Federated Regression Tasks

链接: https://arxiv.org/abs/2411.12697
作者: Francesco Diana,Othmane Marfoq,Chuan Xu,Giovanni Neglia,Frédéric Giroire,Eoin Thomas
关键词-EN: global machine learning, machine learning model, enables multiple clients, machine learning, learning model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables multiple clients, such as mobile phones and IoT devices, to collaboratively train a global machine learning model while keeping their data localized. However, recent studies have revealed that the training phase of FL is vulnerable to reconstruction attacks, such as attribute inference attacks (AIA), where adversaries exploit exchanged messages and auxiliary public information to uncover sensitive attributes of targeted clients. While these attacks have been extensively studied in the context of classification tasks, their impact on regression tasks remains largely unexplored. In this paper, we address this gap by proposing novel model-based AIAs specifically designed for regression tasks in FL environments. Our approach considers scenarios where adversaries can either eavesdrop on exchanged messages or directly interfere with the training process. We benchmark our proposed attacks against state-of-the-art methods using real-world datasets. The results demonstrate a significant increase in reconstruction accuracy, particularly in heterogeneous client datasets, a common scenario in FL. The efficacy of our model-based AIAs makes them better candidates for empirically quantifying privacy leakage for federated regression tasks.

[AI-5] Deep Learning-Driven Heat Map Analysis for Evaluating thickness of Wounded Skin Layers

链接: https://arxiv.org/abs/2411.12678
作者: Devakumar GR,JB Kaarthikeyan,Dominic Immanuel T,Sheena Christabel Pravin
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-6] PoM: Efficient Image and Video Generation with the Polynomial Mixer

链接: https://arxiv.org/abs/2411.12663
作者: David Picard,Nicolas Dufour
关键词-EN: Diffusion models based, models based, based on Multi-Head, Multi-Head Attention, Polynomial Mixer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models based on Multi-Head Attention (MHA) have become ubiquitous to generate high quality images and videos. However, encoding an image or a video as a sequence of patches results in costly attention patterns, as the requirements both in terms of memory and compute grow quadratically. To alleviate this problem, we propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has a linear complexity with respect to the number of tokens. This explicit state also allows us to generate frames in a sequential fashion, minimizing memory and compute requirement, while still being able to train in parallel. We show the Polynomial Mixer is a universal sequence-to-sequence approximator, just like regular MHA. We adapt several Diffusion Transformers (DiT) for generating images and videos with PoM replacing MHA, and we obtain high quality samples while using less computational resources. The code is available at this https URL.

[AI-7] CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval

链接: https://arxiv.org/abs/2411.12644
作者: Ye Liu,Rui Meng,Shafiq Joty,Silvio Savarese,Caiming Xiong,Yingbo Zhou,Semih Yavuz
关键词-EN: largely underexplored area, NLP tasks, code retrieval remains, code retrieval, underexplored area
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Despite the success of text retrieval in many NLP tasks, code retrieval remains a largely underexplored area. Most text retrieval systems are tailored for natural language queries, often neglecting the specific challenges of retrieving code. This gap leaves existing models unable to effectively capture the diversity of programming languages and tasks across different domains, highlighting the need for more focused research in code retrieval. To address this, we introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters. Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework, enhancing model generalizability and retrieval performance. Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark. In addition to excelling in code retrieval, our models demonstrate competitive performance on the widely adopted BeIR text retrieval benchmark, offering versatility across domains. Experimental results demonstrate that improving retrieval performance significantly enhances end-to-end Retrieval-Augmented Generation (RAG) performance for code-related tasks.

[AI-8] Instant Policy: In-Context Imitation Learning via Graph Diffusion WWW

链接: https://arxiv.org/abs/2411.12633
作者: Vitalis Vosylius,Edward Johns
关键词-EN: In-Context Imitation Learning, In-Context Imitation, Imitation Learning, large transformers, opportunity for robotics
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code and videos are available on our project webpage at this https URL

点击查看摘要

Abstract:Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly (without further training) from just one or two demonstrations, achieving ICIL through two key components. First, we introduce inductive biases through a graph representation and model ICIL as a graph generation problem with a learned diffusion process, enabling structured reasoning over demonstrations, observations, and actions. Second, we show that such a model can be trained using pseudo-demonstrations - arbitrary trajectories generated in simulation - as a virtually infinite pool of training data. Simulated and real experiments show that Instant Policy enables rapid learning of various everyday robot tasks. We also show how it can serve as a foundation for cross-embodiment and zero-shot transfer to language-defined tasks. Code and videos are available at this https URL.

[AI-9] STREAM: A Universal State-Space Model for Sparse Geometric Data

链接: https://arxiv.org/abs/2411.12603
作者: Mark Schöne,Yash Bhisikar,Karan Bania,Khaleelulla Khan Nazeer,Christian Mayr,Anand Subramoney,David Kappel
关键词-EN: geometric data, unstructured geometric data, Handling sparse, geometric, pressing challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Handling sparse and unstructured geometric data, such as point clouds or event-based vision, is a pressing challenge in the field of machine vision. Recently, sequence models such as Transformers and state-space models entered the domain of geometric data. These methods require specialized preprocessing to create a sequential view of a set of points. Furthermore, prior works involving sequence models iterate geometric data with either uniform or learned step sizes, implicitly relying on the model to infer the underlying geometric structure. In this work, we propose to encode geometric structure explicitly into the parameterization of a state-space model. State-space models are based on linear dynamics governed by a one-dimensional variable such as time or a spatial coordinate. We exploit this dynamic variable to inject relative differences of coordinates into the step size of the state-space model. The resulting geometric operation computes interactions between all pairs of N points in O(N) steps. Our model deploys the Mamba selective state-space model with a modified CUDA kernel to efficiently map sparse geometric data to modern hardware. The resulting sequence model, which we call STREAM, achieves competitive results on a range of benchmarks from point-cloud classification to event-based vision and audio classification. STREAM demonstrates a powerful inductive bias for sparse geometric data by improving the PointMamba baseline when trained from scratch on the ModelNet40 and ScanObjectNN point cloud analysis datasets. It further achieves, for the first time, 100% test accuracy on all 11 classes of the DVS128 Gestures dataset.
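
STREAM的关键想法是把相邻点的坐标差直接作为状态空间模型的离散化步长Δ。下面用一个一维线性SSM的显式递推做最小示意(零阶保持式离散化;参数a、b与输入均为示意,并非论文所用的Mamba选择性机制)。

```python
import torch

def ssm_with_coord_steps(coords, feats, a=-1.0, b=1.0):
    """coords: (N,) 已排序的点坐标;feats: (N,) 对应点特征。
    以坐标相对差为步长Δ做状态空间递推,N步内完成全部交互,复杂度O(N)。"""
    h, out, prev = torch.zeros(()), [], coords[0]
    for x_i, u_i in zip(coords, feats):
        dt = (x_i - prev).clamp(min=0.0)       # 坐标差注入步长
        h = torch.exp(torch.tensor(a) * dt) * h + dt * b * u_i
        out.append(h.clone())
        prev = x_i
    return torch.stack(out)

y = ssm_with_coord_steps(torch.tensor([0.0, 0.1, 0.5, 0.55]),
                         torch.tensor([1.0, -0.5, 2.0, 0.3]))
```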

[AI-10] Provable unlearning in topic modeling and downstream tasks

链接: https://arxiv.org/abs/2411.12600
作者: Stanley Wei,Sadhika Malladi,Sanjeev Arora,Amartya Sanyal
关键词-EN: legal concerns arise, Machine unlearning algorithms, Machine unlearning, increasingly important, important as legal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine unlearning algorithms are increasingly important as legal concerns arise around the provenance of training data, but verifying the success of unlearning is often difficult. Provable guarantees for unlearning are often limited to supervised learning settings. In this paper, we provide the first theoretical guarantees for unlearning in the pre-training and fine-tuning paradigm by studying topic models, simple bag-of-words language models that can be adapted to solve downstream tasks like retrieval and classification. First, we design a provably effective unlearning algorithm for topic models that incurs a computational overhead independent of the size of the original dataset. Our analysis additionally quantifies the deletion capacity of the model – i.e., the number of examples that can be unlearned without incurring a significant cost in model performance. Finally, we formally extend our analyses to account for adaptation to a given downstream task. In particular, we design an efficient algorithm to perform unlearning after fine-tuning the topic model via a linear head. Notably, we show that it is easier to unlearn pre-training data from models that have been fine-tuned to a particular task, and one can unlearn this data without modifying the base model.

[AI-11] AdaCM2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction

链接: https://arxiv.org/abs/2411.12593
作者: Yuanbin Man,Ying Huang,Chengming Zhang,Bingzhe Li,Wei Niu,Miao Yin
关键词-EN: large language models, incorporating LLMs, advancements in large, large language, language models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advancements in large language models (LLMs) have propelled the improvement of video understanding tasks by incorporating LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent attempts address long-term videos by extracting and compressing visual features into a fixed-size memory. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual and textual queries, leading to difficulties in effectively handling complex question-answering tasks. To address the challenges of long videos and complex prompts, we propose AdaCM^2, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM^2 achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.

[AI-12] Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination

链接: https://arxiv.org/abs/2411.12591
作者: Haojie Zheng,Tianyang Xu,Hanchi Sun,Shu Pu,Ruoxi Chen,Lichao Sun
关键词-EN: large language models, language models, Multimodal large language, linguistic modalities, large language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have advanced the integration of visual and linguistic modalities, establishing themselves as the dominant paradigm for visual-language tasks. Current approaches like chain of thought (CoT) reasoning have augmented the cognitive capabilities of large language models (LLMs), yet their adaptation to MLLMs is hindered by heightened risks of hallucination in cross-modality comprehension. In this paper, we find that the thinking while looking paradigm in current multimodal CoT approaches–where reasoning chains are generated alongside visual input–fails to mitigate hallucinations caused by misleading images. To address these limitations, we propose the Visual Inference Chain (VIC) framework, a novel approach that constructs reasoning chains using textual context alone before introducing visual input, effectively reducing cross-modal biases and enhancing multimodal reasoning accuracy. Comprehensive evaluations demonstrate that VIC significantly improves zero-shot performance across various vision-related tasks, mitigating hallucinations while refining the reasoning capabilities of MLLMs. Our code repository can be found at this https URL.
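
VIC的流程可概括为"先想后看":第一阶段只凭文字生成推理链,第二阶段再结合图像执行推理并作答。下面以伪代码式的两阶段提示流程示意该思路;mllm_generate为假设的多模态模型调用接口,提示词亦为示意,并非论文原始实现。

```python
def visual_inference_chain(question, image, mllm_generate):
    """两阶段推理:先纯文本规划推理步骤,再携带图像执行并作答。
    mllm_generate(prompt, image=None) 为假设的模型接口。"""
    # 第一阶段:不提供图像,避免被误导性视觉内容带偏推理
    chain = mllm_generate(
        f"Question: {question}\n"
        "Without seeing the image, list the reasoning steps needed to answer.")
    # 第二阶段:携带推理链与图像,按计划逐步得出最终答案
    return mllm_generate(
        f"Question: {question}\nReasoning plan:\n{chain}\n"
        "Now look at the image, follow the plan, and give the final answer.",
        image=image)
```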

[AI-13] ULTra: Unveiling Latent Token Interpretability in Transformer Based Understanding

链接: https://arxiv.org/abs/2411.12589
作者: Hesam Hosseini,Ghazal Hosseini Mighan,Amirabbas Afzali,Sajjad Amini,Amir Houmansadr
关键词-EN: Natural Language Processing, revolutionized Computer Vision, Computer Vision, Language Processing, Natural Language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers have revolutionized Computer Vision (CV) and Natural Language Processing (NLP) through self-attention mechanisms. However, due to their complexity, their latent token representations are often difficult to interpret. We introduce a novel framework that interprets Transformer embeddings, uncovering meaningful semantic patterns within them. Based on this framework, we demonstrate that zero-shot unsupervised semantic segmentation can be performed effectively without any fine-tuning using a model pre-trained for tasks other than segmentation. Our method reveals the inherent capacity of Transformer models for understanding input semantics and achieves state-of-the-art performance in semantic segmentation, outperforming traditional segmentation models. Specifically, our approach achieves an accuracy of 67.2% and an mIoU of 32.9% on the COCO-Stuff dataset, as well as an mIoU of 51.9% on the PASCAL VOC dataset. Additionally, we validate our interpretability framework on LLMs for text summarization, demonstrating its broad applicability and robustness.

[AI-14] Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

链接: https://arxiv.org/abs/2411.12584
作者: Xudong Yan,Songhe Feng,Yang Zhang,Jian Yang,Yueguan Lin,Haojun Fei
关键词-EN: Compositional zero-shot learning, Compositional zero-shot, Large Language Model, zero-shot learning, aims to recognize
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attribute and object by extracting shared and exclusive parts between image pairs sharing the same attribute (object), as well as aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) the efficacy of disentanglement is compromised due to the influence of the background and the intricate entanglement of attribute with object in the same parts; (2) existing word embeddings fail to capture complex multimodal semantic information; (3) overconfidence exhibited by existing models in seen compositions hinders their generalization to novel compositions. Aware of these limitations, we propose a novel framework named Multimodal Large Language Model (MLLM) embeddings and attribute smoothing guided disentanglement (TRIDENT) for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of background, and utilize learnable condition masks to capture multigranularity features for disentanglement. Then, the last hidden states of the MLLM are employed as word embeddings for their superior representation capabilities. Moreover, we propose attribute smoothing with auxiliary attributes generated by a Large Language Model (LLM) for seen compositions, addressing the issue of overconfidence by encouraging the model to learn more attributes in one given composition. Extensive experiments demonstrate that TRIDENT achieves state-of-the-art performance on three benchmarks.

[AI-15] Topological Symmetry Enhanced Graph Convolution for Skeleton-Based Action Recognition

链接: https://arxiv.org/abs/2411.12560
作者: Zeyu Liang,Hailun Xia,Naichuan Zheng,Huan Xu
关键词-EN: graph convolutional networks, achieved remarkable performance, Skeleton-based action recognition, NTU RGB, Enhanced Graph Convolution
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Skeleton-based action recognition has achieved remarkable performance with the development of graph convolutional networks (GCNs). However, most of these methods tend to construct complex topology learning mechanisms while neglecting the inherent symmetry of the human body. Additionally, the use of temporal convolutions with certain fixed receptive fields limits their capacity to effectively capture dependencies in time sequences. To address the issues, we (1) propose a novel Topological Symmetry Enhanced Graph Convolution (TSE-GC) to enable distinct topology learning across different channel partitions while incorporating topological symmetry awareness and (2) construct a Multi-Branch Deformable Temporal Convolution (MBDTC) for skeleton-based action recognition. The proposed TSE-GC emphasizes the inherent symmetry of the human body while enabling efficient learning of dynamic topologies. Meanwhile, the design of MBDTC introduces the concept of deformable modeling, leading to more flexible receptive fields and stronger modeling capacity of temporal dependencies. Combining TSE-GC with MBDTC, our final model, TSE-GCN, achieves competitive performance with fewer parameters compared with state-of-the-art methods on three large datasets, NTU RGB+D, NTU RGB+D 120, and NW-UCLA. On the cross-subject and cross-set evaluations of NTU RGB+D 120, the accuracies of our model reach 90.0% and 91.1%, with 1.1M parameters and 1.38 GFLOPS for one stream.

[AI-16] Recall and Refine: A Simple but Effective Source-free Open-set Domain Adaptation Framework

链接: https://arxiv.org/abs/2411.12558
作者: Ismail Nejjar,Hao Dong,Olga Fink
关键词-EN: Open-set Domain Adaptation, Domain Adaptation, Open-set Domain, unknown classes, Source-free Open-set Domain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Open-set Domain Adaptation (OSDA) aims to adapt a model from a labeled source domain to an unlabeled target domain, where novel classes - also referred to as target-private unknown classes - are present. Source-free Open-set Domain Adaptation (SF-OSDA) methods address OSDA without accessing labeled source data, making them particularly relevant under privacy constraints. However, SF-OSDA presents significant challenges due to distribution shifts and the introduction of novel classes. Existing SF-OSDA methods typically rely on thresholding the prediction entropy of a sample to identify it as either a known or unknown class but fail to explicitly learn discriminative features for the target-private unknown classes. We propose Recall and Refine (RRDA), a novel SF-OSDA framework designed to address these limitations by explicitly learning features for target-private unknown classes. RRDA employs a two-step process. First, we enhance the model’s capacity to recognize unknown classes by training a target classifier with an additional decision boundary, guided by synthetic samples generated from target domain features. This enables the classifier to effectively separate known and unknown classes. In the second step, we adapt the entire model to the target domain, addressing both domain shifts and improving generalization to unknown classes. Any off-the-shelf source-free domain adaptation method (e.g., SHOT, AaD) can be seamlessly integrated into our framework at this stage. Extensive experiments on three benchmark datasets demonstrate that RRDA significantly outperforms existing SF-OSDA and OSDA methods.

[AI-17] Rethinking Top Probability from Multi-view for Distracted Driver Behaviour Localization

链接: https://arxiv.org/abs/2411.12525
作者: Quang Vinh Nguyen,Vo Hoang Thanh Son,Chau Truong Vinh Hoang,Duc Duy Nguyen,Nhat Huy Nguyen Minh,Soo-Hyung Kim
关键词-EN: Naturalistic driving action, real-world driving scenarios, comprehend human behaviors, video data captured, Naturalistic driving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Computer Vision and Pattern Recognition Workshop 2024

点击查看摘要

Abstract:The naturalistic driving action localization task aims to recognize and comprehend human behaviors and actions from video data captured during real-world driving scenarios. Previous studies have shown great action localization performance by applying a recognition model followed by probability-based post-processing. Nevertheless, the probabilities provided by the recognition model frequently contain confusing information, posing challenges for post-processing. In this work, we adopt an action recognition model based on self-supervised learning to detect distracted activities and give potential action probabilities. Subsequently, a constraint ensemble strategy takes advantage of multi-camera views to provide robust predictions. Finally, we introduce a conditional post-processing operation to locate distracted behaviours and action temporal boundaries precisely. Experimenting on test set A2, our method obtains the sixth position on the public leaderboard of track 3 of the 2024 AI City Challenge.
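
文中"基于概率的后处理"可用一个通用的"平滑 + 阈值化 + 合并片段"流程来示意,从逐帧动作概率中恢复时间边界;窗口、阈值与最短片段长度均为示意参数,并非原文的条件化后处理设置。

```python
import numpy as np

def localize(probs, thresh=0.6, win=5, min_len=8):
    """probs: (T,) 某动作的逐帧概率;返回动作片段的(起, 止)帧列表。"""
    smooth = np.convolve(probs, np.ones(win) / win, mode="same")  # 滑动平均去抖
    active, segments, start = smooth >= thresh, [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            if t - start >= min_len:            # 过滤过短的噪声片段
                segments.append((start, t - 1))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active) - 1))
    return segments

print(localize(np.r_[np.zeros(10), 0.9 * np.ones(20), np.zeros(10)]))
```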

[AI-18] The Hermeneutic Turn of AI: Is the Machine Capable of Interpreting?

链接: https://arxiv.org/abs/2411.12517
作者: Remy Demichelis
关键词-EN: artificial neural networks, deep learning, artificial neural, neural networks, interactions with machines
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 4 pages.

点击查看摘要

Abstract:This article aims to demonstrate how the approach to computing is being disrupted by deep learning (artificial neural networks), not only in terms of techniques but also in our interactions with machines. It also addresses the philosophical tradition of hermeneutics (Don Ihde, Wilhelm Dilthey) to highlight a parallel with this movement and to demystify the idea of human-like AI.

[AI-19] Transformer Neural Processes – Kernel Regression

链接: https://arxiv.org/abs/2411.12502
作者: Daniel Jenson,Jhonathan Navott,Mengyan Zhang,Makkunda Sharma,Elizaveta Semenova,Seth Flaxman
关键词-EN: Stochastic processes model, Stochastic processes, Transformer Neural Process, Neural Processes, transformer-based Neural Processes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Stochastic processes model various natural phenomena from disease transmission to stock prices, but simulating and quantifying their uncertainty can be computationally challenging. For example, modeling a Gaussian Process with standard statistical methods incurs an $\mathcal{O}(n^3)$ penalty, and even using state-of-the-art Neural Processes (NPs) incurs an $\mathcal{O}(n^2)$ penalty due to the attention mechanism. We introduce the Transformer Neural Process - Kernel Regression (TNP-KR), a new architecture that incorporates a novel transformer block we call a Kernel Regression Block (KRBlock), which reduces the computational complexity of attention in transformer-based Neural Processes (TNPs) from $\mathcal{O}((n_C+n_T)^2)$ to $\mathcal{O}(n_C^2+n_C n_T)$ by eliminating masked computations, where $n_C$ is the number of context points and $n_T$ is the number of test points, as well as a fast attention variant that further reduces all attention calculations to $\mathcal{O}(n_C)$ in space and time complexity. In benchmarks spanning such tasks as meta-regression, Bayesian optimization, and image completion, we demonstrate that the full variant matches the performance of state-of-the-art methods while training faster and scaling two orders of magnitude higher in the number of test points, and the fast variant nearly matches that performance while scaling to millions of both test and context points on consumer hardware.
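
KRBlock把复杂度从$\mathcal{O}((n_C+n_T)^2)$降到$\mathcal{O}(n_C^2+n_C n_T)$的直观做法是:上下文点之间做自注意力,测试点只对上下文做交叉注意力、彼此之间不交互。下面用PyTorch标准多头注意力示意这一拆分(维度与头数为示意值,并非TNP-KR的实际结构)。

```python
import torch
import torch.nn as nn

d, n_ctx, n_test = 64, 128, 1024
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

ctx = torch.randn(1, n_ctx, d)      # 上下文点表示
test = torch.randn(1, n_test, d)    # 测试点表示

ctx_out, _ = attn(ctx, ctx, ctx)            # 上下文自注意力:O(n_C^2)
test_out, _ = attn(test, ctx_out, ctx_out)  # 测试点->上下文交叉注意力:O(n_C * n_T)
print(ctx_out.shape, test_out.shape)        # (1, 128, 64) (1, 1024, 64)
```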

[AI-20] Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus NEURIPS2024

链接: https://arxiv.org/abs/2411.12498
作者: Terufumi Morishita,Gaku Morio,Atsuki Yamaguchi,Yasuhiro Sogawa
关键词-EN: Large language models, Large language, Additional Logic Training, language models, range of tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Large language models (LLMs) are capable of solving a wide range of tasks, yet they have struggled with reasoning. To address this, we propose Additional Logic Training (ALT), which aims to enhance LLMs' reasoning capabilities by program-generated logical reasoning samples. We first establish principles for designing high-quality samples by integrating symbolic logic theory and previous empirical insights. Then, based on these principles, we construct a synthetic corpus named Formal Logic Deduction Diverse (FLD$^{\times 2}$), comprising numerous samples of multi-step deduction with unknown facts, diverse reasoning rules, diverse linguistic expressions, and challenging distractors. Finally, we empirically show that ALT on FLD$^{\times 2}$ substantially enhances the reasoning capabilities of state-of-the-art LLMs, including LLaMA-3.1-70B. Improvements include gains of up to 30 points on logical reasoning benchmarks, up to 10 points on math and coding benchmarks, and 5 points on the benchmark suite BBH.
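
FLD$^{\times 2}$语料的基本单元是程序化生成的"事实 + 规则 → 多步演绎"样本。下面给出一个极简的样本生成器示意,演示如何构造带干扰规则的两步演绎;谓词、格式与标签命名均为占位假设,远简于论文语料的实际多样性。

```python
import random

def make_sample(seed=0):
    """程序化生成一条两步演绎样本:已知A(x)与规则A->B、B->C,问C(x)。"""
    rng = random.Random(seed)
    a, b, c, d = rng.sample(["P", "Q", "R", "S", "T"], 4)
    rules = [f"If {a}(x) then {b}(x).",
             f"If {b}(x) then {c}(x).",
             f"If {d}(x) then {a}(x)."]        # 干扰项:前提d(x)未知
    rng.shuffle(rules)
    return {"context": f"{a}(x) is true. " + " ".join(rules),
            "hypothesis": f"{c}(x) is true.",
            "label": "PROVED",
            "proof": [f"{a}(x) -> {b}(x)", f"{b}(x) -> {c}(x)"]}

print(make_sample(42))
```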

[AI-21] Comparing Prior and Learned Time Representations in Transformer Models of Timeseries

链接: https://arxiv.org/abs/2411.12476
作者: Natalia Koliou,Tatiana Boura,Stasinos Konstantopoulos,George Meramveliotakis,George Kosmadakis
关键词-EN: sets timeseries analysis, fixed time representation, time representation, machine learning exercises, time representation proposed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Presented at the AI in Natural Sciences and Technology (AINST) track of the 13th Conference on Artificial Intelligence (SETN 2024), 11-13 September 2024, Piraeus, Greece

点击查看摘要

Abstract:What sets timeseries analysis apart from other machine learning exercises is that time representation becomes a primary aspect of the experiment setup, as it must adequately represent the temporal relations that are relevant for the application at hand. In the work described here we study two different variations of the Transformer architecture: one where we use the fixed time representation proposed in the literature and one where the time representation is learned from the data. Our experiments use data from predicting the energy output of solar panels, a task that exhibits known periodicities (daily and seasonal) that are straightforward to encode in the fixed time representation. Our results indicate that even in an experiment where the phenomenon is well-understood, it is difficult to encode prior knowledge due to side-effects that are difficult to mitigate. We conclude that research work is needed to work the human into the learning loop in ways that improve the robustness and trustworthiness of the network.
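
文中比较的两种时间表示可以直接用几行代码说明:固定的正弦编码把已知周期(日、年)写死为先验,而可学习编码(Time2Vec风格)让频率与相位由数据学出。以下为两者的最小示意,维度与周期值仅为假设。

```python
import math
import torch
import torch.nn as nn

def fixed_time_encoding(t, periods=(24.0, 8760.0)):
    """固定表示:按已知周期(小时制的日/年)构造正弦/余弦特征。t: (N,)"""
    feats = []
    for p in periods:
        feats += [torch.sin(2 * math.pi * t / p),
                  torch.cos(2 * math.pi * t / p)]
    return torch.stack(feats, dim=-1)

class LearnedTimeEncoding(nn.Module):
    """可学习表示(Time2Vec风格):频率w与相位b为可训练参数。"""
    def __init__(self, dim=8):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, t):                       # t: (N,)
        return torch.sin(t[:, None] * self.w + self.b)

t = torch.arange(48.0)                          # 48小时的时间戳
print(fixed_time_encoding(t).shape, LearnedTimeEncoding()(t).shape)
```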

[AI-22] Preference-Conditioned Gradient Variations for Multi-Objective Quality-Diversity

链接: https://arxiv.org/abs/2411.12433
作者: Hannah Janmohamed,Maxence Faldor,Thomas Pierrot,Antoine Cully
关键词-EN: variety of domains, generate collections, diverse and high-performing, Quality-Diversity algorithms, Multi-Objective Quality-Diversity algorithms
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In a variety of domains, from robotics to finance, Quality-Diversity algorithms have been used to generate collections of both diverse and high-performing solutions. Multi-Objective Quality-Diversity algorithms have emerged as a promising approach for applying these methods to complex, multi-objective problems. However, existing methods are limited by their search capabilities. For example, Multi-Objective Map-Elites depends on random genetic variations which struggle in high-dimensional search spaces. Despite efforts to enhance search efficiency with gradient-based mutation operators, existing approaches consider updating solutions to improve on each objective separately rather than achieving desired trade-offs. In this work, we address this limitation by introducing Multi-Objective Map-Elites with Preference-Conditioned Policy-Gradient and Crowding Mechanisms: a new Multi-Objective Quality-Diversity algorithm that uses preference-conditioned policy-gradient mutations to efficiently discover promising regions of the objective space and crowding mechanisms to promote a uniform distribution of solutions on the Pareto front. We evaluate our approach on six robotics locomotion tasks and show that our method outperforms or matches all state-of-the-art Multi-Objective Quality-Diversity methods in all six, including two newly proposed tri-objective tasks. Importantly, our method also achieves a smoother set of trade-offs, as measured by newly-proposed sparsity-based metrics. This performance comes at a lower computational storage cost compared to previous methods.

[AI-23] DiM: f-Divergence Minimization Guided Sharpness-Aware Optimization for Semi-supervised Medical Image Segmentation

链接: https://arxiv.org/abs/2411.12350
作者: Bingli Wang,Houcheng Su,Nan Yin,Mengzhu Wang,Li Shen
关键词-EN: attracted widespread attention, widespread attention, alleviate the pressure, attracted widespread, data annotation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages

点击查看摘要

Abstract:As a technique to alleviate the pressure of data annotation, semi-supervised learning (SSL) has attracted widespread attention. In the specific domain of medical image segmentation, semi-supervised methods (SSMIS) have become a research hotspot due to their ability to reduce the need for large amounts of precisely annotated data. SSMIS focuses on enhancing the model's generalization performance by leveraging a small number of labeled samples and a large number of unlabeled samples. The latest sharpness-aware optimization (SAM) technique, which optimizes the model by reducing the sharpness of the loss function, has shown significant success in SSMIS. However, SAM and its variants may not fully account for the distribution differences between different datasets. To address this issue, we propose a sharpness-aware optimization method based on $f$-divergence minimization (DiM) for semi-supervised medical image segmentation. This method enhances the model's stability by fine-tuning the sensitivity of model parameters and improves the model's adaptability to different datasets through the introduction of $f$-divergence. By reducing $f$-divergence, the DiM method not only improves the performance balance between the source and target datasets but also prevents performance degradation due to overfitting on the source dataset.
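
作为理解DiM的基础,下面示意"SAM两次前向的一步更新",并在锐度感知损失上叠加一个f-散度项(此处以KL为例,作为f-散度族的一个实例)做源/目标分布对齐。权重lam、半径rho与散度选择均为示意假设(并假设模型所有参数都参与前向),具体实现以原文为准。

```python
import torch
import torch.nn.functional as F

def sam_kl_step(model, loss_fn, x_lab, y_lab, x_src, x_tgt, opt,
                rho=0.05, lam=0.1):
    """SAM更新一步:先沿梯度方向把参数扰动到邻域"最坏点",
    在该点上求(任务损失 + KL散度正则)的梯度,撤销扰动后再更新。"""
    opt.zero_grad()
    loss_fn(model(x_lab), y_lab).backward()            # 第一次前向/反向
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)                                  # 扰动到最坏点
    opt.zero_grad()
    log_p_src = F.log_softmax(model(x_src), dim=-1)    # 源域预测分布
    p_tgt = F.softmax(model(x_tgt), dim=-1)            # 目标域预测分布
    kl = F.kl_div(log_p_src, p_tgt, reduction="batchmean")
    (loss_fn(model(x_lab), y_lab) + lam * kl).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                                  # 撤销扰动
    opt.step()
    opt.zero_grad()
```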

[AI-24] CLIP Unreasonable Potential in Single-Shot Face Recognition

链接: https://arxiv.org/abs/2411.12319
作者: Nhan T. Luu
关键词-EN: analyzing facial patterns, computer vision designed, designed to identify, identify and authenticate, Language Image Pretraining
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Face recognition is a core task in computer vision designed to identify and authenticate individuals by analyzing facial patterns and features. This field intersects with artificial intelligence, image processing, and machine learning, with applications in security, authentication, and personalization. Traditional approaches in facial recognition focus on capturing facial features like the eyes, nose, and mouth, and matching these against a database to verify identities. However, challenges such as high false positive rates have persisted, often due to the similarity among individuals' facial features. Recently, Contrastive Language-Image Pretraining (CLIP), a model developed by OpenAI, has shown promising advancements by linking natural language processing with vision tasks, allowing it to generalize across modalities. Using CLIP's vision-language correspondence and single-shot finetuning, the model can achieve lower false positive rates upon deployment without the need for mass facial feature extraction. This integration demonstrates CLIP's potential to address persistent issues in face recognition model performance without complicating our training paradigm.
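
该思路可用HuggingFace Transformers的CLIP接口简单复现:提取两张人脸图像的视觉嵌入并比较余弦相似度,超过阈值即判为同一人。模型名与阈值为示意;论文中的single-shot微调步骤此处省略。

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def face_similarity(path_a, path_b):
    """返回两张人脸图像CLIP视觉嵌入的余弦相似度。"""
    imgs = [Image.open(p).convert("RGB") for p in (path_a, path_b)]
    inputs = processor(images=imgs, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)   # L2归一化
    return (emb[0] @ emb[1]).item()

# 用法示意:相似度高于阈值(示意值0.9)判为同一人
# same_person = face_similarity("face1.jpg", "face2.jpg") > 0.9
```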

[AI-25] SNN-Based Online Learning of Concepts and Action Laws in an Open World

链接: https://arxiv.org/abs/2411.12308
作者: Christel Grimaud(IRIT-LILaC),Dominique Longin(IRIT-LILaC),Andreas Herzig(IRIT-LILaC)
关键词-EN: spiking neural network, bio-inspired cognitive agent, cognitive agent built, fully autonomous, bio-inspired cognitive
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We present the architecture of a fully autonomous, bio-inspired cognitive agent built around a spiking neural network (SNN) implementing the agent's semantic memory. The agent explores its universe and learns concepts of objects/situations and of its own actions in a one-shot manner. While object/situation concepts are unary, action concepts are triples made up of an initial situation, a motor activity, and an outcome. They embody the agent's knowledge of its universe's action laws. Both kinds of concepts have different degrees of generality. To make decisions, the agent queries its semantic memory for the expected outcomes of envisaged actions and chooses the action to take on the basis of these predictions. Our experiments show that the agent handles new situations by appealing to previously learned general concepts and rapidly modifies its concepts to adapt to environment changes.

[AI-26] SSEditor: Controllable Mask-to-Scene Generation with Diffusion Model

链接: https://arxiv.org/abs/2411.12290
作者: Haowen Zheng,Yanyan Liang
关键词-EN: Recent advancements, semantic scene generation, semantic scene, gained attention, scene generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in 3D diffusion-based semantic scene generation have gained attention. However, existing methods rely on unconditional generation and require multiple resampling steps when editing scenes, which significantly limits their controllability and flexibility. To this end, we propose SSEditor, a controllable Semantic Scene Editor that can generate specified target categories without multiple-step resampling. SSEditor employs a two-stage diffusion-based framework: (1) a 3D scene autoencoder is trained to obtain latent triplane features, and (2) a mask-conditional diffusion model is trained for customizable 3D semantic scene generation. In the second stage, we introduce a geometric-semantic fusion module that enhances the model's ability to learn geometric and semantic information. This ensures that objects are generated with correct positions, sizes, and categories. Extensive experiments on SemanticKITTI and CarlaSC demonstrate that SSEditor outperforms previous approaches in terms of controllability and flexibility in target generation, as well as the quality of semantic scene generation and reconstruction. More importantly, experiments on the unseen Occ-3D Waymo dataset show that SSEditor is capable of generating novel urban scenes, enabling the rapid construction of 3D scenes.

[AI-27] libcll: an Extendable Python Toolkit for Complementary-Label Learning

链接: https://arxiv.org/abs/2411.12276
作者: Nai-Xuan Ye,Tan-Ha Mai,Hsiu-Hsuan Wang,Wei-I Lin,Hsuan-Tien Lin
关键词-EN: weakly supervised learning, supervised learning paradigm, Complementary-label learning, multiclass classification, indicating classes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 3 figures

点击查看摘要

Abstract:Complementary-label learning (CLL) is a weakly supervised learning paradigm for multiclass classification, where only complementary labels – indicating classes an instance does not belong to – are provided to the learning algorithm. Despite CLL's increasing popularity, previous studies highlight two main challenges: (1) inconsistent results arising from varied assumptions on complementary label generation, and (2) high barriers to entry due to the lack of a standardized evaluation platform across datasets and algorithms. To address these challenges, we introduce libcll, an extensible Python toolkit for CLL research. libcll provides a universal interface that supports a wide range of generation assumptions, both synthetic and real-world datasets, and key CLL algorithms. The toolkit is designed to mitigate inconsistencies and streamline the research process, with easy installation, comprehensive usage guides, and quickstart tutorials that facilitate efficient adoption and implementation of CLL techniques. Extensive ablation studies conducted with libcll demonstrate its utility in generating valuable insights to advance future CLL research.
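
补充互补标签学习的基本损失形式:在"互补标签均匀产生"假设下,常用的前向校正 (forward correction) 损失如下。此处为背景示意,与libcll内置的具体算法实现未必一致。

```python
import torch
import torch.nn.functional as F

def forward_cll_loss(logits, comp_labels):
    """均匀假设下的前向校正损失:
    P(互补标签=ȳ|x) = (1 - p_ȳ(x)) / (K - 1),对其取负对数似然。"""
    k = logits.size(-1)
    p = F.softmax(logits, dim=-1)
    p_not = 1.0 - p.gather(1, comp_labels[:, None]).squeeze(1)
    return -torch.log((p_not / (k - 1)).clamp_min(1e-12)).mean()

# 用法示意:batch=4,共10类;comp_labels给出样本"不属于"的类
loss = forward_cll_loss(torch.randn(4, 10), torch.tensor([3, 1, 7, 0]))
print(loss.item())
```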

[AI-28] Restructuring Tractable Probabilistic Circuits

链接: https://arxiv.org/abs/2411.12256
作者: Honghua Zhang,Benjie Wang,Marcelo Arenas,Guy Van den Broeck
关键词-EN: probabilistic models, unifying representation, models that support, Probabilistic circuits, representation for probabilistic
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Probabilistic circuits (PCs) are a unifying representation for probabilistic models that support tractable inference. Numerous applications of PCs, like controllable text generation, depend on the ability to efficiently multiply two circuits. Existing multiplication algorithms require that the circuits respect the same structure, i.e., variable scopes decompose according to the same vtree. In this work, we propose and study the task of restructuring structured(-decomposable) PCs, that is, transforming a structured PC such that it conforms to a target vtree. We propose a generic approach for this problem and show that it leads to novel polynomial-time algorithms for multiplying circuits respecting different vtrees, as well as a practical depth-reduction algorithm that preserves structured decomposability. Our work opens up new avenues for tractable PC inference, suggesting the possibility of training with less restrictive PC structures while enabling efficient inference by changing their structures at inference time.

[AI-29] Error-Feedback Model for Output Correction in Bilateral Control-Based Imitation Learning

链接: https://arxiv.org/abs/2411.12255
作者: Hiroshi Sato,Masashi Konosu,Sho Sakaino,Toshiaki Tsuji
关键词-EN: perform flexible tasks, neural networks, recent years, imitation learning, enabled robots
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, imitation learning using neural networks has enabled robots to perform flexible tasks. However, since neural networks operate in a feedforward structure, they do not possess a mechanism to compensate for output errors. To address this limitation, we developed a feedback mechanism to correct these errors. By employing a hierarchical structure for neural networks comprising lower and upper layers, the lower layer was controlled to follow the upper layer. Additionally, using a multi-layer perceptron in the lower layer, which lacks an internal state, enhanced the error feedback. In the character-writing task, this model demonstrated improved accuracy in writing previously untrained characters. Through autonomous control with error feedback, we confirmed that the lower layer could effectively track the output of the upper layer. This study represents a promising step toward integrating neural networks with control theories.

[AI-30] Efficient Training in Multi-Agent Reinforcement Learning: A Communication-Free Framework for the Box-Pushing Problem

链接: https://arxiv.org/abs/2411.12246
作者: David Ge,Hao Ji
关键词-EN: Self-organizing systems consist, perform complex tasks, Self-organizing systems, central controller, systems consist
类目: Artificial Intelligence (cs.AI)
*备注: 17 pages, 16 figures

点击查看摘要

Abstract:Self-organizing systems consist of autonomous agents that can perform complex tasks and adapt to dynamic environments without a central controller. Prior research often relies on reinforcement learning to enable agents to gain the skills needed for task completion, such as in the box-pushing environment. However, when agents push from opposing directions during exploration, they tend to exert equal and opposite forces on the box, resulting in minimal displacement and inefficient training. This paper proposes a model called Shared Pool of Information (SPI), which enables information to be accessible to all agents and facilitates coordination, reducing force conflicts among agents and enhancing exploration efficiency. Through computer simulations, we demonstrate that SPI not only expedites the training process but also requires fewer steps per episode, significantly improving the agents’ collaborative effectiveness.

[AI-31] Contrast Similarity-Aware Dual-Pathway Mamba for Multivariate Time Series Node Classification

链接: https://arxiv.org/abs/2411.12222
作者: Mingsen Du,Meng Chen,Yongjian Li,Xiuxin Zhang,Jiahui Gao,Cun Ji,Shoushui Wei
关键词-EN: Multivariate time series, high dimensional characteristics, MTS node classification, Multivariate time, MTS
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to Knowledge-Based Systems on Nov 17, 2024

点击查看摘要

Abstract:Multivariate time series (MTS) data is generated through multiple sensors across various domains such as engineering application, health monitoring, and the internet of things, characterized by its temporal changes and high dimensional characteristics. Over the past few years, many studies have explored the long-range dependencies and similarities in MTS. However, long-range dependencies are difficult to model due to their temporal changes, and high dimensionality makes it difficult to obtain similarities effectively and efficiently. Thus, to address these issues, we propose contrast similarity-aware dual-pathway Mamba for MTS node classification (CS-DPMamba). Firstly, to obtain the dynamic similarity of each sample, we use a temporal contrast learning module to acquire MTS representations, and then construct a similarity matrix between MTS representations using Fast Dynamic Time Warping (FastDTW). Secondly, we apply the DPMamba to consider the bidirectional nature of MTS, allowing us to better capture long-range and short-range dependencies within the data. Finally, we utilize the Kolmogorov-Arnold Network enhanced Graph Isomorphism Network to complete the information interaction over the similarity matrix and the MTS node classification task. By comprehensively considering the long-range dependencies and dynamic similarity features, we achieved precise MTS node classification. We conducted experiments on multiple University of East Anglia (UEA) MTS datasets, which encompass diverse application scenarios. Our results demonstrate the superiority of our method through both supervised and semi-supervised experiments on the MTS classification task.
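
文中"用FastDTW在样本表示间构建相似度矩阵"这一步,可用fastdtw库直接示意:两两计算DTW距离后经高斯核映射为相似度(带宽sigma为示意参数)。

```python
import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

def dtw_similarity_matrix(series_list, sigma=1.0):
    """series_list: 每个元素为形如 (T_i, C) 的多元时间序列表示。
    两两计算FastDTW距离,再经exp(-d/sigma)转为相似度矩阵。"""
    n = len(series_list)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            dist, _ = fastdtw(series_list[i], series_list[j], dist=euclidean)
            sim[i, j] = sim[j, i] = np.exp(-dist / sigma)
    return sim

series = [np.random.randn(50 + 10 * k, 3) for k in range(4)]
print(dtw_similarity_matrix(series).round(3))
```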

[AI-32] DeTrigger: A Gradient-Centric Approach to Backdoor Attack Mitigation in Federated Learning

链接: https://arxiv.org/abs/2411.12220
作者: Kichang Lee,Yujin Shin,Jonghyuk Yun,Jun Han,JeongGil Ko
关键词-EN: local data privacy, preserving local data, enables collaborative model, collaborative model training, enables collaborative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training across distributed devices while preserving local data privacy, making it ideal for mobile and embedded systems. However, the decentralized nature of FL also opens vulnerabilities to model poisoning attacks, particularly backdoor attacks, where adversaries implant trigger patterns to manipulate model predictions. In this paper, we propose DeTrigger, a scalable and efficient backdoor-robust federated learning framework that leverages insights from adversarial attack methodologies. By employing gradient analysis with temperature scaling, DeTrigger detects and isolates backdoor triggers, allowing for precise model weight pruning of backdoor activations without sacrificing benign model knowledge. Extensive evaluations across four widely used datasets demonstrate that DeTrigger achieves up to 251x faster detection than traditional methods and mitigates backdoor attacks by up to 98.9%, with minimal impact on global model accuracy. Our findings establish DeTrigger as a robust and scalable solution to protect federated learning environments against sophisticated backdoor threats.

[AI-33] CCIS-Diff: A Generative Model with Stable Diffusion Prior for Controlled Colonoscopy Image Synthesis

链接: https://arxiv.org/abs/2411.12198
作者: Yifan Xie,Jingge Wang,Tao Feng,Fei Ma,Yang Li
关键词-EN: preventing colorectal cancer, identifying adenomatous polyps, colorectal cancer, crucial for identifying, identifying adenomatous
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:Colonoscopy is crucial for identifying adenomatous polyps and preventing colorectal cancer. However, developing robust models for polyp detection is challenged by the limited size and accessibility of existing colonoscopy datasets. While previous efforts have attempted to synthesize colonoscopy images, current methods suffer from instability and insufficient data diversity. Moreover, these approaches lack precise control over the generation process, resulting in images that fail to meet clinical quality standards. To address these challenges, we propose CCIS-DIFF, a Controlled generative model for high-quality Colonoscopy Image Synthesis based on a Diffusion architecture. Our method offers precise control over both the spatial attributes (polyp location and shape) and clinical characteristics of polyps that align with clinical descriptions. Specifically, we introduce a blur mask weighting strategy to seamlessly blend synthesized polyps with the colonic mucosa, and a text-aware attention mechanism to guide the generated images to reflect clinical characteristics. Notably, to achieve this, we construct a new multi-modal colonoscopy dataset that integrates images, mask annotations, and corresponding clinical text descriptions. Experimental results demonstrate that our method generates high-quality, diverse colonoscopy images with fine control over both spatial constraints and clinical consistency, offering valuable support for downstream segmentation and diagnostic tasks.

[AI-34] A More Advanced Group Polarization Measurement Approach Based on LLM-Based Agents and Graphs

链接: https://arxiv.org/abs/2411.12196
作者: Zixin Liu,Ji Zhang,Yiran Ding
关键词-EN: Group polarization, important research direction, attracting many researchers, explore this field, researchers to explore
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Group polarization is an important research direction in social media content analysis, attracting many researchers to explore this field. Therefore, how to effectively measure group polarization has become a critical topic. Measuring group polarization on social media presents several challenges that have not yet been addressed by existing solutions. First, social media group polarization measurement involves processing vast amounts of text, which poses a significant challenge for information extraction. Second, social media texts often contain hard-to-understand content, including sarcasm, memes, and internet slang. Additionally, group polarization research focuses on holistic analysis, while texts are typically fragmented. To address these challenges, we designed a solution based on a multi-agent system and used a graph-structured Community Sentiment Network (CSN) to represent polarization states. Furthermore, we developed a metric called Community Opposition Index (COI) based on the CSN to quantify polarization. Finally, we tested our multi-agent system through a zero-shot stance detection task and achieved outstanding results. In summary, the proposed approach has significant value in terms of usability, accuracy, and interpretability.
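
The abstract names a Community Opposition Index over the CSN but gives no formula, so the sketch below uses one plausible stand-in: the mean antagonism on edges that cross two communities. The `sentiment` edge attribute and the toy graph are assumptions, not the paper's definition.

```python
# Illustrative only: COI here is the mean antagonism (negative sentiment,
# clipped at zero) on edges linking community A and community B.
import networkx as nx

def community_opposition_index(csn: nx.Graph, com_a, com_b) -> float:
    cross = [
        max(0.0, -d["sentiment"])
        for u, v, d in csn.edges(data=True)
        if (u in com_a and v in com_b) or (u in com_b and v in com_a)
    ]
    return sum(cross) / len(cross) if cross else 0.0

g = nx.Graph()
g.add_edge("user1", "user3", sentiment=-0.8)   # hostile cross-community reply
g.add_edge("user2", "user4", sentiment=0.4)    # friendly cross-community reply
g.add_edge("user1", "user2", sentiment=0.9)    # in-group agreement (ignored)
print(community_opposition_index(g, {"user1", "user2"}, {"user3", "user4"}))  # 0.4
```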

[AI-35] Diffusion-Inspired Cold Start with Sufficient Prior in Computerized Adaptive Testing KDD2025

链接: https://arxiv.org/abs/2411.12182
作者: Haiping Ma,Aoqing Xia,Changqian Wang,Hai Wang,Xingyi Zhang
关键词-EN: Computerized Adaptive Testing, Computerized Adaptive, Adaptive Testing, CSIP task, cognitive states
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Accepted by KDD2025

点击查看摘要

Abstract:Computerized Adaptive Testing (CAT) aims to select the most appropriate questions based on the examinee’s ability and is widely used in online education. However, existing CAT systems often lack initial understanding of the examinee’s ability, requiring random probing questions. This can lead to poorly matched questions, extending the test duration and negatively impacting the examinee’s mindset, a phenomenon referred to as the Cold Start with Insufficient Prior (CSIP) task. This issue occurs because CAT systems do not effectively utilize the abundant prior information about the examinee available from other courses on online platforms. These response records, due to the commonality of cognitive states across different knowledge domains, can provide valuable prior information for the target domain. However, no prior work has explored solutions for the CSIP task. In response to this gap, we propose Diffusion Cognitive States TransfeR Framework (DCSR), a novel domain transfer framework based on Diffusion Models (DMs) to address the CSIP task. Specifically, we construct a cognitive state transition bridge between domains, guided by the common cognitive states of examinees, encouraging the model to reconstruct the initial ability state in the target domain. To enrich the expressive power of the generated data, we analyze the causal relationships in the generation process from a causal perspective. Redundant and extraneous cognitive states can lead to limited transfer and negative transfer effects. Our DCSR can seamlessly apply the generated initial ability states in the target domain to existing question selection algorithms, thus improving the cold start performance of the CAT system. Extensive experiments conducted on five real-world datasets demonstrate that DCSR significantly outperforms existing baseline methods in addressing the CSIP task.

[AI-36] SkillTree: Explainable Skill-Based Deep Reinforcement Learning for Long-Horizon Control Tasks

链接: https://arxiv.org/abs/2411.12173
作者: Yongyan Wen,Siyuan Li,Rongchang Zuo,Lei Yuan,Hangyu Mao,Peng Liu
关键词-EN: Deep reinforcement learning, achieved remarkable success, Deep reinforcement, reinforcement learning, achieved remarkable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep reinforcement learning (DRL) has achieved remarkable success in various research domains. However, its reliance on neural networks results in a lack of transparency, which limits its practical applications. To achieve explainability, decision trees have emerged as a popular and promising alternative to neural networks. Nonetheless, due to their limited expressiveness, traditional decision trees struggle with high-dimensional long-horizon continuous control tasks. In this paper, we propose SkillTree, a novel framework that reduces complex continuous action spaces into discrete skill spaces. Our hierarchical approach integrates a differentiable decision tree within the high-level policy to generate skill embeddings, which subsequently guide the low-level policy in executing skills. By making skill decisions explainable, we achieve skill-level explainability, enhancing the understanding of the decision-making process in complex tasks. Experimental results demonstrate that our method achieves performance comparable to skill-based neural networks in complex robotic arm control domains. Furthermore, SkillTree offers explanations at the skill level, thereby increasing the transparency of the decision-making process.
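
The differentiable decision tree inside the high-level policy can be pictured with a standard soft decision tree: every inner node is a sigmoid gate, and the leaf distribution mixes learned skill logits. This is a generic sketch under that assumption, not the paper's architecture; depth and sizes are arbitrary.

```python
# A minimal soft (differentiable) decision tree that maps an observation
# to skill logits. Hyperparameters and the output head are illustrative.
import torch
import torch.nn as nn

class SoftDecisionTree(nn.Module):
    def __init__(self, obs_dim, n_skills, depth=3):
        super().__init__()
        self.depth = depth
        n_inner, n_leaves = 2**depth - 1, 2**depth
        self.gates = nn.Linear(obs_dim, n_inner)          # one sigmoid split per inner node
        self.leaf_logits = nn.Parameter(torch.zeros(n_leaves, n_skills))

    def forward(self, obs):
        p_right = torch.sigmoid(self.gates(obs))          # (B, 2^depth - 1)
        path = obs.new_ones(obs.shape[0], 1)              # prob. of reaching each node
        for d in range(self.depth):
            g = p_right[:, 2**d - 1 : 2**(d + 1) - 1]     # this level's gates
            # each node splits its probability mass into (left, right) children
            path = torch.stack([path * (1 - g), path * g], dim=-1).flatten(1)
        return path @ self.leaf_logits                    # (B, n_skills) soft skill logits

tree = SoftDecisionTree(obs_dim=16, n_skills=8)
print(tree(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```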

[AI-37] UrbanDiT: A Foundation Model for Open-World Urban Spatio-Temporal Learning

链接: https://arxiv.org/abs/2411.12164
作者: Yuan Yuan,Chonghua Han,Jingtao Ding,Depeng Jin,Yong Li
关键词-EN: diverse human activities, complex spatio-temporal dynamics, spatio-temporal dynamics arising, activities and interactions, environment is characterized
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The urban environment is characterized by complex spatio-temporal dynamics arising from diverse human activities and interactions. Effectively modeling these dynamics is essential for understanding and optimizing urban systems. In this work, we introduce UrbanDiT, a foundation model for open-world urban spatio-temporal learning that successfully scales up diffusion transformers in this field. UrbanDiT pioneers a unified model that integrates diverse spatio-temporal data sources and types while learning universal spatio-temporal patterns across different cities and scenarios. This allows the model to unify both multi-data and multi-task learning, and effectively support a wide range of spatio-temporal applications. Its key innovation lies in the elaborated prompt learning framework, which adaptively generates both data-driven and task-specific prompts, guiding the model to deliver superior performance across various urban applications. UrbanDiT offers three primary advantages: 1) It unifies diverse data types, such as grid-based and graph-based data, into a sequential format, allowing it to capture spatio-temporal dynamics across diverse scenarios of different cities; 2) With masking strategies and task-specific prompts, it supports a wide range of tasks, including bi-directional spatio-temporal prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation; and 3) It generalizes effectively to open-world scenarios, with its powerful zero-shot capabilities outperforming nearly all baselines that are trained with data. These features allow UrbanDiT to achieve state-of-the-art performance in different domains such as transportation traffic, crowd flows, taxi demand, bike usage, and cellular traffic, across multiple cities and tasks. UrbanDiT sets up a new benchmark for foundation models in the urban spatio-temporal domain.

[AI-38] Reinforcement Learning with Action Sequence for Data-Efficient Robot Learning

链接: https://arxiv.org/abs/2411.12155
作者: Younggyo Seo,Pieter Abbeel
关键词-EN: tasks typically requires, typically requires, requires a large, large number, Training reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 17 Pages. Website: this https URL

点击查看摘要

Abstract:Training reinforcement learning (RL) agents on robotic tasks typically requires a large number of training samples. This is because training data often consists of noisy trajectories, whether from exploration or human-collected demonstrations, making it difficult to learn value functions that understand the effect of taking each action. On the other hand, recent behavior-cloning (BC) approaches have shown that predicting a sequence of actions enables policies to effectively approximate noisy, multi-modal distributions of expert demonstrations. Can we use a similar idea for improving RL on robotic tasks? In this paper, we introduce a novel RL algorithm that learns a critic network that outputs Q-values over a sequence of actions. By explicitly training the value functions to learn the consequence of executing a series of current and future actions, our algorithm allows for learning useful value functions from noisy trajectories. We study our algorithm across various setups with sparse and dense rewards, and with or without demonstrations, spanning mobile bi-manual manipulation, whole-body control, and tabletop manipulation tasks from BiGym, HumanoidBench, and RLBench. We find that, by learning the critic network with action sequences, our algorithm outperforms various RL and BC baselines, in particular on challenging humanoid control tasks.
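
The core idea, a critic that scores an entire action sequence rather than a single action, is easy to sketch. The network below is an illustrative stand-in, not the paper's architecture; `horizon` and all layer sizes are assumptions.

```python
# Sketch of a critic producing Q(s, a_t, ..., a_{t+H-1}) for a whole
# action sequence, as the abstract describes at a high level.
import torch
import torch.nn as nn

class ActionSequenceCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, horizon, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim * horizon, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                  # scalar Q-value
        )

    def forward(self, obs, action_seq):            # action_seq: (B, H, act_dim)
        flat = action_seq.flatten(start_dim=1)     # concatenate the sequence
        return self.net(torch.cat([obs, flat], dim=-1)).squeeze(-1)

critic = ActionSequenceCritic(obs_dim=24, act_dim=6, horizon=4)
q = critic(torch.randn(8, 24), torch.randn(8, 4, 6))
print(q.shape)  # torch.Size([8])
```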

[AI-39] HEIGHT: Heterogeneous Interaction Graph Transformer for Robot Navigation in Crowded and Constrained Environments

链接: https://arxiv.org/abs/2411.12150
作者: Shuijing Liu,Haochen Xia,Fatemeh Cheraghi Pouria,Kaiwen Hong,Neeloy Chakraborty,Katherine Driggs-Campbell
关键词-EN: corridors and furniture, study the problem, dense and interactive, interactive crowds, crowds with environmental
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of robot navigation in dense and interactive crowds with environmental constraints such as corridors and furniture. Previous methods fail to consider all types of interactions among agents and obstacles, leading to unsafe and inefficient robot paths. In this article, we leverage a graph-based representation of crowded and constrained scenarios and propose a structured framework to learn robot navigation policies with deep reinforcement learning. We first split the representations of different components in the environment and propose a heterogeneous spatio-temporal (st) graph to model distinct interactions among humans, robots, and obstacles. Based on the heterogeneous st-graph, we propose HEIGHT, a novel navigation policy network architecture with different components to capture heterogeneous interactions among entities through space and time. HEIGHT utilizes attention mechanisms to prioritize important interactions and a recurrent network to track changes in the dynamic scene over time, encouraging the robot to avoid collisions adaptively. Through extensive simulation and real-world experiments, we demonstrate that HEIGHT outperforms state-of-the-art baselines in terms of success and efficiency in challenging navigation scenarios. Furthermore, we demonstrate that our pipeline achieves better zero-shot generalization capability than previous works when the densities of humans and obstacles change. More videos are available at this https URL.

[AI-40] Visualizing Loss Functions as Topological Landscape Profiles

链接: https://arxiv.org/abs/2411.12136
作者: Caleb Geniesse,Jiaqing Chen,Tiankai Xie,Ge Shi,Yaoqing Yang,Dmitriy Morozov,Talita Perciano,Michael W. Mahoney,Ross Maciejewski,Gunther H. Weber
关键词-EN: loss function measures, loss, loss landscapes, predictions and ground-truth, function measures
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In machine learning, a loss function measures the difference between model predictions and ground-truth (or target) values. For neural network models, visualizing how this loss changes as model parameters are varied can provide insights into the local structure of the so-called loss landscape (e.g., smoothness) as well as global properties of the underlying model (e.g., generalization performance). While various methods for visualizing the loss landscape have been proposed, many approaches limit sampling to just one or two directions, ignoring potentially relevant information in this extremely high-dimensional space. This paper introduces a new representation based on topological data analysis that enables the visualization of higher-dimensional loss landscapes. After describing this new topological landscape profile representation, we show how the shape of loss landscapes can reveal new details about model performance and learning dynamics, highlighting several use cases, including image segmentation (e.g., UNet) and scientific machine learning (e.g., physics-informed neural networks). Through these examples, we provide new insights into how loss landscapes vary across distinct hyperparameter spaces: we find that the topology of the loss landscape is simpler for better-performing models; and we observe greater variation in the shape of loss landscapes near transitions from low to high model performance.

[AI-41] he Role of Accuracy and Validation Effectiveness in Conversational Business Analytics

链接: https://arxiv.org/abs/2411.12128
作者: Adem Alparslan
关键词-EN: conversational business analytics, traditional self-service analytics, examines conversational business, conversational business, business analytics
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:This study examines conversational business analytics, an approach that utilizes AI to address the technical competency gaps that hindered end users from effectively using traditional self-service analytics. By facilitating natural language interactions, conversational business analytics aims to enable end users to independently retrieve data and generate insights. The analysis focuses on Text-to-SQL as a representative technology for translating natural language requests into SQL statements. Using models grounded in expected utility theory, the study identifies conditions under which conversational business analytics, through partial or full support, can outperform delegation to human experts. The results indicate that partial support, which focuses solely on information generation by AI, is viable when the accuracy of AI-generated SQL queries exceeds a defined threshold. In contrast, full support includes not only information generation but also validation through explanations provided by the AI, and requires sufficiently high validation effectiveness to be reliable. However, user-based validation presents challenges, such as misjudgment and rejection of valid SQL queries, which may limit the effectiveness of conversational business analytics. These challenges underscore the need for robust validation mechanisms, including improved user support, automated processes, and methods for assessing quality independently of end users’ technical competencies.
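
The threshold argument can be made concrete with a toy expected-utility comparison between acting on AI-generated SQL ("partial support") and delegating to a human expert. All payoffs and costs below are invented for illustration; the paper derives its thresholds from its own utility model.

```python
# Toy expected-utility comparison; the numbers are made up, only the
# existence of an accuracy threshold is the point being illustrated.
def expected_utility_ai(p_correct, value=100.0, damage=-40.0, cost=1.0):
    """User acts on AI-generated SQL without validation."""
    return p_correct * value + (1 - p_correct) * damage - cost

def expected_utility_expert(value=100.0, cost=25.0):
    """Delegation: assume the expert is always right but costs more."""
    return value - cost

for p in (0.6, 0.8, 0.95):
    ai, expert = expected_utility_ai(p), expected_utility_expert()
    print(f"accuracy={p:.2f}: AI={ai:6.1f}  expert={expert:.1f}  "
          f"{'AI viable' if ai > expert else 'delegate'}")
```

With these made-up numbers the crossover sits between 80% and 95% accuracy, mirroring the paper's claim that partial support becomes viable only above a defined threshold.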

[AI-42] Distill the Best Ignore the Rest: Improving Dataset Distillation with Loss-Value-Based Pruning

链接: https://arxiv.org/abs/2411.12115
作者: Brian B. Moser,Federico Raue,Tobias C. Nauen,Stanislav Frolov,Andreas Dengel
关键词-EN: including non-beneficial samples, potentially including non-beneficial, existing approaches typically, approaches typically distill, gained significant interest
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dataset distillation has gained significant interest in recent years, yet existing approaches typically distill from the entire dataset, potentially including non-beneficial samples. We introduce a novel “Prune First, Distill After” framework that systematically prunes datasets via loss-based sampling prior to distillation. By leveraging pruning before classical distillation techniques and generative priors, we create a representative core-set that leads to enhanced generalization for unseen architectures - a significant challenge of current distillation methods. More specifically, our proposed framework significantly boosts distilled quality, achieving up to a 5.2 percentage-point accuracy increase even with substantial dataset pruning, i.e., removing 80% of the original dataset prior to distillation. Overall, our experimental results highlight the advantages of our easy-sample prioritization and cross-architecture robustness, paving the way for more effective and high-quality dataset distillation.
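
The "Prune First" step reduces to scoring every training sample by its loss under a pretrained model and keeping the easiest fraction. A minimal sketch, assuming a standard classification dataset and the 80% prune rate mentioned in the abstract:

```python
# Score samples by cross-entropy loss, keep the lowest-loss 20% as the
# pool for subsequent distillation. Model and keep-ratio are placeholders.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset

@torch.no_grad()
def prune_by_loss(model, dataset, keep_ratio=0.2, device="cpu"):
    model.eval().to(device)
    losses = []
    for x, y in DataLoader(dataset, batch_size=256):
        logits = model(x.to(device))
        losses.append(F.cross_entropy(logits, y.to(device), reduction="none").cpu())
    losses = torch.cat(losses)
    n_keep = int(len(dataset) * keep_ratio)
    keep_idx = losses.argsort()[:n_keep]          # easiest (lowest-loss) samples first
    return Subset(dataset, keep_idx.tolist())
```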

[AI-43] Just Leaf It: Accelerating Diffusion Classifiers with Hierarchical Class Pruning

链接: https://arxiv.org/abs/2411.12073
作者: Arundhati S. Shanbhag,Brian B. Moser,Tobias C. Nauen,Stanislav Frolov,Federico Raue,Andreas Dengel
关键词-EN: recently shown unexpected, shown unexpected potential, Bayes’ theorem, generative capabilities, Hierarchical Diffusion Classifier
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models, known for their generative capabilities, have recently shown unexpected potential in image classification tasks by using Bayes’ theorem. However, most diffusion classifiers require evaluating all class labels for a single classification, leading to significant computational costs that can hinder their application in large-scale scenarios. To address this, we present a Hierarchical Diffusion Classifier (HDC) that exploits the inherent hierarchical label structure of a dataset. By progressively pruning irrelevant high-level categories and refining predictions only within relevant subcategories, i.e., leaf nodes, HDC reduces the total number of class evaluations. As a result, HDC can accelerate inference by up to 60% while maintaining and, in some cases, improving classification accuracy. Our work enables a new control mechanism of the trade-off between speed and precision, making diffusion-based classification more viable for real-world applications, particularly in large-scale image classification tasks.
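
The pruning logic is straightforward to sketch generically: score coarse categories first, keep the best, and only evaluate leaf classes inside surviving branches. Here `score` stands in for the expensive diffusion-based class evaluation, and the two-level hierarchy with fake scores is a toy.

```python
# Generic hierarchical class pruning: evaluate coarse categories, keep
# the top-k, then refine among their leaves only.
def hierarchical_classify(hierarchy, score, top_k=1):
    coarse = sorted(hierarchy, key=score, reverse=True)[:top_k]   # prune branches
    leaves = [leaf for cat in coarse for leaf in hierarchy[cat]]
    return max(leaves, key=score)                                  # refine within survivors

hierarchy = {
    "animal": ["cat", "dog", "horse"],
    "vehicle": ["car", "truck", "bicycle"],
}
fake_scores = {"animal": 0.7, "vehicle": 0.3,
               "cat": 0.5, "dog": 0.9, "horse": 0.2,
               "car": 0.4, "truck": 0.1, "bicycle": 0.3}
print(hierarchical_classify(hierarchy, fake_scores.get))  # 'dog'; 3 of 6 leaves skipped
```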

[AI-44] Zoomed In Diffused Out: Towards Local Degradation-Aware Multi-Diffusion for Extreme Image Super-Resolution

链接: https://arxiv.org/abs/2411.12072
作者: Brian B. Moser,Stanislav Frolov,Tobias C. Nauen,Federico Raue,Andreas Dengel
关键词-EN: gained significant popularity, shown unexpected potential, gained significant, significant popularity, shown unexpected
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Large-scale, pre-trained Text-to-Image (T2I) diffusion models have gained significant popularity in image generation tasks and have shown unexpected potential in image Super-Resolution (SR). However, most existing T2I diffusion models are trained with a resolution limit of 512x512, making scaling beyond this resolution an unresolved but necessary challenge for image SR. In this work, we introduce a novel approach that, for the first time, enables these models to generate 2K, 4K, and even 8K images without any additional training. Our method leverages MultiDiffusion, which distributes the generation across multiple diffusion paths to ensure global coherence at larger scales, and local degradation-aware prompt extraction, which guides the T2I model to reconstruct fine local structures according to its low-resolution input. These innovations unlock higher resolutions, allowing T2I diffusion models to be applied to image SR tasks without limitation on resolution.

[AI-45] TSPRank: Bridging Pairwise and Listwise Methods with a Bilinear Travelling Salesman Model KDD2025

链接: https://arxiv.org/abs/2411.12064
作者: Weixian Waylon Li,Yftah Ziser,Yifei Xie,Shay B. Cohen,Tiejun Ma
关键词-EN: Travelling Salesman Problem, Salesman Problem Rank, Travelling Salesman, RankNet and LambdaMART, leading to sub-optimal
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: Accepted to ACM SIGKDD 2025 Research Track

点击查看摘要

Abstract:Traditional Learning-To-Rank (LETOR) approaches, including pairwise methods like RankNet and LambdaMART, often fall short by solely focusing on pairwise comparisons, leading to sub-optimal global rankings. Conversely, deep learning based listwise methods, while aiming to optimise entire lists, require complex tuning and yield only marginal improvements over robust pairwise models. To overcome these limitations, we introduce Travelling Salesman Problem Rank (TSPRank), a hybrid pairwise-listwise ranking method. TSPRank reframes the ranking problem as a Travelling Salesman Problem (TSP), a well-known combinatorial optimisation challenge that has been extensively studied for its numerous solution algorithms and applications. This approach enables the modelling of pairwise relationships and leverages combinatorial optimisation to determine the listwise ranking. TSPRank can be directly integrated as an additional component into embeddings generated by existing backbone models to enhance ranking performance. Our extensive experiments across three backbone models on diverse tasks, including stock ranking, information retrieval, and historical events ordering, demonstrate that TSPRank significantly outperforms both pure pairwise and listwise methods. Our qualitative analysis reveals that TSPRank’s main advantage over existing methods is its ability to harness global information better while ranking. TSPRank’s robustness and superior performance across different domains highlight its potential as a versatile and effective LETOR solution. The code and preprocessed data are available at this https URL.
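
The ranking-as-TSP reframing can be sketched with a pairwise score matrix and a tour heuristic: a tour visiting every item induces a ranking. The greedy nearest-neighbour below is only a stand-in for the solvers the paper actually studies, and the score matrix is random toy data.

```python
# Sketch of ranking-as-TSP: score[i][j] says how strongly item i should
# precede item j; a greedy tour through all items yields a ranking.
import numpy as np

def tsp_rank(score: np.ndarray) -> list[int]:
    n = len(score)
    start = int(score.sum(axis=1).argmax())        # item that dominates most others
    order, remaining = [start], set(range(n)) - {start}
    while remaining:
        nxt = max(remaining, key=lambda j: score[order[-1], j])
        order.append(nxt)
        remaining.remove(nxt)
    return order                                    # order[0] is ranked first

rng = np.random.default_rng(0)
pairwise = rng.random((5, 5))                       # toy pairwise precedence scores
print(tsp_rank(pairwise))
```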

[AI-46] Fingerprinting and Tracing Shadows: The Development and Impact of Browser Fingerprinting on Digital Privacy

链接: https://arxiv.org/abs/2411.12045
作者: Alexander Lawall
关键词-EN: tracking users online, methods like cookies, identifying and tracking, online without traditional, traditional methods
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: SECURWARE 2024, France, Nice

点击查看摘要

Abstract:Browser fingerprinting is a growing technique for identifying and tracking users online without traditional methods like cookies. This paper gives an overview by examining the various fingerprinting techniques and analyzes the entropy and uniqueness of the collected data. The analysis highlights that browser fingerprinting poses a complex challenge from both technical and privacy perspectives, as users often have no control over the collection and use of their data. In addition, it raises significant privacy concerns as users are often tracked without their knowledge or consent.

[AI-47] Fast Convergence of Softmax Policy Mirror Ascent

链接: https://arxiv.org/abs/2411.12042
作者: Reza Asad,Reza Babanezhad,Issam Laradji,Nicolas Le Roux,Sharan Vaswani
关键词-EN: Natural policy gradient, SPMA, Natural policy, common policy optimization, Natural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm, removing its need for a normalization across actions and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves a faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems. We prove that SPMA achieves linear convergence to the neighbourhood of the optimal value function. We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.

[AI-48] Scaling Deep Learning Research with Kubernetes on the NRP Nautilus HyperCluster

链接: https://arxiv.org/abs/2411.12038
作者: J. Alex Hurt,Anes Ouadou,Mariam Alshehri,Grant J. Scott
关键词-EN: scientific computing space, shown excellent performance, computing space, scientific computing, algorithms have shown
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Throughout the scientific computing space, deep learning algorithms have shown excellent performance in a wide range of applications. As these deep neural networks (DNNs) continue to mature, the necessary compute required to train them has continued to grow. Today, modern DNNs require millions of FLOPs and days to weeks of training to generate a well-trained model. The training times required for DNNs are oftentimes a bottleneck in DNN research for a variety of deep learning applications, and as such, accelerating and scaling DNN training enables more robust and accelerated research. To that end, in this work, we explore utilizing the NRP Nautilus HyperCluster to automate and scale deep learning model training for three separate applications of DNNs, including overhead object detection, burned area segmentation, and deforestation detection. In total, 234 deep neural models are trained on Nautilus, for a total time of 4,040 hours.

[AI-49] Regret-Free Reinforcement Learning for LTL Specifications

链接: https://arxiv.org/abs/2411.12019
作者: Rupak Majumdar,Mahmoud Salamati,Sadegh Soudjani
关键词-EN: Reinforcement learning, unknown dynamics, optimal control policies, control systems research, Reinforcement
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) is a promising method to learn optimal control policies for systems with unknown dynamics. In particular, synthesizing controllers for safety-critical systems based on high-level specifications, such as those expressed in temporal languages like linear temporal logic (LTL), presents a significant challenge in control systems research. Current RL-based methods designed for LTL tasks typically offer only asymptotic guarantees, which provide no insight into the transient performance during the learning phase. While running an RL algorithm, it is crucial to assess how close we are to achieving optimal behavior if we stop learning. In this paper, we present the first regret-free online algorithm for learning a controller that addresses the general class of LTL specifications over Markov decision processes (MDPs) with a finite set of states and actions. We begin by proposing a regret-free learning algorithm to solve infinite-horizon reach-avoid problems. For general LTL specifications, we show that the synthesis problem can be reduced to a reach-avoid problem when the graph structure is known. Additionally, we provide an algorithm for learning the graph structure, assuming knowledge of a minimum transition probability, which operates independently of the main regret-free algorithm.

[AI-50] Medical Video Generation for Disease Progression Simulation

链接: https://arxiv.org/abs/2411.11943
作者: Xu Cao,Kaizhao Liang,Kuei-Da Liao,Tianren Gao,Wenqian Ye,Jintai Chen,Zhiguang Ding,Jianguo Cao,James M. Rehg,Jimeng Sun
关键词-EN: disease progression, diagnosis and prognosis, disease, crucial for improving, improving the quality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Tech Report. The appendix will release soon. arXiv admin note: text overlap with arXiv:2309.11745

点击查看摘要

Abstract:Modeling disease progression is crucial for improving the quality and efficacy of clinical diagnosis and prognosis, but it is often hindered by a lack of longitudinal medical image monitoring for individual patients. To address this challenge, we propose the first Medical Video Generation (MVG) framework that enables controlled manipulation of disease-related image and video features, allowing precise, realistic, and personalized simulations of disease progression. Our approach begins by leveraging large language models (LLMs) to recaption prompts for disease trajectories. Next, a controllable multi-round diffusion model simulates the disease progression state for each patient, creating realistic intermediate disease state sequences. Finally, a diffusion-based video transition generation model interpolates disease progression between these states. We validate our framework across three medical imaging domains: chest X-ray, fundus photography, and skin image. Our results demonstrate that MVG significantly outperforms baseline models in generating coherent and clinically plausible disease trajectories. Two user studies by veteran physicians provide further validation and insights into the clinical utility of the generated sequences. MVG has the potential to assist healthcare providers in modeling disease trajectories, interpolating missing medical image data, and enhancing medical education through realistic, dynamic visualizations of disease progression.

[AI-51] Newclid: A User-Friendly Replacement for AlphaGeometry

链接: https://arxiv.org/abs/2411.11938
作者: Vladmir Sicca,Tianxiang Xia,Mathïs Fédérico,Philip John Gorinski,Simon Frieder,Shangling Jui
关键词-EN: DDAR symbolic solver, symbolic solver called, symbolic solver, AlphaGeometry DDAR symbolic, solver called DDARN
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
*备注: 51 pages

点击查看摘要

Abstract:We introduce a new symbolic solver for geometry, called Newclid, which is based on AlphaGeometry. Newclid contains a symbolic solver called DDARN (derived from DDAR-Newclid), which is a significant refactoring and upgrade of AlphaGeometry’s DDAR symbolic solver, making it more user-friendly - both for the end user and for a programmer wishing to extend the codebase. For the programmer, improvements include a modularized codebase and new debugging and visualization tools. For the user, Newclid contains a new command line interface (CLI) that provides interfaces for agents to guide DDARN. DDARN is flexible with respect to its internal reasoning, which can be steered by agents. In addition, we support input from GeoGebra to make Newclid accessible in educational contexts. The scope of problems that Newclid can solve has also been expanded to include an improved understanding of metric geometry concepts (length, angle) and the use of theorems such as the Pythagorean theorem in proofs. Bugs have been fixed, and reproducibility has been improved. Lastly, we re-evaluated the five remaining problems from the original AG-30 dataset that AlphaGeometry was not able to solve and contrasted them with the abilities of DDARN, running in breadth-first-search agentic mode (which corresponds to how DDARN runs by default), finding that DDARN solves an additional problem. We have open-sourced our code under: this https URL

[AI-52] Value Imprint: A Technique for Auditing the Human Values Embedded in RLHF Datasets

链接: https://arxiv.org/abs/2411.11937
作者: Ike Obi,Rohan Pant,Srishti Shekhar Agrawal,Maham Ghazanfar,Aaron Basiletti
关键词-EN: RLHF datasets, human, LLMs are increasingly, datasets, increasingly fine-tuned
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:LLMs are increasingly fine-tuned using RLHF datasets to align them with human preferences and values. However, very limited research has investigated which specific human values are operationalized through these datasets. In this paper, we introduce Value Imprint, a framework for auditing and classifying the human values embedded within RLHF datasets. To investigate the viability of this framework, we conducted three case study experiments by auditing the Anthropic/hh-rlhf, OpenAI WebGPT Comparisons, and Alpaca GPT-4-LLM datasets to examine the human values embedded within them. Our analysis involved a two-phase process. During the first phase, we developed a taxonomy of human values through an integrated review of prior works from philosophy, axiology, and ethics. Then, we applied this taxonomy to annotate 6,501 RLHF preferences. During the second phase, we employed the labels generated from the annotation as ground truth data for training a transformer-based machine learning model to audit and classify the three RLHF datasets. Through this approach, we discovered that information-utility values, including Wisdom/Knowledge and Information Seeking, were the most dominant human values within all three RLHF datasets. In contrast, prosocial and democratic values, including Well-being, Justice, and Human/Animal Rights, were the least represented human values. These findings have significant implications for developing language models that align with societal values and norms. We contribute our datasets to support further research in this area.

[AI-53] METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth

链接: https://arxiv.org/abs/2411.11933
作者: Jiawei Li,Chong Feng,Yang Gao
关键词-EN: evolution enables learning, Model evolution enables, update skills, evolution enables, enables learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Model evolution enables learning from feedback to refine experiences and update skills, transforming models from having no domain knowledge to becoming domain experts. However, there is currently no unified and effective method for guiding this evolutionary process. To address this gap, we propose the Meteor method, which includes three training phases: weak-to-strong data distillation, iterative training, and self-evolution strategies. Each phase maximizes the model’s inherent domain capabilities, allowing it to autonomously refine its domain knowledge and enhance performance. Experiments demonstrate that our approach significantly improves accuracy, completeness, relevance, coherence, and reliability across domain-specific tasks.

[AI-54] AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning

链接: https://arxiv.org/abs/2411.11930
作者: Kun Xiang,Zhili Liu,Zihao Jiang,Yunshuang Nie,Runhui Huang,Haoxiang Fan,Hanhui Li,Weiran Huang,Yihan Zeng,Jianhua Han,Lanqing Hong,Hang Xu,Xiaodan Liang
关键词-EN: large language models, multimodal large language, slow thinking, incorporating the ability, large language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Contrary to existing methods that rely on direct or fast thinking, our key idea is to construct long chains of thought (CoT) consisting of atomic actions in a step-by-step manner, guiding MLLMs to perform complex reasoning. To this end, we design a novel AtomThink framework composed of three key modules: (i) a CoT annotation engine that automatically generates high-quality CoT annotations to address the lack of high-quality visual mathematical data; (ii) an atomic step fine-tuning strategy that jointly optimizes an MLLM and a policy reward model (PRM) for step-wise reasoning; and (iii) four different search strategies that can be applied with the PRM to complete reasoning. Additionally, we propose AtomMATH, a large-scale multimodal dataset of long CoTs, and an atomic capability evaluation metric for mathematical tasks. Extensive experimental results show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving approximately 50% relative accuracy gains on MathVista and 120% on MathVerse. To support the advancement of multimodal slow-thinking models, we will make our code and dataset publicly available on this https URL.

[AI-55] On-Board Vision-Language Models for Personalized Autonomous Vehicle Motion Control: System Design and Real-World Validation

链接: https://arxiv.org/abs/2411.11913
作者: Can Cui,Zichong Yang,Yupeng Zhou,Juntong Peng,Sung-Yeon Park,Cong Zhang,Yunsheng Ma,Xu Cao,Wenqian Ye,Yiheng Feng,Jitesh Panchal,Lingxi Li,Yaobin Chen,Ziran Wang
关键词-EN: match individual users’, individual users’ preferences, comfort standards, strategies to match, safety and comfort
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Personalized driving refers to an autonomous vehicle’s ability to adapt its driving behavior or control strategies to match individual users’ preferences and driving styles while maintaining safety and comfort standards. However, existing works either fail to capture every individual preference precisely or become computationally inefficient as the user base expands. Vision-Language Models (VLMs) offer promising solutions to this front through their natural language understanding and scene reasoning capabilities. In this work, we propose a lightweight yet effective on-board VLM framework that provides low-latency personalized driving performance while maintaining strong reasoning capabilities. Our solution incorporates a Retrieval-Augmented Generation (RAG)-based memory module that enables continuous learning of individual driving preferences through human feedback. Through comprehensive real-world vehicle deployment and experiments, our system has demonstrated the ability to provide safe, comfortable, and personalized driving experiences across various scenarios and significantly reduce takeover rates by up to 76.9%. To the best of our knowledge, this work represents the first end-to-end VLM-based motion control system in real-world autonomous vehicles.
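
The RAG-based memory module can be pictured as a small embed-store-retrieve loop over natural-language feedback. The sketch below is generic, not the paper's system; the bag-of-characters `embed` is a deliberately crude placeholder so the example runs without any model, where a real system would use a sentence encoder.

```python
# Minimal RAG-style preference memory: store feedback as vectors,
# retrieve the closest entries by cosine similarity.
import numpy as np

class PreferenceMemory:
    def __init__(self, embed):
        self.embed, self.texts, self.vecs = embed, [], []

    def add(self, feedback: str):
        self.texts.append(feedback)
        self.vecs.append(self.embed(feedback))

    def retrieve(self, query: str, k: int = 2):
        q = self.embed(query)
        sims = [v @ q / (np.linalg.norm(v) * np.linalg.norm(q)) for v in self.vecs]
        top = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in top]

def embed(text: str) -> np.ndarray:
    """Toy deterministic embedding: a bag-of-characters histogram."""
    v = np.zeros(64)
    for ch in text.lower():
        v[ord(ch) % 64] += 1
    return v

mem = PreferenceMemory(embed)
mem.add("brake earlier when approaching intersections")
mem.add("keep a larger gap on highways")
print(mem.retrieve("how should I behave near an intersection?", k=1))
```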

[AI-56] F3OCUS – Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics

链接: https://arxiv.org/abs/2411.11912
作者: Pramit Saha,Felix Wagner,Divyanshu Mishra,Can Peng,Anshul Thakur,David Clifton,Konstantinos Kamnitsas,J. Alison Noble
关键词-EN: Federated Learning, Effective training, devices in Federated, resource-constrained client devices, Neural Tangent Kernels
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective training of large Vision-Language Models (VLMs) on resource-constrained client devices in Federated Learning (FL) requires the usage of parameter-efficient fine-tuning (PEFT) strategies. To this end, we demonstrate the impact of two factors, viz., a client-specific layer importance score that selects the most important VLM layers for fine-tuning and an inter-client layer diversity score that encourages diverse layer selection across clients, for optimal VLM layer selection. We first theoretically motivate and leverage the principal eigenvalue magnitude of layerwise Neural Tangent Kernels and show its effectiveness as a client-specific layer importance score. Next, we propose a novel layer updating strategy dubbed F^3OCUS that jointly optimizes the layer importance and diversity factors by employing a data-free, multi-objective, meta-heuristic optimization on the server. We explore 5 different meta-heuristic algorithms and compare their effectiveness for selecting model layers and adapter layers towards PEFT-FL. Furthermore, we release a new MedVQA-FL dataset involving overall 707,962 VQA triplets and 9 modality-specific clients and utilize it to train and evaluate our method. Overall, we conduct more than 10,000 client-level experiments on 6 Vision-Language FL task settings involving 58 medical image datasets and 4 different VLM architectures of varying sizes to demonstrate the effectiveness of the proposed method.

[AI-57] ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling

链接: https://arxiv.org/abs/2411.11911
作者: Zikang Zhou,Hengjian Zhou,Haibo Hu,Zihao Wen,Jianping Wang,Yung-Hui Li,Yu-Kai Huang
关键词-EN: safe autonomous driving, future events lays, autonomous driving, events lays, lays the foundation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Anticipating the multimodality of future events lays the foundation for safe autonomous driving. However, multimodal motion prediction for traffic agents has been clouded by the lack of multimodal ground truth. Existing works predominantly adopt the winner-take-all training strategy to tackle this challenge, yet still suffer from limited trajectory diversity and misaligned mode confidence. While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. Unlike the common practice of decoding multiple plausible trajectories in one shot, ModeSeq requires motion decoders to infer the next mode step by step, thereby more explicitly capturing the correlation between modes and significantly enhancing the ability to reason about multimodality. Leveraging the inductive bias of sequential mode prediction, we also propose the Early-Match-Take-All (EMTA) training strategy to diversify the trajectories further. Without relying on dense mode prediction or rule-based trajectory selection, ModeSeq considerably improves the diversity of multimodal output while attaining satisfactory trajectory accuracy, resulting in balanced performance on motion prediction benchmarks. Moreover, ModeSeq naturally emerges with the capability of mode extrapolation, which supports forecasting more behavior modes when the future is highly uncertain.

[AI-58] LLM4DS: Evaluating Large Language Models for Data Science Code Generation

链接: https://arxiv.org/abs/2411.11908
作者: Nathalia Nascimento,Everton Guimaraes,Sai Sanjna Chintakunta,Santhosh Anitha Boominathan
关键词-EN: Large Language Models, Large Language, offers substantial potential, adoption of Large, science offers substantial
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: 11 pages

点击查看摘要

Abstract:The adoption of Large Language Models (LLMs) for code generation in data science offers substantial potential for enhancing tasks such as data manipulation, statistical analysis, and visualization. However, the effectiveness of these models in the data science domain remains underexplored. This paper presents a controlled experiment that empirically assesses the performance of four leading LLM-based AI assistants-Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct)-on a diverse set of data science coding challenges sourced from the Stratascratch platform. Using the Goal-Question-Metric (GQM) approach, we evaluated each model’s effectiveness across task types (Analytical, Algorithm, Visualization) and varying difficulty levels. Our findings reveal that all models exceeded a 50% baseline success rate, confirming their capability beyond random chance. Notably, only ChatGPT and Claude achieved success rates significantly above a 60% baseline, though none of the models reached a 70% threshold, indicating limitations in meeting higher standards. ChatGPT demonstrated consistent performance across varying difficulty levels, while Claude’s success rate fluctuated with task complexity. Hypothesis testing indicates that task type does not significantly impact success rate overall. For analytical tasks, efficiency analysis shows no significant differences in execution times, though ChatGPT tended to be slower and less predictable despite high success rates. This study provides a structured, empirical evaluation of LLMs in data science, delivering insights that support informed model selection tailored to specific task demands. Our findings establish a framework for future AI assessments, emphasizing the value of rigorous evaluation beyond basic accuracy measures.
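
The statistical check behind claims like "above a 50% baseline but below 70%" is a one-sided binomial test, sketched below with invented counts (see the paper for the real numbers):

```python
# One-sided binomial test of a success rate against several baselines.
# The counts are illustrative, not the study's data.
from scipy.stats import binomtest

n_tasks, n_solved = 100, 64
for baseline in (0.5, 0.6, 0.7):
    res = binomtest(n_solved, n_tasks, p=baseline, alternative="greater")
    verdict = "above" if res.pvalue < 0.05 else "not above"
    print(f"baseline {baseline:.0%}: p={res.pvalue:.4f} -> {verdict}")
```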

[AI-59] ResLearn: Transformer-based Residual Learning for Metaverse Network Traffic Prediction

链接: https://arxiv.org/abs/2411.11894
作者: Yoga Suhas Kuruba Manjunath,Mathew Szymanowski,Austin Wissborn,Mushu Li,Lian Zhao,Xiao-Ping Zhang
关键词-EN: intelligent resource management, addressing the growing, eXtended Reality, comprehensive solution, solution for predicting
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Our work proposes a comprehensive solution for predicting Metaverse network traffic, addressing the growing demand for intelligent resource management in eXtended Reality (XR) services. We first introduce a state-of-the-art testbed capturing a real-world dataset of virtual reality (VR), augmented reality (AR), and mixed reality (MR) traffic, made openly available for further research. To enhance prediction accuracy, we then propose a novel view-frame (VF) algorithm that accurately identifies video frames from traffic while ensuring privacy compliance, and we develop a Transformer-based progressive error-learning algorithm, referred to as ResLearn for Metaverse traffic prediction. ResLearn significantly improves time-series predictions by using fully connected neural networks to reduce errors, particularly during peak traffic, outperforming prior work by 99%. Our contributions offer Internet service providers (ISPs) robust tools for real-time network management to satisfy Quality of Service (QoS) and enhance user experience in the Metaverse.
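
The progressive error-learning idea (train a second model on the first model's residuals and sum the two at prediction time) is sketched below on synthetic data. The paper uses a Transformer backbone with fully connected correctors; plain sklearn models are used here only to keep the illustration short.

```python
# Minimal residual-learning sketch: a corrector is fit to the base
# model's errors and added back at prediction time.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 8))
y = X @ rng.random(8) + np.sin(X[:, 0] * 6)        # linear trend + nonlinear bursts

base = Ridge().fit(X, y)                            # coarse predictor
residual = y - base.predict(X)                      # what the base model misses
corrector = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000,
                         random_state=0).fit(X, residual)

y_hat = base.predict(X) + corrector.predict(X)      # progressive error correction
print("base MAE     :", np.abs(y - base.predict(X)).mean().round(4))
print("corrected MAE:", np.abs(y - y_hat).mean().round(4))
```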

[AI-60] Green My LLM: Studying the key factors affecting the energy consumption of code assistants

链接: https://arxiv.org/abs/2411.11892
作者: Tristan Coignion,Clément Quinton,Romain Rouvoy
关键词-EN: Integrated Development Environments, recent years,Large Language, years,Large Language Models, developers’ Integrated Development, years,Large Language
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Submitted to JSS

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) have significantly improved in generating high-quality code, enabling their integration into developers’ Integrated Development Environments (IDEs) as code assistants. These assistants, such as GitHub Copilot, deliver real-time code suggestions and can greatly enhance developers’ productivity. However, the environmental impact of these tools, in particular their energy consumption, remains a key concern. This paper investigates the energy consumption of LLM-based code assistants by simulating developer interactions with GitHub Copilot and analyzing various configuration factors. We collected a dataset of development traces from 20 developers and conducted extensive software project development simulations to measure energy usage under different scenarios. Our findings reveal that the energy consumption and performance of code assistants are influenced by various factors, such as the number of concurrent developers, model size, quantization methods, and the use of streaming. Notably, a substantial portion of generation requests made by GitHub Copilot is either canceled or rejected by developers, indicating a potential area for reducing wasted computations. Based on these findings, we share actionable insights into optimizing configurations for different use cases, demonstrating that careful adjustments can lead to significant energy savings.

[AI-61] Survey on Semantic Interpretation of Tabular Data: Challenges and Directions

链接: https://arxiv.org/abs/2411.11891
作者: Marco Cremaschi,Blerina Spahiu,Matteo Palmonari,Ernesto Jimenez-Ruiz
关键词-EN: Semantic Table Interpretation, Tabular data plays, manipulation and exchange, plays a pivotal, pivotal role
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Tabular data plays a pivotal role in various fields, making it a popular format for data manipulation and exchange, particularly on the web. The interpretation, extraction, and processing of tabular information are invaluable for knowledge-intensive applications. Notably, significant efforts have been invested in annotating tabular data with ontologies and entities from background knowledge graphs, a process known as Semantic Table Interpretation (STI). STI automation aids in building knowledge graphs, enriching data, and enhancing web-based question answering. This survey aims to provide a comprehensive overview of the STI landscape. It starts by categorizing approaches using a taxonomy of 31 attributes, allowing for comparisons and evaluations. It also examines available tools, assessing them based on 12 criteria. Furthermore, the survey offers an in-depth analysis of the Gold Standards used for evaluating STI approaches. Finally, it provides practical guidance to help end-users choose the most suitable approach for their specific tasks while also discussing unresolved issues and suggesting potential future research directions.

[AI-62] Can EDA Tool Feedback Improve Verilog Generation by LLMs?

链接: https://arxiv.org/abs/2411.11856
作者: Jason Blocklove,Shailja Thakur,Benjamin Tan,Hammond Pearce,Siddharth Garg,Ramesh Karri
关键词-EN: hardware description language, Verilog hardware description, digital hardware designs, digital hardware, hardware description
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Traditionally, digital hardware designs are written in the Verilog hardware description language (HDL) and debugged manually by engineers. This can be time-consuming and error-prone for complex designs. Large Language Models (LLMs) are emerging as a potential tool to help generate fully functioning HDL code, but most works have focused on generation in the single-shot capacity: i.e., run and evaluate, a process that does not leverage debugging and as such does not adequately reflect a realistic development process. In this work we evaluate the ability of LLMs to leverage feedback from electronic design automation (EDA) tools to fix mistakes in their own generated Verilog. To accomplish this we present an open-source, highly customizable framework, AutoChip, which combines conversational LLMs with the output from Verilog compilers and simulations to iteratively generate and repair Verilog. To determine the success of these LLMs we leverage the VerilogEval benchmark set. We evaluate four state-of-the-art conversational LLMs, focusing on readily accessible commercial models. EDA tool feedback proved to be consistently more effective than zero-shot prompting only with GPT-4o, the most computationally complex model we evaluated. In the best case we observed a 5.8% increase in the number of successful designs with a 34.2% decrease in cost over the best zero-shot results. Mixing smaller models with this larger model at the end of the feedback iterations resulted in equally as much success as with GPT-4o using feedback, but for an additional 41.9% less cost (overall decrease in cost over zero-shot of 89.6%).
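
The generate-compile-repair loop that AutoChip automates can be sketched as follows. Here `query_llm` is a hypothetical placeholder for any chat-model client, and the compile step assumes Icarus Verilog (`iverilog`) is on the PATH; neither reflects AutoChip's actual interfaces.

```python
# Sketch of an iterative LLM + compiler feedback loop for Verilog.
import subprocess, tempfile, os

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # hypothetical stand-in

def compile_verilog(code: str) -> str:
    """Return compiler stderr ('' means the code compiled cleanly)."""
    with tempfile.NamedTemporaryFile("w", suffix=".v", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(["iverilog", "-o", os.devnull, path],
                              capture_output=True, text=True)
        return proc.stderr
    finally:
        os.unlink(path)

def generate_with_feedback(spec: str, max_rounds: int = 5) -> str:
    code = query_llm(f"Write Verilog for: {spec}")
    for _ in range(max_rounds):
        errors = compile_verilog(code)
        if not errors:
            return code                      # compiles; simulation would come next
        code = query_llm(f"Fix these compiler errors:\n{errors}\n\nCode:\n{code}")
    return code
```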

[AI-63] LUTMUL: Exceed Conventional FPGA Roofline Limit by LUT-based Efficient Multiplication for Neural Network Inference

链接: https://arxiv.org/abs/2411.11852
作者: Yanyue Xie,Zhengang Li,Dana Diaconu,Suranga Handagala,Miriam Leeser,Xue Lin
关键词-EN: digital signal processing, digital signal, signal processing, blocks have traditionally, cornerstone for handling
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by ASPDAC 2025

点击查看摘要

Abstract:For FPGA-based neural network accelerators, digital signal processing (DSP) blocks have traditionally been the cornerstone for handling multiplications. This paper introduces LUTMUL, which harnesses the potential of look-up tables (LUTs) for performing multiplications. The availability of LUTs typically outnumbers that of DSPs by a factor of 100, offering a significant computational advantage. By exploiting this advantage of LUTs, our method demonstrates a potential boost in the performance of FPGA-based neural network accelerators with a reconfigurable dataflow architecture. Our approach challenges the conventional peak performance on DSP-based accelerators and sets a new benchmark for efficient neural network inference on FPGAs. Experimental results demonstrate that our design achieves the best inference speed among all FPGA-based accelerators, achieving a throughput of 1627 images per second and maintaining a top-1 accuracy of 70.95% on the ImageNet dataset.

[AI-64] MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

链接: https://arxiv.org/abs/2411.11217
作者: Shiyi Cao,Shu Liu,Tyler Griggs,Peter Schafhalter,Xiaoxuan Liu,Ying Sheng,Joseph E. Gonzalez,Matei Zaharia,Ion Stoica
关键词-EN: Mixture of Experts, presents significant challenges, resource-constrained platforms presents, platforms presents significant, significant challenges
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efficient deployment of large language models, particularly Mixture of Experts (MoE), on resource-constrained platforms presents significant challenges, especially in terms of computational efficiency and memory utilization. The MoE architecture, renowned for its ability to increase model capacity without a proportional increase in inference cost, greatly reduces the token generation latency compared with dense models. However, the large model size makes MoE models inaccessible to individuals without high-end GPUs. In this paper, we propose MoE-Lightning, a high-throughput MoE batch inference system that significantly outperforms past work. MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization, and a performance model, HRM, based on a Hierarchical Roofline Model we introduce to help find policies with higher throughput than existing systems. MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB). When the theoretical system throughput is bounded by the GPU memory, MoE-Lightning can reach the throughput upper bound with 2-3x less CPU memory, significantly increasing resource utilization. MoE-Lightning also supports efficient batch inference for much larger MoEs (e.g., Mixtral 8x22B and DBRX) on multiple low-cost GPUs (e.g., 2-4 T4).
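
The Hierarchical Roofline idea (throughput is capped by the slowest of compute, GPU memory bandwidth, and CPU-GPU I/O at a given operational intensity) can be sketched as a back-of-the-envelope calculation. All numbers below are illustrative ballparks, not measurements from the paper, and using one intensity for every level is a simplification.

```python
# Toy hierarchical roofline: attainable TFLOP/s is the minimum over
# compute, GPU-memory, and CPU-GPU I/O bounds.
def attainable_tflops(intensity_flop_per_byte: float,
                      peak_tflops: float = 65.0,       # rough T4 FP16 peak
                      hbm_gb_s: float = 300.0,         # GPU memory bandwidth
                      pcie_gb_s: float = 16.0) -> dict:
    bounds = {
        "compute": peak_tflops,
        "gpu_mem": hbm_gb_s * intensity_flop_per_byte / 1e3,     # GB/s * FLOP/B -> TFLOP/s
        "cpu_gpu_io": pcie_gb_s * intensity_flop_per_byte / 1e3,
    }
    bounds["attainable"] = min(bounds.values())
    return bounds

for oi in (1, 64, 4096):   # FLOPs per byte moved across each link
    print(oi, attainable_tflops(oi))
```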

[AI-65] AI Guided Early Screening of Cervical Cancer

链接: https://arxiv.org/abs/2411.12681
作者: Dharanidharan S I,Suhitha Renuka S V,Ajishi Singh,Sheena Christabel Pravin
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-66] Estimating Dark Matter Halo Masses in Simulated Galaxy Clusters with Graph Neural Networks NEURIPS

链接: https://arxiv.org/abs/2411.12629
作者: Nikhil Garuda,John F. Wu,Dylan Nelson,Annalisa Pillepich
关键词-EN: Galaxies grow, grow and evolve, dark matter, dark matter halos, Galaxies
类目: Astrophysics of Galaxies (astro-ph.GA); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures, accepted at the NeurIPS ML4PS 2024 workshop

点击查看摘要

Abstract:Galaxies grow and evolve in dark matter halos. Because dark matter is not visible, galaxies’ halo masses ($M_{\rm halo}$) must be inferred indirectly. We present a graph neural network (GNN) model for predicting $M_{\rm halo}$ from stellar mass ($M_*$) in simulated galaxy clusters using data from the IllustrisTNG simulation suite. Unlike traditional machine learning models like random forests, our GNN captures the information-rich substructure of galaxy clusters by using spatial and kinematic relationships between neighbouring galaxies. A GNN model trained on the TNG-Cluster dataset and independently tested on the TNG300 simulation achieves superior predictive performance compared to other baseline models we tested. Future work will extend this approach to different simulations and real observational datasets to further validate the GNN model’s ability to generalise.
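
A graph-regression setup of this kind is easy to sketch in PyTorch Geometric: galaxies are nodes, phase-space neighbours are edges, and the network regresses log halo mass per node. The feature choices, sizes, and the random toy graph below are assumptions, not the paper's configuration.

```python
# Small GNN regressor: per-galaxy log halo mass from stellar mass and
# phase-space features over neighbour links.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

class HaloGNN(torch.nn.Module):
    def __init__(self, in_dim=7, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, 1)    # log10 M_halo per galaxy

    def forward(self, data):
        h = torch.relu(self.conv1(data.x, data.edge_index))
        h = torch.relu(self.conv2(h, data.edge_index))
        return self.head(h).squeeze(-1)

# Toy cluster: 10 galaxies, features = [log M*, x, y, z, vx, vy, vz].
x = torch.randn(10, 7)
edge_index = torch.randint(0, 10, (2, 40))        # e.g. k-nearest-neighbour links
graph = Data(x=x, edge_index=edge_index)
print(HaloGNN()(graph).shape)  # torch.Size([10])
```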

[AI-67] AI Flow at the Network Edge

Link: https://arxiv.org/abs/2411.12469
Authors: Jiawei Shao,Xuelong Li
Keywords: demonstrating impressive capabilities, Recent advancements, large language models, demonstrating impressive, unprecedented potential
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Recent advancements in large language models (LLMs) and their multimodal variants have led to remarkable progress across various domains, demonstrating impressive capabilities and unprecedented potential. In the era of ubiquitous connectivity, leveraging communication networks to distribute intelligence is a transformative concept, envisioning AI-powered services accessible at the network edge. However, pushing large models from the cloud to resource-constrained environments faces critical challenges. Model inference on low-end devices leads to excessive latency and performance bottlenecks, while raw data transmission over limited bandwidth networks causes high communication overhead. This article presents AI Flow, a framework that streamlines the inference process by jointly leveraging the heterogeneous resources available across devices, edge nodes, and cloud servers, making intelligence flow across networks. To facilitate cooperation among multiple computational nodes, the proposed framework explores a paradigm shift in the design of communication network systems from transmitting information flow to intelligence flow, where the goal of communications is task-oriented and folded into the inference process. Experimental results demonstrate the effectiveness of the proposed framework through an image captioning use case, showcasing the ability to reduce response latency while maintaining high-quality captions. This article serves as a position paper for identifying the motivation, challenges, and principles of AI Flow.

[AI-68] Testability of Instrumental Variables in Additive Nonlinear Non-Constant Effects Models

Link: https://arxiv.org/abs/2411.12184
Authors: Xichen Guo,Zheng Li,Biwei Huang,Yan Zeng,Zhi Geng,Feng Xie
Keywords: instrumental variables derived, AIT condition, observational data, AIT, address the issue
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We address the issue of the testability of instrumental variables derived from observational data. Most existing testable implications are centered on scenarios where the treatment is a discrete variable, e.g., instrumental inequality (Pearl, 1995), or where the effect is assumed to be constant, e.g., instrumental variables condition based on the principle of independent mechanisms (Burauel, 2023). However, treatments can often be continuous variables, such as drug dosages or nutritional content levels, and non-constant effects may occur in many real-world scenarios. In this paper, we consider an additive nonlinear, non-constant effects model with unmeasured confounders, in which treatments can be either discrete or continuous, and propose an Auxiliary-based Independence Test (AIT) condition to test whether a variable is a valid instrument. We first show that if the candidate instrument is valid, then the AIT condition holds. Moreover, we illustrate the implications of the AIT condition and demonstrate that, under certain conditions, the AIT condition is necessary and sufficient to detect all invalid IVs. We also extend the AIT condition to include covariates and introduce a practical testing algorithm. Experimental results on both synthetic and three different real-world datasets show the effectiveness of our proposed condition.
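
The AIT construction is not given in the abstract; the sketch below only illustrates the simpler linear intuition such tests generalize: residuals from a two-stage least squares (2SLS) fit should be uncorrelated with a valid instrument. It is not the paper's test, which handles nonlinear, non-constant effects.

```python
# Toy linear IV model with an unmeasured confounder u: for a valid
# instrument z, the 2SLS residual should be (near-)uncorrelated with z.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)                        # candidate instrument
u = rng.standard_normal(n)                        # unmeasured confounder
x = 0.8 * z + u + 0.3 * rng.standard_normal(n)    # treatment
y = 1.5 * x + u + 0.3 * rng.standard_normal(n)    # outcome

# 2SLS: regress x on z, then y on the fitted x.
x_hat = z * (z @ x) / (z @ z)
beta = (x_hat @ y) / (x_hat @ x_hat)
resid = y - beta * x

corr = np.corrcoef(z, resid)[0, 1]
print(beta, corr)   # beta near 1.5; correlation near 0 for a valid z
```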

[AI-69] Enhancing Low Dose Computed Tomography Images Using Consistency Training Techniques

Link: https://arxiv.org/abs/2411.12181
Authors: Mahmut S. Gokmen,Jie Zhang,Ge Wang,Jin Chen,Cody Bumgardner
Keywords: Noise Improved Consistency, Improved Consistency Training, High Noise Improved, Diffusion models, inpainting and restoration
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models have had a significant impact on a wide range of generative tasks, especially image inpainting and restoration. Despite improvements aimed at decreasing the number of function evaluations (NFE), iterative sampling remains computationally expensive. Consistency models, a new family of generative models, enable single-step sampling of high-quality data without the need for adversarial training. In this paper, we introduce the beta noise distribution, which provides flexibility in adjusting noise levels. This is combined with a sinusoidal curriculum that enhances the learning of the trajectory between the noise distribution and the posterior distribution of interest, allowing High Noise Improved Consistency Training (HN-iCT) to be trained in a supervised fashion. Additionally, the High Noise Improved Consistency Training with Image Condition (HN-iCT-CN) architecture is introduced, which takes low-dose images as a condition for extracting significant features by Weighted Attention Gates (WAG). Our results indicate that unconditional image generation using HN-iCT significantly outperforms basic CT and iCT training techniques with NFE=1 on the CIFAR10 and CelebA datasets. Moreover, our image-conditioned model demonstrates exceptional performance in enhancing low-dose (LD) CT scans.
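
A rough sketch of pairing a beta noise distribution with a sinusoidal curriculum; the exact parameterization is not in the abstract, so the schedule of shape parameters and the sigma range below are assumptions:

```python
import numpy as np

def beta_noise_levels(step, total_steps, n, sigma_min=0.002, sigma_max=80.0,
                      rng=np.random.default_rng(0)):
    # Sinusoidal curriculum: phase rises smoothly from 0 to 1 over training,
    # shifting the beta distribution's mass from low toward high noise
    # (an assumed direction; the paper defines its own schedule).
    phase = 0.5 * (1 + np.sin(np.pi * (step / total_steps - 0.5)))
    a, b = 1.0 + phase, 2.0 - phase
    u = rng.beta(a, b, size=n)          # samples in (0, 1)
    return sigma_min + u * (sigma_max - sigma_min)

print(beta_noise_levels(step=0, total_steps=1000, n=5))    # mostly low sigma
print(beta_noise_levels(step=990, total_steps=1000, n=5))  # mostly high sigma
```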

[AI-70] Variable Rate Neural Compression for Sparse Detector Data

Link: https://arxiv.org/abs/2411.11942
Authors: Yi Huang,Yeonju Go,Jin Huang,Shuhang Li,Xihaier Luo,Thomas Marshall,Joseph Osborn,Christopher Pinkenburg,Yihui Ren,Evgeny Shulga,Shinjae Yoo,Byung-Jun Yoon
Keywords: High-energy large-scale particle, High-energy large-scale, extraordinary rates, Time Projection Chamber, Relativistic Heavy Ion
Subjects: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex)
Comments: 37 pages, 12 figures, submitted to Journal of Computational Physics

Abstract:High-energy large-scale particle colliders generate data at extraordinary rates. Developing real-time high-throughput data compression algorithms to reduce data volume and meet the bandwidth requirement for storage has become increasingly critical. Deep learning is a promising technology that can address this challenging topic. At the newly constructed sPHENIX experiment at the Relativistic Heavy Ion Collider, a Time Projection Chamber (TPC) serves as the main tracking detector, which records three-dimensional particle trajectories in a volume of a gas-filled cylinder. In terms of occupancy, the resulting data flow can be very sparse, reaching $10^{-3}$ for proton-proton collisions. Such sparsity presents a challenge to conventional learning-free lossy compression algorithms, such as SZ, ZFP, and MGARD. In contrast, emerging deep learning-based models, particularly those utilizing convolutional neural networks for compression, have outperformed these conventional methods in terms of compression ratios and reconstruction accuracy. However, research on the efficacy of these deep learning models in handling sparse datasets, like those produced in particle colliders, remains limited. Furthermore, most deep learning models do not adapt their processing speeds to data sparsity, which affects efficiency. To address this issue, we propose a novel approach for TPC data compression via key-point identification facilitated by sparse convolution. Our proposed algorithm, BCAE-VS, achieves a 75% improvement in reconstruction accuracy with a 10% increase in compression ratio over the previous state-of-the-art model. Additionally, BCAE-VS manages to achieve these results with a model size over two orders of magnitude smaller. Lastly, we have experimentally verified that as sparsity increases, so does the model’s throughput.

[AI-71] CSP-Net: Common Spatial Pattern Empowered Neural Networks for EEG-Based Motor Imagery Classification

Link: https://arxiv.org/abs/2411.11879
Authors: Xue Jiang,Lubin Meng,Xinru Chen,Yifan Xu,Dongrui Wu
Keywords: Electroencephalogram-based motor imagery, Electroencephalogram-based motor, motor imagery, CSP, important paradigm
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Electroencephalogram-based motor imagery (MI) classification is an important paradigm of non-invasive brain-computer interfaces. Common spatial pattern (CSP), which exploits different energy distributions on the scalp while performing different MI tasks, is very popular in MI classification. Convolutional neural networks (CNNs) have also achieved great success, due to their powerful learning capabilities. This paper proposes two CSP-empowered neural networks (CSP-Nets), which integrate knowledge-driven CSP filters with data-driven CNNs to enhance the performance in MI classification. CSP-Net-1 directly adds a CSP layer before a CNN to improve the input discriminability. CSP-Net-2 replaces a convolutional layer in CNN with a CSP layer. The CSP layer parameters in both CSP-Nets are initialized with CSP filters designed from the training data. During training, they can either be kept fixed or optimized using gradient descent. Experiments on four public MI datasets demonstrated that the two CSP-Nets consistently improved over their CNN backbones, in both within-subject and cross-subject classifications. They are particularly useful when the number of training samples is very small. Our work demonstrates the advantage of integrating knowledge-driven traditional machine learning with data-driven deep learning in EEG-based brain-computer interfaces.
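
For reference, the knowledge-driven half of CSP-Nets, classic CSP filters, can be computed from the two classes' covariance matrices via a generalized eigendecomposition. A minimal sketch with assumed shapes and filter count (the paper initializes its CSP layer from filters like these):

```python
# Classic two-class CSP: eigenvectors with extreme generalized eigenvalues
# maximize variance for one class relative to the other.
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_filters=6):
    """trials_*: arrays of shape (n_trials, n_channels, n_samples)."""
    def mean_cov(trials):
        covs = [t @ t.T / np.trace(t @ t.T) for t in trials]
        return np.mean(covs, axis=0)

    Ca, Cb = mean_cov(trials_a), mean_cov(trials_b)
    # Generalized eigenproblem Ca w = lambda (Ca + Cb) w.
    vals, vecs = eigh(Ca, Ca + Cb)
    order = np.argsort(vals)
    picks = np.r_[order[:n_filters // 2], order[-n_filters // 2:]]
    return vecs[:, picks].T  # (n_filters, n_channels) spatial filters

rng = np.random.default_rng(0)
W = csp_filters(rng.standard_normal((20, 8, 256)),   # stand-in EEG trials
                rng.standard_normal((20, 8, 256)))
print(W.shape)  # (6, 8)
```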

[AI-72] A Multi-Modal Unsupervised Machine Learning Approach for Biomedical Signal Processing in CPR

Link: https://arxiv.org/abs/2411.11869
Authors: Saidul Islam,Jamal Bentahar,Robin Cohen,Gaith Rjoub
Keywords: life-saving intervention aimed, restoring blood circulation, individuals experiencing cardiac, experiencing cardiac arrest, Cardiopulmonary resuscitation
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Cardiopulmonary resuscitation (CPR) is a critical, life-saving intervention aimed at restoring blood circulation and breathing in individuals experiencing cardiac arrest or respiratory failure. Accurate and real-time analysis of biomedical signals during CPR is essential for monitoring and decision-making, from the pre-hospital stage to the intensive care unit (ICU). However, CPR signals are often corrupted by noise and artifacts, making precise interpretation challenging. Traditional denoising methods, such as filters, struggle to adapt to the varying and complex noise patterns present in CPR signals. Given the high-stakes nature of CPR, where rapid and accurate responses can determine survival, there is a pressing need for more robust and adaptive denoising techniques. In this context, an unsupervised machine learning (ML) methodology is particularly valuable, as it removes the dependence on labeled data, which can be scarce or impractical in emergency scenarios. This paper introduces a novel unsupervised ML approach for denoising CPR signals using a multi-modality framework, which leverages multiple signal sources to enhance the denoising process. The proposed approach not only improves noise reduction and signal fidelity but also preserves critical inter-signal correlations (0.9993) which is crucial for downstream tasks. Furthermore, it outperforms existing methods in an unsupervised context in terms of signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR), making it highly effective for real-time applications. The integration of multi-modality further enhances the system’s adaptability to various biomedical signals beyond CPR, improving both automated CPR systems and clinical decision-making.

[AI-73] Machine Learning Assisted Postural Movement Recognition using Photoplethysmography(PPG)

Link: https://arxiv.org/abs/2411.11862
Authors: Robbie Maccay,Roshan Weerasekera
Keywords: fall prevention technologies, care home admissions, fall detection, fall prevention, home admissions
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:With the growing percentage of elderly people and care home admissions, there is an urgent need for the development of fall detection and fall prevention technologies. This work presents, for the first time, the use of machine learning techniques to recognize postural movements exclusively from Photoplethysmography (PPG) data. To achieve this goal, a device was developed for reading the PPG signal, segmenting the PPG signals into individual pulses, extracting pulse morphology and homeostatic characteristic features, and evaluating different ML algorithms. Investigations into different postural movements (stationary, sitting to standing, and lying to standing) were performed with 11 participants. The results of these investigations provided insight into how homeostasis differs in the PPG signal after each movement. Various machine learning approaches were used for classification, and the Artificial Neural Network (ANN) was found to be the best classifier, achieving a testing accuracy of 85.2% and an F1 score of 78% in experimental results.

Computer Vision

[CV-0] IoT-Based 3D Pose Estimation and Motion Optimization for Athletes: Application of C3D and OpenPose

Link: https://arxiv.org/abs/2411.12676
Authors: Fei Ren,Chao Ren,Tianyi Lyu
Keywords: Pose Optimization Network, IoT-Enhanced Pose Optimization, IoT-Enhanced Pose, pose estimation, Optimization Network
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages

Abstract:This study proposes the IoT-Enhanced Pose Optimization Network (IE-PONet) for high-precision 3D pose estimation and motion optimization of track and field athletes. IE-PONet integrates C3D for spatiotemporal feature extraction, OpenPose for real-time keypoint detection, and Bayesian optimization for hyperparameter tuning. Experimental results on the NTURGB+D and FineGYM datasets demonstrate superior performance, with AP^p50 scores of 90.5 and 91.0, and mAP scores of 74.3 and 74.0, respectively. Ablation studies confirm the essential roles of each module in enhancing model accuracy. IE-PONet provides a robust tool for athletic performance analysis and optimization, offering precise technical insights for training and injury prevention. Future work will focus on further model optimization, multimodal data integration, and developing real-time feedback mechanisms to enhance practical applications.

[CV-1] Machine Learning Approaches on Crop Pattern Recognition a Comparative Analysis

Link: https://arxiv.org/abs/2411.12667
Authors: Kazi Hasibul Kabir,Md. Zahiruddin Aqib,Sharmin Sultana,Shamim Akhter
Keywords: ensure food security, Monitoring agricultural activities, food security, important to ensure, ensure food
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in ICNTET2018: International Conference on New Trends in Engineering Technology Tirupathi Highway, Tiruvallur Dist Chennai, India, September 7-8, 2018

Abstract:Monitoring agricultural activities is important to ensure food security. Remote sensing plays a significant role in large-scale continuous monitoring of cultivation activities. Time-series remote sensing data were used to generate the cropping pattern. Classification algorithms are used to classify crop patterns and map agricultural land use. Some conventional classification methods, including support vector machines (SVM) and decision trees, have been applied for crop pattern recognition. However, in this paper, we propose a Deep Neural Network (DNN) based classification to improve the performance of crop pattern recognition, and make a comparative analysis with two other machine learning approaches, Naive Bayes and Random Forest.
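
A minimal sketch of such a comparison with scikit-learn, using synthetic stand-in features and an MLP in place of the paper's DNN; the real pipeline operates on time-series remote sensing features:

```python
# Compare Naive Bayes, Random Forest, and an MLP (DNN stand-in) on
# synthetic per-pixel feature vectors; all data here is fabricated.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=24, n_classes=4,
                           n_informative=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "DNN (MLP)": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                               random_state=0),
}
for name, model in models.items():
    print(name, model.fit(Xtr, ytr).score(Xte, yte))
```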

[CV-2] M3D: Dual-Stream Selective State Spaces and Depth-Driven Framework for High-Fidelity Single-View 3D Reconstruction CVPR2025

Link: https://arxiv.org/abs/2411.12635
Authors: Luoxi Zhang,Pragyan Shrestha,Yu Zhou,Chun Xie,Itaru Kitahara
Keywords: single RGB image, single RGB, RGB image, autonomous driving, complex scenes presents
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, submitted to CVPR 2025 for review

Abstract:The precise reconstruction of 3D objects from a single RGB image in complex scenes presents a critical challenge in virtual reality, autonomous driving, and robotics. Existing neural implicit 3D representation methods face significant difficulties in balancing the extraction of global and local features, particularly in diverse and complex environments, leading to insufficient reconstruction precision and quality. We propose M3D, a novel single-view 3D reconstruction framework, to tackle these challenges. This framework adopts a dual-stream feature extraction strategy based on Selective State Spaces to effectively balance the extraction of global and local features, thereby improving scene comprehension and representation precision. Additionally, a parallel branch extracts depth information, effectively integrating visual and geometric features to enhance reconstruction quality and preserve intricate details. Experimental results indicate that the fusion of multi-scale features with depth information via the dual-branch feature extraction significantly boosts geometric consistency and fidelity, achieving state-of-the-art reconstruction performance.

[CV-3] Maps from Motion (MfM): Generating 2D Semantic Maps from Sparse Multi-view Images

Link: https://arxiv.org/abs/2411.12620
Authors: Matteo Toso,Stefano Fiorini,Stuart James,Alessio Del Bue
Keywords: enormous collective efforts, require enormous collective, maps require enormous, World-wide detailed, collective efforts
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:World-wide detailed 2D maps require enormous collective efforts. OpenStreetMap is the result of 11 million registered users manually annotating the GPS location of over 1.75 billion entries, including distinctive landmarks and common urban objects. At the same time, manual annotations can include errors and are slow to update, limiting the map’s accuracy. Maps from Motion (MfM) is a step toward automating this time-consuming map-making procedure by computing 2D maps of semantic objects directly from a collection of uncalibrated multi-view images. From each image, we extract a set of object detections, and estimate their spatial arrangement in a top-down local map centered in the reference frame of the camera that captured the image. Aligning these local maps is not a trivial problem, since they provide incomplete, noisy fragments of the scene, and matching detections across them is unreliable because of the presence of repeated patterns and the limited appearance variability of urban objects. We address this with a novel graph-based framework that encodes the spatial and semantic distribution of the objects detected in each image, and learns how to combine them to predict the objects’ poses in a global reference system, while taking into account all possible detection matches and preserving the topology observed in each image. Despite the complexity of the problem, our best model achieves global 2D registration with an average accuracy within 4 meters (i.e., below GPS accuracy) even on sparse sequences with strong viewpoint change, on which COLMAP has an 80% failure rate. We provide extensive evaluation on synthetic and real-world data, showing how the method obtains a solution even in scenarios where standard optimization techniques fail.

[CV-4] A Multimodal Approach Combining Structural and Cross-domain Textual Guidance for Weakly Supervised OCT Segmentation

Link: https://arxiv.org/abs/2411.12615
Authors: Jiaqi Yang,Nitish Mehta,Xiaoling Hu,Chao Chen,Chia-Ling Tsai
Keywords: Optical Coherence Tomography, Coherence Tomography, Optical Coherence, monitoring retinal diseases, Accurate segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 9 figures, 8 tables

Abstract:Accurate segmentation of Optical Coherence Tomography (OCT) images is crucial for diagnosing and monitoring retinal diseases. However, the labor-intensive nature of pixel-level annotation limits the scalability of supervised learning with large datasets. Weakly Supervised Semantic Segmentation (WSSS) provides a promising alternative by leveraging image-level labels. In this study, we propose a novel WSSS approach that integrates structural guidance with text-driven strategies to generate high-quality pseudo labels, significantly improving segmentation performance. In terms of visual information, our method employs two processing modules that exchange raw image features and structural features from OCT images, guiding the model to identify where lesions are likely to occur. In terms of textual information, we utilize large-scale pretrained models from cross-domain sources to implement label-informed textual guidance and synthetic descriptive integration with two textual processing modules that combine local semantic features with consistent synthetic descriptions. By fusing these visual and textual components within a multimodal framework, our approach enhances lesion localization accuracy. Experimental results on three OCT datasets demonstrate that our method achieves state-of-the-art performance, highlighting its potential to improve diagnostic accuracy and efficiency in medical imaging.

[CV-5] SG-LRA: Self-Generating Automatic Scoliosis Cobb Angle Measurement with Low-Rank Approximation

Link: https://arxiv.org/abs/2411.12604
Authors: Zhiwen Shao,Yichen Yuan,Lizhuang Ma,Dit-Yan Yeung,Xiaojia Zhu
Keywords: Cobb angle measurement, Automatic Cobb angle, Cobb angle, screening and diagnosis, Automatic Cobb
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automatic Cobb angle measurement from X-ray images is crucial for scoliosis screening and diagnosis. However, most existing regression-based methods and segmentation-based methods struggle with inaccurate spine representations or mask connectivity/fragmentation issues. Besides, landmark-based methods suffer from insufficient training data and annotations. To address these challenges, we propose a novel framework including Self-Generation pipeline and Low-Rank Approximation representation (SG-LRA) for automatic Cobb angle measurement. Specifically, we propose a parameterized spine contour representation based on LRA, which enables eigen-spine decomposition and spine contour reconstruction. We can directly obtain spine contour with only regressed LRA coefficients, which form a more accurate spine representation than rectangular boxes. Also, we combine LRA coefficient regression with anchor box classification to solve inaccurate predictions and mask connectivity issues. Moreover, we develop a data engine with automatic annotation and automatic selection in an iterative manner, which is trained on a private Spinal2023 dataset. With our data engine, we generate the largest scoliosis X-ray dataset named Spinal-AI2024 largely without privacy leaks. Extensive experiments on public AASCE2019, private Spinal2023, and generated Spinal-AI2024 datasets demonstrate that our method achieves state-of-the-art Cobb angle measurement performance. Our code and Spinal-AI2024 dataset are available at this https URL and this https URL, respectively.
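
The LRA idea, representing a spine contour by a few coefficients on an eigen-spine basis, can be sketched with a truncated SVD; the contour length, rank, and random stand-in data below are assumptions:

```python
# Truncated-SVD basis ("eigen-spines") over stacked training contours;
# any contour is then encoded by a handful of LRA coefficients.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_points, rank = 200, 68, 8
contours = rng.standard_normal((n_train, 2 * n_points))  # stand-in (x, y) contours

mean = contours.mean(axis=0)
_, _, Vt = np.linalg.svd(contours - mean, full_matrices=False)
basis = Vt[:rank]                      # (rank, 2*n_points) eigen-spine basis

def encode(contour):
    return (contour - mean) @ basis.T  # `rank` LRA coefficients

def decode(coeffs):
    return mean + coeffs @ basis       # reconstructed spine contour

c = encode(contours[0])
print(c.shape, np.linalg.norm(decode(c) - contours[0]))  # reconstruction error
```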

[CV-6] SAM Carries the Burden: A Semi-Supervised Approach Refining Pseudo Labels for Medical Segmentation MICCAI

Link: https://arxiv.org/abs/2411.12602
Authors: Ron Keuth,Lasse Hansen,Maren Balks,Ronja Jäger,Anne-Nele Schröder,Ludger Tüshaus,Mattias Heinrich
Keywords: Semantic segmentation, annotated training data, Semantic, crucial task, Segment Anything Model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at MICCAI Workshop on Advancing Data Solutions in Medical Imaging AI 2024; Code and data: this https URL

Abstract:Semantic segmentation is a crucial task in medical imaging. Although supervised learning techniques have proven to be effective in performing this task, they heavily depend on large amounts of annotated training data. The recently introduced Segment Anything Model (SAM) enables prompt-based segmentation and offers zero-shot generalization to unfamiliar objects. In our work, we leverage SAM’s abstract object understanding for medical image segmentation to provide pseudo labels for semi-supervised learning, thereby mitigating the need for extensive annotated training data. Our approach refines initial segmentations that are derived from a limited amount of annotated data (comprising up to 43 cases) by extracting bounding boxes and seed points as prompts forwarded to SAM. Thus, it enables the generation of dense segmentation masks as pseudo labels for unlabelled data. The results show that training with our pseudo labels yields an improvement in Dice score from 74.29% to 84.17% and from 66.63% to 74.87% for the segmentation of bones of the paediatric wrist and teeth in dental radiographs, respectively. As a result, our method outperforms intensity-based post-processing methods, state-of-the-art supervised learning for segmentation (nnU-Net), and the semi-supervised mean teacher approach. Our code is available on GitHub.
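
A minimal sketch of the prompt-extraction step, turning an initial coarse mask into a bounding box and a seed point in SAM's usual (x0, y0, x1, y1) box and (x, y) point conventions; the call into SAM itself is omitted:

```python
# Derive box and point prompts from a binary mask produced by the
# initial model; the heuristics (tight box, centroid-nearest seed)
# are assumptions for illustration.
import numpy as np

def mask_to_prompts(mask: np.ndarray):
    """mask: binary (H, W) array from the initial segmentation."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None, None
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
    # Seed point: the foreground pixel closest to the mask centroid.
    cy, cx = ys.mean(), xs.mean()
    i = np.argmin((ys - cy) ** 2 + (xs - cx) ** 2)
    point = np.array([xs[i], ys[i]])
    return box, point

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 10:30] = True
print(mask_to_prompts(mask))
```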

[CV-7] SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction

Link: https://arxiv.org/abs/2411.12592
Authors: Yutao Tang,Yuxiang Guo,Deming Li,Cheng Peng
Keywords: sparse-view scenarios due, View Synthesis, dense point cloud, point cloud, Synthesis can achieve
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent efforts in Gaussian-Splat-based Novel View Synthesis can achieve photorealistic rendering; however, such capability is limited in sparse-view scenarios due to sparse initialization and over-fitting floaters. Recent progress in depth estimation and alignment can provide dense point cloud with few views; however, the resulting pose accuracy is suboptimal. In this work, we present SPARS3R, which combines the advantages of accurate pose estimation from Structure-from-Motion and dense point cloud from depth estimation. To this end, SPARS3R first performs a Global Fusion Alignment process that maps a prior dense point cloud to a sparse point cloud from Structure-from-Motion based on triangulated correspondences. RANSAC is applied during this process to distinguish inliers and outliers. SPARS3R then performs a second, Semantic Outlier Alignment step, which extracts semantically coherent regions around the outliers and performs local alignment in these regions. Along with several improvements in the evaluation process, we demonstrate that SPARS3R can achieve photorealistic rendering with sparse images and significantly outperforms existing approaches.

[CV-8] Debias your Large Multi-Modal Model at Test-Time with Non-Contrastive Visual Attribute Steering

Link: https://arxiv.org/abs/2411.12590
Authors: Neale Ratzlaff,Matthew Lyle Olson,Musashi Hinck,Estelle Aflalo,Shao-Yen Tseng,Vasudev Lal,Phillip Howard
Keywords: Large Multi-Modal Models, Large Multi-Modal, demonstrated impressive capabilities, provided input, demonstrated impressive
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 3 Figures, 3 Tables. arXiv admin note: text overlap with arXiv:2410.13976

Abstract:Large Multi-Modal Models (LMMs) have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input, such as an image. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we propose a novel debiasing framework for LMMs that directly removes biased representations during text generation, decreasing outputs related to protected attributes or even their internal representation. Our proposed method is training-free; given a single image and a list of target attributes, we can ablate the corresponding representations with just one step of gradient descent on the image itself. Our experiments show that not only can we minimize the propensity of LMMs to generate text related to protected attributes, but we can also improve sentiment and even simply use synthetic data to inform the ablation, while retaining language modeling capabilities on real data such as COCO or FACET. Furthermore, we find the resulting generations from a debiased LMM exhibit similar accuracy to a baseline biased model, showing that debiasing effects can be achieved without sacrificing model performance.

[CV-9] Infrared-Assisted Single-Stage Framework for Joint Restoration and Fusion of Visible and Infrared Images under Hazy Conditions

Link: https://arxiv.org/abs/2411.12586
Authors: Huafeng Li,Jiaqi Fang,Yafei Zhang,Yu Liu
Keywords: gained significant attention, gained significant, significant attention, broad application, infrared image
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Infrared and visible (IR-VIS) image fusion has gained significant attention for its broad application value. However, existing methods often neglect the complementary role of infrared images in restoring visible image features under hazy conditions. To address this, we propose a joint learning framework that utilizes infrared images for the restoration and fusion of hazy IR-VIS images. To mitigate the adverse effects of feature diversity between IR-VIS images, we introduce a prompt generation mechanism that regulates modality-specific feature incompatibility. This creates a prompt selection matrix from non-shared image information, followed by prompt embeddings generated from a prompt pool. These embeddings help generate candidate features for dehazing. We further design an infrared-assisted feature restoration mechanism that selects candidate features based on haze density, enabling simultaneous restoration and fusion within a single-stage framework. To enhance fusion quality, we construct a multi-stage prompt embedding fusion module that leverages feature supplementation from the prompt generation module. Our method effectively fuses IR-VIS images while removing haze, yielding clear, haze-free fusion results. In contrast to two-stage methods that dehaze and then fuse, our approach enables collaborative training in a single-stage framework, making the model relatively lightweight and suitable for practical deployment. Experimental results validate its effectiveness and demonstrate advantages over existing methods.

[CV-10] Contourlet Refinement Gate Framework for Thermal Spectrum Distribution Regularized Infrared Image Super-Resolution

Link: https://arxiv.org/abs/2411.12530
Authors: Yang Zou,Zhixin Chen,Zhipeng Zhang,Xingyuan Li,Long Ma,Jinyuan Liu,Peng Wang,Yanning Zhang
Keywords: active low-level vision, low-level vision problem, reconstruct high-resolution, active low-level, low-level vision
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 figures, 6 tables

Abstract:Image super-resolution (SR) is a classical yet still active low-level vision problem that aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts, serving as a key technique for image enhancement. Current approaches to address SR tasks, such as transformer-based and diffusion-based methods, are either dedicated to extracting RGB image features or assuming similar degradation patterns, neglecting the inherent modal disparities between infrared and visible images. When directly applied to infrared image SR tasks, these methods inevitably distort the infrared spectral distribution, compromising the machine perception in downstream tasks. In this work, we emphasize the infrared spectral distribution fidelity and propose a Contourlet refinement gate framework to restore infrared modal-specific features while preserving spectral distribution fidelity. Our approach captures high-pass subbands from multi-scale and multi-directional infrared spectral decomposition to recover infrared-degraded information through a gate architecture. The proposed Spectral Fidelity Loss regularizes the spectral frequency distribution during reconstruction, which ensures the preservation of both high- and low-frequency components and maintains the fidelity of infrared-specific features. We propose a two-stage prompt-learning optimization to guide the model in learning infrared HR characteristics from LR degradation. Extensive experiments demonstrate that our approach outperforms existing image SR models in both visual and perceptual tasks while notably enhancing machine perception in downstream tasks. Our code is available at this https URL.

[CV-11] Data Pruning in Generative Diffusion Models

Link: https://arxiv.org/abs/2411.12523
Authors: Rania Briq,Jiangtao Wang,Steffan Kesselheim
Keywords: discarding the remainder, identifying a core, core subset, training and discarding, Data
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Data pruning is the problem of identifying a core subset that is most beneficial to training and discarding the remainder. While pruning strategies are well studied for discriminative models like those used in classification, little research has gone into their application to generative models. Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets. In this work we aim to shed light on the accuracy of this statement, specifically answering the question of whether data pruning for generative diffusion models could have a positive impact. Contrary to intuition, we show that eliminating redundant or noisy data in large datasets is beneficial, particularly when done strategically. We experiment with several pruning methods, including recent state-of-the-art methods, and evaluate on the CelebA-HQ and ImageNet datasets. We demonstrate that a simple clustering method outperforms other sophisticated and computationally demanding methods. We further exhibit how we can leverage clustering to balance skewed datasets in an unsupervised manner to allow fair sampling for underrepresented populations in the data distribution, which is a crucial problem in generative models.
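
A minimal sketch of cluster-based pruning in the spirit described, capping how many samples each k-means cluster contributes so dense (redundant) regions are thinned; the embedding source and per-cluster budget are assumptions:

```python
# Keep at most `keep_per_cluster` samples per k-means cluster; dense
# clusters are subsampled, rare ones are kept whole, which also helps
# balance skewed datasets.
import numpy as np
from sklearn.cluster import KMeans

def cluster_prune(embeddings, n_clusters=50, keep_per_cluster=20, seed=0):
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        take = min(keep_per_cluster, len(idx))
        keep.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(keep)

X = np.random.default_rng(0).standard_normal((5000, 64))  # stand-in embeddings
subset = cluster_prune(X)
print(subset.shape)  # indices of the retained core subset
```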

[CV-12] VMGNet: A Low Computational Complexity Robotic Grasping Network Based on VMamba with Multi-Scale Feature Fusion

Link: https://arxiv.org/abs/2411.12520
Authors: Yuhao Jin,Qizhong Gao,Xiaohui Zhu,Yong Yue,Eng Gee Lim,Yuqing Chen,Prudence Wong,Yijie Chu
Keywords: demonstrated strong adaptability, high real-time requirements, deep learning-based robotic, robotic grasping technology, learning-based robotic grasping
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While deep learning-based robotic grasping technology has demonstrated strong adaptability, its computational complexity has also significantly increased, making it unsuitable for scenarios with high real-time requirements. Therefore, we propose a low computational complexity and high accuracy model named VMGNet for robotic grasping. For the first time, we introduce the Visual State Space into the robotic grasping field to achieve linear computational complexity, thereby greatly reducing the model’s computational cost. Meanwhile, to improve the accuracy of the model, we propose an efficient and lightweight multi-scale feature fusion module, named Fusion Bridge Module, to extract and fuse information at different scales. We also present a new loss function calculation method to enhance the importance differences between subtasks, improving the model’s fitting ability. Experiments show that VMGNet has only 8.7G Floating Point Operations and an inference time of 8.1 ms on our devices. VMGNet also achieved state-of-the-art performance on the Cornell and Jacquard public datasets. To validate VMGNet’s effectiveness in practical applications, we conducted real grasping experiments in multi-object scenarios, and VMGNet achieved an excellent performance with a 94.4% success rate in real-world grasping tasks. The video for the real-world robotic grasping experiments is available at this https URL.

[CV-13] 3D Reconstruction by Looking: Instantaneous Blind Spot Detector for Indoor SLAM through Mixed Reality

Link: https://arxiv.org/abs/2411.12514
Authors: Hanbeom Chang,Jongseong Brad Choi,Chul Min Yeum
Keywords: Indoor SLAM, SLAM often suffers, double walls, LiDAR and cameras, point cloud registration
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 21 pages, 13 figures, 3 tables

Abstract:Indoor SLAM often suffers from issues such as scene drifting, double walls, and blind spots, particularly in confined spaces with objects close to the sensors (e.g., LiDAR and cameras) in reconstruction tasks. Real-time visualization of point cloud registration during data collection may help mitigate these issues, but a significant limitation remains in the inability to compare the scanned data in depth with the actual physical environment. These challenges obstruct the quality of reconstruction products, frequently necessitating revisit and rescan efforts. To this end, we developed the LiMRSF (LiDAR-MR-RGB Sensor Fusion) system, allowing users to perceive the in-situ point cloud registration by looking through a Mixed-Reality (MR) headset. This tailored framework visualizes point cloud meshes as holograms, seamlessly matching with the real-time scene on see-through glasses, and automatically highlights errors detected while they overlap. The holographic elements are transmitted via a TCP server to the MR headset, where they are calibrated to align with the world coordinate, i.e., the physical location. This allows users to view the localized reconstruction product instantaneously, enabling them to quickly identify blind spots and errors and take prompt action on-site. Our blind spot detector achieves an error detection precision with an F1 score of 75.76%, with acceptably high monitoring fidelity through the LiMRSF system (highest SSIM of 0.5619, PSNR of 14.1004, and lowest MSE of 0.0389 across the five different sections of the simplified mesh model that users visualize through the LiMRSF device’s see-through glasses). This method ensures the creation of detailed, high-quality datasets for 3D models, with potential applications in, but not limited to, Building Information Modeling (BIM).

[CV-14] PR-ENDO: Physically Based Relightable Gaussian Splatting for Endoscopy

Link: https://arxiv.org/abs/2411.12510
Authors: Joanna Kaleta,Weronika Smolak-Dyżewska,Dawid Malarz,Diego Dall’Alba,Przemysław Korzeniowski,Przemysław Spurek
Keywords: colorectal cancer diagnosis, significantly enhance diagnosis, real-time novel-view synthesis, Endoscopic procedures, cancer diagnosis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Endoscopic procedures are crucial for colorectal cancer diagnosis, and three-dimensional reconstruction of the environment for real-time novel-view synthesis can significantly enhance diagnosis. We present PR-ENDO, a framework that leverages 3D Gaussian Splatting within a physically based, relightable model tailored for the complex acquisition conditions in endoscopy, such as restricted camera rotations and strong view-dependent illumination. By exploiting the connection between the camera and light source, our approach introduces a relighting model to capture the intricate interactions between light and tissue using physically based rendering and an MLP. Existing methods often produce artifacts and inconsistencies under these conditions, which PR-ENDO overcomes by incorporating a specialized diffuse MLP that utilizes light angles and normal vectors, achieving stable reconstructions even with limited training camera rotations. We benchmarked our framework using a publicly available dataset and a newly introduced dataset with wider camera rotations. Our method demonstrated superior image quality compared to baseline approaches.

[CV-15] SCIGS: 3D Gaussians Splatting from a Snapshot Compressive Image

Link: https://arxiv.org/abs/2411.12471
Authors: Zixu Wang,Hao Yang,Yu Guo,Fei Wang
Keywords: Snapshot Compressive Imaging, Snapshot Compressive, Compressive Imaging, requiring efficient reconstruction, NeRF-based reconstruction methods
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Snapshot Compressive Imaging (SCI) offers a possibility for capturing information in high-speed dynamic scenes, requiring efficient reconstruction methods to recover scene information. Despite promising results, current deep learning-based and NeRF-based reconstruction methods face challenges: 1) deep learning-based reconstruction methods struggle to maintain 3D structural consistency within scenes, and 2) NeRF-based reconstruction methods still face limitations in handling dynamic scenes. To address these challenges, we propose SCIGS, a variant of 3DGS, and develop a primitive-level transformation network that utilizes camera pose stamps and Gaussian primitive coordinates as embedding vectors. This approach removes the need for the camera poses required by vanilla 3DGS and enhances multi-view 3D structural consistency in dynamic scenes by utilizing transformed primitives. Additionally, a high-frequency filter is introduced to eliminate the artifacts generated during the transformation. The proposed SCIGS is the first to reconstruct a 3D explicit scene from a single compressed image, extending its application to dynamic 3D scenes. Experiments on both static and dynamic scenes demonstrate that SCIGS not only enhances SCI decoding but also outperforms current state-of-the-art methods in reconstructing dynamic 3D scenes from a single compressed image. The code will be made available upon publication.

[CV-16] GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-training in Autonomous Driving

Link: https://arxiv.org/abs/2411.12452
Authors: Shaoqing Xu,Fang Li,Shengyin Jiang,Ziying Song,Li Liu,Zhi-xin Yang
Keywords: made substantial strides, Self-supervised learning, image processing, made substantial, substantial strides
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures

Abstract:Self-supervised learning has made substantial strides in image processing, while visual pre-training for autonomous driving is still in its infancy. Existing methods often focus on learning geometric scene information while neglecting texture or treating both aspects separately, hindering comprehensive scene understanding. In this context, we are excited to introduce GaussianPretrain, a novel pre-training paradigm that achieves a holistic understanding of the scene by uniformly integrating geometric and texture representations. Conceptualizing 3D Gaussian anchors as volumetric LiDAR points, our method learns a deepened understanding of scenes to enhance pre-training performance with detailed spatial structure and texture, running 40.6% faster than the NeRF-based method UniPAD while using only 70% of the GPU memory. We demonstrate the effectiveness of GaussianPretrain across multiple 3D perception tasks, showing significant performance improvements, such as a 7.05% increase in NDS for 3D object detection, a 1.9% boost in mAP for HD map construction, and a 0.8% improvement in occupancy prediction. These significant gains highlight GaussianPretrain’s theoretical innovation and strong practical potential, promoting visual pre-training development for autonomous driving. Source code will be available at this https URL

[CV-17] Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models ECCV2024

Link: https://arxiv.org/abs/2411.12450
Authors: Jun Xiao,Zihang Lyu,Hao Xie,Cong Zhang,Yakun Ju,Changjian Shui,Kin-Man Lam
Keywords: low-level vision tasks, Blind image restoration, image restoration remains, Blind image, challenge in low-level
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 17 pages, 6 figures, has been accepted by the ECCV 2024: AIM workshop

Abstract:Blind image restoration remains a significant challenge in low-level vision tasks. Recently, denoising diffusion models have shown remarkable performance in image synthesis. Guided diffusion models, leveraging the potent generative priors of pre-trained models along with a differential guidance loss, have achieved promising results in blind image restoration. However, these models typically consider data consistency solely in the spatial domain, often resulting in distorted image content. In this paper, we propose a novel frequency-aware guidance loss that can be integrated into various diffusion models in a plug-and-play manner. Our proposed guidance loss, based on 2D discrete wavelet transform, simultaneously enforces content consistency in both the spatial and frequency domains. Experimental results demonstrate the effectiveness of our method in three blind restoration tasks: blind image deblurring, imaging through turbulence, and blind restoration for multiple degradations. Notably, our method achieves a significant improvement in PSNR score, with a remarkable enhancement of 3.72 dB in image deblurring. Moreover, our method exhibits superior capability in generating images with rich details and reduced distortion, leading to the best visual quality.
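
A minimal sketch of a wavelet-domain consistency term of this kind, using PyWavelets; the wavelet choice and the equal weighting of subbands are assumptions, and the degradation operator the paper's guidance runs through is omitted:

```python
# Spatial MSE plus per-subband MSE over the 2D DWT (LL, LH, HL, HH),
# enforcing consistency in both spatial and frequency domains.
import numpy as np
import pywt

def frequency_aware_loss(estimate: np.ndarray, target: np.ndarray) -> float:
    """estimate, target: 2D grayscale images of the same shape."""
    loss = np.mean((estimate - target) ** 2)          # spatial-domain term
    bands_e = pywt.dwt2(estimate, "haar")
    bands_t = pywt.dwt2(target, "haar")
    for be, bt in zip([bands_e[0], *bands_e[1]], [bands_t[0], *bands_t[1]]):
        loss += np.mean((be - bt) ** 2)               # frequency-domain terms
    return float(loss)

rng = np.random.default_rng(0)
x, y = rng.random((64, 64)), rng.random((64, 64))
print(frequency_aware_loss(x, y))
```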

[CV-18] Large Language Models for Lossless Image Compression: Next-Pixel Prediction in Language Space is All You Need

Link: https://arxiv.org/abs/2411.12448
Authors: Kecheng Chen,Pingping Zhang,Hui Liu,Jie Liu,Yibing Liu,Jixin Huang,Shiqi Wang,Hong Yan,Haoliang Li
Keywords: language large model, lossless image compression, lossless image, image compression, general-purpose lossless compressor
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:We have recently witnessed that "intelligence" and "compression" are two sides of the same coin, where a large language model (LLM) with unprecedented intelligence is a general-purpose lossless compressor for various data modalities. This attribute particularly appeals to the lossless image compression community, given the increasing need to compress high-resolution images in the current streaming media era. Consequently, a spontaneous vision emerges: can the compression performance of the LLM elevate lossless image compression to new heights? However, our findings indicate that the naive application of LLM-based lossless image compressors suffers from a considerable performance gap compared with existing state-of-the-art (SOTA) codecs on common benchmark datasets. In light of this, we are dedicated to fulfilling the unprecedented intelligence (compression) capacity of the LLM for lossless image compression tasks, thereby bridging the gap between theoretical and practical compression performance. Specifically, we propose P^2-LLM, a next-pixel prediction-based LLM, which integrates various elaborated insights and methodologies, e.g., pixel-level priors, the in-context ability of LLMs, and a pixel-level semantic preservation strategy, to enhance the understanding capacity of pixel sequences for better next-pixel predictions. Extensive experiments on benchmark datasets demonstrate that P^2-LLM can beat SOTA classical and learned codecs.
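
The compression-by-prediction link can be made concrete: an entropy coder driven by a predictive distribution spends about -log2 p(pixel) bits per pixel, so a sharper next-pixel predictor yields a shorter code. A toy sketch with stand-in predictors (not the paper's LLM):

```python
# Theoretical codelength of a pixel sequence under a next-pixel predictor:
# mean cross-entropy in bits per pixel.
import numpy as np

def bits_per_pixel(pixels: np.ndarray, probs: np.ndarray) -> float:
    """pixels: (N,) ints in [0, 255]; probs: (N, 256) predictive dists."""
    p = probs[np.arange(len(pixels)), pixels]
    return float(np.mean(-np.log2(np.clip(p, 1e-12, 1.0))))

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=1000)
uniform = np.full((1000, 256), 1 / 256)     # knows nothing: 8 bits/pixel
peaked = np.full((1000, 256), 0.5 / 255)    # a predictor that always puts
peaked[np.arange(1000), pixels] = 0.5       # probability 0.5 on the true pixel
print(bits_per_pixel(pixels, uniform))      # ~8.0
print(bits_per_pixel(pixels, peaked))       # ~1.0
```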

[CV-19] Beyond Gaussians: Fast and High-Fidelity 3D Splatting with Linear Kernels

Link: https://arxiv.org/abs/2411.12440
Authors: Haodong Chen,Runnan Chen,Qiang Qu,Zhaoqing Wang,Tongliang Liu,Xiaoming Chen,Yuk Ying Chung
Keywords: enabling high-quality reconstruction, Recent advancements, view synthesis, enabling high-quality, substantially improved
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) have substantially improved novel view synthesis, enabling high-quality reconstruction and real-time rendering. However, blurring artifacts, such as floating primitives and over-reconstruction, remain challenging. Current methods address these issues by refining scene structure, enhancing geometric representations, addressing blur in training images, improving rendering consistency, and optimizing density control, yet the role of kernel design remains underexplored. We identify the soft boundaries of Gaussian ellipsoids as one of the causes of these artifacts, limiting detail capture in high-frequency regions. To bridge this gap, we introduce 3D Linear Splatting (3DLS), which replaces Gaussian kernels with linear kernels to achieve sharper and more precise results, particularly in high-frequency regions. Through evaluations on three datasets, 3DLS demonstrates state-of-the-art fidelity and accuracy, along with a 30% FPS improvement over baseline 3DGS. The implementation will be made publicly available upon acceptance.

[CV-20] CV-Cities: Advancing Cross-View Geo-Localization in Global Cities

Link: https://arxiv.org/abs/2411.12431
Authors: Gaoshuang Huang,Yang Zhou,Luying Zhao,Wenjian Gan
Keywords: Cross-view geo-localization, retrieving satellite images, involves matching, matching and retrieving, retrieving satellite
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Datasets and codes are available, accepted by IEEE JSTARS

Abstract:Cross-view geo-localization (CVGL), which involves matching and retrieving satellite images to determine the geographic location of a ground image, is crucial in GNSS-constrained scenarios. However, this task faces significant challenges due to substantial viewpoint discrepancies, the complexity of localization scenarios, and the need for global localization. To address these issues, we propose a novel CVGL framework that integrates the vision foundational model DINOv2 with an advanced feature mixer. Our framework introduces the symmetric InfoNCE loss and incorporates near-neighbor sampling and dynamic similarity sampling strategies, significantly enhancing localization accuracy. Experimental results show that our framework surpasses existing methods across multiple public and self-built datasets. To further improve global-scale performance, we have developed CV-Cities, a novel dataset for global CVGL. CV-Cities includes 223,736 ground-satellite image pairs with geolocation data, spanning sixteen cities across six continents and covering a wide range of complex scenarios, providing a challenging benchmark for CVGL. The framework trained with CV-Cities demonstrates high localization accuracy in various test cities, highlighting its strong globalization and generalization capabilities. Our datasets and codes are available at this https URL.
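
A minimal sketch of a symmetric InfoNCE loss for ground-satellite pairs, with in-batch negatives in both directions; the temperature value and the embedding source (DINOv2 plus a feature mixer in the paper) are assumptions here:

```python
# Symmetric InfoNCE: cross-entropy over the similarity matrix in both the
# ground-to-satellite and satellite-to-ground directions.
import torch
import torch.nn.functional as F

def symmetric_infonce(ground: torch.Tensor, sat: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """ground, sat: (B, D) embeddings; row i of each is a matched pair."""
    g = F.normalize(ground, dim=1)
    s = F.normalize(sat, dim=1)
    logits = g @ s.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(g.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

loss = symmetric_infonce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```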

[CV-21] Motif Channel Opened in a White-Box: Stereo Matching via Motif Correlation Graph

Link: https://arxiv.org/abs/2411.12426
Authors: Ziyang Chen,Yongjun Zhang,Wenting Li,Bingshu Wang,Yong Zhao,C. L. Philip Chen
Keywords: Real-world applications, place stringent demands, autonomous driving, safety and accuracy, stereo matching
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-world applications of stereo matching, such as autonomous driving, place stringent demands on both safety and accuracy. However, learning-based stereo matching methods inherently suffer from the loss of geometric structures in certain feature channels, creating a bottleneck in achieving precise detail matching. Additionally, these methods lack interpretability due to the black-box nature of deep learning. In this paper, we propose MoCha-V2, a novel learning-based paradigm for stereo matching. MoCha-V2 introduces the Motif Correlation Graph (MCG) to capture recurring textures, which are referred to as "motifs" within feature channels. These motifs reconstruct geometric structures and are learned in a more interpretable way. Subsequently, we integrate features from multiple frequency domains through wavelet inverse transformation. The resulting motif features are utilized to restore geometric structures in the stereo matching process. Experimental results demonstrate the effectiveness of MoCha-V2. MoCha-V2 achieved 1st place on the Middlebury benchmark at the time of its release. Code is available at this https URL.

[CV-22] Classification of Geographical Land Structure Using Convolution Neural Network and Transfer Learning

Link: https://arxiv.org/abs/2411.12415
Authors: Mustafa M. Abd Zaid,Ahmed Abed Mohammed,Putra Sumari
Keywords: policymakers unprecedented global, unprecedented global access, giving academics, spatial data, dramatically revolutionized
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Satellite imagery has dramatically revolutionized the field of geography by giving academics, scientists, and policymakers unprecedented global access to spatial data. Manual methods typically require significant time and effort to detect the generic land structure in satellite images. Automating this detection supports applications such as urban planning and development, environmental monitoring, and disaster management. Therefore, this research presents a methodology to minimize human labor, reducing the expenses and duration needed to identify the land structure. This article develops a deep learning-based approach to automate the process of classifying geographical land structures. We used a satellite image dataset acquired from MLRSNet. The study compared the performance of three architectures, namely CNN, ResNet-50, and Inception-v3. We used three optimizers with each model: Adam, SGD, and RMSProp. We conducted the training process for a fixed number of epochs, specifically 100 epochs, with a batch size of 64. The ResNet-50 achieved an accuracy of 76.5% with the Adam optimizer, the Inception-v3 with RMSProp achieved an accuracy of 93.8%, and the proposed approach, CNN with the RMSProp optimizer, achieved the highest performance, with an accuracy of 94.8%. Moreover, a thorough examination of the CNN model demonstrated its exceptional accuracy, recall, and F1 scores for all categories, confirming its resilience and dependability in precisely detecting various terrain formations. The results highlight the potential of deep learning models in scene understanding, as well as their significance in efficiently identifying and categorizing land structures from satellite imagery.

[CV-23] Breathless: An 8-hour Performance Contrasting Human and Robot Expressiveness

Link: https://arxiv.org/abs/2411.12361
Authors: Catie Cuan,Tianshuang Qiu,Shreya Ganti,Ken Goldberg
Keywords: American workday, industrial robot arm, paper describes, eight-hour dance, dance that unfolds
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures, accepted for ISRR (International Symposium of Robotics Research) 2024

Abstract:This paper describes the robot technology behind an original performance that pairs a human dancer (Cuan) with an industrial robot arm for an eight-hour dance that unfolds over the timespan of an American workday. To control the robot arm, we combine a range of sinusoidal motions with varying amplitude, frequency and offset at each joint to evoke human motions common in physical labor such as stirring, digging, and stacking. More motions were developed using deep learning techniques for video-based human-pose tracking and extraction. We combine these pre-recorded motions with improvised robot motions created live by putting the robot into teach-mode and triggering force sensing from the robot joints onstage. All motions are combined with commercial and original music using a custom suite of python software with AppleScript, Keynote, and Zoom to facilitate on-stage communication with the dancer. The resulting performance contrasts the expressivity of the human body with the precision of robot machinery. Video, code and data are available on the project website: this https URL

[CV-24] DynFocus: Dynamic Cooperative Network Empowers LLM s with Video Understanding

Link: https://arxiv.org/abs/2411.12355
Authors: Yudong Han,Qingpei Guo,Liyuan Pan,Liu Liu,Yu Guan,Ming Yang
Keywords: memory-affordable token count, LLM-based video understanding, video understanding lies, challenge in LLM-based, understanding lies
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures

Abstract:The challenge in LLM-based video understanding lies in preserving visual and semantic information in long videos while maintaining a memory-affordable token count. However, redundancy and correspondence in videos have hindered the performance potential of existing methods. Through statistical learning on current datasets, we observe that redundancy occurs in both repeated and answer-irrelevant frames, and the corresponding frames vary with different questions. This suggests the possibility of adopting dynamic encoding to balance detailed video information preservation with token budget reduction. To this end, we propose a dynamic cooperative network, DynFocus, for memory-efficient video encoding in this paper. Specifically, it comprises (i) a Dynamic Event Prototype Estimation (DPE) module that dynamically selects meaningful frames for question answering, and (ii) a Compact Cooperative Encoding (CCE) module that encodes meaningful frames with detailed visual appearance and the remaining frames with sketchy perception separately. We evaluate our method on five publicly available benchmarks, and experimental results consistently demonstrate that our method achieves competitive performance.

[CV-25] Target Height Estimation Using a Single Acoustic Camera for Compensation in 2D Seabed Mosaicking

链接: https://arxiv.org/abs/2411.12338
作者: Xiaoteng Zhou,Yusheng Wang,Katsunori Mizuno
关键词-EN: low-visibility underwater perception, target height, compensating target height, underwater perception, target height data
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, conference

点击查看摘要

Abstract:This letter proposes a novel approach for compensating target height data in 2D seabed mosaicking for low-visibility underwater perception. Acoustic cameras are effective sensors for sensing marine environments due to their high-resolution imaging capabilities and robustness to darkness and turbidity. However, the loss of the elevation angle during the imaging process results in a lack of target height information in the original acoustic camera images, leading to a simplistic 2D representation of the seabed mosaicking. In perceiving cluttered and unexplored marine environments, target height data is crucial for avoiding collisions with marine robots. This study proposes a novel approach for estimating seabed target height using a single acoustic camera and integrates the height data into 2D seabed mosaicking to compensate for the missing 3D dimension of seabed targets. Unlike classic methods that model the loss of elevation angle to achieve seabed 3D reconstruction, this study focuses on utilizing available acoustic cast shadow clues and simple sensor motion to quickly estimate target height. The feasibility of our proposal is verified through a water tank experiment and a simulation experiment.
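For reference, the classic flat-seabed shadow geometry recovers target height from the cast-shadow length by similar triangles. The sketch below shows only this textbook baseline under a flat-seabed assumption; the paper's estimator additionally exploits sensor motion and is not claimed to reduce to this formula.

```python
def target_height_from_shadow(shadow_length: float,
                              sensor_altitude: float,
                              range_to_shadow_end: float) -> float:
    """Similar triangles on a flat seabed:
    height / shadow_length = sensor_altitude / range_to_shadow_end."""
    return shadow_length * sensor_altitude / range_to_shadow_end

# e.g. a 1.5 m shadow, sensor 4 m above the seabed, shadow tip 20 m away
print(target_height_from_shadow(1.5, 4.0, 20.0))  # -> 0.3 (m)
```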

[CV-26] Accelerating UMAP for Large-Scale Datasets Through Spectral Coarsening

链接: https://arxiv.org/abs/2411.12331
作者: Yongyu Wang
关键词-EN: spectral compression technique, dramatically accelerate UMAP, essential manifold structure, proposed method
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces an innovative approach to dramatically accelerate UMAP using spectral data coarsening. The proposed method significantly reduces the size of the dataset, preserving its essential manifold structure through an advanced spectral compression technique. This allows UMAP to perform much faster while maintaining the quality of its embeddings. Experiments on real-world datasets, such as USPS, demonstrate the method’s ability to achieve substantial data reduction without compromising embedding fidelity.
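A minimal sketch of the coarsen-then-embed idea, assuming scikit-learn and umap-learn and using the digits dataset as a small stand-in; the paper's actual spectral coarsening algorithm may differ from this simple spectral-clustering reduction.

```python
import numpy as np
import umap                                   # pip install umap-learn
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.manifold import SpectralEmbedding

X = load_digits().data                        # small stand-in for a large dataset
spec = SpectralEmbedding(n_components=8).fit_transform(X)
km = KMeans(n_clusters=300, n_init=4).fit(spec)      # coarsen to 300 groups
reps = np.stack([X[km.labels_ == c].mean(axis=0)     # one representative each
                 for c in range(km.n_clusters)])
embedding = umap.UMAP().fit_transform(reps)   # UMAP runs on 300 points, not 1797
```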

[CV-27] Enhancing Blind Source Separation with Dissociative Principal Component Analysis

链接: https://arxiv.org/abs/2411.12321
作者: Muhammad Usman Khalid
关键词-EN: Sparse principal component, principal component analysis, imposing sparsity constraints, Sparse principal, component analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 1. 13 pages with 6 figures; this work has not been published before. 2. The paper is yet to be peer-reviewed and I am planning to submit it to IEEE Transactions on Image Processing. 3. There is no supplementary material. 4. There is no funding for this work as of now

点击查看摘要

Abstract:Sparse principal component analysis (sPCA) enhances the interpretability of principal components (PCs) by imposing sparsity constraints on loading vectors (LVs). However, when used as a precursor to independent component analysis (ICA) for blind source separation (BSS), sPCA may underperform due to its focus on simplicity, potentially disregarding some statistical information essential for effective ICA. To overcome this limitation, a sophisticated approach is proposed that preserves the interpretability advantages of sPCA while significantly enhancing its source extraction capabilities. This consists of two tailored algorithms, dissociative PCA (DPCA1 and DPCA2), which employ adaptive and firm thresholding alongside gradient and coordinate descent approaches to optimize the proposed model dynamically. These algorithms integrate left and right singular vectors from singular value decomposition (SVD) through dissociation matrices (DMs) that replace traditional singular values, thus capturing latent interdependencies effectively to model complex source relationships. This leads to refined PCs and LVs that more accurately represent the underlying data structure. The proposed approach avoids focusing on individual eigenvectors, instead, it collaboratively combines multiple eigenvectors to disentangle interdependencies within each SVD variate. The superior performance of the proposed DPCA algorithms is demonstrated across four varied imaging applications including functional magnetic resonance imaging (fMRI) source retrieval, foreground-background separation, image reconstruction, and image inpainting. They outperformed traditional methods such as PCA+ICA, PPCA+ICA, SPCA+ICA, PMD, and GPower.

[CV-28] C2INet: Realizing Incremental Trajectory Prediction with Prior-Aware Continual Causal Intervention

链接: https://arxiv.org/abs/2411.12313
作者: Xiaohe Li,Feilong Huang,Zide Fan,Fangli Mou,Leilei Lin,Yingyan Hou,Lijie Wen
关键词-EN: autonomous driving, Continual Causal Intervention, continual learning, Abstract, continual
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Trajectory prediction for multiple agents in complex scenarios is crucial for applications like autonomous driving. However, existing methods often overlook environmental biases, which leads to poor generalization. Additionally, hardware constraints limit the use of large-scale data across environments, and continual learning settings exacerbate the challenge of catastrophic forgetting. To address these issues, we propose the Continual Causal Intervention (C²INet) method for generalizable multi-agent trajectory prediction within a continual learning framework. Using variational inference, we align the environment-related prior with a posterior estimator of confounding factors in the latent space, thereby intervening in causal correlations that affect trajectory representation. Furthermore, we store optimal variational priors across various scenarios using a memory queue, ensuring continuous debiasing during incremental task training. The proposed C²INet enhances adaptability to diverse tasks while preserving previous task information to prevent catastrophic forgetting. It also incorporates pruning strategies to mitigate overfitting. Comparative evaluations on three real and synthetic complex datasets against state-of-the-art methods demonstrate that our proposed method consistently achieves reliable prediction performance, effectively mitigating confounding factors unique to different scenarios. This highlights the practical value of our method for real-world applications.

[CV-29] DGTR: Distributed Gaussian Turbo-Reconstruction for Sparse-View Vast Scenes

链接: https://arxiv.org/abs/2411.12309
作者: Hao Li,Yuanyuan Gao,Haosong Peng,Chenming Wu,Weicai Ye,Yufeng Zhan,Chen Zhao,Dingwen Zhang,Jingdong Wang,Junwei Han
关键词-EN: NVS, play a critical, critical role, reconstruction, vast
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code will be released on our [project page](this https URL)

点击查看摘要

Abstract:Novel-view synthesis (NVS) approaches play a critical role in vast scene reconstruction. However, these methods rely heavily on dense image inputs and require prolonged training times, making them unsuitable where computational resources are limited. Additionally, few-shot methods often struggle with poor reconstruction quality in vast environments. This paper presents DGTR, a novel distributed framework for efficient Gaussian reconstruction for sparse-view vast scenes. Our approach divides the scene into regions, processed independently by drones with sparse image inputs. Using a feed-forward Gaussian model, we predict high-quality Gaussian primitives, followed by a global alignment algorithm to ensure geometric consistency. Synthetic views and depth priors are incorporated to further enhance training, while a distillation-based model aggregation mechanism enables efficient reconstruction. Our method achieves high-quality large-scale scene reconstruction and novel-view synthesis in significantly reduced training times, outperforming existing approaches in both speed and scalability. We demonstrate the effectiveness of our framework on vast aerial scenes, achieving high-quality results within minutes. Code will be released on our [project page](this https URL).

[CV-30] Diffusion Product Quantization

链接: https://arxiv.org/abs/2411.12306
作者: Jie Shao,Hanxiao Zhang,Jianxin Wu
关键词-EN: extreme compression regimes, reduce model size, diffusion models, regimes to reduce, model size
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this work, we explore the quantization of diffusion models in extreme compression regimes to reduce model size while maintaining performance. We begin by investigating classical vector quantization but find that diffusion models are particularly susceptible to quantization error, with the codebook size limiting generation quality. To address this, we introduce product quantization, which offers improved reconstruction precision and larger capacity – crucial for preserving the generative capabilities of diffusion models. Furthermore, we propose a method to compress the codebook by evaluating the importance of each vector and removing redundancy, ensuring the model size remaining within the desired range. We also introduce an end-to-end calibration approach that adjusts assignments during the forward pass and optimizes the codebook using the DDPM loss. By compressing the model to as low as 1 bit (resulting in over 24 times reduction in model size), we achieve a balance between compression and quality. We apply our compression method to the DiT model on ImageNet and consistently outperform other quantization approaches, demonstrating competitive generative performance.
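For readers unfamiliar with product quantization, here is a generic sketch of the core mechanism it builds on: split each vector into subvectors and learn one small codebook per subspace (assuming scikit-learn). The paper's DDPM-calibrated assignments and importance-based codebook pruning are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(W, n_sub=4, k=256):
    """Split each row of W into n_sub subvectors; learn one codebook per subspace."""
    subs = np.split(W, n_sub, axis=1)
    books = [KMeans(n_clusters=k, n_init=4).fit(s) for s in subs]
    codes = np.stack([b.predict(s) for b, s in zip(books, subs)], axis=1)
    return books, codes          # codes: (n_vectors, n_sub) small integer indices

def pq_decode(books, codes):
    return np.hstack([b.cluster_centers_[codes[:, i]]
                      for i, b in enumerate(books)])

W = np.random.randn(4096, 64).astype(np.float32)     # stand-in weight matrix
books, codes = pq_train(W)
W_hat = pq_decode(books, codes)                      # dequantized approximation
```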

[CV-31] Physics-Guided Detector for SAR Airplanes

链接: https://arxiv.org/abs/2411.12301
作者: Zhongling Huang,Long Liu,Shuxin Yang,Zhirui Wang,Gong Cheng,Junwei Han
关键词-EN: SAR airplane, SAR airplane targets, variant scattering characteristics, SAR airplane detection, SAR
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The disperse structure distributions (discreteness) and variant scattering characteristics (variability) of SAR airplane targets lead to special challenges of object detection and recognition. The current deep learning-based detectors encounter challenges in distinguishing fine-grained SAR airplanes against complex backgrounds. To address it, we propose a novel physics-guided detector (PGD) learning paradigm for SAR airplanes that comprehensively investigates their discreteness and variability to improve the detection performance. It is a general learning paradigm that can be extended to different existing deep learning-based detectors with “backbone-neck-head” architectures. The main contributions of PGD include the physics-guided self-supervised learning, feature enhancement, and instance perception, denoted as PGSSL, PGFE, and PGIP, respectively. PGSSL aims to construct a self-supervised learning task based on a wide range of SAR airplane targets that encodes the prior knowledge of various discrete structure distributions into the embedded space. Then, PGFE enhances the multi-scale feature representation of a detector, guided by the physics-aware information learned from PGSSL. PGIP is constructed at the detection head to learn the refined and dominant scattering point of each SAR airplane instance, thus alleviating the interference from the complex background. We propose two implementations, denoted as PGD and PGD-Lite, and apply them to various existing detectors with different backbones and detection heads. The experiments demonstrate the flexibility and effectiveness of the proposed PGD, which can improve existing detectors on SAR airplane detection with the fine-grained classification task (an improvement of up to 3.1% mAP), and achieve the state-of-the-art performance (90.7% mAP) on the SAR-AIRcraft-1.0 dataset. The project is open-source at this https URL.

[CV-32] Generative Timelines for Instructed Visual Assembly

链接: https://arxiv.org/abs/2411.12293
作者: Alejandro Pardo,Jui-Hsien Wang,Bernard Ghanem,Josef Sivic,Bryan Russell,Fabian Caba Heilbron
关键词-EN: Instructed visual assembly, manipulate visual timelines, visual, disabled users, visual assembly tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:The objective of this work is to manipulate visual timelines (e.g. a video) through natural language instructions, making complex timeline editing tasks accessible to non-expert or potentially even disabled users. We call this task Instructed visual assembly. This task is challenging as it requires (i) identifying relevant visual content in the input timeline as well as retrieving relevant visual content in a given input (video) collection, (ii) understanding the input natural language instruction, and (iii) performing the desired edits of the input visual timeline to produce an output timeline. To address these challenges, we propose the Timeline Assembler, a generative model trained to perform instructed visual assembly tasks. The contributions of this work are three-fold. First, we develop a large multimodal language model, which is designed to process visual content, compactly represent timelines and accurately interpret timeline editing instructions. Second, we introduce a novel method for automatically generating datasets for visual assembly tasks, enabling efficient training of our model without the need for human-labeled data. Third, we validate our approach by creating two novel datasets for image and video assembly, demonstrating that the Timeline Assembler substantially outperforms established baseline models, including the recent GPT-4o, in accurately executing complex assembly instructions across various real-world inspired scenarios.

[CV-33] GLOVER: Generalizable Open-Vocabulary Affordance Reasoning for Task-Oriented Grasping

链接: https://arxiv.org/abs/2411.12286
作者: Teli Ma,Zifan Wang,Jiaming Zhou,Mengmeng Wang,Junwei Liang
关键词-EN: Inferring affordable, arbitrary objects based, Large Language Models, based on human, human specifications
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Inferring affordable (i.e., graspable) parts of arbitrary objects based on human specifications is essential for robots advancing toward open-vocabulary manipulation. Current grasp planners, however, are hindered by limited vision-language comprehension and time-consuming 3D radiance modeling, restricting real-time, open-vocabulary interactions with objects. To address these limitations, we propose GLOVER, a unified Generalizable Open-Vocabulary Affordance Reasoning framework, which fine-tunes the Large Language Models (LLMs) to predict visual affordance of graspable object parts within RGB feature space. We compile a dataset of over 10,000 images from human-object interactions, annotated with unified visual and linguistic affordance labels, to enable multi-modal fine-tuning. GLOVER inherits world knowledge and common-sense reasoning from LLMs, facilitating more fine-grained object understanding and sophisticated tool-use reasoning. To enable effective real-world deployment, we present Affordance-Aware Grasping Estimation (AGE), a non-parametric grasp planner that aligns the gripper pose with a superquadric surface derived from affordance data. In evaluations across 30 real-world scenes, GLOVER achieves success rates of 86.0% in part identification and 76.3% in grasping, with speeds approximately 330 times faster in affordance reasoning and 40 times faster in grasping pose estimation than the previous state-of-the-art.

[CV-34] HouseLLM: LLM-Assisted Two-Phase Text-to-Floorplan Generation

链接: https://arxiv.org/abs/2411.12279
作者: Ziyang Zong,Zhaohuan Zhan,Guang Tan
关键词-EN: Large Language Model, conditional diffusion model, guides a Large, Large Language, Language Model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper proposes a two-phase text-to-floorplan generation method, which guides a Large Language Model (LLM) to generate an initial layout (Layout-LLM) and refines it into the final floorplan through a conditional diffusion model. We incorporate a Chain-of-Thought approach to prompt the LLM based on user text specifications, enabling a more user-friendly and intuitive house layout design. This method allows users to describe their needs in natural language, enhancing accessibility and providing clearer geometric constraints. The final floorplans generated by Layout-LLM through conditional diffusion refinement are more accurate and better meet user requirements. Experimental results demonstrate that our approach achieves state-of-the-art performance across all metrics, validating its effectiveness in practical home design applications. We plan to release our code for public use.

[CV-35] KDC-MAE: Knowledge Distilled Contrastive Mask Auto-Encoder

链接: https://arxiv.org/abs/2411.12270
作者: Maheswar Bora,Saurabh Atreya,Aritra Mukherjee,Abhijit Das
关键词-EN: major SSL frameworks, masked data modelling, Self-supervised Learning, knowledge distillation, combining contrastive learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this work, we attempt to extend this line of thought and showcase a way forward for the Self-supervised Learning (SSL) paradigm by combining contrastive learning, self-distillation (knowledge distillation), and masked data modelling, the three major SSL frameworks, to learn a joint and coordinated representation. The proposed technique learns through the collaborative power of the different SSL objectives. To learn them jointly, we propose a new SSL architecture, KDC-MAE, with a complementary masking strategy to learn modular correspondence and a weighted scheme to combine the objectives in a coordinated way. Experimental results show that the contrastive masking correspondence, along with the KD learning objective, enables better learning across multiple modalities and tasks.
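The complementary masking idea is easy to sketch in isolation: two views that partition the patch tokens, so the second view sees exactly what the first hides. The 50% ratio and ViT-Base token shapes below are assumptions; the full KDC-MAE objective (contrastive + KD + MAE) is not shown.

```python
import torch

def complementary_masks(n_tokens: int, ratio: float = 0.5):
    """Return two boolean masks that partition the patch tokens."""
    perm = torch.randperm(n_tokens)
    keep = int(n_tokens * ratio)
    m1 = torch.zeros(n_tokens, dtype=torch.bool)
    m1[perm[:keep]] = True
    return m1, ~m1               # view 2 sees exactly the tokens view 1 hides

tokens = torch.randn(196, 768)   # 14x14 patches at ViT-Base width (assumed)
m1, m2 = complementary_masks(tokens.shape[0])
view1, view2 = tokens[m1], tokens[m2]   # encode each view, then align/distill
```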

[CV-36] Prototype Optimization with Neural ODE for Few-Shot Learning

链接: https://arxiv.org/abs/2411.12259
作者: Baoquan Zhang,Shanshan Feng,Bingqi Shan,Xutao Li,Yunming Ye,Yew-Soon Ong
关键词-EN: Few-Shot Learning, challenging task, recognize novel classes, Learning, prototypes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: An extended version of metanode: prototype optimization as a neural ode for few-shot learning. arXiv admin note: text overlap with arXiv:2103.14341

点击查看摘要

Abstract:Few-Shot Learning (FSL) is a challenging task, which aims to recognize novel classes with few examples. Pre-training based methods effectively tackle the problem by pre-training a feature extractor and then performing class prediction via a cosine classifier with mean-based prototypes. Nevertheless, due to the data scarcity, the mean-based prototypes are usually biased. In this paper, we attempt to diminish the prototype bias by regarding it as a prototype optimization problem. To this end, we propose a novel prototype optimization framework to rectify prototypes, i.e., introducing a meta-optimizer to optimize prototypes. Although the existing meta-optimizers can also be adapted to our framework, they all overlook a crucial gradient bias issue, i.e., the mean-based gradient estimation is also biased on sparse data. To address this issue, in this paper, we regard the gradient and its flow as meta-knowledge and then propose a novel Neural Ordinary Differential Equation (ODE)-based meta-optimizer to optimize prototypes, called MetaNODE. Although MetaNODE has shown superior performance, it suffers from a huge computational burden. To further improve its computation efficiency, we conduct a detailed analysis on MetaNODE and then design an effective and efficient MetaNODE extension version (called E2MetaNODE). It consists of two novel modules: E2GradNet and E2Solver, which aim to estimate accurate gradient flows and solve optimal prototypes in an effective and efficient manner, respectively. Extensive experiments show that 1) our methods achieve superior performance over previous FSL methods and 2) our E2MetaNODE significantly improves computation efficiency meanwhile without performance degradation.

[CV-37] ADV2E: Bridging the Gap Between Analogue Circuit and Discrete Frames in the Video-to-Events Simulator

链接: https://arxiv.org/abs/2411.12250
作者: Xiao Jiang,Fei Zhou,Jiongzhi Lin
关键词-EN: Active Pixel Sensor, traditional Active Pixel, offering significant advantages, operate fundamentally differently, traditional Active
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Event cameras operate fundamentally differently from traditional Active Pixel Sensor (APS) cameras, offering significant advantages. Recent research has developed simulators to convert video frames into events, addressing the shortage of real event datasets. Current simulators primarily focus on the logical behavior of event cameras. However, the fundamental analogue properties of pixel circuits are seldom considered in simulator design. The gap between analogue pixel circuits and discrete video frames causes degradation of synthetic events, particularly in high-contrast scenes. In this paper, we propose a novel method of generating reliable event data based on a detailed analysis of the pixel circuitry in event cameras. We incorporate the analogue properties of event camera pixel circuits into the simulator design: (1) analogue filtering of signals from light intensity to events, and (2) a cutoff frequency that is independent of the video frame rate. Experimental results on two relevant tasks, including semantic segmentation and image reconstruction, validate the reliability of the simulated event data, even in high-contrast scenes. This demonstrates that deep neural networks exhibit strong generalization from simulated to real event data, confirming that the synthetic events generated by the proposed method are both realistic and well-suited for effective training.
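To illustrate the two analogue properties named above, here is a simplified event-synthesis sketch: a first-order RC low-pass filter with a fixed cutoff frequency (independent of frame rate) applied to log intensity, followed by threshold-crossing event generation. The constants and the filter form are illustrative assumptions, not the paper's exact circuit model.

```python
import numpy as np

def frames_to_events(frames, fps, cutoff_hz=200.0, theta=0.2):
    dt = 1.0 / fps
    alpha = dt / (dt + 1.0 / (2 * np.pi * cutoff_hz))   # first-order RC coefficient
    log_i = np.log(frames.astype(np.float64) + 1e-3)
    state = log_i[0].copy()      # filtered photoreceptor output
    ref = log_i[0].copy()        # per-pixel reference level of the last event
    events = []
    for k in range(1, len(frames)):
        state += alpha * (log_i[k] - state)             # analogue low-pass filtering
        d = state - ref
        ys, xs = np.where(np.abs(d) >= theta)           # contrast threshold crossings
        for y, x in zip(ys, xs):
            events.append((k * dt, x, y, int(np.sign(d[y, x]))))
            ref[y, x] = state[y, x]
    return events

frames = np.random.randint(0, 256, (10, 64, 64))        # stand-in video clip
events = frames_to_events(frames, fps=1000)
```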

[CV-38] Neuro-3D: Towards 3D Visual Decoding from EEG Signals

链接: https://arxiv.org/abs/2411.12248
作者: Zhanqiang Guo,Jiamin Wu,Yonghao Song,Weijian Mai,Qihao Zheng,Wanli Ouyang,Chunfeng Song
关键词-EN: Human perception, stereo processing, EEG signals, Human, visual
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Human perception of the visual world is shaped by the stereo processing of 3D information. Understanding how the brain perceives and processes 3D visual stimuli in the real world has been a longstanding endeavor in neuroscience. Towards this goal, we introduce a new neuroscience task: decoding 3D visual perception from EEG signals, a neuroimaging technique that enables real-time monitoring of neural dynamics enriched with complex visual cues. To provide the essential benchmark, we first present EEG-3D, a pioneering dataset featuring multimodal analysis data and extensive EEG recordings from 12 subjects viewing 72 categories of 3D objects rendered in both videos and images. Furthermore, we propose Neuro-3D, a 3D visual decoding framework based on EEG signals. This framework adaptively integrates EEG features derived from static and dynamic stimuli to learn complementary and robust neural representations, which are subsequently utilized to recover both the shape and color of 3D objects through the proposed diffusion-based colored point cloud decoder. To the best of our knowledge, we are the first to explore EEG-based 3D visual decoding. Experiments indicate that Neuro-3D not only reconstructs colored 3D objects with high fidelity, but also learns effective neural representations that enable insightful brain region analysis. The dataset and associated code will be made publicly available.

[CV-39] Invariant Shape Representation Learning For Image Classification

链接: https://arxiv.org/abs/2411.12201
作者: Tonmoy Hossain,Jing Ma,Jundong Li,Miaomiao Zhang
关键词-EN: Geometric shape features, Geometric shape, strong predictors, Geometric, features
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Geometric shape features have been widely used as strong predictors for image classification. Nevertheless, most existing classifiers such as deep neural networks (DNNs) directly leverage the statistical correlations between these shape features and target variables. However, these correlations can often be spurious and unstable across different environments (e.g., in different age groups, certain types of brain changes have unstable relations with neurodegenerative disease); hence leading to biased or inaccurate predictions. In this paper, we introduce a novel framework that for the first time develops invariant shape representation learning (ISRL) to further strengthen the robustness of image classifiers. In contrast to existing approaches that mainly derive features in the image space, our model ISRL is designed to jointly capture invariant features in latent shape spaces parameterized by deformable transformations. To achieve this goal, we develop a new learning paradigm based on invariant risk minimization (IRM) to learn invariant representations of image and shape features across multiple training distributions/environments. By embedding the features that are invariant with regard to target variables in different environments, our model consistently offers more accurate predictions. We validate our method by performing classification tasks on both simulated 2D images, real 3D brain and cine cardiovascular magnetic resonance images (MRIs). Our code is publicly available at this https URL.
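ISRL builds on invariant risk minimization; for orientation, here is the standard IRMv1-style penalty (Arjovsky et al.) applied to generic features rather than the paper's latent shape space. The environment list and loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Squared gradient of the risk w.r.t. a dummy scale (IRMv1)."""
    scale = torch.ones(1, requires_grad=True, device=logits.device)
    loss = F.cross_entropy(logits * scale, y)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

def irm_loss(model, envs, lam=1.0):
    """envs: list of (x, y) batches, one per training environment."""
    risks, penalties = [], []
    for x, y in envs:
        logits = model(x)
        risks.append(F.cross_entropy(logits, y))
        penalties.append(irm_penalty(logits, y))
    return torch.stack(risks).mean() + lam * torch.stack(penalties).mean()
```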

[CV-40] RoSIS: Robust Framework for Text-Promptable Surgical Instrument Segmentation Using Vision-Language Fusion

链接: https://arxiv.org/abs/2411.12199
作者: Tae-Min Choi,Juyoun Park
关键词-EN: deep learning-based research, learning-based research improving, research improving accuracy, Surgical instrument segmentation, computer-assisted surgeries
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 6 figures, submitted to IEEE transactions on Medical Imaging

点击查看摘要

Abstract:Surgical instrument segmentation (SIS) is an essential task in computer-assisted surgeries, with deep learning-based research improving accuracy in complex environments. Recently, text-promptable segmentation methods have been introduced to generate masks based on text prompts describing target objects. However, these methods assume that the object described by a given text prompt exists in the scene. This results in mask generation whenever a related text prompt is provided, even if the object is absent from the image. Existing methods handle this by using prompts only for objects known to be present in the image, which introduces inaccessible information in a vision-based method setting and results in unfair comparisons. For fair comparison, we redefine existing text-promptable SIS settings to robust conditions, called Robust text-promptable SIS (R-SIS), designed to forward prompts of all classes and determine the existence of an object from a given text prompt. Furthermore, we propose a novel framework, Robust Surgical Instrument Segmentation (RoSIS), which combines visual and language features for promptable segmentation in the R-SIS setting. RoSIS employs an encoder-decoder architecture with a Multi-Modal Fusion Block (MMFB) and a Selective Gate Block (SGB) to achieve balanced integration of vision and language features. Additionally, we introduce an iterative inference strategy that refines segmentation masks in two steps: an initial pass using name-based prompts, followed by a refinement step using location prompts. Experiments on various datasets and settings demonstrate that RoSIS outperforms existing vision-based and promptable methods under robust conditions.

[CV-41] MTFusion: Reconstructing Any 3D Object from Single Image Using Multi-word Textual Inversion

链接: https://arxiv.org/abs/2411.12197
作者: Yu Liu,Ruowei Wang,Jiaqi Li,Zixiang Xu,Qijun Zhao
关键词-EN: computer vision, long-standing problem, problem in computer, Neural Radiance Fields, Reconstructing
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: PRCV 2024

点击查看摘要

Abstract:Reconstructing 3D models from single-view images is a long-standing problem in computer vision. The latest advances for single-image 3D reconstruction extract a textual description from the input image and further utilize it to synthesize 3D models. However, existing methods focus on capturing a single key attribute of the image (e.g., object type, artistic style) and fail to consider the multi-perspective information required for accurate 3D reconstruction, such as object shape and material properties. Besides, the reliance on Neural Radiance Fields hinders their ability to reconstruct intricate surfaces and texture details. In this work, we propose MTFusion, which leverages both image data and textual descriptions for high-fidelity 3D reconstruction. Our approach consists of two stages. First, we adopt a novel multi-word textual inversion technique to extract a detailed text description capturing the image’s characteristics. Then, we use this description and the image to generate a 3D model with FlexiCubes. Additionally, MTFusion enhances FlexiCubes by employing a special decoder network for Signed Distance Functions, leading to faster training and finer surface representation. Extensive evaluations demonstrate that our MTFusion surpasses existing image-to-3D methods on a wide range of synthetic and real-world images. Furthermore, the ablation study proves the effectiveness of our network designs.

[CV-42] A Survey of Medical Vision-and-Language Applications and Their Techniques

链接: https://arxiv.org/abs/2411.12195
作者: Qi Chen,Ruoshan Zhao,Sinuo Wang,Vu Minh Hieu Phan,Anton van den Hengel,Johan Verjans,Zhibin Liao,Minh-Son To,Yong Xia,Jian Chen,Yutong Xie,Qi Wu
关键词-EN: attracted substantial interest, substantial interest due, complex medical data, natural language interface, Medical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Their applications are versatile and have the potential to improve diagnostic accuracy and decision-making for individual patients while also contributing to enhanced public health monitoring, disease surveillance, and policy-making through more efficient analysis of large data sets. MVLMs integrate natural language processing with medical images to enable a more comprehensive and contextual understanding of medical images alongside their corresponding textual information. Unlike general vision-and-language models trained on diverse, non-specialized datasets, MVLMs are purpose-built for the medical domain, automatically extracting and interpreting critical information from medical images and textual reports to support clinical decision-making. Popular clinical applications of MVLMs include automated medical report generation, medical visual question answering, medical multimodal segmentation, diagnosis and prognosis, and medical image-text retrieval. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We conduct a detailed analysis of various vision-and-language model architectures, focusing on their distinct strategies for cross-modal integration/exploitation of medical visual and textual features. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics. Furthermore, we highlight potential challenges and summarize future research trends and directions. The full collection of papers and codes is available at: this https URL.

[CV-43] Constant Rate Schedule: Constant-Rate Distributional Change for Efficient Training and Sampling in Diffusion Models

链接: https://arxiv.org/abs/2411.12188
作者: Shuntaro Okada,Kenji Doi,Ryota Yoshihashi,Hirokatsu Kataoka,Tomohiro Tanaka
关键词-EN: noise schedule, probability distribution, rate of change, ensures a constant, diffused data
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 33 pages, 9 figures

点击查看摘要

Abstract:We propose a noise schedule that ensures a constant rate of change in the probability distribution of diffused data throughout the diffusion process. To obtain this noise schedule, we measure the rate of change in the probability distribution of the forward process and use it to determine the noise schedule before training diffusion models. The functional form of the noise schedule is automatically determined and tailored to each dataset and type of diffusion model. We evaluate the effectiveness of our noise schedule on unconditional and class-conditional image generation tasks using the LSUN (bedroom/church/cat/horse), ImageNet, and FFHQ datasets. Through extensive experiments, we confirmed that our noise schedule broadly improves the performance of the diffusion models regardless of the dataset, sampler, number of function evaluations, or type of diffusion model.
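A simplified sketch of the constant-rate idea: measure a proxy for how fast the diffused distribution changes along a fine reference schedule, then re-space the timesteps so each step contributes an equal share of the total change. The proxy used below (the change in the signal rate sqrt(alpha_bar)) is an assumption; the paper measures distributional change directly per dataset and model.

```python
import numpy as np

fine_t = np.linspace(0.0, 1.0, 10_000)
alpha_bar = np.cos(0.5 * np.pi * fine_t) ** 2            # reference cosine schedule
rate = np.abs(np.gradient(np.sqrt(alpha_bar), fine_t))   # change-rate proxy
cum = np.cumsum(rate)
cum /= cum[-1]                                           # normalized cumulative change

T = 1000
targets = np.linspace(0.0, 1.0, T + 1)[1:]               # equal change per step
t_sched = np.interp(targets, cum, fine_t)                # constant-rate timesteps
alpha_bar_sched = np.interp(t_sched, fine_t, alpha_bar)  # schedule to train/sample with
```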

[CV-44] Robust 3D Semantic Occupancy Prediction with Calibration-free Spatial Transformation

链接: https://arxiv.org/abs/2411.12177
作者: Zhuangwei Zhuang,Ziyin Wang,Sitao Chen,Lizhao Liu,Hui Luo,Mingkui Tan
关键词-EN: autonomous driving systems, driving systems, comprehensive representations, sensor calibration, accurate sensor calibration
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 11 figures, 18 tables

点击查看摘要

Abstract:3D semantic occupancy prediction, which seeks to provide accurate and comprehensive representations of environment scenes, is important to autonomous driving systems. For autonomous cars equipped with multi-camera and LiDAR, it is critical to aggregate multi-sensor information into a unified 3D space for accurate and robust predictions. Recent methods are mainly built on the 2D-to-3D transformation that relies on sensor calibration to project the 2D image information into the 3D space. These methods, however, suffer from two major limitations: First, they rely on accurate sensor calibration and are sensitive to the calibration noise, which limits their application in real complex environments. Second, the spatial transformation layers are computationally expensive and limit their running on an autonomous vehicle. In this work, we attempt to exploit a Robust and Efficient 3D semantic Occupancy (REO) prediction scheme. To this end, we propose a calibration-free spatial transformation based on vanilla attention to implicitly model the spatial correspondence. In this way, we robustly project the 2D features to a predefined BEV plane without using sensor calibration as input. Then, we introduce 2D and 3D auxiliary training tasks to enhance the discrimination power of 2D backbones on spatial, semantic, and texture features. Last, we propose a query-based prediction scheme to efficiently generate large-scale fine-grained occupancy predictions. By fusing point clouds that provide complementary spatial information, our REO surpasses the existing methods by a large margin on three benchmarks, including OpenOccupancy, Occ3D-nuScenes, and SemanticKITTI Scene Completion. For instance, our REO achieves a 19.8× speedup compared to Co-Occ, with a 1.1 improvement in geometry IoU on OpenOccupancy. Our code will be available at this https URL.

[CV-45] AsynEIO: Asynchronous Monocular Event-Inertial Odometry Using Gaussian Process Regression

链接: https://arxiv.org/abs/2411.12175
作者: Zhixiang Wang,Xudong Li,Yizhai Zhang,Fan Zhang,Panfeng
关键词-EN: show significant potential, show significant, low-light environments, significant potential, potential for motion
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to IEEE (2024-11-4)

点击查看摘要

Abstract:Event cameras, when combined with inertial sensors, show significant potential for motion estimation in challenging scenarios, such as high-speed maneuvers and low-light environments. There are many methods for producing such estimations, but most boil down to a synchronous discrete-time fusion problem. However, the asynchronous nature of event cameras and their unique fusion mechanism with inertial sensors remain underexplored. In this paper, we introduce a monocular event-inertial odometry method called AsynEIO, designed to fuse asynchronous event and inertial data within a unified Gaussian Process (GP) regression framework. Our approach incorporates an event-driven frontend that tracks feature trajectories directly from raw event streams at a high temporal resolution. These tracked feature trajectories, along with various inertial factors, are integrated into the same GP regression framework to enable asynchronous fusion. With deriving analytical residual Jacobians and noise models, our method constructs a factor graph that is iteratively optimized and pruned using a sliding-window optimizer. Comparative assessments highlight the performance of different inertial fusion strategies, suggesting optimal choices for varying conditions. Experimental results on both public datasets and our own event-inertial sequences indicate that AsynEIO outperforms existing methods, especially in high-speed and low-illumination scenarios.

[CV-46] Sketch-guided Cage-based 3D Gaussian Splatting Deformation

链接: https://arxiv.org/abs/2411.12168
作者: Tianhao Xie,Noam Aigerman,Eugene Belilovsky,Tiberiu Popa
关键词-EN: Gaussian Splatting, received great interest, computer vision, computer graphics, received great
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 10 pages, 9 figures

点击查看摘要

Abstract:3D Gaussian Splatting (GS) is one of the most promising novel 3D representations that has received great interest in computer graphics and computer vision. While various systems have introduced editing capabilities for 3D GS, such as those guided by text prompts, fine-grained control over deformation remains an open challenge. In this work, we present a novel sketch-guided 3D GS deformation system that allows users to intuitively modify the geometry of a 3D GS model by drawing a silhouette sketch from a single viewpoint. Our approach introduces a new deformation method that combines cage-based deformations with a variant of Neural Jacobian Fields, enabling precise, fine-grained control. Additionally, it leverages large-scale 2D diffusion priors and ControlNet to ensure the generated deformations are semantically plausible. Through a series of experiments, we demonstrate the effectiveness of our method and showcase its ability to animate static 3D GS models as one of its key applications.

[CV-47] Self-Supervised Learning in Deep Networks: A Pathway to Robust Few-Shot Classification

链接: https://arxiv.org/abs/2411.12151
作者: Yuyang Xiao
关键词-EN: model feature extraction, combining self-supervised learning, deep network model, few-shot image classification, image classification task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This study aims to optimize the few-shot image classification task and improve the model’s feature extraction and classification performance by combining self-supervised learning with the deep network model ResNet-101. During the training process, we first pre-train the model with self-supervision to enable it to learn common feature expressions on a large amount of unlabeled data; then fine-tune it on the few-shot dataset Mini-ImageNet to improve the model’s accuracy and generalization ability under limited data. The experimental results show that compared with traditional convolutional neural networks, ResNet-50, DenseNet, and other models, our method achieves about 95.12% in classification accuracy (ACC) and F1 score, verifying the effectiveness of self-supervised learning in few-shot classification. This method provides an efficient and reliable solution for the field of few-shot image classification.

[CV-48] FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting

链接: https://arxiv.org/abs/2411.12089
作者: Fangyu Wu,Yuhao Chen
关键词-EN: generation tasks today, real world, sliced or cut, tasks today, generation tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:In the real world, objects reveal internal textures when sliced or cut, yet this behavior is not well-studied in 3D generation tasks today. For example, slicing a virtual 3D watermelon should reveal flesh and seeds. Given that no available dataset captures an object’s full internal structure and collecting data from all slices is impractical, generative methods become the obvious approach. However, current 3D generation and inpainting methods often focus on visible appearance and overlook internal textures. To bridge this gap, we introduce FruitNinja, the first method to generate internal textures for 3D objects undergoing geometric and topological changes. Our approach produces objects via 3D Gaussian Splatting (3DGS) with both surface and interior textures synthesized, enabling real-time slicing and rendering without additional optimization. FruitNinja leverages a pre-trained diffusion model to progressively inpaint cross-sectional views and applies voxel-grid-based smoothing to achieve cohesive textures throughout the object. Our OpaqueAtom GS strategy overcomes 3DGS limitations by employing densely distributed opaque Gaussians, avoiding biases toward larger particles that destabilize training and sharp color transitions for fine-grained textures. Experimental results show that FruitNinja substantially outperforms existing approaches, showcasing unmatched visual quality in real-time rendered internal views across arbitrary geometry manipulations.

[CV-49] Autoassociative Learning of Structural Representations for Modeling and Classification in Medical Imaging

链接: https://arxiv.org/abs/2411.12070
作者: Zuzanna Buchnajzer,Kacper Dobek,Stanisław Hapke,Daniel Jankowski,Krzysztof Krawiec
关键词-EN: convolutional neural networks, neural networks tend, smooth features, learning architectures based, rely on continuous
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 16 pages, 9 figures

点击查看摘要

Abstract:Deep learning architectures based on convolutional neural networks tend to rely on continuous, smooth features. While this characteristic provides significant robustness and proves useful in many real-world tasks, it is strikingly incompatible with the physical characteristic of the world, which, at the scale in which humans operate, comprises crisp objects, typically representing well-defined categories. This study proposes a class of neurosymbolic systems that learn by reconstructing the observed images in terms of visual primitives and are thus forced to form high-level, structural explanations of them. When applied to the task of diagnosing abnormalities in histological imaging, the method proved superior to a conventional deep learning architecture in terms of classification accuracy, while being more transparent.

[CV-50] ITACLIP: Boosting Training-Free Semantic Segmentation with Image Text and Architectural Enhancements

链接: https://arxiv.org/abs/2411.12044
作者: M. Arda Aydın,Efe Mert Çırpar,Elvin Abdinli,Gozde Unal,Yusuf H. Sahin
关键词-EN: computer vision tasks, Vision Language Models, foundational Vision Language, Large Language Models, Recent advances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP’s open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at this https URL.
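Two of the listed ingredients are easy to sketch in isolation, using random tensors in place of real CLIP activations: fusing middle-layer attention maps into the last layer, and averaging text embeddings over LLM-generated synonyms/definitions. The simple-mean fusion rule and all tensor shapes below are assumptions, not ITACLIP's exact design.

```python
import torch

# (1) attention fusion: attn[l] has shape (heads, tokens, tokens)
attn = [torch.softmax(torch.randn(12, 197, 197), dim=-1) for _ in range(12)]
middle = torch.stack(attn[4:11]).mean(dim=0)     # pool a band of middle layers
fused_last = 0.5 * attn[-1] + 0.5 * middle       # inject into the final layer

# (2) one text prototype per class from its name plus synonyms/definitions
def class_prototype(embed_fn, names):
    vecs = torch.stack([embed_fn(n) for n in names])
    vecs = vecs / vecs.norm(dim=-1, keepdim=True)
    return vecs.mean(dim=0)

embed_fn = lambda s: torch.randn(512)            # stand-in for CLIP's text encoder
proto = class_prototype(embed_fn, ["cat", "kitten", "a small domesticated feline"])
```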

[CV-51] In-Situ Melt Pool Characterization via Thermal Imaging for Defect Detection in Directed Energy Deposition Using Vision Transformers

链接: https://arxiv.org/abs/2411.12028
作者: Israt Zarin Era,Fan Zhou,Ahmed Shoyeb Raihan,Imtiaz Ahmed,Alan Abul-Haj,James Craig,Srinjoy Das,Zhichao Liu
关键词-EN: Directed Energy Deposition, Directed Energy, Energy Deposition, offers significant potential, fine-tuned MAE Encoder
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Directed Energy Deposition (DED) offers significant potential for manufacturing complex and multi-material parts. However, internal defects such as porosity and cracks can compromise mechanical properties and overall performance. This study focuses on in-situ monitoring and characterization of melt pools associated with porosity, aiming to improve defect detection and quality control in DED-printed parts. Traditional machine learning approaches for defect identification rely on extensive labeled datasets, often scarce and expensive to generate in real-world manufacturing. To address this, our framework employs self-supervised learning on unlabeled melt pool data using a Vision Transformer-based Masked Autoencoder (MAE) to produce highly representative embeddings. These fine-tuned embeddings are leveraged via transfer learning to train classifiers on a limited labeled dataset, enabling the effective identification of melt pool anomalies. We evaluate two classifiers: (1) a Vision Transformer (ViT) classifier utilizing the fine-tuned MAE Encoder’s parameters and (2) the fine-tuned MAE Encoder combined with an MLP classifier head. Our framework achieves overall accuracy ranging from 95.44% to 99.17% and an average F1 score exceeding 80%, with the ViT Classifier slightly outperforming the MAE Encoder Classifier. This demonstrates the scalability and cost-effectiveness of our approach for automated quality control in DED, effectively detecting defects with minimal labeled data.

[CV-52] Analyzing and Improving the Skin Tone Consistency and Bias in Implicit 3D Relightable Face Generators WACV2025

链接: https://arxiv.org/abs/2411.12002
作者: Libing Zeng,Nima Khademi Kalantari
关键词-EN: generative adversarial networks, relightable face generation, received significant attention, adversarial networks, neural rendering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 10 figures, 5 tables, WACV 2025

点击查看摘要

Abstract:With the advances in generative adversarial networks (GANs) and neural rendering, 3D relightable face generation has received significant attention. Among the existing methods, a particularly successful technique uses an implicit lighting representation and generates relit images through the product of synthesized albedo and light-dependent shading images. While this approach produces high-quality results with intricate shading details, it often has difficulty producing relit images with consistent skin tones, particularly when the lighting condition is extracted from images of individuals with dark skin. Additionally, this technique is biased towards producing albedo images with lighter skin tones. Our main observation is that this problem is rooted in the biased spherical harmonics (SH) coefficients, used during training. Following this observation, we conduct an analysis and demonstrate that the bias appears not only in band 0 (DC term), but also in the other bands of the estimated SH coefficients. We then propose a simple, but effective, strategy to mitigate the problem. Specifically, we normalize the SH coefficients by their DC term to eliminate the inherent magnitude bias, while statistically aligning the coefficients in the other bands to alleviate the directional bias. We also propose a scaling strategy to match the distribution of illumination magnitude in the generated images with the training data. Through extensive experiments, we demonstrate the effectiveness of our solution in increasing the skin tone consistency and mitigating bias.
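A small sketch of the described debiasing: divide the 9 second-order SH coefficients by the DC term to remove the magnitude bias, then align the statistics of the remaining bands to reference statistics. The per-coefficient mean/std matching is an assumed simplification of the paper's alignment.

```python
import numpy as np

def normalize_sh(sh, ref_mean, ref_std, eps=1e-8):
    """sh: (N, 9) band 0-2 SH lighting coefficients for N samples."""
    sh = sh / (sh[:, :1] + eps)                          # remove DC magnitude bias
    rest = (sh[:, 1:] - sh[:, 1:].mean(axis=0)) / (sh[:, 1:].std(axis=0) + eps)
    sh[:, 1:] = rest * ref_std + ref_mean                # align directional bands
    return sh

sh = np.random.randn(100, 9) * 0.3 + 0.5                 # stand-in estimated SH
out = normalize_sh(sh, ref_mean=np.zeros(8), ref_std=np.ones(8) * 0.2)
```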

[CV-53] Coverage-Constrained Human-AI Cooperation with Multiple Experts

链接: https://arxiv.org/abs/2411.11976
作者: Zheng Zhang,Cuong Nguyen,Kevin Wells,Thanh-Toan Do,Gustavo Carneiro
关键词-EN: Human-AI cooperative classification, develop hybrid intelligent, hybrid intelligent systems, Human-AI cooperative, cooperative classification
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Human-AI cooperative classification (HAI-CC) approaches aim to develop hybrid intelligent systems that enhance decision-making in various high-stakes real-world scenarios by leveraging both human expertise and AI capabilities. Current HAI-CC methods primarily focus on learning-to-defer (L2D), where decisions are deferred to human experts, and learning-to-complement (L2C), where AI and human experts make predictions cooperatively. However, a notable research gap remains in effectively exploring both L2D and L2C under diverse expert knowledge to improve decision-making, particularly when constrained by the cooperation cost required to achieve a target probability for AI-only selection (i.e., coverage). In this paper, we address this research gap by proposing the Coverage-constrained Learning to Defer and Complement with Specific Experts (CL2DC) method. CL2DC makes final decisions through either AI prediction alone or by deferring to or complementing a specific expert, depending on the input data. Furthermore, we propose a coverage-constrained optimisation to control the cooperation cost, ensuring it approximates a target probability for AI-only selection. This approach enables an effective assessment of system performance within a specified budget. Also, CL2DC is designed to address scenarios where training sets contain multiple noisy-label annotations without any clean-label references. Comprehensive evaluations on both synthetic and real-world datasets demonstrate that CL2DC achieves superior performance compared to state-of-the-art HAI-CC methods.

[CV-54] TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction

链接: https://arxiv.org/abs/2411.11941
作者: DaDong Jiang,Zhihui Ke,Xiaobo Zhou,Zhi Hou,Xianghui Yang,Wenbo Hu,Tie Qiu,Chunchao Guo
关键词-EN: Dynamic scene reconstruction, long-term challenge, dynamic scenes, Gaussian Splatting, Gaussians reconstruction methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Dynamic scene reconstruction is a long-term challenge in 3D vision. Recent methods extend 3D Gaussian Splatting to dynamic scenes via additional deformation fields and apply explicit constraints like motion flow to guide the deformation. However, they learn motion changes from individual timestamps independently, making it challenging to reconstruct complex scenes, particularly when dealing with violent movement, extreme-shaped geometries, or reflective surfaces. To address the above issue, we design a plug-and-play module called TimeFormer to enable existing deformable 3D Gaussians reconstruction methods with the ability to implicitly model motion patterns from a learning perspective. Specifically, TimeFormer includes a Cross-Temporal Transformer Encoder, which adaptively learns the temporal relationships of deformable 3D Gaussians. Furthermore, we propose a two-stream optimization strategy that transfers the motion knowledge learned from TimeFormer to the base stream during the training phase. This allows us to remove TimeFormer during inference, thereby preserving the original rendering speed. Extensive experiments in the multi-view and monocular dynamic scenes validate qualitative and quantitative improvement brought by TimeFormer. Project Page: this https URL

[CV-55] Fair Distillation: Teaching Fairness from Biased Teachers in Medical Imaging

链接: https://arxiv.org/abs/2411.11939
作者: Milad Masroor,Tahir Hassan,Yu Tian,Kevin Wells,David Rosewarne,Thanh-Toan Do,Gustavo Carneiro
关键词-EN: achieved remarkable success, Deep learning, learning has achieved, achieved remarkable, remarkable success
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep learning has achieved remarkable success in image classification and segmentation tasks. However, fairness concerns persist, as models often exhibit biases that disproportionately affect demographic groups defined by sensitive attributes such as race, gender, or age. Existing bias-mitigation techniques, including Subgroup Re-balancing, Adversarial Training, and Domain Generalization, aim to balance accuracy across demographic groups, but often fail to simultaneously improve overall accuracy, group-specific accuracy, and fairness due to conflicts among these interdependent objectives. We propose the Fair Distillation (FairDi) method, a novel fairness approach that decomposes these objectives by leveraging “biased teacher” models, each optimized for a specific demographic group. These teacher models then guide the training of a unified “student” model, which distills their knowledge to maximize overall and group-specific accuracies, while minimizing inter-group disparities. Experiments on medical imaging datasets show that FairDi achieves significant gains in both overall and group-specific accuracy, along with improved fairness, compared to existing methods. FairDi is adaptable to various medical tasks, such as classification and segmentation, and provides an effective solution for equitable model performance.
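A minimal sketch of group-wise distillation in the spirit of FairDi: each sample is distilled from the teacher trained for its own demographic group, on top of the usual supervised loss. The KD form (temperature-scaled KL) and the loss weighting are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def fairdi_loss(student_logits, labels, group_ids, teachers_logits, T=2.0, lam=0.5):
    """teachers_logits: (num_groups, batch, classes) precomputed teacher outputs."""
    ce = F.cross_entropy(student_logits, labels)
    idx = torch.arange(student_logits.shape[0])
    t_logits = teachers_logits[group_ids, idx]          # each sample's own teacher
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return ce + lam * kd

logits = torch.randn(8, 3)
teachers = torch.randn(2, 8, 3)                         # two biased teachers
loss = fairdi_loss(logits, torch.randint(0, 3, (8,)),
                   torch.randint(0, 2, (8,)), teachers)
```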

[CV-56] Calibrated and Efficient Sampling-Free Confidence Estimation for LiDAR Scene Semantic Segmentation

链接: https://arxiv.org/abs/2411.11935
作者: Hanieh Shojaei Miandashti,Qianqian Zou,Claus Brenner
关键词-EN: Reliable deep learning, dependable uncertainty estimation, deep learning models, learning models require, ensure dependable uncertainty
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliable deep learning models require not only accurate predictions but also well-calibrated confidence estimates to ensure dependable uncertainty estimation. This is crucial in safety-critical applications like autonomous driving, which depend on rapid and precise semantic segmentation of LiDAR point clouds for real-time 3D scene understanding. In this work, we introduce a sampling-free approach for estimating well-calibrated confidence values for classification tasks, achieving alignment with true classification accuracy and significantly reducing inference time compared to sampling-based methods. Our evaluation using the Adaptive Calibration Error (ACE) metric for LiDAR semantic segmentation shows that our approach maintains well-calibrated confidence values while achieving increased processing speed compared to a sampling baseline. Additionally, reliability diagrams reveal that our method produces underconfident rather than overconfident predictions, an advantage for safety-critical applications. Our sampling-free approach offers well-calibrated and time-efficient predictions for LiDAR scene semantic segmentation.

[CV-57] SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input

链接: https://arxiv.org/abs/2411.11934
作者: Zhen Lv,Yangqi Long,Congzhentao Huang,Cao Li,Chengfei Lv,Hao Ren,Dian Zheng
关键词-EN: Stereo video synthesis, virtual reality, Stereo video, monocular input, fields of spatial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Stereo video synthesis from a monocular input is a demanding task in the fields of spatial computing and virtual reality. The main challenges of this task lie in the insufficiency of high-quality paired stereo videos for training and the difficulty of maintaining spatio-temporal consistency between frames. Existing methods primarily address these issues by directly applying novel view synthesis (NVS) techniques to video, while facing limitations such as the inability to effectively represent dynamic scenes and the requirement for large amounts of training data. In this paper, we introduce a novel self-supervised stereo video synthesis paradigm via a video diffusion model, termed SpatialDreamer, which meets the challenges head-on. Firstly, to address the stereo video data insufficiency, we propose a Depth-based Video Generation (DVG) module, which employs a forward-backward rendering mechanism to generate paired videos with geometric and temporal priors. Leveraging data generated by DVG, we propose RefinerNet along with a self-supervised synthetic framework designed to facilitate efficient and dedicated training. More importantly, we devise a consistency control module, which consists of a stereo deviation strength metric and a Temporal Interaction Learning (TIL) module to ensure geometric and temporal consistency, respectively. We evaluated the proposed method against various benchmark methods, with the results showcasing its superior performance.

[CV-58] FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training

链接: https://arxiv.org/abs/2411.11927
作者: Anjia Cao,Xing Wei,Zhiheng Ma
关键词-EN: faces significant challenges, significant challenges due, pre-training faces significant, Language-image pre-training faces, Frozen Large lAnguage
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders. While prevailing methods attempt to address these issues through data augmentation and architecture modifications, they continue to struggle with processing long-form text inputs, and the inherent limitations of traditional CLIP text encoders lead to suboptimal downstream generalization. In this paper, we propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders, naturally processing long text inputs and demonstrating impressive multilingual generalization. FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, which better aligns with the multifaceted nature of images, and 2) a facet-decoupled attention mechanism, complemented by an offline embedding strategy, to ensure efficient computation. Extensive empirical evaluations demonstrate FLAME's superior performance. When trained on CC3M, FLAME surpasses the previous state-of-the-art by 4.9% in ImageNet top-1 accuracy. On YFCC15M, FLAME surpasses the WIT-400M-trained CLIP by 44.4% in average image-to-text recall@1 across 36 languages, and by 34.6% in text-to-image recall@1 for long-context retrieval on Urban-1k. Code is available at this https URL.

[CV-59] KAN-Mamba FusionNet: Redefining Medical Image Segmentation with Non-Linear Modeling

链接: https://arxiv.org/abs/2411.11926
作者: Akansh Agrawal,Akshan Agrawal,Shashwat Gupta,Priyanka Bagade
关键词-EN: Medical image segmentation, image segmentation, Medical image, robotic surgeries, treatment plans
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Medical image segmentation is crucial in robotic surgeries, disease diagnosis, and treatment plans. This research presents an innovative methodology that combines Kolmogorov-Arnold Networks (KAN) with an adapted Mamba layer for medical image segmentation. The proposed KAN-Mamba FusionNet framework improves image segmentation by integrating attention-driven mechanisms with convolutional parallel training and autoregressive deployment, while preserving interpretability, in contrast to the state-of-the-art techniques that depend exclusively on Mamba for ailment localization and accurate diagnosis. We evaluated our proposed KAN-Mamba FusionNet model on three distinct medical image segmentation datasets, BUSI, Kvasir-Seg and GlaS. The results indicated that the KAN-Mamba FusionNet consistently yields better IoU and F1 scores in comparison to the state-of-the-art methods. Further, we offer insights into the model’s behavior via ablation studies, examining the effects of various components and assessing their contributions to the overall performance of the proposed model. The findings illustrate the strength and effectiveness of this methodology for dependable medical image segmentation, providing a unique approach to address intricate visual data issues in healthcare.

[CV-60] Continuous Speculative Decoding for Autoregressive Image Generation

链接: https://arxiv.org/abs/2411.11925
作者: Zili Wang,Robert Zhang,Kun Ding,Qi Yang,Fei Li,Shiming Xiang
关键词-EN: higher generation fidelity, showcasing considerable reconstruction, demonstrated notable superiority, considerable reconstruction quality, image generation models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Continuous-valued Autoregressive (AR) image generation models have demonstrated notable superiority over their discrete-token counterparts, showcasing considerable reconstruction quality and higher generation fidelity. However, the computational demands of the autoregressive framework result in significant inference overhead. While speculative decoding has proven effective in accelerating Large Language Models (LLMs), its adaptation to continuous-valued visual autoregressive models remains unexplored. This work generalizes the speculative decoding algorithm from discrete tokens to continuous space. By analyzing the intrinsic properties of output distribution, we establish a tailored acceptance criterion for the diffusion distributions prevalent in such models. To overcome the inconsistency that arises in speculative decoding output distributions, we introduce denoising trajectory alignment and token pre-filling methods. Additionally, we identify the hard-to-sample distribution in the rejection phase. To mitigate this issue, we propose a meticulous acceptance-rejection sampling method with a proper upper bound, thereby circumventing complex integration. Experimental results show that our continuous speculative decoding achieves a remarkable 2.33× speed-up on off-the-shelf models while maintaining the output distribution. Codes will be available at this https URL
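
For intuition, the standard speculative acceptance rule carries over to continuous samples as accept-with-probability min(1, p(x)/q(x)), with rejected drafts resampled from the residual density proportional to max(0, p − q). The toy sketch below uses 1-D Gaussians purely for illustration; the paper's actual distributions are diffusion outputs with a tailored acceptance criterion and trajectory alignment, which this sketch does not reproduce.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def accept_draft(x, log_p, log_q):
    """Accept a draft sample x ~ q with probability min(1, p(x)/q(x))."""
    return np.log(rng.uniform()) < min(0.0, log_p(x) - log_q(x))

p = norm(loc=0.0, scale=1.0)   # stand-in for the target model's distribution
q = norm(loc=0.2, scale=1.1)   # stand-in for the cheap draft model

draft = float(q.rvs(random_state=rng))
if accept_draft(draft, p.logpdf, q.logpdf):
    sample = draft
else:
    # Resample from the residual density proportional to max(p - q, 0):
    # propose from p and accept with probability max(p - q, 0) / p.
    while True:
        cand = float(p.rvs(random_state=rng))
        if rng.uniform() * p.pdf(cand) < max(p.pdf(cand) - q.pdf(cand), 0.0):
            sample = cand
            break
print(sample)
```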

[CV-61] Dataset Distillers Are Good Label Denoisers In the Wild

链接: https://arxiv.org/abs/2411.11924
作者: Lechao Cheng,Kaifeng Chen,Jiyang Li,Shengeng Tang,Shufei Zhang,Meng Wang
关键词-EN: adapting deep learning, deep learning models, deep learning, essential for adapting, adapting deep
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Learning from noisy data has become essential for adapting deep learning models to real-world applications. Traditional methods often involve first evaluating the noise and then applying strategies such as discarding noisy samples, re-weighting, or re-labeling. However, these methods can fall into a vicious cycle when the initial noise evaluation is inaccurate, leading to suboptimal performance. To address this, we propose a novel approach that leverages dataset distillation for noise removal. This method avoids the feedback loop common in existing techniques and enhances training efficiency, while also providing strong privacy protection through offline processing. We rigorously evaluate three representative dataset distillation methods (DATM, DANCE, and RCIG) under various noise conditions, including symmetric noise, asymmetric noise, and real-world natural noise. Our empirical findings reveal that dataset distillation effectively serves as a denoising tool in random noise scenarios but may struggle with structured asymmetric noise patterns, which can be absorbed into the distilled samples. Additionally, clean but challenging samples, such as those from tail classes in imbalanced datasets, may undergo lossy compression during distillation. Despite these challenges, our results highlight that dataset distillation holds significant promise for robust model training, especially in high-privacy environments where noise is prevalent.

[CV-62] SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

链接: https://arxiv.org/abs/2411.11922
作者: Cheng-Yen Yang,Hsiang-Wei Huang,Wenhao Chai,Zhongyu Jiang,Jenq-Neng Hwang
关键词-EN: managing crowded scenes, object segmentation tasks, visual object tracking, Segment Anything Model, segmentation tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT_ext and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments. Code and results are available at this https URL.

[CV-63] DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes

链接: https://arxiv.org/abs/2411.11921
作者: Chensheng Peng,Chengwei Zhang,Yixiao Wang,Chenfeng Xu,Yichen Xie,Wenzhao Zheng,Kurt Keutzer,Masayoshi Tomizuka,Wei Zhan
关键词-EN: enabling effective static-dynamic, effective static-dynamic decomposition, gaussian splatting representation, complex driving scenarios, enabling effective
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present DeSiRe-GS, a self-supervised gaussian splatting representation, enabling effective static-dynamic decomposition and high-fidelity surface reconstruction in complex driving scenarios. Our approach employs a two-stage optimization pipeline of dynamic street Gaussians. In the first stage, we extract 2D motion masks based on the observation that 3D Gaussian Splatting inherently can reconstruct only the static regions in dynamic environments. These extracted 2D motion priors are then mapped into the Gaussian space in a differentiable manner, leveraging an efficient formulation of dynamic Gaussians in the second stage. Combined with the introduced geometric regularizations, our method is able to address the over-fitting issues caused by data sparsity in autonomous driving, reconstructing physically plausible Gaussians that align with object surfaces rather than floating in air. Furthermore, we introduce temporal cross-view consistency to ensure coherence across time and viewpoints, resulting in high-quality surface reconstruction. Comprehensive experiments demonstrate the efficiency and effectiveness of DeSiRe-GS, surpassing prior self-supervised arts and achieving accuracy comparable to methods relying on external 3D bounding box annotations. Code is available at this https URL

[CV-64] VL-Uncertainty: Detecting Hallucination in Large Vision-Language Model via Uncertainty Estimation

链接: https://arxiv.org/abs/2411.11919
作者: Ruiyang Zhang,Hu Zhang,Zhedong Zheng
关键词-EN: large vision-language models, wider safety concerns, higher information load, information load processed, vision-language models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Given the higher information load processed by large vision-language models (LVLMs) compared to single-modal LLMs, detecting LVLM hallucinations requires more human effort and time, and thus raises wider safety concerns. In this paper, we introduce VL-Uncertainty, the first uncertainty-based framework for detecting hallucinations in LVLMs. Different from most existing methods that require ground-truth or pseudo annotations, VL-Uncertainty utilizes uncertainty as an intrinsic metric. We measure uncertainty by analyzing the prediction variance across semantically equivalent but perturbed prompts, including visual and textual data. When LVLMs are highly confident, they provide consistent responses to semantically equivalent queries. However, when uncertain, the responses of the target LVLM become more random. Considering semantically similar answers with different wordings, we cluster LVLM responses based on their semantic content and then calculate the cluster distribution entropy as the uncertainty measure to detect hallucination. Our extensive experiments on 10 LVLMs across four benchmarks, covering both free-form and multi-choice tasks, show that VL-Uncertainty significantly outperforms strong baseline methods in hallucination detection.
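
A minimal sketch of the uncertainty score described above: embed the answers produced under semantically equivalent perturbed prompts, cluster them, and take the entropy of the cluster distribution. The embedding model and clustering threshold here are assumptions, not the paper's choices (the `metric` argument requires scikit-learn >= 1.2).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sentence_transformers import SentenceTransformer

def semantic_entropy(responses, distance_threshold=0.3):
    """Embed answers, merge semantically similar ones, and score uncertainty
    as the entropy of the resulting cluster-size distribution."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder
    emb = embedder.encode(responses, normalize_embeddings=True)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold,
        metric="cosine", linkage="average").fit_predict(emb)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())   # higher -> more likely hallucination
```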

[CV-65] FCC: Fully Connected Correlation for Few-Shot Segmentation

链接: https://arxiv.org/abs/2411.11917
作者: Seonghyeon Moon,Haein Kong,Muhammad Haris Khan,Yuewei Lin
关键词-EN: target object, Few-shot segmentation, target object shows, aims to segment, prior information
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Few-shot segmentation (FSS) aims to segment the target object in a query image using only a small set of support images and masks. Therefore, having strong prior information for the target object using the support set is essential for guiding the initial training of FSS, which leads to the success of few-shot segmentation in challenging cases, such as when the target object shows considerable variation in appearance, texture, or scale across the support and query images. Previous methods have tried to obtain prior information by creating correlation maps from pixel-level correlation on final-layer or same-layer features. However, we found that these approaches offer only limited and partial information when advanced models like Vision Transformers are used as the backbone. Vision Transformer encoders have a multi-layer structure with identical shapes in their intermediate layers. Leveraging the feature comparison from all layers in the encoder can enhance the performance of few-shot segmentation. We introduce FCC (Fully Connected Correlation) to integrate pixel-level correlations between support and query features, capturing associations that reveal target-specific patterns and correspondences in both same-layers and cross-layers. FCC captures previously inaccessible target information, effectively addressing the limitations of the support mask. Our approach consistently demonstrates state-of-the-art performance on PASCAL, COCO, and domain shift tests. We conducted an ablation study and cross-layer correlation analysis to validate FCC's core methodology. These findings reveal the effectiveness of FCC in enhancing prior information and overall model performance.
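
The core operation is easy to state in code: because ViT intermediate layers share one shape, pixel-level correlations can be formed between every pair of query and support layers, not just matched layers. A hedged PyTorch sketch of that idea (the tensor layout is an assumption, not the paper's):

```python
import torch
import torch.nn.functional as F

def fully_connected_correlation(query_feats, support_feats):
    """Cosine correlation maps for every (query layer, support layer) pair.
    Assumed layout: each element is [C, N] with N flattened spatial tokens."""
    maps = []
    for fq in query_feats:
        fq_n = F.normalize(fq, dim=0)
        for fs in support_feats:
            fs_n = F.normalize(fs, dim=0)
            maps.append(fq_n.t() @ fs_n)   # [Nq, Ns] pixel-level correlations
    return torch.stack(maps)               # [Lq*Ls, Nq, Ns]
```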

[CV-66] SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization

链接: https://arxiv.org/abs/2411.11909
作者: Hongrui Jia,Chaoya Jiang,Haiyang Xu,Wei Ye,Mengfan Dong,Ming Yan,Ji Zhang,Fei Huang,Shikun Zhang
关键词-EN: solve language tasks, Large Language Models, language models continue, In-Context Learning, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, existing LMMs face a critical issue: they often fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns. This indicates that LMMs do not achieve effective alignment between multimodal demonstrations and model outputs. To address this problem, we propose Symbol Demonstration Direct Preference Optimization (SymDPO). Specifically, SymDPO aims to break the traditional paradigm of constructing multimodal demonstrations by using random symbols to replace text answers within instances. This forces the model to carefully understand the demonstration images and establish a relationship between the images and the symbols to answer questions correctly. We validate the effectiveness of this method on multiple benchmarks, demonstrating that with SymDPO, LMMs can more effectively understand the multimodal context within examples and utilize this knowledge to answer questions better.
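
The symbol-substitution step itself is simple; here is a hypothetical sketch of how in-context demonstrations might be rewritten so the textual answer no longer leaks the solution. The record fields and symbol format are assumptions for illustration, not taken from the paper.

```python
import random
import string

def symbolize_demonstrations(demos, rng=random.Random(0)):
    """Replace each in-context demonstration's textual answer with a random
    symbol string, so answering correctly requires grounding in the image."""
    out = []
    for demo in demos:  # assumed fields: {"image": ..., "question": ..., "answer": ...}
        symbol = "".join(rng.choices(string.ascii_uppercase, k=3))
        out.append({**demo, "answer": symbol})
    return out
```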

[CV-67] S³Mamba: Arbitrary-Scale Super-Resolution via Scaleable State Space Model

链接: https://arxiv.org/abs/2411.11906
作者: Peizhe Xia,Long Peng,Xin Di,Renjing Pei,Yang Wang,Yang Cao,Zheng-Jun Zha
关键词-EN: super-resolve low-resolution images, Implicit Neural Representations, continuous representation space, continuous representation, high-resolution images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Arbitrary scale super-resolution (ASSR) aims to super-resolve low-resolution images to high-resolution images at any scale using a single model, addressing the limitations of traditional super-resolution methods that are restricted to fixed-scale factors (e.g., ×2, ×4). The advent of Implicit Neural Representations (INR) has brought forth a plethora of novel methodologies for ASSR, which facilitate the reconstruction of original continuous signals by modeling a continuous representation space for coordinates and pixel values, thereby enabling arbitrary-scale super-resolution. Consequently, the primary objective of ASSR is to construct a continuous representation space derived from low-resolution inputs. However, existing methods, primarily based on CNNs and Transformers, face significant challenges such as high computational complexity and inadequate modeling of long-range dependencies, which hinder their effectiveness in real-world applications. To overcome these limitations, we propose a novel arbitrary-scale super-resolution method, called S³Mamba, to construct a scalable continuous representation space. Specifically, we propose a Scalable State Space Model (SSSM) to modulate the state transition matrix and the sampling matrix of step size during the discretization process, achieving scalable and continuous representation modeling with linear computational complexity. Additionally, we propose a novel scale-aware self-attention mechanism to further enhance the network's ability to perceive global important features at different scales, thereby building S³Mamba to achieve superior arbitrary-scale super-resolution. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our method achieves state-of-the-art performance and superior generalization capabilities at arbitrary super-resolution scales.

[CV-68] GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

链接: https://arxiv.org/abs/2411.11904
作者: Yue Zhou,Mengcheng Lan,Xiang Li,Yiping Ke,Xue Jiang,Litong Feng,Wayne Zhang
关键词-EN: enhancing human interaction, natural language expression, Remote sensing, locate specific objects, visual grounding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 25 pages, 19 figures

点击查看摘要

Abstract:Remote sensing (RS) visual grounding aims to use natural language expression to locate specific objects (in the form of the bounding box or segmentation mask) in RS images, enhancing human interaction with intelligent RS interpretation systems. Early research in this area was primarily based on horizontal bounding boxes (HBBs), but as more diverse RS datasets have become available, tasks involving oriented bounding boxes (OBBs) and segmentation masks have emerged. In practical applications, different targets require different grounding types: HBB can localize an object’s position, OBB provides its orientation, and mask depicts its shape. However, existing specialized methods are typically tailored to a single type of RS visual grounding task and are hard to generalize across tasks. In contrast, large vision-language models (VLMs) exhibit powerful multi-task learning capabilities but struggle to handle dense prediction tasks like segmentation. This paper proposes GeoGround, a novel framework that unifies support for HBB, OBB, and mask RS visual grounding tasks, allowing flexible output selection. Rather than customizing the architecture of VLM, our work aims to elegantly support pixel-level visual grounding output through the Text-Mask technique. We define prompt-assisted and geometry-guided learning to enhance consistency across different signals. To support model training, we present refGeo, a large-scale RS visual instruction-following dataset containing 161k image-text pairs. Experimental results show that GeoGround demonstrates strong performance across four RS visual grounding tasks, matching or surpassing the performance of specialized methods on multiple benchmarks. Code available at this https URL

[CV-69] DiHuR: Diffusion-Guided Generalizable Human Reconstruction WACV2025

链接: https://arxiv.org/abs/2411.11903
作者: Jinnan Chen,Chen Li,Gim Hee Lee
关键词-EN: minimally overlapping images, Signed Distance Function, minimally overlapping, Diffusion-guided model, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to WACV 2025

点击查看摘要

Abstract:We introduce DiHuR, a novel Diffusion-guided model for generalizable Human 3D Reconstruction and view synthesis from sparse, minimally overlapping images. While existing generalizable human radiance fields excel at novel view synthesis, they often struggle with comprehensive 3D reconstruction. Similarly, directly optimizing implicit Signed Distance Function (SDF) fields from sparse-view images typically yields poor results due to limited overlap. To enhance 3D reconstruction quality, we propose using learnable tokens associated with SMPL vertices to aggregate sparse view features and then to guide SDF prediction. These tokens learn a generalizable prior across different identities in training datasets, leveraging the consistent projection of SMPL vertices onto similar semantic areas across various human identities. This consistency enables effective knowledge transfer to unseen identities during inference. Recognizing SMPL’s limitations in capturing clothing details, we incorporate a diffusion model as an additional prior to fill in missing information, particularly for complex clothing geometries. Our method integrates two key priors in a coherent manner: the prior from generalizable feed-forward models and the 2D diffusion prior, and it requires only multi-view image training, without 3D supervision. DiHuR demonstrates superior performance in both within-dataset and cross-dataset generalization settings, as validated on THuman, ZJU-MoCap, and HuMMan datasets compared to existing methods.

[CV-70] Barttender: An approachable interpretable way to compare medical imaging and non-imaging data ML4H2024 ALT

链接: https://arxiv.org/abs/2411.12707
作者: Ayush Singla,Shakson Isaac,Chirag J. Patel
关键词-EN: transformed healthcare research, Imaging-based deep learning, clinical adoption remains, adoption remains limited, remains limited due
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to the Proceedings Track at Machine Learning for Health (ML4H 2024) conference, held on December 15-16, 2024 in Vancouver, Canada

点击查看摘要

Abstract:Imaging-based deep learning has transformed healthcare research, yet its clinical adoption remains limited due to challenges in comparing imaging models with traditional non-imaging and tabular data. To bridge this gap, we introduce Barttender, an interpretable framework that uses deep learning for the direct comparison of the utility of imaging versus non-imaging tabular data for tasks like disease prediction. Barttender converts non-imaging tabular features, such as scalar data from electronic health records, into grayscale bars, facilitating an interpretable and scalable deep learning based modeling of both data modalities. Our framework allows researchers to evaluate differences in utility through performance measures, as well as local (sample-level) and global (population-level) explanations. We introduce a novel measure to define global feature importances for image-based deep learning models, which we call gIoU. Experiments on the CheXpert and MIMIC datasets with chest X-rays and scalar data from electronic health records show that Barttender performs comparably to traditional methods and offers enhanced explainability using deep learning models.

[CV-71] Stochastic BIQA: Median Randomized Smoothing for Certified Blind Image Quality Assessment

链接: https://arxiv.org/abs/2411.12575
作者: Ekaterina Shumitskaya,Mikhail Pautov,Dmitriy Vatolin,Anastasia Antsiferova
关键词-EN: No-Reference Image-Quality Assessment, Image-Quality Assessment, neural networks vulnerable, modern No-Reference Image-Quality, IQA metric
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Most modern No-Reference Image-Quality Assessment (NR-IQA) metrics are based on neural networks vulnerable to adversarial attacks. Attacks on such metrics lead to incorrect image/video quality predictions, which poses significant risks, especially in public benchmarks. Developers of image processing algorithms may unfairly increase the score of a target IQA metric without improving the actual quality of the adversarial image. Although some empirical defenses for IQA metrics were proposed, they do not provide theoretical guarantees and may be vulnerable to adaptive attacks. This work focuses on developing a provably robust no-reference IQA metric. Our method is based on Median Smoothing (MS) combined with an additional convolution denoiser with ranking loss to improve the SROCC and PLCC scores of the defended IQA metric. Compared with two prior methods on three datasets, our method exhibited superior SROCC and PLCC scores while maintaining comparable certified guarantees.
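
A minimal sketch of the Median Smoothing backbone, assuming `iqa_metric` maps an image in [0, 1] to a scalar quality score; the paper's full method additionally trains a convolutional denoiser with a ranking loss before the metric, which is omitted here.

```python
import numpy as np

def median_smoothed_score(iqa_metric, image, sigma=0.1, n=64, seed=0):
    """Median of the metric over Gaussian-noised copies of the input; the
    median's robustness to outliers is what yields the certified bound."""
    rng = np.random.default_rng(seed)
    noisy = (np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)
             for _ in range(n))
    return float(np.median([iqa_metric(im) for im in noisy]))
```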

[CV-72] S3TU-Net: Structured Convolution and Superpixel Transformer for Lung Nodule Segmentation

链接: https://arxiv.org/abs/2411.12547
作者: Yuke Wu,Xiang Liu,Yunyu Shi,Xinyi Chen,Zhenglei Wang,YuQing Xu,Shuo Hong Wang
关键词-EN: images complicate staging, complicate staging diagnosis, detailed lesion information, lung adenocarcinoma nodules, making accurate segmentation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The irregular and challenging characteristics of lung adenocarcinoma nodules in computed tomography (CT) images complicate staging diagnosis, making accurate segmentation critical for clinicians to extract detailed lesion information. In this study, we propose a segmentation model, S3TU-Net, which integrates multi-dimensional spatial connectors and a superpixel-based visual transformer. S3TU-Net is built on a multi-view CNN-Transformer hybrid architecture, incorporating superpixel algorithms, structured weighting, and spatial shifting techniques to achieve superior segmentation performance. The model leverages structured convolution blocks (DWF-Conv/D2BR-Conv) to extract multi-scale local features while mitigating overfitting. To enhance multi-scale feature fusion, we introduce the S2-MLP Link, integrating spatial shifting and attention mechanisms at the skip connections. Additionally, the residual-based superpixel visual transformer (RM-SViT) effectively merges global and local features by employing sparse correlation learning and multi-branch attention to capture long-range dependencies, with residual connections enhancing stability and computational efficiency. Experimental results on the LIDC-IDRI dataset demonstrate that S3TU-Net achieves a DSC, precision, and IoU of 89.04%, 90.73%, and 90.70%, respectively. Compared to recent methods, S3TU-Net improves DSC by 4.52% and sensitivity by 3.16%, with other metrics showing an approximate 2% increase. In addition to comparison and ablation studies, we validated the generalization ability of our model on the EPDB private dataset, achieving a DSC of 86.40%.

[CV-73] MAViS: Modular Autonomous Virtualization System for Two-Dimensional Semiconductor Quantum Dot Arrays

链接: https://arxiv.org/abs/2411.12516
作者: Anantha S. Rao,Donovan Buterakos,Barnaby van Straaten,Valentin John,Cécile X. Yu,Stefan D. Oosterhout,Lucas Stehouwer,Giordano Scappucci,Menno Veldhorst,Francesco Borsoi,Justyna P. Zwolak
关键词-EN: scalable quantum processors, building scalable quantum, leading candidates, candidates for building, building scalable
类目: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 14 pages, 5 figures, 8 pages of supplemental material

点击查看摘要

Abstract:Arrays of gate-defined semiconductor quantum dots are among the leading candidates for building scalable quantum processors. High-fidelity initialization, control, and readout of spin qubit registers require exquisite and targeted control over key Hamiltonian parameters that define the electrostatic environment. However, due to the tight gate pitch, capacitive crosstalk between gates hinders independent tuning of chemical potentials and interdot couplings. While virtual gates offer a practical solution, determining all the required cross-capacitance matrices accurately and efficiently in large quantum dot registers is an open challenge. Here, we establish a Modular Automated Virtualization System (MAViS) – a general and modular framework for autonomously constructing a complete stack of multi-layer virtual gates in real time. Our method employs machine learning techniques to rapidly extract features from two-dimensional charge stability diagrams. We then utilize computer vision and regression models to self-consistently determine all relative capacitive couplings necessary for virtualizing plunger and barrier gates in both low- and high-tunnel-coupling regimes. Using MAViS, we successfully demonstrate accurate virtualization of a dense two-dimensional array comprising ten quantum dots defined in a high-quality Ge/SiGe heterostructure. Our work offers an elegant and practical solution for the efficient control of large-scale semiconductor quantum dot systems.
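
The underlying virtualization step, independent of MAViS's automation, is a linear change of coordinates: invert the measured cross-capacitance matrix so that each virtual knob shifts one chemical potential at a time. An illustrative NumPy sketch with made-up coupling values:

```python
import numpy as np

# M[i, j] ~ d(mu_i)/d(V_j): how physical gate j shifts dot i's potential.
# The values below are illustrative, not measured ones.
M = np.array([[1.00, 0.35, 0.10],
              [0.30, 1.00, 0.28],
              [0.08, 0.33, 1.00]])

G = np.linalg.inv(M)  # virtual-gate matrix: each column is a "one dot at a time" knob

def virtual_gate_step(delta_mu):
    """Physical gate-voltage changes realizing the requested potential shifts."""
    return G @ np.asarray(delta_mu, dtype=float)

print(virtual_gate_step([1.0, 0.0, 0.0]))  # shifts dot 0 only, to first order
```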

[CV-74] Automatic staff reconstruction within SIMSSA project

链接: https://arxiv.org/abs/2411.12383
作者: Lorenzo J. Tardon,Isabel Barbancho,Ana M. Barbancho,Ichiro Fujinaga
关键词-EN: make musical content, include musical scores, research topic, topic of interest, automatic analysis
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages

点击查看摘要

Abstract:The automatic analysis of scores has been a research topic of interest for the last few decades, and it remains so as music databases that include musical scores, including scores of ancient music, are currently being created to make musical content available to the public. For the correct analysis of music elements and their interpretation, the identification of staff lines is of key importance. In this paper, a scheme to post-process the output of a previous musical object identification system is described. This scheme allows the reconstruction, by means of detection, tracking and interpolation, of the staff lines of ancient scores from the digital Salzinnes Database, and shows remarkable performance on the specific task it was created for.

[CV-75] Versatile Cataract Fundus Image Restoration Model Utilizing Unpaired Cataract and High-quality Images

链接: https://arxiv.org/abs/2411.12278
作者: Zheng Gong,Zhuo Deng,Weihao Gao,Wenda Zhou,Yuhang Yang,Hanqing Zhao,Zhiyuan Niu,Lei Shao,Wenbin Wei,Lan Ma
关键词-EN: blinding eye diseases, common blinding eye, blinding eye, eye diseases, Cataract
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Cataract is one of the most common blinding eye diseases and can be treated by surgery. However, because cataract patients may also suffer from other blinding eye diseases, ophthalmologists must diagnose them before surgery. The cloudy lens of cataract patients forms a hazy degeneration in the fundus images, making it challenging to observe the patient's fundus vessels and complicating the diagnosis process. To address this issue, this paper establishes a new cataract image restoration method named Catintell. It contains a cataract image synthesizing model, Catintell-Syn, and a restoration model, Catintell-Res. Catintell-Syn uses a GAN architecture with fully unsupervised data to generate paired cataract-like images with realistic style and texture, rather than relying on the conventional Gaussian degradation algorithm. Meanwhile, Catintell-Res is an image restoration network that can improve the quality of real cataract fundus images using the knowledge learned from synthetic cataract images. Extensive experiments show that Catintell-Res outperforms other cataract image restoration methods, achieving a PSNR of 39.03 and an SSIM of 0.9476. Furthermore, the universal restoration ability that Catintell-Res gained from unpaired cataract images allows it to process cataract images from various datasets. We hope the models can help ophthalmologists identify other blinding eye diseases of cataract patients and inspire more medical image restoration methods in the future.

[CV-76] Acquire Precise and Comparable Fundus Image Quality Score: FTHNet and FQS Dataset

链接: https://arxiv.org/abs/2411.12273
作者: Zheng Gong,Zhuo Deng,Run Gan,Zhiyuan Niu,Lu Chen,Canfeng Huang,Jia Liang,Weihao Gao,Fang Li,Shaochong Zhang,Lan Ma
关键词-EN: retinal fundus images, affect the diagnosis, fundus images, FIQA, FIQA Transformer-based Hypernetwork
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:The retinal fundus images are utilized extensively in the diagnosis, and their quality can directly affect the diagnosis results. However, due to the insufficient dataset and algorithm application, current fundus image quality assessment (FIQA) methods are not powerful enough to meet ophthalmologists' demands. In this paper, we address the limitations of datasets and algorithms in FIQA. First, we establish a new FIQA dataset, Fundus Quality Score (FQS), which includes 2246 fundus images with two labels: a continuous Mean Opinion Score varying from 0 to 100 and a three-level quality label. Then, we propose a FIQA Transformer-based Hypernetwork (FTHNet) to solve these tasks with regression results rather than classification results in conventional FIQA works. The FTHNet is optimized for the FIQA tasks with extensive experiments. Results on our FQS dataset show that the FTHNet can give quality scores for fundus images with PLCC of 0.9423 and SRCC of 0.9488, significantly outperforming other methods with fewer parameters and less computation cost. We successfully build a dataset and model addressing the problems of current FIQA methods. Furthermore, the model deployment experiments demonstrate its potential in automatic medical image quality control. All experiments are carried out with 10-fold cross-validation to ensure the significance of the results.

[CV-77] Self-supervised denoising of visual field data improves detection of glaucoma progression

链接: https://arxiv.org/abs/2411.12146
作者: Sean Wu,Jun Yu Chen,Vahid Mohammadzadeh,Sajad Besharati,Jaewon Lee,Kouros Nouri-Mahdavi,Joseph Caprioli,Zhe Fei,Fabien Scalzo
关键词-EN: Perimetric measurements provide, visual field, Perimetric measurements, measurements provide insight, main outcome measure
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Perimetric measurements provide insight into a patient’s peripheral vision and day-to-day functioning and are the main outcome measure for identifying progression of visual damage from glaucoma. However, visual field data can be noisy, exhibiting high variance, especially with increasing damage. In this study, we demonstrate the utility of self-supervised deep learning in denoising visual field data from over 4000 patients to enhance its signal-to-noise ratio and its ability to detect true glaucoma progression. We deployed both a variational autoencoder (VAE) and a masked autoencoder to determine which self-supervised model best smooths the visual field data while reconstructing salient features that are less noisy and more predictive of worsening disease. Our results indicate that including a categorical p-value at every visual field location improves the smoothing of visual field data. Masked autoencoders led to cleaner denoised data than previous methods, such as variational autoencoders. A 4.7% increase in detection of progressing eyes with pointwise linear regression (PLR) was observed. The masked and variational autoencoders’ smoothed data predicted glaucoma progression 2.3 months earlier when p-values were included compared to when they were not. The faster prediction of time to progression (TTP) and the higher percentage progression detected support our hypothesis that masking out visual field elements during training while including p-values at each location would improve the task of detection of visual field progression. Our study has clinically relevant implications regarding masking when training neural networks to denoise visual field data, resulting in earlier and more accurate detection of glaucoma progression. This denoising model can be integrated into future models for visual field analysis to enhance detection of glaucoma progression.

机器学习

[LG-0] LazyDINO: Fast scalable and efficiently amortized Bayesian inversion via structure-exploiting and surrogate-driven measure transport

链接: https://arxiv.org/abs/2411.12726
作者: Lianghao Cao,Joshua Chen,Michael Brennan,Thomas O’Leary-Roseberry,Youssef Marzouk,Omar Ghattas
关键词-EN: high-dimensional nonlinear Bayesian, nonlinear Bayesian inverse, Bayesian inverse problems, efficiently amortized solutions, map
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We present LazyDINO, a transport map variational inference method for fast, scalable, and efficiently amortized solutions of high-dimensional nonlinear Bayesian inverse problems with expensive parameter-to-observable (PtO) maps. Our method consists of an offline phase in which we construct a derivative-informed neural surrogate of the PtO map using joint samples of the PtO map and its Jacobian. During the online phase, when given observational data, we seek rapid posterior approximation using surrogate-driven training of a lazy map [Brennan et al., NeurIPS, (2020)], i.e., a structure-exploiting transport map with low-dimensional nonlinearity. The trained lazy map then produces approximate posterior samples or density evaluations. Our surrogate construction is optimized for amortized Bayesian inversion using lazy map variational inference. We show that (i) the derivative-based reduced basis architecture [O’Leary-Roseberry et al., Comput. Methods Appl. Mech. Eng., 388 (2022)] minimizes the upper bound on the expected error in surrogate posterior approximation, and (ii) the derivative-informed training formulation [O’Leary-Roseberry et al., J. Comput. Phys., 496 (2024)] minimizes the expected error due to surrogate-driven transport map optimization. Our numerical results demonstrate that LazyDINO is highly efficient in cost amortization for Bayesian inversion. We observe one to two orders of magnitude reduction of offline cost for accurate posterior approximation, compared to simulation-based amortized inference via conditional transport and conventional surrogate-driven transport. In particular, LazyDINO outperforms Laplace approximation consistently using fewer than 1000 offline samples, while other amortized inference methods struggle and sometimes fail at 16,000 offline samples.

[LG-1] Learning multivariate Gaussians with imperfect advice

链接: https://arxiv.org/abs/2411.12700
作者: Arnab Bhattacharyya,Davin Choo,Philips George John,Themis Gouleakis
关键词-EN: boldsymbol, Sigma, framework of learning-augmented, PAC learning setting, learning
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We revisit the problem of distribution learning within the framework of learning-augmented algorithms. In this setting, we explore the scenario where a probability distribution is provided as potentially inaccurate advice on the true, unknown distribution. Our objective is to develop learning algorithms whose sample complexity decreases as the quality of the advice improves, thereby surpassing standard learning lower bounds when the advice is sufficiently accurate. Specifically, we demonstrate that this outcome is achievable for the problem of learning a multivariate Gaussian distribution N(\boldsymbol\mu, \boldsymbol\Sigma) in the PAC learning setting. Classically, in the advice-free setting, \tilde{\Theta}(d^2/\varepsilon^2) samples are sufficient and worst-case necessary to learn d-dimensional Gaussians up to TV distance \varepsilon with constant probability. When we are additionally given a parameter \tilde{\boldsymbol\Sigma} as advice, we show that \tilde{O}(d^{2-\beta}/\varepsilon^2) samples suffice whenever \| \tilde{\boldsymbol\Sigma}^{-1/2} \boldsymbol\Sigma \tilde{\boldsymbol\Sigma}^{-1/2} - \boldsymbol{I}_d \|_1 \leq \varepsilon d^{1-\beta} (where \|\cdot\|_1 denotes the entrywise \ell_1 norm) for any \beta > 0, yielding a polynomial improvement over the advice-free setting.
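
The advice condition is straightforward to evaluate numerically. A small sketch, assuming both covariances are available (in practice \boldsymbol\Sigma is what one is trying to learn, so this only shows what the condition measures):

```python
import numpy as np
from scipy.linalg import sqrtm

def advice_is_good(Sigma, Sigma_advice, eps, beta):
    """Check || advice^{-1/2} @ Sigma @ advice^{-1/2} - I ||_1 <= eps * d**(1-beta),
    where ||.||_1 is the entrywise l1 norm, as in the condition above."""
    d = Sigma.shape[0]
    W = np.linalg.inv(sqrtm(Sigma_advice).real)  # advice^{-1/2}, assuming SPD advice
    deviation = np.abs(W @ Sigma @ W - np.eye(d)).sum()
    return deviation <= eps * d ** (1 - beta), deviation
```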

[LG-2] IMUVIE: Pickup Timeline Action Localization via Motion Movies

链接: https://arxiv.org/abs/2411.12689
作者: John Clapham,Kenneth Koltermann,Yanfu Zhang,Yuming Sun,Evie N Burnet,Gang Zhou
关键词-EN: objects pose significant, pose significant health, impacting quality, due to difficulties, picking up objects
类目: Machine Learning (cs.LG)
*备注: This is a preprint version, 12 pages, 20 figures, 3 tables

点击查看摘要

Abstract:Falls among seniors due to difficulties with tasks such as picking up objects pose significant health and safety risks, impacting quality of life and independence. Reliable, accessible assessment tools are critical for early intervention but often require costly clinic-based equipment and trained personnel, limiting their use in daily life. Existing wearable-based pickup measurement solutions address some needs but face limitations in generalizability. We present IMUVIE, a wearable system that uses motion movies and a machine-learning model to automatically detect and measure pickup events, providing a practical solution for frequent monitoring. IMUVIE's design principles-data normalization, occlusion handling, and streamlined visuals-enhance model performance and are adaptable to tasks beyond pickup classification. In rigorous leave one subject out cross validation evaluations, IMUVIE achieves exceptional window level localization accuracy of 91-92% for pickup action classification on 256,291 motion movie frame candidates while maintaining an event level recall of 97% when evaluated on 129 pickup events. IMUVIE has strong generalization and performs well on unseen subjects. In an interview survey, IMUVIE demonstrated strong user interest and trust, with ease of use identified as the most critical factor for adoption. IMUVIE offers a practical, at-home solution for fall risk assessment, facilitating early detection of movement deterioration, and supporting safer, independent living for seniors.

[LG-3] Auto-Evaluation with Few Labels through Post-hoc Regression

链接: https://arxiv.org/abs/2411.12665
作者: Benjamin Eyre,David Madras
关键词-EN: Continually evaluating large, evaluating large generative, large generative models, Continually evaluating, unique challenge
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Continually evaluating large generative models provides a unique challenge. Often, human annotations are necessary to evaluate high-level properties of these models (e.g. in text or images). However, collecting human annotations of samples can be resource intensive, and using other machine learning systems to provide the annotations, or automatic evaluation, can introduce systematic errors into the evaluation. The Prediction Powered Inference (PPI) framework provides a way of leveraging both the statistical power of automatic evaluation and a small pool of labelled data to produce a low-variance, unbiased estimate of the quantity being evaluated for. However, most work on PPI considers a relatively sizable set of labelled samples, which is not always practical to obtain. To this end, we present two new PPI-based techniques that leverage robust regressors to produce even lower variance estimators in the few-label regime.
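
For reference, the basic PPI mean estimator that this work builds on combines automatic scores on a large unlabelled pool with a bias correction from the few human labels. The sketch below shows that baseline, not the paper's new regressor-based variants:

```python
import numpy as np

def ppi_mean(auto_on_labeled, human_labels, auto_on_unlabeled):
    """theta_hat = mean of automatic scores on the big unlabelled pool,
    plus the rectifier E[Y - f(X)] estimated on the small labelled set."""
    rectifier = np.mean(np.asarray(human_labels) - np.asarray(auto_on_labeled))
    return float(np.mean(auto_on_unlabeled) + rectifier)
```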

[LG-4] PyAWD: A Library for Generating Large Synthetic Datasets of Acoustic Wave Propagation with Devito

链接: https://arxiv.org/abs/2411.12636
作者: Pascal Tribel,Gianluca Bontempi
关键词-EN: deploying physical seismometers, unevenly distributed due, Machine Learning, application of Machine, physical seismometers
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Seismic data is often sparse and unevenly distributed due to the high costs and logistical challenges associated with deploying physical seismometers, limiting the application of Machine Learning (ML) in earthquake analysis. To address this gap, we introduce PyAWD, a Python library designed to generate high-resolution synthetic datasets simulating spatio-temporal acoustic wave propagation in both two-dimensional and three-dimensional heterogeneous media. By allowing fine control over parameters such as wave speed, external forces, spatial and temporal discretization, and media composition, PyAWD enables the creation of ML-scale datasets that capture the complexity of seismic wave behavior. We illustrate the library's potential with an epicenter retrieval task, showcasing its suitability for designing complex, accurate seismic problems that support advanced ML approaches in the absence of dense real-world data.

[LG-5] Exploring the Manifold of Neural Networks Using Diffusion Geometry

链接: https://arxiv.org/abs/2411.12626
作者: Elliott Abel,Peyton Crevasse,Yvan Grinspan,Selma Mazioud,Folu Ogundipe,Kristof Reimann,Ellie Schueler,Andrew J. Steindl,Ellen Zhang,Dhananjay Bhaskar,Siddharth Viswanath,Yanlei Zhang,Tim G. J. Rudner,Ian Adelstein,Smita Krishnaswamy
关键词-EN: high-dimensional data lies, Drawing motivation, apply manifold learning, neural networks, high-dimensional data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Drawing motivation from the manifold hypothesis, which posits that most high-dimensional data lies on or near low-dimensional manifolds, we apply manifold learning to the space of neural networks. We learn manifolds where datapoints are neural networks by introducing a distance between the hidden layer representations of the neural networks. These distances are then fed to the non-linear dimensionality reduction algorithm PHATE to create a manifold of neural networks. We characterize this manifold using features of the representation, including class separation, hierarchical cluster structure, spectral entropy, and topological structure. Our analysis reveals that high-performing networks cluster together in the manifold, displaying consistent embedding patterns across all these features. Finally, we demonstrate the utility of this approach for guiding hyperparameter optimization and neural architecture search by sampling from the manifold.

[LG-6] Hypergraph p-Laplacian equations for data interpolation and semi-supervised learning

链接: https://arxiv.org/abs/2411.12601
作者: Kehan Shi,Martin Burger
关键词-EN: modeling higher-order relationships, Laplacian regularization, Laplacian equation, attracted a lot, lot of attention
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:Hypergraph learning with p-Laplacian regularization has attracted a lot of attention due to its flexibility in modeling higher-order relationships in data. This paper focuses on its fast numerical implementation, which is challenging due to the non-differentiability of the objective function and the non-uniqueness of the minimizer. We derive a hypergraph p-Laplacian equation from the subdifferential of the p-Laplacian regularization. A simplified equation that is mathematically well-posed and computationally efficient is proposed as an alternative. Numerical experiments verify that the simplified p-Laplacian equation suppresses spiky solutions in data interpolation and improves classification accuracy in semi-supervised learning. The remarkably low computational cost enables further applications.
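
As a point of reference, the ordinary graph (rather than hypergraph) p-Laplacian interpolation problem can be solved by projected gradient descent on the p-Dirichlet energy. A small NumPy sketch, with the constant factor of the gradient absorbed into the learning rate; the paper's hypergraph equation and fast solver are more involved than this:

```python
import numpy as np

def p_laplacian_interpolate(W, f0, labeled_mask, p=4, lr=1e-3, iters=2000):
    """Minimize sum_{i,j} w_ij |f_i - f_j|^p over the unlabelled nodes by
    gradient descent, keeping labelled values fixed (W: symmetric weights)."""
    f = f0.astype(float).copy()
    free = ~labeled_mask
    for _ in range(iters):
        diff = f[:, None] - f[None, :]                         # f_i - f_j
        grad = (W * p * np.abs(diff) ** (p - 1) * np.sign(diff)).sum(axis=1)
        f[free] -= lr * grad[free]                             # update unlabelled only
    return f
```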

[LG-7] UMGAD: Unsupervised Multiplex Graph Anomaly Detection

链接: https://arxiv.org/abs/2411.12556
作者: Xiang Li,Jianpeng Qi,Zhongying Zhao,Guanjie Zheng,Lei Cao,Junyu Dong,Yanwei Yu
关键词-EN: Graph anomaly detection, identifying anomalous nodes, anomaly detection, multiplex heterogeneous graphs, primary objective
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph anomaly detection (GAD) is a critical task in graph machine learning, with the primary objective of identifying anomalous nodes that deviate significantly from the majority. This task is widely applied in various real-world scenarios, including fraud detection and social network analysis. However, existing GAD methods still face two major challenges: (1) They are often limited to detecting anomalies in single-type interaction graphs and struggle with multiple interaction types in multiplex heterogeneous graphs; (2) In unsupervised scenarios, selecting appropriate anomaly score thresholds remains a significant challenge for accurate anomaly detection. To address the above challenges, we propose a novel Unsupervised Multiplex Graph Anomaly Detection method, named UMGAD. We first learn multi-relational correlations among nodes in multiplex heterogeneous graphs and capture anomaly information during node attribute and structure reconstruction through graph-masked autoencoder (GMAE). Then, to further weaken the influence of noise and redundant information on abnormal information extraction, we generate attribute-level and subgraph-level augmented-view graphs respectively, and perform attribute and structure reconstruction through GMAE. Finally, we learn to optimize node attributes and structural features through contrastive learning between original-view and augmented-view graphs to improve the model's ability to capture anomalies. Meanwhile, we also propose a new anomaly score threshold selection strategy, which allows the model to be independent of the ground truth in real unsupervised scenarios. Extensive experiments on four datasets show that our model significantly outperforms state-of-the-art methods, achieving average improvements of 13.48% in AUC and 11.68% in Macro-F1 across all datasets.

[LG-8] Empirical Privacy Evaluations of Generative and Predictive Machine Learning Models – A review and challenges for practice

链接: https://arxiv.org/abs/2411.12451
作者: Flavio Hafner,Chang Sun
关键词-EN: Synthetic data generators, produce synthetic data, formal privacy guarantees, Synthetic data, generated synthetic data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthetic data generators, when trained using privacy-preserving techniques like differential privacy, promise to produce synthetic data with formal privacy guarantees, facilitating the sharing of sensitive data. However, it is crucial to empirically assess the privacy risks associated with the generated synthetic data before deploying generative technologies. This paper outlines the key concepts and assumptions underlying empirical privacy evaluation in machine learning-based generative and predictive models. Then, this paper explores the practical challenges for privacy evaluations of generative models for use cases with millions of training records, such as data from statistical agencies and healthcare providers. Our findings indicate that methods designed to verify the correct operation of the training algorithm are effective for large datasets, but they often assume an adversary that is unrealistic in many scenarios. Based on the findings, we highlight a crucial trade-off between the computational feasibility of the evaluation and the level of realism of the assumed threat model. Finally, we conclude with ideas and suggestions for future research.

[LG-9] Dimension Reduction via Sum-of-Squares and Improved Clustering Algorithms for Non-Spherical Mixtures

链接: https://arxiv.org/abs/2411.12438
作者: Prashanti Anderson,Mitali Bafna,Rares-Darius Buhai,Pravesh K. Kothari,David Steurer
关键词-EN: low-dimensional separation-preserving projection, Gaussian mixture models, min, Vempala and Wang, finds a low-dimensional
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 64 pages

点击查看摘要

Abstract:We develop a new approach for clustering non-spherical (i.e., arbitrary component covariances) Gaussian mixture models via a subroutine, based on the sum-of-squares method, that finds a low-dimensional separation-preserving projection of the input data. Our method gives a non-spherical analog of the classical dimension reduction, based on singular value decomposition, that forms a key component of the celebrated spherical clustering algorithm of Vempala and Wang [VW04] (in addition to several other applications). As applications, we obtain an algorithm to (1) cluster an arbitrary total-variation separated mixture of k centered (i.e., zero-mean) Gaussians with n \geq \mathrm{poly}(d) \cdot f(w_{\min}^{-1}) samples and \mathrm{poly}(n) time, and (2) cluster an arbitrary total-variation separated mixture of k Gaussians with identical but arbitrary unknown covariance with n \geq d^{O(\log w_{\min}^{-1})} f(w_{\min}^{-1}) samples and n^{O(\log w_{\min}^{-1})} time. Here, w_{\min} is the minimum mixing weight of the input mixture, and f does not depend on the dimension d. Our algorithms naturally extend to tolerating a dimension-independent fraction of arbitrary outliers. Before this work, the techniques in the state-of-the-art non-spherical clustering algorithms needed d^{O(k)} f(w_{\min}^{-1}) time and samples for clustering such mixtures. Our results may come as a surprise in the context of the d^{\Omega(k)} statistical query lower bound [DKS17] for clustering non-spherical Gaussian mixtures. While this result is usually thought to rule out d^{o(k)} cost algorithms for the problem, our results show that the lower bounds can in fact be circumvented for a remarkably general class of Gaussian mixtures.

[LG-10] STRisk: A Socio-Technical Approach to Assess Hacking Breaches Risk

链接: https://arxiv.org/abs/2411.12435
作者: Hicham Hammouchi,Narjisse Nejjari,Ghita Mezzour,Mounir Ghogho,Houda Benbrahim
关键词-EN: Data breaches, social media dimension, breaches have begun, Data, media dimension
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data breaches have begun to take on new dimensions and their prediction is becoming of great importance to organizations. Prior work has addressed this issue mainly from a technical perspective and neglected other interfering aspects such as the social media dimension. To fill this gap, we propose STRisk, a predictive system that expands the scope of the prediction task by bringing the social media dimension into play. We study over 3800 US organizations including both victim and non-victim organizations. For each organization, we design a profile composed of a variety of externally measured technical indicators and social factors. In addition, to account for unreported incidents, we consider the non-victim sample to be noisy and propose a noise correction approach to correct mislabeled organizations. We then build several machine learning models to predict whether an organization is exposed to experience a hacking breach. By exploiting both technical and social features, we achieve an Area Under Curve (AUC) score exceeding 98%, which is 12% higher than the AUC achieved using only technical features. Furthermore, our feature importance analysis reveals that open ports and expired certificates are the best technical predictors, while spreadability and agreeability are the best social predictors.

[LG-11] Non-IID data in Federated Learning: A Systematic Review with Taxonomy, Metrics, Methods, Frameworks and Future Directions

链接: https://arxiv.org/abs/2411.12377
作者: Daniel M. Jimenez G.,David Solans,Mikko Heikkila,Andrea Vitaletti,Nicolas Kourtellis,Aris Anagnostopoulos,Ioannis Chatzigiannakis
关键词-EN: highlighted Federated Learning, Federated Learning, multiple distributed users, enables multiple distributed, Recent advances
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in machine learning have highlighted Federated Learning (FL) as a promising approach that enables multiple distributed users (so-called clients) to collectively train ML models without sharing their private data. While this privacy-preserving method shows potential, it struggles when data across clients is not independent and identically distributed (non-IID). This remains an unsolved challenge that can result in poorer model performance and slower training times. Despite the significance of non-IID data in FL, there is a lack of consensus among researchers about its classification and quantification. This systematic review aims to fill that gap by providing a detailed taxonomy for non-IID data, partition protocols, and metrics to quantify data heterogeneity. Additionally, we describe popular solutions to address non-IID data and standardized frameworks employed in FL with heterogeneous data. Based on our state-of-the-art review, we present key lessons learned and suggest promising future research directions.

[LG-12] Ultra-Sparse Memory Network

链接: https://arxiv.org/abs/2411.12364
作者: Zihao Huang,Qiyang Min,Hongzhi Huang,Defa Zhu,Yutao Zeng,Ran Guo,Xun Zhou
关键词-EN: computational complexity, Mixture of Experts, Transformer models, widely acknowledged, exponentially related
类目: Machine Learning (cs.LG)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:It is widely acknowledged that the performance of Transformer models is exponentially related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference due to high memory access costs. This work introduces UltraMem, incorporating a large-scale, ultra-sparse memory layer to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but outperforms traditional models. In our experiments, we train networks with up to 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget.

[LG-13] Learning from Label Proportions and Covariate-shifted Instances

链接: https://arxiv.org/abs/2411.12334
作者: Sagalpreet Singh,Navodita Sharma,Shreyas Havaldar,Rishi Saket,Aravindan Raghuveer
关键词-EN: aggregate label derived, aggregate label, hybrid LLP, privacy concerns, LLP
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many applications, especially due to lack of supervision or privacy concerns, the training data is grouped into bags of instances (feature-vectors) and for each bag we have only an aggregate label derived from the instance-labels in the bag. In learning from label proportions (LLP) the aggregate label is the average of the instance-labels in a bag, and a significant body of work has focused on training models in the LLP setting to predict instance-labels. In practice however, the training data may have fully supervised albeit covariate-shifted source data, along with the usual target data with bag-labels, and we wish to train a good instance-level predictor on the target domain. We call this the covariate-shifted hybrid LLP problem. Fully supervised covariate shifted data often has useful training signals and the goal is to leverage them for better predictive performance in the hybrid LLP setting. To achieve this, we develop methods for hybrid LLP which naturally incorporate the target bag-labels along with the source instance-labels, in the domain adaptation framework. Apart from proving theoretical guarantees bounding the target generalization error, we also conduct experiments on several publicly available datasets showing that our methods outperform LLP and domain adaptation baselines as well as techniques from previous related work.
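
To make the LLP setting concrete, here is a minimal sketch (not the paper's method) of the standard proportion-matching loss used when only bag-level label proportions are observed; the model, bag construction, and loss choice are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy binary LLP setup: instances are grouped into bags, and for each bag we
# observe only the average of the (hidden) instance labels.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

def bag_proportion_loss(bags, bag_props):
    """Match each bag's mean predicted probability to its observed label proportion."""
    losses = []
    for x_bag, prop in zip(bags, bag_props):
        p = torch.sigmoid(model(x_bag)).mean().view(1)   # predicted bag proportion
        losses.append(F.binary_cross_entropy(p, prop.view(1)))
    return torch.stack(losses).mean()

bags = [torch.randn(10, 16) for _ in range(4)]           # 4 bags of 10 instances
bag_props = [torch.tensor(p) for p in (0.2, 0.5, 0.7, 0.1)]
loss = bag_proportion_loss(bags, bag_props)
loss.backward()
```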

[LG-14] Graph as a feature: improving node classification with non-neural graph-aware logistic regression

链接: https://arxiv.org/abs/2411.12330
作者: Simon Delarue,Thomas Bonald,Tiphaine Viard
关键词-EN: machine learning problems, solving graph-based machine, graph-based machine learning, Graph Neural Networks, Neural Networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) and their message passing framework that leverages both structural and feature information, have become a standard method for solving graph-based machine learning problems. However, these approaches still struggle to generalise well beyond datasets that exhibit strong homophily, where nodes of the same class tend to connect. This limitation has led to the development of complex neural architectures that pose challenges in terms of efficiency and scalability. In response to these limitations, we focus on simpler and more scalable approaches and introduce Graph-aware Logistic Regression (GLR), a non-neural model designed for node classification tasks. Unlike traditional graph algorithms that use only a fraction of the information accessible to GNNs, our proposed model simultaneously leverages both node features and the relationships between entities. However, instead of relying on message passing, our approach encodes each node’s relationships as an additional feature vector, which is then combined with the node’s self attributes. Extensive experimental results, conducted within a rigorous evaluation framework, show that our proposed GLR approach outperforms both foundational and sophisticated state-of-the-art GNN models in node classification tasks. Going beyond the traditional limited benchmarks, our experiments indicate that GLR increases generalisation ability while reaching performance gains in computation time up to two orders of magnitude compared to its best neural competitor.
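
The core idea is simple enough to sketch: concatenate each node's feature vector with an encoding of its graph relationships (here, its row of the adjacency matrix) and fit a plain logistic regression. A minimal illustration with scikit-learn; the paper's exact relationship encoding may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_nodes, n_feats = 200, 16
X = rng.normal(size=(n_nodes, n_feats))                     # node self attributes
A = (rng.random((n_nodes, n_nodes)) < 0.05).astype(float)   # toy adjacency matrix
A = np.maximum(A, A.T)                                      # symmetrize

# Encode each node's relationships as an extra feature vector (its row-normalized
# adjacency row) and concatenate with the node's own attributes.
row_sums = A.sum(axis=1, keepdims=True) + 1e-9
X_graph = np.hstack([X, A / row_sums])

y = rng.integers(0, 2, size=n_nodes)                        # toy labels
clf = LogisticRegression(max_iter=1000).fit(X_graph, y)
print(clf.score(X_graph, y))
```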

[LG-15] Attributed Graph Clustering in Collaborative Settings

链接: https://arxiv.org/abs/2411.12329
作者: Rui Zhang,Xiaoyang Hou,Zhihua Tian,Jian Liu,Qingbiao Wu,Kui Ren
关键词-EN: Graph clustering, attributed graph clustering, graph clustering methods, Graph, collaborative graph clustering
类目: Machine Learning (cs.LG)
*备注: 16 pages, 3 figures

点击查看摘要

Abstract:Graph clustering is an unsupervised machine learning method that partitions the nodes in a graph into different groups. Despite achieving significant progress in exploiting both attributed and structured data information, graph clustering methods often face practical challenges related to data isolation. Moreover, the absence of collaborative methods for graph clustering limits their effectiveness. In this paper, we propose a collaborative graph clustering framework for attributed graphs, supporting attributed graph clustering over vertically partitioned data with different participants holding distinct features of the same data. Our method leverages a novel technique that reduces the sample space, improving the efficiency of the attributed graph clustering method. Furthermore, we compare our method to its centralized counterpart under a proximity condition, demonstrating that the successful local results of each participant contribute to the overall success of the collaboration. We fully implement our approach and evaluate its utility and efficiency by conducting experiments on four public datasets. The results demonstrate that our method achieves comparable accuracy levels to centralized attributed graph clustering methods. Our collaborative graph clustering framework provides an efficient and effective solution for graph clustering challenges related to data isolation.

[LG-16] Emergence of Implicit World Models from Mortal Agents NEURIPS2024

链接: https://arxiv.org/abs/2411.12304
作者: Kazuya Horibe,Naoto Yoshida
关键词-EN: open-ended behavior optimization, emergent properties, behavior optimization, autonomous agents, discuss the possibility
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: Accepted as a 1-page tiny paper in the Intrinsically Motivated Open-ended Learning workshop at NeurIPS 2024

点击查看摘要

Abstract:We discuss the possibility of world models and active exploration as emergent properties of open-ended behavior optimization in autonomous agents. In discussing the source of the open-endedness of living things, we start from the perspective of biological systems as understood by the mechanistic approach of theoretical biology and artificial life. From this perspective, we discuss the potential of homeostasis in particular as an open-ended objective for autonomous agents and as a general, integrative extrinsic motivation. We then discuss the possibility of implicitly acquiring a world model and active exploration through the internal dynamics of a network, and sketch a hypothetical architecture for this by combining meta-reinforcement learning, which assumes domain adaptation, with a system that achieves robust homeostasis.

[LG-17] A Review on Generative AI Models for Synthetic Medical Text, Time Series, and Longitudinal Data

链接: https://arxiv.org/abs/2411.12274
作者: Mohammad Loni,Fatemeh Poursalim,Mehdi Asadi,Arash Gharehbaghi
关键词-EN: synthetic health records, medical time series, time series, health records, longitudinal data
类目: Machine Learning (cs.LG)
*备注: 27 pages, 3 figures

点击查看摘要

Abstract:This paper presents the results of a novel scoping review on the practical models for generating three different types of synthetic health records (SHRs): medical text, time series, and longitudinal data. The innovative aspects of the review, which incorporate study objectives, data modality, and research methodology of the reviewed studies, uncover the importance and the scope of the topic for the digital medicine context. In total, 52 publications met the eligibility criteria for generating medical time series (22), longitudinal data (17), and medical text (13). Privacy preservation was found to be the main research objective of the studied papers, along with class imbalance, data scarcity, and data imputation as the other objectives. The adversarial network-based, probabilistic, and large language models exhibited superiority for generating synthetic longitudinal data, time series, and medical texts, respectively. Finding a reliable performance measure to quantify SHR re-identification risk is the major research gap of the topic.

[LG-18] On the Accuracy and Precision of Moving Averages to Estimate Wi-Fi Link Quality

链接: https://arxiv.org/abs/2411.12265
作者: Gianluca Cena,Gabriele Formis,Matteo Rosani,Stefano Scanzio
关键词-EN: noticeable variability, wireless communication technology, impairs performance, performance and determinism, communication technology
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: preprint, 8 pages, 2024

点击查看摘要

Abstract:The radio spectrum is characterized by a noticeable variability, which impairs performance and determinism of every wireless communication technology. To counteract this aspect, mechanisms like Minstrel are customarily employed in real Wi-Fi devices, and the adoption of machine learning for optimization is envisaged in next-generation Wi-Fi 8. All these approaches require communication quality to be monitored at runtime. In this paper, the effectiveness of simple techniques based on moving averages to estimate wireless link quality is analyzed, to assess their advantages and weaknesses. Results can be used, e.g., as a baseline when studying how artificial intelligence can be employed to mitigate unpredictability of wireless networks by providing reliable estimates about current spectrum conditions.
Journal reference: 29th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA 2024). DOI: https://doi.org/10.1109/ETFA61755.2024.10710784
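
As a concrete baseline of the kind analyzed here, link quality can be estimated from a stream of frame delivery outcomes with a simple moving average (SMA) or an exponential moving average (EMA); the window size and smoothing factor below are arbitrary illustrative choices, not the paper's settings.

```python
from collections import deque

def sma_estimator(outcomes, window=20):
    """Simple moving average of 0/1 frame delivery outcomes."""
    buf, estimates = deque(maxlen=window), []
    for ok in outcomes:
        buf.append(ok)
        estimates.append(sum(buf) / len(buf))
    return estimates

def ema_estimator(outcomes, alpha=0.1):
    """Exponentially weighted moving average: est <- (1-alpha)*est + alpha*x."""
    est, estimates = None, []
    for ok in outcomes:
        est = ok if est is None else (1 - alpha) * est + alpha * ok
        estimates.append(est)
    return estimates

outcomes = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]   # 1 = frame delivered, 0 = lost
print(sma_estimator(outcomes)[-1], ema_estimator(outcomes)[-1])
```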

[LG-19] Hyper-parameter Optimization for Federated Learning with Step-wise Adaptive Mechanism

链接: https://arxiv.org/abs/2411.12244
作者: Yasaman Saadati,M. Hadi Amini
关键词-EN: protects sensitive information, sharing clients’ raw, decentralized learning approach, Federated Learning, clients’ raw datasets
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a decentralized learning approach that protects sensitive information by utilizing local model parameters rather than sharing clients’ raw datasets. While this privacy-preserving method is widely employed across various applications, it still requires significant development and optimization. Automated Machine Learning (Auto-ML) has been adapted for reducing the need for manual adjustments. Previous studies have explored the integration of AutoML with different FL algorithms to evaluate their effectiveness in enhancing FL settings. However, Automated FL (Auto-FL) faces additional challenges due to the involvement of a large cohort of clients and global training rounds between clients and the server, rendering the tuning process time-consuming and nearly impossible on resource-constrained edge devices (e.g., IoT devices). This paper investigates the deployment and integration of two lightweight Hyper-Parameter Optimization (HPO) tools, Raytune and Optuna, within the context of FL settings. A step-wise feedback mechanism has also been designed to accelerate the hyper-parameter tuning process and coordinate AutoML toolkits with the FL server. To this end, both local and global feedback mechanisms are integrated to limit the search space and expedite the HPO process. Further, a novel client selection technique is introduced to mitigate the straggler effect in Auto-FL. The selected hyper-parameter tuning tools are evaluated using two benchmark datasets, FEMNIST, and CIFAR10. Further, the paper discusses the essential properties of successful HPO tools, the integration mechanism with the FL pipeline, and the challenges posed by the distributed and heterogeneous nature of FL environments.
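
For flavor, here is a minimal Optuna loop of the kind such a pipeline would wrap around FL training; `run_federated_rounds` is a hypothetical stand-in for actual federated training and evaluation, and the search space is illustrative.

```python
import optuna

def run_federated_rounds(lr, local_epochs, clients_per_round):
    # Hypothetical stand-in for a few rounds of real FL training; returns a
    # fake validation score so the sketch runs end to end.
    return 1.0 - abs(lr - 0.01) - 0.02 * abs(local_epochs - 3) - 0.01 * abs(clients_per_round - 5)

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    local_epochs = trial.suggest_int("local_epochs", 1, 5)
    clients = trial.suggest_int("clients_per_round", 2, 10)
    return run_federated_rounds(lr, local_epochs, clients)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params)
```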

[LG-20] Action-Attentive Deep Reinforcement Learning for Autonomous Alignment of Beamlines

链接: https://arxiv.org/abs/2411.12183
作者: Siyu Wang,Shengran Dai,Jianhui Jiang,Shuang Wu,Yufei Peng,Junbin Zhang
关键词-EN: Synchrotron radiation sources, radiation sources play, materials science, sources play, role in fields
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:Synchrotron radiation sources play a crucial role in fields such as materials science, biology, and chemistry. The beamline, a key subsystem of the synchrotron, modulates and directs the radiation to the sample for analysis. However, the alignment of beamlines is a complex and time-consuming process, primarily carried out manually by experienced engineers. Even minor misalignments in optical components can significantly affect the beam’s properties, leading to suboptimal experimental outcomes. Current automated methods, such as Bayesian optimization (BO) and reinforcement learning (RL), enhance performance, but limitations remain. The relationship between the current and target beam properties, crucial for determining the adjustment, is not fully considered. Additionally, the physical characteristics of optical elements are overlooked, such as the need to adjust specific devices to control the output beam’s spot size or position. This paper addresses the alignment of beamlines by modeling it as a Markov Decision Process (MDP) and training an intelligent agent using RL. The agent calculates adjustment values based on the current and target beam states, executes actions, and iterates until optimal parameters are achieved. A policy network with action attention is designed to improve decision-making by considering both state differences and the impact of optical components. Experiments on two simulated beamlines demonstrate that our algorithm outperforms existing methods, with ablation studies highlighting the effectiveness of the action attention-based policy network.

[LG-21] Fine-Grained Uncertainty Quantification via Collisions

链接: https://arxiv.org/abs/2411.12127
作者: Jesse Friedbaum,Sudarshan Adiga,Ravi Tandon
关键词-EN: fine-grained uncertainty quantification, collision matrix, uncertainty quantification, approach for fine-grained, fine-grained uncertainty
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a new approach for fine-grained uncertainty quantification (UQ) using a collision matrix. For a classification problem involving $K$ classes, the $K \times K$ collision matrix $S$ measures the inherent (aleatoric) difficulty in distinguishing between each pair of classes. In contrast to existing UQ methods, the collision matrix gives a much more detailed picture of the difficulty of classification. We discuss several possible downstream applications of the collision matrix, establish its fundamental mathematical properties, as well as show its relationship with existing UQ methods, including the Bayes error rate. We also address the new problem of estimating the collision matrix using one-hot labeled data. We propose a series of innovative techniques to estimate $S$. First, we learn a contrastive binary classifier which takes two inputs and determines if they belong to the same class. We then show that this contrastive classifier (which is PAC learnable) can be used to reliably estimate the Gramian matrix of $S$, defined as $G = S^T S$. Finally, we show that under very mild assumptions, $G$ can be used to uniquely recover $S$, a new result on stochastic matrices which could be of independent interest. Experimental results are also presented to validate our methods on several datasets.
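
One way to picture the pipeline: estimate the Gramian $G = S^T S$ (obtainable from a contrastive classifier) and then recover $S$. The sketch below assumes, purely for illustration, that $S$ is symmetric positive semidefinite so a matrix square root suffices; the paper's recovery result is more general.

```python
import numpy as np
from scipy.linalg import sqrtm

# Toy ground-truth collision matrix for K=3 classes (symmetric PSD by construction).
S_true = np.array([[0.80, 0.15, 0.05],
                   [0.15, 0.70, 0.15],
                   [0.05, 0.15, 0.80]])

G = S_true.T @ S_true          # Gramian, estimable from pairwise same-class probabilities
S_hat = np.real(sqrtm(G))      # recovery under the symmetric-PSD assumption
print(np.abs(S_hat - S_true).max())   # ~0: exact in this idealized case
```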

[LG-22] MMBind: Unleashing the Potential of Distributed and Heterogeneous Data for Multimodal Learning in IoT

链接: https://arxiv.org/abs/2411.12126
作者: Xiaomin Ouyang,Jason Wu,Tomoyoshi Kimura,Yihan Lin,Gunjan Verma,Tarek Abdelzaher,Mani Srivastava
关键词-EN: Multimodal, systems are increasingly, increasingly prevalent, Multimodal sensing systems, data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal sensing systems are increasingly prevalent in various real-world applications. Most existing multimodal learning approaches heavily rely on training with a large amount of complete multimodal data. However, such a setting is impractical in real-world IoT sensing applications where data is typically collected by distributed nodes with heterogeneous data modalities, and is also rarely labeled. In this paper, we propose MMBind, a new framework for multimodal learning on distributed and heterogeneous IoT data. The key idea of MMBind is to construct a pseudo-paired multimodal dataset for model training by binding data from disparate sources and incomplete modalities through a sufficiently descriptive shared modality. We demonstrate that data of different modalities observing similar events, even captured at different times and locations, can be effectively used for multimodal training. Moreover, we propose an adaptive multimodal learning architecture capable of training models with heterogeneous modality combinations, coupled with a weighted contrastive learning approach to handle domain shifts among disparate data. Evaluations on ten real-world multimodal datasets highlight that MMBind outperforms state-of-the-art baselines under varying data incompleteness and domain shift, and holds promise for advancing multimodal foundation model training in IoT applications.
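
A minimal sketch of the binding idea, not the actual MMBind implementation: pair samples from two incomplete-modality sources by nearest neighbors in a shared modality's embedding space. All arrays, modalities, and dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Source A holds (shared, audio) data; source B holds (shared, IMU) data.
shared_a, audio_a = rng.normal(size=(100, 32)), rng.normal(size=(100, 64))
shared_b, imu_b = rng.normal(size=(80, 32)), rng.normal(size=(80, 48))

# Bind: for each sample in A, find the most similar sample in B via the shared modality.
sims = shared_a @ shared_b.T                  # unnormalized similarity scores
nearest = sims.argmax(axis=1)

# Pseudo-paired multimodal dataset: (audio from A, IMU from the matched B sample).
pseudo_pairs = [(audio_a[i], imu_b[nearest[i]]) for i in range(len(audio_a))]
print(len(pseudo_pairs))
```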

[LG-23] Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

链接: https://arxiv.org/abs/2411.12118
作者: Tiberiu Musat
关键词-EN: simple reasoning task, number of layers, simple reasoning, minimum number, retrieval problem
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, I introduce the retrieval problem, a simple reasoning task that can be solved only by transformers with a minimum number of layers. The task has an adjustable difficulty that can further increase the required number of layers to any arbitrary value. I demonstrate that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. I find that successful learning occurs only under the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps in the trained transformers. I also study the training process, uncovering that attention heads always emerge in a specific sequence.

[LG-24] BALI: Learning Neural Networks via Bayesian Layerwise Inference

链接: https://arxiv.org/abs/2411.12102
作者: Richard Kurle,Alexej Klushyn,Ralf Herbrich
关键词-EN: multivariate Bayesian linear, linear regression models, learning Bayesian neural, Bayesian linear regression, stack of multivariate
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a new method for learning Bayesian neural networks, treating them as a stack of multivariate Bayesian linear regression models. The main idea is to infer the layerwise posterior exactly if we know the target outputs of each layer. We define these pseudo-targets as the layer outputs from the forward pass, updated by the backpropagated gradients of the objective function. The resulting layerwise posterior is a matrix-normal distribution with a Kronecker-factorized covariance matrix, which can be efficiently inverted. Our method extends to the stochastic mini-batch setting using an exponential moving average over natural-parameter terms, thus gradually forgetting older data. The method converges in a few iterations and performs as well as or better than leading Bayesian neural network methods on various regression, classification, and out-of-distribution detection benchmarks.
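
To ground the layerwise view, here is the closed-form posterior mean of a single multivariate Bayesian linear regression layer given its inputs and pseudo-targets. The prior and noise scales are arbitrary illustrative choices, and this is not the paper's full Kronecker-factorized update.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 256, 32, 16
X = rng.normal(size=(n, d_in))    # layer inputs from the forward pass
Y = rng.normal(size=(n, d_out))   # pseudo-targets (layer outputs + gradient update)

prior_prec, noise_prec = 1.0, 10.0
# Posterior mean of the layer's weight matrix W (columns treated independently):
#   W_mean = (noise_prec * X^T X + prior_prec * I)^{-1} * noise_prec * X^T Y
A = noise_prec * X.T @ X + prior_prec * np.eye(d_in)
W_mean = np.linalg.solve(A, noise_prec * X.T @ Y)
print(W_mean.shape)   # (d_in, d_out)
```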

[LG-25] Federated Contrastive Learning of Graph-Level Representations

链接: https://arxiv.org/abs/2411.12098
作者: Xiang Li,Gagan Agrawal,Rajiv Ramnath,Ruoming Jin
关键词-EN: Graph-level representations, classification based, representations, Graph-level, Contrastive Learning
类目: Machine Learning (cs.LG)
*备注: Accepted in BigData 2024. This is a preprint

点击查看摘要

Abstract:Graph-level representations (and clustering/classification based on these representations) are required in a variety of applications. Examples include identifying malicious network traffic, prediction of protein properties, and many others. Often, data has to stay in isolated local systems (i.e., cannot be centrally shared for analysis) due to a variety of considerations like privacy concerns, lack of trust between the parties, regulations, or simply because the data is too large to be shared sufficiently quickly. This points to the need for federated learning for graph-level representations, a topic that has not been explored much, especially in an unsupervised setting. Addressing this problem, this paper presents a new framework we refer to as Federated Contrastive Learning of Graph-level Representations (FCLG). As the name suggests, our approach builds on contrastive learning. However, what is unique is that we apply contrastive learning at two levels. The first application is for local unsupervised learning of graph representations. The second level is to address the challenge associated with data distribution variation (i.e. the ``Non-IID issue") when combining local models. Through extensive experiments on the downstream task of graph-level clustering, we demonstrate FCLG outperforms baselines (which apply existing federated methods on existing graph-level clustering methods) with significant margins.

[LG-26] Molecule Generation with Fragment Retrieval Augmentation NEURIPS2024

链接: https://arxiv.org/abs/2411.12078
作者: Seul Lee,Karsten Kreis,Srimukh Prasad Veccham,Meng Liu,Danny Reidenbach,Saee Paliwal,Arash Vahdat,Weili Nie
关键词-EN: desirable biochemical properties, achieved great success, Fragment-based drug discovery, fragment-based molecule generation, fragments
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Fragment-based drug discovery, in which molecular fragments are assembled into new molecules with desirable biochemical properties, has achieved great success. However, many fragment-based molecule generation methods show limited exploration beyond the existing fragments in the database as they only reassemble or slightly modify the given ones. To tackle this problem, we propose a new fragment-based molecule generation framework with retrieval augmentation, namely Fragment Retrieval-Augmented Generation (f-RAG). f-RAG is based on a pre-trained molecular generative model that proposes additional fragments from input fragments to complete and generate a new molecule. Given a fragment vocabulary, f-RAG retrieves two types of fragments: (1) hard fragments, which serve as building blocks that will be explicitly included in the newly generated molecule, and (2) soft fragments, which serve as reference to guide the generation of new fragments through a trainable fragment injection module. To extrapolate beyond the existing fragments, f-RAG updates the fragment vocabulary with generated fragments via an iterative refinement process which is further enhanced with post-hoc genetic fragment modification. f-RAG can achieve an improved exploration-exploitation trade-off by maintaining a pool of fragments and expanding it with novel and high-quality fragments through a strong generative prior.

[LG-27] Theoretical Corrections and the Leveraging of Reinforcement Learning to Enhance Triangle Attack

链接: https://arxiv.org/abs/2411.12071
作者: Nicole Meng,Caleb Manicke,David Chen,Yingjie Lao,Caiwen Ding,Pengyu Hong,Kaleel Mahmood
关键词-EN: decision based black-box, based black-box attacks, based black-box, machine learning models, sensitive domains
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adversarial examples represent a serious issue for the application of machine learning models in many sensitive domains. For generating adversarial examples, decision based black-box attacks are one of the most practical techniques as they only require query access to the model. One of the most recently proposed state-of-the-art decision based black-box attacks is Triangle Attack (TA). In this paper, we offer a high-level description of TA and explain potential theoretical limitations. We then propose a new decision based black-box attack, Triangle Attack with Reinforcement Learning (TARL). Our new attack addresses the limits of TA by leveraging reinforcement learning. This creates an attack that can achieve similar, if not better, attack accuracy than TA with half as many queries on state-of-the-art classifiers and defenses across ImageNet and CIFAR-10.

[LG-28] Interpretation of High-Dimensional Regression Coefficients by Comparison with Linearized Compressing Features

链接: https://arxiv.org/abs/2411.12060
作者: Joachim Schaeffer,Jinwook Rhyu,Robin Droop,Rolf Findeisen,Richard Braatz
关键词-EN: deemed inherently interpretable, Linear regression, inherently interpretable, challenges arise, deemed inherently
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This manuscript is a short communication. 9 pages, 4 figures

点击查看摘要

Abstract:Linear regression is often deemed inherently interpretable; however, challenges arise for high-dimensional data. We focus on further understanding how linear regression approximates nonlinear responses from high-dimensional functional data, motivated by predicting cycle life for lithium-ion batteries. We develop a linearization method to derive feature coefficients, which we compare with the closest regression coefficients of the path of regression solutions. We showcase the methods on battery data case studies where a single nonlinear compressing feature, $g\colon \mathbb{R}^p \to \mathbb{R}$, is used to construct a synthetic response, $\mathbf{y} \in \mathbb{R}$. This unifying view of linear regression and compressing features for high-dimensional functional data helps to understand (1) how regression coefficients are shaped in the highly regularized domain and how they relate to linearized feature coefficients and (2) how the shape of regression coefficients changes as a function of regularization to approximate nonlinear responses by exploiting local structures.
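
The linearization step can be illustrated directly: the feature coefficients of a nonlinear compressing feature $g$ are its gradient at a reference point, computed here by central finite differences around the data mean. The function $g$ and the data are toy stand-ins, not the battery features from the paper.

```python
import numpy as np

def g(x):
    # Toy nonlinear compressing feature g: R^p -> R.
    return np.log(1.0 + np.sum(x ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
x0 = X.mean(axis=0)               # linearization point

eps = 1e-5
grad = np.array([
    (g(x0 + eps * e) - g(x0 - eps * e)) / (2 * eps)
    for e in np.eye(len(x0))
])
# First-order approximation: g(x) ~ g(x0) + grad @ (x - x0),
# so `grad` plays the role of the linearized feature coefficients.
print(grad)
```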

[LG-29] Higher Order Graph Attention Probabilistic Walk Networks

链接: https://arxiv.org/abs/2411.12052
作者: Thomas Bailie,Yun Sing Koh,Karthik Mukkavilli
关键词-EN: Passing Neural Networks, inherently capture dependencies, Graphs inherently capture, Message Passing Neural, capture dependencies
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graphs inherently capture dependencies between nodes or variables through their topological structure, with paths between any two nodes indicating a sequential dependency on the nodes traversed. Message Passing Neural Networks (MPNNs) leverage these latent relationships embedded in graph structures, and have become widely adopted across diverse applications. However, many existing methods predominantly rely on local information within the 1-hop neighborhood. This approach has notable limitations; for example, 1-hop aggregation schemes inherently lose long-distance information, and are limited in expressive power as defined by the $k$-Weisfeiler-Leman ($k$-WL) isomorphism test. To address these issues, we propose the Higher Order Graphical Attention (HoGA) module, which assigns weights to variable-length paths sampled based on feature-vector diversity, effectively reconstructing the $k$-hop neighborhood. HoGA represents higher-order relationships as a robust form of self-attention, applicable to any single-hop attention mechanism. In empirical studies, applying HoGA to existing attention-based models consistently leads to significant accuracy improvements on benchmark node classification datasets. Furthermore, we observe that the performance degradation typically associated with additional message-passing steps may be mitigated.

[LG-30] Machine Learning Evaluation Metric Discrepancies across Programming Languages and Their Components: Need for Standardization

链接: https://arxiv.org/abs/2411.12032
作者: Mohammad R. Salmanpour,Morteza Alizadeh,Ghazal Mousavi,Saba Sadeghi,Sajad Amiri,Mehrdad Oveisi,Arman Rahmim,Ilker Hacihaliloglu
关键词-EN: study evaluates metrics, Cohens Kappa, statistical tests, correlation analysis, metrics
类目: Machine Learning (cs.LG); Software Engineering (cs.SE); Computational Physics (physics.comp-ph)
*备注: This paper is 12 pages with 1 table and 10 figures

点击查看摘要

Abstract:This study evaluates metrics for tasks such as classification, regression, clustering, correlation analysis, statistical tests, segmentation, and image-to-image (I2I) translation. Metrics were compared across Python libraries, R packages, and Matlab functions to assess their consistency and highlight discrepancies. The findings underscore the need for a unified roadmap to standardize metrics, ensuring reliable and reproducible ML evaluations across platforms. This study examined a wide range of evaluation metrics across various tasks and found only some to be consistent across platforms, such as (i) Accuracy, Balanced Accuracy, Cohen's Kappa, F-beta Score, MCC, Geometric Mean, AUC, and Log Loss in binary classification; (ii) Accuracy, Cohen's Kappa, and F-beta Score in multi-class classification; (iii) MAE, MSE, RMSE, MAPE, Explained Variance, Median AE, MSLE, and Huber in regression; (iv) Davies-Bouldin Index and Calinski-Harabasz Index in clustering; (v) Pearson, Spearman, Kendall’s Tau, Mutual Information, Distance Correlation, Percbend, Shepherd, and Partial Correlation in correlation analysis; (vi) Paired t-test, Chi-Square Test, ANOVA, Kruskal-Wallis Test, Shapiro-Wilk Test, Welch's t-test, and Bartlett’s test in statistical tests; (vii) Accuracy, Precision, and Recall in 2D segmentation; (viii) Accuracy in 3D segmentation; (ix) MAE, MSE, RMSE, and R-Squared in 2D-I2I translation; and (x) MAE, MSE, and RMSE in 3D-I2I translation. Given the observed discrepancies in a number of metrics (e.g., precision, recall, and F1 score in binary classification, WCSS in clustering, multiple statistical tests, and IoU in segmentation, amongst others), this study concludes that ML evaluation metrics require standardization and recommends that future research use consistent metrics for different tasks to effectively compare ML techniques and solutions.
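
One common source of such discrepancies is that the "same" metric hides different defaults across implementations. For example, F1 in scikit-learn depends on the `average` argument, so two platforms both reporting "F1" may compute different quantities:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]

# Same predictions, three different "F1 scores" depending on the averaging scheme.
for avg in ("macro", "micro", "weighted"):
    print(avg, f1_score(y_true, y_pred, average=avg))
```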

[LG-31] The Generalization Error of Machine Learning Algorithms

链接: https://arxiv.org/abs/2411.12030
作者: Samir M. Perlaza,Xinying Zou
关键词-EN: deriving closed-form expressions, generalization error, expected empirical risk, information measures, terms of information
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: Submitted to the IEEE Transaction on Information Theory. November 18, 2024

点击查看摘要

Abstract:In this paper, the method of gaps, a technique for deriving closed-form expressions in terms of information measures for the generalization error of machine learning algorithms is introduced. The method relies on two central observations: (a) ~The generalization error is an average of the variation of the expected empirical risk with respect to changes on the probability measure (used for expectation); and~ (b) ~these variations, also referred to as gaps, exhibit closed-form expressions in terms of information measures. The expectation of the empirical risk can be either with respect to a measure on the models (with a fixed dataset) or with respect to a measure on the datasets (with a fixed model), which results in two variants of the method of gaps. The first variant, which focuses on the gaps of the expected empirical risk with respect to a measure on the models, appears to be the most general, as no assumptions are made on the distribution of the datasets. The second variant develops under the assumption that datasets are made of independent and identically distributed data points. All existing exact expressions for the generalization error of machine learning algorithms can be obtained with the proposed method. Also, this method allows obtaining numerous new exact expressions, which improves the understanding of the generalization error; establish connections with other areas in statistics, e.g., hypothesis testing; and potentially, might guide algorithm designs.
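
For orientation, the quantity whose gaps are being analyzed is the usual generalization error; below is a minimal sketch of the standard definition in generic notation, not the paper's exact formalism.

```latex
% Generalization error: gap between expected population risk and expected empirical
% risk for a model \theta produced from a dataset D = (z_1, \dots, z_n).
\overline{\mathrm{gen}}
  = \mathbb{E}_{\theta, D}\!\left[ R(\theta) - \hat{R}(\theta, D) \right],
\qquad
\hat{R}(\theta, D) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, z_i).
```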

[LG-32] Compression of Higher Order Ambisonics with Multichannel RVQGAN

链接: https://arxiv.org/abs/2411.12008
作者: Toni Hirvonen,Mahmoud Namazi
关键词-EN: RVQGAN neural coding, third-order Ambisonics audio, RVQGAN neural, neural coding method, realized for data-driven
类目: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:A multichannel extension to the RVQGAN neural coding method is proposed, and realized for data-driven compression of third-order Ambisonics audio. The input- and output layers of the generator and discriminator models are modified to accept multiple (16) channels without increasing the model bitrate. We also propose a loss function for accounting for spatial perception in immersive reproduction, and transfer learning from single-channel models. Listening test results with 7.1.4 immersive playback show that the proposed extension is suitable for coding scene-based, 16-channel Ambisonics content with good quality at 16 kbit/s.

[LG-33] Transmission Line Outage Probability Prediction Under Extreme Events Using Peter-Clark Bayesian Structural Learning

链接: https://arxiv.org/abs/2411.11980
作者: Xiaolin Chen,Qiuhua Huang,Yuqi Zhou
关键词-EN: extreme weather events, Recent years, notable increase, frequency and intensity, intensity of extreme
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Recent years have seen a notable increase in the frequency and intensity of extreme weather events. With a rising number of power outages caused by these events, accurate prediction of power line outages is essential for safe and reliable operation of power grids. The Bayesian network is a probabilistic model that is very effective for predicting line outages under weather-related uncertainties. However, most existing studies in this area offer general risk assessments, but fall short of providing specific outage probabilities. In this work, we introduce a novel approach for predicting transmission line outage probabilities using a Bayesian network combined with Peter-Clark (PC) structural learning. Our approach not only enables precise outage probability calculations, but also demonstrates better scalability and robust performance, even with limited data. Case studies using data from BPA and NOAA show the effectiveness of this approach, while comparisons with several existing methods further highlight its advantages.
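
For flavor, here is a constraint-based structure-learning call of the kind the paper builds on. This assumes pgmpy's `PC` estimator API and toy discretized data; treat the exact signature as an assumption rather than the authors' code.

```python
import pandas as pd
from pgmpy.estimators import PC  # assumed API: pgmpy's constraint-based PC estimator

# Toy discretized records: weather severity, line age, and observed outage.
data = pd.DataFrame({
    "wind":   [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    "age":    [0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1],
    "outage": [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
})

# Learn a DAG skeleton and orientations from conditional-independence tests; the
# resulting structure would then be parameterized to yield outage probabilities.
dag = PC(data).estimate(ci_test="chi_square", significance_level=0.05)
print(dag.edges())
```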

[LG-34] Introducing Milabench: Benchmarking Accelerators for AI

链接: https://arxiv.org/abs/2411.11940
作者: Pierre Delaunay,Xavier Bouthillier,Olivier Breuleux,Satya Ortiz-Gagné,Olexa Bilaniuk,Fabrice Normandin,Arnaud Bergeron,Bruno Carrez,Guillaume Alain,Soline Blanc,Frédéric Osterrath,Joseph Viviano,Roger Creus-Castanyer,Darshan Patil,Rabiul Awal,Le Zhang
关键词-EN: standard HPC benchmarks, standard HPC, deep learning, high-performance computing, HPC benchmarks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:AI workloads, particularly those driven by deep learning, are introducing novel usage patterns to high-performance computing (HPC) systems that are not comprehensively captured by standard HPC benchmarks. As one of the largest academic research centers dedicated to deep learning, Mila identified the need to develop a custom benchmarking suite to address the diverse requirements of its community, which consists of over 1,000 researchers. This report introduces Milabench, the resulting benchmarking suite. Its design was informed by an extensive literature review encompassing 867 papers, as well as surveys conducted with Mila researchers. This rigorous process led to the selection of 26 primary benchmarks tailored for procurement evaluations, alongside 16 optional benchmarks for in-depth analysis. We detail the design methodology, the structure of the benchmarking suite, and provide performance evaluations using GPUs from NVIDIA, AMD, and Intel. The Milabench suite is open source and can be accessed at this http URL.

[LG-35] Artificial Intelligence Mangrove Monitoring System Based on Deep Learning and Sentinel-2 Satellite Data in the UAE (2017-2024)

链接: https://arxiv.org/abs/2411.11918
作者: Linlin Tan,Haishan Wu
关键词-EN: maintaining coastal ecosystem, coastal ecosystem health, protecting biodiversity, maintaining coastal, coastal ecosystem
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注: 17 pages, 9 figures

点击查看摘要

Abstract:Mangroves play a crucial role in maintaining coastal ecosystem health and protecting biodiversity. Therefore, continuous mapping of mangroves is essential for understanding their dynamics. Earth observation imagery typically provides a cost-effective way to monitor mangrove dynamics. However, there is a lack of regional studies on mangrove areas in the UAE. This study utilizes the UNet++ deep learning model combined with Sentinel-2 multispectral data and manually annotated labels to monitor the spatiotemporal dynamics of densely distributed mangroves (coverage greater than 70%) in the UAE from 2017 to 2024, achieving an mIoU of 87.8% on the validation set. Results show that the total mangrove area in the UAE in 2024 was approximately 9,142.21 hectares, an increase of 2,061.33 hectares compared to 2017, with carbon sequestration increasing by approximately 194,383.42 tons. Abu Dhabi has the largest mangrove area and plays a dominant role in the UAE’s mangrove growth, increasing by 1,855.6 hectares between 2017-2024, while other emirates have also contributed to mangrove expansion through stable and sustainable growth in mangrove areas. This comprehensive growth pattern reflects the collective efforts of all emirates in mangrove restoration.
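
For reference, a model of this kind can be instantiated in a few lines. The sketch below uses the segmentation_models_pytorch library's UNet++ with 10 input bands and 2 classes as illustrative assumptions, not the authors' exact configuration.

```python
import torch
import segmentation_models_pytorch as smp  # pip install segmentation-models-pytorch

# UNet++ for binary mangrove/background segmentation of multispectral tiles.
model = smp.UnetPlusPlus(
    encoder_name="resnet34",
    encoder_weights=None,   # multispectral input, so no ImageNet pretraining
    in_channels=10,         # illustrative number of Sentinel-2 bands
    classes=2,
)

x = torch.randn(1, 10, 256, 256)   # one 256x256 multispectral tile
with torch.no_grad():
    logits = model(x)
print(logits.shape)                # (1, 2, 256, 256)
```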

[LG-36] LoRA Unlearns More and Retains More (Student Abstract) AAAI-25

链接: https://arxiv.org/abs/2411.11907
作者: Atharv Mittal
关键词-EN: increasing privacy regulations, Due to increasing, Machine Unlearning, regulatory compliance, increasing privacy
类目: Machine Learning (cs.LG)
*备注: AAAI-25 Student Abstract

点击查看摘要

Abstract:Due to increasing privacy regulations and regulatory compliance, Machine Unlearning (MU) has become essential. The goal of unlearning is to remove information related to a specific class from a model. Traditional approaches achieve exact unlearning by retraining the model on the remaining dataset, but incur high computational costs. This has driven the development of more efficient unlearning techniques, including model sparsification techniques, which boost computational efficiency, but degrade the model’s performance on the remaining classes. To mitigate these issues, we propose a novel method, PruneLoRA which introduces a new MU paradigm, termed prune first, then adapt, then unlearn. LoRA (Hu et al. 2022) reduces the need for large-scale parameter updates by applying low-rank updates to the model. We leverage LoRA to selectively modify a subset of the pruned model’s parameters, thereby reducing the computational cost, memory requirements and improving the model’s ability to retain performance on the remaining classes. Experimental results across various metrics showcase that our method outperforms other approximate MU methods and bridges the gap between exact and approximate unlearning. Our code is available at this https URL.
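
The LoRA update referenced here is easy to sketch: freeze a base weight matrix and learn a low-rank correction on top of it. This minimal module is illustrative background, not the PruneLoRA implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(128, 64))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```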

[LG-37] MultiBalance: Multi-Objective Gradient Balancing in Industrial-Scale Multi-Task Recommendation System

链接: https://arxiv.org/abs/2411.11871
作者: Yun He,Xuxing Chen,Jiayi Xu,Renqin Cai,Yiling You,Jennifer Cao,Minhui Huang,Liu Yang,Yiqun Liu,Xiaoyi Liu,Rong Jin,Sem Park,Bo Long,Xue Feng
关键词-EN: improve recommendation performance, multiple tasks simultaneously, industrial recommendation systems, learning multiple tasks, joint learning tasks
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:In industrial recommendation systems, multi-task learning (learning multiple tasks simultaneously on a single model) is a predominant approach to save training/serving resources and improve recommendation performance via knowledge transfer between the joint learning tasks. However, multi-task learning often suffers from negative transfer: one or several tasks are less optimized than training them separately. To carefully balance the optimization, we propose a gradient balancing approach called MultiBalance, which is suitable for industrial-scale multi-task recommendation systems. It balances the per-task gradients to alleviate the negative transfer, while saving the huge cost for grid search or manual explorations for appropriate task weights. Moreover, compared with prior work that normally balance the per-task gradients of shared parameters, MultiBalance is more efficient since only requiring to access per-task gradients with respect to the shared feature representations. We conduct experiments on Meta’s large-scale ads and feeds multi-task recommendation system, and observe that MultiBalance achieves significant gains (e.g., 0.738% improvement for normalized entropy (NE)) with neutral training cost in Queries Per Second (QPS), which is significantly more efficient than prior methods that balance per-task gradients of shared parameters with 70~80% QPS degradation.
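
The efficiency argument can be sketched: take per-task gradients with respect to the shared feature representation (not all shared parameters), rescale them to comparable norms, and backpropagate the combined gradient. This toy version is illustrative, not Meta's implementation.

```python
import torch
import torch.nn as nn

shared = nn.Linear(32, 16)                       # shared bottom of a multi-task model
heads = [nn.Linear(16, 1) for _ in range(3)]     # one head per task

x = torch.randn(64, 32)
h = shared(x)                                    # shared feature representation

# Per-task gradients w.r.t. h only (cheap: no gradients for shared params yet).
task_grads = []
for head in heads:
    loss = head(h).pow(2).mean()                 # stand-in per-task loss
    (g,) = torch.autograd.grad(loss, h, retain_graph=True)
    task_grads.append(g)

# Balance: normalize each task gradient, then average and push into shared params.
balanced = torch.stack([g / (g.norm() + 1e-12) for g in task_grads]).mean(dim=0)
h.backward(balanced)
print(shared.weight.grad.shape)
```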

[LG-38] Testing classical properties from quantum data

链接: https://arxiv.org/abs/2411.12730
作者: Matthias C. Caro,Preksha Naik,Joseph Slote
关键词-EN: Boolean functions, Omega, quantum data, quantum, properties of Boolean
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 38 + 14 pages, 2 tables, 2 figures

点击查看摘要

Abstract:Many properties of Boolean functions can be tested far more efficiently than the function can be learned. However, this advantage often disappears when testers are limited to random samples, a natural setting for data science, rather than queries. In this work we investigate the quantum version of this scenario: quantum algorithms that test properties of a function $f$ solely from quantum data in the form of copies of the function state for $f$. For three well-established properties, we show that the speedup lost when restricting classical testers to samples can be recovered by testers that use quantum data. For monotonicity testing, we give a quantum algorithm that uses $\tilde{\mathcal{O}}(n^2)$ function state copies as compared to the $2^{\Omega(\sqrt{n})}$ samples required classically. We also present $\mathcal{O}(1)$-copy testers for symmetry and triangle-freeness, comparing favorably to classical lower bounds of $\Omega(n^{1/4})$ and $\Omega(n)$ samples respectively. These algorithms are time-efficient and necessarily include techniques beyond the Fourier sampling approaches applied to earlier testing problems. These results make the case for a general study of the advantages afforded by quantum data for testing. We contribute to this project by complementing our upper bounds with a lower bound of $\Omega(1/\varepsilon)$ for monotonicity testing from quantum data in the proximity regime $\varepsilon \leq \mathcal{O}(n^{-3/2})$. This implies a strict separation between testing monotonicity from quantum data and from quantum queries, where $\tilde{\mathcal{O}}(n)$ queries suffice when $\varepsilon = \Theta(n^{-3/2})$. We also exhibit a testing problem that can be solved from $\mathcal{O}(1)$ classical queries but requires $\Omega(2^{n/2})$ function state copies, complementing a separation of the same magnitude in the opposite direction derived from the Forrelation problem.

[LG-39] Leadsee-Precip: A Deep Learning Diagnostic Model for Precipitation

链接: https://arxiv.org/abs/2411.12640
作者: Weiwen Ji,Jin Feng,Yueqi Liu,Yulu Qiu,Hua Gao
关键词-EN: deep-learning weather forecasting, surpassed traditional numerical, precipitation, surpassed traditional, Recently
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, deep-learning weather forecasting models have surpassed traditional numerical models in terms of the accuracy of meteorological variables. However, there is considerable potential for improvements in precipitation forecasts, especially for heavy precipitation events. To address this deficiency, we propose Leadsee-Precip, a global deep learning model to generate precipitation from meteorological circulation fields. The model utilizes an information balance scheme to tackle the challenges of predicting heavy precipitation caused by the long-tail distribution of precipitation data. Additionally, more accurate satellite and radar-based precipitation retrievals are used as training targets. Compared to artificial intelligence global weather models, the heavy precipitation from Leadsee-Precip is more consistent with observations and shows competitive performance against global numerical weather prediction models. Leadsee-Precip can be integrated with any global circulation model to generate precipitation forecasts. But the deviations between the predicted and the ground-truth circulation fields may lead to a weakened precipitation forecast, which could potentially be mitigated by further fine-tuning based on the predicted circulation fields.

[LG-40] Reward driven workflows for unsupervised explainable analysis of phases and ferroic variants from atomically resolved imaging data

链接: https://arxiv.org/abs/2411.12612
作者: Kamyar Barakati,Yu Liu,Chris Nelson,Maxim A. Ziatdinov,Xiaohang Zhang,Ichiro Takeuchi,Sergei V. Kalinin
关键词-EN: aberration corrected electron, corrected electron microscopy, electron microscopy necessitates, microscopy necessitates development, Rapid progress
类目: Materials Science (cond-mat.mtrl-sci); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 19 pages, 6 figures

点击查看摘要

Abstract:Rapid progress in aberration corrected electron microscopy necessitates development of robust methods for the identification of phases, ferroic variants, and other pertinent aspects of materials structure from imaging data. While unsupervised methods for clustering and classification are widely used for these tasks, their performance can be sensitive to hyperparameter selection in the analysis workflow. In this study, we explore the effects of descriptors and hyperparameters on the capability of unsupervised ML methods to distill local structural information, exemplified by discovery of polarization and lattice distortion in Sm doped BiFeO3 (BFO) thin films. We demonstrate that a reward-driven approach can be used to optimize these key hyperparameters across the full workflow, where rewards were designed to reflect domain wall continuity and straightness, ensuring that the analysis aligns with the material’s physical behavior. This approach allows us to discover local descriptors that are best aligned with the specific physical behavior, providing insight into the fundamental physics of materials. We further extend the reward driven workflows to disentangle structural factors of variation via optimized variational autoencoder (VAE). Finally, the importance of well-defined rewards was explored as a quantifiable measure of success of the workflow.

[LG-41] GNNAS-Dock: Budget Aware Algorithm Selection with Graph Neural Networks for Molecular Docking

链接: https://arxiv.org/abs/2411.12597
作者: Yiliang Yuan,Mustafa Misir
关键词-EN: discovery and design, major element, element in drug, drug discovery, Graph Neural Network
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Molecular docking is a major element in drug discovery and design. It enables the prediction of ligand-protein interactions by simulating the binding of small molecules to proteins. Despite the availability of numerous docking algorithms, no single algorithm consistently outperforms the others across a diverse set of docking scenarios. This paper introduces GNNAS-Dock, a novel Graph Neural Network (GNN)-based automated algorithm selection system for molecular docking in blind docking situations. GNNs are employed to process the complex structural data of both ligands and proteins. They benefit from the inherent graph-like properties to predict the performance of various docking algorithms under different conditions. The present study pursues two main objectives: 1) predict the performance of each candidate docking algorithm, in terms of Root Mean Square Deviation (RMSD), thereby identifying the most accurate method for specific scenarios; and 2) choose the best computationally efficient docking algorithm for each docking case, aiming to reduce the time required for docking while maintaining high accuracy. We validate our approach on the PDBBind 2020 refined set, which contains about 5,300 pairs of protein-ligand complexes.

[LG-42] A data driven approach to classify descriptors based on their efficiency in translating noisy trajectories into physically-relevant information

链接: https://arxiv.org/abs/2411.12570
作者: Simone Martino,Domiziano Doria,Chiara Lionello,Matteo Becchi,Giovanni M. Pavan
关键词-EN: Reconstructing the physical, Reconstructing, Orientational Tetrahedral Order, many-body dynamical systems, descriptors
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 19 pages, 5 figures + 3 in supporting information (at the bottom of the manuscript)

点击查看摘要

Abstract:Reconstructing the physical complexity of many-body dynamical systems can be challenging. Starting from the trajectories of their constitutive units (raw data), typical approaches require selecting appropriate descriptors to convert them into time-series, which are then analyzed to extract interpretable information. However, identifying the most effective descriptor is often non-trivial. Here, we report a data-driven approach to compare the efficiency of various descriptors in extracting information from noisy trajectories and translating it into physically relevant insights. As a prototypical system with non-trivial internal complexity, we analyze molecular dynamics trajectories of an atomistic system where ice and water coexist in equilibrium near the solid/liquid transition temperature. We compare general and specific descriptors often used in aqueous systems: number of neighbors, molecular velocities, Smooth Overlap of Atomic Positions (SOAP), Local Environments and Neighbors Shuffling (LENS), Orientational Tetrahedral Order, and distance from the fifth neighbor ($d_5$). Using Onion Clustering – an efficient unsupervised method for single-point time-series analysis – we assess the maximum extractable information for each descriptor and rank them via a high-dimensional metric. Our results show that advanced descriptors like SOAP and LENS outperform classical ones due to higher signal-to-noise ratios. Nonetheless, even simple descriptors can rival or exceed advanced ones after local signal denoising. For example, $d_5$, initially among the weakest, becomes the most effective at resolving the system’s non-local dynamical complexity after denoising. This work highlights the critical role of noise in information extraction from molecular trajectories and offers a data-driven approach to identify optimal descriptors for systems with characteristic internal complexity.

[LG-43] Stream-Based Active Learning for Process Monitoring

链接: https://arxiv.org/abs/2411.12563
作者: Christian Capezza,Antonio Lepore,Kamran Paynabar
关键词-EN: Statistical process monitoring, normal operating conditions, Statistical process, true process state, Traditional SPM methods
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Statistical process monitoring (SPM) methods are essential tools in quality management to check the stability of industrial processes, i.e., to dynamically classify the process state as in control (IC), under normal operating conditions, or out of control (OC), otherwise. Traditional SPM methods are based on unsupervised approaches, which are popular because in most industrial applications the true OC states of the process are not explicitly known. This has hampered the development of supervised methods that could instead take advantage of process data containing labels on the true process state, although such methods still need improvement in dealing with class imbalance, as OC states are rare in high-quality processes, and in dynamically recognizing unseen classes, e.g., new OC states. This article presents a novel stream-based active learning strategy for SPM that enhances partially hidden Markov models to deal with data streams. The ultimate goal is to optimize labeling resources constrained by a limited budget and dynamically update the possible OC states. The proposed method's performance in classifying the true state of the process is assessed through a simulation and a case study on the SPM of a resistance spot welding process in the automotive industry, which motivated this research.

[LG-44] Perfecting Imperfect Physical Neural Networks with Transferable Robustness using Sharpness-Aware Training

链接: https://arxiv.org/abs/2411.12352
作者: Tengji Xu,Zeyu Luo,Shaojie Liu,Li Fan,Qiarong Xiao,Benshan Wang,Dongliang Wang,Chaoran Huang
关键词-EN: traditional digital hardware, science and engineering, digital hardware, training, SAT
类目: Optics (physics.optics); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 24 pages, 4 figures

点击查看摘要

Abstract:AI models are essential in science and engineering, but recent advances are pushing the limits of traditional digital hardware. To address these limitations, physical neural networks (PNNs), which use physical substrates for computation, have gained increasing attention. However, developing effective training methods for PNNs remains a significant challenge. Current approaches, whether for offline or online training, suffer from significant accuracy loss. Offline training is hindered by imprecise modeling, while online training yields device-specific models that can’t be transferred to other devices due to manufacturing variances. Both methods face challenges from perturbations after deployment, such as thermal drift or alignment errors, which make trained models invalid and require retraining. Here, we address the challenges with both offline and online training through a novel technique called Sharpness-Aware Training (SAT), where we innovatively leverage the geometry of the loss landscape to tackle the problems in training physical systems. SAT enables accurate training using efficient backpropagation algorithms, even with imprecise models. PNNs trained by SAT offline even outperform those trained online, despite modeling and fabrication errors. SAT also overcomes online training limitations by enabling reliable transfer of models between devices. Finally, SAT is highly resilient to perturbations after deployment, allowing PNNs to continuously operate accurately under perturbations without retraining. We demonstrate SAT across three types of PNNs, showing it is universally applicable, regardless of whether the models are explicitly known. This work offers a transformative, efficient approach to training PNNs, addressing critical challenges in analog computing and enabling real-world deployment.
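
The abstract does not spell out the update rule, but published sharpness-aware methods (e.g., SAM) share a two-step structure that the PyTorch sketch below illustrates: climb along the gradient to a nearby "sharp" point, take the gradient there, then update the original weights. This is a generic sketch under that assumption, not the paper's exact SAT procedure.

```python
import torch

def sharpness_aware_step(model, loss_fn, x, y, optimizer, rho=0.05):
    # Step 1: gradient at the current weights
    loss = loss_fn(model(x), y)
    loss.backward()
    with torch.no_grad():
        norm = torch.sqrt(sum((p.grad ** 2).sum()
                              for p in model.parameters() if p.grad is not None))
        eps = []
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (norm + 1e-12)  # ascend toward the sharper point
            p.add_(e)
            eps.append((p, e))
    optimizer.zero_grad()
    # Step 2: gradient at the perturbed weights drives the actual update
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in eps:
            p.sub_(e)                          # restore the original weights
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```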

[LG-45] Hierarchical Spatio-Temporal Uncertainty Quantification for Distributed Energy Adoption

链接: https://arxiv.org/abs/2411.12193
作者: Wenbin Zhou,Shixiang Zhu,Feng Qiu,Xuan Wu
关键词-EN: distributed energy resources, power grid management, necessitating accurate multilevel, accurate multilevel forecasting, introduced significant spatio-temporal
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The rapid deployment of distributed energy resources (DER) has introduced significant spatio-temporal uncertainties in power grid management, necessitating accurate multilevel forecasting methods. However, existing approaches often produce overly conservative uncertainty intervals at individual spatial units and fail to properly capture uncertainties when aggregating predictions across different spatial scales. This paper presents a novel hierarchical spatio-temporal model based on the conformal prediction framework to address these challenges. Our approach generates circuit-level DER growth predictions and efficiently aggregates them to the substation level while maintaining statistical validity through a tailored non-conformity score. Applied to a decade of DER installation data from a local utility network, our method demonstrates superior performance over existing approaches, particularly in reducing prediction interval widths while maintaining coverage.
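
The conformal machinery underneath can be shown in its simplest split form. This is a textbook sketch, not the paper's hierarchical model or its tailored non-conformity score; the model object and variable names are placeholders.

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, X_test, alpha=0.1):
    # Non-conformity score: absolute residual on a held-out calibration set
    scores = np.abs(y_cal - model.predict(X_cal))
    n = len(scores)
    # Finite-sample-corrected quantile gives ~(1 - alpha) marginal coverage
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pred = model.predict(X_test)
    return pred - q, pred + q
```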

[LG-46] Sensor-fusion based Prognostics Framework for Complex Engineering Systems Exhibiting Multiple Failure Modes

链接: https://arxiv.org/abs/2411.12159
作者: Benjamin Peters,Ayush Mohanty,Xiaolei Fang,Stephen K. Robinson,Nagi Gebraeel
关键词-EN: Complex engineering systems, Complex engineering, multiple failure modes, failure modes, failure
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Complex engineering systems are often subject to multiple failure modes. Developing a remaining useful life (RUL) prediction model that does not consider the failure mode causing degradation is likely to result in inaccurate predictions. However, distinguishing between causes of failure without manually inspecting the system is nontrivial. This challenge is compounded when the causes of historically observed failures are unknown. Sensors, which are useful for monitoring the state-of-health of systems, can also be used for distinguishing between multiple failure modes, as the presence of multiple failure modes results in discriminative behavior of the sensor signals. When systems are equipped with multiple sensors, some sensors may exhibit behavior correlated with degradation, while other sensors do not. Furthermore, which sensors exhibit this behavior may differ for each failure mode. In this paper, we present a simultaneous clustering and sensor selection approach for unlabeled training datasets of systems exhibiting multiple failure modes. The cluster assignments and the selected sensors are then utilized in real-time to first diagnose the active failure mode and then to predict the system RUL. We validate the complete pipeline of the methodology on a simulated dataset of systems exhibiting two failure modes and on a turbofan degradation dataset from NASA.

[LG-47] Tangential Randomization in Linear Bandits (TRAiL): Guaranteed Inference and Regret Bounds

链接: https://arxiv.org/abs/2411.12154
作者: Arda Güçlü,Subhonmesh Bose
关键词-EN: Tangential Randomization, computationally efficient regret-optimal, efficient regret-optimal forced, regret-optimal forced exploration, strongly convex functions
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 42 pages, 6 Figures

点击查看摘要

Abstract:We propose and analyze TRAiL (Tangential Randomization in Linear Bandits), a computationally efficient regret-optimal forced exploration algorithm for linear bandits on action sets that are sublevel sets of strongly convex functions. TRAiL estimates the governing parameter of the linear bandit problem through standard regularized least squares and perturbs the reward-maximizing action corresponding to said point estimate along the tangent plane of the convex compact action set before projecting back to it. Exploiting concentration results for matrix martingales, we prove that TRAiL ensures an \Omega(\sqrt{T}) growth in the inference quality, measured via the minimum eigenvalue of the design (regressor) matrix, with high probability over a T-length period. We build on this result to obtain an \mathcal{O}(\sqrt{T} \log(T)) upper bound on cumulative regret with probability at least 1 - 1/T over T periods, and compare TRAiL to other popular algorithms for linear bandits. Then, we characterize an \Omega(\sqrt{T}) minimax lower bound for any algorithm on the expected regret that covers a wide variety of action/parameter sets and noise processes. Our analysis not only expands the realm of lower bounds in linear bandits significantly, but, as a byproduct, yields a trade-off between regret and inference quality. Specifically, we prove that any algorithm with an \mathcal{O}(T^\alpha) expected regret growth must have an \Omega(T^{1-\alpha}) asymptotic growth in expected inference quality. Our experiments on L^p unit balls as action sets reveal how this relation can be violated, but only in the short run, before returning to respect the bound asymptotically. In effect, regret-minimizing algorithms must have just the right rate of inference – too fast or too slow inference will incur sub-optimal regret growth.
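
A toy version of the main loop on the unit-ball action set makes the "tangential" part concrete. All constants (dimension, noise scale, the exploration size gamma) are illustrative, and the paper's general strongly convex action sets and theoretical guarantees are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam, gamma = 5, 1000, 1.0, 0.1
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
V = lam * np.eye(d)              # regularized design (regressor) matrix
b = np.zeros(d)                  # running sum of action * reward

for t in range(T):
    theta_hat = np.linalg.solve(V, b)      # regularized least squares
    a = theta_hat if np.linalg.norm(theta_hat) > 1e-9 else rng.normal(size=d)
    a = a / np.linalg.norm(a)              # greedy action on the sphere
    z = rng.normal(size=d)
    z -= (z @ a) * a                       # tangential noise: orthogonal to a
    a = a + gamma * z
    a /= np.linalg.norm(a)                 # project back to the action set
    r = a @ theta_star + 0.1 * rng.normal()
    V += np.outer(a, a)
    b += r * a
```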

[LG-48] Exact Risk Curves of signSGD in High-Dimensions: Quantifying Preconditioning and Noise-Compression Effects

链接: https://arxiv.org/abs/2411.12135
作者: Ke Liang Xiao,Noah Marshall,Atish Agarwala,Elliot Paquette
关键词-EN: understand adaptive optimizers, recent years, practical optimizer, adaptive optimizers, garnered interest
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, signSGD has garnered interest both as a practical optimizer and as a simple model for understanding adaptive optimizers like Adam. Though there is a general consensus that signSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high-dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.
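
For readers unfamiliar with the object of study, the signSGD update itself is one line: keep only the sign of each gradient coordinate. A toy run on a badly conditioned quadratic (all constants illustrative) shows the implicit diagonal preconditioning the paper quantifies, since every coordinate moves at the same speed regardless of curvature.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([10.0, 1.0, 0.1])    # badly conditioned quadratic loss 0.5 * w'Aw
w = rng.normal(size=3)
lr = 0.01
for _ in range(500):
    grad = A @ w + 0.05 * rng.normal(size=3)   # noisy gradient
    w -= lr * np.sign(grad)                    # signSGD: keep only the sign
print(w)   # approaches the minimizer 0 up to an lr-sized floor
```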

[LG-49] The Statistical Accuracy of Neural Posterior and Likelihood Estimation

链接: https://arxiv.org/abs/2411.12068
作者: David T. Frazier,Ryan Kelly,Christopher Drovandi,David J. Warne
关键词-EN: Neural posterior estimation, neural likelihood estimation, complex modeling scenarios, machine learning approaches, conducting amortized inference
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Neural posterior estimation (NPE) and neural likelihood estimation (NLE) are machine learning approaches that provide accurate posterior and likelihood approximations in complex modeling scenarios, and in situations where conducting amortized inference is a necessity. While such methods have shown significant promise across a range of diverse scientific applications, the statistical accuracy of these methods is so far unexplored. In this manuscript, we give, for the first time, an in-depth exploration of the statistical behavior of NPE and NLE. We prove that these methods have similar theoretical guarantees to common statistical methods like approximate Bayesian computation (ABC) and Bayesian synthetic likelihood (BSL). While NPE and NLE methods are just as accurate as ABC and BSL, we prove that this accuracy can often be achieved at a vastly reduced computational cost, and that they will therefore deliver more attractive approximations than ABC and BSL in certain problems. We verify our results theoretically and in several examples from the literature.

[LG-50] Prediction-Guided Active Experiments

链接: https://arxiv.org/abs/2411.12036
作者: Ruicheng Ao,Hongyu Chen,David Simchi-Levi
关键词-EN: active experimentation, machine learning, Prediction-Guided Active Experiment, sampling distribution, semi-parametric efficiency bound
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注: 25 pages, 11 figures

点击查看摘要

Abstract:In this work, we introduce a new framework for active experimentation, the Prediction-Guided Active Experiment (PGAE), which leverages predictions from an existing machine learning model to guide sampling and experimentation. Specifically, at each time step, an experimental unit is sampled according to a designated sampling distribution, and the actual outcome is observed based on an experimental probability. Otherwise, only a prediction for the outcome is available. We begin by analyzing the non-adaptive case, where full information on the joint distribution of the predictor and the actual outcome is assumed. For this scenario, we derive an optimal experimentation strategy by minimizing the semi-parametric efficiency bound for the class of regular estimators. We then introduce an estimator that meets this efficiency bound, achieving asymptotic optimality. Next, we move to the adaptive case, where the predictor is continuously updated with newly sampled data. We show that the adaptive version of the estimator remains efficient and attains the same semi-parametric bound under certain regularity assumptions. Finally, we validate PGAE’s performance through simulations and a semi-synthetic experiment using data from the US Census Bureau. The results underscore the PGAE framework’s effectiveness and superiority compared to other existing methods.
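
The mechanism can be mimicked with a toy estimator: observe the true outcome with probability p, fall back on the model prediction otherwise, and debias with inverse-probability weighting. This is our simplified reading of the setup, not the paper's semi-parametrically efficient estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10_000, 0.2
y = rng.normal(loc=1.0, size=n)            # true outcomes (mostly unobserved)
f = y + rng.normal(scale=0.5, size=n)      # imperfect model predictions
observed = rng.random(n) < p               # experiment is run w.p. p
# Prediction mean plus an IPW-corrected residual term: unbiased for E[y]
est = f.mean() + np.mean(observed * (y - f)) / p
print(est, y[observed].mean())   # typically lower variance than labels alone
```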

[LG-51] On the Efficiency of ERM in Feature Learning

链接: https://arxiv.org/abs/2411.12029
作者: Ayoub El Hanchi,Chris J. Maddison,Murat A. Erdogdu
关键词-EN: feature maps indexed, linear classes induced, optimal feature map, empirical risk minimization, feature maps
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 23 pages, 0 figures

点击查看摘要

Abstract:Given a collection of feature maps indexed by a set \mathcal{T}, we study the performance of empirical risk minimization (ERM) on regression problems with square loss over the union of the linear classes induced by these feature maps. This setup aims at capturing the simplest instance of feature learning, where the model is expected to jointly learn from the data an appropriate feature map and a linear predictor. We start by studying the asymptotic quantiles of the excess risk of sequences of empirical risk minimizers. Remarkably, we show that when the set \mathcal{T} is not too large and when there is a unique optimal feature map, these quantiles coincide, up to a factor of two, with those of the excess risk of the oracle procedure, which knows a priori this optimal feature map and deterministically outputs an empirical risk minimizer from the associated optimal linear class. We complement this asymptotic result with a non-asymptotic analysis that quantifies the decaying effect of the global complexity of the set \mathcal{T} on the excess risk of ERM, and relates it to the size of the sublevel sets of the suboptimality of the feature maps. As an application of our results, we obtain new guarantees on the performance of the best subset selection procedure in sparse linear regression under general assumptions.
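
The ERM procedure being analyzed is easy to state in code: fit least squares on each candidate feature map and keep the one with the lowest empirical risk. The sketch below only pins down the object of study; the paper's contribution is the risk analysis, not the procedure itself.

```python
import numpy as np

def erm_over_feature_maps(feature_maps, X, y):
    """feature_maps: candidate callables indexed by the set T."""
    best = (np.inf, None, None)
    for phi in feature_maps:
        Phi = phi(X)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        risk = np.mean((Phi @ w - y) ** 2)     # empirical square loss
        if risk < best[0]:
            best = (risk, phi, w)
    return best   # (risk, selected feature map, linear predictor)
```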

[LG-52] Pricing Weather Derivatives: A Time Series Neural Network Approach

链接: https://arxiv.org/abs/2411.12013
作者: Marco Hening-Tallarico,Pablo Olivares
关键词-EN: underlying climate variables, price weather derivative, weather derivative contracts, derivative contracts based, climate variables
类目: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The objective of the paper is to price weather derivative contracts based on temperature and precipitation as underlying climate variables. We use a neural network approach combined with time series forecasting to value the Pacific Rim index in Toronto and Chicago.

[LG-53] SynCoTrain: A Dual Classifier PU-learning Framework for Synthesizability Prediction

链接: https://arxiv.org/abs/2411.12011
作者: Sasan Amariamir,Janine George,Philipp Benner
关键词-EN: driving advancements, cornerstone of modern, advancements in diverse, diverse disciplines, disciplines from biomedical
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Material discovery is a cornerstone of modern science, driving advancements in diverse disciplines from biomedical technology to climate solutions. Predicting synthesizability, a critical factor in realizing novel materials, remains a complex challenge due to the limitations of traditional heuristics and thermodynamic proxies. While stability metrics such as formation energy offer partial insights, they fail to account for kinetic factors and technological constraints that influence synthesis outcomes. These challenges are further compounded by the scarcity of negative data, as failed synthesis attempts are often unpublished or context-specific. We present SynCoTrain, a semi-supervised machine learning model designed to predict the synthesizability of materials. SynCoTrain employs a co-training framework leveraging two complementary graph convolutional neural networks: SchNet and ALIGNN. By iteratively exchanging predictions between classifiers, SynCoTrain mitigates model bias and enhances generalizability. Our approach uses Positive and Unlabeled (PU) Learning to address the absence of explicit negative data, iteratively refining predictions through collaborative learning. The model demonstrates robust performance, achieving high recall on internal and leave-out test sets. By focusing on oxide crystals, a well-characterized material family with extensive experimental data, we establish SynCoTrain as a reliable tool for predicting synthesizability while balancing dataset variability and computational efficiency. This work highlights the potential of co-training to advance high-throughput materials discovery and generative research, offering a scalable solution to the challenge of synthesizability prediction.
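
A schematic of the PU co-training loop may help; the real model exchanges predictions between two graph networks (SchNet and ALIGNN), while the sketch below substitutes generic sklearn classifiers and a simple confidence heuristic, so it should be read as an outline of the idea only. It assumes the unlabeled pool is at least as large as the positive set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pu_co_train(X_pos, X_unlab, rounds=5, k=50, seed=0):
    rng = np.random.default_rng(seed)
    # PU heuristic: a random unlabeled subset serves as provisional negatives
    neg_idx = rng.choice(len(X_unlab), size=len(X_pos), replace=False)
    X = np.vstack([X_pos, X_unlab[neg_idx]])
    y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_pos))]
    clf_a = RandomForestClassifier(random_state=seed)
    clf_b = RandomForestClassifier(random_state=seed + 1)
    for _ in range(rounds):
        clf_a.fit(X, y)
        # A's most confident unlabeled positives become extra labels for B
        conf = np.argsort(clf_a.predict_proba(X_unlab)[:, 1])[-k:]
        clf_b.fit(np.vstack([X, X_unlab[conf]]), np.r_[y, np.ones(k)])
        # B's confident predictions would symmetrically extend A's set (omitted)
    return clf_a, clf_b
```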

[LG-54] Active learning for efficient discovery of optimal gene combinations in the combinatorial perturbation space

链接: https://arxiv.org/abs/2411.12010
作者: Jason Qin,Hans-Hermann Wessels,Carlos Fernandez-Granda,Yuhan Hao
关键词-EN: screening technologies enables, CRISPR screening technologies, combinatorial CRISPR screening, synergistic gene combinations, screening technologies
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The advancement of novel combinatorial CRISPR screening technologies enables the identification of synergistic gene combinations on a large scale. This is crucial for developing novel and effective combination therapies, but the combinatorial space makes exhaustive experimentation infeasible. We introduce NAIAD, an active learning framework that efficiently discovers optimal gene pairs capable of driving cells toward desired cellular phenotypes. NAIAD leverages single-gene perturbation effects and adaptive gene embeddings that scale with the training data size, mitigating overfitting in small-sample learning while capturing complex gene interactions as more data is collected. Evaluated on four CRISPR combinatorial perturbation datasets totaling over 350,000 genetic interactions, NAIAD, trained on small datasets, outperforms existing models by up to 40% relative to the second-best. NAIAD’s recommendation system prioritizes gene pairs with the maximum predicted effects, resulting in the highest marginal gain in each AI-experiment round and accelerating discovery with fewer CRISPR experimental iterations. Our NAIAD framework (this https URL) improves the identification of novel, effective gene combinations, enabling more efficient CRISPR library design and offering promising applications in genomics research and therapeutic development.

[LG-55] Phenome-wide causal proteomics enhance systemic lupus erythematosus flare prediction: A study in Asian populations

链接: https://arxiv.org/abs/2411.11915
作者: Liying Chen,Ou Deng,Ting Fang,Mei Chen,Xvfeng Zhang,Ruichen Cong,Dingqi Lu,Runrun Zhang,Qun Jin,Xinchang Wang
关键词-EN: Systemic lupus erythematosus, complex autoimmune disease, autoimmune disease characterized, Systemic lupus, SLE Disease Activity
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Objective: Systemic lupus erythematosus (SLE) is a complex autoimmune disease characterized by unpredictable flares. This study aimed to develop a novel proteomics-based risk prediction model specifically for Asian SLE populations to enhance personalized disease management and early intervention. Methods: A longitudinal cohort study was conducted over 48 weeks, including 139 SLE patients monitored every 12 weeks. Patients were classified into flare (n = 53) and non-flare (n = 86) groups. Baseline plasma samples underwent data-independent acquisition (DIA) proteomics analysis, and phenome-wide Mendelian randomization (PheWAS) was performed to evaluate causal relationships between proteins and clinical predictors. Logistic regression (LR) and random forest (RF) models were used to integrate proteomic and clinical data for flare risk prediction. Results: Five proteins (SAA1, B4GALT5, GIT2, NAA15, and RPIA) were significantly associated with SLE Disease Activity Index-2K (SLEDAI-2K) scores and 1-year flare risk, implicating key pathways such as B-cell receptor signaling and platelet degranulation. SAA1 demonstrated causal effects on flare-related clinical markers, including hemoglobin and red blood cell counts. A combined model integrating clinical and proteomic data achieved the highest predictive accuracy (AUC = 0.769), surpassing individual models. SAA1 was highlighted as a priority biomarker for rapid flare discrimination. Conclusion: The integration of proteomic and clinical data significantly improves flare prediction in Asian SLE patients. The identification of key proteins and their causal relationships with flare-related clinical markers provides valuable insights for proactive SLE management and personalized therapeutic approaches.

[LG-56] HeartBERT: A Self-Supervised ECG Embedding Model for Efficient and Effective Medical Signal Analysis

链接: https://arxiv.org/abs/2411.11896
作者: Saedeh Tahery,Fatemeh Hamid Akhlaghi,Termeh Amirsoleimani,Saeed Farzi
关键词-EN: minimizing computational resources, analyze Electrocardiogram, simultaneously improving performance, Bidirectional Encoder Representations, machine learning systems
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: First version, 24 pages, 8 Figures, 7 Tables

点击查看摘要

Abstract:The HeartBERT model is introduced with three primary objectives: reducing the need for labeled data, minimizing computational resources, and simultaneously improving performance in machine learning systems that analyze Electrocardiogram (ECG) signals. Inspired by Bidirectional Encoder Representations from Transformers (BERT) in natural language processing and enhanced with a self-supervised learning approach, the HeartBERT model, built on the RoBERTa architecture, generates sophisticated embeddings tailored for ECG-based projects in the medical domain. To demonstrate the versatility, generalizability, and efficiency of the proposed model, two key downstream tasks have been selected: sleep stage detection and heartbeat classification. HeartBERT-based systems, utilizing bidirectional LSTM heads, are designed to address complex challenges. A series of practical experiments has been conducted to demonstrate the superiority and advancements of HeartBERT, particularly in terms of its ability to perform well with smaller training datasets, reduced learning parameters, and effective performance compared to rival models. The code and data are publicly available at this https URL.

[LG-57] How Many Data are Enough? Optimization of Data Collection for Artifact Detection in EEG Recordings

链接: https://arxiv.org/abs/2411.11886
作者: Lu Wang-Nöth,Philipp Heiler,Hai Huang,Daniel Lichtenstern,Alexandra Reichenbach,Luis Flacke,Linus Maisch,Helmut Mayer
关键词-EN: data, data collection, Objective, collection, biological data collection
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Objective. Electroencephalography (EEG) is a widely used neuroimaging technique known for its cost-effectiveness and user-friendliness. However, the presence of various artifacts, particularly biological artifacts like Electromyography (EMG) ones, leads to a poor signal-to-noise ratio, limiting the precision of analyses and applications. The currently reported EEG data cleaning performance largely depends on the data used for validation, and in the case of machine learning approaches, also on the data used for training. The data are typically gathered either by recruiting subjects to perform specific artifact tasks or by integrating existing datasets. Prevailing approaches, however, tend to rely on intuitive, concept-oriented data collection with minimal justification for the selection of artifacts and their quantities. Given the substantial costs associated with biological data collection and the pressing need for effective data utilization, we propose an optimization procedure for data-oriented data collection design using deep learning-based artifact detection. Approach. We apply a binary classification between artifact epochs (time intervals containing artifacts) and non-artifact epochs (time intervals containing no artifact) using three different architectures. Our aim is to minimize data collection efforts while preserving the cleaning efficiency. Main results. We were able to reduce the number of artifact tasks from twelve to three and decrease repetitions of isometric contraction tasks from ten to three or sometimes even just one. Significance. Our work addresses the need for effective data utilization in biological data collection, offering a systematic and dynamic quantitative approach. By providing clear justifications for the choices of artifacts and their quantity, we aim to guide future studies toward more effective and economical data collection in EEG and EMG research.

[LG-58] Longitudinal Wrist PPG Analysis for Reliable Hypertension Risk Screening Using Deep Learning

链接: https://arxiv.org/abs/2411.11863
作者: Hui Lin,Jiyang Li,Ramy Hussein,Xin Sui,Xiaoyu Li,Guangpu Zhu,Aggelos K. Katsaggelos,Zijing Zeng,Yelei Li
关键词-EN: blood pressure monitoring, leading risk factor, cardiovascular diseases, blood pressure, pressure monitoring
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: blood pressure, hypertension, cuffless, photoplethysmography, deep learning

点击查看摘要

Abstract:Hypertension is a leading risk factor for cardiovascular diseases. Traditional blood pressure monitoring methods are cumbersome and inadequate for continuous tracking, prompting the development of PPG-based cuffless blood pressure monitoring wearables. This study leverages deep learning models, including ResNet and Transformer, to analyze wrist PPG data collected with a smartwatch for efficient hypertension risk screening, eliminating the need for handcrafted PPG features. Using the Home Blood Pressure Monitoring (HBPM) longitudinal dataset of 448 subjects and five-fold cross-validation, our model was trained on over 68k spot-check instances from 358 subjects and tested on real-world continuous recordings of 90 subjects. The compact ResNet model with 0.124M parameters performed significantly better than traditional machine learning methods, demonstrating its effectiveness in distinguishing between healthy and abnormal cases in real-world scenarios.

[LG-59] Robust Graph Neural Networks for Stability Analysis in Dynamic Networks

链接: https://arxiv.org/abs/2411.11848
作者: Xin Zhang,Zhen Xu,Yue Liu,Mengfang Sun,Tong Zhou,Wenying Sun
关键词-EN: financial, economic risk identification, globalization and digitalization, financial system, financial institutions
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: It was accepted by the 3rd International Conference on Cloud Computing Big Data Application and Software Engineering

点击查看摘要

Abstract:In the current context of accelerated globalization and digitalization, the complexity and uncertainty of financial markets are increasing, and the identification and prevention of economic risks have become a key link in maintaining the stability of the financial system. Traditional risk identification methods often have limitations because they are difficult to cope with the multi-level and dynamically changing complex relationships in financial networks. With the rapid development of financial technology, graph neural network (GNN) technology, as an emerging deep learning method, has gradually shown great potential in the field of financial risk management. GNN can map transaction behaviors, financial institutions, individuals, and their interactive relationships in financial networks into graph structures, and effectively capture potential patterns and abnormal signals in financial data through embedded representation learning. Using this technology, financial institutions can extract valuable information from complex transaction networks, identify hidden dangers or abnormal behaviors that may cause systemic risks in a timely manner, optimize decision-making processes, and improve the accuracy of risk warnings. This paper explores the economic risk identification algorithm based on the GNN algorithm, aiming to provide financial institutions and regulators with more intelligent technical tools to help maintain the security and stability of the financial market. Improving the efficiency of economic risk identification through innovative technical means is expected to further enhance the risk resistance of the financial system and lay the foundation for building a robust global financial system.

信息检索

[IR-0] PseudoSeer: a Search Engine for Pseudocode

链接: https://arxiv.org/abs/2411.12649
作者: Levent Toksoz,Mukund Srinath,Gang Tan,C. Lee Giles
关键词-EN: facilitate efficient retrieval, pseudocode search engine, designed to facilitate, facilitate efficient, efficient retrieval
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:A novel pseudocode search engine is designed to facilitate efficient retrieval and search of academic papers containing pseudocode. By leveraging Elasticsearch, the system enables users to search across various facets of a paper, such as the title, abstract, author information, and LaTeX code snippets, while supporting advanced features like combined facet searches and exact-match queries for more targeted results. A description of the data acquisition process is provided, with arXiv as the primary data source, along with methods for data extraction and text-based indexing, highlighting how different data elements are stored and optimized for search. A weighted BM25-based ranking algorithm is used by the search engine, and the factors considered when prioritizing search results for both single and combined facet searches are described. We explain how each facet is weighted in a combined search. Several search engine results pages are displayed. Finally, a brief overview of future work is given, and a potential evaluation methodology for assessing the effectiveness and performance of the search engine is described.
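
Field-level boosting in Elasticsearch is the natural way to realize the weighted facet ranking described here. The query below uses the official Python client; the index and field names ("papers", "title", "pseudocode") and the boost values are our guesses for illustration, not the system's actual schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(index="papers", query={
    "multi_match": {
        "query": "dijkstra shortest path",
        # Per-field boosts approximate per-facet weights in the BM25 ranking
        "fields": ["title^3", "abstract^2", "pseudocode^4", "authors"],
    }
})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```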

[IR-1] Towards Unifying Feature Interaction Models for Click-Through Rate Prediction

链接: https://arxiv.org/abs/2411.12441
作者: Yu Kang,Junwei Pan,Jipeng Jin,Shudong Huang,Xiaofeng Gao,Lei Xiao
关键词-EN: predicting click-through rates, accurately predicting click-through, Modeling feature interactions, feature interactions plays, click-through rates
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Modeling feature interactions plays a crucial role in accurately predicting click-through rates (CTR) in advertising systems. To capture the intricate patterns of interaction, many existing models employ matrix-factorization techniques to represent features as lower-dimensional embedding vectors, enabling the modeling of interactions as products between these embeddings. In this paper, we propose a general framework called IPA to systematically unify these models. Our framework comprises three key components: the Interaction Function, which facilitates feature interaction; the Layer Pooling, which constructs higher-level interaction layers; and the Layer Aggregator, which combines the outputs of all layers to serve as input for the subsequent classifier. We demonstrate that most existing models can be categorized within our framework by making specific choices for these three components. Through extensive experiments and a dimensional collapse analysis, we evaluate the performance of these choices. Furthermore, by leveraging the most powerful components within our framework, we introduce a novel model that achieves competitive results compared to state-of-the-art CTR models. PFL achieves a significant GMV lift in online A/B tests on Tencent’s advertising platform and has been deployed as the production model in several primary scenarios.
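
As one concrete instance of the "products between embeddings" interaction function that the framework unifies, the classic factorization-machine identity computes all pairwise inner products in linear time. This is a standard formula given for context, not code from the paper.

```python
import numpy as np

def fm_pairwise_interactions(emb):
    """emb: (num_fields, dim) embedding matrix for one sample.
    Returns sum_{i<j} <v_i, v_j> via 0.5 * (||sum_i v_i||^2 - sum_i ||v_i||^2)."""
    s = emb.sum(axis=0)
    return 0.5 * (s @ s - (emb * emb).sum())

emb = np.random.default_rng(3).normal(size=(4, 8))
print(fm_pairwise_interactions(emb))
```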

[IR-2] Scalable and Effective Negative Sample Generation for Hyperedge Prediction

链接: https://arxiv.org/abs/2411.12354
作者: Shilin Qu,Weiqing Wang,Yuan-Fang Li,Quoc Viet Hung Nguyen,Hongzhi Yin
关键词-EN: including social networks, understanding complex multi-entity, complex multi-entity interactions, Hyperedge prediction, web-based applications
类目: Information Retrieval (cs.IR)
*备注: 11

点击查看摘要

Abstract:Hyperedge prediction is crucial in hypergraph analysis for understanding complex multi-entity interactions in various web-based applications, including social networks and e-commerce systems. Traditional methods often face difficulties in generating high-quality negative samples due to the imbalance between positive and negative instances. To address this, we present the Scalable and Effective Negative Sample Generation for Hyperedge Prediction (SEHP) framework, which utilizes diffusion models to tackle these challenges. SEHP employs a boundary-aware loss function that iteratively refines negative samples, moving them closer to decision boundaries to improve classification performance. SEHP samples positive instances to form sub-hypergraphs for scalable batch processing. By using structural information from sub-hypergraphs as conditions within the diffusion process, SEHP effectively captures global patterns. To enhance efficiency, our approach operates directly in latent space, avoiding the need for discrete ID generation and resulting in significant speed improvements while preserving accuracy. Extensive experiments show that SEHP outperforms existing methods in accuracy, efficiency, and scalability, representing a substantial advancement in hyperedge prediction techniques. Our code is available here.

[IR-3] Consistency Regularization for Complementary Clothing Recommendations

链接: https://arxiv.org/abs/2411.12295
作者: Shuiying Liao,P.Y. Mok,Li Li
关键词-EN: Bayesian Personalized Ranking, Personalized Ranking, Bayesian Personalized, Consistency Regularized model, biased learning caused
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This paper reports on the development of a Consistency Regularized model for Bayesian Personalized Ranking (CR-BPR), addressing the drawbacks of existing complementary clothing recommendation methods, namely limited consistency and biased learning caused by the diverse feature scales of multi-modal data. Compared to other product types, fashion preferences are inherently subjective and more personal, and fashion is often presented, not as an individual clothing product, but with other complementary product(s) in a well-coordinated outfit. Current complementary-product recommendation studies primarily focus on user preference and product matching; this study further emphasizes the consistency observed in user-product interactions as well as product-product interactions, in the specific context of clothing matching. Most traditional approaches underplay the impact of existing wardrobe items on future matching choices, resulting in less effective preference prediction models. Moreover, many models based on multi-modal information overlook the limitations arising from the various feature scales involved. To address these gaps, the CR-BPR model integrates collaborative filtering techniques to incorporate both user preference and product matching modeling, with a unique focus on consistency regularization for each aspect. Additionally, the incorporation of a feature scaling process further addresses the imbalances caused by different feature scales, ensuring that the model can effectively handle multi-modal data without being skewed by any particular type of feature. The effectiveness of the CR-BPR model was validated through detailed analysis involving two benchmark datasets. The results confirmed that the proposed approach significantly outperforms existing models.
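
For context, the vanilla BPR objective that CR-BPR regularizes scores an observed item above an unobserved one for the same user. The paper's consistency terms and feature scaling are not shown in this minimal sketch.

```python
import numpy as np

def bpr_loss(u, i_pos, i_neg, U, V):
    """U, V: user and item embedding tables; indices are illustrative."""
    x = U[u] @ (V[i_pos] - V[i_neg])           # preference margin
    return -np.log(1.0 / (1.0 + np.exp(-x)))   # -log sigmoid(margin)
```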

[IR-4] SymphonyQG: Towards Symphonious Integration of Quantization and Graph for Approximate Nearest Neighbor Search SIGMOD2025

链接: https://arxiv.org/abs/2411.12229
作者: Yutong Gou,Jianyang Gao,Yuexuan Xu,Cheng Long
关键词-EN: high-dimensional Euclidean space, Approximate nearest neighbor, Approximate nearest, high-dimensional Euclidean, Euclidean space
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注: The paper has been accepted by SIGMOD 2025

点击查看摘要

Abstract:Approximate nearest neighbor (ANN) search in high-dimensional Euclidean space has a broad range of applications. Among existing ANN algorithms, graph-based methods have shown superior performance in terms of the time-accuracy trade-off. However, they face performance bottlenecks due to the random memory accesses caused by the searching process on the graph indices and the costs of computing exact distances to guide the searching process. To relieve the bottlenecks, a recent method named NGT-QG makes an attempt by integrating quantization and graph. It (1) replicates and stores the quantization codes of a vertex’s neighbors compactly so that they can be accessed sequentially, and (2) uses a SIMD-based implementation named FastScan to efficiently estimate distances based on the quantization codes in batch for guiding the searching process. While NGT-QG achieves promising improvements over the vanilla graph-based methods, it has not fully unleashed the potential of integrating quantization and graph. For instance, it entails a re-ranking step to compute exact distances at the end, which introduces extra random memory accesses; its graph structure is not jointly designed considering the in-batch nature of FastScan, which wastes computation during the search. In this work, following NGT-QG, we present a new method named SymphonyQG, which achieves more symphonious integration of quantization and graph (e.g., it avoids the explicit re-ranking step and refines the graph structure to be more aligned with FastScan). Based on extensive experiments on real-world datasets, SymphonyQG establishes the new state-of-the-art in terms of the time-accuracy trade-off.

[IR-5] INDIANA: Personalized Travel Recommendations Using Wearables and AI

链接: https://arxiv.org/abs/2411.12227
作者: Anastasios Manos,Despina Elisabeth Filipidou,Ioannis Deliyannis,Nikolaos Pavlidis,Vasilis Argyros,Ioanna Mazi
关键词-EN: tailored activity suggestions, recommendation system developed, travel recommendation system, user preferences, current location
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: Accepted as position paper at 8th International Workshop on Chatbots and Human-Centred AI - CONVERSATIONS 2024

点击查看摘要

Abstract:This work presents a personalized travel recommendation system developed as part of the INDIANA platform, designed to enhance the tourist experience through tailored activity suggestions. It leverages data from wearable devices, user preferences, current location, weather forecasts, and activity history to provide real-time, context-aware recommendations. The platform not only supports individual tourists in maximizing their travel experience but also offers insights that help tourism professionals enhance service delivery. By integrating modern technologies such as AI, IoT, and wearable analytics, it provides a seamless, personalized, and engaging experience for travelers.

[IR-6] Sparser Training for On-Device Recommendation Systems

链接: https://arxiv.org/abs/2411.12205
作者: Yunke Qu,Liang Qu,Tong Chen,Xiangyu Zhao,Jianxin Li,Hongzhi Yin
关键词-EN: substantial memory consumption, Recommender systems, leading to substantial, consumption and inefficiencies, systems often rely
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender systems often rely on large embedding tables that map users and items to dense vectors of uniform size, leading to substantial memory consumption and inefficiencies. This is particularly problematic in memory-constrained environments like mobile and Web of Things (WoT) applications, where scalability and real-time performance are critical. Various research efforts have sought to address these issues. Although embedding pruning methods utilizing Dynamic Sparse Training (DST) stand out due to their low training and inference costs, consistent sparsity, and end-to-end differentiability, they face key challenges. Firstly, they typically initialize the mask matrix, which is used to prune redundant parameters, with random uniform sparse initialization. This strategy often results in suboptimal performance as it creates unstructured and inefficient connections. Secondly, they tend to favor the users/items sampled in the single batch immediately before weight exploration when they reactivate pruned parameters with large gradient magnitudes, which does not necessarily improve the overall performance. Thirdly, while they use sparse weights during forward passes, they still need to compute dense gradients during backward passes. In this paper, we propose SparseRec, a lightweight embedding method based on DST, to address these issues. Specifically, SparseRec initializes the mask matrix using Nonnegative Matrix Factorization. It accumulates gradients to identify the inactive parameters that can better improve the model performance after activation. Furthermore, it avoids dense gradients during backpropagation by sampling a subset of important vectors. Gradients are calculated only for parameters in this subset, thus maintaining sparsity during training in both forward and backward passes.
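
The NMF-based mask initialization can be sketched as follows; this is our reading of the abstract (factorize the nonnegative interaction matrix and keep the top-magnitude entries per row active), and the density and dimension values are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_mask(R, dim=16, density=0.2, seed=0):
    """R: nonnegative user-item interaction matrix."""
    W = NMF(n_components=dim, init="random", random_state=seed).fit_transform(R)
    k = max(1, int(density * dim))
    mask = np.zeros_like(W, dtype=bool)
    top = np.argsort(W, axis=1)[:, -k:]      # k largest entries per row
    np.put_along_axis(mask, top, True, axis=1)
    return mask

R = np.abs(np.random.default_rng(4).normal(size=(100, 50)))
print(nmf_mask(R).mean())   # roughly `density` fraction of active parameters
```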

[IR-7] Multi-Grained Preference Enhanced Transformer for Multi-Behavior Sequential Recommendation

链接: https://arxiv.org/abs/2411.12179
作者: Chuan He,Yongchao Liu,Qiang Li,Weiqiang Wang,Xin Fu,Xinyi Fu,Chuntao Hong,Xinwei Yao
关键词-EN: aims to predict, purchasing item, Sequential recommendation, users’ dynamic preference, Sequential
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注: 12 pages

点击查看摘要

Abstract:Sequential recommendation (SR) aims to predict the next purchased item according to users’ dynamic preferences learned from their historical user-item interactions. To improve recommendation performance, learning dynamic heterogeneous cross-type behavior dependencies is indispensable for recommender systems. However, there still exist some challenges in Multi-Behavior Sequential Recommendation (MBSR). On the one hand, existing methods only model heterogeneous multi-behavior dependencies at the behavior level or item level, and modeling interaction-level dependencies is still a challenge. On the other hand, the dynamic multi-grained behavior-aware preference, which reflects the interaction-aware sequential pattern, is hard to capture in interaction sequences. To tackle these challenges, we propose a Multi-Grained Preference enhanced Transformer framework (M-GPT). First, M-GPT constructs an interaction-level graph of historical cross-typed interactions in a sequence. Then graph convolution is performed repeatedly to derive interaction-level multi-behavior dependency representations, in which the complex correlations between historical cross-typed interactions at specific orders can be well learned. Secondly, a novel multi-scale transformer architecture equipped with multi-grained user preference extraction is proposed to encode the interaction-aware sequential pattern, enhanced by capturing temporal behavior-aware multi-grained preferences. Experiments on real-world datasets indicate that our method M-GPT consistently outperforms various state-of-the-art recommendation methods.

[IR-8] Metamorphic Evaluation of ChatGPT as a Recommender System

链接: https://arxiv.org/abs/2411.12121
作者: Madhurima Khirbat,Yongli Ren,Pablo Castells,Mark Sanderson
关键词-EN: Large Language Models, Language Models, Large Language, rise of Large, traditional recommender systems
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:With the rise of Large Language Models (LLMs) such as ChatGPT, researchers have been working on how to utilize LLMs for better recommendations. However, although LLMs exhibit black-box and probabilistic characteristics (meaning their internal workings are not visible), the evaluation frameworks used for assessing these LLM-based recommender systems (RS) are the same as those used for traditional recommender systems. To address this gap, we introduce metamorphic testing for the evaluation of GPT-based RS. This testing technique involves defining metamorphic relations (MRs) between the inputs and checking whether the relations are satisfied in the outputs. Specifically, we examine MRs from both the RS and LLM perspectives, including rating multiplication/shifting in RS and adding spaces/randomness to the LLM prompt via prompt perturbation. Similarity metrics (e.g., Kendall \tau and Rank-Biased Overlap (RBO)) are deployed to measure whether the relations are satisfied in the outputs of the MRs. The experimental results on the MovieLens dataset with GPT-3.5 show that lower similarity is obtained in terms of Kendall \tau and RBO, which indicates the need for a comprehensive evaluation of LLM-based RS beyond the existing evaluation metrics used for traditional recommender systems.
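
A single metamorphic check is easy to express: transform the input with a relation such as rating shifting, then measure how close the two output rankings are. The sketch below uses Kendall's tau from SciPy and assumes `recommend` returns a ranked list over the same item set for both inputs; RBO is omitted for brevity.

```python
from scipy.stats import kendalltau

def check_rating_shift_mr(recommend, ratings, shift=1):
    """ratings: {user: {item: rating}}; recommend returns a ranked item list."""
    base = recommend(ratings)
    shifted = recommend({u: {i: r + shift for i, r in items.items()}
                         for u, items in ratings.items()})
    pos = {item: rank for rank, item in enumerate(shifted)}
    tau, _ = kendalltau(range(len(base)), [pos[i] for i in base])
    return tau   # 1.0 means the "shift preserves ranking" relation holds exactly
```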

[IR-9] Preprocessing for lessening the influence of eye artifacts in EEG analysis

链接: https://arxiv.org/abs/2411.12092
作者: Alejandro Villena,Lorenzo J. Tardon,Isabel Barbancho,Ana M. Barbancho,Elvira Brattico,Niels T. Haumann
关键词-EN: lengthy trials, EEG signal components, extract EEG signal, signal components, EEG signals, their influence
类目: ignal Processing (eess.SP); Information Retrieval (cs.IR)
*备注: 16 pages, journal article

点击查看摘要

Abstract:We dealt with the problem of artifacts in EEG signals in relation to the usage of lengthy trials. Specifically, we considered eye artifacts found in EEG signals, their influence on the analysis of the data, and alternatives to diminish their impact on later studies of brain activity in lengthy tasks. We proposed a scheme of partial rejection on independent signal components, provided a method to extract EEG signal components with diminished influence of eye artifacts, and assessed the importance of using artifact-free signal excerpts to extract signal components in order to analyze brain activity in a musical context.
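
A common baseline for the component-rejection idea is ICA-based cleaning: unmix the channels, zero the components judged ocular, and reconstruct. The sketch below uses scikit-learn's FastICA with full rejection of the marked components; the paper's partial-rejection scheme is more refined than this.

```python
import numpy as np
from sklearn.decomposition import FastICA

def remove_eye_components(eeg, bad_components):
    """eeg: (n_samples, n_channels); bad_components: indices judged ocular."""
    ica = FastICA(n_components=eeg.shape[1], random_state=0)
    sources = ica.fit_transform(eeg)        # estimated independent components
    sources[:, bad_components] = 0.0        # reject ocular components
    return ica.inverse_transform(sources)   # reconstruct cleaned channels
```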

附件下载

点击下载今日全部论文列表