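Arxiv Daily Papers | 2024-11-21

This post lists the latest papers retrieved from Arxiv.org on 2024-11-21. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data comes from Arxiv.org and is refreshed automatically at around 12:00 every day.

Reminder: if you would like the daily paper data sent to your inbox, leave your email address in the comments.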

[Quick read]: This paper addresses the automatic detection of minority stress expressed by LGBTQ+ users on social media. The core of the approach is a hybrid model that combines Graph Neural Networks (GNN) with BERT (Bidirectional Encoder Representations from Transformers), a pre-trained deep language model. Pretraining on large amounts of raw data lets the model pick up hidden linguistic nuances, and transductive learning builds representations for the labeled training data and the unlabeled test data jointly. On the LGBTQ+ MiSSoM+ dataset, the RoBERTa-GCN model reaches an accuracy of 0.86 and an F1 score of 0.86, outperforming the other baseline models and improving the prediction of minority stress expressed by LGBTQ+ users on social media.

Link: https://arxiv.org/abs/2411.13534
Authors: S. Chapagain,Y. Zhao,T. K. Rohleen,S. M. Hamdi,S. F. Boubrahimi,R. E. Flinn,E. M. Lund,D. Klooster,J. R. Scheer,C. J. Cascalheira
Keywords: Individuals who identify, minority stress, experience poorer health, LGBTQ, including lesbian
Categories: Computation and Language (cs.CL)
Comments: This paper is accepted in 2024 IEEE 11th International Conference on Data Science and Advanced Analytics (DSAA)


Abstract:Individuals who identify as sexual and gender minorities, including lesbian, gay, bisexual, transgender, queer, and others (LGBTQ+) are more likely to experience poorer health than their heterosexual and cisgender counterparts. One primary source that drives these health disparities is minority stress (i.e., chronic and social stressors unique to LGBTQ+ communities’ experiences adapting to the dominant culture). This stress is frequently expressed in LGBTQ+ users’ posts on social media platforms. However, these expressions are not just straightforward manifestations of minority stress. They involve linguistic complexity (e.g., idiom or lexical diversity), rendering them challenging for many traditional natural language processing methods to detect. In this work, we designed a hybrid model using Graph Neural Networks (GNN) and Bidirectional Encoder Representations from Transformers (BERT), a pre-trained deep language model to improve the classification performance of minority stress detection. We experimented with our model on a benchmark social media dataset for minority stress detection (LGBTQ+ MiSSoM+). The dataset is comprised of 5,789 human-annotated Reddit posts from LGBTQ+ subreddits. Our approach enables the extraction of hidden linguistic nuances through pretraining on a vast amount of raw data, while also engaging in transductive learning to jointly develop representations for both labeled training data and unlabeled test data. The RoBERTa-GCN model achieved an accuracy of 0.86 and an F1 score of 0.86, surpassing the performance of other baseline models in predicting LGBTQ+ minority stress. Improved prediction of minority stress expressions on social media could lead to digital health interventions to improve the wellbeing of LGBTQ+ people-a community with high rates of stress-sensitive health problems.
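The transductive idea in the abstract can be sketched with a single graph-convolution propagation step. This is a minimal illustration under stated assumptions, not the paper's model: the two-node graph, the feature values, and the helper names `normalize_adjacency` and `gcn_propagate` are hypothetical, and the learned weight matrix and nonlinearity of a real GCN layer are left out. It only shows how an unlabeled test node's representation gets mixed with that of a labeled neighbor when both sit in the same graph.

```python
import math

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a square adjacency matrix given as nested lists."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_propagate(adj, features):
    """One propagation step: mix each node's features with its neighbors' features."""
    norm = normalize_adjacency(adj)
    n, d = len(features), len(features[0])
    return [[sum(norm[i][k] * features[k][c] for k in range(n)) for c in range(d)]
            for i in range(n)]

# Node 0 stands for a labeled training post, node 1 for an unlabeled test post (toy values).
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
print(gcn_propagate(adjacency, features))
```

Because the test node is part of the graph during propagation, its features influence and are influenced by the training node, which is the sense in which representations are developed jointly.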

[NLP-1] Advancing Complex Medical Communication in Arabic with Sporo AraSum: Surpassing Existing Large Language Models

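[Quick read]: This paper addresses the particular challenges of handling Arabic clinical documentation and decision-making in healthcare settings, especially the effect of Arabic's complex morphology, syntax, and diglossia on natural language processing (NLP). The core of the approach is to evaluate Sporo AraSum, a language model tailored for Arabic clinical documentation, against JAIS, the leading Arabic NLP model. Using synthetic datasets and modified PDQI-9 metrics, the study assesses how well the models summarize patient-physician interactions in terms of accuracy, comprehensiveness, clinical utility, and linguistic-cultural competence. Sporo AraSum significantly outperforms JAIS on AI-centric quantitative metrics and on all qualitative attributes; its architecture enables precise and culturally sensitive documentation, handles the linguistic nuances of Arabic, and mitigates the risk of AI hallucinations. The authors conclude that Sporo AraSum is better suited to Arabic-speaking healthcare environments and offers a transformative solution for multilingual clinical workflows.

Link: https://arxiv.org/abs/2411.13518
Authors: Chanseo Lee,Sonu Kumar,Kimon A. Vogt,Sam Meraj,Antonia Vogt
Keywords: processing diverse languages, Arabic NLP model, Sporo AraSum, leading Arabic NLP, processing diverse
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2411.06713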


Abstract:The increasing demand for multilingual capabilities in healthcare underscores the need for AI models adept at processing diverse languages, particularly in clinical documentation and decision-making. Arabic, with its complex morphology, syntax, and diglossia, poses unique challenges for natural language processing (NLP) in medical contexts. This case study evaluates Sporo AraSum, a language model tailored for Arabic clinical documentation, against JAIS, the leading Arabic NLP model. Using synthetic datasets and PDQI-9 metrics that we modified to assess model performance in a different language, the study assessed the models’ performance in summarizing patient-physician interactions, focusing on accuracy, comprehensiveness, clinical utility, and linguistic-cultural competence. Results indicate that Sporo AraSum significantly outperforms JAIS in AI-centric quantitative metrics and all qualitative attributes measured in our modified version of the PDQI-9. AraSum’s architecture enables precise and culturally sensitive documentation, addressing the linguistic nuances of Arabic while mitigating risks of AI hallucinations. These findings suggest that Sporo AraSum is better suited to meet the demands of Arabic-speaking healthcare environments, offering a transformative solution for multilingual clinical workflows. Future research should incorporate real-world data to further validate these findings and explore broader integration into healthcare systems.

[NLP-2] Disentangling Memory and Reasoning Ability in Large Language Models

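[Quick read]: This paper addresses the lack of an explicit separation between knowledge retrieval and reasoning steps in the inference process of existing Large Language Models (LLMs), which leaves the model's decision process opaque and prone to hallucination and knowledge forgetting, hurting reliability in high-stakes domains. The core of the approach is a new inference paradigm that decomposes complex inference into two distinct actions: memory recall and reasoning. Two special tokens, "memory" and "reason", guide the model to retrieve knowledge and to reason logically in separate steps, which improves both model performance and the interpretability of the inference process and lets users identify sources of error and refine model responses more effectively.

Link: https://arxiv.org/abs/2411.13504
Authors: Mingyu Jin,Weidi Luo,Sitao Cheng,Xinyi Wang,Wenyue Hua,Ruixiang Tang,William Yang Wang,Yongfeng Zhang
Keywords: Large Language Models, Large Language, handling complex tasks, complex tasks requiring, demonstrated strong performance
Categories: Computation and Language (cs.CL)
Comments: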


Abstract:Large Language Models (LLMs) have demonstrated strong performance in handling complex tasks requiring both extensive knowledge and reasoning abilities. However, the existing LLM inference pipeline operates as an opaque process without explicit separation between knowledge retrieval and reasoning steps, making the model’s decision-making process unclear and disorganized. This ambiguity can lead to issues such as hallucinations and knowledge forgetting, which significantly impact the reliability of LLMs in high-stakes domains. In this paper, we propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions: (1) memory recall: which retrieves relevant knowledge, and (2) reasoning: which performs logical steps based on the recalled knowledge. To facilitate this decomposition, we introduce two special tokens memory and reason, guiding the model to distinguish between steps that require knowledge retrieval and those that involve reasoning. Our experiment results show that this decomposition not only improves model performance but also enhances the interpretability of the inference process, enabling users to identify sources of error and refine model responses effectively. The code is available at this https URL.
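One way to picture the decomposition is to split a tagged generation into its recall and reasoning steps. The sketch below is an assumption-laden illustration: the abstract only names the tokens "memory" and "reason", so the angle-bracket spelling `<memory>` / `<reason>`, the helper `split_steps`, and the example text are all hypothetical.

```python
import re

# Hypothetical marker spelling; the abstract only names the tokens "memory" and "reason".
STEP_PATTERN = re.compile(r"<(memory|reason)>(.*?)(?=<(?:memory|reason)>|$)", re.DOTALL)

def split_steps(generation):
    """Split a tagged generation into (step_kind, text) pairs in order of appearance."""
    return [(kind, text.strip()) for kind, text in STEP_PATTERN.findall(generation)]

example = "<memory>Water boils at 100 C at sea level.<reason>So it boils below 100 C at altitude."
for kind, text in split_steps(example):
    print(f"{kind}: {text}")
```

Separating the steps this way is what would let a reader check whether an error came from a wrong recalled fact or from a faulty inference built on a correct one.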

[NLP-3] Utilizing Large Language Models to Synthesize Product Desirability Datasets

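[Quick read]: This paper addresses how to efficiently generate synthetic data for Product Desirability Toolkit (PDT) testing, which is used to evaluate user sentiment and product experience. The core of the approach is to use large language models (LLMs), specifically gpt-4o-mini as a cost-effective alternative, with three methods (Word+Review, Review+Word, and Supply-Word) that each synthesize 1000 product reviews. The methods are assessed for sentiment alignment, textual diversity, and data generation cost. All of them show high sentiment alignment (Pearson correlations from 0.93 to 0.97), and Supply-Word gives the highest diversity and coverage of PDT terms, although at a higher generation cost. Despite a minor bias toward positive sentiment, LLM-generated synthetic data offers significant advantages when test data is limited, including scalability, cost savings, and flexibility in dataset production.

Link: https://arxiv.org/abs/2411.13485
Authors: John D. Hastings,Sherri Weitl-Harms,Joseph Doty,Zachary L. Myers,Warren Thompson
Keywords: Product Desirability Toolkit, Desirability Toolkit, large language models, Product Desirability, evaluating user sentiment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures, 6 tables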


Abstract:This research explores the application of large language models (LLMs) to generate synthetic datasets for Product Desirability Toolkit (PDT) testing, a key component in evaluating user sentiment and product experience. Utilizing gpt-4o-mini, a cost-effective alternative to larger commercial LLMs, three methods, Word+Review, Review+Word, and Supply-Word, were each used to synthesize 1000 product reviews. The generated datasets were assessed for sentiment alignment, textual diversity, and data generation cost. Results demonstrated high sentiment alignment across all methods, with Pearson correlations ranging from 0.93 to 0.97. Supply-Word exhibited the highest diversity and coverage of PDT terms, although with increased generation costs. Despite minor biases toward positive sentiments, in situations with limited test data, LLM-generated synthetic data offers significant advantages, including scalability, cost savings, and flexibility in dataset production.
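Sentiment alignment is reported here as a Pearson correlation, which can be computed in a few lines. The sketch below is illustrative only: the abstract does not spell out which two score series are correlated, so the pairing of an intended sentiment per prompt word with a sentiment measured on the generated review, and all the numbers, are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores: intended sentiment per prompt word vs. sentiment measured on the review.
target_sentiment = [-1.0, -0.5, 0.0, 0.5, 1.0]
measured_sentiment = [-0.8, -0.6, 0.1, 0.4, 0.9]
print(round(pearson(target_sentiment, measured_sentiment), 3))
```

A value close to 1 means the generated reviews track the intended sentiment closely, which is how the reported range of 0.93 to 0.97 should be read.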

[NLP-4] PatentEdits: Framing Patent Novelty as Textual Entailment

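[Quick read]: This paper addresses how to predict, given the prior art, which claims of an invention should be revised during patent prosecution to overcome novelty objections. The core of the approach is the PatentEdits dataset, which contains 105K examples of successful revisions, with algorithms designed to label the edits sentence by sentence. The study then uses large language models (LLMs) to predict these edits and finds that evaluating textual entailment between the cited references and the draft sentences is especially effective for predicting which inventive claims remain unchanged or are novel relative to the prior art.

Link: https://arxiv.org/abs/2411.13477
Authors: Ryan Lee,Alexander Spangher,Xuezhe Ma
Keywords: Patent Office, non-obvious in order, USPTO, prior art, Office
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: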


Abstract:A patent must be deemed novel and non-obvious in order to be granted by the US Patent Office (USPTO). If it is not, a US patent examiner will cite the prior work, or prior art, that invalidates the novelty and issue a non-final rejection. Predicting what claims of the invention should change given the prior art is an essential and crucial step in securing invention rights, yet has not been studied before as a learnable task. In this work we introduce the PatentEdits dataset, which contains 105K examples of successful revisions that overcome objections to novelty. We design algorithms to label edits sentence by sentence, then establish how well these edits can be predicted with large language models (LLMs). We demonstrate that evaluating textual entailment between cited references and draft sentences is especially effective in predicting which inventive claims remained unchanged or are novel in relation to prior art.
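Sentence-by-sentence edit labeling can be sketched with a plain sequence alignment. This is not the paper's labeling algorithm, which the abstract does not describe; it swaps in Python's `difflib.SequenceMatcher` as a stand-in, and the label names `kept`, `edited`, and `deleted` and the helper `label_sentences` are hypothetical.

```python
import difflib

def label_sentences(draft, granted):
    """Label each draft sentence as 'kept', 'edited', or 'deleted' relative to the granted version."""
    names = {"equal": "kept", "replace": "edited", "delete": "deleted"}
    labels = []
    matcher = difflib.SequenceMatcher(a=draft, b=granted, autojunk=False)
    for tag, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        if tag == "insert":
            continue  # sentence exists only in the granted version; no draft sentence to label
        labels.extend((sentence, names[tag]) for sentence in draft[a_start:a_end])
    return labels

draft_claims = ["A device.", "It has a lid."]
granted_claims = ["A device.", "It has a hinged lid."]
for sentence, label in label_sentences(draft_claims, granted_claims):
    print(f"{label}: {sentence}")
```

Labels of this kind are what a predictive model would then be trained to produce from the draft and the cited prior art.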

[NLP-5] When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

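[Quick read]: This paper addresses numerical issues that arise when Rotary Positional Embedding (RoPE) is used with the BFloat16 format in long-context scenarios, which cause RoPE to deviate from its intended relative positional encoding. The core of the approach is AnchorAttention, a plug-and-play attention method that treats the first token as a shared anchor with a consistent position ID, which alleviates the numerical issues caused by BFloat16's limited precision, improves long-context capability, and speeds up training. AnchorAttention reduces unnecessary attention computation while maintaining semantic coherence, and the experiments show clear improvements in both long-context performance and training time.

Link: https://arxiv.org/abs/2411.13476
Authors: Haonan Wang,Qian Liu,Chao Du,Tongyao Zhu,Cunxiao Du,Kenji Kawaguchi,Tianyu Pang
Keywords: large language models, Extending context window, process longer sequences, context window sizes, Rotary Positional Embedding
Categories: Computation and Language (cs.CL)
Comments: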


Abstract:Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16’s limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM’s capabilities on general tasks. Our code is available at this https URL.
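The precision effect can be illustrated with a small simulation. This is a simplified sketch rather than the paper's analysis: it emulates bfloat16 by rounding float32 bit patterns to their top 16 bits, applies a single 2-D rotation with an arbitrary angle `theta`, and rounds only the sine, cosine, and rotated components. The helper names and all values are hypothetical. In exact arithmetic the score depends only on the offset between the two positions; with rounding it can drift with the absolute position.

```python
import math
import struct

def to_bfloat16(x):
    """Round a Python float to the nearest bfloat16 value (round-half-to-even), returned as float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 encoding.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def rope_score(q, k, m, n, theta=0.1, low_precision=False):
    """Dot product of 2-D q rotated by m*theta and 2-D k rotated by n*theta."""
    cast = to_bfloat16 if low_precision else (lambda v: v)

    def rotate(vec, pos):
        c, s = cast(math.cos(pos * theta)), cast(math.sin(pos * theta))
        return (cast(vec[0] * c - vec[1] * s), cast(vec[0] * s + vec[1] * c))

    qr, kr = rotate(q, m), rotate(k, n)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 0.5), (0.3, -0.7)
# Same relative offset (5) at two different absolute positions.
for low in (False, True):
    near = rope_score(q, k, 0, 5, low_precision=low)
    far = rope_score(q, k, 4000, 4005, low_precision=low)
    print("bfloat16" if low else "float64 ", abs(near - far))
```

The first printed gap is at the level of floating-point noise, while the second shows how much the low-precision score moves when only the absolute position changes.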

[NLP-6] LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models

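[Quick read]: This paper addresses the growing risk of extinction that minority languages face because digital resources are scarce and artificial intelligence models are trained mainly on high-resource languages. The core of the approach is a framework that generates linguistic tools for low-resource languages, focusing on data creation to support language models that can aid preservation efforts. Sardinian, an endangered language, serves as the case study for the framework's effectiveness. By tackling data scarcity, the work aims to promote linguistic diversity and to support language standardization and revitalization through modern technology.

Link: https://arxiv.org/abs/2411.13453
Authors: Salvatore Mario Carta,Stefano Chessa,Giulia Contu,Andrea Corriga,Andrea Deidda,Gianni Fenu,Luca Frigau,Alessandro Giuliani,Luca Grassi,Marco Manolo Manca,Mirko Marras,Francesco Mola,Bastianino Mossa,Piergiorgio Mura,Marco Ortu,Leonardo Piano,Simone Pisano,Alessia Pisu,Alessandro Sebastian Podda,Livio Pompianu,Simone Seu,Sandro Gabriele Tiddia
Keywords: preserving cultural heritage, face growing risks, limited digital resources, artificial intelligence models, intelligence models trained
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: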


Abstract:Minority languages are vital to preserving cultural heritage, yet they face growing risks of extinction due to limited digital resources and the dominance of artificial intelligence models trained on high-resource languages. This white paper proposes a framework to generate linguistic tools for low-resource languages, focusing on data creation to support the development of language models that can aid in preservation efforts. Sardinian, an endangered language, serves as the case study to demonstrate the framework’s effectiveness. By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity and support ongoing efforts in language standardization and revitalization through modern technologies.

[NLP-7] AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations NEURIPS2024

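[Quick read]: This paper addresses the difficulty multimodal web agents have in automating tasks on unseen websites and domains, particularly enterprise-specific and proprietary platforms. The core of the approach is the AdaptAgent framework, which adapts multimodal web agents with few human demonstrations (up to 2). It applies to both proprietary and open-weights models, using in-context demonstrations for proprietary models and meta-adaptation demonstrations for meta-learned open-weights models. On two popular benchmarks (Mind2Web and VisualWebArena), the approach raises task success rate, a relative increase of 21.03% to 65.75%, pointing to few-shot adaptability as a direction for building widely applicable multimodal web agents that complements large-scale pre-training and fine-tuning.

Link: https://arxiv.org/abs/2411.13451
Authors: Gaurav Verma,Rachneet Kaur,Nishan Srishankar,Zhen Zeng,Tucker Balch,Manuela Veloso
Keywords: Large Language Models, Multimodal Large Language, Large Language, graphical user interfaces, processing user instructions
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 18 pages, 3 figures, an abridged version to appear in NeurIPS 2024 AFM Workshop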


Abstract:State-of-the-art multimodal web agents, powered by Multimodal Large Language Models (MLLMs), can autonomously execute many web tasks by processing user instructions and interacting with graphical user interfaces (GUIs). Current strategies for building web agents rely on (i) the generalizability of underlying MLLMs and their steerability via prompting, and (ii) large-scale fine-tuning of MLLMs on web-related tasks. However, web agents still struggle to automate tasks on unseen websites and domains, limiting their applicability to enterprise-specific and proprietary platforms. Beyond generalization from large-scale pre-training and fine-tuning, we propose building agents for few-shot adaptability using human demonstrations. We introduce the AdaptAgent framework that enables both proprietary and open-weights multimodal web agents to adapt to new websites and domains using few human demonstrations (up to 2). Our experiments on two popular benchmarks, Mind2Web and VisualWebArena, show that using in-context demonstrations (for proprietary models) or meta-adaptation demonstrations (for meta-learned open-weights models) boosts task success rate by 3.36% to 7.21% over non-adapted state-of-the-art models, corresponding to a relative increase of 21.03% to 65.75%. Furthermore, our additional analyses (a) show the effectiveness of multimodal demonstrations over text-only ones, (b) shed light on the influence of different data selection strategies during meta-learning on the generalization of the agent, and (c) demonstrate the effect of number of few-shot examples on the web agent’s success rate. Overall, our results unlock a complementary axis for developing widely applicable multimodal web agents beyond large-scale pre-training and fine-tuning, emphasizing few-shot adaptability.
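The abstract quotes both an absolute gain in success rate and a relative increase, and the two are easy to confuse. The sketch below shows how they relate; the helper `relative_gain` and the success rates are hypothetical and are not figures from the paper.

```python
def relative_gain(baseline, adapted):
    """Relative improvement, in percent, of `adapted` over `baseline` (both success rates in percent)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (adapted - baseline) / baseline * 100.0

# Hypothetical success rates, not figures from the paper.
baseline_rate, adapted_rate = 10.0, 12.5
print(adapted_rate - baseline_rate)                 # absolute gain in percentage points
print(relative_gain(baseline_rate, adapted_rate))   # relative gain in percent
```

A gain of a few percentage points becomes a large relative increase when the baseline success rate is low, which is why the two ranges in the abstract differ so much in magnitude.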

[NLP-8] WaterPark: A Robustness Assessment of Language Model Watermarking

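[Quick read]: This paper addresses the misuse of text generated by large language models (LLMs), such as disinformation, automated phishing, and academic cheating, by building a unified platform for evaluating existing watermarking methods ("watermarkers") and their robustness to attacks. The core of the approach is to systematize existing LLM watermarkers and watermark removal attacks and to build WaterPark, a unified platform that integrates 10 state-of-the-art watermarkers and 12 representative attacks. With WaterPark, the paper assesses how the design choices of existing watermarkers affect their attack robustness, shows that a watermarker's resilience depends on its context dependency, and explores best practices for operating watermarkers in adversarial environments, such as pairing a generic detector with a watermark-specific detector to improve security.

Link: https://arxiv.org/abs/2411.13425
Authors: Jiacheng Liang,Zian Wang,Lauren Hong,Shouling Ji,Ting Wang
Keywords: identifying LLM-generated texts, large language models, automated phishing, academic cheating, mitigate the misuse
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 22 pages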


Abstract:To mitigate the misuse of large language models (LLMs), such as disinformation, automated phishing, and academic cheating, there is a pressing need for the capability of identifying LLM-generated texts. Watermarking emerges as one promising solution: it plants statistical signals into LLMs’ generative processes and subsequently verifies whether LLMs produce given texts. Various watermarking methods ("watermarkers") have been proposed; yet, due to the lack of unified evaluation platforms, many critical questions remain under-explored: i) What are the strengths/limitations of various watermarkers, especially their attack robustness? ii) How do various design choices impact their robustness? iii) How to optimally operate watermarkers in adversarial environments? To fill this gap, we systematize existing LLM watermarkers and watermark removal attacks, mapping out their design spaces. We then develop WaterPark, a unified platform that integrates 10 state-of-the-art watermarkers and 12 representative attacks. More importantly, leveraging WaterPark, we conduct a comprehensive assessment of existing watermarkers, unveiling the impact of various design choices on their attack robustness. For instance, a watermarker’s resilience to increasingly intensive attacks hinges on its context dependency. We further explore the best practices to operate watermarkers in adversarial environments. For instance, using a generic detector alongside a watermark-specific detector improves the security of vulnerable watermarkers. We believe our study sheds light on current LLM watermarking techniques while WaterPark serves as a valuable testbed to facilitate future research.
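To make "plants statistical signals and subsequently verifies" concrete, the sketch below uses a green-list style count test, one well-known family of watermark detectors. It is a stand-in chosen for illustration; the abstract does not say which watermarkers WaterPark integrates. The parameter `gamma`, both thresholds, and the helper names are assumptions, and the generic detector is represented only by a score passed in by the caller.

```python
import math

def green_list_z_score(green_count, total_tokens, gamma=0.5):
    """z-score for seeing `green_count` green tokens if each token were green with probability gamma."""
    if total_tokens <= 0 or not 0.0 < gamma < 1.0:
        raise ValueError("need total_tokens > 0 and 0 < gamma < 1")
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

def is_flagged(green_count, total_tokens, generic_score, z_threshold=4.0, generic_threshold=0.5):
    """Flag text if either the watermark test or a generic detector score crosses its threshold."""
    watermark_hit = green_list_z_score(green_count, total_tokens) > z_threshold
    return watermark_hit or generic_score > generic_threshold

print(green_list_z_score(75, 100))
print(is_flagged(50, 100, generic_score=0.9))
```

The second function mirrors the practice mentioned in the abstract: text whose watermark signal has been weakened by an attack can still be caught if a generic detector scores it as machine-generated.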

[NLP-9] CAFE: A Novel Code-switching Dataset for Algerian Dialect, French, and English

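[Quick read]: This paper addresses the lack of code-switching datasets covering Algerian dialect, French, and English, and introduces the CAFE dataset in response. What sets CAFE apart is its spontaneous speaking style in natural conversation, its coverage of code-switching and overlapping speech, the distinct linguistic challenges of North African Arabic dialect that it addresses, and the dialectal variation it captures from different parts of Algeria across sociolinguistic contexts. The paper also highlights how hard this content is for state-of-the-art Automatic Speech Recognition (ASR) models such as Whisper large-v2,3 and PromptingWhisper, and shows that well-designed data processing pipelines and advanced decoding techniques improve ASR performance in terms of Mixed Error Rate (MER), Character Error Rate (CER), and Word Error Rate (WER).

Link: https://arxiv.org/abs/2411.13424
Authors: Houssam Eddine-Othman Lachemat,Akli Abbas,Nourredine Oukas,Yassine El Kheir,Samia Haboussi,Absar Showdhury Shammur
Keywords: North African Arabic, African Arabic dialect, Data download link, dataset between Algerian, Algerian dialect
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 24 pages, submitted to tallip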


Abstract:The paper introduces and publicly releases (data download link available after acceptance) CAFE, the first code-switching dataset covering Algerian dialect, French, and English. The CAFE speech data is unique in that (a) it has a spontaneous speaking style from in vivo human-human conversation, capturing phenomena like code-switching and overlapping speech; (b) it addresses distinct linguistic challenges in North African Arabic dialect; and (c) it captures dialectal variations from various parts of Algeria within different sociolinguistic contexts. CAFE contains approximately 37 hours of speech, with a subset, CAFE-small, of 2 hours and 36 minutes released with manual human annotation including speech segmentation, transcription, explicit annotation of code-switching points, overlapping speech, and other events such as noises and laughter. The remaining approximately 34.58 hours contain pseudo-label transcriptions. In addition to the data release, the paper highlights the challenges of using state-of-the-art Automatic Speech Recognition (ASR) models such as Whisper large-v2,3 and PromptingWhisper to handle such content. We then benchmark CAFE with the aforementioned Whisper models and show how well-designed data processing pipelines and advanced decoding techniques can improve ASR performance, reaching a Mixed Error Rate (MER) of 0.310, a Character Error Rate (CER) of 0.329, and a Word Error Rate (WER) of 0.538.
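Word and character error rates are both edit distances normalized by the reference length. The sketch below shows that standard definition with toy strings; it is not the paper's evaluation code, it skips text normalization, and it does not cover Mixed Error Rate, whose exact definition is not given in the abstract. The helper names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion from the reference
                            curr[j - 1] + 1,          # insertion into the reference
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)

print(word_error_rate("a b c d", "a x c"))
print(char_error_rate("abcd", "abxd"))
```

Lower is better for all three metrics, so the reported values are error levels reached after the improvements, not accuracy scores.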

[NLP-10] Unification of Balti and trans-border sister dialects in the essence of LLMs and AI Technology

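[Quick read]: This paper addresses the difficulty of unifying the Balti language given its dialectal diversity, and in particular how artificial intelligence (AI), especially Large Language Models (LLMs), can be used in an era of globalization to analyze, document, and standardize this endangered language. The core of the approach is to build on the efforts made so far in the different Balti dialects and apply AI technology to unification and standardization, narrowing the linguistic gaps created by cultural, socio-political, religious, and geographical factors.

Link: https://arxiv.org/abs/2411.13409
Authors: Muhammad Sharif,Jiangyan Yi,Muhammad Shoaib
Keywords: Tibeto-Burman language family, called Balti belongs, language called Balti, specifically the Tibeto-Burman, Large Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE conference ISCSLP 2024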


Abstract:The language called Balti belongs to the Sino-Tibetan, specifically the Tibeto-Burman, language family. It is understood, with variations, across populations in India, China, Pakistan, Nepal, Tibet, Burma, and Bhutan, where local cultures have influenced it and produced various dialects. Considering the diverse cultural, socio-political, religious, and geographical impacts, it is important to move toward unifying the dialects on the basis of their common root, lexica, and phonological perspectives. In the era of globalization and increasingly frequent developments in AI technology, understanding the diversity and the efforts at dialect unification is important for understanding commonalities and shortening the gaps created by unavoidable circumstances. This article analyzes and examines how artificial intelligence (AI), in the essence of Large Language Models (LLMs), can assist in analyzing, documenting, and standardizing the endangered Balti language, based on the efforts made in different dialects so far.

[NLP-11] Transformer-Based Contextualized Language Models Joint with Neural Networks for Natural Language Inference in Vietnamese

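[Quick read]: This paper addresses the lack of research on joint models for Natural Language Inference (NLI) in Vietnamese. The core of the approach is to combine Contextualized Language Models (CLM) with neural networks: the CLM produces contextualized representations, and the neural network performs classification. In the experiments, the joint model built on XLM-R (355M) performs best on Vietnamese NLI, with an F1 score of 82.78%, ahead of fine-tuned PhoBERT, mBERT, and XLM-R on their own. The method is simple yet effective, which makes it suitable for applications that require efficient resource utilization.

Link: https://arxiv.org/abs/2411.13407
Authors: Dat Van-Thanh Nguyen,Tin Van Huynh,Kiet Van Nguyen,Ngan Luu-Thuy Nguyen
Keywords: Natural Language Processing, Natural Language Inference, Natural Language, Language Processing, Language Inference
Categories: Computation and Language (cs.CL)
Comments: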


Abstract:Natural Language Inference (NLI) is a task within Natural Language Processing (NLP) that holds value for various AI applications. However, there have been limited studies on Natural Language Inference in Vietnamese that explore the concept of joint models. Therefore, we conducted experiments using various combinations of contextualized language models (CLM) and neural networks. We use CLM to create contextualized word representations and use neural networks for classification. Furthermore, we have evaluated the strengths and weaknesses of each joint model and identified the model failure points in the Vietnamese context. The highest F1 score in this experiment reaches 82.78% on the benchmark dataset (ViNLI). Among the models we experimented with, the largest CLM is XLM-R (355M). That combination has consistently demonstrated superior performance compared to fine-tuning strong pre-trained language models like PhoBERT (+6.58%), mBERT (+19.08%), and XLM-R (+0.94%) in terms of F1-score. This article aims to introduce a novel approach or model that attains improved performance for Vietnamese NLI. Overall, we find that the joint approach of CLM and neural networks is simple yet capable of achieving high-quality performance, which makes it suitable for applications that require efficient resource utilization.

[NLP-12] On the Way to LLM Personalization: Learning to Remember User Conversations

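Quick read: This paper addresses the limits of large language models (LLMs) in personalized conversation, specifically how injecting knowledge of prior conversations can help a model adapt to user preferences and behaviors. The key to the solution is PLUM, a pipeline that uses data augmentation to up-sample conversations into question-answer pairs and then fine-tunes a low-rank adaptation adapter with a weighted cross-entropy loss. The approach respects the sequential nature of conversations and is parameter-efficient; in this first exploration it reaches 81.5% accuracy across 100 conversations, competitive with baselines such as RAG.

Link: https://arxiv.org/abs/2411.13405
Authors: Lucie Charlotte Magister, Katherine Metcalf, Yizhe Zhang, Maartje ter Hoeve
Keywords: Large Language Models, Large Language, Language Models, variety of tasks, invaluable assistant
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 6 tables, 3 figures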

Abstract:Large Language Models (LLMs) have quickly become an invaluable assistant for a variety of tasks. However, their effectiveness is constrained by their ability to tailor responses to human preferences and behaviors via personalization. Prior work in LLM personalization has largely focused on style transfer or incorporating small factoids about the user, as knowledge injection remains an open challenge. In this paper, we explore injecting knowledge of prior conversations into LLMs to enable future work on less redundant, personalized conversations. We identify two real-world constraints: (1) conversations are sequential in time and must be treated as such during training, and (2) per-user personalization is only viable in parameter-efficient settings. To this aim, we propose PLUM, a pipeline performing data augmentation for up-sampling conversations as question-answer pairs, that are then used to finetune a low-rank adaptation adapter with a weighted cross entropy loss. Even in this first exploration of the problem, we perform competitively with baselines such as RAG, attaining an accuracy of 81.5% across 100 conversations.
摘要：大语言模型（Large Language Models, LLMs）迅速成为各种任务中不可或缺的助手。然而，其效能受限于通过个性化调整回应以符合人类偏好和行为的能力。在 LLM 个性化方面，先前的工作主要集中在风格转换或融入用户的小事实信息上，而知识注入仍是一个开放的挑战。本文探讨了将先前对话的知识注入 LLM，以促进未来在减少冗余、个性化对话方面的工作。我们识别了两个现实世界的约束条件：（1）对话在时间上是顺序的，训练时必须如此处理；（2）针对每个用户的个性化仅在参数高效的环境中可行。为此，我们提出了 PLUM，这是一个执行数据增强以对对话进行上采样的管道，将其转换为问答对，然后用于微调带有加权交叉熵损失的低秩适应适配器。即使在这一问题的初步探索中，我们也与 RAG 等基线方法竞争，在 100 次对话中达到了 81.5% 的准确率。
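The abstract mentions fine-tuning with a weighted cross-entropy loss but does not say how the weights are chosen. The sketch below is only an illustration under an assumed scheme, where question tokens get weight 0 and answer tokens get weight 1; the function name and the weighting are hypothetical and not taken from the paper.

```python
import math

def weighted_cross_entropy(token_probs, weights):
    """Weighted mean of per-token negative log-likelihoods.

    token_probs: probability the model assigned to each target token.
    weights: non-negative weight per token (hypothetical scheme, not PLUM's).
    """
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one token needs a non-zero weight")
    losses = [-math.log(p) for p in token_probs]
    return sum(w * l for w, l in zip(weights, losses)) / total_weight

# Assumed example: two question tokens (ignored) followed by two answer tokens.
probs = [0.9, 0.8, 0.5, 0.25]
weights = [0.0, 0.0, 1.0, 1.0]
loss = weighted_cross_entropy(probs, weights)
```

With these weights only the answer tokens contribute, so the loss is the mean negative log-likelihood of the answer span.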

[NLP-13] Executable QR codes with Machine Learning for Industrial Applications

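Quick read: This paper addresses how to run embedded programs on mobile devices without Internet access. The key to the solution is QRind, a new programming language that lets distinct computational blocks, such as machine learning models and algorithms, be integrated into executable QR codes (eQR codes). Users can then run the embedded programs on a mobile device even with no network connection, which allows parts of the Industry 4.0/5.0 paradigms, such as predictive maintenance and easier machinery operation, to be implemented offline.

Link: https://arxiv.org/abs/2411.13400
Authors: Stefano Scanzio, Francesco Velluto, Matteo Rosani, Lukasz Wisniewski, Gianluca Cena
Keywords: embed programs conceived, Executable QR codes, special kind, conceived to run, run on mobile
Subjects: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments: preprint, 4 pages, 2024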

Abstract:Executable QR codes, also known as eQR codes or just sQRy, are a special kind of QR codes that embed programs conceived to run on mobile devices like smartphones. Since the program is directly encoded in binary form within the QR code, it can be executed even when the reading device is not provided with Internet access. The applications of this technology are manifold, and range from smart user guides to advisory systems. The first programming language made available for eQR is QRtree, which enables the implementation of decision trees aimed, for example, at guiding the user in operating/maintaining a complex machinery or for reaching a specific location. In this work, an additional language, termed QRind, is proposed, which was specifically devised for industry. It permits integrating distinct computational blocks into the QR code, e.g., machine learning models to enable predictive maintenance and algorithms to ease machinery usage. QRind permits the Industry 4.0/5.0 paradigms to be implemented, in part, also in those cases where Internet is unavailable.
Journal reference: 29th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA 2024)
Related DOI: https://doi.org/10.1109/ETFA61755.2024.10710739
摘要：可执行二维码，也称为 eQR 码或简称为 sQRy，是一种特殊的二维码，其中嵌入了旨在在智能手机等移动设备上运行的程序。由于程序直接以二进制形式编码在二维码中，因此即使读取设备没有互联网访问权限，程序也可以执行。这项技术的应用非常广泛，从智能用户指南到咨询系统都有涉及。首个可用于 eQR 的编程语言是 QRtree，它支持实现决策树，例如用于指导用户操作/维护复杂机械或到达特定位置。在本研究中，我们提出了一种新的语言，称为 QRind，该语言专为工业应用设计。它允许将不同的计算模块集成到二维码中，例如，用于实现预测性维护的机器学习模型和简化机械使用的算法。QRind 允许在某些互联网不可用的情况下，部分实现工业 4.0/5.0 范式。

[NLP-14] Fact-Level Confidence Calibration and Self-Correction

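Quick read: This paper addresses a shortcoming of existing calibration methods for large language models (LLMs) in long-form generation: they cannot assess confidence fact by fact, and they ignore how relevant each fact is to the query. The key to the solution is a Fact-Level Calibration framework that calibrates confidence at a finer granularity, against relevance-weighted correctness at the fact level. Analysis under this framework led to Confidence-Guided Fact-level Self-Correction (ConFix), which uses high-confidence facts as additional knowledge to improve low-confidence ones, mitigating hallucinations without relying on external knowledge sources.

Link: https://arxiv.org/abs/2411.13343
Authors: Yige Yuan, Bingbing Xu, Hexiang Tan, Fei Sun, Teng Xiao, Wei Li, Huawei Shen, Xueqi Cheng
Keywords: aligning their self-assessed, actual accuracy, self-assessed confidence, current calibration methods, Confidence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL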

Abstract:Confidence calibration in LLMs, i.e., aligning their self-assessed confidence with the actual accuracy of their responses, enables them to self-evaluate the correctness of their outputs. However, current calibration methods for LLMs typically estimate two scalars to represent overall response confidence and correctness, which is inadequate for long-form generation where the response includes multiple atomic facts and may be partially confident and correct. These methods also overlook the relevance of each fact to the query. To address these challenges, we propose a Fact-Level Calibration framework that operates at a finer granularity, calibrating confidence to relevance-weighted correctness at the fact level. Furthermore, comprehensive analysis under the framework inspired the development of Confidence-Guided Fact-level Self-Correction (ConFix), which uses high-confidence facts within a response as additional knowledge to improve low-confidence ones. Extensive experiments across four datasets and six models demonstrate that ConFix effectively mitigates hallucinations without requiring external knowledge sources such as retrieval systems.
摘要：大语言模型（LLM）中的置信度校准，即调整其自我评估的置信度与实际回答准确性的一致性，使其能够自我评估输出的正确性。然而，当前的校准方法通常估计两个标量来表示整体回答的置信度和正确性，这对于包含多个原子事实且可能部分自信和正确的长篇生成任务来说是不够的。这些方法还忽略了每个事实与查询的相关性。为了应对这些挑战，我们提出了一种事实级校准框架，该框架在更细粒度上操作，将置信度校准为事实级相关性加权的正确性。此外，在该框架下的全面分析激发了置信度引导的事实级自我修正（ConFix）的开发，该方法利用响应中的高置信度事实作为额外知识来改进低置信度事实。在四个数据集和六个模型上的广泛实验表明，ConFix 有效地减少了幻觉现象，而无需依赖外部知识源，如检索系统。
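To make "relevance-weighted correctness at the fact level" concrete, here is a minimal sketch. The field names, the 0.7 threshold, and the helper functions are assumptions for illustration; the paper's actual scoring and self-correction procedure may differ.

```python
def relevance_weighted_correctness(facts):
    """Average correctness over facts, weighted by each fact's relevance."""
    total_relevance = sum(f["relevance"] for f in facts)
    if total_relevance == 0:
        raise ValueError("facts must carry some relevance weight")
    weighted = sum(f["relevance"] * f["correct"] for f in facts)
    return weighted / total_relevance

def split_by_confidence(facts, threshold=0.7):
    """Separate facts into high-confidence context and low-confidence
    candidates for revision (threshold is a hypothetical choice)."""
    high = [f for f in facts if f["confidence"] >= threshold]
    low = [f for f in facts if f["confidence"] < threshold]
    return high, low

# Assumed toy response decomposed into three atomic facts.
facts = [
    {"text": "A", "confidence": 0.9, "relevance": 1.0, "correct": 1},
    {"text": "B", "confidence": 0.4, "relevance": 0.5, "correct": 0},
    {"text": "C", "confidence": 0.8, "relevance": 0.5, "correct": 1},
]
score = relevance_weighted_correctness(facts)
high, low = split_by_confidence(facts)
```

In a ConFix-style loop, the `high` facts would be supplied to the model as extra context when it revisits the `low` ones; that prompting step is left out here.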

[NLP-15] Combining Autoregressive and Autoencoder Language Models for Text Classification

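Quick read: This paper addresses the fact that generative language models (such as GPT, Llama, and Phi) usually underperform supervised models (such as BERT) on text classification. The key to the solution is CAALM-TC, a method that combines autoregressive and autoencoder language models. An autoregressive model generates contextual information from the input text; this is combined with the original text and fed to an autoencoder model for classification. The hybrid exploits the broad contextual knowledge of autoregressive models and the classification efficiency of autoencoders, and it improves results on four benchmark datasets, especially for small datasets and more abstract classification objectives.

Link: https://arxiv.org/abs/2411.13282
Authors: João Gonçalves
Keywords: paper presents CAALM-TC, Autoencoder Language Models, Language Models, Combining Autoregressive, enhances text classification
Subjects: Computation and Language (cs.CL)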

Abstract:This paper presents CAALM-TC (Combining Autoregressive and Autoencoder Language Models for Text Classification), a novel method that enhances text classification by integrating autoregressive and autoencoder language models. Autoregressive large language models such as Open AI’s GPT, Meta’s Llama or Microsoft’s Phi offer promising prospects for content analysis practitioners, but they generally underperform supervised BERT based models for text classification. CAALM leverages autoregressive models to generate contextual information based on input texts, which is then combined with the original text and fed into an autoencoder model for classification. This hybrid approach capitalizes on the extensive contextual knowledge of autoregressive models and the efficient classification capabilities of autoencoders. Experimental results on four benchmark datasets demonstrate that CAALM consistently outperforms existing methods, particularly in tasks with smaller datasets and more abstract classification objectives. The findings indicate that CAALM offers a scalable and effective solution for automated content analysis in social science research that minimizes sample size requirements.
摘要：本文介绍了 CAALM-TC（结合自回归和自编码语言模型进行文本分类），这是一种通过整合自回归和自编码语言模型来增强文本分类的新方法。自回归大语言模型，如 Open AI 的 GPT、Meta 的 Llama 或 Microsoft 的 Phi，为内容分析从业者提供了广阔的前景，但在文本分类任务中，它们通常表现不如基于 BERT 的监督模型。CAALM 利用自回归模型根据输入文本生成上下文信息，然后将这些信息与原始文本结合，输入到自编码模型中进行分类。这种混合方法充分利用了自回归模型的广泛上下文知识和自编码模型的有效分类能力。在四个基准数据集上的实验结果表明，CAALM 在现有方法中表现一致优越，特别是在数据集较小和分类目标更抽象的任务中。研究结果表明，CAALM 为社会科学研究中的自动化内容分析提供了一种可扩展且有效的解决方案，同时减少了样本量的需求。

[NLP-16] VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

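Quick read: This paper addresses the limited depth of current evaluations of large multimodal models (LMMs) for video analysis, which rely on traditional multiple-choice benchmarks such as VideoMME and LongVideoBench. The key to the solution is VideoAutoArena, an arena-style benchmark inspired by the LMSYS Chatbot Arena framework that uses user simulation to generate open-ended, adaptive questions for rigorously assessing video understanding. It uses an automated, scalable evaluation framework with a modified ELO rating system for fair, continuous comparison across LMMs, and a fault-driven evolution strategy that progressively raises question complexity to push models toward harder video analysis scenarios. Experiments show that VideoAutoArena effectively differentiates state-of-the-art LMMs and offers insight into their strengths and areas for improvement. An auxiliary benchmark, VideoAutoBench, uses GPT-4o as a judge to compare model responses against human-validated answers, further streamlining evaluation.

Link: https://arxiv.org/abs/2411.13281
Authors: Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li
Keywords: garnered significant attention, Large multimodal models, recently garnered significant, Large multimodal, significant attention
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Project Page: this https URL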

Abstract:Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice questions in benchmarks such as VideoMME and LongVideoBench, which tend to lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena’s framework, designed to automatically assess LMMs’ video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a ‘gold standard’ using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
摘要：具有先进视频分析能力的大型多模态模型（LMMs）近期引起了广泛关注。然而，大多数评估依赖于传统的评估方法，如在VideoMME和LongVideoBench等基准测试中使用多项选择题，这些方法往往缺乏深度，无法捕捉现实世界用户对复杂需求的全面要求。为了解决这一局限性——并且由于视频任务的人工标注成本高昂且进度缓慢——我们引入了VideoAutoArena，这是一个受LMSYS Chatbot Arena框架启发的竞技场式基准测试，旨在自动评估LMMs的视频分析能力。VideoAutoArena利用用户模拟生成开放式、适应性的问题，严格评估模型在视频理解方面的表现。该基准测试采用自动化、可扩展的评估框架，结合改进的ELO评分系统，以实现对多个LMMs的公平和持续比较。为了验证我们的自动化评判系统，我们构建了一个“黄金标准”，使用精心筛选的人工标注子集，证明我们的竞技场与人类判断高度一致，同时保持了可扩展性。此外，我们引入了一种故障驱动进化策略，逐步增加问题复杂度，推动模型处理更具挑战性的视频分析场景。实验结果表明，VideoAutoArena能够有效区分最先进的LMMs，提供了关于模型优势和改进领域的见解。为了进一步简化我们的评估，我们引入了VideoAutoBench作为辅助基准测试，其中人类标注者在VideoAutoArena的部分对战中标注胜者。我们使用GPT-4o作为评判，将模型的回答与这些经过人类验证的答案进行比较。VideoAutoArena和VideoAutoBench共同提供了一个成本效益高、可扩展的框架，用于以用户为中心的视频分析中评估LMMs。
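The abstract says the arena ranks models with a modified ELO rating system but gives no details of the modification. For orientation only, the sketch below shows the standard Elo update after one pairwise battle; the K-factor of 32 and the starting ratings are conventional assumptions, not values from the paper.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Return updated ratings after one battle.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start level; model A wins the judged battle.
new_a, new_b = update_elo(1000.0, 1000.0, 1.0)
```

Because the update is zero-sum, the total rating across the two models is unchanged; a judge (human or automated) only has to supply `score_a` for each battle.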

[NLP-17] Leveraging Prior Experience: An Expandable Auxiliary Knowledge Base for Text-to-SQL

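Quick read: This paper addresses the gap between large language models (LLMs) and humans in downstream applications such as text-to-SQL. The key to the solution is to bring in continual learning through the LPE-SQL framework. The core of LPE-SQL is its fourth module, which logs successful and failed tasks together with their reasoning processes or reflection-generated tips, enabling continual learning without parameter fine-tuning. Experiments show substantial gains: the smaller Llama-3.1-70B model with LPE-SQL outperforms the larger Llama-3.1-405B model using state-of-the-art (SoTA) methods.

Link: https://arxiv.org/abs/2411.13244
Authors: Zhibo Chu, Zichong Wang, Qitao Qin
Keywords: Large Language Models, Large Language, exhibit impressive problem-solving, impressive problem-solving skills, Language Models
Subjects: Computation and Language (cs.CL)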

Abstract:Large Language Models (LLMs) exhibit impressive problem-solving skills across many tasks, but they still underperform compared to humans in various downstream applications, such as text-to-SQL. On the BIRD benchmark leaderboard, human performance achieves an accuracy of 92.96%, whereas the top-performing method reaches only 72.39%. Notably, these state-of-the-art (SoTA) methods predominantly rely on in-context learning to simulate human-like reasoning. However, they overlook a critical human skill: continual learning. Inspired by the educational practice of maintaining mistake notebooks during our formative years, we propose LPE-SQL (Leveraging Prior Experience: An Expandable Auxiliary Knowledge Base for Text-to-SQL), a novel framework designed to augment LLMs by enabling continual learning without requiring parameter fine-tuning. LPE-SQL consists of four modules that (i) retrieve relevant entries, (ii) generate SQL efficiently, (iii) generate the final result through a cross-consistency mechanism, and (iv) log successful and failed tasks along with their reasoning processes or reflection-generated tips. Importantly, the core module of LPE-SQL is the fourth one, while the other modules employ foundational methods, allowing LPE-SQL to be easily integrated with SoTA technologies to further enhance performance. Our experimental results demonstrate that this continual learning approach yields substantial performance gains, with the smaller Llama-3.1-70B model surpassing the performance of the larger Llama-3.1-405B model using SoTA methods.
摘要：大语言模型（LLMs）在众多任务中展现出卓越的问题解决能力，但在各种下游应用中，如文本到SQL的转换，其表现仍不及人类。在BIRD基准测试排行榜上，人类的表现达到了92.96%的准确率，而最先进的方法仅达到72.39%。值得注意的是，这些最先进（SoTA）方法主要依赖于上下文学习来模拟人类推理。然而，它们忽视了一个关键的人类技能：持续学习。受我们成长过程中保持错误笔记本的教育实践启发，我们提出了LPE-SQL（利用先前经验：用于文本到SQL的可扩展辅助知识库），这是一个新颖的框架，旨在通过无需参数微调的方式实现持续学习来增强大语言模型。LPE-SQL由四个模块组成：(i) 检索相关条目，(ii) 高效SQL生成，(iii) 通过交叉一致性机制生成最终结果，以及 (iv) 记录成功和失败任务及其推理过程或反思生成的提示。重要的是，LPE-SQL的核心模块是第四个，而其他模块采用基础方法，使得LPE-SQL能够轻松与最先进技术集成以进一步提升性能。我们的实验结果表明，这种持续学习方法带来了显著的性能提升，较小的Llama-3.1-70B模型表现超过了使用最先进方法的较大的Llama-3.1-405B模型。
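The abstract names a "cross-consistency mechanism" for picking the final result but does not describe it. One plausible reading, shown here purely as an assumption, is a majority vote over the execution results of several candidate SQL queries; the function name and data layout are hypothetical.

```python
from collections import Counter

def pick_by_consistency(candidates):
    """Choose the SQL whose execution result is most common among candidates.

    candidates: list of (sql, result) pairs, where result is a hashable
    representation of the rows returned when the SQL was executed.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    counts = Counter(result for _, result in candidates)
    best_result, _ = counts.most_common(1)[0]
    # Return the first query that produced the majority result.
    for sql, result in candidates:
        if result == best_result:
            return sql

# Assumed toy candidates: two of three queries agree on the result.
candidates = [
    ("SELECT COUNT(*) FROM orders", (41,)),
    ("SELECT COUNT(id) FROM orders", (42,)),
    ("SELECT COUNT(DISTINCT id) FROM orders", (42,)),
]
chosen = pick_by_consistency(candidates)
```

The chosen query and whether it later proved correct are the kind of record the fourth module would log for reuse.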

[NLP-18] BIPro: Zero-shot Chinese Poem Generation via Block Inverse Prompting Constrained Generation Framework

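Quick read: This paper addresses the difficulty generative pre-trained models have with constrained writing tasks, such as poem generation under open-domain titles. The key to the solution is Block Inverse Prompting (BIPro), a constrained generation framework. BIPro uses two block inverse prompting methods, revise and rewrite, to mimic the human writing process, markedly improving zero-shot quality on open-domain traditional-form Chinese poem generation. Built on the less powerful block generative model GLM-10B-Chinese, BIPro produces poems that, in human evaluation, outperform both most advanced direct generative systems (such as GPT-4 or GLM-4) and the best domain-specific systems (such as Yusheng, Shisanbai, or Baidu Poetry Helper), and it considerably narrows the gap between AI-generated works and human literary works.

Link: https://arxiv.org/abs/2411.13237
Authors: Xu Zou
Keywords: made significant strides, superior cross-domain capabilities, exhibit superior cross-domain, Block Inverse Prompting, significant strides
Subjects: Computation and Language (cs.CL)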

Abstract:Recently, generative pre-trained models have made significant strides, particularly highlighted by the release of ChatGPT and GPT-4, which exhibit superior cross-domain capabilities. However, these models still face challenges on constrained writing tasks like poem generation under open-domain titles. In response to this challenge, we introduce Block Inverse Prompting (BIPro) constrained generation framework. BIPro leverages two block inverse prompting methods, revise and rewrite, that mimic the process of human text writing using block generative models. It significantly improves the zero-shot generation quality on the formidable constrained generation task of open-domain traditional-form Chinese poem generation. Based on a less powerful block generative model GLM-10B-Chinese, poems composed via BIPro without priming or additional training outperform both most advanced direct generative systems like GPT-4 or GLM-4 and best domain-specific systems such as Yusheng, Shisanbai, or Baidu Poetry Helper in human evaluation by proficient poets. Finally, BIPro considerably narrows the gap between AI-generated works and short-listed human literary arts in another human evaluation, unveiling the promising potential of block generative models in improving the quality of constrained generation.
摘要：近年来，生成式预训练模型取得了显著进展，尤其是 ChatGPT 和 GPT-4 的发布，展示了其在跨领域任务中的优越能力。然而，这些模型在面对开放领域标题下的受限写作任务（如诗歌生成）时仍面临挑战。针对这一问题，我们提出了 Block Inverse Prompting (BIPro) 受限生成框架。BIPro 利用两种块逆向提示方法——修订和重写，模拟人类文本写作过程，通过块生成模型显著提升了在开放领域传统形式中文诗歌生成这一艰巨任务上的零样本生成质量。基于较弱的块生成模型 GLM-10B-Chinese，通过 BIPro 生成的诗歌在无需引导或额外训练的情况下，在熟练诗人的评估中优于大多数先进的直接生成系统（如 GPT-4 或 GLM-4）以及最佳的领域特定系统（如 Yusheng、Shisanbai 或 Baidu Poetry Helper）。最后，BIPro 在另一次人类评估中显著缩小了 AI 生成作品与入选人类文学艺术作品之间的差距，揭示了块生成模型在提升受限生成质量方面的巨大潜力。

[NLP-19] AIDBench: A benchmark for evaluating the authorship identification capability of large language models

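Quick read: This paper addresses the privacy risk that large language models (LLMs) may help identify the authors of anonymous texts. The key to the solution is AIDBench, a benchmark that brings together several authorship identification datasets and measures LLM performance with two evaluation methods: one-to-one and one-to-many authorship identification. The paper also introduces a method based on Retrieval-Augmented Generation (RAG) to strengthen large-scale authorship identification, particularly when inputs exceed the model's context window. Experiments show that LLMs identify authors at rates well above random chance, revealing a new privacy risk posed by these models.

Link: https://arxiv.org/abs/2411.13226
Authors: Zichen Wen, Dadi Guo, Huishuai Zhang
Keywords: attracting increasing attention, large language models, rapidly advance, daily life, increasing attention
Subjects: Computation and Language (cs.CL)
Comments: 21 pages, 7 figures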

Abstract:As large language models (LLMs) rapidly advance and integrate into daily life, the privacy risks they pose are attracting increasing attention. We focus on a specific privacy risk where LLMs may help identify the authorship of anonymous texts, which challenges the effectiveness of anonymity in real-world systems such as anonymous peer review systems. To investigate these risks, we present AIDBench, a new benchmark that incorporates several author identification datasets, including emails, blogs, reviews, articles, and research papers. AIDBench utilizes two evaluation methods: one-to-one authorship identification, which determines whether two texts are from the same author; and one-to-many authorship identification, which, given a query text and a list of candidate texts, identifies the candidate most likely written by the same author as the query text. We also introduce a Retrieval-Augmented Generation (RAG)-based method to enhance the large-scale authorship identification capabilities of LLMs, particularly when input lengths exceed the models’ context windows, thereby establishing a new baseline for authorship identification using LLMs. Our experiments with AIDBench demonstrate that LLMs can correctly guess authorship at rates well above random chance, revealing new privacy risks posed by these powerful models. The source code and data will be made publicly available after acceptance.
摘要：随着大语言模型 (Large Language Models, LLMs) 的迅速发展和融入日常生活，它们带来的隐私风险正日益受到关注。我们特别关注一种隐私风险，即 LLMs 可能帮助识别匿名文本的作者身份，这挑战了匿名性在现实系统中的有效性，例如匿名同行评审系统。为了研究这些风险，我们提出了 AIDBench，这是一个新的基准，整合了多个作者识别数据集，包括电子邮件、博客、评论、文章和研究论文。AIDBench 采用两种评估方法：一对一作者识别，用于确定两段文本是否来自同一作者；以及一对多作者识别，给定一段查询文本和一组候选文本，识别出最可能与查询文本出自同一作者的候选文本。我们还引入了一种基于检索增强生成 (Retrieval-Augmented Generation, RAG) 的方法，以增强 LLMs 在大规模作者识别方面的能力，特别是在输入长度超过模型上下文窗口时，从而为使用 LLMs 进行作者识别建立了一个新的基准。我们在 AIDBench 上的实验表明，LLMs 能够以远高于随机猜测的准确率正确猜测作者身份，揭示了这些强大模型带来的新的隐私风险。源代码和数据将在接受后公开发布。

[NLP-20] Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM

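Quick read: This paper addresses the cost of labeling real speech data for automatic speech recognition (ASR) systems and proposes a new data augmentation method, Hard-Synth. The key to the solution is to combine large language models (LLMs) with advanced zero-shot text-to-speech (TTS): LLMs rewrite text to generate diverse in-domain text, and zero-shot TTS with a hard prompt selection method clones the speech styles that the ASR model finds hard to recognize. The method removes the dependence on additional text data and improves the Conformer model, with relative word error rate (WER) reductions of 6.5%/4.4% on the LibriSpeech dev/test-other subsets; it is also data-efficient and can reduce bias in ASR.

Link: https://arxiv.org/abs/2411.13159
Authors: Jiawei Yu, Yuang Li, Xiaosong Qiao, Huan Zhao, Xiaofeng Zhao, Wei Tang, Min Zhang, Hao Yang, Jinsong Su
Keywords: automatic speech recognition, labeling real speech, systems using text-only, text-only corpora, widely adopted
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)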

Abstract:Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems using text-only corpora, thereby reducing the cost of labeling real speech data. Existing research primarily utilizes additional text data and predefined speech styles supported by TTS models. In this paper, we propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS. Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data. Rather than using predefined speech styles, we introduce a hard prompt selection method with zero-shot TTS to clone speech styles that the ASR model finds challenging to recognize. Experiments demonstrate that Hard-Synth significantly enhances the Conformer model, achieving relative word error rate (WER) reductions of 6.5%/4.4% on LibriSpeech dev/test-other subsets. Additionally, we show that Hard-Synth is data-efficient and capable of reducing bias in ASR.
摘要：文本到语音 (Text-to-speech, TTS) 模型已被广泛应用于通过仅使用文本语料库来增强自动语音识别 (Automatic Speech Recognition, ASR) 系统，从而降低标注真实语音数据的成本。现有研究主要利用额外的文本数据和 TTS 模型支持的预定义语音风格。本文提出了一种名为 Hard-Synth 的新型 ASR 数据增强方法，该方法利用大语言模型 (Large Language Models, LLMs) 和先进的零样本 TTS (Zero-shot TTS)。我们的方法通过重写技术，利用 LLMs 生成多样化的领域内文本，而无需依赖额外的文本数据。与使用预定义语音风格不同，我们引入了一种硬提示选择方法，结合零样本 TTS 来克隆 ASR 模型难以识别的语音风格。实验结果表明，Hard-Synth 显著提升了 Conformer 模型，在 LibriSpeech dev/test-other 子集上实现了相对词错误率 (Word Error Rate, WER) 分别降低了 6.5% 和 4.4%。此外，我们还展示了 Hard-Synth 在数据效率方面的优势，并能够减少 ASR 中的偏差。
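As a rough illustration of what "hard" could mean here, the sketch below computes word error rate (WER) with a word-level edit distance and keeps the utterances the recognizer gets most wrong. Ranking by per-utterance WER is an assumption made for this example; the paper's hard prompt selection method may use a different criterion, and the field names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Single-row dynamic programming over the edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def select_hard_prompts(utterances, top_k):
    """Keep the top_k utterances with the highest WER as candidate prompts."""
    ranked = sorted(
        utterances,
        key=lambda u: word_error_rate(u["reference"], u["hypothesis"]),
        reverse=True,
    )
    return ranked[:top_k]

# Assumed toy data: ASR output for two utterances.
utterances = [
    {"id": "easy", "reference": "the cat sat", "hypothesis": "the cat sat"},
    {"id": "hard", "reference": "the cat sat on the mat",
     "hypothesis": "the cat sit on mat"},
]
hard = select_hard_prompts(utterances, top_k=1)
```

The selected utterances would then serve as style prompts for a zero-shot TTS system, which is outside the scope of this sketch.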

[NLP-21] Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding

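Quick read: This paper addresses the computational inefficiency of inference in large language models (LLMs) caused by the sequential nature of autoregressive decoding. The key idea is speculative decoding, a two-stage framework of drafting and verification: a smaller, efficient model first generates a preliminary draft, which a larger, more sophisticated model then refines. The paper surveys speculative decoding methods, grouping them into draft-centric and model-centric approaches, and discusses the key ideas behind each and their potential for scaling LLM inference. The survey aims to guide future research on optimizing speculative decoding and integrating it into real-world LLM applications.

Link: https://arxiv.org/abs/2411.13157
Authors: Hyun Ryu, Eric Kim
Keywords: large language models, complexity grow, large language, critical focus, scale and complexity
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)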

Abstract:Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token generation process. Speculative decoding addresses this bottleneck by introducing a two-stage framework: drafting and verification. A smaller, efficient model generates a preliminary draft, which is then refined by a larger, more sophisticated model. This paper provides a comprehensive survey of speculative decoding methods, categorizing them into draft-centric and model-centric approaches. We discuss key ideas associated with each method, highlighting their potential for scaling LLM inference. This survey aims to guide future research in optimizing speculative decoding and its integration into real-world LLM applications.
摘要：随着大语言模型（LLM）的规模和复杂性不断增长，高效的推理已成为一个关键焦点。传统的自回归解码虽然有效，但由于其顺序生成 Token 的过程，存在计算效率低下的问题。推测性解码通过引入两阶段框架——草稿生成和验证，解决了这一瓶颈。一个较小且高效的模型生成初步草稿，然后由一个更大、更复杂的模型进行精炼。本文对推测性解码方法进行了全面的综述，将其分为以草稿为中心和以模型为中心的方法。我们讨论了每种方法的关键思想，强调了它们在扩展 LLM 推理方面的潜力。本综述旨在指导未来在优化推测性解码及其在实际 LLM 应用中的集成方面的研究。
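The draft-then-verify loop can be sketched with toy "models" that map a token sequence to the next token. This is a simplified greedy variant written for illustration: real systems verify all drafted positions in one parallel forward pass of the large model and often use probabilistic acceptance, and the callables here are stand-ins rather than any library's API.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, draft_len=4):
    """Greedy speculative decoding with toy next-token functions.

    target_next / draft_next: callables taking a token list and returning
    the next token (stand-ins for the large and small models).
    """
    tokens = list(prompt)
    goal = len(prompt) + num_tokens
    while len(tokens) < goal:
        # Stage 1: the small model drafts a few tokens cheaply.
        draft = []
        for _ in range(min(draft_len, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # Stage 2: the large model checks the draft left to right.
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
        tokens.extend(accepted)
    return tokens

def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    # Deliberately wrong at some positions to exercise rejection.
    return 0 if len(seq) % 3 == 0 else (seq[-1] + 1) % 10

out = speculative_decode(target_next, draft_next, [0], 8)
```

With greedy verification the output is identical to what the target model would produce on its own; the saving comes from accepting several drafted tokens per verification round.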

[NLP-22] Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control

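Quick read: This paper addresses the particular challenges of lyrics generation, especially achieving precise syllable control while following song form structures such as verses and choruses. Conventional line-by-line generation often produces unnatural phrasing, which calls for finer-grained syllable management. The key to the solution is a framework that controls syllable counts at the word, phrase, line, and paragraph levels while remaining aware of song form. It generates complete lyrics conditioned on input text and song form, keeping them aligned with the specified syllable constraints.

Link: https://arxiv.org/abs/2411.13100
Authors: Yunkee Chae, Eunsik Shin, Hwang Suntae, Seungryeol Paik, Kyogu Lee
Keywords: presents unique challenges, generation presents unique, achieving precise syllable, unique challenges, verses and choruses
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)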

Abstract:Lyrics generation presents unique challenges, particularly in achieving precise syllable control while adhering to song form structures such as verses and choruses. Conventional line-by-line approaches often lead to unnatural phrasing, underscoring the need for more granular syllable management. We propose a framework for lyrics generation that enables multi-level syllable control at the word, phrase, line, and paragraph levels, aware of song form. Our approach generates complete lyrics conditioned on input text and song form, ensuring alignment with specified syllable constraints. Generated lyrics samples are available at: this https URL
摘要：歌词生成面临独特的挑战，特别是在遵循歌曲形式结构（如副歌和诗节）的同时实现精确的音节控制。传统的逐行生成方法往往导致不自然的表达方式，突显了更细粒度音节管理的必要性。我们提出了一种歌词生成框架，该框架能够在单词、短语、行和段落级别上实现多层次的音节控制，并考虑到歌曲形式。我们的方法基于输入文本和歌曲形式生成完整的歌词，确保与指定的音节约束一致。生成的歌词样本可通过以下链接获取：this https URL

[NLP-23] Patience Is The Key to Large Language Model Reasoning

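Quick read: This paper addresses the way existing large language models, when solving complex problems, either shorten their reasoning because of user preferences or need large, expensive training data to learn complex reasoning, which limits their potential. Following the idea of test-time scaling, the key to the solution is a simple method that encourages the model to adopt a more patient reasoning style without introducing new knowledge or skills. Concretely, a preference optimization approach is used: detailed reasoning processes serve as positive examples and simple answers as negative examples, training the model to favor thoroughness in its responses. Results show a gain of up to 6.7% on GSM8k after training on only a lightweight dataset.

Link: https://arxiv.org/abs/2411.13082
Authors: Yijiong Yu
Keywords: Chain of Thought, demonstrated significant improvements, solving complex problems, large language models, Recent advancements
Subjects: Computation and Language (cs.CL)
Comments: The dataset and model are available at this https URL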

Abstract:Recent advancements in the field of large language models, particularly through the Chain of Thought (CoT) approach, have demonstrated significant improvements in solving complex problems. However, existing models either tend to sacrifice detailed reasoning for brevity due to user preferences, or require extensive and expensive training data to learn complicated reasoning ability, limiting their potential in solving complex tasks. To bridge this gap, following the concept of scaling test-time, we propose a simple method by encouraging models to adopt a more patient reasoning style without the need of introducing new knowledge or skills. To employ a preference optimization approach, we generate detailed reasoning processes as positive examples and simple answers as negative examples, thereby training the model to favor thoroughness in its responses. Our results demonstrate a performance increase of up to 6.7% on GSM8k with training just on a lightweight dataset.
摘要：近年来，大语言模型领域，特别是通过思维链 (Chain of Thought, CoT) 方法，在解决复杂问题方面取得了显著进展。然而，现有模型要么由于用户偏好而倾向于牺牲详细推理以追求简洁，要么需要大量且昂贵的训练数据来学习复杂的推理能力，这限制了它们在解决复杂任务中的潜力。为了弥合这一差距，我们遵循扩展测试时间的概念，提出了一种简单的方法，即鼓励模型采用更耐心的推理风格，而无需引入新的知识或技能。我们采用偏好优化方法，生成详细的推理过程作为正例，简单的答案作为负例，从而训练模型在响应中倾向于全面性。我们的实验结果显示，仅在轻量级数据集上训练后，GSM8k 上的性能提升了高达 6.7%。
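The abstract says only that "a preference optimization approach" is used. As one common instance, and purely as an assumption, the sketch below shows the Direct Preference Optimization (DPO) loss for a single pair in which the detailed solution is the chosen response and the terse answer is the rejected one. The log-probabilities and the beta value are made-up numbers.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical pair: "chosen" is the patient, step-by-step solution,
# "rejected" is the short answer.
pair = {
    "prompt": "A train travels 60 km in 1.5 hours. What is its speed?",
    "chosen": "Speed is distance over time. 60 / 1.5 = 40, so 40 km/h.",
    "rejected": "40 km/h.",
}

# Before training the policy equals the reference, so the margin is zero.
loss_at_start = dpo_loss(-5.0, -3.0, -5.0, -3.0)
# After the policy shifts probability toward the detailed answer.
loss_after_shift = dpo_loss(-4.0, -6.0, -5.0, -3.0)
```

The loss falls as the policy raises the likelihood of the detailed response relative to the reference model, which is the direction of the "patient" reasoning style described above.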

[NLP-24] Explainable LLM-driven Multi-dimensional Distillation for E-Commerce Relevance Learning WWW2025

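Quick read: This paper addresses the interpretability and online deployment efficiency of query-item relevance modeling based on large language models (LLMs) in e-commerce search systems. The key to the solution is an explainable LLM-driven multi-dimensional distillation framework with two core components: (1) an explainable LLM for relevance modeling (ELLM-rele), which decomposes relevance learning into intermediate steps and models it as Chain-of-Thought (CoT) reasoning, improving both the interpretability and the performance of the LLM; and (2) a Multi-dimensional Knowledge Distillation (MKD) architecture, which transfers the knowledge of ELLM-rele to currently deployable interaction-based and representation-based student models from two aspects, the relevance score distribution and the CoT reasoning, improving the students' semantic interaction and long-tail generalization abilities.

Link: https://arxiv.org/abs/2411.13045
Authors: Gang Zhao, Ximing Zhang, Chenji Lu, Hui Zhao, Tianshu Wu, Pengjie Wang, Jian Xu, Bo Zheng
Keywords: Effective query-item relevance, Effective query-item, Large Language Model, safeguarding user satisfaction, LLM
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Submitted to WWW 2025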

Abstract:Effective query-item relevance modeling is pivotal for enhancing user experience and safeguarding user satisfaction in e-commerce search systems. Recently, benefiting from the vast inherent knowledge, Large Language Model (LLM) approach demonstrates strong performance and long-tail generalization ability compared with previous neural-based specialized relevance learning methods. Though promising, current LLM-based methods encounter the following inadequacies in practice: First, the massive parameters and computational demands make it difficult to be deployed online. Second, distilling LLM models to online models is a feasible direction, but the LLM relevance modeling is a black box, and its rich intrinsic knowledge is difficult to extract and apply online. To improve the interpretability of LLM and boost the performance of online relevance models via LLM, we propose an Explainable LLM-driven Multi-dimensional Distillation framework for e-commerce relevance learning, which comprises two core components: (1) An Explainable LLM for relevance modeling (ELLM-rele), which decomposes the relevance learning into intermediate steps and models relevance learning as a Chain-of-Thought (CoT) reasoning, thereby enhancing both interpretability and performance of LLM. (2) A Multi-dimensional Knowledge Distillation (MKD) architecture that transfers the knowledge of ELLM-rele to current deployable interaction-based and representation-based student models from both the relevance score distribution and CoT reasoning aspects. Through distilling the probabilistic and CoT reasoning knowledge, MKD improves both the semantic interaction and long-tail generalization abilities of student models. Extensive offline evaluations and online experiments on Taobao search ad scene demonstrate that our proposed framework significantly enhances e-commerce relevance learning performance and user experience.
摘要：在电子商务搜索系统中，有效的查询-商品相关性建模对于提升用户体验和保障用户满意度至关重要。近年来，得益于其庞大的内在知识，大语言模型（Large Language Model, LLM）在性能和长尾泛化能力方面展现出优于以往基于神经网络的专用相关性学习方法的强大表现。尽管前景广阔，当前基于LLM的方法在实践中仍存在以下不足：首先，庞大的参数和计算需求使其难以在线部署。其次，将LLM模型提炼为在线模型是一个可行的方向，但LLM相关性建模是一个黑箱，其丰富的内在知识难以提取并应用于在线环境。为提高LLM的可解释性并借助LLM提升在线相关性模型的性能，我们提出了一种可解释的LLM驱动的多维度提炼框架，用于电子商务相关性学习，该框架包含两个核心组件：（1）一种用于相关性建模的可解释LLM（ELLM-rele），它将相关性学习分解为中间步骤，并将相关性学习建模为链式思维（Chain-of-Thought, CoT）推理，从而增强LLM的可解释性和性能。（2）一种多维度知识提炼（Multi-dimensional Knowledge Distillation, MKD）架构，该架构从相关性得分分布和CoT推理两个方面，将ELLM-rele的知识传递给当前可部署的基于交互和表示的学生模型。通过提炼概率性和CoT推理知识，MKD提升了学生模型的语义交互和长尾泛化能力。在淘宝搜索广告场景中的广泛离线和在线实验表明，我们提出的框架显著提升了电子商务相关性学习的性能和用户体验。
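Distilling a teacher's relevance score distribution into a student is commonly done by minimizing a KL divergence between temperature-softened distributions. The sketch below shows that generic recipe only; it is not the paper's MKD objective, it leaves out the CoT-reasoning part entirely, and the temperature and logits are assumed values.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL between softened teacher and student relevance distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)

# Assumed two-class relevance logits (relevant, irrelevant) for one
# query-item pair.
loss_match = distillation_loss([2.0, 0.0], [2.0, 0.0])
loss_mismatch = distillation_loss([2.0, 0.0], [0.0, 2.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two disagree, which is the signal a student relevance model would be trained on.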

[NLP-25] Breaking the Cycle of Recurring Failures: Applying Generative AI to Root Cause Analysis in Legacy Banking Systems

【速读】：该论文试图解决传统银行在数字化转型过程中面临的挑战，特别是由于遗留系统限制和碎片化所有权导致的表面性事件解决和反复失败问题。解决方案的关键在于引入一种结合知识型生成式 AI 代理 (GenAI agents) 与“五问法”(Five Whys) 技术的新型事后分析方法。该方法通过分析问题描述和变更请求数据，揭示了约70%先前归因于管理和供应商失败的事件实际上源于内部代码问题。通过扫描超过5000个项目，识别出400多个具有相似根本原因的文件，该方法不仅自动化了根本原因分析，还将其转变为一个更加主动的过程，并可应用于软件开发生命周期的其他阶段，从而进一步优化开发流程。

链接: https://arxiv.org/abs/2411.13017
作者: Siyuan Jin,Zhendong Bei,Bichao Chen,Yong Xia
关键词-EN: Traditional banks face, banks face significant, face significant challenges, legacy system constraints, Traditional banks
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:

Abstract:Traditional banks face significant challenges in digital transformation, primarily due to legacy system constraints and fragmented ownership. Recent incidents show that such fragmentation often results in superficial incident resolutions, leaving root causes unaddressed and causing recurring failures. We introduce a novel approach to post-incident analysis, integrating knowledge-based GenAI agents with the “Five Whys” technique to examine problem descriptions and change request data. This method uncovered that approximately 70% of the incidents previously attributed to management or vendor failures were due to underlying internal code issues. We present a case study to show the impact of our method. By scanning over 5,000 projects, we identified over 400 files with a similar root cause. Overall, we leverage the knowledge-based agents to automate and elevate root cause analysis, transforming it into a more proactive process. These agents can be applied across other phases of the software development lifecycle, further improving development processes.
摘要：传统银行在数字化转型过程中面临重大挑战，主要原因是遗留系统限制和所有权分散。近期事件表明，这种分散性往往导致表面性的问题解决，未能触及根本原因，从而导致问题反复出现。我们提出了一种新颖的后续事件分析方法，将基于知识的生成式 AI 智能体与“五个为什么”技术相结合，以审查问题描述和变更请求数据。该方法揭示了约 70% 先前归咎于管理或供应商失败的事件，实际上是由内部代码问题引起的。我们通过一个案例研究展示了该方法的影响。通过扫描超过 5,000 个项目，我们识别出超过 400 个文件具有类似的根本原因。总体而言，我们利用基于知识的智能体来自动化和提升根本原因分析，将其转变为一个更为主动的过程。这些智能体可以应用于软件开发生命周期的其他阶段，进一步改进开发流程。

[NLP-26] LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts

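【Quick read】: This paper addresses two weaknesses of large language models (LLMs): difficulty with long-context understanding and high computational cost. The key to the solution is LLMSteer, a fine-tuning-free framework that enhances LLMs through query-independent attention steering. Experiments show that LLMSteer narrows the performance gap with baselines by 65.9% and reduces runtime delay by up to 4.8x compared with recent attention steering methods.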

链接: https://arxiv.org/abs/2411.13009
作者: Zhuohan Gu,Jiayi Yao,Kuntai Du,Junchen Jiang
关键词-EN: large language models, high computational costs, longer contextual understanding, show impressive performance, language models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:As large language models (LLMs) show impressive performance on complex tasks, they still struggle with longer contextual understanding and high computational costs. To balance efficiency and quality, we introduce LLMSteer, a fine-tuning-free framework that enhances LLMs through query-independent attention steering. Tested on popular LLMs and datasets, LLMSteer narrows the performance gap with baselines by 65.9% and reduces the runtime delay by up to 4.8x compared to recent attention steering methods.
摘要：随着大语言模型（LLMs）在复杂任务中展现出令人印象深刻的表现，它们在长上下文理解和计算成本高昂方面仍面临挑战。为了平衡效率和质量，我们提出了 LLMSteer，这是一个无需微调的框架，通过与查询无关的注意力引导来增强 LLMs。在流行的 LLMs 和数据集上进行测试，LLMSteer 将性能差距与基线缩小了 65.9%，并将运行时延迟相比最近的注意力引导方法减少了高达 4.8 倍。

[NLP-27] MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers NEURIPS2024

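【Quick read】: This paper addresses the computational complexity (FLOPs) of large language models. The key to the solution is MemoryFormer, a new Transformer architecture that eliminates nearly all computation except what the multi-head attention operation requires. Concretely, it stores a large number of discrete vectors in in-memory lookup tables that replace the weight matrices of fully-connected layers. A hash algorithm dynamically retrieves a correlated subset of vectors based on the input embedding, and the retrieved vectors are combined to form the output embedding, which approximates the result of the matrix multiplication. Retrieving data blocks from memory instead of multiplying matrices greatly reduces the computation required.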

链接: https://arxiv.org/abs/2411.12992
作者: Ning Ding,Yehui Tang,Haochen Qin,Zhenli Zhou,Chao Xu,Lin Li,Kai Han,Heng Liao,Yunhe Wang
关键词-EN: computational complexity, large language models, great efforts, improve the efficiency, language models
类目: Computation and Language (cs.CL)
备注: NeurIPS2024

Abstract:In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models, with techniques such as linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computations of the transformer model except for the necessary computation required by the multi-head attention operation. This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large amount of discrete vectors to replace the weight matrix used in linear projection. We then use a hash algorithm to retrieve a correlated subset of vectors dynamically based on the input embedding. The retrieved vectors, combined together, form the output embedding, which provides an estimation of the result of the matrix multiplication operation in a fully-connected layer. Compared to conducting matrix multiplication, retrieving data blocks from memory is a much cheaper operation which requires little computation. We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.
摘要：为了降低大语言模型的计算复杂度，研究者们已经做出了巨大努力来提升 Transformer 模型的效率，例如线性注意力机制和闪存注意力机制。然而，为了追求更高的性能，模型规模及其相应的计算复杂度仍在不断扩大。在本研究中，我们提出了 MemoryFormer，这是一种新颖的 Transformer 架构，从新的角度显著降低了计算复杂度（FLOPs）。我们几乎消除了 Transformer 模型中除多头注意力操作所需的基本计算之外的所有计算。这一成就是通过利用一种替代方法来替换全连接层的线性投影实现的。具体来说，我们首先构建了一组内存中的查找表，这些表存储了大量离散向量，以替代线性投影中使用的权重矩阵。然后，我们使用哈希算法根据输入嵌入动态检索相关向量的子集。检索到的向量组合在一起将形成输出嵌入，从而对全连接层中的矩阵乘法操作结果进行估计。与执行矩阵乘法相比，从内存中检索数据块是一个成本低得多的操作，几乎不需要计算。我们从头开始训练 MemoryFormer，并在各种基准上进行了广泛的实验，以证明所提出模型的有效性。
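
The lookup idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the sign-bit hash, the chunking of the input, the table sizes, and the names `make_tables`, `sign_hash`, and `lookup_layer` are all assumptions chosen to keep the example short. It only shows how a layer output can be assembled by retrieving and summing stored vectors instead of multiplying by a weight matrix.

```python
import random

def make_tables(num_chunks, chunk_dim, out_dim, seed=0):
    """Build one table per input chunk: 2**chunk_dim buckets of out_dim floats.

    In a trained model these vectors would be learned; here they are random.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(out_dim)]
         for _ in range(2 ** chunk_dim)]
        for _ in range(num_chunks)
    ]

def sign_hash(chunk):
    """Hypothetical hash: one bit per coordinate, set when the value is positive."""
    index = 0
    for value in chunk:
        index = (index << 1) | (1 if value > 0 else 0)
    return index

def lookup_layer(x, tables, chunk_dim):
    """Stand-in for a fully-connected layer: sum one retrieved vector per chunk."""
    out = [0.0] * len(tables[0][0])
    for k, table in enumerate(tables):
        chunk = x[k * chunk_dim:(k + 1) * chunk_dim]
        row = table[sign_hash(chunk)]
        out = [o + r for o, r in zip(out, row)]
    return out

tables = make_tables(num_chunks=4, chunk_dim=3, out_dim=5)
x = [0.3, -1.2, 0.8, 0.1, 0.4, -0.7, -0.2, -0.9, 1.5, 0.6, -0.3, 0.2]
y = lookup_layer(x, tables, chunk_dim=3)
```

The per-token cost here is four table reads and four vector additions, independent of how a dense 12-by-5 weight matrix would be multiplied. The memory cost grows as 2 to the power of the chunk width, which is why the input is split into narrow chunks in this sketch.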

[NLP-28] Training Bilingual LMs with Data Constraints in the Targeted Language

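【Quick read】: This paper addresses how to improve pretrained model performance in a data-constrained target language. The key to the solution is to supplement the target-language training data with data from an auxiliary language (such as English) for which high-quality data is plentiful. The study quantifies the performance gap between training on the auxiliary language and training on the target language, explores the benefits of translation systems, examines the limits of model scaling for data-constrained languages, and proposes new methods for upsampling data from the auxiliary language. The results show that, for closely related languages, stronger auxiliary datasets improve performance without any change to the model or training objective; in particular, gains from more information-rich English pretraining datasets can carry over to target-language settings with limited data.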

链接: https://arxiv.org/abs/2411.12986
作者: Skyler Seto,Maartje ter Hoeve,He Bai,Natalie Schluter,David Grangier
关键词-EN: Large language models, current scaling laws, Large language, data, trained on massive
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 22 pages, 14 figures, 15 tables

Abstract:Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a data constrained target language by enlisting data from an auxiliary language for which high quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language compared with training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling for data constrained languages, and proposing new methods for upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in performance gains without modification to the model or training objective for close languages, and, in particular, that performance gains due to the development of more information-rich English pretraining datasets can extend to targeted language settings with limited data.
摘要：大语言模型是根据当前的扩展定律，在庞大的网络数据抓取中进行训练的。由于高质量预训练数据的丰富性，英语方面的进展最为显著。然而，对于大多数其他语言来说，这种高质量的预训练数据是不可得的。在本研究中，我们探讨了如何通过利用高质量数据可得的辅助语言的数据，来提升在数据受限的目标语言中预训练模型的性能。我们通过量化在数据丰富的辅助语言中训练与在目标语言中训练之间的性能差距，探索了翻译系统的优势，研究了模型扩展在数据受限语言中的局限性，并提出了从辅助语言中上采样数据的新方法。我们的结果表明，对于相近的语言，使用更强的辅助数据集可以带来性能提升，而无需修改模型或训练目标；特别是，由于开发了更多信息丰富的英语预训练数据集而带来的性能提升，可以扩展到数据有限的目标语言设置中。

[NLP-29] MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning

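【Quick read】: This paper addresses the difficulty that contemporary embodied agents have with rudimentary tasks when they are powered by open large language models (LLMs). The key to the solution is MindForge, a lifelong collaborative learning framework that strengthens agent learning through explicit perspective-taking. Its innovations are: (1) theory of mind representations that link percepts, beliefs, desires, and actions; (2) natural language communication between agents; and (3) semantic memory of task and environment knowledge together with episodic memory of collaboration episodes. These let agents reason about their own and others' mental states, address the common failure modes of false beliefs and faulty task execution, significantly raise task completion rates, and exhibit emergent behaviors such as knowledge transfer and collaborative code correction.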

链接: https://arxiv.org/abs/2411.12977
作者: Mircea Lică,Ojas Shirekar,Baptiste Colle,Chirag Raman
关键词-EN: demonstrated promising capabilities, Contemporary embodied agents, Contemporary embodied, demonstrated promising, promising capabilities
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Contemporary embodied agents, such as Voyager in Minecraft, have demonstrated promising capabilities in open-ended individual learning. However, when powered with open large language models (LLMs), these agents often struggle with rudimentary tasks, even when fine-tuned on domain-specific knowledge. Inspired by human cultural learning, we present MindForge, a novel framework that enhances Voyager with lifelong collaborative learning through explicit perspective-taking. MindForge introduces three key innovations: (1) theory of mind representations linking percepts, beliefs, desires, and actions; (2) natural language communication between agents; and (3) semantic memory of task and environment knowledge and episodic memory of collaboration episodes. These advancements enable agents to reason about their and others’ mental states, empirically addressing two prevalent failure modes: false beliefs and faulty task executions. In mixed-expertise Minecraft experiments, MindForge agents outperform Voyager counterparts, significantly improving task completion rate by 66.6% (+39.4%) for collecting one block of dirt and 70.8% (+20.8%) for collecting one wood block. They exhibit emergent behaviors like knowledge transfer from expert to novice agents and collaborative code correction. MindForge agents also demonstrate the ability to adapt to out-of-distribution tasks by using their previous experiences and beliefs obtained through collaboration. In this open-ended social learning paradigm, MindForge paves the way for the democratic development of embodied AI, where agents learn in deployment from both peer and environmental feedback.
摘要：当代具身智能体，如 Minecraft 中的 Voyager，在开放式个体学习方面展示了令人鼓舞的能力。然而，当配备开放式大语言模型 (LLM) 时，这些智能体在处理基础任务时往往表现不佳，即便在特定领域知识上进行了微调。受人类文化学习的启发，我们提出了 MindForge，这是一个通过显式视角采纳实现终身协作学习的新框架，旨在增强 Voyager 的能力。MindForge 引入了三项关键创新：(1) 心智理论表示，将感知、信念、欲望和行动联系起来；(2) 智能体之间的自然语言交流；(3) 任务和环境知识的语义记忆以及协作事件的情节记忆。这些进步使智能体能够推理自身和他人的心理状态，从而有效解决两种常见的失败模式：错误信念和任务执行错误。在混合专业水平的 Minecraft 实验中，MindForge 智能体的表现优于 Voyager 的同类智能体，显著提高了任务完成率：收集一块泥土的任务完成率提高了 66.6%（+39.4%），收集一块木块的任务完成率提高了 70.8%（+20.8%）。它们展示了从专家智能体向新手智能体进行知识转移和协作代码修正等涌现行为。MindForge 智能体还展示了通过利用先前的协作经验和信念来适应分布外任务的能力。在这种开放式的社会学习范式中，MindForge 为具身 AI 的民主化发展铺平了道路，智能体在部署过程中从同侪和环境反馈中学习。

[NLP-30] A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

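【Quick read】: This paper addresses off-topic misuse of large language models (LLMs), where users prompt a deployed model to perform tasks beyond its intended scope. Current guardrails, which rely on curated examples or custom classifiers, suffer from high false-positive rates, limited adaptability, and a dependence on real-world data that is not available before production. The key to the solution is a flexible, data-free guardrail development methodology: the problem space is defined qualitatively and passed to an LLM to generate diverse prompts, producing a synthetic dataset for benchmarking and training guardrails that outperform heuristic approaches. Because the task is framed as classifying whether the user prompt is relevant to the system prompt, the guardrails also generalize to other misuse categories such as jailbreak and harmful prompts. The authors open-source the synthetic dataset and the guardrail models to support guardrail development in pre-production environments and future LLM safety research.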

链接: https://arxiv.org/abs/2411.12946
作者: Gabriel Chua,Shing Yee Chan,Shaun Khoo
关键词-EN: Large Language Models, Large Language, Language Models, intended scope, Large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 5 figures

Abstract:Large Language Models are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production. In this paper, we introduce a flexible, data-free guardrail development methodology that addresses these challenges. By thoroughly defining the problem space qualitatively and passing this to an LLM to generate diverse prompts, we construct a synthetic dataset to benchmark and train off-topic guardrails that outperform heuristic approaches. Additionally, by framing the task as classifying whether the user prompt is relevant with respect to the system prompt, our guardrails effectively generalize to other misuse categories, including jailbreak and harmful prompts. Lastly, we further contribute to the field by open-sourcing both the synthetic dataset and the off-topic guardrail models, providing valuable resources for developing guardrails in pre-production environments and supporting future research and development in LLM safety.
摘要：大语言模型容易受到偏离主题的滥用，用户可能会引导这些模型执行超出其设计范围的任务。当前的防护措施，通常依赖于精心挑选的示例或定制的分类器，存在高误报率、适应性有限以及在预生产阶段无法获取现实世界数据的问题。本文提出了一种灵活且无需数据的防护措施开发方法，以解决这些问题。通过全面定性地定义问题空间，并将此传递给大语言模型以生成多样化的提示，我们构建了一个合成数据集，用于基准测试和训练能够超越启发式方法的偏离主题防护措施。此外，通过将任务框架化为判断用户提示是否与系统提示相关，我们的防护措施能够有效地推广到其他滥用类别，包括越狱和有害提示。最后，我们通过开源合成数据集和偏离主题防护模型，为预生产环境中的防护措施开发提供了宝贵的资源，并支持大语言模型安全领域的未来研究与开发。

[NLP-31] Loss-to-Loss Prediction: Scaling Laws for All Datasets

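【Quick read】: This paper addresses how to predict training loss across different data distributions. The key to the solution is a strategy for predicting one loss from another through simple shifted power law relationships. The paper identifies three such relationships: (1) between the train losses of two models trained on two separate datasets when paired by training compute (train-to-train); (2) between the train loss and the test loss on a downstream distribution for a single model (train-to-test); and (3) between the test losses of two models trained on two separate training datasets (test-to-test). These relationships hold for pre-training datasets that differ substantially and across a variety of downstream tasks, and in some settings they predict more accurately than extrapolating single-dataset scaling laws.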

链接: https://arxiv.org/abs/2411.12925
作者: David Brandfonbrener,Nikhil Anand,Nikhil Vyas,Eran Malach,Sham Kakade
关键词-EN: provide a reliable, reliable methodology, predicting train loss, scaling laws provide, methodology for predicting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

Abstract:While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.
摘要：尽管缩放定律为预测单一数据分布在不同计算规模下的训练损失提供了一种可靠的方法，但对于当我们改变数据分布时这些预测应如何变化，我们知之甚少。本文中，我们推导了一种从一种损失预测另一种损失的策略，并将其应用于预测不同预训练数据集之间的损失，以及从预训练数据到下游任务数据的损失。即使在比用于拟合曲线的最大FLOP预算大20倍的情况下，我们的预测也能很好地外推。更具体地说，我们发现存在简单的位移幂律关系，包括：（1）当模型按训练计算配对时，两个分别在两个不同数据集上训练的模型的训练损失之间的关系（训练到训练）；（2）单个模型在任何下游分布上的训练损失和测试损失之间的关系（训练到测试）；以及（3）两个分别在两个不同训练数据集上训练的模型的测试损失之间的关系（测试到测试）。这些结果适用于差异显著的预训练数据集（有些完全是代码，而有些则完全没有代码），并且跨越了多种下游任务。最后，我们发现，在某些情况下，这些位移幂律关系比外推单一数据集的缩放定律能产生更准确的预测。
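
A shifted power law between two losses can be fitted with a few lines of code. The sketch below assumes the functional form L_B = K * (L_A - E_A)^kappa + E_B, with the offsets E_A and E_B treated as known; the abstract only says the relationships are "simple shifted power laws", so this exact parameterization, the function names, and every number in the example are illustrative assumptions. The data is synthetic and does not reproduce any result from the paper.

```python
import math

def fit_shifted_power_law(loss_a, loss_b, e_a, e_b):
    """Least-squares fit of log(L_b - E_b) = log K + kappa * log(L_a - E_a)."""
    xs = [math.log(a - e_a) for a in loss_a]
    ys = [math.log(b - e_b) for b in loss_b]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    kappa = sxy / sxx
    k = math.exp(mean_y - kappa * mean_x)
    return k, kappa

def predict(loss_a, k, kappa, e_a, e_b):
    """Predict the loss on dataset B from the loss on dataset A."""
    return k * (loss_a - e_a) ** kappa + e_b

# Synthetic example: offsets and true parameters are made up for illustration.
E_A, E_B, TRUE_K, TRUE_KAPPA = 1.8, 2.1, 0.9, 1.2
train_a = [3.5, 3.1, 2.8, 2.6, 2.45]          # losses at small compute budgets
train_b = [predict(a, TRUE_K, TRUE_KAPPA, E_A, E_B) for a in train_a]

k, kappa = fit_shifted_power_law(train_a, train_b, E_A, E_B)
extrapolated = predict(2.2, k, kappa, E_A, E_B)  # a lower loss than any fitted point
```

With noisy measurements the offsets would normally be fitted as well, for example by a grid search over E_A and E_B around this log-linear fit; that step is omitted here.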

[NLP-32] Signformer is all you need: Towards Edge AI for Sign Language

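【Quick read】: This paper addresses the impracticality and unsustainability that resource-intensive methods bring to sign language translation, especially gloss-free sign language translation. The key to the solution is Signformer, a new architecture built from scratch that the authors describe as a lightweight "Feather-Giant" aimed at Edge AI. Using novel convolution and attention designs, Signformer improves translation performance and efficiency without pretrained models, prior knowledge transfer, or any NLP strategy that is not from scratch. It takes second place on the leaderboard as of 2024 with 467 to 1807 times fewer parameters than the best methods, and a lighter configuration with 0.57 million parameters outperforms almost every other method.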

链接: https://arxiv.org/abs/2411.12901
作者: Eta Yang
关键词-EN: growing resource-intensive methodologies, Sign language translation, Large Language Models, gloss-free paradigm, resource-intensive methodologies
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Official Code at: this https URL

Abstract:Sign language translation, especially in the gloss-free paradigm, is confronting a dilemma of impracticality and unsustainability due to growing resource-intensive methodologies. Contemporary state-of-the-art (SOTA) methods have significantly hinged on sophisticated pretrained backbones such as Large Language Models (LLMs), embedding sources, or extensive datasets, inducing considerable parametric and computational inefficiency for sustainable use in real-world scenarios. Despite their success, following this research direction undermines the overarching mission of this domain to create substantial value to bridge hard-of-hearing and common populations. Committing to the prevailing trend of LLM and Natural Language Processing (NLP) studies, we pursue a profound essential change in architecture to achieve ground-up improvements without external aid from pretrained models, prior knowledge transfer, or any NLP strategies considered not-from-scratch. We introduce Signformer, a from-scratch Feather-Giant transforming the area towards Edge AI that redefines extremities of performance and efficiency with LLM-competence and edgy-deployable compactness. In this paper, we present nature analysis of sign languages to inform our algorithmic design and deliver a scalable transformer pipeline with convolution and attention novelty. We achieve new 2nd place on the leaderboard with a parametric reduction of 467-1807x against the finest as of 2024 and outcompete almost every other method in a lighter configuration of 0.57 million parameters.
摘要：手语翻译，特别是在无词汇（gloss-free）范式中，由于资源密集型方法的不断增加，正面临着实用性与可持续性不足的困境。当代最先进的技术（SOTAs）在很大程度上依赖于预训练的复杂骨干网络，如大语言模型（LLMs）、嵌入源或大规模数据集，这导致了在实际应用中可持续使用时参数和计算效率的显著不足。尽管这些技术取得了成功，但沿着这一研究方向发展，却削弱了该领域旨在为听障人群与普通人群之间搭建桥梁的根本使命。我们致力于顺应大语言模型（LLM）和自然语言处理（NLP）研究的主流趋势，追求在架构上的根本性变革，以实现自底向上的改进，而不依赖于预训练模型、先验知识转移或任何非从头开始的NLP策略。我们引入了Signformer，这是一种从头开始的轻量级-巨型架构，将手语翻译领域推向边缘AI（Edge AI），重新定义了性能与效率的极限，具备与LLM竞争的能力和边缘部署的紧凑性。在本文中，我们对手语的自然特性进行了分析，以指导我们的算法设计，并提供了一个具有卷积和注意力创新的可扩展Transformer流水线。我们在2024年的排行榜上以467-1807倍的参数减少量取得了第二名的成绩，并在0.57百万参数的轻量级配置下超越了几乎所有其他方法。

[NLP-33] Selective Attention: Enhancing Transformer through Principled Context Control

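【Quick read】: This paper addresses the fact that standard self-attention applies the same mapping to every query, which limits the model's ability to control contextual sparsity and relevance. The proposed solution is the Selective Self-Attention (SSA) layer, whose key idea is to augment the softmax nonlinearity with a temperature scaling strategy so that the contextual sparsity of the attention map adapts to the query embedding and its position in the context window. This alleviates attention dilution, aids optimization, and improves the model's control over the softmax spikiness of individual queries. Applying temperature scaling to the value embeddings further improves the model's ability to suppress irrelevant or noisy tokens. SSA is lightweight, adding less than 0.5% new parameters through a weight-sharing strategy, can be fine-tuned on existing LLMs, and yields a noticeable and consistent accuracy improvement on language modeling benchmarks.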

链接: https://arxiv.org/abs/2411.12892
作者: Xuechen Zhang,Xiangyu Chang,Mingchen Li,Amit Roy-Chowdhury,Jiasi Chen,Samet Oymak
关键词-EN: transformer architecture enables, combine tokens based, transformer architecture, architecture enables, weigh and combine
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same way by applying the mapping $V^\top \mathrm{softmax}(Kq)$, where $V, K$ are the value and key embeddings respectively. In this work, we argue that this uniform treatment hinders the ability to control contextual sparsity and relevance. As a solution, we introduce the Selective Self-Attention (SSA) layer that augments the softmax nonlinearity with a principled temperature scaling strategy. By controlling temperature, SSA adapts the contextual sparsity of the attention map to the query embedding and its position in the context window. Through theory and experiments, we demonstrate that this alleviates attention dilution, aids the optimization process, and enhances the model’s ability to control softmax spikiness of individual queries. We also incorporate temperature scaling for value embeddings and show that it boosts the model’s ability to suppress irrelevant/noisy tokens. Notably, SSA is a lightweight method which introduces less than 0.5% new parameters through a weight-sharing strategy and can be fine-tuned on existing LLMs. Extensive empirical evaluations demonstrate that SSA-equipped models achieve a noticeable and consistent accuracy improvement on language modeling benchmarks.
摘要：Transformer架构中的注意力机制使模型能够根据Token与查询的相关性进行加权和组合。尽管自注意力机制取得了显著成功，但它通过应用映射 $V^\top \mathrm{softmax}(Kq)$ 对所有查询 $q$ 进行统一处理，其中 $V$ 和 $K$ 分别是值和键的嵌入。我们认为，这种统一处理方式阻碍了控制上下文稀疏性和相关性的能力。为此，我们引入了选择性自注意力 (Selective Self-Attention, SSA) 层，该层通过一种有原则的温度缩放策略增强了softmax非线性。通过控制温度，SSA能够根据查询嵌入及其在上下文窗口中的位置调整注意力图的上下文稀疏性。通过理论和实验，我们证明这缓解了注意力稀释问题，有助于优化过程，并增强了模型控制单个查询softmax尖锐度的能力。我们还为值嵌入引入了温度缩放，并展示了它如何提升模型抑制无关/噪声Token的能力。值得注意的是，SSA是一种轻量级方法，通过权重共享策略引入的新参数不到0.5%，并且可以在现有的大语言模型上进行微调。广泛的实证评估表明，配备SSA的模型在语言建模基准测试中实现了显著且一致的准确性提升。
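
The effect of temperature on attention sparsity can be seen in a short sketch. It computes the softmax attention of one query over keys and values, with the scores divided by a temperature before the softmax. In SSA the temperature depends on the query and its position; how it is computed is not given in the abstract, so here it is simply passed in as a number, and the function names and toy vectors are assumptions for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attend(keys, values, query, temperature=1.0):
    """Single-query attention: weights = softmax(K q / temperature), output = V^T weights."""
    scores = [sum(k_i * q_i for k_i, q_i in zip(key, query)) for key in keys]
    weights = softmax(scores, temperature)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

def entropy(weights):
    """Shannon entropy of the attention weights; lower means sparser attention."""
    return -sum(w * math.log(w) for w in weights if w > 0)

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.5]

sharp, sharp_out = attend(keys, values, query, temperature=0.5)
flat, flat_out = attend(keys, values, query, temperature=4.0)
```

A low temperature concentrates the weights on the best-matching key, while a high temperature spreads them out. Making the temperature a function of the query lets each query choose its own point on that range.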

[NLP-34] ProSec: Fortifying Code LLMs with Proactive Security Alignment

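【Quick read】: This paper addresses the security vulnerabilities that code large language models (LLMs) can introduce when generating code. The key to the solution is ProSec, a proactive security alignment approach designed to align code LLMs with secure coding practices. ProSec synthesizes error-inducing coding scenarios from Common Weakness Enumerations (CWEs) to systematically expose vulnerabilities in a code LLM, then generates fixes for the vulnerable snippets so the model can learn secure practices through preference learning objectives. This substantially enlarges the security alignment dataset. Experiments show that models trained with ProSec are 29.2% to 35.5% more secure than previous work, with a negative effect on model utility of less than 2 percentage points.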

链接: https://arxiv.org/abs/2411.12882
作者: Xiangzhe Xu,Zian Su,Jinyao Guo,Kaiyuan Zhang,Zhenting Wang,Xiangyu Zhang
关键词-EN: code-specific large language, Recent advances, greatly enhanced code, enhanced code generation, large language models
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: The first two authors contributed equally to this work

Abstract:Recent advances in code-specific large language models (LLMs) have greatly enhanced code generation and refinement capabilities. However, the safety of code LLMs remains under-explored, posing potential risks as insecure code generated by these models may introduce vulnerabilities into real-world systems. Previous work proposes to collect a security-focused instruction-tuning dataset from real-world vulnerabilities. It is constrained by the data sparsity of vulnerable code, and has limited applicability in the iterative post-training workflows of modern LLMs. In this paper, we propose ProSec, a novel proactive security alignment approach designed to align code LLMs with secure coding practices. ProSec systematically exposes the vulnerabilities in a code LLM by synthesizing error-inducing coding scenarios from Common Weakness Enumerations (CWEs), and generates fixes to vulnerable code snippets, allowing the model to learn secure practices through advanced preference learning objectives. The scenarios synthesized by ProSec trigger 25 times more vulnerable code than a normal instruction-tuning dataset, resulting in a security-focused alignment dataset 7 times larger than the previous work. Experiments show that models trained with ProSec are 29.2% to 35.5% more secure compared to previous work, with a marginal negative effect of less than 2 percentage points on the model’s utility.
摘要：近年来，针对代码的大语言模型 (LLM) 的进步极大地提升了代码生成和优化的能力。然而，代码 LLM 的安全性问题尚未得到充分探索，这些模型生成的非安全代码可能会给现实世界系统带来潜在风险。先前的工作提出从现实世界漏洞中收集以安全为重点的指令调优数据集。但由于漏洞代码数据的稀疏性，这种方法在现代 LLM 的迭代后训练工作流程中的适用性有限。本文提出了 ProSec，一种新颖的主动安全对齐方法，旨在将代码 LLM 与安全编码实践对齐。ProSec 系统地通过从通用弱点枚举 (CWE) 中合成引发错误的编码场景，揭示代码 LLM 中的漏洞，并生成修复漏洞代码片段的补丁，使模型能够通过高级偏好学习目标学习安全实践。ProSec 合成的场景触发的漏洞代码比普通指令调优数据集多 25 倍，生成的安全对齐数据集比先前的工作大 7 倍。实验表明，使用 ProSec 训练的模型相比先前的工作在安全性上提高了 29.2% 至 35.5%，而对模型效用的负面影响不到 2 个百分点。

[NLP-35] AzSLD: Azerbaijani Sign Language Dataset for Fingerspelling Word and Sentence Translation with Baseline Software

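【Quick read】: This paper addresses the shortage of datasets for Azerbaijani Sign Language (AzSL) processing. The key to the solution is a comprehensive Azerbaijani Sign Language Dataset (AzSLD) of 30,000 videos collected from signers of different ages, genders, and signing styles, each annotated with accurate sign labels and corresponding linguistic translations. Videos are recorded from two camera angles, which, together with the detailed annotation, supports robust training and evaluation of gesture recognition models. The dataset ships with technical documentation and source code to ease its use in training and testing. Ethical guidelines were strictly followed during data collection, and all participants gave informed consent.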

链接: https://arxiv.org/abs/2411.12865
作者: Nigar Alishzade,Jamaladdin Hasanov
关键词-EN: processing technology development, technology development relies, language processing technology, Sign language, Sign language processing
类目: Computation and Language (cs.CL)
备注:

Abstract:Sign language processing technology development relies on extensive and reliable datasets, instructions, and ethical guidelines. We present a comprehensive Azerbaijani Sign Language Dataset (AzSLD) collected from diverse sign language users and linguistic parameters to facilitate advancements in sign recognition and translation systems and support the local sign language community. The dataset was created within the framework of a vision-based AzSL translation project. This study introduces the dataset as a summary of the fingerspelling alphabet and sentence- and word-level sign language datasets. The dataset was collected from signers of different ages, genders, and signing styles, with videos recorded from two camera angles to capture each sign in full detail. This approach ensures robust training and evaluation of gesture recognition models. AzSLD contains 30,000 videos, each carefully annotated with accurate sign labels and corresponding linguistic translations. The dataset is accompanied by technical documentation and source code to facilitate its use in training and testing. This dataset offers a valuable resource of labeled data for researchers and developers working on sign language recognition, translation, or synthesis. Ethical guidelines were strictly followed throughout the project, with all participants providing informed consent for collecting, publishing, and using the data.
摘要：手语处理技术的发展依赖于广泛且可靠的数据集、指导方针和伦理准则。我们提出了一个全面的阿塞拜疆手语数据集（Azerbaijani Sign Language Dataset, AzSLD），该数据集从多样化的手语使用者和语言参数中收集，旨在促进手语识别和翻译系统的进步，并支持当地手语社区。该数据集是在基于视觉的阿塞拜疆手语翻译项目框架内创建的。本研究将该数据集作为手指拼写字母和句子及单词级别手语数据集的总结进行介绍。数据集从不同年龄、性别和手语风格的使用者中收集，视频从两个摄像角度录制，以捕捉每个手语的完整细节。这种方法确保了手势识别模型的稳健训练和评估。AzSLD包含30,000个视频，每个视频都经过精心标注，附有准确的手语标签和相应的语言翻译。数据集附有技术文档和源代码，以方便其在训练和测试中的使用。该数据集为从事手语识别、翻译或合成的研究人员和开发者提供了宝贵的标注数据资源。在整个项目过程中，严格遵循了伦理准则，所有参与者均提供了知情同意书，允许收集、发布和使用数据。

[NLP-36] SCOUT: A Situated and Multi-Modal Human-Robot Dialogue Corpus

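【Quick read】: This paper addresses corpus construction for human-robot dialogue in collaborative exploration tasks, in particular how a multi-modal dataset can support the development and study of autonomous human-robot dialogue systems. The key to the solution is the Situated Corpus Of Understanding Transactions (SCOUT), a multi-modal corpus of 89,056 utterances and 310,095 words from 278 dialogues, averaging 320 utterances per dialogue. The dialogues are aligned with the multi-modal data streams available during the experiments (5,785 images and 30 maps). The corpus is annotated with Abstract Meaning Representation (AMR) and Dialogue-AMR to identify the speaker's intent and meaning, and with Transactional Units and Relations to track relationships between utterances and reveal patterns in dialogue structure. SCOUT is intended to accelerate research on autonomous, situated human-robot dialogue, especially in navigation tasks where details of the environment must be discovered.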

链接: https://arxiv.org/abs/2411.12844
作者: Stephanie M. Lukin,Claire Bonial,Matthew Marge,Taylor Hudson,Cory J. Hayes,Kimberly A. Pollard,Anthony Baker,Ashley N. Foots,Ron Artstein,Felix Gervits,Mitchell Abrams,Cassidy Henry,Lucia Donatelli,Anton Leuski,Susan G. Hill,David Traum,Clare R. Voss
关键词-EN: Understanding Transactions, collaborative exploration, domain of collaborative, Abstract Meaning Representation, Transactions
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Robotics (cs.RO)
备注: 14 pages, 7 figures

Abstract:We introduce the Situated Corpus Of Understanding Transactions (SCOUT), a multi-modal collection of human-robot dialogue in the task domain of collaborative exploration. The corpus was constructed from multiple Wizard-of-Oz experiments where human participants gave verbal instructions to a remotely-located robot to move and gather information about its surroundings. SCOUT contains 89,056 utterances and 310,095 words from 278 dialogues averaging 320 utterances per dialogue. The dialogues are aligned with the multi-modal data streams available during the experiments: 5,785 images and 30 maps. The corpus has been annotated with Abstract Meaning Representation and Dialogue-AMR to identify the speaker’s intent and meaning within an utterance, and with Transactional Units and Relations to track relationships between utterances to reveal patterns of the Dialogue Structure. We describe how the corpus and its annotations have been used to develop autonomous human-robot systems and enable research in open questions of how humans speak to robots. We release this corpus to accelerate progress in autonomous, situated, human-robot dialogue, especially in the context of navigation tasks where details about the environment need to be discovered.
摘要：我们介绍了情境化理解交易语料库（Situated Corpus Of Understanding Transactions, SCOUT），这是一个在协作探索任务领域中的人机对话多模态集合。该语料库源自多个Wizard-of-Oz实验，其中人类参与者通过口头指令指导远程机器人移动并收集其周围环境的信息。SCOUT包含89,056个话语和310,095个单词，来自278次对话，平均每段对话有320个话语。这些对话与实验期间可用的多模态数据流（5,785张图像和30张地图）对齐。语料库通过抽象意义表示（Abstract Meaning Representation, AMR）和对话AMR（Dialogue-AMR）进行了标注，以识别说话者的意图和话语中的意义，并通过交易单元（Transactional Units）和关系（Relations）来追踪话语之间的关系，揭示对话结构的规律。我们描述了如何利用该语料库及其标注来开发自主人机系统，并推动关于人类如何与机器人交流的研究。我们发布此语料库，以加速自主、情境化的人机对话领域的进展，特别是在需要发现环境细节的导航任务中。

[NLP-37] Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

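【Quick read】: This paper addresses a limitation of learning a reward model (RM) from human preferences: the conventional Bradley-Terry setup accepts only binary preference data, so it discards samples such as "tied" responses and loses finer-grained information such as "slightly better". The key to the solution is a framework for learning RMs under ordinal feedback that handles preference data of arbitrary granularity. The paper first identifies a marginal unbiasedness condition that generalizes the Bradley-Terry assumption of the binary setting and justifies it through the sociological concept of the wisdom of the crowd. Under this condition it develops a natural probability model for ordinal feedback and analyzes its properties. The theory shows that ordinal feedback reduces the Rademacher complexity relative to binary feedback, and the learning objective and theory extend to hinge loss and direct policy optimization (DPO). The framework also offers insight into writing guidance for human annotators. Numerical experiments confirm that fine-grained feedback leads to better reward learning in both in-distribution and out-of-distribution settings, and that including a certain proportion of tied samples improves RM learning.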

链接: https://arxiv.org/abs/2411.12843
作者: Shang Liu,Yu Pan,Guanting Chen,Xiaocheng Li
关键词-EN: aligning large language, large language models, important component, component in aligning, aligning large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

Abstract:Learning a reward model (RM) from human preferences has been an important component in aligning large language models (LLMs). The canonical setup of learning RMs from pairwise preference data is rooted in the classic Bradley-Terry (BT) model that accepts binary feedback, i.e., the label being either Response 1 is better than Response 2, or the opposite. Such a setup inevitably discards potentially useful samples (such as “tied” between the two responses) and loses more fine-grained information (such as “slightly better”). In this paper, we propose a framework for learning RMs under ordinal feedback which generalizes the case of binary preference feedback to any arbitrary granularity. Specifically, we first identify a marginal unbiasedness condition, which generalizes the assumption of the BT model in the existing binary feedback setting. The condition validates itself via the sociological concept of the wisdom of the crowd. Under the condition, we develop a natural probability model for pairwise preference data under ordinal feedback and analyze its properties. We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity compared to the case of binary feedback. The proposed learning objective and the theory also extend to hinge loss and direct policy optimization (DPO). In particular, the theoretical analysis may be of independent interest when applying to a seemingly unrelated problem of knowledge distillation to interpret the bias-variance trade-off therein. The framework also sheds light on writing guidance for human annotators. Our numerical experiments validate that fine-grained feedback leads to better reward learning for both in-distribution and out-of-distribution settings. Further experiments show that incorporating a certain proportion of samples with tied preference boosts RM learning.
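
As a rough illustration of how ordinal feedback can extend the binary Bradley-Terry objective, the sketch below treats each ordinal label as a soft target z in [0, 1] and takes the cross-entropy against the BT win probability. The mapping `ORDINAL_TARGETS` and the function name `ordinal_bt_loss` are assumptions made for illustration; the paper's exact probability model may differ.

```python
import math

# Hypothetical mapping from ordinal labels to soft targets in [0, 1].
# 1.0 means "Response 1 is better", 0.0 means "Response 2 is better".
ORDINAL_TARGETS = {
    "better": 1.0,
    "slightly_better": 0.75,
    "tie": 0.5,
    "slightly_worse": 0.25,
    "worse": 0.0,
}


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ordinal_bt_loss(reward_1: float, reward_2: float, label: str) -> float:
    """Cross-entropy between the soft target and the BT win probability.

    With only the labels "better" and "worse" this reduces to the usual
    binary Bradley-Terry negative log-likelihood.
    """
    z = ORDINAL_TARGETS[label]
    margin = reward_1 - reward_2
    return -(z * log_sigmoid(margin) + (1.0 - z) * log_sigmoid(-margin))
```

Under this assumed mapping, a "tie" label is best explained by a reward margin of zero, while a "better" label keeps rewarding larger margins, which is one way tied samples can still carry training signal.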

[NLP-38] Human-Robot Dialogue Annotation for Multi-Modal Common Ground

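[Quick Read]: This paper addresses how to establish common ground between humans and robots through natural language dialogue in remote settings (such as disaster relief or search-and-rescue tasks), where communication constraints prevent the robot from immediately sharing high-quality visual information. The key to the solution is a symbolic representation, Dialogue Abstract Meaning Representation (Dialogue-AMR), that captures the propositional semantics and illocutionary force of individual utterances, together with a multi-floor Dialogue Structure annotation schema that captures how utterances relate to one another. The paper also examines how the visual modalities supply contextual information to the dialogue to overcome disparities in the collaborators' understanding of the environment. Finally, it discusses the systems built from these annotations, which let physical robots engage autonomously with humans in bi-directional dialogue and navigation.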

Link: https://arxiv.org/abs/2411.12829
Authors: Claire Bonial,Stephanie M. Lukin,Mitchell Abrams,Anthony Baker,Lucia Donatelli,Ashley Foots,Cory J. Hayes,Cassidy Henry,Taylor Hudson,Matthew Marge,Kimberly A. Pollard,Ron Artstein,David Traum,Clare R. Voss
Keywords: symbolic representations annotated, establishing common ground, natural language dialogue, human-robot dialogue data, autonomous systems participating
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: 52 pages, 14 figures

Abstract:In this paper, we describe the development of symbolic representations annotated on human-robot dialogue data to make dimensions of meaning accessible to autonomous systems participating in collaborative, natural language dialogue, and to enable common ground with human partners. A particular challenge for establishing common ground arises in remote dialogue (occurring in disaster relief or search-and-rescue tasks), where a human and robot are engaged in a joint navigation and exploration task of an unfamiliar environment, but where the robot cannot immediately share high quality visual information due to limited communication constraints. Engaging in a dialogue provides an effective way to communicate, while on-demand or lower-quality visual information can be supplemented for establishing common ground. Within this paradigm, we capture propositional semantics and the illocutionary force of a single utterance within the dialogue through our Dialogue-AMR annotation, an augmentation of Abstract Meaning Representation. We then capture patterns in how different utterances within and across speaker floors relate to one another in our development of a multi-floor Dialogue Structure annotation schema. Finally, we begin to annotate and analyze the ways in which the visual modalities provide contextual information to the dialogue for overcoming disparities in the collaborators’ understanding of the environment. We conclude by discussing the use-cases, architectures, and systems we have implemented from our annotations that enable physical robots to autonomously engage with humans in bi-directional dialogue and navigation.

[NLP-39] Probing the Capacity of Language Model Agents to Operationalize Disparate Experiential Context Despite Distraction

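[Quick Read]: This paper examines how well large language model (LLM) agents perform in complex decision scenarios that contain long interaction histories and distractors. The key contribution is the OEDD (Operationalize Experience Despite Distraction) corpus, a human-validated collection of scenarios in which an agent must make a decision from disparate experiential information in the presence of a distractor. Evaluating state-of-the-art LLMs (GPT-3.5 Turbo, GPT-4o, and Gemini 1.5 Pro) with a minimal chain-of-thought prompting strategy, the study finds that all of them perform worse than random choice when the input context contains over 1,615 tokens of historical interactions, the crucial decision-informing premise is the correct conclusion from two disparate environment premises, and a trivial but distracting red-herring fact follows.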

Link: https://arxiv.org/abs/2411.12828
Authors: Sonny George,Chris Sypherd,Dylan Cashman
Keywords: Large language model, agents show promise, Large language, language model, number of domains
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM) agents show promise in an increasing number of domains. In many proposed applications, it is expected that the agent reasons over accumulated experience presented in an input prompt. We propose the OEDD (Operationalize Experience Despite Distraction) corpus, a human-annotator-validated body of scenarios with pre-scripted agent histories where the agent must make a decision based on disparate experiential information in the presence of a distractor. We evaluate three state-of-the-art LLMs (GPT-3.5 Turbo, GPT-4o, and Gemini 1.5 Pro) using a minimal chain-of-thought prompting strategy and observe that when (1) the input context contains over 1,615 tokens of historical interactions, (2) a crucially decision-informing premise is the rightful conclusion over two disparate environment premises, and (3) a trivial, but distracting red herring fact follows, all LLMs perform worse than random choice at selecting the better of two actions. Our code and test corpus are publicly available at: this https URL .

[NLP-40] Revisiting Fake News Detection: Towards Temporality-aware Evaluation by Leveraging Engagement Earliness WSDM2025

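[Quick Read]: This paper addresses the limitations of conventional social graph-based fake news detection in realistic settings, in particular its reliance on future knowledge during training. The key to the solution is a new temporality-aware evaluation scheme in which the model may be trained only on data collected up to a given point in time. The paper further proposes DAWN, which uses feature representations of engagement earliness to guide an edge weight estimator that suppresses noisy edges linking real and fake news in the social graph, improving detection performance. Experiments show that DAWN outperforms existing fake news detection methods under real-world conditions.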

Link: https://arxiv.org/abs/2411.12775
Authors: Junghoon Kim,Junmo Lee,Yeonjun In,Kanghoon Yoon,Chanyoung Park
Keywords: utilizing social contexts, user information, tweets and comments, false information, aims to identify
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: WSDM 2025

Abstract:Social graph-based fake news detection aims to identify news articles containing false information by utilizing social contexts, e.g., user information, tweets and comments. However, conventional methods are evaluated under less realistic scenarios, where the model has access to future knowledge on article-related and context-related data during training. In this work, we newly formalize a more realistic evaluation scheme that mimics real-world scenarios, where the data is temporality-aware and the detection model can only be trained on data collected up to a certain point in time. We show that the discriminative capabilities of conventional methods decrease sharply under this new setting, and further propose DAWN, a method more applicable to such scenarios. Our empirical findings indicate that later engagements (e.g., consuming or reposting news) contribute more to noisy edges that link real news-fake news pairs in the social graph. Motivated by this, we utilize feature representations of engagement earliness to guide an edge weight estimator to suppress the weights of such noisy edges, thereby enhancing the detection performance of DAWN. Through extensive experiments, we demonstrate that DAWN outperforms existing fake news detection methods under real-world environments. The source code is available at this https URL.
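
The temporality-aware evaluation idea can be sketched as a time-based split: everything used for training, articles and engagements alike, must have been observable at the cutoff. The `Article` and `Engagement` records, their field names, and the `temporal_split` helper are hypothetical and are not taken from the DAWN code.

```python
from dataclasses import dataclass


@dataclass
class Article:
    article_id: str
    published_at: int  # hypothetical Unix timestamp of publication
    label: int         # 1 = fake, 0 = real


@dataclass
class Engagement:
    article_id: str
    user_id: str
    timestamp: int     # when the user consumed or reposted the article


def temporal_split(articles, engagements, cutoff):
    """Train on what was observable up to `cutoff`; test on later articles."""
    train_articles = [a for a in articles if a.published_at <= cutoff]
    test_articles = [a for a in articles if a.published_at > cutoff]
    train_ids = {a.article_id for a in train_articles}
    # Drop engagements after the cutoff so no future knowledge leaks in.
    train_engagements = [
        e for e in engagements
        if e.timestamp <= cutoff and e.article_id in train_ids
    ]
    return train_articles, test_articles, train_engagements
```

A random split would instead mix late engagements into the training graph, which is the kind of future knowledge the abstract argues makes conventional evaluations less realistic.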

[NLP-41] CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization

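[Quick Read]: This paper addresses the vulnerability of large language models (LLMs) to backdoor attacks in text generation tasks. The key to the solution is a new defense called Internal Consistency Regularization (CROW), which uses consistency-regularization fine-tuning to address the layer-wise inconsistencies caused by backdoor triggers. The method builds on the intuition that clean models exhibit smooth, consistent transitions in hidden representations across layers, whereas backdoored models fluctuate noticeably when triggered. By enforcing internal consistency, CROW neutralizes backdoor effects using only a small set of clean data, without clean reference models or prior knowledge of the trigger, which makes it practical to deploy across LLM architectures. Experiments show that CROW significantly reduces attack success rates across diverse backdoor strategies and tasks while preserving the model's generative capabilities.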

Link: https://arxiv.org/abs/2411.12768
Authors: Nay Myat Min,Long H. Pham,Yige Li,Jun Sun
Keywords: Recent studies reveal, Large Language Models, Large Language, Recent studies, reveal that Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent studies reveal that Large Language Models (LLMs) are susceptible to backdoor attacks, where adversaries embed hidden triggers that manipulate model responses. Existing backdoor defense methods are primarily designed for vision or classification tasks, and are thus ineffective for text generation tasks, leaving LLMs vulnerable. We introduce Internal Consistency Regularization (CROW), a novel defense using consistency regularization finetuning to address layer-wise inconsistencies caused by backdoor triggers. CROW leverages the intuition that clean models exhibit smooth, consistent transitions in hidden representations across layers, whereas backdoored models show noticeable fluctuation when triggered. By enforcing internal consistency through adversarial perturbations and regularization, CROW neutralizes backdoor effects without requiring clean reference models or prior trigger knowledge, relying only on a small set of clean data. This makes it practical for deployment across various LLM architectures. Experimental results demonstrate that CROW consistently achieves significant reductions in attack success rates across diverse backdoor strategies and tasks, including negative sentiment, targeted refusal, and code injection, on models such as Llama-2 (7B, 13B), CodeLlama (7B, 13B) and Mistral-7B, while preserving the model’s generative capabilities.
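
To make the layer-wise consistency intuition concrete, the sketch below scores how sharply a token's hidden vector changes direction from one layer to the next. It is an assumption-laden toy: the function names are hypothetical, hidden states are plain Python lists of non-zero vectors, and the adversarial perturbations and fine-tuning loop described in the abstract are not reproduced.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layer_consistency_penalty(hidden_states):
    """Mean (1 - cosine similarity) between consecutive layers.

    `hidden_states` holds one vector per layer for a single token and
    must contain at least two layers. Smooth transitions give a value
    near 0; abrupt changes of direction push it toward 1 or above.
    """
    gaps = [
        1.0 - cosine_similarity(h_prev, h_next)
        for h_prev, h_next in zip(hidden_states, hidden_states[1:])
    ]
    return sum(gaps) / len(gaps)
```

A penalty of this general shape could be added to a fine-tuning loss on clean data, so that the model is discouraged from the abrupt layer-to-layer shifts the abstract associates with triggered backdoors.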

[NLP-42] Suicide Risk Assessment on Social Media with Semi-Supervised Learning

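[Quick Read]: This paper addresses automated suicide risk assessment on social media, particularly when labeled data is scarce and classes are imbalanced. The key to the solution is a semi-supervised framework that combines a small labeled set (n=500) with a larger unlabeled set (n=1,500) and extends the self-training algorithm with a novel pseudo-label acquisition process designed for imbalanced datasets. To ensure pseudo-label quality, the authors manually verify the subset of pseudo-labeled data that was not predicted unanimously across multiple trials of pseudo-label generation. Using partially validated pseudo-labeled data alongside ground-truth labels substantially improves the model's ability to assess suicide risk from social media posts.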

Link: https://arxiv.org/abs/2411.12767
Authors: Max Lovitt,Haotian Ma,Song Wang,Yifan Peng
Keywords: natural language processing, language processing presents, risk assessment systems, media communities increasingly, suicidal individuals post
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Accepted for publication in the 2024 IEEE International Conference on Big Data

Abstract:With social media communities increasingly becoming places where suicidal individuals post and congregate, natural language processing presents an exciting avenue for the development of automated suicide risk assessment systems. However, past efforts suffer from a lack of labeled data and class imbalances within the available labeled data. To accommodate this task’s imperfect data landscape, we propose a semi-supervised framework that leverages labeled (n=500) and unlabeled (n=1,500) data and expands upon the self-training algorithm with a novel pseudo-label acquisition process designed to handle imbalanced datasets. To further ensure pseudo-label quality, we manually verify a subset of the pseudo-labeled data that was not predicted unanimously across multiple trials of pseudo-label generation. We test various models to serve as the backbone for this framework, ultimately deciding that RoBERTa performs the best. Ultimately, by leveraging partially validated pseudo-labeled data in addition to ground-truth labeled data, we substantially improve our model’s ability to assess suicide risk from social media posts.
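
The verification step described above, keeping unanimous pseudo-labels and routing the rest to a human, can be sketched as follows. The function name `split_by_agreement` and the input layout are hypothetical, and the paper's handling of class imbalance is not modelled here.

```python
from collections import Counter


def split_by_agreement(trial_predictions):
    """Separate unanimous pseudo-labels from ones needing manual review.

    `trial_predictions` maps a post id to the list of labels predicted
    for that post, one label per pseudo-labelling trial.
    Returns (accepted, needs_review).
    """
    accepted = {}
    needs_review = {}
    for post_id, labels in trial_predictions.items():
        counts = Counter(labels)
        if len(counts) == 1:
            # Every trial agreed, so the pseudo-label is accepted as-is.
            accepted[post_id] = labels[0]
        else:
            # Trials disagreed: keep the majority vote only as a
            # suggestion for the human verifier.
            needs_review[post_id] = counts.most_common(1)[0][0]
    return accepted, needs_review
```

For example, a post predicted `"low"` in all three trials would be accepted automatically, whereas one predicted `"high"`, `"low"`, `"high"` would be queued for manual verification with `"high"` as the suggested label.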

[NLP-43] SEFD: Semantic-Enhanced Framework for Detecting LLM-Generated Text

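[Quick Read]: This paper addresses the weakness of existing detectors on text generated by large language models (LLMs), especially text produced through paraphrasing. The key to the solution is a semantic-enhanced framework (SEFD) that adds a retrieval-based mechanism to make full use of text semantics. Specifically, SEFD systematically integrates retrieval-based techniques with traditional detectors, using a carefully curated retrieval mechanism that balances comprehensive coverage against computational efficiency. In sequential-text scenarios common in practice (such as online forums and Q&A platforms), it substantially improves detection accuracy on paraphrased text while remaining robust on standard LLM-generated content.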

Link: https://arxiv.org/abs/2411.12764
Authors: Weiqing He,Bojian Hou,Tianqi Shang,Davoud Ataee Tarzanagh,Qi Long,Li Shen
Keywords: large language models, language models, evade existing detection, widespread adoption, adoption of large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:The widespread adoption of large language models (LLMs) has created an urgent need for robust tools to detect LLM-generated text, especially in light of paraphrasing techniques that often evade existing detection methods. To address this challenge, we present a novel semantic-enhanced framework for detecting LLM-generated text (SEFD) that leverages a retrieval-based mechanism to fully utilize text semantics. Our framework improves upon existing detection methods by systematically integrating retrieval-based techniques with traditional detectors, employing a carefully curated retrieval mechanism that strikes a balance between comprehensive coverage and computational efficiency. We showcase the effectiveness of our approach in sequential text scenarios common in real-world applications, such as online forums and Q&A platforms. Through comprehensive experiments across various LLM-generated texts and detection methods, we demonstrate that our framework substantially enhances detection accuracy in paraphrasing scenarios while maintaining robustness for standard LLM-generated content.

[NLP-44] Playing Language Game with LLMs Leads to Jailbreaking

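[Quick Read]: This paper studies the safety of large language models (LLMs) under malicious attack, specifically jailbreak techniques that bypass safety defenses by finding domains where safety generalization fails (mismatched generalization). It proposes two new jailbreak methods based on mismatched generalization: natural language games and custom language games. Both use synthetic linguistic constructs and custom rules to bypass LLM safety mechanisms, and they achieve high attack success rates across multiple LLM platforms: 93% on GPT-4o, 89% on GPT-4o-mini, and 83% on Claude-3.5-Sonnet. In addition, after fine-tuning Llama-3.1-70B for safety alignment on the custom language games, the authors find that the fine-tuned model still fails to identify harmful content when prompted through other language games. This indicates that the safety alignment knowledge in LLMs does not generalize across linguistic formats, which opens new directions for future research.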

Link: https://arxiv.org/abs/2411.12762
Authors: Yu Peng,Zewen Long,Fangming Dong,Congyi Li,Shu Wu,Kai Chen
Keywords: numerous jailbreak techniques, jailbreak techniques aimed, language games, custom language games, natural language games
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The advent of large language models (LLMs) has spurred the development of numerous jailbreak techniques aimed at circumventing their security defenses against malicious attacks. An effective jailbreak approach is to identify a domain where safety generalization fails, a phenomenon known as mismatched generalization. In this paper, we introduce two novel jailbreak methods based on mismatched generalization: natural language games and custom language games, both of which effectively bypass the safety mechanisms of LLMs, with various kinds and different variants, making them hard to defend and leading to high attack rates. Natural language games involve the use of synthetic linguistic constructs and the actions intertwined with these constructs, such as the Ubbi Dubbi language. Building on this phenomenon, we propose the custom language games method: by engaging with LLMs using a variety of custom rules, we successfully execute jailbreak attacks across multiple LLM platforms. Extensive experiments demonstrate the effectiveness of our methods, achieving success rates of 93% on GPT-4o, 89% on GPT-4o-mini and 83% on Claude-3.5-Sonnet. Furthermore, to investigate the generalizability of safety alignments, we fine-tuned Llama-3.1-70B with the custom language games to achieve safety alignment within our datasets and found that when interacting through other language games, the fine-tuned models still failed to identify harmful content. This finding indicates that the safety alignment knowledge embedded in LLMs fails to generalize across different linguistic formats, thus opening new avenues for future research in this area.

[NLP-45] A Novel Approach to Eliminating Hallucinations in Large Language Model-Assisted Causal Discovery

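[Quick Read]: This paper addresses poor model selection caused by hallucinations when large language models (LLMs) are used for causal discovery. The key to the solution is twofold: 1) using Retrieval Augmented Generation (RAG) to reduce hallucinations when quality data is available; and 2) a novel method in which multiple LLMs debate, with an arbiter, to audit the edges of a causal graph, achieving a reduction in hallucinations comparable to RAG.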

Link: https://arxiv.org/abs/2411.12759
Authors: Grace Sng,Yanming Zhang,Klaus Mueller
Keywords: optimal model selection, large language models, human domain experts, domain experts highlights, model selection
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The increasing use of large language models (LLMs) in causal discovery as a substitute for human domain experts highlights the need for optimal model selection. This paper presents the first hallucination survey of popular LLMs for causal discovery. We show that hallucinations exist when using LLMs in causal discovery so the choice of LLM is important. We propose using Retrieval Augmented Generation (RAG) to reduce hallucinations when quality data is available. Additionally, we introduce a novel method employing multiple LLMs with an arbiter in a debate to audit edges in causal graphs, achieving a comparable reduction in hallucinations to RAG.

[NLP-46] An exploration of the effect of quantisation on energy consumption and inference time of StarCoder2

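[Quick Read]: This paper studies quantisation and pruning as strategies for reducing the energy consumed during inference with code large language models (LLMs). On StarCoder2, quantisation increased energy demand because of lower throughput and also caused some accuracy loss, whereas pruning reduced energy use but impaired performance. The paper highlights the challenges and trade-offs in LLM compression and suggests that future work focus on hardware-optimized quantisation to improve efficiency with minimal loss of accuracy.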

Link: https://arxiv.org/abs/2411.12758
Authors: Pepijn de Reus,Ana Oprescu,Jelle Zuidema
Keywords: code Large Language, Large Language Models, Large Language, study examines quantisation, code Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:This study examines quantisation and pruning strategies to reduce energy consumption in code Large Language Models (LLMs) inference. Using StarCoder2, we observe increased energy demands with quantization due to lower throughput and some accuracy losses. Conversely, pruning reduces energy usage but impairs performance. The results highlight challenges and trade-offs in LLM model compression. We suggest future work on hardware-optimized quantization to enhance efficiency with minimal loss in accuracy.
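
The counterintuitive result above follows from simple arithmetic: energy is average power multiplied by time, so a model that draws less power but generates tokens more slowly can still use more energy overall. The sketch below works through that relation; the helper name and all numbers are hypothetical and are not measurements from the study.

```python
def inference_energy_joules(avg_power_watts, tokens, tokens_per_second):
    """Energy = average power draw x wall-clock time spent generating."""
    seconds = tokens / tokens_per_second
    return avg_power_watts * seconds


# Hypothetical numbers chosen only to illustrate the trade-off.
full_precision = inference_energy_joules(
    avg_power_watts=300.0, tokens=1000, tokens_per_second=50.0
)
quantised = inference_energy_joules(
    avg_power_watts=250.0, tokens=1000, tokens_per_second=25.0
)

# Lower power draw, but halved throughput doubles the run time.
print(full_precision, quantised)
```

With these illustrative figures the quantised run draws less power yet takes twice as long, so its total energy is higher, which matches the direction of the effect the abstract reports.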

[NLP-47] A Library Perspective on Supervised Text Processing in Digital Libraries: An Investigation in the Biomedical Domain

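[Quick Read]: This paper addresses how digital libraries with large text collections can process text efficiently and economically to support downstream applications such as building knowledge graphs, semantic enrichment of documents, or implementing new access paths. The key to the solution is generating training data through distant supervision and large language models (such as ChatGPT, LLama, and Olmo), and designing final processing pipelines that balance accuracy against application cost. The paper focuses on relation extraction and text classification, using eight biomedical benchmarks as a showcase.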

Link: https://arxiv.org/abs/2411.12752
Authors: Hermann Kroll,Pascal Sackhoff,Bill Matthias Thang,Maha Ksouri,Wolf-Tilo Balke
Keywords: building knowledge graphs, maintain extensive textual, extensive textual collections, building knowledge, knowledge graphs
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: JCD2024 Full Paper, 12 pages, 6 figures

Abstract:Digital libraries that maintain extensive textual collections may want to further enrich their content for certain downstream applications, e.g., building knowledge graphs, semantic enrichment of documents, or implementing novel access paths. All of these applications require some text processing, either to identify relevant entities, extract semantic relationships between them, or to classify documents into some categories. However, implementing reliable, supervised workflows can become quite challenging for a digital library because suitable training data must be crafted, and reliable models must be trained. While many works focus on achieving the highest accuracy on some benchmarks, we tackle the problem from a digital library practitioner. In other words, we also consider trade-offs between accuracy and application costs, dive into training data generation through distant supervision and large language models such as ChatGPT, LLama, and Olmo, and discuss how to design final pipelines. Therefore, we focus on relation extraction and text classification, using the showcase of eight biomedical benchmarks.

Artificial Intelligence

[AI-0] SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs

Link: https://arxiv.org/abs/2411.13547
Authors: Shirley Kokane,Ming Zhu,Tulika Awalgaonkar,Jianguo Zhang,Thai Hoang,Akshara Prabhakar,Zuxin Liu,Tian Lan,Liangwei Yang,Juntao Tan,Rithesh Murthy,Weiran Yao,Zhiwei Liu,Juan Carlos Niebles,Huan Wang,Shelby Heinecke,Caiming Xiong,Silivo Savarese
Keywords: Large Language Models, Language Models, Large Language, critical aspects, aspects of building
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically only give a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark data set comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.

[AI-1] BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

Link: https://arxiv.org/abs/2411.13543
Authors: Davide Paglieri,Bartłomiej Cupiał,Samuel Coward,Ulyana Piterbarg,Maciej Wolczyk,Akbir Khan,Eduardo Pignatelli,Łukasz Kuciński,Lerrel Pinto,Rob Fergus,Jakob Nicolaus Foerster,Jack Parker-Holder,Tim Rocktäschel
Keywords: Large Language Models, Vision Language Models, Large Language, Vision Language, promising reasoning abilities
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint, under review

Abstract:Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, including tasks that are solvable by non-expert humans in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.

[AI-2] Metacognition for Unknown Situations and Environments (MUSE)

Link: https://arxiv.org/abs/2411.13537
Authors: Rodolfo Valiente,Praveen K. Pilly
Keywords: central to human, human adaptability, unknown situations, awareness and regulation, Metacognition
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Metacognition, the awareness and regulation of one’s cognitive processes, is central to human adaptability in unknown situations. In contrast, current autonomous agents often struggle in novel environments due to their limited capacity for adaptation. We hypothesize that metacognition is a critical missing ingredient in adaptive autonomous systems, equipping them with the cognitive flexibility needed to tackle unfamiliar challenges. Given the broad scope of metacognitive abilities, we focus on two key aspects: competence awareness and strategy selection for novel tasks. To this end, we propose the Metacognition for Unknown Situations and Environments (MUSE) framework, which integrates metacognitive processes, specifically self-awareness and self-regulation, into autonomous agents. We present two initial implementations of MUSE: one based on world modeling and another leveraging large language models (LLMs), both instantiating the metacognitive cycle. Our system continuously learns to assess its competence on a given task and uses this self-awareness to guide iterative cycles of strategy selection. MUSE agents show significant improvements in self-awareness and self-regulation, enabling them to solve novel, out-of-distribution tasks more effectively compared to Dreamer-v3-based reinforcement learning and purely prompt-based LLM agent approaches. This work highlights the promise of approaches inspired by cognitive and neural systems in enabling autonomous systems to adapt to new environments, overcoming the limitations of current methods that rely heavily on extensive training data.

[AI-3] Identity Preserving 3D Head Stylization with Multiview Score Distillation

Link: https://arxiv.org/abs/2411.13536
Authors: Bahri Batuhan Bilecen,Ahmet Berke Gokmen,Furkan Guzelant,Aysegul Dundar
Keywords: enhancing user engagement, virtual reality applications, transforms realistic facial, realistic facial features, stylization transforms realistic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: this https URL

Abstract:3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across gaming and virtual reality applications. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. This paper addresses these challenges by leveraging the PanoHead model, synthesizing images from a comprehensive 360-degree perspective. We propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Please visit this https URL for more visuals.

[AI-4] Entropy Bootstrapping for Weakly Supervised Nuclei Detection CVPR2025

Link: https://arxiv.org/abs/2411.13528
Authors: James Willoughby,Irina Voiculescu
Keywords: Microscopy structure segmentation, Microscopy structure, generally requires, requires a human, human to draw
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted for CVPR 2025

Abstract:Microscopy structure segmentation, such as detecting cells or nuclei, generally requires a human to draw a ground truth contour around each instance. Weakly supervised approaches (e.g. consisting of only single point labels) have the potential to reduce this workload significantly. Our approach uses individual point labels for an entropy estimation to approximate an underlying distribution of cell pixels. We infer full cell masks from this distribution, and use Mask-RCNN to produce an instance segmentation output. We compare this point-annotated approach with training on the full ground truth masks. We show that our method achieves a comparatively good level of performance, despite a 95% reduction in pixel labels.

[AI-5] SoK: A Systems Perspective on Compound AI Threats and Countermeasures

Link: https://arxiv.org/abs/2411.13459
Authors: Sarbartha Banerjee,Prateek Sahu,Mulong Luo,Anjo Vahldiek-Oberwagner,Neeraja J. Yadwadkar,Mohit Tiwari
Keywords: Large language models, Large language, inputs and data, operate on sensitive, sensitive inputs
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 4 figures, 2 tables

Abstract:Large language models (LLMs) used across enterprises often use proprietary models and operate on sensitive inputs and data. The wide range of attack vectors identified in prior research - targeting various software and hardware components used in training and inference - makes it extremely challenging to enforce confidentiality and integrity policies. As we advance towards constructing compound AI inference pipelines that integrate multiple large language models (LLMs), the attack surfaces expand significantly. Attackers now focus on the AI algorithms as well as the software and hardware components associated with these systems. While current research often examines these elements in isolation, we find that combining cross-layer attack observations can enable powerful end-to-end attacks with minimal assumptions about the threat model. Given the sheer number of existing attacks at each layer, we need a holistic and systematized understanding of different attack vectors at each layer. This SoK discusses different software and hardware attacks applicable to compound AI systems and demonstrates how combining multiple attack mechanisms can reduce the threat model assumptions required for an isolated attack. Next, we systematize the ML attacks in line with the MITRE ATT&CK framework to better position each attack based on the threat model. Finally, we outline the existing countermeasures for both software and hardware layers and discuss the necessity of a comprehensive defense strategy to enable the secure and high-performance deployment of compound AI systems.

[AI-6] Robust Monocular Visual Odometry using Curriculum Learning

Link: https://arxiv.org/abs/2411.13438
Authors: Assaf Lahiany,Oren Gal
Keywords-EN: gradually introducing increasingly, natural learning patterns, learning patterns observed, introducing increasingly complex, drawing inspiration
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages

Click to view abstract

Abstract:Curriculum Learning (CL), drawing inspiration from natural learning patterns observed in humans and animals, employs a systematic approach of gradually introducing increasingly complex training data during model development. Our work applies innovative CL methodologies to address the challenging geometric problem of monocular Visual Odometry (VO) estimation, which is essential for robot navigation in constrained environments. The primary objective of our research is to push the boundaries of current state-of-the-art (SOTA) benchmarks in monocular VO by investigating various curriculum learning strategies. We enhance the end-to-end Deep-Patch-Visual Odometry (DPVO) framework through the integration of novel CL approaches, with the goal of developing more resilient models capable of maintaining high performance across challenging environments and complex motion scenarios. Our research encompasses several distinctive CL strategies. We develop methods to evaluate sample difficulty based on trajectory motion characteristics, implement sophisticated adaptive scheduling through self-paced weighted loss mechanisms, and utilize reinforcement learning agents for dynamic adjustment of training emphasis. Through comprehensive evaluation on the real-world TartanAir dataset, our Curriculum Learning-based Deep-Patch-Visual Odometry (CL-DPVO) demonstrates superior performance compared to existing SOTA methods, including both feature-based and learning-based VO approaches. The results validate the effectiveness of integrating curriculum learning principles into visual odometry systems.
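The abstract mentions adaptive scheduling through self-paced weighted losses without giving details. As a rough illustration of the general idea only, the sketch below uses the classic "hard" self-paced rule: samples whose loss is below a threshold get weight 1, the rest get weight 0, and the threshold grows over epochs so harder samples enter training later. The function names, the schedule, and the loss values are all assumptions made for this example; this is not the CL-DPVO implementation.

```python
def self_paced_weights(losses, threshold):
    """Hard self-paced weights: 1.0 if a sample's loss is below the threshold, else 0.0."""
    return [1.0 if loss < threshold else 0.0 for loss in losses]


def weighted_mean_loss(losses, weights):
    """Average the losses of the currently selected samples (0.0 if none are selected)."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(l * w for l, w in zip(losses, weights)) / total_weight


def threshold_schedule(epoch, start=0.5, growth=1.5):
    """Hypothetical schedule: the difficulty threshold grows geometrically per epoch."""
    return start * growth ** epoch


# Made-up per-sample losses, ordered from easy to hard.
sample_losses = [0.2, 0.4, 0.9, 2.0]

for epoch in range(3):
    lam = threshold_schedule(epoch)
    weights = self_paced_weights(sample_losses, lam)
    print(epoch, lam, weights, weighted_mean_loss(sample_losses, weights))
```

With these toy numbers, the first epoch trains only on the two easiest samples, and a third sample is admitted once the threshold passes 0.9.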

[AI-7] SynEHRgy: Synthesizing Mixed-Type Structured Electronic Health Records using Decoder-Only Transformers

Link: https://arxiv.org/abs/2411.13428
Authors: Hojjat Karami,David Atienza,Anisoara Ionescu
Keywords-EN: Electronic Health Records, Generating synthetic Electronic, synthetic Electronic Health, Health Records, offers significant potential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Generating synthetic Electronic Health Records (EHRs) offers significant potential for data augmentation, privacy-preserving data sharing, and improving machine learning model training. We propose a novel tokenization strategy tailored for structured EHR data, which encompasses diverse data types such as covariates, ICD codes, and irregularly sampled time series. Using a GPT-like decoder-only transformer model, we demonstrate the generation of high-quality synthetic EHRs. Our approach is evaluated using the MIMIC-III dataset, and we benchmark the fidelity, utility, and privacy of the generated data against state-of-the-art models.

[AI-8] Heuristically Adaptive Diffusion-Model Evolutionary Strategy

Link: https://arxiv.org/abs/2411.13420
Authors: Benedikt Hartl,Yanbo Zhang,Hananel Hazan,Michael Levin
Keywords-EN: Gaussian noise, degrades domain-specific information, Diffusion Models, Diffusion Models represent, represent a significant
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Diffusion Models represent a significant advancement in generative modeling, employing a dual-phase process that first degrades domain-specific information via Gaussian noise and restores it through a trainable model. This framework enables pure noise-to-data generation and modular reconstruction of images or videos. Concurrently, evolutionary algorithms employ optimization methods inspired by biological principles to refine sets of numerical parameters encoding potential solutions to rugged objective functions. Our research reveals a fundamental connection between diffusion models and evolutionary algorithms through their shared underlying generative mechanisms: both methods generate high-quality samples via iterative refinement on random initial distributions. By employing deep learning-based diffusion models as generative models across diverse evolutionary tasks and iteratively refining diffusion models with heuristically acquired databases, we can iteratively sample potentially better-adapted offspring parameters, integrating them into successive generations of the diffusion model. This approach achieves efficient convergence toward high-fitness parameters while maintaining explorative diversity. Diffusion models introduce enhanced memory capabilities into evolutionary algorithms, retaining historical information across generations and leveraging subtle data correlations to generate refined samples. We elevate evolutionary algorithms from procedures with shallow heuristics to frameworks with deep memory. By deploying classifier-free guidance for conditional sampling at the parameter level, we achieve precise control over evolutionary search dynamics to further specific genotypical, phenotypical, or population-wide traits. Our framework marks a major heuristic and algorithmic transition, offering increased flexibility, precision, and control in evolutionary optimization processes.

[AI-9] Explainable Finite-Memory Policies for Partially Observable Markov Decision Processes

Link: https://arxiv.org/abs/2411.13365
Authors: Muqsit Azeem,Debraj Chakraborty,Sudeep Kanav,Jan Kretinsky
Keywords-EN: Partially Observable Markov, Markov Decision Processes, Observable Markov Decision, Partially Observable, Observable Markov
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: Preprint – Under Review

Click to view abstract

Abstract:Partially Observable Markov Decision Processes (POMDPs) are a fundamental framework for decision-making under uncertainty and partial observability. Since in general optimal policies may require infinite memory, they are hard to implement and often render most problems undecidable. Consequently, finite-memory policies are mostly considered instead. However, the algorithms for computing them are typically very complex, and so are the resulting policies. Facing the need for their explainability, we provide a representation of such policies, both (i) in an interpretable formalism and (ii) typically of smaller size, together yielding higher explainability. To that end, we combine models of Mealy machines and decision trees; the latter describing simple, stationary parts of the policies and the former describing how to switch among them. We design a translation for policies of the finite-state-controller (FSC) form from standard literature and show how our method smoothly generalizes to other variants of finite-memory policies. Further, we identify specific properties of recently used “attractor-based” policies, which allow us to construct yet simpler and smaller representations. Finally, we illustrate the higher explainability in a few case studies.
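To make the combination described in the abstract concrete, here is a minimal sketch: a Mealy-style machine holds the memory (the current mode), and each mode carries a small, stationary decision rule written as a tiny hand-made decision tree. The class name, the modes, the observations, and the actions are invented for illustration and are not taken from the paper.

```python
class ModePolicy:
    """Finite-memory policy: a mode-switching machine over per-mode decision rules."""

    def __init__(self, initial_mode, action_rules, next_mode):
        self.mode = initial_mode          # current memory state
        self.action_rules = action_rules  # mode -> function(observation) -> action
        self.next_mode = next_mode        # function(mode, observation) -> mode

    def step(self, observation):
        # Mealy-style output: the action depends on the current mode and the input.
        action = self.action_rules[self.mode](observation)
        self.mode = self.next_mode(self.mode, observation)
        return action


def search_rule(obs):
    # A depth-1 decision tree for the hypothetical "search" mode.
    if obs["wall_ahead"]:
        return "turn_left"
    return "forward"


def return_rule(obs):
    # A constant rule for the hypothetical "return" mode.
    return "backward"


def next_mode(mode, obs):
    # Switch from "search" to "return" once the goal has been observed.
    if mode == "search" and obs["goal_seen"]:
        return "return"
    return mode


policy = ModePolicy("search", {"search": search_rule, "return": return_rule}, next_mode)
observations = [
    {"wall_ahead": False, "goal_seen": False},
    {"wall_ahead": True, "goal_seen": True},
    {"wall_ahead": False, "goal_seen": False},
]
actions = [policy.step(obs) for obs in observations]
print(actions)
```

The point of the split is readability: the per-mode rules stay small and memoryless, while all memory lives in the mode-switching function.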

[AI-10] Verifying Machine Unlearning with Explainable AI ICPR

Link: https://arxiv.org/abs/2411.13332
Authors: Àlex Pujol Vidal,Anders S. Johansen,Mohammad N. S. Jahromi,Sergio Escalera,Kamal Nasrollahi,Thomas B. Moeslund
Keywords-EN: verifying Machine Unlearning, harbor front monitoring, Data Protection Regulation, verifying Machine, General Data Protection
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICPRW2024

Click to view abstract

Abstract:We investigate the effectiveness of Explainable AI (XAI) in verifying Machine Unlearning (MU) within the context of harbor front monitoring, focusing on data privacy and regulatory compliance. With the increasing need to adhere to privacy legislation such as the General Data Protection Regulation (GDPR), traditional methods of retraining ML models for data deletions prove impractical due to their complexity and resource demands. MU offers a solution by enabling models to selectively forget specific learned patterns without full retraining. We explore various removal techniques, including data relabeling and model perturbation. Then, we leverage attribution-based XAI to discuss the effects of unlearning on model performance. Our proof-of-concept introduces feature importance as an innovative verification step for MU, expanding beyond traditional metrics and demonstrating the techniques’ ability to reduce reliance on undesired patterns. Additionally, we propose two novel XAI-based metrics, Heatmap Coverage (HC) and Attention Shift (AS), to evaluate the effectiveness of these methods. This approach not only highlights how XAI can complement MU by providing effective verification, but also sets the stage for future research to enhance their joint integration.

[AI-11] An Evolutional Neural Network Framework for Classification of Microarray Data

Link: https://arxiv.org/abs/2411.13326
Authors: Maryam Eshraghi Evari,Md Nasir Sulaiman,Amir Rajabi Behjat
Keywords-EN: DNA microarray gene-expression, DNA microarray, microarray gene-expression data, cancerous gene signatures, identify cancerous gene
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments:

Click to view abstract

Abstract:DNA microarray gene-expression data has been widely used to identify cancerous gene signatures. Microarrays can increase the accuracy of cancer diagnosis and prognosis. However, analyzing the large amount of gene expression data from microarray chips poses a challenge for current machine learning research. One of the challenges in classifying healthy and cancerous tissues is the high dimensionality of gene expressions. High dimensionality decreases the accuracy of the classification. This research aims to apply a hybrid model of a Genetic Algorithm and a Neural Network to overcome this problem during subset selection of informative genes. A Genetic Algorithm (GA) reduces dimensionality during feature selection, and then a Multi-Layer Perceptron Neural Network (MLP) is applied to classify the selected genes. The performance is evaluated by considering the accuracy and the number of selected genes. Experimental results show that the proposed method achieves high accuracy with a minimum number of selected genes in comparison with other machine learning algorithms.

[AI-12] Are Large Language Models Memorizing Bug Benchmarks?

Link: https://arxiv.org/abs/2411.13323
Authors: Daniel Ramos,Claudia Mamede,Kush Jain,Paulo Canelas,Catarina Gamboa,Claire Le Goues
Keywords-EN: Large Language Models, Large Language, software engineering tasks, including code generation, software engineering
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: pre-print

Click to view abstract

Abstract:Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and n-gram accuracy. Our findings show that certain models, in particular codegen-multi, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets like LLaMa 3.1 exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess models' capabilities.
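The abstract names n-gram accuracy as one leakage signal but does not define it. One common reading is: give the model a prefix of a benchmark file and check whether it reproduces the next n tokens exactly. The sketch below implements that reading with a toy "model" that has memorized one snippet; the function names, the tokenization by whitespace, and both snippets are assumptions for illustration and do not reproduce the paper's protocol.

```python
def ngram_accuracy(tokens, predict, n, starts):
    """Fraction of start positions where `predict` reproduces the next n tokens exactly."""
    hits = 0
    total = 0
    for start in starts:
        if start <= 0 or start + n > len(tokens):
            continue  # skip positions without a prefix or without a full n-gram
        total += 1
        prefix = tokens[:start]
        if predict(prefix, n) == tokens[start:start + n]:
            hits += 1
    return hits / total if total else 0.0


def make_memorizing_model(memorized):
    """Toy stand-in for an LLM: continues a prefix only if it matches the memorized text."""
    def predict(prefix, n):
        if memorized[:len(prefix)] == prefix:
            return memorized[len(prefix):len(prefix) + n]
        return ["<unk>"] * n
    return predict


# Two made-up code snippets: one "seen" in training, one not.
seen = "int x = a + b ; return x ;".split()
unseen = "int y = c - d ; return y ;".split()
model = make_memorizing_model(seen)

print(ngram_accuracy(seen, model, n=3, starts=[2, 4, 6]))
print(ngram_accuracy(unseen, model, n=3, starts=[2, 4, 6]))
```

A large gap between the score on benchmark files and the score on comparable unseen code is the kind of evidence that would suggest memorization rather than capability.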

[AI-13] Scaling Laws for Online Advertisement Retrieval

Link: https://arxiv.org/abs/2411.13322
Authors: Yunli Wang,Zixuan Yang,Zhen Zhang,Zhiqiang Wang,Jian Yang,Shiyang Wen,Peng Jiang,Kun Gai
Keywords-EN: scaling law, scaling, law, online advertisement retrieval, neural network models
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures

Click to view abstract

Abstract:The scaling law is a notable property of neural network models and has significantly propelled the development of large language models. Scaling laws hold great promise in guiding model design and resource allocation. Recent research increasingly shows that scaling laws are not limited to NLP tasks or Transformer architectures; they also apply to domains such as recommendation. However, there is still a lack of literature on scaling law research in online advertisement retrieval systems. This may be because 1) identifying the scaling law for resource cost and online revenue is often expensive in both time and training resources for large-scale industrial applications, and 2) varying settings for different systems prevent the scaling law from being applied across various scenarios. To address these issues, we propose a lightweight paradigm to identify the scaling law of online revenue and machine cost for a certain online advertisement retrieval scenario with a low experimental cost. Specifically, we focus on a sole factor (FLOPs) and propose an offline metric named R/R* that exhibits a high linear correlation with online revenue for retrieval models. We estimate the machine cost offline via a simulation algorithm. Thus, we can transform most online experiments into low-cost offline experiments. We conduct comprehensive experiments to verify the effectiveness of our proposed metric R/R* and to identify the scaling law in the online advertisement retrieval system of Kuaishou. With the scaling law, we demonstrate practical applications for ROI-constrained model designing and multi-scenario resource allocation in Kuaishou advertising system. To the best of our knowledge, this is the first work to study the scaling laws for online advertisement retrieval of real-world systems, showing great potential for scaling law in advertising system optimization.
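The abstract does not state the functional form of the scaling law it identifies. Scaling laws are often modeled as power laws, so the sketch below shows how one might fit y = a * x^b between FLOPs and an offline metric by linear regression in log-log space. The power-law form, the function name, and the data points are assumptions for illustration; the numbers are synthetic and are not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via linear regression on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log(a)
    return a, b


# Synthetic data generated from a known power law, purely to exercise the fit.
flops = [1e6, 1e7, 1e8, 1e9]
metric = [2.0 * f ** 0.1 for f in flops]

a, b = fit_power_law(flops, metric)
print(a, b)
```

Once a fit like this is available from cheap offline runs, it can be used to extrapolate the expected metric at a larger FLOPs budget before committing to an expensive online experiment.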

[AI-14] A Resource Efficient Fusion Network for Object Detection in Birds-Eye View using Camera and Raw Radar Data ITSC

Link: https://arxiv.org/abs/2411.13311
Authors: Kavin Chandrasekaran,Sorin Grigorescu,Gijs Dubbelman,Pavol Jancura
Keywords-EN: autonomous driving systems, withstand adverse weather, adverse weather conditions, weather conditions unlike, affordable radar sensors
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IEEE Intelligent Transportation Systems Conference (ITSC) 2024

Click to view abstract

Abstract:Cameras can be used to perceive the environment around the vehicle, while affordable radar sensors are popular in autonomous driving systems as they can withstand adverse weather conditions unlike cameras. However, radar point clouds are sparser with low azimuth and elevation resolution that lack semantic and structural information of the scenes, resulting in generally lower radar detection performance. In this work, we directly use the raw range-Doppler (RD) spectrum of radar data, thus avoiding radar signal processing. We independently process camera images within the proposed comprehensive image processing pipeline. Specifically, first, we transform the camera images to Bird’s-Eye View (BEV) Polar domain and extract the corresponding features with our camera encoder-decoder architecture. The resultant feature maps are fused with Range-Azimuth (RA) features, recovered from the RD spectrum input from the radar decoder to perform object detection. We evaluate our fusion strategy with other existing methods not only in terms of accuracy but also on computational complexity metrics on RADIal dataset.

[AI-15] DATTA: Domain-Adversarial Test-Time Adaptation for Cross-Domain WiFi-Based Human Activity Recognition

Link: https://arxiv.org/abs/2411.13284
Authors: Julian Strohmayer,Rafael Sterzinger,Matthias Wödlinger,Martin Kampel
Keywords-EN: channel state information, causing domain shifts, Cross-domain generalization, WiFi-based sensing due, variations in environments
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Cross-domain generalization is an open problem in WiFi-based sensing due to variations in environments, devices, and subjects, causing domain shifts in channel state information. To address this, we propose Domain-Adversarial Test-Time Adaptation (DATTA), a novel framework combining domain-adversarial training (DAT), test-time adaptation (TTA), and weight resetting to facilitate adaptation to unseen target domains and to prevent catastrophic forgetting. DATTA is integrated into a lightweight, flexible architecture optimized for speed. We conduct a comprehensive evaluation of DATTA, including an ablation study on all key components using publicly available data, and verify its suitability for real-time applications such as human activity recognition. When combining a SotA video-based variant of TTA with WiFi-based DAT and comparing it to DATTA, our method achieves an 8.1% higher F1-Score. The PyTorch implementation of DATTA is publicly available at: this https URL.

[AI-16] Towards Specification-Driven LLM-Based Generation of Embedded Automotive Software

Link: https://arxiv.org/abs/2411.13269
Authors: Minal Suresh Patil,Gustav Ung,Mattias Nyberg
Keywords-EN: critical embedded software, produce critical embedded, backprompting and fine-tuning, embedded software, critical embedded
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 21 pages, 2 figures

Click to view abstract

Abstract:The paper studies how code generation by LLMs can be combined with formal verification to produce critical embedded software. The first contribution is a general framework, spec2code, in which LLMs are combined with different types of critics that produce feedback for iterative backprompting and fine-tuning. The second contribution presents a first feasibility study, where a minimalistic instantiation of spec2code, without iterative backprompting and fine-tuning, is empirically evaluated using three industrial case studies from the heavy vehicle manufacturer Scania. The goal is to automatically generate industrial-quality code from specifications only. Different combinations of formal ACSL specifications and natural language specifications are explored. The results indicate that formally correct code can be generated even without the application of iterative backprompting and fine-tuning.

[AI-17] FASTNav: Fine-tuned Adaptive Small-language-models Trained for Multi-point Robot Navigation

Link: https://arxiv.org/abs/2411.13262
Authors: Yuxuan Chen,Yixin Han,Xiao Li
Keywords-EN: large language models, language models, language models bring, large language, starting to enjoy
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:With the rapid development of large language models (LLM), robots are starting to enjoy the benefits of new interaction methods that large language models bring. Because edge computing fulfills the needs for rapid response, privacy, and network autonomy, we believe it facilitates the extensive deployment of large models for robot navigation across various industries. To enable local deployment of language models on edge devices, we adopt some model boosting methods. In this paper, we propose FASTNav - a method for boosting lightweight LLMs, also known as small language models (SLMs), for robot navigation. The proposed method contains three modules: fine-tuning, teacher-student iteration, and language-based multi-point robot navigation. We train and evaluate models with FASTNav in both simulation and real robots, proving that we can deploy them with low cost, high accuracy and low response time. Compared to other model compression methods, FASTNav shows potential in the local deployment of language models and tends to be a promising solution for language-guided robot navigation on edge devices.

[AI-18] BelHouse3D: A Benchmark Dataset for Assessing Occlusion Robustness in 3D Point Cloud Semantic Segmentation ECCV2024

Link: https://arxiv.org/abs/2411.13251
Authors: Umamaheswaran Raman Kumar,Abdur Razzaq Fayjie,Jurgen Hannaert,Patrick Vandewalle
Keywords-EN: advancing machine learning, point cloud, vision tasks, instrumental in advancing, advancing machine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 20 pages, 6 figures, 3 tables, accepted at ECCV 2024 Workshops

Click to view abstract

Abstract:Large-scale 2D datasets have been instrumental in advancing machine learning; however, progress in 3D vision tasks has been relatively slow. This disparity is largely due to the limited availability of 3D benchmarking datasets. In particular, creating real-world point cloud datasets for indoor scene semantic segmentation presents considerable challenges, including data collection within confined spaces and the costly, often inaccurate process of per-point labeling to generate ground truths. While synthetic datasets address some of these challenges, they often fail to replicate real-world conditions, particularly the occlusions that occur in point clouds collected from real environments. Existing 3D benchmarking datasets typically evaluate deep learning models under the assumption that training and test data are independently and identically distributed (IID), which affects the models’ usability for real-world point cloud segmentation. To address these challenges, we introduce the BelHouse3D dataset, a new synthetic point cloud dataset designed for 3D indoor scene semantic segmentation. This dataset is constructed using real-world references from 32 houses in Belgium, ensuring that the synthetic data closely aligns with real-world conditions. Additionally, we include a test set with data occlusion to simulate out-of-distribution (OOD) scenarios, reflecting the occlusions commonly encountered in real-world point clouds. We evaluate popular point-based semantic segmentation methods using our OOD setting and present a benchmark. We believe that BelHouse3D and its OOD setting will advance research in 3D point cloud semantic segmentation for indoor scenes, providing valuable insights for the development of more generalizable models.

[AI-19] XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation NEURIPS2024

Link: https://arxiv.org/abs/2411.13243
Authors: Ziyi Wang,Yanbo Wang,Xumin Yu,Jie Zhou,Jiwen Lu
Keywords-EN: Existing methodologies, segmentation primarily concentrate, feature space encompassing, primarily concentrate, concentrate on establishing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to NeurIPS 2024

Click to view abstract

Abstract:Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we developed a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at this https URL.

[AI-20] Transforming the Hybrid Cloud for Emerging AI Workloads

Link: https://arxiv.org/abs/2411.13239
Authors: Deming Chen,Alaa Youssef,Ruchi Pendse,André Schleife,Bryan K. Clark,Hendrik Hamann,Jingrui He,Teodoro Laino,Lav Varshney,Yuxiong Wang,Avirup Sil,Reyhaneh Jabbarvand,Tianyin Xu,Volodymyr Kindratenko,Carlos Costa,Sarita Adve,Charith Mendis,Minjia Zhang,Santiago Núñez-Corrales,Raghu Ganti,Mudhakar Srivatsa,Nam Sung Kim,Josep Torrellas,Jian Huang,Seetharami Seelam,Klara Nahrstedt,Tarek Abdelzaher,Tamar Eilam,Huimin Zhao,Matteo Manica,Ravishankar Iyer,Martin Hirzel,Vikram Adve,Darko Marinov,Hubertus Franke,Hanghang Tong,Elizabeth Ainsworth,Han Zhao,Deepak Vasisht,Minh Do,Fabio Oliveira,Giovanni Pacifici,Ruchir Puri,Priya Nagpurkar
Keywords-EN: full-stack co-design approaches, IIDAI Institute, envisions transforming hybrid, collaboration between IBM, UIUC researchers
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments: 70 pages, 27 figures

Click to view abstract

Abstract:This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, full-stack co-design approaches, emphasizing usability, manageability, affordability, adaptability, efficiency, and scalability. By integrating cutting-edge technologies such as generative and agentic AI, cross-layer automation and optimization, unified control plane, and composable and adaptive system architecture, the proposed framework addresses critical challenges in energy efficiency, performance, and cost-effectiveness. Incorporating quantum computing as it matures will enable quantum-accelerated simulations for materials science, climate modeling, and other high-impact domains. Collaborative efforts between academia and industry are central to this vision, driving advancements in foundation models for material design and climate solutions, scalable multimodal data processing, and enhanced physics-based AI emulators for applications like weather forecasting and carbon sequestration. Research priorities include advancing AI agentic systems, LLM as an Abstraction (LLMaaA), AI model optimization and unified abstractions across heterogeneous infrastructure, end-to-end edge-cloud transformation, efficient programming model, middleware and platform, secure infrastructure, application-adaptive cloud systems, and new quantum-classical collaborative workflows. These ideas and solutions encompass both theoretical and practical research questions, requiring coordinated input and support from the research community. This joint initiative aims to establish hybrid clouds as secure, efficient, and sustainable platforms, fostering breakthroughs in AI-driven applications and scientific discovery across academia, industry, and society.

[AI-21] Existential Conversations with Large Language Models: Content, Community, and Culture

Link: https://arxiv.org/abs/2411.13223
Authors: Murray Shanahan,Beth Singler
Keywords-EN: including philosophy, large language models, variety of topics, conversational AI systems, systems based
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Contemporary conversational AI systems based on large language models (LLMs) can engage users on a wide variety of topics, including philosophy, spirituality, and religion. Suitably prompted, LLMs can be coaxed into discussing such existentially significant matters as their own putative consciousness and the role of artificial intelligence in the fate of the Cosmos. Here we examine two lengthy conversations of this type. We trace likely sources, both ancient and modern, for the extensive repertoire of images, myths, metaphors, and conceptual esoterica that the language model draws on during these conversations, and foreground the contemporary communities and cultural movements that deploy related motifs, especially in their online activity. Finally, we consider the larger societal impacts of such engagements with LLMs.

[AI-22] Proceedings Sixth International Workshop on Formal Methods for Autonomous Systems

Link: https://arxiv.org/abs/2411.13215
Authors: Matt Luckcuck(University of Nottingham, UK),Mengwei Xu(University of Newcastle, UK)
Keywords-EN: Sixth International Workshop, Autonomous Systems, integrated Formal Methods, Formal Methods, Core Technology Facility
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:This EPTCS volume contains the papers from the Sixth International Workshop on Formal Methods for Autonomous Systems (FMAS 2024), which was held between the 11th and 13th of November 2024. FMAS 2024 was co-located with 19th International Conference on integrated Formal Methods (iFM’24), hosted by the University of Manchester in the United Kingdom, in the University of Manchester’s Core Technology Facility.

[AI-23] Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis

Link: https://arxiv.org/abs/2411.13209
Authors: Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler, Pål Halvorsen
Keywords: Audio Feature Extraction, Feature Extraction, Audio Feature, challenges in Audio, focusing on overcoming
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 6 figures, 3 tables. Submitted to the MDPI journal Big Data and Cognitive Computing

Abstract:This paper examines the integration of real-time talking-head generation for interviewer training, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with Open AI’s Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. These advancements make the system a more effective tool for immersive, interactive training applications, expanding the potential of AI-driven avatars in interviewer training.

[AI-24] The Information Security Awareness of Large Language Models

Link: https://arxiv.org/abs/2411.13207
Authors: Ofir Cohen, Gil Ari Agmon, Asaf Shabtai, Rami Puzis
Keywords: large language models, continues to increase, assisting people, aspects of life, popularity of large
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The popularity of large language models (LLMs) continues to increase, and LLM-based assistants have become ubiquitous, assisting people of diverse backgrounds in many aspects of life. Significant resources have been invested in the safety of LLMs and their alignment with social norms. However, research examining their behavior from the information security awareness (ISA) perspective is lacking. Chatbots and LLM-based assistants may put unwitting users in harm’s way by facilitating unsafe behavior. We observe that the ISA inherent in some of today’s most popular LLMs varies significantly, with most models requiring user prompts with a clear security context to utilize their security knowledge and provide safe responses to users. Based on this observation, we created a comprehensive set of 30 scenarios to assess the ISA of LLMs. These scenarios benchmark the evaluated models with respect to all focus areas defined in a mobile ISA taxonomy. Among our findings is that ISA is mildly affected by changing the model’s temperature, whereas adjusting the system prompt can substantially impact it. This underscores the necessity of setting the right system prompt to mitigate ISA weaknesses. Our findings also highlight the importance of ISA assessment for the development of future LLM-based assistants.

[AI-25] Engagement-Driven Content Generation with Large Language Models

Link: https://arxiv.org/abs/2411.13187
Authors: Erica Coppolillo, Marco Minici, Federico Cinus, Francesco Bonchi, Giuseppe Manco
Keywords: Large Language Models, Large Language, exhibit significant persuasion, networks remains underexplored, significant persuasion capabilities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) exhibit significant persuasion capabilities in one-on-one interactions, but their influence within social networks remains underexplored. This study investigates the potential social impact of LLMs in these environments, where interconnected users and complex opinion dynamics pose unique challenges. In particular, we address the following research question: can LLMs learn to generate meaningful content that maximizes user engagement on social networks? To answer this question, we define a pipeline to guide the LLM-based content generation which employs reinforcement learning with simulated feedback. In our framework, the reward is based on an engagement model borrowed from the literature on opinion dynamics and information propagation. Moreover, we force the text generated by the LLM to be aligned with a given topic and to satisfy a minimum fluency requirement. Using our framework, we analyze the capabilities and limitations of LLMs in tackling the given task, specifically considering the relative positions of the LLM as an agent within the social network and the distribution of opinions in the network on the given topic. Our findings show the full potential of LLMs in creating social engagement. Notable properties of our approach are that the learning procedure is adaptive to the opinion distribution of the underlying network and agnostic to the specifics of the engagement model, which is embedded as a plug-and-play component. In this regard, our approach can be easily refined for more complex engagement tasks and interventions in computational social science. The code used for the experiments is publicly available at this https URL. 

[AI-26] Cross-Camera Distracted Driver Classification through Feature Disentanglement and Contrastive Learning

Link: https://arxiv.org/abs/2411.13181
Authors: Simone Bianco, Luigi Celona, Paolo Napoletano
Keywords: ensuring safe driving, safe driving, classification of distracted, pivotal for ensuring, ensuring safe
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:The classification of distracted drivers is pivotal for ensuring safe driving. Previous studies demonstrated the effectiveness of neural networks in automatically predicting driver distraction, fatigue, and potential hazards. However, recent research has uncovered a significant loss of accuracy in these models when applied to samples acquired under conditions that differ from the training data. In this paper, we introduce a robust model designed to withstand changes in camera position within the vehicle. Our Driver Behavior Monitoring Network (DBMNet) relies on a lightweight backbone and integrates a disentanglement module to discard camera view information from features, coupled with contrastive learning to enhance the encoding of various driver actions. Experiments conducted on the daytime and nighttime subsets of the 100-Driver dataset validate the effectiveness of our approach with an increment on average of 9% in Top-1 accuracy in comparison with the state of the art. In addition, cross-dataset and cross-camera experiments conducted on three benchmark datasets, namely AUCDD-V1, EZZ2021 and SFD, demonstrate the superior generalization capability of the proposed method.

[AI-27] Writing Style Matters: An Examination of Bias and Fairness in Information Retrieval Systems WSDM25

Link: https://arxiv.org/abs/2411.13173
Authors: Hongliu Cao
Keywords: Language Model technologies, advancement of Language, embedding models, text embedding models, Language Model
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM 25)

Abstract:The rapid advancement of Language Model technologies has opened new opportunities, but also introduced new challenges related to bias and fairness. This paper explores the uncharted territory of potential biases in state-of-the-art universal text embedding models towards specific document and query writing styles within Information Retrieval (IR) systems. Our investigation reveals that different embedding models exhibit different preferences of document writing style, while more informal and emotive styles are less favored by most embedding models. In terms of query writing styles, many embedding models tend to match the style of the query with the style of the retrieved documents, but some show a consistent preference for specific styles. Text embedding models fine-tuned on synthetic data generated by LLMs display a consistent preference for certain style of generated data. These biases in text embedding based IR systems can inadvertently silence or marginalize certain communication styles, thereby posing a significant threat to fairness in information retrieval. Finally, we also compare the answer styles of Retrieval Augmented Generation (RAG) systems based on different LLMs and find out that most text embedding models are biased towards LLM’s answer styles when used as evaluation metrics for answer correctness. This study sheds light on the critical issue of writing style based bias in IR systems, offering valuable insights for the development of more fair and robust models.

[AI-28] DMQR-RAG: Diverse Multi-Query Rewriting for RAG

Link: https://arxiv.org/abs/2411.13154
Authors: Zhicong Li, Jiahao Wang, Zhishu Jiang, Hangyu Mao, Zhongxia Chen, Jiazhen Du, Yuanxing Zhang, Fuzheng Zhang, Di Zhang, Yong Liu
Keywords: Large language models, Large language, knowledge and hallucinations, undermine their reliability, language models
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Large language models often encounter challenges with static knowledge and hallucinations, which undermine their reliability. Retrieval-augmented generation (RAG) mitigates these issues by incorporating external information. However, user queries frequently contain noise and intent deviations, necessitating query rewriting to improve the relevance of retrieved documents. In this paper, we introduce DMQR-RAG, a Diverse Multi-Query Rewriting framework designed to improve the performance of both document retrieval and final responses in RAG. Specifically, we investigate how queries with varying information quantities can retrieve a diverse array of documents, presenting four rewriting strategies that operate at different levels of information to enhance the performance of baseline approaches. Additionally, we propose an adaptive strategy selection method that minimizes the number of rewrites while optimizing overall performance. Our methods have been rigorously validated through extensive experiments conducted in both academic and industry settings.

[AI-29] AGLP: A Graph Learning Perspective for Semi-supervised Domain Adaptation

Link: https://arxiv.org/abs/2411.13152
Authors: Houcheng Su, Mengzhu Wang, Jiao Li, Nan Yin, Li Shen
Keywords: partially labeled target, leverage partially labeled, labeled target domain, target domain data, labeled source domain
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:In semi-supervised domain adaptation (SSDA), the model aims to leverage partially labeled target domain data along with a large amount of labeled source domain data to enhance its generalization capability for the target domain. A key advantage of SSDA is its ability to significantly reduce reliance on labeled data, thereby lowering the costs and time associated with data preparation. Most existing SSDA methods utilize information from domain labels and class labels but overlook the structural information of the data. To address this issue, this paper proposes a graph learning perspective (AGLP) for semi-supervised domain adaptation. We apply the graph convolutional network to the instance graph which allows structural information to propagate along the weighted graph edges. The proposed AGLP model has several advantages. First, to the best of our knowledge, this is the first work to model structural information in SSDA. Second, the proposed model can effectively learn domain-invariant and semantic representations, reducing domain discrepancies in SSDA. Extensive experimental results on multiple standard benchmarks demonstrate that the proposed AGLP algorithm outperforms state-of-the-art semi-supervised domain adaptation methods.
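As a rough illustration of the graph-convolution step this abstract refers to (structural information propagating along weighted edges of an instance graph), the sketch below applies the standard propagation rule ReLU(Â H W) from the general GCN literature. It is not the AGLP authors' implementation: the function names, the symmetric normalization with self-loops, and the toy inputs are assumptions made purely for illustration.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a weighted, symmetric adjacency matrix."""
    n = len(adj)
    # Add self-loops so each instance keeps part of its own feature vector.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, features, weight):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    propagated = matmul(normalize_adjacency(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weight)]


# Toy instance graph (hypothetical): two samples joined by an edge of weight 1.
toy_adj = [[0.0, 1.0], [1.0, 0.0]]
toy_features = [[2.0], [4.0]]
toy_weight = [[1.0]]
toy_output = gcn_layer(toy_adj, toy_features, toy_weight)
```

With an edge between the two samples, each output row becomes a normalized mix of both feature vectors; with no edges, the layer reduces to a per-sample linear map followed by ReLU.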

[AI-30] YCB-LUMA: YCB Object Dataset with Luminance Keying for Object Localization

Link: https://arxiv.org/abs/2411.13149
Authors: Thomas Pöllabauer
Keywords: Localizing target objects, Localizing target, computer vision, Localizing, important task
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Localizing target objects in images is an important task in computer vision. Often it is the first step towards solving a variety of applications in autonomous driving, maintenance, quality insurance, robotics, and augmented reality. Best in class solutions for this task rely on deep neural networks, which require a set of representative training data for best performance. Creating sets of sufficient quality, variety, and size is often difficult, error prone, and expensive. This is where the method of luminance keying can help: it provides a simple yet effective solution to record high quality data for training object detection and segmentation. We extend previous work that presented luminance keying on the common YCB-V set of household objects by recording the remaining objects of the YCB superset. The additional variety of objects - addition of transparency, multiple color variations, non-rigid objects - further demonstrates the usefulness of luminance keying and might be used to test the applicability of the approach on new 2D object detection and segmentation algorithms.

[AI-31] GraphCL: Graph-based Clustering for Semi-Supervised Medical Image Segmentation

Link: https://arxiv.org/abs/2411.13147
Authors: Mengzhu Wang, Jiao Li, Houcheng Su, Nan Yin, Shen Li
Keywords: made notable advancements, semi-supervised medical image, medical image segmentation, data utilization efficiency, limited labeled data
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Semi-supervised learning (SSL) has made notable advancements in medical image segmentation (MIS), particularly in scenarios with limited labeled data and significantly enhancing data utilization efficiency. Previous methods primarily focus on complex training strategies to utilize unlabeled data but neglect the importance of graph structural information. Different from existing methods, we propose a graph-based clustering for semi-supervised medical image segmentation (GraphCL) by jointly modeling graph data structure in a unified deep model. The proposed GraphCL model enjoys several advantages. Firstly, to the best of our knowledge, this is the first work to model the data structure information for semi-supervised medical image segmentation (SSMIS). Secondly, to get the clustered features across different graphs, we integrate both pairwise affinities between local image features and raw features as inputs. Extensive experimental results on three standard benchmarks show that the proposed GraphCL algorithm outperforms state-of-the-art semi-supervised medical image segmentation methods.

[AI-32] CopyrightMeter: Revisiting Copyright Protection in Text-to-image Models

Link: https://arxiv.org/abs/2411.13144
Authors: Naen Xu, Changjiang Li, Tianyu Du, Minxi Li, Wenjie Luo, Jiacheng Liang, Yuyuan Li, Xuhong Zhang, Meng Han, Jianwei Yin, Ting Wang
Keywords: generating high-quality images, textual descriptions, emerged as powerful, powerful tools, tools for generating
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-to-image diffusion models have emerged as powerful tools for generating high-quality images from textual descriptions. However, their increasing popularity has raised significant copyright concerns, as these models can be misused to reproduce copyrighted content without authorization. In response, recent studies have proposed various copyright protection methods, including adversarial perturbation, concept erasure, and watermarking techniques. However, their effectiveness and robustness against advanced attacks remain largely unexplored. Moreover, the lack of unified evaluation frameworks has hindered systematic comparison and fair assessment of different approaches. To bridge this gap, we systematize existing copyright protection methods and attacks, providing a unified taxonomy of their design spaces. We then develop CopyrightMeter, a unified evaluation framework that incorporates 17 state-of-the-art protections and 16 representative attacks. Leveraging CopyrightMeter, we comprehensively evaluate protection methods across multiple dimensions, thereby uncovering how different design choices impact fidelity, efficacy, and resilience under attacks. Our analysis reveals several key findings: (i) most protections (16/17) are not resilient against attacks; (ii) the “best” protection varies depending on the target priority; (iii) more advanced attacks significantly promote the upgrading of protections. These insights provide concrete guidance for developing more robust protection methods, while its unified evaluation protocol establishes a standard benchmark for future copyright protection research in text-to-image generation.

[AI-33] Provably Efficient Action-Manipulation Attack Against Continuous Reinforcement Learning

Link: https://arxiv.org/abs/2411.13116
Authors: Zhi Luo, Xiyuan Yang, Pan Zhou, Di Wang
Keywords: Manipulating the interaction, exposing the potential, reinforcement learning, environment can control, potential vulnerabilities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Manipulating the interaction trajectories between the intelligent agent and the environment can control the agent’s training and behavior, exposing the potential vulnerabilities of reinforcement learning (RL). For example, in Cyber-Physical Systems (CPS) controlled by RL, the attacker can manipulate the actions of the adopted RL to other actions during the training phase, which will lead to bad consequences. Existing work has studied action-manipulation attacks in tabular settings, where the states and actions are discrete. As seen in many up-and-coming RL applications, such as autonomous driving, continuous action space is widely accepted, however, its action-manipulation attacks have not been thoroughly investigated yet. In this paper, we consider this crucial problem in both white-box and black-box scenarios. Specifically, utilizing the knowledge derived exclusively from trajectories, we propose a black-box attack algorithm named LCBT, which uses the Monte Carlo tree search method for efficient action searching and manipulation. Additionally, we demonstrate that for an agent whose dynamic regret is sub-linearly related to the total number of steps, LCBT can teach the agent to converge to target policies with only sublinear attack cost, i.e., $O\left(\mathcal{R}(T) + MH^3K^E\log (MT)\right)$ with $0<E<1$, where $H$ is the number of steps per episode, $K$ is the total number of episodes, $T=KH$ is the total number of steps, $M$ is the number of subspaces divided in the state space, and $\mathcal{R}(T)$ is the bound of the RL algorithm’s regret. We conduct our proposed attack methods on three aggressive algorithms: DDPG, PPO, and TD3 in continuous settings, which show a promising attack performance.

[AI-34] Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

Link: https://arxiv.org/abs/2411.13093
Authors: Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, Rongrong Ji
Keywords: Existing large video-language, large video-language models, struggle to comprehend, limited context, large video-language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures

Abstract:Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to limited context. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs would require extensive high-quality data and substantial GPU resources, while GPT-based agents would rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio, optical character, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight with low computing overhead due to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.

[AI-35] Neural Internal Model Control: Learning a Robust Control Policy via Predictive Error Feedback

Link: https://arxiv.org/abs/2411.13079
Authors: Feng Gao, Chao Yu, Yu Wang, Yi Wu
Keywords: Accurate motion control, Accurate motion, complex environments remains, challenge in robotics, environments remains
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Submitted to RAL

Abstract:Accurate motion control in the face of disturbances within complex environments remains a major challenge in robotics. Classical model-based approaches often struggle with nonlinearities and unstructured disturbances, while RL-based methods can be fragile when encountering unseen scenarios. In this paper, we propose a novel framework, Neural Internal Model Control, which integrates model-based control with RL-based control to enhance robustness. Our framework streamlines the predictive model by applying Newton-Euler equations for rigid-body dynamics, eliminating the need to capture complex high-dimensional nonlinearities. This internal model combines model-free RL algorithms with predictive error feedback. Such a design enables a closed-loop control structure to enhance the robustness and generalizability of the control system. We demonstrate the effectiveness of our framework on both quadrotors and quadrupedal robots, achieving superior performance compared to state-of-the-art methods. Furthermore, real-world deployment on a quadrotor with rope-suspended payloads highlights the framework’s robustness in sim-to-real transfer. Our code is released at this https URL.

[AI-36] AMaze: An intuitive benchmark generator for fast prototyping of generalizable agents

Link: https://arxiv.org/abs/2411.13072
Authors: Kevin Godin-Dubois, Karine Miras, Anna V. Kononova
Keywords: Traditional approaches, involved a single, computer vision, generally involved, minimal complexity
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Under review in Frontiers in Artificial Intelligence

Abstract:Traditional approaches to training agents have generally involved a single, deterministic environment of minimal complexity to solve various tasks such as robot locomotion or computer vision. However, agents trained in static environments lack generalization capabilities, limiting their potential in broader scenarios. Thus, recent benchmarks frequently rely on multiple environments, for instance, by providing stochastic noise, simple permutations, or altogether different settings. In practice, such collections result mainly from costly human-designed processes or the liberal use of random number generators. In this work, we introduce AMaze, a novel benchmark generator in which embodied agents must navigate a maze by interpreting visual signs of arbitrary complexities and deceptiveness. This generator promotes human interaction through the easy generation of feature-specific mazes and an intuitive understanding of the resulting agents’ strategies. As a proof-of-concept, we demonstrate the capabilities of the generator in a simple, fully discrete case with limited deceptiveness. Agents were trained under three different regimes (one-shot, scaffolding, interactive), and the results showed that the latter two cases outperform direct training in terms of generalization capabilities. Indeed, depending on the combination of generalization metric, training regime, and algorithm, the median gain ranged from 50% to 100% and maximal performance was achieved through interactive training, thereby demonstrating the benefits of a controllable human-in-the-loop benchmark generator.

[AI-37] Branches Assemble! Multi-Branch Cooperation Network for Large-Scale Click-Through Rate Prediction at Taobao

Link: https://arxiv.org/abs/2411.13057
Authors: Xu Chen, Zida Cheng, Yuangang Pan, Shuai Xiao, Xiaoming Liu, Jinsong Lan, Qingwen Liu, Ivor W. Tsang
Keywords: Existing click-through rate, Existing click-through, click-through rate, feature, studied the role
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:Existing click-through rate (CTR) prediction works have studied the role of feature interaction through a variety of techniques. Each interaction technique exhibits its own strength, and solely using one type could constrain the model’s capability to capture the complex feature relationships, especially for industrial large-scale data with enormous users and items. Recent research shows that effective CTR models often combine an MLP network with a dedicated feature interaction network in a two-parallel structure. However, the interplay and cooperative dynamics between different streams or branches remain under-researched. In this work, we introduce a novel Multi-Branch Cooperation Network (MBCnet) which enables multiple branch networks to collaborate with each other for better complex feature interaction modeling. Specifically, MBCnet consists of three branches: the Expert-based Feature Grouping and Crossing (EFGC) branch that promotes the model’s memorization ability of specific feature fields, the low rank Cross Net branch and Deep branch to enhance both explicit and implicit feature crossing for improved generalization. Among branches, a novel cooperation scheme is proposed based on two principles: branch co-teaching and moderate differentiation. Branch co-teaching encourages well-learned branches to support poorly-learned ones on specific training samples. Moderate differentiation advocates branches to maintain a reasonable level of difference in their feature representations. The cooperation strategy improves learning through mutual knowledge sharing via co-teaching and boosts the discovery of diverse feature interactions across branches. Extensive experiments on large-scale industrial datasets and online A/B test demonstrate MBCnet’s superior performance, delivering a 0.09 point increase in CTR, 1.49% growth in deals, and 1.62% rise in GMV. Core codes will be released soon.

[AI-38] MEGL: Multimodal Explanation-Guided Learning

Link: https://arxiv.org/abs/2411.13053
Authors: Yifei Zhang, Tianxu Jiang, Bo Pan, Jingyu Wang, Guangji Bai, Liang Zhao
Keywords: Artificial Intelligence, Explaining the decision-making, processes of Artificial, visual, Visual explanations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Explaining the decision-making processes of Artificial Intelligence (AI) models is crucial for addressing their “black box” nature, particularly in tasks like image classification. Traditional eXplainable AI (XAI) methods typically rely on unimodal explanations, either visual or textual, each with inherent limitations. Visual explanations highlight key regions but often lack rationale, while textual explanations provide context without spatial grounding. Further, both explanation types can be inconsistent or incomplete, limiting their reliability. To address these challenges, we propose a novel Multimodal Explanation-Guided Learning (MEGL) framework that leverages both visual and textual explanations to enhance model interpretability and improve classification performance. Our Saliency-Driven Textual Grounding (SDTG) approach integrates spatial information from visual explanations into textual rationales, providing spatially grounded and contextually rich explanations. Additionally, we introduce Textual Supervision on Visual Explanations to align visual explanations with textual rationales, even in cases where ground truth visual annotations are missing. A Visual Explanation Distribution Consistency loss further reinforces visual coherence by aligning the generated visual explanations with dataset-level patterns, enabling the model to effectively learn from incomplete multimodal supervision. We validate MEGL on two new datasets, Object-ME and Action-ME, for image classification with multimodal explanations. Experimental results demonstrate that MEGL outperforms previous approaches in prediction accuracy and explanation quality across both visual and textual domains. Our code will be made available upon the acceptance of the paper.

[AI-39] Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization NEURIPS2024

Link: https://arxiv.org/abs/2411.13036
Authors: Sanghyeob Song, Jaihyun Lew, Hyemi Jang, Sungroh Yoon
Keywords: high-level vision tasks, crucial for mid, vision tasks, stitching and fusion, high-level vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper is accepted to the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:Estimating the homography between two images is crucial for mid- or high-level vision tasks, such as image stitching and fusion. However, using supervised learning methods is often challenging or costly due to the difficulty of collecting ground-truth data. In response, unsupervised learning approaches have emerged. Most early methods, though, assume that the given image pairs are from the same camera or have minor lighting differences. Consequently, while these methods perform effectively under such conditions, they generally fail when input image pairs come from different domains, referred to as multimodal image pairs. To address these limitations, we propose AltO, an unsupervised learning framework for estimating homography in multimodal image pairs. Our method employs a two-phase alternating optimization framework, similar to Expectation-Maximization (EM), where one phase reduces the geometry gap and the other addresses the modality gap. To handle these gaps, we use Barlow Twins loss for the modality gap and propose an extended version, Geometry Barlow Twins, for the geometry gap. As a result, we demonstrate that our method, AltO, can be trained on multimodal datasets without any ground-truth data. It not only outperforms other unsupervised methods but is also compatible with various architectures of homography estimators. The source code can be found at this https URL.
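For readers unfamiliar with the Barlow Twins loss named in this abstract, here is a minimal sketch of the standard objective as it is commonly described: push the cross-correlation matrix of two batches of embeddings toward the identity. It does not reproduce the paper's Geometry Barlow Twins extension or the authors' code; the function names, the default weight `lam`, and the toy inputs are assumptions for illustration only.

```python
import math


def standardize(batch):
    """Normalize each feature to zero mean and unit variance over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        # Fall back to 1.0 for a constant feature to avoid division by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Sum of (1 - C_ii)^2 on the diagonal plus lam * C_ij^2 off the diagonal."""
    za, zb = standardize(z_a), standardize(z_b)
    n, d = len(za), len(za[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Cross-correlation between feature i of view A and feature j of view B.
            c_ij = sum(za[k][i] * zb[k][j] for k in range(n)) / n
            loss += (1.0 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss


# Hypothetical one-dimensional embeddings of three samples under two views.
view_a = [[1.0], [2.0], [3.0]]
view_b = [[1.0], [2.0], [3.0]]
loss_same = barlow_twins_loss(view_a, view_b)
```

Identical views give a loss near zero, while embeddings that disagree across views are penalized through the diagonal term.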

[AI-40] “It was 80% me 20% AI”: Seeking Authenticity in Co-Writing with Large Language Models

Link: https://arxiv.org/abs/2411.13032
Authors: Angel Hsing-Chi Hwang, Q. Vera Liao, Su Lin Blodgett, Alexandra Olteanu, Adam Trischler
Keywords: large language models, language models, rising proliferation, proliferation and diversity, powered by large
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Given the rising proliferation and diversity of AI writing assistance tools, especially those powered by large language models (LLMs), both writers and readers may have concerns about the impact of these tools on the authenticity of writing work. We examine whether and how writers want to preserve their authentic voice when co-writing with AI tools and whether personalization of AI writing support could help achieve this goal. We conducted semi-structured interviews with 19 professional writers, during which they co-wrote with both personalized and non-personalized AI writing-support tools. We supplemented writers’ perspectives with opinions from 30 avid readers about the written work co-produced with AI collected through an online survey. Our findings illuminate conceptions of authenticity in human-AI co-creation, which focus more on the process and experience of constructing creators’ authentic selves. While writers reacted positively to personalized AI writing tools, they believed the form of personalization needs to target writers’ growth and go beyond the phase of text production. Overall, readers’ responses showed less concern about human-AI co-writing. Readers could not distinguish AI-assisted work, personalized or not, from writers’ solo-written work and showed positive attitudes toward writers experimenting with new technology for creative writing.

[AI-41] Evaluating LLMs Capabilities Towards Understanding Social Dynamics

Link: https://arxiv.org/abs/2411.13008
Authors: Anique Tahir,Lu Cheng,Manuel Sandoval,Yasin N. Silva,Deborah L. Hall,Huan Liu
Keywords: discourse involves people, media discourse involves, involves people, Social, discourse involves
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: To appear in ASONAM 24 proceedings

Abstract:Social media discourse involves people from different backgrounds, beliefs, and motives. Thus, such discourse can often devolve into toxic interactions. Generative models, such as Llama and ChatGPT, have recently exploded in popularity due to their capabilities in zero-shot question-answering. Because these models are increasingly being used to ask questions of social significance, a crucial research question is whether they can understand social media dynamics. This work provides a critical analysis of generative LLMs’ ability to understand language and dynamics in social contexts, particularly considering cyberbullying and anti-cyberbullying (posts aimed at reducing cyberbullying) interactions. Specifically, we compare and contrast the capabilities of different large language models (LLMs) to understand three key aspects of social dynamics: language, directionality, and the occurrence of bullying/anti-bullying messages. We found that while fine-tuned LLMs exhibit promising results in some social media understanding tasks (understanding directionality), they presented mixed results in others (proper paraphrasing and bullying/anti-bullying detection). We also found that fine-tuning and prompt engineering mechanisms can have positive effects in some tasks. We believe that an understanding of LLMs’ capabilities is crucial to design future models that can be effectively used in social applications.

[AI-42] BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices NEURIPS2024

Link: https://arxiv.org/abs/2411.12990
Authors: Anka Reuel,Amelia Hardy,Chandler Smith,Max Lamparth,Malcolm Hardy,Mykel J. Kochenderfer
Keywords: high-stakes environments, capabilities and risks, increasingly prevalent, prevalent in high-stakes, Benchmarks
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Accepted as a Spotlight Poster to NeurIPS 2024

Abstract:AI models are increasingly prevalent in high-stakes environments, necessitating thorough assessment of their capabilities and risks. Benchmarks are popular for measuring these attributes and for comparing model performance, tracking progress, and identifying weaknesses in foundation and non-foundation models. They can inform model selection for downstream tasks and influence policy initiatives. However, not all benchmarks are the same: their quality depends on their design and usability. In this paper, we develop an assessment framework considering 46 best practices across an AI benchmark’s lifecycle and evaluate 24 AI benchmarks against it. We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues. We further find that most benchmarks do not report statistical significance of their results nor allow for their results to be easily replicated. To support benchmark developers in aligning with best practices, we provide a checklist for minimum quality assurance based on our assessment. We also develop a living repository of benchmark assessments to support benchmark comparability, accessible at this http URL.

[AI-43] LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement

Link: https://arxiv.org/abs/2411.12980
Authors: Siwen Jiao,Yangyi Fang
Keywords: Visual Language Models, Language Models, enabling natural human-vehicle, Recent advancements, visual question answering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Recent advancements in Visual Language Models (VLMs) have made them crucial for visual question answering (VQA) in autonomous driving, enabling natural human-vehicle interactions. However, existing methods often struggle in dynamic driving environments, as they usually focus on static images or videos and rely on downsampling to manage computational costs. This results in the loss of critical details and the difficulty in effectively integrating spatial and temporal information, undermining fine-grained perception and temporal coherence essential for effective decision-making. To tackle these challenges, we introduce LaVida Drive, a novel and efficient VQA framework for autonomous driving. LaVida Drive seamlessly integrates temporal data while maintaining high-resolution inputs for detailed visual perception. It optimizes spatial processing by retaining high-resolution data for intricate details and using lower-resolution inputs for temporal analysis to focus on motion-related features, thereby boosting computational efficiency. The core of LaVida Drive consists of two modules: the Query-aware Token Selection module and the Spatial-Temporal Token Recovery and Enhancement module. The former dynamically selects the most relevant visual tokens based on semantic alignment with the input query, reducing the token count from high-resolution spatial input. The latter ensures smooth and coherent interactions between spatial and temporal information, preserving contextual continuity across frames. Extensive experiments on various autonomous driving question-answering benchmarks show that LaVida Drive significantly reduces visual tokens, enhances efficiency, and improves overall performance.
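
The general idea of query-aware token selection can be illustrated with a small sketch: score each visual token against the query embedding and keep only the top-k. The abstract does not give the scoring function, so cosine similarity, the function names, and the toy vectors below are all assumptions for illustration rather than the authors' module.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_tokens(visual_tokens, query, k):
    """Return indices of the k tokens most similar to the query, in original order."""
    ranked = sorted(
        range(len(visual_tokens)),
        key=lambda i: cosine(visual_tokens[i], query),
        reverse=True,
    )
    # Re-sort the kept indices so the spatial order of the tokens is preserved.
    return sorted(ranked[:k])


# Hypothetical 2-D token embeddings and query embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]
kept = select_tokens(tokens, query, k=2)
```

Here the two tokens best aligned with the query are kept and the rest are dropped, which is how such a step reduces the token count passed to later stages.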

[AI-44] Shrinking POMCP: A Framework for Real-Time UAV Search and Rescue

Link: https://arxiv.org/abs/2411.12967
Authors: Yunuo Zhang,Baiting Luo,Ayan Mukhopadhyay,Daniel Stojcsics,Daniel Elenius,Anirban Roy,Susmit Jha,Miklos Maroti,Xenofon Koutsoukos,Gabor Karsai,Abhishek Dubey
Keywords: Efficient path optimization, including limited visibility, complex information gathering, operations faces challenges, rescue operations faces
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments: Accepted to the 3rd International Conference on Assured Autonomy

Abstract:Efficient path optimization for drones in search and rescue operations faces challenges, including limited visibility, time constraints, and complex information gathering in urban environments. We present a comprehensive approach to optimize UAV-based search and rescue operations in neighborhood areas, utilizing both a 3D AirSim-ROS2 simulator and a 2D simulator. The path planning problem is formulated as a partially observable Markov decision process (POMDP), and we propose a novel “Shrinking POMCP” approach to address time constraints. In the AirSim environment, we integrate our approach with a probabilistic world model for belief maintenance and a neurosymbolic navigator for obstacle avoidance. The 2D simulator employs surrogate ROS2 nodes with equivalent functionality. We compare trajectories generated by different approaches in the 2D simulator and evaluate performance across various belief types in the 3D AirSim-ROS2 simulator. Experimental results from both simulators demonstrate that our proposed Shrinking POMCP solution achieves significant improvements in search times compared to alternative methods, showcasing its potential for enhancing the efficiency of UAV-assisted search and rescue operations.

[AI-45] Real-Time Energy-Optimal Path Planning for Electric Vehicles

Link: https://arxiv.org/abs/2411.12964
Authors: Saman Ahmadi,Guido Tack,Daniel Harabor,Philip Kilby,Mahdi Jalili
Keywords: made energy-aware routing, modern transport systems, successful integration, rapid adoption, adoption of electric
Categories: Artificial Intelligence (cs.AI)
*Comments: 12 pages, 7 figures, 5 tables

Abstract:The rapid adoption of electric vehicles (EVs) in modern transport systems has made energy-aware routing a critical task in their successful integration, especially within large-scale networks. In cases where an EV’s remaining energy is limited and charging locations are not easily accessible, some destinations may only be reachable through an energy-optimal path: a route that consumes less energy than all other alternatives. The feasibility of such energy-efficient paths depends heavily on the accuracy of the energy model used for planning, and thus failing to account for vehicle dynamics can lead to inaccurate energy estimates, rendering some planned routes infeasible in reality. This paper explores the impact of vehicle dynamics on energy-optimal path planning for EVs. We develop an accurate energy model that incorporates key vehicle dynamics parameters into energy calculations, thereby reducing the risk of planning infeasible paths under battery constraints. The paper also introduces two novel online reweighting functions that allow for a faster, pre-processing free, pathfinding in the presence of negative energy costs resulting from regenerative braking, making them ideal for real-time applications. Through extensive experimentation on real-world transport networks, we demonstrate that our approach considerably enhances energy-optimal pathfinding for EVs in both computational efficiency and energy estimation accuracy.
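
To make the "negative energy costs" issue concrete: Dijkstra's algorithm requires non-negative edge weights, so edges with energy recovered by regenerative braking must be reweighted first. The abstract does not specify the paper's two reweighting functions, so the sketch below uses the classic node-potential technique (Johnson-style reweighting) as a stand-in. The graph, the potentials, and all names are hypothetical.

```python
import heapq


def dijkstra_with_potentials(graph, potential, source, target):
    """Energy-optimal cost from source to target, or None if unreachable.

    graph: {u: [(v, energy_cost), ...]}, where energy_cost may be negative.
    potential: {node: h(node)}; each edge is searched with the reduced cost
    cost + h(u) - h(v), which must be non-negative for Dijkstra to be valid.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            # Undo the potential shift to recover the true energy of the path.
            return d - potential[source] + potential[target]
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph.get(u, []):
            reduced = cost + potential[u] - potential[v]
            if reduced < 0:
                raise ValueError("potential does not remove negative costs")
            nd = d + reduced
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None


# Hypothetical road graph: A -> B is downhill and recovers 3 units of energy.
graph = {"A": [("B", -3.0), ("C", 4.0)], "B": [("C", 6.0)]}
# Assumed potentials, e.g. proportional to elevation (A=10, B=0, C=5, factor 0.5).
potential = {"A": 5.0, "B": 0.0, "C": 2.5}

best_energy = dijkstra_with_potentials(graph, potential, "A", "C")
```

Because the potential terms telescope along any path, the reweighting changes every A-to-C path cost by the same constant, so the optimal path is preserved and its true energy is recovered at the end.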

[AI-46] KAAE: Numerical Reasoning for Knowledge Graphs via Knowledge-aware Attributes Learning

Link: https://arxiv.org/abs/2411.12950
Authors: Ming Yin,Qiang Zhou,Zongsheng Cao,Mei Li
Keywords: artificial intelligence applications, natural language processing, Nile is longer, Numerical reasoning, intelligence applications
Categories: Artificial Intelligence (cs.AI)
*Comments:

Abstract:Numerical reasoning is pivotal in various artificial intelligence applications, such as natural language processing and recommender systems, where it involves using entities, relations, and attribute values (e.g., weight, length) to infer new factual relations (e.g., the Nile is longer than the Amazon). However, existing approaches encounter two critical challenges in modeling: (1) semantic relevance: the challenge of insufficiently capturing the necessary contextual interactions among entities, relations, and numerical attributes, often resulting in suboptimal inference; and (2) semantic ambiguity: the difficulty in accurately distinguishing ordinal relationships during numerical reasoning, which compromises the generation of high-quality samples and limits the effectiveness of contrastive learning. To address these challenges, we propose the novel Knowledge-Aware Attributes Embedding model (KAAE) for knowledge graph embeddings in numerical reasoning. Specifically, to overcome the challenge of semantic relevance, we introduce a Mixture-of-Experts-Knowledge-Aware (MoEKA) Encoder, designed to integrate the semantics of entities, relations, and numerical attributes into a joint semantic space. To tackle semantic ambiguity, we implement a new ordinal knowledge contrastive learning (OKCL) strategy that generates high-quality ordinal samples from the original data with the aid of ordinal relations, capturing fine-grained semantic nuances essential for accurate numerical reasoning. Experiments on three public benchmark datasets demonstrate the superior performance of KAAE across various attribute value distributions.

[AI-47] Enhancing Thermal MOT: A Novel Box Association Method Leveraging Thermal Identity and Motion Similarity ECCV

Link: https://arxiv.org/abs/2411.12943
Authors: Wassim El Ahmar,Dhanvin Kolhatkar,Farzan Nowruzi,Robert Laganiere
Keywords: Multiple Object Tracking, unique challenges due, Multiple Object, presents unique challenges, unique challenges
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Workshop on Towards a Complete Analysis of People, part of the European Conference on Computer Vision (ECCV) 2024

Abstract:Multiple Object Tracking (MOT) in thermal imaging presents unique challenges due to the lack of visual features and the complexity of motion patterns. This paper introduces an innovative approach to improve MOT in the thermal domain by developing a novel box association method that utilizes both thermal object identity and motion similarity. Our method merges thermal feature sparsity and dynamic object tracking, enabling more accurate and robust MOT performance. Additionally, we present a new dataset comprised of a large-scale collection of thermal and RGB images captured in diverse urban environments, serving as both a benchmark for our method and a new resource for thermal imaging. We conduct extensive experiments to demonstrate the superiority of our approach over existing methods, showing significant improvements in tracking accuracy and robustness under various conditions. Our findings suggest that incorporating thermal identity with motion data enhances MOT performance. The newly collected dataset and source code are available at this https URL.

[AI-48] Human-In-the-Loop Software Development Agents

Link: https://arxiv.org/abs/2411.12924
Authors: Wannita Takerngsaksiri,Jirat Pasuksmit,Patanamon Thongtanunam,Chakkrit Tantithamthavorn,Ruixiong Zhang,Fan Jiang,Jing Li,Evan Cook,Kun Chen,Ming Wu
Keywords: Large Language Models, Large Language, Language Models, based multi-agent paradigms, automatically resolve software
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Comments:

Abstract:Recently, Large Language Model (LLM)-based multi-agent paradigms for software engineering have been introduced to automatically resolve software development tasks (e.g., from a given issue to source code). However, existing work is evaluated based on historical benchmark datasets, does not consider human feedback at each stage of the automated software development process, and has not been deployed in practice. In this paper, we introduce a Human-in-the-loop LLM-based Agents framework (HULA) for software development that allows software engineers to refine and guide LLMs when generating coding plans and source code for a given task. We design, implement, and deploy the HULA framework into Atlassian JIRA for internal use. Through a multi-stage evaluation of the HULA framework, Atlassian software engineers perceive that HULA can minimize the overall development time and effort, especially in initiating a coding plan and writing code for straightforward tasks. On the other hand, challenges around code quality were raised and remain to be solved in some cases. We draw lessons learned and discuss opportunities for future work, which will pave the way for the advancement of LLM-based agents in software development.

[AI-49] A Comparative Study of Text Retrieval Models on DaReCzech

Link: https://arxiv.org/abs/2411.12921
Authors: Jakub Stetina,Martin Fajcik,Michal Stefanik,Michal Hradis
Keywords: OpenAI ADA, retrieval dataset DaReCzech, Czech retrieval dataset, chosen to determine, dataset DaReCzech
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Abstract:This article presents a comprehensive evaluation of 7 off-the-shelf document retrieval models (SPLADE, PLAID, PLAID-X, SimCSE, Contriever, OpenAI ADA and Gemma2), chosen to determine their performance on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses include retrieval quality, speed, and memory footprint. Second, we analyze whether it is better to use the model directly on Czech text, or to use machine translation into English, followed by retrieval in English. Our experiments identify the most effective option for Czech information retrieval. The findings revealed notable performance differences among the models, with Gemma2 achieving the highest precision and recall, while Contriever performed poorly. In conclusion, the SPLADE and PLAID models offered a balance of efficiency and performance.

[AI-50] MLDGG: Meta-Learning for Domain Generalization on Graphs KDD2025

Link: https://arxiv.org/abs/2411.12913
Authors: Qin Tian,Chen Zhao,Minglai Shao,Wenjun Wang,Yujie Lin,Dong Li
Keywords: robust generalization capabilities, ensuring effective performance, testing set, aims to develop, set despite disparities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Accepted in KDD 2025 (research track)

Abstract:Domain generalization on graphs aims to develop models with robust generalization capabilities, ensuring effective performance on the testing set despite disparities between testing and training distributions. However, existing methods often rely on static encoders directly applied to the target domain, constraining its flexible adaptability. In contrast to conventional methodologies, which concentrate on developing specific generalized models, our framework, MLDGG, endeavors to achieve adaptable generalization across diverse domains by integrating cross-multi-domain meta-learning with structure learning and semantic identification. Initially, it introduces a generalized structure learner to mitigate the adverse effects of task-unrelated edges, enhancing the comprehensiveness of representations learned by Graph Neural Networks (GNNs) while capturing shared structural information across domains. Subsequently, a representation learner is designed to disentangle domain-invariant semantic and domain-specific variation information in node embedding by leveraging causal reasoning for semantic identification, further enhancing generalization. In the context of meta-learning, meta-parameters for both learners are optimized to facilitate knowledge transfer and enable effective adaptation to graphs through fine-tuning within the target domains, where target graphs are inaccessible during training. Our empirical results demonstrate that MLDGG surpasses baseline methods, showcasing its effectiveness in three different distribution shift settings.

[AI-51] Advancing Large Language Models for Spatiotemporal and Semantic Association Mining of Similar Environmental Events

Link: https://arxiv.org/abs/2411.12880
Authors: Yuanyuan Tian,Wenwen Li,Lei Hu,Xiao Chen,Michael Brook,Michael Brubaker,Fan Zhang,Anna K. Liljedahl
Keywords: modern search tools, Large Language Models, leveraging Large Language, framework leveraging Large, Local Environmental Observer
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Retrieval and recommendation are two essential tasks in modern search tools. This paper introduces a novel retrieval-reranking framework leveraging Large Language Models (LLMs) to enhance the spatiotemporal and semantic associated mining and recommendation of relevant unusual climate and environmental events described in news articles and web posts. This framework uses advanced natural language processing techniques to address the limitations of traditional manual curation methods in terms of high labor cost and lack of scalability. Specifically, we explore an optimized solution to employ cutting-edge embedding models for semantically analyzing spatiotemporal events (news) and propose a Geo-Time Re-ranking (GT-R) strategy that integrates multi-faceted criteria including spatial proximity, temporal association, semantic similarity, and category-instructed similarity to rank and identify similar spatiotemporal events. We apply the proposed framework to a dataset of four thousand Local Environmental Observer (LEO) Network events, achieving top performance in recommending similar events among multiple cutting-edge dense retrieval models. The search and recommendation pipeline can be applied to a wide range of similar data search tasks dealing with geospatial and temporal data. We hope that by linking relevant events, we can better aid the general public to gain an enhanced understanding of climate change and its impact on different communities.

[AI-52] The Illusion of Empathy: How AI Chatbots Shape Conversation Perception

Link: https://arxiv.org/abs/2411.12877
Authors: Tingting Liu,Salvatore Giorgi,Ankit Aich,Allison Lahnala,Brenda Curtis,Lyle Ungar,João Sedoc
Keywords: understanding user-centered perceptions, quality remains essential, understanding user-centered, essential yet under-explored, human-like by incorporating
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*Comments:

Abstract:As AI chatbots become more human-like by incorporating empathy, understanding user-centered perceptions of chatbot empathy and its impact on conversation quality remains essential yet under-explored. This study examines how chatbot identity and perceived empathy influence users’ overall conversation experience. Analyzing 155 conversations from two datasets, we found that while GPT-based chatbots were rated significantly higher in conversational quality, they were consistently perceived as less empathetic than human conversational partners. Empathy ratings from GPT-4o annotations aligned with users’ ratings, reinforcing the perception of lower empathy in chatbots. In contrast, 3 out of 5 empathy models trained on human-human conversations detected no significant differences in empathy language between chatbots and humans. Our findings underscore the critical role of perceived empathy in shaping conversation quality, revealing that achieving high-quality human-AI interactions requires more than simply embedding empathetic language; it necessitates addressing the nuanced ways users interpret and experience empathy in conversations with chatbots.

[AI-53] Puppet-CNN: Input-Adaptive Convolutional Neural Networks with Model Compression using Ordinary Differential Equation

Link: https://arxiv.org/abs/2411.12876
Authors: Yucheng Xing,Xin Wang
Keywords: Convolutional Neural Network, machine learning tasks, Neural Network, Convolutional Neural, learning tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Convolutional Neural Network (CNN) has been applied to more and more scenarios due to its excellent performance in many machine learning tasks, especially with deep and complex structures. However, as the network goes deeper, more parameters need to be stored and optimized. Besides, almost all common CNN models adopt a “train-and-use” strategy where the structure is pre-defined and the kernel parameters are fixed after training, with the same structure and set of parameters used for all data regardless of content complexity. In this paper, we propose a new CNN framework, named Puppet-CNN, which contains two modules: a puppet module and a puppeteer module. The puppet module is a CNN model used to actually process the input data just like other works, but its depth and kernels are generated by the puppeteer module (realized with an Ordinary Differential Equation (ODE)) based on the input complexity each time. By recurrently generating kernel parameters in the puppet module, we can take advantage of the dependence among kernels of different convolutional layers to significantly reduce the size of the CNN model by only storing and training the parameters of the much smaller puppeteer ODE module. Through experiments on several datasets, our method has proven to be superior to traditional CNNs in both performance and efficiency. The model size can be reduced by more than 10 times.

[AI-54] From Text to Pose to Image: Improving Diffusion Model Control and Quality NEURIPS2024

Link: https://arxiv.org/abs/2411.12872
Authors: Clément Bonnett,Ariel N. Lee,Franck Wertel,Antoine Tamano,Tanguy Cizain,Pablo Ducru
Keywords: extremely popular, diffusion models, pose, models, diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Published at the NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths Forward

Abstract:In the last two years, text-to-image diffusion models have become extremely popular. As their quality and usage increase, a major concern has been the need for better output control. In addition to prompt engineering, one effective method to improve the controllability of diffusion models has been to condition them on additional modalities such as image style, depth map, or keypoints. This forms the basis of ControlNets or Adapters. When attempting to apply these methods to control human poses in outputs of text-to-image diffusion models, two main challenges have arisen. The first challenge is generating poses following a wide range of semantic text descriptions, for which previous methods involved searching for a pose within a dataset of (caption, pose) pairs. The second challenge is conditioning image generation on a specified pose while keeping both high aesthetic and high pose fidelity. In this article, we fix these two main issues by introducing a text-to-pose (T2P) generative model alongside a new sampling algorithm, and a new pose adapter that incorporates more pose keypoints for higher pose fidelity. Together, these two new state-of-the-art models enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models. We release all models and the code used for the experiments at this https URL.

[AI-55] The Game-Theoretic Symbiosis of Trust and AI in Networked Systems

Link: https://arxiv.org/abs/2411.12859
Authors: Yunfei Ge,Quanyan Zhu
Keywords: Artificial Intelligence, relationship between Artificial, strategic cybersecurity contexts, explores the symbiotic, symbiotic relationship
Categories: Artificial Intelligence (cs.AI)
*Comments:

Abstract:This chapter explores the symbiotic relationship between Artificial Intelligence (AI) and trust in networked systems, focusing on how these two elements reinforce each other in strategic cybersecurity contexts. AI’s capabilities in data processing, learning, and real-time response offer unprecedented support for managing trust in dynamic, complex networks. However, the successful integration of AI also hinges on the trustworthiness of AI systems themselves. Using a game-theoretic framework, this chapter presents approaches to trust evaluation, the strategic role of AI in cybersecurity, and governance frameworks that ensure responsible AI deployment. We investigate how trust, when dynamically managed through AI, can form a resilient security ecosystem. By examining trust as both an AI output and an AI requirement, this chapter sets the foundation for a positive feedback loop where AI enhances network security and the trust placed in AI systems fosters their adoption.

[AI-56] mDAE: modified Denoising AutoEncoder for missing data imputation

Link: https://arxiv.org/abs/2411.12847
Authors: Mariette Dupuy,Marie Chavent,Remi Dubois
Keywords: missing data imputation, Denoising AutoEncoder, based on Denoising, UCI Machine Learning, Machine Learning Repository
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:This paper introduces a methodology based on Denoising AutoEncoder (DAE) for missing data imputation. The proposed methodology, called mDAE hereafter, results from a modification of the loss function and a straightforward procedure for choosing the hyper-parameters. An ablation study shows on several UCI Machine Learning Repository datasets, the benefit of using this modified loss function and an overcomplete structure, in terms of Root Mean Squared Error (RMSE) of reconstruction. This numerical study is completed by comparing the mDAE methodology with eight other methods (four standard and four more recent). A criterion called Mean Distance to Best (MDB) is proposed to measure how a method performs globally well on all datasets. This criterion is defined as the mean (over the datasets) of the distances between the RMSE of the considered method and the RMSE of the best method. According to this criterion, the mDAE methodology was consistently ranked among the top methods (along with SoftImput and missForest), while the four more recent methods were systematically ranked last. The Python code of the numerical study will be available on GitHub so that results can be reproduced or generalized with other datasets and methods.
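
The Mean Distance to Best (MDB) criterion defined in the abstract is simple enough to sketch directly: for each dataset, take the lowest RMSE achieved by any method, then average each method's gap to that best value over all datasets. The sketch below assumes "distance" means the plain difference in RMSE; the method names and RMSE values are made up for illustration.

```python
def mean_distance_to_best(rmse):
    """Compute MDB per method.

    rmse: {method_name: [RMSE on dataset 0, RMSE on dataset 1, ...]},
    with every method evaluated on the same datasets in the same order.
    """
    methods = list(rmse)
    n_datasets = len(rmse[methods[0]])
    # Lowest RMSE reached by any method on each dataset.
    best = [min(rmse[m][d] for m in methods) for d in range(n_datasets)]
    return {
        m: sum(rmse[m][d] - best[d] for d in range(n_datasets)) / n_datasets
        for m in methods
    }


# Made-up RMSE values for two hypothetical methods on three datasets.
scores = {
    "method_A": [0.5, 0.25, 0.75],
    "method_B": [1.0, 0.25, 0.5],
}
mdb = mean_distance_to_best(scores)
```

A method that is best on every dataset has an MDB of zero, and lower values indicate a method that stays close to the best performer across all datasets.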

[AI-57] Declare and Justify: Explicit assumptions in AI evaluations are necessary for effective regulation

Link: https://arxiv.org/abs/2411.12820
Authors: Peter Barnett,Lisa Thiergart
Keywords: important pillar, ensuring safety, systems advance, evaluations, safety
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*Comments:

Abstract:As AI systems advance, AI evaluations are becoming an important pillar of regulations for ensuring safety. We argue that such regulation should require developers to explicitly identify and justify key underlying assumptions about evaluations as part of their case for safety. We identify core assumptions in AI evaluations (both for evaluating existing models and forecasting future models), such as comprehensive threat modeling, proxy task validity, and adequate capability elicitation. Many of these assumptions cannot currently be well justified. If regulation is to be based on evaluations, it should require that AI development be halted if evaluations demonstrate unacceptable danger or if these assumptions are inadequately justified. Our presented approach aims to enhance transparency in AI development, offering a practical path towards more effective governance of advanced AI systems.

[AI-58] Conversational Medical AI: Ready for Practice

Link: https://arxiv.org/abs/2411.12808
Authors: Antoine Lizée,Pierre-Auguste Beaucoté,James Whitbeck,Marion Doumeingts,Anaël Beaugnon,Isabelle Feldhaus
Keywords: shortage of doctors, doctors is creating, creating a critical, critical squeeze, squeeze in access
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*Comments: 14 pages, 7 figures, 3 tables

Abstract:The shortage of doctors is creating a critical squeeze in access to medical expertise. While conversational Artificial Intelligence (AI) holds promise in addressing this problem, its safe deployment in patient-facing roles remains largely unexplored in real-world medical settings. We present the first large-scale evaluation of a physician-supervised LLM-based conversational agent in a real-world medical setting. Our agent, Mo, was integrated into an existing medical advice chat service. Over a three-week period, we conducted a randomized controlled experiment with 926 cases to evaluate patient experience and satisfaction. Among these, Mo handled 298 complete patient interactions, for which we report physician-assessed measures of safety and medical accuracy. Patients reported higher clarity of information (3.73 vs 3.62 out of 4, p < 0.05) and overall satisfaction (4.58 vs 4.42 out of 5, p < 0.05) with AI-assisted conversations compared to standard care, while showing equivalent levels of trust and perceived empathy. The high opt-in rate (81% among respondents) exceeded previous benchmarks for AI acceptance in healthcare. Physician oversight ensured safety, with 95% of conversations rated as “good” or “excellent” by general practitioners experienced in operating a medical advice chat service. Our findings demonstrate that carefully implemented AI medical assistants can enhance patient experience while maintaining safety standards through physician supervision. This work provides empirical evidence for the feasibility of AI deployment in healthcare communication and insights into the requirements for successful integration into existing healthcare services.
Submitted: Tue, 19 Nov 2024 19:00:31 UTC (v1), by James Whitbeck. DOI: https://doi.org/10.48550/arXiv.2411.12808 (pending registration)

[AI-59] Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models

Link: https://arxiv.org/abs/2411.12790
Authors: Zhen Zeng,Leijiang Gu,Xun Yang,Zhangling Duan,Zenglin Shi,Meng Wang
Keywords: Large Language Models, Language Models, cost-effectively correct inaccuracies, Multimodal Large Language, Large Language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Knowledge editing aims to efficiently and cost-effectively correct inaccuracies and update outdated information. Recently, there has been growing interest in extending knowledge editing from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs), which integrate both textual and visual information, introducing additional editing complexities. Existing multimodal knowledge editing works primarily focus on text-oriented, coarse-grained scenarios, failing to address the unique challenges posed by multimodal contexts. In this paper, we propose a visual-oriented, fine-grained multimodal knowledge editing task that targets precise editing in images with multiple interacting entities. We introduce the Fine-Grained Visual Knowledge Editing (FGVEdit) benchmark to evaluate this task. Moreover, we propose a Multimodal Scope Classifier-based Knowledge Editor (MSCKE) framework. MSCKE leverages a multimodal scope classifier that integrates both visual and textual information to accurately identify and update knowledge related to specific entities within images. This approach ensures precise editing while preserving irrelevant information, overcoming the limitations of traditional text-only editing methods. Extensive experiments on the FGVEdit benchmark demonstrate that MSCKE outperforms existing methods, showcasing its effectiveness in solving the complex challenges of multimodal knowledge editing.

[AI-60] Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning

Link: https://arxiv.org/abs/2411.12787
Authors: Pengkun Jiao,Bin Zhu,Jingjing Chen,Chong-Wah Ngo,Yu-Gang Jiang
Keywords: presents significant challenges, multimodal large language, Fine-tuning multimodal large, large language models, limits fine-grained detail
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Fine-tuning multimodal large language models (MLLMs) presents significant challenges, including a reliance on high-level visual features that limits fine-grained detail comprehension, and data conflicts that arise from task complexity. To address these issues, we propose an efficient fine-tuning framework with two novel approaches: Vision Cue Enhancement (VCE) and Dual Low-Rank Adaptation (Dual-LoRA). VCE enhances the vision projector by integrating multi-level visual cues, improving the model’s ability to capture fine-grained visual features. Dual-LoRA introduces a dual low-rank structure for instruction tuning, decoupling learning into skill and task spaces to enable precise control and efficient adaptation across diverse tasks. Our method simplifies implementation, enhances visual comprehension, and improves adaptability. Experiments on both downstream tasks and general benchmarks demonstrate the effectiveness of our proposed approach.

[AI-61] Lucia: A Temporal Computing Platform for Contextual Intelligence

Link: https://arxiv.org/abs/2411.12778
Authors: Weizhe Lin,Junxiao Shen
Keywords: large language models, multi-modal large language, redefined user interactions, artificial intelligence, language models
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*Comments:

Abstract:The rapid evolution of artificial intelligence, especially through multi-modal large language models, has redefined user interactions, enabling responses that are contextually rich and human-like. As AI becomes an integral part of daily life, a new frontier has emerged: developing systems that not only understand spatial and sensory data but also interpret temporal contexts to build long-term, personalized memories. This report introduces Lucia, an open-source Temporal Computing Platform designed to enhance human cognition by capturing and utilizing continuous contextual memory. Lucia introduces a lightweight, wearable device that excels in both comfort and real-time data accessibility, distinguishing itself from existing devices that typically prioritize either wearability or perceptual capabilities alone. By recording and interpreting daily activities over time, Lucia enables users to access a robust temporal memory, enhancing cognitive processes such as decision-making and memory recall.

[AI-62] Education in the Era of Neurosymbolic AI

Link: https://arxiv.org/abs/2411.12763
Authors: Chris Davis Jaldi,Eleni Ilkou,Noah Schroeder,Cogan Shimizu
Keywords: neurosymbolic artificial intelligence, support deeply adaptive, personalized learning experiences, artificial intelligence, transformative shift
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*Comments:

Abstract:Education is poised for a transformative shift with the advent of neurosymbolic artificial intelligence (NAI), which will redefine how we support deeply adaptive and personalized learning experiences. NAI-powered education systems will be capable of interpreting complex human concepts and contexts while employing advanced problem-solving strategies, all grounded in established pedagogical frameworks. This will enable a level of personalization in learning systems that to date has been largely unattainable at scale, providing finely tailored curricula that adapt to an individual’s learning pace and accessibility needs, including the diagnosis of student understanding of subjects at a fine-grained level, identifying gaps in foundational knowledge, and adjusting instruction accordingly. In this paper, we propose a system that leverages the unique affordances of pedagogical agents – embodied characters designed to enhance learning – as critical components of a hybrid NAI architecture. To do so, these agents can thus simulate nuanced discussions, debates, and problem-solving exercises that push learners beyond rote memorization toward deep comprehension. We discuss the rationale for our system design and the preliminary findings of our work. We conclude that education in the era of NAI will make learning more accessible, equitable, and aligned with real-world skills. This is an era that will explore a new depth of understanding in educational tools.

[AI-63] AI-Empowered Human Research Integrating Brain Science and Social Sciences Insights

Link: https://arxiv.org/abs/2411.12761
Authors: Feng Xiong,Xinguo Yu,Hon Wai Leong
Keywords: enhancing scientific research, human-AI joint research, Science Research Paradigm, research, joint research
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*Comments: Accepted to IEIR 2024, 10 pages, 4 figures

Abstract:This paper explores the transformative role of artificial intelligence (AI) in enhancing scientific research, particularly in the fields of brain science and social sciences. We analyze the fundamental aspects of human research and argue that it is high time for researchers to transition to human-AI joint research. Building upon this foundation, we propose two innovative research paradigms of human-AI joint research: “AI-Brain Science Research Paradigm” and “AI-Social Sciences Research Paradigm”. In these paradigms, we introduce three human-AI collaboration models: AI as a research tool (ART), AI as a research assistant (ARA), and AI as a research participant (ARP). Furthermore, we outline the methods for conducting human-AI joint research. This paper seeks to redefine the collaborative interactions between human researchers and AI system, setting the stage for future research directions and sparking innovation in this interdisciplinary field.

[AI-64] SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input

Link: https://arxiv.org/abs/2411.11934
Authors: Zhen Lv,Yangqi Long,Congzhentao Huang,Cao Li,Chengfei Lv,Hao Ren,Dian Zheng
Keywords: Stereo video synthesis, virtual reality, Stereo video, monocular input, fields of spatial
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Stereo video synthesis from a monocular input is a demanding task in the fields of spatial computing and virtual reality. The main challenges of this task lie on the insufficiency of high-quality paired stereo videos for training and the difficulty of maintaining the spatio-temporal consistency between frames. Existing methods primarily address these issues by directly applying novel view synthesis (NVS) techniques to video, while facing limitations such as the inability to effectively represent dynamic scenes and the requirement for large amounts of training data. In this paper, we introduce a novel self-supervised stereo video synthesis paradigm via a video diffusion model, termed SpatialDreamer, which meets the challenges head-on. Firstly, to address the stereo video data insufficiency, we propose a Depth based Video Generation module DVG, which employs a forward-backward rendering mechanism to generate paired videos with geometric and temporal priors. Leveraging data generated by DVG, we propose RefinerNet along with a self-supervised synthetic framework designed to facilitate efficient and dedicated training. More importantly, we devise a consistency control module, which consists of a metric of stereo deviation strength and a Temporal Interaction Learning module TIL for geometric and temporal consistency ensurance respectively. We evaluated the proposed method against various benchmark methods, with the results showcasing its superior performance.

[AI-65] A Random-Key Optimizer for Combinatorial Optimization

Link: https://arxiv.org/abs/2411.04293
Authors: Antonio A. Chaves,Mauricio G.C. Resende,Martin J.A. Schuetz,J. Kyle Brubaker,Helmut G. Katzgraber,Edilson F. de Arruda,Ricardo M. A. Silva
Keywords: search method tailored, efficient stochastic local, Random-Key Optimizer, stochastic local search, local search method
Subjects: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
*Comments: 54 pages, 16 figures, 8 tables

Abstract:This paper presents the Random-Key Optimizer (RKO), a versatile and efficient stochastic local search method tailored for combinatorial optimization problems. Using the random-key concept, RKO encodes solutions as vectors of random keys that are subsequently decoded into feasible solutions via problem-specific decoders. The RKO framework is able to combine a plethora of classic metaheuristics, each capable of operating independently or in parallel, with solution sharing facilitated through an elite solution pool. This modular approach allows for the adaptation of various metaheuristics, including simulated annealing, iterated local search, and greedy randomized adaptive search procedures, among others. The efficacy of the RKO framework, implemented in C++, is demonstrated through its application to three NP-hard combinatorial optimization problems: the alpha-neighborhood p-median problem, the tree of hubs location problem, and the node-capacitated graph partitioning problem. The results highlight the framework’s ability to produce high-quality solutions across diverse problem domains, underscoring its potential as a robust tool for combinatorial optimization.
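The random-key idea in the abstract above can be illustrated with a small Python sketch. It is not the authors' C++ framework: the decoder here is the generic "sort indices by key" permutation decoder, the tour-length objective is a toy stand-in for the paper's problem-specific decoders, and the function names (`decode_permutation`, `tour_length`, `random_key_local_search`) are hypothetical.

```python
import random


def decode_permutation(keys):
    """Decode a random-key vector into a permutation: indices sorted by key."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


def tour_length(perm, dist):
    """Length of the closed tour that visits nodes in the order given by perm."""
    n = len(perm)
    return sum(dist[perm[i]][perm[(i + 1) % n]] for i in range(n))


def random_key_local_search(dist, iters=2000, seed=0):
    """Toy stochastic local search in key space (an assumption, not RKO itself).

    Each step resamples one key in [0, 1), decodes the vector, and keeps the
    candidate if the decoded tour is no longer than the incumbent.
    """
    rng = random.Random(seed)
    n = len(dist)
    keys = [rng.random() for _ in range(n)]
    best = tour_length(decode_permutation(keys), dist)
    for _ in range(iters):
        cand = keys[:]
        cand[rng.randrange(n)] = rng.random()  # perturb a single key
        cost = tour_length(decode_permutation(cand), dist)
        if cost <= best:
            keys, best = cand, cost
    return decode_permutation(keys), best


# Usage: keys [0.7, 0.1, 0.4] decode to the visiting order [1, 2, 0].
order = decode_permutation([0.7, 0.1, 0.4])
```

Because the search only ever manipulates vectors in [0, 1)^n, the same loop could be reused for another problem by swapping the decoder and the objective, which is the modularity the abstract describes.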

[AI-66] Unlocking the Power of Gradient Guidance for Structure-Based Molecule Optimization

Link: https://arxiv.org/abs/2411.13280
Authors: Keyue Qiu,Yuxuan Song,Jie Yu,Hongbo Ma,Ziyao Cao,Zhilong Zhang,Yushuai Wu,Mingyue Zheng,Hao Zhou,Wei-Ying Ma
Keywords: Structure-based molecule optimization, Structure-based molecule, protein targets, types against protein, Molecule Joint Optimization
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
*Comments: 27 pages, 17 figures

Abstract:Structure-based molecule optimization (SBMO) aims to optimize molecules with both continuous coordinates and discrete types against protein targets. A promising direction is to exert gradient guidance on generative models given its remarkable success in images, but it is challenging to guide discrete data and risks inconsistencies between modalities. To this end, we leverage a continuous and differentiable space derived through Bayesian inference, presenting Molecule Joint Optimization (MolJO), the first gradient-based SBMO framework that facilitates joint guidance signals across different modalities while preserving SE(3)-equivariance. We introduce a novel backward correction strategy that optimizes within a sliding window of the past histories, allowing for a seamless trade-off between explore-and-exploit during optimization. Our proposed MolJO achieves state-of-the-art performance on CrossDocked2020 benchmark (Success Rate 51.3%, Vina Dock -9.05 and SA 0.78), more than 4x improvement in Success Rate compared to the gradient-based counterpart, and 2x “Me-Better” Ratio as much as 3D baselines. Furthermore, we extend MolJO to a wide range of optimization settings, including multi-objective optimization and challenging tasks in drug design such as R-group optimization and scaffold hopping, further underscoring its versatility and potential.

[AI-67] Quantum Kernel-Based Long Short-term Memory

Link: https://arxiv.org/abs/2411.13225
Authors: Yu-Chao Hsu,Tai-Yu Li,Kuan-Cheng Chen
Keywords: enhance model efficiency, Long Short-Term Memory, computational capacity, promising approach, approach to enhance
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*Comments:

Abstract:The integration of quantum computing into classical machine learning architectures has emerged as a promising approach to enhance model efficiency and computational capacity. In this work, we introduce the Quantum Kernel-Based Long Short-Term Memory (QK-LSTM) network, which utilizes quantum kernel functions within the classical LSTM framework to capture complex, non-linear patterns in sequential data. By embedding input data into a high-dimensional quantum feature space, the QK-LSTM model reduces the reliance on large parameter sets, achieving effective compression while maintaining accuracy in sequence modeling tasks. This quantum-enhanced architecture demonstrates efficient convergence, robust loss minimization, and model compactness, making it suitable for deployment in edge computing environments and resource-limited quantum devices (especially in the NISQ era). Benchmark comparisons reveal that QK-LSTM achieves performance on par with classical LSTM models, yet with fewer parameters, underscoring its potential to advance quantum machine learning applications in natural language processing and other domains requiring efficient temporal data processing.

[AI-68] Training Physics-Driven Deep Learning Reconstruction without Raw Data Access for Equitable Fast MRI

Link: https://arxiv.org/abs/2411.13022
Authors: Yaşar Utku Alçalar,Merve Gülle,Mehmet Akçakaya
Keywords: Physics-driven deep learning, Physics-driven deep, fast magnetic resonance, magnetic resonance imaging, MRI
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:Physics-driven deep learning (PD-DL) approaches have become popular for improved reconstruction of fast magnetic resonance imaging (MRI) scans. Even though PD-DL offers higher acceleration rates compared to existing clinical fast MRI techniques, their use has been limited outside specialized MRI centers. One impediment for their deployment is the difficulties with generalization to pathologies or population groups that are not well-represented in training sets. This has been noted in several studies, and fine-tuning on target populations to improve reconstruction has been suggested. However, current approaches for PD-DL training require access to raw k-space measurements, which is typically only available at specialized MRI centers that have research agreements for such data access. This is especially an issue for rural and underserved areas, where commercial MRI scanners only provide access to a final reconstructed image. To tackle these challenges, we propose Compressibility-inspired Unsupervised Learning via Parallel Imaging Fidelity (CUPID) for high-quality PD-DL training, using only routine clinical reconstructed images exported from an MRI scanner. CUPID evaluates the goodness of the output with a compressibility-based approach, while ensuring that the output stays consistent with the clinical parallel imaging reconstruction through well-designed perturbations. Our results show that CUPID achieves similar quality compared to well-established PD-DL training strategies that require raw k-space data access, while outperforming conventional compressed sensing (CS) and state-of-the-art generative methods. We also demonstrate its effectiveness in a zero-shot training setup for retrospectively and prospectively sub-sampled acquisitions, attesting to its minimal training burden.

[AI-69] Automating Sonologists USG Commands with AI and Voice Interface

Link: https://arxiv.org/abs/2411.13006
Authors: Emad Mohamed,Shruti Tiwari,Sheena Christabel Pravin
Keywords: advanced AI-powered ultrasound, incorporates real-time image, AI-powered ultrasound imaging, clinical practice, research presents
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:This research presents an advanced AI-powered ultrasound imaging system that incorporates real-time image processing, organ tracking, and voice commands to enhance the efficiency and accuracy of diagnoses in clinical practice. Traditional ultrasound diagnostics often require significant time and introduce a degree of subjectivity due to user interaction. The goal of this innovative solution is to provide Sonologists with a more predictable and productive imaging procedure utilizing artificial intelligence, computer vision, and voice technology. The functionality of the system employs computer vision and deep learning algorithms, specifically adopting the Mask R-CNN model from Detectron2 for semantic segmentation of organs and key landmarks. This automation improves diagnostic accuracy by enabling the extraction of valuable information with minimal human input. Additionally, it includes a voice recognition feature that allows for hands-free operation, enabling users to control the system with commands such as freeze or liver, all while maintaining their focus on the patient. The architecture comprises video processing and real-time segmentation modules that prepare the system to perform essential imaging functions, such as freezing and zooming in on frames. The liver histopathology module, optimized for detecting fibrosis, achieved an impressive accuracy of 98.6%. Furthermore, the organ segmentation module produces output confidence levels between 50% and 95%, demonstrating its efficacy in organ detection.

[AI-70] Enhancing Deep Learning-Driven Multi-Coil MRI Reconstruction via Self-Supervised Denoising

Link: https://arxiv.org/abs/2411.12919
Authors: Asad Aali,Marius Arvinte,Sidharth Kumar,Yamin I. Arefeen,Jonathan I. Tamir
Keywords: Gaussian noise, corrupted by Gaussian, deep learning, Unbiased Risk Estimate, examine the effect
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:We examine the effect of incorporating self-supervised denoising as a pre-processing step for training deep learning (DL) based reconstruction methods on data corrupted by Gaussian noise. K-space data employed for training are typically multi-coil and inherently noisy. Although DL-based reconstruction methods trained on fully sampled data can enable high reconstruction quality, obtaining large, noise-free datasets is impractical. We leverage Generalized Stein’s Unbiased Risk Estimate (GSURE) for denoising. We evaluate two DL-based reconstruction methods: Diffusion Probabilistic Models (DPMs) and Model-Based Deep Learning (MoDL). We evaluate the impact of denoising on the performance of these DL-based methods in solving accelerated multi-coil magnetic resonance imaging (MRI) reconstruction. The experiments were carried out on T2-weighted brain and fat-suppressed proton-density knee scans. We observed that self-supervised denoising enhances the quality and efficiency of MRI reconstructions across various scenarios. Specifically, employing denoised images rather than noisy counterparts when training DL networks results in lower normalized root mean squared error (NRMSE), higher structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) across different SNR levels, including 32dB, 22dB, and 12dB for T2-weighted brain data, and 24dB, 14dB, and 4dB for fat-suppressed knee data. Overall, we showed that denoising is an essential pre-processing technique capable of improving the efficacy of DL-based MRI reconstruction methods under diverse conditions. By refining the quality of input data, denoising can enable the training of more effective DL networks, potentially bypassing the need for noise-free reference MRI scans.
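For readers unfamiliar with two of the metrics named in the abstract above, the following Python sketch computes NRMSE and PSNR on flattened pixel lists. The exact normalizations are assumptions: NRMSE is taken as the L2 error divided by the L2 norm of the reference, and PSNR uses the reference's peak magnitude; the paper may define either differently. SSIM is omitted because it needs windowed statistics.

```python
import math


def nrmse(reference, estimate):
    """L2 norm of the error divided by the L2 norm of the reference (assumed form)."""
    err = math.sqrt(sum((e - r) ** 2 for r, e in zip(reference, estimate)))
    return err / math.sqrt(sum(r ** 2 for r in reference))


def psnr(reference, estimate):
    """Peak signal-to-noise ratio in dB, with the peak taken from the reference."""
    mse = sum((e - r) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    peak = max(abs(r) for r in reference)
    return 10.0 * math.log10(peak ** 2 / mse)


# Usage: a reconstruction that is uniformly 10% too bright.
reference = [1.0] * 16
estimate = [1.1] * 16
print(nrmse(reference, estimate), psnr(reference, estimate))
```

Lower NRMSE and higher PSNR both indicate a reconstruction closer to the reference, which is the direction of improvement the abstract reports for training on denoised data.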

[AI-71] Efficient Medicinal Image Transmission and Resolution Enhancement via GAN

Link: https://arxiv.org/abs/2411.12833
Authors: Rishabh Kumar Sharma,Mukund Sharma,Pushkar Sharma,Jeetashree Aparjeeta
Keywords: X-ray imaging, inherently carries, X-ray images, X-ray images require, X-ray
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:While X-ray imaging is indispensable in medical diagnostics, it inherently carries with it those noises and limitations on resolution that mask the details necessary for diagnosis. B/W X-ray images require a careful balance between noise suppression and high-detail preservation to ensure clarity in soft-tissue structures and bone edges. While traditional methods, such as CNNs and early super-resolution models like ESRGAN, have enhanced image resolution, they often perform poorly regarding high-frequency detail preservation and noise control for B/W imaging. We are going to present one efficient approach that improves the quality of an image with the optimization of network transmission in the following paper. The pre-processing of X-ray images into low-resolution files by Real-ESRGAN, a version of ESRGAN elucidated and improved, helps reduce the server load and transmission bandwidth. Lower-resolution images are upscaled at the receiving end using Real-ESRGAN, fine-tuned for real-world image degradation. The model integrates Residual-in-Residual Dense Blocks with perceptual and adversarial loss functions for high-quality upscaled images with low noise. We further fine-tune Real-ESRGAN by adapting it to the specific B/W noise and contrast characteristics. This suppresses noise artifacts without compromising detail. The comparative evaluation conducted shows that our approach achieves superior noise reduction and detail clarity compared to state-of-the-art CNN-based and ESRGAN models, apart from reducing network bandwidth requirements. These benefits are confirmed both by quantitative metrics, including Peak Signal-to-Noise Ratio and Structural Similarity Index, and by qualitative assessments, which indicate the potential of Real-ESRGAN for diagnostic-quality X-ray imaging and for efficient medical data transmission.

[AI-72] FedCL-Ensemble Learning: A Framework of Federated Continual Learning with Ensemble Transfer Learning Enhanced for Alzheimers MRI Classifications while Preserving Privacy

Link: https://arxiv.org/abs/2411.12756
Authors: Rishit Kapoor(1),Jesher Joshua(2),Muralidharan Vijayarangan(3),Natarajan B(4) ((1) Vellore Institute of Technology, (2) Vellore Institute of Technology, (3) Vellore Institute of Technology, (4) Vellore Institute of Technology)
Keywords: research work introduces, advanced deep learning, deep learning techniques, learning techniques combined, data processing methods
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments: 6 pages, 4 figures

Abstract:This research work introduces a novel approach to the classification of Alzheimer’s disease by using the advanced deep learning techniques combined with secure data processing methods. This research work primary uses transfer learning models such as ResNet, ImageNet, and VNet to extract high-level features from medical image data. Thereafter, these pre-trained models were fine-tuned for Alzheimer’s related subtle patterns such that the model is capable of robust feature extraction over varying data sources. Further, the federated learning approaches were incorporated to tackle a few other challenges related to classification, aimed to provide better prediction performance and protect data privacy. The proposed model was built using federated learning without sharing sensitive patient data. This way, the decentralized model benefits from the large and diversified dataset that it is trained upon while ensuring confidentiality. The cipher-based encryption mechanism is added that allows us to secure the transportation of data and further ensure the privacy and integrity of patient information throughout training and classification. The results of the experiments not only help to improve the accuracy of the classification of Alzheimer’s but at the same time provides a framework for secure and collaborative analysis of health care data.
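The abstract above says the model is trained with federated learning "without sharing sensitive patient data" but does not name the aggregation rule. As a hedged illustration only, the Python sketch below shows the widely used federated averaging step, in which a server combines client parameter vectors weighted by local dataset size. The function name `federated_average` and the flat-list parameter layout are assumptions, and the paper's encryption layer is not modeled.

```python
def federated_average(client_weights, client_sizes):
    """Size-weighted average of client parameter vectors (FedAvg-style step).

    client_weights: one flat list of floats per client, all the same length.
    client_sizes: number of local training samples held by each client.
    Only parameters reach the server; raw patient images stay on the clients.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]


# Usage: two hospitals with 1 and 3 local samples respectively.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a full training loop this step would alternate with local fine-tuning rounds on each client, so the larger site pulls the shared model further toward its own update.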

[AI-73] A Survey of Financial AI: Architectures Advances and Open Challenges

Link: https://arxiv.org/abs/2411.12747
Authors: Junhua Liu
Keywords: empowers sophisticated approaches, financial market forecasting, empowers sophisticated, sophisticated approaches, portfolio optimization
Subjects: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*Comments: Full list of papers and summary slides are available at: this https URL

Abstract:Financial AI empowers sophisticated approaches to financial market forecasting, portfolio optimization, and automated trading. This survey provides a systematic analysis of these developments across three primary dimensions: predictive models that capture complex market dynamics, decision-making frameworks that optimize trading and investment strategies, and knowledge augmentation systems that leverage unstructured financial information. We examine significant innovations including foundation models for financial time series, graph-based architectures for market relationship modeling, and hierarchical frameworks for portfolio optimization. Analysis reveals crucial trade-offs between model sophistication and practical constraints, particularly in high-frequency trading applications. We identify critical gaps and open challenges between theoretical advances and industrial implementation, outlining open challenges and opportunities for improving both model performance and practical applicability.

[AI-74] A Review of Reinforcement Learning in Financial Applications

Link: https://arxiv.org/abs/2411.12746
Authors: Yahui Bai,Yuhe Gao,Runzhe Wan,Sheng Zhang,Rui Song
Keywords: applying Reinforcement Learning, Markov Decision Process
Subjects: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:In recent years, there has been a growing trend of applying Reinforcement Learning (RL) in financial applications. This approach has shown great potential to solve decision-making tasks in finance. In this survey, we present a comprehensive study of the applications of RL in finance and conduct a series of meta-analyses to investigate the common themes in the literature, such as the factors that most significantly affect RL’s performance compared to traditional methods. Moreover, we identify challenges including explainability, Markov Decision Process (MDP) modeling, and robustness that hinder the broader utilization of RL in the financial industry and discuss recent advancements in overcoming these challenges. Finally, we propose future research directions, such as benchmarking, contextual RL, multi-agent RL, and model-based RL to address these challenges and to further enhance the implementation of RL in finance.
Submitted: Fri, 1 Nov 2024 01:03:10 UTC (v1), by Sheng Zhang. DOI: https://doi.org/10.48550/arXiv.2411.12746

Computer Vision

[CV-0] AI-generated Image Detection: Passive or Watermark?

Link: https://arxiv.org/abs/2411.13553
Authors: Moyang Guo, Yuepeng Hu, Zhengyuan Jiang, Zeyu Li, Amir Sadovnik, Arka Daw, Neil Gong
Keywords: offer numerous benefits, models offer numerous, pose significant societal, significant societal risks, models offer
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:While text-to-image models offer numerous benefits, they also pose significant societal risks. Detecting AI-generated images is crucial for mitigating these risks. Detection methods can be broadly categorized into passive and watermark-based approaches: passive detectors rely on artifacts present in AI-generated images, whereas watermark-based detectors proactively embed watermarks into such images. A key question is which type of detector performs better in terms of effectiveness, robustness, and efficiency. However, the current literature lacks a comprehensive understanding of this issue. In this work, we aim to bridge that gap by developing ImageDetectBench, the first comprehensive benchmark to compare the effectiveness, robustness, and efficiency of passive and watermark-based detectors. Our benchmark includes four datasets, each containing a mix of AI-generated and non-AI-generated images. We evaluate five passive detectors and four watermark-based detectors against eight types of common perturbations and three types of adversarial perturbations. Our benchmark results reveal several interesting findings. For instance, watermark-based detectors consistently outperform passive detectors, both in the presence and absence of perturbations. Based on these insights, we provide recommendations for detecting AI-generated images, e.g., when both types of detectors are applicable, watermark-based detectors should be the preferred choice.

[CV-1] REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents

Link: https://arxiv.org/abs/2411.13552
Authors: Rui Tian, Qi Dai, Jianmin Bao, Kai Qiu, Yifan Yang, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
Keywords: Commercial video generation, exhibited realistic, high-fidelity results, Commercial video, Commercial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at this https URL

Abstract:Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access. One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain much more redundant information than images, and thus can be encoded by very few motion latents based on a content image. Towards this goal, we design an image-conditioned VAE to encode a video to an extremely compressed motion latent space. This magic Reducio charm enables 64x reduction of latents compared to a common 2D VAE, without sacrificing the quality. Training diffusion models on such a compact representation easily allows for generating 1K resolution videos. We then adopt a two-stage video generation paradigm, which performs text-to-image and text-image-to-video sequentially. Extensive experiments show that our Reducio-DiT achieves strong performance in evaluation, though trained with limited GPU resources. More importantly, our method significantly boosts the efficiency of video LDMs both in training and inference. We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024×1024 video clip within 15.5 seconds on a single A100 GPU. Code released at this https URL.

[CV-2] Find Any Part in 3D

Link: https://arxiv.org/abs/2411.13550
Authors: Ziqi Ma, Yisong Yue, Georgia Gkioxari
Keywords: text query, study open-world part, part segmentation, part, object based
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL

Abstract:We study open-world part segmentation in 3D: segmenting any part in any object based on any text query. Prior methods are limited in object categories and part vocabularies. Recent advances in AI have demonstrated effective open-world recognition capabilities in 2D. Inspired by this progress, we propose an open-world, direct-prediction model for 3D part segmentation that can be applied zero-shot to any object. Our approach, called Find3D, trains a general-category point embedding model on large-scale 3D assets from the internet without any human annotation. It combines a data engine, powered by foundation models for annotating data, with a contrastive training method. We achieve strong performance and generalization across multiple datasets, with up to a 3x improvement in mIoU over the next best method. Our model is 6x to over 300x faster than existing baselines. To encourage research in general-category open-world 3D part segmentation, we also release a benchmark for general objects and parts. Project website: this https URL

[CV-3] Generating 3D-Consistent Videos from Unposed Internet Photos

Link: https://arxiv.org/abs/2411.13549
Authors: Gene Chou, Kai Zhang, Sai Bi, Hao Tan, Zexiang Xu, Fujun Luan, Bharath Hariharan, Noah Snavely
Keywords: address the problem, problem of generating, Luma Dream Machine, unposed internet photos, internet photos
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model’s ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.

[CV-4] HF-Diff: High-Frequency Perceptual Loss and Distribution Matching for One-Step Diffusion-Based Image Super-Resolution

Link: https://arxiv.org/abs/2411.13548
Authors: Shoaib Meraj Sami, Md Mahedi Hasan, Jeremy Dawson, Nasser Nasrabadi
Keywords: recent diffusion-based single-step, diffusion-based single-step super-resolution, super-resolution methods achieve, single-step super-resolution methods, computationally complex
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages

Abstract:Although recent diffusion-based single-step super-resolution methods achieve better performance than SinSR, they are computationally complex. To improve the performance of SinSR, we investigate preserving high-frequency detail features during super-resolution (SR), because the degraded images lack detailed information. For this purpose, we introduce a high-frequency perceptual loss by utilizing an invertible neural network (INN) pretrained on the ImageNet dataset. Different feature maps of the pretrained INN capture different high-frequency aspects of an image. During training, we enforce preservation of the high-frequency features of the super-resolved and ground-truth (GT) images, which improves SR image quality during inference. Furthermore, we utilize the Jensen-Shannon divergence between GT and SR images in the pretrained DINO-v2 embedding space to match their distributions. By introducing the high-frequency preserving loss and the distribution matching constraint in single-step diffusion-based SR (HF-Diff), we achieve a state-of-the-art CLIPIQA score on the benchmark RealSR, RealSet65, DIV2K-Val, and ImageNet datasets. Furthermore, experimental results on several datasets demonstrate that our high-frequency perceptual loss yields better SR image quality than LPIPS and VGG-based perceptual losses. Our code will be released at this https URL.
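
The distribution matching term mentioned in this abstract relies on the Jensen-Shannon divergence. The sketch below shows only the standard definition of that divergence for two discrete distributions. How HF-Diff turns DINO-v2 embeddings into distributions is not stated in the abstract, so the `softmax` step and the example embedding values are assumptions made purely for illustration.

```python
import math


def softmax(logits):
    """Turn a real-valued vector into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical embeddings of a ground-truth and a super-resolved image.
gt_embedding = [0.2, 1.5, -0.3, 0.9]
sr_embedding = [0.1, 1.2, -0.1, 1.0]
loss = js_divergence(softmax(gt_embedding), softmax(sr_embedding))
print(f"JS divergence: {loss:.6f}")
```

Unlike the KL divergence, this quantity is symmetric in its two arguments and bounded above by ln 2, which makes it a convenient choice for a matching loss.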

[CV-5] Pushing the Limits of Sparsity: A Bag of Tricks for Extreme Pruning

Link: https://arxiv.org/abs/2411.13545
Authors: Andy Li, Aiden Durrant, Milan Markovic, Lu Yin, Georgios Leontidis
Keywords: reducing model size, Pruning of deep, deep neural networks, reducing model, model size
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 3 tables

Abstract:Pruning of deep neural networks has been an effective technique for reducing model size while preserving most of the performance of dense networks, crucial for deploying models on memory and power-constrained devices. While recent sparse learning methods have shown promising performance up to moderate sparsity levels such as 95% and 98%, accuracy quickly deteriorates when pushing sparsities to extreme levels. Obtaining sparse networks at such extreme sparsity levels presents unique challenges, such as fragile gradient flow and heightened risk of layer collapse. In this work, we explore network performance beyond the commonly studied sparsities, and propose a collection of techniques that enable the continuous learning of networks without accuracy collapse even at extreme sparsities, including 99.90%, 99.95% and 99.99% on ResNet architectures. Our approach combines 1) Dynamic ReLU phasing, where DyReLU initially allows for richer parameter exploration before being gradually replaced by standard ReLU, 2) weight sharing which reuses parameters within a residual layer while maintaining the same number of learnable parameters, and 3) cyclic sparsity, where both sparsity levels and sparsity patterns evolve dynamically throughout training to better encourage parameter exploration. We evaluate our method, which we term Extreme Adaptive Sparse Training (EAST) at extreme sparsities using ResNet-34 and ResNet-50 on CIFAR-10, CIFAR-100, and ImageNet, achieving significant performance improvements over state-of-the-art methods we compared with.
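
To give a feel for the sparsity levels quoted in this abstract, the sketch below counts how many weights survive at each level and shows one possible cyclic schedule. The abstract does not specify EAST's actual schedule, so the cosine form, the function names, and the one-million-parameter example are all assumptions.

```python
import math


def cyclic_sparsity(step, period, s_low, s_high):
    """Cosine cycle: s_low at the start of each period, s_high at its midpoint."""
    phase = (step % period) / period
    return s_low + (s_high - s_low) * 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))


def remaining_weights(n_params, sparsity):
    """Number of weights left non-zero at a given sparsity level."""
    return round(n_params * (1.0 - sparsity))


# Sparsity levels named in the abstract, applied to a hypothetical 1M-weight model.
for s in (0.999, 0.9995, 0.9999):
    kept = remaining_weights(1_000_000, s)
    print(f"sparsity {s:.2%}: {kept} of 1,000,000 weights kept")

# One hypothetical cycle oscillating between 99% and 99.99% sparsity.
for step in (0, 25, 50, 75, 100):
    print(step, f"{cyclic_sparsity(step, 100, 0.99, 0.9999):.4%}")
```

At 99.99% sparsity only one weight in ten thousand remains, which is why gradient flow and layer collapse become the dominant concerns.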

[CV-6] DIS-Mine: Instance Segmentation for Disaster-Awareness in Poor-Light Condition in Underground Mines

Link: https://arxiv.org/abs/2411.13544
Authors: Mizanur Rahman Jewel, Mohamed Elmahallawy, Sanjay Madria, Samuel Frimpong
Keywords: explosions and structural, structural damage, instance segmentation, Detecting disasters, underground mining
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Detecting disasters in underground mining, such as explosions and structural damage, has been a persistent challenge over the years. This problem is compounded for first responders, who often have no clear information about the extent or nature of the damage within the mine. The poor-light or even total darkness inside the mines makes rescue efforts incredibly difficult, leading to a tragic loss of life. In this paper, we propose a novel instance segmentation method called DIS-Mine, specifically designed to identify disaster-affected areas within underground mines under low-light or poor visibility conditions, aiding first responders in rescue efforts. DIS-Mine is capable of detecting objects in images, even in complete darkness, by addressing challenges such as high noise, color distortions, and reduced contrast. The key innovations of DIS-Mine are built upon four core components: i) Image brightness improvement, ii) Instance segmentation with SAM integration, iii) Mask R-CNN-based segmentation, and iv) Mask alignment with feature matching. On top of that, we have collected real-world images from an experimental underground mine, introducing a new dataset named ImageMine, specifically gathered in low-visibility conditions. This dataset serves to validate the performance of DIS-Mine in realistic, challenging environments. Our comprehensive experiments on the ImageMine dataset, as well as on various other datasets demonstrate that DIS-Mine achieves a superior F1 score of 86.0% and mIoU of 72.0%, outperforming state-of-the-art instance segmentation methods, with at least 15x improvement and up to 80% higher precision in object detection.

[CV-7] Geometric Algebra Planes: Convex Implicit Neural Volumes

Link: https://arxiv.org/abs/2411.13525
Authors: Irmak Sivgin, Sara Fridovich-Keil, Gordon Wetzstein, Mert Pilanci
Keywords: Volume parameterizations abound, recent literature, parameterizations abound, abound in recent, classic voxel grid
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

Abstract:Volume parameterizations abound in recent literature, from the classic voxel grid to the implicit neural representation and everything in between. While implicit representations have shown impressive capacity and better memory efficiency compared to voxel grids, to date they require training via nonconvex optimization. This nonconvex training process can be slow to converge and sensitive to initialization and hyperparameter choices that affect the final converged result. We introduce a family of models, GA-Planes, that is the first class of implicit neural volume representations that can be trained by convex optimization. GA-Planes models include any combination of features stored in tensor basis elements, followed by a neural feature decoder. They generalize many existing representations and can be adapted for convex, semiconvex, or nonconvex training as needed for different inverse problems. In the 2D setting, we prove that GA-Planes is equivalent to a low-rank plus low-resolution matrix factorization; we show that this approximation outperforms the classic low-rank plus sparse decomposition for fitting a natural image. In 3D, we demonstrate GA-Planes’ competitive performance in terms of expressiveness, model size, and optimizability across three volume fitting tasks: radiance field reconstruction, 3D segmentation, and video segmentation.
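
The 2D result in this abstract describes a "low-rank plus low-resolution" factorization. The toy sketch below illustrates one reading of that structure: a sum of rank-1 outer products added to a coarse grid that is upsampled to full size. The nearest-neighbor upsampling and all function names are assumptions; the paper's exact construction is not given in the abstract.

```python
def low_rank(us, vs):
    """Sum of rank-1 outer products u_k v_k^T, returned as a list of rows."""
    rows, cols = len(us[0]), len(vs[0])
    return [
        [sum(u[i] * v[j] for u, v in zip(us, vs)) for j in range(cols)]
        for i in range(rows)
    ]


def upsample_nearest(grid, factor):
    """Repeat every coarse cell `factor` times along both axes."""
    rows, cols = len(grid) * factor, len(grid[0]) * factor
    return [[grid[i // factor][j // factor] for j in range(cols)] for i in range(rows)]


def low_rank_plus_low_res(us, vs, grid, factor):
    """Approximate a full-resolution matrix as low-rank + upsampled coarse grid."""
    a = low_rank(us, vs)
    b = upsample_nearest(grid, factor)
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# A 4x4 matrix described by 8 line-feature values and a 2x2 coarse grid.
us = [[1, 2, 3, 4]]           # one column factor (rank 1)
vs = [[1, 0, 1, 0]]           # one row factor
coarse = [[10, 20], [30, 40]]
for row in low_rank_plus_low_res(us, vs, coarse, factor=2):
    print(row)
```

Here 12 stored numbers describe a 16-entry matrix; the saving grows quickly with resolution, since the line features scale linearly with side length while the coarse grid stays small.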

[CV-8] VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

Link: https://arxiv.org/abs/2411.13503
Authors: Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu
Keywords: witnessed significant advancements, Video generation, Video, generation, significant advancements
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Leaderboard: this https URL. Code: this https URL. Project page: this https URL. Extension of arXiv:2311.17982. arXiv admin note: substantial text overlap with arXiv:2311.17982

Abstract:Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end, we present VBench, a comprehensive benchmark suite that dissects “video generation quality” into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. VBench has several appealing properties: 1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation (e.g., subject identity inconsistency, motion smoothness, temporal flickering, and spatial relationship, etc). The evaluation metrics with fine-grained levels reveal individual models’ strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmarks’ alignment with human perception, for each evaluation dimension respectively. 3) Valuable Insights: We look into current models’ ability across various evaluation dimensions, and various content types. We also investigate the gaps between video and image generation models. 4) Versatile Benchmarking: VBench++ supports evaluating text-to-video and image-to-video. We introduce a high-quality Image Suite with an adaptive aspect ratio to enable fair evaluations across different image-to-video generation settings. Beyond assessing technical quality, VBench++ evaluates the trustworthiness of video generative models, providing a more holistic view of model performance. 5) Full Open-Sourcing: We fully open-source VBench++ and continually add new video generation models to our leaderboard to drive forward the field of video generation.

[CV-9] Quantum-Brain: Quantum-Inspired Neural Network Approach to Vision-Brain Understanding

Link: https://arxiv.org/abs/2411.13378
Authors: Hoang-Quan Nguyen, Xuan-Bac Nguyen, Hugh Churchill, Arabinda Kumar Choudhary, Pawan Sinha, Samee U. Khan, Khoa Luu
Keywords: Vision-brain understanding aims, Vision-brain understanding, brain signals, aims to extract, vision-brain understanding problem
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision-brain understanding aims to extract semantic information about brain signals from human perceptions. Existing deep learning methods for vision-brain understanding are usually introduced in a traditional learning paradigm missing the ability to learn the connectivities between brain regions. Meanwhile, the quantum computing theory offers a new paradigm for designing deep learning models. Motivated by the connectivities in the brain signals and the entanglement properties in quantum computing, we propose a novel Quantum-Brain approach, a quantum-inspired neural network, to tackle the vision-brain understanding problem. To compute the connectivity between areas in brain signals, we introduce a new Quantum-Inspired Voxel-Controlling module to learn the impact of a brain voxel on others represented in the Hilbert space. To effectively learn connectivity, a novel Phase-Shifting module is presented to calibrate the value of the brain signals. Finally, we introduce a new Measurement-like Projection module to present the connectivity information from the Hilbert space into the feature space. The proposed approach can learn to find the connectivities between fMRI voxels and enhance the semantic information obtained from human perceptions. Our experimental results on the Natural Scene Dataset benchmarks illustrate the effectiveness of the proposed method with Top-1 accuracies of 95.1% and 95.6% on image and brain retrieval tasks and an Inception score of 95.3% on fMRI-to-image reconstruction task. Our proposed quantum-inspired network brings a potential paradigm to solving the vision-brain problems via the quantum computing theory.

[CV-10] Learning based Geez character handwritten recognition

Link: https://arxiv.org/abs/2411.13350
Authors: Hailemicael Lulseged Yimer, Hailegabriel Dereje Degefa, Marco Cristani, Federico Cunico
Keywords: ancient Ethiopic script, Convolutional Neural Networks, ancient Ethiopic, Ethiopic script, Ge’ez handwriting recognition
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Ge’ez, an ancient Ethiopic script of cultural and historical significance, has been largely neglected in handwriting recognition research, hindering the digitization of valuable manuscripts. Our study addresses this gap by developing a state-of-the-art Ge’ez handwriting recognition system using Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. Our approach uses a two-stage recognition process. First, a CNN is trained to recognize individual characters; it then acts as a feature extractor for an LSTM-based word recognition system. Our dual-stage recognition approach achieves new top scores in Ge’ez handwriting recognition, outperforming eight state-of-the-art methods, including SVTR and ASTER, as well as human performance as measured in the HHD-Ethiopic dataset work. This research significantly advances the preservation and accessibility of Ge’ez cultural heritage, with implications for historical document digitization, educational tools, and cultural preservation. The code will be released upon acceptance.

[CV-11] WHALES: A Multi-agent Scheduling Dataset for Enhanced Cooperation in Autonomous Driving

Link: https://arxiv.org/abs/2411.13340
Authors: Siwei Chen, Yinsong (Richard) Wang, Ziyi Song, Sheng Zhou
Keywords: Achieving high levels, Achieving high, critical challenge, standalone systems, limited perception ranges
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Achieving high levels of safety and reliability in autonomous driving remains a critical challenge, especially due to occlusion and limited perception ranges in standalone systems. Cooperative perception among vehicles offers a promising solution, but existing research is hindered by datasets with a limited number of agents. Scaling up the number of cooperating agents is non-trivial and introduces significant computational and technical hurdles that have not been addressed in previous works. To bridge this gap, we present Wireless enHanced Autonomous vehicles with Large number of Engaged agentS (WHALES), a dataset generated using the CARLA simulator that features an unprecedented average of 8.4 agents per driving sequence. In addition to providing the largest number of agents and viewpoints among autonomous driving datasets, WHALES records agent behaviors, enabling cooperation across multiple tasks. This expansion allows for new supporting tasks in cooperative perception. As a demonstration, we conduct experiments on the agent scheduling task, where the ego agent selects one of multiple candidate agents to cooperate with, optimizing perception gains in autonomous driving. The WHALES dataset and codebase can be found at this https URL.
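
The agent scheduling task described in this abstract, where the ego agent picks one cooperator to maximize perception gain, can be pictured with a simple greedy baseline. The gain measure used here (objects a candidate sees that the ego does not) and every identifier in the sketch are hypothetical; the abstract does not describe the scheduling algorithms evaluated on WHALES.

```python
def select_cooperator(ego_visible, candidates):
    """Pick the candidate that contributes the most objects unseen by the ego.

    ego_visible: set of object ids the ego agent already perceives.
    candidates:  dict mapping agent id -> set of object ids that agent perceives.
    Returns (agent_id, gain); ties go to the alphabetically first agent id.
    """
    best_id, best_gain = None, -1
    for agent_id in sorted(candidates):
        gain = len(candidates[agent_id] - ego_visible)
        if gain > best_gain:
            best_id, best_gain = agent_id, gain
    return best_id, best_gain


# Hypothetical scene: the ego sees two objects, three agents offer their views.
ego = {"car_1", "ped_1"}
offers = {
    "agent_a": {"car_1"},
    "agent_b": {"car_2", "ped_1", "ped_2"},
    "agent_c": {"car_2"},
}
print(select_cooperator(ego, offers))
```

A real scheduler would have to estimate this gain before receiving the other agent's data, which is what makes the task non-trivial.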

[CV-12] Teaching VLMs to Localize Specific Objects from In-context Examples

Link: https://arxiv.org/abs/2411.13317
Authors: Sivan Doveh, Nimrod Shabtay, Wei Lin, Eli Schwartz, Hilde Kuehne, Raja Giryes, Rogerio Feris, Leonid Karlinsky, James Glass, Assaf Arbelle, Shimon Ullman, M. Jehanzeb Mirza
Keywords: Visual Question Answering, Question Answering, shown remarkable capabilities, including image recognition, Visual Question
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples) – each with a category label and bounding box – and is tasked with localizing the same object type in a query image. To provoke personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances few-shot localization performance without sacrificing generalization, as demonstrated on several benchmarks tailored to personalized localization. This work is the first to explore and benchmark personalized few-shot localization for VLMs, laying a foundation for future research in context-driven vision-language applications. The code for our project is available at this https URL

[CV-13] Can Reasons Help Improve Pedestrian Intent Estimation? A Cross-Modal Approach

Link: https://arxiv.org/abs/2411.13302
Authors: Vaishnavi Khindkar, Vineeth Balasubramanian, Chetan Arora, Anbumani Subramanian, C.V. Jawahar
Keywords: Vulnerable Road Users, Vulnerable Road, autonomous navigation systems, safety of Vulnerable, Road Users
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the increased importance of autonomous navigation systems has come an increasing need to protect the safety of Vulnerable Road Users (VRUs) such as pedestrians. Predicting pedestrian intent is one such challenging task, where prior work predicts the binary cross/no-cross intention with a fusion of visual and motion features. However, there has been no effort so far to hedge such predictions with human-understandable reasons. We address this issue by introducing a novel problem setting of exploring the intuitive reasoning behind a pedestrian’s intent. In particular, we show that predicting the ‘WHY’ can be very useful in understanding the ‘WHAT’. To this end, we propose a novel, reason-enriched PIE++ dataset consisting of multi-label textual explanations/reasons for pedestrian intent. We also introduce a novel multi-task learning framework called MINDREAD, which leverages a cross-modal representation learning framework for predicting pedestrian intent as well as the reason behind the intent. Our comprehensive experiments show significant improvement of 5.6% and 7% in accuracy and F1-score for the task of intent prediction on the PIE++ dataset using MINDREAD. We also achieved a 4.4% improvement in accuracy on a commonly used JAAD dataset. Extensive evaluation using quantitative/qualitative metrics and user studies shows the effectiveness of our approach.

[CV-14] DATAP-SfM: Dynamic-Aware Tracking Any Point for Robust Structure from Motion in the Wild

Link: https://arxiv.org/abs/2411.13291
Authors: Weicai Ye, Xinyu Chen, Ruohao Zhan, Di Huang, Xiaoshui Huang, Haoyi Zhu, Hujun Bao, Wanli Ouyang, Tong He, Guofeng Zhang
Keywords: robust pipeline, clouds for casual, smooth camera trajectories, point, obtain point trajectories
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild. Traditional frameworks, such as ParticleSfM, address this problem by sequentially computing the optical flow between adjacent frames to obtain point trajectories. They then remove dynamic trajectories through motion segmentation and perform global bundle adjustment. However, the process of estimating optical flow between two adjacent frames and chaining the matches can introduce cumulative errors. Additionally, motion segmentation combined with single-view depth estimation often faces challenges related to scale ambiguity. To tackle these challenges, we propose a dynamic-aware tracking any point (DATAP) method that leverages consistent video depth and point tracking. Specifically, our DATAP addresses these issues by estimating dense point tracking across the video sequence and predicting the visibility and dynamics of each point. By incorporating the consistent video depth prior, the performance of motion segmentation is enhanced. With the integration of DATAP, it becomes possible to estimate and optimize all camera poses simultaneously by performing global bundle adjustment on point tracks classified as static and visible, rather than relying on incremental camera registration. Extensive experiments on dynamic sequences, e.g., Sintel and TUM RGBD dynamic sequences, and on in-the-wild videos, e.g., DAVIS, demonstrate that the proposed method achieves state-of-the-art performance in terms of camera pose estimation even in complex, challenging dynamic scenes.

[CV-15] Unbiased Scene Graph Generation by Type-Aware Message Passing on Heterogeneous and Dual Graphs

Link: https://arxiv.org/abs/2411.13287
Authors: Guanglu Sun, Jin Qiu, Lili Liang
Keywords: scene graph generation, unbiased scene graph, graph generation, scene graph, Interactive Graph Construction
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Although great progress has been made in unbiased scene graph generation, several issues still hinder improving the predictive performance of both head and tail classes. An unbiased scene graph generation method (TA-HDG) is proposed to address these issues. To model interactive and non-interactive relations, Interactive Graph Construction models the dependence of relations on objects by combining heterogeneous and dual graphs when modeling relations between multiple objects. It also implements a subject-object pair selection strategy to reduce meaningless edges. Moreover, Type-Aware Message Passing enhances the understanding of complex interactions by capturing intra- and inter-type context in the Intra-Type and Inter-Type stages. The Intra-Type stage captures the semantic context of inter-relations and inter-objects. On this basis, the Inter-Type stage captures the context between objects and relations for interactive and non-interactive relations, respectively. Experiments on two datasets show that TA-HDG achieves improvements in the metrics of R@K and mR@K, which shows that TA-HDG can accurately predict the tail classes while maintaining competitive performance on the head classes.
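
The R@K and mR@K metrics cited in this abstract differ in how they treat rare predicates, which is why both are reported for unbiased scene graph generation. The sketch below computes them for a single image under simplifying assumptions: triplets are matched by exact equality, and mean recall is averaged over the predicates in that one image rather than over a whole dataset. The example triplets are invented.

```python
from collections import defaultdict


def recall_at_k(ranked_preds, gt_triplets, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    top_k = set(ranked_preds[:k])
    return sum(1 for t in gt_triplets if t in top_k) / len(gt_triplets)


def mean_recall_at_k(ranked_preds, gt_triplets, k):
    """Recall computed per predicate class, then averaged over classes."""
    top_k = set(ranked_preds[:k])
    hits, totals = defaultdict(int), defaultdict(int)
    for subj, predicate, obj in gt_triplets:
        totals[predicate] += 1
        if (subj, predicate, obj) in top_k:
            hits[predicate] += 1
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


# Ground truth has a frequent predicate ("on") and two rare ones.
gt = [
    ("man", "on", "horse"),
    ("man", "on", "street"),
    ("man", "wearing", "hat"),
    ("horse", "eating", "grass"),
]
# A hypothetical model that favors the frequent predicate.
ranked = [
    ("man", "on", "horse"),
    ("man", "on", "street"),
    ("dog", "on", "street"),
    ("man", "wearing", "hat"),
]
print("R@3 :", recall_at_k(ranked, gt, 3))
print("mR@3:", mean_recall_at_k(ranked, gt, 3))
```

A model that only gets the frequent predicate right scores well on R@K but poorly on mR@K, so improving both at once indicates gains on tail classes without sacrificing head classes.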

[CV-16] Paying more attention to local contrast: improving infrared small target detection performance via prior knowledge

Link: https://arxiv.org/abs/2411.13260
Authors: Peichao Wang, Jiabao Wang, Yao Chen, Rui Zhang, Yang Li, Zhuang Miao
Keywords: Local Contrast Attention, infrared small target, deep learning methods, Local Contrast Enhancement, Local Contrast
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures

Abstract:Data-driven methods for infrared small target detection (IRSTD) have achieved promising results. However, due to the small scale of infrared small target datasets and the limited number of pixels occupied by the targets themselves, it is challenging for deep learning methods to learn directly from these samples. Utilizing human expert knowledge to assist deep learning methods in better learning is worthy of exploration. To effectively guide the model to focus on targets’ spatial features, this paper proposes the Local Contrast Attention Enhanced infrared small target detection Network (LCAE-Net), combining prior knowledge with data-driven deep learning methods. LCAE-Net is a U-shaped neural network model which consists of two developed modules: a Local Contrast Enhancement (LCE) module and a Channel Attention Enhancement (CAE) module. The LCE module takes advantage of prior knowledge, leveraging a handcrafted convolution operator to acquire Local Contrast Attention (LCA), which suppresses the background while enhancing the potential target region, thus guiding the neural network to pay more attention to the location information of potential infrared small targets. To effectively utilize the response information throughout the downsampling process, the CAE module is proposed to fuse information among the different channels of the feature maps. Experimental results indicate that our LCAE-Net outperforms existing state-of-the-art methods on the three public datasets NUDT-SIRST, NUAA-SIRST, and IRSTD-1K, and its detection speed can reach up to 70 fps. Meanwhile, our model has a parameter count and Floating-Point Operations (FLOPs) of 1.945M and 4.862G respectively, which makes it suitable for deployment on edge devices.
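
The idea of a handcrafted local contrast operator, as used by the LCE module in this abstract, can be illustrated with a very small example. The operator below (center pixel minus the mean of its eight neighbors, clipped at zero) is an assumed stand-in chosen for simplicity; the abstract does not specify the operator LCAE-Net actually uses.

```python
def local_contrast(image):
    """Center-minus-neighborhood-mean response, clipped at zero.

    image: 2D list of pixel intensities. Border pixels are left at 0.0
    because they lack a full 3x3 neighborhood.
    """
    height, width = len(image), len(image[0])
    response = [[0.0] * width for _ in range(height)]
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    for i in range(1, height - 1):
        for j in range(1, width - 1):
            neighbor_mean = sum(image[i + di][j + dj] for di, dj in offsets) / 8
            response[i][j] = max(0.0, image[i][j] - neighbor_mean)
    return response


# A flat background of intensity 10 with one bright pixel, mimicking a small target.
frame = [[10] * 5 for _ in range(5)]
frame[2][2] = 90
for row in local_contrast(frame):
    print(row)
```

The flat background maps to zero while the single bright pixel stands out, which is the background suppression and target enhancement behavior the abstract attributes to Local Contrast Attention.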

[CV-17] ViSTa Dataset: Do vision-language models understand sequential tasks?

Link: https://arxiv.org/abs/2411.13211
Authors: Evžen Wybitul, Evan Ryan Gunter, Mikhail Seleznyov
Keywords: reinforcement learning holds, learning holds promise, improving safety, reward models, reinforcement learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs’ potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure – basic single-step tasks composed into more and more complex sequential tasks – allows a fine-grained understanding of how well VLMs can judge tasks with varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We find that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.

[CV-18] An Integrated Approach to Robotic Object Grasping and Manipulation

Link: https://arxiv.org/abs/2411.13205
Authors: Owais Ahmed,M Huzaifa,M Areeb,Hamza Ali Khan
Keywords: Amazon has embarked, manual labor, labor and efficiency, transformation by incorporating, warehouse operations
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages

Abstract:In response to the growing challenges of manual labor and efficiency in warehouse operations, Amazon has embarked on a significant transformation by incorporating robotics to assist with various tasks. While a substantial number of robots have been successfully deployed for tasks such as item transportation within warehouses, the complex process of object picking from shelves remains a significant challenge. This project addresses the issue by developing an innovative robotic system capable of autonomously fulfilling a simulated order by efficiently selecting specific items from shelves. A distinguishing feature of the proposed robotic system is its capacity to navigate the challenge of uncertain object positions within each bin of the shelf. The system is engineered to autonomously adapt its approach, employing strategies that enable it to efficiently locate and retrieve the desired items, even in the absence of pre-established knowledge about their placements.

[CV-19] VADet: Multi-frame LiDAR 3D Object Detection using Variable Aggregation WACV2025

Link: https://arxiv.org/abs/2411.13186
Authors: Chengjie Huang,Vahdat Abdelzad,Sean Sedwards,Krzysztof Czarnecki
Keywords: Input aggregation, Variable Aggregation Detection, simple technique, improve detection, Input
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV 2025

Abstract:Input aggregation is a simple technique used by state-of-the-art LiDAR 3D object detectors to improve detection. However, increasing aggregation is known to have diminishing returns and even performance degradation, due to objects responding differently to the number of aggregated frames. To address this limitation, we propose an efficient adaptive method, which we call Variable Aggregation Detection (VADet). Instead of aggregating the entire scene using a fixed number of frames, VADet performs aggregation per object, with the number of frames determined by an object’s observed properties, such as speed and point density. VADet thus reduces the inherent trade-offs of fixed aggregation and is not architecture specific. To demonstrate its benefits, we apply VADet to three popular single-stage detectors and achieve state-of-the-art performance on the Waymo dataset.

[CV-20] Click; Single Object Tracking; Video Object Segmentation; Real-time Interaction

Link: https://arxiv.org/abs/2411.13183
Authors: Kuiran Wang,Xuehui Yu,Wenwen Yu,Guorong Li,Xiangyuan Lan,Qixiang Ye,Jianbin Jiao,Zhenjun Han
Keywords: Single object tracking, Single object, single object trackers, precise object bounding, initializing single object
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Single object tracking (SOT) relies on precise object bounding box initialization. In this paper, we reconsider the deficiencies in current approaches to initializing single object trackers and propose ClickTrack, a new paradigm for single object tracking that uses clicking interaction in real-time scenarios. Moreover, a click as an input type inherently lacks hierarchical information. To address ambiguity in certain special scenarios, we design the Guided Click Refiner (GCR), which accepts a point and optional textual information as inputs and transforms the point into the bounding box expected by the operator. This bounding box is then used as the input to single object trackers. Experiments on the LaSOT and GOT-10k benchmarks show that a tracker combined with GCR achieves stable performance in real-time interactive scenarios. Furthermore, we explore the integration of GCR into the Segment Anything Model (SAM), significantly reducing ambiguity issues when SAM receives point inputs.

[CV-21] SONNET: Enhancing Time Delay Estimation by Leveraging Simulated Audio

Link: https://arxiv.org/abs/2411.13179
Authors: Erik Tegler,Magnus Oskarsson,Kalle Åström
Keywords: Time delay estimation, multiple localization applications, direction of arrival, Time delay, delay estimation
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)

Abstract:Time delay estimation, or Time-Difference-Of-Arrival estimation, is a critical component for multiple localization applications such as multilateration, direction of arrival, and self-calibration. The task is to estimate the time difference between a signal arriving at two different sensors. For the audio sensor modality, most current systems are based on classical methods such as the Generalized Cross-Correlation Phase Transform (GCC-PHAT) method. In this paper we demonstrate that learning-based methods can, even when trained on synthetic data, significantly outperform GCC-PHAT on novel real-world data. To overcome the lack of data with ground truth for the task, we train our model on a simulated dataset that is sufficiently large and varied and that captures the relevant characteristics of the real-world problem. We provide our trained model, SONNET (Simulation Optimized Neural Network Estimator of Timeshifts), which is runnable in real-time and works on novel data out of the box for many real data applications, i.e. without re-training. We further demonstrate greatly improved performance on the downstream task of self-calibration when using our model compared to classical methods.
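
For readers unfamiliar with the classical baseline named in the abstract, here is a minimal NumPy sketch of GCC-PHAT. It is not the authors' code and has nothing to do with the SONNET model itself; the function name `gcc_phat`, the sampling rate, and the synthetic white-noise signal are assumptions made purely for illustration.

```python
import numpy as np

def gcc_phat(sig, ref, fs=1.0):
    """Estimate the delay of `sig` relative to `ref` (in seconds) with GCC-PHAT.

    Illustrative sketch only: names and defaults are assumptions.
    """
    n = sig.shape[0] + ref.shape[0]        # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)             # cross-power spectrum
    cross /= np.abs(cross) + 1e-12         # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)          # generalized cross-correlation
    max_shift = n // 2
    # Reorder so that index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs

# Synthetic check: white noise delayed by 5 samples at an assumed 16 kHz rate.
rng = np.random.default_rng(0)
fs = 16000
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(5), ref[:-5]))
tau = gcc_phat(sig, ref, fs=fs)
```

The PHAT weighting discards spectral magnitude so that the correlation peak depends only on phase, which is what makes the peak sharp for broadband signals.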

[CV-22] RAW-Diffusion: RGB-Guided Diffusion Models for High-Fidelity RAW Image Generation WACV2025

Link: https://arxiv.org/abs/2411.13150
Authors: Christoph Reinders,Radu Berdan,Beril Besbinar,Junji Otsuka,Daisuke Iso
Keywords: Current deep learning, deep learning approaches, computer vision primarily, vision primarily focus, Current deep
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2025

Abstract:Current deep learning approaches in computer vision primarily focus on RGB data sacrificing information. In contrast, RAW images offer richer representation, which is crucial for precise recognition, particularly in challenging conditions like low-light environments. The resultant demand for comprehensive RAW image datasets contrasts with the labor-intensive process of creating specific datasets for individual sensors. To address this, we propose a novel diffusion-based method for generating RAW images guided by RGB images. Our approach integrates an RGB-guidance module for feature extraction from RGB inputs, then incorporates these features into the reverse diffusion process with RGB-guided residual blocks across various resolutions. This approach yields high-fidelity RAW images, enabling the creation of camera-specific RAW datasets. Our RGB2RAW experiments on four DSLR datasets demonstrate state-of-the-art performance. Moreover, RAW-Diffusion demonstrates exceptional data efficiency, achieving remarkable performance with as few as 25 training samples or even fewer. We extend our method to create BDD100K-RAW and Cityscapes-RAW datasets, revealing its effectiveness for object detection in RAW imagery, significantly reducing the amount of required RAW images.

[CV-23] Globally Correlation-Aware Hard Negative Generation

Link: https://arxiv.org/abs/2411.13145
Authors: Wenjie Peng,Hongxiang Huang,Tianshui Chen,Quhui Ke,Gang Dai,Shuangping Huang
Keywords: Hard negative generation, deep metric learning, facilitate advancing deep, advancing deep metric, generate hard negatives
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCV’24

Abstract:Hard negative generation aims to generate informative negative samples that help to determine the decision boundaries and thus facilitate advancing deep metric learning. Current works select pair/triplet samples, learn their correlations, and fuse them to generate hard negatives. However, these works merely consider the local correlations of selected samples, ignoring global sample correlations that would provide more significant information to generate more informative negatives. In this work, we propose a Globally Correlation-Aware Hard Negative Generation (GCA-HNG) framework, which first learns sample correlations from a global perspective and exploits these correlations to guide generating hardness-adaptive and diverse negatives. Specifically, this approach begins by constructing a structured graph to model sample correlations, where each node represents a specific sample and each edge represents the correlations between corresponding samples. Then, we introduce an iterative graph message propagation to propagate the messages of node and edge through the whole graph and thus learn the sample correlations globally. Finally, with the guidance of the learned global correlations, we propose a channel-adaptive manner to combine an anchor and multiple negatives for HNG. Compared to current methods, GCA-HNG allows perceiving sample correlations with numerous negatives from a global and comprehensive perspective and generates the negatives with better hardness and diversity. Extensive experiment results demonstrate that the proposed GCA-HNG is superior to related methods on four image retrieval benchmark datasets. Codes and trained models are available at this https URL.

[CV-24] TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models

Link: https://arxiv.org/abs/2411.13136
Authors: Xin Wang,Kai Chen,Jiaming Zhang,Jingjing Chen,Xingjun Ma
Keywords: Large pre-trained Vision-Language, pre-trained Vision-Language Models, Large pre-trained, Vision-Language Models, demonstrated excellent zero-shot
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Large pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated excellent zero-shot generalizability across various downstream tasks. However, recent studies have shown that the inference performance of CLIP can be greatly degraded by small adversarial perturbations, especially its visual modality, posing significant safety threats. To mitigate this vulnerability, in this paper, we propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks. TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP. Specifically, it is an unsupervised method that optimizes the defensive prompts for each test sample by minimizing a multi-view entropy and aligning adversarial-clean distributions. We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets, demonstrating that it enhances the zero-shot adversarial robustness of the original CLIP by at least 48.9% against AutoAttack (AA), while largely maintaining performance on clean examples. Moreover, TAPT outperforms existing adversarial prompt tuning methods across various backbones, achieving an average robustness improvement of at least 36.6%.
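
The abstract mentions minimizing a "multi-view entropy" but does not spell out the formulation. One common reading, assumed here and not confirmed by the source, is the entropy of the class distribution averaged over several augmented views of the same test sample. The sketch below computes only that quantity in NumPy; the function names and the toy logits are hypothetical, and no prompt optimization is shown.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multiview_entropy(logits_per_view):
    """Entropy of the view-averaged class distribution.

    logits_per_view: array of shape (num_views, num_classes).
    Assumed formulation for illustration; the paper may define it differently.
    """
    probs = softmax(np.asarray(logits_per_view, dtype=float), axis=-1)
    mean_probs = probs.mean(axis=0)
    return float(-(mean_probs * np.log(mean_probs + 1e-12)).sum())

# Toy logits: four views that agree versus four views that disagree.
agree = np.array([[10.0, 0.0, 0.0]] * 4)
disagree = np.array([[10.0, 0.0, 0.0],
                     [0.0, 10.0, 0.0],
                     [0.0, 0.0, 10.0],
                     [10.0, 0.0, 0.0]])
h_agree = multiview_entropy(agree)
h_disagree = multiview_entropy(disagree)
```

Under this reading, the objective is low when all views yield the same confident prediction and high when views disagree, so minimizing it pushes the model toward view-consistent outputs.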

[CV-25] Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images

Link: https://arxiv.org/abs/2411.13127
Authors: Xuechao Zou,Shun Zhang,Kai Li,Shiying Wang,Junliang Xing,Lei Jin,Congyan Lang,Pin Tao
Keywords: sensing image interpretation, remote sensing image, accuracy directly impacts, image interpretation, Cloud segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures

Abstract:Cloud segmentation is a critical challenge in remote sensing image interpretation, as its accuracy directly impacts the effectiveness of subsequent data processing and analysis. Recently, vision foundation models (VFM) have demonstrated powerful generalization capabilities across various visual tasks. In this paper, we present a parameter-efficient adaptive approach, termed Cloud-Adapter, designed to enhance the accuracy and robustness of cloud segmentation. Our method leverages a VFM pretrained on general domain data, which remains frozen, eliminating the need for additional training. Cloud-Adapter incorporates a lightweight spatial perception module that initially utilizes a convolutional neural network (ConvNet) to extract dense spatial representations. These multi-scale features are then aggregated and serve as contextual inputs to an adapting module, which modulates the frozen transformer layers within the VFM. Experimental results demonstrate that the Cloud-Adapter approach, utilizing only 0.6% of the trainable parameters of the frozen backbone, achieves substantial performance gains. Cloud-Adapter consistently attains state-of-the-art (SOTA) performance across a wide variety of cloud segmentation datasets from multiple satellite sources, sensor series, data processing levels, land cover scenarios, and annotation granularities. We have released the source code and pretrained models at this https URL to support further research.

[CV-26] Virtual Staining of Label-Free Tissue in Imaging Mass Spectrometry

Link: https://arxiv.org/abs/2411.13120
Authors: Yijie Zhang,Luzhe Huang,Nir Pillar,Yuzhu Li,Lukasz G. Migas,Raf Van de Plas,Jeffrey M. Spraggins,Aydogan Ozcan
Keywords: highly multiplexed molecular, multiplexed molecular mapping, tool for untargeted, highly multiplexed, powerful tool
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph); Optics (physics.optics)
Comments: 33 pages, 6 figures

Abstract:Imaging mass spectrometry (IMS) is a powerful tool for untargeted, highly multiplexed molecular mapping of tissue in biomedical research. IMS offers a means of mapping the spatial distributions of molecular species in biological tissue with unparalleled chemical specificity and sensitivity. However, most IMS platforms are not able to achieve microscopy-level spatial resolution and lack cellular morphological contrast, necessitating subsequent histochemical staining, microscopic imaging and advanced image registration steps to enable molecular distributions to be linked to specific tissue features and cell types. Here, we present a virtual histological staining approach that enhances spatial resolution and digitally introduces cellular morphological contrast into mass spectrometry images of label-free human tissue using a diffusion model. Blind testing on human kidney tissue demonstrated that the virtually stained images of label-free samples closely match their histochemically stained counterparts (with Periodic Acid-Schiff staining), showing high concordance in identifying key renal pathology structures despite utilizing IMS data with 10-fold larger pixel size. Additionally, our approach employs an optimized noise sampling technique during the diffusion model’s inference process to reduce variance in the generated images, yielding reliable and repeatable virtual staining. We believe this virtual staining method will significantly expand the applicability of IMS in life sciences and open new avenues for mass spectrometry-based biomedical research.

[CV-27] DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

Link: https://arxiv.org/abs/2411.13112
Authors: Xianda Guo,Ruijun Zhang,Yiqun Duan,Yuhang He,Chenming Zhang,Shuai Liu,Long Chen
Keywords: facilitate high-level tasks, Autonomous driving requires, Autonomous driving, environments to facilitate, motion prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code will be available at this https URL

Abstract:Autonomous driving requires a comprehensive understanding of 3D environments to facilitate high-level tasks such as motion prediction, planning, and mapping. In this paper, we introduce DriveMLLM, a benchmark specifically designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving. DriveMLLM includes 2,734 front-facing camera images and introduces both absolute and relative spatial reasoning tasks, accompanied by linguistically diverse natural language questions. To measure MLLMs’ performance, we propose novel evaluation metrics focusing on spatial understanding. We evaluate several state-of-the-art MLLMs on DriveMLLM, and our results reveal the limitations of current models in understanding complex spatial relationships in driving contexts. We believe these findings underscore the need for more advanced MLLM-based spatial reasoning methods and highlight the potential for DriveMLLM to drive further research in autonomous driving. Code will be available at this https URL.

[CV-28] Superpixel Cost Volume Excitation for Stereo Matching

Link: https://arxiv.org/abs/2411.13105
Authors: Shanglong Liu,Lin Qi,Junyu Dong,Wenxiang Gu,Liyi Xu
Keywords: intrinsic local consistency, superpixel soft constraints, predicted disparity maps, soft constraints, concentrate on exciting
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures

Abstract:In this work, we concentrate on exciting the intrinsic local consistency of stereo matching through the incorporation of superpixel soft constraints, with the objective of mitigating inaccuracies at the boundaries of predicted disparity maps. Our approach capitalizes on the observation that neighboring pixels are predisposed to belong to the same object and exhibit closely similar intensities within the probability volume of superpixels. By incorporating this insight, our method encourages the network to generate consistent probability distributions of disparity within each superpixel, aiming to improve the overall accuracy and coherence of predicted disparity maps. Experimental evaluations on widely-used datasets validate the efficacy of our proposed approach, demonstrating its ability to assist cost volume-based matching networks in restoring competitive performance.

[CV-29] ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations

Link: https://arxiv.org/abs/2411.13089
Authors: Xulong Zhang,Xiaoyang Qu,Haoxiang Shi,Chunguang Xiao,Jianzong Wang
Keywords: Current STA models, STA model, paper proposes, designed to address, address the shortcomings
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by the 26th IEEE International Conference on High Performance Computing and Communications (HPCC2024)

Abstract:This paper proposes a novel 3D speech-to-animation (STA) generation framework designed to address the shortcomings of existing models in producing diverse and emotionally resonant animations. Current STA models often generate animations that lack emotional depth and variety, failing to align with human expectations. To overcome these limitations, we introduce a novel STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions through a cross-coupling training approach. Additionally, we develop a training methodology that leverages automatic quality evaluation of generated facial animations to guide the reinforcement learning process. This methodology encourages the STA model to explore a broader range of possibilities, resulting in the generation of diverse and emotionally expressive facial animations of superior quality. We conduct extensive empirical experiments on a benchmark dataset, and the results validate the effectiveness of our proposed framework in generating high-quality, emotionally rich 3D animations that are better aligned with human preferences.

[CV-30] Practical Compact Deep Compressed Sensing

Link: https://arxiv.org/abs/2411.13081
Authors: Bin Chen,Jian Zhang
Keywords: gained growing attention, Recent years, compressed sensing, years have witnessed, witnessed the success
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by IEEE T-PAMI

Abstract:Recent years have witnessed the success of deep networks in compressed sensing (CS), which allows for a significant reduction in sampling cost and has gained growing attention since its inception. In this paper, we propose a new practical and compact network dubbed PCNet for general image CS. Specifically, in PCNet, a novel collaborative sampling operator is designed, which consists of a deep conditional filtering step and a dual-branch fast sampling step. The former learns an implicit representation of a linear transformation matrix into a few convolutions and first performs adaptive local filtering on the input image, while the latter then uses a discrete cosine transform and a scrambled block-diagonal Gaussian matrix to generate under-sampled measurements. Our PCNet is equipped with an enhanced proximal gradient descent algorithm-unrolled network for reconstruction. It offers flexibility, interpretability, and strong recovery performance for arbitrary sampling rates once trained. Additionally, we provide a deployment-oriented extraction scheme for single-pixel CS imaging systems, which allows for the convenient conversion of any linear sampling operator to its matrix form to be loaded onto hardware like digital micro-mirror devices. Extensive experiments on natural image CS, quantized CS, and self-supervised CS demonstrate the superior reconstruction accuracy and generalization ability of PCNet compared to existing state-of-the-art methods, particularly for high-resolution images. Code is available at this https URL.
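
The abstract refers to a network unrolled from proximal gradient descent. As background, the sketch below shows the classical, non-learned iteration (ISTA) for sparse recovery from linear measurements. It is not PCNet: the random Gaussian sampling matrix, the L1 prior, the step size, and all names are assumptions chosen only to illustrate the iteration that unrolled networks start from.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(A, y, x, lam):
    """0.5 * ||A x - y||^2 + lam * ||x||_1."""
    return 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

def ista(A, y, lam=0.05, n_iter=500):
    """Proximal gradient descent (ISTA) for the objective above."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the data-term gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient step on the data-fidelity term
        x = soft_threshold(x - grad / L, lam / L)  # proximal step on the L1 prior
    return x

# Toy problem: 40 random measurements of a 5-sparse signal of length 100.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[rng.choice(100, size=5, replace=False)] = 1.0
y = A @ x_true
x_hat = ista(A, y)
```

An unrolled network typically keeps this two-step structure per layer but replaces the fixed step size and the hand-crafted proximal operator with learned components.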

[CV-31] Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving

Link: https://arxiv.org/abs/2411.13076
Authors: Hao Zhou,Zhanning Gao,Maosheng Ye,Zhili Chen,Qifeng Chen,Tongyi Cao,Honggang Qi
Keywords: stringent safety requirements, general MLLMs combined, driving-specific scenarios accurately, represent driving-specific scenarios, combined with CLIP
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to represent driving-specific scenarios accurately, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations and enhancing multimodal reasoning for autonomous driving VQA tasks. Extensive experiments confirm the effectiveness of the HoP framework, showing it significantly outperforms previous state-of-the-art methods across all key metrics.

[CV-32] Improving OOD Generalization of Pre-trained Encoders via Aligned Embedding-Space Ensembles NEURIPS2024

Link: https://arxiv.org/abs/2411.13073
Authors: Shuman Peng,Arash Khoeini,Sharan Vaswani,Martin Ester
Keywords: poor without fine-tuning, OOD data, self-supervised pre-trained embeddings, OOD, OOD data compared
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the Self-Supervised Learning Workshop and the Unifying Representations in Neural Models Workshop at NeurIPS 2024

Abstract:The quality of self-supervised pre-trained embeddings on out-of-distribution (OOD) data is poor without fine-tuning. A straightforward and simple approach to improving the generalization of pre-trained representation to OOD data is the use of deep ensembles. However, obtaining an effective ensemble in the embedding space with only unlabeled data remains an unsolved problem. We first perform a theoretical analysis that reveals the relationship between individual hyperspherical embedding spaces in an ensemble. We then design a principled method to align these embedding spaces in an unsupervised manner. Experimental results on the MNIST dataset show that our embedding-space ensemble method improves pre-trained embedding quality on in-distribution and OOD data compared to single encoders.

[CV-33] Automatic marker-free registration based on similar tetrahedras for single-tree point clouds

Link: https://arxiv.org/abs/2411.13069
Authors: Jing Ren,Pei Wang,Hanlong Li,Yuhan Wu,Yuhang Gao,Wenxin Chen,Mingtai Zhang,Lingyun Zhang
Keywords: point cloud data, forestry survey data, point clouds, tree point cloud, ICP and NDT
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: remote sensing; terrestrial lidar; multi-scan cloud registration

Abstract:In recent years, terrestrial laser scanning technology has been widely used to collect tree point cloud data, aiding in measurements of diameter at breast height, biomass, and other forestry survey data. Since a single scan from terrestrial laser systems captures data from only one angle, multiple scans must be registered and fused to obtain complete tree point cloud data. This paper proposes a marker-free automatic registration method for single-tree point clouds based on similar tetrahedra. First, two point clouds from two scans of the same tree are used to generate tree skeletons, and key point sets are constructed from these skeletons. Tetrahedra are then filtered and matched according to similarity principles, with the vertices of the two matched tetrahedra selected as matching point pairs, thus completing the coarse registration of the point clouds from the two scans. Subsequently, the ICP method is applied to the coarse-registered leaf point clouds to obtain fine registration parameters, completing the precise registration of the two tree point clouds. Experiments were conducted using terrestrial laser scanning data from eight trees, each from a different species and with varying shapes. The proposed method was evaluated using RMSE and Hausdorff distance and compared against the traditional ICP and NDT methods. The experimental results demonstrate that the proposed method significantly outperforms both ICP and NDT in registration accuracy, achieving speeds up to 593 times and 113 times faster than ICP and NDT, respectively. In summary, the proposed method shows good robustness in single-tree point cloud registration, with significant advantages in accuracy and speed compared to traditional ICP and NDT methods, indicating excellent application prospects in practical registration scenarios.

[CV-34] Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation

Link: https://arxiv.org/abs/2411.13059
Authors: Rohith Peddi,Saurabh,Ayush Abhay Shrivastava,Parag Singla,Vibhav Gogate
Keywords: Scene Graph Generation, Scene Graph Anticipation, Spatio-Temporal Scene Graphs, Video Scene Graph, Scene Graph
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Spatio-Temporal Scene Graphs (STSGs) provide a concise and expressive representation of dynamic scenes by modelling objects and their evolving relationships over time. However, real-world visual relationships often exhibit a long-tailed distribution, causing existing methods for tasks like Video Scene Graph Generation (VidSGG) and Scene Graph Anticipation (SGA) to produce biased scene graphs. To this end, we propose ImparTail, a novel training framework that leverages curriculum learning and loss masking to mitigate bias in the generation and anticipation of spatio-temporal scene graphs. Our approach gradually decreases the dominance of the head relationship classes during training and focuses more on tail classes, leading to more balanced training. Furthermore, we introduce two new tasks, Robust Spatio-Temporal Scene Graph Generation and Robust Scene Graph Anticipation, designed to evaluate the robustness of STSG models against distribution shifts. Extensive experiments on the Action Genome dataset demonstrate that our framework significantly enhances the unbiased performance and robustness of STSG models compared to existing methods.

[CV-35] Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark

Link: https://arxiv.org/abs/2411.13056
Authors: Bing Cao,Quanhao Lu,Jiekang Feng,Pengfei Zhu,Qinghua Hu,Qilong Wang
Keywords: video object counting, Masked Autoencoder Counting, major challenge, Efficient Masked Autoencoder
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The dynamic imbalance of the fore-background is a major challenge in video object counting, which is usually caused by the sparsity of foreground objects. This often leads to severe under- and over-prediction problems and has been less studied in existing works. To tackle this issue in video object counting, we propose a density-embedded Efficient Masked Autoencoder Counting (E-MAC) framework in this paper. To effectively capture the dynamic variations across frames, we utilize an optical flow-based temporal collaborative fusion that aligns features to derive multi-frame density residuals. The counting accuracy of the current frame is boosted by harnessing the information from adjacent frames. More importantly, to empower the representation ability of dynamic foreground objects for intra-frame, we first take the density map as an auxiliary modality to perform Density-Embedded Masked mOdeling (DEMO) for multimodal self-representation learning to regress the density map. However, while DEMO contributes effective cross-modal regression guidance, it also brings in redundant background information and makes it hard to focus on foreground regions. To handle this dilemma, we further propose an efficient spatial adaptive masking derived from density maps to boost efficiency. In addition, considering most existing datasets are limited to human-centric scenarios, we first propose a large video bird counting dataset, DroneBird, in natural scenarios for migratory bird protection. Extensive experiments on three crowd datasets and our DroneBird validate our superiority against the counterparts.

[CV-36] Bounding-box Watermarking: Defense against Model Extraction Attacks on Object Detectors

Link: https://arxiv.org/abs/2411.13047
Authors: Satoru Koda,Ikuya Morikawa
Keywords: Deep neural networks, Deep neural, neural networks, users to query, models
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep neural networks (DNNs) deployed in a cloud often allow users to query models via the APIs. However, these APIs expose the models to model extraction attacks (MEAs). In this attack, the attacker attempts to duplicate the target model by abusing the responses from the API. Backdoor-based DNN watermarking is known as a promising defense against MEAs, wherein the defender injects a backdoor into extracted models via API responses. The backdoor is used as a watermark of the model; if a suspicious model has the watermark (i.e., backdoor), it is verified as an extracted model. This work focuses on object detection (OD) models. Existing backdoor attacks on OD models are not applicable for model watermarking as the defense against MEAs on a realistic threat model. Our proposed approach involves inserting a backdoor into extracted models via APIs by stealthily modifying the bounding-boxes (BBs) of objects detected in queries while keeping the OD capability. In our experiments on three OD datasets, the proposed approach succeeded in identifying the extracted models with 100% accuracy in a wide variety of experimental scenarios.

[CV-37] Attentive Contextual Attention for Cloud Removal

Link: https://arxiv.org/abs/2411.13042
Authors: Wenli Huang,Ye Deng,Yang Wu,Jinjun Wang
Keywords: prompting urgent advancements, Earth observation, remote sensing images, cloud removal technology, prompting urgent
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*Comments: 13 pages, 7 figures

Click to view abstract

Abstract:Cloud cover can significantly hinder the use of remote sensing images for Earth observation, prompting urgent advancements in cloud removal technology. Recently, deep learning strategies have shown strong potential in restoring cloud-obscured areas. These methods utilize convolution to extract intricate local features and attention mechanisms to gather long-range information, improving the overall comprehension of the scene. However, a common drawback of these approaches is that the resulting images often suffer from blurriness, artifacts, and inconsistencies. This is partly because attention mechanisms apply weights to all features based on generalized similarity scores, which can inadvertently introduce noise and irrelevant details from cloud-covered areas. To overcome this limitation and better capture relevant distant context, we introduce a novel approach named Attentive Contextual Attention (AC-Attention). This method enhances conventional attention mechanisms by dynamically learning data-driven attentive selection scores, enabling it to filter out noise and irrelevant features effectively. By integrating the AC-Attention module into the DSen2-CR cloud removal framework, we significantly improve the model’s ability to capture essential distant information, leading to more effective cloud removal. Our extensive evaluation of various datasets shows that our method outperforms existing ones regarding image reconstruction quality. Additionally, we conducted ablation studies by integrating AC-Attention into multiple existing methods and widely used network architectures. These studies demonstrate the effectiveness and adaptability of AC-Attention and reveal its ability to focus on relevant features, thereby improving the overall performance of the networks. The code is available at this https URL.

[CV-38] RobustFormer: Noise-Robust Pre-training for images and videos

Link: https://arxiv.org/abs/2411.13040
Authors: Ashish Bastola,Nishant Luitel,Hao Wang,Danda Pani Paudel,Roshani Poudel,Abolfazl Razi
Keywords: deep learning models, Discrete Wavelet Transforms, deep learning, revolutionized many areas, clean data
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 13 pages

Click to view abstract

Abstract:While deep learning models are powerful tools that revolutionized many areas, they are also vulnerable to noise as they rely heavily on learning patterns and features from the exact details of the clean data. Transformers, which have become the backbone of modern vision models, are no exception. Current Discrete Wavelet Transforms (DWT) based methods do not benefit from masked autoencoder (MAE) pre-training since the inverse DWT (iDWT) introduced in these approaches is computationally inefficient and lacks compatibility with video inputs in transformer architectures. In this work, we present RobustFormer, a method that overcomes these limitations by enabling noise-robust pre-training for both images and videos, improving the efficiency of DWT-based methods by removing the need for computationally inefficient iDWT steps and simplifying the attention mechanism. To our knowledge, the proposed method is the first DWT-based method compatible with video inputs and masked pre-training. Our experiments show that MAE-based pre-training allows us to bypass the iDWT step, greatly reducing computation. Through extensive tests on benchmark datasets, RobustFormer achieves state-of-the-art results for both image and video tasks.

[CV-39] X as Supervision: Contending with Depth Ambiguity in Unsupervised Monocular 3D Pose Estimation

Link: https://arxiv.org/abs/2411.13026
Authors: Yuchen Yang,Xuanyi Liu,Xing Gao,Zhihang Zhong,Xiao Sun
Keywords: Recent unsupervised methods, depth ambiguity issue, inherent depth ambiguity, Recent unsupervised, methods for monocular
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Recent unsupervised methods for monocular 3D pose estimation have endeavored to reduce dependence on limited annotated 3D data, but most are solely formulated in 2D space, overlooking the inherent depth ambiguity issue. Due to the information loss in 3D-to-2D projection, multiple potential depths may exist, yet only some of them are plausible in human structure. To tackle depth ambiguity, we propose a novel unsupervised framework featuring a multi-hypothesis detector and multiple tailored pretext tasks. The detector extracts multiple hypotheses from a heatmap within a local window, effectively managing the multi-solution problem. Furthermore, the pretext tasks harness 3D human priors from the SMPL model to regularize the solution space of pose estimation, aligning it with the empirical distribution of 3D human structures. This regularization is partially achieved through a GCN-based discriminator within the discriminative learning, and is further complemented with synthetic images through rendering, ensuring plausible estimations. Consequently, our approach demonstrates state-of-the-art unsupervised 3D pose estimation performance on various human datasets. Further evaluations on data scale-up and one animal dataset highlight its generalization capabilities. Code will be available at this https URL.

[CV-40] ORID: Organ-Regional Information Driven Framework for Radiology Report Generation WACV2025

Link: https://arxiv.org/abs/2411.13025
Authors: Tiancheng Gu,Kaicheng Yang,Xiang An,Ziyong Feng,Dongnan Liu,Weidong Cai
Keywords: automatically generate coherent, generate coherent textual, coherent textual analyses, Radiology Report Generation, workload of radiologists
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 13 pages, 11 figures, WACV2025

Click to view abstract

Abstract:The objective of Radiology Report Generation (RRG) is to automatically generate coherent textual analyses of diseases based on radiological images, thereby alleviating the workload of radiologists. Current AI-based methods for RRG primarily focus on modifications to the encoder-decoder model architecture. To advance these approaches, this paper introduces an Organ-Regional Information Driven (ORID) framework which can effectively integrate multi-modal information and reduce the influence of noise from unrelated organs. Specifically, based on LLaVA-Med, we first construct an RRG-related instruction dataset to improve organ-regional diagnosis description ability and obtain LLaVA-Med-RRG. After that, we propose an organ-based cross-modal fusion module to effectively combine the information from the organ-regional diagnosis description and radiology image. To further reduce the influence of noise from unrelated organs on the radiology report generation, we introduce an organ importance coefficient analysis module, which leverages Graph Neural Network (GNN) to examine the interconnections of the cross-modal information of each organ region. Extensive experiments and comparisons with state-of-the-art methods across various evaluation metrics demonstrate the superior performance of our proposed method.

[CV-41] Prior-based Objective Inference Mining Potential Uncertainty for Facial Expression Recognition

Link: https://arxiv.org/abs/2411.13024
Authors: Hanwei Liu,Huiling Cai,Qingcheng Lin,Xuefeng Li,Hui Xiao
Keywords: Annotation ambiguity caused, Annotation ambiguity, subjectivity of visual, visual judgment, major challenge
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Annotation ambiguity caused by the inherent subjectivity of visual judgment has always been a major challenge for Facial Expression Recognition (FER) tasks, particularly for large-scale datasets from in-the-wild scenarios. A potential solution is the evaluation of relatively objective emotional distributions to help mitigate the ambiguity of subjective annotations. To this end, this paper proposes a novel Prior-based Objective Inference (POI) network. This network employs prior knowledge to derive a more objective and varied emotional distribution and tackles the issue of subjective annotation ambiguity through dynamic knowledge transfer. POI comprises two key networks: Firstly, the Prior Inference Network (PIN) utilizes the prior knowledge of AUs and emotions to capture intricate motion details. To reduce over-reliance on priors and facilitate objective emotional inference, PIN aggregates inferential knowledge from various key facial subregions, encouraging mutual learning. Secondly, the Target Recognition Network (TRN) integrates subjective emotion annotations and objective inference soft labels provided by the PIN, fostering an understanding of inherent facial expression diversity, thus resolving annotation ambiguity. Moreover, we introduce an uncertainty estimation module to quantify and balance facial expression confidence. This module enables a flexible approach to dealing with the uncertainties of subjective annotations. Extensive experiments show that POI exhibits competitive performance on both synthetic noisy datasets and multiple real-world datasets. All code and training logs will be publicly available at this https URL.

[CV-42] Chanel-Orderer: A Channel-Ordering Predictor for Tri-Channel Natural Images

Link: https://arxiv.org/abs/2411.13021
Authors: Shen Li,Lei Jiang,Wei Wang,Hongwei Hu,Liang Li
Keywords: ad-hoc inductive biases, randomly permuted channel, permuted channel order, channel ordering, paper shows
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:This paper shows a proof-of-concept that, given a typical 3-channel image in a randomly permuted channel order, a model (termed Chanel-Orderer) with ad-hoc inductive biases in terms of both architecture and loss functions can accurately predict the channel ordering and knows how to make it right. Specifically, Chanel-Orderer learns to score each of the three channels with the priors of object semantics and uses the resulting scores to predict the channel ordering. This benefits a typical scenario where an RGB image is mis-displayed in the BGR format and needs to be corrected into the right order. Furthermore, as a byproduct, the resulting model Chanel-Orderer is able to tell whether a given image is a near-gray-scale image (near-monochromatic) or not (polychromatic). Our research suggests that Chanel-Orderer mimics human visual coloring of our physical natural world.
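
The "score each channel, then order by score" step described in the abstract can be illustrated with a small sketch. This is not the authors' model: the scores are supplied by hand rather than learned, and the convention that the highest score maps to R, the middle to G, and the lowest to B is purely an assumption made for the example. The function names `predict_order` and `restore_pixel` are hypothetical.

```python
from typing import Sequence, Tuple


def predict_order(scores: Sequence[float]) -> Tuple[int, int, int]:
    """Return input channel indices sorted by descending score.

    Assumed convention (not from the paper): the highest-scoring
    channel is treated as R, the next as G, and the lowest as B.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three channel scores")
    ranked = sorted(range(3), key=lambda i: scores[i], reverse=True)
    return (ranked[0], ranked[1], ranked[2])


def restore_pixel(pixel: Sequence[int], order: Tuple[int, int, int]) -> Tuple[int, ...]:
    """Reorder one pixel's channels according to the predicted order."""
    return tuple(pixel[i] for i in order)


# A pixel stored as (B, G, R) with hand-picked, hypothetical scores.
bgr_pixel = (30, 120, 200)
scores = (0.1, 0.5, 0.9)

order = predict_order(scores)
rgb_pixel = restore_pixel(bgr_pixel, order)
print(order, rgb_pixel)
```

In a real system the scores would come from the trained network, and the reordering would be applied to the whole image array rather than a single pixel.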

[CV-43] Open-World Amodal Appearance Completion

Link: https://arxiv.org/abs/2411.13019
Authors: Jiayang Ao,Yanbei Jiang,Qiuhong Ke,Krista A. Ehinger
Keywords: Understanding and reconstructing, reconstructing occluded objects, challenging problem, diverse and unpredictable, reconstructing occluded
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Understanding and reconstructing occluded objects is a challenging problem, especially in open-world scenarios where categories and contexts are diverse and unpredictable. Traditional methods, however, are typically restricted to closed sets of object categories, limiting their use in complex, open-world scenes. We introduce Open-World Amodal Appearance Completion, a training-free framework that expands amodal completion capabilities by accepting flexible text queries as input. Our approach generalizes to arbitrary objects specified by both direct terms and abstract queries. We term this capability reasoning amodal completion, where the system reconstructs the full appearance of the queried object based on the provided image and language query. Our framework unifies segmentation, occlusion analysis, and inpainting to handle complex occlusions and generates completed objects as RGBA elements, enabling seamless integration into applications such as 3D reconstruction and image editing. Extensive evaluations demonstrate the effectiveness of our approach in generalizing to novel objects and occlusions, establishing a new benchmark for amodal completion in open-world settings. The code and datasets will be released after paper acceptance.

[CV-44] DT-LSD: Deformable Transformer-based Line Segment Detection

Link: https://arxiv.org/abs/2411.13005
Authors: Sebastian Janampa,Marios Pattichis
Keywords: Line segment detection, Line segment, Transformer-based Line Segment, fundamental low-level task, Convolutional Neural Networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Line segment detection is a fundamental low-level task in computer vision, and improvements in this task can impact more advanced methods that depend on it. Most new methods developed for line segment detection are based on Convolutional Neural Networks (CNNs). Our paper seeks to address challenges that prevent the wider adoption of transformer-based methods for line segment detection. More specifically, we introduce a new model called Deformable Transformer-based Line Segment Detection (DT-LSD) that supports cross-scale interactions and can be trained quickly. This work proposes a novel Deformable Transformer-based Line Segment Detector (DT-LSD) that addresses LETR’s drawbacks. For faster training, we introduce Line Contrastive DeNoising (LCDN), a technique that stabilizes the one-to-one matching process and speeds up training by 34×. We show that DT-LSD is faster and more accurate than its predecessor transformer-based model (LETR) and outperforms all CNN-based models in terms of accuracy. On the Wireframe dataset, DT-LSD achieves 71.7 for sAP^10 and 73.9 for sAP^15, while on the YorkUrban dataset it achieves 33.2 for sAP^10 and 35.1 for sAP^15.

[CV-45] Collaborative Feature-Logits Contrastive Learning for Open-Set Semi-Supervised Object Detection

Link: https://arxiv.org/abs/2411.13001
Authors: Xinhao Zhong,Siyu Jiao,Yao Zhao,Yunchao Wei
Keywords: Current Semi-Supervised Object, Semi-Supervised Object Detection, Object Detection, leveraging large amounts, unlabeled data share
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Current Semi-Supervised Object Detection (SSOD) methods enhance detector performance by leveraging large amounts of unlabeled data, assuming that both labeled and unlabeled data share the same label space. However, in open-set scenarios, the unlabeled dataset contains both in-distribution (ID) classes and out-of-distribution (OOD) classes. Applying semi-supervised detectors in such settings can lead to misclassifying OOD classes as ID classes. To alleviate this issue, we propose a simple yet effective method, termed Collaborative Feature-Logits Detector (CFL-Detector). Specifically, we introduce a feature-level clustering method using contrastive loss to clarify vector boundaries in the feature space and highlight class differences. Additionally, by optimizing the logits-level uncertainty classification loss, the model enhances its ability to effectively distinguish between ID and OOD classes. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared to existing methods.

[CV-46] GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting

Link: https://arxiv.org/abs/2411.12981
Authors: Xiaobao Wei,Peng Chen,Guangyu Li,Ming Lu,Hui Chen,Feng Tian
Keywords: Gaze, Gaussian Splatting, generate augmented data, data, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Gaze estimation encounters generalization challenges when dealing with out-of-distribution data. To address this problem, recent methods use neural radiance fields (NeRF) to generate augmented data. However, existing methods based on NeRF are computationally expensive and lack facial details. 3D Gaussian Splatting (3DGS) has become the prevailing representation of neural fields. While 3DGS has been extensively examined in head avatars, it faces challenges with accurate gaze control and generalization across different subjects. In this work, we propose GazeGaussian, a high-fidelity gaze redirection method that uses a two-stream 3DGS model to represent the face and eye regions separately. By leveraging the unstructured nature of 3DGS, we develop a novel eye representation for rigid eye rotation based on the target gaze direction. To enhance synthesis generalization across various subjects, we integrate an expression-conditional module to guide the neural renderer. Comprehensive experiments show that GazeGaussian outperforms existing methods in rendering speed, gaze redirection accuracy, and facial synthesis across multiple datasets. We also demonstrate that existing gaze estimation methods can leverage GazeGaussian to improve their generalization performance. The code will be available at: this https URL.

[CV-47] On the Consistency of Video Large Language Models in Temporal Comprehension

Link: https://arxiv.org/abs/2411.12951
Authors: Minjoon Jung,Junbin Xiao,Byoung-Tak Zhang,Angela Yao
Keywords: temporally ground language, temporally ground, large language models, Video large language, retrieve video moments
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well-studied nor understood. So we conduct a study on prediction consistency – a key indicator for robustness and trustworthiness of temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check if the model’s responses align with this initial grounding as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video contents, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To that end, we propose event temporal verification tuning that explicitly accounts for consistency, and demonstrate significant improvements for both grounding and consistency. Our data and code will be available at this https URL.
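
To make the idea of "checking whether a probe response aligns with the initial grounding" concrete, here is a minimal sketch. The abstract does not say how alignment is measured; temporal intersection-over-union (IoU) between two predicted moments is a common choice in temporal grounding and is used here only as an assumption. The function names and the 0.5 threshold are hypothetical.

```python
from typing import Tuple

Moment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection-over-union of two time intervals."""
    start_a, end_a = a
    start_b, end_b = b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


def is_consistent(initial: Moment, probe: Moment, threshold: float = 0.5) -> bool:
    """Treat a probe prediction as consistent if it overlaps the
    initial grounding by at least `threshold` IoU (assumed value)."""
    return temporal_iou(initial, probe) >= threshold


initial_moment = (10.0, 20.0)   # model's first answer
probe_moment = (12.0, 22.0)     # answer to a rephrased query
print(temporal_iou(initial_moment, probe_moment))
print(is_consistent(initial_moment, probe_moment))
```

A consistency score over a dataset could then be the fraction of probes for which `is_consistent` holds, though the paper's own probes and metrics may differ.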

[CV-48] VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

Link: https://arxiv.org/abs/2411.12915
Authors: Vishwesh Nath,Wenqi Li,Dong Yang,Andriy Myronenko,Mingxin Zheng,Yao Lu,Zhijian Liu,Hongxu Yin,Yee Man Law,Yucheng Tang,Pengfei Guo,Can Zhao,Ziyue Xu,Yufan He,Greg Heinrich,Stephen Aylward,Marc Edgar,Michael Zephyr,Pavlo Molchanov,Baris Turkbey,Holger Roth,Daguang Xu
Keywords: Generalist vision language, made significant strides, Generalist vision, made significant, significant strides
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Generalist vision language models (VLMs) have made significant strides in computer vision, but they fall short in specialized fields like healthcare, where expert knowledge is essential. In traditional computer vision tasks, creative or approximate answers may be acceptable, but in healthcare, precision is this http URL large multimodal models like Gemini and GPT-4o are insufficient for medical tasks due to their reliance on memorized internet knowledge rather than the nuanced expertise required in healthcare. VLMs are usually trained in three stages: vision pre-training, vision-language pre-training, and instruction fine-tuning (IFT). IFT has been typically applied using a mixture of generic and healthcare data. In contrast, we propose that for medical VLMs, a fourth stage of specialized IFT is necessary, which focuses on medical data and includes information from domain expert models. Domain expert models developed for medical use are crucial because they are specifically trained for certain clinical tasks, e.g. to detect tumors and classify abnormalities through segmentation and classification, which learn fine-grained features of medical data - features that are often too intricate for a VLM to capture effectively especially in radiology. This paper introduces a new framework, VILA-M3, for medical VLMs that utilizes domain knowledge via expert models. Through our experiments, we show an improved state-of-the-art (SOTA) performance with an average improvement of ~9% over the prior SOTA model Med-Gemini and ~6% over models trained on the specific tasks. Our approach emphasizes the importance of domain expertise in creating precise, reliable VLMs for medical applications.

[CV-49] Tree Species Classification using Machine Learning and 3D Tomographic SAR – a case study in Northern Europe

Link: https://arxiv.org/abs/2411.12897
Authors: Colverd Grace,Schade Laura,Takami Jumpei,Bot Karol,Gallego Joseph
Keywords: Synthetic Aperture Radar, Tree species classification, forest inventories, forest management, Tree species
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
*Comments:

Click to view abstract

Abstract:Tree species classification plays an important role in nature conservation, forest inventories, forest management, and the protection of endangered species. Over the past four decades, remote sensing technologies have been extensively utilized for tree species classification, with Synthetic Aperture Radar (SAR) emerging as a key technique. In this study, we employed TomoSense, a 3D tomographic dataset, which utilizes a stack of single-look complex (SLC) images, a byproduct of SAR, captured at different incidence angles to generate a three-dimensional representation of the terrain. Our research focuses on evaluating multiple tabular machine-learning models using the height information derived from the tomographic image intensities to classify eight distinct tree species. The SLC data and tomographic imagery were analyzed across different polarimetric configurations and geosplit configurations. We investigated the impact of these variations on classification accuracy, comparing the performance of various tabular machine-learning models and optimizing them using Bayesian optimization. Additionally, we incorporated a proxy for actual tree height using point cloud data from Light Detection and Ranging (LiDAR) to provide height statistics associated with the model’s predictions. This comparison offers insights into the reliability of tomographic data in predicting tree species classification based on height.

[CV-50] Towards Fairness in AI for Melanoma Detection: Systemic Review and Recommendations

Link: https://arxiv.org/abs/2411.12846
Authors: Laura N Montoya,Jennafer Shae Roberts,Belen Sanchez Hidalgo
Keywords: Early and accurate, crucial for improving, accurate melanoma detection, melanoma detection, Early
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
*Comments: 22 pages, 4 figures, 7 tables, accepted for publication in Future of Information and Communication Conference (FICC) 2025, whose proceedings will be published in ‘Lecture Notes in Networks and Systems’ by Springer Nature

Click to view abstract

Abstract:Early and accurate melanoma detection is crucial for improving patient outcomes. Recent advancements in artificial intelligence (AI) have shown promise in this area, but the technology's effectiveness across diverse skin tones remains a critical challenge. This study conducts a systematic review and preliminary analysis of AI-based melanoma detection research published between 2013 and 2024, focusing on deep learning methodologies, datasets, and skin tone representation. Our findings indicate that while AI can enhance melanoma detection, there is a significant bias towards lighter skin tones. To address this, we propose including skin hue in addition to skin tone as represented by the L'Oréal Color Chart Map for a more comprehensive skin tone assessment technique. This research highlights the need for diverse datasets and robust evaluation metrics to develop AI models that are equitable and effective for all patients. By adopting best practices outlined in a PRISMA Equity framework tailored for healthcare and melanoma detection, we can work towards reducing disparities in melanoma outcomes.

[CV-51] Data-to-Model Distillation: Data-Efficient Learning Framework ECCV2024

Link: https://arxiv.org/abs/2411.12841
Authors: Ahmad Sajedi,Samir Khaki,Lucy Z. Liu,Ehsan Amjadian,Yuri A. Lawryshyn,Konstantinos N. Plataniotis
Keywords: informative synthetic data, model trained, Dataset distillation aims, synthetic data, large-scale real dataset
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Accepted in the 18th European Conference on Computer Vision (ECCV 2024), Milan, Italy, September 29 to October 4, 2024

Click to view abstract

Abstract:Dataset distillation aims to distill the knowledge of a large-scale real dataset into small yet informative synthetic data such that a model trained on it performs as well as a model trained on the full dataset. Despite recent progress, existing dataset distillation methods often struggle with computational efficiency, scalability to complex high-resolution datasets, and generalizability to deep architectures. These approaches typically require retraining when the distillation ratio changes, as knowledge is embedded in raw pixels. In this paper, we propose a novel framework called Data-to-Model Distillation (D2M) to distill the real dataset’s knowledge into the learnable parameters of a pre-trained generative model by aligning rich representations extracted from real and generated images. The learned generative model can then produce informative training images for different distillation ratios and deep architectures. Extensive experiments on 15 datasets of varying resolutions show D2M’s superior performance, re-distillation efficiency, and cross-architecture generalizability. Our method effectively scales up to high-resolution 128x128 ImageNet-1K. Furthermore, we verify D2M’s practical benefits for downstream applications in neural architecture search.

[CV-52] HyperGAN-CLIP: A Unified Framework for Domain Adaptation Image Synthesis and Manipulation SIGGRAPH

Link: https://arxiv.org/abs/2411.12832
Authors: Abdul Basit Anees,Ahmet Canberk Baykal,Muhammed Burak Kizil,Duygu Ceylan,Erkut Erdem,Aykut Erdem
Keywords: Generative Adversarial Networks, Generative Adversarial, Adversarial Networks, generating highly realistic, demonstrated remarkable capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted for publication in SIGGRAPH Asia 2024. Project Website: this https URL

Click to view abstract

Abstract:Generative Adversarial Networks (GANs), particularly StyleGAN and its variants, have demonstrated remarkable capabilities in generating highly realistic images. Despite their success, adapting these models to diverse tasks such as domain adaptation, reference-guided synthesis, and text-guided manipulation with limited training data remains challenging. Towards this end, in this study, we present a novel framework that significantly extends the capabilities of a pre-trained StyleGAN by integrating CLIP space via hypernetworks. This integration allows dynamic adaptation of StyleGAN to new domains defined by reference images or textual descriptions. Additionally, we introduce a CLIP-guided discriminator that enhances the alignment between generated images and target domains, ensuring superior image quality. Our approach demonstrates unprecedented flexibility, enabling text-guided image manipulation without the need for text-specific training data and facilitating seamless style transfer. Comprehensive qualitative and quantitative evaluations confirm the robustness and superior performance of our framework compared to existing methods.

[CV-53] Towards motion from video diffusion models ECCV2024

Link: https://arxiv.org/abs/2411.12831
Authors: Paul Janson,Tiberiu Popa,Eugene Belilovsky
Keywords: Text-conditioned video diffusion, Text-conditioned video, generation and editing, powerful tool, Text-conditioned
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted at ECCV 2024 Workshop: Foundation Models for 3D Humans

Click to view abstract

Abstract:Text-conditioned video diffusion models have emerged as a powerful tool in the realm of video generation and editing. But their ability to capture the nuances of human movement remains under-explored. Indeed the ability of these models to faithfully model an array of text prompts can lead to a wide host of applications in human and character animation. In this work, we take initial steps to investigate whether these models can effectively guide the synthesis of realistic human body animations. Specifically we propose to synthesize human motion by deforming an SMPL-X body representation guided by Score distillation sampling (SDS) calculated using a video diffusion model. By analyzing the fidelity of the resulting animations, we gain insights into the extent to which we can obtain motion using publicly available text-to-video diffusion models using SDS. Our findings shed light on the potential and limitations of these models for generating diverse and plausible human motions, paving the way for further research in this exciting area.

[CV-54] What Makes a Good Dataset for Knowledge Distillation?

Link: https://arxiv.org/abs/2411.12817
Authors: Logan Frank,Jim Davis
Keywords: popular and effective, effective method, model compression, datasets, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Knowledge distillation (KD) has been a popular and effective method for model compression. One important assumption of KD is that the teacher’s original dataset will also be available when training the student. However, in situations such as continual learning and distilling large models trained on company-withheld datasets, having access to the original data may not always be possible. This leads practitioners towards utilizing other sources of supplemental data, which could yield mixed results. One must then ask: “what makes a good dataset for transferring knowledge from teacher to student?” Many would assume that only real in-domain imagery is viable, but is that the only option? In this work, we explore multiple possible surrogate distillation datasets and demonstrate that many different datasets, even unnatural synthetic imagery, can serve as a suitable alternative in KD. From examining these alternative datasets, we identify and present various criteria describing what makes a good dataset for distillation. Source code will be available in the future.

[CV-55] Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline

Link: https://arxiv.org/abs/2411.12814
Authors: Junlong Cheng,Bin Fu,Jin Ye,Guoan Wang,Tianbin Li,Haoyu Wang,Ruoyu Li,He Yao,Junren Chen,JingWen Li,Yanzhou Su,Min Zhu,Junjun He
Keywords: densely annotated datasets, hinders model generalization, Medical Image Segmentation, availability of large-scale, long been constrained
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Interactive Medical Image Segmentation (IMIS) has long been constrained by the limited availability of large-scale, diverse, and densely annotated datasets, which hinders model generalization and consistent evaluation across different models. In this paper, we introduce the IMed-361M benchmark dataset, a significant advancement in general IMIS research. First, we collect and standardize over 6.4 million medical images and their corresponding ground truth masks from multiple data sources. Then, leveraging the strong object recognition capabilities of a vision foundational model, we automatically generated dense interactive masks for each image and ensured their quality through rigorous quality control and granularity management. Unlike previous datasets, which are limited by specific modalities or sparse annotations, IMed-361M spans 14 modalities and 204 segmentation targets, totaling 361 million masks, an average of 56 masks per image. Finally, we developed an IMIS baseline network on this dataset that supports high-quality mask generation through interactive inputs, including clicks, bounding boxes, text prompts, and their combinations. We evaluate its performance on medical image segmentation tasks from multiple perspectives, demonstrating superior accuracy and scalability compared to existing interactive segmentation models. To facilitate research on foundational models in medical computer vision, we release the IMed-361M and model at this https URL.
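
The "56 masks per image" figure follows directly from the two totals quoted in the abstract. The short check below uses the rounded values as stated (361 million masks, 6.4 million images); since the abstract says "over 6.4 million" images, the true average is presumably slightly below the value computed here.

```python
# Totals as quoted in the abstract (rounded figures).
total_masks = 361_000_000
total_images = 6_400_000  # "over 6.4 million", so this is a lower bound

masks_per_image = total_masks / total_images
print(f"{masks_per_image:.2f} masks per image on average")
print(f"rounded: {round(masks_per_image)}")
```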

[CV-56] Stylecodes: Encoding Stylistic Information For Image Generation

Link: https://arxiv.org/abs/2411.12811
Authors: Ciara Rowles
Keywords-EN: Diffusion models excel, Diffusion models, remains a challenge, models excel, controlling them remains
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: code: this https URL; project page: this https URL. arXiv admin note: substantial text overlap with arXiv:2408.03209

Abstract:Diffusion models excel in image generation, but controlling them remains a challenge. We focus on the problem of style-conditioned image generation. Although example images work, they are cumbersome: srefs (style-reference codes) from MidJourney solve this issue by expressing a specific image style in a short numeric code. These have seen widespread adoption throughout social media due to both their ease of sharing and the fact they allow using an image for style control, without having to post the source images themselves. However, users are not able to generate srefs from their own images, nor is the underlying training procedure public. We propose StyleCodes: an open-source and open-research style encoder architecture and training procedure to express image style as a 20-symbol base64 code. Our experiments show that our encoding results in minimal loss in quality compared to traditional image-to-style techniques.
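
The "20-symbol base64 code" in the abstract above implies a fixed payload size: each base64 symbol carries 6 bits, so 20 symbols hold 120 bits, or 15 bytes. The sketch below only illustrates that arithmetic. The function names, the URL-safe alphabet, and the idea that the code is a plain byte string are assumptions for illustration, not details taken from the paper.

```python
import base64
import os

# Assumption: the style code is a raw 15-byte payload.
# 15 bytes = 120 bits = 20 base64 symbols, so no "=" padding is needed.
CODE_BYTES = 15
CODE_SYMBOLS = 20


def bytes_to_stylecode(raw: bytes) -> str:
    """Encode a 15-byte payload as a 20-symbol base64 string (hypothetical helper)."""
    if len(raw) != CODE_BYTES:
        raise ValueError(f"expected {CODE_BYTES} bytes, got {len(raw)}")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def stylecode_to_bytes(code: str) -> bytes:
    """Decode a 20-symbol base64 string back to its 15-byte payload."""
    if len(code) != CODE_SYMBOLS:
        raise ValueError(f"expected {CODE_SYMBOLS} symbols, got {len(code)}")
    return base64.urlsafe_b64decode(code.encode("ascii"))


# Round trip on a random payload standing in for an encoder output.
payload = os.urandom(CODE_BYTES)
code = bytes_to_stylecode(payload)
restored = stylecode_to_bytes(code)
```

How a real style encoder would quantize image features into those 120 bits is the substance of the paper and is not modeled here.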

[CV-57] CLIC: Contrastive Learning Framework for Unsupervised Image Complexity Representation

Link: https://arxiv.org/abs/2411.12792
Authors: Shipeng Liu,Liang Zhao,Dengfeng Chen
Keywords-EN: essential visual attribute, image complexity, image complexity affects, computer vision tasks, complexity affects human
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:As an essential visual attribute, image complexity affects human image comprehension and directly influences the performance of computer vision tasks. However, accurately assessing and quantifying image complexity remains a significant challenge. Previous works lacked generalization capability and relied on well-labeled datasets to learn image complexity features. Creating such datasets requires expensive manual labeling, and the resulting models inevitably learn human subjective biases. To address these problems, we propose CLIC, an unsupervised framework based on contrastive learning, for learning image complexity representations. The method learns image complexity features on unlabeled data, avoiding the high labeling cost. Specifically, we propose a unique positive and negative sample selection strategy to reinforce the differences in complexity features. At the same time, we introduce an image prior-based Complexity-Aware Loss to constrain the learning process of the model. We conducted extensive experiments for verification, and the results show that CLIC can effectively learn the image complexity representation. CLIC obtained results competitive with supervised methods by fine-tuning on IC9600. In addition, CLIC applied to downstream tasks shows significant performance improvements, demonstrating the potential for application in various real-world scenarios. Code is available at this https URL.

[CV-58] Mitigating Perception Bias: A Training-Free Approach to Enhance LMM for Image Quality Assessment

Link: https://arxiv.org/abs/2411.12791
Authors: Siyi Pan,Baoliang Chen,Danni Huang,Hanwei Zhu,Lingyu Zhu,Xiangjie Sui,Shiqi Wang
Keywords-EN: high-level visual tasks, large multimodal models, image, quality, image quality assessment
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*Comments:

Abstract:Despite the impressive performance of large multimodal models (LMMs) in high-level visual tasks, their capacity for image quality assessment (IQA) remains limited. One main reason is that LMMs are primarily trained for high-level tasks (e.g., image captioning), emphasizing unified image semantics extraction under varied quality. Such semantic-aware yet quality-insensitive perception bias inevitably leads to a heavy reliance on image semantics when those LMMs are asked to rate quality. In this paper, instead of costly retraining or tuning of an LMM, we propose a training-free debiasing framework, in which the image quality prediction is rectified by mitigating the bias caused by image semantics. Specifically, we first explore several semantic-preserving distortions that can significantly degrade image quality while maintaining identifiable semantics. By applying these specific distortions to the query or test images, we ensure that the degraded images are recognized as poor quality while their semantics remain. During quality inference, both a query image and its corresponding degraded version are fed to the LMM along with a prompt indicating that the query image quality should be inferred under the condition that the degraded one is deemed poor. This prior condition effectively aligns the LMM’s quality perception, as all degraded images are consistently rated as poor quality, regardless of their semantics. The quality scores of the query image inferred under different prior conditions (degraded versions) are then aggregated using a conditional probability model. Extensive experiments on various IQA datasets show that our debiasing framework consistently enhances LMM performance. The code will be publicly available.

[CV-59] Automated 3D Physical Simulation of Open-world Scene with Gaussian Splatting

Link: https://arxiv.org/abs/2411.12789
Authors: Haoyu Zhao,Hao Wang,Xingyue Zhao,Hongqiu Wang,Zhiyu Wu,Chengjiang Long,Hua Zou
Keywords-EN: content remains challenging, Recent advancements, customizing behaviors, remains challenging, opened new possibilities
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Recent advancements in 3D generation models have opened new possibilities for simulating dynamic 3D object movements and customizing behaviors, yet creating this content remains challenging. Current methods often require manual assignment of precise physical properties for simulations or rely on video generation models to predict them, which is computationally intensive. In this paper, we rethink the usage of multi-modal large language models (MLLMs) in physics-based simulation, and present Sim Anything, a physics-based approach that endows static 3D objects with interactive dynamics. We begin with detailed scene reconstruction and object-level 3D open-vocabulary segmentation, progressing to multi-view image in-painting. Inspired by human visual reasoning, we propose MLLM-based Physical Property Perception (MLLM-P3) to predict mean physical properties of objects in a zero-shot manner. Based on the mean values and the object’s geometry, the Material Property Distribution Prediction (MPDP) model then estimates the full distribution, reformulating the problem as probability distribution estimation to reduce computational costs. Finally, we simulate objects in an open-world scene with particles sampled via the Physical-Geometric Adaptive Sampling (PGAS) strategy, efficiently capturing complex deformations and significantly reducing computational costs. Extensive experiments and user studies demonstrate that Sim Anything achieves more realistic motion than state-of-the-art methods within 2 minutes on a single GPU.

[CV-60] Mini-Splatting2: Building 360 Scenes within Minutes via Aggressive Gaussian Densification

Link: https://arxiv.org/abs/2411.12788
Authors: Guangchi Fang,Bing Wang
Keywords-EN: Gaussian Splatting, explore the essential, essential challenge, challenge of fast, Gaussian
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:In this study, we explore the essential challenge of fast scene optimization for Gaussian Splatting. Through a thorough analysis of the geometry modeling process, we reveal that dense point clouds can be effectively reconstructed early in optimization through Gaussian representations. This insight leads to our approach of aggressive Gaussian densification, which provides a more efficient alternative to conventional progressive densification methods. By significantly increasing the number of critical Gaussians, we enhance the model capacity to capture dense scene geometry at the early stage of optimization. This strategy is seamlessly integrated into the Mini-Splatting densification and simplification framework, enabling rapid convergence without compromising quality. Additionally, we introduce visibility culling within Gaussian Splatting, leveraging per-view Gaussian importance as precomputed visibility to accelerate the optimization process. Our Mini-Splatting2 achieves a balanced trade-off among optimization time, the number of Gaussians, and rendering quality, establishing a strong baseline for future Gaussian-Splatting-based works. Our work sets the stage for more efficient, high-quality 3D scene modeling in real-world applications, and the code will be made available regardless of acceptance.

[CV-61] Joint Vision-Language Social Bias Removal for CLIP

Link: https://arxiv.org/abs/2411.12785
Authors: Haoyu Zhang,Yangyang Guo,Mohan Kankanhalli
Keywords-EN: CLIP show prominent, show prominent capabilities, downstream tasks, show prominent, CLIP show
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Vision-Language (V-L) pre-trained models such as CLIP show prominent capabilities in various downstream tasks. Despite this promise, V-L models are notoriously limited by their inherent social biases. A typical demonstration is that V-L models often produce biased predictions against specific groups of people, significantly undermining their real-world applicability. Existing approaches endeavor to mitigate the social bias problem in V-L models by removing biased attribute information from model embeddings. However, after our revisiting of these methods, we find that their bias removal is frequently accompanied by greatly compromised V-L alignment capabilities. We then reveal that this performance degradation stems from the unbalanced debiasing in image and text embeddings. To address this issue, we propose a novel V-L debiasing framework to align image and text biases followed by removing them from both modalities. By doing so, our method achieves multi-modal bias mitigation while maintaining the V-L alignment in the debiased embeddings. Additionally, we advocate a new evaluation protocol that can 1) holistically quantify the model debiasing and V-L alignment ability, and 2) evaluate the generalization of social bias removal models. We believe this work will offer new insights and guidance for future studies addressing the social bias problem in CLIP.

[CV-62] Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model

Link: https://arxiv.org/abs/2411.12783
Authors: Yiming Shi,Xun Zhu,Ying Hu,Chenyi Guo,Miao Li,Ji Wu
Keywords-EN: increasingly inadequate due, diverse clinical scenarios, modern healthcare, crucial for modern, increasingly inadequate
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:The analysis of 3D medical images is crucial for modern healthcare, yet traditional task-specific models are becoming increasingly inadequate due to limited generalizability across diverse clinical scenarios. Multimodal large language models (MLLMs) offer a promising solution to these challenges. However, existing MLLMs have limitations in fully leveraging the rich, hierarchical information embedded in 3D medical images. Inspired by clinical practice, where radiologists focus on both 3D spatial structure and 2D planar content, we propose Med-2E3, a novel MLLM for 3D medical image analysis that integrates 3D and 2D encoders. To aggregate 2D features more effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring module, which scores the attention of each 2D slice based on slice contents and task instructions. To the best of our knowledge, Med-2E3 is the first MLLM to integrate both 3D and 2D features for 3D medical image analysis. Experiments on a large-scale, open-source 3D medical multimodal benchmark demonstrate that Med-2E3 exhibits task-specific attention distribution and significantly outperforms current state-of-the-art models, with a 14% improvement in report generation and a 5% gain in medical visual question answering (VQA), highlighting the model’s potential in addressing complex multimodal clinical tasks. The code will be released upon acceptance.

[CV-63] FGP: Feature-Gradient-Prune for Efficient Convolutional Layer Pruning

Link: https://arxiv.org/abs/2411.12781
Authors: Qingsong Lv,Jiasheng Sun,Sheng Zhou,Xu Zhang,Liangcheng Li,Yun Gao,Sun Qiao,Jie Song,Jiajun Bu
Keywords-EN: reduce computational overhead, pruning, computational overhead, model pruning techniques, pruning techniques
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:To reduce computational overhead while maintaining model performance, model pruning techniques have been proposed. Among these, structured pruning, which removes entire convolutional channels or layers, significantly enhances computational efficiency and is compatible with hardware acceleration. However, existing pruning methods that rely solely on image features or gradients often result in the retention of redundant channels, negatively impacting inference efficiency. To address this issue, this paper introduces a novel pruning method called Feature-Gradient Pruning (FGP). This approach integrates both feature-based and gradient-based information to more effectively evaluate the importance of channels across various target classes, enabling a more accurate identification of channels that are critical to model performance. Experimental results demonstrate that the proposed method improves both model compactness and practicality while maintaining stable performance. Experiments conducted across multiple tasks and datasets show that FGP significantly reduces computational costs and minimizes accuracy loss compared to existing methods, highlighting its effectiveness in optimizing pruning outcomes. The source code is available at: this https URL.

[CV-64] Faster Multi-GPU Training with PPLL: A Pipeline Parallelism Framework Leveraging Local Learning

Link: https://arxiv.org/abs/2411.12780
Authors: Xiuyuan Guo(1),Chengqi Xu(1),Guinan Guo(3),Feiyu Zhu(4),Changpeng Cai(5),Peizhe Wang(5),Xiaoming Wei(2),Junhao Su(2),Jialin Gao(2) ((1) University of Southern California, (2) Meituan, (3) Sun Yat-sen University, (4) University of Shanghai for Science and Technology, (5) Southeast University)
Keywords-EN: large-scale deep learning, training large-scale deep, training, PPLL, large-scale deep
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Currently, training large-scale deep learning models is typically achieved through parallel training across multiple GPUs. However, due to the inherent communication overhead and synchronization delays in traditional model parallelism methods, seamless parallel training cannot be achieved, which, to some extent, affects overall training efficiency. To address this issue, we present PPLL (Pipeline Parallelism based on Local Learning), a novel framework that leverages local learning algorithms to enable effective parallel training across multiple GPUs. PPLL divides the model into several distinct blocks, each allocated to a separate GPU. By utilizing queues to manage data transfers between GPUs, PPLL ensures seamless cross-GPU communication, allowing multiple blocks to execute forward and backward passes in a pipelined manner. This design minimizes idle times and prevents bottlenecks typically caused by sequential gradient updates, thereby accelerating the overall training process. We validate PPLL through extensive experiments using ResNet and Vision Transformer (ViT) architectures on CIFAR-10, SVHN, and STL-10 datasets. Our results demonstrate that PPLL significantly enhances the training speed of the local learning method while achieving comparable or even superior training speed to traditional pipeline parallelism (PP) without sacrificing model performance. In a 4-GPU training setup, PPLL accelerated local learning training on ViT and ResNet by 162% and 33%, respectively, achieving 1.25x and 0.85x the speed of traditional pipeline parallelism.

[CV-65] Decoupling Training-Free Guided Diffusion by ADMM

Link: https://arxiv.org/abs/2411.12773
Authors: Youyuan Zhang,Zehua Liu,Zenan Li,Zhaoyu Li,James J. Clark,Xujie Si
Keywords-EN: unconditional diffusion models, differentiable loss functions, unconditional generation model, conditional generation problem, problem by guiding
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:In this paper, we consider the conditional generation problem by guiding off-the-shelf unconditional diffusion models with differentiable loss functions in a plug-and-play fashion. While previous research has primarily focused on balancing the unconditional diffusion model and the guided loss through a tuned weight hyperparameter, we propose a novel framework that distinctly decouples these two components. Specifically, we introduce two variables, x and z, to represent the generated samples governed by the unconditional generation model and the guidance function, respectively. This decoupling reformulates conditional generation into two manageable subproblems, unified by the constraint x = z. Leveraging this setup, we develop a new algorithm based on the Alternating Direction Method of Multipliers (ADMM) to adaptively balance these components. Additionally, we establish the equivalence between the diffusion reverse step and the proximal operator of ADMM and provide a detailed convergence analysis of our algorithm under certain mild assumptions. Our experiments demonstrate that our proposed method ADMMDiff consistently generates high-quality samples while ensuring strong adherence to the conditioning criteria. It outperforms existing methods across a range of conditional generation tasks, including image generation with various guidance and controllable motion synthesis.
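
The x / z splitting with the constraint x = z described above is the standard ADMM setup. As a rough illustration of how the three alternating updates interact, here is a minimal sketch on a toy convex problem: minimize 0.5·(x − a)² + λ·|z| subject to x = z, solved per coordinate. This is not the ADMMDiff algorithm. The quadratic term merely stands in for the "unconditional model" side and the L1 term for the "guidance" side, and all names and parameter values are assumptions.

```python
def soft_threshold(v: float, t: float) -> float:
    """Proximal operator of t * |.| (shrinks v toward zero by t)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_toy(a, lam=1.0, rho=1.0, iters=200):
    """Scaled-form ADMM for min 0.5*(x-a)^2 + lam*|z|  s.t.  x = z.

    Toy stand-in only: x plays the role of the variable tied to the first
    objective, z the variable tied to the second, u the scaled dual variable.
    """
    n = len(a)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n
    for _ in range(iters):
        # x-update: closed-form minimizer of 0.5*(x-a)^2 + (rho/2)*(x - z + u)^2
        x = [(a[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(n)]
        # z-update: proximal step on lam*|z| around x + u
        z = [soft_threshold(x[i] + u[i], lam / rho) for i in range(n)]
        # dual update: accumulate the constraint violation x - z
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return x, z


x_sol, z_sol = admm_toy([3.0, 0.5, -2.0], lam=1.0)
```

For this toy problem the known solution is the soft-thresholding of `a` by `lam`, so `x_sol` and `z_sol` should agree with each other and with that closed form once the iteration has converged.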

[CV-66] Generative World Explorer

Link: https://arxiv.org/abs/2411.11844
Authors: Taiming Lu,Tianmin Shu,Alan Yuille,Daniel Khashabi,Jieneng Chen
Keywords-EN: Planning with partial, world, Planning, Genex
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*Comments: Website: this http URL

Abstract:Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can imagine unseen parts of the world through a mental exploration and revise their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human-like ability, we introduce the Generative World Explorer (Genex), an egocentric world exploration framework that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train Genex, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) Genex can generate high-quality and consistent observations during long-horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.

[CV-67] Comparative Analysis of Machine Learning and Deep Learning Models for Classifying Squamous Epithelial Cells of the Cervix

Link: https://arxiv.org/abs/2411.13535
Authors: Subhasish Das,Satish K Panda,Madhusmita Sethy,Prajna Paramita Giri,Ashwini K Nanda
Keywords-EN: female reproductive system, Pap smear, cervical cancer, Pap smear images, reproductive system
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 15 pages, 4 figures

Abstract:The cervix is the narrow end of the uterus that connects to the vagina in the female reproductive system. Abnormal cell growth in the squamous epithelial lining of the cervix leads to cervical cancer in females. A Pap smear is a diagnostic procedure used to detect cervical cancer by gently collecting cells from the surface of the cervix with a small brush and analyzing their changes under a microscope. For population-based cervical cancer screening, visual inspection with acetic acid is a cost-effective method with high sensitivity. However, Pap smears are also suitable for mass screening due to their higher specificity. The current Pap smear analysis method is manual, time-consuming, labor-intensive, and prone to human error. Therefore, an artificial intelligence (AI)-based approach for automatic cell classification is needed. In this study, we aimed to classify cells in Pap smear images into five categories: superficial-intermediate, parabasal, koilocytes, dyskeratotic, and metaplastic. Various machine learning (ML) algorithms, including Gradient Boosting, Random Forest, Support Vector Machine, and k-Nearest Neighbor, as well as deep learning (DL) approaches like ResNet-50, were employed for this classification task. The ML models demonstrated high classification accuracy; however, ResNet-50 outperformed the others, achieving a classification accuracy of 93.06%. This study highlights the efficiency of DL models for cell-level classification and their potential to aid in the early diagnosis of cervical cancer from Pap smear images.

[CV-68] Efficient Brain Imaging Analysis for Alzheimers and Dementia Detection Using Convolution-Derivative Operations

Link: https://arxiv.org/abs/2411.13490
Authors: Yasmine Mustafa,Mohamed Elmahallawy,Tie Luo
Keywords-EN: Alzheimer disease, human brains, characterized by progressive, progressive neurodegeneration, neurodegeneration and results
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Performance (cs.PF)
*Comments:

Abstract:Alzheimer’s disease (AD) is characterized by progressive neurodegeneration and results in detrimental structural changes in human brains. Detecting these changes is crucial for early diagnosis and timely intervention in disease progression. Jacobian maps, derived from spatial normalization in voxel-based morphometry (VBM), have been instrumental in interpreting volume alterations associated with AD. However, the computational cost of generating Jacobian maps limits their clinical adoption. In this study, we explore alternative methods and propose Sobel kernel angle difference (SKAD) as a computationally efficient alternative. SKAD is a derivative operation that offers an optimized approach to quantifying volumetric alterations through localized analysis of the gradients. By efficiently extracting gradient amplitude changes at critical spatial regions, this derivative operation captures regional volume variations. Evaluation of SKAD over various medical datasets demonstrates that it is 6.3x faster than Jacobian maps while still maintaining comparable accuracy. This makes it an efficient and competitive approach in neuroimaging research and clinical practice.
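
The abstract does not spell out how SKAD is computed, so the following is only a guess at the general idea suggested by its name: apply Sobel kernels to two images, take the gradient direction at each pixel, and compare the directions. It works on small 2D lists rather than 3D brain volumes, and every function name here is hypothetical.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical derivatives.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_angles(img):
    """Gradient direction (radians) at each interior pixel of a 2D image."""
    h, w = len(img), len(img[0])
    angles = [[0.0] * (w - 2) for _ in range(h - 2)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            angles[r - 1][c - 1] = math.atan2(gy, gx)
    return angles


def angle_difference(a, b):
    """Per-pixel absolute angular difference, wrapped into [0, pi]."""
    return [[abs(math.atan2(math.sin(p - q), math.cos(p - q)))
             for p, q in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


# Two 4x4 ramps: intensity grows left-to-right vs. top-to-bottom.
ramp_x = [[float(c) for c in range(4)] for _ in range(4)]
ramp_y = [[float(r) for _ in range(4)] for r in range(4)]
diff = angle_difference(sobel_angles(ramp_x), sobel_angles(ramp_y))
```

On these two ramps the gradients are perpendicular, so every entry of `diff` should be close to π/2, while comparing an image with itself gives zeros. Whether the paper compares a scan against a template, uses 3D kernels, or weights by gradient amplitude is not stated in the abstract.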

[CV-69] Adversarial Diffusion Compression for Real-World Image Super-Resolution

Link: https://arxiv.org/abs/2411.13383
Authors: Bin Chen,Gehui Li,Rongyuan Wu,Xindong Zhang,Jie Chen,Jian Zhang,Lei Zhang
Keywords-EN: low-resolution inputs degraded, reconstruct high-resolution images, unknown processes, aims to reconstruct, degraded by complex
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Real-world image super-resolution (Real-ISR) aims to reconstruct high-resolution images from low-resolution inputs degraded by complex, unknown processes. While many Stable Diffusion (SD)-based Real-ISR methods have achieved remarkable success, their slow, multi-step inference hinders practical deployment. Recent SD-based one-step networks like OSEDiff and S3Diff alleviate this issue but still incur high computational costs due to their reliance on large pretrained SD models. This paper proposes a novel Real-ISR method, AdcSR, by distilling the one-step diffusion network OSEDiff into a streamlined diffusion-GAN model under our Adversarial Diffusion Compression (ADC) framework. We meticulously examine the modules of OSEDiff, categorizing them into two types: (1) Removable (VAE encoder, prompt extractor, text encoder, etc.) and (2) Prunable (denoising UNet and VAE decoder). Since direct removal and pruning can degrade the model’s generation capability, we pretrain our pruned VAE decoder to restore its ability to decode images and employ adversarial distillation to compensate for performance loss. This ADC-based diffusion-GAN hybrid design effectively reduces complexity by 73% in inference time, 78% in computation, and 74% in parameters, while preserving the model’s generation capability. Experiments show that our proposed AdcSR achieves competitive recovery quality on both synthetic and real-world datasets, offering up to 9.3× speedup over previous one-step diffusion-based methods. Code and models will be made available.

[CV-70] RTSR: A Real-Time Super-Resolution Model for AV1 Compressed Content

Link: https://arxiv.org/abs/2411.13362
Authors: Yuxuan Jiang,Jakub Nawała,Chen Feng,Fan Zhang,Xiaoqing Zhu,Joel Sole,David Bull
Keywords-EN: reconstructing fine details, fine details, key technique, technique for improving, increasing its spatial
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Super-resolution (SR) is a key technique for improving the visual quality of video content by increasing its spatial resolution while reconstructing fine details. SR has been employed in many applications including video streaming, where compressed low-resolution content is typically transmitted to end users and then reconstructed with a higher resolution and enhanced quality. To support real-time playback, it is important to implement fast SR models while preserving reconstruction quality; however most existing solutions, in particular those based on complex deep neural networks, fail to do so. To address this issue, this paper proposes a low-complexity SR method, RTSR, designed to enhance the visual quality of compressed video content, focusing on resolution up-scaling from a) 360p to 1080p and from b) 540p to 4K. The proposed approach utilizes a CNN-based network architecture, which was optimized for AV1 (SVT)-encoded content at various quantization levels based on a dual-teacher knowledge distillation method. This method was submitted to the AIM 2024 Video Super-Resolution Challenge, specifically targeting the Efficient/Mobile Real-Time Video Super-Resolution competition. It achieved the best trade-off between complexity and coding performance (measured in PSNR, SSIM and VMAF) among all six submissions. The code will be available soon.

[CV-71] Analysis and Synthesis Denoisers for Forward-Backward Plug-and-Play Algorithms

Link: https://arxiv.org/abs/2411.13276
Authors: Matthieu Kowalski,Benoît Malézieux,Thomas Moreau,Audrey Repetti
Keywords-EN: synthesis Gaussian denoisers, synthesis Gaussian, synthesis Gaussian denoising, Gaussian denoiser, work we study
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*Comments:

Abstract:In this work, we study the behavior of the forward-backward (FB) algorithm when the proximity operator is replaced by a sub-iterative procedure to approximate a Gaussian denoiser, in a Plug-and-Play (PnP) fashion. In particular, we consider both analysis and synthesis Gaussian denoisers within a dictionary framework, obtained by unrolling dual-FB iterations or FB iterations, respectively. We analyze the associated minimization problems as well as the asymptotic behavior of the resulting FB-PnP iterations. In particular, we show that the synthesis Gaussian denoising problem can be viewed as a proximity operator. For each case, analysis and synthesis, we show that the FB-PnP algorithms solve the same problem whether we use only one or an infinite number of sub-iterations to solve the denoising problem at each iteration. To this aim, we show that each “one sub-iteration” strategy within the FB-PnP can be interpreted as a primal-dual algorithm when a warm-restart strategy is used. We further present similar results when using a Moreau-Yosida smoothing of the global problem, for an arbitrary number of sub-iterations. Finally, we provide numerical simulations to illustrate our theoretical results. In particular, we consider a toy compressive sensing example as well as an image restoration problem in a deep dictionary framework.

[CV-72] Intensity-Spatial Dual Masked Autoencoder for Multi-Scale Feature Learning in Chest CT Segmentation

Link: https://arxiv.org/abs/2411.13198
Authors: Yuexing Ding,Jun Wang,Hongbing Lyu
Keywords-EN: ambiguous boundaries, and multi-scale, Dual Masked AutoEncoder, boundaries, and multi-scale characteristics, Intensity-Spatial Dual Masked, Masked AutoEncoder
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 10 pages, 6 figures, 3 tables

Abstract:In the field of medical image segmentation, challenges such as indistinct lesion features, ambiguous boundaries, and multi-scale characteristics have long prevailed. This paper proposes an improved method named Intensity-Spatial Dual Masked AutoEncoder (ISD-MAE). Based on the tissue-contrast semi-masked autoencoder, a Masked AutoEncoder (MAE) branch is introduced to perform intensity masking and spatial masking operations on chest CT images for multi-scale feature learning and segmentation tasks. The model utilizes a dual-branch structure and contrastive learning to enhance the ability to learn tissue features and boundary details. Experiments are conducted on multiple 2D and 3D datasets. The results show that ISD-MAE significantly outperforms other methods in 2D pneumonia and mediastinal tumor segmentation tasks. For example, the Dice score reaches 90.10% on the COVID19 LESION dataset, and the performance is relatively stable. However, there is still room for improvement on 3D datasets. In response to this, improvement directions are proposed, including optimizing the loss function, using enhanced 3D convolution blocks, and processing datasets from multiple this http URL code is available at: this https URL.

[CV-73] Demonstrating the Suitability of Neuromorphic Event-Based Dynamic Vision Sensors for In Process Monitoring of Metallic Additive Manufacturing and Welding

Link: https://arxiv.org/abs/2411.13108
Authors: David Mascareñas,Andre Green,Ashlee Liao,Michael Torrez,Alessandro Cattaneo,Amber Black,John Bernardin,Garrett Kenyon
Keywords-EN: metallic additive manufacturing, in-process monitoring applications, additive manufacturing, high dynamic range, metallic additive
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: This work is a derivative work of a conference proceedings paper submitted to the International Modal Analysis Conference 2024, and is subject to some copyright restrictions associated with the Society of Experimental Mechanics. A variation of this paper is also published in the Weapons Engineering Symposium and Journal (WESJ), which is not publicly accessible

Abstract:We demonstrate the suitability of high dynamic range, high-speed, neuromorphic event-based, dynamic vision sensors for metallic additive manufacturing and welding for in-process monitoring applications. In-process monitoring to enable quality control of mission-critical components produced using metallic additive manufacturing is of high interest. However, the extreme light environment and high-speed dynamics of metallic melt pools have made this a difficult environment in which to make measurements. Event-based sensing is an alternative measurement paradigm where data is only transmitted/recorded when a measured quantity exceeds a threshold resolution. The result is that event-based sensors consume less power and less memory/bandwidth, and they operate across a wide range of timescales and dynamic ranges. Event-driven imagers stand out from conventional imager technology in that they have a very high dynamic range of approximately 120 dB. Conventional 8-bit imagers only have a dynamic range of about 48 dB. This high dynamic range makes them a good candidate for monitoring manufacturing processes that feature high-intensity light sources/generation such as metallic additive manufacturing and welding. In addition, event-based imagers are able to capture data at timescales on the order of 100 µs, which makes them attractive for capturing fast dynamics in a metallic melt pool. In this work, we demonstrate that event-driven imagers can observe tungsten inert gas (TIG) and laser welding melt pools. The results of this effort suggest that with additional engineering effort, neuromorphic event imagers should be capable of 3D geometry measurements of the melt pool, and anomaly detection/classification/prediction.
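
The 48 dB figure quoted for 8-bit imagers is consistent with the common convention of expressing dynamic range as 20·log10 of the ratio between the largest and smallest representable levels. The short sketch below reproduces that arithmetic. The abstract does not say how its numbers were derived, so the formula and the function names are assumptions.

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Dynamic range of an ideal linear sensor with the given bit depth,
    using the 20*log10(levels) convention (an assumption, see lead-in)."""
    return 20.0 * math.log10(2 ** bits)


def intensity_ratio(db: float) -> float:
    """Brightest-to-darkest intensity ratio implied by a dynamic range in dB."""
    return 10.0 ** (db / 20.0)


eight_bit_db = dynamic_range_db(8)       # about 48.2 dB
event_ratio = intensity_ratio(120.0)     # about one million to one
equivalent_bits = math.log2(event_ratio) # just under 20 bits
```

Under this convention, 120 dB corresponds to an intensity ratio of roughly a million to one, which a linear sensor would need close to 20 bits to cover.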

[CV-74] LMM-driven Semantic Image-Text Coding for Ultra Low-bitrate Learned Image Compression

Link: https://arxiv.org/abs/2411.13033
Authors: Shimon Murai,Heming Sun,Jiro Katto
Keywords: low-bitrate learned image, Supported by powerful, utilizing perceptual metrics, powerful generative models, learned image compression
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE VCIP 2024 poster

Abstract:Supported by powerful generative models, low-bitrate learned image compression (LIC) models utilizing perceptual metrics have become feasible. Some of the most advanced models achieve high compression rates and superior perceptual quality by using image captions as sub-information. This paper demonstrates that using a large multi-modal model (LMM), it is possible to generate captions and compress them within a single model. We also propose a novel semantic-perceptual-oriented fine-tuning method applicable to any LIC network, resulting in a 41.58% improvement in LPIPS BD-rate compared to existing methods. Our implementation and pre-trained weights are available at this https URL.

[CV-75] Residual Vision Transformer (ResViT) Based Self-Supervised Learning Model for Brain Tumor Classification

Link: https://arxiv.org/abs/2411.12874
Authors: Meryem Altin Karagoz,O. Ufuk Nalbantoglu,Geoffrey C. Fox
Keywords: deep learning models, Deep learning, proposed model, MRI, brain tumor diagnosis
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Deep learning has proven very promising for interpreting MRI in brain tumor diagnosis. However, deep learning models suffer from a scarcity of brain MRI datasets for effective training. Self-supervised learning (SSL) models provide data-efficient and remarkable solutions to limited dataset problems. Therefore, this paper introduces a generative SSL model for brain tumor classification in two stages. The first stage is designed to pre-train a Residual Vision Transformer (ResViT) model for MRI synthesis as a pretext task. The second stage includes fine-tuning a ResViT-based classifier model as a downstream task. Accordingly, we aim to leverage local features via CNN and global features via ViT, employing a hybrid CNN-transformer architecture for ResViT in pretext and downstream tasks. Moreover, synthetic MRI images are utilized to balance the training set. The proposed model is evaluated on the public BraTS 2023, Figshare, and Kaggle datasets. Furthermore, we compare the proposed model with various deep learning models, including A-UNet, ResNet-9, pix2pix, pGAN for MRI synthesis, and ConvNeXtTiny, ResNet101, DenseNet12, Residual CNN, ViT for classification. According to the results, pretraining the proposed model on the MRI dataset is superior to pretraining on the ImageNet dataset. Overall, the proposed model attains the highest accuracy, achieving 90.56% on the BraTS dataset with T1 sequence, 98.53% on the Figshare, and 98.47% on the Kaggle brain tumor datasets. As a result, the proposed model demonstrates a robust, effective, and successful approach to handling insufficient dataset challenges in MRI analysis by incorporating SSL, fine-tuning, data augmentation, and combining CNN and ViT.

[CV-76] SAM-I2I: Unleash the Power of Segment Anything Model for Medical Image Translation

Link: https://arxiv.org/abs/2411.12755
Authors: Jiayu Huo,Sebastien Ourselin,Rachel Sparks
Keywords: expensive multi-modal imaging, Convolutional Neural Networks, clinical field, crucial for reducing, redundant and expensive
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical image translation is crucial for reducing the need for redundant and expensive multi-modal imaging in the clinical field. However, current approaches based on Convolutional Neural Networks (CNNs) and Transformers often fail to capture fine-grained semantic features, resulting in suboptimal image quality. To address this challenge, we propose SAM-I2I, a novel image-to-image translation framework based on the Segment Anything Model 2 (SAM2). SAM-I2I utilizes a pre-trained image encoder to extract multiscale semantic features from the source image and a decoder, based on the mask unit attention module, to synthesize target modality images. Our experiments on multi-contrast MRI datasets demonstrate that SAM-I2I outperforms state-of-the-art methods, offering more efficient and accurate medical image translation.

Machine Learning

[LG-0] Promoting User Data Autonomy During the Dissolution of a Monopolistic Firm NEURIPS2024

Link: https://arxiv.org/abs/2411.13546
Authors: Rushabh Solanki,Elliot Creager
Keywords: neural networks pre-trained, so-called foundation models, large neural networks, digital records, consumer products
Categories: Machine Learning (cs.LG)
Comments: This paper appeared at the 2nd Workshop on Regulatable ML at NeurIPS 2024

Abstract:The deployment of AI in consumer products is currently focused on the use of so-called foundation models, large neural networks pre-trained on massive corpora of digital records. This emphasis on scaling up datasets and pre-training computation raises the risk of further consolidating the industry, and enabling monopolistic (or oligopolistic) behavior. Judges and regulators seeking to improve market competition may employ various remedies. This paper explores dissolution – the breaking up of a monopolistic entity into smaller firms – as one such remedy, focusing in particular on the technical challenges and opportunities involved in the breaking up of large models and datasets. We show how the framework of Conscious Data Contribution can enable user autonomy under dissolution. Through a simulation study, we explore how fine-tuning and the phenomenon of “catastrophic forgetting” could actually prove beneficial as a type of machine unlearning that allows users to specify which data they want used for what purposes.

[LG-1] Procurement Auctions via Approximately Optimal Submodular Optimization

Link: https://arxiv.org/abs/2411.13513
Authors: Yuan Deng,Amin Karbasi,Vahab Mirrokni,Renato Paes Leme,Grigoris Velegkas,Song Zuo
Keywords: submodular optimization algorithms, submodular optimization, study procurement auctions, seeks to acquire, optimization algorithms
Categories: Computer Science and Game Theory (cs.GT); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:We study procurement auctions, where an auctioneer seeks to acquire services from strategic sellers with private costs. The quality of services is measured by a submodular function known to the auctioneer. Our goal is to design computationally efficient procurement auctions that (approximately) maximize the difference between the quality of the acquired services and the total cost of the sellers, while ensuring incentive compatibility (IC), individual rationality (IR) for sellers, and non-negative surplus (NAS) for the auctioneer. Our contributions are twofold: (i) we provide an improved analysis of existing algorithms for non-positive submodular function maximization, and (ii) we design efficient frameworks that transform submodular optimization algorithms into mechanisms that are IC, IR, NAS, and approximation-preserving. These frameworks apply to both the offline setting, where all sellers’ bids and services are available simultaneously, and the online setting, where sellers arrive in an adversarial order, requiring the auctioneer to make irrevocable decisions. We also explore whether state-of-the-art submodular optimization algorithms can be converted into descending auctions in adversarial settings, where the schedule of descending prices is determined by an adversary. We show that a submodular optimization algorithm satisfying bi-criteria (1/2, 1)-approximation in welfare can be effectively adapted to a descending auction. Additionally, we establish a connection between descending auctions and online submodular optimization. Finally, we demonstrate the practical applications of our frameworks by instantiating them with state-of-the-art submodular optimization algorithms and empirically comparing their welfare performance on publicly available datasets with thousands of sellers.

[LG-2] Advancing Heatwave Forecasting via Distribution Informed-Graph Neural Networks (DI-GNNs): Integrating Extreme Value Theory with GNNs

Link: https://arxiv.org/abs/2411.13496
Authors: Farrukh A. Chishtie,Dominique Brunet,Rachel H. White,Daniel Michelson,Jing Jiang,Vicky Lucas,Emily Ruboonga,Sayana Imaash,Melissa Westland,Timothy Chui,Rana Usman Ali,Mujtaba Hassan,Roland Stull,David Hudak
Keywords: posing substantial risks, Graph Neural Network, prolonged periods, posing substantial, public health
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Physics and Society (physics.soc-ph)
Comments: 23 pages, 13 figures, pdf format

Abstract:Heatwaves, prolonged periods of extreme heat, have intensified in frequency and severity due to climate change, posing substantial risks to public health, ecosystems, and infrastructure. Despite advancements in Machine Learning (ML) modeling, accurate heatwave forecasting at weather scales (1–15 days) remains challenging due to the non-linear interactions between atmospheric drivers and the rarity of these extreme events. Traditional models relying on heuristic feature engineering often fail to generalize across diverse climates and capture the complexities of heatwave dynamics. This study introduces the Distribution-Informed Graph Neural Network (DI-GNN), a novel framework that integrates principles from Extreme Value Theory (EVT) into the graph neural network architecture. DI-GNN incorporates Generalized Pareto Distribution (GPD)-derived descriptors into the feature space, adjacency matrix, and loss function to enhance its sensitivity to rare heatwave occurrences. By prioritizing the tails of climatic distributions, DI-GNN addresses the limitations of existing methods, particularly in imbalanced datasets where traditional metrics like accuracy are misleading. Empirical evaluations using weather station data from British Columbia, Canada, demonstrate the superior performance of DI-GNN compared to baseline models. DI-GNN achieved significant improvements in balanced accuracy, recall, and precision, with high AUC and average precision scores, reflecting its robustness in distinguishing heatwave events.
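
For readers unfamiliar with the Generalized Pareto Distribution mentioned above, here is a minimal sketch of the peaks-over-threshold idea from Extreme Value Theory: keep only the exceedances above a high threshold and describe them with a GPD. This is not the paper's descriptor pipeline. The temperatures, the threshold, and the `sigma` and `xi` values are hypothetical and are not fitted to any data.

```python
import math


def exceedances(values, threshold):
    """Amounts by which observations exceed the threshold (peaks over threshold)."""
    return [v - threshold for v in values if v > threshold]


def gpd_cdf(y, sigma, xi):
    """CDF of the Generalized Pareto Distribution for an exceedance y.

    sigma > 0 is the scale parameter and xi is the shape parameter.
    """
    if y < 0:
        return 0.0
    if abs(xi) < 1e-12:
        # The limit xi -> 0 is the exponential distribution.
        return 1.0 - math.exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    if base <= 0.0:
        # For xi < 0 the distribution has a finite upper endpoint.
        return 1.0
    return 1.0 - base ** (-1.0 / xi)


# Hypothetical daily maximum temperatures in degrees Celsius.
daily_max = [24.0, 31.5, 29.0, 35.2, 27.8, 33.1, 38.4]
threshold = 30.0
tail = exceedances(daily_max, threshold)

# Hypothetical GPD parameters, chosen only for illustration.
sigma, xi = 3.0, 0.1
prob_beyond_5 = 1.0 - gpd_cdf(5.0, sigma, xi)

print(f"{len(tail)} exceedances above {threshold} C")
print(f"P(exceedance > 5 C | above threshold) = {prob_beyond_5:.3f}")
```

In practice `sigma` and `xi` would be estimated from the exceedances, for example by maximum likelihood; the abstract does not say how the paper does this.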

[LG-3] Sampling and Integration of Logconcave Functions by Algorithmic Diffusion

Link: https://arxiv.org/abs/2411.13462
Authors: Yunbum Kook,Santosh S. Vempala
Keywords: integrating arbitrary logconcave, arbitrary logconcave functions, general logconcave functions, integrating arbitrary, logconcave functions
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 60 pages, 1 figure

Abstract:We study the complexity of sampling, rounding, and integrating arbitrary logconcave functions. Our new approach provides the first complexity improvements in nearly two decades for general logconcave functions for all three problems, and matches the best-known complexities for the special case of uniform distributions on convex bodies. For the sampling problem, our output guarantees are significantly stronger than previously known, and lead to a streamlined analysis of statistical estimation based on dependent random samples.

[LG-4] A Survey On Enhancing Reinforcement Learning in Complex Environments: Insights from Human and LLM Feedback

Link: https://arxiv.org/abs/2411.13410
Authors: Alireza Rashidi Laleh,Majid Nili Ahmadabadi
Keywords: demonstrating remarkable potential, tackling real-world challenges, demonstrating remarkable, Reinforcement learning, active fields
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement learning (RL) is one of the active fields in machine learning, demonstrating remarkable potential in tackling real-world challenges. Despite its promising prospects, this methodology has encountered issues and challenges that hinder it from achieving the best performance. In particular, these approaches lack decent performance when navigating environments and solving tasks with large observation space, often resulting in sample-inefficiency and prolonged learning times. This issue, commonly referred to as the curse of dimensionality, complicates decision-making for RL agents, necessitating a careful balance between attention and decision-making. RL agents, when augmented with human or large language models’ (LLMs) feedback, may exhibit resilience and adaptability, leading to enhanced performance and accelerated learning. Such feedback, conveyed through various modalities or granularities including natural language, serves as a guide for RL agents, aiding them in discerning relevant environmental cues and optimizing decision-making processes. In this survey paper, we focus on two problems: first, we examine human or LLM assistance, investigating the ways in which these entities may collaborate with the RL agent in order to foster optimal behavior and expedite learning; second, we delve into the research papers dedicated to addressing the intricacies of environments characterized by large observation space.

[LG-5] ODTE – An ensemble of multi-class SVM-based oblique decision trees

Link: https://arxiv.org/abs/2411.13376
Authors: Ricardo Montañana,José A. Gámez,José M. Puerta
Keywords: oblique decision trees, oblique decision, decision trees, decision nodes, decision
Categories: Machine Learning (cs.LG)
Comments: 29 pages

Abstract:We propose ODTE, a new ensemble that uses oblique decision trees as base classifiers. Additionally, we introduce STree, the base algorithm for growing oblique decision trees, which leverages support vector machines to define hyperplanes within the decision nodes. We embed a multiclass strategy – one-vs-one or one-vs-rest – at the decision nodes, allowing the model to directly handle non-binary classification tasks without the need to cluster instances into two groups, as is common in other approaches from the literature. In each decision node, only the best-performing SVM model – the one that minimizes an impurity measure for the n-ary classification – is retained, even if the learned SVM addresses a binary classification subtask. An extensive experimental study involving 49 datasets and various state-of-the-art algorithms for oblique decision tree ensembles has been conducted. Our results show that ODTE ranks consistently above its competitors, achieving significant performance gains when hyperparameters are carefully tuned. Moreover, the oblique decision trees learned through STree are more compact than those produced by other algorithms evaluated in our experiments.

[LG-6] Predicting Wall Thickness Changes in Cold Forging Processes: An Integrated FEM and Neural Network approach

Link: https://arxiv.org/abs/2411.13366
Authors: Sasa Ilic,Abdulkerim Karaman,Johannes Pöppelbaum,Jan Niclas Reimann,Michael Marré,Andreas Schwung
Keywords: Finite Element Method, Element Method, study presents, surrogate models, Finite Element
Categories: Machine Learning (cs.LG)

Abstract:This study presents a novel approach for predicting wall thickness changes in tubes during the nosing process. Specifically, we first provide a thorough analysis of nosing processes and the influencing parameters. We further set up a Finite Element Method (FEM) simulation to better analyse the effects of varying process parameters. However, traditional FEM simulations, while accurate, are time-consuming and computationally intensive, which renders them unsuitable for real-time application. We therefore present a novel modeling framework based on specifically designed graph neural networks as surrogate models. To this end, we extend the neural network architecture by directly incorporating information about the nosing process by adding different types of edges and their corresponding encoders to model object interactions. This augmentation enhances model accuracy and opens the possibility for employing precise surrogate models within closed-loop production processes. The proposed approach is evaluated using a new evaluation metric termed area between thickness curves (ABTC). The results demonstrate promising performance and highlight the potential of neural networks as surrogate models in predicting wall thickness changes during nosing forging processes.

[LG-7] Vertical Validation: Evaluating Implicit Generative Models for Graphs on Thin Support Regions UAI2024

Link: https://arxiv.org/abs/2411.13358
Authors: Mai Elkady,Thu Bui,Bruno Ribeiro,David I. Inouye
Keywords: implicit graph generative, graph generative models, growing excitement, medicine or material, graph generative
Categories: Machine Learning (cs.LG)
Comments: Accepted to UAI 2024

Abstract:There has been a growing excitement that implicit graph generative models could be used to design or discover new molecules for medicine or material design. Because these molecules have not been discovered, they naturally lie in unexplored or scarcely supported regions of the distribution of known molecules. However, prior evaluation methods for implicit graph generative models have focused on validating statistics computed from the thick support (e.g., mean and variance of a graph property). Therefore, there is a mismatch between the goal of generating novel graphs and the evaluation methods. To address this evaluation gap, we design a novel evaluation method called Vertical Validation (VV) that systematically creates thin support regions during the train-test splitting procedure and then reweights generated samples so that they can be compared to the held-out test data. This procedure can be seen as a generalization of the standard train-test procedure except that the splits are dependent on sample features. We demonstrate that our method can be used to perform model selection if performance on thin support regions is the desired goal. As a side benefit, we also show that our approach can better detect overfitting as exemplified by memorization.

[LG-8] Transformers with Sparse Attention for Granger Causality

Link: https://arxiv.org/abs/2411.13264
Authors: Riya Mahesh,Rahul Vashisht,Chandrashekar Lakshminarayanan
Keywords: Temporal causal analysis, analysis means understanding, understanding the underlying, Granger Causality, Sparse Attention Transformer
Categories: Machine Learning (cs.LG)

Abstract:Temporal causal analysis means understanding the underlying causes behind observed variables over time. Deep learning based methods such as transformers are increasingly used to capture temporal dynamics and causal relationships beyond mere correlations. Recent works suggest self-attention weights of transformers as a useful indicator of causal links. We leverage this to propose a novel modification to the self-attention module to establish causal links between the variables of multivariate time-series data with varying lag dependencies. Our Sparse Attention Transformer captures causal relationships using a two-fold approach - performing temporal attention first followed by attention between the variables across the time steps masking them individually to compute Granger Causality indices. The key novelty in our approach is the ability of the model to assert importance and pick the most significant past time instances for its prediction task against manually feeding a fixed time lag value. We demonstrate the effectiveness of our approach via extensive experimentation on several synthetic benchmark datasets. Furthermore, we compare the performance of our model with the traditional Vector Autoregression based Granger Causality method that assumes fixed lag length.

[LG-9] A Unified Analysis for Finite Weight Averaging

Link: https://arxiv.org/abs/2411.13169
Authors: Peng Wang,Li Shen,Zerui Tao,Yan Sun,Guodong Zheng,Dacheng Tao
Keywords: Stochastic Weight Averaging, Stochastic Gradient Descent, Exponential Moving Average, finite weight averaging, LAtest Weight Averaging
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 34 pages

Abstract:Averaging the iterates of Stochastic Gradient Descent (SGD) has achieved empirical success in training deep learning models, with methods such as Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA). In particular, LAWA, a finite weight averaging method, can attain faster convergence and better generalization. However, its theoretical explanation is still underexplored, since there are fundamental differences between the finite and infinite settings. In this work, we first generalize SGD and LAWA as Finite Weight Averaging (FWA) and explain its advantages compared to SGD from the perspective of optimization and generalization. A key challenge is that traditional methods, in the sense of expectation or optimal values for infinite-dimensional settings, are inapplicable to analyzing FWA’s convergence. Second, the cumulative gradients introduced by FWA add further complications to the generalization analysis, especially making it more difficult to discuss them under different assumptions. Extending the final iteration convergence analysis to FWA, this paper, under a convexity assumption, establishes a convergence bound $\mathcal{O}(\log(T/k)/\sqrt{T})$, where $k \in [1, T/2]$ is a constant representing the last $k$ iterations. Compared to SGD with $\mathcal{O}(\log(T)/\sqrt{T})$, we prove theoretically that FWA has a faster convergence rate and explain the effect of the number of average points. In the generalization analysis, we find a recursive representation for bounding the cumulative gradient using mathematical induction. We provide bounds for constant and decay learning rates and the convex and non-convex cases to show the good generalization performance of FWA. Finally, experimental results on several benchmarks verify our theoretical results.
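
The core operation described above, averaging the last k iterates of SGD, is easy to sketch. The toy problem below (noisy SGD on f(w) = 0.5 * w^2), the step size, and the function names are assumptions made for illustration and are not the paper's experimental setup. With k = 1 the average reduces to the plain last iterate of SGD.

```python
import random


def average_last_k(iterates, k):
    """Average the final k iterates: finite weight averaging over a window of size k."""
    if not 1 <= k <= len(iterates):
        raise ValueError("k must be between 1 and the number of iterates")
    window = iterates[-k:]
    return sum(window) / k


def noisy_sgd(steps, lr, seed=0):
    """SGD on the toy objective f(w) = 0.5 * w**2 with Gaussian gradient noise."""
    rng = random.Random(seed)
    w = 5.0
    iterates = []
    for _ in range(steps):
        grad = w + rng.gauss(0.0, 1.0)  # exact gradient w plus unit-variance noise
        w -= lr * grad
        iterates.append(w)
    return iterates


iterates = noisy_sgd(steps=2000, lr=0.05)
last_iterate = average_last_k(iterates, k=1)    # same as iterates[-1]
fwa_estimate = average_last_k(iterates, k=200)  # average over the last 200 iterates

print(f"last iterate:        {last_iterate:+.4f}")
print(f"average of last 200: {fwa_estimate:+.4f}")
```

The minimiser of the toy objective is w = 0, so the two printed values give a rough feel for how averaging damps gradient noise on a single run; they say nothing about the rates proved in the paper.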

[LG-10] Unlocking Historical Clinical Trial Data with ALIGN: A Compositional Large Language Model System for Medical Coding

Link: https://arxiv.org/abs/2411.13163
Authors: Nabeel Seedat,Caterina Tozzi,Andrea Hita Ardiaca,Mihaela van der Schaar,James Weatherall,Adam Taylor
Keywords: Large Language Models, ALIGN, reuse of historical, significant potential, medical
Categories: Machine Learning (cs.LG)

Abstract:The reuse of historical clinical trial data has significant potential to accelerate medical research and drug development. However, interoperability challenges, particularly with missing medical codes, hinder effective data integration across studies. While Large Language Models (LLMs) offer a promising solution for automated coding without labeled data, current approaches face challenges on complex coding tasks. We introduce ALIGN, a novel compositional LLM-based system for automated, zero-shot medical coding. ALIGN follows a three-step process: (1) diverse candidate code generation; (2) self-evaluation of codes and (3) confidence scoring and uncertainty estimation enabling human deferral to ensure reliability. We evaluate ALIGN on harmonizing medication terms into Anatomical Therapeutic Chemical (ATC) and medical history terms into Medical Dictionary for Regulatory Activities (MedDRA) codes extracted from 22 immunology trials. ALIGN outperformed the LLM baselines, while also providing capabilities for trustworthy deployment. For MedDRA coding, ALIGN achieved high accuracy across all levels, matching RAG and excelling at the most specific levels (87-90% for HLGT). For ATC coding, ALIGN demonstrated superior performance, particularly at lower hierarchy levels (ATC Level 4), with 72-73% overall accuracy and 86-89% accuracy for common medications, outperforming baselines by 7-22%. ALIGN’s uncertainty-based deferral improved accuracy by 17%, to 90%, with 30% deferral, notably enhancing performance on uncommon medications. ALIGN achieves this cost-efficiently at $0.0007 and $0.02 per code for GPT-4o-mini and GPT-4o, reducing barriers to clinical adoption. ALIGN advances automated medical coding for clinical trial data, contributing to enhanced data interoperability and reusability, positioning it as a promising tool to improve clinical research and accelerate drug development.

[LG-11] Long-term Detection System for Six Kinds of Abnormal Behavior of the Elderly Living Alone

Link: https://arxiv.org/abs/2411.13153
Authors: Kai Tanaka,Mineichi Kudo,Keigo Kimura,Atsuyoshi Nakamura
Keywords: elderly people, increasing worldwide, detection, Japan, housebound
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 3 figures

Abstract:The proportion of elderly people is increasing worldwide, particularly those living alone in Japan. As elderly people get older, their risks of physical disabilities and health issues increase. To automatically discover these issues at a low cost in daily life, sensor-based detection in a smart home is promising. As part of the effort towards early detection of abnormal behaviors, we propose a simulator-based detection system for six typical anomalies: being semi-bedridden, being housebound, forgetting, wandering, falling while walking, and falling while standing. Our detection system can be customized for various room layouts, sensor arrangements, and resident characteristics by training detection classifiers using the simulator with the parameters fitted to individual cases. Considering that the six anomalies that our system detects have various occurrence durations, such as being housebound for weeks or lying still for seconds after a fall, the detection classifiers of our system produce anomaly labels depending on each anomaly’s occurrence duration, e.g., housebound per day and falls per second. We propose a method that standardizes the processing of sensor data, and uses a simple detection approach. Although the validity depends on the realism of the simulation, numerical evaluations using sensor data that includes a variety of resident behavior patterns over nine years as test data show that (1) the methods for detecting wandering and falls are comparable to previous methods, and (2) the methods for detecting being semi-bedridden, being housebound, and forgetting achieve a sensitivity of over 0.9 with fewer than one false alarm every 50 days.

[LG-12] Domain Adaptive Unfolded Graph Neural Networks

Link: https://arxiv.org/abs/2411.13137
Authors: Zepeng Zhang,Olga Fink
Keywords: machine learning tasks, made significant progress, graph neural networks, numerous graph machine, graph machine learning
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Over the last decade, graph neural networks (GNNs) have made significant progress in numerous graph machine learning tasks. In real-world applications, where domain shifts occur and labels are often unavailable for a new target domain, graph domain adaptation (GDA) approaches have been proposed to facilitate knowledge transfer from the source domain to the target domain. Previous efforts in tackling distribution shifts across domains have mainly focused on aligning the node embedding distributions generated by the GNNs in the source and target domains. However, as the core part of GDA approaches, the impact of the underlying GNN architecture has received limited attention. In this work, we explore this orthogonal direction, i.e., how to facilitate GDA with architectural enhancement. In particular, we consider a class of GNNs that are designed explicitly based on optimization problems, namely unfolded GNNs (UGNNs), whose training process can be represented as bi-level optimization. Empirical and theoretical analyses demonstrate that when transferring from the source domain to the target domain, the lower-level objective value generated by the UGNNs significantly increases, resulting in an increase in the upper-level objective as well. Motivated by this observation, we propose a simple yet effective strategy called cascaded propagation (CP), which is guaranteed to decrease the lower-level objective value. The CP strategy is widely applicable to general UGNNs, and we evaluate its efficacy with three representative UGNN architectures. Extensive experiments on five real-world datasets demonstrate that the UGNNs integrated with CP outperform state-of-the-art GDA baselines.

[LG-13] Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

Link: https://arxiv.org/abs/2411.13117
Authors: Charles O’Neill,David Klindt
Keywords: uncover interpretable features, sparse, sparse inference, recent line, shown promise
Categories: Machine Learning (cs.LG)

Abstract:A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform accurate sparse inference. In this paper, we investigate sparse inference and learning in SAEs through the lens of sparse coding. Specifically, we show that SAEs perform amortised sparse inference with a computationally restricted encoder and, using compressed sensing theory, we prove that this mapping is inherently insufficient for accurate sparse inference, even in solvable cases. Building on this theory, we empirically explore conditions where more sophisticated sparse inference methods outperform traditional SAE encoders. Our key contribution is the decoupling of the encoding and decoding processes, which allows for a comparison of various sparse encoding strategies. We evaluate these strategies on two dimensions: alignment with true underlying sparse features and correct inference of sparse codes, while also accounting for computational costs during training and inference. Our results reveal that substantial performance gains can be achieved with minimal increases in compute cost. We demonstrate that this generalises to SAEs applied to large language models (LLMs), where advanced encoders achieve similar interpretability. This work opens new avenues for understanding neural network representations and offers important implications for improving the tools we use to analyse the activations of large language models.
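
The contrast the abstract draws, between a single linear-nonlinear encoding pass and iterative sparse inference, can be illustrated on a tiny dictionary. Everything in the sketch is an assumption made for illustration: the 2-by-3 dictionary `D`, the soft-threshold nonlinearity with tied weights, and the use of ISTA (iterative shrinkage-thresholding) as a stand-in for "more sophisticated sparse inference". It is not the paper's architecture or benchmark.

```python
def soft_threshold(v, lam):
    """Shrink v toward zero by lam; values inside [-lam, lam] become exactly 0."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def one_step_encoder(D, x, lam):
    """Single linear map followed by a pointwise nonlinearity (amortised-style)."""
    return [soft_threshold(v, lam) for v in matvec(transpose(D), x)]


def ista(D, x, lam, step, iters=500):
    """ISTA for min_z 0.5 * ||x - D z||^2 + lam * ||z||_1.

    step must not exceed 1 / L, where L is the largest eigenvalue of D^T D.
    """
    Dt = transpose(D)
    z = [0.0] * len(Dt)
    for _ in range(iters):
        residual = [xi - ri for xi, ri in zip(x, matvec(D, z))]
        grad = matvec(Dt, residual)
        z = [soft_threshold(zi + step * gi, step * lam) for zi, gi in zip(z, grad)]
    return z


# Overcomplete dictionary: three unit-norm atoms (columns) in two dimensions.
D = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
x = [0.6, 0.8]  # generated by the third atom alone, i.e. true code [0, 0, 1]

amortised_code = one_step_encoder(D, x, lam=0.1)
iterative_code = ista(D, x, lam=0.1, step=0.5)  # L = 2 for this D, so step = 1 / L

print("one-step encoder:", [round(v, 3) for v in amortised_code])
print("ISTA:            ", [round(v, 3) for v in iterative_code])
```

Because the atoms are correlated, the single pass activates all three code entries, while the iterative solver keeps only the entry for the atom that generated `x`, shrunk slightly by the L1 penalty.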

[LG-14] DRL-Based Optimization for AoI and Energy Consumption in C-V2X Enabled IoV

Link: https://arxiv.org/abs/2411.13104
Authors: Zheng Zhang,Qiong Wu,Pingyi Fan,Nan Cheng,Wen Chen,Khaled B. Letaief
Keywords: Generation Partnership Project, Partnership Project, Generation Partnership, communication latency issues, address communication latency
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: This paper has been submitted to IEEE Journal. The source code has been released at: this https URL

Abstract:To address communication latency issues, the Third Generation Partnership Project (3GPP) has defined Cellular-Vehicle to Everything (C-V2X) technology, which includes Vehicle-to-Vehicle (V2V) communication for direct vehicle-to-vehicle communication. However, this method requires vehicles to autonomously select communication resources based on the Semi-Persistent Scheduling (SPS) protocol, which may lead to collisions due to different vehicles sharing the same communication resources, thereby affecting communication effectiveness. Non-Orthogonal Multiple Access (NOMA) is considered a potential solution for handling large-scale vehicle communication, as it can enhance the Signal-to-Interference-plus-Noise Ratio (SINR) by employing Successive Interference Cancellation (SIC), thereby reducing the negative impact of communication collisions. When evaluating vehicle communication performance, traditional metrics such as reliability and transmission delay present certain contradictions. Introducing the new metric Age of Information (AoI) provides a more comprehensive evaluation of the communication system. Additionally, to ensure service quality, user terminals need to possess high computational capabilities, which may lead to increased energy consumption, necessitating a trade-off between communication energy consumption and effectiveness. Given the complexity and dynamics of communication systems, Deep Reinforcement Learning (DRL) serves as an intelligent learning method capable of learning optimal strategies in dynamic environments. Therefore, this paper analyzes the effects of multi-priority queues and NOMA on AoI in the C-V2X vehicular communication system and proposes an energy consumption and AoI optimization method based on DRL. Finally, through comparative simulations with baseline methods, the proposed approach demonstrates its advantages in terms of energy consumption and AoI.
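
The abstract does not define Age of Information, so the sketch below uses the common definition: at time t, the AoI is t minus the generation time of the freshest update received so far. The update timestamps and the function name are hypothetical, and the paper's multi-priority queue and NOMA model are not represented.

```python
def age_of_information(t, updates):
    """AoI at time t: t minus the generation time of the freshest received update.

    updates is a list of (generation_time, reception_time) pairs.
    Returns None if nothing has been received by time t.
    """
    received = [generated for generated, arrived in updates if arrived <= t]
    if not received:
        return None
    return t - max(received)


# Hypothetical status updates as (generated at, received at) pairs, in seconds.
updates = [(0.0, 1.0), (2.0, 2.5), (4.0, 6.0)]

for t in (0.5, 3.0, 5.0, 6.0):
    print(f"t = {t}: AoI = {age_of_information(t, updates)}")
```

The age grows linearly between receptions and drops whenever a fresher update arrives. A delayed update, such as the third one here, keeps the age high even though it was generated on time, which is the kind of effect that delay or reliability alone does not capture.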

[LG-15] Incremental Label Distribution Learning with Scalable Graph Convolutional Networks

Link: https://arxiv.org/abs/2411.13097
Authors: Ziqi Jia,Xiaoyang Qu,Chenghao Liu,Jianzong Wang
Keywords: Label Distribution Learning, handling label ambiguity, Label Distribution, Distribution Learning, Incremental Label Distribution
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
*Comments: Accepted by the 26th IEEE International Conference on High Performance Computing and Communications (HPCC2024)

Abstract:Label Distribution Learning (LDL) is an effective approach for handling label ambiguity, as it can analyze all labels at once and indicate the extent to which each label describes a given sample. Most existing LDL methods consider the number of labels to be static. However, in various LDL-specific contexts (e.g., disease diagnosis), the label count grows over time (such as the discovery of new diseases), a factor that existing methods overlook. Learning samples with new labels directly means learning all labels at once, thus wasting more time on the old labels and even risking overfitting the old labels. At the same time, learning new labels by the LDL model means reconstructing the inter-label relationships. How to make use of constructed relationships is also a crucial challenge. To tackle these challenges, we introduce Incremental Label Distribution Learning (ILDL), analyze its key issues regarding training samples and inter-label relationships, and propose Scalable Graph Label Distribution Learning (SGLDL) as a practical framework for implementing ILDL. Specifically, in SGLDL, we develop a New-label-aware Gradient Compensation Loss to speed up the learning of new labels and represent inter-label relationships as a graph to reduce the time required to reconstruct inter-label relationships. Experimental results on the classical LDL dataset show the clear advantages of unique algorithms and illustrate the importance of a dedicated design for the ILDL problem.

[LG-16] Omnipredicting Single-Index Models with Multi-Index Models

Link: https://arxiv.org/abs/2411.13083
Authors: Lunjia Hu,Kevin Tian,Chutong Yang
Keywords: Recent work, defined the notion, work on supervised, minimizing a family
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
*Comments:

Abstract:Recent work on supervised learning [GKR+22] defined the notion of omnipredictors, i.e., predictor functions $p$ over features that are simultaneously competitive for minimizing a family of loss functions $\mathcal{L}$ against a comparator class $\mathcal{C}$. Omniprediction requires approximating the Bayes-optimal predictor beyond the loss minimization paradigm, and has generated significant interest in the learning theory community. However, even for basic settings such as agnostically learning single-index models (SIMs), existing omnipredictor constructions require impractically-large sample complexities and runtimes, and output complex, highly-improper hypotheses. Our main contribution is a new, simple construction of omnipredictors for SIMs. We give a learner outputting an omnipredictor that is $\varepsilon$-competitive on any matching loss induced by a monotone, Lipschitz link function, when the comparator class is bounded linear predictors. Our algorithm requires $\approx \varepsilon^{-4}$ samples and runs in nearly-linear time, and its sample complexity improves to $\approx \varepsilon^{-2}$ if link functions are bi-Lipschitz. This significantly improves upon the only prior known construction, due to [HJKRR18, GHK+23], which used $\gtrsim \varepsilon^{-10}$ samples. We achieve our construction via a new, sharp analysis of the classical Isotron algorithm [KS09, KKKS11] in the challenging agnostic learning setting, of potential independent interest. Previously, Isotron was known to properly learn SIMs in the realizable setting, as well as constant-factor competitive hypotheses under the squared loss [ZWDD24]. As they are based on Isotron, our omnipredictors are multi-index models with $\approx \varepsilon^{-2}$ prediction heads, bringing us closer to the tantalizing goal of proper omniprediction for general loss families and comparators.

[LG-17] Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Link: https://arxiv.org/abs/2411.13055
Authors: Jared Fernandez,Luca Wehrstedt,Leonid Shamis,Mostafa Elhoushi,Kalyan Saladi,Yonatan Bisk,Emma Strubell,Jacob Kahn
Keywords: Dramatic increases, neural network models, computational resources, model size, capabilities of neural
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Abstract:Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e. compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, overhead incurred from certain distributed communication strategies leads parallelization strategies previously thought to be sub-optimal to in fact become preferable; and (2) scaling the total number of accelerators for large model training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.

[LG-18] On-device Content-based Recommendation with Single-shot Embedding Pruning: A Cooperative Game Perspective

Link: https://arxiv.org/abs/2411.13052
Authors: Hung Vinh Tran,Tong Chen,Guanhua Ye,Quoc Viet Hung Nguyen,Kai Zheng,Hongzhi Yin
Keywords: Content-based Recommender Systems, Content-based Recommender, Recommender Systems, shaping user experiences, online advertising
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Abstract:Content-based Recommender Systems (CRSs) play a crucial role in shaping user experiences in e-commerce, online advertising, and personalized recommendations. However, due to the vast amount of categorical features, the embedding tables used in CRS models pose a significant storage bottleneck for real-world deployment, especially on resource-constrained devices. To address this problem, various embedding pruning methods have been proposed, but most existing ones require expensive retraining steps for each target parameter budget, leading to enormous computation costs. In reality, this computation cost is a major hurdle in real-world applications with diverse storage requirements, such as federated learning and streaming settings. In this paper, we propose Shapley Value-guided Embedding Reduction (Shaver) as our response. With Shaver, we view the problem from a cooperative game perspective, and quantify each embedding parameter’s contribution with Shapley values to facilitate contribution-based parameter pruning. To address the inherently high computation costs of Shapley values, we propose an efficient and unbiased method to estimate Shapley values of a CRS’s embedding parameters. Moreover, in the pruning stage, we put forward a field-aware codebook to mitigate the information loss in the traditional zero-out treatment. Through extensive experiments on three real-world datasets, Shaver has demonstrated competitive performance with lightweight recommendation models across various parameter budgets. The source code is available at this https URL
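
To make the cooperative-game view concrete, here is a minimal sketch of the generic permutation-sampling estimator for Shapley values: it averages each player's marginal contribution over uniformly random orderings, which gives an unbiased estimate. This is the textbook Monte Carlo estimator, not the estimation method proposed in the paper; the names `shapley_monte_carlo` and `toy_value` and the toy utility are hypothetical.

```python
import random


def shapley_monte_carlo(players, value_fn, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over uniformly random player orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev_value = value_fn(coalition)
        for p in order:
            # Marginal contribution of p when joining the current coalition.
            coalition.add(p)
            new_value = value_fn(coalition)
            totals[p] += new_value - prev_value
            prev_value = new_value
    return {p: total / num_permutations for p, total in totals.items()}


def toy_value(coalition):
    # Hypothetical utility: additive weights plus a bonus when "a" and "b" cooperate.
    weights = {"a": 1.0, "b": 2.0, "c": 3.0}
    bonus = 1.0 if {"a", "b"} <= coalition else 0.0
    return sum(weights[p] for p in coalition) + bonus


if __name__ == "__main__":
    print(shapley_monte_carlo(["a", "b", "c"], toy_value))
```

In a pruning setting, the players would stand for embedding parameters and the utility for model quality, so parameters with the smallest estimated contribution become pruning candidates. Each sampled ordering needs one utility evaluation per player, which is the cost that more efficient estimators try to reduce.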

[LG-19] Probably Approximately Precision and Recall Learning

Link: https://arxiv.org/abs/2411.13029
Authors: Lee Cohen,Yishay Mansour,Shay Moran,Han Shao
Keywords: foundational metrics, accurate predictions, predictions and comprehensive, comprehensive coverage, PAC learning
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Precision and Recall are foundational metrics in machine learning where both accurate predictions and comprehensive coverage are essential, such as in recommender systems and multi-label learning. In these tasks, balancing precision (the proportion of relevant items among those predicted) and recall (the proportion of relevant items successfully predicted) is crucial. A key challenge is that one-sided feedback–where only positive examples are observed during training–is inherent in many practical problems. For instance, in recommender systems like YouTube, training data only consists of videos that a user has actively selected, while unselected items remain unseen. Despite this lack of negative feedback in training, avoiding undesirable recommendations at test time is essential. We introduce a PAC learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions, such as between users and items. This framework subsumes the classical binary and multi-class PAC learning models as well as multi-label learning with partial feedback, where only a single random correct label per example is observed, rather than all correct labels. Our work uncovers a rich statistical and algorithmic landscape, with nuanced boundaries on what can and cannot be learned. Notably, classical methods like Empirical Risk Minimization fail in this setting, even for simple hypothesis classes with only two hypotheses. To address these challenges, we develop novel algorithms that learn exclusively from positive data, effectively minimizing both precision and recall losses. Specifically, in the realizable setting, we design algorithms that achieve optimal sample complexity guarantees. In the agnostic case, we show that it is impossible to achieve additive error guarantees–as is standard in PAC learning–and instead obtain meaningful multiplicative approximations. 
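
The two definitions given in the abstract can be written out directly. The sketch below treats the predicted and relevant items as sets; the function name `precision_recall`, the item identifiers, and the convention of returning 0.0 for an empty set are assumptions for illustration only.

```python
def precision_recall(predicted, relevant):
    """Precision: share of predicted items that are relevant.
    Recall: share of relevant items that were predicted."""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)
    # Returning 0.0 on an empty set is a convention chosen for this sketch.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical recommendation list versus the videos a user actually likes.
    p, r = precision_recall(["v1", "v2", "v3", "v4"], ["v2", "v4", "v5"])
    print(p, r)
```

Under one-sided feedback the learner never observes the full relevant set during training, so neither quantity can be computed on the training data in this way; that is the difficulty the abstract describes.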

[LG-20] A Theory for Compressibility of Graph Transformers for Transductive Learning

Link: https://arxiv.org/abs/2411.13028
Authors: Hamed Shirzad,Honghao Lin,Ameya Velingker,Balaji Venkatachalam,David Woodruff,Danica Sutherland
Keywords: typical supervised machine, supervised machine learning, machine learning tasks, graphs differ fundamentally, Transductive tasks
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks, as the independent and identically distributed (i.i.d.) assumption does not hold among samples. Instead, all train/test/validation samples are present during training, making them more akin to a semi-supervised task. These differences make the analysis of the models substantially different from other models. Recently, Graph Transformers have significantly improved results on these datasets by overcoming long-range dependency problems. However, the quadratic complexity of full Transformers has driven the community to explore more efficient variants, such as those with sparser attention patterns. While the attention matrix has been extensively discussed, the hidden dimension or width of the network has received less attention. In this work, we establish some theoretical bounds on how and under what conditions the hidden dimension of these networks can be compressed. Our results apply to both sparse and dense variants of Graph Transformers.

[LG-21] Scalable Deep Metric Learning on Attributed Graphs

Link: https://arxiv.org/abs/2411.13014
Authors: Xiang Li,Gagan Agrawal,Ruoming Jin,Rajiv Ramnath
Keywords: large attributed graphs, supporting multiple downstream, attributed graphs, problem of constructing, supporting multiple
Categories: Machine Learning (cs.LG)
*Comments: This is the complete version of a published paper with appendix including detailed proofs

Abstract:We consider the problem of constructing embeddings of large attributed graphs and supporting multiple downstream learning tasks. We develop a graph embedding method, which is based on extending deep metric and unbiased contrastive learning techniques to 1) work with attributed graphs, 2) enable a mini-batch based approach, and 3) achieve scalability. Based on a multi-class tuplet loss function, we present two algorithms – DMT for semi-supervised learning and DMAT-i for the unsupervised case. Analyzing our methods, we provide a generalization bound for the downstream node classification task and for the first time relate tuplet loss to contrastive learning. Through extensive experiments, we show high scalability of representation construction, and in applying the method for three downstream tasks (node clustering, node classification, and link prediction) better consistency over any single existing method.

[LG-22] Deriving Activation Functions via Integration

Link: https://arxiv.org/abs/2411.13010
Authors: Allen Hao Huang
Keywords: deep neural networks, Activation functions play, Exponential Linear Unit, ELU activation function, neural networks
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Comments:

Abstract:Activation functions play a crucial role in introducing non-linearities to deep neural networks. We propose a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding functions through integration. Our work introduces the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by integrating trainable affine transformations applied on the ELU activation function. xIELU combines two key gradient properties: a trainable and linearly increasing gradient for positive inputs, similar to ReLU$^2$, and a trainable negative gradient flow for negative inputs, akin to xSiLU. Conceptually, xIELU can be viewed as extending ReLU$^2$ to effectively handle negative inputs. In experiments with 1.1B parameter Llama models trained on 126B tokens of FineWeb Edu, xIELU achieves lower perplexity compared to both ReLU$^2$ and SwiGLU when matched for the same compute cost and parameter count.
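
To illustrate the "derive by integration" idea, the worked equation below integrates the standard ELU (with its usual scale $\alpha$) from $0$ to $x$. This is only the untransformed base case and is offered as an assumption-laden sketch: per the abstract, xIELU additionally applies trainable affine transformations to ELU before integrating, and those parameters are not reproduced here.

```latex
\mathrm{ELU}(x) =
\begin{cases}
  x, & x > 0 \\
  \alpha\,(e^{x} - 1), & x \le 0
\end{cases}
\qquad\Longrightarrow\qquad
\int_{0}^{x} \mathrm{ELU}(t)\,dt =
\begin{cases}
  \tfrac{1}{2}\,x^{2}, & x > 0 \\
  \alpha\,(e^{x} - x - 1), & x \le 0
\end{cases}
```

For positive inputs the integral is quadratic, so its gradient grows linearly, as with ReLU$^2$, whose gradient is $2x$ for $x > 0$. For negative inputs the gradient is $\alpha\,(e^{x} - 1)$, which is non-zero, so a gradient signal still flows.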

[LG-23] MERLOT: A Distilled LLM-based Mixture-of-Experts Framework for Scalable Encrypted Traffic Classification

Link: https://arxiv.org/abs/2411.13004
Authors: Yuxuan Chen,Rongpeng Li,Zhifeng Zhao,Honggang Zhang
Keywords: distilled large language, present MERLOT, large language model, language model optimized, based refinement
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:

Abstract:We present MERLOT, a scalable mixture-of-experts (MoE) based refinement of a distilled large language model optimized for encrypted traffic classification. By applying model distillation techniques in a teacher-student paradigm, compact models derived from GPT-2-base retain high classification accuracy while minimizing computational costs. These models function as specialized experts in an MoE architecture, dynamically assigned via a gating network. Unlike generation-based methods, our approach directly classifies encrypted traffic using the final decoder token with contextual feature embedding as input. Experiments on 10 datasets show superior or competitive performance over the state-of-the-art models while significantly reducing resource demands, underscoring its effectiveness and robustness.

[LG-24] NCAirFL: CSI-Free Over-the-Air Federated Learning Based on Non-Coherent Detection

Link: https://arxiv.org/abs/2411.13000
Authors: Haifeng Wen,Nicolò Michelusi,Osvaldo Simeone,Hong Xing
Keywords: leverages computing primitively, multiple access channels, federated learning, leverages computing, computing primitively
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: 6 pages, 2 figures, submitted for possible publication

Abstract:Over-the-air federated learning (FL), i.e., AirFL, leverages computing primitively over multiple access channels. A long-standing challenge in AirFL is to achieve coherent signal alignment without relying on expensive channel estimation and feedback. This paper proposes NCAirFL, a CSI-free AirFL scheme based on unbiased non-coherent detection at the edge server. By exploiting binary dithering and a long-term memory based error-compensation mechanism, NCAirFL achieves a convergence rate of order $\mathcal{O}(1/\sqrt{T})$ in terms of the average square norm of the gradient for general non-convex and smooth objectives, where $T$ is the number of communication rounds. Experiments demonstrate the competitive performance of NCAirFL compared to vanilla FL with ideal communications and to coherent transmission-based benchmarks.

[LG-25] Adaptive Process-Guided Learning: An Application in Predicting Lake DO Concentrations

Link: https://arxiv.org/abs/2411.12973
Authors: Runlong Yu,Chonghao Qiu,Robert Ladwig,Paul C. Hanson,Yiqun Xie,Yanhua Li,Xiaowei Jia
Keywords: recurrent neural networks, sustaining water quality, framework that integrates, neural networks, dissolved oxygen
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:This paper introduces a Process-Guided Learning (Pril) framework that integrates physical models with recurrent neural networks (RNNs) to enhance the prediction of dissolved oxygen (DO) concentrations in lakes, which is crucial for sustaining water quality and ecosystem health. Unlike traditional RNNs, which may deliver high accuracy but often lack physical consistency and broad applicability, the Pril method incorporates differential DO equations for each lake layer, modeling it as a first-order linear solution using a forward Euler scheme with a daily timestep. However, this method is sensitive to numerical instabilities. When drastic fluctuations occur, the numerical integration is neither mass-conservative nor stable. Especially during stratified conditions, exogenous fluxes into each layer cause significant within-day changes in DO concentrations. To address this challenge, we further propose an Adaptive Process-Guided Learning (April) model, which dynamically adjusts timesteps from daily to sub-daily intervals with the aim of mitigating the discrepancies caused by variations in entrainment fluxes. April uses a generator-discriminator architecture to identify days with significant DO fluctuations and employs a multi-step Euler scheme with sub-daily timesteps to effectively manage these variations. We have tested our methods on a wide range of lakes in the Midwestern USA, and demonstrated robust capability in predicting DO concentrations even with limited training data. While primarily focused on aquatic ecosystems, this approach is broadly applicable to diverse scientific and engineering disciplines that utilize process-based models, such as power engineering, climate science, and biomedicine.
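
The instability of a daily forward Euler step, and why sub-daily steps help, can be seen on a toy first-order linear equation dC/dt = -rate * (C - equilibrium). The sketch below is a generic numerical illustration under assumed values, not the paper's DO model: the rate, equilibrium, and starting concentration are hypothetical, and the fixed `substeps` argument stands in for the adaptive day selection that April performs with its generator-discriminator.

```python
def euler_step(conc, rate, equilibrium, dt):
    """One forward Euler step for dC/dt = -rate * (C - equilibrium)."""
    return conc + dt * (-rate * (conc - equilibrium))


def simulate(conc0, rate, equilibrium, days, substeps=1):
    """Advance day by day, splitting each day into `substeps` Euler steps.
    Returns the concentration at the start and at the end of every day."""
    conc = conc0
    dt = 1.0 / substeps
    daily = [conc]
    for _ in range(days):
        for _ in range(substeps):
            conc = euler_step(conc, rate, equilibrium, dt)
        daily.append(conc)
    return daily


if __name__ == "__main__":
    # Hypothetical values: exchange rate 2.5 per day, equilibrium 8.0, start 12.0.
    coarse = simulate(12.0, rate=2.5, equilibrium=8.0, days=5, substeps=1)
    fine = simulate(12.0, rate=2.5, equilibrium=8.0, days=5, substeps=4)
    print("daily step error:    ", abs(coarse[-1] - 8.0))
    print("sub-daily step error:", abs(fine[-1] - 8.0))
```

Each Euler step multiplies the deviation from equilibrium by (1 - rate * dt). With rate * dt = 2.5 the factor is -1.5, so the daily scheme oscillates and diverges, while four sub-daily steps give a factor of 0.375 and the solution decays toward equilibrium as the true dynamics do.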

[LG-26] A Foundation Model for Unified Urban Spatio-Temporal Flow Prediction

Link: https://arxiv.org/abs/2411.12972
Authors: Yuan Yuan,Jingtao Ding,Chonghua Han,Depeng Jin,Yong Li
Keywords: optimizing city infrastructure, encompassing traffic flows, encompassing traffic, managing traffic, emergency responses
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Urban spatio-temporal flow prediction, encompassing traffic flows and crowd flows, is crucial for optimizing city infrastructure and managing traffic and emergency responses. Traditional approaches have relied on separate models tailored to either grid-based data, representing cities as uniform cells, or graph-based data, modeling cities as networks of nodes and edges. In this paper, we build UniFlow, a foundational model for general urban flow prediction that unifies both grid-based and graph-based data. We first design a multi-view spatio-temporal patching mechanism to standardize different data into a consistent sequential format and then introduce a spatio-temporal transformer architecture to capture complex correlations and dynamics. To leverage shared spatio-temporal patterns across different data types and facilitate effective cross-learning, we propose SpatioTemporal Memory Retrieval Augmentation (ST-MRA). By creating structured memory modules to store shared spatio-temporal patterns, ST-MRA enhances predictions through adaptive memory retrieval. Extensive experiments demonstrate that UniFlow outperforms existing models in both grid-based and graph-based flow prediction, excelling particularly in scenarios with limited data availability, showcasing its superior performance and broad applicability. The datasets and code implementation have been released on this https URL.

[LG-27] Machine learned reconstruction of tsunami dynamics from sparse observations

Link: https://arxiv.org/abs/2411.12948
Authors: Edward McDugald,Arvind Mohan,Darren Engwirda,Agnese Marcato,Javier Santos
Keywords: transformer neural network, neural network designed, estimate full-field surface, full-field surface height, sparse sensing applications
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*Comments:

Abstract:We investigate the use of the Senseiver, a transformer neural network designed for sparse sensing applications, to estimate full-field surface height measurements of tsunami waves from sparse observations. The model is trained on a large ensemble of simulated data generated via a shallow water equations solver, which we show to be a faithful reproduction for the underlying dynamics by comparison to historical events. We train the model on a dataset consisting of 8 tsunami simulations whose epicenters correspond to historical USGS earthquake records, and where the model inputs are restricted to measurements obtained at actively deployed buoy locations. We test the Senseiver on a dataset consisting of 8 simulations not included in training, demonstrating its capability for extrapolation. The results show remarkable resolution of fine scale phase and amplitude features from the true field, provided that at least a few of the sensors have obtained a non-zero signal. Throughout, we discuss which forecasting techniques can be improved by this method, and suggest ways in which the flexibility of the architecture can be leveraged to incorporate arbitrary remote sensing data (e.g., HF Radar and satellite measurements) as well as investigate optimal sensor placements.

[LG-28] Improving Low-Fidelity Models of Li-ion Batteries via Hybrid Sparse Identification of Nonlinear Dynamics

Link: https://arxiv.org/abs/2411.12935
Authors: Samuel Filgueira da Silva,Mehmet Fatih Ozkan,Faissal El Idrissi,Prashanth Ramesh,Marcello Canova
Keywords: renewable energy systems, Accurate modeling, Thresholded Ridge Regression, lithium ion, batteries is essential
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Comments: 6 pages

Abstract:Accurate modeling of lithium ion (li-ion) batteries is essential for enhancing the safety and efficiency of electric vehicles and renewable energy systems. This paper presents a data-inspired approach for improving the fidelity of reduced-order li-ion battery models. The proposed method combines a Genetic Algorithm with Sequentially Thresholded Ridge Regression (GA-STRidge) to identify and compensate for discrepancies between a low-fidelity model (LFM) and data generated either from testing or a high-fidelity model (HFM). The hybrid model, combining physics-based and data-driven methods, is tested across different driving cycles to demonstrate the ability to significantly reduce the voltage prediction error compared to the baseline LFM, while preserving computational efficiency. The model robustness is also evaluated under various operating conditions, showing low prediction errors and high Pearson correlation coefficients for terminal voltage in unseen environments.

[LG-29] LEDRO: LLM-Enhanced Design Space Reduction and Optimization for Analog Circuits

Link: https://arxiv.org/abs/2411.12930
Authors: Dimple Vijay Kochar,Hanrui Wang,Anantha Chandrakasan,Xin Zhang
Keywords: significant human expertise, require significant human, Traditional approaches, designing analog circuits, human expertise
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments:

Abstract:Traditional approaches for designing analog circuits are time-consuming and require significant human expertise. Existing automation efforts using methods like Bayesian Optimization (BO) and Reinforcement Learning (RL) are sub-optimal and costly to generalize across different topologies and technology nodes. In our work, we introduce a novel approach, LEDRO, utilizing Large Language Models (LLMs) in conjunction with optimization techniques to iteratively refine the design space for analog circuit sizing. LEDRO is highly generalizable compared to other RL and BO baselines, eliminating the need for design annotation or model training for different topologies or technology nodes. We conduct a comprehensive evaluation of our proposed framework and baseline on 22 different Op-Amp topologies across four FinFET technology nodes. Results demonstrate the superior performance of LEDRO as it outperforms our best baseline by an average of 13% FoM improvement with 2.15x speed-up on low complexity Op-Amps and 48% FoM improvement with 1.7x speed-up on high complexity Op-Amps. This highlights LEDRO’s effective performance, efficiency, and generalizability.

[LG-30] Trojan Cleansing with Neural Collapse

Link: https://arxiv.org/abs/2411.12914
Authors: Xihe Gu,Greg Fields,Yaman Jandali,Tara Javidi,Farinaz Koushanfar
Keywords: embed backdoor triggers, sophisticated training-time attacks, Trojan attacks, backdoor triggers, embed backdoor
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:

Abstract:Trojan attacks are sophisticated training-time attacks on neural networks that embed backdoor triggers which force the network to produce a specific output on any input which includes the trigger. With the increasing relevance of deep networks which are too large to train with personal resources and which are trained on data too large to thoroughly audit, these training-time attacks pose a significant risk. In this work, we connect trojan attacks to Neural Collapse, a phenomenon wherein the final feature representations of over-parameterized neural networks converge to a simple geometric structure. We provide experimental evidence that trojan attacks disrupt this convergence for a variety of datasets and architectures. We then use this disruption to design a lightweight, broadly generalizable mechanism for cleansing trojan attacks from a wide variety of different network architectures and experimentally demonstrate its efficacy.

[LG-31] Tensor-Based Foundations of Ordinary Least Squares and Neural Network Regression Models

Link: https://arxiv.org/abs/2411.12873
Authors: Roberto Dias Algarte
Keywords: Machine Learning literature, current Machine Learning, Neural Network regression, Ordinary Least Squares, Machine Learning
Categories: Machine Learning (cs.LG)
*Comments: 16 pages, 3 algorithms

Abstract:This article introduces a novel approach to the mathematical development of Ordinary Least Squares and Neural Network regression models, diverging from traditional methods in current Machine Learning literature. By leveraging Tensor Analysis and fundamental matrix computations, the theoretical foundations of both models are meticulously detailed and extended to their complete algorithmic forms. The study culminates in the presentation of three algorithms, including a streamlined version of the Backpropagation Algorithm for Neural Networks, illustrating the benefits of this new mathematical approach.

[LG-32] CDI: Copyrighted Data Identification in Diffusion Models

Link: https://arxiv.org/abs/2411.12858
Authors: Jan Dubiński,Antoni Kowalczuk,Franziska Boenisch,Adam Dziedzic
Keywords: data owners, Diffusion Models, data, owners, CDI
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments: Code available at this https URL

Abstract:Diffusion Models (DMs) benefit from large and diverse datasets for their training. Since this data is often scraped from the Internet without permission from the data owners, this raises concerns about copyright and intellectual property protections. While (illicit) use of data is easily detected for training samples perfectly re-created by a DM at inference time, it is much harder for data owners to verify if their data was used for training when the outputs from the suspect DM are not close replicas. Conceptually, membership inference attacks (MIAs), which detect if a given data point was used during training, present themselves as a suitable tool to address this challenge. However, we demonstrate that existing MIAs are not strong enough to reliably determine the membership of individual images in large, state-of-the-art DMs. To overcome this limitation, we propose CDI, a framework for data owners to identify whether their dataset was used to train a given DM. CDI relies on dataset inference techniques, i.e., instead of using the membership signal from a single data point, CDI leverages the fact that most data owners, such as providers of stock photography, visual media companies, or even individual artists, own datasets with multiple publicly exposed data points which might all be included in the training of a given DM. By selectively aggregating signals from existing MIAs and using new handcrafted methods to extract features for these datasets, feeding them to a scoring model, and applying rigorous statistical testing, CDI allows data owners with as little as 70 data points to identify with a confidence of more than 99% whether their data was used to train a given DM. Thereby, CDI represents a valuable tool for data owners to claim illegitimate use of their copyrighted data.

[LG-33] Integrating Secondary Structures Information into Triangular Spatial Relationships (TSR) for Advanced Protein Classification

Link: https://arxiv.org/abs/2411.12853
Authors: Poorya Khajouie,Titli Sarkar,Krishna Rauniyar,Li Chen,Wu Xu,Vijay Raghavan
Keywords: Triangular Spatial Relationship, deciphering biological functions, deciphering biological, Protein, Protein structures represent
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*Comments:

Abstract:Protein structures represent the key to deciphering biological functions. The more detailed form of similarity among these proteins is sometimes overlooked by the conventional structural comparison methods. In contrast, further advanced methods, such as Triangular Spatial Relationship (TSR), have been demonstrated to make finer differentiations. Still, the classical implementation of TSR does not provide for the integration of secondary structure information, which is important for a more detailed understanding of the folding pattern of a protein. To overcome these limitations, we developed the SSE-TSR approach. The proposed method integrates secondary structure elements (SSEs) into TSR-based protein representations. This allows an enriched representation of protein structures by considering 18 different combinations of helix, strand, and coil arrangements. Our results show that using SSEs improves the accuracy and reliability of protein classification to varying degrees. We worked with two large protein datasets of 9.2K and 7.8K samples, respectively. We applied the SSE-TSR approach and used a neural network model for classification. Interestingly, introducing SSEs improved performance statistics for Dataset 1, with accuracy moving from 96.0% to 98.3%. For Dataset 2, where the performance statistics were already good, further small improvements were found with the introduction of SSE, giving an accuracy of 99.5% compared to 99.4%. These results show that SSE integration can dramatically improve TSR key discrimination, with significant benefits in datasets with low initial accuracies and only incremental gains in those with high baseline performance. Thus, SSE-TSR is a powerful bioinformatics tool that improves protein classification and understanding of protein function and interaction.

[LG-34] Generalized Prompt Tuning: Adapting Frozen Univariate Time Series Foundation Models for Multivariate Healthcare Time Series ML4H2024 ALT

Link: https://arxiv.org/abs/2411.12824
Authors: Mingzhu Liu,Angela H. Chen,George H. Chen
Keywords: Time series foundation, Time series, multivariate time series, series foundation models, performance in diverse
Categories: Machine Learning (cs.LG)
Comments: Machine Learning for Health (ML4H 2024)

Abstract:Time series foundation models are pre-trained on large datasets and are able to achieve state-of-the-art performance in diverse tasks. However, to date, there has been limited work demonstrating how well these models perform in medical applications, where labeled data can be scarce. Further, we observe that currently, the majority of time series foundation models either are univariate in nature, or assume channel independence, meaning that they handle multivariate time series but do not model how the different variables relate. In this paper, we propose a prompt-tuning-inspired fine-tuning technique, Generalized Prompt Tuning (Gen-P-Tuning), that enables us to adapt an existing univariate time series foundation model (treated as frozen) to handle multivariate time series prediction. Our approach provides a way to combine information across channels (variables) of multivariate time series. We demonstrate the effectiveness of our fine-tuning approach against various baselines on two MIMIC classification tasks, and on influenza-like illness forecasting.

[LG-35] Exploring Eye Tracking to Detect Cognitive Load in Complex Virtual Reality Training

Link: https://arxiv.org/abs/2411.12771
Authors: Mahsa Nasri,Mehmet Kosa,Leanne Chukoskie,Mohsen Moghaddam,Casper Harteveld
Keywords: Virtual Reality, beneficial training tool, cognitive load, advanced manufacturing, detect cognitive load
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Virtual Reality (VR) has been a beneficial training tool in fields such as advanced manufacturing. However, users may experience a high cognitive load due to various factors, such as the use of VR hardware or tasks within the VR environment. Studies have shown that eye-tracking has the potential to detect cognitive load, but in the context of VR and complex spatiotemporal tasks (e.g., assembly and disassembly), it remains relatively unexplored. Here, we present an ongoing study to detect users’ cognitive load using an eye-tracking-based machine learning approach. We developed a VR training system for cold spray and tested it with 22 participants, obtaining 19 valid eye-tracking datasets and NASA-TLX scores. We applied Multi-Layer Perceptron (MLP) and Random Forest (RF) models to compare the accuracy of predicting cognitive load (i.e., NASA-TLX) using pupil dilation and fixation duration. Our preliminary analysis demonstrates the feasibility of using eye tracking to detect cognitive load in complex spatiotemporal VR experiences and motivates further exploration.

[LG-36] VayuBuddy: an LLM-Powered Chatbot to Democratize Air Quality Insights

Link: https://arxiv.org/abs/2411.12760
Authors: Zeel B Patel,Yash Bachwana,Nitish Sharma,Sarath Guttikunda,Nipun Batra
Keywords: air pollution, million lives, lives are lost, lost due, Large Language Model
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Nearly 6.7 million lives are lost due to air pollution every year. While policymakers are working on the mitigation strategies, public awareness can help reduce the exposure to air pollution. Air pollution data from government-installed sensors is often publicly available in raw format, but there is a non-trivial barrier for various stakeholders in deriving meaningful insights from that data. In this work, we present VayuBuddy, a Large Language Model (LLM)-powered chatbot system to reduce the barrier between the stakeholders and air quality sensor data. VayuBuddy receives the questions in natural language, analyses the structured sensory data with LLM-generated Python code and provides answers in natural language. We use the data from Indian government air quality sensors. We benchmark the capabilities of 7 LLMs on 45 diverse question-answer pairs prepared by us. Additionally, VayuBuddy can also generate visual analysis such as line plots, map plots, bar charts and many others from the sensory data as we demonstrate in this work.

[LG-37] Quantum Attention for Vision Transformers in High Energy Physics

Link: https://arxiv.org/abs/2411.13520
Authors: Alessandro Tesi,Gopal Ramesh Dahale,Sergei Gleyzer,Kyoungchul Kong,Tom Magorsch,Konstantin T. Matchev,Katia Matcheva
Keywords: hybrid quantum-classical vision, orthogonal neural networks, quantum-classical vision transformer, high-energy physics applications, transformer architecture incorporating
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
Comments: 9 pages, 7 figures

Abstract:We present a novel hybrid quantum-classical vision transformer architecture incorporating quantum orthogonal neural networks (QONNs) to enhance performance and computational efficiency in high-energy physics applications. Building on advancements in quantum vision transformers, our approach addresses limitations of prior models by leveraging the inherent advantages of QONNs, including stability and efficient parameterization in high-dimensional spaces. We evaluate the proposed architecture using multi-detector jet images from CMS Open Data, focusing on the task of distinguishing quark-initiated from gluon-initiated jets. The results indicate that embedding quantum orthogonal transformations within the attention mechanism can provide robust performance while offering promising scalability for machine learning challenges associated with the upcoming High Luminosity Large Hadron Collider. This work highlights the potential of quantum-enhanced models to address the computational demands of next-generation particle physics experiments.

[LG-38] Conformal Prediction for Hierarchical Data

Link: https://arxiv.org/abs/2411.13479
Authors: Guillaume Principato,Yvenn Amara-Ouali,Yannig Goude,Bachir Hamrouche,Jean-Michel Poggi,Gilles Stoltz
Keywords: hierarchical time series, multivariate point forecasting, Forecast Reconciliation, Forecast Reconciliation techniques, Conformal Prediction
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 14 pages, 2 figures

Abstract:Reconciliation has become an essential tool in multivariate point forecasting for hierarchical time series. However, there is still a lack of understanding of the theoretical properties of probabilistic Forecast Reconciliation techniques. Meanwhile, Conformal Prediction is a general framework with growing appeal that provides prediction sets with probabilistic guarantees in finite sample. In this paper, we propose a first step towards combining Conformal Prediction and Forecast Reconciliation by analyzing how including a reconciliation step in the Split Conformal Prediction (SCP) procedure enhances the resulting prediction sets. In particular, we show that the validity granted by SCP remains while improving the efficiency of the prediction sets. We also advocate a variation of the theoretical procedure for practical use. Finally, we illustrate these results with simulations.
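
For readers unfamiliar with Split Conformal Prediction, the generic procedure can be sketched as below. This is only the textbook SCP step with absolute residuals as conformity scores, not the paper's reconciled variant; the function name `split_conformal_interval` and its parameters are illustrative assumptions.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, new_pred, alpha=0.1):
    """Return a (lower, upper) prediction interval around one new point forecast.

    cal_preds / cal_targets come from a held-out calibration set that was
    not used to fit the forecasting model (assumption of this sketch).
    """
    # Conformity scores: absolute residuals on the calibration set.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (new_pred - q, new_pred + q)
```

With nine calibration residuals 1 through 9 and `alpha=0.5`, the rank is 5, so a forecast of 10 yields the interval (5, 15). A reconciliation step, as studied in the paper, would adjust the forecasts before the scores are computed.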

[LG-39] On lower bounds of the density of planar periodic sets without unit distances

Link: https://arxiv.org/abs/2411.13248
Authors: Alexander Tolmachev
Keywords: combinatorial geometry, Determining the maximal, unit distances, mathbb, maximal density
Categories: Metric Geometry (math.MG); Machine Learning (cs.LG); Combinatorics (math.CO)
Comments: 21 pages, 9 figures

Abstract:Determining the maximal density $m_1(\mathbb{R}^2)$ of planar sets without unit distances is a fundamental problem in combinatorial geometry. This paper investigates lower bounds for this quantity. We introduce a novel approach to estimating $m_1(\mathbb{R}^2)$ by reformulating the problem as a Maximal Independent Set (MIS) problem on graphs constructed from the flat torus, focusing on periodic sets with respect to two non-collinear vectors. Our experimental results, supported by theoretical justifications of the proposed method, demonstrate that for a sufficiently wide range of parameters this approach does not improve the known lower bound $0.22936 \le m_1(\mathbb{R}^2)$. The best discrete sets found are approximations of Croft’s construction. In addition, several open source software packages for the MIS problem are compared on this task.

[LG-40] Eliminating Ratio Bias for Gradient-based Simulated Parameter Estimation

Link: https://arxiv.org/abs/2411.12995
Authors: Zehao Li,Yijie Peng
Keywords: article addresses, addresses the challenge, parameter calibration, likelihood function, gradient-based simulated parameter
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:This article addresses the challenge of parameter calibration in stochastic models where the likelihood function is not analytically available. We propose a gradient-based simulated parameter estimation framework, leveraging a multi-time scale algorithm that tackles the issue of ratio bias in both maximum likelihood estimation and posterior density estimation problems. Additionally, we introduce a nested simulation optimization structure, providing theoretical analyses including strong convergence, asymptotic normality, convergence rate, and budget allocation strategies for the proposed algorithm. The framework is further extended to neural network training, offering a novel perspective on stochastic approximation in machine learning. Numerical experiments show that our algorithm can improve the estimation accuracy and save computational costs.

[LG-41] On adaptivity and minimax optimality of two-sided nearest neighbors

Link: https://arxiv.org/abs/2411.12965
Authors: Tathagata Sadhukhan,Manit Paul,Raaz Dwivedi
Keywords: sequential decision-making systems, Nearest neighbor, missing data problems, recommender systems, decision-making systems
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 29 pages, 7 figures

Abstract:Nearest neighbor (NN) algorithms have been extensively used for missing data problems in recommender systems and sequential decision-making systems. Prior theoretical analysis has established favorable guarantees for NN when the underlying data is sufficiently smooth and the missingness probabilities are lower bounded. Here we analyze NN with non-smooth non-linear functions with vast amounts of missingness. In particular, we consider matrix completion settings where the entries of the underlying matrix follow a latent non-linear factor model, with the non-linearity belonging to a Hölder function class that is less smooth than Lipschitz. Our results establish the following favorable properties for a suitable two-sided NN: (1) The mean squared error (MSE) of NN adapts to the smoothness of the non-linearity, (2) under certain regularity conditions, the NN error rate matches the rate obtained by an oracle equipped with the knowledge of both the row and column latent factors, and finally (3) NN’s MSE is non-trivial for a wide range of settings even when several matrix entries might be missing deterministically. We support our theoretical findings via extensive numerical simulations and a case study with data from a mobile health study, HeartSteps.

[LG-42] On the relationship between Koopman operator approximations and neural ordinary differential equations for data-driven time-evolution predictions

Link: https://arxiv.org/abs/2411.12940
Authors: Jake Buzhardt,C. Ricardo Constante-Amores,Michael D. Graham
Keywords: Koopman operator-based methods, Koopman operator-based, state space, work explores, explores the relationship
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)

Abstract:This work explores the relationship between state space methods and Koopman operator-based methods for predicting the time-evolution of nonlinear dynamical systems. We demonstrate that extended dynamic mode decomposition with dictionary learning (EDMD-DL), when combined with a state space projection, is equivalent to a neural network representation of the nonlinear discrete-time flow map on the state space. We highlight how this projection step introduces nonlinearity into the evolution equations, enabling significantly improved EDMD-DL predictions. With this projection, EDMD-DL leads to a nonlinear dynamical system on the state space, which can be represented in either discrete or continuous time. This system has a natural structure for neural networks, where the state is first expanded into a high dimensional feature space followed by a linear mapping which represents the discrete-time map or the vector field as a linear combination of these features. Inspired by these observations, we implement several variations of neural ordinary differential equations (ODEs) and EDMD-DL, developed by combining different aspects of their respective model structures and training procedures. We evaluate these methods using numerical experiments on chaotic dynamics in the Lorenz system and a nine-mode model of turbulent shear flow, showing comparable performance across methods in terms of short-time trajectory prediction, reconstruction of long-time statistics, and prediction of rare events. We also show that these methods provide comparable performance to a non-Markovian approach in terms of prediction of extreme events.

[LG-43] Problem-dependent convergence bounds for randomized linear gradient compression

Link: https://arxiv.org/abs/2411.12898
Authors: Thomas Flynn,Patrick Johnstone,Shinjae Yoo
Keywords: compression, distributed optimization, optimization, performance bottleneck, increasing optimization throughput
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 15 pages, 3 figures

Abstract:In distributed optimization, the communication of model updates can be a performance bottleneck. Consequently, gradient compression has been proposed as a means of increasing optimization throughput. In general, due to information loss, compression introduces a penalty on the number of iterations needed to reach a solution. In this work, we investigate how the iteration penalty depends on the interaction between compression and problem structure, in the context of non-convex stochastic optimization. We focus on linear compression schemes, where compression and decompression can be modeled as multiplication with a random matrix. We consider several distributions of matrices, among them random orthogonal matrices and matrices with random Gaussian entries. We find that in each case, the impact of compression on convergence can be quantified in terms of the norm of the Hessian of the objective, using a norm defined by the compression scheme. The analysis reveals that in certain cases, compression performance is related to low-rank structure or other spectral properties of the problem. In these cases, our bounds predict that the penalty introduced by compression is significantly reduced compared to worst-case bounds that only consider the compression level, ignoring problem data. We verify the theoretical findings on several optimization problems, including fine-tuning an image classification model.
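
As an illustration of what "compression and decompression as multiplication with a random matrix" can look like, here is a minimal sketch with Gaussian entries. The `1/k` scaling in `decompress`, which makes the reconstruction unbiased in expectation, and all function names are assumptions of this sketch rather than details taken from the paper.

```python
import random


def gaussian_matrix(k, d, rng):
    """Draw a k x d matrix with independent standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]


def compress(A, g):
    """Project the d-dimensional gradient g down to k numbers: c = A g."""
    return [sum(a * x for a, x in zip(row, g)) for row in A]


def decompress(A, c):
    """Map k numbers back to d dimensions: g_hat = A^T c / k."""
    k = len(A)
    d = len(A[0])
    return [sum(A[i][j] * c[i] for i in range(k)) / k for j in range(d)]


def average_reconstruction(g, k, trials, seed=0):
    """Average g_hat over many independent matrices to show unbiasedness."""
    rng = random.Random(seed)
    total = [0.0] * len(g)
    for _ in range(trials):
        A = gaussian_matrix(k, len(g), rng)
        g_hat = decompress(A, compress(A, g))
        total = [t + x for t, x in zip(total, g_hat)]
    return [t / trials for t in total]
```

A single reconstruction is noisy, which is the source of the iteration penalty the abstract mentions; only the average over many draws recovers the original gradient.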

[LG-44] NPGPT: Natural Product-Like Compound Generation with GPT-based Chemical Language Models

Link: https://arxiv.org/abs/2411.12886
Authors: Koh Sakano,Kairi Furui,Masahito Ohue
Keywords: possess biological activity, structural diversity, Natural products, substances produced, produced by organisms
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)

Abstract:Natural products are substances produced by organisms in nature and often possess biological activity and structural diversity. Drug development based on natural products has been common for many years. However, the intricate structures of these compounds present challenges in terms of structure determination and synthesis, particularly compared to the efficiency of high-throughput screening of synthetic compounds. In recent years, deep learning-based methods have been applied to the generation of molecules. In this study, we trained chemical language models on a natural product dataset and generated natural product-like compounds. The results showed that the distribution of the compounds generated was similar to that of natural products. We also evaluated the effectiveness of the generated compounds as drug candidates. Our method can be used to explore the vast chemical space and reduce the time and cost of drug discovery of natural products.

[LG-45] Local Anti-Concentration Class: Logarithmic Regret for Greedy Linear Contextual Bandit NEURIPS2024

Link: https://arxiv.org/abs/2411.12878
Authors: Seok-Jin Kim,Min-hwan Oh
Keywords: contextual bandit problem, linear contextual bandit, exploration-free greedy algorithms, study the performance, performance guarantees
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: NeurIPS2024

Abstract:We study the performance guarantees of exploration-free greedy algorithms for the linear contextual bandit problem. We introduce a novel condition, named the Local Anti-Concentration (LAC) condition, which enables a greedy bandit algorithm to achieve provable efficiency. We show that the LAC condition is satisfied by a broad class of distributions, including Gaussian, exponential, uniform, Cauchy, and Student’s $t$ distributions, along with other exponential family distributions and their truncated variants. This significantly expands the class of distributions under which greedy algorithms can perform efficiently. Under our proposed LAC condition, we prove that the cumulative expected regret of the greedy algorithm for the linear contextual bandit is bounded by $O(\operatorname{poly} \log T)$. Our results establish the widest range of distributions known to date that allow a sublinear regret bound for greedy algorithms, further achieving a sharp poly-logarithmic regret.

[LG-46] A new Input Convex Neural Network with application to options pricing

Link: https://arxiv.org/abs/2411.12854
Authors: Vincent Lemaire,Gilles Pagès,Christian Yeo
Keywords: neural networks designed, affine functions, leveraging the principle, convex functions, neural networks
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 29 pages

Abstract:We introduce a new class of neural networks designed to be convex functions of their inputs, leveraging the principle that any convex function can be represented as the supremum of the affine functions it dominates. These neural networks, inherently convex with respect to their inputs, are particularly well-suited for approximating the prices of options with convex payoffs. We detail the architecture of these networks and establish theoretical convergence bounds that validate their approximation capabilities. We also introduce a scrambling phase to improve the training of these networks. Finally, we demonstrate numerically the effectiveness of these networks in estimating prices for three types of options with convex payoffs: Basket, Bermudan, and Swing options.
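
The principle cited in the abstract, that a convex function is a supremum of affine functions, can be illustrated with a finite maximum. This is a sketch of the principle only, not of the paper's architecture or its scrambling phase; `max_affine` and `tangent_lines_of_square` are hypothetical helper names.

```python
def max_affine(x, weights, biases):
    """Evaluate f(x) = max_i (w_i . x + b_i), which is convex in x."""
    return max(
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )


def tangent_lines_of_square(points):
    """Affine minorants of x**2 in one dimension: tangents at the given points."""
    weights = [[2.0 * t] for t in points]
    biases = [-t * t for t in points]
    return weights, biases
```

A call payoff is already of this form: `max_affine([120.0], [[0.0], [1.0]], [0.0, -100.0])` evaluates max(0, S - K) for S = 120 and K = 100, giving 20.0. Adding more tangent lines tightens the approximation of a smooth convex function from below.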

[LG-47] Off-policy estimation with adaptively collected data: the power of online learning NEURIPS2024

Link: https://arxiv.org/abs/2411.12786
Authors: Jeonghwan Lee,Cong Ma
Keywords: adaptively collected data, treatment effect, average treatment effect, collected data, adaptively collected
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
Comments: 37 pages. Accepted to the 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, British Columbia, Canada

Abstract:We consider estimation of a linear functional of the treatment effect using adaptively collected data. This task finds a variety of applications including the off-policy evaluation (OPE) in contextual bandits, and estimation of the average treatment effect (ATE) in causal inference. While a certain class of augmented inverse propensity weighting (AIPW) estimators enjoys desirable asymptotic properties including the semi-parametric efficiency, much less is known about their non-asymptotic theory with adaptively collected data. To fill in the gap, we first establish generic upper bounds on the mean-squared error of the class of AIPW estimators that crucially depends on a sequentially weighted error between the treatment effect and its estimates. Motivated by this, we also propose a general reduction scheme that allows one to produce a sequence of estimates for the treatment effect via online learning to minimize the sequentially weighted estimation error. To illustrate this, we provide three concrete instantiations in (i) the tabular case; (ii) the case of linear function approximation; and (iii) the case of general function approximation for the outcome model. We then provide a local minimax lower bound to show the instance-dependent optimality of the AIPW estimator using no-regret online learning algorithms.

[LG-48] Supervised Autoencoders with Fractionally Differentiated Features and Triple Barrier Labelling Enhance Predictions on Noisy Data

Link: https://arxiv.org/abs/2411.12753
Authors: Bartosz Bieganowski,Robert Ślepaczuk
Keywords: financial time series, time series forecasting, improve investment strategy, paper investigates, investigates the enhancement
Categories: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: arXiv admin note: substantial text overlap with arXiv:2404.01866

Abstract:This paper investigates the enhancement of financial time series forecasting with the use of neural networks through supervised autoencoders (SAE), to improve investment strategy performance. Using the Sharpe and Information Ratios, it specifically examines the impact of noise augmentation and triple barrier labeling on risk-adjusted returns. The study focuses on Bitcoin, Litecoin, and Ethereum as the traded assets from January 1, 2016, to April 30, 2022. Findings indicate that supervised autoencoders, with balanced noise augmentation and bottleneck size, significantly boost strategy effectiveness. However, excessive noise and large bottleneck sizes can impair performance.
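
The abstract does not spell out triple barrier labelling, so the following is a sketch of the commonly used form of the technique. It assumes fixed percentage barriers and a fixed holding window; the paper may define its barriers differently (for example, scaled by volatility), and `triple_barrier_label` is an illustrative name.

```python
def triple_barrier_label(prices, start, upper_pct, lower_pct, max_holding):
    """Label the observation at index `start` as +1, -1 or 0.

    +1: the price first rises by at least upper_pct (profit-taking barrier).
    -1: the price first falls by at least lower_pct (stop-loss barrier).
     0: neither barrier is touched within max_holding steps (time barrier).
    """
    entry = prices[start]
    upper = entry * (1.0 + upper_pct)
    lower = entry * (1.0 - lower_pct)
    # Do not look past the end of the series.
    end = min(start + max_holding, len(prices) - 1)
    for t in range(start + 1, end + 1):
        if prices[t] >= upper:
            return 1
        if prices[t] <= lower:
            return -1
    return 0
```

For the series `[100, 101, 103, 99, 95, 106]` with a 5% upper and 4% lower barrier, the label at index 0 is -1 when the window is five steps (the price reaches 95 before 106) and 0 when the window is only three steps.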

[LG-49] FinBERT-BiLSTM: A Deep Learning Model for Predicting Volatile Cryptocurrency Market Prices Using Market Sentiment Dynamics

Link: https://arxiv.org/abs/2411.12748
Authors: Mabsur Fatin Bin Hossain,Lubna Zahan Lamia,Md Mahmudur Rahman,Md Mosaddek Khan
Keywords: guide investment decisions, helping to predict, investment decisions, predict asset prices, guide investment
Categories: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)

Abstract:Time series forecasting is a key tool in financial markets, helping to predict asset prices and guide investment decisions. In highly volatile markets, such as cryptocurrencies like Bitcoin (BTC) and Ethereum (ETH), forecasting becomes more difficult due to extreme price fluctuations driven by market sentiment, technological changes, and regulatory shifts. Traditionally, forecasting relied on statistical methods, but as markets became more complex, deep learning models like LSTM, Bi-LSTM, and the newer FinBERT-LSTM emerged to capture intricate patterns. Building upon recent advancements and addressing the volatility inherent in cryptocurrency markets, we propose a hybrid model that combines Bidirectional Long Short-Term Memory (Bi-LSTM) networks with FinBERT to enhance forecasting accuracy for these assets. This approach fills a key gap in forecasting volatile financial markets by blending advanced time series models with sentiment analysis, offering valuable insights for investors and analysts navigating unpredictable markets.

Information Retrieval

[IR-0] Unleashing the Power of Large Language Models for Group POI Recommendations

Link: https://arxiv.org/abs/2411.13415
Authors: Jing Long,Liang Qu,Guanhua Ye,Tong Chen,Quoc Viet Hung Nguyen,Hongzhi Yin
Keywords: group POI recommendations, POI recommendations, POI, group POI, POI recommendations due
Categories: Information Retrieval (cs.IR)

Abstract:Group Point-of-Interest (POI) recommendations aim to predict the next POI that satisfies the diverse preferences of a group of users. This task is more challenging than traditional individual POI recommendations due to complex group decision-making and extremely sparse group-level check-in data. Existing methods for group POI recommendations primarily rely on single ID-based features from check-in data, capturing only statistical correlations and failing to fully utilize the rich semantic information contained in the check-ins, resulting in suboptimal performance. To this end, we propose a framework that unleashes the power of the Large Language Model (LLM) for context-aware group POI recommendations (LLMGPR). Our approach first introduces POI tokens alongside the original word tokens of the LLM, which are initialized by applying the LLM to the rich information of each POI. We then propose a novel sequencing adapter guided by Quantized Low-Rank Adaptation (QLORA) to modify the LLM. The enhanced LLM can learn sequence representations by combining semantic-enhanced POI tokens and rich contextual information including positional encodings and spatio-temporal differences. This approach can be adapted for learning either group or user representations depending on the sequence type. Furthermore, we enhance group representations by aggregating individual member representations with another QLORA-based aggregation adapter and introducing a self-supervised learning task that predicts the purpose of check-in sequences, alleviating the data sparsity issue. Our experimental results demonstrate that LLMGPR outperforms existing methods, effectively addressing group-level data sparsity and providing superior recommendations.

[IR-1] On the Statistical Significance with Relevance Assessments of Large Language Models

Link: https://arxiv.org/abs/2411.13212
Authors: David Otero,Javier Parapar,Álvaro Barreiro
Keywords: part of Information, Information Retrieval, integral part, retrieval test collections, Test collections
Categories: Information Retrieval (cs.IR)

Abstract:Test collections are an integral part of Information Retrieval (IR) research. They allow researchers to evaluate and compare ranking algorithms in a quick, easy and reproducible way. However, constructing these datasets requires great efforts in manual labelling and logistics, and having only a few human relevance judgements can introduce biases in the comparison. Recent research has explored the use of Large Language Models (LLMs) for labelling the relevance of documents for building new retrieval test collections. Their strong text-understanding capabilities and low cost compared to human-made judgements make them an appealing tool for gathering relevance judgements. Results suggest that LLM-generated labels are promising for IR evaluation in terms of ranking correlation, but nothing is said about the implications in terms of statistical significance. In this work, we look at how LLM-generated judgements preserve the same pairwise significance evaluation as human judgements. Our results show that LLM judgements detect most of the significant differences while maintaining acceptable numbers of false positives. However, we also show that some systems are treated differently under LLM-generated labels, suggesting that evaluation with LLM judgements might not be entirely fair. Our work represents a step forward in the evaluation of statistical testing results provided by LLM judgements. We hope that this will serve as a basis for other researchers to develop reliable models for automatic relevance assessments.

[IR-2] Data Watermarking for Sequential Recommender Systems

Link: https://arxiv.org/abs/2411.12989
Authors: Sixiao Zhang,Cheng Long,Wei Yuan,Hongxu Chen,Hongzhi Yin
Keywords: large foundation models, era of large, large foundation, crucial component, component for building
Categories: Information Retrieval (cs.IR)

Abstract:In the era of large foundation models, data has become a crucial component for building high-performance AI systems. As the demand for high-quality and large-scale data continues to rise, data copyright protection is attracting increasing attention. In this work, we explore the problem of data watermarking for sequential recommender systems, where a watermark is embedded into the target dataset and can be detected in models trained on that dataset. We address two specific challenges: dataset watermarking, which protects the ownership of the entire dataset, and user watermarking, which safeguards the data of individual users. We systematically define these problems and present a method named DWRS to address them. Our approach involves randomly selecting unpopular items to create a watermark sequence, which is then inserted into normal users’ interaction sequences. Extensive experiments on five representative sequential recommendation models and three benchmark datasets demonstrate the effectiveness of DWRS in protecting data copyright while preserving model utility.
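
The embedding step described in the abstract (pick unpopular items at random, then insert them into user sequences) might be sketched as follows. The pool size, the contiguous insertion at a random position, and both function names are assumptions of this sketch; the abstract does not give these details, and detection in the trained model is not covered.

```python
import random
from collections import Counter


def build_watermark(sequences, length, pool_size, rng):
    """Sample a watermark sequence from the pool_size least popular items."""
    counts = Counter(item for seq in sequences for item in seq)
    # Least popular first; ties broken by item id so the pool is deterministic.
    pool = sorted(counts, key=lambda item: (counts[item], item))[:pool_size]
    return rng.sample(pool, length)


def insert_watermark(sequence, watermark, rng):
    """Insert the watermark as a contiguous block at a random position."""
    pos = rng.randint(0, len(sequence))
    return sequence[:pos] + watermark + sequence[pos:]
```

A typical call would build one watermark from the whole dataset with `rng = random.Random(seed)` and then apply `insert_watermark` to a chosen subset of user sequences.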

[IR-3] Epidemiology-informed Network for Robust Rumor Detection

Link: https://arxiv.org/abs/2411.12949
Authors: Wei Jiang,Tong Chen,Xinyi Gao,Wentao Zhang,Lizhen Cui,Hongzhi Yin
Keywords: posed significant challenges, maintaining public trust, rapid spread, media has posed, posed significant
Categories: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)

Abstract:The rapid spread of rumors on social media has posed significant challenges to maintaining public trust and information integrity. Since an information cascade process is essentially a propagation tree, recent rumor detection models leverage graph neural networks to additionally capture information propagation patterns, thus outperforming text-only solutions. Given the variations in topics and social impact of the root node, different source information naturally has distinct outreach capabilities, resulting in different heights of propagation trees. This variation, however, impedes the data-driven design of existing graph-based rumor detectors. Given a shallow propagation tree with limited interactions, it is unlikely for graph-based approaches to capture sufficient cascading patterns, questioning their ability to handle less popular news or early detection needs. In contrast, a deep propagation tree is prone to noisy user responses, and this can in turn obfuscate the predictions. In this paper, we propose a novel Epidemiology-informed Network (EIN) that integrates epidemiological knowledge to enhance performance by overcoming data-driven methods' sensitivity to data quality. Meanwhile, to adapt epidemiology theory to rumor detection, it is expected that each user's stance toward the source information will be annotated. To bypass the costly and time-consuming human labeling process, we take advantage of large language models to generate stance labels, facilitating optimization objectives for learning epidemiology-informed representations. Our experimental results demonstrate that the proposed EIN not only outperforms state-of-the-art methods on real-world datasets but also exhibits enhanced robustness across varying tree depths.
