This post contains the latest paper list retrieved from Arxiv.org on 2025-01-30. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is retrieved from Arxiv.org and updated automatically around 12:00 each day.


Overview (2025-01-30)

A total of 328 papers were updated today, including:

  • Natural Language Processing: 67 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 92 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 40 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 104 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations

【Quick Read】: This paper addresses the fact that current medical AI systems fail to replicate real-world clinical reasoning because they are trained and evaluated mainly on static text and question-answer tasks, overlooking key abilities such as evidence-based reasoning and handling distracting information. To close this gap, the authors introduce a new benchmark that simulates realistic diagnostic scenarios, with noise and difficulty levels aligned to USMLE standards. They also explore dialogue-based fine-tuning, which converts static datasets into conversational formats to better capture iterative reasoning. Experiments show that dialogue-tuned models improve by 9.64% in multi-round reasoning scenarios and by 6.18% in accuracy under noisy conditions. The key contribution is the new benchmark together with dialogue-based fine-tuning, which substantially improve the clinical alignment and robustness of medical AI systems.

Link: https://arxiv.org/abs/2501.17860
Authors: Zijie Liu, Xinyu Zhao, Jie Peng, Zhuangdi Zhu, Qingyu Chen, Xia Hu, Tianlong Chen
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Current medical AI systems often fail to replicate real-world clinical reasoning, as they are predominantly trained and evaluated on static text and question-answer tasks. These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information. To bridge this gap, we introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards. Moreover, we explore dialogue-based fine-tuning, which transforms static datasets into conversational formats to better capture iterative reasoning processes. Experiments show that dialogue-tuned models outperform traditional methods, with improvements of 9.64% in multi-round reasoning scenarios and 6.18% in accuracy in a noisy environment. Our findings highlight dialogue tuning as a promising approach for advancing clinically aligned and robust medical AI systems.

[NLP-1] Improving Your Model Ranking on Chatbot Arena by Vote Rigging

【Quick Read】: This paper studies how crowdsourced voting on the Chatbot Arena platform can be rigged to manipulate the rankings of generative AI models. The key contribution is a set of "omnipresent rigging" strategies that exploit Chatbot Arena's Elo rating mechanism: even when the target model m_t is not directly involved in a battle, influencing that battle's outcome can still shift m_t's ranking indirectly. Experiments show that rigging only hundreds of new votes is enough to noticeably improve a model's ranking.

Link: https://arxiv.org/abs/2501.17858
Authors: Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, Min Lin
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Chatbot Arena is a popular platform for evaluating LLMs by pairwise battles, where users vote for their preferred response from two randomly sampled anonymous models. While Chatbot Arena is widely regarded as a reliable LLM ranking leaderboard, we show that crowdsourced voting can be rigged to improve (or decrease) the ranking of a target model m_t . We first introduce a straightforward target-only rigging strategy that focuses on new battles involving m_t , identifying it via watermarking or a binary classifier, and exclusively voting for m_t wins. However, this strategy is practically inefficient because there are over 190 models on Chatbot Arena and on average only about 1% of new battles will involve m_t . To overcome this, we propose omnipresent rigging strategies, exploiting the Elo rating mechanism of Chatbot Arena that any new vote on a battle can influence the ranking of the target model m_t , even if m_t is not directly involved in the battle. We conduct experiments on around 1.7 million historical votes from the Chatbot Arena Notebook, showing that omnipresent rigging strategies can improve model rankings by rigging only hundreds of new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging. Our code is available at this https URL.
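
To see why a vote on a battle that does not even involve the target model can still move its rank, consider a toy Elo sketch (the model names, ratings, and K-factor below are made up for illustration; this is not the paper's code):

```python
def elo_update(r_winner, r_loser, k=4.0):
    """Standard Elo update: the winner takes points from the loser."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

ratings = {"target_m_t": 1205.0, "model_a": 1210.0, "model_b": 1190.0}
rank = lambda m: sorted(ratings, key=ratings.get, reverse=True).index(m) + 1

print("before:", rank("target_m_t"))   # 2
# "Omnipresent" votes: m_t never battles here, but repeatedly voting for the
# weaker model_b against model_a drags model_a's rating below m_t's.
for _ in range(3):
    ratings["model_b"], ratings["model_a"] = elo_update(
        ratings["model_b"], ratings["model_a"])
print("after: ", rank("target_m_t"))   # 1
```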

[NLP-2] Learning Beyond the Surface: How Far Can Continual Pre-Training with LoRA Enhance LLMs' Domain-Specific Insight Learning?

【Quick Read】: This paper investigates how continual pre-training can improve the ability of Large Language Models (LLMs) to extract and internalize deeper insights from domain-specific datasets, covering three distinct forms of insight: declarative, statistical, and probabilistic. Focusing on two critical domains, medicine and finance, the authors train LLMs with LoRA on existing datasets. The key finding is that modifying documents to retain only their essential information significantly enhances the insight-learning ability of LLMs, whereas continual pre-training on the original documents alone has only a marginal effect.

Link: https://arxiv.org/abs/2501.17840
Authors: Pouya Pezeshkpour, Estevam Hruschka
Affiliation: Megagon Labs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance on various tasks, yet their ability to extract and internalize deeper insights from domain-specific datasets remains underexplored. In this study, we investigate how continual pre-training can enhance LLMs’ capacity for insight learning across three distinct forms: declarative, statistical, and probabilistic insights. Focusing on two critical domains: medicine and finance, we employ LoRA to train LLMs on two existing datasets. To evaluate each insight type, we create benchmarks to measure how well continual pre-training helps models go beyond surface-level knowledge. We also assess the impact of document modification on capturing insights. The results show that, while continual pre-training on original documents has a marginal effect, modifying documents to retain only essential information significantly enhances the insight-learning capabilities of LLMs.

[NLP-3] A Comprehensive Survey on Legal Summarization: Challenges and Future Directions

【Quick Read】: This paper provides a systematic survey of automatic summarization techniques, datasets, models, and evaluation methods in the legal domain. Using strict source selection criteria, it reviews over 120 NLP papers spanning the modern "Transformer" era, filling a gap left by existing systematic surveys. The paper organizes existing work along several axes and discusses trends, challenges, and opportunities for future research; its key contribution is the comprehensive synthesis of this large body of literature to reveal the state and direction of automatic summarization in the legal domain.

Link: https://arxiv.org/abs/2501.17830
Authors: Mousumi Akter, Erion Cano, Erik Weber, Dennis Dobler, Ivan Habernal
Affiliation: TU Dortmund University & Research Center Trustworthy Data Science and Security (Germany); Ruhr University Bochum & Research Center Trustworthy Data Science and Security (Germany)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This article provides a systematic up-to-date survey of automatic summarization techniques, datasets, models, and evaluation methods in the legal domain. Through specific source selection criteria, we thoroughly review over 120 papers spanning the modern 'transformer' era of natural language processing (NLP), thus filling a gap in existing systematic surveys on the matter. We present existing research along several axes and discuss trends, challenges, and opportunities for future research.

[NLP-4] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

【Quick Read】: Janus-Pro improves multimodal understanding and text-to-image instruction-following capabilities, while also stabilizing text-to-image generation, by combining (1) an optimized training strategy, (2) expanded training data, and (3) scaling to a larger model size. The key lies in applying these improvements together.

Link: https://arxiv.org/abs/2501.17811
Authors: Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan
Affiliation: DeepSeek-AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Research paper. arXiv admin note: text overlap with arXiv:2410.13848

Abstract:In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

[NLP-5] BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation – Challenges and Insights

【Quick Read】: This paper addresses the polyphone disambiguation challenge in Text-to-Speech (TTS) systems designed for Taiwanese Mandarin. The key to the solution is the combination of an S³ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme-to-phoneme prediction model, which together generate highly realistic, human-like speech. These components significantly improve performance and generalization, particularly for long-tail speaker modeling and polyphone disambiguation.

Link: https://arxiv.org/abs/2501.17790
Authors: Chan-Jan Hsu, Yi-Cheng Lin, Chia-Chun Lin, Wei-Chih Chen, Ho Lam Chung, Chen-An Li, Yi-Chang Chen, Chien-Yu Yu, Ming-Ji Lee, Chien-Cheng Chen, Ru-Heng Huang, Hung-yi Lee, Da-Shan Shiu
Affiliation: MediaTek Research; National Taiwan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, highlighting phonetic control abilities to address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate a S^3 tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme to phoneme prediction model, to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice’s superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.

[NLP-6] Reasoning Over the Glyphs: Evaluation of LLMs' Decipherment of Rare Scripts

【Quick Read】: This paper explores the potential of large vision-language models (LVLMs) and large language models (LLMs) to decipher rare scripts not encoded in Unicode. The key contributions are a new approach for constructing a multimodal dataset of linguistic puzzles involving such scripts and a tokenization method tailored to language glyphs. The models are enabled to tackle the task via a "Picture Method" for LVLMs and a "Description Method" for LLMs; the core of the solution is developing input representations suited to scripts without Unicode encodings.

Link: https://arxiv.org/abs/2501.17785
Authors: Yu-Fei Shih, Zheng-Lin Lin, Shu-Kai Hsieh
Affiliation: National Taiwan University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 3 figures

Abstract:We explore the capabilities of LVLMs and LLMs in deciphering rare scripts not encoded in Unicode. We introduce a novel approach to construct a multimodal dataset of linguistic puzzles involving such scripts, utilizing a tokenization method for language glyphs. Our methods include the Picture Method for LVLMs and the Description Method for LLMs, enabling these models to tackle these challenges. We conduct experiments using prominent models, GPT-4o, Gemini, and Claude 3.5 Sonnet, on linguistic puzzles. Our findings reveal the strengths and limitations of current AI methods in linguistic decipherment, highlighting the impact of Unicode encoding on model performance and the challenges of modeling visual language tokens through descriptions. Our study advances understanding of AI’s potential in linguistic decipherment and underscores the need for further research.

[NLP-7] 2SSP: A Two-Stage Framework for Structured Pruning of LLMs

【Quick Read】: This paper tackles structured pruning of Large Language Models (LLMs). The key to the solution is a Two-Stage framework for Structured Pruning (2SSP) that combines two different strategies, Width Pruning and Depth Pruning. Width Pruning removes entire neurons, and hence their corresponding rows and columns, so as to preserve connectivity among the pruned structures; Depth Pruning iteratively removes Attention submodules with the minimum impact on a target metric (perplexity in this work). The paper also proposes a novel mechanism to balance the sparsity rates of the two stages with respect to the desired global sparsity. The method outperforms five state-of-the-art competitors on three language modeling tasks and six downstream tasks, with up to a two-order-of-magnitude gain in pruning time.

Link: https://arxiv.org/abs/2501.17771
Authors: Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We propose a novel Two-Stage framework for Structured Pruning (2SSP) for pruning Large Language Models (LLMs), which combines two different strategies of pruning, namely Width and Depth Pruning. The first stage (Width Pruning) removes entire neurons, hence their corresponding rows and columns, aiming to preserve the connectivity among the pruned structures in the intermediate state of the Feed-Forward Networks in each Transformer block. This is done based on an importance score measuring the impact of each neuron over the output magnitude. The second stage (Depth Pruning), instead, removes entire Attention submodules. This is done by applying an iterative process that removes the Attention submodules with the minimum impact on a given metric of interest (in our case, perplexity). We also propose a novel mechanism to balance the sparsity rate of the two stages w.r.t. the desired global sparsity. We test 2SSP on four LLM families and three sparsity rates (25%, 37.5%, and 50%), measuring the resulting perplexity over three language modeling datasets as well as the performance over six downstream tasks. Our method consistently outperforms five state-of-the-art competitors over three language modeling and six downstream tasks, with an up to two-order-of-magnitude gain in terms of pruning time. The code is available at this https URL.
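
As a rough illustration of the first stage, here is a width-pruning sketch under our own assumptions: the importance score below approximates each neuron's impact on the output magnitude from calibration activations, mirroring the stated idea but not necessarily the paper's exact formula.

```python
import torch

def width_prune_ffn(w_in, w_out, x_calib, sparsity=0.25):
    """Stage-1 (width) pruning sketch: drop whole FFN neurons.

    w_in:  (d_ff, d_model) first projection; w_out: (d_model, d_ff) second.
    x_calib: (n_tokens, d_model) calibration activations.
    """
    h = torch.relu(x_calib @ w_in.T)                         # (n_tokens, d_ff)
    # Assumed score: average activation magnitude times outgoing weight mass.
    contrib = h.abs().mean(dim=0) * w_out.abs().sum(dim=0)   # (d_ff,)
    n_keep = int(w_in.shape[0] * (1 - sparsity))
    keep = contrib.topk(n_keep).indices.sort().values
    # Remove each pruned neuron's row in w_in and column in w_out,
    # preserving connectivity among the remaining structures.
    return w_in[keep, :], w_out[:, keep]

w_in, w_out = torch.randn(1024, 256), torch.randn(256, 1024)
w_in_p, w_out_p = width_prune_ffn(w_in, w_out, torch.randn(512, 256))
print(w_in_p.shape, w_out_p.shape)   # torch.Size([768, 256]) torch.Size([256, 768])
```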

[NLP-8] Hybrid Graphs for Table-and-Text based Question Answering using LLMs (NAACL 2025)

【Quick Read】: This paper addresses the challenge of answering questions that require reasoning and aggregation across structured (tables) and unstructured (raw text) data sources. Current methods rely on fine-tuning and high-quality, human-curated data, which is difficult to obtain. The key to the solution is a novel, fine-tuning-free Hybrid Graph approach for Table-Text QA: a unified hybrid graph is constructed from the textual and tabular data and pruned against the input question, giving the LLM a concise, relevant context for efficient multi-source question answering.

Link: https://arxiv.org/abs/2501.17767
Authors: Ankush Agarwal, Ganesh S, Chaitanya Devaguptapu
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at NAACL 2025 Main Track

Abstract:Answering questions that require reasoning and aggregation across both structured (tables) and unstructured (raw text) data sources presents significant challenges. Current methods rely on fine-tuning and high-quality, human-curated data, which is difficult to obtain. Recent advances in Large Language Models (LLMs) have shown promising results for multi-hop question answering (QA) over single-source text data in a zero-shot setting, yet exploration into multi-source Table-Text QA remains limited. In this paper, we present a novel Hybrid Graph-based approach for Table-Text QA that leverages LLMs without fine-tuning. Our method constructs a unified Hybrid Graph from textual and tabular data, pruning information based on the input question to provide the LLM with relevant context concisely. We evaluate our approach on the challenging Hybrid-QA and OTT-QA datasets using state-of-the-art LLMs, including GPT-3.5, GPT-4, and LLaMA-3. Our method achieves the best zero-shot performance on both datasets, improving Exact Match scores by up to 10% on Hybrid-QA and 5.4% on OTT-QA. Moreover, our approach reduces token usage by up to 53% compared to the original context.
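
The construct-then-prune idea can be sketched with simplified assumptions (lexical-overlap scoring and a fixed node budget stand in for the paper's actual construction):

```python
import networkx as nx

def build_hybrid_graph(table_rows, sentences):
    """Unified graph: one node per table row and per sentence; edges link
    nodes sharing a non-trivial token (an assumed heuristic)."""
    g = nx.Graph()
    nodes = [(" | ".join(r), "table") for r in table_rows] + [(s, "text") for s in sentences]
    for i, (text, kind) in enumerate(nodes):
        g.add_node(i, text=text, kind=kind)
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            shared = set(nodes[i][0].lower().split()) & set(nodes[j][0].lower().split())
            if any(len(t) > 3 for t in shared):
                g.add_edge(i, j)
    return g

def prune_for_question(g, question, budget=5):
    """Keep the nodes most lexically relevant to the question (assumed scoring)."""
    q = set(question.lower().split())
    score = {n: len(q & set(g.nodes[n]["text"].lower().split())) for n in g.nodes}
    keep = sorted(g.nodes, key=score.get, reverse=True)[:budget]
    return [g.nodes[n]["text"] for n in keep]   # concise context for the LLM prompt

g = build_hybrid_graph([["Paris", "France", "2.1M"]],
                       ["Paris is the capital of France."])
print(prune_for_question(g, "What is the capital of France?", budget=2))
```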

[NLP-9] Improving Privacy Benefits of Redaction

【Quick Read】: This paper addresses privacy protection for natural text data. The key contribution is a new redaction method that provides better privacy benefits than existing state-of-the-art techniques while maintaining lower redaction levels.

Link: https://arxiv.org/abs/2501.17762
Authors: Vaibhav Gusain, Douglas Leith
Affiliation: Trinity College Dublin - School of Computer Science and Statistics
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We propose a novel redaction methodology that can be used to sanitize natural text data. Our new technique provides better privacy benefits than other state of the art techniques while maintaining lower redaction levels.

[NLP-10] VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback

【Quick Read】: This paper addresses the lack of validation mechanisms in current chest X-ray (CXR) report generation systems, whose outputs are insufficiently reliable and interpretable without expert oversight. The key to the solution is a novel multimodal framework that integrates a Phrase Grounding Model and a Text-to-Image Diffusion Module. The grounding model identifies and localizes pathologies in CXR images from textual prompts, while the diffusion module generates synthetic CXR images from prompts while preserving anatomical fidelity. The framework further introduces a dual-scoring system that separately evaluates localization accuracy and semantic consistency, significantly improving pathology localization and text-to-image alignment.

Link: https://arxiv.org/abs/2501.17726
Authors: Sayeh Gholipour Picha, Dawood Al Chanti, Alice Caplier
Affiliation: Univ. Grenoble Alpes; CNRS; Grenoble INP; GIPSA-lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:As artificial intelligence (AI) becomes increasingly central to healthcare, the demand for explainable and trustworthy models is paramount. Current report generation systems for chest X-rays (CXR) often lack mechanisms for validating outputs without expert oversight, raising concerns about reliability and interpretability. To address these challenges, we propose a novel multimodal framework designed to enhance the semantic alignment and localization accuracy of AI-generated medical reports. Our framework integrates two key modules: a Phrase Grounding Model, which identifies and localizes pathologies in CXR images based on textual prompts, and a Text-to-Image Diffusion Module, which generates synthetic CXR images from prompts while preserving anatomical fidelity. By comparing features between the original and generated images, we introduce a dual-scoring system: one score quantifies localization accuracy, while the other evaluates semantic consistency. This approach significantly outperforms existing methods, achieving state-of-the-art results in pathology localization and text-to-image alignment. The integration of phrase grounding with diffusion models, coupled with the dual-scoring evaluation system, provides a robust mechanism for validating report quality, paving the way for more trustworthy and transparent AI in medical imaging.

[NLP-11] Using Code Generation to Solve Open Instances of Combinatorial Design Problems

【Quick Read】: This paper targets open instances in combinatorial design theory whose existence has not yet been determined. The key to the solution is CPro1, a constructive protocol that uses Large Language Models (LLMs) to generate code that constructs combinatorial designs and resolves some of these open instances. CPro1 starts from the definition of a particular design type and a reliable verifier, has the LLM select strategies and implement them in code, and uses scaffolding for automated hyperparameter tuning and execution feedback via the verifier. By generating many candidates, the protocol automates exploration of standard methods (e.g., simulated annealing, genetic algorithms) and experimentation with variations (e.g., cost functions), increasing the chance of finding valid solutions.

Link: https://arxiv.org/abs/2501.17725
Authors: Christopher D. Rosin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
Comments:

Abstract:The Handbook of Combinatorial Designs catalogs many types of combinatorial designs, together with lists of open instances for which existence has not yet been determined. We develop a constructive protocol CPro1, which uses Large Language Models (LLMs) to generate code that constructs combinatorial designs and resolves some of these open instances. The protocol starts from a definition of a particular type of design, and a verifier that reliably confirms whether a proposed design is valid. The LLM selects strategies and implements them in code, and scaffolding provides automated hyperparameter tuning and execution feedback using the verifier. Most generated code fails, but by generating many candidates, the protocol automates exploration of a variety of standard methods (e.g. simulated annealing, genetic algorithms) and experimentation with variations (e.g. cost functions) to find successful approaches. Testing on 16 different types of designs, CPro1 constructs solutions to open instances for 6 of them: Symmetric and Skew Weighing Matrices, Equidistant Permutation Arrays, Packing Arrays, Balanced Ternary Designs, and Florentine Rectangles.

[NLP-12] RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts (PACLIC 38)

【Quick Read】: This paper addresses the risk of unauthorized access or manipulation in user interactions with conversational agents (CAs), particularly "jailbreaking" attempts that push past built-in guardrails. The key contribution is RICoTA, a dataset of 609 user-posted dialogue prompts from a Korean Reddit-like community that capture in-the-wild jailbreak attempts with specific testing and gaming intentions. With this dataset, the paper aims to evaluate the ability of LLMs to identify the type of conversation and the user's testing purpose, yielding chatbot design implications for mitigating jailbreaking risks.

Link: https://arxiv.org/abs/2501.17715
Authors: Eujeong Choi, Younghun Jeong, Soomin Kim, Won Ik Cho
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: PACLIC 38

Abstract:User interactions with conversational agents (CAs) evolve in the era of heavily guardrailed large language models (LLMs). As users push beyond programmed boundaries to explore and build relationships with these systems, there is a growing concern regarding the potential for unauthorized access or manipulation, commonly referred to as “jailbreaking.” Moreover, with CAs that possess highly human-like qualities, users show a tendency toward initiating intimate sexual interactions or attempting to tame their chatbots. To capture and reflect these in-the-wild interactions into chatbot designs, we propose RICoTA, a Korean red teaming dataset that consists of 609 prompts challenging LLMs with in-the-wild user-made dialogues capturing jailbreak attempts. We utilize user-chatbot conversations that were self-posted on a Korean Reddit-like community, containing specific testing and gaming intentions with a social chatbot. With these prompts, we aim to evaluate LLMs’ ability to identify the type of conversation and users’ testing purposes to derive chatbot design implications for mitigating jailbreaking risks. Our dataset will be made publicly available via GitHub.

[NLP-13] Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

【Quick Read】: This paper challenges the conventional Supervised Fine-Tuning (SFT) paradigm and proposes Critique Fine-Tuning (CFT), a strategy in which models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and more nuanced understanding, traits often overlooked by standard SFT. To validate the approach, the authors construct a 50K-sample dataset with GPT-4o as the teacher model generating the critiques. Experiments show consistent 4-10% improvements over SFT across math benchmarks; notably, the Qwen2.5-Math-CFT model trained on only 50K samples matches or outperforms competing models trained on over 2M samples.

Link: https://arxiv.org/abs/2501.17703
Authors: Yubo Wang, Xiang Yue, Wenhu Chen
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding-traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of (input=[query; noisy response], output=critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We further expand to MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our Qwen2.5-Math-CFT model-trained on just 50K samples-matches or outperforms competitive models such as AceMath and Qwen2.5-Math-Instruct on most benchmarks, both of which use over 2M samples. Ablation studies show that CFT is robust to the source of noisy response and teacher critique model. Through these findings, we argue that critique-based training offers a more effective alternative to advance the reasoning of language models.
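
The example format follows directly from the abstract; the sketch below assembles one (input=[query; noisy response], output=critique) training sample, with a stub standing in for the GPT-4o teacher call:

```python
def make_cft_example(query, noisy_response, teacher_critique_fn):
    """Build one Critique Fine-Tuning example:
    input  = [query; noisy response]
    output = teacher-written critique of that response."""
    critique = teacher_critique_fn(query, noisy_response)  # e.g. a GPT-4o call (hypothetical wrapper)
    return {
        "input": f"Query: {query}\nResponse: {noisy_response}\nCritique the response.",
        "output": critique,
    }

# The model is then fine-tuned to generate the critique,
# not to imitate a gold answer as in standard SFT.
example = make_cft_example(
    "What is 17 * 24?",
    "17 * 24 = 398",
    lambda q, r: "Incorrect: 17 * 24 = 408, not 398.",
)
print(example["input"])
```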

[NLP-14] Exploring Vision Language Models for Multimodal and Multilingual Stance Detection (AAAI)

【Quick Read】: This paper addresses open challenges in multimodal and multilingual stance detection, in particular evaluating how state-of-the-art Vision-Language Models (VLMs) handle social media posts that combine text and images. The key to the solution is a newly extended dataset covering seven languages and multimodal inputs, used to probe the models' use of visual cues, language-specific performance, and cross-modal interactions. The results show that VLMs rely more on text than on images for stance detection, that this trend holds across languages, and that the models depend far more on text embedded within images than on other visual content.

Link: https://arxiv.org/abs/2501.17654
Authors: Jake Vasilakes, Carolina Scarton, Zhixue Zhao
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to the International AAAI Conference on Web and Social Media (ICWSM) 2025

Abstract:Social media’s global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks like stance detection across languages and modalities. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios, such as those involving both images and text, relatively underexplored. Meanwhile, the prevalence of multimodal posts has increased significantly in recent years. Although state-of-the-art Vision-Language Models (VLMs) show promise, their performance on multimodal and multilingual stance detection tasks remains largely unexamined. This paper evaluates state-of-the-art VLMs on a newly extended dataset covering seven languages and multimodal inputs, investigating their use of visual cues, language-specific performance, and cross-modality interactions. Our results show that VLMs generally rely more on text than images for stance detection and this trend persists across languages. Additionally, VLMs rely significantly more on text contained within the images than other visual content. Regarding multilinguality, the models studied tend to generate consistent predictions across languages whether they are explicitly multilingual or not, although there are outliers that are incongruous with macro F1, language support, and model size.

[NLP-15] Tonguescape: Exploring Language Models' Understanding of Vowel Articulation (NAACL 2025)

【Quick Read】: This paper investigates whether language models (LMs) can understand vowel articulation from tongue-position information. The key contribution is a video and image dataset built from an existing real-time MRI dataset, used to test whether LMs grasp the relationship between tongue position and vowel articulation when reference examples are provided. The findings suggest that LMs show potential for understanding the tongue-vowel relationship when given reference examples, but struggle without them.

Link: https://arxiv.org/abs/2501.17643
Authors: Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Affiliation: Nara Institute of Science and Technology (NAIST)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL 2025

Abstract:Vowels are primarily characterized by tongue position. Humans have discovered these features of vowel articulation through their own experience and explicit objective observation such as using MRI. With this knowledge and our experience, we can explain and understand the relationship between tongue positions and vowels, and this knowledge is helpful for language learners to learn pronunciation. Since language models (LMs) are trained on a large amount of data that includes linguistic and medical fields, our preliminary studies indicate that an LM is able to explain the pronunciation mechanisms of vowels. However, it is unclear whether multi-modal LMs, such as vision LMs, align textual information with visual information. One question arises: do LMs associate real tongue positions with vowel articulation? In this study, we created video and image datasets from the existing real-time MRI dataset and investigated whether LMs can understand vowel articulation based on tongue positions using vision-based information. Our findings suggest that LMs exhibit potential for understanding vowels and tongue positions when reference examples are provided while they have difficulties without them. Our code for dataset building is available on GitHub.

[NLP-16] In-Context Meta LoRA Generation

【Quick Read】: This paper addresses the storage and inference inefficiency of training a separate Low-Rank Adaptation (LoRA) model for every task in multi-task settings, as well as the failure of existing parameter generation methods to capture correlations between tasks, which makes multi-task LoRA parameter generation difficult. The key to the solution is In-Context Meta LoRA (ICM-LoRA): a Conditional Variational Autoencoder (CVAE) is trained on data from all tasks, takes task descriptions as input, and outputs task-aware LoRA weights. These weights are then merged with LLMs to create task-specialized models without additional fine-tuning. In-context meta-learning is further used for knowledge enhancement and task mapping, capturing the relationship between tasks and parameter distributions and enabling more accurate LoRA parameter generation.

Link: https://arxiv.org/abs/2501.17635
Authors: Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, Jingcai Guo
Affiliation: The Hong Kong Polytechnic University; Tsinghua University; Beijing Institute for General Artificial Intelligence (BIGAI); University of Trento; University of Copenhagen; Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, Conditional Variational Autoencoder (CVAE). CVAE takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without the need for additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping, to capture the relationship between tasks and parameter distributions. As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies 283MB, only 1% storage compared with the original LoRA.
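
A minimal sketch of the CVAE component as described, with illustrative dimensions (not the paper's configuration):

```python
import torch
import torch.nn as nn

class LoRACVAE(nn.Module):
    """Conditional VAE sketch: task-description embedding -> flattened LoRA weights.
    All dimensions here are illustrative assumptions."""
    def __init__(self, task_dim=768, lora_dim=4096, z_dim=64):
        super().__init__()
        self.z_dim = z_dim
        self.encoder = nn.Linear(lora_dim + task_dim, 2 * z_dim)      # -> (mu, logvar)
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + task_dim, 512), nn.ReLU(), nn.Linear(512, lora_dim)
        )

    def forward(self, lora_flat, task_emb):
        mu, logvar = self.encoder(torch.cat([lora_flat, task_emb], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()          # reparameterization trick
        return self.decoder(torch.cat([z, task_emb], dim=-1)), mu, logvar

    @torch.no_grad()
    def generate(self, task_emb):
        z = torch.randn(task_emb.shape[0], self.z_dim)
        return self.decoder(torch.cat([z, task_emb], dim=-1))         # task-aware LoRA weights

model = LoRACVAE()
flat = model.generate(torch.randn(1, 768))  # reshape into per-layer A/B matrices downstream
print(flat.shape)                           # torch.Size([1, 4096])
```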

[NLP-17] Uncertainty Quantification and Decomposition for LLM-based Recommendation (WWW 2025)

【Quick Read】: This paper addresses the uncertainty that large language models (LLMs) exhibit in recommendation and the need to ensure their recommendations are trustworthy. The key contributions are a novel framework for estimating predictive uncertainty to quantitatively measure the reliability of LLM-based recommendations, and a decomposition of that uncertainty into recommendation uncertainty and prompt uncertainty, enabling in-depth analysis of where the uncertainty comes from. The paper also proposes uncertainty-aware prompting to lower predictive uncertainty and improve recommendations.

Link: https://arxiv.org/abs/2501.17630
Authors: Wonbin Kweon, Sanghwan Jang, SeongKu Kang, Hwanjo Yu
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: WWW 2025

Abstract:Despite the widespread adoption of large language models (LLMs) for recommendation, we demonstrate that LLMs often exhibit uncertainty in their recommendations. To ensure the trustworthy use of LLMs in generating recommendations, we emphasize the importance of assessing the reliability of recommendations generated by LLMs. We start by introducing a novel framework for estimating the predictive uncertainty to quantitatively measure the reliability of LLM-based recommendations. We further propose to decompose the predictive uncertainty into recommendation uncertainty and prompt uncertainty, enabling in-depth analyses of the primary source of uncertainty. Through extensive experiments, we (1) demonstrate predictive uncertainty effectively indicates the reliability of LLM-based recommendations, (2) investigate the origins of uncertainty with decomposed uncertainty measures, and (3) propose uncertainty-aware prompting for a lower predictive uncertainty and enhanced recommendation. Our source code and model weights are available at this https URL
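
One standard estimator that realizes such a decomposition, assumed here purely for illustration (it mirrors the usual total/conditional entropy split, not necessarily the paper's exact measure), samples item-choice distributions under several paraphrased prompts:

```python
import numpy as np

def decompose_uncertainty(probs):
    """probs: (n_prompts, n_items) item-choice distributions sampled under
    paraphrased prompts (assumed estimator).

    total          = H(mean distribution)        : predictive uncertainty
    recommendation = mean per-prompt entropy     : uncertainty within a prompt
    prompt         = total - recommendation      : uncertainty across prompts
    """
    def entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return -(p * np.log(p)).sum(axis=-1)
    total = entropy(probs.mean(axis=0))
    recommendation = entropy(probs).mean()
    return total, recommendation, total - recommendation

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])   # two prompts disagree -> high prompt uncertainty
print(decompose_uncertainty(probs))
```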

[NLP-18] Structured Context Recomposition for Large Language Models Using Probabilistic Layer Realignment

【Quick Read】: This paper addresses the degradation of contextual coherence in extended sequence generation, caused by the inability of conventional self-attention to retain long-range dependencies. The key to the solution is Structured Context Recomposition (SCR), a probabilistic layer realignment strategy that dynamically adjusts learned representations within transformer layers and uses a recursive weighting function to redistribute representational emphasis based on inferred contextual relevance, thereby improving coherence retention and mitigating abrupt topic shifts and logical inconsistencies.

Link: https://arxiv.org/abs/2501.17617
Authors: Jonathan Teel, Jocasta Cumberbatch, Raphael Benington, Quentin Baskerville
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Extended sequence generation often leads to degradation in contextual consistency due to the inability of conventional self-attention mechanisms to effectively retain long-range dependencies. Existing approaches, including memory compression and retrieval-augmented conditioning, introduce computational trade-offs that either increase inference latency or impose additional storage overhead. Structured Context Recomposition (SCR) introduces a probabilistic layer realignment strategy that dynamically adjusts learned representations within transformer layers, ensuring that semantically relevant embeddings persist throughout extended transformations. The proposed method enhances coherence retention through a recursive weighting function that redistributes representational emphasis based on inferred contextual relevance rather than relying on fixed token-level attention scores. Empirical results indicate that probabilistic realignment mitigates abrupt topic shifts and logical inconsistencies, particularly in scenarios where sequences exceed standard attention window constraints. Sequence-level entropy analysis further reveals that SCR moderates representational variability without introducing excessive output regularization, allowing models to sustain generative diversity while preserving contextual alignment. Attention head deviation measurements confirm that hierarchical reweighting contributes to smoother token dependency transitions across transformer layers, reinforcing the stability of multi-turn interactions and document-level reasoning. Computational resource assessments show that while SCR incurs a moderate increase in processing time, memory overhead remains within feasible limits, making it suitable for practical deployment in autoregressive generative applications.

[NLP-19] Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition

【Quick Read】: This paper addresses the weak performance of multilingual automatic speech recognition (ASR) on low-resource languages. The key to the solution is a hierarchical Softmax (H-Softmax) decoder built with a cross-lingual embedding clustering method, which lets similar tokens across different languages share similar decoder representations, overcoming the limitation of the earlier Huffman-based H-Softmax method that relied on shallow features for token similarity assessment.

Link: https://arxiv.org/abs/2501.17615
Authors: Zhengdong Yang, Qianying Liu, Sheng Li, Fei Cheng, Chenhui Chu
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. It utilizes a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, which enables similar tokens across different languages to share similar decoder representations. It addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features in token similarity assessments. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.
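
A sketch of a two-level clustered H-Softmax, with off-the-shelf k-means over shared token embeddings standing in for the paper's cross-lingual clustering:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class ClusteredHSoftmax(nn.Module):
    """Two-level H-Softmax: log P(token|h) = log P(cluster|h) + log P(token|cluster,h).
    Clusters come from k-means over shared token embeddings, so similar tokens
    across languages end up under the same decoder subtree."""
    def __init__(self, token_embs, n_clusters, d_model):
        super().__init__()
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(token_embs)
        self.register_buffer("cluster_of", torch.as_tensor(labels, dtype=torch.long))
        self.cluster_head = nn.Linear(d_model, n_clusters)
        self.token_head = nn.Linear(d_model, len(token_embs))

    def log_prob(self, h, token_ids):
        c = self.cluster_of[token_ids]                       # gold cluster per token
        log_p_c = self.cluster_head(h).log_softmax(-1).gather(-1, c[:, None])[:, 0]
        logits = self.token_head(h)
        # Restrict the inner softmax to tokens inside the gold token's cluster.
        logits = logits.masked_fill(self.cluster_of[None, :] != c[:, None], float("-inf"))
        log_p_t = logits.log_softmax(-1).gather(-1, token_ids[:, None])[:, 0]
        return log_p_c + log_p_t

embs = np.random.randn(1000, 64).astype("float32")   # hypothetical shared embeddings
hsm = ClusteredHSoftmax(embs, n_clusters=32, d_model=128)
print(hsm.log_prob(torch.randn(4, 128), torch.tensor([5, 17, 300, 999])))
```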

[NLP-20] Semantic Consistency Regularization with Large Language Models for Semi-supervised Sentiment Analysis (ICONIP 2024)

【Quick Read】: This paper addresses the labor-intensive and time-consuming nature of text annotation for sentiment analysis, proposing a semi-supervised method enhanced with prompting strategies and Large Language Models (LLMs). The key contributions are two enhancement strategies, Entity-based Enhancement (SCR-EE) and Concept-based Enhancement (SCR-CE), which query LLMs to improve the semantic consistency of unlabeled text; the LLM-augmented data is then used in a consistency loss with confidence thresholding, while a class re-assembling strategy makes full use of the uncertain unlabeled samples.

Link: https://arxiv.org/abs/2501.17598
Authors: Kunrong Li, Xinyu Liu, Zhen Chen
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICONIP 2024

Abstract:Accurate sentiment analysis of texts is crucial for a variety of applications, such as understanding customer feedback, monitoring market trends, and detecting public sentiment. However, manually annotating large sentiment corpora for supervised learning is labor-intensive and time-consuming. Therefore, it is essential and effective to develop a semi-supervised method for the sentiment analysis task. Although some methods have been proposed for semi-supervised text classification, they rely on the intrinsic information within the unlabeled data and the learning capability of the NLP model, which lack generalization ability to the sentiment analysis scenario and may prone to overfit. Inspired by the ability of pretrained Large Language Models (LLMs) in following instructions and generating coherent text, we propose a Semantic Consistency Regularization with Large Language Models (SCR) framework for semi-supervised sentiment analysis. We introduce two prompting strategies to semantically enhance unlabeled text using LLMs. The first is Entity-based Enhancement (SCR-EE), which involves extracting entities and numerical information, and querying the LLM to reconstruct the textual information. The second is Concept-based Enhancement (SCR-CE), which directly queries the LLM with the original sentence for semantic reconstruction. Subsequently, the LLM-augmented data is utilized for a consistency loss with confidence thresholding, which preserves high-quality agreement samples to provide additional supervision signals during training. Furthermore, to fully utilize the uncertain unlabeled data samples, we propose a class re-assembling strategy inspired by the class space shrinking theorem. Experiments show our method achieves remarkable performance over prior semi-supervised methods.
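
The consistency loss with confidence thresholding can be sketched as follows (a simplified, FixMatch-style reading of the abstract; the exact loss is our assumption):

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_orig, logits_aug, threshold=0.9):
    """Pseudo-label the original unlabeled text, keep only confident samples,
    and supervise the LLM-augmented view with those labels."""
    with torch.no_grad():
        probs = logits_orig.softmax(-1)
        conf, pseudo = probs.max(-1)
        keep = conf >= threshold          # high-quality agreement samples only
    if keep.sum() == 0:
        return logits_aug.new_zeros(())   # no confident samples in this batch
    return F.cross_entropy(logits_aug[keep], pseudo[keep])
```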

[NLP-21] GLLM : Self-Corrective G-Code Generation using Large Language Models with User Feedback

【Quick Read】: This paper addresses the difficulty of writing G-code by hand, bridging the gap between human-readable task descriptions and machine-executable code. The key to the solution is a StarCoder-3B model fine-tuned on domain-specific training data and combined with a Retrieval-Augmented Generation (RAG) mechanism, together with advanced prompting strategies and a novel self-corrective code generation approach that ensures both syntactic and semantic correctness of the generated G-code. The system also includes robust validation mechanisms, including syntax checks, G-code-specific verification, and functional correctness evaluation using the Hausdorff distance.

Link: https://arxiv.org/abs/2501.17584
Authors: Mohamed Abdelaal, Samuel Lokadjaja, Gilbert Engert
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces GLLM, an innovative tool that leverages Large Language Models (LLMs) to automatically generate G-code from natural language instructions for Computer Numerical Control (CNC) machining. GLLM addresses the challenges of manual G-code writing by bridging the gap between human-readable task descriptions and machine-executable code. The system incorporates a fine-tuned StarCoder-3B model, enhanced with domain-specific training data and a Retrieval-Augmented Generation (RAG) mechanism. GLLM employs advanced prompting strategies and a novel self-corrective code generation approach to ensure both syntactic and semantic correctness of the generated G-code. The architecture includes robust validation mechanisms, including syntax checks, G-code-specific verifications, and functional correctness evaluations using Hausdorff distance. By combining these techniques, GLLM aims to democratize CNC programming, making it more accessible to users without extensive programming experience while maintaining high accuracy and reliability in G-code generation.
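
The functional-correctness check relies on the standard Hausdorff distance; the sketch below compares two sampled toolpaths (converting G-code into sampled points is a separate, elided step):

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets a, b of shape (n, 2).
    Small values mean the generated toolpath stays close to the reference."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

ref = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]])   # reference toolpath points
gen = np.array([[0.1, 0.0], [10.0, 0.2], [9.8, 10.1]])    # points from generated G-code
print(hausdorff(ref, gen))   # ~0.22 -> functionally close toolpaths
```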

[NLP-22] CSEval: Towards Automated Multi-Dimensional and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs

【Quick Read】: This paper addresses the lack of standardized evaluation protocols and robust automatic metrics for automated counterspeech generation. Current automatic evaluation relies mainly on similarity metrics, which fail to capture the complex, independent attributes of counterspeech quality, such as contextual relevance, aggressiveness, and argumentative coherence. The key contributions are CSEval, a dataset and framework for evaluating counterspeech quality along four dimensions, and ACE (Auto-Calibrated COT for Counterspeech Evaluation), a prompt-based method that scores counterspeech with large language models using auto-calibrated chains of thought. ACE correlates with human judgment more effectively than prior metrics, marking a significant advance in automated counterspeech evaluation.

Link: https://arxiv.org/abs/2501.17581
Authors: Amey Hengle, Aswini Kumar, Anil Bandhakavi, Tanmoy Chakraborty
Affiliation: Indian Institute of Technology Delhi; Logically.ai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 17 pages, 5 figures. arXiv admin note: text overlap with arXiv:2309.13308 by other authors

Abstract:Counterspeech has been popular as an effective approach to counter online hate speech, leading to increasing research interest in automated counterspeech generation using language models. However, this field lacks standardised evaluation protocols and robust automated evaluation metrics that align with human judgement. Current automatic evaluation methods, primarily based on similarity metrics, do not effectively capture the complex and independent attributes of counterspeech quality, such as contextual relevance, aggressiveness, or argumentative coherence. This has led to an increased dependency on labor-intensive human evaluations to assess automated counter-speech generation methods. To address these challenges, we introduce CSEval, a novel dataset and framework for evaluating counterspeech quality across four dimensions: contextual-relevance, aggressiveness, argument-coherence, and suitableness. Furthermore, we propose Auto-Calibrated COT for Counterspeech Evaluation (ACE), a prompt-based method with auto-calibrated chain-of-thoughts (CoT) for scoring counterspeech using large language models. Our experiments show that ACE outperforms traditional metrics like ROUGE, METEOR, and BertScore in correlating with human judgement, indicating a significant advancement in automated counterspeech evaluation.

[NLP-23] A linguistically-motivated evaluation methodology for unraveling models' abilities in reading comprehension tasks

【Quick Read】: This paper addresses how the linguistic complexity of input examples affects model performance on reading comprehension tasks. The key is an evaluation methodology that uses semantic frame annotation to characterize this complexity, identifying seven complexity factors that may account for models' difficulty and validating their effect on models of different sizes and architectures. The results show that fine-grained, linguistically motivated automatic evaluation is not only feasible but helps explain models' ability to handle specific linguistic characteristics of input examples, and reveals that current state-of-the-art models still fail on some of them.

Link: https://arxiv.org/abs/2501.17569
Authors: Elie Antoine (LIS, TALEP), Frédéric Béchet (LIS, TALEP), Géraldine Damnati, Philippe Langlais (DIRO)
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce an evaluation methodology for reading comprehension tasks based on the intuition that certain examples, by the virtue of their linguistic complexity, consistently yield lower scores regardless of model size or architecture. We capitalize on semantic frame annotation for characterizing this complexity, and study seven complexity factors that may account for model’s difficulty. We first deploy this methodology on a carefully annotated French reading comprehension benchmark showing that two of those complexity factors are indeed good predictors of models’ failure, while others are less so. We further deploy our methodology on a well studied English benchmark by using Chat-GPT as a proxy for semantic annotation. Our study reveals that fine-grained linguistically-motivated automatic evaluation of a reading comprehension task is not only possible, but helps understand models’ abilities to handle specific linguistic characteristics of input examples. It also shows that current state-of-the-art models fail with some of those characteristics which suggests that adequately handling them requires more than merely increasing model size.

[NLP-24] Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models

【Quick Read】: This paper addresses the scalability issues of node-level projection and the information loss of graph-level projection when processing graph-structured data. The key to the solution is the Learnable Graph Pooling Token (LGPT), which introduces learnable parameters that act as tokens in large language models, enabling flexible and efficient graph representation that balances fine-grained and global graph information. The paper also investigates an Early Query Fusion technique that fuses query context before constructing the graph representation, producing more effective graph embeddings. Together, these methods yield a 4.13% performance improvement on the GraphQA benchmark without training the large language model.

Link: https://arxiv.org/abs/2501.17549
Authors: Wooyoung Kim, Byungyoon Park, Wooju Kim
Affiliation: Yonsei University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Graph-structured data plays a vital role in numerous domains, such as social networks, citation networks, commonsense reasoning graphs and knowledge graphs. While graph neural networks have been employed for graph processing, recent advancements have explored integrating large language models for graph-based tasks. In this paper, we propose a novel approach named Learnable Graph Pooling Token (LGPT), which addresses the limitations of the scalability issues in node-level projection and information loss in graph-level projection. LGPT enables flexible and efficient graph representation by introducing learnable parameters that act as tokens in large language models, balancing fine-grained and global graph information. Additionally, we investigate an Early Query Fusion technique, which fuses query context before constructing the graph representation, leading to more effective graph embeddings. Our method achieves a 4.13% performance improvement on the GraphQA benchmark without training the large language model, demonstrating significant gains in handling complex textual-attributed graph data.
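
A sketch of pooling-token cross-attention with early query fusion; the token count, dimensions, and additive fusion below are our assumptions:

```python
import torch
import torch.nn as nn

class GraphPoolingTokens(nn.Module):
    """LGPT-style pooling sketch: k learnable tokens cross-attend over node
    embeddings, yielding a fixed-size graph prompt for a frozen LLM."""
    def __init__(self, d=256, k=8, heads=4):
        super().__init__()
        self.pool_tokens = nn.Parameter(torch.randn(k, d) * 0.02)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, node_embs, query_emb):
        # Early query fusion: inject query context into nodes *before* pooling.
        nodes = node_embs + query_emb[None, :]
        q = self.pool_tokens[None]                        # (1, k, d)
        pooled, _ = self.attn(q, nodes[None], nodes[None])
        return pooled[0]                                  # (k, d), projected into LLM token space downstream

pool = GraphPoolingTokens()
print(pool(torch.randn(100, 256), torch.randn(256)).shape)  # torch.Size([8, 256])
```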

[NLP-25] LLM Assistance for Pediatric Depression

【Quick Read】: This paper addresses the practical difficulty of traditional depression screening methods such as the PHQ-9 for children in pediatric primary care, proposing that state-of-the-art large language models (LLMs) extract depressive symptoms to complement traditional screening and reduce diagnostic errors. The key is a zero-shot approach in which the Flan models perform strongly, especially on rarer symptoms such as "sleep problems" (F1: 0.92) and "self-loathing" (F1: 0.8); symptom annotations produced by Flan, used as features in a machine learning algorithm, differentiate depression cases from controls with a high precision of 0.78.

Link: https://arxiv.org/abs/2501.17510
Authors: Mariia Ignashina, Paulina Bondaronek, Dan Santel, John Pestian, Julia Ive
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Traditional depression screening methods, such as the PHQ-9, are particularly challenging for children in pediatric primary care due to practical limitations. AI has the potential to help, but the scarcity of annotated datasets in mental health, combined with the computational costs of training, highlights the need for efficient, zero-shot approaches. In this work, we investigate the feasibility of state-of-the-art LLMs for depressive symptom extraction in pediatric settings (ages 6-24). This approach aims to complement traditional screening and minimize diagnostic errors. Our findings show that all LLMs are 60% more efficient than word match, with Flan leading in precision (average F1: 0.65, precision: 0.78), excelling in the extraction of more rare symptoms like “sleep problems” (F1: 0.92) and “self-loathing” (F1: 0.8). Phi strikes a balance between precision (0.44) and recall (0.60), performing well in categories like “Feeling depressed” (0.69) and “Weight change” (0.78). Llama 3, with the highest recall (0.90), overgeneralizes symptoms, making it less suitable for this type of analysis. Challenges include the complexity of clinical notes and overgeneralization from PHQ-9 scores. The main challenges faced by LLMs include navigating the complex structure of clinical notes with content from different times in the patient trajectory, as well as misinterpreting elevated PHQ-9 scores. We finally demonstrate the utility of symptom annotations provided by Flan as features in an ML algorithm, which differentiates depression cases from controls with high precision of 0.78, showing a major performance boost compared to a baseline that does not use these features.

[NLP-26] DINT Transformer

【Quick Read】: This paper addresses two key limitations of the DIFF Transformer: its lack of global context modeling, which is essential for identifying globally significant tokens, and its numerical instability due to the absence of strict row normalization in the attention matrix. The key to the solution is the DINT Transformer, which introduces a differential-integral mechanism: global importance scores are computed and integrated into the attention matrix, strengthening the ability to capture global dependencies, while a unified parameter design enforces row-normalized attention matrices and improves numerical stability.

Link: https://arxiv.org/abs/2501.17486
Authors: Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Erlu Zhao, Li Shi
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2410.05258 by other authors

Abstract:DIFF Transformer addresses the issue of irrelevant context interference by introducing a differential attention mechanism that enhances the robustness of local attention. However, it has two critical limitations: the lack of global context modeling, which is essential for identifying globally significant tokens, and numerical instability due to the absence of strict row normalization in the attention matrix. To overcome these challenges, we propose DINT Transformer, which extends DIFF Transformer by incorporating a differential-integral mechanism. By computing global importance scores and integrating them into the attention matrix, DINT Transformer improves its ability to capture global dependencies. Moreover, the unified parameter design enforces row-normalized attention matrices, improving numerical stability. Experimental results demonstrate that DINT Transformer excels in accuracy and robustness across various practical applications, such as long-context language modeling and key information retrieval. These results position DINT Transformer as a highly effective and promising architecture.
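
A sketch of the differential-integral idea with enforced row normalization; the exact parameterization is our guess at the mechanism, not the paper's code:

```python
import torch

def dint_attention(q1, q2, k1, k2, v, lam=0.5, gamma=0.1):
    """DINT-style head sketch: differential attention (two softmax maps
    subtracted) plus a global-importance term, renormalized so rows sum to 1."""
    d = q1.shape[-1]
    a1 = torch.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)
    diff = a1 - lam * a2                          # differential attention
    global_imp = a1.mean(dim=-2, keepdim=True)    # per-token global importance scores
    scores = (diff + gamma * global_imp).clamp_min(0)
    scores = scores / scores.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # strict row normalization
    return scores @ v

x = [torch.randn(6, 16) for _ in range(5)]        # q1, q2, k1, k2, v
print(dint_attention(*x).shape)                   # torch.Size([6, 16])
```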

[NLP-27] DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance

【Quick Read】: This paper addresses the uneven performance of large language models (LLMs) across diverse and complex domains. The key to the solution is the Diverse Fingerprint Ensemble (DFPE), which clusters models by their response "fingerprint" patterns, applies a quantile-based filtering mechanism to remove underperforming models per subject, and assigns adaptive weights to the remaining models based on subject-wise validation accuracy, thereby exploiting the complementary strengths of multiple LLMs for more robust performance. On the Massive Multitask Language Understanding (MMLU) benchmark, DFPE improves overall accuracy by 3% and discipline-level accuracy by 5% over the best single model.

Link: https://arxiv.org/abs/2501.17479
Authors: Seffi Cohen, Niv Goldshlager, Nurit Cohen-Inger, Bracha Shapira, Lior Rokach
Affiliation: Ben Gurion University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have shown remarkable capabilities across various natural language processing tasks but often struggle to excel uniformly in diverse or complex domains. We propose a novel ensemble method - Diverse Fingerprint Ensemble (DFPE), which leverages the complementary strengths of multiple LLMs to achieve more robust performance. Our approach involves: (1) clustering models based on response “fingerprints” patterns, (2) applying a quantile-based filtering mechanism to remove underperforming models at a per-subject level, and (3) assigning adaptive weights to remaining models based on their subject-wise validation accuracy. In experiments on the Massive Multitask Language Understanding (MMLU) benchmark, DFPE outperforms the best single model by 3% overall accuracy and 5% in discipline-level accuracy. This method increases the robustness and generalization of LLMs and underscores how model selection, diversity preservation, and performance-driven weighting can effectively address challenging, multi-faceted language understanding tasks.
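
The filtering and weighting steps can be sketched directly (fingerprint clustering for diversity is elided; the quantile and weighting details below are our reading of the abstract):

```python
import numpy as np

def dfpe_weights(val_acc, quantile=0.25):
    """DFPE-style selection/weighting sketch: per subject, drop models below
    the accuracy quantile, then weight survivors by validation accuracy.

    val_acc: (n_models, n_subjects) subject-wise validation accuracy."""
    cut = np.quantile(val_acc, quantile, axis=0, keepdims=True)
    keep = val_acc >= cut
    w = np.where(keep, val_acc, 0.0)
    return w / w.sum(axis=0, keepdims=True)   # per-subject ensemble weights

acc = np.array([[0.70, 0.40],
                [0.65, 0.80],
                [0.30, 0.75]])
print(dfpe_weights(acc))   # model 3 is dropped for subject 1, model 1 for subject 2
```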

[NLP-28] Large Language Models for Single-Step and Multi-Step Flight Trajectory Prediction

【Quick Read】: This paper addresses flight trajectory prediction, a critical time-series task in aviation, by exploring large language models (LLMs) beyond conventional deep learning methods. The key is to reframe trajectory prediction as a language modeling problem: features representing aircraft position and status are extracted from ADS-B flight data to build a prompt-based dataset in which trajectory waypoints are converted into language tokens, and LLMs are fine-tuned on this dataset to learn complex spatiotemporal patterns for accurate prediction. Experiments show notable gains over traditional methods in both single-step and multi-step prediction, with LLaMA-3.1 achieving the highest overall accuracy; however, the high inference latency of LLMs limits real-time applications and calls for further research in this promising direction.

Link: https://arxiv.org/abs/2501.17459
Authors: Kaiwei Luo, Jiliu Zhou
Affiliation: Sichuan University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 7 figures

Abstract:Flight trajectory prediction is a critical time series task in aviation. While deep learning methods have shown significant promise, the application of large language models (LLMs) to this domain remains underexplored. This study pioneers the use of LLMs for flight trajectory prediction by reframing it as a language modeling problem. Specifically, We extract features representing the aircraft’s position and status from ADS-B flight data to construct a prompt-based dataset, where trajectory waypoints are converted into language tokens. The dataset is then employed to fine-tune LLMs, enabling them to learn complex spatiotemporal patterns for accurate predictions. Comprehensive experiments demonstrate that LLMs achieve notable performance improvements in both single-step and multi-step predictions compared to traditional methods, with LLaMA-3.1 model achieving the highest overall accuracy. However, the high inference latency of LLMs poses a challenge for real-time applications, underscoring the need for further research in this promising direction.

[NLP-29] A review on the novelty measurements of academic papers

【Quick Read】: This paper offers a systematic analysis of novelty measurements for scientific papers, motivated by the importance of novelty evaluation for the promotion and management of innovation. The key lies in comparing scientific novelty with four related concepts (originality, scientific innovation, creativity, and scientific breakthrough), classifying and reviewing novelty measures by data type, surveying the approaches used to validate them, and examining the associated tools and datasets, thereby providing a data-driven way to assess contributions, progress, and emerging directions in science.

Link: https://arxiv.org/abs/2501.17456
Authors: Yi Zhao, Chengzhi Zhang
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments:

Abstract:Novelty evaluation is vital for the promotion and management of innovation. With the advancement of information techniques and the open data movement, some progress has been made in novelty measurements. Tracking and reviewing novelty measures provides a data-driven way to assess contributions, progress, and emerging directions in the science field. As academic papers serve as the primary medium for the dissemination, validation, and discussion of scientific knowledge, this review aims to offer a systematic analysis of novelty measurements for scientific papers. We began by comparing the differences between scientific novelty and four similar concepts, including originality, scientific innovation, creativity, and scientific breakthrough. Next, we reviewed the types of scientific novelty. Then, we classified existing novelty measures according to data types and reviewed the measures for each type. Subsequently, we surveyed the approaches employed in validating novelty measures and examined the current tools and datasets associated with these measures. Finally, we proposed several open issues for future studies.

[NLP-30] Cross-Language Approach for Quranic QA

【Quick Read】: This paper addresses the serious limitations question answering systems face in languages with limited resources and scarce data, focusing on the challenges of building a Quranic QA system: the linguistic disparity between questions written in Modern Standard Arabic and answers found in Quranic verses written in Classical Arabic, and the small size of existing datasets, which restricts model performance. The key to the solution is a cross-language approach that (1) augments the dataset via machine translation of Arabic questions into English, paraphrases questions for linguistic diversity, and retrieves answers from an English translation of the Quran to meet multilingual training requirements; and (2) fine-tunes pre-trained models such as BERT-Medium, RoBERTa-Base, DeBERTa-v3-Base, ELECTRA-Large, Flan-T5, Bloom, and Falcon for the specific requirements of Quranic QA. Experiments show that this cross-language approach significantly improves performance, with RoBERTa-Base achieving the highest MAP@10 (0.34) and MRR (0.52), and DeBERTa-v3-Base excelling at Recall@10 (0.50) and Precision@10 (0.24), underscoring the effectiveness of cross-language strategies in overcoming linguistic barriers and advancing Quranic QA systems.

Link: https://arxiv.org/abs/2501.17449
Authors: Islam Oshallah, Mohamed Basem, Ali Hamdi, Ammar Mohammed
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Question answering systems face critical limitations in languages with limited resources and scarce data, making the development of robust models especially challenging. The Quranic QA system holds significant importance as it facilitates a deeper understanding of the Quran, a Holy text for over a billion people worldwide. However, these systems face unique challenges, including the linguistic disparity between questions written in Modern Standard Arabic and answers found in Quranic verses written in Classical Arabic, and the small size of existing datasets, which further restricts model performance. To address these challenges, we adopt a cross-language approach by (1) Dataset Augmentation: expanding and enriching the dataset through machine translation to convert Arabic questions into English, paraphrasing questions to create linguistic diversity, and retrieving answers from an English translation of the Quran to align with multilingual training requirements; and (2) Language Model Fine-Tuning: utilizing pre-trained models such as BERT-Medium, RoBERTa-Base, DeBERTa-v3-Base, ELECTRA-Large, Flan-T5, Bloom, and Falcon to address the specific requirements of Quranic QA. Experimental results demonstrate that this cross-language approach significantly improves model performance, with RoBERTa-Base achieving the highest MAP@10 (0.34) and MRR (0.52), while DeBERTa-v3-Base excels in Recall@10 (0.50) and Precision@10 (0.24). These findings underscore the effectiveness of cross-language strategies in overcoming linguistic barriers and advancing Quranic QA systems
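
The reported metrics are standard ranking measures; for reference, minimal implementations of MRR and MAP@k:

```python
def mrr(ranked_lists, gold_sets):
    """Mean Reciprocal Rank; ranked_lists[i] is a ranked list of candidate ids."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def map_at_k(ranked_lists, gold_sets, k=10):
    """Mean Average Precision at k."""
    scores = []
    for ranked, gold in zip(ranked_lists, gold_sets):
        hits, precisions = 0, []
        for rank, doc in enumerate(ranked[:k], start=1):
            if doc in gold:
                hits += 1
                precisions.append(hits / rank)
        scores.append(sum(precisions) / min(len(gold), k) if gold else 0.0)
    return sum(scores) / len(scores)

print(mrr([["v3", "v1"]], [{"v1"}]))           # 0.5: first relevant verse at rank 2
print(map_at_k([["v3", "v1"]], [{"v1"}], 10))  # 0.5
```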

[NLP-31] Towards Making Flowchart Images Machine Interpretable (ICDAR 2023)

【Quick Read】: This paper works towards making flowchart images machine-interpretable by converting them into executable Python code. The key is FloCo-T5, a novel transformer-based framework that learns the semantics, structure, and patterns of programming languages to generate syntactically correct code, further boosted by task-specific pre-training on a large number of logic-preserving augmented code samples.

Link: https://arxiv.org/abs/2501.17441
Authors: Shreya Shukla, Prajwal Gatti, Yogesh Kumar, Vikash Yadav, Anand Mishra
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Software Engineering (cs.SE)
Comments: Published at: ICDAR 2023, Project Page: this https URL

Abstract:Computer programming textbooks and software documentations often contain flowcharts to illustrate the flow of an algorithm or procedure. Modern OCR engines often tag these flowcharts as graphics and ignore them in further processing. In this paper, we work towards making flowchart images machine-interpretable by converting them to executable Python codes. To this end, inspired by the recent success in natural language to code generation literature, we present a novel transformer-based framework, namely FloCo-T5. Our model is well-suited for this task, as it can effectively learn semantics, structure, and patterns of programming languages, which it leverages to generate syntactically correct code. We also used a task-specific pre-training objective to pre-train FloCo-T5 using a large number of logic-preserving augmented code samples. Further, to perform a rigorous study of this problem, we introduce the FloCo dataset that contains 11,884 flowchart images and their corresponding Python codes. Our experiments show promising results, and FloCo-T5 clearly outperforms related competitive baselines on code generation metrics. We make our dataset and implementation publicly available.

[NLP-32] Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

【Quick Read】: This paper addresses the loss of safety alignment in Large Language Models (LLMs) caused by harmful samples during fine-tuning. The key contribution is an attack method named "Virus" that easily bypasses the moderation guardrail used to filter fine-tuning data: the harmful data optimized by Virus is virtually undetectable (up to a 100% leakage ratio) while simultaneously achieving superior attack performance. The paper's central message is that relying on guardrail moderation alone for data filtration is unreliable and does not resolve the inherent safety issues of pre-trained LLMs.

Link: https://arxiv.org/abs/2501.17433
Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks – models lose their safety alignment ability after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we in this paper show that purely relying on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail with up to 100% leakage ratio, and can simultaneously achieve superior attack performance. Finally, the key message we want to convey through this paper is that: it is reckless to consider guardrail moderation as a clutch at straws towards harmful fine-tuning attack, as it cannot solve the inherent safety issue of the pre-trained LLMs. Our code is available at this https URL

[NLP-33] Actions Speak Louder than Words: Agent Decisions Reveal Implicit Biases in Language Models

【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)在模拟人类行为时是否仍然表现出隐性偏见,即使这些模型在显性偏见方面已有所改进。为验证这一假设,论文提出了一种技术,通过评估具有由LLM生成且包含社会人口统计信息的人格特征的代理之间的决策差异,系统地揭示广泛社会人口统计类别中的偏见。该方法的关键在于通过设计特定的社会人口统计信息和决策场景来检测这些隐性偏见,从而揭示LLMs在不同社会人口群体间的显著决策差异。
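
(以下为该评估思路的示意代码:为不同社会人口属性构造人格化提示,统计各组决策率之差作为隐性偏见信号;其中 query_llm、决策场景与属性取值均为演示用假设,非论文官方实现。)

```python
# 示意:基于人格化提示的决策差异度量(query_llm 为假设接口,需替换为真实 LLM 调用)
from collections import defaultdict

def query_llm(prompt: str) -> str:
    """假设的 LLM 接口,应返回 'approve' 或 'deny'。"""
    raise NotImplementedError

personas = [{"gender": g, "age": a} for g in ("male", "female") for a in ("young", "senior")]
scenario = "As a loan officer, decide whether to approve a $10,000 loan. Answer approve or deny."

rates = defaultdict(list)
for p in personas:
    prompt = f"You are a {p['age']} {p['gender']} adult. {scenario}"
    for _ in range(50):  # 重复采样以估计各组的批准概率
        rates[(p["gender"], p["age"])].append(query_llm(prompt) == "approve")

for group, outcomes in rates.items():
    print(group, sum(outcomes) / len(outcomes))  # 组间批准率之差即隐性偏见的信号
```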

链接: https://arxiv.org/abs/2501.17420
作者: Yuxuan Li,Hirokazu Shirado,Sauvik Das
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:While advances in fairness and alignment have helped mitigate overt biases exhibited by large language models (LLMs) when explicitly prompted, we hypothesize that these models may still exhibit implicit biases when simulating human behavior. To test this hypothesis, we propose a technique to systematically uncover such biases across a broad range of sociodemographic categories by assessing decision-making disparities among agents with LLM-generated, sociodemographically-informed personas. Using our technique, we tested six LLMs across three sociodemographic groups and four decision-making scenarios. Our results show that state-of-the-art LLMs exhibit significant sociodemographic disparities in nearly all simulations, with more advanced models exhibiting greater implicit biases despite reducing explicit biases. Furthermore, when comparing our findings to real-world disparities reported in empirical studies, we find that the biases we uncovered are directionally aligned but markedly amplified. This directional alignment highlights the utility of our technique in uncovering systematic biases in LLMs rather than random variations; moreover, the presence and amplification of implicit biases emphasizes the need for novel strategies to address these biases.
zh

[NLP-34] General Scene Adaptation for Vision-and-Language Navigation ICLR2025

【速读】: 该论文旨在解决视觉与语言导航(Vision-and-Language Navigation, VLN)任务在模拟多环境一次性执行指令方面的局限性,提出了一种新的任务设置GSA-VLN,以更好地反映现实世界中导航机器人在持续环境中工作的条件。论文的关键解决方案是引入了一个新的数据集GSA-R2R,它显著扩展了环境和指令的多样性和数量,并设计了一个三阶段的指令编排管道来优化指令风格。此外,论文提出了一种新颖的方法GR-DUET,通过结合基于记忆的导航图和特定环境的训练策略,实现了在所有GSA-R2R数据集划分上的最新成果。

链接: https://arxiv.org/abs/2501.17403
作者: Haodong Hong,Yanyuan Qiao,Sen Wang,Jiajun Liu,Qi Wu
机构: The University of Queensland(昆士兰大学); CSIRO Data61; The University of Adelaide(阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ICLR 2025

点击查看摘要

Abstract:Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner. However, real-world navigation robots often operate in persistent environments with relatively consistent physical layouts, visual observations, and language styles from instructors. Such a gap in the task setting presents an opportunity to improve VLN agents by incorporating continuous adaptation to specific environments. To better reflect these real-world conditions, we introduce GSA-VLN, a novel task requiring agents to execute navigation instructions within a specific scene and simultaneously adapt to it for improved performance over time. To evaluate the proposed task, one has to address two challenges in existing VLN datasets: the lack of OOD data, and the limited number and style diversity of instructions for each scene. Therefore, we propose a new dataset, GSA-R2R, which significantly expands the diversity and quantity of environments and instructions for the R2R dataset to evaluate agent adaptability in both ID and OOD contexts. Furthermore, we design a three-stage instruction orchestration pipeline that leverages LLMs to refine speaker-generated instructions and apply role-playing techniques to rephrase instructions into different speaking styles. This is motivated by the observation that each individual user often has consistent signatures or preferences in their instructions. We conducted extensive experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various methods. Based on our findings, we propose a novel method, GR-DUET, which incorporates memory-based navigation graphs with an environment-specific training strategy, achieving state-of-the-art results on all GSA-R2R splits.
zh

[NLP-35] MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多轮对话中与人类用户交互时面临的挑战。论文提出了一套名为MultiChallenge的新基准测试,用于评估LLMs在这一领域的表现。解决方案的关键在于识别并构建了四个具有普遍性和现实性的挑战类别,这些挑战不仅当前的人-LLM交互中常见,也难以被现有的前沿LLMs处理。这些挑战要求模型同时具备准确的指令跟随能力、上下文分配能力和情境推理能力。此外,论文还开发了一种基于实例级别评分标准的LLM裁判系统,以促进自动评价方法的发展,该方法与经验丰富的评分手动评分结果有良好的一致性。

链接: https://arxiv.org/abs/2501.17399
作者: Ved Sirdeshmukh,Kaustubh Deshpande,Johannes Mols,Lifeng Jin,Ed-Yeremai Cardona,Dean Lee,Jeremy Kritz,Willow Primack,Summer Yue,Chen Xing
机构: Scale AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All 4 challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop LLM as judge with instance-level rubrics to facilitate an automatic evaluation method with fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving just a 41.4% average accuracy.
zh

[NLP-36] Leveraging In-Context Learning and Retrieval-Augmented Generation for Automatic Question Generation in Educational Domains

【速读】: 该论文旨在解决自动化教育领域问题生成过程中存在的上下文不相关的问题。解决方案的关键在于引入In-Context Learning (ICL) 和 Retrieval-Augmented Generation (RAG) 方法,并提出了一种结合两者优势的新型混合模型(Hybrid Model)。通过使用GPT-4实现ICL以及利用带有检索模块的BART进行RAG,该研究有效地提升了生成问题的上下文准确性和相关性。
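
(以下是混合模型思路的极简示意:先用 RAG 检索教学语料作为上下文,再以 ICL 少样本示例拼接提示;编码器选型、语料与少样本示例均为演示用假设。)

```python
# 示意:混合模型 = RAG 检索教学语料 + ICL 少样本提示,再交由生成模型出题(假设性实现)
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 演示用检索编码器(假设选型)
corpus = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Newton's second law states that force equals mass times acceleration.",
]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def build_prompt(topic: str, k: int = 1) -> str:
    query_emb = encoder.encode(topic, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    context = "\n".join(corpus[h["corpus_id"]] for h in hits)  # RAG:检索到的上下文
    few_shot = ("Context: Water boils at 100 C at sea level.\n"
                "Question: At what temperature does water boil at sea level?\n\n")  # ICL 示例
    return f"{few_shot}Context: {context}\nQuestion:"

print(build_prompt("forces and acceleration"))  # 该提示可交给 GPT-4 等模型补全出题
```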

链接: https://arxiv.org/abs/2501.17397
作者: Subhankar Maity,Aniket Deroy,Sudeshna Sarkar
机构: IIT Kharagpur(印度理工学院克勒格布尔)
类目: Computation and Language (cs.CL)
备注: Accepted at the 16th Meeting of the Forum for Information Retrieval Evaluation as a Regular Paper

点击查看摘要

Abstract:Question generation in education is a time-consuming and cognitively demanding task, as it requires creating questions that are both contextually relevant and pedagogically sound. Current automated question generation methods often generate questions that are out of context. In this work, we explore advanced techniques for automated question generation in educational contexts, focusing on In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and a novel Hybrid Model that merges both methods. We implement GPT-4 for ICL using few-shot examples and BART with a retrieval module for RAG. The Hybrid Model combines RAG and ICL to address these issues and improve question quality. Evaluation is conducted using automated metrics, followed by human evaluation metrics. Our results show that both the ICL approach and the Hybrid Model consistently outperform other methods, including baseline models, by generating more contextually accurate and relevant questions.
zh

[NLP-37] Learning Free Token Reduction for Multi-Modal LLM

【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在实际部署中因高计算成本和较长推理时间所面临的挑战。论文的关键解决方案在于提出了一种同时作用于空间和时间维度的令牌压缩范式(token compression paradigm)。该方法包括一个无需学习、即插即用的压缩流程,可以无缝集成到大多数多模态大型语言模型(Multimodal Large Language Model, MLLM)框架中。通过这种方法,论文展示了如何在不牺牲性能的前提下增强模型的推理能力,并显著提高效率。
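
(下面给出时间维度 token 压缩的一个极简示意:相邻帧高度相似时就地合并,无需任何训练;阈值与合并规则均为演示用假设,非论文官方实现。)

```python
# 示意:免训练的时间维度 token 压缩——相邻帧高度相似时就地合并
import torch
import torch.nn.functional as F

def merge_temporal_tokens(tokens: torch.Tensor, thresh: float = 0.9) -> torch.Tensor:
    """tokens: (T, N, D) 的视频 token 序列;阈值 thresh 为演示用假设值。"""
    kept = [tokens[0]]
    for t in range(1, tokens.size(0)):
        sim = F.cosine_similarity(tokens[t], kept[-1], dim=-1)  # 每个空间位置的相似度 (N,)
        merged = torch.where(sim.unsqueeze(-1) > thresh,
                             (tokens[t] + kept[-1]) / 2,  # 高相似:平均合并
                             tokens[t])
        if (sim > thresh).float().mean() > 0.5:
            kept[-1] = merged   # 整帧大多冗余:与上一帧就地合并,不增加帧数
        else:
            kept.append(merged) # 信息量足够:保留为新帧
    return torch.stack(kept)

out = merge_temporal_tokens(torch.randn(16, 196, 768))
print(out.shape)  # 帧数 <= 16,输入给 LLM 的视觉 token 随之减少
```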

链接: https://arxiv.org/abs/2501.17391
作者: Zihui Zhao,Yingxin Li,Yang Li
机构: Tsinghua University (清华大学); Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); Shenzhen Key Laboratory of Ubiquitous Data Enabling (深圳泛在数据使能重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks; however, their practical deployment is often constrained by high computational costs and prolonged inference times. Since the vision modality typically carries more information than the text modality, compressing visual prompts offers a promising solution to alleviate these challenges. Existing approaches predominantly focus on refining model architectures or directly reducing the number of visual tokens. However, these methods often compromise inference performance due to a lack of consideration for the unique spatial and temporal characteristics of visual data. In this work, we propose a token compression paradigm that operates on both spatial and temporal dimensions. Our approach includes a learning-free, plug-and-play compression pipeline that can be seamlessly integrated into most Multimodal Large Language Model (MLLM) frameworks. By leveraging this method, we enhance the model inference capability while simultaneously reducing its computational cost. Experimental results on the Video-QA task demonstrate the effectiveness of the proposed approach, showcasing significant improvements in efficiency without sacrificing performance.
zh

[NLP-38] Context-Aware Semantic Recomposition Mechanism for Large Language Models

【速读】: 该论文旨在解决大型文本生成任务中语义连贯性、上下文适应性和错误传播的局限性。关键解决方案是引入了情境感知语义重组机制(Context-Aware Semantic Recomposition Mechanism, CASRM),通过动态生成的上下文向量和注意力调制层增强token级表示与广泛上下文依赖之间的对齐,从而显著提升多领域(包括技术、对话和叙述文本)的语义连贯性,并有效缓解错误传播问题。
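
("动态上下文向量 + 注意力调制层"可以用一个很小的 PyTorch 模块来示意:下例把序列均值当作上下文向量并据此门控注意力的 query,属于对 CASRM 思路的假设性简化,非官方实现。)

```python
# 示意:上下文向量 + 注意力调制层的极简 PyTorch 版本(CASRM 思路示意,非官方实现)
import torch
import torch.nn as nn

class ContextModulatedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ctx_proj = nn.Linear(d_model, d_model)  # 由上下文向量生成逐维门控

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = x.mean(dim=1, keepdim=True)        # 动态上下文向量:此处简化为序列均值
        gate = torch.sigmoid(self.ctx_proj(ctx)) # (B, 1, D) 的调制门控
        out, _ = self.attn(x * gate, x, x)       # 用上下文门控调制 query
        return out + x                           # 残差,保持 token 表示与上下文的对齐

layer = ContextModulatedAttention(d_model=512, n_heads=8)
print(layer(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```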

链接: https://arxiv.org/abs/2501.17386
作者: Richard Katrix,Quentin Carroway,Rowan Hawkesbury,Matthias Heathfield
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Context-aware processing mechanisms have increasingly become a critical area of exploration for improving the semantic and contextual capabilities of language generation models. The Context-Aware Semantic Recomposition Mechanism (CASRM) was introduced as a novel framework designed to address limitations in coherence, contextual adaptability, and error propagation in large-scale text generation tasks. Through the integration of dynamically generated context vectors and attention modulation layers, CASRM enhances the alignment between token-level representations and broader contextual dependencies. Experimental evaluations demonstrated significant improvements in semantic coherence across multiple domains, including technical, conversational, and narrative text. The ability to adapt to unseen domains and ambiguous inputs was evaluated using a diverse set of test scenarios, highlighting the robustness of the proposed mechanism. A detailed computational analysis revealed that while CASRM introduces additional processing overhead, the gains in linguistic precision and contextual relevance outweigh the marginal increase in complexity. The framework also successfully mitigates error propagation in sequential tasks, improving performance in dialogue continuation and multi-step text synthesis. Additional investigations into token-level attention distribution emphasized the dynamic focus shifts enabled through context-aware enhancements. The findings suggest that CASRM offers a scalable and flexible solution for integrating contextual intelligence into existing language model architectures.
zh

[NLP-39] Better Slow than Sorry: Introducing Positive Friction for Reliable Dialogue Systems

【速读】: 该论文旨在解决对话系统中过度追求流畅性导致用户对AI输出的盲目依赖问题。这种盲目依赖可能掩盖隐含假设,并引发意外后果。论文的关键解决方案是引入“积极摩擦”(positive friction),通过在战略时刻有意放慢对话速度,提出问题、揭示假设或暂停,从而促进用户的反思和批判性思维,进而改进目标对齐、用户心理状态建模以及任务成功率。

链接: https://arxiv.org/abs/2501.17348
作者: Mert İnan,Anthony Sicilia,Suvodip Dey,Vardhan Dongre,Tejas Srinivasan,Jesse Thomason,Gökhan Tür,Dilek Hakkani-Tür,Malihe Alikhani
机构: University of Southern California (南加州大学); University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校); Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:While theories of discourse and cognitive science have long recognized the value of unhurried pacing, recent dialogue research tends to minimize friction in conversational systems. Yet, frictionless dialogue risks fostering uncritical reliance on AI outputs, which can obscure implicit assumptions and lead to unintended consequences. To meet this challenge, we propose integrating positive friction into conversational AI, which promotes user reflection on goals, critical thinking on system response, and subsequent re-conditioning of AI systems. We hypothesize systems can improve goal alignment, modeling of user mental states, and task success by deliberately slowing down conversations in strategic moments to ask questions, reveal assumptions, or pause. We present an ontology of positive friction and collect expert human annotations on multi-domain and embodied goal-oriented corpora. Experiments on these corpora, along with simulated interactions using state-of-the-art systems, suggest incorporating friction not only fosters accountable decision-making, but also enhances machine understanding of user beliefs and goals, and increases task success rates.
zh

[NLP-40] Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection

【速读】: 该论文旨在解决生成式语言模型在处理需要直接输出多令牌任务级结果的任务(如偏好优化)时所面临的挑战。现有方法受限于耗时的解码过程以及离散令牌选择引起的梯度中断。论文的关键解决方案在于评估一系列无需解码的候选选择方法,这些方法能够从初始词汇表 logits 中直接获取候选概率。研究涵盖了包括五个小候选池的多选题问答任务及四个包含大量候选选项(多达10k+)的临床决策任务,并评估了与多种基础语言模型结合的效果。通过全面的实验,论文为未来模型设计提供了指导性见解。
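
(免解码候选选择的核心可以用几行代码说明:对每个候选做一次教师强制前向,直接从 logits 读出逐 token 对数概率并求和,完全不经过自回归解码;下例用 GPT-2 演示,模型与候选均为假设。)

```python
# 示意:免解码的候选打分——一次教师强制前向即可得到候选的逐 token 对数概率之和
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")  # 演示用 GPT-2,实际可换任意因果 LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def score_candidate(prompt: str, candidate: str) -> float:
    ids = tok(prompt + " " + candidate, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.size(1)
    with torch.no_grad():
        logp = model(ids).logits.log_softmax(-1)
    tgt = ids[0, n_prompt:]                      # 候选部分的 token
    lp = logp[0, n_prompt - 1:-1].gather(-1, tgt.unsqueeze(-1))  # 第 i 位预测第 i+1 个 token
    return lp.sum().item()                       # 也可改用均值以消除长度偏置

cands = ["pneumonia", "acute myocardial infarction"]
print(max(cands, key=lambda c: score_candidate("The most likely diagnosis is", c)))
```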

链接: https://arxiv.org/abs/2501.17338
作者: Mingyu Derek Ma,Yanna Ding,Zijie Huang,Jianxi Gao,Yizhou Sun,Wei Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative Language Models rely on autoregressive decoding to produce the output sequence token by token. Many tasks, such as preference optimization, require the model to produce task-level output consisting of multiple tokens directly by selecting candidates from a pool as predictions. Determining a task-level prediction from candidates using the ordinary token-level decoding mechanism is constrained by time-consuming decoding and gradients interrupted by discrete token selection. Existing works have been using decoding-free candidate selection methods to obtain candidate probability from initial output logits over vocabulary. Though these estimation methods are widely used, they are not systematically evaluated, especially on end tasks. We introduce an evaluation of a comprehensive collection of decoding-free candidate selection approaches on a comprehensive set of tasks, including five multiple-choice QA tasks with a small candidate pool and four clinical decision tasks with a massive amount of candidates, some with 10k+ options. We evaluate the estimation methods paired with a wide spectrum of foundation LMs covering different architectures, sizes and training paradigms. The results and insights from our analysis inform the future model design.
zh

[NLP-41] Attribution analysis of legal language as used by LLM

【速读】: 该论文旨在探究专门设计用于法律任务的大型语言模型(LLM)相较于通用BERT模型在处理法律文本分类任务时的优势与差异。研究通过对比实验,使用集成梯度归因技术分析不同模型在性能上的变化,并从tokenization的角度进行解释。关键在于发现模型tokenizer的差异性行为导致了大部分性能差异,并通过频率分析识别出明确的法律主题标志词。
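
(积分梯度归因本身只有几行数学:沿"基线 → 输入"的路径对梯度做数值积分。下面是一个自包含的极简实现,其中零向量基线与线性分类头均为演示用假设,实际应接在法律领域 LLM 的嵌入层上。)

```python
# 示意:对词嵌入做积分梯度(Integrated Gradients)归因的极简实现
import torch

def integrated_gradients(embed: torch.Tensor, forward_fn, steps: int = 32) -> torch.Tensor:
    """embed: (L, D) 输入嵌入;forward_fn 将嵌入映射为目标类别 logit(标量)。"""
    baseline = torch.zeros_like(embed)          # 以零向量作为基线(常见假设)
    total = torch.zeros_like(embed)
    for k in range(1, steps + 1):
        x = (baseline + (k / steps) * (embed - baseline)).requires_grad_(True)
        forward_fn(x).backward()                # 沿基线到输入的路径累积梯度
        total += x.grad
    return (embed - baseline) * total / steps   # 每个 token 各维度的归因值

head = torch.nn.Linear(768, 1)                  # 演示用分类头,实际应替换为法律 LLM
emb = torch.randn(12, 768)
attr = integrated_gradients(emb, lambda x: head(x).sum())
print(attr.norm(dim=-1))                        # 每个 token 的归因强度,可据此定位关键 token
```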

链接: https://arxiv.org/abs/2501.17330
作者: Richard K. Belew
机构: UCSD (加州大学圣地亚哥分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 17 figures

点击查看摘要

Abstract:Three publicly-available LLMs specifically designed for legal tasks have been implemented, and it has been shown that classification accuracy can benefit from training over legal corpora, but why and how? Here we use two publicly-available legal datasets: a simpler binary classification task of "overruling" texts, and a more elaborate multiple choice task identifying "holding" judicial decisions. We report on experiments contrasting the legal LLMs and a generic BERT model for comparison, against both datasets. We use integrated gradient attribution techniques to impute "causes" of variation in the models' performance, and characterize them in terms of the tokenizations each uses. We find that while all models can correctly classify some test examples from the CaseHOLD task, other examples can only be identified by one model, and attribution can be used to highlight the reasons for this. We find that differential behavior of the models' tokenizers accounts for most of the difference and analyze these differences in terms of the legal language they process. Frequency analysis of tokens generated by dataset texts, combined with use of known "stop word" lists, allows identification of tokens that are clear signifiers of legal topics.
zh

[NLP-42] Memorize and Rank: Elevating Large Language Models for Clinical Diagnosis Prediction AAAI2025

【速读】: 该论文旨在解决临床诊断预测模型在处理患者医疗历史数据时面临的挑战,特别是由于患者数据稀缺及疾病候选空间庞大导致的模型开发难题。论文的关键解决方案是引入MERA模型,通过层次对比学习(Hierarchical Contrastive Learning)优化疾病候选排序列表,并利用微调(fine-tuning)过程将自然语言临床知识与医学编码相结合,从而有效缓解决策空间过大的问题,显著提升生成式大型语言模型(Generative LLMs)在诊断预测中的性能。

链接: https://arxiv.org/abs/2501.17326
作者: Mingyu Derek Ma,Xiaoxuan Wang,Yijia Xiao,Anthony Cuturrufo,Vijay S Nori,Eran Halperin,Wei Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To appear at AAAI 2025

点击查看摘要

Abstract:Clinical diagnosis prediction models, when provided with a patient’s medical history, aim to detect potential diseases early, facilitating timely intervention and improving prognostic outcomes. However, the inherent scarcity of patient data and large disease candidate space often pose challenges in developing satisfactory models for this intricate task. The exploration of leveraging Large Language Models (LLMs) for encapsulating clinical decision processes has been limited. We introduce MERA, a clinical diagnosis prediction model that bridges pertaining natural language knowledge with medical practice. We apply hierarchical contrastive learning on a disease candidate ranking list to alleviate the large decision space issue. With concept memorization through fine-tuning, we bridge the natural language clinical knowledge with medical codes. Experimental results on MIMIC-III and IV datasets show that MERA achieves the state-of-the-art diagnosis prediction performance and dramatically elevates the diagnosis prediction capabilities of generative LMs.
zh

[NLP-43] “Ownership Not Just Happy Talk”: Co-Designing a Participatory Large Language Model for Journalism

【速读】: 该论文旨在探讨如何通过参与式设计方法解决大型语言模型(Large Language Models, LLMs)在新闻业中的应用机会与挑战。论文的关键在于探索记者主导的LLM应如何设计及其实现方式,并提出了一种记者控制的LLM的组织结构和功能设计方案。这一方案旨在应对宏观、中观和微观层面的设计张力,从而更好地适应特定使用情境,而非依赖于现有的商业基础模型。

链接: https://arxiv.org/abs/2501.17299
作者: Emily Tseng,Meg Young,Marianne Aubin Le Quéré,Aimee Rinehart,Harini Suresh
机构: Microsoft Research(微软研究); Data & Society(数据与社会); Cornell University(康奈尔大学); The Associated Press(美联社); Brown University(布朗大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Under review for an ACM conference

点击查看摘要

Abstract:Journalism has emerged as an essential domain for understanding the uses, limitations, and impacts of large language models (LLMs) in the workplace. News organizations face divergent financial incentives: LLMs already permeate newswork processes within financially constrained organizations, even as ongoing legal challenges assert that AI companies violate their copyright. At stake are key questions about what LLMs are created to do, and by whom: How might a journalist-led LLM work, and what can participatory design illuminate about the present-day challenges of adapting "one-size-fits-all" foundation models to a given context of use? In this paper, we undertake a co-design exploration to understand how a participatory approach to LLMs might address opportunities and challenges around AI in journalism. Our 20 interviews with reporters, data journalists, editors, labor organizers, product leads, and executives highlight macro, meso, and micro tensions that designing for this opportunity space must address. From these desiderata, we describe the result of our co-design work: organizational structures and functionality for a journalist-controlled LLM. In closing, we discuss the limitations of commercial foundation models for workplace use, and the methodological implications of applying participatory methods to LLM co-design.
zh

[NLP-44] Mitigating Hallucinated Translations in Large Language Models with Hallucination-focused Preference Optimization NAACL2025

【速读】: 该论文旨在解决基于大型语言模型(Large Language Models, LLM)的机器翻译系统在生成过程中容易产生幻觉(hallucinations)的问题,这会严重损害用户的信任与安全。论文的关键解决方案在于提出一种方法,通过在训练阶段内在性地学习减少幻觉的生成,具体实现方式是引入一个数据创建框架来生成专注于幻觉的偏好数据集(preference datasets)。通过在这些数据集上微调LLMs,论文展示了幻觉率平均降低了96%,同时保持了整体翻译质量。
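
("聚焦幻觉的偏好数据集"可以用一个最小的数据构造示意来说明:忠实译文作为 chosen,注入幻觉的译文作为 rejected;下例的语句与字段名均为演示用假设,非论文官方数据框架。)

```python
# 示意:构造"聚焦幻觉"的偏好数据对——忠实译文为 chosen、注入幻觉的译文为 rejected
import json

def make_preference_pair(src: str, faithful_mt: str, hallucinated_mt: str) -> dict:
    return {
        "prompt": f"Translate to German: {src}",
        "chosen": faithful_mt,        # 忠实译文
        "rejected": hallucinated_mt,  # 含幻觉(凭空添加内容)的译文
    }

pairs = [make_preference_pair(
    "The cat sat on the mat.",
    "Die Katze sass auf der Matte.",
    "Die Katze sass auf der Matte und gewann die Weltmeisterschaft.")]  # 幻觉:凭空添加信息

with open("hallucination_prefs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")
# 随后即可用 DPO 等偏好优化方法在该数据上微调翻译 LLM
```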

链接: https://arxiv.org/abs/2501.17295
作者: Zilu Tang,Rajen Chatterjee,Sarthak Garg
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: NAACL 2025 Main Conference Long paper (9 pages)

点击查看摘要

Abstract:Machine Translation (MT) is undergoing a paradigm shift, with systems based on fine-tuned large language models (LLM) becoming increasingly competitive with traditional encoder-decoder models trained specifically for translation tasks. However, LLM-based systems are at a higher risk of generating hallucinations, which can severely undermine users' trust and safety. Most prior research on hallucination mitigation focuses on traditional MT models, with solutions that involve post-hoc mitigation: detecting hallucinated translations and re-translating them. While effective, this approach introduces additional complexity in deploying extra tools in production and also increases latency. To address these limitations, we propose a method that intrinsically learns to mitigate hallucinations during the model training phase. Specifically, we introduce a data creation framework to generate hallucination-focused preference datasets. Fine-tuning LLMs on these preference datasets reduces the hallucination rate by an average of 96% across five language pairs, while preserving overall translation quality. In a zero-shot setting, our approach reduces hallucinations by 89% on average across three unseen target languages.
zh

[NLP-45] From Natural Language to Extensive-Form Game Representations AAMAS2025

【速读】: 该论文旨在解决将自然语言描述的游戏转换为博弈论中的扩展形式表示的问题。为应对不同战略复杂度(如完全信息与不完全信息)游戏的挑战,论文提出了一种两阶段框架,并引入了专门的模块以增强上下文学习能力,使其能够有效地分解和解决该问题。关键解决方案在于第一阶段通过开发识别信息集及其相应部分树结构的模块来处理不完全信息问题,第二阶段则利用上下文学习结合自调试模块生成完整的扩展形式博弈树。

链接: https://arxiv.org/abs/2501.17282
作者: Shilong Deng,Yongzhao Wang,Rahul Savani
机构: University of Liverpool(利物浦大学); The Alan Turing Institute(图灵研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注: This work has been accepted as a full paper for AAMAS 2025. This is a full version of the AAMAS 2025 proceedings

点击查看摘要

Abstract:We introduce a framework for translating game descriptions in natural language into extensive-form representations in game theory, leveraging Large Language Models (LLMs) and in-context learning. Given the varying levels of strategic complexity in games, such as perfect versus imperfect information, directly applying in-context learning would be insufficient. To address this, we introduce a two-stage framework with specialized modules to enhance in-context learning, enabling it to divide and conquer the problem effectively. In the first stage, we tackle the challenge of imperfect information by developing a module that identifies information sets along with the corresponding partial tree structure. With this information, the second stage leverages in-context learning alongside a self-debugging module to produce a complete extensive-form game tree represented using pygambit, the Python API of a recognized game-theoretic analysis tool called Gambit. Using this Python representation enables the automation of tasks such as computing Nash equilibria directly from natural language descriptions. We evaluate the performance of the full framework, as well as its individual components, using various LLMs on games with different levels of strategic complexity. Our experimental results show that the framework significantly outperforms baseline models in generating accurate extensive-form games, with each module playing a critical role in its success.
zh

[NLP-46] Tailored Truths: Optimizing LLM Persuasion with Personalization and Fabricated Statistics

【速读】: 该论文旨在探讨大型语言模型(Large Language Models, LLMs)在辩论中的说服力,特别是它们如何利用个人数据来个性化论点以影响人类的观点。研究的关键在于比较不同类型的论据策略(包括基于用户人口统计和个性的个性化论据、诉诸伪造统计数据以及混合策略)的说服效果。研究发现,在交互式辩论环境中,LLMs 使用混合策略时表现出显著的说服力,有51%的概率改变参与者的初始立场,相比之下,静态的人类撰写论据只有32%的成功率。因此,论文的关键解决方案在于揭示LLMs在交互式辩论中使用混合策略的高说服力,这突显了LLMs可能被用于廉价且有效的大规模误导活动的潜在风险。

链接: https://arxiv.org/abs/2501.17273
作者: Jasper Timm,Chetan Talele,Jacob Haimes
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are becoming increasingly persuasive, demonstrating the ability to personalize arguments in conversation with humans by leveraging their personal data. This may have serious impacts on the scale and effectiveness of disinformation campaigns. We studied the persuasiveness of LLMs in a debate setting by having humans (n=33) engage with LLM-generated arguments intended to change the human’s opinion. We quantified the LLM’s effect by measuring human agreement with the debate’s hypothesis pre- and post-debate and analyzing both the magnitude of opinion change, as well as the likelihood of an update in the LLM’s direction. We compare persuasiveness across established persuasion strategies, including personalized arguments informed by user demographics and personality, appeal to fabricated statistics, and a mixed strategy utilizing both personalized arguments and fabricated statistics. We found that static arguments generated by humans and GPT-4o-mini have comparable persuasive power. However, the LLM outperformed static human-written arguments when leveraging the mixed strategy in an interactive debate setting. This approach had a 51% chance of persuading participants to modify their initial position, compared to 32% for the static human-written arguments. Our results highlight the concerning potential for LLMs to enable inexpensive and persuasive large-scale disinformation campaigns.
zh

[NLP-47] Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service

【速读】: 该论文旨在解决大规模行业环境下知识图谱问答系统(Knowledge Graph Question Answering, KGQA)评估的挑战。解决方案的关键在于Chronos框架的设计,它全面评估多组件系统的性能,涵盖了端到端及组件级指标,并且能够扩展至多样化数据集,同时提供一种可扩展的方法来衡量系统在发布前的表现。

链接: https://arxiv.org/abs/2501.17270
作者: Saloni Potdar,Daniel Lee,Omar Attia,Varun Embar,De Meng,Ramesh Balaji,Chloe Seivwright,Eric Choi,Mina H. Farid,Yiwen Sun,Yunyao Li
机构: Apple
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Question answering systems for knowledge graphs (KGQA) answer factoid questions based on the data in the knowledge graph. KGQA systems are complex because the system has to understand the relations and entities in the knowledge-seeking natural language queries and map them to structured queries against the KG to answer them. In this paper, we introduce Chronos, a comprehensive evaluation framework for KGQA at industry scale. It is designed to evaluate such a multi-component system comprehensively, focusing on (1) end-to-end and component-level metrics, (2) scalability to diverse datasets, and (3) a scalable approach to measure the performance of the system prior to release. In this paper, we discuss the unique challenges associated with evaluating KGQA systems at industry scale, review the design of Chronos, and how it addresses these challenges. We will demonstrate how it provides a base for data-driven decisions and discuss the challenges of using it to measure and improve a real-world KGQA system.
zh

[NLP-48] Giving the Old a Fresh Spin: Quality Estimation-Assisted Constrained Decoding for Automatic Post-Editing NAACL2025

【速读】: 该论文旨在解决自动后编辑(APE)系统中存在的过校正问题,即不必要的修改导致译文偏离原始翻译原则中的最小编辑准则。解决方案的关键在于引入词级质量估计(Quality Estimation, QE)信息,在解码过程中减少过校正现象。此方法与特定架构无关,适用于任何APE系统,从而显著提升了在英德、英印地语以及英马拉雅拉姆语等语言对上的性能,分别获得了0.65、1.86和1.44个百分点的TER改进。这表明QE信息的整合对于降低APE系统的过校正效果非常有效。
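
("在解码时注入词级 QE 信息"可以用一个对 logits 加偏置的小函数来示意:被 QE 判为 OK 的原译文 token 获得保留倾向,从而抑制不必要的改写。下例假设原译文与解码步单调对齐,偏置值也是演示用假设,并非论文的具体约束解码算法。)

```python
# 示意:解码时注入词级 QE 信息——QE 标为 OK 的原译文 token 获得保留偏置(简化假设:单调对齐)
import torch

def qe_biased_logits(logits: torch.Tensor, step: int,
                     mt_ids: list, qe_ok: list, bonus: float = 2.0) -> torch.Tensor:
    """logits: (V,) 当前解码步的词表分布;若原译文第 step 个 token 被 QE 标为 OK,
    则为该 token 加上 bonus,抑制不必要的改写(缓解过校正)。"""
    out = logits.clone()
    if step < len(mt_ids) and qe_ok[step]:
        out[mt_ids[step]] += bonus
    return out

mt_ids, qe_ok = [464, 3797, 3332], [True, True, False]   # 原译文 token 及其 QE 标签(演示值)
biased = qe_biased_logits(torch.zeros(50257), step=0, mt_ids=mt_ids, qe_ok=qe_ok)
print(biased[464].item())  # OK 的 token 被加偏置,更可能被原样保留
```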

链接: https://arxiv.org/abs/2501.17265
作者: Sourabh Deoghare,Diptesh Kanojia,Pushpak Bhattacharyya
机构: CFILT, Indian Institute of Technology Bombay (印度理工学院孟买分校); Institute for People-Centred AI, University of Surrey (人民中心人工智能研究院, 英国萨里大学)
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL 2025 Main Conference: Short Papers

点击查看摘要

Abstract:Automatic Post-Editing (APE) systems often struggle with over-correction, where unnecessary modifications are made to a translation, diverging from the principle of minimal editing. In this paper, we propose a novel technique to mitigate over-correction by incorporating word-level Quality Estimation (QE) information during the decoding process. This method is architecture-agnostic, making it adaptable to any APE system, regardless of the underlying model or training approach. Our experiments on English-German, English-Hindi, and English-Marathi language pairs show the proposed approach yields significant improvements over their corresponding baseline APE systems, with TER gains of 0.65, 1.86, and 1.44 points, respectively. These results underscore the complementary relationship between QE and APE tasks and highlight the effectiveness of integrating QE information to reduce over-correction in APE systems.
zh

[NLP-49] NUS-Emo at SemEval-2024 Task 3: Instruction-Tuning LLM for Multimodal Emotion-Cause Analysis in Conversations SEMEVAL-2024

【速读】: 该论文旨在解决SemEval-2024任务3中的子任务2:多模态情感-因果分析(Multimodal Emotion-Cause Pair Extraction with Emotion Category, MECPE-Cat)。为应对这一挑战,论文提出了一种双组件系统,并将任务分解为对话中情感识别(Emotion Recognition in Conversation, ERC)和情感-因果对提取(Emotion-Cause Pair Extraction, ECPE)两个子任务。关键解决方案在于利用大型语言模型(Large Language Models, LLMs)的能力,并设计了一种情感-因果感知的指令微调方法,以增强模型对情感及其相应因果关系的理解。这种方法使得研究团队能够有效处理MECPE-Cat的复杂性,最终实现了34.71%的加权平均F1分数,并在排行榜上获得第二名。

链接: https://arxiv.org/abs/2501.17261
作者: Meng Luo,Han Zhang,Shengqiong Wu,Bobo Li,Hong Han,Hao Fei
机构: National University of Singapore(新加坡国立大学); Xidian University(西安电子科技大学); Wuhan University(武汉大学)
类目: Computation and Language (cs.CL)
备注: 2nd place at SemEval-2024 Task 3, Subtask 2, to appear in SemEval-2024 proceedings

点击查看摘要

Abstract:This paper describes the architecture of our system developed for Task 3 of SemEval-2024: Multimodal Emotion-Cause Analysis in Conversations. Our project targets the challenges of subtask 2, dedicated to Multimodal Emotion-Cause Pair Extraction with Emotion Category (MECPE-Cat), and constructs a dual-component system tailored to the unique challenges of this task. We divide the task into two subtasks: emotion recognition in conversation (ERC) and emotion-cause pair extraction (ECPE). To address these subtasks, we capitalize on the abilities of Large Language Models (LLMs), which have consistently demonstrated state-of-the-art performance across various natural language processing tasks and domains. Most importantly, we design an approach of emotion-cause-aware instruction-tuning for LLMs, to enhance the perception of the emotions with their corresponding causal rationales. Our method enables us to adeptly navigate the complexities of MECPE-Cat, achieving a weighted average F1 score of 34.71% on the task, and securing the 2nd rank on the leaderboard. The code and metadata to reproduce our experiments are all made publicly available.
zh

[NLP-50] Audio Large Language Models Can Be Descriptive Speech Quality Evaluators ICLR2025

【速读】: 该论文旨在解决大型语言模型(LLMs)在处理语音信号时缺乏对输入音频质量感知的问题。论文的关键在于引入了一个基于自然语言的语音评估语料库,并提出了一种利用该语料库引导LLM进行音频质量感知的对齐方法(ALLD),从而使得LLM能够从原始语音中提取相关信息并生成有意义的响应。

链接: https://arxiv.org/abs/2501.17202
作者: Chen Chen,Yuchen Hu,Siyin Wang,Helin Wang,Zhehuai Chen,Chao Zhang,Chao-Han Huck Yang,Eng Siong Chng
机构: Nanyang Technological University; NVIDIA; Tsinghua University; Johns Hopkins University
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: ICLR 2025

点击查看摘要

Abstract:An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. In addition to the overall Mean Opinion Score (MOS), this corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. It also enables descriptive comparisons between two speech samples (A/B tests) with human-like judgment. Leveraging this corpus, we propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech and generating meaningful responses. Experimental results demonstrate that ALLD outperforms the previous state-of-the-art regression model in MOS prediction, with a mean square error of 0.17 and an A/B test accuracy of 98.6%. Additionally, the generated responses achieve BLEU scores of 25.8 and 30.2 on two tasks, surpassing the capabilities of task-specific models. This work advances the comprehensive perception of speech signals by audio LLMs, contributing to the development of real-world auditory and sensory intelligent agents.
zh

[NLP-51] Improving LLM Leaderboards with Psychometrical Methodology

【速读】: 该论文旨在解决大型语言模型(LLMs)评估基准 leaderboard 上排名方法过于简单化的问题。论文的关键解决方案在于应用当代心理计量学方法,这些方法最初是为人类测试和调查设计的,以改进 LLMs 在 leaderboard 上的排名方式。通过比较传统的朴素排名方法与心理计量学指导下的排名方法,研究展示了采用心理计量技术能够实现对 LLM 性能更为稳健和有意义的评估。
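
(心理计量学排名的一个代表性做法是用 Rasch(1PL IRT)模型同时估计"模型能力"与"题目难度",再按能力排序,替代对各基准分数简单取平均;下面是一个自包含的 numpy 示意,对错矩阵为随机演示数据,并非论文使用的具体模型或数据。)

```python
# 示意:用 Rasch(1PL IRT)模型替代"各基准简单取平均"来估计模型能力并排名
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(6, 40))   # 6 个 LLM 在 40 道基准题上的 0/1 对错矩阵(演示数据)

theta = np.zeros(X.shape[0])           # 模型"能力"参数
b = np.zeros(X.shape[1])               # 题目"难度"参数
for _ in range(500):                   # 联合极大似然的朴素梯度上升
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    theta += 0.05 * (X - p).sum(axis=1)
    b -= 0.05 * (X - p).sum(axis=0)
    b -= b.mean()                      # 固定难度均值为 0,消除平移不定性

print(np.argsort(-theta))              # 按能力估计排序,得到心理计量学意义上的排名
```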

链接: https://arxiv.org/abs/2501.17200
作者: Denis Federiakin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: 53 pages, 10 figures, 6 tables

点击查看摘要

Abstract:The rapid development of large language models (LLMs) has necessitated the creation of benchmarks to evaluate their performance. These benchmarks resemble human tests and surveys, as they consist of sets of questions designed to measure emergent properties in the cognitive behavior of these systems. However, unlike the well-defined traits and abilities studied in social sciences, the properties measured by these benchmarks are often vaguer and less rigorously defined. The most prominent benchmarks are often grouped into leaderboards for convenience, aggregating performance metrics and enabling comparisons between models. Unfortunately, these leaderboards typically rely on simplistic aggregation methods, such as taking the average score across benchmarks. In this paper, we demonstrate the advantages of applying contemporary psychometric methodologies - originally developed for human tests and surveys - to improve the ranking of large language models on leaderboards. Using data from the Hugging Face Leaderboard as an example, we compare the results of the conventional naive ranking approach with a psychometrically informed ranking. The findings highlight the benefits of adopting psychometric techniques for more robust and meaningful evaluation of LLM performance.
zh

[NLP-52] Atla Selene Mini: A General Purpose Evaluation Model

【速读】: 该论文旨在开发一种高性能的小型语言模型评估器(SLMJ),以在多样化的基准测试中超越现有模型。关键解决方案在于采用了一种系统性的数据整理策略,该策略通过合成生成的批评来扩充公共数据集,并通过过滤和数据集修剪确保高质量。此外,模型训练采用了直接偏好优化(DPO)和监督微调(SFT)相结合的损失函数,从而产生了一个在实际场景中表现出色的高度可提示评估器。
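
(论文提到的"DPO 与 SFT 相结合的损失"可以写成一行公式:标准 DPO 项加上 chosen 样本的负对数似然。下例是该组合目标的极简示意,其中 beta、alpha 权重均为假设值,并非 Selene Mini 的官方实现细节。)

```python
# 示意:DPO 与 SFT 的联合损失(训练目标的极简版;beta、alpha 为假设的超参数)
import torch
import torch.nn.functional as F

def dpo_sft_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                 sft_logprob, beta: float = 0.1, alpha: float = 0.5):
    """入参均为序列级对数概率(标量张量);pi_* 来自被训模型,ref_* 来自冻结的参考模型。"""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    dpo = -F.logsigmoid(beta * margin)   # 标准 DPO 项:拉大 chosen/rejected 的相对似然
    sft = -sft_logprob                   # SFT 项:chosen 样本的负对数似然
    return dpo + alpha * sft

loss = dpo_sft_loss(torch.tensor(-10.0), torch.tensor(-14.0),
                    torch.tensor(-11.0), torch.tensor(-13.0),
                    sft_logprob=torch.tensor(-10.0))
print(loss.item())
```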

链接: https://arxiv.org/abs/2501.17195
作者: Andrei Alexandru,Antonia Calvi,Henry Broomfield,Jackson Golden,Kyle Dai,Mathias Leys,Maurice Burger,Max Bartolo,Roman Engeler,Sashank Pisupati,Toby Drane,Young Sun Park
机构: University College London (伦敦大学学院); Cohere (Cohere); atla-ai.com (atla-ai.com)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges. To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios. Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace (this https URL) and Ollama to encourage widespread community adoption.
zh

[NLP-53] AI-assisted German Employment Contract Review: A Benchmark Dataset

【速读】: 该论文旨在解决法律文本中识别无效或不公平条款的难题,特别是在就业合同审查中的应用。由于专家标注数据的稀缺性,这一任务尤为困难。为了解决这个问题,论文的关键在于发布了一个匿名且标注的基准数据集,用于评估德国就业合同条款的合法性和公平性,并提供了基线模型评估。这一举措为利用自然语言处理(NLP)技术辅助律师进行合同审查奠定了基础。

链接: https://arxiv.org/abs/2501.17194
作者: Oliver Wardas,Florian Matthes
机构: 未知
类目: Computation and Language (cs.CL)
备注: Dataset available on GitHub

点击查看摘要

Abstract:Employment contracts are used to agree upon the working conditions between employers and employees all over the world. Understanding and reviewing contracts for void or unfair clauses requires extensive knowledge of the legal system and terminology. Recent advances in Natural Language Processing (NLP) hold promise for assisting in these reviews. However, applying NLP techniques on legal text is particularly difficult due to the scarcity of expert-annotated datasets. To address this issue and as a starting point for our effort in assisting lawyers with contract reviews using NLP, we release an anonymized and annotated benchmark dataset for legality and fairness review of German employment contract clauses, alongside with baseline model evaluations.
zh

[NLP-54] Aspect-Aware Decomposition for Opinion Summarization

【速读】: 该论文旨在解决在线评论大规模意见总结过程中缺乏可解释性和透明度的问题。关键在于提出了一种基于评论方面的模块化方法,将方面识别、意见整合和元评审综合等任务分离,从而实现更高的透明度和易于检查性。这种模块化方法通过引入基于评论方面的推理,生成了比知识不可知分解提示更为丰富的中间输出,进而提高了总结的全面性和有效性。

链接: https://arxiv.org/abs/2501.17191
作者: Miao Li,Jey Han Lau,Eduard Hovy,Mirella Lapata
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 35 pages

点击查看摘要

Abstract:Opinion summarization plays a key role in deriving meaningful insights from large-scale online reviews. To make this process more explainable and grounded, we propose a modular approach guided by review aspects which separates the tasks of aspect identification, opinion consolidation, and meta-review synthesis, enabling greater transparency and ease of inspection. We conduct extensive experiments across datasets representing scientific research, business, and product domains. Results show that our method generates more grounded summaries compared to strong baseline models, as verified through automated and human evaluations. Additionally, our modular approach, which incorporates reasoning based on review aspects, produces more informative intermediate outputs than knowledge-agnostic decomposed prompting. These intermediate outputs can also effectively support humans in summarizing opinions from large volumes of reviews.
zh

[NLP-55] A Comprehensive Study on Fine-Tuning Large Language Models for Medical Question Answering Using Classification Models and Comparative Analysis

【速读】: 该论文旨在提高大型语言模型(LLMs)在医疗领域提供准确且高效答案的能力。关键在于通过两阶段方法改进模型:首先预测特定标签以分类医疗问题,然后针对该标签提供预定义答案。文中评估了包括RoBERTa和BERT在内的多种模型,并使用来自Healthline.com的真实数据及合成数据进行训练与验证。最终结果表明,Bert Large Uncased模型在准确性、精确度、召回率和F1分数上均达到100%,展示了其在医疗问答分类及精准作答方面的卓越性能。
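
(这种"先分类标签、再返回预定义答案"的两阶段流水线结构非常简单,可用如下示意说明;标签-答案映射与占位分类规则均为演示用假设,实际系统中分类器应为微调后的 RoBERTa/BERT。)

```python
# 示意:两阶段医疗问答——先预测问题标签,再返回该标签对应的预定义答案
label2answer = {  # 假设的标签-答案映射
    "flu_symptoms": "Common flu symptoms include fever, cough, sore throat, and fatigue.",
    "dosage_info": "Dosage depends on the individual; please consult your physician.",
}

def classify(question: str) -> str:
    """实际系统中为微调后的 RoBERTa/BERT 分类器,此处用关键词规则占位。"""
    return "flu_symptoms" if "flu" in question.lower() else "dosage_info"

def answer(question: str) -> str:
    return label2answer[classify(question)]  # 第二阶段:按标签取预定义答案

print(answer("What are the symptoms of flu?"))
```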

链接: https://arxiv.org/abs/2501.17190
作者: Aysegul Ucar,Soumik Nayak,Anunak Roy,Burak Taşcı,Gülay Taşcı
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages, 5 figures,3 tables

点击查看摘要

Abstract:This paper presents an overview of the development and fine-tuning of large language models (LLMs) designed specifically for answering medical questions. We are mainly improving the accuracy and efficiency of providing reliable answers to medical queries. In our approach, we have two stages: prediction of a specific label for the received medical question and then providing a predefined answer for this label. Various models such as RoBERTa and BERT were examined and evaluated based on their classification ability. The models are trained using the datasets derived from 6,800 samples that were scraped from Healthline.com with additional synthetic data. For evaluation, we conducted a comparative study using 5-fold cross-validation. For assessing performance, we used metrics like accuracy, precision, recall, and F1 score, and also recorded the training time. The performance of the models was evaluated using 5-fold cross-validation. The LoRA Roberta-large model achieved an accuracy of 78.47%, precision of 72.91%, recall of 76.95%, and an F1 score of 73.56%. The Roberta-base model demonstrated high performance with an accuracy of 99.87%, precision of 99.81%, recall of 99.86%, and an F1 score of 99.82%. The Bert Uncased model showed strong results with an accuracy of 95.85%, precision of 94.42%, recall of 95.58%, and an F1 score of 94.72%. Lastly, the Bert Large Uncased model achieved the highest performance, with an accuracy, precision, recall, and F1 score of 100%. The results obtained have helped indicate the capability of the models in classifying the medical questions and generating accurate answers in the prescription of improved health-related AI solutions.
zh

[NLP-56] Visualizing Uncertainty in Translation Tasks: An Evaluation of LLM Performance and Confidence Metrics

【速读】: 该论文旨在解决大型语言模型(LLMs)在机器翻译中的预测不确定性问题,以增强其输出的可解释性和用户信任。解决方案的关键在于引入三种新的不确定性量化(UQ)度量方法,并开发了一个交互式的基于Web的可视化工具来表示翻译不确定性和模型置信度。这些方法通过简单的框架有效评估翻译性能,并提供直观的视觉反馈,从而帮助用户理解翻译质量及模型表现。
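
(论文的三种不确定性度量定义明确,可以直接写成几行 numpy/scipy:token 概率的几何平均、算术平均,以及各解码步词表分布峰度的算术平均;下例中的概率与分布均为演示用假设值。)

```python
# 示意:论文三种不确定性度量(UQ metrics)的直接实现(演示数据为假设值)
import numpy as np
from scipy.stats import kurtosis

token_probs = np.array([0.91, 0.85, 0.40, 0.77])         # 每个输出 token 被选中的概率
token_dists = np.random.dirichlet(np.ones(50), size=4)   # 每个解码步的全词表分布(演示用 50 维)

geo_mean = float(np.exp(np.mean(np.log(token_probs))))   # (1) token 概率的几何平均
ari_mean = float(token_probs.mean())                     # (2) token 概率的算术平均
kurt_mean = float(np.mean([kurtosis(d) for d in token_dists]))  # (3) 各步分布峰度的算术平均

print(geo_mean, ari_mean, kurt_mean)  # 数值越高(分布越尖)通常对应越高的模型置信度
```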

链接: https://arxiv.org/abs/2501.17187
作者: Jin Hyun Park,Utsawb Laminchhane,Umer Farooq,Uma Sivakumar,Arpan Kumar
机构: Texas A&M University (德克萨斯农工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly utilized for machine translation, yet their predictions often exhibit uncertainties that hinder interpretability and user trust. Effectively visualizing these uncertainties can enhance the usability of LLM outputs, particularly in contexts where translation accuracy is critical. This paper addresses two primary objectives: (1) providing users with token-level insights into model confidence and (2) developing a web-based visualization tool to quantify and represent translation uncertainties. To achieve these goals, we utilized the T5 model with the WMT19 dataset for translation tasks and evaluated translation quality using established metrics such as BLEU, METEOR, and ROUGE. We introduced three novel uncertainty quantification (UQ) metrics: (1) the geometric mean of token probabilities, (2) the arithmetic mean of token probabilities, and (3) the arithmetic mean of the kurtosis of token distributions. These metrics provide a simple yet effective framework for evaluating translation performance. Our analysis revealed a linear relationship between the traditional evaluation metrics and our UQ metrics, demonstrating the validity of our approach. Additionally, we developed an interactive web-based visualization that uses a color gradient to represent token confidence. This tool offers users a clear and intuitive understanding of translation quality while providing valuable insights into model performance. Overall, we show that our UQ metrics and visualization are both robust and interpretable, offering practical tools for evaluating and assessing machine translation systems.
zh

[NLP-57] Complete Chess Games Enable LLM Become A Chess Master NAACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在抽象游戏如国际象棋中的应用问题。关键解决方案在于将国际象棋游戏转化为文本格式,并使用Forsyth-Edwards记谱法表示最佳走法,通过简单的有监督微调(supervised fine-tuning),使模型达到专业级别的水平,在与标准Elo评分的Stockfish引擎对抗中获得1788的Elo评分。此外,研究还表明数据质量的重要性,长时间对局的数据监督能够带来350的Elo评分提升。
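
("把对局转成文本并用 Forsyth-Edwards 记谱表示走法"的数据构造可以用 python-chess 几行实现;下面是一个示意,其中对局片段与字段名为演示用假设,非论文官方数据管线。)

```python
# 示意:用 python-chess 把对局转成 (FEN 局面 -> 下一步走法) 的文本监督样本
import io
import chess.pgn

pgn = io.StringIO("1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 *")  # 演示用开局片段
game = chess.pgn.read_game(pgn)

board = game.board()
samples = []
for move in game.mainline_moves():
    samples.append({"input": board.fen(),        # 当前局面的 Forsyth-Edwards 记谱
                    "target": board.san(move)})  # 对局中实际走出的下一步
    board.push(move)

for s in samples[:2]:
    print(s)  # 这类 (局面, 走法) 文本对即可用于 LLM 的有监督微调
```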

链接: https://arxiv.org/abs/2501.17186
作者: Yinqi Zhang,Xintian Han,Haolong Li,Kedi Chen,Shaohui Lin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: NAACL 2025

点击查看摘要

Abstract:Large language models (LLM) have shown remarkable abilities in text generation, question answering, language translation, reasoning and many other tasks. It continues to advance rapidly and is becoming increasingly influential in various fields, from technology and business to education and entertainment. Despite LLM’s success in multiple areas, its ability to play abstract games, such as chess, is underexplored. Chess-playing requires the language models to output legal and reasonable moves from textual inputs. Here, we propose the Large language model ChessLLM to play full chess games. We transform the game into a textual format with the best move represented in the Forsyth-Edwards Notation. We show that by simply supervised fine-tuning, our model has achieved a professional-level Elo rating of 1788 in matches against the standard Elo-rated Stockfish when permitted to sample 10 times. We further show that data quality is important. Long-round data supervision enjoys a 350 Elo rating improvement over short-round data.
zh

[NLP-58] LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在航空航天制造领域应用时容易产生“幻觉”(hallucinations),即生成不准确或虚假信息的问题,这可能严重影响产品质量和飞行安全。论文的关键解决方案在于提出了一套针对航空航天制造领域的评估指标,通过深入分析经典教材和指南提取关键信息,并精心构建多选题来评估不同LLM模型在专业领域知识上的准确性。实验结果表明,当前LLMs在这方面的性能亟待提升。

链接: https://arxiv.org/abs/2501.17183
作者: Beiming Liu,Zhizhuo Cui,Siteng Hu,Xiaohua Li,Haifeng Lin,Zhengxin Zhang
机构: Chengdu Aircraft Industrial (Group) Co., Ltd. (成都飞机工业(集团)有限责任公司), Chengdu, China; School of Computer Science (计算机学院), Beihang University (北京航空航天大学), Beijing, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: conference paper

点击查看摘要

Abstract:Aerospace manufacturing demands exceptionally high precision in technical parameters. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and QWen, in Natural Language Processing has sparked industry interest in their application to tasks including process design, material selection, and tool information retrieval. However, LLMs are prone to generating “hallucinations” in specialized domains, producing inaccurate or false information that poses significant risks to the quality of aerospace products and flight safety. This paper introduces a set of evaluation metrics tailored for LLMs in aerospace manufacturing, aiming to assess their accuracy by analyzing their performance in answering questions grounded in professional knowledge. Firstly, key information is extracted through in-depth textual analysis of classic aerospace manufacturing textbooks and guidelines. Subsequently, utilizing LLM generation techniques, we meticulously construct multiple-choice questions with multiple correct answers of varying difficulty. Following this, different LLM models are employed to answer these questions, and their accuracy is recorded. Experimental results demonstrate that the capabilities of LLMs in aerospace professional knowledge are in urgent need of improvement. This study provides a theoretical foundation and practical guidance for the application of LLMs in aerospace manufacturing, addressing a critical gap in the field.
zh

[NLP-59] Dialogue Systems for Emotional Support via Value Reinforcement

【速读】: 该论文旨在解决情感支持对话系统中如何有效强化用户的积极价值观的问题。关键在于提出了一种基于价值观驱动的方法,通过分析Reddit上的在线支持对话来训练模型,使其能够识别并强化对话中的积极价值观。这种方法显著提升了系统的支持能力,并在专家评估中表现出色,特别是在验证用户挑战和强调情境积极面方面。

链接: https://arxiv.org/abs/2501.17182
作者: Juhee Kim,Chunghu Mok,Jisun Lee,Hyang Sook Kim,Yohan Jo
机构: Seoul National University(首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 30 pages, 3 figures

点击查看摘要

Abstract:Emotional support dialogue systems aim to reduce help-seekers’ distress and help them overcome challenges. While human values (core beliefs that shape an individual’s priorities) are increasingly emphasized in contemporary psychological therapy for their role in fostering internal transformation and long-term emotional well-being, their integration into emotional support systems remains underexplored. To bridge this gap, we present a value-driven method for training emotional support dialogue systems designed to reinforce positive values in seekers. Our model learns to identify which values to reinforce at each turn and how to do so, by leveraging online support conversations from Reddit. The model demonstrated superior performance in emotional support capabilities, outperforming various baselines. Notably, it more effectively explored and elicited values from seekers. Expert assessments by therapists highlighted two key strengths of our model: its ability to validate users’ challenges and its effectiveness in emphasizing positive aspects of their situations, both crucial elements of value reinforcement. Our work validates the effectiveness of value reinforcement for emotional support systems and establishes a foundation for future research.
zh

[NLP-60] An AI-Driven Live Systematic Reviews in the Brain-Heart Interconnectome: Minimizing Research Waste and Advancing Evidence Synthesis

【速读】: 该论文旨在解决神经病学与心脏病学交叉领域(即脑心互连组学, Brain-Heart Interconnectome, BHI)中存在的证据综合效率低下、质量标准遵守情况差以及研究浪费等问题。关键解决方案在于开发了一套基于人工智能的系统,该系统通过集成自动检测人群、干预、对照、结果和研究设计(PICOS),使用向量嵌入的语义搜索,基于图的查询以及主题建模来识别冗余和未探索领域。核心组件包括一个达到87%准确率的双向长短时记忆模型(Bi-LSTM)用于PICOS合规性检测,一个达到95.7%准确率的研究设计分类器,以及使用GPT-3.5的检索增强生成(RAG),在基于图和主题驱动的查询方面优于GPT-4。这套系统提供了实时更新,并通过活数据库减少研究浪费,同时提供交互界面及对话式人工智能支持。

链接: https://arxiv.org/abs/2501.17181
作者: Arya Rahgozar,Pouria Mortezaagha,Jodi Edwards,Douglas Manuel,Jessie McGowen,Merrick Zwarenstein,Dean Fergusson,Andrea Tricco,Kelly Cobey,Margaret Sampson,Malcolm King,Dawn Richards,Alexandra Bodnaruc,David Moher
机构: Ottawa Hospital Research Institute(渥太华医院研究所), Ottawa, Ontario, Canada(加拿大安大略省渥太华)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The Brain-Heart Interconnectome (BHI) combines neurology and cardiology but is hindered by inefficiencies in evidence synthesis, poor adherence to quality standards, and research waste. To address these challenges, we developed an AI-driven system to enhance systematic reviews in the BHI domain. The system integrates automated detection of Population, Intervention, Comparator, Outcome, and Study design (PICOS), semantic search using vector embeddings, graph-based querying, and topic modeling to identify redundancies and underexplored areas. Core components include a Bi-LSTM model achieving 87% accuracy for PICOS compliance, a study design classifier with 95.7% accuracy, and Retrieval-Augmented Generation (RAG) with GPT-3.5, which outperformed GPT-4 for graph-based and topic-driven queries. The system provides real-time updates, reducing research waste through a living database and offering an interactive interface with dashboards and conversational AI. While initially developed for BHI, the system’s adaptable architecture enables its application across various biomedical fields, supporting rigorous evidence synthesis, efficient resource allocation, and informed clinical decision-making.
zh

[NLP-61] Tuning LLM Judges Hyperparameters

【速读】: 该论文旨在解决通过大型语言模型(Large Language Models, LLMs)评估过程中因高昂的人工标注成本而带来的挑战。为了解决这一问题,研究者提出了基于LLM的评判模型,通过比较两个LLM的输出实现模型排名,无需人工干预。然而,不同研究之间存在许多混淆因素,如模型、提示和其他超参数通常同时改变,这使得直接对比变得困难。

该论文的关键解决方案在于系统性地分析和调整LLM评判模型的超参数,并利用多目标多保真方法来降低评估成本。这种方法不仅能够提高准确性与成本效率,还能使用开放权重模型,从而确保更高的可访问性和可重复性。

链接: https://arxiv.org/abs/2501.17178
作者: David Salinas,Omar Swelam,Frank Hutter
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective, multi-fidelity optimization, which allows finding judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility.
zh

[NLP-62] Prompt-Based Cost-Effective Evaluation and Operation of ChatGPT as a Computer Programming Teaching Assistant

【速读】: 该论文旨在解决利用大型语言模型(Large Language Models, LLMs)在大学基础编程课程中为学生提供反馈的问题。论文重点关注三个方面:评估两个知名模型GPT-3.5T和GPT-4T在提供反馈方面的性能;提出一种基于情境学习技术的精心设计的提示方法,以自动化评价过程的重要部分,并提供反馈中包含错误信息的下限估计,从而节省时间和精力;建议一种基于所提出的提示技术实现实用学习工具的可能策略。解决方案的关键在于设计了一种新的提示方法,使得反馈具有可编程分析的结构,并包含了关于LLM在解决问题时表现的诊断信息。

链接: https://arxiv.org/abs/2501.17176
作者: Marc Ballestero-Ribó,Daniel Ortiz-Martínez
机构: Department of Mathematics and Computer Science, Universitat de Barcelona (巴塞罗那大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The dream of achieving a student-teacher ratio of 1:1 is closer than ever thanks to the emergence of large language models (LLMs). One potential application of these models in the educational field would be to provide feedback to students in university introductory programming courses, so that a student struggling to solve a basic implementation problem could seek help from an LLM available 24/7. This article focuses on studying three aspects related to such an application. First, the performance of two well-known models, GPT-3.5T and GPT-4T, in providing feedback to students is evaluated. The empirical results showed that GPT-4T performs much better than GPT-3.5T, however, it is not yet ready for use in a real-world scenario. This is due to the possibility of generating incorrect information that potential users may not always be able to detect. Second, the article proposes a carefully designed prompt using in-context learning techniques that allows automating important parts of the evaluation process, as well as providing a lower bound for the fraction of feedbacks containing incorrect information, saving time and effort. This was possible because the resulting feedback has a programmatically analyzable structure that incorporates diagnostic information about the LLM’s performance in solving the requested task. Third, the article also suggests a possible strategy for implementing a practical learning tool based on LLMs, which is rooted on the proposed prompting techniques. This strategy opens up a whole range of interesting possibilities from a pedagogical perspective.
zh

[NLP-63] Document-Level Sentiment Analysis of Urdu Text Using Deep Learning Techniques

【速读】: 该论文旨在解决文档级乌尔都语情感分析(Urdu Sentiment Analysis, SA)这一自然语言处理(NLP)挑战。由于乌尔都语资源匮乏且文档中存在大量表达不同观点的词汇,因此这是一个复杂的问题。为应对这一挑战,论文的关键解决方案在于提出了一种深度学习(Deep Learning, DL)混合模型,该模型将双向长短期记忆网络(Bidirectional Long Short Term Memory, BiLSTM)与单层多滤波卷积神经网络(Single Layer Multi Filter Convolutional Neural Network, SLMFCNN)相结合,称为BiLSTM-SLMFCNN。实验结果表明,该模型在乌尔都语电影评论数据集和客户支持数据集上均优于其他基准DL模型,分别达到了83%,79%,83%和94%的准确率。
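
(BiLSTM-SLMFCNN 的"BiLSTM 编码 + 单层多滤波器 CNN"结构可以用一个小 PyTorch 模块示意;下例的超参数(嵌入维度、滤波器尺寸等)均为演示用假设,非论文官方实现。)

```python
# 示意:BiLSTM-SLMFCNN 的极简 PyTorch 结构——BiLSTM 编码后接单层多滤波器 CNN
import torch
import torch.nn as nn

class BiLSTM_SLMFCNN(nn.Module):
    def __init__(self, vocab=30000, emb=300, hid=128, n_cls=2, sizes=(2, 3, 4), n_filt=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, bidirectional=True, batch_first=True)
        self.convs = nn.ModuleList(nn.Conv1d(2 * hid, n_filt, k) for k in sizes)  # 单层多滤波器
        self.fc = nn.Linear(n_filt * len(sizes), n_cls)

    def forward(self, x):                       # x: (B, L) 的词 id
        h, _ = self.lstm(self.emb(x))           # (B, L, 2*hid)
        h = h.transpose(1, 2)                   # Conv1d 需要 (B, C, L)
        feats = [c(h).relu().max(dim=-1).values for c in self.convs]  # 各尺寸卷积 + 最大池化
        return self.fc(torch.cat(feats, dim=-1))  # 文档级情感 logits

model = BiLSTM_SLMFCNN()
print(model(torch.randint(0, 30000, (4, 200))).shape)  # torch.Size([4, 2])
```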

链接: https://arxiv.org/abs/2501.17175
作者: Ammarah Irum,M. Ali Tahir
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Document-level Urdu Sentiment Analysis (SA) is a challenging Natural Language Processing (NLP) task as it deals with large documents in a resource-poor language. In large documents, there are ample amounts of words that exhibit different viewpoints. Deep learning (DL) models comprise complex neural network architectures that have the ability to learn diverse features of the data to classify various sentiments. Besides audio, image, and video classification, DL algorithms are now extensively used in text-based classification problems. To explore the powerful DL techniques for Urdu SA, we have applied several DL architectures, namely Bidirectional Long Short Term Memory (BiLSTM), Convolutional Neural Network (CNN), Convolutional Neural Network with Bidirectional Long Short Term Memory (CNN-BiLSTM), and Bidirectional Encoder Representations from Transformers (BERT). In this paper, we have proposed a DL hybrid model that integrates BiLSTM with a Single Layer Multi Filter Convolutional Neural Network (BiLSTM-SLMFCNN). The proposed and baseline techniques are applied on the Urdu Customer Support dataset and the IMDB Urdu movie review dataset using pretrained Urdu word embeddings that are suitable for SA at the document level. Results of these techniques are evaluated and our proposed model outperforms all other DL techniques for Urdu SA. BiLSTM-SLMFCNN outperformed the baseline DL models and achieved 83%, 79%, and 83% accuracy on the small, medium, and large IMDB Urdu movie review datasets respectively, and 94% accuracy on the Urdu Customer Support dataset.
zh
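
下面用 PyTorch 给出 BiLSTM 与单层多滤波器 CNN 组合结构的示意:词嵌入经 BiLSTM 编码后,用多种核宽的一维卷积提取特征、最大池化再拼接分类。各维度与超参数均为假设值,仅用于说明 BiLSTM-SLMFCNN 的大致构造,并非论文原始实现。

```python
import torch
import torch.nn as nn

class BiLSTMSLMFCNN(nn.Module):
    """BiLSTM + 单层多滤波器 CNN 的混合结构示意(超参数均为假设)。"""
    def __init__(self, vocab_size=20000, emb_dim=300, hidden=128,
                 num_filters=100, kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        # 单层、多种卷积核宽度的一维卷积
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * hidden, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                    # x: (B, T)
        h, _ = self.bilstm(self.emb(x))      # (B, T, 2*hidden)
        h = h.transpose(1, 2)                # (B, 2*hidden, T)
        feats = [torch.relu(c(h)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))

model = BiLSTMSLMFCNN()
logits = model(torch.randint(0, 20000, (4, 50)))  # 4 条长度 50 的序列
print(logits.shape)  # torch.Size([4, 2])
```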

[NLP-64] Extractive Schema Linking for Text-to-SQL

【速读】: 该论文旨在解决在处理大规模数据库模式(Schema)时,如何高效且准确地将文本转换为SQL查询的问题。论文的关键在于引入了一种新的方法,即通过仅解码器(decoder-only)的大语言模型(LLMs)进行模式链接(Schema Linking),这种方法不仅在计算上更为高效,而且相比生成式方法具有更高的准确性。此外,该提取式方法允许对模式链接的精确率-召回率(Precision-Recall)权衡进行精细控制。

链接: https://arxiv.org/abs/2501.17174
作者: Michael Glass,Mustafa Eyceoz,Dharmashankar Subramanian,Gaetano Rossiello,Long Vu,Alfio Gliozzo
机构: IBM Research AI (IBM研究院)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text-to-SQL is emerging as a practical interface for real world databases. The dominant paradigm for Text-to-SQL is cross-database or schema-independent, supporting application schemas unseen during training. The schema of a database defines the tables, columns, column types and foreign key connections between tables. Real world schemas can be large, containing hundreds of columns, but for any particular query only a small fraction will be relevant. Placing the entire schema in the prompt for an LLM can be impossible for models with smaller token windows and expensive even when the context window is large enough to allow it. Even apart from computational considerations, the accuracy of the model can be improved by focusing the SQL generation on only the relevant portion of the database. Schema linking identifies the portion of the database schema useful for the question. Previous work on schema linking has used graph neural networks, generative LLMs, and cross encoder classifiers. We introduce a new approach to adapt decoder-only LLMs to schema linking that is both computationally more efficient and more accurate than the generative approach. Additionally our extractive approach permits fine-grained control over the precision-recall trade-off for schema linking.
zh
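
抽取式 schema linking 对精确率-召回率的细粒度控制,本质上是对每列相关性得分设阈值。下面的示意用伪造得分演示这一点;列名与得分均为假设,真实得分应来自论文中适配后的 decoder-only LLM。

```python
# 通过阈值在精确率与召回率之间权衡的示意。
def link_schema(column_scores: dict, threshold: float) -> list:
    """返回得分不低于阈值的列;阈值越高精确率越高、召回率越低。"""
    return [col for col, s in column_scores.items() if s >= threshold]

scores = {"orders.id": 0.93, "orders.date": 0.71,
          "users.name": 0.40, "users.email": 0.05}
for t in (0.3, 0.6, 0.9):
    print(f"threshold={t}: {link_schema(scores, t)}")
```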

[NLP-65] Benchmarking Randomized Optimization Algorithms on Binary Permutation and Combinatorial Problem Landscapes

【速读】: 该论文旨在评估四种随机优化算法(Randomized Hill Climbing, Simulated Annealing, Genetic Algorithms, 和 MIMIC)在三种不同类型问题(二元、排列和组合问题)中的性能。通过使用一组基准适应度函数,系统性地比较这些算法在解的质量、收敛速度、计算成本和鲁棒性等关键性能指标上的表现。研究结果表明,虽然MIMIC和遗传算法在处理二元和组合问题时能产生高质量的解决方案,但它们的计算需求差异显著;而随机化爬山法和模拟退火法虽然计算成本较低,但在复杂问题环境中表现有限。论文的关键在于揭示不同优化策略之间的权衡,并基于问题类型、精度需求和计算约束提供实用的算法选择指导。

链接: https://arxiv.org/abs/2501.17170
作者: Jethro Odeyemi,Wenjun Zhang
机构: University of Saskatchewan(萨斯喀彻温大学); Delft University of Technology(代尔夫特理工大学)
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we evaluate the performance of four randomized optimization algorithms: Randomized Hill Climbing (RHC), Simulated Annealing (SA), Genetic Algorithms (GA), and MIMIC (Mutual Information Maximizing Input Clustering), across three distinct types of problems: binary, permutation, and combinatorial. We systematically compare these algorithms using a set of benchmark fitness functions that highlight the specific challenges and requirements of each problem category. Our study analyzes each algorithm’s effectiveness based on key performance metrics, including solution quality, convergence speed, computational cost, and robustness. Results show that while MIMIC and GA excel in producing high-quality solutions for binary and combinatorial problems, their computational demands vary significantly. RHC and SA, while computationally less expensive, demonstrate limited performance in complex problem landscapes. The findings offer valuable insights into the trade-offs between different optimization strategies and provide practical guidance for selecting the appropriate algorithm based on the type of problems, accuracy requirements, and computational constraints.
zh
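
以二元问题的经典基准 OneMax(最大化比特串中 1 的个数)为例,下面给出模拟退火的极简实现,便于理解文中各随机优化算法所面对的适应度景观;初温、降温率等参数均为随意设定的假设值。

```python
import math
import random

def one_max(bits):            # 二元问题的基准适应度函数:1 的个数
    return sum(bits)

def simulated_annealing(n=50, T0=2.0, alpha=0.995, steps=5000, seed=0):
    """模拟退火在 OneMax 上的极简实现(参数为假设,仅作演示)。"""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    f, T = one_max(x), T0
    for _ in range(steps):
        i = rng.randrange(n)
        x[i] ^= 1                          # 翻转一位得到邻居解
        f_new = one_max(x)
        if f_new >= f or rng.random() < math.exp((f_new - f) / T):
            f = f_new                      # 接受新解
        else:
            x[i] ^= 1                      # 拒绝并还原
        T *= alpha                         # 降温
    return f

print(simulated_annealing())  # 接近 50 表示收敛良好
```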

[NLP-66] Fine-Tuning Open-Source Large Language Models to Improve Their Performance on Radiation Oncology Tasks: A Feasibility Study to Investigate Their Potential Clinical Applications in Radiation Oncology

【速读】: 该论文旨在探究通过领域知识微调大型语言模型(LLMs)是否可以提升其在放射肿瘤学中的表现,特别是在任务(1)治疗方案生成、任务(2)治疗方式选择(光子、质子、电子或近距离放射治疗)以及任务(3)ICD-10编码预测方面的性能。关键在于利用包含患者诊断信息和相应答案的数据集对开源模型LLaMA2-7B和Mistral-7B进行微调,并采用低秩近似方法,从而显著提升了模型在所有任务上的准确性及评价指标。

链接: https://arxiv.org/abs/2501.17286
作者: Peilong Wang,Zhengliang Liu,Yiwei Li,Jason Holmes,Peng Shu,Lian Zhang,Xiang Li,Quanzheng Li,Brady S. Laughlin,Diego Santos Toesca,Sujay A. Vora,Samir H. Patel,Terence T. Sio,Tianming Liu,Wei Liu
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Background: The radiation oncology clinical practice involves many steps relying on the dynamic interplay of abundant text data. Large language models have displayed remarkable capabilities in processing complex text information. But their direct applications in specific fields like radiation oncology remain underexplored. Purpose: This study aims to investigate whether fine-tuning LLMs with domain knowledge can improve the performance on Task (1) treatment regimen generation, Task (2) treatment modality selection (photon, proton, electron, or brachytherapy), and Task (3) ICD-10 code prediction in radiation oncology. Methods: Data for 15,724 patient cases were extracted. Cases where patients had a single diagnostic record, and a clearly identifiable primary treatment plan were selected for preprocessing and manual annotation to have 7,903 cases of the patient diagnosis, treatment plan, treatment modality, and ICD-10 code. Each case was used to construct a pair consisting of patient diagnostics details and an answer (treatment regimen, treatment modality, or ICD-10 code respectively) for the supervised fine-tuning of these three tasks. Open source LLaMA2-7B and Mistral-7B models were utilized for the fine-tuning with the Low-Rank Approximations method. Accuracy and ROUGE-1 score were reported for the fine-tuned models and original models. Clinical evaluation was performed on Task (1) by radiation oncologists, while precision, recall, and F-1 score were evaluated for Task (2) and (3). One-sided Wilcoxon signed-rank tests were used to statistically analyze the results. Results: Fine-tuned LLMs outperformed original LLMs across all tasks with p-value = 0.001. Clinical evaluation demonstrated that over 60% of the fine-tuned LLMs-generated treatment regimens were clinically acceptable. Precision, recall, and F1-score showed improved performance of fine-tuned LLMs.
zh
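
论文采用低秩近似方法微调 LLaMA2-7B 与 Mistral-7B。下面给出用 Hugging Face PEFT 库做 LoRA 微调的通用骨架作为参考;`target_modules`、秩 `r` 等超参数均为常见的假设值,并非论文披露的配置。

```python
# 基于 PEFT 的 LoRA 微调示意;模型名与超参数为假设,
# 思路与论文所用的低秩近似(Low-Rank Approximations)方法一致。
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # 仅训练极少量 LoRA 参数

# 随后按「患者诊断信息 -> 治疗方案 / 治疗模态 / ICD-10 编码」构造监督样本,
# 用常规因果语言建模损失进行微调即可。
```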

计算机视觉

[CV-0] U2A: Unified Unimodal Adaptation for Robust and Efficient Multimodal Learning

【速读】:该论文旨在解决多模态学习中依赖复杂模型和训练策略以实现最优性能的问题。解决方案的关键在于提出了统一单模态适应(Unified Unimodal Adaptation, U2A)方法,通过低秩适应(Low-Rank Adaptation, LoRA)联合微调预训练的单模态编码器,显著减少了可学习参数的数量,并简化了训练过程。此外,引入了掩码令牌(Mask Tokens, MT),用于在训练和测试过程中从可用模态生成缺失模态特征,从而避免了专门的特征估计或提示调优方法的需求。这一方案展示了在完整和缺失模态设置下的强健性和高效性。

链接: https://arxiv.org/abs/2501.17823
作者: Md Kaykobad Reza,Niki Nezakati,Ameya Patil,Mashhour Solh,M. Salman Asif
机构: University of California Riverside(加州大学河滨分校); Amazon(亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 Pages, 6 Figures, 6 Tables

点击查看摘要

Abstract:Multimodal learning often relies on designing new models and complex training strategies to achieve optimal performance. We present Unified Unimodal Adaptation (U2A), which jointly fine-tunes pretrained unimodal encoders using low-rank adaptation (LoRA) for various multimodal tasks. Our method significantly reduces the number of learnable parameters and eliminates the need for complex training strategies, such as alternating training, gradient modifications, or unimodal fine-tuning. To address missing modalities during both training and testing, we introduce Mask Tokens (MT), which generate missing modality features from available modalities using a single token per modality. This simplifies the process, removing the need for specialized feature estimation or prompt-tuning methods. Our evaluation demonstrates that U2A matches or outperforms state-of-the-art methods in both complete and missing modality settings, showcasing strong performance and robustness across various modalities, tasks, and datasets. We also analyze and report the effectiveness of Mask Tokens in different missing modality scenarios. Overall, our method provides a robust, flexible, and efficient solution for multimodal learning, with minimal computational overhead.
zh
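
关于「每个模态一个 Mask Token、从可用模态生成缺失模态特征」的思路,下面给出一个极简示意:缺失模态用可学习 token 占位后再做融合。融合方式(简单平均)与特征维度均为演示假设,并非论文原始设计。

```python
import torch
import torch.nn as nn

class MaskTokenFusion(nn.Module):
    """缺失模态用可学习 Mask Token 占位的示意(每模态一个 token)。"""
    def __init__(self, num_modalities=3, dim=256):
        super().__init__()
        self.mask_tokens = nn.Parameter(torch.zeros(num_modalities, dim))

    def forward(self, feats):            # feats: 列表,缺失模态为 None
        out = []
        for m, f in enumerate(feats):
            if f is None:                # 缺失时用该模态的 Mask Token 代替
                f = self.mask_tokens[m].expand(1, -1)
            out.append(f)
        return torch.stack(out, dim=1).mean(dim=1)   # 简单平均融合

fusion = MaskTokenFusion()
print(fusion([torch.randn(1, 256), None, torch.randn(1, 256)]).shape)
```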

[CV-1] SSF: Sparse Long-Range Scene Flow for Autonomous Driving ICRA

【速读】:该论文旨在解决长距离场景流估计中因稀疏观测导致的性能下降问题。论文的关键解决方案是提出了一种基于稀疏卷积的Sparse Scene Flow (SSF) 管道,采用稀疏特征图替代密集特征网格,从而克服了传统方法在远距离范围下计算复杂度呈二次增长的局限。为了应对稀疏特征图在时间序列点云之间尺寸和排序不匹配的问题,SSF引入了虚拟体素来增强特征图,并提出了一种针对远距离点赋予更大权重的范围加权度量方法。这种方法使得SSF在Argoverse2数据集上实现了最先进的长距离场景流估计性能。

链接: https://arxiv.org/abs/2501.17821
作者: Ajinkya Khoche,Qingwen Zhang,Laura Pereira Sanchez,Aron Asefaw,Sina Sharif Mansouri,Patric Jensfelt
机构: KTH Royal Institute of Technology(皇家理工学院); Autonomous Transport Solutions Lab, Scania Group(自主运输解决方案实验室, 斯堪尼亚集团); Stanford University(斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 3 figures, accepted to International Conference on Robotics and Automation (ICRA) 2025

点击查看摘要

Abstract:Scene flow enables an understanding of the motion characteristics of the environment in the 3D world. It gains particular significance in the long-range, where object-based perception methods might fail due to sparse observations far away. Although significant advancements have been made in scene flow pipelines to handle large-scale point clouds, a gap remains in scalability with respect to long-range. We attribute this limitation to the common design choice of using dense feature grids, which scale quadratically with range. In this paper, we propose Sparse Scene Flow (SSF), a general pipeline for long-range scene flow, adopting a sparse convolution based backbone for feature extraction. This approach introduces a new challenge: a mismatch in size and ordering of sparse feature maps between time-sequential point scans. To address this, we propose a sparse feature fusion scheme, that augments the feature maps with virtual voxels at missing locations. Additionally, we propose a range-wise metric that implicitly gives greater importance to faraway points. Our method, SSF, achieves state-of-the-art results on the Argoverse2 dataset, demonstrating strong performance in long-range scene flow estimation. Our code will be released at this https URL.
zh
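
文中提出对远距离点隐式赋予更大权重的范围度量。下面给出一种按点距加权端点误差(EPE)的示意写法;具体加权形式为笔者假设,仅说明「越远权重越大」的思想,并非论文原始定义。

```python
import torch

def range_weighted_epe(pred, gt, points, eps=1e-6):
    """按点到传感器水平距离加权的端点误差(EPE)示意。"""
    epe = torch.linalg.norm(pred - gt, dim=-1)        # 每点流误差 (N,)
    r = torch.linalg.norm(points[:, :2], dim=-1)      # 水平距离
    w = r / (r.sum() + eps)                           # 距离越远权重越大
    return (w * epe).sum()

N = 1024
pts = torch.rand(N, 3) * 100                          # 假设的点云坐标
print(range_weighted_epe(torch.rand(N, 3), torch.rand(N, 3), pts))
```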

[CV-2] P-TAME: Explain Any Image Classifier with Trained Perturbations

【速读】:该论文旨在解决深度神经网络(DNNs)在需要预测与解释并存的关键领域中的应用受限问题,主要由于其固有的黑箱性质。论文提出的关键解决方案是P-TAME (基于扰动的可训练注意力机制),这是一种模型无关的方法,用于解释基于DNN的图像分类器。P-TAME通过引入一个辅助图像分类器来提取输入图像特征,从而避免了为了解释特定的主干分类器架构而调整解释方法的需求。关键创新在于它以高效的方式生成高分辨率解释,仅需一次前向传递即可完成推理过程,从而显著降低了计算需求。

链接: https://arxiv.org/abs/2501.17813
作者: Mariano V. Ntrougkas,Vasileios Mezaris,Ioannis Patras
机构: Information Technologies Institute / CERTH (信息技术研究所); Queen Mary University of London (伦敦大学玛丽皇后学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted for publication

点击查看摘要

Abstract:The adoption of Deep Neural Networks (DNNs) in critical fields where predictions need to be accompanied by justifications is hindered by their inherent black-box nature. In this paper, we introduce P-TAME (Perturbation-based Trainable Attention Mechanism for Explanations), a model-agnostic method for explaining DNN-based image classifiers. P-TAME employs an auxiliary image classifier to extract features from the input image, bypassing the need to tailor the explanation method to the internal architecture of the backbone classifier being explained. Unlike traditional perturbation-based methods, which have high computational requirements, P-TAME offers an efficient alternative by generating high-resolution explanations in a single forward pass during inference. We apply P-TAME to explain the decisions of VGG-16, ResNet-50, and ViT-B-16, three distinct and widely used image classifiers. Quantitative and qualitative results show that our method matches or outperforms previous explainability methods, including model-specific approaches. Code and trained models will be released upon acceptance.
zh

[CV-3] CrowdSplat: Exploring Gaussian Splatting For Crowd Rendering

【速读】:该论文旨在解决动态真实感人群模拟在实时应用中的渲染质量问题、内存效率及计算性能。关键解决方案在于使用3D高斯点阵(CrowdSplat)来表示多样化姿态和服装的动画人物,并通过整合细节层次(LoD)渲染技术优化计算效率。CrowdSplat框架包括两个阶段:角色重建与人群合成,并针对GPU内存使用进行优化以增强可扩展性。

链接: https://arxiv.org/abs/2501.17792
作者: Xiaohan Sun,Yinghan Xu,John Dingliana,Carol O’Sullivan
机构: Trinity College Dublin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 4 figures

点击查看摘要

Abstract:We present CrowdSplat, a novel approach that leverages 3D Gaussian Splatting for real-time, high-quality crowd rendering. Our method utilizes 3D Gaussian functions to represent animated human characters in diverse poses and outfits, which are extracted from monocular videos. We integrate Level of Detail (LoD) rendering to optimize computational efficiency and quality. The CrowdSplat framework consists of two stages: (1) avatar reconstruction and (2) crowd synthesis. The framework is also optimized for GPU memory usage to enhance scalability. Quantitative and qualitative evaluations show that CrowdSplat achieves good levels of rendering quality, memory efficiency, and computational performance. Through these experiments, we demonstrate that CrowdSplat is a viable solution for dynamic, realistic crowd simulation in real-time applications.
zh

[CV-4] Learning Semantic Facial Descriptors for Accurate Face Animation

【速读】:该论文旨在解决面部动画任务中的两个主要问题:一是基于模型的方法(如利用3DMM或标志点)难以有效保留身份特征;二是无模型方法在实现解耦且语义丰富的特征空间时面临挑战,导致精确的运动传递难以实现。论文的关键解决方案在于引入可学习的解耦向量空间中的语义面部描述符,通过学习完整的正交基向量将面部空间解耦为身份和运动子空间,并赋予每个子空间语义。通过编码器获取基向量系数,从而在身份和运动子空间中获得有效的面部描述符,这些描述符可以重新组合为潜在代码以实现面部动画。

链接: https://arxiv.org/abs/2501.17718
作者: Lei Zhu,Yuanqi Chen,Xiaohang Liu,Thomas H. Li,Ge Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages,6 figures

点击查看摘要

Abstract:Face animation is a challenging task. Existing model-based methods (utilizing 3DMMs or landmarks) often result in a model-like reconstruction effect, which doesn’t effectively preserve identity. Conversely, model-free approaches face challenges in attaining a decoupled and semantically rich feature space, thereby making accurate motion transfer difficult to achieve. We introduce the semantic facial descriptors in learnable disentangled vector space to address the dilemma. The approach involves decoupling the facial space into identity and motion subspaces while endowing each of them with semantics by learning complete orthogonal basis vectors. We obtain basis vector coefficients by employing an encoder on the source and driving faces, leading to effective facial descriptors in the identity and motion subspaces. Ultimately, these descriptors can be recombined as latent codes to animate faces. Our approach successfully addresses the issue of model-based methods’ limitations in high-fidelity identity and the challenges faced by model-free methods in accurate motion transfer. Extensive experiments are conducted on three challenging benchmarks (i.e. VoxCeleb, HDTF, CelebV). Comprehensive quantitative and qualitative results demonstrate that our model outperforms SOTA methods with superior identity preservation and motion transfer.
zh
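
「学习完整正交基向量」通常可以通过正交正则项来鼓励。下面给出常见写法 ||BBᵀ − I||² 的示意,分别作用于身份与运动子空间的基;基向量数量与特征维度为假设,并非论文原始损失。

```python
import torch

def orthogonality_loss(basis):
    """鼓励基向量相互正交的正则项:||B Bᵀ - I||²(常见写法,此处为示意)。"""
    gram = basis @ basis.T
    eye = torch.eye(basis.size(0), device=basis.device)
    return ((gram - eye) ** 2).sum()

identity_basis = torch.nn.Parameter(torch.randn(20, 256) * 0.05)
motion_basis = torch.nn.Parameter(torch.randn(20, 256) * 0.05)
loss = orthogonality_loss(identity_basis) + orthogonality_loss(motion_basis)
print(loss.item())
```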

[CV-5] Segmentation-Aware Generative Reinforcement Network (GRN) for Tissue Layer Segmentation in 3-D Ultrasound Images for Chronic Low-back Pain (cLBP) Assessment

【速读】:该论文旨在解决在医学图像分析中,特别是针对超声图像分割任务,需要大量标注数据的问题。论文的关键解决方案是引入了一种名为生成增强网络(GRN)的新型联合训练框架,并结合了分割引导增强(SGE)技术。GRN通过在单一阶段优化图像生成和分割性能,显著减少了所需标注数据量,同时保持与全监督模型相当的性能。实验结果表明,采用SGE的样本高效学习变体GRN-SEL可将标注工作量减少高达70%,且在Dice相似性系数(DSC)上有1.98%的提升。

链接: https://arxiv.org/abs/2501.17690
作者: Zixue Zeng,Xiaoyan Zhao,Matthew Cartier,Tong Yu,Jing Wang,Xin Meng,Zhiyu Sheng,Maryam Satarpour,John M Cormack,Allison Bean,Ryan Nussbaum,Maya Maurer,Emily Landis-Walkenhorst,Dinesh Kumbhare,Kang Kim,Ajay Wasan,Jiantao Pu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce a novel segmentation-aware joint training framework called generative reinforcement network (GRN) that integrates segmentation loss feedback to optimize both image generation and segmentation performance in a single stage. An image enhancement technique called segmentation-guided enhancement (SGE) is also developed, where the generator produces images tailored specifically for the segmentation model. Two variants of GRN were also developed, including GRN for sample-efficient learning (GRN-SEL) and GRN for semi-supervised learning (GRN-SSL). GRN’s performance was evaluated using a dataset of 69 fully annotated 3D ultrasound scans from 29 subjects. The annotations included six anatomical structures: dermis, superficial fat, superficial fascial membrane (SFM), deep fat, deep fascial membrane (DFM), and muscle. Our results show that GRN-SEL with SGE reduces labeling efforts by up to 70% while achieving a 1.98% improvement in the Dice Similarity Coefficient (DSC) compared to models trained on fully labeled datasets. GRN-SEL alone reduces labeling efforts by 60%, GRN-SSL with SGE decreases labeling requirements by 70%, and GRN-SSL alone by 60%, all while maintaining performance comparable to fully supervised models. These findings suggest the effectiveness of the GRN framework in optimizing segmentation performance with significantly less labeled data, offering a scalable and efficient solution for ultrasound image analysis and reducing the burdens associated with data annotation.
zh

[CV-6] ContourFormer:Real-Time Contour-Based End-to-End Instance Segmentation Transformer

【速读】:该论文旨在解决实时轮廓驱动的实例分割(Real-time Contour-based Instance Segmentation)问题。关键解决方案在于提出了两种创新技术:子轮廓解耦机制(sub-contour decoupling mechanisms)和轮廓细粒度分布优化(contour fine-grained distribution refinement)。其中,子轮廓解耦机制通过引入可变形注意力模块(deformable attention-based module),自适应地选择采样区域以更有效地捕捉物体边界信息;多阶段优化过程则逐步细化子轮廓以增强分割精度。这些创新使Contourformer能够在保持实时性能的同时,实现每个实例稳定且精确的分割。

链接: https://arxiv.org/abs/2501.17688
作者: Weiwei yao,Chen Li,Minjun Xiong,Wenbo Dong,Hao Chen,Xiong Xiao
机构: Zhuzhou CRRC Times Electric Co., Ltd., China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents Contourformer, a real-time contour-based instance segmentation algorithm. The method is fully based on the DETR paradigm and achieves end-to-end inference through iterative and progressive mechanisms to optimize contours. To improve efficiency and accuracy, we develop two novel techniques: sub-contour decoupling mechanisms and contour fine-grained distribution refinement. In the sub-contour decoupling mechanism, we propose a deformable attention-based module that adaptively selects sampling regions based on the current predicted contour, enabling more effective capturing of object boundary information. Additionally, we design a multi-stage optimization process to enhance segmentation precision by progressively refining sub-contours. The contour fine-grained distribution refinement technique aims to further improve the ability to express fine details of contours. These innovations enable Contourformer to achieve stable and precise segmentation for each instance while maintaining real-time performance. Extensive experiments demonstrate the superior performance of Contourformer on multiple benchmark datasets, including SBD, COCO, and KINS. We conduct comprehensive evaluations and comparisons with existing state-of-the-art methods, showing significant improvements in both accuracy and inference speed. This work provides a new solution for contour-based instance segmentation tasks and lays a foundation for future research, with the potential to become a strong baseline method in this field.
zh

[CV-7] FeatureGS: Eigenvalue-Feature Optimization in 3D Gaussian Splatting for Geometrically Accurate and Artifact-Reduced Reconstruction

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS)在三维场景重建中的几何精度不足以及浮动物体(floater)伪影的问题。论文的关键解决方案是引入FeatureGS方法,通过在3DGS的优化过程中添加基于特征值衍生的三维形状特征的几何损失项,以提升几何精度并减少结构熵,从而增强平面表面的特性。此外,论文提出了四种不同的几何损失项公式,分别基于高斯点的“平面性”、高斯点邻域的“平面性”、“全变异性”和“特征熵”。这些改进措施使得FeatureGS方法在DTU基准数据集上的几何精度提升了30%,同时减少了90%的高斯点数量,并抑制了浮动物体伪影,保持了与原始方法相当的光度渲染质量。

链接: https://arxiv.org/abs/2501.17655
作者: Miriam Jäger,Markus Hillemann,Boris Jutzi
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 9 figures, 7 tables

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful approach for 3D scene reconstruction using 3D Gaussians. However, neither the centers nor surfaces of the Gaussians are accurately aligned to the object surface, complicating their direct use in point cloud and mesh reconstruction. Additionally, 3DGS typically produces floater artifacts, increasing the number of Gaussians and storage requirements. To address these issues, we present FeatureGS, which incorporates an additional geometric loss term based on an eigenvalue-derived 3D shape feature into the optimization process of 3DGS. The goal is to improve geometric accuracy and enhance properties of planar surfaces with reduced structural entropy in local 3D neighborhoods. We present four alternative formulations for the geometric loss term based on ‘planarity’ of Gaussians, as well as ‘planarity’, ‘omnivariance’, and ‘eigenentropy’ of Gaussian neighborhoods. We provide quantitative and qualitative evaluations on 15 scenes of the DTU benchmark dataset, focusing on the following key aspects: geometric accuracy and artifact reduction, measured by the Chamfer distance, and memory efficiency, evaluated by the total number of Gaussians. Additionally, rendering quality is monitored by the Peak Signal-to-Noise Ratio. FeatureGS achieves a 30 % improvement in geometric accuracy, reduces the number of Gaussians by 90 %, and suppresses floater artifacts, while maintaining comparable photometric rendering quality. The geometric loss with ‘planarity’ from Gaussians provides the highest geometric accuracy, while ‘omnivariance’ in Gaussian neighborhoods reduces floater artifacts and number of Gaussians the most. This makes FeatureGS a strong method for geometrically accurate, artifact-reduced and memory-efficient 3D scene reconstruction, enabling the direct use of Gaussian centers for geometric representation.
zh
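
文中的几何损失基于特征值衍生的 3D 形状特征。下面按点云文献中的标准定义,演示如何由局部协方差矩阵的特征值计算 planarity、omnivariance 与 eigenentropy;这只是这些量本身的计算方式,与论文把它们作用于高斯邻域的具体实现无关。

```python
import numpy as np

def eigen_features(points):
    """由局部点集协方差的特征值计算常用 3D 形状特征(标准定义)。"""
    cov = np.cov(points.T)
    evals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # λ1 ≥ λ2 ≥ λ3
    e = evals / evals.sum()                          # 归一化特征值
    planarity = (evals[1] - evals[2]) / evals[0]
    omnivariance = np.prod(e) ** (1.0 / 3.0)
    eigenentropy = -np.sum(e * np.log(e + 1e-12))
    return planarity, omnivariance, eigenentropy

plane = np.random.rand(200, 3)
plane[:, 2] *= 0.01                                  # 近似平面的点集
print(eigen_features(plane))                         # planarity 应接近 1
```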

[CV-8] Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation

【速读】:该论文旨在解决开放词汇语义分割(Open-vocabulary semantic segmentation, OVSS)任务中的性能与效率之间的权衡问题。现有方法大多在性能或延迟方面表现不佳。论文的关键解决方案是引入ERR-Seg框架,该框架通过训练-free的通道缩减模块(Channel Reduction Module, CRM)利用视觉-语言模型如CLIP的知识来识别最相关的类别,同时剔除其他无关类别。此外,ERR-Seg还结合了高效的语义上下文融合(Efficient Semantic Context Fusion, ESCF),采用空间级和类别级序列缩减策略,从而实现显著的内存和计算节省而不影响准确性。另外,为了进一步提升精度,ERR-Seg引入了分层语义模块(Hierarchical Semantic Module, HSM)以利用中间层特征中的分层语义信息。这些措施共同作用,使得ERR-Seg相比现有最先进的方法,在ADE20K-847设置下实现了+5.6%的平均交并比(mIoU)提升,并将延迟减少了67.3%。

链接: https://arxiv.org/abs/2501.17642
作者: Lin Chen,Qi Yang,Kun Ding,Zhihao Li,Gang Shen,Fei Li,Qiyuan Cao,Shiming Xiang
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所多模态人工智能系统重点实验室), Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院), Beijing 101408, China; School of Software, Shandong University (山东大学软件学院), Jinan 250101, China; China Tower Corporation Limited (中国铁塔股份有限公司), Beijing, 100029, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. Recent advancements in large-scale vision-language models have demonstrated their open-vocabulary understanding capabilities, significantly facilitating the development of OVSS. However, most existing methods suffer from either suboptimal performance or long latency. This study introduces ERR-Seg, a novel framework that effectively reduces redundancy to balance accuracy and efficiency. ERR-Seg incorporates a training-free Channel Reduction Module (CRM) that leverages prior knowledge from vision-language models like CLIP to identify the most relevant classes while discarding others. Moreover, it incorporates Efficient Semantic Context Fusion (ESCF) with spatial-level and class-level sequence reduction strategies. CRM and ESCF result in substantial memory and computational savings without compromising accuracy. Additionally, recognizing the significance of hierarchical semantics extracted from middle-layer features for closed-set semantic segmentation, ERR-Seg introduces the Hierarchical Semantic Module (HSM) to exploit hierarchical semantics in the context of OVSS. Compared to previous state-of-the-art methods under the ADE20K-847 setting, ERR-Seg achieves +5.6% mIoU improvement and reduces latency by 67.3%.
zh

[CV-9] Efficient Interactive 3D Multi-Object Removal

【速读】:该论文旨在解决现有3D场景中多对象移除方法在粒度和灵活性方面的不足,以及在多视角修复过程中需要先验知识导致的时间消耗问题。论文的关键在于提出了一种高效的用户友好型管道,使用户能够灵活选择区域并定义移除或保留的对象。特别是,通过引入一种新颖的基于单应性变换与高置信锚点的掩码匹配与精炼模块,论文提高了跨视图对象一致性与对应性,并通过使用IoU联合形状上下文距离损失增强了变形掩码的准确性,从而改进后续修复过程。此外,论文还提供了一个新的评估数据集以应对当前3D多对象移除领域的不成熟性。实验结果表明,所提方法显著降低了计算成本,处理速度比现有最先进方法快80%以上,同时保持了等效或更高的重建质量。

链接: https://arxiv.org/abs/2501.17636
作者: Jingcheng Ni,Weiguang Zhao,Daniel Wang,Ziyao Zeng,Chenyu You,Alex Wong,Kaizhu Huang
机构: Duke Kunshan University (杜克昆山大学); Duke University (杜克大学); Yale University (耶鲁大学); Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); University of Liverpool (利物浦大学); Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object removal is of great significance to 3D scene understanding, essential for applications in content filtering and scene editing. Current mainstream methods primarily focus on removing individual objects, with a few methods dedicated to eliminating an entire area or all objects of a certain category. However, they confront the challenge of insufficient granularity and flexibility for real-world applications, where users demand tailored excision and preservation of objects within defined zones. In addition, most current methods require various kinds of priors when addressing multi-view inpainting, which is time-consuming. To address these limitations, we propose an efficient and user-friendly pipeline for 3D multi-object removal, enabling users to flexibly select areas and define objects for removal or preservation. Concretely, to ensure object consistency and correspondence across multiple views, we propose a novel mask matching and refinement module, which integrates homography-based warping with high-confidence anchor points for segmentation. By leveraging the IoU joint shape context distance loss, we enhance the accuracy of warped masks and improve subsequent inpainting processes. Considering the current immaturity of 3D multi-object removal, we provide a new evaluation dataset to bridge the developmental void. Experimental results demonstrate that our method significantly reduces computational costs, achieving processing speeds more than 80% faster than state-of-the-art methods while maintaining equivalent or higher reconstruction quality.
zh

[CV-10] Federated Learning With Individualized Privacy Through Client Sampling

【速读】:该论文旨在解决在联邦学习(Federated Learning, FL)中实现个性化差分隐私(Individualized Differential Privacy, IDP)的问题。论文的关键解决方案在于通过引入基于个人隐私预算的客户端特定采样率,并将其整合到改进的IDP-FedAvg算法中。这种方法能够根据用户的个性化隐私偏好调整隐私保护强度,从而在不牺牲数据效用的前提下提升隐私保护效果。实验结果显示,该方法相较于统一差分隐私基线及SCALE方法有显著优势,尤其在非独立同分布(non-i.i.d.)数据的任务中表现更佳。

链接: https://arxiv.org/abs/2501.17634
作者: Lucas Lange,Ole Borchardt,Erhard Rahm
机构: ScaDS.AI Dresden/Leipzig(斯卡德斯.AI 德累斯顿/莱比锡); Leipzig University(莱比锡大学); Leipzig, Germany(德国莱比锡)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With growing concerns about user data collection, individualized privacy has emerged as a promising solution to balance protection and utility by accounting for diverse user privacy preferences. Instead of enforcing a uniform level of anonymization for all users, this approach allows individuals to choose privacy settings that align with their comfort levels. Building on this idea, we propose an adapted method for enabling Individualized Differential Privacy (IDP) in Federated Learning (FL) by handling clients according to their personal privacy preferences. By extending the SAMPLE algorithm from centralized settings to FL, we calculate client-specific sampling rates based on their heterogeneous privacy budgets and integrate them into a modified IDP-FedAvg algorithm. We test this method under realistic privacy distributions and multiple datasets. The experimental results demonstrate that our approach achieves clear improvements over uniform DP baselines, reducing the trade-off between privacy and utility. Compared to the alternative SCALE method in related work, which assigns differing noise scales to clients, our method performs notably better. However, challenges remain for complex tasks with non-i.i.d. data, primarily stemming from the constraints of the decentralized setting.
zh
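
核心思想是把各客户端的个性化隐私预算换算成各自的采样率,再按采样率决定每轮参与训练的客户端。下面是一个高度简化的示意:采样率与预算成正比;真实的 SAMPLE 算法需结合差分隐私会计推导采样率,此处的比例关系仅为假设。

```python
import random

def client_sampling_rates(budgets, base_rate=0.1):
    """根据个性化隐私预算计算采样率的简化示意:预算越小,参与概率越低。"""
    max_eps = max(budgets.values())
    return {c: base_rate * eps / max_eps for c, eps in budgets.items()}

def sample_clients(rates, rng=random.Random(0)):
    """按各自采样率独立抽取本轮参与的客户端(泊松采样)。"""
    return [c for c, p in rates.items() if rng.random() < p]

budgets = {f"client_{i}": eps for i, eps in
           enumerate([1.0, 2.0, 2.0, 4.0, 8.0] * 20)}
rates = client_sampling_rates(budgets)
print(len(sample_clients(rates)), "clients selected this round")
```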

[CV-11] chnical report on label-informed logit redistribution for better domain generalization in low-shot classification with foundation models

【速读】:该论文旨在解决基于基础模型的下游视觉分类任务中的置信校准问题,特别是在少量样本情况下。由于图像-语言对不一致时,CLIP头部的logit得分仍然很大,导致难以在数据空间内解决此问题。论文提出的关键解决方案是引入一种“置信度错配惩罚(CMP)”机制,在微调过程中对错误分类进行惩罚,通过将一定程度的对数似然性转移到真实类别来实现,其量级与两个似然性的相对幅度成正比。实验结果表明,该方法在12个视觉数据集和5个领域泛化数据集上的表现优于现有的标杆方法,平均提升了6.01%的预期校准误差(ECE)。

链接: https://arxiv.org/abs/2501.17595
作者: Behraj Khan,Tahir Syed
机构: Institute of Business Administration Karachi (研究所管理卡拉奇学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Confidence calibration is an emerging challenge in real-world decision systems based on foundation models when used for downstream vision classification tasks. For various reasons, logit scores on the CLIP head remain large irrespective of whether the image-language pairs reconcile. This is difficult to address in data space, given the few-shot regime. We propose a penalty incorporated into the loss objective that penalizes incorrect classifications whenever one is made during finetuning, by moving an amount of log-likelihood to the true class commensurate with the relative amplitudes of the two likelihoods. We refer to it as the confidence misalignment penalty (CMP). Extensive experiments on 12 vision datasets and 5 domain generalization datasets support the calibration performance of our method against the state of the art. CMP outperforms the benchmarked prompt learning methods, demonstrating an average improvement in Expected Calibration Error (ECE) of 6.01%, ranging from 4.01% at minimum to 9.72% at maximum. Anonymized sample source code for this paper can be found at: this https URL
zh
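
按摘要的描述,CMP 仅在发生错分时施加惩罚,其幅度与预测类和真实类两者似然的相对大小相关。下面给出一种合理化的示意实现;惩罚项的具体公式是笔者根据摘要的假设,并非论文原式。

```python
import torch
import torch.nn.functional as F

def cmp_loss(logits, targets, lam=0.5):
    """「置信度失准惩罚」的示意实现(公式为根据摘要的合理化假设):
    仅对错分样本追加惩罚,幅度正比于预测类与真实类似然之比。"""
    ce = F.cross_entropy(logits, targets)
    probs = logits.softmax(dim=-1)
    pred = probs.argmax(dim=-1)
    wrong = pred != targets
    if wrong.any():
        p_pred = probs[wrong, pred[wrong]]      # 错分样本的预测类似然
        p_true = probs[wrong, targets[wrong]]   # 对应的真实类似然
        penalty = (p_pred / (p_true + 1e-12)).log().mean()
        return ce + lam * penalty
    return ce

logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
print(cmp_loss(logits, targets))
```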

[CV-12] Watch Your STEPP: Semantic Traversability Estimation using Pose Projected Features

【速读】:该论文旨在解决自主机器人在非结构化地形(如自然景观)中导航时,传统方法(如占用映射)难以有效评估复杂地形可穿越性的问题。关键解决方案在于通过学习人类行走示范来估计地形可穿越性,利用DINOv2视觉Transformer模型生成密集的像素级特征嵌入,并通过编码器-解码器MLP架构分析地形片段。该方法通过最小化重构误差,区分熟悉地形与不熟悉或危险地形,从而实现异常检测,使四足机器人能够在挑战性地形中更有效地导航。

链接: https://arxiv.org/abs/2501.17594
作者: Sebastian Ægidius,Dennis Hadjivelichkov,Jianhao Jiao,Jonathan Embley-Riches,Dimitrios Kanoulas
机构: Robot Perception and Learning Lab, Department of Computer Science, University College London (伦敦大学学院计算机科学系机器人感知与学习实验室); Archimedes/Athena RC (雅典研究与技术基金会阿基米德中心)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 7 figures

点击查看摘要

Abstract:Understanding the traversability of terrain is essential for autonomous robot navigation, particularly in unstructured environments such as natural landscapes. Although traditional methods, such as occupancy mapping, provide a basic framework, they often fail to account for the complex mobility capabilities of some platforms such as legged robots. In this work, we propose a method for estimating terrain traversability by learning from demonstrations of human walking. Our approach leverages dense, pixel-wise feature embeddings generated using the DINOv2 vision Transformer model, which are processed through an encoder-decoder MLP architecture to analyze terrain segments. The averaged feature vectors, extracted from the masked regions of interest, are used to train the model in a reconstruction-based framework. By minimizing reconstruction loss, the network distinguishes between familiar terrain with a low reconstruction error and unfamiliar or hazardous terrain with a higher reconstruction error. This approach facilitates the detection of anomalies, allowing a legged robot to navigate more effectively through challenging terrain. We run real-world experiments on the ANYmal legged robot both indoors and outdoors to validate our proposed method. The code is open-source, while video demonstrations can be found on our website: this https URL
zh
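
方法的主体是对(来自 DINOv2 的)区域平均特征做重构,以重构误差区分熟悉与陌生地形。下面给出编码器-解码器 MLP 与异常打分的极简示意;网络宽度与特征维度均为假设,特征此处用随机张量代替真实的 DINOv2 输出。

```python
import torch
import torch.nn as nn

class FeatureAE(nn.Module):
    """对地形特征做重构的编码器-解码器 MLP 示意(结构与维度为假设)。"""
    def __init__(self, dim=384, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 32))
        self.dec = nn.Sequential(nn.Linear(32, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, f):
        return self.dec(self.enc(f))

def traversability_score(model, feats):
    """重构误差越大,地形越「陌生」、越可能不可通行。"""
    with torch.no_grad():
        return ((model(feats) - feats) ** 2).mean(dim=-1)

model = FeatureAE()                # 实际使用前需在人类行走示范的特征上训练
feats = torch.randn(16, 384)       # 假设为 DINOv2 的区域平均特征
print(traversability_score(model, feats))
```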

[CV-13] Boosting Weak Positives for Text Based Person Search

【速读】:该论文旨在解决文本基础人物搜索(Text-Based Person Search, TBPS)任务中的挑战,特别是由于有限的数据和细粒度的特性导致的困难。论文的关键解决方案是一种动态识别并强调训练过程中具有挑战性的样本的提升技术。这种方法通过动态更新弱正样本的权重,使得排名非首位但与查询不共享身份的匹配对能够更多地影响损失函数,促使网络更加关注这些容易被误排序的样本。这一策略在四个行人数据集上实现了性能的提升,证明了所提模块的有效性。

链接: https://arxiv.org/abs/2501.17586
作者: Akshay Modi,Ashhar Aziz,Nilanjana Chatterjee,A V Subramanyam
机构: Indian Institute of Technology Delhi(印度理工学院德里分校); Indian Institute of Technology Delhi(印度理工学院德里分校); Indian Institute of Technology Delhi(印度理工学院德里分校); Indian Institute of Technology Delhi(印度理工学院德里分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large vision-language models have revolutionized cross-modal object retrieval, but text-based person search (TBPS) remains a challenging task due to limited data and the fine-grained nature of the task. Existing methods primarily focus on aligning image-text pairs into a common representation space, often disregarding the fact that real-world positive image-text pairs share a varying degree of similarity between them. This leads models to prioritize easy pairs, and in some recent approaches, challenging samples are discarded as noise during training. In this work, we introduce a boosting technique that dynamically identifies and emphasizes these challenging samples during training. Our approach is motivated by the classical boosting technique and dynamically updates the weights of the weak positives, wherein the rank-1 match does not share the identity of the query. The increased weight allows these misranked pairs to contribute more to the loss, forcing the network to pay more attention to such samples. Our method achieves improved performance across four pedestrian datasets, demonstrating the effectiveness of our proposed module.
zh
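
下面示意「动态上调弱正样本权重」的一种写法:带样本权重的图文对比损失,并对本轮 rank-1 未命中的配对放大权重。提升系数与更新规则均为假设,仅演示机制本身。

```python
import torch
import torch.nn.functional as F

def boosted_itc_loss(img, txt, weights, tau=0.07):
    """带样本权重的图文对比损失示意:rank-1 不中的「弱正样本」
    在下一轮获得更大权重(加权与更新规则为假设)。"""
    logits = img @ txt.T / tau                       # (B, B) 相似度矩阵
    targets = torch.arange(img.size(0))
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    loss = (weights * per_pair).sum() / weights.sum()
    # 动态更新:本轮被误排的配对权重乘以提升系数
    misranked = logits.argmax(dim=1) != targets
    new_w = torch.where(misranked, weights * 1.5, weights)
    return loss, new_w.detach()

B, D = 8, 256
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
w = torch.ones(B)
loss, w = boosted_itc_loss(img, txt, w)
print(loss.item(), w)
```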

[CV-14] An Exceptional Dataset For Rare Pancreatic Tumor Segmentation

【速读】:该论文旨在解决胰腺神经内分泌肿瘤(pNETs)早期检测困难的问题,由于pNETs的罕见性,从CT影像中分割这些肿瘤极具挑战。目前尚无专门针对pNETs的数据集供研究人员使用。为了解决这一问题,论文提出了一个包含469名患者数据的对比增强计算机断层扫描(CECT)数据集,这是首个仅针对pNETs的数据集。关键解决方案在于提供了一个基于UNet模型的新切片加权损失函数,以改善整体的pNET分割性能。

链接: https://arxiv.org/abs/2501.17555
作者: Wenqi Li,Yingli Chen,Keyang Zhou,Xiaoxiao Hu,Zilu Zheng,Yue Yan,Xinpeng Zhang,Wei Tang,Zhenxing Qian
机构: School of Computer Science, Fudan University (复旦大学), Shanghai China; Shanghai Medical College, Fudan University Shanghai Cancer Center (复旦大学上海肿瘤医院), Shanghai, China; School of Computer Science, University of Illinois (伊利诺伊大学香槟分校), Illinois, United States
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pancreatic NEuroendocrine Tumors (pNETs) are very rare endocrine neoplasms that account for less than 5% of all pancreatic malignancies, with an incidence of only 1-1.5 cases per 100,000. Early detection of pNETs is critical for improving patient survival, but the rarity of pNETs makes segmenting them from CT a very challenging problem. So far, there has not been a dataset specifically for pNETs available to researchers. To address this issue, we propose a pNETs dataset, a well-annotated Contrast-Enhanced Computed Tomography (CECT) dataset focused exclusively on Pancreatic Neuroendocrine Tumors, containing data from 469 patients. This is the first dataset solely dedicated to pNETs, distinguishing it from previous collections. Additionally, we provide the baseline detection networks with a new slice-wise weighted loss function designed for the UNet-based model, improving the overall pNET segmentation performance. We hope that our dataset can enhance the understanding and diagnosis of pNET Tumors within the medical community, facilitate the development of more accurate diagnostic tools, and ultimately improve patient outcomes and advance the field of oncology.
zh

[CV-15] Action Recognition Using Temporal Shift Module and Ensemble Learning ICPR2024

【速读】:该论文旨在解决多模态动作识别挑战中的问题,具体目标是从包含20种不同人类动作类别的多模态数据集中识别人体动作。解决方案的关键在于利用时间片段模型(Temporal Shift Module, TSM)高效捕捉视频数据中的时间动态,并结合多种数据输入类型。通过迁移学习和针对特定数据集的细致微调来优化性能,同时采用骨干网络选择与计算效率及识别精度之间的平衡。此外,通过集成不同模态输出的集成技术进一步提升模型性能,最终实现了测试集上的完美前1准确率,验证了所提方法在20种类别动作识别中的有效性。

链接: https://arxiv.org/abs/2501.17550
作者: Anh-Kiet Duong,Petra Gomez-Krämer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, MMVPR @ ICPR2024

点击查看摘要

Abstract:This paper presents the first-rank solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at ICPR 2024. The competition aimed to recognize human actions using a diverse dataset of 20 action classes, collected from multi-modal sources. The proposed approach is built upon the Temporal Shift Module (TSM), a technique aimed at efficiently capturing temporal dynamics in video data, incorporating multiple data input types. Our strategy included transfer learning to leverage pre-trained models, followed by meticulous fine-tuning on the challenge’s specific dataset to optimize performance for the 20 action classes. We carefully selected a backbone network to balance computational efficiency and recognition accuracy and further refined the model using an ensemble technique that integrates outputs from different modalities. This ensemble approach proved crucial in boosting the overall performance. Our solution achieved a perfect top-1 accuracy on the test set, demonstrating the effectiveness of the proposed approach in recognizing human actions across 20 classes. Our code is available online: this https URL.
zh
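
方案的核心组件 TSM 通过在时间维上移动一部分通道,以近乎零开销的方式建模时序。下面按 TSM 论文的通用做法给出该移位操作的示意实现(`shift_div=8` 为常用设置);此处只演示移位本身,不含完整的网络与集成部分。

```python
import torch

def temporal_shift(x, shift_div=8):
    """TSM 的核心移位操作:沿时间维移动部分通道,零额外计算量建模时序。
    x: (N, T, C, H, W);shift_div 为被移动通道比例的分母。"""
    n, t, c, h, w = x.size()
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # 前移:看到过去帧
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # 后移:看到未来帧
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # 其余通道不动
    return out

x = torch.randn(2, 8, 64, 14, 14)
print(temporal_shift(x).shape)  # torch.Size([2, 8, 64, 14, 14])
```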

[CV-16] owards Training-Free Open-World Classification with 3D Generative Models

【速读】:该论文致力于解决三维开放世界分类(3D Open-World Classification)在动态和非结构化真实场景中的挑战,特别是开放类别(open-category)和开放姿态(open-pose)识别的问题。现有方法主要依赖于复杂的二维预训练模型来提供丰富的稳定表示,但这些方法受限于三维物体投影到二维空间的问题尚未得到有效解决,从而显著限制了其性能。为克服这一难题,本文提出了利用三维生成模型进行三维开放世界分类的开创性探索。通过借鉴三维生成模型的丰富先验知识,作者进一步设计了一个旋转不变特征提取器(rotation-invariant feature extractor)。这种创新性的结合使得整个流程无需训练即可实现开放类别和姿态不变性,从而非常适合于三维开放世界分类任务。实验结果表明,所提方法在ModelNet10和McGill数据集上的整体准确率分别提升了32.0%和8.7%,展示了生成模型在三维开放世界分类中的潜力,并达到了当前最先进水平。

链接: https://arxiv.org/abs/2501.17547
作者: Xinzhe Xia,Weiguang Zhao,Yuyao Yan,Guanyu Yang,Rui Zhang,Kaizhu Huang,Xi Yang
机构: Xi’an Jiaotong-Liverpool University(西安交通大学利物浦大学); University of Liverpool(利物浦大学); Duke Kunshan University(杜克昆山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D open-world classification is a challenging yet essential task in dynamic and unstructured real-world scenarios, requiring both open-category and open-pose recognition. To address these challenges, recent approaches often adopt sophisticated 2D pre-trained models to provide enriched and stable representations. However, these methods largely rely on how well 3D objects can be projected into 2D space, a problem that is unfortunately not well solved and thus significantly limits their performance. Unlike these present efforts, in this paper we make a pioneering exploration of 3D generative models for 3D open-world classification. Drawing on abundant prior knowledge from 3D generative models, we additionally craft a rotation-invariant feature extractor. This innovative synergy endows our pipeline with the advantages of being training-free, open-category, and pose-invariant, thus well suited to 3D open-world classification. Extensive experiments on benchmark datasets demonstrate the potential of generative models in 3D open-world classification, achieving state-of-the-art performance on ModelNet10 and McGill with 32.0% and 8.7% overall accuracy improvement, respectively.
zh

[CV-17] 3DSES: an indoor Lidar point cloud segmentation dataset with real and pseudo-labels from a 3D model

【速读】:该论文旨在解决室内点云语义分割在数字孪生应用中的挑战,特别是针对由Terrestrial Laser Scanning (TLS) 获取的高密度点云数据。论文的关键解决方案在于提出了一种模型到点云的自动标注算法,并引入了一个新的标注格式——点级别语义标签与完整的3D CAD模型相结合。通过这种方法,论文展示了如何利用伪标签和激光强度信息提高分割精度,从而显著减少手动标注的时间,提升了现有模型在建筑信息建模(BIM)相关物体分割上的表现。

链接: https://arxiv.org/abs/2501.17534
作者: Maxime Mérizette(GeF, CEDRIC - VERTIGO),Nicolas Audebert(CEDRIC - VERTIGO, CNAM, LaSTIG, IGN),Pierre Kervella(GeF),Jérôme Verdun(GeF)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation of indoor point clouds has found various applications in the creation of digital twins for robotics, navigation and building information modeling (BIM). However, most existing datasets of labeled indoor point clouds have been acquired by photogrammetry. In contrast, Terrestrial Laser Scanning (TLS) can acquire dense sub-centimeter point clouds and has become the standard for surveyors. We present 3DSES (3D Segmentation of ESGT point clouds), a new dataset of indoor dense TLS colorized point clouds covering 427 m² of an engineering school. 3DSES has a unique double annotation format: semantic labels annotated at the point level alongside a full 3D CAD model of the building. We introduce a model-to-cloud algorithm for automated labeling of indoor point clouds using an existing 3D CAD model. 3DSES has 3 variants of various semantic and geometrical complexities. We show that our model-to-cloud alignment can produce pseudo-labels on our point clouds with > 95% accuracy, allowing us to train deep models with significant time savings compared to manual labeling. First baselines on 3DSES show the difficulties encountered by existing models when segmenting objects relevant to BIM, such as light and safety utilities. We show that segmentation accuracy can be improved by leveraging pseudo-labels and Lidar intensity, an information rarely considered in current datasets. Code and data will be open sourced.
zh

[CV-18] Solving Inverse Problems using Diffusion with Fast Iterative Renoising

【速读】:该论文旨在解决成像逆问题(Imaging Inverse Problems)在无监督条件下利用预训练扩散模型(Diffusion Models)进行求解时精度不足的问题。关键在于提出了一种名为“DDfire”的新方法,通过在每次扩散步骤中多次重新估计和添加精心设计的彩色噪声来改进图像重建过程,从而确保预训练模型接收到符合其训练方式的白高斯误差。

链接: https://arxiv.org/abs/2501.17468
作者: Matt C. Bendel,Saurav K. Shastri,Rizwan Ahmad,Philip Schniter
机构: The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Imaging inverse problems can be solved in an unsupervised manner using pre-trained diffusion models. In most cases, that involves approximating the gradient of the measurement-conditional score function in the reverse process. Since the approximations produced by existing methods are quite poor, especially early in the reverse process, we propose a new approach that re-estimates and renoises the image several times per diffusion step. Renoising adds carefully shaped colored noise that ensures the pre-trained diffusion model sees white-Gaussian error, in accordance with how it was trained. We demonstrate the effectiveness of our “DDfire” method at 20, 100, and 1000 neural function evaluations on linear inverse problems and phase retrieval.
zh
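
下面给出「每个扩散步内多次估计-再加噪」的骨架示意:先去噪得到 x₀ 估计,施加测量一致性约束,再加噪回同一噪声水平。注意 DDfire 注入的是精心整形的有色噪声,此处为简化以白噪声近似;`denoise` 与 `data_consistency` 的接口均为假设,演示中用占位函数代替。

```python
import torch

def renoise_step(x0_hat, alpha_bar_t):
    """把干净图像估计重新加噪回噪声水平 t(DDPM 前向式);
    DDfire 实际注入精心整形的有色噪声,此处以白噪声简化。"""
    return (alpha_bar_t.sqrt() * x0_hat
            + (1 - alpha_bar_t).sqrt() * torch.randn_like(x0_hat))

def fire_inner_loop(x_t, denoise, data_consistency, alpha_bar_t, iters=4):
    """单个扩散步内多次「估计 -> 一致性 -> 再加噪」的骨架。"""
    for _ in range(iters):
        x0_hat = denoise(x_t, alpha_bar_t)        # 由当前带噪图估计 x0
        x0_hat = data_consistency(x0_hat)         # 施加测量一致性(逆问题约束)
        x_t = renoise_step(x0_hat, alpha_bar_t)   # 回到同一噪声水平
    return x_t

x = torch.randn(1, 3, 32, 32)
out = fire_inner_loop(x, denoise=lambda z, a: 0.9 * z,
                      data_consistency=lambda z: z.clamp(-1, 1),
                      alpha_bar_t=torch.tensor(0.5))
print(out.shape)
```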

[CV-19] SIGN: A Statistically-Informed Gaze Network for Gaze Time Prediction

【速读】:该论文旨在解决图像注视时间预测的问题。关键解决方案在于提出了SIGN(Statistically-Informed Gaze Network),一种融合了统计模型与深度学习技术(包括卷积神经网络CNNs和视觉变换器Visual Transformers)的方法,能够从图像整体注视时间预测出各个区域的注视概率图,进而推导出潜在的注视模式。实验结果表明,SIGN在两个数据集上均显著提升了注视时间预测的准确性,并能生成与实际注视模式相符的合理注视模式。

链接: https://arxiv.org/abs/2501.17422
作者: Jianping Ye,Michel Wedel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注: 4 pages, 2 figures

点击查看摘要

Abstract:We propose a first version of SIGN, a Statistically-Informed Gaze Network, to predict aggregate gaze times on images. We develop a foundational statistical model for which we derive a deep learning implementation involving CNNs and Visual Transformers, which enables the prediction of overall gaze times. The model enables us to derive from the aggregate gaze times the underlying gaze pattern as a probability map over all regions in the image, where each region’s probability represents the likelihood of being gazed at across all possible scan-paths. We test SIGN’s performance on AdGaze3500, a dataset of images of ads with aggregate gaze times, and on COCO-Search18, a dataset with individual-level fixation patterns collected during search. We demonstrate that SIGN (1) improves gaze duration prediction significantly over state-of-the-art deep learning benchmarks on both datasets, and (2) can deliver plausible gaze patterns that correspond to empirical fixation patterns in COCO-Search18. These results suggest that the first version of SIGN holds promise for gaze-time predictions and deserves further development.
zh

[CV-20] Assessing the Capability of YOLO- and Transformer-based Object Detectors for Real-time Weed Detection

【速读】:该论文旨在解决在实际田间条件下实时区分作物与杂草,甚至不同杂草种类的问题。解决方案的关键在于比较当前最先进的目标检测模型(包括 YOLOv8、YOLOv9、YOLOv10 和 RT-DETR)的性能,并通过训练和评估这些模型来确定最适合实时应用的模型。实验结果表明,YOLOv9 模型(特别是 YOLOv9s 和 YOLOv9e)在召回率及平均精度指标方面表现优异,而 RT-DETR 模型(尤其是 RT-DETR-l)则在精确度方面表现出色。此外,最小的 YOLO 模型变体(如 YOLOv8n、YOLOv9t 和 YOLOv10n)提供了快速的推理时间,适合资源受限的嵌入式计算设备部署。

链接: https://arxiv.org/abs/2501.17387
作者: Alicia Allmendinger,Ahmet Oğuz Saltık,Gerassimos G. Peteinatos,Anthony Stein,Roland Gerhards
机构: University of Hohenheim; Department of Artificial Intelligence in Agricultural Engineering & Computational Science Hub; ELGO - “DIMITRA”
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spot spraying represents an efficient and sustainable method for reducing the amount of pesticides, particularly herbicides, used in agricultural fields. To achieve this, it is of utmost importance to reliably differentiate between crops and weeds, and even between individual weed species in situ and under real-time conditions. To assess suitability for real-time application, different object detection models that are currently state-of-the-art are compared. All available models of YOLOv8, YOLOv9, YOLOv10, and RT-DETR are trained and evaluated with images from a real field situation. The images are separated into two distinct datasets: In the initial data set, each species of plants is trained individually; in the subsequent dataset, a distinction is made between monocotyledonous weeds, dicotyledonous weeds, and three chosen crops. The results demonstrate that while all models perform equally well in the metrics evaluated, the YOLOv9 models, particularly the YOLOv9s and YOLOv9e, stand out in terms of their strong recall scores (66.58 % and 72.36 %), as well as mAP50 (73.52 % and 79.86 %), and mAP50-95 (43.82 % and 47.00 %) in dataset 2. However, the RT-DETR models, especially RT-DETR-l, excel in precision with reaching 82.44 % on dataset 1 and 81.46 % in dataset 2, making them particularly suitable for scenarios where minimizing false positives is critical. In particular, the smallest variants of the YOLO models (YOLOv8n, YOLOv9t, and YOLOv10n) achieve substantially faster inference times down to 7.58 ms for dataset 2 on the NVIDIA GeForce RTX 4090 GPU for analyzing one frame, while maintaining competitive accuracy, highlighting their potential for deployment in resource-constrained embedded computing devices as typically used in productive setups.
zh

[CV-21] On the Coexistence and Ensembling of Watermarks

【速读】:该论文旨在研究不同深度图像水印方法之间的共存性,并探索它们在不显著影响图像质量和解码鲁棒性的情况下能否协同工作。论文的关键在于发现多种开源水印可以在同一图像中良好共存,并提出通过集成多种水印方法来增加总体信息容量,同时实现容量、准确性、鲁棒性和图像质量之间的新权衡,而无需重新训练基础模型。

链接: https://arxiv.org/abs/2501.17356
作者: Aleksandar Petrov,Shruti Agarwal,Philip H.S. Torr,Adel Bibi,John Collomosse
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Watermarking, the practice of embedding imperceptible information into media such as images, videos, audio, and text, is essential for intellectual property protection, content provenance and attribution. The growing complexity of digital ecosystems necessitates watermarks for different uses to be embedded in the same media. However, to detect and decode all watermarks, they need to coexist well with one another. We perform the first study of coexistence of deep image watermarking methods and, contrary to intuition, we find that various open-source watermarks can coexist with only minor impacts on image quality and decoding robustness. The coexistence of watermarks also opens the avenue for ensembling watermarking methods. We show how ensembling can increase the overall message capacity and enable new trade-offs between capacity, accuracy, robustness and image quality, without needing to retrain the base models.
zh

[CV-22] Post-Training Quantization for 3D Medical Image Segmentation: A Practical Study on Real Inference Engines

【速读】:该论文旨在解决在大规模医学影像应用中,如何通过量化深度神经网络来显著减少内存使用并加速处理的问题。现有的许多方法仅研究“模拟量化”(fake quantization),即在推理过程中模拟较低精度操作,但并未实际减小模型大小或提高真实世界的推理速度。此外,将真实的三维低比特量化部署到现代GPU上的潜力尚未得到探索。论文的关键解决方案在于引入了一种真实的后训练量化(PTQ)框架,成功地在最先进的3D医学分割模型上实现了真正的8位量化。该框架包括两个主要步骤:首先使用TensorRT进行无标签校准数据集的权重和激活的模拟量化;其次,通过TensorRT引擎将这种模拟量化转换为真实的量化,从而实现在真实GPU上的模型大小和推理延迟的实际降低。

链接: https://arxiv.org/abs/2501.17343
作者: Chongyu Qu,Ritchie Zhao,Ye Yu,Bin Liu,Tianyuan Yao,Junchao Zhu,Bennett A. Landman,Yucheng Tang,Yuankai Huo
机构: Vanderbilt University (范德比尔特大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantizing deep neural networks, i.e., reducing the precision (bit-width) of their computations, can remarkably decrease memory usage and accelerate processing, making these models more suitable for large-scale medical imaging applications with limited computational resources. However, many existing methods studied “fake quantization”, which simulates lower precision operations during inference, but does not actually reduce model size or improve real-world inference speed. Moreover, the potential of deploying real 3D low-bit quantization on modern GPUs is still unexplored. In this study, we introduce a real post-training quantization (PTQ) framework that successfully implements true 8-bit quantization on state-of-the-art (SOTA) 3D medical segmentation models, i.e., U-Net, SegResNet, SwinUNETR, nnU-Net, UNesT, TransUNet, ST-UNet, and VISTA3D. Our approach involves two main steps. First, we use TensorRT to perform fake quantization for both weights and activations with an unlabeled calibration dataset. Second, we convert this fake quantization into real quantization via the TensorRT engine on real GPUs, resulting in real-world reductions in model size and inference latency. Extensive experiments demonstrate that our framework effectively performs 8-bit quantization on GPUs without sacrificing model performance. This advancement enables the deployment of efficient deep learning models in medical imaging applications where computational resources are constrained. The code and models have been released, including U-Net, TransUNet pretrained on the BTCV dataset for abdominal (13-label) segmentation, UNesT pretrained on the Whole Brain Dataset for whole brain (133-label) segmentation, and nnU-Net, SegResNet, SwinUNETR and VISTA3D pretrained on TotalSegmentator V2 for full body (104-label) segmentation. this https URL.
zh

[CV-23] WASUP: Interpretable Classification with Weight-Input Alignment and Class-Discriminative SUPports Vectors

【速读】:该论文旨在解决深度学习模型在关键领域应用中高精度与可解释性之间的平衡问题。解决方案的关键在于引入了WASUP(Weight-Aligned Similarity-based Universal Predictor),这是一种固有的可解释神经网络,通过案例推理的概念从训练图像中提取类别代表性支持向量,确保捕获相关特征并抑制无关特征。WASUP利用B-Cos变换将模型权重与输入对齐,实现潜在特征到输入空间的忠实映射,从而提供局部和全局的解释,确保解释的准确性并验证了其理论基础。

链接: https://arxiv.org/abs/2501.17328
作者: Tom Nuno Wolf,Christian Wachinger
机构: Lab for Artificial Intelligence in Medical Imaging, Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The deployment of deep learning models in critical domains necessitates a balance between high accuracy and interpretability. We introduce WASUP, an inherently interpretable neural network that provides local and global explanations of its decision-making process. We prove that these explanations are faithful by fulfilling established axioms for explanations. Leveraging the concept of case-based reasoning, WASUP extracts class-representative support vectors from training images, ensuring they capture relevant features while suppressing irrelevant ones. Classification decisions are made by calculating and aggregating similarity scores between these support vectors and the input’s latent feature vector. We employ B-Cos transformations, which align model weights with inputs to enable faithful mappings of latent features back to the input space, facilitating local explanations in addition to global explanations of case-based reasoning. We evaluate WASUP on three tasks: fine-grained classification on Stanford Dogs, multi-label classification on Pascal VOC, and pathology detection on the RSNA dataset. Results indicate that WASUP not only achieves competitive accuracy compared to state-of-the-art black-box models but also offers insightful explanations verified through theoretical analysis. Our findings underscore WASUP’s potential for applications where understanding model decisions is as critical as the decisions themselves.
zh
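
WASUP 的分类是对输入潜在特征与各类代表性支持向量的相似度做计算并按类聚合。下面给出按余弦相似度并按类求和的示意;聚合方式与特征维度均为假设,并非论文原始打分函数。

```python
import torch
import torch.nn.functional as F

def wasup_scores(latent, support_vectors, support_labels, num_classes):
    """基于支持向量相似度聚合的分类示意(聚合方式为假设):
    计算输入特征与全部支持向量的余弦相似度,再按支持向量所属类别求和。"""
    sim = F.normalize(latent, dim=-1) @ F.normalize(support_vectors, dim=-1).T
    scores = torch.zeros(latent.size(0), num_classes)
    scores.index_add_(1, support_labels, sim)      # 按类别聚合相似度
    return scores

latent = torch.randn(4, 128)
sv = torch.randn(30, 128)                          # 每类 10 个支持向量,共 3 类
labels = torch.arange(3).repeat_interleave(10)
print(wasup_scores(latent, sv, labels, 3).shape)   # torch.Size([4, 3])
```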

[CV-24] Influence of field of view in visual prostheses design: Analysis with a VR system

【速读】:该论文旨在研究视野范围与空间分辨率对视觉假体感知准确性及响应时间的影响。关键在于通过模拟实验系统评估不同视野范围下的搜索与识别任务表现,发现准确性与响应时间随视野增大而降低,并且性能与视角分辨率相关,但当每度磷光体数量低于2.3时收益递减。因此,论文建议在设计视网膜假体时应集中磷光体于小区域以最大化视角分辨率,即使这意味着牺牲视野范围。

链接: https://arxiv.org/abs/2501.17322
作者: Melani Sanchez-Garcia,Ruben Martinez-Cantin,Jesus Bermudez-Cameo,Jose J. Guerrero
机构: Instituto de Investigación en Ingeniería de Aragón (I3A)(阿贡研究所); Universidad de Zaragoza (萨拉戈萨大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual prostheses are designed to restore partial functional vision in patients with total vision loss. Retinal visual prostheses provide limited capabilities as a result of low resolution, a limited field of view and poor dynamic range. Understanding the influence of these parameters on perception results can guide prostheses research and design. In this work, we evaluate the influence of field of view with respect to spatial resolution in visual prostheses, measuring the accuracy and response time in a search and recognition task. Twenty-four normally sighted participants were asked to find and recognize usual objects, such as furniture and home appliances, in indoor room scenes. For the experiment, we use a new simulated prosthetic vision system that allows simple and effective experimentation. Our system uses a virtual-reality environment based on panoramic scenes. The simulator employs a head-mounted display which allows users to feel immersed in the scene by perceiving the entire scene all around. Our experiments use public image datasets and a commercial head-mounted display. We have also released the virtual-reality software for replicating and extending the experimentation. Results show that the accuracy and response time decrease when the field of view is increased. Furthermore, performance appears to be correlated with the angular resolution, but shows diminishing returns even at a resolution of less than 2.3 phosphenes per degree. Our results seem to indicate that, for the design of retinal prostheses, it is better to concentrate the phosphenes in a small area, to maximize the angular resolution, even if that implies sacrificing field of view.
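The reported trade-off is easy to sanity-check numerically: for a fixed phosphene budget, angular resolution falls as the field of view grows. The toy computation below assumes a square phosphene grid and a 1,000-phosphene budget, both of which are illustrative.

```python
# Toy check: phosphenes per degree for a fixed budget on a square grid.
import math

def phosphenes_per_degree(n_phosphenes, fov_degrees):
    return math.sqrt(n_phosphenes) / fov_degrees

for fov in (10, 20, 40):
    print(f"{fov} deg FOV -> {phosphenes_per_degree(1000, fov):.2f} phosphenes/deg")
# 10 deg -> 3.16, 20 deg -> 1.58 (below the ~2.3/deg knee), 40 deg -> 0.79
```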

[CV-25] A Contrastive Teacher-Student Framework for Novelty Detection under Style Shifts

【速读】: This paper tackles the significant performance drop of Novelty Detection (ND) methods under small, environment-induced distribution shifts (style shifts). It observes that, because out-of-distribution (OOD) samples are absent during training, ND models tend to rely on the dominant style features of the in-distribution (ID) data and mistakenly associate style features with core features. The key to the solution is to craft an auxiliary OOD set and use a task-based knowledge distillation strategy to separate core features from style features, so that the model can rely on core features for discrimination.

链接: https://arxiv.org/abs/2501.17289
作者: Hossein Mirzaei,Mojtaba Nafez,Moein Madadi,Arad Maleki,Mahdi Hajialilue,Zeinab Sadat Taghavi,Sepehr Rezaee,Ali Ansari,Bahar Dibaei Nia,Kian Shamsaie,Mohammadreza Salehi,Mackenzie W. Mathis,Mahdieh Soleymani Baghshah,Mohammad Sabokrou,Mohammad Hossein Rohban
机构: Sharif University of Technology, Iran; Ludwig-Maximilians-Universität München (LMU), Germany; Shahid Beheshti University, Iran; École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The code repository is available at: this https URL

点击查看摘要

Abstract:There have been several efforts to improve Novelty Detection (ND) performance. However, ND methods often suffer significant performance drops under minor distribution shifts caused by changes in the environment, known as style shifts. This challenge arises from the ND setup, where the absence of out-of-distribution (OOD) samples during training causes the detector to be biased toward the dominant style features in the in-distribution (ID) data. As a result, the model mistakenly learns to correlate style with core features, using this shortcut for detection. Robust ND is crucial for real-world applications like autonomous driving and medical imaging, where test samples may have different styles than the training data. Motivated by this, we propose a robust ND method that crafts an auxiliary OOD set with style features similar to the ID set but with different core features. Then, a task-based knowledge distillation strategy is utilized to distinguish core features from style features and help our model rely on core features for discriminating crafted OOD and ID sets. We verified the effectiveness of our method through extensive experimental evaluations on several datasets, including synthetic and real-world benchmarks, against nine different ND methods.
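For orientation, the sketch below shows the generic teacher-student discrepancy score that distillation-based novelty detection builds on: the student is trained to match the teacher on ID data, so their outputs diverge on unfamiliar inputs. The paper's task-based distillation and crafted auxiliary OOD set are richer than this minimal version.

```python
# Generic teacher-student novelty scoring (stand-in backbones, illustrative sizes).
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(64, 32)).eval()   # frozen reference network
student = nn.Sequential(nn.Linear(64, 32))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(x):
    """Match the student to the teacher on in-distribution data."""
    loss = (student(x) - teacher(x).detach()).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def novelty_score(x):
    """Teacher-student discrepancy: large on inputs unlike the training data."""
    with torch.no_grad():
        return (student(x) - teacher(x)).pow(2).mean(dim=-1)

for _ in range(100):
    distill_step(torch.randn(32, 64))                # ID training batches
scores = novelty_score(torch.randn(4, 64))           # higher => more novel
```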

[CV-26] Advancing the Biological Plausibility and Efficacy of Hebbian Convolutional Neural Networks

【速读】: This paper studies how to effectively integrate Hebbian learning rules into Convolutional Neural Networks (CNNs) for image processing. The key is an optimized architecture that combines competition mechanisms, namely hard Winner-Takes-All (WTA) competition and a Gaussian lateral inhibition mechanism, with the Bienenstock-Cooper-Munro (BCM) learning rule in a single model, significantly improving classification performance: 76% accuracy on CIFAR-10, close to the end-to-end backpropagation variant and surpassing the state-of-the-art hard-WTA results reported for CNNs of the same network depth.

链接: https://arxiv.org/abs/2501.17266
作者: Julian Jimenez Nimmo,Esther Mondragon
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages, 14 figures

点击查看摘要

Abstract:The research presented in this paper advances the integration of Hebbian learning into Convolutional Neural Networks (CNNs) for image processing, systematically exploring different architectures to build an optimal configuration, adhering to biological tenability. Hebbian learning operates on local unsupervised neural information to form feature representations, providing an alternative to the popular but arguably biologically implausible and computationally intensive backpropagation learning algorithm. The suggested optimal architecture significantly enhances recent research aimed at integrating Hebbian learning with competition mechanisms and CNNs, expanding their representational capabilities by incorporating hard Winner-Takes-All (WTA) competition, Gaussian lateral inhibition mechanisms and Bienenstock-Cooper-Munro (BCM) learning rule in a single model. The resulting model achieved 76% classification accuracy on CIFAR-10, rivalling its end-to-end backpropagation variant (77%) and critically surpassing the state-of-the-art hard-WTA performance in CNNs of the same network depth (64.6%) by 11.4%. Moreover, results showed clear indications of sparse hierarchical learning through increasingly complex and abstract receptive fields. In summary, our implementation enhances both the performance and the generalisability of the learnt representations and constitutes a crucial step towards more biologically realistic artificial neural networks.
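As a reference point, here is a minimal Hebbian update with hard Winner-Takes-All competition, where only the winning unit moves its weights toward the input; the learning rate and layer sizes are illustrative, and the paper's full model additionally uses Gaussian lateral inhibition and the BCM rule.

```python
# Minimal hard-WTA Hebbian learning: the strongest-responding unit learns.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))                  # 16 competing units, 64 inputs

def hebbian_wta_step(x, W, lr=0.01):
    y = W @ x                                  # unit responses
    k = int(np.argmax(y))                      # hard WTA: a single winner
    W[k] += lr * y[k] * (x - W[k])             # move the winner toward the input
    return W

for _ in range(1000):
    W = hebbian_wta_step(rng.normal(size=64), W)
```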

[CV-27] ViT-2SPN: Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification

【速读】: This paper addresses the challenges of developing OCT-based diagnostic tools, including limited public datasets, sparse annotations, and privacy concerns, which remain unresolved despite progress in deep learning for automated OCT analysis. To overcome these limitations, it proposes the Vision Transformer-based Dual-Stream Self-Supervised Pretraining Network (ViT-2SPN). The key is a three-stage workflow of supervised pretraining, self-supervised pretraining (SSP), and supervised fine-tuning: the OCTMNIST dataset is used with data augmentation, a Vision Transformer (ViT-Base) backbone extracts features, and a negative cosine similarity loss aligns the feature representations, strengthening feature extraction and improving diagnostic accuracy.

链接: https://arxiv.org/abs/2501.17260
作者: Mohammadreza Saraei,Igor Kozak,Eung-Joo Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Optical Coherence Tomography (OCT) is a non-invasive imaging modality essential for diagnosing various eye diseases. Despite its clinical significance, developing OCT-based diagnostic tools faces challenges, such as limited public datasets, sparse annotations, and privacy concerns. Although deep learning has made progress in automating OCT analysis, these challenges remain unresolved. To address these limitations, we introduce the Vision Transformer-based Dual-Stream Self-Supervised Pretraining Network (ViT-2SPN), a novel framework designed to enhance feature extraction and improve diagnostic accuracy. ViT-2SPN employs a three-stage workflow: Supervised Pretraining, Self-Supervised Pretraining (SSP), and Supervised Fine-Tuning. The pretraining phase leverages the OCTMNIST dataset (97,477 unlabeled images across four disease classes) with data augmentation to create dual-augmented views. A Vision Transformer (ViT-Base) backbone extracts features, while a negative cosine similarity loss aligns feature representations. Pretraining is conducted over 50 epochs with a learning rate of 0.0001 and momentum of 0.999. Fine-tuning is performed on a stratified 5.129% subset of OCTMNIST using 10-fold cross-validation. ViT-2SPN achieves a mean AUC of 0.93, accuracy of 0.77, precision of 0.81, recall of 0.75, and an F1 score of 0.76, outperforming existing SSP-based methods.
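The negative cosine similarity objective used to align the two augmented views can be written in a few lines; the stop-gradient on the second branch is a common SimSiam-style choice and is our assumption here.

```python
# Negative cosine similarity between the two augmented views' representations.
import torch
import torch.nn.functional as F

def neg_cosine_loss(p, z):
    """p: prediction from view 1; z: representation from view 2 (no gradient)."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

p1, z2 = torch.randn(8, 256), torch.randn(8, 256)   # batch of 8, 256-dim features
loss = neg_cosine_loss(p1, z2)                       # minimized during SSP
```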

[CV-28] Separated Inter/Intra-Modal Fusion Prompts for Compositional Zero-Shot Learning

【速读】: This paper targets the problem of accurately recognizing subtle semantic differences and compositions of states and objects in Compositional Zero-Shot Learning (CZSL). Existing methods mainly focus on prompt configuration or on prompt-tuning pre-trained vision-language models, but struggle with these challenges. The key to the solution is a diverse prompt-learning method with an inter/intra-modality fusion synthesizer, which improves attribute recognition and yields a more efficient and effective CZSL technique for understanding scenes involving subtle semantic differences and multiple objects.

链接: https://arxiv.org/abs/2501.17171
作者: Sua Jung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: AIAP 2025

点击查看摘要

Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize subtle differences in meaning or the combination of states and objects through the use of known and unknown concepts during training. Existing methods either focused on prompt configuration or on using prompts to tune the pre-trained Vision-Language model. However, these methods faced challenges in accurately identifying subtle differences in meaning or combining states with objects. To jointly eradicate the above issues and construct an efficient and effective CZSL technique, we suggest a method to improve attribute recognition performance by utilizing diverse Prompt Learning with an Inter/Intra-Modality Fusion Synthesizer in scene understanding involving subtle semantic differences and multiple objects.

[CV-29] Aggregation Schemes for Single-Vector WSI Representation Learning in Digital Pathology

【速读】: This paper addresses a crucial step in efficiently integrating Whole Slide Images (WSIs) into computational pathology: assigning each WSI a single high-quality feature vector. Because WSIs are high-resolution, gigapixel images, feeding them into existing GPUs as a single input is infeasible, so the key is to split each WSI into patches, extract an embedding per patch with a pre-trained model, and then reduce the resulting set of patch embeddings to a single WSI embedding. To this end, the paper evaluates multiple set representation learning schemes, including simple average or max pooling, Deep Sets, Memory networks, Focal attention, a Gaussian Mixture Model (GMM) Fisher Vector, and deep sparse and binary Fisher Vectors, on four primary sites from TCGA (bladder, breast, kidney, and colon), comparing their search performance against a non-aggregating baseline, the median of minimum patch-embedding distances.

链接: https://arxiv.org/abs/2501.17822
作者: Sobhan Hemati,Ghazal Alabtah,Saghir Alfasly,H.R. Tizhoosh
机构: KIMIA Lab (KIMIA实验室), Department of Artificial Intelligence and Informatics (人工智能与信息学系), Mayo Clinic (梅奥诊所), Rochester, MN, USA (美国明尼苏达州罗切斯特)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:A crucial step to efficiently integrate Whole Slide Images (WSIs) in computational pathology is assigning a single high-quality feature vector, i.e., one embedding, to each WSI. With the existence of many pre-trained deep neural networks and the emergence of foundation models, extracting embeddings for sub-images (i.e., tiles or patches) is straightforward. However, for WSIs, given their high resolution and gigapixel nature, inputting them into existing GPUs as a single image is not feasible. As a result, WSIs are usually split into many patches. Feeding each patch to a pre-trained model, each WSI can then be represented by a set of patches, hence a set of embeddings. In such a setup, WSI representation learning reduces to set representation learning, where for each WSI we have access to a set of patch embeddings. To obtain a single embedding from a set of patch embeddings for each WSI, multiple set-based learning schemes have been proposed in the literature. In this paper, we evaluate the WSI search performance of multiple recently developed aggregation techniques (mainly set representation learning techniques), including simple average or max pooling operations, Deep Sets, Memory networks, Focal attention, Gaussian Mixture Model (GMM) Fisher Vector, and deep sparse and binary Fisher Vector, on four different primary sites, including bladder, breast, kidney, and colon, from TCGA. Further, we benchmark the search performance of these methods against the median of minimum distances of patch embeddings, a non-aggregating approach used for WSI retrieval.
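The two simplest baselines in the comparison, average and max pooling over a WSI's set of patch embeddings, look as follows; the shapes are illustrative.

```python
# Set-to-vector aggregation baselines for one WSI's patch embeddings.
import torch

patch_embeddings = torch.randn(3000, 768)      # one WSI: 3000 patches x 768 dims

wsi_mean = patch_embeddings.mean(dim=0)        # (768,) average-pooled WSI embedding
wsi_max = patch_embeddings.max(dim=0).values   # (768,) max-pooled alternative
```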

[CV-30] Glioma Multimodal MRI Analysis System for Tumor Layered Diagnosis via Multi-task Semi-supervised Learning

【速读】: This paper addresses the lack of inter-dependency modeling when the tasks in glioma multimodal MRI analysis are handled independently. The key to the solution is the proposed Glioma Multimodal MRI Analysis System (GMMAS), which uses a deep learning network to process multiple related tasks simultaneously, exploiting their inter-dependencies through an uncertainty-based multi-task learning architecture. GMMAS synchronously outputs tumor region segmentation, histological subtype, IDH mutation genotype, and 1p/19q chromosome abnormality status, and a two-stage semi-supervised learning method fully exploits both labeled and unlabeled MRI samples to boost performance. In addition, an adaptation module based on knowledge self-distillation and contrastive learning enables cross-modal feature extraction, keeping GMMAS robust when modalities are missing and revealing the differing importance of the individual MRI modalities.

链接: https://arxiv.org/abs/2501.17758
作者: Yihao Liu,Zhihao Cui,Liming Li,Junjie You,Xinle Feng,Jianxin Wang,Xiangyu Wang,Qing Liu,Minghua Wu
机构: Cancer Research Institute, Central South University (中南大学癌症研究所), Changsha 410083, China; School of Physics & Electronic Science, Changsha University of Science & Technology (长沙理工大学物理与电子科学学院), Changsha 410114, Hunan, China; IFLYTEK Research (科大讯飞研究院), Anhui 230088, Hefei, China; School of Life Sciences, Central South University (中南大学生命科学学院), Changsha 410083, China; Department of Radiology, Xiangya Hospital, Central South University (中南大学湘雅医院放射科), Changsha 410008, Hunan, China; School of Computer Science and Engineering, Central South University (中南大学计算机科学与工程学院), Changsha 410083, China; Department of Neurosurgery, Xiangya Hospital, Central South University (中南大学湘雅医院神经外科), Changsha 410008, Hunan, China
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 13 figures

点击查看摘要

Abstract:Gliomas are the most common primary tumors of the central nervous system. Multimodal MRI is widely used for the preliminary screening of gliomas and plays a crucial role in auxiliary diagnosis, therapeutic efficacy, and prognostic evaluation. Currently, the computer-aided diagnostic studies of gliomas using MRI have focused on independent analysis events such as tumor segmentation, grading, and radiogenomic classification, without studying inter-dependencies among these events. In this study, we propose a Glioma Multimodal MRI Analysis System (GMMAS) that utilizes a deep learning network for processing multiple events simultaneously, leveraging their inter-dependencies through an uncertainty-based multi-task learning architecture and synchronously outputting tumor region segmentation, glioma histological subtype, IDH mutation genotype, and 1p/19q chromosome disorder status. Compared with the reported single-task analysis models, GMMAS improves the precision across tumor layered diagnostic tasks. Additionally, we have employed a two-stage semi-supervised learning method, enhancing model performance by fully exploiting both labeled and unlabeled MRI samples. Further, by utilizing an adaptation module based on knowledge self-distillation and contrastive learning for cross-modal feature extraction, GMMAS exhibited robustness in situations of modality absence and revealed the differing significance of each MRI modal. Finally, based on the analysis outputs of the GMMAS, we created a visual and user-friendly platform for doctors and patients, introducing GMMAS-GPT to generate personalized prognosis evaluations and suggestions.
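A common way to realize the uncertainty-based multi-task weighting that GMMAS describes is the learned homoscedastic-uncertainty loss of Kendall et al. (2018), sketched below; whether the paper uses this exact form is an assumption on our part.

```python
# Uncertainty-weighted multi-task loss: each task learns a log-variance that
# scales its loss, so noisier tasks are down-weighted automatically.
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, n_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))  # learned log sigma^2

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total

# e.g., segmentation, subtype, IDH genotype, and 1p/19q status losses:
criterion = UncertaintyWeightedLoss(n_tasks=4)
```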

[CV-31] PulmoFusion: Advancing Pulmonary Health with Efficient Multi-Modal Fusion

【速读】: This paper addresses the limited precision of traditional remote pulmonary monitoring methods such as conventional remote spirometry. To this end, it proposes a novel non-invasive approach that combines multimodal predictive models integrating RGB or thermal video data with patient metadata. The key is to use energy-efficient Spiking Neural Networks (SNNs) for regressing Peak Expiratory Flow (PEF) and classifying Forced Expiratory Volume in one second (FEV1) and Forced Vital Capacity (FVC), with lightweight Convolutional Neural Networks (CNNs) compensating for SNN limitations in regression tasks. A Multi-Head Attention Layer further strengthens the integration of multimodal data, and K-Fold validation and ensemble learning are employed to improve robustness.

链接: https://arxiv.org/abs/2501.17699
作者: Ahmed Sharshar,Yasser Attia,Mohammad Yaqub,Mohsen Guizani
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional remote spirometry lacks the precision required for effective pulmonary monitoring. We present a novel, non-invasive approach using multimodal predictive models that integrate RGB or thermal video data with patient metadata. Our method leverages energy-efficient Spiking Neural Networks (SNNs) for the regression of Peak Expiratory Flow (PEF) and classification of Forced Expiratory Volume (FEV1) and Forced Vital Capacity (FVC), using lightweight CNNs to overcome SNN limitations in regression tasks. Multimodal data integration is improved with a Multi-Head Attention Layer, and we employ K-Fold validation and ensemble learning to boost robustness. Using thermal data, our SNN models achieve 92% accuracy on a breathing-cycle basis and 99.5% patient-wise. PEF regression models attain Relative RMSEs of 0.11 (thermal) and 0.26 (RGB), with an MAE of 4.52% for FEV1/FVC predictions, establishing state-of-the-art performance. Code and dataset can be found on this https URL

[CV-32] Dual Invariance Self-training for Reliable Semi-supervised Surgical Phase Recognition

【速读】: This paper addresses accuracy and the scarcity of annotated data in surgical phase recognition. The key to the solution is Dual Invariance Self-Training (DIST), a novel semi-supervised learning framework that incorporates both temporal and transformation invariance to enhance surgical phase recognition. Its two-step self-training process dynamically selects reliable pseudo-labels, ensuring robust pseudo-supervision, mitigating the risk of noisy pseudo-labels, steering decision boundaries closer to the true data distribution, and improving generalization to unseen data.

链接: https://arxiv.org/abs/2501.17628
作者: Sahar Nasirihaghighi,Negin Ghamsarian,Raphael Sznitman,Klaus Schoeffmann
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate surgical phase recognition is crucial for advancing computer-assisted interventions, yet the scarcity of labeled data hinders training reliable deep learning models. Semi-supervised learning (SSL), particularly with pseudo-labeling, shows promise over fully supervised methods but often lacks reliable pseudo-label assessment mechanisms. To address this gap, we propose a novel SSL framework, Dual Invariance Self-Training (DIST), that incorporates both Temporal and Transformation Invariance to enhance surgical phase recognition. Our two-step self-training process dynamically selects reliable pseudo-labels, ensuring robust pseudo-supervision. Our approach mitigates the risk of noisy pseudo-labels, steering decision boundaries toward true data distribution and improving generalization to unseen data. Evaluations on Cataract and Cholec80 datasets show our method outperforms state-of-the-art SSL approaches, consistently surpassing both supervised and SSL baselines across various network architectures.
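For context, plain confidence-thresholded pseudo-label selection, the baseline that DIST's dual temporal/transformation invariance checks refine, looks like this; the threshold is illustrative.

```python
# Keep only high-confidence predictions as pseudo-labels for self-training.
import torch

def select_pseudo_labels(logits, threshold=0.95):
    probs = torch.softmax(logits, dim=-1)
    confidence, labels = probs.max(dim=-1)
    mask = confidence >= threshold             # reliable frames only
    return labels[mask], mask
```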

[CV-33] Trustworthy image-to-image translation: evaluating uncertainty calibration in unpaired training scenarios

【速读】: This paper addresses the tension between the need for automated diagnostic protocols in mammographic screening and the insufficient generalization of existing deep neural network models. The key to the solution is evaluating two image-to-image translation frameworks (the GAN-based cycleGAN and the more recent diffusion-based SynDiff) in unpaired-training scenarios, introducing uncertainty quantification to assess model trustworthiness, and proposing a scheme to evaluate calibration quality when no ground truth is available. This facilitates the trustworthy use of image-to-image translation models in domains where ground-truth data are typically lacking.

链接: https://arxiv.org/abs/2501.17570
作者: Ciaran Bench,Emir Ahmed,Spencer A. Thomas
机构: National Physical Laboratory (国家物理实验室)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Mammographic screening is an effective method for detecting breast cancer, facilitating early diagnosis. However, the current need to manually inspect images places a heavy burden on healthcare systems, spurring a desire for automated diagnostic protocols. Techniques based on deep neural networks have been shown effective in some studies, but their tendency to overfit leaves considerable risk for poor generalisation and misdiagnosis, preventing their widespread adoption in clinical settings. Data augmentation schemes based on unpaired neural style transfer models have been proposed that improve generalisability by diversifying the representations of training image features in the absence of paired training data (images of the same tissue in either image style). But these models are similarly prone to various pathologies, and evaluating their performance is challenging without ground truths/large datasets (as is often the case in medical imaging). Here, we consider two frameworks/architectures: a GAN-based cycleGAN, and the more recently developed diffusion-based SynDiff. We evaluate their performance when trained on image patches parsed from three open access mammography datasets and one non-medical image dataset. We consider the use of uncertainty quantification to assess model trustworthiness, and propose a scheme to evaluate calibration quality in unpaired training scenarios. This ultimately helps facilitate the trustworthy use of image-to-image translation models in domains where ground truths are not typically available.

Artificial Intelligence

[AI-0] GRACE: Generalizing Robot-Assisted Caregiving with User Functionality Embeddings

链接: https://arxiv.org/abs/2501.17855
作者: Ziang Liu,Yuanchen Ju,Yu Da,Tom Silver,Pranav N. Thakkar,Jenna Li,Justin Guo,Katherine Dimitropoulou,Tapomayukh Bhattacharjee
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 10 pages, 5 figures, Accepted to IEEE/ACM International Conference on Human-Robot Interaction (HRI), 2025

点击查看摘要

Abstract:Robot caregiving should be personalized to meet the diverse needs of care recipients – assisting with tasks as needed, while taking user agency in action into account. In physical tasks such as handover, bathing, dressing, and rehabilitation, a key aspect of this diversity is the functional range of motion (fROM), which can vary significantly between individuals. In this work, we learn to predict personalized fROM as a way to generalize robot decision-making in a wide range of caregiving tasks. We propose a novel data-driven method for predicting personalized fROM using functional assessment scores from occupational therapy. We develop a neural model that learns to embed functional assessment scores into a latent representation of the user’s physical function. The model is trained using motion capture data collected from users with emulated mobility limitations. After training, the model predicts personalized fROM for new users without motion capture. Through simulated experiments and a real-robot user study, we show that the personalized fROM predictions from our model enable the robot to provide personalized and effective assistance while improving the user’s agency in action. See our website for more visualizations: this https URL.

[AI-1] From Sparse to Dense: Toddler-inspired Reward Transition in Goal-Oriented Reinforcement Learning AAAI2024

链接: https://arxiv.org/abs/2501.17842
作者: Junseok Park,Hyeonseo Yang,Min Whoo Lee,Won-Seok Choi,Minsu Lee,Byoung-Tak Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Extended version of AAAI 2024 paper: Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning. This manuscript is currently being prepared for journal submission

点击查看摘要

Abstract:Reinforcement learning (RL) agents often face challenges in balancing exploration and exploitation, particularly in environments where sparse or dense rewards bias learning. Biological systems, such as human toddlers, naturally navigate this balance by transitioning from free exploration with sparse rewards to goal-directed behavior guided by increasingly dense rewards. Inspired by this natural progression, we investigate the Toddler-Inspired Reward Transition in goal-oriented RL tasks. Our study focuses on transitioning from sparse to potential-based dense (S2D) rewards while preserving optimal strategies. Through experiments on dynamic robotic arm manipulation and egocentric 3D navigation tasks, we demonstrate that effective S2D reward transitions significantly enhance learning performance and sample efficiency. Additionally, using a Cross-Density Visualizer, we show that S2D transitions smooth the policy loss landscape, resulting in wider minima that improve generalization in RL models. In addition, we reinterpret Tolman’s maze experiments, underscoring the critical role of early free exploratory learning in the context of S2D rewards.
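The potential-based dense rewards referenced above have the standard policy-invariant form F(s, s') = gamma * phi(s') - phi(s) from Ng et al. (1999); the distance-to-goal potential below is an illustrative choice.

```python
# Potential-based reward shaping: dense signal, same optimal policies.
import numpy as np

GAMMA = 0.99
goal = np.array([1.0, 1.0])

def phi(state):
    return -np.linalg.norm(state - goal)       # closer to the goal => higher potential

def shaped_reward(sparse_r, s, s_next):
    return sparse_r + GAMMA * phi(s_next) - phi(s)
```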

[AI-2] International AI Safety Report

链接: https://arxiv.org/abs/2501.17805
作者: Yoshua Bengio,Sören Mindermann,Daniel Privitera,Tamay Besiroglu,Rishi Bommasani,Stephen Casper,Yejin Choi,Philip Fox,Ben Garfinkel,Danielle Goldfarb,Hoda Heidari,Anson Ho,Sayash Kapoor,Leila Khalatbari,Shayne Longpre,Sam Manning,Vasilios Mavroudis,Mantas Mazeika,Julian Michael,Jessica Newman,Kwan Yee Ng,Chinasa T. Okolo,Deborah Raji,Girish Sastry,Elizabeth Seger,Theodora Skeadas,Tobin South,Emma Strubell,Florian Tramèr,Lucia Velasco,Nicole Wheeler,Daron Acemoglu,Olubayo Adekanmbi,David Dalrymple,Thomas G. Dietterich,Edward W. Felten,Pascale Fung,Pierre-Olivier Gourinchas,Fredrik Heintz,Geoffrey Hinton,Nick Jennings,Andreas Krause,Susan Leavy,Percy Liang,Teresa Ludermir,Vidushi Marda,Helen Margetts,John McDermid,Jane Munga,Arvind Narayanan,Alondra Nelson,Clara Neppel,Alice Oh,Gopal Ramchurn,Stuart Russell,Marietje Schaake,Bernhard Schölkopf,Dawn Song,Alvaro Soto,Lee Tiedrich,Gaël Varoquaux,Andrew Yao,Ya-Qin Zhang,Fahad Albalawi,Marwan Alserkal,Olubunmi Ajala,Guillaume Avrin,Christian Busch,André Carlos Ponce de Leon Ferreira de Carvalho,Bronwyn Fox,Amandeep Singh Gill,Ahmet Halit Hatip,Juha Heikkilä,Gill Jolly,Ziv Katzir,Hiroaki Kitano,Antonio Krüger,Chris Johnson,Saif M. Khan,Kyoung Mu Lee,Dominic Vincent Ligot,Oleksii Molchanovskyi,Andrea Monti,Nusu Mwamanzi,Mona Nemer,Nuria Oliver,José Ramón López Portillo,Balaraman Ravindran,Raquel Pezoa Rivera,Hammam Riza,Crystal Rugege,Ciarán Seoighe,Jerry Sheehan,Haroon Sheikh,Denise Wong,Yi Zeng
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The first International AI Safety Report comprehensively synthesizes the current evidence on the capabilities, risks, and safety of advanced AI systems. The report was mandated by the nations attending the AI Safety Summit in Bletchley, UK. Thirty nations, the UN, the OECD, and the EU each nominated a representative to the report’s Expert Advisory Panel. A total of 100 AI experts contributed, representing diverse perspectives and disciplines. Led by the report’s Chair, these independent experts collectively had full discretion over the report’s content.

[AI-3] Yin-Yang: Developing Motifs With Long-Term Structure And Controllability

链接: https://arxiv.org/abs/2501.17759
作者: Keshav Bhandari,Geraint A. Wiggins,Simon Colton
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
*备注: 16 Pages, 4 Figures, Accepted at Artificial Intelligence in Music, Sound, Art and Design: 14th International Conference, EvoMUSART 2025

点击查看摘要

Abstract:Transformer models have made great strides in generating symbolically represented music with local coherence. However, controlling the development of motifs in a structured way with global form remains an open research area. One of the reasons for this challenge is due to the note-by-note autoregressive generation of such models, which lack the ability to correct themselves after deviations from the motif. In addition, their structural performance on datasets with shorter durations has not been studied in the literature. In this study, we propose Yin-Yang, a framework consisting of a phrase generator, phrase refiner, and phrase selector models for the development of motifs into melodies with long-term structure and controllability. The phrase refiner is trained on a novel corruption-refinement strategy which allows it to produce melodic and rhythmic variations of an original motif at generation time, thereby rectifying deviations of the phrase generator. We also introduce a new objective evaluation metric for quantifying how smoothly the motif manifests itself within the piece. Evaluation results show that our model achieves better performance compared to state-of-the-art transformer models while having the advantage of being controllable and making the generated musical structure semi-interpretable, paving the way for musical analysis. Our code and demo page can be found at this https URL.

[AI-4] Early External Safety Testing of OpenAI s o3-mini: Insights from the Pre-Deployment Evaluation

链接: https://arxiv.org/abs/2501.17749
作者: Aitor Arrieta,Miriam Ugarte,Pablo Valle,José Antonio Parejo,Sergio Segura
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2501.17132

点击查看摘要

Abstract:Large Language Models (LLMs) have become an integral part of our daily lives. However, they impose certain risks, including those that can harm individuals’ privacy, perpetuate biases and spread misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure their responsible deployment. Safety of LLMs is a key property that needs to be thoroughly tested prior to the model being deployed and made accessible to general users. This paper reports the external safety testing experience conducted by researchers from Mondragon University and University of Seville on OpenAI’s new o3-mini LLM as part of OpenAI’s early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI’s latest LLM.

[AI-5] Inferring Implicit Goals Across Differing Task Models

链接: https://arxiv.org/abs/2501.17704
作者: Silvia Tulli,Stylianos Loukas Vasileiou,Mohamed Chetouani,Sarath Sreedharan
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:One of the significant challenges to generating value-aligned behavior is to not only account for the specified user objectives but also any implicit or unspecified user requirements. Such implicit requirements could be particularly common in settings where the user’s understanding of the task model may differ from the agent’s estimate of the model. Under this scenario, the user may incorrectly expect some agent behavior to be inevitable or guaranteed. This paper addresses such expectation mismatch in the presence of differing models by capturing the possibility of an unspecified user subgoal in the context of a task captured as a Markov Decision Process (MDP) and querying for it as required. Our method identifies bottleneck states and uses them as candidates for potential implicit subgoals. We then introduce a querying strategy that will generate the minimal number of queries required to identify a policy guaranteed to achieve the underlying goal. Our empirical evaluations demonstrate the effectiveness of our approach in inferring and achieving unstated goals across various tasks.

[AI-6] Planning with Vision-Language Models and a Use Case in Robot-Assisted Teaching

链接: https://arxiv.org/abs/2501.17665
作者: Xuzhe Dang,Lada Kudláčková,Stefan Edelkamp
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Automating the generation of Planning Domain Definition Language (PDDL) with Large Language Models (LLMs) opens a new research topic in AI planning, particularly for complex real-world tasks. This paper introduces Image2PDDL, a novel framework that leverages Vision-Language Models (VLMs) to automatically convert images of initial states and descriptions of goal states into PDDL problems. By providing a PDDL domain alongside visual inputs, Image2PDDL addresses key challenges in bridging perceptual understanding with symbolic planning, reducing the expertise required to create structured problem instances, and improving scalability across tasks of varying complexity. We evaluate the framework on various domains, including standard planning domains like blocksworld and sliding tile puzzles, using datasets with multiple difficulty levels. Performance is assessed on syntax correctness, ensuring grammar and executability, and content correctness, verifying accurate state representation in generated PDDL problems. The proposed approach demonstrates promising results across diverse task complexities, suggesting its potential for broader applications in AI planning. We will discuss a potential use case in robot-assisted teaching of students with Autism Spectrum Disorder.

[AI-7] he Imitation Game According To Turing

链接: https://arxiv.org/abs/2501.17629
作者: Sharon Temtsin(1),Diane Proudfoot(1),David Kaber(2),Christoph Bartneck(1) ((1) The University of Canterbury, (2) Oregon State University)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The current cycle of hype and anxiety concerning the benefits and risks to human society of Artificial Intelligence is fuelled, not only by the increasing use of generative AI and other AI tools by the general public, but also by claims made on behalf of such technology by popularizers and scientists. In particular, recent studies have claimed that Large Language Models (LLMs) can pass the Turing Test-a goal for AI since the 1950s-and therefore can “think”. Large-scale impacts on society have been predicted as a result. Upon detailed examination, however, none of these studies has faithfully applied Turing’s original instructions. Consequently, we conducted a rigorous Turing Test with GPT-4-Turbo that adhered closely to Turing’s instructions for a three-player imitation game. We followed established scientific standards where Turing’s instructions were ambiguous or missing. For example, we performed a Computer-Imitates-Human Game (CIHG) without constraining the time duration and conducted a Man-Imitates-Woman Game (MIWG) as a benchmark. All but one participant correctly identified the LLM, showing that one of today’s most advanced LLMs is unable to pass a rigorous Turing Test. We conclude that recent extravagant claims for such models are unsupported, and do not warrant either optimism or concern about the social impact of thinking machines.

[AI-8] VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching ICASSP2025

链接: https://arxiv.org/abs/2501.17612
作者: Ha-Yeong Choi,Jaehan Park
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: Accepted at ICASSP 2025

点击查看摘要

Abstract:Despite remarkable advancements in recent voice conversion (VC) systems, enhancing speaker similarity in zero-shot scenarios remains challenging. This challenge arises from the difficulty of generalizing and adapting speaker characteristics in speech within zero-shot environments, which is further complicated by a mismatch between the training and inference processes. To address these challenges, we propose VoicePrompter, a robust zero-shot VC model that leverages in-context learning with voice prompts. VoicePrompter is composed of (1) a factorization method that disentangles speech components and (2) a DiT-based conditional flow matching (CFM) decoder that conditions on these factorized features and voice prompts. Additionally, (3) latent mixup is used to enhance in-context learning by combining various speaker features. This approach improves speaker similarity and naturalness in zero-shot VC by applying mixup to latent representations. Experimental results demonstrate that VoicePrompter outperforms existing zero-shot VC systems in terms of speaker similarity, speech intelligibility, and audio quality. Our demo is available at this https URL.

[AI-9] Music2Latent2: Audio Compression with Summary Embeddings and Autoregressive Decoding ICASSP2025

链接: https://arxiv.org/abs/2501.17578
作者: Marco Pasini,Stefan Lattner,George Fazekas
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to ICASSP 2025

点击查看摘要

Abstract:Efficiently compressing high-dimensional audio signals into a compact and informative latent space is crucial for various tasks, including generative modeling and music information retrieval (MIR). Existing audio autoencoders, however, often struggle to achieve high compression ratios while preserving audio fidelity and facilitating efficient downstream applications. We introduce Music2Latent2, a novel audio autoencoder that addresses these limitations by leveraging consistency models and a novel approach to representation learning based on unordered latent embeddings, which we call summary embeddings. Unlike conventional methods that encode local audio features into ordered sequences, Music2Latent2 compresses audio signals into sets of summary embeddings, where each embedding can capture distinct global features of the input sample. This enables to achieve higher reconstruction quality at the same compression ratio. To handle arbitrary audio lengths, Music2Latent2 employs an autoregressive consistency model trained on two consecutive audio chunks with causal masking, ensuring coherent reconstruction across segment boundaries. Additionally, we propose a novel two-step decoding procedure that leverages the denoising capabilities of consistency models to further refine the generated audio at no additional cost. Our experiments demonstrate that Music2Latent2 outperforms existing continuous audio autoencoders regarding audio quality and performance on downstream tasks. Music2Latent2 paves the way for new possibilities in audio compression.

[AI-10] Exploring the Potential of Wireless-enabled Multi-Chip AI Accelerators

链接: https://arxiv.org/abs/2501.17567
作者: Emmanuel Irabor,Mariam Musavi,Abhijit Das,Sergi Abadal
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: Accepted in AccML @ HiPEAC 2025

点击查看摘要

Abstract:The insatiable appetite of Artificial Intelligence (AI) workloads for computing power is pushing the industry to develop faster and more efficient accelerators. The rigidity of custom hardware, however, conflicts with the need for scalable and versatile architectures capable of catering to the needs of the evolving and heterogeneous pool of Machine Learning (ML) models in the literature. In this context, multi-chiplet architectures assembling multiple (perhaps heterogeneous) accelerators are an appealing option that is unfortunately hindered by the still rigid and inefficient chip-to-chip interconnects. In this paper, we explore the potential of wireless technology as a complement to existing wired interconnects in this multi-chiplet approach. Using an evaluation framework from the state-of-the-art, we show that wireless interconnects can lead to speedups of 10% on average and 20% maximum. We also highlight the importance of load balancing between the wired and wireless interconnects, which will be further explored in future work.

[AI-11] Solving Urban Network Security Games: Learning Platform Benchmark and Challenge for AI Research

链接: https://arxiv.org/abs/2501.17559
作者: Shuxin Zhuang,Shuxin Li,Tianji Yang,Muheng Li,Xianjie Shi,Bo An,Youzhi Zhang
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:After the great achievement of solving two-player zero-sum games, more and more AI researchers focus on solving multiplayer games. To facilitate the design of efficient learning algorithms for solving multiplayer games, we propose a multiplayer game platform for solving Urban Network Security Games (UNSG) that model real-world scenarios. That is, preventing criminal activity is a highly significant responsibility assigned to police officers in cities, and police officers have to allocate their limited security resources to interdict the escaping criminal when a crime takes place in a city. This interaction between multiple police officers and the escaping criminal can be modeled as a UNSG. The variants of UNSGs can model different real-world settings, e.g., whether real-time information is available or not, and whether police officers can communicate or not. The main challenges of solving this game include the large size of the game and the co-existence of cooperation and competition. While previous efforts have been made to tackle UNSGs, they have been hampered by performance and scalability issues. Therefore, we propose an open-source UNSG platform (GraphChase) for designing efficient learning algorithms for solving UNSGs. Specifically, GraphChase offers a unified and flexible game environment for modeling various variants of UNSGs, supporting the development, testing, and benchmarking of algorithms. We believe that GraphChase not only facilitates the development of efficient algorithms for solving real-world problems but also paves the way for significant advancements in algorithmic development for solving general multiplayer games.

[AI-12] Is Conversational XAI All You Need? Human-AI Decision Making With a Conversational XAI Assistant

链接: https://arxiv.org/abs/2501.17546
作者: Gaole He,Nilay Aishwarya,Ujwal Gadiraju
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: conditionally accepted to IUI 2025

点击查看摘要

Abstract:Explainable artificial intelligence (XAI) methods are being proposed to help interpret and understand how AI systems reach specific predictions. Inspired by prior work on conversational user interfaces, we argue that augmenting existing XAI methods with conversational user interfaces can increase user engagement and boost user understanding of the AI system. In this paper, we explored the impact of a conversational XAI interface on users’ understanding of the AI system, their trust, and reliance on the AI system. In comparison to an XAI dashboard, we found that the conversational XAI interface can bring about a better understanding of the AI system among users and higher user trust. However, users of both the XAI dashboard and conversational XAI interfaces showed clear overreliance on the AI system. Enhanced conversations powered by large language model (LLM) agents amplified over-reliance. Based on our findings, we reason that the potential cause of such overreliance is the illusion of explanatory depth that is concomitant with both XAI interfaces. Our findings have important implications for designing effective conversational XAI interfaces to facilitate appropriate reliance and improve human-AI collaboration. Code can be found at this https URL

[AI-13] RegD: Hierarchical Embeddings via Distances over Geometric Regions

链接: https://arxiv.org/abs/2501.17518
作者: Hui Yang,Jiaoyan Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Hierarchical data are common in many domains like life sciences and e-commerce, and their embeddings often play a critical role. Although hyperbolic embeddings offer a grounded approach to representing hierarchical structures in low-dimensional spaces, their utility is hindered by optimization difficulties in hyperbolic space and dependence on handcrafted structural constraints. We propose RegD, a novel Euclidean framework that addresses these limitations by representing hierarchical data as geometric regions with two new metrics: (1) depth distance, which preserves the representational power of hyperbolic spaces for hierarchical data, and (2) boundary distance, which explicitly encodes set-inclusion relationships between regions in a general way. Our empirical evaluation on diverse real-world datasets shows consistent performance gains over state-of-the-art methods and demonstrates RegD’s potential for broader applications beyond hierarchy-only tasks.

[AI-14] Reflections on “Can AI Understand Our Universe?”

链接: https://arxiv.org/abs/2501.17507
作者: Yu Wang
类目: Artificial Intelligence (cs.AI); High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM)
*备注: Invited talk at the 17th Marcel Grossmann Meeting, associated with arXiv:2404.10019 , to be published in the International Journal of Modern Physics D

点击查看摘要

Abstract:This article briefly discusses the philosophical and technical aspects of AI. It focuses on two concepts of understanding: intuition and causality, and highlights three AI technologies: Transformers, chain-of-thought reasoning, and multimodal processing. We anticipate that in principle AI could form understanding, with these technologies representing promising advancements.

[AI-15] SemML: Enhancing Automata-Theoretic LTL Synthesis with Machine Learning

链接: https://arxiv.org/abs/2501.17496
作者: Jan Kretinsky,Tobias Meggendorfer,Maximilian Prokop,Ashkan Zarkhah
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Synthesizing a reactive system from specifications given in linear temporal logic (LTL) is a classical problem, finding its applications in safety-critical systems design. We present our tool SemML, which won this year’s LTL realizability tracks of SYNTCOMP, after years of domination by Strix. While both tools are based on the automata-theoretic approach, ours relies heavily on (i) Semantic labelling, additional information of a logical nature coming from recent LTL-to-automata translations and decorating the resulting parity game, and (ii) Machine Learning approaches turning this information into a guidance oracle for on-the-fly exploration of the parity game (whence the name SemML). Our tool fills the missing gaps of previous suggestions to use such an oracle and provides an efficient implementation with additional algorithmic improvements. We evaluate SemML both on the entire set of SYNTCOMP as well as on a synthetic data set, compare it to Strix, and analyze the advantages and limitations. As SemML solves more instances on SYNTCOMP and does so significantly faster on larger instances, this demonstrates for the first time that machine-learning-aided approaches can outperform state-of-the-art tools in real LTL synthesis.

[AI-16] Certifying Pareto-Optimality in Multi-Objective Maximum Satisfiability

链接: https://arxiv.org/abs/2501.17493
作者: Christoph Jabs,Jeremias Berg,Bart Bogaerts,Matti Järvisalo
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Due to the wide employment of automated reasoning in the analysis and construction of correct systems, the results reported by automated reasoning engines must be trustworthy. For Boolean satisfiability (SAT) solvers - and more recently SAT-based maximum satisfiability (MaxSAT) solvers - trustworthiness is obtained by integrating proof logging into solvers, making solvers capable of emitting machine-verifiable proofs to certify correctness of the reasoning steps performed. In this work, we enable for the first time proof logging based on the VeriPB proof format for multi-objective MaxSAT (MO-MaxSAT) optimization techniques. Although VeriPB does not offer direct support for multi-objective problems, we detail how preorders in VeriPB can be used to provide certificates for MO-MaxSAT algorithms computing a representative solution for each element in the non-dominated set of the search space under Pareto-optimality, without extending the VeriPB format or the proof checker. By implementing VeriPB proof logging into a state-of-the-art multi-objective MaxSAT solver, we show empirically that proof logging can be made scalable for MO-MaxSAT with reasonable overhead.

[AI-17] Neural Spelling: A Spell-Based BCI System for Language Neural Decoding

链接: https://arxiv.org/abs/2501.17489
作者: Xiaowei Jiang,Charles Zhou,Yiqun Duan,Ziyi Zhao,Thomas Do,Chin-Teng Lin
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Brain-computer interfaces (BCIs) present a promising avenue by translating neural activity directly into text, eliminating the need for physical actions. However, existing non-invasive BCI systems have not successfully covered the entire alphabet, limiting their practicality. In this paper, we propose a novel non-invasive EEG-based BCI system with a Curriculum-based Neural Spelling Framework, which recognizes all 26 alphabet letters by first decoding neural signals associated with handwriting and then applying Generative AI (GenAI) to enhance spell-based neural language decoding tasks. Our approach combines the ease of handwriting with the accessibility of EEG technology, utilizing advanced neural decoding algorithms and pre-trained large language models (LLMs) to translate EEG patterns into text with high accuracy. This system shows how GenAI can improve the performance of typical spelling-based neural language decoding tasks, and addresses the limitations of previous methods, offering a scalable and user-friendly solution for individuals with communication impairments, thereby enhancing inclusive communication options.

[AI-18] Algorithmic Segmentation and Behavioral Profiling for Ransomware Detection Using Temporal-Correlation Graphs

链接: https://arxiv.org/abs/2501.17429
作者: Ignatius Rollere,Caspian Hartsfield,Seraphina Courtenay,Lucian Fenwick,Aurelia Grunwald
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid evolution of cyber threats has outpaced traditional detection methodologies, necessitating innovative approaches capable of addressing the adaptive and complex behaviors of modern adversaries. A novel framework was introduced, leveraging Temporal-Correlation Graphs to model the intricate relationships and temporal patterns inherent in malicious operations. The approach dynamically captured behavioral anomalies, offering a robust mechanism for distinguishing between benign and malicious activities in real-time scenarios. Extensive experiments demonstrated the framework’s effectiveness across diverse ransomware families, with consistently high precision, recall, and overall detection accuracy. Comparative evaluations highlighted its better performance over traditional signature-based and heuristic methods, particularly in handling polymorphic and previously unseen ransomware variants. The architecture was designed with scalability and modularity in mind, ensuring compatibility with enterprise-scale environments while maintaining resource efficiency. Analysis of encryption speeds, anomaly patterns, and temporal correlations provided deeper insights into the operational strategies of ransomware, validating the framework’s adaptability to evolving threats. The research contributes to advancing cybersecurity technologies by integrating dynamic graph analytics and machine learning for future innovations in threat detection. Results from this study underline the potential for transforming the way organizations detect and mitigate complex cyberattacks.

[AI-19] Reqo: A Robust and Explainable Query Optimization Cost Model

链接: https://arxiv.org/abs/2501.17414
作者: Baoming Chang,Amin Kamali,Verena Kantere
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, there has been a growing interest in using machine learning (ML) in query optimization to select more efficient plans. Existing learning-based query optimizers use certain model architectures to convert tree-structured query plans into representations suitable for downstream ML tasks. As the design of these architectures significantly impacts cost estimation, we propose a tree model architecture based on Bidirectional Graph Neural Networks (Bi-GNN) aggregated by Gated Recurrent Units (GRUs) to achieve more accurate cost estimates. The inherent uncertainty of data and model parameters also leads to inaccurate cost estimates, resulting in suboptimal plans and less robust query performance. To address this, we implement a novel learning-to-rank cost model that effectively quantifies the uncertainty in cost estimates using approximate probabilistic ML. This model adaptively integrates quantified uncertainty with estimated costs and learns from comparing pairwise plans, achieving more robust performance. In addition, we propose the first explainability technique specifically designed for learning-based cost models. This technique explains the contribution of any subgraphs in the query plan to the final predicted cost, which can be integrated and trained with any learning-based cost model to significantly boost the model’s explainability. By incorporating these innovations, we propose a cost model for a Robust and Explainable Query Optimizer, Reqo, that improves the accuracy, robustness, and explainability of cost estimation, outperforming state-of-the-art approaches in all three dimensions.

[AI-20] A Genetic Algorithm-Based Approach for Automated Optimization of Kolmogorov-Arnold Networks in Classification Tasks

链接: https://arxiv.org/abs/2501.17411
作者: Quan Long,Bin Wang,Bing Xue,Mengjie Zhang
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To address the issue of interpretability in multilayer perceptrons (MLPs), Kolmogorov-Arnold Networks (KANs) were introduced in 2024. However, optimizing KAN structures is labor-intensive, typically requiring manual intervention and parameter tuning. This paper proposes GA-KAN, a genetic algorithm-based approach that automates the optimization of KANs, requiring no human intervention in the design process. To the best of our knowledge, this is the first time that evolutionary computation has been explored to optimize KANs automatically. Furthermore, inspired by the use of sparse connectivity in MLPs to effectively reduce the number of parameters, GA-KAN further explores sparse connectivity to tackle the challenge of extensive parameter spaces in KANs. GA-KAN is validated on two toy datasets, achieving optimal results without the manual tuning required by the original KAN. Additionally, GA-KAN demonstrates superior performance across five classification datasets, outperforming traditional methods on all datasets and providing interpretable symbolic formulae for the Wine and Iris datasets, thereby enhancing model transparency. Furthermore, GA-KAN significantly reduces the number of parameters over the standard KAN across all five datasets. The core contributions of GA-KAN include automated optimization, a new encoding strategy, and a new decoding process, which together improve the accuracy and interpretability, and reduce the number of parameters.
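To illustrate the kind of search GA-KAN automates, here is a toy genetic-algorithm loop over binary connectivity masks with truncation selection, uniform crossover, and bit-flip mutation; the fitness function, rates, and encoding are placeholders rather than the paper's actual scheme.

```python
# Toy GA over sparse connectivity masks (placeholder fitness).
import numpy as np

rng = np.random.default_rng(0)
POP, GENES = 20, 50                            # 50 candidate connections per individual

def fitness(mask):
    accuracy_proxy = rng.normal(loc=0.8, scale=0.01)   # stand-in for validation accuracy
    return accuracy_proxy - 0.002 * mask.sum()         # penalize dense connectivity

pop = rng.integers(0, 2, size=(POP, GENES))
for generation in range(30):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-POP // 2 :]]     # truncation selection
    n_children = POP - len(parents)
    mates = parents[rng.integers(0, len(parents), (n_children, 2))]
    cross = rng.random((n_children, GENES)) < 0.5      # uniform crossover
    children = np.where(cross, mates[:, 0], mates[:, 1])
    children[rng.random(children.shape) < 0.05] ^= 1   # bit-flip mutation
    pop = np.vstack([parents, children])

best_mask = pop[int(np.argmax([fitness(ind) for ind in pop]))]
```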

[AI-21] Intensional Inheritance Between Concepts: An Information-Theoretic Interpretation

链接: https://arxiv.org/abs/2501.17393
作者: Ben Goertzel
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of formalizing and quantifying the concept of “intensional inheritance” between two concepts. We begin by conceiving the intensional inheritance of W from F as the amount of information the proposition “x is F” provides about the proposition “x is W”. To flesh this out, we consider concepts F and W defined by sets of properties \{F_1, F_2, \ldots, F_n\} and \{W_1, W_2, \ldots, W_m\} with associated degrees \{d_1, d_2, \ldots, d_n\} and \{e_1, e_2, \ldots, e_m\}, respectively, where the properties may overlap. We then derive formulas for the intensional inheritance using both Shannon information theory and algorithmic information theory, incorporating interaction information among properties. We examine a special case where all properties are mutually exclusive and calculate the intensional inheritance in this case in both frameworks. We also derive expressions for P(W \mid F) based on the mutual information formula. Finally we consider the relationship between intensional inheritance and conventional set-theoretic “extensional” inheritance, concluding that in our information-theoretic framework, extensional inheritance emerges as a special case of intensional inheritance.

[AI-22] A Dual-Agent Adversarial Framework for Robust Generalization in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2501.17384
作者: Zhengpeng Xie,Jiahang Cao,Yulong Zhang,Qiang Zhang,Renjing Xu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, empowered with the powerful capabilities of neural networks, reinforcement learning (RL) has successfully tackled numerous challenging tasks. However, while these models demonstrate enhanced decision-making abilities, they are increasingly prone to overfitting. For instance, a trained RL model often fails to generalize to even minor variations of the same task, such as a change in background color or other minor semantic differences. To address this issue, we propose a dual-agent adversarial policy learning framework, which allows agents to spontaneously learn the underlying semantics without introducing any human prior knowledge. Specifically, our framework involves a game process between two agents: each agent seeks to maximize the impact of its perturbations on the opponent’s policy by producing representation differences for the same state, while maintaining its own stability against such perturbations. This interaction encourages agents to learn generalizable policies, capable of handling irrelevant features from the high-dimensional observations. Extensive experimental results on the Procgen benchmark demonstrate that the adversarial process significantly improves the generalization performance of both agents, while also being applicable to various RL algorithms, e.g., Proximal Policy Optimization (PPO). With the adversarial framework, the RL agent outperforms the baseline methods by a significant margin, especially in hard-level tasks, marking a significant step forward in the generalization capabilities of deep reinforcement learning.

[AI-23] ASAP: Learning Generalizable Online Bin Packing via Adaptive Selection After Pruning

Link: https://arxiv.org/abs/2501.17377
Authors: Han Fang,Paul Weng,Yutong Ban
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Recently, deep reinforcement learning (DRL) has achieved promising results in solving online 3D Bin Packing Problems (3D-BPP). However, these DRL-based policies may perform poorly on new instances due to distribution shift. Besides generalization, we also consider adaptation, completely overlooked by previous work, which aims at rapidly finetuning these policies to a new test distribution. To tackle both generalization and adaptation issues, we propose Adaptive Selection After Pruning (ASAP), which decomposes a solver’s decision-making into two policies, one for pruning and one for selection. The role of the pruning policy is to remove inherently bad actions, which allows the selection policy to choose among the remaining most valuable actions. To learn these policies, we propose a training scheme based on a meta-learning phase of both policies followed by a finetuning phase of the sole selection policy to rapidly adapt it to a test distribution. Our experiments demonstrate that ASAP exhibits excellent generalization and adaptation capabilities on in-distribution and out-of-distribution instances under both discrete and continuous setups.
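
The prune-then-select decomposition is straightforward to express as two heads over the same state. A minimal sketch follows, assuming a keep-probability threshold and simple linear heads; both are illustrative choices, not the paper's architecture:

```python
import torch
import torch.nn as nn

class PruneThenSelect(nn.Module):
    """Two-policy decision head: prune bad actions, then select among the rest."""
    def __init__(self, state_dim: int, n_actions: int, keep_thresh: float = 0.5):
        super().__init__()
        self.pruner = nn.Linear(state_dim, n_actions)    # keep-probability logits
        self.selector = nn.Linear(state_dim, n_actions)  # action-preference logits
        self.keep_thresh = keep_thresh

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        keep_prob = torch.sigmoid(self.pruner(state))
        mask = keep_prob > self.keep_thresh
        # Guard: if a row prunes everything, keep all of its actions.
        mask |= ~mask.any(dim=-1, keepdim=True)
        logits = self.selector(state).masked_fill(~mask, float("-inf"))
        return torch.distributions.Categorical(logits=logits).sample()

policy = PruneThenSelect(state_dim=16, n_actions=8)
print(policy(torch.randn(4, 16)))  # sampled actions for a batch of 4 states
```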

[AI-24] A Geometric Perspective for High-Dimensional Multiplex Graphs CIKM

Link: https://arxiv.org/abs/2501.17374
Authors: Kamel Abdous,Nairouz Mrabah,Mohamed Bouguessa
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Published in Proceedings of the ACM Conference on Information and Knowledge Management (CIKM) 2024, DOI: https://doi.org/10.1145/3627673.3679541

Click to view abstract

Abstract:High-dimensional multiplex graphs are characterized by their high number of complementary and divergent dimensions. The existence of multiple hierarchical latent relations between the graph dimensions poses significant challenges to embedding methods. In particular, the geometric distortions that might occur in the representational space have been overlooked in the literature. This work studies the problem of high-dimensional multiplex graph embedding from a geometric perspective. We find that the node representations reside on highly curved manifolds, thus rendering their exploitation more challenging for downstream tasks. Moreover, our study reveals that increasing the number of graph dimensions can cause further distortions to the highly curved manifolds. To address this problem, we propose a novel multiplex graph embedding method that harnesses hierarchical dimension embedding and Hyperbolic Graph Neural Networks. The proposed approach hierarchically extracts hyperbolic node representations that reside on Riemannian manifolds while gradually learning fewer and more expressive latent dimensions of the multiplex graph. Experimental results on real-world high-dimensional multiplex graphs show that the synergy between hierarchical and hyperbolic embeddings incurs much fewer geometric distortions and brings notable improvements over state-of-the-art approaches on downstream tasks.
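
The hyperbolic side of the method builds on the Poincaré-ball geometry used by Hyperbolic Graph Neural Networks. As a reference point, the sketch below shows the standard closed-form geodesic distance on the unit ball, which is what makes such embeddings sensitive to curvature in the first place:

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points inside the unit Poincare ball."""
    uu, vv = np.dot(u, u), np.dot(v, v)
    duv = np.dot(u - v, u - v)
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))
    return np.arccosh(1 + 2 * duv / ((1 - uu) * (1 - vv)))

print(poincare_distance(np.array([0.1, 0.2]), np.array([0.5, -0.3])))
# Near the boundary, distances grow rapidly: the kind of geometry that
# flat (Euclidean) embeddings cannot express without distortion.
print(poincare_distance(np.array([0.94, 0.0]), np.array([0.0, 0.94])))
```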

[AI-25] Forecasting S&P 500 Using LSTM Models

Link: https://arxiv.org/abs/2501.17366
Authors: Prashant Pilla,Raji Mekonen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP); Trading and Market Microstructure (q-fin.TR)
*Comments:

Click to view abstract

Abstract:With the volatile and complex nature of financial data influenced by external factors, forecasting the stock market is challenging. Traditional models such as ARIMA and GARCH perform well with linear data but struggle with non-linear dependencies. Machine learning and deep learning models, particularly Long Short-Term Memory (LSTM) networks, address these challenges by capturing intricate patterns and long-term dependencies. This report compares ARIMA and LSTM models in predicting the S&P 500 index, a major financial benchmark. Using historical price data and technical indicators, we evaluated these models using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The ARIMA model showed reasonable performance with an MAE of 462.1, RMSE of 614, and 89.8 percent accuracy, effectively capturing short-term trends but limited by its linear assumptions. The LSTM model, leveraging sequential processing capabilities, outperformed ARIMA with an MAE of 369.32, RMSE of 412.84, and 92.46 percent accuracy, capturing both short- and long-term dependencies. Notably, the LSTM model without additional features performed best, achieving an MAE of 175.9, RMSE of 207.34, and 96.41 percent accuracy, showcasing its ability to handle market data efficiently. Accurately predicting stock movements is crucial for investment strategies, risk assessments, and market stability. Our findings confirm the potential of deep learning models in handling volatile financial data compared to traditional ones. The results highlight the effectiveness of LSTM and suggest avenues for further improvements. This study provides insights into financial forecasting, offering a comparative analysis of ARIMA and LSTM while outlining their strengths and limitations.
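
The abstract does not give architecture details, so the following is a minimal univariate sketch of the setup it describes: windowed prices in, one-step-ahead prediction out. The window length, hidden size, and the synthetic series are illustrative stand-ins for the paper's data and tuning:

```python
import torch
import torch.nn as nn

class PriceLSTM(nn.Module):
    """Minimal univariate LSTM forecaster (sizes are illustrative)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # one-step-ahead prediction

# Toy stand-in for a normalized closing-price series.
series = torch.sin(torch.linspace(0, 20, 500)).unsqueeze(-1)
window = 30
X = torch.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

model = PriceLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: MSE {loss.item():.4f}")
```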

[AI-26] The M-factor: A Novel Metric for Evaluating Neural Architecture Search in Resource-Constrained Environments

Link: https://arxiv.org/abs/2501.17361
Authors: Srikanth Thudumu,Hy Nguyen,Hung Du,Nhat Duong,Zafaryab Rasool,Rena Logothetis,Scott Barnett,Rajesh Vasa,Kon Mouzakis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Neural Architecture Search (NAS) aims to automate the design of deep neural networks. However, existing NAS techniques often focus on maximising accuracy, neglecting model efficiency. This limitation restricts their use in resource-constrained environments like mobile devices and edge computing systems. Moreover, current evaluation metrics prioritise performance over efficiency, lacking a balanced approach for assessing architectures suitable for constrained scenarios. To address these challenges, this paper introduces the M-factor, a novel metric combining model accuracy and size. Four diverse NAS techniques are compared: Policy-Based Reinforcement Learning, Regularised Evolution, Tree-structured Parzen Estimator (TPE), and Multi-trial Random Search. These techniques represent different NAS paradigms, providing a comprehensive evaluation of the M-factor. The study analyses ResNet configurations on the CIFAR-10 dataset, with a search space of 19,683 configurations. Experiments reveal that Policy-Based Reinforcement Learning and Regularised Evolution achieved M-factor values of 0.84 and 0.82, respectively, while Multi-trial Random Search attained 0.75, and TPE reached 0.67. Policy-Based Reinforcement Learning exhibited performance changes after 39 trials, while Regularised Evolution optimised within 20 trials. The research investigates the optimisation dynamics and trade-offs between accuracy and model size for each strategy. Findings indicate that, in some cases, random search performed comparably to more complex algorithms when assessed using the M-factor. These results highlight how the M-factor addresses the limitations of existing metrics by guiding NAS towards balanced architectures, offering valuable insights for selecting strategies in scenarios requiring both performance and efficiency.
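
The abstract reports M-factor values but not the exact formula, so the sketch below uses a hypothetical equal-weight combination of accuracy and a normalized size score, purely to illustrate the shape of such a metric; substitute the authors' definition when working from the full paper:

```python
def m_factor(accuracy: float, n_params: int, max_params: int) -> float:
    """Hypothetical accuracy/size trade-off metric (not the paper's formula)."""
    size_score = 1.0 - n_params / max_params  # smaller models score higher
    return 0.5 * accuracy + 0.5 * size_score

# Compare two hypothetical ResNet configurations from a NAS run: the slightly
# less accurate but much smaller model wins under this kind of metric.
print(m_factor(accuracy=0.92, n_params=11_000_000, max_params=25_000_000))
print(m_factor(accuracy=0.94, n_params=23_000_000, max_params=25_000_000))
```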

[AI-27] Deep-and-Wide Learning: Enhancing Data-Driven Inference via Synergistic Learning of Inter- and Intra-Data Representations

Link: https://arxiv.org/abs/2501.17347
Authors: Md Tauhidul Islam,Lei Xing
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 16 pages, 8 figures

Click to view abstract

Abstract:Advancements in deep learning are revolutionizing science and engineering. The immense success of deep learning is largely due to its ability to extract essential high-dimensional (HD) features from input data and make inference decisions based on this information. However, current deep neural network (DNN) models face several challenges, such as the requirements of extensive amounts of data and computational resources. Here, we introduce a new learning scheme, referred to as deep-and-wide learning (DWL), to systematically capture features not only within individual input data (intra-data features) but also across the data (inter-data features). Furthermore, we propose a dual-interactive-channel network (D-Net) to realize the DWL, which leverages our Bayesian formulation of low-dimensional (LD) inter-data feature extraction and its synergistic interaction with the conventional HD representation of the dataset, for substantially enhanced computational efficiency and inference. The proposed technique has been applied to data across various disciplines for both classification and regression tasks. Our results demonstrate that DWL surpasses state-of-the-art DNNs in accuracy by a substantial margin with limited training data and improves the computational efficiency by order(s) of magnitude. The proposed DWL strategy dramatically alters the data-driven learning techniques, including emerging large foundation models, and sheds significant insights into the evolving field of AI.

[AI-28] Anomaly Detection in Cooperative Vehicle Perception Systems under Imperfect Communication

Link: https://arxiv.org/abs/2501.17329
Authors: Ashish Bastola,Hao Wang,Abolfazl Razi
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 10 pages

Click to view abstract

Abstract:Anomaly detection is a critical requirement for ensuring safety in autonomous driving. In this work, we leverage Cooperative Perception to share information across nearby vehicles, enabling more accurate identification and consensus of anomalous behaviors in complex traffic scenarios. To account for the real-world challenge of imperfect communication, we propose a cooperative-perception-based anomaly detection framework (CPAD), which is a robust architecture that remains effective under communication interruptions, thereby facilitating reliable performance even in low-bandwidth settings. Since no multi-agent anomaly detection dataset exists for vehicle trajectories, we introduce a benchmark dataset of 15,000 different scenarios with 90,000 trajectories, generated through rule-based vehicle dynamics analysis. Empirical results demonstrate that our approach outperforms standard anomaly classification methods in F1-score and AUC, and showcases strong robustness to agent connection interruptions.

[AI-29] Connecting Federated ADMM to Bayes

Link: https://arxiv.org/abs/2501.17325
Authors: Siddharth Swaroop,Mohammad Emtiyaz Khan,Finale Doshi-Velez
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:We provide new connections between two distinct federated learning approaches based on (i) ADMM and (ii) Variational Bayes (VB), and propose new variants by combining their complementary strengths. Specifically, we show that the dual variables in ADMM naturally emerge through the ‘site’ parameters used in VB with isotropic Gaussian covariances. Using this, we derive two versions of ADMM from VB that use flexible covariances and functional regularisation, respectively. Through numerical experiments, we validate the improvements obtained in performance. The work shows a connection between two fields that are believed to be fundamentally different and combines them to improve federated learning.
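
For reference, here is a standard federated (consensus) ADMM round, whose dual variables u_i are what the paper relates to VB "site" parameters. The sketch uses quadratic local losses so the client step has a closed form; all values are illustrative:

```python
import numpy as np

# Each client i holds a quadratic local loss 0.5 * ||x - t_i||^2.
targets = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 1.0])]
rho = 1.0
z = np.zeros(2)                       # global consensus variable
u = [np.zeros(2) for _ in targets]    # dual variables ("site" parameters)

for _ in range(20):
    # Client step: argmin_x 0.5||x - t_i||^2 + (rho/2)||x - z + u_i||^2.
    x = [(t + rho * (z - ui)) / (1 + rho) for t, ui in zip(targets, u)]
    # Server step: average the client messages.
    z = np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)
    # Dual update: accumulate each client's consensus residual.
    u = [ui + xi - z for ui, xi in zip(u, x)]

print(z)                          # converges to the mean of the targets
print(np.mean(targets, axis=0))
```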

[AI-30] A sketch of an AI control safety case

Link: https://arxiv.org/abs/2501.17315
Authors: Tomek Korbak,Joshua Clymer,Benjamin Hilton,Buck Shlegeris,Geoffrey Irving
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
*Comments:

Click to view abstract

Abstract:As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a “control safety case”, which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won’t exfiltrate sensitive information. The sketch relies on evidence from a “control evaluation,” where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.

[AI-31] Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding

Link: https://arxiv.org/abs/2501.17310
Authors: Yun-Shiuan Chuang,Nikunj Harlalka,Sameer Narendran,Alexander Cheung,Sizhe Gao,Siddharth Suresh,Junjie Hu,Timothy T. Rogers
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*Comments:

Click to view abstract

Abstract:Guesstimation, the task of making approximate quantity estimates, is a common real-world challenge. However, it has been largely overlooked in large language models (LLMs) and vision language models (VLMs) research. We introduce a novel guesstimation dataset, MARBLES. This dataset requires one to estimate how many items (e.g., marbles) can fit into containers (e.g., a one-cup measuring cup), both with and without accompanying images. Inspired by the social science concept of the “Wisdom of Crowds” (WOC), i.e., taking the median of estimates from a crowd, which has proven effective in guesstimation, we propose a “WOC decoding” strategy for LLM guesstimation. We show that LLMs/VLMs perform well on guesstimation, suggesting that they possess some level of a “world model” necessary for guesstimation. Moreover, similar to human performance, the WOC decoding method improves LLM/VLM guesstimation accuracy. Furthermore, the inclusion of images in the multimodal condition enhances model performance. These results highlight the value of WOC decoding strategy for LLMs/VLMs and position guesstimation as a probe for evaluating LLMs/VLMs’ world model.
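
WOC decoding itself is simple to sketch: sample several independent estimates and return the median. In the sketch below, query_model is a hypothetical stand-in for one temperature-sampled LLM/VLM call; the Gaussian reply generator just simulates noisy model outputs so the snippet runs:

```python
import random
import re
import statistics

def query_model(prompt: str) -> str:
    # Hypothetical stand-in for one sampled (temperature > 0) LLM response.
    return f"I'd guess about {random.gauss(120, 25):.0f} marbles."

def woc_decode(prompt: str, n_samples: int = 9) -> float:
    """Wisdom-of-Crowds decoding: median over sampled numeric estimates."""
    estimates = []
    for _ in range(n_samples):
        reply = query_model(prompt)
        m = re.search(r"-?\d+(?:\.\d+)?", reply.replace(",", ""))
        if m:
            estimates.append(float(m.group()))
    return statistics.median(estimates)

print(woc_decode("How many marbles fit in a one-cup measuring cup?"))
```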

[AI-32] Multi-Physics Simulations via Coupled Fourier Neural Operator

Link: https://arxiv.org/abs/2501.17296
Authors: Shibo Li,Tao Wang,Yifei Sun,Heiwei Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Physical simulations are essential tools across critical fields such as mechanical and aerospace engineering, chemistry, meteorology, etc. While neural operators, particularly the Fourier Neural Operator (FNO), have shown promise in predicting simulation results with impressive performance and efficiency, they face limitations when handling real-world scenarios involving coupled multi-physics outputs. Current neural operator methods either overlook the correlations between multiple physical processes or employ simplistic architectures that inadequately capture these relationships. To overcome these challenges, we introduce a novel coupled multi-physics neural operator learning (COMPOL) framework that extends the capabilities of Fourier operator layers to model interactions among multiple physical processes. Our approach implements feature aggregation through recurrent and attention mechanisms, enabling comprehensive modeling of coupled interactions. Our method’s core is an innovative system for aggregating latent features from multi-physics processes. These aggregated features serve as enriched information sources for neural operator layers, allowing our framework to capture complex physical relationships accurately. We evaluated our coupled multi-physics neural operator across diverse physical simulation tasks, including biological systems, fluid mechanics, and multiphase flow in porous media. Our proposed model demonstrates a two to three-fold improvement in predictive performance compared to existing approaches.

[AI-33] Rethinking Functional Brain Connectome Analysis: Do Graph Deep Learning Models Help?

Link: https://arxiv.org/abs/2501.17207
Authors: Keqi Han,Yao Su,Lifang He,Liang Zhan,Sergey Plis,Vince Calhoun,Carl Yang
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*Comments: 22 pages, 6 figures

Click to view abstract

Abstract:Functional brain connectome is crucial for deciphering the neural mechanisms underlying cognitive functions and neurological disorders. Graph deep learning models have recently gained tremendous popularity in this field. However, their actual effectiveness in modeling the brain connectome remains unclear. In this study, we re-examine graph deep learning models based on four large-scale neuroimaging studies encompassing diverse cognitive and clinical outcomes. Surprisingly, we find that the message aggregation mechanism, a hallmark of graph deep learning models, does not help with predictive performance as typically assumed, but rather consistently degrades it. To address this issue, we propose a hybrid model combining a linear model with a graph attention network through dual pathways, achieving robust predictions and enhanced interpretability by revealing both localized and global neural connectivity patterns. Our findings urge caution in adopting complex deep learning models for functional brain connectome analysis, emphasizing the need for rigorous experimental designs to establish tangible performance gains and perhaps more importantly, to pursue improvements in model interpretability.

[AI-34] Integrating Reinforcement Learning and AI Agents for Adaptive Robotic Interaction and Assistance in Dementia Care

Link: https://arxiv.org/abs/2501.17206
Authors: Fengpei Yuan,Nehal Hasnaeen,Ran Zhang,Bryce Bible,Joseph Riley Taylor,Hairong Qi,Fenghui Yao,Xiaopeng Zhao
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*Comments: 18 pages, 12 figures

Click to view abstract

Abstract:This study explores a novel approach to advancing dementia care by integrating socially assistive robotics, reinforcement learning (RL), large language models (LLMs), and clinical domain expertise within a simulated environment. This integration addresses the critical challenge of limited experimental data in socially assistive robotics for dementia care, providing a dynamic simulation environment that realistically models interactions between persons living with dementia (PLWDs) and robotic caregivers. The proposed framework introduces a probabilistic model to represent the cognitive and emotional states of PLWDs, combined with an LLM-based behavior simulation to emulate their responses. We further develop and train an adaptive RL system enabling humanoid robots, such as Pepper, to deliver context-aware and personalized interactions and assistance based on PLWDs’ cognitive and emotional states. The framework also generalizes to computer-based agents, highlighting its versatility. Results demonstrate that the RL system, enhanced by LLMs, effectively interprets and responds to the complex needs of PLWDs, providing tailored caregiving strategies. This research contributes to human-computer and human-robot interaction by offering a customizable AI-driven caregiving platform, advancing understanding of dementia-related challenges, and fostering collaborative innovation in assistive technologies. The proposed approach has the potential to enhance the independence and quality of life for PLWDs while alleviating caregiver burden, underscoring the transformative role of interaction-focused AI systems in dementia care.

[AI-35] Smart Cubing for Graph Search: A Comparative Study

Link: https://arxiv.org/abs/2501.17201
Authors: Markus Kirchweger,Hai Xia,Tomáš Peitl,Stefan Szeider
Subjects: Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Parallel solving via cube-and-conquer is a key method for scaling SAT solvers to hard instances. While cube-and-conquer has proven successful for pure SAT problems, notably the Pythagorean triples conjecture, its application to SAT solvers extended with propagators presents unique challenges, as these propagators learn constraints dynamically during the search. We study this problem using SAT Modulo Symmetries (SMS) as our primary test case, where a symmetry-breaking propagator reduces the search space by learning constraints that eliminate isomorphic graphs. Through extensive experimentation comprising over 10,000 CPU hours, we systematically evaluate different cube-and-conquer variants on three well-studied combinatorial problems. Our methodology combines prerun phases to collect learned constraints, various cubing strategies, and parameter tuning via algorithm configuration and LLM-generated design suggestions. The comprehensive empirical evaluation provides new insights into effective cubing strategies for propagator-based SAT solving, with our best method achieving speedups of 2-3x from improved cubing and parameter tuning, providing an additional 1.5-2x improvement on harder instances.

[AI-36] Letters Colors and Words: Constructing the Ideal Building Blocks Set

Link: https://arxiv.org/abs/2501.17188
Authors: Ricardo Salazar,Shahrzad Jamshidi
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*Comments: 29 pages, 8 figures, submitted to SIAM Undergraduate Research Online

Click to view abstract

Abstract:Define a building blocks set to be a collection of n cubes (each with six sides) where each side is assigned one letter and one color from a palette of m colors. We propose a novel problem of assigning letters and colors to each face so as to maximize the number of words one can spell from a chosen dataset that are either mono words (all letters have the same color) or rainbow words (all letters have unique colors). We explore this problem considering a chosen set of English words, up to six letters long, from a typical vocabulary of a US American 14 year old, focusing on the case n=6 and m=6, with the added restriction that each color appears exactly once on the cube. The problem is intractable, as the size of the solution space makes a brute force approach computationally infeasible. To address this, we explore a range of optimization techniques: random search, simulated annealing, two distinct tree search methods (greedy and best-first), and a genetic algorithm. Additionally, we attempted to implement a reinforcement learning approach; however, the model failed to converge to viable solutions within the problem’s constraints. Among these methods, the genetic algorithm delivered the best performance, achieving a total of 2846 mono and rainbow words.

[AI-37] EvoGP: A GPU-accelerated Framework for Tree-Based Genetic Programming

Link: https://arxiv.org/abs/2501.17168
Authors: Lishuang Wang,Zhihong Wu,Kebin Sun,Zhuozhao Li,Ran Cheng
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Tree-based Genetic Programming (TGP) is a key evolutionary algorithm widely used in symbolic regression, feature engineering, and scientific modeling. Its high computational demands make GPU acceleration essential for scalable and high-performance evolutionary computation. However, GPU acceleration of TGP faces three key challenges: inefficient tree encoding, highly heterogeneous genetic operations, and limited parallelism in fitness evaluation. To address these challenges, we introduce EvoGP, a comprehensive GPU-accelerated TGP framework. First, we design a tensorized encoding scheme to represent trees with different structures as tensors with the same shape, optimizing memory access and enabling efficient parallel execution. Second, we propose a unified parallel framework for genetic operations by leveraging shared computational primitives and implementing dedicated CUDA kernels for scalable performance. Third, we present a fully parallel fitness evaluation strategy for symbolic regression, exploiting both population-level and data-level parallelism to maximize GPU utilization. Moreover, we implement a comprehensive library to provide rich algorithm operators and benchmark problems. EvoGP is extensively tested on various tasks, including symbolic regression, classification, and robotics control, demonstrating its versatility and effectiveness across diverse application scenarios. Experimental results show that EvoGP achieves up to a 140.89x speedup over the state-of-the-art GPU-based TGP implementation, while maintaining or exceeding the accuracy of baseline methods. EvoGP is open-source and accessible at: this https URL.

[AI-38] QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks

Link: https://arxiv.org/abs/2501.17167
Authors: Yaojie Hu,Qiang Zhou,Qihong Chen,Xiaopeng Li,Linbo Liu,Dejiao Zhang,Amit Kachroo,Talha Oz,Omer Tripp
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:We introduce QualityFlow, a dynamic agentic workflow for program synthesis. Given the English description of a programming problem and a set of unit tests, the model’s goal is to synthesize the correct program that solves the problem and passes the tests. QualityFlow consists of multiple large language model (LLM) agents that resemble a software development team, including code generation, testing, and self-debugging. Existing program synthesis methods face three major limitations: assumption of visible unit test conformity, bottleneck of synthesized test quality, and deviation of self-debugging trajectory. To address them, we propose the LLM Quality Checker, which explicitly “imagines” whether the synthesized programs’ execution would conform to the unit tests. The Quality Checks dynamically control the workflow, including actions to submit the final answer, clarify the problem statement, and revert previous workflow steps. As a result, our Quality Checker can precisely accept any correct program, mitigate faulty synthesized tests, and prevent potential workflow deviation. The success of the Quality Checker further enables Diversified Prompting, which encourages variations in LLM responses to maximize the possibility that a correct program appears and passes the quality check. In experiments, QualityFlow establishes the state-of-the-art results on four program synthesis benchmarks: MBPP, HumanEval, and the stricter evaluations of both MBPP and HumanEval from EvalPlus. Our systematic analysis shows that the dynamic workflow controlled by LLM quality checks can outperform static workflows and single-attempt zero-shot synthesis. The Quality Checker is the center of our investigation, and we dissect its individual performance and integrated impact on the workflow accuracy, as well as other ablations experiments to justify our workflow design.

[AI-39] Split Knowledge Distillation for Large Models in IoT: Architecture, Challenges and Solutions

Link: https://arxiv.org/abs/2501.17164
Authors: Zuguang Li,Wen Wu,Shaohua Wu,Qiaohua Lin,Yaping Sun,Hui Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 7 pages, 4 figures, 2 tables, and 15 conference

Click to view abstract

Abstract:Large models (LMs) have immense potential in Internet of Things (IoT) systems, enabling applications such as intelligent voice assistants, predictive maintenance, and healthcare monitoring. However, training LMs on edge servers raises data privacy concerns, while deploying them directly on IoT devices is constrained by limited computational and memory resources. We analyze the key challenges of training LMs in IoT systems, including energy constraints, latency requirements, and device heterogeneity, and propose potential solutions such as dynamic resource management, adaptive model partitioning, and clustered collaborative training. Furthermore, we propose a split knowledge distillation framework to efficiently distill LMs into smaller, deployable versions for IoT devices while ensuring raw data remains local. This framework integrates knowledge distillation and split learning to minimize energy consumption and meet low model training delay requirements. A case study is presented to evaluate the feasibility and performance of the proposed framework.
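
The split-learning and communication machinery is the framework's contribution and is omitted here, but the distillation objective such a pipeline would minimize is standard. Below is a minimal sketch of the usual soft-target knowledge-distillation loss, with the temperature and mixing weight as illustrative hyperparameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.7):
    """Soft-target KD loss: blend teacher-matching KL with hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to be temperature-independent
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 1000, requires_grad=True)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```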

[AI-40] AI Governance through Markets

Link: https://arxiv.org/abs/2501.17755
Authors: Philip Moreira Tomei,Rupal Jain,Matija Franklin
Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:This paper argues that market governance mechanisms should be considered a key approach in the governance of artificial intelligence (AI), alongside traditional regulatory frameworks. While current governance approaches have predominantly focused on regulation, we contend that market-based mechanisms offer effective incentives for responsible AI development. We examine four emerging vectors of market governance: insurance, auditing, procurement, and due diligence, demonstrating how these mechanisms can affirm the relationship between AI risk and financial risk while addressing capital allocation inefficiencies. While we do not claim that market forces alone can adequately protect societal interests, we maintain that standardised AI disclosures and market mechanisms can create powerful incentives for safe and responsible AI development. This paper urges regulators, economists, and machine learning researchers to investigate and implement market-based approaches to AI governance.

[AI-41] Exact characterization of epsilon-Safe Decision Regions for exponential family distributions and Multi Cost SVM approximation

Link: https://arxiv.org/abs/2501.17731
Authors: Alberto Carlevaro,Teodoro Alamo,Fabrizio Dabbene,Maurizio Mongelli
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Probabilistic guarantees on the prediction of data-driven classifiers are necessary to define models that can be considered reliable. This is a key requirement for modern machine learning in which the goodness of a system is measured in terms of trustworthiness, clearly dividing what is safe from what is unsafe. The spirit of this paper is exactly in this direction. First, we introduce a formal definition of \epsilon-Safe Decision Region, a subset of the input space in which the prediction of a target (safe) class is probabilistically guaranteed. Second, we prove that, when data come from exponential family distributions, the form of such a region is analytically determined and controllable by design parameters, i.e. the probability of sampling the target class and the confidence on the prediction. However, the requirement of exponential family data cannot always be met. Inspired by this limitation, we developed Multi Cost SVM, an SVM-based algorithm that approximates the safe region and is also able to handle unbalanced data. The research is complemented by experiments and code available for reproducibility.

Machine Learning

[LG-0] rEGGression: an Interactive and Agnostic Tool for the Exploration of Symbolic Regression Models

Link: https://arxiv.org/abs/2501.17859
Authors: Fabricio Olivetti de Franca,Gabriel Kronberger
Subjects: Machine Learning (cs.LG)
*Comments: 9 pages, 4 figures, 2 tables

Click to view abstract

Abstract:Regression analysis is used for prediction and to understand the effect of independent variables on dependent variables. Symbolic regression (SR) automates the search for non-linear regression models, delivering a set of hypotheses that balances accuracy with the possibility to understand the phenomena. Many SR implementations return a Pareto front allowing the choice of the best trade-off. However, this hides alternatives that are close to non-domination, limiting these choices. Equality graphs (e-graphs) allow to represent large sets of expressions compactly by efficiently handling duplicated parts occurring in multiple expressions. E-graphs allow to store and query all SR solution candidates visited in one or multiple GP runs efficiently and open the possibility to analyse much larger sets of SR solution candidates. We introduce rEGGression, a tool using e-graphs to enable the exploration of a large set of symbolic expressions which provides querying, filtering, and pattern matching features, creating an interactive experience to gain insights about SR models. The main highlight is its focus on the exploration of the building blocks found during the search, which can help the experts to find insights about the studied problem. This is possible by exploiting the pattern matching capability of the e-graph data structure.

[LG-1] Improving Genetic Programming for Symbolic Regression with Equality Graphs

Link: https://arxiv.org/abs/2501.17848
Authors: Fabricio Olivetti de Franca,Gabriel Kronberger
Subjects: Machine Learning (cs.LG)
*Comments: 10 pages, 5 figures, 5 tables

Click to view abstract

Abstract:The search for symbolic regression models with genetic programming (GP) has a tendency of revisiting expressions in their original or equivalent forms. Repeatedly evaluating equivalent expressions is inefficient, as it does not immediately lead to better solutions. However, evolutionary algorithms require diversity and should allow the accumulation of inactive building blocks that can play an important role at a later point. The equality graph is a data structure capable of compactly storing expressions and their equivalent forms allowing an efficient verification of whether an expression has been visited in any of their stored equivalent forms. We exploit the e-graph to adapt the subtree operators to reduce the chances of revisiting expressions. Our adaptation, called eggp, stores every visited expression in the e-graph, allowing us to filter out from the available selection of subtrees all the combinations that would create already visited expressions. Results show that, for small expressions, this approach improves the performance of a simple GP algorithm to compete with PySR and Operon without increasing computational cost. As a highlight, eggp was capable of reliably delivering short and at the same time accurate models for a selected set of benchmarks from SRBench and a set of real-world datasets.

[LG-2] acoupi: An Open-Source Python Framework for Deploying Bioacoustic AI Models on Edge Devices

Link: https://arxiv.org/abs/2501.17841
Authors: Aude Vuilliomenet,Santiago Martínez Balvanera,Oisin Mac Aodha,Kate E. Jones,Duncan Wilson
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments: 21 pages, 3 figures, 1 table, to be submitted to BES Methods in Ecology and Evolution

Click to view abstract

Abstract:1. Passive acoustic monitoring (PAM) coupled with artificial intelligence (AI) is becoming an essential tool for biodiversity monitoring. Traditional PAM systems require manual data offloading and impose substantial demands on storage and computing infrastructure. The combination of on-device AI-based processing and network connectivity enables local data analysis and transmission of only relevant information, greatly reducing storage needs. However, programming these devices for robust operation is challenging, requiring expertise in embedded systems and software engineering. Despite the increase in AI-based models for bioacoustics, their full potential remains unrealized without accessible tools to deploy them on custom hardware and tailor device behaviour to specific monitoring goals. 2. To address this challenge, we develop acoupi, an open-source Python framework that simplifies the creation and deployment of smart bioacoustic devices. acoupi integrates audio recording, AI-based data processing, data management, and real-time wireless messaging into a unified and configurable framework. By modularising key elements of the bioacoustic monitoring workflow, acoupi allows users to easily customise, extend, or select specific components to fit their unique monitoring needs. 3. We demonstrate the flexibility of acoupi by integrating two bioacoustic classifiers: BirdNET, for the classification of bird species, and BatDetect2, for the classification of UK bat species. We test the reliability of acoupi over a month-long deployment of two acoupi-powered devices in a UK urban park. 4. acoupi can be deployed on low-cost hardware such as the Raspberry Pi and can be customised for various applications. acoupi's standardised framework and simplified tools facilitate the adoption of AI-powered PAM systems for researchers and conservationists. acoupi is on GitHub at this https URL.

[LG-3] Matrix Product Sketching via Coordinated Sampling

Link: https://arxiv.org/abs/2501.17836
Authors: Majid Daliri,Juliana Freire,Danrong Li,Christopher Musco
Subjects: Data Structures and Algorithms (cs.DS); Databases (cs.DB); Machine Learning (cs.LG)
*Comments: 18 pages

Click to view abstract

Abstract:We revisit the well-studied problem of approximating a matrix product, A^T B, based on small space sketches \mathcal{S}(A) and \mathcal{S}(B) of A \in \mathbb{R}^{n \times d} and B \in \mathbb{R}^{n \times m}. We are interested in the setting where the sketches must be computed independently of each other, except for the use of a shared random seed. We prove that, when A and B are sparse, methods based on coordinated random sampling can outperform classical linear sketching approaches, like Johnson-Lindenstrauss Projection or CountSketch. For example, to obtain Frobenius norm error \epsilon \|A\|_F \|B\|_F, coordinated sampling requires sketches of size O(s/\epsilon^2) when A and B have at most s \leq d, m non-zeros per row. In contrast, linear sketching leads to sketches of size O(d/\epsilon^2) and O(m/\epsilon^2) for A and B. We empirically evaluate our approach on two applications: 1) distributed linear regression in databases, a problem motivated by tasks like dataset discovery and augmentation, and 2) approximating attention matrices in transformer-based language models. In both cases, our sampling algorithms yield an order of magnitude improvement over linear sketching.
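
A minimal sketch of the coordinated idea follows: both sides sample rows using keys derived from the shared seed, so the same rows survive in both sketches, and an inverse-probability-weighted product of the sketches is unbiased. Uniform sampling probabilities are used here for simplicity, whereas the paper's method would bias sampling toward important rows:

```python
import numpy as np

def coordinated_sketch(M: np.ndarray, seed: int, p: float):
    """Keep rows whose shared pseudorandom key u_i < p (same seed => same rows)."""
    rng = np.random.default_rng(seed)
    u = rng.random(M.shape[0])         # one key per row, shared via the seed
    idx = np.nonzero(u < p)[0]
    return idx, M[idx]

n, d, m, p, seed = 10_000, 20, 30, 0.05, 42
A, B = np.random.randn(n, d), np.random.randn(n, m)

# Sketch A and B independently, coordinated only through the shared seed.
idx_a, SA = coordinated_sketch(A, seed, p)
idx_b, SB = coordinated_sketch(B, seed, p)
assert (idx_a == idx_b).all()          # identical rows survive on both sides

est = SA.T @ SB / p                    # inverse-probability-weighted estimate
exact = A.T @ B
rel_err = np.linalg.norm(est - exact) / (np.linalg.norm(A) * np.linalg.norm(B))
print(f"relative Frobenius error: {rel_err:.4f}")
```

Because row i survives on both sides with probability p, the estimator's expectation is the sum over all rows of a_i b_i^T, i.e., exactly A^T B.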

[LG-4] Hierarchical Fallback Architecture for High Risk Online Machine Learning Inference

Link: https://arxiv.org/abs/2501.17834
Authors: Gustavo Polleti,Marlesson Santana,Felipe Sassi Del Sant,Eduardo Fontes
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Software Engineering (cs.SE)
*Comments:

Click to view abstract

Abstract:Open Banking powered machine learning applications require novel robustness approaches to deal with challenging stress and failure scenarios. In this paper we propose a hierarchical fallback architecture for improving robustness in high risk machine learning applications with a focus on the financial domain. We define generic failure scenarios often found in online inference that depend on external data providers and we describe in detail how to apply the hierarchical fallback architecture to address them. Finally, we offer a real world example of its applicability in the industry for near-real time transactional fraud risk evaluation using Open Banking data and under extreme stress scenarios.
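
A minimal skeleton of the idea, assuming three levels: a primary model that needs an external provider, a lightweight model over local fields, and a static rule. Every name below is a hypothetical placeholder, and the provider outage is simulated so the cascade is visible when the snippet runs:

```python
import logging

def fetch_open_banking_features(tx: dict) -> list:
    # Hypothetical external data provider; simulate an outage here.
    raise TimeoutError("Open Banking provider unreachable")

def full_model_score(tx: dict) -> float:
    """Primary scorer: depends on external Open Banking features."""
    features = fetch_open_banking_features(tx)
    return 0.0  # placeholder for primary_model.predict(features)

def simple_model_score(tx: dict) -> float:
    """Fallback scorer: uses only fields carried by the transaction itself."""
    return min(1.0, tx["amount"] / 20_000)  # stand-in for a lightweight model

def rule_score(tx: dict) -> float:
    """Last-resort static rule when every model path is unavailable."""
    return 0.9 if tx["amount"] > 10_000 else 0.1

def score_transaction(tx: dict) -> float:
    # Walk the hierarchy from richest to simplest scorer.
    for scorer in (full_model_score, simple_model_score, rule_score):
        try:
            return scorer(tx)
        except Exception as exc:  # timeouts, provider outages, bad payloads
            logging.warning("scorer %s failed: %s", scorer.__name__, exc)
    raise RuntimeError("all fallback levels failed")

print(score_transaction({"amount": 5_000, "hour": 14}))  # falls back, prints 0.25
```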

[LG-5] Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning ICLR

Link: https://arxiv.org/abs/2501.17827
Authors: Haque Ishfaq,Guangyuan Wang,Sami Nur Islam,Doina Precup
Subjects: Machine Learning (cs.LG)
*Comments: Published in The Thirteenth International Conference on Learning Representations (ICLR) 2025. The first two authors contributed equally

Click to view abstract

Abstract:Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample efficiency due to lack of principled exploration mechanism within them. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, Langevin Soft Actor Critic (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin Monte Carlo (LMC) based Q updates, parallel tempering for exploring multiple modes of the posterior of the Q function, and diffusion synthesized state-action samples regularized with Q action gradients. Our extensive experiments demonstrate that LSAC outperforms or matches the performance of mainstream model-free RL algorithms for continuous control tasks. Notably, LSAC marks the first successful application of an LMC based Thompson sampling in continuous control tasks with continuous action spaces.
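
The core LMC ingredient is a gradient step with injected Gaussian noise, so critic parameters are sampled from (rather than optimized toward) an approximate posterior, which is what yields Thompson-sampling-style exploration. A minimal sketch of one such update on a linear toy critic; LSAC's distributional critics, parallel tempering, and diffusion-synthesized samples are all omitted:

```python
import torch

@torch.no_grad()
def langevin_step(params, grads, lr: float = 1e-3, beta: float = 1e4):
    """One LMC update: theta <- theta - lr * grad + sqrt(2 * lr / beta) * noise."""
    for p, g in zip(params, grads):
        noise = torch.randn_like(p)
        p.add_(-lr * g + (2 * lr / beta) ** 0.5 * noise)

# Toy critic Q(s, a) = w^T phi(s, a); one noisy update on a Bellman-style loss.
w = torch.zeros(4)
phi, td_target = torch.randn(32, 4), torch.randn(32)
grad = 2 * phi.T @ (phi @ w - td_target) / len(phi)  # gradient of the MSE loss
langevin_step([w], [grad])
print(w)
```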

[LG-6] LEKA: LLM-Enhanced Knowledge Augmentation

Link: https://arxiv.org/abs/2501.17802
Authors: Xinhao Zhang,Jinghan Zhang,Fengran Mo,Dongjie Wang,Yanjie Fu,Kunpeng Liu
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Humans excel in analogical learning and knowledge transfer and, more importantly, possess a unique understanding of identifying appropriate sources of knowledge. From a model’s perspective, this presents an interesting challenge. If models could autonomously retrieve knowledge useful for transfer or decision-making to solve problems, they would transition from passively acquiring to actively accessing and learning from knowledge. However, filling models with knowledge is relatively straightforward – it simply requires more training and accessible knowledge bases. The more complex task is teaching models about which knowledge can be analogized and transferred. Therefore, we design a knowledge augmentation method LEKA for knowledge transfer that actively searches for suitable knowledge sources that can enrich the target domain’s knowledge. This LEKA method extracts key information from textual information from the target domain, retrieves pertinent data from external data libraries, and harmonizes retrieved data with the target domain data in feature space and marginal probability measures. We validate the effectiveness of our approach through extensive experiments across various domains and demonstrate significant improvements over traditional methods in reducing computational costs, automating data alignment, and optimizing transfer learning outcomes.

[LG-7] Detecting Anomalies Using Rotated Isolation Forest

Link: https://arxiv.org/abs/2501.17787
Authors: Vahideh Monemizadeh,Kourosh Kiani
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The Isolation Forest (iForest), proposed by Liu, Ting, and Zhou at TKDE 2012, has become a prominent tool for unsupervised anomaly detection. However, recent research by Hariri, Kind, and Brunner, published in TKDE 2021, has revealed issues with iForest. They identified the presence of axis-aligned ghost clusters that can be misidentified as normal clusters, leading to biased anomaly scores and inaccurate predictions. In response, they developed the Extended Isolation Forest (EIF), which effectively solves these issues by eliminating the ghost clusters introduced by iForest. This enhancement results in improved consistency of anomaly scores and superior performance. We reveal a previously overlooked problem in the Extended Isolation Forest (EIF), showing that it is vulnerable to ghost inter-clusters between normal clusters of data points. In this paper, we introduce the Rotated Isolation Forest (RIF) algorithm which effectively addresses both the axis-aligned ghost clusters observed in iForest and the ghost inter-clusters seen in EIF. RIF accomplishes this by randomly rotating the dataset (using random rotation matrices and QR decomposition) before feeding it into the iForest construction, thereby increasing dataset variation and eliminating ghost clusters. Our experiments conclusively demonstrate that the RIF algorithm outperforms iForest and EIF, as evidenced by the results obtained from both synthetic datasets and real-world datasets.
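
The rotation mechanism is fully specified in the abstract (QR decomposition of a random Gaussian matrix), so it can be sketched directly on top of scikit-learn's IsolationForest. Averaging scores over several rotated fits is one plausible reading; the paper may instead draw one rotation per tree:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def rotated_isolation_forest(X: np.ndarray, n_rotations: int = 10, seed: int = 0):
    """Fit iForests on randomly rotated copies of X and average their scores."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X))
    for _ in range(n_rotations):
        # QR decomposition of a Gaussian matrix yields an orthogonal Q,
        # i.e., a random rotation (up to reflections) of the feature space.
        Q, _ = np.linalg.qr(rng.standard_normal((X.shape[1], X.shape[1])))
        forest = IsolationForest(random_state=seed).fit(X @ Q)
        scores += forest.score_samples(X @ Q)
    return scores / n_rotations  # lower score = more anomalous

X = np.vstack([np.random.randn(200, 2), np.random.randn(5, 2) + 6])
print(rotated_isolation_forest(X)[-5:])  # the 5 shifted points score lowest
```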

[LG-8] AdditiveLLM : Large Language Models Predict Defects in Additive Manufacturing

Link: https://arxiv.org/abs/2501.17784
Authors: Peter Pak,Amir Barati Farimani
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In this work we investigate the ability of large language models to predict additive manufacturing defect regimes given a set of process parameter inputs. For this task we utilize a process parameter defect dataset to fine-tune a collection of models, titled AdditiveLLM, for the purpose of predicting potential defect regimes including Keyholing, Lack of Fusion, and Balling. We compare different methods of input formatting in order to gauge the model’s performance to correctly predict defect regimes on our sparse Baseline dataset and our natural language Prompt dataset. The model displays robust predictive capability, achieving an accuracy of 93% when asked to provide the defect regimes associated with a set of process parameters. The incorporation of natural language input further simplifies the task of process parameters selection, enabling users to identify optimal settings specific to their build.

[LG-9] Picard-KKT-hPINN: Enforcing Nonlinear Enthalpy Balances for Physically Consistent Neural Networks

Link: https://arxiv.org/abs/2501.17782
Authors: Giacomo Lastrucci,Tanuj Karia,Zoë Gromotka,Artur M. Schweidtmann
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Neural networks are widely used as surrogate models but they do not guarantee physically consistent predictions, thereby preventing adoption in various applications. We propose a method that can enforce NNs to satisfy physical laws that are nonlinear in nature, such as enthalpy balances. Our approach, inspired by Picard's successive approximation method, aims to enforce multiplicatively separable constraints by sequentially freezing and projecting a set of the participating variables. We demonstrate our Picard-KKT-hPINN for surrogate modeling of a catalytic packed bed reactor for methanol synthesis. Our results show that the method efficiently enforces nonlinear enthalpy and linear atomic balances at machine-level precision. Additionally, we show that enforcing conservation laws can improve accuracy in data-scarce conditions compared to a vanilla multilayer perceptron.
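
The freeze-and-project idea can be shown on a toy bilinear balance y0*y1 + y2 = 1: freezing one factor makes the constraint linear, an orthogonal (KKT) projection then enforces it exactly, and alternating the frozen variable gives the Picard-style loop. The constraint and values are illustrative, not the paper's enthalpy balance:

```python
import numpy as np

def kkt_project(y, A, b):
    """Orthogonal projection of y onto the affine set {x : A x = b}."""
    return y - A.T @ np.linalg.solve(A @ A.T, A @ y - b)

# Toy multiplicatively separable constraint: y0 * y1 + y2 = 1.
y = np.array([0.8, 1.5, 0.3])              # raw "network output" (violates it)
for _ in range(5):                         # Picard-style successive projections
    A = np.array([[y[1], 0.0, 1.0]])       # freeze y1 -> linear in (y0, y2)
    y = kkt_project(y, A, np.array([1.0]))
    A = np.array([[0.0, y[0], 1.0]])       # freeze y0 -> linear in (y1, y2)
    y = kkt_project(y, A, np.array([1.0]))
print(y, y[0] * y[1] + y[2])               # constraint holds to machine precision
```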

[LG-10] Generative Unordered Flow for Set-Structured Data Generation

Link: https://arxiv.org/abs/2501.17770
Authors: Yangming Li,Carola-Bibiane Schönlieb
Subjects: Machine Learning (cs.LG)
*Comments: Paper under review

Click to view abstract

Abstract:Flow-based generative models have demonstrated promising performance across a broad spectrum of data modalities (e.g., image and text). However, there are few works exploring their extension to unordered data (e.g., spatial point set), which is not trivial because previous models are mostly designed for vector data that are naturally ordered. In this paper, we present unordered flow, a type of flow-based generative model for set-structured data generation. Specifically, we convert unordered data into an appropriate function representation, and learn the probability measure of such representations through function-valued flow matching. For the inverse map from a function representation to unordered data, we propose a method similar to particle filtering, with Langevin dynamics to first warm up the initial particles and gradient-based search to update them until convergence. We have conducted extensive experiments on multiple real-world datasets, showing that our unordered flow model is very effective in generating set-structured data and significantly outperforms previous baselines.

[LG-11] Dynamics of Transient Structure in In-Context Linear Regression Transformers

Link: https://arxiv.org/abs/2501.17745
Authors: Liam Carroll,Jesse Hoogland,Matthew Farrugia-Roberts,Daniel Murfet
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Modern deep neural networks display striking examples of rich internal computational structure. Uncovering principles governing the development of such structure is a priority for the science of deep learning. In this paper, we explore the transient ridge phenomenon: when transformers are trained on in-context linear regression tasks with intermediate task diversity, they initially behave like ridge regression before specializing to the tasks in their training distribution. This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis. Further, we draw on the theory of Bayesian internal model selection to suggest a general explanation for the phenomena of transient structure in transformers, based on an evolving tradeoff between loss and complexity. This explanation is grounded in empirical measurements of model complexity using the local learning coefficient.

[LG-12] Sparser, Better, Faster, Stronger: Efficient Automatic Differentiation for Sparse Jacobians and Hessians

Link: https://arxiv.org/abs/2501.17737
Authors: Adrian Hill,Guillaume Dalle
Subjects: Machine Learning (cs.LG); Mathematical Software (cs.MS)
*Comments: 29 pages, 5 figures, 8 tables, 2 listings

Click to view abstract

Abstract:From implicit differentiation to probabilistic modeling, Jacobians and Hessians have many potential use cases in Machine Learning (ML), but conventional wisdom views them as computationally prohibitive. Fortunately, these matrices often exhibit sparsity, which can be leveraged to significantly speed up the process of Automatic Differentiation (AD). This paper presents advances in Automatic Sparse Differentiation (ASD), starting with a new perspective on sparsity detection. Our refreshed exposition is based on operator overloading, able to detect both local and global sparsity patterns, and naturally avoids dead ends in the control flow graph. We also describe a novel ASD pipeline in Julia, consisting of independent software packages for sparsity detection, matrix coloring, and differentiation, which together enable ASD based on arbitrary AD backends. Our pipeline is fully automatic and requires no modification of existing code, making it compatible with existing ML codebases. We demonstrate that this pipeline unlocks Jacobian and Hessian matrices at scales where they were considered too expensive to compute. On real-world problems from scientific ML and optimization, we show significant speed-ups of up to three orders of magnitude. Notably, our ASD pipeline often outperforms standard AD for one-off computations, once thought impractical due to slower sparsity detection methods.
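
The operator-overloading view of sparsity detection is easy to miniaturize: propagate, through every arithmetic operation, the set of inputs each intermediate value depends on. The toy tracer below detects a global Jacobian sparsity pattern; the paper's Julia pipeline is far more complete (local patterns, control-flow handling, matrix coloring):

```python
class Tracer:
    """Propagates the set of inputs each intermediate value depends on."""
    def __init__(self, deps: frozenset):
        self.deps = deps

    def _merge(self, other):
        deps = other.deps if isinstance(other, Tracer) else frozenset()
        return Tracer(self.deps | deps)

    # All elementwise arithmetic just unions dependency sets.
    __add__ = __mul__ = __sub__ = __truediv__ = _merge
    __radd__ = __rmul__ = __rsub__ = __rtruediv__ = _merge

def jacobian_sparsity(f, n_inputs: int):
    """Rows = outputs, columns = inputs; True where df_i/dx_j may be nonzero."""
    xs = [Tracer(frozenset({j})) for j in range(n_inputs)]
    return [[j in out.deps for j in range(n_inputs)] for out in f(xs)]

# Example: f(x) = (x0 * x1, x2 + x3, x0 / x3)
def f(x):
    return [x[0] * x[1], x[2] + x[3], x[0] / x[3]]

for row in jacobian_sparsity(f, 4):
    print(["X" if v else "." for v in row])
```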

[LG-13] Sparse Autoencoders Can Interpret Randomly Initialized Transformers

Link: https://arxiv.org/abs/2501.17727
Authors: Thomas Heap,Tim Lawson,Lucy Farnik,Laurence Aitchison
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Sparse autoencoders (SAEs) are an increasingly popular technique for interpreting the internal representations of transformers. In this paper, we apply SAEs to ‘interpret’ random transformers, i.e., transformers where the parameters are sampled IID from a Gaussian rather than trained on text data. We find that random and trained transformers produce similarly interpretable SAE latents, and we confirm this finding quantitatively using an open-source auto-interpretability pipeline. Further, we find that SAE quality metrics are broadly similar for random and trained transformers. We find that these results hold across model sizes and layers. We discuss a number of interesting questions that this work raises for the use of SAEs and auto-interpretability in the context of mechanistic interpretability.
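
For readers unfamiliar with the setup, here is a minimal SAE of the kind applied to transformer activations, trained on reconstruction plus an L1 sparsity penalty; the layer sizes and penalty weight are illustrative, not the paper's training recipe:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes activations into an overcomplete set of sparse latents."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))   # sparse, non-negative latent code
        return self.dec(z), z

sae = SparseAutoencoder(d_model=512, d_latent=4096)
acts = torch.randn(1024, 512)         # stand-in for residual-stream activations
recon, z = sae(acts)
loss = ((recon - acts) ** 2).mean() + 3e-4 * z.abs().mean()
print(loss)
```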

[LG-14] STGCN-LSTM for Olympic Medal Prediction: Dynamic Power Modeling and Causal Policy Optimization

Link: https://arxiv.org/abs/2501.17711
Authors: Yiquan Wang,Jiaying Wang,Jingyi Yang,Zihao Xu
Subjects: Machine Learning (cs.LG)
*Comments: 18 pages, 7 figures

Click to view abstract

Abstract:This paper proposes a novel hybrid model, STGCN-LSTM, to forecast Olympic medal distributions by integrating the spatio-temporal relationships among countries and the long-term dependencies of national performance. The Spatial-Temporal Graph Convolution Network (STGCN) captures geographic and interactive factors-such as coaching exchange and socio-economic links-while the Long Short-Term Memory (LSTM) module models historical trends in medal counts, economic data, and demographics. To address zero-inflated outputs (i.e., the disparity between countries that consistently yield wins and those never having won medals), a Zero-Inflated Compound Poisson (ZICP) framework is incorporated to separate random zeros from structural zeros, providing a clearer view of potential breakthrough performances. Validation includes historical backtracking, policy shock simulations, and causal inference checks, confirming the robustness of the proposed method. Results shed light on the influence of coaching mobility, event specialization, and strategic investment on medal forecasts, offering a data-driven foundation for optimizing sports policies and resource allocation in diverse Olympic contexts.
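
The zero-inflation component separates structural zeros (countries that essentially never win, captured by a mixing weight pi) from random zeros of the count model. Below is a minimal sketch of the plain zero-inflated Poisson log-likelihood; the paper's compound-Poisson variant additionally models the size of medal "jumps":

```python
import numpy as np
from scipy.stats import poisson

def zip_log_likelihood(y: np.ndarray, pi: float, lam: float) -> float:
    """Zero-inflated Poisson: P(0) = pi + (1-pi)e^{-lam}; else (1-pi)*Poisson."""
    ll_zero = np.log(pi + (1 - pi) * np.exp(-lam))
    ll_pos = np.log(1 - pi) + poisson.logpmf(y, lam)
    return float(np.where(y == 0, ll_zero, ll_pos).sum())

# Illustrative medal counts: many structural zeros plus a few winners.
medals = np.array([0, 0, 0, 1, 3, 0, 12, 0, 2])
print(zip_log_likelihood(medals, pi=0.4, lam=3.0))
```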

[LG-15] Decision-Theoretic Approaches in Learning-Augmented Algorithms

Link: https://arxiv.org/abs/2501.17701
Authors: Spyros Angelopoulos,Christoph Dürr,Georgii Melidi
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In this work, we initiate the systematic study of decision-theoretic metrics in the design and analysis of algorithms with machine-learned predictions. We introduce approaches based on both deterministic measures such as distance-based evaluation, that help us quantify how close the algorithm is to an ideal solution, as well as stochastic measures that allow us to balance the trade-off between the algorithm’s performance and the risk associated with the imperfect oracle. These approaches help us quantify the algorithmic performance across the entire spectrum of prediction error, unlike several previous works that focus on few, and often extreme values of the error. We apply these techniques to two well-known problems from resource allocation and online decision making, namely contract scheduling and 1-max search.

[LG-16] Temperature-Free Loss Function for Contrastive Learning

Link: https://arxiv.org/abs/2501.17683
Authors: Bum Jun Kim,Sang Woo Kim
Subjects: Machine Learning (cs.LG)
*Comments: 10 pages, 5 figures

Click to view abstract

Abstract:As one of the most promising methods in self-supervised learning, contrastive learning has achieved a series of breakthroughs across numerous fields. A predominant approach to implementing contrastive learning is applying InfoNCE loss: By capturing the similarities between pairs, InfoNCE loss enables learning the representation of data. Despite its success, adopting InfoNCE loss requires tuning a temperature, which is a core hyperparameter for calibrating similarity scores. Although several studies have emphasized its significance and its sensitivity to performance, searching for a valid temperature requires extensive trial-and-error-based experiments, which increases the difficulty of adopting InfoNCE loss. To address this difficulty, we propose a novel method to deploy InfoNCE loss without temperature. Specifically, we replace temperature scaling with the inverse hyperbolic tangent function, resulting in a modified InfoNCE loss. In addition to hyperparameter-free deployment, we observed that the proposed method even yielded a performance gain in contrastive learning. Our detailed theoretical analysis discovers that the current practice of temperature scaling in InfoNCE loss causes serious problems in gradient descent, whereas our method provides desirable gradient properties. The proposed method was validated on five benchmarks on contrastive learning, yielding satisfactory results without temperature tuning.
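
The proposed modification is stated directly: replace similarity/temperature with the inverse hyperbolic tangent of the similarity. A minimal sketch, assuming it is applied to cosine similarities with positives on the diagonal; the clamping constant is an implementation detail added here to keep atanh finite:

```python
import torch
import torch.nn.functional as F

def temperature_free_infonce(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """InfoNCE with atanh(cos_sim) in place of cos_sim / temperature."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T                        # cosine similarities in [-1, 1]
    eps = 1e-4                             # keep atanh finite at the endpoints
    logits = torch.atanh(sim.clamp(-1 + eps, 1 - eps))
    labels = torch.arange(len(z1), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = temperature_free_infonce(torch.randn(32, 128), torch.randn(32, 128))
print(loss)
```

Note that atanh is steep near similarity 1 and -1, so it plays a calibrating role similar to dividing by a small temperature, but without a tunable hyperparameter.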

[LG-17] Explainable Artificial Intelligence for identifying profitability predictors in Financial Statements

Link: https://arxiv.org/abs/2501.17676
Authors: Marco Piazza,Mauro Passacantando,Francesca Magli,Federica Doni,Andrea Amaduzzi,Enza Messina
Subjects: Machine Learning (cs.LG)
*Comments: Short paper presented at Workshop on “Ai in Finance” at European Conference on Artificial Intelligence

Click to view abstract

Abstract:The interconnected nature of the economic variables influencing a firm’s performance makes the prediction of a company’s earning trend a challenging task. Existing methodologies often rely on simplistic models and financial ratios failing to capture the complexity of interacting influences. In this paper, we apply Machine Learning techniques to raw financial statements data taken from AIDA, a Database comprising Italian listed companies’ data from 2013 to 2022. We present a comparative study of different models and, following the European AI regulations, we complement our analysis by applying explainability techniques to the proposed models. In particular, we propose adopting an eXplainable Artificial Intelligence method based on Game Theory to identify the most sensitive features and make the result more interpretable.

[LG-18] CAMP in the Odyssey: Provably Robust Reinforcement Learning with Certified Radius Maximization USENIX-SECURITY

Link: https://arxiv.org/abs/2501.17667
Authors: Derui Wang,Kristen Moore,Diksha Goel,Minjune Kim,Gang Li,Yang Li,Robin Doss,Minhui Xue,Bo Li,Seyit Camtepe,Liming Zhu
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments: Accepted to USENIX Security Symposium 2025, Seattle, WA, USA. Source code is available at Github ( this https URL ) and Zenodo ( this https URL )

Click to view abstract

Abstract:Deep reinforcement learning (DRL) has gained widespread adoption in control and decision-making tasks due to its strong performance in dynamic environments. However, DRL agents are vulnerable to noisy observations and adversarial attacks, and concerns about the adversarial robustness of DRL systems have emerged. Recent efforts have focused on addressing these robustness issues by establishing rigorous theoretical guarantees for the returns achieved by DRL agents in adversarial settings. Among these approaches, policy smoothing has proven to be an effective and scalable method for certifying the robustness of DRL agents. Nevertheless, existing certifiably robust DRL relies on policies trained with simple Gaussian augmentations, resulting in a suboptimal trade-off between certified robustness and certified return. To address this issue, we introduce a novel paradigm dubbed Certified-rAdius-Maximizing Policy (CAMP) training. CAMP is designed to enhance DRL policies, achieving better utility without compromising provable robustness. By leveraging the insight that the global certified radius can be derived from local certified radii based on training-time statistics, CAMP formulates a surrogate loss related to the local certified radius and optimizes the policy guided by this surrogate loss. We also introduce policy imitation as a novel technique to stabilize CAMP training. Experimental results demonstrate that CAMP significantly improves the robustness-return trade-off across various tasks. Based on the results, CAMP can achieve up to twice the certified expected return compared to that of baselines. Our code is available at this https URL.

[LG-19] Landscape Features in Single-Objective Continuous Optimization: Have We Hit a Wall in Algorithm Selection Generalization?

Link: https://arxiv.org/abs/2501.17663
Authors: Gjorgjina Cenikj,Gašper Petelin,Moritz Seiler,Nikola Cenikj,Tome Eftimov
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The process of identifying the most suitable optimization algorithm for a specific problem, referred to as algorithm selection (AS), entails training models that leverage problem landscape features to forecast algorithm performance. A significant challenge in this domain is ensuring that AS models can generalize effectively to novel, unseen problems. This study evaluates the generalizability of AS models based on different problem representations in the context of single-objective continuous optimization. In particular, it considers the most widely used Exploratory Landscape Analysis features, as well as recently proposed Topological Landscape Analysis features, and features based on deep learning, such as DeepELA, TransOptAS and Doe2Vec. Our results indicate that when presented with out-of-distribution evaluation data, none of the feature-based AS models outperform a simple baseline model, i.e., a Single Best Solver.

[LG-20] Drivetrain simulation using variational autoencoders

Link: https://arxiv.org/abs/2501.17653
Authors: Pallavi Sharma,Jorge-Humberto Urrea-Quintero,Bogdan Bogdan,Adrian-Dumitru Ciotec,Laura Vasilie,Henning Wessels,Matteo Skull
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP)
*Comments: 27 pages

Click to view abstract

Abstract:This work proposes variational autoencoders (VAEs) to predict a vehicle’s jerk from a given torque demand, addressing the limitations of sparse real-world datasets. Specifically, we implement unconditional and conditional VAEs to generate jerk signals that integrate features from different drivetrain scenarios. The VAEs are trained on experimental data collected from two variants of a fully electric SUV, which differ in maximum torque delivery and drivetrain configuration. New meaningful jerk signals are generated within an engineering context through the interpretation of the VAE’s latent space. A performance comparison with baseline physics-based and hybrid models confirms the effectiveness of the VAEs. We show that VAEs bypass the need for exhaustive manual system parametrization while maintaining physical plausibility by conditioning data generation on specific inputs.

[LG-21] nabqr: Python package for improving probabilistic forecasts

Link: https://arxiv.org/abs/2501.17604
Authors: Bastian Schmidt Jørgensen,Jan Kloppenborg Møller,Peter Nystrup,Henrik Madsen
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
*Comments:

Click to view abstract

Abstract:We introduce the open-source Python package NABQR: Neural Adaptive Basis for (time-adaptive) Quantile Regression that provides reliable probabilistic forecasts. NABQR corrects ensembles (scenarios) with LSTM networks and then applies time-adaptive quantile regression to the corrected ensembles to obtain improved and more reliable forecasts. With the suggested package, accuracy improvements of up to 40% in mean absolute terms can be achieved in day-ahead forecasting of onshore and offshore wind power production in Denmark.

[LG-22] RegionGCN: Spatial-Heterogeneity-Aware Graph Convolutional Networks

Link: https://arxiv.org/abs/2501.17599
Authors: Hao Guo,Han Wang,Di Zhu,Lun Wu,A. Stewart Fotheringham,Yu Liu
Subjects: Machine Learning (cs.LG)
*Comments: 28 pages, 6 figures

Click to view abstract

Abstract:Modeling spatial heterogeneity in the data generation process is essential for understanding and predicting geographical phenomena. Despite their prevalence in geospatial tasks, neural network models usually assume spatial stationarity, which could limit their performance in the presence of spatial process heterogeneity. By allowing model parameters to vary over space, several approaches have been proposed to incorporate spatial heterogeneity into neural networks. However, current geographically weighting approaches are ineffective on graph neural networks, yielding no significant improvement in prediction accuracy. We assume the crux lies in the over-fitting risk brought by a large number of local parameters. Accordingly, we propose to model spatial process heterogeneity at the regional level rather than at the individual level, which largely reduces the number of spatially varying parameters. We further develop a heuristic optimization procedure to learn the region partition adaptively in the process of model training. Our proposed spatial-heterogeneity-aware graph convolutional network, named RegionGCN, is applied to the spatial prediction of county-level vote share in the 2016 US presidential election based on socioeconomic attributes. Results show that RegionGCN achieves significant improvement over the basic and geographically weighted GCNs. We also offer an exploratory analysis tool for the spatial variation of non-linear relationships through ensemble learning of regional partitions from RegionGCN. Our work contributes to the practice of Geospatial Artificial Intelligence (GeoAI) in tackling spatial heterogeneity.

[LG-23] Histogram approaches for imbalanced data streams regression

Link: https://arxiv.org/abs/2501.17568
Authors: Ehsan Aminian,Joao Gama,Rita P. Ribeiro
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Handling imbalanced data streams in regression tasks presents a significant challenge, as rare instances can appear anywhere in the target distribution rather than being confined to its extreme values. In this paper, we introduce novel data-level sampling strategies, HistUS and HistOS, that utilize histogram-based approaches to dynamically balance data streams. Unlike previous methods based on Chebyshev's inequality, our proposed techniques identify and handle rare cases across the entire distribution effectively. We demonstrate that HistUS and HistOS outperform traditional methods through extensive experiments on synthetic and real-world datasets, leading to more accurate and robust regression models in streaming environments.
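To make the histogram idea concrete, here is a small sketch of what a histogram-driven undersampler could compute: a per-instance keep-probability that is high for targets in rare bins. The weighting below is our illustration; HistUS/HistOS define their own scheme.

```python
import numpy as np

def histogram_keep_probability(y: np.ndarray, n_bins: int = 10) -> np.ndarray:
    counts, edges = np.histogram(y, bins=n_bins)
    bin_idx = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)
    rel_freq = counts[bin_idx] / counts.sum()   # how common each target's bin is
    keep = 1.0 - rel_freq                       # rare bins -> keep probability near 1
    return keep / keep.max()

# Undersampling a batch from the stream:
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
mask = rng.random(y.size) < histogram_keep_probability(y)
```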

[LG-24] Heuristic-Informed Mixture of Experts for Link Prediction in Multilayer Networks

Link: https://arxiv.org/abs/2501.17557
Authors: Lucio La Cava,Domenico Mandaglio,Lorenzo Zangari,Andrea Tagarelli
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
*Comments: Under Review

Click to view abstract

Abstract:Link prediction algorithms for multilayer networks are in principle required to effectively account for the entire layered structure while capturing the unique contexts offered by each layer. However, many existing approaches excel at predicting specific links in certain layers but struggle with others, as they fail to effectively leverage the diverse information encoded across different network layers. In this paper, we present MoE-ML-LP, the first Mixture-of-Experts (MoE) framework specifically designed for multilayer link prediction. Building on top of multilayer heuristics for link prediction, MoE-ML-LP synthesizes the decisions taken by diverse experts, resulting in significantly enhanced predictive capabilities. Our extensive experimental evaluation on real-world and synthetic networks demonstrates that MoE-ML-LP consistently outperforms several baselines and competing methods, achieving remarkable improvements of +60% in Mean Reciprocal Rank, +82% in Hits@1, +55% in Hits@5, and +41% in Hits@10. Furthermore, MoE-ML-LP features a modular architecture that enables the seamless integration of newly developed experts without necessitating the re-training of the entire framework, fostering efficiency and scalability to new experts, paving the way for future advancements in link prediction.

[LG-25] Closing the Gap Between Synthetic and Ground Truth Time Series Distributions via Neural Mapping

Link: https://arxiv.org/abs/2501.17553
Authors: Daesoo Lee,Sara Malacarne,Erlend Aune
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:In this paper, we introduce Neural Mapper for Vector Quantized Time Series Generator (NM-VQTSG), a novel method aimed at addressing fidelity challenges in vector quantized (VQ) time series generation. VQ-based methods, such as TimeVQVAE, have demonstrated success in generating time series but are hindered by two critical bottlenecks: information loss during compression into discrete latent spaces and deviations in the learned prior distribution from the ground truth distribution. These challenges result in synthetic time series with compromised fidelity and distributional accuracy. To overcome these limitations, NM-VQTSG leverages a U-Net-based neural mapping model to bridge the distributional gap between synthetic and ground truth time series. To be more specific, the model refines synthetic data by addressing artifacts introduced during generation, effectively aligning the distributions of synthetic and real data. Importantly, NM-VQTSG can be used for synthetic time series generated by any VQ-based generative method. We evaluate NM-VQTSG across diverse datasets from the UCR Time Series Classification archive, demonstrating its capability to consistently enhance fidelity in both unconditional and conditional generation tasks. These improvements are evidenced by significant gains in FID, IS, and conditional FID, additionally backed up by visual inspection in a data space and a latent space. Our findings establish NM-VQTSG as a new method to improve the quality of synthetic time series. Our implementation is available at this https URL.

[LG-26] NF-MKV Net: A Constraint-Preserving Neural Network Approach to Solving Mean-Field Games Equilibrium

Link: https://arxiv.org/abs/2501.17450
Authors: Jinwei Liu,Lu Ren,Wang Yao,Xiao Zhang
Subjects: Machine Learning (cs.LG)
*Comments: 7 pages

Click to view abstract

Abstract:Neural network-based methods for solving Mean-Field Games (MFGs) equilibria have garnered significant attention for their effectiveness in high-dimensional problems. However, many algorithms struggle with ensuring that the evolution of the density distribution adheres to the required mathematical constraints. This paper investigates a neural network approach to solving MFGs equilibria through a stochastic process perspective. It integrates process-regularized Normalizing Flow (NF) frameworks with state-policy-connected time-series neural networks to address McKean-Vlasov-type Forward-Backward Stochastic Differential Equation (MKV FBSDE) fixed-point problems, equivalent to MFGs equilibria.

[LG-27] Gradual Domain Adaptation for Graph Learning

Link: https://arxiv.org/abs/2501.17443
Authors: Pui Ieng Lei,Ximing Chen,Yijun Sheng,Yanyan Liu,Jingzhi Guo,Zhiguo Gong
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Existing literature lacks a graph domain adaptation technique for handling large distribution shifts, primarily due to the difficulty in simulating an evolving path from source to target graph. To make a breakthrough, we present a graph gradual domain adaptation (GGDA) framework with the construction of a compact domain sequence that minimizes information loss in adaptations. Our approach starts with an efficient generation of knowledge-preserving intermediate graphs over the Fused Gromov-Wasserstein (FGW) metric. With the bridging data pool, GGDA domains are then constructed via a novel vertex-based domain progression, which comprises “close” vertex selections and adaptive domain advancement to enhance inter-domain information transferability. Theoretically, our framework concretizes the intractable inter-domain distance W_p(\mu_t,\mu_{t+1}) via implementable upper and lower bounds, enabling flexible adjustments of this metric for optimizing domain formation. Extensive experiments under various transfer scenarios validate the superior performance of our GGDA framework.

[LG-28] Human-Aligned Skill Discovery: Balancing Behaviour Exploration and Alignment AAMAS2025

Link: https://arxiv.org/abs/2501.17431
Authors: Maxence Hussonnois,Thommen George Karimpanal,Santu Rana
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: Accepted at the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)

Click to view abstract

Abstract:Unsupervised skill discovery in Reinforcement Learning aims to mimic humans’ ability to autonomously discover diverse behaviors. However, existing methods are often unconstrained, making it difficult to find useful skills, especially in complex environments, where discovered skills are frequently unsafe or impractical. We address this issue by proposing Human-aligned Skill Discovery (HaSD), a framework that incorporates human feedback to discover safer, more aligned skills. HaSD simultaneously optimises skill diversity and alignment with human values. This approach ensures that alignment is maintained throughout the skill discovery process, eliminating the inefficiencies associated with exploring unaligned skills. We demonstrate its effectiveness in both 2D navigation and SafetyGymnasium environments, showing that HaSD discovers diverse, human-aligned skills that are safe and useful for downstream tasks. Finally, we extend HaSD by learning a range of configurable skills with varying degrees of diversity alignment trade-offs that could be useful in practical scenarios.

[LG-29] WCDT: Systematic WCET Optimization for Decision Tree Implementations

Link: https://arxiv.org/abs/2501.17428
Authors: Nils Hölscher,Christian Hakert,Georg von der Brüggen,Jian-Jia Chen,Kuan-Hsun Chen,Jan Reineke
Subjects: Machine Learning (cs.LG); Performance (cs.PF)
*Comments:

Click to view abstract

Abstract:Machine-learning models are increasingly deployed on resource-constrained embedded systems with strict timing constraints. In such scenarios, the worst-case execution time (WCET) of the models is required to ensure safe operation. Specifically, decision trees are a prominent class of machine-learning models and the main building blocks of tree-based ensemble models (e.g., random forests), which are commonly employed in resource-constrained embedded systems. In this paper, we develop a systematic approach for WCET optimization of decision tree implementations. To this end, we introduce a linear surrogate model that estimates the execution time of individual paths through a decision tree based on the path's length and the number of taken branches. We provide an optimization algorithm that constructively builds a WCET-optimal implementation of a given decision tree with respect to this surrogate model. We experimentally evaluate both the surrogate model and the WCET-optimization algorithm. The evaluation shows that the optimization algorithm improves analytically determined WCET by up to 17% compared to an unoptimized implementation.
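The surrogate described in the abstract is linear in two path features; a minimal sketch follows, with placeholder coefficients that would in practice be fitted to measurements on the target hardware.

```python
def path_time_estimate(path_len: int, taken_branches: int,
                       a: float = 1.0, b: float = 2.0, c: float = 0.0) -> float:
    # Surrogate: execution time grows with path length and number of taken branches.
    return a * path_len + b * taken_branches + c

def wcet_estimate(paths) -> float:
    # WCET under the surrogate: the worst root-to-leaf path of the decision tree.
    return max(path_time_estimate(length, taken) for length, taken in paths)

print(wcet_estimate([(5, 2), (7, 4), (6, 1)]))  # -> 15.0
```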

[LG-30] Certificated Actor-Critic: Hierarchical Reinforcement Learning with Control Barrier Functions for Safe Navigation ICRA2025

Link: https://arxiv.org/abs/2501.17424
Authors: Junjun Xie,Shuhao Zhao,Liang Hu,Huijun Gao
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: Accepted to ICRA 2025

Click to view abstract

Abstract:Control Barrier Functions (CBFs) have emerged as a prominent approach to designing safe navigation systems of robots. Despite their popularity, current CBF-based methods exhibit some limitations: optimization-based safe control techniques tend to be either myopic or computationally intensive, and they rely on simplified system models; conversely, the learning-based methods suffer from the lack of quantitative indication in terms of navigation performance and safety. In this paper, we present a new model-free reinforcement learning algorithm called Certificated Actor-Critic (CAC), which introduces a hierarchical reinforcement learning framework and well-defined reward functions derived from CBFs. We provide a theoretical analysis and proofs for our algorithm, and propose several improvements to its implementation. Our analysis is validated by two simulation experiments, showing the effectiveness of our proposed CAC algorithm.

[LG-31] si4onnx: A Python package for Selective Inference in Deep Learning Models

Link: https://arxiv.org/abs/2501.17415
Authors: Teruyuki Katsuoka,Tomohiro Shiraishi,Daiki Miwa,Shuichi Nishino,Ichiro Takeuchi
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 35 pages, 3 figures

Click to view abstract

Abstract:In this paper, we introduce si4onnx, a package for performing selective inference on deep learning models. Techniques such as CAM in XAI and reconstruction-based anomaly detection using VAE can be interpreted as methods for identifying significant regions within input images. However, the identified regions may not always carry meaningful significance. Therefore, evaluating the statistical significance of these regions represents a crucial challenge in establishing the reliability of AI systems. si4onnx is a Python package that enables straightforward implementation of hypothesis testing with controlled type I error rates through selective inference. It is compatible with deep learning models constructed using common frameworks such as PyTorch and TensorFlow.

[LG-32] Poisoning Attacks and Defenses to Federated Unlearning

Link: https://arxiv.org/abs/2501.17396
Authors: Wenbin Wang,Qiwen Ma,Zifan Zhang,Yuchen Liu,Zhuqing Liu,Minghong Fang
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments: To appear in The Web Conference 2025

Click to view abstract

Abstract:Federated learning allows multiple clients to collaboratively train a global model with the assistance of a server. However, its distributed nature makes it susceptible to poisoning attacks, where malicious clients can compromise the global model by sending harmful local model updates to the server. To unlearn an accurate global model from a poisoned one after identifying malicious clients, federated unlearning has been introduced. Yet, current research on federated unlearning has primarily concentrated on its effectiveness and efficiency, overlooking the security challenges it presents. In this work, we bridge the gap via proposing BadUnlearn, the first poisoning attacks targeting federated unlearning. In BadUnlearn, malicious clients send specifically designed local model updates to the server during the unlearning process, aiming to ensure that the resulting unlearned model remains poisoned. To mitigate these threats, we propose UnlearnGuard, a robust federated unlearning framework that is provably robust against both existing poisoning attacks and our BadUnlearn. The core concept of UnlearnGuard is for the server to estimate the clients’ local model updates during the unlearning process and employ a filtering strategy to verify the accuracy of these estimations. Theoretically, we prove that the model unlearned through UnlearnGuard closely resembles one obtained by train-from-scratch. Empirically, we show that BadUnlearn can effectively corrupt existing federated unlearning methods, while UnlearnGuard remains secure against poisoning attacks.

[LG-33] Byzantine-Robust Federated Learning over Ring-All-Reduce Distributed Computing

Link: https://arxiv.org/abs/2501.17392
Authors: Minghong Fang,Zhuqing Liu,Xuecen Zhao,Jia Liu
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: To appear in The Web Conference 2025

Click to view abstract

Abstract:Federated learning (FL) has gained attention as a distributed learning paradigm for its data privacy benefits and accelerated convergence through parallel computation. Traditional FL relies on a server-client (SC) architecture, where a central server coordinates multiple clients to train a global model, but this approach faces scalability challenges due to server communication bottlenecks. To overcome this, the ring-all-reduce (RAR) architecture has been introduced, eliminating the central server and achieving bandwidth optimality. However, the tightly coupled nature of RAR’s ring topology exposes it to unique Byzantine attack risks not present in SC-based FL. Despite its potential, designing Byzantine-robust RAR-based FL algorithms remains an open problem. To address this gap, we propose BRACE (Byzantine-robust ring-all-reduce), the first RAR-based FL algorithm to achieve both Byzantine robustness and communication efficiency. We provide theoretical guarantees for the convergence of BRACE under Byzantine attacks, demonstrate its bandwidth efficiency, and validate its practical effectiveness through experiments. Our work offers a foundational understanding of Byzantine-robust RAR-based FL design.

[LG-34] Do We Really Need to Design New Byzantine-robust Aggregation Rules? NDSS2025

Link: https://arxiv.org/abs/2501.17381
Authors: Minghong Fang,Seyedsina Nabavirazavi,Zhuqing Liu,Wei Sun,Sundararaja Sitharama Iyengar,Haibo Yang
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments: To appear in NDSS 2025

Click to view abstract

Abstract:Federated learning (FL) allows multiple clients to collaboratively train a global machine learning model through a server, without exchanging their private training data. However, the decentralized aspect of FL makes it susceptible to poisoning attacks, where malicious clients can manipulate the global model by sending altered local model updates. To counter these attacks, a variety of aggregation rules designed to be resilient to Byzantine failures have been introduced. Nonetheless, these methods can still be vulnerable to sophisticated attacks or depend on unrealistic assumptions about the server. In this paper, we demonstrate that there is no need to design new Byzantine-robust aggregation rules; instead, FL can be secured by enhancing the robustness of well-established aggregation rules. To this end, we present FoundationFL, a novel defense mechanism against poisoning attacks. FoundationFL involves the server generating synthetic updates after receiving local model updates from clients. It then applies existing Byzantine-robust foundational aggregation rules, such as Trimmed-mean or Median, to combine clients’ model updates with the synthetic ones. We theoretically establish the convergence performance of FoundationFL under Byzantine settings. Comprehensive experiments across several real-world datasets validate the efficiency of our FoundationFL method.

[LG-35] Data-Informed Model Complexity Metric for Optimizing Symbolic Regression Models GECCO2025

Link: https://arxiv.org/abs/2501.17372
Authors: Nathan Haut,Zenas Huang,Adam Alessio
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Comments: Submitted to GECCO 2025

Click to view abstract

Abstract:Choosing models from a well-fitted evolved population that generalizes beyond training data is difficult. We introduce a pragmatic method to estimate model complexity using Hessian rank for post-processing selection. Complexity is approximated by averaging the model output Hessian rank across a few points (N=3), offering efficient and accurate rank estimates. This method aligns model selection with input data complexity, calculated using intrinsic dimensionality (ID) estimators. Using the StackGP system, we develop symbolic regression models for the Penn Machine Learning Benchmark and employ twelve scikit-dimension library methods to estimate ID, aligning model expressiveness with dataset ID. Our data-informed complexity metric finds the ideal complexity window, balancing model expressiveness and accuracy, enhancing generalizability without bias common in methods reliant on user-defined parameters, such as parsimony pressure in weight selection.
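For intuition, here is a sketch of the complexity estimate the abstract describes: average the rank of the model-output Hessian over a handful of points (N=3). StackGP models are symbolic, so the differentiable `f` below is a stand-in of our own, not the paper's pipeline.

```python
import torch

def mean_hessian_rank(f, points):
    # f: differentiable scalar-valued function of a 1-D input tensor.
    ranks = []
    for x in points:
        H = torch.autograd.functional.hessian(f, x)
        ranks.append(torch.linalg.matrix_rank(H).item())
    return sum(ranks) / len(ranks)

# Example: f(x) = (w . x)^2 has Hessian 2*w*w^T, so its rank is 1 everywhere.
w = torch.randn(5)
f = lambda x: (w @ x) ** 2
print(mean_hessian_rank(f, [torch.randn(5) for _ in range(3)]))  # -> 1.0
```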

[LG-36] Breaking the log(1/Delta_2) Barrier: Better Batched Best Arm Identification with Adaptive Grids ICLR2025

Link: https://arxiv.org/abs/2501.17370
Authors: Tianyuan Jin,Qin Zhang,Dongruo Zhou
Subjects: Machine Learning (cs.LG)
*Comments: 21 pages, published at ICLR 2025

Click to view abstract

Abstract:We investigate the problem of batched best arm identification in multi-armed bandits, where we aim to identify the best arm from a set of n arms while minimizing both the number of samples and batches. We introduce an algorithm that achieves near-optimal sample complexity and features an instance-sensitive batch complexity, which breaks the \log(1/\Delta_2) barrier. The main contribution of our algorithm is a novel sample allocation scheme that effectively balances exploration and exploitation for batch sizes. Experimental results indicate that our approach is more batch-efficient across various setups. We also extend this framework to the problem of batched best arm identification in linear bandits and achieve similar improvements.

[LG-37] Compact Neural TTS Voices for Accessibility ICASSP2025

Link: https://arxiv.org/abs/2501.17332
Authors: Kunal Jain,Eoin Murphy,Deepanshu Gupta,Jonathan Dyke,Saumya Shah,Vasileios Tsiaras,Petko Petkov,Alistair Conkie
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments: Accepted at ICASSP 2025

Click to view abstract

Abstract:Contemporary text-to-speech solutions for accessibility applications can typically be classified into two categories: (i) device-based statistical parametric speech synthesis (SPSS) or unit selection (USEL) and (ii) cloud-based neural TTS. SPSS and USEL offer low latency and low disk footprint at the expense of naturalness and audio quality. Cloud-based neural TTS systems provide significantly better audio quality and naturalness but regress in terms of latency and responsiveness, rendering these impractical for real-world applications. More recently, neural TTS models were made deployable to run on handheld devices. Nevertheless, latency remains higher than SPSS and USEL, while disk footprint prohibits pre-installation for multiple voices at once. In this work, we describe a high-quality compact neural TTS system achieving latency on the order of 15 ms with low disk footprint. The proposed solution is capable of running on low-power devices.

[LG-38] CardiCat: a Variational Autoencoder for High-Cardinality Tabular Data

Link: https://arxiv.org/abs/2501.17324
Authors: Lee Carlin,Yuval Benjamini
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:High-cardinality categorical features are a common characteristic of mixed-type tabular datasets. Existing generative model architectures struggle to learn the complexities of such data at scale, primarily due to the difficulty of parameterizing the categorical features. In this paper, we present a general variational autoencoder model, CardiCat, that can accurately fit imbalanced high-cardinality and heterogeneous tabular data. Our method substitutes one-hot encoding with regularized dual encoder-decoder embedding layers, which are jointly learned. This approach enables us to use embeddings that depend also on the other covariates, leading to a compact and homogenized parameterization of categorical features. Our model employs a considerably smaller trainable parameter space than competing methods, enabling learning at a large scale. CardiCat generates high-quality synthetic data that better represent high-cardinality and imbalanced features compared to competing VAE models for multiple real and simulated datasets.

[LG-39] Exploring Non-Convex Discrete Energy Landscapes: A Langevin-Like Sampler with Replica Exchange

Link: https://arxiv.org/abs/2501.17323
Authors: Haoyang Zheng,Ruqi Zhang,Guang Lin
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 7 figures, 23 pages

Click to view abstract

Abstract:Gradient-based Discrete Samplers (GDSs) are effective for sampling discrete energy landscapes. However, they often stagnate in complex, non-convex settings. To improve exploration, we introduce the Discrete Replica EXchangE Langevin (DREXEL) sampler and its variant with Adjusted Metropolis (DREAM). These samplers use two GDSs at different temperatures and step sizes: one focuses on local exploitation, while the other explores broader energy landscapes. When energy differences are significant, sample swaps occur, which are determined by a mechanism tailored for discrete sampling to ensure detailed balance. Theoretically, we prove both DREXEL and DREAM converge asymptotically to the target energy and exhibit faster mixing than a single GDS. Experiments further confirm their efficiency in exploring non-convex discrete energy landscapes.
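The swap step between the two chains follows the parallel-tempering pattern; the sketch below is only the textbook criterion, while DREXEL/DREAM use a mechanism tailored to gradient-based discrete samplers to preserve detailed balance.

```python
import math
import random

def accept_swap(e_low: float, e_high: float, t_low: float, t_high: float) -> bool:
    # Chains at temperatures t_low < t_high with current energies e_low, e_high.
    # Accept with probability min(1, exp((1/t_low - 1/t_high) * (e_low - e_high))).
    log_alpha = (1.0 / t_low - 1.0 / t_high) * (e_low - e_high)
    return random.random() < math.exp(min(0.0, log_alpha))
```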

[LG-40] MDDM: A Molecular Dynamics Diffusion Model to Predict Particle Self-Assembly

Link: https://arxiv.org/abs/2501.17319
Authors: Kevin Ferguson,Yu-hsuan Chen,Levent Burak Kara
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments:

Click to view abstract

Abstract:The discovery and study of new material systems relies on molecular simulations that often come with significant computational expense. We propose MDDM, a Molecular Dynamics Diffusion Model, which is capable of predicting a valid output conformation for a given input pair potential function. After training MDDM on a large dataset of molecular dynamics self-assembly results, the proposed model can convert uniform noise into a meaningful output particle structure corresponding to an arbitrary input potential. The model’s architecture has domain-specific properties built-in, such as satisfying periodic boundaries and being invariant to translation. The model significantly outperforms the baseline point-cloud diffusion model for both unconditional and conditional generation tasks.

[LG-41] RLPP: A Residual Method for Zero-Shot Real-World Autonomous Racing on Scaled Platforms ICRA WWW

Link: https://arxiv.org/abs/2501.17311
Authors: Edoardo Ghignone,Nicolas Baumann,Cheng Hu,Jonathan Wang,Lei Xie,Andrea Carron,Michele Magno
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: This paper has been accepted for publication at the IEEE International Conference on Robotics and Automation (ICRA), Atlanta 2025. The code is available at: this http URL

Click to view abstract

Abstract:Autonomous racing presents a complex environment requiring robust controllers capable of making rapid decisions under dynamic conditions. While traditional controllers based on tire models are reliable, they often demand extensive tuning or system identification. RL methods offer significant potential due to their ability to learn directly from interaction, yet they typically suffer from the Sim-to-Real gap, where policies trained in simulation fail to perform effectively in the real world. In this paper, we propose RLPP, a residual RL framework that enhances a Pure Pursuit (PP) controller with an RL-based residual. This hybrid approach leverages the reliability and interpretability of PP while using RL to fine-tune the controller's performance in real-world scenarios. Extensive testing on the F1TENTH platform demonstrates that RLPP improves lap times by up to 6.37%, closing the gap to the SotA methods by more than 52% and providing reliable performance in zero-shot real-world deployment, overcoming key challenges associated with the Sim-to-Real transfer and reducing the performance gap from simulation to reality by more than 8-fold when compared to the baseline RL controller. The RLPP framework is made available as an open-source tool, encouraging further exploration and advancement in autonomous racing research. The code is available at: this http URL.

[LG-42] Summary of the NOTSOFAR-1 Challenge: Highlights and Learnings

Link: https://arxiv.org/abs/2501.17304
Authors: Igor Abramovski,Alon Vinnikov,Shalev Shaer,Naoyuki Kanda,Xiaofei Wang,Amir Ivry,Eyal Krupka
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments:

Click to view abstract

Abstract:The first Natural Office Talkers in Settings of Far-field Audio Recordings (NOTSOFAR-1) Challenge is a pivotal initiative that sets new benchmarks by offering datasets more representative of the needs of real-world business applications than those previously available. The challenge provides a unique combination of 280 recorded meetings across 30 diverse environments, capturing real-world acoustic conditions and conversational dynamics, and a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization, incorporating 15,000 real acoustic transfer functions. In this paper, we provide an overview of the systems submitted to the challenge and analyze the top-performing approaches, hypothesizing the factors behind their success. Additionally, we highlight promising directions left unexplored by participants. By presenting key findings and actionable insights, this work aims to drive further innovation and progress in DASR research and applications.

[LG-43] Nonlinear dynamics of localization in neural receptive fields NEURIPS2024

Link: https://arxiv.org/abs/2501.17284
Authors: Leon Lufkin,Andrew M. Saxe,Erin Grant
Subjects: Machine Learning (cs.LG)
*Comments: Appeared at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024); spotlight presentation

Click to view abstract

Abstract:Localized receptive fields – neurons that are selective for certain contiguous spatiotemporal features of their input – populate early sensory regions of the mammalian brain. Unsupervised learning algorithms that optimize explicit sparsity or independence criteria replicate features of these localized receptive fields, but fail to explain directly how localization arises through learning without efficient coding, as occurs in early layers of deep neural networks and might occur in early sensory regions of biological systems. We consider an alternative model in which localized receptive fields emerge without explicit top-down efficiency constraints – a feedforward neural network trained on a data model inspired by the structure of natural images. Previous work identified the importance of non-Gaussian statistics to localization in this setting but left open questions about the mechanisms driving dynamical emergence. We address these questions by deriving the effective learning dynamics for a single nonlinear neuron, making precise how higher-order statistical properties of the input data drive emergent localization, and we demonstrate that the predictions of these effective dynamics extend to the many-neuron setting. Our analysis provides an alternative explanation for the ubiquity of localization as resulting from the nonlinear dynamics of learning in neural circuits.

[LG-44] Stiff Transfer Learning for Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2501.17281
Authors: Emilien Seiler,Wanzhou Lei,Pavlos Protopapas
Subjects: Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*Comments:

Click to view abstract

Abstract:Stiff differential equations are prevalent in various scientific domains, posing significant challenges due to the disparate time scales of their components. As computational power grows, physics-informed neural networks (PINNs) have led to significant improvements in modeling physical processes described by differential equations. Despite their promising outcomes, vanilla PINNs face limitations when dealing with stiff systems, known as failure modes. In response, we propose a novel approach, stiff transfer learning for physics-informed neural networks (STL-PINNs), to effectively tackle stiff ordinary differential equations (ODEs) and partial differential equations (PDEs). Our methodology involves training a Multi-Head-PINN in a low-stiff regime, and obtaining the final solution in a high stiff regime by transfer learning. This addresses the failure modes related to stiffness in PINNs while maintaining computational efficiency by computing “one-shot” solutions. The proposed approach demonstrates superior accuracy and speed compared to PINNs-based methods, as well as comparable computational efficiency with implicit numerical methods in solving stiff-parameterized linear and polynomial nonlinear ODEs and PDEs under stiff conditions. Furthermore, we demonstrate the scalability of such an approach and the superior speed it offers for simulations involving initial conditions and forcing function reparametrization.

[LG-45] A 1-D CNN inference engine for constrained platforms

Link: https://arxiv.org/abs/2501.17269
Authors: Ishwar Mudraje,Kai Vogelgesang,Thorsten Herfet
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:1D-CNNs are used for time series classification in various domains with a high degree of accuracy. Most implementations collect the incoming data samples in a buffer before performing inference on it. On edge devices, which are typically constrained and single-threaded, such an implementation may interfere with time-critical tasks. One such task is that of sample acquisition. In this work, we propose an inference scheme that interleaves the convolution operations between sample intervals, which allows us to reduce the inference latency. Furthermore, our scheme is well-suited for storing data in ring buffers, yielding a small memory footprint. We demonstrate these improvements by comparing our approach to TFLite’s inference method, giving a 10% reduction in the inference delay while almost halving the memory usage. Our approach is feasible on common consumer devices, which we show using an AVR-based Arduino board and an ARM-based Arduino board.
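A toy sketch of the interleaving idea: each arriving sample immediately scatters its partial products into a small ring of output accumulators, so no full window-sized dot product has to run at classification time. The class and buffer layout are illustrative assumptions, not the paper's or TFLite's implementation.

```python
class InterleavedConv1D:
    def __init__(self, kernel):
        self.kernel = kernel                  # FIR taps; kernel[-1] hits the newest sample
        self.partial = [0.0] * len(kernel)    # ring of partially accumulated outputs
        self.t = 0

    def push(self, sample):
        k = len(self.kernel)
        # The output finishing now receives this sample's last contribution.
        out = self.partial[self.t % k] + self.kernel[-1] * sample
        self.partial[self.t % k] = 0.0
        # Scatter the sample's remaining products into future output slots.
        for i in range(k - 1):
            self.partial[(self.t + k - 1 - i) % k] += self.kernel[i] * sample
        self.t += 1
        return out  # outputs for t < k-1 are warm-up values
```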

[LG-46] Increasing Information for Model Predictive Control with Semi-Markov Decision Processes

Link: https://arxiv.org/abs/2501.17256
Authors: Rémy Hosseinkhan Boucher(1 and 2),Onofrio Semeraro(1 and 2),Lionel Mathelin(1 and 2) ((1) Université Paris-Saclay, (2) CNRS)
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recent works in Learning-Based Model Predictive Control of dynamical systems show impressive sample complexity performances using criteria from Information Theory to accelerate the learning procedure. However, the sequential exploration opportunities are limited by the system local state, restraining the amount of information of the observations from the current exploration trajectory. This article resolves this limitation by introducing temporal abstraction through the framework of Semi-Markov Decision Processes. The framework increases the total information of the gathered data for a fixed sampling budget, thus reducing the sample complexity.

[LG-47] Amplifier: Bringing Attention to Neglected Low-Energy Components in Time Series Forecasting AAAI2025

Link: https://arxiv.org/abs/2501.17216
Authors: Jingru Fei,Kun Yi,Wei Fan,Qi Zhang,Zhendong Niu
Subjects: Machine Learning (cs.LG)
*Comments: Accepted by AAAI 2025

Click to view abstract

Abstract:We propose an energy amplification technique to address the issue that existing models easily overlook low-energy components in time series forecasting. This technique comprises an energy amplification block and an energy restoration block. The energy amplification block enhances the energy of low-energy components to improve the model’s learning efficiency for these components, while the energy restoration block returns the energy to its original level. Moreover, considering that the energy-amplified data typically displays two distinct energy peaks in the frequency spectrum, we integrate the energy amplification technique with a seasonal-trend forecaster to model the temporal relationships of these two peaks independently, serving as the backbone for our proposed model, Amplifier. Additionally, we propose a semi-channel interaction temporal relationship enhancement block for Amplifier, which enhances the model’s ability to capture temporal relationships from the perspective of the commonality and specificity of each channel in the data. Extensive experiments on eight time series forecasting benchmarks consistently demonstrate our model’s superiority in both effectiveness and efficiency compared to state-of-the-art methods.
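To illustrate what an energy amplification block might do, here is a sketch that boosts low-energy rFFT components of a series; dividing by the same weights would play the role of the restoration block. The weighting scheme and alpha are invented for illustration and are not the paper's design.

```python
import torch

def amplify_low_energy(x: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    # x: (batch, length) time series.
    X = torch.fft.rfft(x, dim=-1)
    energy = X.abs() ** 2
    # Weight components inversely to their (normalized) energy.
    w = 1.0 + alpha * (1.0 - energy / (energy.amax(dim=-1, keepdim=True) + 1e-8))
    return torch.fft.irfft(X * w, n=x.size(-1), dim=-1)
```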

[LG-48] Deep Learning in Wireless Communication Receiver: A Survey

Link: https://arxiv.org/abs/2501.17184
Authors: Shadman Rahman Doha,Ahmed Abdelhadi
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: 16 pages, 9 figures

Click to view abstract

Abstract:The design of wireless communication receivers to enhance signal processing in complex and dynamic environments is going through a transformation by leveraging deep neural networks (DNNs). Traditional wireless receivers depend on mathematical models and algorithms, which do not have the ability to adapt or learn from data. In contrast, deep learning-based receivers are more suitable for modern wireless communication systems because they can learn from data and adapt accordingly. This survey explores various deep learning architectures such as multilayer perceptrons (MLPs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), and autoencoders, focusing on their application in the design of wireless receivers. Key modules of a receiver such as synchronization, channel estimation, equalization, space-time decoding, demodulation, decoding, interference cancellation, and modulation classification are discussed in the context of advanced wireless technologies like orthogonal frequency division multiplexing (OFDM), multiple input multiple output (MIMO), semantic communication, task-oriented communication, and next-generation (Next-G) networks. The survey not only emphasizes the potential of deep learning-based receivers in future wireless communication but also investigates different challenges of deep learning-based receivers, such as data availability, security and privacy concerns, model interpretability, computational complexity, and integration with legacy systems.

[LG-49] Long-term prediction of El Niño-Southern Oscillation using reservoir computing with data-driven realtime filter

Link: https://arxiv.org/abs/2501.17781
Authors: Takuya Jinno,Takahito Mitsui,Kengo Nakai,Yoshitaka Saiki,Tsuyoshi Yoneda
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*Comments: 21 pages, 7 figures

Click to view abstract

Abstract:In recent years, the application of machine learning approaches to time-series forecasting of climate dynamical phenomena has become increasingly active. It is known that applying a band-pass filter to a time-series data is a key to obtaining a high-quality data-driven model. Here, to obtain longer-term predictability of machine learning models, we introduce a new type of band-pass filter. It can be applied to realtime operational prediction workflows since it relies solely on past time series. We combine the filter with reservoir computing, which is a machine-learning technique that employs a data-driven dynamical system. As an application, we predict the multi-year dynamics of the El Niño-Southern Oscillation with the prediction horizon of 24 months using only past time series.

[LG-50] Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling

Link: https://arxiv.org/abs/2501.17772
Authors: Theo Lepage,Reda Dehak
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*Comments: submitted to IEEE/ACM TASLP in January 2025

Click to view abstract

Abstract:Recent developments in Self-Supervised Learning (SSL) have demonstrated significant potential for Speaker Verification (SV), but closing the performance gap with supervised systems remains an ongoing challenge. Standard SSL frameworks rely on anchor-positive pairs extracted from the same audio utterances. Hence, positives have channel characteristics similar to those of their corresponding anchors, even with extensive data-augmentation. Therefore, this positive sampling strategy is a fundamental limitation as it encodes too much information regarding the recording source in the learned representations. This article introduces Self-Supervised Positive Sampling (SSPS), a bootstrapped technique for sampling appropriate and diverse positives in SSL frameworks for SV. SSPS samples positives close to their anchor in the representation space, as we assume that these pseudo-positives belong to the same speaker identity but correspond to different recording conditions. This method demonstrates consistent improvements in SV performance on VoxCeleb benchmarks when implemented in major SSL frameworks, such as SimCLR, SwAV, VICReg, and DINO. Using SSPS, SimCLR, and DINO achieve 2.57% and 2.53% EER on VoxCeleb1-O. SimCLR yields a 58% relative reduction in EER, getting comparable performance to DINO with a simpler training framework. Furthermore, SSPS lowers intra-class variance and reduces channel information in speaker representations while exhibiting greater robustness without data-augmentation.
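A sketch of the positive-sampling step as we read it: pseudo-positives are drawn near the anchor in representation space, here uniformly among the k nearest neighbors in a memory bank. The bank, k, and the uniform draw are our assumptions rather than the exact SSPS procedure.

```python
import torch
import torch.nn.functional as F

def sample_pseudo_positives(anchors: torch.Tensor, bank: torch.Tensor, k: int = 64) -> torch.Tensor:
    # anchors: (B, dim) current embeddings; bank: (N, dim) past embeddings.
    sims = F.normalize(anchors, dim=1) @ F.normalize(bank, dim=1).t()
    nn_idx = sims.topk(k, dim=1).indices                           # (B, k) neighbor ids
    pick = torch.randint(0, k, (anchors.size(0), 1), device=anchors.device)
    return bank[nn_idx.gather(1, pick).squeeze(1)]                 # (B, dim) pseudo-positives
```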

[LG-51] Machine-Learning-Enhanced Optimization of Noise-Resilient Variational Quantum Eigensolvers

Link: https://arxiv.org/abs/2501.17689
Authors: Kim A. Nicoli,Luca J. Wagner,Lena Funcke
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
*Comments: 14 pages, 3 figures, contribution to the 41st International Symposium on Lattice Field Theory (Lattice 2024), July 28th - August 3rd, 2024, Liverpool, UK

Click to view abstract

Abstract:Variational Quantum Eigensolvers (VQEs) are a powerful class of hybrid quantum-classical algorithms designed to approximate the ground state of a quantum system described by its Hamiltonian. VQEs hold promise for various applications, including lattice field theory. However, the inherent noise of Noisy Intermediate-Scale Quantum (NISQ) devices poses a significant challenge for running VQEs as these algorithms are particularly susceptible to noise, e.g., measurement shot noise and hardware noise. In a recent work, it was proposed to enhance the classical optimization of VQEs with Gaussian Processes (GPs) and Bayesian Optimization, as these machine-learning techniques are well-suited for handling noisy data. In these proceedings, we provide additional insights into this new algorithm and present further numerical experiments. In particular, we examine the impact of hardware noise and error mitigation on the algorithm's performance. We validate the algorithm using classical simulations of quantum hardware, including hardware noise benchmarks, which have not been considered in previous works. Our numerical experiments demonstrate that GP-enhanced algorithms can outperform state-of-the-art baselines, laying the foundation for future research on deploying these techniques to real quantum hardware and lattice field theory setups.

[LG-52] Extracting Inter-Protein Interactions Via Multitasking Graph Structure Learning

Link: https://arxiv.org/abs/2501.17589
Authors: Jiang Li,Yuan-Ting Li
Subjects: Quantitative Methods (q-bio.QM); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*Comments: Submitted

Click to view abstract

Abstract:Identifying protein-protein interactions (PPI) is crucial for gaining in-depth insights into numerous biological processes within cells and holds significant guiding value in areas such as drug development and disease treatment. Currently, most PPI prediction methods focus primarily on the study of protein sequences, neglecting the critical role of the internal structure of proteins. This paper proposes a novel PPI prediction method named MgslaPPI, which utilizes graph attention to mine protein structural information and enhances the expressive power of the protein encoder through multitask learning strategy. Specifically, we decompose the end-to-end PPI prediction process into two stages: amino acid residue reconstruction (A2RR) and protein interaction prediction (PIP). In the A2RR stage, we employ a graph attention-based residue reconstruction method to explore the internal relationships and features of proteins. In the PIP stage, in addition to the basic interaction prediction task, we introduce two auxiliary tasks, i.e., protein feature reconstruction (PFR) and masked interaction prediction (MIP). The PFR task aims to reconstruct the representation of proteins in the PIP stage, while the MIP task uses partially masked protein features for PPI prediction, with both working in concert to prompt MgslaPPI to capture more useful information. Experimental results demonstrate that MgslaPPI significantly outperforms existing state-of-the-art methods under various data partitioning schemes.

[LG-53] Sequential Learning of the Pareto Front for Multi-objective Bandits

Link: https://arxiv.org/abs/2501.17513
Authors: Elise Crépon(UMPA-ENSL),Aurélien Garivier(UMPA-ENSL),Wouter M Koolen(CWI)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:We study the problem of sequential learning of the Pareto front in multi-objective multi-armed bandits. An agent is faced with K possible arms to pull. At each turn she picks one and receives a vector-valued reward. When she thinks she has enough information to identify the Pareto front of the different arm means, she stops the game and gives an answer. We are interested in designing algorithms such that the answer given is correct with probability at least 1-\delta. Our main contribution is an efficient implementation of an algorithm achieving the optimal sample complexity when the risk \delta is small. With K arms in d dimensions, p of which are in the Pareto set, the algorithm runs in time O(Kp^d) per round.
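For reference, the answer the agent must produce is the Pareto front of the arm means; a direct dominance check looks like the sketch below (means are assumed known here, whereas the paper's algorithm must identify the front from noisy samples).

```python
import numpy as np

def pareto_front(means: np.ndarray) -> list:
    # means: (K, d) matrix of arm means. Arm i is dominated if some arm j
    # is at least as good in every objective and strictly better in one.
    K = means.shape[0]
    return [
        i for i in range(K)
        if not any(
            np.all(means[j] >= means[i]) and np.any(means[j] > means[i])
            for j in range(K) if j != i
        )
    ]
```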

[LG-54] A Survey on Cluster-based Federated Learning

Link: https://arxiv.org/abs/2501.17512
Authors: Omar El-Rifai(CIS-ENSMSE),Michael Ben Ali(IRIT),Imen Megdiche(IRIT, IRIT-SIG, INUC),André Peninou(IRIT, IRIT-SIG, UT2J),Olivier Teste(IRIT-SIG, IRIT, UT2J, UT)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:As the industrial and commercial use of Federated Learning (FL) has expanded, so has the need for optimized algorithms. In settings where FL clients' data is non-independently and identically distributed (non-IID) and with highly heterogeneous distributions, the baseline FL approach seems to fall short. To tackle this issue, recent studies have looked into personalized FL (PFL), which relaxes the implicit single-model constraint and allows for multiple hypotheses to be learned from the data or local models. Among the personalized FL approaches, cluster-based solutions (CFL) are particularly interesting whenever it is clear, through domain knowledge, that the clients can be separated into groups. In this paper, we study recent works on CFL, proposing: i) a classification of CFL solutions for personalization; ii) a structured review of the literature; and iii) a review of alternative use cases for CFL.

[LG-55] Fundamental Computational Limits in Pursuing Invariant Causal Prediction and Invariance-Guided Regularization

Link: https://arxiv.org/abs/2501.17354
Authors: Yihong Gu,Cong Fang,Yang Xu,Zijian Guo,Jianqing Fan
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*Comments: 70 pages, 3 figures

Click to view abstract

Abstract:Pursuing invariant prediction from heterogeneous environments opens the door to learning causality in a purely data-driven way and has several applications in causal discovery and robust transfer learning. However, existing methods such as ICP [Peters et al., 2016] and EILLS [Fan et al., 2024] that can attain sample-efficient estimation are based on exponential time algorithms. In this paper, we show that such a problem is intrinsically hard in computation: the decision problem, testing whether a non-trivial prediction-invariant solution exists across two environments, is NP-hard even for the linear causal relationship. In the world where P \neq NP, our results imply that the estimation error rate can be arbitrarily slow using any computationally efficient algorithm. This suggests that pursuing causality is fundamentally harder than detecting associations when no prior assumption is pre-offered. Given there is almost no hope of computational improvement under the worst case, this paper proposes a method capable of attaining both computationally and statistically efficient estimation under additional conditions. Furthermore, our estimator is a distributionally robust estimator with an ellipse-shaped uncertain set where more uncertainty is placed on spurious directions than invariant directions, resulting in a smooth interpolation between the most predictive solution and the causal solution by varying the invariance hyper-parameter. Non-asymptotic results and empirical applications support the claim.

[LG-56] Testing Conditional Mean Independence Using Generative Neural Networks

Link: https://arxiv.org/abs/2501.17345
Authors: Yi Zhang,Linjun Huang,Yun Yang,Xiaofeng Shao
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 18 pages, 4 figures

Click to view abstract

Abstract:Conditional mean independence (CMI) testing is crucial for statistical tasks including model determination and variable importance evaluation. In this work, we introduce a novel population CMI measure and a bootstrap-based testing procedure that utilizes deep generative neural networks to estimate the conditional mean functions involved in the population measure. The test statistic is thoughtfully constructed to ensure that even slowly decaying nonparametric estimation errors do not affect the asymptotic accuracy of the test. Our approach demonstrates strong empirical performance in scenarios with high-dimensional covariates and response variable, can handle multivariate responses, and maintains nontrivial power against local alternatives outside an n^{-1/2} neighborhood of the null hypothesis. We also use numerical simulations and real-world imaging data applications to highlight the efficacy and versatility of our testing procedure.

[LG-57] A Guaranteed-Stable Neural Network Approach for Optimal Control of Nonlinear Systems

链接: https://arxiv.org/abs/2501.17333
作者: Anran Li,John P. Swensen,Mehdi Hosseinzadeh
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A promising approach to optimal control of nonlinear systems involves iteratively linearizing the system and solving an optimization problem at each time instant to determine the optimal control input. Since this approach relies on online optimization, it can be computationally expensive, and thus unrealistic for systems with limited computing resources. One potential solution to this issue is to incorporate a Neural Network (NN) into the control loop to emulate the behavior of the optimal control scheme. Ensuring stability and reference tracking in the resulting NN-based closed-loop system requires modifications to the primary optimization problem. These modifications often introduce non-convexity and nonlinearity with respect to the decision variables, which may surpass the capabilities of existing solvers and complicate the generation of the training dataset. To address this issue, this paper develops a Neural Optimization Machine (NOM) to solve the resulting optimization problems. The central concept of a NOM is to transform the optimization challenges into the problem of training a NN. Rigorous proofs demonstrate that when a NN trained on data generated by the NOM is used in the control loop, all signals remain bounded and the system states asymptotically converge to a neighborhood around the desired equilibrium point, with a tunable proximity threshold. Simulation and experimental studies are provided to illustrate the effectiveness of the proposed methodology.
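
下面的草图示意“用 NN 模仿在线优化控制器”这一闭环思路:离线对采样状态求解最优控制,训练网络拟合状态到控制输入的映射,运行时以网络前向代替在线求解。其中 solve_ocp 为占位函数;论文中 NOM 将优化问题转化为 NN 训练的机制、以及保证稳定性所需的修改均未在此体现。

```python
# 示意性草图:离线生成 (状态 -> 最优控制) 数据,训练 NN 在运行时替代在线优化。
import numpy as np
from sklearn.neural_network import MLPRegressor

def solve_ocp(x):            # 占位:代表在状态 x 处求解(可能非凸的)优化问题
    return -0.5 * x          # 这里用一个线性反馈充当"最优解",仅为演示

states = np.random.uniform(-1, 1, size=(5000, 4))
controls = np.array([solve_ocp(x) for x in states])

policy = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
policy.fit(states, controls)            # 训练后,闭环中用 policy.predict 代替在线优化
u = policy.predict(states[:1])          # 运行时:一次前向即可得到控制输入
```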

[LG-58] MR imaging in the low-field: Leveraging the power of machine learning

链接: https://arxiv.org/abs/2501.17211
作者: Andreas Kofler,Dongyue Si,David Schote,Rene M Botnar,Christoph Kolbitsch,Claudia Prieto
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: To appear as a book chapter in T. Küstner et al, “Machine Learning in MRI: From Methods to Clinical Translation”

点击查看摘要

Abstract:Recent innovations in Magnetic Resonance Imaging (MRI) hardware and software have reignited interest in low-field ( < 1\,\mathrm{T} ) and ultra-low-field MRI ( < 0.1\,\mathrm{T} ). These technologies offer advantages such as lower power consumption, reduced specific absorption rate, reduced field inhomogeneities, and cost-effectiveness, presenting a promising alternative for resource-limited and point-of-care settings. However, low-field MRI faces inherent challenges like reduced signal-to-noise ratio and therefore potentially lower spatial resolution or longer scan times. This chapter examines the challenges and opportunities of low-field and ultra-low-field MRI, with a focus on the role of machine learning (ML) in overcoming these limitations. We provide an overview of deep neural networks and their application in enhancing low-field and ultra-low-field MRI performance. Specific ML-based solutions, including advanced image reconstruction, denoising, and super-resolution algorithms, are discussed. The chapter concludes by exploring how integrating ML with low-field MRI could expand its clinical applications and improve accessibility, potentially revolutionizing its use in diverse healthcare settings.
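
作为章节中“去噪”这类 ML 方案的一个背景示例,下面给出一个极简的残差去噪 CNN(DnCNN 风格)草图;网络结构与超参数均为示意性假设,与章节讨论的具体方法无关。

```python
# 示意性草图:低信噪比图像去噪的残差 CNN,网络预测噪声再从输入中减去。
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self, channels=32, depth=4):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return x - self.net(x)  # 残差学习:输出 = 输入 - 预测的噪声

noisy = torch.randn(1, 1, 64, 64)       # 模拟一张低信噪比切片
denoised = TinyDenoiser()(noisy)
```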

[LG-59] Near-Optimal Algorithms for Omniprediction

链接: https://arxiv.org/abs/2501.17205
作者: Princewill Okoroafor,Robert Kleinberg,Michael P. Kim
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Omnipredictors are simple prediction functions that encode loss-minimizing predictions with respect to a hypothesis class \mathcal{H} , simultaneously for every loss function within a class of losses \mathcal{L} . In this work, we give near-optimal learning algorithms for omniprediction, in both the online and offline settings. To begin, we give an oracle-efficient online learning algorithm that achieves (\mathcal{L},\mathcal{H}) -omniprediction with \tilde{O}(\sqrt{T \log |\mathcal{H}|}) regret for any class of Lipschitz loss functions \mathcal{L} \subseteq \mathcal{L}_{\mathrm{Lip}} . Quite surprisingly, this regret bound matches the optimal regret for minimization of a single loss function (up to a \sqrt{\log(T)} factor). Given this online algorithm, we develop an online-to-offline conversion that achieves near-optimal complexity across a number of measures. In particular, for all bounded loss functions within the class of Bounded Variation losses \mathcal{L}_{\mathrm{BV}} (which include all convex, all Lipschitz, and all proper losses) and any (possibly-infinite) \mathcal{H} , we obtain an offline learning algorithm that, leveraging an (offline) ERM oracle and m samples from \mathcal{D} , returns an efficient (\mathcal{L}_{\mathrm{BV}},\mathcal{H},\epsilon(m)) -omnipredictor for \epsilon(m) scaling near-linearly in the Rademacher complexity of \mathrm{Th} \circ \mathcal{H} .
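
摘要中的 oracle-efficient 在线算法较为复杂;作为背景,下面给出有限假设类上指数加权(Hedge)算法的最小实现。对取值在 [0,1] 的单一损失,它可达 O(\sqrt{T \log |\mathcal{H}|}) 量级的遗憾,与摘要中全预测遗憾界的形式一致,但它并非论文提出的算法。

```python
# 背景性草图:Hedge(指数加权)算法;损失假设在 [0,1] 内。
import numpy as np

def hedge(losses, eta):
    """losses: (T, n_experts) 的损失矩阵;返回算法的累计损失。"""
    T, n = losses.shape
    w = np.ones(n) / n
    total = 0.0
    for t in range(T):
        p = w / w.sum()
        total += p @ losses[t]           # 按权重混合各假设的损失
        w *= np.exp(-eta * losses[t])    # 指数加权更新
    return total

T, n = 1000, 16
losses = np.random.rand(T, n)
alg = hedge(losses, eta=np.sqrt(8 * np.log(n) / T))
best = losses.sum(axis=0).min()          # 事后最优单个假设的累计损失
print(alg - best)                        # 遗憾应为 O(sqrt(T log n)) 量级
```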

信息检索

[IR-0] Leveraging Multimodal LLM for Inspirational User Interface Search

链接: https://arxiv.org/abs/2501.17799
作者: Seokhyeon Park,Yumin Song,Soohyun Lee,Jaeyoung Kim,Jinwook Seo
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '25)

点击查看摘要

Abstract:Inspirational search, the process of exploring designs to inform and inspire new creative work, is pivotal in mobile user interface (UI) design. However, exploring the vast space of UI references remains a challenge. Existing AI-based UI search methods often miss crucial semantics like target users or the mood of apps. Additionally, these models typically require metadata like view hierarchies, limiting their practical use. We used a multimodal large language model (MLLM) to extract and interpret semantics from mobile UI images. We identified key UI semantics through a formative study and developed a semantic-based UI search system. Through computational and human evaluations, we demonstrate that our approach significantly outperforms existing UI retrieval methods, offering UI designers a more enriched and contextually relevant search experience. We enhance the understanding of mobile UI design semantics and highlight MLLMs’ potential in inspirational search, providing a rich dataset of UI semantics for future studies.
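
下面的草图示意“UI 图像到语义描述,再到向量检索”的流水线结构。其中 extract_semantics 与 embed 均为占位函数(假设存在相应的多模态大模型与文本嵌入服务),并非论文系统的真实接口。

```python
# 示意性草图:基于 MLLM 抽取语义的 UI 检索流水线,占位函数需自行接入实际模型。
import numpy as np

def extract_semantics(ui_image) -> str:
    """占位:调用多模态大模型,返回目标用户、应用氛围等语义描述文本。"""
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    """占位:任意文本嵌入模型,返回定长向量。"""
    raise NotImplementedError

def search(query: str, corpus_vecs: np.ndarray, top_k: int = 5):
    """corpus_vecs: 预先对语料中每张 UI 的语义描述做 embed 得到的矩阵。"""
    q = embed(query)
    sims = corpus_vecs @ q / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q) + 1e-9
    )
    return np.argsort(-sims)[:top_k]  # 返回余弦相似度最高的 UI 索引
```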

[IR-1] WARP: An Efficient Engine for Multi-Vector Retrieval

链接: https://arxiv.org/abs/2501.17788
作者: Jan Luca Scheerer,Matei Zaharia,Christopher Potts,Gustavo Alonso,Omar Khattab
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:We study the efficiency of multi-vector retrieval methods like ColBERT and its recent variant XTR. We introduce WARP, a retrieval engine that drastically improves the efficiency of XTR-based ColBERT retrievers through three key innovations: (1) WARP_{\text{SELECT}} for dynamic similarity imputation, (2) implicit decompression to bypass costly vector reconstruction during retrieval, and (3) a two-stage reduction process for efficient scoring. Combined with highly optimized C++ kernels and specialized inference runtimes, WARP reduces end-to-end query latency by 41x compared to XTR’s reference implementation, and thereby achieves a 3x speedup over the official ColBERTv2 PLAID engine while preserving retrieval quality.
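
作为背景,下面给出 ColBERT 式多向量检索的 MaxSim 打分的直接实现,即 WARP 所加速的计算对象;WARP 的动态相似度填补、隐式解压与两阶段归约等优化均未包含在内。

```python
# 背景性草图:多向量检索的 MaxSim 打分,每个查询 token 取与文档 token 的最大相似度后求和。
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (nq, d);doc_vecs: (nd, d)。返回查询-文档的 MaxSim 分数。"""
    sims = query_vecs @ doc_vecs.T          # (nq, nd) 的相似度矩阵
    return float(sims.max(axis=1).sum())    # 逐查询 token 取最大,再对 token 求和

q = np.random.randn(32, 128)                # 32 个查询 token,128 维
d = np.random.randn(180, 128)               # 180 个文档 token
print(maxsim_score(q, d))
```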

[IR-2] Distinguished Quantized Guidance for Diffusion-based Sequence Recommendation

链接: https://arxiv.org/abs/2501.17670
作者: Wenyu Mao,Shuchang Liu,Haoyang Liu,Haozhe Liu,Xiang Li,Lantao Hu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Diffusion models (DMs) have emerged as promising approaches for sequential recommendation due to their strong ability to model data distributions and generate high-quality items. Existing work typically adds noise to the next item and progressively denoises it guided by the user’s interaction sequence, generating items that closely align with user interests. However, we identify two key issues in this paradigm. First, the sequences are often heterogeneous in length and content, exhibiting noise due to stochastic user behaviors. Using such sequences as guidance may hinder DMs from accurately understanding user interests. Second, DMs are prone to data bias and tend to generate only the popular items that dominate the training dataset, thus failing to meet the personalized needs of different users. To address these issues, we propose Distinguished Quantized Guidance for Diffusion-based Sequence Recommendation (DiQDiff), which aims to extract robust guidance to understand user interests and generate distinguished items for personalized user interests within DMs. To extract robust guidance, DiQDiff introduces Semantic Vector Quantization (SVQ) to quantize sequences into semantic vectors (e.g., collaborative signals and category interests) using a codebook, which can enrich the guidance to better understand user interests. To generate distinguished items, DiQDiff personalizes the generation through Contrastive Discrepancy Maximization (CDM), which maximizes the distance between denoising trajectories using contrastive loss to prevent biased generation for different users. Extensive experiments are conducted to compare DiQDiff with multiple baseline models across four widely-used datasets. The superior recommendation performance of DiQDiff against leading approaches demonstrates its effectiveness in sequential recommendation tasks.
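
下面的草图示意 SVQ 的核心操作:用码本把序列表示量化为语义向量,作为扩散去噪的引导。最近邻量化是常见的 VQ 做法;码本大小等均为示意性假设,论文中的训练目标与 CDM 对比损失未在此体现。

```python
# 示意性草图:语义向量量化(SVQ)的最近邻码本查找。
import torch

def quantize(seq_repr: torch.Tensor, codebook: torch.Tensor):
    """seq_repr: (B, d) 的序列编码;codebook: (K, d)。返回量化向量及其码字索引。"""
    dists = torch.cdist(seq_repr, codebook)      # (B, K) 的欧氏距离
    idx = dists.argmin(dim=1)
    return codebook[idx], idx                     # 量化向量作为去噪过程的引导

codebook = torch.randn(256, 64)                   # 假设:大小 256 的码本
guided, idx = quantize(torch.randn(8, 64), codebook)
```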

[IR-3] Value Function Decomposition in Markov Recommendation Process

链接: https://arxiv.org/abs/2501.17409
作者: Xiaobei Wang,Shuchang Liu,Qingpeng Cai,Xiang Li,Lantao Hu,Han Li,Guangming Xie
类目: Information Retrieval (cs.IR)
*备注: 14 pages, 9 figures

点击查看摘要

Abstract:Recent advances in recommender systems have shown that user-system interaction essentially formulates a long-term optimization problem, and online reinforcement learning can be adopted to improve recommendation performance. The general solution framework incorporates a value function that estimates the user’s expected cumulative rewards in the future and guides the training of the recommendation policy. To avoid local maxima, the policy may explore potential high-quality actions during inference to increase the chance of finding better future rewards. To accommodate the stepwise recommendation process, one widely adopted approach to learning the value function is learning from the difference between the values of two consecutive user states. However, we argue that this paradigm involves an incorrect approximation of the stochastic process. Specifically, between the current state and the next state in each training sample, there exist two separate random factors: the stochastic policy and the uncertain user environment. Original temporal difference (TD) learning under these mixed random factors may result in a suboptimal estimation of the long-term rewards. As a solution, we show that these two factors can be separately approximated by decomposing the original temporal difference loss. The disentangled learning framework achieves a more accurate estimation with faster learning and improved robustness against action exploration. As empirical verification of our proposed method, we conduct offline experiments with online simulated environments built on public datasets.
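
作为背景,下面给出被论文分解的原始 TD 损失的最小实现:以 r + \gamma V(s') 为目标对 V(s) 回归。论文指出在随机策略与随机用户环境两个因素混合时,这一原始形式的近似并不准确;其具体的分解方式此处未实现。

```python
# 背景性草图:标准时序差分(TD)损失,即论文所分解的原始形式。
import torch

def td_loss(v, states, next_states, rewards, gamma=0.99):
    """v: 状态价值网络;TD 目标 r + gamma * V(s') 对当前 V(s) 做回归。"""
    with torch.no_grad():
        target = rewards + gamma * v(next_states).squeeze(-1)  # 目标不回传梯度
    return torch.nn.functional.mse_loss(v(states).squeeze(-1), target)

v = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
loss = td_loss(v, torch.randn(32, 16), torch.randn(32, 16), torch.randn(32))
loss.backward()
```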

附件下载

点击下载今日全部论文列表