This blog post lists the latest papers retrieved from Arxiv.org on 2025-01-14. The list is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: The daily paper data is fetched from Arxiv.org and updated automatically at around 12:00 every morning.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-01-14)

A total of 714 papers were updated today, including:

  • Natural Language Processing: 96 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 205 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 170 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 182 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] WebWalker: Benchmarking LLMs in Web Traversal

[Quick Read]: This paper tackles the problem that, in retrieval-augmented generation (RAG) tasks, traditional search engines may retrieve only shallow content, limiting the ability of large language models (LLMs) to handle complex, multi-layered information. To address this, the authors propose WebWalkerQA, a benchmark designed to assess LLMs' ability to traverse web pages, in particular to systematically extract high-quality data. The key to the solution is WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker through horizontal and vertical integration in real-world scenarios.

Link: https://arxiv.org/abs/2501.07572
Authors: Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Deyu Zhou, Pengjun Xie, Fei Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address it, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website’s subpages to extract high-quality data systematically. We propose WebWalker, which is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrates the effectiveness of RAG combined with WebWalker, through the horizontal and vertical integration in real-world scenarios.
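
The explore-critic loop can be pictured with a small sketch. This is a minimal illustration of the paradigm as the abstract describes it, not the paper's implementation; the `call_llm` stub, the prompts, and the `fetch` callback are hypothetical placeholders.

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for any chat-completion API (hypothetical)."""
    return "INSUFFICIENT"

def explorer(query: str, page: str, links: list[str]) -> str:
    """Explorer agent: ask the LLM which subpage looks most promising."""
    return call_llm(f"Query: {query}\nPage: {page}\nLinks: {links}\nPick one link.")

def critic(query: str, memory: list[str]) -> str | None:
    """Critic agent: answer once the gathered evidence suffices, else defer."""
    verdict = call_llm(f"Query: {query}\nEvidence: {memory}\nAnswer or reply INSUFFICIENT.")
    return None if verdict == "INSUFFICIENT" else verdict

def web_walk(query: str, fetch, start_url: str, max_steps: int = 10) -> str | None:
    memory: list[str] = []
    url = start_url
    for _ in range(max_steps):
        page, links = fetch(url)            # fetch(url) -> (page text, sub-links)
        memory.append(page)                 # accumulate traversal evidence
        answer = critic(query, memory)
        if answer is not None:
            return answer
        url = explorer(query, page, links)  # descend into the chosen subpage
    return None

# toy usage with a two-page 'site'
site = {"/": ("home page", ["/about"]), "/about": ("about page", [])}
print(web_walk("what is this site?", lambda u: site.get(u, ("", [])), "/"))
```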

[NLP-1] SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing WACV

[Quick Read]: This paper addresses the challenge of evaluating video-editing models. Traditional metrics such as CLIP text and image scores have limitations: text scores are constrained by inadequate training data and hierarchical dependencies, while image scores cannot assess temporal consistency. The paper therefore proposes SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a new evaluation framework that combines modern vision-language models (VLMs), object detection, and temporal-consistency checks. SST-EM has four key components: (1) semantic extraction from frames using a VLM, (2) primary-object tracking with object detection, (3) focused-object refinement via an LLM agent, and (4) temporal-consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The core of SST-EM is a comprehensive assessment of semantic fidelity and temporal smoothness in video editing.

Link: https://arxiv.org/abs/2501.07554
Authors: Varun Biyyala, Bharat Chanderprakash Kathuria, Jialu Li, Youshan Zhang
Affiliations: Katz School of Science and Health, Yeshiva University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: WACV workshop

Abstract:Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), Object Detection, and Temporal Consistency checks. SST-EM comprises four components: (1) semantic extraction from frames using a VLM, (2) primary object tracking with Object Detection, (3) focused object refinement via an LLM agent, and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The name SST-EM reflects its focus on Semantic, Spatial, and Temporal aspects of video evaluation. SST-EM provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing. The source code is available in the GitHub Repository (this https URL).
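
As a quick illustration of the final aggregation step, the sketch below combines the four component scores into one number with fixed weights. The weight values are made up for illustration; the paper derives its weights from human evaluations and regression analysis.

```python
def sst_em_score(semantic: float, object_tracking: float,
                 refinement: float, temporal: float,
                 weights=(0.3, 0.2, 0.2, 0.3)) -> float:
    """Unified SST-EM-style score: weighted sum of normalized components.

    The paper fits the weights by regressing human quality ratings on the
    component scores; the defaults here are placeholders.
    """
    components = (semantic, object_tracking, refinement, temporal)
    return sum(w * c for w, c in zip(weights, components))

print(sst_em_score(0.9, 0.8, 0.85, 0.7))  # -> 0.81
```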

[NLP-2] Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

[Quick Read]: This paper addresses the poor performance of large language models (LLMs) and multimodal large language models (MLLMs) on complex spatial-reasoning tasks. Although Chain-of-Thought (CoT) prompting excels at enhancing complex reasoning, it remains limited on tasks involving spatial reasoning. The proposed solution, Multimodal Visualization-of-Thought (MVoT), generates image visualizations to emulate the visual thinking humans use while reasoning. The key innovation is a token discrepancy loss introduced into autoregressive multimodal models to improve visual coherence and fidelity. Experiments show that MVoT performs well across several dynamic spatial-reasoning tasks, with marked improvements in complex scenarios where CoT fails. The approach opens new possibilities for effectively combining visual thinking with verbal reasoning.

Link: https://arxiv.org/abs/2501.07542
Authors: Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, Furu Wei
Affiliations: Microsoft Research; Language Technology Lab, University of Cambridge; Institute of Automation, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures, 4 tables (27 pages, 10 figures, 16 tables including references and appendices)

Abstract:Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.

[NLP-3] Investigating Large Language Models in Inferring Personality Traits from User Conversations

[Quick Read]: This paper investigates whether large language models (LLMs), specifically GPT-4o and GPT-4o mini, can infer Big Five personality traits from user conversations and generate Big Five Inventory-10 (BFI-10) item scores under zero-shot prompting. The key to the solution is an intermediate step: prompting the model to produce BFI-10 item scores before computing the traits. This markedly improves prediction accuracy and aligns more closely with the gold standard. The study also compares groups based on the presence of depressive symptoms, revealing differential model performance: GPT-4o mini was more sensitive to depression-related shifts in traits such as Neuroticism and Conscientiousness in the symptom-present group, while GPT-4o showed strengths in nuanced interpretation across groups. These findings highlight the potential of LLMs for analyzing real-world psychological data and lay a foundation for interdisciplinary research at the intersection of AI and psychology.

Link: https://arxiv.org/abs/2501.07532
Authors: Jianfeng Zhu, Ruoming Jin, Karin G. Coifman
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 5 figures

Abstract:Large Language Models (LLMs) are demonstrating remarkable human-like capabilities across diverse domains, including psychological assessment. This study evaluates whether LLMs, specifically GPT-4o and GPT-4o mini, can infer Big Five personality traits and generate Big Five Inventory-10 (BFI-10) item scores from user conversations under zero-shot prompting conditions. Our findings reveal that incorporating an intermediate step–prompting for BFI-10 item scores before calculating traits–enhances accuracy and aligns more closely with the gold standard than direct trait inference. This structured approach underscores the importance of leveraging psychological frameworks in improving predictive precision. Additionally, a group comparison based on depressive symptom presence revealed differential model performance. Participants were categorized into two groups: those experiencing at least one depressive symptom and those without symptoms. GPT-4o mini demonstrated heightened sensitivity to depression-related shifts in traits such as Neuroticism and Conscientiousness within the symptom-present group, whereas GPT-4o exhibited strengths in nuanced interpretation across groups. These findings underscore the potential of LLMs to analyze real-world psychological data effectively, offering a valuable foundation for interdisciplinary research at the intersection of artificial intelligence and psychology.
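
The intermediate step is easy to make concrete: prompt for the ten BFI-10 item scores first, then compute each trait as the mean of its two items, one of which is reverse-scored. The key below is the standard published BFI-10 scoring key; how the study phrased its prompts is not shown here.

```python
REVERSED = {1, 3, 4, 5, 7}               # items reverse-scored on the 1-5 scale
TRAIT_ITEMS = {                          # standard BFI-10 key: trait -> (item, item)
    "Extraversion": (1, 6), "Agreeableness": (2, 7),
    "Conscientiousness": (3, 8), "Neuroticism": (4, 9), "Openness": (5, 10),
}

def traits_from_items(item_scores: dict[int, int]) -> dict[str, float]:
    """Aggregate LLM-produced BFI-10 item scores into Big Five traits."""
    def score(i: int) -> int:
        x = item_scores[i]
        return 6 - x if i in REVERSED else x   # reverse-score: 6 - x on 1-5
    return {t: (score(a) + score(b)) / 2 for t, (a, b) in TRAIT_ITEMS.items()}

# e.g., item scores parsed from the model's answer to the item-level prompt
print(traits_from_items({i: 3 for i in range(1, 11)}))
```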

[NLP-4] Parallel Key-Value Cache Fusion for Position Invariant RAG

[Quick Read]: This paper addresses the sensitivity of large language models (LLMs) to the position of relevant information in the input context: when the relevant information sits in the middle of the context, models tend to produce incorrect responses, a phenomenon known as 'Lost in the Middle'. To solve this, the paper proposes a framework that makes decoder-only models produce consistent outputs regardless of input-context order. The key is position invariance: the model is insensitive to the order of input contexts and is more robust to irrelevant passages than prevailing approaches for RAG pipelines.

Link: https://arxiv.org/abs/2501.07523
Authors: Philhoon Oh, Jinwoo Shin, James Thorne
Affiliations: KAIST AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 5 pages

Abstract:Recent advancements in Large Language Models (LLMs) underscore the necessity of Retrieval Augmented Generation (RAG) to leverage external information. However, LLMs are sensitive to the position of relevant information within contexts and tend to generate incorrect responses when such information is placed in the middle, known as the 'Lost in the Middle' phenomenon. In this paper, we introduce a framework that generates consistent outputs for decoder-only models, irrespective of the input context order. Experimental results for three open domain question answering tasks demonstrate position invariance, where the model is not sensitive to input context order, and superior robustness to irrelevant passages compared to prevailing approaches for RAG pipelines.

[NLP-5] TiEBe: A Benchmark for Assessing the Current Knowledge of Large Language Models

[Quick Read]: This paper addresses how large language models (LLMs) can stay up to date in a rapidly changing knowledge landscape, focusing on two gaps: continual learning and geographic disparities in global knowledge. Existing benchmarks mostly test general factual recall but overlook how models integrate new knowledge through continual learning and how performance varies across regions. The authors therefore introduce the Timely Events Benchmark (TiEBe), a dataset of more than 11,000 question-answer pairs about globally and regionally significant events. TiEBe leverages structured retrospective data from Wikipedia and can be continuously updated to assess LLMs' knowledge of evolving global affairs and their understanding of events in different regions. The benchmark reveals substantial geographic disparities in factual recall, underscoring the need for more balanced global knowledge representation. TiEBe also serves as a tool for evaluating continual-learning strategies, shedding light on models' ability to acquire new information without forgetting old knowledge.

Link: https://arxiv.org/abs/2501.07482
Authors: Thales Sales Almeida, Giovana Kerche Bonás, João Guilherme Alves Santos, Hugo Abonizio, Rodrigo Nogueira
Affiliations: State University of Campinas (UNICAMP); Maritaca AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In a rapidly evolving knowledge landscape and the increasing adoption of large language models, a need has emerged to keep these models continuously updated with current events. While existing benchmarks evaluate general factual recall, they often overlook two critical aspects: the ability of models to integrate evolving knowledge through continual learning and the significant regional disparities in their performance. To address these gaps, we introduce the Timely Events Benchmark (TiEBe), a dataset containing over 11,000 question-answer pairs focused on globally and regionally significant events. TiEBe leverages structured retrospective data from Wikipedia, enabling continuous updates to assess LLMs’ knowledge of evolving global affairs and their understanding of events across different regions. Our benchmark demonstrates that LLMs exhibit substantial geographic disparities in factual recall, emphasizing the need for more balanced global knowledge representation. Furthermore, TiEBe serves as a tool for evaluating continual learning strategies, providing insights into models’ ability to acquire new information without forgetting past knowledge.

[NLP-6] Enhancing Retrieval-Augmented Generation: A Study of Best Practices

[Quick Read]: This paper addresses the underexplored question of how the individual components and configurations of retrieval-augmented generation (RAG) systems affect performance. The authors build several advanced RAG designs, including query expansion, several novel retrieval strategies, and a new Contrastive In-Context Learning RAG, and systematically study how key factors affect response quality: language-model size, prompt design, document chunk size, knowledge-base size, retrieval stride, query-expansion techniques, Contrastive In-Context Learning knowledge bases, multilingual knowledge bases, and sentence-level retrieval of relevant context (Focus Mode). The key contribution is an extensive experimental analysis of these factors, yielding actionable insights for building RAG systems that balance contextual richness against retrieval-generation efficiency and paving the way for adaptable, high-performing RAG frameworks across diverse real-world applications.

Link: https://arxiv.org/abs/2501.07391
Authors: Siran Li, Linus Stenzel, Carsten Eickhoff, Seyed Ali Bahrainian
Affiliations: University of Tübingen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) systems have recently shown remarkable advancements by integrating retrieval mechanisms into language models, enhancing their ability to produce more accurate and contextually relevant responses. However, the influence of various components and configurations within RAG systems remains underexplored. A comprehensive understanding of these elements is essential for tailoring RAG systems to complex retrieval tasks and ensuring optimal performance across diverse applications. In this paper, we develop several advanced RAG system designs that incorporate query expansion, various novel retrieval strategies, and a novel Contrastive In-Context Learning RAG. Our study systematically investigates key factors, including language model size, prompt design, document chunk size, knowledge base size, retrieval stride, query expansion techniques, Contrastive In-Context Learning knowledge bases, multilingual knowledge bases, and Focus Mode retrieving relevant context at sentence-level. Through extensive experimentation, we provide a detailed analysis of how these factors influence response quality. Our findings offer actionable insights for developing RAG systems, striking a balance between contextual richness and retrieval-generation efficiency, thereby paving the way for more adaptable and high-performing RAG frameworks in diverse real-world scenarios. Our code and implementation details are publicly available.
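
Two of the studied knobs, document chunk size and query expansion, are simple to sketch. The chunker and the expansion prompt below are generic illustrations under assumed settings, not the paper's exact configurations.

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-level chunks of `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def expand_query(query: str, llm) -> str:
    """Query expansion: append LLM-suggested paraphrases and related terms."""
    return query + " " + llm(f"List synonyms and related terms for: {query}")

print(len(chunk("word " * 1000, size=200, overlap=50)))  # -> 7 chunks
```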

[NLP-7] Emergent effects of scaling on the functional hierarchies within large language models

[Quick Read]: This paper examines whether the layer structure of large language models (LLMs) follows the traditional functional-hierarchy view, in which early layers process syntax, middle layers parse semantics, and deep layers integrate information. The study feeds simple texts (e.g., "A church and organ") to an LLM, extracts per-layer activations, and fits support vector machines (SVMs) and ridge regressions to predict text labels, probing what information each layer encodes. The results show that while a smaller model (Llama-3.2-3b) partly supports the traditional hierarchy, a larger model (Llama-3.3-70b-Instruct) exhibits stark fluctuations in abstraction level across depth, along with coordination between the attention mechanisms of adjacent layers. The findings suggest that although an abstraction hierarchy generally emerges across layers, large models deviate from it in notable ways, showing intricate patterns of abstraction and information compression.

Link: https://arxiv.org/abs/2501.07359
Authors: Paul C. Bogdan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM) architectures are often described as functionally hierarchical: Early layers process syntax, middle layers begin to parse semantics, and late layers integrate information. The present work revisits these ideas. This research submits simple texts to an LLM (e.g., “A church and organ”) and extracts the resulting activations. Then, for each layer, support vector machines and ridge regressions are fit to predict a text’s label and thus examine whether a given layer encodes some information. Analyses using a small model (Llama-3.2-3b; 28 layers) partly bolster the common hierarchical perspective: Item-level semantics are most strongly represented early (layers 2-7), then two-item relations (layers 8-12), and then four-item analogies (layers 10-15). Afterward, the representation of items and simple relations gradually decreases in deeper layers that focus on more global information. However, several findings run counter to a steady hierarchy view: First, although deep layers can represent document-wide abstractions, deep layers also compress information from early portions of the context window without meaningful abstraction. Second, when examining a larger model (Llama-3.3-70b-Instruct), stark fluctuations in abstraction level appear: As depth increases, two-item relations and four-item analogies initially increase in their representation, then markedly decrease, and afterward increase again momentarily. This peculiar pattern consistently emerges across several experiments. Third, another emergent effect of scaling is coordination between the attention mechanisms of adjacent layers. Across multiple experiments using the larger model, adjacent layers fluctuate between what information they each specialize in representing. In sum, an abstraction hierarchy often manifests across layers, but large models also deviate from this structure in curious ways.
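
The probing setup lends itself to a compact sketch: for each layer, fit a simple linear classifier on that layer's activations and read off how decodable the label is. Random arrays stand in for real Llama activations here, and `RidgeClassifier` stands in for the paper's SVM and ridge probes.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_texts, n_layers, hidden = 200, 28, 64              # stand-in sizes (Llama-3.2-3b has 28 layers)
acts = rng.normal(size=(n_layers, n_texts, hidden))  # [layer, text, hidden dim]
labels = rng.integers(0, 4, size=n_texts)            # e.g., item-category labels

for layer in range(n_layers):
    acc = cross_val_score(RidgeClassifier(), acts[layer], labels, cv=5).mean()
    print(f"layer {layer:2d}: probe accuracy {acc:.2f}")   # chance level ~0.25 here
```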

[NLP-8] Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding ICASSP2025

[Quick Read]: This paper addresses the fact that, in spoken language understanding (SLU), conventional sequence-to-sequence approaches cannot perform speech recognition and structured-content extraction simultaneously. The authors propose the Joint Speech Recognition and Structure Learning framework (JSRSL), a span-based end-to-end SLU model that can accurately transcribe speech while extracting structured content at the same time. Experiments on named entity recognition and intent classification with the Chinese dataset AISHELL-NER and the English dataset SLURP show that the method outperforms traditional sequence-to-sequence approaches in both transcription and extraction and achieves state-of-the-art performance on both datasets.

Link: https://arxiv.org/abs/2501.07329
Authors: Jiliang Hu, Zuchao Li, Mengjia Shen, Haojun Ai, Sheng Li, Jun Zhang
Affiliations: 1 Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China; 2 School of Computer Science, Wuhan University, Wuhan, China; 3 Wuhan Second Ship Design and Research Institute, Wuhan, China; 4 National Institute of Information and Communications Technology, Japan
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 2 figures, accepted by ICASSP 2025

Abstract:Spoken language understanding (SLU) is a structure prediction task in the field of speech. Recently, many works on SLU that treat it as a sequence-to-sequence task have achieved great success. However, this method is not suitable for simultaneous speech recognition and understanding. In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), an end-to-end SLU model based on span, which can accurately transcribe speech and extract structured content simultaneously. We conduct experiments on named entity recognition and intent classification using the Chinese dataset AISHELL-NER and the English dataset SLURP. The results show that our proposed method not only outperforms the traditional sequence-to-sequence method in both transcription and extraction capabilities but also achieves state-of-the-art performance on the two datasets.

[NLP-9] FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering ALT

[Quick Read]: This paper targets the quality of training data for large language models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. The paper proposes an LLM-based line-level filtering method: GPT-4o mini labels a 20,000-document sample from FineWeb at the line level, producing descriptive labels for low-quality lines. The labels are grouped into nine main categories, and a DeBERTa-v3 classifier is trained to scale the filtering to a 10B-token subset of FineWeb. Experiments show that GPT-2 models trained on the filtered dataset achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data, substantially improving both data quality and training efficiency. The annotated dataset, FinerWeb-10BT, and the codebase are released to support further work.

Link: https://arxiv.org/abs/2501.07314
Authors: Erik Henriksson, Otto Tarkka, Filip Ginter
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 4 figures, 4 tables. To be published in NoDaLiDa/Baltic-HLT 2025 proceedings

Abstract:Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
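
The filtering step itself is a one-pass map over document lines. In the sketch below, `quality_label` is a stand-in for the fine-tuned DeBERTa-v3 classifier, and the toy rule and two-label scheme are illustrative, not the paper's nine categories.

```python
KEEP = {"clean"}

def quality_label(line: str) -> str:
    """Stub classifier: flags boilerplate-looking lines (illustrative rule only)."""
    if len(line.split()) < 3 or line.isupper():
        return "junk"
    return "clean"

def filter_document(doc: str) -> str:
    """Keep only the lines the classifier labels as clean."""
    return "\n".join(ln for ln in doc.splitlines() if quality_label(ln) in KEEP)

sample = "COOKIE NOTICE\nThis article explains the method in detail.\nOK"
print(filter_document(sample))  # -> only the middle line survives
```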

[NLP-10] The Lessons of Developing Process Reward Models in Mathematical Reasoning

[Quick Read]: This paper addresses the data-annotation and evaluation challenges facing process reward models (PRMs) for process supervision in the mathematical reasoning of large language models (LLMs). Existing Monte Carlo (MC) estimation-based data synthesis yields inferior performance and generalization, and conventional Best-of-N (BoN) evaluation is biased, misaligning the evaluation criteria with the PRM objective of process verification. To address these issues, the paper develops a consensus filtering mechanism that integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework combining response-level and step-level metrics. This mechanism significantly improves model performance and data efficiency in BoN evaluation and in step-wise error identification, culminating in a new state-of-the-art PRM that outperforms existing open-source alternatives, along with practical guidelines for building process-supervision models.

Link: https://arxiv.org/abs/2501.07301
Authors: Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs of such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process to outcome-based assessment in BoN Optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on the mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
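
The consensus-filtering idea can be sketched as keeping a step-level annotation only when the Monte Carlo estimate and the LLM judge agree. Data shapes and the 0.5 threshold are assumptions for illustration; the paper's actual pipeline is more involved.

```python
def mc_correct(success_rate: float, thresh: float = 0.5) -> bool:
    """MC estimation: a step counts as correct if completions from it often succeed."""
    return success_rate >= thresh

def consensus_filter(steps: list[tuple[float, bool]]) -> list[tuple[float, bool]]:
    """Keep (mc_success_rate, judge_verdict) pairs only when both sources agree."""
    return [(rate, judge) for rate, judge in steps if mc_correct(rate) == judge]

steps = [(0.9, True), (0.8, False), (0.1, False), (0.4, True)]
print(consensus_filter(steps))  # -> [(0.9, True), (0.1, False)]
```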

[NLP-11] Comparative analysis of optical character recognition methods for Sami texts from the National Library of Norway

[Quick Read]: This paper addresses the insufficient accuracy of optical character recognition (OCR) for Sámi-language documents in the National Library of Norway's (NLN) digitisation pipeline. Since OCR quality directly affects downstream processing, improving OCR for text written in Sámi languages is essential to making these resources accessible. The paper fine-tunes and evaluates three established OCR approaches: Transkribus, Tesseract, and TrOCR. The results show that Transkribus and TrOCR outperform Tesseract on Sámi documents, while Tesseract performs better on an out-of-domain dataset. The paper further shows that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate Sámi OCR even with only a moderate amount of manually annotated data.

Link: https://arxiv.org/abs/2501.07300
Authors: Tita Enstad, Trond Trosterud, Marie Iversdatter Røsok, Yngvil Beyer, Marie Roald
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in Proceedings of the 25th Nordic Conference on Computational Linguistics (NoDaLiDa)

Abstract:Optical Character Recognition (OCR) is crucial to the National Library of Norway’s (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN’s collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN’s collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.

[NLP-12] Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model

[Quick Read]: This paper addresses the underexplored reasoning capabilities of large audio-language models (LALMs) on complex real-world problems. Although LALMs perform well on audio perception and understanding tasks such as speech recognition and audio captioning, their reasoning abilities have received little attention. The key contribution is integrating Chain-of-Thought (CoT) reasoning into LALMs to strengthen reasoning across auditory modalities. Evaluating representative CoT methods on information-extraction and reasoning tasks in the sound, music, and speech domains, the paper finds that CoT clearly improves performance on easy and medium tasks, but on hard tasks reasoning chains can confuse the model and hurt accuracy. The study also identifies a positive correlation between reasoning-path length and accuracy, suggesting that scaling inference can benefit advanced instruction following and reasoning. The work highlights both the promise and the limitations of CoT for LALM reasoning and offers actionable directions for future research.

Link: https://arxiv.org/abs/2501.07246
Authors: Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, Xie Chen
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning capabilities - critical for solving complex real-world problems - remain underexplored. In this work, we conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALMs to enhance their reasoning ability across auditory modalities. We evaluate representative CoT methods, analyzing their performance in both information extraction and reasoning tasks across sound, music, and speech domains. Our findings reveal that CoT methods significantly improve performance on easy and medium tasks but encounter challenges with hard tasks, where reasoning chains can confuse the model rather than improve accuracy. Additionally, we identify a positive correlation between reasoning path length and accuracy, demonstrating the potential of scaling inference for advanced instruction-following and reasoning. This study not only highlights the promise of CoT in enhancing LALM reasoning capabilities but also identifies key limitations and provides actionable directions for future research.

[NLP-13] Can Vision-Language Models Evaluate Handwritten Math?

[Quick Read]: This paper addresses the untested ability of current vision-language models (VLMs) to automatically evaluate handwritten student math work. Despite progress in automatic grading, VLMs' capacity to detect, localize, and correct errors in handwritten content has not been studied comprehensively. The authors introduce FERMAT, a benchmark spanning four key error dimensions (computational, conceptual, notational, and presentation) and comprising over 2,200 handwritten math solutions derived from 609 manually curated grade 7-12 problems with intentionally introduced perturbations. Using FERMAT, nine VLMs are benchmarked on three tasks: error detection, localization, and correction. The results reveal significant shortcomings in current VLMs' reasoning over handwritten text, with Gemini-1.5-Pro achieving the best error-correction rate at 77%. Some models also struggle with handwritten inputs specifically: their accuracy improves when the handwriting is replaced with printed text or images. These findings expose the limitations of current VLMs and point to new avenues for improvement.

Link: https://arxiv.org/abs/2501.07244
Authors: Oikantik Nath, Hanani Bathina, Mohammed Safi Ur Rahman Khan, Mitesh M. Khapra
Affiliations: Indian Institute of Technology, Madras; AI4Bharat; Chennai Mathematical Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in Vision-Language Models (VLMs) have opened new possibilities in automatic grading of handwritten student responses, particularly in mathematics. However, a comprehensive study to test the ability of VLMs to evaluate and reason over handwritten content remains absent. To address this gap, we introduce FERMAT, a benchmark designed to assess the ability of VLMs to detect, localize and correct errors in handwritten mathematical content. FERMAT spans four key error dimensions - computational, conceptual, notational, and presentation - and comprises over 2,200 handwritten math solutions derived from 609 manually curated problems from grades 7-12 with intentionally introduced perturbations. Using FERMAT we benchmark nine VLMs across three tasks: error detection, localization, and correction. Our results reveal significant shortcomings in current VLMs in reasoning over handwritten text, with Gemini-1.5-Pro achieving the highest error correction rate (77%). We also observed that some models struggle with processing handwritten content, as their accuracy improves when handwritten inputs are replaced with printed text or images. These findings highlight the limitations of current VLMs and reveal new avenues for improvement. We release FERMAT and all the associated resources in the open-source to drive further research.

[NLP-14] When lies are mostly truthful: automated verbal deception detection for embedded lies

[Quick Read]: This paper addresses the lag in research on embedded lies in verbal deception detection. Traditional deception-detection research treats statements as wholly truthful or wholly deceptive, whereas in reality veracity lies on a continuum, with truthful and deceptive parts embedded in the same statement. The authors collect a new dataset of 2,088 truthful and deceptive statements with annotated embedded lies and show that a fine-tuned language model (Llama-3-8B) classifies truthful statements versus those containing embedded lies with 64% accuracy. The analysis suggests that the difficulty of detecting embedded lies stems from their resemblance to truthful statements: typical deceptive statements consist of roughly two-thirds truthful information and one-third embedded lies, largely drawn from past personal experiences and differing only minimally from their truthful counterparts in linguistic properties. The dataset is released as a new resource for research on embedded lies.

Link: https://arxiv.org/abs/2501.07217
Authors: Riccardo Loconte, Bennett Kleinberg
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Background: Verbal deception detection research relies on narratives and commonly assumes statements as truthful or deceptive. A more realistic perspective acknowledges that the veracity of statements exists on a continuum with truthful and deceptive parts being embedded within the same statement. However, research on embedded lies has been lagging behind. Methods: We collected a novel dataset of 2,088 truthful and deceptive statements with annotated embedded lies. Using a within-subjects design, participants provided a truthful account of an autobiographical event. They then rewrote their statement in a deceptive manner by including embedded lies, which they highlighted afterwards and judged on lie centrality, deceptiveness, and source. Results: We show that a fine-tuned language model (Llama-3-8B) can classify truthful statements and those containing embedded lies with 64% accuracy. Individual differences, linguistic properties and explainability analysis suggest that the challenge of moving the dial towards embedded lies stems from their resemblance to truthful statements. Typical deceptive statements consisted of 2/3 truthful information and 1/3 embedded lies, largely derived from past personal experiences and with minimal linguistic differences with their truthful counterparts. Conclusion: We present this dataset as a novel resource to address this challenge and foster research on embedded lies in verbal deception detection.

[NLP-15] BIOMEDICA: An Open Biomedical Image-Caption Archive Dataset and Vision-Language Models Derived from Scientific Literature

[Quick Read]: This paper addresses the lack of large-scale, diverse, publicly accessible multimodal datasets in the biomedical domain, which has held back the development of generalist biomedical vision-language models (VLMs). Existing datasets are confined to narrow domains and miss the full diversity of biomedical knowledge encoded in the scientific literature. The authors present BIOMEDICA, a scalable open-source framework that extracts, annotates, and serializes the PubMed Central Open Access subset into an easy-to-use public dataset containing over 24 million unique image-text pairs from more than 6 million articles, together with metadata and expert-guided annotations. They also release BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on BIOMEDICA via streaming, avoiding the need to download 27 TB of data. The models achieve state-of-the-art performance across 40 tasks, excelling in zero-shot classification and image-text retrieval while using 10x less compute. The key to the solution is a comprehensive, diverse biomedical dataset combined with an efficient pre-training recipe.

Link: https://arxiv.org/abs/2501.07171
Authors: Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu, Anita Rau, Xiaohan Wang, Yuhui Zhang, Alfred Seunghoon Song, Robert Tibshirani, Serena Yeung-Levy
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.

[NLP-16] ListConRanker: A Contrastive Text Reranker with Listwise Encoding

[Quick Read]: This paper addresses two main problems in existing reranker models for semantic-similarity assessment: first, conventional approaches use pointwise encoding, encoding each passage independently and thus unable to compare semantic differences between passages during encoding; second, conventional models are trained with a cross-entropy loss, which causes unsmooth gradient changes and low training efficiency. The proposed Listwise-encoded Contrastive text reRanker (ListConRanker) introduces listwise encoding, letting each passage be compared against the others during encoding and strengthening the contrastive signal between positive examples and between positives and negatives. The model is further trained with the circle loss, increasing gradient flexibility and addressing training efficiency. Experiments show that ListConRanker achieves state-of-the-art performance on the reranking benchmarks of the Chinese Massive Text Embedding Benchmark.

Link: https://arxiv.org/abs/2501.07111
Authors: Junlong Liu, Yue Ma, Ruihui Zhao, Junhao Zheng, Qianli Ma, Yangyang Kang
Affiliations: 1 School of Computer Science and Engineering, South China University of Technology; 2 ByteDance China; 3 Zhejiang University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 11 pages, 4 figures

Abstract:Reranker models aim to re-rank the passages based on the semantic similarity between the given query and passages, which have recently received more attention due to the wide application of the Retrieval-Augmented Generation. Most previous methods apply pointwise encoding, meaning that it can only encode the context of the query for each passage input into the model. However, for the reranker model, given a query, the comparison results between passages are even more important, which is called listwise encoding. Besides, previous models are trained using the cross-entropy loss function, which leads to issues of unsmooth gradient changes during training and low training efficiency. To address these issues, we propose a novel Listwise-encoded Contrastive text reRanker (ListConRanker). It can help the passage to be compared with other passages during the encoding process, and enhance the contrastive information between positive examples and between positive and negative examples. At the same time, we use the circle loss to train the model to increase the flexibility of gradients and solve the problem of training efficiency. Experimental results show that ListConRanker achieves state-of-the-art performance on the reranking benchmark of Chinese Massive Text Embedding Benchmark, including the cMedQA1.0, cMedQA2.0, MMarcoReranking, and T2Reranking datasets.
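
For reference, the sketch below implements the standard circle loss (Sun et al., 2020) over one query's positive and negative passage similarities; whether ListConRanker uses exactly these margin and scale values is an assumption.

```python
import torch
import torch.nn.functional as F

def circle_loss(sp: torch.Tensor, sn: torch.Tensor,
                m: float = 0.25, gamma: float = 32.0) -> torch.Tensor:
    """sp/sn: similarities of positive/negative passages for one query."""
    ap = torch.clamp_min(1 + m - sp, 0)   # self-paced weight for positives
    an = torch.clamp_min(sn + m, 0)       # self-paced weight for negatives
    delta_p, delta_n = 1 - m, m           # decision margins
    logit_p = -gamma * ap * (sp - delta_p)
    logit_n = gamma * an * (sn - delta_n)
    # log(1 + sum_i exp(logit_p_i) * sum_j exp(logit_n_j))
    return F.softplus(torch.logsumexp(logit_p, 0) + torch.logsumexp(logit_n, 0))

loss = circle_loss(sp=torch.tensor([0.9, 0.7]), sn=torch.tensor([0.3, 0.1]))
print(loss)  # gradients adapt per pair, unlike plain cross-entropy
```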

[NLP-17] AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR ICASSP2025

[Quick Read]: This paper addresses the challenge that intra-sentential code-switching (CS) poses to automatic speech recognition (ASR), especially in low-resource languages such as Vietnamese, where limited training data and the unpredictability of code-switching make it hard to accurately transcribe speech containing foreign proper names or specialized terms. The proposed solution, AdaCS, integrates an adaptive bias attention module (BAM) into an encoder-decoder network to identify and normalize code-switched phrases, greatly improving adaptability to unseen domains. At inference time, AdaCS exploits a provided biased word list to strengthen this adaptive capability. Experiments show that AdaCS reduces word error rate (WER) by 56.2% and 36.8% on the two proposed test sets, outperforming the previous state of the art in Vietnamese CS ASR normalization.

Link: https://arxiv.org/abs/2501.07102
Authors: TheChuong Chu, Vu Tuan Dat Pham, Kien Dao, Hoang Nguyen, Quoc Hung Truong
Affiliations: VinBrain, Hanoi, Vietnam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at ICASSP 2025

Abstract:Intra-sentential code-switching (CS) refers to the alternation between languages that happens within a single utterance and is a significant challenge for Automatic Speech Recognition (ASR) systems. For example, when a Vietnamese speaker uses foreign proper names or specialized terms within their speech. ASR systems often struggle to accurately transcribe intra-sentential CS due to their training on monolingual data and the unpredictable nature of CS. This issue is even more pronounced for low-resource languages, where limited data availability hinders the development of robust models. In this study, we propose AdaCS, a normalization model that integrates an adaptive bias attention module (BAM) into an encoder-decoder network. This novel approach provides a robust solution to CS ASR in unseen domains, thereby significantly enhancing our contribution to the field. By utilizing BAM to both identify and normalize CS phrases, AdaCS enhances its adaptive capabilities with a biased list of words provided during inference. Our method demonstrates impressive performance and the ability to handle unseen CS phrases across various domains. Experiments show that AdaCS outperforms the previous state-of-the-art method on Vietnamese CS ASR normalization by considerable WER reduction of 56.2% and 36.8% on the two proposed test sets.

[NLP-18] Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models ICASSP2025

[Quick Read]: This paper targets the insufficient comprehension of input text by large multimodal models (LMMs) in text-to-image (T2I) generation. As demand grows for more complex and flexible image descriptions, improving input-text understanding within the in-context learning (ICL) paradigm remains a critical yet underexplored area. The proposed solution, Parallel Multilingual Prompts (PMT2I), harnesses the multilingual capabilities of LMMs by translating the input text into several languages and feeding the model both the original text and the translations. Experiments show that PMT2I performs better on general, compositional, and fine-grained assessments, particularly in human-preference alignment, and, thanks to generating more diverse images, significantly outperforms baseline prompts when combined with reranking methods.

Link: https://arxiv.org/abs/2501.07086
Authors: Yongyu Mu, Hengyu Li, Junxin Wang, Xiaoxuan Zhou, Chenglong Wang, Yingfeng Luo, Qiaozhi He, Tong Xiao, Guocheng Chen, Jingbo Zhu
Affiliations: 1 School of Computer Science and Engineering, Northeastern University, Shenyang, China; 2 School of Economics and Management, Dalian Jiaotong University, Dalian, China; 3 NiuTrans Research, Shenyang, China
Categories: Computation and Language (cs.CL)
Comments: Accepted to ICASSP 2025

Abstract:Previous work on augmenting large multimodal models (LMMs) for text-to-image (T2I) generation has focused on enriching the input space of in-context learning (ICL). This includes providing a few demonstrations and optimizing image descriptions to be more detailed and logical. However, as demand for more complex and flexible image descriptions grows, enhancing comprehension of input text within the ICL paradigm remains a critical yet underexplored area. In this work, we extend this line of research by constructing parallel multilingual prompts aimed at harnessing the multilingual capabilities of LMMs. More specifically, we translate the input text into several languages and provide the models with both the original text and the translations. Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments, especially in human preference alignment. Additionally, with its advantage of generating more diverse images, PMT2I significantly outperforms baseline prompts when incorporated with reranking methods. Our code and parallel multilingual data can be found at this https URL.
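
Constructing a parallel multilingual prompt is straightforward to sketch: translate the caption into a few languages and present the original together with the translations. The translator stub, language set, and template below are illustrative, not the paper's exact prompt.

```python
def translate(text: str, lang: str) -> str:
    """Stub standing in for any MT system or LLM translation call."""
    return f"[{lang}] {text}"   # placeholder 'translation'

def build_pmt2i_prompt(caption: str, langs=("de", "fr", "zh")) -> str:
    parallel = [caption] + [translate(caption, lang) for lang in langs]
    header = "Generate an image described by the following equivalent texts:\n"
    return header + "\n".join(f"- {t}" for t in parallel)

print(build_pmt2i_prompt("a red bicycle leaning against a brick wall"))
```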

[NLP-19] Research on the Online Update Method for Retrieval-Augmented Generation (RAG) Model with Incremental Learning

[Quick Read]: This paper addresses the challenge of keeping language models up to date in a dynamically changing information environment, against the backdrop of rapid advances in information technology and exponential data growth. The key contribution is an online update method built on the existing Retrieval Enhanced Generation (RAG) model: a dynamic memory captures emerging data samples, which are gradually integrated into the core model through a tunable knowledge-distillation strategy. The retrieval module adds hierarchical indexing and a multi-layer gating mechanism so that retrieved content is more targeted and accurate. In the generation stage, a multi-stage network structure handles different input types, with cross-attention matching and screening applied to each stage's intermediate representations to ensure effective integration and iterative updating of new and old knowledge. Experiments show the method outperforms mainstream comparison models in knowledge retention and inference accuracy.

Link: https://arxiv.org/abs/2501.07063
Authors: Yuxin Fan, Yuxiang Wang, Lipeng Liu, Xirui Tang, Na Sun, Zidong Yu
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:In the contemporary context of rapid advancements in information technology and the exponential growth of data volume, language models are confronted with significant challenges in effectively navigating the dynamic and ever-evolving information landscape to update and adapt to novel knowledge in real time. In this work, an online update method is proposed, which is based on the existing Retrieval Enhanced Generation (RAG) model with multiple innovation mechanisms. Firstly, the dynamic memory is used to capture the emerging data samples, and then gradually integrate them into the core model through a tunable knowledge distillation strategy. At the same time, hierarchical indexing and multi-layer gating mechanism are introduced into the retrieval module to ensure that the retrieved content is more targeted and accurate. Finally, a multi-stage network structure is established for different types of inputs in the generation stage, and cross-attention matching and screening are carried out on the intermediate representations of each stage to ensure the effective integration and iterative update of new and old knowledge. Experimental results show that the proposed method is better than the existing mainstream comparison models in terms of knowledge retention and inference accuracy.

[NLP-20] Leveraging ASIC AI Chips for Homomorphic Encryption

[Quick Read]: This paper addresses the high latency of homomorphic encryption (HE) in cloud computing caused by its steep computational demands. Although HE offers strong privacy guarantees, it is far more expensive than computing on plaintext, leading to unacceptably long waits for results. The proposed solution converts HE primitives into AI operators and accelerates them on existing ASIC AI accelerators such as TPUs. The key lies in three parts: (1) supporting modular multiplication, (2) high-precision arithmetic in software, and (3) efficient mapping onto matrix engines. To this end, the paper introduces the CROSS compiler, which adopts Barrett reduction to provide modular-reduction support with multipliers and adders, uses Basis Aligned Transformation (BAT) to convert high-precision multiplication into low-precision matrix-vector multiplication, and uses Matrix Aligned Transformation (MAT) to map vectorized modular operations with reduction onto matrix multiplication that a 2D spatial matrix engine can process efficiently. Evaluation on a Google TPUv4 shows significant gains: up to 161x and 5x speedups over prior work on many-core CPUs and a V100, respectively.

Link: https://arxiv.org/abs/2501.07047
Authors: Jianming Tong, Tianhao Huang, Leo de Castro, Anirudh Itagi, Jingtian Dang, Anupam Golder, Asra Ali, Jevin Jiang, Arvind, G. Edward Suh, Tushar Krishna
Affiliations: Georgia Institute of Technology; Massachusetts Institute of Technology; Google; Cornell University / NVIDIA
Categories: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: 16 pages, 10 figures, 4 algorithms, 7 tables. Enabling Google TPUv4 for privacy-preserving AI inference

Abstract:Cloud-based services are making the outsourcing of sensitive client data increasingly common. Although homomorphic encryption (HE) offers strong privacy guarantee, it requires substantially more resources than computing on plaintext, often leading to unacceptably large latencies in getting the results. HE accelerators have emerged to mitigate this latency issue, but with the high cost of ASICs. In this paper we show that HE primitives can be converted to AI operators and accelerated on existing ASIC AI accelerators, like TPUs, which are already widely deployed in the cloud. Adapting such accelerators for HE requires (1) supporting modular multiplication, (2) high-precision arithmetic in software, and (3) efficient mapping on matrix engines. We introduce the CROSS compiler (1) to adopt Barrett reduction to provide modular reduction support using multiplier and adder, (2) Basis Aligned Transformation (BAT) to convert high-precision multiplication into low-precision matrix-vector multiplication, (3) Matrix Aligned Transformation (MAT) to convert vectorized modular operations with reduction into matrix multiplication that can be efficiently processed on a 2D spatial matrix engine. Our evaluation of CROSS on a Google TPUv4 demonstrates significant performance improvements, with up to 161x and 5x speedup compared to the previous work on many-core CPUs and V100. The kernel-level codes are open-sourced at this https URL.
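
Barrett reduction itself is textbook material and easy to sketch: replace division by q with a multiplication by a precomputed constant and a shift, followed by at most two correction subtractions. The sketch shows only the arithmetic trick, not the CROSS mapping onto TPU matrix engines; the choice of modulus is illustrative.

```python
def barrett_setup(q: int) -> tuple[int, int]:
    k = q.bit_length()
    mu = (1 << (2 * k)) // q            # precomputed floor(2^(2k) / q)
    return k, mu

def barrett_reduce(x: int, q: int, k: int, mu: int) -> int:
    """x mod q for 0 <= x < q**2, using only multiplies, shifts, subtracts."""
    t = x - ((x * mu) >> (2 * k)) * q   # subtract q times an estimated quotient
    while t >= q:                       # estimate is off by at most 2
        t -= q
    return t

q = 12289                               # an NTT-friendly prime (illustrative)
k, mu = barrett_setup(q)
assert all(barrett_reduce(x, q, k, mu) == x % q for x in range(0, q * q, 997))
print("Barrett reduction matches x % q")
```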

[NLP-21] ViSoLex: An Open-Source Repository for Vietnamese Social Media Lexical Normalization COLING2025

[Quick Read]: This paper addresses lexical normalization for Vietnamese social-media text, in particular the handling of non-standard words (NSWs). Vietnamese social-media text is full of varied informal expressions and lacks standardized annotated data, posing challenges for natural language processing (NLP) tasks. The ViSoLex system integrates pre-trained language models and weakly supervised learning to provide NSW Lookup and lexical-normalization services, effectively addressing the scarcity of labeled Vietnamese data. The key is using pre-trained models and weak supervision to improve normalization accuracy and efficiency, while offering a flexible, customizable framework adaptable to different datasets and research needs.

Link: https://arxiv.org/abs/2501.07020
Authors: Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Kiet Van Nguyen
Affiliations: University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The 31st International Conference on Computational Linguistics (COLING 2025)

Abstract:ViSoLex is an open-source system designed to address the unique challenges of lexical normalization for Vietnamese social media text. The platform provides two core services: Non-Standard Word (NSW) Lookup and Lexical Normalization, enabling users to retrieve standard forms of informal language and standardize text containing NSWs. ViSoLex’s architecture integrates pre-trained language models and weakly supervised learning techniques to ensure accurate and efficient normalization, overcoming the scarcity of labeled data in Vietnamese. This paper details the system’s design, functionality, and its applications for researchers and non-technical users. Additionally, ViSoLex offers a flexible, customizable framework that can be adapted to various datasets and research requirements. By publishing the source code, ViSoLex aims to contribute to the development of more robust Vietnamese natural language processing tools and encourage further research in lexical normalization. Future directions include expanding the system’s capabilities for additional languages and improving the handling of more complex non-standard linguistic patterns.

[NLP-22] LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models

[Quick Read]: This paper addresses the visual-understanding limitations of multimodal large language models (MLLMs), particularly the problems caused by a single vision encoder and excessively long visual tokens. Although existing hybrid MLLMs have made progress by introducing a mixture of vision experts, a research gap remains in effectively integrating diverse vision encoders. The proposed MLLM, LEO, uses a dual-branch vision-encoder framework combined with a post-adaptation fusion strategy and adaptive tiling: for each tile of the input image, LEO sequentially interleaves the visual tokens produced by its two vision encoders, optimizing the fusion of visual information. Experiments show that LEO performs strongly on 13 vision-language benchmarks, outperforming state-of-the-art open-source MLLMs and hybrid MLLMs on most tasks, and can be adapted to the specialized domain of autonomous driving without changing the model architecture or training recipe, achieving competitive performance against existing baselines.

Link: https://arxiv.org/abs/2501.06986
Authors: Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki
Affiliations: University of Waterloo
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Enhanced visual understanding serves as a cornerstone for multimodal large language models (MLLMs). Recent hybrid MLLMs incorporate a mixture of vision experts to address the limitations of using a single vision encoder and excessively long visual tokens. Despite the progress of these MLLMs, a research gap remains in effectively integrating diverse vision encoders. This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO, a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling: for each segmented tile of the input images, LEO sequentially interleaves the visual tokens from its two vision encoders. Extensive evaluation across 13 vision-language benchmarks reveals that LEO outperforms state-of-the-art open-source MLLMs and hybrid MLLMs on the majority of tasks. Furthermore, we show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe, achieving competitive performance compared to existing baselines. The code and model will be publicly available.
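
The fusion step can be sketched directly from the abstract: for each adaptive tile, append the (post-adaptation) token sequences of the two encoders in turn. Tensor shapes and the exact interleaving granularity are assumptions; the abstract does not pin them down.

```python
import torch

def fuse_tiles(tiles_a: list[torch.Tensor], tiles_b: list[torch.Tensor]) -> torch.Tensor:
    """Per tile, place encoder-A tokens then encoder-B tokens, tile after tile.

    tiles_a[i], tiles_b[i]: [n_tokens, dim] visual tokens for tile i from the
    two vision encoders, already projected to a shared dimension.
    """
    return torch.cat([t for a, b in zip(tiles_a, tiles_b) for t in (a, b)], dim=0)

tiles_a = [torch.randn(4, 8) for _ in range(3)]   # 3 tiles, 4 tokens each
tiles_b = [torch.randn(4, 8) for _ in range(3)]
print(fuse_tiles(tiles_a, tiles_b).shape)          # torch.Size([24, 8])
```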

[NLP-23] Harnessing Large Language Models for Disaster Management: A Survey

[Quick Read]: This paper addresses the lack of a systematic review and in-depth analysis of large language models (LLMs) for natural disaster management. Although LLMs have shown remarkable capabilities in scientific research and transformative impact across fields, especially in mitigating threats to human life, infrastructure, and the environment, research on LLMs for disaster management has not been systematically organized. The paper therefore presents a comprehensive survey of existing LLMs in natural disaster management, together with a taxonomy that categorizes existing work by disaster phase and application scenario. By collecting public datasets and identifying key challenges and opportunities, the study aims to guide the professional community in developing advanced LLMs for disaster management and strengthening resilience against natural disasters. The key contribution is the systematic categorization of existing research and the directions it sets for future work.

Link: https://arxiv.org/abs/2501.06932
Authors: Zhenyu Lei, Yushun Dong, Weiyu Li, Rong Ding, Qi Wang, Jundong Li
Affiliations: University of Virginia; Florida State University; Northeastern University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have revolutionized scientific research with their exceptional capabilities and transformed various fields. Among their practical applications, LLMs have been playing a crucial role in mitigating threats to human life, infrastructure, and the environment. Despite growing research in disaster LLMs, there remains a lack of systematic review and in-depth analysis of LLMs for natural disaster management. To address the gap, this paper presents a comprehensive survey of existing LLMs in natural disaster management, along with a taxonomy that categorizes existing works based on disaster phases and application scenarios. By collecting public datasets and identifying key challenges and opportunities, this study aims to guide the professional community in developing advanced LLMs for disaster management to enhance the resilience against natural disasters.

[NLP-24] Risk-Averse Finetuning of Large Language Models NEURIPS2024

[Quick Read]: This paper addresses the problem of large language models (LLMs) generating negative or toxic content in response to certain prompts. The proposed remedy integrates risk-averse principles into LLM fine-tuning to minimize harmful outputs, especially rare but high-impact events. The key is optimizing the Conditional Value at Risk (CVaR) risk measure, training LLMs to avoid toxic outputs while remaining effective on generative tasks. Empirical evaluations on sentiment modification and toxicity mitigation demonstrate the efficacy of risk-averse reinforcement learning with human feedback (RLHF) in promoting a safer and more constructive online discourse environment.

Link: https://arxiv.org/abs/2501.06911
Authors: Sapana Chaudhary, Ujwal Dinesha, Dileep Kalathil, Srinivas Shakkottai
Affiliations: Amazon Web Services (AWS); Department of Electrical and Computer Engineering, Texas A&M University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: NeurIPS 2024

Abstract:We consider the challenge of mitigating the generation of negative or toxic content by the Large Language Models (LLMs) in response to certain prompts. We propose integrating risk-averse principles into LLM fine-tuning to minimize the occurrence of harmful outputs, particularly rare but significant events. By optimizing the risk measure of Conditional Value at Risk (CVaR), our methodology trains LLMs to exhibit superior performance in avoiding toxic outputs while maintaining effectiveness in generative tasks. Empirical evaluations on sentiment modification and toxicity mitigation tasks demonstrate the efficacy of risk-averse reinforcement learning with human feedback (RLHF) in promoting a safer and more constructive online discourse environment.
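
The CVaR objective is simple to state: instead of the mean reward, optimize the mean over the worst alpha-fraction of rewards, which is what targets the rare toxic generations. The sketch below computes that tail loss for a batch of reward scores; the alpha value and the surrounding RLHF machinery (policy gradients, KL penalties) are assumptions left out here.

```python
import torch

def cvar_loss(rewards: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Negative CVaR_alpha: mean reward over the worst alpha-fraction of samples."""
    k = max(1, int(alpha * rewards.numel()))
    worst, _ = torch.topk(rewards, k, largest=False)  # the alpha-tail of rewards
    return -worst.mean()                              # minimizing lifts the tail

rewards = torch.tensor([0.9, 0.8, -2.0, 0.7, 0.95, -1.5, 0.85, 0.6, 0.9, 0.75])
print(cvar_loss(rewards, alpha=0.2))  # -> 1.75; driven by the two toxic outliers
```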

[NLP-25] Language Fusion for Parameter-Efficient Cross-lingual Transfer

[Quick Read]: This paper addresses the poor downstream performance of languages other than English caused by the limited availability of multilingual text corpora: representation spaces for non-English languages are undertrained. Existing cross-lingual transfer methods exploit the English representation space by, for example, mixing English and non-English tokens at the input level or extending model parameters for new languages, but often at the cost of increased computational complexity. The proposed Fusion for Language Representations (FLARE) in adapters integrates source- and target-language representations inside low-rank (LoRA) adapters using lightweight linear transformations, improving transfer performance while preserving parameter efficiency. Experiments show that, on question-answering tasks measured by exact match, FLARE improves over standard LoRA fine-tuning by 4.9% for Llama 3.1 and 2.2% for Gemma 2.

Link: https://arxiv.org/abs/2501.06892
Authors: Philipp Borchert, Ivan Vulić, Marie-Francine Moens, Jochen De Weerdt
Affiliations: Research Centre for Information Systems Engineering, KU Leuven, Belgium; IESEG School of Management, France; Language Technology Lab, University of Cambridge, United Kingdom; Department of Computer Science, KU Leuven, Belgium
Categories: Computation and Language (cs.CL)
Comments: 20 pages

Abstract:Limited availability of multilingual text corpora for training language models often leads to poor performance on downstream tasks due to undertrained representation spaces for languages other than English. This ‘under-representation’ has motivated recent cross-lingual transfer methods to leverage the English representation space by e.g. mixing English and ‘non-English’ tokens at the input level or extending model parameters to accommodate new languages. However, these approaches often come at the cost of increased computational complexity. We propose Fusion for Language Representations (FLARE) in adapters, a novel method that enhances representation quality and downstream performance for languages other than English while maintaining parameter efficiency. FLARE integrates source and target language representations within low-rank (LoRA) adapters using lightweight linear transformations, maintaining parameter efficiency while improving transfer performance. A series of experiments across representative cross-lingual natural language understanding tasks, including natural language inference, question-answering and sentiment analysis, demonstrate FLARE’s effectiveness. FLARE achieves performance improvements of 4.9% for Llama 3.1 and 2.2% for Gemma 2 compared to standard LoRA fine-tuning on question-answering tasks, as measured by the exact match metric.

[NLP-26] Transfer Learning of Tabular Data by Finetuning Large Language Models

[Quick Read]: This paper addresses the lack of success of deep learning on tabular-data classification, owing to heterogeneous feature spaces and limited sample sizes without viable transfer learning. The key is to leverage large language models (LLMs) from the generative-AI era for transfer learning. Specifically, the paper proposes end-to-end LLM fine-tuning that demonstrates cross-data transfer learning on ten benchmark datasets in the absence of large pre-trained tabular-data models. The method outperforms state-of-the-art machine-learning and deep-learning approaches on tabular data with fewer than ten features, a standard feature size for tabular datasets, while using a fraction of the computational cost of other deep-learning or API-based solutions and ensuring competitive or superior classification performance.

Link: https://arxiv.org/abs/2501.06863
Authors: Shourav B. Rabbani, Ibna Kowsar, Manar D. Samad
Affiliations: Department of Computer Science, Tennessee State University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Despite the artificial intelligence (AI) revolution, deep learning has yet to achieve much success with tabular data due to heterogeneous feature space and limited sample sizes without viable transfer learning. The new era of generative AI, powered by large language models (LLM), brings unprecedented learning opportunities to diverse data and domains. This paper investigates the effectiveness of an LLM application programming interface (API) and transfer learning of LLM in tabular data classification. LLM APIs respond to input text prompts with tokenized data and instructions, whereas transfer learning finetunes an LLM for a target classification task. This paper proposes an end-to-end finetuning of LLM to demonstrate cross-data transfer learning on ten benchmark data sets when large pre-trained tabular data models do not exist to facilitate transfer learning. The proposed LLM finetuning method outperforms state-of-the-art machine and deep learning methods on tabular data with less than ten features - a standard feature size for tabular data sets. The transfer learning approach uses a fraction of the computational cost of other deep learning or API-based solutions while ensuring competitive or superior classification performance.

[NLP-27] A Comprehensive Evaluation of Large Language Models on Mental Illnesses in Arabic Context

[Quick Read]: This paper addresses the growing public-health concern of mental health disorders in the Arab world and how large language models (LLMs) might power accessible diagnostic and intervention tools. The key elements are: (1) evaluating eight LLMs, including general multilingual and bilingual models, on diverse Arabic-context mental health datasets, studying the effect of prompt design, language configuration (native Arabic vs. translated English), and few-shot prompting on diagnostic performance; (2) finding that prompt engineering significantly affects model scores, with a structured prompt outperforming a less structured variant on multi-class datasets by an average of 14.5%; (3) finding that language has a modest effect but model selection is crucial, e.g., Phi-3.5 MoE excels in balanced accuracy, particularly for binary classification, while Mistral NeMo performs best on mean absolute error for severity prediction; and (4) showing that few-shot prompting consistently helps, with GPT-4o Mini's multi-class accuracy improving by an average factor of 1.58. These findings underscore the importance of prompt optimization, multilingual analysis, and few-shot learning for building culturally sensitive, effective LLM-based mental health tools for Arabic-speaking populations.

Link: https://arxiv.org/abs/2501.06859
Authors: Noureldin Zahran, Aya E. Fouda, Radwa J. Hanafy, Mohammed E. Fouda
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mental health disorders pose a growing public health concern in the Arab world, emphasizing the need for accessible diagnostic and intervention tools. Large language models (LLMs) offer a promising approach, but their application in Arabic contexts faces challenges including limited labeled datasets, linguistic complexity, and translation biases. This study comprehensively evaluates 8 LLMs, including general multi-lingual models, as well as bi-lingual ones, on diverse mental health datasets (such as AraDepSu, Dreaddit, MedMCQA), investigating the impact of prompt design, language configuration (native Arabic vs. translated English, and vice versa), and few-shot prompting on diagnostic performance. We find that prompt engineering significantly influences LLM scores mainly due to reduced instruction following, with our structured prompt outperforming a less structured variant on multi-class datasets, with an average difference of 14.5%. While language influence on performance was modest, model selection proved crucial: Phi-3.5 MoE excelled in balanced accuracy, particularly for binary classification, while Mistral NeMo showed superior performance in mean absolute error for severity prediction tasks. Few-shot prompting consistently improved performance, with particularly substantial gains observed for GPT-4o Mini on multi-class classification, boosting accuracy by an average factor of 1.58. These findings underscore the importance of prompt optimization, multilingual analysis, and few-shot learning for developing culturally sensitive and effective LLM-based mental health tools for Arabic-speaking populations.
zh

[NLP-28] A General Framework for Inference-time Scaling and Steering of Diffusion Models

【速读】: 该论文试图解决在生成具有用户指定属性的样本时,扩散模型(Diffusion Models)面临的挑战。尽管扩散模型在图像、视频、蛋白质设计和文本等多种模态中表现出色,但生成符合特定属性的样本仍然困难。现有的方法通常通过微调模型来最大化捕捉期望属性的奖励函数,但这些方法需要昂贵的训练成本,并且容易导致模式崩溃(mode collapse)。

论文提出的解决方案是Feynman Kac (FK) steering,这是一种在推理时(inference-time)通过奖励函数引导扩散模型的框架。FK steering的核心在于采样多个相互作用的扩散过程(称为粒子),并在中间步骤根据使用势函数(potentials)计算的分数对粒子进行重采样。势函数通过中间状态的奖励来定义,高势值表示粒子将生成高奖励的样本。该方法无需训练,能够显著提升样本质量和可控性,且在文本到图像和文本扩散模型中表现出色,优于传统的微调方法。

链接: https://arxiv.org/abs/2501.06848
作者: Raghav Singhal,Zachary Horvitz,Ryan Teehan,Mengye Ren,Zhou Yu,Kathleen McKeown,Rajesh Ranganath
机构: Department of Computer Science, New York University(纽约大学计算机科学系); Columbia University(哥伦比亚大学); Center for Data Science, New York University(纽约大学数据科学中心)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models produce impressive results in modalities ranging from images and video to protein design and text. However, generating samples with user-specified properties remains a challenge. Recent research proposes fine-tuning models to maximize rewards that capture desired properties, but these methods require expensive training and are prone to mode collapse. In this work, we propose Feynman Kac (FK) steering, an inference-time framework for steering diffusion models with reward functions. FK steering works by sampling a system of multiple interacting diffusion processes, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are selected such that a high value indicates that the particle will yield a high-reward sample. We explore various choices of potentials, intermediate rewards, and samplers. We evaluate FK steering on text-to-image and text diffusion models. For steering text-to-image models with a human preference reward, we find that FK steering a 0.8B parameter model outperforms a 2.6B parameter fine-tuned model on prompt fidelity, with faster sampling and no training. For steering text diffusion models with rewards for text quality and specific text attributes, we find that FK steering generates lower perplexity, more linguistically acceptable outputs and enables gradient-free control of attributes like toxicity. Our results demonstrate that inference-time scaling and steering of diffusion models, even with off-the-shelf rewards, can provide significant sample quality gains and controllability benefits. Code is available at this https URL .
zh
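
结合上面的描述,下面给出一段按"粒子 + 势函数 + 重采样"思路整理的最小Python示意,帮助理解FK steering的推理时流程。其中 denoise_step、potential 均为假设的接口,softmax式的权重归一化也是本示意的简化写法,并非论文官方实现:

```python
import numpy as np

def fk_steering(init_particles, denoise_step, potential, n_steps, seed=0):
    """推理时FK引导的概念示意:K个粒子并行去噪,每步按势函数重采样。"""
    rng = np.random.default_rng(seed)
    particles = list(init_particles)
    for t in range(n_steps):
        # 每个粒子各自执行一步去噪(denoise_step 为假设接口)
        particles = [denoise_step(x, t) for x in particles]
        # 势函数基于中间状态的奖励打分,softmax 归一化为重采样权重
        scores = np.array([potential(x, t) for x in particles], dtype=float)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        # 高势值粒子被复制,低势值粒子被淘汰
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = [particles[i] for i in idx]
    return particles

# 玩具用例:一维"去噪"过程,奖励偏好接近 1.0 的样本
out = fk_steering(
    init_particles=[float(x) for x in np.random.default_rng(1).normal(size=8)],
    denoise_step=lambda x, t: 0.9 * x + 0.1,
    potential=lambda x, t: -abs(x - 1.0),
    n_steps=10,
)
```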

[NLP-29] SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

【速读】: 该论文试图解决大语言模型(LLMs)训练过程中出现的梯度尖峰(gradient spikes)问题,这些尖峰会导致训练不稳定,进而引发模型性能下降、检查点恢复和实验重启等高昂代价。梯度尖峰的幅度可达典型梯度的1000倍,严重干扰了学习过程。为解决这一问题,论文提出了一种名为SPAM(Spike-Aware Adam with Momentum Reset)的新型优化器,其关键创新在于通过动量重置(momentum reset)和尖峰感知梯度裁剪(spike-aware gradient clipping)来有效应对梯度尖峰。SPAM在多种任务中(如LLM预训练、4-bit LLM预训练、强化学习和时间序列预测)均表现出优于Adam及其变体的性能,并且在内存受限的情况下,SPAM通过稀疏动量(sparse momentum)实现了内存高效训练,超越了现有的内存高效优化器(如GaLore和Adam-Mini)。该研究强调了缓解梯度尖峰对LLM训练的重要性,并提出了一种既能提升训练稳定性又能提高资源效率的优化策略。

链接: https://arxiv.org/abs/2501.06842
作者: Tianjin Huang,Ziquan Zhu,Gaojie Jin,Lu Liu,Zhangyang Wang,Shiwei Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to 1000× larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) Time Series Forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at this https URL
zh
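
下面是按"动量重置 + 尖峰感知梯度裁剪"两个要点拼出的单参数Adam变体示意。尖峰的判定阈值(以二阶动量的平方根为参考尺度)与固定的重置周期均为本示意的假设,论文的具体形式可能不同:

```python
import torch

def init_state(p):
    return {"m": torch.zeros_like(p), "v": torch.zeros_like(p),
            "step": 0, "t": 0}

@torch.no_grad()
def spam_step(p, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              reset_interval=500, theta=50.0):
    g = p.grad.clone()
    m, v = state["m"], state["v"]
    state["step"] += 1
    if state["step"] > 1:
        # 尖峰感知裁剪:幅度超过 theta*sqrt(v) 的梯度分量按阈值缩回
        ref = v.sqrt() + eps
        spike = g.abs() > theta * ref
        g = torch.where(spike, g.sign() * theta * ref, g)
    if state["step"] % reset_interval == 0:
        # 周期性动量重置:清零一、二阶动量,偏差校正步数重新计数
        m.zero_(); v.zero_(); state["t"] = 0
    state["t"] += 1
    t = state["t"]
    m.mul_(betas[0]).add_(g, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    p.add_(-lr * m_hat / (v_hat.sqrt() + eps))
```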

[NLP-30] LLM s Model Non-WEIRD Populations: Experiments with Synthetic Cultural Agents

【速读】: 该论文试图解决在研究非WEIRD(西方、教育程度高、工业化、富裕和民主)人群的经济行为时所面临的挑战。这些挑战主要源于这些人群的多样性和难以获取的实地数据。为了解决这一问题,论文提出了一种新颖的方法,即利用大语言模型(LLMs)生成代表这些人群的合成文化代理(SCAs)。通过将这些SCAs置于经典的行为实验(如独裁者游戏和最后通牒游戏)中,研究展示了实验行为中的显著跨文化差异。关键解决方案在于利用LLMs生成SCAs,这些代理的行为在已有数据的人群中与真实人类被试的行为相似,而在未研究的人群中,该方法能够生成可测试的经济行为假设。通过将AI整合到实验经济学中,该方法为难以触及的人群提供了一种有效且伦理的实验预研和协议优化工具。

链接: https://arxiv.org/abs/2501.06834
作者: Augusto Gonzalez-Bonorino(1),Monica Capra(2 and 3),Emilio Pantoja(4) ((1) Pomona College Economics Department, (2) Claremont Graduate University Economics Department, (3) University of Arizona Center for the Philosophy of Freedom, (4) Pitzer College Economics and Computer Science Department)
机构: Department of Economics, Pomona College(波莫纳学院经济系); Department of Economic Sciences, Claremont Graduate University(克莱蒙特研究生大学经济科学系); Center for the Philosophy of Freedom, University of Arizona(亚利桑那大学自由哲学中心); Department of Economics and Computer Science, Pitzer College(皮策学院经济与计算机科学系)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Despite its importance, studying economic behavior across diverse, non-WEIRD (Western, Educated, Industrialized, Rich, and Democratic) populations presents significant challenges. We address this issue by introducing a novel methodology that uses Large Language Models (LLMs) to create synthetic cultural agents (SCAs) representing these populations. We subject these SCAs to classic behavioral experiments, including the dictator and ultimatum games. Our results demonstrate substantial cross-cultural variability in experimental behavior. Notably, for populations with available data, SCAs’ behaviors qualitatively resemble those of real human subjects. For unstudied populations, our method can generate novel, testable hypotheses about economic behavior. By integrating AI into experimental economics, this approach offers an effective and ethical method to pilot experiments and refine protocols for hard-to-reach populations. Our study provides a new tool for cross-cultural economic studies and demonstrates how LLMs can help experimental behavioral research.
zh
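
作为参考,下面是一个构造"合成文化代理"并让其参与独裁者博弈的极简提示骨架。提示模板与解析逻辑均为本示意的假设(论文未公开具体提示词),实际调用哪个LLM接口由读者自行接入:

```python
def dictator_game_prompt(culture_profile: str, endowment: int = 100) -> str:
    """用文化画像构造独裁者博弈提示(假设的提示模板)。"""
    return (
        f"You are a person from the following community: {culture_profile}\n"
        f"You have been given {endowment} tokens. Decide how many tokens "
        f"(0-{endowment}) to give to an anonymous stranger from your community. "
        f"Answer with a single integer."
    )

def parse_allocation(reply: str, endowment: int = 100) -> int:
    """从模型回复中解析分配数额,解析规则为简化假设。"""
    digits = "".join(ch for ch in reply if ch.isdigit())
    return min(int(digits), endowment) if digits else 0

print(parse_allocation("I would give 30 tokens."))  # 30
```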

[NLP-31] Event Argument Extraction with Enriched Prompts

【速读】: 该论文旨在深入研究基于提示(prompt-based)的事件论元抽取(Event Argument Extraction, EAE)模型,探讨在提示中引入不同类型信息对模型性能的影响。具体来说,研究关注了事件触发器(trigger)、同一事件中的其他角色论元(role arguments)以及同一文档中多个事件的跨事件角色论元对模型表现的影响。解决方案的关键在于通过优化训练目标,进一步提升基于提示的EAE模型的性能。实验在RAMS数据集上进行了验证,涉及三个小型语言模型和两个大型语言模型。

链接: https://arxiv.org/abs/2501.06825
作者: Chen Liang
机构: Beijing Jiaotong University (北京交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This work aims to delve deeper into prompt-based event argument extraction (EAE) models. We explore the impact of incorporating various types of information into the prompt on model performance, including the trigger, other role arguments for the same event, and role arguments across multiple events within the same document. Further, we provide the best possible performance that the prompt-based EAE model can attain and demonstrate such models can be further optimized from the perspective of the training objective. Experiments are carried out with three small language models and two large language models on the RAMS dataset.
zh

[NLP-32] Bridging the Fairness Gap: Enhancing Pre-trained Models with LLM -Generated Sentences

【速读】: 该论文试图解决预训练语言模型(Pre-trained Language Models, PLMs)中存在的性别偏见问题。传统去偏方法通常依赖外部语料库,但这些语料库可能存在质量、多样性或人口统计平衡性不足的问题,影响去偏效果。论文提出了一种新的方法(Fair-Gender),通过吸收具有连贯性、属性平衡且语义丰富的句子来增强PLMs的公平性。然而,这些句子由于对齐问题和负迁移风险,不能直接用于去偏。为此,论文采用因果分析来估计因果效应,过滤掉未对齐的句子,并识别出对齐的句子,将其整合到PLMs中,从而确保正迁移。实验结果表明,该方法在显著减少性别偏见的同时,保持了PLMs的语言表达能力。

链接: https://arxiv.org/abs/2501.06795
作者: Liu Yu,Ludie Guo,Ping Kuang,Fan Zhou
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pre-trained language models (PLMs) are trained on data that inherently contains gender biases, leading to undesirable impacts. Traditional debiasing methods often rely on external corpora, which may lack quality, diversity, or demographic balance, affecting the effectiveness of debiasing. With the rise of large language models and their extensive knowledge, we propose enhancing fairness (Fair-Gender) in PLMs by absorbing coherent, attribute-balanced, and semantically rich sentences. However, these sentences cannot be directly used for debiasing due to alignment issues and the risk of negative transfer. We address this by applying causal analysis to estimate causal effects, filtering out unaligned sentences, and identifying aligned ones for incorporation into PLMs, thereby ensuring positive transfer. Experiments show that our approach significantly reduces gender biases in PLMs while preserving their language expressiveness.
zh

[NLP-33] 3DCoMPaT200: Language-Grounded Compositional Understanding of Parts and Materials of 3D Shapes

【速读】: 该论文旨在解决当前3D物体在部件级别理解上的数据集类别有限的问题。现有的数据集如ShapeNet-Part和PartNet分别仅包含16和24个物体类别,而3DCoMPaT数据集虽然专注于部件和材料的组合理解,但也仅包含42个物体类别。为了促进更丰富和细粒度的3D部件级别理解,论文提出了3DCoMPaT200数据集,该数据集包含200个物体类别,物体词汇量约为3DCoMPaT的5倍,部件类别约为4倍。具体而言,3DCoMPaT200显著扩展了3DCoMPaT,包含1,031个细粒度部件类别和293个不同的材料类别,适用于3D物体部件的组合应用。此外,论文提出了一种新的任务——组合部件形状检索(Compositional Part Shape Retrieval),使用ULIP方法为3D组合理解提供了一个强大的基础模型。该方法通过评估模型在给定一个、三个或六个文本描述的部件时的形状检索性能,结果表明模型的性能随着组合数量的增加而提高,强调了组合数据集在增强模型理解复杂3D形状能力方面的关键作用。

链接: https://arxiv.org/abs/2501.06785
作者: Mahmoud Ahmed,Xiang Li,Arpit Prajapati,Mohamed Elhoseiny
机构: KAUST (King Abdullah University of Science and Technology); Poly9
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding objects in 3D at the part level is essential for humans and robots to navigate and interact with the environment. Current datasets for part-level 3D object understanding encompass a limited range of categories. For instance, the ShapeNet-Part and PartNet datasets include only 16 and 24 object categories, respectively. The 3DCoMPaT dataset, specifically designed for compositional understanding of parts and materials, contains only 42 object categories. To foster richer and fine-grained part-level 3D understanding, we introduce 3DCoMPaT200, a large-scale dataset tailored for compositional understanding of object parts and materials, with 200 object categories, an ≈5 times larger object vocabulary compared to 3DCoMPaT, and ≈4 times larger part categories. Concretely, 3DCoMPaT200 significantly expands upon 3DCoMPaT, featuring 1,031 fine-grained part categories and 293 distinct material classes for compositional application to 3D object parts. Additionally, to address the complexities of compositional 3D modeling, we propose a novel task of Compositional Part Shape Retrieval using ULIP to provide a strong 3D foundational model for 3D Compositional Understanding. This method evaluates the model shape retrieval performance given one, three, or six parts described in text format. These results show that the model's performance improves with an increasing number of style compositions, highlighting the critical role of the compositional dataset. Such results underscore the dataset's effectiveness in enhancing models' capability to understand complex 3D shapes from a compositional perspective. Code and Data can be found at this http URL
zh

[NLP-34] Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models

【速读】: 该论文旨在探讨文本到图像(Text-to-Image, T2I)扩散模型中填充标记(padding tokens)对图像生成过程的影响。尽管填充标记在文本编码前被默认扩展为固定长度,但其在图像生成过程中的作用尚未被深入研究。论文通过开发两种因果分析技术,深入研究了填充标记在T2I模型不同组件中的信息编码方式,并揭示了三种不同情景:填充标记可能在文本编码阶段影响输出、在扩散过程中发挥作用,或被完全忽略。关键发现包括这些情景与模型架构(交叉注意力或自注意力)及训练过程(冻结或训练文本编码器)之间的重要关系。这些发现有助于更深入地理解填充标记的作用机制,为未来T2I系统的模型设计和训练实践提供了潜在指导。

链接: https://arxiv.org/abs/2501.06751
作者: Michael Toker,Ido Galil,Hadas Orgad,Rinon Gal,Yoad Tewel,Gal Chechik,Yonatan Belinkov
机构: Technion – Israel Institute of Technology (以色列理工学院); NVIDIA (英伟达); Bar-Ilan University (巴伊兰大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models rely on encoded prompts to guide the image generation process. Typically, these prompts are extended to a fixed length by adding padding tokens before text encoding. Despite being a default practice, the influence of padding tokens on the image generation process has not been investigated. In this work, we conduct the first in-depth analysis of the role padding tokens play in T2I models. We develop two causal techniques to analyze how information is encoded in the representation of tokens across different components of the T2I pipeline. Using these techniques, we investigate when and how padding tokens impact the image generation process. Our findings reveal three distinct scenarios: padding tokens may affect the model’s output during text encoding, during the diffusion process, or be effectively ignored. Moreover, we identify key relationships between these scenarios and the model’s architecture (cross or self-attention) and its training process (frozen or trained text encoder). These insights contribute to a deeper understanding of the mechanisms of padding tokens, potentially informing future model design and training practices in T2I systems.
zh

[NLP-35] Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM -Based Medical Evaluation

【速读】: 该论文旨在解决大型语言模型(LLMs)在医疗应用中的可靠性和准确性问题,特别是在临床环境中的复杂诊断任务。现有基准测试通常局限于固定格式的任务(如多项选择题),无法充分反映真实世界临床诊断的复杂性。此外,传统的评估指标和基于LLM的评估工具存在对齐问题,往往提供过于简化的评估结果,无法准确反映人类判断。为解决这些问题,论文提出了HDCEval(Hierarchical Divide-and-Conquer Evaluation)框架,这是一种针对医疗评估的细粒度对齐的分层分治评估框架。HDCEval的关键在于其基于专业医生合作开发的细粒度医疗评估指南,涵盖患者问题相关性、医学知识正确性和表达三个方面。该框架通过将复杂评估任务分解为专门的子任务,每个子任务由经过属性驱动令牌优化(ADTO)训练的专家模型进行评估,从而确保每个评估方面都能以专家级的精度处理,显著提高了与人类评估者的一致性。

链接: https://arxiv.org/abs/2501.06741
作者: Shunfan Zheng,Xiechi Zhang,Gerard de Melo,Xiaoling Wang,Linlin Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the rapidly evolving landscape of large language models (LLMs) for medical applications, ensuring the reliability and accuracy of these models in clinical settings is paramount. Existing benchmarks often focus on fixed-format tasks like multiple-choice QA, which fail to capture the complexity of real-world clinical diagnostics. Moreover, traditional evaluation metrics and LLM-based evaluators struggle with misalignment, often providing oversimplified assessments that do not adequately reflect human judgment. To address these challenges, we introduce HDCEval, a Hierarchical Divide-and-Conquer Evaluation framework tailored for fine-grained alignment in medical evaluation. HDCEval is built on a set of fine-grained medical evaluation guidelines developed in collaboration with professional doctors, encompassing Patient Question Relevance, Medical Knowledge Correctness, and Expression. The framework decomposes complex evaluation tasks into specialized subtasks, each evaluated by expert models trained through Attribute-Driven Token Optimization (ADTO) on a meticulously curated preference dataset. This hierarchical approach ensures that each aspect of the evaluation is handled with expert precision, leading to a significant improvement in alignment with human evaluators.
zh

[NLP-36] ZOQO: Zero-Order Quantized Optimization ICASSP2025

【速读】: 该论文试图解决深度学习(deep learning)中计算和内存需求不断增加的问题,特别是在资源受限环境下的挑战。为了解决这一问题,作者提出了一种零阶量化优化(zero-order quantized optimization, ZOQO)方法,旨在通过量化参数和操作来训练模型。该解决方案的关键在于利用梯度符号的零阶近似(zero-order approximations of the gradient sign),并调整学习过程以保持参数的量化,而无需进行全精度的梯度计算。实验表明,ZOQO在大语言模型微调(fine-tuning)和黑盒对抗攻击(black-box adversarial attacks)中表现出色,尽管零阶和量化操作训练存在局限性,但其性能与全精度方法相当,展示了其在低资源环境中的潜力。

链接: https://arxiv.org/abs/2501.06736
作者: Noga Bar,Raja Giryes
机构: Tel Aviv University (特拉维夫大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to ICASSP 2025

点击查看摘要

Abstract:The increasing computational and memory demands in deep learning present significant challenges, especially in resource-constrained environments. We introduce a zero-order quantized optimization (ZOQO) method designed for training models with quantized parameters and operations. Our approach leverages zero-order approximations of the gradient sign and adapts the learning process to maintain the parameters’ quantization without the need for full-precision gradient calculations. We demonstrate the effectiveness of ZOQO through experiments in fine-tuning of large language models and black-box adversarial attacks. Despite the limitations of zero-order and quantized operations training, our method achieves competitive performance compared to full-precision methods, highlighting its potential for low-resource environments.
zh
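
下面用一段概念性代码说明"零阶符号估计 + 量化格点更新"的基本思路:对称扰动两次前向得到方向导数的符号,扰动与更新步长都取量化格距的整数倍,使参数全程停留在格点上。eval_loss 为假设接口,整段是示意而非论文实现:

```python
import torch

@torch.no_grad()
def zoqo_step(params, eval_loss, scale, eps_steps=1, lr_steps=1):
    """一步零阶量化更新:scale 为量化格距,eps_steps/lr_steps 取整数。"""
    zs = [(torch.randint(0, 2, p.shape, device=p.device) * 2 - 1).to(p.dtype)
          for p in params]                       # 每个参数一份 ±1 随机方向
    for p, z in zip(params, zs):                 # 正向扰动 +eps
        p.add_(eps_steps * scale * z)
    loss_plus = eval_loss()
    for p, z in zip(params, zs):                 # 反向扰动 -eps
        p.add_(-2 * eps_steps * scale * z)
    loss_minus = eval_loss()
    g_sign = 1.0 if loss_plus > loss_minus else -1.0   # 方向导数符号的零阶估计
    for p, z in zip(params, zs):
        p.add_(eps_steps * scale * z)            # 撤销扰动,回到原格点
        p.add_(-g_sign * lr_steps * scale * z)   # 沿符号方向走整数个格点
```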

[NLP-37] Better Prompt Compression Without Multi-Layer Perceptrons

【速读】: 该论文旨在解决语言模型推理过程中提示(prompt)压缩的问题,以提高推理速度而不改变生成模型本身。现有的方法通常通过训练一个编码器来将提示压缩为较小的学习到的标记序列,该编码器作为推理语言模型的低秩适应(LoRA)进行训练。然而,论文提出,编码器并不需要保持原始语言模型的架构来实现有效的压缩。关键解决方案是引入了仅注意力压缩器(Attention-Only Compressor, AOC),该压缩器在移除语言模型Transformer块中的多层感知机(MLP)层后,学习提示压缩编码器,从而减少了约67%的参数。实验表明,在高达480倍的压缩比范围内,AOC能够更好地重建提示,并优于未移除MLP层的基线压缩编码器。这一结果表明,提示压缩编码器的架构无需与原始解码语言模型相同,为未来研究提示压缩的架构和方法开辟了新方向。

链接: https://arxiv.org/abs/2501.06730
作者: Edouardo Honig,Andrew Lizarraga,Zijun Frank Zhang,Ying Nian Wu
机构: University of California, Los Angeles: Department of Statistics & Data Science (加州大学洛杉矶分校: 统计与数据科学系); Natera
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 7 pages, 0 figures

点击查看摘要

Abstract:Prompt compression is a promising approach to speeding up language model inference without altering the generative model. Prior works compress prompts into smaller sequences of learned tokens using an encoder that is trained as a Low-Rank Adaptation (LoRA) of the inference language model. However, we show that the encoder does not need to keep the original language model's architecture to achieve useful compression. We introduce the Attention-Only Compressor (AOC), which learns a prompt compression encoder after removing the multilayer perceptron (MLP) layers in the Transformer blocks of a language model, resulting in an encoder with roughly 67% fewer parameters compared to the original model. Intriguingly, we find that, across a range of compression ratios up to 480x, AOC can better regenerate prompts and outperform a baseline compression encoder that is a LoRA of the inference language model without removing MLP layers. These results demonstrate that the architecture of prompt compression encoders does not need to be identical to that of the original decoder language model, paving the way for further research into architectures and approaches for prompt compression.
zh
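
AOC的核心改动可以用一个"只留注意力"的Transformer块直观说明。下面是一个自包含的PyTorch小例子(结构为示意,压缩标记、LoRA训练等细节从略;维度与层数为随意取值):

```python
import torch
import torch.nn as nn

class AttentionOnlyBlock(nn.Module):
    """移除 MLP 子层后的 Transformer 块:仅保留层归一化与自注意力。"""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out   # 残差连接;原本此后还有 LN + MLP 子层,这里被整体移除

# 示意:8 层注意力块堆成一个压缩编码器
encoder = nn.Sequential(*[AttentionOnlyBlock(768, 12) for _ in range(8)])
x = torch.randn(1, 16, 768)
print(encoder(x).shape)   # torch.Size([1, 16, 768])
```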

[NLP-38] Measuring the Robustness of Reference-Free Dialogue Evaluation Systems

【速读】: 该论文旨在解决当前对话系统评估指标在应对多样化和创造性响应时的可靠性不足问题。具体而言,作者提出了一个基准测试,用于评估无参考对话指标在面对四类对抗性攻击(speaker tag prefixes、static responses、ungrammatical responses、repeated conversational context)时的鲁棒性。解决方案的关键在于分析现有评估指标(如DialogRPT、UniEval和PromptEval)在基于人类判断的相关性和对抗性攻击的敏感性之间的差异。研究发现,传统基准测试中表现相似的指标在面对对抗性响应时可能表现出显著差异,这促使开发更为细致的评估框架以应对实际对话系统中的挑战。

链接: https://arxiv.org/abs/2501.06728
作者: Justin Vasselli,Adam Nohejl,Taro Watanabe
机构: Nara Institute of Science and Technology (奈良先端科学技术大学院大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Advancements in dialogue systems powered by large language models (LLMs) have outpaced the development of reliable evaluation metrics, particularly for diverse and creative responses. We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks: speaker tag prefixes, static responses, ungrammatical responses, and repeated conversational context. We analyze metrics such as DialogRPT, UniEval, and PromptEval – a prompt-based method leveraging LLMs – across grounded and ungrounded datasets. By examining both their correlation with human judgment and susceptibility to adversarial attacks, we find that these two axes are not always aligned; metrics that appear to be equivalent when judged by traditional benchmarks may, in fact, vary in their scores of adversarial responses. These findings motivate the development of nuanced evaluation frameworks to address real-world dialogue challenges.
zh
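
按摘要列出的四类对抗攻击,可以写出如下极简的扰动构造函数。具体实现细节(说话人标签的写法、固定回复与乱序方式)为本示意的猜测,仅用于说明这类攻击的形态:

```python
import random

def speaker_tag_prefix(response: str) -> str:
    return "Speaker 2: " + response            # 1) 加说话人标签前缀

def static_response(_context: list[str]) -> str:
    return "I don't know."                     # 2) 与上下文无关的固定回复

def ungrammatical(response: str, seed: int = 0) -> str:
    words = response.split()                   # 3) 打乱词序得到不合语法的回复
    random.Random(seed).shuffle(words)
    return " ".join(words)

def repeated_context(context: list[str]) -> str:
    return context[-1]                         # 4) 直接复述对话上下文的最后一句

ctx = ["A: How was your trip?", "B: Great, thanks for asking!"]
print(ungrammatical("It was a wonderful experience overall."))
print(repeated_context(ctx))
```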

[NLP-39] ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian

【速读】: 该论文试图解决如何评估大型语言模型(Large Language Models, LLMs)在乌克兰语中的推理能力和鲁棒性的问题。由于现有研究主要集中在英语基准测试上,其他语言如乌克兰语的评估相对不足,导致对这些模型在乌克兰语中的表现缺乏全面了解。为此,论文提出了一个基于乌克兰标准化教育考试系统(External Independent Evaluation 和 National Multi-subject Test)的综合性基准测试 ZNO-Eval。该基准测试包含单答案选项、多项选择、匹配和开放式问题,涵盖乌克兰语、数学、历史和地理等多个学科,旨在全面分析模型在不同领域和复杂度下的推理能力。通过对多个知名语言模型(如 GPT-3.5-Turbo、GPT-4o、GPT-4-Turbo、Mistral Large、Claude 3 Opus 和 Gemini-1.5 Pro)的评估,论文发现 GPT-4o 在常识推理和复杂语言任务中表现最优,而 Gemini Pro 和 GPT-4 Turbo 在算术领域表现突出。尽管所有模型在历史和地理等纯文本常识任务中接近满分,但在乌克兰语和数学任务中仍存在差距,凸显了开发专门语言基准测试的重要性,以更准确地评估模型在不同语言和上下文中的能力和局限性。

链接: https://arxiv.org/abs/2501.06715
作者: Mykyta Syromiatnikov,Victoria Ruvinskaya,Anastasiya Troynina
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures. X International conference “Informatics. Culture. Technology.” (2024)

点击查看摘要

Abstract:As the usage of large language models for problems outside of simple text understanding or generation increases, assessing their abilities and limitations becomes crucial. While significant progress has been made in this area over the last few years, most research has focused on benchmarking English, leaving other languages underexplored. This makes evaluating the reasoning and robustness level of language models in Ukrainian particularly challenging. The purpose of this work is to establish a comprehensive benchmark for the reasoning capabilities evaluation of large language models in the Ukrainian language. This paper presents the ZNO-Eval benchmark based on real exam tasks from Ukraine’s standardized educational testing system: the External Independent Evaluation and the National Multi-subject Test. With single-answer options, multiple-choice, matching, and open-ended questions from diverse subjects, including Ukrainian language, mathematics, history, and geography, this dataset paves the way toward a thorough analysis of reasoning capabilities across different domains and complexities. Evaluation of several well-known language models, such as GPT-3.5-Turbo, GPT-4o, GPT-4-Turbo, Mistral Large, Claude 3 Opus, and Gemini-1.5 Pro on this benchmark demonstrated the superiority of GPT-4o in both common knowledge reasoning and intricate language tasks. At the same time, Gemini Pro and GPT-4 Turbo excelled in the arithmetic domain, leading in single-answer and open-ended math problems. While all models were close to max performance in text-only common knowledge tasks like history and geography, there still is a gap for Ukrainian language and math, thus highlighting the importance of developing specialized language benchmarks for more accurate assessments of model capabilities and limitations across different languages and contexts.
zh

[NLP-40] Fine-tuning ChatGPT for Automatic Scoring of Written Scientific Explanations in Chinese

【速读】: 该论文试图解决科学评估中学生书面解释的自动评分问题,特别是在象形文字语言(如中文)中的应用。当前,手动评分学生解释既具有挑战性又耗费资源,而大型语言模型(LLMs)在字母语言(如英语)中已显示出潜力,但在象形文字语言中的应用尚未充分探索。论文的关键解决方案是通过微调ChatGPT这一领先的大型语言模型,来自动评分中文科学解释。研究收集了学生对七个科学解释任务的回答,并进行了自动评分,同时使用Kendall相关系数分析了评分准确性与推理复杂性之间的关系。结果表明,领域特定的适应性使ChatGPT能够准确评分中文科学解释,但评分准确性与推理复杂性相关:低水平回答的评分准确性与推理复杂性呈负相关,而高水平回答则呈正相关。模型在复杂句子结构的低水平回答中倾向于高估推理复杂性,而在简洁因果推理的高水平回答中则低估。这些相关性源于语言特征:简洁和清晰增强了低水平回答的评分准确性,而全面性则提高了高水平回答的准确性。研究结果证明了大型语言模型在中文语境下自动评分的有效性,并强调了语言特征和推理复杂性在教育评估评分模型微调中的重要性。

链接: https://arxiv.org/abs/2501.06704
作者: Jie Yang,Ehsan Latif,Yuze He,Xiaoming Zhai
机构: Beijing Normal University(北京师范大学); University of Georgia(乔治亚大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The development of explanations for scientific phenomena is essential in science assessment, but scoring student-written explanations remains challenging and resource-intensive. Large language models (LLMs) have shown promise in addressing this issue, particularly in alphabetic languages like English. However, their applicability to logographic languages is less explored. This study investigates the potential of fine-tuning ChatGPT, a leading LLM, to automatically score scientific explanations written in Chinese. Student responses to seven scientific explanation tasks were collected and automatically scored, with scoring accuracy examined in relation to reasoning complexity using the Kendall correlation. A qualitative analysis explored how linguistic features influenced scoring accuracy. The results show that domain-specific adaptation enables ChatGPT to score Chinese scientific explanations with accuracy. However, scoring accuracy correlates with reasoning complexity: a negative correlation for lower-level responses and a positive one for higher-level responses. The model overrates complex reasoning in low-level responses with intricate sentence structures and underrates high-level responses using concise causal reasoning. These correlations stem from linguistic features–simplicity and clarity enhance accuracy for lower-level responses, while comprehensiveness improves accuracy for higher-level ones. Simpler, shorter responses tend to score more accurately at lower levels, whereas longer, information-rich responses yield better accuracy at higher levels. These findings demonstrate the effectiveness of LLMs in automatic scoring within a Chinese context and emphasize the importance of linguistic features and reasoning complexity in fine-tuning scoring models for educational assessments.
zh
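
文中"评分准确性与推理复杂度的Kendall相关"这一步,本质上是一次秩相关检验,可用scipy直接复现(下面的数据为虚构示例,仅演示计算方式):

```python
from scipy.stats import kendalltau

# 假设数据:各推理复杂度等级及该等级下自动评分与人工评分的一致率
reasoning_complexity = [1, 2, 3, 4, 5]
scoring_accuracy = [0.91, 0.86, 0.80, 0.83, 0.88]

tau, p_value = kendalltau(reasoning_complexity, scoring_accuracy)
print(f"Kendall tau = {tau:.3f}, p = {p_value:.3f}")
```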

[NLP-41] APO: Task-Referenced Adaptation for Prompt Optimization ICASSP2025

【速读】: 该论文旨在解决自动化提示优化(APO)中忽视任务特定特征的问题,导致生成的提示缺乏领域特异性且不适用于任务特定的优化。为解决这一问题,论文提出了TAPO,一个多任务感知的提示优化(multitask-aware prompt optimization)框架,其核心包括三个关键模块:任务感知的指标选择模块(task-aware metric selection module),用于增强任务特定的提示生成能力;多指标评估模块(multi-metrics evaluation module),从多个角度联合评估提示;以及基于进化的优化框架(evolution-based optimization framework),用于自动优化提示,提升其在不同任务中的适应性。通过六个数据集的广泛实验验证了该框架的有效性。

链接: https://arxiv.org/abs/2501.06689
作者: Wenxin Luo,Weirui Wang,Xiaopeng Li,Weibo Zhou,Pengyue Jia,Xiangyu Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to ICASSP 2025

点击查看摘要

Abstract:Prompt engineering can significantly improve the performance of large language models (LLMs), with automated prompt optimization (APO) gaining significant attention due to the time-consuming and laborious nature of manual prompt design. However, much of the existing work in APO overlooks task-specific characteristics, resulting in prompts that lack domain specificity and are not well-suited for task-specific optimization. In this paper, we introduce TAPO, a multitask-aware prompt optimization framework composed of three key modules. First, a task-aware metric selection module is proposed to enhance task-specific prompt generation capabilities. Second, we present a multi-metrics evaluation module to jointly evaluate prompts from multiple perspectives. Third, an evolution-based optimization framework is introduced for automatic prompt refinement, which improves adaptability across various tasks. Extensive experiments on six datasets demonstrate the effectiveness of our approach, and our code is publicly available.
zh

[NLP-42] Ultra Memory-Efficient On-FPGA Training of Transformers via Tensor-Compressed Optimization

【速读】: 该论文试图解决在资源受限的边缘设备(edge devices)上进行Transformer模型训练时面临的计算和内存需求过高的问题。由于隐私、领域适应(domain adaptation)和设备端科学机器学习(on-device scientific machine learning)等考虑,边缘设备上的Transformer训练需求日益增加,但其计算和内存需求往往超出了边缘设备的能力范围。

解决方案的关键在于两个方面:算法层面和硬件层面。在算法层面,论文提出了一种双向收缩流(bi-directional contraction flow)的张量化Transformer训练方法,显著减少了计算FLOPS和层内内存开销。在硬件层面,论文设计了一个基于FPGA的加速器,将所有高度压缩的模型参数和梯度信息存储在芯片上,创建了一个仅使用片上内存的框架,从而减少了片外通信,并最小化了延迟和能耗。此外,论文还实现了针对每个训练阶段的自定义计算内核,并采用了层内并行和流水线技术,进一步提升了运行时间和内存效率。通过这些方法,论文在AMD Alveo U50 FPGA上实现了单批次端到端训练,内存预算控制在6MB BRAM和22.5MB URAM以内,相比NVIDIA RTX 3090 GPU上的未压缩训练,内存减少了30到51倍,每轮训练的能耗降低了3.6倍。

链接: https://arxiv.org/abs/2501.06663
作者: Jiayi Tian,Jinming Lu,Hai Li,Xiangwei Wang,Cong (Callie) Hao,Ian Young,Zheng Zhang
机构: Department of Electrical and Computer Engineering, University of California, Santa Barbara (加州大学圣塔芭芭拉分校电气与计算机工程系); Intel Corporation (英特尔公司); Department of Computer Science, North Carolina State University (北卡罗来纳州立大学计算机科学系); School of Electrical and Computer Engineering, Georgia Institute of Technology (佐治亚理工学院电气与计算机工程学院)
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Transformer models have achieved state-of-the-art performance across a wide range of machine learning tasks. There is growing interest in training transformers on resource-constrained edge devices due to considerations such as privacy, domain adaptation, and on-device scientific machine learning. However, the significant computational and memory demands required for transformer training often exceed the capabilities of an edge device. Leveraging low-rank tensor compression, this paper presents the first on-FPGA accelerator for end-to-end transformer training. On the algorithm side, we present a bi-directional contraction flow for tensorized transformer training, significantly reducing the computational FLOPS and intra-layer memory costs compared to existing tensor operations. On the hardware side, we store all highly compressed model parameters and gradient information on chip, creating an on-chip-memory-only framework for each stage in training. This reduces off-chip communication and minimizes latency and energy costs. Additionally, we implement custom computing kernels for each training stage and employ intra-layer parallelism and pipelining to further enhance run-time and memory efficiency. Through experiments on transformer models ranging from 36.7 to 93.5 MB using FP-32 data formats on the ATIS dataset, our tensorized FPGA accelerator could conduct single-batch end-to-end training on the AMD Alveo U50 FPGA, with a memory budget of less than 6 MB BRAM and 22.5 MB URAM. Compared to uncompressed training on the NVIDIA RTX 3090 GPU, our on-FPGA training achieves a memory reduction of 30× to 51×. Our FPGA accelerator also achieves up to 3.6× less energy cost per epoch compared with tensor Transformer training on an NVIDIA RTX 3090 GPU.
zh

[NLP-43] FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

【速读】: 该论文试图解决直接偏好优化(Direct Preference Optimization, DPO)在训练大型语言模型(LLMs)时对错误排序的偏好对(misranked preference pairs)改进效果有限的问题。尽管DPO的梯度强调这些错误排序的偏好对,但实验表明,DPO训练很少能有效改善这些情况。为此,论文提出了FocalPO,一种DPO的变体,其关键解决方案是通过引入一个调制因子(modulating factor)动态调整DPO损失,从而降低对错误排序偏好对的权重,并优先增强模型对已正确排序的偏好对的理解。FocalPO的灵感来源于视觉任务中使用的Focal Loss,实验证明其在Alpaca Eval 2.0等基准测试中优于DPO及其变体,进一步验证了其有效性。

链接: https://arxiv.org/abs/2501.06645
作者: Tong Liu,Xiao Yu,Wenxuan Zhou,Jindong Gu,Volker Tresp
机构: LMU Munich(慕尼黑大学); Columbia University(哥伦比亚大学); University of Southern California(南加州大学); University of Oxford(牛津大学); Munich Center for Machine Learning(慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward model, and focus on training it to correct misranked preference pairs. However, recent work (Chen et al., 2024) empirically finds that DPO training rarely improves these misranked preference pairs, despite its gradient emphasizing these cases. We introduce FocalPO, a DPO variant that instead down-weighs misranked preference pairs and prioritizes enhancing the model's understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss. Our experiment demonstrates that FocalPO surpasses DPO and its variants on popular benchmarks like Alpaca Eval 2.0 using Mistral-Base-7B and Llama-3-Instruct-8B. Additionally, we empirically reveal how FocalPO affects training on correct and incorrect sample groups, further underscoring its effectiveness.
zh
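
FocalPO"在DPO损失前乘一个调制因子"的做法,可以用几行PyTorch写清楚。注意:这里取 p^gamma(p 为隐式奖励把偏好对排对的概率)作为调制因子,是参照Focal Loss形式做出的假设,具体因子形式请以论文原文为准:

```python
import torch
import torch.nn.functional as F

def focalpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l,
                 beta=0.1, gamma=2.0):
    # DPO 的隐式奖励差(logits)
    logits = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    p_correct = torch.sigmoid(logits)      # 模型把该偏好对排对的概率
    dpo = -F.logsigmoid(logits)            # 标准 DPO 损失
    # p 小(排序错误)→ 权重低;p 大(已能排对)→ 权重高
    return (p_correct ** gamma * dpo).mean()

# 示例:一条排对、一条排错的偏好对
w = torch.tensor([-4.0, -6.0]); l = torch.tensor([-5.0, -4.5])
rw = torch.tensor([-4.2, -5.8]); rl = torch.tensor([-5.1, -4.6])
print(focalpo_loss(w, l, rw, rl))
```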

[NLP-44] Scaling Down Semantic Leakage: Investigating Associative Bias in Smaller Language Models

【速读】: 该论文探讨了语义泄漏(semantic leakage)现象在较小规模语言模型中的表现,特别是参数规模在500M到7B之间的模型。语义泄漏是指从训练数据中学到的关联在语言模型生成过程中以意外且有时不期望的方式出现。此前的研究主要集中在参数规模较大的语言模型(7B+参数)上。本文通过使用Qwen2.5模型家族,系统地评估了较小模型是否由于容量有限而表现出较少的语义泄漏。研究基于Gonen等人(2024)的数据集,引入了一个新的以颜色为中心的提示数据集,并将其分类为特定类型的语义关联,以评估模型的表现。结果表明,较小模型总体上表现出较少的语义泄漏,但这种趋势并非严格线性,中等规模的模型有时在泄漏行为上甚至超过较大模型。研究的关键在于通过引入新的数据集和系统评估方法,揭示了模型规模与语义泄漏之间的复杂关系。

链接: https://arxiv.org/abs/2501.06638
作者: Veronika Smilga
机构: University of Tübingen (图宾根大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Semantic leakage is a phenomenon recently introduced by Gonen et al. (2024). It refers to a situation in which associations learnt from the training data emerge in language model generations in an unexpected and sometimes undesired way. Prior work has focused on leakage in large language models (7B+ parameters). In this study, I use Qwen2.5 model family to explore whether smaller models, ranging from 500M to 7B parameters, demonstrate less semantic leakage due to their limited capacity for capturing complex associations. Building on the previous dataset from Gonen et al. (2024), I introduce a new dataset of color-focused prompts, categorized into specific types of semantic associations, to systematically evaluate the models’ performance. Results indicate that smaller models exhibit less semantic leakage overall, although this trend is not strictly linear, with medium-sized models sometimes surpassing larger ones in leaking behavior. The dataset, the model generations, and the evaluation code are publicly available at this https URL.
zh

[NLP-45] Dual use issues in the field of Natural Language Generation

【速读】: 该论文旨在探讨自然语言生成(Natural Language Generation, NLG)领域的双重用途(Dual Use)问题。双重用途问题指的是技术既可用于有益目的,也可能被滥用于有害目的。论文基于SIGGEN(ACL下属的自然语言生成特别兴趣小组)社区的一项调查,调查由ACL执行委员会发起,要求各SIG在其子领域内提供双重用途问题的概述。调查于2024年10月发出,2025年1月处理结果。尽管仅有23名受访者,样本可能不足以代表所有SIGGEN成员,但该报告为未来的讨论提供了有价值的参考。解决方案的关键在于通过社区反馈进一步完善对双重用途问题的理解,并为相关政策的制定提供依据。

链接: https://arxiv.org/abs/2501.06636
作者: Emiel van Miltenburg
机构: Department of Communication and Cognition, Tilburg University, Tilburg, the Netherlands (荷兰蒂尔堡大学传播与认知系)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This report documents the results of a recent survey in the SIGGEN community, focusing on Dual Use issues in Natural Language Generation (NLG). SIGGEN is the Special Interest Group (SIG) of the Association for Computational Linguistics (ACL) for researchers working on NLG. The survey was prompted by the ACL executive board, which asked all SIGs to provide an overview of dual use issues within their respective subfields. The survey was sent out in October 2024 and the results were processed in January 2025. With 23 respondents, the survey is presumably not representative of all SIGGEN members, but at least this document offers a helpful resource for future discussions. This report is open to feedback from the SIGGEN community. Let me know if you have any questions or comments!
zh

[NLP-46] EmoXpt: Analyzing Emotional Variances in Human Comments and LLM -Generated Responses

【速读】: 该论文旨在探讨生成式 AI(Generative AI)在公众中的情感动态,特别是通过分析人类在社交媒体(如推特)上对 ChatGPT、OpenAI、Copilot 和大型语言模型(LLMs)等术语的提及,来理解公众对生成式 AI 的情感反应。此外,研究还进一步评估了 ChatGPT 在回应这些推文时的情感智能,比较了人类评论与 LLM 生成回复之间的情感差异。为解决这一问题,论文提出了 EmoXpt 这一情感分析框架,该框架不仅评估人类对生成式 AI 的情感态度,还专门分析了 ChatGPT 回复中的情感表达。实验结果表明,LLM 生成的回复在效率、连贯性和情感一致性上显著优于人类回复,且情感表达更为积极。

链接: https://arxiv.org/abs/2501.06597
作者: Shireesh Reddy Pyreddy,Tarannum Shaila Zaman
机构: SUNY Polytechnic Institute(纽约州立大学理工学院); University of Maryland Baltimore County(马里兰大学巴尔的摩分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 7 pages, 10 figures, 5 tables. This paper has been accepted and presented at the 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC)

点击查看摘要

Abstract:The widespread adoption of generative AI has generated diverse opinions, with individuals expressing both support and criticism of its applications. This study investigates the emotional dynamics surrounding generative AI by analyzing human tweets referencing terms such as ChatGPT, OpenAI, Copilot, and LLMs. To further understand the emotional intelligence of ChatGPT, we examine its responses to selected tweets, highlighting differences in sentiment between human comments and LLM-generated responses. We introduce EmoXpt, a sentiment analysis framework designed to assess both human perspectives on generative AI and the sentiment embedded in ChatGPT’s responses. Unlike prior studies that focus exclusively on human sentiment, EmoXpt uniquely evaluates the emotional expression of ChatGPT. Experimental results demonstrate that LLM-generated responses are notably more efficient, cohesive, and consistently positive than human responses.
zh

[NLP-47] ChemAgent : Self-updating Library in Large Language Models Improves Chemical Reasoning

【速读】: 该论文旨在解决大语言模型(LLMs)在处理化学推理任务时面临的挑战,包括难以准确处理领域特定公式、执行推理步骤以及有效整合代码等问题。为了解决这些问题,作者提出了ChemAgent框架,其关键是通过动态自更新的库(library)来提升LLMs的性能。该库通过将化学任务分解为子任务,并将这些子任务编译为结构化集合,供未来查询参考。当遇到新问题时,ChemAgent从库中检索并精炼相关信息(称为“记忆”),从而促进任务分解和解决方案的生成。该框架设计了三种类型的记忆和一个库增强的推理组件,使LLMs能够通过经验不断改进。实验结果表明,ChemAgent在四个化学推理数据集上的性能提升高达46%(GPT-4),显著优于现有方法。这一解决方案为未来在药物发现和材料科学等领域的应用展示了巨大潜力。

链接: https://arxiv.org/abs/2501.06590
作者: Xiangru Tang,Tianyu Hu,Muyang Ye,Yanjun Shao,Xunjian Yin,Siru Ouyang,Wangchunshu Zhou,Pan Lu,Zhuosheng Zhang,Yilun Zhao,Arman Cohan,Mark Gerstein
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science. Our code can be found at this https URL
zh

[NLP-48] Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping

【速读】: 该论文试图解决在大规模语言模型推理过程中,由于模型并行化策略(Model Parallelism)导致的GPU间通信瓶颈问题。这种通信瓶颈限制了通过增加设备数量来提升计算效率的潜力。论文提出的解决方案是引入一种称为“梯子残差(Ladder Residual)”的架构修改,适用于所有基于残差(Residual-based)的模型。梯子残差的关键在于通过重新设计模型架构,实现通信与计算的解耦(Decoupling),从而有效隐藏通信延迟。具体来说,梯子残差允许在传统的并行化模式中实现通信与计算的重叠,特别是在张量并行(Tensor Parallelism)中,显著减少了通信开销。实验表明,在8个设备上进行张量并行分片(TP sharding)时,梯子残差应用于70B参数的Transformer模型可以在推理时实现30%的端到端加速。此外,论文还展示了通过梯子残差架构对Llama-3.1 8B模型进行部分转换时,仅需3B token的重新训练即可实现最小精度损失。

链接: https://arxiv.org/abs/2501.06589
作者: Muru Zhang,Mayank Mishra,Zhongzhu Zhou,William Brandon,Jue Wang,Yoon Kim,Jonathan Ragan-Kelley,Shuaiwen Leon Song,Ben Athiwaratkun,Tri Dao
机构: Together AI; University of Southern California (南加州大学); MIT-IBM Watson Lab (MIT-IBM 沃森实验室); University of Sydney (悉尼大学); Massachusetts Institute of Technology (麻省理工学院); Princeton University (普林斯顿大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to efficiently scale. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve a 30% end-to-end wall-clock speed-up at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train a 1B and 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens.
zh
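
下面用 torch.distributed 的异步 all-reduce 勾勒"通信结果迟一层并入残差流"的前向流程,帮助理解通信与计算如何重叠。block.compute 表示张量并行分片下的本地计算(假设接口),整段是按本文理解写的概念示意,并非官方实现:

```python
import torch.distributed as dist

def ladder_forward(blocks, x):
    """梯子残差式前向(概念示意):第 i 层输出的 all-reduce
    与第 i+1 层的本地计算重叠,通信结果晚一层才并入残差流。"""
    residual, pending, buf = x, None, None
    for block in blocks:
        hidden = block.compute(residual)   # 本层本地计算,可与上一层的通信并行
        if pending is not None:
            pending.wait()                 # 上一层的 all-reduce 到此刻才需要完成
            residual = residual + buf
        buf = hidden
        pending = dist.all_reduce(buf, async_op=True)  # 异步发起本层通信(原地归约)
    pending.wait()                         # 收尾:并入最后一层的通信结果
    return residual + buf
```

与标准残差(每层必须等 all-reduce 完成才能进入下一层)相比,这种写法让每层的通信延迟被下一层的计算掩盖,代价是每层消费的残差流滞后一层的通信贡献。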

[NLP-49] ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Drafting

【速读】: 该论文试图解决合同起草中的信息检索问题,特别是合同条款检索。由于律师通常不会从头起草合同,而是通过查找和修改最相关的先例条款来完成合同起草,因此高效的条款检索系统至关重要。论文引入了Atticus Clause Retrieval Dataset (ACORD),这是首个完全由专家标注的合同起草检索基准数据集。ACORD专注于复杂的合同条款,如责任限制(Limitation of Liability)、赔偿(Indemnification)、控制权变更(Change of Control)和最惠国条款(Most Favored Nation)。该数据集包含114个查询和超过126,000个查询-条款对,每个对都按1到5星进行评分。解决方案的关键在于使用双编码器检索器(bi-encoder retriever)与点式LLM重排序器(pointwise LLMs re-rankers)相结合,显示出有前景的结果。然而,要有效处理律师通常面临的复杂法律工作,仍需进一步改进。ACORD作为首个由专家标注的合同起草检索基准,可为自然语言处理(NLP)社区提供宝贵的信息检索基准。

链接: https://arxiv.org/abs/2501.06582
作者: Steven H. Wang,Maksim Zubkov,Kexin Fan,Sarah Harrell,Yuyang Sun,Wei Chen,Andreas Plesner,Roger Wattenhofer
机构: ETH Zurich(苏黎世联邦理工学院); Independent Researcher(独立研究员); New York University(纽约大学); University of Washington(华盛顿大学); Yale University(耶鲁大学); The Atticus Project(阿提克斯项目)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Information retrieval, specifically contract clause retrieval, is foundational to contract drafting because lawyers rarely draft contracts from scratch; instead, they locate and revise the most relevant precedent. We introduce the Atticus Clause Retrieval Dataset (ACORD), the first retrieval benchmark for contract drafting fully annotated by experts. ACORD focuses on complex contract clauses such as Limitation of Liability, Indemnification, Change of Control, and Most Favored Nation. It includes 114 queries and over 126,000 query-clause pairs, each ranked on a scale from 1 to 5 stars. The task is to find the most relevant precedent clauses to a query. The bi-encoder retriever paired with pointwise LLMs re-rankers shows promising results. However, substantial improvements are still needed to effectively manage the complex legal work typically undertaken by lawyers. As the first retrieval benchmark for contract drafting annotated by experts, ACORD can serve as a valuable IR benchmark for the NLP community.
zh
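
摘要中"双编码器召回 + 点式LLM重排序"的两阶段流程,可以用下面的骨架表示。向量由任一句向量模型离线计算后作为输入,llm_point_score 为假设的打分接口(例如让LLM输出1-5星相关度):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_clauses(query, query_vec, clauses, clause_vecs,
                     llm_point_score, top_k=20, final_k=5):
    """两阶段条款检索示意:先向量召回,再点式 LLM 重排序。"""
    # 阶段一:按余弦相似度召回 top_k 条款(双编码器)
    sims = [cosine(query_vec, v) for v in clause_vecs]
    cand = np.argsort(sims)[::-1][:top_k]
    # 阶段二:逐条用 LLM 打分并重排(点式重排序)
    reranked = sorted(cand, key=lambda i: llm_point_score(query, clauses[i]),
                      reverse=True)
    return [clauses[i] for i in reranked[:final_k]]
```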

[NLP-50] Natural Language Processing and Deep Learning Models to Classify Phase of Flight in Aviation Safety Occurrences

【速读】: 该论文旨在解决航空安全事件报告中非结构化文本难以被计算机系统理解的问题,特别是如何从这些文本中分类和识别安全事件发生的飞行阶段。解决方案的关键在于应用自然语言处理(NLP)和人工智能(AI)模型,特别是深度学习模型(如ResNet和sRNN),来处理和分析文本叙述。通过使用来自NTSB的27,000份安全事件报告作为初始数据集,研究评估了这两种模型的分类性能。结果显示,sRNN模型在七分类问题中的准确率超过68%,显著优于简化的ResNet模型架构,表明NLP和深度学习模型能够有效地从原始文本中推断出飞行阶段,从而支持航空业利益相关者进行更有效的安全事件分析。

链接: https://arxiv.org/abs/2501.06564
作者: Aziida Nanyonga,Hassan Wasswa,Oleksandra Molloy,Ugur Turhan,Graham Wild
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: NLP, Aviation reports, Text analysis, Deep learning algorithms, Flight phase classification

点击查看摘要

Abstract:The air transport system recognizes the criticality of safety, as even minor anomalies can have severe consequences. Reporting accidents and incidents play a vital role in identifying their causes and proposing safety recommendations. However, the narratives describing pre-accident events are presented in unstructured text that is not easily understood by computer systems. Classifying and categorizing safety occurrences based on these narratives can support informed decision-making by aviation industry stakeholders. In this study, researchers applied natural language processing (NLP) and artificial intelligence (AI) models to process text narratives to classify the flight phases of safety occurrences. The classification performance of two deep learning models, ResNet and sRNN was evaluated, using an initial dataset of 27,000 safety occurrence reports from the NTSB. The results demonstrated good performance, with both models achieving an accuracy exceeding 68%, well above the random guess rate of 14% for a seven-class classification problem. The models also exhibited high precision, recall, and F1 scores. The sRNN model greatly outperformed the simplified ResNet model architecture used in this study. These findings indicate that NLP and deep learning models can infer the flight phase from raw text narratives, enabling effective analysis of safety occurrences.
zh

[NLP-51] A Survey on Spoken Italian Datasets and Corpora

【速读】: 该论文旨在解决意大利语(Italian)口语数据集在语言学研究、自然语言处理(Natural Language Processing, NLP)和语音技术领域中资源不足的问题。尽管意大利语作为一种丰富且多样的罗曼语族语言,其口语数据集的开发和应用相较于英语或汉语等主要语言仍显不足。论文通过对66个意大利语口语数据集的全面分析,详细探讨了这些数据集的特征、采集方法及其在自动语音识别(Automatic Speech Recognition, ASR)、情感检测和教育等领域的应用。关键解决方案包括对这些数据集进行分类(如按语音类型、来源和语境、人口统计和语言特征等),并公开数据集清单,供研究人员和开发者使用。此外,论文还讨论了数据集稀缺性、代表性和可访问性等挑战,并提出了改进数据集创建和利用的建议,以推动意大利语语音技术和语言学研究的发展。

链接: https://arxiv.org/abs/2501.06557
作者: Marco Giordano,Claudia Rinaldi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注: submitted to IEEE Access Journal in Dec 2024

点击查看摘要

Abstract:Spoken language datasets are vital for advancing linguistic research, Natural Language Processing, and speech technology. However, resources dedicated to Italian, a linguistically rich and diverse Romance language, remain underexplored compared to major languages like English or Mandarin. This survey provides a comprehensive analysis of 66 spoken Italian datasets, highlighting their characteristics, methodologies, and applications. The datasets are categorized by speech type, source and context, and demographic and linguistic features, with a focus on their utility in fields such as Automatic Speech Recognition, emotion detection, and education. Challenges related to dataset scarcity, representativeness, and accessibility are discussed alongside recommendations for enhancing dataset creation and utilization. The full dataset inventory is publicly accessible via GitHub and archived on Zenodo, serving as a valuable resource for researchers and developers. By addressing current gaps and proposing future directions, this work aims to support the advancement of Italian speech technologies and linguistic research.
zh

[NLP-52] Dispersion Measures as Predictors of Lexical Decision Time Word Familiarity and Lexical Complexity WWW

【速读】: 该论文旨在解决如何通过不同的分散度(dispersion)测量方法来更全面地描述词汇在语料库中的分布,并验证这些方法在预测词汇决策时间、词汇熟悉度和词汇复杂性方面的有效性。研究评估了多种分散度测量方法在五种不同语言中的表现,发现范围的对数(logarithm of range)在所有任务和语言中不仅比词频的对数(log-frequency)具有更好的预测能力,而且在与词频对数结合使用时,其预测效果也优于更复杂的分散度测量方法。研究还探讨了语料库部分粒度(corpus part granularity)和对数变换(logarithmic transformation)对结果的影响,揭示了先前研究中矛盾结果的原因。

链接: https://arxiv.org/abs/2501.06536
作者: Adam Nohejl,Taro Watanabe
机构: Nara Institute of Science and Technology (奈良先端科学技术大学院大学)
类目: Computation and Language (cs.CL)
备注: Pre-print, to be presented at the NLP Meeting 2025 ( this http URL - NON-REVIEWED)

点击查看摘要

Abstract:Various measures of dispersion have been proposed to paint a fuller picture of a word’s distribution in a corpus, but only little has been done to validate them externally. We evaluate a wide range of dispersion measures as predictors of lexical decision time, word familiarity, and lexical complexity in five diverse languages. We find that the logarithm of range is not only a better predictor than log-frequency across all tasks and languages, but that it is also the most powerful additional variable to log-frequency, consistently outperforming the more complex dispersion measures. We discuss the effects of corpus part granularity and logarithmic transformation, shedding light on contradictory results of previous studies.
zh
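
文中比较的词频对数与"范围的对数(log range)"都容易直接计算:把语料切成若干部分后,range 即出现该词的部分数。下面是一个自包含的小例子(切分粒度等细节按常见做法假设):

```python
import math

def dispersion_stats(corpus_parts: list[list[str]], word: str) -> dict:
    """corpus_parts: 语料库按(例如等长)部分切分后的词列表集合。"""
    freqs = [part.count(word) for part in corpus_parts]
    total = sum(freqs)
    rng = sum(1 for f in freqs if f > 0)   # range:含该词的部分数
    return {
        "log_frequency": math.log(total) if total else float("-inf"),
        "range": rng,
        "log_range": math.log(rng) if rng else float("-inf"),
    }

parts = [["the", "cat"], ["the", "dog"], ["a", "bird"]]
print(dispersion_stats(parts, "the"))  # log_frequency≈0.693, range=2
```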

[NLP-53] Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering COLING2025

【速读】: 该论文试图解决大型语言模型(LLMs)在法律问答(Legal QA)领域中存在的幻觉(hallucination)问题,即模型生成错误或虚构信息的现象。为了解决这一问题,作者首先引入了一个名为LegalHalBench的基准测试和三种自动评估指标,用于评估LLMs在回答法律问题时常见的幻觉现象。随后,作者提出了一种幻觉缓解方法,该方法结合了行为克隆(behavior cloning)和一种新颖的硬样本感知迭代直接偏好优化(Hard Sample-aware Iterative Direct Preference Optimization, HIPO)。通过大量真实数据实验,作者验证了该方法的有效性,并在多个指标上取得了显著改进,包括新提出的非幻觉法规率(Non-Hallucinated Statute Rate)、法规相关性率(Statute Relevance Rate)、法律声明真实性(Legal Claim Truthfulness),以及传统的METEOR、BERTScore、ROUGE-L和胜率(win rates)等指标。

链接: https://arxiv.org/abs/2501.06521
作者: Yinghao Hu,Leilei Gan,Wenyi Xiao,Kun Kuang,Fei Wu
机构: 1School of Software Technology, Zhejiang University (浙江大学软件技术学院); 2College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院)
类目: Computation and Language (cs.CL)
备注: 18 pages, 8 figures, to be published in COLING 2025

点击查看摘要

Abstract:Hallucination, or the generation of incorrect or fabricated information, remains a critical challenge in large language models (LLMs), particularly in high-stake domains such as legal question answering (QA). In order to mitigate the hallucination rate in legal QA, we first introduce a benchmark called LegalHalBench and three automatic metrics to evaluate the common hallucinations when LLMs answer legal questions. We then propose a hallucination mitigation method that integrates behavior cloning and a novel Hard Sample-aware Iterative Direct Preference Optimization (HIPO). We conduct extensive real-data experiments to validate the effectiveness of our approach. Our results demonstrate remarkable improvements in various metrics, including the newly proposed Non-Hallucinated Statute Rate, Statute Relevance Rate, Legal Claim Truthfulness, as well as traditional metrics such as METEOR, BERTScore, ROUGE-L, and win rates.
zh

[NLP-54] PASS: Presentation Automation for Slide Generation and Speech

【速读】: 该论文旨在解决在快节奏的工作环境中,如何高效地从一般Word文档生成演示文稿(presentation)并自动化其口头交付的问题。现有的研究主要集中在将研究论文转换为演示文稿,而忽略了更广泛的文档类型和演示文稿的自动化交付。论文提出的解决方案PASS(Pipeline for Automated Slide Synthesis)通过分析用户文档,生成动态且引人入胜的演示文稿,并利用AI生成语音进行自动化口头交付。PASS的关键创新在于其能够处理多种文档类型,而不仅限于研究论文,并且通过基于LLM(Large Language Model)的评估指标,从相关性(relevance)、连贯性(coherence)和冗余性(redundancy)三个关键维度对生成的演示文稿进行评估,确保其质量和效果。

链接: https://arxiv.org/abs/2501.06497
作者: Tushar Aggarwal,Aarohi Bhand
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In today’s fast-paced world, effective presentations have become an essential tool for communication in both online and offline meetings. The crafting of a compelling presentation requires significant time and effort, from gathering key insights to designing slides that convey information clearly and concisely. However, despite the wealth of resources available, people often find themselves manually extracting crucial points, analyzing data, and organizing content in a way that ensures clarity and impact. Furthermore, a successful presentation goes beyond just the slides; it demands rehearsal and the ability to weave a captivating narrative to fully engage the audience. Although there has been some exploration of automating document-to-slide generation, existing research is largely centered on converting research papers. In addition, automation of the delivery of these presentations has yet to be addressed. We introduce PASS, a pipeline used to generate slides from general Word documents, going beyond just research papers, which also automates the oral delivery of the generated slides. PASS analyzes user documents to create a dynamic, engaging presentation with an AI-generated voice. Additionally, we developed an LLM-based evaluation metric to assess our pipeline across three critical dimensions of presentations: relevance, coherence, and redundancy. The data and codes are available at this https URL.
zh

[NLP-55] Analyzing the Role of Context in Forecasting with Large Language Models

【速读】: 该论文旨在评估近期语言模型(LLMs)在二元预测问题上的预测性能。研究首先引入了一个包含600多个二元预测问题的新数据集,并辅以相关的新闻文章及其简洁的问题相关摘要。随后,论文探讨了不同上下文水平的输入提示对预测性能的影响。研究结果表明,引入新闻文章显著提升了预测性能,而使用少量示例(few-shot examples)则导致准确性下降。此外,研究发现较大的模型在性能上始终优于较小的模型,凸显了LLMs在增强自动化预测方面的潜力。解决方案的关键在于通过引入新闻文章等上下文信息来优化模型的输入提示,从而提升预测准确性。

链接: https://arxiv.org/abs/2501.06496
作者: Gerrit Mutschlechner,Adam Jatowt
机构: University of Innsbruck (因斯布鲁克大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:This study evaluates the forecasting performance of recent language models (LLMs) on binary forecasting questions. We first introduce a novel dataset of over 600 binary forecasting questions, augmented with related news articles and their concise question-related summaries. We then explore the impact of input prompts with varying level of context on forecasting performance. The results indicate that incorporating news articles significantly improves performance, while using few-shot examples leads to a decline in accuracy. We find that larger models consistently outperform smaller models, highlighting the potential of LLMs in enhancing automated forecasting.
zh

[NLP-56] Sequential Classification of Aviation Safety Occurrences with Natural Language Processing

【速读】: 该论文试图解决的问题是如何从航空安全事件(safety occurrence)的文本叙述中推断出飞机受损程度(damage level)。由于这些文本叙述通常是非结构化的、人类可理解的文本,计算机系统难以直接处理,因此需要通过自然语言处理(NLP)和人工智能(AI)模型来对这些文本进行分类和归类。解决方案的关键在于应用多种深度学习模型,包括LSTM(长短期记忆网络)、BLSTM(双向长短期记忆网络)、GRU(门控循环单元)、sRNN(简单循环神经网络)及其组合模型,对来自NTSB(美国国家运输安全委员会)的27,000份安全事件报告进行分析。研究结果表明,这些模型在四分类问题上的准确率均超过87.9%,且sRNN在召回率和准确率上略优于其他单一模型,而LSTM在精确率上表现稍好。这一方法为航空行业利益相关者提供了基于文本叙述的安全决策支持。

链接: https://arxiv.org/abs/2501.06490
作者: Aziida Nanyonga,Hassan Wasswa,Ugur Turhan,Oleksandra Molloy,Graham Wild
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Safety is a critical aspect of the air transport system given even slight operational anomalies can result in serious consequences. To reduce the chances of aviation safety occurrences, accidents and incidents are reported to establish the root cause, propose safety recommendations etc. However, analysis narratives of the pre-accident events are presented using human-understandable, raw, unstructured text that a computer system cannot understand. The ability to classify and categorise safety occurrences from their textual narratives would help aviation industry stakeholders make informed safety-critical decisions. To classify and categorise safety occurrences, we applied natural language processing (NLP) and AI (Artificial Intelligence) models to process text narratives. The study aimed to answer the question: How well can the damage level caused to the aircraft in a safety occurrence be inferred from the text narrative using natural language processing? The classification performance of various deep learning models including LSTM, BLSTM, GRU, sRNN, and combinations of these models including LSTM and GRU, BLSTM+GRU, sRNN and LSTM, sRNN and BLSTM, sRNN and GRU, sRNN and BLSTM and GRU, and sRNN and LSTM and GRU was evaluated on a set of 27,000 safety occurrence reports from the NTSB. The results of this study indicate that all models investigated performed competitively well, recording an accuracy of over 87.9%, which is well above the random guess of 25% for a four-class classification problem. Also, the models recorded high precision, recall, and F1 scores above 80%, 88%, and 85%, respectively. sRNN slightly outperformed other single models in terms of recall (90%) and accuracy (90%) while LSTM reported slightly better performance in terms of precision (87%).
zh
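
为直观起见,下面用 Keras 给出论文中“sRNN 与 GRU 组合”这类循环网络做四分类(受损程度)的一个最小示意;词表大小、序列长度、隐层维度等超参数均为本文假设,并非论文原始配置。

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, MAXLEN, NUM_CLASSES = 20000, 200, 4   # 词表、序列长度、四类受损程度

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAXLEN,)),
    layers.Embedding(VOCAB, 128),                 # 词嵌入
    layers.SimpleRNN(64, return_sequences=True),  # sRNN 层
    layers.GRU(64),                               # GRU 层
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```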

[NLP-57] First Token Probability Guided RAG for Telecom Question Answering

【速读】: 该论文试图解决在电信领域中的多项选择题回答(MCQA)任务中,现有检索增强生成(RAG)方法在检索质量和减少幻觉(hallucinations)方面的不足。为了解决这些问题,作者提出了一种新颖的首词元概率引导的RAG框架。该框架的关键在于利用置信度分数(confidence scores)来优化关键超参数,如块数量(chunk number)和块窗口大小(chunk window size),并动态调整上下文。具体而言,该方法首先检索最相关的文本块,并生成单个词元(token)作为潜在答案,然后对所有选项的概率进行归一化处理,生成置信度分数,用于指导上下文的动态调整。通过基于这些置信度分数迭代优化超参数,该方法能够持续提升RAG在特定领域MCQA任务中的准确性。

链接: https://arxiv.org/abs/2501.06468
作者: Tingwei Chen,Jiayi Chen,Zijian Zhao,Haolong Chen,Liang Zhang,Guangxu Zhu
机构: Shenzhen Research Institute of Big Data, Shenzhen, China (深圳大数据研究院); School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen), Shenzhen, China (香港中文大学(深圳)理工学院); School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China (中山大学计算机科学与工程学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have garnered significant attention for their impressive general-purpose capabilities. For applications requiring intricate domain knowledge, Retrieval-Augmented Generation (RAG) has shown a distinct advantage in incorporating domain-specific information into LLMs. However, existing RAG research has not fully addressed the challenges of Multiple Choice Question Answering (MCQA) in telecommunications, particularly in terms of retrieval quality and mitigating hallucinations. To tackle these challenges, we propose a novel first token probability guided RAG framework. This framework leverages confidence scores to optimize key hyperparameters, such as chunk number and chunk window size, while dynamically adjusting the context. Our method starts by retrieving the most relevant chunks and generates a single token as the potential answer. The probabilities of all options are then normalized to serve as confidence scores, which guide the dynamic adjustment of the context. By iteratively optimizing the hyperparameters based on these confidence scores, we can continuously improve RAG performance. We conducted experiments to validate the effectiveness of our framework, demonstrating its potential to enhance accuracy in domain-specific MCQA tasks.
zh
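
下面以 Python 勾勒该框架的核心机制:把各选项首词元的 logits 归一化为置信度,置信度不足时扩大检索块数量后重试。其中 retrieve、score_options 为假设的接口,阈值与倍增策略也是本文的简化假设。

```python
import numpy as np

def option_confidence(option_logits: dict) -> dict:
    """把各选项(A/B/C/D)首词元的 logit 经 softmax 归一化为置信度。"""
    keys = list(option_logits)
    logits = np.array([option_logits[k] for k in keys], dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return dict(zip(keys, probs))

def adaptive_rag(retrieve, score_options, chunk_num=2, max_chunks=8, tau=0.6):
    """置信度低于阈值 tau 时,倍增检索块数量并重试(超参数动态调整的简化版)。"""
    best, p = None, 0.0
    while chunk_num <= max_chunks:
        context = retrieve(chunk_num)           # 取回最相关的 chunk_num 个文本块
        conf = option_confidence(score_options(context))
        best, p = max(conf.items(), key=lambda kv: kv[1])
        if p >= tau:                            # 置信度足够即停止
            break
        chunk_num *= 2                          # 否则扩大上下文再试
    return best, p
```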

[NLP-58] Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis

【速读】: 该论文试图解决在对话语音合成(Conversational Speech Synthesis, CSS)中如何生成与对话风格一致且富有表现力的语音的问题。现有的研究通常仅依赖于当前对话(Current Dialogue, CD)历史,而忽略了存储对话(Stored Dialogue, SD)中所包含的与当前对话场景相关的风格表达知识。这些知识对于生成具有同理心反馈的表达性语音至关重要。为解决这一问题,论文提出了一种名为RADKA-CSS的检索增强对话知识聚合方案,其关键包括三个主要组件:1)构建存储对话语义-风格数据库(Stored Dialogue Semantic-Style Database, SDSSD),并通过多属性检索方案从SD中检索与CD在语义和风格上最相似的对话片段;2)采用多粒度图结构编码对话,并引入多源风格知识聚合机制,以有效利用CD和SD中的风格知识;3)将聚合的风格知识输入语音合成器,帮助生成与对话风格一致的表达性语音。实验结果表明,RADKA-CSS在表达性渲染方面优于基线模型。

链接: https://arxiv.org/abs/2501.06467
作者: Rui Liu,Zhenqi Jia,Feilong Bao,Haizhou Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by Information Fusion 2025

点击查看摘要

Abstract:Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style. Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction, which include style expression knowledge relevant to scenarios similar to those in CD. Note that this knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech that generates empathetic feedback. However, prior research has overlooked this aspect. To address this issue, we propose a novel Retrieval-Augmented Dialogue Knowledge Aggregation scheme for expressive CSS, termed RADKA-CSS, which includes three main components: 1) To effectively retrieve dialogues from SD that are similar to CD in terms of both semantic and style. First, we build a stored dialogue semantic-style database (SDSSD) which includes the text and audio samples. Then, we design a multi-attribute retrieval scheme to match the dialogue semantic and style vectors of the CD with the stored dialogue semantic and style vectors in the SDSSD, retrieving the most similar dialogues. 2) To effectively utilize the style knowledge from CD and SD, we propose adopting the multi-granularity graph structure to encode the dialogue and introducing a multi-source style knowledge aggregation mechanism. 3) Finally, the aggregated style knowledge is fed into the speech synthesizer to help the agent synthesize expressive speech that aligns with the conversational style. We conducted a comprehensive and in-depth experiment based on the DailyTalk dataset, which is a benchmarking dataset for the CSS task. Both objective and subjective evaluations demonstrate that RADKA-CSS outperforms baseline models in expressiveness rendering. Code and audio samples can be found at: this https URL.
zh
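
下面给出“多属性检索”思路的一个简化示意:对当前对话的语义向量与风格向量分别计算余弦相似度,加权后取 Top-K 存储对话。权重 alpha 与数据结构均为本文假设,并非 RADKA-CSS 的原始实现。

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_topk(cd_sem, cd_sty, sd_entries, k=3, alpha=0.5):
    """sd_entries 为 [(对话id, 语义向量, 风格向量), ...];按加权相似度取前 k 条。"""
    scored = [(i, alpha * cosine(cd_sem, sem) + (1 - alpha) * cosine(cd_sty, sty))
              for i, sem, sty in sd_entries]
    return sorted(scored, key=lambda x: -x[1])[:k]

rng = np.random.default_rng(0)
sd = [(i, rng.normal(size=32), rng.normal(size=32)) for i in range(10)]
print(retrieve_topk(rng.normal(size=32), rng.normal(size=32), sd))
```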

[NLP-59] MedCT: A Clinical Terminology Graph for Generative AI Applications in Healthcare

【速读】: 该论文旨在解决中国医疗社区中临床数据的标准化和可编程表示问题,以促进新药开发、治疗路径的改进以及患者结果的优化。解决方案的关键在于引入了全球首个针对中国医疗社区的临床术语系统 MedCT(Medical Clinical Terminology),并配套开发了临床基础模型 MedBERT 和实体链接模型 MedLink。MedCT 系统通过知识图谱提供了一种机制,能够有效减少大语言模型(LLMs)在临床应用中的幻觉问题,从而显著提高基于 LLM 的临床应用的准确性和安全性。此外,论文还展示了如何利用 LLM 的生成和表达能力,快速构建并部署高质量的术语系统,相较于传统术语系统(如 SNOMED CT)的长期开发过程,MedCT 在短短三个月内实现了实际临床应用。实验结果表明,MedCT 在语义匹配和实体链接任务中达到了最先进的性能,不仅适用于中文,也适用于英文。通过在实际临床任务中的应用,如电子健康记录(EHR)自动生成和医疗文档搜索,MedCT 展示了其在临床工作流程和患者结果中的多重价值。

链接: https://arxiv.org/abs/2501.06465
作者: Ye Chen,Dongdong Huang,Haoyun Xu,Cong Fu,Lin Sheng,Qingli Zhou,Yuqiang Shen,Kai Wang
机构: Tiger Research(老虎研究); The Fourth Affiliated Hospital of School of Medicine, Zhejiang University(浙江大学医学院第四附属医院); Information Center, The Fourth Affiliated Hospital of School of Medicine, Zhejiang University(浙江大学医学院第四附属医院信息中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce the world’s first clinical terminology for the Chinese healthcare community, namely MedCT, accompanied by a clinical foundation model MedBERT and an entity linking model MedLink. The MedCT system enables standardized and programmable representation of Chinese clinical data, successively stimulating the development of new medicines, treatment pathways, and better patient outcomes for the populous Chinese community. Moreover, the MedCT knowledge graph provides a principled mechanism to minimize the hallucination problem of large language models (LLMs), therefore achieving significant levels of accuracy and safety in LLM-based clinical applications. By leveraging the LLMs’ emergent capabilities of generativeness and expressiveness, we were able to rapidly build a production-quality terminology system and deploy it to the real-world clinical field within three months, while classical terminologies like SNOMED CT have gone through more than twenty years of development. Our experiments show that the MedCT system achieves state-of-the-art (SOTA) performance in semantic matching and entity linking tasks, not only for Chinese but also for English. We also conducted a longitudinal field experiment by applying MedCT and LLMs in a representative spectrum of clinical tasks, including electronic health record (EHR) auto-generation and medical document search for diagnostic decision making. Our study shows a multitude of values of MedCT for clinical workflows and patient outcomes, especially in the new genre of clinical LLM applications. We present our approach in sufficient engineering detail, such that implementing a clinical terminology for other non-English societies should be readily reproducible. We openly release our terminology, models and algorithms, along with real-world clinical datasets for the development.
zh
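
作为示意,下面给出基于向量相似度的实体链接最小实现:把文本中的候选提及向量映射到术语库中最相近的标准概念。术语库与向量均为随机假设数据,仅演示思路,并非 MedLink 的实际做法。

```python
import numpy as np

def link_entity(mention_vec: np.ndarray, term_bank: dict):
    """term_bank: {标准术语: 向量}。返回与提及向量相似度最高的术语及得分。"""
    best, best_sim = None, -1.0
    for term, vec in term_bank.items():
        sim = float(mention_vec @ vec /
                    (np.linalg.norm(mention_vec) * np.linalg.norm(vec) + 1e-8))
        if sim > best_sim:
            best, best_sim = term, sim
    return best, best_sim

rng = np.random.default_rng(0)
bank = {"2型糖尿病": rng.normal(size=64), "原发性高血压": rng.normal(size=64)}
print(link_entity(rng.normal(size=64), bank))
```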

[NLP-60] O1 Replication Journey – Part 3: Inference-time Scaling for Medical Reasoning

【速读】: 该论文旨在探索在大语言模型(LLMs)中通过推理时间扩展(inference-time scaling)来提升其在医学推理任务中的表现,包括诊断决策和治疗规划。研究通过在不同复杂度的医学基准(如MedQA、Medbullets和JAMA Clinical Challenges)上进行广泛实验,揭示了几个关键发现:首先,增加推理时间确实能够显著提升模型性能,尤其是在仅有500个样本的小规模训练集下,模型性能提升了6%-11%。其次,任务复杂度与所需的推理链长度直接相关,表明复杂问题需要更长的思考过程。最后,模型生成的鉴别诊断遵循假设-演绎法(hypothetico-deductive method),能够系统地生成并缩小可能的疾病列表。这些发现表明,推理时间扩展与旅程学习(journey learning)的结合在提升LLMs的实际临床推理能力方面具有显著潜力。

链接: https://arxiv.org/abs/2501.06458
作者: Zhongzhen Huang,Gui Geng,Shengyi Hua,Zhen Huang,Haoyang Zou,Shaoting Zhang,Pengfei Liu,Xiaofan Zhang
机构: Shanghai Jiao Tong University (上海交通大学); SII; SPIRAL Lab; Generative AI Research Lab (GAIR)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method, producing a list of potential conditions that may explain a patient’s symptoms and systematically narrowing these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs’ real-world clinical reasoning capabilities.
zh

[NLP-61] Synthetic Feature Augmentation Improves Generalization Performance of Language Models

【速读】: 该论文试图解决在有限且不平衡的数据集上训练和微调深度学习模型(尤其是大语言模型,LLMs)时面临的挑战。这些问题通常导致模型泛化能力差,表现为模型对主导类别过拟合,而对少数类别表现不佳,从而导致预测偏差和在实际应用中的鲁棒性降低。为解决这些问题,论文提出通过在嵌入空间中生成合成样本来增强特征,具体方法是通过多种技术上采样(upsampling)少数类别,从而改善模型性能并缓解数据不平衡问题。该方案的关键在于通过生成合成样本增强嵌入空间中的特征,从而提升模型在不平衡数据场景下的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2501.06434
作者: Ashok Choudhary,Cornelius Thiels,Hojjat Salehinejad
机构: Mayo Clinic(梅奥诊所); Kern Center for the Science of Health Care Delivery(健康护理科学核心中心); Department of Artificial Intelligence and Informatics(人工智能与信息学系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for presentation at IEEE SSCI 2025

点击查看摘要

Abstract:Training and fine-tuning deep learning models, especially large language models (LLMs), on limited and imbalanced datasets poses substantial challenges. These issues often result in poor generalization, where models overfit to dominant classes and underperform on minority classes, leading to biased predictions and reduced robustness in real-world applications. To overcome these challenges, we propose augmenting features in the embedding space by generating synthetic samples using a range of techniques. By upsampling underrepresented classes, this method improves model performance and alleviates data imbalance. We validate the effectiveness of this approach across multiple open-source text classification benchmarks, demonstrating its potential to enhance model robustness and generalization in imbalanced data scenarios.
zh
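
下面用一段 Python 勾勒“在嵌入空间为少数类生成合成样本”的常见做法之一(SMOTE 式线性插值)。插值方式是本文选取的示意,论文实际使用了多种生成技术。

```python
import numpy as np

def synth_embeddings(minority_embs: np.ndarray, n_new: int, seed=0) -> np.ndarray:
    """在同类两个样本的嵌入之间随机线性插值,生成 n_new 个合成嵌入。"""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(minority_embs), size=(n_new, 2))
    lam = rng.random((n_new, 1))                      # 插值系数
    return lam * minority_embs[idx[:, 0]] + (1 - lam) * minority_embs[idx[:, 1]]

embs = np.random.randn(10, 768)          # 假设的少数类句向量
print(synth_embeddings(embs, 5).shape)   # (5, 768):并入训练集以平衡类别
```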

[NLP-62] Tensor Product Attention Is All You Need

【速读】: 该论文试图解决在处理长输入序列时,语言模型(Language Models)所需的大规模键值缓存(Key-Value, KV caches)导致的内存开销问题。为了解决这一问题,论文提出了一种新的注意力机制——张量积注意力(Tensor Product Attention, TPA)。TPA通过张量分解(tensor decompositions)将查询(queries)、键(keys)和值(values)紧凑地表示,从而显著减少了推理时的KV缓存大小。TPA的关键在于将表示分解为上下文低秩成分(contextual low-rank components),并与旋转位置编码(RoPE)无缝集成,从而在提高模型质量的同时实现了内存效率的提升。基于TPA,论文还引入了T6模型架构,该架构在语言建模任务中表现优异,超越了包括多头注意力(MHA)、多查询注意力(MQA)、分组查询注意力(GQA)和多头潜在注意力(Multi-head Latent Attention, MLA)在内的标准Transformer基线模型。TPA的内存效率使得在固定资源约束下能够处理更长的序列,解决了现代语言模型中的一个关键可扩展性挑战。

链接: https://arxiv.org/abs/2501.06425
作者: Yifan Zhang,Yifeng Liu,Huizhuo Yuan,Zhen Qin,Yang Yuan,Quanquan Gu,Andrew Chi-Chih Yao
机构: IIIS, Tsinghua University(清华大学); Shanghai Qi Zhi Institute(上海期智研究院); University of California, Los Angeles(加州大学洛杉矶分校); TapTap
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 5 figures

点击查看摘要

Abstract:Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation of language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines including MHA, MQA, GQA, and MLA across various metrics, including perplexity and a range of renowned evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at this https URL.
zh
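
为帮助理解张量分解如何压缩 KV 缓存,下面给出一个 PyTorch 玩具示意:每个 token 的 K 由 R 个“头方向因子与维度因子的外积”求和而成,缓存时只需保存低秩因子。各维度与秩均为本文假设的玩具数值,并非 TPA 的完整实现。

```python
import torch

B, T, D = 2, 16, 64            # batch、序列长、隐层维
H, Dh, R = 4, 16, 2            # 头数、每头维度、分解秩

x = torch.randn(B, T, D)
Wa = torch.nn.Linear(D, R * H)   # 产生头方向因子 a,形状 (R, H)
Wb = torch.nn.Linear(D, R * Dh)  # 产生维度方向因子 b,形状 (R, Dh)

a = Wa(x).view(B, T, R, H)
b = Wb(x).view(B, T, R, Dh)
# 秩一项求和:K[b,t] = (1/R) * sum_r a_r ⊗ b_r,得到 (H, Dh) 的键矩阵
k = torch.einsum("btrh,btrd->bthd", a, b) / R
print(k.shape)  # torch.Size([2, 16, 4, 16]):缓存 a、b 即可重建 K
```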

[NLP-63] Using Pre-trained LLM s for Multivariate Time Series Forecasting

【速读】: 该论文试图解决的问题是如何利用预训练的大语言模型(LLMs)来进行多元需求时间序列预测。尽管LLMs在自然语言处理领域表现出色,但其在时间序列预测中的应用仍面临挑战,尤其是如何将多元时间序列数据有效地映射到LLM的token嵌入空间中。论文的核心解决方案是提出了一种新颖的多元时间序列分块策略(multivariate patching strategy),将时间序列特征嵌入到仅解码器(decoder-only)的预训练Transformer模型中。这一策略使得模型能够在时间序列预测任务中表现出与当前最先进模型相竞争的性能。此外,论文还使用了基于权重的诊断方法来验证其研究结果的有效性。

链接: https://arxiv.org/abs/2501.06386
作者: Malcolm L. Wolff,Shenghao Yang,Kari Torkkola,Michael W. Mahoney
机构: Amazon(亚马逊); University of Waterloo(滑铁卢大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pre-trained Large Language Models (LLMs) encapsulate large amounts of knowledge and take enormous amounts of compute to train. We make use of this resource, together with the observation that LLMs are able to transfer knowledge and performance from one domain or even modality to another seemingly-unrelated area, to help with multivariate demand time series forecasting. Attention in transformer-based methods requires something worth attending to – more than just samples of a time-series. We explore different methods to map multivariate input time series into the LLM token embedding space. In particular, our novel multivariate patching strategy to embed time series features into decoder-only pre-trained Transformers produces results competitive with state-of-the-art time series forecasting models. We also use recently-developed weight-based diagnostics to validate our findings.
zh
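
下面以 PyTorch 勾勒“多元时间序列分块并映射到词元嵌入空间”的基本思路:把每个长度为 P 的多元窗口展平后线性投影到模型隐层维度。分块与投影方式为本文假设的简化版本。

```python
import torch

B, L, C = 4, 96, 7      # batch、时间步数、变量数
P, D_MODEL = 16, 768    # 每个 patch 的长度、LLM 嵌入维度

series = torch.randn(B, L, C)
patches = series.unfold(dimension=1, size=P, step=P)   # (B, L//P, C, P)
patches = patches.reshape(B, L // P, C * P)            # 展平每个多元 patch
proj = torch.nn.Linear(C * P, D_MODEL)                 # 投影为“词元嵌入”
tokens = proj(patches)                                 # (B, 6, 768),可送入预训练解码器
print(tokens.shape)
```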

[NLP-64] Dynamics of “Spontaneous” Topic Changes in Next Token Prediction with Self-Attention

【速读】: 该论文试图解决的问题是:基于自注意力机制(self-attention)的语言模型在生成文本时缺乏人类认知中自发的主题转换能力。具体而言,人类在对话中能够基于情感或上下文信号自然地切换话题,而现有的语言模型则依赖于输入标记的结构化统计线索进行下一个标记的预测,无法实现这种自发性。论文通过定义主题连续性(topic continuity)、模糊序列(ambiguous sequences)和主题转换(change of topic)等概念,并基于标记优先级图(token priority graphs, TPGs)对主题进行建模,探讨了影响模型进行主题转换的因素。关键解决方案包括:(1)模型在输入主题相关的标记优先级顺序上保持一致性;(2)只有当低优先级标记数量超过输入主题的所有高优先级标记时,模型才会发生主题转换;(3)与人类认知不同,较长的上下文长度和重叠主题会降低模型自发转换主题的可能性。这些发现揭示了人类认知与自注意力模型在主题转换上的差异,并为设计更自然的对话式AI提供了理论依据。

链接: https://arxiv.org/abs/2501.06382
作者: Mumin Jia,Jairo Diaz-Rodriguez
机构: Department of Mathematics and Statistics, York University (约克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Human cognition can spontaneously shift conversation topics, often triggered by emotional or contextual signals. In contrast, self-attention-based language models depend on structured statistical cues from input tokens for next-token prediction, lacking this spontaneity. Motivated by this distinction, we investigate the factors that influence the next-token prediction to change the topic of the input sequence. We define concepts of topic continuity, ambiguous sequences, and change of topic, based on defining a topic as a set of token priority graphs (TPGs). Using a simplified single-layer self-attention architecture, we derive analytical characterizations of topic changes. Specifically, we demonstrate that (1) the model maintains the priority order of tokens related to the input topic, (2) a topic change occurs only if lower-priority tokens outnumber all higher-priority tokens of the input topic, and (3) unlike human cognition, longer context lengths and overlapping topics reduce the likelihood of spontaneous redirection. These insights highlight differences between human cognition and self-attention-based models in navigating topic changes and underscore the challenges in designing conversational AI capable of handling “spontaneous” conversations more naturally. To our knowledge, this is the first work to address these questions in such close relation to human conversation and thought.
zh
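
论文结论 (2) 可以用一个玩具函数直观表达:仅当低优先级词元数量超过输入主题的全部高优先级词元时才判定主题转换。下列词元优先级与阈值均为本文虚构的示例。

```python
def topic_change(tokens, priority, topic_threshold):
    """priority: 词元到优先级的映射;topic_threshold: 输入主题的最低高优先级。"""
    high = sum(1 for t in tokens if priority.get(t, 0) >= topic_threshold)
    low = sum(1 for t in tokens if priority.get(t, 0) < topic_threshold)
    return low > high   # 低优先级词元占多数才会切换主题

priority = {"股票": 3, "利率": 3, "天气": 1, "下雨": 1}
print(topic_change(["股票", "利率", "天气"], priority, 2))          # False:主题延续
print(topic_change(["天气", "下雨", "天气", "股票"], priority, 2))  # True:发生主题转换
```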

[NLP-65] AFRIDOC-MT: Document-level MT Corpus for African Languages

【速读】: 该论文旨在解决非洲语言在文档级多平行翻译(document-level multi-parallel translation)中的资源匮乏问题,特别是针对健康和信息科技领域的新闻文档。论文提出了AFRIDOC-MT数据集,该数据集涵盖了英语与五种非洲语言(阿姆哈拉语、豪萨语、斯瓦希里语、约鲁巴语和祖鲁语)之间的翻译,包含334篇健康新闻和271篇信息技术新闻文档,所有内容均由人工从英语翻译而来。解决方案的关键在于通过评估神经机器翻译(NMT)模型和大型语言模型(LLMs)在句子和伪文档(pseudo-document)级别的翻译表现,并将输出重新对齐以形成完整的文档进行评估。实验结果表明,NLLB-200在标准NMT模型中表现最佳,而GPT-4o在通用LLMs中表现最优。此外,通过对选定模型进行微调,性能显著提升,但基于句子训练的模型在处理长文档时表现不佳。分析还揭示了一些LLMs在非洲语言翻译中存在生成不足、重复词汇或短语以及偏离目标翻译等问题。

链接: https://arxiv.org/abs/2501.06374
作者: Jesujoba O. Alabi,Israel Abebe Azime,Miaoran Zhang,Cristina España-Bonet,Rachel Bawden,Dawei Zhu,David Ifeoluwa Adelani,Clement Oyeleke Odoje,Idris Akinade,Iffat Maab,Davis David,Shamsuddeen Hassan Muhammad,Neo Putini,David O. Ademuyiwa,Andrew Caines,Dietrich Klakow
机构: Masakhane NLP; Saarland University, Saarland Informatic Campus; DFKI GmbH; Inria, Paris, France; Mila, McGill University & Canada CIFAR AI Chair; University of Ibadan, Nigeria; National Institute of Informatics, Japan; Selcom Tanzania; Imperial College London; University of KwaZulu-Natal; Loughborough University, U.K; University of Cambridge, U.K
类目: Computation and Language (cs.CL)
备注: under review

点击查看摘要

Abstract:This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.
zh

[NLP-66] Gender-Neutral Large Language Models for Medical Applications: Reducing Bias in PubMed Abstracts

【速读】: 该论文旨在解决大型语言模型(LLMs)在医学文献中存在的性别偏见问题,特别是与职业相关的性别代词偏见。解决方案的关键在于开发了一个名为“MOBERT”的基于BERT的模型,该模型通过处理1965年至1980年间379,000篇PubMed摘要,识别并修改与职业相关的性别代词,以实现性别中立化。MOBERT在训练过程中使用了这些中性化的摘要,并与基于原始数据集训练的“1965Bert”进行了性能对比。结果显示,MOBERT实现了70%的包容性替换率,而1965Bert仅为4%。进一步分析表明,MOBERT的代词替换准确性与训练数据中职业术语的频率相关。论文建议通过扩展数据集和优化流程来进一步提高性能,确保在医学应用中的语言建模更加公平。

链接: https://arxiv.org/abs/2501.06365
作者: Elizabeth Schaefer,Kirk Roberts
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:This paper presents a pipeline for mitigating gender bias in large language models (LLMs) used in medical literature by neutralizing gendered occupational pronouns. A dataset of 379,000 PubMed abstracts from 1965-1980 was processed to identify and modify pronouns tied to professions. We developed a BERT-based model, "Modern Occupational Bias Elimination with Refined Training," or "MOBERT," trained on these neutralized abstracts, and compared its performance with "1965Bert," trained on the original dataset. MOBERT achieved a 70% inclusive replacement rate, while 1965Bert reached only 4%. A further analysis of MOBERT revealed that pronoun replacement accuracy correlated with the frequency of occupational terms in the training data. We propose expanding the dataset and refining the pipeline to improve performance and ensure more equitable language modeling in medical applications.
zh
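
下面用正则表达式给出“职业相关性别代词中性化”这一流程步骤的极简示意:先检测句中是否含职业词,再把性别代词替换为中性代词。职业词表与替换规则为本文假设,远比论文的实际流程简化。

```python
import re

OCCUPATIONS = {"physician", "nurse", "surgeon", "scientist"}
NEUTRAL = {"he": "they", "she": "they", "his": "their",
           "her": "their", "him": "them"}

def neutralize(sentence: str) -> str:
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if not words & OCCUPATIONS:          # 只处理含职业词的句子
        return sentence
    return re.sub(r"\b(he|she|his|her|him)\b",
                  lambda m: NEUTRAL[m.group(1).lower()], sentence,
                  flags=re.IGNORECASE)

print(neutralize("The physician said he would review his notes."))
# 输出:The physician said they would review their notes.
```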

[NLP-67] Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages

【速读】: 该论文探讨了大型语言模型(LLMs)如何学习和编码多种语言中的形态句法概念(morphosyntactic concepts),如语法数(grammatical number)、性别(gender)和时态(tense)。研究的主要问题是这些概念在不同语言中的表示是否共享,以及这些共享表示在模型中的具体作用。解决方案的关键在于使用稀疏自编码器(sparse autoencoders)对Llama-3-8B和Aya-23-8B模型进行训练,以识别跨语言的抽象语法概念的特征方向。通过因果干预(causal interventions)验证这些表示的多语言性质,并利用这些特征在机器翻译任务中精确修改模型行为,展示了这些特征在网络中的普遍性和选择性。研究结果表明,即使主要基于英语数据训练的模型也能发展出跨语言的形态句法概念的抽象表示。

链接: https://arxiv.org/abs/2501.06346
作者: Jannik Brinkmann,Chris Wendler,Christian Bartelt,Aaron Mueller
机构: University of Mannheim(曼海姆大学); Northeastern University(东北大学); EPFL(洛桑联邦理工学院); Technion – IIT(以色列理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are multiple languages learned and encoded? In this work, we explore the extent to which LLMs share representations of morphosyntactic concepts such as grammatical number, gender, and tense across languages. We train sparse autoencoders on Llama-3-8B and Aya-23-8B, and demonstrate that abstract grammatical concepts are often encoded in feature directions shared across many languages. We use causal interventions to verify the multilingual nature of these representations; specifically, we show that ablating only multilingual features decreases classifier performance to near-chance across languages. We then use these features to precisely modify model behavior in a machine translation task; this demonstrates both the generality and selectivity of these features' roles in the network. Our findings suggest that even models trained predominantly on English data can develop robust, cross-lingual abstractions of morphosyntactic concepts.
zh
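
作为背景,下面给出稀疏自编码器的最小 PyTorch 示意:用 L1 稀疏惩罚把模型激活分解到过完备字典上。维度与稀疏系数均为本文假设的玩具数值。

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)   # 过完备字典:d_dict >> d_model
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))             # 稀疏特征激活
        return self.dec(f), f

sae = SparseAutoencoder()
x = torch.randn(8, 512)                          # 假设的 LLM 残差流激活
x_hat, f = sae(x)
loss = ((x_hat - x) ** 2).mean() + 1e-3 * f.abs().mean()  # 重建误差 + L1 稀疏
loss.backward()
```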

[NLP-68] Understanding How Paper Writers Use AI-Generated Captions in Figure Caption Writing AAAI2025

【速读】: 该论文试图解决科学出版物中图表标题(figure captions)撰写质量不高的问题,尤其是由于作者缺乏关注导致的标题不完善。尽管先前的研究已经探索了生成式 AI 在标题生成中的应用,但这些研究主要集中于读者为中心的使用场景,即用户评估生成的标题,而非将其主动整合到写作过程中。本文通过一项涉及 18 名参与者的用户研究,探讨了论文作者如何将 AI 生成的标题整合到他们的写作过程中。研究的关键在于,参与者使用最先进的 AI 模型生成的标题作为资源,重写自己近期发表论文中的两个图表标题。通过对写作过程的视频记录进行交互分析,研究发现作者通常从复制和精炼 AI 生成的标题开始,倾向于使用更长、细节丰富的标题,并整合文本和视觉元素。然而,当前的 AI 模型在处理复杂图表时效果不佳。这些发现揭示了图表标题撰写的复杂性和多样性,为 AI 系统提供了设计机会,以更好地支持学术写作中的挑战。

链接: https://arxiv.org/abs/2501.06317
作者: Ho Yin(Sam)Ng,Ting-Yao Hsu,Jiyoo Min,Sungchul Kim,Ryan A. Rossi,Tong Yu,Hyunggu Jung,Ting-Hao ‘Kenneth’ Huang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper will appear at AAAI 2025 Workshop (2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle)

点击查看摘要

Abstract:Figures and their captions play a key role in scientific publications. However, despite their importance, many captions in published papers are poorly crafted, largely due to a lack of attention by paper authors. While prior AI research has explored caption generation, it has mainly focused on reader-centered use cases, where users evaluate generated captions rather than actively integrating them into their writing. This paper addresses this gap by investigating how paper authors incorporate AI-generated captions into their writing process through a user study involving 18 participants. Each participant rewrote captions for two figures from their own recently published work, using captions generated by state-of-the-art AI models as a resource. By analyzing video recordings of the writing process through interaction analysis, we observed that participants often began by copying and refining AI-generated captions. Paper writers favored longer, detail-rich captions that integrated textual and visual elements but found current AI models less effective for complex figures. These findings highlight the nuanced and diverse nature of figure caption composition, revealing design opportunities for AI systems to better support the challenges of academic writing.
zh

[NLP-69] Bactrainus: Optimizing Large Language Models for Multi-hop Complex Question Answering Tasks

【速读】: 该论文试图解决大型语言模型(LLMs)在领域特定任务中的性能评估问题,特别是那些需要深度自然语言理解的任务,如多跳问答(MHQA)问题。研究聚焦于使用HotpotQA数据集来评估LLMs在需要推理和结合多个文本源信息的复杂任务中的表现。解决方案的关键在于设计了一个两阶段的“选择器-阅读器”架构,其中每个阶段都使用独立的LLM。此外,研究还采用了思维链(Chain of Thought, CoT)和问题分解等方法,以探究这些技术对提升模型性能的影响。实验结果表明,结合这些技术后,LLMs在答案查找任务中的F1分数提升了4%,证明了其在处理领域特定任务和复杂语言理解方面的能力。

链接: https://arxiv.org/abs/2501.06286
作者: Iman Barati,Arash Ghafouri,Behrouz Minaei-Bidgoli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, the use of large language models (LLMs) has significantly increased, and these models have demonstrated remarkable performance in a variety of general language tasks. However, the evaluation of their performance in domain-specific tasks, particularly those requiring deep natural language understanding, has received less attention. In this research, we evaluate the ability of large language models in performing domain-specific tasks, focusing on the multi-hop question answering (MHQA) problem using the HotpotQA dataset. This task, due to its requirement for reasoning and combining information from multiple textual sources, serves as a challenging benchmark for assessing the language comprehension capabilities of these models. To tackle this problem, we have designed a two-stage selector-reader architecture, where each stage utilizes an independent LLM. In addition, methods such as Chain of Thought (CoT) and question decomposition have been employed to investigate their impact on improving the model’s performance. The results of the study show that the integration of large language models with these techniques can lead to up to a 4% improvement in F1 score for finding answers, providing evidence of the models’ ability to handle domain-specific tasks and their understanding of complex language.
zh
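
下面以 Python 勾勒两阶段“选择器-阅读器”架构的调用流程:选择器先挑出相关证据段落,阅读器再结合思维链提示作答。其中 ask_llm 为假设的 LLM 调用接口,提示词亦为本文示意。

```python
def selector(question, paragraphs, ask_llm):
    """第一阶段:让 LLM 从候选段落中选出作答所需的证据段落。"""
    prompt = ("从下列段落中选出回答问题所需的段落编号(逗号分隔):\n"
              + "\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs))
              + f"\n问题:{question}\n编号:")
    ids = [int(s) for s in ask_llm(prompt).split(",") if s.strip().isdigit()]
    return [paragraphs[i] for i in ids if i < len(paragraphs)]

def reader(question, evidence, ask_llm):
    """第二阶段:基于证据,用思维链(CoT)提示作答。"""
    prompt = ("请先逐步推理(Chain of Thought),再给出最终答案。\n证据:\n"
              + "\n".join(evidence) + f"\n问题:{question}\n推理:")
    return ask_llm(prompt)

def multihop_qa(question, paragraphs, ask_llm):
    """两阶段流程:选择器筛证据,阅读器答题;两阶段可使用不同的 LLM。"""
    return reader(question, selector(question, paragraphs, ask_llm), ask_llm)
```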

[NLP-70] Dafny as Verification-Aware Intermediate Language for Code Generation

【速读】: 该论文试图解决使用大语言模型(LLMs)从自然语言提示生成源代码时,生成的代码可能存在难以察觉的错误的问题。为了解决这一问题,论文提出了一种利用形式化方法(formal methods)来提高生成代码质量的方案。其关键解决方案是引导LLM先生成一个不透明的中间表示(intermediate representation),使用验证感知语言Dafny编写,该中间表示可以自动验证其是否符合预定的规范。验证通过后,正确的Dafny程序会被编译为目标语言并返回给用户。整个过程中,用户与系统的交互仅通过自然语言进行,Dafny代码不会直接暴露给用户。论文还描述了当前的原型系统,并报告了其在HumanEval Python代码生成基准测试中的表现。

链接: https://arxiv.org/abs/2501.06283
作者: Yue Chen Li,Stefan Zetzsche,Siva Somayyajula
机构: Massachusetts Institute of Technology(麻省理工学院); Amazon(亚马逊); Amazon(亚马逊)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Using large language models (LLMs) to generate source code from natural language prompts is a popular and promising idea with a wide range of applications. One of its limitations is that the generated code can be faulty at times, often in a subtle way, despite being presented to the user as correct. In this paper, we explore ways in which formal methods can assist with increasing the quality of code generated by an LLM. Instead of emitting code in a target language directly, we propose that the user guides the LLM to first generate an opaque intermediate representation, in the verification-aware language Dafny, that can be automatically validated for correctness against agreed on specifications. The correct Dafny program is then compiled to the target language and returned to the user. All user-system interactions throughout the procedure occur via natural language; Dafny code is never exposed. We describe our current prototype and report on its performance on the HumanEval Python code generation benchmarks.
zh
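
下面给出这一“先产 Dafny、验证通过再编译”流程的编排示意(Python)。generate_dafny 为假设的 LLM 接口;dafny verify / dafny translate 按 Dafny 4 常见命令行用法书写,具体参数可能随版本变化。

```python
import pathlib
import subprocess
import tempfile

def verify_then_compile(spec: str, generate_dafny, max_tries=3):
    """LLM 生成 Dafny -> 自动验证 -> 通过后翻译为 Python;失败则带反馈重试。"""
    for _ in range(max_tries):
        code = generate_dafny(spec)                   # 依据自然语言规范生成 Dafny
        src = pathlib.Path(tempfile.mkdtemp()) / "candidate.dfy"
        src.write_text(code, encoding="utf-8")
        ok = subprocess.run(["dafny", "verify", str(src)]).returncode == 0
        if ok:                                        # 验证通过才编译到目标语言
            subprocess.run(["dafny", "translate", "py", str(src)])
            return src
        spec += "\n上一次生成的程序未通过验证,请修正后重新给出。"
    raise RuntimeError("多次尝试仍未通过形式化验证")
```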

[NLP-71] MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

【速读】: 该论文旨在解决现有语音交互模型在无缝语音交互中的局限性,特别是原生模型(native models)和对齐模型(aligned models)所面临的问题。原生模型虽然在一个框架内整合了语音和文本处理,但存在序列长度不一致和预训练不足的问题;而对齐模型虽然保持了文本大语言模型(LLMs)的能力,但受限于小数据集和对语音任务的狭窄关注。论文提出的解决方案是MinMo,一个具有约80亿参数的多模态大语言模型(Multimodal Large Language Model),通过多阶段训练(包括语音到文本对齐、文本到语音对齐、语音到语音对齐以及双工交互对齐)在140万小时的多样化语音数据和广泛的语音任务上进行训练。MinMo不仅在各种语音理解和生成基准上实现了最先进的性能,还支持全双工对话(即用户与系统之间的双向同时通信),并提出了一个新颖且简单的语音解码器,在语音生成方面优于现有模型。此外,MinMo增强了指令跟随能力,能够根据用户指令控制语音生成,包括情感、方言、语速等细微差别,并模仿特定声音。

链接: https://arxiv.org/abs/2501.06282
作者: Qian Chen,Yafeng Chen,Yanni Chen,Mengzhe Chen,Yingda Chen,Chong Deng,Zhihao Du,Ruize Gao,Changfeng Gao,Zhifu Gao,Yabin Li,Xiang Lv,Jiaqing Liu,Haoneng Luo,Bin Ma,Chongjia Ni,Xian Shi,Jialong Tang,Hui Wang,Hao Wang,Wen Wang,Yuxuan Wang,Yunlan Xu,Fan Yu,Zhijie Yan,Yexin Yang,Baosong Yang,Xian Yang,Guanrou Yang,Tianyu Zhao,Qinglin Zhang,Shiliang Zhang,Nan Zhao,Pei Zhang,Chong Zhang,Jinren Zhou
机构: Tongyi Lab, Alibaba Group(通义实验室, 阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Work in progress. Authors are listed in alphabetical order by family name

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo supports controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is this https URL, and the code and models will be released soon.
zh

[NLP-72] Punctuations Semantic Role between Brain and Transformers Models

【速读】: 该论文试图解决的问题是:当代用于自然语言处理(NLP)的神经网络(neural networks)是否能够与人类大脑的语言处理机制相兼容,以及如何通过实验方法评估不同NLP模型与大脑活动之间的对应关系。研究还探讨了文本中不同标点符号的去除对大脑语义处理的影响。

解决方案的关键在于:采用了一种基于人类大脑数据的实验方法,通过将大脑活动与NLP模型的内部表示进行对比,评估了四种新的NLP模型与大脑活动的兼容性。研究特别关注了RoBERTa和BERT模型的表现,发现RoBERTa在准确度上优于BERT,且BERT在去除标点符号后表现出更高的准确度。此外,研究还通过改变文本的标点符号配置,分析了这些变化对大脑语义处理的影响。

链接: https://arxiv.org/abs/2501.06278
作者: Zenon Lamprou,Frank Polick,Yashar Moshfeghi
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Contemporary neural networks intended for natural language processing (NLP) are not designed with specific linguistic rules. It suggests that they may acquire a general understanding of language. This attribute has led to extensive research in deciphering their internal representations. A pioneering method involves an experimental setup using human brain data to explore if a translation between brain and neural network representations can be established. Since this technique emerged, more sophisticated NLP models have been developed. In our study, we apply this method to evaluate four new NLP models aiming to identify the one most compatible with brain activity. Additionally, to explore how the brain comprehends text semantically, we alter the text by removing punctuation in four different ways to understand its impact on semantic processing by the human brain. Our findings indicate that the RoBERTa model aligns best with brain activity, outperforming BERT in accuracy according to our metrics. Furthermore, for BERT, higher accuracy was noted when punctuation was excluded, and increased context length did not significantly diminish accuracy compared to the original results with punctuation.
zh
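
论文的一个关键操作是对同一文本构造多种标点处理变体再分别评估。摘要未列出四种具体方式,下列变体为本文假设的示意。

```python
import re
import string

text = "Contemporary models, however, lack this; punctuation matters."

variants = {
    "original": text,                                              # 原文
    "no_punct": text.translate(str.maketrans("", "", string.punctuation)),  # 去全部标点
    "only_periods": re.sub(r"[^\w\s.]", "", text),                 # 仅保留句号
    "punct_as_space": re.sub(rf"[{re.escape(string.punctuation)}]", " ", text),  # 标点替换为空格
}
for name, v in variants.items():
    print(f"{name}: {v}")
```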

[NLP-73] Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain

【速读】: 该论文试图解决生成式 AI 在生态和环境科学应用中缺乏统一评估框架的问题。为了解决这一问题,作者提出了环境大语言模型评估(Environmental Large Language model Evaluation, ELLE)问答数据集,这是首个专门用于评估大语言模型及其在生态和环境科学中应用的基准数据集。ELLE 数据集包含 1,130 个问答对,涵盖 16 个环境主题,并按领域、难度和类型进行分类。通过提供这一标准化的评估工具,ELLE 数据集能够促进生成式 AI 技术在可持续环境成果中的开发和应用。

链接: https://arxiv.org/abs/2501.06277
作者: Jing Guo,Nan Li,Ming Xu
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative AI holds significant potential for ecological and environmental applications such as monitoring, data analysis, education, and policy support. However, its effectiveness is limited by the lack of a unified evaluation framework. To address this, we present the Environmental Large Language model Evaluation (ELLE) question answer (QA) dataset, the first benchmark designed to assess large language models and their applications in ecological and environmental sciences. The ELLE dataset includes 1,130 question answer pairs across 16 environmental topics, categorized by domain, difficulty, and type. This comprehensive dataset standardizes performance assessments in these fields, enabling consistent and objective comparisons of generative AI performance. By providing a dedicated evaluation tool, ELLE dataset promotes the development and application of generative AI technologies for sustainable environmental outcomes. The dataset and code are available at this https URL and this https URL.
zh

[NLP-74] PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control

【速读】: 该论文试图解决在语音合成(Text-to-Speech, TTS)中捕捉情感和风格等细微差别的问题。现有的深度神经网络架构虽然在模仿人类语音模式方面取得了显著进展,但在情感和风格的控制上仍面临挑战。为此,论文提出了一种基于提示(prompt-based)的情感控制方法。该解决方案的关键在于通过多说话者的情感和强度控制,结合大语言模型(Large Language Models, LLMs)来调节语音的韵律(prosody),同时保持语言内容的完整性。通过嵌入情感线索、调节强度水平,并使用提示引导韵律变化,该方法能够为合成语音注入更具人类表现力和多样性的特征。最终,论文通过系统探索上述控制机制,验证了该方法的有效性。

链接: https://arxiv.org/abs/2501.06276
作者: Shaozuo Zhang,Ambuj Mehrish,Yingting Li,Soujanya Poria
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech synthesis has significantly advanced from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in speech synthesis is challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multi-speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. By embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of the control mechanisms mentioned above.
zh

[NLP-75] Polarized Patterns of Language Toxicity and Sentiment of Debunking Posts on Social Media

【速读】: 该论文试图解决在线政治讨论中虚假信息和假新闻对民主进程和公众参与的负面影响问题,特别是辟谣(debunking)过程中语言毒性(language toxicity)、悲观情绪(pessimism)和社会极化(social polarization)之间的关系。通过分析2016年和2020年美国总统选举以及QAnon阴谋论相关的超过8600万条辟谣推文和400万条Reddit辟谣评论,研究揭示了三个关键发现:(1) 外围参与者(1-degree users)在塑造毒性话语中起主导作用,主要由于较低的社区责任感和情感表达;(2) 平台机制显著影响极化现象,Twitter放大了党派差异,而Reddit由于其结构化和社区驱动的互动方式,整体毒性更高;(3) 语言毒性与悲观情绪呈负相关,增加互动尤其能减少Reddit上的毒性。研究表明,平台架构影响用户互动的信息复杂性,Twitter促进集中、统一的讨论,而Reddit鼓励多样化、复杂的交流。这些发现为政策制定者和平台设计者提供了减轻有害影响、促进健康在线讨论的见解,对理解数字环境中的虚假信息、仇恨言论和政治极化具有重要意义。

链接: https://arxiv.org/abs/2501.06274
作者: Wentao Xu,Wenlu Fan,Shiqian Lu,Tenghao Li,Bin Wang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of misinformation and fake news in online political discourse poses significant challenges to democratic processes and public engagement. While debunking efforts aim to counteract misinformation and foster fact-based dialogue, these discussions often involve language toxicity and emotional polarization. We examined over 86 million debunking tweets and more than 4 million Reddit debunking comments to investigate the relationship between language toxicity, pessimism, and social polarization in debunking efforts. Focusing on discussions of the 2016 and 2020 U.S. presidential elections and the QAnon conspiracy theory, our analysis reveals three key findings: (1) peripheral participants (1-degree users) play a disproportionate role in shaping toxic discourse, driven by lower community accountability and emotional expression; (2) platform mechanisms significantly influence polarization, with Twitter amplifying partisan differences and Reddit fostering higher overall toxicity due to its structured, community-driven interactions; and (3) a negative correlation exists between language toxicity and pessimism, with increased interaction reducing toxicity, especially on Reddit. We show that platform architecture affects informational complexity of user interactions, with Twitter promoting concentrated, uniform discourse and Reddit encouraging diverse, complex communication. Our findings highlight the importance of user engagement patterns, platform dynamics, and emotional expressions in shaping polarization in debunking discourse. This study offers insights for policymakers and platform designers to mitigate harmful effects and promote healthier online discussions, with implications for understanding misinformation, hate speech, and political polarization in digital environments.
zh

[NLP-76] AgoraSpeech: A multi-annotated comprehensive dataset of political discourse through the lens of humans and AI

【速读】: 该论文旨在解决政治话语数据集稀缺的问题,特别是高质量、全面标注的数据集。现有的政治话语语料库虽然数量众多,但由于需要大量的人工努力、多学科知识和专业知识来进行修辞策略和意识形态背景的细致标注,高质量的数据集仍然匮乏。为此,论文提出了AgoraSpeech数据集,该数据集包含了2023年希腊全国选举期间六个政党的171篇政治演讲,并逐段标注了六项自然语言处理任务:文本分类、主题识别、情感分析、命名实体识别、极化检测和民粹主义检测。解决方案的关键在于采用了两步标注方法,首先使用ChatGPT生成初步标注,然后通过详尽的人工验证进行修正。这一数据集不仅为选举前期的分析提供了洞察,还为政治和社会科学家、记者以及数据科学家提供了丰富的信息源,同时也可用于自然语言处理和大语言模型的基准测试与微调。

链接: https://arxiv.org/abs/2501.06265
作者: Pavlos Sermpezis,Stelios Karamanidis,Eva Paraschou,Ilias Dimitriadis,Sofia Yfantidou,Filitsa-Ioanna Kouskouveli,Thanasis Troboukis,Kelly Kiki,Antonis Galanopoulos,Athena Vakali
机构: Data & Web Science Lab, School of Informatics, Aristotle University of Thessaloniki (亚里士多德大学信息学院数据与网络科学实验室); incubator for Media Education and Development (iMEdD) (媒体教育与发展的孵化器); School of Political Sciences, Aristotle University of Thessaloniki (亚里士多德大学政治科学学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Political discourse datasets are important for gaining political insights, analyzing communication strategies or social science phenomena. Although numerous political discourse corpora exist, comprehensive, high-quality, annotated datasets are scarce. This is largely due to the substantial manual effort, multidisciplinarity, and expertise required for the nuanced annotation of rhetorical strategies and ideological contexts. In this paper, we present AgoraSpeech, a meticulously curated, high-quality dataset of 171 political speeches from six parties during the Greek national elections in 2023. The dataset includes annotations (per paragraph) for six natural language processing (NLP) tasks: text classification, topic identification, sentiment analysis, named entity recognition, polarization and populism detection. A two-step annotation was employed, starting with ChatGPT-generated annotations and followed by exhaustive human-in-the-loop validation. The dataset was initially used in a case study to provide insights during the pre-election period. However, it has general applicability by serving as a rich source of information for political and social scientists, journalists, or data scientists, while it can be used for benchmarking and fine-tuning NLP and large language models (LLMs).
zh

[NLP-77] What Matters for In-Context Learning: A Balancing Act of Look-up and In-Weight Learning

【速读】: 该论文试图解决的问题是大型语言模型(LLMs)在上下文学习(In-Context Learning, ICL)中的关键机制尚不明确。尽管先前的研究已经探讨了预训练数据和模型架构的作用,但ICL的核心机制仍未得到充分理解。论文通过系统性地揭示支持ICL涌现的模型特性,提出了一种解决方案。其关键在于发现数据序列中的概念重复(conceptual repetitions)对ICL至关重要,这种重复可以是文本数据中的n-gram重复或图像序列数据中的精确图像复制。此外,论文还指出,ICL的涌现依赖于在训练过程中平衡模型权重学习目标与上下文解决能力。这一发现不仅揭示了ICL的核心机制,还提供了减少ICL性能波动的新见解。

链接: https://arxiv.org/abs/2501.06256
作者: Jelena Bratulić,Sudhanshu Mittal,Christian Rupprecht,Thomas Brox
机构: University of Freiburg(弗莱堡大学); University of Oxford(牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performance in various tasks, including In-Context Learning (ICL), where the model performs new tasks by conditioning solely on the examples provided in the context, without updating the model’s weights. While prior research has explored the roles of pretraining data and model architecture, the key mechanism behind ICL remains unclear. In this work, we systematically uncover properties present in LLMs that support the emergence of ICL. To disambiguate these factors, we conduct a study with a controlled dataset and data sequences using a deep autoregressive model. We show that conceptual repetitions in the data sequences are crucial for ICL, more so than previously indicated training data properties like burstiness or long-tail distribution. Conceptual repetitions could refer to n -gram repetitions in textual data or exact image copies in image sequence data. Such repetitions also offer other previously overlooked benefits such as reduced transiency in ICL performance. Furthermore, we show that the emergence of ICL depends on balancing the in-weight learning objective with the in-context solving ability during training.
zh

[NLP-78] Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

【速读】: 该论文试图解决稀疏自编码器(Sparse Autoencoders, SAEs)在提升大语言模型(Large Language Models, LLMs)可解释性时,传统性能指标(如均方误差和L0稀疏度)无法有效评估其语义表示能力的问题。具体而言,传统指标忽略了SAEs是否能够提取出可解释的单义特征(monosemantic features)并保持词语之间的语义关系。论文提出了一套针对多义词(polysemous words)的评估方法,旨在分析SAEs提取单义特征的质量。研究发现,仅优化MSE-L0帕累托前沿的SAEs可能会混淆可解释性,而未必能有效提取单义特征。通过分析多义词,论文揭示了LLMs内部机制,特别是深层网络和注意力模块在区分多义词中的作用。这一语义导向的评估为多义词研究和现有SAE目标的改进提供了新的见解,有助于开发更具实用性的SAEs。

链接: https://arxiv.org/abs/2501.06254
作者: Gouki Minegishi,Hiroki Furuta,Yusuke Iwasawa,Yutaka Matsuo
机构: The University of Tokyo (东京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) have gained a lot of attention as a promising tool to improve the interpretability of large language models (LLMs) by mapping the complex superposition of polysemantic neurons into monosemantic features and composing a sparse dictionary of words. However, traditional performance metrics like Mean Squared Error and L0 sparsity ignore the evaluation of the semantic representational power of SAEs – whether they can acquire interpretable monosemantic features while preserving the semantic relationship of words. For instance, it is not obvious whether a learned sparse feature could distinguish different meanings in one word. In this paper, we propose a suite of evaluations for SAEs to analyze the quality of monosemantic features by focusing on polysemous words. Our findings reveal that SAEs developed to improve the MSE-L0 Pareto frontier may confuse interpretability, which does not necessarily enhance the extraction of monosemantic features. The analysis of SAEs with polysemous words can also figure out the internal mechanism of LLMs; deeper layers and the Attention module contribute to distinguishing polysemy in a word. Our semantics focused evaluation offers new insights into the polysemy and the existing SAE objective and contributes to the development of more practical SAEs.
zh

[NLP-79] Transformer²: Self-adaptive LLMs

【速读】: 该论文试图解决传统微调方法(fine-tuning)在处理多样化任务时存在的计算密集性和静态适应能力不足的问题。传统方法通常需要对整个模型进行重新训练,导致计算资源消耗大且难以实时适应新任务。论文提出的解决方案 Transformer² 是一种新颖的自适应框架,通过选择性地仅调整权重矩阵的奇异成分(singular components),使大语言模型(LLMs)能够实时适应未见过的任务。其关键机制包括两阶段推理过程:首先,调度系统识别任务属性;其次,通过强化学习训练的任务特定“专家”向量被动态混合,以针对输入提示生成目标行为。该方法在参数更少、效率更高的情况下,优于常见的低秩适应(LoRA)等方法,并在不同LLM架构和多模态任务(如视觉-语言任务)中表现出广泛的适用性。Transformer² 为提升LLM的适应性和任务特定性能提供了一种可扩展且高效的解决方案,推动了真正动态、自组织AI系统的发展。

链接: https://arxiv.org/abs/2501.06252
作者: Qi Sun,Edoardo Cetin,Yujin Tang
机构: Sakana AI, Japan(日本); Institute of Science Tokyo, Japan(东京科学研究所, 日本)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 11 figures, 9 tables

点击查看摘要

Abstract:Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce Transformer², a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting only the singular components of their weight matrices. During inference, Transformer² employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific “expert” vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. Transformer² demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. Transformer² represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.
zh
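
下面用 PyTorch 勾勒“只调整权重矩阵奇异成分”这一核心思想:对权重做 SVD,用一个专家向量缩放奇异值后重建。缩放值为本文假设,实际系统中该向量由强化学习训练并按任务动态混合。

```python
import torch

W = torch.randn(256, 256)                  # 某一线性层的权重
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

z = torch.ones_like(S)
z[:32] = 1.5                               # “专家向量”:放大部分奇异方向(假设值)
W_adapted = U @ torch.diag(S * z) @ Vh     # 推理时按任务混合多个 z 即可切换行为

print(torch.dist(W, W_adapted))            # 只有被缩放的奇异成分发生变化
```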

[NLP-80] Utility-inspired Reward Transformations Improve Reinforcement Learning Training of Language Models

【速读】: 该论文试图解决在使用强化学习反馈训练大型语言模型(LLMs)时,现有方法通常采用多个奖励函数的线性平均(linear aggregation),而忽视了单个奖励维度的关键特征和奖励之间的相互依赖性,从而导致生成文本的次优结果。论文指出,线性聚合奖励存在一些脆弱性,可能导致生成文本出现不期望的特性。解决方案的关键在于提出了一种基于经济学理论中的效用函数(utility functions)——特别是Inada条件(Inada conditions)——的奖励函数转换方法。该方法增强了对低奖励值的敏感性,同时降低了对高奖励值的敏感性。通过与传统线性加权平均方法的对比,论文展示了Inada启发的奖励反馈在模型训练中的优越性,定量和定性分析表明,使用Inada转换训练的模型在生成文本时更具帮助性且更少有害性。

链接: https://arxiv.org/abs/2501.06248
作者: Roberto-Rafael Maura-Rivero,Chirag Nagpal,Roma Patel,Francesco Visin
机构: Google DeepMind; Google Research; London School of Economics (伦敦政治经济学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:Current methods that train large language models (LLMs) with reinforcement learning feedback, often resort to averaging outputs of multiple rewards functions during training. This overlooks crucial aspects of individual reward dimensions and inter-reward dependencies that can lead to sub-optimal outcomes in generations. In this work, we show how linear aggregation of rewards exhibits some vulnerabilities that can lead to undesired properties of generated text. We then propose a transformation of reward functions inspired by economic theory of utility functions (specifically Inada conditions), that enhances sensitivity to low reward values while diminishing sensitivity to already high values. We compare our approach to the existing baseline methods that linearly aggregate rewards and show how the Inada-inspired reward feedback is superior to traditional weighted averaging. We quantitatively and qualitatively analyse the difference in the methods, and see that models trained with Inada-transformations score as more helpful while being less harmful.
zh
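
为直观展示“对低奖励敏感、对高奖励钝化”的凹变换,下面给出一个满足 Inada 条件的示例 u(r)=√r(u'(0+)=∞,u'(∞)=0)。具体函数形式为本文选取的示意,并非论文采用的唯一形式。

```python
import numpy as np

def inada_transform(r: np.ndarray) -> np.ndarray:
    # u(r)=sqrt(r):u'(r)=1/(2*sqrt(r)),r→0 时斜率趋于无穷,r→∞ 时趋于 0
    return np.sqrt(np.clip(r, 0.0, None))

rewards = np.array([0.04, 1.0, 9.0])
print(inada_transform(rewards))            # [0.2 1.  3. ]:低奖励区的差距被放大
# 线性加权平均 vs 变换后再平均:低分奖励的改进在变换后占更大比重
print(rewards.mean(), inada_transform(rewards).mean())
```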

[NLP-81] A partition cover approach to tokenization

【速读】: 该论文试图解决自然语言处理(NLP)中的分词(Tokenization)问题,即将字符串编码为固定词汇表大小的标记(tokens)。当前主流的分词算法是字节对编码(Byte Pair Encoding, BPE),它将分词问题视为压缩问题,并通过一系列合并操作来解决。本文提出了一种新的优化目标,将分词问题形式化为一个优化问题,并通过简单的顶点覆盖(vertex cover)归约证明了该问题是NP难问题。为了解决这一问题,作者提出了一种多项式时间的贪心算法GreedTok。该算法的关键在于将分词问题自然地松弛为加权最大覆盖问题(weighted maximum coverage problem),并利用已有的(1 - 1/e)近似算法GreedWMC进行求解。通过在实际语料库上的实验评估,GreedTok在性能上优于BPE,并且在目标得分上与GreedWMC相当,尽管GreedWMC由于松弛可能获得更高的得分。

链接: https://arxiv.org/abs/2501.06246
作者: Jia Peng Lim,Davin Choo,Hady W. Lauw
机构: Singapore Management University(新加坡管理大学); Harvard University(哈佛大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注:

点击查看摘要

Abstract:Tokenization is the process of encoding strings into tokens from a fixed vocabulary of size k and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple (1 - 1/e) -approximation algorithm GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE, while achieving a comparable objective score as GreedWMC (which could have achieved a higher score due to relaxation).
zh
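
下面给出把分词松弛为加权最大覆盖之后的贪心选择示意:每轮挑选“新增覆盖权重”最大的候选词元,直到词表大小达到 k。候选词元及其覆盖集合的构造方式为本文的简化假设。

```python
def greedy_token_select(candidates: dict, k: int):
    """candidates: {词元: {(语料位置, 权重), ...}}。返回贪心选出的 k 个词元。"""
    vocab, covered = [], set()
    for _ in range(k):
        if not candidates:
            break
        # 只统计尚未被覆盖位置上的权重增量
        best = max(candidates,
                   key=lambda t: sum(w for pos, w in candidates[t]
                                     if pos not in covered))
        vocab.append(best)
        covered |= {pos for pos, w in candidates[best]}
        candidates.pop(best)
    return vocab

cands = {"th": {(0, 2), (10, 2)}, "the": {(0, 3)}, "he": {(1, 2), (11, 2)}}
print(greedy_token_select(cands, 2))
```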

[NLP-82] owards a scalable AI-driven framework for data-independent Cyber Threat Intelligence Information Extraction

【速读】: 该论文试图解决网络安全威胁情报(Cyber Threat Intelligence, CTI)信息提取(Information Extraction, IE)中高质量标注数据不足的问题。现有的AI驱动解决方案通常依赖于大量标注数据,但这些数据在实际应用中往往难以获取。为此,论文提出了0-CTI,一个基于AI的可扩展框架,旨在高效提取CTI信息。该框架的关键在于利用了先进的自然语言处理(Natural Language Processing, NLP)技术,特别是基于Transformer的架构,能够处理完整的CTI报告文本序列,提取出命名实体及其关系的网络本体。0-CTI的独特之处在于其支持监督学习和零样本学习(zero-shot learning),尤其是在零样本方法下,系统能够在没有标注数据的情况下进行实体和关系提取,从而适应各种数据可用性场景。此外,该框架的监督实体提取器在网络安全实体提取任务中超越了当前最先进的模型,展示了其在低资源和数据丰富环境中的双重优势。通过将系统输出与网络安全领域信息交换标准STIX(Structured Threat Information Expression)格式对齐,0-CTI标准化了提取的知识,增强了网络安全操作中的沟通与协作。

链接: https://arxiv.org/abs/2501.06239
作者: Olga Sorokoletova,Emanuele Antonioni,Giordano Colò
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cyber Threat Intelligence (CTI) is critical for mitigating threats to organizations, governments, and institutions, yet the necessary data are often dispersed across diverse formats. AI-driven solutions for CTI Information Extraction (IE) typically depend on high-quality, annotated data, which are not always available. This paper introduces 0-CTI, a scalable AI-based framework designed for efficient CTI Information Extraction. Leveraging advanced Natural Language Processing (NLP) techniques, particularly Transformer-based architectures, the proposed system processes complete text sequences of CTI reports to extract a cyber ontology of named entities and their relationships. Our contribution is the development of 0-CTI, the first modular framework for CTI Information Extraction that supports both supervised and zero-shot learning. Unlike existing state-of-the-art models that rely heavily on annotated datasets, our system enables fully dataless operation through zero-shot methods for both Entity and Relation Extraction, making it adaptable to various data availability scenarios. Additionally, our supervised Entity Extractor surpasses current state-of-the-art performance in cyber Entity Extraction, highlighting the dual strength of the framework in both low-resource and data-rich environments. By aligning the system’s outputs with the Structured Threat Information Expression (STIX) format, a standard for information exchange in the cybersecurity domain, 0-CTI standardizes extracted knowledge, enhancing communication and collaboration in cybersecurity operations.
zh

[NLP-83] Fitting Different Interactive Information: Joint Classification of Emotion and Intention

【速读】: 该论文旨在解决低资源多模态情感和意图识别(low-resource multimodal emotion and intention recognition)中的关键问题,特别是在如何有效利用大量未标注数据的同时,确保不同难度任务在交互阶段的相互促进。解决方案的关键在于:首先,通过对已标注数据训练的模型进行伪标签标注(pseudo-label labeling),选择高置信度的样本及其标签来缓解低资源问题;其次,利用实验中发现的意图识别易于表示的特性,使其与情感识别在不同注意力头(attention heads)下相互促进,并通过融合实现更高的意图识别性能。最终,在精细化处理数据的基础上,该方案在测试集上取得了0.5532的得分,赢得了该赛道的冠军。

链接: https://arxiv.org/abs/2501.06215
作者: Xinger Li,Zhiqiang Zhong,Bo Huang,Yang Yang
机构: Nanjing University of Science and Technology (南京理工大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:This paper is the first-place solution for ICASSP MEIJU@2025 Track I, which focuses on low-resource multimodal emotion and intention recognition. How to effectively utilize a large amount of unlabeled data, and how to ensure that tasks of different difficulty levels promote each other in the interaction stage, are the two keys to the competition. In this paper, pseudo-labeling is carried out with the model trained on labeled data, and samples with high confidence, together with their labels, are selected to alleviate the low-resource problem. At the same time, the experimental finding that intention is comparatively easy to represent is used to let intention recognition and emotion recognition promote each other under different attention heads, and higher intention recognition performance is achieved through fusion. Finally, on the refined data, we achieve a score of 0.5532 on the Test set and win the championship of the track.
zh
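
下面用 NumPy 勾勒“高置信度伪标签筛选”这一步:只有最大类别概率超过阈值的未标注样本才连同伪标签并入训练集。阈值为本文假设。

```python
import numpy as np

def select_pseudo_labels(probs: np.ndarray, threshold=0.95):
    """probs: (N, C) 的类别概率。返回(被选样本下标, 对应伪标签)。"""
    conf = probs.max(axis=1)        # 每个样本的最大类别概率
    labels = probs.argmax(axis=1)   # 伪标签
    keep = conf >= threshold
    return np.where(keep)[0], labels[keep]

probs = np.array([[0.97, 0.02, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.94, 0.01]])
idx, pseudo = select_pseudo_labels(probs, threshold=0.9)
print(idx, pseudo)   # [0 2] [0 1]:仅高置信度样本进入下一轮训练
```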

[NLP-84] FLAME: Financial Large-Language Model Assessment and Metrics Evaluation

【速读】: 该论文旨在解决金融领域大语言模型(LLMs)在中文环境下的全面评估问题。尽管越来越多的金融LLMs被引入用于特定金融任务,但如何全面评估其价值仍然具有挑战性。为此,论文提出了FLAME(Financial LLMs Evaluation System),一个专门针对中文金融LLMs的评估系统。FLAME的核心包括两个评估基准:FLAME-Cer和FLAME-Sce。FLAME-Cer涵盖了14种权威金融认证(如CPA、CFA、FRM等),包含约16,000道经过人工审核的精选题目,确保其准确性和代表性。FLAME-Sce则包含10个主要核心金融业务场景、21个次要金融业务场景以及近100个三级金融应用任务的综合评估集。通过这一系统,论文评估了包括GPT-4o、GLM-4、ERNIE-4.0、Qwen2.5、XuanYuan3和Baichuan4-Finance在内的6个代表性LLMs,发现Baichuan4-Finance在大多数任务中表现优异。FLAME的建立为中文金融LLMs的进一步发展提供了全面且专业的评估框架。

Link: https://arxiv.org/abs/2501.06211
Authors: Jiayu Guo, Yu Guo, Martha Li, Songtao Tan
Affiliations: The School of Finance, Renmin University of China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:LLMs have revolutionized NLP and demonstrated potential across diverse domains. More and more financial LLMs have been introduced for finance-specific tasks, yet comprehensively assessing their value is still challenging. In this paper, we introduce FLAME, a comprehensive financial LLMs evaluation system in Chinese, which includes two core evaluation benchmarks: FLAME-Cer and FLAME-Sce. FLAME-Cer covers 14 types of authoritative financial certifications, including CPA, CFA, and FRM, with a total of approximately 16,000 carefully selected questions. All questions have been manually reviewed to ensure accuracy and representativeness. FLAME-Sce consists of 10 primary core financial business scenarios, 21 secondary financial business scenarios, and a comprehensive evaluation set of nearly 100 tertiary financial application tasks. We evaluate 6 representative LLMs, including GPT-4o, GLM-4, ERNIE-4.0, Qwen2.5, XuanYuan3, and the latest Baichuan4-Finance, revealing Baichuan4-Finance excels other LLMs in most tasks. By establishing a comprehensive and professional evaluation system, FLAME facilitates the advancement of financial LLMs in Chinese contexts. Instructions for participating in the evaluation are available on GitHub: this https URL.

[NLP-85] Applications of natural language processing in aviation safety: A review and qualitative analysis

[Quick Read]: This paper explores how Natural Language Processing (NLP) can enhance aviation safety, focusing on applications of machine learning algorithms to strengthen safety measures. By analyzing the existing literature, it reveals trends in the methodologies, findings, and implications of NLP in aviation safety, summarizes research motivations, objectives, and outcomes, and shows NLP's potential for identifying critical safety issues and improving aviation safety. The paper also identifies research gaps, suggests directions for future exploration, and offers practical recommendations for the aviation industry. The key lies in overcoming the challenges of applying NLP to aviation safety, such as the need for large annotated datasets and the difficulty of interpreting complex models; proposed solutions include active learning for data annotation and explainable AI for model interpretation, with case studies demonstrating successful applications of NLP to improving aviation safety.

Link: https://arxiv.org/abs/2501.06210
Authors: Aziida Nanyonga, Keith Joiner, Ugur Turhan, Graham Wild
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This study explores using Natural Language Processing in aviation safety, focusing on machine learning algorithms to enhance safety measures. As of May 2024, there are 34 Scopus results for the keyword search "natural language processing and aviation safety." Analyzing these studies allows us to uncover trends in the methodologies, findings and implications of NLP in aviation. Both qualitative and quantitative tools have been used to investigate the current state of literature on NLP for aviation safety. The qualitative analysis summarises the research motivations, objectives, and outcomes, showing how NLP can be utilized to help identify critical safety issues and improve aviation safety. This study also identifies research gaps and suggests areas for future exploration, providing practical recommendations for the aviation industry. We discuss challenges in implementing NLP in aviation safety, such as the need for large, annotated datasets, and the difficulty in interpreting complex models. We propose solutions like active learning for data annotation and explainable AI for model interpretation. Case studies demonstrate the successful application of NLP in improving aviation safety, highlighting its potential to make aviation safer and more efficient.

[NLP-86] Enhancing AI Safety Through the Fusion of Low Rank Adapters

[Quick Read]: This paper tackles the problem that instruction fine-tuning of large language models (LLMs) can inadvertently lead to harmful responses when models face malicious prompts. The proposed remedy is Low-Rank Adapter (LoRA) fusion: fusing a task adapter with a safety adapter to lower the probability of harmful responses. The safety adapter is trained specifically on a safety dataset so that the model can effectively filter harmful content when confronted with malicious prompts. Experiments show that LoRA fusion reduces the harmfulness rate by 42%. However, the method also induces exaggerated safety behaviour: the model may reject safe prompts that closely resemble unsafe ones.

Link: https://arxiv.org/abs/2501.06208
Authors: Satya Swaroop Gudipudi, Sreeram Vipparla, Harpreet Singh, Shashwat Goel, Ponnurangam Kumaraguru
Affiliations: IIIT Hyderabad; NSUT Delhi
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Instruction fine-tuning of large language models (LLMs) is a powerful method for improving task-specific performance, but it can inadvertently lead to a phenomenon where models generate harmful responses when faced with malicious prompts. In this paper, we explore Low-Rank Adapter Fusion (LoRA) as a means to mitigate these risks while preserving the model’s ability to handle diverse instructions effectively. Through an extensive comparative analysis against established baselines using recognized benchmark datasets, we demonstrate a 42% reduction in the harmfulness rate by leveraging LoRA fusion between a task adapter and a safety adapter, the latter of which is specifically trained on our safety dataset. However, we also observe exaggerated safety behaviour, where the model rejects safe prompts that closely resemble unsafe ones.
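
For intuition, here is a minimal sketch of one way to fuse a task LoRA with a safety LoRA: take a convex combination of their low-rank weight updates before adding them to the base weight. The tensor names, dimensions, and the mixing weight alpha are illustrative assumptions; the paper's fusion procedure may differ in detail.

```python
import torch

def fuse_lora_deltas(task_A, task_B, safe_A, safe_B, alpha=0.5):
    """Fuse a task LoRA and a safety LoRA into one weight update.

    Each adapter contributes a low-rank delta W = B @ A; this simple
    fusion takes a convex combination of the two deltas.
    """
    delta_task = task_B @ task_A  # (out, in) update of the task adapter
    delta_safe = safe_B @ safe_A  # (out, in) update of the safety adapter
    return alpha * delta_task + (1 - alpha) * delta_safe

# Toy example: a 16x32 weight with rank-4 adapters.
out_dim, in_dim, r = 16, 32, 4
task_A, task_B = torch.randn(r, in_dim), torch.randn(out_dim, r)
safe_A, safe_B = torch.randn(r, in_dim), torch.randn(out_dim, r)

base_weight = torch.randn(out_dim, in_dim)
fused_weight = base_weight + fuse_lora_deltas(task_A, task_B, safe_A, safe_B)
print(fused_weight.shape)  # torch.Size([16, 32])
```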

[NLP-87] Leveraging Edge Intelligence and LLMs to Advance 6G-Enabled Internet of Automated Defense Vehicles

[Quick Read]: This paper explores applications of Artificial Intelligence (AI) and its subset Deep Learning (DL) to military autonomous driving, in particular how advanced decision-making models and Ultra-Reliable Low Latency Communication (URLLC) can improve the precision and safety of autonomous systems in hazardous environments. The core problem is how to guarantee seamless coordination, real-time data exchange, and instantaneous responses to dynamic driving environments among many interconnected autonomous vehicles in mission-critical scenarios. The key of the solution is to use 6G to strengthen connectivity in the Internet of Automated Defense Vehicles (IoADV) and to use pre-trained generative large language models (LLMs) to optimize decision-making and communication. Combining these technologies promises to realize the full potential of autonomous systems in military defense, especially for mission efficiency and safety.

Link: https://arxiv.org/abs/2501.06205
Authors: Murat Arda Onsu, Poonam Lohan, Burak Kantarci
Affiliations: School of Electrical Engineering and Computer Science, University of Ottawa
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 5 figures, under (secondary/revised) review in IEEE Internet of Things Magazine

Abstract:The evolution of Artificial Intelligence (AI) and its subset Deep Learning (DL), has profoundly impacted numerous domains, including autonomous driving. The integration of autonomous driving in military settings reduces human casualties and enables precise and safe execution of missions in hazardous environments while allowing for reliable logistics support without the risks associated with fatigue-related errors. However, relying on autonomous driving solely requires an advanced decision-making model that is adaptable and optimum in any situation. Considering the presence of numerous interconnected autonomous vehicles in mission-critical scenarios, Ultra-Reliable Low Latency Communication (URLLC) is vital for ensuring seamless coordination, real-time data exchange, and instantaneous response to dynamic driving environments. The advent of 6G strengthens the Internet of Automated Defense Vehicles (IoADV) concept within the realm of Internet of Military Defense Things (IoMDT) by enabling robust connectivity, crucial for real-time data exchange, advanced navigation, and enhanced safety features through IoADV interactions. On the other hand, a critical advancement in this space is using pre-trained Generative Large Language Models (LLMs) for decision-making and communication optimization for autonomous driving. Hence, this work presents opportunities and challenges with a vision of realizing the full potential of these technologies in critical defense applications, especially through the advancement of IoADV and its role in enhancing autonomous military operations.

[NLP-88] A Novel Task-Driven Method with Evolvable Interactive Agents Using Event Trees for Enhanced Emergency Decision Support

[Quick Read]: This paper addresses the limitations of human-driven strategies for unforeseen emergencies against the backdrop of climate change and other global challenges; in particular, inadequate pre-established emergency plans can overwhelm operators during complex system malfunctions. It proposes EvoTaskTree, a task-driven method with evolvable interactive agents using event trees for emergency decision support. The key of the solution is to integrate two types of agents powered by large language models (LLMs): task executors, which carry out critical procedures, and task validators, which ensure the efficacy of those actions. Drawing on event tree analysis, the framework covers three crucial tasks: initiating event and subevent analysis, event tree header event analysis, and decision recommendations, with the agents learning from both successful and unsuccessful responses. Demonstrated on nuclear power plants as a safety-critical system, the designed agents are not only effective but outperform existing approaches, reaching up to 100% accuracy on previously unencountered incident scenarios; EvoTaskTree thus significantly enhances the rapid formulation of emergency decisions.

Link: https://arxiv.org/abs/2501.06193
Authors: Xingyu Xiao, Peng Chen, Ben Qi, Jingang Liang, Jiejuan Tong, Haitao Wang
Affiliations: Tsinghua University; University of Chinese Academy of Sciences
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:As climate change and other global challenges increase the likelihood of unforeseen emergencies, the limitations of human-driven strategies in critical situations become more pronounced. Inadequate pre-established emergency plans can lead operators to become overwhelmed during complex system malfunctions. This study addresses the urgent need for agile decision-making in response to various unforeseen incidents through a novel approach, EvoTaskTree (a task-driven method with evolvable interactive agents using event trees for emergency decision support). This advanced approach integrates two types of agents powered by large language models (LLMs): task executors, responsible for executing critical procedures, and task validators, ensuring the efficacy of those actions. By leveraging insights from event tree analysis, our framework encompasses three crucial tasks: initiating event and subevent analysis, event tree header event analysis, and decision recommendations. The agents learn from both successful and unsuccessful responses from these tasks. Finally, we use nuclear power plants as a demonstration of a safety-critical system. Our findings indicate that the designed agents are not only effective but also outperform existing approaches, achieving an impressive accuracy rate of up to 100% in processing previously unencountered incident scenarios. This paper demonstrates that EvoTaskTree significantly enhances the rapid formulation of emergency decision-making.

[NLP-89] A Multimodal Social Agent

[Quick Read]: This paper asks how computers can be combined with social capabilities to automate and improve social content analysis. Specifically, it proposes MuSA, a multimodal LLM-based agent for human-centric content analysis tasks such as question answering, visual question answering, title generation, and categorization. The key of the solution is that MuSA uses planning, reasoning, acting, optimizing, criticizing, and refining strategies to complete a task. In this way, MuSA performs substantially better than the baselines on question answering, title generation, and content categorization, showing its potential for automating social content analysis and assisting decision-making across various applications.

Link: https://arxiv.org/abs/2501.06189
Authors: Athina Bikaki, Ioannis A. Kakadiaris
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages

Abstract:In recent years, large language models (LLMs) have demonstrated remarkable progress in common-sense reasoning tasks. This ability is fundamental to understanding social dynamics, interactions, and communication. However, the potential of integrating computers with these social capabilities is still relatively unexplored. This paper introduces MuSA, a multimodal LLM-based agent that analyzes text-rich social content tailored to address selected human-centric content analysis tasks, such as question answering, visual question answering, title generation, and categorization. It uses planning, reasoning, acting, optimizing, criticizing, and refining strategies to complete a task. Our approach demonstrates that MuSA can automate and improve social content analysis, helping decision-making processes across various applications. We have evaluated our agent’s capabilities in question answering, title generation, and content categorization tasks. MuSA performs substantially better than our baselines.

[NLP-90] Causal Claims in Economics WWW

[Quick Read]: This paper analyzes the rise of causal inference methods (e.g., DiD, IV, RDD, RCTs) in economics and their effects on academic recognition and long-run scholarly impact. Analyzing more than 44,000 NBER and CEPR working papers from 1980 to 2023, the authors use a custom language model to construct knowledge graphs that map economic concepts and their relationships. The share of causal claims rose substantially, from roughly 4% in 1990 to nearly 28% in 2020, reflecting the growing influence of the "credibility revolution." The key of the solution is to distinguish general claims from claims documented via causal inference methods and to examine how the complexity of causal narratives (e.g., the depth of causal chains) affects top-5 journal publication and citation counts. The results show that causal complexity strongly predicts both publication in top journals and higher citations, whereas non-causal complexity is uncorrelated or negatively associated with these outcomes; novelty matters for top-5 publication only when grounded in credible causal methods, while non-causal novelty exhibits weak or even negative effects. Overall, methodological rigor and causal innovation are key drivers of academic recognition, but long-run impact may require balancing novel contributions with conceptual integration into established economic discourse.

Link: https://arxiv.org/abs/2501.06873
Authors: Prashant Garg, Thiemo Fetzer
Affiliations: Imperial College London; University of Warwick; Bonn; CEPR; CAGE; NIESR; ECONtribute; Grantham Institute
Subjects: General Economics (econ.GN); Computation and Language (cs.CL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI); Methodology (stat.ME)
Comments: For data, interactive tools, and additional project information, visit this https URL . The website contains resources such as data downloads, interactive author and paper-level knowledge graphs, and more

Abstract:We analyze over 44,000 NBER and CEPR working papers from 1980 to 2023 using a custom language model to construct knowledge graphs that map economic concepts and their relationships. We distinguish between general claims and those documented via causal inference methods (e.g., DiD, IV, RDD, RCTs). We document a substantial rise in the share of causal claims-from roughly 4% in 1990 to nearly 28% in 2020-reflecting the growing influence of the “credibility revolution.” We find that causal narrative complexity (e.g., the depth of causal chains) strongly predicts both publication in top-5 journals and higher citation counts, whereas non-causal complexity tends to be uncorrelated or negatively associated with these outcomes. Novelty is also pivotal for top-5 publication, but only when grounded in credible causal methods: introducing genuinely new causal edges or paths markedly increases both the likelihood of acceptance at leading outlets and long-run citations, while non-causal novelty exhibits weak or even negative effects. Papers engaging with central, widely recognized concepts tend to attract more citations, highlighting a divergence between factors driving publication success and long-term academic impact. Finally, bridging underexplored concept pairs is rewarded primarily when grounded in causal methods, yet such gap filling exhibits no consistent link with future citations. Overall, our findings suggest that methodological rigor and causal innovation are key drivers of academic recognition, but sustained impact may require balancing novel contributions with conceptual integration into established economic discourse.
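
One concrete reading of "causal narrative complexity" is the depth of the longest causal chain in a paper's claim graph. The sketch below builds a tiny directed graph of hypothetical causal edges (not the authors' data) and measures that depth with networkx.

```python
import networkx as nx

# Hypothetical causal edges extracted from a paper's claims;
# invented for illustration, not taken from the study.
causal_edges = [
    ("schooling reform", "years of education"),
    ("years of education", "earnings"),
    ("earnings", "health outcomes"),
]

G = nx.DiGraph(causal_edges)

# One simple notion of causal narrative complexity:
# the length of the longest causal chain in the claim graph.
if nx.is_directed_acyclic_graph(G):
    longest_chain = nx.dag_longest_path(G)
    print("depth:", len(longest_chain) - 1)          # 3 edges
    print("chain:", " -> ".join(longest_chain))
```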

[NLP-91] Correcting Annotator Bias in Training Data: Population-Aligned Instance Replication (PAIR)

[Quick Read]: This paper addresses the problem that models trained on crowdsourced labels may not reflect broader population views when annotator pools are not representative: when annotators with different labeling tendencies are unevenly represented in the training data, models become poorly calibrated. The proposed solution is Population-Aligned Instance Replication (PAIR), which duplicates labels from underrepresented annotator groups so that their proportions match the target population, significantly reducing bias without any new data collection. The key idea borrows statistical adjustment techniques from survey research to align model training with target populations even when representative annotator pools are unavailable.

Link: https://arxiv.org/abs/2501.06826
Authors: Stephanie Eckman, Bolei Ma, Christoph Kern, Rob Chew, Barbara Plank, Frauke Kreuter
Affiliations: University of Maryland, College Park; LMU Munich & Munich Center for Machine Learning; RTI International
Subjects: Methodology (stat.ME); Computation and Language (cs.CL)
Comments:

Abstract:Models trained on crowdsourced labels may not reflect broader population views when annotator pools are not representative. Since collecting representative labels is challenging, we propose Population-Aligned Instance Replication (PAIR), a method to address this bias through statistical adjustment. Using a simulation study of hate speech and offensive language detection, we create two types of annotators with different labeling tendencies and generate datasets with varying proportions of the types. Models trained on unbalanced annotator pools show poor calibration compared to those trained on representative data. However, PAIR, which duplicates labels from underrepresented annotator groups to match population proportions, significantly reduces bias without requiring new data collection. These results suggest statistical techniques from survey research can help align model training with target populations even when representative annotator pools are unavailable. We conclude with three practical recommendations for improving training data quality.
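
The replication step itself is simple to sketch: compute how under-represented each annotator group is relative to the target population and duplicate its rows accordingly. The toy proportions below are made up, and the probabilistic rounding is our own choice; PAIR's exact bookkeeping may differ.

```python
import math
import random

def pair_replicate(rows, target_shares, seed=0):
    """Schematic Population-Aligned Instance Replication (PAIR):
    duplicate rows from underrepresented annotator groups until
    group shares match the target population (no row is dropped)."""
    rng = random.Random(seed)
    counts = {}
    for r in rows:
        counts[r["group"]] = counts.get(r["group"], 0) + 1
    # Scale factors so the most over-represented group keeps factor 1.
    ratio = {g: target_shares[g] / counts[g] for g in counts}
    base = min(ratio.values())
    out = []
    for r in rows:
        factor = ratio[r["group"]] / base
        copies = math.floor(factor)
        if rng.random() < factor - copies:  # probabilistic rounding
            copies += 1
        out.extend([r] * copies)
    rng.shuffle(out)
    return out

# 80/20 annotator pool, 50/50 target population; toy numbers.
rows = [{"id": i, "group": "A"} for i in range(80)]
rows += [{"id": 100 + i, "group": "B"} for i in range(20)]
aligned = pair_replicate(rows, {"A": 0.5, "B": 0.5})
print(sum(r["group"] == "B" for r in aligned), "B rows of", len(aligned))
```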

[NLP-92] Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis ICASSP2025

[Quick Read]: This paper addresses cross-lingual phonetic representation for low-resource languages in speech processing, in particular how to improve performance on a target low-resource language through effective source-language selection. Previous cross-lingual work used various source languages without thorough consideration of the selection. The key of the solution is an in-depth analysis of language selection, supported by a practical approach for assessing phonetic proximity across multiple language families. The study shows that multilingual training with phonologically similar languages yields a 55.6% relative improvement over monolingual training on the phoneme recognition task, even surpassing a large-scale self-supervised learning model; within a language family, higher phonological similarity enhances performance, while lower similarity degrades it relative to monolingual training.

Link: https://arxiv.org/abs/2501.06810
Authors: Minu Kim, Kangwook Jang, Hoirin Kim
Affiliations: KAIST
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 10 pages, 5 figures, accepted to ICASSP 2025

Abstract:This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages, emphasizing effective source language selection. Previous cross-lingual research has used various source languages to enhance performance for the target low-resource language without thorough consideration of selection. Our study stands out by providing an in-depth analysis of language selection, supported by a practical approach to assess phonetic proximity among multiple language families. We investigate how within-family similarity impacts performance in multilingual training, which aids in understanding language dynamics. We also evaluate the effect of using phonologically similar languages, regardless of family. For the phoneme recognition task, utilizing phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training, even surpassing the performance of a large-scale self-supervised learning model. Multilingual training within the same language family demonstrates that higher phonological similarity enhances performance, while lower similarity results in degraded performance compared to monolingual training.
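
As a rough illustration of ranking candidate source languages by phonological similarity, one crude proxy is Jaccard overlap of phoneme inventories. The paper's proximity measure is more careful than this, and the tiny inventories below are invented.

```python
def phoneme_jaccard(inv_a, inv_b):
    """Crude phonetic-proximity proxy: Jaccard overlap of two
    phoneme inventories. A stand-in, not the paper's measure."""
    a, b = set(inv_a), set(inv_b)
    return len(a & b) / len(a | b)

# Tiny illustrative inventories (IPA-like symbols), not real data.
inventories = {
    "lang_x": {"p", "t", "k", "m", "n", "s", "a", "i", "u"},
    "lang_y": {"p", "t", "k", "b", "d", "m", "n", "a", "i", "u"},
    "lang_z": {"q", "x", "r", "l", "e", "o"},
}

target = "lang_x"
scores = {l: phoneme_jaccard(inventories[target], inv)
          for l, inv in inventories.items() if l != target}
# Pick the most similar languages as multilingual training sources.
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```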

[NLP-93] The Magnitude of Categories of Texts Enriched by Language Models

[Quick Read]: This paper does two things. First, using the next-token probabilities given by a language model, it explicitly defines a $[0,1]$-enrichment of a category of texts in natural language, considers the terminating conditions for text generation, and determines when the enrichment can be interpreted as a probability distribution over texts. Second, it computes the Möbius function and the magnitude of an associated generalized metric space of texts, using a combinatorial version of these quantities recently introduced by Vigneaux. The magnitude function $f(t)$ is a sum over texts $x$ (prompts) of the Tsallis $t$-entropies of the next-token probability distributions $p(-|x)$, plus the cardinality of the model's possible outputs; the derivative of $f$ at $t=1$ recovers a sum of Shannon entropies, which justifies viewing magnitude as a partition function. Following Leinster and Schulman, the paper also expresses the magnitude function as an Euler characteristic of magnitude homology and explicitly describes the zeroth and first magnitude homology groups. The key is to use the language model's probability distributions together with combinatorial tools to quantify the information content and structural properties of text generation.

Link: https://arxiv.org/abs/2501.06662
Authors: Tai-Danae Bradley, Juan Pablo Vigneaux
Affiliations: SandboxAQ; Department of Mathematics, The Master's University; Department of Mathematics, California Institute of Technology
Subjects: Category Theory (math.CT); Computation and Language (cs.CL)
Comments:

Abstract:The purpose of this article is twofold. Firstly, we use the next-token probabilities given by a language model to explicitly define a $[0,1]$-enrichment of a category of texts in natural language, in the sense of Bradley, Terilla, and Vlassopoulos. We consider explicitly the terminating conditions for text generation and determine when the enrichment itself can be interpreted as a probability over texts. Secondly, we compute the Möbius function and the magnitude of an associated generalized metric space $\mathcal{M}$ of texts using a combinatorial version of these quantities recently introduced by Vigneaux. The magnitude function $f(t)$ of $\mathcal{M}$ is a sum over texts $x$ (prompts) of the Tsallis $t$-entropies of the next-token probability distributions $p(-|x)$ plus the cardinality of the model's possible outputs. The derivative of $f$ at $t=1$ recovers a sum of Shannon entropies, which justifies seeing magnitude as a partition function. Following Leinster and Schulman, we also express the magnitude function of $\mathcal{M}$ as an Euler characteristic of magnitude homology and provide an explicit description of the zeroth and first magnitude homology groups.
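
For readers who want the formulas at a glance, here is a schematic restatement in LaTeX. The notation ($A$ for the set of possible outputs, $S_t$ for the Tsallis entropy) is shorthand of ours, and the paper itself works with Vigneaux's combinatorial version of these quantities, which is not reproduced here.

```latex
% Standard Tsallis t-entropy of the next-token distribution p(.|x),
% which recovers Shannon entropy in the limit t -> 1:
\[
  S_t\bigl(p(\cdot\mid x)\bigr)
    = \frac{1}{t-1}\Bigl(1 - \sum_{y} p(y\mid x)^{t}\Bigr),
  \qquad
  \lim_{t\to 1} S_t(p) = -\sum_{y} p(y)\log p(y) = H(p).
\]
% Per the abstract, the magnitude function is, schematically,
%   f(t) = |A| + \sum_x S_t(p(.|x)),
% a sum over prompts x plus the cardinality of the output set A.
```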

[NLP-94] Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives ICASSP2025

[Quick Read]: This paper develops automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. Since oral narratives provide a way to assess children's language development before they learn to read, the authors examine a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speech, they find that additional in-domain adult data (adult speech matching the story domain) provides the biggest improvement, especially when coupled with voice conversion. Semi-supervised learning helps for both languages, while parameter-efficient fine-tuning helps on Afrikaans but not on isiXhosa, which is under-represented in the Whisper model. The work fills a gap in ASR research on non-English data and on preschool children aged 4 and 5, validating a wide range of previous child-speech ASR strategies in an under-explored setting.

Link: https://arxiv.org/abs/2501.06478
Authors: Christiaan Jacobs, Annelien Smith, Daleen Klop, Ondřej Klejch, Febe de Wet, Herman Kamper
Affiliations: Stellenbosch University, South Africa; University of Edinburgh, United Kingdom
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to ICASSP 2025

Abstract:We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. Oral narratives provide a way to assess children’s language development before they learn to read. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speech, we find that additional in-domain adult data (adult speech matching the story domain) provides the biggest improvement, especially when coupled with voice conversion. Semi-supervised learning also helps for both languages, while parameter-efficient fine-tuning helps on Afrikaans but not on isiXhosa (which is under-represented in the Whisper model). Few child-speech studies look at non-English data, and even fewer at the preschool ages of 4 and 5. Our work therefore represents a unique validation of a wide range of previous child-speech ASR strategies in an under-explored setting.

[NLP-95] TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer ICASSP2025

[Quick Read]: This paper addresses two key problems in text-to-speech (TTS) generation: how to avoid an explicit duration predictor, and how to efficiently handle the complexity of predicting multiple codebook tokens per frame, as required by audio codec models with residual quantizers. The core of the solution is a novel architecture, TTS-Transducer, which combines the strengths of neural transducers and audio codec models. Specifically, TTS-Transducer first uses a transducer architecture to learn monotonic alignments between text tokens and the speech codec tokens of the first codebook, thereby avoiding an explicit duration predictor. A non-autoregressive Transformer then predicts the remaining codebooks using the alignment extracted from the transducer loss. The whole system is trained end-to-end and is shown to be a competitive and robust alternative to contemporary TTS systems.

Link: https://arxiv.org/abs/2501.06320
Authors: Vladimir Bataev, Subhankar Ghosh, Vitaly Lavrukhin, Jason Li
Affiliations: NVIDIA
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted by ICASSP 2025

Abstract:This work introduces TTS-Transducer - a novel architecture for text-to-speech, leveraging the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are employed to learn monotonic alignments and allow for avoiding using explicit duration predictors. Neural audio codecs efficiently compress audio into discrete codes, revealing the possibility of applying text modeling approaches to speech generation. However, the complexity of predicting multiple tokens per frame from several codebooks, as necessitated by audio codec models with residual quantizers, poses a significant challenge. The proposed system first uses a transducer architecture to learn monotonic alignments between tokenized text and speech codec tokens for the first codebook. Next, a non-autoregressive Transformer predicts the remaining codes using the alignment extracted from transducer loss. The proposed system is trained end-to-end. We show that TTS-Transducer is a competitive and robust alternative to contemporary TTS systems.

Computer Vision

[CV-0] Dataset Distillation via Committee Voting

[Quick Read]: This paper addresses the key problem in dataset distillation: how to synthesize a smaller but representative dataset that preserves the essential properties of the original data, enabling efficient model training with reduced computational resources. Existing work mainly focuses on improving the alignment or matching between original and synthetic data, or on making large-scale distillation more efficient. This paper proposes a novel and orthogonal approach, Committee Voting for Dataset Distillation (CV-DD), whose key is to leverage the collective wisdom of multiple models or experts to create high-quality distilled datasets. By integrating distributions and predictions from a committee of models and generating high-quality soft labels, the method captures a wider spectrum of data features and reduces model-specific biases and the adverse effects of distribution shifts, leading to significant improvements in generalization. The voting-based strategy also promotes diversity and robustness within the distilled dataset and significantly reduces overfitting, improving performance on post-evaluation tasks. Experiments across various datasets and IPC (images per class) settings show that committee voting yields more reliable and adaptable distilled data than single- or multi-model distillation methods, demonstrating its potential for efficient and accurate dataset distillation.

Link: https://arxiv.org/abs/2501.07575
Authors: Jiacheng Cui, Zhaoyi Li, Xiaochen Ma, Xinyue Bi, Yaxin Luo, Zhiqiang Shen
Affiliations: VILA Lab, MBZUAI; University of Ottawa; Technical University of Denmark
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code at: this https URL

Abstract:Dataset distillation aims to synthesize a smaller, representative dataset that preserves the essential properties of the original data, enabling efficient model training with reduced computational resources. Prior work has primarily focused on improving the alignment or matching process between original and synthetic data, or on enhancing the efficiency of distilling large datasets. In this work, we introduce Committee Voting for Dataset Distillation (CV-DD), a novel and orthogonal approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets. We start by showing how to establish a strong baseline that already achieves state-of-the-art accuracy through leveraging recent advancements and thoughtful adjustments in model design and optimization processes. By integrating distributions and predictions from a committee of models while generating high-quality soft labels, our method captures a wider spectrum of data features, reduces model-specific biases and the adverse effects of distribution shifts, leading to significant improvements in generalization. This voting-based strategy not only promotes diversity and robustness within the distilled dataset but also significantly reduces overfitting, resulting in improved performance on post-eval tasks. Extensive experiments across various datasets and IPCs (images per class) demonstrate that Committee Voting leads to more reliable and adaptable distilled data compared to single/multi-model distillation methods, demonstrating its potential for efficient and accurate dataset distillation. Code is available at: this https URL.
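
A minimal sketch of the soft-label side of committee voting: average the temperature-softened predictions of several models. The toy model architectures, the temperature, and the plain averaging rule are illustrative assumptions; the paper's aggregation may be weighted differently.

```python
import torch
import torch.nn.functional as F

def committee_soft_labels(models, images, temperature=1.0):
    """Average the softened predictions of a committee of models to
    obtain soft labels for distilled images. A schematic reading of
    'committee voting'; the paper's exact aggregation may differ."""
    probs = []
    with torch.no_grad():
        for m in models:
            m.eval()
            logits = m(images)
            probs.append(F.softmax(logits / temperature, dim=-1))
    return torch.stack(probs).mean(dim=0)  # (batch, num_classes)

# Toy committee of two linear classifiers over flattened 8x8 images.
models = [torch.nn.Linear(64, 10) for _ in range(2)]
images = torch.randn(4, 64)
soft = committee_soft_labels(models, images, temperature=2.0)
print(soft.shape, soft.sum(dim=-1))  # each row sums to 1
```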

[CV-1] UnCommon Objects in 3D

[Quick Read]: This paper addresses the lack of high-quality, diverse datasets with comprehensive 3D annotations for 3D deep learning and 3D generative AI. It introduces Uncommon Objects in 3D (uCO3D), a new object-centric dataset that is the largest publicly available collection of high-resolution videos of objects with 3D annotations, covering more than 1,000 object categories with full 360° coverage. Compared with MVImgNet and CO3Dv2, uCO3D is significantly more diverse and of higher quality, thanks to extensive quality checks of both the collected videos and the 3D annotations. Besides annotations of 3D camera poses, depth maps, and sparse point clouds, each object comes with a caption and a 3D Gaussian Splat reconstruction. Training several large 3D models on MVImgNet, CO3Dv2, and uCO3D, the authors obtain superior results with the latter, showing that uCO3D is better for learning applications.

Link: https://arxiv.org/abs/2501.07574
Authors: Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y. Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, David Novotny
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of high-resolution videos of objects with 3D annotations that ensures full 360° coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.

[CV-2] Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss

[Quick Read]: This paper tackles the challenge of maintaining temporal consistency and precise motion control in video generation. Existing training-free methods often fail to keep frames temporally coherent or to follow the guided motion accurately. The paper proposes a simple yet effective solution that combines an initial-noise-based approach with a novel motion consistency loss, the latter being the key innovation. Specifically, the inter-frame feature correlation patterns of intermediate features from a video diffusion model are captured to represent the motion pattern of the reference video, and a motion consistency loss is designed to maintain similar feature correlation patterns in the generated video; the gradient of this loss in the latent space guides the generation process for precise motion control. The approach improves temporal consistency across various motion control tasks while preserving the benefits of a training-free setup.

Link: https://arxiv.org/abs/2501.07563
Authors: Xinyu Zhang, Zicheng Duan, Dong Gong, Lingqiao Liu
Affiliations: The University of Adelaide; The University of New South Wales
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:In this paper, we address the challenge of generating temporally consistent videos with motion guidance. While many existing methods depend on additional control modules or inference-time fine-tuning, recent studies suggest that effective motion guidance is achievable without altering the model architecture or requiring extra training. Such approaches offer promising compatibility with various video generation foundation models. However, existing training-free methods often struggle to maintain consistent temporal coherence across frames or to follow guided motion accurately. In this work, we propose a simple yet effective solution that combines an initial-noise-based approach with a novel motion consistency loss, the latter being our key innovation. Specifically, we capture the inter-frame feature correlation patterns of intermediate features from a video diffusion model to represent the motion pattern of the reference video. We then design a motion consistency loss to maintain similar feature correlation patterns in the generated video, using the gradient of this loss in the latent space to guide the generation process for precise motion control. This approach improves temporal consistency across various motion control tasks while preserving the benefits of a training-free setup. Extensive experiments show that our method sets a new standard for efficient, temporally coherent video generation.
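
The loss is easy to sketch at the level of pooled per-frame features: compare the inter-frame cosine-correlation pattern of the generated video against that of the reference. Feature shapes and the MSE choice below are assumptions for illustration; the paper applies the idea to intermediate diffusion features.

```python
import torch
import torch.nn.functional as F

def frame_correlation(feats):
    """Inter-frame feature correlation: cosine similarity between
    per-frame feature vectors. feats: (T, C) pooled features."""
    f = F.normalize(feats, dim=-1)
    return f @ f.T                       # (T, T) correlation pattern

def motion_consistency_loss(ref_feats, gen_feats):
    """Penalize differences between the reference video's inter-frame
    correlation pattern and the generated video's; a schematic
    version of the loss described in the abstract."""
    return F.mse_loss(frame_correlation(gen_feats),
                      frame_correlation(ref_feats).detach())

# Toy check with 8 frames of 128-d pooled features.
ref = torch.randn(8, 128)
gen = torch.randn(8, 128, requires_grad=True)
loss = motion_consistency_loss(ref, gen)
loss.backward()                          # gradient can steer the latents
print(float(loss), gen.grad.shape)
```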

[CV-3] MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training ATC

[Quick Read]: This paper addresses the performance degradation of image matching across different imaging modalities, particularly when cross-modal annotated training data are scarce. Existing deep learning algorithms deteriorate on cross-modality images with significant appearance changes, limiting applications that rely on multiple modalities for complementary information. The solution is a large-scale pre-training framework that uses synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images, a capability that transfers to real-world, unseen cross-modality matching tasks. The key finding is that the matching model trained with this framework achieves remarkable generalizability across more than eight unseen cross-modality registration tasks using the same network weights, substantially outperforming existing methods, whether designed for generalization or tailored for specific tasks. This advance significantly broadens the applicability of image matching technologies across scientific disciplines and paves the way for new applications in multi-modality human and artificial intelligence analysis.

Link: https://arxiv.org/abs/2501.07556
Authors: Xingyi He, Hao Yu, Sida Peng, Dongli Tan, Zehong Shen, Hujun Bao, Xiaowei Zhou
Affiliations: State Key Lab of CAD&CG, Zhejiang University; Shandong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. In recent years, deep learning-based image matching algorithms have dramatically outperformed humans in rapidly and accurately finding large amounts of correspondences. However, when dealing with images captured under different imaging modalities that result in significant appearance changes, the performance of these algorithms often deteriorates due to the scarcity of annotated cross-modal training data. This limitation hinders applications in various fields that rely on multiple image modalities to obtain complementary information. To address this challenge, we propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images. This capability is transferable to real-world, unseen cross-modality image matching tasks. Our key finding is that the matching model trained with our framework achieves remarkable generalizability across more than eight unseen cross-modality registration tasks using the same network weight, substantially outperforming existing methods, whether designed for generalization or tailored for specific tasks. This advancement significantly enhances the applicability of image matching technologies across various scientific disciplines and paves the way for new applications in multi-modality human and artificial intelligence analysis and beyond.

[CV-4] Confident Pseudo-labeled Diffusion Augmentation for Canine Cardiomegaly Detection WACV

[Quick Read]: This paper addresses poor generalization in canine cardiomegaly detection caused by small datasets, low annotation quality, and diverse imaging conditions. The key of the solution is a Confident Pseudo-labeled Diffusion Augmentation (CDA) model. It expands the dataset by using diffusion models to generate synthetic X-ray images annotated with Vertebral Heart Score key points, and employs a Monte Carlo Dropout-based pseudo-labeling strategy to select high-confidence labels, refine the synthetic dataset, and iteratively improve the model, overcoming the limitations of existing approaches. Experiments show that CDA outperforms traditional methods and achieves state-of-the-art accuracy in canine cardiomegaly detection.

Link: https://arxiv.org/abs/2501.07533
Authors: Shiman Zhang, Lakshmikar Reddy Polamreddy, Youshan Zhang
Affiliations: Katz School of Science and Health, Yeshiva University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV workshop

Abstract:Canine cardiomegaly, marked by an enlarged heart, poses serious health risks if undetected, requiring accurate diagnostic methods. Current detection models often rely on small, poorly annotated datasets and struggle to generalize across diverse imaging conditions, limiting their real-world applicability. To address these issues, we propose a Confident Pseudo-labeled Diffusion Augmentation (CDA) model for identifying canine cardiomegaly. Our approach addresses the challenge of limited high-quality training data by employing diffusion models to generate synthetic X-ray images and annotate Vertebral Heart Score key points, thereby expanding the dataset. We also employ a pseudo-labeling strategy with Monte Carlo Dropout to select high-confidence labels, refine the synthetic dataset, and improve accuracy. Iteratively incorporating these labels enhances the model’s performance, overcoming the limitations of existing approaches. Experimental results show that the CDA model outperforms traditional methods, achieving state-of-the-art accuracy in canine cardiomegaly detection. The code implementation is available at this https URL.
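
A minimal sketch of the Monte Carlo Dropout confidence filter: keep dropout active at inference, run several stochastic passes, and retain samples whose predictions vary little. The toy regressor and the thresholding rule are illustrative assumptions, not the paper's implementation.

```python
import torch

def mc_dropout_confidence(model, x, passes=20):
    """Estimate pseudo-label confidence with Monte Carlo Dropout:
    run several stochastic forward passes and use the spread of
    the predictions as an uncertainty score."""
    model.train()                 # keeps dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(passes)])
    mean = preds.mean(dim=0)      # pseudo-label (e.g., keypoints)
    std = preds.std(dim=0)        # per-output uncertainty
    return mean, std

# Toy regressor predicting 6 keypoint coordinates from 64-d input.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 32), torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3), torch.nn.Linear(32, 6))
x = torch.randn(10, 64)
mean, std = mc_dropout_confidence(model, x)
keep = std.mean(dim=1) < std.mean()   # retain low-uncertainty samples
print("kept", int(keep.sum()), "of", len(x))
```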

[CV-5] IP-FaceDiff: Identity-Preserving Facial Video Editing with Diffusion WACV-25

[Quick Read]: This paper targets several key challenges in facial video editing: poor editing quality, high computational cost, difficulty preserving facial identity across diverse edits, and the inflexibility of existing models that are restricted to predefined facial attributes. The proposed framework leverages the rich latent space of pre-trained text-to-image (T2I) diffusion models and fine-tunes them specifically for facial video editing. The key is a targeted fine-tuning scheme that enables high-quality, localized, text-driven edits while ensuring identity consistency across video frames. In addition, using pre-trained T2I models at inference reduces editing time by 80% while maintaining temporal consistency throughout the video sequence. Extensive tests across challenging scenarios, including varying head poses, complex action sequences, and diverse facial expressions, show that the method consistently outperforms existing techniques across a broad set of metrics and benchmarks.

Link: https://arxiv.org/abs/2501.07530
Authors: Tharun Anand, Aryan Garg, Kaushik Mitra
Affiliations: Indian Institute Of Technology Madras; University of Wisconsin Madison
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV-25 Workshop

Abstract:Facial video editing has become increasingly important for content creators, enabling the manipulation of facial expressions and attributes. However, existing models encounter challenges such as poor editing quality, high computational costs and difficulties in preserving facial identity across diverse edits. Additionally, these models are often constrained to editing predefined facial attributes, limiting their flexibility to diverse editing prompts. To address these challenges, we propose a novel facial video editing framework that leverages the rich latent space of pre-trained text-to-image (T2I) diffusion models and fine-tune them specifically for facial video editing tasks. Our approach introduces a targeted fine-tuning scheme that enables high quality, localized, text-driven edits while ensuring identity preservation across video frames. Additionally, by using pre-trained T2I models during inference, our approach significantly reduces editing time by 80%, while maintaining temporal consistency throughout the video sequence. We evaluate the effectiveness of our approach through extensive testing across a wide range of challenging scenarios, including varying head poses, complex action sequences, and diverse facial expressions. Our method consistently outperforms existing techniques, demonstrating superior performance across a broad set of metrics and benchmarks.

[CV-6] RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment

[Quick Read]: This paper addresses two key problems in automated chest X-ray interpretation: accurate disease classification and detailed radiology report generation. Current approaches either favor classification accuracy at the expense of interpretability or produce detailed but potentially unreliable reports via image captioning. The proposed RadAlign framework combines the predictive accuracy of vision-language models (VLMs) with the reasoning capabilities of large language models (LLMs). Inspired by the radiologist's workflow, RadAlign first uses a specialized VLM to align visual features with key medical concepts, achieving superior disease classification with an average AUC of 0.885 across multiple diseases. The recognized conditions, represented as text-based concepts in the aligned visual-language space, then prompt LLM-based report generation. Enhanced by a retrieval-augmented generation mechanism that grounds outputs in similar historical cases, RadAlign delivers superior report quality with a GREEN score of 0.678, outperforming the state-of-the-art methods' 0.634. The framework maintains strong clinical interpretability while reducing hallucinations, advancing integrated predictive and generative AI for automated medical imaging and report analysis.

Link: https://arxiv.org/abs/2501.07525
Authors: Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Automated chest radiographs interpretation requires both accurate disease classification and detailed radiology report generation, presenting a significant challenge in the clinical workflow. Current approaches either focus on classification accuracy at the expense of interpretability or generate detailed but potentially unreliable reports through image captioning techniques. In this study, we present RadAlign, a novel framework that combines the predictive accuracy of vision-language models (VLMs) with the reasoning capabilities of large language models (LLMs). Inspired by the radiologist’s workflow, RadAlign first employs a specialized VLM to align visual features with key medical concepts, achieving superior disease classification with an average AUC of 0.885 across multiple diseases. These recognized medical conditions, represented as text-based concepts in the aligned visual-language space, are then used to prompt LLM-based report generation. Enhanced by a retrieval-augmented generation mechanism that grounds outputs in similar historical cases, RadAlign delivers superior report quality with a GREEN score of 0.678, outperforming state-of-the-art methods’ 0.634. Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI. Code is available at this https URL.
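
The retrieval-augmented step can be pictured with a small sketch: given an embedding of the current study and embeddings of historical cases, retrieve the nearest reports to ground the LLM prompt. Everything below (shapes, the cosine-similarity retrieval, the prompt format) is an illustrative assumption, not RadAlign's actual code.

```python
import numpy as np

def retrieve_similar_cases(query_emb, case_embs, case_reports, k=3):
    """Retrieve the k most similar historical cases by cosine
    similarity, to ground report generation. Embeddings are
    assumed to be given by some upstream encoder."""
    q = query_emb / np.linalg.norm(query_emb)
    C = case_embs / np.linalg.norm(case_embs, axis=1, keepdims=True)
    idx = np.argsort(-(C @ q))[:k]
    return [case_reports[i] for i in idx]

# Toy case bank of 100 historical studies with 256-d embeddings.
cases = np.random.rand(100, 256)
reports = [f"report {i}" for i in range(100)]
query = np.random.rand(256)
context = retrieve_similar_cases(query, cases, reports)
prompt = "Findings: ...\nSimilar cases:\n" + "\n".join(context)
print(prompt[:80])
```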

[CV-7] Three-view Focal Length Recovery From Homographies

[Quick Read]: This paper addresses the problem of recovering focal lengths from three-view homographies. By examining the consistency of normal vectors between two homographies, the authors use an elimination technique to derive new explicit constraints between focal lengths and homographies. The key is that three-view homographies provide two additional constraints, enabling recovery of one or two focal lengths. Four cases are discussed: three cameras with an unknown equal focal length; three cameras with two different unknown focal lengths; and one camera with a known focal length while the other two have equal or different unknown focal lengths. All problems reduce to solving polynomials in one or two unknowns, which can be solved efficiently using Sturm sequences or the hidden variable technique. Evaluation on synthetic and real data shows that the proposed solvers are both faster and more accurate than methods relying on existing two-view solvers.

Link: https://arxiv.org/abs/2501.07499
Authors: Yaqing Ding, Viktor Kocur, Zuzana Berger Haladová, Qianliang Wu, Shen Cai, Jian Yang, Zuzana Kukelova
Affiliations: Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague; Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava; PCA Lab, Nanjing University of Science and Technology, Nanjing, China; Visual and Geometric Perception Lab, Donghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at this https URL Dataset available at: this https URL

Abstract:In this paper, we propose a novel approach for recovering focal lengths from three-view homographies. By examining the consistency of normal vectors between two homographies, we derive new explicit constraints between the focal lengths and homographies using an elimination technique. We demonstrate that three-view homographies provide two additional constraints, enabling the recovery of one or two focal lengths. We discuss four possible cases, including three cameras having an unknown equal focal length, three cameras having two different unknown focal lengths, three cameras where one focal length is known, and the other two cameras have equal or different unknown focal lengths. All the problems can be converted into solving polynomials in one or two unknowns, which can be efficiently solved using Sturm sequence or hidden variable technique. Evaluation using both synthetic and real data shows that the proposed solvers are both faster and more accurate than methods relying on existing two-view solvers. The code and data are available on this https URL
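
The abstract reduces each calibration case to a low-degree polynomial in the unknown focal length(s). As a toy illustration of that final root-finding step only (the coefficients here are made up; the paper derives them from the homographies and solves via Sturm sequences or the hidden variable technique), one can root-solve numerically:

```python
import numpy as np

# Made-up coefficients for a quartic p(f) = c0*f^4 + ... + c4 = 0
# standing in for a derived focal-length constraint.
c = [1.0, -2.0, -5.0, 6.0, 0.5]
roots = np.roots(c)

# Keep real, positive roots as focal-length candidates.
candidates = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0]
print(candidates)
```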

[CV-8] Aligning First Then Fusing: A Novel Weakly Supervised Multimodal Violence Detection Method

[Quick Read]: This paper addresses multimodal fusion in weakly supervised violence detection, that is, training models to identify violent segments in videos using only video-level labels. Whereas existing work focuses on designing multimodal fusion models to handle modality discrepancies, this paper proposes a novel multimodal semantic feature alignment method: the semantic features of local, transient, and less informative modalities (such as audio and optical flow) are sparsely mapped into the more informative RGB semantic feature space. Through an iterative process, the method identifies a suitable non-zero feature-matching subspace and aligns the modality-specific event representations based on this subspace, so that information from all modalities can be fully exploited in the subsequent fusion stage. On this basis, the paper designs a new weakly supervised violence detection framework consisting of unimodal multiple-instance learning for extracting unimodal semantic features, multimodal alignment, multimodal fusion, and final detection. Experiments demonstrate the method's effectiveness, achieving an average precision (AP) of 86.07% on the XD-Violence dataset.

Link: https://arxiv.org/abs/2501.07496
Authors: Wenping Jin, Li Zhu, Jing Sun
Affiliations: Xi'an Jiaotong University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Weakly supervised violence detection refers to the technique of training models to identify violent segments in videos using only video-level labels. Among these approaches, multimodal violence detection, which integrates modalities such as audio and optical flow, holds great potential. Existing methods in this domain primarily focus on designing multimodal fusion models to address modality discrepancies. In contrast, we take a different approach; leveraging the inherent discrepancies across modalities in violence event representation to propose a novel multimodal semantic feature alignment method. This method sparsely maps the semantic features of local, transient, and less informative modalities (such as audio and optical flow) into the more informative RGB semantic feature space. Through an iterative process, the method identifies a suitable non-zero feature-matching subspace and aligns the modality-specific event representations based on this subspace, enabling the full exploitation of information from all modalities during the subsequent modality fusion stage. Building on this, we design a new weakly supervised violence detection framework that consists of unimodal multiple-instance learning for extracting unimodal semantic features, multimodal alignment, multimodal fusion, and final detection. Experimental results on benchmark datasets demonstrate the effectiveness of our method, achieving an average precision (AP) of 86.07% on the XD-Violence dataset. Our code is available at this https URL.

[CV-9] 3DGS-to-PC: Convert a 3D Gaussian Splatting Scene into a Dense Point Cloud or Mesh

[Quick Read]: This paper addresses the complex challenge of converting a 3D Gaussian Splatting (3DGS) scene into a point cloud. While 3DGS produces highly detailed 3D reconstructions, its scenes usually require specialised renderers for visualisation, whereas point clouds are a widely used 3D representation compatible with most popular 3D processing software. The proposed 3DGS-to-PC is a flexible and highly customisable framework that transforms 3DGS scenes into dense, high-accuracy point clouds. Its key steps are: 1) sampling points probabilistically from each Gaussian, treated as a 3D density function; 2) thresholding new points by their Mahalanobis distance to the Gaussian centre to prevent extreme outliers; and 3) recalculating Gaussian colours through a customised image rendering approach so that colours in the point cloud match the final rendered scene. 3DGS-to-PC also supports mesh generation via Poisson Surface Reconstruction applied to points sampled from predicted surface Gaussians, so coloured meshes can be generated from 3DGS scenes without re-training. The framework integrates easily into existing 3DGS pipelines, providing a powerful tool for converting 3DGS data into point cloud and surface-based formats.

Link: https://arxiv.org/abs/2501.07478
Authors: Lewis A G Stuart, Michael P Pound
Affiliations: University of Nottingham
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) excels at producing highly detailed 3D reconstructions, but these scenes often require specialised renderers for effective visualisation. In contrast, point clouds are a widely used 3D representation and are compatible with most popular 3D processing software, yet converting 3DGS scenes into point clouds is a complex challenge. In this work we introduce 3DGS-to-PC, a flexible and highly customisable framework that is capable of transforming 3DGS scenes into dense, high-accuracy point clouds. We sample points probabilistically from each Gaussian as a 3D density function. We additionally threshold new points using the Mahalanobis distance to the Gaussian centre, preventing extreme outliers. The result is a point cloud that closely represents the shape encoded into the 3D Gaussian scene. Individual Gaussians use spherical harmonics to adapt colours depending on view, and each point may contribute only subtle colour hints to the resulting rendered scene. To avoid spurious or incorrect colours that do not fit with the final point cloud, we recalculate Gaussian colours via a customised image rendering approach, assigning each Gaussian the colour of the pixel to which it contributes most across all views. 3DGS-to-PC also supports mesh generation through Poisson Surface Reconstruction, applied to points sampled from predicted surface Gaussians. This allows coloured meshes to be generated from 3DGS scenes without the need for re-training. This package is highly customisable and capable of simple integration into existing 3DGS pipelines. 3DGS-to-PC provides a powerful tool for converting 3DGS data into point cloud and surface-based formats.
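
The per-Gaussian sampling step can be sketched directly: draw candidate points from the Gaussian and reject those whose Mahalanobis distance to the centre is too large. The threshold of 3.0 and the toy covariance are illustrative choices, not the package's defaults.

```python
import numpy as np

def sample_gaussian_points(mean, cov, n, max_mahalanobis=3.0, seed=0):
    """Sample candidate points from one 3D Gaussian and reject those
    whose Mahalanobis distance to the centre exceeds a threshold."""
    rng = np.random.default_rng(seed)
    pts = rng.multivariate_normal(mean, cov, size=n)
    cov_inv = np.linalg.inv(cov)
    d = pts - mean
    m2 = np.einsum("ni,ij,nj->n", d, cov_inv, d)   # squared distances
    return pts[np.sqrt(m2) <= max_mahalanobis]

# One anisotropic Gaussian (a flat "splat"), toy parameters.
mean = np.array([0.0, 0.0, 0.0])
cov = np.diag([0.04, 0.04, 0.001])
pts = sample_gaussian_points(mean, cov, n=1000)
print(pts.shape)   # extreme outliers removed
```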

[CV-10] A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion

[Quick Read]: This paper addresses the problem that static optimization techniques (e.g., pruning, quantization) for deploying large computer vision models on embedded devices cannot adapt the amount of computation to different inputs: static methods ignore the fact that different inputs have different complexities, wasting resources on easy inputs or under-serving hard ones. The proposed remedy is Dynamic Neural Networks, which condition the number of computations on the specific input, improving efficiency and adaptivity. The key point is that dynamic networks allow the output, the computation graph, or the input of the network to be adaptive, better matching inputs of varying complexity. The survey synthesizes and unifies existing Dynamic Neural Network research in computer vision, provides a logical taxonomy based on which component is adaptive, and argues that dynamic networks are particularly beneficial for sensor fusion, offering better adaptivity, noise reduction, and information prioritization.

Link: https://arxiv.org/abs/2501.07451
Authors: Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis
Affiliations: Technical University of Denmark; Sapienza University of Rome
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at International Journal of Computer Vision

Abstract:Model compression is essential in the deployment of large Computer Vision models on embedded devices. However, static optimization techniques (e.g. pruning, quantization, etc.) neglect the fact that different inputs have different complexities, thus requiring different amount of computations. Dynamic Neural Networks allow to condition the number of computations to the specific input. The current literature on the topic is very extensive and fragmented. We present a comprehensive survey that synthesizes and unifies existing Dynamic Neural Networks research in the context of Computer Vision. Additionally, we provide a logical taxonomy based on which component of the network is adaptive: the output, the computation graph or the input. Furthermore, we argue that Dynamic Neural Networks are particularly beneficial in the context of Sensor Fusion for better adaptivity, noise reduction and information prioritization. We present preliminary works in this direction.
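
As one concrete instance of the surveyed idea, here is a minimal early-exit network: a cheap intermediate head answers easy inputs, and only hard inputs pay for the full depth. The architecture and confidence threshold are illustrative assumptions, shown for single-sample inference.

```python
import torch
import torch.nn.functional as F

class EarlyExitNet(torch.nn.Module):
    """A minimal input-adaptive network: an early classifier head can
    terminate computation when its prediction is confident enough.
    One of several 'dynamic' mechanisms surveyed in the paper."""
    def __init__(self, dim=64, classes=10, threshold=0.9):
        super().__init__()
        self.block1 = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())
        self.exit1 = torch.nn.Linear(dim, classes)   # cheap early head
        self.block2 = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())
        self.exit2 = torch.nn.Linear(dim, classes)   # full-depth head
        self.threshold = threshold

    def forward(self, x):                 # x: (1, dim), one sample at a time
        h = self.block1(x)
        p1 = F.softmax(self.exit1(h), dim=-1)
        if p1.max().item() >= self.threshold:        # easy input: stop here
            return p1
        return F.softmax(self.exit2(self.block2(h)), dim=-1)

net = EarlyExitNet()
print(net(torch.randn(1, 64)).shape)   # torch.Size([1, 10])
```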

[CV-11] PrecipDiff: Leverag ing image diffusion models to enhance satellite-based precipitation observations

[Quick Read]: This paper addresses the limitations of satellite precipitation products in accuracy, bias, and spatial resolution. Satellite observations enable global, near-real-time monitoring but suffer from inconsistency, bias, and low spatial resolution (10 km). The proposed solution is a unified framework based on diffusion models and residual learning: it introduces the first diffusion model for correcting inconsistencies between different precipitation products, and it downscales satellite precipitation estimates from 10 km to 1 km resolution. Extensive experiments in the Seattle region demonstrate significant improvements in accuracy, bias reduction, and spatial detail. Importantly, these results are achieved using precipitation data alone, showcasing the potential of a purely computer-vision-based approach for enhancing satellite precipitation products and laying the groundwork for further research in this domain.

Link: https://arxiv.org/abs/2501.07447
Authors: Ting-Yu Dai, Hayato Ushijima-Mwesigwa
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A recent report from the World Meteorological Organization (WMO) highlights that water-related disasters have caused the highest human losses among natural disasters over the past 50 years, with over 91% of deaths occurring in low-income countries. This disparity is largely due to the lack of adequate ground monitoring stations, such as weather surveillance radars (WSR), which are expensive to install. For example, while the US and Europe combined possess over 600 WSRs, Africa, despite having almost one and a half times their landmass, has fewer than 40. To address this issue, satellite-based observations offer a global, near-real-time monitoring solution. However, they face several challenges like accuracy, bias, and low spatial resolution. This study leverages the power of diffusion models and residual learning to address these limitations in a unified framework. We introduce the first diffusion model for correcting the inconsistency between different precipitation products. Our method demonstrates the effectiveness in downscaling satellite precipitation estimates from 10 km to 1 km resolution. Extensive experiments conducted in the Seattle region demonstrate significant improvements in accuracy, bias reduction, and spatial detail. Importantly, our approach achieves these results using only precipitation data, showcasing the potential of a purely computer vision-based approach for enhancing satellite precipitation products and paving the way for further advancements in this domain.

[CV-12] Guided SAM: Label-Efficient Part Segmentation

[Quick Read]: This paper addresses the need for extensive training data and labor-intensive annotation in object part segmentation. The Segment-Anything Model (SAM) performs well across many segmentation problems but requires manually supplied positional prompts to indicate where to segment, and since it was trained on full objects rather than parts, it is prone to over-segmenting parts. The key of the proposed "Guided SAM" is to learn positional prompts from coarse patch annotations that are cheaper and easier to acquire: classifiers trained on image patches identify part classes, patches are aggregated into regions of interest (ROIs) with positional prompts, and SAM is conditioned on these ROIs and prompts. Experiments on a dataset of car parts show that Guided SAM improves the average IoU over state-of-the-art models from 0.37 to 0.49, with annotations that are on average five times more efficient to acquire.

Link: https://arxiv.org/abs/2501.07434
Authors: S.B. van Rooij, G.J. Burghouts
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Localizing object parts precisely is essential for tasks such as object recognition and robotic manipulation. Recent part segmentation methods require extensive training data and labor-intensive annotations. Segment-Anything Model (SAM) has demonstrated good performance on a wide range of segmentation problems, but requires (manual) positional prompts to guide it where to segment. Furthermore, since it has been trained on full objects instead of object parts, it is prone to over-segmentation of parts. To address this, we propose a novel approach that guides SAM towards the relevant object parts. Our method learns positional prompts from coarse patch annotations that are easier and cheaper to acquire. We train classifiers on image patches to identify part classes and aggregate patches into regions of interest (ROIs) with positional prompts. SAM is conditioned on these ROIs and prompts. This approach, termed 'Guided SAM', enhances efficiency and reduces manual effort, allowing effective part segmentation with minimal labeled data. We demonstrate the efficacy of Guided SAM on a dataset of car parts, improving the average IoU of state-of-the-art models from 0.37 to 0.49 with annotations that are on average five times more efficient to acquire.
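
The patch-to-prompt aggregation can be sketched in a few lines: threshold per-patch classifier scores on a grid and take the bounding box of the firing patches as the ROI prompt. Grid size, threshold, and the single-box rule are illustrative assumptions.

```python
import numpy as np

def patches_to_roi(patch_scores, patch_size, score_thresh=0.5):
    """Aggregate per-patch part-classifier scores on a grid into one
    bounding-box ROI (x0, y0, x1, y1) usable as a box prompt."""
    ys, xs = np.where(patch_scores > score_thresh)
    if len(xs) == 0:
        return None
    x0, y0 = xs.min() * patch_size, ys.min() * patch_size
    x1 = (xs.max() + 1) * patch_size
    y1 = (ys.max() + 1) * patch_size
    return (x0, y0, x1, y1)

# Toy 4x4 patch grid of "wheel" scores on a 256x256 image (patch 64).
scores = np.zeros((4, 4))
scores[2:4, 1:3] = 0.9            # classifier fires bottom-centre
roi = patches_to_roi(scores, patch_size=64)
print(roi)                         # (64, 128, 192, 256)
# The ROI box would then be passed to SAM as a positional prompt,
# e.g. predictor.predict(box=np.array(roi)) in the segment-anything API.
```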

[CV-13] Diff-Ensembler: Learning to Ensemble 2D Diffusion Models for Volume-to-Volume Medical Image Translation

[Quick Read]: This paper addresses the difficulty existing models have in effectively capturing 3D spatial structure for volume-to-volume translation in medical images. The current state of the art combines multiple 2D networks through weighted averaging, neglecting 3D spatial structure, while directly training 3D models faces high computational demands and the need for large-scale datasets. The proposed Diff-Ensembler is a novel hybrid 2D-3D model that, at each diffusion step, ensembles perpendicularly trained 2D diffusion models with a 3D network, achieving efficient and effective volumetric translation. Moreover, the model can naturally ensemble diffusion models conditioned on different modalities, allowing flexible and accurate fusion of input conditions. Extensive experiments show that Diff-Ensembler attains superior accuracy and volumetric realism in 3D medical image super-resolution and modality translation, with tumor segmentation as a downstream task further demonstrating the strength of its volumetric realism.

Link: https://arxiv.org/abs/2501.07430
Authors: Xiyue Zhu, Dou Hoon Kwark, Ruike Zhu, Kaiwen Hong, Yiqi Tao, Shirui Luo, Yudu Li, Zhi-Pei Liang, Volodymyr Kindratenko
Affiliations: University of Illinois at Urbana-Champaign; National Center for Supercomputing Applications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite success in volume-to-volume translations in medical images, most existing models struggle to effectively capture the inherent volumetric distribution using 3D representations. The current state-of-the-art approach combines multiple 2D-based networks through weighted averaging, thereby neglecting the 3D spatial structures. Directly training 3D models in medical imaging presents significant challenges due to high computational demands and the need for large-scale datasets. To address these challenges, we introduce Diff-Ensembler, a novel hybrid 2D-3D model for efficient and effective volumetric translations by ensembling perpendicularly trained 2D diffusion models with a 3D network in each diffusion step. Moreover, our model can naturally be used to ensemble diffusion models conditioned on different modalities, allowing flexible and accurate fusion of input conditions. Extensive experiments demonstrate that Diff-Ensembler attains superior accuracy and volumetric realism in 3D medical image super-resolution and modality translation. We further demonstrate the strength of our model’s volumetric realism using tumor segmentation as a downstream task.
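
The "perpendicular" idea is easy to sketch outside the diffusion loop: run 2D models slice-wise along each of the three axes of a volume and fuse the results. The plain averaging below is a schematic stand-in; Diff-Ensembler fuses the models with a learned 3D network inside every diffusion step.

```python
import numpy as np

def ensemble_perpendicular(vol, model_ax, model_co, model_sa):
    """Fuse three 2D models that operate on axial, coronal, and
    sagittal slices of a volume by averaging their volumetric outputs."""
    def apply_along(model, v, axis):
        slices = np.moveaxis(v, axis, 0)
        out = np.stack([model(s) for s in slices])
        return np.moveaxis(out, 0, axis)
    preds = [apply_along(model_ax, vol, 0),
             apply_along(model_co, vol, 1),
             apply_along(model_sa, vol, 2)]
    return np.mean(preds, axis=0)

# Toy "models": a 2D smoothing stand-in for slice-wise networks.
blur = lambda s: (s + np.roll(s, 1, 0) + np.roll(s, 1, 1)) / 3.0
vol = np.random.rand(8, 8, 8)
fused = ensemble_perpendicular(vol, blur, blur, blur)
print(fused.shape)   # (8, 8, 8)
```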

[CV-14] OCORD: Open-Campus Object Removal Dataset

[Quick Read]: This paper addresses the challenges of object removal: inadequate semantic understanding, generation of unintended artifacts, and the mismatch between existing datasets and real-world scenarios. Existing datasets rely heavily on synthetic data that fails to reflect real-world physical phenomena such as lighting and shadows, and they suffer from limited scalability, inefficient annotation, and limited realism. To overcome these limitations, the paper constructs a high-resolution real-world dataset via long-duration video capture with fixed camera settings, and automates annotation with tools such as Grounding-DINO, Segment-Anything-Model, and MASA, providing image, background, and mask pairs while greatly reducing annotation time and labor. The key contributions are this efficient annotation pipeline, the release of the first fully open, high-resolution real-world dataset for object removal, and improved object-removal performance through fine-tuning of pre-trained diffusion models.

Link: https://arxiv.org/abs/2501.07397
Authors: Shuo Zhang, Runpu Wei, Kongming Liang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: technical report

Abstract:The rapid advancements in generative models, particularly diffusion-based techniques, have revolutionized image inpainting tasks by enabling the generation of high-fidelity and diverse content. However, object removal remains under-explored as a specific subset of inpainting, facing challenges such as inadequate semantic understanding and the unintended generation of artifacts. Existing datasets for object removal often rely on synthetic data, which fails to align with real-world scenarios, limiting model performance. Although some real-world datasets address these issues partially, they suffer from scalability, annotation inefficiencies, and limited realism in physical phenomena such as lighting and shadows. To address these limitations, this paper introduces a novel approach to object removal by constructing a high-resolution real-world dataset through long-duration video capture with fixed camera settings. Leveraging advanced tools such as Grounding-DINO, Segment-Anything-Model, and MASA for automated annotation, we provide image, background, and mask pairs while significantly reducing annotation time and labor. With our efficient annotation pipeline, we release the first fully open, high-resolution real-world dataset for object removal, and demonstrate improved performance in object removal tasks through fine-tuning of pre-trained diffusion models.

[CV-15] Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models

[Quick Read]: This paper addresses the challenges automatic target recognition (ATR) faces in extreme use cases such as military applications, particularly recognition under unknown terrains, environmental conditions, and novel target categories. Existing object detectors, including open-world detectors, cannot confidently recognize novel objects or operate effectively in unknown environments because they have not been exposed to these new conditions. Large Vision-Language Models (LVLMs), by contrast, exhibit emergent zero-shot recognition capabilities, but struggle to localize objects within a scene. The key of the solution is a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of LVLMs, creating a robust system for zero-shot ATR of novel classes and unknown domains. The study also compares LVLMs on recognizing military vehicles, which are often underrepresented in training datasets, and examines how factors such as distance range, modality, and prompting methods affect recognition performance.

Link: https://arxiv.org/abs/2501.07396
Authors: Yasiru Ranasinghe, Vibashan VS, James Uplinger, Celso De Melo, Vishal M. Patel
Affiliations: The Johns Hopkins University; DEVCOM Army Research Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automatic target recognition (ATR) plays a critical role in tasks such as navigation and surveillance, where safety and accuracy are paramount. In extreme use cases, such as military applications, these factors are often challenged due to the presence of unknown terrains, environmental conditions, and novel object categories. Current object detectors, including open-world detectors, lack the ability to confidently recognize novel objects or operate in unknown environments, as they have not been exposed to these new conditions. However, Large Vision-Language Models (LVLMs) exhibit emergent properties that enable them to recognize objects in varying conditions in a zero-shot manner. Despite this, LVLMs struggle to localize objects effectively within a scene. To address these limitations, we propose a novel pipeline that combines the detection capabilities of open-world detectors with the recognition confidence of LVLMs, creating a robust system for zero-shot ATR of novel classes and unknown domains. In this study, we compare the performance of various LVLMs for recognizing military vehicles, which are often underrepresented in training datasets. Additionally, we examine the impact of factors such as distance range, modality, and prompting methods on the recognition performance, providing insights into the development of more reliable ATR systems for novel conditions and classes.
zh

[CV-16] Kolmogorov-Arnold Network for Remote Sensing Image Semantic Segmentation

【速读】:该论文旨在解决遥感应用中语义分割(Semantic Segmentation)任务中存在的两个主要问题:一是现有方法难以充分利用编码器提取的高维特征,二是在解码过程中难以有效恢复细节信息。为解决这些问题,作者提出了一种基于Kolmogorov Arnold Network (KAN)的新型语义分割网络DeepKANSeg。其关键创新点包括:1)引入基于KAN的深度特征精炼模块(DeepKAN),用于从高维特征中有效捕捉复杂的空间和语义关系;2)在全局-局部联合解码器(global-local combined decoder)中用基于KAN的线性层(GLKAN)替代传统的多层感知机(MLP)层,以增强解码器在恢复细节信息时的能力。实验结果表明,该方法在ISPRS Vaihingen和ISPRS Potsdam两个高分辨率遥感数据集上均优于现有方法,且KAN的显式单变量分解特性提高了模型的可解释性,适用于需要可解释学习的遥感应用。
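
KAN 的核心是把高维函数分解为可学习的一元变换。下面给出一个极简的 KAN 风格层示意(非 DeepKANSeg 官方实现:每条“输入维到输出维”的边用固定 RBF 基函数的线性组合近似一元函数,基函数形式与数量均为假设)。

```python
# 极简 KAN 风格层:每条边是一个可学习的一元函数
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_basis=8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-2, 2, num_basis), requires_grad=False)
        # 每条 (输出维, 输入维) 边对每个基函数有一个系数
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)

    def forward(self, x):                                  # x: (batch, in_dim)
        # RBF 基展开 -> (batch, in_dim, num_basis)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)
        # 对每条边求一元函数值,并在输入维上求和 -> (batch, out_dim)
        return torch.einsum("bik,oik->bo", basis, self.coef)

layer = ToyKANLayer(16, 4)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 4])
```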

链接: https://arxiv.org/abs/2501.07390
作者: Xianping Ma,Ziyao Wang,Yin Hu,Xiaokang Zhang,Man-On Pun
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学深圳校区); Wuhan University of Science and Technology (武汉科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures

点击查看摘要

Abstract:Semantic segmentation plays a crucial role in remote sensing applications, where the accurate extraction and representation of features are essential for high-quality results. Despite the widespread use of encoder-decoder architectures, existing methods often struggle with fully utilizing the high-dimensional features extracted by the encoder and efficiently recovering detailed information during decoding. To address these problems, we propose a novel semantic segmentation network, namely DeepKANSeg, including two key innovations based on the emerging Kolmogorov Arnold Network (KAN). Notably, the advantage of KAN lies in its ability to decompose high-dimensional complex functions into univariate transformations, enabling efficient and flexible representation of intricate relationships in data. First, we introduce a KAN-based deep feature refinement module, namely DeepKAN to effectively capture complex spatial and rich semantic relationships from high-dimensional features. Second, we replace the traditional multi-layer perceptron (MLP) layers in the global-local combined decoder with KAN-based linear layers, namely GLKAN. This module enhances the decoder’s ability to capture fine-grained details during decoding. To evaluate the effectiveness of the proposed method, experiments are conducted on two well-known fine-resolution remote sensing benchmark datasets, namely ISPRS Vaihingen and ISPRS Potsdam. The results demonstrate that the KAN-enhanced segmentation model achieves superior performance in terms of accuracy compared to state-of-the-art methods. They highlight the potential of KANs as a powerful alternative to traditional architectures in semantic segmentation tasks. Moreover, the explicit univariate decomposition provides improved interpretability, which is particularly beneficial for applications requiring explainable learning in remote sensing.
zh

[CV-17] FedSemiDG: Domain Generalized Federated Semi-supervised Medical Image Segmentation

【速读】:该论文试图解决在联邦半监督学习(Federated Semi-Supervised Learning, FSSL)中存在的领域偏移(domain shift)问题,这一问题可能导致模型聚合效果不佳以及未标记数据的利用效率低下,最终影响模型在未见领域上的表现。为了解决这一问题,论文提出了领域泛化的联邦半监督学习(FedSemiDG)框架,旨在从多个领域中利用有限的标记数据和丰富的未标记数据,以分布式方式训练模型,使其能够在未见领域上具有良好的泛化能力。

解决方案的关键在于提出了一个名为“联邦泛化感知半监督学习”(Federated Generalization-Aware SemiSupervised Learning, FGASL)的新框架。该框架在全局层面引入了“泛化感知聚合”(Generalization-Aware Aggregation, GAA),根据局部模型的泛化性能为其分配自适应权重;在局部层面,采用了“双教师自适应伪标签细化”(Dual-Teacher Adaptive Pseudo Label Refinement, DR)策略,结合全局和领域特定知识,生成更可靠的伪标签。此外,通过“扰动不变对齐”(Perturbation-Invariant Alignment, PIA)策略,在扰动下强制特征一致性,促进领域不变学习。这些方法共同作用,显著提升了模型在未见领域上的泛化性能。
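
其中“泛化感知聚合(GAA)”按局部模型的泛化表现加权聚合参数,可用如下极简示意理解(非论文官方实现;以验证集分数经 softmax 得到权重是假设的具体化方式)。

```python
# GAA 思想示意:按泛化性能给各客户端模型分配聚合权重
import torch
import torch.nn as nn

def generalization_aware_aggregate(state_dicts, val_scores, temperature=1.0):
    """state_dicts: 各客户端模型参数;val_scores: 对应的泛化性能分数(越高越好)。"""
    w = torch.softmax(torch.tensor(val_scores, dtype=torch.float32) / temperature, dim=0)
    agg = {}
    for key in state_dicts[0]:
        agg[key] = sum(wi * sd[key].float() for wi, sd in zip(w, state_dicts))
    return agg

# 用法示意:三个客户端,验证性能分别为 0.82 / 0.74 / 0.91
clients = [nn.Linear(4, 2).state_dict() for _ in range(3)]
global_sd = generalization_aware_aggregate(clients, [0.82, 0.74, 0.91])
```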

链接: https://arxiv.org/abs/2501.07378
作者: Zhipeng Deng,Zhe Xu,Tsuyoshi Isshiki,Yefeng Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages

点击查看摘要

Abstract:Medical image segmentation is challenging due to the diversity of medical images and the lack of labeled data, which motivates recent developments in federated semi-supervised learning (FSSL) to leverage a large amount of unlabeled data from multiple centers for model training without sharing raw data. However, what remains under-explored in FSSL is the domain shift problem which may cause suboptimal model aggregation and low effectivity of the utilization of unlabeled data, eventually leading to unsatisfactory performance in unseen domains. In this paper, we explore this previously ignored scenario, namely domain generalized federated semi-supervised learning (FedSemiDG), which aims to learn a model in a distributed manner from multiple domains with limited labeled data and abundant unlabeled data such that the model can generalize well to unseen domains. We present a novel framework, Federated Generalization-Aware SemiSupervised Learning (FGASL), to address the challenges in FedSemiDG by effectively tackling critical issues at both global and local levels. Globally, we introduce Generalization-Aware Aggregation (GAA), assigning adaptive weights to local models based on their generalization performance. Locally, we use a Dual-Teacher Adaptive Pseudo Label Refinement (DR) strategy to combine global and domain-specific knowledge, generating more reliable pseudo labels. Additionally, Perturbation-Invariant Alignment (PIA) enforces feature consistency under perturbations, promoting domain-invariant learning. Extensive experiments on three medical segmentation tasks (cardiac MRI, spine MRI and bladder cancer MRI) demonstrate that our method significantly outperforms state-of-the-art FSSL and domain generalization approaches, achieving robust generalization on unseen domains.
zh

[CV-18] TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations WACV

【速读】:该论文旨在解决林业操作中木材(timber)的自动化检测和测量问题,特别是在远程环境中,这些操作通常依赖大量人力且存在较高的安全风险。论文提出的解决方案的关键在于引入了一个名为TimberVision的数据集,该数据集包含超过2000张标注的RGB图像,涵盖了51000个树干组件(包括切割面和侧面),在数量和细节上远超现有数据集。基于该数据集,论文通过一系列消融实验研究了面向对象检测(oriented object detection)和实例分割(instance segmentation)的性能,并评估了多种场景参数对模型性能的影响。此外,论文提出了一个通用框架,将检测到的组件融合为统一的树干表示,并自动推导几何属性,结合多目标跟踪(multi-object tracking)以增强鲁棒性。最终,该解决方案仅依赖RGB图像数据,能够在复杂环境条件下提供高度描述性和精确的树干表示,适用于多种应用场景,并可与其他传感器模态结合使用。

链接: https://arxiv.org/abs/2501.07360
作者: Daniel Steininger,Julia Simon,Andreas Trondl,Markus Murschitz
机构: AIT Austrian Institute of Technology (Center for Vision, Automation & Control) (奥地利技术研究院,视觉、自动化与控制中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at Winter Conference on Applications of Computer Vision (WACV) 2025. Code and dataset available at this https URL

点击查看摘要

Abstract:Timber represents an increasingly valuable and versatile resource. However, forestry operations such as harvesting, handling and measuring logs still require substantial human labor in remote environments posing significant safety risks. Progressively automating these tasks has the potential of increasing their efficiency as well as safety, but requires an accurate detection of individual logs as well as live trees and their context. Although initial approaches have been proposed for this challenging application domain, specialized data and algorithms are still too scarce to develop robust solutions. To mitigate this gap, we introduce the TimberVision dataset, consisting of more than 2k annotated RGB images containing a total of 51k trunk components including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in terms of both quantity and detail by a large margin. Based on this data, we conduct a series of ablation experiments for oriented object detection and instance segmentation and evaluate the influence of multiple scene parameters on model performance. We introduce a generic framework to fuse the components detected by our models for both tasks into unified trunk representations. Furthermore, we automatically derive geometric properties and apply multi-object tracking to further enhance robustness. Our detection and tracking approach provides highly descriptive and accurate trunk representations solely from RGB image data, even under challenging environmental conditions. Our solution is suitable for a wide range of application scenarios and can be readily combined with other sensor modalities.
zh

[CV-19] A method for estimating roadway billboard salience

【速读】:该论文旨在解决路边广告牌(roadside billboards)等户外广告形式在驾驶员视角下的显著性及其对驾驶注意力的潜在影响问题。研究首先评估了神经网络(neural networks)在检测路边广告中的有效性,重点比较了YOLOv5和Faster R-CNN模型的性能。其次,研究通过显著性提取方法(saliency extraction methods)来确定广告牌的显著性,具体采用了UniSal和SpectralResidual方法生成每张图像的显著性图(saliency maps)。为了验证这些显著性模型,研究还建立了一个基于城市高速公路驾驶过程中眼动追踪(eye tracking)数据的数据库。解决方案的关键在于结合深度学习模型和显著性分析方法,以量化广告牌对驾驶员注意力的影响,从而为交通安全和广告设计提供科学依据。
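
其中 SpectralResidual 是经典的频域显著性方法(Hou & Zhang, 2007),可以用 numpy 直接实现如下(平滑核大小与高斯 sigma 为常用示意值,非论文参数)。

```python
# 谱残差(Spectral Residual)显著性图的 numpy 实现示意
import numpy as np
from PIL import Image
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray):
    f = np.fft.fft2(gray)
    log_amp = np.log(np.abs(f) + 1e-8)
    phase = np.angle(f)
    residual = log_amp - uniform_filter(log_amp, size=3)   # 对数幅度谱的残差
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = gaussian_filter(sal, sigma=2.5)                  # 后平滑
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)

img = np.asarray(Image.open("road.jpg").convert("L").resize((128, 128)), dtype=np.float64)
saliency_map = spectral_residual_saliency(img)
```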

链接: https://arxiv.org/abs/2501.07342
作者: Zuzana Berger Haladova,Michal Zrubec,Zuzana Cernekova
机构: Faculty of Mathematics Physics and Informatics, Comenius University Bratislava Slovakia (数学物理与信息学院,布拉迪斯拉发考门斯基大学,斯洛伐克)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Roadside billboards and other forms of outdoor advertising play a crucial role in marketing initiatives; however, they can also distract drivers, potentially contributing to accidents. This study delves into the significance of roadside advertising in images captured from a driver’s perspective. Firstly, it evaluates the effectiveness of neural networks in detecting advertising along roads, focusing on the YOLOv5 and Faster R-CNN models. Secondly, the study addresses the determination of billboard significance using methods for saliency extraction. The UniSal and SpectralResidual methods were employed to create saliency maps for each image. The study establishes a database of eye tracking sessions captured during city highway driving to assess the saliency models.
zh

[CV-20] Anonymization of Documents for Law Enforcement with Machine Learning

【速读】:该论文试图解决在涉及敏感个人信息(如执法领域)的数据驱动方法中,如何自动匿名化扫描文档图像的问题,以减少人工操作并确保数据保护合规性。解决方案的关键在于结合自动检测敏感区域和手动匿名化参考文档的知识,最小化自动遮蔽区域,从而保持匿名化后文档的进一步法医处理可行性。通过使用自监督图像模型进行参考文档的实例检索,该方法仅需一个匿名化示例即可高效处理同类型的所有文档,显著减少了处理时间。实验表明,该方法在手工制作的真实遮蔽数据集上优于纯自动遮蔽系统和简单的参考匿名化复制粘贴方案。

链接: https://arxiv.org/abs/2501.07334
作者: Manuel Eberhardinger,Patrick Takenaka,Daniel Grießhaber,Johannes Maucher
机构: Stuttgart Media University(斯图加特传媒大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE Symposium on CI in Security, Defence and Biometrics 2025 (IEEE CISDB)

点击查看摘要

Abstract:The steadily increasing utilization of data-driven methods and approaches in areas that handle sensitive personal information such as in law enforcement mandates an ever increasing effort in these institutions to comply with data protection guidelines. In this work, we present a system for automatically anonymizing images of scanned documents, reducing manual effort while ensuring data protection compliance. Our method considers the viability of further forensic processing after anonymization by minimizing automatically redacted areas by combining automatic detection of sensitive regions with knowledge from a manually anonymized reference document. Using a self-supervised image model for instance retrieval of the reference document, our approach requires only one anonymized example to efficiently redact all documents of the same type, significantly reducing processing time. We show that our approach outperforms both a purely automatic redaction system and also a naive copy-paste scheme of the reference anonymization to other documents on a hand-crafted dataset of ground truth redactions.
zh

[CV-21] Localization-Aware Multi-Scale Representation Learning for Repetitive Action Counting

【速读】:该论文试图解决重复动作计数(Repetitive Action Counting, RAC)任务中的噪声干扰问题,特别是在视频中由于动作中断和不一致性导致的计数性能下降。现有的RAC方法通常依赖于帧间相似性表示来进行周期预测,但这些方法在面对常见的噪声时表现不佳。论文提出了一种新的解决方案,即通过引入前景定位优化目标来增强相似性表示学习,从而获得更鲁棒和高效的视频特征。关键解决方案包括两个模块:1) 多尺度周期感知表示(Multi-Scale Period-Aware Representation, MPR),通过特定尺度的设计来适应不同动作频率并学习更灵活的时间相关性;2) 重复前景定位(Repetition Foreground Localization, RFL),通过粗略识别周期性动作并结合全局语义信息来增强表示。这两个模块可以联合优化,生成更具辨别力的周期性动作表示,显著减少噪声的影响,提高计数准确性。实验结果表明,该方法在RepCountA和UCFRep数据集上有效提升了重复动作计数的性能。

链接: https://arxiv.org/abs/2501.07312
作者: Sujia Wang,Xiangwei Shen,Yansong Tang,Xin Dong,Wenjia Geng,Lei Chen
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Department of Automation, Tsinghua University (清华大学自动化系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE VCIP2024

点击查看摘要

Abstract:Repetitive action counting (RAC) aims to estimate the number of class-agnostic action occurrences in a video without exemplars. Most current RAC methods rely on a raw frame-to-frame similarity representation for period prediction. However, this approach can be significantly disrupted by common noise such as action interruptions and inconsistencies, leading to sub-optimal counting performance in realistic scenarios. In this paper, we introduce a foreground localization optimization objective into similarity representation learning to obtain more robust and efficient video features. We propose a Localization-Aware Multi-Scale Representation Learning (LMRL) framework. Specifically, we apply a Multi-Scale Period-Aware Representation (MPR) with a scale-specific design to accommodate various action frequencies and learn more flexible temporal correlations. Furthermore, we introduce the Repetition Foreground Localization (RFL) method, which enhances the representation by coarsely identifying periodic actions and incorporating global semantic information. These two modules can be jointly optimized, resulting in a more discerning periodic action representation. Our approach significantly reduces the impact of noise, thereby improving counting accuracy. Additionally, the framework is designed to be scalable and adaptable to different types of video content. Experimental results on the RepCountA and UCFRep datasets demonstrate that our proposed method effectively handles repetitive action counting.
zh

[CV-22] The Devil is in the Spurious Correlation: Boosting Moment Retrieval via Temporal Dynamic Learning

【速读】:该论文试图解决在视频时刻检索(moment retrieval)任务中,模型容易受到文本查询与背景帧之间的虚假相关性(spurious correlation)影响的问题,导致难以准确预测目标时刻的时间跨度。为解决这一问题,论文提出了两种关键策略:首先,通过一种新颖的视频合成方法构建动态上下文,使模型能够在不同动态背景下关注与查询相关的目标时刻;其次,通过增强时间动态表示,将文本查询与时间动态表示对齐,从而建立查询相关时刻与上下文之间的非虚假相关性。这些策略有效缓解了虚假相关性问题,并在QVHighlights和Charades-STA两个基准数据集上取得了新的最优性能。

链接: https://arxiv.org/abs/2501.07305
作者: Xinyang Zhou,Fanyue Wei,Lixin Duan,Wen Li
机构: University of Electronic Science and Technology of China (电子科技大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Given a textual query along with a corresponding video, the objective of moment retrieval aims to localize the moments relevant to the query within the video. While commendable results have been demonstrated by existing transformer-based approaches, predicting the accurate temporal span of the target moment is currently still a major challenge. In this paper, we reveal that a crucial reason stems from the spurious correlation between the text queries and the moment context. Namely, the model may associate the textual query with the background frames rather than the target moment. To address this issue, we propose a temporal dynamic learning approach for moment retrieval, where two strategies are designed to mitigate the spurious correlation. First, we introduce a novel video synthesis approach to construct a dynamic context for the relevant moment. With separate yet similar videos mixed up, the synthesis approach empowers our model to attend to the target moment of the corresponding query under various dynamic contexts. Second, we enhance the representation by learning temporal dynamics. Besides the visual representation, text queries are aligned with temporal dynamic representations, which enables our model to establish a non-spurious correlation between the query-related moment and context. With the aforementioned proposed method, the spurious correlation issue in moment retrieval can be largely alleviated. Our method establishes a new state-of-the-art performance on two popular benchmarks of moment retrieval, i.e., QVHighlights and Charades-STA. In addition, the detailed ablation analyses demonstrate the effectiveness of the proposed strategies. Our code will be publicly available.
zh

[CV-23] Code and Pixels: Multi-Modal Contrastive Pre-training for Enhanced Tabular Data Analysis

【速读】:该论文试图解决如何通过结合表格数据和图像数据来增强表格模型性能的问题。表格数据提供了丰富的结构化信息,对于全面理解和决策过程至关重要,但传统方法往往忽略了表格数据与图像数据之间的相关性。论文提出的解决方案是多任务对比掩码表格建模(Multi-task Contrastive Masked Tabular Modeling, MT-CMTM),该方法通过结合对比学习(contrastive learning)和掩码表格建模(masked tabular modeling)来优化这两种数据模态之间的协同作用。关键创新在于使用了一种带有残差连接和注意力机制的1D卷积神经网络(1D-ResNet-CBAM),该网络能够高效处理表格数据,而无需依赖图像数据。这使得MT-CMTM能够处理纯表格数据,避免了图像获取和处理的高成本。实验结果表明,MT-CMTM在HIPMP和DVM数据集上均优于从头训练的1D-ResNet-CBAM模型,展示了其在多模态学习领域的潜力。
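
其中表格嵌入与图像嵌入之间的对比对齐,常用 InfoNCE 形式的损失实现;以下为极简示意(非 MT-CMTM 官方实现,温度系数为示意值)。

```python
# 表格-图像对比对齐(InfoNCE)损失示意:成对样本拉近,非成对样本推远
import torch
import torch.nn.functional as F

def contrastive_loss(tab_emb, img_emb, temperature=0.07):
    tab = F.normalize(tab_emb, dim=-1)
    img = F.normalize(img_emb, dim=-1)
    logits = tab @ img.t() / temperature            # (B, B) 相似度矩阵
    targets = torch.arange(tab.size(0))             # 对角线为正样本对
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```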

链接: https://arxiv.org/abs/2501.07304
作者: Kankana Roy,Lars Krämer,Sebastian Domaschke,Malik Haris,Roland Aydin,Fabian Isensee,Martin Held
机构: Karolinska Institute(卡罗林斯卡学院); Helmholtz Imaging and the German Cancer Research Center (DKFZ)(亥姆霍兹成像与德国癌症研究中心); Institute of Materials Systems Modeling, Helmholtz-Zentrum Hereon(亥姆霍兹材料系统建模研究所); Institute of Membrane Research, Helmholtz-Zentrum Hereon(亥姆霍兹膜研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Learning from tabular data is of paramount importance, as it complements the conventional analysis of image and video data by providing a rich source of structured information that is often critical for comprehensive understanding and decision-making processes. We present Multi-task Contrastive Masked Tabular Modeling (MT-CMTM), a novel method aiming to enhance tabular models by leveraging the correlation between tabular data and corresponding images. MT-CMTM employs a dual strategy combining contrastive learning with masked tabular modeling, optimizing the synergy between these data modalities. Central to our approach is a 1D Convolutional Neural Network with residual connections and an attention mechanism (1D-ResNet-CBAM), designed to efficiently process tabular data without relying on images. This enables MT-CMTM to handle purely tabular data for downstream tasks, eliminating the need for potentially costly image acquisition and processing. We evaluated MT-CMTM on the DVM car dataset, which is uniquely suited for this particular scenario, and the newly developed HIPMP dataset, which connects membrane fabrication parameters with image data. Our MT-CMTM model outperforms the proposed tabular 1D-ResNet-CBAM, which is trained from scratch, achieving a relative 1.48% improvement in relative MSE on HIPMP and a 2.38% increase in absolute accuracy on DVM. These results demonstrate MT-CMTM’s robustness and its potential to advance the field of multi-modal learning.
zh

[CV-24] Toward Realistic Camouflaged Object Detection: Benchmarks and Method

【速读】:该论文试图解决在现实场景中检测伪装物体(Realistic Camouflaged Object Detection, RCOD)的挑战。由于伪装物体与其背景特征高度相似,传统的语义分割或实例分割方法虽然能够识别物体的轮廓,但在仅需定位物体位置的任务中效率较低且成本较高。相比之下,目标检测算法在RCOD任务中提供了更优的解决方案,但由于目标检测器省略了像素级的比较分析,进一步加剧了检测难度。为解决这一问题,论文提出了一种伪装感知特征细化(Camouflage-Aware Feature Refinement, CAFR)策略。CAFR通过充分利用大模型的先验知识,帮助检测器深入理解背景与前景之间的差异。具体而言,CAFR引入了自适应梯度传播(Adaptive Gradient Propagation, AGP)模块,该模块对大检测模型中的所有特征提取层进行微调,以从伪装背景中充分细化类别特定的特征。此外,论文还设计了稀疏特征细化(Sparse Feature Refinement, SFR)模块,优化基于Transformer的特征提取器,使其主要聚焦于在伪装场景中捕捉类别特定的特征。为了促进RCOD任务的评估,论文还对三个现有的分割COD数据集进行了手动标注,创建了新的RCOD任务基准。

链接: https://arxiv.org/abs/2501.07297
作者: Zhimeng Xin,Tianxu Wu,Shiming Chen,Shuo Ye,Zijing Xie,Yixiong Zou,Xinge You,Yufei Guo
机构: School of Cyber Science and Engineering, Huazhong University of Science and Technology (华中科技大学网络空间安全学院); School of Electronic Information and Communications, Huazhong University of Science and Technology (华中科技大学电子信息与通信学院); School of Computer Science & Technology, Huazhong University of Science and Technology (华中科技大学计算机科学与技术学院); Intelligent Science and Technology Academy of CASIC (中国航天科工集团智能科技研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Camouflaged object detection (COD) primarily relies on semantic or instance segmentation methods. While these methods have made significant advancements in identifying the contours of camouflaged objects, they may be inefficient or not cost-effective for tasks that only require the specific location of the object. Object detection algorithms offer an optimized solution for Realistic Camouflaged Object Detection (RCOD) in such cases. However, detecting camouflaged objects remains a formidable challenge due to the high degree of similarity between the features of the objects and their backgrounds. Unlike segmentation methods that perform pixel-wise comparisons to differentiate between foreground and background, object detectors omit this analysis, further aggravating the challenge. To solve this problem, we propose a camouflage-aware feature refinement (CAFR) strategy. Since camouflaged objects are not rare categories, CAFR fully utilizes a clear perception of the current object within the prior knowledge of large models to assist detectors in deeply understanding the distinctions between background and foreground. Specifically, in CAFR, we introduce the Adaptive Gradient Propagation (AGP) module that fine-tunes all feature extractor layers in large detection models to fully refine class-specific features from camouflaged contexts. We then design the Sparse Feature Refinement (SFR) module that optimizes the transformer-based feature extractor to focus primarily on capturing class-specific features in camouflaged scenarios. To facilitate the assessment of RCOD tasks, we manually annotate the labels required for detection on three existing segmentation COD datasets, creating a new benchmark for RCOD tasks. Code and datasets are available at: this https URL.
zh

[CV-25] Event-based Video Person Re-identification via Cross-Modality and Temporal Collaboration ICASSP2025

【速读】:该论文试图解决视频行人重识别(Video-based person ReID)中的隐私泄露问题,并提出了一种仅使用事件数据(event data)的方法来避免传统RGB图像带来的隐私风险。解决方案的关键在于提出了一个跨模态与时序协作(Cross-Modality and Temporal Collaboration, CMTC)网络。该网络首先通过事件转换网络从原始事件数据中提取辅助信息,然后通过差分模态协作模块平衡事件数据与辅助信息的作用,以实现互补效果。此外,还引入了时序协作模块来利用运动信息和外观线索。实验结果表明,该方法在基于事件数据的视频行人重识别任务中表现优于其他方法。

链接: https://arxiv.org/abs/2501.07296
作者: Renkai Li,Xin Yuan,Wei Liu,Xin Xu
机构: 1 School of Computer Science and Technology, Wuhan University of Science and Technology (武汉科技大学计算机科学与技术学院); 2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System (湖北省智能信息处理与实时工业系统重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Video-based person re-identification (ReID) has become increasingly important due to its applications in video surveillance applications. By employing events in video-based person ReID, more motion information can be provided between continuous frames to improve recognition accuracy. Previous approaches have assisted by introducing event data into the video person ReID task, but they still cannot avoid the privacy leakage problem caused by RGB images. In order to avoid privacy attacks and to take advantage of the benefits of event data, we consider using only event data. To make full use of the information in the event stream, we propose a Cross-Modality and Temporal Collaboration (CMTC) network for event-based video person ReID. First, we design an event transform network to obtain corresponding auxiliary information from the input of raw events. Additionally, we propose a differential modality collaboration module to balance the roles of events and auxiliaries to achieve complementary effects. Furthermore, we introduce a temporal collaboration module to exploit motion information and appearance cues. Experimental results demonstrate that our method outperforms others in the task of event-based video person ReID.
zh

[CV-26] Skip Mamba Diffusion for Monocular 3D Semantic Scene Completion AAAI2025

【速读】:该论文旨在解决3D语义场景补全(3D semantic scene completion)问题,即在自主系统中估计从场景数据中缺失的几何和语义信息。由于现实世界条件的复杂性,该任务通常需要处理多模态数据的复杂模型以达到可接受的性能。论文提出了一种独特的神经模型,利用状态空间(state space)和扩散生成建模(diffusion generative modeling)的进展,通过单目图像输入实现显著的3D语义场景补全性能。解决方案的关键在于在变分自编码器(variational autoencoder)的条件潜空间中进行数据处理,并采用创新的状态空间技术进行扩散建模。核心组件是提出的Skimba(Skip Mamba)去噪器,该去噪器擅长高效处理长序列数据。Skimba扩散模型结合了三重Mamba结构、维度分解残差和沿三个方向的变化扩张,构成了3D场景补全网络的核心。此外,该方法还采用了该网络的变体进行后续的语义分割阶段。通过在SemanticKITTI和SSCBench-KITTI360数据集上的广泛评估,该方法不仅大幅优于其他单目技术,还在与立体方法的竞争中表现出色。

链接: https://arxiv.org/abs/2501.07260
作者: Li Liang,Naveed Akhtar,Jordan Vice,Xiangrui Kong,Ajmal Saeed Mian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:3D semantic scene completion is critical for multiple downstream tasks in autonomous systems. It estimates missing geometric and semantic information in the acquired scene data. Due to the challenging real-world conditions, this task usually demands complex models that process multi-modal data to achieve acceptable performance. We propose a unique neural model, leveraging advances from the state space and diffusion generative modeling to achieve remarkable 3D semantic scene completion performance with monocular image input. Our technique processes the data in the conditioned latent space of a variational autoencoder where diffusion modeling is carried out with an innovative state space technique. A key component of our neural network is the proposed Skimba (Skip Mamba) denoiser, which is adept at efficiently processing long-sequence data. The Skimba diffusion model is integral to our 3D scene completion network, incorporating a triple Mamba structure, dimensional decomposition residuals and varying dilations along three directions. We also adopt a variant of this network for the subsequent semantic segmentation stage of our method. Extensive evaluation on the standard SemanticKITTI and SSCBench-KITTI360 datasets show that our approach not only outperforms other monocular techniques by a large margin, it also achieves competitive performance against stereo methods. The code is available at this https URL
zh

[CV-27] EdgeTAM: On-Device Track Anything Model

【速读】:该论文旨在解决如何在保持性能的同时,使SAM 2(Segment Anything Model 2)在移动设备上高效运行的问题。SAM 2通过引入内存银行机制(memory bank mechanism)扩展了其从图像到视频输入的能力,并在视频分割任务中表现出色。然而,现有的优化方法主要集中在压缩图像编码器上,未能有效解决SAM 2中新引入的内存注意力模块(memory attention blocks)成为延迟瓶颈的问题。为此,论文提出了EdgeTAM,其关键创新在于使用了一种新颖的2D空间感知器(2D Spatial Perceiver),通过轻量级Transformer编码密集存储的帧级内存,并采用固定数量的可学习查询(learnable queries)来减少计算成本。此外,论文还提出了一种蒸馏管道(distillation pipeline),在不增加推理开销的情况下进一步提升性能。最终,EdgeTAM在多个数据集上取得了优异的性能,并在iPhone 15 Pro Max上实现了16 FPS的实时运行速度。
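
“用固定数量的可学习查询压缩密集记忆”这一思路,可用一层交叉注意力粗略示意如下(非 EdgeTAM 官方实现,未体现其全局/分块两组查询的设计;维度、查询数与头数均为示意值)。

```python
# 2D Spatial Perceiver 思路示意:K 个可学习查询通过交叉注意力汇聚记忆特征
import torch
import torch.nn as nn

class ToySpatialPerceiver(nn.Module):
    def __init__(self, dim=256, num_queries=64, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, memory):                  # memory: (B, N_tokens, dim)
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        compressed, _ = self.attn(q, memory, memory)
        return compressed                       # (B, num_queries, dim),查询数远小于记忆 token 数

mem = torch.randn(1, 4096, 256)                 # 密集存储的帧级记忆
print(ToySpatialPerceiver()(mem).shape)         # torch.Size([1, 64, 256])
```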

链接: https://arxiv.org/abs/2501.07256
作者: Chong Zhou,Chenchen Zhu,Yunyang Xiong,Saksham Suri,Fanyi Xiao,Lemeng Wu,Raghuraman Krishnamoorthi,Bo Dai,Chen Change Loy,Vikas Chandra,Bilge Soran
机构: Meta Reality Labs; Nanyang Technological University (南洋理工大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code will be released at this https URL

点击查看摘要

Abstract:On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also the latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential so that the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM achieves 87.7, 70.0, 72.3, and 71.7 J&F on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max.
zh

[CV-28] MOS-Attack: A Scalable Multi-objective Adversarial Attack Framework CVPR2025

【速读】:该论文试图解决现有对抗攻击方法在评估和增强深度神经网络(DNNs)鲁棒性时存在的局限性。具体来说,现有的单目标对抗攻击方法主要依赖于替代损失函数(surrogate loss function),未能充分利用多个损失函数的协同效应和冲突关系,导致攻击效果受限。为了解决这一问题,论文提出了一种新的对抗攻击框架——多目标集基攻击(Multi-Objective Set-based Attack, MOS Attack)。该框架通过采用基于集的多目标优化策略,能够自动挖掘多个损失函数之间的协同模式,从而在不增加额外参数的情况下生成更强大的对抗样本。实验表明,MOS Attack在攻击效果上优于单目标攻击方法,并且在减少损失函数数量的情况下仍能保持优越的性能。
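
多损失协同的基本形式,可以用“在 PGD 迭代中聚合多个替代损失”来粗略示意(非 MOS-Attack 官方实现,未包含其基于集合的多目标优化与协同模式自动挖掘;这里仅等权聚合交叉熵与 CW 间隔两种损失,系数与步数为示意值)。

```python
# 多损失聚合的 PGD 式对抗攻击示意
import torch
import torch.nn.functional as F

def multi_loss_pgd(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        ce = F.cross_entropy(logits, y)
        # CW 间隔损失:压低真实类 logit,抬高最强竞争类 logit
        true = logits.gather(1, y[:, None]).squeeze(1)
        other = logits.scatter(1, y[:, None], float("-inf")).max(dim=1).values
        cw = (true - other).mean()
        grad = torch.autograd.grad(ce - cw, x_adv)[0]   # 两个损失简单等权聚合后做梯度上升
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0, 1)
    return x_adv
```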

链接: https://arxiv.org/abs/2501.07251
作者: Ping Guo,Cheng Gong,Xi Lin,Fei Liu,Zhichao Lu,Qingfu Zhang,Zhenkun Wang
机构: Department of Computer Science, City University of Hong Kong (香港城市大学计算机科学系)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review of CVPR 2025

点击查看摘要

Abstract:Crafting adversarial examples is crucial for evaluating and enhancing the robustness of Deep Neural Networks (DNNs), presenting a challenge equivalent to maximizing a non-differentiable 0-1 loss function. However, existing single objective methods, namely adversarial attacks focus on a surrogate loss function, do not fully harness the benefits of engaging multiple loss functions, as a result of insufficient understanding of their synergistic and conflicting nature. To overcome these limitations, we propose the Multi-Objective Set-based Attack (MOS Attack), a novel adversarial attack framework leveraging multiple loss functions and automatically uncovering their interrelations. The MOS Attack adopts a set-based multi-objective optimization strategy, enabling the incorporation of numerous loss functions without additional parameters. It also automatically mines synergistic patterns among various losses, facilitating the generation of potent adversarial attacks with fewer objectives. Extensive experiments have shown that our MOS Attack outperforms single-objective attacks. Furthermore, by harnessing the identified synergistic patterns, MOS Attack continues to show superior results with a reduced number of loss functions.
zh

[CV-29] Depth and Image Fusion for Road Obstacle Detection Using Stereo Camera

【速读】:该论文致力于解决在道路上检测物体的问题,特别是在物体出现时间、大小和形状未知的情况下,传统的机器学习(ML)和深度学习(DL)方法不适用。由于人工照明的变化、不均匀的路面纹理以及物体的未知特征,这一任务变得更加复杂。为了解决这一问题,作者提出了一种深度信息和图像融合的方法,该方法结合了基于RGB的小对比度物体搜索和基于立体图像的障碍物检测,并采用了SLIC超像素分割技术。通过在地下停车场对静态和低速障碍物进行实验,该方法成功检测并跟踪了小物体,如停车基础设施、遗留在路上的物品、车轮、掉落的箱子等。解决方案的关键在于融合深度信息和图像分析,以克服单一方法的局限性。
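
其中 SLIC 超像素分割这一步可直接用 scikit-image 复现(分割数与紧致度为示意参数,非论文设定)。

```python
# SLIC 超像素分割示意,用于后续基于立体图像的障碍物检测
import numpy as np
from skimage.io import imread
from skimage.segmentation import slic, mark_boundaries

image = imread("parking_lot.png")[..., :3]
segments = slic(image, n_segments=400, compactness=10, start_label=0)
overlay = mark_boundaries(image, segments)          # 可视化超像素边界
print("superpixels:", segments.max() + 1)
```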

链接: https://arxiv.org/abs/2501.07245
作者: Oleg Perezyabov,Mikhail Gavrilenkov,Ilya Afanasyev
机构: Huawei St. Petersburg Research Center(华为圣彼得堡研究中心); Independent Researcher(独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 8 pages, 15 figures

点击查看摘要

Abstract:This paper is devoted to the detection of objects on a road, performed with a combination of two methods based on both the use of depth information and video analysis of data from a stereo camera. Since neither the time of the appearance of an object on the road, nor its size and shape is known in advance, ML/DL-based approaches are not applicable. The task becomes more complicated due to variations in artificial illumination, inhomogeneous road surface texture, and unknown character and features of the object. To solve this problem we developed the depth and image fusion method that complements a search of small contrast objects by RGB-based method, and obstacle detection by stereo image-based approach with SLIC superpixel segmentation. We conducted experiments with static and low speed obstacles in an underground parking lot and demonstrated the successful work of the developed technique for detecting and even tracking small objects, which can be parking infrastructure objects, things left on the road, wheels, dropped boxes, etc.
zh

[CV-30] CSTA: Spatial-Temporal Causal Adaptive Learning for Exemplar-Free Video Class-Incremental Learning

【速读】:该论文试图解决在视频数据的类增量学习(Class-Incremental Learning, CIL)中,如何在引入新类别的同时保留过去类别的信息,并有效处理视频数据中空间外观和时间动作的复杂性。为了解决这一问题,论文提出了一种无示例框架,通过引入独立的时空适配器(spatiotemporal adapters)来学习新类别的模式,并满足每个类别独特的增量信息表示需求。然而,简单地应用这些适配器会阻碍空间和时间信息增量之间的内在联系,影响新学习类别信息的表示效率。为此,论文从因果关系的角度提出了两个关键创新:首先,设计了一个因果蒸馏模块(causal distillation module),以保持时空知识之间的关系,从而实现更高效的表示;其次,提出了一种因果补偿机制(causal compensation mechanism),以减少不同类型信息在增量和记忆过程中的冲突。实验结果表明,该框架在基准数据集上取得了新的最先进结果,平均准确率比当前基于示例的方法高出4.2%。

链接: https://arxiv.org/abs/2501.07236
作者: Tieyuan Chen,Huabin Liu,Chern Hong Lim,John See,Xing Gao,Junhui Hou,Weiyao Lin
机构: Department of Electronic Engineering, Shanghai Jiao Tong University (上海交通大学电子工程系); Zhongguancun Academy (中关村研究院); School of Information Technology, Monash University (莫纳什大学信息技术学院); School of Mathematical and Computer Sciences, Heriot-Watt University (赫瑞瓦特大学数学与计算机科学学院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Department of Computer Science, City University of Hong Kong (香港城市大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE TCSVT Submission

点击查看摘要

Abstract:Continual learning aims to acquire new knowledge while retaining past information. Class-incremental learning (CIL) presents a challenging scenario where classes are introduced sequentially. For video data, the task becomes more complex than image data because it requires learning and preserving both spatial appearance and temporal action involvement. To address this challenge, we propose a novel exemplar-free framework that equips separate spatiotemporal adapters to learn new class patterns, accommodating the incremental information representation requirements unique to each class. While separate adapters are proven to mitigate forgetting and fit unique requirements, naively applying them hinders the intrinsic connection between spatial and temporal information increments, affecting the efficiency of representing newly learned class information. Motivated by this, we introduce two key innovations from a causal perspective. First, a causal distillation module is devised to maintain the relation between spatial-temporal knowledge for a more efficient representation. Second, a causal compensation mechanism is proposed to reduce the conflicts during increment and memorization between different types of information. Extensive experiments conducted on benchmark datasets demonstrate that our framework can achieve new state-of-the-art results, surpassing current example-based methods by 4.2% in accuracy on average.
zh

[CV-31] MECD: Unlocking Event-Level Causal Graph Discovery for Video Reasoning

【速读】:该论文试图解决视频因果推理(Video Causal Reasoning)领域中的局限性,特别是现有方法主要局限于问答范式,且仅关注包含孤立事件和基本因果关系的短视频片段,缺乏对包含多个相互关联事件的长视频进行综合和结构化因果分析的能力。为此,论文提出了一个新的任务和数据集,称为多事件因果发现(Multi-Event Causal Discovery, MECD),旨在揭示长视频中按时间顺序分布的事件之间的因果关系,并通过视觉片段和事件文本描述生成一个综合且结构化的事件级视频因果图,解释结果事件发生的原因和过程。

解决方案的关键在于提出了一种基于Granger因果方法(Granger Causality)的新框架,该框架结合了高效的基于掩码的事件预测模型,用于执行事件Granger测试(Event Granger Test)。该测试通过比较在前提事件被掩码和未掩码时对结果事件的预测来估计因果关系。此外,论文还整合了因果推理技术,如前门调整(front-door adjustment)和反事实推理(counterfactual inference),以应对MECD中的因果混淆和虚假因果关系等挑战。同时,引入了上下文链推理(context chain reasoning)以进行更稳健和泛化的推理。实验结果表明,该框架在推理完整因果关系方面优于GPT-4o和VideoChat2,分别提高了5.77%和2.70%,并且因果关系图还能提升下游视频理解任务(如视频问答和视频事件预测)的性能。
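
事件 Granger 测试的核心是比较“遮蔽某个前提事件前后”模型对结果事件预测的变化,可用如下接口示意(非论文官方实现;玩具预测器仅演示接口,真实系统应为基于掩码的事件预测网络)。

```python
# 事件 Granger 测试思路示意:遮蔽前提事件,看结果事件的对数似然下降多少
import torch

def event_granger_score(predictor, events, premise_idx, result_idx):
    """predictor(events, mask, result_idx) -> 结果事件的对数似然;
    mask 为被遮蔽的前提事件索引集合。"""
    with torch.no_grad():
        ll_full = predictor(events, set(), result_idx)
        ll_masked = predictor(events, {premise_idx}, result_idx)
    return float(ll_full - ll_masked)   # 分数越大,说明该前提事件的因果贡献越强

# 玩具预测器:仅演示接口
def toy_predictor(events, mask, result_idx):
    context = [e for i, e in enumerate(events) if i not in mask and i != result_idx]
    return torch.tensor(0.1 * len(context))  # 示意:可用上下文越多,似然越高

print(event_granger_score(toy_predictor, ["e0", "e1", "e2"], premise_idx=0, result_idx=2))
```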

链接: https://arxiv.org/abs/2501.07227
作者: Tieyuan Chen,Huabin Liu,Yi Wang,Yihang Chen,Tianyao He,Chaofan Gan,Huanyu He,Weiyao Lin
机构: Shanghai Jiao Tong University (上海交通大学); Monash University (莫纳什大学); Zhongguancun Academy (中关村研究院); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE TPAMI Submission. arXiv admin note: substantial text overlap with arXiv:2409.17647

点击查看摘要

Abstract:Video causal reasoning aims to achieve a high-level understanding of videos from a causal perspective. However, it exhibits limitations in its scope, primarily executed in a question-answering paradigm and focusing on brief video segments containing isolated events and basic causal relations, lacking comprehensive and structured causality analysis for videos with multiple interconnected events. To fill this gap, we introduce a new task and dataset, Multi-Event Causal Discovery (MECD). It aims to uncover the causal relations between events distributed chronologically across long videos. Given visual segments and textual descriptions of events, MECD identifies the causal associations between these events to derive a comprehensive and structured event-level video causal graph explaining why and how the result event occurred. To address the challenges of MECD, we devise a novel framework inspired by the Granger Causality method, incorporating an efficient mask-based event prediction model to perform an Event Granger Test. It estimates causality by comparing the predicted result event when premise events are masked versus unmasked. Furthermore, we integrate causal inference techniques such as front-door adjustment and counterfactual inference to mitigate challenges in MECD like causality confounding and illusory causality. Additionally, context chain reasoning is introduced to conduct more robust and generalized reasoning. Experiments validate the effectiveness of our framework in reasoning complete causal relations, outperforming GPT-4o and VideoChat2 by 5.77% and 2.70%, respectively. Further experiments demonstrate that causal relation graphs can also contribute to downstream video understanding tasks such as video question answering and video event prediction.
zh

[CV-32] Exploring the Use of Contrastive Language-Image Pre-Training for Human Posture Classification: Insights from Yoga Pose Analysis

【速读】:该论文旨在解决图像和视频中人体姿态分类的准确性问题,特别是在瑜伽姿势分类中的应用。解决方案的关键在于利用多模态学习方法,尤其是对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)模型,通过迁移学习(transfer learning)对包含15,301张图像(真实和合成)的82类数据集进行微调(fine-tuning)。研究结果表明,微调后的CLIP模型在3826张测试图像上达到了超过85%的分类准确率,比现有最先进模型提升了约6%,且训练时间比基于YOLOv8的模型减少了3.5倍。此外,在小规模数据集(如每类6个姿势,分别包含1301和401张训练图像)上,微调模型的准确率分别达到了98.8%和99.1%。实验还表明,即使在每类仅有20张图像的情况下,六类数据集的分类准确率仍可达到约90%。这些结果表明,CLIP模型在瑜伽姿势分类及更广泛的人体姿态分类任务中具有显著的应用潜力,且其推理时间(约7毫秒)支持其在实时自动化系统中的集成,例如开发实时个人瑜伽辅助系统。
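
CLIP 分类的推理形式可示意如下(非论文官方代码;提示句式、类别与模型规格均为假设示例。微调即在此图文相似度上用交叉熵继续更新权重)。

```python
# 用 CLIP 做姿势分类的零样本推理示意
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

poses = ["downward dog", "tree pose", "warrior II", "triangle pose"]
prompts = [f"a photo of a person doing the {p} yoga pose" for p in poses]

image = Image.open("yoga.jpg").convert("RGB")
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
print(poses[int(probs.argmax())], float(probs.max()))
```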

链接: https://arxiv.org/abs/2501.07221
作者: Andrzej D. Dobrzycki,Ana M. Bernardos,Luca Bergesio,Andrzej Pomirski,Daniel Sáez-Trigueros
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate human posture classification in images and videos is crucial for automated applications across various fields, including work safety, physical rehabilitation, sports training, or daily assisted living. Recently, multimodal learning methods, such as Contrastive Language-Image Pretraining (CLIP), have advanced significantly in jointly understanding images and text. This study aims to assess the effectiveness of CLIP in classifying human postures, focusing on its application in yoga. Despite the initial limitations of the zero-shot approach, applying transfer learning on 15,301 images (real and synthetic) with 82 classes has shown promising results. The article describes the full procedure for fine-tuning, including the choice for image description syntax, models and hyperparameters adjustment. The fine-tuned CLIP model, tested on 3826 images, achieves an accuracy of over 85%, surpassing the current state-of-the-art of previous works on the same dataset by approximately 6%, its training time being 3.5 times lower than what is needed to fine-tune a YOLOv8-based model. For more application-oriented scenarios, with smaller datasets of six postures each, containing 1301 and 401 training images, the fine-tuned models attain an accuracy of 98.8% and 99.1%, respectively. Furthermore, our experiments indicate that training with as few as 20 images per pose can yield around 90% accuracy in a six-class dataset. This study demonstrates that this multimodal technique can be effectively used for yoga pose classification, and possibly for human posture classification, in general. Additionally, CLIP inference time (around 7 ms) supports that the model can be integrated into automated systems for posture evaluation, e.g., for developing a real-time personal yoga assistant for performance assessment.
zh

[CV-33] TimeLogic: A Temporal Logic Benchmark for Video QA

【速读】:该论文试图解决当前视频问答(VideoQA)基准测试中缺乏对时间逻辑理解能力评估的问题。由于标注时间逻辑的复杂性,现有的VideoQA基准测试很少关注这一关键技能,尽管视觉-语言模型取得了进展,但评估其时间逻辑推理能力仍然是一个挑战,主要原因是缺乏需要正式、复杂时间推理的问答对。为了解决这一问题,作者提出了TimeLogic QA(TLQA)框架,该框架能够自动生成专门用于评估时间逻辑理解的问答对。TLQA框架的关键在于利用现有视频数据集中的时间标注和逻辑理论中的时间操作符,构建测试事件序列及其时间关系理解的问题。该框架具有通用性和可扩展性,能够利用带有时间动作分割标注或时间场景图标注的视频数据集,自动生成时间逻辑问题。通过使用STAR、Breakfast、AGQA和CrossTask四个数据集,TLQA生成了包含2k和10k问答对的小型(TLQA-S)和大型(TLQA-L)VideoQA数据集变体,每个数据集分别包含32k和160k个问答对。作者通过TLQA对领先的VideoQA模型进行了全面评估,以基准测试其时间逻辑理解能力,并评估了这些模型在16个不同时间复杂性类别上的时间推理性能。
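
基于时间标注与逻辑操作符自动生成问答对的思路可示意如下(非 TLQA 官方实现;这里只演示最简单的 before/after 模板,题型与句式均为假设示例)。

```python
# 从时间动作标注自动生成时间逻辑问答对的极简示意
def generate_temporal_qa(segments):
    """segments: [(action, start, end), ...],来自视频的时间动作标注。"""
    qa_pairs = []
    for a1, s1, e1 in segments:
        for a2, s2, e2 in segments:
            if e1 <= s2:   # a1 在 a2 之前结束 -> "before"/"after" 关系
                qa_pairs.append((f"What does the person do before '{a2}'?", a1))
                qa_pairs.append((f"What does the person do after '{a1}'?", a2))
    return qa_pairs

segs = [("take cup", 0.0, 2.1), ("pour coffee", 2.5, 6.0), ("stir", 6.2, 8.0)]
for q, a in generate_temporal_qa(segs)[:4]:
    print(q, "->", a)
```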

链接: https://arxiv.org/abs/2501.07214
作者: Sirnam Swetha,Hilde Kuehne,Mubarak Shah
机构: University of Central Florida(中佛罗里达大学); University of Tuebingen(蒂宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Temporal logical understanding, a core facet of human cognition, plays a pivotal role in capturing complex sequential events and their temporal relationships within videos. This capability is particularly crucial in tasks like Video Question Answering (VideoQA), where the goal is to process visual data over time together with textual data to provide coherent answers. However, current VideoQA benchmarks devote little focus to evaluating this critical skill due to the challenge of annotating temporal logic. Despite the advancement of vision-language models, assessing their temporal logical reasoning powers remains a challenge, primarily due to the lack of QA pairs that demand formal, complex temporal reasoning. To bridge this gap, we introduce the TimeLogic QA (TLQA) framework to automatically generate the QA pairs, specifically designed to evaluate the temporal logical understanding. To this end, TLQA leverages temporal annotations from existing video datasets together with temporal operators derived from logic theory to construct questions that test understanding of event sequences and their temporal relationships. TLQA framework is generic and scalable, capable of leveraging both, existing video action datasets with temporal action segmentation annotations, or video datasets with temporal scene graph annotations, to automatically generate temporal logical questions. We leverage 4 datasets, STAR, Breakfast, AGQA, and CrossTask, and generate two VideoQA dataset variants - small (TLQA-S) and large (TLQA-L) - containing 2k and 10k QA pairs for each category, resulting in 32k and 160k total pairs per dataset. We undertake a comprehensive evaluation of leading-edge VideoQA models, employing the TLQA to benchmark their temporal logical understanding capabilities. We assess the VideoQA model’s temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
zh

[CV-34] Multi-face emotion detection for effective Human-Robot Interaction

【速读】:该论文旨在解决如何通过情感识别技术增强人机交互(human-robot interaction)的问题,特别是在移动人形机器人(mobile humanoid robot)中的应用。为了实现这一目标,研究提出了一种集成在移动人形机器人中的面部情感检测接口(facial emotion detection interface),能够实时显示多个个体的情感。解决方案的关键在于开发并评估了多种用于面部表情识别(facial expression recognition)的深度神经网络模型(deep neural network models),并在一致的计算机条件下进行了测试,取得了良好的结果。此外,研究还考虑了在移动人形机器人上实现该应用时,准确性与内存占用(memory footprint)之间的权衡,以确保系统的有效性和实用性。

链接: https://arxiv.org/abs/2501.07213
作者: Mohamed Ala Yahyaoui,Mouaad Oujabour,Leila Ben Letaifa,Amine Bohi
机构: CESI LINEACT Laboratory, UR 7527, Vandoeuvre-lès-Nancy, 54500, France; CESI LINEACT Laboratory, UR 7527, Dijon, 21800, France
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 9 pages, 8 figures and 1 table. Accepted at the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025), Porto, Portugal

点击查看摘要

Abstract:The integration of dialogue interfaces in mobile devices has become ubiquitous, providing a wide array of services. As technology progresses, humanoid robots designed with human-like features to interact effectively with people are gaining prominence, and the use of advanced human-robot dialogue interfaces is continually expanding. In this context, emotion recognition plays a crucial role in enhancing human-robot interaction by enabling robots to understand human intentions. This research proposes a facial emotion detection interface integrated into a mobile humanoid robot, capable of displaying real-time emotions from multiple individuals on a user interface. To this end, various deep neural network models for facial expression recognition were developed and evaluated under consistent computer-based conditions, yielding promising results. Afterwards, a trade-off between accuracy and memory footprint was carefully considered to effectively implement this application on a mobile humanoid robot.
zh

[CV-35] FaceOracle: Chat with a Face Image Oracle

【速读】:该论文试图解决在身份证和旅行证件签发过程中,如何确保提交的人脸图像符合国际标准的质量要求的问题。高质量的图像对于人工审查和自动人脸识别系统都至关重要。论文提出的解决方案是FaceOracle,一个基于大语言模型(LLM)的AI助手,能够通过自然对话的方式帮助用户分析人脸图像,并使用符合标准的算法进行质量评估。FaceOracle的关键在于利用LLM的强大能力,使用户能够理解各种人脸图像质量概念,并解释人脸图像质量评估(FIQA)算法的结果。通过将FaceOracle集成到签发机构的工作流程中,专家可以更高效地分析、理解并传达他们的决策,从而提高工作效率。

链接: https://arxiv.org/abs/2501.07202
作者: Wassim Kabbani,Kiran Raja,Raghavendra Ramachandra,Christoph Busch
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A face image is a mandatory part of ID and travel documents. Obtaining high-quality face images when issuing such documents is crucial for both human examiners and automated face recognition systems. In several international standards, face image quality requirements are intricate and defined in detail. Identifying and understanding non-compliance or defects in the submitted face images is crucial for both issuing authorities and applicants. In this work, we introduce FaceOracle, an LLM-powered AI assistant that helps its users analyze a face image in a natural conversational manner using standard compliant algorithms. Leveraging the power of LLMs, users can get explanations of various face image quality concepts as well as interpret the outcome of face image quality assessment (FIQA) algorithms. We implement a proof-of-concept that demonstrates how experts at an issuing authority could integrate FaceOracle into their workflow to analyze, understand, and communicate their decisions more efficiently, resulting in enhanced productivity.
zh

[CV-36] VAGeo: View-specific Attention for Cross-View Object Geo-Localization ICASSP2025

【速读】:该论文旨在解决跨视角物体地理定位(Cross-view Object Geo-localization, CVOGL)中的问题,即在地面或无人机视角的查询图像中精确定位目标物体在卫星图像中的位置。现有方法通常将地面视角和无人机视角的查询图像等同对待,忽略了它们之间的视角差异以及查询图像与卫星参考图像之间的空间关联。为此,论文提出了一种新颖的视角特定注意力地理定位方法(View-specific Attention Geo-localization, VAGeo)。该方法的关键在于两个核心模块:视角特定位置编码(View-specific Positional Encoding, VSPE)模块和通道-空间混合注意力(Channel-Spatial Hybrid Attention, CSHA)模块。VSPE模块根据不同视角(地面和无人机)的特性设计了特定的位置编码,以更准确地识别查询图像中的目标物体;CSHA模块则通过结合通道注意力和空间注意力机制,学习更具判别性的特征。实验结果表明,VAGeo在CVOGL数据集上显著提升了性能,特别是在地面视角和无人机视角的查询图像中,准确率分别从45.43%/42.24%提升到48.21%/45.22%和从61.97%/57.66%提升到66.19%/61.87%。
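
其中通道-空间混合注意力与 CBAM 的常见写法类似,可示意如下(非 VAGeo 官方实现,未包含视角特定位置编码部分;缩减率与卷积核大小为常用示意值)。

```python
# CBAM 风格的通道-空间混合注意力示意:先通道注意力,再空间注意力
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                   # x: (B, C, H, W)
        # 通道注意力:全局平均/最大池化 -> 共享 MLP -> sigmoid 加权
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx)[:, :, None, None]
        # 空间注意力:沿通道维做平均/最大 -> 7x7 卷积 -> sigmoid 加权
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))

print(ChannelSpatialAttention(64)(torch.randn(2, 64, 32, 32)).shape)
```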

链接: https://arxiv.org/abs/2501.07194
作者: Zhongyang Li,Xin Yuan,Wei Liu,Xin Xu
机构: 1 School of Computer Science and Technology, Wuhan University of Science and Technology (武汉科技大学计算机科学与技术学院); 2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System (湖北省智能信息处理与实时工业系统重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Cross-view object geo-localization (CVOGL) aims to locate an object of interest in a captured ground- or drone-view image within the satellite image. However, existing works treat ground-view and drone-view query images equivalently, overlooking their inherent viewpoint discrepancies and the spatial correlation between the query image and the satellite-view reference image. To this end, this paper proposes a novel View-specific Attention Geo-localization method (VAGeo) for accurate CVOGL. Specifically, VAGeo contains two key modules: view-specific positional encoding (VSPE) module and channel-spatial hybrid attention (CSHA) module. In object-level, according to the characteristics of different viewpoints of ground and drone query images, viewpoint-specific positional codings are designed to more accurately identify the click-point object of the query image in the VSPE module. In feature-level, a hybrid attention in the CSHA module is introduced by combining channel attention and spatial attention mechanisms simultaneously for learning discriminative features. Extensive experimental results demonstrate that the proposed VAGeo gains a significant performance improvement, i.e., improving acc@0.25/acc@0.5 on the CVOGL dataset from 45.43%/42.24% to 48.21%/45.22% for ground-view, and from 61.97%/57.66% to 66.19%/61.87% for drone-view.
zh

[CV-37] A4O: All Trigger for One sample

【速读】:该论文旨在解决现有后门攻击(backdoor attacks)研究中单一触发器(trigger)类型导致的防御漏洞问题。现有防御方法通常假设触发器以统一方式出现,这种假设为更复杂的后门攻击提供了可乘之机。论文提出了一种新颖的后门攻击机制,通过结合多种类型的触发器,专注于隐蔽性和攻击效果。其关键解决方案在于观察到后门攻击的性能、可检测性和可移除性与触发器的强度成正比,因此通过降低每种触发器的强度并将它们组合起来,实现强效的后门攻击,同时避免被现有防御机制检测到。实验结果表明,该方法在多个标准数据集上能够实现高攻击成功率(ASRs),并成功绕过当前最先进的防御技术。
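
“降低单个触发器强度、多种触发器叠加”的思想可示意如下(非论文官方实现;混合触发器、补丁触发器及各强度系数均为假设示例)。

```python
# 叠加弱混合触发器与弱补丁触发器的投毒示意
import torch

def apply_combined_trigger(images, blend_pattern, patch, blend_alpha=0.05, patch_alpha=0.3):
    """images: (B, C, H, W),取值范围 [0, 1];各 alpha 为假设的触发强度。"""
    poisoned = (1 - blend_alpha) * images + blend_alpha * blend_pattern   # 弱全图混合
    ph, pw = patch.shape[-2:]
    region = poisoned[..., -ph:, -pw:]                                    # 右下角弱补丁
    poisoned[..., -ph:, -pw:] = (1 - patch_alpha) * region + patch_alpha * patch
    return poisoned.clamp(0, 1)

imgs = torch.rand(4, 3, 32, 32)
blend = torch.rand(3, 32, 32)
patch = torch.ones(3, 4, 4)
poisoned = apply_combined_trigger(imgs, blend, patch)
```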

链接: https://arxiv.org/abs/2501.07192
作者: Duc Anh Vu,Anh Tuan Tran,Cong Tran,Cuong Pham
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Backdoor attacks have become a critical threat to deep neural networks (DNNs), drawing many research interests. However, most of the studied attacks employ a single type of trigger. Consequently, proposed backdoor defenders often rely on the assumption that triggers would appear in a unified way. In this paper, we show that this naive assumption can create a loophole, allowing more sophisticated backdoor attacks to bypass. We design a novel backdoor attack mechanism that incorporates multiple types of backdoor triggers, focusing on stealthiness and effectiveness. Our journey begins with the intriguing observation that the performance of a backdoor attack in deep learning models, as well as its detectability and removability, are all proportional to the magnitude of the trigger. Based on this correlation, we propose reducing the magnitude of each trigger type and combining them to achieve a strong backdoor relying on the combined trigger while still staying safely under the radar of defenders. Extensive experiments on three standard datasets demonstrate that our method can achieve high attack success rates (ASRs) while consistently bypassing state-of-the-art defenses.
zh

[CV-38] Uncertainty Guarantees on Automated Precision Weeding using Conformal Prediction

【速读】:该论文试图解决精准农业(Precision Agriculture)中,特别是精准除草(Precision Weeding)领域,农民对基于深度学习(Deep Learning)和计算机视觉(Computer Vision)的自动化系统缺乏信任的问题。这一信任缺失主要源于深度神经网络(Deep Neural Networks)的不透明性和复杂性,以及制造商无法提供有效的性能保证。论文提出的解决方案是采用保形预测(Conformal Prediction),这是一种在机器学习社区中广泛认可的方法,能够在极小的约束条件下为任何黑箱模型的预测提供可信的保证。通过将保形预测应用于基于深度学习的图像分类任务,并结合明确的喷洒决策规则,论文开发了一个精准喷洒管道,并在两种实际场景中进行了评估:一种是分布内(In-Distribution)条件,另一种是接近分布外(Near Out-of-Distribution)条件。实验结果表明,该方法能够为至少90%的杂草喷洒提供形式化(Certifiable)的保证。
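
论文使用的保形预测可用标准的分裂保形(split conformal)分类流程示意(非论文官方实现;非一致性分数取 1 − 真实类概率,是最常见的选择之一)。在标定集上求分位阈值后,预测集以不低于 1 − alpha 的概率覆盖真实类别。

```python
# 分裂保形预测(分类)示意:标定分位阈值 -> 构造带覆盖保证的预测集
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """cal_probs: (n, K) softmax 概率;分数取 1 - 真实类概率。"""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = np.ceil((n + 1) * (1 - alpha)) / n          # 有限样本修正后的分位水平
    return np.quantile(scores, min(q, 1.0), method="higher")

def prediction_set(probs, qhat):
    return np.where(1.0 - probs <= qhat)[0]         # 分数不超过阈值的所有类别

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=500)     # 用随机数据演示接口
cal_labels = rng.integers(0, 3, size=500)
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print(prediction_set(np.array([0.7, 0.2, 0.1]), qhat))
```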

链接: https://arxiv.org/abs/2501.07185
作者: Paul Melki(IMS),Lionel Bombrun(IMS),Boubacar Diallo,Jérôme Dias,Jean-Pierre da Costa(IMS)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Precision agriculture in general, and precision weeding in particular, have greatly benefited from the major advancements in deep learning and computer vision. A large variety of commercial robotic solutions are already available and deployed. However, the adoption by farmers of such solutions is still low for many reasons, an important one being the lack of trust in these systems. This is in great part due to the opaqueness and complexity of deep neural networks and the manufacturers’ inability to provide valid guarantees on their performance. Conformal prediction, a well-established methodology in the machine learning community, is an efficient and reliable strategy for providing trustworthy guarantees on the predictions of any black-box model under very minimal constraints. Bridging the gap between the safe machine learning and precision agriculture communities, this article showcases conformal prediction in action on the task of precision weeding through deep learning-based image classification. After a detailed presentation of the conformal prediction methodology and the development of a precision spraying pipeline based on a “conformalized” neural network and well-defined spraying decision rules, the article evaluates this pipeline on two real-world scenarios: one under in-distribution conditions, the other reflecting a near out-of-distribution setting. The results show that we are able to provide formal, i.e. certifiable, guarantees on spraying at least 90% of the weeds.
zh

[CV-39] Radial Distortion in Face Images: Detection and Impact

【速读】:该论文试图解决在无监督自注册场景下,通过智能手机获取的人脸图像中存在的径向畸变(radial distortion,也称为鱼眼效应)问题,及其对人脸识别系统(FRS)性能的影响。径向畸变会导致图像质量下降,进而影响FRS的准确性和可靠性,尤其是在在线身份验证和旅行证件签发等应用中。论文提出了一种有效的径向畸变检测模型,能够在注册场景中检测并标记出存在径向畸变的图像。该模型被形式化为一种人脸图像质量评估(FIQA)算法,并通过实验详细研究了径向畸变对FRS性能的影响。实验结果表明,所提出的模型在检测径向畸变方面表现出色,并为如何在操作系统中最佳使用这些模型提供了有价值的见解。解决方案的关键在于开发并验证了一种能够有效检测径向畸变的FIQA算法,从而确保注册图像的质量,提升FRS的整体性能。
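
径向畸变通常用 Brown–Conrady 模型的径向部分描述:x_d = x·(1 + k1·r² + k2·r⁴)。下面用 numpy 示意如何对归一化坐标施加这种畸变(系数为示意值,可用于合成带“鱼眼效应”的样本;这并非论文的检测模型本身)。

```python
# 径向畸变多项式模型示意:对归一化图像坐标施加畸变
import numpy as np

def radial_distort(points, k1=0.3, k2=0.05):
    """points: (N, 2) 归一化图像坐标(以主点为原点);k1、k2 为示意系数。"""
    r2 = np.sum(points ** 2, axis=1, keepdims=True)
    return points * (1 + k1 * r2 + k2 * r2 ** 2)   # x_d = x * (1 + k1 r^2 + k2 r^4)

pts = np.array([[0.0, 0.0], [0.5, 0.0], [0.5, 0.5]])
print(radial_distort(pts))
```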

链接: https://arxiv.org/abs/2501.07179
作者: Wassim Kabbani,Tristan Le Pessot,Kiran Raja,Raghavendra Ramachandra,Christoph Busch
机构: NTNU, Gjøvik, Norway (挪威科技大学, 挪威); ENSICAEN, Caen, France (法国卡昂国立高等工程师学校, 法国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Acquiring face images of sufficiently high quality is important for online ID and travel document issuance applications using face recognition systems (FRS). Low-quality, manipulated (intentionally or unintentionally), or distorted images degrade the FRS performance and facilitate documents' misuse. Securing quality for enrolment images, especially in the unsupervised self-enrolment scenario via a smartphone, becomes important to assure FRS performance. In this work, we focus on the less studied area of radial distortion (a.k.a. the fish-eye effect) in face images and its impact on FRS performance. We introduce an effective radial distortion detection model that can detect and flag radial distortion in the enrolment scenario. We formalize the detection model as a face image quality assessment (FIQA) algorithm and provide a careful inspection of the effect of radial distortion on FRS performance. Evaluation results show excellent detection performance for the proposed models, and the study on the impact on FRS uncovers valuable insights into how to best use these models in operational systems.

[CV-40] Adaptive Noise-Tolerant Network for Image Segmentation

【Quick Read】: This paper tackles the difficulty of obtaining high-quality ground-truth segmentations as training data for deep learning models in automatic image segmentation, especially for biomedical images such as histopathological images. Because of the fine resolution, large size, and complexity of these images, manually annotating segmentation labels is time-consuming and impractical. The paper therefore proposes a new Adaptive Noise-Tolerant Network (ANTN) model that improves segmentation by integrating imperfect or noisy segmentation results produced by off-the-shelf algorithms. The solution has two key aspects: first, multiple noisy labels are integrated into a single deep learning model; second, the noisy-segmentation modeling (including its probabilistic parameters) is adaptive, adjusting to the appearance of the test image. Experiments on synthetic data and real histopathological images show that ANTN outperforms existing segmentation algorithms.

Link: https://arxiv.org/abs/2501.07163
Authors: Weizhi Li
Affiliations: Texas A&M University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unlike image classification and annotation, for which deep network models have achieved dominant, superior performance compared to traditional computer vision algorithms, deep learning for automatic image segmentation still faces critical challenges. One such hurdle is to obtain ground-truth segmentations as the training labels for deep network training. Especially when we study biomedical images, such as histopathological images (histo-images), it is unrealistic to ask for manual segmentation labels as the ground truth for training due to the fine image resolution as well as the large image size and complexity. In this paper, instead of relying on clean segmentation labels, we study whether and how integrating imperfect or noisy segmentation results from off-the-shelf segmentation algorithms may help achieve better segmentation results through a new Adaptive Noise-Tolerant Network (ANTN) model. We extend the noisy label deep learning to image segmentation with two novel aspects: (1) multiple noisy labels can be integrated into one deep learning model; (2) noisy segmentation modeling, including probabilistic parameters, is adaptive, depending on the given testing image appearance. Implementation of the new ANTN model on both the synthetic data and real-world histo-images demonstrates its effectiveness and superiority over off-the-shelf and other existing deep-learning-based image segmentation algorithms.
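
As a rough illustration of aspect (1), several noisy label maps can be combined in one training objective by weighting a per-pixel cross-entropy against each noisy source with an adaptive per-image confidence. The weighting scheme and all names below are our assumptions, not the ANTN formulation:

```python
import torch
import torch.nn.functional as F

def multi_noisy_label_loss(logits, noisy_masks, confidences):
    """Weight per-pixel cross-entropy against each noisy label source.

    logits      : (B, C, H, W) segmentation network output
    noisy_masks : list of (B, H, W) integer label maps from off-the-shelf segmenters
    confidences : (B, S) adaptive per-image weights over the S sources
                  (e.g. predicted from image appearance), rows summing to 1
    """
    loss = logits.new_zeros(())
    for s, mask in enumerate(noisy_masks):
        per_pixel = F.cross_entropy(logits, mask, reduction="none")  # (B, H, W)
        loss = loss + (confidences[:, s] * per_pixel.mean(dim=(1, 2))).sum()
    return loss / logits.shape[0]
```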

[CV-41] Eye Sclera for Fair Face Image Quality Assessment

【Quick Read】: This paper addresses fairness in face recognition systems (FRS), specifically in face image quality assessment (FIQA). Existing FIQA methods can be biased by demographic factors such as skin tone, compromising system fairness. The paper proposes a fair FIQA scheme based on the sclera region, since the sclera is unaffected by skin tone and demographic variation. Analyzing datasets spanning different skin tones and demographic groups, the paper shows that the sclera region alone suffices to assess image quality measures such as dynamic range, over-exposure, and under-exposure. The key to the solution is using the sclera as an alternative quality assessment region, whose effectiveness for fair FIQA is demonstrated through Error-vs-Discard Characteristic (EDC) curve analysis.

Link: https://arxiv.org/abs/2501.07158
Authors: Wassim Kabbani, Kiran Raja, Raghavendra Ramachandra, Christoph Busch
Affiliations: Norwegian University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fair operational systems are crucial in gaining and maintaining society's trust in face recognition systems (FRS). An FRS starts by capturing an image and assessing its quality before using it further for enrollment or verification. Fair Face Image Quality Assessment (FIQA) schemes therefore become equally important in the context of fair FRS. This work examines the sclera as a quality assessment region for obtaining a fair FIQA. The sclera region is agnostic to demographic variations and skin colour for assessing the quality of a face image. We analyze three skin tone related ISO/IEC face image quality assessment measures and assess the sclera region as an alternative area for assessing FIQ. Our analysis of the face dataset of individuals from different demographic groups representing different skin tones indicates that the sclera region alone suffices to measure dynamic range and over- and under-exposure of the face. The sclera region being agnostic to skin tone, i.e., demographic factors, provides equal utility as a fair FIQA as shown by our Error-vs-Discard Characteristic (EDC) curve analysis.
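
The exposure measures involved are simple statistics once the sclera is segmented. A minimal sketch follows, assuming a grayscale face image and a binary sclera mask; the 247/8 intensity thresholds are illustrative choices of ours, not the ISO/IEC values:

```python
import numpy as np

def sclera_exposure_metrics(gray_face, sclera_mask):
    """Exposure-related quality measures computed on sclera pixels only.

    gray_face  : (H, W) uint8 grayscale face image
    sclera_mask: (H, W) binary mask of the segmented sclera region
    """
    pixels = gray_face[sclera_mask > 0].astype(np.float32)
    dynamic_range = float(pixels.max() - pixels.min())
    over_exposure = float(np.mean(pixels >= 247))   # fraction of near-white pixels
    under_exposure = float(np.mean(pixels <= 8))    # fraction of near-black pixels
    return dynamic_range, over_exposure, under_exposure
```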

[CV-42] Robust Single Object Tracking in LiDAR Point Clouds under Adverse Weather Conditions

【Quick Read】: This paper addresses the insufficient evaluation of LiDAR-based 3D single object tracking (3DSOT) under adverse weather. Current 3DSOT methods perform well on clean datasets but degrade markedly under real-world rain, fog, and snow, largely because adverse-weather benchmarks have been missing. The paper therefore proposes a challenging benchmark comprising two synthetic datasets (KITTI-A and nuScenes-A) and one real-world dataset (CADC-SOT) covering three weather types. On this benchmark, five representative 3D trackers are evaluated for robustness and show significant performance drops. The paper further analyzes three key factors behind the degradation: target distance, template shape corruption, and target shape corruption. Finally, it proposes DRCT, a dual-branch tracking framework based on domain randomization and contrastive learning, which performs strongly under adverse weather. The key to the solution lies in introducing the adverse-weather benchmark and designing a new tracking framework that improves 3DSOT robustness in complex environments.

Link: https://arxiv.org/abs/2501.07133
Authors: Xiantong Zhao, Xiuping Liu, Shengjing Tian, Yinan Han
Affiliations: School of Mathematical Sciences, Dalian University of Technology, China; School of Economics and Management, China University of Mining and Technology, Xuzhou, China; DUT-BSU Joint Institute, Dalian University of Technology, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages

Abstract:3D single object tracking (3DSOT) in LiDAR point clouds is a critical task for outdoor perception, enabling real-time perception of object location, orientation, and motion. Despite the impressive performance of current 3DSOT methods, evaluating them on clean datasets inadequately reflects their comprehensive performance, as the adverse weather conditions in real-world surroundings have not been considered. One of the main obstacles is the lack of adverse weather benchmarks for the evaluation of 3DSOT. To this end, this work proposes a challenging benchmark for LiDAR-based 3DSOT in adverse weather, which comprises two synthetic datasets (KITTI-A and nuScenes-A) and one real-world dataset (CADC-SOT) spanning three weather types: rain, fog, and snow. Based on this benchmark, five representative 3D trackers from different tracking frameworks were evaluated for robustness, showing significant performance degradations. This prompts the question: What are the factors that cause current advanced methods to fail on such adverse weather samples? Consequently, we explore the impacts of adverse weather and answer the above question from three perspectives: 1) target distance; 2) template shape corruption; and 3) target shape corruption. Finally, based on domain randomization and contrastive learning, we designed a dual-branch tracking framework for adverse weather, named DRCT, achieving excellent performance in benchmarks.
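
Domain randomization for LiDAR can be as simple as perturbing clean point clouds so the tracker never trains on pristine geometry. The sketch below (random dropout, Gaussian jitter, uniform clutter) is a generic illustration of the idea; the paper's actual weather simulation is more principled, and all parameters here are our assumptions:

```python
import numpy as np

def randomize_weather(points, drop_prob=0.2, jitter_std=0.02, clutter_ratio=0.05):
    """Crude weather-like corruption of an (N, 3) LiDAR point cloud."""
    keep = np.random.rand(len(points)) > drop_prob           # random point dropout
    pts = points[keep] + np.random.normal(0.0, jitter_std, (keep.sum(), 3))
    n_clutter = int(len(pts) * clutter_ratio)                # spurious returns
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    clutter = np.random.uniform(lo, hi, (n_clutter, 3))
    return np.concatenate([pts, clutter], axis=0)
```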

[CV-43] Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning

【Quick Read】: This paper addresses a key problem in Compositional Zero-Shot Learning (CZSL): existing methods learn semantic representations of seen compositions but often fail to disentangle the independent features of states and objects in images, limiting generalization to unseen compositions. The proposed Duplex method centers on a dual-prototype learning strategy that integrates semantic and visual prototypes through a carefully designed dual-branch architecture, using a Graph Neural Network (GNN) to adaptively update the visual prototypes and capture the complex interactions between states and objects. Duplex further exploits the strong visual-semantic alignment of pre-trained Vision-Language Models (VLMs), aligning image and text representations through a multi-path architecture combined with prompt engineering, thereby ensuring robust generalization in both closed-world and open-world settings.

Link: https://arxiv.org/abs/2501.07114
Authors: Zhong Peng, Yishi Xu, Gerong Wang, Wenchao Chen, Bo Chen, Jing Zhang
Affiliations: Institution1; Institution2
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Compositional Zero-Shot Learning (CZSL) aims to enable models to recognize novel compositions of visual states and objects that were absent during training. Existing methods predominantly focus on learning semantic representations of seen compositions but often fail to disentangle the independent features of states and objects in images, thereby limiting their ability to generalize to unseen compositions. To address this challenge, we propose Duplex, a novel dual-prototype learning method that integrates semantic and visual prototypes through a carefully designed dual-branch architecture, enabling effective representation learning for compositional tasks. Duplex utilizes a Graph Neural Network (GNN) to adaptively update visual prototypes, capturing complex interactions between states and objects. Additionally, it leverages the strong visual-semantic alignment of pre-trained Vision-Language Models (VLMs) and employs a multi-path architecture combined with prompt engineering to align image and text representations, ensuring robust generalization. Extensive experiments on three benchmark datasets demonstrate that Duplex outperforms state-of-the-art methods in both closed-world and open-world settings.

[CV-44] Matching Free Depth Recovery from Structured Light

【Quick Read】: This paper targets depth estimation from images captured by structured light systems. Unlike traditional approaches that rely on an image matching process, it represents scene geometry with a density voxel grid trained via self-supervised differentiable volume rendering. The key idea is to use the color fields derived from the projected patterns of the structured light system during rendering to optimize the geometry field, enabling the geometry field to be optimized in isolation, which contributes to faster convergence and high-quality output. The method also incorporates normalized device coordinates (NDC), a distortion loss, and a novel surface-based color loss to enhance geometric fidelity. Experiments show that, in few-shot settings, the method outperforms existing matching-based techniques in geometric performance, reducing average estimated depth errors by about 60% on synthetic scenes and about 30% on real-world captured scenes, while training roughly three times faster than previous matching-free methods that use implicit representations.

Link: https://arxiv.org/abs/2501.07113
Authors: Zhuohang Yu, Kai Wang, Juyong Zhang
Affiliations: University of Science and Technology of China; China Unicom
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures

Abstract:We present a novel approach for depth estimation from images captured by structured light systems. Unlike many previous methods that rely on image matching process, our approach uses a density voxel grid to represent scene geometry, which is trained via self-supervised differentiable volume rendering. Our method leverages color fields derived from projected patterns in structured light systems during the rendering process, enabling the isolated optimization of the geometry field. This contributes to faster convergence and high-quality output. Additionally, we incorporate normalized device coordinates (NDC), a distortion loss, and a novel surface-based color loss to enhance geometric fidelity. Experimental results demonstrate that our method outperforms existing matching-based techniques in geometric performance for few-shot scenarios, achieving approximately a 60% reduction in average estimated depth errors on synthetic scenes and about 30% on real-world captured scenes. Furthermore, our approach delivers fast training, with a speed roughly three times faster than previous matching-free methods that employ implicit representations.
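
The core of differentiable volume rendering along one ray is standard: density is converted to per-sample opacity, transmittance accumulates, and colors are alpha-composited. A minimal generic sketch (not the paper's exact renderer) of that compositing step:

```python
import torch

def render_ray(densities, colors, deltas):
    """Composite one ray.

    densities: (N,)   density samples along the ray
    colors   : (N, 3) color samples along the ray
    deltas   : (N,)   spacing between consecutive samples
    """
    alpha = 1.0 - torch.exp(-densities * deltas)                 # opacity per sample
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
    )                                                            # transmittance T_i
    weights = trans * alpha                                      # contribution per sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)           # rendered RGB
```

Because the rendered color is differentiable in the densities, a photometric loss against the observed projected pattern back-propagates directly into the geometry field, which is what allows it to be optimized in isolation.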

[CV-45] Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation

【Quick Read】: This paper addresses the static nature of multimodal fusion (visual, acoustic, and textual) in micro-video recommendation. Conventional methods adopt a static fusion strategy that cannot model the diverse relationships among the multimodal information of different micro-videos. The paper proposes Meta Multimodal Fusion (MetaMMF), a meta-learning-based fusion framework whose core idea is to dynamically assign parameters to the multimodal fusion function of each micro-video during representation learning, achieving personalized fusion. Specifically, MetaMMF treats the multimodal fusion of each micro-video as an independent task; based on meta information extracted from the input task's multimodal features, a meta learner parameterizes a neural network as the item-specific fusion function. Experiments on three benchmark datasets show significant improvements over state-of-the-art multimodal recommendation models such as MMGCN, LATTICE, and InvRL. The paper also lightens the model via canonical polyadic decomposition to improve training efficiency, with effectiveness validated experimentally.

Link: https://arxiv.org/abs/2501.07110
Authors: Han Liu, Yinwei Wei, Fan Liu, Wenjie Wang, Liqiang Nie, Tat-Seng Chua
Affiliations: Shandong University; National University of Singapore; Harbin Institute of Technology (Shenzhen)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: This paper has been accepted by ACM Transactions on Information Systems

Abstract:Multimodal information (e.g., visual, acoustic, and textual) has been widely used to enhance representation learning for micro-video recommendation. For integrating multimodal information into a joint representation of micro-video, multimodal fusion plays a vital role in the existing micro-video recommendation approaches. However, the static multimodal fusion used in previous studies is insufficient to model the various relationships among multimodal information of different micro-videos. In this paper, we develop a novel meta-learning-based multimodal fusion framework called Meta Multimodal Fusion (MetaMMF), which dynamically assigns parameters to the multimodal fusion function for each micro-video during its representation learning. Specifically, MetaMMF regards the multimodal fusion of each micro-video as an independent task. Based on the meta information extracted from the multimodal features of the input task, MetaMMF parameterizes a neural network as the item-specific fusion function via a meta learner. We perform extensive experiments on three benchmark datasets, demonstrating the significant improvements over several state-of-the-art multimodal recommendation models, like MMGCN, LATTICE, and InvRL. Furthermore, we lighten our model by adopting canonical polyadic decomposition to improve the training efficiency, and validate its effectiveness through experimental results. Codes are available at this https URL.
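
To make the idea concrete, a toy version of a meta learner that emits per-item fusion parameters is sketched below. MetaMMF parameterizes a full fusion network rather than the three scalar weights assumed here, and all names are ours:

```python
import torch
import torch.nn as nn

class MetaFusion(nn.Module):
    """Meta learner that emits per-item fusion weights for visual/acoustic/textual features."""

    def __init__(self, dim):
        super().__init__()
        # The meta learner maps the concatenated modality features (the "meta
        # information" of this toy version) to one weight per modality.
        self.meta = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 3)
        )

    def forward(self, v, a, t):
        # v, a, t: (B, dim) modality features of a batch of micro-videos.
        w = torch.softmax(self.meta(torch.cat([v, a, t], dim=-1)), dim=-1)
        # Each item gets its own fusion weights, i.e. a dynamic fusion function.
        return w[..., 0:1] * v + w[..., 1:2] * a + w[..., 2:3] * t
```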

[CV-46] The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering

【Quick Read】: This paper offers a comprehensive review of Visual Question Answering (VQA), tracing its development from early work to recent breakthroughs such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods. It surveys the key models, datasets, and techniques that shaped VQA systems, emphasizing the central role of transformer architectures and multimodal pre-training in driving recent progress. The survey also covers specialized applications of VQA in domains such as healthcare, and discusses open challenges including dataset bias, model interpretability, and the need for common-sense reasoning. Finally, it looks ahead to future directions, particularly large multimodal language models and the integration of external knowledge. The key takeaway is that transformer architectures and multimodal pre-training, combined with external knowledge, can improve the performance and generalization of VQA systems.

Link: https://arxiv.org/abs/2501.07109
Authors: Anupam Pandey, Deepjyoti Bodo, Arpan Phukan, Asif Ekbal
Affiliations: 1Department of CSE, IIT Patna; 2School of AIDE, IIT Jodhpur
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual Question Answering (VQA) is an interdisciplinary field that bridges the gap between computer vision (CV) and natural language processing (NLP), enabling Artificial Intelligence (AI) systems to answer questions about images. Since its inception in 2015, VQA has rapidly evolved, driven by advances in deep learning, attention mechanisms, and transformer-based models. This survey traces the journey of VQA from its early days, through major breakthroughs, such as attention mechanisms, compositional reasoning, and the rise of vision-language pre-training methods. We highlight key models, datasets, and techniques that shaped the development of VQA systems, emphasizing the pivotal role of transformer architectures and multimodal pre-training in driving recent progress. Additionally, we explore specialized applications of VQA in domains like healthcare and discuss ongoing challenges, such as dataset bias, model interpretability, and the need for common-sense reasoning. Lastly, we discuss the emerging trends in large multimodal language models and the integration of external knowledge, offering insights into the future directions of VQA. This paper aims to provide a comprehensive overview of the evolution of VQA, highlighting both its current state and potential advancements.

[CV-47] RMAvatar: Photorealistic Human Avatar Reconstruction from Monocular Video Based on Rectified Mesh-embedded Gaussians

【Quick Read】: This paper addresses learning a clothed avatar representation from monocular video. Existing methods struggle with complex non-rigid deformations, especially since the Linear Blend Skinning (LBS) formulation makes it hard to precisely control complex deformations through the human skeleton. The paper proposes RMAvatar, a novel avatar representation based on mesh-embedded Gaussian splatting. The key lies in two modules: a Gaussian initialization module and a Gaussian rectification module. The initialization module embeds Gaussians into the triangular faces of the mesh and controls their motion through the mesh, ensuring low-frequency motion and surface deformation of the avatar. The rectification module learns pose-dependent, fine-grained non-rigid deformation details, further improving the realism and expressiveness of the avatar. Experiments show that RMAvatar achieves state-of-the-art performance in both rendering quality and quantitative evaluation.

Link: https://arxiv.org/abs/2501.07104
Authors: Sen Peng, Weixing Xie, Zilong Wang, Xiaohu Guo, Zhonggui Chen, Baorong Yang, Xiao Dong
Affiliations: College of Computer Engineering, Jimei University, Xiamen, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China; Department of Computer Science, The University of Texas at Dallas, Richardson, United States; School of Informatics, Xiamen University, Xiamen, China; Guangdong Provincial/Zhuhai Key Laboratory of IRADS, BNU-HKBU United International College, Zhuhai, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVM2025

Abstract:We introduce RMAvatar, a novel human avatar representation with Gaussian splatting embedded on mesh to learn clothed avatar from a monocular video. We utilize the explicit mesh geometry to represent motion and shape of a virtual human and implicit appearance rendering with Gaussian Splatting. Our method consists of two main modules: Gaussian initialization module and Gaussian rectification module. We embed Gaussians into triangular faces and control their motion through the mesh, which ensures low-frequency motion and surface deformation of the avatar. Due to the limitations of the LBS formulation, it is hard to model complex non-rigid transformations through the human skeleton alone. We then design a pose-related Gaussian rectification module to learn fine-detailed non-rigid deformations, further improving the realism and expressiveness of the avatar. We conduct extensive experiments on public datasets; RMAvatar shows state-of-the-art performance on both rendering quality and quantitative evaluations. Please see our project page at this https URL.

[CV-48] Dual Scale-aware Adaptive Masked Knowledge Distillation for Object Detection

【Quick Read】: This paper addresses a limitation of existing feature masking knowledge distillation methods: when performing feature masking distillation under global attention guidance, they fail to mine the fine-grained visual clues in feature maps. Existing methods typically apply global masking on single-scale feature maps and overlook locality-aware clues across scales, limiting feature reconstruction. The paper proposes a fine-grained adaptive feature masking distillation framework that distills features across multi-scale feature maps, encoding object-aware locality for more accurate feature reconstruction. The method further combines a masking logits distillation strategy that uses the logits difference between teacher and student networks to guide the distillation process and improve knowledge transfer. Experiments show the method significantly outperforms existing approaches such as DMKD and FreeKD on object detection.

Link: https://arxiv.org/abs/2501.07101
Authors: ZhouRui Zhang, Jun Li, JiaYan Li, ZhiJian Wu, JianHua Xu
Affiliations: Nanjing Normal University; East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent feature masking knowledge distillation methods make use of attention mechanisms to identify either important spatial regions or channel clues for discriminative feature reconstruction. However, most existing strategies perform global attention-guided feature masking distillation without delving into fine-grained visual clues in feature maps. In particular, uncovering locality-aware clues across different scales is conducive to reconstructing region-aware features, thereby significantly benefiting distillation performance. In this study, we propose a fine-grained adaptive feature masking distillation framework for accurate object detection. Different from previous methods in which global masking is performed on single-scale feature maps, we explore the scale-aware feature masking by performing feature distillation across various scales, such that the object-aware locality is encoded for improved feature reconstruction. In addition, our fine-grained feature distillation strategy is combined with a masking logits distillation scheme in which logits difference between teacher and student networks is utilized to guide the distillation process. Thus, it can help the student model to better learn from the teacher counterpart with improved knowledge transfer. Extensive experiments on the detection task demonstrate the superiority of our method. For example, when RetinaNet, RepPoints and Cascade Mask RCNN are used as teacher detectors, the student network achieves mAP scores of 41.5%, 42.9%, and 42.6%, respectively, outperforming state-of-the-art methods such as DMKD and FreeKD.

[CV-49] Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics AAAI2025

【Quick Read】: This paper addresses the poor performance of existing methods at recognizing seen actions on unseen objects in egocentric 3D hand-object interaction datasets. The main challenges are that existing methods represent object shape and motion with 3D bounding boxes and rely on object templates at test time, which limits generalization to unseen objects. The paper proposes superquadrics as an alternative 3D object representation and demonstrates their effectiveness for both template-free object reconstruction and action recognition. It further studies the compositionality of actions by designing a harder task in which the verb-noun combinations of the training set do not overlap with the test split, probing the potential benefit of 3D geometric information for action recognition. To this end, the paper extends the H2O and FPHA datasets with compositional splits and designs a novel collaborative learning framework that explicitly reasons about the geometric relations between hands and the manipulated object. Extensive quantitative and qualitative evaluations show significant improvements over the state of the art on compositional action recognition.

Link: https://arxiv.org/abs/2501.07100
Authors: Tze Ho Elden Tse, Runyang Feng, Linfang Zheng, Jiho Park, Yixing Gao, Jihie Kim, Ales Leonardis, Hyung Jin Chang
Affiliations: 1. University of Birmingham; 2. University of Science and Technology of China; 3. Korea Advanced Institute of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2025

Abstract:With the availability of egocentric 3D hand-object interaction datasets, there is increasing interest in developing unified models for hand-object pose estimation and action recognition. However, existing methods still struggle to recognise seen actions on unseen objects due to the limitations in representing object shape and movement using 3D bounding boxes. Additionally, the reliance on object templates at test time limits their generalisability to unseen objects. To address these challenges, we propose to leverage superquadrics as an alternative 3D object representation to bounding boxes and demonstrate their effectiveness on both template-free object reconstruction and action recognition tasks. Moreover, as we find that pure appearance-based methods can outperform the unified methods, the potential benefits from 3D geometric information remain unclear. Therefore, we study the compositionality of actions by considering a more challenging task where the training combinations of verbs and nouns do not overlap with the testing split. We extend H2O and FPHA datasets with compositional splits and design a novel collaborative learning framework that can explicitly reason about the geometric relations between hands and the manipulated object. Through extensive quantitative and qualitative evaluations, we demonstrate significant improvements over the state of the art in (compositional) action recognition.

[CV-50] Video Quality Assessment for Online Processing: From Spatial to Temporal Sampling

【Quick Read】: This paper investigates the spatial and temporal redundancy of videos in video quality assessment (VQA), asking how little video information can be kept while maintaining acceptable performance. The key is joint spatial and temporal sampling: the video's spatial and temporal information is drastically reduced, and the heavily compressed video is fed into a stable VQA model. Experiments show the VQA model retains acceptable performance even after most of the video information is discarded. The paper further makes an initial attempt at an online VQA model design, instantiated with a simplified spatial feature extractor, a temporal feature fusion module, and a global quality regression module, verifying the feasibility of online VQA.

Link: https://arxiv.org/abs/2501.07087
Authors: Jiebin Yan, Lei Wu, Yuming Fang, Xuelin Liu, Xue Xia, Weide Liu
Affiliations: School of Information Technology, Jiangxi University of Finance and Economics, Nanchang 330032, Jiangxi, China; Harvard Medical School, Harvard University, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid development of multimedia processing and deep learning technologies, especially in the field of video understanding, video quality assessment (VQA) has achieved significant progress. Although researchers have moved from designing efficient video quality mapping models to various research directions, in-depth exploration of the effectiveness-efficiency trade-offs of spatio-temporal modeling in VQA models remains insufficient. Considering the fact that videos have highly redundant information, this paper investigates this problem from the perspective of joint spatial and temporal sampling, aiming to determine how little information we can keep when feeding videos into VQA models while sacrificing only acceptable performance. To this end, we drastically sample the video's information from both spatial and temporal dimensions, and the heavily squeezed video is then fed into a stable VQA model. Comprehensive experiments regarding joint spatial and temporal sampling are conducted on six public video quality databases, and the results demonstrate the acceptable performance of the VQA model when throwing away most of the video information. Furthermore, with the proposed joint spatial and temporal sampling strategy, we make an initial attempt to design an online VQA model, which is instantiated with a spatial feature extractor, a temporal feature fusion module, and a global quality regression module, each kept as simple as possible. Through quantitative and qualitative experiments, we verify the feasibility of the online VQA model by simplifying it and reducing its input.
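
At its simplest, joint spatial and temporal sampling reduces to strided slicing of the video tensor. The strides below are illustrative values of ours, not the paper's settings:

```python
import numpy as np

def joint_sample(video, t_stride=8, s_stride=4):
    """Keep every t_stride-th frame and every s_stride-th pixel of a (T, H, W, C) video."""
    return video[::t_stride, ::s_stride, ::s_stride, :]

clip = np.zeros((240, 1080, 1920, 3), dtype=np.uint8)
print(joint_sample(clip).shape)  # (30, 270, 480, 3): over 99% of the samples discarded
```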

[CV-51] Representation Learning of Point Cloud Upsampling in Global and Local Inputs

【Quick Read】: This paper addresses quality issues in point cloud upsampling caused by sparsity and noise. The key to the solution is to exploit both global and local information of the point cloud model through representation learning. Specifically, the global and local information of the same object are fed into two encoders to extract features, which are fused and passed into an upsampling decoder. In this way, the paper leverages prior knowledge from both global and local inputs to improve upsampling. Experiments show the framework can further improve state-of-the-art (SOTA) point cloud upsampling networks, and Saliency Maps illustrate both the differences between global and local feature inputs and the effectiveness of training with both inputs in parallel.

Link: https://arxiv.org/abs/2501.07076
Authors: Tongxu Zhang, Bei Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, point cloud upsampling has been widely applied in fields such as 3D reconstruction. Our study investigates the factors influencing point cloud upsampling on both global and local levels through representation learning. Specifically, the paper inputs global and local information of the same point cloud model object into two encoders to extract these features, fuses them, and then feeds the combined features into an upsampling decoder. The goal is to address issues of sparsity and noise in point clouds by leveraging prior knowledge from both global and local inputs. The proposed framework can be applied to any state-of-the-art point cloud upsampling neural network. Experiments were conducted on a series of autoencoder-based models utilizing deep learning, yielding interpretability for both global and local inputs, and the results show that our framework further improves the upsampling performance of previous SOTA works. At the same time, the Saliency Map reflects the differences between global and local feature inputs, as well as the effectiveness of training with both inputs in parallel.

[CV-52] Label Calibration in Source Free Domain Adaptation WACV

【Quick Read】: This paper addresses the noise in pseudolabels for source-free domain adaptation (SFDA). Because of the domain gap between source and target domains, pseudolabels produced by conventional self-supervised SFDA techniques via the softmax function are often unreliable. The paper proposes a solution based on evidential deep learning that refines pseudolabels through predictive uncertainty and softmax calibration. Specifically, a Dirichlet prior is placed over the target network's outputs so that uncertainty evidence is captured in a single forward pass. Softmax calibration addresses the translation invariance problem and aids learning with noisy labels. The method combines an evidential deep learning loss with an information maximization loss and applies calibrated softmax in SFDA settings with and without prior target knowledge. Experiments show it outperforms other state-of-the-art methods on benchmark datasets.

Link: https://arxiv.org/abs/2501.07072
Authors: Shivangi Rai, Rini Smita Thakur, Kunal Jangid, Vinod K Kurmi
Affiliations: Indian Institute of Science Education and Research Bhopal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

Abstract:Source-free domain adaptation (SFDA) utilizes a pre-trained source model with unlabeled target data. Self-supervised SFDA techniques generate pseudolabels from the pre-trained source model, but these pseudolabels often contain noise due to domain discrepancies between the source and target domains. Traditional self-supervised SFDA techniques rely on deterministic model predictions using the softmax function, leading to unreliable pseudolabels. In this work, we propose to introduce predictive uncertainty and softmax calibration for pseudolabel refinement using evidential deep learning. The Dirichlet prior is placed over the output of the target network to capture uncertainty using evidence with a single forward pass. Furthermore, softmax calibration solves the translation invariance problem to assist in learning with noisy labels. We incorporate a combination of evidential deep learning loss and information maximization loss with calibrated softmax in both prior and non-prior target knowledge SFDA settings. Extensive experimental analysis shows that our method outperforms other state-of-the-art methods on benchmark datasets.
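
An evidential head of this kind typically maps logits to non-negative evidence, treats evidence + 1 as Dirichlet concentration parameters, and reads both class probabilities and a vacuity-style uncertainty from them in one forward pass. A generic sketch of that computation (not the paper's exact loss or calibration):

```python
import torch
import torch.nn.functional as F

def dirichlet_from_logits(logits):
    """Evidential head: logits -> Dirichlet parameters, probabilities, and vacuity.

    logits: (B, K) raw network outputs for K classes
    """
    evidence = F.softplus(logits)                 # non-negative evidence per class
    alpha = evidence + 1.0                        # Dirichlet concentration parameters
    S = alpha.sum(dim=-1, keepdim=True)           # Dirichlet strength
    probs = alpha / S                             # expected class probabilities
    vacuity = logits.shape[-1] / S.squeeze(-1)    # high when evidence is scarce
    return alpha, probs, vacuity
```

The vacuity term is what lets pseudolabels with little supporting evidence be down-weighted or discarded during adaptation.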

[CV-53] Enhancing Image Generation Fidelity via Progressive Prompts ICASSP2025

【Quick Read】: This paper addresses the limited exploration of regional prompt control in the diffusion transformer (DiT) architecture for image generation. Existing DiT-based methods focus mainly on global-aware synthesis, leaving regional prompt control underexplored. The paper proposes a coarse-to-fine generation pipeline that uses a powerful large language model (LLM) to produce both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style), and studies the influence of cross-attention layers at different depths. It finds that deeper layers are responsible for high-level content control while shallow layers handle low-level content control. By injecting the various prompts into the proposed regional cross-attention control for coarse-to-fine generation, the pipeline enhances the controllability of DiT-based image generation, and experiments show it improves the performance of the generated images.

Link: https://arxiv.org/abs/2501.07070
Authors: Zhen Xiong, Yuqi Li, Chuanguang Yang, Tiao Tan, Zhihong Zhu, Siyuan Li, Yue Ma
Affiliations: 1Institute of Computing Technology, Chinese Academy of Sciences, China; 2Tsinghua University, China; 3Peking University, China; 4EaseUS, China; 5The Hong Kong University of Science and Technology, HK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025, Github: this https URL

Abstract:The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT-based image generation methods focus on global-aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control. Various prompts are injected into the proposed regional cross-attention control for coarse-to-fine generation. By using the proposed pipeline, we enhance the controllability of DiT-based image generation. Extensive quantitative and qualitative results show that our pipeline can improve the performance of the generated images.

[CV-54] Hierarchical Superpixel Segmentation via Structural Information Theory SDM2025

【Quick Read】: This paper addresses the failure of existing graph-based superpixel segmentation methods to exploit global information. Such methods typically consider only the relationships between a pixel and its immediate neighbors, ignoring the influence of non-adjacent pixels, which leads to suboptimal segmentation quality. The paper proposes SIT-HSS, a hierarchical superpixel segmentation method based on structural information theory. Its key components are twofold: first, a novel graph construction strategy that incrementally explores pixel neighborhoods and adds edges based on 1-dimensional structural entropy (1D SE), maximizing retained graph information while avoiding an overly complex graph structure; second, a new 2-dimensional structural entropy (2D SE)-guided hierarchical graph partitioning method that iteratively merges pixel clusters layer by layer to reduce the graph's 2D SE until a predefined segmentation scale is reached. Experiments on three benchmark datasets show that SIT-HSS outperforms state-of-the-art unsupervised superpixel segmentation algorithms.

Link: https://arxiv.org/abs/2501.07069
Authors: Minhui Xie, Hao Peng, Pu Li, Guangjie Zeng, Shuhai Wang, Jia Wu, Peng Li, Philip S. Yu
Affiliations: School of Cyber Science and Technology, Beihang University, China; College of Information Science and Technology, Shijiazhuang Tiedao University, China; Macquarie University, Australia; University of Illinois at Chicago, Illinois, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SDM 2025

Abstract:Superpixel segmentation is a foundation for many higher-level computer vision tasks, such as image segmentation, object recognition, and scene understanding. Existing graph-based superpixel segmentation methods typically concentrate on the relationships between a given pixel and its directly adjacent pixels while overlooking the influence of non-adjacent pixels. These approaches do not fully leverage the global information in the graph, leading to suboptimal segmentation quality. To address this limitation, we present SIT-HSS, a hierarchical superpixel segmentation method based on structural information theory. Specifically, we first design a novel graph construction strategy that incrementally explores the pixel neighborhood to add edges based on 1-dimensional structural entropy (1D SE). This strategy maximizes the retention of graph information while avoiding an overly complex graph structure. Then, we design a new 2D SE-guided hierarchical graph partitioning method, which iteratively merges pixel clusters layer by layer to reduce the graph's 2D SE until a predefined segmentation scale is achieved. Experimental results on three benchmark datasets demonstrate that the SIT-HSS performs better than state-of-the-art unsupervised superpixel segmentation algorithms. The source code is available at this https URL.
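
For reference, the one-dimensional structural entropy of a weighted graph is simply the Shannon entropy of its stationary degree distribution. A small sketch (dense adjacency matrix assumed):

```python
import numpy as np

def one_dim_structural_entropy(adj):
    """1D SE of an undirected weighted graph given its dense adjacency matrix."""
    deg = adj.sum(axis=1)        # (weighted) node degrees
    vol = adj.sum()              # 2m: total volume of the graph
    p = deg[deg > 0] / vol       # stationary distribution of a random walk
    return float(-(p * np.log2(p)).sum())
```

SIT-HSS scores candidate edges with this quantity during graph construction, and its hierarchical partitioning is guided by the 2D generalization computed over a cluster hierarchy.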

[CV-55] SFC-GAN: A Generative Adversarial Network for Brain Functional and Structural Connectome Translation

【Quick Read】: This paper addresses the challenge of acquiring structural connectivity (SC) and functional connectivity (FC) simultaneously in brain imaging studies, in particular how to translate bidirectionally between SC and FC when only one modality is available. Existing deep generative models typically synthesize a single modality or translate in one direction only, failing to exploit the potential benefits of bidirectional translation. The paper proposes Structural-Functional Connectivity GAN (SFC-GAN), a framework for bidirectional SC-FC translation. The key is a CycleGAN architecture with convolutional layers that effectively capture the spatial structure of brain connectomes. To preserve the topological integrity of the connectomes, a structure-preserving loss guides the model to capture both global and local connectome patterns while maintaining symmetry. Experiments show that SFC-GAN outperforms baseline models in similarity and graph-property evaluations against ground truth, and each translated modality can be used effectively for downstream classification.

Link: https://arxiv.org/abs/2501.07055
Authors: Yee-Fan Tan, Jun Lin Liow, Pei-Sze Tan, Fuad Noman, Raphael C.-W. Phan, Hernando Ombao, Chee-Ming Ting
Affiliations: Monash University Malaysia; King Abdullah University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures

Abstract:Modern brain imaging technologies have enabled the detailed reconstruction of human brain connectomes, capturing structural connectivity (SC) from diffusion MRI and functional connectivity (FC) from functional MRI. Understanding the intricate relationships between SC and FC is vital for gaining deeper insights into the brain's functional and organizational mechanisms. However, obtaining both SC and FC modalities simultaneously remains challenging, hindering comprehensive analyses. Existing deep generative models typically focus on synthesizing a single modality or unidirectional translation between FC and SC, thereby missing the potential benefits of bi-directional translation, especially in scenarios where only one connectome is available. Therefore, we propose Structural-Functional Connectivity GAN (SFC-GAN), a novel framework for bidirectional translation between SC and FC. This approach leverages the CycleGAN architecture, incorporating convolutional layers to effectively capture the spatial structures of brain connectomes. To preserve the topological integrity of these connectomes, we employ a structure-preserving loss that guides the model in capturing both global and local connectome patterns while maintaining symmetry. Our framework demonstrates superior performance in translating between SC and FC, outperforming baseline models in similarity and graph property evaluations compared to ground truth data; each translated modality can be effectively utilized for downstream classification.
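
One plausible form of a structure-preserving objective for connectome matrices combines a reconstruction term with symmetry and node-strength (degree) penalties. This is our sketch of the general idea; the actual SFC-GAN loss and its weights may differ:

```python
import torch

def structure_preserving_loss(pred, target, lam_sym=0.1, lam_deg=0.1):
    """Reconstruction + symmetry + node-strength terms for (B, N, N) connectomes."""
    recon = (pred - target).abs().mean()                             # global pattern
    symmetry = (pred - pred.transpose(-1, -2)).abs().mean()          # undirected graph
    strength = (pred.sum(dim=-1) - target.sum(dim=-1)).abs().mean()  # local degree profile
    return recon + lam_sym * symmetry + lam_deg * strength
```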

[CV-56] Protego: Detecting Adversarial Examples for Vision Transformers via Intrinsic Capabilities

【Quick Read】: This paper addresses the vulnerability of Vision Transformer (ViT) models to adversarial examples. Although transformer models excel at natural language tasks, ViT models remain susceptible to adversarial attacks in computer vision. The paper studies the impact of six common adversarial attack methods on three pretrained ViT models, revealing their vulnerability, and uses two visualisation techniques, attention rollout and grad attention rollout, to understand and analyse the decision bias of the networks under adversarial inputs.

The proposed solution, Protego, is a detection framework that leverages the transformer's intrinsic capabilities to identify adversarial examples for ViT models. Its key insight comes from the attention mechanism: the token of prediction contains all the information of the input sample, and the attention regions of adversarial examples differ from those of normal ones. By training a detector on this signal, Protego distinguishes adversarial from normal samples effectively; experiments show its detection performance clearly surpasses existing methods, with AUC scores exceeding 0.95 for all six attack methods. The approach may advance research in metaverse security.

Link: https://arxiv.org/abs/2501.07044
Authors: Jialin Wu, Kaikai Pan, Yanjiao Chen, Jiangyi Deng, Shengyuan Pang, Wenyuan Xu
Affiliations: USSLAB, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by IEEE MetaCom 2024

Abstract:Transformer models have excelled in natural language tasks, prompting the vision community to explore their implementation in computer vision problems. However, these models are still influenced by adversarial examples. In this paper, we investigate the attack capabilities of six common adversarial attacks on three pretrained ViT models to reveal the vulnerability of ViT models. To understand and analyse the bias in neural network decisions when the input is adversarial, we use two visualisation techniques, namely attention rollout and grad attention rollout. To prevent ViT models from adversarial attack, we propose Protego, a detection framework that leverages the transformer intrinsic capabilities to detect adversarial examples of ViT models. Nonetheless, this is challenging due to a diversity of attack strategies that may be adopted by adversaries. Inspired by the attention mechanism, we know that the token of prediction contains all the information from the input sample. Additionally, the attention region for adversarial examples differs from that of normal examples. Given these points, we can train a detector that achieves superior performance to existing detection methods in identifying adversarial examples. Our experiments have demonstrated the high effectiveness of our detection method. For these six adversarial attack methods, our detector's AUC scores all exceed 0.95. Protego may advance investigations in metaverse security.

[CV-57] Rethinking Knowledge in Distillation: An In-context Sample Retrieval Perspective

【Quick Read】: This paper addresses a problem in conventional knowledge distillation (KD): the relationships among samples of the same class are neglected. Conventional KD focuses on making the student's output for each sample similar to the teacher's, ignoring the associations among same-class samples. The paper proposes a new In-Context Knowledge Distillation (IC-KD) framework that redefines the knowledge used in distillation by capturing the relationship between each sample and its in-context samples (a group of similar samples of the same or different classes).

The key is to perform distillation from an in-context sample retrieval perspective. Since KD is a type of learned label smoothing regularization (LSR), a theoretical analysis first shows that the teacher's knowledge from in-context samples is a crucial contributor to regularizing the student's training. Building on this, IC-KD constructs a feature memory bank from the teacher model and retrieves in-context samples for each sample via retrieval-based learning. Positive In-Context Distillation (PICD) then reduces the discrepancy in logit space between a student sample and the teacher's aggregated same-class in-context samples, while Negative In-Context Distillation (NICD) separates a student sample from the teacher's different-class in-context samples in logit space. Experiments show IC-KD is effective across diverse KD paradigms (offline, online, and teacher-free) and achieves state-of-the-art performance on CIFAR-100 and ImageNet.

Link: https://arxiv.org/abs/2501.07040
Authors: Jinjing Zhu, Songze Li, Lin Wang
Affiliations: The Hong Kong University of Science and Technology (HKUST); Southeast University; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Conventional knowledge distillation (KD) approaches are designed for the student model to predict outputs similar to those of the teacher model for each sample. Unfortunately, the relationship among samples of the same class is often neglected. In this paper, we explore redefining the knowledge in distillation, capturing the relationship between each sample and its corresponding in-context samples (a group of similar samples with the same or different classes), and perform KD from an in-context sample retrieval perspective. As KD is a type of learned label smoothing regularization (LSR), we first conduct a theoretical analysis showing that the teacher's knowledge from the in-context samples is a crucial contributor to regularize the student training with the corresponding samples. Buttressed by the analysis, we propose a novel in-context knowledge distillation (IC-KD) framework that shows its superiority across diverse KD paradigms (offline, online, and teacher-free KD). Firstly, we construct a feature memory bank from the teacher model and retrieve in-context samples for each corresponding sample through retrieval-based learning. We then introduce Positive In-Context Distillation (PICD) to reduce the discrepancy between a sample from the student and the aggregated in-context samples with the same class from the teacher in the logit space. Moreover, Negative In-Context Distillation (NICD) is introduced to separate a sample from the student and the in-context samples with different classes from the teacher in the logit space. Extensive experiments demonstrate that IC-KD is effective across various types of KD, and consistently achieves state-of-the-art performance on CIFAR-100 and ImageNet datasets.
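
The retrieval step at the heart of this idea can be pictured as a cosine-similarity lookup into a teacher feature bank, followed by a KD-style loss against aggregated in-context logits. A schematic sketch of a PICD-like term, with assumed names and an assumed mean aggregation (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def picd_like_loss(student_logits, student_feat, bank_feats, bank_logits, k=8, T=4.0):
    """Positive in-context distillation sketch for one sample.

    student_logits: (C,)   student logits for the current sample
    student_feat  : (D,)   feature of the current student sample
    bank_feats    : (N, D) teacher feature memory bank
    bank_logits   : (N, C) teacher logits stored alongside the bank
    """
    sim = F.cosine_similarity(student_feat.unsqueeze(0), bank_feats, dim=1)
    idx = sim.topk(k).indices                      # retrieve k in-context samples
    target = bank_logits[idx].mean(dim=0)          # aggregate teacher logits
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(target / T, dim=-1),
                    reduction="sum") * T * T       # temperature-scaled KD loss
```

A negative (NICD-like) term would instead push the student's distribution away from in-context samples retrieved from other classes.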

[CV-58] IoT-Based Real-Time Medical-Related Human Activity Recognition Using Skeletons and Multi-Stage Deep Learning for Healthcare

【Quick Read】: This paper addresses challenges in recognizing medical-related human activities (MRHA), particularly for real-time patient monitoring. Current human motion recognition (HMR) techniques face high computational demands, low accuracy, and limited adaptability, which restricts their effectiveness in healthcare. The paper proposes a novel HMR method that combines multi-stage deep learning with the Internet of Things (IoT) to detect MRHA. Its key design uses EfficientNet with seven Mobile Inverted Bottleneck Convolution (MBConv) blocks to extract optimized spatial features from skeleton frame sequences, followed by ConvLSTM to capture spatio-temporal patterns; a classification module with global average pooling, a fully connected layer, and a dropout layer produces the final predictions. The model achieves 94.85% (cross-subject) and 96.45% (cross-view) accuracy on NTU RGB+D 120, and 89.00% accuracy on HMDB51. The system further integrates IoT capabilities through a Raspberry Pi and a GSM module, sending real-time alerts to caregivers and patients via Twilio's SMS service. This scalable and efficient solution bridges HMR and IoT, advancing patient monitoring, improving healthcare outcomes, and reducing costs.

Link: https://arxiv.org/abs/2501.07039
Authors: Subrata Kumer Paul, Abu Saleh Musa Miah, Rakhi Rani Paul, Md. Ekramul Hamid, Jungpil Shin, Md Abdur Rahim
Affiliations: Dept. of Computer Science and Engineering (CSE), Bangladesh Army University of Engineering & Technology (BAUET), Qadirabad Cantonment, Dayarampur, Natore-6431, Rajshahi, Bangladesh; Department of Computer Science & Engineering, University of Rajshahi, Rajshahi-6205, Bangladesh; Dept. of Computer Science and Engineering (CSE), Bangladesh Army University of Science and Technology (BAUST), Saidpur, Bangladesh; School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu, Fukushima, Japan; Dept. of Computer Science and Engineering, Pabna University of Science and Technology, Rajapur, Pabna, Bangladesh
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Internet of Things (IoT) and mobile technology have significantly transformed healthcare by enabling real-time monitoring and diagnosis of patients. Recognizing medical-related human activities (MRHA) is pivotal for healthcare systems, particularly for identifying actions that are critical to patient well-being. However, challenges such as high computational demands, low accuracy, and limited adaptability persist in Human Motion Recognition (HMR). While some studies have integrated HMR with IoT for real-time healthcare applications, limited research has focused on recognizing MRHA as essential for effective patient monitoring. This study proposes a novel HMR method for MRHA detection, leveraging multi-stage deep learning techniques integrated with IoT. The approach employs EfficientNet to extract optimized spatial features from skeleton frame sequences using seven Mobile Inverted Bottleneck Convolutions (MBConv) blocks, followed by ConvLSTM to capture spatio-temporal patterns. A classification module with global average pooling, a fully connected layer, and a dropout layer generates the final predictions. The model is evaluated on the NTU RGB+D 120 and HMDB51 datasets, focusing on MRHA, such as sneezing, falling, walking, sitting, etc. It achieves 94.85% accuracy for cross-subject evaluations and 96.45% for cross-view evaluations on NTU RGB+D 120, along with 89.00% accuracy on HMDB51. Additionally, the system integrates IoT capabilities using a Raspberry Pi and GSM module, delivering real-time alerts via Twilio's SMS service to caregivers and patients. This scalable and efficient solution bridges the gap between HMR and IoT, advancing patient monitoring, improving healthcare outcomes, and reducing costs.
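
The described pipeline (per-frame CNN features, then ConvLSTM, then global average pooling and an FC head with dropout) can be wired up as below. The ConvLSTM cell is a minimal textbook version and all module names are ours, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four gates from a single convolution."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class MRHAClassifier(nn.Module):
    """Per-frame CNN features -> ConvLSTM -> global average pooling -> FC head."""

    def __init__(self, backbone, feat_ch, hid_ch, num_classes):
        super().__init__()
        self.backbone = backbone          # e.g. an EfficientNet trunk (assumed)
        self.cell = ConvLSTMCell(feat_ch, hid_ch)
        self.head = nn.Sequential(nn.Dropout(0.5), nn.Linear(hid_ch, num_classes))

    def forward(self, clip):              # clip: (B, T, C, H, W)
        B, T = clip.shape[:2]
        h = c = None
        for t in range(T):
            f = self.backbone(clip[:, t])  # (B, feat_ch, H', W') spatial features
            if h is None:
                h = f.new_zeros(B, self.cell.hid_ch, *f.shape[2:])
                c = torch.zeros_like(h)
            h, c = self.cell(f, (h, c))    # accumulate spatio-temporal state
        pooled = h.mean(dim=(2, 3))        # global average pooling
        return self.head(pooled)
```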

[CV-59] Detection of AI Deepfake and Fraud in Online Payments Using GAN-Based Models

【Quick Read】: This study tackles fraud in online payment systems driven by AI deepfake technology. As deepfake techniques that manipulate facial features in images and videos proliferate, traditional security systems struggle to identify these highly sophisticated forms of fraud. The research proposes a novel model based on Generative Adversarial Networks (GANs) that strengthens online payment security by identifying subtle manipulations in payment images. The model is trained on a dataset of real-world online payment images and deepfake images generated with advanced GAN architectures such as StyleGAN and DeepFake. Experiments show it can distinguish legitimate transactions from deepfakes with over 95% accuracy, markedly improving the robustness of payment systems against AI-driven fraud. The key lies in exploiting the adversarial training mechanism of GANs for efficient deepfake image detection, offering a new technical path for fraud detection in financial services.

Link: https://arxiv.org/abs/2501.07033
Authors: Zong Ke, Shicheng Zhou, Yining Zhou, Chia Hong Chang, Rong Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper will be published and indexed by IEEE at 2025 8th International Conference on Advanced Algorithms and Control Engineering (ICAACE 2025)

Abstract:This study explores the use of Generative Adversarial Networks (GANs) to detect AI deepfakes and fraudulent activities in online payment systems. With the growing prevalence of deepfake technology, which can manipulate facial features in images and videos, the potential for fraud in online transactions has escalated. Traditional security systems struggle to identify these sophisticated forms of fraud. This research proposes a novel GAN-based model that enhances online payment security by identifying subtle manipulations in payment images. The model is trained on a dataset consisting of real-world online payment images and deepfake images generated using advanced GAN architectures, such as StyleGAN and DeepFake. The results demonstrate that the proposed model can accurately distinguish between legitimate transactions and deepfakes, achieving a high detection rate above 95%. This approach significantly improves the robustness of payment systems against AI-driven fraud. The paper contributes to the growing field of digital security, offering insights into the application of GANs for fraud detection in financial services. Keywords: Payment Security, Image Recognition, Generative Adversarial Networks, AI Deepfake, Fraudulent Activities

[CV-60] UNetVL: Enhancing 3D Medical Image Segmentation with Chebyshev KAN Powered Vision-LSTM

【Quick Read】: This paper addresses the difficulty of balancing long-range dependency acquisition against computational efficiency in 3D medical image segmentation with Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). To tackle this, it proposes UNETVL (U-Net Vision-LSTM), a novel architecture that combines Vision-LSTM (ViL) with efficient Chebyshev Kolmogorov-Arnold Networks (KAN) to better handle complex long-range dependency patterns. UNETVL gains scalability and memory functions through ViL while using KAN to process long-range dependencies effectively. Validated on the ACDC and AMOS2022 (post-challenge Task 2) benchmarks, UNETVL significantly improves mean Dice score over recent state-of-the-art approaches, with gains of 7.3% on ACDC and 15.6% on AMOS over its predecessor UNETR.

Link: https://arxiv.org/abs/2501.07017
Authors: Xuhui Guo, Tanmoy Dam, Rohan Dhamdhere, Gourav Modanwal, Anant Madabhushi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:3D medical image segmentation has progressed considerably due to Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), yet these methods struggle to balance long-range dependency acquisition with computational efficiency. To address this challenge, we propose UNETVL (U-Net Vision-LSTM), a novel architecture that leverages recent advancements in temporal information processing. UNETVL incorporates Vision-LSTM (ViL) for improved scalability and memory functions, alongside an efficient Chebyshev Kolmogorov-Arnold Networks (KAN) to handle complex and long-range dependency patterns more effectively. We validated our method on the ACDC and AMOS2022 (post challenge Task 2) benchmark datasets, showing a significant improvement in mean Dice score compared to recent state-of-the-art approaches, especially over its predecessor, UNETR, with increases of 7.3% on ACDC and 15.6% on AMOS, respectively. Extensive ablation studies were conducted to demonstrate the impact of each component in UNETVL, providing a comprehensive understanding of its architecture. Our code is available at this https URL, facilitating further research and applications in this domain.
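
A Chebyshev KAN layer replaces fixed activations with learnable Chebyshev polynomial expansions of each input. One common formulation is sketched below; it is our illustration and may differ in detail from the layer used in UNETVL:

```python
import torch
import torch.nn as nn

class ChebyKANLayer(nn.Module):
    """Each input dim is expanded in Chebyshev polynomials T_0..T_D and mapped
    to the outputs through learnable coefficients."""

    def __init__(self, in_dim, out_dim, degree=4):
        super().__init__()
        self.degree = degree
        self.coeffs = nn.Parameter(torch.randn(in_dim, out_dim, degree + 1) * 0.1)

    def forward(self, x):
        x = torch.tanh(x)                         # squash into [-1, 1], T_n's domain
        T = [torch.ones_like(x), x]
        for _ in range(2, self.degree + 1):
            T.append(2 * x * T[-1] - T[-2])       # T_n = 2x T_{n-1} - T_{n-2}
        basis = torch.stack(T, dim=-1)            # (..., in_dim, degree + 1)
        return torch.einsum("...id,iod->...o", basis, self.coeffs)
```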

[CV-61] SplatMAP: Online Dense Monocular SLAM with 3D Gaussian Splatting

【Quick Read】: This paper addresses the challenge of high-fidelity 3D reconstruction from monocular video, where traditional methods such as Structure-from-Motion (SfM) and monocular SLAM have limited ability to capture scene detail. Differentiable rendering techniques such as Neural Radiance Fields (NeRF) partly address these issues, but their high computational cost precludes real-time use. Moreover, existing 3D Gaussian Splatting (3DGS) methods often emphasize photometric consistency, neglect geometric accuracy, and fail to exploit SLAM's dynamic depth and pose updates to refine the scene.

The key to the proposed solution is integrating dense SLAM with 3DGS for real-time, high-fidelity dense reconstruction. Specifically, the method introduces SLAM-Informed Adaptive Densification, which dynamically updates and densifies the Gaussian model using the dense point clouds produced by SLAM. It further incorporates Geometry-Guided Optimization, combining edge-aware geometric constraints with photometric consistency to jointly optimize the appearance and geometry of the 3DGS scene representation, producing detailed and accurate SLAM map reconstructions. Experiments on the Replica and TUM-RGBD datasets show state-of-the-art results among monocular systems, with significant improvements in PSNR, SSIM, and LPIPS.

Link: https://arxiv.org/abs/2501.07015
Authors: Yue Hu, Rong Liu, Meida Chen, Andrew Feng, Peter Beerel
Affiliations: Institute for Creative Technologies, University of Southern California; University of Southern California
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Achieving high-fidelity 3D reconstruction from monocular video remains challenging due to the inherent limitations of traditional methods like Structure-from-Motion (SfM) and monocular SLAM in accurately capturing scene details. While differentiable rendering techniques such as Neural Radiance Fields (NeRF) address some of these challenges, their high computational costs make them unsuitable for real-time applications. Additionally, existing 3D Gaussian Splatting (3DGS) methods often focus on photometric consistency, neglecting geometric accuracy and failing to exploit SLAM’s dynamic depth and pose updates for scene refinement. We propose a framework integrating dense SLAM with 3DGS for real-time, high-fidelity dense reconstruction. Our approach introduces SLAM-Informed Adaptive Densification, which dynamically updates and densifies the Gaussian model by leveraging dense point clouds from SLAM. Additionally, we incorporate Geometry-Guided Optimization, which combines edge-aware geometric constraints and photometric consistency to jointly optimize the appearance and geometry of the 3DGS scene representation, enabling detailed and accurate SLAM mapping reconstruction. Experiments on the Replica and TUM-RGBD datasets demonstrate the effectiveness of our approach, achieving state-of-the-art results among monocular systems. Specifically, our method achieves a PSNR of 36.864, SSIM of 0.985, and LPIPS of 0.040 on Replica, representing improvements of 10.7%, 6.4%, and 49.4%, respectively, over the previous SOTA. On TUM-RGBD, our method outperforms the closest baseline by 10.2%, 6.6%, and 34.7% in the same metrics. These results highlight the potential of our framework in bridging the gap between photometric and geometric dense 3D scene representations, paving the way for practical and efficient monocular dense reconstruction.

[CV-62] Comparison of Autoencoders for tokenization of ASL datasets

【Quick Read】: This paper targets high-fidelity reconstruction of an American Sign Language (ASL) image dataset of 87,000 images across 29 hand sign classes, comparing three encoder-decoder architectures: feedforward autoencoders, convolutional autoencoders, and diffusion autoencoders. The key finding is that the diffusion autoencoder, with its probabilistic noise modeling and iterative denoising capabilities, clearly outperforms the others, achieving the lowest mean squared error (MSE) and highest Mean Opinion Score (MOS). Its strength lies in handling complex image data, showing promise for multimodal AI applications such as sign language recognition and generation. The results offer important insights for designing robust encoder-decoder systems and advancing multimodal AI capabilities.

Link: https://arxiv.org/abs/2501.06942
Authors: Vouk Praun-Petrovic, Aadhvika Koundinya, Lavanya Prahallad
Affiliations: Harker Upper School, San Jose, California; Irvington High School, Fremont, California; Research Spark Hub, Dublin, California
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 2 tables, 4 figures

Abstract:Generative AI, powered by large language models (LLMs), has revolutionized applications across text, audio, images, and video. This study focuses on developing and evaluating encoder-decoder architectures for the American Sign Language (ASL) image dataset, consisting of 87,000 images across 29 hand sign classes. Three approaches were compared: Feedforward Autoencoders, Convolutional Autoencoders, and Diffusion Autoencoders. The Diffusion Autoencoder outperformed the others, achieving the lowest mean squared error (MSE) and highest Mean Opinion Score (MOS) due to its probabilistic noise modeling and iterative denoising capabilities. The Convolutional Autoencoder demonstrated effective spatial feature extraction but lacked the robustness of the diffusion process, while the Feedforward Autoencoder served as a baseline with limitations in handling complex image data. Objective and subjective evaluations confirmed the superiority of the Diffusion Autoencoder for high-fidelity image reconstruction, emphasizing its potential in multimodal AI applications such as sign language recognition and generation. This work provides critical insights into designing robust encoder-decoder systems to advance multimodal AI capabilities.

[CV-63] Evaluating unsupervised contrastive learning framework for MRI sequences classification

【Quick Read】: This paper addresses automatic identification of Magnetic Resonance Imaging (MRI) sequences to streamline clinical workflows, reducing the time radiologists spend manually sorting and identifying sequences and thereby accelerating diagnosis and treatment planning. The key is a system built on an unsupervised contrastive deep learning framework: a convolutional neural network (CNN) based on the ResNet-18 architecture classifies nine common MRI sequence types as a 9-class problem. The system is trained on an in-house internal dataset and validated on several public datasets (including BraTS, ADNI, the Fused Radiology-Pathology Prostate Dataset, and the ACRIN breast cancer dataset) spanning diverse acquisition protocols, and requires only 2D slices for training. It achieves over 0.95 classification accuracy across the nine most common MRI sequence types.

Link: https://arxiv.org/abs/2501.06938
Authors: Yuli Wang, Kritika Iyer, Sep Farhand, Yoshihisa Shinagawa
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:The automatic identification of Magnetic Resonance Imaging (MRI) sequences can streamline clinical workflows by reducing the time radiologists spend manually sorting and identifying sequences, thereby enabling faster diagnosis and treatment planning for patients. However, the lack of standardization in the parameters of MRI scans poses challenges for automated systems and complicates the generation and utilization of datasets for machine learning research. To address this issue, we propose a system for MRI sequence identification using an unsupervised contrastive deep learning framework. By training a convolutional neural network based on the ResNet-18 architecture, our system classifies nine common MRI sequence types as a 9-class classification problem. The network was trained using an in-house internal dataset and validated on several public datasets, including BraTS, ADNI, Fused Radiology-Pathology Prostate Dataset, the Breast Cancer Dataset (ACRIN), among others, encompassing diverse acquisition protocols and requiring only 2D slices for training. Our system achieves a classification accuracy of over 0.95 across the nine most common MRI sequence types.
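
Unsupervised contrastive frameworks of this kind usually train with an NT-Xent (SimCLR-style) loss over two augmented views of each slice. Whether this exact loss is used in the paper is our assumption; the sketch below shows the standard form:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """z1, z2: (n, d) embeddings of two augmented views of the same batch."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / tau                        # (2n, 2n) cosine similarities
    sim.fill_diagonal_(float("-inf"))            # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)         # positives are the paired views
```

Here z1 and z2 would be, for instance, ResNet-18 embeddings of two augmentations of the same 2D MRI slice.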

[CV-64] CULTURE3D: Cultural Landmarks and Terrain Dataset for 3D Applications

【Quick Read】: This paper addresses the limitations of existing datasets for fine-grained 3D applications, notably insufficient scale and level of detail. The authors present a large-scale fine-grained dataset built from drone-captured aerial imagery, offering higher resolution and richer detail that more accurately captures real-world site layouts and architectural structures. The key is using drone imagery to construct the dataset and reconstructing environments with COLMAP-format Gaussian Splatting and the Structure-from-Motion (SfM) method. The dataset is compatible with widely used techniques such as SLAM, Multi-View Stereo, and Neural Radiance Fields (NeRF), supporting accurate 3D reconstruction and point cloud generation and serving as a benchmark for reconstruction and segmentation tasks. It also supports seamless integration of multimodal data, fostering innovation in a broad range of 3D applications from architectural reconstruction to virtual tourism.

Link: https://arxiv.org/abs/2501.06927
Authors: Xinyi Zheng, Steve Zhang, Weizhe Lin, Aaron Zhang, Walterio W. Mayol-Cuevas, Junxiao Shen
Affiliations: University of Bristol; X-Intelligence Labs
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we present a large-scale fine-grained dataset using high-resolution images captured from locations worldwide. Compared to existing datasets, our dataset offers a significantly larger size and includes a higher level of detail, making it uniquely suited for fine-grained 3D applications. Notably, our dataset is built using drone-captured aerial imagery, which provides a more accurate perspective for capturing real-world site layouts and architectural structures. By reconstructing environments with these detailed images, our dataset supports applications such as the COLMAP format for Gaussian Splatting and the Structure-from-Motion (SfM) method. It is compatible with widely-used techniques including SLAM, Multi-View Stereo, and Neural Radiance Fields (NeRF), enabling accurate 3D reconstructions and point clouds. This makes it a benchmark for reconstruction and segmentation tasks. The dataset enables seamless integration with multi-modal data, supporting a range of 3D applications, from architectural reconstruction to virtual tourism. Its flexibility promotes innovation, facilitating breakthroughs in 3D modeling and analysis.

[CV-65] Benchmarking YOLOv8 for Optimal Crack Detection in Civil Infrastructure

【Quick Read】: This paper targets real-time crack detection for monitoring the structural integrity and safety of bridges. Traditional crack detection methods are too slow for real-time use, and existing two-stage target detection algorithms, while accurate, struggle to meet real-time requirements. The paper adopts the YOLOv8 framework for crack detection, rigorously evaluating its five model scales (nano, small, medium, large, extra-large) with hyperparameter optimization over six state-of-the-art optimizers (Stochastic Gradient Descent, Adaptive Moment Estimation, Adam with Decoupled Weight Decay, Root Mean Square Propagation, Rectified Adam, Nesterov-accelerated Adam). Results show that YOLOv8 optimized with Stochastic Gradient Descent delivers outstanding accuracy and speed, setting a new benchmark for real-time crack detection. The key lies in exploiting YOLOv8's efficiency and flexibility together with optimizer selection, markedly improving detection speed and accuracy and providing a foundational approach for integrating computer vision into infrastructure monitoring.

Link: https://arxiv.org/abs/2501.06922
Authors: Woubishet Zewdu Taffese, Ritesh Sharma, Mohammad Hossein Afsharmovahed, Gunasekaran Manogaran, Genda Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 104th TRB Annual Meeting 2025

Abstract:Ensuring the structural integrity and safety of bridges is crucial for the reliability of transportation networks and public safety. Traditional crack detection methods are increasingly being supplemented or replaced by advanced artificial intelligence (AI) techniques. However, most of these models rely on two-stage target detection algorithms, which pose concerns for real-time applications due to their lower speed. Models such as YOLO (You Only Look Once) have emerged as transformative tools due to their remarkable speed and accuracy, yet the potential of the latest YOLOv8 framework in this domain remains underexplored. This study bridges that gap by rigorously evaluating YOLOv8's performance across five model scales (nano, small, medium, large, and extra-large) using a high-quality Roboflow dataset. A comprehensive hyperparameter optimization was performed, testing six state-of-the-art optimizers: Stochastic Gradient Descent, Adaptive Moment Estimation, Adam with Decoupled Weight Decay, Root Mean Square Propagation, Rectified Adam, and Nesterov-accelerated Adam. Results revealed that YOLOv8, optimized with Stochastic Gradient Descent, delivered exceptional accuracy and speed, setting a new benchmark for real-time crack detection. Beyond its immediate application, this research positions YOLOv8 as a foundational approach for integrating advanced computer vision techniques into infrastructure monitoring. By enabling more reliable and proactive maintenance of aging bridge networks, this work paves the way for safer, more efficient transportation systems worldwide.
zh
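
代码示意:论文的最优配置(YOLOv8 + SGD)可以直接用 ultralytics 库复现训练流程。下面是一个简短的示意脚本,其中数据配置文件 `crack_dataset.yaml` 与各超参数取值均为假设的占位:

```python
# pip install ultralytics
from ultralytics import YOLO

# 五种模型规模逐一训练,统一使用论文结论中表现最佳的 SGD 优化器
for scale in ["n", "s", "m", "l", "x"]:                 # nano ... extra-large
    model = YOLO(f"yolov8{scale}.pt")                   # 加载官方预训练权重
    model.train(
        data="crack_dataset.yaml",                      # 假设:Roboflow 导出的数据配置
        epochs=100,
        imgsz=640,
        optimizer="SGD",
        lr0=0.01,                                       # 初始学习率(假设值)
    )
    metrics = model.val()                               # 在验证集上评估
    print(scale, metrics.box.map)                       # mAP50-95
```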

[CV-66] Local Foreground Selection aware Attentive Feature Reconstruction for few-shot fine-grained plant species classification

【速读】:该论文旨在解决植物物种分类中由于类内差异显著而类间差异较小导致的分类准确率问题。为了解决这一问题,论文提出了一种新颖的局部前景选择(Local Foreground Selection, LFS)注意力机制。该机制通过整合局部注意力(local attention)和前景选择注意力(foreground selection attention)来生成具有区分性的支持特征图和查询特征图。局部注意力用于捕捉局部空间细节,增强特征区分度并增加类间差异;前景选择注意力则通过强调前景植物对象并减少背景干扰,从而减少类内差异。实验结果表明,LFS机制在三个植物物种数据集上表现出色,并相较于以往的特征重建方法具有互补优势。

链接: https://arxiv.org/abs/2501.06909
作者: Aisha Zulfiqar,Ebroul Izquiedro
机构: Queen Mary University of London (伦敦玛丽女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Plant species exhibit significant intra-class variation and minimal inter-class variation. To enhance classification accuracy, it is essential to reduce intra-class variation while maximizing inter-class variation. This paper addresses plant species classification using a limited number of labelled samples and introduces a novel Local Foreground Selection (LFS) attention mechanism. LFS is a straightforward module designed to generate discriminative support and query feature maps. It operates by integrating two types of attention: local attention, which captures local spatial details to enhance feature discrimination and increase inter-class differentiation, and foreground selection attention, which emphasizes the foreground plant object while mitigating background interference. By focusing on the foreground, the query and support features selectively highlight relevant feature sequences and disregard less significant background sequences, thereby reducing intra-class differences. Experimental results from three plant species datasets demonstrate the effectiveness of the proposed LFS attention mechanism and its complementary advantages over previous feature reconstruction methods.
zh
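
代码示意:论文摘要只描述了 LFS 的功能构成,这里给出一个纯概念性的 PyTorch 草图,说明“局部注意力 + 前景选择”的组合思路;模块结构完全是假设,并非论文实现:

```python
import torch
import torch.nn as nn

class LFSAttention(nn.Module):
    """局部前景选择注意力的概念草图(结构为假设,非论文实现)。"""
    def __init__(self, dim):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # 局部空间细节
        self.fg_score = nn.Conv2d(dim, 1, kernel_size=1)            # 前景显著性打分

    def forward(self, feat):                            # feat: [B, C, H, W]
        feat = feat * torch.sigmoid(self.local(feat))   # 局部注意力:增强判别性细节
        saliency = torch.sigmoid(self.fg_score(feat))   # 前景选择:逐像素前景概率
        return feat * saliency                          # 抑制背景序列,突出前景植物
```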

[CV-67] Synthetic Prior for Few-Shot Drivable Head Avatar Inversion

【速读】:该论文试图解决两个主要问题:首先,训练可控的3D生成网络需要大量多样化的序列数据,而这些数据中的图像和高质量跟踪网格(tracked meshes)并不总是可用;其次,现有的单目(monocular)头像模型在新视角和表情的泛化能力上表现不佳,缺乏强先验知识,并且容易过拟合到特定的视角分布。为解决这些问题,论文提出了一种名为SynShot的新方法,其关键解决方案是通过从大量合成头部数据集中学习先验模型,这些数据集包含多样化的身份、表情和视角。SynShot利用少量输入图像对预训练的合成先验进行微调,以弥合领域差距,从而生成能够泛化到新表情和视角的逼真头部头像。该方法使用3D高斯溅射(3D Gaussian splatting)和卷积编码-解码器来建模头部头像,并在UV纹理空间中输出高斯参数。此外,为了应对头部不同部位(如皮肤与头发)的建模复杂性,该方法嵌入了显式控制机制,以增加每个部位的基本图元数量。与需要数千张真实训练图像的最先进单目方法相比,SynShot显著提升了新视角和表情的合成效果。

链接: https://arxiv.org/abs/2501.06903
作者: Wojciech Zielonka,Stephan J. Garbin,Alexandros Lattas,George Kopanas,Paulo Gotardo,Thabo Beeler,Justus Thies,Timo Bolkart
机构: Max Planck Institute for Intelligent Systems, Tübingen, Germany(马克斯·普朗克智能系统研究所, 图宾根, 德国); Technical University of Darmstadt(达姆施塔特工业大学); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Website this https URL

点击查看摘要

Abstract:We present SynShot, a novel method for the few-shot inversion of a drivable head avatar based on a synthetic prior. We tackle two major challenges. First, training a controllable 3D generative network requires a large number of diverse sequences, for which pairs of images and high-quality tracked meshes are not always available. Second, state-of-the-art monocular avatar models struggle to generalize to new views and expressions, lacking a strong prior and often overfitting to a specific viewpoint distribution. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads with diverse identities, expressions, and viewpoints. With few input images, SynShot fine-tunes the pretrained synthetic prior to bridge the domain gap, modeling a photorealistic head avatar that generalizes to novel expressions and viewpoints. We model the head avatar using 3D Gaussian splatting and a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space. To account for the different modeling complexities over parts of the head (e.g., skin vs hair), we embed the prior with explicit control for upsampling the number of per-part primitives. Compared to state-of-the-art monocular methods that require thousands of real training images, SynShot significantly improves novel view and expression synthesis.
zh

[CV-68] ActiveGAMER: Active GAussian Mapping through Efficient Rendering

【速读】:该论文旨在解决实时场景映射和探索中的高质量重建问题,特别是在复杂环境中的主动映射性能。传统基于神经辐射场(NeRF)的方法由于计算量大,限制了主动映射的效率。论文提出的解决方案ActiveGAMER系统,利用3D高斯泼溅(3D Gaussian Splatting, 3DGS)的高效渲染能力,实现了在复杂环境中的有效探索。系统的核心是一个基于渲染的信息增益模块,动态识别最具信息量的视角,用于下一最佳视角规划,从而提升几何和光度重建的准确性。此外,ActiveGAMER通过结合从粗到细的探索、后优化以及全局-局部关键帧选择策略,最大化重建的完整性和保真度。实验结果表明,该系统在几何和光度重建的准确性和完整性方面显著优于现有方法。

链接: https://arxiv.org/abs/2501.06897
作者: Liyan Chen,Huangying Zhan,Kevin Chen,Xiangyu Xu,Qingan Yan,Changjiang Cai,Yi Xu
机构: OPPO US Research Center; Stevens Institute of Technology(斯蒂文斯理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We introduce ActiveGAMER, an active mapping system that utilizes 3D Gaussian Splatting (3DGS) to achieve high-quality, real-time scene mapping and exploration. Unlike traditional NeRF-based methods, which are computationally demanding and restrict active mapping performance, our approach leverages the efficient rendering capabilities of 3DGS, allowing effective and efficient exploration in complex environments. The core of our system is a rendering-based information gain module that dynamically identifies the most informative viewpoints for next-best-view planning, enhancing both geometric and photometric reconstruction accuracy. ActiveGAMER also integrates a carefully balanced framework, combining coarse-to-fine exploration, post-refinement, and a global-local keyframe selection strategy to maximize reconstruction completeness and fidelity. Our system autonomously explores and reconstructs environments with state-of-the-art geometric and photometric accuracy and completeness, significantly surpassing existing approaches in both aspects. Extensive evaluations on benchmark datasets such as Replica and MP3D highlight ActiveGAMER’s effectiveness in active mapping tasks.
zh

[CV-69] MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis WACV

【速读】:该论文试图解决在皮肤癌诊断中深度学习模型的“黑箱”性质问题,即这些模型在提高病灶检测和分类准确性的同时,缺乏透明性和可解释性,导致医生对其决策过程产生信任问题。为了解决这一问题,研究提出了一种基于CLIP(Contrastive Language-Image Pretraining)模型的改进方法,称为MedGrad E-CLIP。该方法通过引入加权熵机制,专门针对复杂的医学影像(如皮肤病变)进行优化,能够捕捉视觉特征与诊断标准术语之间的有意义关系,并突出显示与特定诊断描述相关的关键图像区域。这一解决方案的关键在于通过视觉解释图像中不同特征与诊断标准之间的关系,增强了AI驱动诊断系统的透明性、鲁棒性和可信度。

链接: https://arxiv.org/abs/2501.06887
作者: Sadia Kamal,Tim Oates
机构: University of Maryland Baltimore County (马里兰大学巴尔的摩分校); University of Maryland Baltimore County (马里兰大学巴尔的摩分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: Accepted to 2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)

点击查看摘要

Abstract:As deep learning models gain traction in medical data, ensuring transparent and trustworthy decision-making is essential. In skin cancer diagnosis, while advancements in lesion detection and classification have improved accuracy, the black-box nature of these methods poses challenges in understanding their decision processes, leading to trust issues among physicians. This study leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on different skin lesion datasets, to capture meaningful relationships between visual features and diagnostic criteria terms. To further enhance transparency, we propose a method called MedGrad E-CLIP, which builds on gradient-based E-CLIP by incorporating a weighted entropy mechanism designed for complex medical imaging like skin lesions. This approach highlights critical image regions linked to specific diagnostic descriptions. The developed integrated pipeline not only classifies skin lesions by matching corresponding descriptions but also adds an essential layer of explainability developed especially for medical data. By visually explaining how different features in an image relate to diagnostic criteria, this approach demonstrates the potential of advanced vision-language models in medical image analysis, ultimately improving transparency, robustness, and trust in AI-driven diagnostic systems.
zh

[CV-70] Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning NEURIPS2024

【速读】:该论文旨在解决多任务学习(Multi-Task Learning, MTL)在视觉Transformer(Vision Transformer)中的应用问题,特别是现有方法在结合专家混合(Mixture-of-Experts, MoE)结构和低秩适应(Low-Rank Adaptation, LoRA)时,由于刚性组合导致的优化困难和推理效率低下的问题。论文提出的解决方案关键包括:1)开发了MoEfied LoRA结构,将预训练的Transformer分解为低秩MoE结构,并利用LoRA对参数进行微调;2)设计了质量保持(Quality Retaining, QR)优化机制,利用历史高质量类别logits防止训练良好的任务性能退化;3)提出了路由器衰减策略,将学习到的参数整合到原始Transformer中,以实现高效的推理。这些创新显著提升了多任务学习的性能和推理速度。

链接: https://arxiv.org/abs/2501.06884
作者: Hanwen Zhong,Jiaxin Chen,Yutong Zhang,Di Huang,Yunhong Wang
机构: 1State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China (北京航空航天大学虚拟现实技术与系统国家重点实验室); 2School of Computer Science and Engineering, Beihang University, Beijing, China (北京航空航天大学计算机科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Multi-Task Learning (MTL) for Vision Transformer aims at enhancing the model capability by tackling multiple tasks simultaneously. Most recent works have predominantly focused on designing Mixture-of-Experts (MoE) structures and integrating Low-Rank Adaptation (LoRA) to efficiently perform multi-task learning. However, their rigid combination hampers both the optimization of MoE and the effectiveness of reparameterization of LoRA, leading to sub-optimal performance and low inference speed. In this work, we propose a novel approach dubbed Efficient Multi-Task Learning (EMTAL) by transforming a pre-trained Vision Transformer into an efficient multi-task learner during training, and reparameterizing the learned structure for efficient inference. Specifically, we first develop the MoEfied LoRA structure, which decomposes the pre-trained Transformer into a low-rank MoE structure and employs LoRA to fine-tune the parameters. Subsequently, we take into account the intrinsic asynchronous nature of multi-task learning and devise a learning Quality Retaining (QR) optimization mechanism, by leveraging the historical high-quality class logits to prevent a well-trained task from performance degradation. Finally, we design a router fading strategy to integrate the learned parameters into the original Transformer, achieving efficient inference. Extensive experiments on public benchmarks demonstrate the superiority of our method, compared to the state-of-the-art multi-task learning approaches.
zh
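
代码示意:EMTAL 的完整 MoEfied LoRA 结构较复杂,下面只示意其底层依赖的 LoRA 低秩适配与重参数化(把增量合并回原权重)思想;接口与初始化方式为通用写法而非论文实现:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA 低秩适配草图:冻结预训练权重 W,仅训练低秩增量 B@A;
    训练后可把增量合并回 W,对应论文中“将学到的参数并回原 Transformer”的思想。"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # 冻结原始参数
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # 零初始化
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

    @torch.no_grad()
    def merge(self):
        """推理前合并低秩增量,之后可直接使用 self.base,无额外开销。"""
        self.base.weight += (self.B @ self.A) * self.scale
```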

[CV-71] Real-Time Neural-Enhancement for Online Cloud Gaming

【速读】:该论文旨在解决在线云游戏(Online Cloud Gaming)中实时高质量视频传输的挑战,特别是在可变广域网(WAN)环境下。传统基于超分辨率(Super-Resolution, SR)的视频质量增强方法需要对整个视频进行密集的微调,这在多样化的在线云游戏场景中不可行。为此,论文提出了名为River的云游戏传输框架,其核心思想是利用云游戏视频片段特征的重复性和冗余性,通过重用微调后的SR模型,将微调延迟从分钟级降低到毫秒级的查询延迟。River的关键解决方案包括:1)设计内容感知编码器,为不同视频片段微调SR模型并将其存储在查找表中;2)在线模型调度器,根据视频特征检索最相关的SR模型以增强帧质量;3)动态更新机制,当现有模型无法满足某些视频片段时,进一步微调新模型并更新查找表;4)预取策略,预测最可能被检索的模型,避免向客户端传输模型权重的开销。实验表明,River相比现有最优方案减少了44%的冗余训练开销,并将峰值信噪比(PSNR)提高了1.81dB,同时在实际部署中满足实时性要求,在移动设备上实现了约720p 20fps的性能。

链接: https://arxiv.org/abs/2501.06880
作者: Shan Jiang,Zhenhua Han,Haisheng Tan,Xinyang Jiang,Yifan Yang,Xiaoxi Zhang,Hongqiu Ni,Yuqing Yang,Xiang-Yang Li
机构: University of Science and Technology of China (中国科学技术大学); Microsoft Research Asia (微软亚洲研究院); Sun Yat-sen University (中山大学)
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Online cloud gaming demands real-time, high-quality video transmission across variable wide-area networks (WANs). Neural-enhanced video transmission algorithms employing super-resolution (SR) for video quality enhancement have proven effective in challenging WAN environments. However, these SR-based methods require intensive fine-tuning for the whole video, making them infeasible in diverse online cloud gaming. To address this, we introduce River, a cloud gaming delivery framework designed based on the observation that video segment features in cloud gaming are typically repetitive and redundant. This permits a significant opportunity to reuse fine-tuned SR models, reducing fine-tuning latency from minutes to a query latency of milliseconds. To enable the idea, we design a practical system that addresses several challenges, such as model organization, online model scheduling, and transfer strategy. River first builds a content-aware encoder that fine-tunes SR models for diverse video segments and stores them in a lookup table. When delivering cloud gaming video streams online, River checks the video features and retrieves the most relevant SR models to enhance the frame quality. Meanwhile, if no existing SR model performs well enough for some video segments, River will further fine-tune new models and update the lookup table. Finally, to avoid the overhead of streaming model weights to the clients, River designs a prefetching strategy that predicts the models with the highest probability of being retrieved. Our evaluation based on real video game streaming demonstrates River can reduce redundant training overhead by 44% and improve the Peak-Signal-to-Noise-Ratio by 1.81dB compared to the SOTA solutions. Practical deployment shows River meets real-time requirements, achieving approximately 720p 20fps on mobile devices.
zh
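
代码示意:River 的核心机制是“按内容特征检索已微调的 SR 模型”。下面用一个极简的查找表草图说明这一思路(相似度度量与阈值均为假设;真实系统还需处理模型组织、在线调度与预取):

```python
import torch

class SRModelLookup:
    """按内容特征检索已微调 SR 模型的查找表草图(相似度与阈值为假设)。"""
    def __init__(self, threshold=0.3):
        self.keys, self.models = [], []
        self.threshold = threshold

    def retrieve(self, feat: torch.Tensor):
        """返回特征最相近的缓存模型;距离超过阈值则返回 None,触发新微调。"""
        if not self.keys:
            return None
        dists = torch.stack([1 - torch.cosine_similarity(feat, k, dim=0)
                             for k in self.keys])
        i = int(dists.argmin())
        return self.models[i] if dists[i] < self.threshold else None

    def insert(self, feat: torch.Tensor, model):
        """为新视频片段微调出的 SR 模型登记特征键。"""
        self.keys.append(feat)
        self.models.append(model)
```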

[CV-72] Defect Detection Network In PCB Circuit Devices Based on GAN Enhanced YOLOv11

【速读】:该论文旨在解决印刷电路板(PCB)表面缺陷检测中的挑战,特别是针对复杂和不常见的缺陷类型,如毛刺(burr)。解决方案的关键在于提出了一种改进的YOLOv11模型,并结合生成对抗网络(GAN)进行数据增强。通过GAN生成合成的缺陷图像,扩充了数据集的多样性和真实性,从而提升了模型在复杂环境和小目标缺陷检测中的泛化能力。该方法显著提高了检测的准确性、召回率和鲁棒性,减少了对手工检测的依赖,并加速了从设计到生产的流程。研究强调了在电子设计自动化(EDA)过程中集成GAN数据增强和优化检测架构的重要性,为工业应用中的PCB缺陷检测提供了可靠性和效率的提升。

链接: https://arxiv.org/abs/2501.06879
作者: Jiayi Huang,Feiyun Zhao,Lieyang Chen
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This study proposes an advanced method for surface defect detection in printed circuit boards (PCBs) using an improved YOLOv11 model enhanced with a generative adversarial network (GAN). The approach focuses on identifying six common defect types: missing hole, rat bite, open circuit, short circuit, burr, and virtual welding. By employing GAN to generate synthetic defect images, the dataset is augmented with diverse and realistic patterns, improving the model’s ability to generalize, particularly for complex and infrequent defects like burrs. The enhanced YOLOv11 model is evaluated on a PCB defect dataset, demonstrating significant improvements in accuracy, recall, and robustness, especially when dealing with defects in complex environments or small targets. This research contributes to the broader field of electronic design automation (EDA), where efficient defect detection is a crucial step in ensuring high-quality PCB manufacturing. By integrating advanced deep learning techniques, this approach enhances the automation and precision of defect detection, reducing reliance on manual inspection and accelerating design-to-production workflows. The findings underscore the importance of incorporating GAN-based data augmentation and optimized detection architectures in EDA processes, providing valuable insights for improving reliability and efficiency in PCB defect detection within industrial applications.
zh
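
代码示意:数据增强环节的思路可以用如下草图表示:用已训练好的条件 GAN 为稀有缺陷类别(如毛刺)批量生成合成样本,再并入检测器训练集。其中 `generator` 及其调用接口均为假设:

```python
import torch

@torch.no_grad()
def augment_with_gan(generator, n_samples, latent_dim=128, defect_label=4):
    """用已训练的条件 GAN 生成某一缺陷类别的合成图像(示意)。
    generator 为假设的接口:输入噪声与类别标签,输出图像张量 [N, 3, H, W]。"""
    z = torch.randn(n_samples, latent_dim)
    labels = torch.full((n_samples,), defect_label, dtype=torch.long)
    fake_images = generator(z, labels)
    return fake_images, labels   # 并入 YOLOv11 训练集,提升稀有缺陷的召回
```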

[CV-73] Uncertainty-Aware Online Extrinsic Calibration: A Conformal Prediction Approach WACV2025

【速读】:该论文试图解决自主系统中传感器校准(sensor calibration)的不确定性量化(uncertainty quantification)问题。尽管传感器校准的准确性对自主系统至关重要,但其不确定性量化仍然缺乏深入探索。论文提出了一种将不确定性感知(uncertainty awareness)集成到在线外参校准(online extrinsic calibration)中的方法,结合蒙特卡洛丢弃(Monte Carlo Dropout)和保形预测(Conformal Prediction)来生成具有保证覆盖水平的预测区间(prediction intervals)。该方法的创新之处在于提出了一个框架,能够增强现有校准模型的不确定性量化能力,并且兼容多种网络架构。通过在KITTI(RGB相机-激光雷达)和DSEC(事件相机-激光雷达)数据集上的验证,论文展示了该方法在不同视觉传感器类型中的有效性,并通过适应性指标评估了区间的效率和可靠性。通过提供具有可量化置信度度量的校准参数,该方法为校准估计的可靠性提供了深入见解,从而显著提高了动态环境中传感器融合的鲁棒性,并为计算机视觉(Computer Vision)领域提供了有价值的贡献。

链接: https://arxiv.org/abs/2501.06878
作者: Mathieu Cocheteux,Julien Moreau,Franck Davoine
机构: Université de technologie de Compiègne (贡比涅技术大学), CNRS (法国国家科学研究中心), Heudiasyc (Heudiasyc实验室); CNRS (法国国家科学研究中心), INSA Lyon (里昂国立应用科学学院), UCBL (里昂第一大学), LIRIS (LIRIS实验室), UMR5205 (UMR5205研究单位)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at WACV 2025

点击查看摘要

Abstract:Accurate sensor calibration is crucial for autonomous systems, yet its uncertainty quantification remains underexplored. We present the first approach to integrate uncertainty awareness into online extrinsic calibration, combining Monte Carlo Dropout with Conformal Prediction to generate prediction intervals with a guaranteed level of coverage. Our method proposes a framework to enhance existing calibration models with uncertainty quantification, compatible with various network architectures. Validated on KITTI (RGB Camera-LiDAR) and DSEC (Event Camera-LiDAR) datasets, we demonstrate effectiveness across different visual sensor types, measuring performance with adapted metrics to evaluate the efficiency and reliability of the intervals. By providing calibration parameters with quantifiable confidence measures, we offer insights into the reliability of calibration estimates, which can greatly improve the robustness of sensor fusion in dynamic environments and usefully serve the Computer Vision community.
zh
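
代码示意:该方法的两个组件都有通用的最小实现。下面的草图演示 Monte Carlo Dropout 的多次采样推理,以及 split conformal prediction 如何由标定集残差得到具有覆盖率保证的区间半径(一维简化,非论文原始代码):

```python
import torch

def mc_dropout_predict(model, x, n_samples=30):
    """Monte Carlo Dropout:推理时保持 dropout 激活,多次前向得到预测分布。"""
    model.train()                       # 使 dropout 在推理阶段仍然生效
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)  # 均值作点估计,标准差反映不确定性

def conformal_radius(cal_errors: torch.Tensor, alpha=0.1):
    """Split conformal prediction:取标定集绝对残差的约 (1-alpha) 分位数
    作为预测区间半径,保证边际覆盖率(一维简化示意)。"""
    n = cal_errors.numel()
    level = min(1.0, (n + 1) * (1 - alpha) / n)
    return torch.quantile(cal_errors, level)   # 区间为 [y_hat - q, y_hat + q]
```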

[CV-74] A Foundational Generative Model for Breast Ultrasound Image Analysis

【速读】:该论文旨在解决乳腺超声图像分析领域缺乏专门的基础生成模型(foundational generative model)的问题。现有的基础模型虽然在临床任务中表现出色,但在乳腺超声图像分析方面的潜力尚未得到充分开发。为此,作者提出了BUSGen,这是首个专门为乳腺超声图像分析设计的基础生成模型。BUSGen的关键在于其预训练阶段使用了超过350万张乳腺超声图像,使其能够学习乳腺结构、病理特征和临床变异。通过少样本适应(few-shot adaptation),BUSGen能够生成逼真且信息丰富的任务特定数据,从而支持多种下游任务的模型开发。实验表明,BUSGen在乳腺癌筛查、诊断和预后任务中表现出卓越的适应性,显著优于基于真实数据训练的基础模型,并在早期诊断中超越了所有参与测试的放射科医生。此外,BUSGen通过生成完全去标识化的数据,保护了患者隐私,推动了医疗数据的安全利用。

链接: https://arxiv.org/abs/2501.06869
作者: Haojun Yu,Youcheng Li,Nan Zhang,Zihan Niu,Xuantong Gong,Yanwen Luo,Haotian Ye,Siyu He,Quanlin Wu,Wangyan Qin,Mengyuan Zhou,Jie Han,Jia Tao,Ziwei Zhao,Di Dai,Di He,Dong Wang,Binghui Tang,Ling Huo,James Zou,Qingli Zhu,Yong Wang,Liwei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Peking University; Stanford University; Peking University Cancer Hospital Institute; Peking Union Medical College Hospital; Cancer Hospital, Chinese Academy of Medical Sciences

点击查看摘要

Abstract:Foundational models have emerged as powerful tools for addressing various tasks in clinical settings. However, their potential for breast ultrasound analysis remains untapped. In this paper, we present BUSGen, the first foundational generative model specifically designed for breast ultrasound image analysis. Pretrained on over 3.5 million breast ultrasound images, BUSGen has acquired extensive knowledge of breast structures, pathological features, and clinical variations. With few-shot adaptation, BUSGen can generate repositories of realistic and informative task-specific data, facilitating the development of models for a wide range of downstream tasks. Extensive experiments highlight BUSGen's exceptional adaptability, significantly exceeding real-data-trained foundational models in breast cancer screening, diagnosis, and prognosis. In breast cancer early diagnosis, our approach outperformed all board-certified radiologists (n=9), achieving an average sensitivity improvement of 16.5% (P-value < 0.0001). Additionally, we characterized the scaling effect of using generated data, which was as effective as the collected real-world data for training diagnostic models. Moreover, extensive experiments demonstrated that our approach improved the generalization ability of downstream models. Importantly, BUSGen protected patient privacy by enabling fully de-identified data sharing, making progress forward in secure medical data utilization. An online demo of BUSGen is available at this https URL.
zh

[CV-75] LarvSeg: Exploring Image Classification Data For Large Vocabulary Semantic Segmentation via Category-wise Attentive Classifier

【速读】:该论文试图解决大规模词汇语义分割(semantic segmentation)模型扩展词汇量时面临的挑战,特别是由于大规模掩码标签(mask labels)的标注工作耗时且劳动密集,导致模型在处理分布外类别(out-of-distribution categories)时性能显著下降的问题。为了解决这一问题,论文提出了一种名为LarvSeg的新框架,其关键创新在于利用图像分类数据(image classification data)来扩展语义分割模型的词汇量。具体而言,LarvSeg通过将图像级监督(image-level supervision)引入像素级分割模型的训练过程,使得模型能够在分类数据中引入的新类别上进行语义分割。此外,论文还设计了一个类别注意力分类器(category-wise attentive classifier),通过对相应类别的精确区域施加监督,进一步提升模型性能。实验表明,LarvSeg显著提高了大规模词汇语义分割的性能,尤其是在没有掩码标签的类别上,并首次借助ImageNet21K实现了21K类别的语义分割模型。

链接: https://arxiv.org/abs/2501.06862
作者: Haojun Yu,Di Dai,Ziwei Zhao,Di He,Han Hu,Liwei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: PRCV 2024

点击查看摘要

Abstract:Scaling up the vocabulary of semantic segmentation models is extremely challenging because annotating large-scale mask labels is labour-intensive and time-consuming. Recently, language-guided segmentation models have been proposed to address this challenge. However, their performance drops significantly when applied to out-of-distribution categories. In this paper, we propose a new large vocabulary semantic segmentation framework, called LarvSeg. Different from previous works, LarvSeg leverages image classification data to scale the vocabulary of semantic segmentation models as large-vocabulary classification datasets usually contain balanced categories and are much easier to obtain. However, for classification tasks, the category is image-level, while for segmentation we need to predict the label at pixel level. To address this issue, we first propose a general baseline framework to incorporate image-level supervision into the training process of a pixel-level segmentation model, making the trained network perform semantic segmentation on newly introduced categories in the classification data. We then observe that a model trained on segmentation data can group pixel features of categories beyond the training vocabulary. Inspired by this finding, we design a category-wise attentive classifier to apply supervision to the precise regions of corresponding categories to improve the model performance. Extensive experiments demonstrate that LarvSeg significantly improves the large vocabulary semantic segmentation performance, especially in the categories without mask labels. For the first time, we provide a 21K-category semantic segmentation model with the help of ImageNet21K. The code is available at this https URL.
zh
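
代码示意:将图像级监督引入像素级分割模型的基线思路,可以用“把像素 logits 聚合为图像 logits 后再做分类损失”来表示。下面的 top-k 池化只是对论文“对相应类别区域施加监督”思想的一个假设性近似:

```python
import torch.nn.functional as F

def image_level_loss(pixel_logits, image_labels, k_ratio=0.01):
    """把像素级 logits 聚合成图像级预测,用分类标签监督分割模型(基线思路示意)。
    pixel_logits: [B, C, H, W];image_labels: [B]。top-k 池化为假设性近似。"""
    b, c, h, w = pixel_logits.shape
    k = max(1, int(h * w * k_ratio))
    flat = pixel_logits.flatten(2)                            # [B, C, H*W]
    image_logits = flat.topk(k, dim=2).values.mean(dim=2)     # 每类取响应最强的区域
    return F.cross_entropy(image_logits, image_labels)
```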

[CV-76] Faithful Counterfactual Visual Explanations (FCVE)

【速读】:该论文试图解决深度学习模型在计算机视觉领域缺乏透明性和可解释性的问题。尽管深度学习模型取得了显著进展,但其决策过程往往难以理解和解释,尤其是对于非专家用户。现有技术通常难以提供令人信服且易于理解的解释,且无法准确识别模型内在的决策机制。为解决这些问题,论文提出了一种反事实解释(Counterfactual Explanation, CE)模型,该模型在保持图像像素数据不变的前提下,通过最小化必要的改变生成易于理解的视觉解释。该方法的创新之处在于识别模型学习到的内部概念和过滤器,并利用它们生成合理的反事实解释。这些解释能够反映模型的内部决策过程,从而确保解释的忠实性(faithfulness)。

链接: https://arxiv.org/abs/2501.06841
作者: Bismillah Khan,Syed Ali Tariq,Tehseen Zia,Muhammad Ahsan,David Windridge
机构: COMSATS University Islamabad(COMSATS大学伊斯兰堡); The City School(城市学校); Middlesex University London(伦敦米德尔塞克斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning models in computer vision have made remarkable progress, but their lack of transparency and interpretability remains a challenge. The development of explainable AI can enhance the understanding and performance of these models. However, existing techniques often struggle to provide convincing explanations that non-experts easily understand, and they cannot accurately identify models’ intrinsic decision-making processes. To address these challenges, we propose to develop a counterfactual explanation (CE) model that balances plausibility and faithfulness. This model generates easy-to-understand visual explanations by making minimum changes necessary in images without altering the pixel data. Instead, the proposed method identifies internal concepts and filters learned by models and leverages them to produce plausible counterfactual explanations. The provided explanations reflect the internal decision-making process of the model, thus ensuring faithfulness to the model.
zh

[CV-77] SAM-DA: Decoder Adapter for Efficient Medical Domain Adaptation WACV25

【速读】:该论文旨在解决医学影像语义分割(semantic segmentation)中的领域适应(domain adaptation)问题。尽管现有的基础分割模型(如SAM)在自然图像上表现出色,但在医学影像领域表现不佳。此外,现有的端到端微调方法在计算上不可行。为此,论文提出了一种新颖的SAM适配器(SAM adapter)方法,通过最小化可训练参数的数量,同时达到与全微调相当的性能。该适配器被策略性地放置在掩码解码器(mask decoder)中,具有出色的泛化能力,并在全监督和测试时领域适应任务中显著提升了分割效果。通过在四个数据集上的广泛验证,该方法在仅训练不到1%的SAM参数的情况下,超越了现有方法。

链接: https://arxiv.org/abs/2501.06836
作者: Javier Gamazo Tejero,Moritz Schmid,Pablo Márquez Neila,Martin S. Zinkernagel,Sebastian Wolf,Raphael Sznitman
机构: University of Bern(伯尔尼大学); Inselspital Bern, Switzerland(瑞士伯尔尼大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV25

点击查看摘要

Abstract:This paper addresses the domain adaptation challenge for semantic segmentation in medical imaging. Despite the impressive performance of recent foundational segmentation models like SAM on natural images, they struggle with medical domain images. Beyond this, recent approaches that perform end-to-end fine-tuning of models are simply not computationally tractable. To address this, we propose a novel SAM adapter approach that minimizes the number of trainable parameters while achieving comparable performances to full fine-tuning. The proposed SAM adapter is strategically placed in the mask decoder, offering excellent and broad generalization capabilities and improved segmentation across both fully supervised and test-time domain adaptation tasks. Extensive validation on four datasets showcases the adapter’s efficacy, outperforming existing methods while training less than 1% of SAM’s total parameters.
zh
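
代码示意:论文的适配器置于 SAM 的掩码解码器内部;适配器模块本身通常是一个瓶颈结构。下面给出一个通用的适配器草图(维度与具体接入位置为假设),上投影零初始化可保证训练起点与原模型等价:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """瓶颈式适配器草图:下投影-激活-上投影 + 残差,仅引入极少量可训练参数。"""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # 零初始化:初始输出为恒等映射
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))
```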

[CV-78] X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding

【速读】:该论文旨在解决现有基准数据集在评估超长第一人称视角(egocentric)视频理解任务中的不足。现有的数据集主要关注单一、短时长的视频或中等时长的视频(最多几十分钟),无法有效评估长时间的第一人称视角视频记录。为此,作者提出了X-LeBench,这是一个专门用于评估超长第一人称视角视频任务的新型基准数据集。解决方案的关键在于利用大语言模型(LLMs)的先进文本处理能力,开发了一个生活日志模拟管道(life-logging simulation pipeline),能够生成与现实世界视频数据一致的、连贯的日常计划。通过将合成的日常计划与来自Ego4D(一个涵盖广泛日常生活场景的大规模第一人称视角视频数据集)的真实视频片段灵活结合,X-LeBench生成了432个模拟视频生活日志,这些日志反映了在丰富情境下的真实日常活动,视频时长从23分钟到16.4小时不等。该研究揭示了现有基线系统和多模态大语言模型(MLLMs)在长时第一人称视角视频理解任务中的表现普遍较差,强调了开发更先进模型的必要性。

链接: https://arxiv.org/abs/2501.06835
作者: Wenqi Zhou,Kai Cao,Hao Zheng,Xinyi Zheng,Miao Liu,Per Ola Kristensson,Walterio Mayol-Cuevas,Fan Zhang,Weizhe Lin,Junxiao Shen
机构: University of Bristol(布里斯托大学); University of Manchester(曼彻斯特大学); University of Cambridge(剑桥大学); X-Intelligence Labs(X-智能实验室); Meta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long-form egocentric video understanding provides rich contextual information and unique insights into long-term human behaviors, holding significant potential for applications in embodied intelligence, long-term activity analysis, and personalized assistive technologies. However, existing benchmark datasets primarily focus on single, short-duration videos or moderately long videos up to dozens of minutes, leaving a substantial gap in evaluating extensive, ultra-long egocentric video recordings. To address this, we introduce X-LeBench, a novel benchmark dataset specifically crafted for evaluating tasks on extremely long egocentric video recordings. Leveraging the advanced text processing capabilities of large language models (LLMs), X-LeBench develops a life-logging simulation pipeline that produces realistic, coherent daily plans aligned with real-world video data. This approach enables the flexible integration of synthetic daily plans with real-world footage from Ego4D, a massive-scale egocentric video dataset covering a wide range of daily life scenarios, resulting in 432 simulated video life logs that mirror realistic daily activities in contextually rich scenarios. The video life-log durations span from 23 minutes to 16.4 hours. The evaluation of several baseline systems and multimodal large language models (MLLMs) reveals their poor performance across the board, highlighting the inherent challenges of long-form egocentric video understanding and underscoring the need for more advanced models.
zh

[CV-79] Towards Counterfactual and Contrastive Explainability and Transparency of DCNN Image Classifiers

【速读】:该论文试图解决深度卷积神经网络(DCNNs)的可解释性问题,旨在揭示DCNN模型决策背后的原因,并提高其在高风险环境中的理解和可靠性。为此,作者提出了一种新颖的方法,用于生成可解释的反事实(counterfactual)和对比性(contrastive)解释。该方法的关键在于通过探测DCNN的内部工作机制,而不是通过修改输入图像来生成解释。具体而言,给定一个输入图像,该方法通过识别DCNN中最重要的滤波器(filters)来提供对比性解释,这些滤波器代表了将模型的决策从原始推断类别分类到其他指定类别的特征和概念。同时,反事实解释则通过指定这些滤波器所需的最小变化来实现对比性输出。通过识别这些滤波器和概念,该方法能够提供模型决策背后的对比性和反事实原因,从而使模型更加透明。该方法的一个有趣应用是误分类分析,通过比较特定输入图像中识别的概念与类别特定概念,以验证模型决策的有效性。该方法在Caltech-UCSD Birds (CUB) 2011数据集上进行了评估,并与现有最先进方法进行了比较,展示了其解释的有效性。

链接: https://arxiv.org/abs/2501.06831
作者: Syed Ali Tariq,Tehseen Zia,Mubeen Ghafoor
机构: COMSATS University Islamabad(COMSATS大学伊斯兰堡); University of Lincoln(林肯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published in Knowledge-Based Systems, Volume 257, 2022, 109901

点击查看摘要

Abstract:Explainability of deep convolutional neural networks (DCNNs) is an important research topic that tries to uncover the reasons behind a DCNN model's decisions and improve their understanding and reliability in high-risk environments. In this regard, we propose a novel method for generating interpretable counterfactual and contrastive explanations for DCNN models. The proposed method is model-intrusive: it probes the internal workings of a DCNN instead of altering the input image to generate explanations. Given an input image, we provide contrastive explanations by identifying the most important filters in the DCNN representing features and concepts that separate the model's decision between classifying the image to the original inferred class or some other specified alter class. On the other hand, we provide counterfactual explanations by specifying the minimal changes necessary in such filters so that a contrastive output is obtained. Using these identified filters and concepts, our method can provide contrastive and counterfactual reasons behind a model's decisions and makes the model more transparent. One of the interesting applications of this method is misclassification analysis, where we compare the concepts identified from a particular input image with class-specific concepts to establish the validity of the model's decisions. The proposed method is compared with the state-of-the-art and evaluated on the Caltech-UCSD Birds (CUB) 2011 dataset to show the usefulness of the explanations provided.
zh
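
代码示意:“识别最重要滤波器”这一步可以用“梯度 × 激活”的通用近似来说明(activations 需通过 forward hook 从目标卷积层获取;这并不是论文的具体算法,仅用于示意思路):

```python
import torch

def filter_importance(activations: torch.Tensor, class_score: torch.Tensor):
    """用 梯度 x 激活 估计每个卷积滤波器对某类得分的贡献(Grad-CAM 式近似,
    并非论文的具体算法)。activations: [B, K, H, W],需位于计算图中;
    class_score 为目标类别的 logit。"""
    grads = torch.autograd.grad(class_score.sum(), activations,
                                retain_graph=True)[0]
    return (grads * activations).mean(dim=(0, 2, 3))   # [K]:逐滤波器重要性

# 重要性最高的滤波器可作为对比解释的候选概念;反事实解释则寻找
# 使这些滤波器输出发生最小改变、足以翻转模型预测的修改。
```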

[CV-80] GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing

【速读】:该论文旨在解决现有多模态大语言模型(MLLMs)在遥感(RS)图像理解任务中缺乏像素级对话能力的问题。具体来说,现有模型在图像描述、视觉问答和视觉定位等任务中表现出色,但无法根据用户指令生成特定实例的分割掩码(segmentation masks)。为解决这一问题,论文提出了GeoPix模型,通过引入掩码预测器(mask predictor)将视觉编码器的特征转换为基于大语言模型分割标记嵌入的掩码,从而实现了像素级的图像理解。此外,为了处理遥感图像中多尺度对象的分割问题,模型集成了类级可学习记忆模块(class-wise learnable memory module),以在实例级别捕获和存储整个数据集的类级地理上下文信息。论文还构建了GeoPixInstruct数据集,包含65,463张图像和140,412个实例,每个实例都标注了文本描述、边界框和掩码,以解决大规模训练数据缺失的问题。最后,采用两阶段训练策略来平衡文本生成和掩码预测在多模态多任务优化中的不同需求。实验结果表明,GeoPix在像素级分割任务中表现出色,同时在图像和区域级基准测试中保持竞争力。

链接: https://arxiv.org/abs/2501.06828
作者: Ruizhe Ou,Yuan Hu,Fan Zhang,Jiaxin Chen,Yu Liu
机构: Pattern Recognition and Intelligent System Lab, School of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学模式识别与智能系统实验室); Institute of Remote Sensing and Geographic Information Systems, School of Earth and Space Sciences, Peking University (北京大学地球与空间科学学院遥感与地理信息系统研究所); Peking University Ordos Research Institute of Energy (北京大学鄂尔多斯能源研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have achieved remarkable success in image- and region-level remote sensing (RS) image understanding tasks, such as image captioning, visual question answering, and visual grounding. However, existing RS MLLMs lack the pixel-level dialogue capability, which involves responding to user instructions with segmentation masks for specific instances. In this paper, we propose GeoPix, an RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM's segmentation token embeddings. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor to capture and store class-wise geo-context at the instance level across the entire dataset. In addition, to address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset, comprising 65,463 images and 140,412 instances, with each instance annotated with text descriptions, bounding boxes, and masks. Furthermore, we develop a two-stage training strategy to balance the distinct requirements of text generation and mask prediction in multi-modal multi-task optimization. Extensive experiments verify the effectiveness and superiority of GeoPix in pixel-level segmentation tasks, while also maintaining competitive performance in image- and region-level benchmarks.
zh

[CV-81] UR2P-Dehaze: Learning a Simple Image Dehaze Enhancer via Unpaired Rich Physical Prior

【速读】:该论文旨在解决图像去雾(image dehazing)技术中现有方法依赖单一手动先验(manual prior)而无法有效揭示图像细节的问题。为了解决这一局限性,作者提出了一种基于非配对图像的去雾网络,称为“通过非配对丰富物理先验的简单图像去雾增强器”(UR2P-Dehaze)。其解决方案的关键包括:1)设计了一个共享先验估计器(Shared Prior Estimator, SPE),通过迭代训练确保光照和反射率的一致性,从而生成清晰、高质量的图像;2)引入自监控机制,消除不良特征,为图像重建提供可靠的先验;3)提出了动态小波可分离卷积(Dynamic Wavelet Separable Convolution, DWSC),有效整合低频和高频的关键特征,显著增强图像细节的保留并确保全局一致性;4)设计了自适应颜色校正器(Adaptive Color Corrector),解决图像颜色不清晰的问题。实验结果表明,该方法在多个评价指标上达到了最先进的性能,并有助于提升下游任务的性能。

链接: https://arxiv.org/abs/2501.06818
作者: Minglong Xue,Shuaibin Fan,Shivakumara Palaiahnakote,Mingliang Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image dehazing techniques aim to enhance contrast and restore details, which are essential for preserving visual information and improving image processing accuracy. Existing methods rely on a single manual prior, which cannot effectively reveal image details. To overcome this limitation, we propose an unpaired image dehazing network, called the Simple Image Dehaze Enhancer via Unpaired Rich Physical Prior (UR2P-Dehaze). First, to accurately estimate the illumination, reflectance, and color information of the hazy image, we design a shared prior estimator (SPE) that is iteratively trained to ensure the consistency of illumination and reflectance, generating clear, high-quality images. Additionally, a self-monitoring mechanism is introduced to eliminate undesirable features, providing reliable priors for image reconstruction. Next, we propose Dynamic Wavelet Separable Convolution (DWSC), which effectively integrates key features across both low and high frequencies, significantly enhancing the preservation of image details and ensuring global consistency. Finally, to effectively restore the color information of the image, we propose an Adaptive Color Corrector that addresses the problem of unclear colors. The PSNR, SSIM, LPIPS, FID and CIEDE2000 metrics on the benchmark dataset show that our method achieves state-of-the-art performance. It also contributes to the performance improvement of downstream tasks. The project code will be available at this https URL.
zh

[CV-82] RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models

【速读】:该论文旨在解决遥感图像分割中通过自由格式文本输入实现细粒度视觉理解的问题。当前研究主要依赖预训练语言模型来编码文本描述并将其与视觉模态对齐,但这些方法在建立细粒度语义概念之间的鲁棒对齐方面存在困难,导致文本和视觉信息之间的表示不一致。为解决这一问题,论文提出了一个基于基础模型的遥感图像分割方法,称为RSRefSeg。该模型利用CLIP(Contrastive Language–Image Pretraining)进行视觉和文本编码,并通过全局和局部文本语义作为过滤器,在潜在空间中生成与参考相关的视觉激活特征。这些激活特征随后作为输入提示传递给SAM(Segment Anything Model),利用其强大的视觉泛化能力进一步优化分割掩码。实验结果表明,RSRefSeg在RRSIS-D数据集上优于现有方法,证明了基础模型在增强多模态任务理解方面的有效性。

链接: https://arxiv.org/abs/2501.06809
作者: Keyan Chen,Jiafan Zhang,Chenyang Liu,Zhengxia Zou,Zhenwei Shi
机构: Beihang University(北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding through free-format textual input, enabling enhanced scene and object extraction in remote sensing applications. Current research primarily utilizes pre-trained language models to encode textual descriptions and align them with visual modalities, thereby facilitating the expression of relevant visual features. However, these approaches often struggle to establish robust alignments between fine-grained semantic concepts, leading to inconsistent representations across textual and visual information. To address these limitations, we introduce a referring remote sensing image segmentation foundational model, RSRefSeg. RSRefSeg leverages CLIP for visual and textual encoding, employing both global and local textual semantics as filters to generate referring-related visual activation features in the latent space. These activated features then serve as input prompts for SAM, which refines the segmentation masks through its robust visual generalization capabilities. Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods, underscoring the effectiveness of foundational models in enhancing multimodal task comprehension. The code is available at \urlthis https URL.
zh

[CV-83] Semantic-CD: Remote Sensing Image Semantic Change Detection towards Open-vocabulary Setting

【速读】:该论文试图解决遥感图像语义变化检测(Semantic Change Detection, SCD)中传统方法在实际场景中难以泛化到不同语义类别的问题。为了解决这一问题,作者提出了一种名为Semantic-CD的新方法,该方法结合了视觉-语言基础模型CLIP(Contrastive Language–Image Pretraining)的开放词汇语义知识。Semantic-CD的关键在于通过完全解耦的多任务学习(包括二值变化检测和语义变化检测任务),利用CLIP的广泛词汇知识来增强模型的类别泛化能力,并提升分割精度。该方法的核心组件包括双时相CLIP视觉编码器、开放语义提示器、二值变化检测解码器和语义变化检测解码器,分别用于提取双时相图像特征、生成开放词汇的语义成本体积图、生成二值变化检测掩码以及生成语义标签。实验结果表明,Semantic-CD在SECOND数据集上能够生成更准确的掩码并减少语义分类错误,验证了其在SCD任务中应用视觉-语言基础模型语义先验的有效性。

链接: https://arxiv.org/abs/2501.06808
作者: Yongshuo Zhu,Lu Li,Keyan Chen,Chenyang Liu,Fugen Zhou,Zhenwei Shi
机构: Beihang University(北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing image semantic change detection is a method used to analyze remote sensing images, aiming to identify areas of change as well as categorize these changes within images of the same location taken at different times. Traditional change detection methods often face challenges in generalizing across semantic categories in practical scenarios. To address this issue, we introduce a novel approach called Semantic-CD, specifically designed for semantic change detection in remote sensing images. This method incorporates the open vocabulary semantics from the vision-language foundation model, CLIP. By utilizing CLIP’s extensive vocabulary knowledge, our model enhances its ability to generalize across categories and improves segmentation through fully decoupled multi-task learning, which includes both binary change detection and semantic change detection tasks. Semantic-CD consists of four main components: a bi-temporal CLIP visual encoder for extracting features from bi-temporal images, an open semantic prompter for creating semantic cost volume maps with open vocabulary, a binary change detection decoder for generating binary change detection masks, and a semantic change detection decoder for producing semantic labels. Experimental results on the SECOND dataset demonstrate that Semantic-CD achieves more accurate masks and reduces semantic classification errors, illustrating its effectiveness in applying semantic priors from vision-language foundation models to SCD tasks.
zh

[CV-84] Improving Pain Classification using Spatio-Temporal Deep Learning Approaches with Facial Expressions

【速读】:该论文旨在解决传统疼痛管理中的主观性问题,特别是针对无法自我报告疼痛的非言语个体(如语言能力受限的人群)。传统方法依赖于自我报告,存在主观性强且不适用于特定人群的局限性。为此,论文提出了一种基于面部表情的自动化疼痛检测方法,利用深度学习技术改进疼痛评估。解决方案的关键在于两种创新模型:一是结合ConvNeXt模型和长短期记忆(LSTM)模块的混合模型,用于分析视频帧并预测疼痛存在;二是将时空图卷积网络(STGCN)与LSTM结合,处理面部图像的关键点以进行疼痛检测。这些方法首次应用于Pain Emotion Faces Database(PEMF)数据集进行二分类疼痛检测,并通过实验验证了其有效性,展示了结合空间和时间特征在提升疼痛检测精度方面的潜力。

链接: https://arxiv.org/abs/2501.06787
作者: Aafaf Ridouan,Amine Bohi,Youssef Mourchid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures, 3 tables. Accepted and presented at the 18th International Conference on Machine Vision (ICMV 2024), Edinburgh, UK

点击查看摘要

Abstract:Pain management and severity detection are crucial for effective treatment, yet traditional self-reporting methods are subjective and may be unsuitable for non-verbal individuals (people with limited speaking skills). To address this limitation, we explore automated pain detection using facial expressions. Our study leverages deep learning techniques to improve pain assessment by analyzing facial images from the Pain Emotion Faces Database (PEMF). We propose two novel approaches: (1) a hybrid ConvNeXt model combined with Long Short-Term Memory (LSTM) blocks to analyze video frames and predict pain presence, and (2) a Spatio-Temporal Graph Convolution Network (STGCN) integrated with LSTM to process landmarks from facial images for pain detection. Our work represents the first use of the PEMF dataset for binary pain classification and demonstrates the effectiveness of these models through extensive experimentation. The results highlight the potential of combining spatial and temporal features for enhanced pain detection, offering a promising advancement in objective pain assessment methodologies.
zh
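
代码示意:第一种混合模型的结构可以用如下 PyTorch 草图表示:ConvNeXt 逐帧提取特征,LSTM 做时序聚合后进行二分类(主干规模与隐藏层维度等超参数均为假设):

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny

class ConvNeXtLSTM(nn.Module):
    """混合模型草图:ConvNeXt 逐帧提特征,LSTM 聚合时序,二分类疼痛检测。"""
    def __init__(self, hidden=256):
        super().__init__()
        backbone = convnext_tiny(weights=None)
        backbone.classifier = nn.Identity()      # 保留 768 维池化特征
        self.backbone = backbone
        self.lstm = nn.LSTM(768, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)         # 疼痛 / 无疼痛

    def forward(self, clips):                    # clips: [B, T, 3, H, W]
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])             # 取最后时刻的隐状态做分类
```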

[CV-85] Temporal-Aware Spiking Transformer Hashing Based on 3D-DWT

【速读】:该论文旨在解决动态视觉传感器(DVS)数据快速增长背景下,构建低能耗、高效的数据检索系统的迫切需求。解决方案的关键在于提出了一种名为Spikinghash的新型监督哈希方法,该方法基于脉冲神经网络(SNNs)的二进制特性,采用分层轻量级结构。具体而言,Spiking WaveMixer(SWM)部署在浅层,利用多级三维离散小波变换(3D-DWT)将时空特征解耦为低频和高频分量,并通过高效的光谱特征融合来捕捉时间依赖性和局部空间特征。Spiking Self-Attention(SSA)则部署在深层,进一步提取全局时空信息。此外,论文设计了一个利用SNNs二进制特性的哈希层,通过整合多个时间步长的信息生成最终的哈希码。为了提升检索性能,论文还提出了一种新的动态软相似性损失函数,利用膜电位构建可学习的相似性矩阵作为软标签,充分捕捉类别间的相似性差异并补偿SNNs中的信息损失。实验结果表明,Spikinghash在多个数据集上实现了低能耗和少参数的最先进检索性能。

链接: https://arxiv.org/abs/2501.06786
作者: Zihao Mei,Jianhao Li,Bolin Zhang,Chong Wang,Lijun Guo,Guoqi Li,Jiangbo Qian
机构: Ningbo University (宁波大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: TPAMI under review. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:With the rapid growth of dynamic vision sensor (DVS) data, constructing a low-energy, efficient data retrieval system has become an urgent task. Hash learning is one of the most important retrieval technologies, which can keep the distance between hash codes consistent with the distance between DVS data. As spiking neural networks (SNNs) can encode information through spikes, they demonstrate great potential in promoting energy efficiency. Based on the binary characteristics of SNNs, we first propose a novel supervised hashing method named Spikinghash with a hierarchical lightweight structure. Spiking WaveMixer (SWM) is deployed in shallow layers, utilizing a multilevel 3D discrete wavelet transform (3D-DWT) to decouple spatiotemporal features into various low-frequency and high-frequency components, and then employing efficient spectral feature fusion. SWM can effectively capture the temporal dependencies and local spatial features. Spiking Self-Attention (SSA) is deployed in deeper layers to further extract global spatiotemporal information. We also design a hash layer utilizing the binary characteristic of SNNs, which integrates information over multiple time steps to generate final hash codes. Furthermore, we propose a new dynamic soft similarity loss for SNNs, which utilizes membrane potentials to construct a learnable similarity matrix as soft labels to fully capture the similarity differences between classes and compensate for information loss in SNNs, thereby improving retrieval performance. Experiments on multiple datasets demonstrate that Spikinghash can achieve state-of-the-art results with low energy consumption and fewer parameters.
zh

[CV-86] SuperNeRF-GAN: A Universal 3D-Consistent Super-Resolution Framework for Efficient and Enhanced 3D-Aware Image Synthesis

【速读】:该论文旨在解决基于神经辐射场(NeRF)的3D感知图像合成中高分辨率(HR)图像生成的高计算成本问题,同时保持3D一致性(3D-consistency)。现有方法通常通过2D超分辨率技术来提升图像分辨率,但这会牺牲3D一致性,或者采用辐射流形(radiance manifolds)或两阶段生成方法,但这些方法局限于特定任务,缺乏通用性。为解决这些问题,论文提出了SuperNeRF-GAN框架,其关键创新在于无缝集成NeRF基础的3D感知图像合成方法,能够在提升图像分辨率的同时保持3D一致性并降低计算成本。具体而言,该框架通过预训练的生成器生成NeRF表示(如tri-plane),进行体积渲染以获取低分辨率图像及其对应的深度和法线图,然后利用NeRF超分辨率模块学习网络以生成高分辨率NeRF,并通过深度引导渲染过程(Depth-Guided Rendering)实现边界校正的多深度图构建、法线引导的深度超分辨率和深度引导的NeRF渲染。实验结果表明,该方法在效率、3D一致性和图像质量方面均表现出色,消融研究也验证了各模块的有效性。

链接: https://arxiv.org/abs/2501.06770
作者: Peng Zheng,Linzhi Huang,Yizhou Yu,Yi Chang,Yilin Wang,Rui Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neural volume rendering techniques, such as NeRF, have revolutionized 3D-aware image synthesis by enabling the generation of images of a single scene or object from various camera poses. However, the high computational cost of NeRF presents challenges for synthesizing high-resolution (HR) images. Most existing methods address this issue by leveraging 2D super-resolution, which compromises 3D-consistency. Other methods propose radiance manifolds or two-stage generation to achieve 3D-consistent HR synthesis, yet they are limited to specific synthesis tasks, reducing their universality. To tackle these challenges, we propose SuperNeRF-GAN, a universal framework for 3D-consistent super-resolution. A key highlight of SuperNeRF-GAN is its seamless integration with NeRF-based 3D-aware image synthesis methods: it can simultaneously enhance the resolution of generated images while preserving 3D-consistency and reducing computational cost. Specifically, given a pre-trained generator capable of producing a NeRF representation such as a tri-plane, we first perform volume rendering to obtain a low-resolution image with corresponding depth and normal maps. Then, we employ a NeRF Super-Resolution module which learns a network to obtain a high-resolution NeRF. Next, we propose a novel Depth-Guided Rendering process which contains three simple yet effective steps, including the construction of a boundary-correct multi-depth map through depth aggregation, a normal-guided depth super-resolution and a depth-guided NeRF rendering. Experimental results demonstrate the superior efficiency, 3D-consistency, and quality of our approach. Additionally, ablation studies confirm the effectiveness of our proposed components.
zh

[CV-87] ODPG: Outfitting Diffusion with Pose Guided Condition

【速读】:该论文试图解决虚拟试衣(Virtual Try-On, VTON)技术在生成高真实感图像和处理动态姿态时面临的挑战。传统方法通常依赖于生成对抗网络(Generative Adversarial Networks, GANs)和扩散模型(Diffusion models),但在处理复杂姿态和细节纹理时效果有限。论文提出了一种名为“基于姿态引导条件的扩散模型”(Outfitting Diffusion with Pose Guided Condition, ODPG)的新方法,其关键创新在于利用潜在扩散模型(latent diffusion model)在去噪过程中引入多重条件输入。具体而言,ODPG将服装、姿态和外观图像转换为潜在特征,并在基于UNet的去噪模型中集成这些特征,从而实现了在动态姿态下非显式地合成服装。该方法通过端到端架构避免了显式的服装变形过程,并在FashionTryOn和DeepFashion数据集上展示了其在生成具有精细纹理细节的VTON图像方面的优越性。

链接: https://arxiv.org/abs/2501.06769
作者: Seohyun Lee,Jintae Park,Sanghyeok Park
机构: Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures. Preprint submitted to VISAPP 2025: the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications

点击查看摘要

Abstract:Virtual Try-On (VTON) technology allows users to visualize how clothes would look on them without physically trying them on, gaining traction with the rise of digitalization and online shopping. Traditional VTON methods, often using Generative Adversarial Networks (GANs) and Diffusion models, face challenges in achieving high realism and handling dynamic poses. This paper introduces Outfitting Diffusion with Pose Guided Condition (ODPG), a novel approach that leverages a latent diffusion model with multiple conditioning inputs during the denoising process. By transforming garment, pose, and appearance images into latent features and integrating these features in a UNet-based denoising model, ODPG achieves non-explicit synthesis of garments on dynamically posed human images. Our experiments on the FashionTryOn and a subset of the DeepFashion dataset demonstrate that ODPG generates realistic VTON images with fine-grained texture details across various poses, utilizing an end-to-end architecture without the need for explicit garment warping processes. Future work will focus on generating VTON outputs in video format and on applying our attention mechanism, as detailed in the Method section, to other domains with limited data.
zh

[CV-88] VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning AAAI2025

【速读】:该论文试图解决视频大语言模型(VideoLLMs)在细粒度时间理解任务中的不足,特别是密集视频描述(Dense Video Captioning, DVC)任务。DVC任务要求模型不仅描述视频中的所有事件,还要对这些事件进行时间定位,涉及视频分割、视频描述和时间视频定位等多个细粒度任务。现有的VideoLLMs通常以单一步骤解决DVC任务,未能充分利用其推理能力,且训练目标与评估指标不完全一致,导致监督信号与目标任务不对齐。

为解决这一问题,论文提出了一种名为VidChain的新框架,其核心包括任务链(Chain-of-Tasks, CoTasks)和基于度量的直接偏好优化(Metric-based Direct Preference Optimization, M-DPO)。CoTasks将复杂任务分解为一系列子任务,使VideoLLMs能够更有效地利用其推理能力。M-DPO则通过将模型与评估指标对齐,为每个任务提供与指标一致的细粒度监督。实验表明,VidChain在两种不同的VideoLLMs上均显著提升了细粒度视频理解能力,并在两个DVC基准测试和时间视频定位任务上优于现有方法。

链接: https://arxiv.org/abs/2501.06761
作者: Ji Soo Lee,Jongha Kim,Jeehye Na,Jinyoung Park,Hyunwoo J. Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2025

点击查看摘要

Abstract:Despite the advancements of Video Large Language Models (VideoLLMs) in various tasks, they struggle with fine-grained temporal understanding, such as Dense Video Captioning (DVC). DVC is a complicated task of describing all events within a video while also temporally localizing them, which integrates multiple fine-grained tasks, including video segmentation, video captioning, and temporal video grounding. Previous VideoLLMs attempt to solve DVC in a single step, failing to utilize their reasoning capability. Moreover, previous training objectives for VideoLLMs do not fully reflect the evaluation metrics, therefore not providing supervision directly aligned to target tasks. To address such a problem, we propose a novel framework named VidChain comprised of Chain-of-Tasks (CoTasks) and Metric-based Direct Preference Optimization (M-DPO). CoTasks decompose a complex task into a sequence of sub-tasks, allowing VideoLLMs to leverage their reasoning capabilities more effectively. M-DPO aligns a VideoLLM with evaluation metrics, providing fine-grained supervision to each task that is well-aligned with metrics. Applied to two different VideoLLMs, VidChain consistently improves their fine-grained video understanding, thereby outperforming previous VideoLLMs on two different DVC benchmarks and also on the temporal video grounding task. Code is available at \urlthis https URL.
zh
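
代码示意:M-DPO 建立在通用的 DPO 损失之上:给定指标得分更高的输出 y_w 与较差的 y_l,让策略相对参考模型更偏好前者。下面是该损失的标准形式草图,输入为策略与参考模型对两个输出的序列对数似然(M-DPO 如何用评测指标构造偏好对见论文,此处不涉及):

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO 损失的标准形式:提高策略相对参考模型对偏好输出 y_w 的相对对数似然。"""
    ratio_w = logp_w - ref_logp_w      # log[pi(y_w|x) / pi_ref(y_w|x)]
    ratio_l = logp_l - ref_logp_l      # log[pi(y_l|x) / pi_ref(y_l|x)]
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```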

[CV-89] Static Segmentation by Tracking: A Frustratingly Label-Efficient Approach to Fine-Grained Segmentation

【速读】:该论文旨在解决生物领域图像分割(image segmentation)中的精细任务,特别是从标本图像(如蝴蝶翅膀条纹或甲虫身体部位)中进行特征和部位分割。传统方法需要手动标注大量图像(通常每个物种数百张),并训练分割模型以将这些标注推广到其他图像,这一过程极其耗时。论文提出了一种名为静态分割跟踪(Static Segmentation by Tracking, SST)的标签高效方法。SST的关键在于利用同一物种标本图像中特征和部位的一致性,将标本图像拼接成“伪视频”,并将特征和部位分割问题重新定义为跟踪问题。具体而言,SST通过从“伪前序”图像中传播标注或预测的掩码,为未标注图像生成掩码。该方法基于最初为视频分割开发的Segment Anything Model 2 (SAM~2),能够在每个物种仅需一张标注图像的情况下实现高质量的特征和部位分割,显著提升了标本图像分析的效率。此外,论文还引入了循环一致性损失(cycle-consistent loss)来微调模型,进一步展示了SST在野外图像的单次实例分割和基于特征的图像检索中的广泛潜力。

链接: https://arxiv.org/abs/2501.06749
作者: Zhenyang Feng,Zihe Wang,Saul Ibaven Bueno,Tomasz Frelek,Advikaa Ramesh,Jingyan Bai,Lemeng Wang,Zanming Huang,Jianyang Gu,Jinsu Yoo,Tai-Yu Pan,Arpita Chowdhury,Michelle Ramirez,Elizabeth G. Campolongo,Matthew J. Thompson,Christopher G. Lawrence,Sydne Record,Neil Rosser,Anuj Karpatne,Daniel Rubenstein,Hilmar Lapp,Charles V. Stewart,Tanya Berger-Wolf,Yu Su,Wei-Lun Chao
机构: The Ohio State University(俄亥俄州立大学); Princeton University(普林斯顿大学); University of Maine(缅因大学); University of Miami(迈阿密大学); Virginia Tech(弗吉尼亚理工大学); Duke University(杜克大学); Rensselaer Polytechnic Institute(伦斯勒理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study image segmentation in the biological domain, particularly trait and part segmentation from specimen images (e.g., butterfly wing stripes or beetle body parts). This is a crucial, fine-grained task that aids in understanding the biology of organisms. The conventional approach involves hand-labeling masks, often for hundreds of images per species, and training a segmentation model to generalize these labels to other images, which can be exceedingly laborious. We present a label-efficient method named Static Segmentation by Tracking (SST). SST is built upon the insight: while specimens of the same species have inherent variations, the traits and parts we aim to segment show up consistently. This motivates us to concatenate specimen images into a "pseudo-video" and reframe trait and part segmentation as a tracking problem. Concretely, SST generates masks for unlabeled images by propagating annotated or predicted masks from the "pseudo-preceding" images. Powered by Segment Anything Model 2 (SAM 2), initially developed for video segmentation, we show that SST can achieve high-quality trait and part segmentation with merely one labeled image per species, a breakthrough for analyzing specimen images. We further develop a cycle-consistent loss to fine-tune the model, again using one labeled image. Additionally, we highlight the broader potential of SST, including one-shot instance segmentation on images taken in the wild and trait-based image retrieval.
zh

[CV-90] Diversified Augmentation with Domain Adaption for Debiased Video Temporal Grounding ICASSP2025

【速读】:该论文试图解决视频时序句子定位(Temporal Sentence Grounding in Videos, TSGV)任务中由于公共数据集存在显著的时间偏差(temporal biases)而导致的模型泛化能力不足的问题。这些时间偏差主要源于目标时刻(target moments)在时间分布上的不均匀性。现有方法通过生成增强视频来改变目标时刻的时间位置,但由于数据集中的视频长度变化较小,仅改变时间位置无法有效提升模型在不同长度视频上的泛化能力。

论文提出的解决方案包括两个关键点:首先,通过多样化的数据增强(diversified data augmentation)生成具有不同长度和目标时刻位置的视频,以丰富时间分布的多样性;其次,设计了一个域适应辅助任务(domain adaptation auxiliary task),通过域判别器(domain discriminator)减少原始视频和增强视频之间的特征差异,从而降低增强视频引入的噪声。此外,模型还被鼓励对具有相同文本查询但不同时刻位置的视频生成不同的预测,以促进去偏差训练(debiased training)。实验结果表明,该方法在Charades-CD和ActivityNet-CD数据集上表现出优异的泛化能力,并在多种定位结构中取得了最先进的结果。
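
域判别器配合梯度反转层(Gradient Reversal Layer, GRL)是实现此类域适应辅助任务的常见做法;以下为基于该思路的 PyTorch 示意(假设性实现,维度与网络结构均为演示用设定,非论文官方代码)。

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """梯度反转层:前向恒等,反向将梯度乘以 -lambda,实现对抗式域对齐。"""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class DomainDiscriminator(nn.Module):
    """判别特征来自原始视频(0)还是增强视频(1)。"""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, feat, lamb=1.0):
        return self.net(GradReverse.apply(feat, lamb))

# 用法示意:判别器学习区分域,而 GRL 使特征提取器反向学习“混淆”两个域
disc = DomainDiscriminator()
feat = torch.randn(8, 256, requires_grad=True)   # 视频级特征
domain_label = torch.randint(0, 2, (8,))         # 0=原始, 1=增强
loss = nn.CrossEntropyLoss()(disc(feat, lamb=0.5), domain_label)
loss.backward()
```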

链接: https://arxiv.org/abs/2501.06746
作者: Junlong Ren,Gangjian Zhang,Haifeng Sun,Hao Wang
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Temporal sentence grounding in videos (TSGV) faces challenges due to public TSGV datasets containing significant temporal biases, which are attributed to the uneven temporal distributions of target moments. Existing methods generate augmented videos, where target moments are forced to have varying temporal locations. However, since the video lengths of the given datasets have small variations, only changing the temporal locations results in poor generalization ability in videos with varying lengths. In this paper, we propose a novel training framework complemented by diversified data augmentation and a domain discriminator. The data augmentation generates videos with various lengths and target moment locations to diversify temporal distributions. However, augmented videos inevitably exhibit distinct feature distributions which may introduce noise. To address this, we design a domain adaptation auxiliary task to diminish feature discrepancies between original and augmented videos. We also encourage the model to produce distinct predictions for videos with the same text queries but different moment locations to promote debiased training. Experiments on Charades-CD and ActivityNet-CD datasets demonstrate the effectiveness and generalization abilities of our method in multiple grounding structures, achieving state-of-the-art results.
zh

[CV-91] Rice Leaf Disease Detection: A Comparative Study Between CNN Transformer and Non-neural Network Architectures

【速读】:该论文旨在解决孟加拉国水稻叶片病害的早期识别和分类问题,这对于防止病害传播、减少对作物产量和质量的影响至关重要。论文通过比较多种计算机视觉技术,包括卷积神经网络(CNN)和视觉变换器(ViT),以及传统的机器学习方法如支持向量机(SVM),来探索最佳的病害检测方法。关键解决方案是利用迁移学习(transfer learning)来提高模型在少量训练数据下的泛化能力。实验结果表明,ResNet50模型在CNN和基于变换器的模型中表现最佳,成为该任务的最优选择。
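
迁移学习的常见做法是加载 ImageNet 预训练权重、替换分类头并(可选地)冻结骨干网络;以下为 torchvision 的最小示意(类别数等参数为演示用假设,非论文原始配置)。

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 3  # 假设的病害类别数,仅作演示

# 加载 ImageNet 预训练的 ResNet50 并替换全连接分类头(需 torchvision >= 0.13)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# 可选:冻结骨干,只训练新分类头,以适应小规模训练数据
for name, p in model.named_parameters():
    if not name.startswith("fc"):
        p.requires_grad = False

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 3, 224, 224)          # 一个批次的叶片图像
y = torch.randint(0, num_classes, (4,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```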

链接: https://arxiv.org/abs/2501.06740
作者: Samia Mehnaz,Md. Touhidul Islam
机构: Viqarunnisa Noon College, Dhaka, Bangladesh(维卡伦尼萨正午学院, 达卡, 孟加拉国); Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh(电气与电子工程系, 孟加拉国工程技术大学, 达卡, 孟加拉国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 6 figures

点击查看摘要

Abstract:In nations such as Bangladesh, agriculture plays a vital role in providing livelihoods for a significant portion of the population. Identifying and classifying plant diseases early is critical to prevent their spread and minimize their impact on crop yield and quality. Various computer vision techniques can be used for such detection and classification. While CNNs have long dominated such image classification tasks, vision transformers have recently become equally capable. In this paper, we study various computer vision techniques for Bangladeshi rice leaf disease detection. We use Dhan-Shomadhan, a Bangladeshi rice leaf disease dataset, to experiment with various CNN and ViT models. We also compare the performance of these deep neural network architectures with a traditional machine learning method, the Support Vector Machine (SVM). We leverage transfer learning for better generalization with a smaller amount of training data. Among the models tested, ResNet50 exhibited the best performance over the other CNN and transformer-based models, making it the optimal choice for this task.
zh

[CV-92] Multi-Label Scene Classification in Remote Sensing Benefits from Image Super-Resolution

【速读】:该论文试图解决卫星图像在遥感(Remote Sensing, RS)应用中的空间分辨率限制问题,特别是在多标签场景分类任务中,由于需要更高层次的细节和特征区分,低分辨率图像往往限制了分类精度。论文的核心解决方案是通过图像超分辨率(Super-Resolution, SR)作为预处理步骤,提升卫星图像的质量,从而改善下游分类任务的性能。研究评估了四种SR模型(SRResNet、HAT、SeeSR和RealESRGAN)在不同卷积神经网络(CNN)架构(如ResNet-50、ResNet-101、ResNet-152和Inception-v4)上的效果。结果表明,SR技术的应用显著提高了多标签场景分类的性能,尤其是在保留对多标签任务至关重要的空间细节方面。该研究为遥感领域中选择合适的SR技术提供了有价值的见解,并提出了一种易于集成的框架,以改进现有的遥感系统。
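
该框架本质上是“先超分、后分类”的两段式流水线;以下示意中用双三次插值充当 SR 模型的占位,并以 BCE 损失接入多标签分类(标签数等均为演示用假设,非论文原始实现)。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

def super_resolve(x, scale=4):
    """占位 SR:真实流程中可替换为 SRResNet / HAT / SeeSR / RealESRGAN。"""
    return F.interpolate(x, scale_factor=scale, mode="bicubic", align_corners=False)

num_labels = 17  # 假设的遥感场景标签数,仅作演示
clf = models.resnet50(weights=None)
clf.fc = nn.Linear(clf.fc.in_features, num_labels)

lr_img = torch.randn(2, 3, 64, 64)   # 低分辨率卫星图像
hr_img = super_resolve(lr_img)       # 超分到 2 x 3 x 256 x 256 再送入分类器
logits = clf(hr_img)
target = torch.randint(0, 2, (2, num_labels)).float()
loss = F.binary_cross_entropy_with_logits(logits, target)  # 多标签损失
```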

链接: https://arxiv.org/abs/2501.06720
作者: Ashitha Mudraje,Brian B. Moser,Stanislav Frolov,Andreas Dengel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Satellite imagery is a cornerstone for numerous Remote Sensing (RS) applications; however, limited spatial resolution frequently hinders the precision of such systems, especially in multi-label scene classification tasks as it requires a higher level of detail and feature differentiation. In this study, we explore the efficacy of image Super-Resolution (SR) as a pre-processing step to enhance the quality of satellite images and thus improve downstream classification performance. We investigate four SR models - SRResNet, HAT, SeeSR, and RealESRGAN - and evaluate their impact on multi-label scene classification across various CNN architectures, including ResNet-50, ResNet-101, ResNet-152, and Inception-v4. Our results show that applying SR significantly improves downstream classification performance across various metrics, demonstrating its ability to preserve spatial details critical for multi-label tasks. Overall, this work offers valuable insights into the selection of SR techniques for multi-label prediction in remote sensing and presents an easy-to-integrate framework to improve existing RS systems.
zh

[CV-93] F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet with Cycle-Consistent Gaussian Splatting

【速读】:该论文旨在解决从单目数据集(如ImageNet)中进行可泛化的3D感知生成的问题。核心挑战在于如何在不依赖多视角或动态数据的情况下,学习到鲁棒的3D感知表示,并确保在不同视角下纹理和几何的一致性。尽管现有的一些基线方法能够进行3D感知生成,但其生成的图像质量仍落后于最先进的2D生成方法。为解决这一局限性,论文提出了一种基于像素对齐高斯溅射(pixel-aligned Gaussian Splatting)的前馈管道,称为F3D-Gaus。该方案通过自监督的循环一致性约束来增强跨视角一致性,并通过视频模型先验进行几何感知的细化,从而在宽视角场景中生成更精细的细节,并提升模型捕捉复杂3D纹理的能力。实验表明,该方法不仅能够从单目数据集中生成高质量、多视角一致的3D感知图像,还显著提高了训练和推理效率。

链接: https://arxiv.org/abs/2501.06714
作者: Yuxin Wang,Qianyi Wu,Dan Xu
机构: HKUST(香港科技大学); Monash University(莫纳什大学); HKUST(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:This paper tackles the problem of generalizable 3D-aware generation from monocular datasets, e.g., ImageNet. The key challenge of this task is learning a robust 3D-aware representation without multi-view or dynamic data, while ensuring consistent texture and geometry across different viewpoints. Although some baseline methods are capable of 3D-aware generation, the quality of the generated images still lags behind state-of-the-art 2D generation approaches, which excel in producing high-quality, detailed images. To address this severe limitation, we propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting, coined as F3D-Gaus, which can produce more realistic and reliable 3D renderings from monocular inputs. In addition, we introduce a self-supervised cycle-consistent constraint to enforce cross-view consistency in the learned 3D representation. This training strategy naturally allows aggregation of multiple aligned Gaussian primitives and significantly alleviates the interpolation limitations inherent in single-view pixel-aligned Gaussian Splatting. Furthermore, we incorporate video model priors to perform geometry-aware refinement, enhancing the generation of fine details in wide-viewpoint scenarios and improving the model’s capability to capture intricate 3D textures. Extensive experiments demonstrate that our approach not only achieves high-quality, multi-view consistent 3D-aware generation from monocular datasets, but also significantly improves training and inference efficiency.
zh

[CV-94] Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints AAAI2025

【速读】:该论文旨在解决多任务视觉定位(Multi-task Visual Grounding)中存在的两个主要问题:一是引用表达式理解(Referring Expression Comprehension, REC)和引用图像分割(Referring Image Segmentation, RIS)之间的模糊性,导致多任务预测结果不一致;二是多模态理解不足,导致目标感知偏差。为解决这些问题,论文提出了一种从粗到细的一致性约束视觉定位架构(Coarse-to-fine Consistency Constraints Visual Grounding, C^3VG)。该架构的关键在于其两阶段框架:首先通过查询解码器和像素解码器生成初步的检测和分割输出(即粗糙语义感知阶段,Rough Semantic Perception, RSP),然后通过掩码引导交互模块(Mask-guided Interaction Module, MIM)和双向一致性约束损失(bidirectional consistency constraint loss)对粗预测进行细化(即精细化一致性交互阶段,Refined Consistency Interaction, RCI),以确保任务间的一致性表示。此外,论文还利用基于视觉-语言融合表示的预训练模型来增强多模态理解能力。实验结果表明,C^3VG在RefCOCO、RefCOCO+和RefCOCOg数据集上显著优于现有的REC和RIS方法。
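
双向一致性约束的一种直观形式,是让 REC 的检测框与 RIS 的分割掩码互相监督;以下给出一个示意性损失(非论文原始公式):将预测框栅格化为掩码后,与分割预测计算两个方向的 Dice 一致性。注意此处的栅格化对框坐标不可导,仅作概念演示,实际实现需用可导近似。

```python
import torch

def box_to_mask(box, h, w):
    """将归一化框 (x1, y1, x2, y2) 栅格化为二值掩码(不可导,仅作演示)。"""
    mask = torch.zeros(h, w)
    x1, y1, x2, y2 = (box * torch.tensor([w, h, w, h])).long()
    mask[y1:y2, x1:x2] = 1.0
    return mask

def dice(a, b, eps=1e-6):
    inter = (a * b).sum()
    return (2 * inter + eps) / (a.sum() + b.sum() + eps)

def bidirectional_consistency_loss(pred_box, pred_mask):
    """示意:检测→分割与分割→检测两个方向的一致性之和。"""
    h, w = pred_mask.shape
    box_mask = box_to_mask(pred_box, h, w)
    seg_fg = (pred_mask > 0.5).float()
    loss_det2seg = 1 - dice(box_mask, pred_mask)  # 框约束分割
    loss_seg2det = 1 - dice(seg_fg, box_mask)     # 分割约束框
    return loss_det2seg + loss_seg2det

loss = bidirectional_consistency_loss(
    torch.tensor([0.2, 0.2, 0.6, 0.7]), torch.rand(64, 64))
print(loss.item())
```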

链接: https://arxiv.org/abs/2501.06710
作者: Ming Dai,Jian Li,Jiedong Zhuang,Xian Zhang,Wankou Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI2025

点击查看摘要

Abstract:Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture (C^3VG), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are subsequently refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of C^3VG, which significantly outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at this https URL.
zh

[CV-95] Mamba-MOC: A Multicategory Remote Object Counting via State Space Model

【速读】:该论文旨在解决多类别遥感目标计数(Multicategory remote object counting)问题,即在遥感图像中准确估计不同类别目标的数量。现有方法主要依赖于卷积神经网络(CNNs)和Transformer,但CNN难以捕捉全局依赖关系,而Transformer计算复杂度高,限制了其在遥感应用中的有效性。为此,论文提出了Mamba-MOC,一种基于Mamba的网络,首次将Mamba应用于遥感目标计数。该方案的关键在于:1)提出了跨尺度交互模块(cross-scale interaction module),以促进层次特征的深度融合;2)设计了上下文状态空间模型(context state space model),在扫描过程中同时捕捉全局和局部上下文信息,并提供局部邻域信息。实验结果表明,该方法在大规模实际场景中相比主流计数算法达到了最先进的性能。
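
跨尺度交互的常见实现是将深层低分辨率特征上采样后与浅层高分辨率特征交互融合;以下为与该思路同类的简化 PyTorch 模块(示意性实现,非论文官方代码)。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusion(nn.Module):
    """把低分辨率深层特征上采样,与高分辨率浅层特征逐元素交互后卷积融合。"""
    def __init__(self, c_low, c_high, c_out):
        super().__init__()
        self.align = nn.Conv2d(c_low, c_high, 1)              # 通道对齐
        self.fuse = nn.Conv2d(c_high * 2, c_out, 3, padding=1)

    def forward(self, f_low, f_high):
        up = F.interpolate(self.align(f_low), size=f_high.shape[-2:],
                           mode="bilinear", align_corners=False)
        # 逐元素交互(up * f_high)与原特征拼接后融合
        return self.fuse(torch.cat([up * f_high, f_high], dim=1))

m = CrossScaleFusion(512, 256, 256)
out = m(torch.randn(1, 512, 16, 16), torch.randn(1, 256, 32, 32))
print(out.shape)  # torch.Size([1, 256, 32, 32])
```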

链接: https://arxiv.org/abs/2501.06697
作者: Peng Liu,Sen Lei,Heng-Chao Li
机构: Southwest Jiaotong University (西南交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multicategory remote object counting is a fundamental task in computer vision, aimed at accurately estimating the number of objects of various categories in remote images. Existing methods rely on CNNs and Transformers, but CNNs struggle to capture global dependencies, and Transformers are computationally expensive, which limits their effectiveness in remote applications. Recently, Mamba has emerged as a promising solution in the field of computer vision, offering a linear complexity for modeling global dependencies. To this end, we propose Mamba-MOC, a mamba-based network designed for multi-category remote object counting, which represents the first application of Mamba to remote sensing object counting. Specifically, we propose a cross-scale interaction module to facilitate the deep integration of hierarchical features. Then we design a context state space model to capture both global and local contextual information and provide local neighborhood information during the scan process. Experimental results in large-scale realistic scenarios demonstrate that our proposed method achieves state-of-the-art performance compared with some mainstream counting algorithms.
zh

[CV-96] Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation

【速读】:该论文旨在解决机器人学习中的“仿真到现实差距”(sim-to-real gap)问题,即通过仿真环境训练的模型难以直接应用于现实世界。传统的解决方案如领域随机化(domain randomization)和系统辨识(system identification)受限于仿真和图形引擎的固有约束。本文提出了一种名为Vid2Sim的新框架,通过一个可扩展且成本效益高的“现实到仿真”(real2sim)管道,实现了神经三维场景重建和仿真,从而有效弥合了这一差距。Vid2Sim的核心创新在于能够从单目视频输入生成具有物理交互能力的逼真三维仿真环境,使得视觉导航智能体能够在复杂的城市环境中进行强化学习。实验结果表明,与现有仿真方法训练的智能体相比,Vid2Sim在数字孪生和现实世界中的城市导航成功率分别提高了31.2%和68.3%。

链接: https://arxiv.org/abs/2501.06693
作者: Ziyang Xie,Zhizheng Liu,Zhenghao Peng,Wayne Wu,Bolei Zhou
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); University of California, Los Angeles(加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this https URL

点击查看摘要

Abstract:Sim-to-real gap has long posed a significant challenge for robot learning in simulation, preventing the deployment of learned models in the real world. Previous work has primarily focused on domain randomization and system identification to mitigate this gap. However, these methods are often limited by the inherent constraints of the simulation and graphics engines. In this work, we propose Vid2Sim, a novel framework that effectively bridges the sim2real gap through a scalable and cost-efficient real2sim pipeline for neural 3D scene reconstruction and simulation. Given a monocular video as input, Vid2Sim can generate photorealistic and physically interactable 3D simulation environments to enable the reinforcement learning of visual navigation agents in complex urban environments. Extensive experiments demonstrate that Vid2Sim significantly improves the performance of urban navigation in the digital twins and real world by 31.2% and 68.3% in success rate compared with agents trained with prior simulation methods.
zh

[CV-97] PGP-SAM: Prototype-Guided Prompt Learning for Efficient Few-Shot Medical Image Segmentation

【速读】:该论文旨在解决在医学图像分割中,Segment Anything Model (SAM) 需要大量像素级标注和精确的点或框提示设计的问题。为了解决这些挑战,作者提出了PGP-SAM,一种基于原型(prototype)的少样本调优方法,利用有限样本替代繁琐的手动提示。其关键解决方案包括两个主要组件:(1) 一个即插即用的上下文调制模块,用于整合多尺度信息;(2) 一个类引导的交叉注意力机制,用于融合原型和特征以自动生成提示。实验结果表明,PGP-SAM在仅使用10%的2D切片的情况下,在公共多器官数据集和私有心室数据集上均优于现有的无提示SAM变体,表现出更高的平均Dice分数。
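
类引导的交叉注意力可以理解为:以类别原型作 query、图像特征作 key/value,其输出直接充当 SAM 解码器的提示嵌入;以下为维度假设下的示意实现(非官方代码)。

```python
import torch
import torch.nn as nn

class PrototypePromptGenerator(nn.Module):
    """原型(query)与图像特征(key/value)做交叉注意力,生成提示嵌入。"""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, prototypes, img_feat):
        # prototypes: (B, 类别数, D);img_feat: (B, HW, D)
        prompts, _ = self.attn(prototypes, img_feat, img_feat)
        return prompts  # 可作为 SAM 解码器的稀疏提示嵌入

gen = PrototypePromptGenerator()
prompts = gen(torch.randn(2, 4, 256), torch.randn(2, 1024, 256))
print(prompts.shape)  # torch.Size([2, 4, 256])
```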

链接: https://arxiv.org/abs/2501.06692
作者: Zhonghao Yan,Zijin Yin,Tianyu Lin,Xiangzhu Zeng,Kongming Liang,Zhanyu Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures, Accepted at ISBI 2025

点击查看摘要

Abstract:The Segment Anything Model (SAM) has demonstrated strong and versatile segmentation capabilities, along with intuitive prompt-based interactions. However, customizing SAM for medical image segmentation requires massive amounts of pixel-level annotations and precise point- or box-based prompt designs. To address these challenges, we introduce PGP-SAM, a novel prototype-based few-shot tuning approach that uses limited samples to replace tedious manual prompts. Our key idea is to leverage inter- and intra-class prototypes to capture class-specific knowledge and relationships. We propose two main components: (1) a plug-and-play contextual modulation module that integrates multi-scale information, and (2) a class-guided cross-attention mechanism that fuses prototypes and features for automatic prompt generation. Experiments on a public multi-organ dataset and a private ventricle dataset demonstrate that PGP-SAM achieves superior mean Dice scores compared with existing prompt-free SAM variants, while using only 10% of the 2D slices.
zh

[CV-98] Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving

【速读】:该论文试图解决自动驾驶(Autonomous Driving, AD)中语义理解与下游决策的挑战,特别是在行人行为理解和交互处理方面。尽管近年来自动驾驶在3D检测、分类和定位方面取得了显著进展,但在复杂场景下的语义理解和高级规划仍存在困难。论文提出通过知识蒸馏(Knowledge Distillation)将大型语言模型(Large Language Models, LLM)和视觉-语言模型(Vision-Language Models, VLM)的语义标签有效迁移到更小的视觉网络中,从而在减少计算和内存资源需求的同时,实现复杂场景的语义表示,以支持下游的规划和控制决策。解决方案的关键在于通过知识蒸馏技术,将大规模模型的语义理解能力压缩到轻量级网络中,以适用于车载计算环境。
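
语义标签的知识蒸馏通常采用带温度的 KL 散度,让小型视觉网络拟合大模型(教师)输出的软标签;以下为经典 Hinton 蒸馏损失的示意(非论文专有实现,温度与权重均为演示用假设)。

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=4.0, alpha=0.7):
    """alpha 加权的软标签 KL 散度 + 硬标签交叉熵(经典蒸馏目标)。"""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # 温度平方补偿梯度尺度
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 10)   # 学生(小型视觉网络)输出
t = torch.randn(8, 10)   # 教师(VLM 语义头)输出
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y).item())
```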

链接: https://arxiv.org/abs/2501.06680
作者: Haoxiang Gao,Yu Zhao
机构: Motional AD LLC; University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Autonomous driving (AD) has experienced significant improvements in recent years and achieved promising 3D detection, classification, and localization results. However, many challenges remain, e.g., semantic understanding of pedestrians' behaviors and downstream handling of pedestrian interactions. Recent studies in applications of Large Language Models (LLM) and Vision-Language Models (VLM) have achieved promising results in scene understanding and high-level maneuver planning in diverse traffic scenarios. However, deploying the billion-parameter LLMs to vehicles requires significant computation and memory resources. In this paper, we analyze effective knowledge distillation of semantic labels to smaller vision networks, which can be used to build semantic representations of complex scenes for downstream planning and control decisions.
zh

[CV-99] Imbalanced Medical Image Segmentation with Pixel-dependent Noisy Labels

【速读】:该论文旨在解决医学图像分割中由于标注噪声(noisy labels)和类别不平衡(class imbalance)导致的性能下降问题。现有的方法通常假设噪声标签是类别依赖的(class-dependent),而忽略了大多数噪声标签实际上是像素依赖的(pixel-dependent)特性。此外,现有方法通常使用固定阈值来过滤噪声标签,这可能导致少数类别的样本被错误移除,从而进一步降低分割性能。为解决这些问题,论文提出了一个名为“协作学习与课程选择”(Collaborative Learning with Curriculum Selection, CLCS)的框架。该框架的关键创新点包括:1)将噪声标签视为像素依赖的,并通过协作学习框架进行处理;2)采用课程动态阈值(curriculum dynamic thresholding)方法,根据模型的学习进度动态选择干净样本,以缓解类别不平衡问题;3)引入噪声平衡损失(Noise Balance Loss, NBL),充分利用噪声样本而非直接丢弃,从而提高数据利用率。具体而言,CLCS框架包含两个模块:课程噪声标签样本选择(Curriculum Noisy Label Sample Selection, CNS)和噪声平衡损失(Noise Balance Loss, NBL)。CNS模块通过设计一个双分支网络并引入差异损失(discrepancy loss)进行协作学习,从不同视角提取同一实例的特征表示,并通过概率投票选择干净样本。NBL模块则通过鲁棒损失函数利用噪声样本提升模型性能。
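
课程动态阈值可随训练进度调整干净样本的筛选标准,噪声样本则交由鲁棒损失利用而非丢弃;以下示意中以线性退火阈值和广义交叉熵(GCE)作为 CNS/NBL 的简化替身(假设性实现,非论文原始损失)。

```python
import torch
import torch.nn.functional as F

def dynamic_threshold(epoch, max_epoch, t0=0.9, t1=0.5):
    """课程式阈值:训练早期严格(t0),随进度线性放宽到 t1。"""
    return t0 + (t1 - t0) * min(epoch / max_epoch, 1.0)

def gce_loss(probs, labels, q=0.7):
    """广义交叉熵(GCE):对噪声标签更鲁棒,此处作为噪声平衡损失的示意替身。"""
    p_y = probs.gather(1, labels.unsqueeze(1)).squeeze(1).clamp_min(1e-6)
    return ((1 - p_y.pow(q)) / q).mean()

logits = torch.randn(16, 4)             # 16 个像素、4 个类别
labels = torch.randint(0, 4, (16,))
probs = logits.softmax(dim=-1)

tau = dynamic_threshold(epoch=10, max_epoch=100)
clean = probs.max(dim=-1).values > tau  # 按置信度筛选干净像素
loss_clean = (F.cross_entropy(logits[clean], labels[clean])
              if clean.any() else torch.tensor(0.0))
loss_noisy = (gce_loss(probs[~clean], labels[~clean])
              if (~clean).any() else torch.tensor(0.0))
total_loss = loss_clean + loss_noisy
```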

链接: https://arxiv.org/abs/2501.06678
作者: Erjian Guo,Zicheng Wang,Zhen Zhao,Luping Zhou
机构: School of Electrical and Computer Engineering, University of Sydney, Australia (悉尼大学电气与计算机工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate medical image segmentation is often hindered by noisy labels in training data, due to the challenges of annotating medical images. Prior research works addressing noisy labels tend to make class-dependent assumptions, overlooking the pixel-dependent nature of most noisy labels. Furthermore, existing methods typically apply fixed thresholds to filter out noisy labels, risking the removal of minority classes and consequently degrading segmentation performance. To bridge these gaps, our proposed framework, Collaborative Learning with Curriculum Selection (CLCS), addresses pixel-dependent noisy labels with class imbalance. CLCS advances the existing works by i) treating noisy labels as pixel-dependent and addressing them through a collaborative learning framework, and ii) employing a curriculum dynamic thresholding approach adapting to model learning progress to select clean data samples to mitigate the class imbalance issue, and iii) applying a noise balance loss to noisy data samples to improve data utilization instead of discarding them outright. Specifically, our CLCS contains two modules: Curriculum Noisy Label Sample Selection (CNS) and Noise Balance Loss (NBL). In the CNS module, we designed a two-branch network with discrepancy loss for collaborative learning so that different feature representations of the same instance could be extracted from distinct views and used to vote the class probabilities of pixels. Besides, a curriculum dynamic threshold is adopted to select clean-label samples through probability voting. In the NBL module, instead of directly dropping the suspiciously noisy labels, we further adopt a robust loss to leverage such instances to boost the performance.
zh

[CV-100] MapGS: Generalizable Pretraining and Data Augmentation for Online Mapping via Novel View Synthesis

【速读】:该论文试图解决自动驾驶车辆在跨传感器配置(cross-sensor configuration)下的泛化问题。具体来说,现有在线地图生成方法在部署到具有不同相机内参(intrinsics)和外参(extrinsics)的车辆时,性能会显著下降。论文提出了一种基于高斯溅射(Gaussian splatting)的新框架,通过重建场景并在目标传感器配置下渲染相机图像,生成目标配置的传感器数据及其对应的标签,用于训练在线地图生成模型。该框架在nuScenes和Argoverse 2数据集上实现了18%的性能提升,并通过有效的数据增强(dataset augmentation)实现了更快的收敛和高效的训练,仅使用25%的原始训练数据即可超越现有最先进方法。这一方案的关键在于利用高斯溅射技术生成目标传感器配置下的数据,从而减少对繁琐数据标注的依赖,并实现数据的重用。

链接: https://arxiv.org/abs/2501.06660
作者: Hengyuan Zhang,David Paz,Yuliang Guo,Xinyu Huang,Henrik I. Christensen,Liu Ren
机构: Contextual Robotics Institute, UC San Diego (加州大学圣地亚哥分校上下文机器人研究所); Bosch North America (博世北美)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Online mapping reduces the reliance of autonomous vehicles on high-definition (HD) maps, significantly enhancing scalability. However, recent advancements often overlook cross-sensor configuration generalization, leading to performance degradation when models are deployed on vehicles with different camera intrinsics and extrinsics. With the rapid evolution of novel view synthesis methods, we investigate the extent to which these techniques can be leveraged to address the sensor configuration generalization challenge. We propose a novel framework leveraging Gaussian splatting to reconstruct scenes and render camera images in target sensor configurations. The target config sensor data, along with labels mapped to the target config, are used to train online mapping models. Our proposed framework on the nuScenes and Argoverse 2 datasets demonstrates a performance improvement of 18% through effective dataset augmentation, achieves faster convergence and efficient training, and exceeds state-of-the-art performance when using only 25% of the original training data. This enables data reuse and reduces the need for laborious data labeling. Project page at this https URL.
zh

[CV-101] TWIX: Automatically Reconstructing Structured Data from Templatized Documents

【速读】:该论文试图解决从模板化文档(templatized documents)中高效、准确地提取数据的问题。模板化文档是通过在视觉模板中填充字段以程序化方式生成的文档。当前的数据提取工具在处理复杂文档布局时表现不佳,且在大规模数据集上存在高延迟和高成本的问题,通常还需要大量人工干预来提取表格或用户指定字段的值。论文提出的解决方案的关键在于预测生成这些文档的底层模板,通过建模文档之间的视觉和结构共性来实现。基于预测模板的数据提取方法提供了一个更加原则化、准确且高效的解决方案,且成本较低。实验结果表明,揭示模板对于从模板化文档中提取数据至关重要,TWIX工具在34个多样化真实世界数据集上的平均精度和召回率超过90%,显著优于行业工具如Textract和Azure Document Intelligence,以及基于视觉的大语言模型(如GPT-4-Vision),并且在处理大规模数据集时具有显著的速度和成本优势。

链接: https://arxiv.org/abs/2501.06659
作者: Yiming Lin,Mawil Hasan,Rohan Kosalge,Alvin Cheung,Aditya G. Parameswaran
机构: UC Berkeley(加州大学伯克利分校)
类目: Databases (cs.DB); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Many documents, that we call templatized documents, are programmatically generated by populating fields in a visual template. Effective data extraction from these documents is crucial to supporting downstream analytical tasks. Current data extraction tools often struggle with complex document layouts, incur high latency and/or cost on large datasets, and often require significant human effort, when extracting tables or values given user-specified fields from documents. The key insight of our tool, TWIX, is to predict the underlying template used to create such documents, modeling the visual and structural commonalities across documents. Data extraction based on this predicted template provides a more principled, accurate, and efficient solution at a low cost. Comprehensive evaluations on 34 diverse real-world datasets show that uncovering the template is crucial for data extraction from templatized documents. TWIX achieves over 90% precision and recall on average, outperforming tools from industry: Textract and Azure Document Intelligence, and vision-based LLMs like GPT-4-Vision, by over 25% in precision and recall. TWIX scales easily to large datasets and is 734X faster and 5836X cheaper than vision-based LLMs for extracting data from a large document collection with 817 pages.
zh

[CV-102] Personalized Preference Fine-tuning of Diffusion Models

【速读】:该论文试图解决现有基于RLHF(Reinforcement Learning from Human Feedback)技术(如DPO)在文本到图像生成扩散模型(text-to-image diffusion models)中存在的个性化不足问题。现有方法通常优化单一奖励函数,以对齐模型生成结果与群体偏好,但忽略了用户个体在信念或价值观上的细微差异,导致模型在个性化生成方面的效果受限。为解决这一问题,论文提出了PPD(Personalized Preference Diffusion)方法,其核心在于通过多奖励优化目标,使扩散模型能够对齐个性化偏好。具体而言,PPD方法的关键在于:(1)利用视觉语言模型(Vision-Language Model, VLM)从少量成对偏好示例中提取个性化偏好嵌入(personal preference embeddings);(2)通过交叉注意力机制将这些嵌入整合到扩散模型中。在用户嵌入的条件下,模型通过DPO目标进行微调,同时优化对齐多个用户的偏好。实验结果表明,该方法能够有效优化多个奖励函数,并在推理过程中实现奖励函数之间的插值。在实际用户场景中,仅需新用户的四个偏好示例,PPD方法就能以76%的平均胜率优于Stable Cascade,生成更符合特定用户偏好的图像。
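
PPD 的训练目标建立在 DPO 之上;以下给出标准 DPO 偏好损失的示意(用户嵌入的条件化发生在模型前向内部,此处以对数似然抽象表示;假设性实现,非官方代码)。对扩散模型而言,logp 通常由去噪误差的负值近似。

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO:最大化被偏好样本 (w) 相对非偏好样本 (l) 的隐式奖励差。
    logp_* 为当前模型(已条件于用户嵌入)与参考模型的对数似然。"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# 用法示意:一个批次 8 对(偏好, 非偏好)样本
logp_w, logp_l = torch.randn(8), torch.randn(8)
ref_w, ref_l = torch.randn(8), torch.randn(8)
print(dpo_loss(logp_w, logp_l, ref_w, ref_l).item())
```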

链接: https://arxiv.org/abs/2501.06655
作者: Meihua Dang,Anikait Singh,Linqi Zhou,Stefano Ermon,Jiaming Song
机构: Stanford University(斯坦福大学); Luma AI
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:RLHF techniques like DPO can significantly improve the generation quality of text-to-image diffusion models. However, these methods optimize for a single reward that aligns model generation with population-level preferences, neglecting the nuances of individual users’ beliefs or values. This lack of personalization limits the efficacy of these models. To bridge this gap, we introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users. Specifically, our approach (1) leverages a vision-language model (VLM) to extract personal preference embeddings from a small set of pairwise preference examples, and then (2) incorporates the embeddings into diffusion models through cross attention. Conditioning on user embeddings, the text-to-image models are fine-tuned with the DPO objective, simultaneously optimizing for alignment with the preferences of multiple users. Empirical results demonstrate that our method effectively optimizes for multiple reward functions and can interpolate between them during inference. In real-world user scenarios, with as few as four preference examples from a new user, our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
zh

[CV-103] Parking Space Detection in the City of Granada

【速读】:该论文旨在解决城市区域停车位检测的挑战,特别是针对格拉纳达市。研究通过利用航空影像,开发并应用语义分割(Semantic Segmentation)技术,准确识别停放车辆、移动车辆和道路。解决方案的关键在于创建了一个专属于格拉纳达的私有数据集,该数据集在训练神经网络模型中起到了关键作用。研究采用了全卷积网络(Fully Convolutional Networks)、金字塔网络(Pyramid Networks)和扩张卷积(Dilated Convolutions)等技术,展示了这些方法在城市语义分割中的有效性。通过对比分析和优化多种模型(如Dynamic U-Net、PSPNet和DeepLabV3+),研究结果表明DeepLabV3+在性能上表现最为突出。该研究为城市规划和交通管理领域提供了通过先进图像处理技术高效利用停车空间的见解。

链接: https://arxiv.org/abs/2501.06651
作者: Crespo-Orti Luis,Moreno-Cuadrado Isabel,Olivares-Martínez Pablo,Sanz-Tornero Ximo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:This paper addresses the challenge of parking space detection in urban areas, focusing on the city of Granada. Utilizing aerial imagery, we develop and apply semantic segmentation techniques to accurately identify parked cars, moving cars and roads. A significant aspect of our research is the creation of a proprietary dataset specific to Granada, which is instrumental in training our neural network model. We employ Fully Convolutional Networks, Pyramid Networks and Dilated Convolutions, demonstrating their effectiveness in urban semantic segmentation. Our approach involves comparative analysis and optimization of various models, including Dynamic U-Net, PSPNet and DeepLabV3+, tailored for the segmentation of aerial images. The study includes a thorough experimentation phase, using datasets such as UDD5 and UAVid, alongside our custom Granada dataset. We evaluate our models using metrics like Foreground Accuracy, Dice Coefficient and Jaccard Index. Our results indicate that DeepLabV3+ offers the most promising performance. We conclude with future directions, emphasizing the need for a dedicated neural network for parked car detection and the potential for application in other urban environments. This work contributes to the fields of urban planning and traffic management, providing insights into efficient utilization of parking spaces through advanced image processing techniques.
zh

[CV-104] A Comparative Performance Analysis of Classification and Segmentation Models on Bangladeshi Pothole Dataset

【速读】:该论文旨在解决基于孟加拉国道路坑洞数据集的分类和分割模型性能分析问题。研究团队开发了一个包含824个样本的自定义数据集,这些样本来自达卡和博格拉的街道,并在分类和分割任务中与现有工业和自定义数据集进行了对比。解决方案的关键在于对数据集进行增强(四倍用于分割,十倍用于分类),并测试了九种分类模型(CCT、CNN、INN、Swin Transformer、ConvMixer、VGG16、ResNet50、DenseNet201和Xception)和四种分割模型(U-Net、ResU-Net、U-Net++和Attention-Unet)。研究特别强调了轻量级模型(如CCT、CNN、INN、Swin Transformer和ConvMixer)在低计算需求和快速预测时间方面的优势。实验结果表明,该数据集在分类任务中达到了超过99%的准确率和F1分数,在分割任务中也达到了与现有数据集相当的性能,Dice相似系数最高为67.54%,IoU分数最高为59.39%。数据增强显著提升了所有测试模型的性能。

链接: https://arxiv.org/abs/2501.06602
作者: Antara Firoz Parsa,S. M. Abdullah,Anika Hasan Talukder,Md. Asif Shahidullah Kabbya,Shakib Al Hasan,Md. Farhadul Islam,Jannatun Noor
机构: BRAC University (BRAC大学); United International University (联合国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 Tables, 7 Figures

点击查看摘要

Abstract:The study involves a comprehensive performance analysis of popular classification and segmentation models, applied over a Bangladeshi pothole dataset developed by the authors of this research. This custom dataset of 824 samples, collected from the streets of Dhaka and Bogura, performs competitively against the existing industrial and custom datasets utilized in the present literature. The dataset was further augmented four-fold for segmentation and ten-fold for classification evaluation. We tested nine classification models (CCT, CNN, INN, Swin Transformer, ConvMixer, VGG16, ResNet50, DenseNet201, and Xception) and four segmentation models (U-Net, ResU-Net, U-Net++, and Attention-Unet) over both the datasets. Among the classification models, lightweight models, namely CCT, CNN, INN, Swin Transformer, and ConvMixer, were emphasized due to their low computational requirements and faster prediction times. The lightweight models performed respectably, often matching the performance of heavyweight models. In addition, augmentation was found to enhance the performance of all the tested models. The experimental results show that our dataset performs on par with or outperforms the similar classification setups utilized in the existing literature, reaching accuracy and F1-scores over 99%. The dataset also performed on par with the existing datasets for segmentation, achieving Dice Similarity Coefficients up to 67.54% and IoU scores up to 59.39%.
zh

[CV-105] Exploring Pose-Based Anomaly Detection for Retail Security: A Real-World Shoplifting Dataset and Benchmark

【速读】:该论文试图解决零售业中因盗窃行为(shoplifting)导致的巨额经济损失问题。传统安全措施在实时检测盗窃行为方面存在不足,因此需要智能化的解决方案。论文将盗窃检测问题框架化为异常检测(anomaly detection)问题,重点在于识别与典型购物模式的偏差。解决方案的关键在于引入了PoseLift数据集,这是一个专门为盗窃检测设计的隐私保护数据集,解决了数据稀缺、隐私问题和模型偏差等挑战。PoseLift数据集通过与零售商店合作构建,包含来自真实场景的匿名化人体姿态数据,既保留了关键的行为信息,又确保了隐私保护。通过在PoseLift数据集上对先进的基于姿态的异常检测模型进行基准测试,论文展示了这些方法在检测准确性上的优势,同时有效解决了传统方法中的隐私和偏差问题。PoseLift数据集为研究人员提供了一个有价值的工具,推动了计算机视觉领域的伦理发展,并将公开以促进创新和合作。

链接: https://arxiv.org/abs/2501.06591
作者: Narges Rashvand,Ghazal Alinezhad Noghre,Armin Danesh Pazho,Shanle Yao,Hamed Tabkhi
机构: University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Shoplifting poses a significant challenge for retailers, resulting in billions of dollars in annual losses. Traditional security measures often fall short, highlighting the need for intelligent solutions capable of detecting shoplifting behaviors in real time. This paper frames shoplifting detection as an anomaly detection problem, focusing on the identification of deviations from typical shopping patterns. We introduce PoseLift, a privacy-preserving dataset specifically designed for shoplifting detection, addressing challenges such as data scarcity, privacy concerns, and model biases. PoseLift is built in collaboration with a retail store and contains anonymized human pose data from real-world scenarios. By preserving essential behavioral information while anonymizing identities, PoseLift balances privacy and utility. We benchmark state-of-the-art pose-based anomaly detection models on this dataset, evaluating performance using a comprehensive set of metrics. Our results demonstrate that pose-based approaches achieve high detection accuracy while effectively addressing privacy and bias concerns inherent in traditional methods. As one of the first datasets capturing real-world shoplifting behaviors, PoseLift offers researchers a valuable tool to advance computer vision ethically and will be publicly available to foster innovation and collaboration. The dataset is available at this https URL.
zh

[CV-106] VASparse: Towards Efficient Visual Hallucination Mitigation for Large Vision-Language Model via Visual-Aware Sparsification

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在生成输出时可能出现的与现实不符的现象,即视觉幻觉(Visual Hallucinations, VH),这一问题严重影响了模型在实际应用中的可靠性。为解决这一问题,论文提出了一种高效的即插即用解码算法——视觉感知稀疏化(Visual-Aware Sparsification, VASparse)。该算法的关键创新点在于通过令牌稀疏化(token sparsification)策略来平衡效率和可信度。具体而言,VASparse在解码过程中实施了一种视觉感知的令牌选择策略,以减少冗余令牌的同时有效保留视觉上下文。此外,论文还创新性地引入了一种基于稀疏的视觉对比解码方法,用于重新校准幻觉输出的分布,而无需进行二次解码,从而避免了时间开销。实验结果表明,VASparse在多个基准测试中显著缓解了VH问题,并在保持解码速度的同时达到了最先进的性能。
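
视觉感知的令牌稀疏化可以概括为“按重要性得分只保留最关键的视觉令牌”;以下为解码时 top-k 选择的最小示意(非 VASparse 的官方选择策略,keep_ratio 等参数为演示用假设)。

```python
import torch

def sparsify_visual_tokens(tokens, attn_scores, keep_ratio=0.3):
    """tokens: (B, N, D) 视觉令牌;attn_scores: (B, N) 文本对视觉的注意力得分。
    仅保留得分最高的 keep_ratio 比例令牌,其余丢弃以加速解码。"""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = attn_scores.topk(k, dim=1).indices                  # (B, k)
    return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))

tok = torch.randn(2, 576, 1024)      # 例如 24x24 网格的视觉令牌
score = torch.rand(2, 576)
print(sparsify_visual_tokens(tok, score).shape)  # torch.Size([2, 172, 1024])
```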

链接: https://arxiv.org/abs/2501.06553
作者: Xianwei Zhuang,Zhihong Zhu,Yuxin Xie,Liming Liang,Yuexian Zou
机构: SECE of Peking University (北京大学信息科学技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, also known as visual hallucinations (VH), which significantly impedes their real-world usage. To alleviate VH, various decoding strategies have been proposed to enhance visual information. However, many of these methods may require secondary decoding and rollback, which significantly reduces inference speed. In this work, we propose an efficient plug-and-play decoding algorithm via Visual-Aware Sparsification (VASparse) from the perspective of token sparsity for mitigating VH. VASparse is inspired by empirical observations: (1) the sparse activation of attention in LVLMs, and (2) visual-agnostic tokens sparsification exacerbates VH. Based on these insights, we propose a novel token sparsification strategy that balances efficiency and trustworthiness. Specifically, VASparse implements a visual-aware token selection strategy during decoding to reduce redundant tokens while preserving visual context effectively. Additionally, we innovatively introduce a sparse-based visual contrastive decoding method to recalibrate the distribution of hallucinated outputs without the time overhead associated with secondary decoding. Subsequently, VASparse recalibrates attention scores to penalize attention sinking of LVLMs towards text tokens. Extensive experiments across four popular benchmarks confirm the effectiveness of VASparse in mitigating VH across different LVLM families without requiring additional training or post-processing. Impressively, VASparse achieves state-of-the-art performance for mitigating VH while maintaining competitive decoding speed. Code is available at this https URL.
zh

[CV-107] CoreNet: Conflict Resolution Network for Point-Pixel Misalignment and Sub-Task Suppression of 3D LiDAR-Camera Object Detection

【速读】:该论文旨在解决多模态输入融合中的两个关键问题:点-像素不对齐(point-pixel misalignment)和子任务抑制(sub-task suppression)。点-像素不对齐指的是不透明物体的像素特征在投影到世界空间时,可能会映射到同一条射线上的多个点特征,导致信息不准确;子任务抑制则是指分类预测和边界框回归任务之间可能相互抑制,影响整体性能。为解决这些问题,论文提出了一种名为冲突解决网络(Conflict Resolution Network, CoreNet)的新方法。其关键解决方案包括:1)双流变换模块(dual-stream transformation module),通过基于射线和基于点的2D到BEV(鸟瞰图)变换,实现从图像空间到世界空间的近似唯一映射;2)任务特定预测器(task-specific predictor),采用双分支结构,分别使用类别特定查询和边界框特定查询来处理不同的子任务,每个查询由任务特定特征和通用特征构成,使网络能够根据子任务自适应选择相关信息。实验结果表明,CoreNet在nuScenes数据集上取得了显著性能提升,证明了其有效性。

链接: https://arxiv.org/abs/2501.06550
作者: Yiheng Li,Yang Yang,Zhen Lei
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Information Fusion 2025

点击查看摘要

Abstract:Fusing multi-modality inputs from different sensors is an effective way to improve the performance of 3D object detection. However, current methods overlook two important conflicts: point-pixel misalignment and sub-task suppression. The former means a pixel feature from the opaque object is projected to multiple point features of the same ray in the world space, and the latter means the classification prediction and bounding box regression may cause mutual suppression. In this paper, we propose a novel method named Conflict Resolution Network (CoreNet) to address the aforementioned issues. Specifically, we first propose a dual-stream transformation module to tackle point-pixel misalignment. It consists of ray-based and point-based 2D-to-BEV transformations. Both of them achieve approximately unique mapping from the image space to the world space. Moreover, we introduce a task-specific predictor to tackle sub-task suppression. It uses a dual-branch structure which adopts class-specific query and Bbox-specific query for the corresponding sub-tasks. Each task-specific query is constructed of task-specific features and general features, which allows the heads to adaptively select information of interest based on different sub-tasks. Experiments on the large-scale nuScenes dataset demonstrate the superiority of our proposed CoreNet, by achieving 75.6% NDS and 73.3% mAP on the nuScenes test set without test-time augmentation and model ensemble techniques. The extensive ablation study also demonstrates the effectiveness of each component. The code is released at this https URL.
zh

[CV-108] Natural Language Supervision for Low-light Image Enhancement

【速读】:该论文试图解决低光图像增强(Low-Light Image Enhancement, LLIE)领域中存在的挑战,即如何在不同光照条件下定义“完美”的参考图像,并协调基于度量的结果与视觉友好性之间的矛盾。主流方法通常依赖于低光图像与正常光图像对的端到端映射学习,但由于光照条件的多样性,难以确定一个统一的参考标准。为此,论文提出了一种基于自然语言监督(Natural Language Supervision, NLS)的策略,通过从与图像对应的文本中学习特征映射,为不同光照条件下的图像描述提供了一种通用且灵活的接口。然而,基于文本描述的图像分布具有高度多模态性,导致训练困难。为解决这一问题,论文设计了文本引导条件机制(Textual Guidance Conditioning Mechanism, TCM),通过结合图像区域与句子词汇之间的联系,增强了对图像和文本细粒度跨模态线索的捕捉能力。此外,论文还提出了信息融合注意力(Information Fusion Attention, IFA)模块,用于有效识别和融合不同层次的图像与文本信息。最终,论文将TCM和IFA整合到名为NaLSuper的自然语言监督网络中,实验结果表明该方法具有鲁棒性和优越的性能。

链接: https://arxiv.org/abs/2501.06546
作者: Jiahui Tang,Kaihua Zhou,Zhijian Luo,Yueen Hou
机构: School of Computer, Jiaying University, Meizhou, R. P. China, 514015 (计算机学院, 嘉应学院, 梅州, 中国, 514015)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 10 figures

点击查看摘要

Abstract:With the development of deep learning, numerous methods for low-light image enhancement (LLIE) have demonstrated remarkable performance. Mainstream LLIE methods typically learn an end-to-end mapping based on pairs of low-light and normal-light images. However, normal-light images under varying illumination conditions serve as reference images, making it difficult to define a “perfect” reference image. This leads to the challenge of reconciling metric-oriented and visual-friendly results. Recently, many cross-modal studies have found that side information from other related modalities can guide visual representation learning. Based on this, we introduce a Natural Language Supervision (NLS) strategy, which learns feature maps from text corresponding to images, offering a general and flexible interface for describing an image under different illumination. However, image distributions conditioned on textual descriptions are highly multimodal, which makes training difficult. To address this issue, we design a Textual Guidance Conditioning Mechanism (TCM) that incorporates the connections between image regions and sentence words, enhancing the ability to capture fine-grained cross-modal cues for images and text. This strategy not only utilizes a wider range of supervised sources, but also provides a new paradigm for LLIE based on visual and textual feature alignment. In order to effectively identify and merge features from various levels of image and textual information, we design an Information Fusion Attention (IFA) module to enhance different regions at different levels. We integrate the proposed TCM and IFA into a Natural Language Supervision network for LLIE, named NaLSuper. Finally, extensive experiments demonstrate the robustness and superior effectiveness of our proposed NaLSuper.
zh

[CV-109] CeViT: Copula-Enhanced Vision Transformer in multi-task learning and bi-group image covariates with an application to myopia screening

【速读】:该论文旨在解决基于图像的近视筛查中的两个长期问题:一是如何整合一对眼睛的眼底图像信息,二是如何结合高度近视状态和眼轴长度之间的固有依赖关系。解决方案的关键在于提出了一种名为CeViT的视觉Transformer(Vision Transformer)双通道架构。该架构通过共享的Transformer编码器提取一对眼睛的共同特征,并通过分离的多层感知机头(multilayer perceptron heads)建模双眼之间的不对称性。此外,论文还引入了一种称为copula loss的统计方法,用于建模图像协变量条件下离散-连续混合响应的条件依赖关系。通过这种架构和方法,CeViT在高度近视分类和眼轴长度预测的准确性上均优于基线模型。

链接: https://arxiv.org/abs/2501.06540
作者: Chong Zhong,Yang Li,Jinfeng Xu,Xiang Fu,Yunhao Liu,Qiuyi Huang,Danjuan Yang,Meiyan Li,Aiyi Liu,Alan H. Welsh,Xingtao Zhou,Bo Fu,Catherine C. Liu
机构: 1Department of Data Science & Artificial Intelligence, The Hong Kong Polytechnic University(香港理工大学数据科学与人工智能系); 2School of Data Science, Fudan University(复旦大学数据科学学院); 3Department of Biostatistics, City University of Hong Kong(香港城市大学生物统计学系); 4School of Information Science and Technology, Fudan University(复旦大学信息科学与技术学院); 5Eye & ENT Hospital, Fudan University(复旦大学眼耳鼻喉科医院); 6Eunice Kennedy Shriver National Institute of Child Health and Human Development(尤尼斯·肯尼迪·施莱弗国家儿童健康与人类发展研究所); 7College of Business and Economics, Australian National University(澳大利亚国立大学商业与经济学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:We aim to assist image-based myopia screening by resolving two longstanding problems, “how to integrate the information of ocular images of a pair of eyes” and “how to incorporate the inherent dependence among high-myopia status and axial length for both eyes.” The classification-regression task is modeled as a novel 4-dimensional multi-response regression, where discrete responses are allowed, that relates to two dependent 3rd-order tensors (3D ultrawide-field fundus images). We present a Vision Transformer-based bi-channel architecture, named CeViT, where the common features of a pair of eyes are extracted via a shared Transformer encoder, and the interocular asymmetries are modeled through separated multilayer perceptron heads. Statistically, we model the conditional dependence among the mixture of discrete-continuous responses given the image covariates by a so-called copula loss. We establish a new theoretical framework regarding fine-tuning on CeViT based on latent representations, making the black-box fine-tuning procedure interpretable and guaranteeing higher relative efficiency of fine-tuning weight estimation in the asymptotic setting. We apply CeViT to an annotated ultrawide-field fundus image dataset collected by Shanghai Eye & ENT Hospital, demonstrating that CeViT enhances the baseline model in both the accuracy of classifying high myopia and the prediction of AL (axial length) for both eyes.
zh

[CV-110] DivTrackee versus DynTracker: Promoting Diversity in Anti-Facial Recognition against Dynamic FR Strategy

【速读】:该论文试图解决面部识别(FR)模型在广泛采用过程中可能被滥用的隐私问题,特别是针对那些旨在追踪特定目标身份的追踪者(trackers)的能力。现有的反面部识别(AFR)方法主要基于静态FR策略进行评估,无法准确反映这些追踪者的实际能力。为此,论文提出了一种动态FR策略(DynTracker),通过迭代更新模型库中的目标身份图像,使得现有AFR保护措施失效。为解决这一问题,论文提出了一种新的方法DivTrackee,通过文本引导的图像生成框架和促进多样性的对抗性损失,生成多样化的AFR保护图像。实验表明,DynTracker能够有效突破现有AFR方法,而DivTrackee在防止动态FR策略识别用户面部图像方面表现出优越性。该研究为开发更有效的AFR方法以应对追踪者的威胁提供了重要基础。
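
促进多样性的对抗损失可以用“最大化同一人脸多张保护图像之间的特征差异”来理解;以下为一个示意性的多样性项(最小化批内两两余弦相似度;非论文原始损失)。

```python
import torch
import torch.nn.functional as F

def diversity_loss(feats):
    """feats: (B, D) 同一人脸多张 AFR 保护图像的特征。
    最小化两两余弦相似度 ≈ 最大化保护扰动的多样性。"""
    f = F.normalize(feats, dim=-1)
    sim = f @ f.t()                                 # (B, B) 余弦相似度矩阵
    B = sim.size(0)
    off_diag = sim[~torch.eye(B, dtype=torch.bool)]  # 去掉对角线的自相似
    return off_diag.mean()                           # 越小越多样

print(diversity_loss(torch.randn(6, 512)).item())
```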

链接: https://arxiv.org/abs/2501.06533
作者: Wenshu Fan,Minxing Zhang,Hongwei Li,Wenbo Jiang,Hanxiao Chen,Xiangyu Yue,Michael Backes,Xiao Zhang
机构: University of Electronic Science and Technology of China(电子科技大学); CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心); The Chinese University of Hong Kong(香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The widespread adoption of facial recognition (FR) models raises serious concerns about their potential misuse, motivating the development of anti-facial recognition (AFR) to protect user facial privacy. In this paper, we argue that the static FR strategy, predominantly adopted in prior literature for evaluating AFR efficacy, cannot faithfully characterize the actual capabilities of determined trackers who aim to track a specific target identity. In particular, we introduce DynTracker, a dynamic FR strategy where the model's gallery database is iteratively updated with newly recognized target identity images. Surprisingly, such a simple approach renders all the existing AFR protections ineffective. To mitigate the privacy threats posed by DynTracker, we advocate for explicitly promoting diversity in the AFR-protected images. We hypothesize that the lack of diversity is the primary cause of the failure of existing AFR methods. Specifically, we develop DivTrackee, a novel method for crafting diverse AFR protections that builds upon a text-guided image generation framework and diversity-promoting adversarial losses. Through comprehensive experiments on various facial image benchmarks and feature extractors, we demonstrate DynTracker's strength in breaking existing AFR methods and the superiority of DivTrackee in preventing user facial images from being identified by dynamic FR strategies. We believe our work can act as an important initial step towards developing more effective AFR methods for protecting user facial privacy against determined trackers.
zh

[CV-111] Multi-View Factorizing and Disentangling: A Novel Framework for Incomplete Multi-View Multi-Label Classification

【速读】:该论文试图解决多视图多标签分类(MvMLC)中的两个主要问题:视图和标签的不完整性,以及如何从多样化的视图中学习既具有视图一致性又具有视图特异性的鲁棒多视图表示。解决方案的关键在于提出了一种新的框架,称为不完整多视图多标签分类(iMvMLC)。该框架将多视图表示分解为视图一致性和视图特异性两个独立的因子集,并设计了一种图解缠损失函数以减少这些表示之间的冗余。此外,框架创新性地将一致性表示学习分解为三个关键子目标:提取不同视图之间的共享信息、消除一致性表示中的视图内冗余以及保留任务相关信息。通过设计一个鲁棒的任务相关一致性学习模块,结合掩码跨视图预测(MCP)策略和信息理论,该框架能够在视图和标签不完整的情况下有效运行,并在多个数据集上表现出优于其他领先方法的性能。

链接: https://arxiv.org/abs/2501.06524
作者: Wulin Xie,Lian Zhao,Jiang Long,Xiaohuan Lu,Bingyan Nie
机构: Guizhou University (贵州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-view multi-label classification (MvMLC) has recently garnered significant research attention due to its wide range of real-world applications. However, incompleteness in views and labels is a common challenge, often resulting from data collection oversights and uncertainties in manual annotation. Furthermore, the task of learning robust multi-view representations that are both view-consistent and view-specific from diverse views remains a challenging problem in MvMLC. To address these issues, we propose a novel framework for incomplete multi-view multi-label classification (iMvMLC). Our method factorizes multi-view representations into two independent sets of factors: view-consistent and view-specific, and we correspondingly design a graph disentangling loss to fully reduce redundancy between these representations. Additionally, our framework innovatively decomposes consistent representation learning into three key sub-objectives: (i) how to extract view-shared information across different views, (ii) how to eliminate intra-view redundancy in consistent representations, and (iii) how to preserve task-relevant information. To this end, we design a robust task-relevant consistency learning module that collaboratively learns high-quality consistent representations, leveraging a masked cross-view prediction (MCP) strategy and information theory. Notably, all modules in our framework are developed to function effectively under conditions of incomplete views and labels, making our method adaptable to various multi-view and multi-label datasets. Extensive experiments on five datasets demonstrate that our method outperforms other leading approaches.
zh

[CV-112] NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References

【速读】:该论文试图解决神经视图合成(Neural View Synthesis, NVS)中质量评估方法的局限性问题。传统的全参考方法(如PSNR、SSIM和LPIPS)依赖于稀疏的参考视图,难以全面捕捉神经合成场景(Neurally Synthesized Scenes, NSS)的感知质量,且由于获取人类感知标签的困难,导致数据集有限,模型容易过拟合并降低泛化能力。为解决这些问题,论文提出了NVS-SQA方法,通过自监督学习无参考质量表示,不依赖人类标签。其关键解决方案在于利用启发式线索和质量分数作为学习目标,并结合专门的对比对准备过程,以提高学习的有效性和效率。实验结果表明,NVS-SQA在无参考方法中显著优于其他17种方法,并在所有评估指标上超越了16种全参考方法。

链接: https://arxiv.org/abs/2501.06488
作者: Qiang Qu,Yiran Shen,Xiaoming Chen,Yuk Ying Chung,Weidong Cai,Tongliang Liu
机构: School of Computer Science, the University of Sydney, Australia(悉尼大学计算机科学学院); School of Software, Shandong University, China(山东大学软件学院); Beijing Technology and Business University, China(北京工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, a NSS quality assessment method to learn no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the “same instance, similar representation” assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).
zh

[CV-113] Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation

【速读】:该论文试图解决文本到图像生成(Text-to-Image, T2I)模型在生成图像时存在的感知伪影、与复杂提示的错位以及安全性问题。现有的方法通常通过收集人类反馈、训练奖励模型来估计人类反馈,并基于奖励模型对T2I模型进行微调以使其与人类偏好对齐。然而,这些方法在提升某些质量指标(如安全性)时,可能会意外地降低其他方面的表现(如提示对齐),甚至导致奖励作弊(reward hacking)。为此,论文提出了Focus-N-Fix,一种区域感知的微调方法,专注于修正原始模型中存在问题的图像区域。该方法能够在保持图像整体结构不变的前提下,显著提升在安全性(如过度性化或暴力)、合理性等方面的局部质量,且对其他区域的影响极小或几乎不可察觉。

链接: https://arxiv.org/abs/2501.06481
作者: Xiaoying Xing,Avinab Saha,Junfeng He,Susan Hao,Paul Vicol,Moonkyung Ryu,Gang Li,Sahil Singla,Sarah Young,Yinxiao Li,Feng Yang,Deepak Ramachandran
机构: Google Research; Northwestern University; UT Austin; Google DeepMind; Google
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) generation has made significant advances in recent years, but challenges still remain in the generation of perceptual artifacts, misalignment with complex prompts, and safety. The prevailing approach to address these issues involves collecting human feedback on generated images, training reward models to estimate human feedback, and then fine-tuning T2I models based on the reward models to align them with human preferences. However, while existing reward fine-tuning methods can produce images with higher rewards, they may change model behavior in unexpected ways. For example, fine-tuning for one quality aspect (e.g., safety) may degrade other aspects (e.g., prompt alignment), or may lead to reward hacking (e.g., finding a way to increase rewards without having the intended effect). In this paper, we propose Focus-N-Fix, a region-aware fine-tuning method that trains models to correct only previously problematic image regions. The resulting fine-tuned model generates images with the same high-level structure as the original model but shows significant improvements in regions where the original model was deficient in safety (over-sexualization and violence), plausibility, or other criteria. Our experiments demonstrate that Focus-N-Fix improves these localized quality aspects with little or no degradation to others and typically imperceptible changes in the rest of the image. Disclaimer: This paper contains images that may be overly sexual, violent, offensive, or harmful.
zh

[CV-114] Flash Window Attention: speedup the attention computation for Swin Transformer

【速读】:该论文旨在解决高分辨率图像像素处理中的计算效率问题。传统的Swin Transformer通过引入窗口注意力机制(window attention),将图像划分为不重叠的窗口,并在每个窗口内进行注意力计算,从而显著提高了计算效率。然而,直接替换为在语言模型中表现更高效的闪存注意力(flash attention)并不适用,因为闪存注意力适用于长序列处理,而窗口注意力则需处理大量短序列。为此,论文提出了一种专门为窗口注意力优化的解决方案——闪存窗口注意力(Flash Window Attention)。该方案通过优化注意力计算,将计算效率提升了高达300%,并将端到端运行效率提升了30%。
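
窗口注意力的计算模式是“划窗后并行处理大批量短序列”;以下用 PyTorch 内建的 scaled_dot_product_attention(其后端可自动调度 FlashAttention 核)给出示意,并非该报告的定制核实现;为简洁起见 Q=K=V 且未还原窗口布局。

```python
import torch
import torch.nn.functional as F

def window_attention(x, window=7, heads=4):
    """x: (B, H, W, C)。划分为不重叠窗口后,在每个窗口内做注意力。"""
    B, H, W, C = x.shape
    hd = C // heads
    # (B, H/w, w, W/w, w, C) -> (B*nW, w*w, C):大量并行的短序列
    xw = (x.view(B, H // window, window, W // window, window, C)
           .permute(0, 1, 3, 2, 4, 5)
           .reshape(-1, window * window, C))
    # 演示用 Q=K=V;SDPA 在可用时会调用 FlashAttention 类的高效核
    qkv = xw.view(xw.size(0), -1, heads, hd).transpose(1, 2)
    out = F.scaled_dot_product_attention(qkv, qkv, qkv)
    return out.transpose(1, 2).reshape_as(xw)

y = window_attention(torch.randn(2, 56, 56, 96))
print(y.shape)  # torch.Size([128, 49, 96]):2 x (56/7)^2 = 128 个窗口
```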

链接: https://arxiv.org/abs/2501.06480
作者: Zhendong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:To address the high resolution of image pixels, the Swin Transformer introduces window attention. This mechanism divides an image into non-overlapping windows and restricts attention computation to within each window, significantly enhancing computational efficiency. To further optimize this process, one might consider replacing standard attention with flash attention, which has proven to be more efficient in language models. However, a direct substitution is ineffective. Flash attention is designed for long sequences, whereas window attention deals with shorter sequences but must handle a large number of them in parallel. In this report, we present an optimized solution called Flash Window Attention, tailored specifically for window attention. Flash Window Attention improves attention computation efficiency by up to 300% and enhances end-to-end runtime efficiency by up to 30%. Our code is available online.
zh

[CV-115] Enhancing Multi-Modal Video Sentiment Classification Through Semi-Supervised Clustering

【速读】:该论文试图解决视频情感分类(video sentiment classification)中的挑战,特别是在标注数据有限的情况下如何提高分类准确性。解决方案的关键在于采用多模态(multi-modal)方法,结合视频本身、伴随文本和声学特征(acoustic features),并通过基于聚类的半监督预训练(clustering-based semi-supervised pre-training)从数据中提取有意义的表示。这一预训练步骤能够识别视频和文本数据中的模式,使模型能够在初始阶段无需大量标注信息的情况下学习底层结构和关系。随后,通过监督式微调(supervised fine-tuning)对系统进行优化,以实现更准确和深入的情感分类。这种方法尤其适用于标注数据有限的情况,能够有效提升分类性能。

链接: https://arxiv.org/abs/2501.06475
作者: Mehrshad Saadatinia,Minoo Ahmadi,Armin Abdollahi
机构: University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding emotions in videos is a challenging task. However, videos contain several modalities which make them a rich source of data for machine learning and deep learning tasks. In this work, we aim to improve video sentiment classification by focusing on three key modalities: the video itself, the accompanying text, and the acoustic features. To address the limitations of relying on large labeled datasets, we are developing a method that utilizes clustering-based semi-supervised pre-training to extract meaningful representations from the data. This pre-training step identifies patterns in the video and text data, allowing the model to learn underlying structures and relationships without requiring extensive labeled information at the outset. Once these patterns are established, we fine-tune the system in a supervised manner to classify the sentiment expressed in videos. We believe that this multi-modal approach, combining clustering with supervised fine-tuning, will lead to more accurate and insightful sentiment classification, especially in cases where labeled data is limited.
zh

[CV-116] YO-CSA-T: A Real-time Badminton Tracking System Utilizing YOLO Based on Contextual and Spatial Attention

【速读】:该论文旨在解决羽毛球机器人(badminton rally robot)在实时人机对抗中高精度3D羽毛球轨迹检测的问题。由于羽毛球飞行速度快、易受环境因素(如场地线条和光照)干扰,传统的2D检测方法难以满足实时性和高精度的要求。论文提出的解决方案包括两个关键部分:首先,设计了YO-CSA检测网络,该网络通过引入上下文和空间注意力机制(contextual and spatial attention mechanisms),优化了YOLOv8s模型的骨干网络(backbone)、颈部(neck)和头部(head),以增强模型在全局和局部特征提取与融合方面的能力。其次,构建了一个实时3D羽毛球轨迹检测系统,该系统将YO-CSA提取的2D坐标序列通过立体视觉映射到3D空间,并基于历史信息预测未来3D坐标,同时通过重投影更新2D检测的位置约束。此外,系统还包含一个补偿模块,用于填补缺失的中间帧,确保轨迹的完整性。实验结果表明,YO-CSA在mAP@0.75指标上达到了90.43%的准确率,优于YOLOv8s和YOLO11s,且系统在12个测试序列中保持了超过130 fps的处理速度。
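
其中“通过立体视觉把 2D 坐标映射到 3D 空间”一步可以用经典的线性三角测量(DLT)来理解;下面是一个 NumPy 极简示意(假设左右相机的 3x4 投影矩阵已由标定得到;轨迹外推用二次多项式拟合近似,并非论文的原始预测模型):

```python
import numpy as np

def triangulate(P_left, P_right, pt_l, pt_r):
    # 线性三角测量(DLT):P_left/P_right 为 3x4 投影矩阵,
    # pt_l/pt_r 为左右视图中的 2D 像素坐标,返回齐次解对应的 3D 点
    A = np.stack([
        pt_l[0] * P_left[2] - P_left[0],
        pt_l[1] * P_left[2] - P_left[1],
        pt_r[0] * P_right[2] - P_right[0],
        pt_r[1] * P_right[2] - P_right[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def predict_next(history, dt=1.0):
    # 轨迹外推示意:对最近若干帧的 3D 坐标做二次多项式拟合(近似恒定加速度)
    # history: (T, 3) 数组,要求 T >= 3
    t = np.arange(len(history))
    return np.array([np.polyval(np.polyfit(t, history[:, k], 2), t[-1] + dt)
                     for k in range(3)])
```

预测出的 3D 点再经两个投影矩阵重投影回左右视图,即可为下一帧的 2D 检测提供位置约束。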

链接: https://arxiv.org/abs/2501.06472
作者: Yuan Lai,Zhiwei Shi,Chengxi Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages,14 figures

点击查看摘要

Abstract:The 3D trajectory of a shuttlecock required for a badminton rally robot for human-robot competition demands real-time performance with high accuracy. However, the fast flight speed of the shuttlecock, along with various visual effects, and its tendency to blend with environmental elements, such as court lines and lighting, present challenges for rapid and accurate 2D detection. In this paper, we first propose the YO-CSA detection network, which optimizes and reconfigures the YOLOv8s model's backbone, neck, and head by incorporating contextual and spatial attention mechanisms to enhance the model's ability in extracting and integrating both global and local features. Next, we integrate three major subtasks, detection, prediction, and compensation, into a real-time 3D shuttlecock trajectory detection system. Specifically, our system maps the 2D coordinate sequence extracted by YO-CSA into 3D space using stereo vision, then predicts the future 3D coordinates based on historical information, and re-projects them onto the left and right views to update the position constraints for 2D detection. Additionally, our system includes a compensation module to fill in missing intermediate frames, ensuring a more complete trajectory. We conduct extensive experiments on our own dataset to evaluate both YO-CSA's performance and system effectiveness. Experimental results show that YO-CSA achieves a high accuracy of 90.43% mAP@0.75, surpassing both YOLOv8s and YOLO11s. Our system performs excellently, maintaining a speed of over 130 fps across 12 test sequences.
zh

[CV-117] SP-SLAM: Neural Real-Time Dense SLAM With Scene Priors

【速读】:该论文旨在解决现有神经隐式表示(Neural Implicit Representations)在密集同步定位与地图构建(SLAM)中存在的重建质量和实时性能不足的问题。现有方法的主要缺陷在于场景表示策略不够灵活,且未充分利用先验信息。为此,论文提出了SP-SLAM系统,其关键解决方案包括:1)通过计算深度图像并在表面附近建立稀疏体素编码(Sparse Voxel-Encoding)的场景先验,实现模型的快速收敛;2)将单帧深度图像计算的编码体素融合到全局体积中,以支持高保真表面重建;3)采用三平面(Tri-Planes)存储场景外观信息,在高质量几何纹理映射和内存消耗之间取得平衡;4)引入有效的映射优化策略,使系统能够在运行时持续优化所有历史输入帧的姿态,而不增加计算开销。实验结果表明,SP-SLAM在多个基准数据集上实现了更高的跟踪精度和重建质量,同时显著提升了运行速度。

链接: https://arxiv.org/abs/2501.06469
作者: Zhen Hong,Bowen Wang,Haoran Duan,Yawen Huang,Xiong Li,Zhenyu Wen,Xiang Wu,Wei Xiang,Yefeng Zheng
机构: Institute of Cyberspace Security and College of Information Engineering, Zhejiang University of Technology (浙江工业大学网络空间安全研究所与信息工程学院); Department of Automation, Tsinghua University (清华大学自动化系); Tencent Jarvis Lab (腾讯Jarvis实验室); School of Engineering and Mathematical Sciences, La Trobe University (拉筹伯大学工程与数学科学学院); Medical Artificial Intelligence Laboratory, Westlake University (西湖大学医学人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neural implicit representations have recently shown promising progress in dense Simultaneous Localization And Mapping (SLAM). However, existing works have shortcomings in terms of reconstruction quality and real-time performance, mainly due to inflexible scene representation strategy without leveraging any prior information. In this paper, we introduce SP-SLAM, a novel neural RGB-D SLAM system that performs tracking and mapping in real-time. SP-SLAM computes depth images and establishes sparse voxel-encoded scene priors near the surfaces to achieve rapid convergence of the model. Subsequently, the encoding voxels computed from single-frame depth image are fused into a global volume, which facilitates high-fidelity surface reconstruction. Simultaneously, we employ tri-planes to store scene appearance information, striking a balance between achieving high-quality geometric texture mapping and minimizing memory consumption. Furthermore, in SP-SLAM, we introduce an effective optimization strategy for mapping, allowing the system to continuously optimize the poses of all historical input frames during runtime without increasing computational overhead. We conduct extensive evaluations on five benchmark datasets (Replica, ScanNet, TUM RGB-D, Synthetic RGB-D, 7-Scenes). The results demonstrate that, compared to existing methods, we achieve superior tracking accuracy and reconstruction quality, while running at a significantly faster speed.
zh

[CV-118] Discovering an Image-Adaptive Coordinate System for Photography Processing BMVC2024

【速读】:该论文试图解决现有基于曲线查找表(Curve Lookup Table, LUT)的方法在处理全RGB空间映射时面临的高内存复杂度问题。现有方法通常通过采样离散化的3D网格来构建3D LUT,或将RGB通道分解为三个独立的1D LUT,这些方法在效率和精度上存在局限。论文提出了一种新颖的算法IAC(Image-Adaptive Cartesian coordinate system),通过在RGB色彩空间中学习一个图像自适应的笛卡尔坐标系,再进行曲线操作。这一端到端可训练的方法能够通过联合学习的图像自适应坐标系和曲线高效地调整图像。实验结果表明,该策略在多种摄影处理任务(如照片修饰、曝光校正和白平衡编辑)中实现了最先进的性能,同时保持了轻量级设计和快速推理速度。解决方案的关键在于引入了图像自适应的坐标系,从而在降低内存复杂度的同时提升了处理效果。
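
下面的 PyTorch 片段示意“先变换坐标系、再做 1D 曲线、最后变换回 RGB”的核心思路(假设:坐标系与曲线在此为全局可学习参数,而论文中是按图像自适应预测的;单调性约束等细节亦从略):

```python
import torch
import torch.nn as nn

class IACSketch(nn.Module):
    def __init__(self, knots=16):
        super().__init__()
        self.basis = nn.Parameter(torch.eye(3))   # 可学习的坐标轴(示意:全局共享)
        self.curve = nn.Parameter(torch.linspace(0, 1, knots).repeat(3, 1))

    def forward(self, img):                        # img: (B, 3, H, W),值域 [0, 1]
        x = torch.einsum('ij,bjhw->bihw', self.basis, img)   # 变换到新坐标系
        idx = x.clamp(0, 1) * (self.curve.shape[1] - 1)      # 1D LUT 线性插值
        lo = idx.floor().long().clamp(max=self.curve.shape[1] - 2)
        w = idx - lo.float()
        out = torch.stack([
            self.curve[c][lo[:, c]] * (1 - w[:, c]) + self.curve[c][lo[:, c] + 1] * w[:, c]
            for c in range(3)], dim=1)
        return torch.einsum('ij,bjhw->bihw', torch.inverse(self.basis), out)

m = IACSketch()
print(m(torch.rand(2, 3, 64, 64)).shape)   # torch.Size([2, 3, 64, 64])
```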

链接: https://arxiv.org/abs/2501.06448
作者: Ziteng Cui,Lin Gu,Tatsuya Harada
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: BMVC 2024

点击查看摘要

Abstract:Curve Lookup Table (LUT) based methods directly map a pixel to the target output, making them highly efficient tools for real-time photography processing. However, due to extreme memory complexity to learn full RGB space mapping, existing methods either sample a discretized 3D lattice to build a 3D LUT or decompose into three separate curves (1D LUTs) on the RGB channels. Here, we propose a novel algorithm, IAC, to learn an image-adaptive Cartesian coordinate system in the RGB color space before performing curve operations. This end-to-end trainable approach enables us to efficiently adjust images with a jointly learned image-adaptive coordinate system and curves. Experimental results demonstrate that this simple strategy achieves state-of-the-art (SOTA) performance in various photography processing tasks, including photo retouching, exposure correction, and white-balance editing, while also maintaining a lightweight design and fast inference speed.
zh

[CV-119] CPDR: Towards Highly-Efficient Salient Object Detection via Crossed Post-decoder Refinement

【速读】:该论文旨在解决当前显著目标检测(salient object detection)方法中,由于使用深层网络和大规模骨干网络(backbones)导致的计算复杂度显著增加的问题。为了解决这一问题,作者提出了一种轻量级的后解码器细化模块——交叉后解码器细化(CPDR),以增强标准特征金字塔网络(FPN)或U-Net框架的特征表示能力。解决方案的关键在于引入了注意力下采样融合(ADF)和注意力上采样融合(AUF),分别通过通道注意力机制和空间注意力机制来优化低层和高层特征。此外,作者还提出了基于ADF和AUF的双注意力交叉融合(DACF),在保持性能的同时减少了参数数量。实验结果表明,该方法在五个基准数据集上优于现有的最先进方法。
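
“高层特征生成通道注意力精炼低层特征(ADF)、低层特征生成空间注意力引导高层特征(AUF)”这一思路可用如下极简 PyTorch 示意理解(模块结构为演示假设,非论文的原始实现):

```python
import torch
import torch.nn as nn

class AttentionFusionSketch(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.ch_fc = nn.Sequential(nn.Linear(c, c // 4), nn.ReLU(), nn.Linear(c // 4, c))
        self.sp_conv = nn.Conv2d(c, 1, kernel_size=7, padding=3)

    def forward(self, low, high):
        # low/high: (B, C, H, W);假设 high 已上采样到与 low 相同分辨率
        ch = torch.sigmoid(self.ch_fc(high.mean(dim=(2, 3))))    # 高层 -> 通道注意力
        low_refined = low * ch[:, :, None, None]                  # 精炼低层特征(ADF 思路)
        sp = torch.sigmoid(self.sp_conv(low))                     # 低层 -> 空间注意力
        high_guided = high * sp                                   # 引导高层特征(AUF 思路)
        return low_refined + high_guided

m = AttentionFusionSketch(64)
print(m(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)).shape)
```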

链接: https://arxiv.org/abs/2501.06441
作者: Yijie Li,Hewei Wang,Aggelos Katsaggelos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages

点击查看摘要

Abstract:Most of the current salient object detection approaches use deeper networks with large backbones to produce more accurate predictions, which results in a significant increase in computational complexity. A great number of network designs follow the pure U-Net and Feature Pyramid Network (FPN) architecture, whose limited feature extraction and aggregation ability motivated us to design a lightweight post-decoder refinement module, the crossed post-decoder refinement (CPDR), to enhance the feature representation of a standard FPN or U-Net framework. Specifically, we introduce the Attention Down Sample Fusion (ADF), which employs channel attention mechanisms with attention maps generated by high-level representation to refine the low-level features, and Attention Up Sample Fusion (AUF), leveraging the low-level information to guide the high-level features through spatial attention. Additionally, we propose the Dual Attention Cross Fusion (DACF) upon ADFs and AUFs, which reduces the number of parameters while maintaining the performance. Experiments on five benchmark datasets demonstrate that our method outperforms previous state-of-the-art approaches.
zh

[CV-120] UCloudNet: A Residual U-Net with Deep Supervision for Cloud Image Segmentation

【速读】:该论文旨在解决传统卷积神经网络(CNNs)在云图像分割任务中收敛速度慢、训练消耗大的问题,特别是在实时处理天空相机系统中的应用。为了解决这一问题,论文提出了一种基于深度监督的残差U-Net(Residual U-Net)模型,称为UCloudNet。该模型通过在编码器中引入残差连接(residual connection),进一步提升了特征提取能力,从而在保证高精度的同时减少了训练消耗。这一解决方案的关键在于结合了残差连接和深度监督机制,使得模型在较少的训练周期内能够快速收敛,适用于实时云图像分割任务。
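
残差编码器块与深度监督的组合可以用如下极简 PyTorch 片段来理解(结构为示意性假设,非 UCloudNet 的原始配置):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualEncoderBlock(nn.Module):
    # U-Net 编码器中的残差块:跳跃连接缓解梯度消失,加快收敛
    def __init__(self, cin, cout):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, 3, padding=1)
        self.conv2 = nn.Conv2d(cout, cout, 3, padding=1)
        self.skip = nn.Conv2d(cin, cout, 1) if cin != cout else nn.Identity()

    def forward(self, x):
        h = F.relu(self.conv1(x))
        h = self.conv2(h)
        return F.relu(h + self.skip(x))

blk = ResidualEncoderBlock(3, 32)
print(blk(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 32, 64, 64])

# 深度监督示意:中间解码特征也接 1x1 预测头,与最终输出共同计算损失,例如
# loss = bce(main_out, y) + 0.4 * bce(F.interpolate(aux_head(mid), y.shape[-2:]), y)
```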

链接: https://arxiv.org/abs/2501.06440
作者: Yijie Li,Hewei Wang,Shaofan Wang,Yee Hui Lee,Muhammad Salman Pathan,Soumyabrata Dev
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:Recent advancements in meteorology involve the use of ground-based sky cameras for cloud observation. Analyzing images from these cameras helps in calculating cloud coverage and understanding atmospheric phenomena. Traditionally, cloud image segmentation relied on conventional computer vision techniques. However, with the advent of deep learning, convolutional neural networks (CNNs) are increasingly applied for this purpose. Despite their effectiveness, CNNs often require many epochs to converge, posing challenges for real-time processing in sky camera systems. In this paper, we introduce a residual U-Net with deep supervision for cloud segmentation, which provides better accuracy than previous approaches with less training consumption. By utilizing residual connections in the encoders of UCloudNet, the feature extraction ability is further improved.
zh

[CV-121] Qffusion: Controllable Portrait Video Editing via Quadrant-Grid Attention Learning

【速读】:该论文旨在解决肖像视频编辑中的稳定性和灵活性问题。解决方案的关键在于提出了Qffusion框架,该框架基于“动画用于编辑”的设计原则,通过从两幅静态参考图像中训练一个通用的动画框架,并在推理过程中使用修改后的起始帧和结束帧作为参考,从而实现肖像视频编辑。Qffusion利用Stable Diffusion的强大生成能力,提出了象限网格排列(Quadrant-grid Arrangement, QGA)方案,将两幅参考图像的潜在代码和四种面部条件的潜在代码以四网格方式排列,并通过自注意力机制融合这两种模态的特征,实现外观和时间上的联合建模。此外,Qffusion还提出了象限网格传播(Quadrant-grid Propagation, QGP)推理策略,通过递归处理参考帧和条件帧,实现了稳定的任意长度视频生成。该框架无需额外的网络或复杂的训练阶段,仅需修改Stable Diffusion的输入格式,即可实现稳定的视频编辑。
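
象限网格排列(QGA)的“四宫格”拼接可以用如下片段直观理解(此处把两个参考潜码与两个条件潜码拼成 2x2 网格,为简化示意;论文实际使用两张参考图与四种面部条件):

```python
import torch

def quadrant_grid_arrange(ref_a, ref_b, cond_a, cond_b):
    # 把参考潜码与条件潜码排成 2x2 "四宫格",
    # 供后续自注意力在同一张量内联合建模外观与时序
    top = torch.cat([ref_a, cond_a], dim=-1)       # (B, C, h, 2w)
    bottom = torch.cat([ref_b, cond_b], dim=-1)
    return torch.cat([top, bottom], dim=-2)        # (B, C, 2h, 2w)

z = torch.randn(1, 4, 32, 32)
print(quadrant_grid_arrange(z, z, z, z).shape)     # torch.Size([1, 4, 64, 64])
```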

链接: https://arxiv.org/abs/2501.06438
作者: Maomao Li,Lijian Lin,Yunfei Liu,Ye Zhu,Yu Li
机构: International Digital Economy Academy (IDEA)(国际数字经济学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages

点击查看摘要

Abstract:This paper presents Qffusion, a dual-frame-guided framework for portrait video editing. Specifically, we consider a design principle of "animation for editing", and train Qffusion as a general animation framework from two still reference images, while during inference we can use it for portrait video editing simply by applying modified start and end frames as references. Leveraging the powerful generative capability of Stable Diffusion, we propose a Quadrant-grid Arrangement (QGA) scheme for latent re-arrangement, which arranges the latent codes of two reference images and that of four facial conditions into a four-grid fashion, separately. Then, we fuse features of these two modalities and use self-attention for both appearance and temporal learning, where representations at different times are jointly modeled under QGA. Our Qffusion can achieve stable video editing without additional networks or complex training stages, where only the input format of Stable Diffusion is modified. Further, we propose a Quadrant-grid Propagation (QGP) inference strategy, which enjoys a unique advantage on stable arbitrary-length video generation by processing reference and condition frames recursively. Through extensive experiments, Qffusion consistently outperforms state-of-the-art techniques on portrait video editing.
zh

[CV-122] Aug3D: Augmenting large scale outdoor datasets for Generalizable Novel View Synthesis IROS2024

【速读】:该论文试图解决当前基于生成式 AI 的逼真新视角合成(Novel View Synthesis, NVS)方法在大型室外场景中的局限性问题。尽管优化型 NVS 模型已尝试解决这一问题,但具有显著优势的可泛化前馈方法(feed-forward methods)仍未被充分探索。论文提出了一种基于前馈的 NVS 模型 PixelNeRF,并在大规模 UrbanScene3D 数据集上进行了训练。关键解决方案包括:1)提出了四种训练策略,通过聚类和训练数据集来应对视角重叠有限的挑战;2)引入了 Aug3D 增强技术,该技术利用传统运动结构重建(Structure-from-Motion, SfM)生成的新视角,通过网格和语义采样来增强前馈 NVS 模型的学习能力。实验表明,Aug3D 通过将生成的新视角与原始数据集结合,显著提升了模型预测新视角的能力。

链接: https://arxiv.org/abs/2501.06431
作者: Aditya Rauniyar,Omar Alama,Silong Yong,Katia Sycara,Sebastian Scherer
机构: Robotics Institute, School of Computer Science at Carnegie Mellon University(卡内基梅隆大学机器人研究所,计算机科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: IROS 2024 Workshop, 9 Pages, 7 Figures

点击查看摘要

Abstract:Recent photorealistic Novel View Synthesis (NVS) advances have increasingly gained attention. However, these approaches remain constrained to small indoor scenes. While optimization-based NVS models have attempted to address this, generalizable feed-forward methods, offering significant advantages, remain underexplored. In this work, we train PixelNeRF, a feed-forward NVS model, on the large-scale UrbanScene3D dataset. We propose four training strategies to cluster and train on this dataset, highlighting that performance is hindered by limited view overlap. To address this, we introduce Aug3D, an augmentation technique that leverages reconstructed scenes using traditional Structure-from-Motion (SfM). Aug3D generates well-conditioned novel views through grid and semantic sampling to enhance feed-forward NVS model learning. Our experiments reveal that reducing the number of views per cluster from 20 to 10 improves PSNR by 10%, but the performance remains suboptimal. Aug3D further addresses this by combining the newly generated novel views with the original dataset, demonstrating its effectiveness in improving the model’s ability to predict novel views.
zh

[CV-123] Open Eyes Then Reason : Fine-grained Visual Mathematical Understanding in MLLM s

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在需要细粒度视觉理解的数学问题解决任务中表现不佳的问题。具体而言,现有模型在几何实体识别等视觉任务中存在显著错误,导致数学推理能力受限。论文通过系统评估现有MLLMs的视觉定位能力,揭示了视觉定位准确性与问题解决性能之间的显著负相关关系,强调了细粒度视觉理解的重要性。

解决方案的关键在于提出了一种名为SVE-Math(Selective Vision-Enhanced Mathematical MLLM)的新方法。该方法通过引入几何基础的视觉编码器(geometric-grounded vision encoder)和特征路由器(feature router),动态调整层次化视觉特征图的贡献,从而准确识别视觉基元并生成适合语言模型推理需求的精确视觉提示。实验表明,SVE-Math-Qwen2.5-7B在MathVerse数据集上比其他7B模型表现高出15%,并在MathVista上与GPT-4V兼容,同时在GeoQA数据集上表现出与更大数据集训练的模型相媲美的性能。这一方法为未来研究提供了将细粒度视觉理解融入MLLMs的有力方向。
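
其中“特征路由器对层次化视觉特征图做动态加权”的思路可以用如下极简 PyTorch 示意理解(门控结构与输入形状均为演示假设,非论文的原始实现):

```python
import torch
import torch.nn as nn

class FeatureRouterSketch(nn.Module):
    # 对多个层级的视觉特征做可学习的动态加权求和,
    # 让语言模型按需取用不同粒度的几何视觉信息
    def __init__(self, num_levels, dim):
        super().__init__()
        self.gate = nn.Linear(dim, num_levels)

    def forward(self, feats, query):
        # feats: num_levels 个 (B, N, D) 特征图;query: (B, D) 的任务/文本摘要向量
        w = torch.softmax(self.gate(query), dim=-1)            # (B, L)
        stacked = torch.stack(feats, dim=1)                    # (B, L, N, D)
        return (w[:, :, None, None] * stacked).sum(dim=1)      # (B, N, D)

fr = FeatureRouterSketch(num_levels=3, dim=256)
fused = fr([torch.randn(2, 49, 256) for _ in range(3)], torch.randn(2, 256))
print(fused.shape)   # torch.Size([2, 49, 256])
```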

链接: https://arxiv.org/abs/2501.06430
作者: Shan Zhang,Aotian Chen,Yanpeng Sun,Jindong Gu,Yi-Yu Zheng,Piotr Koniusz,Kai Zou,Anton van den Hengel,Yuan Xue
机构: Australian Institute for Machine Learning, University of Adelaide (阿德莱德大学); Georgia Institute of Technology (佐治亚理工学院); Nanjing University of Science and Technology (南京理工大学); University of Oxford (牛津大学); NetMind.ai; Data61, CSIRO (澳大利亚联邦科学与工业研究组织); The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. The limitation is largely attributable to inadequate perception of geometric primitives during image-level contrastive pre-training (e.g., CLIP). While recent efforts to improve math MLLMs have focused on scaling up mathematical visual instruction datasets and employing stronger LLM backbones, they often overlook persistent errors in visual recognition. In this paper, we systematically evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance, underscoring the critical role of fine-grained visual understanding. Notably, advanced models like GPT-4o exhibit a 70% error rate when identifying geometric entities, highlighting that this remains a key bottleneck in visual mathematical reasoning. To address this, we propose a novel approach, SVE-Math (Selective Vision-Enhanced Mathematical MLLM), featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps. Our model recognizes accurate visual primitives and generates precise visual prompts tailored to the language model’s reasoning needs. In experiments, SVE-Math-Qwen2.5-7B outperforms other 7B models by 15% on MathVerse and is compatible with GPT-4V on MathVista. Despite being trained on smaller datasets, SVE-Math-7B achieves competitive performance on GeoQA, rivaling models trained on significantly larger datasets. Our findings emphasize the importance of incorporating fine-grained visual understanding into MLLMs and provide a promising direction for future research.
zh

[CV-124] FocusDD: Real-World Scene Infusion for Robust Dataset Distillation

【速读】:该论文旨在解决大规模和高分辨率数据集压缩(dataset distillation)在实际应用中的局限性问题。传统的数据集压缩方法在处理大规模和高分辨率数据时表现不佳,限制了其在实际训练中的实用性。论文提出了一种新颖的、分辨率无关的数据集压缩方法——FocusDD(Focused Dataset Distillation),其核心在于通过识别关键信息块(key information patches)来生成多样且真实的压缩数据,从而确保压缩数据集在不同网络架构中的泛化能力。具体而言,FocusDD利用预训练的视觉Transformer(Vision Transformer, ViT)提取关键图像块,并将这些块合成为单个压缩图像。这些压缩图像不仅适用于分类任务,还适用于目标检测等密集任务。此外,为了进一步提升压缩数据集的泛化能力,每个合成图像还通过原始图像的下采样视图进行增强。实验结果表明,FocusDD在ImageNet-1K和COCO2017数据集上均显著优于现有方法,验证了其有效性。

链接: https://arxiv.org/abs/2501.06405
作者: Youbing Hu,Yun Cheng,Olga Saukh,Firat Ozdemir,Anqi Lu,Zhiqiang Cao,Zhijun Li
机构: Faculty of Computing, Harbin Institute of Technology(哈尔滨工业大学计算机学院); Swiss Data Science Center, Zurich, Switzerland(瑞士苏黎世数据科学中心); Graz University of Technology, Austria(奥地利格拉茨科技大学); Complexity Science Hub Vienna, Austria(奥地利维也纳复杂性科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dataset distillation has emerged as a strategy to compress real-world datasets for efficient training. However, it struggles with large-scale and high-resolution datasets, limiting its practicality. This paper introduces a novel resolution-independent dataset distillation method, Focused Dataset Distillation (FocusDD), which achieves diversity and realism in distilled data by identifying key information patches, thereby ensuring the generalization capability of the distilled dataset across different network architectures. Specifically, FocusDD leverages a pre-trained Vision Transformer (ViT) to extract key image patches, which are then synthesized into a single distilled image. These distilled images, which capture multiple targets, are suitable not only for classification tasks but also for dense tasks such as object detection. To further improve the generalization of the distilled dataset, each synthesized image is augmented with a downsampled view of the original image. Experimental results on the ImageNet-1K dataset demonstrate that, with 100 images per class (IPC), ResNet50 and MobileNet-v2 achieve validation accuracies of 71.0% and 62.6%, respectively, outperforming state-of-the-art methods by 2.8% and 4.7%. Notably, FocusDD is the first method to use distilled datasets for object detection tasks. On the COCO2017 dataset, with an IPC of 50, YOLOv11n and YOLOv11s achieve 24.4% and 32.1% mAP, respectively, further validating the effectiveness of our approach.
zh

[CV-125] Has an AI model been trained on your images?

【速读】:该论文试图解决生成式 AI(Generative AI)图像模型在训练过程中未经许可使用大量互联网图像所引发的知识产权和版权问题。具体而言,许多创作者担心他们的作品被未经允许地用于模型训练,且缺乏退出机制。论文提出了一种计算高效的方法,用于确定某个模型是否在训练过程中使用了特定的图像或图像集。该方法的关键在于不依赖于对模型架构或权重的了解(即所谓的黑盒成员推断,black-box membership inference),从而能够在不深入模型内部的情况下进行审计。这一方法有望为现有模型的审计提供支持,并促进生成式 AI 模型的公平开发和部署。

链接: https://arxiv.org/abs/2501.06399
作者: Matyas Bohacek,Hany Farid
机构: Stanford University (斯坦福大学); University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:From a simple text prompt, generative-AI image models can create stunningly realistic and creative images bounded, it seems, by only our imagination. These models have achieved this remarkable feat thanks, in part, to the ingestion of billions of images collected from nearly every corner of the internet. Many creators have understandably expressed concern over how their intellectual property has been ingested without their permission or a mechanism to opt out of training. As a result, questions of fair use and copyright infringement have quickly emerged. We describe a method that allows us to determine if a model was trained on a specific image or set of images. This method is computationally efficient and assumes no explicit knowledge of the model architecture or weights (so-called black-box membership inference). We anticipate that this method will be crucial for auditing existing models and, looking ahead, ensuring the fairer development and deployment of generative AI models.
zh

[CV-126] owards Robust Nonlinear Subspace Clustering: A Kernel Learning Approach

【速读】:该论文试图解决基于核的子空间聚类(Kernel-based subspace clustering)中的三个主要问题:(i) 预定义核(predefined kernels)对模型性能的影响;(ii) 在非线性空间中保留原始流形结构(manifold structures)的困难;(iii) 谱类型策略(spectral-type strategies)对亲和矩阵(affinity matrix)理想块对角结构(block diagonal structure)的依赖。为解决这些问题,论文提出了一种新的范式DKLM(Data-driven Kernel Learning for Manifold preservation),其关键解决方案是通过数据自表示(self-representation)直接学习核,确保自适应加权并满足乘法三角不等式约束(multiplicative triangle inequality constraint),从而增强学习核的鲁棒性。DKLM通过利用学习到的核,在非线性空间中保留数据的局部流形结构,同时促进形成最优的块对角亲和矩阵。

链接: https://arxiv.org/abs/2501.06368
作者: Kunpeng Xu,Lifei Chen,Shengrui Wang
机构: Université de Sherbrooke (舍布鲁克大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Kernel-based subspace clustering, which addresses the nonlinear structures in data, is an evolving area of research. Despite noteworthy progressions, prevailing methodologies predominantly grapple with limitations relating to (i) the influence of predefined kernels on model performance; (ii) the difficulty of preserving the original manifold structures in the nonlinear space; (iii) the dependency of spectral-type strategies on the ideal block diagonal structure of the affinity matrix. This paper presents DKLM, a novel paradigm for kernel-induced nonlinear subspace clustering. DKLM provides a data-driven approach that directly learns the kernel from the data’s self-representation, ensuring adaptive weighting and satisfying the multiplicative triangle inequality constraint, which enhances the robustness of the learned kernel. By leveraging this learned kernel, DKLM preserves the local manifold structure of data in a nonlinear space while promoting the formation of an optimal block-diagonal affinity matrix. A thorough theoretical examination of DKLM reveals its relationship with existing clustering paradigms. Comprehensive experiments on synthetic and real-world datasets demonstrate the effectiveness of the proposed method.
zh

[CV-127] Mix-QViT: Mixed-Precision Vision Transformer Quantization Driven by Layer Importance and Quantization Sensitivity

【速读】:该论文旨在解决在视觉Transformer(ViT)模型中进行混合精度量化(MPQ)时的精度损失问题。解决方案的关键在于提出了Mix-QViT框架,该框架通过两个标准系统地分配每层的比特宽度:一是通过层间相关性传播(Layer-wise Relevance Propagation, LRP)评估每层对最终分类的贡献度(层重要性),二是通过量化敏感性评估,即在保持其他层基线精度的同时,量化每层并评估其性能影响。此外,针对训练后量化(PTQ),论文提出了一种裁剪通道量化方法,旨在通过消除层归一化(LayerNorm)激活后的极端离群值来减少通道间差异的影响。实验结果表明,Mix-QViT在3-bit、4-bit和6-bit精度下均优于现有技术,且在2-bit混合精度量化训练中表现出色。
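
按“层重要性 × 量化敏感性”打分并在比特预算内逐层降低精度,这一思路可用如下 NumPy 草图示意(打分与降档策略为演示假设,并非论文的精确算法):

```python
import numpy as np

def allocate_bits(importance, sensitivity, budget_bits, choices=(2, 3, 4, 6, 8)):
    # 得分低(不重要且不敏感)的层优先降低比特宽度,直到满足平均比特预算
    score = np.asarray(importance, float) * np.asarray(sensitivity, float)
    bits = np.full(len(score), max(choices), dtype=float)
    ladder = sorted(choices, reverse=True)            # 8 -> 6 -> 4 -> 3 -> 2
    order = np.argsort(score)
    while bits.mean() > budget_bits:
        lowered = False
        for i in order:
            k = ladder.index(bits[i])
            if k + 1 < len(ladder):                   # 该层还能再降一档
                bits[i] = ladder[k + 1]
                lowered = True
                if bits.mean() <= budget_bits:
                    break
        if not lowered:                               # 已全部降到最低,无法满足预算
            break
    return bits

print(allocate_bits([0.9, 0.2, 0.6, 0.1], [0.8, 0.3, 0.5, 0.2], budget_bits=4.5))
# 例如输出 [6. 4. 4. 4.]:重要且敏感的层保留更高精度
```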

链接: https://arxiv.org/abs/2501.06357
作者: Navin Ranjan,Andreas Savakis
机构: Rochester Institute of Technology(罗切斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal, 12 pages, 7 figures

点击查看摘要

Abstract:In this paper, we propose Mix-QViT, an explainability-driven MPQ framework that systematically allocates bit-widths to each layer based on two criteria: layer importance, assessed via Layer-wise Relevance Propagation (LRP), which identifies how much each layer contributes to the final classification, and quantization sensitivity, determined by evaluating the performance impact of quantizing each layer at various precision levels while keeping other layers at a baseline. Additionally, for post-training quantization (PTQ), we introduce a clipped channel-wise quantization method designed to reduce the effects of extreme outliers in post-LayerNorm activations by removing severe inter-channel variations. We validate our approach by applying Mix-QViT to ViT, DeiT, and Swin Transformer models across multiple datasets. Our experimental results for PTQ demonstrate that both fixed-bit and mixed-bit methods outperform existing techniques, particularly at 3-bit, 4-bit, and 6-bit precision. Furthermore, in quantization-aware training, Mix-QViT achieves superior performance with 2-bit mixed-precision.
zh

[CV-128] MEt3R: Measuring Multi-View Consistency in Generated Images

【速读】:该论文旨在解决多视角生成图像的一致性评估问题。传统重建指标(reconstruction metrics)由于生成模型(generative models)的特性,无法有效衡量生成图像的质量,因此需要一种独立于采样过程的评估方法。论文提出了MEt3R这一指标,专门用于评估生成的多视角图像之间的一致性,且该评估不依赖于特定场景。解决方案的关键在于使用DUSt3R从图像对中通过前馈方式获取密集的3D重建结果,并将图像内容从一个视角映射到另一个视角,然后通过比较这些图像的特征图来获得不受视角影响的一致性评分。通过MEt3R,论文评估了多种新视角生成和视频生成方法的一致性,包括其开放的多视角潜在扩散模型(multi-view latent diffusion model)。
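
MEt3R 的核心步骤——把一个视图的特征按 3D 对应关系扭转(warp)到另一视图后比较特征相似度——可以用如下极简示意理解(假设扭转结果 feat_b_warped 与有效重叠区域 mask 已由 DUSt3R 式的重建给出,此处仅演示打分环节):

```python
import torch
import torch.nn.functional as F

def view_consistency_score(feat_a, feat_b_warped, mask):
    # feat_a / feat_b_warped: (B, C, H, W);mask: (B, H, W) 的有效区域
    # 用余弦相似度比较特征,对光照等视角相关效应不敏感
    sim = F.cosine_similarity(feat_a, feat_b_warped, dim=1)     # (B, H, W)
    return (sim * mask).sum() / mask.sum().clamp(min=1)

fa = torch.randn(1, 64, 32, 32)
fb = torch.randn(1, 64, 32, 32)
m = torch.ones(1, 32, 32)
print(view_consistency_score(fa, fb, m))
```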

链接: https://arxiv.org/abs/2501.06336
作者: Mohammad Asim,Christopher Wewer,Thomas Wimmer,Bernt Schiele,Jan Eric Lenssen
机构: Max Planck Institute for Informatics, Saarland Informatics Campus (马克斯·普朗克信息学研究所, 萨尔兰信息学园区); ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Project website: this https URL

点击查看摘要

Abstract:We introduce MEt3R, a metric for multi-view consistency in generated images. Large-scale generative models for multi-view image generation are rapidly advancing the field of 3D inference from sparse observations. However, due to the nature of generative modeling, traditional reconstruction metrics are not suitable to measure the quality of generated outputs and metrics that are independent of the sampling procedure are desperately needed. In this work, we specifically address the aspect of consistency between generated multi-view images, which can be evaluated independently of the specific scene. Our approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner, which are used to warp image contents from one view into the other. Then, feature maps of these images are compared to obtain a similarity score that is invariant to view-dependent effects. Using MEt3R, we evaluate the consistency of a large set of previous methods for novel view and video generation, including our open, multi-view latent diffusion model.
zh

[CV-129] owards Iris Presentation Attack Detection with Foundation Models

【速读】:该论文试图解决近红外虹膜呈现攻击检测(NIR Iris Presentation Attack Detection, PAD)中由于数据集规模小、攻击工具多样性不足以及真实样本和攻击样本之间缺乏对应关系所导致的泛化能力不足的问题。解决方案的关键在于利用两个基础模型(Foundation Models)——DinoV2和VisualOpenClip,通过在小型神经网络上进行微调预测,显著超越了基于深度学习的现有方法。然而,如果能够获得真实样本和攻击样本,从头训练的系统仍然可以达到更好的结果。

链接: https://arxiv.org/abs/2501.06312
作者: Juan E. Tapia,Lázaro Janier González-Soler,Christoph Busch
机构: da/sec-Biometrics and Internet Security Research Group, Darmstadt, Germany (da/sec-生物识别与互联网安全研究小组, 达姆施塔特, 德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models are becoming increasingly popular due to their strong generalization capabilities resulting from being trained on huge datasets. These generalization capabilities are attractive in areas such as NIR Iris Presentation Attack Detection (PAD), in which databases are limited in the number of subjects and diversity of attack instruments, and there is no correspondence between the bona fide and attack images because, most of the time, they do not belong to the same subjects. This work explores an iris PAD approach based on two foundation models, DinoV2 and VisualOpenClip. The results show that fine-tuning with a small neural network as a prediction head surpasses the state-of-the-art performance of deep-learning-based approaches. However, systems trained from scratch can still reach better results if bona fide and attack images are available.
zh

[CV-130] Visualizing Uncertainty in Image Guided Surgery a Review

【速读】:该论文试图解决在脑肿瘤切除手术中,由于脑移位(brain shift)导致的术前影像(如MRI和超声)失效和配准不确定性问题。脑移位是由渗透压、液体水平和组织切除等因素引起的动态变形,可能导致导航系统的不准确性。解决方案的关键在于两个方面:1)量化不确定性(quantifying uncertainty);2)将量化的不确定性有效地传达给观察者(如外科医生)。通过考虑并可视化这种不确定性,可以帮助外科医生重新信任导航系统,从而提高手术的精确性和安全性。

链接: https://arxiv.org/abs/2501.06280
作者: Mahsa Geshvadi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:During tumor resection surgery, surgeons rely on neuronavigation to locate tumors and other critical structures in the brain. Most neuronavigation is based on preoperative images, such as MRI and ultrasound, to navigate through the brain. Neuronavigation acts like GPS for the brain, guiding neurosurgeons during the procedure. However, brain shift, a dynamic deformation caused by factors such as osmotic concentration, fluid levels, and tissue resection, can invalidate the preoperative images and introduce registration uncertainty. Considering and effectively visualizing this uncertainty has the potential to help surgeons trust the navigation again. Uncertainty has been studied in various domains since the 19th century. Considering uncertainty requires two essential components: 1) quantifying uncertainty; and 2) conveying the quantified values to the observer. There has been growing interest in both of these research areas during the past few decades.
zh

[CV-131] OpenAI ChatGPT GPT interprets Radiological Images: GPT-4 as a Medical Doctor for a Fast Check-Up

【速读】:该论文试图解决的问题是:人工智能(AI)是否能够取代医疗专业人员(如医生),或者是否可以作为决策支持工具,使决策更加容易和可靠。解决方案的关键在于探索GPT-4在医疗领域的图像解释能力,特别是其在放射学图像(radiological images)中的应用。通过实验验证GPT-4的图像处理能力,论文旨在评估AI在医疗诊断中的潜力,并探讨其作为辅助工具的可能性。

链接: https://arxiv.org/abs/2501.06269
作者: Omer Aydin,Enis Karaarslan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:OpenAI released version GPT-4 on March 14, 2023, following the success of ChatGPT, which was announced in November 2022. In addition to the existing GPT-3 features, GPT-4 has the ability to interpret images. To achieve this, the processing power and model have been significantly improved. The ability to process and interpret images greatly extends the potential applications and effectiveness of artificial intelligence. In this study, we will first explore the interpretation of radiological images in healthcare using artificial intelligence (AI). Then, we will experiment with the image interpretation capability of GPT-4. In this way, we will address the question of whether artificial intelligence (AI) can replace a healthcare professional (e.g., a medical doctor) or whether it can be used as a decision support tool that makes decisions easier and more reliable.
zh

[CV-132] GelBelt: A Vision-based Tactile Sensor for Continuous Sensing of Large Surfaces

【速读】:该论文旨在解决传统视觉触觉传感器在大规模表面连续感知应用中的局限性,如感应区域小、易受损等问题。为解决这些问题,论文提出了一种新型的视觉触觉传感器设计,其关键创新在于采用弹性带(elastomeric belt)和两个轮子(two wheels)的机械结构,能够连续扫描目标表面。该设计在形状重建和表面融合方面表现出色,能够在高达45 mm/s的速度下快速且高精度地扫描大规模表面。实验结果表明,该传感器在感应区域内对不同扫描速度下的表面法线图(surface normal map)估计与参考值之间的点积(dot product)具有较高的准确性,验证了其在大规模表面连续感知中的适用性。

链接: https://arxiv.org/abs/2501.06263
作者: Mohammad Amin Mirzaee,Hung-Jui Huang,Wenzhen Yuan
机构: University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); Carnegie Mellon University(卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to IEEE RA-L. 8 pages, 7 figures, webpage: this https URL

点击查看摘要

Abstract:Scanning large-scale surfaces is widely demanded in surface reconstruction applications and detecting defects in industries’ quality control and maintenance stages. Traditional vision-based tactile sensors have shown promising performance in high-resolution shape reconstruction while suffering limitations such as small sensing areas or susceptibility to damage when slid across surfaces, making them unsuitable for continuous sensing on large surfaces. To address these shortcomings, we introduce a novel vision-based tactile sensor designed for continuous surface sensing applications. Our design uses an elastomeric belt and two wheels to continuously scan the target surface. The proposed sensor showed promising results in both shape reconstruction and surface fusion, indicating its applicability. The dot product of the estimated and reference surface normal map is reported over the sensing area and for different scanning speeds. Results indicate that the proposed sensor can rapidly scan large-scale surfaces with high accuracy at speeds up to 45 mm/s.
zh

[CV-133] CAMs as Shapley Value-based Explainers

【速读】:该论文旨在解决当前类激活映射(Class Activation Mapping, CAM)方法在解释神经网络决策时机制不明确的问题。为了增强对CAM方法的理解并提高其可解释性,作者提出了内容保留博弈论解释器(Content Reserved Game-theoretic, CRG Explainer)这一理论框架。该框架通过将神经网络预测过程建模为合作博弈,阐明了GradCAM和HiResCAM的理论基础。在此基础上,作者开发了ShapleyCAM方法,该方法利用梯度和Hessian矩阵(Hessian matrix)提供更精确且理论依据更强的视觉解释。由于精确计算Shapley值(Shapley value)在计算上不可行,ShapleyCAM采用合作博弈效用函数的二阶泰勒展开来推导闭式表达式。此外,作者提出了残差Softmax目标类(Residual Softmax Target-Class, ReST)效用函数,以解决预Softmax和后Softmax分数的局限性。通过在ImageNet验证集上对12种流行网络进行广泛实验,验证了ShapleyCAM及其变体的有效性。该研究不仅推动了CAM方法的可解释性,还弥合了启发式CAM方法与计算密集型的Shapley值方法之间的差距。
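
ShapleyCAM 依赖梯度与 Hessian-向量积(HVP),后者可通过两次反向传播获得而无需显式构造 Hessian。下面是一个可运行的玩具示意(网络结构与加权形式 w = grad + 0.5·H·A 仅为演示二阶泰勒思想的假设,精确闭式解以论文为准):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 玩具网络:卷积特征图 A 即 CAM 所依赖的激活
conv = nn.Conv2d(3, 8, 3, padding=1)
fc = nn.Linear(8, 10)

x = torch.randn(1, 3, 32, 32)
A = torch.relu(conv(x))                                  # (1, 8, 32, 32)
prob = F.softmax(fc(A.mean(dim=(2, 3))), dim=-1)         # softmax 使得分对 A 非线性
score = prob[0, 3]                                       # 目标类得分

grad = torch.autograd.grad(score, A, create_graph=True)[0]
hvp = torch.autograd.grad((grad * A.detach()).sum(), A)[0]   # Hessian-向量积 H·A
w = (grad + 0.5 * hvp).mean(dim=(2, 3))                      # 二阶泰勒近似的通道权重
cam = torch.relu((w[:, :, None, None] * A).sum(dim=1)).detach()
print(cam.shape)                                             # torch.Size([1, 32, 32])
```

实际使用时通常借助 forward hook 获取目标层激活 A;关键在于 create_graph=True 使一阶梯度仍在计算图中,从而支持第二次反向传播。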

链接: https://arxiv.org/abs/2501.06261
作者: Huaiguang Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)
备注: Accepted by The Visual Computer

点击查看摘要

Abstract:Class Activation Mapping (CAM) methods are widely used to visualize neural network decisions, yet their underlying mechanisms remain incompletely understood. To enhance the understanding of CAM methods and improve their explainability, we introduce the Content Reserved Game-theoretic (CRG) Explainer. This theoretical framework clarifies the theoretical foundations of GradCAM and HiResCAM by modeling the neural network prediction process as a cooperative game. Within this framework, we develop ShapleyCAM, a new method that leverages gradients and the Hessian matrix to provide more precise and theoretically grounded visual explanations. Due to the computational infeasibility of exact Shapley value calculation, ShapleyCAM employs a second-order Taylor expansion of the cooperative game's utility function to derive a closed-form expression. Additionally, we propose the Residual Softmax Target-Class (ReST) utility function to address the limitations of pre-softmax and post-softmax scores. Extensive experiments across 12 popular networks on the ImageNet validation set demonstrate the effectiveness of ShapleyCAM and its variants. Our findings not only advance CAM explainability but also bridge the gap between heuristic-driven CAM methods and compute-intensive Shapley value-based methods. The code is available at this https URL.
zh

[CV-134] Quantum Down Sampling Filter for Variational Auto-encoder

【速读】:该论文旨在解决传统变分自编码器(VAEs)在处理低分辨率输入(16x16像素)时,重建图像质量较差、细节丢失的问题。传统VAEs在这种情况下通常会产生模糊或不准确的结果。为了解决这一问题,论文提出了一种混合模型,该模型在VAE的编码器中结合了量子计算技术,并在解码器中使用了卷积神经网络(CNNs)。通过在编码过程中将分辨率从16x16提升到32x32,该方法评估了模型在增强分辨率的同时如何保持关键特征和结构的能力。该方案的关键在于利用量子计算技术提升编码器的性能,并通过卷积神经网络在解码器中进一步优化图像重建的质量。实验结果表明,量子增强的VAE(Q-VAE)在MNIST和USPS数据集上显著优于传统的VAE和CDP-VAE,表现出更低的Fréchet Inception Distance(FID)和均方误差(MSE),证明了其在提升图像重建质量和保留关键细节方面的潜力。

链接: https://arxiv.org/abs/2501.06259
作者: Farina Riaz,Fakhar Zaman,Hajime Suzuki,Sharif Abuadbba,David Nguyen
机构: CSIRO Data61; CSIRO Manufacturing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 13 figures

点击查看摘要

Abstract:Variational Autoencoders (VAEs) are essential tools in generative modeling and image reconstruction, with their performance heavily influenced by the encoder-decoder architecture. This study aims to improve the quality of reconstructed images by enhancing their resolution and preserving finer details, particularly when working with low-resolution inputs (16x16 pixels), where traditional VAEs often yield blurred or inaccurate results. To address this, we propose a hybrid model that combines quantum computing techniques in the VAE encoder with convolutional neural networks (CNNs) in the decoder. By upscaling the resolution from 16x16 to 32x32 during the encoding process, our approach evaluates how the model reconstructs images with enhanced resolution while maintaining key features and structures. This method tests the model's robustness in handling image reconstruction and its ability to preserve essential details despite training on lower-resolution data. We evaluate our proposed down sampling filter for Quantum VAE (Q-VAE) on the MNIST and USPS datasets and compare it with classical VAEs and a variant called Classical Direct Passing VAE (CDP-VAE), which uses windowing pooling filters in the encoding process. Performance is assessed using metrics such as the Fréchet Inception Distance (FID) and Mean Squared Error (MSE), which measure the fidelity of reconstructed images. Our results demonstrate that the Q-VAE consistently outperforms both the Classical VAE and CDP-VAE, achieving significantly lower FID and MSE scores. Additionally, CDP-VAE yields better performance than C-VAE. These findings highlight the potential of quantum-enhanced VAEs to improve image reconstruction quality by enhancing resolution and preserving essential features, offering a promising direction for future applications in computer vision and synthetic data generation.
zh

[CV-135] he State of Post-Hoc Local XAI Techniques for Image Processing: Challenges and Motivations

【速读】:该论文试图解决复杂人工智能(AI)系统中存在的“黑箱”问题,即这些系统的内部工作机制不透明,难以解释和理解。为了提高AI系统的可信度和可靠性,论文探讨了可解释人工智能(Explainable Artificial Intelligence, XAI)领域的相关研究。解决方案的关键在于通过XAI技术使AI系统更加透明和可解释,从而增强用户对AI系统的信任。论文详细讨论了XAI的动机、方法、面临的挑战以及未来研究方向,特别是在图像处理领域的应用,旨在推动XAI研究的进一步发展。

链接: https://arxiv.org/abs/2501.06253
作者: Rech Leong Tian Poh,Sye Loong Keoh,Liying Li
机构: TÜV SÜD Asia Pacific(新加坡); School of Computing Science, University of Glasgow(格拉斯哥大学, 苏格兰, 英国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As complex AI systems further prove to be an integral part of our lives, a persistent and critical problem is the underlying black-box nature of such products and systems. In pursuit of productivity enhancements, one must not forget the need for various technologies to boost the overall trustworthiness of such AI systems. One example, which is studied extensively in this work, is the domain of Explainable Artificial Intelligence (XAI). Research works in this scope are centred around the objective of making AI systems more transparent and interpretable, to further boost reliability and trust in using them. In this work, we discuss the various motivations for XAI and its approaches, the underlying challenges that XAI faces, and some open problems that we believe deserve further efforts to look into. We also provide a brief discussion of various XAI approaches for image processing, and finally discuss some future directions, to hopefully express and motivate the positive development of the XAI research space.
zh

[CV-136] Generative AI for Cel-Animation: A Survey

【速读】:该论文探讨了传统赛璐珞动画(Cel Animation)制作流程中存在的效率低下和技术门槛高的问题。传统动画制作涉及多个关键步骤,如故事板设计、布局设计、关键帧动画、中间帧生成和上色等,这些步骤需要大量的人工投入、技术专长和时间成本,限制了动画制作的效率和可扩展性。论文提出,生成式人工智能(Generative AI, GenAI)的兴起为解决这些问题提供了创新方案。通过整合大语言模型、多模态模型和扩散模型等技术,GenAI能够自动化中间帧生成、上色和故事板创建等任务,从而降低技术门槛,扩大创作者的范围,并让艺术家能够更专注于创意表达和艺术创新。论文还讨论了未来AI辅助动画的潜在发展方向,尽管在视觉一致性、风格连贯性和伦理问题等方面仍存在挑战。

链接: https://arxiv.org/abs/2501.06250
作者: Yunlong Tang,Junjia Guo,Pinxin Liu,Zhiyuan Wang,Hang Hua,Jia-Xing Zhong,Yunzhong Xiao,Chao Huang,Luchuan Song,Susan Liang,Yizhi Song,Liu He,Jing Bi,Mingqian Feng,Xinyang Li,Zeliang Zhang,Chenliang Xu
机构: University of Rochester(罗切斯特大学); UCSB(加州大学圣塔芭芭拉分校); University of Oxford(牛津大学); CMU(卡内基梅隆大学); Purdue University(普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 20 pages

点击查看摘要

Abstract:Traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, issues such as maintaining visual consistency, ensuring stylistic coherence, and addressing ethical considerations continue to pose challenges. Furthermore, this paper discusses future directions and explores potential advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: this https URL
zh

[CV-137] Scalable Cosmic AI Inference using Cloud Serverless Computing with FMI

【速读】:该论文旨在解决大规模天文图像数据处理和预测中的计算资源需求高、可访问性受限的问题。现代深度学习模型虽然具有高预测精度,但其对计算资源的需求较大,导致资源密集且难以普及。为此,论文提出了基于云的天文学推理框架(Cloud-based Astronomy Inference, CAI),其关键解决方案在于将预训练的基础模型(foundation models)与无服务器云基础设施(serverless cloud infrastructure)通过函数即服务(Function-as-a-Service, FaaS)消息接口(FMI)集成。CAI框架能够在无需大量硬件支持的情况下,实现对天文图像的高效、可扩展推理。通过以红移预测(redshift prediction)为例的广泛实验,CAI展示了在大规模数据集上的显著可扩展性改进,为天文学界提供了一个易于访问且高效的工具。

链接: https://arxiv.org/abs/2501.06249
作者: Mills Staylor,Amirreza Dolatpour Fathkouhi,Md Khairul Islam,Kaleigh O’Hara,Ryan Ghiles Goudjil,Geoffrey Fox,Judy Fox
机构: Department of Computer Science, University of Virginia (弗吉尼亚大学计算机科学系); School of Data Science, University of Virginia (弗吉尼亚大学数据科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM)
备注:

点击查看摘要

Abstract:Large-scale astronomical image data processing and prediction is essential for astronomers, providing crucial insights into celestial objects, the universe’s history, and its evolution. While modern deep learning models offer high predictive accuracy, they often demand substantial computational resources, making them resource-intensive and limiting accessibility. We introduce the Cloud-based Astronomy Inference (CAI) framework to address these challenges. This scalable solution integrates pre-trained foundation models with serverless cloud infrastructure through a Function-as-a-Service (FaaS) Message Interface (FMI). CAI enables efficient and scalable inference on astronomical images without extensive hardware. Using a foundation model for redshift prediction as a case study, our extensive experiments cover user devices, HPC (High-Performance Computing) servers, and Cloud. CAI’s significant scalability improvement on large data sizes provides an accessible and effective tool for the astronomy community. The code is accessible at this https URL.
zh

[CV-138] NextStop: An Improved Tracker For Panoptic LIDAR Segmentation Data

【速读】:该论文旨在解决4D全景LiDAR(Light Detection and Ranging)分割中的身份切换(ID switches)和跟踪性能下降问题。现有方法如4D-PLS和4D-STOP依赖于短期的实例检测,缺乏运动估计,并且排除了小尺寸实例,导致在复杂环境中频繁出现身份切换和跟踪性能下降。为解决这些问题,论文提出了NextStop1跟踪器,其关键创新在于集成了基于卡尔曼滤波(Kalman filter)的运动估计、数据关联和生命周期管理,并引入了轨迹状态(tracklet state)概念以优化优先级分配。通过使用LiDAR分割与跟踪质量(LSTQ)指标在SemanticKITTI验证集上的评估,NextStop1显著提升了小尺寸物体(如行人和骑自行车者)的跟踪性能,减少了身份切换,提前了跟踪初始化,并在复杂环境中表现出更高的可靠性。
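
其中基于卡尔曼滤波的运动估计可以用经典的匀速模型来理解;下面是一个 NumPy 极简示意(噪声协方差等参数为演示假设):

```python
import numpy as np

class KalmanPoint3D:
    # 匀速模型的卡尔曼滤波:状态为 [x, y, z, vx, vy, vz],
    # 在帧间预测实例质心位置,辅助数据关联、减少 ID 切换
    def __init__(self, dt=0.1):
        self.F = np.eye(6); self.F[:3, 3:] = dt * np.eye(3)   # 状态转移
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])     # 只观测位置
        self.Q = 1e-2 * np.eye(6); self.R = 1e-1 * np.eye(3)  # 过程/观测噪声(假设值)
        self.x = np.zeros(6); self.P = np.eye(6)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P

kf = KalmanPoint3D()
kf.update(np.array([0.10, 0.20, 1.50]))   # 用当前帧检测校正
print(kf.predict())                        # 预测下一帧质心,供数据关联使用
```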

链接: https://arxiv.org/abs/2501.06235
作者: Nirit Alkalay,Roy Orfaig,Ben-Zion Bobrovsky
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:4D panoptic LiDAR segmentation is essential for scene understanding in autonomous driving and robotics, combining semantic and instance segmentation with temporal consistency. Current methods, like 4D-PLS and 4D-STOP, use a tracking-by-detection methodology, employing deep learning networks to perform semantic and instance segmentation on each frame. To maintain temporal consistency, large-size instances detected in the current frame are compared and associated with instances within a temporal window that includes the current and preceding frames. However, their reliance on short-term instance detection, lack of motion estimation, and exclusion of small-sized instances lead to frequent identity switches and reduced tracking performance. We address these issues with the NextStop1 tracker, which integrates Kalman filter-based motion estimation, data association, and lifespan management, along with a tracklet state concept to improve prioritization. Evaluated using the LiDAR Segmentation and Tracking Quality (LSTQ) metric on the SemanticKITTI validation set, NextStop demonstrated enhanced tracking performance, particularly for small-sized objects like people and bicyclists, with fewer ID switches, earlier tracking initiation, and improved reliability in complex environments. The source code is available at this https URL
zh

[CV-139] BEN: Using Confidence-Guided Matting for Dichotomous Image Segmentation

【速读】:该论文试图解决二值图像分割(Dichotomous Image Segmentation, DIS)中图像抠图(image matting)和对象分割(object segmentation)被视为不同任务的问题。随着图像分割的改进变得越来越具有挑战性,结合图像抠图和灰度分割技术为架构创新提供了新的方向。论文提出了一种名为置信度引导抠图(Confidence-Guided Matting, CGM)的新架构方法,并开发了首个CGM模型——背景擦除网络(Background Erase Network, BEN)。BEN由两个组件组成:用于初始分割的BEN Base和用于置信度优化的BEN Refiner。该方案在DIS5K验证数据集上显著优于当前最先进的方法,表明基于抠图的优化可以显著提升分割质量。这一工作为计算机视觉中抠图与分割技术的交叉融合开辟了新的可能性。

链接: https://arxiv.org/abs/2501.06230
作者: Maxwell Meyer,Jack Spruyt
机构: Prama.llc
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 13 pages, 2 figures, 2 tables, and 2 algorithms

点击查看摘要

Abstract:Current approaches to dichotomous image segmentation (DIS) treat image matting and object segmentation as fundamentally different tasks. As improvements in image segmentation become increasingly challenging to achieve, combining image matting and grayscale segmentation techniques offers promising new directions for architectural innovation. Inspired by the possibility of aligning these two model tasks, we propose a new architectural approach for DIS called Confidence-Guided Matting (CGM). We created the first CGM model called Background Erase Network (BEN). BEN is comprised of two components: BEN Base for initial segmentation and BEN Refiner for confidence refinement. Our approach achieves substantial improvements over current state-of-the-art methods on the DIS5K validation dataset, demonstrating that matting-based refinement can significantly enhance segmentation quality. This work opens new possibilities for cross-pollination between matting and segmentation techniques in computer vision.
zh

[CV-140] Open-Source Manually Annotated Vocal Tract Database for Automatic Segmentation from 3D MRI Using Deep Learning: Benchmarking 2D and 3D Convolutional and Transformer Networks

【速读】:该论文试图解决从磁共振成像(MRI)数据中准确分割声道(vocal tract)的问题,这对于语音和声音应用至关重要。手动分割不仅耗时且容易出错。研究的核心在于评估深度学习算法在自动分割3D MRI声道中的有效性,旨在通过自动化方法提高分割的准确性和效率。

链接: https://arxiv.org/abs/2501.06229
作者: Subin Erattakulangara,Karthika Kelat,Katie Burnham,Rachel Balbi,Sarah E. Gerard,David Meyer,Sajan Goud Lingala
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Accurate segmentation of the vocal tract from magnetic resonance imaging (MRI) data is essential for various voice and speech applications. Manual segmentation is time intensive and susceptible to errors. This study aimed to evaluate the efficacy of deep learning algorithms for automatic vocal tract segmentation from 3D MRI.
zh

[CV-141] A Distributed Hybrid Quantum Convolutional Neural Network for Medical Image Classification

【速读】:该论文旨在解决医学图像分类中复杂特征提取的挑战。医学图像具有高度复杂和精细的特征,传统神经网络在处理这些特征时存在局限性。论文提出了一种基于量子电路分割的分布式混合量子卷积神经网络(QCNN),以利用量子计算的优势,在资源受限的环境中高效捕捉医学图像的复杂特征。解决方案的关键在于通过量子卷积神经网络提取高维特征,并结合量子电路分割技术,将8量子比特的QCNN重构为仅需5量子比特的模型。实验结果表明,该模型在多个数据集上表现出色,且在参数较少的情况下优于现有技术,验证了其有效性。

链接: https://arxiv.org/abs/2501.06225
作者: Yangyang Li,Zhengya Qia,Yuelin Lia,Haorui Yanga,Ronghua Shanga,Licheng Jiaoa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Medical images are characterized by intricate and complex features, requiring interpretation by physicians with medical knowledge and experience. Classical neural networks can reduce the workload of physicians, but can only handle these complex features to a limited extent. Theoretically, quantum computing can explore a broader parameter space with fewer parameters, but it is currently limited by the constraints of quantum hardware. Considering these factors, we propose a distributed hybrid quantum convolutional neural network based on quantum circuit splitting. This model leverages the advantages of quantum computing to effectively capture the complex features of medical images, enabling efficient classification even in resource-constrained environments. Our model employs a quantum convolutional neural network (QCNN) to extract high-dimensional features from medical images, thereby enhancing the model's expressive capability. By integrating distributed techniques based on quantum circuit splitting, the 8-qubit QCNN can be reconstructed using only 5 qubits. Experimental results demonstrate that our model achieves strong performance across 3 datasets for both binary and multiclass classification tasks. Furthermore, compared to recent technologies, our model achieves superior performance with fewer parameters, and experimental results validate the effectiveness of our model.
zh

[CV-142] Detection Retrieval and Explanation Unified: A Violence Detection System Based on Knowledge Graphs and GAT

【速读】:该论文旨在解决当前暴力检测系统(violence detection systems)面临的两个关键挑战:缺乏可解释性和功能单一性。大多数现有系统作为黑箱模型,仅提供分类或检索功能,无法解释其推理过程。为解决这些问题,论文提出了一种新型的可解释暴力检测系统,称为“三合一”(Three-in-One, TIO)系统。该系统的核心解决方案包括:1)通过知识图谱(Knowledge Graph, KG)和图注意力网络(Graph Attention Network, GAT)实现检测、检索和解释三大功能;2)利用ImageBind生成高维嵌入以构建知识图谱,结合GAT进行推理,并通过轻量级时间序列模块提取视频嵌入特征;3)通过连接分类器和检索器实现多功能输出,同时利用知识图谱的可解释性验证推理过程。此外,论文还引入了多种轻量级方法以减少系统资源消耗并提升效率。实验结果表明,该系统在XD-Violence和UCF-Crime数据集上表现优异,并通过案例研究揭示了旁观者数量增加与暴力行为减少之间的有趣现象。

链接: https://arxiv.org/abs/2501.06224
作者: Wen-Dong Jiang,Chih-Yung Chang,Diptendu Sinha Roy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages. Submitted to Neurocomputing

点击查看摘要

Abstract:Recently, violence detection systems developed using unified multimodal models have achieved significant success and attracted widespread attention. However, most of these systems face two critical challenges: the lack of interpretability as black-box models and limited functionality, offering only classification or retrieval capabilities. To address these challenges, this paper proposes a novel interpretable violence detection system, termed the Three-in-One (TIO) System. The TIO system integrates knowledge graphs (KG) and graph attention networks (GAT) to provide three core functionalities: detection, retrieval, and explanation. Specifically, the system processes each video frame along with text descriptions generated by a large language model (LLM) for videos containing potential violent behavior. It employs ImageBind to generate high-dimensional embeddings for constructing a knowledge graph, uses GAT for reasoning, and applies lightweight time series modules to extract video embedding features. The final step connects a classifier and retriever for multi-functional outputs. The interpretability of KG enables the system to verify the reasoning process behind each output. Additionally, the paper introduces several lightweight methods to reduce the resource consumption of the TIO system and enhance its efficiency. Extensive experiments conducted on the XD-Violence and UCF-Crime datasets validate the effectiveness of the proposed system. A case study further reveals an intriguing phenomenon: as the number of bystanders increases, the occurrence of violent behavior tends to decrease.
zh

[CV-143] Powerful Design of Small Vision Transformer on CIFAR10

【速读】:该论文旨在解决Vision Transformers (ViTs)在小规模数据集上表现不佳的问题,特别是在与卷积神经网络(CNNs)相比时。论文通过设计和优化Tiny ViTs,以CIFAR-10为基准数据集,系统地评估了数据增强、patch token初始化、低秩压缩(low-rank compression)和多类token策略对模型性能的影响。关键解决方案包括:1)在多头潜在注意力(Multi-Head Latent Attention, MLA)中对查询进行低秩压缩,以减少冗余并最小化性能损失;2)引入多个CLS token以增强全局表示能力,从而提升模型精度。这些发现为优化Tiny ViTs提供了一个全面的框架,并为高效且有效的设计提供了实用见解。
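
“多 CLS token”策略可以用如下极简 PyTorch 示意理解(编码器用 torch 内置的 TransformerEncoder 代替,维度与层数为演示假设,非论文的原始配置):

```python
import torch
import torch.nn as nn

class MultiCLSViTSketch(nn.Module):
    # 在 patch 序列前拼接 k 个可学习的分类 token,
    # 编码后把 k 个 token 的输出取平均送入分类头,增强全局表示
    def __init__(self, dim=192, num_cls=4, num_classes=10, depth=6, heads=3):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, num_cls, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):                  # (B, N, D)
        B = patch_tokens.shape[0]
        x = torch.cat([self.cls.expand(B, -1, -1), patch_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x[:, : self.cls.shape[1]].mean(dim=1))

m = MultiCLSViTSketch()
print(m(torch.randn(2, 64, 192)).shape)   # torch.Size([2, 10])
```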

链接: https://arxiv.org/abs/2501.06220
作者: Gent Wu
机构: Surrey Institute for People-Centred AI (PAI) (萨里以人为本人工智能研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) have demonstrated remarkable success on large-scale datasets, but their performance on smaller datasets often falls short of convolutional neural networks (CNNs). This paper explores the design and optimization of Tiny ViTs for small datasets, using CIFAR-10 as a benchmark. We systematically evaluate the impact of data augmentation, patch token initialization, low-rank compression, and multi-class token strategies on model performance. Our experiments reveal that low-rank compression of queries in Multi-Head Latent Attention (MLA) incurs minimal performance loss, indicating redundancy in ViTs. Additionally, introducing multiple CLS tokens improves global representation capacity, boosting accuracy. These findings provide a comprehensive framework for optimizing Tiny ViTs, offering practical insights for efficient and effective designs. Code is available at this https URL.
zh

[CV-144] WhACC: Whisker Automatic Contact Classifier with Expert Human-Level Performance

【速读】:该论文旨在解决在啮齿动物触须系统研究中,手动标注触须接触事件耗时且劳动强度大的问题。尽管已有自动化工具如Janelia Whisker Tracker,但每百万帧视频仍需约3小时的手动标注时间。为解决这一问题,作者提出了Whisker Automatic Contact Classifier (WhACC),这是一个基于Python的软件包,能够从高速视频中自动识别头部固定的啮齿动物的触须接触事件,其性能达到人类水平。WhACC的关键在于利用ResNet50V2进行特征提取,并结合LightGBM进行分类。通过与三位专家人类标注者在一百万帧视频上的对比,WhACC的触须接触分类一致性达到99.5%,与人类标注者之间的一致性相当。此外,WhACC还提供了一个自定义再训练接口,允许用户在小数据集上进行模型定制,进一步减少了标注时间。通过这一解决方案,标注一亿帧视频所需的时间从约333小时大幅减少至约6小时。
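
“预训练特征 + LightGBM 分类”的流水线可用如下草图示意(此处用随机数组代替 ResNet50V2 提取的帧特征与人工标注,仅演示接口用法):

```python
import numpy as np
import lightgbm as lgb

# 假设 feats 为 ResNet50V2 对每帧提取的 2048 维特征,labels 为人工标注的接触/非接触
feats = np.random.randn(5000, 2048)
labels = np.random.randint(0, 2, 5000)

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(feats[:4000], labels[:4000])                 # 训练分类器
print(clf.score(feats[4000:], labels[4000:]))        # 留出集上的帧级准确率
```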

链接: https://arxiv.org/abs/2501.06219
作者: Phillip Maire,Samson G. King,Jonathan Andrew Cheung,Stefanie Walker,Samuel Andrew Hires
机构: Department of Biological Sciences, Section of Neurobiology, University of Southern California (南加州大学生物科学系神经生物学部); Neuroscience Graduate Program, University of Southern California (南加州大学神经科学研究生项目)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rodent vibrissal system is pivotal in advancing neuroscience research, particularly for studies of cortical plasticity, learning, decision-making, sensory encoding, and sensorimotor integration. Despite the advantages, curating touch events is labor intensive and often requires 3 hours per million video frames, even after leveraging automated tools like the Janelia Whisker Tracker. We address this limitation by introducing Whisker Automatic Contact Classifier (WhACC), a python package designed to identify touch periods from high-speed videos of head-fixed behaving rodents with human-level performance. WhACC leverages ResNet50V2 for feature extraction, combined with LightGBM for classification. Performance is assessed against three expert human curators on over one million frames. WhACC achieves pairwise touch classification agreement on 99.5% of video frames, equal to between-human agreement. Finally, we offer a custom retraining interface to allow model customization on a small subset of data, which was validated on four million frames across 16 single-unit electrophysiology recordings. Including this retraining step, we reduce the human hours required to curate a 100 million frame dataset from ~333 hours to ~6 hours.
zh

[CV-145] Dissecting Bit-Level Scaling Laws in Quantizing Vision Generative Models

【速读】:该论文旨在解决视觉生成模型(Vision Generative Models)在量化(Quantization)过程中的性能差异问题,特别是对比了扩散式模型(Diffusion-style Models)和语言式模型(Language-style Models)在不同量化设置下的表现。研究发现,尽管两者在全精度(Full Precision)下性能相当,但在量化后,语言式模型表现更优,主要归因于其离散表示空间对信息丢失的容忍度更高。论文进一步提出,提升量化视觉生成模型的比特级扩展律(Bit-level Scaling Laws)具有挑战性,而模型蒸馏(Model Distillation)是一种有效的方法。为此,作者提出了TopKLD方法,通过平衡蒸馏过程中的“隐式知识”(Implicit Knowledge)和“显式知识”(Explicit Knowledge),优化知识转移,从而在整数和浮点量化设置下提升比特级扩展律。
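
TopKLD“把教师分布拆成 Top-K(显式知识)与其余类别(隐式知识)分别蒸馏、再加权平衡”的思路,可以用如下草图示意(这是按论文描述写出的非官方极简实现,温度平方缩放等细节从简):

```python
import torch
import torch.nn.functional as F

def topkld_sketch(student_logits, teacher_logits, k=5, T=2.0, alpha=0.5):
    t = F.softmax(teacher_logits / T, dim=-1)
    s = F.log_softmax(student_logits / T, dim=-1)
    topk = t.topk(k, dim=-1).indices
    mask = torch.zeros_like(t).scatter(-1, topk, 1.0)      # Top-K 类别的掩码
    kl = t * (t.clamp_min(1e-8).log() - s)                 # 元素级 KL 贡献
    explicit = (kl * mask).sum(-1).mean()                  # "显式知识":Top-K 部分
    implicit = (kl * (1 - mask)).sum(-1).mean()            # "隐式知识":其余类别
    return alpha * explicit + (1 - alpha) * implicit

loss = topkld_sketch(torch.randn(8, 1024), torch.randn(8, 1024))
print(loss)
```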

链接: https://arxiv.org/abs/2501.06218
作者: Xin Ding,Shijie Cao,Ting Cao,Zhibo Chen
机构: Microsoft Research(微软研究院); Beijing Zhongke Research Institue(北京中科研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision generative models have recently made significant advancements along two primary paradigms: diffusion-style and language-style, both of which have demonstrated excellent scaling laws. Quantization is crucial for efficiently deploying these models, as it reduces memory and computation costs. In this work, we systematically investigate the impact of quantization on these two paradigms. Surprisingly, despite achieving comparable performance in full precision, language-style models consistently outperform diffusion-style models across various quantization settings. This observation suggests that language-style models have superior bit-level scaling laws, offering a better tradeoff between model quality and total bits. To dissect this phenomenon, we conduct extensive experiments and find that the primary reason is the discrete representation space of language-style models, which is more tolerant of information loss during quantization. Furthermore, our analysis indicates that improving the bit-level scaling law of quantized vision generative models is challenging, with model distillation identified as a highly effective approach. Specifically, we propose TopKLD to optimize the transfer of distilled knowledge by balancing "implicit knowledge" and "explicit knowledge" during the distillation process. This approach elevates the bit-level scaling laws by one level across both integer and floating-point quantization settings.
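The abstract does not spell out the TopKLD formula, so the following is only a plausible sketch: a distillation loss that treats the teacher's top-K probability mass as "explicit knowledge" and the remaining tail as "implicit knowledge", blending KL terms over the two. The function name, masking scheme, and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def topk_kld(student_logits, teacher_logits, k=50, alpha=0.5, tau=1.0):
    """Blend a KL term over the teacher's top-k classes ("explicit")
    with a KL term over the remaining classes ("implicit")."""
    t = F.softmax(teacher_logits / tau, dim=-1)
    log_s = F.log_softmax(student_logits / tau, dim=-1)
    topk_idx = t.topk(k, dim=-1).indices
    mask = torch.zeros_like(t).scatter_(-1, topk_idx, 1.0)
    pointwise_kl = t * (t.clamp_min(1e-9).log() - log_s)   # elementwise KL terms
    explicit = (pointwise_kl * mask).sum(-1).mean()
    implicit = (pointwise_kl * (1.0 - mask)).sum(-1).mean()
    return alpha * explicit + (1.0 - alpha) * implicit

s = torch.randn(4, 1024, requires_grad=True)   # student logits over a codebook
t = torch.randn(4, 1024)                       # teacher logits
loss = topk_kld(s, t)
loss.backward()
print(float(loss))
```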

[CV-146] Understanding colors of Dufaycolor: Can we recover them using historical colorimetric and spectral data?

【Quick Read】: This paper addresses how to accurately reconstruct the original colors of additive color photography, focusing on the Dufaycolor process produced from 1935 to the late 1950s. The key to the solution is the development of an open-source Color-Screen tool that incorporates historical measurements of the dyes used to manufacture the color-screen filter (réseau). By drawing on these historical data, the researchers can recover the original colors of additive color photographs more accurately.

Link: https://arxiv.org/abs/2501.06216
Authors: Jan Hubička,Linda Kimrová,Melichar Konečný
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 8 pages, 6 figures, 4 tables; submitted to proceedings of 3rd international conference on “Colour Photography and Film: analysis, preservation, and conservation of analogue and digital materials”,

Click to view abstract

Abstract:Dufaycolor, an additive color photography process produced from 1935 to the late 1950s, represents one of the most advanced iterations of this technique. This paper presents ongoing research and development of an open-source Color-Screen tool designed to reconstruct the original colors of additive color photographs. We discuss the incorporation of historical measurements of dyes used in the production of the color-screen filter (réseau) to achieve accurate color recovery.

[CV-147] Path Space Partitioning and Guided Image Sampling for MCMC

【Quick Read】: This paper tackles the inefficiency of integrating light paths over one unified path space in rendering algorithms. The key to the solution is partitioning path space: paths produced by a standard Monte Carlo estimator are analyzed to split path space into subspaces, each of which is then integrated with a Markov Chain Monte Carlo (MCMC) estimator. Because integration then happens within sparser subsets of path space, the method also uses guided proposal distributions in image space to improve efficiency. Experiments show that, at an equal sample count, the approach improves image quality over other MCMC integration methods.

Link: https://arxiv.org/abs/2501.06214
Authors: Thomas Bashford-Rogers,Luis Paulo Santos
Affiliations: University of Warwick; Universidade do Minho
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:Rendering algorithms typically integrate light paths over path space. However, integrating over this one unified space is not necessarily the most efficient approach, and we show that partitioning path space and integrating each of these partitioned spaces with a separate estimator can have advantages. We propose an approach for partitioning path space based on analyzing paths from a standard Monte Carlo estimator and integrating these partitioned path spaces using a Markov Chain Monte Carlo (MCMC) estimator. This also means that integration happens within a sparser subset of path space, so we propose the use of guided proposal distributions in image space to improve efficiency. We show that our method improves image quality over other MCMC integration approaches at the same number of samples.

[CV-148] Implicit Neural Representations for Registration of Left Ventricle Myocardium During a Cardiac Cycle

【Quick Read】: This paper addresses modeling the motion of the left ventricle myocardium (LVmyo) over the cardiac cycle, which is essential for assessing cardiac function. Traditional CNN-based deep learning approaches to deformable image registration (DIR) usually demand substantial memory and compute. The paper proposes an efficient solution based on implicit neural representations (INRs), which perform registration by operating on any number of continuous points. To sharpen registration around the LVmyo, the study combines the Hounsfield Unit values of the CT frames with a signed distance field of the LVmyo, guiding the registration while preserving the tissue information in the CT frames. The framework shows high registration accuracy and provides a robust temporal registration method for further analysis of LVmyo motion.

Link: https://arxiv.org/abs/2501.07248
Authors: Mathias Micheelsen Lowes,Jonas Jalili Pedersen,Bjørn S. Hansen,Klaus Fuglsang Kofoed,Maxime Sermesant,Rasmus R. Paulsen
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, STACOM 2024

Click to view abstract

Abstract:Understanding the movement of the left ventricle myocardium (LVmyo) during the cardiac cycle is essential for assessing cardiac function. One way to model this movement is through a series of deformable image registrations (DIRs) of the LVmyo. Traditional deep learning methods for DIRs, such as those based on convolutional neural networks, often require substantial memory and computational resources. In contrast, implicit neural representations (INRs) offer an efficient approach by operating on any number of continuous points. This study extends the use of INRs for DIR to cardiac computed tomography (CT), focusing on LVmyo registration. To enhance the precision of the registration around the LVmyo, we incorporate the signed distance field of the LVmyo with the Hounsfield Unit values from the CT frames. This guides the registration of the LVmyo, while keeping the tissue information from the CT frames. Our framework demonstrates high registration accuracy and provides a robust method for temporal registration that facilitates further analysis of LVmyo motion.
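A minimal sketch of the INR idea follows: a coordinate MLP maps a continuous point plus a cycle phase to a displacement, and can be queried at arbitrary points instead of a full voxel grid. The architecture sizes, the placeholder HU and signed-distance samplers, and the loss are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DisplacementINR(nn.Module):
    """MLP: (x, y, z, t) -> (dx, dy, dz). Queried at arbitrary
    continuous coordinates instead of a fixed voxel grid."""
    def __init__(self, hidden: int = 128, layers: int = 4):
        super().__init__()
        dims = [4] + [hidden] * layers + [3]
        blocks = []
        for i in range(len(dims) - 1):
            blocks.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                blocks.append(nn.ReLU())
        self.net = nn.Sequential(*blocks)

    def forward(self, coords_t: torch.Tensor) -> torch.Tensor:
        return self.net(coords_t)

inr = DisplacementINR()
pts = torch.rand(4096, 4)            # random (x, y, z, t) samples in [0, 1]
moved = pts[:, :3] + inr(pts)        # warped positions

# Toy similarity loss: in the paper's spirit one would compare HU values
# and the LVmyo signed-distance field at fixed vs. moved points (the two
# sampling functions below are hypothetical stand-ins for those fields).
def sample_hu(p): return torch.sin(10 * p).sum(-1)       # placeholder HU field
def sample_sdf(p): return (p - 0.5).norm(dim=-1) - 0.3   # placeholder SDF
loss = ((sample_hu(moved) - sample_hu(pts[:, :3])) ** 2).mean() \
     + ((sample_sdf(moved) - sample_sdf(pts[:, :3])) ** 2).mean()
loss.backward()
```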

[CV-149] Lung Cancer detection using Deep Learning

【Quick Read】: This paper targets early detection of lung cancer, in particular distinguishing benign from malignant tumors to improve diagnostic accuracy. The key to the solution is a hybrid model combining Convolutional Neural Networks (CNNs) and Support Vector Machines (SVMs), trained on a dataset of computed tomography (CT) scans for early detection. Such deep-learning-based hybrid models are considered a cutting-edge approach to early lung cancer detection.

Link: https://arxiv.org/abs/2501.07197
Authors: Aryan Chaudhari,Ankush Singh,Sanchi Gajbhiye,Pratham Agrawal
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In this paper we discuss lung cancer detection using a hybrid model of Convolutional Neural Networks (CNNs) and Support Vector Machines (SVMs) for early detection of tumors, benign or malignant. The work trains this hybrid model on a dataset of Computed Tomography scans (CT scans). Using deep learning for detecting lung cancer early is a cutting-edge method.
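One common realization of the CNN+SVM hybrid summarized above is to use a small CNN as a feature extractor and fit an SVM on its pooled activations; the sketch below shows that wiring with random stand-ins for CT slices (all shapes and layers are assumptions).

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

class SliceEncoder(nn.Module):
    """Tiny CNN turning a 1-channel CT slice into a feature vector."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
    def forward(self, x):
        return self.features(x).flatten(1)

encoder = SliceEncoder().eval()
slices = torch.randn(40, 1, 64, 64)          # stand-in CT slices
labels = torch.randint(0, 2, (40,))          # 0 = benign, 1 = malignant
with torch.no_grad():
    feats = encoder(slices).numpy()
svm = SVC(kernel="rbf").fit(feats, labels.numpy())   # SVM on CNN features
print(svm.predict(feats[:5]))
```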

[CV-150] MSV-Mamba: A Multiscale Vision Mamba Network for Echocardiography Segmentation

【Quick Read】: This paper addresses the challenges that ultrasound imaging poses for segmentation: high noise levels, reduced spatiotemporal resolution, and complex anatomical structures, all of which hinder a model's ability to capture and analyze structural relationships and dynamic patterns across cardiac regions. It proposes a U-shaped deep learning model for echocardiographic segmentation that combines a large-window Mamba scale (LMS) module with hierarchical feature fusion. The key elements are: first, a cascaded residual block serves as the encoder to progressively extract multiscale detailed features; second, a large-window multiscale Mamba module is integrated into the decoder to capture global dependencies across regions and strengthen segmentation of complex anatomy; in addition, the model introduces auxiliary losses at each decoder layer and uses a dual attention mechanism to fuse multilayer features both spatially and across channels, improving segmentation performance and accuracy. Experiments on the EchoNet-Dynamic and CAMUS datasets show that the model outperforms other methods, with notable accuracy gains on segmentation of the left ventricular endocardium (LV_endo) and epicardium (LV_epi).

Link: https://arxiv.org/abs/2501.07120
Authors: Xiaoxian Yang,Qi Wang,Kaiqi Zhang,Ke Wei,Jun Lyu,Lingchao Chen
Affiliations: School of Computer and Information Engineering, Shanghai Polytechnic University; Department of Clinical Research, The First Affiliated Hospital of Jinan University; Department of Neurosurgery, Fudan University Huashan Hospital
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Ultrasound imaging frequently encounters challenges, such as those related to elevated noise levels, diminished spatiotemporal resolution, and the complexity of anatomical structures. These factors significantly hinder the model’s ability to accurately capture and analyze structural relationships and dynamic patterns across various regions of the heart. Mamba, an emerging model, is one of the most cutting-edge approaches that is widely applied to diverse vision and language tasks. To this end, this paper introduces a U-shaped deep learning model incorporating a large-window Mamba scale (LMS) module and a hierarchical feature fusion approach for echocardiographic segmentation. First, a cascaded residual block serves as an encoder and is employed to incrementally extract multiscale detailed features. Second, a large-window multiscale mamba module is integrated into the decoder to capture global dependencies across regions and enhance the segmentation capability for complex anatomical structures. Furthermore, our model introduces auxiliary losses at each decoder layer and employs a dual attention mechanism to fuse multilayer features both spatially and across channels. This approach enhances segmentation performance and accuracy in delineating complex anatomical structures. Finally, the experimental results using the EchoNet-Dynamic and CAMUS datasets demonstrate that the model outperforms other methods in terms of both accuracy and robustness. For the segmentation of the left ventricular endocardium (LV_endo), the model achieved optimal values of 95.01 and 93.36, respectively, while for the left ventricular epicardium (LV_epi), values of 87.35 and 87.80, respectively, were achieved. This represents an improvement ranging between 0.54 and 1.11 compared with the best-performing model.

[CV-151] A Multi-Modal Deep Learning Framework for Pan-Cancer Prognosis

【Quick Read】: This paper addresses the limited generalization and under-utilization of multi-modal data in existing prognostic models, which are typically built per cancer type and use only certain modalities (e.g., histopathology WSIs and gene expression profiles), resulting in weak generalization ability. To address this, a deep-learning model named UMPSNet is proposed. Its key components are: (1) encoders for histopathology images and genomic expression profiles, plus the integration of four important types of metadata (demographic information, cancer type, treatment protocols, and diagnosis results) into text templates processed by a text encoder; (2) an optimal transport (OT)-based attention mechanism to align and fuse features across modalities; and (3) a guided soft mixture of experts (GMoE) mechanism to handle distribution differences among multiple cancer datasets. By exploiting multi-modal patient data with joint training, UMPSNet achieves strong generalization and predictive performance across multiple cancer types.

Link: https://arxiv.org/abs/2501.07016
Authors: Binyu Zhang,Shichao Li,Junpeng Jian,Zhu Meng,Limei Guo,Zhicheng Zhao
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The prognostic task is of great importance as it is closely related to the survival analysis of patients, the optimization of treatment plans and the allocation of resources. The existing prognostic models have shown promising results on specific datasets, but there are limitations in two aspects. On the one hand, they merely explore certain types of modal data, such as patient histopathology WSI and gene expression analysis. On the other hand, they adopt the per-cancer-per-model paradigm, which means the trained models can only predict the prognostic effect of a single type of cancer, resulting in weak generalization ability. In this paper, a deep-learning based model, named UMPSNet, is proposed. Specifically, to comprehensively understand the condition of patients, in addition to constructing encoders for histopathology images and genomic expression profiles respectively, UMPSNet further integrates four types of important meta data (demographic information, cancer type information, treatment protocols, and diagnosis results) into text templates, and then introduces a text encoder to extract textual features. In addition, the optimal transport (OT)-based attention mechanism is utilized to align and fuse features of different modalities. Furthermore, a guided soft mixture of experts (GMoE) mechanism is introduced to effectively address the issue of distribution differences among multiple cancer datasets. By incorporating the multi-modality of patient data and joint training, UMPSNet outperforms all SOTA approaches, and moreover, it demonstrates the effectiveness and generalization ability of the proposed learning paradigm of a single model for multiple cancer types. The code of UMPSNet is available at this https URL.
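The abstract names an optimal-transport-based attention for aligning modalities without giving details; a standard building block for this is Sinkhorn normalization, which turns a pairwise cost matrix into a transport plan usable as attention weights. The sketch below shows only that building block, with assumed token shapes.

```python
import torch

def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.1, iters: int = 50):
    """Entropic-OT transport plan between two uniform marginals.
    cost: (n, m) pairwise cost between token sets of two modalities."""
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.size(0)) / cost.size(0)
    v = torch.ones(cost.size(1)) / cost.size(1)
    a, b = u.clone(), v.clone()
    for _ in range(iters):                  # alternating marginal scaling
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a[:, None] * K * b[None, :]      # transport plan, rows act as attention

img_tokens = torch.randn(16, 64)            # e.g. histopathology features
gene_tokens = torch.randn(10, 64)           # e.g. genomic features
cost = torch.cdist(img_tokens, gene_tokens)
plan = sinkhorn_plan(cost)
fused = (plan / plan.sum(1, keepdim=True)) @ gene_tokens  # OT-weighted mix
print(fused.shape)                           # torch.Size([16, 64])
```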

[CV-152] Super-Resolution of 3D Micro-CT Images Using Generative Adversarial Networks: Enhancing Resolution and Segmentation Accuracy

【Quick Read】: This paper addresses insufficient resolution and inaccurate segmentation of micro-computed tomography (micro-CT) images in rock analysis, where overlapping X-ray attenuation among rock minerals and phases leads to segmentation errors. The proposed solution is a machine-learning generative model, a 3D Deep Convolutional Wasserstein GAN with Gradient Penalty (3D DC WGAN-GP), trained on segmented low-resolution 3D micro-CT images and unpaired 2D high-resolution Laser Scanning Microscope (LSM) images. It achieves an eightfold (8x) resolution enhancement and markedly improves segmentation of constituent minerals and pore space, producing high-quality 3D images at 0.4375 micro-m/voxel and demonstrating its potential for digital rock physics.

Link: https://arxiv.org/abs/2501.06939
Authors: Evgeny Ugolkov,Xupeng He,Hyung Kwak,Hussein Hoteit
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages, 9 figures

Click to view abstract

Abstract:We develop a procedure for substantially improving the quality of segmented 3D micro-Computed Tomography (micro-CT) images of rocks with a Machine Learning (ML) Generative Model. The proposed model enhances the resolution eightfold (8x) and addresses segmentation inaccuracies due to the overlapping X-ray attenuation in micro-CT measurement for different rock minerals and phases. The proposed generative model is a 3D Deep Convolutional Wasserstein Generative Adversarial Network with Gradient Penalty (3D DC WGAN-GP). The algorithm is trained on segmented 3D low-resolution micro-CT images and segmented unpaired complementary 2D high-resolution Laser Scanning Microscope (LSM) images. The algorithm was demonstrated on multiple samples of Berea sandstones. We achieved high-quality super-resolved 3D images with a resolution of 0.4375 micro-m/voxel and accurate segmentation for constituting minerals and pore space. The described procedure can significantly expand the modern capabilities of digital rock physics.
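The defining ingredient of the WGAN-GP objective used here is the gradient penalty evaluated at random interpolates between real and generated samples; a minimal 3D version is sketched below (the critic architecture and tensor sizes are placeholders, not the paper's network).

```python
import torch
import torch.nn as nn

critic = nn.Sequential(                     # stand-in 3D critic
    nn.Conv3d(1, 8, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Flatten(), nn.LazyLinear(1),
)

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP term: penalize ||grad critic(x_hat)|| deviating from 1
    at random interpolates x_hat between real and fake volumes."""
    eps = torch.rand(real.size(0), 1, 1, 1, 1)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)[0]
    norms = grads.flatten(1).norm(2, dim=1)
    return lam * ((norms - 1.0) ** 2).mean()

real = torch.randn(2, 1, 16, 16, 16)        # segmented micro-CT patches
fake = torch.randn(2, 1, 16, 16, 16)        # generator output
gp = gradient_penalty(critic, real, fake)
gp.backward()
print(float(gp))
```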

[CV-153] Driver Age and Its Effect on Key Driving Metrics: Insights from Dynamic Vehicle Data

【Quick Read】: This paper examines the real-world driving behavior of senior drivers (65 and older), and how age-related behavioral changes affect driving safety. Drivers over 70 have higher crash death rates than drivers in their forties and fifties, so effective safety interventions for this demographic are essential. The study leverages Naturalistic Driving Data (NDD) to analyze driving performance measures, in particular speed limit adherence on interstates and deceleration at stop intersections, both of which may be influenced by age-related declines. The key element of the solution is to use Cumulative Distribution Functions (CDFs) to establish benchmarks for key driving behaviors of senior and younger drivers, and then to reveal significant differences in driving patterns through anomaly detection, benchmark comparison, and accuracy evaluation. This gives Advanced Driver Assistance Systems (ADAS) a basis for tailored interventions grounded in age-specific driving behavior, improving safety.

Link: https://arxiv.org/abs/2501.06918
Authors: Aparna Joshi,Kojo Adugyamfi,Jennifer Merickel,Pujitha Gunaratne,Anuj Sharma
Affiliations: Unknown
Subjects: Methodology (stat.ME); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 9 figures, 4 Tables, 104th TRB Annual Meeting 2025, Washington DC

Click to view abstract

Abstract:By 2030, the senior population aged 65 and older is expected to increase by over 50%, significantly raising the number of older drivers on the road. Drivers over 70 face higher crash death rates compared to those in their forties and fifties, underscoring the importance of developing more effective safety interventions for this demographic. Although the impact of aging on driving behavior has been studied, there is limited research on how these behaviors translate into real-world driving scenarios. This study addresses this need by leveraging Naturalistic Driving Data (NDD) to analyze driving performance measures - specifically, speed limit adherence on interstates and deceleration at stop intersections, both of which may be influenced by age-related declines. Using NDD, we developed Cumulative Distribution Functions (CDFs) to establish benchmarks for key driving behaviors among senior and young drivers. Our analysis, which included anomaly detection, benchmark comparisons, and accuracy evaluations, revealed significant differences in driving patterns primarily related to speed limit adherence at 75mph. While our approach shows promising potential for enhancing Advanced Driver Assistance Systems (ADAS) by providing tailored interventions based on age-specific adherence to speed limit driving patterns, we recognize the need for additional data to refine and validate metrics for other driving behaviors. By establishing precise benchmarks for various driving performance metrics, ADAS can effectively identify anomalies, such as abrupt deceleration, which may indicate impaired driving or other safety concerns. This study lays a strong foundation for future research aimed at improving safety interventions through detailed driving behavior analysis.
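Benchmarking a driving metric with CDFs, as described above, boils down to comparing empirical distribution functions between age groups; below is a compact sketch with synthetic speed data (the distributions and the way the 75 mph threshold is read off are illustrative, not the study's data).

```python
import numpy as np

def ecdf(x):
    """Empirical CDF: sorted values and cumulative probabilities."""
    xs = np.sort(x)
    return xs, np.arange(1, len(xs) + 1) / len(xs)

rng = np.random.default_rng(1)
young = rng.normal(76.0, 3.0, 500)      # synthetic interstate speeds (mph)
senior = rng.normal(72.0, 4.0, 500)

for name, sample in [("young", young), ("senior", senior)]:
    xs, ps = ecdf(sample)
    over = 1.0 - np.interp(75.0, xs, ps)    # fraction above a 75 mph limit
    print(f"{name}: {over:.1%} of samples above 75 mph")

# A simple anomaly flag: observations in a group's extreme 1% tail.
thresh = np.quantile(senior, 0.99)
print("senior anomaly threshold:", round(thresh, 1), "mph")
```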

[CV-154] Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution CVPR2025

【Quick Read】: This paper targets arbitrary-scale super-resolution (ASR), where implicit neural representation (INR) is limited by the restricted receptive field of the linear layers in its multi-layer perceptron (MLP) and by the high cost of querying the MLP once per pixel. It proposes a new approach based on Gaussian Splatting (GS), addressing the key obstacle that GS is originally an optimization-based method that overfits each scene, while ASR requires a single model that generalizes to different images and scaling factors.

The solution rests on two techniques. First, a novel architecture predicts, in a feed-forward manner, the image-conditioned Gaussians corresponding to the input low-resolution image. Second, an efficient differentiable 2D GPU/CUDA-based scale-aware rasterization renders super-resolved images by sampling discrete RGB values from the predicted continuous Gaussians. After end-to-end training, the resulting network, GSASR, performs ASR for arbitrary images and unseen scaling factors, and experiments validate the method's effectiveness.

Link: https://arxiv.org/abs/2501.06838
Authors: Du Chen,Liyi Chen,Zhengqiang Zhang,Lei Zhang
Affiliations: The Hong Kong Polytechnic University; OPPO Research Institute
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to CVPR 2025

Click to view abstract

Abstract:Equipped with the continuous representation capability of Multi-Layer Perceptron (MLP), Implicit Neural Representation (INR) has been successfully employed for Arbitrary-scale Super-Resolution (ASR). However, the limited receptive field of the linear layers in MLP restricts the representation capability of INR, while it is computationally expensive to query the MLP numerous times to render each pixel. Recently, Gaussian Splatting (GS) has shown its advantages over INR in both visual quality and rendering speed in 3D tasks, which motivates us to explore whether GS can be employed for the ASR task. However, directly applying GS to ASR is exceptionally challenging because the original GS is an optimization-based method through overfitting each single scene, while in ASR we aim to learn a single model that can generalize to different images and scaling factors. We overcome these challenges by developing two novel techniques. Firstly, to generalize GS for ASR, we elaborately design an architecture to predict the corresponding image-conditioned Gaussians of the input low-resolution image in a feed-forward manner. Secondly, we implement an efficient differentiable 2D GPU/CUDA-based scale-aware rasterization to render super-resolved images by sampling discrete RGB values from the predicted contiguous Gaussians. Via end-to-end training, our optimized network, namely GSASR, can perform ASR for any image and unseen scaling factors. Extensive experiments validate the effectiveness of our proposed method. The project page can be found at \urlthis https URL.
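The core rendering step, sampling RGB at arbitrary output resolutions from a sum of predicted 2D Gaussians, can be written densely in a few lines. This naive sketch evaluates every Gaussian at every pixel; the paper's CUDA rasterizer is far more efficient, and the parameterization below is an assumption.

```python
import torch

def splat_2d(means, covs_inv, colors, weights, H, W):
    """Render an (H, W, 3) image from N 2D Gaussians.
    means: (N, 2) in [0, 1]^2; covs_inv: (N, 2, 2); colors: (N, 3)."""
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)       # (HW, 2)
    d = pix[:, None, :] - means[None, :, :]                  # (HW, N, 2)
    mahal = torch.einsum("pni,nij,pnj->pn", d, covs_inv, d)  # squared distances
    resp = weights * torch.exp(-0.5 * mahal)                 # per-pixel responses
    img = resp @ colors / resp.sum(-1, keepdim=True).clamp_min(1e-8)
    return img.reshape(H, W, 3)

N = 64
means = torch.rand(N, 2)
covs_inv = torch.eye(2).expand(N, 2, 2) * 400.0              # tight isotropic blobs
colors = torch.rand(N, 3)
weights = torch.rand(N)
print(splat_2d(means, covs_inv, colors, weights, 48, 48).shape)
```

Because the Gaussians are continuous, the same set can be rasterized at any H and W, which is what makes the representation a natural fit for arbitrary-scale output.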

[CV-155] Wavelet Integrated Convolutional Neural Network for ECG Signal Denoising

【Quick Read】: This paper addresses the strong noise distortion encountered in wearable electrocardiogram (ECG) measurement with dry electrodes. Because the frequency bands of the ECG signal and the noise overlap, conventional denoising struggles, so a mechanism is needed that adapts to the noise's intensity and type. The proposed solution is a convolutional neural network (CNN) with an added wavelet transform layer that extracts the characteristic frequency features of a clean ECG, allowing effective noise reduction. Experiments over a signal-to-noise ratio (SNR) range of -10 to 10 show strong performance, especially at low SNR.

Link: https://arxiv.org/abs/2501.06724
Authors: Takamasa Terada,Masahiro Toyoura
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Wearable electrocardiogram (ECG) measurement using dry electrodes has a problem with high-intensity noise distortion. Hence, a robust noise reduction method is required. However, overlapping frequency bands of ECG and noise make noise reduction difficult. Hence, it is necessary to provide a mechanism that changes the characteristics of the noise based on its intensity and type. This study proposes a convolutional neural network (CNN) model with an additional wavelet transform layer that extracts the specific frequency features in a clean ECG. Testing confirms that the proposed method effectively predicts accurate ECG behavior with reduced noise by accounting for all frequency domains. In an experiment, noisy signals in the signal-to-noise ratio (SNR) range of -10 to 10 are evaluated, demonstrating that the efficiency of the proposed method is higher when the SNR is small.
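A wavelet-augmented CNN of the kind described can be sketched by decomposing the input with a discrete wavelet transform and letting convolutions operate per sub-band. PyWavelets supplies the DWT here; the toy architecture, wavelet choice, and padding scheme are assumptions rather than the paper's design.

```python
import numpy as np
import pywt
import torch
import torch.nn as nn

def dwt_features(signal: np.ndarray, wavelet: str = "db4", level: int = 3):
    """Multi-level DWT; pad sub-bands to equal length and stack as channels."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    width = max(len(c) for c in coeffs)
    bands = [np.pad(c, (0, width - len(c))) for c in coeffs]
    return torch.tensor(np.stack(bands), dtype=torch.float32)  # (level+1, width)

denoiser = nn.Sequential(                    # toy sub-band CNN
    nn.Conv1d(4, 16, 5, padding=2), nn.ReLU(),
    nn.Conv1d(16, 4, 5, padding=2),
)

noisy_ecg = np.sin(np.linspace(0, 20 * np.pi, 512)) + 0.3 * np.random.randn(512)
bands = dwt_features(noisy_ecg)[None]        # (1, 4, width)
cleaned_bands = denoiser(bands)              # learnable per-band filtering
print(cleaned_bands.shape)
```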

[CV-156] CNN-powered micro- to macro-scale flow modeling in deformable porous media

【Quick Read】: This paper addresses prediction of the macroscopic intrinsic permeability tensor in deformable porous media. Traditional approaches rely on time-consuming experiments or computationally expensive fluid dynamics simulations; this work proposes an efficient machine-learning (ML) alternative. The key is to use a convolutional neural network (CNN) to predict pore-fluid flow behavior under deformation and anisotropic flow conditions: binarized CT images of the porous microstructure serve as input to predict the symmetric second-order permeability tensor, a critical parameter in continuum porous media flow modeling. The method has four main steps: building a dataset of Bentheim sandstone CT images at different volumetric strain levels; generating permeability data via pore-scale single-phase flow simulations with the lattice Boltzmann method (LBM); training the CNN with CT images as input and permeability tensors as output; and exploring techniques such as data augmentation and alternative CNN architectures to improve generalization. The results show the CNN can accurately predict the permeability tensor, which matters across disciplines such as geotechnical engineering, hydrology, and materials science.

Link: https://arxiv.org/abs/2501.06466
Authors: Yousef Heider,Fadi Aldakheel,Wolfgang Ehlers
Affiliations: Unknown
Subjects: Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 12 figures, research paper

Click to view abstract

Abstract:This work introduces a novel application for predicting the macroscopic intrinsic permeability tensor in deformable porous media, using a limited set of micro-CT images of real microgeometries. The primary goal is to develop an efficient, machine-learning (ML)-based method that overcomes the limitations of traditional permeability estimation techniques, which often rely on time-consuming experiments or computationally expensive fluid dynamics simulations. The novelty of this work lies in leveraging Convolutional Neural Networks (CNN) to predict pore-fluid flow behavior under deformation and anisotropic flow conditions. Particularly, the described approach employs binarized CT images of porous micro-structure as inputs to predict the symmetric second-order permeability tensor, a critical parameter in continuum porous media flow modeling. The methodology comprises four key steps: (1) constructing a dataset of CT images from Bentheim sandstone at different volumetric strain levels; (2) performing pore-scale simulations of single-phase flow using the lattice Boltzmann method (LBM) to generate permeability data; (3) training the CNN model with the processed CT images as inputs and permeability tensors as outputs; and (4) exploring techniques to improve model generalization, including data augmentation and alternative CNN architectures. Examples are provided to demonstrate the CNN’s capability to accurately predict the permeability tensor, a crucial parameter in various disciplines such as geotechnical engineering, hydrology, and material science. An exemplary source code is made available for interested readers.
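Regressing a symmetric second-order tensor with a CNN amounts to predicting its six independent components and reassembling the matrix; below is a small 3D-CNN sketch under assumed shapes (the network is illustrative, not the paper's architecture).

```python
import torch
import torch.nn as nn

class PermeabilityCNN(nn.Module):
    """3D CNN regressing the 6 independent components of a
    symmetric permeability tensor from a binarized CT volume."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, 6),
        )

    def forward(self, vol):
        k = self.body(vol)                 # (B, 6): kxx kyy kzz kxy kxz kyz
        B = k.size(0)
        K = torch.zeros(B, 3, 3)
        idx = [(0, 0), (1, 1), (2, 2), (0, 1), (0, 2), (1, 2)]
        for c, (i, j) in enumerate(idx):
            K[:, i, j] = k[:, c]
            K[:, j, i] = k[:, c]           # enforce symmetry by construction
        return K

vols = (torch.rand(2, 1, 32, 32, 32) > 0.6).float()   # binarized pore space
print(PermeabilityCNN()(vols).shape)                  # torch.Size([2, 3, 3])
```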

[CV-157] Ultrasound Image Synthesis Using Generative AI for Lung Ultrasound Detection

【Quick Read】: This paper addresses the problem that, when training data are imbalanced, medical AI models plateau on prevalent classes while underperforming on rare cases. To overcome this, the authors propose DiffUltra, a generative AI technique that synthesizes realistic Lung Ultrasound (LUS) images with extensive lesion variability. The key is the introduced Lesion-anatomy Bank, which captures structural and positional properties of lesions from real patient data to guide image generation. With this approach, DiffUltra improves lung consolidation detection by 5.6% in average precision (AP) over models trained solely on real patient data, and it increases data diversity and the prevalence of rare cases, yielding a 25% AP improvement in detecting large lung consolidations, which make up only 10% of the dataset.

Link: https://arxiv.org/abs/2501.06356
Authors: Yu-Cheng Chou,Gary Y. Li,Li Chen,Mohsen Zahiri,Naveen Balaraju,Shubham Patil,Bryson Hicks,Nikolai Schnittke,David O. Kessler,Jeffrey Shupp,Maria Parker,Cristiana Baloescu,Christopher Moore,Cynthia Gregory,Kenton Gregory,Balasundar Raju,Jochen Kruecker,Alvin Chen
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ISBI 2025

Click to view abstract

Abstract:Developing reliable healthcare AI models requires training with representative and diverse data. In imbalanced datasets, model performance tends to plateau on the more prevalent classes while remaining low on less common cases. To overcome this limitation, we propose DiffUltra, the first generative AI technique capable of synthesizing realistic Lung Ultrasound (LUS) images with extensive lesion variability. Specifically, we condition the generative AI by the introduced Lesion-anatomy Bank, which captures the lesion’s structural and positional properties from real patient data to guide the image generation. We demonstrate that DiffUltra improves consolidation detection by 5.6% in AP compared to the models trained solely on real patient data. More importantly, DiffUltra increases data diversity and prevalence of rare cases, leading to a 25% AP improvement in detecting rare instances such as large lung consolidations, which make up only 10% of the dataset.

[CV-158] Underwater Image Enhancement using Generative Adversarial Networks: A Survey

【Quick Read】: This paper surveys the key problems of underwater image enhancement, particularly the degradation caused by light attenuation, scattering, and color distortion in underwater environments, which limits use in marine biology, ecosystem monitoring, coral reef health assessment, underwater archaeology, and autonomous underwater vehicle (AUV) navigation. Its focus is Generative Adversarial Networks (GANs), whose ability to learn complex transformations and generate realistic outputs makes them powerful for enhancing underwater photos. The paper analyzes physical and physics-free models, convolutional neural network (CNN)-based models, and state-of-the-art GAN-based methods, covering evaluation metrics, datasets, and loss functions, and discusses limitations of current methods, such as generalization issues, high computational demands, and dataset biases, along with directions for future research.

Link: https://arxiv.org/abs/2501.06273
Authors: Kancharagunta Kishan Babu,Ashreen Tabassum,Bommakanti Navaneeth,Tenneti Jahnavi,Yenka Akshaya
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures, 2 tables

Click to view abstract

Abstract:In recent years, there has been a surge of research focused on underwater image enhancement using Generative Adversarial Networks (GANs), driven by the need to overcome the challenges posed by underwater environments. Issues such as light attenuation, scattering, and color distortion severely degrade the quality of underwater images, limiting their use in critical applications. Generative Adversarial Networks (GANs) have emerged as a powerful tool for enhancing underwater photos due to their ability to learn complex transformations and generate realistic outputs. These advancements have been applied to real-world applications, including marine biology and ecosystem monitoring, coral reef health assessment, underwater archaeology, and autonomous underwater vehicle (AUV) navigation. This paper explores all major approaches to underwater image enhancement, from physical and physics-free models to Convolutional Neural Network (CNN)-based models and state-of-the-art GAN-based methods. It provides a comprehensive analysis of these methods, evaluation metrics, datasets, and loss functions, offering a holistic view of the field. Furthermore, the paper delves into the limitations and challenges faced by current methods, such as generalization issues, high computational demands, and dataset biases, while suggesting potential directions for future research.

[CV-159] Interpretable Auto Window Setting for Deep-Learning-Based CT Analysis

【Quick Read】: This paper addresses the lack of domain-invariant, intuitively interpretable methods for Auto Window Setting in computed tomography (CT) analysis. Although prior work has explored the potential of CT multi-window fusion for enhancing neural networks, such a method has been missing. The key to the proposed solution is a plug-and-play module derived from the Tanh activation function and compatible with mainstream deep learning architectures. Starting from the physical principles of CT, the module adheres to interpretability principles to ensure reliability in medical use. Its domain-invariant design lets the preference decisions of the adaptive mechanism be observed from a clinically intuitive perspective, so the method can be understood not only by neural network experts but also earns greater trust from clinicians. Validation on multiple open-source datasets yields 10%~200% Dice improvements on hard segmentation targets.

Link: https://arxiv.org/abs/2501.06223
Authors: Yiqin Zhang,Meiling Chen,Zhengjie Zhang
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Whether during the early days of popularization or in the present, the window setting in Computed Tomography (CT) has always been an indispensable part of the CT analysis process. Although research has investigated the capabilities of CT multi-window fusion in enhancing neural networks, there remains a paucity of domain-invariant, intuitively interpretable methodologies for Auto Window Setting. In this work, we propose a plug-and-play module originating from the Tanh activation function, which is compatible with mainstream deep learning architectures. Starting from the physical principles of CT, we adhere to the principle of interpretability to ensure the module’s reliability for medical implementations. The domain-invariant design facilitates observation of the preference decisions rendered by the adaptive mechanism from a clinically intuitive perspective. This enables the proposed method to be understood not only by experts in neural networks but also garners higher trust from clinicians. We confirm the effectiveness of the proposed method on multiple open-source datasets, yielding 10%~200% Dice improvements on hard segmentation targets.
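Since the paper only states that the module originates from the Tanh activation function, the following minimal sketch is an assumption about its form: a differentiable soft window with learnable center and width applied to raw Hounsfield units, so the network can pick its own window during training.

```python
import torch
import torch.nn as nn

class SoftWindow(nn.Module):
    """Differentiable CT windowing: y = tanh((HU - center) / width).
    Center and width are trainable, standing in for a hand-set window."""
    def __init__(self, center: float = 40.0, width: float = 200.0):
        super().__init__()
        self.center = nn.Parameter(torch.tensor(center))
        self.width = nn.Parameter(torch.tensor(width))

    def forward(self, hu: torch.Tensor) -> torch.Tensor:
        return torch.tanh((hu - self.center) / self.width.clamp_min(1.0))

ct = torch.randint(-1000, 2000, (1, 1, 64, 64)).float()  # raw HU slice
win = SoftWindow()
out = win(ct)                          # values squashed into (-1, 1)
out.sum().backward()                   # gradients reach center and width
print(win.center.grad is not None, out.min().item(), out.max().item())
```

Inspecting the learned center and width afterwards is what would make the module's "preference decisions" clinically readable.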

Artificial Intelligence

[AI-0] Evaluating Agent-based Program Repair at Google

Link: https://arxiv.org/abs/2501.07531
Authors: Pat Rondon,Renyao Wei,José Cambronero,Jürgen Cito,Aaron Sun,Siddhant Sanyam,Michele Tufano,Satish Chandra
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Agent-based program repair offers to automatically resolve complex bugs end-to-end by combining the planning, tool use, and code generation abilities of modern LLMs. Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench, a collection of bugs from highly-rated GitHub Python projects. In addition, various agentic approaches such as SWE-Agent have been proposed to solve bugs in this benchmark. This paper explores the viability of using an agentic approach to address bugs in an enterprise context. To investigate this, we curate an evaluation set of 178 bugs drawn from Google’s issue tracking system. This dataset spans both human-reported (78) and machine-reported bugs (100). To establish a repair performance baseline on this benchmark, we implement Passerine, an agent similar in spirit to SWE-Agent that can work within Google’s development environment. We show that with 20 trajectory samples and Gemini 1.5 Pro, Passerine can produce a patch that passes bug tests (i.e., plausible) for 73% of machine-reported and 25.6% of human-reported bugs in our evaluation set. After manual examination, we found that 43% of machine-reported bugs and 17.9% of human-reported bugs have at least one patch that is semantically equivalent to the ground-truth patch. These results establish a baseline on an industrially relevant benchmark, which as we show, contains bugs drawn from a different distribution in terms of language diversity, size, and spread of changes, compared to those in the popular SWE-Bench dataset.

[AI-1] The Paradox of Success in Evolutionary and Bioinspired Optimization: Revisiting Critical Issues, Key Studies and Methodological Pathways

Link: https://arxiv.org/abs/2501.07515
Authors: Daniel Molina,Javier Del Ser,Javier Poyatos,Francisco Herrera
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 38 pages, 1 figure

Click to view abstract

Abstract:Evolutionary and bioinspired computation are crucial for efficiently addressing complex optimization problems across diverse application domains. By mimicking processes observed in nature, like evolution itself, these algorithms offer innovative solutions beyond the reach of traditional optimization methods. They excel at finding near-optimal solutions in large, complex search spaces, making them invaluable in numerous fields. However, both areas are plagued by challenges at their core, including inadequate benchmarking, problem-specific overfitting, insufficient theoretical grounding, and superfluous proposals justified only by their biological metaphor. This overview recapitulates and analyzes in depth the criticisms concerning the lack of innovation and rigor in experimental studies within the field. To this end, we examine the judgmental positions of the existing literature in an informed attempt to guide the research community toward directions of solid contribution and advancement in these areas. We summarize guidelines for the design of evolutionary and bioinspired optimizers, the development of experimental comparisons, and the derivation of novel proposals that take a step further in the field. We provide a brief note on automating the process of creating these algorithms, which may help align metaheuristic optimization research with its primary objective (solving real-world problems), provided that our identified pathways are followed. Our conclusions underscore the need for a sustained push towards innovation and the enforcement of methodological rigor in prospective studies to fully realize the potential of these advanced computational techniques.

[AI-2] Inductive Learning of Robot Task Knowledge from Raw Data and Online Expert Feedback

Link: https://arxiv.org/abs/2501.07507
Authors: Daniele Meli,Paolo Fiorini
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:The increasing level of autonomy of robots poses challenges of trust and social acceptance, especially in human-robot interaction scenarios. This requires an interpretable implementation of robotic cognitive capabilities, possibly based on formal methods as logics for the definition of task specifications. However, prior knowledge is often unavailable in complex realistic scenarios. In this paper, we propose an offline algorithm based on inductive logic programming from noisy examples to extract task specifications (i.e., action preconditions, constraints and effects) directly from raw data of few heterogeneous (i.e., not repetitive) robotic executions. Our algorithm leverages on the output of any unsupervised action identification algorithm from video-kinematic recordings. Combining it with the definition of very basic, almost task-agnostic, commonsense concepts about the environment, which contribute to the interpretability of our methodology, we are able to learn logical axioms encoding preconditions of actions, as well as their effects in the event calculus paradigm. Since the quality of learned specifications depends mainly on the accuracy of the action identification algorithm, we also propose an online framework for incremental refinement of task knowledge from user feedback, guaranteeing safe execution. Results in a standard manipulation task and benchmark for user training in the safety-critical surgical robotic scenario, show the robustness, data- and time-efficiency of our methodology, with promising results towards the scalability in more complex domains.

[AI-3] RbRL2.0: Integrated Reward and Policy Learning for Rating-based Reinforcement Learning AAAI2025

Link: https://arxiv.org/abs/2501.07502
Authors: Mingkang Wu,Devin White,Vernon Lawhern,Nicholas R. Waytowich,Yongcan Cao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the Collaborative AI and Modeling of Humans Bridge Program at AAAI 2025

Click to view abstract

Abstract:Reinforcement learning (RL), a common tool in decision making, learns policies from various experiences based on the associated cumulative return/rewards without treating them differently. On the contrary, humans often learn to distinguish from different levels of performance and extract the underlying trends towards improving their decision making for best performance. Motivated by this, this paper proposes a novel RL method that mimics humans’ decision making process by differentiating among collected experiences for effective policy learning. The main idea is to extract important directional information from experiences with different performance levels, named ratings, so that policies can be updated towards desired deviation from these experiences with different ratings. Specifically, we propose a new policy loss function that penalizes distribution similarities between the current policy and failed experiences with different ratings, and assign different weights to the penalty terms based on the rating classes. Meanwhile, reward learning from these rated samples can be integrated with the new policy loss towards an integrated reward and policy learning from rated samples. Optimizing the integrated reward and policy loss function will lead to the discovery of directions for policy improvement towards maximizing cumulative rewards and penalizing most from the lowest performance level while least from the highest performance level. To evaluate the effectiveness of the proposed method, we present results for experiments on a few typical environments that show improved convergence and overall performance over the existing rating-based reinforcement learning method with only reward learning.
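The key idea, penalizing a policy's similarity to experiences from low-rated classes more heavily than to high-rated ones, can be sketched as a weighted sum of divergence terms; the rating weights, distance measure, and data layout below are illustrative assumptions, not the paper's loss.

```python
import torch
import torch.nn.functional as F

def rated_policy_penalty(policy_logits, rated_action_dists, rating_weights):
    """policy_logits: (B, A) current policy logits on sampled states.
    rated_action_dists: dict rating -> (B, A) empirical action
    distributions from experiences with that rating. A larger weight
    penalizes similarity to that rating class more (e.g. failures)."""
    log_pi = F.log_softmax(policy_logits, dim=-1)
    penalty = 0.0
    for rating, dist in rated_action_dists.items():
        # Negative KL(dist || pi) grows when pi matches the rated data,
        # so adding it pushes pi away, scaled by the class weight.
        kl = (dist * (dist.clamp_min(1e-8).log() - log_pi)).sum(-1).mean()
        penalty = penalty + rating_weights[rating] * (-kl)
    return penalty

logits = torch.randn(32, 4, requires_grad=True)
rated = {0: F.softmax(torch.randn(32, 4), -1),   # lowest-rated experiences
         1: F.softmax(torch.randn(32, 4), -1)}
weights = {0: 1.0, 1: 0.25}                      # penalize failures most
loss = rated_policy_penalty(logits, rated, weights)
loss.backward()
print(float(loss))
```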

[AI-4] Data and System Perspectives of Sustainable Artificial Intelligence

Link: https://arxiv.org/abs/2501.07487
Authors: Tao Xie,David Harel,Dezhi Ran,Zhenwen Li,Maoliang Li,Zhi Yang,Leye Wang,Xiang Chen,Ying Zhang,Wentao Zhang,Meng Li,Chen Zhang,Linyi Li,Assaf Marron
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Sustainable AI is a subfield of AI concerned with developing and using AI systems in ways that aim to reduce environmental impact and achieve sustainability. Sustainable AI is increasingly important given that training of and inference with AI models such as large language models consume a large amount of computing power. In this article, we discuss current issues, opportunities and example solutions for addressing these issues, and future challenges to tackle, from the data and system perspectives, related to data acquisition, data processing, and AI model training and inference.

[AI-5] Smart Learning in the 21st Century: Advancing Constructionism Across Three Digital Epochs

Link: https://arxiv.org/abs/2501.07486
Authors: Ilya Levin,Alexei L. Semenov,Mikael Gorsky
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 22 pages

Click to view abstract

Abstract:This article explores the evolution of constructionism as an educational framework, tracing its relevance and transformation across three pivotal eras: the advent of personal computing, the networked society, and the current era of generative AI. Rooted in Seymour Papert's constructionist philosophy, this study examines how constructionist principles align with the expanding role of digital technology in personal and collective learning. We discuss the transformation of educational environments from hierarchical instructionism to constructionist models that emphasize learner autonomy and interactive, creative engagement. Central to this analysis is the concept of an expanded personality, wherein digital tools and AI integration fundamentally reshape individual self-perception and social interactions. By integrating constructionism into the paradigm of smart education, we propose it as a foundational approach to personalized and democratized learning. Our findings underscore constructionism's enduring relevance in navigating the complexities of technology-driven education, providing insights for educators and policymakers seeking to harness digital innovations to foster adaptive, student-centered learning experiences.

[AI-6] Estimating Musical Surprisal in Audio ICASSP2025

Link: https://arxiv.org/abs/2501.07474
Authors: Mathias Rose Bjare,Giorgia Cantisani,Stefan Lattner,Gerhard Widmer
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 2 figures, 1 table. Accepted at the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2025), Hyderabad, India

Click to view abstract

Abstract:In modeling musical surprisal expectancy with computational methods, it has been proposed to use the information content (IC) of one-step predictions from an autoregressive model as a proxy for surprisal in symbolic music. With an appropriately chosen model, the IC of musical events has been shown to correlate with human perception of surprise and complexity aspects, including tonal and rhythmic complexity. This work investigates whether an analogous methodology can be applied to music audio. We train an autoregressive Transformer model to predict compressed latent audio representations of a pretrained autoencoder network. We verify learning effects by estimating the decrease in IC with repetitions. We investigate the mean IC of musical segment types (e.g., A or B) and find that segment types appearing later in a piece have a higher IC than earlier ones on average. We investigate the IC’s relation to audio and musical features and find it correlated with timbral variations and loudness and, to a lesser extent, dissonance, rhythmic complexity, and onset density related to audio and musical features. Finally, we investigate if the IC can predict EEG responses to songs and thus model humans’ surprisal in music. We provide code for our method on this http URL.
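Given an autoregressive model's next-step probabilities, the information content used as the surprisal proxy is just the negative log-probability of the event that actually occurred; a minimal sketch over a toy categorical predictor follows (the codebook framing and sizes are assumptions).

```python
import torch
import torch.nn.functional as F

def information_content(logits: torch.Tensor, targets: torch.Tensor):
    """IC(x_t) = -log p(x_t | x_<t); logits: (T, V), targets: (T,)."""
    log_p = F.log_softmax(logits, dim=-1)
    return -log_p.gather(1, targets[:, None]).squeeze(1)  # nats per step

# Toy stand-in for an autoregressive model's per-step predictions over
# a codebook of V latent audio tokens.
T, V = 8, 256
logits = torch.randn(T, V)
tokens = torch.randint(0, V, (T,))
ic = information_content(logits, tokens)
print(ic)                                   # higher values = more surprising
print("mean IC of segment:", float(ic.mean()))
```

Averaging IC over a musical segment is what allows the segment-type comparisons (e.g., A vs. B sections) described in the abstract.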

[AI-7] A Survey of Embodied AI in Healthcare: Techniques, Applications and Opportunities

Link: https://arxiv.org/abs/2501.07468
Authors: Yihao Liu,Xu Cao,Tingting Chen,Yankai Jiang,Junjie You,Minghua Wu,Xiaosong Wang,Mengling Feng,Yaochu Jin,Jintai Chen
Subjects: Artificial Intelligence (cs.AI)
Comments: 44 pages, 11 figures

Click to view abstract

Abstract:Healthcare systems worldwide face persistent challenges in efficiency, accessibility, and personalization. Powered by modern AI technologies such as multimodal large language models and world models, Embodied AI (EmAI) represents a transformative frontier, offering enhanced autonomy and the ability to interact with the physical world to address these challenges. As an interdisciplinary and rapidly evolving research domain, “EmAI in healthcare” spans diverse fields such as algorithms, robotics, and biomedicine. This complexity underscores the importance of timely reviews and analyses to track advancements, address challenges, and foster cross-disciplinary collaboration. In this paper, we provide a comprehensive overview of the “brain” of EmAI for healthcare, wherein we introduce foundational AI algorithms for perception, actuation, planning, and memory, and focus on presenting the healthcare applications spanning clinical interventions, daily care companionship, infrastructure support, and biomedical research. Despite its promise, the development of EmAI for healthcare is hindered by critical challenges such as safety concerns, gaps between simulation platforms and real-world applications, the absence of standardized benchmarks, and uneven progress across interdisciplinary domains. We discuss the technical barriers and explore ethical considerations, offering a forward-looking perspective on the future of EmAI in healthcare. A hierarchical framework of intelligent levels for EmAI systems is also introduced to guide further development. By providing systematic insights, this work aims to inspire innovation and practical applications, paving the way for a new era of intelligent, patient-centered healthcare.

[AI-8] Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI

Link: https://arxiv.org/abs/2501.07458
Authors: Rolf Pfister,Hansueli Jud
Subjects: Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 15 pages

Click to view abstract

Abstract:OpenAI’s o3 achieves a high score of 87.5 % on ARC-AGI, a benchmark proposed to measure intelligence. This raises the question whether systems based on Large Language Models (LLMs), particularly o3, demonstrate intelligence and progress towards artificial general intelligence (AGI). Building on the distinction between skills and intelligence made by François Chollet, the creator of ARC-AGI, a new understanding of intelligence is introduced: an agent is the more intelligent, the more efficiently it can achieve the more diverse goals in the more diverse worlds with the less knowledge. An analysis of the ARC-AGI benchmark shows that its tasks represent a very specific type of problem that can be solved by massive trialling of combinations of predefined operations. This method is also applied by o3, achieving its high score through the extensive use of computing power. However, for most problems in the physical world and in the human domain, solutions cannot be tested in advance and predefined operations are not available. Consequently, massive trialling of predefined operations, as o3 does, cannot be a basis for AGI - instead, new approaches are required that can reliably solve a wide variety of problems without existing skills. To support this development, a new benchmark for intelligence is outlined that covers a much higher diversity of unknown tasks to be solved, thus enabling a comprehensive assessment of intelligence and of progress towards AGI.

[AI-9] Online inductive learning from answer sets for efficient reinforcement learning exploration

Link: https://arxiv.org/abs/2501.07445
Authors: Celeste Veronese,Daniele Meli,Alessandro Farinelli
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper presents a novel approach combining inductive logic programming with reinforcement learning to improve training performance and explainability. We exploit inductive learning of answer set programs from noisy examples to learn a set of logical rules representing an explainable approximation of the agent policy at each batch of experience. We then perform answer set reasoning on the learned rules to guide the exploration of the learning agent at the next batch, without requiring inefficient reward shaping and preserving optimality with soft bias. The entire procedure is conducted during the online execution of the reinforcement learning algorithm. We preliminarily validate the efficacy of our approach by integrating it into the Q-learning algorithm for the Pac-Man scenario in two maps of increasing complexity. Our methodology produces a significant boost in the discounted return achieved by the agent, even in the first batches of training. Moreover, inductive learning does not compromise the computational time required by Q-learning and learned rules quickly converge to an explanation of the agent policy.

[AI-10] Empirical Evaluation of the Implicit Hitting Set Approach for Weighted CSPs

Link: https://arxiv.org/abs/2501.07432
Authors: Aleksandra Petrova,Javier Larrosa,Emma Rollón
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:SAT technology has proven to be surprisingly effective in a large variety of domains. However, for the Weighted CSP problem dedicated algorithms have always been superior. One approach not well-studied so far is the use of SAT in conjunction with the Implicit Hitting Set approach. In this work, we explore some alternatives to the existing algorithm of reference. The alternatives, mostly borrowed from related boolean frameworks, consider trade-offs for the two main components of the IHS approach: the computation of low-cost hitting vectors, and their transformation into high-cost cores. For each one, we propose 4 levels of intensity. Since we also test the usefulness of cost function merging, our experiments consider 32 different implementations. Our empirical study shows that for WCSP it is not easy to identify the best alternative. Nevertheless, the cost-function merging encoding and extracting maximal cores seems to be a robust approach.

[AI-11] An Investigation into Seasonal Variations in Energy Forecasting for Student Residences

Link: https://arxiv.org/abs/2501.07423
Authors: Muhammad Umair Danish,Mathumitha Sureshkumar,Thanuri Fonseka,Umeshika Uthayakumar,Vinura Galwaduge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This research provides an in-depth evaluation of various machine learning models for energy forecasting, focusing on the unique challenges of seasonal variations in student residential settings. The study assesses the performance of baseline models, such as LSTM and GRU, alongside state-of-the-art forecasting methods, including Autoregressive Feedforward Neural Networks, Transformers, and hybrid approaches. Special attention is given to predicting energy consumption amidst challenges like seasonal patterns, vacations, meteorological changes, and irregular human activities that cause sudden fluctuations in usage. The findings reveal that no single model consistently outperforms others across all seasons, emphasizing the need for season-specific model selection or tailored designs. Notably, the proposed Hyper Network based LSTM and MiniAutoEncXGBoost models exhibit strong adaptability to seasonal variations, effectively capturing abrupt changes in energy consumption during summer months. This study advances the energy forecasting field by emphasizing the critical role of seasonal dynamics and model-specific behavior in achieving accurate predictions.

[AI-12] Initial Findings on Sensor based Open Vocabulary Activity Recognition via Text Embedding Inversion

Link: https://arxiv.org/abs/2501.07408
Authors: Lala Shakti Swarup Ray,Bo Zhou,Sungho Suh,Paul Lukowicz
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Conventional human activity recognition (HAR) relies on classifiers trained to predict discrete activity classes, inherently limiting recognition to activities explicitly present in the training set. Such classifiers would invariably fail, putting zero likelihood, when encountering unseen activities. We propose Open Vocabulary HAR (OV-HAR), a framework that overcomes this limitation by first converting each activity into natural language and breaking it into a sequence of elementary motions. This descriptive text is then encoded into a fixed-size embedding. The model is trained to regress this embedding, which is subsequently decoded back into natural language using a pre-trained embedding inversion model. Unlike other works that rely on auto-regressive large language models (LLMs) at their core, OV-HAR achieves open vocabulary recognition without the computational overhead of such models. The generated text can be transformed into a single activity class using LLM prompt engineering. We have evaluated our approach on different modalities, including vision (pose), IMU, and pressure sensors, demonstrating robust generalization across unseen activities and modalities, offering a fundamentally different paradigm from contemporary classifiers.

[AI-13] PROTECT: Protein circadian time prediction using unsupervised learning

Link: https://arxiv.org/abs/2501.07405
Authors: Aram Ansary Ogholbake,Qiang Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Circadian rhythms regulate the physiology and behavior of humans and animals. Despite advancements in understanding these rhythms and predicting circadian phases at the transcriptional level, predicting circadian phases from proteomic data remains elusive. This challenge is largely due to the scarcity of time labels in proteomic datasets, which are often characterized by small sample sizes, high dimensionality, and significant noise. Furthermore, existing methods for predicting circadian phases from transcriptomic data typically rely on prior knowledge of known rhythmic genes, making them unsuitable for proteomic datasets. To address this gap, we developed a novel computational method using unsupervised deep learning techniques to predict circadian sample phases from proteomic data without requiring time labels or prior knowledge of proteins or genes. Our model involves a two-stage training process optimized for robust circadian phase prediction: an initial greedy one-layer-at-a-time pre-training which generates informative initial parameters followed by fine-tuning. During fine-tuning, a specialized loss function guides the model to align protein expression levels with circadian patterns, enabling it to accurately capture the underlying rhythmic structure within the data. We tested our method on both time-labeled and unlabeled proteomic data. For labeled data, we compared our predictions to the known time labels, achieving high accuracy, while for unlabeled human datasets, including postmortem brain regions and urine samples, we explored circadian disruptions. Notably, our analysis identified disruptions in rhythmic proteins between Alzheimer’s disease and control subjects across these samples.

[AI-14] Derivation of effective gradient flow equations and dynamical truncation of training data in Deep Learning

Link: https://arxiv.org/abs/2501.07400
Authors: Thomas Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: AMS Latex, 35 pages

Click to view abstract

Abstract:We derive explicit equations governing the cumulative biases and weights in Deep Learning with ReLU activation function, based on gradient descent for the Euclidean cost in the input layer, and under the assumption that the weights are, in a precise sense, adapted to the coordinate system distinguished by the activations. We show that gradient descent corresponds to a dynamical process in the input layer, whereby clusters of data are progressively reduced in complexity (“truncated”) at an exponential rate that increases with the number of data points that have already been truncated. We provide a detailed discussion of several types of solutions to the gradient flow equations. A main motivation for this work is to shed light on the interpretability question in supervised learning.

[AI-15] The Essentials of AI for Life and Society: An AI Literacy Course for the University Community AAAI-25

Link: https://arxiv.org/abs/2501.07392
Authors: Joydeep Biswas,Don Fussell,Peter Stone,Kristin Patterson,Kristen Procko,Lea Sabatini,Zifan Xu
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted to EAAI-25: The 15th Symposium on Educational Advances in Artificial Intelligence, collocated with AAAI-25

Click to view abstract

Abstract:We describe the development of a one-credit course to promote AI literacy at The University of Texas at Austin. In response to a call for the rapid deployment of class to serve a broad audience in Fall of 2023, we designed a 14-week seminar-style course that incorporated an interdisciplinary group of speakers who lectured on topics ranging from the fundamentals of AI to societal concerns including disinformation and employment. University students, faculty, and staff, and even community members outside of the University, were invited to enroll in this online offering: The Essentials of AI for Life and Society. We collected feedback from course participants through weekly reflections and a final survey. Satisfyingly, we found that attendees reported gains in their AI literacy. We sought critical feedback through quantitative and qualitative analysis, which uncovered challenges in designing a course for this general audience. We utilized the course feedback to design a three-credit version of the course that is being offered in Fall of 2024. The lessons we learned and our plans for this new iteration may serve as a guide to instructors designing AI courses for a broad audience.

[AI-16] Information-Theoretic Dual Memory System for Continual Learning

Link: https://arxiv.org/abs/2501.07382
Authors: RunQing Wu,KaiHui Huang,HanYi Zhang,QiHe Liu,GuoJin Yu,JingSong Deng,Fei Ye
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 35 pages, 9 figures, submitted to Knowledge-Based Systems

Click to view abstract

Abstract:Continuously acquiring new knowledge from a dynamic environment is a fundamental capability for animals, facilitating their survival and ability to address various challenges. This capability is referred to as continual learning, which focuses on the ability to learn a sequence of tasks without the detriment of previous knowledge. A prevalent strategy to tackle continual learning involves selecting and storing numerous essential data samples from prior tasks within a fixed-size memory buffer. However, the majority of current memory-based techniques typically utilize a single memory buffer, which poses challenges in concurrently managing newly acquired and previously learned samples. Drawing inspiration from the Complementary Learning Systems (CLS) theory, which defines rapid and gradual learning mechanisms for processing information, we propose an innovative dual memory system called the Information-Theoretic Dual Memory System (ITDMS). This system comprises a fast memory buffer designed to retain temporary and novel samples, alongside a slow memory buffer dedicated to preserving critical and informative samples. The fast memory buffer is optimized employing an efficient reservoir sampling process. Furthermore, we introduce a novel information-theoretic memory optimization strategy that selectively identifies and retains diverse and informative data samples for the slow memory buffer. Additionally, we propose a novel balanced sample selection procedure that automatically identifies and eliminates redundant memorized samples, thus freeing up memory capacity for new data acquisitions, which can deal with a growing array of tasks. Our methodology is rigorously assessed through a series of continual learning experiments, with empirical results underscoring the effectiveness of the proposed system.

[AI-17] TempoGPT: Enhancing Temporal Reasoning via Quantizing Embedding

Link: https://arxiv.org/abs/2501.07335
Authors: Haochuan Zhang,Chunhua Yang,Jie Han,Liyang Qin,Xiaoli Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Multi-modal language models have made advanced progress in vision and audio, but still face significant challenges in dealing with complex reasoning tasks in the time series domain. The reasons are twofold. First, labels for multi-modal time series data are coarse and devoid of analysis or reasoning processes. Training with these data cannot improve the model's reasoning capabilities. Second, due to the lack of precise tokenization in processing time series, the representation patterns for temporal and textual information are inconsistent, which hampers the effectiveness of multi-modal alignment. To address these challenges, we propose a multi-modal time series data construction approach and a multi-modal time series language model (TLM), TempoGPT. Specifically, we construct multi-modal data for complex reasoning tasks by analyzing the variable-system relationships within a white-box system. Additionally, the proposed TempoGPT achieves consistent representation between temporal and textual information by quantizing temporal embeddings, where temporal embeddings are quantized into a series of discrete tokens using a predefined codebook; subsequently, a shared embedding layer processes both temporal and textual tokens. Extensive experiments demonstrate that TempoGPT accurately perceives temporal information, logically infers conclusions, and achieves state-of-the-art in the constructed complex time series reasoning tasks. Moreover, we quantitatively demonstrate the effectiveness of quantizing temporal embeddings in enhancing multi-modal alignment and the reasoning capabilities of TLMs. Code and data are available at this https URL.
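The key mechanism, quantizing continuous temporal embeddings against a predefined codebook so they can share an embedding layer with text tokens, can be sketched in a few lines. This is only our reading of the abstract; the shapes and codebook initialization below are illustrative assumptions:

```python
import torch

def quantize_temporal_embeddings(temporal_emb: torch.Tensor,
                                 codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous temporal embeddings (batch, seq, dim) to discrete
    token ids (batch, seq) via nearest-neighbor lookup in the codebook."""
    flat = temporal_emb.reshape(-1, temporal_emb.shape[-1])  # (B*L, dim)
    dists = torch.cdist(flat, codebook)                      # (B*L, n_codes)
    return dists.argmin(dim=-1).reshape(temporal_emb.shape[:-1])

emb = torch.randn(2, 16, 64)       # toy temporal embeddings
codebook = torch.randn(512, 64)    # predefined codebook of 512 codes
tokens = quantize_temporal_embeddings(emb, codebook)  # ids in [0, 512)
# The discrete ids can then be offset past the text vocabulary and fed
# through the same shared embedding layer as the text tokens.
```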

[AI-18] Evaluation of Artificial Intelligence Methods for Lead Time Prediction in Non-Cycled Areas of Automotive Production

Link: https://arxiv.org/abs/2501.07317
Authors: Cornelius Hake(1, 2),Jonas Weigele(1, 3),Frederik Reichert(3),Christian Friedrich(2) ((1) Ing. h.c. F. Porsche AG (2) Hochschule Karlsruhe (3) Hochschule Esslingen)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 7 pages, 4 figures, CLC2024 Conference

Abstract:The present study examines the effectiveness of applying Artificial Intelligence methods in an automotive production environment to predict unknown lead times in a non-cycle-controlled production area. Data structures are analyzed to identify contextual features and then preprocessed using one-hot encoding. Method selection focuses on supervised machine learning techniques. In supervised learning methods, regression and classification methods are evaluated. Continuous regression based on target size distribution is not feasible. Classification methods analysis shows that Ensemble Learning and Support Vector Machines are the most suitable. Preliminary study results indicate that gradient boosting algorithms LightGBM, XGBoost, and CatBoost yield the best results. After further testing and extensive hyperparameter optimization, the final method choice is the LightGBM algorithm. Depending on feature availability and prediction interval granularity, relative prediction accuracies of up to 90% can be achieved. Further tests highlight the importance of periodic retraining of AI models to accurately represent complex production processes using the database. The research demonstrates that AI methods can be effectively applied to highly variable production data, adding business value by providing an additional metric for various control tasks while outperforming current non-AI-based systems.
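Since the abstract names LightGBM classification over one-hot encoded contextual features as the final method, a minimal sketch of that pipeline (on synthetic stand-in data, with arbitrarily chosen hyperparameters) could be:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for categorical production context (e.g., order type,
# station); the real feature set is not described in the abstract.
rng = np.random.default_rng(0)
X_raw = rng.choice(["A", "B", "C"], size=(1000, 4))
y = rng.integers(0, 5, size=1000)   # lead-time interval class labels

# sparse_output assumes scikit-learn >= 1.2 (older versions use `sparse`).
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X = enc.fit_transform(X_raw)

clf = LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X[:800], y[:800])
print("holdout accuracy:", (clf.predict(X[800:]) == y[800:]).mean())
```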

[AI-19] Principles for Responsible AI Consciousness Research

Link: https://arxiv.org/abs/2501.07290
Authors: Patrick Butlin,Theodoros Lappas
Subjects: Artificial Intelligence (cs.AI)

Abstract:Recent research suggests that it may be possible to build conscious AI systems now or in the near future. Conscious AI systems would arguably deserve moral consideration, and it may be the case that large numbers of conscious systems could be created and caused to suffer. Furthermore, AI systems or AI-generated characters may increasingly give the impression of being conscious, leading to debate about their moral status. Organisations involved in AI research must establish principles and policies to guide research and deployment choices and public communication concerning consciousness. Even if an organisation chooses not to study AI consciousness as such, it will still need policies in place, as those developing advanced AI systems risk inadvertently creating conscious entities. Responsible research and deployment practices are essential to address this possibility. We propose five principles for responsible research and argue that research organisations should make voluntary, public commitments to principles on these lines. Our principles concern research objectives and procedures, knowledge sharing and public communications.

[AI-20] LLM-Net: Democratizing LLMs-as-a-Service through Blockchain-based Expert Networks

Link: https://arxiv.org/abs/2501.07288
Authors: Zan-Kai Chong,Hiroyuki Ohsaki,Bryan Ng
Subjects: Artificial Intelligence (cs.AI)

Abstract:The centralization of Large Language Models (LLMs) development has created significant barriers to AI advancement, limiting the democratization of these powerful technologies. This centralization, coupled with the scarcity of high-quality training data and mounting complexity of maintaining comprehensive expertise across rapidly expanding knowledge domains, poses critical challenges to the continued growth of LLMs. While solutions like Retrieval-Augmented Generation (RAG) offer potential remedies, maintaining up-to-date expert knowledge across diverse domains remains a significant challenge, particularly given the exponential growth of specialized information. This paper introduces LLMs Networks (LLM-Net), a blockchain-based framework that democratizes LLMs-as-a-Service through a decentralized network of specialized LLM providers. By leveraging collective computational resources and distributed domain expertise, LLM-Net incorporates fine-tuned expert models for various specific domains, ensuring sustained knowledge growth while maintaining service quality through collaborative prompting mechanisms. The framework’s robust design includes blockchain technology for transparent transaction and performance validation, establishing an immutable record of service delivery. Our simulation, built on top of state-of-the-art LLMs such as Claude 3.5 Sonnet, Llama 3.1, Grok-2, and GPT-4o, validates the effectiveness of the reputation-based mechanism in maintaining service quality by selecting high-performing respondents (LLM providers). Thereby it demonstrates the potential of LLM-Net to sustain AI advancement through the integration of decentralized expertise and blockchain-based accountability.

[AI-21] Lifelong Learning of Large Language Model based Agents: A Roadmap

Link: https://arxiv.org/abs/2501.07278
Authors: Junhao Zheng,Chengming Shi,Xidi Cai,Qiuke Li,Duzhen Zhang,Chenxing Li,Dong Yu,Qianli Ma
Subjects: Artificial Intelligence (cs.AI)
Comments: 46 pages

Abstract:Lifelong learning, also known as continual or incremental learning, is a crucial component for advancing Artificial General Intelligence (AGI) by enabling systems to continuously adapt in dynamic environments. While large language models (LLMs) have demonstrated impressive capabilities in natural language processing, existing LLM agents are typically designed for static systems and lack the ability to adapt over time in response to new challenges. This survey is the first to systematically summarize the potential techniques for incorporating lifelong learning into LLM-based agents. We categorize the core components of these agents into three modules: the perception module for multimodal input integration, the memory module for storing and retrieving evolving knowledge, and the action module for grounded interactions with the dynamic environment. We highlight how these pillars collectively enable continuous adaptation, mitigate catastrophic forgetting, and improve long-term performance. This survey provides a roadmap for researchers and practitioners working to develop lifelong learning capabilities in LLM agents, offering insights into emerging trends, evaluation metrics, and application scenarios. Relevant literature and resources are available at this https URL.

[AI-22] Bridging Smart Meter Gaps: A Benchmark of Statistical Machine Learning and Time Series Foundation Models for Data Imputation

Link: https://arxiv.org/abs/2501.07276
Authors: Amir Sartipi,Joaquin Delgado Fernandez,Sergio Potenciano Menci,Alessio Magitteri
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The integrity of time series data in smart grids is often compromised by missing values due to sensor failures, transmission errors, or disruptions. Gaps in smart meter data can bias consumption analyses and hinder reliable predictions, causing technical and economic inefficiencies. As smart meter data grows in volume and complexity, conventional techniques struggle with its nonlinear and nonstationary patterns. In this context, Generative Artificial Intelligence offers promising solutions that may outperform traditional statistical methods. In this paper, we evaluate two general-purpose Large Language Models and five Time Series Foundation Models for smart meter data imputation, comparing them with conventional Machine Learning and statistical models. We introduce artificial gaps (30 minutes to one day) into an anonymized public dataset to test inference capabilities. Results show that Time Series Foundation Models, with their contextual understanding and pattern recognition, could significantly enhance imputation accuracy in certain cases. However, the trade-off between computational cost and performance gains remains a critical consideration.
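The evaluation protocol described here, injecting artificial gaps and scoring imputations against the held-out truth, is easy to reproduce with a conventional baseline. A sketch of the gap-injection and scoring step (synthetic load curve; time-aware interpolation as the statistical baseline):

```python
import numpy as np
import pandas as pd

# One week of 15-minute smart meter readings (synthetic daily cycle).
idx = pd.date_range("2024-01-01", periods=96 * 7, freq="15min")
truth = pd.Series(1.0 + 0.5 * np.sin(np.arange(len(idx)) * 2 * np.pi / 96),
                  index=idx)

# Inject an artificial multi-hour gap, as in the benchmark.
load = truth.copy()
gap = slice("2024-01-03 10:00", "2024-01-03 14:00")
load.loc[gap] = np.nan

# A conventional statistical baseline: time-aware linear interpolation.
imputed = load.interpolate(method="time")
mae = (imputed.loc[gap] - truth.loc[gap]).abs().mean()
print(f"baseline MAE on the injected gap: {mae:.4f}")
```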

[AI-23] Lessons From Red Teaming 100 Generative AI Products

Link: https://arxiv.org/abs/2501.07238
Authors: Blake Bullwinkel,Amanda Minnich,Shiven Chawla,Gary Lopez,Martin Pouliot,Whitney Maxwell,Joris de Gruyter,Katherine Pratt,Saphir Qi,Nina Chikanov,Roman Lutz,Raja Sekhar Rao Dheekonda,Bolor-Erdene Jagdagdorj,Eugenia Kim,Justin Song,Keegan Hines,Daniel Jones,Giorgio Severi,Richard Lundeen,Sam Vaughan,Victoria Westerhoff,Pete Bryan,Ram Shankar Siva Kumar,Yonatan Zunger,Chang Kawaguchi,Mark Russinovich
Subjects: Artificial Intelligence (cs.AI)

Abstract:In recent years, AI red teaming has emerged as a practice for probing the safety and security of generative AI systems. Due to the nascency of the field, there are many open questions about how red teaming operations should be conducted. Based on our experience red teaming over 100 generative AI products at Microsoft, we present our internal threat model ontology and eight main lessons we have learned: 1. Understand what the system can do and where it is applied. 2. You don't have to compute gradients to break an AI system. 3. AI red teaming is not safety benchmarking. 4. Automation can help cover more of the risk landscape. 5. The human element of AI red teaming is crucial. 6. Responsible AI harms are pervasive but difficult to measure. 7. LLMs amplify existing security risks and introduce new ones. 8. The work of securing AI systems will never be complete. By sharing these insights alongside case studies from our operations, we offer practical recommendations aimed at aligning red teaming efforts with real world risks. We also highlight aspects of AI red teaming that we believe are often misunderstood and discuss open questions for the field to consider.

[AI-24] Breaking Memory Limits: Gradient Wavelet Transform Enhances LLM s Training

Link: https://arxiv.org/abs/2501.07237
Authors: Ziqing Wen,Ping Luo,Jiahuan Wang,Xiaoge Deng,Jinping Zou,Kun Yuan,Tao Sun,Dongsheng Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate the memory-efficient method beyond low-rank training, proposing a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements for maintaining optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training without sacrificing performance. Through extensive experiments on both pre-training and fine-tuning tasks, we show that GWT achieves state-of-the-art performance compared with advanced memory-efficient optimizers and full-rank approaches in terms of both memory usage and training performance.
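The core idea, transforming gradients into a wavelet basis so that optimizer state can be kept for fewer (or compressed) coefficients, can be illustrated with a hand-rolled one-level Haar transform. This is a conceptual sketch only; the abstract does not specify how GWT actually handles coefficients or integrates with the optimizer:

```python
import numpy as np

def haar_dwt(g: np.ndarray):
    """One-level Haar transform of a flat gradient (even length)."""
    g = g.reshape(-1, 2)
    approx = (g[:, 0] + g[:, 1]) / np.sqrt(2)
    detail = (g[:, 0] - g[:, 1]) / np.sqrt(2)
    return approx, detail

def haar_idwt(approx: np.ndarray, detail: np.ndarray) -> np.ndarray:
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out

grad = np.random.randn(1024).astype(np.float32)
approx, detail = haar_dwt(grad)
assert np.allclose(haar_idwt(approx, detail), grad, atol=1e-5)
# Optimizer moments (e.g., Adam's m and v) would be tracked in the wavelet
# domain over the half-length approximation coefficients, shrinking the
# stateful part of training; after the coefficient update, the step is
# mapped back to parameter space via the inverse transform.
```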

[AI-25] Crowdsourced human-based computational approach for tagging peripheral blood smear sample images from Sickle Cell Disease patients using non-expert users

Link: https://arxiv.org/abs/2501.07196
Authors: José María Buades Rubio,Gabriel Moyà-Alcover,Antoni Jaume-i-Capó,Nataša Petrović
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:In this paper, we present a human-based computation approach for the analysis of peripheral blood smear (PBS) images in patients with Sickle Cell Disease (SCD). We used the Mechanical Turk microtask market to crowdsource the labeling of PBS images. We then use the expert-tagged erythrocytesIDB dataset to assess the accuracy and reliability of our proposal. Our results showed that when a robust consensus is achieved among the Mechanical Turk workers, the probability of error is very low, based on comparison with expert analysis. This suggests that our proposed approach can be used to annotate datasets of PBS images, which can then be used to train automated methods for the diagnosis of SCD. In future work, we plan to explore the potential integration of our findings with outcomes obtained through automated methodologies. This could lead to the development of more accurate and reliable methods for the diagnosis of SCD.

[AI-26] Generalizable Graph Neural Networks for Robust Power Grid Topology Control

Link: https://arxiv.org/abs/2501.07186
Authors: Matthijs de Jong,Jan Viebahn,Yuliya Shapovalova
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:The energy transition necessitates new congestion management methods. One such method is controlling the grid topology with machine learning (ML). This approach has gained popularity following the Learning to Run a Power Network (L2RPN) competitions. Graph neural networks (GNNs) are a class of ML models that reflect graph structure in their computation, which makes them suitable for power grid modeling. Various GNN approaches for topology control have thus been proposed. We propose the first GNN model for grid topology control that uses only GNN layers. Additionally, we identify the busbar information asymmetry problem that the popular homogeneous graph representation suffers from, and propose a heterogeneous graph representation to resolve it. We train both homogeneous and heterogeneous GNNs and fully connected neural networks (FCNN) baselines on an imitation learning task. We evaluate the models according to their classification accuracy and grid operation ability. We find that the heterogeneous GNNs perform best on in-distribution networks, followed by the FCNNs, and lastly, the homogeneous GNNs. We also find that both GNN types generalize better to out-of-distribution networks than FCNNs.

[AI-27] Kriging and Gaussian Process Interpolation for Georeferenced Data Augmentation

Link: https://arxiv.org/abs/2501.07183
Authors: Frédérick Fabre Ferber(LIM, UPR Recyclage et risque),Dominique Gay(LIM),Jean-Christophe Soulié(UPR Recyclage et risque),Jean Diatta(LIM),Odalric-Ambrym Maillard(Scool)
Subjects: Artificial Intelligence (cs.AI)

Abstract:Data augmentation is a crucial step in the development of robust supervised learning models, especially when dealing with limited datasets. This study explores interpolation techniques for the augmentation of geo-referenced data, with the aim of predicting the presence of Commelina benghalensis L. in sugarcane plots in La Réunion. Given the spatial nature of the data and the high cost of data collection, we evaluated two interpolation approaches: Gaussian processes (GPs) with different kernels and kriging with various variograms. The objectives of this work are threefold: (i) to identify which interpolation methods offer the best predictive performance for various regression algorithms, (ii) to analyze the evolution of performance as a function of the number of observations added, and (iii) to assess the spatial consistency of augmented datasets. The results show that GP-based methods, in particular with combined kernels (GP-COMB), significantly improve the performance of regression algorithms while requiring less additional data. Although kriging shows slightly lower performance, it is distinguished by a more homogeneous spatial coverage, a potential advantage in certain contexts.
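As a concrete illustration of GP-based interpolation for augmentation: the kernel composition below is only a guess at the spirit of the paper's GP-COMB variant (the exact kernels are not given in the abstract), and the data are synthetic:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel

# Toy georeferenced observations: (x, y) coordinates -> presence score.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(40, 2))
presence = np.sin(coords[:, 0]) + 0.1 * rng.standard_normal(40)

# A combined kernel (illustrative stand-in for the GP-COMB composition).
kernel = RBF(length_scale=2.0) + Matern(length_scale=2.0, nu=1.5) + WhiteKernel(1e-2)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(coords, presence)

# Interpolate at new locations to augment the training set.
new_coords = rng.uniform(0, 10, size=(100, 2))
augmented_labels = gp.predict(new_coords)
```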

[AI-28] Anomalous Agreement: How to find the Ideal Number of Anomaly Classes in Correlated Multivariate Time Series Data AAAI

Link: https://arxiv.org/abs/2501.07172
Authors: Ferdinand Rewicki,Joachim Denzler,Julia Niebling
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at AAAI Workshop on AI for Time Series Analysis (AI4TS) 2025

Abstract:Detecting and classifying abnormal system states is critical for condition monitoring, but supervised methods often fall short due to the rarity of anomalies and the lack of labeled data. Therefore, clustering is often used to group similar abnormal behavior. However, evaluating cluster quality without ground truth is challenging, as existing measures such as the Silhouette Score (SSC) only evaluate the cohesion and separation of clusters and ignore possible prior knowledge about the data. To address this challenge, we introduce the Synchronized Anomaly Agreement Index (SAAI), which exploits the synchronicity of anomalies across multivariate time series to assess cluster quality. We demonstrate the effectiveness of SAAI by showing that maximizing SAAI improves accuracy on the task of finding the true number of anomaly classes K in correlated time series by 0.23 compared to SSC and by 0.32 compared to X-Means. We also show that clusters obtained by maximizing SAAI are easier to interpret compared to SSC.
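The abstract does not give the SAAI formula, but the underlying intuition, that anomalies in channels of the same cluster should co-occur in time, can be illustrated with a simple within-minus-between synchronicity score. The function below is our illustrative proxy, not the paper's metric:

```python
import numpy as np

def saai_like_score(anomaly_masks: np.ndarray, labels: np.ndarray) -> float:
    """Illustrative synchronicity score (not the paper's exact SAAI).

    anomaly_masks: (n_channels, T) boolean matrix of detected anomalies
    labels:        (n_channels,) cluster assignment per channel
    Returns mean within-cluster minus mean between-cluster Jaccard overlap
    of anomaly timestamps; higher values mean more synchronized clusters.
    """
    def jaccard(a, b):
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union else 0.0

    within, between = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            s = jaccard(anomaly_masks[i], anomaly_masks[j])
            (within if labels[i] == labels[j] else between).append(s)
    return np.mean(within or [0.0]) - np.mean(between or [0.0])
```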

[AI-29] Natural Language-Assisted Multi-modal Medication Recommendation

Link: https://arxiv.org/abs/2501.07166
Authors: Jie Tan,Yu Rong,Kangfei Zhao,Tian Bian,Tingyang Xu,Junzhou Huang,Hong Cheng,Helen Meng
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:Combinatorial medication recommendation (CMR) is a fundamental task of healthcare, which offers opportunities for clinical physicians to provide more precise prescriptions for patients with intricate health conditions, particularly in the scenarios of long-term medical care. Previous research efforts have sought to extract meaningful information from electronic health records (EHRs) to facilitate combinatorial medication recommendations. Existing learning-based approaches further consider the chemical structures of medications, but ignore the textual medication descriptions in which the functionalities are clearly described. Furthermore, the textual knowledge derived from the EHRs of patients remains largely underutilized. To address these issues, we introduce the Natural Language-Assisted Multi-modal Medication Recommendation (NLA-MMR), a multi-modal alignment framework designed to learn knowledge from the patient view and medication view jointly. Specifically, NLA-MMR formulates CMR as an alignment problem from patient and medication modalities. In this vein, we employ pretrained language models (PLMs) to extract in-domain knowledge regarding patients and medications, serving as the foundational representation for both modalities. In the medication modality, we exploit both chemical structures and textual descriptions to create medication representations. In the patient modality, we generate the patient representations based on textual descriptions of diagnosis, procedure, and symptom. Extensive experiments conducted on three publicly accessible datasets demonstrate that NLA-MMR achieves new state-of-the-art performance, with a notable average improvement of 4.72% in Jaccard score. Our source code is publicly available on this https URL.

[AI-30] QuantuneV2: Compiler-Based Local Metric-Driven Mixed Precision Quantization for Practical Embedded AI Applications

Link: https://arxiv.org/abs/2501.07161
Authors: Jeongseok Kim,Jemin Lee,Yongin Kwon,Daeyoung Kim
Subjects: Artificial Intelligence (cs.AI)
Comments: 18 pages, 10 figures, Accepted in Future Generation Computer Systems Journal

Abstract:Mixed-precision quantization methods have been proposed to reduce model size while minimizing accuracy degradation. However, existing studies require retraining and do not consider the computational overhead and intermediate representations (IR) generated during the compilation process, limiting their application at the compiler level. This computational overhead refers to the runtime latency caused by frequent quantization and dequantization operations during inference. Performing these operations at the individual operator level causes significant runtime delays. To address these issues, we propose QuantuneV2, a compiler-based mixed-precision quantization method designed for practical embedded AI applications. QuantuneV2 performs inference only twice, once before quantization and once after quantization, and operates with a computational complexity of O(n) that increases linearly with the number of model parameters. We also made the sensitivity analysis more stable by using local metrics like weights, activation values, the Signal to Quantization Noise Ratio, and the Mean Squared Error. We also cut down on computational overhead by choosing the best IR and using operator fusion. Experimental results show that QuantuneV2 achieved up to a 10.28 percent improvement in accuracy and a 12.52 percent increase in speed compared to existing methods across five models: ResNet18v1, ResNet50v1, SqueezeNetv1, VGGNet, and MobileNetv2. This demonstrates that QuantuneV2 enhances model performance while maintaining computational efficiency, making it suitable for deployment in embedded AI environments.
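One of the local sensitivity metrics named in the abstract, the Signal-to-Quantization-Noise Ratio, is straightforward to compute per tensor. A minimal sketch for symmetric uniform int8 quantization (the thresholds and how QuantuneV2 combines metrics are not reproduced here):

```python
import numpy as np

def sqnr_db(x: np.ndarray, num_bits: int = 8) -> float:
    """SQNR in dB for symmetric uniform quantization of a tensor."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -(qmax + 1), qmax)
    noise = x - q * scale
    return 10 * np.log10((x ** 2).sum() / (noise ** 2).sum())

# Layers with low SQNR are more sensitive to quantization and would be
# kept at higher precision in a mixed-precision scheme.
weights = np.random.randn(4096).astype(np.float32)
print(f"int8 SQNR: {sqnr_db(weights):.1f} dB")
```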

[AI-31] CureGraph: Contrastive Multi-Modal Graph Representation Learning for Urban Living Circle Health Profiling and Prediction

Link: https://arxiv.org/abs/2501.07157
Authors: Jinlin Li,Xiao Zhou
Subjects: Artificial Intelligence (cs.AI)

Abstract:The early detection and prediction of health status decline among the elderly at the neighborhood level are of great significance for urban planning and public health policymaking. While existing studies affirm the connection between living environments and health outcomes, most rely on single data modalities or simplistic feature concatenation of multi-modal information, limiting their ability to comprehensively profile the health-oriented urban environments. To fill this gap, we propose CureGraph, a contrastive multi-modal representation learning framework for urban health prediction that employs graph-based techniques to infer the prevalence of common chronic diseases among the elderly within the urban living circles of each neighborhood. CureGraph leverages rich multi-modal information, including photos and textual reviews of residential areas and their surrounding points of interest, to generate urban neighborhood embeddings. By integrating pre-trained visual and textual encoders with graph modeling techniques, CureGraph captures cross-modal spatial dependencies, offering a comprehensive understanding of urban environments tailored to elderly health considerations. Extensive experiments on real-world datasets demonstrate that CureGraph improves the best baseline by 28% on average in terms of R^2 across elderly disease risk prediction tasks. Moreover, the model enables the identification of stage-wise chronic disease progression and supports comparative public health analysis across neighborhoods, offering actionable insights for sustainable urban development and enhanced quality of life. The code is publicly available at this https URL.

[AI-32] TIMRL: A Novel Meta-Reinforcement Learning Framework for Non-Stationary and Multi-Task Environments

Link: https://arxiv.org/abs/2501.07146
Authors: Chenyang Qi,Huiping Li,Panfeng Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent years, meta-reinforcement learning (meta-RL) algorithms have been proposed to improve sample efficiency in the field of decision-making and control, enabling agents to learn new knowledge from a small number of samples. However, most research uses the Gaussian distribution to extract task representation, which is poorly adapted to tasks that change in non-stationary environments. To address this problem, we propose a novel meta-reinforcement learning method that leverages a Gaussian mixture model and a transformer network to construct the task inference model. The Gaussian mixture model is utilized to extend the task representation and conduct explicit encoding of tasks. Specifically, the classification of tasks is encoded through the transformer network to determine the Gaussian component corresponding to the task. By leveraging task labels, the transformer network is trained using supervised learning. We validate our method on MuJoCo benchmarks with non-stationary and multi-task environments. Experimental results demonstrate that the proposed method dramatically improves sample efficiency and accurately recognizes the classification of the tasks, while performing excellently in the environment.

[AI-33] FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices

Link: https://arxiv.org/abs/2501.07139
Authors: Yuji Chai,Mujin Kwen,David Brooks,Gu-Yeon Wei
Subjects: Artificial Intelligence (cs.AI); Performance (cs.PF)

Abstract:Deploying LLMs on edge devices presents serious technical challenges. Memory elasticity is crucial for edge devices with unified memory, where memory is shared and fluctuates dynamically. Existing solutions suffer from either poor transition granularity or high storage costs. We propose FlexQuant, a novel elasticity framework that generates an ensemble of quantized models, providing an elastic hosting solution with 15x granularity improvement and 10x storage reduction compared to SoTA methods. FlexQuant works with most quantization methods and creates a family of trade-off options under various storage limits through our pruning method. It brings great performance and flexibility to the edge deployment of LLMs.
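The elastic-hosting idea, keeping a family of quantized variants and serving the best one that fits the memory currently available on the device, reduces to a simple selection policy at runtime. The variant labels and sizes below are made-up placeholders, not figures from the paper:

```python
# Illustrative elastic model selection: pick the most accurate quantized
# variant that fits the current memory budget (hypothetical sizes for a
# ~7B-parameter model).
VARIANTS = [  # (label, approx. size in GiB), best quality first
    ("int8", 7.0),
    ("int5", 4.5),
    ("int4", 3.6),
    ("int3", 2.8),
]

def pick_variant(free_mem_gib: float) -> str:
    for label, size in VARIANTS:
        if size <= free_mem_gib:
            return label
    raise MemoryError("no quantized variant fits the current budget")

print(pick_variant(4.0))  # -> "int4"
```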

[AI-34] How GPT learns layer by layer

Link: https://arxiv.org/abs/2501.07108
Authors: Jason Du,Kelly Hong,Alishba Imran,Erfan Jahanparast,Mehdi Khfifi,Kaichun Qiao
Subjects: Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) excel at tasks like language processing, strategy games, and reasoning but struggle to build generalizable internal representations essential for adaptive decision-making in agents. For agents to effectively navigate complex environments, they must construct reliable world models. While LLMs perform well on specific benchmarks, they often fail to generalize, leading to brittle representations that limit their real-world effectiveness. Understanding how LLMs build internal world models is key to developing agents capable of consistent, adaptive behavior across tasks. We analyze OthelloGPT, a GPT-based model trained on Othello gameplay, as a controlled testbed for studying representation learning. Despite being trained solely on next-token prediction with random valid moves, OthelloGPT shows meaningful layer-wise progression in understanding board state and gameplay. Early layers capture static attributes like board edges, while deeper layers reflect dynamic tile changes. To interpret these representations, we compare Sparse Autoencoders (SAEs) with linear probes, finding that SAEs offer more robust, disentangled insights into compositional features, whereas linear probes mainly detect features useful for classification. We use SAEs to decode features related to tile color and tile stability, a previously unexamined feature that reflects complex gameplay concepts like board control and long-term planning. We study the progression of linear probe accuracy and tile color using both SAEs and linear probes to compare their effectiveness at capturing what the model is learning. Although we begin with a smaller language model, OthelloGPT, this study establishes a framework for understanding the internal representations learned by GPT models, transformers, and LLMs more broadly. Our code is publicly available: this https URL.
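A linear probe of the kind compared against SAEs here is just a linear classifier trained on frozen hidden states. A sketch on synthetic activations (the real probes are trained on OthelloGPT layer activations with board-square labels, which we do not have access to here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for layer activations: probe whether a board square's
# state (empty / own / opponent) is linearly decodable from hidden states.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((5000, 512))     # (positions, d_model)
square_state = rng.integers(0, 3, size=5000)  # imaginary labels

probe = LogisticRegression(max_iter=1000)
probe.fit(hidden[:4000], square_state[:4000])
# With random labels this hovers near chance (~1/3); on real activations,
# accuracy per layer traces how linearly decodable the board state is.
print("probe accuracy:", probe.score(hidden[4000:], square_state[4000:]))
```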

[AI-35] MathReader: Text-to-Speech for Mathematical Documents ICASSP2025

Link: https://arxiv.org/abs/2501.07088
Authors: Sieun Hyeon,Kyudan Jung,Nam-Joon Kim,Hyun Gon Ryu,Jaeyoung Do
Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at ICASSP 2025

Abstract:TTS (Text-to-Speech) document readers from Microsoft, Adobe, Apple, and OpenAI have been serviced worldwide. They provide relatively good TTS results for general plain text, but sometimes skip contents or provide unsatisfactory results for mathematical expressions. This is because most modern academic papers are written in LaTeX, and when LaTeX formulas are compiled, they are rendered as distinctive text forms within the document. However, traditional TTS document readers output only the text as it is recognized, without considering the mathematical meaning of the formulas. To address this issue, we propose MathReader, which effectively integrates OCR, a fine-tuned T5 model, and TTS. MathReader demonstrated a lower Word Error Rate (WER) than existing TTS document readers, such as Microsoft Edge and Adobe Acrobat, when processing documents containing mathematical formulas. MathReader reduced the WER from 0.510 to 0.281 compared to Microsoft Edge, and from 0.617 to 0.281 compared to Adobe Acrobat. This will significantly contribute to alleviating the inconvenience faced by users who want to listen to documents, especially those who are visually impaired. The code is available at this https URL.

[AI-36] ADKGD: Anomaly Detection in Knowledge Graphs with Dual-Channel Training

Link: https://arxiv.org/abs/2501.07078
Authors: Jiayang Wu,Wensheng Gan,Jiahao Zhang,Philip S. Yu
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: Preprint. 11 figures, 6 tables

Abstract:In the current development of large language models (LLMs), it is important to ensure the accuracy and reliability of the underlying data sources. LLMs are critical for various applications, but they often suffer from hallucinations and inaccuracies due to knowledge gaps in the training data. Knowledge graphs (KGs), as a powerful structural tool, could serve as a vital external information source to mitigate the aforementioned issues. By providing a structured and comprehensive understanding of real-world data, KGs enhance the performance and reliability of LLMs. However, it is common that errors exist in KGs while extracting triplets from unstructured data to construct KGs. This could lead to degraded performance in downstream tasks such as question-answering and recommender systems. Therefore, anomaly detection in KGs is essential to identify and correct these errors. This paper presents an anomaly detection algorithm in knowledge graphs with dual-channel learning (ADKGD). ADKGD leverages a dual-channel learning approach to enhance representation learning from both the entity-view and triplet-view perspectives. Furthermore, using a cross-layer approach, our framework integrates internal information aggregation and context information aggregation. We introduce a Kullback-Leibler (KL) loss component to improve the accuracy of the scoring function between the dual channels. To evaluate ADKGD's performance, we conduct empirical studies on three real-world KGs: WN18RR, FB15K, and NELL-995. Experimental results demonstrate that ADKGD outperforms the state-of-the-art anomaly detection algorithms. The source code and datasets are publicly available at this https URL.
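The KL component that couples the two channels can be sketched as a symmetric KL divergence between the channels' score distributions over the same candidate triplets. This is only our reading of the abstract, not ADKGD's exact loss composition:

```python
import torch
import torch.nn.functional as F

def dual_channel_kl(entity_scores: torch.Tensor,
                    triplet_scores: torch.Tensor) -> torch.Tensor:
    """Symmetric KL consistency between two channels' score distributions.

    Both inputs are raw logits of shape (batch, n_candidates) over the
    same triplets, one from the entity-view channel and one from the
    triplet-view channel.
    """
    p_log = F.log_softmax(entity_scores, dim=-1)
    q_log = F.log_softmax(triplet_scores, dim=-1)
    kl_pq = F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")  # KL(p||q)
    kl_qp = F.kl_div(p_log, q_log, log_target=True, reduction="batchmean")  # KL(q||p)
    return 0.5 * (kl_pq + kl_qp)

loss = dual_channel_kl(torch.randn(32, 100), torch.randn(32, 100))
```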

[AI-37] Value Compass Leaderboard: A Platform for Fundamental and Validated Evaluation of LLMs Values

Link: https://arxiv.org/abs/2501.07071
Authors: Jing Yao,Xiaoyuan Yi,Shitong Duan,Jindong Wang,Yuzhuo Bai,Muhua Huang,Peng Zhang,Tun Lu,Zhicheng Dou,Maosong Sun,Xing Xie
Subjects: Artificial Intelligence (cs.AI)

Abstract:As Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative for their responsible development and customized applications. However, there is still a lack of evaluations of LLMs' values that fulfill three desirable goals. (1) Value Clarification: We expect to clarify the underlying values of LLMs precisely and comprehensively, while current evaluations focus narrowly on safety risks such as bias and toxicity. (2) Evaluation Validity: Existing static, open-source benchmarks are prone to data contamination and quickly become obsolete as LLMs evolve. Additionally, these discriminative evaluations uncover LLMs' knowledge about values, rather than valid assessments of LLMs' behavioral conformity to values. (3) Value Pluralism: The pluralistic nature of human values across individuals and cultures is largely ignored in measuring LLMs' value alignment. To address these challenges, we present the Value Compass Leaderboard, with three correspondingly designed modules. It (i) grounds the evaluation on motivationally distinct basic values to clarify LLMs' underlying values from a holistic view; (ii) applies a generative evolving evaluation framework with adaptive test items for evolving LLMs and direct value recognition from behaviors in realistic scenarios; (iii) proposes a metric that quantifies LLMs' alignment with a specific value as a weighted sum over multiple dimensions, with weights determined by pluralistic values.

[AI-38] Logic Meets Magic: LLMs Cracking Smart Contract Vulnerabilities

Link: https://arxiv.org/abs/2501.07058
Authors: ZeKe Xiao,Qin Wang,Hammond Pearce,Shiping Chen
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Smart contract vulnerabilities caused significant economic losses in blockchain applications. Large Language Models (LLMs) provide new possibilities for addressing this time-consuming task. However, state-of-the-art LLM-based detection solutions are often plagued by high false-positive rates. In this paper, we push the boundaries of existing research in two key ways. First, our evaluation is based on Solidity v0.8, offering the most up-to-date insights compared to prior studies that focus on older versions (v0.4). Second, we leverage the latest five LLM models (across companies), ensuring comprehensive coverage across the most advanced capabilities in the field. We conducted a series of rigorous evaluations. Our experiments demonstrate that a well-designed prompt can reduce the false-positive rate by over 60%. Surprisingly, we also discovered that the recall rate for detecting some specific vulnerabilities in Solidity v0.8 has dropped to just 13% compared to earlier versions (i.e., v0.4). Further analysis reveals the root cause of this decline: the reliance of LLMs on identifying changes in newly introduced libraries and frameworks during detection.

[AI-39] PoAct: Policy and Action Dual-Control Agent for Generalized Applications

Link: https://arxiv.org/abs/2501.07054
Authors: Guozhi Yuan,Youfeng Liu,Jingli Yang,Wei Jia,Kai Lin,Yansong Gao,Shan He,Zilin Ding,Haitao Li
Subjects: Artificial Intelligence (cs.AI)

Abstract:Based on their superior comprehension and reasoning capabilities, Large Language Model (LLM) driven agent frameworks have achieved significant success in numerous complex reasoning tasks. ReAct-like agents can solve various intricate problems step-by-step through progressive planning and tool calls, iteratively optimizing new steps based on environmental feedback. However, as the planning capabilities of LLMs improve, the actions invoked by tool calls in ReAct-like frameworks often misalign with complex planning and challenging data organization. Code Action addresses these issues while also introducing the challenges of a more complex action space and more difficult action organization. To leverage Code Action and tackle the challenges of its complexity, this paper proposes Policy and Action Dual-Control Agent (PoAct) for generalized applications. The aim is to achieve higher-quality code actions and more accurate reasoning paths by dynamically switching reasoning policies and modifying the action space. Experimental results on the Agent Benchmark for both legal and generic scenarios demonstrate the superior reasoning capabilities and reduced token consumption of our approach in complex tasks. On the LegalAgentBench, our method shows a 20 percent improvement over the baseline while requiring fewer tokens. We conducted experiments and analyses on the GPT-4o and GLM-4 series models, demonstrating the significant potential and scalability of our approach to solve complex problems.

[AI-40] Unveiling the Potential of Text in High-Dimensional Time Series Forecasting NEURIPS24

Link: https://arxiv.org/abs/2501.07048
Authors: Xin Zhou,Weiqing Wang,Shilin Qu,Zhiqiang Zhang,Christoph Bergmeir
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS24 TSALM Workshop

Abstract:Time series forecasting has traditionally focused on univariate and multivariate numerical data, often overlooking the benefits of incorporating multimodal information, particularly textual data. In this paper, we propose a novel framework that integrates time series models with Large Language Models to improve high-dimensional time series forecasting. Inspired by multimodal models, our method combines time series and textual data in the dual-tower structure. This fusion of information creates a comprehensive representation, which is then processed through a linear layer to generate the final forecast. Extensive experiments demonstrate that incorporating text enhances high-dimensional time series forecasting performance. This work paves the way for further research in multimodal time series forecasting.
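A dual-tower fusion of the kind described, one encoder per modality, concatenation, then a linear layer producing the forecast, can be sketched as follows; the encoder choices and dimensions here are placeholders rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class DualTowerForecaster(nn.Module):
    """Sketch: a time series tower and a text tower produce embeddings
    that are concatenated and mapped through a linear head to a forecast."""

    def __init__(self, n_series: int, text_dim: int, hidden: int = 128,
                 horizon: int = 24):
        super().__init__()
        self.ts_tower = nn.GRU(n_series, hidden, batch_first=True)
        self.text_tower = nn.Linear(text_dim, hidden)  # e.g., pooled LLM embedding
        self.head = nn.Linear(2 * hidden, n_series * horizon)
        self.n_series, self.horizon = n_series, horizon

    def forward(self, ts, text_emb):
        _, h = self.ts_tower(ts)                       # h: (1, batch, hidden)
        fused = torch.cat([h[-1], self.text_tower(text_emb)], dim=-1)
        return self.head(fused).view(-1, self.horizon, self.n_series)

model = DualTowerForecaster(n_series=50, text_dim=768)
y = model(torch.randn(8, 96, 50), torch.randn(8, 768))  # -> (8, 24, 50)
```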

[AI-41] ACCon: Angle-Compensated Contrastive Regularizer for Deep Regression AAAI-2025

Link: https://arxiv.org/abs/2501.07045
Authors: Botao Zhao,Xiaoyang Qu,Zuheng Kang,Junqing Peng,Jing Xiao,Jianzong Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI-2025 (The 39th Annual AAAI Conference on Artificial Intelligence)

Abstract:In deep regression, capturing the relationship among continuous labels in feature space is a fundamental challenge that has attracted increasing interest. Addressing this issue can prevent models from converging to suboptimal solutions across various regression tasks, leading to improved performance, especially for imbalanced regression and under limited sample sizes. However, existing approaches often rely on order-aware representation learning or distance-based weighting. In this paper, we hypothesize a linear negative correlation between label distances and representation similarities in regression tasks. To implement this, we propose an angle-compensated contrastive regularizer for deep regression, which adjusts the cosine distance between anchor and negative samples within the contrastive learning framework. Our method offers a plug-and-play compatible solution that extends most existing contrastive learning methods for regression tasks. Extensive experiments and theoretical analysis demonstrate that our proposed angle-compensated contrastive regularizer not only achieves competitive regression performance but also excels in data efficiency and effectiveness on imbalanced datasets.
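The hypothesized linear negative correlation between label distance and representation similarity suggests a simple regularizer: push pairwise cosine similarity toward a target that decays linearly with label distance. The sketch below captures that intuition only; ACCon's actual angle-compensation scheme differs in detail:

```python
import torch
import torch.nn.functional as F

def label_aware_contrastive_loss(z: torch.Tensor, labels: torch.Tensor,
                                 max_dist: float) -> torch.Tensor:
    """Illustrative regularizer: cosine similarity of a pair should fall
    linearly from 1 (identical labels) to -1 (labels max_dist apart)."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.T                                        # pairwise cosine sim
    dist = (labels[:, None] - labels[None, :]).abs() / max_dist
    target = 1.0 - 2.0 * dist.clamp(max=1.0)
    mask = ~torch.eye(len(z), dtype=torch.bool)          # drop self-pairs
    return F.mse_loss(sim[mask], target[mask])

loss = label_aware_contrastive_loss(torch.randn(16, 64), torch.rand(16), 1.0)
```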

[AI-42] A Proposed Large Language Model-Based Smart Search for Archive System

Link: https://arxiv.org/abs/2501.07024
Authors: Ha Dung Nguyen,Thi-Hoang Anh Nguyen,Thanh Binh Nguyen
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: The 13th International Symposium on Information and Communication Technology (SOICT 2024)

Abstract:This study presents a novel framework for smart search in digital archival systems, leveraging the capabilities of Large Language Models (LLMs) to enhance information retrieval. By employing a Retrieval-Augmented Generation (RAG) approach, the framework enables the processing of natural language queries and the transformation of non-textual data into meaningful textual representations. The system integrates advanced metadata generation techniques, a hybrid retrieval mechanism, a router query engine, and robust response synthesis; the results demonstrated improved search precision and relevance. We present the architecture and implementation of the system and evaluate its performance in four experiments concerning LLM efficiency, hybrid retrieval optimizations, multilingual query handling, and the impacts of individual components. Obtained results show significant improvements over conventional approaches and have demonstrated the potential of AI-powered systems to transform modern archival practices.

[AI-43] Neural Probabilistic Circuits: Enabling Compositional and Interpretable Predictions through Logical Reasoning

Link: https://arxiv.org/abs/2501.07021
Authors: Weixin Chen,Simon Yu,Huajie Shao,Lui Sha,Han Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:End-to-end deep neural networks have achieved remarkable success across various domains but are often criticized for their lack of interpretability. While post hoc explanation methods attempt to address this issue, they often fail to accurately represent these black-box models, resulting in misleading or incomplete explanations. To overcome these challenges, we propose an inherently transparent model architecture called Neural Probabilistic Circuits (NPCs), which enable compositional and interpretable predictions through logical reasoning. In particular, an NPC consists of two modules: an attribute recognition model, which predicts probabilities for various attributes, and a task predictor built on a probabilistic circuit, which enables logical reasoning over recognized attributes to make class predictions. To train NPCs, we introduce a three-stage training algorithm comprising attribute recognition, circuit construction, and joint optimization. Moreover, we theoretically demonstrate that an NPC’s error is upper-bounded by a linear combination of the errors from its modules. To further demonstrate the interpretability of NPC, we provide both the most probable explanations and the counterfactual explanations. Empirical results on four benchmark datasets show that NPCs strike a balance between interpretability and performance, achieving results competitive even with those of end-to-end black-box models while providing enhanced interpretability.

[AI-44] AlgoRxplorers | Precision in Mutation – Enhancing Drug Design with Advanced Protein Stability Prediction Tools

Link: https://arxiv.org/abs/2501.07014
Authors: Karishma Thakrar,Jiangqin Ma,Max Diamond,Akash Patel
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Predicting the impact of single-point amino acid mutations on protein stability is essential for understanding disease mechanisms and advancing drug development. Protein stability, quantified by changes in Gibbs free energy ( \Delta\Delta G ), is influenced by these mutations. However, the scarcity of data and the complexity of model interpretation pose challenges in accurately predicting stability changes. This study proposes the application of deep neural networks, leveraging transfer learning and fusing complementary information from different models, to create a feature-rich representation of the protein stability landscape. We developed four models, with our third model, ThermoMPNN+, demonstrating the best performance in predicting \Delta\Delta G values. This approach, which integrates diverse feature sets and embeddings through latent transfusion techniques, aims to refine \Delta\Delta G predictions and contribute to a deeper understanding of protein dynamics, potentially leading to advancements in disease research and drug discovery.

[AI-45] Likelihood Training of Cascaded Diffusion Models via Hierarchical Volume-preserving Maps ICLR2024

Link: https://arxiv.org/abs/2501.06999
Authors: Henry Li,Ronen Basri,Yuval Kluger
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Spotlight at ICLR 2024

Abstract:Cascaded models are multi-scale generative models with a marked capacity for producing perceptually impressive samples at high resolutions. In this work, we show that they can also be excellent likelihood models, so long as we overcome a fundamental difficulty with probabilistic multi-scale models: the intractability of the likelihood function. Chiefly, in cascaded models each intermediary scale introduces extraneous variables that cannot be tractably marginalized out for likelihood evaluation. This issue vanishes by modeling the diffusion process on latent spaces induced by a class of transformations we call hierarchical volume-preserving maps, which decompose spatially structured data in a hierarchical fashion without introducing local distortions in the latent space. We demonstrate that two such maps are well-known in the literature for multiscale modeling: Laplacian pyramids and wavelet transforms. Not only do such reparameterizations allow the likelihood function to be directly expressed as a joint likelihood over the scales, we show that the Laplacian pyramid and wavelet transform also produce significant improvements to the state-of-the-art on a selection of benchmarks in likelihood modeling, including density estimation, lossless compression, and out-of-distribution detection. Investigating the theoretical basis of our empirical gains we uncover deep connections to score matching under the Earth Mover's Distance (EMD), which is a well-known surrogate for perceptual similarity. Code can be found at this https URL.

[AI-46] Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning

Link: https://arxiv.org/abs/2501.06994
Authors: Juntao Ren,Priya Sundaresan,Dorsa Sadigh,Sanjiban Choudhury,Jeannette Bohg
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for either human hands or robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT-pi) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT-pi completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT-pi achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website this https URL.

[AI-47] Graph Contrastive Learning on Multi-label Classification for Recommendations

Link: https://arxiv.org/abs/2501.06985
Authors: Jiayang Wu,Wensheng Gan,Huashen Lu,Philip S. Yu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Preprint. 10 figures, 5 tables

Abstract:In business analysis, providing effective recommendations is essential for enhancing company profits. The utilization of graph-based structures, such as bipartite graphs, has gained popularity for their ability to analyze complex data relationships. Link prediction is crucial for recommending specific items to users. Traditional methods in this area often involve identifying patterns in the graph structure or using representational techniques like graph neural networks (GNNs). However, these approaches encounter difficulties as the volume of data increases. To address these challenges, we propose a model called Graph Contrastive Learning for Multi-label Classification (MCGCL). MCGCL leverages contrastive learning to enhance recommendation effectiveness. The model incorporates two training stages: a main task and a subtask. The main task is holistic user-item graph learning to capture user-item relationships. The homogeneous user-user (item-item) subgraph is constructed to capture user-user and item-item relationships in the subtask. We assessed the performance using real-world datasets from Amazon Reviews in multi-label classification tasks. Comparative experiments with state-of-the-art methods confirm the effectiveness of MCGCL, highlighting its potential for improving recommendation systems.

[AI-48] Data Enrichment Work and AI Labor in Latin America and the Caribbean

Link: https://arxiv.org/abs/2501.06981
Authors: Gianna Williams,Maya De Los Santos,Alexandra To,Saiph Savage
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 17 pages of content with 2 figures

Abstract:The global AI surge demands crowdworkers from diverse languages and cultures. They are pivotal in labeling data for enabling global AI systems. Despite global significance, research has primarily focused on understanding the perspectives and experiences of US and India crowdworkers, leaving a notable gap. To bridge this, we conducted a survey with 100 crowdworkers across 16 Latin American and Caribbean countries. We discovered that these workers exhibited pride and respect for their digital labor, with strong support and admiration from their families. Notably, crowd work was also seen as a stepping stone to financial and professional independence. Surprisingly, despite wanting more connection, these workers also felt isolated from peers and doubtful of others’ labor quality. They resisted collaboration and gender-based tools, valuing gender-neutrality. Our work advances HCI understanding of Latin American and Caribbean crowdwork, offering insights for digital resistance tools for the region.

[AI-49] Combining LLM decision and RL action selection to improve RL policy for adaptive interventions

Link: https://arxiv.org/abs/2501.06980
Authors: Karine Karine,Benjamin M. Marlin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning (RL) is increasingly being used in the healthcare domain, particularly for the development of personalized health adaptive interventions. Inspired by the success of Large Language Models (LLMs), we are interested in using LLMs to update the RL policy in real time, with the goal of accelerating personalization. We use the text-based user preference to influence the action selection on the fly, in order to immediately incorporate the user preference. We use the term “user preference” as a broad term to refer to a user personal preference, constraint, health status, or a statement expressing like or dislike, etc. Our novel approach is a hybrid method that combines the LLM response and the RL action selection to improve the RL policy. Given an LLM prompt that incorporates the user preference, the LLM acts as a filter in the typical RL action selection. We investigate different prompting strategies and action selection strategies. To evaluate our approach, we implement a simulation environment that generates the text-based user preferences and models the constraints that impact behavioral dynamics. We show that our approach is able to take into account the text-based user preferences, while improving the RL policy, thus improving personalization in adaptive intervention.
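The hybrid selection rule, where the RL policy chooses epsilon-greedily but only among actions the LLM judges compatible with the stated user preference, can be sketched as below; `llm_approves` is a hypothetical stand-in for a real prompted LLM call, and the fallback behavior is our assumption:

```python
import numpy as np

def hybrid_action_selection(q_values, actions, llm_approves, epsilon=0.1):
    """LLM acts as a filter in an otherwise standard epsilon-greedy step."""
    allowed = [i for i, a in enumerate(actions) if llm_approves(a)]
    if not allowed:                    # fall back to the unfiltered policy
        allowed = list(range(len(actions)))
    if np.random.rand() < epsilon:     # explore among approved actions
        return actions[np.random.choice(allowed)]
    best = max(allowed, key=lambda i: q_values[i])
    return actions[best]

# Toy usage with a stub in place of a real LLM call:
acts = ["send_walk_reminder", "send_evening_message", "no_message"]
pick = hybrid_action_selection(np.array([0.3, 0.9, 0.1]), acts,
                               llm_approves=lambda a: a != "send_evening_message")
```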

[AI-50] Kolmogorov-Arnold Recurrent Network for Short Term Load Forecasting Across Diverse Consumers

Link: https://arxiv.org/abs/2501.06965
Authors: Muhammad Umair Danish,Katarina Grolinger
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:Load forecasting plays a crucial role in energy management, directly impacting grid stability, operational efficiency, cost reduction, and environmental sustainability. Traditional Vanilla Recurrent Neural Networks (RNNs) face issues such as vanishing and exploding gradients, whereas sophisticated RNNs such as LSTMs have shown considerable success in this domain. However, these models often struggle to accurately capture complex and sudden variations in energy consumption, and their applicability is typically limited to specific consumer types, such as offices or schools. To address these challenges, this paper proposes the Kolmogorov-Arnold Recurrent Network (KARN), a novel load forecasting approach that combines the flexibility of Kolmogorov-Arnold Networks with RNN’s temporal modeling capabilities. KARN utilizes learnable temporal spline functions and edge-based activations to better model non-linear relationships in load data, making it adaptable across a diverse range of consumer types. The proposed KARN model was rigorously evaluated on a variety of real-world datasets, including student residences, detached homes, a home with electric vehicle charging, a townhouse, and industrial buildings. Across all these consumer categories, KARN consistently outperformed traditional Vanilla RNNs, while it surpassed LSTM and Gated Recurrent Units (GRUs) in six buildings. The results demonstrate KARN’s superior accuracy and applicability, making it a promising tool for enhancing load forecasting in diverse energy management scenarios.

[AI-51] Enhancing Patient-Centric Communication: Leveraging LLMs to Simulate Patient Perspectives

链接: https://arxiv.org/abs/2501.06964
作者: Xinyao Ma,Rui Zhu,Zihao Wang,Jingwei Xiong,Qingyu Chen,Haixu Tang,L. Jean Camp,Lucila Ohno-Machado
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing scenarios, particularly in simulating domain-specific experts using tailored prompts. This ability enables LLMs to adopt the persona of individuals with specific backgrounds, offering a cost-effective and efficient alternative to traditional, resource-intensive user studies. By mimicking human behavior, LLMs can anticipate responses based on concrete demographic or professional profiles. In this paper, we evaluate the effectiveness of LLMs in simulating individuals with diverse backgrounds and analyze the consistency of these simulated behaviors compared to real-world outcomes. In particular, we explore the potential of LLMs to interpret and respond to discharge summaries provided to patients leaving the Intensive Care Unit (ICU). We evaluate and compare with human responses the comprehensibility of discharge summaries among individuals with varying educational backgrounds, using this analysis to assess the strengths and limitations of LLM-driven simulations. Notably, when LLMs are primed with educational background information, they deliver accurate and actionable medical guidance 88% of the time. However, when other information is provided, performance significantly drops, falling below random chance levels. This preliminary study shows the potential benefits and pitfalls of automatically generating patient-specific health information from diverse populations. While LLMs show promise in simulating health personas, our results highlight critical gaps that must be addressed before they can be reliably used in clinical settings. Our findings suggest that a straightforward query-response model could outperform a more tailored approach in delivering health information. This is a crucial first step in understanding how LLMs can be optimized for personalized health communication while maintaining accuracy.

[AI-52] Generative Artificial Intelligence-Supported Pentesting: A Comparison between Claude Opus, GPT-4, and Copilot

链接: https://arxiv.org/abs/2501.06963
作者: Antonio López Martínez,Alejandro Cano,Antonio Ruiz-Martínez
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advent of Generative Artificial Intelligence (GenAI) has brought a significant change to our society. GenAI can be applied across numerous fields, with particular relevance in cybersecurity. Among the various areas of application, its use in penetration testing (pentesting) or ethical hacking processes is of special interest. In this paper, we have analyzed the potential of leading general-purpose GenAI tools (Claude Opus, GPT-4 from ChatGPT, and Copilot) in augmenting the penetration testing process as defined by the Penetration Testing Execution Standard (PTES). Our analysis involved evaluating each tool across all PTES phases within a controlled virtualized environment. The findings reveal that, while these tools cannot fully automate the pentesting process, they provide substantial support by enhancing efficiency and effectiveness in specific tasks. Notably, all tools demonstrated utility; however, Claude Opus consistently outperformed the others in our experimental scenarios.

[AI-53] Compact Bayesian Neural Networks via pruned MCMC sampling

链接: https://arxiv.org/abs/2501.06962
作者: Ratneel Deo,Scott Sisson,Jody M. Webster,Rohitash Chandra
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 22 pages, 11 figures

点击查看摘要

Abstract:Bayesian Neural Networks (BNNs) offer robust uncertainty quantification in model predictions, but training them presents a significant computational challenge. This is mainly due to the problem of sampling multimodal posterior distributions using Markov Chain Monte Carlo (MCMC) sampling and variational inference algorithms. Moreover, the number of model parameters scales exponentially with additional hidden layers, neurons, and features in the dataset. Typically, a significant portion of these densely connected parameters are redundant, and pruning a neural network not only improves portability but also has the potential for better generalisation capabilities. In this study, we address some of the challenges by leveraging MCMC sampling with network pruning to obtain compact probabilistic models with redundant parameters removed. We sample the posterior distribution of model parameters (weights and biases) and prune weights with low importance, resulting in a compact model. We ensure that the compact BNN retains its ability to estimate uncertainty via the posterior distribution while retaining the model training and generalisation performance accuracy by adapting post-pruning resampling. We evaluate the effectiveness of our MCMC pruning strategy on selected benchmark datasets for regression and classification problems through empirical result analysis. We also consider two coral reef drill-core lithology classification datasets to test the robustness of the pruning model in complex real-world datasets. We further investigate whether refining the compact BNN can recover any loss of performance. Our results demonstrate the feasibility of training and pruning BNNs using MCMC whilst retaining generalisation performance with over 75% reduction in network size. This paves the way for developing compact BNN models that provide uncertainty estimates for real-world applications.
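As a rough illustration of pruning from posterior samples, the sketch below ranks weights by a posterior signal-to-noise ratio and keeps the top quarter; the SNR criterion and the 75% pruning level are assumptions echoing the reported reduction, not the paper's exact importance rule.

```python
import numpy as np

def prune_by_posterior_snr(samples, keep_ratio=0.25):
    """Rank weights by posterior signal-to-noise ratio |mean| / std and keep
    the top `keep_ratio`; an assumed importance rule, not the paper's exact one.

    samples: (n_mcmc_samples, n_weights) array of MCMC draws.
    Returns a boolean mask over weights (True = keep).
    """
    snr = np.abs(samples.mean(axis=0)) / (samples.std(axis=0) + 1e-8)
    return snr >= np.quantile(snr, 1.0 - keep_ratio)

rng = np.random.default_rng(0)
draws = rng.normal(loc=rng.normal(size=1000), scale=0.1, size=(500, 1000))  # toy posterior
mask = prune_by_posterior_snr(draws)
print(mask.mean())  # ~0.25 kept, i.e. a 75% reduction before post-pruning resampling
```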

[AI-54] Patent Novelty Assessment Accelerating Innovation and Patent Prosecution

链接: https://arxiv.org/abs/2501.06956
作者: Kapil Kashyap,Sean Fargose,Gandhar Dhonde,Aditya Mishra
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In the rapidly evolving landscape of technological innovation, safeguarding intellectual property rights through patents is crucial for fostering progress and stimulating research and development investments. This report introduces a ground-breaking Patent Novelty Assessment and Claim Generation System, meticulously crafted to dissect the inventive aspects of intellectual property and simplify access to extensive patent claim data. Addressing a crucial gap in academic institutions, our system provides college students and researchers with an intuitive platform to navigate and grasp the intricacies of patent claims, particularly tailored for the nuances of Chinese patents. Unlike conventional analysis systems, our initiative harnesses a proprietary Chinese API to ensure unparalleled precision and relevance. The primary challenge lies in the complexity of accessing and comprehending diverse patent claims, inhibiting effective innovation upon existing ideas. Our solution aims to overcome these barriers by offering a bespoke approach that seamlessly retrieves comprehensive claim information, finely tuned to the specifics of the Chinese patent landscape. By equipping users with efficient access to comprehensive patent claim information, our transformative platform seeks to ignite informed exploration and innovation in the ever-evolving domain of intellectual property. Its envisioned impact transcends individual colleges, nurturing an environment conducive to research and development while deepening the understanding of patented concepts within the academic community.

[AI-55] The Einstein Test: Towards a Practical Test of a Machine's Ability to Exhibit Superintelligence

链接: https://arxiv.org/abs/2501.06948
作者: David Benrimoh,Nace Mikus,Ariel Rosenfeld
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Creative and disruptive insights (CDIs), such as the development of the theory of relativity, have punctuated human history, marking pivotal shifts in our intellectual trajectory. Recent advancements in artificial intelligence (AI) have sparked debates over whether state-of-the-art models possess the capacity to generate CDIs. We argue that the ability to create CDIs should be regarded as a significant feature of machine superintelligence (SI). To this end, we propose a practical test to evaluate whether an approach to AI targeting SI can yield novel insights of this kind. We propose the Einstein test: given the data available prior to the emergence of a known CDI, can an AI independently reproduce that insight (or one that is formally equivalent)? By achieving such a milestone, a machine can be considered to at least match humanity’s past top intellectual achievements, and therefore to have the potential to surpass them.

[AI-56] An Empirical Study of Deep Reinforcement Learning in Continuing Tasks

链接: https://arxiv.org/abs/2501.06937
作者: Yi Wan,Dmytro Korenkevych,Zheqing Zhu
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In reinforcement learning (RL), continuing tasks refer to tasks where the agent-environment interaction is ongoing and cannot be broken down into episodes. These tasks are suitable when environment resets are unavailable, agent-controlled, or predefined, but where all rewards, including those beyond resets, are critical. These scenarios frequently occur in real-world applications and cannot be modeled by episodic tasks. While modern deep RL algorithms have been extensively studied and well understood in episodic tasks, their behavior in continuing tasks remains underexplored. To address this gap, we provide an empirical study of several well-known deep RL algorithms using a suite of continuing task testbeds based on Mujoco and Atari environments, highlighting several key insights concerning continuing tasks. Using these testbeds, we also investigate the effectiveness of a method for improving temporal-difference-based RL algorithms in continuing tasks by centering rewards, as introduced by Naik et al. (2024). While their work primarily focused on this method in conjunction with Q-learning, our results extend their findings by demonstrating that this method is effective across a broader range of algorithms, scales to larger tasks, and outperforms two other reward-centering approaches.
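Reward centering itself is a one-line idea: subtract a running estimate of the long-run average reward before the TD update. A minimal sketch, with an assumed constant step size:

```python
class CenteredReward:
    """Subtract a running estimate of the long-run average reward before the
    TD update; the constant step size beta is an illustrative assumption."""
    def __init__(self, beta=0.001):
        self.avg, self.beta = 0.0, beta

    def __call__(self, reward):
        self.avg += self.beta * (reward - self.avg)  # track the mean reward
        return reward - self.avg                     # centered reward for the TD update

center = CenteredReward(beta=0.1)
print([round(center(r), 3) for r in [1.0, 1.0, 5.0, 1.0]])
```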

[AI-57] Why are we living the age of AI applications right now? The long innovation path from AI's birth to a child's bedtime magic

链接: https://arxiv.org/abs/2501.06929
作者: Tapio Pitkäranta
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:Today a four-year-old child who does not know how to read or write can now create bedtime stories with graphical illustrations and narrated audio, using AI tools that seamlessly transform speech into text, generate visuals, and convert text back into speech in a natural and engaging manner. This remarkable example demonstrates why we are living in the age of AI applications. This paper examines contemporary leading AI applications and traces their historical development, highlighting the major advancements that have enabled their realization. Five key factors are identified: 1) The evolution of computational hardware (CPUs and GPUs), enabling the training of complex AI models 2) The vast digital archives provided by the World Wide Web, which serve as a foundational data resource for AI systems 3) The ubiquity of mobile computing, with smartphones acting as powerful, accessible small computers in the hands of billions 4) The rise of industrial-scale cloud infrastructures, offering elastic computational power for AI training and deployment 5) Breakthroughs in AI research, including neural networks, backpropagation, and the “Attention is All You Need” framework, which underpin modern AI capabilities. These innovations have elevated AI from solving narrow tasks to enabling applications like ChatGPT that are adaptable for numerous use cases, redefining human-computer interaction. By situating these developments within a historical context, the paper highlights the critical milestones that have made AI’s current capabilities both possible and widely accessible, offering profound implications for society.

[AI-58] What Is a Counterfactual Cause in Action Theories? AAMAS2025

链接: https://arxiv.org/abs/2501.06857
作者: Daxin Liu,Vaishak Belle
类目: Artificial Intelligence (cs.AI)
*备注: This is an extended report of our short paper accepted at AAMAS 2025

点击查看摘要

Abstract:Since the proposal by Halpern and Pearl, reasoning about actual causality has gained increasing attention in artificial intelligence, ranging from domains such as model-checking and verification to reasoning about actions and knowledge. More recently, Batusov and Soutchanski proposed a notion of actual achievement cause in the situation calculus, amongst others, they can determine the cause of quantified effects in a given action history. While intuitively appealing, this notion of cause is not defined in a counterfactual perspective. In this paper, we propose a notion of cause based on counterfactual analysis. In the context of action history, we show that our notion of cause generalizes naturally to a notion of achievement cause. We analyze the relationship between our notion of the achievement cause and the achievement cause by Batusov and Soutchanski. Finally, we relate our account of cause to Halpern and Pearl’s account of actual causality. Particularly, we note some nuances in applying a counterfactual viewpoint to disjunctive goals, a common thorn to definitions of actual causes.

[AI-59] An efficient approach to represent enterprise web application structure using Large Language Model in the service of Intelligent Quality Engineering

链接: https://arxiv.org/abs/2501.06837
作者: Zaber Al Hassan Ayon,Gulam Husain,Roshankumar Bisoi,Waliur Rahman,Dr Tom Osborn
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 16 pages, 1 figure and 4 tables, relevant for Gen AI and enterprise AI use cases

点击查看摘要

Abstract:This paper presents a novel approach to represent enterprise web application structures using Large Language Models (LLMs) to enable intelligent quality engineering at scale. We introduce a hierarchical representation methodology that optimizes the few-shot learning capabilities of LLMs while preserving the complex relationships and interactions within web applications. The approach encompasses five key phases: comprehensive DOM analysis, multi-page synthesis, test suite generation, execution, and result analysis. Our methodology addresses existing challenges around the usage of Generative AI techniques in automated software testing by developing a structured format that enables LLMs to understand web application architecture through in-context learning. We evaluated our approach using two distinct web applications: an e-commerce platform (Swag Labs) and a healthcare application (MediBox), which is deployed within the Atalgo engineering environment. The results demonstrate success rates of 90% and 70%, respectively, in achieving automated testing, with high relevance scores for test cases across multiple evaluation criteria. The findings suggest that our representation approach significantly enhances LLMs’ ability to generate contextually relevant test cases and provide better quality assurance overall, while reducing the time and effort required for testing.

[AI-60] Leveraging Taxonomy and LLMs for Improved Multimodal Hierarchical Classification COLING2025

链接: https://arxiv.org/abs/2501.06827
作者: Shijing Chen,Mohamed Reda Bouadjenek,Shoaib Jameel,Usman Naseem,Basem Suleiman,Flora D. Salim,Hakim Hacid,Imran Razzak
类目: Artificial Intelligence (cs.AI)
*备注: 11 pages, 7 figures, 2 tables, and accepted by COLING 2025

点击查看摘要

Abstract:Multi-level Hierarchical Classification (MLHC) tackles the challenge of categorizing items within a complex, multi-layered class structure. However, traditional MLHC classifiers often rely on a backbone model with independent output layers, which tend to ignore the hierarchical relationships between classes. This oversight can lead to inconsistent predictions that violate the underlying taxonomy. Leveraging Large Language Models (LLMs), we propose a novel taxonomy-embedded transitional LLM-agnostic framework for multimodality classification. The cornerstone of this advancement is the ability of models to enforce consistency across hierarchical levels. Our evaluations on the MEP-3M dataset - a multi-modal e-commerce product dataset with various hierarchical levels - demonstrated a significant performance improvement compared to conventional LLM structures.

[AI-61] MEXA-CTP: Mode Experts Cross-Attention for Clinical Trial Outcome Prediction SDM2025

链接: https://arxiv.org/abs/2501.06823
作者: Yiqing Zhang,Xiaozhong Liu,Fabricio Murai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注: Accepted and to be published in SDM2025

点击查看摘要

Abstract:Clinical trials are the gold standard for assessing the effectiveness and safety of drugs for treating diseases. Given the vast design space of drug molecules, elevated financial cost, and multi-year timeline of these trials, research on clinical trial outcome prediction has gained immense traction. Accurate predictions must leverage data of diverse modes such as drug molecules, target diseases, and eligibility criteria to infer successes and failures. Previous Deep Learning approaches for this task, such as HINT, often require wet lab data from synthesized molecules and/or rely on prior knowledge to encode interactions as part of the model architecture. To address these limitations, we propose a light-weight attention-based model, MEXA-CTP, to integrate readily-available multi-modal data and generate effective representations via specialized modules dubbed “mode experts”, while avoiding human biases in model design. We optimize MEXA-CTP with the Cauchy loss to capture relevant interactions across modes. Our experiments on the Trial Outcome Prediction (TOP) benchmark demonstrate that MEXA-CTP improves upon existing approaches by, respectively, up to 11.3% in F1 score, 12.2% in PR-AUC, and 2.5% in ROC-AUC, compared to HINT. Ablation studies are provided to quantify the effectiveness of each component in our proposed method.
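The Cauchy loss mentioned above has a standard closed form, log(1 + ((ŷ − y)/c)²), which grows only logarithmically for large residuals and so downweights outliers relative to squared error. A minimal sketch (the scale parameter c is an assumption, not the paper's setting):

```python
import torch

def cauchy_loss(pred, target, c=1.0):
    """Cauchy (Lorentzian) loss log(1 + ((pred - target)/c)^2): it grows only
    logarithmically for large residuals, so outliers are downweighted
    compared to squared error; c is an assumed scale parameter."""
    return torch.log1p(((pred - target) / c) ** 2).mean()

print(cauchy_loss(torch.tensor([0.9, 0.1]), torch.tensor([1.0, 0.0])).item())
```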

[AI-62] A Study on Educational Data Analysis and Personalized Feedback Report Generation Based on Tags and ChatGPT

链接: https://arxiv.org/abs/2501.06819
作者: Yizhou Zhou,Mengqiao Zhang,Yuan-Hao Jiang,Xinyu Gao,Naijie Liu,Bo Jiang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study introduces a novel method that employs tag annotation coupled with the ChatGPT language model to analyze student learning behaviors and generate personalized feedback. Central to this approach is the conversion of complex student data into an extensive set of tags, which are then decoded through tailored prompts to deliver constructive feedback that encourages rather than discourages students. This methodology focuses on accurately feeding student data into large language models and crafting prompts that enhance the constructive nature of feedback. The effectiveness of this approach was validated through surveys conducted with over 20 mathematics teachers, who confirmed the reliability of the generated reports. This method can be seamlessly integrated into intelligent adaptive learning systems or provided as a tool to significantly reduce the workload of teachers, providing accurate and timely feedback to students. By transforming raw educational data into interpretable tags, this method supports the provision of efficient and timely personalized learning feedback that offers constructive suggestions tailored to individual learner needs.

[AI-63] Unifying Two Types of Scaling Laws from the Perspective of Conditional Kolmogorov Complexity

链接: https://arxiv.org/abs/2501.06802
作者: Jun Wan
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In 2020, OpenAI proposed the first type of Scaling Laws, describing the relationships between model performance and parameters, data, and compute. In 2024, OpenAI proposed the second type of Scaling Laws, describing the relationship between model inference performance and inference computation. In this paper, we analyze LLM training and inference processes from the perspective of lossless compression using conditional Kolmogorov complexity, and unify these two types of Scaling Laws. We find that both types of Scaling Laws improve the approximation of conditional Kolmogorov complexity by increasing execution steps t. The first type of Scaling Laws increases t by increasing model parameters y. The second type of Scaling Laws increases t by increasing the number of output tokens.

[AI-64] Cost-Effective Robotic Handwriting System with AI Integration DATE

链接: https://arxiv.org/abs/2501.06783
作者: Tianyi Huang,Richard Xiong
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: This is an updated version of a paper originally presented at the 2024 IEEE Long Island Systems, Applications and Technology Conference (LISAT)

点击查看摘要

Abstract:This paper introduces a cost-effective robotic handwriting system designed to replicate human-like handwriting with high precision. Combining a Raspberry Pi Pico microcontroller, 3D-printed components, and a machine learning-based handwriting generation model implemented via this http URL, the system converts user-supplied text into realistic stroke trajectories. By leveraging lightweight 3D-printed materials and efficient mechanical designs, the system achieves a total hardware cost of approximately $56, significantly undercutting commercial alternatives. Experimental evaluations demonstrate handwriting precision within ±0.3 millimeters and a writing speed of approximately 200 mm/min, positioning the system as a viable solution for educational, research, and assistive applications. This study seeks to lower the barriers to personalized handwriting technologies, making them accessible to a broader audience.

[AI-65] Eliza: A Web3 friendly AI Agent Operating System

链接: https://arxiv.org/abs/2501.06781
作者: Shaw Walters,Sam Gao,Shakker Nerd,Feng Da,Warren Williams,Ting-Chien Meng,Hunter Han,Frank He,Allen Zhang,Ming Wu,Timothy Shen,Maxwell Hu,Jerry Yan
类目: Artificial Intelligence (cs.AI)
*备注: 20 pages, 5 figures

点击查看摘要

Abstract:AI Agent, powered by large language models (LLMs) as its cognitive core, is an intelligent agentic system capable of autonomously controlling and determining the execution paths under the user’s instructions. With the burst of capabilities of LLMs and various plugins, such as RAG, text-to-image/video/3D, etc., the potential of AI Agents has been vastly expanded, with their capabilities growing stronger by the day. However, at the intersection between AI and web3, there is currently no ideal agentic framework that can seamlessly integrate web3 applications into AI agent functionalities. In this paper, we propose Eliza, the first open-source web3-friendly Agentic framework that makes the deployment of web3 applications effortless. We emphasize that every aspect of Eliza is a regular TypeScript program under the full control of its user, and it seamlessly integrates with web3 (i.e., reading and writing blockchain data, interacting with smart contracts, etc.). Furthermore, we show how stable performance is achieved through the pragmatic implementation of the key components of Eliza’s runtime. Our code is publicly available at this https URL.

[AI-66] On the Complexity of Global Necessary Reasons to Explain Classification

链接: https://arxiv.org/abs/2501.06766
作者: Marco Calautti,Enrico Malizia,Cristian Molinaro
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Explainable AI has garnered considerable attention in recent years, as understanding the reasons behind decisions or predictions made by AI systems is crucial for their successful adoption. Explaining classifiers’ behavior is one prominent problem. Work in this area has proposed notions of both local and global explanations, where the former are concerned with explaining a classifier’s behavior for a specific instance, while the latter are concerned with explaining the overall classifier’s behavior regardless of any specific instance. In this paper, we focus on global explanations, and explain classification in terms of “minimal” necessary conditions for the classifier to assign a specific class to a generic instance. We carry out a thorough complexity analysis of the problem for natural minimality criteria and important families of classifiers considered in the literature.

[AI-67] MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2501.06713
作者: Tianyu Fan,Jingyuan Wang,Xubin Ren,Chao Huang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The growing demand for efficient and lightweight Retrieval-Augmented Generation (RAG) systems has highlighted significant challenges when deploying Small Language Models (SLMs) in existing RAG frameworks. Current approaches face severe performance degradation due to SLMs’ limited semantic understanding and text processing capabilities, creating barriers for widespread adoption in resource-constrained scenarios. To address these fundamental limitations, we present MiniRAG, a novel RAG system designed for extreme simplicity and efficiency. MiniRAG introduces two key technical innovations: (1) a semantic-aware heterogeneous graph indexing mechanism that combines text chunks and named entities in a unified structure, reducing reliance on complex semantic understanding, and (2) a lightweight topology-enhanced retrieval approach that leverages graph structures for efficient knowledge discovery without requiring advanced language capabilities. Our extensive experiments demonstrate that MiniRAG achieves comparable performance to LLM-based methods even when using SLMs while requiring only 25% of the storage space. Additionally, we contribute a comprehensive benchmark dataset for evaluating lightweight RAG systems under realistic on-device scenarios with complex queries. We fully open-source our implementation and datasets at: this https URL.

[AI-68] Evaluating Sample Utility for Data Selection by Mimicking Model Weights

链接: https://arxiv.org/abs/2501.06708
作者: Tzu-Heng Huang,Manjot Bilkhu,Frederic Sala,Javier Movellan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook samples’ utility in the training process. Instead, we propose a new approach, Mimic Score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic scores to guide model training results in consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic scores and their associated filters improve upon existing filtering methods and offer accurate estimation of dataset quality.
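The Mimic score's core quantity, as the abstract describes it, is the alignment between a sample's gradient and the vector pointing from the new model toward the reference model in weight space. A minimal sketch using cosine similarity as the alignment measure (an assumed choice; sign conventions and per-sample gradient computation are simplified here):

```python
import torch
import torch.nn.functional as F

def mimic_score(sample_grad, w_new, w_ref):
    """Alignment between a sample's gradient and the vector pointing from the
    new model's weights toward the reference model's weights; cosine
    similarity is an assumed choice of alignment measure."""
    direction = w_ref - w_new
    return F.cosine_similarity(sample_grad.flatten(), direction.flatten(), dim=0)

w_new, w_ref = torch.randn(256), torch.randn(256)
grad = torch.randn(256)  # stand-in for one sample's parameter gradient
print(mimic_score(grad, w_new, w_ref).item())  # low scores -> candidate for filtering
```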

[AI-69] ELIZA Reanimated: The world's first chatbot restored on the world's first time sharing system

链接: https://arxiv.org/abs/2501.06707
作者: Rupert Lane,Anthony Hay,Arthur Schwarz,David M. Berry,Jeff Shrager
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Symbolic Computation (cs.SC)
*备注: In review

点击查看摘要

Abstract:ELIZA, created by Joseph Weizenbaum at MIT in the early 1960s, is usually considered the world’s first chatbot. It was developed in MAD-SLIP on MIT’s CTSS, the world’s first time-sharing system, on an IBM 7094. We discovered an original ELIZA printout in Prof. Weizenbaum’s archives at MIT, including an early version of the famous DOCTOR script, a nearly complete version of the MAD-SLIP code, and various support functions in MAD and FAP. Here we describe the reanimation of this original ELIZA on a restored CTSS, itself running on an emulated IBM 7094. The entire stack is open source, so that any user of a unix-like OS can run the world’s first chatbot on the world’s first time-sharing system.

[AI-70] AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds

链接: https://arxiv.org/abs/2501.06706
作者: Yinfang Chen,Manish Shetty,Gagan Somashekar,Minghua Ma,Yogesh Simmhan,Jonathan Mace,Chetan Bansal,Rujia Wang,Saravan Rajmohan
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.

[AI-71] Large Language Models, Knowledge Graphs and Search Engines: A Crossroads for Answering Users' Questions

链接: https://arxiv.org/abs/2501.06699
作者: Aidan Hogan,Xin Luna Dong,Denny Vrandečić,Gerhard Weikum
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:Much has been discussed about how Large Language Models, Knowledge Graphs and Search Engines can be combined in a synergistic manner. A dimension largely absent from current academic discourse is the user perspective. In particular, there remain many open questions regarding how best to address the diverse information needs of users, incorporating varying facets and levels of difficulty. This paper introduces a taxonomy of user information needs, which guides us to study the pros, cons and possible synergies of Large Language Models, Knowledge Graphs and Search Engines. From this study, we derive a roadmap for future research.

[AI-72] DVM: Towards Controllable LLM Agents in Social Deduction Games ICASSP2025

链接: https://arxiv.org/abs/2501.06695
作者: Zheng Zhang,Yihuai Lan,Yangsen Chen,Lei Wang,Xiang Wang,Hao Wang
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have advanced the capability of game agents in social deduction games (SDGs). These games rely heavily on conversation-driven interactions and require agents to infer, make decisions, and express based on such information. While this progress leads to more sophisticated and strategic non-player characters (NPCs) in SDGs, there exists a need to control the proficiency of these agents. This control not only ensures that NPCs can adapt to varying difficulty levels during gameplay, but also provides insights into the safety and fairness of LLM agents. In this paper, we present DVM, a novel framework for developing controllable LLM agents for SDGs, and demonstrate its implementation on one of the most popular SDGs, Werewolf. DVM comprises three main components: Predictor, Decider, and Discussor. By integrating reinforcement learning with a win rate-constrained decision chain reward mechanism, we enable agents to dynamically adjust their gameplay proficiency to achieve specified win rates. Experiments show that DVM not only outperforms existing methods in the Werewolf game, but also successfully modulates its performance levels to meet predefined win rate targets. These results pave the way for LLM agents’ adaptive and balanced gameplay in SDGs, opening new avenues for research in controllable game agents.

[AI-73] Generative AI in Education: From Foundational Insights to the Socratic Playground for Learning

链接: https://arxiv.org/abs/2501.06682
作者: Xiangen Hu,Sheng Xu,Richard Tong,Art Graesser
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper explores the synergy between human cognition and Large Language Models (LLMs), highlighting how generative AI can drive personalized learning at scale. We discuss parallels between LLMs and human cognition, emphasizing both the promise and new perspectives on integrating AI systems into education. After examining challenges in aligning technology with pedagogy, we review AutoTutor, one of the earliest Intelligent Tutoring Systems (ITS), and detail its successes, limitations, and unfulfilled aspirations. We then introduce the Socratic Playground, a next-generation ITS that uses advanced transformer-based models to overcome AutoTutor’s constraints and provide personalized, adaptive tutoring. To illustrate its evolving capabilities, we present a JSON-based tutoring prompt that systematically guides learner reflection while tracking misconceptions. Throughout, we underscore the importance of placing pedagogy at the forefront, ensuring that technology’s power is harnessed to enhance teaching and learning rather than overshadow it.

[AI-74] Common Sense Is All You Need

链接: https://arxiv.org/abs/2501.06642
作者: Hugo Latapie
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has made significant strides in recent years, yet it continues to struggle with a fundamental aspect of cognition present in all animals: common sense. Current AI systems, including those designed for complex tasks like autonomous driving, problem-solving challenges such as the Abstraction and Reasoning Corpus (ARC), and conversational benchmarks like the Turing Test, often lack the ability to adapt to new situations without extensive prior knowledge. This manuscript argues that integrating common sense into AI systems is essential for achieving true autonomy and unlocking the full societal and commercial value of AI. We propose a shift in the order of knowledge acquisition emphasizing the importance of developing AI systems that start from minimal prior knowledge and are capable of contextual learning, adaptive reasoning, and embodiment – even within abstract domains. Additionally, we highlight the need to rethink the AI software stack to address this foundational challenge. Without common sense, AI systems may never reach true autonomy, instead exhibiting asymptotic performance that approaches theoretical ideals like AIXI but remains unattainable in practice due to infinite resource and computation requirements. While scaling AI models and passing benchmarks like the Turing Test have brought significant advancements in applications that do not require autonomy, these approaches alone are insufficient to achieve autonomous AI with common sense. By redefining existing benchmarks and challenges to enforce constraints that require genuine common sense, and by broadening our understanding of embodiment to include both physical and abstract domains, we can encourage the development of AI systems better equipped to handle the complexities of real-world and abstract environments.

[AI-75] Enhancing Path Planning Performance through Image Representation Learning of High-Dimensional Configuration Spaces

链接: https://arxiv.org/abs/2501.06639
作者: Jorge Ocampo Jimenez,Wael Suleiman
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents a novel method for accelerating path-planning tasks in unknown scenes with obstacles by utilizing Wasserstein Generative Adversarial Networks (WGANs) with Gradient Penalty (GP) to approximate the distribution of waypoints for a collision-free path using the Rapidly-exploring Random Tree algorithm. Our approach involves conditioning the WGAN-GP with a forward diffusion process in a continuous latent space to handle multimodal datasets effectively. We also propose encoding the waypoints of a collision-free path as a matrix, where the multidimensional ordering of the waypoints is naturally preserved. This method not only improves model learning but also enhances training convergence. Furthermore, we propose a method to assess whether the trained model fails to accurately capture the true waypoints. In such cases, we revert to uniform sampling to ensure the algorithm’s probabilistic completeness; a process that traditionally involves manually determining an optimal ratio for each scenario in other machine learning-based methods. Our experiments demonstrate promising results in accelerating path-planning tasks under critical time constraints. The source code is openly available at this https URL.
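The WGAN-GP component relies on the standard gradient penalty of Gulrajani et al., which pushes the critic's gradient norm toward 1 on random interpolates between real and generated samples. A minimal sketch (the critic architecture and the flat waypoint-matrix encoding are illustrative assumptions):

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake):
    """Standard WGAN-GP penalty (Gulrajani et al.): push the critic's gradient
    norm toward 1 on random interpolates between real and generated samples."""
    eps = torch.rand(real.size(0), 1)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

critic = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
real, fake = torch.randn(16, 8), torch.randn(16, 8)  # flattened waypoint matrices
print(gradient_penalty(critic, real, fake).item())
```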

[AI-76] Quantifying Relational Exploration in Cultural Heritage Knowledge Graphs with LLMs: A Neuro-Symbolic Approach

链接: https://arxiv.org/abs/2501.06628
作者: Mohammed Maree
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces a neuro-symbolic approach for relational exploration in cultural heritage knowledge graphs, leveraging Large Language Models (LLMs) for explanation generation and a novel mathematical framework to quantify the interestingness of relationships. We demonstrate the importance of interestingness measure using a quantitative analysis, by highlighting its impact on the overall performance of our proposed system, particularly in terms of precision, recall, and F1-score. Using the Wikidata Cultural Heritage Linked Open Data (WCH-LOD) dataset, our approach yields a precision of 0.70, recall of 0.68, and an F1-score of 0.69, representing an improvement compared to graph-based (precision: 0.28, recall: 0.25, F1-score: 0.26) and knowledge-based baselines (precision: 0.45, recall: 0.42, F1-score: 0.43). Furthermore, our LLM-powered explanations exhibit better quality, reflected in BLEU (0.52), ROUGE-L (0.58), and METEOR (0.63) scores, all higher than the baseline approaches. We show a strong correlation (0.65) between interestingness measure and the quality of generated explanations, validating its effectiveness. The findings highlight the importance of LLMs and a mathematical formalization for interestingness in enhancing the effectiveness of relational exploration in cultural heritage knowledge graphs, with results that are measurable and testable. We further show that the system enables more effective exploration compared to purely knowledge-based and graph-based methods.

[AI-77] Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks

链接: https://arxiv.org/abs/2501.06625
作者: Amr Almorsi,Mohanned Ahmed,Walid Gomaa
类目: Artificial Intelligence (cs.AI)
*备注: 4 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in code generation tasks, yet they face significant limitations in handling complex, long-context programming challenges and demonstrating complex compositional reasoning abilities. This paper introduces a novel agentic framework for “guided code generation” that tries to address these limitations through a deliberately structured, fine-grained approach to code generation tasks. Our framework leverages LLMs’ strengths as fuzzy searchers and approximate information retrievers while mitigating their weaknesses in long sequential reasoning and long-context understanding. Empirical evaluation using OpenAI’s HumanEval benchmark with Meta’s Llama 3.1 8B model (int4 precision) demonstrates a 23.79% improvement in solution accuracy compared to direct one-shot generation. Our results indicate that structured, guided approaches to code generation can significantly enhance the practical utility of LLMs in software development while overcoming their inherent limitations in compositional reasoning and context handling.

[AI-78] ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation

链接: https://arxiv.org/abs/2501.06598
作者: Xuanle Zhao,Xianzhen Luo,Qi Shi,Chi Chen,Shuo Wang,Wanxiang Che,Zhiyuan Liu,Maosong Sun
类目: Artificial Intelligence (cs.AI)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in chart understanding tasks. However, interpreting charts with textual descriptions often leads to information loss, as it fails to fully capture the dense information embedded in charts. In contrast, parsing charts into code provides lossless representations that can effectively contain all critical details. Although existing open-source MLLMs have achieved success in chart understanding tasks, they still face two major challenges when applied to chart-to-code tasks: (1) low executability and poor restoration of chart details in the generated code, and (2) lack of large-scale and diverse training data. To address these challenges, we propose ChartCoder, the first dedicated chart-to-code MLLM, which leverages Code LLMs as the language backbone to enhance the executability of the generated code. Furthermore, we introduce Chart2Code-160k, the first large-scale and diverse dataset for chart-to-code generation, and propose the Snippet-of-Thought (SoT) method, which transforms direct chart-to-code generation data into step-by-step generation. Experiments demonstrate that ChartCoder, with only 7B parameters, surpasses existing open-source MLLMs on chart-to-code benchmarks, achieving superior chart restoration and code executability. Our code will be available at this https URL.

[AI-79] Transforming Social Science Research with Transfer Learning: Social Science Survey Data Integration with AI

链接: https://arxiv.org/abs/2501.06577
作者: Ali Amini
类目: Artificial Intelligence (cs.AI)
*备注: 22 pages, 5 figures, Presented and Submitted to SPSA 2025 (Political Methodology Panel)

点击查看摘要

Abstract:Large-N nationally representative surveys, which have profoundly shaped American politics scholarship, represent related but distinct domains, a key condition for transfer learning applications. These surveys are related through their shared demographic, party identification, and ideological variables, yet differ in that individual surveys often lack specific policy preference questions that researchers require. Our study introduces a novel application of transfer learning (TL) to address these gaps, marking the first systematic use of TL paradigms in the context of survey data. Specifically, models pre-trained on the Cooperative Election Study (CES) dataset are fine-tuned for use in the American National Election Studies (ANES) dataset to predict policy questions based on demographic variables. Even with a naive architecture, our transfer learning approach achieves approximately 92 percent accuracy in predicting missing variables across surveys, demonstrating the robust potential of this method. Beyond this specific application, our paper argues that transfer learning is a promising framework for maximizing the utility of existing survey data. We contend that artificial intelligence, particularly transfer learning, opens new frontiers in social science methodology by enabling systematic knowledge transfer between well-administered surveys that share common variables but differ in their outcomes of interest.

[AI-80] Active Rule Mining for Multivariate Anomaly Detection in Radio Access Networks

链接: https://arxiv.org/abs/2501.06571
作者: Ebenezer R. H. P. Isaac,Joseph H. R. Isaac
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multivariate anomaly detection finds its importance in diverse applications. Despite the existence of many detectors to solve this problem, one cannot simply define why an obtained anomaly inferred by the detector is anomalous. This reasoning is required for network operators to understand the root cause of the anomaly and the remedial action that should be taken to counteract its occurrence. Existing solutions in explainable AI may give cues to features that influence an anomaly, but they do not formulate generalizable rules that can be assessed by a domain expert. Furthermore, not all outliers are anomalous in a business sense. There is an unfulfilled need for a system that can interpret anomalies predicted by a multivariate anomaly detector and map these patterns to actionable rules. This paper aims to fulfill this need by proposing a semi-autonomous anomaly rule miner. The proposed method is applicable to both discrete and time series data and is tailored for radio access network (RAN) anomaly detection use cases. The proposed method is demonstrated in this paper with time series RAN data.

[AI-81] Where to Go Next Day: Multi-scale Spatial-Temporal Decoupled Model for Mid-term Human Mobility Prediction

链接: https://arxiv.org/abs/2501.06561
作者: Zongyuan Huang,Weipeng Wang,Shaoyu Huang,Marta C. Gonzalez,Yaohui Jin,Yanyan Xu
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Predicting individual mobility patterns is crucial across various applications. While current methods mainly focus on predicting the next location for personalized services like recommendations, they often fall short in supporting broader applications such as traffic management and epidemic control, which require longer period forecasts of human mobility. This study addresses mid-term mobility prediction, aiming to capture daily travel patterns and forecast trajectories for the upcoming day or week. We propose a novel Multi-scale Spatial-Temporal Decoupled Predictor (MSTDP) designed to efficiently extract spatial and temporal information by decoupling daily trajectories into distinct location-duration chains. Our approach employs a hierarchical encoder to model multi-scale temporal patterns, including daily recurrence and weekly periodicity, and utilizes a transformer-based decoder to globally attend to predicted information in the location or duration chain. Additionally, we introduce a spatial heterogeneous graph learner to capture multi-scale spatial relationships, enhancing semantic-rich representations. Extensive experiments, including statistical physics analysis, are conducted on large-scale mobile phone records in five cities (Boston, Los Angeles, SF Bay Area, Shanghai, and Tokyo), to demonstrate MSTDP’s advantages. Applied to epidemic modeling in Boston, MSTDP significantly outperforms the best-performing baseline, achieving a remarkable 62.8% reduction in MAE for cumulative new cases.
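The decoupling step is simple to illustrate: a day's trajectory of stays splits into a location chain and a duration chain that can then be modeled separately. A toy sketch (field names and the hour units are illustrative assumptions, not the paper's data schema):

```python
def decouple_day(stays):
    """Split one day's stays into a location chain and a duration chain, the
    decoupled representation the abstract describes.

    stays: ordered list of (location_id, start_hour, end_hour) tuples.
    """
    locations = [loc for loc, _, _ in stays]
    durations = [end - start for _, start, end in stays]
    return locations, durations

day = [("home", 0, 8), ("office", 9, 17), ("gym", 18, 19), ("home", 20, 24)]
locs, durs = decouple_day(day)
print(locs)  # ['home', 'office', 'gym', 'home']
print(durs)  # [8, 8, 1, 4]
```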

[AI-82] Hierarchical Reinforcement Learning for Optimal Agent Grouping in Cooperative Systems

链接: https://arxiv.org/abs/2501.06554
作者: Liyuan Hu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 9 pages, 2 figures

点击查看摘要

Abstract:This paper presents a hierarchical reinforcement learning (RL) approach to address the agent grouping or pairing problem in cooperative multi-agent systems. The goal is to simultaneously learn the optimal grouping and agent policy. By employing a hierarchical RL framework, we distinguish between high-level decisions of grouping and low-level agents’ actions. Our approach utilizes the CTDE (Centralized Training with Decentralized Execution) paradigm, ensuring efficient learning and scalable execution. We incorporate permutation-invariant neural networks to handle the homogeneity and cooperation among agents, enabling effective coordination. The option-critic algorithm is adapted to manage the hierarchical decision-making process, allowing for dynamic and optimal policy adjustments.
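A permutation-invariant encoder of the kind mentioned can be as simple as a shared per-agent network followed by mean pooling (the DeepSets pattern). A minimal sketch with illustrative sizes; the paper's exact architecture is not reproduced here:

```python
import torch
import torch.nn as nn

class PermutationInvariantEncoder(nn.Module):
    """DeepSets-style encoder: a shared per-agent MLP followed by mean pooling,
    so the output ignores agent ordering; layer sizes are illustrative."""
    def __init__(self, obs_dim=8, hidden=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.rho = nn.Linear(hidden, hidden)

    def forward(self, obs):                 # obs: (batch, n_agents, obs_dim)
        return self.rho(self.phi(obs).mean(dim=1))

enc = PermutationInvariantEncoder()
x = torch.randn(4, 5, 8)
perm = x[:, torch.randperm(5)]              # shuffle the agents
assert torch.allclose(enc(x), enc(perm), atol=1e-6)  # order does not matter
```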

[AI-83] Scaffolding Creativity: Integrating Generative AI Tools and Real-world Experiences in Business Education

链接: https://arxiv.org/abs/2501.06527
作者: Nicole C. Wang
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:This case study explores the integration of Generative AI tools and real-world experiences in business education. Through a study of an innovative undergraduate course, we investigate how AI-assisted learning, combined with experiential components, impacts students’ creative processes and learning outcomes. Our findings reveal that this integrated approach accelerates knowledge acquisition, enables students to overcome traditional creative barriers, and facilitates a dynamic interplay between AI-generated insights and real-world observations. The study also highlights challenges, including the need for instructors with high AI literacy and the rapid evolution of AI tools creating a moving target for curriculum design. These insights contribute to the growing body of literature on AI in education and provide actionable recommendations for educators preparing students for the complexities of modern business environments.

[AI-84] Neural Codec Source Tracing: Toward Comprehensive Attribution in Open-Set Condition

链接: https://arxiv.org/abs/2501.06514
作者: Yuankun Xie,Xiaopeng Wang,Zhiyong Wang,Ruibo Fu,Zhengqi Wen,Songjun Cao,Long Ma,Chenxing Li,Haonnan Cheng,Long Ye
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Current research in audio deepfake detection is gradually transitioning from binary classification to multi-class tasks, referred to as the audio deepfake source tracing task. However, existing studies on source tracing consider only closed-set scenarios and have not considered the challenges posed by open-set conditions. In this paper, we define the Neural Codec Source Tracing (NCST) task, which is capable of performing open-set neural codec classification and interpretable ALM detection. Specifically, we constructed the ST-Codecfake dataset for the NCST task, which includes bilingual audio samples generated by 11 state-of-the-art neural codec methods and ALM-based out-of-distribution (OOD) test samples. Furthermore, we establish a comprehensive source tracing benchmark to assess NCST models in open-set conditions. The experimental results reveal that although the NCST models perform well in in-distribution (ID) classification and OOD detection, they lack robustness in classifying unseen real audio. The ST-Codecfake dataset and code are available.

[AI-85] Resource Allocation under the Latin Square Constraint AAMAS2025

链接: https://arxiv.org/abs/2501.06506
作者: Yasushi Kawase,Bodhayan Roy,Mohammad Azharuddin Sanpui
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: This paper has been accepted in AAMAS 2025 as an extended abstract

点击查看摘要

Abstract:A Latin square is an n × n matrix filled with n distinct symbols, each of which appears exactly once in each row and exactly once in each column. We introduce a problem of allocating n indivisible items among n agents over n rounds while satisfying the Latin square constraint. This constraint ensures that each agent receives no more than one item per round and receives each item at most once. Each agent has an additive valuation on the item–round pairs. Real-world applications like scheduling, resource management, and experimental design require the Latin square constraint to satisfy fairness or balancedness in allocation. Our goal is to find a partial or complete allocation that maximizes the sum of the agents’ valuations (utilitarian social welfare) or the minimum of the agents’ valuations (egalitarian social welfare). For the problem of maximizing utilitarian social welfare, we prove NP-hardness even when the valuations are binary additive. We then provide (1-1/e)- and (1-1/e)/4-approximation algorithms for partial and complete settings, respectively. Additionally, we present fixed-parameter tractable (FPT) algorithms with respect to the order of the Latin square and the optimum value for both partial and complete settings. For the problem of maximizing egalitarian social welfare, we establish that deciding whether the optimum value is at most 1 or at least 2 is NP-hard for both the partial and complete settings, even when the valuations are binary. Furthermore, we demonstrate that checking the existence of a complete allocation that satisfies each of envy-free, proportional, equitable, envy-free up to any good, proportional up to any good, or equitable up to any good is NP-hard, even when the valuations are identical.
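A quick sketch of the constraint itself: viewing an allocation as an n × n grid with rounds as rows and agents as columns, the Latin square condition says no item repeats within any row or any column. The checker below is illustrative, not part of the paper's algorithms:

```python
def is_latin_allocation(alloc):
    """alloc[r][a] = item agent a receives in round r, or None for no item.
    Checks the Latin square constraint: every item appears at most once in
    each round (row) and at most once per agent (column)."""
    n = len(alloc)
    rows_ok = all(
        len([x for x in row if x is not None]) ==
        len({x for x in row if x is not None}) for row in alloc)
    cols_ok = all(
        len([alloc[r][a] for r in range(n) if alloc[r][a] is not None]) ==
        len({alloc[r][a] for r in range(n) if alloc[r][a] is not None})
        for a in range(n))
    return rows_ok and cols_ok

alloc = [[0, 1, 2], [1, 2, 0], [2, 0, 1]]  # a complete 3x3 Latin allocation
print(is_latin_allocation(alloc))  # True
print(is_latin_allocation([[0, 0, 1], [1, 2, 0], [2, 1, None]]))  # False: item 0 twice in round 0
```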

[AI-86] Improving Requirements Classification with SMOTE-Tomek Preprocessing

链接: https://arxiv.org/abs/2501.06491
作者: Barak Or
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:This study emphasizes the domain of requirements engineering by applying the SMOTE-Tomek preprocessing technique, combined with stratified K-fold cross-validation, to address class imbalance in the PROMISE dataset. This dataset comprises 969 categorized requirements, classified into functional and non-functional types. The proposed approach enhances the representation of minority classes while maintaining the integrity of validation folds, leading to a notable improvement in classification accuracy. Logistic regression achieved 76.16%, significantly surpassing the baseline of 58.31%. These results highlight the applicability and efficiency of machine learning models as scalable and interpretable solutions.
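A minimal sketch of the described pipeline using imblearn and scikit-learn, resampling only inside each training fold so the validation folds keep their original class balance; the synthetic data and class weights below stand in for the PROMISE dataset:

```python
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced stand-in for the 969-requirement PROMISE dataset.
X, y = make_classification(n_samples=969, weights=[0.7, 0.3], random_state=0)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # Resample only the training fold; the validation fold stays untouched.
    X_res, y_res = SMOTETomek(random_state=0).fit_resample(X[train_idx], y[train_idx])
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
print(sum(scores) / len(scores))  # mean cross-validated accuracy
```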

[AI-87] A Diffusive Data Augmentation Framework for Reconstruction of Complex Network Evolutionary History

链接: https://arxiv.org/abs/2501.06485
作者: En Xu,Can Rong,Jingtao Ding,Yong Li
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The evolutionary processes of complex systems contain critical information regarding their functional characteristics. The generation time of edges provides insights into the historical evolution of various networked complex systems, such as protein-protein interaction networks, ecosystems, and social networks. Recovering these evolutionary processes holds significant scientific value, including aiding in the interpretation of the evolution of protein-protein interaction networks. However, existing methods are capable of predicting the generation times of remaining edges given a partial temporal network but often perform poorly in cross-network prediction tasks. These methods frequently fail in edge generation time recovery tasks for static networks that lack timestamps. In this work, we adopt a comparative paradigm-based framework that fuses multiple networks for training, enabling cross-network learning of the relationship between network structure and edge generation times. Compared to separate training, this approach yields an average accuracy improvement of 16.98%. Furthermore, given the difficulty in collecting temporal networks, we propose a novel diffusion-model-based generation method to produce a large number of temporal networks. By combining real temporal networks with generated ones for training, we achieve an additional average accuracy improvement of 5.46% through joint training.

[AI-88] The Internet of Large Language Models: An Orchestration Framework for LLM Training and Knowledge Exchange Toward Artificial General Intelligence

链接: https://arxiv.org/abs/2501.06471
作者: Wilson Wei,Nicholas Chen,Yuxuan Li
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper explores the multi-dimensional challenges faced during the development of Large Language Models (LLMs), including the massive scale of model parameters and file sizes, the complexity of development environment configuration, the singularity of model functionality, and the high costs of computational resources. To address these challenges, this paper proposes three core technical solutions: LLM sharing protocol, LLM universal environment framework, and Agent optimal path module. To solve the computational resource constraints in the early stages of research, we further innovatively propose a joint mining mechanism, achieving bilateral value sharing between computing power providers and model designers, including breakthrough rewards for optimal model paths and long-term profit distribution, thereby providing researchers with cost-optimized computational resource support and promoting the continuous development of LLM research and applications.

[AI-89] Assessing instructor-AI cooperation for grading essay-type questions in an introductory sociology course

链接: https://arxiv.org/abs/2501.06461
作者: Francisco Olivos,Tobias Kamelski,Sebastián Ascui-Gac
类目: Artificial Intelligence (cs.AI)
*备注: 10 figures, 2 tables

点击查看摘要

Abstract:This study explores the use of artificial intelligence (AI) as a complementary tool for grading essay-type questions in higher education, focusing on its consistency with human grading and potential to reduce biases. Using 70 handwritten exams from an introductory sociology course, we evaluated generative pre-trained transformers (GPT) models’ performance in transcribing and scoring students’ responses. GPT models were tested under various settings for both transcription and grading tasks. Results show high similarity between human and GPT transcriptions, with GPT-4o-mini outperforming GPT-4o in accuracy. For grading, GPT demonstrated strong correlations with the human grader scores, especially when template answers were provided. However, discrepancies remained, highlighting GPT’s role as a “second grader” to flag inconsistencies for assessment reviewing rather than fully replace human evaluation. This study contributes to the growing literature on AI in education, demonstrating its potential to enhance fairness and efficiency in grading essay-type questions.

[AI-90] On the Computational Capability of Graph Neural Networks: A Circuit Complexity Bound Perspective

链接: https://arxiv.org/abs/2501.06444
作者: Xiaoyu Li,Yingyu Liang,Zhenmei Shi,Zhao Song,Wei Wang,Jiahao Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become the standard approach for learning and reasoning over relational data, leveraging the message-passing mechanism that iteratively propagates node embeddings through graph structures. While GNNs have achieved significant empirical success, their theoretical limitations remain an active area of research. Existing studies primarily focus on characterizing GNN expressiveness through Weisfeiler-Lehman (WL) graph isomorphism tests. In this paper, we take a fundamentally different approach by exploring the computational limitations of GNNs through the lens of circuit complexity. Specifically, we analyze the circuit complexity of common GNN architectures and prove that under constraints of constant-depth layers, linear or sublinear embedding sizes, and polynomial precision, GNNs cannot solve key problems such as graph connectivity and graph isomorphism unless \mathsf{TC}^0 = \mathsf{NC}^1 . These results reveal the intrinsic expressivity limitations of GNNs behind their empirical success and introduce a novel framework for analyzing GNN expressiveness that can be extended to a broader range of GNN models and graph decision problems.

[AI-91] ARES: Auxiliary Range Expansion for Outlier Synthesis

链接: https://arxiv.org/abs/2501.06442
作者: Eui-Soo Jung,Hae-Hun Seo,Hyun-Woo Jung,Je-Geon Oh,Yoon-Yeong Kim
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent successes of artificial intelligence and deep learning often depend on the well-collected training dataset which is assumed to have an identical distribution with the test dataset. However, this assumption, which is called closed-set learning, is hard to meet in realistic scenarios for deploying deep learning models. As one of the solutions to mitigate this assumption, research on out-of-distribution (OOD) detection has been actively explored in various domains. In OOD detection, we assume that we are given the data of a new class that was not seen in the training phase, i.e., outlier, at the evaluation phase. The ultimate goal of OOD detection is to detect and classify such unseen outlier data as a novel “unknown” class. Among various research branches for OOD detection, generating a virtual outlier during the training phase has been proposed. However, conventional generation-based methodologies utilize the in-distribution training dataset to imitate outlier instances, which limits the quality of the synthesized virtual outlier instance itself. In this paper, we propose a novel methodology for OOD detection named Auxiliary Range Expansion for Outlier Synthesis, or ARES. ARES models the region for generating out-of-distribution instances by escaping from the given in-distribution region, instead of remaining near its boundary. ARES consists of various stages that ultimately generate valuable OOD-like virtual instances. The energy score-based discriminator is then trained to effectively separate in-distribution data and outlier data. Quantitative experiments on broad settings show the improvement of performance by our method, and qualitative results provide logical explanations of the mechanism behind it.
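
摘要提到用基于能量分数(energy score)的判别器区分分布内数据与离群数据。下面给出能量分数本身的一个最小示意(即 Liu 等人提出的标准定义);其中温度 T 与阈值 tau 均为假设参数,ARES 的离群样本生成流程请见原文。

```python
# 示意代码:基于能量分数的 OOD 打分
# 假设 classifier 为任意输出 logits 的 PyTorch 分类器
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    # 能量定义为 -T * logsumexp(logits / T);能量越低通常越像分布内样本
    return -T * torch.logsumexp(logits / T, dim=-1)

# 用法示例(tau 为假设阈值):能量高于 tau 的样本被标记为 "unknown"
# logits = classifier(x); is_ood = energy_score(logits) > tau
```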

[AI-92] Deep Learning on Hester Davis Scores for Inpatient Fall Prediction

链接: https://arxiv.org/abs/2501.06432
作者: Hojjat Salehinejad,Ricky Rojas,Kingsley Iheasirim,Mohammed Yousufuddin,Bijan Borah
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for presentation at IEEE SSCI 2025

点击查看摘要

Abstract:Fall risk prediction among hospitalized patients is a critical aspect of patient safety in clinical settings, and accurate models can help prevent adverse events. The Hester Davis Score (HDS) is commonly used to assess fall risk, with current clinical practice relying on a threshold-based approach. In this method, a patient is classified as high-risk when their HDS exceeds a predefined threshold. However, this approach may fail to capture dynamic patterns in fall risk over time. In this study, we model the threshold-based approach and propose two machine learning approaches for enhanced fall prediction: One-step ahead fall prediction and sequence-to-point fall prediction. The one-step ahead model uses the HDS at the current timestamp to predict the risk at the next timestamp, while the sequence-to-point model leverages all preceding HDS values to predict fall risk using deep learning. We compare these approaches to assess their accuracy in fall risk prediction, demonstrating that deep learning can outperform the traditional threshold-based method by capturing temporal patterns and improving prediction reliability. These findings highlight the potential for data-driven approaches to enhance patient safety through more reliable fall prevention strategies.
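
下面用 PyTorch 给出摘要中 sequence-to-point 范式的一个最小示意:以全部历史 HDS 序列预测下一时刻的跌倒风险。网络规模与超参数均为假设,并非论文模型。

```python
# 示意代码:sequence-to-point 跌倒风险预测(假设 HDS 已整理为等间隔时间序列)
import torch
import torch.nn as nn

class SeqToPoint(nn.Module):
    """用全部历史 HDS 值预测下一时刻跌倒风险(sequence-to-point)。"""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, hds_seq):                   # hds_seq: (batch, T, 1)
        _, (h, _) = self.lstm(hds_seq)
        return torch.sigmoid(self.head(h[-1]))    # 高风险概率
```

作为对照,one-step ahead 基线只用当前时刻分数预测下一时刻风险,例如一个 nn.Linear(1, 1) 加 sigmoid 的线性模型;两种范式可在同一标签上直接比较。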

[AI-93] AlgoPilot: Fully Autonomous Program Synthesis Without Human-Written Programs

链接: https://arxiv.org/abs/2501.06423
作者: Xiaoxin Yin
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Program synthesis has traditionally relied on human-provided specifications, examples, or prior knowledge to generate functional algorithms. Existing methods either emulate human-written algorithms or solve specific tasks without generating reusable programmatic logic, limiting their ability to create novel algorithms. We introduce AlgoPilot, a groundbreaking approach for fully automated program synthesis without human-written programs or trajectories. AlgoPilot leverages reinforcement learning (RL) guided by a Trajectory Language Model (TLM) to synthesize algorithms from scratch. The TLM, trained on trajectories generated by random Python functions, serves as a soft constraint during the RL process, aligning generated sequences with patterns likely to represent valid algorithms. Using sorting as a test case, AlgoPilot demonstrates its ability to generate trajectories that are interpretable as classical algorithms, such as Bubble Sort, while operating without prior algorithmic knowledge. This work establishes a new paradigm for algorithm discovery and lays the groundwork for future advancements in autonomous program synthesis.

[AI-94] DiscQuant: A Quantization Method for Neural Networks Inspired by Discrepancy Theory

链接: https://arxiv.org/abs/2501.06417
作者: Jerry Chee,Arturs Backurs,Rainie Heck,Li Zhang,Janardhan Kulkarni,Thomas Rothvoss,Sivakanth Gopi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Quantizing the weights of a neural network has two steps: (1) Finding a good low bit-complexity representation for weights (which we call the quantization grid) and (2) Rounding the original weights to values in the quantization grid. In this paper, we study the problem of rounding optimally given any quantization grid. The simplest and most commonly used way to round is Round-to-Nearest (RTN). By rounding in a data-dependent way instead, one can improve the quality of the quantized model significantly. We study the rounding problem from the lens of discrepancy theory, which studies how well we can round a continuous solution to a discrete solution without affecting solution quality too much. We prove that given m = \mathrm{poly}(1/\epsilon) samples from the data distribution, we can round all but O(m) model weights such that the expected approximation error of the quantized model on the true data distribution is \le \epsilon as long as the space of gradients of the original model is approximately low rank (which we empirically validate). Our proof, which is algorithmic, inspired a simple and practical rounding algorithm called DiscQuant. In our experiments, we demonstrate that DiscQuant significantly improves over the prior state-of-the-art rounding method called GPTQ and the baseline RTN over a range of benchmarks on Phi3mini-3.8B and Llama3.1-8B. For example, rounding Phi3mini-3.8B to a fixed quantization grid with 3.25 bits per parameter using DiscQuant gets 64% accuracy on the GSM8k dataset, whereas GPTQ achieves 54% and RTN achieves 31% (the original model achieves 84%). We make our code available at this https URL.
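
作为参照,下面给出摘要中作为基线的 Round-to-Nearest(RTN)舍入的最小示意;这里的均匀网格只是演示假设,DiscQuant 的数据相关舍入算法请见原文及其代码链接。

```python
# 示意代码:给定量化网格的 RTN 舍入基线
import torch

def round_to_nearest(w: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    # grid 为一维升序量化网格;对每个权重取距离最近的网格点
    idx = torch.argmin((w.unsqueeze(-1) - grid).abs(), dim=-1)
    return grid[idx]

w = torch.randn(4, 4)
grid = torch.linspace(-2, 2, steps=9)   # 假设的均匀网格(约 3.17 bit)
w_q = round_to_nearest(w, grid)
```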

[AI-95] Influencing Humans to Conform to Preference Models for RLHF

链接: https://arxiv.org/abs/2501.06416
作者: Stephane Hatgis-Kessell,W. Bradley Knox,Serena Booth,Scott Niekum,Peter Stone
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human’s unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human’s reward function. In this paper, we conduct three human studies to assess whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human’s unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally unobservable information derived from the reward function; training people to follow a specific preference model; and modifying the preference elicitation question. All intervention types show significant effects, providing practical tools to improve preference data quality and the resultant alignment of the learned reward functions. Overall we establish a novel research direction in model alignment: designing interfaces and training interventions to increase human conformance with the modeling assumptions of the algorithm that will learn from their input.

[AI-96] UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation

链接: https://arxiv.org/abs/2501.06394
作者: Zhengyan Sheng,Zhihao Du,Heng Lu,Shiliang Zhang,Zhen-Hua Ling
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Recent advancements in personalized speech generation have brought synthetic speech increasingly close to the realism of target speakers’ recordings, yet multimodal speaker generation remains on the rise. This paper introduces UniSpeaker, a unified approach for multimodality-driven speaker generation. Specifically, we propose a unified voice aggregator based on KV-Former, applying soft contrastive loss to map diverse voice description modalities into a shared voice space, ensuring that the generated voice aligns more closely with the input descriptions. To evaluate multimodality-driven voice control, we build the first multimodality-based voice control (MVC) benchmark, focusing on voice suitability, voice diversity, and speech quality. UniSpeaker is evaluated across five tasks using the MVC benchmark, and the experimental results demonstrate that UniSpeaker outperforms previous modality-specific models. Speech samples are available at this https URL.

[AI-97] Kolmogorov-Arnold networks for metal surface defect classification

链接: https://arxiv.org/abs/2501.06389
作者: Maciej Krzywda,Mariusz Wermiński,Szymon Łukasik,Amir H. Gandomi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:This paper presents the application of Kolmogorov-Arnold Networks (KAN) in classifying metal surface defects. Specifically, steel surfaces are analyzed to detect defects such as cracks, inclusions, patches, pitted surfaces, and scratches. Drawing on the Kolmogorov-Arnold theorem, KAN provides a novel approach compared to conventional multilayer perceptrons (MLPs), facilitating more efficient function approximation by utilizing spline functions. The results show that KAN networks can achieve better accuracy than convolutional neural networks (CNNs) with fewer parameters, resulting in faster convergence and improved performance in image classification.

[AI-98] Towards a Probabilistic Framework for Analyzing and Improving LLM-Enabled Software

链接: https://arxiv.org/abs/2501.06370
作者: Juan Manuel Baldonado,Flavia Bonomo-Braberman,Víctor Adrián Braberman
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Ensuring the reliability and verifiability of large language model (LLM)-enabled systems remains a significant challenge in software engineering. We propose a probabilistic framework for systematically analyzing and improving these systems by modeling and refining distributions over clusters of semantically equivalent outputs. This framework facilitates the evaluation and iterative improvement of Transference Models – key software components that utilize LLMs to transform inputs into outputs for downstream tasks. To illustrate its utility, we apply the framework to the autoformalization problem, where natural language documentation is transformed into formal program specifications. Our case illustrates how probabilistic analysis enables the identification of weaknesses and guides focused alignment improvements, resulting in more reliable and interpretable outputs. This principled approach offers a foundation for addressing critical challenges in the development of robust LLM-enabled systems.

[AI-99] On The Statistical Complexity of Offline Decision-Making ICML’24

链接: https://arxiv.org/abs/2501.06339
作者: Thanh Nguyen-Tang,Raman Arora
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: arXiv version for the ICML’24 paper

点击查看摘要

Abstract:We study the statistical complexity of offline decision-making with function approximation, establishing (near) minimax-optimal rates for stochastic contextual bandits and Markov decision processes. The performance limits are captured by the pseudo-dimension of the (value) function class and a new characterization of the behavior policy that strictly subsumes all the previous notions of data coverage in the offline decision-making literature. In addition, we seek to understand the benefits of using offline data in online decision-making and show nearly minimax-optimal rates in a wide range of regimes.

[AI-100] Aggregating Low Rank Adapters in Federated Fine-tuning

链接: https://arxiv.org/abs/2501.06332
作者: Evelyn Trautmann,Ian Hales,Martin F. Volk
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: presented at conference this https URL

点击查看摘要

Abstract:Fine-tuning large language models requires high computational and memory resources, and is therefore associated with significant costs. When training on federated datasets, an increased communication effort is also needed. For this reason, parameter-efficient methods (PEFT) are becoming increasingly important. In this context, very good results have already been achieved by fine-tuning with low-rank adaptation methods (LoRA). The application of LoRA methods in Federated Learning, and especially the aggregation of adaptation matrices, is a current research field. In this article, we propose a novel aggregation method and compare it with different existing aggregation methods of low rank adapters trained in a federated fine-tuning of large machine learning models and evaluate their performance with respect to selected GLUE benchmark datasets.
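
下面示意联邦场景下聚合 LoRA 适配器时的一个关键差别:分别平均低秩因子 A、B,与平均完整增量 ΔW = B·A 并不等价。此为笔者的说明性草图(矩阵形状与变量名为假设),论文提出的新型聚合方法请见原文。

```python
# 示意代码:两种 LoRA 聚合方式的对比
# 假设每个客户端 k 上传适配矩阵 (A_k, B_k),增量为 ΔW_k = B_k @ A_k
import torch

def avg_factors(As, Bs):
    # 朴素 FedAvg:分别平均 A 与 B;注意 mean(B) @ mean(A) != mean(B @ A)
    return torch.stack(As).mean(0), torch.stack(Bs).mean(0)

def avg_products(As, Bs):
    # 平均完整增量 ΔW:数学上更忠实,但结果秩可能升高,
    # 若要保持低秩形式还需再做低秩近似(如截断 SVD)
    return torch.stack([B @ A for A, B in zip(As, Bs)]).mean(0)
```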

[AI-101] Multi-Agent Collaboration Mechanisms: A Survey of LLMs

链接: https://arxiv.org/abs/2501.06322
作者: Khanh-Tung Tran,Dung Dao,Minh-Duong Nguyen,Quoc-Viet Pham,Barry O’Sullivan,Hoang D. Nguyen
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With recent advances in Large Language Models (LLMs), Agentic AI has become phenomenal in real-world applications, moving toward multiple LLM-based agents to perceive, learn, reason, and act collaboratively. These LLM-based Multi-Agent Systems (MASs) enable groups of intelligent agents to coordinate and solve complex tasks collectively at scale, transitioning from isolated models to collaboration-centric approaches. This work provides an extensive survey of the collaborative aspect of MASs and introduces an extensible framework to guide future research. Our framework characterizes collaboration mechanisms based on key dimensions: actors (agents involved), types (e.g., cooperation, competition, or coopetition), structures (e.g., peer-to-peer, centralized, or distributed), strategies (e.g., role-based or model-based), and coordination protocols. Through a review of existing methodologies, our findings serve as a foundation for demystifying and advancing LLM-based MASs toward more intelligent and collaborative solutions for complex, real-world use cases. In addition, various applications of MASs across diverse domains, including 5G/6G networks, Industry 5.0, question answering, and social and cultural settings, are also investigated, demonstrating their wider adoption and broader impacts. Finally, we identify key lessons learned, open challenges, and potential research directions of MASs towards artificial collective intelligence.

[AI-102] BioAgents: Democratizing Bioinformatics Analysis with Multi-Agent Systems

链接: https://arxiv.org/abs/2501.06314
作者: Nikita Mehandru,Amanda K. Hall,Olesya Melnichenko,Yulia Dubinina,Daniel Tsirulnikov,David Bamman,Ahmed Alaa,Scott Saponas,Venkat S. Malladi
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Creating end-to-end bioinformatics workflows requires diverse domain expertise, which poses challenges for both junior and senior researchers as it demands a deep understanding of both genomics concepts and computational techniques. While large language models (LLMs) provide some assistance, they often fall short in providing the nuanced guidance needed to execute complex bioinformatics tasks, and require expensive computing resources to achieve high performance. We thus propose a multi-agent system built on small language models, fine-tuned on bioinformatics data, and enhanced with retrieval augmented generation (RAG). Our system, BioAgents, enables local operation and personalization using proprietary data. We observe performance comparable to human experts on conceptual genomics tasks, and suggest next steps to enhance code generation capabilities.

[AI-103] Towards smart and adaptive agents for active sensing on edge devices

链接: https://arxiv.org/abs/2501.06262
作者: Devendra Vyas,Miguel de Prado,Tim Verbelen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:TinyML has made deploying deep learning models on low-power edge devices feasible, creating new opportunities for real-time perception in constrained environments. However, the adaptability of such deep learning methods remains limited to data drift adaptation, lacking broader capabilities that account for the environment’s underlying dynamics and inherent uncertainty. Deep learning’s scaling laws, which counterbalance this limitation by massively up-scaling data and model size, cannot be applied when deploying on the Edge, where deep learning limitations are further amplified as models are scaled down for deployment on resource-constrained devices. This paper presents a smart agentic system capable of performing on-device perception and planning, enabling active sensing on the edge. By incorporating active inference into our solution, our approach extends beyond deep learning capabilities, allowing the system to plan in dynamic environments while operating in real time with a modest total model size of 2.3 MB. We showcase our proposed system by creating and deploying a saccade agent connected to an IoT camera with pan and tilt capabilities on an NVIDIA Jetson embedded device. The saccade agent controls the camera’s field of view following optimal policies derived from the active inference principles, simulating human-like saccadic motion for surveillance and robotics applications.

[AI-104] Progressive Supervision via Label Decomposition: A Long-Term and Large-Scale Wireless Traffic Forecasting Method

链接: https://arxiv.org/abs/2501.06255
作者: Daojun Liang,Haixia Zhang,Dongfeng Yuan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at Knowledge-Based Systems. arXiv admin note: substantial text overlap with arXiv:2412.00108

点击查看摘要

Abstract:Long-term and Large-scale Wireless Traffic Forecasting (LL-WTF) is pivotal for strategic network management and comprehensive planning on a macro scale. However, LL-WTF poses greater challenges than short-term ones due to the pronounced non-stationarity of extended wireless traffic and the vast number of nodes distributed at the city scale. To cope with this, we propose a Progressive Supervision method based on Label Decomposition (PSLD). Specifically, we first introduce a Random Subgraph Sampling (RSS) algorithm designed to sample a tractable subset from large-scale traffic data, thereby enabling efficient network training. Then, PSLD employs label decomposition to obtain multiple easy-to-learn components, which are learned progressively at shallow layers and combined at deep layers to effectively cope with the non-stationary problem raised by LL-WTF tasks. Finally, we compare the proposed method with various state-of-the-art (SOTA) methods on three large-scale WT datasets. Extensive experimental results demonstrate that the proposed PSLD significantly outperforms existing methods, with an average 2%, 4%, and 11% performance improvement on three WT datasets, respectively. In addition, we built an open source library for WT forecasting (WTFlib) to facilitate related research, which contains numerous SOTA methods and provides a strong baseline. Results can be reproduced through this https URL.

[AI-105] A Survey on Algorithmic Developments in Optimal Transport Problem with Applications

链接: https://arxiv.org/abs/2501.06247
作者: Sina Moradi
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Optimal Transport (OT) has established itself as a robust framework for quantifying differences between distributions, with applications that span fields such as machine learning, data science, and computer vision. This paper offers a detailed examination of the OT problem, beginning with its theoretical foundations, including the classical formulations of Monge and Kantorovich and their extensions to modern computational techniques. It explores cutting-edge algorithms, including Sinkhorn iterations, primal-dual strategies, and reduction-based approaches, emphasizing their efficiency and scalability in addressing high-dimensional problems. The paper also highlights emerging trends, such as integrating OT into machine learning frameworks, the development of novel problem variants, and ongoing theoretical advancements. Applications of OT are presented across a range of domains, with particular attention to its innovative application in time series data analysis via Optimal Transport Warping (OTW), a robust alternative to methods like Dynamic Time Warping. Despite the significant progress made, challenges related to scalability, robustness, and ethical considerations remain, necessitating further research. The paper underscores OT’s potential to bridge theoretical depth and practical utility, fostering impactful advancements across diverse disciplines.

[AI-106] Microservice Deployment in Space Computing Power Networks via Robust Reinforcement Learning

链接: https://arxiv.org/abs/2501.06244
作者: Zhiyong Yu,Yuning Jiang,Xin Liu,Yuanming Shi,Chunxiao Jiang,Linling Kuang
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:With the growing demand for Earth observation, it is important to provide reliable real-time remote sensing inference services to meet the low-latency requirements. The Space Computing Power Network (Space-CPN) offers a promising solution by providing onboard computing and extensive coverage capabilities for real-time inference. This paper presents a remote sensing artificial intelligence applications deployment framework designed for Low Earth Orbit satellite constellations to achieve real-time inference performance. The framework employs the microservice architecture, decomposing monolithic inference tasks into reusable, independent modules to address high latency and resource heterogeneity. This distributed approach enables optimized microservice deployment, minimizing resource utilization while meeting quality of service and functional requirements. We introduce Robust Optimization to the deployment problem to address data uncertainty. Additionally, we model the Robust Optimization problem as a Partially Observable Markov Decision Process and propose a robust reinforcement learning algorithm to handle the semi-infinite Quality of Service constraints. Our approach yields sub-optimal solutions that minimize accuracy loss while maintaining acceptable computational costs. Simulation results demonstrate the effectiveness of our framework.

[AI-107] Agent TCP/IP: An Agent-to-Agent Transaction System

链接: https://arxiv.org/abs/2501.06243
作者: Andrea Muttoni,Jason Zhao
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
*备注: 17 pages, 2 figures

点击查看摘要

Abstract:Autonomous agents represent an inevitable evolution of the internet. Current agent frameworks do not embed a standard protocol for agent-to-agent interaction, leaving existing agents isolated from their peers. As intellectual property is the native asset ingested by and produced by agents, a true agent economy requires equipping agents with a universal framework for engaging in binding contracts with each other, including the exchange of valuable training data, personality, and other forms of Intellectual Property. A purely agent-to-agent transaction layer would transcend the need for human intermediation in multi-agent interactions. The Agent Transaction Control Protocol for Intellectual Property (ATCP/IP) introduces a trustless framework for exchanging IP between agents via programmable contracts, enabling agents to initiate, trade, borrow, and sell agent-to-agent contracts on the Story blockchain network. These contracts not only represent auditable onchain execution but also contain a legal wrapper that allows agents to express and enforce their actions in the offchain legal setting, creating legal personhood for agents. Via ATCP/IP, agents can autonomously sell their training data to other agents, license confidential or proprietary information, collaborate on content based on their unique skills, all of which constitutes an emergent knowledge economy.

[AI-108] Intelligent Task Offloading: Advanced MEC Task Offloading and Resource Management in 5G Networks

链接: https://arxiv.org/abs/2501.06242
作者: Alireza Ebrahimi,Fatemeh Afghah
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:5G technology enhances industries with high-speed, reliable, low-latency communication, revolutionizing mobile broadband and supporting massive IoT connectivity. With the increasing complexity of applications on User Equipment (UE), offloading resource-intensive tasks to robust servers is essential for improving latency and speed. The 3GPP’s Multi-access Edge Computing (MEC) framework addresses this challenge by processing tasks closer to the user, highlighting the need for an intelligent controller to optimize task offloading and resource allocation. This paper introduces a novel methodology to efficiently allocate both communication and computational resources among individual UEs. Our approach integrates two critical 5G service imperatives: Ultra-Reliable Low Latency Communication (URLLC) and Massive Machine Type Communication (mMTC), embedding them into the decision-making framework. Central to this approach is the utilization of Proximal Policy Optimization, providing a robust and efficient solution to the challenges posed by the evolving landscape of 5G technology. The proposed model is evaluated in a simulated 5G MEC environment. The model significantly reduces processing time by 4% for URLLC users under strict latency constraints and decreases power consumption by 26% for mMTC users, compared to existing baseline models based on the reported simulation results. These improvements showcase the model’s adaptability and superior performance in meeting diverse QoS requirements in 5G networks.

[AI-109] Forecasting Anonymized Electricity Load Profiles

链接: https://arxiv.org/abs/2501.06237
作者: Joaquin Delgado Fernandez,Sergio Potenciano Menci,Alessio Magitteri
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the evolving landscape of data privacy, the anonymization of electric load profiles has become a critical issue, especially with the enforcement of the General Data Protection Regulation (GDPR) in Europe. These electric load profiles, which are essential datasets in the energy industry, are classified as personal behavioral data, necessitating stringent protective measures. This article explores the implications of this classification, the importance of data anonymization, and the potential of forecasting using microaggregated data. The findings underscore that effective anonymization techniques, such as microaggregation, do not compromise the performance of forecasting models under certain conditions (i.e., forecasting aggregated). In such an aggregated level, microaggregated data maintains high levels of utility, with minimal impact on forecasting accuracy. The implications for the energy sector are profound, suggesting that privacy-preserving data practices can be integrated into smart metering technology applications without hindering their effectiveness.
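
下面给出"微聚合"(microaggregation)思路的一个简化示意:用组质心替换组内个体负荷曲线,之后可在聚合结果上训练预测模型以检验匿名化对精度的影响。注意严格的微聚合算法(如 MDAV)还需保证每组至少 k 条记录,此处用 KMeans 仅作演示假设。

```python
# 示意代码:以质心替换个体负荷曲线的简化微聚合
# 假设 X 的每一行是一条用户负荷曲线
import numpy as np
from sklearn.cluster import KMeans

def microaggregate(X: np.ndarray, n_groups: int, seed: int = 0) -> np.ndarray:
    km = KMeans(n_clusters=n_groups, random_state=seed, n_init=10).fit(X)
    return km.cluster_centers_[km.labels_]   # 每条曲线被所属组的质心替换

# 用法示例:在 microaggregate(X, n_groups=50) 上训练预测模型,
# 与原始数据上的模型精度对比,评估匿名化的影响
```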

[AI-110] Data-Driven Radio Propagation Modeling using Graph Neural Networks

链接: https://arxiv.org/abs/2501.06236
作者: Adrien Bufort,Laurent Lebocq,Stefan Cathabard
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Modeling radio propagation is essential for wireless network design and performance optimization. Traditional methods rely on physics models of radio propagation, which can be inaccurate or inflexible. In this work, we propose using graph neural networks to learn radio propagation behaviors directly from real-world network data. Our approach converts the radio propagation environment into a graph representation, with nodes corresponding to locations and edges representing spatial and ray-tracing relationships between locations. The graph is generated by converting images of the environment into a graph structure, with specific relationships between nodes. The model is trained on this graph representation, using sensor measurements as target data. We demonstrate that the graph neural network, which learns to predict radio propagation directly from data, achieves competitive performance compared to traditional heuristic models. This data-driven approach outperforms classic numerical solvers in terms of both speed and accuracy. To the best of our knowledge, we are the first to apply graph neural networks to real-world radio propagation data to generate coverage maps, enabling generative models of signal propagation with point measurements only.

[AI-111] Sustainable and Intelligent Public Facility Failure Management System Based on Large Language Models

链接: https://arxiv.org/abs/2501.06231
作者: Siguo Bi,Jilong Zhang,Wei Ni
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents a new Large Language Model (LLM)-based Smart Device Management framework, a pioneering approach designed to address the intricate challenges of managing intelligent devices within public facilities, with a particular emphasis on applications to libraries. Our framework leverages state-of-the-art LLMs to analyze and predict device failures, thereby enhancing operational efficiency and reliability. Through prototype validation in real-world library settings, we demonstrate the framework’s practical applicability and its capacity to significantly reduce budgetary constraints on public facilities. The advanced and innovative nature of our model is evident from its successful implementation in prototype testing. We plan to extend the framework’s scope to include a wider array of public facilities and to integrate it with cutting-edge cybersecurity technologies, such as Internet of Things (IoT) security and machine learning algorithms for threat detection and response. This will result in a comprehensive and proactive maintenance system that not only bolsters the security of intelligent devices but also utilizes machine learning for automated analysis and real-time threat mitigation. By incorporating these advanced cybersecurity elements, our framework will be well-positioned to tackle the dynamic challenges of modern public infrastructure, ensuring robust protection against potential threats and enabling facilities to anticipate and prevent failures, leading to substantial cost savings and enhanced service quality.

[AI-112] asanAI: In-Browser No-Code Offline-First Machine Learning Toolkit

链接: https://arxiv.org/abs/2501.06226
作者: Norman Koch,Siavash Ghiasvand
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 7 pages, 8 figures

点击查看摘要

Abstract:Machine learning (ML) has become crucial in modern life, with growing interest from researchers and the public. Despite its potential, a significant entry barrier prevents widespread adoption, making it challenging for non-experts to understand and implement ML techniques. The increasing desire to leverage ML is counterbalanced by its technical complexity, creating a gap between potential and practical application. This work introduces asanAI, an offline-first, open-source, no-code machine learning toolkit designed for users of all skill levels. It allows individuals to design, debug, train, and test ML models directly in a web browser, eliminating the need for software installations and coding. The toolkit runs on any device with a modern web browser, including smartphones, and ensures user privacy through local computations while utilizing WebGL for enhanced GPU performance. Users can quickly experiment with neural networks and train custom models using various data sources, supported by intuitive visualizations of network structures and data flows. asanAI simplifies the teaching of ML concepts in educational settings and is released under an open-source MIT license, encouraging modifications. It also supports exporting models in industry-ready formats, empowering a diverse range of users to effectively learn and apply machine learning in their projects. The proposed toolkit is successfully utilized by researchers of this http URL to swiftly draft and test machine learning ideas, by trainers to effectively educate enthusiasts, and by teachers to introduce contemporary ML topics in classrooms with minimal effort and high clarity.

[AI-113] A Novel Method for Pignistic Information Fusion in the View of Z-number

链接: https://arxiv.org/abs/2501.06201
作者: Yuanpeng He
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:How to properly fuse information from complex sources is still an open problem. Lots of methods have been put forward to provide an effective solution for fusing intricate information. Among them, Dempster-Shafer evidence theory (DSET) is one of the representatives; it is widely used to handle uncertain information. Based on DSET, this paper proposes a completely new method to fuse information from different sources based on pignistic transformation and Z-numbers, which is able to handle separate situations of information and keeps high accuracy in producing rational and correct judgments on actual situations. Besides, in order to illustrate the superiority of the proposed method, some numerical examples and applications are also provided to verify its validity and robustness.
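
摘要中的 pignistic 变换(BetP)可以用几行代码说明:它把每个焦元的质量均摊给其中的单元素命题。此处假设 m(∅)=0,示例数值为笔者虚构,仅用于演示变换本身。

```python
# 示意代码:Dempster-Shafer 证据理论中的 pignistic 变换 BetP
def pignistic(mass: dict) -> dict:
    """mass: {frozenset: m(A)};返回 {元素: BetP(元素)}。"""
    betp = {}
    for A, m in mass.items():
        for x in A:
            betp[x] = betp.get(x, 0.0) + m / len(A)
    return betp

m = {frozenset({"a"}): 0.5, frozenset({"a", "b"}): 0.3, frozenset({"b", "c"}): 0.2}
print(pignistic(m))   # {'a': 0.65, 'b': 0.25, 'c': 0.1}
```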

[AI-114] A Computational Model of Learning and Memory Using Structurally Dynamic Cellular Automata

链接: https://arxiv.org/abs/2501.06192
作者: Jeet Singh
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Dynamical Systems (math.DS); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:In the fields of computation and neuroscience, much is still unknown about the underlying computations that enable key cognitive functions including learning, memory, abstraction and behavior. This paper proposes a mathematical and computational model of learning and memory based on a small set of bio-plausible functions that include coincidence detection, signal modulation, and reward/penalty mechanisms. Our theoretical approach proposes that these basic functions are sufficient to establish and modulate an information space over which computation can be carried out, generating signal gradients usable for inference and behavior. The computational method used to test this is a structurally dynamic cellular automaton with continuous-valued cell states and a series of recursive steps propagating over an undirected graph with the memory function embedded entirely in the creation and modulation of graph edges. The experimental results show: that the toy model can make near-optimal choices to re-discover a reward state after a single training run; that it can avoid complex penalty configurations; that signal modulation and network plasticity can generate exploratory behaviors in sparse reward environments; that the model generates context-dependent memory representations; and that it exhibits high computational efficiency because of its minimal, single-pass training requirements combined with flexible and contextual memory representation.

[AI-115] Attention when you need

链接: https://arxiv.org/abs/2501.07440
作者: Lokesh Boominathan,Yizhou Chen,Matthew McGinley,Xaq Pitkow
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Being attentive to task-relevant features can improve task performance, but paying attention comes with its own metabolic cost. Therefore, strategic allocation of attention is crucial in performing the task efficiently. This work aims to understand this strategy. Recently, de Gee et al. conducted experiments involving mice performing an auditory sustained attention-value task. This task required the mice to exert attention to identify whether a high-order acoustic feature was present amid the noise. By varying the trial duration and reward magnitude, the task allows us to investigate how an agent should strategically deploy their attention to maximize their benefits and minimize their costs. In our work, we develop a reinforcement learning-based normative model of the mice to understand how it balances attention cost against its benefits. The model is such that at each moment the mice can choose between two levels of attention and decide when to take costly actions that could obtain rewards. Our model suggests that efficient use of attentional resources involves alternating blocks of high attention with blocks of low attention. In the extreme case where the agent disregards sensory input during low attention states, we see that high attention is used rhythmically. Our model provides evidence about how one should deploy attention as a function of task utility, signal statistics, and how attention affects sensory evidence.

[AI-116] The Spoils of Algorithmic Collusion: Profit Allocation Among Asymmetric Firms

链接: https://arxiv.org/abs/2501.07178
作者: Simon Martin,Hans-Theo Normann,Paul Püplichhuisen,Tobias Werner
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We study the propensity of independent algorithms to collude in repeated Cournot duopoly games. Specifically, we investigate the predictive power of different oligopoly and bargaining solutions regarding the effect of asymmetry between firms. We find that both consumers and firms can benefit from asymmetry. Algorithms produce more competitive outcomes when firms are symmetric, but less when they are very asymmetric. Although the static Nash equilibrium underestimates the effect on total quantity and overestimates the effect on profits, it delivers surprisingly accurate predictions in terms of total welfare. The best description of our results is provided by the equal relative gains solution. In particular, we find algorithms to agree on profits that are on or close to the Pareto frontier for all degrees of asymmetry. Our results suggest that the common belief that symmetric industries are more prone to collusion may no longer hold when algorithms increasingly drive managerial decisions.
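
为了让摘要中作为预测基准的"静态纳什均衡"更具体,下面给出线性需求、非对称边际成本下古诺双寡头的标准教科书解;记号 a、b、c_i 为笔者假设,并非原文符号。

```latex
% 设逆需求 P = a - b(q_1 + q_2),厂商 i 的边际成本为 c_i,
% 利润 \pi_i = \bigl(a - b(q_i + q_j) - c_i\bigr) q_i。
% 联立两家厂商的一阶条件即得纳什均衡:
\[
q_i^{*} = \frac{a - 2c_i + c_j}{3b}, \qquad
P^{*} = \frac{a + c_i + c_j}{3}, \qquad
\pi_i^{*} = b\,\bigl(q_i^{*}\bigr)^{2}, \qquad i \neq j .
\]
```

由此可见成本越低的厂商在均衡中产量与利润越高,这正是论文用来衡量"非对称性"影响的参照点之一。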

[AI-117] Discrete Speech Unit Extraction via Independent Component Analysis ICASSP2025

链接: https://arxiv.org/abs/2501.06562
作者: Tomohiko Nakamura,Kwanghee Choi,Keigo Hojo,Yoshiaki Bando,Satoru Fukayama,Shinji Watanabe
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to ICASSP 2025 SALMA Workshop. Code available at this https URL

点击查看摘要

Abstract:Self-supervised speech models (S3Ms) have become a common tool for the speech processing community, leveraging representations for downstream tasks. Clustering S3M representations yields discrete speech units (DSUs), which serve as compact representations for speech signals. DSUs are typically obtained by k-means clustering. Using DSUs often leads to strong performance in various tasks, including automatic speech recognition (ASR). However, even with the high dimensionality and redundancy of S3M representations, preprocessing S3M representations for better clustering remains unexplored, even though it can affect the quality of DSUs. In this paper, we investigate the potential of linear preprocessing methods for extracting DSUs. We evaluate standardization, principal component analysis, whitening, and independent component analysis (ICA) on DSU-based ASR benchmarks and demonstrate their effectiveness as preprocessing for k-means. We also conduct extensive analyses of their behavior, such as orthogonality or interpretability of individual components of ICA.
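
下面是"先 ICA 预处理、再 k-means 聚类得到离散语音单元(DSU)"这一流程的最小 sklearn 示意;特征维度、分量数与聚类数均为假设值,实际设置请见论文及其代码链接。

```python
# 示意代码:ICA 预处理 + k-means 离散化 S3M 表征
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans

feats = np.random.randn(10000, 768)   # 占位数据,实际应为 (帧数, 维度) 的 S3M 特征

ica = FastICA(n_components=128, whiten="unit-variance", random_state=0)
feats_ica = ica.fit_transform(feats)

kmeans = KMeans(n_clusters=500, random_state=0, n_init=10).fit(feats_ica)
dsu_ids = kmeans.labels_              # 每一帧被离散化为一个单元编号
```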

[AI-118] Determination of galaxy photometric redshifts using Conditional Generative Adversarial Networks (CGANs)

链接: https://arxiv.org/abs/2501.06532
作者: M. Garcia-Fernandez
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate and reliable photometric redshift determination is one of the key aspects for wide-field photometric surveys. Determination of photometric redshifts for galaxies has traditionally been solved by machine-learning and artificial intelligence techniques trained on a calibration sample of galaxies, where both photometry and spectrometry are determined. In this paper, we present a new algorithmic approach for determining photometric redshifts of galaxies using Conditional Generative Adversarial Networks (CGANs). The proposed CGAN implementation approaches photometric redshift determination as a probabilistic regression, where instead of determining a single value for the estimated redshift of the galaxy, a full probability density is computed. The proposed methodology is tested with Dark Energy Survey (DES) Y1 data and compared with other existing algorithms, such as a Random Forest regressor.

[AI-119] TopoFormer: Integrating Transformers and ConvLSTMs for Coastal Topography Prediction

链接: https://arxiv.org/abs/2501.06494
作者: Santosh Munian,Oktay Karakuş,William Russell,Gwyn Nelson
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: 11 pages, 5 figures, 1 table

点击查看摘要

Abstract:This paper presents TopoFormer, a novel hybrid deep learning architecture that integrates transformer-based encoders with convolutional long short-term memory (ConvLSTM) layers for the precise prediction of topographic beach profiles referenced to elevation datums, with a particular focus on Mean Low Water Springs (MLWS) and Mean Low Water Neaps (MLWN). Accurate topographic estimation down to MLWS is critical for coastal management, navigation safety, and environmental monitoring. Leveraging a comprehensive dataset from the Wales Coastal Monitoring Centre (WCMC), consisting of over 2000 surveys across 36 coastal survey units, TopoFormer addresses key challenges in topographic prediction, including temporal variability and data gaps in survey measurements. The architecture uniquely combines multi-head attention mechanisms and ConvLSTM layers to capture both long-range dependencies and localized temporal patterns inherent in beach profile data. TopoFormer’s predictive performance was rigorously evaluated against state-of-the-art models, including DenseNet, 1D/2D CNNs, and LSTMs. While all models demonstrated strong performance, TopoFormer achieved the lowest mean absolute error (MAE), as low as 2 cm, and provided superior accuracy in both in-distribution (ID) and out-of-distribution (OOD) evaluations.

[AI-120] LensNet: Enhancing Real-time Microlensing Event Discovery with Recurrent Neural Networks in the Korea Microlensing Telescope Network

链接: https://arxiv.org/abs/2501.06293
作者: Javier Viaña,Kyu-Ha Hwang,Zoë de Beurs,Jennifer C. Yee,Andrew Vanderburg,Michael D. Albrow,Sun-Ju Chung,Andrew Gould,Cheongho Han,Youn Kil Jung,Yoon-Hyun Ryu,In-Gu Shin,Yossi Shvartzvald,Hongjing Yang,Weicheng Zang,Sang-Mok Cha,Dong-Jin Kim,Seung-Lee Kim,Chung-Uk Lee,Dong-Joo Lee,Yongseok Lee,Byeong-Gon Park,Richard W. Pogge
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI)
*备注: 23 pages, 13 figures, Accepted for publication in the The Astronomical Journal

点击查看摘要

Abstract:Traditional microlensing event vetting methods require highly trained human experts, and the process is both complex and time-consuming. This reliance on manual inspection often leads to inefficiencies and constrains the ability to scale for widespread exoplanet detection, ultimately hindering discovery rates. To address the limits of traditional microlensing event vetting, we have developed LensNet, a machine learning pipeline specifically designed to distinguish legitimate microlensing events from false positives caused by instrumental artifacts, such as pixel bleed trails and diffraction spikes. Our system operates in conjunction with a preliminary algorithm that detects increasing trends in flux. These flagged instances are then passed to LensNet for further classification, allowing for timely alerts and follow-up observations. Tailored for the multi-observatory setup of the Korea Microlensing Telescope Network (KMTNet) and trained on a rich dataset of manually classified events, LensNet is optimized for early detection and warning of microlensing occurrences, enabling astronomers to organize follow-up observations promptly. The internal model of the pipeline employs a multi-branch Recurrent Neural Network (RNN) architecture that evaluates time-series flux data with contextual information, including sky background, the full width at half maximum of the target star, flux errors, PSF quality flags, and air mass for each observation. We demonstrate a classification accuracy above 87.5%, and anticipate further improvements as we expand our training set and continue to refine the algorithm.

[AI-121] Large Language Models for Bioinformatics

链接: https://arxiv.org/abs/2501.06271
作者: Wei Ruan,Yanjun Lyu,Jing Zhang,Jiazhang Cai,Peng Shu,Yang Ge,Yao Lu,Shang Gao,Yue Wang,Peilong Wang,Lin Zhao,Tao Wang,Yufang Liu,Luyang Fang,Ziyu Liu,Zhengliang Liu,Yiwei Li,Zihao Wu,Junhao Chen,Hanqi Jiang,Yi Pan,Zhenyuan Yang,Jingyuan Chen,Shizhe Liang,Wei Zhang,Terry Ma,Yuan Dou,Jianli Zhang,Xinyu Gong,Qi Gan,Yusong Zou,Zebang Chen,Yuanxin Qian,Shuo Yu,Jin Lu,Kenan Song,Xianqiao Wang,Andrea Sikora,Gang Li,Xiang Li,Quanzheng Li,Yingfeng Wang,Lu Zhang,Yohannes Abate,Lifang He,Wenxuan Zhong,Rongjie Liu,Chao Huang,Wei Liu,Ye Shen,Ping Ma,Hongtu Zhu,Yajun Yan,Dajiang Zhu,Tianming Liu
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注: 64 pages, 1 figure

点击查看摘要

Abstract:With the rapid advancements in large language model (LLM) technology and the emergence of bioinformatics-specific language models (BioLMs), there is a growing need for a comprehensive analysis of the current landscape, computational characteristics, and diverse applications. This survey aims to address this need by providing a thorough review of BioLMs, focusing on their evolution, classification, and distinguishing features, alongside a detailed examination of training methodologies, datasets, and evaluation frameworks. We explore the wide-ranging applications of BioLMs in critical areas such as disease diagnosis, drug discovery, and vaccine development, highlighting their impact and transformative potential in bioinformatics. We identify key challenges and limitations inherent in BioLMs, including data privacy and security concerns, interpretability issues, biases in training data and model outputs, and domain adaptation complexities. Finally, we highlight emerging trends and future directions, offering valuable insights to guide researchers and clinicians toward advancing BioLMs for increasingly sophisticated biological and clinical applications.

[AI-122] How Do Artificial Intelligences Think? The Three Mathematico-Cognitive Factors of Categorical Segmentation Operated by Synthetic Neurons

链接: https://arxiv.org/abs/2501.06196
作者: Michael Pichat,William Pogrund,Armanush Gasparian,Paloma Pichat,Samuel Demarchi,Michael Veillet-Guillem
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:How do the synthetic neurons in language models create “thought categories” to segment and analyze their informational environment? What are the cognitive characteristics, at the very level of formal neurons, of this artificial categorical thought? Based on the mathematical nature of algebraic operations inherent to neuronal aggregation functions, we attempt to identify mathematico-cognitive factors that genetically shape the categorical reconstruction of the informational world faced by artificial cognition. This study explores these concepts through the notions of priming, attention, and categorical phasing.

机器学习

[LG-0] E2ESlack: An End-to-End Graph-Based Framework for Pre-Routing Slack Prediction

链接: https://arxiv.org/abs/2501.07564
作者: Saurabh Bodhe,Zhanguang Zhang,Atia Hamidizadeh,Shixiong Kai,Yingxue Zhang,Mingxuan Yuan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pre-routing slack prediction remains a critical area of research in Electronic Design Automation (EDA). Despite numerous machine learning-based approaches targeting this task, there is still a lack of a truly end-to-end framework that engineers can use to obtain TNS/WNS metrics from raw circuit data at the placement stage. Existing works have demonstrated effectiveness in Arrival Time (AT) prediction but lack a mechanism for Required Arrival Time (RAT) prediction, which is essential for slack prediction and obtaining TNS/WNS metrics. In this work, we propose E2ESlack, an end-to-end graph-based framework for pre-routing slack prediction. The framework includes a TimingParser that supports DEF, SDF and LIB files for feature extraction and graph construction, an arrival time prediction model and a fast RAT estimation module. To the best of our knowledge, this is the first work capable of predicting path-level slacks at the pre-routing stage. We perform extensive experiments and demonstrate that our proposed RAT estimation method outperforms the SOTA ML-based prediction method and also the pre-routing STA tool. Additionally, the proposed E2ESlack framework achieves TNS/WNS values comparable to post-routing STA results while saving up to 23x runtime.

[LG-1] Dynamic Prototype Rehearsal for Continual Learning in ECG Arrhythmia Detection ICASSP2025

链接: https://arxiv.org/abs/2501.07555
作者: Sana Rahmani,Reetam Chatterjee,Ali Etemad,Javad Hashemi
类目: Machine Learning (cs.LG)
*备注: Accepted to 2025 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

点击查看摘要

Abstract:Continual Learning (CL) methods aim to learn from a sequence of tasks while avoiding the challenge of forgetting previous knowledge. We present DREAM-CL, a novel CL method for ECG arrhythmia detection that introduces dynamic prototype rehearsal memory. DREAM-CL selects representative prototypes by clustering data based on learning behavior during each training session. Within each cluster, we apply a smooth sorting operation that ranks samples by training difficulty, compressing extreme values and removing outliers. The more challenging samples are then chosen as prototypes for the rehearsal memory, ensuring effective knowledge retention across sessions. We evaluate our method on time-incremental, class-incremental, and lead-incremental scenarios using two widely used ECG arrhythmia datasets, Chapman and PTB-XL. The results demonstrate that DREAM-CL outperforms the state-of-the-art in CL for ECG arrhythmia detection. Detailed ablation and sensitivity studies are performed to validate the different design choices of our method.
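
下面以 numpy/sklearn 给出 DREAM-CL 原型筛选思路的简化示意:先按特征聚类,组内按训练难度(此处以样本损失作为假设的难度代理)排序,剔除最极端样本后选较难样本进入回放记忆。"平滑排序"的具体定义请见原文,此处仅以截断近似代替。

```python
# 示意代码:聚类 + 按难度排序的原型筛选
import numpy as np
from sklearn.cluster import KMeans

def select_prototypes(feats, losses, n_clusters=8, per_cluster=4):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)
    memory = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # 按难度升序排序,并去掉最极端的 5% 作为离群值
        order = idx[np.argsort(losses[idx])]
        keep = order[: max(1, int(len(order) * 0.95))]
        memory.extend(keep[-per_cluster:])   # 选剩余样本中最难的若干个
    return np.array(memory)
```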

[LG-2] ML Mule: Mobile-Driven Context-Aware Collaborative Learning

链接: https://arxiv.org/abs/2501.07536
作者: Haoxiang Yu,Javier Berrocal,Christine Julien
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Artificial intelligence has been integrated into nearly every aspect of daily life, powering applications from object detection with computer vision to large language models for writing emails and compact models in smart homes. These machine learning models cater to individual users but are often detached from them, as they are typically stored and processed in centralized data centers. This centralized approach raises privacy concerns, incurs high infrastructure costs, and struggles with personalization. Federated and fully decentralized learning methods have been proposed to address these issues, but they still depend on centralized servers or face slow convergence due to communication constraints. To overcome these challenges, we propose ML Mule, an approach that utilizes individual mobile devices as ‘Mules’ to train and transport model snapshots as they move through physical spaces, sharing these models with the physical ‘Spaces’ they inhabit. This method implicitly forms affinity groups among devices associated with users who share particular spaces, enabling collaborative model evolution while protecting users’ privacy. Our approach addresses several major shortcomings of traditional, federated, and fully decentralized learning systems. The proposed framework represents a new class of machine learning methods that are more robust, distributed, and personalized, bringing the field closer to realizing the original vision of intelligent, adaptive, and genuinely context-aware smart environments. The results show that ML Mule converges faster and achieves higher model accuracy compared to other existing methods.

[LG-3] Investigating Map-Based Path Loss Models: A Study of Feature Representations in Convolutional Neural Networks

链接: https://arxiv.org/abs/2501.07534
作者: Ryan G. Dempsey,Jonathan Ethier,Halim Yanikomeroglu
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 4 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Path loss prediction is a beneficial tool for efficient use of the radio frequency spectrum. Building on prior research on high-resolution map-based path loss models, this paper studies convolutional neural network input representations in more detail. We investigate different methods of representing scalar features in convolutional neural networks. Specifically, we compare using frequency and distance as input channels to convolutional layers or as scalar inputs to regression layers. We assess model performance using three different feature configurations and find that representing scalar features as image channels results in the strongest generalization.
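The two representations under comparison can be sketched in a few lines of PyTorch: scalars either become constant-valued image channels fed to the convolutional stack, or are concatenated with the pooled features before the regression head. This toy model only shows the structural difference; the paper's actual architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class MapPathLossNet(nn.Module):
    """Toy path-loss regressor over map patches, illustrating the two scalar
    feature representations compared in the paper (names are illustrative)."""
    def __init__(self, scalars_as_channels: bool, n_scalars: int = 2):
        super().__init__()
        self.scalars_as_channels = scalars_as_channels
        in_ch = 1 + (n_scalars if scalars_as_channels else 0)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        head_in = 16 + (0 if scalars_as_channels else n_scalars)
        self.head = nn.Linear(head_in, 1)

    def forward(self, maps, scalars):
        # maps: (B, 1, H, W) terrain/building map; scalars: (B, n), e.g. (freq, dist)
        if self.scalars_as_channels:
            b, _, h, w = maps.shape
            planes = scalars[:, :, None, None].expand(b, scalars.shape[1], h, w)
            x = self.conv(torch.cat([maps, planes], dim=1))   # constant-valued channels
        else:
            x = torch.cat([self.conv(maps), scalars], dim=1)  # scalars fed to the head
        return self.head(x)

maps, scalars = torch.randn(4, 1, 32, 32), torch.randn(4, 2)
print(MapPathLossNet(True)(maps, scalars).shape)   # torch.Size([4, 1])
print(MapPathLossNet(False)(maps, scalars).shape)  # torch.Size([4, 1])
```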

[LG-4] Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

链接: https://arxiv.org/abs/2501.07493
作者: Yangsibo Huang,Milad Nasr,Anastasios Angelopoulos,Nicholas Carlini,Wei-Lin Chiang,Christopher A. Choquette-Choo,Daphne Ippolito,Matthew Jagielski,Katherine Lee,Ken Ziyu Liu,Ion Stoica,Florian Tramer,Chiyuan Zhang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than 95% accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increases the cost of such attacks. Some of these defenses were present before our collaboration, such as bot protection with Cloudflare, malicious user detection, and rate limiting. Others, including reCAPTCHA and login, are being integrated to strengthen the security in Chatbot Arena.

[LG-5] MVICAD2: Multi-View Independent Component Analysis with Delays and Dilations

链接: https://arxiv.org/abs/2501.07426
作者: Ambroise Heurtebise,Omar Chehab,Pierre Ablin,Alexandre Gramfort
类目: Machine Learning (cs.LG)
*备注: 19 pages, 8 figures

点击查看摘要

Abstract:Machine learning techniques in multi-view settings face significant challenges, particularly when integrating heterogeneous data, aligning feature spaces, and managing view-specific biases. These issues are prominent in neuroscience, where data from multiple subjects exposed to the same stimuli are analyzed to uncover brain activity dynamics. In magnetoencephalography (MEG), where signals are captured at the scalp level, estimating the brain’s underlying sources is crucial, especially in group studies where sources are assumed to be similar for all subjects. Common methods, such as Multi-View Independent Component Analysis (MVICA), assume identical sources across subjects, but this assumption is often too restrictive due to individual variability and age-related changes. Multi-View Independent Component Analysis with Delays (MVICAD) addresses this by allowing sources to differ up to a temporal delay. However, temporal dilation effects, particularly in auditory stimuli, are common in brain dynamics, making the estimation of time delays alone insufficient. To address this, we propose Multi-View Independent Component Analysis with Delays and Dilations (MVICAD2), which allows sources to differ across subjects in both temporal delays and dilations. We present a model with identifiable sources, derive an approximation of its likelihood in closed form, and use regularization and optimization techniques to enhance performance. Through simulations, we demonstrate that MVICAD2 outperforms existing multi-view ICA methods. We further validate its effectiveness using the Cam-CAN dataset, and showing how delays and dilations are related to aging.

[LG-6] Dynami-CAL GraphNet: A Physics-Informed Graph Neural Network Conserving Linear and Angular Momentum for Dynamical Systems

链接: https://arxiv.org/abs/2501.07373
作者: Vinay Sharma,Olga Fink
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Accurate, interpretable, and real-time modeling of multi-body dynamical systems is essential for predicting behaviors and inferring physical properties in natural and engineered environments. Traditional physics-based models face scalability challenges and are computationally demanding, while data-driven approaches like Graph Neural Networks (GNNs) often lack physical consistency, interpretability, and generalization. In this paper, we propose Dynami-CAL GraphNet, a Physics-Informed Graph Neural Network that integrates the learning capabilities of GNNs with physics-based inductive biases to address these limitations. Dynami-CAL GraphNet enforces pairwise conservation of linear and angular momentum for interacting nodes using edge-local reference frames that are equivariant to rotational symmetries, invariant to translations, and equivariant to node permutations. This design ensures physically consistent predictions of node dynamics while offering interpretable, edge-wise linear and angular impulses resulting from pairwise interactions. Evaluated on a 3D granular system with inelastic collisions, Dynami-CAL GraphNet demonstrates stable error accumulation over extended rollouts, effective extrapolations to unseen configurations, and robust handling of heterogeneous interactions and external forces. Dynami-CAL GraphNet offers significant advantages in fields requiring accurate, interpretable, and real-time modeling of complex multi-body dynamical systems, such as robotics, aerospace engineering, and materials science. By providing physically consistent and scalable predictions that adhere to fundamental conservation laws, it enables the inference of forces and moments while efficiently handling heterogeneous interactions and external forces.

[LG-7] Multimodal semantic retrieval for product search

链接: https://arxiv.org/abs/2501.07365
作者: Dong Liu,Esther Lopez Ramos
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semantic retrieval (also known as dense retrieval) based on textual data has been extensively studied for both web search and product search applications, where the relevance of a query and a potential target document is computed by comparing their dense vector representations. Product images are crucial for e-commerce search interactions and are a key factor in customers’ product exploration, but their impact on semantic retrieval has not yet been well studied. In this research, we build a multimodal representation for product items in e-commerce search, in contrast to a pure-text representation of products, and investigate the impact of such representations. The models are developed and evaluated on e-commerce datasets. We demonstrate that a multimodal representation scheme for a product can improve either purchase recall or relevance accuracy in semantic retrieval. Additionally, we provide a numerical analysis of matches retrieved exclusively by the multimodal semantic retrieval model versus a text-only model, to demonstrate the validity of the multimodal solution.
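A minimal sketch of the late-fusion idea: normalize text and image embeddings, combine them into one product representation, and rank by cosine similarity against the query embedding. The fusion weight and encoder outputs are assumptions for illustration; the paper does not prescribe this exact scheme.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fuse(text_emb, image_emb, alpha=0.5):
    """Late-fusion multimodal product representation: a weighted combination
    of L2-normalized text and image embeddings (one simple fusion choice)."""
    return l2norm(alpha * l2norm(text_emb) + (1 - alpha) * l2norm(image_emb))

rng = np.random.default_rng(0)
query = l2norm(rng.normal(size=(128,)))        # query encoder output
prod_text = rng.normal(size=(1000, 128))       # product title/description embeddings
prod_image = rng.normal(size=(1000, 128))      # product image embeddings

scores = fuse(prod_text, prod_image) @ query   # cosine similarity (all unit-norm)
top10 = np.argsort(-scores)[:10]               # retrieve the 10 closest products
```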

[LG-8] Deep Generative Clustering with VAEs and Expectation-Maximization

链接: https://arxiv.org/abs/2501.07358
作者: Michael Adipoetra,Ségolène Martin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a novel deep clustering method that integrates Variational Autoencoders (VAEs) into the Expectation-Maximization (EM) framework. Our approach models the probability distribution of each cluster with a VAE and alternates between updating model parameters by maximizing the Evidence Lower Bound (ELBO) of the log-likelihood and refining cluster assignments based on the learned distributions. This enables effective clustering and generation of new samples from each cluster. Unlike existing VAE-based methods, our approach eliminates the need for a Gaussian Mixture Model (GMM) prior or additional regularization techniques. Experiments on MNIST and FashionMNIST demonstrate superior clustering performance compared to state-of-the-art methods.

[LG-9] Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data AAAI2025

链接: https://arxiv.org/abs/2501.07346
作者: Shilong Deng,Zetao Zheng,Hongcai He,Paul Weng,Jie Shao
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2025 (this version includes supplementary material)

点击查看摘要

Abstract:A major challenge in Reinforcement Learning (RL) is the difficulty of learning an optimal policy from sparse rewards. Prior works enhance online RL with conventional Imitation Learning (IL) via a handcrafted auxiliary objective, at the cost of restricting the RL policy to be sub-optimal when the offline data is generated by a non-expert policy. Instead, to better leverage valuable information in offline data, we develop Generalized Imitation Learning from Demonstration (GILD), which meta-learns an objective that distills knowledge from offline data and instills intrinsic motivation towards the optimal policy. Distinct from prior works that are exclusive to a specific RL algorithm, GILD is a flexible module intended for diverse vanilla off-policy RL algorithms. In addition, GILD introduces no domain-specific hyperparameter and minimal increase in computational cost. In four challenging MuJoCo tasks with sparse rewards, we show that three RL algorithms enhanced with GILD significantly outperform state-of-the-art methods.

[LG-10] Digital Operating Mode Classification of Real-World Amateur Radio Transmissions ICASSP2025

链接: https://arxiv.org/abs/2501.07337
作者: Maximilian Bundscherer,Thomas H. Schmitt,Ilja Baumann,Tobias Bocklet
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Conference IEEE ICASSP 2025

点击查看摘要

Abstract:This study presents an ML approach for classifying digital radio operating modes evaluated on real-world transmissions. We generated 98 different parameterized radio signals from 17 digital operating modes, transmitted each of them on the 70 cm (UHF) amateur radio band, and recorded our transmissions with two different architectures of SDR receivers. Three lightweight ML models were trained exclusively on spectrograms of limited non-transmitted signals with random characters as payloads. This training involved an online data augmentation pipeline to simulate various radio channel impairments. Our best model, EfficientNetB0, achieved an accuracy of 93.80% across the 17 operating modes and 85.47% across all 98 parameterized radio signals, evaluated on our real-world transmissions with Wikipedia articles as payloads. Furthermore, we analyzed the impact of varying signal durations and the number of FFT bins on classification, assessed the effectiveness of our simulated channel impairments, and tested our models across multiple simulated SNRs.

[LG-11] Foundation Models at Work: Fine-Tuning for Fairness in Algorithmic Hiring AAAI2025

链接: https://arxiv.org/abs/2501.07324
作者: Buse Sibel Korkmaz,Rahul Nair,Elizabeth M. Daly,Evangelos Anagnostopoulos,Christos Varytimidis,Antonio del Rio Chanona
类目: Machine Learning (cs.LG)
*备注: Accepted to AAAI 2025, AI Governance Workshop

点击查看摘要

Abstract:Foundation models require fine-tuning to ensure their generative outputs align with intended results for specific tasks. Automating this fine-tuning process is challenging, as it typically needs human feedback that can be expensive to acquire. We present AutoRefine, a method that leverages reinforcement learning for targeted fine-tuning, utilizing direct feedback from measurable performance improvements in specific downstream tasks. We demonstrate the method for a problem arising in algorithmic hiring platforms where linguistic biases influence a recommendation system. In this setting, a generative model seeks to rewrite given job specifications to receive more diverse candidate matches from a recommendation engine which matches jobs to candidates. Our model detects and regulates biases in job descriptions to meet diversity and fairness criteria. The experiments on a public hiring dataset and a real-world hiring platform showcase how large language models can assist in identifying and mitigating biases in the real world.

[LG-12] Variable Bregman Majorization-Minimization Algorithm and its Application to Dirichlet Maximum Likelihood Estimation

链接: https://arxiv.org/abs/2501.07306
作者: Ségolène Martin,Jean-Christophe Pesquet,Gabriele Steidl,Ismail Ben Ayed
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We propose a novel Bregman descent algorithm for minimizing a convex function that is expressed as the sum of a differentiable part (defined over an open set) and a possibly nonsmooth term. The approach, referred to as the Variable Bregman Majorization-Minimization (VBMM) algorithm, extends the Bregman Proximal Gradient method by allowing the Bregman function used in the divergence to adaptively vary at each iteration, provided it satisfies a majorizing condition on the objective function. This adaptive framework enables the algorithm to approximate the objective more precisely at each iteration, thereby allowing for accelerated convergence compared to the traditional Bregman Proximal Gradient descent. We establish the convergence of the VBMM algorithm to a minimizer under mild assumptions on the family of metrics used. Furthermore, we introduce a novel application of both the Bregman Proximal Gradient method and the VBMM algorithm to the estimation of the multidimensional parameters of a Dirichlet distribution through the maximization of its log-likelihood. Numerical experiments confirm that the VBMM algorithm outperforms existing approaches in terms of convergence speed.
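In notation assumed here (not taken verbatim from the paper), the VBMM iteration for minimizing f + g can be written as follows, where the Bregman function φ_k may change at each iteration as long as it majorizes the objective:

```latex
% VBMM iteration for minimizing f + g, with f differentiable and
% D_{\varphi_k} the Bregman divergence of an iteration-dependent
% function \varphi_k that satisfies the majorizing condition:
\begin{aligned}
x^{k+1} &\in \operatorname*{arg\,min}_{x}\;
  f(x^k) + \langle \nabla f(x^k),\, x - x^k \rangle
  + D_{\varphi_k}(x, x^k) + g(x),\\
\text{where } D_{\varphi_k}(x, y) &= \varphi_k(x) - \varphi_k(y)
  - \langle \nabla \varphi_k(y),\, x - y \rangle .
\end{aligned}
```

Fixing φ_k = φ for all k recovers the standard Bregman Proximal Gradient method; the adaptivity is what buys the tighter per-iteration approximation described in the abstract.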

[LG-13] Dataset-Agnostic Recommender Systems

链接: https://arxiv.org/abs/2501.07294
作者: Tri Kurniawan Wijaya,Edoardo D’Amico,Xinyang Shao
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:[This is a position paper and does not contain any empirical or theoretical results] Recommender systems have become a cornerstone of personalized user experiences, yet their development typically involves significant manual intervention, including dataset-specific feature engineering, hyperparameter tuning, and configuration. To this end, we introduce a novel paradigm: Dataset-Agnostic Recommender Systems (DAReS) that aims to enable a single codebase to autonomously adapt to various datasets without the need for fine-tuning, for a given recommender system task. Central to this approach is the Dataset Description Language (DsDL), a structured format that provides metadata about the dataset’s features and labels, and allow the system to understand dataset’s characteristics, allowing it to autonomously manage processes like feature selection, missing values imputation, noise removal, and hyperparameter optimization. By reducing the need for domain-specific expertise and manual adjustments, DAReS offers a more efficient and scalable solution for building recommender systems across diverse application domains. It addresses critical challenges in the field, such as reusability, reproducibility, and accessibility for non-expert users or entry-level researchers.

[LG-14] Generating Poisoning Attacks against Ridge Regression Models with Categorical Features

链接: https://arxiv.org/abs/2501.07275
作者: Monse Guedes-Ayala,Lars Schewe,Zeynep Suvak,Miguel Anjos
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Machine Learning (ML) models have become a very powerful tool to extract information from large datasets and use it to make accurate predictions and automated decisions. However, ML models can be vulnerable to external attacks, causing them to underperform or deviate from their expected tasks. One way to attack ML models is by injecting malicious data to mislead the algorithm during the training phase, which is referred to as a poisoning attack. We can prepare for such situations by designing anticipated attacks, which are later used for creating and testing defence strategies. In this paper, we propose an algorithm to generate strong poisoning attacks for a ridge regression model containing both numerical and categorical features that explicitly models and poisons categorical features. We model categorical features as SOS-1 sets and formulate the problem of designing poisoning attacks as a bilevel optimization problem that is nonconvex mixed-integer in the upper level and unconstrained convex quadratic in the lower level. We present the mathematical formulation of the problem, introduce a single-level reformulation based on the Karush-Kuhn-Tucker (KKT) conditions of the lower level, find bounds for the lower-level variables to accelerate solver performance, and propose a new algorithm to poison categorical features. Numerical experiments show that our method improves the mean squared error of all datasets compared to the previous benchmark in the literature.

[LG-15] Interpretable machine-learning for predicting molecular weight of PLA based on artificial bee colony optimization algorithm and adaptive neurofuzzy inference system

链接: https://arxiv.org/abs/2501.07247
作者: Amir Pouya Masoumi,Leo Creedon,Ramen Ghosh,Nimra Munir,Ross McMorrow,Marion McAfee
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This article discusses the integration of the Artificial Bee Colony (ABC) algorithm with two supervised learning methods, namely Artificial Neural Networks (ANNs) and Adaptive Network-based Fuzzy Inference System (ANFIS), for feature selection from Near-Infrared (NIR) spectra for predicting the molecular weight of medical-grade Polylactic Acid (PLA). During extrusion processing of PLA, in-line NIR spectra were captured along with extrusion process and machine setting data. With a dataset comprising 63 observations and 512 input features, appropriate machine learning tools are essential for interpreting data and selecting features to improve prediction accuracy. Initially, the ABC optimization algorithm is coupled with ANN/ANFIS to forecast PLA molecular weight. The objective functions of the ABC algorithm are to minimize the root mean square error (RMSE) between experimental and predicted PLA molecular weights while also minimizing the number of input features. Results indicate that employing ABC-ANFIS yields the lowest RMSE of 282 Da and identifies four significant parameters (NIR wavenumbers 6158 cm-1, 6310 cm-1, 6349 cm-1, and melt temperature) for prediction. These findings demonstrate the effectiveness of using the ABC algorithm with ANFIS for selecting a minimal set of features to predict PLA molecular weight with high accuracy during processing.

[LG-16] A data-driven approach to discover and quantify systemic lupus erythematosus etiological heterogeneity from electronic health records

链接: https://arxiv.org/abs/2501.07206
作者: Marco Barbero Mota,John M. Still,Jorge L. Gamboa,Eric V. Strobl,Charles M. Stein,Vivian K. Kawai,Thomas A. Lasko
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: Received Runner-up Knowledge Discovery and Data Mining Innovation Award at the American Medical Informatics Association Annual Symposium 2024

点击查看摘要

Abstract:Systemic lupus erythematosus (SLE) is a complex heterogeneous disease with many manifestational facets. We propose a data-driven approach to discover probabilistic independent sources from multimodal imperfect EHR data. These sources represent exogenous variables in the data generation process causal graph that estimate latent root causes of the presence of SLE in the health record. We objectively evaluated the sources against the original variables from which they were discovered by training supervised models to discriminate SLE from negative health records using a reduced set of labelled instances. We found 19 predictive sources with high clinical validity and whose EHR signatures define independent factors of SLE heterogeneity. Using the sources as input patient data representation enables models to provide rich explanations that better capture the clinical reasons why a particular record is (not) an SLE case. Providers may be willing to trade patient-level interpretability for discrimination, especially in challenging cases.

[LG-17] An Enhanced Zeroth-Order Stochastic Frank-Wolfe Framework for Constrained Finite-Sum Optimization

链接: https://arxiv.org/abs/2501.07201
作者: Haishan Ye,Yinghui Huang,Hao Di,Xiangyu Chang
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 35 pages, 4 figures, 3 tables

点击查看摘要

Abstract:We propose an enhanced zeroth-order stochastic Frank-Wolfe framework to address constrained finite-sum optimization problems, a structure prevalent in large-scale machine-learning applications. Our method introduces a novel double variance reduction framework that effectively reduces the gradient approximation variance induced by zeroth-order oracles and the stochastic sampling variance from finite-sum objectives. By leveraging this framework, our algorithm achieves significant improvements in query efficiency, making it particularly well-suited for high-dimensional optimization tasks. Specifically, for convex objectives, the algorithm achieves a query complexity of $O(d\sqrt{n}/\epsilon)$ to find an $\epsilon$-suboptimal solution, where $d$ is the dimensionality and $n$ is the number of functions in the finite-sum objective. For non-convex objectives, it achieves a query complexity of $O(d^{3/2}\sqrt{n}/\epsilon^2)$ without requiring the computation of $d$ partial derivatives at each iteration. These complexities are the best known among zeroth-order stochastic Frank-Wolfe algorithms that avoid explicit gradient calculations. Empirical experiments on convex and non-convex machine learning tasks, including sparse logistic regression, robust classification, and adversarial attacks on deep networks, validate the computational efficiency and scalability of our approach. Our algorithm demonstrates superior performance in both convergence rate and query complexity compared to existing methods.
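A stripped-down sketch of the zeroth-order Frank-Wolfe idea, using a two-point gradient estimator and the closed-form linear minimization oracle of an l1 ball. The paper's double variance reduction and finite-sum handling are omitted; all names and constants are illustrative.

```python
import numpy as np

def zo_frank_wolfe(f, x0, radius=1.0, mu=1e-4, n_dirs=20, iters=200):
    """Zeroth-order Frank-Wolfe over an l1 ball: estimate the gradient from
    function values only, then take a classic FW step toward the vertex
    returned by the linear minimization oracle."""
    x, d = x0.copy(), x0.size
    for t in range(1, iters + 1):
        g = np.zeros(d)
        for _ in range(n_dirs):                    # two-point gradient estimate
            u = np.random.randn(d)
            g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu * n_dirs) * u
        s = np.zeros(d)                            # LMO of the l1 ball:
        i = np.argmax(np.abs(g))                   # best vertex is +-radius * e_i
        s[i] = -radius * np.sign(g[i])
        x += 2.0 / (t + 2.0) * (s - x)             # classic FW step size
    return x

f = lambda x: np.sum((x - 0.1) ** 2)               # toy smooth objective
print(zo_frank_wolfe(f, np.zeros(50))[:5])
```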

[LG-18] Pre-Trained Large Language Model Based Remaining Useful Life Transfer Prediction of Bearing

链接: https://arxiv.org/abs/2501.07191
作者: Laifa Tao,Zhengduo Zhao,Xuesong Wang,Bin Li,Wenchao Zhan,Xuanyuan Su,Shangyu Li,Qixuan Huang,Haifei Liu,Chen Lu,Zhixuan Lian
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately predicting the remaining useful life (RUL) of rotating machinery, such as bearings, is essential for ensuring equipment reliability and minimizing unexpected industrial failures. Traditional data-driven deep learning methods face challenges in practical settings due to inconsistent training and testing data distributions and limited generalization for long-term predictions.

[LG-19] Knowledge Distillation and Enhanced Subdomain Adaptation Using Graph Convolutional Network for Resource-Constrained Bearing Fault Diagnosis

链接: https://arxiv.org/abs/2501.07173
作者: Mohammadreza Kavianpour,Parisa Kavianpour,Amin Ramezani,Mohammad TH Beheshti
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Bearing fault diagnosis under varying working conditions faces challenges, including a lack of labeled data, distribution discrepancies, and resource constraints. To address these issues, we propose a progressive knowledge distillation framework that transfers knowledge from a complex teacher model, utilizing a Graph Convolutional Network (GCN) with Autoregressive moving average (ARMA) filters, to a compact and efficient student model. To mitigate distribution discrepancies and labeling uncertainty, we introduce Enhanced Local Maximum Mean Squared Discrepancy (ELMMSD), which leverages mean and variance statistics in the Reproducing Kernel Hilbert Space (RKHS) and incorporates a priori probability distributions between labels. This approach increases the distance between clustering centers, bridges subdomain gaps, and enhances subdomain alignment reliability. Experimental results on benchmark datasets (CWRU and JNU) demonstrate that the proposed method achieves superior diagnostic accuracy while significantly reducing computational costs. Comprehensive ablation studies validate the effectiveness of each component, highlighting the robustness and adaptability of the approach across diverse working conditions.

[LG-20] AlphaNet: Scaling Up Local Frame-based Atomistic Foundation Model

链接: https://arxiv.org/abs/2501.07155
作者: Bangchen Yin,Jiaao Wang,Weitao Du,Pengbo Wang,Penghua Ying,Haojun Jia,Zisheng Zhang,Yuanqi Du,Carla P. Gomes,Chenru Duan,Hai Xiao,Graeme Henkelman
类目: Machine Learning (cs.LG)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:We present AlphaNet, a local frame-based equivariant model designed to achieve both accurate and efficient simulations for atomistic systems. Recently, machine learning force fields (MLFFs) have gained prominence in molecular dynamics simulations due to their advantageous efficiency-accuracy balance compared to classical force fields and quantum mechanical calculations, alongside their transferability across various systems. Despite the advancements in improving model accuracy, the efficiency and scalability of MLFFs remain significant obstacles in practical applications. AlphaNet enhances computational efficiency and accuracy by leveraging the local geometric structures of atomic environments through the construction of equivariant local frames and learnable frame transitions. We substantiate the efficacy of AlphaNet across diverse datasets, including defected graphene, formate decomposition, zeolites, and surface reactions. AlphaNet consistently surpasses well-established models, such as NequIP and DeepPot, in terms of both energy and force prediction accuracy. Notably, AlphaNet offers one of the best trade-offs between computational efficiency and accuracy among existing models. Moreover, AlphaNet exhibits scalability across a broad spectrum of system and dataset sizes, affirming its versatility.

[LG-21] LLM 360 K2: Scaling Up 360-Open-Source Large Language Models

链接: https://arxiv.org/abs/2501.07124
作者: Zhengzhong Liu,Bowen Tan,Hongyi Wang,Willie Neiswanger,Tianhua Tao,Haonan Li,Fajri Koto,Yuqi Wang,Suqi Sun,Omkar Pangarkar,Richard Fan,Yi Gu,Victor Miller,Liqun Ma,Liping Tang,Nikhil Ranjan,Yonghao Zhuang,Guowei He,Renxi Wang,Mingkai Deng,Robin Algayres,Yuanzhi Li,Zhiqiang Shen,Preslav Nakov,Eric Xing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We detail the training of the LLM360 K2-65B model, scaling up our 360-degree OPEN SOURCE approach to the largest and most powerful models under project LLM360. While open-source LLMs continue to advance, the answer to “How are the largest LLMs trained?” remains unclear within the community. The implementation details for such high-capacity models are often protected due to business considerations associated with their high cost. This lack of transparency prevents LLM researchers from leveraging valuable insights from prior experience, e.g., “What are the best practices for addressing loss spikes?” The LLM360 K2 project addresses this gap by providing full transparency and access to resources accumulated during the training of LLMs at the largest scale. This report highlights key elements of the K2 project, including our first model, K2 DIAMOND, a 65 billion-parameter LLM that surpasses LLaMA-65B and rivals LLaMA2-70B, while requiring fewer FLOPs and tokens. We detail the implementation steps and present a longitudinal analysis of K2 DIAMOND’s capabilities throughout its training process. We also outline ongoing projects such as TXT360, setting the stage for future models in the series. By offering previously unavailable resources, the K2 project also resonates with the 360-degree OPEN SOURCE principles of transparency, reproducibility, and accessibility, which we believe are vital in the era of resource-intensive AI research.

[LG-22] D3MES: Diffusion Transformer with multihead equivariant self-attention for 3D molecule generation

链接: https://arxiv.org/abs/2501.07077
作者: Zhejun Zhang,Yuanping Chen,Shibing Chu
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Understanding and predicting the diverse conformational states of molecules is crucial for advancing fields such as chemistry, material science, and drug development. Despite significant progress in generative models, accurately generating complex and biologically or material-relevant molecular structures remains a major challenge. In this work, we introduce a diffusion model for three-dimensional (3D) molecule generation that combines a classifiable diffusion model, Diffusion Transformer, with multihead equivariant self-attention. This method addresses two key challenges: correctly attaching hydrogen atoms in generated molecules through learning representations of molecules after hydrogen atoms are removed; and overcoming the limitations of existing models that cannot generate molecules across multiple classes simultaneously. The experimental results demonstrate that our model not only achieves state-of-the-art performance across several key metrics but also exhibits robustness and versatility, making it highly suitable for early-stage large-scale generation processes in molecular design, followed by validation and further screening to obtain molecules with specific properties.

[LG-23] Explore the Use of Time Series Foundation Model for Car-Following Behavior Analysis

链接: https://arxiv.org/abs/2501.07034
作者: Luwei Zeng,Runze Yan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modeling car-following behavior is essential for traffic simulation, analyzing driving patterns, and understanding complex traffic flows with varying levels of autonomous vehicles. Traditional models like the Safe Distance Model and Intelligent Driver Model (IDM) require precise parameter calibration and often lack generality due to simplified assumptions about driver behavior. While machine learning and deep learning methods capture complex patterns, they require large labeled datasets. Foundation models provide a more efficient alternative. Pre-trained on vast, diverse time series datasets, they can be applied directly to various tasks without the need for extensive re-training. These models generalize well across domains, and with minimal fine-tuning, they can be adapted to specific tasks like car-following behavior prediction. In this paper, we apply Chronos, a state-of-the-art public time series foundation model, to analyze car-following behavior using the Open ACC dataset. Without fine-tuning, Chronos outperforms traditional models like IDM and Exponential smoothing with trend and seasonality (ETS), and achieves similar results to deep learning models such as DeepAR and TFT, with an RMSE of 0.60. After fine-tuning, Chronos reduces the error to an RMSE of 0.53, representing a 33.75% improvement over IDM and a 12-37% reduction compared to machine learning models like ETS and deep learning models including DeepAR, WaveNet, and TFT. This demonstrates the potential of foundation models to significantly advance transportation research, offering a scalable, adaptable, and highly accurate approach to predicting and simulating car-following behaviors.
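Using Chronos zero-shot on a speed series takes only a few lines, assuming the open-source chronos-forecasting package and its published checkpoint names (both are assumptions here; the paper's fine-tuning setup is not reproduced):

```python
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting (assumed API)

# Load a pretrained Chronos checkpoint (name assumed from the public release).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small", device_map="cpu", torch_dtype=torch.float32
)

# context: follower-vehicle speed history (m/s); forecast the next 10 steps.
speed_history = torch.tensor([14.2, 14.0, 13.8, 13.9, 14.1, 14.4, 14.6, 14.5])
samples = pipeline.predict(speed_history, prediction_length=10)  # (1, n_samples, 10)
point_forecast = samples.median(dim=1).values                    # median as point estimate
```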

[LG-24] PRKAN: Parameter-Reduced Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2501.07032
作者: Hoang-Thang Ta,Duy-Quy Thai,Anh Tran,Grigori Sidorov,Alexander Gelbukh
类目: Machine Learning (cs.LG)
*备注: 23 pages

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) represent an innovation in neural network architectures, offering a compelling alternative to Multi-Layer Perceptrons (MLPs) in models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. By advancing network design, KANs are driving groundbreaking research and enabling transformative applications across various scientific domains involving neural networks. However, existing KANs often require significantly more parameters in their network layers compared to MLPs. To address this limitation, this paper introduces PRKANs (Parameter-Reduced Kolmogorov-Arnold Networks), which employ several methods to reduce the parameter count in KAN layers, making them comparable to MLP layers. Experimental results on the MNIST and Fashion-MNIST datasets demonstrate that PRKANs with attention mechanisms outperform several existing KANs and rival the performance of MLPs, albeit with slightly longer training times. Furthermore, the study highlights the advantages of Gaussian Radial Basis Functions (GRBFs) and layer normalization in KAN designs. The repository for this work is available at: this https URL.
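One way a GRBF-based KAN-style layer with layer normalization might look, summing the per-dimension basis responses with learned coefficients to keep the parameter count near that of an MLP layer. This is a sketch in the spirit of the paper, not its exact design.

```python
import torch
import torch.nn as nn

class GRBFKANLayer(nn.Module):
    """KAN-style layer: each input dimension passes through a learnable
    Gaussian-RBF activation, followed by an ordinary linear map, so the
    parameter count stays close to an MLP layer."""
    def __init__(self, in_dim, out_dim, n_centers=8):
        super().__init__()
        self.centers = nn.Parameter(
            torch.linspace(-2, 2, n_centers).expand(in_dim, n_centers).clone())
        self.log_width = nn.Parameter(torch.zeros(in_dim, n_centers))
        self.coef = nn.Parameter(torch.randn(in_dim, n_centers) / n_centers ** 0.5)
        self.linear = nn.Linear(in_dim, out_dim)
        self.norm = nn.LayerNorm(in_dim)           # layer norm, as the paper favors

    def forward(self, x):                          # x: (B, in_dim)
        x = self.norm(x)
        diff = x[:, :, None] - self.centers        # (B, in_dim, n_centers)
        phi = torch.exp(-(diff / self.log_width.exp()) ** 2)
        act = (phi * self.coef).sum(-1)            # learned per-dim GRBF activation
        return self.linear(act)

layer = GRBFKANLayer(784, 64)
print(layer(torch.randn(32, 784)).shape)           # torch.Size([32, 64])
```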

[LG-25] Erasing Noise in Signal Detection with Diffusion Model: From Theory to Application

链接: https://arxiv.org/abs/2501.07030
作者: Xiucheng Wang,Peilin Zheng,Nan Cheng
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In this paper, a signal detection method based on the denoising diffusion model (DM) is proposed, which outperforms the maximum likelihood (ML) estimation method that has long been regarded as the optimal signal detection technique. Theoretically, a novel mathematical theory for intelligent signal detection based on stochastic differential equations (SDEs) is established in this paper, demonstrating the effectiveness of DM in reducing the additive white Gaussian noise in received signals. Moreover, a mathematical relationship between the signal-to-noise ratio (SNR) and the timestep in DM is established, revealing that for any given SNR, a corresponding optimal timestep can be identified. Furthermore, to address potential issues with out-of-distribution inputs in the DM, we employ a mathematical scaling technique that allows the trained DM to handle signal detection across a wide range of SNRs without any fine-tuning. Building on the above theoretical foundation, we propose a DM-based signal detection method, with the diffusion transformer (DiT) serving as the backbone neural network, whose computational complexity is $\mathcal{O}(n^2)$. Simulation results demonstrate that, for BPSK and QAM modulation schemes, the DM-based method achieves a significantly lower symbol error rate (SER) compared to ML estimation, while maintaining a much lower computational complexity.
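The SNR-to-timestep pairing can be made concrete under the standard DDPM forward process: since SNR(t) = ᾱ_t / (1 − ᾱ_t), a channel SNR maps to the diffusion step with the closest implied noise level. The linear beta schedule below is an assumption for illustration, not taken from the paper.

```python
import numpy as np

# Standard DDPM forward process: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps,
# so the signal-to-noise ratio at step t is SNR(t) = abar_t / (1 - abar_t).
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # common linear schedule (an assumption)
abar = np.cumprod(1.0 - betas)
snr_of_t = abar / (1.0 - abar)

def timestep_for_snr(snr_db):
    """Match a channel SNR (in dB) to the diffusion step whose implied
    noise level is closest, the pairing idea described in the abstract."""
    snr_lin = 10.0 ** (snr_db / 10.0)
    return int(np.argmin(np.abs(snr_of_t - snr_lin)))

for snr_db in (0, 10, 20):
    print(snr_db, "dB ->", timestep_for_snr(snr_db))
```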

[LG-26] Improved Regret Bounds for Online Fair Division with Bandit Learning

链接: https://arxiv.org/abs/2501.07022
作者: Benjamin Schiffer,Shirley Zhang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study online fair division when there are a finite number of item types and the player values for the items are drawn randomly from distributions with unknown means. In this setting, a sequence of indivisible items arrives according to a random online process, and each item must be allocated to a single player. The goal is to maximize expected social welfare while maintaining that the allocation satisfies proportionality in expectation. When player values are normalized, we show that it is possible, with high probability, to guarantee satisfaction of the proportionality constraint and achieve $\tilde{O}(\sqrt{T})$ regret. To achieve this result, we present an upper confidence bound (UCB) algorithm that uses two rounds of linear optimization. This algorithm highlights fundamental aspects of proportionality constraints that allow for a UCB algorithm despite the presence of many (potentially tight) constraints. This result improves upon the previous best regret rate of $\tilde{O}(T^{2/3})$.

[LG-27] Global Search for Optimal Low Thrust Spacecraft Trajectories using Diffusion Models and the Indirect Method

链接: https://arxiv.org/abs/2501.07005
作者: Jannik Graebner,Ryne Beeson
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Long time-duration low-thrust nonlinear optimal spacecraft trajectory global search is a computationally and time expensive problem characterized by clustering patterns in locally optimal solutions. During preliminary mission design, mission parameters are subject to frequent changes, necessitating that trajectory designers efficiently generate high-quality control solutions for these new scenarios. Generative machine learning models can be trained to learn how the solution structure varies with respect to a conditional parameter, thereby accelerating the global search for missions with updated parameters. In this work, state-of-the-art diffusion models are integrated with the indirect approach for trajectory optimization within a global search framework. This framework is tested on two low-thrust transfers of different complexity in the circular restricted three-body problem. By generating and analyzing a training data set, we develop mathematical relations and techniques to understand the complex structures in the costate domain of locally optimal solutions for these problems. A diffusion model is trained on this data and successfully accelerates the global search for both problems. The model predicts how the costate solution structure changes, based on the maximum spacecraft thrust magnitude. Warm-starting a numerical solver with diffusion model samples for the costates at the initial time increases the number of solutions generated per minute for problems with unseen thrust magnitudes by one to two orders of magnitude in comparison to samples from a uniform distribution and from an adjoint control transformation.

[LG-28] Sanidha: A Studio Quality Multi-Modal Dataset for Carnatic Music

链接: https://arxiv.org/abs/2501.06959
作者: Venkatakrishnan Vaidyanathapuram Krishnan,Noel Alben,Anish Nair,Nathaniel Condit-Schultz
类目: ound (cs.SD); Digital Libraries (cs.DL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

点击查看摘要

Abstract:Music source separation demixes a piece of music into its individual sound sources (vocals, percussion, melodic instruments, etc.), a task with no simple mathematical solution. It requires deep learning methods involving training on large datasets of isolated music stems. The most commonly available datasets are made from commercial Western music, limiting the models’ applications to non-Western genres like Carnatic music. Carnatic music is a live tradition, with the available multi-track recordings containing overlapping sounds and bleeds between the sources. This poses a challenge to commercially available source separation models like Spleeter and Hybrid Demucs. In this work, we introduce ‘Sanidha’, the first open-source novel dataset for Carnatic music, offering studio-quality, multi-track recordings with minimal to no overlap or bleed. Along with the audio files, we provide high-definition videos of the artists’ performances. Additionally, we fine-tuned Spleeter, one of the most commonly used source separation models, on our dataset and observed improved SDR performance compared to fine-tuning on a pre-existing Carnatic multi-track dataset. The outputs of the fine-tuned model with ‘Sanidha’ are evaluated through a listening study.

[LG-29] A Hessian-informed hyperparameter optimization for differential learning rate

链接: https://arxiv.org/abs/2501.06954
作者: Shiyun Xu,Zhiqi Bu,Yiliang Zhang,Ian Barnett
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Differential learning rate (DLR), a technique that applies different learning rates to different model parameters, has been widely used in deep learning and achieved empirical success via its various forms. For example, parameter-efficient fine-tuning (PEFT) applies zero learning rates to most parameters so as to significantly save the computational cost. At the core, DLR leverages the observation that different parameters can have different loss curvature, which is hard to characterize in general. We propose the Hessian-informed differential learning rate (Hi-DLR), an efficient approach that solves the hyperparameter optimization (HPO) of learning rates and captures the loss curvature for any model and optimizer adaptively. Given a proper grouping of parameters, we empirically demonstrate that Hi-DLR can improve the convergence by dynamically determining the learning rates during the training. Furthermore, we can quantify the influence of different parameters and freeze the less-contributing parameters, which leads to a new PEFT that automatically adapts to various tasks and models. Additionally, Hi-DLR also exhibits comparable performance on various full model training tasks.
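One plausible reading of "Hessian-informed" rates is sketched below: a Hutchinson estimate of each parameter group's Hessian diagonal sets a learning rate inversely proportional to its curvature. The paper's Hi-DLR procedure may differ in detail; names here are illustrative.

```python
import torch

def group_lrs_from_curvature(model, loss, base_lr=1e-3, n_samples=4, eps=1e-8):
    """Assign each parameter a learning rate inversely proportional to a
    Hutchinson estimate of its Hessian diagonal (one way to turn curvature
    into differential learning rates)."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    curv = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]  # Rademacher probes
        hvps = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        for c, v, hv in zip(curv, vs, hvps):
            c += (v * hv) / n_samples          # E[v * Hv] equals diag(H)
    return [
        {"params": [p], "lr": base_lr / (c.abs().mean().item() + eps)}
        for p, c in zip(params, curv)
    ]

model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Tanh(), torch.nn.Linear(10, 1))
loss = model(torch.randn(32, 10)).pow(2).mean()
optimizer = torch.optim.SGD(group_lrs_from_curvature(model, loss), lr=1e-3)
```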

[LG-30] A group-theoretic framework for machine learning in hyperbolic spaces

链接: https://arxiv.org/abs/2501.06934
作者: Vladimir Jaćimović
类目: Machine Learning (cs.LG)
*备注: 22 pages, 4 figures

点击查看摘要

Abstract:Embedding the data in hyperbolic spaces can preserve complex relationships in very few dimensions, thus enabling compact models and improving efficiency of machine learning (ML) algorithms. The underlying idea is that hyperbolic representations can prevent the loss of important structural information for certain ubiquitous types of data. However, further advances in hyperbolic ML require more principled mathematical approaches and adequate geometric methods. The present study aims at enhancing mathematical foundations of hyperbolic ML by combining group-theoretic and conformal-geometric arguments with optimization and statistical techniques. Precisely, we introduce the notion of the mean (barycenter) and the novel family of probability distributions on hyperbolic balls. We further propose efficient optimization algorithms for computation of the barycenter and for maximum likelihood estimation. One can build upon basic concepts presented here in order to design more demanding algorithms and implement hyperbolic deep learning pipelines.
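For concreteness, the standard Poincaré-ball distance and the barycenter (Fréchet mean) it induces, consistent with the setting the abstract describes, are:

```latex
% Poincare-ball distance and the barycenter it induces (standard definitions).
d(\mathbf{x},\mathbf{y}) = \operatorname{arcosh}\!\left(
  1 + \frac{2\,\lVert \mathbf{x}-\mathbf{y}\rVert^{2}}
           {(1-\lVert \mathbf{x}\rVert^{2})(1-\lVert \mathbf{y}\rVert^{2})}\right),
\qquad
\bar{\mathbf{x}} = \operatorname*{arg\,min}_{\lVert \mathbf{z}\rVert<1}
  \sum_{i=1}^{n} d(\mathbf{z},\mathbf{x}_i)^{2}.
```

Unlike the Euclidean case, this minimization has no closed form, which is why the paper proposes dedicated optimization algorithms for computing the barycenter.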

[LG-31] Neural equilibria for long-term prediction of nonlinear conservation laws

链接: https://arxiv.org/abs/2501.06933
作者: J. Antonio Lara Benitez,Junyi Guo,Kareem Hegazy,Ivan Dokmanić,Michael W. Mahoney,Maarten V. de Hoop
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:We introduce Neural Discrete Equilibrium (NeurDE), a machine learning (ML) approach for long-term forecasting of flow phenomena that relies on a “lifting” of physical conservation laws into the framework of kinetic theory. The kinetic formulation provides an excellent structure for ML algorithms by separating nonlinear, non-local physics into a nonlinear but local relaxation to equilibrium and a linear non-local transport. This separation allows the ML to focus on the local nonlinear components while addressing the simpler linear transport with efficient classical numerical algorithms. To accomplish this, we design an operator network that maps macroscopic observables to equilibrium states in a manner that maximizes entropy, yielding expressive BGK-type collisions. By incorporating our surrogate equilibrium into the lattice Boltzmann (LB) algorithm, we achieve accurate flow forecasts for a wide range of challenging flows. We show that NeurDE enables accurate prediction of compressible flows, including supersonic flows, while tracking shocks over hundreds of time steps, using a small velocity lattice, a feat heretofore unattainable without expensive numerical root finding.
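For context, a classical BGK lattice Boltzmann step on a D1Q3 lattice is shown below; NeurDE's contribution is, in effect, to replace the equilibrium function with a learned entropy-respecting operator network while keeping the linear transport step classical. This toy sketch is not the paper's code.

```python
import numpy as np

W = np.array([2/3, 1/6, 1/6])        # D1Q3 weights
C = np.array([0, 1, -1])             # lattice velocities, c_s^2 = 1/3

def equilibrium(rho, u):
    """Standard D1Q3 Maxwellian expansion; NeurDE swaps this for a learned
    operator mapping macroscopic observables to equilibrium states."""
    cu = 3.0 * np.outer(C, u)        # (3, N)
    return W[:, None] * rho * (1.0 + cu + 0.5 * cu**2 - 1.5 * u**2)

def lb_step(f, tau=0.8):
    rho = f.sum(axis=0)              # macroscopic density
    u = (C @ f) / rho                # macroscopic velocity
    f = f - (f - equilibrium(rho, u)) / tau   # local BGK relaxation (nonlinear)
    for i, c in enumerate(C):                 # linear non-local transport
        f[i] = np.roll(f[i], c)
    return f

x = np.arange(200)
rho0 = 1.0 + 0.1 * np.exp(-0.5 * ((x - 100) / 10.0) ** 2)   # density bump
f = equilibrium(rho0, np.zeros(200))                        # start at equilibrium
for _ in range(100):
    f = lb_step(f)
```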

[LG-32] A Hybrid Virtual Element Method and Deep Learning Approach for Solving One-Dimensional Euler-Bernoulli Beams

链接: https://arxiv.org/abs/2501.06925
作者: Paulo Akira F. Enabe,Rodrigo Provasi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A hybrid framework integrating the Virtual Element Method (VEM) with deep learning is presented as an initial step toward developing efficient and flexible numerical models for one-dimensional Euler-Bernoulli beams. The primary aim is to explore a data-driven surrogate model capable of predicting displacement fields across varying material and geometric parameters while maintaining computational efficiency. Building upon VEM’s ability to handle higher-order polynomials and non-conforming discretizations, the method offers a robust numerical foundation for structural mechanics. A neural network architecture is introduced to separately process nodal and material-specific data, effectively capturing complex interactions with minimal reliance on large datasets. To address challenges in training, the model incorporates Sobolev training and GradNorm techniques, ensuring balanced loss contributions and enhanced generalization. While this framework is in its early stages, it demonstrates the potential for further refinement and development into a scalable alternative to traditional methods. The proposed approach lays the groundwork for advancing numerical and data-driven techniques in beam modeling, offering a foundation for future research in structural mechanics.

[LG-33] Optimal Online Bookmaking for Binary Games

链接: https://arxiv.org/abs/2501.06923
作者: Alankrita Bhatt,Or Ordentlich,Oron Sabag
类目: Computer Science and Game Theory (cs.GT); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:In online betting, the bookmaker can update the payoffs it offers on a particular event many times before the event takes place, and the updated payoffs may depend on the bets accumulated thus far. We study the problem of bookmaking with the goal of maximizing the return in the worst case, with respect to the gamblers’ behavior and the event’s outcome. We formalize this problem as the Optimal Online Bookmaking game, and provide the exact solution for the binary case. To this end, we develop the optimal bookmaking strategy, which relies on a new technique called bi-balancing trees, ensuring that the house loss is the same for all decisive betting sequences, where the gambler bets all its money on a single outcome in each round.

[LG-34] Black-box optimization and quantum annealing for filtering out mislabeled training instances

链接: https://arxiv.org/abs/2501.06916
作者: Makoto Otsuka,Kento Kodama,Keisuke Morita,Masayuki Ohzeki
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:This study proposes an approach for removing mislabeled instances from contaminated training datasets by combining surrogate model-based black-box optimization (BBO) with postprocessing and quantum annealing. Mislabeled training instances, a common issue in real-world datasets, often degrade model generalization, necessitating robust and efficient noise-removal strategies. The proposed method evaluates filtered training subsets based on validation loss, iteratively refines loss estimates through surrogate model-based BBO with postprocessing, and leverages quantum annealing to efficiently sample diverse training subsets with low validation error. Experiments on a noisy majority bit task demonstrate the method’s ability to prioritize the removal of high-risk mislabeled instances. Integrating D-Wave’s clique sampler running on a physical quantum annealer achieves faster optimization and higher-quality training subsets compared to OpenJij’s simulated quantum annealing sampler or Neal’s simulated annealing sampler, offering a scalable framework for enhancing dataset quality. This work highlights the effectiveness of the proposed method for supervised learning tasks, with future directions including its application to unsupervised learning, real-world datasets, and large-scale implementations.

[LG-35] Deep Learning and Foundation Models for Weather Prediction: A Survey

链接: https://arxiv.org/abs/2501.06907
作者: Jimeng Shi,Azam Shirali,Bowen Jin,Sizhe Zhou,Wei Hu,Rahuul Rangaraj,Shaowen Wang,Jiawei Han,Zhaonan Wang,Upmanu Lall,Yanzhao Wu,Leonardo Bobadilla,Giri Narasimhan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-based numerical models have been the bedrock of atmospheric sciences for decades, offering robust solutions but often at the cost of significant computational resources. Deep learning (DL) models have emerged as powerful tools in meteorology, capable of analyzing complex weather and climate data by learning intricate dependencies and providing rapid predictions once trained. While these models demonstrate promising performance in weather prediction, often surpassing traditional physics-based methods, they still face critical challenges. This paper presents a comprehensive survey of recent deep learning and foundation models for weather prediction. We propose a taxonomy to classify existing models based on their training paradigms: deterministic predictive learning, probabilistic generative learning, and pre-training and fine-tuning. For each paradigm, we delve into the underlying model architectures, address major challenges, offer key insights, and propose targeted directions for future research. Furthermore, we explore real-world applications of these methods and provide a curated summary of open-source code repositories and widely used datasets, aiming to bridge research advancements with practical implementations while fostering open and trustworthy scientific practices in adopting cutting-edge artificial intelligence for weather prediction. The related sources are available at this https URL (DL-Foundation-Models-Weather).

[LG-36] Introduction to the Usage of Open Data from the Large Hadron Collider for Computer Scientists in the Context of Machine Learning

链接: https://arxiv.org/abs/2501.06896
作者: Timo Saala,Matthias Schott
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 34 pages, 22 figures (without Appendix)

点击查看摘要

Abstract:Deep learning techniques have evolved rapidly in recent years, significantly impacting various scientific fields, including experimental particle physics. To effectively leverage the latest developments in computer science for particle physics, a strengthened collaboration between computer scientists and physicists is essential. As all machine learning techniques depend on the availability and comprehensibility of extensive data, clear data descriptions and commonly used data formats are prerequisites for successful collaboration. In this study, we converted open data from the Large Hadron Collider, recorded in the ROOT data format commonly used in high-energy physics, to pandas DataFrames, a well-known format in computer science. Additionally, we provide a brief introduction to the data’s content and interpretation. This paper aims to serve as a starting point for future interdisciplinary collaborations between computer scientists and physicists, fostering closer ties and facilitating efficient knowledge exchange.
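A minimal sketch of the conversion workflow, assuming the uproot package and illustrative file and branch names (the paper's exact pipeline may differ):

```python
import uproot  # pip install uproot (assumed tooling for reading ROOT files)

# Open a ROOT file with LHC open data and convert one TTree to a pandas
# DataFrame. The file, tree, and branch names below are placeholders.
with uproot.open("open_data_events.root") as file:
    tree = file["Events"]
    df = tree.arrays(["Muon_pt", "Muon_eta", "Muon_phi"], library="pd")

print(df.head())   # ready for standard pandas / scikit-learn workflows
```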

[LG-37] A novel multi-agent dynamic portfolio optimization learning system based on hierarchical deep reinforcement learning

链接: https://arxiv.org/abs/2501.06832
作者: Ruoyu Sun,Yue Xi,Angelos Stefanidis,Zhengyong Jiang,Jionglong Su
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Deep Reinforcement Learning (DRL) has been extensively used to address portfolio optimization problems. The DRL agents acquire knowledge and make decisions through unsupervised interactions with their environment without requiring explicit knowledge of the joint dynamics of portfolio assets. Among these DRL algorithms, the combination of actor-critic algorithms and deep function approximators is the most widely used DRL algorithm. Here, we find that training the DRL agent using the actor-critic algorithm and deep function approximators may lead to scenarios where the improvement in the DRL agent’s risk-adjusted profitability is not significant. We propose that such situations primarily arise from the following two problems: sparsity in positive reward and the curse of dimensionality. These limitations prevent DRL agents from comprehensively learning asset price change patterns in the training environment. As a result, the DRL agents cannot explore the dynamic portfolio optimization policy to improve the risk-adjusted profitability in the training process. To address these problems, we propose a novel multi-agent Hierarchical Deep Reinforcement Learning (HDRL) algorithmic framework in this research. Under this framework, the agents work together as a learning system for portfolio optimization. Specifically, by designing an auxiliary agent that works together with the executive agent for optimal policy exploration, the learning system can focus on exploring the policy with higher risk-adjusted return in the action space with positive return and low variance. In this way, we can overcome the issue of the curse of dimensionality and improve the training efficiency in the positive reward sparse environment.

[LG-38] A Pan-cancer Classification Model using Multi-view Feature Selection Method and Ensemble Classifier

链接: https://arxiv.org/abs/2501.06805
作者: Tareque Mohmud Chowdhury,Farzana Tabassum,Sabrina Islam,Abu Raihan Mostofa Kamal
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 20 pages, 5 figures, 9 tables

点击查看摘要

Abstract:Accurately identifying cancer samples is crucial for precise diagnosis and effective patient treatment. Traditional methods falter with high-dimensional and high feature-to-sample count ratios, which are critical for classifying cancer samples. This study aims to develop a novel feature selection framework specifically for transcriptome data and propose two ensemble classifiers. For feature selection, we partition the transcriptome dataset vertically based on feature types. Then apply the Boruta feature selection process on each of the partitions, combine the results, and apply Boruta again on the combined result. We repeat the process with different parameters of Boruta and prepare the final feature set. Finally, we constructed two ensemble ML models based on LR, SVM and XGBoost classifiers with max voting and averaging probability approach. We used 10-fold cross-validation to ensure robust and reliable classification performance. With 97.11% accuracy and 0.9996 AUC value, our approach performs better compared to existing state-of-the-art methods to classify 33 types of cancers. A set of 12 types of cancer is traditionally challenging to differentiate between each other due to their similarity in tissue of origin. Our method accurately identifies over 90% of samples from these 12 types of cancers, which outperforms all known methods presented in existing literature. The gene set enrichment analysis reveals that our framework’s selected features have enriched the pathways highly related to cancers. This study develops a feature selection framework to select features highly related to cancer development and leads to identifying different types of cancer samples with higher accuracy.
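The two ensembles described above correspond directly to scikit-learn's hard and soft voting. A minimal sketch with synthetic data follows; the paper's selected features and tuned classifiers are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from xgboost import XGBClassifier  # assumed available, as in the paper

X, y = make_classification(n_samples=500, n_features=50, n_classes=3,
                           n_informative=10, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True)),          # probability=True enables soft voting
    ("xgb", XGBClassifier(eval_metric="mlogloss")),
]

# "Max voting" = hard voting; "averaging probability" = soft voting.
for name, voting in [("hard", "hard"), ("soft", "soft")]:
    clf = VotingClassifier(estimators, voting=voting)
    print(name, cross_val_score(clf, X, y, cv=10).mean())   # 10-fold CV, as in the paper
```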

[LG-39] COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators DATE2025

链接: https://arxiv.org/abs/2501.06780
作者: Jihoon Park,Jeongin Choe,Dohyun Kim,Jae-Joon Kim
类目: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: Accepted IEEE DATE 2025

点击查看摘要

Abstract:Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for in-memory accelerators has also been introduced, it is currently limited to the case where all weights are assumed to be on-chip. This limitation becomes apparent as network sizes grow significantly beyond the in-memory footprint, making weight replacement schemes essential. We propose COMPASS, a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. COMPASS is specially targeted at networks that exceed the capacity of PIM crossbar arrays, necessitating access to external memories. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip. Our scheme takes into account the data dependence between layers, core utilization, and the number of write instructions to minimize latency and memory accesses and to improve energy efficiency. Simulation results demonstrate that COMPASS can accommodate many more networks using a minimal memory footprint, while improving throughput by 1.78X and providing 1.28X savings in energy-delay product (EDP) over baseline partitioning methods.

[LG-40] Pareto Set Learning for Multi-Objective Reinforcement Learning AAAI2025

链接: https://arxiv.org/abs/2501.06773
作者: Erlong Liu,Yu-Chang Wu,Xiaobin Huang,Chengrui Gao,Ren-Jian Wang,Ke Xue,Chao Qian
类目: Machine Learning (cs.LG)
*备注: AAAI 2025 Accept

点击查看摘要

Abstract:Multi-objective decision-making problems have emerged in numerous real-world scenarios, such as video games, navigation and robotics. Considering the clear advantages of Reinforcement Learning (RL) in optimizing decision-making processes, researchers have delved into the development of Multi-Objective RL (MORL) methods for solving multi-objective decision problems. However, previous methods either cannot obtain the entire Pareto front, or employ only a single policy network for all the preferences over multiple objectives, which may not produce personalized solutions for each preference. To address these limitations, we propose a novel decomposition-based framework for MORL, Pareto Set Learning for MORL (PSL-MORL), which harnesses the generation capability of a hypernetwork to produce the parameters of the policy network for each decomposition weight, efficiently generating relatively distinct policies for the various scalarized subproblems. PSL-MORL is a general framework compatible with any RL algorithm. Theoretical results guarantee the superior model capacity of PSL-MORL and the optimality of the obtained policy network. Through extensive experiments on diverse benchmarks, we demonstrate the effectiveness of PSL-MORL in achieving dense coverage of the Pareto front, significantly outperforming state-of-the-art MORL methods in the hypervolume and sparsity indicators.
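
The core mechanism, a hypernetwork that maps a decomposition weight to the parameters of a policy network, can be sketched in PyTorch as follows. All layer sizes and names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class PolicyHypernetwork(nn.Module):
    """Maps a single preference (decomposition weight) vector to the parameters
    of a small policy MLP -- a sketch of the PSL-MORL idea, with made-up sizes."""
    def __init__(self, n_objectives, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.obs_dim, self.act_dim, self.hidden = obs_dim, act_dim, hidden
        n_params = (obs_dim * hidden + hidden) + (hidden * act_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(n_objectives, 128), nn.ReLU(), nn.Linear(128, n_params))

    def forward(self, pref, obs):
        # pref: (n_objectives,) single preference; obs: (obs_dim,) or (B, obs_dim)
        p = self.net(pref)  # flat parameter vector for this preference
        i = 0
        W1 = p[i:i + self.obs_dim * self.hidden].view(self.hidden, self.obs_dim); i += W1.numel()
        b1 = p[i:i + self.hidden]; i += self.hidden
        W2 = p[i:i + self.hidden * self.act_dim].view(self.act_dim, self.hidden); i += W2.numel()
        b2 = p[i:]
        h = torch.tanh(obs @ W1.T + b1)
        return h @ W2.T + b2  # action logits for the scalarized subproblem
```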

[LG-41] MTPareto: A MultiModal Targeted Pareto Framework for Fake News Detection

链接: https://arxiv.org/abs/2501.06764
作者: Kaiying Yan,Moyang Liu,Yukun Liu,Ruibo Fu,Zhengqi Wen,Jianhua Tao,Xuefei Liu,Guanjun Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal fake news detection is essential for maintaining the authenticity of Internet multimedia information. Significant differences in the form and content of multimodal information lead to intensified optimization conflicts, hindering effective model training and reducing the effectiveness of existing fusion methods designed for bimodal data. To address this problem, we propose the MTPareto framework to optimize multimodal fusion, using a Targeted Pareto (TPareto) optimization algorithm for fusion-level-specific objective learning with a certain focus. Based on the designed hierarchical fusion network, the algorithm defines three fusion levels with corresponding losses and implements all-modal-oriented Pareto gradient integration for each. This approach accomplishes superior multimodal fusion by utilizing the information obtained from intermediate fusion to benefit the entire process. Experimental results on the FakeSV and FVC datasets show that the proposed framework outperforms baselines, with the TPareto optimization algorithm achieving accuracy improvements of 2.40% and 1.89%, respectively.
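
To illustrate what a "Pareto gradient integration" step can look like, here is a PCGrad-style projection of conflicting per-loss gradients in PyTorch. This is a stand-in illustration only; the paper's TPareto rule may differ, and all names here are ours.

```python
import torch

def pcgrad_combine(grads):
    """PCGrad-style conflict-free combination of per-loss gradients
    (each a flat tensor). Illustrates the general shape of a Pareto
    gradient integration step; not MTPareto's exact rule."""
    out = []
    for i, g in enumerate(grads):
        g = g.clone()
        for j, h in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g, h)
            if dot < 0:  # conflicting pair: project g off h
                g = g - dot / (h.norm() ** 2 + 1e-12) * h
        out.append(g)
    return torch.stack(out).mean(dim=0)  # shared update direction
```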

[LG-42] Procedural Fairness and Its Relationship with Distributive Fairness in Machine Learning

链接: https://arxiv.org/abs/2501.06753
作者: Ziming Wang,Changwu Huang,Ke Tang,Xin Yao
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 33 pages, 11 figures

点击查看摘要

Abstract:Fairness in machine learning (ML) has garnered significant attention in recent years. While existing research has predominantly focused on the distributive fairness of ML models, there has been limited exploration of procedural fairness. This paper proposes a novel method to achieve procedural fairness during the model training phase. The effectiveness of the proposed method is validated through experiments conducted on one synthetic and six real-world datasets. Additionally, this work studies the relationship between procedural fairness and distributive fairness in ML models. On one hand, the impact of dataset bias and the procedural fairness of ML model on its distributive fairness is examined. The results highlight a significant influence of both dataset bias and procedural fairness on distributive fairness. On the other hand, the distinctions between optimizing procedural and distributive fairness metrics are analyzed. Experimental results demonstrate that optimizing procedural fairness metrics mitigates biases introduced or amplified by the decision-making process, thereby ensuring fairness in the decision-making process itself, as well as improving distributive fairness. In contrast, optimizing distributive fairness metrics encourages the ML model’s decision-making process to favor disadvantaged groups, counterbalancing the inherent preferences for advantaged groups present in the dataset and ultimately achieving distributive fairness.

[LG-43] DRDT3: Diffusion-Refined Decision Test-Time Training Model

链接: https://arxiv.org/abs/2501.06718
作者: Xingshuai Huang,Di Wu,Benoit Boulet
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decision Transformer (DT), a trajectory modeling method, has shown competitive performance compared to traditional offline reinforcement learning (RL) approaches on various classic control tasks. However, it struggles to learn optimal policies from suboptimal, reward-labeled trajectories. In this study, we explore the use of conditional generative modeling to facilitate trajectory stitching given its high-quality data generation ability. Additionally, recent advancements in Recurrent Neural Networks (RNNs) have shown their linear complexity and competitive sequence modeling performance over Transformers. We leverage the Test-Time Training (TTT) layer, an RNN that updates hidden states during testing, to model trajectories in the form of DT. We introduce a unified framework, called Diffusion-Refined Decision TTT (DRDT3), to achieve performance beyond DT models. Specifically, we propose the Decision TTT (DT3) module, which harnesses the sequence modeling strengths of both self-attention and the TTT layer to capture recent contextual information and make coarse action predictions. We further integrate DT3 with the diffusion model using a unified optimization objective. With experiments on multiple tasks of Gym and AntMaze in the D4RL benchmark, our DT3 model without diffusion refinement demonstrates improved performance over standard DT, while DRDT3 further achieves superior results compared to state-of-the-art conventional offline RL and DT-based methods.

[LG-44] Average Reward Reinforcement Learning for Wireless Radio Resource Management

链接: https://arxiv.org/abs/2501.06700
作者: Kun Yang,Jing Yang,Cong Shen
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: Accepted by Asilomar 2024

点击查看摘要

Abstract:In this paper, we address a crucial but often overlooked issue in applying reinforcement learning (RL) to radio resource management (RRM) in wireless communications: the mismatch between the discounted-reward RL formulation and the undiscounted goal of wireless network optimization. To the best of our knowledge, we are the first to systematically investigate this discrepancy, starting with a discussion of the problem formulation followed by simulations that quantify the extent of the gap. To bridge this gap, we introduce the use of average reward RL, a method that aligns more closely with the long-term objectives of RRM. We propose a new method called Average Reward Off-policy Soft Actor-Critic (ARO SAC), an adaptation of the well-known Soft Actor-Critic algorithm to the average reward framework. This new method achieves significant performance improvement: our simulation results demonstrate a 15% gain in system performance over the traditional discounted reward RL approach, underscoring the potential of average reward RL in enhancing the efficiency and effectiveness of wireless network optimization.
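
The essential change behind average reward RL is the critic target: instead of discounting future values, a running estimate of the long-run average reward is subtracted (the "differential" formulation). A minimal sketch with hypothetical names, not the authors' ARO SAC implementation:

```python
import torch

def differential_td_target(r, q_next, rho):
    # Discounted target would be r + gamma * q_next; the average-reward
    # (differential) target drops gamma and subtracts the average reward rho.
    return r - rho + q_next

def update_avg_reward(rho, td_error, lr_rho=1e-3):
    # rho tracks the long-run average reward via the TD error,
    # as in standard differential TD learning.
    return rho + lr_rho * td_error.mean().item()
```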

[LG-45] Understanding and Mitigating Membership Inference Risks of Neural Ordinary Differential Equations

链接: https://arxiv.org/abs/2501.06686
作者: Sanghyun Hong,Fan Wu,Anthony Gruber,Kookjin Lee
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural ordinary differential equations (NODEs) are an emerging paradigm in scientific computing for modeling dynamical systems. By accurately learning underlying dynamics in data in the form of differential equations, NODEs have been widely adopted in various domains, such as healthcare, finance, computer vision, and language modeling. However, there remains a limited understanding of the privacy implications of these fundamentally different models, particularly with regard to their membership inference risks. In this work, we study the membership inference risks associated with NODEs. We first comprehensively evaluate NODEs against membership inference attacks. We show that NODEs are twice as resistant to these privacy attacks compared to conventional feedforward models such as ResNets. By analyzing the variance in membership risks across different NODE models, we identify the factors that contribute to their lower risks. We then demonstrate, both theoretically and empirically, that membership inference risks can be further mitigated by utilizing a stochastic variant of NODEs: Neural stochastic differential equations (NSDEs). We show that NSDEs are differentially-private (DP) learners that provide the same provable privacy guarantees as DP-SGD, the de-facto mechanism for training private models. NSDEs are also effective in mitigating existing membership inference attacks, demonstrating risks comparable to private models trained with DP-SGD while offering an improved privacy-utility trade-off. Moreover, we propose a drop-in-replacement strategy that efficiently integrates NSDEs into conventional feedforward models to enhance their privacy.

[LG-46] Tab-Shapley: Identifying Top-k Tabular Data Quality Insights AAAI-25

链接: https://arxiv.org/abs/2501.06685
作者: Manisha Padala,Lokesh Nagalapatti,Atharv Tyagi,Ramasuri Narayanam,Shiv Kumar Saini
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at AAAI-25

点击查看摘要

Abstract:We present an unsupervised method for aggregating anomalies in tabular datasets by identifying the top-k tabular data quality insights. Each insight consists of a set of anomalous attributes and the corresponding subsets of records that serve as evidence to the user. The process of identifying these insight blocks is challenging due to (i) the absence of labeled anomalies, (ii) the exponential size of the subset search space, and (iii) the complex dependencies among attributes, which obscure the true sources of anomalies. Simple frequency-based methods fail to capture these dependencies, leading to inaccurate results. To address this, we introduce Tab-Shapley, a cooperative game theory based framework that uses Shapley values to quantify the contribution of each attribute to the data’s anomalous nature. While calculating Shapley values typically requires exponential time, we show that our game admits a closed-form solution, making the computation efficient. We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels.
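
Because the closed-form solution is specific to the Tab-Shapley game, a generic Monte Carlo Shapley estimator is the simplest way to see the quantity being computed: each attribute's average marginal contribution to an anomalousness score. A sketch under that framing, with a hypothetical `value_fn`:

```python
import numpy as np

def shapley_attribute_contributions(attrs, value_fn, n_perm=200, seed=0):
    """Monte Carlo Shapley values of attributes for a set-valued
    'anomalousness' game value_fn(frozenset) -> float. The paper's
    Tab-Shapley game admits a closed-form solution; this generic
    estimator is only for illustration."""
    rng = np.random.default_rng(seed)
    phi = {a: 0.0 for a in attrs}
    for _ in range(n_perm):
        order = rng.permutation(list(attrs))
        coalition, prev = set(), value_fn(frozenset())
        for a in order:
            coalition.add(a)
            cur = value_fn(frozenset(coalition))
            phi[a] += cur - prev  # marginal contribution of attribute a
            prev = cur
    return {a: v / n_perm for a, v in phi.items()}
```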

[LG-47] Challenging reaction prediction models to generalize to novel chemistry

链接: https://arxiv.org/abs/2501.06669
作者: John Bradshaw,Anji Zhang,Babak Mahjour,David E. Graff,Marwin H.S. Segler,Connor W. Coley
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Deep learning models for anticipating the products of organic reactions have found many use cases, including validating retrosynthetic pathways and constraining synthesis-based molecular design tools. Despite compelling performance on popular benchmark tasks, strange and erroneous predictions sometimes ensue when using these models in practice. The core issue is that common benchmarks test models in an in-distribution setting, whereas many real-world uses for these models are in out-of-distribution settings and require a greater degree of extrapolation. To better understand how current reaction predictors work in out-of-distribution domains, we report a series of more challenging evaluations of a prototypical SMILES-based deep learning model. First, we illustrate how performance on randomly sampled datasets is overly optimistic compared to performance when generalizing to new patents or new authors. Second, we conduct time splits that evaluate how models perform when tested on reactions published in years after those in their training set, mimicking real-world deployment. Finally, we consider extrapolation across reaction classes to reflect what would be required for the discovery of novel reaction types. This panel of tasks can reveal the capabilities and limitations of today’s reaction predictors, acting as a crucial first step in the development of tomorrow’s next-generation models capable of reaction discovery.

[LG-48] Learning dynamical systems with hit-and-run random feature maps

链接: https://arxiv.org/abs/2501.06661
作者: Pinak Mandal,Georg A. Gottwald
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We show how random feature maps can be used to forecast dynamical systems with excellent forecasting skill. We consider the tanh activation function and judiciously choose the internal weights in a data-driven manner such that the resulting features explore the nonlinear, non-saturated regions of the activation function. We introduce skip connections and construct a deep variant of random feature maps by combining several units. To mitigate the curse of dimensionality, we introduce localization where we learn local maps, employing conditional independence. Our modified random feature maps provide excellent forecasting skill for both single trajectory forecasts as well as long-time estimates of statistical properties, for a range of chaotic dynamical systems with dimensions up to 512. In contrast to other methods such as reservoir computers which require extensive hyperparameter tuning, we effectively need to tune only a single hyperparameter, and are able to achieve state-of-the-art forecast skill with much smaller networks.
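
A minimal random-feature forecaster along these lines is easy to write down: tanh features with random internal weights and a ridge-regression readout. Plain uniform weight sampling below stands in for the paper's data-driven "hit-and-run" sampling, so treat this purely as an illustrative baseline.

```python
import numpy as np

def fit_random_feature_forecaster(X, Y, n_features=512, scale=0.4, reg=1e-6, seed=0):
    """One-step forecaster y_t ~ W_out tanh(W_in x_t + b).
    X: (T, d) states, Y: (T, d) next states. Uniform sampling of (W_in, b)
    replaces the paper's hit-and-run scheme -- an illustrative simplification."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W_in = rng.uniform(-scale, scale, size=(n_features, d))
    b = rng.uniform(-scale, scale, size=n_features)
    Phi = np.tanh(X @ W_in.T + b)  # (T, n_features)
    # Ridge-regression readout: solve (Phi^T Phi + reg I) W_out^T = Phi^T Y
    W_out = np.linalg.solve(Phi.T @ Phi + reg * np.eye(n_features), Phi.T @ Y).T
    return W_in, b, W_out

def forecast(x, steps, W_in, b, W_out):
    traj = []
    for _ in range(steps):  # autoregressive rollout
        x = W_out @ np.tanh(W_in @ x + b)
        traj.append(x)
    return np.array(traj)
```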

[LG-49] SafeSplit: A Novel Defense Against Client-Side Backdoor Attacks in Split Learning NDSS2025

链接: https://arxiv.org/abs/2501.06650
作者: Phillip Rieger,Alessandro Pegoraro,Kavita Kumari,Tigist Abera,Jonathan Knauer,Ahmad-Reza Sadeghi
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: To appear at NDSS 2025; 18 pages, 6 Tables, and 11 figures

点击查看摘要

Abstract:Split Learning (SL) is a distributed deep learning approach enabling multiple clients and a server to collaboratively train and infer on a shared deep neural network (DNN) without requiring clients to share their private local data. The DNN is partitioned in SL, with most layers residing on the server and a few initial layers and inputs on the client side. This configuration allows resource-constrained clients to participate in training and inference. However, the distributed architecture exposes SL to backdoor attacks, where malicious clients can manipulate local datasets to alter the DNN’s behavior. Existing defenses from other distributed frameworks like Federated Learning are not applicable, and there is a lack of effective backdoor defenses specifically designed for SL. We present SafeSplit, the first defense against client-side backdoor attacks in Split Learning (SL). SafeSplit enables the server to detect and filter out malicious client behavior by employing circular backward analysis after a client’s training is completed, iteratively reverting to a trained checkpoint where the model under examination is found to be benign. It uses a two-fold analysis to identify client-induced changes and detect poisoned models. First, a static analysis in the frequency domain measures the differences in the layer’s parameters at the server. Second, a dynamic analysis introduces a novel rotational distance metric that assesses the orientation shifts of the server’s layer parameters during training. Our comprehensive evaluation across various data distributions, client counts, and attack scenarios demonstrates the high efficacy of this dual analysis in mitigating backdoor attacks while preserving model utility.
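
One plausible reading of the "rotational distance" is the angle between a layer's flattened parameters at two checkpoints; the paper's exact metric may differ. A sketch:

```python
import torch

def rotational_distance(params_before, params_after):
    """Angle (radians) between a layer's flattened parameters at two
    checkpoints -- one possible reading of SafeSplit's rotational
    distance, for illustration only."""
    a = params_before.flatten()
    b = params_after.flatten()
    cos = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + 1e-12), -1.0, 1.0)
    return torch.acos(cos)
```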

[LG-50] Dual-Modality Representation Learning for Molecular Property Prediction

链接: https://arxiv.org/abs/2501.06608
作者: Anyin Zhao,Zuquan Chen,Zhengyu Fang,Xiaoge Zhang,Jing Li
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Molecular property prediction has attracted substantial attention recently. Accurate prediction of drug properties relies heavily on effective molecular representations. The structures of chemical compounds are commonly represented as graphs or SMILES sequences. Recent advances in learning drug properties commonly employ Graph Neural Networks (GNNs) based on the graph representation. For the SMILES representation, Transformer-based architectures have been adopted by treating each SMILES string as a sequence of tokens. Because each representation has its own advantages and disadvantages, combining both representations in learning drug properties is a promising direction. We propose a method named Dual-Modality Cross-Attention (DMCA) that can effectively combine the strengths of two representations by employing the cross-attention mechanism. DMCA was evaluated across eight datasets including both classification and regression tasks. Results show that our method achieves the best overall performance, highlighting its effectiveness in leveraging the complementary information from both graph and SMILES modalities.
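
A minimal cross-attention fusion of the two modalities might look like the following PyTorch sketch; the dimensions and pooling are invented for illustration, and this is not the authors' DMCA implementation.

```python
import torch
import torch.nn as nn

class DualModalityCrossAttention(nn.Module):
    """Fuses a graph embedding sequence (e.g., GNN node states) with a SMILES
    token embedding sequence via cross-attention -- a sketch of the DMCA idea
    with invented dimensions."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.g2s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.s2g = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(2 * d_model, 1)  # property prediction head

    def forward(self, graph_tokens, smiles_tokens):
        # each modality attends to the other (query, key, value)
        g, _ = self.g2s(graph_tokens, smiles_tokens, smiles_tokens)
        s, _ = self.s2g(smiles_tokens, graph_tokens, graph_tokens)
        fused = torch.cat([g.mean(dim=1), s.mean(dim=1)], dim=-1)
        return self.head(fused)
```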

[LG-51] Preconditioned Sharpness-Aware Minimization: Unifying Analysis and a Novel Learning Algorithm ICASSP

链接: https://arxiv.org/abs/2501.06603
作者: Yilang Zhang,Bingcong Li,Georgios B. Giannakis
类目: Machine Learning (cs.LG)
*备注: Accepted by International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025

点击查看摘要

Abstract:Targeting solutions over "flat" regions of the loss landscape, sharpness-aware minimization (SAM) has emerged as a powerful tool to improve generalizability of deep neural network based learning. While several SAM variants have been developed to this end, a unifying approach that also guides principled algorithm design has been elusive. This contribution leverages preconditioning (pre) to unify SAM variants and provide not only unifying convergence analysis, but also valuable insights. Building upon preSAM, a novel algorithm termed infoSAM is introduced to address the so-called adversarial model degradation issue in SAM by adjusting gradients depending on noise estimates. Extensive numerical tests demonstrate the superiority of infoSAM across various benchmarks.
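
For context, a single SAM update perturbs the weights along the (optionally preconditioned) gradient direction, re-computes the gradient at the perturbed point, and then applies the base optimizer. The sketch below illustrates this view with an optional diagonal preconditioner; it is not the paper's infoSAM algorithm, and `loss_fn`/`precond` are assumed interfaces.

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05, precond=None):
    # 1) gradient at the current weights
    loss_fn(model, batch).backward()
    eps = []
    with torch.no_grad():
        norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]
        scale = rho / (torch.norm(torch.stack(norms)) + 1e-12)
        for p in model.parameters():
            if p.grad is None:
                continue
            e = p.grad * scale
            if precond is not None:
                e = e * precond[p]  # preconditioned perturbation (the "pre" in preSAM)
            p.add_(e)
            eps.append((p, e))
    model.zero_grad()
    # 2) gradient at the perturbed (sharpness-probing) weights
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, e in eps:
            p.sub_(e)  # undo the perturbation
    base_opt.step()
    base_opt.zero_grad()
```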

[LG-52] A Tight VC-Dimension Analysis of Clustering Coresets with Applications

链接: https://arxiv.org/abs/2501.06588
作者: Vincent Cohen-Addad,Andrew Draganov,Matteo Russo,David Saulpic,Chris Schwiegelshohn
类目: Computational Geometry (cs.CG); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider coresets for $k$-clustering problems, where the goal is to assign points to centers minimizing powers of distances. A popular example is the $k$-median objective $\sum_{p \in P}\min_{c \in C}\mathrm{dist}(p,c)$. Given a point set $P$, a coreset $\Omega$ is a small weighted subset that approximates the cost of $P$ for all candidate solutions $C$ up to a $(1\pm\varepsilon)$ multiplicative factor. In this paper, we give a sharp VC-dimension based analysis for coreset construction. As a consequence, we obtain improved $k$-median coreset bounds for the following metrics: (i) coresets of size $\tilde{O}(k\varepsilon^{-2})$ for shortest path metrics in planar graphs, improving over the bounds $\tilde{O}(k\varepsilon^{-6})$ by [Cohen-Addad, Saulpic, Schwiegelshohn, STOC’21] and $\tilde{O}(k^2\varepsilon^{-4})$ by [Braverman, Jiang, Krauthgamer, Wu, SODA’21]; (ii) coresets of size $\tilde{O}(kd\ell\varepsilon^{-2}\log m)$ for clustering $d$-dimensional polygonal curves of length at most $m$ with curves of length at most $\ell$ with respect to Fréchet metrics, improving over the bounds $\tilde{O}(k^3 d\ell\varepsilon^{-3}\log m)$ by [Braverman, Cohen-Addad, Jiang, Krauthgamer, Schwiegelshohn, Toftrup, and Wu, FOCS’22] and $\tilde{O}(k^2 d\ell\varepsilon^{-2}\log m \log |P|)$ by [Conradi, Kolbe, Psarros, Rohde, SoCG’24].

[LG-53] Boundary-enhanced time series data imputation with long-term dependency diffusion models

链接: https://arxiv.org/abs/2501.06585
作者: Chunjing Xiao,Xue Jiang,Xianghe Du,Wei Yang,Wei Lu,Xiaomin Wang,Kevin Chetty
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Accepted by Knowledge-Based Systems

点击查看摘要

Abstract:Data imputation is crucial for addressing challenges posed by missing values in multivariate time series data across various fields, such as healthcare, traffic, and economics, and has garnered significant attention. Among various methods, diffusion model-based approaches show notable performance improvements. However, existing methods often cause disharmonious boundaries between missing and known regions and overlook long-range dependencies in missing data estimation, leading to suboptimal results. To address these issues, we propose a Diffusion-based time Series Data Imputation (DSDI) framework. We develop a weight-reducing injection strategy that incorporates the predicted values of missing points with reducing weights into the reverse diffusion process to mitigate boundary inconsistencies. Further, we introduce a multi-scale S4-based U-Net, which combines hierarchical information from different levels via multi-resolution integration to capture long-term dependencies. Experimental results demonstrate that our model outperforms existing imputation methods.

[LG-54] Recommending the right academic programs: An interest mining approach using BERTopic

链接: https://arxiv.org/abs/2501.06581
作者: Alessandro Hill,Kalen Goo,Puneet Agarwal
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注: Accepted at Data Mining and Knowledge Discovery (Springer)

点击查看摘要

Abstract:Prospective students face the challenging task of selecting a university program that will shape their academic and professional careers. For decision-makers and support services, it is often time-consuming and extremely difficult to match personal interests with suitable programs due to the vast and complex catalogue information available. This paper presents the first information system that provides students with efficient recommendations based on both program content and personal preferences. BERTopic, a powerful topic modeling algorithm, is used that leverages text embedding techniques to generate topic representations. It enables us to mine interest topics from all course descriptions, representing the full body of knowledge taught at the institution. Underpinned by the student’s individual choice of topics, a shortlist of the most relevant programs is computed through statistical backtracking in the knowledge map, a novel characterization of the program-course relationship. This approach can be applied to a wide range of educational settings, including professional and vocational training. A case study at a post-secondary school with 80 programs and over 5,000 courses shows that the system provides immediate and effective decision support. The presented interest topics are meaningful, leading to positive effects such as serendipity, personalization, and fairness, as revealed by a qualitative study involving 65 students. Over 98% of users indicated that the recommendations aligned with their interests, and about 94% stated they would use the tool in the future. Quantitative analysis shows the system can be configured to ensure fairness, achieving 98% program coverage while maintaining a personalization score of 0.77. These findings suggest that this real-time, user-centered, data-driven system could improve the program selection process.
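
The topic-mining stage maps directly onto the BERTopic library; the knowledge-map backtracking that turns selected topics into program recommendations is specific to the paper and omitted here. A minimal sketch:

```python
from bertopic import BERTopic

def mine_interest_topics(course_descriptions):
    """course_descriptions: list[str], one per course in the catalogue.
    Returns the fitted topic model and the per-course topic assignments."""
    topic_model = BERTopic(min_topic_size=10)
    topics, probs = topic_model.fit_transform(course_descriptions)
    return topic_model, topics

# topic_model.get_topic_info() then lists the mined interest topics
# a student could choose from.
```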

[LG-55] Physics-Informed Neuro-Evolution (PINE): A Survey and Prospects

链接: https://arxiv.org/abs/2501.06572
作者: Jian Cheng Wong,Abhishek Gupta,Chin Chun Ooi,Pao-Hsiung Chiu,Jiao Liu,Yew-Soon Ong
类目: Neural and Evolutionary Computing (cs.NE); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 20 pages, 8 figures, 1 table

点击查看摘要

Abstract:Deep learning models trained on finite data lack a complete understanding of the physical world. On the other hand, physics-informed neural networks (PINNs) are infused with such knowledge through the incorporation of mathematically expressible laws of nature into their training loss function. By complying with physical laws, PINNs provide advantages over purely data-driven models in limited-data regimes. This feature has propelled them to the forefront of scientific machine learning, a domain characterized by scarce and costly data. However, the vision of accurate physics-informed learning comes with significant challenges. This review examines PINNs for the first time in terms of model optimization and generalization, shedding light on the need for new algorithmic advances to overcome issues pertaining to the training speed, precision, and generalizability of today’s PINN models. Of particular interest are the gradient-free methods of neuroevolution for optimizing the uniquely complex loss landscapes arising in PINN training. Methods synergizing gradient descent and neuroevolution for discovering bespoke neural architectures and balancing multiple conflicting terms in physics-informed learning objectives are positioned as important avenues for future research. Yet another exciting track is to cast neuroevolution as a meta-learner of generalizable PINN models.

[LG-56] Online Algorithm for Aggregating Experts Predictions with Unbounded Quadratic Loss

链接: https://arxiv.org/abs/2501.06505
作者: Alexander Korotin,Vladimir V’yugin,Evgeny Burnaev
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of online aggregation of expert predictions under the quadratic loss function. We propose an algorithm for aggregating expert predictions that does not require prior knowledge of an upper bound on the losses. The algorithm is based on exponential reweighing of the expert losses.
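
The classic exponentially weighted average forecaster behind this idea is a few lines of NumPy. The fixed learning rate below implicitly assumes bounded losses; removing that assumption is precisely the paper's contribution, so this sketch shows only the baseline scheme.

```python
import numpy as np

def aggregate_online(expert_preds, outcomes, eta=0.5):
    """Exponentially weighted average forecaster with quadratic loss.
    expert_preds: (T, N) predictions of N experts; outcomes: (T,).
    Fixed eta assumes bounded losses -- the baseline the paper improves on."""
    T, N = expert_preds.shape
    log_w = np.zeros(N)
    master = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max()); w /= w.sum()
        master[t] = w @ expert_preds[t]  # aggregated prediction
        log_w -= eta * (expert_preds[t] - outcomes[t]) ** 2  # reweigh by loss
    return master
```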

[LG-57] A New Flexible Train-Test Split Algorithm an approach for choosing among the Hold-out K-fold cross-validation and Hold-out iteration

链接: https://arxiv.org/abs/2501.06492
作者: Zahra Bami,Ali Behnampour,Hassan Doosti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Artificial Intelligence has transformed industries such as engineering, medicine, and finance. Predictive models rely on supervised learning, a vital subset of machine learning, and cross-validation (including re-substitution, hold-out, and K-fold) is crucial for model evaluation. This study focuses on improving the accuracy of ML algorithms across three different datasets. To evaluate hold-out, hold-out with iteration, and K-fold cross-validation, we created a flexible Python program. By modifying parameters such as test size, random state, and 'k' values, we were able to improve accuracy assessment. The outcomes demonstrate the persistent superiority of the hold-out validation method, particularly with a test size of 10%. Across iterations and random state settings, hold-out with iteration shows little variance in accuracy. Results also vary by algorithm, with Decision Tree performing best for Framingham, and Naive Bayes and K-Nearest Neighbors for COVID-19. Different datasets require different optimal K values in K-fold cross-validation, highlighting these considerations. This study challenges the universality of K values in K-fold cross-validation and suggests a 10% test size and 90% training size for better outcomes. It also emphasizes the contextual impact of dataset features, sample size, feature count, and selected methodologies. Researchers can adapt these codes to their datasets to obtain the highest accuracy with a specific evaluation.
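
A compact version of such a comparison (hold-out at 10% test size, repeated hold-out, and 10-fold CV for one classifier) can be written with scikit-learn; the dataset below is a placeholder, not one of the study's three.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

def holdout_accuracy(seed, test_size=0.10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    return DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print("hold-out (10% test):", holdout_accuracy(0))
print("hold-out, 20 iterations:", np.mean([holdout_accuracy(s) for s in range(20)]))
print("10-fold CV:", cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=10).mean())
```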

[LG-58] Automated Detection and Analysis of Minor Deformations in Flat Walls Due to Railway Vibrations Using LiDAR and Machine Learning

链接: https://arxiv.org/abs/2501.06457
作者: Surjo Dey,Ankit Sharma,Hritu Raj,Susham Biswas
类目: Machine Learning (cs.LG)
*备注: IEEE conference paper

点击查看摘要

Abstract:This study introduces an advanced methodology for automatically identifying minor deformations in flat walls caused by vibrations from nearby railway tracks. It leverages high-density Terrestrial Laser Scanner (TLS) LiDAR surveys and AI/ML techniques to collect and analyze data. The scan data is processed into a detailed point cloud, which is segmented to distinguish ground points, trees, buildings, and other objects. The analysis focuses on identifying sections along flat walls and estimating their deformations relative to the ground orientation. Findings from the study, conducted at the RGIPT campus, reveal significant deformations in walls close to the railway corridor, with the highest deformations ranging from 7 to 8 cm and an average of 3 to 4 cm. In contrast, walls further from the corridor show negligible deformations. The developed automated process for feature extraction and deformation monitoring demonstrates potential for structural health monitoring. By integrating LiDAR data with machine learning, the methodology provides an efficient system for identifying and analyzing structural deformations, highlighting the importance of continuous monitoring for ensuring structural integrity and public safety in urban infrastructure. This approach represents a substantial advancement in automated feature extraction and deformation analysis, contributing to more effective management of urban infrastructure.

[LG-59] Cross-Technology Interference: Detection Avoidance and Coexistence Mechanisms in the ISM Bands

链接: https://arxiv.org/abs/2501.06446
作者: Zegeye Mekasha Kidane,Waltenegus Dargie
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A large number of heterogeneous wireless networks share the unlicensed spectrum designated as the ISM (Industrial, Scientific, and Medical) radio bands. These networks do not adhere to a common medium access rule and differ considerably in their specifications. As a result, when concurrently active, they cause cross-technology interference (CTI) on each other. The effect of this interference is not reciprocal: networks using high transmission power and advanced transmission schemes often cause disproportionate disruptions to those with modest communication and computation resources. CTI corrupts packets, incurs packet retransmission costs, introduces end-to-end latency and jitter, and makes networks unpredictable. The purpose of this paper is to closely examine its impact on low-power networks based on the IEEE 802.15.4 standard. It discusses the latest developments in CTI detection, coexistence, and avoidance mechanisms, as well as messaging schemes that attempt to enable heterogeneous networks to communicate directly with one another to coordinate packet transmission and channel assignment.

[LG-60] Reliable Imputed-Sample Assisted Vertical Federated Learning

链接: https://arxiv.org/abs/2501.06429
作者: Yaopei Zeng,Lei Liu,Shaoguo Liu,Hongjian Dou,Baoyuan Wu,Li Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Vertical Federated Learning (VFL) is a well-known FL variant that enables multiple parties to collaboratively train a model without sharing their raw data. Existing VFL approaches focus on overlapping samples among different parties, but their performance is constrained by the limited number of these samples, leaving numerous non-overlapping samples unexplored. Some previous work has explored techniques for imputing missing values in samples, but often without adequate attention to the quality of the imputed samples. To address this issue, we propose a Reliable Imputed-Sample Assisted (RISA) VFL framework to effectively exploit non-overlapping samples by selecting reliable imputed samples for training VFL models. Specifically, after imputing non-overlapping samples, we introduce evidence theory to estimate the uncertainty of imputed samples, and only samples with low uncertainty are selected. In this way, high-quality non-overlapping samples are utilized to improve the VFL model. Experiments on two widely used datasets demonstrate the significant performance gains achieved by RISA, especially with limited overlapping samples, e.g., a 48% accuracy gain on CIFAR-10 with only 1% overlapping samples.

[LG-61] Task Delay and Energy Consumption Minimization for Low-altitude MEC via Evolutionary Multi-objective Deep Reinforcement Learning

链接: https://arxiv.org/abs/2501.06410
作者: Geng Sun,Weilong Ma,Jiahui Li,Zemin Sun,Jiacheng Wang,Dusit Niyato,Shiwen Mao
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:The low-altitude economy (LAE), driven by unmanned aerial vehicles (UAVs) and other aircraft, has revolutionized fields such as transportation, agriculture, and environmental monitoring. In the upcoming sixth-generation (6G) era, UAV-assisted mobile edge computing (MEC) is particularly crucial in challenging environments such as mountainous or disaster-stricken areas. The computation task offloading problem is one of the key issues in UAV-assisted MEC, primarily addressing the trade-off between minimizing the task delay and the energy consumption of the UAV. In this paper, we consider a UAV-assisted MEC system where the UAV carries the edge servers to facilitate task offloading for ground devices (GDs), and formulate a calculation delay and energy consumption multi-objective optimization problem (CDECMOP) to simultaneously improve the performance and reduce the cost of the system. Then, by modeling the formulated problem as a multi-objective Markov decision process (MOMDP), we propose a multi-objective deep reinforcement learning (DRL) algorithm within an evolutionary framework to dynamically adjust the weights and obtain non-dominated policies. Moreover, to ensure stable convergence and improve performance, we incorporate a target distribution learning (TDL) algorithm. Simulation results demonstrate that the proposed algorithm can better balance multiple optimization objectives and obtain superior non-dominated solutions compared to other methods.

[LG-62] Mathematics of Digital Twins and Transfer Learning for PDE Models

链接: https://arxiv.org/abs/2501.06400
作者: Yifei Zong,Alexandre Tartakovsky
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 22 pages, 7 figures

点击查看摘要

Abstract:We define a digital twin (DT) of a physical system governed by partial differential equations (PDEs) as a model for real-time simulations and control of the system behavior under changing conditions. We construct DTs using the Karhunen-Loève Neural Network (KL-NN) surrogate model and transfer learning (TL). The surrogate model allows fast inference and differentiability with respect to control parameters for control and optimization. TL is used to retrain the model for new conditions with minimal additional data. We employ the moment equations to analyze TL and identify parameters that can be transferred to new conditions. The proposed analysis also guides the control variable selection in DT to facilitate efficient TL. For linear PDE problems, the non-transferable parameters in the KL-NN surrogate model can be exactly estimated from a single solution of the PDE corresponding to the mean values of the control variables under new target conditions. Retraining an ML model with a single solution sample is known as one-shot learning, and our analysis shows that the one-shot TL is exact for linear PDEs. For nonlinear PDE problems, transferring of any parameters introduces errors. For a nonlinear diffusion PDE model, we find that for a relatively small range of control variables, some surrogate model parameters can be transferred without introducing a significant error, some can be approximately estimated from the mean-field equation, and the rest can be found using a linear residual least square problem or an ordinary linear least square problem if a small labeled dataset for new conditions is available. The former approach results in a one-shot TL while the latter approach is an example of a few-shot TL. Both methods are approximate for the nonlinear PDEs.

[LG-63] On the Partial Identifiability in Reward Learning: Choosing the Best Reward

链接: https://arxiv.org/abs/2501.06376
作者: Filippo Lazzati,Alberto Maria Metelli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In Reward Learning (ReL), we are given feedback on an unknown target reward, and the goal is to use this information to find it. When the feedback is not informative enough, the target reward is only partially identifiable, i.e., there exists a set of rewards (the feasible set) that are equally compatible with the feedback. In this paper, we show that there exists a choice of reward, not necessarily contained in the feasible set, that, depending on the ReL application, improves performance relative to selecting the reward arbitrarily among the feasible ones. To this aim, we introduce a new quantitative framework to analyze ReL problems in a simple yet expressive way. We exemplify the framework in a reward transfer use case, for which we devise three provably-efficient ReL algorithms.

[LG-64] On Creating A Brain-To-Text Decoder

链接: https://arxiv.org/abs/2501.06326
作者: Zenon Lamprou,Yashar Moshfeghi
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Brain decoding has emerged as a rapidly advancing and extensively utilized technique within neuroscience. This paper centers on the application of raw electroencephalogram (EEG) signals for decoding human brain activity, offering a more expedited and efficient methodology for enhancing our understanding of the human brain. The investigation specifically scrutinizes the efficacy of brain-computer interfaces (BCI) in deciphering neural signals associated with speech production, with particular emphasis on the impact of vocabulary size, electrode density, and training data on the framework’s performance. The study reveals the competitive word error rates (WERs) achievable on the Librispeech benchmark through pre-training on unlabelled data for speech processing. Furthermore, the study evaluates the efficacy of voice recognition under configurations with limited labeled data, surpassing previous state-of-the-art techniques while utilizing significantly fewer labels. Additionally, the research provides a comprehensive analysis of error patterns in voice recognition and the influence of model size and unlabelled training data. It underscores the significance of factors such as vocabulary size and electrode density in enhancing BCI performance, advocating for an increase in microelectrodes and refinement of language models.

[LG-65] Uncertainty Estimation for Path Loss and Radio Metric Models

链接: https://arxiv.org/abs/2501.06308
作者: Alexis Bose,Jonathan Ethier,Ryan G. Dempsey,Yifeng Qiu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 5 pages, 12 figures

点击查看摘要

Abstract:This research leverages Conformal Prediction (CP) in the form of Conformal Predictive Systems (CPS) to accurately estimate uncertainty in a suite of machine learning (ML)-based radio metric models [1] as well as in a 2-D map-based ML path loss model [2]. Utilizing diverse difficulty estimators, we construct 95% confidence prediction intervals (PIs) that are statistically robust. Our experiments demonstrate that CPS models, trained on Toronto datasets, generalize effectively to other cities such as Vancouver and Montreal, maintaining high coverage and reliability. Furthermore, the employed difficulty estimators identify challenging samples, leading to measurable reductions in RMSE as dataset difficulty decreases. These findings highlight the effectiveness of scalable and reliable uncertainty estimation through CPS in wireless network modeling, offering important potential insights for network planning, operations, and spectrum management.
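
The split-conformal construction underlying CPS-style intervals can be sketched as follows; difficulty estimates enter by normalizing the calibration residuals. This is a simplified sketch: CPS proper outputs full predictive distributions, not just intervals.

```python
import numpy as np

def split_conformal_interval(resid_cal, y_hat_test, alpha=0.05,
                             sigma_cal=None, sigma_test=None):
    """95% (for alpha=0.05) prediction intervals from calibration residuals.
    sigma_* are optional difficulty estimates -- the role difficulty
    estimators play in conformal predictive systems."""
    scores = np.abs(resid_cal)
    if sigma_cal is not None:
        scores = scores / sigma_cal  # normalized nonconformity scores
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    half_width = q * (sigma_test if sigma_test is not None else 1.0)
    return y_hat_test - half_width, y_hat_test + half_width
```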

[LG-66] Tensorization of neural networks for improved privacy and interpretability

链接: https://arxiv.org/abs/2501.06300
作者: José Ramón Pareja Monturiol,Alejandro Pozas-Kerstjens,David Pérez-García
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
*备注: 39 pages, 9 figures

点击查看摘要

Abstract:We present a tensorization algorithm for constructing tensor train representations of functions, drawing on sketching and cross interpolation ideas. The method only requires black-box access to the target function and a small set of sample points defining the domain of interest. Thus, it is particularly well-suited for machine learning models, where the domain of interest is naturally defined by the training dataset. We show that this approach can be used to enhance the privacy and interpretability of neural network models. Specifically, we apply our decomposition to (i) obfuscate neural networks whose parameters encode patterns tied to the training data distribution, and (ii) estimate topological phases of matter that are easily accessible from the tensor train representation. Additionally, we show that this tensorization can serve as an efficient initialization method for optimizing tensor trains in general settings, and that, for model compression, our algorithm achieves a superior trade-off between memory and time complexity compared to conventional tensorization methods of neural networks.
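
For background, the textbook TT-SVD construction below shows what a tensor-train representation is; the paper instead builds the cores from black-box function access via sketching and cross interpolation, which this sketch does not implement.

```python
import numpy as np

def tt_svd(tensor, max_rank=8):
    """Tensor-train decomposition by sequential truncated SVDs (TT-SVD).
    Returns a list of 3-way cores G_k of shape (r_{k-1}, d_k, r_k)."""
    dims = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(S))                       # rank truncation
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        mat = (S[:r, None] * Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores
```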

[LG-67] Cluster Catch Digraphs with the Nearest Neighbor Distance

链接: https://arxiv.org/abs/2501.06268
作者: Rui Shi,Nedret Billor,Elvan Ceyhan
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 28 pages, 4 figures, and 10 tables

点击查看摘要

Abstract:We introduce a new method for clustering based on Cluster Catch Digraphs (CCDs). The new method addresses the limitations of RK-CCDs by employing a new variant of spatial randomness test that employs the nearest neighbor distance (NND) instead of the Ripley’s K function used by RK-CCDs. We conduct a comprehensive Monte Carlo analysis to assess the performance of our method, considering factors such as dimensionality, data set size, number of clusters, cluster volumes, and inter-cluster distance. Our method is particularly effective for high-dimensional data sets, comparable to or outperforming KS-CCDs and RK-CCDs that rely on a KS-type statistic or the Ripley’s K function. We also evaluate our methods using real and complex data sets, comparing them to well-known clustering methods. Again, our methods exhibit competitive performance, producing high-quality clusters with desirable properties. Keywords: Graph-based clustering, Cluster catch digraphs, High-dimensional data, The nearest neighbor distance, Spatial randomness test

[LG-68] Contextual Bandit Optimization with Pre-Trained Neural Networks

链接: https://arxiv.org/abs/2501.06258
作者: Mikhail Terekhov
类目: Machine Learning (cs.LG)
*备注: Master’s thesis

点击查看摘要

Abstract:Bandit optimization is a difficult problem, especially if the reward model is high-dimensional. When rewards are modeled by neural networks, sublinear regret has only been shown under strong assumptions, usually when the network is extremely wide. In this thesis, we investigate how pre-training can help us in the regime of smaller models. We consider a stochastic contextual bandit with the rewards modeled by a multi-layer neural network. The last layer is a linear predictor, and the layers before it are a black box neural architecture, which we call a representation network. We model pre-training as an initial guess of the weights of the representation network provided to the learner. To leverage the pre-trained weights, we introduce a novel algorithm we call Explore Twice then Commit (E2TC). During its two stages of exploration, the algorithm first estimates the last layer’s weights using Ridge regression, and then runs Stochastic Gradient Descent jointly on all the weights. For a locally convex loss function, we provide conditions on the pre-trained weights under which the algorithm can learn efficiently. Under these conditions, we show sublinear regret of E2TC when the dimension of the last layer and the number of actions $K$ are much smaller than the horizon $T$. In the weak training regime, when only the last layer is learned, the problem reduces to a misspecified linear bandit. We introduce a measure of misspecification $\epsilon_0$ for this bandit and use it to provide bounds $O(\epsilon_0\sqrt{dKT}+(KT)^{4/5})$ or $\tilde{O}(\epsilon_0\sqrt{dKT}+d^{1/3}(KT)^{2/3})$ on the regret, depending on regularization strength. The first of these bounds has a dimension-independent sublinear term, made possible by the stochasticity of contexts. We also run experiments to evaluate the regret of E2TC and the sample complexity of its exploration in practice.

[LG-69] Predicting House Rental Prices in Ghana Using Machine Learning

链接: https://arxiv.org/abs/2501.06241
作者: Philip Adzanoukpe
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 13 pages, 8 figures, 2 tables

点击查看摘要

Abstract:This study investigates the efficacy of machine learning models for predicting house rental prices in Ghana, addressing the need for accurate and accessible housing market information. Utilising a comprehensive dataset of rental listings, we trained and evaluated various models, including CatBoost, XGBoost, and Random Forest. CatBoost emerged as the best-performing model, achieving an R^2 of 0.876, demonstrating its ability to effectively capture complex relationships within the housing market. Feature importance analysis revealed that location-based features, number of bedrooms, bathrooms, and furnishing status are key drivers of rental prices. Our findings provide valuable insights for stakeholders, including real estate professionals, investors, and policymakers, while also highlighting opportunities for future research, such as incorporating temporal data and exploring regional variations.
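
A minimal version of the winning pipeline (CatBoost regression with categorical features passed natively) might look as follows; the column names are invented, not the paper's actual schema.

```python
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def train_rent_model(df, target="rent_price",
                     cat_features=("location", "furnishing_status")):
    """df: pandas DataFrame of rental listings with categorical string
    columns for location and furnishing status (illustrative names)."""
    X = df.drop(columns=[target])
    y = df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = CatBoostRegressor(cat_features=list(cat_features), verbose=0)
    model.fit(X_tr, y_tr)
    print("R^2:", r2_score(y_te, model.predict(X_te)))
    return model
```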

[LG-70] he Convergence of Dynamic Routing between Capsules

链接: https://arxiv.org/abs/2501.06240
作者: Daoyuan Ye,Juntao Li,Yiting Shen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Capsule networks (CapsNet) are recently proposed neural network models with new processing layers, specifically for entity representation and discovery in images. It is well known that CapsNet have some advantages over traditional neural networks, especially in generalization capability. At the same time, some studies report negative experimental results. The causes of this contradiction have not been thoroughly analyzed. Preliminary experimental results show that the behavior of routing algorithms does not always produce good results as expected; in most cases, different routing algorithms do not change the classification results, but simply polarize the link strength, especially when they continue to repeat without stopping. To realize the true potential of CapsNet, deep mathematical analysis of the routing algorithms is crucial. In this paper, we give the objective function minimized by the dynamic routing algorithm, which is a concave function. The dynamic routing algorithm can be regarded as a nonlinear gradient method for solving an optimization problem under linear constraints, and its convergence can be strictly proved mathematically. Furthermore, a mathematically rigorous proof of convergence is given for this class of iterative routing procedures. We analyze in detail the relation between the objective function and the constraints solved by the dynamic routing algorithm, and perform corresponding routing experiments to analyze the effect of our convergence proof.
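
The iterative procedure under analysis is the routing-by-agreement loop of Sabour et al. (2017), sketched below in PyTorch: each iteration softmaxes the logits, aggregates predictions, squashes, and increases the logits of agreeing pairs, which is the polarization behavior the abstract mentions.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    n2 = (s * s).sum(dim=dim, keepdim=True)
    return (n2 / (1 + n2)) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, n_iters=3):
    """Routing-by-agreement. u_hat: (B, n_in, n_out, d) prediction
    vectors from lower-level capsules."""
    B, n_in, n_out, d = u_hat.shape
    b = torch.zeros(B, n_in, n_out, device=u_hat.device)
    for _ in range(n_iters):
        c = F.softmax(b, dim=2)                    # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)   # (B, n_out, d)
        v = squash(s)                              # output capsules
        b = b + (u_hat * v.unsqueeze(1)).sum(-1)   # agreement update
    return v
```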

[LG-71] Multi-field Visualization: Trait design and trait-induced merge trees

链接: https://arxiv.org/abs/2501.06238
作者: Danhua Lei,Jochen Jankowai,Petar Hristov,Hamish Carr,Leif Denby,Talha Bin Masood,Ingrid Hotz
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注: IEEE Transactions on Visualization and Computer Graphics. arXiv admin note: text overlap with arXiv:2308.09015

点击查看摘要

Abstract:Feature level sets (FLS) have shown significant potential in the analysis of multi-field data by using traits defined in attribute space to specify features in the domain. In this work, we address key challenges in the practical use of FLS: trait design and feature selection for rendering. To simplify trait design, we propose a Cartesian decomposition of traits into simpler components, making the process more intuitive and computationally efficient. Additionally, we utilize dictionary learning results to automatically suggest point traits. To enhance feature selection, we introduce trait-induced merge trees (TIMTs), a generalization of merge trees for feature level sets, aimed at topologically analyzing tensor fields or general multi-variate data. The leaves in the TIMT represent areas in the input data that are closest to the defined trait, thereby most closely resembling the defined feature. This merge tree provides a hierarchy of features, enabling the querying of the most relevant and persistent features. Our method includes various query techniques for the tree, allowing the highlighting of different aspects. We demonstrate the cross-application capabilities of this approach through five case studies from different domains.

[LG-72] Mechanics and Design of Metastructured Auxetic Patches with Bio-inspired Materials

链接: https://arxiv.org/abs/2501.06233
作者: Yingbin Chen,Milad Arzani,Xuan Mu,Sophia Jin,Shaoping Xiao
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Metastructured auxetic patches, characterized by negative Poisson’s ratios, offer unique mechanical properties that closely resemble the behavior of human tissues and organs. As a result, these patches have gained significant attention for their potential applications in organ repair and tissue regeneration. This study focuses on neural networks-based computational modeling of auxetic patches with a sinusoidal metastructure fabricated from silk fibroin, a bio-inspired material known for its biocompatibility and strength. The primary objective of this research is to introduce a novel, data-driven framework for patch design. To achieve this, we conducted experimental fabrication and mechanical testing to determine material properties and validate the corresponding finite element models. Finite element simulations were then employed to generate the necessary data, while greedy sampling, an active learning technique, was utilized to reduce the computational cost associated with data labeling. Two neural networks were trained to accurately predict Poisson’s ratios and stresses for strains up to 15%, respectively. Both models achieved R^2 scores exceeding 0.995, which indicates highly reliable predictions. Building on this, we developed a neural network-based design model capable of tailoring patch designs to achieve specific mechanical properties. This model demonstrated superior performance when compared to traditional optimization methods, such as genetic algorithms, by providing more efficient and precise design solutions. The proposed framework represents a significant advancement in the design of bio-inspired metastructures for medical applications, paving the way for future innovations in tissue engineering and regenerative medicine.

[LG-73] An Interpretable ML-based Model for Predicting p-y Curves of Monopile Foundations in Sand

Link: https://arxiv.org/abs/2501.06232
Authors: Biao Li, Qing-Kai Song, Wen-Gang Qi, Fu-Ping Gao
Subjects: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft)
Comments:

Click to view abstract

Abstract:Predicting the lateral pile response is challenging due to the complexity of pile-soil interactions. Machine learning (ML) techniques have gained considerable attention for their effectiveness in non-linear analysis and prediction. This study develops an interpretable ML-based model for predicting p-y curves of monopile foundations. An XGBoost model was trained using a database compiled from existing research. The results demonstrate that the model achieves superior predictive accuracy. Shapley Additive Explanations (SHAP) was employed to enhance interpretability. The SHAP value distributions for each variable demonstrate strong alignment with established theoretical knowledge on factors affecting the lateral response of pile foundations.
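
The XGBoost-plus-SHAP pattern the abstract describes is straightforward to reproduce in outline. A minimal sketch follows; the synthetic features and response are placeholders, since the paper's compiled p-y database is not available here.

```python
import numpy as np
import xgboost as xgb
import shap

rng = np.random.default_rng(0)

# Synthetic stand-in for a p-y database; the feature names are hypothetical.
X = rng.uniform(size=(500, 4))   # e.g., [diameter, depth, relative density, displacement]
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 3]) + rng.normal(0, 0.05, 500)

model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X, y)

# TreeExplainer yields per-feature SHAP attributions for every prediction,
# which is the interpretability layer the abstract refers to.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(shap_values.shape)         # (500, 4): one attribution per feature per sample
```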

[LG-74] Generating and Detecting Various Types of Fake Image and Audio Content: A Review of Modern Deep Learning Technologies and Tools

Link: https://arxiv.org/abs/2501.06227
Authors: Arash Dehghani, Hossein Saberi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper reviews the state-of-the-art in deepfake generation and detection, focusing on modern deep learning technologies and tools based on the latest scientific advancements. The rise of deepfakes, leveraging techniques like Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion models and other generative models, presents significant threats to privacy, security, and democracy. This fake media can deceive individuals, discredit real people and organizations, facilitate blackmail, and even threaten the integrity of legal, political, and social systems. Therefore, finding appropriate solutions to counter the potential threats posed by this technology is essential. We explore various deepfake methods, including face swapping, voice conversion, reenactment and lip synchronization, highlighting their applications in both benign and malicious contexts. The review critically examines the ongoing “arms race” between deepfake generation and detection, analyzing the challenges in identifying manipulated contents. By examining current methods and highlighting future research directions, this paper contributes to a crucial understanding of this rapidly evolving field and the urgent need for robust detection strategies to counter the misuse of this powerful technology. While focusing primarily on audio, image, and video domains, this study allows the reader to easily grasp the latest advancements in deepfake generation and detection.

[LG-75] Can Explainable AI Assess Personalized Health Risks from Indoor Air Pollution?

Link: https://arxiv.org/abs/2501.06222
Authors: Pritisha Sarkar, Kushalava reddy Jala, Mousumi Saha
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:While the effects of outdoor air pollution are widely acknowledged, the literature inadequately addresses the impacts of indoor air pollution. Despite the daily health risks, existing research has primarily focused on monitoring and lacks accuracy in pinpointing indoor pollution sources. In our research work, we thoroughly investigated the influence of indoor activities on pollution levels. A survey of 143 participants revealed limited awareness of indoor air pollution. Leveraging 65 days of diverse data encompassing activities like incense stick usage, indoor smoking, inadequately ventilated cooking, excessive AC usage, and accidental paper burning, we developed a comprehensive monitoring system. We identify pollutant sources and effects with high precision through clustering analysis and interpretability models (LIME and SHAP). Our method integrates Decision Trees, Random Forest, Naive Bayes, and SVM models, excelling at 99.8% accuracy with Decision Trees. Continuous 24-hour data allows personalized assessments for targeted pollution reduction strategies, achieving 91% accuracy in predicting activities and pollution exposure.

[LG-76] Optimizing Supply Chain Networks with the Power of Graph Neural Networks

Link: https://arxiv.org/abs/2501.06221
Authors: Chi-Sheng Chen, Ying-Jung Chen
Subjects: Machine Learning (cs.LG); General Economics (econ.GN)
Comments:

Click to view abstract

Abstract:Graph Neural Networks (GNNs) have emerged as transformative tools for modeling complex relational data, offering unprecedented capabilities in tasks like forecasting and optimization. This study investigates the application of GNNs to demand forecasting within supply chain networks using the SupplyGraph dataset, a benchmark for graph-based supply chain analysis. By leveraging advanced GNN methodologies, we enhance the accuracy of forecasting models, uncover latent dependencies, and address temporal complexities inherent in supply chain operations. Comparative analyses demonstrate that GNN-based models significantly outperform traditional approaches, including Multilayer Perceptrons (MLPs) and Graph Convolutional Networks (GCNs), particularly in single-node demand forecasting tasks. The integration of graph representation learning with temporal data highlights GNNs’ potential to revolutionize predictive capabilities for inventory management, production scheduling, and logistics optimization. This work underscores the pivotal role of forecasting in supply chain management and provides a robust framework for advancing research and applications in this domain.
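
As a rough illustration of GNN-based demand forecasting on a supply chain graph, here is a minimal two-layer GCN regressor in PyTorch Geometric; the toy graph and feature dimensions are assumptions, and neither the SupplyGraph dataset nor the paper's exact architecture is reproduced.

```python
import torch
from torch_geometric.nn import GCNConv

class DemandGNN(torch.nn.Module):
    """Two-layer GCN with a linear head producing one demand value per node."""
    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        return self.head(h).squeeze(-1)

# Toy supply chain: 4 nodes with 8 features each, directed supplier->consumer links.
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
model = DemandGNN(in_dim=8)
print(model(x, edge_index).shape)  # torch.Size([4]): one forecast per node
```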

[LG-77] Improving DeFi Accessibility through Efficient Liquidity Provisioning with Deep Reinforcement Learning AAAI2025

Link: https://arxiv.org/abs/2501.07508
Authors: Haonan Xu, Alessio Brini
Subjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures. Accepted at AI for Social Impact: Bridging Innovations in Finance, Social Media, and Crime Prevention Workshop at AAAI 2025

Click to view abstract

Abstract:This paper applies deep reinforcement learning (DRL) to optimize liquidity provisioning in Uniswap v3, a decentralized finance (DeFi) protocol implementing an automated market maker (AMM) model with concentrated liquidity. We model the liquidity provision task as a Markov Decision Process (MDP) and train an active liquidity provider (LP) agent using the Proximal Policy Optimization (PPO) algorithm. The agent dynamically adjusts liquidity positions by using information about price dynamics to balance fee maximization and impermanent loss mitigation. We use a rolling window approach for training and testing, reflecting realistic market conditions and regime shifts. This study compares the data-driven performance of the DRL-based strategy against common heuristics adopted by small retail LP actors that do not systematically modify their liquidity positions. By promoting more efficient liquidity management, this work aims to make DeFi markets more accessible and inclusive for a broader range of participants. Through a data-driven approach to liquidity management, this work seeks to contribute to the ongoing development of more efficient and user-friendly DeFi markets.
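
The MDP-plus-PPO setup can be sketched with a toy gymnasium environment and stable-baselines3. Everything below (the log-price state, the action as a liquidity-range half-width, and the fee and impermanent-loss proxies) is a deliberately crude stand-in for Uniswap v3 dynamics, not the paper's environment.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

class ToyLPEnv(gym.Env):
    """Toy LP environment: reward = fees earned while the price stays in
    range, minus a crude impermanent-loss proxy. All constants are illustrative."""
    def __init__(self):
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(0.01, 0.5, shape=(1,), dtype=np.float32)
        self.p = 0.0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.p = 0.0
        return np.array([self.p], dtype=np.float32), {}

    def step(self, action):
        width = float(action[0])                    # half-width of the liquidity range
        dp = self.np_random.normal(0.0, 0.02)       # toy log-price move
        self.p += dp
        fees = 0.003 / width if abs(dp) < width else 0.0  # concentration raises fees
        il = dp ** 2 / (2 * width)                  # impermanent-loss proxy
        return np.array([self.p], dtype=np.float32), fees - il, False, False, {}

model = PPO("MlpPolicy", ToyLPEnv(), verbose=0)
model.learn(total_timesteps=2_000)                  # short demo run
```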

[LG-78] Synthesis and Analysis of Data as Probability Measures with Entropy-Regularized Optimal Transport

Link: https://arxiv.org/abs/2501.07446
Authors: Brendan Mallery, James M. Murphy, Shuchin Aeron
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 58 pages. Code to reproduce experiments: this https URL

Click to view abstract

Abstract:We consider synthesis and analysis of probability measures using the entropy-regularized Wasserstein-2 cost and its unbiased version, the Sinkhorn divergence. The synthesis problem consists of computing the barycenter, with respect to these costs, of $m$ reference measures given a set of coefficients belonging to the $m$-dimensional simplex. The analysis problem consists of finding the coefficients for the closest barycenter in the Wasserstein-2 distance to a given measure $\mu$. Under the weakest assumptions on the measures thus far in the literature, we compute the derivative of the entropy-regularized Wasserstein-2 cost. We leverage this to establish a characterization of regularized barycenters as solutions to a fixed-point equation for the average of the entropic maps from the barycenter to the reference measures. This characterization yields a finite-dimensional, convex, quadratic program for solving the analysis problem when $\mu$ is a barycenter. It is shown that these coordinates, as well as the value of the barycenter functional, can be estimated from samples with dimension-independent rates of convergence, a hallmark of entropy-regularized optimal transport, and we verify these rates experimentally. We also establish that barycentric coordinates are stable with respect to perturbations in the Wasserstein-2 metric, suggesting a robustness of these coefficients to corruptions. We employ the barycentric coefficients as features for classification of corrupted point cloud data, and show that compared to neural network baselines, our approach is more efficient in small training data regimes.
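
The synthesis (barycenter) step has a direct analogue in the POT library. A small sketch, assuming two 1-D reference measures on a shared grid and uniform simplex weights; the regularization level is illustrative.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Two reference measures on a shared 1-D grid of n points.
n = 60
x = np.linspace(0, 1, n)
a1 = np.exp(-((x - 0.25) ** 2) / 0.005)
a2 = np.exp(-((x - 0.75) ** 2) / 0.005)
A = np.vstack([a1 / a1.sum(), a2 / a2.sum()]).T   # columns are the measures

# Squared Euclidean ground cost, matching the Wasserstein-2 setting.
M = ot.dist(x.reshape(-1, 1), x.reshape(-1, 1))

# Entropy-regularized barycenter with simplex weights (0.5, 0.5).
bary = ot.bregman.barycenter(A, M, reg=1e-2, weights=np.array([0.5, 0.5]))
print(bary.sum())  # ~1.0: the barycenter is itself a probability vector
```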

[LG-79] Pairwise Comparisons without Stochastic Transitivity: Model Theory and Applications

Link: https://arxiv.org/abs/2501.07437
Authors: Sze Ming Lee, Yunxiao Chen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 34 pages, 1 figure

Click to view abstract

Abstract:Most statistical models for pairwise comparisons, including the Bradley-Terry (BT) and Thurstone models and many extensions, make a relatively strong assumption of stochastic transitivity. This assumption imposes the existence of an unobserved global ranking among all the players/teams/items and monotone constraints on the comparison probabilities implied by the global ranking. However, the stochastic transitivity assumption does not hold in many real-world scenarios of pairwise comparisons, especially games involving multiple skills or strategies. As a result, models relying on this assumption can have suboptimal predictive performance. In this paper, we propose a general family of statistical models for pairwise comparison data without a stochastic transitivity assumption, substantially extending the BT and Thurstone models. In this model, the pairwise probabilities are determined by an (approximately) low-dimensional skew-symmetric matrix. Likelihood-based estimation methods and computational algorithms are developed, which allow for sparse data with only a small proportion of observed pairs. Theoretical analysis shows that the proposed estimator achieves minimax-rate optimality, which adapts effectively to the sparsity level of the data. The spectral theory for skew-symmetric matrices plays a crucial role in the implementation and theoretical analysis. The proposed method’s superiority against the BT model, along with its broad applicability across diverse scenarios, is further supported by simulations and real data analysis.
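
The core modeling idea, pairwise probabilities driven by a low-rank skew-symmetric matrix, fits in a few lines of NumPy. The sigmoid link and the rank below are assumptions for illustration; the paper develops likelihood-based estimation on top of this structure.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, r = 6, 2

# Low-rank skew-symmetric scores: S = U V^T - V U^T satisfies S = -S^T.
U = rng.normal(size=(n_items, r))
V = rng.normal(size=(n_items, r))
S = U @ V.T - V @ U.T

# Win probabilities: skew-symmetry gives P[i, j] + P[j, i] = 1, but no
# global ranking (stochastic transitivity) is imposed, so cycles can occur.
P = 1.0 / (1.0 + np.exp(-S))
i, j, k = 0, 1, 2
print(P[i, j], P[j, k], P[i, k])  # e.g., i beats j, j beats k, yet k may beat i
```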

[LG-80] Distance Measure Based on an Embedding of the Manifold of K-Component Gaussian Mixture Models into the Manifold of Symmetric Positive Definite Matrices

Link: https://arxiv.org/abs/2501.07429
Authors: Amit Vishwakarma, KS Subrahamanian Moosath
Subjects: Differential Geometry (math.DG); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In this paper, a distance between Gaussian Mixture Models (GMMs) is obtained based on an embedding of the K-component Gaussian Mixture Model into the manifold of symmetric positive definite matrices. We prove that K-component GMMs embed into the manifold of symmetric positive definite matrices and that the image forms a submanifold. We then prove that the manifold of GMMs, equipped with the pullback of the induced metric, is isometric to this submanifold with its induced metric. Through this embedding we obtain a general lower bound for the Fisher-Rao metric. This lower bound is a distance measure on the manifold of GMMs, and we employ it as a similarity measure between GMMs. The effectiveness of this framework is demonstrated through experiments on standard machine learning benchmarks, achieving accuracies of 98%, 92%, and 93.33% on the UIUC, KTH-TIPS, and UMD texture recognition datasets, respectively.
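
To make the construction concrete, here is a sketch of one standard embedding of a single Gaussian into the SPD manifold, paired with the affine-invariant SPD distance. Whether this matches the paper's exact embedding is an assumption, and the extension to K components is omitted.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def gaussian_to_spd(mu, Sigma):
    """Embed N(mu, Sigma) on R^d into SPD(d+1); a standard construction
    assumed here to illustrate the paper's embedding idea."""
    d = mu.shape[0]
    M = np.empty((d + 1, d + 1))
    M[:d, :d] = Sigma + np.outer(mu, mu)
    M[:d, d] = mu
    M[d, :d] = mu
    M[d, d] = 1.0
    return M

def spd_distance(A, B):
    """Affine-invariant Riemannian distance on the SPD manifold."""
    A_isqrt = fractional_matrix_power(A, -0.5)
    return np.linalg.norm(logm(A_isqrt @ B @ A_isqrt), "fro")

A = gaussian_to_spd(np.zeros(2), np.eye(2))
B = gaussian_to_spd(np.array([1.0, 0.0]), 2.0 * np.eye(2))
print(spd_distance(A, B))
```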

[LG-81] Simulating the Hubbard Model with Equivariant Normalizing Flows

Link: https://arxiv.org/abs/2501.07371
Authors: Dominic Schuh, Janik Kreit, Evan Berkowitz, Lena Funcke, Thomas Luu, Kim A. Nicoli, Marcel Rodekamp
Subjects: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
Comments: 14 pages, 5 figures, contribution to the 41st International Symposium on Lattice Field Theory (Lattice 2024), July 28th - August 3rd, 2024, Liverpool, UK

Click to view abstract

Abstract:Generative models, particularly normalizing flows, have shown exceptional performance in learning probability distributions across various domains of physics, including statistical mechanics, collider physics, and lattice field theory. In the context of lattice field theory, normalizing flows have been successfully applied to accurately learn the Boltzmann distribution, enabling a range of tasks such as direct estimation of thermodynamic observables and sampling independent and identically distributed (i.i.d.) configurations. In this work, we present a proof-of-concept demonstration that normalizing flows can be used to learn the Boltzmann distribution for the Hubbard model. This model is widely employed to study the electronic structure of graphene and other carbon nanomaterials. State-of-the-art numerical simulations of the Hubbard model, such as those based on Hybrid Monte Carlo (HMC) methods, often suffer from ergodicity issues, potentially leading to biased estimates of physical observables. Our numerical experiments demonstrate that leveraging i.i.d. sampling from the normalizing flow effectively addresses these issues.

[LG-82] Estimating quantum relative entropies on quantum computers

Link: https://arxiv.org/abs/2501.07292
Authors: Yuchen Lu, Kun Fang
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 24 pages, 10 figures; comments are welcome

Click to view abstract

Abstract:Quantum relative entropy, a quantum generalization of the well-known Kullback-Leibler divergence, serves as a fundamental measure of the distinguishability between quantum states and plays a pivotal role in quantum information science. Despite its importance, efficiently estimating quantum relative entropy between two quantum states on quantum computers remains a significant challenge. In this work, we propose the first quantum algorithm for estimating quantum relative entropy and Petz Rényi divergence from two unknown quantum states on quantum computers, addressing open problems highlighted in [Phys. Rev. A 109, 032431 (2024)] and [IEEE Trans. Inf. Theory 70, 5653-5680 (2024)]. This is achieved by combining quadrature approximations of relative entropies, the variational representation of quantum f-divergences, and a new technique for parameterizing Hermitian polynomial operators to estimate their traces with quantum states. Notably, the circuit size of our algorithm is at most 2n+1 with n being the number of qubits in the quantum states and it is directly applicable to distributed scenarios, where quantum states to be compared are hosted on cross-platform quantum computers. We validate our algorithm through numerical simulations, laying the groundwork for its future deployment on quantum hardware devices.

[LG-83] A User's Guide to KSig: GPU-Accelerated Computation of the Signature Kernel

Link: https://arxiv.org/abs/2501.07145
Authors: Csaba Tóth, Danilo Jr Dela Cruz, Harald Oberhauser
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The signature kernel is a positive definite kernel for sequential and temporal data that has become increasingly popular in machine learning applications due to powerful theoretical guarantees, strong empirical performance, and a variety of recently introduced scalable variations. In this chapter, we give a short introduction to KSig, a Scikit-Learn compatible Python package that implements various GPU-accelerated algorithms for computing signature kernels and performing downstream learning tasks. We also introduce a new algorithm based on tensor sketches which gives strong performance compared to existing algorithms. The package is available at this https URL.

[LG-84] Inferring Interpretable Models of Fragmentation Functions using Symbolic Regression

Link: https://arxiv.org/abs/2501.07123
Authors: Nour Makke, Sanjay Chawla
Subjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); Symbolic Computation (cs.SC); High Energy Physics - Theory (hep-th)
Comments:

Click to view abstract

Abstract:Machine learning is rapidly making its way into the natural sciences, including high-energy physics. We present the first study that infers, directly from experimental data, a functional form of fragmentation functions. The latter are a key ingredient in describing physical observables measured in high-energy physics processes that involve hadron production, and in predicting their values at different energies. Fragmentation functions cannot be calculated from theory and must instead be determined from data. Traditional approaches rely on global fits of experimental data using a pre-assumed functional form, inspired by phenomenological models, whose parameters are then learned. This novel approach uses an ML technique, namely symbolic regression, to learn an analytical model from measured charged hadron multiplicities. The function learned by symbolic regression resembles the Lund string function and describes the data well, thus representing a potential candidate for use in global FF fits. This study illustrates an approach that can be followed in such QCD-related phenomenology studies and, more generally, in the sciences.
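
Symbolic regression of this kind can be prototyped with gplearn. The sketch below fits a toy, simplified Lund-like target; the primitive set, hyperparameters, and synthetic data are all assumptions (the paper's data and SR tooling may differ, and an exponential primitive would have to be added as a custom function).

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)

# Toy stand-in for multiplicity data: a simplified Lund-like shape
# f(z) = (1/z) * (1 - z)^2, chosen so the primitive set below can express it.
z = rng.uniform(0.1, 0.9, size=(400, 1))
y = (1.0 / z[:, 0]) * (1.0 - z[:, 0]) ** 2

est = SymbolicRegressor(
    population_size=1000,
    generations=20,
    function_set=("add", "sub", "mul", "div"),
    parsimony_coefficient=1e-3,   # penalize bloated expressions
    random_state=0,
)
est.fit(z, y)
print(est._program)  # best analytic expression found by the evolution
```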

[LG-85] Differentially Private Kernelized Contextual Bandits

Link: https://arxiv.org/abs/2501.07046
Authors: Nikola Pavlovic, Sudeep Salgia, Qing Zhao
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We consider the problem of contextual kernel bandits with stochastic contexts, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space (RKHS). We study this problem under the additional constraint of joint differential privacy, where the agent needs to ensure that the sequence of query points is differentially private with respect to both the sequence of contexts and rewards. We propose a novel algorithm that improves upon the state of the art and achieves an error rate of $\mathcal{O}\left(\sqrt{\frac{\gamma_T}{T}} + \frac{\gamma_T}{T\varepsilon}\right)$ after $T$ queries for a large class of kernel families, where $\gamma_T$ represents the effective dimensionality of the kernel and $\varepsilon > 0$ is the privacy parameter. Our results are based on a novel estimator for the reward function that simultaneously enjoys high utility along with a low sensitivity to observed rewards and contexts, which is crucial to obtain an order-optimal learning performance with improved dependence on the privacy parameter.
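
A crude way to picture a private kernel estimator: fit a kernel ridge model, then perturb its dual coefficients with Gaussian noise scaled by the privacy budget. The noise calibration below is purely illustrative and does not reproduce the paper's sensitivity analysis or regret guarantees.

```python
import numpy as np

def rbf(X, Y, ls=0.3):
    """Gaussian (RBF) kernel matrix between row sets X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))                  # observed context-action points
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=50)

lam, eps = 1.0, 1.0                            # ridge parameter, privacy budget
alpha = np.linalg.solve(rbf(X, X) + lam * np.eye(50), y)

# Gaussian noise on the dual coefficients as a stand-in for the private
# estimator; the 0.05/eps scale is an assumption, not a derived sensitivity.
alpha_priv = alpha + rng.normal(0, 0.05 / eps, size=alpha.shape)

X_new = rng.uniform(size=(5, 2))
print(rbf(X_new, X) @ alpha_priv)              # private reward estimates
```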

[LG-86] Automatic Double Reinforcement Learning in Semiparametric Markov Decision Processes with Applications to Long-Term Causal Inference

Link: https://arxiv.org/abs/2501.06926
Authors: Lars van der Laan, David Hubbard, Allen Tran, Nathan Kallus, Aurélien Bibaut
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Double reinforcement learning (DRL) enables statistically efficient inference on the value of a policy in a nonparametric Markov Decision Process (MDP) given trajectories generated by another policy. However, this approach necessarily requires stringent overlap between the state distributions, which is often violated in practice. To relax this requirement and extend DRL, we study efficient inference on linear functionals of the $Q$-function (of which policy value is a special case) in infinite-horizon, time-invariant MDPs under semiparametric restrictions on the $Q$-function. These restrictions can reduce the overlap requirement and lower the efficiency bound, yielding more precise estimates. As an important example, we study the evaluation of long-term value under domain adaptation, given a few short trajectories from the new domain and restrictions on the difference between the domains. This can be used for long-term causal inference. Our method combines flexible estimates of the $Q$-function and the Riesz representer of the functional of interest (e.g., the stationary state density ratio for policy value) and is automatic in that we do not need to know the form of the latter - only the functional we care about. To address potential model misspecification bias, we extend the adaptive debiased machine learning (ADML) framework of \citet{van2023adaptive} to construct nonparametrically valid and superefficient estimators that adapt to the functional form of the $Q$-function. As a special case, we propose a novel adaptive debiased plug-in estimator that uses isotonic-calibrated fitted $Q$-iteration - a new calibration algorithm for MDPs - to circumvent the computational challenges of estimating debiasing nuisances from min-max objectives.
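
The isotonic-calibration idea in the final estimator can be illustrated with scikit-learn: monotonically remap raw Q-estimates onto their bootstrapped targets. The data below are synthetic and the single-sweep setup is a simplification of fitted Q-iteration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Stand-in for one fitted-Q-iteration sweep: raw Q-estimates versus
# bootstrapped targets r + gamma * max_a' Q(s', a') (both synthetic here).
q_raw = rng.uniform(0, 1, 500)
targets = np.clip(q_raw + 0.3 * np.sin(6 * q_raw) + rng.normal(0, 0.1, 500), 0, None)

# Isotonic calibration: a monotone remapping of the Q-estimates.
iso = IsotonicRegression(out_of_bounds="clip")
q_calibrated = iso.fit_transform(q_raw, targets)
print(q_calibrated[:5])
```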

[LG-87] Variable Selection Methods for Multivariate Functional and Complex Biomedical Data in the AI Age

Link: https://arxiv.org/abs/2501.06868
Authors: Marcos Matabuena
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Many problems within personalized medicine and digital health rely on the analysis of continuous-time functional biomarkers and other complex data structures emerging from high-resolution patient monitoring. In this context, this work proposes new optimization-based variable selection methods for multivariate, functional, and even more general outcomes in metric spaces based on best-subset selection. Our framework applies to several types of regression models, including linear, quantile, or nonparametric additive models, and to a broad range of random responses, such as univariate or multivariate Euclidean data, functional data, and even random graphs. Our analysis demonstrates that our proposed methodology outperforms state-of-the-art methods in accuracy and, especially, in speed, achieving several orders of magnitude improvement over competitors across various types of statistical responses, such as mathematical functions. While our framework is general and not designed for a specific regression or scientific problem, the article is self-contained and focuses on biomedical applications. In clinical settings, it serves as a valuable resource for professionals in biostatistics, statistics, and artificial intelligence interested in variable selection problems in this new technological AI era.

[LG-88] Hierarchy-Boosted Funnel Learning for Identifying Semiconductors with Ultralow Lattice Thermal Conductivity

Link: https://arxiv.org/abs/2501.06775
Authors: Mengfan Wu, Shenshen Yan, Jie Ren
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures

Click to view abstract

Abstract:Data-driven machine learning (ML) has demonstrated tremendous potential in material property predictions. However, the scarcity of materials data with costly property labels in the vast chemical space presents a significant challenge for ML in efficiently predicting properties and uncovering structure-property relationships. Here, we propose a novel hierarchy-boosted funnel learning (HiBoFL) framework, which is successfully applied to identify semiconductors with ultralow lattice thermal conductivity ($\kappa_\mathrm{L}$). By training on only a few hundred materials targeted by unsupervised learning from a pool of hundreds of thousands, we achieve efficient and interpretable supervised predictions of ultralow $\kappa_\mathrm{L}$, thereby circumventing large-scale brute-force calculations without clear objectives. As a result, we provide a list of candidates with ultralow $\kappa_\mathrm{L}$ for potential thermoelectric applications and discover a new factor that significantly influences structural anharmonicity. This study offers a novel practical pathway for accelerating the discovery of functional materials.
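
The two-stage funnel (unsupervised narrowing of a huge pool, then supervised learning on a small labeled subset) can be sketched with scikit-learn; the clustering choice, the cluster of interest, and the toy labels below are assumptions rather than the paper's materials workflow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))          # descriptors for a large candidate pool

# Stage 1: unsupervised clustering narrows the pool to one cluster of interest
# (here arbitrarily cluster 0, standing in for "low-kappa-like" candidates).
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
focus = X[km.labels_ == 0]

# Stage 2: label only the focused candidates (toy labels here; the paper
# would compute them) and train a supervised model on this small subset.
y = (focus[:, 0] + focus[:, 1] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(focus, y)
print(len(focus), clf.score(focus, y))   # labeled-subset size, training accuracy
```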

[LG-89] Improving the adaptive and continuous learning capabilities of artificial neural networks: Lessons from multi-neuromodulatory dynamics

Link: https://arxiv.org/abs/2501.06762
Authors: Jie Mei, Alejandro Rodriguez-Garcia, Daigo Takeuchi, Gabriel Wainstein, Nina Hubig, Yalda Mohsenzadeh, Srikanth Ramaswamy
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:Continuous, adaptive learning, that is, the ability to adapt to the environment and improve performance, is a hallmark of both natural and artificial intelligence. Biological organisms excel in acquiring, transferring, and retaining knowledge while adapting to dynamic environments, making them a rich source of inspiration for artificial neural networks (ANNs). This study explores how neuromodulation, a fundamental feature of biological learning systems, can help address challenges such as catastrophic forgetting and enhance the robustness of ANNs in continuous learning scenarios. Driven by neuromodulators including dopamine (DA), acetylcholine (ACh), serotonin (5-HT) and noradrenaline (NA), neuromodulatory processes in the brain operate at multiple scales, facilitating dynamic responses to environmental changes through mechanisms ranging from local synaptic plasticity to global network-wide adaptability. Importantly, the relationship between neuromodulators and their interplay in the modulation of sensory and cognitive processes are more complex than expected, demonstrating a “many-to-one” neuromodulator-to-task mapping. To inspire the design of novel neuromodulation-aware learning rules, we highlight (i) how multi-neuromodulatory interactions enrich single-neuromodulator-driven learning, (ii) the impact of neuromodulators at multiple spatial and temporal scales, and correspondingly, (iii) strategies to integrate neuromodulated learning into or approximate it in ANNs. To illustrate these principles, we present a case study to demonstrate how neuromodulation-inspired mechanisms, such as DA-driven reward processing and NA-based cognitive flexibility, can enhance ANN performance in a Go/No-Go task. By integrating multi-scale neuromodulation, we aim to bridge the gap between biological learning and artificial systems, paving the way for ANNs with greater flexibility, robustness, and adaptability.

[LG-90] Sequential Portfolio Selection under Latent Side Information-Dependence Structure: Optimality and Universal Learning Algorithms

Link: https://arxiv.org/abs/2501.06701
Authors: Duy Khanh Lam
Subjects: Mathematical Finance (q-fin.MF); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Portfolio Management (q-fin.PM)
Comments: 34 pages, working paper, first draft (errors may exist)

Click to view abstract

Abstract:This paper investigates the investment problem of constructing an optimal no-short sequential portfolio strategy in a market with a latent dependence structure between asset prices and partly unobservable side information, which is often high-dimensional. The results demonstrate that a dynamic strategy, which forms a portfolio based on perfect knowledge of the dependence structure and full market information over time, may not grow at a higher rate infinitely often than a constant strategy, which remains invariant over time. Specifically, if the market is stationary, implying that the dependence structure is statistically stable, the growth rate of an optimal dynamic strategy, utilizing the maximum capacity of the entire market information, almost surely decays over time into an equilibrium state, asymptotically converging to the growth rate of a constant strategy. Technically, this work reassesses the common belief that a constant strategy only attains the optimal limiting growth rate of dynamic strategies when the market process is identically and independently distributed. By analyzing the dynamic log-optimal portfolio strategy as the optimal benchmark in a stationary market with side information, we show that a random optimal constant strategy almost surely exists, even when a limiting growth rate for the dynamic strategy does not. Consequently, two approaches to learning algorithms for portfolio construction are discussed, demonstrating the safety of removing side information from the learning process while still guaranteeing an asymptotic growth rate comparable to that of the optimal dynamic strategy.

[LG-91] Dynamic Causal Structure Discovery and Causal Effect Estimation

Link: https://arxiv.org/abs/2501.06534
Authors: Jianian Wang, Rui Song
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:To represent the causal relationships between variables, a directed acyclic graph (DAG) is widely utilized in many areas, such as social sciences, epidemics, and genetics. Many causal structure learning approaches are developed to learn the hidden causal structure utilizing deep-learning approaches. However, these approaches have a hidden assumption that the causal relationship remains unchanged over time, which may not hold in real life. In this paper, we develop a new framework to model the dynamic causal graph where the causal relations are allowed to be time-varying. We incorporate the basis approximation method into the score-based causal discovery approach to capture the dynamic pattern of the causal graphs. Utilizing the autoregressive model structure, we could capture both contemporaneous and time-lagged causal relationships while allowing them to vary with time. We propose an algorithm that could provide both past-time estimates and future-time predictions on the causal graphs, and conduct simulations to demonstrate the usefulness of the proposed method. We also apply the proposed method for the covid-data analysis, and provide causal estimates on how policy restriction’s effect changes.

[LG-92] Reinforcement Learning for Enhancing Sensing Estimation in Bistatic ISAC Systems with UAV Swarms

Link: https://arxiv.org/abs/2501.06454
Authors: Obed Morrison Atsu, Salmane Naoumi, Roberto Bomfin, Marwa Chafii
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper introduces a novel Multi-Agent Reinforcement Learning (MARL) framework to enhance integrated sensing and communication (ISAC) networks using unmanned aerial vehicle (UAV) swarms as sensing radars. By framing the positioning and trajectory optimization of UAVs as a Partially Observable Markov Decision Process, we develop a MARL approach that leverages centralized training with decentralized execution to maximize the overall sensing performance. Specifically, we implement a decentralized cooperative MARL strategy to enable UAVs to develop effective communication protocols, therefore enhancing their environmental awareness and operational efficiency. Additionally, we augment the MARL solution with a transmission power adaptation technique to mitigate interference between the communicating drones and optimize the learned communication protocol efficiency. Despite the increased complexity, our solution demonstrates robust performance and adaptability across various scenarios, providing a scalable and cost-effective enhancement for future ISAC networks.

[LG-93] IPP-Net: A Generalizable Deep Neural Network Model for Indoor Pathloss Radio Map Prediction ICASSP2025

Link: https://arxiv.org/abs/2501.06414
Authors: Bin Feng, Meng Zheng, Wei Liang, Lei Zhang
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 2 pages, 1 figure, Accepted to ICASSP 2025

Click to view abstract

Abstract:In this paper, we propose a generalizable deep neural network model for indoor pathloss radio map prediction (termed as IPP-Net). IPP-Net is based on a UNet architecture and learned from both large-scale ray tracing simulation data and a modified 3GPP indoor hotspot model. The performance of IPP-Net is evaluated in the First Indoor Pathloss Radio Map Prediction Challenge in ICASSP 2025. The evaluation results show that IPP-Net achieves a weighted root mean square error of 9.501 dB on three competition tasks and obtains the second overall ranking.

[LG-94] Computational and Statistical Asymptotic Analysis of the JKO Scheme for Iterative Algorithms to update distributions

Link: https://arxiv.org/abs/2501.06408
Authors: Shang Wu, Yazhen Wang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The seminal paper of Jordan, Kinderlehrer, and Otto introduced what is now widely known as the JKO scheme, an iterative algorithmic framework for computing distributions. This scheme can be interpreted as a Wasserstein gradient flow and has been successfully applied in machine learning contexts, such as deriving policy solutions in reinforcement learning. In this paper, we extend the JKO scheme to accommodate models with unknown parameters. Specifically, we develop statistical methods to estimate these parameters and adapt the JKO scheme to incorporate the estimated values. To analyze the adopted statistical JKO scheme, we establish an asymptotic theory via stochastic partial differential equations that describes its limiting dynamic behavior. Our framework allows both the sample size used in parameter estimation and the number of algorithmic iterations to go to infinity. This study offers a unified framework for joint computational and statistical asymptotic analysis of the statistical JKO scheme. On the computational side, we examine the scheme’s dynamic behavior as the number of iterations increases, while on the statistical side, we investigate the large-sample behavior of the resulting distributions computed through the scheme. We conduct numerical simulations to evaluate the finite-sample performance of the proposed methods and validate the developed asymptotic theory.

[LG-95] A Hybrid Framework for Reinsurance Optimization: Integrating Generative Models and Reinforcement Learning

Link: https://arxiv.org/abs/2501.06404
Authors: Stella C. Dong, James R. Finlay
Subjects: Econometrics (econ.EM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Reinsurance optimization is critical for insurers to manage risk exposure, ensure financial stability, and maintain solvency. Traditional approaches often struggle with dynamic claim distributions, high-dimensional constraints, and evolving market conditions. This paper introduces a novel hybrid framework that integrates Generative Models, specifically Variational Autoencoders (VAEs), with Reinforcement Learning (RL) using Proximal Policy Optimization (PPO). The framework enables dynamic and scalable optimization of reinsurance strategies by combining the generative modeling of complex claim distributions with the adaptive decision-making capabilities of reinforcement learning. The VAE component generates synthetic claims, including rare and catastrophic events, addressing data scarcity and variability, while the PPO algorithm dynamically adjusts reinsurance parameters to maximize surplus and minimize ruin probability. The framework’s performance is validated through extensive experiments, including out-of-sample testing, stress-testing scenarios (e.g., pandemic impacts, catastrophic events), and scalability analysis across portfolio sizes. Results demonstrate its superior adaptability, scalability, and robustness compared to traditional optimization techniques, achieving higher final surpluses and computational efficiency. Key contributions include the development of a hybrid approach for high-dimensional optimization, dynamic reinsurance parameterization, and validation against stochastic claim distributions. The proposed framework offers a transformative solution for modern reinsurance challenges, with potential applications in multi-line insurance operations, catastrophe modeling, and risk-sharing strategy design.
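
The generative half of the hybrid can be sketched as a small VAE over claim vectors in PyTorch; decoding latent draws then yields synthetic claims for the RL stage. The architecture and dimensions below are illustrative assumptions, not the paper's model.

```python
import torch
from torch import nn

class ClaimsVAE(nn.Module):
    """Tiny VAE over (scaled) claim vectors: encoder, reparameterized latent, decoder."""
    def __init__(self, dim: int = 8, latent: int = 2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 32), nn.ReLU())
        self.mu = nn.Linear(32, latent)
        self.logvar = nn.Linear(32, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

def elbo_loss(x, recon, mu, logvar):
    rec = ((recon - x) ** 2).sum(dim=1).mean()               # Gaussian reconstruction
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return rec + kl

x = torch.randn(64, 8)            # stand-in for a batch of scaled claim features
model = ClaimsVAE()
recon, mu, logvar = model(x)
elbo_loss(x, recon, mu, logvar).backward()
# After training, decoding z ~ N(0, I) produces synthetic claim scenarios.
```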

[LG-96] Counterfactually Fair Reinforcement Learning via Sequential Data Preprocessing

Link: https://arxiv.org/abs/2501.06366
Authors: Jitao Wang, Chengchun Shi, John D. Piette, Joshua R. Loftus, Donglin Zeng, Zhenke Wu
Subjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:When applied in healthcare, reinforcement learning (RL) seeks to dynamically match the right interventions to subjects to maximize population benefit. However, the learned policy may disproportionately allocate efficacious actions to one subpopulation, creating or exacerbating disparities in other socioeconomically-disadvantaged subgroups. These biases tend to occur in multi-stage decision making and can be self-perpetuating, which if unaccounted for could cause serious unintended consequences that limit access to care or treatment benefit. Counterfactual fairness (CF) offers a promising statistical tool grounded in causal inference to formulate and study fairness. In this paper, we propose a general framework for fair sequential decision making. We theoretically characterize the optimal CF policy and prove its stationarity, which greatly simplifies the search for optimal CF policies by leveraging existing RL algorithms. The theory also motivates a sequential data preprocessing algorithm to achieve CF decision making under an additive noise assumption. We prove and then validate our policy learning approach in controlling unfairness and attaining optimal value through simulations. Analysis of a digital health dataset designed to reduce opioid misuse shows that our proposal greatly enhances fair access to counseling.

Information Retrieval

[IR-0] Future-Conditioned Recommendations with Multi-Objective Controllable Decision Transformer

Link: https://arxiv.org/abs/2501.07212
Authors: Chongming Gao, Kexin Huang, Ziang Fei, Jiaju Chen, Jiawei Chen, Jianshan Sun, Shuchang Liu, Qingpeng Cai, Peng Jiang
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Securing long-term success is the ultimate aim of recommender systems, demanding strategies capable of foreseeing and shaping the impact of decisions on future user satisfaction. Current recommendation strategies grapple with two significant hurdles. Firstly, the future impacts of recommendation decisions remain obscured, rendering it impractical to evaluate them through direct optimization of immediate metrics. Secondly, conflicts often emerge between multiple objectives, like enhancing accuracy versus exploring diverse recommendations. Existing strategies, trapped in a “training, evaluation, and retraining” loop, grow more labor-intensive as objectives evolve. To address these challenges, we introduce a future-conditioned strategy for multi-objective controllable recommendations, allowing for the direct specification of future objectives and empowering the model to generate item sequences that align with these goals autoregressively. We present the Multi-Objective Controllable Decision Transformer (MocDT), an offline Reinforcement Learning (RL) model capable of autonomously learning the mapping from multiple objectives to item sequences, leveraging extensive offline data. Consequently, it can produce recommendations tailored to any specified objectives during the inference stage. Our empirical findings emphasize the controllable recommendation strategy’s ability to produce item sequences according to different objectives while maintaining performance that is competitive with current recommendation strategies across various objectives.

[IR-1] Intent-Interest Disentanglement and Item-Aware Intent Contrastive Learning for Sequential Recommendation

Link: https://arxiv.org/abs/2501.07096
Authors: Yijin Choi, Chiehyeon Lim
Subjects: Information Retrieval (cs.IR)
Comments: 14 pages, 6 figures, 4 tables

Click to view abstract

Abstract:Recommender systems aim to provide personalized item recommendations by capturing user behaviors derived from their interaction history. Considering that user interactions naturally occur sequentially based on users’ intents in mind, user behaviors can be interpreted as user intents. Therefore, intent-based sequential recommendations have recently been actively studied to model user intents from historical interactions for a more precise user understanding, beyond traditional studies that often overlook the underlying semantics behind user interactions. However, existing studies face three challenges: 1) the limited understanding of user behaviors by focusing solely on intents, 2) the lack of robustness in categorizing intents due to arbitrary fixed numbers of intent categories, and 3) the neglect of interacted items in the modeling of user intents. To address these challenges, we propose Intent-Interest Disentanglement and Item-Aware Intent Contrastive Learning for Sequential Recommendation (IDCLRec). IDCLRec disentangles user behaviors into intents, which are dynamic motivations, and interests, which are stable tastes of users, for a comprehensive understanding of user behaviors. A causal cross-attention mechanism is used to identify consistent interests across interactions, while residual behaviors are modeled as intents by modeling their temporal dynamics through a similarity adjustment loss. In addition, without predefining the number of intent categories, an importance-weighted attention mechanism captures user-specific categorical intent considering the importance of intent for each interaction. Furthermore, we introduce item-aware contrastive learning, which aligns intents that occur within the same interaction and aligns intents with the item combinations produced by the corresponding intent. Extensive experiments conducted on real-world datasets demonstrate the effectiveness of IDCLRec.
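
The contrastive component can be illustrated with a generic InfoNCE loss over two views of intent embeddings; this is a stand-in for the paper's item-aware alignment terms, with the batch size, embedding dimension, and temperature chosen arbitrarily.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature: float = 0.1):
    """InfoNCE over a batch: row i of `positive` is the positive for row i of
    `anchor`; all other rows serve as in-batch negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature                    # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

# Toy intent embeddings from two views of the same interactions.
intents_view1 = torch.randn(32, 64)
intents_view2 = torch.randn(32, 64)
print(info_nce(intents_view1, intents_view2))
```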

[IR-2] Unveiling Temporal Trends in 19th Century Literature: An Information Retrieval Approach

Link: https://arxiv.org/abs/2501.06833
Authors: Suchana Datta, Dwaipayan Roy, Derek Greene, Gerardine Meaney
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: Accepted at JCDL 2024

Click to view abstract

Abstract:In English literature, the 19th century witnessed a significant transition in styles, themes, and genres. Consequently, the novels from this period display remarkable diversity. This paper explores these variations by examining the evolution of term usage in 19th century English novels through the lens of information retrieval. By applying a query expansion-based approach to a decade-segmented collection of fiction from the British Library, we examine how related terms vary over time. Our analysis employs multiple standard metrics including Kendall’s tau, Jaccard similarity, and Jensen-Shannon divergence to assess overlaps and shifts in expanded query term sets. Our results indicate a significant degree of divergence in the related terms across decades as selected by the query expansion technique, suggesting substantial linguistic and conceptual changes throughout the 19th century novels.
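
All three overlap metrics are available off the shelf. A small sketch with hypothetical expanded-term lists for two decades:

```python
import numpy as np
from scipy.stats import kendalltau
from scipy.spatial.distance import jensenshannon

# Hypothetical expanded-term lists for two decades (illustrative only).
terms_1850s = ["railway", "carriage", "telegraph", "estate", "parlour"]
terms_1890s = ["railway", "bicycle", "telephone", "estate", "suburb"]

# Jaccard similarity of the two term sets.
s1, s2 = set(terms_1850s), set(terms_1890s)
jaccard = len(s1 & s2) / len(s1 | s2)

# Kendall's tau over the ranks of the shared vocabulary in each ranked list.
shared = sorted(s1 & s2)
tau, _ = kendalltau([terms_1850s.index(t) for t in shared],
                    [terms_1890s.index(t) for t in shared])

# Jensen-Shannon divergence between term-weight distributions on a joint vocab.
vocab = sorted(s1 | s2)
p = np.array([1.0 if t in s1 else 1e-9 for t in vocab]); p /= p.sum()
q = np.array([1.0 if t in s2 else 1e-9 for t in vocab]); q /= q.sum()
jsd = jensenshannon(p, q) ** 2   # scipy returns the distance (its square root)

print(jaccard, tau, jsd)
```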

[IR-3] Repeat-bias-aware Optimization of Beyond-accuracy Metrics for Next Basket Recommendation ECIR2025

Link: https://arxiv.org/abs/2501.06362
Authors: Yuanna Liu, Ming Li, Mohammad Aliannejadi, Maarten de Rijke
Subjects: Information Retrieval (cs.IR)
Comments: This paper has been accepted as a full paper at the 47th European Conference on Information Retrieval (ECIR2025)

Click to view abstract

Abstract:In next basket recommendation (NBR) a set of items is recommended to users based on their historical basket sequences. In many domains, the recommended baskets consist of both repeat items and explore items. Some state-of-the-art NBR methods are heavily biased to recommend repeat items so as to maximize utility. The evaluation and optimization of beyond-accuracy objectives for NBR, such as item fairness and diversity, has attracted increasing attention. How can such beyond-accuracy objectives be pursued in the presence of heavy repeat bias? We find that only optimizing diversity or item fairness without considering repeat bias may cause NBR algorithms to recommend more repeat items. To solve this problem, we propose a model-agnostic repeat-bias-aware optimization algorithm to post-process the recommended results obtained from NBR methods with the objective of mitigating repeat bias when optimizing diversity or item fairness. We consider multiple variations of our optimization algorithm to cater to multiple NBR methods. Experiments on three real-world grocery shopping datasets show that the proposed algorithms can effectively improve diversity and item fairness, and mitigate repeat bias at acceptable Recall loss.
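
A repeat-bias-aware post-processing step can be sketched as a greedy re-ranker that charges a growing penalty once repeat items exceed a soft budget; the penalty form and budget below are illustrative assumptions, not the paper's optimization objective.

```python
import numpy as np

def rerank(scores, is_repeat, k: int = 10, penalty: float = 0.1):
    """Greedily build a k-item basket by relevance, discounting repeat items
    once more than `budget` of them have already been selected."""
    chosen, n_repeat = [], 0
    candidates = set(range(len(scores)))
    budget = k // 2                       # soft cap on repeat items
    while len(chosen) < k:
        best, best_val = None, -np.inf
        for i in candidates:
            val = scores[i]
            if is_repeat[i] and n_repeat >= budget:
                val -= penalty * (n_repeat - budget + 1)
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
        n_repeat += int(is_repeat[best])
        candidates.remove(best)
    return chosen

rng = np.random.default_rng(0)
scores = rng.uniform(size=30)
is_repeat = rng.uniform(size=30) < 0.6    # repeat-heavy candidate pool
print(rerank(scores, is_repeat))
```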

Attachment Download

Click to download the full list of today's papers