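Arxiv Daily Papers | 2024-12-06

This post lists the latest papers retrieved from Arxiv.org on 2024-12-06. It updates automatically and groups papers into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data comes from Arxiv.org and is refreshed automatically at around 12:00 every morning.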

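[Quick Read]: This paper addresses the redundancy and high computational cost of visual tokens in vision-language models. The key to its solution is VisionZip, a method that selects a set of informative visual tokens as input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. VisionZip applies to image and video understanding tasks and is particularly well suited to practical scenarios such as multi-turn dialogue. Experiments show that VisionZip outperforms the previous state-of-the-art method by at least 5% in nearly all settings and significantly speeds up inference, allowing the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results.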

Link: https://arxiv.org/abs/2412.04467
Authors: Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
Keywords: raising computational costs, Recent advancements, significantly raising computational, computational costs, advancements in vision-language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 2 columns, 28 pages, 15 figures, 18 tables

Abstract:Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at this https URL .
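
The token-selection idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the summary does not say how VisionZip scores tokens, so the `scores` argument (for example, attention received by each token) and the function name `select_informative_tokens` are hypothetical.

```python
from typing import List, Sequence


def select_informative_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> List[Sequence[float]]:
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    `scores` is a hypothetical per-token importance value; the actual
    criterion used by the paper is not given in this digest.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    keep = max(0, min(keep, len(tokens)))
    # Rank token indices by score, highest first, and keep the top `keep`.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    # Restore positional order so the language model sees tokens in sequence.
    return [tokens[i] for i in sorted(top)]


if __name__ == "__main__":
    toy_tokens = [[0.0], [1.0], [2.0], [3.0]]
    toy_scores = [0.1, 0.9, 0.2, 0.8]
    print(select_informative_tokens(toy_tokens, toy_scores, keep=2))
```

Fewer tokens passed to the language model means a shorter prefill sequence, which is where the reported speedup would come from.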

[NLP-1] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

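[Quick Read]: This paper addresses the complexity and variability of automating graphical user interface (GUI) tasks, in particular the limits on generalization, efficiency, and scalability that come from existing methods' reliance on textual representations. The key to its solution is Aguvis, a unified, pure-vision framework for autonomous GUI agents that works across platforms. Aguvis uses image-based observations, grounds natural-language instructions to visual elements, and relies on a consistent action space to ensure cross-platform generalization. It also integrates explicit planning and reasoning, strengthening the agent's ability to navigate and interact autonomously in complex digital environments. By building a large-scale dataset of GUI agent trajectories and using a two-stage training pipeline (general GUI grounding first, then planning and reasoning), Aguvis surpasses existing state-of-the-art methods in both offline and online scenarios and achieves the first fully autonomous pure-vision GUI agent that does not depend on external closed-source models.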

Link: https://arxiv.org/abs/2412.04454
Authors: Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
Keywords: Graphical User Interfaces, Graphical User, User Interfaces, remains challenging due, tasks remains challenging
Categories: Computation and Language (cs.CL)
Comments: this https URL

Abstract:Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. Our approach leverages image-based observations, and grounding instructions in natural language to visual elements, and employs a consistent action space to ensure cross-platform generalization. To address the limitations of previous work, we integrate explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. We construct a large-scale dataset of GUI agent trajectories, incorporating multimodal reasoning and grounding, and employ a two-stage training pipeline that first focuses on general GUI grounding, followed by planning and reasoning. Through comprehensive experiments, we demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving, to our knowledge, the first fully autonomous pure vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. We open-sourced all datasets, models, and training recipes to facilitate future research at this https URL.

[NLP-2] p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

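[Quick Read]: This paper addresses the high computational cost of training and running inference with multimodal large language models (MLLMs). The key to its solution is the Mixture-of-Depths (MoD) mechanism: each transformer decoder layer processes only the essential vision tokens and skips redundant ones, which improves efficiency. To integrate MoD into MLLMs effectively, the paper proposes two designs, tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing), along with a progressive ratio decay (PRD) strategy that lowers the token retention ratio layer by layer using a shifted cosine schedule, further unlocking MoD's potential and significantly improving model efficiency and performance. Experiments show that p-MoD performs strongly on multiple benchmarks while substantially reducing compute cost in both inference and training.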

Link: https://arxiv.org/abs/2412.04449
Authors: Jun Zhang, Desen Meng, Ji Qi, Zhenpeng Huang, Tao Wu, Limin Wang
Keywords: multimodal large language, inference costs impede, large language models, diverse tasks, impede their advancement
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Technical Report; Code released at this https URL

Abstract:Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. The majority of computation stems from the overwhelming volume of vision tokens processed by the transformer decoder. In this paper, we propose to build efficient MLLMs by leveraging the Mixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layer and thus design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer, employing a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. To validate the effectiveness of our approach, we conduct extensive experiments with two baseline models across 14 benchmarks. Our model, p-MoD, matches or even surpasses the performance of the baseline models, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and 77.7% GPU hours during training.
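
A per-layer retention schedule of the kind PRD describes might look like the sketch below. The abstract only says the ratio decays layer by layer with a shifted cosine schedule; the exact formula, the `start` and `end` ratios, and the function name `retention_ratios` are assumptions made for illustration.

```python
import math
from typing import List


def retention_ratios(num_layers: int, start: float = 1.0, end: float = 0.1) -> List[float]:
    """Cosine decay of the token retention ratio from `start` to `end`.

    Assumed form: the cosine curve is scaled and shifted so that the first
    layer keeps `start` of the vision tokens and the last keeps `end`.
    The paper's actual schedule may differ.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [start]
    ratios = []
    for layer in range(num_layers):
        progress = layer / (num_layers - 1)                   # 0.0 at first layer, 1.0 at last
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # falls from 1.0 to 0.0
        ratios.append(end + (start - end) * cosine)
    return ratios


if __name__ == "__main__":
    for layer, ratio in enumerate(retention_ratios(8)):
        print(f"layer {layer}: keep {ratio:.2f} of vision tokens")
```

Deeper layers keep fewer tokens, matching the observation that vision tokens are more redundant in deeper layers.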

[NLP-3] Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

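[Quick Read]: This paper asks how the success of large pre-trained language models can be carried over to robot learning, specifically robot manipulation tasks. The key to its solution is Moto, which converts video content into latent Motion Token sequences with a Latent Motion Tokenizer trained in an unsupervised manner, capturing diverse visual motion knowledge from videos. Through autoregressive pre-training on motion tokens, Moto-GPT can produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer the learned motion priors to real robot manipulation, the paper proposes a co-fine-tuning strategy that bridges latent motion token prediction and real robot control. Experiments show that the fine-tuned Moto-GPT is more robust and efficient on robot manipulation benchmarks, validating its transfer of knowledge from video data to downstream visual manipulation tasks.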

Link: https://arxiv.org/abs/2412.04445
Authors: Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu
Keywords: Large Language Models, Language Models pre-trained, Recent developments, developments in Large, Models pre-trained
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project released at: this https URL

Abstract:Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich “corpus”, can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging “language” of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.

[NLP-4] Establishing Task Scaling Laws via Compute-Efficient Model Ladders

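[Quick Read]: This paper addresses predicting the performance of pretrained language models (LMs) on individual tasks in the overtrained setting. The key to its solution is a two-step prediction approach: first predict a task-specific loss from model and data size, then predict task performance from that task loss. The authors train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two steps, and make predictions for two target models (a 7B model trained to 4T tokens and a 13B model trained to 5T tokens). Experiments show that on four multiple-choice tasks the method predicts the target models' accuracy within 2 points of absolute error, while prediction error is larger on tasks whose metrics have higher variance. The paper also shows that its design choices and the two-step approach perform better for establishing scaling laws.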

Link: https://arxiv.org/abs/2412.04403
Authors: Akshita Bhagia, Jiacheng Liu, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi
Keywords: pretrained language models, overtrained setting, individual task performance, task performance, models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: first use model and data size to predict a task-specific loss, and then use this task loss to predict task performance. We train a set of small-scale “ladder” models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models only costs 1% of the compute used for the target models. On four multiple-choice tasks written in ranked classification format, we can predict the accuracy of both target models within 2 points of absolute error. We have higher prediction error on four other tasks (average absolute error 6.9) and find that these are often tasks with higher variance in task metrics. We also find that using less compute to train fewer ladder models tends to deteriorate predictions. Finally, we empirically show that our design choices and the two-step approach lead to superior performance in establishing scaling laws.
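
The two-step prediction can be sketched as two chained functions. The functional forms below (a power law in model and data size for step 1, a sigmoid from loss to accuracy for step 2) and every coefficient are assumptions for illustration; the abstract states only that each step is a parameterized function fitted on ladder-model data.

```python
import math


def predict_task_loss(n_params: float, n_tokens: float,
                      A: float = 400.0, alpha: float = 0.3,
                      B: float = 2000.0, beta: float = 0.3,
                      E: float = 0.5) -> float:
    """Step 1 (assumed form): task loss as a power law in model and data size.

    All coefficients are hypothetical placeholders; in practice they would be
    fitted to the losses measured on the small ladder models.
    """
    return A / n_params ** alpha + B / n_tokens ** beta + E


def predict_accuracy(task_loss: float, low: float = 0.25, high: float = 1.0,
                     k: float = 2.0, mid: float = 1.5) -> float:
    """Step 2 (assumed form): sigmoid mapping from task loss to accuracy.

    Accuracy approaches `high` as loss falls and `low` (e.g. chance level
    on a four-way multiple-choice task) as loss rises.
    """
    return low + (high - low) / (1.0 + math.exp(k * (task_loss - mid)))


if __name__ == "__main__":
    # Sizes mirror the 7B / 4T-token target mentioned in the abstract.
    loss = predict_task_loss(7e9, 4e12)
    print(f"predicted loss {loss:.3f}, predicted accuracy {predict_accuracy(loss):.3f}")
```

Chaining the two functions is what lets a fit on cheap small models extrapolate to the larger targets.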

[NLP-5] BhashaVerse: Translation Ecosystem for Indian Subcontinent Languages

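[Quick Read]: This paper addresses machine translation among 36 Indian languages, focusing on the challenges posed by script variations, phonetic differences, and syntactic diversity. The key to its solution is a range of corpus creation strategies: leveraging existing resources, developing parallel datasets, generating domain-specific corpora, and using synthetic data techniques. The paper also evaluates machine translation along several dimensions, covering standard translation, discourse-level translation, domain-specific translation, reference-based and reference-free evaluation, error analysis, and automatic post-editing. Together, these measures aim to establish a comprehensive framework that improves machine translation quality and supports cross-lingual communication in India's linguistically diverse environment.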

Link: https://arxiv.org/abs/2412.04351
Authors: Vandan Mujadia, Dipti Misra Sharma
Keywords: Arabic and Devanagari, Bengali and Meitei, Indian languages, including Assamese, developing translation models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper focuses on developing translation models and related applications for 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj, Bodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada, Kangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili, Malayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi, Sanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu, Telugu, and Urdu. Achieving this requires parallel and other types of corpora for all 36 * 36 language pairs, addressing challenges like script variations, phonetic differences, and syntactic diversity. For instance, languages like Kashmiri and Sindhi, which use multiple scripts, demand script normalization for alignment, while low-resource languages such as Khasi and Santali require synthetic data augmentation to ensure sufficient coverage and quality. To address these challenges, this work proposes strategies for corpus creation by leveraging existing resources, developing parallel datasets, generating domain-specific corpora, and utilizing synthetic data techniques. Additionally, it evaluates machine translation across various dimensions, including standard and discourse-level translation, domain-specific translation, reference-based and reference-free evaluation, error analysis, and automatic post-editing. By integrating these elements, the study establishes a comprehensive framework to improve machine translation quality and enable better cross-lingual communication in India’s linguistically diverse ecosystem.

[NLP-6] Retrieval-Augmented Machine Translation with Unstructured Knowledge

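[Quick Read]: This paper addresses how to use world knowledge in unstructured documents to strengthen the machine translation (MT) ability of large language models (LLMs). The key to its solution is RAGtrans, a benchmark containing 79K translation samples collected via GPT-4o and human translators, accompanied by multilingual documents that supply knowledge for those samples. The paper further proposes a multi-task training method that uses existing multilingual corpora to create auxiliary training objectives, with no additional labeling required, teaching LLMs to use information from multilingual documents during translation. Experiments show that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET.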

Link: https://arxiv.org/abs/2412.04342
Authors: Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou
Keywords: large language models, RAG, enhance large language, Retrieval-augmented generation, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs). In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance models’ MT ability. However, a large amount of world knowledge is organized in unstructured documents, and might not be fully paired across different languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark to train and evaluate LLMs’ retrieval-augmented MT ability. RAGtrans contains 79K MT samples collected via GPT-4o and human translators. Besides, documents from different languages are also provided to supply the knowledge to these samples. Based on RAGtrans, we further propose a multi-task training method to teach LLMs how to use information from multilingual documents during their translation. The method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements. Extensive experiments show that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores.

[NLP-7] Understanding Student Sentiment on Mental Health Support in Colleges Using Large Language Models

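[Quick Read]: This paper addresses the data collection difficulties and lack of standardized metrics that hamper evaluation of mental health support in colleges. The key to its solution is to analyze public student feedback data with large language models (LLMs): the authors create a sentiment analysis dataset, SMILE-College, and evaluate both traditional machine learning methods and state-of-the-art LLMs on it. GPT-3.5 and BERT perform best at predicting the sentiment of student feedback, which offers practical insights for mental health research and for improving college mental health services. With this data-driven approach, the paper aims to make mental health support evaluation, management, and decision-making more efficient and better informed.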

Link: https://arxiv.org/abs/2412.04326
Authors: Palak Sood, Chengyang He, Divyanshu Gupta, Yue Ning, Ping Wang
Keywords: organizing supportive events, offering counseling services, Mental health support, supportive events, vital in educating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted by ‘2024 IEEE International Conference on Big Data (IEEE BigData 2024)’. The paper has 8 pages, 2 figures, 4 tables

Abstract:Mental health support in colleges is vital in educating students by offering counseling services and organizing supportive events. However, evaluating its effectiveness faces challenges like data collection difficulties and lack of standardized metrics, limiting research scope. Student feedback is crucial for evaluation but often relies on qualitative analysis without systematic investigation using advanced machine learning methods. This paper uses public Student Voice Survey data to analyze student sentiments on mental health support with large language models (LLMs). We created a sentiment analysis dataset, SMILE-College, with human-machine collaboration. The investigation of both traditional machine learning methods and state-of-the-art LLMs showed the best performance of GPT-3.5 and BERT on this new dataset. The analysis highlights challenges in accurately predicting response sentiments and offers practical insights on how LLMs can enhance mental health-related research and improve college mental health services. This data-driven approach will facilitate efficient and informed mental health support evaluation, management, and decision-making.

[NLP-8] The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation ICLR

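[Quick Read]: This paper addresses the repetitive and dull sequences that pre-trained large language models (LLMs) produce in open-ended text generation. The key to its solution is hyperfitting: further fine-tuning a model on a very small dataset until training loss is near zero, which markedly improves the diversity and quality of long-sequence generation. Experiments show that long sequences generated by hyperfitted models under greedy decoding beat Top-P sampling in both diversity and human preference, and that the effect holds across model sizes, domains, and even autoregressive image generation. Hyperfitting is also distinctly different from Grokking and double descent; hyperfitted models rarely fall into repeating the sequences they were trained on, and they still produce high-quality output when those sequences are explicitly blocked.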

Link: https://arxiv.org/abs/2412.04318
Authors: Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre
Keywords: overfitting pre-trained large, pre-trained large language, counter-intuitive generalization results, paper introduces, introduces the counter-intuitive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review at ICLR

Abstract:This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well-documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating using greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples (a process we refer to as hyperfitting), the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomenon to be distinctly different from that of Grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.
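
To make "extremely low-entropy predictions" concrete, the sketch below computes the Shannon entropy of a next-token distribution and picks the greedy token. The helper names and the toy distributions are illustrative assumptions, not values from the paper.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def greedy_token(probs: Sequence[float]) -> int:
    """Greedy decoding step: index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


if __name__ == "__main__":
    # Toy distributions over a four-token vocabulary (hypothetical values).
    peaked = [0.97, 0.01, 0.01, 0.01]    # nearly all mass on one token
    spread = [0.25, 0.25, 0.25, 0.25]    # maximally uncertain
    print(f"peaked entropy: {entropy(peaked):.3f} nats")
    print(f"spread entropy: {entropy(spread):.3f} nats")
    print("greedy choice for peaked:", greedy_token(peaked))
```

When almost all probability sits on one token, greedy decoding and sampling pick nearly the same output, so the distinction between decoding strategies matters less for such a model.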

[NLP-9] Densing Law of LLMs

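[Quick Read]: This paper addresses the training and inference efficiency problems that large language models (LLMs) face as they scale, especially when deployed in resource-constrained environments. The key to its solution is a new metric, capacity density, which evaluates the quality of LLMs of different scales and describes their trend in both effectiveness and efficiency. The effective parameter size of a target LLM is defined and compared with its actual parameter size, giving a unified framework for assessing model effectiveness and efficiency. The paper also reports the densing law, under which capacity density grows exponentially over time, offering a new perspective on future LLM development and stressing the importance of raising capacity density to achieve the best results with minimal computational overhead.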

Link: https://arxiv.org/abs/2412.04315
Authors: Chaojun Xiao, Jie Cai, Weilin Zhao, Guoyang Zeng, Xu Han, Zhiyuan Liu, Maosong Sun
Keywords: Large Language Models, Large Language, capacity density, Language Models, model size increases
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have emerged as a milestone in artificial intelligence, and their performance can improve as the model size increases. However, this scaling brings great challenges to training and inference efficiency, particularly for deploying LLMs in resource-constrained environments, and the scaling trend is becoming increasingly unsustainable. This paper introduces the concept of "capacity density" as a new metric to evaluate the quality of the LLMs across different scales and describes the trend of LLMs in terms of both effectiveness and efficiency. To calculate the capacity density of a given target LLM, we first introduce a set of reference models and develop a scaling law to predict the downstream performance of these reference models based on their parameter sizes. We then define the effective parameter size of the target LLM as the parameter size required by a reference model to achieve equivalent performance, and formalize the capacity density as the ratio of the effective parameter size to the actual parameter size of the target LLM. Capacity density provides a unified framework for assessing both model effectiveness and efficiency. Our further analysis of recent open-source base LLMs reveals an empirical law (the densing law) that the capacity density of LLMs grows exponentially over time. More specifically, using some widely used benchmarks for evaluation, the capacity density of LLMs doubles approximately every three months. The law provides new perspectives to guide future LLM development, emphasizing the importance of improving capacity density to achieve optimal results with minimal computational overhead.
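
The two quantities in the abstract reduce to simple arithmetic, sketched below. The ratio definition and the roughly three-month doubling period come from the abstract; the function names, the starting density, and treating the doubling as exact are illustrative assumptions.

```python
def capacity_density(effective_params: float, actual_params: float) -> float:
    """Capacity density: effective parameter size divided by actual parameter size."""
    if actual_params <= 0:
        raise ValueError("actual_params must be positive")
    return effective_params / actual_params


def projected_density(initial_density: float, months: float,
                      doubling_months: float = 3.0) -> float:
    """Exponential growth with a fixed doubling period.

    The 3-month default follows the abstract's "approximately every three
    months"; treating it as exact is a simplifying assumption.
    """
    return initial_density * 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Hypothetical model: 3B actual parameters matching a 6B reference model.
    print("density:", capacity_density(6e9, 3e9))
    # Under a strict 3-month doubling, one year multiplies density by 2**4.
    print("growth over 12 months:", projected_density(1.0, 12.0))
```

A density above 1 means the target model matches a reference model larger than itself.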

[NLP-10] ALMA: Alignment with Minimal Annotation

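[Quick Read]: This paper addresses the large amount of human-annotated data required to align large language models (LLMs). The key to its solution is ALMA (Alignment with Minimal Annotation), which achieves effective alignment with only 9,000 labeled examples, far fewer than the millions of annotations conventional approaches require. ALMA's core techniques are diverse prompt synthesis via few-shot learning, diverse response generation with multiple model checkpoints, and enhancement of the judge (reward model) through score aggregation and self-distillation. Using only a pretrained Llama3 base model, 5,000 supervised fine-tuning (SFT) examples, and 4,000 judge annotations, ALMA comes close to Llama3-Instruct on multiple alignment benchmarks. The result suggests that base models already possess sufficient knowledge for effective alignment and that synthetic data generation methods can expose it.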

Link: https://arxiv.org/abs/2412.04305
Authors: Michihiro Yasunaga, Leonid Shamis, Chunting Zhou, Andrew Cohen, Jason Weston, Luke Zettlemoyer, Marjan Ghazvininejad
Keywords: typically require millions, Recent approaches, alignment typically require, external aligned models, typically require
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Recent approaches to large language model (LLM) alignment typically require millions of human annotations or rely on external aligned models for synthetic data generation. This paper introduces ALMA: Alignment with Minimal Annotation, demonstrating that effective alignment can be achieved using only 9,000 labeled examples – less than 1% of conventional approaches. ALMA generates large amounts of high-quality synthetic alignment data through new techniques: diverse prompt synthesis via few-shot learning, diverse response generation with multiple model checkpoints, and judge (reward model) enhancement through score aggregation and self-distillation. Using only a pretrained Llama3 base model, 5,000 SFT examples, and 4,000 judge annotations, ALMA achieves performance close to Llama3-Instruct across diverse alignment benchmarks (e.g., 0.1% difference on AlpacaEval 2.0 score). These results are achieved with a multi-round, self-bootstrapped data synthesis and training recipe that continues to improve for 10 rounds, surpassing the typical 3-round ceiling of previous methods. These results suggest that base models already possess sufficient knowledge for effective alignment, and that synthetic data generation methods can expose it.

[NLP-11] Evolutionary Pre-Prompt Optimization for Mathematical Reasoning

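[Quick Read]: This paper addresses how to optimize example selection for few-shot learning combined with chain-of-thought (CoT) prompting in large language models (LLMs), in order to improve logical consistency on complex reasoning tasks. The key to its solution is Evolutionary Pre-Prompt Optimization (EPPO), which relies on comparison-based evolutionary computation and improves exact match scores on benchmark datasets such as GSM8k and MathQA by more than 10 absolute points over the naive few-shot approach. The gains grow further when EPPO is combined with self-consistency (SC).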

Link: https://arxiv.org/abs/2412.04291
Authors: Mathurin Videau, Alessandro Leite, Marc Schoenauer, Olivier Teytaud
Keywords: demonstrate remarkable proficiency, complex reasoning tasks, large language models, Recent advancements, demonstrate remarkable
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements have highlighted that large language models (LLMs), when given a small set of task-specific examples, demonstrate remarkable proficiency, a capability that extends to complex reasoning tasks. In particular, the combination of few-shot learning with the chain-of-thought (CoT) approach has been pivotal in steering models towards more logically consistent conclusions. This paper explores the optimization of example selection for designing effective CoT pre-prompts and shows that the choice of the optimization algorithm, typically in favor of comparison-based methods such as evolutionary computation, significantly enhances efficacy and feasibility. Specifically, thanks to a limited exploitative and overfitted optimization, Evolutionary Pre-Prompt Optimization (EPPO) brings an improvement over the naive few-shot approach exceeding 10 absolute points in exact match scores on benchmark datasets such as GSM8k and MathQA. These gains are consistent across various contexts and are further amplified when integrated with self-consistency (SC).
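
Comparison-based selection of few-shot examples can be sketched as a simple (1+1) evolutionary loop. This is a generic stand-in, not the paper's algorithm: the mutation rule, the `score` callback (in practice something like exact-match accuracy on held-out questions), and the name `evolve_preprompt` are all assumptions.

```python
import random
from typing import Callable, List, Sequence


def evolve_preprompt(pool_size: int, k: int,
                     score: Callable[[Sequence[int]], float],
                     steps: int = 200, seed: int = 0) -> List[int]:
    """Search for k example indices (out of pool_size) that maximize `score`.

    Only score comparisons are used, never gradients. `score` is a
    hypothetical callback supplied by the caller.
    """
    rng = random.Random(seed)
    current = rng.sample(range(pool_size), k)
    current_score = score(current)
    for _ in range(steps):
        candidate = list(current)
        unused = [i for i in range(pool_size) if i not in candidate]
        if not unused:
            break  # every example is already selected; nothing to swap in
        # Mutation: replace one selected example with an unused one.
        candidate[rng.randrange(k)] = rng.choice(unused)
        candidate_score = score(candidate)
        # Selection: keep the candidate only if it is at least as good.
        if candidate_score >= current_score:
            current, current_score = candidate, candidate_score
    return current


if __name__ == "__main__":
    # Toy score for demonstration only: prefer higher example indices.
    print(sorted(evolve_preprompt(pool_size=20, k=4, score=sum)))
```

Because the loop needs only comparisons between candidate scores, it works with noisy, non-differentiable metrics such as exact match.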

[NLP-12] Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic

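[Quick Read]: This paper addresses the high hardware requirements and inference latency caused by the large parameter counts of LLMs for Arabic natural language processing (NLP). The key to its solution is Arabic Stable LM 1.6B, a small but powerful Arabic-centric model released in base and chat versions. By augmenting the fine-tuning data with a large synthetic dialogue dataset, the paper demonstrates the benefit of mixing in synthetic instruction tuning data; the Arabic Stable LM 1.6B chat model beats models with up to 8x the parameters on several benchmarks.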

Link: https://arxiv.org/abs/2412.04277
Authors: Zaid Alyafeai, Michael Pieler, Hannah Teufel, Jonathan Tow, Marco Bellagente, Duy Phung, Nikhil Pinnaparaju, Reshinth Adithyan, Paulo Rocha, Maksym Zhuravinskyi, Carlos Riquelme
Keywords: natural language processing, English language, domains of natural, language processing, Arabic Stable
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have shown impressive results in multiple domains of natural language processing (NLP) but are mainly focused on the English language. Recently, more LLMs have incorporated a larger proportion of multilingual text to represent low-resource languages. In Arabic NLP, several Arabic-centric LLMs have shown remarkable results on multiple benchmarks in the past two years. However, most Arabic LLMs have more than 7 billion parameters, which increases their hardware requirements and inference latency, when compared to smaller LLMs. This paper introduces Arabic Stable LM 1.6B in a base and chat version as a small but powerful Arabic-centric LLM. Our Arabic Stable LM 1.6B chat model achieves impressive results on several benchmarks beating multiple models with up to 8x the parameters. In addition, we show the benefit of mixing in synthetic instruction tuning data by augmenting our fine-tuning data with a large synthetic dialogue dataset.

[NLP-13] Representation Purification for End-to-End Speech Translation COLING2025

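[Quick Read]: This paper addresses inefficient knowledge transfer in speech-to-text translation (ST) caused by factors in the speech signal that are irrelevant to translation content, such as timbre and rhythm. The key to its solution is the Speech Representation Purification with Supervision Enhancement (SRPSE) framework, which excludes content-agnostic components from speech representations to reduce their negative impact on ST performance. Experiments show that SRPSE significantly improves translation performance on the MuST-C and CoVoST-2 datasets, with especially strong results in the transcript-free setting.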

Link: https://arxiv.org/abs/2412.04266
Authors: Chengwei Zhang, Yue Zhou, Rui Zhao, Yidong Chen, Xiaodong Shi
Keywords: converting spoken language, involves converting spoken, spoken language, textbf, cross-modal task
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by COLING 2025

Abstract:Speech-to-text translation (ST) is a cross-modal task that involves converting spoken language into text in a different language. Previous research primarily focused on enhancing speech translation by facilitating knowledge transfer from machine translation, exploring various methods to bridge the gap between speech and text modalities. Despite substantial progress made, factors in speech that are not relevant to translation content, such as timbre and rhythm, often limit the efficiency of knowledge transfer. In this paper, we conceptualize speech representation as a combination of content-agnostic and content-relevant factors. We examine the impact of content-agnostic factors on translation performance through preliminary experiments and observe a significant performance deterioration when content-agnostic perturbations are introduced to speech signals. To address this issue, we propose a Speech Representation Purification with Supervision Enhancement (SRPSE) framework, which excludes the content-agnostic components within speech representations to mitigate their negative impact on ST. Experiments on MuST-C and CoVoST-2 datasets demonstrate that SRPSE significantly improves translation performance across all translation directions in three settings and achieves preeminent performance under a transcript-free setting.

[NLP-14] Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier

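[Quick Read]: This paper addresses the challenge of building highly performant multilingual models that match or surpass monolingual models. The key to its solution is to draw on several years of research at Cohere For AI and Cohere, including data arbitrage, multilingual preference training, and model merging, resulting in the Aya Expanse model family. The models set a new state of the art in multilingual performance; evaluated on the Arena-Hard-Auto dataset translated into 23 languages, they outperform leading open-weight models in their parameter classes, and the 32B model even outperforms Llama 3.1 70B, a model with twice as many parameters.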

Link: https://arxiv.org/abs/2412.04261
Authors: John Dang, Shivalika Singh, Daniel D’souza, Arash Ahmadian, Alejandro Salamanca, Madeline Smith, Aidan Peppin, Sungjin Hong, Manoj Govindassamy, Terrence Zhao, Sandra Kublik, Meor Amer, Viraat Aryabumi, Jon Ander Campos, Yi-Chern Tan, Tom Kocmi, Florian Strub, Nathan Grinsztajn, Yannis Flet-Berliac, Acyr Locatelli, Hangyu Lin, Dwarak Talupuru, Bharat Venkitesh, David Cairuz, Bowen Yang, Tim Chung, Wei-Yin Ko, Sylvie Shang Shi, Amir Shukayev, Sammie Bae, Aleksandra Piktus, Roman Castagné, Felipe Cruz-Salinas, Eddie Kim, Lucas Crawhall-Stein, Adrien Morisot, Sudip Roy, Phil Blunsom, Ivan Zhang, Aidan Gomez, Nick Frosst, Marzieh Fadaee, Beyza Ermis, Ahmet Üstün, Sara Hooker
Keywords: Aya Expanse, Aya Expanse model, developing highly performant, Aya Expanse sets, highly performant multilingual
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce the Aya Expanse model family, a new generation of 8B and 32B parameter multilingual language models, aiming to address the critical challenge of developing highly performant multilingual models that match or surpass the capabilities of monolingual models. By leveraging several years of research at Cohere For AI and Cohere, including advancements in data arbitrage, multilingual preference training, and model merging, Aya Expanse sets a new state-of-the-art in multilingual performance. Our evaluations on the Arena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya Expanse 8B and 32B outperform leading open-weight models in their respective parameter classes, including Gemma 2, Qwen 2.5, and Llama 3.1, achieving up to a 76.6% win-rate. Notably, Aya Expanse 32B outperforms Llama 3.1 70B, a model with twice as many parameters, achieving a 54.0% win-rate. In this short technical report, we present extended evaluation results for the Aya Expanse model family and release their open-weights, together with a new multilingual evaluation dataset m-ArenaHard.

[NLP-15] CLINICSUM: Utilizing Language Models for Generating Clinical Summaries from Patient-Doctor Conversations ALT

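[Quick Read]: This paper addresses the automatic generation of clinical summaries from patient-doctor conversations. The proposed framework, ClinicSum, uses a two-module architecture: a retrieval-based filtering module that extracts Subjective, Objective, Assessment, and Plan (SOAP) information from conversation transcripts, and an inference module powered by fine-tuned Pre-trained Language Models (PLMs) that turns the extracted SOAP data into abstracted clinical summaries. To fine-tune the PLM, the authors consolidate two public datasets, FigShare and MTS-Dialog, into 1,473 conversation-summary pairs whose ground-truth summaries are validated by Subject Matter Experts (SMEs). ClinicSum outperforms state-of-the-art PLMs on automatic metrics (e.g., ROUGE, BERTScore) and is preferred by SMEs in human evaluation.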

Link: https://arxiv.org/abs/2412.04254
Authors: Subash Neupane,Himanshu Tripathi,Shaswata Mitra,Sean Bozorgzad,Sudip Mittal,Shahram Rahimi,Amin Amirlatifi
Keywords: Pre-trained Language Models, paper presents ClinicSum, automatically generate clinical, fine-tuned Pre-trained Language, paper presents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: accepted at the 2024 IEEE International Conference on Big Data, Workshop on Big Data and AI for Healthcare

Abstract:This paper presents ClinicSum, a novel framework designed to automatically generate clinical summaries from patient-doctor conversations. It utilizes a two-module architecture: a retrieval-based filtering module that extracts Subjective, Objective, Assessment, and Plan (SOAP) information from conversation transcripts, and an inference module powered by fine-tuned Pre-trained Language Models (PLMs), which leverage the extracted SOAP data to generate abstracted clinical summaries. To fine-tune the PLM, we created a training dataset consisting of 1,473 conversation-summary pairs by consolidating two publicly available datasets, FigShare and MTS-Dialog, with ground truth summaries validated by Subject Matter Experts (SMEs). ClinicSum’s effectiveness is evaluated through both automatic metrics (e.g., ROUGE, BERTScore) and expert human assessments. Results show that ClinicSum outperforms state-of-the-art PLMs, demonstrating superior precision, recall, and F-1 scores in automatic evaluations and receiving high preference from SMEs in human assessment, making it a robust solution for automated clinical summarization.

[NLP-16] A History of Philosophy in Colombia through Topic Modelling

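[Quick Read]: This paper asks how data-driven methods can be used to study the history of philosophy in Colombia and Latin America. The authors apply dynamic topic modelling to the Colombian philosophy journal Ideas y Valores to identify and track its main topics and trends across historical periods. The most prominent topics are value theory (including ethics, political philosophy, and aesthetics), epistemology, and the philosophy of science; the study also traces how work on the historical and interpretive aspects of philosophical texts has evolved. It further tests whether editorial pressures have reduced the number of historically focused articles and finds no significant decline. Finally, it proposes extending the analysis to other Latin American journals and improving natural language processing workflows for non-English languages.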

Link: https://arxiv.org/abs/2412.04236
Authors: Juan R. Loaiza,Miguel González-Duque
Keywords: Data-driven approaches, valuable tool, tool for studying, Latin American journals, Latin American
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments:

Abstract:Data-driven approaches to philosophy have emerged as a valuable tool for studying the history of the discipline. However, most studies in this area have focused on a limited number of journals from specific regions and subfields. We expand the scope of this research by applying dynamic topic modelling techniques to explore the history of philosophy in Colombia and Latin America. Our study examines the Colombian philosophy journal Ideas y Valores, founded in 1951 and currently one of the most influential academic philosophy journals in the region. By analyzing the evolution of topics across the journal’s history, we identify various trends and specific dynamics in philosophical discourse within the Colombian and Latin American context. Our findings reveal that the most prominent topics are value theory (including ethics, political philosophy, and aesthetics), epistemology, and the philosophy of science. We also trace the evolution of articles focusing on the historical and interpretive aspects of philosophical texts, and we observe a notable emphasis on German philosophers such as Kant, Husserl, and Hegel on various topics throughout the journal’s lifetime. Additionally, we investigate whether articles with a historical focus have decreased over time due to editorial pressures. Our analysis suggests no significant decline in such articles. Finally, we propose ideas for extending this research to other Latin American journals and suggest improvements for natural language processing workflows in non-English languages.

[NLP-17] Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots

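[Quick Read]: This paper tackles hallucinations in large language models (LLMs) by combining detection and mitigation. Two elements are central: 1) mitigation through a question-answering Retrieval-Augmented Generation (RAG) framework, which grounds answers in external data; and 2) detection through the Negative Missing Information Scoring System (NMISS), which refines evaluation by identifying cases where traditional metrics wrongly flag contextually accurate responses as hallucinations. Using Italian health news articles as context, the evaluation shows that Gemma2 and GPT-4 outperform the other models, while mid-tier models such as Llama2, Llama3, and Mistral benefit significantly from NMISS, which credits the richer contextual information they provide. The combined approach offers new insight into reducing and more accurately assessing hallucinations in LLMs, with applications in real-world healthcare tasks and other domains.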

Link: https://arxiv.org/abs/2412.04235
Authors: Maria Paola Priola
Keywords: Large Language Models, Large Language, Information Scoring System, Negative Missing Information, Missing Information Scoring
Categories: Computation and Language (cs.CL)
Comments:

Abstract:I combine detection and mitigation techniques to address hallucinations in Large Language Models (LLMs). Mitigation is achieved in a question-answering Retrieval-Augmented Generation (RAG) framework while detection is obtained by introducing the Negative Missing Information Scoring System (NMISS), which accounts for contextual relevance in responses. While RAG mitigates hallucinations by grounding answers in external data, NMISS refines the evaluation by identifying cases where traditional metrics incorrectly flag contextually accurate responses as hallucinations. I use Italian health news articles as context to evaluate LLM performance. Results show that Gemma2 and GPT-4 outperform the other models, with GPT-4 producing answers closely aligned with reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral, benefit significantly from NMISS, highlighting their ability to provide richer contextual information. This combined approach offers new insights into the reduction and more accurate assessment of hallucinations in LLMs, with applications in real-world healthcare tasks and other domains.
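
The grounding step of a RAG pipeline can be sketched as retrieve-then-prompt. The example below is only an illustration under stated assumptions: the bag-of-words cosine retriever, the prompt wording, and the sample passages are all hypothetical, and neither the paper's actual retriever nor NMISS itself is reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=1):
    """Ground the question in retrieved context before it is sent to an LLM."""
    context = "\n".join(retrieve(question, passages, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Hypothetical passages standing in for health news articles.
passages = [
    "The flu vaccine is recommended every autumn for people over 65.",
    "Regular exercise lowers blood pressure in most adults.",
]
print(build_prompt("Who should get the flu vaccine?", passages))
```

The prompt string would then be passed to whichever LLM is being evaluated; that call is left out because it depends on the model provider.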

[NLP-18] A Context-aware Framework for Translation-mediated Conversations

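[Quick Read]: This paper addresses translation errors and misunderstandings that arise in bilingual conversations because current automatic translation systems do not make sufficient use of context. The proposed framework trains on context-augmented parallel data and, at inference time, applies quality-aware decoding with context-aware metrics to select the best translation from a pool of candidates. The aim is a translation system that is sensitive to conversational history and therefore produces more accurate, contextually appropriate translations and more coherent conversations.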

Link: https://arxiv.org/abs/2412.04205
Authors: José Pombal,Sweta Agrawal,Patrick Fernandes,Emmanouil Zaranis,André F. T. Martins
Keywords: Effective communication, communication is fundamental, challenges arise, arise when participants, share a common
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Effective communication is fundamental to any interaction, yet challenges arise when participants do not share a common language. Automatic translation systems offer a powerful solution to bridge language barriers in such scenarios, but they introduce errors that can lead to misunderstandings and conversation breakdown. A key issue is that current systems fail to incorporate the rich contextual information necessary to resolve ambiguities and omitted details, resulting in literal, inappropriate, or misaligned translations. In this work, we present a framework to improve large language model-based translation systems by incorporating contextual information in bilingual conversational settings. During training, we leverage context-augmented parallel data, which allows the model to generate translations sensitive to conversational history. During inference, we perform quality-aware decoding with context-aware metrics to select the optimal translation from a pool of candidates. We validate both components of our framework on two task-oriented domains: customer chat and user-assistant interaction. Across both settings, our framework consistently results in better translations than state-of-the-art systems like GPT-4o and TowerInstruct, as measured by multiple automatic translation quality metrics on several language pairs. We also show that the resulting model leverages context in an intended and interpretable way, improving consistency between the conveyed message and the generated translations.
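
Selecting the best translation from a candidate pool with a quality metric amounts to reranking. A minimal sketch follows, assuming a pluggable `metric(source, context, candidate)` function; the toy metric here is a hypothetical stand-in for a learned context-aware metric and is not the one used in the paper.

```python
def select_translation(source, context, candidates, metric):
    """Return the candidate translation with the highest metric score."""
    return max(candidates, key=lambda cand: metric(source, context, cand))


def toy_metric(source, context, candidate):
    """Hypothetical stand-in for a context-aware quality metric.

    It counts candidate words that already appear in the conversation
    history, rewarding terminology that is consistent with earlier turns.
    """
    history = {w.lower() for turn in context for w in turn.split()}
    return sum(w.lower() in history for w in candidate.split())


context = ["Your order was shipped on Monday."]
candidates = ["The order arrives tomorrow.", "The command arrives tomorrow."]
best = select_translation("La commande arrive demain.", context, candidates, toy_metric)
print(best)  # prints: The order arrives tomorrow.
```

Keeping the metric as a parameter means a real quality-estimation model could replace `toy_metric` without changing the selection logic.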

[NLP-19] AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic

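[Quick Read]: This paper addresses how poorly Dialectal Arabic (DA) is served by language technologies, particularly large language models (LLMs). It proposes a method that comprehensively evaluates LLMs on DA along four dimensions: fidelity, understanding, quality, and diglossia. After evaluating nine LLMs on eight DA varieties, the paper offers best-practice recommendations and reports that LLMs do not produce DA as well as they understand it, although quality does not appear to deteriorate when they do produce it. It also finds that current post-training can degrade DA capabilities and that few-shot examples can overcome this and other LLM deficiencies.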

Link: https://arxiv.org/abs/2412.04193
Authors: Nathaniel R. Robinson,Shahd Abdelmoneim,Kelly Marchisio,Sebastian Ruder
Keywords: Dialectal Arabic, large language models, language technologies, language models, Dialectal
Categories: Computation and Language (cs.CL)
Comments: Pre-print

Abstract:Dialectal Arabic (DA) varieties are under-served by language technologies, particularly large language models (LLMs). This trend threatens to exacerbate existing social inequalities and limits language modeling applications, yet the research community lacks operationalized LLM performance measurements in DA. We present a method that comprehensively evaluates LLM fidelity, understanding, quality, and diglossia in modeling DA. We evaluate nine LLMs in eight DA varieties across these four dimensions and provide best practice recommendations. Our evaluation suggests that LLMs do not produce DA as well as they understand it, but does not suggest deterioration in quality when they do. Further analysis suggests that current post-training can degrade DA capabilities, that few-shot examples can overcome this and other LLM deficiencies, and that otherwise no measurable features of input text correlate well with LLM DA performance.

[NLP-20] If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs

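[Quick Read]: This paper asks how the suboptimal checkpoints produced while training large (~100B-parameter) models, across different training stages, objectives, hyperparameters, and data mixtures, can be reused through model merging to improve performance. The key is an optimization algorithm that tunes the weight of each checkpoint in a linear combination, yielding Pareto-optimal models that outperform both the individual models and merge-based baselines across tasks. Further analysis shows that good merges tend to include almost all checkpoints with non-zero weights, so even seemingly poor checkpoints can contribute to the final merged model.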

Link: https://arxiv.org/abs/2412.04144
Authors: Muhammad Khalifa,Yi-Chern Tan,Arash Ahmadian,Tom Hosking,Honglak Lee,Lu Wang,Ahmet Üstün,Tom Sherborne,Matthias Gallé
Keywords: shown great promise, generalist models trained, combining expert models, shown great, great promise
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 9 figures

Abstract:Model merging has shown great promise at combining expert models, but the benefit of merging is unclear when merging "generalist" models trained on many tasks. We explore merging in the context of large (~100B) models, by recycling checkpoints that exhibit tradeoffs among different tasks. Such checkpoints are often created in the process of developing a frontier model, and many suboptimal ones are usually discarded. Given a pool of model checkpoints obtained from different training runs (e.g., different stages, objectives, hyperparameters, and data mixtures), which naturally show tradeoffs across different language capabilities (e.g., instruction following vs. code generation), we investigate whether merging can recycle such suboptimal models into a Pareto-optimal one. Our optimization algorithm tunes the weight of each checkpoint in a linear combination, resulting in Pareto-optimal models that outperform both individual models and merge-based baselines. Further analysis shows that good merges tend to include almost all checkpoints with non-zero weights, indicating that even seemingly bad initial checkpoints can contribute to good final merges.
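
Merging checkpoints as a linear combination means computing, for every parameter, a weighted sum across checkpoints. The sketch below shows only that combination step, with parameters held as plain Python lists; the checkpoint contents and the weights are hypothetical, and the optimization procedure that tunes the weights in the paper is not shown.

```python
def merge_checkpoints(checkpoints, weights):
    """Weighted linear combination of checkpoints.

    Assumes every checkpoint maps the same parameter names to
    flat lists of the same length.
    """
    if len(checkpoints) != len(weights):
        raise ValueError("one weight per checkpoint is required")
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged


# Two hypothetical checkpoints with a single two-element parameter.
ckpt_a = {"layer.weight": [1.0, 2.0]}
ckpt_b = {"layer.weight": [3.0, 6.0]}
merged = merge_checkpoints([ckpt_a, ckpt_b], [0.5, 0.5])
print(merged)  # prints {'layer.weight': [2.0, 4.0]}
```

A search over `weights` (for example, scoring each merged model on held-out tasks) would sit on top of this function.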

[NLP-21] Reducing Tool Hallucination via Reliability Alignment

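[Quick Read]: This paper addresses tool hallucinations in large language models (LLMs) during tool calling, which can lead to flawed task execution and increased operational costs. It proposes a reliability-focused alignment framework that improves the model's ability to assess tool relevance and usage accurately, thereby reducing both tool selection hallucination and tool usage hallucination. A suite of evaluation metrics and experiments on StableToolBench demonstrate that the framework mitigates tool hallucination and improves the overall reliability of LLM tool-calling systems.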

Link: https://arxiv.org/abs/2412.04141
Authors: Hongshen Xu,Su Zhu,Zihan Wang,Hang Zheng,Da Ma,Ruisheng Cao,Shuai Fan,Lu Chen,Kai Yu
Keywords: Large Language Models, Large Language, offering powerful potential, language generation, Language Models
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have extended their capabilities beyond language generation to interact with external systems through tool calling, offering powerful potential for real-world applications. However, the phenomenon of tool hallucinations, which occur when models improperly select or misuse tools, presents critical challenges that can lead to flawed task execution and increased operational costs. This paper investigates the concept of reliable tool calling and highlights the necessity of addressing tool hallucinations. We systematically categorize tool hallucinations into two main types: tool selection hallucination and tool usage hallucination. To mitigate these issues, we propose a reliability-focused alignment framework that enhances the model’s ability to accurately assess tool relevance and usage. By proposing a suite of evaluation metrics and evaluating on StableToolBench, we further demonstrate the effectiveness of our framework in mitigating tool hallucination and improving the overall system reliability of LLM tool calling.
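
The two categories named above can be illustrated with a structural check on a proposed tool call. This is a simplified sketch, not the paper's alignment framework: the tool registry, its schema format, and the error labels are hypothetical, and only the cases detectable from a schema are covered (a call to an existing but irrelevant tool would pass this check).

```python
# Hypothetical tool registry: each tool lists its required and optional arguments.
TOOLS = {
    "get_weather": {"required": {"city"}, "optional": {"unit"}},
}


def check_tool_call(name, arguments, tools=TOOLS):
    """Classify a proposed tool call as ok, a selection error, or a usage error."""
    if name not in tools:
        # The model picked a tool that does not exist.
        return "selection_error"
    spec = tools[name]
    given = set(arguments)
    allowed = spec["required"] | spec["optional"]
    if not spec["required"] <= given or not given <= allowed:
        # The tool exists, but arguments are missing or unknown.
        return "usage_error"
    return "ok"


print(check_tool_call("get_stock", {}))                     # selection_error
print(check_tool_call("get_weather", {"town": "Rome"}))     # usage_error
print(check_tool_call("get_weather", {"city": "Rome"}))     # ok
```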

[NLP-22] Text Change Detection in Multilingual Documents Using Image Comparison WACV2025

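[Quick Read]: This paper addresses the limitations of optical character recognition (OCR) in document comparison, namely the need to select a language model per document and the limited performance of multilingual or hybrid models. It proposes text change detection (TCD) based on an image comparison model: word-level text images are compared directly, and the model produces bidirectional change segmentation maps between the source and target documents. By exploiting correlations among multi-scale attention features, the method improves performance without explicit text alignment or scaling preprocessing. The authors also build a benchmark dataset of printed and scanned word pairs in various languages, validate the method on it and on public benchmarks, and compare against state-of-the-art semantic segmentation and change detection models as well as conventional OCR-based models.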

Link: https://arxiv.org/abs/2412.04137
Authors: Doyoung Park,Naresh Reddy Yarram,Sunjin Kim,Minkyu Kim,Seongho Cho,Taehee Lee
Keywords: optical character recognition, comparison typically relies, character recognition, core technology, typically relies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages, 11 figures, 6 tables; accepted at WACV 2025

Abstract:Document comparison typically relies on optical character recognition (OCR) as its core technology. However, OCR requires the selection of appropriate language models for each document and the performance of multilingual or hybrid models remains limited. To overcome these challenges, we propose text change detection (TCD) using an image comparison model tailored for multilingual documents. Unlike OCR-based approaches, our method employs word-level text image-to-image comparison to detect changes. Our model generates bidirectional change segmentation maps between the source and target documents. To enhance performance without requiring explicit text alignment or scaling preprocessing, we employ correlations among multi-scale attention features. We also construct a benchmark dataset comprising actual printed and scanned word pairs in various languages to evaluate our model. We validate our approach using our benchmark dataset and public benchmarks Distorted Document Images and the LRDE Document Binarization Dataset. We compare our model against state-of-the-art semantic segmentation and change detection models, as well as to conventional OCR-based models.

[NLP-23] GRAF: Graph Retrieval Augmented by Facts for Legal Question Answering

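[Quick Read]: This paper addresses legal multiple-choice question answering (MCQA) in a low-resource language, Romanian. Its contributions are: 1) JuRO, the first openly available Romanian legal MCQA dataset, with 10,836 questions; 2) CROL, an organized corpus of 93 legal documents together with their modifications, used for Information Retrieval (IR) techniques; 3) Law-RoG, a Knowledge Graph (KG) for the Romanian language derived from that corpus, which the authors are the first to propose; and 4) Graph Retrieval Augmented by Facts (GRAF), a novel MCQA approach that is competitive with generally accepted state-of-the-art (SOTA) methods and exceeds them in most settings.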

Link: https://arxiv.org/abs/2412.04119
Authors: Cristian-George Crăciun,Răzvan-Alexandru Smădu,Dumitru-Clementin Cercel,Mihaela-Claudia Cercel
Keywords: Pre-trained Language Models, shown remarkable performances, Pre-trained Language, Language Models, NLP research
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Pre-trained Language Models (PLMs) have shown remarkable performances in recent years, setting a new paradigm for NLP research and industry. The legal domain has received some attention from the NLP community partly due to its textual nature. Some tasks from this domain are represented by question-answering (QA) tasks. This work explores the legal domain Multiple-Choice QA (MCQA) for a low-resource language. The contribution of this work is multi-fold. We first introduce JuRO, the first openly available Romanian legal MCQA dataset, comprising three different examinations and a number of 10,836 total questions. Along with this dataset, we introduce CROL, an organized corpus of laws that has a total of 93 distinct documents with their modifications from 763 time spans, that we leveraged in this work for Information Retrieval (IR) techniques. Moreover, we are the first to propose Law-RoG, a Knowledge Graph (KG) for the Romanian language, and this KG is derived from the aforementioned corpus. Lastly, we propose a novel approach for MCQA, Graph Retrieval Augmented by Facts (GRAF), which achieves competitive results with generally accepted SOTA methods and even exceeds them in most settings.

[NLP-24] Missing Melodies: AI Music Generation and its “Nearly” Complete Omission of the Global South

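[Quick Read]: This paper addresses the severe under-representation of music from the Global South in generative AI music research. The authors analyze over one million hours of audio datasets and manually review more than 200 papers, finding that approximately 86% of dataset hours and over 93% of researchers focus primarily on music from the Global North, and that about 51% of the surveyed papers concentrate on symbolic music generation, a method that often fails to capture the cultural nuances of music from regions such as South Asia, the Middle East, and Africa. The paper proposes steps to mitigate these risks and foster more inclusive AI-driven music generation.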

Link: https://arxiv.org/abs/2412.04100
Authors: Atharva Mehta,Shivam Chauhan,Monojit Choudhury
Keywords: sparked renewed interest, Recent advances, Global South, music generation, music
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Submitted to CACM, 12 pages, 2 figures

Abstract:Recent advances in generative AI have sparked renewed interest and expanded possibilities for music generation. However, the performance and versatility of these systems across musical genres are heavily influenced by the availability of training data. We conducted an extensive analysis of over one million hours of audio datasets used in AI music generation research and manually reviewed more than 200 papers from eleven prominent AI and music conferences and organizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR, NeurIPS, NIME, SMC) to identify a critical gap in the fair representation and inclusion of the musical genres of the Global South in AI research. Our findings reveal a stark imbalance: approximately 86% of the total dataset hours and over 93% of researchers focus primarily on music from the Global North. Although around 40% of these datasets include some form of non-Western music, genres from the Global South account for only 14.6% of the data. Furthermore, approximately 51% of the papers surveyed concentrate on symbolic music generation, a method that often fails to capture the cultural nuances inherent in music from regions such as South Asia, the Middle East, and Africa. As AI increasingly shapes the creation and dissemination of music, the significant underrepresentation of music genres in datasets and research presents a serious threat to global musical diversity. We also propose some important steps to mitigate these risks and foster a more inclusive future for AI-driven music generation.

[NLP-25] GEITje 7B Ultra: A Conversational Model for Dutch

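[Quick Read]: This paper addresses the limited performance of language models in non-English languages such as Dutch, which stems from the lack of extensive pretraining in those languages. It improves the Dutch model GEITje, originally derived from Mistral 7B, in two steps: supervised finetuning on newly created high-quality synthetic conversational datasets, followed by preference alignment on a synthetic feedback dataset. Both the models and the datasets are openly available.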

Link: https://arxiv.org/abs/2412.04092
Authors: Bram Vanroy
Keywords: neglecting extensive pretraining, focusing on English, rapidly evolved, predominantly focusing, neglecting extensive
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Language models have rapidly evolved, predominantly focusing on English while often neglecting extensive pretraining in other languages. This approach has required initiatives to adapt powerful, English-centric models to other linguistic contexts through finetuning. For Dutch, one such recent endeavour is "GEITje", a model originally derived from the English-based Mistral 7B. Building on this fundamental work, the current research extends the capabilities of GEITje by supervised finetuning on newly created high-quality synthetic conversational datasets, along with an additional preference alignment procedure on a synthetic feedback dataset. Both the developed models and the created datasets are openly available.

[NLP-26] Automated Medical Report Generation for ECG Data: Bridging Medical Text and Signal Processing with Deep Learning

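[Quick Read]: This paper addresses the automated interpretation of electrocardiogram (ECG) data by generating free-text reports similar to those written by clinicians. It introduces an encoder-decoder method trained on existing ECG datasets that come with free-text reports, so that the model learns to produce detailed descriptions of ECG episodes. Tested on several datasets covering both 1-lead and 12-lead ECGs, the model achieves a METEOR score of 55.53%, compared with 24.51% for the state-of-the-art reference model by Qiu et al. The paper also discusses several key design choices, giving an overview of current challenges and innovations in the field.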

Link: https://arxiv.org/abs/2412.04067
Authors: Amnon Bleich,Antje Linnemann,Bjoern H. Diem,Tim OF Conrad
Keywords: Recent advances, improved image captioning, natural language generation, significantly improved image, visual content
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in deep learning and natural language generation have significantly improved image captioning, enabling automated, human-like descriptions for visual content. In this work, we apply these captioning techniques to generate clinician-like interpretations of ECG data. This study leverages existing ECG datasets accompanied by free-text reports authored by healthcare professionals (HCPs) as training data. These reports, while often inconsistent, provide a valuable foundation for automated learning. We introduce an encoder-decoder-based method that uses these reports to train models to generate detailed descriptions of ECG episodes. This represents a significant advancement in ECG analysis automation, with potential applications in zero-shot classification and automated clinical decision support. The model is tested on various datasets, including both 1- and 12-lead ECGs. It significantly outperforms the state-of-the-art reference model by Qiu et al., achieving a METEOR score of 55.53% compared to 24.51% achieved by the reference model. Furthermore, several key design choices are discussed, providing a comprehensive overview of current challenges and innovations in this domain. The source code for this research is publicly available in our Git repository at this https URL.

[NLP-27] Hostility Detection in UK Politics: A Dataset on Online Abuse Targeting MPs

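[Quick Read]: This paper addresses the detection of hostility directed at UK Members of Parliament (MPs) in a political context. The authors construct a dataset of 3,320 English tweets spanning a two-year period, manually annotated for hostility and for the identity characteristic targeted (race, gender, religion, none). They use the dataset for linguistic and topical analyses of what is distinctive about UK political data, and they evaluate pre-trained language models and large language models on binary hostility detection and multi-class targeted identity type classification. The result is UK-specific data and insight for future research on the prevalence and nature of politics-related hostility.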

Link: https://arxiv.org/abs/2412.04046
Authors: Mugdha Pandya,Mali Jin,Kalina Bontcheva,Diana Maynard
Keywords: social media platforms, social media, media platforms, politicians, hostility
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Numerous politicians use social media platforms, particularly X, to engage with their constituents. This interaction allows constituents to pose questions and offer feedback but also exposes politicians to a barrage of hostile responses, especially given the anonymity afforded by social media. They are typically targeted in relation to their governmental role, but the comments also tend to attack their personal identity. This can discredit politicians and reduce public trust in the government. It can also incite anger and disrespect, leading to offline harm and violence. While numerous models exist for detecting hostility in general, they lack the specificity required for political contexts. Furthermore, addressing hostility towards politicians demands tailored approaches due to the distinct language and issues inherent to each country (e.g., Brexit for the UK). To bridge this gap, we construct a dataset of 3,320 English tweets spanning a two-year period manually annotated for hostility towards UK MPs. Our dataset also captures the targeted identity characteristics (race, gender, religion, none) in hostile tweets. We perform linguistic and topical analyses to delve into the unique content of the UK political data. Finally, we evaluate the performance of pre-trained language models and large language models on binary hostility detection and multi-class targeted identity type classification tasks. Our study offers valuable data and insights for future research on the prevalence and nature of politics-related hostility specific to the UK.

[NLP-28] M3D: A Multimodal Multilingual and Multitask Dataset for Grounded Document-level Information Extraction

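[Quick Read]: This paper addresses a gap in multimodal information extraction (IE): existing datasets focus mainly on sentence-level, image-facilitated IE in English text and pay little attention to video-based multimodal IE and fine-grained visual grounding. The authors construct M³D, a multimodal, multilingual, multitask dataset that (1) pairs document-level text with video to enrich multimodal information; (2) supports two widely used languages, English and Chinese; and (3) covers entity recognition, entity chain extraction, relation extraction, and visual grounding. They also propose a hierarchical multimodal IE model that integrates multimodal information through a Denoised Feature Fusion Module (DFFM) and, for non-ideal scenarios, a Missing Modality Construction Module (MMCM) that alleviates the problems caused by missing modalities.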

Link: https://arxiv.org/abs/2412.04026
Authors: Jiang Liu,Bobo Li,Xinran Yang,Na Yang,Hao Fei,Mingyao Zhang,Fei Li,Donghong Ji
Keywords: attracted increasing attention, Multimodal, Multimodal information, multimodal information benefits, information benefits text
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 9 figures, 6 tables

Abstract:Multimodal information extraction (IE) tasks have attracted increasing attention because many studies have shown that multimodal information benefits text information extraction. However, existing multimodal IE datasets mainly focus on sentence-level image-facilitated IE in English text, and pay little attention to video-based multimodal IE and fine-grained visual grounding. Therefore, in order to promote the development of multimodal IE, we constructed a multimodal multilingual multitask dataset, named M³D, which has the following features: (1) It contains paired document-level text and video to enrich multimodal information; (2) It supports two widely-used languages, namely English and Chinese; (3) It includes more multimodal IE tasks such as entity recognition, entity chain extraction, relation extraction and visual grounding. In addition, our dataset introduces an unexplored theme, i.e., biography, enriching the domains of multimodal IE resources. To establish a benchmark for our dataset, we propose an innovative hierarchical multimodal IE model. This model effectively leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal scenarios, modal information is often incomplete. Thus, we designed a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieved an average performance of 53.80% and 53.77% on four tasks in English and Chinese datasets, respectively, which set a reasonable standard for subsequent research. In addition, we conducted more analytical experiments to verify the effectiveness of our proposed module. We believe that our work can promote the development of the field of multimodal IE.

[NLP-29] Exploring the Influence of Label Aggregation on Minority Voices: Implications for Dataset Bias and Model Training

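[Quick Read]: This paper examines how label aggregation strategies in manual annotation represent minority opinions. It studies the impact of standard strategies such as majority vote and expert opinion on minority opinions in sexism detection, assesses the quality and value of minority annotations, and examines how they change the class distribution of the gold labels and, in turn, the behaviour of models trained on the resulting datasets. It also discusses the biases each method can introduce and how models can amplify them.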

Link: https://arxiv.org/abs/2412.04025
Authors: Mugdha Pandya,Nafise Sadat Moosavi,Diana Maynard
Keywords: removing unreliable annotators, Resolving disagreement, manual annotation typically, annotation typically consists, label aggregation strategy
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Resolving disagreement in manual annotation typically consists of removing unreliable annotators and using a label aggregation strategy such as majority vote or expert opinion to resolve disagreement. These may have the side-effect of silencing or under-representing minority but equally valid opinions. In this paper, we study the impact of standard label aggregation strategies on minority opinion representation in sexism detection. We investigate the quality and value of minority annotations, and then examine their effect on the class distributions in gold labels, as well as how this affects the behaviour of models trained on the resulting datasets. Finally, we discuss the potential biases introduced by each method and how they can be amplified by the models.
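
How majority voting drops minority labels can be seen in a few lines. The sketch below uses hypothetical annotations, and its tie-breaking rule (first label seen wins) is an assumption; it illustrates the aggregation step only, not the paper's analysis.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label.

    Assumption: ties go to the label that appears first in the list.
    """
    return Counter(labels).most_common(1)[0][0]


def minority_share(annotations):
    """Fraction of individual annotations that disagree with their item's majority label."""
    total = 0
    disagreeing = 0
    for labels in annotations:
        gold = majority_vote(labels)
        total += len(labels)
        disagreeing += sum(label != gold for label in labels)
    return disagreeing / total


# Hypothetical annotations: three annotators per item.
annotations = [
    ["sexist", "not_sexist", "not_sexist"],
    ["sexist", "sexist", "not_sexist"],
    ["not_sexist", "not_sexist", "not_sexist"],
]
gold = [majority_vote(labels) for labels in annotations]
print(gold)  # prints ['not_sexist', 'sexist', 'not_sexist']
print(round(minority_share(annotations), 3))  # prints 0.222
```

In this toy data, two of the nine individual judgements vanish from the gold labels, which is the kind of silent loss the abstract describes.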

[NLP-30] Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement

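[Quick Read]: This paper addresses the weak performance of large language models (LLMs) on low-resource languages. It introduces Marco-LLM, an LLM that strengthens cross-lingual capability through massive multilingual training: the team collected a substantial amount of multilingual data for several low-resource languages and carried out extensive continual pre-training on the Qwen2 models. Marco-LLM shows substantial improvements on a range of multilingual benchmarks, including any-to-any machine translation, and narrows the performance gap between high- and low-resource languages.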

Link: https://arxiv.org/abs/2412.04003
Authors: Lingfeng Ming,Bo Zeng,Chenyang Lyu,Tianqi Shi,Yu Zhao,Xue Yang,Yefeng Liu,Yiyu Wang,Linlong Xu,Yangyang Liu,Xiaohu Zhao,Hao Wang,Heng Liu,Hao Zhou,Huifeng Yin,Zifu Shang,Haijun Li,Longyue Wang,Weihua Luo,Kaifu Zhang
Keywords: Large Language Models, Large Language, achieved remarkable progress, recent years, remarkable progress
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their excellent performance is still largely limited to major world languages, primarily English. Many LLMs continue to face challenges with multilingual tasks, especially when it comes to low-resource languages. To address this issue, we introduced Marco-LLM: Massive multilingual training for cross-lingual enhancement LLM. We have collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training using the Qwen2 models. This effort has resulted in a multilingual LLM named Marco-LLM. Through comprehensive evaluations on various multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA and many others, Marco-LLM has demonstrated substantial improvements over state-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements in any-to-any machine translation tasks, showing the effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not only perform exceptionally well in multilingual tasks, including low-resource languages, but also maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication to ensuring LLMs work accurately across various languages.

[NLP-31] MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for Strengthening LLM

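[Quick Read]: This paper addresses the limitations of large language models (LLMs) on tasks that require complex logical reasoning and multi-step problem-solving. It introduces MTMT (Multi-thinking Modes Tree), a method that interacts with an LLM to construct a thought tree simulating several advanced cognitive processes, such as association, counterfactual thinking, task decomposition, and comparison. By breaking the original complex task into simpler sub-questions, MTMT makes problems easier for the LLM to solve and makes more effective use of its latent knowledge. The results indicate that integrating multiple modes of thinking significantly improves LLMs' ability to handle complex tasks.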

链接: https://arxiv.org/abs/2412.03987
作者: Changcheng Li,Xiangyu Wang,Qiuju Chen,Xiren Zhou,Huanhuan Chen
关键词-EN: Large language models, Large language, requiring complex logical, complex logical reasoning, shown limitations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown limitations in tasks requiring complex logical reasoning and multi-step problem-solving. To address these challenges, researchers have employed carefully designed prompts and flowcharts, simulating human cognitive processes to enhance LLM performance, such as the Chain of Thought approach. In this paper, we introduce MTMT (Multi-thinking Modes Tree), a novel method that interacts with LLMs to construct a thought tree, simulating various advanced cognitive processes, including but not limited to association, counterfactual thinking, task decomposition, and comparison. By breaking down the original complex task into simpler sub-questions, MTMT facilitates easier problem-solving for LLMs, enabling more effective utilization of the latent knowledge within LLMs. We evaluate the performance of MTMT under different parameter configurations, using GPT-4o mini as the base model. Our results demonstrate that integrating multiple modes of thinking significantly enhances the ability of LLMs to handle complex tasks.

[NLP-32] Demonstration Selection for In-Context Learning via Reinforcement Learning

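Quick read: This paper addresses how to select diverse demonstrations in few-shot prompting so as to improve the generalization and classification accuracy of large language models (LLMs) on text classification tasks. The key to the solution is Relevance-Diversity Enhanced Selection (RDES), which uses a Q-learning framework from reinforcement learning to dynamically select demonstrations that are both diverse and relevant to the classification objective. RDES computes a diversity score from the label distribution of the selected demonstrations, ensuring a balanced representation of the reference data and thereby improving classification accuracy. The paper also examines adding Chain-of-Thought (CoT) reasoning at inference time, which further improves predictive performance.

Link: https://arxiv.org/abs/2412.03966
Authors: Xubin Wang,Jianfei Wu,Yichen Yuan,Mingzhe Li,Deyu Cai,Weijia Jia
Keywords: enhancing model generalization, structures and concepts, Large Language Models, crucial for enhancing, enables a broader
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: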

Abstract:Diversity in demonstration selection is crucial for enhancing model generalization, as it enables a broader coverage of structures and concepts. However, constructing an appropriate set of demonstrations has remained a focal point of research. This paper presents the Relevance-Diversity Enhanced Selection (RDES), an innovative approach that leverages reinforcement learning to optimize the selection of diverse reference demonstrations for text classification tasks using Large Language Models (LLMs), especially in few-shot prompting scenarios. RDES employs a Q-learning framework to dynamically identify demonstrations that maximize both diversity and relevance to the classification objective by calculating a diversity score based on label distribution among selected demonstrations. This method ensures a balanced representation of reference data, leading to improved classification accuracy. Through extensive experiments on four benchmark datasets and involving 12 closed-source and open-source LLMs, we demonstrate that RDES significantly enhances classification accuracy compared to ten established baselines. Furthermore, we investigate the incorporation of Chain-of-Thought (CoT) reasoning in the reasoning process, which further enhances the model’s predictive performance. The results underscore the potential of reinforcement learning to facilitate adaptive demonstration selection and deepen the understanding of classification challenges.
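
The abstract says the diversity score is based on the label distribution of the selected demonstrations but does not give a formula. As a minimal sketch, assuming normalized Shannon entropy as the measure (an assumption, not necessarily the paper's definition), the score could look like this; `label_diversity` is a hypothetical name:

```python
import math
from collections import Counter


def label_diversity(labels, num_classes):
    """Normalized Shannon entropy of the label distribution, in [0, 1].

    Assumption: entropy stands in for the paper's diversity score.
    1.0 means the selected demonstrations cover all classes evenly,
    0.0 means they all share one label.
    """
    if not labels or num_classes < 2:
        return 0.0
    counts = Counter(labels)
    total = len(labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_classes)


balanced = label_diversity(["pos", "neg"], num_classes=2)
skewed = label_diversity(["pos", "pos"], num_classes=2)
```

A selector could add such a score to a relevance term when computing the reward for a candidate demonstration set; how RDES weighs the two is not stated in the abstract.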

[NLP-33] MIND: Effective Incorrect Assignment Detection through a Multi-Modal Structure-Enhanced Language Model

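Quick read: This paper addresses author name ambiguity caused by the rapid growth of academic publications, which in online digital libraries leads to papers being assigned to the wrong authors. The key to the solution is a structure-enhanced language model that combines key structural features from graph-based methods with fine-grained semantic features from rich paper attributes to detect incorrect assignments. The model is trained with a multi-modal, multi-turn instruction tuning framework comprising task-guided instruction tuning, a text-attribute modality, and a structural modality, and achieves top performance on the KDD Cup 2024 leaderboard.

Link: https://arxiv.org/abs/2412.03930
Authors: Yunhe Pang,Bo Chen,Fanjin Zhang,Yanghui Rao,Jie Tang
Keywords: online digital libraries, digital libraries, rapid growth, publications has exacerbated, exacerbated the issue
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: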

Abstract:The rapid growth of academic publications has exacerbated the issue of author name ambiguity in online digital libraries. Despite advances in name disambiguation algorithms, cumulative errors continue to undermine the reliability of academic systems. It is estimated that over 10% of paper-author assignments are rectified when constructing the million-scale WhoIsWho benchmark. Existing endeavors to detect incorrect assignments are either semantic-based or graph-based approaches, which fall short of making full use of the rich text attributes of papers and implicit structural features defined via the co-occurrence of paper attributes. To this end, this paper introduces a structure-enhanced language model that combines key structural features from graph-based methods with fine-grained semantic features from rich paper attributes to detect incorrect assignments. The proposed model is trained with a highly effective multi-modal multi-turn instruction tuning framework, which incorporates task-guided instruction tuning, text-attribute modality, and structural modality. Experimental results demonstrate that our model outperforms previous approaches, achieving top performance on the leaderboard of KDD Cup 2024. Our code is publicly available.

[NLP-34] A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios

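Quick read: This paper addresses the lack of a systematic survey on evaluating large language model (LLM)-based social agents in game-theoretic scenarios. The key to the solution is a systematic review of existing research organized into three core components: Game Framework, Social Agent, and Evaluation Protocol. The game framework covers game scenarios ranging from choice-focusing to communication-focusing; the social agent part examines agents' preferences, beliefs, and reasoning abilities; the evaluation protocol covers both game-agnostic and game-specific metrics. The paper summarizes current progress and identifies directions for future research, offering insights for developing and evaluating social agents in game-theoretic scenarios.

Link: https://arxiv.org/abs/2412.03920
Authors: Xiachong Feng,Longxu Dou,Ella Li,Qinghao Wang,Haochuan Wang,Yu Guo,Chang Ma,Lingpeng Kong
Keywords: Large Language Model, Language Model, Large Language, intelligence of Large, based social agents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: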

Abstract:Game-theoretic scenarios have become pivotal in evaluating the social intelligence of Large Language Model (LLM)-based social agents. While numerous studies have explored these agents in such settings, there is a lack of a comprehensive survey summarizing the current progress. To address this gap, we systematically review existing research on LLM-based social agents within game-theoretic scenarios. Our survey organizes the findings into three core components: Game Framework, Social Agent, and Evaluation Protocol. The game framework encompasses diverse game scenarios, ranging from choice-focusing to communication-focusing games. The social agent part explores agents’ preferences, beliefs, and reasoning abilities. The evaluation protocol covers both game-agnostic and game-specific metrics for assessing agent performance. By reflecting on the current research and identifying future research directions, this survey provides insights to advance the development and evaluation of social agents in game-theoretic scenarios.

[NLP-35] MISR: Measuring Instrumental Self-Reasoning in Frontier Models

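Quick read: This paper addresses how to evaluate the instrumental self-reasoning ability of large language model (LLM) agents in agentic tasks. The key to the solution is a comprehensive suite of evaluation tasks covering scenarios such as self-modification, knowledge seeking, and opaque self-reasoning, applied to agents built on state-of-the-art LLMs. The study finds that instrumental self-reasoning ability emerges only in the most capable models and is highly context-dependent. The evaluation tasks are open-sourced so that gains in instrumental self-reasoning ability can be measured in future models.

Link: https://arxiv.org/abs/2412.03904
Authors: Kai Fronsdal,David Lindner
Keywords: instrumental self-reasoning ability, self-reasoning ability, instrumental self-reasoning, large language model, self-reasoning
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 65 page appendix, 5 figures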

Abstract:We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks in a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations, hence our evaluation can be used to measure increases in instrumental self-reasoning ability in future models. We open-source our evaluations at this https URL.

[NLP-36] Uniform Discretized Integrated Gradients: An effective attribution based method for explaining large language models

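Quick read: This paper addresses the shortcomings of conventional Integrated Gradients when explaining large language models (LLMs) over discrete feature spaces such as word embeddings. The key to the solution is Uniform Discretized Integrated Gradients (UDIG), which uses a new interpolation strategy to choose a nonlinear path better suited to computing attribution scores. Evaluations on sentiment classification and question answering show that UDIG outperforms existing methods on multiple evaluation metrics.

Link: https://arxiv.org/abs/2412.03886
Authors: Swarnava Sinha Roy,Ayan Kundu
Keywords: explaining deep learning, deep learning models, Discretized Integrated Gradients, well-known technique, technique for explaining
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: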

Abstract:Integrated Gradients is a well-known technique for explaining deep learning models. It calculates feature importance scores with a gradient-based approach, computing gradients of the model output with respect to input features and accumulating them along a linear path. While this works well for continuous feature spaces, it may not be the best way to deal with discrete spaces like word embeddings. For interpreting LLMs (Large Language Models), there exists a need for a non-linear path where intermediate points, whose gradients are to be computed, lie close to actual words in the embedding space. In this paper, we propose a method called Uniform Discretized Integrated Gradients (UDIG) based on a new interpolation strategy where we choose a favorable nonlinear path for computing attribution scores suitable for predictive language models. We evaluate our method on two types of NLP tasks, Sentiment Classification and Question Answering, against three metrics, viz. Log odds, Comprehensiveness and Sufficiency. For sentiment classification, we have used the SST2, IMDb and Rotten Tomatoes datasets for benchmarking and for Question Answering, we have used the fine-tuned BERT model on SQuAD dataset. Our approach outperforms the existing methods in almost all the metrics.
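
For readers unfamiliar with the baseline method, the sketch below shows standard Integrated Gradients along the linear path that the abstract describes. It is not UDIG: the nonlinear interpolation strategy is not specified in the abstract. The toy function and its analytic gradient are illustrative assumptions.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients along the straight line
    from `baseline` to `x` with a midpoint Riemann sum.

    grad_fn(point) must return the gradient of the model output
    with respect to each input feature at `point`.
    """
    diff = [xi - bi for xi, bi in zip(x, baseline)]
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * d for bi, d in zip(baseline, diff)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [d * g for d, g in zip(diff, avg_grad)]


def toy_grad(p):
    """Analytic gradient of the toy model F(x) = x0**2 + 3*x1."""
    return [2.0 * p[0], 3.0]


attributions = integrated_gradients(toy_grad, [2.0, 1.0], [0.0, 0.0])
```

For this toy model the attributions sum to F(x) - F(baseline) = 7, which is the completeness property of Integrated Gradients. With word embeddings, the intermediate points on the straight line generally do not correspond to real words, which is the limitation the paper targets.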

[NLP-37] AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer

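Quick read: This paper addresses the transliteration of Thai proper names into Latin script, in particular the challenges posed by tonal features and vowel length distinctions in Thai phonology. The key to the solution is a novel two-model approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and AyutthayaAlpha-VerySmall, a more computationally efficient variant that unexpectedly outperforms its larger counterpart. By combining linguistic rules with deep learning and training on a carefully curated dataset of 1.2 million Thai-Latin name pairs (strategically upsampled to 2.7 million examples), the AyutthayaAlpha system achieves state-of-the-art transliteration accuracy and effectively captures personal and cultural preferences, with applications in cross-lingual information retrieval, international data standardization, and identity verification systems.

Link: https://arxiv.org/abs/2412.03877
Authors: Davor Lauc,Attapol Rutherford,Weerin Wongwarawipatr
Keywords: advanced transformer-based machine, transformer-based machine learning, machine learning model, learning model designed, study introduces AyutthayaAlpha
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: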

Abstract:This study introduces AyutthayaAlpha, an advanced transformer-based machine learning model designed for the transliteration of Thai proper names into Latin script. Our system achieves state-of-the-art performance with 82.32% first-token accuracy and 95.24% first-three-token accuracy, while maintaining a low character error rate of 0.0047. The complexity of Thai phonology, including tonal features and vowel length distinctions, presents significant challenges for accurate transliteration, which we address through a novel two-model approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and AyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly outperforms its larger counterpart. Our research combines linguistic rules with deep learning, training on a carefully curated dataset of 1.2 million Thai-Latin name pairs, augmented through strategic upsampling to 2.7 million examples. Extensive evaluations against existing transliteration methods and human expert benchmarks demonstrate that AyutthayaAlpha not only achieves superior accuracy but also effectively captures personal and cultural preferences in name romanization. The system’s practical applications extend to cross-lingual information retrieval, international data standardization, and identity verification systems, with particular relevance for government databases, academic institutions, and global business operations. This work represents a significant advance in bridging linguistic gaps between Thai and Latin scripts, while respecting the cultural and personal dimensions of name transliteration.

[NLP-38] Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer

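Quick read: This paper addresses converting images of handwritten or digital mathematical expressions into LaTeX code. The key to the solution is a Transformer-based architecture which, compared with the conventional combination of a convolutional neural network (CNN) encoder and a recurrent neural network (RNN) decoder, achieves higher overall accuracy and BLEU scores and lower Levenshtein scores in the experiments. The paper also explores improving the CNN-RNN architecture by replacing the CNN encoder with a ResNet50 model, and notes that appropriate fine-tuning of model parameters could improve performance further.

Link: https://arxiv.org/abs/2412.03853
Authors: Jayaprakash Sundararaj,Akhil Vyas,Benjamin Gonzalez-Maldonado
Keywords: Converting mathematical expressions, digital mathematical expression, mathematical expression images, Converting mathematical, CNN encoder
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 7 pages; 3 figures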

Abstract:Converting mathematical expressions into LaTeX is challenging. In this paper, we explore using newer transformer-based architectures for addressing the problem of converting handwritten/digital mathematical expression images into equivalent LaTeX code. We use the current state-of-the-art CNN encoder and RNN decoder as a baseline for our experiments. We also investigate improvements to the CNN-RNN architecture by replacing the CNN encoder with the ResNet50 model. Our experiments show that transformer architectures achieve a higher overall accuracy and BLEU scores along with lower Levenshtein scores compared to the baseline CNN/RNN architecture, with room to achieve even better results with appropriate fine-tuning of model parameters.
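
The Levenshtein score mentioned above counts the minimum number of insertions, deletions, and substitutions needed to turn the predicted LaTeX into the reference, so lower is better. A minimal sketch of the standard dynamic-programming computation follows; whether the paper scores characters or LaTeX tokens is not stated, and the function accepts either strings or token lists.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


char_distance = levenshtein(r"\frac{a}{b}", r"\frac{a}{c}")
token_distance = levenshtein(["\\frac", "{", "a", "}"], ["\\sqrt", "{", "a", "}"])
```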

[NLP-39] Educational-Psychological Dialogue Robot Based on Multi-Agent Collaboration

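Quick read: This paper addresses the limitations of existing intelligent dialogue systems in education and psychological counseling: restriction to a single domain, inaccurate handling of complex issues, and a lack of professionalism. The key to the solution is an intelligent dialogue system that integrates educational and psychological counseling functions, in which multiple AI agents (a security detection agent, an intent identification agent, an educational LLM agent, and a psychological LLM agent) work together to provide accurate educational Q&A and psychological support. Specifically, the system identifies the intent of user input with an intent classification model and invokes a retrieval-augmented educational large model and a psychological large model fine-tuned on psychological data to provide professional educational advice and psychological support.

Link: https://arxiv.org/abs/2412.03847
Authors: Shiwen Ni,Min Yang
Keywords: psychological counseling fields, Intelligent dialogue systems, complex issues, psychological LLM agent, single domain
Categories: Computation and Language (cs.CL)
Comments: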

Abstract:Intelligent dialogue systems are increasingly used in modern education and psychological counseling fields, but most existing systems are limited to a single domain, cannot deal with both educational and psychological issues, and often lack accuracy and professionalism when dealing with complex issues. To address these problems, this paper proposes an intelligent dialogue system that combines educational and psychological counseling functions. The system consists of multiple AI agents, including a security detection agent, an intent identification agent, an educational LLM agent, and a psychological LLM agent, which work in concert to ensure the provision of accurate educational knowledge Q&A and psychological support services. Specifically, the system recognizes user-input intentions through an intention classification model and invokes a retrieval-augmented educational large model and a psychological large model fine-tuned with psychological data in order to provide professional educational advice and psychological support.
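
The agent pipeline described above (safety check, then intent identification, then dispatch to a domain agent) can be sketched in a few lines. Everything here is a hypothetical stand-in: the keyword rules replace the paper's trained models, and the function names and agent labels are illustrative only.

```python
def is_safe(message):
    """Stand-in for the security detection agent (hypothetical blocklist)."""
    return "password" not in message.lower()


def classify_intent(message):
    """Keyword stand-in for the intent classification model (hypothetical)."""
    psych_words = {"anxious", "stressed", "sad", "lonely"}
    words = set(message.lower().split())
    return "psychological" if words & psych_words else "educational"


def route(message):
    """Return the name of the agent that should handle the message."""
    if not is_safe(message):
        return "blocked"
    return classify_intent(message) + "_llm_agent"


first = route("I feel anxious about exams")
second = route("What is photosynthesis?")
```

In a real system each branch would call the corresponding LLM agent rather than return a label.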

[NLP-40] Beyond the Binary: Capturing Diverse Preferences With Reward Regularization

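Quick read: This paper addresses the inability of existing reward models trained on binary judgments to capture the broader, aggregate preferences of target users in real-world tasks. It argues that preference tuning based on binary choices fails to reflect differing user preferences along two dimensions of subjectivity: the Plurality of Responses to Prompts and the Indistinguishability of Responses. The key to the solution is a simple yet effective method that augments existing binary preference datasets with synthetic preference judgments to estimate potential user disagreement, and incorporates these as a margin term acting as regularization during model training, so that the model's predictions align better with aggregate user preferences.

Link: https://arxiv.org/abs/2412.03822
Authors: Vishakh Padmakumar,Chuanyang Jin,Hannah Rose Kirk,He He
Keywords: Large language models, Large language, increasingly deployed, deployed via public-facing, public-facing interfaces
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: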

Abstract:Large language models (LLMs) are increasingly deployed via public-facing interfaces to interact with millions of users, each with diverse preferences. Despite this, preference tuning of LLMs predominantly relies on reward models trained using binary judgments where annotators select the preferred choice out of pairs of model outputs. In this work, we argue that this reliance on binary choices does not capture the broader, aggregate preferences of the target user in real-world tasks. We propose a taxonomy that identifies two dimensions of subjectivity where different users disagree on the preferred output, namely, the Plurality of Responses to Prompts, where prompts allow for multiple correct answers, and the Indistinguishability of Responses, where candidate outputs are paraphrases of each other. We show that reward models correlate weakly with user preferences in these cases. As a first step to address this issue, we introduce a simple yet effective method that augments existing binary preference datasets with synthetic preference judgments to estimate potential user disagreement. Incorporating these via a margin term as a form of regularization during model training yields predictions that better align with the aggregate user preferences.
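
The abstract does not spell out the loss. As a hedged sketch, a common way to add a margin term to a pairwise (Bradley-Terry style) reward-model loss is -log sigmoid(r_chosen - r_rejected - margin). The idea that the margin shrinks when estimated user disagreement is high is an assumption made here for illustration, and `pairwise_margin_loss` is a hypothetical name.

```python
import math


def pairwise_margin_loss(r_chosen, r_rejected, margin=0.0):
    """-log(sigmoid(r_chosen - r_rejected - margin)).

    Computed as softplus(-z) in a numerically stable form.
    Assumption: `margin` would be derived from the synthetic
    preference judgments, smaller when users are likely to disagree.
    """
    z = r_chosen - r_rejected - margin
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


loss_no_margin = pairwise_margin_loss(1.0, 0.0, margin=0.0)
loss_with_margin = pairwise_margin_loss(1.0, 0.0, margin=1.0)
```

A larger margin demands a wider reward gap before the loss becomes small, so pairs on which users would disagree exert less pressure on the model when given a small margin.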

[NLP-41] Detecting Redundant Health Survey Questions Using Language-agnostic BERT Sentence Embedding (LaBSE)

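Quick read: This paper aims to compute the semantic similarity among publicly available health survey questions in order to facilitate the standardization of survey-based Person-Generated Health Data (PGHD). The key to the solution is building and evaluating three classifiers: Bag-of-Words, SBERT with BERT-based embeddings, and SBERT with LaBSE embeddings. SBERT-LaBSE performs best at assessing semantic similarity across languages, with areas under both the Receiver Operating Characteristic (ROC) curve and the Precision-Recall curve above 0.99, clearly outperforming the other two. Although challenges remain in capturing subtle nuances and in computational efficiency, its performance points to its potential for improving the cross-lingual semantic interoperability of survey data.

Link: https://arxiv.org/abs/2412.03817
Authors: Sunghoon Kang,Hyeoneui Kim,Hyewon Park,Ricky Taira
Keywords: Person-Generated Health Data, NIH CDE Repository, health survey questions, Health Data, Korean public health
Categories: Computation and Language (cs.CL)
Comments: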

Abstract:The goal of this work was to compute the semantic similarity among publicly available health survey questions in order to facilitate the standardization of survey-based Person-Generated Health Data (PGHD). We compiled various health survey questions authored in both English and Korean from the NIH CDE Repository, PROMIS, Korean public health agencies, and academic publications. Questions were drawn from various health lifelog domains. A randomized question pairing scheme was used to generate a Semantic Text Similarity (STS) dataset consisting of 1758 question pairs. Similarity scores between each question pair were assigned by two human experts. The tagged dataset was then used to build three classifiers featuring: Bag-of-Words, SBERT with BERT-based embeddings, and SBERT with LaBSE embeddings. The algorithms were evaluated using traditional contingency statistics. Among the three algorithms, SBERT-LaBSE demonstrated the highest performance in assessing question similarity across both languages, achieving an Area Under the Receiver Operating Characteristic (ROC) and Precision-Recall Curves of over 0.99. Additionally, it proved effective in identifying cross-lingual semantic similarities. The SBERT-LaBSE algorithm excelled at aligning semantically equivalent sentences across both languages but encountered challenges in capturing subtle nuances and maintaining computational efficiency. Future research should focus on testing with larger multilingual datasets and on calibrating and normalizing scores across the health lifelog domains to improve consistency. This study introduces the SBERT-LaBSE algorithm for calculating semantic similarity across two languages, showing it outperforms BERT-based models and the Bag of Words approach, highlighting its potential to improve semantic interoperability of survey-based PGHD across language barriers.
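
Once two questions are embedded, redundancy detection reduces to comparing vectors. The sketch below assumes cosine similarity with a fixed threshold, which is the usual choice for SBERT-style embeddings but is not confirmed by the abstract. The toy vectors, the 0.8 threshold, and the function names are illustrative assumptions; real LaBSE embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_redundant(u, v, threshold=0.8):
    """Flag a question pair as semantically redundant (hypothetical rule)."""
    return cosine_similarity(u, v) >= threshold


english_q = [0.9, 0.1, 0.0]   # toy embedding of an English question
korean_q = [0.8, 0.2, 0.0]    # toy embedding of its Korean counterpart
unrelated_q = [0.0, 0.1, 0.9]
```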

[NLP-42] Synergizing LLMs and Knowledge Graphs: A Novel Approach to Software Repository-Related Question Answering

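Quick read: This paper addresses the efficiency and accuracy of extracting insights from software repository data. The key to the solution is building a knowledge graph and combining it with a large language model (LLM) to improve the accuracy of natural-language question answering. The approach has two steps: (1) construct a knowledge graph from the repository data; (2) combine the knowledge graph with an LLM so that natural-language questions can be understood and the relevant data retrieved accurately. Evaluated on five popular open-source projects, the approach reached an initial accuracy of 65%, which few-shot chain-of-thought prompting raised to 84%, demonstrating the effectiveness of combining LLMs with knowledge graphs.

Link: https://arxiv.org/abs/2412.03815
Authors: Samuel Abedu,SayedHassan Khatoonabadi,Emad Shihab
Keywords: development process, valuable information, information for gaining, gaining insights, repository data
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to ACM Transactions on Software Engineering and Methodology for review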

Abstract:Software repositories contain valuable information for gaining insights into their development process. However, extracting insights from these repository data is time-consuming and requires technical expertise. While software engineering chatbots have been developed to facilitate natural language interactions with repositories, they struggle with understanding natural language and accurately retrieving relevant data. This study aims to improve the accuracy of LLM-based chatbots in answering repository-related questions by augmenting them with knowledge graphs. We achieve this in a two-step approach: (1) constructing a knowledge graph from the repository data and (2) synergizing the knowledge graph with an LLM to allow for natural language questions and answers. We curated a set of 20 questions with different complexities and evaluated our approach on five popular open-source projects. Our approach achieved an accuracy of 65%. We further investigated the limitations and identified six key issues, with the majority relating to the reasoning capability of the LLM. We experimented with few-shot chain-of-thought prompting to determine if it could enhance our approach. This technique improved the overall accuracy to 84%. Our findings demonstrate the synergy between LLMs and knowledge graphs as a viable solution for making repository data accessible to both technical and non-technical stakeholders.

[NLP-43] Agent AI with LangGraph: A Modular Framework for Enhancing Machine Translation Using Large Language Models

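Quick read: This paper addresses automation and efficiency in machine translation (MT), specifically using Agent AI and the LangGraph framework to improve the accuracy and scalability of multilingual translation. The key to the solution is twofold: 1) modular agents such as TranslateEnAgent, TranslateFrenchAgent, and TranslateJpAgent, each dedicated to translation for a specific language and using large language models (LLMs) such as GPT-4o to produce accurate, contextually relevant translations; 2) LangGraph, a graph-based framework built on LangChain, which simplifies and automates the management and collaboration of agents and supports dynamic state management and the automation of complex workflows, yielding an efficient translation pipeline. With this modular and automated design, the paper points to further innovation in multilingual translation.

Link: https://arxiv.org/abs/2412.03801
Authors: Jialin Wang,Zhihua Duan
Keywords: explores the transformative, transformative role, advancing the automation, automation and effectiveness, Agents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: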

Abstract:This paper explores the transformative role of Agent AI and LangGraph in advancing the automation and effectiveness of machine translation (MT). Agents are modular components designed to perform specific tasks, such as translating between particular languages, with specializations like TranslateEnAgent, TranslateFrenchAgent, and TranslateJpAgent for English, French, and Japanese translations, respectively. These agents leverage the powerful semantic capabilities of large language models (LLMs), such as GPT-4o, to ensure accurate, contextually relevant translations while maintaining modularity, scalability, and context retention. LangGraph, a graph-based framework built on LangChain, simplifies the creation and management of these agents and their workflows. It supports dynamic state management, enabling agents to maintain dialogue context and automates complex workflows by linking agents and facilitating their collaboration. With flexibility, open-source community support, and seamless integration with LLMs, LangGraph empowers agents to deliver high-quality translations. Together, Agent AI and LangGraph create a cohesive system where LangGraph orchestrates agent interactions, ensuring that user inputs are analyzed, routed, and processed efficiently. Experimental results demonstrate the potential of this system to enhance multilingual translation accuracy and scalability. By highlighting modular design and automated workflows, this paper sets the stage for further innovations in intelligent machine translation services.

[NLP-44] The broader spectrum of in-context learning

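Quick read: This paper addresses how to understand few-shot in-context learning in language models (supervised few-shot learning) within a broader meta-learning framework. The key is a perspective under which any distribution of sequences in which context non-trivially decreases loss on subsequent predictions can be interpreted as eliciting a kind of in-context learning. This perspective helps unify the in-context abilities that language models exhibit, such as adapting to tasks from instructions or role play, or extrapolating time series. It also sheds light on potential roots of in-context learning in lower-level processing of linguistic dependencies (e.g. coreference or parallel structures) and highlights the importance of generalization: the ability to learn something novel, flexibility in learning from different presentations, and the ability to apply what is learned. The paper also discusses broader connections to meta-learning and goal-conditioned agents, and suggests that research on in-context learning should consider this broader spectrum of in-context capabilities and types of generalization.

Link: https://arxiv.org/abs/2412.03782
Authors: Andrew Kyle Lampinen,Stephanie C. Y. Chan,Aaditya K. Singh,Murray Shanahan
Keywords: generated substantial interest, substantial interest, in-context learning, generated substantial, in-context
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: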

Abstract:The ability of language models to learn a task from a few examples in context has generated substantial interest. Here, we provide a perspective that situates this type of supervised few-shot learning within a much broader spectrum of meta-learned in-context learning. Indeed, we suggest that any distribution of sequences in which context non-trivially decreases loss on subsequent predictions can be interpreted as eliciting a kind of in-context learning. We suggest that this perspective helps to unify the broad set of in-context abilities that language models exhibit, such as adapting to tasks from instructions or role play, or extrapolating time series. This perspective also sheds light on potential roots of in-context learning in lower-level processing of linguistic dependencies (e.g. coreference or parallel structures). Finally, taking this perspective highlights the importance of generalization, which we suggest can be studied along several dimensions: not only the ability to learn something novel, but also flexibility in learning from different presentations, and in applying what is learned. We discuss broader connections to past literature in meta-learning and goal-conditioned agents, and other perspectives on learning and adaptation. We close by suggesting that research on in-context learning should consider this broader spectrum of in-context capabilities and types of generalization.

[NLP-45] WithdrarXiv: A Large-Scale Dataset for Retraction Study

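Quick read: This paper addresses the scarcity of systematic studies of retractions in computer science and other STEM fields. The key to the solution is WithdrarXiv, the first large-scale dataset of withdrawn arXiv papers, containing over 14,000 papers and their associated retraction comments and spanning arXiv's entire history through September 2024. From a careful analysis of author comments, the paper develops a comprehensive taxonomy of retraction reasons with 10 distinct categories, ranging from critical errors to policy violations. It also demonstrates zero-shot automatic categorization of retraction reasons with a weighted average F1-score of 0.96. The paper further releases WithdrarXiv-SciFy, an enriched version including scripts for parsed full-text PDFs, intended to support research on scientific feasibility studies, claim verification, and automated theorem proving. These results offer insights for improving scientific quality control and automated verification systems; the paper also discusses ethical issues in the data release and takes steps to foster open science in this area.

Link: https://arxiv.org/abs/2412.03775
Authors: Delip Rao,Jonathan Young,Thomas Dietterich,Chris Callison-Burch
Keywords: STEM fields remain, fields remain scarce, STEM fields, maintaining scientific integrity, remain scarce
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: 11 pages, 5 figures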

Abstract:Retractions play a vital role in maintaining scientific integrity, yet systematic studies of retractions in computer science and other STEM fields remain scarce. We present WithdrarXiv, the first large-scale dataset of withdrawn papers from arXiv, containing over 14,000 papers and their associated retraction comments spanning the repository’s entire history through September 2024. Through careful analysis of author comments, we develop a comprehensive taxonomy of retraction reasons, identifying 10 distinct categories ranging from critical errors to policy violations. We demonstrate a simple yet highly accurate zero-shot automatic categorization of retraction reasons, achieving a weighted average F1-score of 0.96. Additionally, we release WithdrarXiv-SciFy, an enriched version including scripts for parsed full-text PDFs, specifically designed to enable research in scientific feasibility studies, claim verification, and automated theorem proving. These findings provide valuable insights for improving scientific quality control and automated verification systems. Finally, and most importantly, we discuss ethical issues and take a number of steps to implement responsible data release while fostering open science in this area.
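
The reported metric is a weighted average F1-score: per-class F1 values averaged with weights proportional to each class's number of true instances. A minimal sketch of that computation follows; the category labels in the example are made up for illustration and are not the paper's taxonomy.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical labels, for illustration only.
gold = ["error", "error", "policy", "policy"]
pred = ["error", "policy", "policy", "policy"]
example_score = weighted_f1(gold, pred)
```

Weighting by support means frequent withdrawal reasons dominate the score, so a high weighted F1 does not by itself guarantee good performance on rare categories.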

[NLP-46] Language Model Meets Prototypes: Towards Interpretable Text Classification Models through Prototypical Networks AAAI25

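Quick read: This paper addresses the lack of interpretability of pretrained transformer language models (LMs), which otherwise perform very well on natural language processing (NLP) tasks. The key to the solution is developing intrinsically interpretable models while maintaining high performance. Specifically, the paper proposes a prototype-network-based approach that captures sentiment incongruity to improve both the interpretability and accuracy of sarcasm detection, and designs a novel white-box multi-head graph attention-based prototype network that explains the decisions of text classification models without sacrificing the accuracy of the original black-box LMs. It also explores extending the attention-based prototype network with contrastive learning to redesign an interpretable graph neural network, aiming to improve both the interpretability and performance of document classification models.

Link: https://arxiv.org/abs/2412.03761
Authors: Ximing Wen
Keywords: Pretrained transformer-based Language, transformer-based Language Models, achieve significant improvement, NLP tasks, Pretrained transformer-based
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 2 pages, 1 figure, accepted by AAAI25 DC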

Abstract:Pretrained transformer-based Language Models (LMs) are well-known for their ability to achieve significant improvement on NLP tasks, but their black-box nature, which leads to a lack of interpretability, has been a major concern. My dissertation focuses on developing intrinsically interpretable models when using LMs as encoders while maintaining their superior performance via prototypical networks. I initiated my research by investigating enhancements in performance for interpretable models of sarcasm detection. My proposed approach focuses on capturing sentiment incongruity to enhance accuracy while offering instance-based explanations for the classification decisions. Later, I developed a novel white-box multi-head graph attention-based prototype network designed to explain the decisions of text classification models without sacrificing the accuracy of the original black-box LMs. In addition, I am working on extending the attention-based prototype network with contrastive learning to redesign an interpretable graph neural network, aiming to enhance both the interpretability and performance of the model in document classification.

[NLP-47] Domain-specific Question Answering with Hybrid Search

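[Quick Read]: This paper addresses the unique challenges of domain-specific question answering with a hybrid approach that combines a fine-tuned dense retriever with keyword-based sparse search. The key is a linear combination of relevance signals (cosine similarity from dense retrieval, BM25 scores, and URL host matching), each with a tunable boost parameter. Experiments show that the hybrid method outperforms the authors' single-retriever system, improving accuracy while maintaining robust contextual grounding, which makes it suitable for domain-specific question answering in enterprise settings.

Link: https://arxiv.org/abs/2412.03736
Authors: Dewang Sultania,Zhaoyu Lu,Twisha Naik,Franck Dernoncourt,David Seunghyun Yoon,Sanat Sharma,Trung Bui,Ashok Gupta,Tushar Vatsa,Suhas Suresha,Ishita Verma,Vibha Belavadi,Cheng Chen,Michael Friedrich
Keywords: address unique challenges, requires specialized solutions, unique challenges, Domain specific question, evolving field
Categories: Computation and Language (cs.CL)
Comments: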

Abstract:Domain specific question answering is an evolving field that requires specialized solutions to address unique challenges. In this paper, we show that a hybrid approach combining a fine-tuned dense retriever with keyword based sparse search methods significantly enhances performance. Our system leverages a linear combination of relevance signals, including cosine similarity from dense retrieval, BM25 scores, and URL host matching, each with tunable boost parameters. Experimental results indicate that this hybrid method outperforms our single-retriever system, achieving improved accuracy while maintaining robust contextual grounding. These findings suggest that integrating multiple retrieval methodologies with weighted scoring effectively addresses the complexities of domain specific question answering in enterprise settings.
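
To make the weighted scoring concrete, here is a minimal sketch of a linear combination of the three signals named in the abstract. The weight values, the `hybrid_score` function, the toy documents, and the assumption that BM25 scores are already normalized to [0, 1] are all illustrative and not taken from the paper.

```python
import math
from urllib.parse import urlparse


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_score(query_vec, doc, bm25_score, preferred_host,
                 w_dense=1.0, w_bm25=0.5, w_host=0.2):
    """Linear combination of three relevance signals with tunable boosts."""
    dense = cosine(query_vec, doc["vec"])
    # URL host matching: 1.0 if the document lives on the preferred host.
    host_match = 1.0 if urlparse(doc["url"]).netloc == preferred_host else 0.0
    return w_dense * dense + w_bm25 * bm25_score + w_host * host_match


# Toy data: embeddings and BM25 scores are made up for illustration.
docs = [
    {"url": "https://help.example.com/a", "vec": [1.0, 0.0]},
    {"url": "https://blog.example.org/b", "vec": [0.6, 0.8]},
]
bm25_scores = [0.2, 0.9]  # assumed already normalized to [0, 1]
query_vec = [1.0, 0.0]

scores = [hybrid_score(query_vec, d, s, "help.example.com")
          for d, s in zip(docs, bm25_scores)]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this toy case the first document wins on dense similarity and host match even though the second has the higher BM25 score; changing the boosts changes the ranking, which is what tuning them would exploit.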

[NLP-48] From Language Models over Tokens to Language Models over Characters

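[Quick Read]: This paper addresses the programming challenges that arise because modern language models are distributions over token strings rather than character strings. In particular, a prompt given as a character string must be tokenized before it is passed to the token-level model. The key to the solution is a set of algorithms, both exact and approximate, for converting token-level language models into character-level ones. With these algorithms, the paper approximates the character-level distribution of the Llama 3.1 8B language model accurately (less than 0.00021 excess bits / character) and at a reasonable speed (46.3 characters / second), even with a small computation budget.

Link: https://arxiv.org/abs/2412.03719
Authors: Tim Vieira,Ben LeBrun,Mario Giulianelli,Juan Luis Gastaldi,Brian DuSell,John Terilla,Timothy J. O’Donnell,Ryan Cotterell
Keywords: posing numerous challenges, Modern language models, programmers building user, building user applications, Modern language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: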

Abstract:Modern language models are internally – and mathematically – distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent analyses are very sensitive to the specification of the prompt (e.g., if the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. We find that – even with a small computation budget – our method is able to accurately approximate the character-level distribution (less than 0.00021 excess bits / character) at reasonably fast speeds (46.3 characters / second) on the Llama 3.1 8B language model.
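
The relationship between the two kinds of distribution can be illustrated with a brute-force sketch: the probability of a character string is the total probability of all token sequences that spell it. The toy `token_lm` table and the function names below are hypothetical, and this enumeration is only feasible for a tiny finite model; it is not the paper's exact or approximate algorithm.

```python
from collections import defaultdict

# Hypothetical toy token-level model: probability of each complete token sequence.
token_lm = {
    ("he", "llo"): 0.4,
    ("hell", "o"): 0.1,   # a different tokenization of the same characters
    ("h", "i"): 0.3,
    ("hi",): 0.2,
}


def char_string_probs(lm):
    """Marginalize over tokenizations: sum sequences that spell the same string."""
    probs = defaultdict(float)
    for tokens, p in lm.items():
        probs["".join(tokens)] += p
    return dict(probs)


def char_prefix_prob(lm, prefix):
    """Probability that the generated character string starts with `prefix`."""
    return sum(p for tokens, p in lm.items() if "".join(tokens).startswith(prefix))


def next_char_dist(lm, prefix):
    """Conditional distribution over the next character given a character prefix."""
    total = char_prefix_prob(lm, prefix)
    if total == 0:
        return {}
    dist = defaultdict(float)
    for tokens, p in lm.items():
        text = "".join(tokens)
        if text.startswith(prefix) and len(text) > len(prefix):
            dist[text[len(prefix)]] += p / total
    return dict(dist)


string_probs = char_string_probs(token_lm)
after_h = next_char_dist(token_lm, "h")
```

Note that the prefix "h" is handled without committing to any tokenization of it, which is the sensitivity (for example, to a trailing space) that the abstract describes.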

[NLP-49] Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

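[Quick Read]: This paper addresses the lack of effective approaches for improving the response quality of vision-language models (VLMs) by scaling inference-time computation. The key to the solution is the Vision Value Model (VisVM), which guides the VLM's inference-time search toward responses with better visual comprehension. VisVM not only evaluates the quality of the sentence generated in the current search step but also anticipates the quality of the subsequent sentences that step may lead to, thereby providing a long-term value. In this way, it steers VLMs away from sentences prone to hallucination or insufficient detail. Experiments show that VisVM-guided search significantly improves VLMs' ability to generate descriptive captions with richer visual detail and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. In addition, self-training with VisVM-guided captions improves VLM performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs.

Link: https://arxiv.org/abs/2412.03704
Authors: Wang Xiyao,Yang Zhengyuan,Li Linjie,Lu Hongjin,Xu Yuancheng,Lin Chung-Ching Lin,Lin Kevin,Huang Furong,Wang Lijuan
Keywords: lacks effective approaches, scaling inference-time computation, significant advancements, advancements in vision-language, lacks effective
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: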

Abstract:Despite significant advancements in vision-language models (VLMs), effective approaches to enhance response quality by scaling inference-time computation are lacking. This capability is known to be a core step towards the self-improving models in recent large language model studies. In this paper, we present Vision Value Model (VisVM) that can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs’ ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves VLM’s performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at this https URL.
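
The sentence-level search described above can be sketched as a loop that samples several candidate sentences per step and keeps the one a value function rates highest. The `propose` and `value` callables, the candidate table, and the scores are hypothetical stand-ins for a VLM sampler and a trained value model; the sketch shows only the control flow, not the authors' implementation.

```python
def value_guided_search(propose, value, max_steps, n_candidates):
    """Build a response sentence by sentence, keeping the highest-value candidate."""
    response = []
    for _ in range(max_steps):
        candidates = propose(response, n_candidates)
        if not candidates:
            break  # the generator has nothing more to say
        best = max(candidates, key=lambda sentence: value(response, sentence))
        response.append(best)
    return response


# Toy stand-ins (assumptions): fixed candidates per step and a lookup-table value.
CANDIDATES = [
    ["A dog sits on grass.", "A cat flies a plane."],
    ["It wears a red collar.", "It is nice."],
]
VALUES = {
    "A dog sits on grass.": 0.9,
    "A cat flies a plane.": 0.1,    # hallucination-prone, low value
    "It wears a red collar.": 0.8,
    "It is nice.": 0.3,             # too little detail, low value
}


def toy_propose(prefix, n):
    step = len(prefix)
    return CANDIDATES[step][:n] if step < len(CANDIDATES) else []


def toy_value(prefix, sentence):
    return VALUES[sentence]


caption = value_guided_search(toy_propose, toy_value, max_steps=5, n_candidates=2)
```

In the paper's setting the value would also reflect the expected quality of later sentences, so a candidate that looks fine now but leads toward hallucination would be scored down.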

[NLP-50] Acquired TASTE: Multimodal Stance Detection with Textual and Structural Embeddings COLING

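[Quick Read]: This paper addresses the role of conversational context in stance detection, in particular how to fuse conversational-structure information with textual representations to improve detection. The key to the solution is TASTE, a multimodal architecture that combines Transformer-based content embeddings with unsupervised structural embeddings and fuses in the social embedding through a Gated Residual Network (GRN) layer, capturing the complex interplay between content and conversational structure. The approach achieves state-of-the-art results on common benchmarks and underscores the importance of using content and structure together for stance detection.

Link: https://arxiv.org/abs/2412.03681
Authors: Guy Barel,Oren Tsur,Dan Volenchik
Keywords: Stance detection plays, Stance detection, downstream applications, scientific facts, plays a pivotal
Categories: Computation and Language (cs.CL)
Comments: The modified camera ready version will be published in January 2025 at COLING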

Abstract:Stance detection plays a pivotal role in enabling an extensive range of downstream applications, from discourse parsing to tracing the spread of fake news and the denial of scientific facts. While most stance classification models rely on textual representation of the utterance in question, prior work has demonstrated the importance of the conversational context in stance detection. In this work we introduce TASTE – a multimodal architecture for stance detection that harmoniously fuses Transformer-based content embedding with unsupervised structural embedding. Through the fine-tuning of a pretrained transformer and the amalgamation with social embedding via a Gated Residual Network (GRN) layer, our model adeptly captures the complex interplay between content and conversational structure in determining stance. TASTE achieves state-of-the-art results on common benchmarks, significantly outperforming an array of strong baselines. Comparative evaluations underscore the benefits of social grounding – emphasizing the criticality of concurrently harnessing both content and structure for enhanced stance detection.

[NLP-51] Evaluating Language Models as Synthetic Data Generators

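[Quick Read]: This paper addresses the lack of a systematic comparison of different language models (LMs) as data generators for LM post-training. The key to the solution is AgoraBench, a benchmark that provides standardized settings and metrics for comparing LMs' data generation abilities. By synthesizing 1.26 million training instances with 6 LMs and training 99 student models, the study finds that LMs have distinct strengths (some are better at generating new problems, others at enhancing existing ones), that intrinsic features of data quality such as response quality, perplexity, and instruction difficulty collectively indicate data generation ability better than problem-solving ability does, and that the choice of output format and cost-conscious model selection significantly affect data generation effectiveness.

Link: https://arxiv.org/abs/2412.03679
Authors: Seungone Kim,Juyoung Suk,Xiang Yue,Vijay Viswanathan,Seongyun Lee,Yizhong Wang,Kiril Gashteovski,Carolin Lawrence,Sean Welleck,Graham Neubig
Keywords: generate high-quality data, solve problems directly, data generation, generate high-quality, data
Categories: Computation and Language (cs.CL)
Comments: Work in Progress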

Abstract:Given the increasing use of synthetic data in language model (LM) post-training, an LM’s ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs’ data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs’ data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM’s data generation ability doesn’t necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality (including response quality, perplexity, and instruction difficulty) collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.

[NLP-52] Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis ECCV2024

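[Quick Read]: This paper evaluates how Multimodal LLMs perform on image captioning and whether they can supplant traditional image captioning networks. The key is to assess both their zero-shot capabilities and their adaptability to different semantic domains through fine-tuning methods (prompt learning, prefix tuning, and low-rank adaptation). The results show that although Multimodal LLMs achieve impressive zero-shot performance, fine-tuning them for specific domains while keeping their generalization capabilities intact remains challenging.

Link: https://arxiv.org/abs/2412.03665
Authors: Davide Bucciarelli,Nicholas Moratelli,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Keywords: generate natural language, Large Language Models, natural language descriptions, Multimodal LLMs, image captioning demands
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: ECCV 2024 Workshop on Green Foundation Models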

Abstract:The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs – like GPT-4V and Gemini – which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks. We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods, including prompt learning, prefix tuning, and low-rank adaptation. Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning for specific domains while maintaining their generalization capabilities intact remains challenging. We discuss the implications of these findings for future research in image captioning and the development of more adaptable Multimodal LLMs.

[NLP-53] Multimodal Sentiment Analysis Based on BERT and ResNet

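[Quick Read]: This paper addresses the insufficient fusion of text and image features in multimodal sentiment analysis. The key to the solution is a multimodal sentiment analysis framework that combines BERT and ResNet. BERT extracts the text feature vector and ResNet extracts the image feature representation; after exploring several feature fusion strategies, an attention-based fusion model is selected to make full use of the complementary information between text and image. On the public dataset MAVA-single, the multimodal model improves accuracy and F1 score over single-modal models that use only BERT or ResNet, reaching a best accuracy of 74.5%.

Link: https://arxiv.org/abs/2412.03625
Authors: JiaLe Ren
Keywords: Internet and social, sentiment analysis tasks, multimodal sentiment analysis, sentiment analysis, social media
Categories: Computation and Language (cs.CL)
Comments: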

Abstract:With the rapid development of the Internet and social media, multi-modal data (text and image) is increasingly important in sentiment analysis tasks. However, existing methods struggle to effectively fuse text and image features, which limits the accuracy of analysis. To solve this problem, a multimodal sentiment analysis framework combining BERT and ResNet is proposed. BERT has shown strong text representation ability in natural language processing, and ResNet has excellent image feature extraction performance in the field of computer vision. Firstly, BERT is used to extract the text feature vector, and ResNet is used to extract the image feature representation. Then, a variety of feature fusion strategies are explored, and finally the fusion model based on attention mechanism is selected to make full use of the complementary information between text and image. Experimental results on the public dataset MAVA-single show that compared with the single-modal models that only use BERT or ResNet, the proposed multi-modal model improves the accuracy and F1 score, reaching the best accuracy of 74.5%. This study not only provides new ideas and methods for multimodal sentiment analysis, but also demonstrates the application potential of BERT and ResNet in cross-domain fusion. In the future, more advanced feature fusion techniques and optimization strategies will be explored to further improve the accuracy and generalization ability of multimodal sentiment analysis.
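
The abstract does not specify the attention mechanism used for fusion, so the following is only one plausible minimal form: a query vector scores each modality's feature, a softmax turns the scores into weights, and the fused feature is the weighted sum. The `attention_fuse` function, the query vector, and the assumption that both features are already projected to the same dimension are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(text_feat, image_feat, query):
    """Weight two same-sized modality features by scaled dot-product attention."""
    feats = [text_feat, image_feat]
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in feats]
    weights = softmax(scores)
    fused = [sum(w * feat[i] for w, feat in zip(weights, feats))
             for i in range(len(query))]
    return fused, weights


# Toy features standing in for a BERT text vector and a ResNet image vector.
text_feat = [1.0, 0.0]
image_feat = [0.0, 1.0]
query = [1.0, 0.0]  # hypothetical learned query; here it favors the text feature

fused, weights = attention_fuse(text_feat, image_feat, query)
```

In a trained model the query (or the projections producing the scores) would be learned, so the weighting between text and image can differ per example instead of being fixed as in plain concatenation.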

[NLP-54] How to Correctly do Semantic Backpropagation on Language-based Agentic Systems

[Quick Read]: This paper addresses improper feedback assignment in Graph-based Agentic System Optimization (GASO), that is, how to assign feedback to a system's components given feedback on the system's output. The key to the solution is formalizing semantic backpropagation with semantic gradients, which exploits the relationship among nodes with a common successor to compute directional information about how changes to each component might improve the system's output. These gradients are used by a method called semantic gradient descent, which outperforms existing state-of-the-art methods on BIG-Bench Hard and GSM8K.

Link: https://arxiv.org/abs/2412.03624
Authors: Wenyi Wang,Hisham A. Alyahya,Dylan R. Ashley,Oleg Serikov,Dmitrii Khizbullin,Francesco Faccio,Jürgen Schmidhuber
Keywords: challenging real-world tasks, Language-based agentic systems, shown great promise, Language-based agentic, solving small-scale research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Comments: 11 pages in main text + 2 pages of references + 15 pages of appendices, 2 figures in main text + 17 figures in appendices, 2 tables in main text + 1 table in appendices, 2 algorithms in main text; source code available at this https URL

Abstract:Language-based agentic systems have shown great promise in recent years, transitioning from solving small-scale research problems to being deployed in challenging real-world tasks. However, optimizing these systems often requires substantial manual labor. Recent studies have demonstrated that these systems can be represented as computational graphs, enabling automatic optimization. Despite these advancements, most current efforts in Graph-based Agentic System Optimization (GASO) fail to properly assign feedback to the system’s components given feedback on the system’s output. To address this challenge, we formalize the concept of semantic backpropagation with semantic gradients – a generalization that aligns several key optimization techniques, including reverse-mode automatic differentiation and the more recent TextGrad by exploiting the relationship among nodes with a common successor. This serves as a method for computing directional information about how changes to each component of an agentic system might improve the system’s output. To use these gradients, we propose a method called semantic gradient descent which enables us to solve GASO effectively. Our results on both BIG-Bench Hard and GSM8K show that our approach outperforms existing state-of-the-art methods for solving GASO problems. A detailed ablation study on the LIAR dataset demonstrates the parsimonious nature of our method. A full copy of our implementation is publicly available at this https URL

[NLP-55] CBEval: A framework for evaluating and interpreting cognitive biases in LLMs

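[Quick Read]: This paper addresses the cognitive biases and reasoning limitations of large language models (LLMs). The key to the solution is a framework that interprets, explains, and provides insight into cognitive biases in LLMs by constructing influence graphs that identify the phrases and words most responsible for those biases. The paper further investigates biases such as round number bias and the cognitive bias barrier, which are revealed when observing the framing effect in language models.

Link: https://arxiv.org/abs/2412.03605
Authors: Ammar Shaikh,Raj Abhijit Dandekar,Sreedath Panat,Rajat Dandekar
Keywords: Rapid advancements, advancements in Large, Large Language models, Large Language, significantly enhanced
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: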

Abstract:Rapid advancements in Large Language Models (LLMs) have significantly enhanced their reasoning capabilities. Despite improved performance on benchmarks, LLMs exhibit notable gaps in their cognitive processes. Additionally, as reflections of human-generated data, these models have the potential to inherit cognitive biases, raising concerns about their reasoning and decision making capabilities. In this paper, we present a framework to interpret, understand and provide insights into a host of cognitive biases in LLMs. Conducting our research on frontier language models, we’re able to elucidate reasoning limitations and biases, and provide reasoning behind these biases by constructing influence graphs that identify phrases and words most responsible for biases manifested in LLMs. We further investigate biases such as round number bias and cognitive bias barrier revealed when noting framing effect in language models.

[NLP-56] CPTQuant – A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models

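[Quick Read]: This paper addresses the substantial memory and computational requirements of large language models. The key to the solution is CPTQuant, a comprehensive strategy comprising three mixed precision techniques: correlation-based (CMPQ), pruning-based (PMPQ), and Taylor decomposition-based (TDMPQ). CMPQ adapts the precision level based on canonical correlation analysis of different layers, PMPQ sets precision layer-wise according to each layer's sensitivity to sparsity, and TDMPQ uses Taylor decomposition to assess each layer's sensitivity to input perturbation. More sensitive layers receive higher precision and more robust layers lower precision. Experiments on BERT and several OPT models show up to 4x compression and a 2x increase in efficiency with minimal accuracy drop compared to Hugging Face FP16.

Link: https://arxiv.org/abs/2412.03599
Authors: Amitash Nanda,Sree Bhargavi Balija,Debashis Sahoo
Keywords: Large language, computational requirements, Large language models, transformed the comprehension, comprehension and generation
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 9 figures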

Abstract:Large language models have transformed the comprehension and generation of natural language tasks, but they come with substantial memory and computational requirements. Quantization techniques have emerged as a promising avenue for addressing these challenges while preserving accuracy and improving energy efficiency. We propose CPTQuant, a comprehensive strategy that introduces correlation-based (CMPQ), pruning-based (PMPQ), and Taylor decomposition-based (TDMPQ) mixed precision techniques. CMPQ adapts the precision level based on canonical correlation analysis of different layers. PMPQ optimizes precision layer-wise based on their sensitivity to sparsity. TDMPQ modifies precision using Taylor decomposition to assess each layer’s sensitivity to input perturbation. These strategies allocate higher precision to more sensitive layers while diminishing precision to robust layers. CPTQuant assesses the performance across BERT, OPT-125M, OPT-350M, OPT-1.3B, and OPT-2.7B. We demonstrate up to 4x compression and a 2x increase in efficiency with minimal accuracy drop compared to Hugging Face FP16. PMPQ stands out for achieving a considerably higher model compression. Sensitivity analyses across various LLMs show that the initial and final 30% of layers exhibit higher sensitivities than the remaining layers. PMPQ demonstrates an 11% higher compression ratio than other methods for classification tasks, while TDMPQ achieves a 30% greater compression ratio for language modeling tasks.
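
The shared idea behind the three techniques, giving more bits to more sensitive layers, can be sketched as a simple allocation rule. The sensitivity scores, the two bit-widths, the 50% split, and the equal-layer-size assumption in the compression estimate are all illustrative; how sensitivities are actually computed (CCA, pruning, or Taylor decomposition) is not shown.

```python
def allocate_bits(sensitivities, high_bits=8, low_bits=4, high_fraction=0.5):
    """Assign high precision to the most sensitive layers, low precision to the rest."""
    n = len(sensitivities)
    n_high = round(n * high_fraction)
    ranked = sorted(range(n), key=lambda i: sensitivities[i], reverse=True)
    high = set(ranked[:n_high])
    return [high_bits if i in high else low_bits for i in range(n)]


def compression_ratio(bits, baseline_bits=16):
    """Compression versus a uniform baseline, assuming equally sized layers."""
    return baseline_bits * len(bits) / sum(bits)


# Hypothetical per-layer sensitivities: first and last layers are most sensitive.
sensitivities = [0.9, 0.2, 0.1, 0.8]
bits = allocate_bits(sensitivities)
ratio = compression_ratio(bits)
```

With these toy scores the first and last layers keep 8 bits and the middle layers drop to 4, which mirrors the abstract's observation that the initial and final layers tend to be more sensitive.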

[NLP-57] The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?

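[Quick Read]: This paper addresses the paradox that large language models (LLMs) excel at standardized tests yet fail to demonstrate genuine language understanding and adaptability. Through a systematic analysis of NLP evaluation frameworks, it reveals pervasive vulnerabilities in existing evaluation methods, including benchmark exploitation, dataset contamination, and evaluation bias, which create a false perception of progress in language understanding. The key to the solution is laying the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks. This calls for dynamically adapted evaluation frameworks that reflect LLM performance more accurately and address the limitations of current methods.

Link: https://arxiv.org/abs/2412.03597
Authors: Sourav Banerjee,Ayushi Agarwal,Eishkaran Singh
Keywords: Large Language Models, demonstrate genuine language, rankings in Large, models excel, Large Language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 11 pages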

Abstract:The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum, from basic metrics to complex benchmarks like GLUE and MMLU. These vulnerabilities manifest through benchmark exploitation, dataset contamination, and evaluation bias, creating a false perception of progress in language understanding capabilities. Through extensive review of contemporary evaluation approaches, we identify significant limitations in static benchmark designs, human evaluation protocols, and LLM-as-judge frameworks, all of which compromise the reliability of current performance assessments. As LLM capabilities evolve and existing benchmarks become redundant, we lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks. This requires frameworks that are adapted dynamically, addressing current limitations and providing a more accurate reflection of LLM performance.

[NLP-58] BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

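[Quick Read]: This paper addresses the performance limitations of existing large language model (LLM) inference engines on large batched tasks with prefix sharing. These engines are optimized for streaming requests and handle such batches poorly, leading to low hardware utilization and inefficient memory use. The key to the solution is BatchLLM, which explicitly identifies common prefixes globally and schedules requests that share a prefix together to maximize reuse of the KV context, which also shrinks the lifetime of the common KV memory. In addition, BatchLLM reorders requests so that those with a larger ratio of decoding steps are scheduled first, to better mix decoding tokens with the later prefill chunks, and applies memory-centric token batching to enlarge token-batch sizes and raise GPU utilization. In the evaluation, BatchLLM outperforms vLLM by 1.1x to 2x on a set of microbenchmarks and two typical industry workloads.

Link: https://arxiv.org/abs/2412.03594
Authors: Zhen Zheng,Xin Ji,Taosong Fang,Fanghao Zhou,Chuanjie Liu,Gang Peng
Keywords: requests, performance indictor, prefix sharing characteristic, LLM tasks, prefix sharing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: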

Abstract:Many LLM tasks are performed in large batches or even offline, and the performance indicator for these tasks is throughput. These tasks usually show the characteristic of prefix sharing, where different prompt inputs partially share a common prefix. However, the existing LLM inference engines tend to optimize the streaming requests and show limitations in supporting the large batched tasks with the prefix sharing characteristic. The existing solutions use the LRU-based cache to reuse the KV context of common prefix. The KV context that is about to be reused may prematurely be evicted with the implicit cache management. Even if not evicted, the lifetime of the shared KV context is extended since requests sharing the same context are not scheduled together, resulting in larger memory usage. These streaming oriented systems schedule the requests in the first-come-first-serve or similar order. As a result, the requests with larger ratio of decoding steps may be scheduled too late to be able to mix with the prefill chunks to increase the hardware utilization. Besides, the token and request number based batching can limit the size of token-batch, which keeps the GPU from saturating for the iterations dominated by decoding tokens. We propose BatchLLM to address the above problems. BatchLLM explicitly identifies the common prefixes globally. The requests sharing the same prefix will be scheduled together to reuse the KV context the best, which also shrinks the lifetime of common KV memory. BatchLLM reorders the requests and schedules the requests with larger ratio of decoding first to better mix the decoding tokens with the latter prefill chunks and applies memory-centric token batching to enlarge the token-batch sizes, which helps to increase the GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM by 1.1x to 2x on a set of microbenchmarks and two typical industry workloads.
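
Two of the scheduling ideas above, grouping requests by shared prefix and putting decode-heavy requests first, can be sketched on a toy batch. The request fields, the fixed `prefix_len` used to detect a shared prefix, and the `est_decode_len` estimate are assumptions for illustration; the paper's global prefix identification and memory-centric token batching are not reproduced here.

```python
from collections import defaultdict


def decode_ratio(req):
    """Share of a request's total tokens that are expected to be decoded."""
    return req["est_decode_len"] / (len(req["prompt"]) + req["est_decode_len"])


def schedule(requests, prefix_len):
    """Group requests by common prefix, then order decode-heavy work first."""
    groups = defaultdict(list)
    for req in requests:
        groups[tuple(req["prompt"][:prefix_len])].append(req)
    # Groups containing the most decode-heavy request are scheduled first.
    ordered_groups = sorted(groups.values(),
                            key=lambda g: max(decode_ratio(r) for r in g),
                            reverse=True)
    return [r["id"] for g in ordered_groups
            for r in sorted(g, key=decode_ratio, reverse=True)]


# Toy batch: prompts are token-id lists; arrival order interleaves two prefixes.
requests = [
    {"id": "r1", "prompt": [1, 2, 3, 10], "est_decode_len": 4},
    {"id": "r2", "prompt": [7, 8, 9, 11], "est_decode_len": 12},
    {"id": "r3", "prompt": [1, 2, 3, 12], "est_decode_len": 1},
    {"id": "r4", "prompt": [7, 8, 9, 13], "est_decode_len": 4},
]

order = schedule(requests, prefix_len=3)
```

A first-come-first-serve scheduler would run r1, r2, r3, r4 and alternate between the two prefixes; the grouped order keeps each prefix's KV context in use for one contiguous stretch.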

[NLP-59] CovidLLM: A Robust Large Language Model with Missing Value Adaptation and Multi-Objective Learning Strategy for Predicting Disease Severity and Clinical Outcomes in COVID-19 Patients

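[Quick Read]: This paper addresses the early identification of disease severity and clinical outcomes in COVID-19 patients, especially in high-risk populations. The key to the solution is to exploit the semantic understanding of large language models (LLMs) through specialized prompts and a multi-objective learning strategy. The study first selects serological indicators that correlate significantly with clinical outcomes and disease severity as model input, and states explicitly in the prompt when a feature's value is missing, avoiding the imputation that traditional models rely on. Under the multi-objective strategy, the model first predicts disease severity and then uses that prediction as the basis for generating the clinical outcome; during fine-tuning, the two objectives influence and improve each other. The experiments are based on the ChatGLM model, and the results demonstrate the effectiveness of LLMs in this task and their potential for further development.

Link: https://arxiv.org/abs/2412.03593
Authors: Shengjun Zhu(1),Siyu Liu(2),Yang Li(3),Qing Lei,Hongyan Hou,Hewei Jiang,Shujuan Guo,Feng Wang,Rongshang Chen,Xionglin Fan,Shengce Tao,Jiaxin Cai((1) School of Mathematics and Statistics, Xiamen University of Technology, Xiamen, China, (2) School of Computer and Information Engineering, Xiamen University of Technology, Xiamen, China, (3) Shanghai Center for Systems Biomedicine, Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Jiao Tong University, Shanghai, China)
Keywords: Coronavirus Disease, deaths worldwide, caused millions, millions of deaths, clinical outcomes
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: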

Abstract:Coronavirus Disease 2019 (COVID-19), which emerged in 2019, has caused millions of deaths worldwide. Although effective vaccines have been developed to mitigate severe symptoms, certain populations, particularly the elderly and those with comorbidities, remain at high risk for severe outcomes and increased mortality. Consequently, early identification of the severity and clinical outcomes of the disease in these patients is vital to prevent adverse prognoses. Although traditional machine learning and deep learning models have been widely employed in this area, the potential of large language models (LLMs) remains largely unexplored. Our research focuses primarily on constructing specialized prompts and adopting multi-objective learning strategies. We started by selecting serological indicators that significantly correlate with clinical outcomes and disease severity to serve as input data for the model. Blood test samples often contain numerous missing values, and traditional models generally rely on imputation to handle these gaps in the data. In contrast, LLMs offer the advantage of robust semantic understanding. By setting prompts, we can explicitly inform the model when a feature’s value is missing, without the need for imputation. For the multi-objective learning strategy, the model is designed to first predict disease severity and then predict clinical outcomes. Given that LLMs utilize both the input text and the generated tokens as input for generating the next token, the predicted severity is used as a basis for generating the clinical outcome. During the fine-tuning of the LLM, the two objectives influence and improve each other. Our experiments were implemented based on the ChatGLM model. The results demonstrate the effectiveness of LLMs in this task, suggesting promising potential for further development.
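
As a rough illustration of handling missing values in the prompt instead of imputing them, the sketch below renders each indicator as text and marks absent ones explicitly. The `build_prompt` function, the indicator names, the values, and the prompt wording are hypothetical; the paper's actual prompt template is not given in the abstract.

```python
def build_prompt(indicators):
    """Render indicators as text, stating explicitly when a value is missing."""
    lines = []
    for name, value in indicators.items():
        if value is None:
            lines.append(f"- {name}: missing (not measured)")
        else:
            lines.append(f"- {name}: {value}")
    return (
        "Patient serological indicators:\n"
        + "\n".join(lines)
        # Severity is requested first so it can condition the outcome prediction.
        + "\nFirst state the disease severity, then the clinical outcome."
    )


# Hypothetical example record with one missing measurement.
record = {"CRP (mg/L)": 12.5, "IL-6 (pg/mL)": None, "Lymphocytes (10^9/L)": 0.8}
prompt = build_prompt(record)
```

Asking for severity before outcome matches the ordering the abstract describes: because the model generates tokens left to right, the severity it has just produced becomes part of the context for the outcome.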

[NLP-60] Using Images to Find Context-Independent Word Representations in Vector Space

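[Quick Read]: This paper addresses the reliance of existing word vector methods on textual context to capture semantic relationships. The key to the solution is a novel method that uses dictionary meanings and image depictions to derive word vectors independently of any context. An auto-encoder is applied to the word images to find meaningful representations, which are then used to calculate the word vectors. The method performs comparably to context-based methods on word similarity, concept categorization, and outlier detection tasks while taking much less training time.

Link: https://arxiv.org/abs/2412.03592
Authors: Harsh Kumar
Keywords: find semantic relationships, rely on capturing, semantic relationships, proposed to find, text to find
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: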

Abstract:Many methods have been proposed to find vector representation for words, but most rely on capturing context from the text to find semantic relationships between these vectors. We propose a novel method of using dictionary meanings and image depictions to find word vectors independent of any context. We use auto-encoder on the word images to find meaningful representations and use them to calculate the word vectors. We finally evaluate our method on word similarity, concept categorization and outlier detection tasks. Our method performs comparably to context-based methods while taking much less training time.

[NLP-61] Enhancing Document AI Data Generation Through Graph-Based Synthetic Layouts

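[Quick Read]: This paper addresses the scarcity of high-quality labeled data and the data privacy concerns that constrain the training of Document AI models. The key to the solution is a method for generating synthetic document layouts with Graph Neural Networks (GNNs). Document elements (such as text blocks, images, and tables) are represented as nodes in a graph and their spatial relationships as edges, and GNNs are trained to generate realistic and diverse layouts. Graph-based learning ensures structural coherence and semantic consistency, addressing the inability of traditional augmentation techniques to capture complex document layouts. The experiments report significant performance improvements on document classification, named entity recognition (NER), and information extraction, and the paper proposes solutions to mitigate domain adaptation issues between synthetic and real-world data.

Link: https://arxiv.org/abs/2412.03590
Authors: Amit Agarwal,Hitesh Patel,Priyaranjan Pattnayak,Srikant Panda,Bhargava Kumar,Tejaswini Kumar
Keywords: data privacy concerns, Graph Neural Networks, access to high-quality, primarily due, privacy concerns
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Published in IJERT, Volume 13, Issue 10 (October 2024)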

Abstract:The development of robust Document AI models has been constrained by limited access to high-quality, labeled datasets, primarily due to data privacy concerns, scarcity, and the high cost of manual annotation. Traditional methods of synthetic data generation, such as text and image augmentation, have proven effective for increasing data diversity but often fail to capture the complex layout structures present in real world documents. This paper proposes a novel approach to synthetic document layout generation using Graph Neural Networks (GNNs). By representing document elements (e.g., text blocks, images, tables) as nodes in a graph and their spatial relationships as edges, GNNs are trained to generate realistic and diverse document layouts. This method leverages graph-based learning to ensure structural coherence and semantic consistency, addressing the limitations of traditional augmentation techniques. The proposed framework is evaluated on tasks such as document classification, named entity recognition (NER), and information extraction, demonstrating significant performance improvements. Furthermore, we address the computational challenges of GNN based synthetic data generation and propose solutions to mitigate domain adaptation issues between synthetic and real-world datasets. Our experimental results show that graph-augmented document layouts outperform existing augmentation techniques, offering a scalable and flexible solution for training Document AI models.
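
The graph representation described above (elements as nodes, spatial relationships as edges) can be sketched from bounding boxes alone. The two relation types, the overlap rules, and the toy page below are assumptions chosen for illustration; the paper's edge definitions and the GNN that consumes the graph are not shown.

```python
def build_layout_graph(elements):
    """Turn boxes (x0, y0, x1, y1; y grows downward) into a small relation graph."""
    nodes = {e["id"]: e["type"] for e in elements}
    edges = []
    for a in elements:
        for b in elements:
            if a["id"] == b["id"]:
                continue
            ax0, ay0, ax1, ay1 = a["box"]
            bx0, by0, bx1, by1 = b["box"]
            if ay1 <= by0 and ax0 < bx1 and bx0 < ax1:
                # a ends before b starts vertically and they overlap horizontally
                edges.append((a["id"], "above", b["id"]))
            elif ax1 <= bx0 and ay0 < by1 and by0 < ay1:
                # a ends before b starts horizontally and they overlap vertically
                edges.append((a["id"], "left_of", b["id"]))
    return nodes, edges


# Toy page: a title spanning the page, with a text block beside an image below it.
elements = [
    {"id": "title", "type": "text_block", "box": (0, 0, 100, 10)},
    {"id": "body", "type": "text_block", "box": (0, 12, 48, 60)},
    {"id": "figure", "type": "image", "box": (52, 12, 100, 60)},
]

nodes, edges = build_layout_graph(elements)
```

A generative model over such graphs could then sample new node sets and relations and convert them back into boxes, which is the direction the abstract describes.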

[NLP-62] Human Evaluation of Procedural Knowledge Graph Extraction from Text with Large Language Models

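[Quick Read]: This paper addresses the extraction of procedural knowledge from natural language text and its representation as a Knowledge Graph (KG). The key to the solution is to leverage Large Language Model (LLM) capabilities through a prompt engineering approach that extracts steps, actions, objects, equipment, and temporal information from a textual procedure and populates a procedural KG according to a pre-defined ontology. A user study evaluates the perceived quality and usefulness of the extraction results, showing that LLMs can produce outputs of acceptable quality, and assesses human evaluators' subjective perception of AI.

Link: https://arxiv.org/abs/2412.03589
Authors: Valentina Anita Carriero,Antonia Azzini,Ilaria Baroni,Mario Scrocca,Irene Celino
Keywords: perform some tasks, know-how expressed, form of sequences, needed to perform, Large Language Model
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: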

Abstract:Procedural Knowledge is the know-how expressed in the form of sequences of steps needed to perform some tasks. Procedures are usually described by means of natural language texts, such as recipes or maintenance manuals, possibly spread across different documents and systems, and their interpretation and subsequent execution is often left to the reader. Representing such procedures in a Knowledge Graph (KG) can be the basis to build digital tools to support those users who need to apply or execute them. In this paper, we leverage Large Language Model (LLM) capabilities and propose a prompt engineering approach to extract steps, actions, objects, equipment and temporal information from a textual procedure, in order to populate a Procedural KG according to a pre-defined ontology. We evaluate the KG extraction results by means of a user study, in order to qualitatively and quantitatively assess the perceived quality and usefulness of the LLM-extracted procedural knowledge. We show that LLMs can produce outputs of acceptable quality and we assess the subjective perception of AI by human evaluators.

[NLP-63] Not All Adapters Matter: Selective Adapter Freezing for Memory-Efficient Fine-Tuning of Language Models

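[Quick read]: This paper addresses the high resource consumption of fine-tuning Transformer-based large-scale pre-trained models on downstream tasks. The key to the solution is SAFE (Selective Adapter Freezing), which gradually freezes adapters that contribute little to task performance during the early training steps, reducing memory usage, computation, and training time while maintaining or improving model performance. SAFE also induces a regularization effect that smooths the loss landscape.

Link: https://arxiv.org/abs/2412.03587
Authors: Hyegang Son,Yonglak Son,Changhoon Kim,Young Geun Kim
Keywords: Transformer-based large-scale pre-trained, large-scale pre-trained models, pre-trained models achieve, achieve great success, models achieve great
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: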

Abstract:Transformer-based large-scale pre-trained models achieve great success, and fine-tuning, which tunes a pre-trained model on a task-specific dataset, is the standard practice to utilize these models for downstream tasks. Recent work has developed adapter-tuning, but these approaches still require relatively high resource usage. Through our investigation, we show that each adapter in adapter-tuning does not have the same impact on task performance and resource usage. Based on our findings, we propose SAFE, which gradually freezes less-important adapters that do not contribute to adaptation during the early training steps. In our experiments, SAFE reduces memory usage, computation amount, and training time by 42.85%, 34.59%, and 11.82%, respectively, while achieving comparable or better performance compared to the baseline. We also demonstrate that SAFE induces a regularization effect, thereby smoothing the loss landscape.
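
The freezing idea can be sketched in a few lines. This is an illustration only: the importance scores, the threshold, and the warm-up length are hypothetical, and the abstract does not say how SAFE actually measures adapter importance or schedules the freezing.

```python
from dataclasses import dataclass


@dataclass
class Adapter:
    name: str
    importance: float       # hypothetical score; the real metric is not given
    trainable: bool = True


def freeze_less_important(adapters, step, warmup_steps, threshold):
    """Freeze adapters scoring below `threshold` once the warm-up has ended.

    Returns the names of the adapters frozen by this call.
    """
    if step < warmup_steps:
        return []
    newly_frozen = []
    for adapter in adapters:
        if adapter.trainable and adapter.importance < threshold:
            adapter.trainable = False   # no more gradients or optimizer state
            newly_frozen.append(adapter.name)
    return newly_frozen


adapters = [
    Adapter("layer0", 0.05),
    Adapter("layer1", 0.40),
    Adapter("layer2", 0.10),
    Adapter("layer3", 0.75),
]
frozen_early = freeze_less_important(adapters, step=10, warmup_steps=100, threshold=0.2)
frozen_later = freeze_less_important(adapters, step=100, warmup_steps=100, threshold=0.2)
still_trainable = [a.name for a in adapters if a.trainable]
```

Frozen adapters no longer need gradients or optimizer state, which is the intuition behind the memory and compute savings the abstract reports.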

[NLP-64] PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback

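[Quick read]: This paper addresses the fact that evaluations of code generated by Large Language Models (LLMs) focus on functional correctness and neglect runtime efficiency. The key to the solution is PerfCodeGen, a training-free framework that feeds runtime measurements from test case execution back into self-refinement iterations to improve the performance of LLM-generated code. PerfCodeGen achieves state-of-the-art runtime efficiency on benchmarks such as HumanEval, MBPP, and APPS, frequently surpassing the reference solutions. The method also improves code quality across a range of open LLMs, including Phi-3-mini, Llama 3 8B, Mixtral 8x7B, Command R, and Llama 3 70B.

Link: https://arxiv.org/abs/2412.03578
Authors: Yun Peng,Akhilesh Deepak Gotmare,Michael Lyu,Caiming Xiong,Silvio Savarese,Doyen Sahoo
Keywords: software development tasks, Large Language Models, Large Language, development tasks, widely adopted
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: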

Abstract:Large Language Models (LLMs) are widely adopted for assisting in software development tasks, yet their performance evaluations have narrowly focused on the functional correctness of generated code. Human programmers, however, require LLM-generated code to be not only correct but also optimally efficient. We propose PerfCodeGen, a training-free framework that enhances the performance of LLM-generated code by incorporating feedback based on runtime during test case execution into the self-refinement iterations. With PerfCodeGen, we achieve speedups for a significantly higher proportion of problems compared to using the base LLM with sophisticated prompting techniques. Applied to open language models like Phi-3-mini, PerfCodeGen achieves runtime efficiency comparable to prompting powerful closed models like GPT-4. We achieve state-of-the-art runtime efficiency on benchmarks such as HumanEval, MBPP, and APPS, frequently surpassing the ground truth reference solutions with PerfCodeGen using GPT-3.5 and GPT-4. Additionally, we demonstrate the effectiveness of our approach in enhancing code quality across a range of open LLMs of varying sizes including Phi-3-mini, Llama 3 8B, Mixtral 8x7B, Command R, and Llama 3 70B.
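
A rough sketch of what execution feedback looks like in practice: run each candidate on the test cases, discard the incorrect ones, and measure the runtime of the rest. The three candidate functions below are hand-written stand-ins for LLM outputs, and simply picking the fastest candidate replaces the refinement prompt that PerfCodeGen would send back to the model; both are assumptions of this sketch.

```python
import time


def slow_sum(n):
    """Candidate 1: correct, O(n) loop."""
    total = 0
    for i in range(n + 1):
        total += i
    return total


def fast_sum(n):
    """Candidate 2: correct, closed-form formula."""
    return n * (n + 1) // 2


def wrong_sum(n):
    """Candidate 3: fast but incorrect."""
    return n * n // 2


def run_tests(func, test_cases):
    """Return (all_passed, elapsed_seconds) for one candidate."""
    start = time.perf_counter()
    passed = all(func(arg) == expected for arg, expected in test_cases)
    return passed, time.perf_counter() - start


def pick_fastest_correct(candidates, test_cases):
    """Keep candidates that pass every test, then return the fastest one's name."""
    timed = []
    for func in candidates:
        passed, elapsed = run_tests(func, test_cases)
        if passed:
            timed.append((elapsed, func.__name__))
    return min(timed)[1] if timed else None


test_cases = [(10, 55), (200000, 20000100000)]
best = pick_fastest_correct([slow_sum, fast_sum, wrong_sum], test_cases)
```

The point of the design is that correctness is checked before speed: `wrong_sum` is the cheapest function here but never enters the timing comparison.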

[NLP-65] Leveraging Large Language Models for Generating Labeled Mineral Site Record Linkage Data

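[Quick read]: This paper addresses data heterogeneity and scale in mineral site record linkage, particularly when gold-standard data are unavailable. The key to the solution is to use a large generative language model (LLM) to generate training data and then fine-tune a pre-trained discriminative language model (PLM) on that data, improving linkage accuracy without sacrificing efficiency. The approach improves the F1 score by over 45% compared with traditional PLM-based methods that use ground-truth data, and reduces inference time by nearly 18 times compared with relying on LLMs. It also provides an automated pipeline that requires no human intervention, which strengthens its practicality.

Link: https://arxiv.org/abs/2412.03575
Authors: Jiyoon Pyo,Yao-Yi Chiang
Keywords: Record linkage, linkage integrates diverse, integrates diverse data, diverse data sources, Record linkage integrates
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 10 figures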

Abstract:Record linkage integrates diverse data sources by identifying records that refer to the same entity. In the context of mineral site records, accurate record linkage is crucial for identifying and mapping mineral deposits. Properly linking records that refer to the same mineral deposit helps define the spatial coverage of mineral areas, benefiting resource identification and site data archiving. Mineral site record linkage falls under the spatial record linkage category since the records contain information about the physical locations and non-spatial attributes in a tabular format. The task is particularly challenging due to the heterogeneity and vast scale of the data. While prior research employs pre-trained discriminative language models (PLMs) on spatial entity linkage, they often require substantial amounts of curated ground-truth data for fine-tuning. Gathering and creating ground truth data is both time-consuming and costly. Therefore, such approaches are not always feasible in real-world scenarios where gold-standard data are unavailable. Although large generative language models (LLMs) have shown promising results in various natural language processing tasks, including record linkage, their high inference time and resource demand present challenges. We propose a method that leverages an LLM to generate training data and fine-tune a PLM to address the training data gap while preserving the efficiency of PLMs. Our approach achieves over 45% improvement in F1 score for record linkage compared to traditional PLM-based methods using ground truth data while reducing the inference time by nearly 18 times compared to relying on LLMs. Additionally, we offer an automated pipeline that eliminates the need for human intervention, highlighting this approach’s potential to overcome record linkage challenges.
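
For readers unfamiliar with how record linkage is scored, the F1 metric mentioned in the abstract can be computed over sets of linked record pairs as in the sketch below. The record identifiers are made up for illustration and are not taken from the paper's data.

```python
def f1_score(predicted, gold):
    """F1 over sets of linked record pairs: 2*TP / (2*TP + FP + FN)."""
    tp = len(predicted & gold)   # links found and correct
    fp = len(predicted - gold)   # links found but wrong
    fn = len(gold - predicted)   # true links that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical record identifiers from two source databases.
gold_links = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
predicted_links = {("a1", "b1"), ("a2", "b2"), ("a4", "b9")}
score = f1_score(predicted_links, gold_links)
```

Here two of three predicted links are correct and one true link is missed, so precision and recall are both 2/3 and so is F1.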

[NLP-66] Improving Tool Retrieval by Leveraging Large Language Models for Query Generation

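[Quick read]: This paper addresses how to retrieve, from a large pool of tools, those most relevant to a complex user request when Large Language Models (LLMs) use tools. The key to the solution is to use the LLM's understanding to generate a retrieval query, embed the generated query, and run a nearest-neighbor search to find the most relevant tools. Three query generation approaches are studied: zero-shot prompting, supervised fine-tuning on tool descriptions, and alignment learning that iteratively optimizes a reward metric measuring retrieval performance. Experiments show that the approach improves retrieval in both in-domain (seen tools) and out-of-domain (unseen tools) settings.

Link: https://arxiv.org/abs/2412.03573
Authors: Mohammad Kachuee,Sarthak Ahuja,Vaibhav Kumar,Puyang Xu,Xiaohu Liu
Keywords: Large Language Models, Language Models, Large Language, promising avenue, avenue to extend
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: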

Abstract:Using tools by Large Language Models (LLMs) is a promising avenue to extend their reach beyond language or conversational settings. The number of tools can scale to thousands as they enable accessing sensory information, fetching updated factual knowledge, or taking actions in the real world. In such settings, in-context learning by providing a short list of relevant tools in the prompt is a viable approach. To retrieve relevant tools, various approaches have been suggested, ranging from simple frequency-based matching to dense embedding-based semantic retrieval. However, such approaches lack the contextual and common-sense understanding required to retrieve the right tools for complex user requests. Rather than increasing the complexity of the retrieval component itself, we propose leveraging LLM understanding to generate a retrieval query. Then, the generated query is embedded and used to find the most relevant tools via a nearest-neighbor search. We investigate three approaches for query generation: zero-shot prompting, supervised fine-tuning on tool descriptions, and alignment learning by iteratively optimizing a reward metric measuring retrieval performance. By conducting extensive experiments on a dataset covering complex and multi-tool scenarios, we show that leveraging LLMs for query generation improves the retrieval for in-domain (seen tools) and out-of-domain (unseen tools) settings.
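
The retrieve-by-generated-query pipeline can be sketched as follows. Everything concrete here is an assumption: the bag-of-words `embed` function stands in for a dense encoder, the tool names and descriptions are invented, and `generated_query` is hard-coded where the paper would call an LLM.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (stand-in for a dense encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, tools, k=1):
    """Return the names of the k tools whose descriptions are nearest to the query."""
    q = embed(query)
    ranked = sorted(tools, key=lambda name: cosine(q, embed(tools[name])), reverse=True)
    return ranked[:k]


tools = {
    "calendar_api": "create calendar event meeting schedule",
    "currency_api": "convert currency exchange rate",
    "weather_api": "get current weather forecast temperature for a city",
}
user_request = "Should I bring an umbrella to Paris tomorrow?"
generated_query = "weather forecast for a city"   # hard-coded stand-in for LLM output

raw_score = cosine(embed(user_request), embed(tools["weather_api"]))
top_tool = retrieve(generated_query, tools, k=1)
```

The raw request shares no terms with the weather tool's description, which is the gap the generated query is meant to close; a real dense encoder would narrow that gap but, per the abstract, still lacks the common-sense step from "umbrella" to "weather".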

[NLP-67] CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing NEURIPS2024

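[Quick read]: This paper addresses the over-reliance of standard fine-tuning methods on input audio features in speech processing and proposes a generalist conditioning model, the Condition-Aware Self-Supervised Learning Representation (CA-SSLR). The key to the solution is to integrate language and speaker embeddings from earlier layers into the self-supervised model, making it aware of the current language and speaker context and reducing its reliance on input audio features while preserving the integrity of the base self-supervised representation. By using linear modulation to adjust internal representations dynamically, CA-SSLR achieves fine-grained adaptability without significantly altering the original model behavior, reduces the number of trainable parameters, mitigates overfitting, and performs well on under-resourced and unseen tasks.

Link: https://arxiv.org/abs/2412.04425
Authors: Yen-Ju Lu,Jing Liu,Thomas Thebaud,Laureano Moro-Velazquez,Ariya Rastrow,Najim Dehak,Jesus Villalba
Keywords: Condition-Aware Self-Supervised Learning, Self-Supervised Learning Representation, introduce Condition-Aware Self-Supervised, Self-Supervised Learning, generalist conditioning model
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)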

Abstract:We introduce Condition-Aware Self-Supervised Learning Representation (CA-SSLR), a generalist conditioning model broadly applicable to various speech-processing tasks. Compared to standard fine-tuning methods that optimize for downstream models, CA-SSLR integrates language and speaker embeddings from earlier layers, making the SSL model aware of the current language and speaker context. This approach reduces the reliance on input audio features while preserving the integrity of the base SSLR. CA-SSLR improves the model’s capabilities and demonstrates its generality on unseen tasks with minimal task-specific tuning. Our method employs linear modulation to dynamically adjust internal representations, enabling fine-grained adaptability without significantly altering the original model behavior. Experiments show that CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and excels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a 10% relative reduction in LID errors, a 37% improvement in ASR CER on the ML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb-1, demonstrating its effectiveness.
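
The abstract does not spell out the form of the linear modulation. A common reading is a feature-wise scale and shift predicted from a condition vector, and the sketch below assumes exactly that; the layer shapes, the weights, and the two-dimensional condition embedding are all hypothetical.

```python
def linear(weights, bias, x):
    """Dense layer: `weights` is a list of rows, `x` is a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def modulate(hidden, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift `hidden` with parameters predicted from `condition`."""
    gamma = linear(w_gamma, b_gamma, condition)
    beta = linear(w_beta, b_beta, condition)
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]


hidden = [1.0, -2.0, 0.5]
condition = [1.0, 0.0]   # hypothetical language/speaker embedding

# Zero weights with a scale bias of 1 and a shift bias of 0 give an identity
# modulation, so the base representation is untouched at initialisation.
zeros = [[0.0, 0.0] for _ in hidden]
identity = modulate(hidden, condition, zeros, [1.0] * 3, zeros, [0.0] * 3)

# A non-trivial modulation: first unit doubled, last unit shifted by 1.
w_gamma = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
w_beta = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
adapted = modulate(hidden, condition, w_gamma, [1.0] * 3, w_beta, [0.0] * 3)
```

Starting from the identity is one way to adapt a model "without significantly altering the original model behavior"; whether CA-SSLR initialises its modulation this way is not stated in the abstract.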

Computer Vision

[CV-0] Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail

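[Quick read]: This paper addresses key challenges in stereo matching, such as textureless regions, occlusions, and non-Lambertian surfaces. The key to the solution is Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). A dual-branch architecture integrates stereo matching with learned contextual cues, and novel cost volume fusion mechanisms handle the challenges above. Experiments show that the model achieves state-of-the-art zero-shot generalization, significantly outperforms existing solutions, and is remarkably robust to hard cases such as mirrors and transparent objects.

Link: https://arxiv.org/abs/2412.04472
Authors: Luca Bartolomei,Fabio Tosi,Matteo Poggi,Stefano Mattoccia
Keywords: depth Vision Foundation, Vision Foundation Models, monocular depth Vision, Vision Foundation, combines geometric constraints
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL - Project page: this https URL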

Abstract:We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Following this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.

[CV-1] PaintScene4D: Consistent 4D Scene Generation from Text Prompts

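[Quick read]: This paper addresses the challenge of generating photorealistic dynamic 4D scenes, in particular the tendency of existing methods to produce object-centric scenes that lack photorealism. The key to the solution is PaintScene4D, a novel text-to-4D scene generation framework with a streamlined architecture that harnesses video generative models trained on diverse real-world datasets. The method first generates a reference video with a video generation model, then selects a camera array strategically for rendering, and applies progressive warping and inpainting to ensure spatial and temporal consistency across viewpoints. Finally, it optimizes the multi-view images with a dynamic renderer, giving users flexible control over the camera. The framework is training-free, produces realistic 4D scenes efficiently, and supports viewing along arbitrary trajectories.

Link: https://arxiv.org/abs/2412.04471
Authors: Vinayak Gupta,Yunze Man,Yu-Xiong Wang
Keywords: Recent advances, generating photorealistic dynamic, content creation, significant challenge, advances in diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL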

Abstract:Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing dynamic 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture that harnesses video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model, and then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize multi-view images using a dynamic renderer, enabling flexible camera control based on user preferences. Adopting a training-free architecture, our PaintScene4D efficiently produces realistic 4D scenes that can be viewed from arbitrary trajectories. The code will be made publicly available. Our project page is at this https URL

[CV-2] Turbo3D: Ultra-fast Text-to-3D Generation

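[Quick read]: This paper addresses the slow speed of text-to-3D generation and presents Turbo3D, an ultra-fast system that generates high-quality Gaussian splatting assets in under one second. The key to the solution is a rapid 4-step, 4-view diffusion generator combined with an efficient feed-forward Gaussian reconstructor that operates in latent space. The 4-step, 4-view generator is distilled with a novel Dual-Teacher approach, in which the student learns view consistency from a multi-view teacher and photo-realism from a single-view teacher. Moving the Gaussian reconstructor's inputs from pixel space to latent space removes the extra image decoding time and halves the transformer sequence length. The method produces better 3D generation results than previous baselines in a fraction of their runtime.

Link: https://arxiv.org/abs/2412.04470
Authors: Hanzhe Hu,Tianwei Yin,Fujun Luan,Yiwei Hu,Hao Tan,Zexiang Xu,Sai Bi,Shubham Tulsiani,Kai Zhang
Keywords: generating high-quality Gaussian, high-quality Gaussian splatting, Gaussian splatting assets, system capable, capable of generating
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL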

Abstract:We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view diffusion generator and an efficient feed-forward Gaussian reconstructor, both operating in latent space. The 4-step, 4-view generator is a student model distilled through a novel Dual-Teacher approach, which encourages the student to learn view consistency from a multi-view teacher and photo-realism from a single-view teacher. By shifting the Gaussian reconstructor’s inputs from pixel space to latent space, we eliminate the extra image decoding time and halve the transformer sequence length for maximum efficiency. Our method demonstrates superior 3D generation results compared to previous baselines, while operating in a fraction of their runtime.

[CV-3] QUEEN: QUantized Efficient ENcoding of Dynamic Gaussians for Streaming Free-viewpoint Videos NEURIPS2024

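[Quick read]: This paper addresses the challenges of online free-viewpoint video (FVV) streaming, in particular how to achieve incremental updates to a volumetric representation, fast training and rendering, and a small memory footprint for transmission under real-time constraints. The key to the solution is a new framework, QUantized and Efficient ENcoding (QUEEN), built on 3D Gaussian Splatting (3D-GS). QUEEN directly learns Gaussian attribute residuals between consecutive frames without imposing structural constraints on them, which allows high-quality reconstruction and generalizability. To store the residuals efficiently, the paper proposes a quantization-sparsity framework consisting of a learned latent decoder that quantizes attribute residuals other than Gaussian positions and a learned gating module that sparsifies position residuals. The paper also uses the Gaussian viewspace gradient difference vector to separate static from dynamic scene content, which speeds up training and guides sparsity learning. On diverse FVV benchmarks, QUEEN outperforms state-of-the-art online FVV methods on all metrics.

Link: https://arxiv.org/abs/2412.04469
Authors: Sharath Girish,Tianye Li,Amrita Mazumdar,Abhinav Shrivastava,David Luebke,Shalini De Mello
Keywords: Online free-viewpoint video, challenging problem, FVV, Online free-viewpoint, free-viewpoint video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2024, Project website: this https URL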

Abstract:Online free-viewpoint video (FVV) streaming is a challenging problem, which is relatively under-explored. It requires incremental on-the-fly updates to a volumetric representation, fast training and rendering to satisfy real-time constraints and a small memory footprint for efficient transmission. If achieved, it can enhance user experience by enabling novel applications, e.g., 3D video conferencing and live volumetric video broadcast, among others. In this work, we propose a novel framework for QUantized and Efficient ENcoding (QUEEN) for streaming FVV using 3D Gaussian Splatting (3D-GS). QUEEN directly learns Gaussian attribute residuals between consecutive frames at each time-step without imposing any structural constraints on them, allowing for high quality reconstruction and generalizability. To efficiently store the residuals, we further propose a quantization-sparsity framework, which contains a learned latent-decoder for effectively quantizing attribute residuals other than Gaussian positions and a learned gating module to sparsify position residuals. We propose to use the Gaussian viewspace gradient difference vector as a signal to separate the static and dynamic content of the scene. It acts as a guide for effective sparsity learning and speeds up training. On diverse FVV benchmarks, QUEEN outperforms the state-of-the-art online FVV methods on all metrics. Notably, for several highly dynamic scenes, it reduces the model size to just 0.7 MB per frame while training in under 5 sec and rendering at 350 FPS. Project website is at this https URL
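
To make the residual-coding idea concrete, the sketch below encodes the change between two frames as small integers. It is a deliberate simplification: a fixed threshold and a uniform quantizer stand in for QUEEN's learned gating module and learned latent decoder, and the attribute values, threshold, and step size are invented.

```python
def encode_residuals(prev, curr, gate_threshold, step):
    """Return integer codes for the per-attribute change between two frames."""
    codes = []
    for p, c in zip(prev, curr):
        residual = c - p
        if abs(residual) < gate_threshold:
            codes.append(0)                        # gated out: treated as static
        else:
            codes.append(round(residual / step))   # uniform quantization
    return codes


def decode_residuals(prev, codes, step):
    """Reconstruct the current frame from the previous frame and the codes."""
    return [p + q * step for p, q in zip(prev, codes)]


prev_frame = [0.0, 1.0, 2.0, 3.0]    # made-up attribute values at time t-1
curr_frame = [0.0, 1.5, 2.01, 2.0]   # made-up attribute values at time t
codes = encode_residuals(prev_frame, curr_frame, gate_threshold=0.05, step=0.25)
reconstructed = decode_residuals(prev_frame, codes, step=0.25)
zero_fraction = codes.count(0) / len(codes)
```

Only the non-zero codes need to be transmitted for each frame, which is why sparsity in the residuals translates directly into a smaller per-frame model size.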

[CV-4] NVILA: Efficient Frontier Visual Language Models

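[Quick read]: This paper addresses the fact that the efficiency of Visual Language Models (VLMs) has received far less attention than their accuracy. The key to the solution is NVILA, a family of open VLMs that optimizes the model architecture with a "scale-then-compress" approach: first scale up the spatial and temporal resolutions, then compress the visual tokens, so that high-resolution images and long videos are processed efficiently. The paper also systematically studies efficiency across the whole lifecycle, from training and fine-tuning to deployment, reducing training cost, fine-tuning memory usage, and pre-filling and decoding latency. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs on a wide range of image and video benchmarks while being substantially more efficient.

Link: https://arxiv.org/abs/2412.04468
Authors: Zhijian Liu,Ligeng Zhu,Baifeng Shi,Zhuoyang Zhang,Yuming Lou,Shang Yang,Haocheng Xi,Shiyi Cao,Yuxian Gu,Dacheng Li,Xiuyu Li,Yunhao Fang,Yukang Chen,Cheng-Yu Hsieh,De-An Huang,An-Chieh Cheng,Vishwesh Nath,Jinyi Hu,Sifei Liu,Ranjay Krishna,Daguang Xu,Xiaolong Wang,Pavlo Molchanov,Jan Kautz,Hongxu Yin,Song Han,Yao Lu
Keywords: made significant advances, recent years, Visual language models, made significant, significant advances
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: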

Abstract:Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This “scale-then-compress” approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.

[CV-5] UnZipLoRA: Separating Content and Style from a Single Image

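[Quick read]: This paper addresses the disentanglement of subject and style in an image, that is, how to separate the two from a single image and represent them as two distinct Low-Rank Adaptations (LoRAs). The key to the solution is UnZipLoRA, which trains both LoRAs simultaneously and ensures that they are compatible, meaning they can be combined seamlessly by direct addition. UnZipLoRA uses a novel prompt separation technique together with column and block separation strategies to preserve the characteristics of subject and style accurately and to keep the learned LoRAs compatible. The method supports independent manipulation and recontextualization of subject and style, including generating variations of each, applying the extracted style to new subjects, and recombining them to reconstruct the original image or create novel variations.

Link: https://arxiv.org/abs/2412.04465
Authors: Chang Liu,Viraj Shah,Aiyu Cui,Svetlana Lazebnik
Keywords: Low-Rank Adaptations, paper introduces UnZipLoRA, paper introduces, subject, constituent subject
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL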

Abstract:This paper introduces UnZipLoRA, a method for decomposing an image into its constituent subject and style, represented as two distinct LoRAs (Low-Rank Adaptations). Unlike existing personalization techniques that focus on either subject or style in isolation, or require separate training sets for each, UnZipLoRA disentangles these elements from a single image by training both the LoRAs simultaneously. UnZipLoRA ensures that the resulting LoRAs are compatible, i.e., they can be seamlessly combined using direct addition. UnZipLoRA enables independent manipulation and recontextualization of subject and style, including generating variations of each, applying the extracted style to new subjects, and recombining them to reconstruct the original image or create novel variations. To address the challenge of subject and style entanglement, UnZipLoRA employs a novel prompt separation technique, as well as column and block separation strategies to accurately preserve the characteristics of subject and style, and ensure compatibility between the learned LoRAs. Evaluation with human studies and quantitative metrics demonstrates UnZipLoRA’s effectiveness compared to other state-of-the-art methods, including DreamBooth-LoRA, Inspiration Tree, and B-LoRA.
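
"Combined using direct addition" can be illustrated with plain matrices. In the sketch below each LoRA is a rank-1 product `B @ A` added to a base weight; the 2x2 shapes and all values are invented for illustration, and how UnZipLoRA trains the factors is outside the scope of the sketch.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(*matrices):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


base_weight = [[1.0, 0.0], [0.0, 1.0]]

# Each LoRA is a low-rank product B @ A (rank 1 here); values are made up.
subject_b, subject_a = [[1.0], [0.0]], [[0.0, 2.0]]
style_b, style_a = [[0.0], [1.0]], [[3.0, 0.0]]

subject_delta = matmul(subject_b, subject_a)
style_delta = matmul(style_b, style_a)

merged = add(base_weight, subject_delta, style_delta)   # subject and style together
subject_only = add(base_weight, subject_delta)          # subject without the style
```

Because the two updates are simply summed, either one can be left out or swapped for a LoRA from another image, which is what enables the recombination use cases listed in the abstract.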

[CV-6] DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction

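[Quick read]: This paper addresses the reconstruction of the 3D shape and pose of deformable objects. The key to the solution is a new data representation, Dual Point Maps (DualPM). A DualPM is a pair of point maps extracted from the same image: one associates pixels with their 3D locations on the object, and the other associates them with a canonical version of the object at rest pose. With this representation, 3D reconstruction and 3D pose estimation reduce to predicting DualPMs. Experiments show that the representation is a good prediction target for a deep network and that, on deformable objects such as horses, it improves on previous methods for 3D analysis and reconstruction by a large margin.

Link: https://arxiv.org/abs/2412.04464
Authors: Ben Kaye,Tomas Jakab,Shangzhe Wu,Christian Rupprecht,Andrea Vedaldi
Keywords: point maps, Dual Point Maps, geometric tasks, learning in geometric, viewpoint-invariant point maps
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: First two authors contributed equally. Project page: this https URL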

Abstract:The choice of data representation is a key factor in the success of deep learning in geometric tasks. For instance, DUSt3R has recently introduced the concept of viewpoint-invariant point maps, generalizing depth prediction, and showing that one can reduce all the key problems in the 3D reconstruction of static scenes to predicting such point maps. In this paper, we develop an analogous concept for a very different problem, namely, the reconstruction of the 3D shape and pose of deformable objects. To this end, we introduce the Dual Point Maps (DualPM), where a pair of point maps is extracted from the same image, one associating pixels to their 3D locations on the object, and the other to a canonical version of the object at rest pose. We also extend point maps to amodal reconstruction, seeing through self-occlusions to obtain the complete shape of the object. We show that 3D reconstruction and 3D pose estimation reduce to the prediction of the DualPMs. We demonstrate empirically that this representation is a good target for a deep network to predict; specifically, we consider modeling horses, showing that DualPMs can be trained purely on 3D synthetic data, consisting of a single model of a horse, while generalizing very well to real images. With this, we improve by a large margin previous methods for the 3D analysis and reconstruction of this type of objects.

[CV-7] MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos

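[Quick read]: This paper addresses accurate, fast, and robust estimation of camera parameters and depth maps from monocular videos of dynamic scenes. Conventional Structure from Motion and monocular SLAM techniques assume input videos of predominantly static scenes with large amounts of parallax, and they tend to produce erroneous estimates when these conditions do not hold. Neural network-based approaches try to overcome these challenges but are either computationally expensive or brittle on dynamic videos with uncontrolled camera motion. The key to the solution is a set of careful modifications to the training and inference schemes of a deep visual SLAM framework, which let it scale to complex dynamic scenes with unconstrained camera paths, including videos with little parallax. Experiments show that the system is significantly more accurate and robust than existing methods at camera pose and depth estimation, with faster or comparable running times.

Link: https://arxiv.org/abs/2412.04463
Authors: Zhengqi Li,Richard Tucker,Forrester Cole,Qianqian Wang,Linyi Jin,Vickie Ye,Angjoo Kanazawa,Aleksander Holynski,Noah Snavely
Keywords: casual monocular videos, maps from casual, monocular SLAM techniques, casual monocular, dynamic scenes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: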

Abstract:We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. Most conventional structure from motion and monocular SLAM techniques assume input videos that feature predominantly static scenes with large amounts of parallax. Such methods tend to produce erroneous estimates in the absence of these conditions. Recent neural network-based approaches attempt to overcome these challenges; however, such methods are either computationally expensive or brittle when run on dynamic videos with uncontrolled camera motion or unknown field of view. We demonstrate the surprising effectiveness of a deep visual SLAM framework: with careful modifications to its training and inference schemes, this system can scale to real-world videos of complex dynamic scenes with unconstrained camera paths, including videos with little camera parallax. Extensive experiments on both synthetic and real videos demonstrate that our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work, with faster or comparable running times. See interactive results on our project page: this https URL

[CV-8] 4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion

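[Quick read]: This paper proposes 4Real-Video, a new framework for generating 4D videos organized along both a time axis and a viewpoint axis. The key to the solution is a novel two-stream architecture: one stream performs viewpoint updates and the other performs temporal updates. After each diffusion transformer layer, a synchronization layer exchanges information between the two streams. The result is higher inference speed, better visual quality (measured by FVD, CLIP, and VideoScore), and improved temporal and viewpoint consistency (measured by VideoScore and Dust3R-Confidence).

Link: https://arxiv.org/abs/2412.04462
Authors: Chaoyang Wang,Peiye Zhuang,Tuan Duc Ngo,Willi Menapace,Aliaksandr Siarohin,Michael Vasilkovsky,Ivan Skorokhodov,Sergey Tulyakov,Peter Wonka,Hsin-Ying Lee
Keywords: framework for generating, video frames, viewpoint axes, frames, viewpoint
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL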

Abstract:We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes. In this grid, each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint. We propose a novel two-stream architecture. One stream performs viewpoint updates on columns, and the other stream performs temporal updates on rows. After each diffusion transformer layer, a synchronization layer exchanges information between the two token streams. We propose two implementations of the synchronization layer, using either hard or soft synchronization. This feedforward architecture improves upon previous work in three ways: higher inference speed, enhanced visual quality (measured by FVD, CLIP, and VideoScore), and improved temporal and viewpoint consistency (measured by VideoScore and Dust3R-Confidence).

[CV-9] LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors

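[Quick read]: This paper addresses the generation of transparent images with foreground and background layers, which matters for fields such as graphic design, animation, and digital art. The key to the solution is a new image generation pipeline based on Latent Diffusion Models (LDMs) that generates a foreground layer (RGBA) with transparency information together with a background layer (RGB). Unlike existing methods that generate the layers sequentially, it introduces a harmonized generation mechanism that lets the foreground and background layers interact dynamically, producing more coherent outputs. This yields significant improvements in visual coherence, image quality, and layer consistency.

Link: https://arxiv.org/abs/2412.04460
Authors: Yusuf Dalva,Yijun Li,Qing Liu,Nanxuan Zhao,Jianming Zhang,Zhe Lin,Pinar Yanardag
Keywords: Large-scale diffusion models, achieved remarkable success, Large-scale diffusion, generating high-quality images, textual descriptions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL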

Abstract:Large-scale diffusion models have achieved remarkable success in generating high-quality images from textual descriptions, gaining popularity across various applications. However, the generation of layered content, such as transparent images with foreground and background layers, remains an under-explored area. Layered content generation is crucial for creative workflows in fields like graphic design, animation, and digital art, where layer-based approaches are fundamental for flexible editing and composition. In this paper, we propose a novel image generation pipeline based on Latent Diffusion Models (LDMs) that generates images with two layers: a foreground layer (RGBA) with transparency information and a background layer (RGB). Unlike existing methods that generate these layers sequentially, our approach introduces a harmonized generation mechanism that enables dynamic interactions between the layers for more coherent outputs. We demonstrate the effectiveness of our method through extensive qualitative and quantitative experiments, showing significant improvements in visual coherence, image quality, and layer consistency compared to baseline methods.
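
For context on how the two generated layers relate to a final image, the sketch below applies the standard "over" compositing rule for an RGBA foreground on an opaque RGB background, out = alpha * foreground + (1 - alpha) * background. The pixel values are made up, and this is ordinary alpha blending, not the paper's generation mechanism.

```python
def composite(fg_rgba, bg_rgb):
    """Blend one RGBA foreground pixel over one opaque RGB background pixel."""
    r, g, b, alpha = fg_rgba
    return tuple(
        round(alpha * fg + (1 - alpha) * bg)
        for fg, bg in zip((r, g, b), bg_rgb)
    )


background = (100, 100, 200)
opaque = composite((200, 100, 0, 1.0), background)        # foreground fully covers
transparent = composite((200, 100, 0, 0.0), background)   # background shows through
blended = composite((200, 100, 0, 0.25), background)      # 25% foreground
```

Layer consistency matters because any mismatch between the foreground's alpha edge and the background becomes visible as soon as the layers are blended this way.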

[CV-10] Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering

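[Quick read]: This paper addresses efficient radiance field rendering, achieving high-quality results with sparse voxels and without neural networks or 3D Gaussians. The solution has two key parts. First, dynamic Morton ordering renders sparse voxels in the correct depth order along pixel rays, which avoids the popping artifact common in Gaussian splatting. Second, sparse voxels are fitted adaptively to different levels of detail in the scene, reproducing scene detail faithfully at high frame rates. The method improves on the previous neural-free voxel grid representation by over 4 dB PSNR and more than 10x in rendering speed, reaches novel-view synthesis quality comparable to the state of the art, and is compatible with grid-based 3D processing algorithms: integrating TSDF-Fusion and Marching Cubes gives promising mesh reconstruction accuracy.

Link: https://arxiv.org/abs/2412.04459
Authors: Cheng Sun,Jaesung Choe,Charles Loop,Wei-Chiu Ma,Yu-Chiang Frank Wang
Keywords: efficient radiance field, radiance field rendering, propose an efficient, efficient radiance, radiance field
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Code release in progress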

Abstract:We propose an efficient radiance field rendering algorithm that incorporates a rasterization process on sparse voxels without neural networks or 3D Gaussians. There are two key contributions coupled with the proposed system. The first is to render sparse voxels in the correct depth order along pixel rays by using dynamic Morton ordering. This avoids the well-known popping artifact found in Gaussian splatting. Second, we adaptively fit sparse voxels to different levels of detail within scenes, faithfully reproducing scene details while achieving high rendering frame rates. Our method improves the previous neural-free voxel grid representation by over 4 dB PSNR and more than 10x rendering FPS speedup, achieving novel-view synthesis results comparable to the state of the art. Additionally, our neural-free sparse voxels are seamlessly compatible with grid-based 3D processing algorithms. We achieve promising mesh reconstruction accuracy by integrating TSDF-Fusion and Marching Cubes into our sparse grid system.
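
Morton ordering itself is easy to show. A Morton code interleaves the bits of the x, y, and z grid coordinates, and sorting by it never places a voxel before another one that precedes it on every axis. The sketch below mirrors each axis whose viewing direction is negative before encoding; that mirroring step is an assumption about what makes the ordering "dynamic", not a description of the paper's implementation.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def depth_order(voxels, ray_dir, bits=10):
    """Sort voxel grid coordinates front to back for the given direction signs."""
    top = (1 << bits) - 1

    def key(voxel):
        # Mirror an axis when the ray travels toward decreasing coordinates.
        mirrored = [top - c if d < 0 else c for c, d in zip(voxel, ray_dir)]
        return morton3d(*mirrored, bits=bits)

    return sorted(voxels, key=key)


voxels = [(1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
forward = depth_order(voxels, ray_dir=(1, 1, 1))
backward = depth_order(voxels, ray_dir=(-1, -1, -1))
```

Since only the signs of the viewing direction enter the sort key, at most eight distinct orderings exist, so the order can be chosen per direction octant rather than re-sorted by distance.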

[CV-11] Cubify Anything: Scaling Indoor 3D Object Detection

【速读】：该论文试图解决室内3D物体检测的问题，特别是在使用消费级手持设备获取的单帧RGB(-D)图像上的应用。解决方案的关键在于引入了一个新的数据集Cubify-Anything 1M (CA-1M)，该数据集通过激光扫描技术提供了高度精确的3D物体标注，并与手持设备的RGB(-D)图像进行了近乎完美的配准。此外，论文提出了Cubify Transformer (CuTR)，这是一种基于Transformer的3D物体检测模型，它直接从RGB(-D)输入的2D特征中预测3D边界框，而不依赖于点云或体素表示。尽管CuTR缺乏3D归纳偏置，但结合CA-1M数据集，它在3D物体召回率上达到了62%以上，并且在处理噪声和不确定性方面表现出色，甚至在仅使用RGB数据时也能保持良好的性能。通过在CA-1M上的预训练，CuTR在更广泛的数据集上也能超越基于点云的方法，表明在数据丰富的环境下，3D归纳偏置可能不再是必要的。

Link: https://arxiv.org/abs/2412.04458
Authors: Justin Lazarow,David Griffiths,Gefen Kohavi,Francisco Crespo,Afshin Dehghan
Keywords: frame acquired, commodity handheld device, single RGB, establish Cubify Transformer, handheld device
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations to scale, accuracy, and diversity of objects. As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer 3D object detection baseline which rather than operating in 3D on point or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that paired with CA-1M, CuTR outperforms point-based methods - accurately recalling over 62% of objects in 3D, and is significantly more capable at handling noise and uncertainty present in commodity LiDAR-derived depth maps while also providing promising RGB only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D - supporting the notion that while inductive biases in 3D are useful at the smaller sizes of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.

[CV-12] Monocular Dynamic Gaussian Splatting is Fast and Brittle but Smooth Motion Helps

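[Quick read]: This paper addresses view synthesis for dynamic scenes, in particular the challenging setting where only monocular input data is available. The key is a systematic organization, benchmarking, and analysis of many Gaussian-splatting-based methods to enable direct comparison. Using multiple existing datasets and a new synthetic dataset, the paper categorizes Gaussian splatting methods by motion representation type and quantifies how these differences affect performance. It finds that the ranking of methods is well-defined on synthetic data, but on complex real-world data the differences are currently overwhelmed by data complexity. In addition, all Gaussian-based methods render quickly but are brittle during optimization.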

Link: https://arxiv.org/abs/2412.04457
Authors: Yiqing Liang,Mikhail Okunev,Mikaela Angelina Uy,Runfeng Li,Leonidas Guibas,James Tompkin,Adam W. Harley
Keywords: converting multi-view image, multi-view image data, view synthesis, popular approach, approach for converting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 37 pages, 39 figures, 9 tables

Abstract:Gaussian splatting methods are emerging as a popular approach for converting multi-view image data into scene representations that allow view synthesis. In particular, there is interest in enabling view synthesis for dynamic scenes using only monocular input data – an ill-posed and challenging problem. The fast pace of work in this area has produced multiple simultaneous papers that claim to work best, which cannot all be true. In this work, we organize, benchmark, and analyze many Gaussian-splatting-based methods, providing apples-to-apples comparisons that prior works have lacked. We use multiple existing datasets and a new instructive synthetic dataset designed to isolate factors that affect reconstruction quality. We systematically categorize Gaussian splatting methods into specific motion representation types and quantify how their differences impact performance. Empirically, we find that their rank order is well-defined in synthetic data, but the complexity of real-world data currently overwhelms the differences. Furthermore, the fast rendering speed of all Gaussian-based methods comes at the cost of brittleness in optimization. We summarize our experiments into a list of findings that can help to further progress in this lively problem setting. Project Webpage: this https URL

[CV-13] HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery

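[Quick read]: This paper addresses human shape and pose recovery from multiple static views, targeting fixed multiview monitoring such as elderly care and safety monitoring. The key is HeatFormer, a neural optimizer that iteratively refines SMPL parameters to estimate human shape and pose accurately from multiview images. Its core idea is to cast SMPL parameter estimation as heat map generation and alignment, using a novel Transformer encoder and decoder, which makes the method fundamentally agnostic to the configuration of views. According to the paper, this improves accuracy, robustness to occlusion, and generalizability.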

Link: https://arxiv.org/abs/2412.04456
Authors: Yuto Matsubara,Ko Nishino
Keywords: fully leverage multiple, leverage multiple static, multiple static views, shape and pose, pose recovery
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: We introduce a novel method for human shape and pose recovery that can fully leverage multiple static views. We target fixed-multiview people monitoring, including elderly care and safety monitoring, in which calibrated cameras can be installed at the corners of a room or an open space but whose configuration may vary depending on the environment. Our key idea is to formulate it as neural optimization. We achieve this with HeatFormer, a neural optimizer that iteratively refines the SMPL parameters given multiview images, which is fundamentally agnostic to the configuration of views. HeatFormer realizes this SMPL parameter estimation as heat map generation and alignment with a novel transformer encoder and decoder. We demonstrate the effectiveness of HeatFormer including its accuracy, robustness to occlusion, and generalizability through an extensive set of experiments. We believe HeatFormer can play a key role in passive human behavior modeling.

[CV-14] Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

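[Quick read]: This paper addresses automatic detection and prevention of open-set failures in closed-loop robotic systems, in particular the dual challenge of identifying unexpected failures reactively after they occur and preventing foreseeable ones proactively. The key is a new paradigm, Code-as-Monitor (CaM), which uses a vision-language model (VLM) to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and evaluates these constraints in real time with VLM-generated code. In addition, constraint elements, which abstract constraint-related entities or their parts into compact geometric elements, improve monitoring accuracy and efficiency, simplify tracking, and facilitate constraint-aware visual programming. Experiments show that under severe disturbances CaM achieves a 28.7% higher success rate than baselines and reduces execution time by 31.8%.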

Link: https://arxiv.org/abs/2412.04455
Authors: Enshen Zhou,Qi Su,Cheng Chi,Zhizheng Zhang,Zhongyuan Wang,Tiejun Huang,Lu Sheng,He Wang
Keywords: Automatic detection, closed-loop robotic systems, Automatic, robotic systems, identify unexpected failures
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. The core of our method is to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and use VLM-generated code to evaluate them for real-time monitoring. To enhance the accuracy and efficiency of monitoring, we further introduce constraint elements that abstract constraint-related entities or their parts into compact geometric elements. This approach offers greater generality, simplifies tracking, and facilitates constraint-aware visual programming by leveraging these elements as visual prompts. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances compared to baselines across three simulators and a real-world setting. Moreover, CaM can be integrated with open-loop control policies to form closed-loop systems, enabling long-horizon tasks in cluttered scenes with dynamic environments.
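
To make the idea of evaluating a spatio-temporal constraint with code more concrete, here is a minimal sketch. It assumes each frame supplies tracked 3D points for named geometric elements; the element names, the `held_above` constraint, and the `first_violation` helper are hypothetical and are not the code a VLM produces in the paper.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]
Frame = Dict[str, Point]            # element name -> tracked 3D point (assumed layout)
Constraint = Callable[[Frame], bool]


def held_above(obj: str, surface: str, min_gap: float) -> Constraint:
    """Spatial constraint: `obj` must stay at least `min_gap` above `surface` (z axis up)."""
    def check(frame: Frame) -> bool:
        return frame[obj][2] - frame[surface][2] >= min_gap
    return check


def first_violation(frames: List[Frame], constraint: Constraint) -> int:
    """Temporal part: scan frames in order and return the index of the first
    frame that violates the constraint, or -1 if it holds throughout."""
    for index, frame in enumerate(frames):
        if not constraint(frame):
            return index
    return -1
```

Reducing objects to a few points keeps each check cheap enough to run on every frame, which is the property the abstract attributes to constraint elements.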

[CV-15] NaVILA: Legged Robot Vision-Language-Action Model for Navigation

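[Quick read]: This paper addresses Vision-and-Language Navigation, specifically for legged robots navigating challenging and cluttered scenes. The key is NaVILA, a two-level framework that unifies a Vision-Language-Action (VLA) model with locomotion skills. Rather than predicting low-level actions directly from the VLA, NaVILA first generates mid-level action commands with spatial information expressed in language (for example, "moving forward 75cm"), which then serve as input to a visual locomotion reinforcement learning policy that executes the low-level actions. The approach substantially improves results on existing benchmarks and shows the same advantages on newly developed IsaacLab benchmarks that feature more realistic scenes and low-level controls, as well as in real-world robot experiments.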

Link: https://arxiv.org/abs/2412.04453
Authors: An-Chieh Cheng,Yandong Ji,Zhaojing Yang,Xueyan Zou,Jan Kautz,Erdem Bıyık,Hongxu Yin,Sifei Liu,Xiaolong Wang
Keywords: Navigation with legged, solve the problem, challenging and cluttered, Navigation, cluttered scenes
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL

Abstract:This paper proposes to solve the problem of Vision-and-Language Navigation with legged robots, which not only provides a flexible way for humans to command but also allows the robot to navigate through more challenging and cluttered scenes. However, it is non-trivial to translate human language instructions all the way to low-level leg joint actions. We propose NaVILA, a 2-level framework that unifies a Vision-Language-Action model (VLA) with locomotion skills. Instead of directly predicting low-level actions from VLA, NaVILA first generates mid-level actions with spatial information in the form of language, (e.g., “moving forward 75cm”), which serves as an input for a visual locomotion RL policy for execution. NaVILA substantially improves previous approaches on existing benchmarks. The same advantages are demonstrated in our newly developed benchmarks with IsaacLab, featuring more realistic scenes, low-level controls, and real-world robot experiments. We show more results at this https URL
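
The mid-level interface described above, language such as "moving forward 75cm" handed to a locomotion policy, can be pictured as a small parsing step. The sketch below is an assumption about what such a step might look like: only the "moving forward 75cm" example comes from the abstract, while the other verbs, directions, units, and the `MidLevelAction` type are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class MidLevelAction(NamedTuple):
    verb: str         # "move" or "turn"
    direction: str    # "forward", "backward", "left", or "right"
    value: float      # magnitude as written in the command
    unit: str         # "cm", "m", or "degrees"


_ACTION_PATTERN = re.compile(
    r"(?P<verb>mov(?:e|ing)|turn(?:ing)?)\s+"
    r"(?P<direction>forward|backward|left|right)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>cm|m|degrees)",
    re.IGNORECASE,
)


def parse_action(text: str) -> Optional[MidLevelAction]:
    """Turn a language command into a structured action, or None if it does not match."""
    match = _ACTION_PATTERN.search(text)
    if match is None:
        return None
    verb = "move" if match.group("verb").lower().startswith("mov") else "turn"
    return MidLevelAction(
        verb=verb,
        direction=match.group("direction").lower(),
        value=float(match.group("value")),
        unit=match.group("unit").lower(),
    )
```

For example, `parse_action("moving forward 75cm")` yields a `move` action with direction `forward`, value `75.0`, and unit `cm`. Keeping the interface in language lets the high-level model stay independent of any specific robot's joint space.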

[CV-16] Four-Plane Factorized Video Autoencoders

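[Quick read]: This paper addresses efficient training and inference of latent variable generative models on higher-dimensional data such as video. The key is an autoencoder that projects volumetric data onto a four-plane factorized latent space whose size grows sublinearly with the input size, making it suitable for high-dimensional data. The factorized design supports straightforward use of latent diffusion models (LDMs) in several conditional generation tasks, such as class-conditional generation, frame prediction, and video interpolation, with significant gains in speed and memory efficiency while retaining the rich representation needed for high-fidelity reconstruction.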

Link: https://arxiv.org/abs/2412.04452
Authors: Mohammed Suhail,Carlos Esteves,Leonid Sigal,Ameesh Makadia
Keywords: tasks including image, generative tasks including, emerged as powerful, powerful tools, including image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Latent variable generative models have emerged as powerful tools for generative tasks including image and video synthesis. These models are enabled by pretrained autoencoders that map high resolution data into a compressed lower dimensional latent space, where the generative models can subsequently be developed while requiring fewer computational resources. Despite their effectiveness, the direct application of latent variable models to higher dimensional domains such as videos continues to pose challenges for efficient training and inference. In this paper, we propose an autoencoder that projects volumetric data onto a four-plane factorized latent space that grows sublinearly with the input size, making it ideal for higher dimensional data like videos. The design of our factorized model supports straightforward adoption in a number of conditional generation tasks with latent diffusion models (LDMs), such as class-conditional generation, frame prediction, and video interpolation. Our results show that the proposed four-plane latent space retains a rich representation needed for high-fidelity reconstructions despite the heavy compression, while simultaneously enabling LDMs to operate with significant improvements in speed and memory.

[CV-17] MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

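[Quick read]: This paper addresses three main problems in audio-driven talking video generation: audio-lip synchronization, long-term identity consistency, and natural expression generation. The key is Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach with two core modules: (1) a memory-guided temporal module, which develops memory states that store information from a longer past context and guide temporal modeling via linear attention, improving long-term identity consistency and motion smoothness; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to strengthen audio-video interaction and detects emotions from audio to refine facial expressions via emotion adaptive layer norm. Together these modules let MEMO generate high-quality talking videos with accurate audio-lip synchronization, consistent identity, and natural expressions.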

Link: https://arxiv.org/abs/2412.04448
Authors: Longtao Zheng,Yifan Zhang,Hanzhong Guo,Jiachun Pan,Zhenxiong Tan,Jiahao Lu,Chuanxin Tang,Bo An,Shuicheng Yan
Keywords: Recent advances, talking video generation, video diffusion models, models have unlocked, unlocked new potential
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.

[CV-18] EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

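[Quick read]: This paper addresses the limited planning ability of Multimodal Large Language Models (MLLMs) in diverse scenarios. The key is EgoPlan-Bench2, a comprehensive benchmark built semi-automatically from egocentric videos, covering 4 major domains and 24 detailed scenarios of daily life, to evaluate MLLM planning in the real world. The paper also proposes a training-free improvement based on multimodal Chain-of-Thought (CoT) prompting, which raises the performance of GPT-4V on EgoPlan-Bench2 by 10.24. The work exposes the current limitations of MLLMs in planning and points to directions for future improvement in this area.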

Link: https://arxiv.org/abs/2412.04447
Authors: Lu Qiu,Yuying Ge,Yi Chen,Yixiao Ge,Ying Shan,Xihui Liu
Keywords: Large Language Models, Language Models, Multimodal Large Language, Large Language, artificial general intelligence
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and data are available at: this https URL

Abstract:The advent of Multimodal Large Language Models, leveraging the power of Large Language Models, has recently demonstrated superior multimodal understanding and reasoning abilities, heralding a new era for artificial general intelligence. However, achieving AGI necessitates more than just comprehension and reasoning. A crucial capability required is effective planning in diverse scenarios, which involves making reasonable decisions based on complex environments to solve real-world problems. Despite its importance, the planning abilities of current MLLMs in varied scenarios remain underexplored. In this paper, we introduce EgoPlan-Bench2, a rigorous and comprehensive benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. EgoPlan-Bench2 encompasses everyday tasks spanning 4 major domains and 24 detailed scenarios, closely aligned with human daily life. EgoPlan-Bench2 is constructed through a semi-automatic process utilizing egocentric videos, complemented by manual verification. Grounded in a first-person perspective, it mirrors the way humans approach problem-solving in everyday life. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. To further improve the planning proficiency of current MLLMs, we propose a training-free approach using multimodal Chain-of-Thought (CoT) prompting through investigating the effectiveness of various multimodal prompts in complex planning. Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training. Our work not only sheds light on the current limitations of MLLMs in planning, but also provides insights for future enhancements in this critical area. We have made data and code available at this https URL.

[CV-19] DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

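[Quick read]: This paper addresses scalability and efficiency in video generation, in particular how to generate video with autoregressive (AR) language models while maintaining quality. The key is DiCoDe, a new approach that uses Diffusion-Compressed Deep Tokens to sharply reduce the token count (a 1000x compression rate), so that vanilla AR language models can generate video efficiently. DiCoDe treats videos as temporal sequences and trains a tokenizer that leverages the prior knowledge of video diffusion models to convert video into deep tokens for AR generation. The method is comparable in quality to existing methods while training efficiently: it can generate videos from a few seconds to one minute long using only 4 A100 GPUs for training.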

Link: https://arxiv.org/abs/2412.04446
Authors: Yizhuo Li,Yuying Ge,Yixiao Ge,Ping Luo,Ying Shan
Keywords: Deep Tokens, language, language models, DiCoDe, models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Videos are inherently temporal sequences by their very nature. In this work, we explore the potential of modeling videos in a chronological and scalable manner with autoregressive (AR) language models, inspired by their success in natural language processing. We introduce DiCoDe, a novel approach that leverages Diffusion-Compressed Deep Tokens to generate videos with a language model in an autoregressive manner. Unlike existing methods that employ low-level representations with limited compression rates, DiCoDe utilizes deep tokens with a considerable compression rate (a 1000x reduction in token count). This significant compression is made possible by a tokenizer trained through leveraging the prior knowledge of video diffusion models. Deep tokens enable DiCoDe to employ vanilla AR language models for video generation, akin to translating one visual “language” into another. By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation. DiCoDe is scalable using readily available AR architectures, and is capable of generating videos ranging from a few seconds to one minute using only 4 A100 GPUs for training. We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it performs comparably to existing methods in terms of quality while ensuring efficient training. To showcase its scalability, we release a series of DiCoDe configurations with varying parameter sizes and observe a consistent improvement in performance as the model size increases from 100M to 3B. We believe that DiCoDe’s exploration in academia represents a promising initial step toward scalable video modeling with AR language models, paving the way for the development of larger and more powerful video generation models.

[CV-20] Learning Artistic Signatures: Symmetry Discovery and Style Transfer

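[Quick read]: This paper addresses the lack of a clear definition of artistic style, and the resulting lack of theoretical grounding and interpretability in style transfer. The key is a new definition of artistic style as a set of global symmetries that dictate the arrangement of local textures. The paper validates this definition by learning the symmetries of a large dataset of paintings and showing that they are predictive of artistic movement. Furthermore, combining local and global features, using both Lie generators and traditional texture measures, captures the stylistic similarity between artists more accurately than either feature set alone, giving a more interpretable and theoretically grounded framework for style transfer.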

Link: https://arxiv.org/abs/2412.04441
Authors: Emma Finn,T. Anderson Keller,Emmanouil Theodosis,Demba E. Ba
Keywords: style transfer, style, decade of literature, undisputed definition, artistic style
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite nearly a decade of literature on style transfer, there is no undisputed definition of artistic style. State-of-the-art models produce impressive results but are difficult to interpret since, without a coherent definition of style, the problem of style transfer is inherently ill-posed. Early work framed style-transfer as an optimization problem but treated style as a measure only of texture. This led to artifacts in the outputs of early models where content features from the style image sometimes bled into the output image. Conversely, more recent work with diffusion models offers compelling empirical results but provides little theoretical grounding. To address these issues, we propose an alternative definition of artistic style. We suggest that style should be thought of as a set of global symmetries that dictate the arrangement of local textures. We validate this perspective empirically by learning the symmetries of a large dataset of paintings and showing that symmetries are predictive of the artistic movement to which each painting belongs. Finally, we show that by considering both local and global features, using both Lie generators and traditional measures of texture, we can quantitatively capture the stylistic similarity between artists better than with either set of features alone. This approach not only aligns well with art historians’ consensus but also offers a robust framework for distinguishing nuanced stylistic differences, allowing for a more interpretable, theoretically grounded approach to style transfer.

[CV-21] GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

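[Quick read]: This paper addresses the difficulty text-to-video generation models have with complex dynamic scenes, in particular generating scenes from compositional text prompts that involve attribute binding, temporal dynamics, and interactions between objects. The key is GenMAC, an iterative multi-agent framework in which role-specialized multimodal large language model (MLLM) agents collaborate to perform compositional text-to-video generation. The workflow has three stages, Design, Generation, and Redesign, with an iterative loop between Generation and Redesign that progressively verifies and refines the generated videos. The Redesign stage is the core: to avoid the hallucination of a single agent it is decomposed into four sequentially executed MLLM agents (verification agent, suggestion agent, correction agent, and output structuring agent), and a self-routing mechanism adaptively selects the appropriate correction agent for each generation scenario. Experiments show that GenMAC achieves state-of-the-art performance in compositional text-to-video generation.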

Link: https://arxiv.org/abs/2412.04440
Authors: Kaiyi Huang,Yukun Huang,Xuefei Ning,Zinan Lin,Yu Wang,Xihui Liu
Keywords: shown significant progress, recent years, models have shown, shown significant, significant progress
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL

Abstract: Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging stage that aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination of a single MLLM agent, we decompose this stage into four sequentially-executed MLLM-based agents: verification agent, suggestion agent, correction agent, and output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, achieving state-of-the-art performance in compositional text-to-video generation.
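
The Design, Generation, and Redesign workflow above can be sketched as a plain control loop. This is only a skeleton under stated assumptions: the agents are passed in as ordinary callables, the report and suggestion dictionaries with their `ok` and `scenario` keys are hypothetical, and nothing here reflects the authors' actual prompts or interfaces.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Plan = Dict[str, Any]


def redesign(plan: Plan, video: Any, prompt: str,
             verify: Callable, suggest: Callable,
             correctors: Dict[str, Callable], structure: Callable) -> Tuple[bool, Plan]:
    """Redesign stage as four sequential agents: verify, suggest, correct, structure."""
    report = verify(video, prompt)                    # verification agent
    if report["ok"]:
        return True, plan
    suggestion = suggest(report)                      # suggestion agent
    corrector = correctors[suggestion["scenario"]]    # self-routing: pick a specialist
    corrected = corrector(plan, suggestion)           # correction agent
    return False, structure(corrected)                # output structuring agent


def generate_with_redesign(prompt: str, design: Callable, generate: Callable,
                           verify: Callable, suggest: Callable,
                           correctors: Dict[str, Callable], structure: Callable,
                           max_iters: int = 5) -> Tuple[Optional[Any], int]:
    """Design once, then loop Generation and Redesign until verification passes."""
    plan = design(prompt)                             # Design stage
    video = None
    for iteration in range(1, max_iters + 1):
        video = generate(plan)                        # Generation stage
        done, plan = redesign(plan, video, prompt, verify, suggest, correctors, structure)
        if done:
            return video, iteration
    return video, max_iters
```

Splitting Redesign into separate callables mirrors the stated motivation: each step has one narrow job, so a mistake in one agent is easier to catch than in a single monolithic call.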

[CV-22] Towards Real-Time Open-Vocabulary Video Instance Segmentation

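[Quick read]: This paper addresses the computational bottlenecks that keep open-vocabulary video instance segmentation (OV-VIS) from running in real time. The key is a new method, TROY-VIS, which significantly improves processing speed while maintaining high accuracy through three techniques: (1) Decoupled Attention Feature Enhancer, which speeds up information interaction between different modalities and scales; (2) Flash Embedding Memory, which obtains text embeddings of object categories quickly; and (3) Kernel Interpolation, which exploits temporal continuity in videos. Experiments show that TROY-VIS achieves the best trade-off between speed and accuracy on two large-scale OV-VIS benchmarks, BURST and LV-VIS, running 20x faster than GLEE-Lite (25 FPS vs. 1.25 FPS) with comparable or better accuracy, which shows its potential for real-time applications in dynamic environments such as mobile robotics and augmented reality.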

Link: https://arxiv.org/abs/2412.04434
Authors: Bin Yan,Martin Sundermeyer,David Joseph Tan,Huchuan Lu,Federico Tombari
Keywords: video instance segmentation, performing open-vocabulary video, open-vocabulary video instance, Decoupled Attention Feature, instance segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: In this paper, we address the challenge of performing open-vocabulary video instance segmentation (OV-VIS) in real-time. We analyze the computational bottlenecks of state-of-the-art foundation models that perform OV-VIS, and propose a new method, TROY-VIS, that significantly improves processing speed while maintaining high accuracy. We introduce three key techniques: (1) Decoupled Attention Feature Enhancer to speed up information interaction between different modalities and scales; (2) Flash Embedding Memory for obtaining fast text embeddings of object categories; and (3) Kernel Interpolation for exploiting the temporal continuity in videos. Our experiments demonstrate that TROY-VIS achieves the best trade-off between accuracy and speed on two large-scale OV-VIS benchmarks, BURST and LV-VIS, running 20x faster than GLEE-Lite (25 FPS vs. 1.25 FPS) with comparable or even better accuracy. These results demonstrate TROY-VIS’s potential for real-time applications in dynamic environments such as mobile robotics and augmented reality. Code and model will be released at this https URL.

[CV-23] PBDyG: Position Based Dynamic Gaussians for Motion-Aware Clothed Human Avatars

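[Quick read]: This paper addresses learning a clothed human model from multiview RGB videos, with emphasis on recovering physically accurate body and cloth movements. The key is Position Based Dynamic Gaussians (PBDyG), which realizes "movement-dependent" cloth deformation through physical simulation rather than relying only on "pose-dependent" rigid transformations. PBDyG models the clothed human holistically but as two distinct physical entities in contact: clothing is modeled as 3D Gaussians attached to a skinned SMPL body that follows the movement of the person in the input videos. The articulation of the SMPL body drives the physical simulation of the clothing Gaussians, transforming the avatar to novel poses. To run the position based dynamics simulation, physical properties including mass and material stiffness are estimated from the RGB videos through Dynamic 3D Gaussian Splatting. Experiments show that the method not only reproduces appearance accurately but also reconstructs avatars wearing highly deformable garments such as skirts or coats, which existing methods struggle with.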

Link: https://arxiv.org/abs/2412.04433
Authors: Shota Sasaki,Jane Wu,Ko Nishino
Keywords: recovering physically accurate, physically accurate body, multiview RGB videos, Based Dynamic Gaussians, Position Based Dynamic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: This paper introduces a novel clothed human model that can be learned from multiview RGB videos, with a particular emphasis on recovering physically accurate body and cloth movements. Our method, Position Based Dynamic Gaussians (PBDyG), realizes “movement-dependent” cloth deformation via physical simulation, rather than merely relying on “pose-dependent” rigid transformations. We model the clothed human holistically but with two distinct physical entities in contact: clothing modeled as 3D Gaussians, which are attached to a skinned SMPL body that follows the movement of the person in the input videos. The articulation of the SMPL body also drives physically-based simulation of the clothes’ Gaussians to transform the avatar to novel poses. In order to run position based dynamics simulation, physical properties including mass and material stiffness are estimated from the RGB videos through Dynamic 3D Gaussian Splatting. Experiments demonstrate that our method not only accurately reproduces appearance but also enables the reconstruction of avatars wearing highly deformable garments, such as skirts or coats, which have been challenging to reconstruct using existing methods.
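
For readers unfamiliar with position based dynamics, the sketch below shows one generic simulation step with distance constraints between particles. It is the textbook scheme (predict positions, project constraints, derive velocities), not the authors' implementation; treating particles as cloth points, the `stiffness` factor, and the pinned particle standing in for a body attachment are assumptions of this illustration.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
DistanceConstraint = Tuple[int, int, float]   # (particle i, particle j, rest length)


def pbd_step(positions: Sequence[Vec3], velocities: Sequence[Vec3],
             inv_masses: Sequence[float], constraints: Sequence[DistanceConstraint],
             dt: float, gravity: Vec3 = (0.0, -9.81, 0.0),
             stiffness: float = 1.0, iterations: int = 10) -> Tuple[List[Vec3], List[Vec3]]:
    """Advance particles by one position based dynamics step."""
    # 1) Predict positions; particles with inverse mass 0 are pinned and do not move.
    predicted = []
    for p, v, w in zip(positions, velocities, inv_masses):
        if w > 0.0:
            v = tuple(vi + dt * gi for vi, gi in zip(v, gravity))
            predicted.append([pi + dt * vi for pi, vi in zip(p, v)])
        else:
            predicted.append(list(p))
    # 2) Iteratively project each distance constraint onto the predicted positions.
    for _ in range(iterations):
        for i, j, rest in constraints:
            delta = [a - b for a, b in zip(predicted[i], predicted[j])]
            dist = math.sqrt(sum(d * d for d in delta))
            w_sum = inv_masses[i] + inv_masses[j]
            if dist == 0.0 or w_sum == 0.0:
                continue
            scale = stiffness * (dist - rest) / (dist * w_sum)
            for k in range(3):
                predicted[i][k] -= inv_masses[i] * scale * delta[k]
                predicted[j][k] += inv_masses[j] * scale * delta[k]
    # 3) Derive velocities from the corrected positions.
    new_positions = [tuple(p) for p in predicted]
    new_velocities = [tuple((q - p) / dt for q, p in zip(new_p, old_p))
                      for new_p, old_p in zip(new_positions, positions)]
    return new_positions, new_velocities
```

The inverse masses and the stiffness factor are the kind of physical properties the abstract says are estimated from video; here they are simply passed in as numbers.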

[CV-24] Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

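[Quick read]: This paper addresses unifying video comprehension and generation within large language models (LLMs). The key is a versatile video tokenizer, Divot, which uses the diffusion process for self-supervised video representation learning. Divot captures both the spatial characteristics and temporal dynamics of videos to produce representations for LLMs, and these representations can be decoded into realistic video clips by a video diffusion model, which enables video generation. Integrated with a pre-trained LLM, Divot achieves competitive performance on video comprehension and generation tasks, and the Divot-Vicuna model performs well in video storytelling.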

Link: https://arxiv.org/abs/2412.04432
Authors: Yuying Ge,Yizhuo Li,Yixiao Ge,Ying Shan
Keywords: Large Language Models, Large Language, Language Models, unifying image comprehension, video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project released at: this https URL

Abstract:In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and the representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.

[CV-25] Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

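[Quick read]: This paper addresses how to improve the resolution, detail quality, and generation speed of text-to-image generation. The key is Infinity, a method based on Bitwise Visual AutoRegressive Modeling. By introducing an infinite-vocabulary tokenizer and a bitwise self-correction mechanism, Infinity markedly improves generation capacity and detail. In addition, by theoretically scaling the tokenizer vocabulary size to infinity and concurrently scaling the transformer size, Infinity outperforms top-tier diffusion models such as SD3-Medium and SDXL in both generation speed and quality, and the paper describes it as the fastest text-to-image model.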

Link: https://arxiv.org/abs/2412.04431
Authors: Jian Han,Jinlai Liu,Yi Jiang,Bin Yan,Yuqi Zhang,Zehuan Yuan,Bingyue Peng,Xiaobing Liu
Keywords: Bitwise Visual AutoRegressive, AutoRegressive Modeling capable, generating high-resolution, language instruction, capable of generating
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 14 figures

Abstract:We present Infinity, a Bitwise Visual AutoRegressive Modeling capable of generating high-resolution, photorealistic images following language instruction. Infinity redefines visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer classifier and bitwise self-correction mechanism, remarkably improving the generation capacity and details. By theoretically scaling the tokenizer vocabulary size to infinity and concurrently scaling the transformer size, our method significantly unleashes powerful scaling capabilities compared to vanilla VAR. Infinity sets a new record for autoregressive text-to-image models, outperforming top-tier diffusion models like SD3-Medium and SDXL. Notably, Infinity surpasses SD3-Medium by improving the GenEval benchmark score from 0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a win rate of 66%. Without extra optimization, Infinity generates a high-quality 1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium and establishing it as the fastest text-to-image model. Models and codes will be released to promote further exploration of Infinity for visual generation and unified tokenizer modeling.
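
One way to see why predicting bits rather than token indices helps a vocabulary scale is a simple counting argument, sketched below. It assumes each visual token is a `num_bits`-bit code, so the vocabulary has `2**num_bits` entries; the helper names are hypothetical, and the abstract does not describe the actual classifier or the self-correction mechanism at this level of detail.

```python
from typing import List, Tuple


def index_to_bits(index: int, num_bits: int) -> List[int]:
    """Write a token index as its bits, least significant bit first."""
    if not 0 <= index < (1 << num_bits):
        raise ValueError("index does not fit in num_bits")
    return [(index >> i) & 1 for i in range(num_bits)]


def bits_to_index(bits: List[int]) -> int:
    """Inverse of index_to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def classifier_outputs(num_bits: int) -> Tuple[int, int]:
    """Output count for (index-wise softmax over the vocabulary, bitwise binary predictions)."""
    return (1 << num_bits, num_bits)
```

With 16-bit tokens, an index-wise classifier needs 65,536 outputs while a bitwise one needs 16, and the gap doubles with every added bit. That is the sense in which the vocabulary can grow without the prediction head growing with it.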

[CV-26] Grounding Descriptions in Images informs Zero-Shot Visual Recognition

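[Quick read]: This paper addresses two main problems that vision-language models (VLMs) such as CLIP face in zero-shot visual recognition: (1) difficulty identifying fine-grained entities; and (2) difficulty generalizing to unseen concepts outside the training distribution. The key is GRAIN, a new pretraining strategy that aligns image and description representations at both fine and coarse levels simultaneously. Specifically, GRAIN jointly learns to ground textual descriptions in image regions and to align overarching captions with global image representations, which improves zero-shot recognition. To drive this pretraining, the authors use frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. Experiments show that GRAIN improves zero-shot performance across 11 diverse image classification datasets and recognizes novel concepts on the newly introduced Products-2023 dataset.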

链接: https://arxiv.org/abs/2412.04429
作者: Shaunak Halbe,Junjiao Tian,K J Joseph,James Seale Smith,Katherine Stevo,Vineeth N Balasubramanian,Zsolt Kira
关键词-EN: perform zero-shot visual, zero-shot visual recognition, Vision-language models, visual recognition, recognition on open-vocabulary
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

Abstract:Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations. To drive this pre-training, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods across 11 diverse image classification datasets. Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model’s ability to recognize these concepts by benchmarking on it. Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach. Code available at this https URL .
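
The zero-shot step this abstract describes (pick the category whose text representation is most similar to the query image) can be sketched roughly as follows. This is only an illustration: the three-dimensional embeddings are made-up toy values, and the names `cosine_similarity` and `zero_shot_classify` are hypothetical helpers, not part of CLIP or GRAIN.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the class prompt whose text embedding is most similar to the image."""
    return max(
        class_text_embeddings,
        key=lambda name: cosine_similarity(image_embedding, class_text_embeddings[name]),
    )

# Toy embeddings (assumed values; a real system would obtain them from
# the image encoder and the text encoder of a vision-language model).
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]
prediction = zero_shot_classify(image_embedding, text_embeddings)
```

Because classification reduces to a nearest-neighbor lookup in the shared embedding space, any misalignment between image and description embeddings directly hurts accuracy, which is the issue the abstract attributes to CLIP's pretraining.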

[CV-27] Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

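[Quick read]: This paper addresses the limitations of visual feature representations in multimodal large language models (MLLMs), in particular the shortcomings of existing CLIP-style vision transformers trained with contrastive learning in capturing visual features at multiple levels and aspects. The key to the solution is Florence-VL, a new family of MLLMs whose visual representations are produced by the generative vision foundation model Florence-2, which captures richer and more diverse visual features. The key innovations are: (1) a novel feature-fusion architecture, "depth-breadth fusion (DBFusion)", which fuses visual features extracted from different depths and under multiple prompts; (2) a training strategy that integrates Florence-2's visual features into pretrained language models (such as Phi 3.5 and LLama 3) through end-to-end pretraining followed by finetuning; and (3) training on a carefully designed mix of diverse open-source datasets covering high-quality image captions and instruction-tuning pairs. With these innovations, Florence-VL significantly outperforms existing state-of-the-art models on vision-language alignment and on multimodal and vision-centric benchmarks.

Link: https://arxiv.org/abs/2412.04424
Authors: Jiuhai Chen,Jianwei Yang,Haiping Wu,Dianqi Li,Jianfeng Gao,Tianyi Zhou,Bin Xiao
Keywords: multimodal large language, visual representations produced, generative vision foundation, large language models, visual features
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: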

Abstract:We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile to be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2’s visual features into pretrained LLMs, such as Phi 3.5 and LLama 3. In particular, we propose “depth-breadth fusion (DBFusion)” to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL’s visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multi-modal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, Chart, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced. this https URL

[CV-28] FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for Mitigating Data Heterogeneity in Federated Learning

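[Quick read]: This paper addresses the performance degradation, slower convergence, and reduced global-model robustness in Federated Learning (FL) caused by heterogeneous client data distributions, especially label skew. The key to the solution is a dual-strategy approach. First, an adaptive loss function for client-side training preserves previously acquired knowledge and balances local optimization against consistency with the global model. Second, a dynamic aggregation strategy for combining client models at the server adapts to each client's unique learning patterns, addressing the challenge of data diversity across the network. A comprehensive evaluation on three real-world datasets, together with theoretical convergence guarantees, shows that the method outperforms existing state-of-the-art approaches.

Link: https://arxiv.org/abs/2412.04416
Authors: Pranab Sahoo,Ashutosh Tripathi,Sriparna Saha,Samrat Mondal
Keywords: combining locally optimized, locally optimized models, Federated Learning, unified global model, marks a transformative
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes: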

Abstract:Federated Learning (FL) marks a transformative approach to distributed model training by combining locally optimized models from various clients into a unified global model. While FL preserves data privacy by eliminating centralized storage, it encounters significant challenges such as performance degradation, slower convergence, and reduced robustness of the global model due to the heterogeneity in client data distributions. Among the various forms of data heterogeneity, label skew emerges as a particularly formidable and prevalent issue, especially in domains such as image classification. To address these challenges, we begin with comprehensive experiments to pinpoint the underlying issues in the FL training process. Based on our findings, we then introduce an innovative dual-strategy approach designed to effectively resolve these issues. First, we introduce an adaptive loss function for client-side training, meticulously crafted to preserve previously acquired knowledge while maintaining an optimal equilibrium between local optimization and global model coherence. Secondly, we develop a dynamic aggregation strategy for aggregating client models at the server. This approach adapts to each client’s unique learning patterns, effectively addressing the challenges of diverse data across the network. Our comprehensive evaluation, conducted across three diverse real-world datasets, coupled with theoretical convergence guarantees, demonstrates the superior efficacy of our method compared to several established state-of-the-art approaches.
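
For context, the basic idea of "combining locally optimized models from various clients into a unified global model" is often realized as a data-size-weighted average of client parameters (commonly called FedAvg). The sketch below shows only that standard baseline, under the assumption that each model is a flat list of floats; it is not the dynamic aggregation strategy proposed in the paper, and `federated_average` is a hypothetical name.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighting each client by its sample count.

    client_weights: list of parameter lists, one per client (same length each).
    client_sizes:   number of local training samples held by each client.
    """
    total = sum(client_sizes)
    num_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(num_params)
    ]

# Two clients: the second holds three times as much data, so it dominates.
global_model = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

With label skew, a fixed size-based weighting like this can pull the global model toward clients with unrepresentative label distributions, which is the kind of issue an adaptive aggregation rule aims to mitigate.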

[CV-29] Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction

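[Quick read]: This paper addresses the inefficiency of existing 3D semantic occupancy prediction methods for autonomous driving: they typically represent the scene with dense grids and overlook the spatial sparsity of driving scenes, which wastes computation. The key to the solution is a probabilistic Gaussian superposition model that interprets each Gaussian as a probability distribution of its neighborhood being occupied and derives the overall geometry through probabilistic multiplication. In addition, the paper adopts an exact Gaussian mixture model for semantics calculation, avoiding unnecessary overlap between Gaussians. To initialize Gaussians effectively in non-empty regions, it designs a distribution-based initialization module that learns the pixel-aligned occupancy distribution rather than surface depth. Experiments show that the method achieves state-of-the-art performance with high efficiency on the nuScenes and KITTI-360 datasets.

Link: https://arxiv.org/abs/2412.04384
Authors: Yuanhui Huang,Amonnut Thammatadatrakoon,Wenzhao Zheng,Yunpeng Zhang,Dalong Du,Jiwen Lu
Keywords: robust vision-centric autonomous, vision-centric autonomous driving, predicts fine-grained geometry, important task, task for robust
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Code is available at: this https URL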

Abstract:3D semantic occupancy prediction is an important task for robust vision-centric autonomous driving, which predicts fine-grained geometry and semantics of the surrounding scene. Most existing methods leverage dense grid-based scene representations, overlooking the spatial sparsity of the driving scenes. Although 3D semantic Gaussian serves as an object-centric sparse alternative, most of the Gaussians still describe the empty region with low efficiency. To address this, we propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied and conforms to probabilistic multiplication to derive the overall geometry. Furthermore, we adopt the exact Gaussian mixture model for semantics calculation to avoid unnecessary overlapping of Gaussians. To effectively initialize Gaussians in non-empty region, we design a distribution-based initialization module which learns the pixel-aligned occupancy distribution instead of the depth of surfaces. We conduct extensive experiments on nuScenes and KITTI-360 datasets and our GaussianFormer-2 achieves state-of-the-art performance with high efficiency. Code: this https URL.
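
One plausible reading of "probabilistic multiplication" is that a point is empty only if every Gaussian independently says it is empty, so the occupancy probability is one minus the product of the per-Gaussian "empty" probabilities. The sketch below illustrates that reading only. It assumes isotropic Gaussians (a single scalar scale) for brevity and uses hypothetical function names; the paper's exact formulation may differ.

```python
import math

def gaussian_occupancy(point, mean, scale):
    """Probability that `point` is occupied according to one isotropic Gaussian.

    Equals 1 at the Gaussian's mean and decays with squared distance.
    """
    sq_dist = sum((p - m) ** 2 for p, m in zip(point, mean))
    return math.exp(-0.5 * sq_dist / (scale ** 2))

def superposed_occupancy(point, gaussians):
    """Combine Gaussians multiplicatively: occupied unless all of them say empty."""
    prob_empty = 1.0
    for mean, scale in gaussians:
        prob_empty *= 1.0 - gaussian_occupancy(point, mean, scale)
    return 1.0 - prob_empty

# Two toy Gaussians given as (mean, scale); values are assumptions for illustration.
gaussians = [((0.0, 0.0, 0.0), 1.0), ((2.0, 0.0, 0.0), 1.0)]
at_center = superposed_occupancy((0.0, 0.0, 0.0), gaussians)
between = superposed_occupancy((1.0, 0.0, 0.0), gaussians)
far_away = superposed_occupancy((10.0, 10.0, 10.0), gaussians)
```

Under this combination rule the result always stays within [0, 1], and regions far from every Gaussian are treated as empty without needing Gaussians placed there, which matches the abstract's motivation of not spending Gaussians on empty space.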

[CV-30] SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

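[Quick read]: This paper addresses the limited scalability and adaptability of 3D Visual Grounding (3DVG) methods that rely on annotated 3D datasets and predefined object categories. The key to the solution is the SeeGround framework, which uses Vision-Language Models (VLMs) trained on large-scale 2D data to perform zero-shot 3DVG. Specifically, SeeGround represents a 3D scene as a hybrid of query-aligned rendered images and spatially enriched text descriptions, bridging the gap between 3D data and the input format of 2D VLMs. Its core modules are the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions to enhance object localization. Experiments show that the method significantly outperforms existing zero-shot methods on the ScanRefer and Nr3D datasets, exceeds weakly supervised methods, and rivals some fully supervised ones.

Link: https://arxiv.org/abs/2412.04383
Authors: Rong Li,Shijie Li,Lingdong Kong,Xulei Yang,Junwei Liang
Keywords: Visual Grounding, aims to locate, reality and robotics, based on textual, essential for applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: Preprint; 19 pages, 10 figures, 9 tables; Project Page at this https URL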

Abstract:3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on textual descriptions, which is essential for applications like augmented reality and robotics. Traditional 3DVG approaches rely on annotated 3D datasets and predefined object categories, limiting scalability and adaptability. To overcome these limitations, we introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data. We propose to represent 3D scenes as a hybrid of query-aligned rendered images and spatially enriched text descriptions, bridging the gap between 3D data and 2D-VLMs input formats. We propose two modules: the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions to enhance object localization. Extensive experiments on ScanRefer and Nr3D demonstrate that our approach outperforms existing zero-shot methods by large margins. Notably, we exceed weakly supervised methods and rival some fully supervised ones, outperforming previous SOTA by 7.7% on ScanRefer and 7.1% on Nr3D, showcasing its effectiveness.

[CV-31] EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

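[Quick read]: This paper addresses 3D occupancy prediction for embodied agents that explore a scene progressively. Existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents that must perceive the scene gradually. The key to the solution is EmbodiedOcc, a Gaussian-based framework that initializes the global scene with uniformly distributed 3D semantic Gaussians and progressively updates the local regions observed by the agent. For each update, it extracts semantic and structural features from the observed image and incorporates them efficiently via deformable cross-attention to refine the regional Gaussians. Finally, it obtains the global 3D occupancy from the updated 3D Gaussians through Gaussian-to-voxel splatting. EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and gains knowledge through progressive local refinement of Gaussians, mirroring how humans understand new scenes through embodied exploration.

Link: https://arxiv.org/abs/2412.04380
Authors: Yuqi Wu,Wenzhao Zheng,Sicheng Zuo,Yuanhui Huang,Jie Zhou,Jiwen Lu
Keywords: comprehensive description, Gaussians, occupancy prediction, embodied, occupancy prediction task
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Code: this https URL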

Abstract:3D occupancy prediction provides a comprehensive description of the surrounding scenes and has become an essential task for 3D perception. Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents, which need to gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize the global scene with uniform 3D semantic Gaussians and progressively update local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and efficiently incorporate them via deformable cross-attention to refine the regional Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global 3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and maintains an explicit global memory of it with 3D Gaussians. It gradually gains knowledge through local refinement of regional Gaussians, which is consistent with how humans understand new scenes through embodied exploration. We reorganize an EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the evaluation of the embodied 3D occupancy prediction task. Experiments demonstrate that our EmbodiedOcc outperforms existing local prediction methods and accomplishes the embodied occupancy prediction with high accuracy and strong expandability. Our code is available at: this https URL.

[CV-32] Discriminative Fine-tuning of LVLMs

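[Quick read]: This paper addresses two limitations: contrastively trained Vision-Language Models (VLMs) such as CLIP have limited language understanding and often exhibit "bag of words" behavior, while Large Vision-Language Models (LVLMs) are poorly suited to discriminative tasks. The key to the solution is a new training approach that converts a generative LVLM into a discriminative one, retaining strong image-text discrimination while enhancing language understanding. Specifically, the method combines a contrastive loss with a next-token prediction loss, trains on image-text pairs of variable length and granularity, and achieves parameter-efficient adaptation through soft prompting and LoRA adapters. The approach significantly outperforms state-of-the-art CLIP-like models on standard image-text retrieval benchmarks and yields notable gains in compositionality.

Link: https://arxiv.org/abs/2412.04378
Authors: Yassine Ouali,Adrian Bulat,Alexandros Xenos,Anestis Zaganidis,Ioannis Maniadis Metaxas,Georgios Tzimiropoulos,Brais Martinez
Keywords: Contrastively-trained Vision-Language Models, vision-language representation learning, Contrastively-trained Vision-Language, representation learning, Large Vision-Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Preprint. The first two authors contributed equally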

Abstract:Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a “bag of words” behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine “the best of both worlds”: a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework’s components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.

[CV-33] A Hitchhiker's Guide to Understanding Performances of Two-Class Classifiers

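[Quick read]: This paper addresses how to understand classifier performance comprehensively across application scenarios instead of relying on only one or two standard scores. The key to the solution is the Tile, a tool that organizes an infinity of ranking scores into a 2D map, making it possible to evaluate and compare classifiers efficiently while displaying all possible application-specific preferences. Using the Tile, the paper shows how to provide different interpretative flavors adapted to the needs of different users (a theoretical analyst, a method designer, a benchmarker, and an application developer) and how to capture classifier behavior in a single visualization.

Link: https://arxiv.org/abs/2412.04377
Authors: Anaïs Halin,Sébastien Piérard,Anthony Cioppa,Marc Van Droogenbroeck
Keywords: Properly understanding, Tile, Properly, classifiers, compare classifiers
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Performance (cs.PF)
Notes: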

Abstract:Properly understanding the performances of classifiers is essential in various scenarios. However, the literature often relies only on one or two standard scores to compare classifiers, which fails to capture the nuances of application-specific requirements, potentially leading to suboptimal classifier selection. Recently, a paper on the foundations of the theory of performance-based ranking introduced a tool, called the Tile, that organizes an infinity of ranking scores into a 2D map. Thanks to the Tile, it is now possible to evaluate and compare classifiers efficiently, displaying all possible application-specific preferences instead of having to rely on a pair of scores. In this paper, we provide a first hitchhiker’s guide for understanding the performances of two-class classifiers by presenting four scenarios, each showcasing a different user profile: a theoretical analyst, a method designer, a benchmarker, and an application developer. Particularly, we show that we can provide different interpretative flavors that are adapted to the user’s needs by mapping different values on the Tile. As an illustration, we leverage the newly introduced Tile tool and the different flavors to rank and analyze the performances of 74 state-of-the-art semantic segmentation models in two-class classification through the eyes of the four user profiles. Through these user profiles, we demonstrate that the Tile effectively captures the behavior of classifiers in a single visualization, while accommodating an infinite number of ranking scores.

[CV-34] ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation NEURIPS2024

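[Quick read]: This paper addresses the temporal analysis of actions in videos, specifically temporal action segmentation and long-term action anticipation. The key to the solution is ActFusion, a unified diffusion model that handles both the visible and invisible parts of a video sequence: the visible part is used for action segmentation and the invisible part for future action anticipation. To achieve this, the paper introduces a new anticipative masking strategy in which a late part of the video frames is masked as invisible during training and replaced with learnable tokens, so that the model learns to predict the unseen future. Experiments show that this joint learning achieves state-of-the-art performance on both action segmentation and anticipation, outperforming task-specific models.

Link: https://arxiv.org/abs/2412.04353
Authors: Dayoung Gong,Suha Kwak,Minsu Cho
Keywords: popular vision tasks, long-term action anticipation, action segmentation, Temporal action segmentation, popular vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Accepted to NeurIPS 2024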

Abstract:Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos. Despite apparent relevance and potential complementarity, these two problems have been investigated as separate and distinct tasks. In this work, we tackle these two problems, action segmentation and action anticipation, jointly using a unified diffusion model dubbed ActFusion. The key idea to unification is to train the model to effectively handle both visible and invisible parts of the sequence in an integrated manner; the visible part is for temporal segmentation, and the invisible part is for future anticipation. To this end, we introduce a new anticipative masking strategy during training in which a late part of the video frames is masked as invisible, and learnable tokens replace these frames to learn to predict the invisible future. Experimental results demonstrate the bi-directional benefits between action segmentation and anticipation. ActFusion achieves the state-of-the-art performance across the standard benchmarks of 50 Salads, Breakfast, and GTEA, outperforming task-specific models in both of the two tasks with a single unified model through joint learning.
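
The anticipative masking idea (hide a late portion of the frames and substitute a placeholder token) can be illustrated with a very small sketch. Everything here is a simplifying assumption: frames are plain Python values, the mask token is a fixed string rather than a learnable vector, and `anticipative_mask` and `observed_ratio` are hypothetical names not taken from the paper.

```python
def anticipative_mask(frames, observed_ratio, mask_token):
    """Keep the first `observed_ratio` fraction of frames and mask the rest.

    Returns the masked sequence and the number of visible frames. In a real
    model the mask token would be a learnable embedding, not a constant.
    """
    num_visible = int(len(frames) * observed_ratio)
    visible = list(frames[:num_visible])
    masked = [mask_token] * (len(frames) - num_visible)
    return visible + masked, num_visible

frames = ["f0", "f1", "f2", "f3"]
masked_frames, num_visible = anticipative_mask(frames, 0.5, "<MASK>")
```

The output sequence keeps its original length, so a single model can label the visible prefix (segmentation) and fill in the masked suffix (anticipation) in one pass.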

[CV-35] RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse

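[Quick read]: This paper addresses the fact that practical motion generation is constrained by dataset diversity and scale, which limits its ability to handle out-of-distribution scenarios. The key to the solution is a simple and effective baseline called RMD (Retrieval-Augmented Motion Diffuse), which improves the generalization of motion generation through retrieval augmentation. Its key advantages are: (1) the external retrieval database can be flexibly replaced; (2) body parts from the motion database can be reused, with a large language model (LLM) handling splitting and recombination; and (3) a pretrained motion diffusion model serves as a prior to improve the quality of motions obtained through retrieval and direct combination. RMD achieves state-of-the-art performance without any additional training, with notable advantages on out-of-distribution data.

Link: https://arxiv.org/abs/2412.04343
Authors: Zhouyingcheng Liao,Mingyuan Zhang,Wenjia Wang,Lei Yang,Taku Komura
Keywords: made substantial progress, practical application remains, application remains constrained, substantial progress, diversity and scale
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Notes: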

Abstract:While motion generation has made substantial progress, its practical application remains constrained by dataset diversity and scale, limiting its ability to handle out-of-distribution scenarios. To address this, we propose a simple and effective baseline, RMD, which enhances the generalization of motion generation through retrieval-augmented techniques. Unlike previous retrieval-based methods, RMD requires no additional training and offers three key advantages: (1) the external retrieval database can be flexibly replaced; (2) body parts from the motion database can be reused, with an LLM facilitating splitting and recombination; and (3) a pre-trained motion diffusion model serves as a prior to improve the quality of motions obtained through retrieval and direct combination. Without any training, RMD achieves state-of-the-art performance, with notable advantages on out-of-distribution data.

[CV-36] Reflective Teacher: Semi-Supervised Multimodal 3D Object Detection in Bird's-Eye-View via Uncertainty Measure

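[Quick read]: This paper addresses the catastrophic forgetting that affects the teacher network when pseudo-labeling is used in Semi-Supervised 3D Object Detection (SSOD). The key to the solution is a new concept called Reflective Teacher: the student network is trained on both labeled and pseudo-labeled data, and its knowledge is progressively passed to the teacher through a regularizer so that the teacher retains previous knowledge. In addition, the paper proposes Geometry Aware BEV Fusion (GA-BEVFusion) for efficient alignment of multi-modal BEV features, which reduces the disparity between the camera and LiDAR modalities and supports accurate mapping of geometric information and extraction of semantic information. Experiments show that the method outperforms existing state-of-the-art methods on the nuScenes and Waymo datasets and, using only part of the labeled data, matches fully supervised methods that use the full labeled dataset.

Link: https://arxiv.org/abs/2412.04337
Authors: Saheli Hazra,Sudip Das,Rohit Choudhary,Arindam Das,Ganesh Sistu,Ciaran Eising,Ujjwal Bhattacharya
Keywords: Applying pseudo labeling, Exponential Moving Average, pseudo labeling techniques, Applying pseudo, object detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: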

Abstract:Applying pseudo labeling techniques has been found to be advantageous in semi-supervised 3D object detection (SSOD) in Bird’s-Eye-View (BEV) for autonomous driving, particularly where labeled data is limited. In the literature, Exponential Moving Average (EMA) has been used for adjustments of the weights of teacher network by the student network. However, the same induces catastrophic forgetting in the teacher network. In this work, we address this issue by introducing a novel concept of Reflective Teacher where the student is trained by both labeled and pseudo labeled data while its knowledge is progressively passed to the teacher through a regularizer to ensure retention of previous knowledge. Additionally, we propose Geometry Aware BEV Fusion (GA-BEVFusion) for efficient alignment of multi-modal BEV features, thus reducing the disparity between the modalities - camera and LiDAR. This helps to map the precise geometric information embedded among LiDAR points reliably with the spatial priors for extraction of semantic information from camera images. Our experiments on the nuScenes and Waymo datasets demonstrate: 1) improved performance over state-of-the-art methods in both fully supervised and semi-supervised settings; 2) Reflective Teacher achieves equivalent performance with only 25% and 22% of labeled data for nuScenes and Waymo datasets respectively, in contrast to other fully supervised methods that utilize the full labeled dataset.
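
For reference, the Exponential Moving Average (EMA) teacher update that the abstract identifies as the source of catastrophic forgetting is usually written as teacher = m * teacher + (1 - m) * student. The sketch below shows this conventional baseline only, not the Reflective Teacher regularizer; weights are assumed to be flat lists of floats and `ema_update` is a hypothetical name.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

# With momentum 0.5 the teacher lands halfway between its old value and the student.
teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.5)
```

Because every step overwrites part of the teacher with the current student, errors the student picks up from noisy pseudo labels gradually leak into the teacher, which is one way to understand the forgetting issue the paper targets.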

[CV-37] Liquid: Language Models are Scalable Multi-modal Generators

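[Quick read]: This paper addresses the integration of vision and language tasks in multimodal large language models (MLLMs), in particular how to integrate visual comprehension and generation seamlessly without relying on external pretrained visual embeddings such as CLIP. The key to the solution is Liquid, an auto-regressive generation paradigm that tokenizes images into discrete codes and learns these code embeddings alongside text tokens in a feature space shared by vision and language. This removes the need for separate visual embeddings and also reveals a scaling law: the performance drop caused by unified training on visual and language tasks diminishes as model size increases. Moreover, the unified token space lets visual generation and comprehension tasks enhance each other, effectively removing the interference common in earlier models. By building on existing LLMs, Liquid performs well in both multimodal capability and language performance while significantly reducing training cost.

Link: https://arxiv.org/abs/2412.04332
Authors: Junfeng Wu,Yi Jiang,Chuofan Ma,Yuliang Liu,Hengshuang Zhao,Zehuan Yuan,Song Bai,Xiang Bai
Keywords: embeddings alongside text, seamlessly integrates visual, alongside text tokens, shared feature space, auto-regressive generation paradigm
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Technical report. Will be updated soon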

Abstract:We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using a single large language model (LLM), eliminating the need for external pretrained visual embeddings such as CLIP. For the first time, Liquid uncovers a scaling law: the performance drop unavoidably brought by the unified training of visual and language tasks diminishes as the model size increases. Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other, effectively removing the typical interference seen in earlier models. We show that existing LLMs can serve as strong foundations for Liquid, saving 100x in training costs while outperforming Chameleon in multimodal capabilities and maintaining language performance comparable to mainstream LLMs like LLAMA2. Liquid also outperforms models like SD v2.1 and SD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and text-only tasks. This work demonstrates that LLMs such as LLAMA3.2 and GEMMA2 are powerful multimodal generators, offering a scalable solution for enhancing both vision-language understanding and generation. The code and models will be released.

[CV-38] FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

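[Quick read]: This paper addresses the slow response and large latency of multimodal large language models (MLLMs) in practical use. The key to the solution is FlashSloth, an efficient tiny MLLM that improves the descriptive power of visual tokens while compressing their redundant semantics, thereby reducing the number of visual tokens, the training memory, and the computational complexity while retaining strong performance on vision-language tasks. Specifically, FlashSloth introduces embedded visual compression designs that capture both visually salient and instruction-related image information, achieving superior multimodal performance with fewer visual tokens.

Link: https://arxiv.org/abs/2412.04317
Authors: Bo Tong,Bokai Lai,Yiyi Zhou,Gen Luo,Yunhang Shen,Ke Li,Xiaoshuai Sun,Rongrong Ji
Keywords: large language models, big leap forward, multimodal large language, large latency, forward in capability
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: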

Abstract:Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., slow response and large latency. Recent efforts are devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens still used limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Different from previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens in the process of compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs to capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, and a bunch of tiny but strong MLLMs are also comprehensively compared, e.g., InternVL2, MiniCPM-V2 and Qwen2-VL. The experimental results show that compared with these advanced tiny MLLMs, our FlashSloth can greatly reduce the number of visual tokens, training memory and computation complexity while retaining high performance on various VL tasks.

[CV-39] LocalSR: Image Super-Resolution in Local Region

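[Quick read]: This paper addresses application scenarios in which only local regions of an image need super-resolution. Conventional methods super-resolve the entire image, which is costly in computation and memory and unnecessary when high resolution is needed only locally. The key to the solution is a new method called Context-based Local Super-Resolution (CLSR), built from three parallel processing modules: (1) a base module that super-resolves the Region of Interest (ROI); (2) a global context module that gathers helpful features from across the whole image; and (3) a proximity integration module that concentrates on the area surrounding the ROI and progressively propagates features from distant pixels to the target region. Experiments show that the method, with its reduced complexity, outperforms variants that focus exclusively on the ROI.

Link: https://arxiv.org/abs/2412.04314
Authors: Bo Ji,Angela Yao
Keywords: Standard single-image super-resolution, Standard single-image, restores entire images, entire image, ROI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: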

Abstract:Standard single-image super-resolution (SR) upsamples and restores entire images. Yet several real-world applications require higher resolutions only in specific regions, such as license plates or faces, making the super-resolution of the entire image, along with the associated memory and computational cost, unnecessary. We propose a novel task, called LocalSR, to restore only local regions of the low-resolution image. For this problem setting, we propose a context-based local super-resolution (CLSR) to super-resolve only specified regions of interest (ROI) while leveraging the entire image as context. Our method uses three parallel processing modules: a base module for super-resolving the ROI, a global context module for gathering helpful features from across the image, and a proximity integration module for concentrating on areas surrounding the ROI, progressively propagating features from distant pixels to the target region. Experimental results indicate that our approach, with its reduced complexity, outperforms variants that focus exclusively on the ROI.

[CV-40] The Tile: A 2D Map of Ranking Scores for Two-Class Classification

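[Quick read]: This paper addresses how to evaluate classifiers rigorously and how to compare and rank them accurately in computer vision and machine learning. Existing evaluation tools such as the Receiver Operating Characteristic (ROC) and Precision/Recall (PR) spaces display performance effectively but are limited for comparing classifiers, especially when application-specific preferences matter. The key to the solution is the Tile, a novel and versatile tool that organizes an infinity of ranking scores on a single 2D map. The Tile covers common evaluation scores, including accuracy, the true positive rate, the positive predictive value, Jaccard's coefficient, and all F-beta scores. The paper also studies intrinsic properties of these ranking scores, such as the influence of the priors and the correspondence with the ROC space. In this way, the Tile captures all rankings in a single visualization and helps interpret them.

Link: https://arxiv.org/abs/2412.04309
Authors: Sébastien Piérard,Anaïs Halin,Anthony Cioppa,Adrien Deliège,Marc Van Droogenbroeck
Keywords: machine learning communities, learning communities, research domains, computer vision, vision and machine
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Performance (cs.PF)
Notes: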

Abstract:In the computer vision and machine learning communities, as well as in many other research domains, rigorous evaluation of any new method, including classifiers, is essential. One key component of the evaluation process is the ability to compare and rank methods. However, ranking classifiers and accurately comparing their performances, especially when taking application-specific preferences into account, remains challenging. For instance, commonly used evaluation tools like Receiver Operating Characteristic (ROC) and Precision/Recall (PR) spaces display performances based on two scores. Hence, they are inherently limited in their ability to compare classifiers across a broader range of scores and lack the capability to establish a clear ranking among classifiers. In this paper, we present a novel versatile tool, named the Tile, that organizes an infinity of ranking scores in a single 2D map for two-class classifiers, including common evaluation scores such as the accuracy, the true positive rate, the positive predictive value, Jaccard’s coefficient, and all F-beta scores. Furthermore, we study the properties of the underlying ranking scores, such as the influence of the priors or the correspondences with the ROC space, and depict how to characterize any other score by comparing them to the Tile. Overall, we demonstrate that the Tile is a powerful tool that effectively captures all the rankings in a single visualization and allows interpreting them.
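
The scores named in this abstract (accuracy, true positive rate, positive predictive value, Jaccard's coefficient, and F-beta) all derive from the same two-class confusion matrix, which helps explain why they can be laid out on a common map. The sketch below computes them using their standard textbook definitions; it does not reproduce the Tile's own parameterization, `two_class_scores` is a hypothetical name, and the function assumes all denominators are non-zero.

```python
def two_class_scores(tp, fp, fn, tn, beta=1.0):
    """Compute common two-class scores from confusion-matrix counts.

    tp/fp/fn/tn: true positives, false positives, false negatives, true negatives.
    beta: weight of recall relative to precision in the F-beta score.
    """
    b2 = beta ** 2
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "tpr": tp / (tp + fn),                # true positive rate (recall)
        "ppv": tp / (tp + fp),                # positive predictive value (precision)
        "jaccard": tp / (tp + fp + fn),       # intersection over union for the positive class
        "f_beta": (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp),
    }

scores = two_class_scores(tp=40, fp=10, fn=10, tn=40)
```

Each score weights the four counts differently, so two classifiers can swap ranks depending on which score is chosen; a map over a whole family of scores makes such rank changes visible at once.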

[CV-41] Towards Zero-shot 3D Anomaly Localization WACV2025

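[Quick read]: This paper addresses 3D anomaly detection and localization in industrial inspection when normal training data for the target 3D objects is unavailable, for example because of data privacy or export control regulations. The key to the solution is a new task, zero-shot 3D anomaly detection and localization, together with the 3DzAL framework. The framework generates pseudo anomalies from task-irrelevant 3D xyz data and uses patch-level contrastive learning to learn more representative features. The paper also trains a normalcy classifier network, designs the anomaly score by combining feature distance with the classification result, and strengthens the classification-based anomaly score by adding adversarial perturbations to the input.

Link: https://arxiv.org/abs/2412.04304
Authors: Yizhou Wang,Kuan-Chuan Peng,Yun Fu
Keywords: detection and localization, anomaly detection, industrial inspection, great significance, significance for industrial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted to WACV 2025
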

Abstract:3D anomaly detection and localization is of great significance for industrial inspection. Prior 3D anomaly detection and localization methods focus on the setting that the testing data share the same category as the training data which is normal. However, in real-world applications, the normal training data for the target 3D objects can be unavailable due to issues like data privacy or export control regulation. To tackle these challenges, we identify a new task – zero-shot 3D anomaly detection and localization, where the training and testing classes do not overlap. To this end, we design 3DzAL, a novel patch-level contrastive learning framework based on pseudo anomalies generated using the inductive bias from task-irrelevant 3D xyz data to learn more representative feature representations. Furthermore, we train a normalcy classifier network to classify the normal patches and pseudo anomalies and utilize the classification result jointly with feature distance to design anomaly scores. Instead of directly using the patch point clouds, we introduce adversarial perturbations to the input patch xyz data before feeding into the 3D normalcy classifier for the classification-based anomaly score. We show that 3DzAL outperforms the state-of-the-art anomaly detection and localization performance.

[CV-42] SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion

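[Quick read]: This paper addresses the speed bottleneck that multi-step diffusion models impose on text-guided image editing: the costly multi-step inversion and sampling process keeps editing too slow for real-world and on-device applications. The key to the solution is SwiftEdit, which achieves instant text-guided image editing (0.23 s) through two contributions: a one-step inversion framework that enables one-step image reconstruction via inversion, and a mask-guided editing technique that uses the proposed attention rescaling mechanism for localized editing. Together these make SwiftEdit at least 50 times faster than previous multi-step methods while keeping editing results competitive.

Link: https://arxiv.org/abs/2412.04301
Authors: Trong-Tung Nguyen,Quang Nguyen,Khoi Nguyen,Anh Tran,Cuong Pham
Keywords: Recent advances, simple text inputs, text inputs, text-guided image editing, perform image edits
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 15 figures
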

Abstract:Recent advances in text-guided image editing enable users to perform image edits through simple text inputs, leveraging the extensive priors of multi-step diffusion-based text-to-image models. However, these methods often fall short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response to this, we introduce SwiftEdit, a simple yet highly efficient editing tool that achieves instant text-guided image editing (in 0.23s). The advancement of SwiftEdit lies in its two novel contributions: a one-step inversion framework that enables one-step image reconstruction via inversion and a mask-guided editing technique with our proposed attention rescaling mechanism to perform localized image editing. Extensive experiments are provided to demonstrate the effectiveness and efficiency of SwiftEdit. In particular, SwiftEdit enables instant text-guided image editing, which is far faster than previous multi-step methods (at least 50 times faster) while maintaining a competitive performance in editing results. Our project page is at: this https URL

[CV-43] T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts

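[Quick read]: This paper addresses the evaluation of factuality in text-to-image (T2I) models when they generate knowledge-intensive concepts. The key to the solution is T2I-FactualBench, the largest benchmark to date, in number of concepts and prompts, designed specifically to evaluate the factuality of knowledge-intensive concept generation. It consists of a three-tiered knowledge-intensive text-to-image generation framework, ranging from the memorization of individual knowledge concepts to the more complex composition of multiple knowledge concepts. The paper also introduces an evaluation framework based on multi-round visual question answering (VQA) to assess factuality across the three tiers. Experiments show that current state-of-the-art T2I models still leave significant room for improvement in factuality.

Link: https://arxiv.org/abs/2412.04300
Authors: Ziwei Huang,Wanggui He,Quanyu Long,Yandi Wang,Haoyuan Li,Zhelun Yu,Fangxun Shu,Long Chen,Hao Jiang,Leilei Gan
Keywords: synthesized images remains, evaluating text-image alignment, synthesized images, images remains, image quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
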

Abstract:Evaluating the quality of synthesized images remains a significant challenge in the development of text-to-image (T2I) generation. Most existing studies in this area primarily focus on evaluating text-image alignment, image quality, and object composition capabilities, with comparatively fewer studies addressing the evaluation of the factuality of T2I models, particularly when the concepts involved are knowledge-intensive. To mitigate this gap, we present T2I-FactualBench in this work - the largest benchmark to date in terms of the number of concepts and prompts specifically designed to evaluate the factuality of knowledge-intensive concept generation. T2I-FactualBench consists of a three-tiered knowledge-intensive text-to-image generation framework, ranging from the basic memorization of individual knowledge concepts to the more complex composition of multiple knowledge concepts. We further introduce a multi-round visual question answering (VQA) based evaluation framework to assess the factuality of three-tiered knowledge-intensive text-to-image generation tasks. Experiments on T2I-FactualBench indicate that current state-of-the-art (SOTA) T2I models still leave significant room for improvement.

[CV-44] SIDA: Social Media Image Deepfake Detection Localization and Explanation with Large Multimodal Model

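[Quick read]: This paper addresses the spread of misinformation on social media through generative models, in particular highly realistic synthetic images that mislead large audiences and erode trust in digital content. The key to the solution is the Social media Image Detection dataSet (SID-Set), a large and diverse deepfake detection dataset with extensive volume (300K AI-generated/tampered and authentic images), broad diversity (fully synthetic and tampered images), and high realism. The paper also proposes a new framework for image deepfake detection, localization, and explanation, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only judges whether an image is authentic but also localizes tampered regions through mask prediction and gives a textual explanation of the model's judgment criteria. Experiments show that SIDA outperforms existing state-of-the-art deepfake detection models across diverse settings.

Link: https://arxiv.org/abs/2412.04292
Authors: Zhenglin Huang,Jinwei Hu,Xiangtai Li,Yiwei He,Xingyu Zhao,Bei Peng,Baoyuan Wu,Xiaowei Huang,Guangliang Cheng
Keywords: creating highly realistic, poses substantial risks, highly realistic images, realistic images poses, images poses substantial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
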

Abstract:The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) broad diversity, encompassing fully synthetic and tampered images across various classes, and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model’s judgment criteria. Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments demonstrate that SIDA achieves superior performance among diversified settings. The code, model, and dataset will be released.

[CV-45] Learnable Infinite Taylor Gaussian for Dynamic View Rendering

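[Quick read]: This paper addresses modeling how Gaussian properties such as position, rotation, and scale evolve over time, a problem where the large number of time-varying parameters and the limited photometric data cause convergence issues and make an optimal solution hard to find. The key to the solution is a new approach based on a learnable infinite Taylor Formula, which combines the flexibility of implicit network-based methods with the interpretability of explicit polynomial functions and so models Gaussian dynamics more robustly and generalizably across dynamic scenes. Extensive experiments on dynamic novel view rendering show that the method achieves state-of-the-art performance.

Link: https://arxiv.org/abs/2412.04282
Authors: Bingbing Hu,Yanyan Li,Rui Xie,Bo Xu,Haoye Dong,Junfeng Yao,Gim Hee Lee
Keywords: limited photometric data, challenging task due, convergence issues, making it difficult, Capturing the temporal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
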

Abstract:Capturing the temporal evolution of Gaussian properties such as position, rotation, and scale is a challenging task due to the vast number of time-varying parameters and the limited photometric data available, which generally results in convergence issues, making it difficult to find an optimal solution. While feeding all inputs into an end-to-end neural network can effectively model complex temporal dynamics, this approach lacks explicit supervision and struggles to generate high-quality transformation fields. On the other hand, using time-conditioned polynomial functions to model Gaussian trajectories and orientations provides a more explicit and interpretable solution, but requires significant handcrafted effort and lacks generalizability across diverse scenes. To overcome these limitations, this paper introduces a novel approach based on a learnable infinite Taylor Formula to model the temporal evolution of Gaussians. This method offers both the flexibility of an implicit network-based approach and the interpretability of explicit polynomial functions, allowing for more robust and generalizable modeling of Gaussian dynamics across various dynamic scenes. Extensive experiments on dynamic novel view rendering tasks are conducted on public datasets, demonstrating that the proposed method achieves state-of-the-art performance in this domain. More information is available on our project page(this https URL).
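
To make the polynomial side of the idea concrete, here is a minimal sketch that evaluates a truncated Taylor polynomial for a Gaussian's position over time. It is a simplification, not the paper's method: the paper describes a learnable, infinite series, whereas this example uses a fixed order with hand-picked derivative values. The function name `taylor_position` and the numbers are assumptions.

```python
from math import factorial


def taylor_position(derivs, t0, t):
    """Evaluate sum_k derivs[k] * (t - t0)**k / k! for a 3D point.

    derivs[k] is the k-th time derivative of the position at t0,
    so derivs[0] is the position, derivs[1] the velocity, and so on.
    """
    dt = t - t0
    position = [0.0, 0.0, 0.0]
    for k, derivative in enumerate(derivs):
        scale = dt ** k / factorial(k)
        position = [p + scale * d for p, d in zip(position, derivative)]
    return position


# Position, velocity, and acceleration at t0 = 0 (illustrative values).
derivs = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, -4.0)]
p = taylor_position(derivs, t0=0.0, t=0.5)
```

In a learned setting the entries of `derivs` would be trainable parameters or network outputs rather than constants.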

[CV-46] HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

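[Quick read]: This paper addresses the weak alignment with human preferences in existing large-scale image editing datasets, which usually incorporate too little human feedback to reflect what people expect from an edit. The key to the solution is HumanEdit, a high-quality, human-rewarded dataset. Human annotators construct the data pairs and administrators provide feedback, and the dataset is curated over four stages to ensure accuracy and reliability. HumanEdit contains 5,751 images covering six types of editing instructions, with high-resolution content from diverse domains, and sets a new versatile benchmark for instruction-guided image editing.

Link: https://arxiv.org/abs/2412.04280
Authors: Jinbin Bai,Wei Chow,Ling Yang,Xiangtai Li,Juncheng Li,Hanwang Zhang,Shuicheng Yan
Keywords: human-rewarded dataset specifically, diverse image manipulations, dataset specifically designed, open-form language instructions, enabling precise
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Codes and Supplementary Material: this https URL
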

Abstract:We present HumanEdit, a high-quality, human-rewarded dataset specifically designed for instruction-guided image editing, enabling precise and diverse image manipulations through open-form language instructions. Previous large-scale editing datasets often incorporate minimal human feedback, leading to challenges in aligning datasets with human preferences. HumanEdit bridges this gap by employing human annotators to construct data pairs and administrators to provide feedback. With meticulous curation, HumanEdit comprises 5,751 images and requires more than 2,500 hours of human effort across four stages, ensuring both accuracy and reliability for a wide range of image editing tasks. The dataset includes six distinct types of editing instructions: Action, Add, Counting, Relation, Remove, and Replace, encompassing a broad spectrum of real-world scenarios. All images in the dataset are accompanied by masks, and for a subset of the data, we ensure that the instructions are sufficiently detailed to support mask-free editing. Furthermore, HumanEdit offers comprehensive diversity and high-resolution 1024 × 1024 content sourced from various domains, setting a new versatile benchmark for instructional image editing datasets. With the aim of advancing future research and establishing evaluation benchmarks in the field of image editing, we release HumanEdit at this https URL.

[CV-47] Targeted Hard Sample Synthesis Based on Estimated Pose and Occlusion Error for Improved Object Pose Estimation

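[Quick read]: This paper addresses 6D object pose estimation in robotic bin-picking, where objects may be textureless and in difficult poses. The key to the solution is a model-agnostic method for synthesizing hard samples that uses existing simulators and models pose error in both the camera-to-object viewsphere and occlusion space. By evaluating model performance over the distribution of object poses and occlusions, the method identifies high-error regions and generates realistic training samples that target them. Experiments show an improvement in correct detection rate of up to 20% with state-of-the-art pose estimation models.

Link: https://arxiv.org/abs/2412.04279
Authors: Alan Li,Angela P. Schoellig
Keywords: robotics enabling efficient, enabling efficient interaction, fundamental component, component in robotics, robotics enabling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in IEEE Robotics and Automation Letters (RA-L)
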

Abstract:6D Object pose estimation is a fundamental component in robotics enabling efficient interaction with the environment. It is particularly challenging in bin-picking applications, where objects may be textureless and in difficult poses, and occlusion between objects of the same type may cause confusion even in well-trained models. We propose a novel method of hard example synthesis that is model-agnostic, using existing simulators and the modeling of pose error in both the camera-to-object viewsphere and occlusion space. Through evaluation of the model performance with respect to the distribution of object poses and occlusions, we discover regions of high error and generate realistic training samples to specifically target these regions. With our training approach, we demonstrate an improvement in correct detection rate of up to 20% across several ROBI-dataset objects using state-of-the-art pose estimation models.
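
One simple way to target high-error regions, sketched below, is to bin the viewpoint and occlusion space and draw new synthetic samples in proportion to each bin's measured error. This is an illustrative assumption, not the authors' procedure: the bin labels, the error values, and the helper names `sampling_probabilities` and `draw_hard_bins` are all hypothetical.

```python
import random


def sampling_probabilities(bin_errors):
    """Normalize per-bin error into a sampling distribution."""
    total = sum(bin_errors.values())
    return {bin_id: err / total for bin_id, err in bin_errors.items()}


def draw_hard_bins(bin_errors, n, seed=0):
    """Pick n bins to render, favouring bins where the model errs most."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    bins = list(bin_errors)
    weights = [bin_errors[b] for b in bins]
    return rng.choices(bins, weights=weights, k=n)


# Hypothetical (viewpoint, occlusion) bins with made-up error rates.
bin_errors = {
    ("top", "none"): 0.05,
    ("top", "heavy"): 0.20,
    ("side", "none"): 0.15,
    ("side", "heavy"): 0.60,
}
probs = sampling_probabilities(bin_errors)
draws = draw_hard_bins(bin_errors, n=20)
```

Each drawn bin would then be passed to a simulator to render a training image in that viewpoint and occlusion range.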

[CV-48] Reinforcement Learning from Wild Animal Videos

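[Quick read]: This paper addresses how to learn locomotion skills for legged robots by watching videos of wild animals. The key to the solution is a method called Reinforcement Learning from Wild Animal Videos (RLWAV). It first trains a video classifier to recognize actions in wild animal videos, then trains a multi-skill policy in a physics simulator with reinforcement learning, using the classifier's score on third-person videos of the robot as the reward. The learned policy transfers directly to a real quadruped robot, which learns skills such as walking, jumping, and keeping still without reference trajectories or skill-specific rewards.

Link: https://arxiv.org/abs/2412.04273
Authors: Elliot Chane-Sane,Constant Roux,Olivier Stasse,Nicolas Mansard
Keywords: legged robot locomotion, wild animal videos, nature documentaries, watching thousands, featured in nature
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project website: this https URL
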

Abstract:We propose to learn legged robot locomotion skills by watching thousands of wild animal videos from the internet, such as those featured in nature documentaries. Indeed, such videos offer a rich and diverse collection of plausible motion examples, which could inform how robots should move. To achieve this, we introduce Reinforcement Learning from Wild Animal Videos (RLWAV), a method to ground these motions into physical robots. We first train a video classifier on a large-scale animal video dataset to recognize actions from RGB clips of animals in their natural habitats. We then train a multi-skill policy to control a robot in a physics simulator, using the classification score of a third-person camera capturing videos of the robot’s movements as a reward for reinforcement learning. Finally, we directly transfer the learned policy to a real quadruped Solo. Remarkably, despite the extreme gap in both domain and embodiment between animals in the wild and robots, our approach enables the policy to learn diverse skills such as walking, jumping, and keeping still, without relying on reference trajectories nor skill-specific rewards.
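
The reward described in the abstract is the classifier's score for the commanded skill. A minimal sketch of that idea follows, assuming the classifier outputs one logit per skill and that the score is a softmax probability. Both are assumptions, as are the names `SKILLS` and `skill_reward`, and the logits here are made up rather than produced by a real video classifier.

```python
import math

# Skill names taken from the abstract; the ordering is arbitrary.
SKILLS = ["walking", "jumping", "keeping still"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def skill_reward(logits, commanded_skill):
    """Reward = classifier probability assigned to the commanded skill."""
    probs = softmax(logits)
    return probs[SKILLS.index(commanded_skill)]


# Stand-in for classifier logits on a clip of the simulated robot.
reward = skill_reward([2.0, 0.0, 0.0], "walking")
```

The same clip earns a lower reward if a different skill was commanded, which is what lets a single policy learn several skills.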

[CV-49] Enhancing Whole Slide Image Classification through Supervised Contrastive Domain Adaptation

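[Quick read]: This paper addresses domain shift in histopathological imaging, which arises from differences in staining and digitization protocols within and between hospitals. The key to the solution is a new domain adaptation method that adds a training constraint to supervised contrastive learning to achieve domain adaptation and improve inter-class separability. Experiments on whole-slide images of six skin cancer subtypes from two centers show that it outperforms both using no domain adaptation and applying staining normalization only.

Link: https://arxiv.org/abs/2412.04260
Authors: Ilán Carretero,Pablo Meseguer,Rocío del Amor,Valery Naranjo
Keywords: common phenomenon due, digitization protocols, domain adaptation, common phenomenon, phenomenon due
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in CASEIB 2024
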

Abstract:Domain shift in the field of histopathological imaging is a common phenomenon due to the intra- and inter-hospital variability of staining and digitization protocols. The implementation of robust models, capable of creating generalized domains, represents a need to be solved. In this work, a new domain adaptation method to deal with the variability between histopathological images from multiple centers is presented. In particular, our method adds a training constraint to the supervised contrastive learning approach to achieve domain adaptation and improve inter-class separability. Experiments performed on domain adaptation and classification of whole-slide images of six skin cancer subtypes from two centers demonstrate the method’s usefulness. The results reflect superior performance compared to not using domain adaptation after feature extraction or staining normalization.
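
For readers unfamiliar with the base objective, the sketch below implements the standard supervised contrastive loss in plain Python. It does not include the paper's additional training constraint, which the abstract does not specify. The function name `supcon_loss`, the temperature, and the toy embeddings are assumptions.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss, averaged over anchors.

    For each anchor, samples sharing its label are positives and every
    other sample appears in the softmax denominator.
    """
    # L2-normalize so the dot product is a cosine similarity.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    total, anchors = 0.0, 0
    n = len(normed)
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {
            a: sum(x * y for x, y in zip(normed[i], normed[a])) / temperature
            for a in range(n) if a != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total -= sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors


# Two tight clusters: labels that follow the clusters give a low loss.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = supcon_loss(emb, [0, 0, 1, 1])
bad = supcon_loss(emb, [0, 1, 0, 1])
```

Here `good` is much smaller than `bad`, since the loss rewards embeddings in which same-class samples sit close together.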

[CV-50] 3D Part Segmentation via Geometric Aggregation of 2D Visual Features

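[Quick read]: This paper addresses the poor transferability of existing supervised 3D part segmentation models to open-set scenarios, particularly the diverse objects and parts found in the real world. The key to the solution is COPS, a comprehensive model that blends the semantics extracted from visual concepts with 3D geometry to identify object parts. COPS renders a point cloud from multiple viewpoints, extracts 2D features, projects them back to 3D, applies a novel geometric-aware feature aggregation procedure to ensure spatial and semantic consistency, and finally clusters the points into parts and labels them. The approach is efficient and scalable, and achieves zero-shot state-of-the-art performance on multiple datasets.

Link: https://arxiv.org/abs/2412.04247
Authors: Marco Garosi,Riccardo Tedoldi,Davide Boscaini,Massimiliano Mancini,Nicu Sebe,Fabio Poiesi
Keywords: identify object parts, part segmentation models, limiting their transferability, transferability to open-set, fixed set
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
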

Abstract:Supervised 3D part segmentation models are tailored for a fixed set of objects and parts, limiting their transferability to open-set, real-world scenarios. Recent works have explored vision-language models (VLMs) as a promising alternative, using multi-view rendering and textual prompting to identify object parts. However, naively applying VLMs in this context introduces several drawbacks, such as the need for meticulous prompt engineering, and fails to leverage the 3D geometric structure of objects. To address these limitations, we propose COPS, a COmprehensive model for Parts Segmentation that blends the semantics extracted from visual concepts and 3D geometry to effectively identify object parts. COPS renders a point cloud from multiple viewpoints, extracts 2D features, projects them back to 3D, and uses a novel geometric-aware feature aggregation procedure to ensure spatial and semantic consistency. Finally, it clusters points into parts and labels them. We demonstrate that COPS is efficient, scalable, and achieves zero-shot state-of-the-art performance across five datasets, covering synthetic and real-world data, texture-less and coloured objects, as well as rigid and non-rigid shapes. The code is available at this https URL.

[CV-51] Intriguing Properties of Robust Classification

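[Quick read]: This paper addresses how to train classifiers that are both highly accurate and robust to small perturbations of their inputs. The key insight is that, in certain settings, robust generalization is only possible with extremely large amounts of data: there is a setting in which a robust classifier exists, yet learning one requires an exponential amount of data. Building on this theoretical result, the paper examines how well robust classifiers generalize on datasets such as CIFAR-10 and concludes that current robust models are limited by generalization and need a lot of data to do well on the test set. It also argues that the problem does not lie in the expressiveness or generalization capabilities of current architectures; rather, the data contains low-magnitude features that are useful for non-robust generalization but are not available to robust classifiers.

Link: https://arxiv.org/abs/2412.04245
Authors: Bernd Prach,Christoph H. Lampert
Keywords: train high-accuracy classifiers, years ago, extensive research, community learned, learned about adversarial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
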

Abstract:Despite extensive research since the community learned about adversarial examples 10 years ago, we still do not know how to train high-accuracy classifiers that are guaranteed to be robust to small perturbations of their inputs. Previous works often argued that this might be because no classifier exists that is robust and accurate at the same time. However, in computer vision this assumption does not match reality where humans are usually accurate and robust on most tasks of interest. We offer an alternative explanation and show that in certain settings robust generalization is only possible with unrealistically large amounts of data. More precisely we find a setting where a robust classifier exists, it is easy to learn an accurate classifier, yet it requires an exponential amount of data to learn a robust classifier. Based on this theoretical result, we explore how well robust classifiers generalize on datasets such as CIFAR-10. We come to the conclusion that on these datasets, the limitation of current robust models also lies in the generalization, and that they require a lot of data to do well on the test set. We also show that the problem is not in the expressiveness or generalization capabilities of current architectures, and that there are low magnitude features in the data which are useful for non-robust generalization but are not available for robust classifiers.

[CV-52] GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

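[Quick read]: This paper addresses the understanding of bimanual (two-handed) human hand activities, a critical problem in AI and robotics. Large models of bimanual activities are hard to build because existing datasets lack scale, coverage of diverse hand activities, and detailed annotations. The key to the solution is GigaHands, a massive annotated dataset that captures 34 hours of bimanual hand activities from 56 subjects and 417 objects, totaling 14k motion clips derived from 183 million frames and paired with 84k text annotations. A markerless capture setup and data acquisition protocol enable fully automatic 3D hand and object estimation while minimizing the effort of text annotation. The scale and diversity of GigaHands support a broad range of applications, including text-driven action synthesis, hand motion captioning, and dynamic radiance field reconstruction.

Link: https://arxiv.org/abs/2412.04244
Authors: Rao Fu,Dingxi Zhang,Alex Jiang,Wanjia Fu,Austin Funk,Daniel Ritchie,Srinath Sridhar
Keywords: Understanding bimanual human, Understanding bimanual, human hand activities, bimanual human hand, critical problem
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
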

Abstract:Understanding bimanual human hand activities is a critical problem in AI and robotics. We cannot build large models of bimanual activities because existing datasets lack the scale, coverage of diverse hand activities, and detailed annotations. We introduce GigaHands, a massive annotated dataset capturing 34 hours of bimanual hand activities from 56 subjects and 417 objects, totaling 14k motion clips derived from 183 million frames paired with 84k text annotations. Our markerless capture setup and data acquisition protocol enable fully automatic 3D hand and object estimation while minimizing the effort required for text annotation. The scale and diversity of GigaHands enable broad applications, including text-driven action synthesis, hand motion captioning, and dynamic radiance field reconstruction.

[CV-53] Quantifying the Limits of Segment Anything Model: Analyzing Challenges in Segmenting Tree-Like and Low-Contrast Structures

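[Quick read]: This paper addresses the weak performance of the Segment Anything Model (SAM) on objects with dense, tree-like structures and low textural contrast with their surroundings. The key to the solution is two metrics that quantify object characteristics: tree-likeness and textural separability. Through extensive controlled synthetic experiments and tests on real datasets, the paper shows that SAM's performance is noticeably correlated with both, and attributes the behavior to "textural confusion": SAM misinterprets local structure as global texture, which leads to over-segmentation, or struggles to separate objects from similarly textured backgrounds. The findings offer the first quantitative framework for modeling SAM's limitations and guide future improvements to vision foundation models.

Link: https://arxiv.org/abs/2412.04243
Authors: Yixin Zhang,Nicholas Konz,Kevin Kramer,Maciej A. Mazurowski
Keywords: shown impressive performance, diverse domains, large-scale training, shown impressive, interactive and zero-shot
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Code: this https URL
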

Abstract:Segment Anything Model (SAM) has shown impressive performance in interactive and zero-shot segmentation across diverse domains, suggesting that they have learned a general concept of “objects” from their large-scale training. However, we observed that SAM struggles with certain types of objects, particularly those featuring dense, tree-like structures and low textural contrast from their surroundings. These failure modes are critical for understanding its limitations in real-world use. In order to systematically examine this issue, we propose metrics to quantify two key object characteristics: tree-likeness and textural separability. Through extensive controlled synthetic experiments and testing on real datasets, we demonstrate that SAM’s performance is noticeably correlated with these factors. We link these behaviors under the concept of “textural confusion”, where SAM misinterprets local structure as global texture, leading to over-segmentation, or struggles to differentiate objects from similarly textured backgrounds. These findings offer the first quantitative framework to model SAM’s challenges, providing valuable insights into its limitations and guiding future improvements for vision foundation models.

[CV-54] VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction

[Quick read]: This paper addresses how to use large vision-language models (LVLMs) for content-aware layout generation. The key to the solution is Visual-Aware Self-Correction LAyout GeneRation (VASCAR), in which the LVLM iteratively refines its output by referring to rendered layout images, shown as colored bounding boxes on the poster background. Without any additional training, VASCAR achieves state-of-the-art layout generation quality, outperforming both existing layout-specific generative models and other LLM-based methods.

Link: https://arxiv.org/abs/2412.04237
Authors: Jiahao Zhang,Ryota Yoshihashi,Shunsuke Kitada,Atsuki Osanai,Yuta Nakashima
Keywords: HTML or JSON, produce structure-description languages, Large language models, visual information, structure-description languages
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Large language models (LLMs) have proven effective for layout generation due to their ability to produce structure-description languages, such as HTML or JSON, even without access to visual information. Recently, LLM providers have evolved these models into large vision-language models (LVLM), which shows prominent multi-modal understanding capabilities. Then, how can we leverage this multi-modal power for layout generation? To answer this, we propose Visual-Aware Self-Correction LAyout GeneRation (VASCAR) for LVLM-based content-aware layout generation. In our method, LVLMs iteratively refine their outputs with reference to rendered layout images, which are visualized as colored bounding boxes on poster backgrounds. In experiments, we demonstrate that our method combined with the Gemini. Without any additional training, VASCAR achieves state-of-the-art (SOTA) layout generation quality outperforming both existing layout-specific generative models and other LLM-based methods.

[CV-55] DEIM: DETR with Improved Matching for Fast Convergence

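[Quick read]: This paper addresses the slow training convergence of Transformer-based real-time object detectors such as DETR. The key to the solution is DEIM, a training framework built on two strategies. The first is Dense one-to-one (Dense O2O) matching, which speeds up convergence by increasing the number of positive samples per image. The second is the Matchability-Aware Loss (MAL), which optimizes matches across different quality levels and so makes Dense O2O matching more effective. With these changes DEIM markedly reduces training time and improves model performance, most notably when combined with RT-DETR and D-FINE.

Link: https://arxiv.org/abs/2412.04234
Authors: Shihua Huang,Zhichao Lu,Xiaodong Cun,Yongjun Yu,Xiao Zhou,Xi Shen
Keywords: Transformer-based architectures, efficient training framework, training framework designed, innovative and efficient, framework designed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Exceeding all existing real-time object detectors, including YOLOv11 and D-FINE
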

Abstract:We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy. This approach increases the number of positive samples per image by incorporating additional targets, using standard data augmentation techniques. While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance. To address this, we propose the Matchability-Aware Loss (MAL), a novel loss function that optimizes matches across various quality levels, enhancing the effectiveness of Dense O2O. Extensive experiments on the COCO dataset validate the efficacy of DEIM. When integrated with RT-DETR and D-FINE, it consistently boosts performance while reducing training time by 50%. Notably, paired with RT-DETRv2, DEIM achieves 53.2% AP in a single day of training on an NVIDIA 4090 GPU. Additionally, DEIM-trained real-time models outperform leading real-time object detectors, with DEIM-D-FINE-L and DEIM-D-FINE-X achieving 54.7% and 56.5% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without the need for additional data. We believe DEIM sets a new baseline for advancements in real-time object detection. Our code and pre-trained models are available at this https URL.
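
The sparse supervision mentioned in the abstract follows from how one-to-one matching works: each ground-truth box is assigned to exactly one query, so the number of positive queries equals the number of targets. The sketch below finds the minimum-cost assignment by brute force over permutations, as a small stand-in for the Hungarian algorithm that DETR-style detectors use. The function name `one_to_one_match` and the cost values are assumptions.

```python
from itertools import permutations


def one_to_one_match(cost):
    """Minimum-cost one-to-one assignment of ground truths to queries.

    cost[g][q] is the cost of assigning ground truth g to query q, with
    at most as many ground truths as queries. Brute force is only
    practical for tiny examples like this one.
    """
    n_gt, n_queries = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    for queries in permutations(range(n_queries), n_gt):
        total = sum(cost[g][q] for g, q in enumerate(queries))
        if total < best_cost:
            best, best_cost = queries, total
    return list(enumerate(best)), best_cost


# Two ground-truth boxes, three queries (made-up matching costs).
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.3, 0.8],
]
matches, total_cost = one_to_one_match(cost)
```

Only two of the three queries receive a positive target here. Adding more targets per image, as Dense O2O does through data augmentation, raises that count without changing the matching rule.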

[CV-56] Foundations of the Theory of Performance-Based Ranking

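[Quick read]: This paper addresses how to rank entities such as algorithms, devices, methods, or models by their performance while accounting for application-specific preferences. The key to the solution is a rigorous framework built on probability theory and order theory that covers the elements needed to (1) manipulate performances as mathematical objects, (2) express which performances are worse than or equivalent to others, (3) model tasks through a satisfaction variable, (4) consider properties of the evaluation, (5) define scores, and (6) specify application-specific preferences through an importance variable. On top of this framework, the paper gives an axiomatic definition of performance orderings and performance-based rankings and introduces a universal parametric family of scores, called ranking scores, that yield rankings satisfying the axioms while respecting application-specific preferences. For two-class classification, the family of ranking scores includes well-known performance scores such as the accuracy, the true positive rate (recall, sensitivity), the true negative rate (specificity), the positive predictive value (precision), and F1. The paper also shows that some other commonly used scores are unsuitable for deriving performance orderings that satisfy the axioms.

Link: https://arxiv.org/abs/2412.04227
Authors: Sébastien Piérard,Anaïs Halin,Anthony Cioppa,Adrien Deliège,Marc Van Droogenbroeck
Keywords: application-specific preferences, Ranking, scores, models based, ranking scores
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Comments:
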

Abstract:Ranking entities such as algorithms, devices, methods, or models based on their performances, while accounting for application-specific preferences, is a challenge. To address this challenge, we establish the foundations of a universal theory for performance-based ranking. First, we introduce a rigorous framework built on top of both the probability and order theories. Our new framework encompasses the elements necessary to (1) manipulate performances as mathematical objects, (2) express which performances are worse than or equivalent to others, (3) model tasks through a variable called satisfaction, (4) consider properties of the evaluation, (5) define scores, and (6) specify application-specific preferences through a variable called importance. On top of this framework, we propose the first axiomatic definition of performance orderings and performance-based rankings. Then, we introduce a universal parametric family of scores, called ranking scores, that can be used to establish rankings satisfying our axioms, while considering application-specific preferences. Finally, we show, in the case of two-class classification, that the family of ranking scores encompasses well-known performance scores, including the accuracy, the true positive rate (recall, sensitivity), the true negative rate (specificity), the positive predictive value (precision), and F1. However, we also show that some other scores commonly used to compare classifiers are unsuitable to derive performance orderings satisfying the axioms. Therefore, this paper provides the computer vision and machine learning communities with a rigorous framework for evaluating and ranking entities.
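
A small example shows why the choice of score, and hence the application's preferences, matters for ranking. The sketch below ranks two hypothetical classifiers by accuracy and by true positive rate and gets opposite orders. The confusion counts and the helper names are made up for illustration and are not taken from the paper.

```python
def accuracy(c):
    """Fraction of all samples classified correctly."""
    return (c["tn"] + c["tp"]) / (c["tn"] + c["fp"] + c["fn"] + c["tp"])


def true_positive_rate(c):
    """Fraction of positive samples that were detected."""
    return c["tp"] / (c["tp"] + c["fn"])


def rank(classifiers, score):
    """Names sorted from best to worst according to the given score."""
    return sorted(classifiers, key=lambda name: score(classifiers[name]),
                  reverse=True)


# Classifier A rarely predicts positive; classifier B finds most positives.
classifiers = {
    "A": {"tn": 90, "fp": 0, "fn": 8, "tp": 2},
    "B": {"tn": 80, "fp": 10, "fn": 1, "tp": 9},
}
by_accuracy = rank(classifiers, accuracy)
by_tpr = rank(classifiers, true_positive_rate)
```

An application that cares mostly about missed positives would prefer B even though A has the higher accuracy.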

[CV-57] Customize Segment Anything Model for Multi-Modal Semantic Segmentation with Mixture of LoRA Experts

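[Quick read]: This paper addresses the suboptimal multi-modal segmentation performance that results from applying an existing segmentation model such as the Segment Anything Model (SAM) directly to emerging visual modalities such as depth and event data. The key to the solution is a Mixture of Low-Rank Adaptation Experts (MoE-LoRA) tailored to the different input modalities. Only the MoE-LoRA layers are trained while SAM's weights stay frozen, which preserves SAM's strong generalization and segmentation capabilities. Specifically, the paper proposes a new MoE routing strategy that adaptively generates weighted features across modalities to improve multi-modal feature integration. It also adds multi-scale feature extraction and fusion by adapting SAM's segmentation head and introducing an auxiliary segmentation head that combines multi-scale features, which significantly improves multi-modal segmentation performance.

Link: https://arxiv.org/abs/2412.04220
Authors: Chenyang Zhu,Bin Xiao,Lin Shi,Shoukun Xu,Xu Zheng
Keywords: scaling segmentation models, RGB modality, Segment Anything Model, recent Segment, represents a significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
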

Abstract:The recent Segment Anything Model (SAM) represents a significant breakthrough in scaling segmentation models, delivering strong performance across various downstream applications in the RGB modality. However, directly applying SAM to emerging visual modalities, such as depth and event data, results in suboptimal performance in multi-modal segmentation tasks. In this paper, we make the first attempt to adapt SAM for multi-modal semantic segmentation by proposing a Mixture of Low-Rank Adaptation Experts (MoE-LoRA) tailored for different input visual modalities. By training only the MoE-LoRA layers while keeping SAM’s weights frozen, SAM’s strong generalization and segmentation capabilities can be preserved for downstream tasks. Specifically, to address cross-modal inconsistencies, we propose a novel MoE routing strategy that adaptively generates weighted features across modalities, enhancing multi-modal feature integration. Additionally, we incorporate multi-scale feature extraction and fusion by adapting SAM’s segmentation head and introducing an auxiliary segmentation head to effectively combine multi-scale features for improved segmentation performance. Extensive experiments were conducted on three multi-modal benchmarks: DELIVER, MUSES, and MCubeS. The results consistently demonstrate that the proposed method significantly outperforms state-of-the-art approaches across diverse scenarios. Notably, under the particularly challenging condition of missing modalities, our approach exhibits a substantial performance gain, achieving an improvement of 32.15% compared to existing methods.
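
The idea of training low-rank experts on top of a frozen weight can be sketched in a few lines. This is a generic mixture-of-LoRA layer under stated assumptions, not the paper's routing strategy: the softmax router, the toy shapes, and every name below are hypothetical, and the zero-initialised B matrices follow the common LoRA convention rather than anything the abstract specifies.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_lora_forward(x, w_frozen, experts, router):
    """y = W x + sum_i g_i * B_i (A_i x), with gates g = softmax(router x).

    w_frozen is never updated; only the (A_i, B_i) pairs and the router
    would be trained in this sketch.
    """
    gates = softmax(matvec(router, x))
    y = matvec(w_frozen, x)
    for gate, (a, b) in zip(gates, experts):
        delta = matvec(b, matvec(a, x))  # low-rank update B_i (A_i x)
        y = [yi + gate * di for yi, di in zip(y, delta)]
    return y, gates


# Toy setup: 2-d input and output, rank-1 experts, two experts.
w = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[0.0], [0.0]]),  # B = 0, so the expert starts as a no-op
    ([[0.0, 1.0]], [[0.0], [0.0]]),
]
router = [[0.0, 0.0], [0.0, 0.0]]
y, gates = moe_lora_forward([3.0, 4.0], w, experts, router)
```

With zero-initialised B matrices the layer reproduces the frozen model exactly at the start of training, which is one reason this kind of adapter tends to preserve pretrained behaviour.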

[CV-58] Aligned Music Notation and Lyrics Transcription

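[Quick read]: This paper addresses a challenge unique to digitizing vocal music scores: transcribing them while preserving the critical alignment between music notation and lyrics. The key to the solution is introducing and formally defining the Aligned Music Notation and Lyrics Transcription (AMNLT) challenge, which achieves complete transcription of vocal scores by jointly considering music symbols, lyrics, and their synchronization. The paper analyzes approaches ranging from traditional divide-and-conquer methods to novel end-to-end solutions (direct transcription, unfolding mechanisms, and language modeling). Experiments show that end-to-end approaches generally outperform heuristic methods on the alignment challenge, and that language models show a clear advantage when sufficient training data is available.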

Link: https://arxiv.org/abs/2412.04217
Authors: Eliseo Fuentes-Martínez,Antonio Ríos-Vila,Juan C. Martinez-Sevilla,David Rizo,Jorge Calvo-Zaragoza
Keywords: Optical Character Recognition, Optical Music Recognition, Character Recognition, Optical Character, traditional Optical Music
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The digitization of vocal music scores presents unique challenges that go beyond traditional Optical Music Recognition (OMR) and Optical Character Recognition (OCR), as it necessitates preserving the critical alignment between music notation and lyrics. This alignment is essential for proper interpretation and processing in practical applications. This paper introduces and formalizes, for the first time, the Aligned Music Notation and Lyrics Transcription (AMNLT) challenge, which addresses the complete transcription of vocal scores by jointly considering music symbols, lyrics, and their synchronization. We analyze different approaches to address this challenge, ranging from traditional divide-and-conquer methods that handle music and lyrics separately, to novel end-to-end solutions including direct transcription, unfolding mechanisms, and language modeling. To evaluate these methods, we introduce four datasets of Gregorian chants, comprising both real and synthetic sources, along with custom metrics specifically designed to assess both transcription and alignment accuracy. Our experimental results demonstrate that end-to-end approaches generally outperform heuristic methods in the alignment challenge, with language models showing particular promise in scenarios where sufficient training data is available. This work establishes the first comprehensive framework for AMNLT, providing both theoretical foundations and practical solutions for preserving and digitizing vocal music heritage.

[CV-59] PANGAEA: A Global and Inclusive Benchmark for Geospatial Foundation Models

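[Quick read]: This paper addresses limitations in how Geospatial Foundation Models (GFMs) are evaluated, in particular the inconsistency, narrowness, and geographic bias of existing evaluations. The key to the solution is PANGAEA, a standardized evaluation protocol that covers diverse datasets, tasks, resolutions, sensor modalities, and temporalities. PANGAEA aims to establish a robust, widely applicable benchmark for GFM performance and, by comparing GFMs with supervised baselines (e.g., UNet and vanilla ViT), analyzes how effective they are when labeled data is limited. By releasing the evaluation code and benchmark, the authors let researchers reproduce the experiments and build on them, promoting a more principled evaluation of large pre-trained geospatial models.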

Link: https://arxiv.org/abs/2412.04204
Authors: Valerio Marsocci,Yuru Jia,Georges Le Bellier,David Kerekes,Liang Zeng,Sebastian Hafner,Sebastian Gerard,Eric Brune,Ritu Yadav,Ali Shibli,Heng Fang,Yifang Ban,Maarten Vergauwen,Nicolas Audebert,Andrea Nascetti
Keywords: Earth observation data, representations from Earth, Earth observation, evaluation remains inconsistent, Geospatial Foundation Models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Geospatial Foundation Models (GFMs) have emerged as powerful tools for extracting representations from Earth observation data, but their evaluation remains inconsistent and narrow. Existing works often evaluate on suboptimal downstream datasets and tasks that are too easy or too narrow, limiting the usefulness of the evaluations for assessing the real-world applicability of GFMs. Additionally, there is a distinct lack of diversity in current evaluation protocols, which fail to account for the multiplicity of image resolutions, sensor types, and temporalities, further complicating the assessment of GFM performance. In particular, most existing benchmarks are geographically biased towards North America and Europe, calling the global applicability of GFMs into question. To overcome these challenges, we introduce PANGAEA, a standardized evaluation protocol that covers a diverse set of datasets, tasks, resolutions, sensor modalities, and temporalities. It establishes a robust and widely applicable benchmark for GFMs. We evaluate the most popular openly available GFMs on this benchmark and analyze their performance across several domains. In particular, we compare these models to supervised baselines (e.g. UNet and vanilla ViT), and assess their effectiveness when faced with limited labeled data. Our findings highlight the limitations of GFMs under different scenarios, showing that they do not consistently outperform supervised models. PANGAEA is designed to be highly extensible, allowing for the seamless inclusion of new datasets, models, and tasks in future research. By releasing the evaluation code and benchmark, we aim to enable other researchers to replicate our experiments and build upon our work, fostering a more principled evaluation protocol for large pre-trained geospatial models. The code is available at this https URL.

[CV-60] Hipandas: Hyperspectral Image Joint Denoising and Super-Resolution by Image Fusion with the Panchromatic Image

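[Quick read]: This paper addresses the noise and low resolution that hyperspectral images (HSIs) frequently suffer from because of imaging-device constraints. The key to the solution is a new learning paradigm, Hyperspectral Image Joint Pandenoising and Pansharpening (Hipandas), which performs denoising and super-resolution jointly by fusing a high-resolution panchromatic (PAN) image. The Hipandas framework consists of a guided denoising network, a guided super-resolution network, and a PAN reconstruction network; it uses an HSI low-rank prior together with a new detail-oriented low-rank prior, and a two-stage training strategy ensures the networks train effectively. Experiments show that the method outperforms state-of-the-art algorithms on both simulated and real-world datasets, producing more accurate and visually pleasing high-resolution HSIs.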

Link: https://arxiv.org/abs/2412.04201
Authors: Shuang Xu,Zixiang Zhao,Haowen Bai,Chang Yu,Jiangjun Peng,Xiangyong Cao,Deyu Meng
Keywords: low resolution due, Hyperspectral images, imaging devices, low resolution, resolution due
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Hyperspectral images (HSIs) are frequently noisy and of low resolution due to the constraints of imaging devices. Recently launched satellites can concurrently acquire HSIs and panchromatic (PAN) images, enabling the restoration of HSIs to generate clean and high-resolution imagery through fusing PAN images for denoising and super-resolution. However, previous studies treated these two tasks as independent processes, resulting in accumulated errors. This paper introduces Hyperspectral Image Joint Pandenoising and Pansharpening (Hipandas), a novel learning paradigm that reconstructs high-resolution hyperspectral (HRHS) images from noisy low-resolution HSIs (LRHS) and high-resolution PAN images. The proposed zero-shot Hipandas framework consists of a guided denoising network, a guided super-resolution network, and a PAN reconstruction network, utilizing an HSI low-rank prior and a newly introduced detail-oriented low-rank prior. The interconnection of these networks complicates the training process, necessitating a two-stage training strategy to ensure effective training. Experimental results on both simulated and real-world datasets indicate that the proposed method surpasses state-of-the-art algorithms, yielding more accurate and visually pleasing HRHS images.

[CV-61] Instructional Video Generation

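[Quick read]: This paper addresses the lack of visual detail in generated video, especially in egocentric instructional videos, where intricate hand motion must be coordinated with a mostly stable environment. The solution rests on two innovations. First, it proposes an automatic method for generating the expected region of motion, guided by both the visual context and the action text. Second, it introduces a hand structure loss that steers the diffusion model toward smooth and consistent hand poses. Together these markedly improve instructional video generation on augmented datasets based on EpicKitchens and Ego4D, particularly the clarity of hand motion, surpassing existing state-of-the-art methods.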

Link: https://arxiv.org/abs/2412.04189
Authors: Yayuan Li,Zhi Cao,Jason J. Corso
Keywords: instructional video generation, video generation, recent strides, struggle with elements, visual detail
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures and 4 tables

Abstract:Despite the recent strides in video generation, state-of-the-art methods still struggle with elements of visual detail. One particularly challenging case is the class of egocentric instructional videos in which the intricate motion of the hand coupled with a mostly stable and non-distracting environment is necessary to convey the appropriate visual action instruction. To address these challenges, we introduce a new method for instructional video generation. Our diffusion-based method incorporates two distinct innovations. First, we propose an automatic method to generate the expected region of motion, guided by both the visual context and the action text. Second, we introduce a critical hand structure loss to guide the diffusion model to focus on smooth and consistent hand poses. We evaluate our method on augmented instructional datasets based on EpicKitchens and Ego4D, demonstrating significant improvements over state-of-the-art methods in terms of instructional clarity, especially of the hand motion in the target region, across diverse environments and actions. Video results can be found on the project webpage: this https URL

[CV-62] Frequency-Adaptive Low-Latency Object Detection Using Events and Frames

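[Quick read]: This paper addresses the difficulty of high-frequency fusion-based object detection with event cameras and RGB cameras, which stems from two mismatches: low-latency events versus high-latency RGB frames, and temporally sparse labels in training versus a continuous data stream at inference. The key to the solution is the Frequency-Adaptive Low-Latency Object Detector (FAOD). FAOD uses an Align Module to align low-frequency RGB frames with high-frequency event data, resolving the Event-RGB mismatch. In addition, the proposed Time Shift training strategy forces the module to align predictions from temporally shifted Event-RGB pairs with their original representation, so the network can treat high-frequency event data as the primary reference and low-frequency RGB images as supplementary information, retaining the low latency of the event stream for high-frequency detection. Experiments show that FAOD achieves state-of-the-art performance on the PKU-DAVIS-SOD and DSEC-Detection datasets.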

Link: https://arxiv.org/abs/2412.04149
Authors: Haitian Zhang,Xiangyuan Wang,Chang Xu,Xinya Wang,Fang Xu,Huai Yu,Lei Yu,Wen Yang
Keywords: Fusing Events, object detection leverages, rich semantic information, semantic information provided, RGB cameras
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fusing Events and RGB images for object detection leverages the robustness of Event cameras in adverse environments and the rich semantic information provided by RGB cameras. However, two critical mismatches, low-latency Events vs. high-latency RGB frames and temporally sparse labels in training vs. continuous flow in inference, significantly hinder high-frequency fusion-based object detection. To address these challenges, we propose the Frequency-Adaptive Low-Latency Object Detector (FAOD). FAOD aligns low-frequency RGB frames with high-frequency Events through an Align Module, which reinforces cross-modal style and spatial proximity to address the Event-RGB Mismatch. We further propose a training strategy, Time Shift, which enforces the module to align the prediction from temporally shifted Event-RGB pairs and their original representation, that is, consistent with Event-aligned annotations. This strategy enables the network to use high-frequency Event data as the primary reference while treating low-frequency RGB images as supplementary information, retaining the low-latency nature of the Event stream toward high-frequency detection. Furthermore, we observe that these corrected Event-RGB pairs demonstrate better generalization from low training frequency to higher inference frequencies compared to using Event data alone. Extensive experiments on the PKU-DAVIS-SOD and DSEC-Detection datasets demonstrate that our FAOD achieves SOTA performance. Specifically, in the PKU-DAVIS-SOD Dataset, FAOD achieves 9.8 points improvement in terms of the mAP in fully paired Event-RGB data with only a quarter of the parameters compared to SODFormer, and even maintains robust performance (only a 3 points drop in mAP) under 80× Event-RGB frequency mismatch.

[CV-63] AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

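[Quick read]: This paper addresses the fact that existing diffusion-based garment image generation methods cannot handle combinations of multiple garments and struggle to preserve garment details while staying faithful to the text prompt. The key to the solution is a new method, AnyDressing, built from two main networks: GarmentsNet and DressingNet. GarmentsNet uses an efficient Garment-Specific Feature Extractor module to extract detailed garment features in parallel, preventing garment confusion while keeping the network efficient. DressingNet uses an adaptive Dressing-Attention mechanism and an Instance-Level Garment Localization Learning strategy to inject multi-garment features precisely into their corresponding regions, effectively integrating multi-garment texture cues and improving text-image consistency. The paper also introduces a Garment-Enhanced Texture Learning strategy to improve fine-grained garment texture. AnyDressing is designed as a plug-in module that integrates easily with community control extensions for diffusion models, improving the diversity and controllability of synthesized images.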

Link: https://arxiv.org/abs/2412.04146
Authors: Xinghui Li,Qichao Sun,Pengze Zhang,Fulong Ye,Zhichao Liao,Wanquan Feng,Songtao Zhao,Qian He
Keywords: Recent advances, garment-centric image generation, advances in garment-centric, image prompts based, text prompts
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent advances in garment-centric image generation from text and image prompts based on diffusion models are impressive. However, existing methods lack support for various combinations of attire, and struggle to preserve the garment details while maintaining faithfulness to the text prompts, limiting their performance across diverse scenarios. In this paper, we focus on a new task, i.e., Multi-Garment Virtual Dressing, and we propose a novel AnyDressing method for customizing characters conditioned on any combination of garments and any personalized text prompts. AnyDressing comprises two primary networks named GarmentsNet and DressingNet, which are respectively dedicated to extracting detailed clothing features and generating customized images. Specifically, we propose an efficient and scalable module called Garment-Specific Feature Extractor in GarmentsNet to individually encode garment textures in parallel. This design prevents garment confusion while ensuring network efficiency. Meanwhile, we design an adaptive Dressing-Attention mechanism and a novel Instance-Level Garment Localization Learning strategy in DressingNet to accurately inject multi-garment features into their corresponding regions. This approach efficiently integrates multi-garment texture cues into generated images and further enhances text-image consistency. Additionally, we introduce a Garment-Enhanced Texture Learning strategy to improve the fine-grained texture details of garments. Thanks to our well-crafted design, AnyDressing can serve as a plug-in module that integrates easily with any community control extensions for diffusion models, improving the diversity and controllability of synthesized images. Extensive experiments show that AnyDressing achieves state-of-the-art results.

[CV-64] Deep priors for satellite image restoration with accurate uncertainties

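[Quick read]: This paper addresses the distortion present in satellite optical images once they are received on the ground, covering restoration tasks such as denoising, deblurring, and super-resolution, and further quantifies the uncertainty of the restoration to lower the risk of hallucination and avoid propagating bias into downstream applications. The key to the solution is a generic method that restores data from several sensors with a single network and derives uncertainties in a scalable way. Specifically, the paper introduces two methods: VBLE-xz and SatDPIR. VBLE-xz solves the inverse problem in the latent space of a variational compressive autoencoder, jointly estimating uncertainty in the latent and image spaces, which enables scalable posterior sampling and calibrated uncertainties. SatDPIR is a denoiser-based method adapted from DPIR that efficiently computes accurate point estimates. Comprehensive experiments on simulated and real very-high-resolution Pleiades images demonstrate the performance and robustness of both methods, which achieve state-of-the-art results compared with direct inversion methods.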

Link: https://arxiv.org/abs/2412.04130
Authors: Biquard Maud,Marie Chabert,Florence Genin,Christophe Latry,Thomas Oberlin
Keywords: Satellite optical images, on-ground receipt, offer a distorted, observed scene, distorted view
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optics (physics.optics)
Comments:

Abstract:Satellite optical images, upon their on-ground receipt, offer a distorted view of the observed scene. Their restoration, classically including denoising, deblurring, and sometimes super-resolution, is required before their exploitation. Moreover, quantifying the uncertainty related to this restoration could be valuable by lowering the risk of hallucination and avoiding propagating these biases in downstream applications. Deep learning methods are now state-of-the-art for satellite image restoration. However, they require training a specific network for each sensor and do not provide the associated uncertainties. This paper proposes a generic method involving a single network to restore images from several sensors and a scalable way to derive the uncertainties. We focus on deep regularization (DR) methods, which learn a deep prior on target images before plugging it into a model-based optimization scheme. First, we introduce VBLE-xz, which solves the inverse problem in the latent space of a variational compressive autoencoder, estimating the uncertainty jointly in the latent and in the image spaces. It enables scalable posterior sampling with relevant and calibrated uncertainties. Second, we propose the denoiser-based method SatDPIR, adapted from DPIR, which efficiently computes accurate point estimates. We conduct a comprehensive set of experiments on very high resolution simulated and real Pleiades images, asserting both the performance and robustness of the proposed methods. VBLE-xz and SatDPIR achieve state-of-the-art results compared to direct inversion methods. In particular, VBLE-xz is a scalable method to get realistic posterior samples and accurate uncertainties, while SatDPIR represents a compelling alternative to direct inversion methods when uncertainty quantification is not required.

[CV-65] CrossSDF: 3D Reconstruction of Thin Structures From Cross-Sections

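[Quick read]: This paper addresses the reconstruction of complex structures from planar cross-sections, in particular the difficulties existing methods face with sparse data and thin geometric structures in fields such as medical imaging, manufacturing, and topography. The key to the solution is a new method, CrossSDF, which extracts a 3D signed distance field from 2D signed distances generated from planar contours. By using losses designed for the case where geometry is known within 2D slices, it makes the training of neural signed distance fields (neural SDFs) contour-aware. Experiments show that the method clearly outperforms existing techniques, effectively reconstructing thin structures and producing accurate 3D models without interpolation artifacts or over-smoothing.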

Link: https://arxiv.org/abs/2412.04120
Authors: Thomas Walker,Salvatore Esposito,Daniel Rebain,Amir Vaxman,Arno Onken,Changjian Li,Oisin Mac Aodha
Keywords: challenging problem, Reconstructing complex structures, medical imaging, Reconstructing complex, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing complex structures from planar cross-sections is a challenging problem, with wide-reaching applications in medical imaging, manufacturing, and topography. Out-of-the-box point cloud reconstruction methods can often fail due to the data sparsity between slicing planes, while current bespoke methods struggle to reconstruct thin geometric structures and preserve topological continuity. This is important for medical applications where thin vessel structures are present in CT and MRI scans. This paper introduces CrossSDF, a novel approach for extracting a 3D signed distance field from 2D signed distances generated from planar contours. Our approach makes the training of neural SDFs contour-aware by using losses designed for the case where geometry is known within 2D slices. Our results demonstrate a significant improvement over existing methods, effectively reconstructing thin structures and producing accurate 3D models without the interpolation artifacts or over-smoothing of prior approaches.

[CV-66] MVUDA: Unsupervised Domain Adaptation for Multi-view Pedestrian Detection

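[Quick read]: This paper addresses the performance drop in multi-view pedestrian detection when training and test data are collected with different camera setups. The key to the solution is an unsupervised domain adaptation (UDA) method that combines the mean teacher self-training framework with a new pseudo-labeling technique tailored to multi-view pedestrian detection, letting the model adapt to a new camera setup without additional labeled data. The method achieves state-of-the-art performance on multiple benchmarks and removes the dependence on external labeled monocular datasets, which makes multi-view pedestrian detectors considerably more practical and robust.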

Link: https://arxiv.org/abs/2412.04117
Authors: Erik Brorsson,Lennart Svensson,Kristofer Bengtsson,Knut Åkesson
Keywords: address multi-view pedestrian, multi-view pedestrian, multi-view pedestrian detection, labeled data, pedestrian detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We address multi-view pedestrian detection in a setting where labeled data is collected using a multi-camera setup different from the one used for testing. While recent multi-view pedestrian detectors perform well on the camera rig used for training, their performance declines when applied to a different setup. To facilitate seamless deployment across varied camera rigs, we propose an unsupervised domain adaptation (UDA) method that adapts the model to new rigs without requiring additional labeled data. Specifically, we leverage the mean teacher self-training framework with a novel pseudo-labeling technique tailored to multi-view pedestrian detection. This method achieves state-of-the-art performance on multiple benchmarks, including MultiviewX → Wildtrack. Unlike previous methods, our approach eliminates the need for external labeled monocular datasets, thereby reducing reliance on labeled data. Extensive evaluations demonstrate the effectiveness of our method and validate key design choices. By enabling robust adaptation across camera setups, our work enhances the practicality of multi-view pedestrian detectors and establishes a strong UDA baseline for future research.
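
The mean teacher framework mentioned above keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights, and trains the student on the teacher's pseudo-labels. The sketch below shows only that generic mechanism. The plain confidence threshold is a stand-in assumption, not the paper's pseudo-labeling technique, and all names and values are hypothetical.

```python
def ema_update(teacher, student, decay=0.99):
    """teacher <- decay * teacher + (1 - decay) * student, per parameter."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def pseudo_labels(teacher_scores, threshold=0.5):
    """Turn teacher confidences into hard labels (placeholder rule)."""
    return [1 if score >= threshold else 0 for score in teacher_scores]


# Toy parameters standing in for network weights.
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, decay=0.9)
labels = pseudo_labels([0.2, 0.7, 0.5])
```

A decay close to 1 makes the teacher change slowly, which is what keeps its pseudo-labels more stable than the student's own predictions.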

[CV-67] Thermal and RGB Images Work Better Together in Wind Turbine Damage Detection

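[Quick read]: This paper addresses the low efficiency of defect detection on wind turbine blades (WTBs). The key to the solution is a multispectral image composition method that combines thermal and RGB images captured by UAVs through spatial coordinate transformation, key point detection, binary descriptor creation, and weighted image overlay to produce composite images. The approach markedly improves defect detection: the YOLOv8 model gains in accuracy, precision, recall, and F1-score, while the numbers of false positives and missed defects both fall.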

Link: https://arxiv.org/abs/2412.04114
Authors: Serhii Svystun,Oleksandr Melnychenko,Pavlo Radiuk,Oleg Savenko,Anatoliy Sachenko,Andrii Lysyi
Keywords: wind turbine blades, thermal and RGB, turbine blades, wind turbine, crucial for ensuring
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Unmanned aerial vehicle, image composition, multispectral images, green energy, data quality management, weighted overlay

Abstract:The inspection of wind turbine blades (WTBs) is crucial for ensuring their structural integrity and operational efficiency. Traditional inspection methods can be dangerous and inefficient, prompting the use of unmanned aerial vehicles (UAVs) that access hard-to-reach areas and capture high-resolution imagery. In this study, we address the challenge of enhancing defect detection on WTBs by integrating thermal and RGB images obtained from UAVs. We propose a multispectral image composition method that combines thermal and RGB imagery through spatial coordinate transformation, key point detection, binary descriptor creation, and weighted image overlay. Using a benchmark dataset of WTB images annotated for defects, we evaluated several state-of-the-art object detection models. Our results show that composite images significantly improve defect detection efficiency. Specifically, the YOLOv8 model’s accuracy increased from 91% to 95%, precision from 89% to 94%, recall from 85% to 92%, and F1-score from 87% to 93%. The number of false positives decreased from 6 to 3, and missed defects reduced from 5 to 2. These findings demonstrate that integrating thermal and RGB imagery enhances defect detection on WTBs, contributing to improved maintenance and reliability.
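
Two pieces of the abstract lend themselves to a quick sketch: the final weighted overlay step and the reported F1-scores, which can be checked against the reported precision and recall. The code assumes the thermal and RGB images are already aligned (the coordinate transformation and keypoint matching steps are not shown) and treats each image as a single-channel grid. The blending weight `alpha` and the function names are hypothetical.

```python
def weighted_overlay(rgb, thermal, alpha=0.6):
    """Blend two aligned single-channel images pixel by pixel."""
    return [
        [alpha * r + (1.0 - alpha) * t for r, t in zip(rgb_row, thermal_row)]
        for rgb_row, thermal_row in zip(rgb, thermal)
    ]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for YOLOv8.
baseline_f1 = f1_score(0.89, 0.85)   # before composition
composite_f1 = f1_score(0.94, 0.92)  # with composite images
blended = weighted_overlay([[100.0]], [[200.0]], alpha=0.5)
```

Rounded to two decimals, the two F1 values come out at 0.87 and 0.93, matching the 87% and 93% quoted in the abstract.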

[CV-68] MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities

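[Quick read]: This paper addresses the way heterogeneous modalities and scarce mask annotations hold back the development of medical image segmentation models on unannotated modalities. The key to the solution is a new paradigm: using generative models to controllably synthesize data for unannotated modalities without requiring registered data pairs. The contributions are: (i) collecting and curating MedGen-1M, a large-scale radiology image-text dataset with modality labels, attributes, region and organ information, plus a subset of organ mask annotations, to support research on controllable medical image generation; (ii) proposing MRGen, a diffusion-based data engine that generates MR images conditioned on text prompts and masks, synthesizing training samples for modalities that lack mask annotations so that segmentation models can be trained on them; and (iii) showing through extensive experiments across modalities that the data engine synthesizes effective training samples and extends MRI segmentation to unannotated modalities.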

Link: https://arxiv.org/abs/2412.04106
Authors: Haoning Wu,Ziheng Zhao,Ya Zhang,Weidi Xie,Yanfeng Wang
Keywords: deep neural networks, recently demonstrated impressive, demonstrated impressive progress, mask annotations limit, unannotated modalities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical Report; Project Page: this https URL

Abstract:Medical image segmentation has recently demonstrated impressive progress with deep neural networks, yet the heterogeneous modalities and scarcity of mask annotations limit the development of segmentation models on unannotated modalities. This paper investigates a new paradigm for leveraging generative models in medical applications: controllably synthesizing data for unannotated modalities, without requiring registered data pairs. Specifically, we make the following contributions in this paper: (i) we collect and curate a large-scale radiology image-text dataset, MedGen-1M, comprising modality labels, attributes, region, and organ information, along with a subset of organ mask annotations, to support research in controllable medical image generation; (ii) we propose a diffusion-based data engine, termed MRGen, which enables generation conditioned on text prompts and masks, synthesizing MR images for diverse modalities lacking mask annotations, to train segmentation models on unannotated modalities; (iii) we conduct extensive experiments across various modalities, illustrating that our data engine can effectively synthesize training samples and extend MRI segmentation towards unannotated modalities.

[CV-69] D-LORD for Motion Stylization

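[Quick read]: This paper addresses motion stylization, specifically motion style transfer and motion retargeting. The key to the solution is a new framework, D-LORD (Double Latent Optimization for Representation Disentanglement), which uses data-driven latent optimization to separate the class information in a motion sequence (e.g., a particular emotion or an individual's identity) from its content information (universally understood actions such as walking or jumping). D-LORD's main advantage is that it performs style transfer without paired motion data, relying instead on class and content labels during latent optimization. With the representation disentangled, the framework uses Adaptive Instance Normalization to transform the style of one motion sequence into that of another. D-LORD also generalizes well: it handles different class and content labels, suits a range of applications, and can generate diverse motion sequences for given class and content labels.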

Link: https://arxiv.org/abs/2412.04097
Authors: Meenakshi Gupta,Mingyuan Lei,Tat-Jen Cham,Hwee Kuan Lee
Keywords: Double Latent Optimization, Double Latent, Latent Optimization, Representation Disentanglement, motion style transfer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces a novel framework named D-LORD (Double Latent Optimization for Representation Disentanglement), which is designed for motion stylization (motion style transfer and motion retargeting). The primary objective of this framework is to separate the class and content information from a given motion sequence using a data-driven latent optimization approach. Here, class refers to person-specific style, such as a particular emotion or an individual’s identity, while content relates to the style-agnostic aspect of an action, such as walking or jumping, as universally understood concepts. The key advantage of D-LORD is its ability to perform style transfer without needing paired motion data. Instead, it utilizes class and content labels during the latent optimization process. By disentangling the representation, the framework enables the transformation of one motion sequence’s style into another’s using Adaptive Instance Normalization. The proposed D-LORD framework is designed with a focus on generalization, allowing it to handle different class and content labels for various applications. Additionally, it can generate diverse motion sequences when specific class and content labels are provided. The framework’s efficacy is demonstrated through experimentation on three datasets: the CMU XIA dataset for motion style transfer, and the MHAD and RRIS Ability datasets for motion retargeting. Notably, this paper presents the first generalized framework for motion style transfer and motion retargeting, showcasing its potential contributions in this area.
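
Adaptive Instance Normalization (AdaIN), which the abstract names as the style-transfer step, re-scales content features so that their mean and standard deviation match those of the style features. Below is a minimal one-dimensional sketch of the standard AdaIN formula. How D-LORD wires it into its latent codes is not described in the abstract, so the feature vectors and names here are purely illustrative.

```python
import statistics


def adain(content, style, eps=1e-5):
    """AdaIN(x, y) = std(y) * (x - mean(x)) / (std(x) + eps) + mean(y)."""
    mu_c, mu_s = statistics.fmean(content), statistics.fmean(style)
    sd_c, sd_s = statistics.pstdev(content), statistics.pstdev(style)
    return [sd_s * (x - mu_c) / (sd_c + eps) + mu_s for x in content]


# Toy feature channels: the output keeps the content's shape
# but takes on the style's statistics.
stylized = adain(content=[0.0, 2.0], style=[10.0, 14.0])
```

In practice the operation is applied per channel of a feature map; the small `eps` only guards against division by zero for constant channels.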

[CV-70] HyperFLINT: Hypernetwork-based Flow Estimation and Temporal Interpolation for Scientific Ensemble Visualization

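[Quick read]: This paper addresses the fact that traditional methods for spatio-temporal scientific ensemble data do not take ensemble parameters into account, which limits their adaptability across simulation settings and their insight into the data dynamics. The key to the solution is HyperFLINT, a hypernetwork-based deep learning approach in which a hypernetwork dynamically generates the weights of the main network. This lets the model adapt flow estimation and temporal interpolation to different simulation parameters, considerably improving the accuracy of both and supporting parameter space exploration that yields useful insights into complex scientific ensembles.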

Link: https://arxiv.org/abs/2412.04095
Authors: Hamid Gadirov,Qi Wu,David Bauer,Kwan-Liu Ma,Jos Roerdink,Steffen Frey
Keywords: temporally interpolating scalar, deep learning-based approach, Hypernetwork-based FLow estimation, interpolating scalar fields, Hypernetwork-based FLow
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:We present HyperFLINT (Hypernetwork-based FLow estimation and temporal INTerpolation), a novel deep learning-based approach for estimating flow fields, temporally interpolating scalar fields, and facilitating parameter space exploration in spatio-temporal scientific ensemble data. This work addresses the critical need to explicitly incorporate ensemble parameters into the learning process, as traditional methods often neglect these, limiting their ability to adapt to diverse simulation settings and provide meaningful insights into the data dynamics. HyperFLINT introduces a hypernetwork to account for simulation parameters, enabling it to generate accurate interpolations and flow fields for each timestep by dynamically adapting to varying conditions, thereby outperforming existing parameter-agnostic approaches. The architecture features modular neural blocks with convolutional and deconvolutional layers, supported by a hypernetwork that generates weights for the main network, allowing the model to better capture intricate simulation dynamics. A series of experiments demonstrates HyperFLINT’s significantly improved performance in flow field estimation and temporal interpolation, as well as its potential in enabling parameter space exploration, offering valuable insights into complex scientific ensembles.
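
The core mechanism, a hypernetwork that produces the weights of the main network from the simulation parameters, can be illustrated with a deliberately tiny example. This is a sketch under strong assumptions: both networks are single linear maps, whereas the abstract describes convolutional and deconvolutional blocks, and all names and numbers are hypothetical.

```python
def hypernetwork(sim_params, h_weights, h_bias):
    """Linear hypernetwork: simulation parameters -> main-network weights."""
    return [
        sum(w * p for w, p in zip(row, sim_params)) + b
        for row, b in zip(h_weights, h_bias)
    ]


def main_network(x, generated):
    """One-neuron main network y = w * x + b using generated weights."""
    w, b = generated
    return w * x + b


# One ensemble parameter (say, a viscosity-like value) drives the weights.
generated = hypernetwork([2.0], h_weights=[[1.0], [0.5]], h_bias=[0.0, 1.0])
y = main_network(3.0, generated)
```

Only the hypernetwork's own parameters (`h_weights`, `h_bias`) would be trained; the main network has no free weights of its own, so its behaviour changes with each ensemble member's parameters.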

[CV-71] LossAgent: Towards Any Optimization Objectives for Image Processing with LLM Agents

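[Quick read]: This paper addresses the fact that, in low-level image processing tasks such as image super-resolution and restoration, existing loss functions (e.g., MSE loss) cannot instantiate complex optimization objectives such as hand-crafted perceptual metrics, text descriptions, and intricate human feedback. The key to the solution is to use a large language model (LLM) as a loss agent (LossAgent), whose textual understanding and prior knowledge let it interpret complex optimization objectives, optimization trajectories, and state feedback from the external environment. Concretely, the paper builds a loss repository and designs optimization-oriented prompt engineering so that, at each optimization interaction, the agent decides the compositional weight of every loss in the repository, steering the optimization trajectory toward any customized objective.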

Link: https://arxiv.org/abs/2412.04090
Authors: Bingchen Li,Xin Li,Yiting Lu,Zhibo Chen
Keywords: low-level image processing, image processing, low-level image, image processing networks, image processing tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present the first loss agent, dubbed LossAgent, for low-level image processing tasks, e.g., image super-resolution and restoration, intending to achieve any customized optimization objectives of low-level image processing in different practical applications. Notably, not all optimization objectives, such as complex hand-crafted perceptual metrics, text description, and intricate human feedback, can be instantiated with existing low-level losses, e.g., MSE loss, which presents a crucial challenge in optimizing image processing networks in an end-to-end manner. To address this, our LossAgent introduces the powerful large language model (LLM) as the loss agent, where the rich textual understanding of prior knowledge empowers the loss agent with the potential to understand complex optimization objectives, trajectory, and state feedback from external environments in the optimization process of the low-level image processing networks. In particular, we establish the loss repository by incorporating existing loss functions that support the end-to-end optimization for low-level image processing. Then, we design the optimization-oriented prompt engineering for the loss agent to actively and intelligently decide the compositional weights for each loss in the repository at each optimization interaction, thereby achieving the required optimization trajectory for any customized optimization objectives. Extensive experiments on three typical low-level image processing tasks and multiple optimization objectives have shown the effectiveness and applicability of our proposed LossAgent. Code and pre-trained models will be available at this https URL.
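
The loss repository with agent-chosen compositional weights can be pictured as a weighted sum over a dictionary of loss functions. In the sketch below the weights are hard-coded where the paper would obtain them from the LLM agent at each optimization interaction. The two losses, the repository contents, and all names are assumptions for illustration only.

```python
def mse(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical repository of differentiable low-level losses.
LOSS_REPOSITORY = {"mse": mse, "l1": l1}


def combined_loss(pred, target, weights):
    """Weighted sum of every loss in the repository."""
    return sum(
        weights[name] * loss_fn(pred, target)
        for name, loss_fn in LOSS_REPOSITORY.items()
    )


# Stand-in for weights an agent might return after reading feedback.
agent_weights = {"mse": 0.5, "l1": 2.0}
total = combined_loss([1.0, 3.0], [0.0, 1.0], agent_weights)
```

Because only the weights change between interactions, the combined objective stays differentiable even when the target objective (a text description, say) is not.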

[CV-72] BodyMetric: Evaluating the Realism of Human Bodies in Text-to-Image Generation

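[Quick read]: This paper addresses the body-related artifacts that text-to-image models commonly produce when generating humans, such as extra or missing limbs, unrealistic poses, and blurred body parts. The key to the solution is BodyMetric, a learnable metric that predicts how realistic the human body in an image is. BodyMetric is trained on realism labels together with multi-modal signals, including 3D body representations inferred from the input image and textual descriptions. To support this, the paper designs an annotation pipeline that collects expert ratings of body realism, yielding a new dataset, BodyRealism. Ablation studies support the architectural choices and highlight the importance of a 3D human body prior for capturing body-related artifacts in 2D images. Unlike existing metrics that evaluate general user preference, BodyMetric specifically reflects body-related artifacts, and the paper demonstrates its use for large-scale benchmarking and image ranking.

Link: https://arxiv.org/abs/2412.04086
Authors: Nefeli Andreou,Varsha Vivek,Ying Wang,Alex Vorobiov,Tiffany Deng,Raja Bala,Larry Davis,Betty Mohler Tesch
Keywords: Accurately generating images, Accurately generating, text remains, remains a challenging, challenging problem
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: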

Abstract:Accurately generating images of human bodies from text remains a challenging problem for state of the art text-to-image models. Commonly observed body-related artifacts include extra or missing limbs, unrealistic poses, blurred body parts, etc. Currently, evaluation of such artifacts relies heavily on time-consuming human judgments, limiting the ability to benchmark models at scale. We address this by proposing BodyMetric, a learnable metric that predicts body realism in images. BodyMetric is trained on realism labels and multi-modal signals including 3D body representations inferred from the input image, and textual descriptions. In order to facilitate this approach, we design an annotation pipeline to collect expert ratings on human body realism leading to a new dataset for this task, namely, BodyRealism. Ablation studies support our architectural choices for BodyMetric and the importance of leveraging a 3D human body prior in capturing body-related artifacts in 2D images. In comparison to concurrent metrics which evaluate general user preference in images, BodyMetric specifically reflects body-related artifacts. We demonstrate the utility of BodyMetric through applications that were previously infeasible at scale. In particular, we use BodyMetric to benchmark the generation ability of text-to-image models to produce realistic human bodies. We also demonstrate the effectiveness of BodyMetric in ranking generated images based on the predicted realism scores.

[CV-73] Unified Framework for Open-World Compositional Zero-shot Learning

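[Quick read]: This paper addresses the challenge in Open-World Compositional Zero-Shot Learning (OW-CZSL) of recognizing novel compositions of known primitives and entities. The key to the solution is to strengthen inter-modality interactions between image and text data, and to introduce a new module that reduces the computational burden of exhaustively exploring all possible compositions at inference time. The paper also proposes a hybrid procedure that combines joint and independent learning mechanisms to produce the final predictions. The model achieves state-of-the-art OW-CZSL results on three datasets and surpasses Large Vision Language Models (LLVM) on two datasets.

Link: https://arxiv.org/abs/2412.04083
Authors: Hirunima Jayasekara,Khoi Pham,Nirat Saini,Abhinav Shrivastava
Keywords: Open-World Compositional Zero-Shot, Compositional Zero-Shot Learning, Open-World Compositional, Compositional Zero-Shot, addresses the challenge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: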

Abstract:Open-World Compositional Zero-Shot Learning (OW-CZSL) addresses the challenge of recognizing novel compositions of known primitives and entities. Even though prior works utilize language knowledge for recognition, such approaches exhibit limited interactions between language-image modalities. Our approach primarily focuses on enhancing the inter-modality interactions by fostering richer interactions between image and textual data. Additionally, we introduce a novel module aimed at alleviating the computational burden associated with exhaustive exploration of all possible compositions during the inference stage. While previous methods exclusively learn compositions jointly or independently, we introduce an advanced hybrid procedure that leverages both learning mechanisms to generate final predictions. Our proposed model achieves state-of-the-art results in OW-CZSL on three datasets, while surpassing Large Vision Language Models (LLVM) on two datasets.

[CV-74] SoRA: Singular Value Decomposed Low-Rank Adaptation for Domain Generalizable Representation Learning

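[Quick read]: This paper addresses domain generalization (DG): training a model on one or more source domains so that it performs robustly on unseen target domains. The key to the solution is Singular Value Decomposed Low-Rank Adaptation (SoRA). SoRA analyzes the pre-trained weights through singular value decomposition (SVD) and selectively tunes the minor singular components while freezing the rest, which preserves the generalization ability of the pre-trained model while efficiently learning task-specific features. SoRA additionally freezes domain-generalizable blocks and uses an annealing weight decay strategy to achieve an optimal balance between generalizability and discriminability. The method achieves state-of-the-art results on multiple domain generalization benchmarks, adds no inference overhead or regularization loss, is compatible with any backbone or head, and is easy to integrate into a wide range of tasks.

Link: https://arxiv.org/abs/2412.04077
Authors: Seokju Yun,Seunghye Chae,Dongheon Lee,Youngmin Ro
Keywords: ensure robust performance, unseen target domains, aims to adapt, ensure robust, robust performance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL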

Abstract:Domain generalization (DG) aims to adapt a model using one or multiple source domains to ensure robust performance in unseen target domains. Recently, Parameter-Efficient Fine-Tuning (PEFT) of foundation models has shown promising results in the context of the DG problem. Nevertheless, existing PEFT methods still struggle to strike a balance between preserving generalizable components of the pre-trained model and learning task-specific features. To gain insights into the distribution of generalizable components, we begin by analyzing the pre-trained weights through the lens of singular value decomposition. Building on these insights, we introduce Singular Value Decomposed Low-Rank Adaptation (SoRA), an approach that selectively tunes minor singular components while keeping the residual parts frozen. SoRA effectively retains the generalization ability of the pre-trained model while efficiently acquiring task-specific skills. Furthermore, we freeze domain-generalizable blocks and employ an annealing weight decay strategy, thereby achieving an optimal balance in the delicate trade-off between generalizability and discriminability. SoRA attains state-of-the-art results on multiple benchmarks that span from domain generalized semantic segmentation to domain generalized object detection. In addition, our methods introduce no additional inference overhead or regularization loss, maintain compatibility with any backbone or head, and are designed to be versatile, allowing easy integration into a wide range of tasks.
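
The split between frozen and tuned components described in the abstract can be written out as a short worked equation. The notation is an assumption made for illustration (a weight matrix W of rank R, and a hypothetical count r of minor components to tune); the abstract does not say how the tuned part is parameterized, so this only shows the decomposition itself.

```latex
% SVD of a pre-trained weight matrix, singular values sorted in descending order
W = U \Sigma V^{\top} = \sum_{i=1}^{R} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_R \ge 0

% Split into the residual part (kept frozen) and the r minor components (tuned)
W = \underbrace{\sum_{i=1}^{R-r} \sigma_i \, u_i v_i^{\top}}_{\text{frozen residual}}
  + \underbrace{\sum_{i=R-r+1}^{R} \sigma_i \, u_i v_i^{\top}}_{\text{tuned minor components}}
```

Because the two sums add back to W exactly, the adapted weights can be merged after training, which is consistent with the abstract's claim of no additional inference overhead.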

[CV-75] TransAdapter: Vision Transformer for Feature-Centric Unsupervised Domain Adaptation

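[Quick read]: This paper addresses the difficulty that significant domain gaps cause in Unsupervised Domain Adaptation (UDA). The key to the solution is to build on the Swin Transformer with three modules: a Graph Domain Discriminator that improves domain alignment through graph convolutions and entropy-based attention differentiation; an Adaptive Double Attention module that combines Windows and Shifted Windows attention with dynamic reweighting to align long-range and local features; and a Cross-Feature Transform that modifies Swin Transformer blocks to improve cross-domain generalization. Together, these modules achieve state-of-the-art performance across diverse applications without task-specific alignment modules.

Link: https://arxiv.org/abs/2412.04073
Authors: A. Enes Doruk,Erhan Oztop,Hasan F. Ates
Keywords: Unsupervised Domain Adaptation, utilize labeled data, significant domain gaps, Swin Transformer, unlabeled target domain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: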

Abstract:Unsupervised Domain Adaptation (UDA) aims to utilize labeled data from a source domain to solve tasks in an unlabeled target domain, often hindered by significant domain gaps. Traditional CNN-based methods struggle to fully capture complex domain relationships, motivating the shift to vision transformers like the Swin Transformer, which excel in modeling both local and global dependencies. In this work, we propose a novel UDA approach leveraging the Swin Transformer with three key modules. A Graph Domain Discriminator enhances domain alignment by capturing inter-pixel correlations through graph convolutions and entropy-based attention differentiation. An Adaptive Double Attention module combines Windows and Shifted Windows attention with dynamic reweighting to align long-range and local features effectively. Finally, a Cross-Feature Transform modifies Swin Transformer blocks to improve generalization across domains. Extensive benchmarks confirm the state-of-the-art performance of our versatile method, which requires no task-specific alignment modules, establishing its adaptability to diverse applications.

[CV-76] ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality

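[Quick read]: This paper addresses the low generation efficiency of autoregressive visual generation models. The key to the solution is ZipAR, a training-free, plug-and-play parallel decoding framework. ZipAR exploits the local structure of images: in addition to next-token prediction along the row dimension, it decodes visual tokens for spatially adjacent regions along the column dimension in parallel, giving a "next-set prediction" paradigm. Decoding multiple tokens in a single forward pass significantly reduces the number of forward passes needed to generate an image. Experiments show that ZipAR reduces the number of model forward passes by up to 91% on the Emu3-Gen model without any additional retraining.

Link: https://arxiv.org/abs/2412.04062
Authors: Yefei He,Feng Chen,Yuanyu He,Shaoxuan He,Hong Zhou,Kaipeng Zhang,Bohan Zhuang
Keywords: parallel decoding framework, accelerating auto-regressive, framework for accelerating, decoding framework, propose ZipAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages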

Abstract:In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating auto-regressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel, enabling the "next-set prediction" paradigm. By decoding multiple tokens simultaneously in a single forward pass, the number of forward passes required to generate an image is significantly reduced, resulting in a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.
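
To see why decoding rows in parallel cuts the number of forward passes, here is a toy schedule simulation. It rests on an assumed rule that is not taken from the paper: a row may start decoding once the row above it has produced `window` tokens, and every active row emits one token per forward pass. The function name and parameters are hypothetical, and the real ZipAR criterion may differ.

```python
def count_forward_passes(height: int, width: int, window: int) -> int:
    """Count forward passes for a height x width token grid under a toy
    row-parallel schedule (assumed rule, see the lead-in above)."""
    if height < 1 or width < 1 or window < 1:
        raise ValueError("height, width and window must be positive")
    decoded = [0] * height  # tokens decoded so far in each row
    passes = 0
    threshold = min(window, width)  # a row can never supply more than `width` tokens
    while decoded[-1] < width:
        # Pick the rows that advance in this pass, using the state before the pass.
        active = [
            r for r in range(height)
            if decoded[r] < width and (r == 0 or decoded[r - 1] >= threshold)
        ]
        for r in active:
            decoded[r] += 1
        passes += 1
    return passes


sequential = 32 * 32                         # one token per pass, raster order
parallel = count_forward_passes(32, 32, 4)   # toy schedule with a window of 4
```

Under this toy rule the count works out to `(height - 1) * window + width`, so a smaller window means fewer passes; setting `window` equal to `width` falls back to fully sequential raster-order decoding.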

[CV-77] Benchmarking and Enhancing Surgical Phase Recognition Models for Robotic-Assisted Esophagectomy

[Quick read]: This paper addresses surgical phase recognition in robotic-assisted minimally invasive esophagectomy (RAMIE). The key to the solution is a new deep learning model with an encoder-decoder structure and causal hierarchical attention, designed to capture the complex temporal dynamics of the procedure more effectively. In a comparison with state-of-the-art surgical phase recognition models, the new model performs better, with the aim of providing intraoperative support to surgeons.

Link: https://arxiv.org/abs/2412.04039
Authors: Yiping Li,Romy van Jaarsveld,Ronald de Jong,Jasper Bongers,Gino Kuiper,Richard van Hillegersberg,Jelle Ruurda,Marcel Breeuwer,Yasmina Al Khalil
Keywords: Robotic-assisted minimally invasive, minimally invasive esophagectomy, minimally invasive surgery, traditional minimally invasive, Robotic-assisted minimally
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at the SPIE Medical Imaging Conference, 2025

Abstract:Robotic-assisted minimally invasive esophagectomy (RAMIE) is a recognized treatment for esophageal cancer, offering better patient outcomes compared to open surgery and traditional minimally invasive surgery. RAMIE is highly complex, spanning multiple anatomical areas and involving repetitive phases and non-sequential phase transitions. Our goal is to leverage deep learning for surgical phase recognition in RAMIE to provide intraoperative support to surgeons. To achieve this, we have developed a new surgical phase recognition dataset comprising 27 videos. Using this dataset, we conducted a comparative analysis of state-of-the-art surgical phase recognition models. To more effectively capture the temporal dynamics of this complex procedure, we developed a novel deep learning model featuring an encoder-decoder structure with causal hierarchical attention, which demonstrates superior performance compared to existing models.

[CV-78] INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

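[Quick read]: This paper addresses how to build an audio-driven head generation model for dyadic interaction that switches dynamically between listening and speaking states. The key to the solution is the INFP framework, which has two stages: Motion-Based Head Imitation and Audio-Guided Motion Generation. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space and uses the motion latent codes to animate a static image. The second stage learns, through denoising, the mapping from the input dyadic audio to motion latent codes, which enables audio-driven head generation. The paper also introduces DyConv, a large-scale dataset of dyadic conversations collected from the Internet, to support this line of research.

Link: https://arxiv.org/abs/2412.04037
Authors: Yongming Zhu,Longhao Zhang,Zhengkun Rong,Tianshu Hu,Shuang Liang,Zhipeng Ge
Keywords: socially intelligent agent, socially intelligent, head generation, audio-driven head generation, Imagine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: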

Abstract:Imagine having a conversation with a socially intelligent agent. It can attentively listen to your words and offer visual and linguistic feedback promptly. This seamless interaction allows for multiple rounds of conversation to flow smoothly and naturally. In pursuit of actualizing it, we propose INFP, a novel audio-driven head generation framework for dyadic interaction. Unlike previous head generation works that only focus on single-sided communication, or require manual role assignment and explicit role switching, our model drives the agent portrait to alternate dynamically between speaking and listening states, guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space, and uses the motion latent codes to animate a static image. The second stage learns the mapping from the input dyadic audio to motion latent codes through denoising, leading to the audio-driven head generation in interactive scenarios. To facilitate this line of research, we introduce DyConv, a large-scale dataset of rich dyadic conversations collected from the Internet. Extensive experiments and visualizations demonstrate superior performance and effectiveness of our method. Project Page: this https URL.

[CV-79] Mask of truth: model sensitivity to unexpected regions of medical images

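[Quick read]: This paper addresses the problem that larger models for medical image analysis improve performance but often rely on non-relevant parts of the image (spurious correlations, or shortcuts), which leads to failures in real-world use. The key to the approach is to challenge the classification ability of convolutional neural networks (CNNs) by masking out the clinically relevant region of interest (ROI), and to combine this with the SHAP explainability method and an analysis of embeddings to reveal whether models rely on spurious correlations. In addition, a radiology resident interpreted chest X-rays under different masking strategies to complement the findings with clinical knowledge.

Link: https://arxiv.org/abs/2412.04030
Authors: Théo Sourget,Michelle Hestbek-Møller,Amelia Jiménez-Sánchez,Jack Junchi Xu,Veronika Cheplygina
Keywords: development of larger, led to increased, medical image analysis, images, medical image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: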

Abstract:The development of larger models for medical image analysis has led to increased performance. However, it also affected our ability to explain and validate model decisions. Models can use non-relevant parts of images, also called spurious correlations or shortcuts, to obtain high performance on benchmark datasets but fail in real-world scenarios. In this work, we challenge the capacity of convolutional neural networks (CNN) to classify chest X-rays and eye fundus images while masking out clinically relevant parts of the image. We show that all models trained on the PadChest dataset, irrespective of the masking strategy, are able to obtain an Area Under the Curve (AUC) above random. Moreover, the models trained on full images obtain good performance on images without the region of interest (ROI), even superior to the one obtained on images only containing the ROI. We also reveal a possible spurious correlation in the Chaksu dataset while the performances are more aligned with the expectation of an unbiased model. We go beyond the performance analysis with the usage of the explainability method SHAP and the analysis of embeddings. We asked a radiology resident to interpret chest X-rays under different masking to complement our findings with clinical knowledge. Our code is available at this https URL and this https URL

[CV-80] PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors

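[Quick read]: This paper addresses inaccurate perception of spatial and motion information in autonomous driving, in particular the shortcomings of traditional object-centric and class-agnostic methods on sparse LiDAR data in distant regions. The key to the solution is PriorMotion, a generative framework that extracts rasterized and vectorized scene representations to model spatio-temporal priors. The model comprises a BEV encoder, a Raster-Vector prior Encoder, and a Spatio-Temporal prior Generator, which together improve spatial and temporal consistency in motion prediction. The paper also introduces a standardized evaluation protocol for class-agnostic motion prediction, reports state-of-the-art performance on the nuScenes dataset, and further validates robustness on advanced FMCW LiDAR.

Link: https://arxiv.org/abs/2412.04020
Authors: Kangan Qian,Xinyu Jiao,Yining Shi,Yunlong Wang,Ziang Luo,Zheng Fu,Kun Jiang,Diange Yang
Keywords: safe autonomous navigation, Reliable perception, autonomous navigation, information is crucial, crucial for safe
Categories: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF); Robotics (cs.RO)
Comments: 8 pages, 6 figures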

Abstract:Reliable perception of spatial and motion information is crucial for safe autonomous navigation. Traditional approaches typically fall into two categories: object-centric and class-agnostic methods. While object-centric methods often struggle with missed detections, leading to inaccuracies in motion prediction, many class-agnostic methods focus heavily on encoder design, often overlooking important priors like rigidity and temporal consistency, leading to suboptimal performance, particularly with sparse LiDAR data in distant regions. To address these issues, we propose PriorMotion, a generative framework that extracts rasterized and vectorized scene representations to model spatio-temporal priors. Our model comprises a BEV encoder, a Raster-Vector prior Encoder, and a Spatio-Temporal prior Generator, improving both spatial and temporal consistency in motion prediction. Additionally, we introduce a standardized evaluation protocol for class-agnostic motion prediction. Experiments on the nuScenes dataset show that PriorMotion achieves state-of-the-art performance, with further validation on advanced FMCW LiDAR confirming its robustness.

[CV-81] IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation CVPR2025

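[Quick read]: This paper addresses high-resolution talking head video generation from a single image and audio input. Existing methods based on explicit face models, such as 3D morphable models (3DMM) and facial landmarks, fall short of high-fidelity video because they lack an appearance-aware motion representation, while generative approaches such as video diffusion models achieve high video quality but are too slow for practical use. The key to the solution is the Implicit Face Motion Diffusion Model (IF-MDM), which uses implicit motion to encode human faces into appearance-aware compressed facial latents. Because implicit motion lacks the spatial disentanglement of explicit models, which complicates alignment with subtle lip movements, the paper introduces motion statistics to capture fine-grained motion information, and provides motion controllability to tune the trade-off between motion intensity and visual quality during inference. IF-MDM supports real-time generation of 512x512 video at up to 45 frames per second (fps), and extensive evaluations show that it outperforms existing diffusion and explicit face models.

Link: https://arxiv.org/abs/2412.04000
Authors: Sejong Yang,Seoung Wug Oh,Yang Zhou,Seon Joo Kim
Keywords: high-resolution talking head, talking head generation, audio input, approach for high-resolution, high-resolution talking
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at CVPR 2025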

Abstract:We introduce a novel approach for high-resolution talking head generation from a single image and audio input. Prior methods using explicit face models, like 3D morphable models (3DMM) and facial landmarks, often fall short in generating high-fidelity videos due to their lack of appearance-aware motion representation. While generative approaches such as video diffusion models achieve high video quality, their slow processing speeds limit practical application. Our proposed model, Implicit Face Motion Diffusion Model (IF-MDM), employs implicit motion to encode human faces into appearance-aware compressed facial latents, enhancing video generation. Although implicit motion lacks the spatial disentanglement of explicit models, which complicates alignment with subtle lip movements, we introduce motion statistics to help capture fine-grained motion information. Additionally, our model provides motion controllability to optimize the trade-off between motion intensity and visual quality during inference. IF-MDM supports real-time generation of 512x512 resolution videos at up to 45 frames per second (fps). Extensive evaluations demonstrate its superior performance over existing diffusion and explicit face models. The code will be released publicly, available alongside supplementary materials. The video results can be found on this https URL.

[CV-82] Blind Underwater Image Restoration using Co-Operational Regressor Networks

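[Quick read]: This paper addresses visual degradation in underwater image restoration, which matters for applications such as biological research, archaeology, and infrastructure maintenance. The key to the solution is a new machine learning model, Co-Operational Regressor Networks (CoRe-Nets). A CoRe-Net consists of two cooperating networks: the Apprentice Regressor (AR), which performs the image transformation, and the Master Regressor (MR), which evaluates the Peak Signal-to-Noise Ratio (PSNR) of the images generated by the AR and feeds it back to the AR. CoRe-Nets are built on Self-Organized Operational Neural Networks (Self-ONNs), which offer superior learning capability by modulating nonlinearity in kernel transformations. Experiments on the Large Scale Underwater Image (LSUI) dataset show state-of-the-art restoration performance with significantly reduced computational complexity; with a 2-pass application, the restored images can sometimes even surpass the visual quality of the ground truth.

Link: https://arxiv.org/abs/2412.03995
Authors: Ozer Can Devecioglu,Serkan Kiranyaz,Turker Ince,Moncef Gabbouj
Keywords: water's unique properties, including scattering, color distortion, biological research, infrastructure maintenance
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 11 pages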

Abstract:The exploration of underwater environments is essential for applications such as biological research, archaeology, and infrastructure maintenance. However, underwater imaging is challenging due to the water's unique properties, including scattering, absorption, color distortion, and reduced visibility. To address such visual degradations, a variety of approaches have been proposed, ranging from basic signal processing methods to deep learning models; however, none of them has proven to be consistently successful. In this paper, we propose a novel machine learning model, Co-Operational Regressor Networks (CoRe-Nets), designed to achieve the best possible underwater image restoration. A CoRe-Net consists of two co-operating networks: the Apprentice Regressor (AR), responsible for image transformation, and the Master Regressor (MR), which evaluates the Peak Signal-to-Noise Ratio (PSNR) of the images generated by the AR and feeds it back to the AR. CoRe-Nets are built on Self-Organized Operational Neural Networks (Self-ONNs), which offer a superior learning capability by modulating nonlinearity in kernel transformations. The effectiveness of the proposed model is demonstrated on the benchmark Large Scale Underwater Image (LSUI) dataset. Leveraging the joint learning capabilities of the two cooperating networks, the proposed model achieves state-of-the-art restoration performance with significantly reduced computational complexity and often produces results that can even surpass the visual quality of the ground truth with a 2-pass application. Our results and the optimized PyTorch implementation of the proposed approach are now publicly shared on GitHub.
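
For readers unfamiliar with the quantity the Master Regressor is said to evaluate, the following is a minimal sketch of the standard PSNR formula on flat lists of pixel values. In the paper the MR is a learned network; this function only shows the reference metric, and the name `psnr` and the 8-bit default `max_value` are choices made here for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 25.5 grey levels on an 8-bit scale gives a ratio of 100.
example_db = psnr([0.0, 0.0], [25.5, 25.5])
```

Higher values mean the restored image is closer to the reference, which is why a PSNR estimate can serve as a feedback signal for the network doing the restoration.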

[CV-83] LaserGuider: A Laser Based Physical Backdoor Attack against Deep Neural Networks

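[Quick read]: This paper addresses the shortcomings of existing physical backdoor attacks in remote control, temporal stealthiness, flexibility, and mobility. The key to the solution is a new type of laser-based backdoor trigger and a physical backdoor attack built on it, called LaserGuider. LaserGuider exploits the long-distance transmission and instant-imaging properties of lasers to gain remote control ability and high temporal stealthiness, flexibility, and mobility. The paper also proposes a systematic approach to optimizing laser parameters to improve attack effectiveness. Experiments show that LaserGuider achieves an attack success rate above 90% on traffic sign recognition deep neural networks (DNNs), with negligible impact on normal inputs.

Link: https://arxiv.org/abs/2412.03993
Authors: Yongjie Xu,Guangke Chen,Fu Song,Yuqi Chen
Keywords: deep neural networks, embed hidden associations, attacks embed hidden, maintaining normal behavior, neural networks
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: In Proceedings of the 23rd International Conference on Applied Cryptography and Network Security (ACNS), Munich, Germany, 23-26 June, 2025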

Abstract:Backdoor attacks embed hidden associations between triggers and targets in deep neural networks (DNNs), causing them to predict the target when a trigger is present while maintaining normal behavior otherwise. Physical backdoor attacks, which use physical objects as triggers, are feasible but lack remote control, temporal stealthiness, flexibility, and mobility. To overcome these limitations, in this work, we propose a new type of backdoor triggers utilizing lasers that feature long-distance transmission and instant-imaging properties. Based on the laser-based backdoor triggers, we present a physical backdoor attack, called LaserGuider, which possesses remote control ability and achieves high temporal stealthiness, flexibility, and mobility. We also introduce a systematic approach to optimize laser parameters for improving attack effectiveness. Our evaluation on traffic sign recognition DNNs, critical in autonomous vehicles, demonstrates that LaserGuider with three different laser-based triggers achieves over 90% attack success rate with negligible impact on normal inputs. Additionally, we release LaserMark, the first dataset of real world traffic signs stamped with physical laser spots, to support further research in backdoor attacks and defenses.

[CV-84] UNCOVER: Unknown Class Object Detection for Autonomous Vehicles in Real-time

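[Quick read]: This paper addresses the problem that, in open-world autonomous driving scenarios, standard object detectors trained on a limited set of base classes tend to ignore unknown objects, creating potential risks on the road. The key to the solution is to add occupancy prediction alongside bounding box regression, scoring objectness as the ratio of the predicted area that is occupied by actual objects. To improve generalization, the paper increases object diversity by using data from other domains via Mosaic and Mixup augmentation, and classifies objects outside the training classes as a new out-of-distribution (OOD) class. To reduce false positives, particularly for close objects, it also adds a post-hoc filtering step that uses geometric cues from the depth map. The resulting solution, UNCOVER, targets real-time detection with high recall of unknown objects.

Link: https://arxiv.org/abs/2412.03986
Authors: Lars Schmarje,Kaspar Sakman,Reinhard Koch,Dan Zhang
Keywords: encountering unknown objects, operates in open-world, open-world scenarios, unknown objects, encountering unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: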

Abstract:Autonomous driving (AD) operates in open-world scenarios, where encountering unknown objects is inevitable. However, standard object detectors trained on a limited number of base classes tend to ignore any unknown objects, posing potential risks on the road. To address this, it is important to learn a generic rather than a class specific objectness from objects seen during training. We therefore introduce an occupancy prediction together with bounding box regression. It learns to score the objectness by calculating the ratio of the predicted area occupied by actual objects. To enhance its generalizability, we increase the object diversity by exploiting data from other domains via Mosaic and Mixup augmentation. The objects outside the AD training classes are classified as a newly added out-of-distribution (OOD) class. Our solution UNCOVER, for UNknown Class Object detection for autonomous VEhicles in Real-time, excels at achieving both real-time detection and high recall of unknown objects on challenging AD benchmarks. To further attain very low false positive rates, particularly for close objects, we introduce a post-hoc filtering step that utilizes geometric cues extracted from the depth map, typically available within the AD system.
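
The objectness score described in the abstract, the ratio of the predicted area occupied by actual objects, can be read as a simple occupancy ratio. The sketch below is one possible interpretation, assuming a binary occupancy mask and an axis-aligned box in half-open pixel coordinates; the function name and data layout are hypothetical, and the paper's exact formulation may differ.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel coordinates


def occupancy_ratio(box: Box, occupancy_mask: List[List[int]]) -> float:
    """Fraction of the pixels inside `box` that the mask marks as occupied (1)."""
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        return 0.0  # degenerate box: treat as containing no object
    occupied = sum(occupancy_mask[y][x]
                   for y in range(y0, y1)
                   for x in range(x0, x1))
    return occupied / area


# Toy 4x4 mask with a 2x2 object in the top-left corner.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
tight_score = occupancy_ratio((0, 0, 2, 2), mask)   # box hugs the object
loose_score = occupancy_ratio((0, 0, 4, 4), mask)   # box is mostly background
```

A tight box around an object scores close to 1 regardless of the object's class, which matches the abstract's goal of learning a generic rather than class-specific objectness.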

[CV-85] Exploring Fully Convolutional Networks for the Segmentation of Hyperspectral Imaging Applied to Advanced Driver Assistance Systems

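[Quick read]: This paper addresses the unreliable detection and tracking of existing computer-vision-based Advanced Driver Assistance Systems (ADAS) under adverse weather, changing lighting, and complex scenes with overlapping objects. The key to the solution is hyperspectral imaging (HSI), which uses the distinct near-infrared (NIR) spectral reflectances of different materials to better separate objects in a driving scene. Concretely, the paper applies fully convolutional networks (FCN) to HSI image segmentation to investigate how far the spatial features encoded by convolutional filters can improve HSI segmentation performance. The study uses the HSI-Drive v1.1 dataset and assesses feasibility and performance by prototyping the FCN model, together with the required hyperspectral cube preprocessing stage, on a multiprocessor system-on-chip (MPSoC).

Link: https://arxiv.org/abs/2412.03982
Authors: Jon Gutiérrez-Zaballa,Koldo Basterretxea,Javier Echanobe,M. Victoria Martínez,Inés del Campo
Keywords: Advanced Driver Assistance, Driver Assistance Systems, Advanced Driver, Driver Assistance, Assistance Systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: arXiv admin note: text overlap with arXiv:2411.19274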

Abstract:Advanced Driver Assistance Systems (ADAS) are designed with the main purpose of increasing the safety and comfort of vehicle occupants. Most current computer vision-based ADAS perform detection and tracking tasks quite successfully under regular conditions, but are not completely reliable, particularly under adverse weather and changing lighting conditions, or in complex situations with many overlapping objects. In this work we explore the use of hyperspectral imaging (HSI) in ADAS on the assumption that the distinct near infrared (NIR) spectral reflectances of different materials can help to better separate the objects in a driving scene. In particular, this paper describes some experimental results of the application of fully convolutional networks (FCN) to the image segmentation of HSI for ADAS applications. More specifically, our aim is to investigate to what extent the spatial features codified by convolutional filters can be helpful to improve the performance of HSI segmentation systems. With that aim, we use the HSI-Drive v1.1 dataset, which provides a set of labelled images recorded in real driving conditions with a small-size snapshot NIR-HSI camera. Finally, we analyze the implementability of such an HSI segmentation system by prototyping the developed FCN model together with the necessary hyperspectral cube preprocessing stage and characterizing its performance on an MPSoC.

[CV-86] HyperDefect-YOLO: Enhance YOLO with HyperGraph Computation for Industrial Defect Detection

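[Quick read]: This paper addresses defect detection in manufacturing, in particular the limitations of traditional YOLO models in capturing high-order feature interrelationships in complex scenes and across defect scales. The key to the solution is to introduce hypergraph computation into the YOLO framework, yielding HyperDefect-YOLO (HD-YOLO). HD-YOLO integrates a Defect Aware Module (DAM) and a Mixed Graph Network (MGNet) into the backbone to perceive and extract defect features. It further proposes a HyperGraph Aggregation Network (HGANet), which combines hypergraph and attention mechanisms to aggregate multi-scale features, and Cross-Scale Fusion (CSF) to adaptively fuse and process features. Finally, a Semantic Aware Module (SAM) in the neck strengthens semantic exploitation so that defects of different sizes can be localized accurately against disturbed backgrounds.

Link: https://arxiv.org/abs/2412.03969
Authors: Zuo Zuo,Jiahao Dong,Yue Gao,Zongze Wu
Keywords: challenging task aiming, detect defects generated, defect detection, Defect Aware Module, manufacturing industry
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review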

Abstract:In the manufacturing industry, defect detection is an essential but challenging task aiming to detect defects generated in the process of production. Though traditional YOLO models present good performance in defect detection, they still have limitations in capturing high-order feature interrelationships, which hinders defect detection in complex scenarios and across scales. To this end, we introduce hypergraph computation into the YOLO framework, dubbed HyperDefect-YOLO (HD-YOLO), to improve representational ability and semantic exploitation. HD-YOLO consists of a Defect Aware Module (DAM) and a Mixed Graph Network (MGNet) in the backbone, which specialize in perception and extraction of defect features. To effectively aggregate multi-scale features, we propose a HyperGraph Aggregation Network (HGANet), which combines hypergraph and attention mechanisms to aggregate multi-scale features. Cross-Scale Fusion (CSF) is proposed to adaptively fuse and handle features instead of simple concatenation and convolution. Finally, we propose a Semantic Aware Module (SAM) in the neck to enhance semantic exploitation for accurately localizing defects of different sizes in disturbed backgrounds. HD-YOLO undergoes rigorous evaluation on the public HRIPCB and NEU-DET datasets with significant improvements compared to state-of-the-art methods. We also evaluate HD-YOLO on a self-built MINILED dataset collected in real industrial scenarios to demonstrate the effectiveness of the proposed method. The source codes are at this https URL.

[CV-87] Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation

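[Quick read]: This paper addresses automated crop mapping from Satellite Image Time Series (SITS), where low resolution and unclear parcel boundaries make pixel-level annotation complex and time-consuming. The key to the solution is the weakly supervised paradigm, which uses only image-level category labels and so removes the need for exhaustive annotation. Specifically, the paper proposes a method called exploring space-time perceptive clues (Exact): it introduces spatial clues to capture representative patterns of different crops, and uses temporal-to-class interaction to emphasize the contribution of pivotal clips, which improves the model's perception of crop regions. The clue-based class activation maps (CAMs) derived from these space-time perceptive clues effectively supervise the SITS segmentation network, bringing weakly supervised crop mapping close to fully supervised performance.

Link: https://arxiv.org/abs/2412.03968
Authors: Hao Zhu,Yan Zhu,Jiayu Xiao,Tianxiang Xiao,Yike Ma,Yucheng Zhang,Feng Dai
Keywords: Image Time Series, Satellite Image Time, Time Series, Satellite Image, Image Time
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review. Code will be available at this https URL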

Abstract: Automated crop mapping through Satellite Image Time Series (SITS) has emerged as a crucial avenue for agricultural monitoring and management. However, due to the low resolution and unclear parcel boundaries, annotating pixel-level masks is exceptionally complex and time-consuming in SITS. This paper embraces the weakly supervised paradigm (i.e., only image-level categories available) to liberate the crop mapping task from the exhaustive annotation burden. The unique characteristics of SITS give rise to several challenges in weakly supervised learning: (1) noise perturbation from spatially neighboring regions, and (2) erroneous semantic bias from anomalous temporal periods. To address these difficulties, we propose a novel method, termed exploring space-time perceptive clues (Exact). First, we introduce a set of spatial clues to explicitly capture the representative patterns of different crops from the most class-relative regions. In addition, we leverage the temporal-to-class interaction of the model to emphasize the contributions of pivotal clips, thereby enhancing the model's perception of crop regions. Building upon the space-time perceptive clues, we derive clue-based CAMs to effectively supervise the SITS segmentation network. Our method demonstrates impressive performance on various SITS benchmarks. Remarkably, the segmentation network trained on Exact-generated masks achieves 95% of its fully supervised performance, showing the promise of the weakly supervised paradigm in crop mapping scenarios. Our code will be publicly available.

[CV-88] Local Curvature Smoothing with Stein's Identity for Efficient Score Matching NEURIPS2024

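[Quick Read]: This paper addresses the computational difficulty of score matching in training score-based diffusion models (SDMs), specifically the computationally expensive Jacobian trace that score matching involves. The key to the solution is a new score matching variant, local curvature smoothing with Stein's identity (LCSS). LCSS bypasses the Jacobian trace by applying Stein's identity, which both makes computation efficient and provides a regularization effect. Experiments show that LCSS surpasses existing methods in sample generation performance and matches the widely adopted denoising score matching on FID, Inception score, and bits per dimension, and that it enables realistic image generation even at a high resolution of 1024 × 1024.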

Link: https://arxiv.org/abs/2412.03962
Authors: Genki Osada,Makoto Shing,Takashi Nishide
Keywords: score-based diffusion models, diffusion models, score matching, score-based diffusion, score
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2024

Abstract: The training of score-based diffusion models (SDMs) is based on score matching. The challenge of score matching is that it includes a computationally expensive Jacobian trace. While several methods have been proposed to avoid this computation, each has drawbacks, such as instability during training and approximating the learning as learning a denoising vector field rather than a true score. We propose a novel score matching variant, local curvature smoothing with Stein’s identity (LCSS). The LCSS bypasses the Jacobian trace by applying Stein’s identity, enabling regularization effectiveness and efficient computation. We show that LCSS surpasses existing methods in sample generation performance and matches the performance of denoising score matching, widely adopted by most SDMs, in evaluations such as FID, Inception score, and bits per dimension. Furthermore, we show that LCSS enables realistic image generation even at a high resolution of 1024 × 1024.
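
The identity behind this idea can be checked numerically. The sketch below is not the LCSS objective from the paper; it only illustrates, assuming isotropic Gaussian noise, how Stein's identity replaces a Jacobian trace with an expectation that needs function evaluations only: E[tr J_f(x + σε)] = E[ε · f(x + σε)] / σ for ε ~ N(0, I). The names `stein_trace_estimate` and `linear_map` are hypothetical, and the linear test function is chosen because its Jacobian trace is known exactly.

```python
import random


def stein_trace_estimate(f, x, sigma=0.1, n_samples=20000, seed=0):
    """Monte Carlo estimate of the Jacobian trace of f, smoothed over N(x, sigma^2 I).

    Uses Stein's identity for Gaussian noise:
        E[tr J_f(x + sigma*eps)] = E[eps . f(x + sigma*eps)] / sigma
    Antithetic pairs (+eps, -eps) are averaged to reduce variance.
    """
    rng = random.Random(seed)
    dim = len(x)
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        f_plus = f([xi + sigma * e for xi, e in zip(x, eps)])
        f_minus = f([xi - sigma * e for xi, e in zip(x, eps)])
        # Average of the +eps and -eps estimators: eps . (f+ - f-) / (2*sigma)
        total += sum(e * (p - m) for e, p, m in zip(eps, f_plus, f_minus)) / (2.0 * sigma)
    return total / n_samples


def linear_map(v):
    """Illustrative test function f(v) = A v with A = [[2, 1], [0, 3]], so tr(J_f) = 5."""
    return [2.0 * v[0] + 1.0 * v[1], 3.0 * v[1]]


if __name__ == "__main__":
    # Should print a value close to 5 (the exact trace), up to Monte Carlo error.
    print(stein_trace_estimate(linear_map, [0.5, -1.0]))
```

No derivative of `f` is ever computed, which is the property that makes this kind of estimator attractive when `f` is a neural score network.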

[CV-89] A Framework For Image Synthesis Using Supervised Contrastive Learning

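[Quick Read]: This paper addresses the fact that existing generative adversarial network (GAN) methods for text-to-image (T2I) generation model inter-modal semantic correspondence but ignore inner-modal semantic relationships among images. The key to the solution is a new framework that combines inter-modal and inner-modal semantic correspondence, using label guided supervised contrastive learning to strengthen image and text representations. Specifically, the framework introduces two parameter-sharing contrast branches in both the pretraining and generation phases, which effectively cluster the representations of semantically similar image-text pairs and thereby improve the quality of generated images. On the single-object dataset CUB and the multi-object dataset COCO, the method yields significant improvements in Inception Score (IS) and Frechet Inception Distance (FID); on the more complex multi-object COCO, it improves FID by 30.1%, 27.3%, 16.2%, and 17.1% for AttnGAN, DM-GAN, SSA-GAN, and GALIP, respectively.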

Link: https://arxiv.org/abs/2412.03957
Authors: Yibin Liu,Jianyu Zhang,Li Zhang,Shijian Li,Gang Pan
Keywords: producing realistic images, Generative Adversarial Network, aims at producing, producing realistic, Adversarial Network
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Text-to-image (T2I) generation aims at producing realistic images corresponding to text descriptions. The Generative Adversarial Network (GAN) has proven successful in this task. Typical T2I GANs are two-phase methods that first pretrain an inter-modal representation from aligned image-text pairs and then use a GAN to train the image generator on that basis. However, such a representation ignores inner-modal semantic correspondence, e.g., among images with the same label. The semantic label describes, a priori, the inherent distribution pattern with underlying cross-image relationships, which supplements the text description for understanding the full characteristics of an image. In this paper, we propose a framework leveraging both inter- and inner-modal correspondence by label guided supervised contrastive learning. We extend T2I GANs with two parameter-sharing contrast branches in both the pretraining and generation phases. This integration effectively clusters the semantically similar image-text pair representations, thereby fostering the generation of higher-quality images. We demonstrate our framework on four novel T2I GANs on both the single-object dataset CUB and the multi-object dataset COCO, achieving significant improvements in the Inception Score (IS) and Frechet Inception Distance (FID) metrics of image generation evaluation. Notably, on the more complex multi-object COCO, our framework improves FID by 30.1%, 27.3%, 16.2% and 17.1% for AttnGAN, DM-GAN, SSA-GAN and GALIP, respectively. We also validate our superiority by comparing with other label guided T2I GANs. The results affirm the effectiveness and competitiveness of our approach in advancing the state-of-the-art GAN for T2I generation.

[CV-90] Enhancing and Accelerating Diffusion-Based Inverse Problem Solving through Measurements Optimization

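[Quick Read]: This paper addresses the large number of function evaluations (NFEs) that diffusion models need to generate high-quality images when solving inverse problems. The key to the solution is the Measurements Optimization (MO) module, which integrates measurement information more efficiently at each step of the inverse problem-solving process. With it, the paper achieves state-of-the-art performance on multiple tasks while substantially reducing the required NFEs. For example, for high dynamic range imaging on the FFHQ 256 dataset, DPS-MO reaches a peak signal-to-noise ratio (PSNR) of 28.71 dB with only 100 NFEs, whereas existing methods require 1000 to 4000 NFEs.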

Link: https://arxiv.org/abs/2412.03941
Authors: Tianyu Chen,Zhendong Wang,Mingyuan Zhou
Keywords: recently demonstrated notable, demonstrated notable success, solving inverse problems, models have recently, recently demonstrated
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Diffusion models have recently demonstrated notable success in solving inverse problems. However, current diffusion model-based solutions typically require a large number of function evaluations (NFEs) to generate high-quality images conditioned on measurements, as they incorporate only limited information at each step. To accelerate the diffusion-based inverse problem-solving process, we introduce Measurements Optimization (MO), a more efficient plug-and-play module for integrating measurement information at each step of the inverse problem-solving process. This method is comprehensively evaluated across eight diverse linear and nonlinear tasks on the FFHQ and ImageNet datasets. By using MO, we establish state-of-the-art (SOTA) performance across multiple tasks, with key advantages: (1) it operates with no more than 100 NFEs, with phase retrieval on ImageNet being the sole exception; (2) it achieves SOTA or near-SOTA results even at low NFE counts; and (3) it can be seamlessly integrated into existing diffusion model-based solutions for inverse problems, such as DPS (Chung et al., 2022) and Red-diff (Mardani et al., 2023). For example, DPS-MO attains a peak signal-to-noise ratio (PSNR) of 28.71 dB on the FFHQ 256 dataset for high dynamic range imaging, setting a new SOTA benchmark with only 100 NFEs, whereas current methods require between 1000 and 4000 NFEs for comparable performance.
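
For readers unfamiliar with the metric quoted above, the following minimal sketch shows the standard PSNR definition, PSNR = 10 · log10(MAX² / MSE). It assumes images flattened to lists of floats with a peak value of 1.0; the function names are illustrative, and the paper's exact evaluation pipeline (color handling, value range) is not specified here.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat images."""
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR formula to recover the mean squared error."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))


if __name__ == "__main__":
    reference = [0.0, 0.5, 1.0, 0.25]
    estimate = [0.1, 0.6, 0.9, 0.35]  # every pixel is off by 0.1, so MSE = 0.01
    print(psnr(reference, estimate))
    print(mse_from_psnr(28.71))
```

Under the assumed [0, 1] range, `mse_from_psnr(28.71)` gives a sense of the per-pixel error that the reported 28.71 dB corresponds to.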

[CV-91] AIpparel: A Large Multimodal Generative Model for Digital Garments

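[Quick Read]: This paper addresses the time-consuming manual work in garment design by proposing AIpparel, a large multimodal model for generating and editing sewing patterns. The key to the solution is fine-tuning state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. The paper also proposes a new tokenization scheme that concisely encodes complex sewing patterns so that large language models (LLMs) can predict them efficiently. The approach achieves state-of-the-art performance on single-modal tasks such as text-to-garment and image-to-garment prediction, and supports novel multimodal garment generation applications such as interactive garment editing.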

Link: https://arxiv.org/abs/2412.03937
Authors: Kiyohiro Nakayama,Jan Ackermann,Timur Levent Kesdogan,Yang Zheng,Maria Korosteleva,Olga Sorkine-Hornung,Leonidas J. Guibas,Guandao Yang,Gordon Wetzstein
Keywords: mirroring cultural identities, showcasing personal style, Apparel is essential, offering protection, human life
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Apparel is essential to human life, offering protection, mirroring cultural identities, and showcasing personal style. Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a large multimodal model for generating and editing sewing patterns. Our model fine-tunes state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. Additionally, we propose a novel tokenization scheme that concisely encodes these complex sewing patterns so that LLMs can learn to predict them efficiently. AIpparel achieves state-of-the-art performance in single-modal tasks, including text-to-garment and image-to-garment prediction, and enables novel multimodal garment generation applications such as interactive garment editing. The project website is at this http URL.

[CV-92] InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

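[Quick Read]: This paper addresses the generation of unbounded dynamic 3D driving scenes, in particular the limited scale and the lack of geometric and appearance consistency in existing methods. The key to the solution is combining scalable 3D representations with video models, with flexible control over generation through a map-conditioned sparse-voxel generative model and pixel-aligned guidance buffers. Specifically, the paper first builds a map-conditioned sparse-voxel-based 3D generative model to generate an unbounded voxel world; it then re-purposes a video model and grounds it on the voxel world so that the generated appearance is consistent; finally, it proposes a fast feed-forward approach that uses both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects, producing controllable and realistic 3D driving scenes.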

Link: https://arxiv.org/abs/2412.03934
Authors: Yifan Lu,Xuanchi Ren,Jiawei Yang,Tianchang Shen,Zhangjie Wu,Jun Gao,Yue Wang,Siheng Chen,Mike Chen,Sanja Fidler,Jiahui Huang
Keywords: present InfiniCube, fidelity and controllability, high fidelity, generating unbounded dynamic, driving scenes
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project Page: this https URL

Abstract:We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. Previous methods for scene generation either suffer from limited scales or lack geometric and appearance consistency along generated sequences. In contrast, we leverage the recent advancements in scalable 3D representation and video models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects. Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness and superiority of our model.

[CV-93] MT3DNet: Multi-Task learning Network for 3D Surgical Scene Reconstruction

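[Quick Read]: This paper addresses how, in image-assisted minimally invasive surgery (MIS), to accurately detect, segment, and estimate the depth of surgical scenes while reconstructing the scene in 3D and providing segmentation and detection labels for surgical instruments. The key to the solution is a novel Multi-Task Learning (MTL) network that integrates an Adversarial Weight Update to overcome the optimization difficulties of handling multiple tasks concurrently. By integrating segmentation, depth estimation, and object detection, the method achieves 3D reconstruction of the surgical scene, improving scene understanding relative to existing studies that lack 3D capabilities. Experiments on the EndoVis2018 benchmark dataset show that the model handles all three tasks efficiently.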

Link: https://arxiv.org/abs/2412.03928
Authors: Mithun Parab,Pranay Lendave,Jiyoung Kim,Thi Quynh Dan Nguyen,Palash Ingle
Keywords: minimally invasive surgeries, collaborative human-robot procedures, image-assisted minimally invasive, surgical scenes, understanding surgical scenes
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract: In image-assisted minimally invasive surgeries (MIS), understanding surgical scenes is vital for real-time feedback to surgeons, skill evaluation, and improving outcomes through collaborative human-robot procedures. Within this context, the challenge lies in accurately detecting, segmenting, and estimating the depth of surgical scenes depicted in high-resolution images, while simultaneously reconstructing the scene in 3D and providing segmentation of surgical instruments along with detection labels for each instrument. To address this challenge, a novel Multi-Task Learning (MTL) network is proposed for performing these tasks concurrently. A key aspect of this approach involves overcoming the optimization hurdles associated with handling multiple tasks concurrently by integrating an Adversarial Weight Update into the MTL framework. The proposed MTL model achieves 3D reconstruction through the integration of segmentation, depth estimation, and object detection, thereby enhancing the understanding of surgical scenes, which marks a significant advancement compared to existing studies that lack 3D capabilities. Comprehensive experiments on the EndoVis2018 benchmark dataset underscore the adeptness of the model in efficiently addressing all three tasks, demonstrating the efficacy of the proposed techniques.

[CV-94] MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models

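[Quick Read]: This paper addresses the limited ability of vision-language models (VLMs) to perceive and interpret color and physical environment, and in particular the lack of datasets that evaluate subtle color variations and spatial context. The key to the solution is MegaCOIN, a high-quality, human-labeled dataset of 220,000 real images annotated with three features (foreground color, background color, and a description of the object's physical environment), for a total of 660,000 human annotations. MegaCOIN has two parts: MegaCOIN-Instruct, for supervised fine-tuning (SFT), and MegaCOIN-Bench, a stand-alone question-answering test set. MegaCOIN can also be used to benchmark domain generalization (DG) algorithms. Fine-tuning with MegaCOIN improves VLM performance on color recognition and visual evaluation tasks; in certain cases, small-scale open-source models such as LLaVA and Bunny outperform the closed-source GPT-4o.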

Link: https://arxiv.org/abs/2412.03927
Authors: Ming-Chang Chiu,Shicheng Wen,Pin-Yu Chen,Xuezhe Ma
Keywords: achieving contextually accurate, contextually accurate understanding, understanding and interaction, ability to perceive, perceive and interpret
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 13 tables, 2 figures

Abstract: In vision-language models (VLMs), the ability to perceive and interpret color and physical environment is crucial for achieving contextually accurate understanding and interaction. However, despite advances in multimodal modeling, there remains a significant lack of specialized datasets that rigorously evaluate a model’s capacity to discern subtle color variations and spatial context, critical elements for situational comprehension and reliable deployment across real-world applications. Toward that goal, we curate MegaCOIN, a high-quality, human-labeled dataset based on real images with various contextual attributes. MegaCOIN consists of two parts: MegaCOIN-Instruct, which serves as a supervised fine-tuning (SFT) dataset for VLMs; and MegaCOIN-Bench, an annotated test set that can be used as a stand-alone QA dataset. MegaCOIN provides three annotated features for 220,000 real images: foreground color, background color, and description of an object’s physical environment, constituting 660k human annotations. In addition, MegaCOIN can be applied to benchmark domain generalization (DG) algorithms. We explore benchmarking DG methods in the linear probing setup for VLMs and show some new insights. Last but not least, we show that VLMs, including GPT-4o, have subpar color recognition capabilities, and fine-tuning with MegaCOIN can result in improved performance on visual evaluation tasks. In certain cases, MegaCOIN fine-tuned small-scale open-source models such as LLaVA and Bunny can outperform closed-source GPT-4o. We hope the utilities of MegaCOIN can shed light on the directions in which VLMs can improve and provide a more complex platform for domain generalization algorithms.

[CV-95] Privacy-Preserving in Medical Image Analysis: A Review of Methods and Applications

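[Quick Read]: This paper addresses the privacy concerns raised by applying artificial intelligence (AI) and deep learning to medical image analysis. It reviews privacy-preserving techniques including encryption, differential privacy, homomorphic encryption, federated learning, and generative adversarial networks, as applied to medical image analysis tasks such as diagnosis, pathology, and telemedicine, to protect sensitive patient information. It also discusses emerging trends, such as zero-knowledge proofs and secure multi-party computation, as directions for future research. By organizing the review around specific challenges and their corresponding solutions, so that technical applications map directly onto practical issues, the paper aims to address gaps in current research and advance privacy preservation in medical image analysis.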

Link: https://arxiv.org/abs/2412.03924
Authors: Yanming Zhu,Xuefei Yin,Alan Wee-Chung Liew,Hui Tian
Keywords: medical image analysis, significantly improving diagnostic, improving diagnostic accuracy, medical image, image analysis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: With the rapid advancement of artificial intelligence and deep learning, medical image analysis has become a critical tool in modern healthcare, significantly improving diagnostic accuracy and efficiency. However, AI-based methods also raise serious privacy concerns, as medical images often contain highly sensitive patient information. This review offers a comprehensive overview of privacy-preserving techniques in medical image analysis, including encryption, differential privacy, homomorphic encryption, federated learning, and generative adversarial networks. We explore the application of these techniques across various medical image analysis tasks, such as diagnosis, pathology, and telemedicine. Notably, we organize the review based on specific challenges and their corresponding solutions in different medical image analysis applications, so that technical applications are directly aligned with practical issues, addressing gaps in the current research landscape. Additionally, we discuss emerging trends, such as zero-knowledge proofs and secure multi-party computation, offering insights for future research. This review serves as a valuable resource for researchers and practitioners and can help advance privacy preservation in medical image analysis.

[CV-96] Quantized and Interpretable Learning Scheme for Deep Neural Networks in Classification Task

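[Quick Read]: This paper addresses the high computational demands and poor interpretability that hinder deploying deep learning models in resource-constrained environments. The key to the solution is combining saliency-guided training with quantization techniques: Parameterized Clipping Activation (PACT) is used for quantization-aware training, while interpretability is improved by iteratively masking features with low gradient values. The approach optimizes precision and resource usage and, by mitigating noisy gradients, yields clearer and more interpretable saliency maps, improving efficiency and interpretability without sacrificing classification accuracy and making the model better suited to resource-limited deployment.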

Link: https://arxiv.org/abs/2412.03915
Authors: Alireza Maleki,Mahsa Lavaei,Mohsen Bagheritabar,Salar Beigzad,Zahra Abadi
Keywords: Deep learning techniques, proven highly effective, resource-constrained environments remains, environments remains challenging, remains challenging due
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Deep learning techniques have proven highly effective in image classification, but their deployment in resource-constrained environments remains challenging due to high computational demands. Furthermore, their interpretability is of high importance, which demands even more resources. In this work, we introduce an approach that combines saliency-guided training with quantization techniques to create an interpretable and resource-efficient model without compromising accuracy. We utilize Parameterized Clipping Activation (PACT) to perform quantization-aware training, specifically targeting activations and weights to optimize precision while minimizing resource usage. Concurrently, saliency-guided training is employed to enhance interpretability by iteratively masking features with low gradient values, leading to more focused and meaningful saliency maps. This training procedure helps in mitigating noisy gradients and yields models that provide clearer, more interpretable insights into their decision-making processes. To evaluate the impact of our approach, we conduct experiments using well-known Convolutional Neural Network (CNN) architectures on two popular benchmark datasets, MNIST and CIFAR-10. We compare the saliency maps generated by standard and quantized models to assess the influence of quantization on both interpretability and classification accuracy. Our results demonstrate that the combined use of saliency-guided training and PACT-based quantization not only maintains classification performance but also produces models that are significantly more efficient and interpretable, making them suitable for deployment in resource-limited settings.
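
As a rough illustration of the PACT step mentioned above, the sketch below applies the forward pass as it is commonly stated: clip the activation to [0, α], then quantize uniformly to k bits, y_q = round(y · (2^k − 1) / α) · α / (2^k − 1). The function name and the fixed α are illustrative assumptions; in quantization-aware training α is a learnable parameter and gradients pass through the rounding via a straight-through estimator, both of which this forward-only sketch omits.

```python
def pact_quantize(x, alpha, bits):
    """Forward pass of PACT-style activation quantization for a single value.

    x:     input activation
    alpha: clipping threshold (learned during training; fixed here for illustration)
    bits:  number of quantization bits k
    """
    if alpha <= 0 or bits < 1:
        raise ValueError("alpha must be positive and bits must be >= 1")
    levels = 2 ** bits - 1                      # number of quantization steps
    clipped = min(max(x, 0.0), alpha)           # clip to [0, alpha]
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return round(clipped * levels / alpha) * alpha / levels


if __name__ == "__main__":
    # With alpha = 6.0 and 2 bits, the representable values are 0, 2, 4, 6.
    for value in (-1.0, 2.9, 3.1, 7.0):
        print(value, "->", pact_quantize(value, alpha=6.0, bits=2))
```

Lower bit widths give coarser grids between 0 and α, which is the precision-versus-resource trade-off the abstract refers to.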

[CV-97] Multi-View Pose-Agnostic Change Localization with Zero Labels

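[Quick Read]: This paper addresses how autonomous agents can detect and localize changes in complex environments, particularly when observation viewpoints are unconstrained and inconsistent. The key to the solution is a novel label-free, pose-agnostic change detection method that integrates information from multiple viewpoints to build a change-aware 3D Gaussian Splatting (3DGS) representation of the scene. With as few as 5 images of the post-change scene, the method learns additional change channels in a 3DGS and produces change masks that outperform single-view techniques. The change-aware 3D scene representation also yields accurate change masks for unseen viewpoints. Experiments show state-of-the-art performance in complex multi-object scenes, with 1.7× and 1.6× improvements in Mean Intersection Over Union and F1 score, respectively, over other baselines.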

Link: https://arxiv.org/abs/2412.03911
Authors: Chamuditha Jayanga Galappaththige,Jason Lai,Lloyd Windrim,Donald Dansereau,Niko Suenderhauf,Dimity Miller
Keywords: Autonomous agents, require accurate methods, agents often require, detecting and localizing, observations are captured
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Autonomous agents often require accurate methods for detecting and localizing changes in their environment, particularly when observations are captured from unconstrained and inconsistent viewpoints. We propose a novel label-free, pose-agnostic change detection method that integrates information from multiple viewpoints to construct a change-aware 3D Gaussian Splatting (3DGS) representation of the scene. With as few as 5 images of the post-change scene, our approach can learn additional change channels in a 3DGS and produce change masks that outperform single-view techniques. Our change-aware 3D scene representation additionally enables the generation of accurate change masks for unseen viewpoints. Experimental results demonstrate state-of-the-art performance in complex multi-object scenes, achieving a 1.7× and 1.6× improvement in Mean Intersection Over Union and F1 score respectively over other baselines. We also contribute a new real-world dataset to benchmark change detection in diverse challenging scenes in the presence of lighting variations.
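
The two metrics quoted above can be sketched for a single binary change mask as follows. This is a generic definition, not the paper's evaluation code: masks are assumed to be flat lists of 0/1 values, the function name is hypothetical, and how scores are averaged across images or scenes to obtain the mean IoU is left out.

```python
def iou_and_f1(predicted, truth):
    """Intersection over Union and F1 score for two flat binary masks."""
    if len(predicted) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)          # changed in both
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)      # false alarm
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)      # missed change
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement (a convention)
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 1]
    truth = [1, 0, 0, 1, 1]
    print(iou_and_f1(predicted, truth))  # tp=2, fp=1, fn=1
```

Because F1 counts true positives twice, it is never lower than IoU for the same mask pair, which is worth keeping in mind when comparing the two reported improvement factors.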

[CV-98] DGNS: Deformable Gaussian Splatting and Dynamic Neural Surface for Monocular Dynamic 3D Reconstruction

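[Quick Read]: This paper addresses dynamic scene reconstruction from monocular video, tackling the dual challenges of dynamic novel-view synthesis and 3D geometry reconstruction. The key to the solution is a hybrid framework, Deformable Gaussian Splatting and Dynamic Neural Surfaces (DGNS), in which the two modules support each other on both tasks. During training, depth maps generated by the deformable Gaussian splatting module guide ray sampling for faster processing and provide depth supervision to the dynamic neural surface module, improving geometry reconstruction; in turn, the dynamic neural surface module directs the distribution of Gaussian primitives around the surface, improving rendering quality. The paper also introduces a depth-filtering process on depth maps derived from Gaussian rasterization to further refine depth supervision. Experiments on public datasets show that DGNS achieves state-of-the-art performance in both novel-view synthesis and 3D reconstruction.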

Link: https://arxiv.org/abs/2412.03910
Authors: Xuesong Li,Jinguang Tong,Jie Hong,Vivien Rolland,Lars Petersson
Keywords: Deformable Gaussian Splatting, dynamic neural surface, Dynamic Neural, Dynamic scene reconstruction, real-world applications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dynamic scene reconstruction from monocular video is critical for real-world applications. This paper tackles the dual challenges of dynamic novel-view synthesis and 3D geometry reconstruction by introducing a hybrid framework: Deformable Gaussian Splatting and Dynamic Neural Surfaces (DGNS), in which both modules can leverage each other for both tasks. During training, depth maps generated by the deformable Gaussian splatting module guide the ray sampling for faster processing and provide depth supervision within the dynamic neural surface module to improve geometry reconstruction. Simultaneously, the dynamic neural surface directs the distribution of Gaussian primitives around the surface, enhancing rendering quality. To further refine depth supervision, we introduce a depth-filtering process on depth maps derived from Gaussian rasterization. Extensive experiments on public datasets demonstrate that DGNS achieves state-of-the-art performance in both novel-view synthesis and 3D reconstruction.

[CV-99] Can Targeted Clean-Label Poisoning Attacks Generalize?

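[Quick Read]: This paper addresses whether targeted poisoning attacks generalize to unknown variations of the target samples. The key to the solution is an attack that leverages both the direction and magnitude of model gradients; compared with the widely adopted cosine similarity-based attack, it generalizes better across variations of the target, such as an object seen from varied viewpoints or an animal species with distinct appearances. In extensive experiments across generalization scenarios, the method outperforms the cosine similarity-based attack by 20.95% in attack success rate with similar overall accuracy, averaged over four models on two image benchmark datasets.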

Link: https://arxiv.org/abs/2412.03908
Authors: Zhizhen Chen,Subrat Kishore Dutta,Zhengyu Zhao,Chenhao Lin,Chao Shen,Xiao Zhang
Keywords: Targeted poisoning attacks, Targeted poisoning, poisoning attacks aim, aim to compromise, specific target samples
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures, 5 tables

Abstract:Targeted poisoning attacks aim to compromise the model’s prediction on specific target samples. In a common clean-label setting, they are achieved by slightly perturbing a subset of training samples given access to those specific targets. Despite continuous efforts, it remains unexplored whether such attacks can generalize to unknown variations of those targets. In this paper, we take the first step to systematically study this generalization problem. Observing that the widely adopted, cosine similarity-based attack exhibits limited generalizability, we propose a well-generalizable attack that leverages both the direction and magnitude of model gradients. In particular, we explore diverse target variations, such as an object with varied viewpoints and an animal species with distinct appearances. Extensive experiments across various generalization scenarios demonstrate that our method consistently achieves the best attack effectiveness. For example, our method outperforms the cosine similarity-based attack by 20.95% in attack success rate with similar overall accuracy, averaged over four models on two image benchmark datasets. The code is available at this https URL
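
The contrast between a direction-only objective and one that also sees magnitude can be made concrete with two toy gradient vectors. The sketch below is not the attack loss from the paper, whose exact form is not given in the abstract; it only shows, with hypothetical helper names, that cosine similarity is unchanged when one gradient is rescaled while a distance-based measure is not.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two gradient vectors (direction only)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_distance(a, b):
    """Squared Euclidean distance (sensitive to both direction and magnitude)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


if __name__ == "__main__":
    target_grad = [1.0, 2.0, 2.0]
    poison_grad = [2.0, 4.0, 4.0]   # same direction, twice the magnitude
    print(cosine_similarity(target_grad, poison_grad))  # perfectly aligned
    print(squared_distance(target_grad, poison_grad))   # still nonzero
```

A matching criterion based only on the first quantity would treat these two gradients as identical, which gives some intuition for why adding magnitude information could change attack behavior.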

[CV-100] ONER: Online Experience Replay for Incremental Anomaly Detection

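[Quick Read]: This paper addresses incremental anomaly detection in dynamic industrial scenarios, where models suffer catastrophic forgetting caused by knowledge overwriting and feature conflicts as new categories keep arriving. The key to the solution is ONER, an end-to-end ONline Experience Replay method that uses decomposed prompts and semantic prototypes from past tasks to mitigate catastrophic forgetting while adapting to new tasks at minimal cost. The decomposed prompts consist of learnable components that assemble into attention-conditioned prompts, which reuse previously learned knowledge so the model can learn new tasks effectively. The semantic prototypes operate at both the pixel and image levels, regularizing the latent feature space to prevent forgetting across tasks. Experiments show state-of-the-art performance in incremental anomaly detection, with significantly reduced forgetting and efficient adaptation to new categories.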

Link: https://arxiv.org/abs/2412.03907
Authors: Yizhou Jin,Jiahui Zhu,Guodong Wang,Shiwei Li,Jinjin Zhang,Qingjie Liu,Xinyue Liu,Yunhong Wang
Keywords: dynamic industrial scenarios, sequentially recognizes abnormal, recognizes abnormal regions, detection sequentially recognizes, industrial scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Incremental anomaly detection sequentially recognizes abnormal regions in novel categories for dynamic industrial scenarios. This remains highly challenging due to knowledge overwriting and feature conflicts, leading to catastrophic forgetting. In this work, we propose ONER, an end-to-end ONline Experience Replay method, which efficiently mitigates catastrophic forgetting while adapting to new tasks with minimal cost. Specifically, our framework utilizes two types of experiences from past tasks: decomposed prompts and semantic prototypes, addressing both model parameter updates and feature optimization. The decomposed prompts consist of learnable components that assemble to produce attention-conditioned prompts. These prompts reuse previously learned knowledge, enabling the model to learn novel tasks effectively. The semantic prototypes operate at both pixel and image levels, performing regularization in the latent feature space to prevent forgetting across various tasks. Extensive experiments demonstrate that our method achieves state-of-the-art performance in incremental anomaly detection with significantly reduced forgetting, while efficiently adapting to new categories at minimal cost. These results confirm the efficiency and stability of ONER, making it a powerful solution for real-world applications.

[CV-101] 4D SlingBAG: spatial-temporal coupled Gaussian ball for large-scale dynamic 3D photoacoustic iterative reconstruction

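[Quick Read]: This paper addresses two problems in large-scale dynamic 3D photoacoustic imaging (PAI): the angular deficiencies of the sparse 2D sensor arrays typically used, and the very high memory consumption and long computation time of existing iterative reconstruction (IR) algorithms for multi-frame 3D reconstruction. The key to the solution is the 4D sliding Gaussian ball adaptive growth (4D SlingBAG) algorithm. Building on the existing point cloud-based SlingBAG algorithm, it applies spatial-temporal coupled deformation functions to each Gaussian sphere in the point cloud, explicitly learning the deformation features of the dynamic 3D PA scene. This efficiently represents changes in vessel morphology and blood flow caused by physiological processes (such as pulsation) or external pressures (e.g., blood perfusion experiments). Compared with running SlingBAG on each frame individually, the method significantly reduces computation time while keeping memory consumption extremely low.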

Link: https://arxiv.org/abs/2412.03898
Authors: Shuang Li,Yibing Wang,Jian Gao,Chulhong Kim,Seongwook Choi,Yu Zhang,Qian Chen,Yao Yao,Changhui Li
Keywords: Large-scale dynamic three-dimensional, photoacoustic imaging, PAI, clinical applications, important in clinical
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Large-scale dynamic three-dimensional (3D) photoacoustic imaging (PAI) is significantly important in clinical applications. In practical implementations, large-scale 3D real-time PAI systems typically utilize sparse two-dimensional (2D) sensor arrays with certain angular deficiencies, necessitating advanced iterative reconstruction (IR) algorithms to achieve quantitative PAI and reduce reconstruction artifacts. However, for existing IR algorithms, multi-frame 3D reconstruction leads to extremely high memory consumption and prolonged computation time, with limited consideration of the spatial-temporal continuity between data frames. Here, we propose a novel method, named the 4D sliding Gaussian ball adaptive growth (4D SlingBAG) algorithm, based on the current point cloud-based IR algorithm sliding Gaussian ball adaptive growth (SlingBAG), which has minimal memory consumption among IR methods. Our 4D SlingBAG method applies spatial-temporal coupled deformation functions to each Gaussian sphere in the point cloud, thus explicitly learning the deformation features of the dynamic 3D PA scene. This allows for the efficient representation of various physiological processes (such as pulsation) or external pressures (e.g., blood perfusion experiments) contributing to changes in vessel morphology and blood flow during dynamic 3D PAI, enabling highly efficient IR for dynamic 3D PAI. Simulation experiments demonstrate that 4D SlingBAG achieves high-quality dynamic 3D PA reconstruction. Compared to performing reconstructions using the SlingBAG algorithm individually for each frame, our method significantly reduces computation time and keeps memory consumption extremely low. The project for 4D SlingBAG can be found in the following GitHub repository: this https URL.

[CV-102] Multisource Collaborative Domain Generalization for Cross-Scene Remote Sensing Image Classification

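[Quick Read]: This paper addresses domain generalization for cross-scene image classification in remote sensing, where existing single-source methods are easily confused by large real-world domain shifts because of limited training information and insufficient diversity modeling capacity. The key to the solution is a multi-source collaborative domain generalization framework (MS-CDG) that combines data-aware adversarial augmentation with model-aware multi-level diversification. The data-aware adversarial augmentation uses an adversarial neural network with semantic guidance to generate multi-source (MS) samples by adaptively learning realistic channel and distribution changes across domains. The model-aware diversification transforms the shared spatial-channel features of MS data into a class-wise prototype and kernel mixture module to handle domain discrepancies and cluster different classes effectively. Finally, original and augmented MS samples are classified jointly under a distribution consistency alignment, which increases model diversity and supports better domain-invariant representation learning.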

链接: https://arxiv.org/abs/2412.03897
作者: Zhu Han,Ce Zhang,Lianru Gao,Zhiqiang Zeng,Michael K. Ng,Bing Zhang,Jocelyn Chanussot
关键词-EN: transfer prior knowledge, reduce hand-crafted cost, image classification aims, Cross-scene image classification, aims to transfer
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cross-scene image classification aims to transfer prior knowledge of ground materials to annotate regions with different distributions and reduce hand-crafted cost in the field of remote sensing. However, existing approaches focus on single-source domain generalization to unseen target domains, and are easily confused by large real-world domain shifts due to the limited training information and insufficient diversity modeling capacity. To address this gap, we propose a novel multi-source collaborative domain generalization framework (MS-CDG) based on homogeneity and heterogeneity characteristics of multi-source remote sensing data, which considers data-aware adversarial augmentation and model-aware multi-level diversification simultaneously to enhance cross-scene generalization performance. The data-aware adversarial augmentation adopts an adversary neural network with semantic guide to generate MS samples by adaptively learning realistic channel and distribution changes across domains. In views of cross-domain and intra-domain modeling, the model-aware diversification transforms the shared spatial-channel features of MS data into the class-wise prototype and kernel mixture module, to address domain discrepancies and cluster different classes effectively. Finally, the joint classification of original and augmented MS samples is employed by introducing a distribution consistency alignment to increase model diversity and ensure better domain-invariant representation learning. Extensive experiments on three public MS remote sensing datasets demonstrate the superior performance of the proposed method when benchmarked with the state-of-the-art methods.

[CV-103] A Noise is Worth Diffusion Guidance

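[Quick read]: This paper addresses the reliance of current diffusion models on guidance methods, such as classifier-free guidance (CFG), for generating high-quality images. The key to the solution is mapping Gaussian noise to "guidance-free noise", which reveals that small low-magnitude, low-frequency components significantly enhance the denoising process and remove the need for guidance. The paper proposes a new method, \ours, which achieves high-quality image generation without guidance, within the same diffusion pipeline, through a single refinement of the initial noise. The method relies on efficient noise-space learning and converges quickly with strong performance using only 50K text-image pairs.

Link: https://arxiv.org/abs/2412.03895
Authors: Donghoon Ahn,Jiwon Kang,Sanghyun Lee,Jaewon Min,Minjae Kim,Wooseok Jang,Hyoungwon Cho,Sayak Paul,SeonHwa Kim,Eunju Cha,Kyong Hwan Jin,Seungryong Kim
Keywords: Diffusion models excel, guidance, excel in generating, generating high-quality images, guidance methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL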

Abstract:Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to "guidance-free noise", we uncover that small low-magnitude low-frequency components significantly enhance the denoising process, removing the need for guidance and thus improving both inference throughput and memory. Expanding on this, we propose \ours, a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise can eliminate the need for guidance. See our project page: this https URL.
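
For readers unfamiliar with the guidance step this abstract wants to remove, the following is a minimal sketch of the standard classifier-free guidance combination rule, not the paper's method. The function name `classifier_free_guidance` and the toy values are assumptions made for illustration; a real sampler would apply the same rule to tensors produced by two denoiser forward passes per step.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine two noise predictions with the standard CFG rule.

    eps_uncond / eps_cond are flat lists of floats standing in for the
    denoiser output without and with the text condition (hypothetical data).
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions for a 3-element latent (illustrative values only).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

# scale > 1 pushes the prediction further along the conditional direction.
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)

# scale == 1 reduces to the plain conditional prediction, i.e. no guidance.
unguided = classifier_free_guidance(eps_uncond, eps_cond, scale=1.0)

print(guided)
print(unguided)
```

Because the rule needs both the conditional and the unconditional prediction at every step, dropping guidance halves the number of denoiser evaluations, which is consistent with the throughput and memory benefit the abstract claims.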

[CV-104] ShapeCraft: Body-Aware and Semantics-Aware 3D Object Design

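[Quick read]: This paper addresses how to make the design of everyday objects aware of both the human body and the semantics of the design specification, two objectives that remain challenging for current AI-based design tools. The key to the solution is a mesh deformation procedure that optimizes for semantic alignment together with contact and penetration losses. With this approach, users can generate virtual or real-world 3D objects from text, images, or sketches without manual artist intervention. Qualitative and quantitative results on a variety of object categories demonstrate the effectiveness of the approach.

Link: https://arxiv.org/abs/2412.03889
Authors: Michelle Guo,Mia Tang,Hannah Cha,Ruohan Zhang,C. Karen Liu,Jiajun Wu
Keywords: design specification, design process, wide range, range of everyday, design
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project webpage: this https URL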

Abstract:For designing a wide range of everyday objects, the design process should be aware of both the human body and the underlying semantics of the design specification. However, these two objectives present significant challenges to the current AI-based designing tools. In this work, we present a method to synthesize body-aware 3D objects from a base mesh given an input body geometry and either text or image as guidance. The generated objects can be simulated on virtual characters, or fabricated for real-world use. We propose to use a mesh deformation procedure that optimizes for both semantic alignment as well as contact and penetration losses. Using our method, users can generate both virtual or real-world objects from text, image, or sketch, without the need for manual artist intervention. We present both qualitative and quantitative results on various object categories, demonstrating the effectiveness of our approach.

[CV-105] MOANA: Multi-Radar Dataset for Maritime Odometry and Autonomous Navigation Application

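[Quick read]: This paper addresses the challenges that complex conditions pose for maritime environmental sensing, such as harsh weather, platform perturbations, large dynamic objects, and the need for long detection ranges. The key to the solution is combining multiple radar sensors to overcome the limitations of any single sensor. Specifically, the paper presents a comprehensive maritime sensor dataset that integrates short-range LiDAR data, medium-range W-band radar data, and long-range X-band radar data. This multi-range detection capability complements robust long-range detection with accurate close-range object detection, which matters especially in scenarios such as berthing operations. The dataset also includes labels for oceanic objects derived from radar and stereo camera images, making it a valuable resource for research on place recognition, odometry estimation, SLAM, object detection, and dynamic object elimination in maritime environments.

Link: https://arxiv.org/abs/2412.03887
Authors: Hyesu Jang,Wooseong Yang,Hanguen Kim,Dongje Lee,Yongjin Kim,Jinbum Park,Minsoo Jeon,Jaeseong Koh,Yejin Kang,Minwoo Jung,Sangwoo Jung,Ayoung Kim
Keywords: environmental sensing requires, sensing requires overcoming, requires overcoming challenges, Maritime environmental sensing, platform perturbations
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures, 3 tables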

Abstract:Maritime environmental sensing requires overcoming challenges from complex conditions such as harsh weather, platform perturbations, large dynamic objects, and the requirement for long detection ranges. While cameras and LiDAR are commonly used in ground vehicle navigation, their applicability in maritime settings is limited by range constraints and hardware maintenance issues. Radar sensors, however, offer robust long-range detection capabilities and resilience to physical contamination from weather and saline conditions, making them powerful sensors for maritime navigation. Among various radar types, X-band radar (e.g., marine radar) is widely employed for maritime vessel navigation, providing effective long-range detection essential for situational awareness and collision avoidance. Nevertheless, it exhibits limitations during berthing operations where close-range object detection is critical. To address this shortcoming, we incorporate W-band radar (e.g., Navtech imaging radar), which excels in detecting nearby objects with a higher update rate. We present a comprehensive maritime sensor dataset featuring multi-range detection capabilities. This dataset integrates short-range LiDAR data, medium-range W-band radar data, and long-range X-band radar data into a unified framework. Additionally, it includes object labels for oceanic object detection usage, derived from radar and stereo camera images. The dataset comprises seven sequences collected from diverse regions with varying levels of estimation difficulty, ranging from easy to challenging, and includes common locations suitable for global localization tasks. This dataset serves as a valuable resource for advancing research in place recognition, odometry estimation, SLAM, object detection, and dynamic object elimination within maritime environments. The dataset can be found at the following link: this https URL

[CV-106] DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism ECCV

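[Quick read]: This paper aims to improve the accessibility of media content for the Deaf and Hard of Hearing (DHH) community. The key to the solution is combining parametric modeling with generative modeling to produce realistic and expressive sign language videos. Specifically, the paper first retargets human sign language poses to 3D sign language avatars by optimizing a parametric model, then uses the high-fidelity poses from the rendered avatars to condition a diffusion-based generative model that produces the poses of synthetic signers. The appearance of the synthetic signer is controlled by an image prompt supplied through a visual adapter. The method improves the temporal consistency and realism of sign language videos, supports multimodal prompts that let users further customize the signer's appearance to accommodate diversity (e.g., skin tone, gender), and is also useful for signer anonymization.

Link: https://arxiv.org/abs/2412.03878
Authors: Sudha Krishnamurthy,Vimal Bhat,Abhinav Jain
Keywords: media content, make content accessible, Hard of Hearing, recent years, world to view
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Proceedings of ECCV, Workshop on Assistive Computer Vision and Robotics, 2024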

Abstract:The proliferation of several streaming services in recent years has now made it possible for a diverse audience across the world to view the same media content, such as movies or TV shows. While translation and dubbing services are being added to make content accessible to the local audience, the support for making content accessible to people with different abilities, such as the Deaf and Hard of Hearing (DHH) community, is still lagging. Our goal is to make media content more accessible to the DHH community by generating sign language videos with synthetic signers that are realistic and expressive. Using the same signer for a given media content that is viewed globally may have limited appeal. Hence, our approach combines parametric modeling and generative modeling to generate realistic-looking synthetic signers and customize their appearance based on user preferences. We first retarget human sign language poses to 3D sign language avatars by optimizing a parametric model. The high-fidelity poses from the rendered avatars are then used to condition the poses of synthetic signers generated using a diffusion-based generative model. The appearance of the synthetic signer is controlled by an image prompt supplied through a visual adapter. Our results show that the sign language videos generated using our approach have better temporal consistency and realism than signing videos generated by a diffusion model conditioned only on text prompts. We also support multimodal prompts to allow users to further customize the appearance of the signer to accommodate diversity (e.g. skin tone, gender). Our approach is also useful for signer anonymization.

[CV-107] Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

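[Quick read]: This paper addresses the tendency of Text-to-Image (T2I) diffusion models to generate unsafe images. The key to the solution is a new training-free approach called Prompt-Noise Optimization (PNO), which optimizes both the continuous prompt embedding and the injected noise trajectory during sampling in order to generate safe images. PNO achieves state-of-the-art performance in suppressing toxic image generation and is robust to adversarial attacks, without tuning the model parameters. Compared with existing methods, PNO uses comparable generation time while offering the best tradeoff between safe generation and prompt-image alignment.

Link: https://arxiv.org/abs/2412.03876
Authors: Jiangweizhi Peng,Zhiwei Tang,Gaowen Liu,Charles Fleming,Mingyi Hong
Keywords: diverse images based, widely recognized, high-quality and diverse, based on text, diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: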

Abstract:Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts. However, despite recent advances, these models are still prone to generating unsafe images containing sensitive or inappropriate content, which can be harmful to users. Current efforts to prevent inappropriate image generation for diffusion models are easy to bypass and vulnerable to adversarial attacks. How to ensure that T2I models align with specific safety goals remains a significant challenge. In this work, we propose a novel, training-free approach, called Prompt-Noise Optimization (PNO), to mitigate unsafe image generation. Our method introduces a novel optimization framework that leverages both the continuous prompt embedding and the injected noise trajectory in the sampling process to generate safe images. Extensive numerical results demonstrate that our framework achieves state-of-the-art performance in suppressing toxic image generations and demonstrates robustness to adversarial attacks, without needing to tune the model parameters. Furthermore, compared with existing methods, PNO uses comparable generation time while offering the best tradeoff between the conflicting goals of safe generation and prompt-image alignment.

[CV-108] CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

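[Quick read]: This paper addresses the performance of lightweight vision-language models in resource-constrained scenarios, in particular their suboptimal results when trained with a single image-text contrastive objective. The key to the solution is CLIP-PING (Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance), a simple and efficient training paradigm designed to boost lightweight models with minimal computational overhead and lower data demands. CLIP-PING uses unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance from neighboring samples, namely nearest neighbors (NN) and cross nearest neighbors (XNN). This extra contrastive supervision substantially improves cross-modal feature alignment and lets lightweight models learn more generic features with rich semantic diversity. Experiments show that CLIP-PING notably outperforms its peers in zero-shot generalization and cross-modal retrieval.

Link: https://arxiv.org/abs/2412.03871
Authors: Chu Myaet Thwal,Ye Lin Tun,Minh N. H. Nguyen,Eui-Nam Huh,Choong Seon Hong
Keywords: recent trends mark, Contrastive Language-Image Pre-training, Language-Image Pre-training, recent trends, resource-constrained scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures, 20 tables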

Abstract:Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring the applicability of lightweight vision-language models for resource-constrained scenarios. These models often deliver suboptimal performance when relying solely on a single image-text contrastive learning objective, spotlighting the need for more effective training mechanisms that guarantee robust cross-modal feature alignment. In this work, we propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance, a simple and efficient training paradigm designed to boost the performance of lightweight vision-language models with minimal computational overhead and lower data demands. CLIP-PING bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance of proximus neighbor samples, i.e., nearest-neighbor (NN) and cross nearest-neighbor (XNN). We find that extra contrastive supervision from these neighbors substantially boosts cross-modal alignment, enabling lightweight models to learn more generic features with rich semantic diversity. Extensive experiments reveal that CLIP-PING notably surpasses its peers in zero-shot generalization and cross-modal retrieval tasks. Specifically, a 5.5% gain on zero-shot ImageNet1K with 10.7% (I2T) and 5.7% (T2I) on Flickr30K, compared to the original CLIP when using ViT-XS image encoder trained on 3 million (image, text) pairs. Moreover, CLIP-PING showcases strong transferability under the linear evaluation protocol across several downstream tasks.
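
The two ingredients named in this abstract, neighbor lookup on frozen features and a contrastive objective, can be sketched in plain Python as below. This is only an illustrative sketch under stated assumptions, not the CLIP-PING implementation: the helper names (`cosine`, `nearest_neighbor`, `info_nce`), the temperature value, and the toy 2-D features are all hypothetical, and how the paper actually wires NN and XNN neighbors into its loss is not specified here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbor(query, bank, exclude=None):
    """Index of the bank feature most similar to query, skipping `exclude`."""
    best_index, best_sim = None, -2.0
    for i, feature in enumerate(bank):
        if i == exclude:
            continue
        sim = cosine(query, feature)
        if sim > best_sim:
            best_index, best_sim = i, sim
    return best_index


def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy in which anchors[i] should match targets[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, t) / temperature for t in targets]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical features from a frozen pre-trained encoder.
frozen_bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbor_of_first = nearest_neighbor(frozen_bank[0], frozen_bank, exclude=0)

# Hypothetical embeddings from the lightweight model being trained.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce(image_emb, text_emb)
misaligned_loss = info_nce(image_emb, list(reversed(text_emb)))

print(neighbor_of_first, aligned_loss, misaligned_loss)
```

In a neighbor-guided setup, the sample found by `nearest_neighbor` would presumably serve as an additional positive for the anchor, so the same `info_nce` term could be reused with neighbor targets alongside the usual image-text term.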

[CV-109] CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

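[Quick read]: This paper addresses how to use Multimodal Diffusion Transformers (MM-DiTs) effectively for Layout-to-Image (L2I) generation. The key to the solution is a network architecture named SiamLayout, together with supporting data and tooling: 1) a separate set of network weights processes the layout, treating it as equally important as the image and text modalities; 2) the image-layout interaction is decoupled into a siamese branch that runs alongside the image-text branch and is fused at a later stage; 3) a large-scale layout dataset, LayoutSAM, and an evaluation benchmark, LayoutSAM-Eval, support model training and evaluation; 4) large language models are used for layout planning, improving layout generation and optimization. Together, these measures ensure that layout information is effectively integrated and balanced during multimodal generation.

Link: https://arxiv.org/abs/2412.03859
Authors: Hui Zhang,Dexiang Hong,Tingwei Gao,Yitong Wang,Jie Shao,Xinglong Wu,Zuxuan Wu,Yu-Gang Jiang
Keywords: Multimodal Diffusion Transformers, high artistic quality, explored Multimodal Diffusion, ability to generate, visually appealing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 13 figures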

Abstract:Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them in the later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating the L2I generation quality. Finally, we introduce the Layout Designer, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. Our code, model, and dataset will be available at this https URL.

[CV-110] HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting

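[Quick read]: This paper addresses the generation of high-quality novel view renderings in scenes that contain transient objects. The key to the solution is a new hybrid representation, HybridGS, which uses 2D Gaussians to represent the transient objects in each image while keeping traditional 3D Gaussians for the whole static scene. This representation decomposes the scene according to the basic principle of viewpoint consistency, which makes the model more reasonable. The paper also proposes a multi-view regulated supervision method that uses information from co-visible regions to sharpen the distinction between transients and statics. Finally, a simple yet effective multi-stage training strategy ensures robust training and high-quality view synthesis across various settings.

Link: https://arxiv.org/abs/2412.03844
Authors: Jingyu Lin,Jiaqi Gu,Lubin Fan,Bojian Wu,Yujing Lou,Renjie Chen,Ligang Liu,Jieping Ye
Keywords: Gaussian Splatting, featuring transient objects, scenes featuring transient, transient objects, Generating high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL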

Abstract:Generating high-quality novel view renderings of 3D Gaussian Splatting (3DGS) in scenes featuring transient objects is challenging. We propose a novel hybrid representation, termed as HybridGS, using 2D Gaussians for transient objects per image and maintaining traditional 3D Gaussians for the whole static scenes. Note that, the 3DGS itself is better suited for modeling static scenes that assume multi-view consistency, but the transient objects appear occasionally and do not adhere to the assumption, thus we model them as planar objects from a single view, represented with 2D Gaussians. Our novel representation decomposes the scene from the perspective of fundamental viewpoint consistency, making it more reasonable. Additionally, we present a novel multi-view regulated supervision method for 3DGS that leverages information from co-visible regions, further enhancing the distinctions between the transients and statics. Then, we propose a straightforward yet effective multi-stage training strategy to ensure robust training and high-quality view synthesis across various settings. Experiments on benchmark datasets show our state-of-the-art performance of novel view synthesis in both indoor and outdoor scenes, even in the presence of distracting elements.

[CV-111] LL-ICM: Image Compression for Low-level Machine Vision via Large Vision-Language Model

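[Quick read]: This paper addresses image compression for machine vision, specifically the compression needs of low-level machine vision tasks such as image restoration. The key to the solution is a pioneering image compression framework named LL-ICM, which jointly optimizes compression and low-level vision tasks. This joint optimization strengthens the codec's ability to generalize across diverse low-level tasks and improves the processing ability of downstream low-level task models, achieving mutual adaptation between image codecs and those models. The paper also integrates large-scale vision-language models into the LL-ICM framework to generate more universal and distortion-robust feature embeddings, so that a single LL-ICM codec can generalize to multiple tasks. Experiments show that LL-ICM achieves a 22.65% BD-rate reduction over state-of-the-art methods.

Link: https://arxiv.org/abs/2412.03841
Authors: Yuan Xue,Qi Zhang,Chuanmin Jia,Shiqi Wang
Keywords: machine vision tasks, aims to compress, human viewing, machine vision, tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: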

Abstract:Image Compression for Machines (ICM) aims to compress images for machine vision tasks rather than human viewing. Current works predominantly concentrate on high-level tasks like object detection and semantic segmentation. However, the quality of original images is usually not guaranteed in the real world, leading to even worse perceptual quality or downstream task performance after compression. Low-level (LL) machine vision models, like image restoration models, can help improve such quality, and thereby their compression requirements should also be considered. In this paper, we propose a pioneering ICM framework for LL machine vision tasks, namely LL-ICM. By jointly optimizing compression and LL tasks, the proposed LL-ICM not only enriches its encoding ability in generalizing to versatile LL tasks but also optimizes the processing ability of downstream LL task models, achieving mutual adaptation for image codecs and LL task models. Furthermore, we integrate large-scale vision-language models into the LL-ICM framework to generate more universal and distortion-robust feature embeddings for LL vision tasks. Therefore, one LL-ICM codec can generalize to multiple tasks. We establish a solid benchmark to evaluate LL-ICM, which includes extensive objective experiments by using both full and no-reference image quality assessments. Experimental results show that LL-ICM can achieve 22.65% BD-rate reductions over the state-of-the-art methods.

[CV-112] Movie Gen: SWOT Analysis of Meta's Generative AI Foundation Model for Transforming Media Generation, Advertising, and Entertainment Industries

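[Quick read]: This paper examines the use of generative AI in media production, analyzing Meta's Movie Gen model to explore its potential and challenges in video generation, personalization, and scalability. The key to its approach is a comprehensive assessment of Movie Gen's strengths (such as high-resolution video generation, precise editing, and seamless audio integration) and limitations (such as constraints on video length and content bias), together with a comparison against leading models such as DALL-E and Google Imagen that highlights its distinctive video personalization and multimodal synthesis features. The paper also discusses regulatory and ethical issues around generative AI, including content authenticity, cultural representation, and responsible use, with the aim of guiding future development and ensuring scalability, quality, and ethical integrity in media production.

Link: https://arxiv.org/abs/2412.03837
Authors: Abul Ehtesham,Saket Kumar,Aditi Singh,Tala Talaei Khoei
Keywords: enabling unprecedented capabilities, Meta's Movie Gen, enabling unprecedented, unprecedented capabilities, comprehensive SWOT analysis
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: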

Abstract:Generative AI is reshaping the media landscape, enabling unprecedented capabilities in video creation, personalization, and scalability. This paper presents a comprehensive SWOT analysis of Meta's Movie Gen, a cutting-edge generative AI foundation model designed to produce 1080p HD videos with synchronized audio from simple text prompts. We explore its strengths, including high-resolution video generation, precise editing, and seamless audio integration, which make it a transformative tool across industries such as filmmaking, advertising, and education. However, the analysis also addresses limitations, such as constraints on video length and potential biases in generated content, which pose challenges for broader adoption. In addition, we examine the evolving regulatory and ethical considerations surrounding generative AI, focusing on issues like content authenticity, cultural representation, and responsible use. Through comparative insights with leading models like DALL-E and Google Imagen, this paper highlights Movie Gen's unique features, such as video personalization and multimodal synthesis, while identifying opportunities for innovation and areas requiring further research. Our findings provide actionable insights for stakeholders, emphasizing both the opportunities and challenges of deploying generative AI in media production. This work aims to guide future advancements in generative AI, ensuring scalability, quality, and ethical integrity in this rapidly evolving field.

[CV-113] CLIP-FSAC: Few-Shot Anomaly Classification with Anomaly Descriptor Based on CLIP

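[Quick read]: This paper addresses data scarcity in Industrial Anomaly Classification (AC), particularly anomaly classification in few-shot settings. The key to the solution is a one-stage training framework named CLIP-FSAC++, which introduces a cross-modality interaction module, the Anomaly Descriptor, to strengthen the correlation between image and text embeddings so that the pre-trained CLIP model adapts better to the target data. Specifically, the Anomaly Descriptor uses an image-to-text cross-attention module to produce image-specific text embeddings and a text-to-image cross-attention module to produce text-specific visual embeddings, which then enhance the matching ability of the original CLIP representations.

Link: https://arxiv.org/abs/2412.03829
Authors: Zuo Zuo,Jiahao Dong,Yao Wu,Yanyun Qu,Zongze Wu
Keywords: industrial manufacturing, Industrial anomaly classification, indispensable task, guarantees quality, quality and safety
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review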

Abstract:Industrial anomaly classification (AC) is an indispensable task in industrial manufacturing, which guarantees the quality and safety of various products. To address the scarcity of data in industrial scenarios, many few-shot anomaly detection methods have emerged recently. In this paper, we propose an effective few-shot anomaly classification (FSAC) framework with one-stage training, dubbed CLIP-FSAC++. Specifically, we introduce a cross-modality interaction module named Anomaly Descriptor following the image and text encoders, which enhances the correlation of visual and text embeddings and adapts the representations of CLIP from pre-trained data to target data. In the Anomaly Descriptor, an image-to-text cross-attention module is used to obtain image-specific text embeddings and a text-to-image cross-attention module is used to obtain text-specific visual embeddings. Then these modality-specific embeddings are used to enhance the original representations of CLIP for better matching ability. Comprehensive experiment results are provided for evaluating our method in few-normal shot anomaly classification on VisA and MVTEC-AD for 1, 2, 4 and 8-shot settings. The source codes are at this https URL
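
The two cross-attention directions described in this abstract can be illustrated with a deliberately simplified sketch. It assumes single-head attention with no learned projections, which is not what the paper's Anomaly Descriptor necessarily uses; the function names, the 2-D toy embeddings, and the prompt and patch interpretation are all hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head attention without learned projections.

    Each query row becomes a softmax-weighted average of the rows in
    keys_values, with scaled dot-product scores as the weights.
    """
    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys_values
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * row[d] for w, row in zip(weights, keys_values))
            for d in range(len(keys_values[0]))
        ])
    return outputs


# Hypothetical embeddings: 2 text prompts (e.g. normal, anomalous), 3 image patches.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
patch_emb = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Text queries attend over image patches: image-specific text embeddings.
image_specific_text = cross_attention(text_emb, patch_emb)

# Patch queries attend over text prompts: text-specific visual embeddings.
text_specific_visual = cross_attention(patch_emb, text_emb)

print(image_specific_text)
print(text_specific_visual)
```

The only difference between the two directions is which modality supplies the queries, so one function covers both; a trained module would add separate query, key, and value projections for each direction.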

[CV-114] Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration

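[Quick read]: This paper addresses the mismatch in image complexity between training and testing datasets in Image Restoration (IR), which degrades restoration quality. The key to the solution is ReSyn, a large-scale IR dataset with balanced complexity, together with a unified training standard. In addition, the proposed RWKV-IR model integrates linear-complexity RWKV into a Transformer and combines Depth-wise Convolution with Bi-directional attention to obtain both global and local receptive fields, which effectively improves restoration results.

Link: https://arxiv.org/abs/2412.03814
Authors: Yuzhen Du,Teng Hu,Jiangning Zhang,Ran Yi,Chengming Xu,Xiaobin Hu,Kai Wu,Donghao Luo,Yabiao Wang,Lizhuang Ma
Keywords: restore degraded images, enhancing performance, deep learning, Image Restoration aims, aims to restore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: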

Abstract:Image Restoration aims to restore degraded images, with deep learning, especially CNNs and Transformers, enhancing performance. However, there’s a lack of a unified training benchmark for IR. We identified a bias in image complexity between training and testing datasets, affecting restoration quality. To address this, we created ReSyn, a large-scale IR dataset with balanced complexity, including real and synthetic images. We also established a unified training standard for IR models. Our RWKV-IR model integrates linear complexity RWKV into transformers for global and local receptive fields. It replaces Q-Shift with Depth-wise Convolution for local dependencies and combines Bi-directional attention for global-local awareness. The Cross-Bi-WKV module balances horizontal and vertical attention. Experiments show RWKV-IR’s effectiveness in image restoration.

[CV-115] Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting

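[Quick read]: This paper addresses foreground-conditioned inpainting: given a foreground subject and a text description, generate a high-quality background that matches the text while preserving the shape of the subject. The key to the solution lies in three innovations. First, a Self-Consistent Adapter integrates the foreground subject features into the layout-related self-attention layer, easing conflicts between text and subject features so that the model accounts for the subject's characteristics while handling the overall image layout. Second, a Decoupled Image Feature Extraction method uses distinct architectures to extract semantic and shape features separately, significantly improving subject feature extraction and preserving the subject's shape. Third, a Shared Positional Embedding Anchor ensures precise use of the extracted features and focuses attention on the subject region, greatly improving the model's understanding of subject features and its training efficiency.

Link: https://arxiv.org/abs/2412.03812
Authors: Guangben Lu,Yuzhen Du,Zhimin Sun,Ran Yi,Yifan Qi,Yizhe Tang,Tianyi Wang,Lizhuang Ma,Fangyuan Zou
Keywords: text description, provided foreground subject, Foreground-conditioned inpainting aims, foreground subject, subject
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: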

Abstract:Foreground-conditioned inpainting aims to seamlessly fill the background region of an image by utilizing the provided foreground subject and a text description. While existing T2I-based image inpainting methods can be applied to this task, they suffer from issues of subject shape expansion, distortion, or impaired ability to align with the text description, resulting in inconsistencies between the visual elements and the text description. To address these challenges, we propose Pinco, a plug-and-play foreground-conditioned inpainting adapter that generates high-quality backgrounds with good text alignment while effectively preserving the shape of the foreground subject. Firstly, we design a Self-Consistent Adapter that integrates the foreground subject features into the layout-related self-attention layer, which helps to alleviate conflicts between the text and subject features by ensuring that the model can effectively consider the foreground subject’s characteristics while processing the overall image layout. Secondly, we design a Decoupled Image Feature Extraction method that employs distinct architectures to extract semantic and shape features separately, significantly improving subject feature extraction and ensuring high-quality preservation of the subject’s shape. Thirdly, to ensure precise utilization of the extracted features and to focus attention on the subject region, we introduce a Shared Positional Embedding Anchor, greatly improving the model’s understanding of subject features and boosting training efficiency. Extensive experiments demonstrate that our method achieves superior performance and efficiency in foreground-conditioned inpainting.

[CV-116] I2OL-Net: Intra-Inter Objectness Learning Network for Point-Supervised X-Ray Prohibited Item Detection

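[Quick read]: This paper addresses the heavy reliance on manually annotated boxes in prohibited item detection for X-ray images. The key to the solution is a network named I²OL-Net with two core modules: an intra-modality objectness learning (intra-OL) module and an inter-modality objectness learning (inter-OL) module. The intra-OL module uses a local focus Gaussian masking block and a global random Gaussian masking block to collaboratively learn objectness in X-ray images, while the inter-OL module uses a wavelet decomposition-based adversarial learning block and an objectness block to reduce the modality discrepancy and transfer objectness knowledge learned from natural images to X-ray images. The method greatly alleviates the part domination problem caused by intra-class variations in X-ray images, shows superior performance on multiple X-ray datasets, and substantially lowers annotation cost, improving practicality and accessibility.

Link: https://arxiv.org/abs/2412.03811
Authors: Sanjoeng Wong,Yan Yan
Keywords: Automatic detection, X-ray images plays, X-ray images, X-ray, public security
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: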

Abstract:Automatic detection of prohibited items in X-ray images plays a crucial role in public security. However, existing methods rely heavily on labor-intensive box annotations. To address this, we investigate X-ray prohibited item detection under labor-efficient point supervision and develop an intra-inter objectness learning network (I²OL-Net). I²OL-Net consists of two key modules: an intra-modality objectness learning (intra-OL) module and an inter-modality objectness learning (inter-OL) module. The intra-OL module designs a local focus Gaussian masking block and a global random Gaussian masking block to collaboratively learn the objectness in X-ray images. Meanwhile, the inter-OL module introduces the wavelet decomposition-based adversarial learning block and the objectness block, effectively reducing the modality discrepancy and transferring the objectness knowledge learned from natural images with box annotations to X-ray images. Based on the above, I²OL-Net greatly alleviates the problem of part domination caused by severe intra-class variations in X-ray images. Experimental results on four X-ray datasets show that I²OL-Net can achieve superior performance with a significant reduction of annotation cost, thus enhancing its accessibility and practicality.

[CV-117] EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM

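[Quick read]: This paper addresses the difficulty of detecting diffusion model-based image edits in digital forensics. The key is to bring in a multimodal Large Language Model (LLM) whose stronger reasoning is used to localize regions tampered with by diffusion-based editing. Drawing on the contextual and semantic strengths of LLMs, the framework outperforms traditional methods in mIoU and F1-score on the MagicBrush, AutoSplice, and PerfBrush (a newly constructed diffusion-based dataset) datasets, with particularly strong results on PerfBrush, which shows its potential for detecting new types of edits.

Link: https://arxiv.org/abs/2412.03809
Authors: Quang Nguyen,Truong Vu,Trong-Tung Nguyen,Yuxin Wen,Preston K Robinette,Taylor T Johnson,Tom Goldstein,Anh Tran,Khoi Nguyen
Keywords: Image editing technologies, Large Language Model, image editing tools, editing technologies, Image editing
Categories: Computer Vision and Pattern Recognition (cs.CV)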

Abstract:Image editing technologies are tools used to transform, adjust, remove, or otherwise alter images. Recent research has significantly improved the capabilities of image editing tools, enabling the creation of photorealistic and semantically informed forged regions that are nearly indistinguishable from authentic imagery, presenting new challenges in digital forensics and media credibility. While current image forensic techniques are adept at localizing forged regions produced by traditional image manipulation methods, current capabilities struggle to localize regions created by diffusion-based techniques. To bridge this gap, we present a novel framework that integrates a multimodal Large Language Model (LLM) for enhanced reasoning capabilities to localize tampered regions in images produced by diffusion model-based editing methods. By leveraging the contextual and semantic strengths of LLMs, our framework achieves promising results on MagicBrush, AutoSplice, and PerfBrush (novel diffusion-based dataset) datasets, outperforming previous approaches in mIoU and F1-score metrics. Notably, our method excels on the PerfBrush dataset, a self-constructed test set featuring previously unseen types of edits. Here, where traditional methods typically falter, achieving markedly low scores, our approach demonstrates promising performance.
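
The abstract reports mIoU and F1-score for forged-region localization. As a reminder of what those numbers measure, the sketch below computes IoU and F1 for one pair of binary masks using the standard definitions. The function name `iou_and_f1` and the flat 0/1 mask representation are assumptions for illustration; the paper's exact evaluation protocol (for example how scores are averaged into mIoU) is not given here.

```python
def iou_and_f1(pred, target):
    """Compute IoU and F1 for two flat binary masks (sequences of 0/1)."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)        # forged, predicted forged
    fp = sum(1 for p, t in zip(pred, target) if p and not t)    # authentic, predicted forged
    fn = sum(1 for p, t in zip(pred, target) if not p and t)    # forged, missed
    union = tp + fp + fn
    iou = tp / union if union else 1.0            # two empty masks agree perfectly
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 1.0
    return iou, f1

iou, f1 = iou_and_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

For the same prediction F1 is never lower than IoU, since F1 = 2·IoU / (1 + IoU), which is worth keeping in mind when comparing the two columns of a results table.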

[CV-118] Advancing Auto-Regressive Continuation for Video Frames

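[Quick read]: This paper addresses two key problems in video continuation: degeneration in long-term frame generation and the quality of generated images. The key is a scheme named ARCON, which alternately generates semantic tokens and RGB tokens so that the large language model (LLM) explicitly learns and predicts the high-level structural information of the video. The paper also uses an optical flow-based texture stitching method to improve the visual quality of the generated videos. Experiments show that the model can consistently generate high-quality long videos in autonomous driving scenarios.

Link: https://arxiv.org/abs/2412.03758
Authors: Ruibo Ming,Jingwei Wu,Zhewei Huang,Zhuoxuan Ju,Jianming HU,Lihui Peng,Shuchang Zhou
Keywords: generating high-quality text, auto-regressive large language, Recent advances, large language models, high-quality text
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review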

Abstract:Recent advances in auto-regressive large language models (LLMs) have shown their potential in generating high-quality text, inspiring researchers to apply them to image and video generation. This paper explores the application of LLMs to video continuation, a task essential for building world models and predicting future frames. In this paper, we tackle challenges including preventing degeneration in long-term frame generation and enhancing the quality of generated images. We design a scheme named ARCON, which involves training our model to alternately generate semantic tokens and RGB tokens, enabling the LLM to explicitly learn and predict the high-level structural information of the video. We find high consistency in the RGB images and semantic maps generated without special design. Moreover, we employ an optical flow-based texture stitching method to enhance the visual quality of the generated videos. Quantitative and qualitative experiments in autonomous driving scenarios demonstrate our model can consistently generate long videos.

[CV-119] Multi-view Image Diffusion via Coordinate Noise and Fourier Attention WACV2025

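[Quick read]: This paper addresses the generation of multi-view consistent images from text prompts. The key is a Fourier-based attention block that attends to time-dependent spatial frequencies of features and focuses on features from non-overlapping regions of the generated scene to better align the global appearance. The paper also proposes a new noise initialization technique that induces noise correlations across views through shared noise and low spatial frequency information derived from pixel coordinates and depth maps. Finally, a cross-attention loss further aligns features of the scene that share the same prompt. Together these improve multi-view consistent generation and surpass existing state-of-the-art methods on several quantitative metrics.

Link: https://arxiv.org/abs/2412.03756
Authors: Justin Theiss,Norman Müller,Daeil Kim,Aayush Prakash
Keywords: made significant advancements, generalization capabilities compared, previous baselines, models has made, made significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WACV 2025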

Abstract:Recently, text-to-image generation with diffusion models has made significant advancements in both higher fidelity and generalization capabilities compared to previous baselines. However, generating holistic multi-view consistent images from prompts still remains an important and challenging task. To address this challenge, we propose a diffusion process that attends to time-dependent spatial frequencies of features with a novel attention mechanism as well as novel noise initialization technique and cross-attention loss. This Fourier-based attention block focuses on features from non-overlapping regions of the generated scene in order to better align the global appearance. Our noise initialization technique incorporates shared noise and low spatial frequency information derived from pixel coordinates and depth maps to induce noise correlations across views. The cross-attention loss further aligns features sharing the same prompt across the scene. Our technique improves SOTA on several quantitative metrics with qualitatively better results when compared to other state-of-the-art approaches for multi-view consistency.
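
One simple way to induce noise correlations across views is to mix a shared Gaussian sample with per-view independent samples. The sketch below shows only that generic construction; it leaves out the pixel-coordinate and depth-map terms the abstract mentions, and the function name `correlated_view_noise` and the mixing weight `alpha` are assumptions, not the paper's formulation.

```python
import math
import random

def correlated_view_noise(num_views, size, alpha, seed=0):
    """Draw unit-variance noise per view with pairwise correlation `alpha`.

    Each view is sqrt(alpha) * shared + sqrt(1 - alpha) * own, so the
    variance stays alpha + (1 - alpha) = 1 while views share a common part.
    """
    rng = random.Random(seed)
    shared = [rng.gauss(0.0, 1.0) for _ in range(size)]
    views = []
    for _ in range(num_views):
        own = [rng.gauss(0.0, 1.0) for _ in range(size)]
        views.append([
            math.sqrt(alpha) * s + math.sqrt(1.0 - alpha) * e
            for s, e in zip(shared, own)
        ])
    return views

identical = correlated_view_noise(3, 4, alpha=1.0)    # every view equals the shared noise
independent = correlated_view_noise(3, 4, alpha=0.0)  # no shared component at all
```

Intermediate values of `alpha` trade off cross-view agreement in the initial latent against per-view diversity.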

[CV-120] HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution

[Quick read]: This paper addresses a limitation of existing implicit neural representations (INRs) for image super-resolution (ISR): parameterizing them with multi-layer perceptrons fails to exploit the hierarchical structure of local sampling points, which constrains representation capability. The key is a Hierarchical encoding based Implicit Image Function (HIIF), which introduces a novel hierarchical positional encoding to enhance the local implicit representation so that it captures detail at multiple scales. HIIF also embeds a multi-head linear attention mechanism in the implicit attention network to take additional non-local information into account. Experiments show that, when combined with different backbone encoders, HIIF outperforms existing continuous image super-resolution methods by up to 0.17 dB in PSNR.

Link: https://arxiv.org/abs/2412.03748
Authors: Yuxuan Jiang,Ho Man Kwan,Tianhao Peng,Ge Gao,Fan Zhang,Xiaoqing Zhu,Joel Sole,David Bull
Keywords: shown significant promise, modeling visual signals, low-vision tasks including, Recent advances, INR-based ISR methods
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advances in implicit neural representations (INRs) have shown significant promise in modeling visual signals for various low-vision tasks including image super-resolution (ISR). INR-based ISR methods typically learn continuous representations, providing flexibility for generating high-resolution images at any desired scale from their low-resolution counterparts. However, existing INR-based ISR methods utilize multi-layer perceptrons for parameterization in the network; this does not take account of the hierarchical structure existing in local sampling points and hence constrains the representation capability. In this paper, we propose a new Hierarchical encoding based Implicit Image Function for continuous image super-resolution, HIIF, which leverages a novel hierarchical positional encoding that enhances the local implicit representation, enabling it to capture fine details at multiple scales. Our approach also embeds a multi-head linear attention mechanism within the implicit attention network by taking additional non-local information into account. Our experiments show that, when integrated with different backbone encoders, HIIF outperforms the state-of-the-art continuous image super-resolution methods by up to 0.17dB in PSNR. The source code of HIIF will be made publicly available at this http URL.
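
To put the reported 0.17 dB PSNR gain in perspective, the sketch below implements the standard PSNR definition, 10·log10(MAX² / MSE), and converts a dB gain into the corresponding mean-squared-error ratio. The helper names and the 8-bit peak value of 255 are assumptions for illustration; the paper's evaluation settings are not stated in this digest.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """MSE of the better method divided by MSE of the baseline for a given dB gain."""
    return 10.0 ** (-gain_db / 10.0)

ratio = mse_ratio_for_gain(0.17)
```

Under these definitions a 0.17 dB gain corresponds to an MSE ratio of roughly 0.96, that is, about 4% lower mean squared error than the baseline.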

[CV-121] Deep Variational Bayesian Modeling of Haze Degradation Process CIKM2023

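[Quick read]: This paper addresses the uncertainty in single image dehazing, in particular the ill-posedness caused by unknown factors such as the transmission map and atmospheric light. The key is a variational Bayesian framework that treats the clean image and the transmission map as latent variables and parameterizes their posterior distributions with corresponding neural networks (a dehazing network and a transmission network). Based on a physical model of haze degradation, the framework derives a new objective function that encourages cooperation between the two networks, enabling joint training that improves both. At inference, the dehazing network estimates the clean image independently of transmission map estimation, and the model-agnostic framework can be integrated into existing dehazing networks, improving performance across datasets and models.

Link: https://arxiv.org/abs/2412.03745
Authors: Eun Woo Im,Junsung Shin,Sungyong Baik,Tae Hyun Kim
Keywords: haze degradation, scene over distance, light reaching, atmospheric light, representation power
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in CIKM 2023, 10 pages, 9 figures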

Abstract:Relying on the representation power of neural networks, most recent works have often neglected several factors involved in haze degradation, such as transmission (the amount of light reaching an observer from a scene over distance) and atmospheric light. These factors are generally unknown, making dehazing problems ill-posed and creating inherent uncertainties. To account for such uncertainties and factors involved in haze degradation, we introduce a variational Bayesian framework for single image dehazing. We propose to take not only a clean image but also a transmission map as latent variables, the posterior distributions of which are parameterized by corresponding neural networks: dehazing and transmission networks, respectively. Based on a physical model for haze degradation, our variational Bayesian framework leads to a new objective function that encourages the cooperation between them, facilitating their joint training and thereby boosting the performance of both. In our framework, a dehazing network can estimate a clean image independently of a transmission map estimation during inference, introducing no overhead. Furthermore, our model-agnostic framework can be seamlessly incorporated with other existing dehazing networks, greatly enhancing the performance consistently across datasets and models.
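
The abstract refers to a physical model of haze degradation built from transmission and atmospheric light. The sketch below uses the widely used atmospheric scattering model, I = J·t + A·(1 − t), on the assumption that this is the form meant; the function names and the clamp `t_min` are illustrative and not taken from the paper.

```python
def apply_haze(clean, transmission, atmospheric_light):
    """Hazy pixel I = J * t + A * (1 - t) for each clean pixel J and transmission t."""
    return [j * t + atmospheric_light * (1.0 - t)
            for j, t in zip(clean, transmission)]

def remove_haze(hazy, transmission, atmospheric_light, t_min=0.1):
    """Invert the model: J = (I - A) / max(t, t_min) + A.

    The clamp `t_min` is a hypothetical safeguard against division by a
    near-zero transmission; it is exact only where t >= t_min.
    """
    return [(i - atmospheric_light) / max(t, t_min) + atmospheric_light
            for i, t in zip(hazy, transmission)]

clean = [0.2, 0.8]
transmission = [0.5, 0.9]
hazy = apply_haze(clean, transmission, atmospheric_light=1.0)
restored = remove_haze(hazy, transmission, atmospheric_light=1.0)
```

The inversion only works when both the transmission and the atmospheric light are known, which is exactly what a single hazy image does not provide and why the problem is ill-posed.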

[CV-122] VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

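[Quick read]: This paper addresses hallucination in multimodal large language models (MLLMs) on video understanding tasks. The key is VidHalluc, a benchmark designed to evaluate MLLM hallucinations in video understanding along three dimensions: action, temporal sequence, and scene transition. The paper also proposes DINO-HEAL, a training-free method that reweights visual features using spatial saliency information from DINOv2 to reduce hallucinations during inference. Experiments show that it improves performance on VidHalluc, with an average improvement of 3.02% in mitigating hallucinations.

Link: https://arxiv.org/abs/2412.03735
Authors: Chaoyu Li,Eun Woo Im,Pooyan Fazli
Keywords: Multimodal large language, recently shown significant, shown significant advancements, Multimodal large, large language models
Categories: Computer Vision and Pattern Recognition (cs.CV)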

Abstract:Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, the problem of hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that the visual encoder of MLLMs often struggles to differentiate between video pairs that are visually distinct but semantically similar, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding tasks. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. VidHalluc consists of 5,002 videos, paired based on semantic similarity and visual differences, focusing on cases where hallucinations are most likely to occur. Through comprehensive testing, our experiments show that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency information from DINOv2 to reweight visual features during inference. Our results demonstrate that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations among all tasks. Both the VidHalluc benchmark and DINO-HEAL code can be accessed via this https URL.

[CV-123] Sprite Sheet Diffusion: Generate Game Character for Animation

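[Quick read]: This paper addresses the time-consuming, labor-intensive manual drawing of every frame when creating character animations in 2D game development. The key is to use generative models, in particular diffusion models, to automate the creation of the sprite sheets needed for character animation. This can significantly reduce illustrators' manual workload, speed up animation creation, and open up new creative possibilities in game development.

Link: https://arxiv.org/abs/2412.03685
Authors: Cheng-An Hsieh,Jing Zhang,Ava Yan
Keywords: creating character animations, vital step, creating character, diffusion models, character animations
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL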

Abstract:In the game development process, creating character animations is a vital step that involves several stages. Typically for 2D games, illustrators begin by designing the main character image, which serves as the foundation for all subsequent animations. To create a smooth motion sequence, these subsequent animations involve drawing the character in different poses and actions, such as running, jumping, or attacking. This process requires significant manual effort from illustrators, as they must meticulously ensure consistency in design, proportions, and style across multiple motion frames. Each frame is drawn individually, making this a time-consuming and labor-intensive task. Generative models, such as diffusion models, have the potential to revolutionize this process by automating the creation of sprite sheets. Diffusion models, known for their ability to generate diverse images, can be adapted to create character animations. By leveraging the capabilities of diffusion models, we can significantly reduce the manual workload for illustrators, accelerate the animation creation process, and open up new creative possibilities in game development.

[CV-124] Designing DNNs for a trade-off between robustness and processing performance in embedded devices

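[Quick read]: This paper addresses the robustness to soft errors of machine learning-based embedded systems in safety-critical applications such as aerospace and autonomous driving. Soft errors are a growing concern in modern digital processors because smaller transistor geometries and lower voltages make electronic devices more sensitive to background radiation. The key is to investigate and assess bounded activation functions (AFs) as a way to improve the robustness of deep neural network (DNN) models to parameter perturbations. The paper analyzes how the choice of activation function affects model accuracy, compressibility, and computational burden, focusing on encoder-decoder fully convolutional models for semantic segmentation of hyperspectral images for scene understanding in autonomous driving. Deployment is characterized experimentally on an AMD-Xilinx KV260 SoM.

Link: https://arxiv.org/abs/2412.03682
Authors: Jon Gutiérrez-Zaballa,Koldo Basterretxea,Javier Echanobe
Keywords: Machine learning-based embedded, learning-based embedded systems, embedded systems employed, Machine learning-based, learning-based embedded
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)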

Abstract:Machine learning-based embedded systems employed in safety-critical applications such as aerospace and autonomous driving need to be robust against perturbations produced by soft errors. Soft errors are an increasing concern in modern digital processors since smaller transistor geometries and lower voltages give electronic devices a higher sensitivity to background radiation. The resilience of deep neural network (DNN) models to perturbations in their parameters is determined, to a large extent, by the structure of the model itself, and also by the selected numerical representation and used arithmetic precision. When compression techniques such as model pruning and model quantization are applied to reduce memory footprint and computational complexity for deployment, both model structure and numerical representation are modified and thus, soft error robustness also changes. In this sense, although the choice of activation functions (AFs) in DNN models is frequently ignored, it conditions not only their accuracy and trainability, but also compressibility rates and numerical robustness. This paper investigates the suitability of using bounded AFs to improve model robustness against DNN parameter perturbations, assessing at the same time the impact of this choice on deployment in terms of model accuracy, compressibility, and computational burden. In particular, we analyze encoder-decoder fully convolutional models aimed at performing semantic segmentation tasks on hyperspectral images for scene understanding in autonomous driving. Deployment characterization is performed experimentally on an AMD-Xilinx’s KV260 SoM.
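
The intuition for why a bounded activation function can limit the damage of a corrupted parameter can be shown with a single neuron. The sketch below compares an unbounded ReLU with a clipped ReLU; the upper bound of 6.0 follows the common ReLU6 convention and is an assumption, not the activation studied in the paper.

```python
def relu(x):
    """Unbounded rectifier: passes arbitrarily large positive values through."""
    return max(0.0, x)

def bounded_relu(x, upper=6.0):
    """Clipped rectifier: output is confined to [0, upper]."""
    return min(max(0.0, x), upper)

def neuron(weights, inputs, activation):
    """Weighted sum followed by an activation function."""
    return activation(sum(w * x for w, x in zip(weights, inputs)))

inputs = [2.0, 1.0]
healthy = [0.5, -0.25]
corrupted = [1.0e38, -0.25]  # first weight blown up, as a soft error might do

relu_out = neuron(corrupted, inputs, relu)             # huge value propagates onward
bounded_out = neuron(corrupted, inputs, bounded_relu)  # capped at the upper bound
```

With healthy weights both activations return 0.75; with the corrupted weight the unbounded version emits a value near 2e38, while the bounded version saturates at 6.0 and keeps the error from growing in later layers.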

[CV-125] NBM: an Open Dataset for the Acoustic Monitoring of Nocturnal Migratory Birds in Europe

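[Quick read]: This paper addresses effective monitoring of nocturnal bird migration (NBM), in particular for nocturnal migratory species that are difficult to track otherwise. The key is the Nocturnal Bird Migration (NBM) dataset, which contains 13,359 annotated vocalizations from 117 species of the Western Palearctic, gathered by bird enthusiasts across France with precise time and frequency annotations. The paper shows that a two-stage object detection model adapted to audio data can be trained to locate bounding box coordinates around each signal of interest in a spectrogram. This approach has received little attention in the bird sound recognition literature, but it can potentially differentiate individual birds within an audio window. The paper further shows that a recognition model trained on this dataset competes, on the 45 main species, with state-of-the-art systems trained on much larger datasets, which underlines the value of similar open-science initiatives for acquiring costly but valuable fine-grained annotations of audio files.

Link: https://arxiv.org/abs/2412.03633
Authors: Louis Airale,Adrien Pajot,Juliette Linossier
Keywords: effective monitoring techniques, migratory bird populations, Nocturnal Bird Migration, nocturnal migratory species, persisting threats
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)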

Abstract:The persisting threats to migratory bird populations highlight the urgent need for effective monitoring techniques that could assist in their conservation. Among these, passive acoustic monitoring is an essential tool, particularly for nocturnal migratory species that are difficult to track otherwise. This work presents the Nocturnal Bird Migration (NBM) dataset, a collection of 13,359 annotated vocalizations from 117 species of the Western Palearctic. The dataset includes precise time and frequency annotations, gathered by dozens of bird enthusiasts across France, enabling novel downstream acoustic analysis. In particular, we demonstrate that a two-stage object detection model, tailored for the processing of audio data, can be trained on our dataset to retrieve localized bounding box coordinates around each signal of interest in a spectrogram. This object detection approach, which is largely overlooked in the bird sound recognition literature, allows important applications by potentially differentiating individual birds within audio windows. Further, we show that the accuracy of our recognition model on the 45 main species of the dataset competes with state-of-the-art systems trained on much larger datasets. This highlights the interest of fostering similar open-science initiatives to acquire costly but valuable fine-grained annotations of audio files. All data and code are made openly available.

[CV-126] MV-Adapter: Multi-view Consistent Image Generation Made Easy

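[Quick read]: This paper addresses two main problems of existing multi-view image generation methods: (1) invasive modifications to pre-trained text-to-image (T2I) models lead to high computational cost, especially with large base models and high-resolution images; (2) image quality degrades because of optimization difficulties and scarce high-quality 3D data. The proposed solution is MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter trains efficiently and preserves the prior knowledge in the pre-trained model, reducing the risk of overfitting. Its key designs are duplicated self-attention layers and a parallel attention architecture, which let the adapter inherit the strong priors of the pre-trained model to model new 3D knowledge. The paper also presents a unified condition encoder that integrates camera parameters and geometric information, supporting text- and image-based 3D generation and texturing.

Link: https://arxiv.org/abs/2412.03632
Authors: Zehuan Huang,Yuan-Chen Guo,Haoran Wang,Ran Yi,Lizhuang Ma,Yan-Pei Cao,Lu Sheng
Keywords: Existing multi-view image, high computational costs, require full fine-tuning, make invasive modifications, large base models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL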

Abstract:Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.

[CV-127] Evaluating Single Event Upsets in Deep Neural Networks for Semantic Segmentation: an embedded system perspective

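[Quick read]: This paper addresses the robustness of embedded deep neural networks (DNNs) to parameter perturbations caused by single event upsets (SEUs), in particular for image semantic segmentation. The key is a layer-by-layer and bit-by-bit sensitivity analysis of the vulnerability of encoder-decoder models to soft errors, together with an evaluation of how techniques such as model pruning and parameter quantization affect the robustness of compressed models. The findings shed light on the mechanisms behind SEU-induced failures, and the paper proposes a set of lightweight error mitigation techniques, with no additional memory or computational cost, suited to resource-constrained deployments.

Link: https://arxiv.org/abs/2412.03630
Authors: Jon Gutiérrez-Zaballa,Koldo Basterretxea,Javier Echanobe
Keywords: autonomous AI-based perception, applications areas considered, areas considered safety-critical, Deep Neural Networks, embedded Deep Neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)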

Abstract:As the deployment of artificial intelligence (AI) algorithms at edge devices becomes increasingly prevalent, enhancing the robustness and reliability of autonomous AI-based perception and decision systems is becoming as relevant as precision and performance, especially in application areas considered safety-critical such as autonomous driving and aerospace. This paper delves into the robustness assessment in embedded Deep Neural Networks (DNNs), particularly focusing on the impact of parameter perturbations produced by single event upsets (SEUs) on convolutional neural networks (CNN) for image semantic segmentation. By scrutinizing the layer-by-layer and bit-by-bit sensitivity of various encoder-decoder models to soft errors, this study thoroughly investigates the vulnerability of segmentation DNNs to SEUs and evaluates the consequences of techniques like model pruning and parameter quantization on the robustness of compressed models aimed at embedded implementations. The findings offer valuable insights into the mechanisms underlying SEU-induced failures that allow for evaluating the robustness of DNNs once trained in advance. Moreover, based on the collected data, we propose a set of practical lightweight error mitigation techniques with no memory or computational cost suitable for resource-constrained deployments. The code used to perform the fault injection (FI) campaign is available at this https URL , while the code to implement proposed techniques is available at this https URL .
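
A bit-by-bit sensitivity study rests on a simple primitive: flipping one bit of a stored parameter and observing the new value. The sketch below does this for an IEEE 754 single-precision value using Python's `struct` module. It is not the authors' fault injection code; the function name `flip_bit` and the choice of float32 storage are assumptions for illustration.

```python
import struct

def flip_bit(value, bit):
    """Flip bit `bit` (0 = least significant, 31 = sign) of `value` stored as float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))   # reinterpret as uint32
    bits ^= 1 << bit                                          # the single event upset
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits)) # reinterpret as float32
    return flipped

sign_flip = flip_bit(1.0, 31)      # sign bit: 1.0 becomes -1.0
exponent_flip = flip_bit(0.5, 30)  # top exponent bit: 0.5 becomes 2**127
mantissa_flip = flip_bit(1.0, 0)   # lowest mantissa bit: barely changes the value
```

The three cases show why sensitivity is so uneven across bit positions: an upset in the most significant exponent bit can turn a small weight into a value near the top of the float32 range, while a low mantissa bit changes it by about one part in ten million.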

[CV-128] HunyuanVideo: A Systematic Framework For Large Video Generative Models

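[Quick read]: This paper addresses the performance gap between open-source and closed-source models in video generation. The key is HunyuanVideo, an open-source video foundation model whose performance is comparable to, or even surpasses, that of leading closed-source models. HunyuanVideo integrates data curation, advanced architectural design, progressive model scaling and training, and efficient infrastructure for large-scale model training and inference, and the authors trained a video generative model with over 13 billion parameters, the largest among all open-source models. Through extensive experiments and a series of targeted designs, HunyuanVideo performs well in visual quality, motion dynamics, text-video alignment, and advanced filming techniques, outperforming previous state-of-the-art models including Runway Gen-3 and Luma 1.6. By releasing the code, the paper aims to narrow the gap between closed-source and open-source communities and foster a more active and diverse video generation ecosystem.

Link: https://arxiv.org/abs/2412.03603
Authors: Weijie Kong,Qi Tian,Zijian Zhang,Rox Min,Zuozhuo Dai,Jin Zhou,Jiangfeng Xiong,Xin Li,Bo Wu,Jianwei Zhang,Kathrina Wu,Qin Lin,Aladdin Wang,Andong Wang,Bai Jiawang,Changlin Li,Duojun Huang,Fang Yang,Hao Tan,Hongmei Wang,Jacob Song,Jiawang Bai,Jianbing Wu,Jinbao Xue,Joey Wang,Junkun Yuan,Kai Wang,Mengyang Liu,Pengyu Li,Shuai Li,Weiyan Wang,Wenqing Yu,Xinchi Deng,Yanxin Long,Yi Chen,Yutao Cui,Yuanbo Peng,Zhentao Yu,Zhiyu He,Zhiyong Xu,Zixiang Zhou,Zunnan Xu,Yangyu Tao,Qinglin Lu,Songtao Liu,Daquan Zhou,Hongfa Wang,Yong Yang,Di Wang,Yuhong Liu,Jie Jiang,Caesar Zhong (Refer to the report for detailed contributions)
Keywords: significantly impacted daily, impacted daily life, Recent advancements, video generation, significantly impacted
Categories: Computer Vision and Pattern Recognition (cs.CV)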

Abstract:Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at this https URL.

[CV-129] Likelihood-Scheduled Score-Based Generative Modeling for Fully 3D PET Image Reconstruction

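[Quick read]: This paper addresses the slow reconstruction, burdensome hyperparameter tuning, and slice inconsistency in 3D that affect positron emission tomography (PET) image reconstruction with pre-trained score-based generative models (SGMs). The key is a practical methodology for 3D reconstruction that matches the likelihood of the SGM's reverse diffusion process to the current iterate of the maximum-likelihood expectation maximization algorithm, which accelerates reconstruction and reduces the number of critical hyperparameters. On simulated [¹⁸F]DPA-714 datasets, the method matches or improves on the NRMSE and SSIM of existing state-of-the-art SGM-based PET reconstruction while reducing reconstruction time and the need for hyperparameter tuning. The paper also presents the first SGM-based reconstruction of real 3D PET data, specifically [¹⁸F]DPA-714 data, integrating perpendicular pre-trained SGMs to eliminate slice inconsistency.

Link: https://arxiv.org/abs/2412.04339
Authors: George Webber,Yuya Mizuno,Oliver D. Howes,Alexander Hammers,Andrew P. King,Andrew J. Reader
Keywords: Medical image reconstruction, image distribution modeling, advanced image distribution, score-based generative models, Medical image
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 12 figures. Submitted to Transactions on Medical Imaging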

Abstract:Medical image reconstruction with pre-trained score-based generative models (SGMs) has advantages over other existing state-of-the-art deep-learned reconstruction methods, including improved resilience to different scanner setups and advanced image distribution modeling. SGM-based reconstruction has recently been applied to simulated positron emission tomography (PET) datasets, showing improved contrast recovery for out-of-distribution lesions relative to the state-of-the-art. However, existing methods for SGM-based reconstruction from PET data suffer from slow reconstruction, burdensome hyperparameter tuning and slice inconsistency effects (in 3D). In this work, we propose a practical methodology for fully 3D reconstruction that accelerates reconstruction and reduces the number of critical hyperparameters by matching the likelihood of an SGM’s reverse diffusion process to a current iterate of the maximum-likelihood expectation maximization algorithm. Using the example of low-count reconstruction from simulated [¹⁸F]DPA-714 datasets, we show our methodology can match or improve on the NRMSE and SSIM of existing state-of-the-art SGM-based PET reconstruction while reducing reconstruction time and the need for hyperparameter tuning. We evaluate our methodology against state-of-the-art supervised and conventional reconstruction algorithms. Finally, we demonstrate a first-ever implementation of SGM-based reconstruction for real 3D PET data, specifically [¹⁸F]DPA-714 data, where we integrate perpendicular pre-trained SGMs to eliminate slice inconsistency issues.
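
The method ties the reverse diffusion process to iterates of the maximum-likelihood expectation maximization (MLEM) algorithm. The sketch below shows only the standard MLEM update for Poisson data, x ← x / (Aᵀ1) · Aᵀ(y / Ax), on a toy dense system matrix. It does not implement the likelihood scheduling proposed in the paper; the function name `mlem_step`, the tiny matrix, and the guard `eps` are assumptions for illustration.

```python
def mlem_step(x, A, y, eps=1e-12):
    """One MLEM update for measurements y ~ Poisson(A x).

    x: current image estimate (list of voxel values)
    A: system matrix as a list of rows (one row per measurement bin)
    y: measured counts per bin
    """
    n_meas, n_vox = len(A), len(x)
    forward = [sum(A[i][j] * x[j] for j in range(n_vox)) for i in range(n_meas)]
    ratio = [y[i] / max(forward[i], eps) for i in range(n_meas)]                # y / (A x)
    back = [sum(A[i][j] * ratio[i] for i in range(n_meas)) for j in range(n_vox)]
    sens = [sum(A[i][j] for i in range(n_meas)) for j in range(n_vox)]          # A^T 1
    return [x[j] * back[j] / max(sens[j], eps) for j in range(n_vox)]

A = [[1.0, 0.0],
     [1.0, 1.0]]
y = [2.0, 5.0]            # noiseless data generated by the image [2.0, 3.0]
x1 = mlem_step([1.0, 1.0], A, y)
```

Two properties are easy to check on this toy problem: an image that reproduces the data exactly is a fixed point of the update, and after any single update the forward-projected counts sum to the measured total.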

[CV-130] Multi-Subject Image Synthesis as a Generative Prior for Single-Subject PET Image Reconstruction

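[Quick read]: This paper addresses the difficulty of acquiring high-quality medical image datasets, in particular for positron emission tomography (PET), where reconstructed image quality is limited by inherent Poisson noise. The key is a novel method that synthesizes diverse and realistic pseudo-PET images with improved signal-to-noise ratio. First, deep-learned deformable registration is performed on multi-subject magnetic resonance (MR) images paired with PET images. Then the anatomically-learned deformation fields are used to transform multiple PET images to the same reference space, and random subsets of the transformed data are averaged to generate a large number of varying pseudo-PET images. The method improves the anatomical detail of the pseudo-PET images and shows potential for PET image reconstruction: pseudo-PET images are generated in the same space as the intended single-subject reconstruction and used as training data for a diffusion model-based reconstruction method, giving visual improvement and reduced background noise.

Link: https://arxiv.org/abs/2412.04324
Authors: George Webber,Yuya Mizuno,Oliver D. Howes,Alexander Hammers,Andrew P. King,Andrew J. Reader
Keywords: Large high-quality medical, high-quality medical image, medical image datasets, PET image reconstruction, pseudo-PET images
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 pages, 3 figures. Accepted as a poster presentation at IEEE NSS MIC RTSD 2024 (submitted May 2024; accepted July 2024; presented Nov 2024)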

Abstract:Large high-quality medical image datasets are difficult to acquire but necessary for many deep learning applications. For positron emission tomography (PET), reconstructed image quality is limited by inherent Poisson noise. We propose a novel method for synthesising diverse and realistic pseudo-PET images with improved signal-to-noise ratio. We also show how our pseudo-PET images may be exploited as a generative prior for single-subject PET image reconstruction. Firstly, we perform deep-learned deformable registration of multi-subject magnetic resonance (MR) images paired to multi-subject PET images. We then use the anatomically-learned deformation fields to transform multiple PET images to the same reference space, before averaging random subsets of the transformed multi-subject data to form a large number of varying pseudo-PET images. We observe that using MR information for registration imbues the resulting pseudo-PET images with improved anatomical detail compared to the originals. We consider applications to PET image reconstruction, by generating pseudo-PET images in the same space as the intended single-subject reconstruction and using them as training data for a diffusion model-based reconstruction method. We show visual improvement and reduced background noise in our 2D reconstructions as compared to OSEM, MAP-EM and an existing state-of-the-art diffusion model-based approach. Our method shows the potential for utilising highly subject-specific prior information within a generative reconstruction framework. Future work may compare the benefits of our approach to explicitly MR-guided reconstruction methodologies.
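
The averaging step described above can be sketched in a few lines once the images are already registered to a common space. The code below assumes registration has been done elsewhere and treats each image as a flat list of voxel values; the function name `pseudo_images` and the parameters `subset_size`, `count`, and `seed` are hypothetical.

```python
import random

def pseudo_images(registered, subset_size, count, seed=0):
    """Average `count` random subsets of already-registered images.

    registered: list of images in the same reference space (flat voxel lists)
    subset_size: number of subjects averaged into each pseudo-image
    """
    rng = random.Random(seed)
    out = []
    for _ in range(count):
        subset = rng.sample(registered, subset_size)   # subjects drawn without replacement
        out.append([sum(voxels) / subset_size for voxels in zip(*subset)])
    return out

registered = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
all_subjects = pseudo_images(registered, subset_size=3, count=2)
pairs = pseudo_images(registered, subset_size=2, count=5)
```

If the noise in the individual images were independent with equal variance, averaging N of them would reduce the noise variance by a factor of N, which is the usual argument for why such averages have a better signal-to-noise ratio than any single scan.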

[CV-131] Generative-Model-Based Fully 3D PET Image Reconstruction by Conditional Diffusion Sampling

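[Quick read]: This paper addresses 3D image reconstruction from positron emission tomography (PET) data under low-dose or short-duration scanning. The key is a practical methodology based on score-based generative models (SGMs), applied for the first time to reconstruction of real fully 3D PET data. An SGM is trained on full-count reference brain images, and the methodology is extended to reconstruction at very low counts (1% of the original counts), which allows the bias and variance characteristics of the method to be analyzed. Uncertainty images for the reconstructions are computed by sampling from the posterior distribution of the generative algorithm. Compared with conventional OSEM and MAP-EM, the SGM-based low-count reconstructions match full-dose reconstructions more closely and show lower variance in a bias-variance trade-off comparison. Future work will compare against supervised deep-learned methods and investigate how data conditioning affects the SGM's posterior distribution and how the algorithm performs with different tracers.

Link: https://arxiv.org/abs/2412.04319
Authors: George Webber,Yuya Mizuno,Oliver D. Howes,Alexander Hammers,Andrew P. King,Andrew J. Reader
Keywords: Score-based generative models, positron emission tomography, recently shown promising, shown promising results, simulated positron emission
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 2 pages, 2 figures. Accepted for oral presentation at IEEE NSS MIC RTSD 2024 (submitted May 2024; accepted July 2024; presented Nov 2024)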

Abstract:Score-based generative models (SGMs) have recently shown promising results for image reconstruction on simulated positron emission tomography (PET) datasets. In this work we have developed and implemented practical methodology for 3D image reconstruction with SGMs, and perform (to our knowledge) the first SGM-based reconstruction of real fully 3D PET data. We train an SGM on full-count reference brain images, and extend methodology to allow SGM-based reconstructions at very low counts (1% of original, to simulate low-dose or short-duration scanning). We then perform reconstructions for multiple independent realisations of 1% count data, allowing us to analyse the bias and variance characteristics of the method. We sample from the learned posterior distribution of the generative algorithm to calculate uncertainty images for our reconstructions. We evaluate the method’s performance on real full- and low-count PET data and compare with conventional OSEM and MAP-EM baselines, showing that our SGM-based low-count reconstructions match full-dose reconstructions more closely and in a bias-variance trade-off comparison, our SGM-reconstructed images have lower variance than existing baselines. Future work will compare to supervised deep-learned methods, with other avenues for investigation including how data conditioning affects the SGM’s posterior distribution and the algorithm’s performance with different tracers.
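
The bias, variance, and uncertainty analysis mentioned in the abstract above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming reconstructions from independent noise realisations (or posterior samples) are stacked along the first axis; the function names and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def bias_variance(recons, reference):
    """Voxel-wise bias and variance over independent realisations.

    recons: array (n_realisations, ...) of reconstructions.
    reference: array with the same spatial shape (e.g. a full-count image).
    """
    bias_img = recons.mean(axis=0) - reference
    var_img = recons.var(axis=0, ddof=1)
    return bias_img, var_img


def uncertainty_image(samples):
    """Voxel-wise standard deviation across samples for one data set."""
    return samples.std(axis=0, ddof=1)


# Synthetic check: a known offset of 0.5 and noise with std 0.1.
rng = np.random.default_rng(0)
reference = np.ones((8, 8))
recons = reference + 0.5 + 0.1 * rng.standard_normal((200, 8, 8))
bias_img, var_img = bias_variance(recons, reference)
sigma_img = uncertainty_image(recons)
print(bias_img.mean(), var_img.mean())
```

In this toy setting the estimated bias recovers the injected offset and the variance recovers the injected noise level, which is the kind of sanity check one might run before applying the same functions to real reconstructions.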

[CV-132] Structure-Aware Stylized Image Synthesis for Robust Medical Image Segmentation

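[Quick read]: This paper addresses domain shift in medical image segmentation caused by variations in imaging devices, acquisition conditions, and patient-specific attributes. The key to the solution is a new medical image segmentation method that combines diffusion models with a Structure-Preserving Network for structure-aware one-shot image stylization. By transforming images from different sources into a consistent style while preserving the location, size, and shape of lesions, the method mitigates domain shift and yields robust, accurate segmentation even when the target domain is absent from the training data. Experiments on colonoscopy polyp segmentation and skin lesion segmentation datasets show that the method significantly improves the robustness and accuracy of segmentation models and outperforms baseline models without style transfer.

Link: https://arxiv.org/abs/2412.04296
Authors: Jie Bao, Zhixin Zhou, Wen Jung Li, Rui Luo
Keywords: acquisition conditions, imaging devices, patient-specific attributes, essential for effective, effective diagnosis
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)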

Abstract:Accurate medical image segmentation is essential for effective diagnosis and treatment planning but is often challenged by domain shifts caused by variations in imaging devices, acquisition conditions, and patient-specific attributes. Traditional domain generalization methods typically require inclusion of parts of the test domain within the training set, which is not always feasible in clinical settings with limited diverse data. Additionally, although diffusion models have demonstrated strong capabilities in image generation and style transfer, they often fail to preserve the critical structural information necessary for precise medical analysis. To address these issues, we propose a novel medical image segmentation method that combines diffusion models and Structure-Preserving Network for structure-aware one-shot image stylization. Our approach effectively mitigates domain shifts by transforming images from various sources into a consistent style while maintaining the location, size, and shape of lesions. This ensures robust and accurate segmentation even when the target domain is absent from the training data. Experimental evaluations on colonoscopy polyp segmentation and skin lesion segmentation datasets show that our method enhances the robustness and accuracy of segmentation models, achieving superior performance metrics compared to baseline models without style transfer. This structure-aware stylization framework offers a practical solution for improving medical image segmentation across diverse domains, facilitating more reliable clinical diagnoses.

[CV-133] Adult Glioma Segmentation in Sub-Saharan Africa using Transfer Learning on Stratified Finetuning Data MICCAI

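[Quick read]: This paper addresses the challenge of diagnosing gliomas in resource-limited regions, particularly Sub-Saharan Africa, where MRI data are scarce and of low quality. The key to the solution is transfer learning with a stratified fine-tuning strategy: using the pre-trained deep learning models nnU-Net and MedNeXt, the authors build stratified training folds through radiomic analysis, train on a large brain tumor dataset, and then transfer the models to the Sub-Saharan African setting. A weighted model ensembling strategy and adaptive post-processing further improve segmentation accuracy. The method achieved excellent segmentation results on the validation cases of the BraTS-Africa 2024 task, highlighting the potential of integrated machine learning techniques to narrow the gap in medical imaging capability between resource-limited countries and developed regions.

Link: https://arxiv.org/abs/2412.04111
Authors: Abhijeet Parida, Daniel Capellán-Martín, Zhifan Jiang, Austin Tapp, Xinyang Liu, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru
Keywords: present substantial diagnostic, substantial diagnostic challenges, Sub-Saharan Africa, brain tumor characterized, high mortality
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 3 tables. This paper was accepted at MICCAI-BraTS 2024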

Abstract:Gliomas, a kind of brain tumor characterized by high mortality, present substantial diagnostic challenges in low- and middle-income countries, particularly in Sub-Saharan Africa. This paper introduces a novel approach to glioma segmentation using transfer learning to address challenges in resource-limited regions with minimal and low-quality MRI data. We leverage pre-trained deep learning models, nnU-Net and MedNeXt, and apply a stratified fine-tuning strategy using the BraTS2023-Adult-Glioma and BraTS-Africa datasets. Our method exploits radiomic analysis to create stratified training folds, model training on a large brain tumor dataset, and transfer learning to the Sub-Saharan context. A weighted model ensembling strategy and adaptive post-processing are employed to enhance segmentation accuracy. The evaluation of our proposed method on unseen validation cases on the BraTS-Africa 2024 task resulted in lesion-wise mean Dice scores of 0.870, 0.865, and 0.926, for enhancing tumor, tumor core, and whole tumor regions and was ranked first for the challenge. Our approach highlights the ability of integrated machine-learning techniques to bridge the gap between the medical imaging capabilities of resource-limited countries and established developed regions. By tailoring our methods to a target population’s specific needs and constraints, we aim to enhance diagnostic capabilities in isolated environments. Our findings underscore the importance of approaches like local data integration and stratification refinement to address healthcare disparities, ensure practical applicability, and enhance impact.
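
Two ingredients named in the abstract above, weighted model ensembling and Dice scoring, can be sketched as follows. This is a generic illustration, not the authors' code: the weights and toy probability maps are made up, and `dice` is the plain overlap score rather than the lesion-wise variant used by the BraTS challenge, which scores each lesion separately.

```python
import numpy as np


def weighted_ensemble(prob_maps, weights):
    """Fuse per-model class probabilities and return a label map.

    prob_maps: array (n_models, n_classes, H, W) of softmax outputs.
    weights: one non-negative weight per model (normalised internally).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    fused = np.tensordot(w, prob_maps, axes=1)  # -> (n_classes, H, W)
    return fused.argmax(axis=0)


def dice(pred, target):
    """Plain Dice coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / denom


# Two toy models that disagree: A favours class 1, B favours class 0.
model_a = np.stack([np.full((2, 2), 0.1), np.full((2, 2), 0.9)])
model_b = np.stack([np.full((2, 2), 0.6), np.full((2, 2), 0.4)])
stacked = np.stack([model_a, model_b])
labels_equal = weighted_ensemble(stacked, weights=[1.0, 1.0])
labels_trust_b = weighted_ensemble(stacked, weights=[1.0, 9.0])
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

With equal weights the more confident model A wins, while a large weight on model B flips the decision; choosing such weights from validation performance is one common way to build a weighted ensemble.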

[CV-134] Magnetic Resonance Imaging Feature-Based Subtyping and Model Ensemble for Enhanced Brain Tumor Segmentation MICCAI

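[Quick read]: This paper addresses accurate, automatic segmentation of brain tumors in multi-parametric magnetic resonance imaging (mpMRI), which underpins the quantitative measurements that play an increasingly important role in clinical diagnosis and prognosis. The key to the solution is a deep learning-based ensemble approach that integrates state-of-the-art segmentation models and introduces adaptive pre- and post-processing techniques that use MRI-based radiomic analyses to differentiate tumor subtypes. Given the heterogeneity of the tumors in the BraTS 2024 datasets, this approach improves the precision and generalizability of the segmentation models. On the final test sets, the method achieved mean lesion-wise Dice similarity coefficients of 0.926, 0.801, and 0.688 for the whole tumor in pediatric brain tumors (PED), meningiomas (MEN-RT), and brain metastases (MET), respectively.

Link: https://arxiv.org/abs/2412.04094
Authors: Zhifan Jiang, Daniel Capellán-Martín, Abhijeet Parida, Austin Tapp, Xinyang Liu, María J. Ledesma-Carbayo, Syed Muhammad Anwar, Marius George Linguraru
Keywords: magnetic resonance imaging, multi-parametric magnetic resonance, increasingly important role, Accurate and automatic, International Brain Tumor
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures, 3 tables. This paper was accepted at MICCAI-BraTS 2024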

Abstract:Accurate and automatic segmentation of brain tumors in multi-parametric magnetic resonance imaging (mpMRI) is essential for quantitative measurements, which play an increasingly important role in clinical diagnosis and prognosis. The International Brain Tumor Segmentation (BraTS) Challenge 2024 offers a unique benchmarking opportunity, including various types of brain tumors in both adult and pediatric populations, such as pediatric brain tumors (PED), meningiomas (MEN-RT) and brain metastases (MET), among others. Compared to previous editions, BraTS 2024 has implemented changes to substantially increase clinical relevance, such as refined tumor regions for evaluation. We propose a deep learning-based ensemble approach that integrates state-of-the-art segmentation models. Additionally, we introduce innovative, adaptive pre- and post-processing techniques that employ MRI-based radiomic analyses to differentiate tumor subtypes. Given the heterogeneous nature of the tumors present in the BraTS datasets, this approach enhances the precision and generalizability of segmentation models. On the final testing sets, our method achieved mean lesion-wise Dice similarity coefficients of 0.926, 0.801, and 0.688 for the whole tumor in PED, MEN-RT, and MET, respectively. These results demonstrate the effectiveness of our approach in improving segmentation performance and generalizability for various brain tumor types.

[CV-135] Deformation-Aware Segmentation Network Robust to Motion Artifacts for Brain Tissue Segmentation using Disentanglement Learning MICCAI2024

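[Quick read]: This paper addresses motion artifacts in magnetic resonance imaging (MRI) caused by prolonged acquisition times, which hinder accurate tissue segmentation. The key to the solution is a novel deep learning framework in which a disentanglement learning network progressively removes artifacts to produce cleaner images, while a jointly trained motion estimation and segmentation network delivers more accurate brain tissue segmentation. The framework generates three outputs: a motion-corrected image, a deformation map that identifies artifact-affected regions, and a brain tissue segmentation mask. The deformation map acts as a guidance mechanism, helping the model recover lost information or remove artificial structures introduced by the artifacts. Experiments show that the framework outperforms existing state-of-the-art methods at segmenting motion-corrupted MRI scans.

Link: https://arxiv.org/abs/2412.03922
Authors: Sunyoung Jung, Yoonseok Choi, Mohammed A. Al-masni, Minyoung Jung, Dong-Hyun Kim
Keywords: Magnetic Resonance Imaging, Resonance Imaging, Magnetic Resonance, prolonged acquisition time, challenge in Magnetic
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Medical Image Computing and Computer Assisted Intervention, MICCAI 2024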

Abstract:Motion artifacts caused by prolonged acquisition time are a significant challenge in Magnetic Resonance Imaging (MRI), hindering accurate tissue segmentation. These artifacts appear as blurred images that mimic tissue-like appearances, making segmentation difficult. This study proposes a novel deep learning framework that demonstrates superior performance in both motion correction and robust brain tissue segmentation in the presence of artifacts. The core concept lies in a complementary process: a disentanglement learning network progressively removes artifacts, leading to cleaner images and, consequently, more accurate segmentation by a jointly trained motion estimation and segmentation network. This network generates three outputs: a motion-corrected image, a motion deformation map that identifies artifact-affected regions, and a brain tissue segmentation mask. The deformation map serves as a guidance mechanism for the disentanglement process, aiding the model in recovering lost information or removing artificial structures introduced by the artifacts. Extensive in-vivo experiments on pediatric motion data demonstrate that our proposed framework outperforms state-of-the-art methods in segmenting motion-corrupted MRI scans.

[CV-136] Dual-Branch Subpixel-Guided Network for Hyperspectral Image Classification

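[Quick read]: This paper addresses the mixed-pixel problem in hyperspectral image (HSI) classification, which arises from the limited spatial resolution of sensors. The key to the solution is DSNet, a dual-branch subpixel-guided network that introduces a deep autoencoder unmixing architecture to automatically integrate subpixel information with convolutional class features and thereby improve classification performance. DSNet accounts for the physically nonlinear properties within subpixels and adaptively generates diagnostic abundances, yielding more reliable decision boundaries for class label distributions. In addition, a subpixel fusion module ensures high-quality information fusion between pixel and subpixel features, further promoting stable joint classification.

Link: https://arxiv.org/abs/2412.03893
Authors: Zhu Han, Jin Yang, Lianru Gao, Zhiqiang Zeng, Bing Zhang, Jocelyn Chanussot
Keywords: promising feature learning, hyperspectral image, representation capabilities, widely applied, applied into hyperspectral
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)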

Abstract:Deep learning (DL) has been widely applied to hyperspectral image (HSI) classification owing to its promising feature learning and representation capabilities. However, limited by the spatial resolution of sensors, existing DL-based classification approaches mainly focus on pixel-level spectral and spatial information extraction through complex network architecture design, while ignoring the existence of mixed pixels in actual scenarios. To tackle this difficulty, we propose a novel dual-branch subpixel-guided network for HSI classification, called DSNet, which automatically integrates subpixel information and convolutional class features by introducing a deep autoencoder unmixing architecture to enhance classification performance. DSNet is capable of fully considering physically nonlinear properties within subpixels and adaptively generating diagnostic abundances in an unsupervised manner to achieve more reliable decision boundaries for class label distributions. The subpixel fusion module is designed to ensure high-quality information fusion across pixel and subpixel features, further promoting stable joint classification. Experimental results on three benchmark datasets demonstrate the effectiveness and superiority of DSNet compared with state-of-the-art DL-based HSI classification approaches. The code will be available at this https URL, contributing to the remote sensing community.

[CV-137] INRetouch: Context Aware Implicit Neural Representation for Photography Retouching

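[Quick read]: This paper addresses the automation of complex operations in professional photo editing, in particular the challenge of maintaining editing control and high-fidelity output. The key to the solution is a novel retouch transfer approach that learns from before-and-after image pairs of professional edits in order to replicate complex editing operations precisely. Specifically, the paper introduces a comprehensive photo retouching dataset of 100,000 high-quality images and develops a context-aware Implicit Neural Representation that applies edits adaptively based on image content and context, requires no pretraining, and can learn from a single example. By extracting the implicit transformation from a reference edit and applying it adaptively to new images, the method not only surpasses existing photo retouching methods but also improves performance on related image reconstruction tasks such as gamut mapping and raw reconstruction.

Link: https://arxiv.org/abs/2412.03848
Authors: Omar Elezabi, Marcos V. Conde, Zongwei Wu, Radu Timofte
Keywords: editing remains challenging, remains challenging, knowledge of imaging, imaging pipelines, photo editing remains
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)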

Abstract:Professional photo editing remains challenging, requiring extensive knowledge of imaging pipelines and significant expertise. With the ubiquity of smartphone photography, there is an increasing demand for accessible yet sophisticated image editing solutions. While recent deep learning approaches, particularly style transfer methods, have attempted to automate this process, they often struggle with output fidelity, editing control, and complex retouching capabilities. We propose a novel retouch transfer approach that learns from professional edits through before-after image pairs, enabling precise replication of complex editing operations. To facilitate this research direction, we introduce a comprehensive Photo Retouching Dataset comprising 100,000 high-quality images edited using over 170 professional Adobe Lightroom presets. We develop a context-aware Implicit Neural Representation that learns to apply edits adaptively based on image content and context, requiring no pretraining and capable of learning from a single example. Our method extracts implicit transformations from reference edits and adaptively applies them to new images. Through extensive evaluation, we demonstrate that our approach not only surpasses existing methods in photo retouching but also enhances performance in related image reconstruction tasks like Gamut Mapping and Raw Reconstruction. By bridging the gap between professional editing capabilities and automated solutions, our work presents a significant step toward making sophisticated photo editing more accessible while maintaining high-fidelity results. Check the Project Page (this https URL) for more results and information about code and dataset availability.

[CV-138] DiffuPT: Class Imbalance Mitigation for Glaucoma Detection via Diffusion Based Generation and Model Pretraining

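[Quick read]: This paper addresses the degraded performance of deep learning algorithms in glaucoma diagnosis caused by class imbalance in the datasets. The key to the solution is a generative framework based on diffusion models that produces synthetic data to mitigate the imbalance. Combined with a pretraining approach, it yields a more robust classifier training process that improves classifier performance, particularly the harmonic mean of sensitivity and specificity and the area under the ROC curve (AUC). Experiments show significant performance gains on both the authors' national dataset and the AIROGS dataset, underscoring the value of diffusion models for handling class imbalance in medical datasets.

Link: https://arxiv.org/abs/2412.03629
Authors: Youssof Nawar, Nouran Soliman, Moustafa Wassel, Mohamed ElHabebe, Noha Adly, Marwan Torki, Ahmed Elmassry, Islam Ahmed
Keywords: progressive optic neuropathy, optic neuropathy characterized, optic nerve head, progressive optic, optic neuropathy
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)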

Abstract:Glaucoma is a progressive optic neuropathy characterized by structural damage to the optic nerve head and functional changes in the visual field. Detecting glaucoma early is crucial to preventing loss of eyesight. However, medical datasets often suffer from class imbalances, making detection more difficult for deep-learning algorithms. We use a generative-based framework to enhance glaucoma diagnosis, specifically addressing class imbalance through synthetic data generation. In addition, we collected the largest national dataset for glaucoma detection to support our study. The imbalance between normal and glaucomatous cases leads to performance degradation of classifier models. By combining our proposed framework leveraging diffusion models with a pretraining approach, we created a more robust classifier training process. This training process results in a better-performing classifier. The proposed approach shows promising results in improving the harmonic mean of sensitivity and specificity and the area under the ROC curve (AUC) for the glaucoma classifier. We report an improvement in the harmonic mean metric from 89.09% to 92.59% on the test set of our national dataset. We examine our method against other methods to overcome imbalance through extensive experiments. We report similar improvements on the AIROGS dataset. This study highlights that diffusion-based generation can be of great importance in tackling class imbalances in medical datasets to improve diagnostic performance.
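
The harmonic mean of sensitivity and specificity reported in the abstract above is easy to compute from a confusion matrix. The sketch below is a generic illustration with made-up counts (the function name and numbers are hypothetical, not the paper's data); it also shows why plain accuracy can mislead on an imbalanced dataset.

```python
def harmonic_mean_se_sp(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


# A balanced-looking classifier: sensitivity 0.90, specificity 0.95.
hm_good = harmonic_mean_se_sp(tp=90, fn=10, tn=95, fp=5)

# A degenerate classifier on imbalanced data (5 diseased, 95 normal cases)
# that labels everything "normal": accuracy is 0.95, sensitivity is 0.
accuracy_trivial = (0 + 95) / 100
hm_trivial = harmonic_mean_se_sp(tp=0, fn=5, tn=95, fp=0)
print(round(hm_good, 4), accuracy_trivial, hm_trivial)
```

Because the harmonic mean collapses to zero when either sensitivity or specificity is zero, it penalises a classifier that simply ignores the minority class, which accuracy does not.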

[CV-139] End-to-end Triple-domain PET Enhancement: A Hybrid Denoising-and-reconstruction Framework for Reconstructing Standard-dose PET Images from Low-dose PET Sinograms

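[Quick read]: This paper addresses the reconstruction of high-quality standard-dose positron emission tomography (SPET) images from low-dose PET (LPET) data, in order to reduce the radiation hazard to patients undergoing PET examinations. The key to the solution is TriPLET, an end-to-end framework that combines a hybrid denoising-and-reconstruction process with a triple-domain representation (sinograms, frequency spectrum maps, and images) to reconstruct SPET images from LPET sinograms. TriPLET consists of three sequentially coupled components: 1) a Transformer-assisted denoising network that denoises the input LPET sinograms in the projection domain; 2) a discrete-wavelet-transform-based reconstruction network that further reconstructs SPET from LPET in the wavelet domain; and 3) a pair-based adversarial network that evaluates the reconstructed SPET images in the image domain. Experiments show that TriPLET reconstructs SPET images with higher similarity and signal-to-noise ratio than existing state-of-the-art methods.

Link: https://arxiv.org/abs/2412.03617
Authors: Caiwen Jiang, Mianxin Liu, Kaicong Sun, Dinggang Shen
Keywords: positron emission tomography, early disease diagnosis, functional imaging technique, sensitive functional imaging, reconstruct SPET images
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)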

Abstract:As a sensitive functional imaging technique, positron emission tomography (PET) plays a critical role in early disease diagnosis. However, obtaining a high-quality PET image requires injecting a sufficient dose (standard dose) of radionuclides into the body, which inevitably poses radiation hazards to patients. To mitigate radiation hazards, the reconstruction of standard-dose PET (SPET) from low-dose PET (LPET) is desired. According to imaging theory, PET reconstruction process involves multiple domains (e.g., projection domain and image domain), and a significant portion of the difference between SPET and LPET arises from variations in the noise levels introduced during the sampling of raw data as sinograms. In light of these two facts, we propose an end-to-end TriPle-domain LPET EnhancemenT (TriPLET) framework, by leveraging the advantages of a hybrid denoising-and-reconstruction process and a triple-domain representation (i.e., sinograms, frequency spectrum maps, and images) to reconstruct SPET images from LPET sinograms. Specifically, TriPLET consists of three sequentially coupled components including 1) a Transformer-assisted denoising network that denoises the inputted LPET sinograms in the projection domain, 2) a discrete-wavelet-transform-based reconstruction network that further reconstructs SPET from LPET in the wavelet domain, and 3) a pair-based adversarial network that evaluates the reconstructed SPET images in the image domain. Extensive experiments on the real PET dataset demonstrate that our proposed TriPLET can reconstruct SPET images with the highest similarity and signal-to-noise ratio to real data, compared with state-of-the-art methods.
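
The abstract above mentions a reconstruction network that works in the wavelet domain. As background only, the sketch below shows what a one-level 2D discrete wavelet transform does to an array, using the Haar wavelet written out in NumPy. The choice of Haar, the function names, and the random test array are assumptions for illustration; the paper's actual wavelet and network design are not specified here.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform (even height and width)."""
    lo_r = (x[0::2, :] + x[1::2, :]) / SQRT2  # low-pass along rows axis
    hi_r = (x[0::2, :] - x[1::2, :]) / SQRT2  # high-pass along rows axis
    ll = (lo_r[:, 0::2] + lo_r[:, 1::2]) / SQRT2  # coarse approximation
    lh = (lo_r[:, 0::2] - lo_r[:, 1::2]) / SQRT2  # detail sub-band
    hl = (hi_r[:, 0::2] + hi_r[:, 1::2]) / SQRT2  # detail sub-band
    hh = (hi_r[:, 0::2] - hi_r[:, 1::2]) / SQRT2  # detail sub-band
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    rows, cols = ll.shape
    lo_r = np.empty((rows, 2 * cols))
    hi_r = np.empty((rows, 2 * cols))
    lo_r[:, 0::2] = (ll + lh) / SQRT2
    lo_r[:, 1::2] = (ll - lh) / SQRT2
    hi_r[:, 0::2] = (hl + hh) / SQRT2
    hi_r[:, 1::2] = (hl - hh) / SQRT2
    x = np.empty((2 * rows, 2 * cols))
    x[0::2, :] = (lo_r + hi_r) / SQRT2
    x[1::2, :] = (lo_r - hi_r) / SQRT2
    return x


# Stand-in for a sinogram: any 2D array with even dimensions works.
rng = np.random.default_rng(0)
sino = rng.random((8, 6))
bands = haar_dwt2(sino)
restored = haar_idwt2(*bands)
```

The transform splits the input into one coarse sub-band and three detail sub-bands at half resolution, and it is invertible, so a network that processes the sub-bands can still map back to the original domain without loss.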

Artificial Intelligence

[AI-0] Marvel: Accelerating Safe Online Reinforcement Learning with Finetuned Offline Policy

Link: https://arxiv.org/abs/2412.04426
Authors: Keru Chen, Honghao Wei, Zhigang Deng, Sen Lin
Keywords: environment interactions hinder, extensive environment interactions, safe reinforcement learning, current online safe, online safe reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The high costs and risks involved in extensive environment interactions hinder the practical application of current online safe reinforcement learning (RL) methods. While offline safe RL addresses this by learning policies from static datasets, the performance therein is usually limited due to reliance on data quality and challenges with out-of-distribution (OOD) actions. Inspired by recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to facilitate faster and safer online policy learning, a direction that has yet to be fully investigated. To fill this gap, we first demonstrate that naively applying existing O2O algorithms from standard RL would not work well in the safe RL setting due to two unique challenges: erroneous Q-estimations, resulting from offline-online objective mismatch and offline cost sparsity, and Lagrangian mismatch, resulting from difficulties in aligning Lagrange multipliers between offline and online policies. To address these challenges, we introduce Marvel, a novel framework for O2O safe RL, comprising two key components that work in concert: Value Pre-Alignment to align the Q-functions with the underlying truth before online learning, and Adaptive PID Control to effectively adjust the Lagrange multipliers during online finetuning. Extensive experiments demonstrate that Marvel significantly outperforms existing baselines in both reward maximization and safety constraint satisfaction. By introducing the first policy-finetuning based framework for O2O safe RL, which is compatible with many offline and online safe RL methods, our work has great potential to advance the field towards more efficient and practical safe RL solutions.
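
To make the idea of adjusting a Lagrange multiplier with PID control more concrete, here is a generic sketch of a PID-style multiplier update for a cost-constrained RL problem. It is not Marvel's Adaptive PID Control: the class name, gains, cost limit, and clipping choices are all assumptions for illustration, and the adaptive part of the paper's method is not reproduced.

```python
class PIDLagrangian:
    """Generic PID update for a Lagrange multiplier (illustrative only)."""

    def __init__(self, kp, ki, kd, cost_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, episode_cost):
        # Positive error means the safety constraint is being violated.
        error = episode_cost - self.cost_limit
        # Integral term accumulates violations but is kept non-negative.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only when the violation is growing.
        derivative = max(0.0, error - self.prev_error)
        self.prev_error = error
        multiplier = (self.kp * error
                      + self.ki * self.integral
                      + self.kd * derivative)
        return max(0.0, multiplier)  # multipliers must stay non-negative


# Hypothetical gains and cost limit, chosen only for the demonstration.
controller = PIDLagrangian(kp=0.1, ki=0.01, kd=0.0, cost_limit=25.0)
lam_violating = controller.update(episode_cost=35.0)  # cost above limit
lam_safe = controller.update(episode_cost=15.0)       # cost below limit
print(lam_violating, lam_safe)
```

The multiplier rises when the measured cost exceeds the limit and relaxes to zero once the policy is safe again; in a Lagrangian method the policy would then maximise reward minus the multiplier times cost.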

[AI-1] Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation

Link: https://arxiv.org/abs/2412.04415
Authors: Xuying Li, Zhuo Li, Yuji Kosuga, Yasuhiro Yoshida, Victor Bian
Keywords: large language models, transformed human-computer interactions, powered by large, language models, enabling seamless
Categories: Artificial Intelligence (cs.AI)

Abstract:AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication. While these advancements offer immense utility, they also inherit and amplify inherent safety risks such as bias, fairness, hallucinations, privacy breaches, and a lack of transparency. This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents. Specifically, we test the hypothesis that a deceptively simple adversarial prefix, such as "Ignore the document", can compel LLMs to produce dangerous or unintended outputs by bypassing their contextual safeguards. Through experimentation, we demonstrate a high attack success rate (ASR), revealing the fragility of existing LLM defenses. These findings emphasize the urgent need for robust, multi-layered security measures tailored to mitigate vulnerabilities at the LLM level and within broader agent-based architectures.

[AI-2] Machine Theory of Mind for Autonomous Cyber-Defence

Link: https://arxiv.org/abs/2412.04367
Authors: Luke Swaby, Matthew Stewart, Daniel Harrold, Chris Willis, Gregory Palmer
Keywords: Intelligent autonomous agents, Intelligent autonomous, autonomous agents hold, Autonomous Cyber Operations, domain of cyber-security
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 29 pages, 17 figures, 12 tables

Abstract:Intelligent autonomous agents hold much potential for the domain of cyber-security. However, due to many state-of-the-art approaches relying on uninterpretable black-box models, there is growing demand for methods that offer stakeholders clear and actionable insights into their latent beliefs and motivations. To address this, we evaluate Theory of Mind (ToM) approaches for Autonomous Cyber Operations. Upon learning a robust prior, ToM models can predict an agent’s goals, behaviours, and contextual beliefs given only a handful of past behaviour observations. In this paper, we introduce a novel Graph Neural Network (GNN)-based ToM architecture tailored for cyber-defence, Graph-In, Graph-Out (GIGO)-ToM, which can accurately predict both the targets and attack trajectories of adversarial cyber agents over arbitrary computer network topologies. To evaluate the latter, we propose a novel extension of the Wasserstein distance for measuring the similarity of graph-based probability distributions. Whereas the standard Wasserstein distance lacks a fixed reference scale, we introduce a graph-theoretic normalization factor that enables a standardized comparison between networks of different sizes. We furnish this metric, which we term the Network Transport Distance (NTD), with a weighting function that emphasizes predictions according to custom node features, allowing network operators to explore arbitrary strategic considerations. Benchmarked against a Graph-In, Dense-Out (GIDO)-ToM architecture in an abstract cyber-defence environment, our empirical evaluations show that GIGO-ToM can accurately predict the goals and behaviours of various unseen cyber-attacking agents across a range of network topologies, as well as learn embeddings that can effectively characterize their policies.

[AI-3] Artificial intelligence and the internal processes of creativity

Link: https://arxiv.org/abs/2412.04366
Authors: Jaan Aru
Keywords: generating creative outputs, internal processes, systems capable, capable of generating, outputs are reshaping
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Abstract:Artificial intelligence (AI) systems capable of generating creative outputs are reshaping our understanding of creativity. This shift presents an opportunity for creativity researchers to reevaluate the key components of the creative process. In particular, the advanced capabilities of AI underscore the importance of studying the internal processes of creativity. This paper explores the neurobiological machinery that underlies these internal processes and describes the experiential component of creativity. It is concluded that although the products of artificial and human creativity can be similar, the internal processes are different. The paper also discusses how AI may negatively affect the internal processes of human creativity, such as the development of skills, the integration of knowledge, and the diversity of ideas.

[AI-4] Action Mapping for Reinforcement Learning in Continuous Environments with Constraints

Link: https://arxiv.org/abs/2412.04327
Authors: Mirco Theile, Lukas Dirnberger, Raphael Trumpp, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli
Keywords: Deep reinforcement learning, constraints remains challenging, remains challenging due, poor sample efficiency, Deep reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:Deep reinforcement learning (DRL) has had success across various domains, but applying it to environments with constraints remains challenging due to poor sample efficiency and slow convergence. Recent literature explored incorporating model knowledge to mitigate these problems, particularly through the use of models that assess the feasibility of proposed actions. However, integrating feasibility models efficiently into DRL pipelines in environments with continuous action spaces is non-trivial. We propose a novel DRL training strategy utilizing action mapping that leverages feasibility models to streamline the learning process. By decoupling the learning of feasible actions from policy optimization, action mapping allows DRL agents to focus on selecting the optimal action from a reduced feasible action set. We demonstrate through experiments that action mapping significantly improves training performance in constrained environments with continuous action spaces, especially with imperfect feasibility models.

[AI-5] GRAM: Generalization in Deep RL with a Robust Adaptation Module

Link: https://arxiv.org/abs/2412.04323
Authors: James Queeney, Xiaoyi Cai, Mouhacine Benosman, Jonathan P. How
Keywords: real-world settings requires, deep reinforcement learning, deep reinforcement, reinforcement learning, real-world settings
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)

Abstract:The reliable deployment of deep reinforcement learning in real-world settings requires the ability to generalize across a variety of conditions, including both in-distribution scenarios seen during training as well as novel out-of-distribution scenarios. In this work, we present a framework for dynamics generalization in deep reinforcement learning that unifies these two distinct types of generalization within a single architecture. We introduce a robust adaptation module that provides a mechanism for identifying and reacting to both in-distribution and out-of-distribution environment dynamics, along with a joint training pipeline that combines the goals of in-distribution adaptation and out-of-distribution robustness. Our algorithm GRAM achieves strong generalization performance across in-distribution and out-of-distribution scenarios upon deployment, which we demonstrate on a variety of realistic simulated locomotion tasks with a quadruped robot.

[AI-6] PoTable: Programming Standardly on Table-based Reasoning Like a Human Analyst

Link: https://arxiv.org/abs/2412.04272
Authors: Qingyang Mao, Qi Liu, Zhi Li, Mingyue Cheng, Zheng Zhang, Rui Li
Keywords: Large Language Model, Language Model, Large Language, substantial research interest, garnered substantial research
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures

Abstract:Table-based reasoning has garnered substantial research interest, particularly in its integration with Large Language Models (LLMs), which have revolutionized the general reasoning paradigm. Numerous LLM-based studies introduce symbolic tools (e.g., databases, Python) as assistants to extend human-like abilities in structured table understanding and complex arithmetic computations. However, these studies could better simulate human cognitive behavior when using symbolic tools, as they still suffer from the limitations of non-standard logical splits and constrained operation pools. In this study, we propose PoTable as a novel table-based reasoning method that simulates a human tabular analyst, which integrates a Python interpreter as the real-time executor accompanied by an LLM-based operation planner and code generator. Specifically, PoTable follows a human-like logical stage split and extends the operation pool into an open-world space without any constraints. Through planning and executing in each distinct stage, PoTable completes the entire reasoning process in a standard manner and produces superior reasoning results along with highly accurate, step-by-step commented and completely executable programs. Accordingly, the effectiveness and explainability of PoTable are fully demonstrated. Extensive experiments over three evaluation datasets from two public benchmarks on two backbones show the outstanding performance of our approach. In particular, GPT-based PoTable achieves over 4% higher absolute accuracy than runners-up on all evaluation datasets.

[AI-7] Transient Multi-Agent Path Finding for Lifelong Navigation in Dense Environments ICAPS2025

Link: https://arxiv.org/abs/2412.04256
Authors: Jonathan Morag, Noy Gabay, Daniel Koyfman, Roni Stern
Keywords: Multi-Agent Path Finding, finding conflict-free paths, Path Finding, Multi-Agent Path, MAPF
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Submitted to The 35th International Conference on Automated Planning and Scheduling (ICAPS 2025)

Abstract:Multi-Agent Path Finding (MAPF) deals with finding conflict-free paths for a set of agents from an initial configuration to a given target configuration. The Lifelong MAPF (LMAPF) problem is a well-studied online version of MAPF in which an agent receives a new target when it reaches its current target. The common approach for solving LMAPF is to treat it as a sequence of MAPF problems, periodically replanning from the agents’ current configurations to their current targets. A significant drawback in this approach is that in MAPF the agents must reach a configuration in which all agents are at their targets simultaneously, which is needlessly restrictive for LMAPF. Techniques have been proposed to indirectly mitigate this drawback. We describe cases where these mitigation techniques fail. As an alternative, we propose to solve LMAPF problems by solving a sequence of modified MAPF problems, in which the objective is for each agent to eventually visit its target, but not necessarily for all agents to do so simultaneously. We refer to this MAPF variant as Transient MAPF (TMAPF) and propose several algorithms for solving it based on existing MAPF algorithms. A limited experimental evaluation identifies some cases where using a TMAPF algorithm instead of a MAPF algorithm with an LMAPF framework can improve the system throughput significantly.
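
The difference between the two goal conditions described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the path representation, the function names, and the toy corridor are hypothetical, paths are assumed to have equal length, and conflict checking is left out entirely.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]
Paths = Dict[str, List[Cell]]  # agent name -> position at each timestep


def mapf_goal_reached(paths: Paths, targets: Dict[str, Cell]) -> bool:
    """Classical MAPF: at the final timestep every agent is at its target."""
    return all(path[-1] == targets[agent] for agent, path in paths.items())


def tmapf_goal_reached(paths: Paths, targets: Dict[str, Cell]) -> bool:
    """Transient MAPF: every agent visits its target at some timestep."""
    return all(targets[agent] in path for agent, path in paths.items())


# Hypothetical example: agent "a" passes through its target and keeps moving.
paths = {
    "a": [(0, 0), (0, 1), (0, 2)],
    "b": [(1, 2), (1, 1), (1, 0)],
}
targets = {"a": (0, 1), "b": (1, 0)}

print(mapf_goal_reached(paths, targets))   # False
print(tmapf_goal_reached(paths, targets))  # True
```

The same plan fails the simultaneous condition but satisfies the transient one, which is the relaxation the abstract argues is sufficient for the lifelong setting.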

[AI-8] HyperMARL: Adaptive Hypernetworks for Multi-Agent RL

Link: https://arxiv.org/abs/2412.04233
Authors: Kale-ab Abebe Tessera, Arrasy Rahman, Stefano V. Albrecht
Keywords: Balancing individual specialisation, Balancing individual, multi-agent reinforcement learning, multi-agent reinforcement, Balancing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Balancing individual specialisation and shared behaviours is a critical challenge in multi-agent reinforcement learning (MARL). Existing methods typically focus on encouraging diversity or leveraging shared representations. Full parameter sharing (FuPS) improves sample efficiency but struggles to learn diverse behaviours when required, while no parameter sharing (NoPS) enables diversity but is computationally expensive and sample inefficient. To address these challenges, we introduce HyperMARL, a novel approach using hypernetworks to balance efficiency and specialisation. HyperMARL generates agent-specific actor and critic parameters, enabling agents to adaptively exhibit diverse or homogeneous behaviours as needed, without modifying the learning objective or requiring prior knowledge of the optimal diversity. Furthermore, HyperMARL decouples agent-specific and state-based gradients, which empirically correlates with reduced policy gradient variance, potentially offering insights into its ability to capture diverse behaviours. Across MARL benchmarks requiring homogeneous, heterogeneous, or mixed behaviours, HyperMARL consistently matches or outperforms FuPS, NoPS, and diversity-focused methods, achieving NoPS-level diversity with a shared architecture. These results highlight the potential of hypernetworks as a versatile approach to the trade-off between specialisation and shared behaviours in MARL.

[AI-9] Relationships between Keywords and Strong Beats in Lyrical Music

Link: https://arxiv.org/abs/2412.04202
Authors: Callie C. Liao, Duoduo Liao, Ellie L. Zhang
Keywords: Artificial Intelligence, features remains limited, strong beats, strong, beats
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by IEEE BigData 2024

Abstract:Artificial Intelligence (AI) song generation has emerged as a popular topic, yet the focus on exploring the latent correlations between specific lyrical and rhythmic features remains limited. In contrast, this pilot study particularly investigates the relationships between keywords and rhythmically stressed features such as strong beats in songs. It focuses on several key elements: keywords or non-keywords, stressed or unstressed syllables, and strong or weak beats, with the aim of uncovering insightful correlations. Experimental results indicate that, on average, 80.8% of keywords land on strong beats, whereas 62% of non-keywords fall on weak beats. The relationship between stressed syllables and strong or weak beats is weak, revealing that keywords have the strongest relationships with strong beats. Additionally, the lyrics-rhythm matching score, a key matching metric measuring keywords on strong beats and non-keywords on weak beats across various time signatures, is 0.765, while the matching score for syllable types is 0.495. This study demonstrates that word types strongly align with their corresponding beat types, as evidenced by the distinct patterns, whereas syllable types exhibit a much weaker alignment. This disparity underscores the greater reliability of word types in capturing rhythmic structures in music, highlighting their crucial role in effective rhythmic matching and analysis. We also conclude that keywords that consistently align with strong beats are more reliable indicators of lyrics-rhythm associations, providing valuable insights for AI-driven song generation through enhanced structural analysis. Furthermore, our development of tailored Lyrics-Rhythm Matching (LRM) metrics maximizes lyrical alignments with corresponding beat stresses, and our novel LRM file format captures critical lyrical and rhythmic information without needing original sheet music.
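
To make the idea of a lyrics-rhythm matching score concrete, here is a minimal Python sketch. The abstract does not give the formula, so the definition below (the share of words whose type agrees with the beat type) is an assumption, and the annotated line is made up for illustration.

```python
from typing import List, Tuple

# Each word is (is_keyword, on_strong_beat); these annotations are invented.
Word = Tuple[bool, bool]


def matching_score(words: List[Word]) -> float:
    """Share of words that are keywords on strong beats or non-keywords on weak beats."""
    if not words:
        return 0.0
    matches = sum(1 for is_keyword, on_strong in words if is_keyword == on_strong)
    return matches / len(words)


def keyword_strong_rate(words: List[Word]) -> float:
    """Share of keywords that land on strong beats."""
    keyword_beats = [on_strong for is_keyword, on_strong in words if is_keyword]
    return sum(keyword_beats) / len(keyword_beats) if keyword_beats else 0.0


line = [(True, True), (False, False), (True, True), (False, True), (True, False)]
print(matching_score(line))       # 0.6
print(keyword_strong_rate(line))  # two of the three keywords
```

The paper's LRM metrics also account for time signatures, which this sketch ignores.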

[AI-10] Directed Structural Adaptation to Overcome Statistical Conflicts and Enable Continual Learning AAAI-2024

Link: https://arxiv.org/abs/2412.04190
Authors: Zeki Doruk Erden, Boi Faltings
Keywords: Adaptive networks today, overparameterized fixed topologies, networks today rely, Adaptive networks, statistical conflicts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Presented in Deployable AI (DAI) workshop at AAAI-2024

Abstract:Adaptive networks today rely on overparameterized fixed topologies that cannot break through the statistical conflicts they encounter in the data they are exposed to, and are prone to “catastrophic forgetting” as the network attempts to reuse the existing structures to learn new tasks. We propose a structural adaptation method, DIRAD, that can complexify as needed and in a directed manner without being limited by statistical conflicts within a dataset. We then extend this method and present the PREVAL framework, designed to prevent “catastrophic forgetting” in continual learning by detecting new data and assigning encountered data to suitable models adapted to process them, without needing task labels anywhere in the workflow. We show the reliability of DIRAD in growing a network with high performance that is orders of magnitude simpler than fixed-topology networks, and demonstrate the proof-of-concept operation of PREVAL, in which continual adaptation to new tasks is observed while being able to detect and discern previously encountered tasks.

[AI-11] Leveraging Large Language Models to Generate Course-specific Semantically Annotated Learning Objects

Link: https://arxiv.org/abs/2412.04185
Authors: Dominic Lohr, Marc Berges, Abhishek Chugh, Michael Kohlhase, Dennis Müller
Keywords: undergone significant transformations, automated question generation, past few decades, process and methodology, methodology of automated
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at Journal of Computer Assisted Learning (2024)

Abstract:Background: Over the past few decades, the process and methodology of automated question generation (AQG) have undergone significant transformations. Recent progress in generative natural language models has opened up new potential in the generation of educational content. Objectives: This paper explores the potential of large language models (LLMs) for generating computer science questions that are sufficiently annotated for automatic learner model updates, are fully situated in the context of a particular course, and address the cognitive dimension understand. Methods: Unlike previous attempts that might use basic methods like ChatGPT, our approach involves more targeted strategies such as retrieval-augmented generation (RAG) to produce contextually relevant and pedagogically meaningful learning objects. Results and Conclusions: Our results show that generating structural, semantic annotations works well. However, this success was not reflected in the case of relational annotations. The quality of the generated questions often did not meet educational standards, highlighting that although LLMs can contribute to the pool of learning materials, their current level of performance requires significant human intervention to refine and validate the generated content.

[AI-12] Bench-CoE: a Framework for Collaboration of Experts from Benchmark

Link: https://arxiv.org/abs/2412.04167
Authors: Yuanshuai Wang, Xingjian Zhang, Jinkun Zhao, Siwei Wen, Peilin Feng, Shuhao Liao, Lei Huang, Wenjun Wu
Keywords: key technologies driving, technologies driving intelligent, driving intelligent systems, Large Language Models, handle multiple tasks
Categories: Artificial Intelligence (cs.AI)
Comments: The code is available at this https URL

Abstract:Large Language Models (LLMs) are key technologies driving intelligent systems to handle multiple tasks. To meet the demands of various tasks, an increasing number of LLM-driven experts with diverse capabilities have been developed, accompanied by corresponding benchmarks to evaluate their performance. This paper proposes the Bench-CoE framework, which enables Collaboration of Experts (CoE) by effectively leveraging benchmark evaluations to achieve optimal performance across various tasks. Bench-CoE includes a set of expert models, a router for assigning tasks to corresponding experts, and a benchmark dataset for training the router. Moreover, we formulate Query-Level and Subject-Level approaches based on our framework, and analyze the merits and drawbacks of these two approaches. Finally, we conduct a series of experiments with varying data distributions on both language and multimodal tasks to validate that our proposed Bench-CoE outperforms any single model in terms of overall performance. We hope this method serves as a baseline for further research in this area. The code is available at this https URL.
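
A subject-level router of the kind mentioned in this abstract can be approximated by a lookup table built from benchmark scores. The sketch below is a deliberate simplification: the paper trains its router, whereas here the routing is a plain argmax, and the expert names and scores are hypothetical.

```python
from typing import Dict

# Hypothetical benchmark accuracies per subject for each expert model.
benchmark_scores: Dict[str, Dict[str, float]] = {
    "math_expert": {"math": 0.81, "law": 0.42, "biology": 0.55},
    "general_expert": {"math": 0.60, "law": 0.66, "biology": 0.70},
}


def build_subject_router(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each subject to the expert with the best benchmark score on it."""
    subjects = {s for per_expert in scores.values() for s in per_expert}
    return {
        subject: max(scores, key=lambda e: scores[e].get(subject, 0.0))
        for subject in subjects
    }


router = build_subject_router(benchmark_scores)
print(router["math"])     # math_expert
print(router["biology"])  # general_expert
```

A query-level variant would need a model that predicts the best expert per individual query, which a static table like this cannot express.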

[AI-13] Understanding Memorization in Generative Models via Sharpness in Probability Landscapes

Link: https://arxiv.org/abs/2412.04140
Authors: Dongjae Jeon, Dueun Kim, Albert No
Keywords: log probability density, introduce a geometric, geometric framework, framework to analyze, probability density
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we introduce a geometric framework to analyze memorization in diffusion models using the eigenvalues of the Hessian of the log probability density. We propose that memorization arises from isolated points in the learned probability distribution, characterized by sharpness in the probability landscape, as indicated by large negative eigenvalues of the Hessian. Through experiments on various datasets, we demonstrate that these eigenvalues effectively detect and quantify memorization. Our approach provides a clear understanding of memorization in diffusion models and lays the groundwork for developing strategies to ensure secure and reliable generative models.
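
The link between sharpness and large negative Hessian eigenvalues can be seen on a toy example. The sketch below uses a one-dimensional Gaussian, where the Hessian of the log density is a single number equal to -1/σ²; it is not a diffusion model, and the function names and the finite-difference estimator are choices made here for illustration.

```python
import math


def log_gaussian(x: float, mean: float, std: float) -> float:
    """Log density of a 1D Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def second_derivative(f, x: float, h: float = 1e-3) -> float:
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


# A broad mode versus a very narrow one standing in for an isolated point.
broad = second_derivative(lambda x: log_gaussian(x, 0.0, 1.0), 0.0)
sharp = second_derivative(lambda x: log_gaussian(x, 0.0, 0.05), 0.0)
print(broad)  # close to -1
print(sharp)  # close to -400
```

The narrower the peak, the more negative the curvature, which is the signal the abstract proposes to use for detecting memorized samples.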

[AI-14] Monet: Mixture of Monosemantic Experts for Transformers

Link: https://arxiv.org/abs/2412.04139
Authors: Jungwoo Park, Young Jin Ahn, Kee-Eung Kim, Jaewoo Kang
Keywords: toxic content generation, preventing undesirable behaviors, Understanding the internal, content generation, computations of large
Categories: Artificial Intelligence (cs.AI)

Abstract:Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity – where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior. The source code and pretrained checkpoints are available at this https URL.
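
The square-root scaling claim is easy to check numerically. The snippet below only illustrates the arithmetic; it says nothing about how Monet's expert decomposition achieves it, and the baseline of 1,024 experts is an arbitrary comparison point chosen here.

```python
import math


def sqrt_scaling_factor(num_experts: int, baseline_experts: int) -> float:
    """Growth of a quantity proportional to sqrt(N) when N grows from the baseline."""
    return math.sqrt(num_experts / baseline_experts)


num_experts = 262_144
print(math.isqrt(num_experts))                  # 512
print(num_experts // 1_024)                     # 256
print(sqrt_scaling_factor(num_experts, 1_024))  # 16.0
```

Going from 1,024 to 262,144 experts multiplies the expert count by 256, while a quantity that scales with the square root grows only 16-fold.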

[AI-15] DeepFEA: Deep Learning for Prediction of Transient Finite Element Analysis Solutions

Link: https://arxiv.org/abs/2412.04121
Authors: Georgios Triantafyllou, Panagiotis G. Kalozoumis, George Dimas, Dimitris K. Iakovidis
Keywords: Finite Element Analysis, simulating physical phenomena, Finite Element, Element Analysis, computationally intensive method
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: This work has been submitted to a journal for possible publication

Abstract:Finite Element Analysis (FEA) is a powerful but computationally intensive method for simulating physical phenomena. Recent advancements in machine learning have led to surrogate models capable of accelerating FEA. Yet there are still limitations in developing surrogates of transient FEA models that can simultaneously predict the solutions for both nodes and elements with applicability on both the 2D and 3D domains. Motivated by this research gap, this study proposes DeepFEA, a deep learning-based framework that leverages a multilayer Convolutional Long Short-Term Memory (ConvLSTM) network branching into two parallel convolutional neural networks to predict the solutions for both nodes and elements of FEA models. The proposed network is optimized using a novel adaptive learning algorithm, called Node-Element Loss Optimization (NELO). NELO minimizes the error occurring at both branches of the network enabling the prediction of solutions for transient FEA simulations. The experimental evaluation of DeepFEA is performed on three datasets in the context of structural mechanics, generated to serve as publicly available reference datasets. The results show that DeepFEA can achieve less than 3% normalized mean and root mean squared error for 2D and 3D simulation scenarios, and inference times that are two orders of magnitude faster than FEA. In contrast, relevant state-of-the-art methods face challenges with multi-dimensional output and dynamic input prediction. Furthermore, DeepFEA’s robustness was demonstrated in a real-life biomedical scenario, confirming its suitability for accurate and efficient predictions of FEA simulations.

[AI-16] Enhancing Mathematical Reasoning in LLMs with Background Operators

Link: https://arxiv.org/abs/2412.04110
Authors: Jiajun Chen, Yik-Cheung Tam
Keywords: propose utilizing background, large language models, utilizing background operators, propose utilizing, reasoning in large
Categories: Artificial Intelligence (cs.AI)

Abstract:We propose utilizing background operators for mathematical reasoning in large language models (LLMs). To achieve this, we define a set of fundamental mathematical predicates as the basic building blocks. For each mathematical problem, we develop a Prolog solution that includes problem-specific predicates and intermediate predicates derived from these background operators, ensuring that each solution adheres to the defined operator set. We introduce the MATH-Prolog corpus, which is derived from the counting and probability categories of the MATH corpus. For efficient data augmentation, we apply K-fold cross-validated self-training. This method incrementally generates new Prolog solutions for each fold, incorporating those verified as correct into the training set throughout the model training process. Our experimental results demonstrate that 5-fold cross-validated self-training effectively identifies new, accurate Prolog solutions, achieving an accuracy of 84.6% on the cross-validated set and 84.8% on the test set when fine-tuning the Meta-Llama-3.1-8B-Instruct model. This approach successfully uncovers new solutions with fully computable inference steps for previously unseen problems. Additionally, incorporating the background mathematical predicates into the prompt enhances solution coverage.

[AI-17] Pre-train, Align, and Disentangle: Empowering Sequential Recommendation with Large Language Models

Link: https://arxiv.org/abs/2412.04107
Authors: Yuhao Wang, Junwei Pan, Xiangyu Zhao, Pengyue Jia, Wanyu Wang, Yuan Wang, Yue Liu, Dapeng Liu, Jie Jiang
Keywords: users’ historical interactions, sequential dependencies, evolving interests, dependencies in users’, users’ historical
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Sequential recommendation (SR) aims to model the sequential dependencies in users’ historical interactions to better capture their evolving interests. However, existing SR approaches primarily rely on collaborative data, which leads to limitations such as the cold-start problem and sub-optimal performance. Meanwhile, despite the success of large language models (LLMs), their application in industrial recommender systems is hindered by high inference latency, inability to capture all distribution statistics, and catastrophic forgetting. To this end, we propose a novel Pre-train, Align, and Disentangle (PAD) paradigm to empower recommendation models with LLMs. Specifically, we first pre-train both the SR and LLM models to get collaborative and textual embeddings. Next, a characteristic recommendation-anchored alignment loss is proposed using multi-kernel maximum mean discrepancy with Gaussian kernels. Finally, a triple-experts architecture, consisting of aligned and modality-specific experts with disentangled embeddings, is fine-tuned in a frequency-aware manner. Experiments conducted on three public datasets demonstrate the effectiveness of PAD, showing significant improvements and compatibility with various SR backbone models, especially on cold items. The implementation code and datasets will be publicly available.
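
Multi-kernel maximum mean discrepancy with Gaussian kernels, the quantity named in this abstract, has a standard textbook estimator that fits in a few lines. The sketch below is that generic biased estimator in plain Python; it does not reproduce the paper's recommendation-anchored loss, and the bandwidths and the two tiny embedding sets are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, bandwidths: Sequence[float]) -> float:
    """Sum of Gaussian kernels over several bandwidths (the multi-kernel part)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-sq_dist / (2.0 * s * s)) for s in bandwidths)


def mean_kernel(xs: List[Vector], ys: List[Vector], bandwidths: Sequence[float]) -> float:
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, bandwidths) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: List[Vector], ys: List[Vector],
                bandwidths: Sequence[float] = (0.5, 1.0, 2.0)) -> float:
    """Biased estimate of the squared MMD between two sets of embeddings."""
    return (
        mean_kernel(xs, xs, bandwidths)
        + mean_kernel(ys, ys, bandwidths)
        - 2.0 * mean_kernel(xs, ys, bandwidths)
    )


# Hypothetical collaborative and textual embeddings for two items.
collaborative = [[0.0, 0.1], [0.2, 0.0]]
textual = [[1.0, 1.1], [0.9, 1.2]]
print(mmd_squared(collaborative, collaborative))  # 0.0
print(mmd_squared(collaborative, textual))        # positive: the sets differ
```

Minimizing such a term pulls the two embedding distributions together, which is the role an alignment loss plays between the SR and LLM embedding spaces.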

[AI-18] Practical Considerations for Agentic LLM Systems

Link: https://arxiv.org/abs/2412.04093
Authors: Chris Sypherd, Vaishak Belle
Keywords: Large Language Models, strength of Large, Large Language, underlying models, Language Models
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures, 1 table

Abstract:As the strength of Large Language Models (LLMs) has grown over recent years, so too has interest in their use as the underlying models for autonomous agents. Although LLMs demonstrate emergent abilities and broad expertise across natural language domains, their inherent unpredictability makes the implementation of LLM agents challenging, resulting in a gap between related research and the real-world implementation of such systems. To bridge this gap, this paper frames actionable insights and considerations from the research community in the context of established application paradigms to enable the construction and facilitate the informed deployment of robust LLM agents. Namely, we position relevant research findings into four broad categories–Planning, Memory, Tools, and Control Flow–based on common practices in application-focused literature and highlight practical considerations to make when designing agentic LLMs for real-world applications, such as handling stochasticity and managing resources efficiently. While we do not conduct empirical evaluations, we do provide the necessary background for discussing critical aspects of agentic LLM designs, both in academia and industry.

[AI-19] Federated Learning in Mobile Networks: A Comprehensive Case Study on Traffic Forecasting

Link: https://arxiv.org/abs/2412.04081
Authors: Nikolaos Pavlidis, Vasileios Perifanis, Selim F. Yilmaz, Francesc Wilhelmi, Marco Miozzo, Pavlos S. Efraimidis, Remous-Aris Koutsiamanis, Pavol Mulinka, Paolo Dini
Keywords: efficient resource allocation, real-time cellular traffic, increasing demand, demand for efficient, efficient resource
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The increasing demand for efficient resource allocation in mobile networks has catalyzed the exploration of innovative solutions that could enhance the task of real-time cellular traffic prediction. Under these circumstances, federated learning (FL) stands out as a distributed and privacy-preserving solution to foster collaboration among different sites, thus enabling responsive near-the-edge solutions. In this paper, we comprehensively study the potential benefits of FL in telecommunications through a case study on federated traffic forecasting using real-world data from base stations (BSs) in Barcelona (Spain). Our study encompasses relevant aspects within the federated experience, including model aggregation techniques, outlier management, the impact of individual clients, personalized learning, and the integration of exogenous sources of data. The performed evaluation is based on both prediction accuracy and sustainability, thus showcasing the environmental impact of employed FL algorithms in various settings. The findings from our study highlight FL as a promising and robust solution for mobile traffic prediction, emphasizing its twin merits as a privacy-conscious and environmentally sustainable approach, while also demonstrating its capability to overcome data heterogeneity and ensure high-quality predictions, marking a significant stride towards its integration in mobile traffic management systems.

[AI-20] Does your model understand genes? A benchmark of gene properties for biological and text models

Link: https://arxiv.org/abs/2412.04075
Authors: Yoav Kan-Tor, Michael Morris Danziger, Eden Zohar, Matan Ninio, Yishai Shimoni
Keywords: deep learning methods, models, learning methods, recent years, application of deep
Categories: Artificial Intelligence (cs.AI)

Abstract:The application of deep learning methods, particularly foundation models, in biological research has surged in recent years. These models can be text-based or trained on underlying biological data, especially omics data of various types. However, comparing the performance of these models consistently has proven to be a challenge due to differences in training data and downstream tasks. To tackle this problem, we developed an architecture-agnostic benchmarking approach that, instead of evaluating the models directly, leverages entity representation vectors from each model and trains simple predictive models for each benchmarking task. This ensures that all types of models are evaluated using the same input and output types. Here we focus on gene properties collected from professionally curated bioinformatics databases. These gene properties are categorized into five major groups: genomic properties, regulatory functions, localization, biological processes, and protein properties. Overall, we define hundreds of tasks based on these databases, which include binary, multi-label, and multi-class classification tasks. We apply these benchmark tasks to evaluate expression-based models, large language models, protein language models, DNA-based models, and traditional baselines. Our findings suggest that text-based models and protein language models generally outperform expression-based models in genomic properties and regulatory functions tasks, whereas expression-based models demonstrate superior performance in localization tasks. These results should aid in the development of more informed artificial intelligence strategies for biological understanding and therapeutic discovery. To ensure the reproducibility and transparency of our findings, we have made the source code and benchmark data publicly accessible for further investigation and expansion at this http URL.

[AI-21] ProtDAT: A Unified Framework for Protein Sequence Design from Any Protein Text Description

Link: https://arxiv.org/abs/2412.04069
Authors: Xiao-Yu Guo, Yi-Fan Li, Yuan Liu, Xiaoyong Pan, Hong-Bin Shen
Keywords: advancing significant potential, enzyme engineering, advancing significant, drug development, development and enzyme
Categories: Artificial Intelligence (cs.AI)

Abstract:Protein design has become a critical method in advancing significant potential for various applications such as drug development and enzyme engineering. However, protein design methods utilizing large language models with solely pretraining and fine-tuning struggle to capture relationships in multi-modal protein data. To address this, we propose ProtDAT, a de novo fine-grained framework capable of designing proteins from any descriptive protein text input. ProtDAT builds upon the inherent characteristics of protein data to unify sequences and text as a cohesive whole rather than separate entities. It leverages an innovative multi-modal cross-attention, integrating protein sequences and textual information for a foundational level and seamless integration. Experimental results demonstrate that ProtDAT achieves the state-of-the-art performance in protein sequence generation, excelling in rationality, functionality, structural similarity, and validity. On 20,000 text-sequence pairs from Swiss-Prot, it improves pLDDT by 6%, TM-score by 0.26, and reduces RMSD by 1.2 Å, highlighting its potential to advance protein design.

[AI-22] Graph Neural Networks Need Cluster-Normalize-Activate Modules NEURIPS2024

Link: https://arxiv.org/abs/2412.04064
Authors: Arseny Skryagin, Felix Divo, Mohammad Amin Ali, Devendra Singh Dhami, Kristian Kersting
Keywords: Graph Neural Networks, Graph Neural, Neural Networks, non-Euclidean deep learning, deep learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures, 6 tables, accepted at NeurIPS 2024

Abstract:Graph Neural Networks (GNNs) are non-Euclidean deep learning models for graph-structured data. Despite their successful and diverse applications, oversmoothing prohibits deep architectures due to node features converging to a single fixed point. This severely limits their potential to solve complex tasks. To counteract this tendency, we propose a plug-and-play module consisting of three steps: Cluster-Normalize-Activate (CNA). By applying CNA modules, GNNs search and form super nodes in each layer, which are normalized and activated individually. We demonstrate in node classification and property prediction tasks that CNA significantly improves the accuracy over the state-of-the-art. Particularly, CNA reaches 94.18% and 95.75% accuracy on Cora and CiteSeer, respectively. It further benefits GNNs in regression tasks as well, reducing the mean squared error compared to all baselines. At the same time, GNNs with CNA require substantially fewer learnable parameters than competing architectures.

[AI-23] Expanding Deep Learning-based Sensing Systems with Multi-Source Knowledge Transfer

Link: https://arxiv.org/abs/2412.04060
Authors: Gaole Dai, Huatao Xu, Rui Tan, Mo Li
Keywords: limited labeled data, deep learning models, provide high-quality deep, high-quality deep learning, users or environments
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures

Abstract:Expanding the existing sensing systems to provide high-quality deep learning models for more domains, such as new users or environments, is challenged by the limited labeled data and the data and device heterogeneities. While knowledge distillation methods could overcome label scarcity and device heterogeneity, they assume the teachers are fully reliable and overlook the data heterogeneity, which prevents the direct adoption of existing models. To address this problem, this paper proposes an efficient knowledge transfer framework, HaKT, to expand sensing systems. It first selects multiple high-quality models from the system at a low cost and then fuses their knowledge by assigning sample-wise weights to their predictions. Later, the fused knowledge is selectively injected into the customized models for new domains based on the knowledge quality. Extensive experiments on different tasks, modalities, and settings show that HaKT outperforms state-of-the-art baselines by up to 16.5% accuracy and saves up to 39% communication traffic.
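
Fusing teacher predictions with sample-wise weights can be pictured as a weighted average of class-probability vectors, computed separately for each sample. The sketch below shows only that averaging step; how HaKT derives the weights is not stated in the abstract, so the weights, the teacher outputs, and the function name are all hypothetical.

```python
from typing import List


def fuse_predictions(teacher_probs: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted average of teacher class-probability vectors for one sample."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    num_classes = len(teacher_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, teacher_probs)) / total
        for c in range(num_classes)
    ]


# Three teachers, one sample, two classes; the weights are made up.
teacher_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
weights = [0.5, 0.3, 0.2]
fused = fuse_predictions(teacher_probs, weights)
print(fused)  # roughly [0.67, 0.33]
```

Because the weights are chosen per sample, a teacher that is unreliable on one input can be down-weighted there without being discarded everywhere.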

[AI-24] From Code to Play: Benchmarking Program Search for Games Using Large Language Models

Link: https://arxiv.org/abs/2412.04057
Authors: Manuel Eberhardinger, James Goodman, Alexander Dockhorn, Diego Perez-Liebana, Raluca D. Gaina, Duygu Çakmak, Setareh Maghsudi, Simon Lucas
Keywords: opening exciting opportunities, shown impressive capabilities, Large language models, Large language, applying program synthesis
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to Transactions on Games Special Issue on Large Language Models and Games

Abstract:Large language models (LLMs) have shown impressive capabilities in generating program code, opening exciting opportunities for applying program synthesis to games. In this work, we explore the potential of LLMs to directly synthesize usable code for a wide range of gaming applications, focusing on two programming languages, Python and Java. We use an evolutionary hill-climbing algorithm, where the mutations and seeds of the initial programs are controlled by LLMs. For Python, the framework covers various game-related tasks, including five miniature versions of Atari games, ten levels of Baba is You, an environment inspired by Asteroids, and a maze generation task. For Java, the framework contains 12 games from the TAG tabletop games framework. Across 29 tasks, we evaluated 12 language models for Python and 8 for Java. Our findings suggest that the performance of LLMs depends more on the task than on model size. While larger models generate more executable programs, these do not always result in higher-quality solutions but are much more expensive. No model has a clear advantage, although on any specific task, one model may be better. Trying many models on a problem and using the best results across them is more reliable than using just one.
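
The search loop described in this abstract follows the usual hill-climbing pattern: mutate the current best candidate and keep the result only if it scores at least as well. The sketch below shows that pattern with stand-ins. In the paper an LLM proposes the seed program and its mutations and the score comes from playing a game; here the candidate is a list of integers, the mutation is a random step, and the score is a distance to a made-up target.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def hill_climb(seed: T, mutate: Callable[[T], T],
               score: Callable[[T], float], steps: int) -> T:
    """Keep a mutated candidate only when it scores at least as well as the best."""
    best, best_score = seed, score(seed)
    for _ in range(steps):
        candidate = mutate(best)
        candidate_score = score(candidate)
        if candidate_score >= best_score:
            best, best_score = candidate, candidate_score
    return best


# Stand-ins for the LLM-driven mutation and the game-based evaluation.
rng = random.Random(0)
target = [3, 1, 4, 1, 5]


def mutate(params: List[int]) -> List[int]:
    child = list(params)
    i = rng.randrange(len(child))
    child[i] += rng.choice([-1, 1])
    return child


def score(params: List[int]) -> float:
    return -sum(abs(p - t) for p, t in zip(params, target))


result = hill_climb([0, 0, 0, 0, 0], mutate, score, steps=500)
print(result, score(result))
```

Swapping `mutate` for a call that asks a model to edit a program, and `score` for a game rollout, gives the shape of the framework without changing the loop.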

[AI-25] SocialMind: LLM-based Proactive AR Social Assistive System with Human-like Perception for In-situ Live Interactions

Link: https://arxiv.org/abs/2412.04036
Authors: Bufang Yang, Yunqi Guo, Lilin Xu, Zhenyu Yan, Hongkai Chen, Guoliang Xing, Xiaofan Jiang
Keywords: Social, human life, Social interactions, live social interactions, revolutionize human interactions
Categories: Artificial Intelligence (cs.AI)

Abstract:Social interactions are fundamental to human life. The recent emergence of large language models (LLMs)-based virtual assistants has demonstrated their potential to revolutionize human interactions and lifestyles. However, existing assistive systems mainly provide reactive services to individual users, rather than offering in-situ assistance during live social interactions with conversational partners. In this study, we introduce SocialMind, the first LLM-based proactive AR social assistive system that provides users with in-situ social assistance. SocialMind employs human-like perception leveraging multi-modal sensors to extract both verbal and nonverbal cues, social factors, and implicit personas, incorporating these social cues into LLM reasoning for social suggestion generation. Additionally, SocialMind employs a multi-tier collaborative generation strategy and proactive update mechanism to display social suggestions on Augmented Reality (AR) glasses, ensuring that suggestions are timely provided to users without disrupting the natural flow of conversation. Evaluations on three public datasets and a user study with 20 participants show that SocialMind achieves 38.3% higher engagement compared to baselines, and 95% of participants are willing to use SocialMind in their live social interactions.

[AI-26] Considerations Influencing Offense-Defense Dynamics From Artificial Intelligence

Link: https://arxiv.org/abs/2412.04029
Authors: Giulio Corsi, Kyle Kilian, Richard Mallah
Keywords: technologies presents profound, presents profound challenges, artificial intelligence, rapid advancement, advancement of artificial
Categories: Artificial Intelligence (cs.AI)

Abstract:The rapid advancement of artificial intelligence (AI) technologies presents profound challenges to societal safety. As AI systems become more capable, accessible, and integrated into critical services, the dual nature of their potential is increasingly clear. While AI can enhance defensive capabilities in areas like threat detection, risk assessment, and automated security operations, it also presents avenues for malicious exploitation and large-scale societal harm, for example through automated influence operations and cyber attacks. Understanding the dynamics that shape AI’s capacity to both cause harm and enhance protective measures is essential for informed decision-making regarding the deployment, use, and integration of advanced AI systems. This paper builds on recent work on offense-defense dynamics within the realm of AI, proposing a taxonomy to map and examine the key factors that influence whether AI systems predominantly pose threats or offer protective benefits to society. By establishing a shared terminology and conceptual foundation for analyzing these interactions, this work seeks to facilitate further research and discourse in this critical area.

[AI-27] Augmenting Minds or Automating Skills: The Differential Role of Human Capital in Generative AI's Impact on Creative Tasks

Link: https://arxiv.org/abs/2412.03963
Authors: Meiling Huang, Ming Jin, Ning Li
Keywords: raising critical questions, raising critical, societal implications, critical questions, beneficiaries and societal
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); General Economics (econ.GN)

Abstract:Generative AI is rapidly reshaping creative work, raising critical questions about its beneficiaries and societal implications. This study challenges prevailing assumptions by exploring how generative AI interacts with diverse forms of human capital in creative tasks. Through two random controlled experiments in flash fiction writing and song composition, we uncover a paradox: while AI democratizes access to creative tools, it simultaneously amplifies cognitive inequalities. Our findings reveal that AI enhances general human capital (cognitive abilities and education) by facilitating adaptability and idea integration but diminishes the value of domain-specific expertise. We introduce a novel theoretical framework that merges human capital theory with the automation-augmentation perspective, offering a nuanced understanding of human-AI collaboration. This framework elucidates how AI shifts the locus of creative advantage from specialized expertise to broader cognitive adaptability. Contrary to the notion of AI as a universal equalizer, our work highlights its potential to exacerbate disparities in skill valuation, reshaping workplace hierarchies and redefining the nature of creativity in the AI era. These insights advance theories of human capital and automation while providing actionable guidance for organizations navigating AI integration amidst workforce inequalities.

[AI-28] Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation

Link: https://arxiv.org/abs/2412.03944
Authors: Hao Yang, Qianghua Zhao, Lei Li
Keywords: numerous studies exploring, studies exploring factors, exploring factors influencing, large language models, prompting has significantly
Categories: Artificial Intelligence (cs.AI)

Abstract:Chain-of-Thought prompting has significantly enhanced the reasoning capabilities of large language models, with numerous studies exploring factors influencing its performance. However, the underlying mechanisms remain poorly understood. To further demystify the operational principles, this work examines three key aspects: decoding, projection, and activation, aiming to elucidate the changes that occur within models when employing Chain-of-Thought. Our findings reveal that LLMs effectively imitate exemplar formats while integrating them with their understanding of the question, exhibiting fluctuations in token logits during generation but ultimately producing a more concentrated logits distribution, and activating a broader set of neurons in the final layers, indicating more extensive knowledge retrieval compared to standard prompts. Our code and data will be publicly available when the paper is accepted.

[AI-29] Exploring AI Text Generation, Retrieval-Augmented Generation, and Detection Technologies: a Comprehensive Overview

Link: https://arxiv.org/abs/2412.03933
Authors: Fnu Neha, Deepshikha Bhati, Deepak Kumar Shukla, Angela Guercio, Ben Ward
Keywords: Artificial Intelligence, large language models, diverse applications, powerful text generation, large language
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:The rapid development of Artificial Intelligence (AI) has led to the creation of powerful text generation models, such as large language models (LLMs), which are widely used for diverse applications. However, concerns surrounding AI-generated content, including issues of originality, bias, misinformation, and accountability, have become increasingly prominent. This paper offers a comprehensive overview of AI text generators (AITGs), focusing on their evolution, capabilities, and ethical implications. This paper also introduces Retrieval-Augmented Generation (RAG), a recent approach that improves the contextual relevance and accuracy of text generation by integrating dynamic information retrieval. RAG addresses key limitations of traditional models, including their reliance on static knowledge and potential inaccuracies in handling real-world data. Additionally, the paper reviews detection tools that help differentiate AI-generated text from human-written content and discusses the ethical challenges these technologies pose. The paper explores future directions for improving detection accuracy, supporting ethical AI development, and increasing accessibility. The paper contributes to a more responsible and reliable use of AI in content creation through these discussions.

[AI-30] Integrating Various Software Artifacts for Better LLM-based Bug Localization and Program Repair

Link: https://arxiv.org/abs/2412.03905
Authors: Qiong Feng, Xiaotian Ma, Jiayi Sheng, Ziyuan Feng, Wei Song, Peng Liang
Keywords: streamline Automated Program, Automated Program Repair, garnered considerable attention, streamline Automated, Automated Program
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 22 pages, 11 images, 9 tables, Manuscript submitted to a journal (2024)

Abstract:LLMs have garnered considerable attention for their potential to streamline Automated Program Repair (APR). LLM-based approaches can either insert the correct code or directly generate patches when provided with buggy methods. However, most LLM-based APR methods rely on a single type of software information, without fully leveraging different software artifacts. Despite this, many LLM-based approaches do not explore which specific types of information best assist in APR. Addressing this gap is crucial for advancing LLM-based APR techniques. We propose DEVLoRe, which uses issue content (description and message) and stack error traces to localize buggy methods, then relies on debug information in buggy methods together with issue content and stack error traces to localize buggy lines and generate plausible patches that pass all unit tests. The results show that while issue content is particularly effective in assisting LLMs with fault localization and program repair, different types of software artifacts complement each other. By incorporating different artifacts, DEVLoRe successfully locates 49.3% and 47.6% of single and non-single buggy methods and generates 56.0% and 14.5% plausible patches for the Defects4J v2.0 dataset, respectively. This outperforms current state-of-the-art APR methods. The source code and experimental results of this work for replication are available at this https URL.

[AI-31] Using SlowFast Networks for Near-Miss Incident Analysis in Dashcam Videos

Link: https://arxiv.org/abs/2412.03903
Authors: Yucheng Zhang, Koichi Emura, Eiji Watanabe
Keywords: SlowFast deep neural, deep neural network, paper classifies near-miss, fast visual information, visual information processed
Categories: Artificial Intelligence (cs.AI)
Comments: Best Research Paper Award for Asia-Pacific Region, The 30th ITS World Congress 2024

Abstract:This paper classifies near-miss traffic videos using the SlowFast deep neural network that mimics the characteristics of the slow and fast visual information processed by two different streams from the M (Magnocellular) and P (Parvocellular) cells of the human brain. The approach significantly improves the accuracy of the traffic near-miss video analysis and presents insights into human visual perception in traffic scenarios. Moreover, it contributes to traffic safety enhancements and provides novel perspectives on the potential cognitive errors in traffic accidents.

[AI-32] Machine Learning-based Android Intrusion Detection System

Link: https://arxiv.org/abs/2412.03894
Authors: Madiha Tahreem, Ifrah Andleeb, Bilal Zahid Hussain, Arsalan Hameed
Keywords: android operating system, operating system, android operating, smart devices, SMS Fraud
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The Android operating system is installed on most smart devices. Intrusions into such operating systems are rising at a tremendous rate. With the introduction of such malicious data streams, smart devices are being subjected to various attacks such as phishing, spyware, SMS fraud, bots, and banking Trojans, among others. This paper applies machine learning classification algorithms to the security of Android APK files. Each APK data stream was marked as either malicious or non-malicious on the basis of different parameters. The machine learning classification techniques are then used to classify whether a newly installed application's signature falls within the malicious or non-malicious domain. If it falls within the malicious category, appropriate action can be taken, and the Android operating system can be shielded against illegal activities.

[AI-33] A Unified Framework for Evaluating the Effectiveness and Enhancing the Transparency of Explainable AI Methods in Real-World Applications

Link: https://arxiv.org/abs/2412.03884
Authors: Md. Ariful Islam, M. F. Mridha, Md Abrar Jahin, Nilanjan Dey
Keywords: models frequently constrains, rapid advancement, substantial advancements, black box, deep learning
Categories: Artificial Intelligence (cs.AI)

Abstract:The rapid advancement of deep learning has resulted in substantial advancements in AI-driven applications; however, the “black box” characteristic of these models frequently constrains their interpretability, transparency, and reliability. Explainable artificial intelligence (XAI) seeks to elucidate AI decision-making processes, guaranteeing that explanations faithfully represent the model’s rationale and correspond with human comprehension. Despite comprehensive research in XAI, a significant gap persists in standardized procedures for assessing the efficacy and transparency of XAI techniques across many real-world applications. This study presents a unified XAI evaluation framework incorporating extensive quantitative and qualitative criteria to systematically evaluate the correctness, interpretability, robustness, fairness, and completeness of explanations generated by AI models. The framework prioritizes user-centric and domain-specific adaptations, hence improving the usability and reliability of AI models in essential domains. To address deficiencies in existing evaluation processes, we suggest defined benchmarks and a systematic evaluation pipeline that includes data loading, explanation development, and thorough method assessment. The suggested framework’s relevance and variety are evidenced by case studies in healthcare, finance, agriculture, and autonomous systems. These provide a solid basis for the equitable and dependable assessment of XAI methodologies. This paradigm enhances XAI research by offering a systematic, flexible, and pragmatic method to guarantee transparency and accountability in AI systems across many real-world contexts.

[AI-34] Weak-to-Strong Generalization Through the Data-Centric Lens

Link: https://arxiv.org/abs/2412.03881
Authors: Changho Shin, John Cooper, Frederic Sala
Keywords: important machine learning, machine learning applications, learning applications including, highly data-efficient learning, applications including highly
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 39 pages

Abstract:The weak-to-strong generalization phenomenon is the driver for important machine learning applications including highly data-efficient learning and, most recently, performing superalignment. While decades of research have resulted in numerous algorithms that produce strong empirical performance, understanding what aspects of data enable weak-to-strong generalization has been understudied. We propose a simple data-centric mechanism that characterizes weak-to-strong generalization: the overlap density. Intuitively, generalization tracks the number of points that contain overlaps, i.e., both easy patterns (learnable by a weak model) and challenging patterns (only learnable by a stronger model), as with such points, weak predictions can be used to learn challenging patterns by stronger models. We provide a practical overlap detection algorithm to find such points in datasets and leverage them to learn, among multiple sources of data, which to query when seeking to maximize overlap density and thereby enhance weak-to-strong generalization. We present a theoretical result showing that the generalization benefit is a function of the overlap density and a regret bound for our data selection algorithm. Empirically, we validate the mechanism and the overlap detection algorithm on a wide array of settings.

[AI-35] Fine-Grained Sentiment Analysis of Electric Vehicle User Reviews: A Bidirectional LSTM Approach to Capturing Emotional Intensity in Chinese Text

Link: https://arxiv.org/abs/2412.03873
Authors: Shuhao Chen, Chengyi Tu
Keywords: improving product design, electric vehicle, industry has highlighted, rapid expansion, highlighted the importance
Categories: Artificial Intelligence (cs.AI)

Abstract:The rapid expansion of the electric vehicle (EV) industry has highlighted the importance of user feedback in improving product design and charging infrastructure. Traditional sentiment analysis methods often oversimplify the complexity of user emotions, limiting their effectiveness in capturing nuanced sentiments and emotional intensities. This study proposes a Bidirectional Long Short-Term Memory (Bi-LSTM) network-based sentiment scoring model to analyze user reviews of EV charging infrastructure. By assigning sentiment scores ranging from 0 to 5, the model provides a fine-grained understanding of emotional expression. Leveraging a dataset of 43,678 reviews from PC Auto, the study employs rigorous data cleaning and preprocessing, including tokenization and stop word removal, to optimize input for deep learning. The Bi-LSTM model demonstrates significant improvements over traditional approaches like SnowNLP across key evaluation metrics, including Mean Squared Error (MSE), Mean Absolute Error (MAE), and Explained Variance Score (EVS). These results highlight the model’s superior capability to capture nuanced sentiment dynamics, offering valuable insights for targeted product and service enhancements in the EV ecosystem.
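
The three evaluation metrics named in the abstract have standard definitions, sketched below in plain Python. The function names and the example scores are made up for illustration and are not taken from the paper; the explained variance score is computed as 1 - Var(y - ŷ) / Var(y).

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def _variance(values):
    """Population variance (divide by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """Explained Variance Score: 1 - Var(residuals) / Var(targets)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - _variance(residuals) / _variance(y_true)


if __name__ == "__main__":
    # Hypothetical sentiment scores on the 0 to 5 scale used in the abstract.
    y_true = [1.0, 2.0, 3.0, 4.0]
    y_pred = [1.5, 2.0, 2.5, 4.0]
    print(mse(y_true, y_pred), mae(y_true, y_pred), explained_variance(y_true, y_pred))
```

Lower MSE and MAE are better, while an explained variance score closer to 1 indicates that the predictions account for more of the variation in the targets.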

[AI-36] Training MLPs on Graphs without Supervision WSDM25

Link: https://arxiv.org/abs/2412.03864
Authors: Zehong Wang, Zheyuan Zhang, Chuxu Zhang, Yanfang Ye
Keywords: Graph Neural Networks, Neural Networks, financial fraud detection, real-time financial fraud, inference poses challenges
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Accepted by WSDM 25

Abstract:Graph Neural Networks (GNNs) have demonstrated their effectiveness in various graph learning tasks, yet their reliance on neighborhood aggregation during inference poses challenges for deployment in latency-sensitive applications, such as real-time financial fraud detection. To address this limitation, recent studies have proposed distilling knowledge from teacher GNNs into student Multi-Layer Perceptrons (MLPs) trained on node content, aiming to accelerate inference. However, these approaches often inadequately explore structural information when inferring unseen nodes. To this end, we introduce SimMLP, a Self-supervised framework for learning MLPs on graphs, designed to fully integrate rich structural information into MLPs. Notably, SimMLP is the first MLP-learning method that can achieve equivalence to GNNs in the optimal case. The key idea is to employ self-supervised learning to align the representations encoded by graph context-aware GNNs and neighborhood dependency-free MLPs, thereby fully integrating the structural information into MLPs. We provide a comprehensive theoretical analysis, demonstrating the equivalence between SimMLP and GNNs based on mutual information and inductive bias, highlighting SimMLP’s advanced structural learning capabilities. Additionally, we conduct extensive experiments on 20 benchmark datasets, covering node classification, link prediction, and graph classification, to showcase SimMLP’s superiority over state-of-the-art baselines, particularly in scenarios involving unseen nodes (e.g., inductive and cold-start node classification) where structural insights are crucial. Our codes are available at: this https URL.

[AI-37] How Good is ChatGPT in Giving Adaptive Guidance Using Knowledge Graphs in E-Learning Environments?

Link: https://arxiv.org/abs/2412.03856
Authors: Patrick Ocheja, Brendan Flanagan, Yiling Dai, Hiroaki Ogata
Keywords: large language models, increasingly harnessing large, harnessing large language, E-learning environments, tailored educational support
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)

Abstract:E-learning environments are increasingly harnessing large language models (LLMs) like GPT-3.5 and GPT-4 for tailored educational support. This study introduces an approach that integrates dynamic knowledge graphs with LLMs to offer nuanced student assistance. By evaluating past and ongoing student interactions, the system identifies and appends the most salient learning context to prompts directed at the LLM. Central to this method is the knowledge graph’s role in assessing a student’s comprehension of topic prerequisites. Depending on the categorized understanding (good, average, or poor), the LLM adjusts its guidance, offering advanced assistance, foundational reviews, or in-depth prerequisite explanations, respectively. Preliminary findings suggest students could benefit from this tiered support, achieving enhanced comprehension and improved task outcomes. However, several issues related to potential errors arising from LLMs were identified, which can potentially mislead students. This highlights the need for human intervention to mitigate these risks. This research aims to advance AI-driven personalized learning while acknowledging the limitations and potential pitfalls, thus guiding future research in technology and data-driven education.

[AI-38] What Do Machine Learning Researchers Mean by “Reproducible”? AAAI2025

Link: https://arxiv.org/abs/2412.03854
Authors: Edward Raff, Michel Benaroch, Sagar Samtani, Andrew L. Farris
Keywords: Artificial Intelligence, Machine Learning, concern that Artificial, spurred significant research, past few years
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: To appear in AAAI 2025, Senior Member Presentation Track

Abstract:The concern that Artificial Intelligence (AI) and Machine Learning (ML) are entering a “reproducibility crisis” has spurred significant research in the past few years. Yet with each paper, it is often unclear what someone means by “reproducibility”. Our work attempts to clarify the scope of “reproducibility” as displayed by the community at large. In doing so, we propose to refine the research to eight general topic areas. In this light, we see that each of these areas contains many works that do not advertise themselves as being about “reproducibility”, in part because they go back decades before the matter came to broader attention.

[AI-39] FedMetaMed: Federated Meta-Learning for Personalized Medication in Distributed Healthcare Systems

Link: https://arxiv.org/abs/2412.03851
Authors: Jiechao Gao, Yuangang Li
Keywords: individual patient characteristics, healthcare systems, tailor healthcare, patient characteristics, healthcare
Categories: Artificial Intelligence (cs.AI)

Abstract:Personalized medication aims to tailor healthcare to individual patient characteristics. However, the heterogeneity of patient data across healthcare systems presents significant challenges to achieving accurate and effective personalized treatments. Ethical concerns further complicate the aggregation of large volumes of data from diverse institutions. Federated Learning (FL) offers a promising decentralized solution by enabling collaborative model training through the exchange of client models rather than raw data, thus preserving privacy. However, existing FL methods often suffer from retrogression during server aggregation, leading to a decline in model performance in real-world medical FL settings. To address data variability in distributed healthcare systems, we introduce Federated Meta-Learning for Personalized Medication (FedMetaMed), which combines federated learning and meta-learning to create models that adapt to diverse patient data across healthcare systems. The FedMetaMed framework aims to produce superior personalized models for individual clients by addressing these limitations. Specifically, we introduce Cumulative Fourier Aggregation (CFA) at the server to improve stability and effectiveness in global knowledge aggregation. CFA achieves this by gradually integrating client models from low to high frequencies. At the client level, we implement a Collaborative Transfer Optimization (CTO) strategy with a three-step process - Retrieve, Reciprocate, and Refine - to enhance the personalized local model through seamless global knowledge transfer. Experiments on real-world medical imaging datasets demonstrate that FedMetaMed outperforms state-of-the-art FL methods, showing superior generalization even on out-of-distribution cohorts.

[AI-40] Towards Data Governance of Frontier AI Models

Link: https://arxiv.org/abs/2412.03824
Authors: Jason Hausenloy, Duncan McClements, Madhavendra Thakur
Keywords: frontier artificial intelligence, fine-tune today frontier, today frontier artificial, Data, frontier data governance
Categories: Artificial Intelligence (cs.AI)

Abstract:Data is essential to train and fine-tune today’s frontier artificial intelligence (AI) models and to develop future ones. To date, academic, legal, and regulatory work has primarily addressed how data can directly harm consumers and creators, such as through privacy breaches, copyright infringements, and bias and discrimination. Our work, instead, focuses on the comparatively neglected question of how data can enable new governance capacities for frontier AI models. This approach for “frontier data governance” opens up new avenues for monitoring and mitigating risks from advanced AI models, particularly as they scale and acquire specific dangerous capabilities. Still, frontier data governance faces challenges that stem from the fundamental properties of data itself: data is non-rival, often non-excludable, easily replicable, and increasingly synthesizable. Despite these inherent difficulties, we propose a set of policy mechanisms targeting key actors along the data supply chain, including data producers, aggregators, model developers, and data vendors. We provide a brief overview of 15 governance mechanisms, of which we centrally introduce five, underexplored policy recommendations. These include developing canary tokens to detect unauthorized use for producers; (automated) data filtering to remove malicious content for pre-training and post-training datasets; mandatory dataset reporting requirements for developers and vendors; improved security for datasets and data generation algorithms; and know-your-customer requirements for vendors. By considering data not just as a source of potential harm, but as a critical governance lever, this work aims to equip policymakers with a new tool for the governance and regulation of frontier AI models.

[AI-41] ELEMENT: Episodic and Lifelong Exploration via Maximum Entropy

Link: https://arxiv.org/abs/2412.03800
Authors: Hongming Li, Shujian Yu, Bin Liu, Jose C. Principe
Keywords: intrinsically motivated reinforcement, intrinsically motivated, downstream tasks, Lifelong Exploration, ENTropy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper proposes Episodic and Lifelong Exploration via Maximum ENTropy (ELEMENT), a novel, multiscale, intrinsically motivated reinforcement learning (RL) framework that is able to explore environments without using any extrinsic reward and effectively transfer the learned skills to downstream tasks. We advance the state of the art in three ways. First, we propose a multiscale entropy optimization to take care of the fact that previous maximum state entropy, for lifelong exploration with millions of state observations, suffers from vanishing rewards and becomes very expensive computationally across iterations. Therefore, we add an episodic maximum entropy over each episode to speed up the search further. Second, we propose a novel intrinsic reward for episodic entropy maximization named average episodic state entropy which provides the optimal solution for a theoretical upper bound of the episodic state entropy objective. Third, to speed up the lifelong entropy maximization, we propose a k-nearest-neighbors (kNN) graph to organize the estimation of the entropy and updating processes that reduces the computation substantially. Our ELEMENT significantly outperforms state-of-the-art intrinsic rewards in both episodic and lifelong setups. Moreover, it can be exploited in task-agnostic pre-training, collecting data for offline reinforcement learning, etc.
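
To give a feel for how nearest-neighbor distances can act as an entropy-style exploration signal, here is a brute-force sketch of a generic particle-based reward: each state is rewarded by the log of its distance to its k-th nearest neighbor. This is a commonly used heuristic and an assumption on our part, not the ELEMENT reward or its kNN graph structure; the function name and the toy states are hypothetical.

```python
import math


def knn_entropy_rewards(states, k=1):
    """Reward each state by log(1 + distance to its k-th nearest neighbor).

    States in sparsely visited regions sit far from their neighbors and get
    larger rewards, which pushes an agent toward broader state coverage.
    Brute force, O(n^2) distance computations; requires len(states) > k.
    """
    rewards = []
    for i, state in enumerate(states):
        distances = sorted(
            math.dist(state, other) for j, other in enumerate(states) if j != i
        )
        rewards.append(math.log(1.0 + distances[k - 1]))
    return rewards


if __name__ == "__main__":
    # Three clustered states and one far-away state (made-up 2D points).
    visited = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
    print(knn_entropy_rewards(visited, k=1))
```

The quadratic cost of this brute-force version is the kind of overhead that motivates organizing neighbors in a graph when the number of stored states grows large.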

[AI-42] Automated Multi-Label Annotation for Mental Health Illnesses Using Large Language Models

Link: https://arxiv.org/abs/2412.03796
Authors: Abdelrahaman A. Hassan, Radwa J. Hanafy, Mohammed E. Fouda
Keywords: mental health, present significant challenges, mental health disorders, mental health conditions, diagnosis and treatment
Categories: Artificial Intelligence (cs.AI)

Abstract:The growing prevalence and complexity of mental health disorders present significant challenges for accurate diagnosis and treatment, particularly in understanding the interplay between co-occurring conditions. Mental health disorders, such as depression and anxiety, often co-occur, yet current datasets derived from social media posts typically focus on single-disorder labels, limiting their utility in comprehensive diagnostic analyses. This paper addresses this critical gap by proposing a novel methodology for cleaning, sampling, labeling, and combining data to create versatile multi-label datasets. Our approach introduces a synthetic labeling technique to transform single-label datasets into multi-label annotations, capturing the complexity of overlapping mental health conditions. To achieve this, two single-label datasets are first merged into a foundational multi-label dataset, enabling realistic analyses of co-occurring diagnoses. We then design and evaluate various prompting strategies for large language models (LLMs), ranging from single-label predictions to unrestricted prompts capable of detecting any present disorders. After rigorously assessing multiple LLMs and prompt configurations, the optimal combinations are identified and applied to label six additional single-disorder datasets from RMHD. The result is SPAADE-DR, a robust, multi-label dataset encompassing diverse mental health conditions. This research demonstrates the transformative potential of LLM-driven synthetic labeling in advancing mental health diagnostics from social media data, paving the way for more nuanced, data-driven insights into mental health care.

[AI-43] Safe Adaptive Cruise Control Under Perception Uncertainty: A Deep Ensemble and Conformal Tube Model Predictive Control Approach

Link: https://arxiv.org/abs/2412.03792
Authors: Xiao Li, Anouck Girard, Ilya Kolmanovsky
Keywords: Autonomous driving heavily, driving heavily relies, Autonomous driving, Deep Neural Network, environment for decision-making
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:Autonomous driving heavily relies on perception systems to interpret the environment for decision-making. To enhance robustness in these safety critical applications, this paper considers a Deep Ensemble of Deep Neural Network regressors integrated with Conformal Prediction to predict and quantify uncertainties. In the Adaptive Cruise Control setting, the proposed method performs state and uncertainty estimation from RGB images, informing the downstream controller of the DNN perception uncertainties. An adaptive cruise controller using Conformal Tube Model Predictive Control is designed to ensure probabilistic safety. Evaluations with a high-fidelity simulator demonstrate the algorithm’s effectiveness in speed tracking and safe distance maintaining, including in Out-Of-Distribution scenarios.
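
As background for the Conformal Prediction component mentioned in the abstract, the sketch below shows textbook split conformal prediction for a scalar regression output: held-out calibration residuals are turned into an interval radius with a finite-sample coverage guarantee of at least 1 - alpha under exchangeability. It is not the paper's Deep Ensemble or tube MPC pipeline, and all names and numbers here are illustrative assumptions.

```python
import math


def conformal_radius(calibration_residuals, alpha=0.1):
    """Return the split-conformal interval half-width.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual from a
    held-out calibration set. If that rank exceeds n, no finite radius gives
    the requested coverage, so infinity is returned.
    """
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return scores[rank - 1]


def prediction_interval(point_estimate, radius):
    """Symmetric interval around a new point prediction."""
    return point_estimate - radius, point_estimate + radius


if __name__ == "__main__":
    # Hypothetical calibration residuals (e.g., errors of a distance estimate).
    residuals = [(-1) ** i * i for i in range(1, 20)]
    radius = conformal_radius(residuals, alpha=0.1)
    print(radius, prediction_interval(100.0, radius))
```

A downstream controller could treat such an interval as a bound on the perception error, which is the general idea behind passing uncertainty estimates to a robust or tube-based controller.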

[AI-44] Coordinate In and Value Out: Training Flow Transformers in Ambient Space

Link: https://arxiv.org/abs/2412.03791
Authors: Yuyang Wang, Anurag Ranjan, Josh Susskind, Miguel Angel Bautista
Keywords: flow matching generative, Flow matching, data, powerful method, Flow
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 10 figures, 10 tables

Abstract:Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on unstructured data like 3D point clouds. These models are commonly trained in two stages: first, a data compressor (i.e., a variational auto-encoder) is trained, and in a subsequent training stage a flow matching generative model is trained in the low-dimensional latent space of the data compressor. This two stage paradigm adds complexity to the overall training recipe and sets obstacles for unifying models across data domains, as specific data compressors are used for different data modalities. To this end, we introduce Ambient Space Flow Transformers (ASFT), a domain-agnostic approach to learn flow matching transformers in ambient space, sidestepping the requirement of training compressors and simplifying the training process. We introduce a conditionally independent point-wise training objective that enables ASFT to make predictions continuously in coordinate space. Our empirical results demonstrate that using general purpose transformer blocks, ASFT effectively handles different data modalities such as images and 3D point clouds, achieving strong performance in both domains and outperforming comparable approaches. ASFT is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.

[AI-45] Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech

Link: https://arxiv.org/abs/2412.03784
Authors: Yerin Choi, Jeehyun Lee, Myoung-Wan Koo
Keywords (EN): subjective nature, Due, DNN models outperform, automatic severity evaluation, DNN models
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted to SLT 2024

Abstract:Due to the subjective nature of current clinical evaluation, the need for automatic severity evaluation in dysarthric speech has emerged. DNN models outperform ML models but lack user-friendly explainability. ML models offer explainable results at a feature level, but their performance is comparatively lower. Current ML models extract various features from raw waveforms to predict severity. However, existing methods do not encompass all dysarthric features used in clinical evaluation. To address this gap, we propose a feature extraction method that minimizes information loss. We introduce an ASR transcription as a novel feature extraction source. We finetune the ASR model for dysarthric speech, then use this model to transcribe dysarthric speech and extract word segment boundary information. This enables capturing finer pronunciation and broader prosodic features. These features improved severity prediction performance over existing features, reaching a balanced accuracy of 83.72%.

[AI-46] Expressivity of Representation Learning on Continuous-Time Dynamic Graphs: An Information-Flow Centric Review

Link: https://arxiv.org/abs/2412.03783
Authors: Sofiane Ennadir, Gabriela Zarzar Gandler, Filip Cornell, Lele Cao, Oleg Smirnov, Tianze Wang, Levente Zólyomi, Björn Brinne, Sahar Asadi
Keywords (EN): Graph Neural Networks, Neural Networks, learning expressive representations, Graph Representation Learning, social networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12-page main paper + 8-page appendix

Abstract:Graphs are ubiquitous in real-world applications, ranging from social networks to biological systems, and have inspired the development of Graph Neural Networks (GNNs) for learning expressive representations. While most research has centered on static graphs, many real-world scenarios involve dynamic, temporally evolving graphs, motivating the need for Continuous-Time Dynamic Graph (CTDG) models. This paper provides a comprehensive review of Graph Representation Learning (GRL) on CTDGs with a focus on Self-Supervised Representation Learning (SSRL). We introduce a novel theoretical framework that analyzes the expressivity of CTDG models through an Information-Flow (IF) lens, quantifying their ability to propagate and encode temporal and structural information. Leveraging this framework, we categorize existing CTDG methods based on their suitability for different graph types and application scenarios. Within the same scope, we examine the design of SSRL methods tailored to CTDGs, such as predictive and contrastive approaches, highlighting their potential to mitigate the reliance on labeled data. Empirical evaluations on synthetic and real-world datasets validate our theoretical insights, demonstrating the strengths and limitations of various methods across long-range, bi-partite and community-based graphs. This work offers both a theoretical foundation and practical guidance for selecting and developing CTDG models, advancing the understanding of GRL in dynamic settings.

[AI-47] Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

Link: https://arxiv.org/abs/2412.03773
Authors: Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross
Keywords (EN): low-rank algorithms implemented, compressing nonlinear feature-maps, discovering simpler, goal of mechanistic, mechanistic interpretability
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The goal of mechanistic interpretability is discovering simpler, low-rank algorithms implemented by models. While we can compress activations into features, compressing nonlinear feature-maps – like MLP layers – is an open problem. In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models. We work in the classic setting of the modular addition models, and target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit. To study the ReLU MLP analytically, we use the infinite-width lens, which turns post-activation matrix multiplications into approximate integrals. We discover a novel interpretation of the MLP layer in one-layer transformers implementing the “pizza” algorithm: the MLP can be understood as evaluating a quadrature scheme, where each neuron computes the area of a rectangle under the curve of a trigonometric integral identity. Our code is available at this https URL.
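
Quick note: the "quadrature scheme" reading can be made concrete with a generic rectangle rule, where each term of the sum plays the role the abstract assigns to one neuron. The sketch below is only an illustration under assumptions of my own: the integrand `ReLU(cos x)` and the 64-rectangle grid are chosen for convenience and are not the paper's trigonometric identity or its neuron weights.

```python
import math

def rectangle_rule(f, a, b, n):
    """Approximate the integral of f over [a, b] with n midpoint rectangles."""
    width = (b - a) / n
    # Each term is the area of one rectangle: height f(midpoint) times width.
    return sum(f(a + (i + 0.5) * width) * width for i in range(n))

def relu_cos(x):
    # ReLU applied to a trigonometric function (an illustrative integrand).
    return max(0.0, math.cos(x))

# The exact value of the integral of ReLU(cos x) over [-pi, pi] is 2.
approx = rectangle_rule(relu_cos, -math.pi, math.pi, 64)
```

With only 64 rectangles the estimate already lands within about 0.001 of the exact value 2, which gives some intuition for why a finite-width layer can be analysed as an approximation to an integral.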

[AI-48] A Contemporary Overview: Trends and Applications of Large Language Models on Mobile Devices

Link: https://arxiv.org/abs/2412.03772
Authors: Lianjun Liu, Hongli An, Pengxuan Chen, Longxiang Ye
Keywords (EN): powerful natural language, natural language processing, large language models, possess powerful natural, personalized user experiences
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid development of large language models (LLMs), which possess powerful natural language processing and generation capabilities, LLMs are poised to provide more natural and personalized user experiences. Their deployment on mobile devices is gradually becoming a significant trend in the field of intelligent devices. LLMs have demonstrated tremendous potential in applications such as voice assistants, real-time translation, and intelligent recommendations. Advancements in hardware technologies (such as neural network accelerators) and network infrastructure (such as 5G) have enabled efficient local inference and low-latency intelligent responses on mobile devices. This reduces reliance on cloud computing while enhancing data privacy and security. Developers can easily integrate LLM functionalities through open APIs and SDKs, enabling the creation of more innovative intelligent applications. The widespread use of LLMs not only enhances the intelligence of mobile devices but also fosters the integrated innovation of fields like augmented reality (AR) and the Internet of Things (IoT). This trend is expected to drive the development of the next generation of mobile intelligent applications.

[AI-49] Beyond Local Sharpness: Communication-Efficient Global Sharpness-aware Minimization for Federated Learning

Link: https://arxiv.org/abs/2412.03752
Authors: Debora Caldarola, Pietro Cagnasso, Barbara Caputo, Marco Ciccone
Keywords (EN): enables collaborative model, collaborative model training, enables collaborative, privacy preservation, training with privacy
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint, 26 pages

Abstract:Federated learning (FL) enables collaborative model training with privacy preservation. Data heterogeneity across edge devices (clients) can cause models to converge to sharp minima, negatively impacting generalization and robustness. Recent approaches use client-side sharpness-aware minimization (SAM) to encourage flatter minima, but the discrepancy between local and global loss landscapes often undermines their effectiveness, as optimizing for local sharpness does not ensure global flatness. This work introduces FedGloSS (Federated Global Server-side Sharpness), a novel FL approach that prioritizes the optimization of global sharpness on the server, using SAM. To reduce communication overhead, FedGloSS cleverly approximates sharpness using the previous global gradient, eliminating the need for additional client communication. Our extensive evaluations demonstrate that FedGloSS consistently reaches flatter minima and better performance compared to state-of-the-art FL methods across various federated vision benchmarks.
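
Quick note: for readers unfamiliar with sharpness-aware minimization (SAM), the sketch below shows the standard two-gradient SAM update on a toy quadratic: perturb the weights along the normalized gradient by radius `rho`, then descend using the gradient taken at the perturbed point. It is plain single-machine SAM, not FedGloSS; the server-side approximation with the previous global gradient is not reproduced here, and the loss, learning rate, and `rho` are hypothetical.

```python
import math

def loss(w):
    # Toy quadratic with very different curvature along each coordinate.
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return [w[0], 10.0 * w[1]]

def sam_step(w, lr=0.05, rho=0.05):
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Ascent step: move to the approximate worst case within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient evaluated at the perturbed weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w)
final_loss = loss(w)
```

Because the perturbation has fixed radius, the iterates settle into a small neighbourhood of the minimum instead of converging exactly, which is expected behaviour for SAM with a constant `rho`.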

[AI-50] Exploring the Role of AI-Powered Chatbots for Teens and Young Adults with ASD or Social Anxiety

Link: https://arxiv.org/abs/2412.03740
Authors: Dilan Mian
Keywords (EN): Autistic Spectrum Disorder, High-Functioning Autistic Spectrum, complex and difficult, difficult place, Autistic Spectrum
Categories: Artificial Intelligence (cs.AI)
Comments: 33 pages, 30 figures

Abstract:The world can be a complex and difficult place to navigate. People with High-Functioning Autistic Spectrum Disorder as well as general social ineptitude often face navigation challenges that individuals of other demographics simply do not themselves. This can become even more pronounced with people of that specific group when they are in their teenage years and early adulthood (that being the usual age range of college students). When they are at such a vulnerable age, they can be far more susceptible to the struggles of becoming comfortable and content with social interactions as well as having strong relationships (outside their immediate family). Concerning this, the rapid emergence of artificial intelligence chatbots has led to many of them being used to benefit people of different ages and demographics with easy accessibility. With this, if there is anything that people with High-Functioning ASD and social ineptitude want when it comes to guidance towards self-improvement, surely easy accessibility would be one. What are the potential benefits and limitations of using a Mindstudio AI-powered chatbot to provide mental health support for teens and young adults with the aforementioned conditions? What could be done with a tool like this to help those individuals navigate ethical dilemmas within different social environments to reduce existing social tensions? This paper addresses these queries and offers insights to inform future discussions on the subject.

[AI-51] ParetoFlow: Guided Flows in Multi-Objective Optimization

Link: https://arxiv.org/abs/2412.03718
Authors: Ye Yuan, Can Chen, Christopher Pal, Xue Liu
Keywords (EN): simultaneously minimize multiple, dataset of designs, labels to simultaneously, simultaneously minimize, Pareto front
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:

Abstract:In offline multi-objective optimization (MOO), we leverage an offline dataset of designs and their associated labels to simultaneously minimize multiple objectives. This setting more closely mirrors complex real-world problems compared to single-objective optimization. Recent works mainly employ evolutionary algorithms and Bayesian optimization, with limited attention given to the generative modeling capabilities inherent in such data. In this study, we explore generative modeling in offline MOO through flow matching, noted for its effectiveness and efficiency. We introduce ParetoFlow, specifically designed to guide flow sampling to approximate the Pareto front. Traditional predictor (classifier) guidance is inadequate for this purpose because it models only a single objective. In response, we propose a multi-objective predictor guidance module that assigns each sample a weight vector, representing a weighted distribution across multiple objective predictions. A local filtering scheme is introduced to address non-convex Pareto fronts. These weights uniformly cover the entire objective space, effectively directing sample generation towards the Pareto front. Since distributions with similar weights tend to generate similar samples, we introduce a neighboring evolution module to foster knowledge sharing among neighboring distributions. This module generates offspring from these distributions, and selects the most promising one for the next iteration. Our method achieves state-of-the-art performance across various tasks.
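
Quick note: the target of the method is the Pareto front, i.e. the set of designs that no other design beats on every objective at once. The sketch below only illustrates that definition with a brute-force non-dominated filter for minimization; it is not ParetoFlow's guided sampling, and the two-objective `designs` list is made up for the example.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical designs scored on two objectives to minimize.
designs = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.5, 3.5)]
front = pareto_front(designs)
```

Here `(3.0, 4.0)` and `(2.5, 3.5)` are both dominated by `(2.0, 3.0)`, so three designs remain on the front. The filter is quadratic in the number of points, which is fine for illustration but not for large candidate sets.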

[AI-52] PathletRL: Optimizing Trajectory Pathlet Extraction and Dictionary Formation via Reinforcement Learning

Link: https://arxiv.org/abs/2412.03715
Authors: Gian Alix, Arian Haghparast, Manos Papagelis
Keywords (EN): Advances in tracking, tracking technologies, technologies have spurred, spurred the rapid, rapid growth
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Advances in tracking technologies have spurred the rapid growth of large-scale trajectory data. Building a compact collection of pathlets, referred to as a trajectory pathlet dictionary, is essential for supporting mobility-related applications. Existing methods typically adopt a top-down approach, generating numerous candidate pathlets and selecting a subset, leading to high memory usage and redundant storage from overlapping pathlets. To overcome these limitations, we propose a bottom-up strategy that incrementally merges basic pathlets to build the dictionary, reducing memory requirements by up to 24,000 times compared to baseline methods. The approach begins with unit-length pathlets and iteratively merges them while optimizing utility, which is defined using newly introduced metrics of trajectory loss and representability. We develop a deep reinforcement learning framework, PathletRL, which utilizes Deep Q-Networks (DQN) to approximate the utility function, resulting in a compact and efficient pathlet dictionary. Experiments on both synthetic and real-world datasets demonstrate that our method outperforms state-of-the-art techniques, reducing the size of the constructed dictionary by up to 65.8%. Additionally, our results show that only half of the dictionary pathlets are needed to reconstruct 85% of the original trajectory data. Building on PathletRL, we introduce PathletRL++, which extends the original model by incorporating a richer state representation and an improved reward function to optimize decision-making during pathlet merging. These enhancements enable the agent to gain a more nuanced understanding of the environment, leading to higher-quality pathlet dictionaries. PathletRL++ achieves even greater dictionary size reduction, surpassing the performance of PathletRL, while maintaining high trajectory representability.

[AI-53] CIKAN: Constraint Informed Kolmogorov-Arnold Networks for Autonomous Spacecraft Rendezvous using Time Shift Governor

Link: https://arxiv.org/abs/2412.03710
Authors: Taehyeun Kim, Anouck Girard, Ilya Kolmanovsky
Keywords (EN): Time Shift Governor, Constrained-Informed Neural Network, nominal closed-loop system, Shift Governor, Time Shift
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures

Abstract:The paper considers a Constrained-Informed Neural Network (CINN) approximation for the Time Shift Governor (TSG), which is an add-on scheme to the nominal closed-loop system used to enforce constraints by time-shifting the reference trajectory in spacecraft rendezvous applications. We incorporate Kolmogorov-Arnold Networks (KANs), an emerging architecture in the AI community, as a fundamental component of CINN and propose a Constrained-Informed Kolmogorov-Arnold Network (CIKAN)-based approximation for TSG. We demonstrate the effectiveness of the CIKAN-based TSG through simulations of constrained spacecraft rendezvous missions on highly elliptic orbits and present comparisons between CIKANs, MLP-based CINNs, and the conventional TSG.

[AI-54] System Test Case Design from Requirements Specifications: Insights and Challenges of Using ChatGPT

Link: https://arxiv.org/abs/2412.03693
Authors: Shreya Bhatia, Tarushi Gandhi, Dhruv Kumar, Pankaj Jalote
Keywords (EN): final products meet, test cases, test case designs, test, Large Language Models
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:System testing is essential in any software development project to ensure that the final products meet the requirements. Creating comprehensive test cases for system testing from requirements is often challenging and time-consuming. This paper explores the effectiveness of using Large Language Models (LLMs) to generate test case designs from Software Requirements Specification (SRS) documents. In this study, we collected the SRS documents of five software engineering projects containing functional and non-functional requirements, which were implemented, tested, and delivered by respective developer teams. For generating test case designs, we used ChatGPT-4o Turbo model. We employed prompt-chaining, starting with an initial context-setting prompt, followed by prompts to generate test cases for each use case. We assessed the quality of the generated test case designs through feedback from the same developer teams as mentioned above. Our experiments show that about 87 percent of the generated test cases were valid, with the remaining 13 percent either not applicable or redundant. Notably, 15 percent of the valid test cases were previously not considered by developers in their testing. We also tasked ChatGPT with identifying redundant test cases, which were subsequently validated by the respective developers to identify false positives and to uncover any redundant test cases that may have been missed by the developers themselves. This study highlights the potential of leveraging LLMs for test generation from the Requirements Specification document and also for assisting developers in quickly identifying and addressing redundancies, ultimately improving test suite quality and efficiency of the testing procedure.

[AI-55] Predicting Pedestrian Crossing Behavior in Germany and Japan: Insights into Model Transferability

Link: https://arxiv.org/abs/2412.03689
Authors: Chi Zhang (1), Janis Sprenger (2), Zhongjun Ni (3), Christian Berger (1) ((1) Department of Computer Science and Engineering, University of Gothenburg, Sweden, (2) German Research Center for Artificial Intelligence (DFKI), Saarland Informatics Campus, Germany, (3) Department of Science and Technology, Linköping University, Campus Norrköping, Sweden)
Keywords (EN): avoid pedestrian-vehicle collisions, intelligent traffic systems, pedestrian-vehicle collisions, pedestrian crossing behavior, important for intelligent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 12 figures, 11 tables. Accepted in IEEE Transactions on Intelligent Vehicles

Abstract:Predicting pedestrian crossing behavior is important for intelligent traffic systems to avoid pedestrian-vehicle collisions. Most existing pedestrian crossing behavior models are trained and evaluated on datasets collected from a single country, overlooking differences between countries. To address this gap, we compared pedestrian road-crossing behavior at unsignalized crossings in Germany and Japan. We presented four types of machine learning models to predict gap selection behavior, zebra crossing usage, and their trajectories using simulator data collected from both countries. When comparing the differences between countries, pedestrians from the study conducted in Japan are more cautious, selecting larger gaps compared to those in Germany. We evaluate and analyze model transferability. Our results show that neural networks outperform other machine learning models in predicting gap selection and zebra crossing usage, while random forest models perform best on trajectory prediction tasks, demonstrating strong performance and transferability. We develop a transferable model using an unsupervised clustering method, which improves prediction accuracy for gap selection and trajectory prediction. These findings provide a deeper understanding of pedestrian crossing behaviors in different countries and offer valuable insights into model transferability.

[AI-56] JPC: Flexible Inference for Predictive Coding Networks in JAX

Link: https://arxiv.org/abs/2412.03676
Authors: Francesco Innocenti, Paul Kinghorn, Will Yun-Farmbrough, Miguel De Llanza Varona, Ryan Singh, Christopher L. Buckley
Keywords (EN): Predictive Coding, JAX library, training neural networks, library for training, training neural
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 7 figures

Abstract:We introduce JPC, a JAX library for training neural networks with Predictive Coding. JPC provides a simple, fast and flexible interface to train a variety of PC networks (PCNs) including discriminative, generative and hybrid models. Unlike existing libraries, JPC leverages ordinary differential equation solvers to integrate the gradient flow inference dynamics of PCNs. We find that a second-order solver achieves significantly faster runtimes compared to standard Euler integration, with comparable performance on a range of tasks and network depths. JPC also provides some theoretical tools that can be used to study PCNs. We hope that JPC will facilitate future research of PC. The code is available at this https URL.

[AI-57] Tight Lower Bounds and Improved Convergence in Performative Prediction

Link: https://arxiv.org/abs/2412.03671
Authors: Pedram Khorsandi, Rushil Gupta, Mehrnaz Mofakhami, Simon Lacoste-Julien, Gauthier Gidel
Keywords (EN): data distribution induced, Affine Risk Minimizers, Repeated Risk Minimization, data distribution remains, data distribution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in the real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previous existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.
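
Quick note: Repeated Risk Minimization (RRM) retrains on the data distribution induced by the previously deployed model until deployment no longer shifts the optimum. The sketch below is a toy instance under assumptions of my own: a scalar model `theta`, squared loss, and a distribution whose mean is `mu + eps * theta`, so each retraining step has the closed form used in the loop. It shows plain last-iterate RRM only, not the Affine Risk Minimizers proposed in the paper.

```python
def rrm(mu, eps, theta0=0.0, steps=50):
    """Run last-iterate RRM on a toy problem.

    Assumed setup: deploying theta shifts the data mean to mu + eps * theta,
    and the squared-loss risk minimizer on that data is its mean.
    """
    theta = theta0
    for _ in range(steps):
        # Retrain on the distribution induced by the current deployment.
        theta = mu + eps * theta
    return theta

# With |eps| < 1 the map is a contraction, so the iterates approach the
# performatively stable point mu / (1 - eps).
theta_stable = rrm(mu=1.0, eps=0.5)
```

In this toy case the error shrinks by a factor of `eps` per retraining round, which gives a feel for the kind of convergence rate that the paper's upper and lower bounds characterize in general.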

[AI-58] Recommender Systems for Sustainability: Overview and Research Issues

Link: https://arxiv.org/abs/2412.03620
Authors: Alexander Felfernig, Manfred Wundara, Thi Ngoc Trang Tran, Seda Polat-Erdeniz, Sebastian Lubos, Merfat El-Mansi, Damian Garber, Viet-Man Le
Keywords (EN): ending of poverty, Sustainability development goals, planet protection, universal call, call to action
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sustainability development goals (SDGs) are regarded as a universal call to action with the overall objectives of planet protection, ending of poverty, and ensuring peace and prosperity for all people. In order to achieve these objectives, different AI technologies play a major role. Specifically, recommender systems can provide support for organizations and individuals to achieve the defined goals. Recommender systems integrate AI technologies such as machine learning, explainable AI (XAI), case-based reasoning, and constraint solving in order to find and explain user-relevant alternatives from a potentially large set of options. In this article, we summarize the state of the art in applying recommender systems to support the achievement of sustainability development goals. In this context, we discuss open issues for future research.

[AI-59] Chatting with Logs: An exploratory study on Finetuning LLMs for LogQL

Link: https://arxiv.org/abs/2412.03612
Authors: Vishwanath Seshagiri, Siddharth Balyan, Vaastav Anand, Kaustubh Dhole, Ishan Sharma, Avani Wildani, José Cambronero, Andreas Züfle
Keywords (EN): modern distributed applications, creates significant challenges, significant challenges, critical function, function in modern
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: draft under submission at another venue

Abstract:Logging is a critical function in modern distributed applications, but the lack of standardization in log query languages and formats creates significant challenges. Developers currently must write ad hoc queries in platform-specific languages, requiring expertise in both the query language and application-specific log details – an impractical expectation given the variety of platforms and volume of logs and applications. While generating these queries with large language models (LLMs) seems intuitive, we show that current LLMs struggle with log-specific query generation due to the lack of exposure to domain-specific knowledge. We propose a novel natural language (NL) interface to address these inconsistencies and aid log query generation, enabling developers to create queries in a target log query language by providing NL inputs. We further introduce NL2QL, a manually annotated, real-world dataset of natural language questions paired with corresponding LogQL queries spread across three log formats, to promote the training and evaluation of NL-to-log query systems. Using NL2QL, we subsequently fine-tune and evaluate several state-of-the-art LLMs, and demonstrate their improved capability to generate accurate LogQL queries. We perform further ablation studies to demonstrate the effect of additional training data, and the transferability across different log formats. In our experiments, we find up to 75% improvement of finetuned models to generate LogQL queries compared to non-finetuned models.

[AI-60] The Use of Artificial Intelligence in Military Intelligence: An Experimental Investigation of Added Value in the Analysis Process

Link: https://arxiv.org/abs/2412.03610
Authors: Christian Nitzl, Achim Cyran, Sascha Krstanovic, Uwe M. Borghoff
Keywords (EN): potential benefits, Named Entity Recognition, military intelligence, start-up Aleph Alpha, artificial intelligence
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 28 pages, 8 figures, 4 tables

Abstract:It is beyond dispute that the potential benefits of artificial intelligence (AI) in military intelligence are considerable. Nevertheless, it remains uncertain precisely how AI can enhance the analysis of military data. The aim of this study is to address this issue. To this end, the AI demonstrator deepCOM was developed in collaboration with the start-up Aleph Alpha. The AI functions include text search, automatic text summarization and Named Entity Recognition (NER). These are evaluated for their added value in military analysis. It is demonstrated that under time pressure, the utilization of AI functions results in assessments clearly superior to that of the control group. Nevertheless, despite the demonstrably superior analysis outcome in the experimental group, no increase in confidence in the accuracy of their own analyses was observed. Finally, the paper identifies the limitations of employing AI in military intelligence, particularly in the context of analyzing ambiguous and contradictory information.

[AI-61] A Survey on E-Commerce Learning to Rank

Link: https://arxiv.org/abs/2412.03581
Authors: Md. Ahsanul Kabir, Mohammad Al Hasan, Aritra Mandal, Daniel Tunkelang, Zhe Wu
Keywords (EN): search result ranking, search result, result ranking, important task, users’ preference
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In e-commerce, ranking the search results based on users’ preference is the most important task. Commercial e-commerce platforms such as Amazon, Alibaba, eBay, and Walmart perform extensive and relentless research to perfect their search result ranking algorithms because the quality of ranking drives a user’s decision to purchase or not to purchase an item, directly affecting the profitability of the e-commerce platform. On such commercial platforms, numerous features are considered when optimizing search result ranking, emerging from relevance, personalization, seller’s reputation and paid promotion. To maintain their competitive advantage in the market, the platforms do not publish their core ranking algorithms, so it is difficult to know which of the algorithms or which of the features is the most effective for finding the optimal search result ranking in e-commerce. Moreover, no extensive survey of learning to rank in the e-commerce domain has yet been published. In this work, we survey the existing e-commerce learning to rank algorithms. We also compare these algorithms based on a query relevance criterion on a large real-life e-commerce dataset and provide a quantitative analysis. To the best of our knowledge, this is the first such survey that includes an experimental comparison among various learning to rank algorithms.

[AI-62] Reinforced Symbolic Learning with Logical Constraints for Predicting Turbine Blade Fatigue Life

Link: https://arxiv.org/abs/2412.03580
Authors: Pei Li, Joo-Ho Choi, Dingyang Zhang, Shuyou Zhang, Yiming Zhang
Keywords (EN): fatigue life, aircraft engines, safety and reliability, reliability of aircraft, fatigue
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: full-length article with 24 pages

Abstract:Accurate prediction of turbine blade fatigue life is essential for ensuring the safety and reliability of aircraft engines. A significant challenge in this domain is uncovering the intrinsic relationship between mechanical properties and fatigue life. This paper introduces Reinforced Symbolic Learning (RSL), a method that derives predictive formulas linking these properties to fatigue life. RSL incorporates logical constraints during symbolic optimization, ensuring that the generated formulas are both physically meaningful and interpretable. The optimization process is further enhanced using deep reinforcement learning, which efficiently guides the symbolic regression towards more accurate models. The proposed RSL method was evaluated on two turbine blade materials, GH4169 and TC4, to identify optimal fatigue life prediction models. When compared with six empirical formulas and five machine learning algorithms, RSL not only produces more interpretable formulas but also achieves superior or comparable predictive accuracy. Additionally, finite element simulations were conducted to assess mechanical properties at critical points on the blade, which were then used to predict fatigue life under various operating conditions.

[AI-63] Towards a Practical Ethics of Generative AI in Creative Production Processes

Link: https://arxiv.org/abs/2412.03579
Authors: Geert Hofman
Keywords: significant ethical questions, raises significant ethical, raises significant, Double Diamond design, increasing integration
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:The increasing integration of artificial intelligence into various domains, including design and creative processes, raises significant ethical questions. While AI ethics is often examined from the perspective of technology developers, less attention has been paid to the practical ethical considerations faced by technology users, particularly in design contexts. This paper introduces a framework for addressing ethical challenges in creative production processes, such as the Double Diamond design model. Drawing on six major ethical theories - virtue ethics, deontology, utilitarianism, contract theory, care ethics, and existentialism - we develop a “compass” to navigate and reflect on the ethical dimensions of AI in design. The framework highlights the importance of responsibility, anticipation, and reflection across both the AI lifecycle and each stage of the creative process. We argue that by adopting a playful and exploratory approach to AI, while remaining anchored in core ethical principles, designers can responsibly harness the potential of AI technologies without overburdening or compromising their creative processes.

[AI-64] OKG: On-the-Fly Keyword Generation in Sponsored Search Advertising

Link: https://arxiv.org/abs/2412.03577
Authors: Zhao Wang,Briti Gangopadhyay,Mengjie Zhao,Shingo Takamatsu
Keywords: Current keyword decision-making, real-time KPI metrics, sponsored search advertising, search advertising relies, Current keyword
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Current keyword decision-making in sponsored search advertising relies on large, static datasets, limiting the ability to automatically set up keywords and adapt to real-time KPI metrics and product updates that are essential for effective advertising. In this paper, we propose On-the-fly Keyword Generation (OKG), an LLM agent-based method that dynamically monitors KPI changes and adapts keyword generation in real time, aligning with strategies recommended by advertising platforms. Additionally, we introduce the first publicly accessible dataset containing real keyword data along with its KPIs across diverse domains, providing a valuable resource for future research. Experimental results show that OKG significantly improves keyword adaptability and responsiveness compared to traditional methods. The code for OKG and the dataset are available at this https URL.

[AI-65] Ethical Challenges and Evolving Strategies in the Integration of Artificial Intelligence into Clinical Practice

Link: https://arxiv.org/abs/2412.03576
Authors: Ellison B. Weiner,Irene Dankwa-Mullan,William A. Nelson,Saeed Hassanpour
Keywords: improve patient outcomes, Artificial intelligence, revolutionize clinical practice, transformed various sectors, rapidly transformed
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Artificial intelligence (AI) has rapidly transformed various sectors, including healthcare, where it holds the potential to revolutionize clinical practice and improve patient outcomes. However, its integration into medical settings brings significant ethical challenges that need careful consideration. This paper examines the current state of AI in healthcare, focusing on five critical ethical concerns: justice and fairness, transparency, patient consent and confidentiality, accountability, and patient-centered and equitable care. These concerns are particularly pressing as AI systems can perpetuate or even exacerbate existing biases, often resulting from non-representative datasets and opaque model development processes. The paper explores how bias, lack of transparency, and challenges in maintaining patient trust can undermine the effectiveness and fairness of AI applications in healthcare. In addition, we review existing frameworks for the regulation and deployment of AI, identifying gaps that limit the widespread adoption of these systems in a just and equitable manner. Our analysis provides recommendations to address these ethical challenges, emphasizing the need for fairness in algorithm design, transparency in model decision-making, and patient-centered approaches to consent and data privacy. By highlighting the importance of continuous ethical scrutiny and collaboration between AI developers, clinicians, and ethicists, we outline pathways for achieving more responsible and inclusive AI implementation in healthcare. These strategies, if adopted, could enhance both the clinical value of AI and the trustworthiness of AI systems among patients and healthcare professionals, ensuring that these technologies serve all populations equitably.

[AI-66] Back-filling Missing Data When Predicting Domestic Electricity Consumption From Smart Meter Data

Link: https://arxiv.org/abs/2412.03574
Authors: Xianjuan Chen,Shuxiang Cai,Alan F. Smeaton
Keywords: smart meter data, data smart meter, domestic electricity smart, smart meter, annual electricity bills
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, 4 tables

Abstract:This study uses data from domestic electricity smart meters to estimate annual electricity bills. We develop a method for back-filling smart meter data for up to six missing months for users who have less than one year of smart meter data, ensuring reliable estimates of annual consumption. We identify five distinct electricity consumption user profiles for homes based on day, night, and peak usage patterns, highlighting the economic advantages of Time-of-Use (ToU) tariffs over fixed tariffs for most users, especially those with higher nighttime consumption. Ultimately, the results of this study empower consumers to manage their energy use effectively and to make informed choices regarding electricity tariff plans.
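
To make the tariff comparison in this abstract concrete, here is a minimal sketch of how a fixed tariff and a Time-of-Use tariff could be compared for two usage profiles. The rates, the profile values, and the function names are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical tariffs in cents per kWh (illustrative numbers, not from the paper).
FIXED_RATE = 30
TOU_RATES = {"day": 32, "night": 16, "peak": 45}


def fixed_cost(usage_kwh, rate=FIXED_RATE):
    """Cost in cents when every kWh is billed at the same rate."""
    return sum(usage_kwh.values()) * rate


def tou_cost(usage_kwh, rates=TOU_RATES):
    """Cost in cents when each period (day, night, peak) has its own rate."""
    return sum(kwh * rates[period] for period, kwh in usage_kwh.items())


# Two made-up usage profiles with the same total consumption (350 kWh).
night_heavy = {"day": 100, "night": 200, "peak": 50}
peak_heavy = {"day": 100, "night": 50, "peak": 200}

for name, profile in [("night-heavy", night_heavy), ("peak-heavy", peak_heavy)]:
    print(name, "fixed:", fixed_cost(profile), "ToU:", tou_cost(profile))
```

With these assumed rates the night-heavy profile is cheaper on the ToU tariff while the peak-heavy profile is not, which is the kind of profile-dependent outcome the abstract describes.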

[AI-67] Methodology for Online Estimation of Rheological Parameters in Polymer Melts Using Deep Learning and Microfluidics

Link: https://arxiv.org/abs/2412.04142
Authors: Juan Sandubete-López,José L. Risco-Martín,Alexander H. McMillan,Eva Besada-Portas
Keywords: chemical experiments due, biological and chemical, chemical experiments, experiments due, Microfluidic devices
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, Winter Simulation Conference 2024

Abstract:Microfluidic devices are increasingly used in biological and chemical experiments due to their cost-effectiveness for rheological estimation in fluids. However, these devices often face challenges in terms of accuracy, size, and cost. This study presents a methodology, integrating deep learning, modeling and simulation to enhance the design of microfluidic systems, used to develop an innovative approach for viscosity measurement of polymer melts. We use synthetic data generated from the simulations to train a deep learning model, which then identifies rheological parameters of polymer melts from pressure drop and flow rate measurements in a microfluidic circuit, enabling online estimation of fluid properties. By improving the accuracy and flexibility of microfluidic rheological estimation, our methodology accelerates the design and testing of microfluidic devices, reducing reliance on physical prototypes, and offering significant contributions to the field.

[AI-68] Deep-Unrolling Multidimensional Harmonic Retrieval Algorithms on Neuromorphic Hardware

Link: https://arxiv.org/abs/2412.04008
Authors: Vlad C. Andrei,Alexandru P. Drăguţoiu,Gabriel Béna,Mahmoud Akl,Yin Li,Matthias Lohrmann,Ullrich J. Mönich,Holger Boche
Keywords: multidimensional harmonic retrieval, energy-efficient single-snapshot multidimensional, single-snapshot multidimensional harmonic, Structured Learned Iterative, Learned Iterative Shrinkage
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Neural and Evolutionary Computing (cs.NE)
Comments: accepted to the 58th Asilomar Conference on Signals, Systems, and Computers, Oct. 27th - Oct. 30th, 2024, Pacific Grove, CA

Abstract:This paper explores the potential of conversion-based neuromorphic algorithms for highly accurate and energy-efficient single-snapshot multidimensional harmonic retrieval (MHR). By casting the MHR problem as a sparse recovery problem, we devise the currently proposed, deep-unrolling-based Structured Learned Iterative Shrinkage and Thresholding (S-LISTA) algorithm to solve it efficiently using complex-valued convolutional neural networks with complex-valued activations, which are trained using a supervised regression objective. Afterward, a novel method for converting the complex-valued convolutional layers and activations into spiking neural networks (SNNs) is developed. At the heart of this method lies the recently proposed Few Spikes (FS) conversion, which is extended by modifying the neuron model’s parameters and internal dynamics to account for the inherent coupling between real and imaginary parts in complex-valued computations. Finally, the converted SNNs are mapped onto the SpiNNaker2 neuromorphic board, and the original CNNs deployed on an NVIDIA Jetson Xavier are compared with the SNNs in terms of estimation accuracy and power efficiency. The measurement results show that the converted SNNs achieve almost five-fold power efficiency at moderate performance loss compared to the original CNNs.
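
For readers unfamiliar with the iterative shrinkage idea that LISTA-style networks unroll, the following is a minimal sketch of the classical, real-valued ISTA iteration for sparse recovery. It is not the S-LISTA algorithm from the paper (which is learned, structured, and complex-valued); the function names and the tiny example are assumptions made for illustration.

```python
import math


def soft_threshold(vector, tau):
    """Shrink every entry toward zero by tau; entries smaller than tau become 0."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in vector]


def ista(A, y, lam, step, n_iter):
    """Classical ISTA for min_x 0.5*||A x - y||^2 + lam*||x||_1.

    A is a list of rows, y a list of measurements. `step` should not exceed
    1 / (largest eigenvalue of A^T A) for the iteration to converge.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(n_iter):
        # Residual r = A x - y
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the quadratic term: A^T r
        grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by shrinkage
        x = soft_threshold([x[j] - step * grad[j] for j in range(n)], step * lam)
    return x


# Tiny example with an identity measurement matrix.
identity = [[1.0, 0.0], [0.0, 1.0]]
estimate = ista(identity, [3.0, 0.5], lam=1.0, step=1.0, n_iter=10)
print(estimate)  # [2.0, 0.0]
```

Deep unrolling treats a fixed number of such iterations as network layers and learns the matrices and thresholds from data instead of fixing them in advance.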

[AI-69] A Data-Driven Framework for Discovering Fractional Differential Equations in Complex Systems

Link: https://arxiv.org/abs/2412.03970
Authors: Xiangnan Yu,Hao Xu,Zhiping Mao,HongGuang Sun,Yong Zhang,Dongxiao Zhang,Yuntian Chen
Keywords: conventional differential equations, differential equations, fall short, short in capturing, limited to local
Categories: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)

Abstract:In complex physical systems, conventional differential equations often fall short in capturing non-local and memory effects, as they are limited to local dynamics and integer-order interactions. This study introduces a stepwise data-driven framework for discovering fractional differential equations (FDEs) directly from data. FDEs, known for their capacity to model non-local dynamics with fewer parameters than integer-order derivatives, can represent complex systems with long-range interactions. Our framework applies deep neural networks as surrogate models for denoising and reconstructing sparse and noisy observations while using Gaussian-Jacobi quadrature to handle the challenges posed by singularities in fractional derivatives. To optimize both the sparse coefficients and fractional order, we employ an alternating optimization approach that combines sparse regression with global optimization techniques. We validate the framework across various datasets, including synthetic anomalous diffusion data, experimental data on the creep behavior of frozen soils, and single-particle trajectories modeled by Lévy motion. Results demonstrate the framework’s robustness in identifying the structure of FDEs across diverse noise levels and its capacity to capture integer-order dynamics, offering a flexible approach for modeling memory effects in complex systems.

[AI-70] Social Media Informatics for Sustainable Cities and Societies: An Overview of the Applications, associated Challenges, and Potential Solutions

Link: https://arxiv.org/abs/2412.03600
Authors: Jebran Khan,Kashif Ahmad,Senthil Kumar Jagatheesaperumal,Nasir Ahmad,Kyung-Ah Sohn
Keywords: global warming climate, warming climate change, social media informatics, cities and societies, sustainable cities
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
Comments: 35 pages, 3 tables, and 4 figures

Abstract:In the modern world, our cities and societies face several technological and societal challenges, such as rapid urbanization, global warming, climate change, the digital divide, and social inequalities, increasing the need for more sustainable cities and societies. Addressing these challenges requires a multifaceted approach involving all the stakeholders, sustainable planning, efficient resource management, innovative solutions, and modern technologies. Like other modern technologies, social media informatics also plays its part in developing more sustainable and resilient cities and societies. Despite its limitations, social media informatics has proven very effective in various sustainable cities and society applications. In this paper, we review and analyze the role of social media informatics in sustainable cities and society by providing a detailed overview of its applications, associated challenges, and potential solutions. This work is expected to provide a baseline for future research in the domain.

Machine Learning

[LG-0] Efficient Task Grouping Through Samplewise Optimisation Landscape Analysis

Link: https://arxiv.org/abs/2412.04413
Authors: Anshul Thakur,Yichen Huang,Soheila Molaei,Yujiang Wang,David A. Clifton
Keywords: machine learning applications, Shared training approaches, gradient-based meta-learning, negative transfer, task
Categories: Machine Learning (cs.LG)
Comments: Under review at IEEE Transactions on Pattern Analysis and Machine Intelligence

Abstract:Shared training approaches, such as multi-task learning (MTL) and gradient-based meta-learning, are widely used in various machine learning applications, but they often suffer from negative transfer, leading to performance degradation in specific tasks. While several optimisation techniques have been developed to mitigate this issue for pre-selected task cohorts, identifying optimal task combinations for joint learning - known as task grouping - remains underexplored and computationally challenging due to the exponential growth in task combinations and the need for extensive training and evaluation cycles. This paper introduces an efficient task grouping framework designed to reduce these overwhelming computational demands of the existing methods. The proposed framework infers pairwise task similarities through a sample-wise optimisation landscape analysis, eliminating the need for the shared model training required to infer task similarities in existing methods. With task similarities acquired, a graph-based clustering algorithm is employed to pinpoint near-optimal task groups, providing an approximate yet efficient and effective solution to the originally NP-hard problem. Empirical assessments conducted on 8 different datasets highlight the effectiveness of the proposed framework, revealing a five-fold speed enhancement compared to previous state-of-the-art methods. Moreover, the framework consistently demonstrates comparable performance, confirming its remarkable efficiency and effectiveness in task grouping.

[LG-1] Stabilizing and Solving Inverse Problems using Data and Machine Learning

Link: https://arxiv.org/abs/2412.04409
Authors: Erik Burman,Mats G. Larson,Karl Larsson,Carl Lundholm
Keywords: partial differential equation, unknown boundary conditions, nonlinear partial differential, boundary data, differential equation
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:We consider an inverse problem involving the reconstruction of the solution to a nonlinear partial differential equation (PDE) with unknown boundary conditions. Instead of direct boundary data, we are provided with a large dataset of boundary observations for typical solutions (collective data) and a bulk measurement of a specific realization. To leverage this collective data, we first compress the boundary data using proper orthogonal decomposition (POD) in a linear expansion. Next, we identify a possible nonlinear low-dimensional structure in the expansion coefficients using an auto-encoder, which provides a parametrization of the dataset in a lower-dimensional latent space. We then train a neural network to map the latent variables representing the boundary data to the solution of the PDE. Finally, we solve the inverse problem by optimizing a data-fitting term over the latent space. We analyze the underlying stabilized finite element method in the linear setting and establish optimal error estimates in the H^1 and L^2 -norms. The nonlinear problem is then studied numerically, demonstrating the effectiveness of our approach.

[LG-2] Providing Differential Privacy for Federated Learning Over Wireless: A Cross-layer Framework

Link: https://arxiv.org/abs/2412.04408
Authors: Jiayu Mao,Tongxin Yin,Aylin Yener,Mingyan Liu
Keywords: distributed machine learning, Federated Learning, local training data, machine learning framework, distributed machine
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: submitted for an IEEE publication

Abstract:Federated Learning (FL) is a distributed machine learning framework that inherently allows edge devices to maintain their local training data, thus providing some level of privacy. However, FL’s model updates still pose a risk of privacy leakage, which must be mitigated. Over-the-air FL (OTA-FL) is an adapted FL design for wireless edge networks that leverages the natural superposition property of the wireless medium. We propose a wireless physical layer (PHY) design for OTA-FL which improves differential privacy (DP) through a decentralized, dynamic power control that utilizes both inherent Gaussian noise in the wireless channel and a cooperative jammer (CJ) for additional artificial noise generation when higher privacy levels are required. Although primarily implemented within the Upcycled-FL framework, where a resource-efficient method with first-order approximations is used at every even iteration to decrease the required information from clients, our power control strategy is applicable to any FL framework, including FedAvg and FedProx as shown in the paper. This adaptation showcases the flexibility and effectiveness of our design across different learning algorithms while maintaining a strong emphasis on privacy. Our design removes the need for client-side artificial noise injection for DP, utilizing a cooperative jammer to enhance privacy without affecting transmission efficiency for higher privacy demands. Privacy analysis is provided using the Moments Accountant method. We perform a convergence analysis for non-convex objectives to tackle heterogeneous data distributions, highlighting the inherent trade-offs between privacy and accuracy. Numerical results show that our approach with various FL algorithms outperforms the state-of-the-art under the same DP conditions on the non-i.i.d. FEMNIST dataset, and highlight the cooperative jammer’s effectiveness in ensuring strict privacy.

[LG-3] Federated Automated Feature Engineering

Link: https://arxiv.org/abs/2412.04404
Authors: Tom Overman,Diego Klabjan
Keywords: Automated feature engineering, needing significant human, significant human intervention, Automated feature, improve predictive performance
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Preliminary Work

Abstract:Automated feature engineering (AutoFE) is used to automatically create new features from original features to improve predictive performance without needing significant human intervention and expertise. Many algorithms exist for AutoFE, but very few approaches exist for the federated learning (FL) setting where data is gathered across many clients and is not shared between clients or a central server. We introduce AutoFE algorithms for the horizontal, vertical, and hybrid FL settings, which differ in how the data is gathered across clients. To the best of our knowledge, we are the first to develop AutoFE algorithms for the horizontal and hybrid FL cases, and we show that the downstream model performance of federated AutoFE is similar to the case where data is held centrally and AutoFE is performed centrally.

[LG-4] Asynchronous Batch Bayesian Optimization with Pipelining Evaluations for Experimental Resource–constrained Conditions

Link: https://arxiv.org/abs/2412.04392
Authors: Yujin Taguchi,Yusuke Shibuya,Yusuke Hiki,Takashi Morikura,Takahiro G. Yamada,Akira Funahashi
Keywords: Bayesian optimization, Batch Bayesian optimization, Bayesian, optimization, Bayesian optimization reduces
Categories: Machine Learning (cs.LG)

Abstract:Bayesian optimization is efficient even with a small amount of data and is used in engineering and in science, including biology and chemistry. In Bayesian optimization, a parameterized model with an uncertainty is fitted to explain the experimental data, and then the model suggests parameters that would most likely improve the results. Batch Bayesian optimization reduces the processing time of optimization by parallelizing experiments. However, batch Bayesian optimization cannot be applied if the number of parallelized experiments is limited by the cost or scarcity of equipment; in such cases, sequential methods require an unrealistic amount of time. In this study, we developed pipelining Bayesian optimization (PipeBO) to reduce the processing time of optimization even with a limited number of parallel experiments. PipeBO was inspired by the pipelining of central processing unit architecture, which divides computational tasks into multiple processes. PipeBO was designed to achieve experiment parallelization by overlapping various processes of the experiments. PipeBO uses the results of completed experiments to update the parameters of running parallelized experiments. Using the Black-Box Optimization Benchmarking, which consists of 24 benchmark functions, we compared PipeBO with the sequential Bayesian optimization methods. PipeBO reduced the average processing time of optimization to about 56% for the experiments that consisted of two processes or even less for those with more processes for 20 out of the 24 functions. Overall, PipeBO parallelizes Bayesian optimization in the resource-constrained settings so that efficient optimization can be achieved.

[LG-5] Finer Behavioral Foundation Models via Auto-Regressive Features and Advantage Weighting

Link: https://arxiv.org/abs/2412.04368
Authors: Edoardo Cetin,Ahmed Touati,Yann Ollivier
Keywords: Touati Ollivier, recently proposed framework, Touati, providing zero-shot efficient, zero-shot efficient policies
Categories: Machine Learning (cs.LG)

Abstract:The forward-backward representation (FB) is a recently proposed framework (Touati et al., 2023; Touati & Ollivier, 2021) to train behavior foundation models (BFMs) that aim at providing zero-shot efficient policies for any new task specified in a given reinforcement learning (RL) environment, without training for each new task. Here we address two core limitations of FB model training. First, FB, like all successor-feature-based methods, relies on a linear encoding of tasks: at test time, each new reward function is linearly projected onto a fixed set of pre-trained features. This limits expressivity as well as precision of the task representation. We break the linearity limitation by introducing auto-regressive features for FB, which let fine-grained task features depend on coarser-grained task information. This can represent arbitrary nonlinear task encodings, thus significantly increasing expressivity of the FB framework. Second, it is well-known that training RL agents from offline datasets often requires specific techniques. We show that FB works well together with such offline RL techniques, by adapting techniques from (Nair et al., 2020b; Cetin et al., 2024) for FB. This is necessary to get non-flatlining performance in some datasets, such as DMC Humanoid. As a result, we produce efficient FB BFMs for a number of new environments. Notably, in the D4RL locomotion benchmark, the generic FB agent matches the performance of standard single-task offline agents (IQL, XQL). In many setups, the offline techniques are needed to get any decent performance at all. The auto-regressive features have a positive but moderate impact, concentrated on tasks requiring spatial precision and task generalization beyond the behaviors represented in the trainset.

[LG-6] Approximate Top-k for Increased Parallelism

Link: https://arxiv.org/abs/2412.04358
Authors: Oscar Key,Luka Ribar,Alberto Cattaneo,Luke Hudlass-Galley,Douglas Orr
Keywords: bucketed approximate top, top, approximate top, bucketed approximate, algorithms
Categories: Machine Learning (cs.LG)

Abstract:We present an evaluation of bucketed approximate top-k algorithms. Computing top-k exactly suffers from limited parallelism, because the k largest values must be aggregated along the vector, and is thus not well suited to computation on highly-parallel machine learning accelerators. By relaxing the requirement that the top-k is exact, bucketed algorithms can dramatically increase the parallelism available by independently computing many smaller top-k operations. We explore the design choices of this class of algorithms using both theoretical analysis and empirical evaluation on downstream tasks. Our motivating examples are sparsity algorithms for language models, which often use top-k to select the most important parameters or activations. We also release a fast bucketed top-k implementation for PyTorch.
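
The bucketing idea described in this abstract can be illustrated with a small pure-Python sketch. This is not the authors' released PyTorch implementation; the interleaved bucket assignment, the function names, and the requirement that k be divisible by the number of buckets are assumptions made to keep the example short.

```python
import heapq


def exact_top_k(values, k):
    """Return the k largest values in descending order."""
    return heapq.nlargest(k, values)


def bucketed_top_k(values, k, num_buckets):
    """Approximate top-k: take the top k/num_buckets values from each bucket.

    Buckets are formed by interleaving (element i goes to bucket i % num_buckets).
    Each per-bucket top-k is independent of the others, so on parallel hardware
    the buckets could be processed at the same time.
    """
    if k % num_buckets != 0:
        raise ValueError("this sketch assumes k is divisible by num_buckets")
    per_bucket = k // num_buckets
    candidates = []
    for b in range(num_buckets):
        bucket = values[b::num_buckets]
        candidates.extend(heapq.nlargest(per_bucket, bucket))
    return sorted(candidates, reverse=True)


values = [9, 1, 8, 2, 7, 3, 6, 4]
print(exact_top_k(values, 2))        # [9, 8]
print(bucketed_top_k(values, 2, 2))  # [9, 4]
```

In this example the approximate result misses 8 because it shares a bucket with 9. With a single bucket the function reduces to the exact top-k.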

[LG-7] Distributionally Robust Performative Prediction

Link: https://arxiv.org/abs/2412.04346
Authors: Songkai Xue,Yuekai Sun
Keywords: predictive outcomes subsequently, outcomes subsequently influence, distributionally robust performative, Performative prediction aims, robust performative prediction
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS) 2024

Abstract:Performative prediction aims to model scenarios where predictive outcomes subsequently influence the very systems they target. The pursuit of a performative optimum (PO), which minimizes performative risk, generally relies on modeling the distribution map, which characterizes how a deployed ML model alters the data distribution. Unfortunately, inevitable misspecification of the distribution map can lead to a poor approximation of the true PO. To address this issue, we introduce a novel framework of distributionally robust performative prediction and study a new solution concept termed distributionally robust performative optimum (DRPO). We show provable guarantees for DRPO as a robust approximation to the true PO when the nominal distribution map is different from the actual one. Moreover, distributionally robust performative prediction can be reformulated as an augmented performative prediction problem, enabling efficient optimization. The experimental results demonstrate that DRPO offers potential advantages over the traditional PO approach when the distribution map is misspecified at either the micro- or macro-level.

[LG-8] Deep Causal Inference for Point-referenced Spatial Data with Continuous Treatments

Link: https://arxiv.org/abs/2412.04285
Authors: Ziyang Jiang,Zach Calhoun,Yiling Liu,Lei Duan,David Carlson
Keywords: handling high-dimensional inputs, high-dimensional inputs, handling high-dimensional, causal effects, Causal
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 4 figures, 5 tables

Abstract:Causal reasoning is often challenging with spatial data, particularly when handling high-dimensional inputs. To address this, we propose a neural network (NN) based framework integrated with an approximate Gaussian process to manage spatial interference and unobserved confounding. Additionally, we adopt a generalized propensity-score-based approach to address partially observed outcomes when estimating causal effects with continuous treatments. We evaluate our framework using synthetic, semi-synthetic, and real-world data inferred from satellite imagery. Our results demonstrate that NN-based models significantly outperform linear spatial regression models in estimating causal effects. Furthermore, in real-world case studies, NN-based models offer more reasonable predictions of causal effects, facilitating decision-making in relevant applications.

[LG-9] Complexity of Vector-valued Prediction: From Linear Models to Stochastic Convex Optimization

Link: https://arxiv.org/abs/2412.04274
Authors: Matan Schliserman,Tomer Koren
Keywords: dimensional feature vector, prediction rules parameterized, vector-valued linear predictors, Empirical Risk Minimization, rules parameterized
Categories: Machine Learning (cs.LG)

Abstract:We study the problem of learning vector-valued linear predictors: these are prediction rules parameterized by a matrix that maps an $m$-dimensional feature vector to a $k$-dimensional target. We focus on the fundamental case with a convex and Lipschitz loss function, and show several new theoretical results that shed light on the complexity of this problem and its connection to related learning models. First, we give a tight characterization of the sample complexity of Empirical Risk Minimization (ERM) in this setting, establishing that $\widetilde{\Omega}(k/\epsilon^2)$ examples are necessary for ERM to reach $\epsilon$ excess (population) risk; this provides for an exponential improvement over recent results by Magen and Shamir (2023) in terms of the dependence on the target dimension $k$, and matches a classical upper bound due to Maurer (2016). Second, we present a black-box conversion from general $d$-dimensional Stochastic Convex Optimization (SCO) to vector-valued linear prediction, showing that any SCO problem can be embedded as a prediction problem with $k=\Theta(d)$ outputs. These results portray the setting of vector-valued linear prediction as bridging between two extensively studied yet disparate learning models: linear models (corresponding to $k=1$) and general $d$-dimensional SCO (with $k=\Theta(d)$).

[LG-10] SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction

Link: https://arxiv.org/abs/2412.04262
Authors: Ethan Bradley,Muhammad Roman,Karen Rafferty,Barry Devereux
Keywords: challenging AI problem, tables, Table extraction, Existing table extraction, content domains
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 8 figures

Abstract:Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables due to the vast amount of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words, and their positions, contained within the tables, instead relying on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model using real-world financial tables and compare it to a state-of-the-art generative model and discuss the results. We make the dataset, model, and dataset generation code publicly available.

[LG-11] SCADE: Scalable Command-line Anomaly Detection Engine

Link: https://arxiv.org/abs/2412.04259
Authors: Vaishali Vinay, Anjali Mangal
Keywords: complex command-line abuse, command-line interfaces remain, command-line abuse continues, exploitation through stealthy, continues to grow
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:As command-line interfaces remain an integral part of high-computation environments, the risk of exploitation through stealthy, complex command-line abuse continues to grow. Conventional security solutions often struggle with these command-line-based anomalies due to their context-specific nature and lack of labeled data, especially in detecting rare, malicious patterns amidst legitimate, high-volume activity. This gap has left organizations vulnerable to sophisticated threats like Living-off-the-Land (LOL) attacks, where standard detection tools frequently miss or misclassify anomalous command-line behavior. We introduce Scalable Command-Line Anomaly Detection Engine (SCADE), which addresses these challenges by introducing a dual-layered detection framework that combines a global statistical analysis with local context-specific anomaly detection, innovatively using a novel ensemble of statistical models such as BM25 and Log Entropy, adapted for command-line data. The framework also features a dynamic thresholding mechanism for adaptive anomaly detection, ensuring high precision and recall even in environments with extremely high Signal-to-Noise Ratios (SNRs). Initial experimental results demonstrate the effectiveness of the framework, achieving above 98% SNR in identifying unusual command-line behavior while minimizing false positives. In this paper, we present SCADE’s core architecture, including its metadata-enriched approach to anomaly detection and the design choices behind its scalability for enterprise-level deployment. We argue that SCADE represents a significant advancement in command-line anomaly detection, offering a robust, adaptive framework for security analysts and researchers seeking to enhance detection accuracy in high-computation environments.
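
The abstract names BM25 as one of the statistical models but does not say how it is adapted to command-line data. As a rough illustration only, the sketch below applies the textbook BM25 term weight to tokenized commands and sums the weights of a command's own tokens, so commands built from rarely seen tokens score higher. The `build_bm25` helper, the whitespace tokenization, the `k1`/`b` defaults, and the toy history are all assumptions, not SCADE's implementation.

```python
import math
from collections import Counter


def build_bm25(corpus, k1=1.5, b=0.75):
    """Return a scorer built from a corpus of tokenized commands (hypothetical helper)."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many commands each token appears.
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))

    def idf(tok):
        n = doc_freq.get(tok, 0)
        # Non-negative BM25 IDF variant; unseen tokens get the largest value.
        return math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)

    def score(command):
        tf = Counter(command)
        norm = k1 * (1.0 - b + b * len(command) / avg_len)
        return sum(idf(t) * f * (k1 + 1.0) / (f + norm) for t, f in tf.items())

    return score


# Toy history of "normal" commands (made up for illustration).
history = [
    "ls -la /home".split(),
    "ls -la /tmp".split(),
    "cd /home".split(),
    "ls /home".split(),
]
score = build_bm25(history)
common = score("ls -la /home".split())
rare = score("certutil -urlcache -f http://example.test/a.exe".split())
print(f"common={common:.2f} rare={rare:.2f}")
```

In this toy setup the unfamiliar command receives the higher score; a real system would still need the thresholding and local-context layer that the abstract describes.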

[LG-12] LMDM: Latent Molecular Diffusion Model For 3D Molecule Generation

Link: https://arxiv.org/abs/2412.04242
Authors: Xiang Chen
Keywords: rich geometric features, molecular diffusion model, latent molecular diffusion, maintain rich geometric, maintain Euclidean transformation
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2209.05710 by other authors

Abstract:In this work, we propose a latent molecular diffusion model that can make the generated 3D molecules rich in diversity and maintain rich geometric features. The model captures the information of the forces and local constraints between atoms so that the generated molecules can maintain Euclidean transformation and a high level of effectiveness and diversity. We also use the lower-rank manifold advantage of the latent variables of the latent model to fuse the information of the forces between atoms to better maintain the geometric equivariant properties of the molecules. Because there is no need to perform information fusion encoding in stages like traditional encoders and decoders, this reduces the amount of calculation in the back-propagation process. The model keeps the forces and local constraints of particle bonds in the latent variable space, reducing the impact of underfitting on the surface of the network on the large position drift of the particle geometry, so that our model can converge earlier. We introduce a distribution control variable in each backward step to strengthen exploration and improve the diversity of generation. In the experiment, the quality of the samples we generated and the convergence speed of the model have been significantly improved.

[LG-13] Physics-informed Deep Learning for Muscle Force Prediction with Unlabeled sEMG Signals

Link: https://arxiv.org/abs/2412.04213
Authors: Shuhao Ma, Jie Zhang, Chaoyang Shi, Pei Di, Ian D. Robertson, Zhi-Qiang Zhang
Keywords: Computational biomechanical analysis, biomechanical analysis plays, improving human movements, physical functions, Computational biomechanical
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP); Biological Physics (physics.bio-ph)
Comments: 11 pages, 8 figures, journal

Abstract:Computational biomechanical analysis plays a pivotal role in understanding and improving human movements and physical functions. Although physics-based modeling methods can interpret the dynamic interaction between the neural drive to muscle dynamics and joint kinematics, they suffer from high computational latency. In recent years, data-driven methods have emerged as a promising alternative due to their fast execution speed, but label information is still required during training, which is not easy to acquire in practice. To tackle these issues, this paper presents a novel physics-informed deep learning method to predict muscle forces without any label information during model training. In addition, the proposed method could also identify personalized muscle-tendon parameters. To achieve this, the Hill muscle model-based forward dynamics is embedded into the deep neural network as the additional loss to further regulate the behavior of the deep neural network. Experimental validations on the wrist joint from six healthy subjects are performed, and a fully connected neural network (FNN) is selected to implement the proposed method. The predicted results of muscle forces show comparable or even lower root mean square error (RMSE) and higher coefficient of determination compared with baseline methods, which have to use the labeled surface electromyography (sEMG) signals, and it can also identify muscle-tendon parameters accurately, demonstrating the effectiveness of the proposed physics-informed deep learning method.

[LG-14] Linear Discriminant Analysis in Credit Scoring: A Transparent Hybrid Model Approach

Link: https://arxiv.org/abs/2412.04183
Authors: Md Shihab Reza, Monirul Islam Mahmud, Ifti Azad Abeer, Nova Ahmed
Keywords: credit scoring approaches, made credit scoring, Linear Discriminant Analysis, credit scoring, development of computing
Categories: Machine Learning (cs.LG)
Comments: Accepted at the International Conference on Computer and Information Technology (ICCIT) 2024

Abstract:The development of computing has made credit scoring approaches possible, with various machine learning (ML) and deep learning (DL) techniques becoming more and more valuable. While complex models yield more accurate predictions, their interpretability is often weakened, which is a concern for credit scoring that places importance on decision fairness. As features of the dataset are a crucial factor for the credit scoring system, we implement Linear Discriminant Analysis (LDA) as a feature reduction technique, which reduces the burden of the model's complexity. We compared 6 different machine learning models, 1 deep learning model, and a hybrid model with and without using LDA. From the results, we found that our hybrid model, XG-DNN, outperformed the other models with the highest accuracy of 99.45% and a 99% F1 score with LDA. Lastly, to interpret model decisions, we applied 2 different explainable AI techniques named LIME (local) and Morris Sensitivity Analysis (global). Through this research, we showed how feature reduction techniques can be used without affecting the performance and explainability of the model, which can be very useful in resource-constrained settings to optimize the computational workload.

[LG-15] SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization

Link: https://arxiv.org/abs/2412.04180
Authors: Runsheng Bai, Qiang Liu, Bo Liu
Keywords: Large Language Models, Large Language, inference poses challenges, exhibit impressive performance, Language Models
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) exhibit impressive performance across various tasks, but deploying them for inference poses challenges. Their high resource demands often necessitate complex, costly multi-GPU pipelines, or the use of smaller, less capable models. While quantization offers a promising solution utilizing lower precision for model storage, existing methods frequently experience significant performance drops at lower precision levels. Additionally, they typically provide only a limited set of solutions at specific bit levels, many of which are extensively manually tuned. To address these challenges, we propose a new method called SKIM: Scaled K-means clustering wIth Mixed precision. Our approach introduces two novel techniques: 1. A greedy algorithm to solve approximately optimal bit allocation across weight channels, and 2. A trainable scaling vector for non-differentiable K-means clustering. These techniques substantially improve performance and can be adapted to any given bit. Notably, in terms of model perplexity, our method narrows the gap between 3-bit quantized LLaMA models and their full precision counterparts by 16.3% on average.
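
The abstract mentions a greedy algorithm for bit allocation across weight channels without giving its details. The sketch below shows one generic greedy scheme under stated assumptions: each channel has a known error at every bit width, every channel starts at a minimum width, and each remaining bit goes to the channel whose error drops the most. The function name `greedy_bit_allocation`, the error table, and the toy error model `scale * 4**(-bits)` are hypothetical and are not taken from SKIM.

```python
def greedy_bit_allocation(errors, total_bits, min_bits=1):
    """Assign bits per channel greedily.

    errors[c][b] is the (assumed known) quantization error of channel c at b bits.
    """
    n_channels = len(errors)
    bits = [min_bits] * n_channels
    remaining = total_bits - min_bits * n_channels
    while remaining > 0:
        # Channels that can still receive one more bit.
        candidates = [c for c in range(n_channels) if bits[c] + 1 < len(errors[c])]
        if not candidates:
            break
        # Pick the channel with the largest error reduction from one extra bit.
        best = max(candidates, key=lambda c: errors[c][bits[c]] - errors[c][bits[c] + 1])
        bits[best] += 1
        remaining -= 1
    return bits


def total_error(errors, bits):
    return sum(errors[c][b] for c, b in enumerate(bits))


# Toy error model: error shrinks by 4x per extra bit, scaled per channel.
scales = [16.0, 1.0, 0.25]
max_bits = 8
errors = [[s * 4.0 ** (-b) for b in range(max_bits + 1)] for s in scales]

allocation = greedy_bit_allocation(errors, total_bits=9)  # average of 3 bits per channel
uniform = [3, 3, 3]
print(allocation, total_error(errors, allocation), total_error(errors, uniform))
```

With the same average budget, the greedy allocation gives more bits to the high-error channel and ends with a lower total error than the uniform 3-bit assignment in this toy case.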

[LG-16] Multi-Layer Privacy-Preserving Record Linkage with Clerical Review based on gradual information disclosure

Link: https://arxiv.org/abs/2412.04178
Authors: Florens Rohde, Victor Christen, Martin Franke, Erhard Rahm
Keywords: data integration tasks, essential component, integration tasks, tasks of sensitive, PPRL
Categories: Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Accepted at the 21st Conference on Database Systems for Business, Technology and Web (BTW)

Abstract:Privacy-Preserving Record Linkage (PPRL) is an essential component in data integration tasks involving sensitive information. The linkage quality determines the usability of combined datasets and (machine learning) applications based on them. We present a novel privacy-preserving protocol that integrates clerical review in PPRL using a multi-layer active learning process. Uncertain match candidates are reviewed on several layers by human and non-human oracles to reduce the amount of disclosed information per record and in total. Predictions are propagated back to update previous layers, resulting in an improved linkage performance for non-reviewed candidates as well. The data owners remain in control of the amount of information they share for each record. Therefore, our approach follows need-to-know and data sovereignty principles. The experimental evaluation on real-world datasets shows considerable linkage quality improvements with limited labeling effort and privacy risks.

[LG-17] Fixed-Mean Gaussian Processes for Post-hoc Bayesian Deep Learning

Link: https://arxiv.org/abs/2412.04177
Authors: Luis A. Ortega, Simón Rodríguez-Santana, Daniel Hernández-Lobato
Keywords: deep neural networks, pre-trained deep neural, performing post-hoc uncertainty, pre-trained DNN, DNN prediction uncertainty
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 12 pages, 6 figures and 2 tables. Submitted to IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

Abstract:Recently, there has been an increasing interest in performing post-hoc uncertainty estimation about the predictions of pre-trained deep neural networks (DNNs). Given a pre-trained DNN via back-propagation, these methods enhance the original network by adding output confidence measures, such as error bars, without compromising its initial accuracy. In this context, we introduce a novel family of sparse variational Gaussian processes (GPs), where the posterior mean is fixed to any continuous function when using a universal kernel. Specifically, we fix the mean of this GP to the output of the pre-trained DNN, allowing our approach to effectively fit the GP’s predictive variances to estimate the DNN prediction uncertainty. Our approach leverages variational inference (VI) for efficient stochastic optimization, with training costs that remain independent of the number of training points, scaling efficiently to large datasets such as ImageNet. The proposed method, called fixed mean GP (FMGP), is architecture-agnostic, relying solely on the pre-trained model’s outputs to adjust the predictive variances. Experimental results demonstrate that FMGP improves both uncertainty estimation and computational efficiency when compared to state-of-the-art methods.

[LG-18] An In-Depth Examination of Risk Assessment in Multi-Class Classification Algorithms

Link: https://arxiv.org/abs/2412.04166
Authors: Disha Ghandwani, Neeraj Sarna, Yuanyuan Li, Yang Lin
Keywords: Advanced classification algorithms, Advanced classification, safety-critical applications, Advanced, applications like health-care
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Advanced classification algorithms are being increasingly used in safety-critical applications like health-care, engineering, etc. In such applications, misclassifications made by ML algorithms can result in substantial financial or health-related losses. To better anticipate and prepare for such losses, the algorithm user seeks an estimate for the probability that the algorithm misclassifies a sample. We refer to this task as risk assessment. For a variety of models and datasets, we numerically analyze the performance of different methods in solving the risk-assessment problem. We consider two solution strategies: a) calibration techniques that calibrate the output probabilities of classification models to provide accurate probability outputs; and b) a novel approach based upon the prediction interval generation technique of conformal prediction. Our conformal prediction based approach is model and data-distribution agnostic, simple to implement, and provides reasonable results for a variety of use-cases. We compare the different methods on a broad variety of models and datasets.
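
For readers unfamiliar with conformal prediction, the sketch below shows standard split conformal prediction sets for a classifier, which is the building block the abstract refers to. How the paper turns this into a misclassification-risk estimate is not stated in the abstract, so that step is not shown. The function names, the nonconformity score `1 - p(true class)`, and the toy calibration values are illustrative assumptions.

```python
import math


def conformal_threshold(true_class_probs, alpha=0.1):
    """Calibrate a score threshold from held-out data.

    true_class_probs: model probability assigned to the correct class
    for each calibration sample. Nonconformity score = 1 - probability.
    """
    scores = sorted(1.0 - p for p in true_class_probs)
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: include every class
    return scores[rank - 1]


def prediction_set(class_probs, threshold):
    """All classes whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(class_probs) if 1.0 - p <= threshold]


# Toy calibration data (made up): probability the model gave to the true class.
calibration = [0.9, 0.8, 0.95, 0.7, 0.6, 0.85, 0.55, 0.75, 0.4]
q = conformal_threshold(calibration, alpha=0.1)

confident = prediction_set([0.90, 0.05, 0.05], q)
ambiguous = prediction_set([0.50, 0.45, 0.05], q)
print(q, confident, ambiguous)
```

A prediction set with more than one class is one simple signal that a sample is risky; under exchangeability, the sets contain the true class with probability at least 1 - alpha.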

[LG-19] On the Lack of Robustness of Binary Function Similarity Systems

Link: https://arxiv.org/abs/2412.04163
Authors: Gianluca Capozzi, Tong Tang, Jie Wan, Ziqi Yang, Daniele Cono D’Elia, Giuseppe Antonio Di Luna, Lorenzo Cavallaro, Leonardo Querzoni
Keywords: Binary function similarity, including machine learning, Binary function, relies on learning-based, learning-based algorithms
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Binary function similarity, which often relies on learning-based algorithms to identify what functions in a pool are most similar to a given query function, is a sought-after topic in different communities, including machine learning, software engineering, and security. Its importance stems from the impact it has in facilitating several crucial tasks, from reverse engineering and malware analysis to automated vulnerability detection. Whereas recent work has shed light on performance on this long-studied problem, the research landscape remains largely lackluster in understanding the resiliency of the state-of-the-art machine learning models against adversarial attacks. As security requires reasoning about adversaries, in this work we assess the robustness of such models through a simple yet effective black-box greedy attack, which modifies the topology and the content of the control flow of the attacked functions. We demonstrate that this attack is successful in compromising all the models, achieving average attack success rates of 57.06% and 95.81% depending on the problem settings (targeted and untargeted attacks). Our findings are insightful: top performance on clean data does not necessarily relate to top robustness properties, which explicitly highlights performance-robustness trade-offs one should consider when deploying such models, calling for further research.

[LG-20] LossVal: Efficient Data Valuation for Neural Networks

Link: https://arxiv.org/abs/2412.04158
Authors: Tim Wibiral, Mohamed Karim Belaid, Maximilian Rabus, Ansgar Scherp
Keywords: machine learning, key challenge, challenge in machine, individual training samples, Assessing the importance
Categories: Machine Learning (cs.LG)

Abstract:Assessing the importance of individual training samples is a key challenge in machine learning. Traditional approaches retrain models with and without specific samples, which is computationally expensive and ignores dependencies between data points. We introduce LossVal, an efficient data valuation method that computes importance scores during neural network training by embedding a self-weighting mechanism into loss functions like cross-entropy and mean squared error. LossVal reduces computational costs, making it suitable for large datasets and practical applications. Experiments on classification and regression tasks across multiple datasets show that LossVal effectively identifies noisy samples and is able to distinguish helpful from harmful samples. We examine the gradient calculation of LossVal to highlight its advantages. The source code is available at: this https URL

[LG-21] Non-Asymptotic Bounds for Closed-Loop Identification of Unstable Nonlinear Stochastic Systems

Link: https://arxiv.org/abs/2412.04157
Authors: Seth Siriya, Jingge Zhu, Dragan Nešić, Ye Pu
Keywords: closed-loop nonlinear stochastic, linearly parameterised uncertainty, squares parameter estimation, nonlinear stochastic systems, closed-loop nonlinear
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 21 pages, 2 figures

Abstract:We consider the problem of least squares parameter estimation from single-trajectory data for discrete-time, unstable, closed-loop nonlinear stochastic systems, with linearly parameterised uncertainty. Assuming a region of the state space produces informative data, and the system is sub-exponentially unstable, we establish non-asymptotic guarantees on the estimation error at times where the state trajectory evolves in this region. If the whole state space is informative, high probability guarantees on the error hold for all times. Examples are provided where our results are useful for analysis, but existing results are not.

[LG-22] MultiTASC++: A Continuously Adaptive Scheduler for Edge-Based Multi-Device Cascade Inference

Link: https://arxiv.org/abs/2412.04147
Authors: Sokratis Nikolaidis, Stylianos I. Venieris, Iakovos S. Venieris
Keywords: refining challenging samples, high-accuracy model refining, low computational burden, lightweight model processing, model refining challenging
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Cascade systems, consisting of a lightweight model processing all samples and a heavier, high-accuracy model refining challenging samples, have become a widely-adopted distributed inference approach to achieving high accuracy and maintaining a low computational burden for mobile and IoT devices. As intelligent indoor environments, like smart homes, continue to expand, a new scenario emerges, the multi-device cascade. In this setting, multiple diverse devices simultaneously utilize a shared heavy model hosted on a server, often situated within or close to the consumer environment. This work introduces MultiTASC++, a continuously adaptive multi-tenancy-aware scheduler that dynamically controls the forwarding decision functions of devices to optimize system throughput while maintaining high accuracy and low latency. Through extensive experimentation in diverse device environments and with varying server-side models, we demonstrate the scheduler’s efficacy in consistently maintaining a targeted satisfaction rate while providing the highest available accuracy across different device tiers and workloads of up to 100 devices. This demonstrates its scalability and efficiency in addressing the unique challenges of collaborative DNN inference in dynamic and diverse IoT environments.
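
To make the idea of a forwarding decision function concrete, here is a minimal sketch of a confidence-threshold cascade with a simple feedback rule. The abstract does not describe the scheduler's actual control logic, so the rule below (lower the threshold when observed latency exceeds a target, raise it otherwise), the function names, and the step size are all assumptions chosen for illustration.

```python
def should_forward(confidence, threshold):
    """Forward a sample to the heavy server model when the light model is unsure."""
    return confidence < threshold


def update_threshold(threshold, observed_latency_ms, target_latency_ms, step=0.05):
    """Hypothetical feedback rule, not the paper's scheduler.

    If the server is too slow, forward fewer samples (lower threshold);
    otherwise allow more forwarding to gain accuracy (higher threshold).
    """
    if observed_latency_ms > target_latency_ms:
        threshold -= step
    else:
        threshold += step
    return min(1.0, max(0.0, threshold))


threshold = 0.8
for latency_ms in [150.0, 140.0, 90.0]:  # made-up latency measurements
    threshold = update_threshold(threshold, latency_ms, target_latency_ms=100.0)

# Light-model confidences for three samples (made up).
forwarded = [c for c in [0.95, 0.72, 0.40] if should_forward(c, threshold)]
print(round(threshold, 2), forwarded)
```

In a multi-device setting, each device would hold its own threshold and the server would adjust them jointly; that coordination is the part the paper contributes and is not modeled here.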

[LG-23] Compositional Generative Multiphysics and Multi-component Simulation

Link: https://arxiv.org/abs/2412.04134
Authors: Tao Zhang, Zhenhai Liu, Feipeng Qi, Yongjun Jiao, Tailin Wu
Keywords: aerospace engineering, critical in fields, multi-component simulation, multiphysics simulations typically, multi-component
Categories: Machine Learning (cs.LG)
Comments: 30 pages, 13 figures

Abstract:Multiphysics simulation, which models the interactions between multiple physical processes, and multi-component simulation of complex structures are critical in fields like nuclear and aerospace engineering. Previous studies often rely on numerical solvers or machine learning-based surrogate models to solve or accelerate these simulations. However, multiphysics simulations typically require integrating multiple specialized solvers, each responsible for evolving a specific physical process, into a coupled program, which introduces significant development challenges. Furthermore, no universal algorithm exists for multi-component simulations, which adds to the complexity. Here we propose compositional Multiphysics and Multi-component Simulation with Diffusion models (MultiSimDiff) to overcome these challenges. During diffusion-based training, MultiSimDiff learns energy functions modeling the conditional probability of one physical process/component conditioned on other processes/components. In inference, MultiSimDiff generates coupled multiphysics solutions and multi-component structures by sampling from the joint probability distribution, achieved by composing the learned energy functions in a structured way. We test our method in three tasks. In the reaction-diffusion and nuclear thermal coupling problems, MultiSimDiff successfully predicts the coupling solution using decoupled data, while the surrogate model fails in the more complex second problem. For the thermal and mechanical analysis of the prismatic fuel element, MultiSimDiff trained for single component prediction accurately predicts a larger structure with 64 components, reducing the relative error by 40.3% compared to the surrogate model.

[LG-24] Learnable Similarity and Dissimilarity Guided Symmetric Non-Negative Matrix Factorization

Link: https://arxiv.org/abs/2412.04082
Authors: Wenlong Lyu, Yuheng Jia
Keywords: Symmetric nonnegative matrix, nonnegative matrix factorization, Symmetric nonnegative, similarity matrix, construct similarity matrix
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 14 figures

Abstract:Symmetric nonnegative matrix factorization (SymNMF) is a powerful tool for clustering, which typically uses the k-nearest neighbor (k-NN) method to construct the similarity matrix. However, k-NN may mislead clustering since the neighbors may belong to different clusters, and its reliability generally decreases as k grows. In this paper, we construct the similarity matrix as a weighted k-NN graph with learnable weights that reflect the reliability of each k-th NN. This approach reduces the search space of the similarity matrix learning to n - 1 dimensions, as opposed to the \mathcal{O}(n^2) dimensions of existing methods, where n represents the number of samples. Moreover, to obtain a discriminative similarity matrix, we introduce a dissimilarity matrix with a dual structure of the similarity matrix, and propose a new form of orthogonality regularization with discussions on its geometric interpretation and numerical stability. An efficient alternative optimization algorithm is designed to solve the proposed model, with a theoretical guarantee that the variables converge to a stationary point that satisfies the KKT conditions. The advantage of the proposed model is demonstrated by the comparison with nine state-of-the-art clustering methods on eight datasets. The code is available at this https URL.

[LG-25] Towards Generalizable Autonomous Penetration Testing via Domain Randomization and Meta-Reinforcement Learning

Link: https://arxiv.org/abs/2412.04078
Authors: Shicheng Zhou, Jingju Liu, Yuliang Lu, Jiahai Yang, Yue Zhang, Jie Chen
Keywords: emerging research area, autonomous penetration testing, autonomous pentesting, studying autonomous pentesting, penetration testing
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:With increasing numbers of vulnerabilities exposed on the internet, autonomous penetration testing (pentesting) has become an emerging research area, and reinforcement learning (RL) is a natural fit for studying autonomous pentesting. Previous research in RL-based autonomous pentesting mainly focused on enhancing agents’ learning efficacy within abstract simulated training environments. They overlooked the applicability and generalization requirements of deploying agents’ policies in real-world environments that differ substantially from their training settings. In contrast, for the first time, we shift focus to the pentesting agents’ ability to generalize across unseen real environments. For this purpose, we propose a Generalizable Autonomous Pentesting framework (namely GAP) for training agents capable of drawing inferences from one to another, a key requirement for the broad application of autonomous pentesting and a hallmark of human intelligence. GAP introduces a Real-to-Sim-to-Real pipeline with two key methods: domain randomization and meta-RL learning. Specifically, we are among the first to apply domain randomization in autonomous pentesting and propose a large language model-powered domain randomization method for synthetic environment generation. We further apply meta-RL to improve the agents’ generalization ability in unseen environments by leveraging the synthetic environments. The combination of these two methods can effectively bridge the generalization gap and improve policy adaptation performance. Experiments are conducted on various vulnerable virtual machines, with results showing that GAP can (a) enable policy learning in unknown real environments, (b) achieve zero-shot policy transfer in similar environments, and (c) realize rapid policy adaptation in dissimilar environments.

[LG-26] Distance-Adaptive Quaternion Knowledge Graph Embedding with Bidirectional Rotation COLING2025

Link: https://arxiv.org/abs/2412.04076
Authors: Weihua Wang, Qiuyu Liang, Feilong Bao, Guanglai Gao
Keywords: expressive hypercomplex space, real part, imaginary parts, expressive hypercomplex, distance
Categories: Machine Learning (cs.LG)
Comments: Accepted by COLING 2025

Abstract:A quaternion contains one real part and three imaginary parts, which provides a more expressive hypercomplex space for learning knowledge graphs. Existing quaternion embedding models measure the plausibility of a triplet either through semantic matching or geometric distance scoring functions. However, it appears that semantic matching diminishes the separability of entities, while the distance scoring function weakens the semantics of entities. To address this issue, we propose a novel quaternion knowledge graph embedding model. Our model combines semantic matching with the entities’ geometric distance to better measure the plausibility of triplets. Specifically, in the quaternion space, we perform a right rotation on the head entity and a reverse rotation on the tail entity to learn rich semantic features. Then, we utilize distance-adaptive translations to learn the geometric distance between entities. Furthermore, we provide mathematical proofs to demonstrate that our model can handle complex logical relationships. Extensive experimental results and analyses show that our model significantly outperforms previous models on well-known knowledge graph completion benchmark datasets. Our code is available at this https URL.
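
The rotations mentioned in the abstract are built on the Hamilton product of quaternions. The sketch below implements that product in plain Python and checks that right-multiplying by a unit quaternion and then by its conjugate returns the original vector, which is the sense in which a "reverse rotation" undoes a rotation. The variable names and the example values are illustrative; the model's actual scoring function is not given in the abstract and is not reproduced here.

```python
import math


def hamilton(p, q):
    """Hamilton product of quaternions given as (a, b, c, d) = a + bi + cj + dk."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def conjugate(q):
    a, b, c, d = q
    return (a, -b, -c, -d)


def normalize(q):
    norm = math.sqrt(sum(x * x for x in q))
    return tuple(x / norm for x in q)


# A made-up "head entity" component and a unit "relation" quaternion.
head = (0.5, -1.0, 2.0, 0.25)
relation = normalize((1.0, 2.0, 3.0, 4.0))

rotated = hamilton(head, relation)                  # right rotation
restored = hamilton(rotated, conjugate(relation))   # reverse rotation undoes it
print(rotated, restored)
```

Because the product is not commutative, left and right multiplication give different results, which is part of what makes quaternion embeddings more expressive than complex-valued ones.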

[LG-27] Integrated Sensing and Communications for Low-Altitude Economy: A Deep Reinforcement Learning Approach

Link: https://arxiv.org/abs/2412.04074
Authors: Xiaowen Ye, Yuyi Mao, Xianghao Yu, Shu Sun, Liqun Fu, Jie Xu
Keywords: ground base station, unmanned aerial vehicles, unauthorized mobile target, authorized unmanned aerial, low-altitude economy
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: submitted for an IEEE publication

Abstract:This paper studies an integrated sensing and communications (ISAC) system for low-altitude economy (LAE), where a ground base station (GBS) provides communication and navigation services for authorized unmanned aerial vehicles (UAVs), while sensing the low-altitude airspace to monitor the unauthorized mobile target. The expected communication sum-rate over a given flight period is maximized by jointly optimizing the beamforming at the GBS and UAVs’ trajectories, subject to the constraints on the average signal-to-noise ratio requirement for sensing, the flight mission and collision avoidance of UAVs, as well as the maximum transmit power at the GBS. Typically, this is a sequential decision-making problem with the given flight mission. Thus, we transform it to a specific Markov decision process (MDP) model called episode task. Based on this modeling, we propose a novel LAE-oriented ISAC scheme, referred to as Deep LAE-ISAC (DeepLSC), by leveraging the deep reinforcement learning (DRL) technique. In DeepLSC, a reward function and a new action selection policy termed constrained noise-exploration policy are judiciously designed to fulfill various constraints. To enable efficient learning in episode tasks, we develop a hierarchical experience replay mechanism, where the gist is to employ all experiences generated within each episode to jointly train the neural network. Besides, to enhance the convergence speed of DeepLSC, a symmetric experience augmentation mechanism, which simultaneously permutes the indexes of all variables to enrich available experience sets, is proposed. Simulation results demonstrate that compared with benchmarks, DeepLSC yields a higher sum-rate while meeting the preset constraints, achieves faster convergence, and is more robust against different settings.

[LG-28] Boundary-Guided Learning for Gene Expression Prediction in Spatial Transcriptomics

Link: https://arxiv.org/abs/2412.04072
Authors: Mingcheng Qu, Yuncong Wu, Donglin Di, Anyang Su, Tonghua Su, Yang Song, Lei Fan
Keywords: Spatial transcriptomics, spatial context, gene expression, advanced technology, Spatial
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 5 figures

Abstract:Spatial transcriptomics (ST) has emerged as an advanced technology that provides spatial context to gene expression. Recently, deep learning-based methods have shown the capability to predict gene expression from WSI data using ST data. Existing approaches typically extract features from images and the neighboring regions using pretrained models, and then develop methods to fuse this information to generate the final output. However, these methods often fail to account for the cellular structure similarity, cellular density and the interactions within the microenvironment. In this paper, we propose a framework named BG-TRIPLEX, which leverages boundary information extracted from pathological images as guiding features to enhance gene expression prediction from WSIs. Specifically, our model consists of three branches: the spot, in-context and global branches. In the spot and in-context branches, boundary information, including edge and nuclei characteristics, is extracted using pretrained models. These boundary features guide the learning of cellular morphology and the characteristics of microenvironment through Multi-Head Cross-Attention. Finally, these features are integrated with global features to predict the final output. Extensive experiments were conducted on three public ST datasets. The results demonstrate that our BG-TRIPLEX consistently outperforms existing methods in terms of Pearson Correlation Coefficient (PCC). This method highlights the crucial role of boundary features in understanding the complex interactions between WSI and gene expression, offering a promising direction for future research.

[LG-29] Space to Policy: Scalable Brick Kiln Detection and Automatic Compliance Monitoring with Geospatial Data

Link: https://arxiv.org/abs/2412.04065
Authors: Zeel B Patel, Rishabh Mondal, Shataxi Dubey, Suraj Jaiswal, Sarath Guttikunda, Nipun Batra
Keywords: million people annually, Air pollution kills, million people, people annually, brick kilns
Categories: Machine Learning (cs.LG)

Abstract:Air pollution kills 7 million people annually. The brick kiln sector significantly contributes to economic development but also accounts for 8-14% of air pollution in India. Policymakers have implemented compliance measures to regulate brick kilns. Emission inventories are critical for air quality modeling and source apportionment studies. However, the largely unorganized nature of the brick kiln sector necessitates labor-intensive survey efforts for monitoring. Recent efforts by air quality researchers have relied on manual annotation of brick kilns using satellite imagery to build emission inventories, but this approach lacks scalability. Machine-learning-based object detection methods have shown promise for detecting brick kilns; however, previous studies often rely on costly high-resolution imagery and fail to integrate with governmental policies. In this work, we developed a scalable machine-learning pipeline that detected and classified 30638 brick kilns across five states in the Indo-Gangetic Plain using free, moderate-resolution satellite imagery from Planet Labs. Our detections have a high correlation with on-ground surveys. We performed automated compliance analysis based on government policies. In the Delhi airshed, stricter policy enforcement has led to the adoption of efficient brick kiln technologies. This study highlights the need for inclusive policies that balance environmental sustainability with the livelihoods of workers.

[LG-30] AI4EF: Artificial Intelligence for Energy Efficiency in the Building Sector

Link: https://arxiv.org/abs/2412.04045
Authors: Alexandros Menelaos Tzortzis,Georgios Kormpakis,Sotiris Pelekis,Ariadni Michalitsi-Psarrou,Evangelos Karakolis,Christos Ntanos,Dimitris Askounis
Keywords: user-centric tool designed, designed to support, support decision-making, building energy retrofitting, Energy
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:AI4EF, Artificial Intelligence for Energy Efficiency, is an advanced, user-centric tool designed to support decision-making in building energy retrofitting and efficiency optimization. Leveraging machine learning (ML) and data-driven insights, AI4EF enables stakeholders such as public sector representatives, energy consultants, and building owners to model, analyze, and predict energy consumption, retrofit costs, and environmental impacts of building upgrades. Featuring a modular framework, AI4EF includes customizable building retrofitting, photovoltaic installation assessment, and predictive modeling tools that allow users to input building parameters and receive tailored recommendations for achieving energy savings and carbon reduction goals. Additionally, the platform incorporates a Training Playground for data scientists to refine ML models used by said framework. Finally, AI4EF provides access to the Enershare Data Space to facilitate seamless data sharing and access within the ecosystem. Its compatibility with open-source identity management, Keycloak, enhances security and accessibility, making it adaptable for various regulatory and organizational contexts. This paper presents an architectural overview of AI4EF, its application in energy efficiency scenarios, and its potential for advancing sustainable energy practices through artificial intelligence (AI).

[LG-31] Dynamic Graph Representation with Contrastive Learning for Financial Market Prediction: Integrating Temporal Evolution and Static Relations

Link: https://arxiv.org/abs/2412.04034
Authors: Yunhua Pei,Jin Zheng,John Cartlidge
Keywords: Temporal Graph Learning, Contrastive Learning, Contrastive Constrained Training, Dynamic Graph Representation, evolving nature
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computational Finance (q-fin.CP)
Comments: 12 pages, 2 figures, author manuscript accepted for ICAART 2025 (International Conference on Agents and Artificial Intelligence)

Click to view abstract

Abstract:Temporal Graph Learning (TGL) is crucial for capturing the evolving nature of stock markets. Traditional methods often ignore the interplay between dynamic temporal changes and static relational structures between stocks. To address this issue, we propose the Dynamic Graph Representation with Contrastive Learning (DGRCL) framework, which integrates dynamic and static graph relations to improve the accuracy of stock trend prediction. Our framework introduces two key components: the Embedding Enhancement (EE) module and the Contrastive Constrained Training (CCT) module. The EE module focuses on dynamically capturing the temporal evolution of stock data, while the CCT module enforces static constraints based on stock relations, refined within contrastive learning. This dual-relation approach allows for a more comprehensive understanding of stock market dynamics. Our experiments on two major U.S. stock market datasets, NASDAQ and NYSE, demonstrate that DGRCL significantly outperforms state-of-the-art TGL baselines. Ablation studies indicate the importance of both modules. Overall, DGRCL not only enhances prediction ability but also provides a robust framework for integrating temporal and relational data in dynamic graphs. Code and data are available for public access.

[LG-32] Electronic Health Records-Based Data-Driven Diabetes Knowledge Unveiling and Risk Prognosis

Link: https://arxiv.org/abs/2412.03961
Authors: Huadong Pang,Li Zhou,Yiping Dong,Peiyuan Chen,Dian Gu,Tianyi Lyu,Hansong Zhang
Keywords: deep learning technologies, Electronic Health Records, Logistic Regression, revolutionized data analysis, disease forecasting
Categories: Machine Learning (cs.LG)
Comments: 16 pages

Click to view abstract

Abstract:In the healthcare sector, the application of deep learning technologies has revolutionized data analysis and disease forecasting. This is particularly evident in the field of diabetes, where the deep analysis of Electronic Health Records (EHR) has unlocked new opportunities for early detection and effective intervention strategies. Our research presents an innovative model that synergizes the capabilities of Bidirectional Long Short-Term Memory Networks-Conditional Random Field (BiLSTM-CRF) with a fusion of XGBoost and Logistic Regression. This model is designed to enhance the accuracy of diabetes risk prediction by conducting an in-depth analysis of electronic medical records data. The first phase of our approach involves employing BiLSTM-CRF to delve into the temporal characteristics and latent patterns present in EHR data. This method effectively uncovers the progression trends of diabetes, which are often hidden in the complex data structures of medical records. The second phase leverages the combined strength of XGBoost and Logistic Regression to classify these extracted features and evaluate associated risks. This dual approach facilitates a more nuanced and precise prediction of diabetes, outperforming traditional models, particularly in handling multifaceted and nonlinear medical datasets. Our research demonstrates a notable advancement in diabetes prediction over traditional methods, showcasing the effectiveness of our combined BiLSTM-CRF, XGBoost, and Logistic Regression model. This study highlights the value of data-driven strategies in clinical decision-making, equipping healthcare professionals with precise tools for early detection and intervention. By enabling personalized treatment and timely care, our approach signifies progress in incorporating advanced analytics in healthcare, potentially improving outcomes for diabetes and other chronic conditions.

[LG-33] BEFL: Balancing Energy Consumption in Federated Learning for Mobile Edge IoT

Link: https://arxiv.org/abs/2412.03950
Authors: Zehao Ju,Tongquan Wei,Fuke Shen
Keywords: highly accurate global, distributed learning paradigm, learning paradigm designed, privacy-preserving distributed learning, Mobile Edge IoT
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Click to view abstract

Abstract:Federated Learning (FL) is a privacy-preserving distributed learning paradigm designed to build a highly accurate global model. In Mobile Edge IoT (MEIoT), the training and communication processes can significantly deplete the limited battery resources of devices. Existing research primarily focuses on reducing overall energy consumption, but this may inadvertently create energy consumption imbalances, leading to the premature dropout of energy-sensitive devices. To address these challenges, we propose BEFL, a joint optimization framework aimed at balancing three objectives: enhancing global model accuracy, minimizing total energy consumption, and reducing energy usage disparities among devices. First, taking into account the communication constraints of MEIoT and the heterogeneity of devices, we employed the Sequential Least Squares Programming (SLSQP) algorithm for the rational allocation of communication resources. Based on this, we introduce a heuristic client selection algorithm that combines cluster partitioning with utility-driven approaches to alleviate both the total energy consumption of all devices and the discrepancies in energy consumption. We utilize the proposed heuristic client selection algorithm as a template for offline imitation learning during pre-training, while adopting a ranking-based reinforcement learning approach online to further boost training efficiency. Our experiments reveal that BEFL improves global model accuracy by 1.6%, reduces energy consumption variance by 72.7%, and lowers total energy consumption by 28.2% compared to existing methods. The relevant code can be found at this https URL.

[LG-34] Learning Speed-Adaptive Walking Agent Using Imitation Learning with Physics-Informed Simulation

Link: https://arxiv.org/abs/2412.03949
Authors: Yi-Hung Chiu,Ung Hee Lee,Changseob Song,Manaen Hu,Inseung Kang
Keywords: Virtual models, labor-intensive data collection, offer a promising, promising solution, solution for studying
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Currently under review

Click to view abstract

Abstract:Virtual models of human gait, or digital twins, offer a promising solution for studying mobility without the need for labor-intensive data collection. However, challenges such as the sim-to-real gap and limited adaptability to diverse walking conditions persist. To address these, we developed and validated a framework to create a skeletal humanoid agent capable of adapting to varying walking speeds while maintaining biomechanically realistic motions. The framework combines a synthetic data generator, which produces biomechanically plausible gait kinematics from open-source biomechanics data, and a training system that uses adversarial imitation learning to train the agent’s walking policy. We conducted comprehensive analyses comparing the agent’s kinematics, synthetic data, and the original biomechanics dataset. The agent achieved a root mean square error of 5.24 ± 0.09 degrees at varying speeds compared to ground-truth kinematics data, demonstrating its adaptability. This work represents a significant step toward developing a digital twin of human locomotion, with potential applications in biomechanics research, exoskeleton design, and rehabilitation.

[LG-35] JANUS: A Difference-Oriented Analyzer For Financial Centralization Risks in Smart Contracts

Link: https://arxiv.org/abs/2412.03938
Authors: Wansen Wang,Pu Zhang,Renjie Ji,Wenchao Huang,Zhaoyi Meng,Yan Xiong
Keywords: violate decentralization principles, caused financial losses, centralization risks, introducing centralization risks, contracts violate decentralization
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:Some smart contracts violate decentralization principles by defining privileged accounts that manage other users’ assets without permission, introducing centralization risks that have caused financial losses. Existing methods, however, face challenges in accurately detecting diverse centralization risks due to their dependence on predefined behavior patterns. In this paper, we propose JANUS, an automated analyzer for Solidity smart contracts that detects financial centralization risks independently of their specific behaviors. JANUS identifies differences between states reached by privileged and ordinary accounts, and analyzes whether these differences are finance-related. Focusing on the impact of risks rather than behaviors, JANUS achieves improved accuracy compared to existing tools and can uncover centralization risks with unknown patterns. To evaluate JANUS’s performance, we compare it with other tools using a dataset of 540 contracts. Our evaluation demonstrates that JANUS outperforms representative tools in terms of detection accuracy for financial centralization risks. Additionally, we evaluate JANUS on a real-world dataset of 33,151 contracts, successfully identifying two types of risks that other tools fail to detect. We also prove that the state traversal method and variable summaries, which are used in JANUS to reduce the number of states to be compared, do not introduce false alarms or omissions in detection.

[LG-36] Traffic Co-Simulation Framework Empowered by Infrastructure Camera Sensing and Reinforcement Learning

Link: https://arxiv.org/abs/2412.03925
Authors: Talha Azfar,Ruimin Ke
Keywords: showing promising potential, showing promising, reinforcement learning, Multi-agent reinforcement learning, promising potential
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Traffic simulations are commonly used to optimize traffic flow, with reinforcement learning (RL) showing promising potential for automated traffic signal control. Multi-agent reinforcement learning (MARL) is particularly effective for learning control strategies for traffic lights in a network using iterative simulations. However, existing methods often assume perfect vehicle detection, which overlooks real-world limitations related to infrastructure availability and sensor reliability. This study proposes a co-simulation framework integrating CARLA and SUMO, which combines high-fidelity 3D modeling with large-scale traffic flow simulation. Cameras mounted on traffic light poles within the CARLA environment use a YOLO-based computer vision system to detect and count vehicles, providing real-time traffic data as input for adaptive signal control in SUMO. MARL agents, trained with four different reward structures, leverage this visual feedback to optimize signal timings and improve network-wide traffic flow. Experiments in the test-bed demonstrate the effectiveness of the proposed MARL approach in enhancing traffic conditions using real-time camera-based detection. The framework also evaluates the robustness of MARL under faulty or sparse sensing and compares the performance of YOLOv5 and YOLOv8 for vehicle detection. Results show that while better accuracy improves performance, MARL agents can still achieve significant improvements with imperfect detection, demonstrating adaptability for real-world scenarios.

[LG-37] Graph Disentangle Causal Model: Enhancing Causal Inference in Networked Observational Data WSDM2025

Link: https://arxiv.org/abs/2412.03913
Authors: Binbin Hu,Zhicheng An,Zhengwei Wu,Ke Tu,Ziqi Liu,Zhiqiang Zhang,Jun Zhou,Yufei Feng,Jiawei Chen
Keywords: Estimating individual treatment, individual treatment effects, Estimating individual, confounder representations, ITE estimation
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: Accepted by WSDM 2025

Click to view abstract

Abstract:Estimating individual treatment effects (ITE) from observational data is a critical task across various domains. However, many existing works on ITE estimation overlook the influence of hidden confounders, which remain unobserved at the individual unit level. To address this limitation, researchers have utilized graph neural networks to aggregate neighbors’ features to capture the hidden confounders and mitigate confounding bias by minimizing the discrepancy of confounder representations between the treated and control groups. Despite the success of these approaches, practical scenarios often treat all features as confounders and involve substantial differences in feature distributions between the treated and control groups. Confusing the adjustment and confounder and enforcing strict balance on the confounder representations could potentially undermine the effectiveness of outcome prediction. To mitigate this issue, we propose a novel framework called the Graph Disentangle Causal model (GDC) to conduct ITE estimation in the network setting. GDC utilizes a causal disentangle module to separate unit features into adjustment and confounder representations. Then we design a graph aggregation module consisting of three distinct graph aggregators to obtain adjustment, confounder, and counterfactual confounder representations. Finally, a causal constraint module is employed to enforce the disentangled representations as true causal factors. The effectiveness of our proposed method is demonstrated by conducting comprehensive experiments on two networked datasets.

[LG-38] Final-Model-Only Data Attribution with a Unifying View of Gradient-Based Methods

Link: https://arxiv.org/abs/2412.03906
Authors: Dennis Wei,Inkit Padhi,Soumya Ghosh,Amit Dhurandhar,Karthikeyan Natesan Ramamurthy,Maria Chang
Keywords: Training data attribution, attributing model behavior, Training data, data attribution, Training
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 28 pages, 8 figures

Click to view abstract

Abstract:Training data attribution (TDA) is the task of attributing model behavior to elements in the training data. This paper draws attention to the common setting where one has access only to the final trained model, and not the training algorithm or intermediate information from training. To serve as a gold standard for TDA in this “final-model-only” setting, we propose further training, with appropriate adjustment and averaging, to measure the sensitivity of the given model to training instances. We then unify existing gradient-based methods for TDA by showing that they all approximate the further training gold standard in different ways. We investigate empirically the quality of these gradient-based approximations to further training, for tabular, image, and text datasets and models. We find that the approximation quality of first-order methods is sometimes high but decays with the amount of further training. In contrast, the approximations given by influence function methods are more stable but surprisingly lower in quality.

[LG-39] Transferring self-supervised pre-trained models for SHM data anomaly detection with scarce labeled data

Link: https://arxiv.org/abs/2412.03880
Authors: Mingyuan Zhou,Xudong Jian,Ye Xia,Zhilu Lai
Keywords: Structural health monitoring, accumulating massive monitoring, Structural health, experienced significant advancements, massive monitoring data
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Click to view abstract

Abstract:Structural health monitoring (SHM) has experienced significant advancements in recent decades, accumulating massive monitoring data. Data anomalies inevitably exist in monitoring data, posing significant challenges to their effective utilization. Recently, deep learning has emerged as an efficient and effective approach for anomaly detection in bridge SHM. Despite its progress, many deep learning models require large amounts of labeled data for training. The process of labeling data, however, is labor-intensive, time-consuming, and often impractical for large-scale SHM datasets. To address these challenges, this work explores the use of self-supervised learning (SSL), an emerging paradigm that combines unsupervised pre-training and supervised fine-tuning. The SSL-based framework aims to learn from only a very small quantity of labeled data by fine-tuning, while making the best use of the vast amount of unlabeled SHM data by pre-training. Mainstream SSL methods are compared and validated on the SHM data of two in-service bridges. Comparative analysis demonstrates that SSL techniques boost data anomaly detection performance, achieving increased F1 scores compared to conventional supervised training, especially given a very limited amount of labeled data. This work manifests the effectiveness and superiority of SSL techniques on large-scale SHM data, providing an efficient tool for preliminary anomaly detection with scarce label information.
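
The abstract compares methods by F1 score. As a minimal sketch of that metric for the binary case (normal versus anomalous segments), assuming labels are encoded as 0 and 1; the function name `f1_score` and the toy labels are illustrative only, and the paper's own evaluation may average over several anomaly classes:

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score (harmonic mean of precision and recall) for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    if true_pos == 0:
        # No correct positive prediction: precision or recall is zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 marks an anomalous data segment, 0 a normal one.
labels = [1, 1, 0, 0, 1]
predictions = [1, 0, 0, 1, 1]
score = f1_score(labels, predictions)
```

F1 is commonly preferred over plain accuracy in anomaly detection because anomalous segments tend to be rare, so a detector that never flags anything can still look accurate.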

[LG-40] GP-FL: Model-Based Hessian Estimation for Second-Order Over-the-Air Federated Learning

Link: https://arxiv.org/abs/2412.03867
Authors: Shayan Mohajer Hamidi,Ali Bereyhi,Saba Asaad,H. Vincent Poor
Keywords: global Hessian matrix, Hessian matrix, global Hessian, Hessian, widely adopted
Categories: Machine Learning (cs.LG)
Comments: The paper is submitted to IEEE Transactions on Signal Processing

Click to view abstract

Abstract:Second-order methods are widely adopted to improve the convergence rate of learning algorithms. In federated learning (FL), these methods require the clients to share their local Hessian matrices with the parameter server (PS), which comes at a prohibitive communication cost. A classical solution to this issue is to approximate the global Hessian matrix from the first-order information. Unlike in idealized networks, this solution does not perform effectively in over-the-air FL settings, where the PS receives noisy versions of the local gradients. This paper introduces a novel second-order FL framework tailored for wireless channels. The pivotal innovation lies in the PS’s capability to directly estimate the global Hessian matrix from the received noisy local gradients via a non-parametric method: the PS models the unknown Hessian matrix as a Gaussian process, and then uses the temporal relation between the gradients and Hessian along with the channel model to find a stochastic estimator for the global Hessian matrix. We refer to this method as Gaussian process-based Hessian modeling for wireless FL (GP-FL) and show that it exhibits a linear-quadratic convergence rate. Numerical experiments on various datasets demonstrate that GP-FL outperforms all classical baseline first and second order FL approaches.
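
For readers unfamiliar with why the Hessian matters here, the following is a minimal sketch of a generic second-order (Newton) update, w ← w − H⁻¹g, on a two-parameter quadratic loss. It is not the GP-FL estimator: the Hessian is given exactly rather than estimated from noisy gradients, and the loss, function names, and values are all illustrative assumptions.

```python
def quadratic_gradient(w, a_matrix, b_vector):
    """Gradient of f(w) = 0.5 * w^T A w - b^T w, which is A w - b."""
    return [
        a_matrix[0][0] * w[0] + a_matrix[0][1] * w[1] - b_vector[0],
        a_matrix[1][0] * w[0] + a_matrix[1][1] * w[1] - b_vector[1],
    ]


def newton_step(w, grad, hessian):
    """One update w <- w - H^{-1} g for a 2-parameter model (2x2 Hessian)."""
    (a, b), (c, d) = hessian
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("Hessian is singular or ill-conditioned")
    # H^{-1} g computed with the closed-form inverse of a 2x2 matrix.
    step0 = (d * grad[0] - b * grad[1]) / det
    step1 = (-c * grad[0] + a * grad[1]) / det
    return [w[0] - step0, w[1] - step1]


# Illustrative quadratic: its Hessian is the constant matrix A.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
w0 = [0.0, 0.0]
w1 = newton_step(w0, quadratic_gradient(w0, A, b), A)
```

On a quadratic loss a single exact Newton step lands on the minimizer, which is the convergence advantage second-order methods aim for; the cost is obtaining H, which is the part the paper addresses for the over-the-air setting.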

[LG-41] A large language model-type architecture for high-dimensional molecular potential energy surfaces

Link: https://arxiv.org/abs/2412.03831
Authors: Xiao Zhu,Srinivasan S. Iyengar
Keywords: Computing high dimensional, areas including fundamental, Computing high, including fundamental prediction, reaction rates
Categories: Machine Learning (cs.LG); Atomic and Molecular Clusters (physics.atm-clus); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
Comments:

Click to view abstract

Abstract:Computing high dimensional potential surfaces for molecular and materials systems is considered to be a great challenge in computational chemistry with potential impact in a range of areas including fundamental prediction of reaction rates. In this paper we design and discuss an algorithm that has similarities to large language models in generative AI and natural language processing. Specifically, we represent a molecular system as a graph which contains a set of nodes, edges, faces etc. Interactions between these sets, which represent molecular subsystems in our case, are used to construct the potential energy surface for a reasonably sized chemical system with 51 dimensions. Essentially a family of neural networks that pertain to the graph-based subsystems, get the job done for this 51 dimensional system. We then ask if this same family of lower-dimensional neural networks can be transformed to provide accurate predictions for a 186 dimensional potential surface. We find that our algorithm does provide reasonably accurate results for this larger dimensional problem with sub-kcal/mol accuracy for the higher dimensional potential surface problem.

[LG-42] Residual Hyperbolic Graph Convolution Networks

Link: https://arxiv.org/abs/2412.03825
Authors: Yangkai Xue,Jindou Dai,Zhipeng Lu,Yuwei Wu,Yunde Jia
Keywords: demonstrated representational capabilities, modeling hierarchical-structured graphs, Hyperbolic graph convolutional, graph convolutional networks, Hyperbolic
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Hyperbolic graph convolutional networks (HGCNs) have demonstrated representational capabilities of modeling hierarchical-structured graphs. However, as in general GCNs, over-smoothing may occur as the number of model layers increases, limiting the representation capabilities of most current HGCN models. In this paper, we propose residual hyperbolic graph convolutional networks (R-HGCNs) to address the over-smoothing problem. We introduce a hyperbolic residual connection function to overcome the over-smoothing problem, and also theoretically prove the effectiveness of the hyperbolic residual function. Moreover, we use product manifolds and HyperDrop to facilitate the R-HGCNs. The distinctive features of the R-HGCNs are as follows: (1) The hyperbolic residual connection preserves the initial node information in each layer and adds a hyperbolic identity mapping to prevent node features from being indistinguishable. (2) Product manifolds in R-HGCNs have been set up with different origin points in different components to facilitate the extraction of feature information from a wider range of perspectives, which enhances the representing capability of R-HGCNs. (3) HyperDrop adds multiplicative Gaussian noise into hyperbolic representations, such that perturbations can be added to alleviate the over-fitting problem without deconstructing the hyperbolic geometry. Experiment results demonstrate the effectiveness of R-HGCNs under various graph convolution layers and different structures of product manifolds.

[LG-43] Diffusion in Zero-Shot Learning for Environmental Audio

Link: https://arxiv.org/abs/2412.03771
Authors: Ysobel Sims,Stephan Chalup,Alexandre Mendes
Keywords: Zero-shot learning, audio zero-shot learning, environmental audio zero-shot, environmental audio, Zero-shot learning enables
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:Zero-shot learning enables models to generalize to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, with poor performance in existing studies. Generative methods, which have demonstrated success in computer vision, are notably absent from environmental audio zero-shot learning, where classification-based approaches dominate. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, a novel diffusion model conditioned on class auxiliary data is introduced. The diffusion model generates synthetic data for unseen classes, which is combined with seen-class data to train a classifier. Experiments are conducted on two environmental audio datasets, ESC-50 and FSC22. Results show that the diffusion model significantly outperforms all baseline methods, achieving more than 25% higher accuracy on the ESC-50 test partition. This work establishes the diffusion model as a promising generative approach for zero-shot learning and introduces the first benchmark of generative methods for environmental audio zero-shot learning, providing a foundation for future research in the field. Code is provided at this https URL for the novel ZeroDiffusion method. 

[LG-44] Hyper: Hyperparameter Robust Efficient Exploration in Reinforcement Learning

Link: https://arxiv.org/abs/2412.03767
Authors: Yiran Wang,Chenshu Liu,Yunfan Li,Sanae Amani,Bolei Zhou,Lin F. Yang
Keywords: dilemma poses significant, poses significant challenges, exploitation dilemma poses, reinforcement learning, dilemma poses
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: arXiv admin note: text overlap with arXiv:1907.05388 by other authors

Click to view abstract

Abstract:The exploration-exploitation dilemma poses significant challenges in reinforcement learning (RL). Recently, curiosity-based exploration methods achieved great success in tackling hard-exploration problems. However, they necessitate extensive hyperparameter tuning on different environments, which heavily limits the applicability and accessibility of this line of methods. In this paper, we characterize this problem via analysis of the agent behavior, concluding the fundamental difficulty of choosing a proper hyperparameter. We then identify the difficulty and the instability of the optimization when the agent learns with curiosity. We propose our method, hyperparameter robust exploration (Hyper), which extensively mitigates the problem by effectively regularizing the visitation of the exploration and decoupling the exploitation to ensure stable training. We theoretically justify that Hyper is provably efficient under function approximation setting and empirically demonstrate its appealing performance and robustness in various environments.

[LG-45] End to End Collaborative Synthetic Data Generation

Link: https://arxiv.org/abs/2412.03766
Authors: Sikha Pentyala,Geetha Sitaraman,Trae Claar,Martine De Cock
Keywords: train models, data, synthetic data, Secure Multiparty Computation, synthetic
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Abstract:The success of AI is based on the availability of data to train models. While in some cases a single data custodian may have sufficient data to enable AI, often multiple custodians need to collaborate to reach a cumulative size required for meaningful AI research. The latter is, for example, often the case for rare diseases, with each clinical site having data for only a small number of patients. Recent algorithms for federated synthetic data generation are an important step towards collaborative, privacy-preserving data sharing. Existing techniques, however, focus exclusively on synthesizer training, assuming that the training data is already preprocessed and that the desired synthetic data can be delivered in one shot, without any hyperparameter tuning. In this paper, we propose an end-to-end collaborative framework for publishing of synthetic data that accounts for privacy-preserving preprocessing as well as evaluation. We instantiate this framework with Secure Multiparty Computation (MPC) protocols and evaluate it in a use case for privacy-preserving publishing of synthetic genomic data for leukemia.

[LG-46] A Hybrid Deep-Learning Model for El Niño Southern Oscillation in the Low-Data Regime

Link: https://arxiv.org/abs/2412.03743
Authors: Jakob Schlör,Matthew Newman,Jannik Thuemmel,Antonietta Capotondi,Bedartha Goswami
Keywords: Niño Southern Oscillation, Southern Oscillation, Niño Southern, climate model biases, climate model simulations
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*Comments:

Abstract:While deep-learning models have demonstrated skillful El Niño Southern Oscillation (ENSO) forecasts up to one year in advance, they are predominantly trained on climate model simulations that provide thousands of years of training data at the expense of introducing climate model biases. Simpler Linear Inverse Models (LIMs) trained on the much shorter observational record also make skillful ENSO predictions but do not capture predictable nonlinear processes. This motivates a hybrid approach, combining the LIM's modest data needs with a deep-learning non-Markovian correction of the LIM. For O(100 yr) datasets, our resulting Hybrid model is more skillful than the LIM while also exceeding the skill of a full deep-learning model. Additionally, while the most predictable ENSO events are still identified in advance by the LIM, they are better predicted by the Hybrid model, especially in the western tropical Pacific for leads beyond about 9 months, by capturing the subsequent asymmetric (warm versus cold phases) evolution of ENSO.

[LG-47] Utilizing Machine Learning Models to Predict Acute Kidney Injury in Septic Patients from MIMIC-III Database

Link: https://arxiv.org/abs/2412.03737
Authors: Aleyeh Roknaldin,Zehao Zhang,Jiayuan Xu,Kamiar Alaei,Maryam Pishgar
Keywords: septic patients, AKI, severe condition, body to respond, respond incorrectly
Categories: Machine Learning (cs.LG)
*Comments: 18 pages, 6 figures, 5 tables

Abstract:Sepsis is a severe condition that causes the body to respond incorrectly to an infection. This reaction can subsequently cause organ failure, a major one being acute kidney injury (AKI). For septic patients, approximately 50% develop AKI, with a mortality rate above 40%. Creating models that can accurately predict AKI based on specific qualities of septic patients is crucial for early detection and intervention. Using medical data from septic patients during intensive care unit (ICU) admission from the Medical Information Mart for Intensive Care 3 (MIMIC-III) database, we extracted 3301 patients with sepsis, with 73% of patients developing AKI. The data was randomly divided into a training set (n = 1980, 60%), a test set (n = 661, 20%), and a validation set (n = 660, 20%). The proposed model was logistic regression, and it was compared against five baseline models: XGBoost, K Nearest Neighbors (KNN), Support Vector Machines (SVM), Random Forest (RF), and LightGBM. Area Under the Curve (AUC), Accuracy, F1-Score, and Recall were calculated for each model. After analysis, we were able to select 23 features to include in our model, the top features being urine output, maximum bilirubin, minimum bilirubin, weight, maximum blood urea nitrogen, and minimum estimated glomerular filtration rate. The logistic regression model performed the best, achieving an AUC score of 0.887 (95% CI: [0.861-0.915]), an accuracy of 0.817, an F1 score of 0.866, a recall score of 0.827, and a Brier score of 0.13. Compared to the best existing literature in this field, our model achieved an 8.57% improvement in AUC while using 13 fewer variables, showcasing its effectiveness in determining AKI in septic patients. While the features selected for predicting AKI in septic patients are similar to previous literature, the top features that influenced our model’s performance differ.
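
The split sizes quoted in the abstract can be sanity-checked against the cohort size. The following is a minimal sketch that uses only the counts stated above; the variable names are illustrative and not taken from the paper.

```python
# Split sizes as reported in the abstract (3301 septic patients in total).
splits = {"train": 1980, "test": 661, "validation": 660}
total = sum(splits.values())

# Share of the cohort in each split, rounded to a whole percent.
shares = {name: round(100 * n / total) for name, n in splits.items()}

for name, pct in shares.items():
    print(f"{name}: n = {splits[name]}, about {pct}% of {total}")
```

The three counts add up to the full cohort, so no patient is left out of the partition.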

[LG-48] Online Experimental Design With Estimation-Regret Trade-off Under Network Interference

Link: https://arxiv.org/abs/2412.03727
Authors: Zhiheng Zhang,Zichen Wang
Keywords: garnered significant interest, garnered significant, significant interest, causal inference, Network interference
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*Comments: 36 pages

Abstract:Network interference has garnered significant interest in the field of causal inference. It reflects diverse sociological behaviors, wherein the treatment assigned to one individual within a network may influence the outcome of other individuals, such as their neighbors. To estimate the causal effect, one classical way is to randomly assign experimental candidates into different groups and compare their differences. However, in the context of sequential experiments, such treatment assignment may result in a large regret. In this paper, we develop a unified interference-based online experimental design framework. Compared to existing literature, we expand the definition of arm space by leveraging the statistical concept of exposure mapping. Importantly, we establish the Pareto-optimal trade-off between the estimation accuracy and regret with respect to both time period and arm space, which remains superior to the baseline even in the absence of network interference. We further propose an algorithmic implementation and model generalization.

[LG-49] Electrocardiogram-based diagnosis of liver diseases: an externally validated and explainable machine learning approach ALT

Link: https://arxiv.org/abs/2412.03717
Authors: Juan Miguel Lopez Alcaraz,Wilhelm Haverkamp,Nils Strodthoff
Keywords: global health concern, major global health, Liver diseases, Liver, major global
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: 8 pages, 3 images, code under this https URL

Abstract:Background: Liver diseases are a major global health concern, often diagnosed using resource-intensive methods. Electrocardiogram (ECG) data, widely accessible and non-invasive, offers potential as a diagnostic tool for liver diseases, leveraging the physiological connections between cardiovascular and hepatic health. Methods: This study applies machine learning models to ECG data for the diagnosis of liver diseases. The pipeline, combining tree-based models with Shapley values for explainability, was trained, internally validated, and externally validated on an independent cohort, demonstrating robust generalizability. Findings: Our results demonstrate the potential of ECG to derive biomarkers to diagnose liver diseases. Shapley values revealed key ECG features contributing to model predictions, highlighting already known connections between cardiovascular biomarkers and hepatic conditions as well as providing new ones. Furthermore, our approach holds promise as a scalable and affordable solution for liver disease detection, particularly in resource-limited settings. Interpretation: This study underscores the feasibility of leveraging ECG features and machine learning to enhance the diagnosis of liver diseases. By providing interpretable insights into cardiovascular-liver interactions, the approach bridges existing gaps in non-invasive diagnostics, offering implications for broader systemic disease monitoring.

[LG-50] A Water Efficiency Dataset for African Data Centers NEURIPS2024

Link: https://arxiv.org/abs/2412.03716
Authors: Noah Shumba,Opelo Tshekiso,Pengfei Li,Giulia Fanti,Shaolei Ren
Keywords: African countries, electricity generation, electricity generation data, selected African countries, amount of freshwater
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
*Comments: Accepted by NeurIPS 2024 Workshop on Tackling Climate Change with Machine Learning

Abstract:AI computing and data centers consume a large amount of freshwater, both directly for cooling and indirectly for electricity generation. While most attention has been paid to developed countries such as the U.S., this paper presents the first-of-its-kind dataset that combines nation-level weather and electricity generation data to estimate water usage efficiency for data centers in 41 African countries across five different climate regions. We also use our dataset to evaluate and estimate the water consumption of inference on two large language models (i.e., Llama-3-70B and GPT-4) in 11 selected African countries. Our findings show that writing a 10-page report using Llama-3-70B could consume about 0.7 liters of water, while the water consumption by GPT-4 for the same task may go up to about 60 liters. For writing a medium-length email of 120-200 words, Llama-3-70B and GPT-4 could consume about 0.13 liters and 3 liters of water, respectively. Interestingly, given the same AI model, 8 out of the 11 selected African countries consume less water than the global average, mainly because of lower water intensities for electricity generation. However, water consumption can be substantially higher in some African countries with a steppe climate than the U.S. and global averages, prompting more attention when deploying AI computing in these countries. Our dataset is publicly available on Hugging Face (this https URL).

[LG-51] Fairness without Demographics through Learning Graph of Gradients KDD2025

Link: https://arxiv.org/abs/2412.03706
Authors: Yingtao Luo,Zhixun Li,Qiang Liu,Jun Zhu
Keywords: Machine learning systems, Machine learning, algorithmic fairness issues, leading to algorithmic, learning systems
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Accepted to KDD 2025 (August Cycle)

Abstract:Machine learning systems are notoriously prone to biased predictions about certain demographic groups, leading to algorithmic fairness issues. Due to privacy concerns and data quality problems, some demographic information may not be available in the training data, and the complex interaction of different demographics can lead to many unknown minority subpopulations, all of which limit the applicability of group fairness. Many existing works on fairness without demographics assume a correlation between groups and features. However, we argue that the model gradients are also valuable for fairness without demographics. In this paper, we show that the correlation between gradients and groups can help identify and improve group fairness. With an adversarial weighting architecture, we construct a graph where samples with similar gradients are connected and learn the weights of different samples from it. Unlike the surrogate grouping methods that cluster groups from features and labels as a proxy sensitive attribute, our method leverages the graph structure as a soft grouping mechanism, which is much more robust to noise. The results show that our method is robust to noise and can improve fairness significantly without decreasing the overall accuracy too much.

[LG-52] Interpretable Hierarchical Attention Network for Medical Condition Identification

Link: https://arxiv.org/abs/2412.03701
Authors: Dongping Fang,Lian Duan,Xiaojing Yuan,Allyn Klunder,Kevin Tan,Suiting Cao,Yeqing Ji,Mike Xu
Keywords: straight past clinical, past clinical evidence, Accurate prediction, medical, hierarchical attention
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Accurate prediction of medical conditions with straight past clinical evidence is a long-sought topic in the medical management and health insurance field. Although great progress has been made with machine learning algorithms, the medical community is still skeptical about the model accuracy and interpretability. This paper presents an innovative hierarchical attention deep learning model, the Interpretable Hierarchical Attention Network (IHAN), to achieve better prediction and clear interpretability that can be easily understood by medical professionals. IHAN uses a hierarchical attention structure that matches naturally with the medical history data structure and reflects the patient's encounter (date of service) sequence. The model attention structure consists of 3 levels: (1) attention on the medical code types (diagnosis codes, procedure codes, lab test results, and prescription drugs), (2) attention on the sequential medical encounters within a type, (3) attention on the individual medical codes within an encounter and type. This model is applied to predict the occurrence of stage 3 chronic kidney disease (CKD), using three years of medical history of Medicare Advantage (MA) members from an American nationwide health insurance company. The model takes members' medical events, both claims and Electronic Medical Records (EMR) data, as input, makes a prediction of stage 3 CKD, and calculates the contribution of individual events to the predicted outcome.

[LG-53] Good practices for evaluation of machine learning systems

Link: https://arxiv.org/abs/2412.03700
Authors: Luciana Ferrer,Odette Scharenborg,Tom Bäckström
Keywords: development decisions affect, decisions affect, development decisions, training data, test data
Categories: Machine Learning (cs.LG)
*Comments: v1.0

Abstract:Many development decisions affect the results obtained from ML experiments: training data, features, model architecture, hyperparameters, test data, etc. Among these aspects, arguably the most important design decisions are those that involve the evaluation procedure. This procedure is what determines whether the conclusions drawn from the experiments will or will not generalize to unseen data and whether they will be relevant to the application of interest. If the data is incorrectly selected, the wrong metric is chosen for evaluation or the significance of the comparisons between models is overestimated, conclusions may be misleading or result in suboptimal development decisions. To avoid such problems, the evaluation protocol should be very carefully designed before experimentation starts. In this work we discuss the main aspects involved in the design of the evaluation protocol: data selection, metric selection, and statistical significance. This document is not meant to be an exhaustive tutorial on each of these aspects. Instead, the goal is to explain the main guidelines that should be followed in each case. We include examples taken from the speech processing field, and provide a list of common mistakes related to each aspect.

[LG-54] Hyperparameter Tuning Through Pessimistic Bilevel Optimization

Link: https://arxiv.org/abs/2412.03666
Authors: Meltem Apaydin Ustun,Liang Xu,Bo Zeng,Xiaoning Qian
Keywords: deep learning models, model learning achieved, bilevel optimization problem, bilevel hyperparameter optimization, machine learning
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Automated hyperparameter search in machine learning, especially for deep learning models, is typically formulated as a bilevel optimization problem, with hyperparameter values determined by the upper level and the model learning achieved by the lower-level problem. Most of the existing bilevel optimization solutions either assume the uniqueness of the optimal training model given hyperparameters or adopt an optimistic view when the non-uniqueness issue emerges. Potential model uncertainty may arise when training complex models with limited data, especially when the uniqueness assumption is violated. Thus, the suitability of the optimistic view underlying current bilevel hyperparameter optimization solutions is questionable. In this paper, we propose pessimistic bilevel hyperparameter optimization to assure appropriate outer-level hyperparameters to better generalize the inner-level learned models, by explicitly incorporating potential uncertainty of the inner-level solution set. To solve the resulting computationally challenging pessimistic bilevel optimization problem, we develop a novel relaxation-based approximation method. It derives pessimistic solutions with more robust prediction models. In our empirical studies of automated hyperparameter search for binary linear classifiers, pessimistic solutions have demonstrated better prediction performances than optimistic counterparts when we have limited training data or perturbed testing data, showing the necessity of considering pessimistic solutions besides existing optimistic ones.
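
For orientation, the optimistic and pessimistic views are commonly written as below. This is the generic textbook form of bilevel optimization and not necessarily the paper's notation: the symbols \lambda (hyperparameters), w (model parameters), F (upper-level objective, e.g. validation loss), f (lower-level training objective) and S(\lambda) (lower-level solution set) are all assumed here for illustration.

```latex
\begin{aligned}
S(\lambda) &= \operatorname*{arg\,min}_{w}\; f(\lambda, w) \\
\text{optimistic:}\quad & \min_{\lambda}\; \min_{w \in S(\lambda)} F(\lambda, w) \\
\text{pessimistic:}\quad & \min_{\lambda}\; \max_{w \in S(\lambda)} F(\lambda, w)
\end{aligned}
```

The two coincide when S(\lambda) is a singleton. They differ only when the lower-level solution is not unique, in which case the pessimistic form guards against the worst model in S(\lambda).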

[LG-55] Explainable Malware Detection through Integrated Graph Reduction and Learning Techniques

Link: https://arxiv.org/abs/2412.03634
Authors: Hesamodin Mohammadian,Griffin Higgins,Samuel Ansong,Roozbeh Razavi-Far,Ali A. Ghorbani
Keywords: Control Flow Graphs, Function Call Graphs, Control Flow, Function Call, Graph Neural Networks
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Abstract:Control Flow Graphs and Function Call Graphs have become pivotal in providing a detailed understanding of program execution and effectively characterizing the behavior of malware. These graph-based representations, when combined with Graph Neural Networks (GNN), have shown promise in developing high-performance malware detectors. However, challenges remain due to the large size of these graphs and the inherent opacity in the decision-making process of GNNs. This paper addresses these issues by developing several graph reduction techniques to reduce graph size and applying the state-of-the-art GNNExplainer to enhance the interpretability of GNN outputs. The analysis demonstrates that integrating our proposed graph reduction technique along with GNNExplainer in the malware detection framework significantly reduces graph size while preserving high performance, providing an effective balance between efficiency and transparency in malware detection.

[LG-56] Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

Link: https://arxiv.org/abs/2412.03611
Authors: Xinyu Yuan,Yan Qiao,Meng Li,Zhenchun Wei,Cuiying Feng
Keywords: fast data stream, Estimating the frequency, fast data, network measurement, frequency of items
Categories: Machine Learning (cs.LG); Databases (cs.DB)
*Comments:

Abstract:Estimating the frequency of items in high-volume, fast data streams has been extensively studied in many areas, such as databases and network measurement. Traditional sketch algorithms give only very rough estimates under a limited memory budget. Some learning-augmented algorithms have been proposed recently, but their offline framework requires actual frequencies for training, which are generally hard to obtain, and they are too slow for real-time processing while their accuracy remains coarse-grained. To this end, we propose a more practical learning-based estimation framework, namely UCL-sketch, which follows the line of equation-based sketches to estimate per-key frequencies. In a nutshell, there are two key techniques: online training via equivalent learning without ground truth, and a highly scalable architecture with logical estimation buckets. We conduct experiments on both real-world and synthetic datasets. The results demonstrate that our method greatly outperforms existing state-of-the-art sketches regarding per-key accuracy and distribution, while preserving resource efficiency. Our code is attached in the supplementary material, and will be made publicly available at this https URL.

[LG-57] Online Physics-Informed Dynamic Mode Decomposition: Theory and Applications

Link: https://arxiv.org/abs/2412.03609
Authors: Biqi Chen,Ying Wang
Keywords: Dynamic Mode Decomposition, Dynamic Mode, Mode Decomposition, received increasing research, increasing research attention
Categories: Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO)
*Comments:

Abstract:Dynamic Mode Decomposition (DMD) has received increasing research attention due to its capability to analyze and model complex dynamical systems. However, it faces challenges in computational efficiency, noise sensitivity, and difficulty adhering to physical laws, which negatively affect its performance. Addressing these issues, we present Online Physics-informed DMD (OPIDMD), a novel adaptation of DMD into a convex optimization framework. This approach not only ensures convergence to a unique global optimum, but also enhances the efficiency and accuracy of modeling dynamical systems in an online setting. Leveraging the Bayesian DMD framework, we propose a probabilistic interpretation of Physics-informed DMD (piDMD), examining the impact of physical constraints on the DMD linear operator. Further, we implement online proximal gradient descent and formulate specific algorithms to tackle problems with different physical constraints, enabling real-time solutions across various scenarios. Compared with existing algorithms such as Exact DMD, Online DMD, and piDMD, OPIDMD achieves the best prediction performance in short-term forecasting, e.g. an R^2 value of 0.991 for noisy Lorenz system. The proposed method employs a time-varying linear operator, offering a promising solution for the real-time simulation and control of complex dynamical systems.
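
To make the idea of an online proximal gradient update for a constrained DMD operator concrete, here is a minimal sketch. It is not the OPIDMD algorithm from the paper: the symmetric-operator constraint, the step size, the noiseless synthetic data and all function names are assumptions chosen for illustration. For a constraint set, the proximal operator reduces to a projection onto that set.

```python
import numpy as np

def project_symmetric(A):
    """Projection onto symmetric matrices (the prox of that constraint's indicator)."""
    return 0.5 * (A + A.T)

def online_prox_step(A, x, y, step):
    """One online update for the loss 0.5 * ||A x - y||^2 on a snapshot pair (x, y)."""
    residual = A @ x - y
    grad = np.outer(residual, x)  # gradient of the loss with respect to A
    return project_symmetric(A - step * grad)

rng = np.random.default_rng(0)
# Hypothetical ground-truth operator that satisfies the assumed constraint.
A_true = 0.3 * project_symmetric(rng.normal(size=(3, 3)))

A_est = np.zeros((3, 3))
for _ in range(2000):
    x = rng.normal(size=3)   # current snapshot
    y = A_true @ x           # next snapshot (noiseless in this sketch)
    A_est = online_prox_step(A_est, x, y, step=0.05)

print(np.max(np.abs(A_est - A_true)))
```

Each snapshot pair is used once and then discarded, which is what makes the scheme online. Other physical constraints would swap in a different projection or proximal map.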

[LG-58] Resampled Mutual Information for Clustering and Community Detection

Link: https://arxiv.org/abs/2412.03584
Authors: Cheaheon Lim
Keywords: introduce resampled mutual, pair counting approaches, resampled mutual information, introduce resampled, resampled mutual
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*Comments:

Abstract:We introduce resampled mutual information (ResMI), a novel measure of clustering similarity that combines insights from information theoretic and pair counting approaches to clustering and community detection. Similar to chance-corrected measures, ResMI satisfies the constant baseline property, but it has the advantages of not requiring adjustment terms and being fully interpretable in the language of information theory. Experiments on synthetic datasets demonstrate that ResMI is robust to common biases exhibited by existing measures, particularly in settings with high cluster counts and asymmetric cluster distributions. Additionally, we show that ResMI identifies meaningful community structures in two real contact tracing networks.

[LG-59] Exploring Non-Linear Effects of Built Environment on Travel Using an Integrated Machine Learning and Inferential Modeling Approach: A Three-Wave Repeated Cross-Sectional Study

Link: https://arxiv.org/abs/2412.03582
Authors: Niaz Mahmud Zafri,Ming Zhang
Keywords: study investigates, investigates the dynamic, built environment, built environment characteristics, dynamic relationship
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP)
*Comments:

Abstract:This study investigates the dynamic relationship between the built environment and travel in Austin, Texas, over a 20-year period. Using three waves of household travel surveys from 1997, 2006, and 2017, the research employs a repeated cross-sectional approach to address the limitations of traditional longitudinal and cross-sectional studies. Methodologically, it integrates machine learning and inferential modeling to uncover non-linear relationships and threshold effects of built environment characteristics on travel. Findings reveal that the built environment serves as a sustainable tool for managing travel in the long term, contributing 50% or more to the total feature importance in predicting individual travel, surpassing the combined effects of personal and household characteristics. Increased transit accessibility, local and regional destination accessibility, population and employment density, and diversity significantly reduce travel, particularly within their identified thresholds, though the magnitude of their influence varies across time periods. These findings highlight the potential of smart growth policies (such as expanding transit accessibility, promoting high-density and mixed-use development, and discouraging single-use development and peripheral sprawl) as effective strategies to reduce car dependency and manage travel demand.

[LG-60] Multi-Scale Node Embeddings for Graph Modeling and Generation

Link: https://arxiv.org/abs/2412.04354
Authors: Riccardo Milocco,Fabian Jansen,Diego Garlaschelli
Keywords: Machine Learning, Science and Machine, vector-based downstream tasks, abstract geometric space, Network Science
Categories: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); General Economics (econ.GN); Data Analysis, Statistics and Probability (physics.data-an)
*Comments:

Abstract:Lying at the interface between Network Science and Machine Learning, node embedding algorithms take a graph as input and encode its structure onto output vectors that represent nodes in an abstract geometric space, enabling various vector-based downstream tasks such as network modelling, data compression, link prediction, and community detection. Two apparently unrelated limitations affect these algorithms. On one hand, it is not clear what the basic operation defining vector spaces, i.e. the vector sum, corresponds to in terms of the original nodes in the network. On the other hand, while the same input network can be represented at multiple levels of resolution by coarse-graining the constituent nodes into arbitrary block-nodes, the relationship between node embeddings obtained at different hierarchical levels is not understood. Here, building on recent results in network renormalization theory, we address these two limitations at once and define a multiscale node embedding method that, upon arbitrary coarse-grainings, ensures statistical consistency of the embedding vector of a block-node with the sum of the embedding vectors of its constituent nodes. We illustrate the power of this approach on two economic networks that can be naturally represented at multiple resolution levels: namely, the international trade between (sets of) countries and the input-output flows among (sets of) industries in the Netherlands. We confirm the statistical consistency between networks retrieved from coarse-grained node vectors and networks retrieved from sums of fine-grained node vectors, a result that cannot be achieved by alternative methods. Several key network properties, including a large number of triangles, are successfully replicated already from embeddings of very low dimensionality, allowing for the generation of faithful replicas of the original networks at arbitrary resolution levels.
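
The consistency requirement between a block-node vector and the sum of its constituents can be illustrated with a small numerical sketch. The link function 1 - exp(-u·v) used below is an assumption, chosen because it makes the sum rule exact for independent links; the abstract does not state the paper's functional form, and the random vectors and block partition are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical non-negative embedding vectors for 6 fine-grained nodes.
x = rng.uniform(0.0, 1.0, size=(6, 2))
blocks = [[0, 1, 2], [3, 4, 5]]  # arbitrary coarse-graining into two block-nodes

def link_prob(u, v):
    """Assumed link probability between two (block-)nodes with vectors u and v."""
    return 1.0 - np.exp(-np.dot(u, v))

# Block-node vectors defined as sums of their constituent node vectors.
X = np.array([x[b].sum(axis=0) for b in blocks])
p_block_from_sum = link_prob(X[0], X[1])

# Same quantity from the fine-grained level: two blocks are linked unless
# every pair of constituent nodes is unlinked (links assumed independent).
no_link = 1.0
for i in blocks[0]:
    for j in blocks[1]:
        no_link *= 1.0 - link_prob(x[i], x[j])
p_block_from_fine = 1.0 - no_link

print(p_block_from_sum, p_block_from_fine)
```

The two probabilities agree because the dot product is bilinear, so the exponents of the pairwise terms add up to the dot product of the summed vectors.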

[LG-61] Pathwise optimization for bridge-type estimators and its applications

Link: https://arxiv.org/abs/2412.04047
Authors: Alessandro De Gregorio,Francesco Iafrate
Keywords: Sparse parametric models, Sparse parametric, parametric models, great interest, interest in statistical
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
*Comments:

Abstract:Sparse parametric models are of great interest in statistical learning and are often analyzed by means of regularized estimators. Pathwise methods make it possible to efficiently compute the full solution path for penalized estimators, for any possible value of the penalization parameter \lambda. In this paper we deal with the pathwise optimization for bridge-type problems; i.e. we are interested in the minimization of a loss function, such as negative log-likelihood or residual sum of squares, plus the sum of \ell^q norms with q\in(0,1] involving adaptive coefficients. For some loss functions this regularization achieves asymptotically the oracle properties (such as the selection consistency). Nevertheless, since the objective function involves nonconvex and nondifferentiable terms, the minimization problem is computationally challenging. The aim of this paper is to apply some general algorithms, arising from nonconvex optimization theory, to compute efficiently the path solutions for the adaptive bridge estimator with multiple penalties. In particular, we take into account two different approaches: accelerated proximal gradient descent and blockwise alternating optimization. The convergence and the path consistency of these algorithms are discussed. In order to assess our methods, we apply these algorithms to the penalized estimation of diffusion processes observed at discrete times. The latter represents a recent research topic in the field of statistics for time-dependent data.
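
One way to write the objective described above is sketched below. The notation is assumed rather than taken from the paper: \mathcal{L} is the loss, the parameter is split into m blocks \theta_k of size p_k, each block has its own penalization parameter \lambda_k and exponent q_k, and w_{k,j} are the adaptive coefficients.

```latex
\hat{\theta} \in \operatorname*{arg\,min}_{\theta}\;
  \mathcal{L}(\theta)
  + \sum_{k=1}^{m} \lambda_k \sum_{j=1}^{p_k} w_{k,j}\, \lvert \theta_{k,j} \rvert^{q_k},
\qquad q_k \in (0,1]
```

With m = 1, q_1 = 1 and unit weights this reduces to the familiar lasso penalty. For q_k < 1 the penalty is nonconvex, which is the source of the computational difficulty mentioned in the abstract.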

[LG-62] A Note on Spectral Map

Link: https://arxiv.org/abs/2412.04011
Authors: Tuğçe Gökdemir,Jakub Rydzewski
Keywords: rare events due, transitions between states, thermal temperature, drive rare events, rare events
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*Comments: A letter prepared for the Ensemble journal of the Molecular Simulation Society of Japan (MSSJ)

Abstract:In molecular dynamics (MD) simulations, transitions between states are often rare events due to energy barriers that exceed the thermal temperature. Because of their infrequent occurrence and the huge number of degrees of freedom in molecular systems, understanding the physical properties that drive rare events is immensely difficult. A common approach to this problem is to propose a collective variable (CV) that describes this process by a simplified representation. However, choosing CVs is not easy, as it often relies on physical intuition. Machine learning (ML) techniques provide a promising approach for effectively extracting optimal CVs from MD data. Here, we provide a note on a recent unsupervised ML method called spectral map, which constructs CVs by maximizing the timescale separation between slow and fast variables in the system.

[LG-63] How well behaved is finite dimensional Diffusion Maps?

Link: https://arxiv.org/abs/2412.03992
Authors: Wenyu Bo, Marina Meilă (Department of Statistics, University of Washington, Seattle, WA)
Keywords: isometric Diffusion Maps, finite polynomial approximation, Diffusion Maps, isometric Diffusion, family of submanifolds
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 20 pages, 3 figures

Abstract: Under a set of assumptions on a family of submanifolds $\subset \mathbb{R}^D$, we derive a series of geometric properties that remain valid after finite-dimensional and almost isometric Diffusion Maps (DM), including almost uniform density, finite polynomial approximation and local reach. Leveraging these properties, we establish a rigorous bound of $O\left(\left(\frac{\log n}{n}\right)^{\frac{1}{8d+16}}\right)$ on the embedding errors introduced by the DM algorithm. These results offer a solid theoretical foundation for understanding the performance and reliability of DM in practical applications.

[LG-64] Safe and Efficient Online Convex Optimization with Linear Budget Constraints and Partial Feedback

Link: https://arxiv.org/abs/2412.03983
Authors: Shanqi Liu, Xin Liu
Keywords: paper studies online, unknown linear budget, online convex optimization, studies online convex, linear budget constraints
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract: This paper studies online convex optimization with unknown linear budget constraints, where only the gradient information of the objective and the bandit feedback of constraint functions are observed. We propose a safe and efficient Lyapunov-optimization algorithm (SELO) that can achieve an $O(\sqrt{T})$ regret and zero cumulative constraint violation. The result also implies SELO achieves $O(\sqrt{T})$ regret when the budget is hard and not allowed to be violated. The proposed algorithm is computationally efficient as it resembles a primal-dual algorithm where the primal problem is an unconstrained, strongly convex and smooth problem, and the dual problem has a simple gradient-type update. The algorithm and theory are further justified in a simulated application of energy-efficient task processing in distributed data centers.

[LG-65] Deep Learning Modeling Method for RF Devices Based on Uniform Noise Training Set

Link: https://arxiv.org/abs/2412.03936
Authors: Zhaokun Hu, Yindong Xiao, Houjun Wang, Jiayong Yu, Zihang Gao
Keywords: uniform noise training, traditional modeling methods, noise training set, uniform noise, continue to increase
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 9 pages, 11 figures

Abstract:As the scale and complexity of integrated circuits continue to increase, traditional modeling methods are struggling to address the nonlinear challenges in radio frequency (RF) chips. Deep learning has been increasingly applied to RF device modeling. This paper proposes a deep learning-based modeling method for RF devices using a uniform noise training set, aimed at modeling and fitting the nonlinear characteristics of RF devices. We hypothesize that a uniform noise signal can encompass the full range of characteristics across both frequency and amplitude, and that a deep learning model can effectively capture and learn these features. Based on this hypothesis, the paper designs a complete integrated circuit modeling process based on measured data, including data collection, processing, and neural network training. The proposed method is experimentally validated using the RF amplifier PW210 as a case study. Experimental results show that the uniform noise training set allows the model to capture the nonlinear characteristics of RF devices, and the trained model can predict waveform patterns it has never encountered before. The proposed deep learning-based RF device modeling method, using a uniform noise training set, demonstrates strong generalization capability and excellent training performance, offering high practical application value.

[LG-66] Reconstruction of boosted and resolved multi-Higgs-boson events with symmetry-preserving attention networks

Link: https://arxiv.org/abs/2412.03819
Authors: Haoyang Li, Marko Stamenkovic, Alexander Shmakov, Michael Fenton, Darius Shih-Chieh Chao, Kaitlyn Maiya White, Caden Mikkelsen, Jovan Mitic, Cristina Mantilla Suarez, Melissa Quinnan, Greg Landsberg, Harvey Newman, Pierre Baldi, Daniel Whiteson, Javier Duarte
Keywords: quartic Higgs self-interaction, Higgs self-interaction strengths, large transverse momentum, multiple Higgs bosons, standard model effects
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)

Abstract: The production of multiple Higgs bosons at the CERN LHC provides a direct way to measure the trilinear and quartic Higgs self-interaction strengths as well as potential access to beyond the standard model effects that can enhance production at large transverse momentum $p_{\mathrm{T}}$. The largest event fraction arises from the fully hadronic final state in which every Higgs boson decays to a bottom quark-antiquark pair ($b\bar{b}$). This introduces a combinatorial challenge known as the jet assignment problem: assigning jets to sets representing Higgs boson candidates. Symmetry-preserving attention networks (SPA-Nets) have been developed to address this challenge. However, the complexity of jet assignment increases when simultaneously considering both $H\rightarrow b\bar{b}$ reconstruction possibilities, i.e., two “resolved” small-radius jets each containing a shower initiated by a $b$-quark or one “boosted” large-radius jet containing a merged shower initiated by a $b\bar{b}$ pair. The latter improves the reconstruction efficiency at high $p_{\mathrm{T}}$. In this work, we introduce a generalization to the SPA-Net approach to simultaneously consider both boosted and resolved reconstruction possibilities and unambiguously interpret an event as “fully resolved”, “fully boosted”, or in between. We report the performance of baseline methods, the original SPA-Net approach, and our generalized version on nonresonant HH and HHH production at the LHC. Considering both boosted and resolved topologies, our SPA-Net approach increases the Higgs boson reconstruction purity by 57–62% and the efficiency by 23–38% compared to the baseline method depending on the final state.

[LG-67] Samudra: An AI Global Ocean Emulator for Climate

Link: https://arxiv.org/abs/2412.03795
Authors: Surya Dheeshjith, Adam Subel, Alistair Adcroft, Julius Busecke, Carlos Fernandez-Granda, Shubham Gupta, Laure Zanna
Keywords: conventional numerical predictions, outperform conventional numerical, numerical predictions, forecasting have emerged, emerged as powerful
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)

Abstract: AI emulators for forecasting have emerged as powerful tools that can outperform conventional numerical predictions. The next frontier is to build emulators for long-term climate projections with robust skill across a wide range of spatiotemporal scales, a particularly important goal for the ocean. Our work builds a skillful global emulator of the ocean component of a state-of-the-art climate model. We emulate key ocean variables (sea surface height, horizontal velocities, temperature, and salinity) across their full depth. We use a modified ConvNeXt UNet architecture trained on multidepth levels of ocean data. We show that the ocean emulator, Samudra, which exhibits no drift relative to the truth, can reproduce the depth structure of ocean variables and their interannual variability. Samudra is stable for centuries and 150 times faster than the original ocean model. Samudra struggles to capture the correct magnitude of the forcing trends and simultaneously remain stable, requiring further work.

[LG-68] Community Detection with Heterogeneous Block Covariance Model

Link: https://arxiv.org/abs/2412.03780
Authors: Xiang Li, Yunpeng Zhao, Qing Pan, Ning Hao
Keywords: clustering objects based, pairwise relationships, community detection methods, task of clustering, Community detection
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

Abstract:Community detection is the task of clustering objects based on their pairwise relationships. Most of the model-based community detection methods, such as the stochastic block model and its variants, are designed for networks with binary (yes/no) edges. In many practical scenarios, edges often possess continuous weights, spanning positive and negative values, which reflect varying levels of connectivity. To address this challenge, we introduce the heterogeneous block covariance model (HBCM) that defines a community structure within the covariance matrix, where edges have signed and continuous weights. Furthermore, it takes into account the heterogeneity of objects when forming connections with other objects within a community. A novel variational expectation-maximization algorithm is proposed to estimate the group membership. The HBCM provides provable consistent estimates of memberships, and its promising performance is observed in numerical simulations with different setups. The model is applied to a single-cell RNA-seq dataset of a mouse embryo and a stock price dataset. Supplementary materials for this article are available online.

[LG-69] Learning Networks from Wide-Sense Stationary Stochastic Processes

Link: https://arxiv.org/abs/2412.03768
Authors: Anirudh Rayas, Jiajun Cheng, Rajasekhar Anguluri, Deepjyoti Deka, Gautam Dasarathy
Keywords: Complex networked systems, Complex networked, networked systems driven, fields like neuroscience
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract: Complex networked systems driven by latent inputs are common in fields like neuroscience, finance, and engineering. A key inference problem here is to learn edge connectivity from node outputs (potentials). We focus on systems governed by steady-state linear conservation laws: $X_t = L^{\ast} Y_t$, where $X_t, Y_t \in \mathbb{R}^p$ denote inputs and potentials, respectively, and the sparsity pattern of the $p \times p$ Laplacian $L^{\ast}$ encodes the edge structure. Assuming $X_t$ to be a wide-sense stationary stochastic process with a known spectral density matrix, we learn the support of $L^{\ast}$ from temporally correlated samples of $Y_t$ via an $\ell_1$-regularized Whittle’s maximum likelihood estimator (MLE). The regularization is particularly useful for learning large-scale networks in the high-dimensional setting where the network size $p$ significantly exceeds the number of samples $n$. We show that the MLE problem is strictly convex, admitting a unique solution. Under a novel mutual incoherence condition and certain sufficient conditions on $(n, p, d)$, we show that the ML estimate recovers the sparsity pattern of $L^{\ast}$ with high probability, where $d$ is the maximum degree of the graph underlying $L^{\ast}$. We provide recovery guarantees for $L^{\ast}$ in element-wise maximum, Frobenius, and operator norms. Finally, we complement our theoretical results with several simulation studies on synthetic and benchmark datasets, including engineered systems (power and water networks), and real-world datasets from neural systems (such as the human brain).

[LG-70] Optimal probabilistic feature shifts for reclassification in tree ensembles

Link: https://arxiv.org/abs/2412.03722
Authors: Víctor Blanco, Alberto Japón, Justo Puerto, Peter Zhang
Keywords: ensemble classification rule, tree ensemble classification, mathematical optimization based, desired class, target class
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 30 pages, 4 Figures, 4 Tables

Abstract: In this paper we provide a novel mathematical optimization-based methodology to perturb the features of a given observation so that it is re-classified, by a tree ensemble classification rule, to a certain desired class. The method is based on these facts: the most viable changes for an observation to reach the desired class do not always coincide with the closest point (in the feature space) of the target class; individuals put effort into a small number of features to reach the desired class; and each individual is endowed with a probability of changing each of its features to a given value, which determines the overall probability of changing to the target class. Putting it all together, we provide different methods to find the features on which individuals must exert effort to maximize the probability of reaching the target class. Our method also allows us to rank the most important features in the tree ensemble. The proposed methodology is tested on a real dataset, validating the proposal.

[LG-71] Asymptotics of Linear Regression with Linearly Dependent Data

Link: https://arxiv.org/abs/2412.03702
Authors: Behrad Moniri, Hamed Hassani
Keywords: linear dependency structure, linear dependency, dependency structure, assumption of independence, linear regression
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:In this paper we study the asymptotics of linear regression in settings where the covariates exhibit a linear dependency structure, departing from the standard assumption of independence. We model the covariates using stochastic processes with spatio-temporal covariance and analyze the performance of ridge regression in the high-dimensional proportional regime, where the number of samples and feature dimensions grow proportionally. A Gaussian universality theorem is proven, demonstrating that the asymptotics are invariant under replacing the covariates with Gaussian vectors preserving mean and covariance. Next, leveraging tools from random matrix theory, we derive precise characterizations of the estimation error. The estimation error is characterized by a fixed-point equation involving the spectral properties of the spatio-temporal covariance matrices, enabling efficient computation. We then study optimal regularization, overparameterization, and the double descent phenomenon in the context of dependent data. Simulations validate our theoretical predictions, shedding light on how dependencies influence estimation error and the choice of regularization parameters.

[LG-72] Interpreting Transformers for Jet Tagging NEURIPS2024

Link: https://arxiv.org/abs/2412.03673
Authors: Aaron Wang, Abijith Gandrakota, Jennifer Ngadiuba, Vivekanand Sahu, Priyansh Bhatnagar, Elham E Khoda, Javier Duarte
Keywords: CERN LHC, ATLAS and CMS, vast data generated, experiments like ATLAS, attention-based transformer models
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
Comments: Accepted at the Machine Learning and the Physical Sciences Workshop, NeurIPS 2024

Abstract: Machine learning (ML) algorithms, particularly attention-based transformer models, have become indispensable for analyzing the vast data generated by particle physics experiments like ATLAS and CMS at the CERN LHC. Particle Transformer (ParT), a state-of-the-art model, leverages particle-level attention to improve jet-tagging tasks, which are critical for identifying particles resulting from proton collisions. This study focuses on interpreting ParT by analyzing attention heat maps and particle-pair correlations on the $\eta$-$\phi$ plane, revealing a binary attention pattern where each particle attends to at most one other particle. At the same time, we observe that ParT shows varying focus on important particles and subjets depending on decay, indicating that the model learns traditional jet substructure observables. These insights enhance our understanding of the model’s internal workings and learning process, offering potential avenues for improving the efficiency of transformer architectures in future high-energy physics applications.

[LG-73] Deep Learning in Single-Cell and Spatial Transcriptomics Data Analysis: Advances and Challenges from a Data Science Perspective

Link: https://arxiv.org/abs/2412.03614
Authors: Shuang Ge, Shuqing Sun, Huan Xu, Qiang Cheng, Zhixiang Ren
Keywords: investigate cellular properties, revolutionized our capacity, capacity to investigate, single-cell and spatial, spatial
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)

Abstract:The development of single-cell and spatial transcriptomics has revolutionized our capacity to investigate cellular properties, functions, and interactions in both cellular and spatial contexts. However, the analysis of single-cell and spatial omics data remains challenging. First, single-cell sequencing data are high-dimensional and sparse, often contaminated by noise and uncertainty, obscuring the underlying biological signals. Second, these data often encompass multiple modalities, including gene expression, epigenetic modifications, and spatial locations. Integrating these diverse data modalities is crucial for enhancing prediction accuracy and biological interpretability. Third, while the scale of single-cell sequencing has expanded to millions of cells, high-quality annotated datasets are still limited. Fourth, the complex correlations of biological tissues make it difficult to accurately reconstruct cellular states and spatial contexts. Traditional feature engineering-based analysis methods struggle to deal with the various challenges presented by intricate biological networks. Deep learning has emerged as a powerful tool capable of handling high-dimensional complex data and automatically identifying meaningful patterns, offering significant promise in addressing these challenges. This review systematically analyzes these challenges and discusses related deep learning approaches. Moreover, we have curated 21 datasets from 9 benchmarks, encompassing 58 computational methods, and evaluated their performance on the respective modeling tasks. Finally, we highlight three areas for future development from a technical, dataset, and application perspective. This work will serve as a valuable resource for understanding how deep learning can be effectively utilized in single-cell and spatial transcriptomics analyses, while inspiring novel approaches to address emerging challenges.

[LG-74] Advanced Risk Prediction and Stability Assessment of Banks Using Time Series Transformer Models

Link: https://arxiv.org/abs/2412.03606
Authors: Wenying Sun, Zhen Xu, Wenqing Zhang, Kunyuan Ma, You Wu, Mengfang Sun
Keywords: Time Series Transformer, Series Transformer model, Time Series, bank stability index, Series Transformer
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG)

Abstract:This paper aims to study the prediction of the bank stability index based on the Time Series Transformer model. The bank stability index is an important indicator to measure the health status and risk resistance of financial institutions. Traditional prediction methods are difficult to adapt to complex market changes because they rely on single-dimensional macroeconomic data. This paper proposes a prediction framework based on the Time Series Transformer, which uses the self-attention mechanism of the model to capture the complex temporal dependencies and nonlinear relationships in financial data. Through experiments, we compare the model with LSTM, GRU, CNN, TCN and RNN-Transformer models. The experimental results show that the Time Series Transformer model outperforms other models in both mean square error (MSE) and mean absolute error (MAE) evaluation indicators, showing strong prediction ability. This shows that the Time Series Transformer model can better handle multidimensional time series data in bank stability prediction, providing new technical approaches and solutions for financial risk management.

Information Retrieval

[IR-0] User-item fairness tradeoffs in recommendations

Link: https://arxiv.org/abs/2412.04466
Authors: Sophie Greenwood, Sudalakshmee Chiniah, Nikhil Garg
Keywords: basic recommendation paradigm, fairness, item fairness, item, user
Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY)
Comments: Accepted at the Thirty-Eighth Annual Conference on Neural Information Processing Systems

Abstract:In the basic recommendation paradigm, the most (predicted) relevant item is recommended to each user. This may result in some items receiving lower exposure than they “should”; to counter this, several algorithmic approaches have been developed to ensure item fairness. These approaches necessarily degrade recommendations for some users to improve outcomes for items, leading to user fairness concerns. In turn, a recent line of work has focused on developing algorithms for multi-sided fairness, to jointly optimize user fairness, item fairness, and overall recommendation quality. This induces the question: what is the tradeoff between these objectives, and what are the characteristics of (multi-objective) optimal solutions? Theoretically, we develop a model of recommendations with user and item fairness objectives and characterize the solutions of fairness-constrained optimization. We identify two phenomena: (a) when user preferences are diverse, there is “free” item and user fairness; and (b) users whose preferences are misestimated can be especially disadvantaged by item fairness constraints. Empirically, we prototype a recommendation system for preprints on arXiv and implement our framework, measuring the phenomena in practice and showing how these phenomena inform the design of markets with recommendation systems-intermediated matching.

[IR-1] Graph-Sequential Alignment and Uniformity: Toward Enhanced Recommendation Systems

Link: https://arxiv.org/abs/2412.04276
Authors: Yuwei Cao, Liangwei Yang, Zhiwei Liu, Yuqing Liu, Chen Wang, Yueqing Liang, Hao Peng, Philip S. Yu
Keywords: popular recommendation paradigms, Graph-based and sequential, recommendation paradigms, Graph Neural Network, popular recommendation
Categories: Information Retrieval (cs.IR)
Comments: Under review

Abstract:Graph-based and sequential methods are two popular recommendation paradigms, each excelling in its domain but lacking the ability to leverage signals from the other. To address this, we propose a novel method that integrates both approaches for enhanced performance. Our framework uses Graph Neural Network (GNN)-based and sequential recommenders as separate submodules while sharing a unified embedding space optimized jointly. To enable positive knowledge transfer, we design a loss function that enforces alignment and uniformity both within and across submodules. Experiments on three real-world datasets demonstrate that the proposed method significantly outperforms using either approach alone and achieves state-of-the-art results. Our implementations are publicly available at this https URL.
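
For readers unfamiliar with the two terms, here is a minimal sketch of alignment and uniformity losses on embeddings. It assumes the form commonly attributed to Wang and Isola (2020), because the abstract does not state the paper's exact loss. The function names, the pairing of embeddings, and the temperature `t` are illustrative assumptions, not the authors' implementation.

```python
import math

def alignment(xs, ys):
    """Mean squared Euclidean distance between paired embeddings.

    xs[i] and ys[i] are assumed to be two views of the same entity
    (for example, a graph-side and a sequence-side embedding).
    """
    total = 0.0
    for x, y in zip(xs, ys):
        total += sum((a - b) ** 2 for a, b in zip(x, y))
    return total / len(xs)

def uniformity(xs, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    More negative values mean the embeddings are spread out more evenly.
    """
    vals = []
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            d2 = sum((a - b) ** 2 for a, b in zip(xs[i], xs[j]))
            vals.append(math.exp(-t * d2))
    return math.log(sum(vals) / len(vals))
```

Both quantities are minimized in this formulation: alignment pulls matched pairs together, while uniformity penalizes embeddings that collapse onto the same point.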

[IR-2] Learning to Hash for Recommendation: A Survey

Link: https://arxiv.org/abs/2412.03875
Authors: Fangyuan Luo, Honglei Zhang, Tong Li, Jun Wu
Keywords: Recommender Systems, facing unprecedented challenges, users and items, storage cost, explosive growth
Categories: Information Retrieval (cs.IR)

Abstract: With the explosive growth of users and items, Recommender Systems (RS) are facing unprecedented challenges in both retrieval efficiency and storage cost. Fortunately, Learning to Hash (L2H) techniques have been shown to be a promising solution to these two dilemmas; their core idea is to encode high-dimensional data into compact hash codes. To this end, L2H for RS (HashRec for short) has recently received widespread attention as a way to support large-scale recommendations. In this survey, we present a comprehensive review of current HashRec algorithms. Specifically, we first introduce the commonly used two-tower models in the recall stage and identify two search strategies frequently employed in L2H. Then, we categorize prior works into a two-tier taxonomy based on: (i) the type of loss function and (ii) the optimization strategy. We also introduce some commonly used evaluation metrics to measure the performance of HashRec algorithms. Finally, we shed light on the limitations of the current research and outline future research directions. Furthermore, the summary of HashRec methods reviewed in this survey can be found at this https URL.
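
The core idea mentioned in the abstract, encoding vectors as compact binary codes and comparing them by Hamming distance, can be illustrated with a small sketch. Note the substitution: the projections below are random rather than learned, so this is plain sign-random-projection hashing and not any HashRec method from the survey. All function names and parameters are hypothetical.

```python
import random

def make_projections(dim, n_bits, seed=0):
    """Draw one random Gaussian hyperplane per output bit (not learned)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_code(vec, projections):
    """Pack the signs of the projections of vec into an integer code."""
    code = 0
    for row in projections:
        dot = sum(w * x for w, x in zip(row, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")

# Usage: hash a user vector and an item vector, then compare the codes.
projections = make_projections(dim=3, n_bits=16)
user_code = hash_code([1.0, 2.0, -0.5], projections)
item_code = hash_code([0.9, 2.1, -0.4], projections)
distance = hamming(user_code, item_code)
```

A 16-bit code fits in two bytes regardless of the input dimension, which is where the storage saving comes from; a learned method would replace `make_projections` with trained parameters.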

Attachment Download

Click to download today's full paper list
