arXiv Daily Papers | 2024-12-04

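This post lists the latest papers retrieved from arXiv.org on 2024-12-04. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.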

Note: the daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

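[Quick read]: This paper addresses automated punctuation and capitalization correction in Turkish text. The key to the solution is using BERT-based models in five sizes (Tiny, Mini, Small, Medium, and Base) to balance performance against computational overhead. The study systematically compares the precision, recall, and F1 score of each model and shows that text readability and accuracy improve as model size increases, with the Base model achieving the highest correction precision. It offers comprehensive guidance for choosing a model size based on user needs and computational resources, and lays out a framework for deploying these models in real-world applications to improve the quality of written Turkish.

Link: https://arxiv.org/abs/2412.02698
Authors: Abdulkader Saoud, Mahmut Alomeyr, Himmet Toprak Kesgin, Mehmet Fatih Amasyali
Keywords: effectiveness of BERT, BERT based models, distinct model sizes, BERT based, paper investigates
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 2024 Innovations in Intelligent Systems and Applications Conference (ASYU)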

Abstract:This paper investigates the effectiveness of BERT based models for automated punctuation and capitalization corrections in Turkish texts across five distinct model sizes. The models are designated as Tiny, Mini, Small, Medium, and Base. The design and capabilities of each model are tailored to address the specific challenges of the Turkish language, with a focus on optimizing performance while minimizing computational overhead. The study presents a systematic comparison of the performance metrics precision, recall, and F1 score of each model, offering insights into their applicability in diverse operational contexts. The results demonstrate a significant improvement in text readability and accuracy as model size increases, with the Base model achieving the highest correction precision. This research provides a comprehensive guide for selecting the appropriate model size based on specific user needs and computational resources, establishing a framework for deploying these models in real-world applications to enhance the quality of written Turkish.
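To make the comparison metrics concrete, here is a minimal sketch of how precision, recall, and F1 could be computed for a single punctuation label over token-level tags. The tag names (`O`, `COMMA`, `PERIOD`) and the sample sequences are made up for illustration; the abstract does not describe the paper's label scheme or evaluation script.

```python
def precision_recall_f1(gold, pred, positive):
    """Score one label (e.g. "COMMA") over aligned token-level tags."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted tags for six tokens.
gold = ["O", "COMMA", "O", "PERIOD", "O", "COMMA"]
pred = ["O", "COMMA", "COMMA", "PERIOD", "O", "O"]
p, r, f = precision_recall_f1(gold, pred, "COMMA")
```

Here the model finds one of the two true commas and adds one spurious comma, so precision and recall are both one half.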

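[NLP-1] T-REG: Preference Optimization with Token-Level Reward Regularization

[Quick read]: This paper addresses the problem that the single, sparse sequence-level reward used in conventional reinforcement learning from human feedback (RLHF) gives the model little guidance for fine-grained credit assignment. The key to the solution is token-level reward regularization (T-REG), which combines sequence-level and token-level rewards and uses contrastive prompting to let large language models (LLMs) self-generate token-level rewards, yielding more effective reward distribution and model alignment. Experiments show that the method significantly outperforms baselines on instruction-following benchmarks.

Link: https://arxiv.org/abs/2412.02685
Authors: Wenxuan Zhou, Shujian Zhang, Lingxiao Zhao, Tao Meng
Keywords: aligning large language, Reinforcement learning, large language models, human feedback, RLHF involves generating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: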

Abstract:Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models (LLMs) with human values. Traditionally, RLHF involves generating responses to a query and using a reward model to assign a reward to the entire response. However, this approach faces challenges due to its reliance on a single, sparse reward, which makes it challenging for the model to identify which parts of the sequence contribute most significantly to the final reward. Recent methods have attempted to address this limitation by introducing token-level rewards. However, these methods often rely on either a trained credit assignment model or AI annotators, raising concerns about the quality and reliability of the rewards. In this paper, we propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization. Harnessing the self-refinement capabilities of LLMs, our method uses contrastive prompting to enable LLMs to self-generate token-level rewards. These self-generated rewards then act as reward regularization, guiding the model to more effectively distribute sequence-level rewards across tokens. This facilitates better token-level credit assignment and enhances alignment performance. Experiments on the instruction following benchmarks, including Alpaca Eval 2 and Arena-Hard, show that our method consistently outperforms baseline methods by up to 3.8% and 4.4%, respectively. We will release the code and models at this https URL.

[NLP-2] Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

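[Quick read]: This paper addresses the lack of a fundamental understanding of self-improvement in large language models (LLMs). The key to the solution is a mathematical quantity called the generation-verification gap, together with the experimental finding that a variant of this gap scales monotonically with the model's pre-training FLOPs. This gives LLM self-improvement a theoretical footing, reveals how the process scales, and points to directions for future research.

Link: https://arxiv.org/abs/2412.02674
Authors: Yuda Song, Hanlin Zhang, Carson Eisenach, Sham Kakade, Dean Foster, Udaya Ghai
Keywords: Large Language Model, Large Language, mechanism in Large, Language Model, post-training and test-time
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 41 pages, 19 figures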

Abstract:Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We explore a framework where the model verifies its own outputs, filters or reweights data based on this verification, and distills the filtered data. Despite several empirical successes, a fundamental understanding is still lacking. In this work, we initiate a comprehensive, modular and controlled study on LLM self-improvement. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap. Through experiments with various model families and tasks, we discover a scaling phenomenon of self-improvement – a variant of the generation-verification gap scales monotonically with the model pre-training flops. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance. Our findings not only advance understanding of LLM self-improvement with practical implications, but also open numerous avenues for future research into its capabilities and boundaries.
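One simple way to picture the generation-verification gap is as the accuracy gained when the model's own verifier filters its sampled answers. The sketch below uses that reading as an assumption; the abstract does not give the paper's formal definition, and the sample data is invented.

```python
def generation_verification_gap(samples):
    """samples: list of (is_correct, verifier_accepts) pairs for sampled answers."""
    gen_acc = sum(c for c, _ in samples) / len(samples)
    kept = [c for c, accepted in samples if accepted]
    # If the verifier rejects everything, filtering changes nothing.
    ver_acc = sum(kept) / len(kept) if kept else gen_acc
    return ver_acc - gen_acc


# Hypothetical outcomes: 1 = correct answer, True = the model's verifier accepted it.
samples = [(1, True), (0, False), (1, True), (0, True), (0, False), (1, False)]
gap = generation_verification_gap(samples)
```

A positive gap means the filtered answers are better than the raw generations, which is the condition under which distilling the filtered data could help.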

[NLP-3] Probing the statistical properties of enriched co-occurrence networks

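[Quick read]: This paper examines how adding virtual edges to word co-occurrence networks enriches graph representations, particularly for short texts. The key is assessing how well network metrics distinguish meaningless from meaningful texts, and how sensitive those metrics are to the syntactic versus semantic aspects of a text. The results show that virtual edges affect different metrics differently: the informativeness of the average shortest path and closeness centrality improves in short texts, while that of the clustering coefficient decreases as more virtual edges are added. The study also finds that including stopwords affects the statistical properties of the enriched networks. These findings offer guidance for choosing the most suitable network metrics for a given application.

Link: https://arxiv.org/abs/2412.02664
Authors: Diego R. Amancio, Jeaneth Machicao, Laura V. C. Quispe
Keywords: enhance graph representations, Recent studies, word co-occurrence networks, word embeddings, graph representations
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: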

Abstract:Recent studies have explored the addition of virtual edges to word co-occurrence networks using word embeddings to enhance graph representations, particularly for short texts. While these enriched networks have demonstrated some success, the impact of incorporating semantic edges into traditional co-occurrence networks remains uncertain. This study investigates two key statistical properties of text-based network models. First, we assess whether network metrics can effectively distinguish between meaningless and meaningful texts. Second, we analyze whether these metrics are more sensitive to syntactic or semantic aspects of the text. Our results show that incorporating virtual edges can have positive and negative effects, depending on the specific network metric. For instance, the informativeness of the average shortest path and closeness centrality improves in short texts, while the clustering coefficient’s informativeness decreases as more virtual edges are added. Additionally, we found that including stopwords affects the statistical properties of enriched networks. Our results can serve as a guideline for determining which network metrics are most appropriate for specific applications, depending on the typical text size and the nature of the problem.
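The enrichment step can be sketched in a few lines: build a co-occurrence graph from adjacent words, add virtual edges between words whose embeddings are similar, and see how a metric such as the clustering coefficient shifts. The adjacency window, the similarity threshold, and the 2-d embeddings are all assumptions for illustration, not the paper's settings.

```python
from itertools import combinations
from math import sqrt


def cooccurrence_edges(tokens):
    """Link each pair of adjacent words, ignoring self-loops."""
    return {frozenset((a, b)) for a, b in zip(tokens, tokens[1:]) if a != b}


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def virtual_edges(vectors, threshold):
    """Link word pairs whose embedding similarity reaches the threshold."""
    return {frozenset((a, b)) for a, b in combinations(sorted(vectors), 2)
            if cosine(vectors[a], vectors[b]) >= threshold}


def average_clustering(edges):
    """Mean local clustering coefficient of an undirected graph."""
    nbrs = {}
    for edge in edges:
        a, b = tuple(edge)
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    total = 0.0
    for ns in nbrs.values():
        k = len(ns)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for x, y in combinations(ns, 2) if y in nbrs[x])
        total += 2 * links / (k * (k - 1))
    return total / len(nbrs)


tokens = "the cat chased the dog".split()
# Hypothetical 2-d embeddings, made up for this example.
vectors = {"the": (1.0, 0.0), "cat": (0.1, 1.0), "dog": (0.2, 1.0), "chased": (1.0, 1.0)}
base = cooccurrence_edges(tokens)
enriched = base | virtual_edges(vectors, threshold=0.95)
```

In this toy case the only virtual edge links "cat" and "dog", and that single edge already changes the clustering coefficient, which suggests why the informativeness of a metric can move in either direction once semantic edges are added.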

[NLP-4] QA-TOOLBOX: Conversational Question-Answering for process task guidance in manufacturing

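[Quick read]: This paper explores using large language models (LLMs) for data augmentation in a task guidance system for advanced manufacturing. The key is generating a high-quality question-answer dataset, which requires understanding procedure specification documents and the temporal ordering of actions and objects. The team built a benchmark dataset of 200,000+ question-answer pairs, ran experiments with several popular open-source LLMs, and evaluated them in a reference-free setting using LLM-as-a-judge, with ratings compared against crowd-workers and validated by experts.

Link: https://arxiv.org/abs/2412.02638
Authors: Ramesh Manuvinakurike, Elizabeth Watkins, Celal Savur, Anthony Rhodes, Sovan Biswas, Gesem Gudino Mejia, Richard Beckwith, Saurav Sahay, Giuseppe Raffa, Lama Nachman
Keywords: task guidance system, manufacturing task guidance, explore utilizing LLMs, guidance system, data augmentation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: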

Abstract: In this work, we explore using LLMs for data augmentation in a manufacturing task guidance system. The dataset consists of representative samples of interactions with technicians working in an advanced manufacturing setting. The purpose of this work is to explore the task, data augmentation for the supported tasks, and the performance of existing LLMs. We observe that the task is complex, requiring understanding of procedure specification documents and of actions and objects sequenced temporally. The dataset consists of 200,000+ question/answer pairs that refer to the spec document and are grounded in narrations and/or video demonstrations. We compared the performance of several popular open-source LLMs by developing a baseline with each LLM, then compared the responses in a reference-free setting using LLM-as-a-judge, and compared the ratings with those of crowd-workers while validating them with experts.

[NLP-5] Words and Action: Modeling Linguistic Leadership in #BlackLivesMatter Communities

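[Quick read]: This paper addresses how to model semantic leadership across communities associated with the #BlackLivesMatter movement, informed by qualitative research on the structure of social media and Black Twitter in particular. The key lies in bespoke approaches to time-binning, community clustering, and connecting communities over time, along with adaptations of state-of-the-art methods for semantic change detection and semantic leadership induction. These methods identify the leadership roles of BLM activists, progressives, and Black celebrities, and reveal the sustained engagement of the conservative community with this discourse, suggesting an alternative explanation for the current nationwide enactment of “anti-woke” and “anti-CRT” bills.

Link: https://arxiv.org/abs/2412.02637
Authors: Dani Roytburg, Deborah Olorunisola, Sandeep Soni, Lauren Klein
Keywords: modeling semantic leadership, Black Twitter, BlackLivesMatter movement, semantic leadership induction, method of modeling
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: Accepted at ICWSM 2025; minor revisions forthcoming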

Abstract:In this project, we describe a method of modeling semantic leadership across a set of communities associated with the #BlackLivesMatter movement, which has been informed by qualitative research on the structure of social media and Black Twitter in particular. We describe our bespoke approaches to time-binning, community clustering, and connecting communities over time, as well as our adaptation of state-of-the-art approaches to semantic change detection and semantic leadership induction. We find substantial evidence of the leadership role of BLM activists and progressives, as well as Black celebrities. We also find evidence of the sustained engagement of the conservative community with this discourse, suggesting an alternative explanation for how we arrived at the present moment, in which “anti-woke” and “anti-CRT” bills are being enacted nationwide.

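[NLP-6] Time-Reversal Provides Unsupervised Feedback to LLMs

[Quick read]: This paper asks how to strengthen the self-feedback ability of large language models (LLMs) during generation, in particular how to provide effective feedback without supervision. The key is Time Reversed Language Models (TRLMs), which predict and score queries when conditioned on responses, operating in the reverse direction of time. By pre-training and fine-tuning a model from scratch in reverse token order (TRLM-Ba), the paper shows that TRLMs complement forward-model predictions when scoring the query given the response, and deliver significant gains in applications such as citation generation and passage retrieval. TRLMs are also used to augment input safety filters, sharply reducing the false negative rate with negligible impact on the false positive rate.

Link: https://arxiv.org/abs/2412.02626
Authors: Yerram Varun, Rahul Madhavan, Sravanti Addepalli, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain
Keywords: Large Language Models, Large Language, Time Reversed Language, Reversed Language Models, typically trained
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: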

Abstract:Large Language Models (LLMs) are typically trained to predict in the forward direction of time. However, recent works have shown that prompting these models to look back and critique their own generations can produce useful feedback. Motivated by this, we explore the question of whether LLMs can be empowered to think (predict and score) backwards to provide unsupervised feedback that complements forward LLMs. Towards this, we introduce Time Reversed Language Models (TRLMs), which can score and generate queries when conditioned on responses, effectively functioning in the reverse direction of time. Further, to effectively infer in the response to query direction, we pre-train and fine-tune a language model (TRLM-Ba) in the reverse token order from scratch. We show empirically (and theoretically in a stylized setting) that time-reversed models can indeed complement forward model predictions when used to score the query given response for re-ranking multiple forward generations. We obtain up to 5% improvement on the widely used AlpacaEval Leaderboard over the competent baseline of best-of-N re-ranking using self log-perplexity scores. We further show that TRLM scoring outperforms conventional forward scoring of response given query, resulting in significant gains in applications such as citation generation and passage retrieval. We next leverage the generative ability of TRLM to augment or provide unsupervised feedback to input safety filters of LLMs, demonstrating a drastic reduction in false negative rate with negligible impact on false positive rates against several attacks published on the popular JailbreakBench leaderboard.
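The re-ranking idea can be sketched as best-of-N selection where each candidate response is scored by how likely the query is given that response. In the sketch below, `toy_reverse_logprob` is a hypothetical stand-in (a smoothed unigram model) for a real time-reversed language model, which the abstract describes but whose interface it does not specify.

```python
import math


def rerank(query, responses, reverse_logprob):
    """Pick the response under which the query is most likely."""
    return max(responses, key=lambda r: reverse_logprob(query, r))


def toy_reverse_logprob(query, response):
    """Stand-in scorer: add-one smoothed unigram log P(query | response)."""
    q_words = query.lower().split()
    r_words = response.lower().split()
    vocab = set(q_words) | set(r_words)
    return sum(math.log((r_words.count(w) + 1) / (len(r_words) + len(vocab)))
               for w in q_words)


query = "capital of france"
responses = ["Paris is the capital of France", "I like turtles"]
best = rerank(query, responses, toy_reverse_logprob)
```

Swapping `toy_reverse_logprob` for a model that actually scores the query conditioned on the response would keep the same selection logic.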

[NLP-7] GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

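[Quick read]: This paper addresses building an intelligent, human-like end-to-end spoken chatbot, GLM-4-Voice, which supports Chinese and English, holds real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. The key is an ultra-low-bitrate (175 bps), single-codebook speech tokenizer with a 12.5 Hz frame rate, derived from an automatic speech recognition (ASR) model by adding a vector-quantized bottleneck to the encoder. To transfer knowledge efficiently from text to speech, speech-text interleaved data is synthesized from existing text pre-training corpora with a text-to-token model. Pre-training then continues from the text language model GLM-4-9B on a mix of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling to 1 trillion tokens and reaching state-of-the-art performance in speech language modeling and spoken question answering. Finally, fine-tuning on high-quality conversational speech data improves conversational ability and speech quality beyond existing baselines.

Link: https://arxiv.org/abs/2412.02612
Authors: Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, Jie Tang
Keywords: intelligent and human-like, Chinese and English, speech, spoken chatbot, data
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: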

Abstract:We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through this https URL and this https URL.
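As a rough sanity check on the tokenizer figures, the sketch below works out the bits carried by each speech token. It assumes one token per frame and that bitrate equals frame rate times log2 of the codebook size; the implied codebook size is a derived value under that assumption, not a figure quoted in the abstract.

```python
bitrate_bps = 175.0    # from the abstract
frame_rate_hz = 12.5   # from the abstract

# Assumption: a single-codebook tokenizer emits exactly one token per frame.
bits_per_token = bitrate_bps / frame_rate_hz
implied_codebook_size = 2 ** round(bits_per_token)
tokens_per_minute = frame_rate_hz * 60
```

Under these assumptions each token carries 14 bits, and a minute of speech becomes 750 tokens, which helps explain why such a tokenizer keeps speech sequences short enough for language-model training.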

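[NLP-8] AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

[Quick read]: This paper addresses the poor performance of multimodal large language models (MLLMs) on simple audio tasks, notably telling which of two sounds is louder or higher in pitch. The key is AV-Odyssey Bench, a comprehensive benchmark of 4,555 carefully crafted audio-visual problems designed to assess whether MLLMs truly understand audio-visual information. Each problem combines text, visual, and audio components and requires the model to reason from both visual and audio clues. Questions are multiple-choice, which keeps evaluation precise and objective, and the results provide useful insight for future dataset collection and model development.

Link: https://arxiv.org/abs/2412.02611
Authors: Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue
Keywords: Reka Core, multimodal large language, large language models, multimodal large, large language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Project page: this https URL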

Abstract:Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information. This benchmark encompasses 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To successfully infer answers, models must effectively leverage clues from both visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we have structured the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize the observations. By revealing the limitations of current models, we aim to provide useful insight for future dataset collection and model development.

[NLP-9] Interpretable Company Similarity with Sparse Autoencoders

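[Quick read]: This paper addresses determining company similarity in finance, where practitioners rely on sector classifications (such as SIC and GICS codes) and where embedding-based alternatives lack the interpretability needed in high-stakes contexts. The key is using features from Sparse Autoencoders (SAEs), which decompose the activations of Large Language Models (LLMs) into interpretable features, to measure company similarity. The results show that SAE features can reproduce and even surpass sector classifications in quantifying fundamental company characteristics, as evaluated by the correlation of monthly returns and by PnL from cointegration.

Link: https://arxiv.org/abs/2412.02605
Authors: Marco Molinari, Vladimir Tregubiak, Victor Shao, Abhimanyu Pandey, Mateusz Mikolajczak, Sebastião Kuznetsov Ryder Torres Pereira
Keywords: Determining company similarity, underpinning hedging, risk management, portfolio diversification, task in finance
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); General Economics (econ.GN)
Comments: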

Abstract:Determining company similarity is a vital task in finance, underpinning hedging, risk management, portfolio diversification, and more. Practitioners often rely on sector and industry classifications to gauge similarity, such as SIC-codes and GICS-codes, the former being used by the U.S. Securities and Exchange Commission (SEC), and the latter widely used by the investment community. Clustering embeddings of company descriptions has been proposed as a potential technique for determining company similarity, but the lack of interpretability in token embeddings poses a significant barrier to adoption in high-stakes contexts. Sparse Autoencoders have shown promise in enhancing the interpretability of Large Language Models by decomposing LLM activations into interpretable features. In this paper, we explore the use of SAE features in measuring company similarity and benchmark them against (1) SIC codes and (2) Major Group codes. We conclude that SAE features can reproduce and even surpass sector classifications in quantifying fundamental characteristics of companies, evaluated by the correlation of monthly returns, a proxy for similarity, and PnL from cointegration.
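A minimal sketch of how similarity over sparse, interpretable features might be computed is shown below. Cosine similarity is an assumption here (the abstract does not name the similarity measure), and the feature ids and activations are invented.

```python
from math import sqrt


def sparse_cosine(a, b):
    """Cosine similarity between sparse vectors stored as {feature_id: activation}."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shared_features(a, b):
    """Feature ids active in both companies; these make the score inspectable."""
    return sorted(set(a) & set(b))


# Hypothetical SAE feature activations for three company descriptions.
company_a = {12: 3.0, 87: 4.0}
company_b = {12: 4.0, 87: 3.0}
company_c = {301: 2.0}
similarity_ab = sparse_cosine(company_a, company_b)
similarity_ac = sparse_cosine(company_a, company_c)
```

Because each feature id is meant to correspond to an interpretable concept, listing the shared features gives a reason for the score, which a dense embedding does not offer directly.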

[NLP-10] CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs

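[Quick read]: This paper addresses balancing model performance against carbon emissions. The key is a new metric, the Carbon Efficient Gain Index (CEGI), which quantifies the carbon emitted per unit percentage gain in performance per million trainable parameters. With it, researchers can compare how efficiently different models trade performance gains for environmental cost and select models that are both effective and environmentally friendly. Experiments show that fine-tuned small language models (SLMs) and vision language models (VLMs) can match the performance of large language models (LLMs) with significantly lower carbon emissions. Lower-bit quantization further improves energy efficiency without sacrificing performance.

Link: https://arxiv.org/abs/2412.02602
Authors: Abhas Kumar, Kapil Pathak, Rajesh Kavuru, Prabhakar Srinivasan
Keywords: Visual Question Answering, Image Captioning, Visual Question, Question Answering, Dialogue Summarization
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: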

Abstract: This paper analyzes the performance of Small Language Models (SLMs) and Vision Language Models (VLMs) and evaluates the trade-off between model performance and carbon emissions across 4 essential tasks: Image Captioning, Visual Question Answering (VQA), Dialogue Summarization and Text-to-SQL conversion. Various SLMs and VLMs belonging to the Qwen and LLaMA architecture families are chosen, and variants based on model size in terms of the number of parameters, quantization level and fine-tuning parameters are evaluated. Each model variant’s performance and carbon emissions are calculated. To quantify the trade-off between model performance and carbon emissions, we introduce a novel metric called CEGI (Carbon Efficient Gain Index). This metric represents the carbon emission per unit percentage gain per million trainable parameters. It provides a normalized measure for comparing models’ efficiency in terms of performance improvement relative to their environmental cost. The experiment’s outcome demonstrates that fine-tuning SLMs and VLMs can achieve performance levels comparable to Large Language Models (LLMs) while producing significantly less carbon emissions. Our findings suggest that the marginal gains in accuracy from larger models do not justify the substantial increase in carbon emissions. Leveraging lower-bit quantization levels, the proposed metric further enhances energy efficiency without compromising performance. This study highlights the need to balance high performance and environmental sustainability. It offers a valuable metric for selecting models suitable for environmentally-friendly AI development.
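Read literally, the metric divides emissions by the percentage gain and by the number of trainable parameters in millions. The sketch below encodes that reading as an assumption; the abstract gives no explicit formula or units, and the input numbers are invented.

```python
def cegi(emissions_kg, gain_pct, trainable_params):
    """Carbon emitted per percentage point gained per million trainable parameters.

    Assumed form: emissions / (gain in percentage points * parameters in millions).
    """
    if gain_pct <= 0:
        raise ValueError("CEGI is only meaningful for a positive performance gain")
    return emissions_kg / (gain_pct * trainable_params / 1e6)


# Hypothetical run: 0.5 kg CO2-equivalent, a 10-point gain, 20M trainable parameters.
score = cegi(emissions_kg=0.5, gain_pct=10.0, trainable_params=20_000_000)
```

Under this reading a lower value is better: the same gain was bought with less carbon per trainable parameter.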

[NLP-11] Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

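[Quick read]: This paper addresses the problem that aggressive model-based filtering (as in FineWeb-Edu and DCLM) improves benchmark accuracy but removes about 90% of the data, limiting suitability for long-token-horizon training such as the 15T tokens used for Llama 3.1. The key is combining classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters to reach a better trade-off between accuracy and data quantity. When training 8B-parameter models for 1T tokens, a high-quality subset of the data improves MMLU by 5.6 over DCLM, showing the method's effectiveness over a relatively short token horizon. The full 6.3T-token dataset matches DCLM on MMLU while containing four times more unique real tokens, which unlocks state-of-the-art training over a long token horizon.

Link: https://arxiv.org/abs/2412.02595
Authors: Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Keywords: Recent English Common, English Common Crawl, Common Crawl datasets, Recent English, English Common
Categories: Computation and Language (cs.CL)
Comments: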

Abstract:Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved significant benchmark gains via aggressive model-based filtering, but at the cost of removing 90% of data. This limits their suitability for long token horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how to achieve better trade-offs between accuracy and data quantity by a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. When training 8B parameter models for 1T tokens, using a high-quality subset of our data improves MMLU by 5.6 over DCLM, demonstrating the efficacy of our methods for boosting accuracies over a relatively short token horizon. Furthermore, our full 6.3T token dataset matches DCLM on MMLU, but contains four times more unique real tokens than DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B parameter model trained for 15T tokens, of which 7.2T came from our dataset, is better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks. The dataset is available at this https URL

[NLP-12] Semantic Tokens in Retrieval Augmented Generation

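[Quick read]: This paper addresses the declining accuracy of Retrieval-Augmented Generation (RAG) architectures as the volume of data grows, and the uncertainty introduced by their dependence on state-of-the-art large language models (LLMs). The key is an evaluator module that compares external recommendations with the retrieved document chunks, adding a decision-making layer that improves reliability. This ensures that retrieved chunks are both semantically relevant and logically consistent with deterministic insights, improving the accuracy and overall efficiency of RAG systems.

Link: https://arxiv.org/abs/2412.02563
Authors: Joel Suro
Keywords: recently garnered significant, garnered significant attention, improve truth grounding, Retrieval-Augmented Generation, language processing tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: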

Abstract:Retrieval-Augmented Generation (RAG) architectures have recently garnered significant attention for their ability to improve truth grounding and coherence in natural language processing tasks. However, the reliability of RAG systems in producing accurate answers diminishes as the volume of data they access increases. Even with smaller datasets, these systems occasionally fail to address simple queries. This issue arises from their dependence on state-of-the-art large language models (LLMs), which can introduce uncertainty into the system’s outputs. In this work, I propose a novel Comparative RAG system that introduces an evaluator module to bridge the gap between probabilistic RAG systems and deterministically verifiable responses. The evaluator compares external recommendations with the retrieved document chunks, adding a decision-making layer that enhances the system’s reliability. This approach ensures that the chunks retrieved are both semantically relevant and logically consistent with deterministic insights, thereby improving the accuracy and overall efficiency of RAG systems. This framework paves the way for more reliable and scalable question-answering applications in domains requiring high precision and verifiability.

[NLP-13] Patent-CR: A Dataset for Patent Claim Revision

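[Quick read]: This paper addresses claim revision in patent applications, in particular ensuring that revised claims meet stringent legal criteria including clarity of scope, technical accuracy, language precision, and legal robustness. The key is Patent-CR, the first dataset for patent claim revision in English, together with a professional human evaluation of various large language models (LLMs), which finds that domain-specific models and fine-tuning show promising results. GPT-4 outperforms the other tested LLMs, though further revision is still needed to reach the examination standard. The paper also shows that automated and human evaluation results are inconsistent, with GPT-4-based automated evaluation correlating most strongly with human judgment.

Link: https://arxiv.org/abs/2412.02549
Authors: Lekang Jiang, Pascal A Scherz, Stephan Goetz
Keywords: paper presents Patent-CR, patent claim revision, presents Patent-CR, patent claim, paper presents
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 6 tables, 3 figures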

Abstract:This paper presents Patent-CR, the first dataset created for the patent claim revision task in English. It includes both initial patent applications rejected by patent examiners and the final granted versions. Unlike normal text revision tasks that predominantly focus on enhancing sentence quality, such as grammar correction and coherence improvement, patent claim revision aims at ensuring the claims meet stringent legal criteria. These criteria are beyond novelty and inventiveness, including clarity of scope, technical accuracy, language precision, and legal robustness. We assess various large language models (LLMs) through professional human evaluation, including general LLMs with different sizes and architectures, text revision models, and domain-specific models. Our results indicate that LLMs often bring ineffective edits that deviate from the target revisions. In addition, domain-specific models and the method of fine-tuning show promising results. Notably, GPT-4 outperforms other tested LLMs, but further revisions are still necessary to reach the examination standard. Furthermore, we demonstrate the inconsistency between automated and human evaluation results, suggesting that GPT-4-based automated evaluation has the highest correlation with human judgment. This dataset, along with our preliminary empirical research, offers invaluable insights for further exploration in patent claim revision.

[NLP-14] LLMForecaster: Improving Seasonal Event Forecasts with Unstructured Textual Data NEURIPS

[Quick read]: This paper addresses the failure of existing time-series forecasting models to make full use of rich unstructured information. The key is LLMForecaster, a novel forecast post-processor that fine-tunes large language models (LLMs) to combine unstructured semantic and contextual information with historical data, improving the forecasts of an existing demand forecasting pipeline. In an industry-scale retail application, the technique yields statistically significant forecast improvements for products subject to holiday-driven demand surges.

Link: https://arxiv.org/abs/2412.02525
Authors: Hanyu Zhang, Chuck Arvin, Dmitry Efimov, Michael W. Mahoney, Dominique Perrault-Joncas, Shankar Ramasubramanian, Andrew Gordon Wilson, Malcolm Wolff
Keywords: Modern time-series forecasting, Modern time-series, time-series forecasting models, rich unstructured information, make full
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Presented at NeurIPS Time Series in the Age of Large Models (2024)

Abstract: Modern time-series forecasting models often fail to make full use of rich unstructured information about the time series themselves. This lack of proper conditioning can lead to obvious model failures; for example, models may be unaware of the details of a particular product, and hence fail to anticipate seasonal surges in customer demand in the lead up to major exogenous events like holidays for clearly relevant products. To address this shortcoming, this paper introduces a novel forecast post-processor – which we call LLMForecaster – that fine-tunes large language models (LLMs) to incorporate unstructured semantic and contextual information and historical data to improve the forecasts from an existing demand forecasting pipeline. In an industry-scale retail application, we demonstrate that our technique yields statistically significant forecast improvements across several sets of products subject to holiday-driven demand surges.

[NLP-15] DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

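[Quick read]: This paper addresses the challenges pre-trained large language models (LLMs) face when generating tabular data under differential privacy (DP), especially the drop in output quality caused by inefficient allocation of the privacy budget. The key is a two-stage fine-tuning framework, DP-2Stage: non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on the private dataset. By spending the privacy budget more efficiently, the method improves the quality and coherence of tabular data generated under DP constraints.

Link: https://arxiv.org/abs/2412.02467
Authors: Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz
Keywords: machine learning models, protection ensures theoretical, noisy supervision signals, Large Language Models, training machine learning
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: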

Abstract: Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs) – even those at the scale of GPT-2 – have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at this https URL.
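The abstract does not say which DP mechanism the second stage uses. A common choice for DP fine-tuning is DP-SGD, so the sketch below assumes it: each sample's gradient is clipped to a fixed norm, the clipped gradients are summed, Gaussian noise scaled to the clipping norm is added, and the result is averaged. The parameter names and toy gradients are illustrative only.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale is tied to the clipping norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_sample_grads) for x in noisy]


# Toy per-sample gradients for a two-parameter model.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
update = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
```

The noise is spent on every parameter update regardless of what the model is learning, which is consistent with the abstract's point that learning the table format in a non-private first stage leaves more of the privacy budget for the private content.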

[NLP-16] Can ChatGPT capture swearing nuances? Evidence from translating Arabic oaths

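[Quick read]: This paper asks whether ChatGPT can accurately capture the nuances of Arabic oath expressions when translating them into English. The key is a detailed analysis of ChatGPT's translations of 30 Arabic oath expressions, compared with human translations and categorized by the type of gap ChatGPT leaves: religious gap, cultural gap, both religious and cultural gaps, no gap, use of non-oath particles, redundancy, and failure to capture Arabic script diacritics. The results show that ChatGPT's translation of oaths remains far from satisfactory, pointing to the need for further development and for training on Arabic data that covers oath expressions, nuances, rituals, and practices.

Link: https://arxiv.org/abs/2412.02466
Authors: Mohammed Q. Shormani
Keywords: capture swearing nuances, ChatGPT capture swearing, Arabic oath expressions, major question, answer one major
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 3 figures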

点击查看摘要

Abstract:This study sets out to answer one major question: Can ChatGPT capture swearing nuances? It presents an empirical study on the ability of ChatGPT to translate Arabic oath expressions into English. 30 Arabic oath expressions were collected from the literature. These 30 oaths were first translated via ChatGPT and then analyzed and compared to the human translation in terms of the types of gaps left unfilled by ChatGPT. Specifically, the gaps involved are: religious gap, cultural gap, both religious and cultural gaps, no gap, using non-oath particles, redundancy, and non-capturing of Arabic script diacritics. It concludes that ChatGPT's translation of oaths is still far from satisfactory, revealing the need for further development of ChatGPT and for the inclusion of Arabic data on which ChatGPT should be trained, including oath expressions, oath nuances, rituals, and practices.

[NLP-17] Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining COLING2025

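【Quick read】: This paper addresses backdoor attacks on generative large language models (LLMs). Because generative LLMs output sequences of high-dimensional token logits rather than low-dimensional classification logits, existing backdoor defenses designed for discriminative models such as BERT are ineffective for them. The key to the solution is the marked difference between the gradients of backdoor and clean samples in the frequency space, on which the authors build Gradient Clustering in the Frequency Space for Backdoor Sample Filtering (GraCeFul). The method transforms each training sample's gradient into the frequency space and uses these sample-wise gradients to identify backdoor samples, without retraining the LLM. Across multiple free-style question answering datasets, GraCeFul significantly outperforms baselines: it is computationally efficient, reaches nearly 100% recall and F1 in identifying backdoor samples, and reduces the average success rate of various backdoor attacks to 0% with negligible drops in clean accuracy.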

Link: https://arxiv.org/abs/2412.02454
Authors: Zongru Wu, Pengzhou Cheng, Lingyong Fang, Zhuosheng Zhang, Gongshen Liu
Keywords: remain significant security, significant security threats, generative large language, large language models, frequency space
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted at COLING 2025

Abstract:Backdoor attacks remain significant security threats to generative large language models (LLMs). Since generative LLMs output sequences of high-dimensional token logits instead of low-dimensional classification logits, most existing backdoor defense methods designed for discriminative models like BERT are ineffective for generative LLMs. Inspired by the observed differences in learning behavior between backdoor and clean mapping in the frequency space, we transform gradients of each training sample, directly influencing parameter updates, into the frequency space. Our findings reveal a distinct separation between the gradients of backdoor and clean samples in the frequency space. Based on this phenomenon, we propose Gradient Clustering in the Frequency Space for Backdoor Sample Filtering (GraCeFul), which leverages sample-wise gradients in the frequency space to effectively identify backdoor samples without requiring retraining LLMs. Experimental results show that GraCeFul outperforms baselines significantly. Notably, GraCeFul exhibits remarkable computational efficiency, achieving nearly 100% recall and F1 scores in identifying backdoor samples, reducing the average success rate of various backdoor attacks to 0% with negligible drops in clean accuracy across multiple free-style question answering datasets. Additionally, GraCeFul generalizes to Llama-2 and Vicuna. The codes are publicly available at this https URL.
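
The filtering idea can be sketched in a few lines: move each sample's gradient into the frequency space, cluster the results, and flag the outlying cluster. The abstract states only that gradients are transformed and clustered; the DCT-II transform, the tiny two-cluster k-means, and the rule that the smaller cluster is suspicious are assumptions made here for illustration, and the gradient vectors are toy values.

```python
import math


def dct2(x):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi / n * (t + 0.5) * k) for t in range(n))
            for k in range(n)]


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def two_means(points, iters=10):
    """Tiny 2-means seeded with the first point and the point farthest from it."""
    centers = [points[0], max(points, key=lambda p: dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def flag_minority(gradients):
    """Indices of samples in the smaller cluster of frequency-space gradients."""
    labels = two_means([dct2(g) for g in gradients])
    minority = 1 if sum(labels) <= len(labels) / 2 else 0
    return [i for i, lab in enumerate(labels) if lab == minority]
```

In this toy setting, four smooth gradients and one rapidly oscillating gradient separate cleanly, and only the oscillating one is flagged. Real per-sample LLM gradients are far higher-dimensional, so a practical version would need dimensionality reduction and a more careful clustering step.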

[NLP-18] Artificial Expert Intelligence through PAC-reasoning

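【Quick read】: This paper addresses the lack of adaptability and precision that existing AI systems show on novel problems. The key to the solution is Artificial Expert Intelligence (AEI), which combines domain-specific expertise with the critical, precise reasoning of top human experts and thereby aims to go beyond the limits of both artificial general intelligence (AGI) and narrow AI. The core framework of AEI is Probably Approximately Correct (PAC) reasoning, which provides robust theoretical guarantees for reliably decomposing complex problems, together with a practical mechanism for controlling reasoning precision. Building on the division of human thought into System 1 (intuitive thinking) and System 2 (reflective reasoning), the paper proposes System 3 (precise reasoning) as a foundation for error-bounded, inference-time learning.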

Link: https://arxiv.org/abs/2412.02441
Authors: Shai Shalev-Shwartz, Amnon Shashua, Gal Beniamini, Yoav Levine, Or Sharir, Noam Wies, Ido Ben-Shaul, Tomer Nussbaum, Shir Granot Peled
Keywords: Artificial General Intelligence, Artificial Expert Intelligence, General Intelligence, Artificial General, Expert Intelligence
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Artificial Expert Intelligence (AEI) seeks to transcend the limitations of both Artificial General Intelligence (AGI) and narrow AI by integrating domain-specific expertise with critical, precise reasoning capabilities akin to those of top human experts. Existing AI systems often excel at predefined tasks but struggle with adaptability and precision in novel problem-solving. To overcome this, AEI introduces a framework for "Probably Approximately Correct (PAC) Reasoning". This paradigm provides robust theoretical guarantees for reliably decomposing complex problems, with a practical mechanism for controlling reasoning precision. In reference to the division of human thought into System 1 for intuitive thinking and System 2 for reflective reasoning (Tversky and Kahneman, 1974), we refer to this new type of reasoning as System 3 for precise reasoning, inspired by the rigor of the scientific method. AEI thus establishes a foundation for error-bounded, inference-time learning.

[NLP-19] GerPS-Compare: Comparing NER methods for legal norm analysis

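【Quick read】: This paper addresses named entity recognition (NER) for a particular sub-genre of German legal texts: legal norms regulating administrative processes in public service administration. The key to its approach is to compare and evaluate three NER methods: a rule-based system, deep discriminative models, and a deep generative model. The results show that deep discriminative models perform best on these classes, ahead of both the rule-based system and the deep generative model. The authors attribute this to the semantic and syntactic heterogeneity of the classes, which deep discriminative models handle better than the other two approaches.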

Link: https://arxiv.org/abs/2412.02427
Authors: Sarah T. Bachinger, Christoph Unger, Robin Erd, Leila Feddoul, Clara Lachenmaier, Sina Zarrieß, Birgitta König-Ries
Keywords: legal norms regulating, norms regulating administrative, regulating administrative processes, public service administration, deep discriminative models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We apply NER to a particular sub-genre of legal texts in German: the genre of legal norms regulating administrative processes in public service administration. The analysis of such texts involves identifying stretches of text that instantiate one of ten classes identified by public service administration professionals. We investigate and compare three methods for performing Named Entity Recognition (NER) to detect these classes: a Rule-based system, deep discriminative models, and a deep generative model. Our results show that Deep Discriminative models outperform both the Rule-based system as well as the Deep Generative model, the latter two roughly performing equally well, outperforming each other in different classes. The main cause for this somewhat surprising result is arguably the fact that the classes used in the analysis are semantically and syntactically heterogeneous, in contrast to the classes used in more standard NER tasks. Deep Discriminative models appear to be better equipped for dealing with this heterogeneity than both generic LLMs and human linguists designing rule-based NER systems.
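
Because the abstract reports that systems outperform each other in different classes, a per-class breakdown is the natural way to compare them. The sketch below computes exact-match precision, recall, and F1 per label from sets of spans. The span format `(start, end, label)` and the label names in the usage note are hypothetical, and the paper may use a different matching criterion.

```python
from collections import Counter


def per_class_prf(gold, pred):
    """Per-label precision, recall and F1 for exact-match spans.

    gold, pred: sets of (start, end, label) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for span in pred:
        # A predicted span counts as correct only if boundaries and label match.
        (tp if span in gold else fp)[span[2]] += 1
    for span in gold - pred:
        fn[span[2]] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores
```

For example, with gold spans `{(0, 2, "Actor"), (5, 7, "Action"), (9, 10, "Actor")}` and a system that labels all three spans `"Actor"`, the `"Actor"` class gets full recall but reduced precision, while `"Action"` scores zero. A single micro-averaged number would hide this pattern.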

[NLP-20] Four Guiding Principles for Modeling Causal Domain Knowledge: A Case Study on Brainstorming Approaches for Urban Blight Analysis

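【Quick read】: This paper addresses causal modeling in urban blight analysis, in particular the integration of domain knowledge. The key to the solution is a set of four rules for effective modeling of causal domain knowledge. Examining existing cognitive maps developed for urban blight analysis against these rules reveals significant deviations from causal modeling guidelines, and the findings offer insights for future work on the complex interactions behind urban blight.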

Link: https://arxiv.org/abs/2412.02400
Authors: Houssam Razouk, Michael Leitner, Roman Kern
Keywords: Urban blight, policy making, problem of high, high interest, interest for planning
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments: 16 pages, 4 figures, 2 tables

Abstract:Urban blight is a problem of high interest for planning and policy making. Researchers frequently propose theories about the relationships between urban blight indicators, focusing on relationships reflecting causality. In this paper, we improve on the integration of domain knowledge in the analysis of urban blight by introducing four rules for effective modeling of causal domain knowledge. By investigating cognitive maps developed for urban blight analysis, this study reveals significant deviation from causal modeling guidelines. These findings provide valuable insights that will inform future work on urban blight, ultimately enhancing our understanding of its complex interactions.

[NLP-21] TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual Similarity ICASSP2025

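【Quick read】: This paper addresses the fact that existing Tibetan adversarial text generation methods do not fully consider the textual features of Tibetan script and overestimate the quality of the generated texts. The key to the solution is a new method, TSCheater, which takes into account the characteristics of Tibetan encoding and the feature that visually similar syllables have similar semantics. TSCheater uses a self-constructed Tibetan syllable visual similarity database (TSVSDB) to generate substitution candidates and a greedy algorithm-based scoring mechanism to determine the substitution order. It outperforms existing methods in attack effectiveness, perturbation magnitude, semantic similarity, visual similarity, and human acceptance. The paper also constructs AdvTS, the first Tibetan adversarial robustness evaluation benchmark.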

Link: https://arxiv.org/abs/2412.02371
Authors: Xi Cao, Quzong Gesang, Yuan Sun, Nuo Qun, Tashi Nyima
Keywords: deep neural networks, Tibetan adversarial text, Tibetan, Tibetan adversarial, based on deep
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Review Version; Submitted to ICASSP 2025

Abstract:Language models based on deep neural networks are vulnerable to textual adversarial attacks. While rich-resource languages like English are receiving focused attention, Tibetan, a cross-border language, is gradually being studied due to its abundant ancient literature and critical language strategy. Currently, there are several Tibetan adversarial text generation methods, but they do not fully consider the textual features of Tibetan script and overestimate the quality of generated adversarial texts. To address this issue, we propose a novel Tibetan adversarial text generation method called TSCheater, which considers the characteristic of Tibetan encoding and the feature that visually similar syllables have similar semantics. This method can also be transferred to other abugidas, such as Devanagari script. We utilize a self-constructed Tibetan syllable visual similarity database called TSVSDB to generate substitution candidates and adopt a greedy algorithm-based scoring mechanism to determine substitution order. After that, we conduct the method on eight victim language models. Experimentally, TSCheater outperforms existing methods in attack effectiveness, perturbation magnitude, semantic similarity, visual similarity, and human acceptance. Finally, we construct the first Tibetan adversarial robustness evaluation benchmark called AdvTS, which is generated by existing methods and proofread by humans.
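
The two ingredients named in the abstract, a similarity database for candidates and a greedy scoring mechanism for the substitution order, can be sketched as follows. The deletion-based importance score, the `confidence` callable standing in for a victim model, and the placeholder tokens are all assumptions for illustration; the abstract does not specify how TSCheater scores positions.

```python
def greedy_attack(syllables, similar, confidence, threshold=0.5):
    """Substitute syllables in descending importance order.

    similar: dict mapping a syllable to visually similar candidates
             (a stand-in for a database such as TSVSDB).
    confidence: callable returning the victim model's confidence in the
                original label for a list of syllables (hypothetical).
    Stops once confidence falls below threshold.
    """
    base = confidence(syllables)
    # Importance of a position = confidence drop when that syllable is deleted.
    importance = []
    for i in range(len(syllables)):
        reduced = syllables[:i] + syllables[i + 1:]
        importance.append((base - confidence(reduced), i))
    current = list(syllables)
    for _, i in sorted(importance, reverse=True):
        best, best_conf = None, confidence(current)
        for cand in similar.get(current[i], []):
            trial = current[:i] + [cand] + current[i + 1:]
            c = confidence(trial)
            if c < best_conf:
                best, best_conf = cand, c
        if best is not None:
            current[i] = best
        if best_conf < threshold:
            break
    return current
```

Ranking positions first keeps the number of substitutions small, which is one way to limit perturbation magnitude. Each substitution still costs one victim-model query per candidate.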

[NLP-22] The Impact of Featuring Comments in Online Discussions

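【Quick read】: This paper evaluates how featuring comments (editor picks or featured comments) on online news platforms affects discussion quality. The key to its approach is to compare discussions in which comments are featured with discussions in which none are, estimating discussion quality from the perspective of both the user base and the platform. The results show that the impact on discussion quality is limited, but discussion activity increases after the first comments are featured, suggesting that the strategy could be used to raise user engagement and postpone the natural decline in user activity over time.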

Link: https://arxiv.org/abs/2412.02369
Authors: Cedric Waterschoot, Ernst van den Hemel, Antal van den Bosch
Keywords: called editor picks, platform deems high, deems high quality, high quality comments, deems high
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:A widespread moderation strategy by online news platforms is to feature what the platform deems high quality comments, usually called editor picks or featured comments. In this paper, we compare online discussions of news articles in which certain comments are featured, versus discussions in which no comments are featured. We measure the impact of featuring comments on the discussion, by estimating and comparing the quality of discussions from the perspective of the user base and the platform itself. Our analysis shows that the impact on discussion quality is limited. However, we do observe an increase in discussion activity after the first comments are featured by moderators, suggesting that the moderation strategy might be used to increase user engagement and to postpone the natural decline in user activity over time.

[NLP-23] ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?

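【Quick read】: This paper addresses the evaluation of multimodal large language models (LLMs) on scientific image generation, an application that matters for accelerating scientific progress. The key to the solution is ScImage, a benchmark for evaluating the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ScImage assesses three dimensions of understanding (spatial, numeric, and attribute comprehension) and their combinations, focusing on the relationships between scientific objects (e.g., squares, circles). Five models (GPT-4o, Llama, AutomaTikZ, Dall-E, and StableDiffusion) are evaluated in two output modes (code-based output and direct raster image generation) and four input languages (English, German, Farsi, and Chinese). GPT-4o produces outputs of decent quality for simpler prompts, but all models struggle with more complex ones.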

Link: https://arxiv.org/abs/2412.02368
Authors: Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Christoph Leiter, Simone Paolo Ponzetto, Fahimeh Moafian, Zhixue Zhao
Keywords: demonstrated impressive capabilities, generating high-quality images, generating scientific images, Multimodal large language, demonstrated impressive
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models (LLMs) have demonstrated impressive capabilities in generating high-quality images from textual instructions. However, their performance in generating scientific images–a critical application for accelerating scientific progress–remains underexplored. In this work, we address this gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ScImage assesses three key dimensions of understanding: spatial, numeric, and attribute comprehension, as well as their combinations, focusing on the relationships between scientific objects (e.g., squares, circles). We evaluate five models, GPT-4o, Llama, AutomaTikZ, Dall-E, and StableDiffusion, using two modes of output generation: code-based outputs (Python, TikZ) and direct raster image generation. Additionally, we examine four different input languages: English, German, Farsi, and Chinese. Our evaluation, conducted with 11 scientists across three criteria (correctness, relevance, and scientific accuracy), reveals that while GPT-4o produces outputs of decent quality for simpler prompts involving individual dimensions such as spatial, numeric, or attribute understanding in isolation, all models face challenges in this task, especially for more complex prompts.

[NLP-24] Multi-Granularity Tibetan Textual Adversarial Attack Method Based on Masked Language Model WWW2024

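【Quick read】: This paper addresses adversarial attacks on text classification for Chinese minority languages, Tibetan in particular. The key to the solution is TSTricker, a multi-granularity Tibetan textual adversarial attack method based on masked language models. It uses masked language models to generate candidate substitution syllables or words, applies a scoring mechanism to determine the substitution order, and then attacks several fine-tuned victim models. Experiments show that TSTricker reduces the accuracy of the classification models by more than 28.70% and changes their predictions on more than 90.60% of the samples, a clearly stronger attack effect than the baseline method.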

Link: https://arxiv.org/abs/2412.02343
Authors: Xi Cao, Nuo Qun, Quzong Gesang, Yulei Zhu, Trashi Nyima
Keywords: neural network models, hate speech detection, neural network, Chinese minority languages, network models
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Revised Version; Accepted at WWW 2024 Workshop on SocialNLP

Abstract:In social media, neural network models have been applied to hate speech detection, sentiment analysis, etc., but neural network models are susceptible to adversarial attacks. For instance, in a text classification task, the attacker elaborately introduces perturbations to the original texts that hardly alter the original semantics in order to trick the model into making different predictions. By studying textual adversarial attack methods, the robustness of language models can be evaluated and then improved. Currently, most of the research in this field focuses on English, and there is also a certain amount of research on Chinese. However, there is little research targeting Chinese minority languages. With the rapid development of artificial intelligence technology and the emergence of Chinese minority language models, textual adversarial attacks become a new challenge for the information processing of Chinese minority languages. In response to this situation, we propose a multi-granularity Tibetan textual adversarial attack method based on masked language models called TSTricker. We utilize the masked language models to generate candidate substitution syllables or words, adopt the scoring mechanism to determine the substitution order, and then conduct the attack method on several fine-tuned victim models. The experimental results show that TSTricker reduces the accuracy of the classification models by more than 28.70% and makes the classification models change the predictions of more than 90.60% of the samples, which has an evidently higher attack effect than the baseline method.
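
The two headline numbers in this abstract are an accuracy drop and the share of samples whose prediction changes. A minimal sketch of how such metrics are typically computed is shown below; the function and key names are hypothetical, and the paper's exact definitions may differ (for instance, it may only count samples that were classified correctly before the attack).

```python
def attack_metrics(labels, preds_clean, preds_adv):
    """Accuracy drop and prediction-change rate for an adversarial attack.

    labels: gold labels; preds_clean / preds_adv: model predictions on the
    original and the perturbed texts, aligned by index.
    """
    n = len(labels)
    acc_clean = sum(p == y for p, y in zip(preds_clean, labels)) / n
    acc_adv = sum(p == y for p, y in zip(preds_adv, labels)) / n
    # Share of samples whose predicted class differs after perturbation.
    changed = sum(a != c for a, c in zip(preds_adv, preds_clean)) / n
    return {"accuracy_drop": acc_clean - acc_adv, "changed_rate": changed}
```

The two metrics answer different questions: the accuracy drop measures harm relative to the gold labels, while the change rate measures how often the attack moves the model at all, including cases where the original prediction was already wrong.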

[NLP-25] Pay Attention to the Robustness of Chinese Minority Language Models! Syllable-level Textual Adversarial Attack on Tibetan Script ACL2023

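【Quick read】: This paper addresses textual adversarial attacks on Chinese minority languages, Tibetan in particular. The key to the solution is TSAttacker, a Tibetan syllable-level black-box textual adversarial attack based on syllable cosine distance and a scoring mechanism. Such attacks add carefully designed, imperceptible perturbations that make an NLP model produce false judgments, and can therefore be used to evaluate model robustness. Experiments on six models, obtained by fine-tuning two pre-trained language models (PLMs) on three downstream tasks, show that TSAttacker is effective and generates high-quality adversarial samples, and that the robustness of the models involved still has much room for improvement.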

Link: https://arxiv.org/abs/2412.02323
Authors: Xi Cao, Dolma Dawa, Nuo Qun, Trashi Nyima
Keywords: produces false judgments, attacker adds imperceptible, adds imperceptible perturbations, model produces false, Chinese minority languages
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Revised Version; Accepted at ACL 2023 Workshop on TrustNLP

Abstract:The textual adversarial attack refers to an attack method in which the attacker adds imperceptible perturbations to the original texts by elaborate design so that the NLP (natural language processing) model produces false judgments. This method is also used to evaluate the robustness of NLP models. Currently, most of the research in this field focuses on English, and there is also a certain amount of research on Chinese. However, to the best of our knowledge, there is little research targeting Chinese minority languages. Textual adversarial attacks are a new challenge for the information processing of Chinese minority languages. In response to this situation, we propose a Tibetan syllable-level black-box textual adversarial attack called TSAttacker based on syllable cosine distance and scoring mechanism. And then, we conduct TSAttacker on six models generated by fine-tuning two PLMs (pre-trained language models) for three downstream tasks. The experiment results show that TSAttacker is effective and generates high-quality adversarial samples. In addition, the robustness of the involved models still has much room for improvement.
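
Candidate selection by syllable cosine distance can be illustrated with a small nearest-neighbour lookup over syllable embeddings. The embedding table, the placeholder syllable names, and the choice of `k` below are hypothetical; the abstract does not describe where TSAttacker's syllable vectors come from.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_syllables(target, embeddings, k=2):
    """Return the k syllables closest to target by cosine distance."""
    ranked = sorted(
        (cosine_distance(embeddings[target], vec), syl)
        for syl, vec in embeddings.items()
        if syl != target
    )
    return [syl for _, syl in ranked[:k]]
```

The returned syllables would serve as substitution candidates, and a scoring mechanism would then decide which position in the text to perturb first.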

[NLP-26] Large Multimodal Agents for Accurate Phishing Detection with Enhanced Token Optimization and Cost Reduction

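【Quick read】: This paper addresses effective and economical detection of sophisticated phishing attacks. The key to the solution is a two-tiered agentic approach built on large multimodal agents (Gemini 1.5 Flash and GPT-4o mini) that analyze URLs and webpage screenshots via APIs, which avoids the complexity of training and maintaining AI systems. One agent first assesses the URL; if the result is inconclusive, a second agent evaluates both the URL and the screenshot. This maintains robust detection performance while significantly reducing API costs by avoiding unnecessary multi-input queries. According to the cost analysis, the agentic approach lets GPT-4o mini and Gemini 1.5 Flash process about 4.2 times and 2.6 times as many websites, respectively, as the multimodal approach for the same spend.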

Link: https://arxiv.org/abs/2412.02301
Authors: Fouad Trad, Ali Chehab
Keywords: sophisticated phishing attacks, rise of sophisticated, effective and economical, economical detection solutions, agentic approach
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted in the 2nd International Conference on Foundation and Large Language Models (FLLM2024)

Abstract:With the rise of sophisticated phishing attacks, there is a growing need for effective and economical detection solutions. This paper explores the use of large multimodal agents, specifically Gemini 1.5 Flash and GPT-4o mini, to analyze both URLs and webpage screenshots via APIs, thus avoiding the complexities of training and maintaining AI systems. Our findings indicate that integrating these two data types substantially enhances detection performance over using either type alone. However, API usage incurs costs per query that depend on the number of input and output tokens. To address this, we propose a two-tiered agentic approach: initially, one agent assesses the URL, and if inconclusive, a second agent evaluates both the URL and the screenshot. This method not only maintains robust detection performance but also significantly reduces API costs by minimizing unnecessary multi-input queries. Cost analysis shows that with the agentic approach, GPT-4o mini can process about 4.2 times as many websites per $100 compared to the multimodal approach (107,440 vs. 25,626), and Gemini 1.5 Flash can process about 2.6 times more websites (2,232,142 vs. 862,068). These findings underscore the significant economic benefits of the agentic approach over the multimodal method, providing a viable solution for organizations aiming to leverage advanced AI for phishing detection while controlling expenses.
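
The two-tiered flow and the reported throughput ratios can be sketched as follows. The agent callables are stand-ins for API calls to a multimodal model, and the verdict strings are hypothetical; only the website counts come from the abstract.

```python
def two_tier_classify(url, screenshot, url_agent, multimodal_agent):
    """Cheap URL-only check first; escalate only when it is inconclusive.

    Returns (verdict, tier), where tier records which agent decided.
    """
    verdict = url_agent(url)
    if verdict in ("phishing", "legitimate"):
        return verdict, 1
    # Second tier sees both inputs and therefore costs more tokens per query.
    return multimodal_agent(url, screenshot), 2


def throughput_ratio(sites_agentic, sites_multimodal):
    """How many times more websites the agentic approach handles per budget."""
    return sites_agentic / sites_multimodal


# Website counts reported in the abstract.
print(round(throughput_ratio(107_440, 25_626), 1))     # GPT-4o mini: 4.2
print(round(throughput_ratio(2_232_142, 862_068), 1))  # Gemini 1.5 Flash: 2.6
```

The savings depend on how often the first tier is conclusive: every URL resolved at tier 1 avoids sending a screenshot, which is where most of the input tokens would go.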

[NLP-27] Characterizing Information Shared by Participants to Coding Challenges: The Case of Advent of Code

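【Quick read】: This paper addresses how to systematically analyze participant behavior and programming language usage in the Advent of Code (AoC) coding challenge, and how these relate to the developer community and technological requirements. The key to the solution is a dataset of discussion threads from the 2019-2021 AoC editions and a model based on stream graphs that represents participants, comments, and programming languages through time. With this model the authors study user participation, the adoption of new programming languages within a challenge and between two challenges, and the resiliency of programming languages. In particular, languages rated "Popular" or "Loved" in the Stack Overflow survey are the ones adopted for the longest time within an AoC edition.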

Link: https://arxiv.org/abs/2412.02290
Authors: Francesco Cauteruccio, Enrico Corradini, Luca Virgili
Keywords: solve programming puzzles, coding challenge requiring, programming languages, Advent of Code, sets and levels
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 10 pages, 7 figures

Abstract:Advent of Code (AoC from now on) is a popular coding challenge that requires solving programming puzzles for a variety of skill sets and levels. AoC follows the advent calendar, so it is an annual challenge that lasts for 25 days. AoC participants usually post their solutions on social networks and discuss them online. These challenges are interesting to study since they could highlight the adoption of new tools, the evolution of the developer community, or the technological requirements of well-known companies. For these reasons, we first create a dataset of the 2019-2021 AoC editions containing the discussion threads made on the subreddit /r/adventofcode. Then, we propose a model based on stream graphs to best study this context, where we represent its most important actors through time: participants, comments, and programming languages. Thanks to our model, we investigate user participation, adoption of new programming languages during a challenge and between two of them, and resiliency of programming languages based on a Stack Overflow survey. We find that the top-used programming languages are almost the same in the three years, pointing out their importance. Moreover, participants tend to keep the same programming language for the whole challenge, while the ones attending two AoCs usually change it in the next one. Finally, we observe interesting results about the programming languages that are "Popular" or "Loved" according to the Stack Overflow survey. Firstly, these are the ones adopted for the longest time in an AoC edition, thanks to which users have a high chance of reaching the end of the challenge. Secondly, they are the most chosen when a participant decides to change programming language during the same challenge.
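
A stream graph attaches time to every interaction, which is what makes questions such as "did this participant switch language mid-challenge?" answerable. The class below is a deliberately reduced sketch: it stores only timestamped user-to-language links and ignores comments and node lifetimes. The class name, method names, and sample data are hypothetical and are not taken from the paper.

```python
class StreamGraph:
    """Minimal temporal graph of (day, user, language) links."""

    def __init__(self):
        self.links = []

    def add_link(self, day, user, language):
        """Record that `user` posted a solution in `language` on `day`."""
        self.links.append((day, user, language))

    def languages_over_time(self, user):
        """Languages used by one participant, in chronological order."""
        return [lang for day, u, lang in sorted(self.links) if u == user]

    def switched_language(self, user):
        """True if the participant used more than one language."""
        return len(set(self.languages_over_time(user))) > 1

    def active_users(self, day):
        """Participants with at least one link on the given day."""
        return {u for d, u, _ in self.links if d == day}
```

With links such as `(1, "alice", "Python")`, `(2, "alice", "Python")`, `(1, "bob", "Rust")`, and `(3, "bob", "Go")`, the graph reports that `bob` switched language while `alice` did not. This is the kind of per-participant trajectory the abstract's findings rely on.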

[NLP-28] A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis

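【Quick read】: This paper addresses how to comprehensively evaluate large language models (LLMs) on aspect-based sentiment analysis (ABSA). The key to the solution is a unified task formulation that applies multiple LLMs to multiple ABSA subtasks under two paradigms, fine-tuning-dependent and fine-tuning-free. For the fine-tuning-dependent paradigm, LLMs are fine-tuned efficiently with instruction-based multi-task learning; for the fine-tuning-free paradigm, three demonstration selection strategies are proposed to stimulate the few-shot abilities of LLMs. Experiments show that LLMs reach new state-of-the-art performance in the fine-tuning-dependent paradigm, and that in the fine-tuning-free paradigm LLMs with In-Context Learning (ICL) show considerable potential, even competing with fine-tuned small language models (SLMs) on some ABSA subtasks.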

Link: https://arxiv.org/abs/2412.02279
Authors: Changzhi Zhou, Dandan Song, Yuhang Tian, Zhijing Wu, Hao Wang, Xinyu Zhang, Jun Yang, Ziyi Yang, Shuhao Zhang
Keywords: Large Language Models, garnered increasing attention, revolutionizing numerous downstream, natural language processing, numerous downstream tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently, Large Language Models (LLMs) have garnered increasing attention in the field of natural language processing, revolutionizing numerous downstream tasks with powerful reasoning and generation abilities. For example, In-Context Learning (ICL) introduces a fine-tuning-free paradigm, allowing out-of-the-box LLMs to execute downstream tasks by analogy learning without any fine-tuning. Besides, in a fine-tuning-dependent paradigm where substantial training data exists, cost-effective Parameter-Efficient Fine-Tuning (PEFT) methods enable LLMs to achieve excellent performance comparable to full fine-tuning. However, these fascinating techniques employed by LLMs have not been fully exploited in the ABSA field. Previous works probe LLMs in ABSA by merely using randomly selected input-output pairs as demonstrations in ICL, resulting in an incomplete and superficial evaluation. In this paper, we shed light on a comprehensive evaluation of LLMs in the ABSA field, involving 13 datasets, 8 ABSA subtasks, and 6 LLMs. Specifically, we design a unified task formulation to unify "multiple LLMs for multiple ABSA subtasks in multiple paradigms." For the fine-tuning-dependent paradigm, we efficiently fine-tune LLMs using instruction-based multi-task learning. For the fine-tuning-free paradigm, we propose 3 demonstration selection strategies to stimulate the few-shot abilities of LLMs. Our extensive experiments demonstrate that LLMs achieve a new state-of-the-art performance compared to fine-tuned Small Language Models (SLMs) in the fine-tuning-dependent paradigm. More importantly, in the fine-tuning-free paradigm where SLMs are ineffective, LLMs with ICL still showcase impressive potential and even compete with fine-tuned SLMs on some ABSA subtasks.
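
The abstract contrasts randomly chosen demonstrations with deliberate selection strategies but does not name the three strategies. As one plausible illustration, the sketch below picks the pool examples most similar to the query by word-overlap (Jaccard) similarity and assembles a few-shot prompt. The similarity measure, the prompt template, and the sample reviews are assumptions, not the paper's method.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def select_demonstrations(query, pool, k=2):
    """Pick the k pool examples whose text is most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex["text"]), reverse=True)
    return ranked[:k]


def build_prompt(query, demos):
    """Format demonstrations followed by the unanswered query."""
    blocks = [f"Review: {d['text']}\nAspects: {d['label']}" for d in demos]
    blocks.append(f"Review: {query}\nAspects:")
    return "\n\n".join(blocks)
```

Given the query "the pizza was great", this selection prefers a pool example about pizza over one about battery life, so the in-context examples share aspect vocabulary with the input. Embedding-based similarity would be a natural replacement for the word-overlap measure.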

[NLP-29] MediaSpin: Exploring Media Bias Through Fine-Grained Analysis of News Headlines

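【Quick read】: This paper addresses the automatic detection of media bias in news headlines. The key to the solution is the MediaSpin dataset, which contains 78,910 pairs of news headlines with annotations of the assigned bias categories, produced by human-supervised and human-validated large language model (LLM) labeling. The dataset covers 13 distinct types of media bias and includes explanations, providing training material for models that detect bias in news edits.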

Link: https://arxiv.org/abs/2412.02271
Authors: Preetika Verma, Kokil Jaidka
Keywords: Large Language Model, validated Large Language, Large Language, media bias present, Language Model
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we introduce the MediaSpin dataset aiming to help in the development of models that can detect different forms of media bias present in news headlines, developed through human-supervised and -validated Large Language Model (LLM) labeling of media bias. This corpus comprises 78,910 pairs of news headlines and annotations with explanations of the 13 distinct types of media bias categories assigned. We demonstrate the usefulness of our dataset for automated bias detection in news edits.

[NLP-30] Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

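【Quick read】: This paper addresses the loss of inference efficiency, particularly in memory and computational complexity, that comes with larger context windows when large language models (LLMs) handle long-text tasks. The key to the solution is to reduce the memory and computational load of less important tokens rather than discarding them, improving efficiency without token loss. Specifically, the authors 1) investigate the distribution of important tokens in the context and find that recent tokens are more important than distant ones, and 2) optimize resources for distant tokens by sharing attention scores across layers. Experiments show that the method saves 35% of the KV cache without compromising performance.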

Link: https://arxiv.org/abs/2412.02252
Authors: Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu
Keywords: Large Language Models, Language Models, Large Language, size in Large, GPT and LLaMA
Categories: Computation and Language (cs.CL)
Comments: preprint

Abstract:The increasing context window size in Large Language Models (LLMs), such as the GPT and LLaMA series, has improved their ability to tackle complex, long-text tasks, but at the cost of inference efficiency, particularly regarding memory and computational complexity. Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tokens needed for future text generation. In this paper, we propose an approach that enhances LLM efficiency without token loss by reducing the memory and computational load of less important tokens, rather than discarding them. We address two challenges: 1) investigating the distribution of important tokens in the context, discovering recent tokens are more important than distant tokens in context, and 2) optimizing resources for distant tokens by sharing attention scores across layers. The experiments show that our method saves 35% KV cache without compromising the performance.
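
To put the 35% figure in perspective, the size of a standard KV cache can be computed directly from the model shape. The configuration below (32 layers, 32 KV heads, head dimension 128, 16-bit values, 4,096 tokens) is an illustrative assumption that roughly matches a 7B-parameter decoder; it is not a setup taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


full = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
saved = 0.35 * full  # saving reported in the abstract
print(f"full cache: {full / 2**30:.2f} GiB")                 # full cache: 2.00 GiB
print(f"after 35% saving: {(full - saved) / 2**30:.2f} GiB")  # after 35% saving: 1.30 GiB
```

Since the cache grows linearly with sequence length and batch size, the same relative saving translates into proportionally more memory at longer contexts.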

[NLP-31] BANER: Boundary-Aware LLMs for Few-Shot Named Entity Recognition COLING2025

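【Quick read】: This paper addresses two main problems in few-shot named entity recognition (NER): over- and under-detected false spans in the span detection stage, and unaligned entity prototypes in the type classification stage. The proposed approach, Boundary-Aware LLMs for Few-Shot Named Entity Recognition, introduces a boundary-aware contrastive learning strategy that strengthens the LLM's perception of entity boundaries for generalized entity spans. It also uses LoRAHub to align information from the target domain to the source domain, improving adaptive cross-domain classification. Experiments on various benchmarks show that the framework outperforms prior methods.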

Link: https://arxiv.org/abs/2412.02228
Authors: Quanjiang Guo, Yihong Dong, Ling Tian, Zhao Kang, Yu Zhang, Sijie Wang
Keywords: two-stage prototypical networks, named entity recognition, classification stage persist, few-shot named entity, unaligned entity prototypes
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear at COLING 2025

Abstract:Despite the recent success of two-stage prototypical networks in few-shot named entity recognition (NER), challenges such as over/under-detected false spans in the span detection stage and unaligned entity prototypes in the type classification stage persist. Additionally, LLMs have not proven to be effective few-shot information extractors in general. In this paper, we propose an approach called Boundary-Aware LLMs for Few-Shot Named Entity Recognition to address these issues. We introduce a boundary-aware contrastive learning strategy to enhance the LLM’s ability to perceive entity boundaries for generalized entity spans. Additionally, we utilize LoRAHub to align information from the target domain to the source domain, thereby enhancing adaptive cross-domain classification capabilities. Extensive experiments across various benchmarks demonstrate that our framework outperforms prior methods, validating its effectiveness. In particular, the proposed strategies demonstrate effectiveness across a range of LLM architectures. The code and data are released on this https URL.

[NLP-32] DataLab: A Unified Platform for LLM-Powered Business Intelligence

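【Quick read】: This paper addresses task fragmentation in existing business intelligence (BI) systems, where tasks spread across different data roles and tools lead to inefficiencies and potential errors. The key to the solution is DataLab, a unified BI platform that integrates a large language model (LLM)-based agent framework with an augmented computational notebook interface. Three designs make this unification possible: 1) a domain knowledge incorporation module tailored to enterprise-specific BI tasks; 2) an inter-agent communication mechanism that facilitates information sharing across the BI workflow; and 3) a cell-based context management strategy that improves context utilization efficiency in BI notebooks. DataLab thus combines LLM assistance with user customization in a single environment, supports a wide range of BI tasks, and achieves state-of-the-art performance in the reported experiments.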

Link: https://arxiv.org/abs/2412.02205
Authors: Luoxuan Weng, Yinghao Tang, Yingchaojie Feng, Zhuo Chang, Peng Chen, Ruiqin Chen, Haozhe Feng, Chen Hou, Danqing Huang, Yang Li, Huaming Rao, Haonan Wang, Canshi Wei, Xiaofeng Yang, Yuhui Zhang, Yifeng Zheng, Xiuqi Huang, Minfeng Zhu, Yuxin Ma, Bin Cui, Wei Chen
Keywords: transforms large volumes, Business intelligence, informed decision-making, modern organizations, organizations into actionable
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Business intelligence (BI) transforms large volumes of data within modern organizations into actionable insights for informed decision-making. Recently, large language model (LLM)-based agents have streamlined the BI workflow by automatically performing task planning, reasoning, and actions in executable environments based on natural language (NL) queries. However, existing approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS. The fragmentation of tasks across different data roles and tools leads to inefficiencies and potential errors due to the iterative and collaborative nature of BI. In this paper, we introduce DataLab, a unified BI platform that integrates a one-stop LLM-based agent framework with an augmented computational notebook interface. DataLab supports a wide range of BI tasks for different data roles by seamlessly combining LLM assistance with user customization within a single environment. To achieve this unification, we design a domain knowledge incorporation module tailored for enterprise-specific BI tasks, an inter-agent communication mechanism to facilitate information sharing across the BI workflow, and a cell-based context management strategy to enhance context utilization efficiency in BI notebooks. Extensive experiments demonstrate that DataLab achieves state-of-the-art performance on various BI tasks across popular research benchmarks. Moreover, DataLab maintains high effectiveness and efficiency on real-world datasets from Tencent, achieving up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks.

[NLP-33] VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

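Quick read: This paper addresses the lack of a systematic analysis of how well large vision-language models (LVLMs) can critique and correct their own reasoning during self-improvement. The key to the solution is the VISCO benchmark, which requires LVLMs to assess each step of a chain-of-thought at fine granularity and to support each judgment with a natural language explanation, yielding dense, fine-grained critique. The proposed LookBack strategy has the model revisit the image to verify each piece of information in its initial reasoning, improving critique and correction performance by up to 13.5%.

Link: https://arxiv.org/abs/2412.02172
Authors: Xueqing Wu, Yuheng Ding, Bingxuan Li, Pan Lu, Da Yin, Kai-Wei Chang, Nanyun Peng
Keywords: large vision-language models, essential building block, vision-language models, ability of large, large vision-language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project: this https URL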

Abstract:The ability of large vision-language models (LVLMs) to critique and correct their reasoning is an essential building block towards their self-improvement. However, a systematic analysis of such capabilities in LVLMs is still lacking. We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs. Compared to existing work that uses a single scalar value to critique the entire reasoning [4], VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought and provide natural language explanations to support their judgments. Extensive evaluation of 24 LVLMs demonstrates that human-written critiques significantly enhance the performance after correction, showcasing the potential of the self-improvement strategy. However, the model-generated critiques are less helpful and sometimes detrimental to the performance, suggesting that critique is the crucial bottleneck. We identified three common patterns in critique failures: failure to critique visual perception, reluctance to “say no”, and exaggerated assumption of error propagation. To address these issues, we propose an effective LookBack strategy that revisits the image to verify each piece of information in the initial reasoning. LookBack significantly improves critique and correction performance by up to 13.5%.

[NLP-34] Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach NEURIPS2024

Quick read: This paper addresses how to prevent a large language model (LLM) from engaging in a narrowly defined set of forbidden behaviors, using as a case study preventing an LLM from helping a user make a bomb. The key to the solution is a transcript-classifier defense, which outperforms the baseline defenses tested, including safety training, adversarial training, and input/output classifiers. The transcript-classifier defense still fails in some circumstances, however, which shows that jailbreak defense remains difficult even in a narrow domain.

Link: https://arxiv.org/abs/2412.02159
Authors: Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
Keywords: Defending large language, large language models, Defending large, large language, language models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted to the AdvML-Frontiers and SoLaR workshops at NeurIPS 2024

Abstract:Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak-defense even in a narrow domain.

[NLP-35] Leveraging Large Language Models for Comparative Literature Summarization with Reflective Incremental Mechanisms

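Quick read: This paper addresses the inability of existing summarization models to provide deep comparative insights when summarizing literature. The key to the solution is ChatCite, a method that uses large language models (LLMs) with a multi-step reasoning mechanism to generate comparative literature summaries. ChatCite extracts critical elements from papers, incrementally builds a comparative summary, and refines the output through a reflective memory process. It outperforms several baselines (GPT-4, BART, T5, and CoT) on automatic metrics such as ROUGE and G-Score, and human evaluation rates its summaries as more coherent, insightful, and fluent.

Link: https://arxiv.org/abs/2412.02149
Authors: Fernando Gabriela Garcia, Spencer Burns, Harrison Fuller
Keywords: leveraging large language, large language models, method leveraging large, leveraging large, large language
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 8 pages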

Abstract:In this paper, we introduce ChatCite, a novel method leveraging large language models (LLMs) for generating comparative literature summaries. The ability to summarize research papers with a focus on key comparisons between studies is an essential task in academic research. Existing summarization models, while effective at generating concise summaries, fail to provide deep comparative insights. ChatCite addresses this limitation by incorporating a multi-step reasoning mechanism that extracts critical elements from papers, incrementally builds a comparative summary, and refines the output through a reflective memory process. We evaluate ChatCite on a custom dataset, CompLit-LongContext, consisting of 1000 research papers with annotated comparative summaries. Experimental results show that ChatCite outperforms several baseline methods, including GPT-4, BART, T5, and CoT, across various automatic evaluation metrics such as ROUGE and the newly proposed G-Score. Human evaluation further confirms that ChatCite generates more coherent, insightful, and fluent summaries compared to these baseline models. Our method provides a significant advancement in automatic literature review generation, offering researchers a powerful tool for efficiently comparing and synthesizing scientific research.

[NLP-36] Personalized Multimodal Large Language Models: A Survey

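Quick read: This paper surveys personalized Multimodal Large Language Models (MLLMs), covering how their techniques are categorized, how they are trained, where they are applied, and how they are evaluated. The key contribution is an intuitive taxonomy of the techniques used to personalize MLLMs, together with a discussion of how these techniques can be combined or adapted, highlighting their advantages and underlying rationale. The paper also summarizes the personalization tasks studied in existing research and their common evaluation metrics, lists datasets useful for benchmarking, and outlines the critical open challenges in the field.

Link: https://arxiv.org/abs/2412.02142
Authors: Junda Wu, Hanjia Lyu, Yu Xia, Zhehao Zhang, Joe Barrow, Ishita Kumar, Mehrnoosh Mirtaheri, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Namyong Park, Sungchul Kim, Huanrui Yang, Subrata Mitra, Zhengmian Hu, Nedim Lipka, Dang Nguyen, Yue Zhao, Jiebo Luo, Julian McAuley
Keywords: Large Language Models, Multimodal Large Language, multiple data modalities, increasingly important due, integrate multiple data
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: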

Abstract:Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities, such as text, images, and audio, to perform complex tasks with high accuracy. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications. We propose an intuitive taxonomy for categorizing the techniques used to personalize MLLMs to individual users, and discuss the techniques accordingly. Furthermore, we discuss how such techniques can be combined or adapted when appropriate, highlighting their advantages and underlying rationale. We also provide a succinct summary of personalization tasks investigated in existing research, along with the evaluation metrics commonly used. Additionally, we summarize the datasets that are useful for benchmarking personalized MLLMs. Finally, we outline critical open challenges. This survey aims to serve as a valuable resource for researchers and practitioners seeking to understand and advance the development of personalized multimodal large language models.

[NLP-37] WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

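Quick read: This paper addresses two limitations of existing multimodal large language models (MLLMs) in computational pathology: they cannot analyze whole slide images (WSIs) comprehensively, and they tend to bypass crucial morphological features. The key to the solution is WSI-Bench, a large-scale morphology-aware benchmark, together with a new framework called WSI-LLaVA. WSI-LLaVA uses a three-stage training approach (WSI-text alignment, feature space alignment, and task-specific instruction tuning) to improve understanding of gigapixel WSIs, and the authors develop two specialized WSI metrics (WSI-Precision and WSI-Relevance) to better assess models in pathological contexts. Experiments show that WSI-LLaVA significantly outperforms existing models in morphological analysis and diagnostic accuracy.

Link: https://arxiv.org/abs/2412.02141
Authors: Yuci Liang, Xinheng Lyu, Meidan Ding, Wenting Chen, Jipeng Zhang, Yuexiang Ren, Xiangjian He, Song Wu, Sen Yang, Xiyue Wang, Xiaohan Xing, Linlin Shen
Keywords: Multi-modal Large Language, patch-level Multi-modal Large, produced patch-level Multi-modal, Large Language Models, Multi-modal Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 38 pages, 22 figures, 35 tables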

Abstract:Recent advancements in computational pathology have produced patch-level Multi-modal Large Language Models (MLLMs), but these models are limited by their inability to analyze whole slide images (WSIs) comprehensively and their tendency to bypass crucial morphological features that pathologists rely on for diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, designed to evaluate MLLMs’ understanding of morphological characteristics crucial for accurate diagnosis. Building upon this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI understanding that employs a three-stage training approach: WSI-text alignment, feature space alignment, and task-specific instruction tuning. To better assess model performance in pathological contexts, we develop two specialized WSI metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that WSI-LLaVA outperforms existing models across all capability dimensions, with a significant improvement in morphological analysis, establishing a clear correlation between morphological understanding and diagnostic accuracy.

[NLP-38] Misalignment of Semantic Relation Knowledge between WordNet and Human Intuition

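Quick read: This paper examines how well the semantic relations in WordNet, a database built by specialists, align with the intuition of language users. The key to the approach is using templates to elicit responses from human participants, which allows a systematic study of the alignment between the two sources of semantic relation knowledge. The study finds a general misalignment between WordNet and human intuition, with a systematic pattern of mismatch among synonymy and taxonomic relations (hypernymy and hyponymy). It also finds that WordNet path length is not a reliable indicator of human intuition about hypernymy or hyponymy. These findings can support more appropriate use of WordNet and facilitate its improvement.

Link: https://arxiv.org/abs/2412.02138
Authors: Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga
Keywords: carefully constructed repository, created by specialists, carefully constructed, constructed repository, semantic relations
Categories: Computation and Language (cs.CL)
Comments: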

Abstract:WordNet provides a carefully constructed repository of semantic relations, created by specialists. But there is another source of information on semantic relations, the intuition of language users. We present the first systematic study of the degree to which these two sources are aligned. Investigating the cases of misalignment could support the proper use of WordNet and facilitate its improvement. Our analysis, which uses templates to elicit responses from human participants, reveals a general misalignment of semantic relation knowledge between WordNet and human intuition. Further analyses find a systematic pattern of mismatch among synonymy and taxonomic relations (hypernymy and hyponymy), together with the fact that WordNet path length does not serve as a reliable indicator of human intuition regarding hypernymy or hyponymy relations.
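
To make the notion of path length in a hypernym hierarchy concrete, here is a minimal sketch over a hand-written toy graph. The words and edges below are illustrative assumptions rather than WordNet data, and the function name is hypothetical; it simply counts edges on the shortest chain from a word up to an ancestor.

```python
from collections import deque

# Toy hypernym edges (child -> parents); illustrative only, not taken from WordNet.
HYPERNYMS = {
    "poodle": ["dog"],
    "dog": ["canine"],
    "canine": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": ["animal"],
    "cat": ["feline"],
    "feline": ["carnivore"],
}

def hypernym_path_length(word, ancestor):
    """Return the number of edges from word up to ancestor, or None if unreachable."""
    queue = deque([(word, 0)])
    seen = {word}
    while queue:
        node, dist = queue.popleft()
        if node == ancestor:
            return dist
        for parent in HYPERNYMS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return None

print(hypernym_path_length("poodle", "dog"))   # direct hypernym
print(hypernym_path_length("dog", "animal"))   # several edges apart in this toy graph
```

In this toy graph "dog" and "animal" are four edges apart, while "poodle" and "dog" are one edge apart. A speaker might judge both as equally obvious "is a kind of" pairs, which is one way to picture why path length may not track intuition.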

[NLP-39] Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

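Quick read: This paper addresses the interpretability and explainability challenges faced by Multimodal Large Language Models (MLLMs). The key contribution is a new framework that categorizes and systematically analyzes existing research from three perspectives: data, model, and training/inference. Specifically, the paper examines interpretability from token-level to embedding-level representations, assesses approaches related to architecture analysis and design, and explores training and inference strategies that enhance transparency. By comparing methods, it identifies their strengths and limitations and proposes future research directions for unresolved problems in multimodal explainability. The framework offers a foundational resource for advancing interpretability and transparency in MLLMs, guiding researchers and practitioners toward more accountable and robust multimodal AI systems.

Link: https://arxiv.org/abs/2412.02104
Authors: Yunkai Dang, Kaichen Huang, Jiahao Huo, Yibo Yan, Sirui Huang, Dongrui Liu, Mengxi Gao, Jie Zhang, Chen Qian, Kun Wang, Yong Liu, Jing Shao, Hui Xiong, Xuming Hu
Keywords: Artificial Intelligence, revolutionized numerous fields, large language models, natural language understanding, development of Artificial
Categories: Computation and Language (cs.CL)
Comments: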

Abstract:The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with large language models (LLMs) and computer vision (CV) systems driving advancements in natural language understanding and visual processing, respectively. The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video modalities. Multimodal large language models (MLLMs), in particular, have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce significant challenges in interpretability and explainability, essential for establishing transparency, trustworthiness, and reliability in high-stakes applications. This paper provides a comprehensive survey on the interpretability and explainability of MLLMs, proposing a novel framework that categorizes existing research across three perspectives: (I) Data, (II) Model, (III) Training & Inference. We systematically analyze interpretability from token-level to embedding-level representations, assess approaches related to both architecture analysis and design, and explore training and inference strategies that enhance transparency. By comparing various methodologies, we identify their strengths and limitations and propose future research directions to address unresolved challenges in multimodal explainability. This survey offers a foundational resource for advancing interpretability and transparency in MLLMs, guiding researchers and practitioners toward developing more accountable and robust multimodal AI systems.

[NLP-40] Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation

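Quick read: This paper addresses the underperformance of the decoder-only architecture in multilingual neural machine translation (MNMT), which it attributes to a lack of language transfer capability. The key to the solution is dividing the decoding process into two stages, with target tokens explicitly excluded in the first stage, which implicitly boosts transfer capability across languages. In addition, applying contrastive learning to translation instructions further improves zero-shot translation. Experiments show that, compared with the encoder-decoder architecture, the method performs competitively in supervised translation while significantly improving BLEU, chrF++, BERTScore, and COMET in zero-shot translation.

Link: https://arxiv.org/abs/2412.02101
Authors: Zhi Qu, Yiran Wang, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe
Keywords: Existing multilingual neural, multilingual neural machine, Existing multilingual, translate multiple languages, neural machine translation
Categories: Computation and Language (cs.CL)
Comments: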

Abstract:Existing multilingual neural machine translation (MNMT) approaches mainly focus on improving models with the encoder-decoder architecture to translate multiple languages. However, decoder-only architecture has been explored less in MNMT due to its underperformance when trained on parallel data solely. In this work, we attribute the issue of the decoder-only architecture to its lack of language transfer capability. Specifically, the decoder-only architecture is insufficient in encoding source tokens with the target language features. We propose dividing the decoding process into two stages so that target tokens are explicitly excluded in the first stage to implicitly boost the transfer capability across languages. Additionally, we impose contrastive learning on translation instructions, resulting in improved performance in zero-shot translation. We conduct experiments on TED-19 and OPUS-100 datasets, considering both training from scratch and fine-tuning scenarios. Experimental results show that, compared to the encoder-decoder architecture, our methods not only perform competitively in supervised translations but also achieve improvements of up to 3.39 BLEU, 6.99 chrF++, 3.22 BERTScore, and 4.81 COMET in zero-shot translations.

[NLP-41] Let's Think Var-by-Var: Large Language Models Enable Ad Hoc Probabilistic Reasoning

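Quick read: This paper addresses how to use the common-sense knowledge in large language models (LLMs) to answer guesstimation questions such as "How much are Airbnb listings in Newark, NJ?" without direct access to data. The key to the solution is synthesizing an ad hoc probabilistic model: an LLM is prompted to propose random variables relevant to the question, followed by moment constraints on their joint distribution, and the joint distribution is then optimized within a log-linear family to maximize constraint satisfaction. Experiments show that LLMs can propose reasonable variables, and that although the proposed numerical constraints can be noisy, joint optimization reconciles them. On probabilistic questions derived from three real-world tabular datasets, the framework performs comparably to a direct prompting baseline in total variation distance and is similarly robust to noise.

Link: https://arxiv.org/abs/2412.02081
Authors: Shepard Xia, Brian Lu, Jason Eisner
Keywords: hallmark of intelligence, ability to flesh, flesh out underspecified, underspecified situations, common sense
Categories: Computation and Language (cs.CL)
Comments: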

Abstract:A hallmark of intelligence is the ability to flesh out underspecified situations using “common sense.” We propose to extract that common sense from large language models (LLMs), in a form that can feed into probabilistic inference. We focus our investigation on guesstimation questions such as “How much are Airbnb listings in Newark, NJ?” Formulating a sensible answer without access to data requires drawing on, and integrating, bits of common knowledge about how Price and Location may relate to other variables, such as Property Type. Our framework answers such a question by synthesizing an ad hoc probabilistic model. First we prompt an LLM to propose a set of random variables relevant to the question, followed by moment constraints on their joint distribution. We then optimize the joint distribution p within a log-linear family to maximize the overall constraint satisfaction. Our experiments show that LLMs can successfully be prompted to propose reasonable variables, and while the proposed numerical constraints can be noisy, jointly optimizing for their satisfaction reconciles them. When evaluated on probabilistic questions derived from three real-world tabular datasets, we find that our framework performs comparably to a direct prompting baseline in terms of total variation distance from the dataset distribution, and is similarly robust to noise.
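
The idea of fitting a log-linear joint distribution to moment constraints can be sketched on a tiny example. Everything below is an assumption for illustration: the two binary variables, the feature functions, the target moments (stand-ins for LLM-proposed constraints), and the plain moment-matching update, which is a classic way to fit such models and not necessarily the optimizer the authors use.

```python
import math
from itertools import product

# States of two hypothetical binary variables: (whole_home, pricey).
STATES = list(product([0, 1], repeat=2))

# Feature functions whose expectations are constrained (illustrative choices).
FEATURES = [
    lambda s: s[0],         # listing is a whole home
    lambda s: s[1],         # listing is pricey
    lambda s: s[0] * s[1],  # whole home and pricey
]

# Made-up target moments standing in for LLM-proposed constraints.
TARGETS = [0.6, 0.5, 0.4]

def distribution(theta):
    """Log-linear joint: p(s) is proportional to exp(sum_k theta_k * f_k(s))."""
    scores = [math.exp(sum(t * f(s) for t, f in zip(theta, FEATURES))) for s in STATES]
    z = sum(scores)
    return [v / z for v in scores]

def moments(p):
    """Expected value of each feature under the joint p."""
    return [sum(pi * f(s) for pi, s in zip(p, STATES)) for f in FEATURES]

def fit(targets, steps=5000, lr=1.0):
    """Nudge each weight by the gap between its target and current moment."""
    theta = [0.0] * len(FEATURES)
    for _ in range(steps):
        m = moments(distribution(theta))
        theta = [t + lr * (tgt - mk) for t, tgt, mk in zip(theta, targets, m)]
    return theta

p = distribution(fit(TARGETS))
# Query the fitted joint: P(pricey | whole_home).
p_pricey_given_whole = p[3] / (p[2] + p[3])
print(moments(p), p_pricey_given_whole)
```

Once the joint is fitted, any conditional or marginal query can be read off it, which is what makes the ad hoc model useful for answering the original question.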

[NLP-42] BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling on Social Media Texts

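Quick read: This paper addresses gender and age classification of Bangla-language authors on social media. The key contribution is introducing a newly created Bangla author profiling dataset, BN-AuthProf, and benchmarking machine learning approaches on it. The dataset contains 30,131 social media posts from 300 authors, labeled by age and gender. Using a range of classical machine learning and deep learning techniques, the study reports 80% accuracy for gender classification with a Support Vector Machine (SVM), and 91% accuracy with an F1 score of 0.905 for age classification with a Multinomial Naive Bayes (MNB) classifier. The results highlight the effectiveness of machine learning for Bangla author profiling, with applications in marketing, security, forensic linguistics, education, and criminal investigations.

Link: https://arxiv.org/abs/2412.02058
Authors: Raisa Tasnim, Mehanaz Chowdhury, Md Ataur Rahman
Keywords: Bangla Author Profiling, Author profiling, social media platforms, Bangla Author, social media
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: Accepted to be Published in 2024 27th International Conference on Computer and Information Technology (ICCIT)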

Abstract:Author profiling, the analysis of texts to uncover attributes such as gender and age of the author, has become essential with the widespread use of social media platforms. This paper focuses on author profiling in the Bangla language, aiming to extract valuable insights about anonymous authors based on their writing style on social media. The primary objective is to introduce and benchmark the performance of machine learning approaches on a newly created Bangla Author Profiling dataset, BN-AuthProf. The dataset comprises 30,131 social media posts from 300 authors, labeled by their age and gender. Authors’ identities and sensitive information were anonymized to ensure privacy. Various classical machine learning and deep learning techniques were employed to evaluate the dataset. For gender classification, the best accuracy achieved was 80% using Support Vector Machine (SVM), while a Multinomial Naive Bayes (MNB) classifier achieved the best F1 score of 0.756. For age classification, MNB attained a maximum accuracy score of 91% with an F1 score of 0.905. This research highlights the effectiveness of machine learning in gender and age classification for Bangla author profiling, with practical implications spanning marketing, security, forensic linguistics, education, and criminal investigations, considering privacy and biases.
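
The abstract reports that one classifier had the best gender accuracy while another had the best F1 score. A small worked example shows how that can happen. The confusion counts below are invented for illustration and are not the paper's results, and the sketch assumes plain binary F1, which may differ from the averaging the authors used.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts for two classifiers on 100 posts, 30 of them positive.
model_a = binary_metrics(tp=15, fp=5, fn=15, tn=65)   # cautious: misses many positives
model_b = binary_metrics(tp=24, fp=18, fn=6, tn=52)   # eager: more false alarms

print("A: accuracy=%.2f f1=%.3f" % (model_a[0], model_a[3]))
print("B: accuracy=%.2f f1=%.3f" % (model_b[0], model_b[3]))
```

With these counts the cautious model has the higher accuracy and the eager model has the higher F1, because F1 ignores true negatives and rewards recall on the positive class.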

[NLP-43] A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil and Sinhala

Quick read: This paper addresses Named Entity Recognition (NER) for low-resource languages, namely Tamil and Sinhala, and explores its use in Neural Machine Translation (NMT). The key to the approach is using pre-trained multilingual Language Models (mLMs) to establish new NER benchmark results, investigating in detail how different types of mLMs perform on NER, and finally demonstrating the utility of the resulting NER system on a low-resource NMT task.

Link: https://arxiv.org/abs/2412.02056
Authors: Surangika Ranathunga, Asanka Ranasinghea, Janaka Shamala, Ayodya Dandeniyaa, Rashmi Galappaththia, Malithi Samaraweeraa
Keywords: Named Entity Recognition, Named Entities, Sinhala and Tamil, benchmark Named Entity, multi-way parallel
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents a multi-way parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs), where Sinhala and Tamil are low-resource languages. Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil. We also carry out a detailed investigation on the NER capabilities of different types of mLMs. Finally, we demonstrate the utility of our NER system on a low-resource Neural Machine Translation (NMT) task. Our dataset is publicly released: this https URL.

[NLP-44] Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code

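Quick read: This paper examines how data snooping affects neural networks used for vulnerability detection in lifted code. The key question is whether performance changes significantly when the embedding models are trained on datasets that include samples also used for neural network training and validation. The results show that introducing data snooping did not significantly alter model performance, which may mean that its impact is minimal, or that samples randomly dropped as part of the methodology contained hidden features critical to achieving optimal performance. The study also reinforces earlier findings that models trained with GPT-2 embeddings consistently outperform neural networks trained with other embeddings; this holds even when data snooping is introduced into the embedding model, indicating that GPT-2 represents complex code features robustly.

Link: https://arxiv.org/abs/2412.02048
Authors: Gary A. McCully, John D. Hastings, Shengjie Xu
Keywords: bidirectional transformer-based embeddings, data snooping, study examines, vulnerability detection, detection in lifted
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 7 pages, 2 figures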

Abstract:This study examines the impact of data snooping on neural networks for vulnerability detection in lifted code, building on previous research which used word2vec, and unidirectional and bidirectional transformer-based embeddings. The research specifically focuses on how model performance is affected when embedding models are trained on datasets that include samples also used for neural network training and validation. The results show that introducing data snooping did not significantly alter model performance, suggesting that data snooping had a minimal impact or that samples randomly dropped as part of the methodology contained hidden features critical to achieving optimal performance. In addition, the findings reinforce the conclusions of previous research, which found that models trained with GPT-2 embeddings consistently outperformed neural networks trained with other embeddings. The fact that this holds even when data snooping is introduced into the embedding model indicates GPT-2’s robustness in representing complex code features, even under less-than-ideal conditions.

[NLP-45] Real-Time Multilingual Sign Language Processing

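Quick read: This thesis addresses the limitations of traditional gloss-based systems in Sign Language Processing (SLP), which are language-specific and fail to capture the multidimensional nature of signed languages. The key to the solution is SignWriting, a universal sign language transcription notation system, used as an intermediary between the visual-gestural modality and text-based linguistic representations. The work contributes foundational libraries and resources to the SLP community to support deeper research on sign language translation and production, aiming at more natural and accurate translation across languages and paving the way for real-time, multilingual applications that make sign language technology more inclusive and accessible.

Link: https://arxiv.org/abs/2412.01991
Authors: Amit Moryossef
Keywords: Computer Vision, Sign Language, interdisciplinary field comprised, Sign Language Processing, Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: PhD Thesis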

Abstract:Sign Language Processing (SLP) is an interdisciplinary field comprised of Natural Language Processing (NLP) and Computer Vision. It is focused on the computational understanding, translation, and production of signed languages. Traditional approaches have often been constrained by the use of gloss-based systems that are both language-specific and inadequate for capturing the multidimensional nature of sign language. These limitations have hindered the development of technology capable of processing signed languages effectively. This thesis aims to revolutionize the field of SLP by proposing a simple paradigm that can bridge this existing technological gap. We propose the use of SignWriting, a universal sign language transcription notation system, to serve as an intermediary link between the visual-gestural modality of signed languages and text-based linguistic representations. We contribute foundational libraries and resources to the SLP community, thereby setting the stage for a more in-depth exploration of the tasks of sign language translation and production. These tasks encompass the translation of sign language from video to spoken language text and vice versa. Through empirical evaluations, we establish the efficacy of our transcription method as a pivot for enabling faster, more targeted research that can lead to more natural and accurate translations across a range of languages. The universal nature of our transcription-based paradigm also paves the way for real-time, multilingual applications in SLP, thereby offering a more inclusive and accessible approach to language technology. This is a significant step toward universal accessibility, enabling a wider reach of AI-driven language technologies to include the deaf and hard-of-hearing community.
Cite as: arXiv:2412.01991 [cs.CL] (v1 submitted Mon, 2 Dec 2024 21:51:41 UTC; 17,413 KB). DOI: https://doi.org/10.48550/arXiv.2412.01991

[NLP-46] Free Process Rewards without Process Labels

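Quick read: This paper addresses the need for large amounts of intermediate-step labels when training a Process Reward Model (PRM). The key to the solution is an implicit PRM, obtained at no additional cost by simply training an Outcome Reward Model (ORM) on cheaper response-level labels. The only assumption is that the outcome reward is parameterized as the log-likelihood ratio of the policy and reference models, which can be optimized regardless of the specific loss objective. Experiments show that the implicit PRM outperforms a strong Monte Carlo Tree Search (MCTS)-based baseline on the MATH dataset, that its performance improves further as instructions and responses are scaled up, and that it is especially data-efficient when instantiated with the cross-entropy (CE) loss.

Link: https://arxiv.org/abs/2412.01981
Authors: Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng
Keywords: implicit PRM, reasoning trajectory step, fine grained rewards, PRM, scores a reasoning
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Models and data are available at: this https URL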

Abstract:Different from its counterpart outcome reward models (ORMs), which evaluate the entire responses, a process reward model (PRM) scores a reasoning trajectory step by step, providing denser and more fine grained rewards. However, training a PRM requires labels annotated at every intermediate step, presenting significant challenges for both manual and automatic data collection. This paper aims to address this challenge. Both theoretically and empirically, we show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. The only assumption is to parameterize the outcome reward as the log-likelihood ratios of the policy and reference models, which can be optimized regardless of the specific choice of loss objectives. In experiments, we instantiate our implicit PRMs with various objectives and evaluate their performance on MATH. We show that our implicit PRM outperforms a strong MCTS-based baseline à la Math-Shepherd using less than 1/38 of the training data. Its performance can be further improved with majority voting. We further find that scaling up instructions and responses benefits our implicit PRM, and the latter brings a larger gain. Particularly, we find that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction, the setup that suffers from extreme data scarcity and imbalance. Further, instructions should be relevant to downstream tasks while the diversity of responses does not bring gains. Surprisingly, training on extra Math-Shepherd step labels brings no further improvements to our implicit PRM trained on only outcome data. We hope that our work will encourage a rethinking of PRM training approaches and contribute to making training PRMs more accessible.
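
The log-likelihood-ratio parameterization can be illustrated with a short sketch. The abstract states only that the outcome reward is a log-likelihood ratio of the policy and reference models; splitting that sum at step boundaries to read off per-step rewards is an interpretation made here for illustration. The function name, the `beta` scale, and the token log-probabilities are all hypothetical.

```python
def implicit_process_rewards(policy_logps, ref_logps, step_ends, beta=1.0):
    """Split the log-likelihood-ratio reward of one response into per-step rewards.

    policy_logps, ref_logps: per-token log-probabilities of the same response
        under the trained (policy) model and the reference model.
    step_ends: exclusive token indices at which each reasoning step ends.
    """
    ratios = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    rewards, start = [], 0
    for end in step_ends:
        rewards.append(sum(ratios[start:end]))  # reward credited to this step
        start = end
    return rewards

# Made-up log-probabilities for a four-token response with two steps.
policy = [-1.0, -0.5, -2.0, -0.3]
reference = [-1.5, -0.5, -1.0, -0.8]
step_rewards = implicit_process_rewards(policy, reference, step_ends=[2, 4])
outcome_reward = sum(step_rewards)  # equals beta * log(pi(y) / pi_ref(y))
print(step_rewards, outcome_reward)
```

Because the per-step values sum to the response-level log-ratio, a model trained only on outcome labels still yields a score for every prefix, which is the sense in which the process rewards come "for free".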

[NLP-47] The use of large language models to enhance cancer clinical trial educational materials

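Quick read: This paper addresses the recruitment and engagement difficulties of cancer clinical trials, which stem largely from a lack of participant-facing educational resources. The key to the solution is using Large Language Models (LLMs), specifically GPT4, to generate patient-friendly educational content from clinical trial informed consent forms. Trial summaries are generated with zero-shot learning and multiple-choice questions with one-shot learning, and their effectiveness is evaluated through patient surveys and crowdsourced annotation. The results indicate that GPT4-generated summaries are both readable and comprehensive and may improve patients' understanding of and interest in clinical trials. Although LLMs show potential to generate clinical trial education materials "out-of-the-box", human oversight is still needed to avoid misinformation risks.

Link: https://arxiv.org/abs/2412.01955
Authors: Mingye Gao, Aman Varshney, Shan Chen, Vikram Goddla, Jack Gallifant, Patrick Doyle, Claire Novack, Maeve Dillon-Martin, Teresia Perkins, Xinrong Correia, Erik Duhaime, Howard Isenstein, Elad Sharon, Lisa Soleymani Lehmann, David Kozono, Brian Anthony, Dmitriy Dligach, Danielle S. Bitterman
Keywords: Cancer clinical trials, Large Language Models, Cancer clinical, face challenges, challenges in recruitment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: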

Abstract:Cancer clinical trials often face challenges in recruitment and engagement due to a lack of participant-facing informational and educational resources. This study investigated the potential of Large Language Models (LLMs), specifically GPT4, in generating patient-friendly educational content from clinical trial informed consent forms. Using data from this http URL, we employed zero-shot learning for creating trial summaries and one-shot learning for developing multiple-choice questions, evaluating their effectiveness through patient surveys and crowdsourced annotation. Results showed that GPT4-generated summaries were both readable and comprehensive, and may improve patients’ understanding and interest in clinical trials. The multiple-choice questions demonstrated high accuracy and agreement with crowdsourced annotators. For both resource types, hallucinations were identified that require ongoing human oversight. The findings demonstrate the potential of LLMs “out-of-the-box” to support the generation of clinical trial education materials with minimal trial-specific engineering, but implementation with a human-in-the-loop is still needed to avoid misinformation risks.

[NLP-48] Self-Improvement in Language Models: The Sharpening Mechanism

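Quick read: This paper asks how a language model can improve its own performance through self-improvement without external feedback. The key to the answer is a new perspective called sharpening: the language model itself serves as a verifier during post-training, so that the model is sharpened toward placing large mass on high-quality sequences, amortizing the expensive inference-time computation of generating good sequences. The paper formalizes this process in a new statistical framework, establishes fundamental limits, and analyzes two natural families of self-improvement algorithms, based on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

Link: https://arxiv.org/abs/2412.01951
Authors: Audrey Huang, Adam Block, Dylan J. Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy
Keywords: achieve higher performance, Recent work, language models evaluates, external feedback, modeling has raised
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: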

Abstract:Recent work in language modeling has raised the possibility of self-improvement, where a language model evaluates and refines its own generations to achieve higher performance without external feedback. It is impossible for this self-improvement to create information that is not already in the model, so why should we expect that this will lead to improved capabilities? We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training in order to “sharpen” the model to one placing large mass on high-quality sequences, thereby amortizing the expensive inference-time computation of generating good sequences. We begin by introducing a new statistical framework for sharpening in which the learner aims to sharpen a pre-trained base policy via sample access, and establish fundamental limits. Then we analyze two natural families of self-improvement algorithms based on SFT and RLHF.
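
One way to picture sharpening is to ask what distribution results from drawing several responses and keeping the one the model itself scores highest. The sketch below computes that best-of-n distribution exactly for a toy categorical model. The response probabilities and self-assigned scores are invented, and treating best-of-n as the target that fine-tuning would then amortize is an illustrative reading, not a statement of the paper's algorithms.

```python
def best_of_n_distribution(probs, rewards, n):
    """Exact distribution of the highest-reward sample among n i.i.d. draws.

    probs: base model probability of each candidate response.
    rewards: self-assigned score of each response (assumed to have no ties).
    """
    order = sorted(range(len(probs)), key=lambda i: rewards[i])
    out = [0.0] * len(probs)
    cum = 0.0  # total mass of responses with strictly lower reward
    for i in order:
        # P(max reward is response i) = P(all draws <= i) - P(all draws < i)
        out[i] = (cum + probs[i]) ** n - cum ** n
        cum += probs[i]
    return out

base = [0.5, 0.3, 0.2]      # hypothetical base policy over three responses
scores = [0.1, 0.5, 0.9]    # hypothetical self-verification scores
sharpened = best_of_n_distribution(base, scores, n=4)
print(sharpened)
```

In this toy case the best-scored response has base probability 0.2 but receives most of the mass after best-of-4 selection. A model trained to imitate the selected samples would produce such responses in a single draw, which is the amortization the abstract refers to.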

[NLP-49] Composition of Experts: A Modular Compound AI System Leveraging Large Language Models

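[Quick Read]: This paper addresses the scalability, cost, and customization challenges of large language models (LLMs). The key is Composition of Experts (CoE), a modular compound AI system built from multiple expert LLMs. A router dynamically selects the most suitable expert for each input, which enables efficient use of resources and better performance. To handle the complexity of training a CoE, the paper proposes a two-step routing approach: the router first classifies the input into a category, and a category-to-expert mapping then selects the expert. The paper presents CoE as a flexible, cost-effective solution and reports experiments showing strong performance with reduced computational overhead.

Link: https://arxiv.org/abs/2412.01868
Authors: Swayambhoo Jain, Ravi Raju, Bo Li, Zoltan Csaki, Jonathan Li, Kaizhao Liang, Guoyao Feng, Urmish Thakkar, Anand Sampat, Raghu Prabhakar, Sumati Jairath
Keywords-EN: Large Language Models, Large Language, Language Models, achieved remarkable advancements, nature presents challenges
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: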

Abstract:Large Language Models (LLMs) have achieved remarkable advancements, but their monolithic nature presents challenges in terms of scalability, cost, and customization. This paper introduces the Composition of Experts (CoE), a modular compound AI system leveraging multiple expert LLMs. CoE leverages a router to dynamically select the most appropriate expert for a given input, enabling efficient utilization of resources and improved performance. We formulate the general problem of training a CoE and discuss inherent complexities associated with it. We propose a two-step routing approach to address these complexities that first uses a router to classify the input into distinct categories followed by a category-to-expert mapping to obtain desired experts. CoE offers a flexible and cost-effective solution to build compound AI systems. Our empirical evaluation demonstrates the effectiveness of CoE in achieving superior performance with reduced computational overhead. Given that CoE comprises many expert LLMs, it has unique system requirements for cost-effective serving. We present an efficient implementation of CoE leveraging the SambaNova SN40L RDU's unique three-tiered memory architecture. CoEs obtained using open-weight LLMs Qwen/Qwen2-7B-Instruct, google/gemma-2-9b-it, google/gemma-2-27b-it, meta-llama/Llama-3.1-70B-Instruct and Qwen/Qwen2-72B-Instruct achieve a score of 59.4 with merely 31 billion average active parameters on Arena-Hard and a score of 9.06 with 54 billion average active parameters on MT-Bench.
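
The two-step routing described in the abstract (classify the input, then map the category to an expert) can be sketched as follows. This is only a minimal illustration: the keyword-based classifier stands in for a learned router, and the category names, keyword lists, and expert names are all hypothetical.

```python
# Step 2 table (hypothetical): each category maps to one expert model name.
CATEGORY_TO_EXPERT = {
    "code": "expert-code",
    "math": "expert-math",
    "general": "expert-general",
}

# Hypothetical keyword lists standing in for a trained category classifier.
KEYWORDS = {
    "code": ("python", "function", "bug"),
    "math": ("integral", "prove", "equation"),
}

def classify(prompt):
    """Step 1: assign the prompt to a category (here by counting keyword hits)."""
    words = prompt.lower().split()
    scores = {cat: sum(w in kws for w in words) for cat, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

def route(prompt):
    """Step 2: look up the expert assigned to the predicted category."""
    return CATEGORY_TO_EXPERT[classify(prompt)]
```

Separating the classifier from the mapping means the category-to-expert table can be changed, for example when a new expert is added, without retraining the classifier.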

[NLP-50] A Theoretical Framework for Acoustic Neighbor Embeddings

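[Quick Read]: This paper addresses how to interpret and apply acoustic neighbor embeddings, which represent the phonetic content of variable-width audio or text in a fixed-dimensional embedding space. The key is a probabilistic interpretation of embedding distances based on phonetic similarity, together with theoretical and empirical evidence for an approximation of uniform cluster-wise isotropy, which reduces the distances between embeddings to simple Euclidean distances. The framework is validated in several experiments that also show how it can be applied: isolated word classification with large vocabularies, out-of-vocabulary word recovery, English dialect clustering, and confusion prediction for device wake-up words.

Link: https://arxiv.org/abs/2412.02164
Authors: Woojay Jeon
Keywords-EN: interpreting acoustic neighbor, fixed-dimensional embedding space, acoustic neighbor embeddings, interpreting acoustic, acoustic neighbor
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: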

Abstract:This paper provides a theoretical framework for interpreting acoustic neighbor embeddings, which are representations of the phonetic content of variable-width audio or text in a fixed-dimensional embedding space. A probabilistic interpretation of the distances between embeddings is proposed, based on a general quantitative definition of phonetic similarity between words. This provides us with a framework for understanding and applying the embeddings in a principled manner. Theoretical and empirical evidence to support an approximation of uniform cluster-wise isotropy is shown, which allows us to reduce the distances to simple Euclidean distances. Four experiments that validate the framework and demonstrate how it can be applied to diverse problems are described. Nearest-neighbor search between audio and text embeddings can give isolated word classification accuracy that is identical to that of finite state transducers (FSTs) for vocabularies as large as 500k. Embedding distances give accuracy with 0.5% point difference compared to phone edit distances in out-of-vocabulary word recovery, as well as producing clustering hierarchies identical to those derived from human listening experiments in English dialect clustering. The theoretical framework also allows us to use the embeddings to predict the expected confusion of device wake-up words. All source code and pretrained models are provided.
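
Once distances reduce to Euclidean distances, the nearest-neighbor search between an audio embedding and text embeddings becomes a short ranking step. The sketch below assumes tiny hand-written 3-dimensional vectors; the words, values, and function name are hypothetical, and real embeddings would come from trained audio and text encoders.

```python
import math

# Hypothetical text embeddings for a three-word vocabulary.
TEXT_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "cap": [0.8, 0.3, 0.1],
    "dog": [-0.5, 0.7, 0.2],
}

def nearest_words(audio_embedding, text_embeddings, k=1):
    """Rank vocabulary words by Euclidean distance to the audio embedding."""
    ranked = sorted(
        text_embeddings,
        key=lambda word: math.dist(audio_embedding, text_embeddings[word]),
    )
    return ranked[:k]

# Hypothetical embedding of a spoken utterance.
audio = [0.88, 0.12, 0.02]
top_two = nearest_words(audio, TEXT_EMBEDDINGS, k=2)
```

In this toy vocabulary the phonetically similar words "cat" and "cap" sit close together, so they are returned ahead of "dog". A 500k-word vocabulary would call for a vectorized or approximate search rather than a Python sort.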

Computer Vision

[CV-0] Motion Prompting: Controlling Video Generation with Motion Trajectories

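[Quick Read]: This paper addresses the difficulty existing video generation models have in capturing the nuances of dynamic actions and temporal compositions, in particular the difficulty of expressing complex motion when control relies on text prompts alone. The key is to train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. This flexible conditioning representation, called motion prompts, can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion. The paper also introduces motion prompt expansion, which translates high-level user requests into detailed, semi-dense motion prompts and broadens the applications to camera and object motion control, interacting with an image, motion transfer, and image editing. Quantitative evaluation and a human study show strong performance, and the results exhibit emergent behaviors such as realistic physics.

Link: https://arxiv.org/abs/2412.02700
Authors: Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, Deqing Sun
Keywords-EN: generation models rely, compelling video content, existing video generation, Motion, temporal compositions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL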

Abstract:Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as motion prompts. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term motion prompt expansion. We demonstrate the versatility of our approach through various applications, including camera and object motion control, “interacting” with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance. Video results are available on our webpage: this https URL

[CV-1] Diffusion-based Visual Anagram as Multi-task Learning WACV2025

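[Quick Read]: This paper addresses two key failure modes in generating visual anagrams: concept segregation and concept domination. The key is to cast visual anagram generation as a multi-task learning problem in which the prompts for different views are treated as different tasks, and to design an anti-segregation optimization strategy and a noise vector balancing method that align the denoising trajectories across tasks. The paper also proposes a noise variance rectification method to address the suboptimal performance caused by directly averaging noise predictions. Together, these techniques significantly improve the quality and diversity of the generated visual anagrams.

Link: https://arxiv.org/abs/2412.02693
Authors: Zhiyuan Xu, Yinhe Chen, Huan-ang Gao, Weiyan Zhao, Guiyu Zhang, Hao Zhao
Keywords-EN: appearance upon transformation, flipping or rotation, images that change, change appearance, Visual anagrams
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025. Code is publicly available at this https URL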

Abstract:Visual anagrams are images that change appearance upon transformation, like flipping or rotation. With the advent of diffusion models, generating such optical illusions can be achieved by averaging noise across multiple views during the reverse denoising process. However, we observe two critical failure modes in this approach: (i) concept segregation, where concepts in different views are independently generated, which cannot be considered a true anagram, and (ii) concept domination, where certain concepts overpower others. In this work, we cast the visual anagram generation problem in a multi-task learning setting, where different viewpoint prompts are analogous to different tasks, and derive denoising trajectories that align well across tasks simultaneously. At the core of our designed framework are two newly introduced techniques: (i) an anti-segregation optimization strategy that promotes overlap in cross-attention maps between different concepts, and (ii) a noise vector balancing method that adaptively adjusts the influence of different tasks. Additionally, we observe that directly averaging noise predictions yields suboptimal performance because statistical properties may not be preserved, prompting us to derive a noise variance rectification method. Extensive qualitative and quantitative experiments demonstrate our method’s superior ability to generate visual anagrams spanning diverse concepts.
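
The remark that averaging noise predictions may not preserve statistical properties can be checked numerically. The sketch below assumes the per-view predictions are independent standard Gaussians, in which case the mean of k predictions has variance 1/k, and it restores unit variance by dividing by the empirical standard deviation. This rescaling is a generic stand-in chosen for illustration, not the rectification method derived in the paper, and the step of mapping each view back to a common orientation before averaging is omitted.

```python
import random
import statistics

def average_noise(predictions):
    """Element-wise mean of several noise predictions of equal length."""
    count = len(predictions)
    return [sum(values) / count for values in zip(*predictions)]

def rescale_to_unit_variance(noise):
    """Divide by the empirical standard deviation so the variance becomes 1."""
    std = statistics.pstdev(noise)
    return [x / std for x in noise]

# Two hypothetical per-view noise predictions, drawn independently from N(0, 1).
rng = random.Random(0)
views = [[rng.gauss(0.0, 1.0) for _ in range(10000)] for _ in range(2)]

averaged = average_noise(views)                  # variance close to 1/2
rectified = rescale_to_unit_variance(averaged)   # variance back to 1
```

Real per-view predictions are correlated, so the shrinkage is usually smaller than 1/k, which is one reason a fixed correction factor would not be enough in practice.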

[CV-2] Taming Scalable Visual Tokenizer for Autoregressive Image Generation

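[Quick Read]: This paper addresses the codebook instability that partial updates cause when training existing vector quantization (VQ) methods: codebook utilization drops and the distribution gap between visual features and non-activated codes widens, which limits scalability. The key is a new VQ method, Index Backpropagation Quantization (IBQ), which applies a straight-through estimator to the one-hot categorical distribution between the encoded features and the codebook so that all codebook embeddings and the visual encoder are optimized jointly. IBQ keeps every code differentiable and maintains a latent space consistent with the visual encoder, which enables scalable training of visual tokenizers and, for the first time, a large-scale (2^18), high-dimensional (256) codebook with high utilization. Experiments on the standard ImageNet benchmark demonstrate the scalability and superiority of IBQ, with competitive results on both reconstruction and autoregressive visual generation.

Link: https://arxiv.org/abs/2412.02692
Authors: Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, Limin Wang
Keywords-EN: Existing vector quantization, undergoes partial updates, Existing vector, Index Backpropagation Quantization, largely attributed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: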

Abstract:Existing vector quantization (VQ) methods struggle with scalability, largely attributed to the instability of the codebook that undergoes partial updates during training. The codebook is prone to collapse as utilization decreases, due to the progressively widening distribution gap between non-activated codes and visual features. To solve the problem, we propose Index Backpropagation Quantization (IBQ), a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. Applying a straight-through estimator on the one-hot categorical distribution between the encoded feature and codebook, all codes are differentiable and maintain a consistent latent space with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook (2^18) with high dimension (256) and high utilization. Experiments on the standard ImageNet benchmark demonstrate the scalability and superiority of IBQ, achieving competitive results on both reconstruction (1.00 rFID) and autoregressive visual generation (2.05 gFID). The code and models are available at this https URL.

[CV-3] FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

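[Quick Read]: This paper addresses the difficulty of generating realistic hand images, given the hand's complex articulation, varying viewpoints, and frequent occlusions. The key is FoundHand, a large-scale domain-specific diffusion model for synthesizing single- and dual-hand images. To train it, the paper introduces FoundHand-10M, a large-scale hand dataset annotated with 2D keypoints and segmentation masks. The core idea is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations and supports precise control through 2D keypoints as well as appearance control. The model can repose hands, transfer hand appearance, and synthesize novel views, which yields zero-shot capabilities for fixing malformed hands in previously generated images and for synthesizing hand video sequences.

Link: https://arxiv.org/abs/2412.02690
Authors: Kefan Chen, Chaerin Min, Linguang Zhang, Shreyas Hampali, Cem Keskin, Srinath Sridhar
Keywords-EN: persistent challenge due, generating realistic hands, realistic hands remains, generating realistic, frequent occlusions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: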

Abstract:Despite remarkable progress in image generation models, generating realistic hands remains a persistent challenge due to their complex articulation, varying viewpoints, and frequent occlusions. We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single and dual hand images. To train our model, we introduce FoundHand-10M, a large-scale hand dataset with 2D keypoints and segmentation mask annotations. Our insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control. Our model exhibits core capabilities that include the ability to repose hands, transfer hand appearance, and even synthesize novel views. This leads to zero-shot capabilities for fixing malformed hands in previously generated images, or synthesizing hand video sequences. We present extensive experiments and evaluations that demonstrate state-of-the-art performance of our method.

[CV-4] SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance

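[Quick Read]: This paper addresses two main problems in distilling multi-step text-to-image diffusion models into one-step models: first, existing methods such as SwiftBrushv2 become unstable during training across different diffusion model backbones because they use a fixed guidance scale; second, existing one-step diffusion models lack support for negative prompt guidance, which is crucial in practical image generation. The key is a new framework, SNOOPI, which enhances the guidance in one-step diffusion models during both training and inference. Specifically, the paper introduces Proper Guidance-SwiftBrush (PG-SB), which uses random-scale classifier-free guidance to improve training stability and broadens the teacher models' output distributions to make the VSD loss more robust. It also proposes the training-free Negative-Away Steer Attention (NASA), which integrates negative prompts into one-step diffusion models via cross-attention to suppress undesired elements in generated images. Experiments show that these methods significantly improve the baseline models and reach an HPSv2 score of 31.08, a new state of the art for one-step diffusion models.

Link: https://arxiv.org/abs/2412.02687
Authors: Viet Nguyen, Anh Aengus Nguyen, Trung Dao, Khoi Nguyen, Cuong Pham, Toan Tran, Anh Tran
Keywords-EN: Recent approaches, one-step diffusion models, diffusion models, one-step diffusion, yielded promising results
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 9 figures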

Abstract:Recent approaches have yielded promising results in distilling multi-step text-to-image diffusion models into one-step ones. The state-of-the-art efficient distillation technique, i.e., SwiftBrushv2 (SBv2), even surpasses the teacher model’s performance with limited resources. However, our study reveals its instability when handling different diffusion model backbones due to using a fixed guidance scale within the Variational Score Distillation (VSD) loss. Another weakness of the existing one-step diffusion models is the missing support for negative prompt guidance, which is crucial in practical image generation. This paper presents SNOOPI, a novel framework designed to address these limitations by enhancing the guidance in one-step diffusion models during both training and inference. First, we effectively enhance training stability through Proper Guidance-SwiftBrush (PG-SB), which employs a random-scale classifier-free guidance approach. By varying the guidance scale of both teacher models, we broaden their output distributions, resulting in a more robust VSD loss that enables SB to perform effectively across diverse backbones while maintaining competitive performance. Second, we propose a training-free method called Negative-Away Steer Attention (NASA), which integrates negative prompts into one-step diffusion models via cross-attention to suppress undesired elements in generated images. Our experimental results show that our proposed methods significantly improve baseline models across various metrics. Remarkably, we achieve an HPSv2 score of 31.08, setting a new state-of-the-art benchmark for one-step diffusion models.
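
The random-scale classifier-free guidance mentioned for PG-SB builds on the standard guidance formula, which extrapolates from the unconditional noise prediction toward the conditional one. The sketch below applies that formula with a scale drawn uniformly at random on each call. The sampling range (2.0 to 6.0) and the function names are assumptions made for illustration, not values or APIs from the paper.

```python
import random

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def random_scale_guidance(eps_uncond, eps_cond, rng, low=2.0, high=6.0):
    """Draw a guidance scale uniformly from [low, high], then apply CFG.

    The range is a hypothetical choice; a fixed-scale method would always
    pass the same constant instead.
    """
    scale = rng.uniform(low, high)
    return scale, classifier_free_guidance(eps_uncond, eps_cond, scale)

# Hypothetical two-element noise predictions from a teacher model.
rng = random.Random(0)
scale, guided = random_scale_guidance([0.0, 1.0], [1.0, 3.0], rng)
```

A scale of 1 reproduces the conditional prediction and a scale of 0 the unconditional one, so varying the scale sweeps through a family of teacher outputs rather than a single one.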

[CV-5] AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction

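[Quick Read]: This paper addresses the insufficient detail, viewpoint inconsistency, and computational inefficiency that arise when generating animatable human avatars from a single image. The key is to use generative models to produce multi-view canonical pose images and normal maps, which resolve ambiguities in animatable human reconstruction. Specifically, the paper adapts a transformer-based video generation model to generate these images and pretrains it on a large-scale video dataset to improve generalization. It also recasts the reconstruction problem as a 4D task and introduces an efficient 3D modeling approach based on 4D Gaussian Splatting to handle view inconsistencies, which enables real-time rendering.

Link: https://arxiv.org/abs/2412.02684
Authors: Lingteng Qiu, Shenhao Zhu, Qi Zuo, Xiaodong Gu, Yuan Dong, Junfei Zhang, Chao Xu, Zhe Li, Weihao Yuan, Liefeng Bo, Guanying Chen, Zilong Dong
Keywords-EN: Generating animatable human, Generating animatable, human modeling applications, Generating, digital human modeling
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL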

Abstract:Generating animatable human avatars from a single image is essential for various digital human modeling applications. Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though avoiding explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and computational inefficiencies. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. We then propose a robust method for 3D reconstruction of inconsistent images, enabling real-time rendering during inference. Specifically, we adapt a transformer-based video generation model to generate multi-view canonical pose images and normal maps, pretraining on a large-scale video dataset to improve generalization. To handle view inconsistencies, we recast the reconstruction problem as a 4D task and introduce an efficient 3D modeling approach using 4D Gaussian Splatting. Experiments demonstrate that our method achieves photorealistic, real-time animation of 3D human avatars from in-the-wild images, showcasing its effectiveness and generalization capability.

[CV-6] Planning-Guided Diffusion Policy Learning for Generalizable Contact-Rich Bimanual Manipulation

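[Quick Read]: This paper addresses data acquisition and policy generalization for complex, contact-rich bimanual manipulation tasks. The key is Generalizable Planning-Guided Diffusion Policy Learning (GLIDE), which uses model-based motion planners to generate demonstration data in high-fidelity physics simulation and thereby efficiently produces large-scale, high-quality synthetic motion trajectories. A task-conditioned diffusion policy is then trained via behavior cloning. To close the sim-to-real gap, the paper proposes a set of design options in feature extraction, task representation, action prediction, and data augmentation so that the policy learns smooth action sequences and generalizes to unseen scenarios. Experiments show that the approach enables a bimanual robotic system to effectively manipulate objects of diverse geometries, dimensions, and physical properties.

Link: https://arxiv.org/abs/2412.02676
Authors: Xuanlin Li, Tong Zhao, Xinghao Zhu, Jiuguang Wang, Tao Pang, Kuan Fang
Keywords-EN: involves precise coordination, manipulation involves precise, strategically selected contacts, Contact-rich bimanual manipulation, change object states
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: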

Abstract:Contact-rich bimanual manipulation involves precise coordination of two arms to change object states through strategically selected contacts and motions. Due to the inherent complexity of these tasks, acquiring sufficient demonstration data and training policies that generalize to unseen scenarios remain a largely unresolved challenge. Building on recent advances in planning through contacts, we introduce Generalizable Planning-Guided Diffusion Policy Learning (GLIDE), an approach that effectively learns to solve contact-rich bimanual manipulation tasks by leveraging model-based motion planners to generate demonstration data in high-fidelity physics simulation. Through efficient planning in randomized environments, our approach generates large-scale and high-quality synthetic motion trajectories for tasks involving diverse objects and transformations. We then train a task-conditioned diffusion policy via behavior cloning using these demonstrations. To tackle the sim-to-real gap, we propose a set of essential design options in feature extraction, task representation, action prediction, and data augmentation that enable learning robust prediction of smooth action sequences and generalization to unseen scenarios. Through experiments in both simulation and the real world, we demonstrate that our approach can enable a bimanual robotic system to effectively manipulate objects of diverse geometries, dimensions, and physical properties. Website: this https URL

[CV-7] A Bidirectional Long Short Term Memory Approach for Infrastructure Health Monitoring Using On-board Vibration Response

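[Quick Read]: This paper addresses the estimation of infrastructure physical parameters, such as railway track stiffness, from vibration response signals measured directly as a vehicle drives by. The key is a deep learning approach that combines a Long Short-term Memory (LSTM) feature extractor with a bidirectional Long Short-term Memory (BiLSTM) network to capture temporal dependencies in the vibration signal. In addition, segmenting the vibration signal into frames equal in length to the beam spacing and centering them over the beam nodes raises the resolution of the monitoring task. The method estimates railway track stiffness accurately and automatically and identifies local stiffness reductions in the presence of noise.

Link: https://arxiv.org/abs/2412.02643
Authors: R. R. Samani, A. Nunez, B. De Schutter
Keywords-EN: Long Short-term Memory, powerful datadriven approaches, drive-by vibration response, infrastructural monitoring data, monitoring data enables
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages; Accepted for the presentation at Transportation Research Board (TRB) Annual Meeting, and under review in the Journal of Transportation Research Record (TRR)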

Abstract:The growing volume of available infrastructural monitoring data enables the development of powerful data-driven approaches to estimate infrastructure health conditions using direct measurements. This paper proposes a deep learning methodology to estimate infrastructure physical parameters, such as railway track stiffness, using drive-by vibration response signals. The proposed method employs a Long Short-term Memory (LSTM) feature extractor accounting for temporal dependencies in the feature extraction phase, and a bidirectional Long Short-term Memory (BiLSTM) network to leverage bidirectional temporal dependencies in both the forward and backward paths of the drive-by vibration response in the condition estimation phase. Additionally, a framing approach is employed to enhance the resolution of the monitoring task to the beam level by segmenting the vibration signal into frames equal to the distance between individual beams, centering the frames over the beam nodes. The proposed LSTM-BiLSTM model offers a versatile tool for monitoring the condition of various bridge and railway infrastructure using direct drive-by vibration response measurements. The results demonstrate the potential of incorporating temporal analysis in the feature extraction phase and emphasize the pivotal role of bidirectional temporal information in infrastructure health condition estimation. The proposed methodology can accurately and automatically estimate railway track stiffness and identify local stiffness reductions in the presence of noise using drive-by measurements. An illustrative case study of vehicle-track interaction simulation is used to demonstrate the performance of the proposed model, achieving a maximum mean absolute percentage error of 1.7% and 0.7% in estimating railpad and ballast stiffness, respectively.
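
The framing step can be sketched as a simple slicing routine. The sketch assumes the vibration signal is sampled uniformly along the track (for example at constant vehicle speed), so that one beam spacing corresponds to a fixed number of samples. The function name and parameters are hypothetical, and `samples_per_beam` must be a positive integer.

```python
def frame_signal(signal, samples_per_beam, first_beam_index):
    """Cut the signal into frames one beam spacing long, centered on beam nodes.

    signal: list of vibration samples, uniformly spaced along the track.
    samples_per_beam: number of samples covering one beam spacing (must be > 0).
    first_beam_index: sample index located over the first beam node.
    """
    half = samples_per_beam // 2
    frames = []
    center = first_beam_index
    while True:
        start = center - half
        end = start + samples_per_beam
        if end > len(signal):
            break  # the next frame would run past the end of the recording
        if start >= 0:
            frames.append(signal[start:end])
        center += samples_per_beam  # move to the next beam node
    return frames

# Toy signal of 20 samples with a beam node every 4 samples, starting at index 2.
frames = frame_signal(list(range(20)), samples_per_beam=4, first_beam_index=2)
```

Each frame then yields one stiffness estimate, which is what gives the monitoring task its beam-level resolution.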

[CV-8] Robust soybean seed yield estimation using high-throughput ground robot videos

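[Quick Read]: This paper addresses the drawbacks of traditional soybean yield data collection, which is labor-intensive, costly, prone to equipment failure, and requires transporting equipment between sites. The key is to estimate soybean yield through high-throughput seed counting using computer vision and deep learning. Specifically, the paper uses a ground robot equipped with fisheye cameras to capture comprehensive videos of soybean plots, from which images are extracted. The images are processed by the P2PNet-Yield model, which combines a Feature Extraction Module (the backbone of P2PNet-Soy) with a Yield Regression Module to estimate the seed yield of each plot. Using three years of plot data (8500 plots in 2021, 2275 in 2022, and 650 in 2023), the approach incorporates innovations such as fisheye image correction and data augmentation with random sensor effects to improve the accuracy and generalizability of seed counting and yield estimation. P2PNet-Yield reaches a genotype ranking accuracy of up to 83% and reduces the time and cost of traditional yield data collection by up to 32%.

Link: https://arxiv.org/abs/2412.02642
Authors: Jiale Feng, Samuel W. Blair, Timilehin Ayanlade, Aditya Balu, Baskar Ganapathysubramanian, Arti Singh, Soumik Sarkar, Asheesh K Singh
Keywords-EN: leveraging high throughput, Glycine max, high throughput seed, estimation leveraging high, leveraging high
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 12 figures, 2 tables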

Abstract:We present a novel method for soybean (Glycine max (L.) Merr.) yield estimation leveraging high throughput seed counting via computer vision and deep learning techniques. Traditional methods for collecting yield data are labor-intensive, costly, prone to equipment failures at critical data collection times, and require transportation of equipment across field sites. Computer vision, the field of teaching computers to interpret visual data, allows us to extract detailed yield information directly from images. By treating it as a computer vision task, we report a more efficient alternative, employing a ground robot equipped with fisheye cameras to capture comprehensive videos of soybean plots from which images are extracted in a variety of development programs. These images are processed through the P2PNet-Yield model, a deep learning framework where we combined a Feature Extraction Module (the backbone of the P2PNet-Soy) and a Yield Regression Module to estimate seed yields of soybean plots. Our results are built on three years of yield testing plot data - 8500 in 2021, 2275 in 2022, and 650 in 2023. With these datasets, our approach incorporates several innovations to further improve the accuracy and generalizability of the seed counting and yield estimation architecture, such as the fisheye image correction and data augmentation with random sensor effects. The P2PNet-Yield model achieved a genotype ranking accuracy score of up to 83%. It demonstrates up to a 32% reduction in time to collect yield data as well as costs associated with traditional yield estimation, offering a scalable solution for breeding programs and agricultural productivity enhancement.

[CV-9] MetaShadow: Object-Centered Shadow Detection Removal and Synthesis

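[Quick Read]: This paper addresses the insufficient handling of shadows in image editing applications, which limits the realism of edited results. The key is MetaShadow, a framework that integrates shadow detection, removal, and controllable synthesis and operates in an object-centered fashion. It consists of two cooperative components: Shadow Analyzer, for object-centered shadow detection and removal, and Shadow Synthesizer, for reference-based controllable shadow synthesis. The learning of Shadow Analyzer's intermediate features is optimized to guide Shadow Synthesizer toward more realistic shadows that blend seamlessly with the scene. Experiments on multiple shadow benchmark datasets show that MetaShadow significantly outperforms existing state-of-the-art methods, and it performs well in image editing tasks such as object removal, relocation, and insertion.

Link: https://arxiv.org/abs/2412.02635
Authors: Tianyu Wang, Jianming Zhang, Haitian Zheng, Zhihong Ding, Scott Cohen, Zhe Lin, Wei Xiong, Chi-Wing Fu, Luis Figueroa, Soo Ye Kim
Keywords-EN: image editing applications, Shadow, object-centered shadow detection, Shadow Analyzer, Shadow Synthesizer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: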

Abstract:Shadows are often under-considered or even ignored in image editing applications, limiting the realism of the edited results. In this paper, we introduce MetaShadow, a three-in-one versatile framework that enables detection, removal, and controllable synthesis of shadows in natural images in an object-centered fashion. MetaShadow combines the strengths of two cooperative components: Shadow Analyzer, for object-centered shadow detection and removal, and Shadow Synthesizer, for reference-based controllable shadow synthesis. Notably, we optimize the learning of the intermediate features from Shadow Analyzer to guide Shadow Synthesizer to generate more realistic shadows that blend seamlessly with the scene. Extensive evaluations on multiple shadow benchmark datasets show significant improvements of MetaShadow over the existing state-of-the-art methods on object-centered shadow detection, removal, and synthesis. MetaShadow excels in image-editing tasks such as object removal, relocation, and insertion, pushing the boundaries of object-centered image editing.

[CV-10] Scaling Image Tokenizers with Grouped Spherical Quantization

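[Quick Read]: This paper addresses shortcomings in prior work on the scalability and compactness of vision tokenizers, in particular its reliance on old-school GAN-based hyperparameters, biased comparisons, and the lack of a comprehensive analysis of scaling behavior. The key is a new method, Grouped Spherical Quantization (GSQ), which uses spherical codebook initialization and lookup regularization to constrain codebook latents to a spherical surface. GSQ-GAN achieves higher reconstruction quality than existing state-of-the-art methods with fewer training iterations, which provides a solid foundation for scaling studies. The paper also systematically studies how GSQ scales with latent dimensionality, codebook size, and compression ratio and how these affect model performance, revealing distinct behaviors at high and low spatial compression levels and showing how GSQ restructures high-dimensional latents into compact, low-dimensional spaces for efficient scaling with improved quality. GSQ-GAN achieves a reconstruction FID (rFID) of 0.50 at 16x down-sampling.

Link: https://arxiv.org/abs/2412.02632
Authors: Jiangtao Wang, Zhen Qin, Yifan Zhang, Vincent Tao Hu, Björn Ommer, Rania Briq, Stefan Kesselheim
Keywords-EN: previous works depend, old-school GAN-based hyperparameters, Grouped Spherical Quantization, Vision tokenizers, biased comparisons
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: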

Abstract:Vision tokenizers have gained a lot of traction due to their scalability and compactness; previous works depend on old-school GAN-based hyperparameters, biased comparisons, and a lack of comprehensive analysis of the scaling behaviours. To tackle those issues, we introduce Grouped Spherical Quantization (GSQ), featuring spherical codebook initialization and lookup regularization to constrain codebook latents to a spherical surface. Our empirical analysis of image tokenizer training strategies demonstrates that GSQ-GAN achieves superior reconstruction quality over state-of-the-art methods with fewer training iterations, providing a solid foundation for scaling studies. Building on this, we systematically examine the scaling behaviours of GSQ, specifically in latent dimensionality, codebook size, and compression ratios, and their impact on model performance. Our findings reveal distinct behaviours at high and low spatial compression levels, underscoring challenges in representing high-dimensional latent spaces. We show that GSQ can restructure high-dimensional latents into compact, low-dimensional spaces, thus enabling efficient scaling with improved quality. As a result, GSQ-GAN achieves a 16x down-sampling with a reconstruction FID (rFID) of 0.50.
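
A rough sketch of what a grouped, spherical codebook lookup might look like is given below. It assumes that the latent vector is split into equal-sized groups, that each group is L2-normalized onto the unit sphere, and that all groups share one codebook of unit-norm codes. These choices, along with the function names and the toy codebook, are assumptions for illustration and may differ from the paper's actual design. Groups are assumed to be non-zero so that normalization is defined.

```python
import math

def normalize(vector):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def quantize_group(group, codebook):
    """Return the index of the closest unit-norm code.

    On the unit sphere, the smallest Euclidean distance corresponds to the
    largest dot product, so a dot-product argmax is sufficient.
    """
    unit = normalize(group)
    return max(
        range(len(codebook)),
        key=lambda i: sum(a * b for a, b in zip(unit, codebook[i])),
    )

def grouped_spherical_quantize(latent, codebook, num_groups):
    """Split the latent into equal groups and quantize each one independently."""
    size = len(latent) // num_groups
    return [
        quantize_group(latent[g * size:(g + 1) * size], codebook)
        for g in range(num_groups)
    ]

# Hypothetical 2-dimensional codebook with four unit-norm codes.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
indices = grouped_spherical_quantize([3.0, 0.1, -0.2, -5.0], codebook, num_groups=2)
```

Splitting a 4-dimensional latent into two 2-dimensional groups lets each lookup happen in a low-dimensional space, which is the intuition behind restructuring high-dimensional latents into compact ones.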

[CV-11] Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation

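[Quick Read]: This paper addresses the geometric artifacts and limited controllability that arise when 3D models are reconstructed from multi-view images. The key is Sharp-It, a multi-view to multi-view diffusion model that takes multi-view images rendered from a low-quality 3D object and enriches their geometric details and texture, so that a high-quality 3D model can be reconstructed. Sharp-It operates on the multi-view set in parallel and shares features across views. By combining the advantages of 2D and 3D approaches, it offers an efficient and controllable method for high-quality 3D content creation.

Link: https://arxiv.org/abs/2412.02631
Authors: Yiftach Edelstein, Or Patashnik, Dana Cohen-Bar, Lihi Zelnik-Manor
Keywords-EN: led to significant, significant progress, multi-view, Advancements, multi-view images
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page at this https URL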

Abstract:Advancements in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object, and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is hence prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in their resolution, resulting in lower quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and ones that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our method offers an efficient and controllable method for high-quality 3D content creation. We demonstrate that Sharp-It enables various 3D applications, such as fast synthesis, editing, and controlled generation, while attaining high-quality assets.

[CV-12] Continual Learning of Personalized Generative Face Models with Experience Replay WACV2025

[Quick Read]: This paper addresses catastrophic forgetting in the continual learning of personalized generative face models as new batches of photos arrive. The key is a new experience replay algorithm that combines random sampling with StyleGAN's latent space to represent the buffer as an optimal convex hull. This convex hull-based experience replay prevents forgetting more effectively than a random sampling baseline and the lower bound.

Link: https://arxiv.org/abs/2412.02627
Authors: Annie N. Wang, Luchao Qi, Roni Sengupta
Keywords-EN: continual learning problem, generative face model, learning problem, captured regularly, continual learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2025. Project page (incl. supplementary materials): this https URL

Abstract:We introduce a novel continual learning problem: how to sequentially update the weights of a personalized 2D and 3D generative face model as new batches of photos in different appearances, styles, poses, and lighting are captured regularly. We observe that naive sequential fine-tuning of the model leads to catastrophic forgetting of past representations of the individual’s face. We then demonstrate that a simple random sampling-based experience replay method is effective at mitigating catastrophic forgetting when a relatively large number of images can be stored and replayed. However, for long-term deployment of these models with relatively smaller storage, this simple random sampling-based replay technique also forgets past representations. Thus, we introduce a novel experience replay algorithm that combines random sampling with StyleGAN’s latent space to represent the buffer as an optimal convex hull. We observe that our proposed convex hull-based experience replay is more effective in preventing forgetting than a random sampling baseline and the lower bound.

[CV-13] Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

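[Quick read]: This paper addresses the unrealistic dynamic object interactions produced by large text-to-video models, in particular inaccurate object motion and violations of physics. The key to the solution is to use external feedback so that the model can improve itself, raising both text-video alignment and the realism of object interactions. Specifically, the paper derives a probabilistic objective for offline reinforcement learning fine-tuning and uses fine-grained feedback from vision-language models to optimize object dynamics in video. Experiments show that reward signals derived from AI feedback yield substantial gains in video quality, particularly for scenes with complex multi-object interactions and realistic object motion.

Link: https://arxiv.org/abs/2412.02617
Authors: Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang
Keywords: hold immense potential, models hold immense, downstream applications, hold immense, immense potential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL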

Abstract:Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms like KL regularization and policy projection emerge as specific choices within a unified framework. We then use derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but notice that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
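
For orientation, KL regularization in RL fine-tuning is commonly written as the objective below. This is the standard textbook form, not the unified objective derived in the paper, and the notation is assumed here: $\pi_\theta$ is the fine-tuned model, $\pi_{\mathrm{ref}}$ the pretrained model, $c$ a text prompt drawn from a dataset $\mathcal{D}$, $x$ a generated video, $r$ a reward, and $\beta > 0$ the regularization weight.

```latex
\max_{\theta}\;
\mathbb{E}_{c \sim \mathcal{D},\; x \sim \pi_\theta(\cdot \mid c)}
  \big[\, r(x, c) \,\big]
\;-\;
\beta\, \mathbb{E}_{c \sim \mathcal{D}}
  \Big[ D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid c) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid c) \big) \Big]
```

A larger $\beta$ keeps the fine-tuned model closer to the pretrained one, while a smaller $\beta$ lets the reward dominate.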

[CV-14] MERGE: Multi-faceted Hierarchical Graph-based GNN for Gene Expression Prediction from Whole Slide Histopathology Images

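[Quick read]: This paper addresses a limitation of existing Spatial Transcriptomics (ST) methods that predict gene expression from tissue image patches: they do not fully account for interactions between different tissue locations. The key to the solution is MERGE (Multi-faceted hiErarchical gRaph for Gene Expressions), which combines a multi-faceted hierarchical graph construction strategy with graph neural networks (GNNs) to improve prediction accuracy. MERGE clusters tissue image patches by both spatial and morphological features and adds intra- and inter-cluster edges, so that distant tissue locations can interact during GNN learning. The paper also advocates gene-aware data smoothing, which it argues is more biologically justified, to reduce artifacts caused by technical imperfections in ST data. Experiments show that MERGE outperforms state-of-the-art methods on gene expression prediction.

Link: https://arxiv.org/abs/2412.02601
Authors: Aniruddha Ganguly, Debolina Chatterjee, Wentao Huang, Jie Zhang, Alisa Yurovsky, Travis Steele Johnson, Chao Chen
Keywords: Recent advances, pair histology images, spatially resolved gene, gene expression, Spatial Transcriptomics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Main Paper: 8 pages, Supplementary Material: 9 pages, Figures: 16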

Abstract:Recent advances in Spatial Transcriptomics (ST) pair histology images with spatially resolved gene expression profiles, enabling predictions of gene expression across different tissue locations based on image patches. This opens up new possibilities for enhancing whole slide image (WSI) prediction tasks with localized gene expression. However, existing methods fail to fully leverage the interactions between different tissue locations, which are crucial for accurate joint prediction. To address this, we introduce MERGE (Multi-faceted hiErarchical gRaph for Gene Expressions), which combines a multi-faceted hierarchical graph construction strategy with graph neural networks (GNN) to improve gene expression predictions from WSIs. By clustering tissue image patches based on both spatial and morphological features, and incorporating intra- and inter-cluster edges, our approach fosters interactions between distant tissue locations during GNN learning. As an additional contribution, we evaluate different data smoothing techniques that are necessary to mitigate artifacts in ST data, often caused by technical imperfections. We advocate for adopting gene-aware smoothing methods that are more biologically justified. Experimental results on gene expression prediction show that our GNN method outperforms state-of-the-art techniques across multiple metrics.
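
To make the idea of intra- and inter-cluster edges concrete, here is a minimal sketch of building both edge sets from a patch-to-cluster assignment. It is not MERGE's construction: the function name `build_edges`, the fully connected intra-cluster wiring, and the choice of the lowest patch id as each cluster's representative are all assumptions made for illustration.

```python
from itertools import combinations


def build_edges(cluster_of):
    """cluster_of maps patch id -> cluster id; returns (intra, inter) edge sets."""
    members = {}
    for patch, cluster in sorted(cluster_of.items()):
        members.setdefault(cluster, []).append(patch)

    # Intra-cluster edges: connect every pair of patches in the same cluster.
    intra = {pair for group in members.values() for pair in combinations(group, 2)}

    # Inter-cluster edges: link one representative per cluster (here the
    # lowest patch id, an arbitrary choice) to the other representatives.
    representatives = sorted(group[0] for group in members.values())
    inter = set(combinations(representatives, 2))
    return intra, inter


intra, inter = build_edges({0: "a", 1: "a", 2: "b", 3: "b", 4: "b"})
```

The inter-cluster edges are what give distant patches a short path to one another during message passing.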

[CV-15] Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes

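[Quick read]: This paper addresses how to analyze the difficulty of classification datasets, in particular how to quantify difficulty efficiently at the sample, class, and dataset levels. The key to the solution is a new analysis framework based on reconstruction error ratios (RERs), computed between autoencoders trained on individual classes. RERs probe classification difficulty and allow it to be decomposed into (1) finite sample size and (2) Bayes error and decision-boundary complexity. The results show that the RER-based difficulty probe correlates strongly with the error rates of state-of-the-art classifiers, and that RERs achieve state-of-the-art performance on label-mistake detection on hard datasets.

Link: https://arxiv.org/abs/2412.02596
Authors: Jacob Marks, Brent A. Griffin, Jason J. Corso
Keywords: individual classes, autoencoders trained, trained on individual, analyzing classification datasets, classification datasets based
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 18 figures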

Abstract:We introduce a new framework for analyzing classification datasets based on the ratios of reconstruction errors between autoencoders trained on individual classes. This analysis framework enables efficient characterization of datasets on the sample, class, and entire dataset levels. We define reconstruction error ratios (RERs) that probe classification difficulty and allow its decomposition into (1) finite sample size and (2) Bayes error and decision-boundary complexity. Through systematic study across 19 popular visual datasets, we find that our RER-based dataset difficulty probe strongly correlates with error rate for state-of-the-art (SOTA) classification models. By interpreting sample-level classification difficulty as a label mistakenness score, we further find that RERs achieve SOTA performance on mislabel detection tasks on hard datasets under symmetric and asymmetric label noise. Our code is publicly available at this https URL.
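
One plausible reading of a reconstruction error ratio is sketched below: for a single sample, divide the error of the autoencoder trained on its labeled class by the smallest error among the other classes' autoencoders. The exact definition in the paper may differ; the function name and the input format (a dict of per-class errors) are assumptions for illustration.

```python
def reconstruction_error_ratio(errors, label):
    """errors maps class name -> reconstruction error for one sample."""
    own = errors[label]
    # Best reconstruction achieved by an autoencoder of any other class.
    best_other = min(err for cls, err in errors.items() if cls != label)
    return own / best_other


# The labeled class reconstructs the sample best: ratio below 1.
easy = reconstruction_error_ratio({"cat": 0.1, "dog": 0.4}, "cat")
# Another class reconstructs it better: ratio above 1, a possible label mistake.
suspect = reconstruction_error_ratio({"cat": 0.5, "dog": 0.2}, "cat")
```

Under this reading, ranking samples by the ratio gives a simple mislabel score, in the spirit of the sample-level difficulty described in the abstract.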

[CV-16] OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

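[Quick read]: This paper addresses the noise that Optical Character Recognition (OCR) introduces when building knowledge bases for Retrieval-Augmented Generation (RAG) systems that augment large language models (LLMs). The key contribution is OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench contains 350 unstructured PDF documents from six real-world RAG application domains, together with question-answer pairs derived from multimodal elements in the documents, which challenge existing OCR solutions. The paper identifies two main types of OCR noise, Semantic Noise and Formatting Noise, and applies perturbations to generate structured data with varying degrees of each. Using OHRBench, it evaluates current OCR solutions and finds none adequate for building high-quality RAG knowledge bases, systematically measures the impact of the two noise types, and demonstrates the vulnerability of RAG systems. It also discusses the potential of using Vision-Language Models (VLMs) in RAG systems without OCR.

Link: https://arxiv.org/abs/2412.02592
Authors: Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, Wentao Zhang
Keywords: enhances Large Language, Large Language Models, Retrieval-augmented Generation, Large Language, Optical Character Recognition
Categories: Computer Vision and Pattern Recognition (cs.CV)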

Abstract:Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various kinds of OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with QAs derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR’s impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbation to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: this https URL
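
A common way to put a number on how noisy OCR output is, relative to a ground-truth transcription, is a normalized edit distance. The sketch below is offered as a generic illustration, not as the metric OHRBench uses; `edit_distance` and `noise_level` are hypothetical helper names.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def noise_level(ocr_text, reference):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(ocr_text), len(reference))
    return edit_distance(ocr_text, reference) / longest if longest else 0.0


score = noise_level("Retrieva1-augmented", "Retrieval-augmented")
```

A character-level score of this kind reacts to both misrecognized characters and stray formatting symbols, so separating the two noise types would need additional handling.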

[CV-17] MedTet: An Online Motion Model for 4D Heart Reconstruction

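[Quick read]: This paper addresses the difficulty of reconstructing 3D cardiac motion during surgical interventions, where real-time data is usually limited to sparse 2D frames or 1D signals. The key to the solution is a versatile framework that discretizes 3D space into a deformable tetrahedral grid with signed distance values, providing implicit unlimited resolution while keeping explicit control over motion dynamics. Using a universal observation encoder, the system can reconstruct coherent 3D cardiac motion from full 3D volumes, a few 2D MRI slices, or even 1D signals. Experiments show that the method produces plausible and anatomically consistent 3D motion reconstructions from various sparse real-time observations, highlighting its potential for multimodal cardiac imaging.

Link: https://arxiv.org/abs/2412.02589
Authors: Yihong Chen, Jiancheng Yang, Deniz Sayin Mercadier, Hieu Le, Pascal Fua
Keywords: sparse intraoperative data, intraoperative data, data, limited observed data, motion
Categories: Computer Vision and Pattern Recognition (cs.CV)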

Abstract:We present a novel approach to reconstruction of 3D cardiac motion from sparse intraoperative data. While existing methods can accurately reconstruct 3D organ geometries from full 3D volumetric imaging, they cannot be used during surgical interventions, where usually only limited observed data, such as a few 2D frames or 1D signals, is available in real time. We propose a versatile framework for reconstructing 3D motion from such partial data. It discretizes the 3D space into a deformable tetrahedral grid with signed distance values, providing implicit unlimited resolution while maintaining explicit control over motion dynamics. Given an initial 3D model reconstructed from pre-operative full volumetric data, our system, equipped with a universal observation encoder, can reconstruct coherent 3D cardiac motion from full 3D volumes, a few 2D MRI slices or even 1D signals. Extensive experiments on cardiac intervention scenarios demonstrate our ability to generate plausible and anatomically consistent 3D motion reconstructions from various sparse real-time observations, highlighting its potential for multimodal cardiac imaging. Our code and model will be made available at this https URL.
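
In a tetrahedral grid that stores a signed distance at each vertex, the surface crosses any edge whose two endpoints have opposite signs, and the crossing point is usually placed by linear interpolation, as in marching-tetrahedra-style extraction. The sketch below shows that single step; it is a generic illustration rather than MedTet's implementation, and `edge_crossing` is a hypothetical name.

```python
def edge_crossing(p0, p1, s0, s1):
    """Point on segment p0-p1 where the interpolated signed distance is zero.

    p0, p1 are 3D points; s0, s1 are their signed distance values.
    Returns None when the edge does not strictly cross the surface.
    """
    if s0 * s1 >= 0:
        # Same sign, or a vertex exactly on the surface (ignored in this sketch).
        return None
    t = s0 / (s0 - s1)  # fraction of the way from p0 to p1
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))


midpoint = edge_crossing((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), -1.0, 1.0)
quarter = edge_crossing((0.0, 0.0, 0.0), (0.0, 2.0, 0.0), -1.0, 3.0)
outside = edge_crossing((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.5, 2.0)
```

Because the crossing moves continuously as the vertex positions and signed distances change, the extracted surface can sit anywhere inside a cell, which is one way to read the abstract's "implicit unlimited resolution".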

[CV-18] Copy-Move Forgery Detection and Question Answering for Remote Sensing Image

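[Quick read]: This paper addresses the interpretation of complex tampering scenarios and the inference of relationships between objects in the Remote Sensing Copy-Move Question Answering (RSCMQA) task. The key to the solution is a global remote sensing dataset named RS-CMQA-2.1M, with images from 29 regions across 14 countries, together with a refined balanced dataset, RS-CMQA-B, that targets the long-standing long-tail data problem in remote sensing. The paper also proposes a region-discriminative guided multimodal CMQA model that improves question answering on tampered images by using prompts about the differences and connections between the source and tampered domains. Experiments show that the method provides a stronger benchmark for RSCMQA than general VQA and RSVQA models.

Link: https://arxiv.org/abs/2412.02575
Authors: Ze Zhang, Enyuan Zhao, Ziyi Wan, Jie Nie, Xinyue Liang, Lei Huang
Keywords: Copy-Move Question Answering, Remote Sensing Visual, Sensing Visual Question, Remote Sensing, Visual Question Answering
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 7 figs, 7 tables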

Abstract:This paper introduces the task of Remote Sensing Copy-Move Question Answering (RSCMQA). Unlike traditional Remote Sensing Visual Question Answering (RSVQA), RSCMQA focuses on interpreting complex tampering scenarios and inferring relationships between objects. Based on the practical needs of national defense security and land resource monitoring, we have developed an accurate and comprehensive global dataset for remote sensing image copy-move question answering, named RS-CMQA-2.1M. These images were collected from 29 different regions across 14 countries. Additionally, we have refined a balanced dataset, RS-CMQA-B, to address the long-standing issue of long-tail data in the remote sensing field. Furthermore, we propose a region-discriminative guided multimodal CMQA model, which enhances the accuracy of answering questions about tampered images by leveraging prompts about the differences and connections between the source and tampered domains. Extensive experiments demonstrate that our method provides a stronger benchmark for RS-CMQA compared to general VQA and RSVQA models. Our dataset and code are available at this https URL.

[CV-19] Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey

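[Quick read]: This paper addresses a limitation of change detection in traditional remote sensing temporal image analysis: it relies mainly on visual-level interpretation and lacks contextual and descriptive information. The key to the solution is to bring in Vision-Language Models (VLMs), specifically Remote Sensing Temporal VLMs (RSTVLMs), which combine visual information with natural language for higher-level interpretation of changes across temporal images. RSTVLMs can generate descriptive captions, answer questions, and provide richer semantic understanding, offering higher-level insight for complex remote sensing applications. The paper reviews progress in RSTVLM research, categorizes and discusses core methodologies, datasets, and metrics, highlights recent advances in temporal vision-language tasks, and outlines key challenges and future research directions.

Link: https://arxiv.org/abs/2412.02573
Authors: Chenyang Liu, Jiafan Zhang, Keyan Chen, Man Wang, Zhengxia Zou, Zhenwei Shi
Keywords: remote sensing temporal, sensing temporal image, Temporal image analysis, remote sensing, Temporal image
Categories: Computer Vision and Pattern Recognition (cs.CV)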

Abstract:Temporal image analysis in remote sensing has traditionally centered on change detection, which identifies regions of change between images captured at different times. However, change detection remains limited by its focus on visual-level interpretation, often lacking contextual or descriptive information. The rise of Vision-Language Models (VLMs) has introduced a new dimension to remote sensing temporal image analysis by integrating visual information with natural language, creating an avenue for advanced interpretation of temporal image changes. Remote Sensing Temporal VLMs (RSTVLMs) allow for dynamic interactions, generating descriptive captions, answering questions, and providing a richer semantic understanding of temporal images. This temporal vision-language capability is particularly valuable for complex remote sensing applications, where higher-level insights are crucial. This paper comprehensively reviews the progress of RSTVLM research, with a focus on the latest VLM applications for temporal image analysis. We categorize and discuss core methodologies, datasets, and metrics, highlight recent advances in temporal vision-language tasks, and outline key challenges and future directions for research in this emerging field. This survey fills a critical gap in the literature by providing an integrated overview of RSTVLM, offering a foundation for further advancements in remote sensing temporal image understanding. We will keep tracing related works at this https URL

[CV-20] SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection

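[Quick read]: This paper addresses the fundamental challenge of implementing image segmentation within multimodal architectures, in particular their limits in fine-grained spatial localization and operational capabilities. The key to the solution is the SJTU framework (Spatial Judgments in multimodal models, Towards Unified segmentation through coordinate detection). The framework uses normalized coordinate detection to produce bounding boxes and translates them into actionable segmentation outputs, exploring the integration of multimodal spatial and language representations. The approach shows superior performance on several benchmark datasets and produces accurate object segmentation.

Link: https://arxiv.org/abs/2412.02565
Authors: Joongwon Chae, Zhenyu Wang, Peiwu Qin
Keywords: artificial intelligence systems, modern artificial intelligence, implementing image segmentation, implementing image, intelligence systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 3 figures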

Abstract:Despite advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, demonstrate inherent limitations in fine-grained spatial localization and operational capabilities. This paper introduces SJTU: Spatial Judgments in multimodal models - Towards Unified segmentation through coordinate detection, a novel framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification through natural language instructions. The framework proposes a novel approach for integrating segmentation techniques with vision-language models based on multimodal spatial inference. By leveraging normalized coordinate detection for bounding boxes and translating it into actionable segmentation outputs, we explore the possibility of integrating multimodal spatial and language representations. Based on the proposed technical approach, the framework demonstrates superior performance on various benchmark datasets as well as accurate object segmentation. Results on the COCO 2017 dataset for general object detection and Pascal VOC datasets for semantic segmentation demonstrate the generalization capabilities of the framework.
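
The step from normalized coordinates to something a segmentation stage can consume is, at its simplest, a rescaling to pixel units. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` with each value in `[0, 1]`; that layout and the function name `to_pixel_box` are assumptions, since the abstract does not specify the format.

```python
def to_pixel_box(norm_box, width, height):
    """Convert a normalized (x_min, y_min, x_max, y_max) box to pixel coordinates."""
    def clamp(value):
        # Keep slightly out-of-range predictions inside the image.
        return min(max(value, 0.0), 1.0)

    x0, y0, x1, y1 = norm_box
    return (
        round(clamp(x0) * width),
        round(clamp(y0) * height),
        round(clamp(x1) * width),
        round(clamp(y1) * height),
    )


box = to_pixel_box((0.25, 0.5, 0.75, 1.0), width=640, height=480)
clipped = to_pixel_box((-0.1, 0.0, 1.2, 0.5), width=640, height=480)
```

Normalized coordinates keep the language model's output independent of image resolution; the pixel box is only computed once the actual image size is known.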

[CV-21] ShadowHack: Hacking Shadows via Luminance-Color Divide and Conquer

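[Quick read]: This paper addresses the reduced brightness, texture deterioration, and color distortion that shadows cause in images. The key to the solution is ShadowHack, a divide-and-conquer strategy that decomposes the task into luminance recovery and color remedy. The paper customizes LRNet, a U-shaped network with a rectified outreach attention module, to enhance information interaction in the luminance space and recalibrate contaminated attention maps, thereby brightening shadow regions and repairing corrupted textures. Once luminance is recovered, CRNet uses cross-attention mechanisms to restore vibrant colors and produce visually compelling results. Extensive experiments on multiple datasets show that ShadowHack outperforms existing state-of-the-art solutions both quantitatively and qualitatively.

Link: https://arxiv.org/abs/2412.02545
Authors: Jin Hu, Mingjia Li, Xiaojie Guo
Keywords: Shadows introduce challenges, reduced brightness, distortion in images, complicating a holistic, introduce challenges
Categories: Computer Vision and Pattern Recognition (cs.CV)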

Abstract:Shadows introduce challenges such as reduced brightness, texture deterioration, and color distortion in images, complicating a holistic solution. This study presents ShadowHack, a divide-and-conquer strategy that tackles these complexities by decomposing the original task into luminance recovery and color remedy. To brighten shadow regions and repair the corrupted textures in the luminance space, we customize LRNet, a U-shaped network with a rectified outreach attention module, to enhance information interaction and recalibrate contaminated attention maps. With luminance recovered, CRNet then leverages cross-attention mechanisms to revive vibrant colors, producing visually compelling results. Extensive experiments on multiple datasets are conducted to demonstrate the superiority of ShadowHack over existing state-of-the-art solutions both quantitatively and qualitatively, highlighting the effectiveness of our design. Our code will be made publicly available at this https URL
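
The luminance-color split behind this divide-and-conquer idea can be illustrated with a standard luma/chroma transform. The sketch below uses the BT.601 weights; the abstract does not say which color space the paper works in, so that choice and the function names are assumptions.

```python
def split_luma_chroma(rgb):
    """Split an (R, G, B) pixel in [0, 1] into luminance Y and chroma (Cb, Cr)."""
    r, g, b = rgb
    y = 0.299 * r + 0.587 * g + 0.114 * b  # BT.601 luma weights
    cb = (b - y) / 1.772
    cr = (r - y) / 1.402
    return y, (cb, cr)


def merge_luma_chroma(y, chroma):
    """Inverse of split_luma_chroma."""
    cb, cr = chroma
    r = y + 1.402 * cr
    b = y + 1.772 * cb
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


# Brighten only the luminance of a dark pixel; chroma is passed through unchanged.
y, chroma = split_luma_chroma((0.2, 0.1, 0.05))
brightened = merge_luma_chroma(min(1.0, y * 2.0), chroma)
```

Handling brightness in one channel and color in the remaining two is what lets the two sub-problems be treated by separate networks.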

[CV-22] Unveiling Concept Attribution in Diffusion Models

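[Quick read]: This paper addresses how the internal components of generative models such as diffusion models work together to exhibit a particular concept, such as an object or a style. The key to the solution is to adapt component attribution to decompose diffusion models and reveal each component's contribution to a concept. The approach identifies components that contribute positively to a concept and also components that contribute negatively, which earlier knowledge localization approaches had not found. The framework also supports effective model editing: for example, removing the positive components erases a concept from the model while retaining knowledge of other concepts. Experimental results confirm the roles of the positive and negative components identified by the framework.

Link: https://arxiv.org/abs/2412.02542
Authors: Quang H. Nguyen, Hoang Phan, Khoa D. Doan
Keywords: shown remarkable abilities, text prompts, shown remarkable, remarkable abilities, abilities in generating
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)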

Abstract:Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains a black box; we know little about the role of its components in exhibiting a concept such as objects or styles. Recent works employ causal tracing to localize layers storing knowledge in generative models without showing how those layers contribute to the target concept. In this work, we approach the model interpretability problem from a more general perspective and pose a question: “How do model components work jointly to demonstrate knowledge?”. We adapt component attribution to decompose diffusion models, unveiling how a component contributes to a concept. Our framework allows effective model editing; in particular, we can erase a concept from diffusion models by removing positive components while retaining knowledge of other concepts. Surprisingly, we also show there exist components that contribute negatively to a concept, which has not been discovered in the knowledge localization approach. Experimental results confirm the role of positive and negative components pinpointed by our framework, depicting a complete view of interpreting generative models. Our code is available at this https URL

[CV-23] LiDAR-based Registration against Georeferenced Models for Globally Consistent Allocentric Maps

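[Quick read]: This paper addresses inaccurate localization of unmanned aerial vehicles (UAVs) on search and rescue missions in urban environments, where GNSS precision degrades near buildings. The key to the solution is to refine coarse GNSS measurements by registering LiDAR maps against CityGML and digital elevation models (DEM). First, a plausibility score computed from occupancy in a 2D height map selects the best hypothesis. Then the registration results are integrated with LiDAR odometry and other sensing modalities in a continuous-time spline-based pose graph optimizer to obtain globally consistent, georeferenced trajectories and maps. Experiments show that the method reduces GNSS offset errors from up to 16 m to below 0.5 m and produces maps that are globally consistent with prior 3D geospatial models.

Link: https://arxiv.org/abs/2412.02533
Authors: Jan Quenzel, Linus T. Mallwitz, Benedikt T. Arnold, Sven Behnke
Keywords: Modern unmanned aerial, unmanned aerial vehicles, Modern unmanned, aerial vehicles, search and rescue
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), New York City, USA, November 2024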

Abstract:Modern unmanned aerial vehicles (UAVs) are irreplaceable in search and rescue (SAR) missions to obtain a situational overview or provide closeups without endangering personnel. However, UAVs heavily rely on the global navigation satellite system (GNSS) for localization, which works well in open spaces, but the precision drastically degrades in the vicinity of buildings. These inaccuracies hinder aggregation of diverse data from multiple sources in a unified georeferenced frame for SAR operators. In contrast, CityGML models provide approximate building shapes with accurate georeferenced poses. Besides, LiDAR works best in the vicinity of 3D structures. Hence, we refine coarse GNSS measurements by registering LiDAR maps against CityGML and digital elevation map (DEM) models as a prior for allocentric mapping. An intuitive plausibility score selects the best hypothesis based on occupancy using a 2D height map. Afterwards, we integrate the registration results in a continuous-time spline-based pose graph optimizer with LiDAR odometry and further sensing modalities to obtain globally consistent, georeferenced trajectories and maps. We evaluate the viability of our approach on multiple flights captured at two distinct testing sites. Our method successfully reduced GNSS offset errors from up to 16 m to below 0.5 m on multiple flights. Furthermore, we obtain globally consistent maps w.r.t. prior 3D geospatial models.

[CV-24] Multimodal Remote Sensing Scene Classification Using VLMs and Dual-Cross Attention Networks

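[Quick read]: This paper addresses the limits of unimodal image-based approaches to Remote Sensing Scene Classification (RSSC), which struggle with high intra-class variance and inter-class similarity. The key to the solution is a new framework that integrates text descriptions generated by large Vision-Language Models (VLMs) as an auxiliary modality, improving classification without the cost of manual annotation. The paper proposes a dual cross-attention-based network that fuses visual and textual data into a unified representation. Experiments show that the framework outperforms baseline models on multiple RSSC datasets and verify the effectiveness of VLM-generated text descriptions; a zero-shot scenario further shows that the learned representation can classify unseen classes. The work offers a new way to use textual information in RSSC and a reference point for future multimodal fusion research.

Link: https://arxiv.org/abs/2412.02531
Authors: Jinjin Cai, Kexin Meng, Baijian Yang, Gang Shao
Keywords: Remote sensing scene, Remote sensing, sensing scene classification, resource management, sensing scene
Categories: Computer Vision and Pattern Recognition (cs.CV)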

Abstract:Remote sensing scene classification (RSSC) is a critical task with diverse applications in land use and resource management. While unimodal image-based approaches show promise, they often struggle with limitations such as high intra-class variance and inter-class similarity. Incorporating textual information can enhance classification by providing additional context and semantic understanding, but manual text annotation is labor-intensive and costly. In this work, we propose a novel RSSC framework that integrates text descriptions generated by large vision-language models (VLMs) as an auxiliary modality without incurring expensive manual annotation costs. To fully leverage the latent complementarities between visual and textual data, we propose a dual cross-attention-based network to fuse these modalities into a unified representation. Extensive experiments with both quantitative and qualitative evaluation across five RSSC datasets demonstrate that our framework consistently outperforms baseline models. We also verify the effectiveness of VLM-generated text descriptions compared to human-annotated descriptions. Additionally, we design a zero-shot classification scenario to show that the learned multimodal representation can be effectively utilized for unseen class classification. This research opens new opportunities for leveraging textual information in RSSC tasks and provides a promising multimodal fusion structure, offering insights and inspiration for future studies. Code is available at: this https URL

[CV-25] WEM-GAN: Wavelet transform based facial expression manipulation

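[Quick read]: This paper addresses the loss of identity information during facial expression manipulation. The key to the solution is WEM-GAN, a wavelet-based expression manipulation GAN that combines the wavelet transform with a generator built on a U-net autoencoder backbone to better preserve the details of facial features. A high-frequency component discriminator and a high-frequency domain adversarial loss further constrain optimization and supply richer detail. Residual connections between encoder and decoder, together with repeated use of relative action units (AUs), narrow the gap between generated and target expressions. Experiments on the AffectNet dataset show that the model performs well in identity preservation, editing capability, and image generation quality, with strong results on metrics such as Average Content Distance (ACD) and Expression Distance (ED).

Link: https://arxiv.org/abs/2412.02530
Authors: Dongya Sun, Yunfei Hu, Xianzhe Zhang, Yingsong Hu
Keywords: change human facial, human facial expressions, expression manipulation aims, affecting face recognition, Facial expression manipulation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)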

Abstract:Facial expression manipulation aims to change human facial expressions without affecting face recognition. In order to transform the facial expressions to target expressions, previous methods relied on expression labels to guide the manipulation process. However, these methods failed to preserve the details of facial features, which weakens or loses identity information in the output image. In our work, we propose WEM-GAN, short for wavelet-based expression manipulation GAN, which puts more effort into preserving the details of the original image in the editing process. Firstly, we take advantage of the wavelet transform technique and combine it with our generator with a U-net autoencoder backbone, in order to improve the generator’s ability to preserve more details of facial features. Secondly, we also implement the high-frequency component discriminator, and use high-frequency domain adversarial loss to further constrain the optimization of our model, providing the generated face image with more abundant details. Additionally, in order to narrow the gap between generated facial expressions and target expressions, we use residual connections between encoder and decoder, while also using relative action units (AUs) several times. Extensive qualitative and quantitative experiments have demonstrated that our model performs better in preserving identity features, editing capability, and image generation quality on the AffectNet dataset. It also shows superior performance in metrics such as Average Content Distance (ACD) and Expression Distance (ED).
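
To show what separating low- and high-frequency content with a wavelet transform means in practice, here is one level of an unnormalized Haar transform on a 1D signal. The abstract does not name the wavelet family, so Haar is an assumption chosen for brevity; image pipelines apply the same step along rows and then columns.

```python
def haar_step(signal):
    """One level of the unnormalized Haar transform on an even-length list."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]   # local averages: coarse content
    high = [(a - b) / 2 for a, b in pairs]  # local differences: fine detail
    return low, high


def haar_inverse(low, high):
    """Rebuild the signal from its low- and high-frequency halves."""
    signal = []
    for l, h in zip(low, high):
        signal.extend([l + h, l - h])
    return signal


low, high = haar_step([4, 2, 6, 6])
restored = haar_inverse(low, high)
```

The `high` band is where edges and fine texture live, which is the kind of detail a high-frequency discriminator would be looking at.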

[CV-26] Towards Rich Emotions in 3D Avatars: A Text-to-3D Avatar Generation Benchmark

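[Quick read]: This paper addresses the generation of emotionally dynamic 3D facial avatars from text derived from spoken words (Emo3D). The key to the solution is to break Emo3D generation into two cascading steps: Text-to-3D Expression Mapping (T3DEM) and 3D Avatar Rendering (3DAR). T3DEM is the step that most determines the quality of Emo3D generation and involves three challenges: expression diversity, emotion-content consistency, and expression fluidity. To address them, the paper introduces a new benchmark comprising the EmoAva dataset and several evaluation metrics, and proposes the Continuous Text-to-Expression Generator and the Globally-informed Gaussian Avatar (GiGA) model, which improve expression modeling in the T3DEM step and the rendering of subtle expressions in the 3DAR step, respectively.

Link: https://arxiv.org/abs/2412.02508
Authors: Haidong Xu, Meishan Zhang, Hao Ju, Zhedong Zheng, Hongyuan Zhu, Erik Cambria, Min Zhang, Hao Fei
Keywords: Producing emotionally dynamic, Producing emotionally, pivotal research topic, spoken words, emotionally dynamic
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 14 figures. Project website: this https URL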

Abstract:Producing emotionally dynamic 3D facial avatars with text derived from spoken words (Emo3D) has been a pivotal research topic in 3D avatar generation. While progress has been made in general-purpose 3D avatar generation, the exploration of generating emotional 3D avatars remains scarce, primarily due to the complexities of identifying and rendering rich emotions from spoken words. This paper reexamines Emo3D generation and draws inspiration from human processes, breaking down Emo3D into two cascading steps: Text-to-3D Expression Mapping (T3DEM) and 3D Avatar Rendering (3DAR). T3DEM is the most crucial step in determining the quality of Emo3D generation and encompasses three key challenges: Expression Diversity, Emotion-Content Consistency, and Expression Fluidity. To address these challenges, we introduce a novel benchmark to advance research in Emo3D generation. First, we present EmoAva, a large-scale, high-quality dataset for T3DEM, comprising 15,000 text-to-3D expression mappings that characterize the aforementioned three challenges in Emo3D generation. Furthermore, we develop various metrics to effectively evaluate models against these identified challenges. Next, to effectively model the consistency, diversity, and fluidity of human expressions in the T3DEM step, we propose the Continuous Text-to-Expression Generator, which employs an autoregressive Conditional Variational Autoencoder for expression code generation, enhanced with Latent Temporal Attention and Expression-wise Attention mechanisms. Finally, to further enhance the 3DAR step on rendering higher-quality subtle expressions, we present the Globally-informed Gaussian Avatar (GiGA) model. GiGA incorporates a global information mechanism into 3D Gaussian representations, enabling the capture of subtle micro-expressions and seamless transitions between emotional states.

[CV-27] ROVER: A Multi-Season Dataset for Visual SLAM

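[Quick read]: This paper addresses the challenges of robust Simultaneous Localization and Mapping (SLAM) in natural, unstructured environments such as parks and gardens. Seasonal changes, varying lighting, and dense vegetation in these environments often degrade visual SLAM algorithms that were designed for structured urban environments. The key contribution is ROVER, a comprehensive benchmark dataset for evaluating visual SLAM algorithms under diverse environmental conditions and spatial configurations. The dataset was captured with a robotic platform equipped with monocular, stereo, and RGB-D cameras and inertial sensors, and comprises 39 recordings across five outdoor locations, covering all seasons and several lighting scenarios (day, dusk, and night, with and without external lighting). With this dataset, the paper evaluates several traditional and deep learning-based SLAM methods and analyzes their performance under challenging conditions, showing that existing SLAM systems perform poorly in low-light and high-vegetation scenarios, particularly in summer and autumn. The study highlights the need for more adaptable visual SLAM algorithms and provides a solid foundation for SLAM research in real-world natural environments.

Link: https://arxiv.org/abs/2412.02506
Authors: Fabian Schmidt, Constantin Blessing, Markus Enzweiler, Abhinav Valada
Keywords: Robust Simultaneous Localization, Robust Simultaneous, visual SLAM algorithms, visual SLAM, SLAM
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 7 figures, 11 tables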

Abstract:Robust Simultaneous Localization and Mapping (SLAM) is a crucial enabler for autonomous navigation in natural, unstructured environments such as parks and gardens. However, these environments present unique challenges for SLAM due to frequent seasonal changes, varying light conditions, and dense vegetation. These factors often degrade the performance of visual SLAM algorithms originally developed for structured urban environments. To address this gap, we present ROVER, a comprehensive benchmark dataset tailored for evaluating visual SLAM algorithms under diverse environmental conditions and spatial configurations. We captured the dataset with a robotic platform equipped with monocular, stereo, and RGB-D cameras, as well as inertial sensors. It covers 39 recordings across five outdoor locations, collected through all seasons and various lighting scenarios, i.e., day, dusk, and night with and without external lighting. With this novel dataset, we evaluate several traditional and deep learning-based SLAM methods and study their performance in diverse challenging conditions. The results demonstrate that while stereo-inertial and RGB-D configurations generally perform better under favorable lighting and moderate vegetation, most SLAM systems perform poorly in low-light and high-vegetation scenarios, particularly during summer and autumn. Our analysis highlights the need for improved adaptability in visual SLAM algorithms for outdoor applications, as current systems struggle with dynamic environmental factors affecting scale, feature extraction, and trajectory consistency. This dataset provides a solid foundation for advancing visual SLAM research in real-world, natural environments, fostering the development of more resilient SLAM systems for long-term outdoor localization and mapping. The dataset and the code of the benchmark are available under this https URL.

[CV-28] RelayGS: Reconstructing Dynamic Scenes with Large-Scale and Complex Motions via Relay Gaussians

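【Quick read】: This paper addresses the reconstruction of dynamic scenes with large-scale, complex motions. The key to the solution is RelayGS, a new method based on 3D Gaussian Splatting (3DGS) that represents and reconstructs highly dynamic scenes through a three-stage learning process. First, it learns a fundamental 3DGS from all frames and uses a learnable mask to separate the highly dynamic foreground from the minimally moving background. Second, it replicates the foreground Gaussians into multiple copies, one per temporal segment, and optimizes each copy; these copies, called Relay Gaussians, act as explicit relay nodes that break large-scale motion trajectories into smaller, manageable segments. Finally, it jointly learns the scene's temporal motion and refines the canonical Gaussians learned in the first two stages. Experiments show that RelayGS outperforms the state of the art by more than 1 dB in PSNR and successfully reconstructs complex-motion scenes such as basketball games.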

Link: https://arxiv.org/abs/2412.02493
Authors: Qiankun Gao, Yanmin Wu, Chengxiang Wen, Jiarui Meng, Luyang Tang, Jie Chen, Ronggang Wang, Jian Zhang
Keywords-EN: Reconstructing dynamic scenes, Neural Radiance Fields, Reconstructing dynamic, significant challenge, remains a significant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. GitHub: this https URL

Abstract:Reconstructing dynamic scenes with large-scale and complex motions remains a significant challenge. Recent techniques like Neural Radiance Fields and 3D Gaussian Splatting (3DGS) have shown promise but still struggle with scenes involving substantial movement. This paper proposes RelayGS, a novel method based on 3DGS, specifically designed to represent and reconstruct highly dynamic scenes. Our RelayGS learns a complete 4D representation with canonical 3D Gaussians and a compact motion field, consisting of three stages. First, we learn a fundamental 3DGS from all frames, ignoring temporal scene variations, and use a learnable mask to separate the highly dynamic foreground from the minimally moving background. Second, we replicate multiple copies of the decoupled foreground Gaussians from the first stage, each corresponding to a temporal segment, and optimize them using pseudo-views constructed from multiple frames within each segment. These Gaussians, termed Relay Gaussians, act as explicit relay nodes, simplifying and breaking down large-scale motion trajectories into smaller, manageable segments. Finally, we jointly learn the scene’s temporal motion and refine the canonical Gaussians learned from the first two stages. We conduct thorough experiments on two dynamic scene datasets featuring large and complex motions, where our RelayGS outperforms state-of-the-arts by more than 1 dB in PSNR, and successfully reconstructs real-world basketball game scenes in a much more complete and coherent manner, whereas previous methods usually struggle to capture the complex motion of players. Code will be publicly available at this https URL
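The "more than 1 dB in PSNR" claim is easier to read with the metric written out. The following is a minimal sketch of the standard PSNR formula, assuming images are flat lists of pixel values in [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative and not taken from the paper's code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are flat lists of pixel intensities (an assumption made for
    this sketch); max_value is the largest possible intensity.
    """
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gain of 1 dB corresponds to reducing the MSE by a factor of 10^(-0.1), roughly 21%.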

[CV-29] OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations

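【Quick read】: This paper addresses the robustness of face recognition under real-world Out-of-Distribution (OOD) scenarios. The key to the solution is to systematically design and simulate 30 OOD scenarios across 9 major categories, covering common corruptions and appearance variations, and to build three robustness benchmarks on public datasets (LFW-C/V, CFP-FP-C/V, YTF-C/V) for evaluating 19 face recognition models and 3 commercial APIs. The paper also runs extended experiments on face masks, Vision-Language Models (VLMs), and defense strategies, and provides a unified toolkit that is easy to extend to other datasets. These experiments and analyses expose the vulnerability of face recognition systems to OOD data and suggest possible directions for improvement.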

Link: https://arxiv.org/abs/2412.02479
Authors: Caixin Kang, Yubo Chen, Shouwei Ruan, Shiji Zhao, Ruochen Zhang, Jiayi Wang, Shan Fu, Xingxing Wei
Keywords-EN: facial recognition, facial recognition models, deep learning, rapid development, recognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:With the rise of deep learning, facial recognition technology has seen extensive research and rapid development. Although facial recognition is considered a mature technology, we find that existing open-source models and commercial algorithms lack robustness in certain real-world Out-of-Distribution (OOD) scenarios, raising concerns about the reliability of these systems. In this paper, we introduce OODFace, which explores the OOD challenges faced by facial recognition models from two perspectives: common corruptions and appearance variations. We systematically design 30 OOD scenarios across 9 major categories tailored for facial recognition. By simulating these challenges on public datasets, we establish three robustness benchmarks: LFW-C/V, CFP-FP-C/V, and YTF-C/V. We then conduct extensive experiments on 19 different facial recognition models and 3 commercial APIs, along with extended experiments on face masks, Vision-Language Models (VLMs), and defense strategies to assess their robustness. Based on the results, we draw several key insights, highlighting the vulnerability of facial recognition systems to OOD data and suggesting possible solutions. Additionally, we offer a unified toolkit that includes all corruption and variation types, easily extendable to other datasets. We hope that our benchmarks and findings can provide guidance for future improvements in facial recognition model robustness.

[CV-30] BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding

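【Quick read】: This paper addresses object association in dynamic scene understanding, particularly in long-term dynamic environments, where conventional methods depend on predefined object shapes and categories and lack adaptability and robustness. The key to the solution is BYE, a class-agnostic, per-scene point cloud encoder that needs no predefined categories, shape priors, or extensive association datasets. Trained on only a single sequence of exploration data, BYE efficiently performs object association in dynamically changing scenes. The paper further proposes an ensembling scheme that combines the semantic strengths of Vision Language Models (VLMs) with BYE's scene-specific expertise, achieving a 7% improvement and a 95% success rate in object association tasks.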

Link: https://arxiv.org/abs/2412.02449
Authors: Chenguang Huang, Shengchao Yan, Wolfram Burgard
Keywords-EN: scene understanding remains, robotic applications, Dynamic scene understanding, understanding remains, remains a persistent
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Dynamic scene understanding remains a persistent challenge in robotic applications. Early dynamic mapping methods focused on mitigating the negative influence of short-term dynamic objects on camera motion estimation by masking or tracking specific categories, which often fall short in adapting to long-term scene changes. Recent efforts address object association in long-term dynamic environments using neural networks trained on synthetic datasets, but they still rely on predefined object shapes and categories. Other methods incorporate visual, geometric, or semantic heuristics for the association but often lack robustness. In this work, we introduce BYE, a class-agnostic, per-scene point cloud encoder that removes the need for predefined categories, shape priors, or extensive association datasets. Trained on only a single sequence of exploration data, BYE can efficiently perform object association in dynamically changing scenes. We further propose an ensembling scheme combining the semantic strengths of Vision Language Models (VLMs) with the scene-specific expertise of BYE, achieving a 7% improvement and a 95% success rate in object association tasks. Code and dataset are available at this https URL.
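The abstract does not say how the ensembling scheme combines the two models, so the following is only one plausible sketch: it assumes each model yields a similarity score for every (query object, candidate object) pair and fuses them with a weighted average before picking the best candidate. The function name `ensemble_associate` and the `weight` parameter are hypothetical.

```python
def ensemble_associate(encoder_scores, vlm_scores, weight=0.5):
    """Associate each query object with its best-scoring candidate.

    encoder_scores and vlm_scores are lists of rows, one row per query,
    each row holding one similarity score per candidate. The weighted
    average used here is an assumed fusion rule, not the paper's.
    """
    matches = []
    for enc_row, vlm_row in zip(encoder_scores, vlm_scores):
        combined = [weight * e + (1.0 - weight) * v
                    for e, v in zip(enc_row, vlm_row)]
        # Index of the candidate with the highest fused score.
        matches.append(max(range(len(combined)), key=combined.__getitem__))
    return matches
```

Setting `weight=1.0` falls back to the scene-specific encoder alone, and `weight=0.0` to the VLM alone, which makes it easy to compare each source against the ensemble.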

[CV-31] Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations

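【Quick read】: This paper addresses how to accurately account for, and explain, social interactions among agents when forecasting the trajectories of agents such as pedestrians. The key to the solution is the Resonance model, which treats pedestrian trajectory forecasting as co-vibration and associates social interactions with the spectral properties of trajectories. The model forecasts future trajectories as three distinct vibration terms, each representing agents' future plans from a different perspective, which yields a decoupled prediction. It also accounts for how social interactions modify scheduled trajectories in a resonance-like manner by learning the similarities of trajectory spectra. Experiments on multiple datasets verify its effectiveness.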

Link: https://arxiv.org/abs/2412.02447
Authors: Conghao Wong, Ziqian Zou, Beihao Xia, Xinge You
Keywords-EN: researchers’ attention, social interactions, intelligent agents, trajectories, social
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learning to forecast the trajectories of intelligent agents like pedestrians has caught more researchers’ attention. Despite researchers’ efforts, it remains a challenge to accurately account for social interactions among agents when forecasting, and in particular, to simulate such social modifications to future trajectories in an explainable and decoupled way. Inspired by the resonance phenomenon of vibration systems, we propose the Resonance (short for Re) model to forecast pedestrian trajectories as co-vibrations, and regard that social interactions are associated with spectral properties of agents’ trajectories. It forecasts future trajectories as three distinct vibration terms to represent agents’ future plans from different perspectives in a decoupled way. Also, agents’ social interactions and how they modify scheduled trajectories will be considered in a resonance-like manner by learning the similarities of their trajectory spectrums. Experiments on multiple datasets, whether pedestrian or vehicle, have verified the usefulness of our method both quantitatively and qualitatively.
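To make "similarities of trajectory spectrums" concrete, here is a small sketch that compares two trajectories in the frequency domain. It is an illustration only: the paper learns these similarities, whereas this example uses a fixed cosine similarity over discrete Fourier transform magnitudes, and all function names are hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier coefficient of a 1D signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def trajectory_spectrum(trajectory):
    """Concatenate the x and y magnitude spectra of a list of (x, y) points."""
    xs = [p[0] for p in trajectory]
    ys = [p[1] for p in trajectory]
    return dft_magnitudes(xs) + dft_magnitudes(ys)


def spectral_similarity(traj_a, traj_b):
    """Cosine similarity between two spectra (trajectories of equal length)."""
    sa, sb = trajectory_spectrum(traj_a), trajectory_spectrum(traj_b)
    dot = sum(a * b for a, b in zip(sa, sb))
    norm = math.sqrt(sum(a * a for a in sa)) * math.sqrt(sum(b * b for b in sb))
    return dot / norm if norm > 0 else 0.0
```

Since magnitudes are non-negative, the score lies between 0 and 1, with 1 meaning the two trajectories have proportional spectra.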

[CV-32] TimeWalker: Personalized Neural Space for Lifelong Head Avatars

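【Quick read】: This paper addresses how to build realistic, full-scale 3D head avatars on a lifelong scale, and in particular how to construct a person's comprehensive identity from unstructured data collected across different life stages. The key to the solution is TimeWalker, a framework built around a novel neural parametric model that learns a personalized representation with shape, expression, and appearance disentangled across ages. Its main components are: 1) the Dynamic Neural Basis-Blending Module (Dynamo), which dynamically adjusts the number and blend weights of neural head bases to compactly represent the target person's head variations across ages; 2) Dynamic 2D Gaussian Splatting (DNA-2DGS), an extension of the Gaussian splatting representation that models head motion deformations such as facial expressions without sacrificing the realism of rendering and reconstruction. Together these allow TimeWalker to reconstruct and animate avatars across decoupled dimensions, offering a route to personalized "time traveling".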

Link: https://arxiv.org/abs/2412.02421
Authors: Dongwei Pan, Yang Li, Hongsheng Li, Kwan-Yee Lin
Keywords-EN: lifelong scale, head, head avatar pipelines, human head avatar, person
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL , Video: this https URL

Abstract:We present TimeWalker, a novel framework that models realistic, full-scale 3D head avatars of a person on a lifelong scale. Unlike current human head avatar pipelines that capture identity at the momentary level (e.g., instant photography or short videos), TimeWalker constructs a person’s comprehensive identity from unstructured data collected over his/her various life stages, offering a paradigm to achieve full reconstruction and animation of that person at different moments of life. At the heart of TimeWalker’s success is a novel neural parametric model that learns a personalized representation with the disentanglement of shape, expression, and appearance across ages. Two concepts are central to our methodology: (1) We return to the principle of modeling a person’s identity as an additive combination of an average head representation in the canonical space and moment-specific head attribute representations driven by a set of neural head bases. To learn a set of head bases that can represent the comprehensive head variations in a compact manner, we propose a Dynamic Neural Basis-Blending Module (Dynamo). It dynamically adjusts the number and blend weights of neural head bases according to both shared and specific traits of the target person over ages. (2) We introduce Dynamic 2D Gaussian Splatting (DNA-2DGS), an extension of the Gaussian splatting representation, to model head motion deformations like facial expressions without losing the realism of rendering and reconstruction. DNA-2DGS includes a set of controllable 2D oriented planar Gaussian disks that utilize the priors from the parametric model and move/rotate with the change of expression. Through extensive experimental evaluations, we show TimeWalker’s ability to reconstruct and animate avatars across decoupled dimensions with realistic rendering effects, demonstrating a way to achieve personalized ‘time traveling’ in a breeze.
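The "additive combination" of an average head and a weighted set of head bases can be written as x = mean + Σ wᵢ·bᵢ. The sketch below illustrates just that linear blend on plain feature vectors; the real model operates on learned neural representations and also adapts the number of bases, which is not modeled here. The name `blend_head` is hypothetical.

```python
def blend_head(mean_head, bases, weights):
    """Return mean_head + sum_i weights[i] * bases[i] for flat feature vectors.

    Plain lists stand in for the neural head representations; this is an
    illustrative simplification, not the Dynamo module itself.
    """
    if len(bases) != len(weights):
        raise ValueError("need exactly one weight per basis")
    blended = list(mean_head)
    for basis, weight in zip(bases, weights):
        for i, value in enumerate(basis):
            blended[i] += weight * value
    return blended
```

With all weights set to zero the result is the average head, so the weights can be read as the moment-specific deviation from it.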

[CV-33] It Takes Two: Real-time Co-Speech Two-persons Interaction Generation via Reactive Auto-regressive Diffusion Model

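【Quick read】: This paper addresses the weakness of existing co-speech motion synthesis methods in conversational scenarios, where one person's audio and gestures influence the other's responses. In addition, most existing methods rely on offline sequence-to-sequence frameworks that are unsuitable for online applications. The key to the solution is an audio-driven, auto-regressive system that synthesizes dynamic movements for two characters in conversation. At its core is a diffusion-based full-body motion synthesis model conditioned on the past states of both characters, the speech audio, and a task-oriented motion trajectory input, which allows flexible spatial control. To strengthen the model's ability to learn diverse interactions, the paper enriches existing two-person conversational motion datasets with more dynamic and interactive motions. In multiple experiments the system performs well on single-person and two-person co-speech motion generation as well as interactive motion generation; to the authors' knowledge, it is the first system able to generate interactive full-body motions for two characters from speech online.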

Link: https://arxiv.org/abs/2412.02419
Authors: Mingyi Shi, Dafei Qin, Leo Ho, Zhouyingcheng Liao, Yinghao Huang, Junichi Yamagishi, Taku Komura
Keywords-EN: real-world settings, common in real-world, approaches often fall, fall short, gestures will influence
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 15 pages, 10 figures

Abstract:Conversational scenarios are very common in real-world settings, yet existing co-speech motion synthesis approaches often fall short in these contexts, where one person’s audio and gestures will influence the other’s responses. Additionally, most existing methods rely on offline sequence-to-sequence frameworks, which are unsuitable for online applications. In this work, we introduce an audio-driven, auto-regressive system designed to synthesize dynamic movements for two characters during a conversation. At the core of our approach is a diffusion-based full-body motion synthesis model, which is conditioned on the past states of both characters, speech audio, and a task-oriented motion trajectory input, allowing for flexible spatial control. To enhance the model’s ability to learn diverse interactions, we have enriched existing two-person conversational motion datasets with more dynamic and interactive motions. We evaluate our system through multiple experiments to show it outperforms across a variety of tasks, including single and two-person co-speech motion generation, as well as interactive motion generation. To the best of our knowledge, this is the first system capable of generating interactive full-body motions for two characters from speech in an online manner.

[CV-34] VISTA: A Panoramic View of Neural Representations

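【Quick read】: This paper addresses the difficulty of analyzing the vast multidimensional internal representation spaces of modern machine learning models. The key to the solution is VISTA (Visualization of Internal States and Their Associations), a novel pipeline that maps representations into a semantic 2D space so that a network's internal states and their associations can be explored and interpreted visually. The approach reveals patterns and relationships within internal representations, and applying it to sparse autoencoder latents uncovers new properties and interpretations, indicating broad potential for neural network interpretability.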

Link: https://arxiv.org/abs/2412.02412
Authors: Tom White
Keywords-EN: Internal States, exploring and interpreting, Visualization of Internal, interpreting neural network, Visualization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present VISTA (Visualization of Internal States and Their Associations), a novel pipeline for visually exploring and interpreting neural network representations. VISTA addresses the challenge of analyzing vast multidimensional spaces in modern machine learning models by mapping representations into a semantic 2D space. The resulting collages visually reveal patterns and relationships within internal representations. We demonstrate VISTA’s utility by applying it to sparse autoencoder latents uncovering new properties and interpretations. We review the VISTA methodology, present findings from our case study ( this https URL ), and discuss implications for neural network interpretability across various domains of machine learning.

[CV-35] 3D Face Reconstruction From Radar Images

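【Quick read】: This paper addresses 3D face reconstruction from radar images. The key to the solution is a model-based method that generates a dataset of synthetic radar images and trains a convolutional neural network (CNN) encoder to estimate the parameters of a 3D Morphable Face Model. The reconstruction is then extended, in an Analysis-by-Synthesis fashion, into a model-based autoencoder whose decoder learns the rendering process and acts as an object-specific differentiable radar renderer. Encoder and decoder are trained jointly to minimize both the parameter loss and the loss of the reconstructed radar image, which also allows the parameters to be further optimized at test time by unsupervised fine-tuning on the image loss.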

Link: https://arxiv.org/abs/2412.02403
Authors: Valentin Braeutigam, Vanessa Wirth, Ingrid Ullmann, Christian Schüßler, Martin Vossiek, Matthias Berking, Bernhard Egger
Keywords-EN: gains wide attention, faces gains wide, virtual reality, gains wide, wide attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The 3D reconstruction of faces gains wide attention in computer vision and is used in many fields of application, for example, animation, virtual reality, and even forensics. This work is motivated by monitoring patients in sleep laboratories. Due to their unique characteristics, sensors from the radar domain have advantages compared to optical sensors, namely penetration of electrically non-conductive materials and independence of light. These advantages of radar signals unlock new applications and require adaptation of 3D reconstruction frameworks. We propose a novel model-based method for 3D reconstruction from radar images. We generate a dataset of synthetic radar images with a physics-based but non-differentiable radar renderer. This dataset is used to train a CNN-based encoder to estimate the parameters of a 3D morphable face model. Whilst the encoder alone already leads to strong reconstructions of synthetic data, we extend our reconstruction in an Analysis-by-Synthesis fashion to a model-based autoencoder. This is enabled by learning the rendering process in the decoder, which acts as an object-specific differentiable radar renderer. Subsequently, the combination of both network parts is trained to minimize both, the loss of the parameters and the loss of the resulting reconstructed radar image. This leads to the additional benefit, that at test time the parameters can be further optimized by finetuning the autoencoder unsupervised on the image loss. We evaluated our framework on generated synthetic face images as well as on real radar images with 3D ground truth of four individuals.

[CV-36] RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation NEURIPS2024

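【Quick read】: This paper addresses over-segmentation and mis-segmentation in 3D Referring Expression Segmentation (3D-RES) caused by insufficient emphasis on the spatial information of instances. The key to the solution is the Rule-Guided Spatial Awareness Network (RG-SAN), which uses only the spatial information of the target instance for supervision and thereby strengthens the network's ability to depict and reason about the spatial relationships among all entities described in the text. RG-SAN consists of the Text-driven Localization Module (TLM) and the Rule-guided Weak Supervision (RWS) strategy. The TLM initially locates all mentioned instances and iteratively refines their positional information, while RWS uses dependency tree rules to precisely guide the positioning of the core instance, given that only the target object has supervised positional information.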

Link: https://arxiv.org/abs/2412.02402
Authors: Changli Wu, Qi Chen, Jiayi Ji, Haowei Wang, Yiwei Ma, You Huang, Gen Luo, Hao Fei, Xiaoshuai Sun, Rongrong Ji
Keywords-EN: Referring Expression Segmentation, correlating referring expressions, Expression Segmentation, Referring Expression, correlating referring
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024 (Oral), Code: this https URL

Abstract:3D Referring Expression Segmentation (3D-RES) aims to segment 3D objects by correlating referring expressions with point clouds. However, traditional approaches frequently encounter issues like over-segmentation or mis-segmentation, due to insufficient emphasis on spatial information of instances. In this paper, we introduce a Rule-Guided Spatial Awareness Network (RG-SAN) by utilizing solely the spatial information of the target instance for supervision. This approach enables the network to accurately depict the spatial relationships among all entities described in the text, thus enhancing the reasoning capabilities. The RG-SAN consists of the Text-driven Localization Module (TLM) and the Rule-guided Weak Supervision (RWS) strategy. The TLM initially locates all mentioned instances and iteratively refines their positional information. The RWS strategy, acknowledging that only target objects have supervised positional information, employs dependency tree rules to precisely guide the core instance’s positioning. Extensive testing on the ScanRefer benchmark has shown that RG-SAN not only establishes new performance benchmarks, with an mIoU increase of 5.1 points, but also exhibits significant improvements in robustness when processing descriptions with spatial ambiguity. All codes are available at this https URL.
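The reported gain is measured in mIoU, the mean intersection-over-union between predicted and ground-truth point masks. The sketch below shows that computation on binary per-point masks; treating two empty masks as a perfect match is a convention assumed here, not something stated in the abstract.

```python
def iou(pred_mask, gt_mask):
    """Intersection over union of two binary per-point masks."""
    intersection = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (pred_mask, gt_mask) pairs, one pair per expression."""
    return sum(iou(pred, gt) for pred, gt in pairs) / len(pairs)
```

Over-segmentation inflates the union and mis-segmentation shrinks the intersection, so both failure modes named in the abstract lower this score.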

[CV-37] OMENN: One Matrix to Explain Neural Networks

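【Quick read】: This paper addresses the difficulty of interpreting the decision-making of Deep Learning (DL) models. The key to the solution is One Matrix to Explain Neural Networks (OMENN), a novel post-hoc explanation method. OMENN represents a neural network, for each specific input, as a single interpretable matrix built from a series of linear transformations that represent how each successive layer processes the input. The method provides locally precise, attribution-based explanations and applies to a range of modern models, including Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs). Theoretical analysis and experiments show that OMENN is competitive with state-of-the-art methods on eXplainable Artificial Intelligence (XAI) benchmarks.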

Link: https://arxiv.org/abs/2412.02399
Authors: Adam Wróbel, Mikołaj Janusz, Bartosz Zieliński, Dawid Rymarczyk
Keywords-EN: decision-making processes difficult, Deep Learning, black boxes, making their decision-making, difficult to interpret
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review, code will be released after acceptance

Abstract:Deep Learning (DL) models are often black boxes, making their decision-making processes difficult to interpret. This lack of transparency has driven advancements in eXplainable Artificial Intelligence (XAI), a field dedicated to clarifying the reasoning behind DL model predictions. Among these, attribution-based methods such as LRP and GradCAM are widely used, though they rely on approximations that can be imprecise. To address these limitations, we introduce One Matrix to Explain Neural Networks (OMENN), a novel post-hoc method that represents a neural network as a single, interpretable matrix for each specific input. This matrix is constructed through a series of linear transformations that represent the processing of the input by each successive layer in the neural network. As a result, OMENN provides locally precise, attribution-based explanations of the input across various modern models, including ViTs and CNNs. We present a theoretical analysis of OMENN based on dynamic linearity property and validate its effectiveness with extensive tests on two XAI benchmarks, demonstrating that OMENN is competitive with state-of-the-art methods.
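The idea of "a single matrix for each specific input" can be illustrated on a toy two-layer ReLU network, where fixing the activation pattern for one input makes the whole network an exact affine map. This is a sketch of that general local-linearity property only, not the authors' implementation (which also covers ViTs and CNNs); all names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def forward(x, w1, b1, w2, b2):
    """Toy network: y = W2 * relu(W1 * x + b1) + b2."""
    pre = [z + b for z, b in zip(matvec(w1, x), b1)]
    hidden = [max(0.0, z) for z in pre]
    return [z + b for z, b in zip(matvec(w2, hidden), b2)]


def input_specific_matrix(x, w1, b1, w2, b2):
    """Return (M, c) such that forward(x) == M * x + c for this particular x."""
    pre = [z + b for z, b in zip(matvec(w1, x), b1)]
    mask = [1.0 if z > 0 else 0.0 for z in pre]  # which ReLUs are active
    hidden_units = range(len(mask))
    # M = W2 * diag(mask) * W1 and c = W2 * (mask * b1) + b2
    matrix = [[sum(w2[o][h] * mask[h] * w1[h][i] for h in hidden_units)
               for i in range(len(x))] for o in range(len(w2))]
    bias = [sum(w2[o][h] * mask[h] * b1[h] for h in hidden_units) + b2[o]
            for o in range(len(w2))]
    return matrix, bias
```

Each product `matrix[o][i] * x[i]` can then be read as the contribution of input feature `i` to output `o`, and the contributions plus the bias sum exactly to the network output for that input.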

[CV-38] Who Walks With You Matters: Perceiving Social Interactions with Groups for Pedestrian Trajectory Prediction CVPR2025

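【Quick read】: This paper addresses understanding and anticipating pedestrian movement in applications such as autonomous driving and surveillance, where the complex interactions arising from different relations between agents are a major challenge. The key to the solution is the GrouP ConCeption (GPCC) model: its Group method categorizes nearby agents as group members or non-group members using a long-term distance kernel function, and its Conception module perceives the visual and acoustic information surrounding the target agent. Evaluated on multiple datasets, GPCC shows significant improvements in trajectory prediction accuracy, validating its effectiveness in modeling both social and individual dynamics.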

Link: https://arxiv.org/abs/2412.02395
Authors: Ziqian Zou, Conghao Wong, Beihao Xia, Qinmu Peng, Xinge You
Keywords-EN: Understanding and anticipating, anticipating human movement, driving and surveillance, critical and challenging, challenging in diverse
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 10 figures, submitted to CVPR 2025

Abstract:Understanding and anticipating human movement has become more critical and challenging in diverse applications such as autonomous driving and surveillance. The complex interactions brought by different relations between agents are a crucial reason this task is challenging. Researchers have put much effort into designing systems that use rule-based or data-based models to extract and validate the patterns between pedestrian trajectories and these interactions, but the problem has not yet been adequately addressed. Inspired by how humans perceive social interactions with different levels of relation to themselves, this work proposes the GrouP ConCeption (short for GPCC) model, composed of the Group method, which categorizes nearby agents into either group members or non-group members based on a long-term distance kernel function, and the Conception module, which perceives both visual and acoustic information surrounding the target agent. Evaluated across multiple datasets, the GPCC model demonstrates significant improvements in trajectory prediction accuracy, validating its effectiveness in modeling both social and individual dynamics. The qualitative analysis also indicates that the GPCC framework leverages grouping and perception cues in an intuitive, human-like way, supporting the proposed model’s explainability in pedestrian trajectory forecasting.
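The abstract does not define the long-term distance kernel, so the sketch below substitutes a simple stand-in: a neighbor counts as a group member when its distance to the target, averaged over the whole observed window, stays under a threshold. The threshold value and both function names are assumptions for illustration.

```python
import math


def mean_distance(traj_a, traj_b):
    """Average Euclidean distance between two time-aligned (x, y) trajectories."""
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)


def split_group(target, neighbors, threshold=1.5):
    """Split neighbors into (group members, non-group members).

    neighbors maps a name to a trajectory. Averaging over the whole
    window is an assumed stand-in for the paper's long-term kernel.
    """
    members, others = [], []
    for name, trajectory in neighbors.items():
        if mean_distance(target, trajectory) <= threshold:
            members.append(name)
        else:
            others.append(name)
    return members, others
```

Averaging over time rather than using a single frame means that someone who merely passes close by for a moment is not mistaken for a walking companion.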

[CV-39] Bio-inspired visual relative localization for large swarms of UAVs

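【Quick read】: This paper addresses the relative localization of individual agents within large-scale UAV swarms. The key to the solution is a neighbor-density regression approach: instead of detecting each neighbor and estimating its relative position, the method regresses a neighbor density over distance. This improves both the accuracy of distance estimation and scalability with respect to the number of neighbors. The paper also proposes a new swarm control algorithm compatible with this relative localization method, enabling swarm stabilization.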

Link: https://arxiv.org/abs/2412.02393
Authors: Martin Křížek, Matouš Vrba, Antonella Barišić Kulaš, Stjepan Bogdan, Martin Saska
Keywords-EN: visual perception, relative localization, biological perception utilized, large-scale swarms, relative
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose a new approach to visual perception for relative localization of agents within large-scale swarms of UAVs. Inspired by biological perception utilized by schools of sardines, swarms of bees, and other large groups of animals capable of moving in a decentralized yet coherent manner, our method does not rely on detecting individual neighbors by each agent and estimating their relative position, but rather we propose to regress a neighbor density over distance. This allows for a more accurate distance estimation as well as better scalability with respect to the number of neighbors. Additionally, a novel swarm control algorithm is proposed to make it compatible with the new relative localization method. We provide a thorough evaluation of the presented methods and demonstrate that the regressing approach to distance estimation is more robust to varying relative pose of the targets and that it is suitable to be used as the main source of relative localization for swarm stabilization.
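One way to picture "a neighbor density over distance" is as a histogram: rather than a list of per-neighbor positions, the quantity of interest is how many neighbors fall into each distance band. The sketch below builds such a histogram from known distances, as might be done to create a regression target; the bin edges and the function name are assumptions, and the paper's actual representation may differ.

```python
def neighbor_density(distances, bin_edges):
    """Count neighbors per distance bin [bin_edges[i], bin_edges[i + 1]).

    Neighbors outside the covered range are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    for distance in distances:
        for i in range(len(counts)):
            if bin_edges[i] <= distance < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts
```

The output length depends only on the number of bins, not on the number of neighbors, which is one intuition for why such a representation scales to large swarms.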

[CV-40] Single-Shot Metric Depth from Focused Plenoptic Cameras ICRA2025

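【Quick read】: This paper addresses metric depth estimation from visual sensors, specifically how to obtain dense metric depth from a single view without the hardware demands and calibration burden of traditional stereo or structured light cameras. The key to the solution is to use focused plenoptic cameras: machine learning first produces a sparse metric point cloud, which is then used to scale and align a dense relative depth map regressed by a foundation depth model, yielding dense metric depth. The paper also curates the Light Field Stereo Image Dataset (LFS) to validate the pipeline and support further research in this area.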

Link: https://arxiv.org/abs/2412.02386
Authors: Blanca Lasheras-Hernandez, Klaus H. Strobl, Sergio Izquierdo, Tim Bodenmüller, Rudolph Triebel, Javier Civera
Keywords-EN: Metric depth, Metric, dense metric depth, depth, robots to perceive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages (6 for text + 2 for references), 6 figures, 2 tables. Submitted to IEEE ICRA 2025

Abstract:Metric depth estimation from visual sensors is crucial for robots to perceive, navigate, and interact with their environment. Traditional range imaging setups, such as stereo or structured light cameras, face hassles including calibration, occlusions, and hardware demands, with accuracy limited by the baseline between cameras. Single- and multi-view monocular depth offers a more compact alternative, but is constrained by the unobservability of the metric scale. Light field imaging provides a promising solution for estimating metric depth by using a unique lens configuration through a single device. However, its application to single-view dense metric depth is under-addressed mainly due to the technology’s high cost, the lack of public benchmarks, and proprietary geometrical models and software. Our work explores the potential of focused plenoptic cameras for dense metric depth. We propose a novel pipeline that predicts metric depth from a single plenoptic camera shot by first generating a sparse metric point cloud using machine learning, which is then used to scale and align a dense relative depth map regressed by a foundation depth model, resulting in dense metric depth. To validate it, we curated the Light Field Stereo Image Dataset (LFS) of real-world light field images with stereo depth labels, filling a current gap in existing resources. Experimental results show that our pipeline produces accurate metric depth predictions, laying a solid groundwork for future research in this field.
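The "scale and align" step can be sketched as a least-squares fit: find a scale s and shift t so that s·r + t best matches the sparse metric depths z at the pixels where both are known, then apply the same map to the whole dense relative depth map. The affine-in-depth model and the function names are assumptions; the abstract does not state which alignment model the authors use.

```python
def fit_scale_shift(relative, metric):
    """Closed-form least squares for (scale, shift) minimizing
    sum((scale * r + shift - z) ** 2) over paired samples."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_z = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (z - mean_z) for r, z in zip(relative, metric))
    scale = cov / var_r
    return scale, mean_z - scale * mean_r


def to_metric(dense_relative, scale, shift):
    """Apply the fitted affine map to every pixel of a dense relative depth map."""
    return [[scale * value + shift for value in row] for row in dense_relative]
```

Only two unknowns are fitted, so even a very sparse metric point cloud is enough to anchor the dense map, provided the sparse points span a range of depths.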

[CV-41] Active Negative Loss: A Robust Framework for Learning with Noisy Labels

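【Quick read】: This paper addresses the tendency of deep supervised learning to overfit when trained on noisy labels. The key to the solution is a new class of loss functions, Normalized Negative Loss Functions (NNLFs), which serve as the passive loss in the Active Passive Loss (APL) framework in place of Mean Absolute Error (MAE). By concentrating more on memorized clean samples, NNLFs overcome MAE's drawback of paying equal attention to clean and noisy samples, which slows convergence and can make training difficult, particularly on large-scale datasets. For non-symmetric noise, the paper additionally proposes an entropy-based regularization technique to mitigate vulnerability to label imbalance. Experiments show that the resulting framework, Active Negative Loss (ANL), performs better than or comparably to state-of-the-art methods across various label noise types and in image segmentation tasks.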

Link: https://arxiv.org/abs/2412.02373
Authors: Xichen Ye, Yifan Wu, Yiwen Xu, Xiaoqiang Li, Weizhong Zhang, Yifan Chen
Keywords-EN: Deep supervised learning, achieved remarkable success, Deep supervised, loss functions, loss
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Deep supervised learning has achieved remarkable success across a wide range of tasks, yet it remains susceptible to overfitting when confronted with noisy labels. To address this issue, noise-robust loss functions offer an effective solution for enhancing learning in the presence of label noise. In this work, we systematically investigate the limitation of the recently proposed Active Passive Loss (APL), which employs Mean Absolute Error (MAE) as its passive loss function. Despite the robustness brought by MAE, one of its key drawbacks is that it pays equal attention to clean and noisy samples; this feature slows down convergence and potentially makes training difficult, particularly in large-scale datasets. To overcome these challenges, we introduce a novel loss function class, termed Normalized Negative Loss Functions (NNLFs), which serve as passive loss functions within the APL framework. NNLFs effectively address the limitations of MAE by concentrating more on memorized clean samples. By replacing MAE in APL with our proposed NNLFs, we enhance APL and present a new framework called Active Negative Loss (ANL). Moreover, in non-symmetric noise scenarios, we propose an entropy-based regularization technique to mitigate the vulnerability to the label imbalance. Extensive experiments demonstrate that the new loss functions adopted by our ANL framework can achieve better or comparable performance to state-of-the-art methods across various label noise types and in image segmentation tasks. The source code is available at: this https URL.
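For context, the APL baseline that this work revisits pairs an active loss with MAE as the passive loss. The sketch below shows one common such pairing, normalized cross entropy plus MAE, on a single sample's predicted probabilities. It illustrates the baseline only; the proposed NNLFs are not reproduced here because the abstract does not give their formula, and the `alpha` and `beta` weights are illustrative.

```python
import math


def normalized_cross_entropy(probs, label):
    """Cross entropy for the given label divided by the sum of cross
    entropies over all possible labels (probs must be strictly positive)."""
    logs = [math.log(p) for p in probs]
    return logs[label] / sum(logs)


def mean_absolute_error(probs, label):
    """L1 distance between probs and the one-hot label vector."""
    return sum(abs(p - (1.0 if k == label else 0.0))
               for k, p in enumerate(probs))


def active_passive_loss(probs, label, alpha=1.0, beta=1.0):
    """APL-style combination: alpha * active term + beta * passive term."""
    return (alpha * normalized_cross_entropy(probs, label)
            + beta * mean_absolute_error(probs, label))
```

When the probabilities sum to one, the MAE term equals 2·(1 − p_label), which is linear in the probability of the labeled class; its gradient with respect to that probability is therefore the same for every sample, one way to see the "equal attention" drawback mentioned above.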

[CV-42] Trajectory-based Road Autolabeling with Lidar-Camera Fusion in Winter Conditions

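【Quick read】: This paper addresses robust road segmentation under all road conditions, which is essential for safe autonomous driving and advanced driver assistance systems. The key to the solution is trajectory-based self-supervised learning performed jointly with lidar and camera data. The approach learns from the traversed route without manual labels, which offers a way to handle out-of-distribution scenarios. On a winter driving dataset that includes countryside and suburb scenes, the proposed method outperforms recent standalone camera-based and lidar-based methods.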

Link: https://arxiv.org/abs/2412.02370
Authors: Eerik Alamikkotervo, Henrik Toikka, Kari Tammi, Risto Ojala
Keywords-EN: Robust road segmentation, driver assistance systems, advanced driver assistance, Robust road, safe autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robust road segmentation in all road conditions is required for safe autonomous driving and advanced driver assistance systems. Supervised deep learning methods provide accurate road segmentation in the domain of their training data but cannot be trusted in out-of-distribution scenarios. Including the whole distribution in the trainset is challenging as each sample must be labeled by hand. Trajectory-based self-supervised methods offer a potential solution as they can learn from the traversed route without manual labels. However, existing trajectory-based methods use learning schemes that rely only on the camera or only on the lidar. In this paper, trajectory-based learning is implemented jointly with lidar and camera for increased performance. Our method outperforms recent standalone camera- and lidar-based methods when evaluated with a challenging winter driving dataset including countryside and suburb driving scenes. The source code is available at this https URL

[CV-43] GenMix: Effective Data Augmentation with Generative Diffusion Model Image Editing

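【Quick read】: This paper addresses the difficulty traditional data augmentation has in improving generalization for visual classification when source and target domains differ, as in domain adaptation. The key to the solution is GenMix, a generalizable prompt-guided generative data augmentation approach. GenMix uses image editing to generate augmented images from custom conditional prompts, then blends portions of the input image with its edited generative counterpart and incorporates fractal patterns; this mitigates unrealistic images and label ambiguity and improves the performance and adversarial robustness of models on in-domain and cross-domain classification. Extensive experiments on eight public datasets demonstrate its effectiveness for general and fine-grained classification, with stronger performance than existing state-of-the-art methods.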

链接: https://arxiv.org/abs/2412.02366
作者: Khawar Islam,Muhammad Zaigham Zaheer,Arif Mahmood,Karthik Nandakumar,Naveed Akhtar
关键词-EN: visual classification tasks, generalization in visual, Data augmentation, classification tasks, enhance generalization
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Data augmentation is widely used to enhance generalization in visual classification tasks. However, traditional methods struggle when source and target domains differ, as in domain adaptation, due to their inability to address domain gaps. This paper introduces GenMix, a generalizable prompt-guided generative data augmentation approach that enhances both in-domain and cross-domain image classification. Our technique leverages image editing to generate augmented images based on custom conditional prompts, designed specifically for each problem type. By blending portions of the input image with its edited generative counterpart and incorporating fractal patterns, our approach mitigates unrealistic images and label ambiguity, improving the performance and adversarial robustness of the resulting models. Efficacy of our method is established with extensive experiments on eight public datasets for general and fine-grained classification, in both in-domain and cross-domain settings. Additionally, we demonstrate performance improvements for self-supervised learning, learning with data scarcity, and adversarial robustness. As compared to the existing state-of-the-art methods, our technique achieves stronger performance across the board.

[CV-44] Realistic Surgical Simulation from Monocular Videos

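[Quick read]: This paper addresses the challenge of automatically generating realistic surgical simulations from readily available surgical videos. The key to the solution is SurgiSim, a novel automatic simulation system. It builds the simulation environment by maintaining a canonical 3D scene composed of 3D Gaussians coupled with a deformation field, and improves geometric consistency through multi-stage optimization with trajectory and anisotropic regularization. SurgiSim also adopts a visco-elastic deformation model based on the Maxwell model to simulate the complex deformations of soft tissue, and infers the tissue's physical parameters by minimizing the discrepancy between the input video and the simulation results, which keeps the simulation realistic.

Link: https://arxiv.org/abs/2412.02359
Authors: Kailing Wang,Chen Yang,Keyang Zhao,Xiaokang Yang,Wei Shen
Keywords: automatically performing realistic, tackles the challenge, challenge of automatically, automatically performing, simulation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: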

Abstract:This paper tackles the challenge of automatically performing realistic surgical simulations from readily available surgical videos. Recent efforts have successfully integrated physically grounded dynamics within 3D Gaussians to perform high-fidelity simulations in well-reconstructed simulation environments from static scenes. However, they struggle with the geometric inconsistency in reconstructing simulation environments and unrealistic physical deformations in simulations of soft tissues when it comes to dynamic and complex surgical processes. In this paper, we propose SurgiSim, a novel automatic simulation system to overcome these limitations. To build a surgical simulation environment, we maintain a canonical 3D scene composed of 3D Gaussians coupled with a deformation field to represent a dynamic surgical scene. This process involves a multi-stage optimization with trajectory and anisotropic regularization, enhancing the geometry consistency of the canonical scene, which serves as the simulation environment. To achieve realistic physical simulations in this environment, we implement a Visco-Elastic deformation model based on the Maxwell model, effectively restoring the complex deformations of tissues. Additionally, we infer the physical parameters of tissues by minimizing the discrepancies between the input video and simulation results guided by estimated tissue motion, ensuring realistic simulation outcomes. Experiments on various surgical scenarios and interactions demonstrate SurgiSim’s ability to perform realistic simulation of soft tissues among surgical procedures, showing its enormous potential for enhancing surgical training, planning, and robotic surgery systems. The project page is at this https URL.
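As background for the Maxwell model named in the abstract: the classical one-dimensional Maxwell element (a spring of stiffness E in series with a dashpot of viscosity eta) satisfies d(strain)/dt = (1/E) d(stress)/dt + stress/eta. The sketch below integrates the stress relaxation that follows when the strain is held constant. It is a textbook scalar illustration, not SurgiSim's 3D formulation; the function name and parameter values are assumptions chosen for the example.

```python
import math

def maxwell_relaxation(sigma0, E, eta, dt, steps):
    """Stress relaxation of a 1D Maxwell element under constant strain.

    With d(strain)/dt = 0 the model reduces to d(sigma)/dt = -(E / eta) * sigma,
    whose exact solution over one time step dt is a multiplication by exp(-E * dt / eta).
    """
    decay = math.exp(-E / eta * dt)
    sigma = sigma0
    history = [sigma]
    for _ in range(steps):
        sigma *= decay  # exact update for this linear ODE
        history.append(sigma)
    return history

# Illustrative values: relaxation time eta / E = 2.0, simulated for 20 steps of 0.1.
stress = maxwell_relaxation(sigma0=1.0, E=2.0, eta=4.0, dt=0.1, steps=20)
```

The ratio eta / E is the relaxation time: after that much simulated time the stress has dropped to 1/e of its initial value. Fitting such parameters to observed motion is, in spirit, what the abstract describes as inferring the physical parameters of tissues.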

[CV-45] Dual Exposure Stereo for Extended Dynamic Range 3D Imaging

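[Quick read]: This paper addresses robust stereo 3D imaging under diverse illumination. The main challenge is that a camera's dynamic range (DR) is far smaller than that of the real world, so existing stereo depth estimation methods lose accuracy on under- or over-exposed images. The key to the solution is dual-exposure stereo imaging: an automatic dual-exposure control method adjusts the two exposures and diverges them when the scene DR exceeds the camera DR, which provides information over a broader dynamic range. From the captured dual-exposure stereo images, a motion-aware dual-exposure stereo network estimates depth. Experiments show that the method outperforms other exposure control methods.

Link: https://arxiv.org/abs/2412.02351
Authors: Juhyung Choi,Jinnyeong Kim,Seokjun Choi,Jinwoo Lee,Samuel Brucker,Mario Bijelic,Felix Heide,Seung-Hwan Baek
Keywords: Achieving robust stereo, diverse illumination conditions, Achieving robust, limited dynamic ranges, challenging task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: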

Abstract:Achieving robust stereo 3D imaging under diverse illumination conditions is an important yet challenging task, due to the limited dynamic ranges (DRs) of cameras, which are significantly smaller than real-world DR. As a result, the accuracy of existing stereo depth estimation methods is often compromised by under- or over-exposed images. Here, we introduce dual-exposure stereo for extended dynamic range 3D imaging. We develop an automatic dual-exposure control method that adjusts the dual exposures, diverging them when the scene DR exceeds the camera DR, thereby providing information about broader DR. From the captured dual-exposure stereo images, we estimate depth using a motion-aware dual-exposure stereo network. To validate our method, we develop a robot-vision system, collect stereo video datasets, and generate a synthetic dataset. Our method outperforms other exposure control methods.

[CV-46] UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices

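[Quick read]: This paper addresses the high memory and computational demands of deploying Transformer-based architectures on edge devices. The key to the solution is a new mechanism called Reuse Attention, which consolidates the per-head computations of multi-head attention (MHA) into a single shared attention matrix and thereby significantly reduces memory overhead and computational complexity. This enables high-performance image classification on resource-constrained platforms, with faster inference and better memory scalability than existing attention mechanisms such as Linear Attention and Flash Attention. In the reported experiments, UniForm-l with Reuse Attention reaches 76.7% Top-1 accuracy on ImageNet-1K with a 21.8 ms inference time on edge devices such as the Jetson AGX Orin, up to a 5x speedup over competing methods.

Link: https://arxiv.org/abs/2412.02344
Authors: Seul-Ki Yeom,Tae-Ho Kim
Keywords: demonstrated remarkable success, remains challenging due, Transformer-based architectures, Reuse Attention, devices remains challenging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 Pages, 8 Tables, 7 Figures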

Abstract:Transformer-based architectures have demonstrated remarkable success across various domains, but their deployment on edge devices remains challenging due to high memory and computational demands. In this paper, we introduce a novel Reuse Attention mechanism, tailored for efficient memory access and computational optimization, enabling seamless operation on resource-constrained platforms without compromising performance. Unlike traditional multi-head attention (MHA), which redundantly computes separate attention matrices for each head, Reuse Attention consolidates these computations into a shared attention matrix, significantly reducing memory overhead and computational complexity. Comprehensive experiments on ImageNet-1K and downstream tasks show that the proposed UniForm models leveraging Reuse Attention achieve state-of-the-art ImageNet classification accuracy while outperforming existing attention mechanisms, such as Linear Attention and Flash Attention, in inference speed and memory scalability. Notably, UniForm-l achieves a 76.7% Top-1 accuracy on ImageNet-1K with 21.8ms inference time on edge devices like the Jetson AGX Orin, representing up to a 5x speedup over competing benchmark methods. These results demonstrate the versatility of Reuse Attention across high-performance GPUs and edge platforms, paving the way for broader real-time applications.
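The idea of a shared attention matrix can be sketched as below. This is a rough illustration under stated assumptions, not the UniForm implementation: it assumes a single query/key projection produces one attention matrix, which every head then applies to its own slice of the value projection. The function names and weight shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reuse_attention(x, w_q, w_k, w_v, num_heads):
    """Assumed form of shared attention: one (tokens x tokens) matrix for all heads.

    x: (tokens, dim) input; w_q, w_k: (dim, d_k); w_v: (dim, d_v), d_v divisible by num_heads.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # computed once and stored once
    value_heads = np.split(v, num_heads, axis=-1)   # each head keeps its own values
    out = np.concatenate([attn @ h for h in value_heads], axis=-1)
    return out, attn

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out, attn = reuse_attention(x, rng.normal(size=(8, 4)), rng.normal(size=(8, 4)),
                            rng.normal(size=(8, 8)), num_heads=2)
```

Standard MHA would hold `num_heads` separate (tokens x tokens) matrices here, so the saving grows with the head count and the square of the token count. In this reduced form the output equals `attn @ v`; any per-head processing in the actual model is outside what the abstract specifies.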

[CV-47] Amodal Depth Anything: Amodal Depth Estimation in the Wild

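[Quick read]: This paper addresses amodal depth estimation, that is, estimating the depth of the occluded parts of objects in a scene. The key to the solution is a new formulation based on relative depth prediction, which improves generalization across diverse natural images. The paper introduces the Amodal Depth In the Wild (ADIW) dataset, generates depth maps with large pre-trained depth models, and uses a scale-and-shift alignment strategy to refine and blend the depth predictions so that the annotations stay consistent. It also proposes two complementary frameworks: Amodal-DAV2, a deterministic model based on Depth Anything V2, and Amodal-DepthFM, a generative model that integrates conditional flow matching. Both achieve high-quality amodal depth prediction with minimal modifications to large pre-trained models. On the ADIW dataset, the method improves accuracy by 69.5% over the previous state of the art (SoTA).

Link: https://arxiv.org/abs/2412.02336
Authors: Zhenyu Li,Mykola Lavreniuk,Jian Shi,Shariq Farooq Bhat,Peter Wonka
Keywords: Amodal depth, Amodal depth estimation, depth, depth estimation aims, parts of objects
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: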

Abstract:Amodal depth estimation aims to predict the depth of occluded (invisible) parts of objects in a scene. This task addresses the question of whether models can effectively perceive the geometry of occluded regions based on visible cues. Prior methods primarily rely on synthetic datasets and focus on metric depth estimation, limiting their generalization to real-world settings due to domain shifts and scalability challenges. In this paper, we propose a novel formulation of amodal depth estimation in the wild, focusing on relative depth prediction to improve model generalization across diverse natural images. We introduce a new large-scale dataset, Amodal Depth In the Wild (ADIW), created using a scalable pipeline that leverages segmentation datasets and compositing techniques. Depth maps are generated using large pre-trained depth models, and a scale-and-shift alignment strategy is employed to refine and blend depth predictions, ensuring consistency in ground-truth annotations. To tackle the amodal depth task, we present two complementary frameworks: Amodal-DAV2, a deterministic model based on Depth Anything V2, and Amodal-DepthFM, a generative model that integrates conditional flow matching principles. Our proposed frameworks effectively leverage the capabilities of large pre-trained models with minimal modifications to achieve high-quality amodal depth predictions. Experiments validate our design choices, demonstrating the flexibility of our models in generating diverse, plausible depth structures for occluded regions. Our method achieves a 69.5% improvement in accuracy over the previous SoTA on the ADIW dataset.
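The scale-and-shift alignment mentioned in the abstract is commonly done as a least-squares fit of two scalars. The sketch below shows that generic technique, assuming the alignment is fitted over pixels where both depth maps are valid; the function name `align_scale_shift` and the choice of mask are illustrative, and the paper's exact procedure may differ.

```python
import numpy as np

def align_scale_shift(pred, ref, mask):
    """Fit scalars s, t minimizing ||s * pred + t - ref||^2 over mask, then apply them.

    pred, ref: depth maps of the same shape; mask: boolean array selecting the fit region.
    """
    p = pred[mask]
    r = ref[mask]
    design = np.stack([p, np.ones_like(p)], axis=1)  # columns: [pred, 1]
    (s, t), *_ = np.linalg.lstsq(design, r, rcond=None)
    return s * pred + t, s, t

rng = np.random.default_rng(0)
pred = rng.random((4, 6))
ref = 2.0 * pred + 0.5                 # reference differs by a known scale and shift
mask = np.ones_like(pred, dtype=bool)
mask[0, :] = False                     # pretend the first row is not usable for fitting
aligned, s, t = align_scale_shift(pred, ref, mask)
```

Because relative depth is only defined up to an affine transform, fitting on the overlapping region and applying the result to the whole map is what allows two predictions to be blended consistently.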

[CV-48] SimuScope: Realistic Endoscopic Synthetic Dataset Generation through Surgical Simulation and Diffusion Models WACV

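[Quick read]: This paper addresses the shortage of training data and the difficulty of annotation for deep learning models in computer-assisted surgical (CAS) systems. The key to the solution is a multi-stage pipeline in which a fully-fledged surgical simulator generates realistic synthetic data and automatically produces all annotations needed by modern CAS systems. The simulator yields a richer set of annotations than public synthetic datasets and a more complex and realistic simulation of surgical interactions, including the dynamics between surgical instruments and deformable anatomical environments. In addition, the paper proposes a lightweight and flexible image-to-image translation method based on Stable Diffusion (SD) and Low-Rank Adaptation (LoRA) to further narrow the visual gap between synthetic and real data. The method needs only a limited amount of annotated data, trains efficiently, and preserves the integrity of the simulator-generated annotations, which improves both training and CAS guidance.

Link: https://arxiv.org/abs/2412.02332
Authors: Sabina Martyniak,Joanna Kaleta,Diego Dall’Alba,Michał Naskręt,Szymon Płotka,Przemysław Korzeniowski
Keywords: providing advanced support, enhance surgical execution, Computer-assisted surgical, support to surgeons, execution and outcomes
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025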

Abstract:Computer-assisted surgical (CAS) systems enhance surgical execution and outcomes by providing advanced support to surgeons. These systems often rely on deep learning models trained on complex, challenging-to-annotate data. While synthetic data generation can address these challenges, enhancing the realism of such data is crucial. This work introduces a multi-stage pipeline for generating realistic synthetic data, featuring a fully-fledged surgical simulator that automatically produces all necessary annotations for modern CAS systems. This simulator generates a wide set of annotations that surpass those available in public synthetic datasets. Additionally, it offers a more complex and realistic simulation of surgical interactions, including the dynamics between surgical instruments and deformable anatomical environments, outperforming existing approaches. To further bridge the visual gap between synthetic and real data, we propose a lightweight and flexible image-to-image translation method based on Stable Diffusion (SD) and Low-Rank Adaptation (LoRA). This method leverages a limited amount of annotated data, enables efficient training, and maintains the integrity of annotations generated by our simulator. The proposed pipeline is experimentally validated and can translate synthetic images into images with real-world characteristics, which can generalize to real-world context, thereby improving both training and CAS guidance. The code and the dataset are available at this https URL.

[CV-49] Controlling the Latent Diffusion Model for Generative Image Shadow Removal via Residual Generation

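[Quick read]: This paper addresses the difficulty of applying large-scale generative models to image shadow removal. Such models tend to generate diverse, realistic details without enough attention to fidelity, so they fail to meet the task's requirement of precise content preservation. The core of the solution is to use a diffusion model to generate and refine image residuals, which makes full use of the detail already present in the shadowed image and yields a more efficient and faithful reconstruction of shadow-free content. The paper also proposes a cross-timestep self-enhancement training strategy, in which the network augments its own training data and dynamically corrects its generation trajectory for more accurate and robust output. Finally, a content-preserved encoder-decoder structure with a control mechanism and multi-scale skip connections is designed to achieve high-fidelity shadow-free reconstruction.

Link: https://arxiv.org/abs/2412.02322
Authors: Xinjie Li,Yang Zhao,Dong Wang,Yuan Chen,Li Cao,Xiaoping Liu
Keywords: achieved remarkable advancements, Large-scale generative models, images remains challenging, Large-scale generative, visual tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 10 figures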

Abstract:Large-scale generative models have achieved remarkable advancements in various visual tasks, yet their application to shadow removal in images remains challenging. These models often generate diverse, realistic details without adequate focus on fidelity, failing to meet the crucial requirements of shadow removal, which necessitates precise preservation of image content. In contrast to prior approaches that aimed to regenerate shadow-free images from scratch, this paper utilizes diffusion models to generate and refine image residuals. This strategy fully uses the inherent detailed information within shadowed images, resulting in a more efficient and faithful reconstruction of shadow-free content. Additionally, to prevent the accumulation of errors during the generation process, a cross-timestep self-enhancement training strategy is proposed. This strategy leverages the network itself to augment the training data, not only increasing the volume of data but also enabling the network to dynamically correct its generation trajectory, ensuring a more accurate and robust output. In addition, to address the loss of original details in the process of image encoding and decoding of large generative models, a content-preserved encoder-decoder structure is designed with a control mechanism and multi-scale skip connections to achieve high-fidelity shadow-free image reconstruction. Experimental results demonstrate that the proposed method can reproduce high-quality results based on a large latent diffusion prior and faithfully preserve the original contents in shadow regions.

[CV-50] HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset

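[Quick read]: This paper addresses the lack of a comprehensive dataset for automatic rigging of 3D humanoid character models. The key to the solution is HumanRig, the first large-scale dataset designed specifically for 3D humanoid character rigging, containing 11,434 carefully curated T-posed meshes that follow a uniform skeleton topology. Building on this dataset, the paper introduces a data-driven automatic rigging framework that combines a Prior-Guided Skeleton Estimator (PGSE) with a Mesh-Skeleton Mutual Attention Network (MSMAN). PGSE uses 2D skeleton joints to provide a preliminary 3D skeleton, and MSMAN fuses skeleton features with 3D mesh features extracted by a U-shaped point transformer. This enables coarse-to-fine 3D skeleton joint regression and robust skinning estimation, surpassing previous methods in quality and versatility.

Link: https://arxiv.org/abs/2412.02317
Authors: Zedong Chu,Feng Xiong,Meiduo Liu,Jinzhi Zhang,Mingqi Shao,Zhaoxu Sun,Di Wang,Mu Xu
Keywords: humanoid character models, humanoid character rigging, generation algorithms, humanoid character, cost of producing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL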

Abstract:With the rapid evolution of 3D generation algorithms, the cost of producing 3D humanoid character models has plummeted, yet the field is impeded by the lack of a comprehensive dataset for automatic rigging, which is a pivotal step in character animation. Addressing this gap, we present HumanRig, the first large-scale dataset specifically designed for 3D humanoid character rigging, encompassing 11,434 meticulously curated T-posed meshes adhering to a uniform skeleton topology. Capitalizing on this dataset, we introduce an innovative, data-driven automatic rigging framework, which overcomes the limitations of GNN-based methods in handling complex AI-generated meshes. Our approach integrates a Prior-Guided Skeleton Estimator (PGSE) module, which uses 2D skeleton joints to provide a preliminary 3D skeleton, and a Mesh-Skeleton Mutual Attention Network (MSMAN) that fuses skeleton features with 3D mesh features extracted by a U-shaped point transformer. This enables a coarse-to-fine 3D skeleton joint regression and a robust skinning estimation, surpassing previous methods in quality and versatility. This work not only remedies the dataset deficiency in rigging research but also propels the animation industry towards more efficient and automated character rigging pipelines.

[CV-51] LoCo: Low-Contrast-Enhanced Contrastive Learning for Semi-Supervised Endoscopic Image Segmentation

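[Quick read]: This paper addresses precise segmentation of endoscopic images, which is challenging because of limited annotations and low contrast. The key to the solution is LoCo, a semi-supervised segmentation framework that uses low-contrast-enhanced contrastive learning (LCC) to exploit large amounts of unlabeled data and improve segmentation accuracy and robustness. LCC includes two core strategies, inter-class contrast enhancement (ICE) and boundary contrast enhancement (BCE), which make low-contrast pixels more distinguishable so that the model can better segment them among malignant tumors, benign tumors, and normal tissue. The paper also designs a confidence-based dynamic filter (CDF) for pseudo-label selection, with a particular focus on minority classes, which further improves the use of unlabeled data.

Link: https://arxiv.org/abs/2412.02314
Authors: Lingcong Cai,Yun Li,Xiaomao Fan,Kaixuan Song,Yongcheng Li,Yixuan Yuan,Ruxin Wang,Wenbin Lei
Keywords: endoscopic images plays, diagnosis and treatment, plays a vital, vital role, role in computer-aided
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: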

Abstract:The segmentation of endoscopic images plays a vital role in computer-aided diagnosis and treatment. The advancements in deep learning have led to the employment of numerous models for endoscopic tumor segmentation, achieving promising segmentation performance. Despite recent advancements, precise segmentation remains challenging due to limited annotations and the issue of low contrast. To address these issues, we propose a novel semi-supervised segmentation framework termed LoCo via low-contrast-enhanced contrastive learning (LCC). This innovative approach effectively harnesses the vast amounts of unlabeled data available for endoscopic image segmentation, improving both accuracy and robustness in the segmentation process. Specifically, LCC incorporates two advanced strategies to enhance the distinctiveness of low-contrast pixels: inter-class contrast enhancement (ICE) and boundary contrast enhancement (BCE), enabling models to segment low-contrast pixels among malignant tumors, benign tumors, and normal tissues. Additionally, a confidence-based dynamic filter (CDF) is designed for pseudo-label selection, enhancing the utilization of generated pseudo-labels for unlabeled data with a specific focus on minority classes. Extensive experiments conducted on two public datasets, as well as a large proprietary dataset collected over three years, demonstrate that LoCo achieves state-of-the-art results, significantly outperforming previous methods. The source code of LoCo is available at the URL of this https URL.

[CV-52] Noisy Ostracods: A Fine-Grained Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods

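[Quick read]: This paper addresses the poor performance of existing machine learning methods on fine-grained genus and species classification of ostracods (small crustaceans) when the dataset contains noise from multiple sources. The key contribution is the creation and public release of the Noisy Ostracods dataset, which contains several types of noise, including open-set noise and pseudo-classes, together with a highly imbalanced class distribution (imbalance factor ρ = 22429). By providing a real-world dataset with complex noise characteristics, the paper aims to advance noise-robust machine learning methods, especially those that can handle diverse, real-world noise in fine-grained classification.

Link: https://arxiv.org/abs/2412.02313
Authors: Jiamian Hu,Yuanyuan Hong,Yihua Chen,He Wang,Moriaki Yasuhara
Keywords: Noisy Ostracods dataset, Noisy Ostracods, Ostracods dataset, Noisy, dataset
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Initial submit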

Abstract:We present the Noisy Ostracods, a noisy dataset for genus and species classification of crustacean ostracods with specialists’ annotations. Of the 71466 specimens collected, 5.58% are estimated to be noisy (possibly problematic) at genus level. The dataset is created to address a real-world challenge: creating a clean fine-grained taxonomy dataset. The Noisy Ostracods dataset has diverse noises from multiple sources. Firstly, the noise is open-set, including new classes discovered during curation that were not part of the original annotation. The dataset has pseudo-classes, where annotators misclassified samples that should belong to an existing class into a new pseudo-class. The Noisy Ostracods dataset is highly imbalanced with an imbalance factor \rho = 22429. This presents a unique challenge for robust machine learning methods, as existing approaches have not been extensively evaluated on fine-grained classification tasks with such diverse real-world noise. Initial experiments using current robust learning techniques have not yielded significant performance improvements on the Noisy Ostracods dataset compared to cross-entropy training on the raw, noisy data. On the other hand, noise detection methods have underperformed in error hit rate compared to naive cross-validation ensembling for identifying problematic labels. These findings suggest that the fine-grained, imbalanced nature, and complex noise characteristics of the dataset present considerable challenges for existing noise-robust algorithms. By openly releasing the Noisy Ostracods dataset, our goal is to encourage further research into the development of noise-resilient machine learning methods capable of effectively handling diverse, real-world noise in fine-grained classification tasks. The dataset, along with its evaluation protocols, can be accessed at this https URL.
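The imbalance factor quoted in the abstract is not defined there. A common convention in the long-tailed learning literature is the ratio between the sizes of the largest and smallest classes; the sketch below assumes that convention, and both the function name and the toy labels are illustrative.

```python
from collections import Counter

def imbalance_factor(labels):
    """Ratio of the largest class size to the smallest class size (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy example: class sizes 6, 3, and 2.
toy_labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
rho = imbalance_factor(toy_labels)
```

Under this reading, a factor of 22429 would mean the most frequent class has about 22429 times as many specimens as the rarest one.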

[CV-53] Active Learning via Classifier Impact and Greedy Selection for Interactive Image Retrieval

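[Quick read]: This paper addresses how to reduce annotation cost in interactive image retrieval, a setting characterized by open-set, class-imbalanced binary classification that starts from very few labeled samples. The key to the solution is GAL (Greedy Active Learning), a new batch-mode active learning framework. It introduces a new acquisition function for sample selection that measures the impact of each unlabeled sample on the classifier, and embeds it in a greedy selection scheme that makes better use of the samples within each batch. The framework is evaluated with linear (SVM) and non-linear (MLP and Gaussian Process) classifiers, and a theoretical guarantee on the greedy approximation is given for the Gaussian Process case. Experiments show that GAL outperforms existing approaches and common baselines on interactive content-based image retrieval.

Link: https://arxiv.org/abs/2412.02310
Authors: Leah Bar,Boaz Lerner,Nir Darshan,Rami Ben-Ari
Keywords: reducing annotation costs, Active Learning, Greedy Active Learning, user-interactive approach aimed, Active Learning framework
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Accepted to Transactions on Machine Learning Research (TMLR)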

Abstract:Active Learning (AL) is a user-interactive approach aimed at reducing annotation costs by selecting the most crucial examples to label. Although AL has been extensively studied for image classification tasks, the specific scenario of interactive image retrieval has received relatively little attention. This scenario presents unique characteristics, including an open-set and class-imbalanced binary classification, starting with very few labeled samples. We introduce a novel batch-mode Active Learning framework named GAL (Greedy Active Learning) that better copes with this application. It incorporates a new acquisition function for sample selection that measures the impact of each unlabeled sample on the classifier. We further embed this strategy in a greedy selection approach, better exploiting the samples within each batch. We evaluate our framework with both linear (SVM) and non-linear MLP/Gaussian Process classifiers. For the Gaussian Process case, we show a theoretical guarantee on the greedy approximation. Finally, we assess our performance for the interactive content-based image retrieval task on several benchmarks and demonstrate its superiority over existing approaches and common baselines. Code is available at this https URL.

[CV-54] Partial Non-rigid Deformations and interpolations of Human Body Surfaces

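[Quick read]: This paper addresses partial deformation in non-rigid shape deformation, specifically for 3D surface meshes of the human body. The key to the solution is a new method called Partial Non-rigid Deformations and interpolations of the human body Surfaces (PaNDAS). Built on recent deep models, it learns local and global deformations, can flexibly restrict a deformation to specific parts of the shape, and allows poses from the database to be mixed and combined, all without optimization at inference time. The method shows state-of-the-art accuracy and greater locality when generating new shapes, interpolating between parts of shapes, and performing other shape manipulation tasks.

Link: https://arxiv.org/abs/2412.02306
Authors: Thomas Besnier,Emery Pierson,Sylvain Arguillere,Mohamed Daoudi
Keywords: Partial Non-rigid Deformations, partial deformations effectively, pose significant challenges, handle partial deformations, existing methods struggle
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: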

Abstract:Non-rigid shape deformations pose significant challenges, and most existing methods struggle to handle partial deformations effectively. We present Partial Non-rigid Deformations and interpolations of the human body Surfaces (PaNDAS), a new method to learn local and global deformations of 3D surface meshes by building on recent deep models. Unlike previous approaches, our method enables restricting deformations to specific parts of the shape in a versatile way and allows for mixing and combining various poses from the database, all while not requiring any optimization at inference time. We demonstrate that the proposed framework can be used to generate new shapes, interpolate between parts of shapes, and perform other shape manipulation tasks with state-of-the-art accuracy and greater locality across various types of human surface data. Code and data will be made available soon.

[CV-55] Viewpoint Consistency in 3D Generation via Attention and CLIP Guidance

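[Quick read]: This paper addresses the geometric inconsistency in text-to-3D generation known as the Janus Problem. The key to the solution is to identify and correct the viewpoint generation bias in diffusion models with a tuning-free approach, the Attention and CLIP Guidance (ACG) mechanism. ACG enhances the desired viewpoints by adaptively controlling cross-attention maps, uses CLIP-based view-text similarity to filter out erroneous viewpoints, and applies a coarse-to-fine optimization strategy with staged prompts to progressively refine the 3D generation. Experiments show that the method significantly reduces the Janus Problem without compromising generation speed, which makes it an efficient plug-and-play component for existing text-to-3D frameworks.

Link: https://arxiv.org/abs/2412.02287
Authors: Qing Zhang,Zehao Chen,Jinguang Tong,Jing Zhang,Jie Hong,Xuesong Li
Keywords: Janus Problem, geometric inconsistencies, commonly referred, recent advances, suffer from geometric
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: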

Abstract:Despite recent advances in text-to-3D generation techniques, current methods often suffer from geometric inconsistencies, commonly referred to as the Janus Problem. This paper identifies the root cause of the Janus Problem: viewpoint generation bias in diffusion models, which creates a significant gap between the actual generated viewpoint and the expected one required for optimizing the 3D model. To address this issue, we propose a tuning-free approach called the Attention and CLIP Guidance (ACG) mechanism. ACG enhances desired viewpoints by adaptively controlling cross-attention maps, employs CLIP-based view-text similarities to filter out erroneous viewpoints, and uses a coarse-to-fine optimization strategy with staged prompts to progressively refine 3D generation. Extensive experiments demonstrate that our method significantly reduces the Janus Problem without compromising generation speed, establishing ACG as an efficient, plug-and-play component for existing text-to-3D frameworks.

[CV-56] AH-OCDA: Amplitude-based Curriculum Learning and Hopfield Segmentation Model for Open Compound Domain Adaptation WACV2025

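[Quick read]: This paper addresses Open Compound Domain Adaptation (OCDA): adapting across a source domain, a target compound domain, and an unseen open domain when neither domain labels nor pixel-level segmentation labels are available for the compound and open domains. The key to the solution is AH-OCDA, which combines amplitude-based curriculum learning with a Hopfield segmentation model. Amplitude-based curriculum learning ranks unlabeled compound-domain images using the Fast Fourier Transform (FFT) and gradually guides the semantic segmentation model from the near-source compound domain to the far-source compound domain. The Hopfield segmentation model maps segmentation feature distributions from arbitrary domains to the feature distribution of the source domain. The method achieves state-of-the-art performance on two OCDA benchmarks and extended open domains, demonstrating its adaptability to continuously changing compound domains and unseen open domains.

Link: https://arxiv.org/abs/2412.02280
Authors: Jaehyun Choi,Junwon Ko,Dong-Jae Lee,Junmo Kim
Keywords: Open compound domain, compound domain adaptation, compound domain, Hopfield segmentation model, domain adaptation
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025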

Abstract:Open compound domain adaptation (OCDA) is a practical domain adaptation problem that consists of a source domain, target compound domain, and unseen open domain. In this problem, the absence of domain labels and pixel-level segmentation labels for both compound and open domains poses challenges to the direct application of existing domain adaptation and generalization methods. To address this issue, we propose Amplitude-based curriculum learning and a Hopfield segmentation model for Open Compound Domain Adaptation (AH-OCDA). Our method comprises two complementary components: 1) amplitude-based curriculum learning and 2) Hopfield segmentation model. Without prior knowledge of target domains within the compound domains, amplitude-based curriculum learning gradually induces the semantic segmentation model to adapt from the near-source compound domain to the far-source compound domain by ranking unlabeled compound domain images through Fast Fourier Transform (FFT). Additionally, the Hopfield segmentation model maps segmentation feature distributions from arbitrary domains to the feature distributions of the source domain. AH-OCDA achieves state-of-the-art performance on two OCDA benchmarks and extended open domains, demonstrating its adaptability to continuously changing compound domains and unseen open domains.
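One way to picture the FFT-based ranking is sketched below. The abstract does not state the ranking criterion, so this example assumes a simple one: the distance between an image's FFT amplitude spectrum and the mean amplitude spectrum of the source images, with smaller distances treated as "near-source". The function names are hypothetical.

```python
import numpy as np

def amplitude_spectrum(img):
    """Amplitude of the 2D FFT taken over the two spatial axes of an (H, W[, C]) array."""
    return np.abs(np.fft.fft2(img, axes=(0, 1)))

def rank_by_amplitude_distance(source_imgs, target_imgs):
    """Return target indices ordered from near-source to far-source (assumed criterion)."""
    source_amp = np.mean([amplitude_spectrum(s) for s in source_imgs], axis=0)
    dists = [np.linalg.norm(amplitude_spectrum(t) - source_amp) for t in target_imgs]
    return np.argsort(dists)

rng = np.random.default_rng(0)
base = rng.random((8, 8))
# Three synthetic "target" images that differ from the source by growing amounts.
targets = [base * 3.0, base + 0.01, base * 1.5]
order = rank_by_amplitude_distance([base], targets)
```

A curriculum would then feed the images to the segmentation model in this order, starting with those whose low-level statistics are closest to the source domain.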

[CV-57] PCIM: Learning Pixel Attributions via Pixel-wise Channel Isolation Mixing in High Content Imaging

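[Quick read]: This paper addresses the limited interpretability of deep neural network (DNN) decisions in computer vision, a black-box property that hinders wide acceptance, particularly in biomedical applications. The key to the solution is a new method, Pixel-wise Channel Isolation Mixing (PCIM), for computing pixel attribution maps. PCIM treats each pixel as a separate input channel and trains a blending layer to mix these pixels so that the result reflects a specific classification decision, without extracting internal network states or gradients. The method produces pixel attribution maps for arbitrary DNNs and shows state-of-the-art model fidelity and localization ability on fluorescence and bright-field high content imaging, which supports interpretability and trust.

Link: https://arxiv.org/abs/2412.02275
Authors: Daniel Siegismund,Mario Wieser,Stephan Heyse,Stephan Steigele
Keywords: Deep Neural Networks, Deep Neural, computer vision tasks, shown remarkable success, Neural Networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: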

Abstract:Deep Neural Networks (DNNs) have shown remarkable success in various computer vision tasks. However, their black-box nature often leads to difficulty in interpreting their decisions, creating an unmet need for methods to explain the decisions, and ultimately forming a barrier to their wide acceptance, especially in biomedical applications. This work introduces a novel method, Pixel-wise Channel Isolation Mixing (PCIM), to calculate pixel attribution maps, highlighting the image parts most crucial for a classification decision without the need to extract internal network states or gradients. Unlike existing methods, PCIM treats each pixel as a distinct input channel and trains a blending layer to mix these pixels, reflecting specific classifications. This unique approach allows the generation of pixel attribution maps for each image while remaining agnostic to the choice of the underlying classification network. Benchmark testing on three application-relevant, diverse high content imaging datasets shows state-of-the-art performance, particularly for model fidelity and localization ability in both fluorescence and bright-field high content imaging. PCIM contributes a unique and effective method for creating pixel-level attribution maps from arbitrary DNNs, enabling interpretability and trust.

[CV-58] Sustainable Self-evolution Adversarial Training

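[Quick read]: This paper addresses the difficulty that existing adversarial-training defense models have in sustaining defense performance over the long term against dynamic, evolving attack methods. The key to the solution is the Sustainable Self-Evolution Adversarial Training (SSEAT) framework, whose continual adversarial defense pipeline learns from many kinds of adversarial examples across multiple stages. To counter the catastrophic forgetting caused by continually learning new attacks, the paper proposes an adversarial data replay module that selects more diverse and more critical data for relearning. It also designs a consistency regularization strategy that encourages the current defense model to learn more from previously trained models, retaining more past knowledge and preserving accuracy on clean samples.

Link: https://arxiv.org/abs/2412.02270
Authors: Wenxuan Wang,Chenglei Wang,Huihui Qi,Menghao Ye,Xuelin Qian,Peng Wang,Yanning Zhang
Keywords: computer vision tasks, deep neural network, generation strategies aimed, exploring model security, neural network models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ACMMM 2024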

Abstract:With the wide application of deep neural network models in various computer vision tasks, there has been a proliferation of adversarial example generation strategies aimed at deeply exploring model security. However, existing adversarial training defense models, which rely on single or limited types of attacks under a one-time learning process, struggle to adapt to the dynamic and evolving nature of attack methods. Therefore, to achieve defense performance improvements for models in long-term applications, we propose a novel Sustainable Self-Evolution Adversarial Training (SSEAT) framework. Specifically, we introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples across multiple stages. Additionally, to address the issue of model catastrophic forgetting caused by continual learning from ongoing novel attacks, we propose an adversarial data replay module to better select more diverse and key relearning data. Furthermore, we design a consistency regularization strategy to encourage current defense models to learn more from previously trained ones, guiding them to retain more past knowledge and maintain accuracy on clean samples. Extensive experiments have been conducted to verify the efficacy of the proposed SSEAT defense method, which demonstrates superior defense performance and classification accuracy compared to competitors.

[CV-59] GSGTrack: Gaussian Splatting-Guided Object Pose Tracking from RGB Videos

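[Quick read]: This paper addresses 6-degree-of-freedom (6DoF) pose tracking of unknown objects in monocular RGB video sequences, particularly when accurate depth information is unavailable. The key to the solution is GSGTrack, a new RGB-based pose tracking framework that jointly optimizes geometry and pose. Specifically, the paper uses 3D Gaussian Splatting to build an optimizable 3D representation, learned together with a graph-based geometry optimization, to capture the object's appearance features and refine its geometry. To cope with noise in the pose and geometry data, it proposes an object silhouette loss that reduces the oversensitivity of pixel-wise losses to pose noise. To mitigate the geometric ambiguity caused by inaccurate depth information, it also proposes a geometry-consistent image pair selection strategy that filters out low-confidence pairs and keeps the geometry optimization robust.

Link: https://arxiv.org/abs/2412.02267
Authors: Zhiyuan Chen,Fan Lu,Guo Yu,Bin Li,Sanqing Qu,Yuan Huang,Changhong Fu,Guang Chen
Keywords: monocular RGB video, RGB video sequences, monocular RGB, RGB video, robotic manipulation
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: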

Abstract:Tracking the 6DoF pose of unknown objects in monocular RGB video sequences is crucial for robotic manipulation. However, existing approaches typically rely on accurate depth information, which is non-trivial to obtain in real-world scenarios. Although depth estimation algorithms can be employed, geometric inaccuracy can lead to failures in RGBD-based pose tracking methods. To address this challenge, we introduce GSGTrack, a novel RGB-based pose tracking framework that jointly optimizes geometry and pose. Specifically, we adopt 3D Gaussian Splatting to create an optimizable 3D representation, which is learned simultaneously with a graph-based geometry optimization to capture the object’s appearance features and refine its geometry. However, the joint optimization process is susceptible to perturbations from noisy pose and geometry data. Thus, we propose an object silhouette loss to address the issue of pixel-wise loss being overly sensitive to pose noise during tracking. To mitigate the geometric ambiguities caused by inaccurate depth information, we propose a geometry-consistent image pair selection strategy, which filters out low-confidence pairs and ensures robust geometric optimization. Extensive experiments on the OnePose and HO3D datasets demonstrate the effectiveness of GSGTrack in both 6DoF pose tracking and object reconstruction.

[CV-60] Diabetic Retinopathy Classification from Retinal Images using Machine Learning Approaches

[Quick read]: This paper addresses early detection of Diabetic Retinopathy (DR) to help patients avoid blindness. The key to the solution is to extract DR features, such as properties of exudates, blood vessels, and microaneurysms, and to classify the stages of DR (healthy, mild non-proliferative, moderate non-proliferative, severe non-proliferative, and proliferative) with Support Vector Machine, Random Forest, and Naive Bayes classifiers. Random Forest performs best, with an accuracy of 76.5%, a sensitivity of 77.2%, and a specificity of 93.3%.

Link: https://arxiv.org/abs/2412.02265
Authors: Indronil Bhattacharjee,Al-Mahmud,Tareq Mahmud
Keywords: Diabetic Retinopathy, affects eyes, familiar diseases, diabetes complication, complication that affects
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 9 figures, 2 tables. International Conference on Advanced Engineering, Technology and Applications (ICAETA-2021), Istanbul, Turkey

Abstract:Diabetic Retinopathy is one of the most familiar diseases and is a diabetes complication that affects eyes. Initially, diabetic retinopathy may cause no symptoms or only mild vision problems. Eventually, it can cause blindness. So early detection of symptoms could help to avoid blindness. In this paper, we present some experiments on some features of diabetic retinopathy, like properties of exudates, properties of blood vessels and properties of microaneurysm. Using the features, we can classify healthy, mild non-proliferative, moderate non-proliferative, severe non-proliferative and proliferative stages of DR. Support Vector Machine, Random Forest and Naive Bayes classifiers are used to classify the stages. Finally, Random Forest is found to be the best for higher accuracy, sensitivity and specificity of 76.5%, 77.2% and 93.3% respectively.
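The abstract reports accuracy, sensitivity, and specificity for a five-stage problem but does not say how the per-class values were aggregated. As a rough sketch, assuming one-vs-rest counts averaged over classes (macro averaging), the three metrics could be computed from a confusion matrix as follows. The function name `macro_metrics` and the toy matrix are illustrative and do not come from the paper.

```python
def macro_metrics(confusion):
    """Return (accuracy, macro sensitivity, macro specificity).

    confusion[i][j] is the number of samples with true class i that were
    predicted as class j. Assumes every class has at least one true sample
    and at least one negative sample, so no denominator is zero.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    sensitivities, specificities = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp     # other, predicted c
        tn = total - tp - fn - fp
        sensitivities.append(tp / (tp + fn))
        specificities.append(tn / (tn + fp))

    return correct / total, sum(sensitivities) / n, sum(specificities) / n


# Hypothetical 3-class confusion matrix, for illustration only.
toy = [[8, 1, 1],
       [2, 6, 2],
       [0, 1, 9]]
accuracy, sensitivity, specificity = macro_metrics(toy)
```

With many classes, specificity tends to sit well above sensitivity because each one-vs-rest split has far more negatives than positives, which is consistent with the gap between the 77.2% and 93.3% figures quoted above.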

[CV-61] Composing Open-domain Vision with RAG for Ocean Monitoring and Conservation NEURIPS2024

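[Quick read]: This paper addresses species identification for marine biodiversity monitoring, especially in dynamic and diverse ocean environments where traditional top-down learning struggles with long-tailed distributions, generalization, and domain transfer. The key to the solution is a bottom-up, open-domain learning framework that combines pretrained vision-language models (VLMs) with retrieval-augmented generation (RAG) to provide resilient, scalable image and video analysis. The approach needs no domain-specific training or knowledge of the task, and a preliminary application to classifying fish on board fishing vessels shows notable retrieval and prediction capabilities.

Link: https://arxiv.org/abs/2412.02262
Authors: Sepand Dyanatkar,Angran Li,Alexander Dungate
Keywords: Climate change destruction, Climate change, change destruction, biodiversity is threatening, threatening communities
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to Climate Change AI Workshop at NeurIPS 2024. 9 pages, 6 figures, 1 table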

Abstract:Climate change’s destruction of marine biodiversity is threatening communities and economies around the world which rely on healthy oceans for their livelihoods. The challenge of applying computer vision to niche, real-world domains such as ocean conservation lies in the dynamic and diverse environments where traditional top-down learning struggles with long-tailed distributions, generalization, and domain transfer. Scalable species identification for ocean monitoring is particularly difficult due to the need to adapt models to new environments and identify rare or unseen species. To overcome these limitations, we propose leveraging bottom-up, open-domain learning frameworks as a resilient, scalable solution for image and video analysis in marine applications. Our preliminary demonstration uses pretrained vision-language models (VLMs) combined with retrieval-augmented generation (RAG) as grounding, leaving the door open for numerous architectural, training and engineering optimizations. We validate this approach through a preliminary application in classifying fish from video onboard fishing vessels, demonstrating impressive emergent retrieval and prediction capabilities without domain-specific training or knowledge of the task itself.

[CV-62] Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis

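[Quick read]: This paper addresses the data dependence of scene-aware motion synthesis, specifically the reliance on paired motion-scene data. The key to the solution is a unified framework called Diffusion Implicit Policy (DIP), which disentangles human-scene interaction from motion synthesis during training and introduces an interaction-based implicit policy into motion diffusion during inference. Through iterative diffusion denoising and implicit policy optimization, the generated motion stays both natural and plausible in its interactions. The implicit policy optimizes the intermediate noised motion in a GAN-inversion manner to maintain motion continuity, and controls keyframe poses through a ControlNet branch and motion inpainting. For long-term motion synthesis, the paper introduces motion blending, which fuses the motions of multiple sub-tasks in rotation power space and translation linear space to ensure stable transitions. Experiments on synthesized and real scenes show that the framework outperforms current state-of-the-art methods, indicating that it is feasible for motion synthesis in more general tasks and more varied scenes.

Link: https://arxiv.org/abs/2412.02261
Authors: Jingyu Gong,Chong Zhang,Fengqi Liu,Ke Fan,Qianyu Zhou,Xin Tan,Zhizhong Zhang,Yuan Xie,Lizhuang Ma
Keywords: Human motion generation, widely researched recently, researched recently due, Human motion, motion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: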

Abstract:Human motion generation is a long-standing problem, and scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data whose quantity is limited. Meanwhile, it is difficult to generalize to diverse scenes when trained only on a few specific ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, where paired motion-scene data are no longer necessary. In this framework, we disentangle human-scene interaction from motion synthesis during training and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion can be derived through iterative diffusion denoising and implicit policy optimization, thus motion naturalness and interaction plausibility can be maintained simultaneously. The proposed implicit policy optimizes the intermediate noised motion in a GAN Inversion manner to maintain motion continuity and control keyframe poses through the ControlNet branch and motion inpainting. For long-term motion synthesis, we introduce motion blending for stable transitions between multiple sub-tasks, where motions are fused in rotation power space and translation linear space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture, and real scenes from PROX and Replica. Results show that our framework presents better motion naturalness and interaction plausibility than cutting-edge methods. This also indicates the feasibility of utilizing the DIP for motion synthesis in more general tasks and versatile scenes. this https URL

[CV-63] VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation

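[Quick read]: This paper addresses the problems of logical storytelling and visual consistency that existing video generation models face when producing multi-shot, movie-like videos. The key to the solution is the VideoGen-of-Thought (VGoT) architecture, which generates multi-shot video through four core modules: 1) script generation, which turns a short story into detailed prompts for each shot; 2) keyframe generation, which creates keyframes consistent with the character portrayals; 3) shot-level video generation, which turns script and keyframe information into shots; and 4) a smoothing mechanism, which keeps the multi-shot output visually consistent. VGoT also uses a reasoned narrative design to ensure logical consistency and character development, and uses identity-preserving (IP) embeddings together with a cross-shot smoothing mechanism to maintain temporal and identity consistency, yielding high-quality, coherent multi-shot video.

Link: https://arxiv.org/abs/2412.02259
Authors: Mingzhe Zheng,Yongqi Xu,Haojian Huang,Xuran Ma,Yexin Liu,Wenjie Shu,Yatian Pang,Feilong Tang,Qifeng Chen,Harry Yang,Ser-Nam Lim
Keywords: Current video generation, generating short clips, video generation, Current video, generation models excel
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Webpage: this https URL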

Abstract:Current video generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos. Existing models trained on large-scale data on the back of rich computational resources are unsurprisingly inadequate for maintaining a logical storyline and visual consistency across multiple shots of a cohesive script since they are often trained with a single-shot objective. To this end, we propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation. VGoT is designed with three goals in mind as follows. Multi-Shot Video Generation: We divide the video generation process into a structured, modular sequence, including (1) Script Generation, which translates a curt story into detailed prompts for each shot; (2) Keyframe Generation, responsible for creating visually consistent keyframes faithful to character portrayals; (3) Shot-Level Video Generation, which transforms information from scripts and keyframes into shots; and (4) a Smoothing Mechanism that ensures a consistent multi-shot output. Reasonable Narrative Design: Inspired by cinematic scriptwriting, our prompt generation approach spans five key domains, ensuring logical consistency, character development, and narrative flow across the entire video. Cross-Shot Consistency: We ensure temporal and identity consistency by leveraging identity-preserving (IP) embeddings across shots, which are automatically created from the narrative. Additionally, we incorporate a cross-shot smoothing mechanism, which integrates a reset boundary that effectively combines latent features from adjacent shots, resulting in smooth transitions and maintaining visual coherence throughout the video. Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.

[CV-64] ProbPose: A Probabilistic Approach to 2D Human Pose Estimation KR

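[Quick read]: This paper addresses two main problems in current human pose estimation: state-of-the-art models ignore keypoints that fall outside the image, and they use uncalibrated heatmaps to represent keypoint locations. To address them, the paper proposes ProbPose, which predicts three things for each keypoint: the probability of keypoint presence at each location in the activation window, the probability that the keypoint lies outside the window, and its predicted visibility. The paper also introduces the CropCOCO dataset and the Extended OKS (Ex-OKS) metric to evaluate out-of-image keypoints. Tested on COCO, CropCOCO, and OCHuman, ProbPose significantly improves out-of-image keypoint localization, improves in-image localization through data augmentation, and makes the model more robust along bounding-box edges and more flexible in keypoint evaluation.

Link: https://arxiv.org/abs/2412.02254
Authors: Miroslav Purkrabek,Jiri Matas
Keywords: Current Human Pose, Human Pose Estimation, Pose Estimation methods, Current Human, Human Pose
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL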

Abstract:Current Human Pose Estimation methods have achieved significant improvements. However, state-of-the-art models ignore out-of-image keypoints and use uncalibrated heatmaps as keypoint location representations. To address these limitations, we propose ProbPose, which predicts for each keypoint: a calibrated probability of keypoint presence at each location in the activation window, the probability of being outside of it, and its predicted visibility. To address the lack of evaluation protocols for out-of-image keypoints, we introduce the CropCOCO dataset and the Extended OKS (Ex-OKS) metric, which extends OKS to out-of-image points. Tested on COCO, CropCOCO, and OCHuman, ProbPose shows significant gains in out-of-image keypoint localization while also improving in-image localization through data augmentation. Additionally, the model improves robustness along the edges of the bounding box and offers better flexibility in keypoint evaluation. The code and models are available on this https URL for research purposes.
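For context, Ex-OKS extends the standard COCO Object Keypoint Similarity (OKS). The abstract does not define the extension, so the sketch below covers only the standard metric, in which each labeled keypoint contributes exp(-d² / (2·s²·k²)), with d the prediction error, s² the object area, and k = 2σ a per-keypoint constant. The function name and argument layout are illustrative.

```python
import math


def oks(pred, gt, visibility, area, sigmas):
    """Standard COCO-style OKS for a single person instance.

    pred, gt:    lists of (x, y) keypoint coordinates
    visibility:  ground-truth visibility flags; keypoints with v <= 0 are skipped
    area:        object area in pixels squared (the s**2 term)
    sigmas:      per-keypoint falloff constants (k = 2 * sigma)
    """
    similarity_sum, labeled = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not count toward the score
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma
        similarity_sum += math.exp(-d2 / (2.0 * area * k * k))
        labeled += 1
    return similarity_sum / labeled if labeled else 0.0


# A perfect prediction scores 1.0; errors decay smoothly with distance.
score = oks(pred=[(10.0, 10.0), (20.0, 5.0)],
            gt=[(10.0, 10.0), (20.0, 5.0)],
            visibility=[2, 1],
            area=400.0,
            sigmas=[0.05, 0.1])
```

Because unlabeled keypoints are skipped entirely, the standard metric says nothing about points outside the image, which is the gap the abstract says Ex-OKS is meant to close.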

[CV-65] Vision Transformers for Weakly-Supervised Microorganism Enumeration

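[Quick read]: This paper addresses the automation of microorganism enumeration, which has traditionally relied on tedious, time-consuming manual counting. Previous work automates the task with computer vision and machine learning, mainly through instance segmentation or density estimation. This study instead compares vision transformers (ViTs) with traditional architectures such as ResNet for weakly-supervised counting. It trains different versions of ViTs as the feature-extraction backbone on four microbiology datasets. The results show that ResNets perform better overall, but ViTs deliver competent results on all datasets, opening up new lines of research for microorganism enumeration.

Link: https://arxiv.org/abs/2412.02250
Authors: Javier Ureña Santiago,Thomas Ströhle,Antonio Rodríguez-Sánchez,Ruth Breu
Keywords: evaluating surface cleanliness, assessing contamination levels, ensuring health standards, Microorganism enumeration, surface cleanliness
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures, 3 tables, conference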

Abstract:Microorganism enumeration is an essential task in many applications, such as assessing contamination levels or ensuring health standards when evaluating surface cleanliness. However, it’s traditionally performed by human-supervised methods that often require manual counting, making it tedious and time-consuming. Previous research suggests automating this task using computer vision and machine learning methods, primarily through instance segmentation or density estimation techniques. This study conducts a comparative analysis of vision transformers (ViTs) for weakly-supervised counting in microorganism enumeration, contrasting them with traditional architectures such as ResNet and investigating ViT-based models such as TransCrowd. We trained different versions of ViTs as the architectural backbone for feature extraction using four microbiology datasets to determine potential new approaches for total microorganism enumeration in images. Results indicate that while ResNets perform better overall, ViTs deliver competent results across all datasets, opening up promising lines of research in microorganism enumeration. This comparative study contributes to the field of microbial image analysis by presenting innovative approaches to the recurring challenge of microorganism enumeration and by highlighting the capabilities of ViTs in the task of regression counting.

[CV-66] Multi-robot autonomous 3D reconstruction using Gaussian splatting with Semantic guidance

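[Quick read]: This paper addresses rapid reconstruction of large-scale scenes, in particular the low reconstruction efficiency of single-robot systems and the tendency of task-driven planning to get trapped in local optima. The key to the solution is the first centralized multi-robot autonomous 3D reconstruction framework based on 3D Gaussian Splatting (3DGS). The framework integrates online open-vocabulary semantic segmentation with the surface uncertainty of 3DGS, which reduces the time cost of task generation and improves reconstruction quality. Specifically, the system focuses view sampling on regions with high instance uncertainty, and a multi-robot collaboration strategy with mode and task assignments improves reconstruction quality and planning efficiency. Experiments show that the method achieves the highest reconstruction quality among all planning methods and better planning efficiency than existing multi-robot methods.

Link: https://arxiv.org/abs/2412.02249
Authors: Jing Zeng,Qi Ye,Tianle Liu,Yang Xu,Jin Li,Jinming Xu,Liang Li,Jiming Chen
Keywords: Implicit neural representations, Gaussian splatting, shown great potential, Implicit neural, neural representations
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: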

Abstract:Implicit neural representations and 3D Gaussian splatting (3DGS) have shown great potential for scene reconstruction. Recent studies have expanded their applications in autonomous reconstruction through task assignment methods. However, these methods are mainly limited to a single robot, and rapid reconstruction of large-scale scenes remains challenging. Additionally, task-driven planning based on surface uncertainty is prone to being trapped in local optima. To this end, we propose the first 3DGS-based centralized multi-robot autonomous 3D reconstruction framework. To further reduce the time cost of task generation and improve reconstruction quality, we integrate online open-vocabulary semantic segmentation with surface uncertainty of 3DGS, focusing view sampling on regions with high instance uncertainty. Finally, we develop a multi-robot collaboration strategy with mode and task assignments improving reconstruction quality while ensuring planning efficiency. Our method demonstrates the highest reconstruction quality among all planning methods and superior planning efficiency compared to existing multi-robot methods. We deploy our method on multiple robots, and results show that it can effectively plan view paths and reconstruct scenes with high quality.

[CV-67] SparseLGS: Sparse View Language Embedded Gaussian Splatting

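[Quick read]: This paper addresses 3D scene understanding from sparse input images without pose information. The key to the solution is a method called SparseLGS, which rests on four steps. First, a learning-based dense stereo model handles the pose-free, sparse inputs. Second, a three-step region matching strategy resolves multi-view semantic inconsistency, which matters especially for sparse inputs. Third, the method extracts low-dimensional information and builds bijections, avoiding the excessive learning and storage costs of learning high-dimensional CLIP features directly. Finally, a reconstruction loss is added during semantic training to improve the positions and shapes of the Gaussians. In experiments, SparseLGS reconstructs semantic fields from few inputs (3-4 views) with quality comparable to previous state-of-the-art methods that use dense input, and with the same sparse input it significantly improves reconstruction quality and computation speed (a 5x speedup).

Link: https://arxiv.org/abs/2412.02245
Authors: Jun Hu,Zhang Chen,Zhong Li,Yi Xu,Juyong Zhang
Keywords: combined Gaussian Splatting, Splatting to obtain, obtain scene representations, Gaussian Splatting, embeddings for open-vocabulary
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL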

Abstract:Recently, several studies have combined Gaussian Splatting to obtain scene representations with language embeddings for open-vocabulary 3D scene understanding. While these methods perform well, they essentially require very dense multi-view inputs, limiting their applicability in real-world scenarios. In this work, we propose SparseLGS to address the challenge of 3D scene understanding with pose-free and sparse view input images. Our method leverages a learning-based dense stereo model to handle pose-free and sparse inputs, and a three-step region matching approach to address the multi-view semantic inconsistency problem, which is especially important for sparse inputs. Different from directly learning high-dimensional CLIP features, we extract low-dimensional information and build bijections to avoid excessive learning and storage costs. We introduce a reconstruction loss during semantic training to improve Gaussian positions and shapes. To the best of our knowledge, we are the first to address the 3D semantic field problem with sparse pose-free inputs. Experimental results show that SparseLGS achieves comparable quality when reconstructing semantic fields with fewer inputs (3-4 views) compared to previous SOTA methods with dense input. Besides, when using the same sparse input, SparseLGS leads significantly in quality and heavily improves the computation speed (5× speedup). Project page: this https URL

[CV-68] Fast LiDAR Data Generation with Rectified Flows

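[Quick read]: This paper addresses the high computational cost and low generation efficiency of generative models for LiDAR data. The key to the solution is the R2Flow model, which is based on rectified flows: it learns straight trajectories for data generation, which greatly reduces the number of sampling steps and improves efficiency. The paper also designs an efficient Transformer-based architecture that processes the image representation of LiDAR data, covering range and reflectance measurements. Together these let R2Flow generate high-quality LiDAR data at a much lower computational cost, which suits practical use in autonomous mobile robots.

Link: https://arxiv.org/abs/2412.02241
Authors: Kazuto Nakashima,Xiaowen Liu,Tomoya Miyawaki,Yumi Iwashita,Ryo Kurazume
Keywords: autonomous mobile robots, Building LiDAR generative, models holds promise, scene manipulation, powerful data priors
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: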

Abstract:Building LiDAR generative models holds promise as powerful data priors for restoration, scene manipulation, and scalable simulation in autonomous mobile robots. In recent years, approaches using diffusion models have emerged, significantly improving training stability and generation quality. Despite the success of diffusion models, generating high-quality samples requires numerous iterations of running neural networks, and the increasing computational cost can pose a barrier to robotics applications. To address this challenge, this paper presents R2Flow, a fast and high-fidelity generative model for LiDAR data. Our method is based on rectified flows that learn straight trajectories, simulating data generation with far fewer sampling steps than diffusion models. We also propose an efficient Transformer-based model architecture for processing the image representation of LiDAR range and reflectance measurements. Our experiments on the unconditional generation of the KITTI-360 dataset demonstrate the effectiveness of our approach in terms of both efficiency and quality.
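To see why straight trajectories allow few sampling steps, consider the generic rectified-flow sampler, which integrates dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps. The toy below is a sketch, not R2Flow itself. It replaces the trained network with a hand-written velocity field (`make_toy_velocity`, a hypothetical stand-in) that moves every starting point toward one fixed target along a straight line.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Stand-in for a trained network: the exact velocity of the straight path
    x_t = (1 - t) * x0 + t * target, rewritten in terms of the current x_t."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


velocity = make_toy_velocity(target=[2.0, -1.0])
one_step = euler_sample(velocity, x0=[0.5, 0.25], num_steps=1)
four_steps = euler_sample(velocity, x0=[0.5, 0.25], num_steps=4)
```

Along a perfectly straight path the velocity is constant, so one Euler step lands on the same endpoint as four. A learned flow is only approximately straight, so in practice a few steps are still needed.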

[CV-69] Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models

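[Quick read]: This paper addresses the limited understanding of cross-attention layers in text-to-image diffusion models. The key to the solution is a method for constructing Head Relevance Vectors (HRVs) that align with useful visual concepts. An HRV is a vector whose length equals the total number of cross-attention heads, and each element represents the importance of the corresponding head for a given visual concept. Through an ordered weakening analysis, the paper shows that HRVs are effective as interpretable features. It also proposes concept strengthening and concept adjusting methods and applies them to three visual generative tasks, showing that HRVs help correct misinterpretations of polysemous words, modify challenging attributes in image editing, and mitigate catastrophic neglect in multi-concept generation. Overall, the work deepens the understanding of cross-attention layers and introduces new ways to control them finely at the head level.

Link: https://arxiv.org/abs/2412.02237
Authors: Jungwon Park,Jungmin Ko,Dongnam Byun,Jangwon Suh,Wonjong Rhee
Keywords: diffusion models leverage, models leverage cross-attention, leverage cross-attention layers, diffusion models, models leverage
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: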

Abstract:Recent text-to-image diffusion models leverage cross-attention layers, which have been effectively utilized to enhance a range of visual generative tasks. However, our understanding of cross-attention layers remains somewhat limited. In this study, we present a method for constructing Head Relevance Vectors (HRVs) that align with useful visual concepts. An HRV for a given visual concept is a vector with a length equal to the total number of cross-attention heads, where each element represents the importance of the corresponding head for the given visual concept. We develop and employ an ordered weakening analysis to demonstrate the effectiveness of HRVs as interpretable features. To demonstrate the utility of HRVs, we propose concept strengthening and concept adjusting methods and apply them to enhance three visual generative tasks. We show that misinterpretations of polysemous words in image generation can be corrected in most cases, five challenging attributes in image editing can be successfully modified, and catastrophic neglect in multi-concept generation can be mitigated. Overall, our work provides an advancement in understanding cross-attention layers and introduces new approaches for fine-controlling these layers at the head level.

[CV-70] CubeFormer: A Simple yet Effective Baseline for Lightweight Image Super-Resolution

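[Quick read]: This paper addresses the weak performance and poor detail recovery of lightweight image super-resolution (SR) methods. It argues that existing lightweight SR methods are held back by constrained feature diversity, which hurts feature representation and detail recovery. The key to the solution is a new baseline called CubeFormer, which improves performance by enriching features. Specifically, CubeFormer introduces cube attention, which extends 2D attention to 3D space to encourage exhaustive information interaction and feature variety. In addition, block sampling and grid sampling strategies are used to build intra-cube transformer blocks (Intra-CTB) and inter-cube transformer blocks (Inter-CTB), which perform local and global modeling respectively and further strengthen information extraction. Experiments show that CubeFormer achieves state-of-the-art performance on commonly used SR benchmarks.

Link: https://arxiv.org/abs/2412.02234
Authors: Jikai Wang,Huan Zheng,Jianbing Shen
Keywords: Lightweight image super-resolution, lightweight neural network, image super-resolution, neural network, Lightweight image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: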

Abstract:Lightweight image super-resolution (SR) methods aim at increasing the resolution and restoring the details of an image using a lightweight neural network. However, current lightweight SR methods still suffer from inferior performance and unpleasant details. Our analysis reveals that these methods are hindered by constrained feature diversity, which adversely impacts feature representation and detail recovery. To address this issue, we propose a simple yet effective baseline called CubeFormer, designed to enhance feature richness by completing holistic information aggregation. To be specific, we introduce cube attention, which expands 2D attention to 3D space, facilitating exhaustive information interactions, further encouraging comprehensive information extraction and promoting feature variety. In addition, we inject block and grid sampling strategies to construct intra-cube transformer blocks (Intra-CTB) and inter-cube transformer blocks (Inter-CTB), which perform local and global modeling, respectively. Extensive experiments show that our CubeFormer achieves state-of-the-art performance on commonly used SR benchmarks. Our source code and models will be publicly available.

[CV-71] How to Use Diffusion Priors under Sparse Views?

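[Quick read]: This paper addresses the long-standing challenge of novel view synthesis under sparse views. In particular, when a diffusion model serves as the external prior, the low information entropy of sparse views causes mode deviation in 3D reconstruction based on Score Distillation Sampling (SDS). The key to the solution is Inline Prior Guided Score Matching (IPSM), which uses the visual inline priors given by pose relationships between viewpoints to rectify the rendered image distribution, and decomposes the original optimization objective of SDS, providing effective diffusion visual guidance without any fine-tuning or pre-training. The paper also proposes the IPSM-Gaussian pipeline, which adopts 3D Gaussian Splatting as the backbone and adds depth and geometry consistency regularization on top of IPSM to further improve the inline priors and the rectified distribution, achieving state-of-the-art reconstruction quality on different public datasets.

Link: https://arxiv.org/abs/2412.02225
Authors: Qisen Wang,Yifan Zhao,Jiawei Ma,Jia Li
Keywords: long-term important challenge, Score Distillation Sampling, Guided Score Matching, long-term important, Prior Guided Score
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: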

Abstract:Novel view synthesis under sparse views has been a long-term important challenge in 3D reconstruction. Existing works mainly rely on introducing external semantic or depth priors to supervise the optimization of 3D representations. However, the diffusion model, as an external prior that can directly provide visual supervision, has always underperformed in sparse-view 3D reconstruction using Score Distillation Sampling (SDS) due to the low information entropy of sparse views compared to text, leading to optimization challenges caused by mode deviation. To this end, we present a thorough analysis of SDS from the mode-seeking perspective and propose Inline Prior Guided Score Matching (IPSM), which leverages visual inline priors provided by pose relationships between viewpoints to rectify the rendered image distribution and decomposes the original optimization objective of SDS, thereby offering effective diffusion visual guidance without any fine-tuning or pre-training. Furthermore, we propose the IPSM-Gaussian pipeline, which adopts 3D Gaussian Splatting as the backbone and supplements depth and geometry consistency regularization based on IPSM to further improve inline priors and rectified distribution. Experimental results on different public datasets show that our method achieves state-of-the-art reconstruction quality. The code is released at this https URL.

[CV-72] Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs

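[Quick read]: This paper addresses the lack of few-shot adaptability in Visual Foundation Models (VFMs) for data-limited and real-time applications. The key to the solution is a framework called LoRA Recycle, which uses a meta-learning objective to distill a meta-LoRA from diverse pre-tuned Low-Rank Adaptation (LoRA) modules, training on surrogate data generated from those pre-tuned LoRAs themselves. With the meta-LoRA, a VFM can solve new few-shot tasks in a single forward pass without explicit fine-tuning, much like the in-context learning of Large Language Models (LLMs). The framework also includes a double-efficient mechanism that significantly accelerates meta-training while maintaining or improving performance.

Link: https://arxiv.org/abs/2412.02220
Authors: Zixuan Hu,Yongxian Wei,Li Shen,Chun Yuan,Dacheng Tao
Keywords: Large Language Models, Large Language, Visual Foundation Models, Language Models, ChatGPT demonstrate strong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: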

Abstract:Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot adaptability without requiring fine-tuning, making them ideal for data-limited and real-time applications. However, this adaptability has not yet been replicated in current Visual Foundation Models (VFMs), which require explicit fine-tuning with sufficient tuning data. Besides, the pretraining-finetuning paradigm has led to the surge of numerous task-specific modular components, such as Low-Rank Adaptation (LoRA). For the first time, we explore the potential of reusing diverse pre-tuned LoRAs without accessing their original training data, to achieve tuning-free few-shot adaptation in VFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned LoRAs with a meta-learning objective, using surrogate data generated inversely from pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, is empowered to solve new few-shot tasks in a single forward pass, akin to the in-context learning of LLMs. Additionally, we incorporate a double-efficient mechanism tailored to our framework, significantly accelerating the meta-training process while maintaining or even improving performance. Extensive experiments across various few-shot classification benchmarks across both in- and cross-domain scenarios demonstrate the superiority of our framework.
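As background on the modules being recycled, a LoRA adapter leaves a pretrained weight matrix W frozen and adds a trainable low-rank update, so a layer computes W·x + (alpha / r)·B·(A·x) with A of shape r × d_in and B of shape d_out × r. The sketch below shows only this generic forward pass in plain Python. It does not implement the meta-LoRA distillation or the surrogate-data generation described in the abstract, and all names and values are illustrative.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Forward pass of one linear layer with a LoRA adapter.

    W: frozen pretrained weights, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    rank = len(A)
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank path: B @ (A @ x)
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # rank r = 1
B = [[0.5], [-0.5]]
y = lora_forward(W, A, B, x=[2.0, 3.0], alpha=2.0)

# With B set to zeros (the usual initialization), the adapter is a no-op.
y_init = lora_forward(W, A, [[0.0], [0.0]], x=[2.0, 3.0], alpha=2.0)
```

Because the adapter holds only A and B, many task-specific LoRAs can be stored and shared cheaply, which is what makes a pool of pre-tuned LoRAs available to reuse in the first place.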

[CV-73] GIST: Towards Photorealistic Style Transfer via Multiscale Geometric Representations

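[Quick read]: This paper addresses the significant artifacts and loss of photorealism that arise because existing style transfer methods rely on pre-trained encoders optimized for discriminative tasks. The key to the solution is a geometry-based image style transfer technique called GIST (Geometric-based Image Style Transfer). GIST replaces the standard Neural Style Transfer autoencoding framework with a multiscale image expansion, preserving scene details without post-processing or training. It matches multiresolution and multidirectional representations, such as Wavelets and Contourlets, by solving an optimal transport problem, which gives efficient texture transfer. Experiments show that GIST matches or outperforms recent photorealistic style transfer methods while significantly reducing processing time, with no model training.

Link: https://arxiv.org/abs/2412.02214
Authors: Renan A. Rojas-Gomez,Minh N. Do
Keywords: leverage pre-trained encoders, pre-trained encoders optimized, Style Transfer, Neural Style Transfer, discriminative tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: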

Abstract:State-of-the-art Style Transfer methods often leverage pre-trained encoders optimized for discriminative tasks, which may not be ideal for image synthesis. This can result in significant artifacts and loss of photorealism. Motivated by the ability of multiscale geometric image representations to capture fine-grained details and global structure, we propose GIST: Geometric-based Image Style Transfer, a novel Style Transfer technique that exploits the geometric properties of content and style images. GIST replaces the standard Neural Style Transfer autoencoding framework with a multiscale image expansion, preserving scene details without the need for post-processing or training. Our method matches multiresolution and multidirectional representations such as Wavelets and Contourlets by solving an optimal transport problem, leading to efficient texture transfer. Experiments show that GIST is on par with or outperforms recent photorealistic Style Transfer approaches while significantly reducing the processing time with no model training.
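The abstract does not say which optimal transport formulation GIST solves, or over which coefficients. As a loose illustration of the general idea only, the sketch below uses the simplest case: for two equal-sized sets of scalar values and a convex cost, the optimal transport plan pairs values of equal rank. The function name and the toy numbers are hypothetical, and the lists stand in for coefficients of a single subband.

```python
def ot_match_1d(content, style):
    """Map content values onto style values by rank-matching.

    For equal-sized 1D samples and a convex cost such as squared distance,
    pairing the i-th smallest content value with the i-th smallest style
    value is an optimal transport plan. The result has exactly the style
    values, arranged in the spatial order of the content.
    """
    if len(content) != len(style):
        raise ValueError("this sketch assumes equal-sized samples")
    order = sorted(range(len(content)), key=lambda i: content[i])
    style_sorted = sorted(style)
    matched = [0.0] * len(content)
    for rank, index in enumerate(order):
        matched[index] = style_sorted[rank]
    return matched


# Toy coefficients standing in for one subband of a multiscale transform.
content_coeffs = [0.3, -1.0, 2.0, 0.0]
style_coeffs = [10.0, 40.0, 20.0, 30.0]
transferred = ot_match_1d(content_coeffs, style_coeffs)
```

The matched output keeps the ordering of the content (where the large and small responses sit) while taking on the value distribution of the style, and it needs no training, which fits the training-free claim in the abstract.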

[CV-74] CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

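[Quick read]: This paper addresses the lack of a comprehensive benchmark for evaluating Large Multimodal Models (LMMs) on document image recognition tasks with complex structure and fine-grained visual challenges. The key to the solution is CC-OCR, a comprehensive benchmark covering four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It contains 39 subsets and 7,058 fully annotated images, 41% of which come from real applications and are released publicly for the first time. By evaluating nine prominent LMMs, CC-OCR reveals their strengths and weaknesses in text grounding, multi-orientation recognition, and repetition hallucination. It aims to comprehensively evaluate LMM capabilities on OCR-related tasks and to drive further progress in LMMs.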

链接: https://arxiv.org/abs/2412.02210
作者: Zhibo Yang,Jun Tang,Zhaohai Li,Pengfei Wang,Jianqiang Wan,Humen Zhong,Xuejing Liu,Mingkun Yang,Peng Wang,Yuliang Liu,LianWen Jin,Xiang Bai,Shuai Bai,Junyang Lin
关键词-EN: Large Multimodal Models, Large Multimodal, natural language instructions, demonstrated impressive performance, Multimodal Models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures; The code and data will be publicly available as soon as possible

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have demonstrated impressive performance on recognizing document images with natural language instructions. However, the extent of their literacy capabilities on documents with rich structure and fine-grained visual challenges remains unclear. The current landscape lacks a comprehensive benchmark to effectively measure the literate capabilities of LMMs. Existing benchmarks are often limited by narrow scenarios and specified tasks. To this end, we introduce CC-OCR, a comprehensive benchmark that possesses a diverse range of scenarios, tasks, and challenges. CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and are being released for the first time. Furthermore, we evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition. CC-OCR aims to comprehensively evaluate the capabilities of LMMs on OCR-centered tasks, driving advancement in LMMs.

[CV-75] 3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation

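[Quick read]: This paper addresses the difficulty of compressing 3D data into a manageable set of discrete tokens for generation, given that 3D data lacks an inherent order and multi-scale relationships. The key to the solution is the Variational Tokenizer (VAT), which transforms unordered 3D data into compact latent tokens with an implicit hierarchy for efficient autoregressive modeling. VAT first uses an in-context transformer to compress a large number of unordered 3D features into a reduced token set, then performs residual quantization over a Gaussian distribution, progressively increasing the token count across scales. This naturally links tokens at different scales, and at decoding time a high-resolution triplane converts the compact latent tokens into detailed 3D shapes. Experiments show that VAT outperforms existing methods in 3D generation quality, efficiency, and generalization, achieving up to 250x compression while maintaining a high F-score.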

链接: https://arxiv.org/abs/2412.02202
作者: Jinzhi Zhang,Feng Xiong,Mu Xu
关键词-EN: revolutionized high-fidelity image, high-fidelity image generation, image patches, tokens, VAT
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 21 figures

点击查看摘要

Abstract:Autoregressive transformers have revolutionized high-fidelity image generation. One crucial ingredient lies in the tokenizer, which compresses high-resolution image patches into manageable discrete tokens with a scanning or hierarchical order suitable for large language models. Extending these tokenizers to 3D generation, however, presents a significant challenge: unlike image patches that naturally exhibit spatial sequence and multi-scale relationships, 3D data lacks an inherent order, making it difficult to compress into fewer tokens while preserving structural details. To address this, we introduce the Variational Tokenizer (VAT), which transforms unordered 3D data into compact latent tokens with an implicit hierarchy, suited for efficient and high-fidelity coarse-to-fine autoregressive modeling. VAT begins with an in-context transformer, which compresses numerous unordered 3D features into a reduced token set with minimal information loss. This latent space is then mapped to a Gaussian distribution for residual quantization, with token counts progressively increasing across scales. In this way, tokens at different scales naturally establish interconnections by allocating themselves into different subspaces within the same Gaussian distribution, facilitating discrete modeling of token relationships across scales. During the decoding phase, a high-resolution triplane is utilized to convert these compact latent tokens into detailed 3D shapes. Extensive experiments demonstrate that VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization. Remarkably, VAT achieves up to a 250x compression, reducing a 1MB mesh to just 3.9KB with a 96% F-score, and can further compress to 256 int8 tokens, achieving a 2000x reduction while maintaining a 92% F-score.
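Residual quantization, which the abstract names as the discretization step, can be sketched on a single scalar. This is a generic illustration and not VAT's implementation: the three hand-picked codebooks, the scalar input, and the function names are all assumptions. Each stage quantizes whatever the previous stages left over, so later tokens refine earlier ones in a coarse-to-fine manner.

```python
def nearest(codebook, x):
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def residual_quantize(x, codebooks):
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual, tokens = x, []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual -= codebook[idx]
    return tokens, residual


def dequantize(tokens, codebooks):
    """Reconstruct the value by summing the selected entries of every stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, tokens))


# Hypothetical coarse-to-fine codebooks (illustrative values only).
codebooks = [
    [-1.0, 0.0, 1.0],     # coarse
    [-0.5, 0.0, 0.5],     # medium
    [-0.25, 0.0, 0.25],   # fine
]

tokens, residual = residual_quantize(0.8, codebooks)
reconstruction = dequantize(tokens, codebooks)
```

For the input 0.8 the stages pick 1.0, 0.0, and -0.25, so the reconstruction is 0.75 and the remaining error is about 0.05. The paper applies the idea to latent token sets whose size grows across scales, which this scalar toy does not model.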

[CV-76] Transformer-Metric Loss for CNN-Based Face Recognition

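[Quick read]: This paper investigates how to improve deep learning models for face recognition by combining a transformer network with conventional metric loss functions. The key to the solution is a new loss evaluation technique, the transformer-metric loss, which adds a transformer network as an additive loss on top of the conventional metric loss. Specifically, a transformer encoder processes the contextual vectors output by the final convolution layer of the CNN, bringing transformer behavior into the loss function itself. The authors evaluate the transformer loss combined with several base metric loss functions and report state-of-the-art (SoTA) results on multiple validation datasets, with some limitations. The work extends the use of transformers in machine vision and opens a new direction: exploring transformers as loss functions.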

链接: https://arxiv.org/abs/2412.02198
作者: Pritesh Prakash,Ashish Jacob Sam
关键词-EN: loss, loss function plays, deep learning, loss function, plays a crucial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Face Recognition using Transformer Loss

点击查看摘要

Abstract:In deep learning, the loss function plays a crucial role in optimizing the network. Many recent innovations in loss techniques have been made, and various margin-based angular loss functions (metric loss) have been designed particularly for face recognition. The concept of transformers is already well-researched and applied in many facets of machine vision. This paper presents a technique for loss evaluation that uses a transformer network as an additive loss in the face recognition domain. The standard metric loss function typically takes the final embedding of the main CNN backbone as its input. Here, we employ a transformer-metric loss, a combined approach that integrates both transformer-loss and metric-loss. This research intends to analyze the transformer behavior on the convolution output when the CNN outcome is arranged in a sequential vector. The transformer encoder takes input from the contextual vectors obtained from the final convolution layer of the network. With this technique, we use transformer loss with various base metric-loss functions to evaluate the effect of the combined loss functions. We observe that such a configuration allows the network to achieve SoTA results on various validation datasets with some limitations. This research expands the role of transformers in the machine vision domain and opens new possibilities for exploring transformers as a loss function.

[CV-77] Cascaded Multi-Scale Attention for Enhanced Multi-Scale Feature Extraction and Interaction with Low-Resolution Images

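[Quick read]: This paper addresses the problem that cameras in image recognition tasks such as human pose estimation often capture low-resolution images, which makes it hard to extract and exploit the multi-scale features that precise inference depends on. The key to the solution is a new attention mechanism, cascaded multi-scale attention (CMSA), designed for CNN-ViT hybrid architectures to handle low-resolution inputs effectively. CMSA combines grouped multi-head self-attention with window-based local attention and a cascaded fusion of multi-scale features across scales, extracting and integrating multi-scale features without downsampling the input image or feature maps. This improves performance on tasks such as human pose estimation and head pose estimation with low-resolution images. In experiments it outperforms existing state-of-the-art methods with fewer parameters, making it suitable for real-world settings where high-resolution images are hard to obtain.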

链接: https://arxiv.org/abs/2412.02197
作者: Xiangyong Lu,Masanori Suganuma,Takayuki Okatani
关键词-EN: human pose estimation, pose estimation, cameras often capture, capture objects, low resolutions
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures, 5 tables. The paper is under consideration at Computer Vision and Image Understanding

点击查看摘要

Abstract:In real-world applications of image recognition tasks, such as human pose estimation, cameras often capture objects, like human bodies, at low resolutions. This scenario poses a challenge in extracting and leveraging multi-scale features, which is often essential for precise inference. To address this challenge, we propose a new attention mechanism, named cascaded multi-scale attention (CMSA), tailored for use in CNN-ViT hybrid architectures, to handle low-resolution inputs effectively. The design of CMSA enables the extraction and seamless integration of features across various scales without necessitating the downsampling of the input image or feature maps. This is achieved through a novel combination of grouped multi-head self-attention mechanisms with window-based local attention and cascaded fusion of multi-scale features over different scales. This architecture allows for the effective handling of features across different scales, enhancing the model’s ability to perform tasks such as human pose estimation, head pose estimation, and more with low-resolution images. Our experimental results show that the proposed method outperforms existing state-of-the-art methods in these areas with fewer parameters, showcasing its potential for broad application in real-world scenarios where capturing high-resolution images is not feasible. Code is available at this https URL.

[CV-78] LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

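[Quick read]: This paper addresses open-universe 3D layout generation, where large language models (LLMs) struggle to produce 3D scenes that are physically plausible and faithful to the input instructions, particularly in cluttered scenes. The key to the solution is LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of vision-language models (VLMs) and uses differentiable optimization to ensure physical plausibility. LayoutVLM has VLMs generate two mutually reinforcing representations and uses a self-consistent decoding process to improve their spatial planning. Experiments show that LayoutVLM overcomes the limitations of existing LLM-based and constraint-based methods, producing 3D layouts better aligned with the semantic intent of the input language instructions, and that fine-tuning VLMs with the scene layout representation extracted from existing scene datasets further improves performance.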

链接: https://arxiv.org/abs/2412.02193
作者: Fan-Yun Sun,Weiyu Liu,Siyi Gu,Dylan Lim,Goutam Bhat,Federico Tombari,Manling Li,Nick Haber,Jiajun Wu
关键词-EN: generation arranges unlabeled, layout generation arranges, arranges unlabeled, assets conditioned, Open-universe
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: project website: this https URL

点击查看摘要

Abstract:Open-universe 3D layout generation arranges unlabeled 3D assets conditioned on language instructions. Large language models (LLMs) struggle with generating physically plausible 3D scenes and adhering to input instructions, particularly in cluttered scenes. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve performance.
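To make "differentiable optimization to ensure physical plausibility" concrete, here is a deliberately tiny sketch: two objects on a line, each with a target position (standing in for what a VLM might propose) and a radius, optimized by plain gradient descent. The objective, the weights, and the function name `optimize_layout` are assumptions for illustration only. The abstract does not specify LayoutVLM's actual objective.

```python
def optimize_layout(targets, radius, w_sem=1.0, w_phys=10.0, lr=0.01, steps=500):
    """Gradient descent on: w_sem * (distance to targets)^2 + w_phys * overlap^2.

    Assumes two objects on a line with the second to the right of the first.
    """
    t1, t2 = targets
    x1, x2 = t1, t2  # start at the proposed (possibly overlapping) positions
    for _ in range(steps):
        # Overlap is positive while the centers are closer than two radii.
        overlap = max(0.0, 2 * radius - (x2 - x1))
        # Hand-derived gradients of the objective with respect to x1 and x2.
        g1 = 2 * w_sem * (x1 - t1) + 2 * w_phys * overlap
        g2 = 2 * w_sem * (x2 - t2) - 2 * w_phys * overlap
        x1 -= lr * g1
        x2 -= lr * g2
    return x1, x2


# Proposed centers are 1.0 apart, but two radius-1 objects need 2.0 to not collide.
x1, x2 = optimize_layout(targets=(0.0, 1.0), radius=1.0)
```

With these weights the objects settle about 1.95 apart: the soft penalty trades a small residual overlap against staying near the proposed positions. Raising `w_phys` shrinks the residual overlap, at the cost of needing a smaller learning rate for stability.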

[CV-79] VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding

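[Quick read]: This paper addresses the performance drop of video large multimodal models (LMMs) on out-of-distribution (OOD) tasks. The key to the solution is VideoICL, a new video in-context learning framework that introduces a similarity-based relevant example selection strategy and a confidence-based iterative inference approach to extend the effective context length, improving OOD video understanding without high computational cost. Specifically, VideoICL selects the examples most relevant to the current task and ranks them by similarity for use during inference. If the generated response has low confidence, the framework selects new examples and runs inference again, iterating until a high-confidence result is obtained. The method yields significant gains on multiple benchmarks, especially in domain-specific scenarios, laying the groundwork for broader video understanding applications.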

链接: https://arxiv.org/abs/2412.02186
作者: Kangsan Kim,Geon Park,Youngwan Lee,Woongyeong Yeo,Sung Ju Hwang
关键词-EN: large multimodal models, Recent advancements, video large multimodal, multimodal models, reasoning capabilities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in video large multimodal models (LMMs) have significantly improved their video understanding and reasoning capabilities. However, their performance drops on out-of-distribution (OOD) tasks that are underrepresented in training data. Traditional methods like fine-tuning on OOD datasets are impractical due to high computational costs. While in-context learning (ICL) with demonstration examples has shown promising generalization performance in language tasks and image-language tasks without fine-tuning, applying ICL to video-language tasks faces challenges due to the limited context length in Video LMMs, as videos require longer token lengths. To address these issues, we propose VideoICL, a novel video in-context learning framework for OOD tasks that introduces a similarity-based relevant example selection strategy and a confidence-based iterative inference approach. This allows the framework to select the most relevant examples and rank them by similarity for use in inference. If the generated response has low confidence, our framework selects new examples and performs inference again, iteratively refining the results until a high-confidence response is obtained. This approach improves OOD video understanding performance by extending effective context length without incurring high costs. The experimental results on multiple benchmarks demonstrate significant performance gains, especially in domain-specific scenarios, laying the groundwork for broader video comprehension applications. Code will be released at this https URL
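The select-rank-infer-retry loop described in the abstract can be sketched in a few lines. Everything here is a stand-in: `similarity` and `infer` are hypothetical callables supplied by the caller, the batch size `k`, the threshold, and the round limit are made-up defaults, and returning the most confident answer when no round reaches the threshold is an assumption rather than something the abstract states.

```python
def video_icl(query, examples, similarity, infer, k=2, threshold=0.8, max_rounds=3):
    """Try batches of k examples, most similar first, until confidence is high enough."""
    ranked = sorted(examples, key=lambda ex: similarity(query, ex), reverse=True)
    best = None
    for round_idx in range(max_rounds):
        batch = ranked[round_idx * k:(round_idx + 1) * k]
        if not batch:
            break  # ran out of examples
        answer, confidence = infer(query, batch)
        if best is None or confidence > best[1]:
            best = (answer, confidence)
        if confidence >= threshold:
            break  # confident enough, stop iterating
    return best


# Toy stand-ins (hypothetical): plain numbers play the role of videos.
calls = []


def toy_similarity(query, example):
    return -abs(query - example)


def toy_infer(query, batch):
    calls.append(list(batch))
    confidence = 0.9 if 6 in batch else 0.5
    return f"answer from {batch}", confidence


result = video_icl(5, [1, 4, 6, 9, 5, 20], toy_similarity, toy_infer)
```

In the toy run the first batch yields low confidence, so a second batch of the next-most-similar examples is tried and accepted. Because each round feeds only `k` examples to the model, the total number of examples considered can exceed what fits in a single context window, which is the sense in which the effective context length is extended.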

[CV-80] Anatomically-Grounded Fact Checking of Automated Chest X-ray Reports

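[Quick read]: This paper addresses factual errors in radiology reports generated by large-scale vision-language models. The key to the solution is a new explainable fact-checking model that identifies errors in a report's findings and their locations in the image. Specifically, the authors analyze the types of errors made by automated reporting methods and, starting from a ground truth dataset, build a synthetic dataset of images paired with real and fake descriptions of findings and their locations. They then train a new multi-label cross-modal contrastive regression network on this dataset. Experiments indicate that this error detection and correction improves report quality by more than 40%.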

链接: https://arxiv.org/abs/2412.02177
作者: R. Mahmood,K.C.L. Wong,D. M. Reyes,N. D’Souza,L. Shi,J. Wu,P. Kaviani,M. Kalra,G. Wang,P. Yan,T. Syeda-Mahmood
关键词-EN: realistic radiology reports, large-scale vision-language models, realistic radiology, simple prompts, emergence of large-scale
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the emergence of large-scale vision-language models, realistic radiology reports may be generated using only medical images as input guided by simple prompts. However, their practical utility has been limited due to the factual errors in their description of findings. In this paper, we propose a novel model for explainable fact-checking that identifies errors in findings and their locations indicated through the reports. Specifically, we analyze the types of errors made by automated reporting methods and derive a new synthetic dataset of images paired with real and fake descriptions of findings and their locations from a ground truth dataset. A new multi-label cross-modal contrastive regression network is then trained on this dataset. We evaluate the resulting fact-checking model and its utility in correcting reports generated by several SOTA automated reporting tools on a variety of benchmark datasets, with results pointing to over 40% improvement in report quality through such error detection and correction.

[CV-81] Underload: Defending against Latency Attacks for Object Detectors on Edge Devices

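[Quick read]: This paper addresses the vulnerability of object detection systems to latency attacks. Latency attacks create a computational bottleneck in the object detector's post-processing module, causing cascading failures that threaten real-time downstream tasks. The key to the solution is background-attentive adversarial training that accounts for both the adversarial behavior and the underlying hardware capabilities. Specifically, the authors draw system-level connections between latency attacks and the hardware capacity of heterogeneous GPU devices, then use objectness loss as a proxy to build background attention into the adversarial training pipeline, balancing clean and robust accuracy. Experiments show that the method restores real-time processing from 13 FPS to 43 FPS on Jetson Orin NX, with a better trade-off between clean and robust accuracy.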

链接: https://arxiv.org/abs/2412.02171
作者: Tianyi Wang,Zichen Wang,Cong Wang,Yuanchao Shu,Ruilong Deng,Peng Cheng,Jiming Chen(Zhejiang University, Hangzhou, China)
关键词-EN: supply chain management, real-time downstream applications, autonomous driving, augmented reality, chain management
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Object detection is a fundamental enabler for many real-time downstream applications such as autonomous driving, augmented reality and supply chain management. However, the algorithmic backbone of neural networks is brittle to imperceptible perturbations in the system inputs, which are generally known as misclassifying attacks. A new class of latency attacks, which target the real-time processing capability, has been reported recently. They exploit new attack surfaces in object detectors by creating a computational bottleneck in the post-processing module, which leads to cascading failure and puts the real-time downstream tasks at risk. In this work, we make an initial attempt to defend against this attack via background-attentive adversarial training that is also cognizant of the underlying hardware capabilities. We first draw system-level connections between latency attacks and hardware capacity across heterogeneous GPU devices. Based on the particular adversarial behaviors, we utilize objectness loss as a proxy and build background attention into the adversarial training pipeline, and achieve a reasonable balance between clean and robust accuracy. Extensive experiments demonstrate the effectiveness of the defense, restoring real-time processing capability from 13 FPS to 43 FPS on Jetson Orin NX, with a better trade-off between clean and robust accuracy.
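Why a flood of spurious detections turns post-processing into a bottleneck can be seen with a small experiment. The sketch below assumes the post-processing step is greedy non-maximum suppression (NMS). The abstract only says "post-processing module", so this is an assumption, as are the box layouts and the function names. It counts IoU comparisons for 100 heavily overlapping boxes versus 100 disjoint ones.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS that also reports how many IoU comparisons it performed."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep, comparisons = [], 0
    for i in order:
        suppressed = False
        for j in keep:
            comparisons += 1
            if iou(boxes[i], boxes[j]) > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return keep, comparisons


scores = [0.9] * 100

# Benign-like case: many candidates for the same object collapse to one box.
benign = [(0.0, 0.0, 10.0, 10.0)] * 100
kept_benign, cost_benign = nms(benign, scores)

# Attack-like case: many disjoint "phantom" boxes, none of which is suppressed.
phantom = [(20.0 * i, 0.0, 20.0 * i + 10.0, 10.0) for i in range(100)]
kept_phantom, cost_phantom = nms(phantom, scores)
```

With the same number of input boxes, the disjoint case needs 4,950 comparisons instead of 99, because every surviving box must be checked against every box kept so far. The cost grows quadratically with the number of surviving boxes, which is consistent with the abstract's description of a computational bottleneck in post-processing.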

[CV-82] Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis

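[Quick read]: This paper addresses the inability of current image generation models to produce scene-consistent images under specific camera settings, such as the fields of view of lenses with different focal lengths. The key to the solution is the Generative Photography framework, which uses two core concepts, Dimensionality Lifting and Contrastive Camera Learning, to achieve continuous and consistent image generation across camera settings. Experiments show that the method produces significantly more scene-consistent and photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.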

链接: https://arxiv.org/abs/2412.02168
作者: Yu Yuan,Xijun Wang,Yichen Sheng,Prateek Chennuri,Xingguang Zhang,Stanley Chan
关键词-EN: Image generation today, text prompts, realistic images, Contrastive Camera Learning, generation today
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Image generation today can produce somewhat realistic images from text prompts. However, if one asks the generator to synthesize a particular camera setting, such as creating different fields of view using a 24mm lens versus a 70mm lens, the generator will not be able to interpret and generate scene-consistent images. This limitation not only hinders the adoption of generative tools in photography applications but also exemplifies a broader issue of bridging the gap between data-driven models and the physical world. In this paper, we introduce the concept of Generative Photography, a framework designed to control camera intrinsic settings during content generation. The core innovations of this work are the concepts of Dimensionality Lifting and Contrastive Camera Learning, which achieve continuous and consistent transitions for different camera settings. Experimental results show that our method produces significantly more scene-consistent photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.

[CV-83] Agri-LLaVA: Knowledge-Infused Large Multimodal Assistant on Agricultural Pests and Diseases

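[Quick read]: This paper addresses the complexity and variability of identifying and controlling agricultural pests and diseases. The key to the solution is the first multimodal instruction-following dataset in the agricultural domain, covering over 221 types of pests and diseases with approximately 400,000 data entries, together with a knowledge-infused training method built on this dataset to develop Agri-LLaVA, an agricultural multimodal conversation system. Through multimodal conversation and visual understanding, the system offers new perspectives and approaches for pest and disease identification and control. The paper also designs a diverse and challenging evaluation benchmark to advance research in this area, and open-sources the dataset and model to promote research on large multimodal models (LMMs) in agriculture.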

链接: https://arxiv.org/abs/2412.02158
作者: Liqiong Wang,Teng Jin,Jinyu Yang,Ales Leonardis,Fangyi Wang,Feng Zheng
关键词-EN: achieved significant advancements, pests and diseases, persist in applying, agricultural pests, large multimodal models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the general domain, large multimodal models (LMMs) have achieved significant advancements, yet challenges persist in applying them to specific fields, especially agriculture. As the backbone of the global economy, agriculture confronts numerous challenges, with pests and diseases being particularly concerning due to their complexity, variability, rapid spread, and high resistance. This paper specifically addresses these issues. We construct the first multimodal instruction-following dataset in the agricultural domain, covering over 221 types of pests and diseases with approximately 400,000 data entries. This dataset aims to explore and address the unique challenges in pest and disease control. Based on this dataset, we propose a knowledge-infused training method to develop Agri-LLaVA, an agricultural multimodal conversation system. To accelerate progress in this field and inspire more researchers to engage, we design a diverse and challenging evaluation benchmark for agricultural pests and diseases. Experimental results demonstrate that Agri-LLaVA excels in agricultural multimodal conversation and visual understanding, providing new insights and approaches to address agricultural pests and diseases. By open-sourcing our dataset and model, we aim to promote research and development in LMMs within the agricultural domain and make significant contributions to tackle the challenges of agricultural pests and diseases. All resources can be found at this https URL.

[CV-84] SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images

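[Quick read]: This paper addresses fast scene updates and efficient operation for language-guided robotic grasping in changeable environments. Existing methods depend on dense camera views and struggle to update scenes quickly, which limits their effectiveness when the environment changes. The key to the solution is SparseGrasp, a new open-vocabulary robotic grasping system that runs efficiently on sparse-view RGB images and handles scene updates quickly. Its main techniques are: 1) using DUSt3R to generate a dense point cloud as the initialization for 3D Gaussian Splatting (3DGS), maintaining high fidelity even under sparse supervision; 2) incorporating the semantic awareness of recent vision foundation models; 3) repurposing Principal Component Analysis (PCA) to compress features from 2D models for processing efficiency; 4) introducing a new render-and-compare strategy that ensures rapid scene updates, enabling multi-turn grasping in changeable environments. Experiments show that SparseGrasp significantly outperforms existing state-of-the-art methods in speed and adaptability.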

链接: https://arxiv.org/abs/2412.02140
作者: Junqiu Yu,Xinlin Ren,Yongchong Gu,Haitao Lin,Tianyu Wang,Yi Zhu,Hang Xu,Yu-Gang Jiang,Xiangyang Xue,Yanwei Fu
关键词-EN: grasp specific objects, rapidly advancing field, Language-guided robotic grasping, Language-guided robotic, specific objects
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language-guided robotic grasping is a rapidly advancing field where robots are instructed using human language to grasp specific objects. However, existing methods often depend on dense camera views and struggle to quickly update scenes, limiting their effectiveness in changeable environments. In contrast, we propose SparseGrasp, a novel open-vocabulary robotic grasping system that operates efficiently with sparse-view RGB images and handles scene updates quickly. Our system builds upon and significantly enhances existing computer vision modules in robotic learning. Specifically, SparseGrasp utilizes DUSt3R to generate a dense point cloud as the initialization for 3D Gaussian Splatting (3DGS), maintaining high fidelity even under sparse supervision. Importantly, SparseGrasp incorporates semantic awareness from recent vision foundation models. To further improve processing efficiency, we repurpose Principal Component Analysis (PCA) to compress features from 2D models. Additionally, we introduce a novel render-and-compare strategy that ensures rapid scene updates, enabling multi-turn grasping in changeable environments. Experimental results show that SparseGrasp significantly outperforms state-of-the-art methods in terms of both speed and adaptability, providing a robust solution for multi-turn grasping in changeable environments.

[CV-85] From Pixels to Planes: Minimum Ground Sample Distance for Aircraft

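[Quick read]: This paper studies how ground sample distance (GSD) affects detection performance for aircraft of different sizes, and identifies the GSD requirement that minimizes camera weight while maintaining detection accuracy. The key to the solution is training and evaluating a YOLOv8s model on imagery at different GSDs, which shows that a GSD of at least 0.86 m is needed to accurately detect most aircraft, particularly those with wingspans shorter than 20 meters. This finding can inform camera design for high-altitude platforms, optimizing detection performance under weight constraints.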

链接: https://arxiv.org/abs/2412.02137
作者: Matthew Ciolino,Willie Maddox
关键词-EN: ground sample distance, sample distance, proprietary AllPlanes, study investigates, investigates the impact
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 1 Table, 2 Authors, 3 Pages, 4 Figures

点击查看摘要

Abstract:This study investigates the impact of ground sample distance (GSD) on the detection performance of aircraft of various sizes using the proprietary AllPlanes 120 dataset. The dataset comprises 120 civilian, military and museum aircraft from multiple satellite/aerial sources collected over two years. Resolutions ranging from 2.4 to 0.3 meters GSD were simulated. Performance metrics were derived from a YOLOv8s model trained on down-sampled versions of zoom level 19 (0.3m GSD) imagery. The results indicate that a GSD of at least 0.86m is required to accurately detect most aircraft, particularly those with wingspans shorter than 20 meters. Due to weight constraints in high-altitude platforms, this GSD specification can inform camera design to minimize weight while maintaining detection accuracy.
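A quick back-of-the-envelope calculation shows what these GSD values mean in pixels. GSD is the ground distance covered by one pixel, so an object's extent in pixels is its size divided by the GSD. The helper name `pixels_across` is hypothetical, and reading "a GSD of at least 0.86m" as "0.86 m per pixel or finer" is an interpretation, since a smaller GSD means higher resolution.

```python
def pixels_across(object_size_m, gsd_m_per_px):
    """Number of pixels an object of the given size spans at a given GSD."""
    if gsd_m_per_px <= 0:
        raise ValueError("GSD must be positive")
    return object_size_m / gsd_m_per_px


# A 20 m wingspan at the coarsest, threshold, and finest GSDs from the abstract.
for gsd in (2.4, 0.86, 0.3):
    print(f"GSD {gsd} m/px -> {pixels_across(20.0, gsd):.1f} px across a 20 m wingspan")
```

At 0.86 m per pixel a 20 m wingspan spans roughly 23 pixels, compared with about 8 pixels at 2.4 m and about 67 pixels at 0.3 m.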

[CV-86] GSOT3D: Towards Generic 3D Single Object Tracking in the Wild

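[Quick read]: This paper addresses the challenges of generic 3D single object tracking (SOT) in the wild, in particular the lack of a comprehensive, high-quality benchmark that supports multiple 3D tracking tasks. The key to the solution is a new benchmark, GSOT3D, which contains 620 sequences and 123,000 frames across 54 object categories and provides multiple modalities (point cloud, RGB image, and depth). GSOT3D supports both single-modal 3D SOT (such as tracking on point clouds) and multi-modal 3D SOT (such as RGB-point cloud or RGB-depth), greatly broadening the research directions for 3D object tracking. All sequences are labeled manually with multiple rounds of inspection and refinement to ensure high-quality 3D annotations. The paper also evaluates eight representative point cloud tracking models and proposes a simple yet effective generic 3D tracker, PROT3D, which localizes the target object with a progressive spatial-temporal network and significantly outperforms existing solutions. By releasing GSOT3D, the authors hope to advance future 3D tracking research and applications.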

链接: https://arxiv.org/abs/2412.02129
作者: Yifan Jiao,Yunhao Li,Junhua Ding,Qing Yang,Song Fu,Heng Fan,Libo Zhang
关键词-EN: single object tracking, aims at facilitating, facilitating development, SOT, object tracking
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 12 figures

点击查看摘要

Abstract:In this paper, we present a novel benchmark, GSOT3D, that aims at facilitating development of generic 3D single object tracking (SOT) in the wild. Specifically, GSOT3D offers 620 sequences with 123K frames, and covers a wide selection of 54 object categories. Each sequence is offered with multiple modalities, including the point cloud (PC), RGB image, and depth. This allows GSOT3D to support various 3D tracking tasks, such as single-modal 3D SOT on PC and multi-modal 3D SOT on RGB-PC or RGB-D, and thus greatly broadens research directions for 3D object tracking. To provide high-quality per-frame 3D annotations, all sequences are labeled manually with multiple rounds of meticulous inspection and refinement. To the best of our knowledge, GSOT3D is the largest benchmark dedicated to various generic 3D object tracking tasks. To understand how existing 3D trackers perform and to provide comparisons for future research on GSOT3D, we assess eight representative point cloud-based tracking models. Our evaluation results show that these models degrade heavily on GSOT3D, and more efforts are required for robust and generic 3D object tracking. Besides, to encourage future research, we present a simple yet effective generic 3D tracker, named PROT3D, that localizes the target object via a progressive spatial-temporal network and outperforms all current solutions by a large margin. By releasing GSOT3D, we expect to advance further 3D tracking in future research and applications. Our benchmark and model as well as the evaluation results will be publicly released at our webpage this https URL.

[CV-87] Streamlining Video Analysis for Efficient Violence Detection

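[Quick read]: This paper addresses automated violence detection in video frames captured by surveillance cameras, specifically classifying scenes as "fight" or "non-fight". The key to the solution is X3D, a model based on a 3D Convolutional Neural Network (3D CNN), combined with pre-processing steps such as tube extraction, volume cropping, and frame aggregation, together with clustering techniques, to accurately localize and classify fight scenes. Experiments show that the method is effective at distinguishing violent from non-violent events, providing useful insights for advancing practical violence detection systems.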

链接: https://arxiv.org/abs/2412.02127
作者: Gourang Pathak,Abhay Kumar,Sannidhya Rawat,Shikha Gupta
关键词-EN: Convolutional Neural Network, surveillance cameras, specifically focusing, video frames captured, paper addresses
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper addresses the challenge of automated violence detection in video frames captured by surveillance cameras, specifically focusing on classifying scenes as “fight” or “non-fight.” This task is critical for enhancing unmanned security systems, online content filtering, and related applications. We propose an approach using a 3D Convolutional Neural Network (3D CNN)-based model named X3D to tackle this problem. Our approach incorporates pre-processing steps such as tube extraction, volume cropping, and frame aggregation, combined with clustering techniques, to accurately localize and classify fight scenes. Extensive experimentation demonstrates the effectiveness of our method in distinguishing violent from non-violent events, providing valuable insights for advancing practical violence detection systems.

[CV-88] Rethinking Self-Supervised Learning Within the Framework of Partial Information Decomposition

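[Quick read]: This paper addresses the debate over the role of mutual information in the self-supervised learning (SSL) framework. Specifically, existing work disagrees on whether the mutual information between augmented views should be increased or decreased. The paper proposes revisiting the core idea of SSL within the framework of partial information decomposition (PID) and replacing traditional mutual information with the more general concept of joint mutual information to resolve the debate. The key to the solution is to extract the unique information component under PID and use it to improve existing SSL models for more effective feature learning. Concretely, unique information is used in lower-level supervision for generic feature learning, while higher-level supervisory signals are developed for task-related feature learning. Experiments show that the method improves SSL frameworks across multiple baselines and datasets.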

链接: https://arxiv.org/abs/2412.02121
作者: Salman Mohamadi,Gianfranco Doretto,Donald A. Adjeroh
关键词-EN: mutual information, SSL, Supervised learning, unlabeled data, information
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) has demonstrated its effectiveness in feature learning from unlabeled data. Regarding this success, there have been some arguments on the role that mutual information plays within the SSL framework. Some works argue for increasing mutual information between representations of augmented views. Others suggest decreasing mutual information between them, while increasing task-relevant information. We ponder upon this debate and propose to revisit the core idea of SSL within the framework of partial information decomposition (PID). Thus, with SSL under PID we propose to replace traditional mutual information with the more general concept of joint mutual information to resolve the argument. Our investigation on instantiation of SSL within the PID framework leads to upgrading the existing pipelines by considering the components of the PID in the SSL models for improved representation learning. Accordingly, we propose a general pipeline that can be applied to improve existing baselines. Our pipeline focuses on extracting the unique information component under the PID to build upon lower-level supervision for generic feature learning and on developing higher-level supervisory signals for task-related feature learning. In essence, this could be interpreted as a joint utilization of local and global clustering. Experiments on four baselines and four datasets show the effectiveness and generality of our approach in improving existing SSL frameworks.
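For readers unfamiliar with PID, the standard two-source decomposition may help. The notation below is an assumption made for illustration (the abstract gives no formulas): $X_1$ and $X_2$ denote two sources, for instance the representations of two augmented views, $Y$ denotes the target variable, and $R$, $U_1$, $U_2$, $S$ denote the redundant, unique (per source), and synergistic components.

```latex
\begin{align}
I(Y; X_1, X_2) &= R + U_1 + U_2 + S \\
I(Y; X_1) &= R + U_1 \\
I(Y; X_2) &= R + U_2
\end{align}
```

The first line is the joint mutual information that the abstract proposes as the more general quantity. Each single-source mutual information captures only the redundant part plus that source's unique part, so the unique components $U_1$ and $U_2$, which the proposed pipeline focuses on extracting, cannot be isolated from pairwise mutual information alone.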

[CV-89] Understanding Particles From Video: Property Estimation of Granular Materials via Visuo-Haptic Learning ICRA2025

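[Quick read]: This paper addresses how to estimate the relative particle size and density of granular materials (GMs) from video, without dedicated measurement equipment or extensive manual effort. The key to the solution is a visuo-haptic learning framework: a neural network is trained to extract, from visual data of a probe being dragged through the granular material, features that correlate with the haptic signal, thereby implicitly characterizing the distribution of particle size and density. The core idea is that the network learns to map visual information to the haptic signal, after which the trained encoder can be used to analyze granular material properties without extra sensor data or manual labeling.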

链接: https://arxiv.org/abs/2412.02119
作者: Zeqing Zhang,Guangze Zheng,Xuebo Ji,Guanqi Chen,Ruixing Jia,Wentao Chen,Guanhua Chen,Liangjun Zhang,Jia Pan
关键词-EN: Granular materials, daily life, ubiquitous in daily, Granular, properties
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: IEEE Robotics and Automation Letters, with ICRA 2025

点击查看摘要

Abstract:Granular materials (GMs) are ubiquitous in daily life. Understanding their properties is also important, especially in agriculture and industry. However, existing works require dedicated measurement equipment and also need large human efforts to handle a large number of particles. In this paper, we introduce a method for estimating the relative values of particle size and density from the video of the interaction with GMs. It is trained on a visuo-haptic learning framework inspired by a contact model, which reveals the strong correlation between GM properties and the visual-haptic data during the probe-dragging in the GMs. After training, the network can map the visual modality well to the haptic signal and implicitly characterize the relative distribution of particle properties in its latent embeddings, as interpreted in that contact model. Therefore, we can analyze GM properties using the trained encoder, and only visual information is needed without extra sensory modalities and human efforts for labeling. The presented GM property estimator has been extensively validated via comparison and ablation experiments. The generalization capability has also been evaluated and a real-world application on the beach is also demonstrated. Experiment videos are available at this https URL.

[CV-90] ILASH: A Predictive Neural Architecture Search Framework for Multi-Task Applications

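[Quick read]: This paper addresses the efficiency problems of deploying multi-task AI models on resource-constrained edge devices, in particular how to reduce power consumption, raise frame rate, and shrink model size while keeping performance high. The key is a new neural network architecture paradigm (ILASH) that uses layer sharing to optimize these metrics. The paper also proposes a neural architecture search framework (ILASH-NAS) that takes a data-driven approach to constructing models for a given set of tasks and device constraints, making the search itself markedly more efficient in energy, time, and CO2 emissions.

Link: https://arxiv.org/abs/2412.02116
Authors: Md Hafizur Rahman, Md Mashfiq Rizvee, Sumaiya Shomaji, Prabuddha Chakraborty
Keywords: fields including healthcare, Artificial intelligence, autonomous vehicles, traffic monitoring, including healthcare
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures, 6 tables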

Abstract:Artificial intelligence (AI) is widely used in various fields including healthcare, autonomous vehicles, robotics, traffic monitoring, and agriculture. Many modern AI applications in these fields are multi-tasking in nature (i.e. perform multiple analysis on same data) and are deployed on resource-constrained edge devices requiring the AI models to be efficient across different metrics such as power, frame rate, and size. For these specific use-cases, in this work, we propose a new paradigm of neural network architecture (ILASH) that leverages a layer sharing concept for minimizing power utilization, increasing frame rate, and reducing model size. Additionally, we propose a novel neural network architecture search framework (ILASH-NAS) for efficient construction of these neural network models for a given set of tasks and device constraints. The proposed NAS framework utilizes a data-driven intelligent approach to make the search efficient in terms of energy, time, and CO2 emission. We perform extensive evaluations of the proposed layer shared architecture paradigm (ILASH) and the ILASH-NAS framework using four open-source datasets (UTKFace, MTFL, CelebA, and Taskonomy). We compare ILASH-NAS with AutoKeras and observe significant improvement in terms of both the generated model performance and neural search efficiency with up to 16x less energy utilization, CO2 emission, and training/search time.

[CV-91] OmniCreator: Self-Supervised Unified Generation with Universal Editing

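[Quick read]: This paper addresses text-driven unified image and video generation and editing. The key is OmniCreator, a framework trained in a self-supervised manner: it takes original text-video pairs as conditions and uses the same video as the denoising target to learn the semantic correspondence between video and text. At inference time, OmniCreator can generate or edit video from a text prompt without being restricted to particular editing types or requiring additional controls (such as structural conditions, attention features, or DDIM inversion). It also handles image generation, which makes it a truly unified framework. To evaluate generative video editing models, the paper introduces the OmniBench-99 dataset. Experiments show that OmniCreator substantially outperforms other models.

Link: https://arxiv.org/abs/2412.02114
Authors: Haodong Chen, Lan Wang, Harry Yang, Ser-Nam Lim
Keywords: conduct text-prompted unified, conduct text-prompted, generative video editing, video, editing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project: this https URL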

Abstract:We introduce OmniCreator, a novel framework that can conduct text-prompted unified (image+video) generation as well as editing all in one place. OmniCreator acquires generative and universal editing capabilities in a self-supervised manner, taking original text-video pairs as conditions while utilizing the same video as a denoising target to learn the semantic correspondence between video and text. During inference, when presented with a text prompt and a video, OmniCreator is capable of generating a target that is faithful to both, achieving a universal editing effect that is unconstrained as opposed to existing editing work that primarily focuses on certain editing types or relies on additional controls (e.g., structural conditions, attention features, or DDIM inversion). On the other hand, when presented with a text prompt only, OmniCreator becomes generative, producing high-quality video as a result of the semantic correspondence learned. Importantly, we found that the same capabilities extend to images as is, making OmniCreator a truly unified framework. Further, due to the lack of existing generative video editing benchmarks, we introduce the OmniBench-99 dataset, designed to evaluate the performance of generative video editing models comprehensively. Extensive experiments demonstrate that OmniCreator exhibits substantial superiority over all other models.

[CV-92] Direct Coloring for Self-Supervised Enhanced Feature Decoupling

[Quick read]: This paper addresses dimensional collapse in self-supervised learning (SSL) and proposes improving representation learning through feature decoupling. The key is feature coloring, which promotes useful features early in training; the technique is built on a Bayesian prior of the augmented data and thereby achieves feature decoupling. The paper shows that the framework is complementary to state-of-the-art techniques and outperforms both contrastive and non-contrastive methods.

Link: https://arxiv.org/abs/2412.02109
Authors: Salman Mohamadi, Gianfranco Doretto, Donald A. Adjeroh
Keywords: multiple recent theoretical, empirical studies, including the role, success of self-supervised, focus of multiple
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The success of self-supervised learning (SSL) has been the focus of multiple recent theoretical and empirical studies, including the role of data augmentation (in feature decoupling) as well as complete and dimensional representation collapse. While complete collapse is well-studied and addressed, dimensional collapse has only gained attention and been addressed in recent years, mostly using variants of redundancy reduction (aka whitening) techniques. In this paper, we further explore a complementary approach to whitening via feature decoupling for improved representation learning while avoiding representation collapse. In particular, we perform feature decoupling by early promotion of useful features via careful feature coloring. The coloring technique is developed based on a Bayesian prior of the augmented data, which is inherently encoded for feature decoupling. We show that our proposed framework is complementary to the state-of-the-art techniques, while outperforming both contrastive and recent non-contrastive methods. We also study the different effects of the coloring approach to formulate it as a general complementary technique along with other baselines.

[CV-93] AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation

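[Quick read]: This paper addresses the object repetition and local distortion that diffusion models exhibit when the inference resolution differs from the pre-training resolution. The key is AccDiffusion v2, a training-free method for patch-wise higher-resolution diffusion extrapolation. Its core contributions are: 1) decoupling the vanilla image-content-aware prompt into a set of patch-content-aware prompts that describe each patch more precisely; 2) introducing auxiliary local structural information through ControlNet to mitigate local distortion; 3) dilated sampling with window interaction to strengthen global semantic information, which suppresses both repetitive generation and local distortion. Experiments show that AccDiffusion v2 achieves state-of-the-art performance in image generation extrapolation.

Link: https://arxiv.org/abs/2412.02099
Authors: Zhihang Lin, Mingbao Lin, Wengyi Zhan, Rongrong Ji
Keywords: inference resolution differs, models suffer severe, suffer severe object, severe object repetition, Diffusion models suffer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages. arXiv admin note: text overlap with arXiv:2407.10738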

Abstract:Diffusion models suffer severe object repetition and local distortion when the inference resolution differs from its pre-trained resolution. We propose AccDiffusion v2, an accurate method for patch-wise higher-resolution diffusion extrapolation without training. Our in-depth analysis in this paper shows that using an identical text prompt for different patches leads to repetitive generation, while the absence of a prompt undermines image details. In response, our AccDiffusion v2 novelly decouples the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of a patch. Further analysis reveals that local distortion arises from inaccurate descriptions in prompts about the local structure of higher-resolution images. To address this issue, AccDiffusion v2, for the first time, introduces an auxiliary local structural information through ControlNet during higher-resolution diffusion extrapolation aiming to mitigate the local distortions. Finally, our analysis indicates that global semantic information is conducive to suppressing both repetitive generation and local distortion. Hence, our AccDiffusion v2 further proposes dilated sampling with window interaction for better global semantic information during higher-resolution diffusion extrapolation. We conduct extensive experiments, including both quantitative and qualitative comparisons, to demonstrate the efficacy of our AccDiffusion v2. The quantitative comparison shows that AccDiffusion v2 achieves state-of-the-art performance in image generation extrapolation without training. The qualitative comparison intuitively illustrates that AccDiffusion v2 effectively suppresses the issues of repetitive generation and local distortion in image generation extrapolation. Our code is available at this https URL.
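
To make the idea of dilated sampling in the abstract above more concrete, here is a minimal sketch in plain Python. It only illustrates generic strided (dilated) sampling of a 2D grid into interleaved low-resolution views and the inverse merge; the function names `dilated_views` and `merge_views` are hypothetical, the latent is a toy list of lists, and the window interaction described in the abstract is not modeled here. It is not the authors' implementation.

```python
def dilated_views(latent, scale):
    """Split an H x W grid into scale*scale interleaved views.

    View (i, j) takes every `scale`-th row starting at row i and every
    `scale`-th column starting at column j, so each view spans the whole
    image at low resolution (hypothetical helper, for illustration only).
    """
    h, w = len(latent), len(latent[0])
    if h % scale or w % scale:
        raise ValueError("grid size must be divisible by scale")
    return [
        [row[j::scale] for row in latent[i::scale]]
        for i in range(scale)
        for j in range(scale)
    ]


def merge_views(views, scale):
    """Inverse of dilated_views: interleave the views back into one grid."""
    vh, vw = len(views[0]), len(views[0][0])
    out = [[None] * (vw * scale) for _ in range(vh * scale)]
    for k, view in enumerate(views):
        i, j = divmod(k, scale)  # row and column offset of this view
        for r in range(vh):
            for c in range(vw):
                out[r * scale + i][c * scale + j] = view[r][c]
    return out


# Toy 4 x 4 "latent" with values 0..15.
latent = [[4 * r + c for c in range(4)] for r in range(4)]
views = dilated_views(latent, 2)
restored = merge_views(views, 2)
```

Each view covers the full spatial extent of the grid, which is why denoising such views can carry global semantic information that a local patch cannot.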

[CV-94] Topology-Preserving Image Segmentation with Spatial-Aware Persistent Feature Matching

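[Quick read]: This paper addresses the ambiguous matching of persistent homology features in existing topological segmentation loss functions. Existing methods rely mainly on information in the topological space and ignore the original spatial domain of the image, which makes the matching inaccurate. The proposed solution is a Spatial-Aware Topological Loss Function that uses both the topological information and the original spatial-domain information of the image to assist the matching of persistent features. This improves the topological accuracy of the segmentation and outperforms state-of-the-art methods in experiments on images of various tubular structures.

Link: https://arxiv.org/abs/2412.02076
Authors: Bo Wen, Haochen Zhang, Dirk-Uwe G. Bartsch, William R. Freeman, Truong Q. Nguyen, Cheolhong An
Keywords: topological segmentation loss, Topological Loss Function, segmentation loss functions, correctness is critical, persistent features
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: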

Abstract:Topological correctness is critical for segmentation of tubular structures. Existing topological segmentation loss functions are primarily based on the persistent homology of the image. They match the persistent features from the segmentation with the persistent features from the ground truth and minimize the difference between them. However, these methods suffer from an ambiguous matching problem since the matching only relies on the information in the topological space. In this work, we propose an effective and efficient Spatial-Aware Topological Loss Function that further leverages the information in the original spatial domain of the image to assist the matching of persistent features. Extensive experiments on images of various types of tubular structures show that the proposed method has superior performance in improving the topological accuracy of the segmentation compared with state-of-the-art methods.

[CV-95] Gaussian Object Carver: Object-Compositional Gaussian Splatting with surfaces completion

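[Quick read]: This paper addresses the limited editability and compositional flexibility of 3D scene reconstruction, especially in scenarios that require high interactivity and object-level manipulation. The key is the Gaussian Object Carver (GOC), a framework that uses 3D Gaussian Splatting (GS) combined with monocular geometry priors and multi-view geometry regularization to achieve efficient, scalable, and flexible object-compositional 3D scene reconstruction. The paper also proposes a zero-shot Object Surface Completion (OSC) model that uses priors from 3D object data to reconstruct unobserved surfaces, so that objects remain complete even in occluded regions. These contributions improve reconstruction efficiency and geometric fidelity and may advance the practical use of digital twins in embodied AI, AR/VR, and interactive simulation environments.

Link: https://arxiv.org/abs/2412.02075
Authors: Liu Liu, Xinjie Wang, Jiaxiong Qiu, Tianwei Lin, Xiaolin Zhou, Zhizhong Su
Keywords: Neural Implicit Representations, computer vision, foundational problem, problem in computer, Gaussian Object Carver
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: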

Abstract:3D scene reconstruction is a foundational problem in computer vision. Despite recent advancements in Neural Implicit Representations (NIR), existing methods often lack editability and compositional flexibility, limiting their use in scenarios requiring high interactivity and object-level manipulation. In this paper, we introduce the Gaussian Object Carver (GOC), a novel, efficient, and scalable framework for object-compositional 3D scene reconstruction. GOC leverages 3D Gaussian Splatting (GS), enriched with monocular geometry priors and multi-view geometry regularization, to achieve high-quality and flexible reconstruction. Furthermore, we propose a zero-shot Object Surface Completion (OSC) model, which uses 3D priors from 3d object data to reconstruct unobserved surfaces, ensuring object completeness even in occluded areas. Experimental results demonstrate that GOC improves reconstruction efficiency and geometric fidelity. It holds promise for advancing the practical application of digital twins in embodied AI, AR/VR, and interactive simulation environments.

[CV-96] Performance Comparison of Deep Learning Techniques in Naira Classification

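[Quick read]: This paper addresses denomination classification of banknotes of the Naira, Nigeria's official currency. The key is the deployment and evaluation of Deep Learning (DL) models. Using a diverse dataset of 1,808 images of Naira notes captured under different conditions, the authors trained models with several architectures; MobileNetV2 performed best, with a training accuracy of 90.75% and a validation accuracy of 87.04%. The model performed well across various scenarios and has potential applications in automated cash handling systems, sorting systems, and assistive technology for the visually impaired, which could improve the security and efficiency of financial transactions in Nigeria.

Link: https://arxiv.org/abs/2412.02072
Authors: Ismail Ismail Tijjani, Ahmad Abubakar Mustapha, Isma’il Tijjani Idris
Keywords: Nigeria official currency, Nigeria official, classify Currency Notes, Naira notes captured, official currency
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: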

Abstract:The Naira is Nigeria’s official currency in daily transactions. This study presents the deployment and evaluation of Deep Learning (DL) models to classify Currency Notes (Naira) by denomination. Using a diverse dataset of 1,808 images of Naira notes captured under different conditions, we trained models with different architectures and obtained the highest accuracy with MobileNetV2: the model achieved a training accuracy of 90.75% and a validation accuracy of 87.04% in classification tasks and demonstrated substantial performance across various scenarios. This model holds significant potential for practical applications, including automated cash handling systems, sorting systems, and assistive technology for the visually impaired. The results demonstrate how the model could improve the security and efficiency of financial transactions in the Nigerian economy.

[CV-97] Progress-Aware Video Frame Captioning

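[Quick read]: This paper addresses an important middle ground in captioning: progress-aware video captioning at the frame level. Existing image captioning gives isolated descriptions of single images, and video captioning gives one narrative for an entire clip; this work instead aims to generate temporally fine-grained captions that accurately describe each frame and also capture the subtle progression of actions through the video sequence. The key is ProgressCaptioner, a model designed to capture fine-grained temporal dynamics within an action sequence. The paper also develops the FrameCap dataset for training and the FrameCapEval benchmark for assessing caption quality. Experiments show that ProgressCaptioner significantly surpasses leading captioning models and produces captions that precisely track action progression, setting a new standard for temporal precision in video captioning.

Link: https://arxiv.org/abs/2412.02071
Authors: Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman
Keywords: important middle ground, entire video clip, middle ground, individual images, isolated descriptions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL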

Abstract:While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite the strong capabilities of existing leading vision language models, they often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. Alongside, we develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning. Finally, we showcase practical applications of our approach, specifically in aiding keyframe selection and advancing video understanding, highlighting its broad utility.

[CV-98] CLERF: Contrastive LEaRning for Full Range Head Pose Estimation

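[Quick read]: This paper addresses the difficulty of triplet sampling in head pose estimation (HPE) caused by data sparsity. The key is to perform contrastive learning on data generated by 3D-aware generative adversarial networks (3D-aware GAN), augmented with geometric transformations, so that the network learns genuine features that contribute to accurate HPE. Experiments show that the method performs on par with state-of-the-art models on standard test datasets and outperforms them when images are slightly rotated or flipped and in full-range head pose estimation. The authors state that it is the first model to accurately predict the full range of head poses, including upside-down poses.

Link: https://arxiv.org/abs/2412.02066
Authors: Ting-Ruen Wei, Haowei Liu, Huei-Chung Hu, Xuyang Wu, Yi Fang, Hsin-Tai Wu
Keywords: framework for representation, head pose estimation, head pose, HPE, head
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: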

Abstract:We introduce a novel framework for representation learning in head pose estimation (HPE). Previously such a scheme was difficult due to head pose data sparsity, making triplet sampling infeasible. Recent progress in 3D generative adversarial networks (3D-aware GAN) has opened the door for easily sampling triplets (anchor, positive, negative). We perform contrastive learning on extensively augmented data including geometric transformations and demonstrate that contrastive learning allows networks to learn genuine features that contribute to accurate HPE. On the other hand, we observe that existing HPE works struggle to predict head poses as accurately when test image rotation matrices are slightly out of the training dataset distribution. Experiments show that our methodology performs on par with state-of-the-art models on standard test datasets and outperforms them when images are slightly rotated/ flipped or full range head pose. To the best of our knowledge, we are the first to deliver a true full range HPE model capable of accurately predicting any head pose including upside-down pose. Furthermore, we compared with other existing full-yaw range models and demonstrated superior results.
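
The abstract above mentions sampling (anchor, positive, negative) triplets for contrastive learning but does not state the exact loss. As a hedged illustration, the sketch below shows the standard triplet margin loss on plain Python lists; the embedding values, the `margin` default, and the function names are assumptions for illustration and are not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2D embeddings: the positive is near the anchor.
anchor, positive = [0.0, 0.0], [0.0, 1.0]
easy_loss = triplet_margin_loss(anchor, positive, [3.0, 4.0])
hard_loss = triplet_margin_loss(anchor, positive, [0.0, 1.5])
```

In the setting the abstract describes, a positive would plausibly be an image with a similar head pose and a negative one with a clearly different pose, which is the kind of triplet that sparse pose data makes hard to sample.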

[CV-99] Redundant Queries in DETR-Based 3D Detection Methods: Unnecessary and Prunable

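[Quick read]: This paper addresses the excessive computational and memory cost that query-based models for 3D object detection incur by using far more object queries than needed. The key is a method called Gradually Pruning Queries (GPQ), which prunes redundant queries step by step according to their classification scores. GPQ is simple to implement and can be integrated as a fine-tuning step after training an existing model, which greatly reduces the number of queries and speeds up inference without significantly affecting performance. The reported gains are up to 1.31x faster inference on desktop GPUs and up to a 76.38% reduction in inference time on edge devices.

Link: https://arxiv.org/abs/2412.02054
Authors: Lizhen Xu, Shanmin Pang, Wenzhao Qiu, Zehao Wu, Xiuxiu Bai, Kuizhi Mei, Jianru Xue
Keywords: object detection tasks, pre-trained checkpoints readily, detection tasks, readily available online, wide range
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures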

Abstract:Query-based models are extensively used in 3D object detection tasks, with a wide range of pre-trained checkpoints readily available online. However, despite their popularity, these models often require an excessive number of object queries, far surpassing the actual number of objects to detect. The redundant queries result in unnecessary computational and memory costs. In this paper, we find that not all queries contribute equally – a significant portion of queries have a much smaller impact compared to others. Based on this observation, we propose an embarrassingly simple approach called Gradually Pruning Queries (GPQ), which prunes queries incrementally based on their classification scores. It is straightforward to implement in any query-based method, as it can be seamlessly integrated as a fine-tuning step using an existing checkpoint after training. With GPQ, users can easily generate multiple models with fewer queries, starting from a checkpoint with an excessive number of queries. Experiments on various advanced 3D detectors show that GPQ effectively reduces redundant queries while maintaining performance. Using our method, model inference on desktop GPUs can be accelerated by up to 1.31x. Moreover, after deployment on edge devices, it achieves up to a 67.86% reduction in FLOPs and a 76.38% decrease in inference time. The code will be available at this https URL.
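
The abstract above says queries are pruned incrementally by classification score during fine-tuning. The following is a rough sketch of that idea only, under stated assumptions: `score_fn` is a hypothetical stand-in for re-estimating per-query scores between pruning steps, and the schedule (drop `n_drop` queries per step until `target` remain) is an illustrative choice, not the authors' published procedure.

```python
def prune_step(scores, keep_ids, n_drop):
    """Drop the n_drop lowest-scoring queries among keep_ids."""
    ranked = sorted(keep_ids, key=lambda q: scores[q], reverse=True)
    n_keep = max(len(ranked) - n_drop, 0)
    return sorted(ranked[:n_keep])


def gradual_prune(score_fn, num_queries, target, n_drop):
    """Repeatedly prune until only `target` queries remain.

    score_fn(keep_ids) returns a mapping from query id to score; in a
    real pipeline it would be recomputed as fine-tuning proceeds
    (assumption for illustration).
    """
    if target < 0 or n_drop < 1:
        raise ValueError("target must be >= 0 and n_drop >= 1")
    keep = list(range(num_queries))
    while len(keep) > target:
        scores = score_fn(keep)
        keep = prune_step(scores, keep, min(n_drop, len(keep) - target))
    return keep


# Toy example with fixed scores for six queries.
fixed_scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.3, 4: 0.7, 5: 0.2}
kept = gradual_prune(lambda ids: fixed_scores, num_queries=6, target=3, n_drop=2)
```

As a side note on the reported numbers, a 76.38% decrease in inference time corresponds to a speedup of roughly 1 / (1 - 0.7638), or about 4.2x, on the edge device.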

[CV-100] Mutli-View 3D Reconstruction using Knowledge Distillation

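[Quick read]: This paper addresses the long inference time and heavy compute demands of large foundation models such as Dust3r in visual localization tasks. The key is knowledge distillation: the authors build a student-teacher setup with Dust3r as the teacher and explore several student architectures (based on convolutional neural networks (CNN) and on Vision Transformers). The students are trained on the 3D reconstructed points output by Dust3r so that they learn scene-specific representations and output 3D points with performance comparable to Dust3r. Experiments show that the Vision Transformer performs best both visually and quantitatively.

Link: https://arxiv.org/abs/2412.02039
Authors: Aditya Dutt, Ishikaa Lunawat, Manpreet Kaur
Keywords: Large Foundation Models, produce high quality, Large Foundation, high quality outputs, Visual Localization requires
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 10 figures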

Abstract:Large Foundation Models like Dust3r can produce high quality outputs such as pointmaps, camera intrinsics, and depth estimation, given stereo-image pairs as input. However, the application of these outputs on tasks like Visual Localization requires a large amount of inference time and compute resources. To address these limitations, in this paper, we propose the use of a knowledge distillation pipeline, where we aim to build a student-teacher model with Dust3r as the teacher and explore multiple architectures of student models that are trained using the 3D reconstructed points output by Dust3r. Our goal is to build student models that can learn scene-specific representations and output 3D points with replicable performance such as Dust3r. The data set we used to train our models is 12Scenes. We test two main architectures of models: a CNN-based architecture and a Vision Transformer based architecture. For each architecture, we also compare the use of pre-trained models against models built from scratch. We qualitatively compare the reconstructed 3D points output by the student model against Dust3r’s and discuss the various features learned by the student model. We also perform ablation studies on the models through hyperparameter tuning. Overall, we observe that the Vision Transformer presents the best performance visually and quantitatively.
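
The distillation setup above trains a student on the 3D points output by the teacher. The abstract does not give the training objective, so the sketch below assumes a simple mean squared distance between corresponding student and teacher points; the function name and the toy coordinates are hypothetical.

```python
def point_distillation_loss(student_points, teacher_points):
    """Mean squared Euclidean distance between corresponding 3D points.

    Both arguments are equal-length sequences of (x, y, z) tuples; the
    teacher points play the role of regression targets (assumed loss,
    chosen for illustration).
    """
    if len(student_points) != len(teacher_points):
        raise ValueError("point sets must have the same length")
    if not student_points:
        raise ValueError("point sets must be non-empty")
    total = 0.0
    for s, t in zip(student_points, teacher_points):
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(student_points)


# Toy example: the student is off by 1 unit in z on the first point.
teacher = [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
student = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
loss = point_distillation_loss(student, teacher)
```

Minimizing a loss of this kind pushes the student toward the teacher's reconstruction without running the teacher at inference time, which is the cost saving the abstract is after.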

[CV-101] NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training

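[Quick read]: This paper addresses the quality gap of single-step diffusion generation, that is, how to keep the speed advantage while matching the quality of multi-step methods. The key is a dynamic adversarial framework, NitroFusion, whose core contributions are: (i) a dynamic discriminator pool with specialized discriminator groups to improve generation quality; (ii) strategic refresh mechanisms to prevent discriminator overfitting; (iii) global-local discriminator heads for multi-scale quality assessment, together with unconditional/conditional training for balanced generation. The framework also supports bottom-up refinement, letting users choose between 1 and 4 denoising steps with the same model as a direct quality-speed trade-off. With these contributions, NitroFusion significantly outperforms existing single-step methods across multiple evaluation metrics, particularly in preserving fine details and global consistency.

Link: https://arxiv.org/abs/2412.02030
Authors: Dar-Yen Chen, Hmrishav Bandyopadhyay, Kai Zou, Yi-Zhe Song
Keywords: achieves high-quality generation, diffusion that achieves, achieves high-quality, dynamic adversarial framework, discriminator
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: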

Abstract:We introduce NitroFusion, a fundamentally different approach to single-step diffusion that achieves high-quality generation through a dynamic adversarial framework. While one-step methods offer dramatic speed advantages, they typically suffer from quality degradation compared to their multi-step counterparts. Just as a panel of art critics provides comprehensive feedback by specializing in different aspects like composition, color, and technique, our approach maintains a large pool of specialized discriminator heads that collectively guide the generation process. Each discriminator group develops expertise in specific quality aspects at different noise levels, providing diverse feedback that enables high-fidelity one-step generation. Our framework combines: (i) a dynamic discriminator pool with specialized discriminator groups to improve generation quality, (ii) strategic refresh mechanisms to prevent discriminator overfitting, and (iii) global-local discriminator heads for multi-scale quality assessment, and unconditional/conditional training for balanced generation. Additionally, our framework uniquely supports flexible deployment through bottom-up refinement, allowing users to dynamically choose between 1-4 denoising steps with the same model for direct quality-speed trade-offs. Through comprehensive experiments, we demonstrate that NitroFusion significantly outperforms existing single-step methods across multiple evaluation metrics, particularly excelling in preserving fine details and global consistency.

[CV-102] Learning Ensembles of Vision-based Safety Control Filters

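[Quick read]: This paper addresses the design of safety filters based on visual observations in uncertain and complex environments. The key is to use ensemble methods to improve the filters' accuracy and out-of-distribution generalization. The authors compare ensembles with different configurations against their individual member models and against large single-model baselines on the DeepAccident dataset, and find that diverse ensembles achieve better state and control classification accuracy than individual models.

Link: https://arxiv.org/abs/2412.02029
Authors: Ihab Tabbara, Hussein Sibai
Keywords: violate safety constraints, systems correct nominal, correct nominal controls, correct nominal, safety constraints
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: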

Abstract:Safety filters in control systems correct nominal controls that violate safety constraints. Designing such filters as functions of visual observations in uncertain and complex environments is challenging. Several deep learning-based approaches to tackle this challenge have been proposed recently. However, formally verifying that the learned filters satisfy critical properties that enable them to guarantee the safety of the system is currently beyond reach. Instead, in this work, motivated by the success of ensemble methods in reinforcement learning, we empirically investigate the efficacy of ensembles in enhancing the accuracy and the out-of-distribution generalization of such filters, as a step towards more reliable ones. We experiment with diverse pre-trained vision representation models as filter backbones, training approaches, and output aggregation techniques. We compare the performance of ensembles with different configurations against each other, their individual member models, and large single-model baselines in distinguishing between safe and unsafe states and controls in the DeepAccident dataset. Our results show that diverse ensembles have better state and control classification accuracies compared to individual models.
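
The abstract above mentions experimenting with output aggregation techniques for the ensemble without listing them. As an illustration only, the sketch below shows two common aggregation rules, probability averaging and majority voting, applied to hypothetical per-member "unsafe" probabilities; the function name, the `rule` parameter, and the 0.5 threshold are assumptions.

```python
def ensemble_is_unsafe(member_probs, threshold=0.5, rule="mean"):
    """Aggregate per-member unsafe probabilities into one decision.

    rule="mean": average the probabilities, then threshold.
    rule="vote": threshold each member, then take a strict majority.
    """
    if not member_probs:
        raise ValueError("need at least one ensemble member")
    if rule == "mean":
        return sum(member_probs) / len(member_probs) >= threshold
    if rule == "vote":
        votes = sum(p >= threshold for p in member_probs)
        return votes * 2 > len(member_probs)
    raise ValueError(f"unknown rule: {rule}")


# One confident member and two mildly doubtful ones.
probs = [0.9, 0.4, 0.4]
by_mean = ensemble_is_unsafe(probs, rule="mean")
by_vote = ensemble_is_unsafe(probs, rule="vote")
```

The two rules can disagree, as they do on this toy input: averaging lets one confident member dominate, while voting ignores confidence. That is one reason the choice of aggregation technique is worth comparing empirically.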

[CV-103] Unveiling Interpretability in Self-Supervised Speech Representations for Parkinsons Diagnosis

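[Quick read]: This paper addresses the complexity and limited interpretability of self-supervised speech representations in pathological speech analysis, specifically for Parkinson's Disease (PD) diagnosis. The key is a novel interpretable framework built on simple yet effective cross-attention mechanisms that operate at the embedding level and at the temporal level, offering interpretability from two complementary perspectives. On several PD detection benchmarks the framework identifies meaningful speech patterns within self-supervised representations, and fine-grained temporal analysis further improves the interpretability of deep-learning pathological speech models, paving the way for more transparent, trustworthy, and clinically applicable computer-assisted diagnosis systems. The method is also competitive in classification accuracy and robust in cross-lingual scenarios.

Link: https://arxiv.org/abs/2412.02006
Authors: David Gimeno-Gómez, Catarina Botelho, Anna Pompili, Alberto Abad, Carlos-D. Martínez-Hinarejos
Keywords: Recent works, leading to promising, increasingly relied, relied on powerful, support Parkinson Disease
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the Special Issue on “Modelling and Processing Language and Speech in Neurodegenerative Disorders” published by Journal of Selected Topics in Signal Processing (JSTSP)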

Abstract:Recent works in pathological speech analysis have increasingly relied on powerful self-supervised speech representations, leading to promising results. However, the complex, black-box nature of these embeddings and the limited research on their interpretability significantly restrict their adoption for clinical diagnosis. To address this gap, we propose a novel, interpretable framework specifically designed to support Parkinson’s Disease (PD) diagnosis. Through the design of simple yet effective cross-attention mechanisms for both embedding- and temporal-level analysis, the proposed framework offers interpretability from two distinct but complementary perspectives. Experimental findings across five well-established speech benchmarks for PD detection demonstrate the framework’s capability to identify meaningful speech patterns within self-supervised representations for a wide range of assessment tasks. Fine-grained temporal analyses further underscore its potential to enhance the interpretability of deep-learning pathological speech models, paving the way for the development of more transparent, trustworthy, and clinically applicable computer-assisted diagnosis systems in this domain. Moreover, in terms of classification accuracy, our method achieves results competitive with state-of-the-art approaches, while also demonstrating robustness in cross-lingual scenarios when applied to spontaneous speech production.

[CV-104] ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

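[Quick read]: This paper addresses how to generate a sequence of step-by-step visual instruction images from an input scene image and textual instructions. The solution has three parts. First, an automatic approach collects large-scale step-by-step visual instruction training data from instructional videos, producing a high-quality dataset of 0.6M sequences of image-text pairs. Second, the authors develop and train ShowHowTo, a video diffusion model that generates step-by-step visual instructions consistent with the input image. Third, the generated image sequences are evaluated for accuracy along three dimensions (step, scene, and task), and the model achieves state-of-the-art results on all of them.

Link: https://arxiv.org/abs/2412.01987
Authors: Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, Josef Sivic
Keywords: image sequences, visual instructions, textual instructions, multi-step image sequences, image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: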

Abstract:The goal of this work is to generate step-by-step visual instructions in the form of a sequence of images, given an input image that provides the scene context and the sequence of textual instructions. This is a challenging problem as it requires generating multi-step image sequences to achieve a complex goal while being grounded in a specific environment. Part of the challenge stems from the lack of large-scale training data for this problem. The contribution of this work is thus three-fold. First, we introduce an automatic approach for collecting large step-by-step visual instruction training data from instructional videos. We apply this approach to one million videos and create a large-scale, high-quality dataset of 0.6M sequences of image-text pairs. Second, we develop and train ShowHowTo, a video diffusion model capable of generating step-by-step visual instructions consistent with the provided input image. Third, we evaluate the generated image sequences across three dimensions of accuracy (step, scene, and task) and show our model achieves state-of-the-art results on all of them. Our code, dataset, and trained models are publicly available.

[CV-105] HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh Quality Assessment

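[Quick read]: This paper addresses the failure of existing Mesh Quality Assessment (MQA) models to capture the intricate interactions between texture and 3D geometry. The key is HybridMQA, a first-of-its-kind hybrid full-reference colored MQA framework that combines model-based, topology-aware feature extraction with projection-based 2D rendering. It uses graph learning to extract detailed 3D representations and a novel feature rendering process to align them precisely with colored projections. Cross-attention then explores the geometry-texture interactions to produce comprehensive mesh quality representations. Experiments show that HybridMQA performs strongly across diverse datasets and makes effective use of geometry-texture interactions to understand mesh quality.

Link: https://arxiv.org/abs/2412.01986
Authors: Armin Shafiee Sarvestani, Sheyang Tang, Zhou Wang
Keywords: mesh operation systems, variety of applications, Mesh quality assessment, Current MQA models, play a critical
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: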

Abstract:Mesh quality assessment (MQA) models play a critical role in the design, optimization, and evaluation of mesh operation systems in a wide variety of applications. Current MQA models, whether model-based methods using topology-aware features or projection-based approaches working on rendered 2D projections, often fail to capture the intricate interactions between texture and 3D geometry. We introduce HybridMQA, a first-of-its-kind hybrid full-reference colored MQA framework that integrates model-based and projection-based approaches, capturing complex interactions between textural information and 3D structures for enriched quality representations. Our method employs graph learning to extract detailed 3D representations, which are then projected to 2D using a novel feature rendering process that precisely aligns them with colored projections. This enables the exploration of geometry-texture interactions via cross-attention, producing comprehensive mesh quality representations. Extensive experiments demonstrate HybridMQA’s superior performance across diverse datasets, highlighting its ability to effectively leverage geometry-texture interactions for a thorough understanding of mesh quality. Our implementation will be made publicly available.

[CV-106] Smart Parking with Pixel-Wise ROI Selection for Vehicle Detection Using YOLOv8, YOLOv9, YOLOv10, and YOLOv11

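[Quick Read]: This paper addresses the gap between the growing demand for parking in increasingly urbanized cities and the limitations of traditional smart parking systems. The key to the solution is combining Internet of Things, Edge Computing, and Deep Learning concepts, using the latest YOLO models (YOLOv8, YOLOv9, YOLOv10, and YOLOv11) for vehicle detection. Exploring both edge and cloud computing, the authors find that inference times on edge devices range from 1 to 92 seconds, depending on the hardware and model version. The paper also proposes a new pixel-wise post-processing Region of Interest (ROI) selection method that accurately identifies the regions of interest in parking lot images so that vehicles can be counted. The system reaches 99.68% balanced accuracy on a custom dataset of 3,484 images, offering a cost-effective smart parking solution that ensures precise vehicle detection while preserving data privacy.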

Link: https://arxiv.org/abs/2412.01983
Authors: Gustavo P. C. P. da Luz, Gabriel Massuyoshi Sato, Luis Fernando Gomez Gonzalez, Juliana Freitag Borin
Keywords (EN): efficient parking management, increasing urbanization, growing number, cities have underscored, parking management systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to Elsevier Internet of Things, 22 pages, 11 figures, 6 tables


Abstract:The increasing urbanization and the growing number of vehicles in cities have underscored the need for efficient parking management systems. Traditional smart parking solutions often rely on sensors or cameras for occupancy detection, each with its limitations. Recent advancements in deep learning have introduced new YOLO models (YOLOv8, YOLOv9, YOLOv10, and YOLOv11), but these models have not been extensively evaluated in the context of smart parking systems, particularly when combined with Region of Interest (ROI) selection for object detection. Existing methods still rely on fixed polygonal ROI selections or simple pixel-based modifications, which limit flexibility and precision. This work introduces a novel approach that integrates Internet of Things, Edge Computing, and Deep Learning concepts, by using the latest YOLO models for vehicle detection. By exploring both edge and cloud computing, it was found that inference times on edge devices ranged from 1 to 92 seconds, depending on the hardware and model version. Additionally, a new pixel-wise post-processing ROI selection method is proposed for accurately identifying regions of interest to count vehicles in parking lot images. The proposed system achieved 99.68% balanced accuracy on a custom dataset of 3,484 images, offering a cost-effective smart parking solution that ensures precise vehicle detection while preserving data privacy.
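
To make the idea of pixel-wise ROI post-processing concrete, here is a minimal sketch. It assumes the ROI is a binary mask with the same size as the image and that a detection counts when the centre of its box lands on an ROI pixel. That centre-in-mask rule and the function names are assumptions for illustration; the abstract does not state the paper's exact selection criterion. The second function is the standard definition of balanced accuracy.

```python
def count_in_roi(boxes, roi_mask):
    """Count detections whose box centre falls on an ROI pixel.

    boxes: list of (x1, y1, x2, y2) in pixel coordinates.
    roi_mask: 2D list of 0/1 values, indexed as roi_mask[y][x].
    """
    height, width = len(roi_mask), len(roi_mask[0])
    count = 0
    for x1, y1, x2, y2 in boxes:
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        if 0 <= cy < height and 0 <= cx < width and roi_mask[cy][cx]:
            count += 1
    return count


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on positives) and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))
```

Unlike a fixed polygon, a mask of this kind can mark any set of pixels, including regions with holes or several disconnected parking areas.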

[CV-107] Enhancing Deep Learning Model Robustness through Metamorphic Re-Training

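[Quick Read]: This paper addresses the robustness of machine learning models in real-world use. The key to the solution is a Metamorphic Retraining Framework, which applies metamorphic relations to data and uses semi-supervised learning algorithms in an iterative, adaptive, multi-cycle training process. The framework integrates several semi-supervised retraining algorithms, including FixMatch, FlexMatch, MixMatch, and FullMatch, to automate the retraining, evaluation, and testing of models. Experiments show a significant robustness gain, with the robustness metric rising by 17% on average.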

Link: https://arxiv.org/abs/2412.01958
Authors: Said Togru, Youssef Sameh Mostafa, Karim Lotfy
Keywords (EN): metamorphic relations, applies metamorphic relations, Metamorphic Retraining Framework, machine learning models, paper evaluates
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:This paper evaluates the use of metamorphic relations to enhance the robustness and real-world performance of machine learning models. We propose a Metamorphic Retraining Framework, which applies metamorphic relations to data and utilizes semi-supervised learning algorithms in an iterative and adaptive multi-cycle process. The framework integrates multiple semi-supervised retraining algorithms, including FixMatch, FlexMatch, MixMatch, and FullMatch, to automate the retraining, evaluation, and testing of models with specified configurations. To assess the effectiveness of this approach, we conducted experiments on CIFAR-10, CIFAR-100, and MNIST datasets using a variety of image processing models, both pretrained and non-pretrained. Our results demonstrate the potential of metamorphic retraining to significantly improve model robustness: each model gained 17 percent on average in our robustness metric.
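
A metamorphic relation states how a model's output should behave when its input is transformed in a known way, so it can be checked without ground-truth labels. The sketch below measures one such relation, prediction invariance under a horizontal flip. The toy classifier and the choice of relation are hypothetical stand-ins; the abstract does not list the relations or the robustness metric the framework actually uses.

```python
def hflip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def toy_classifier(image):
    """Hypothetical stand-in model: class 1 if the mean pixel exceeds 0.5."""
    pixels = [p for row in image for p in row]
    return int(sum(pixels) / len(pixels) > 0.5)


def consistency_rate(model, images, transform):
    """Fraction of inputs whose prediction is unchanged by the transform."""
    same = sum(model(img) == model(transform(img)) for img in images)
    return same / len(images)
```

Inputs that violate the relation need no human label to be flagged, which is what makes them natural candidates for semi-supervised retraining.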

[CV-108] Enhancing Crop Segmentation in Satellite Image Time Series with Transformer Networks

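[Quick Read]: This paper asks whether Transformer-based networks can outperform Convolutional Neural Networks (CNNs) for crop segmentation in Satellite Image Time Series (SITS). The key to the solution is a revised version of the Transformer-based Swin UNETR model, adapted specifically for crop segmentation of SITS. On the Munich dataset it reaches 96.14% validation accuracy and 95.26% test accuracy, clearly surpassing the previous best results of 93.55% and 92.94%. On the Lombardia dataset its performance is comparable to UNet3D and superior to FPN and DeepLabV3. The experiments also indicate that the model matches or exceeds CNN accuracy while requiring significantly less training time. These results highlight the potential of Transformer-based architectures for crop segmentation in SITS and open new avenues for remote sensing applications.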

Link: https://arxiv.org/abs/2412.01944
Authors: Ignazio Gallo, Mattia Gatti, Nicola Landro, Christian Loschiavo, Mirco Boschetti, Riccardo La Grassa
Keywords (EN): Convolutional Neural Networks, Image Time Series, Satellite Image Time, Convolutional Neural, Satellite Image
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:


Abstract:Recent studies have shown that Convolutional Neural Networks (CNNs) achieve impressive results in crop segmentation of Satellite Image Time Series (SITS). However, the emergence of transformer networks in various vision tasks raises the question of whether they can outperform CNNs in this task as well. This paper presents a revised version of the Transformer-based Swin UNETR model, specifically adapted for crop segmentation of SITS. The proposed model demonstrates significant advancements, achieving a validation accuracy of 96.14% and a test accuracy of 95.26% on the Munich dataset, surpassing the previous best results of 93.55% for validation and 92.94% for the test. Additionally, the model’s performance on the Lombardia dataset is comparable to UNet3D and superior to FPN and DeepLabV3. Experiments of this study indicate that the model will likely achieve comparable or superior accuracy to CNNs while requiring significantly less training time. These findings highlight the potential of transformer-based architectures for crop segmentation in SITS, opening new avenues for remote sensing applications.

[CV-109] Global Average Feature Augmentation for Robust Semantic Segmentation with Transformers

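[Quick Read]: This paper addresses the robustness of modern neural networks to out-of-distribution data, specifically in semantic segmentation. The key to the solution is Channel Wise Feature Augmentation (CWFA), which applies a globally estimated perturbation per encoder during training. It improves the robustness of Vision Transformers with minimal compute overhead and without affecting performance on clean data. On Cityscapes and ADE20K, CWFA-enhanced models are markedly more robust to visual corruptions such as blur or noise. For example, on Cityscapes a CWFA-augmented SegFormer-B1 gains up to 27.7% mIoU robustness under impulse noise.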

Link: https://arxiv.org/abs/2412.01941
Authors: Alberto Gonzalo Rodriguez Salgado, Maying Schen, Philipp Harzig, Peter Mayer, Jose M. Alvarez
Keywords (EN): modern neural networks, deploying modern neural, Vision Transformers, neural networks, Channel Wise Feature
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Robustness to out-of-distribution data is crucial for deploying modern neural networks. Recently, Vision Transformers, such as SegFormer for semantic segmentation, have shown impressive robustness to visual corruptions like blur or noise affecting the acquisition device. In this paper, we propose Channel Wise Feature Augmentation (CWFA), a simple yet efficient feature augmentation technique to improve the robustness of Vision Transformers for semantic segmentation. CWFA applies a globally estimated perturbation per encoder with minimal compute overhead during training. Extensive evaluations on Cityscapes and ADE20K, with three state-of-the-art Vision Transformer architectures (SegFormer, Swin Transformer, and Twins), demonstrate that CWFA-enhanced models significantly improve robustness without affecting clean data performance. For instance, on Cityscapes, a CWFA-augmented SegFormer-B1 model yields up to 27.7% mIoU robustness gain on impulse noise compared to the non-augmented SegFormer-B1. Furthermore, CWFA-augmented SegFormer-B5 achieves a new state-of-the-art 84.3% retention rate, a 0.7% improvement over the recently published FAN+STL.

[CV-110] Planar Gaussian Splatting WACV

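[Quick Read]: This paper addresses learning the 3D geometry of a scene and parsing its 3D planes directly from multiple RGB images. The key to the solution is Planar Gaussian Splatting (PGS), a novel neural rendering approach. PGS models the scene with Gaussian primitives and groups them with a hierarchical Gaussian mixture approach. By progressively and probabilistically merging similar Gaussians in tree-structured Gaussian mixtures, PGS identifies distinct 3D plane instances and forms the overall 3D scene geometry. To enable the grouping, the Gaussian primitives carry additional parameters, such as surface normals and plane descriptors derived by lifting 2D masks from a general 2D segmentation model. Experiments show that PGS achieves state-of-the-art planar reconstruction without 3D plane labels or depth supervision, generalizes across datasets better than existing supervised methods, and is significantly faster than existing optimization-based approaches.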

Link: https://arxiv.org/abs/2412.01931
Authors: Farhad G. Zanjani, Hong Cai, Hanno Ackermann, Leila Mirvakhabova, Fatih Porikli
Keywords (EN): multiple RGB images, Planar Gaussian Splatting, RGB images, Gaussian Splatting, paper presents Planar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025


Abstract:This paper presents Planar Gaussian Splatting (PGS), a novel neural rendering approach to learn the 3D geometry and parse the 3D planes of a scene, directly from multiple RGB images. PGS leverages Gaussian primitives to model the scene and employs a hierarchical Gaussian mixture approach to group them. Similar Gaussians are progressively merged probabilistically in the tree-structured Gaussian mixtures to identify distinct 3D plane instances and form the overall 3D scene geometry. In order to enable the grouping, the Gaussian primitives contain additional parameters, such as plane descriptors derived by lifting 2D masks from a general 2D segmentation model and surface normals. Experiments show that the proposed PGS achieves state-of-the-art performance in 3D planar reconstruction without requiring either 3D plane labels or depth supervision. In contrast to existing supervised methods that have limited generalizability and struggle under domain shift, PGS maintains its performance across datasets thanks to its neural rendering and scene-specific optimization mechanism, while also being significantly faster than existing optimization-based approaches.

[CV-111] PROFIT: A PROximal FIne Tuning Optimizer for Multi-Task Learning

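[Quick Read]: This paper addresses the tendency of recent fine-tuning approaches in computer vision and robotics to prioritize efficiency over model accuracy. The key to the solution is PROFIT, a new optimizer designed specifically for incrementally fine-tuning converged models on new tasks or datasets. Unlike traditional optimizers such as SGD or Adam, PROFIT uses the structure of the converged model to regularize optimization, and a simple temporal gradient orthogonalization process yields clearly better fine-tuning results. PROFIT performs well on image classification, representation learning, and large-scale motion prediction, and because it is encapsulated in the optimizer logic it can be integrated into any training pipeline with little effort, reducing reliance on training models from scratch.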

Link: https://arxiv.org/abs/2412.01930
Authors: Anirudh S Chakravarthy, Shuai Kyle Zheng, Xin Huang, Sachithra Hemachandra, Xiao Zhang, Yuning Chai, Zhao Chen
Keywords (EN): vision and robotics, invaluable in computer, computer vision, PROFIT, Fine-tuning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: technical report


Abstract:Fine-tuning pre-trained models has become invaluable in computer vision and robotics. Recent fine-tuning approaches focus on improving efficiency rather than accuracy by using a mixture of smaller learning rates or frozen backbones. To return the spotlight to model accuracy, we present PROFIT, one of the first optimizers specifically designed for incrementally fine-tuning converged models on new tasks or datasets. Unlike traditional optimizers such as SGD or Adam, which make minimal assumptions due to random initialization, PROFIT leverages the structure of a converged model to regularize the optimization process, leading to improved results. By employing a simple temporal gradient orthogonalization process, PROFIT outperforms traditional fine-tuning methods across various tasks: image classification, representation learning, and large-scale motion prediction. Moreover, PROFIT is encapsulated within the optimizer logic, making it easily integrated into any training pipeline with minimal engineering effort. A new class of fine-tuning optimizers like PROFIT can drive advancements as fine-tuning and incremental training become increasingly prevalent, reducing reliance on costly model training from scratch.
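
The abstract names temporal gradient orthogonalization without spelling out the update rule. As a rough illustration of the underlying operation only, the sketch below removes from a gradient its component along a reference direction using a standard vector projection. Which vectors PROFIT orthogonalizes, and at which steps, is not given here, so `reference` is a hypothetical placeholder and this is not the PROFIT algorithm itself.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(grad, reference, eps=1e-12):
    """Return grad minus its projection onto reference.

    The result has zero inner product with `reference` (up to rounding).
    If `reference` is numerically zero, grad is returned unchanged.
    """
    denom = dot(reference, reference)
    if denom < eps:
        return list(grad)
    scale = dot(grad, reference) / denom
    return [g - scale * r for g, r in zip(grad, reference)]
```

A step taken along the orthogonalized gradient does not move the weights along the reference direction to first order, which is the general intuition behind projection-based regularization in incremental training.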

[CV-112] Understanding Bias in Large-Scale Visual Datasets

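[Quick Read]: This paper addresses bias in large-scale visual datasets, whose concrete forms remain unclear. The key to the solution is a framework that applies various transformations to extract semantic, structural, boundary, color, and frequency information, identifies the unique visual attributes that distinguish the datasets, and assesses how much each type of information reflects their bias. Object-level analysis and natural language methods are then used to generate detailed, open-ended descriptions of each dataset's characteristics, helping researchers understand the bias in existing large-scale pre-training datasets and build more diverse and representative ones.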

Link: https://arxiv.org/abs/2412.01876
Authors: Boya Zeng, Yida Yin, Zhuang Liu
Keywords (EN): modern neural networks, neural networks, easily classified, classified by modern, modern neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:


Abstract:A recent study has shown that large-scale visual datasets are very biased: they can be easily classified by modern neural networks. However, the concrete forms of bias among these datasets remain unclear. In this study, we propose a framework to identify the unique visual attributes distinguishing these datasets. Our approach applies various transformations to extract semantic, structural, boundary, color, and frequency information from datasets, and assesses how much each type of information reflects their bias. We further decompose their semantic bias with object-level analysis, and leverage natural language methods to generate detailed, open-ended descriptions of each dataset’s characteristics. Our work aims to help researchers understand the bias in existing large-scale pre-training datasets, and build more diverse and representative ones in the future. Our project page and code are available at this http URL.

[CV-113] Pairwise Discernment of AffectNet Expressions with ArcFace

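[Quick Read]: This paper addresses teaching computers to recognize human emotions through Facial Emotion Recognition (FER). The key to the solution is transfer learning combined with pairwise learning. Specifically, the study applies transfer learning with ResNeXt and EfficientNet models and with an ArcFace model originally trained for face verification, using the AffectNet database of face images annotated with emotions. It highlights the value of congruent-domain transfer learning and the effectiveness of pairwise learning in handling class imbalance, which improves model performance on the FER task.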

Link: https://arxiv.org/abs/2412.01860
Authors: Dylan Waldner, Shyamal Mitra
Keywords (EN): Facial Emotion Recognition, recognize human emotions, Emotion Recognition, preliminary step, step toward teaching
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:This study takes a preliminary step toward teaching computers to recognize human emotions through Facial Emotion Recognition (FER). Transfer learning is applied using ResNeXt, EfficientNet models, and an ArcFace model originally trained on the facial verification task, leveraging the AffectNet database, a collection of human face images annotated with corresponding emotions. The findings highlight the value of congruent domain transfer learning, the challenges posed by imbalanced datasets in learning facial emotion patterns, and the effectiveness of pairwise learning in addressing class imbalances to enhance model performance on the FER task.

[CV-114] BAFPN: Bi directional alignment of features to improve localization accuracy

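[Quick Read]: This paper addresses the failure of the traditional Feature Pyramid Network (FPN) and its variants to fully handle spatial misalignment on a global scale, which leads to suboptimal high-precision object localization. The key to the solution is a new Bidirectional Alignment Feature Pyramid Network (BAFPN). In the bottom-up information propagation phase, a Spatial Feature Alignment Module (SPAM) aligns misaligned features globally. In the top-down phase, a fine-grained Semantic Alignment Module (SEAM) further mitigates the aliasing caused by cross-scale feature fusion. On the DOTAv1.5 dataset, BAFPN improves the baseline model's AP75, AP50, and mAP by 1.68%, 1.45%, and 1.34%, respectively, and it also yields significant gains when applied to other advanced detectors.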

Link: https://arxiv.org/abs/2412.01859
Authors: Li Jiakun, Wang Qingqing, Dong Hongbin, Li Kexin
Keywords (EN): Feature Pyramid Network, Pyramid Network, utilize feature pyramids, extract multi-scale information, Alignment Feature Pyramid
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages


Abstract:Current state-of-the-art vision models often utilize feature pyramids to extract multi-scale information, with the Feature Pyramid Network (FPN) being one of the most widely used classic architectures. However, traditional FPNs and their variants (e.g., AUGFPN, PAFPN) fail to fully address spatial misalignment on a global scale, leading to suboptimal performance in high-precision localization of objects. In this paper, we propose a novel Bidirectional Alignment Feature Pyramid Network (BAFPN), which aligns misaligned features globally through a Spatial Feature Alignment Module (SPAM) during the bottom-up information propagation phase. Subsequently, it further mitigates aliasing effects caused by cross-scale feature fusion via a fine-grained Semantic Alignment Module (SEAM) in the top-down phase. On the DOTAv1.5 dataset, BAFPN improves the baseline model’s AP75, AP50, and mAP by 1.68%, 1.45%, and 1.34%, respectively. Additionally, BAFPN demonstrates significant performance gains when applied to various other advanced detectors.
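
AP50 and AP75 differ only in the intersection-over-union (IoU) threshold a prediction must reach to count as a match, which is why AP75 is the more sensitive indicator of localization accuracy. The sketch below computes IoU for axis-aligned boxes as a simplified illustration. Treating boxes as axis-aligned is an assumption made for brevity, since DOTA is typically evaluated with oriented boxes.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def is_match(pred, truth, threshold):
    """A prediction matches the ground truth if IoU reaches the threshold."""
    return box_iou(pred, truth) >= threshold
```

For example, a 10 by 10 prediction shifted 2 pixels sideways from its ground truth has an IoU of 80/120, about 0.67, so it matches at the 0.5 threshold but not at 0.75.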

[CV-115] Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation

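[Quick Read]: This paper addresses the inability of existing Vision-and-Language Navigation (VLN) agents to use episodic simulation and episodic memory when navigating unfamiliar environments. The key to the solution is a novel architecture that helps agents build a recurrent imaginative memory system. Specifically, the agent maintains a reality-imagination hybrid global memory during navigation and expands the memory map through imaginative mechanisms and navigation actions. A series of pre-training tasks helps the agent acquire fine-grained imaginative abilities. The method improves the state-of-the-art (SoTA) success rate (SR) by 7% while simultaneously imagining high-fidelity RGB representations of future scenes.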

Link: https://arxiv.org/abs/2412.01857
Authors: Yiyuan Pan, Yunzhe Xu, Zhe Liu, Hesheng Wang
Keywords (EN): Humans navigate unfamiliar, Humans navigate, navigate unfamiliar environments, episodic simulation, navigate unfamiliar
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:


Abstract:Humans navigate unfamiliar environments using the capabilities of episodic simulation and episodic memory. Developing imagination-based memory, analogous to episodic simulation and episodic memory, can enhance embodied agents’ comprehension of the complex relationship between environments and objects. However, existing Vision-and-Language Navigation (VLN) agents fail to perform the aforementioned mechanism. We propose a novel architecture to help agents build a recurrent imaginative memory system. Specifically, the agent can maintain a reality-imagination hybrid global memory during navigation and expand the memory map through imaginative mechanisms and navigation actions. Correspondingly, we design a series of pre-training tasks to help the agent acquire fine-grained imaginative abilities. Our agents improve the state-of-the-art (SoTA) success rate (SR) by 7% while simultaneously imagining high-fidelity RGB representations for future scenes.

[CV-116] Data Augmentation through Background Removal for Apple Leaf Disease Classification Using the MobileNetV2 Model

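[Quick Read]: This paper addresses the lower performance of disease classification on apple leaf images captured in the field under real-world conditions. The key to the solution is a data augmentation strategy that expands the training set with leaf images whose complex backgrounds have been removed. The author fine-tunes the lightweight pre-trained MobileNetV2 model and shows experimentally that background removal, used as data augmentation, improves classification accuracy. The method reaches 98.71% classification accuracy on the Plant Pathology database, about 3% above existing state-of-the-art methods, which demonstrates improved robustness under real-world conditions.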

Link: https://arxiv.org/abs/2412.01854
Authors: Youcef Ferdi
Keywords (EN): computer vision made, plant diseases, advances in computer, computer vision, vision made
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:The advances in computer vision made possible by deep learning technology are increasingly being used in precision agriculture to automate the detection and classification of plant diseases. Symptoms of plant diseases are often seen on their leaves. The leaf images in existing datasets have been collected either under controlled conditions or in the field. The majority of previous studies have focused on identifying leaf diseases using images captured in controlled laboratory settings, often achieving high performance. However, methods aimed at detecting and classifying leaf diseases in field images have generally exhibited lower performance. The objective of this study is to evaluate the impact of a data augmentation approach that involves removing complex backgrounds from leaf images on the classification performance of apple leaf diseases in images captured under real world conditions. To achieve this objective, the lightweight pre-trained MobileNetV2 deep learning model was fine-tuned and subsequently used to evaluate the impact of expanding the training dataset with background-removed images on classification performance. Experimental results show that this augmentation strategy enhances classification accuracy. Specifically, using the Adam optimizer, the proposed method achieved a classification accuracy of 98.71% on the Plant Pathology database, representing an approximately 3% improvement and outperforming state-of-the-art methods. This demonstrates the effectiveness of background removal as a data augmentation technique for improving the robustness of disease classification models in real-world conditions.
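
The augmentation described above can be pictured as adding a background-free copy of each training image while keeping the original. The sketch below assumes a binary leaf mask is already available for every image. How the masks are produced is not stated in the abstract, so `leaf_mask` and both function names are hypothetical.

```python
def remove_background(image, leaf_mask, fill=0):
    """Keep pixels where the mask is 1 and replace the rest with `fill`."""
    return [[px if m else fill for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, leaf_mask)]


def augment_with_background_removed(dataset, masks):
    """Return the original (image, label) pairs plus background-removed copies.

    dataset: list of (image, label); masks: one binary mask per image.
    Labels are unchanged because removing the background does not alter
    which disease the leaf shows.
    """
    extra = [(remove_background(img, mask), label)
             for (img, label), mask in zip(dataset, masks)]
    return dataset + extra
```

Because the originals stay in the training set, the model sees each leaf both with and without its field background, which is what makes this an expansion of the data rather than a preprocessing step.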

[CV-117] Explainable Artificial Intelligence for Medical Applications: A Review

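[Quick Read]: This paper addresses the reliability and transparency of artificial intelligence (AI) decision-making in medicine. The key to the solution is explainable artificial intelligence (XAI): the article reviews recent XAI research from visual, audio, and multimodal perspectives, showing how explanations of the AI decision process can increase medical professionals' trust in and acceptance of AI-assisted diagnosis and decision-making.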

Link: https://arxiv.org/abs/2412.01829
Authors: Qiyang Sun, Alican Akman, Björn W. Schuller
Keywords (EN): theory has propelled, unprecedented heights, continuous development, relentless efforts, efforts of scholars
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:The continuous development of artificial intelligence (AI) theory has propelled this field to unprecedented heights, owing to the relentless efforts of scholars and researchers. In the medical realm, AI takes a pivotal role, leveraging robust machine learning (ML) algorithms. AI technology in medical imaging aids physicians in X-ray, computed tomography (CT) scans, and magnetic resonance imaging (MRI) diagnoses, conducts pattern recognition and disease prediction based on acoustic data, delivers prognoses on disease types and developmental trends for patients, and employs intelligent health management wearable devices with human-computer interaction technology to name but a few. While these well-established applications have significantly assisted in medical field diagnoses, clinical decision-making, and management, collaboration between the medical and AI sectors faces an urgent challenge: How to substantiate the reliability of decision-making? The underlying issue stems from the conflict between the demand for accountability and result transparency in medical scenarios and the black-box model traits of AI. This article reviews recent research grounded in explainable artificial intelligence (XAI), with an emphasis on medical practices within the visual, audio, and multimodal perspectives. We endeavour to categorise and synthesise these practices, aiming to provide support and guidance for future researchers and healthcare professionals.

[CV-118] COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models ICML2023

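[Quick Read]: This paper addresses the large model sizes and high computational costs of attention-based vision models such as Vision Transformer (ViT) and its variants, and proposes an efficient model compression method. The key to the solution is a highly efficient ViT compression scheme built on a new insight into the multi-head attention layer. When compressing DeiT-small and DeiT-base on ImageNet, it achieves 0.45% and 0.76% higher top-1 accuracy than state-of-the-art pruning methods, even with fewer parameters. The approach also improves the customization efficiency of text-to-image diffusion models, with up to 2.6× faster training and up to 1927.5× lower extra storage cost.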

Link: https://arxiv.org/abs/2305.17235
Authors: Jinqi Xiao, Miao Yin, Yu Gong, Xiao Zang, Jian Ren, Bo Yuan
Keywords (EN): shown promising performance, computer vision tasks, Attention-based vision models, shown promising, promising performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2023 Poster


Abstract:Attention-based vision models, such as Vision Transformer (ViT) and its variants, have shown promising performance in various computer vision tasks. However, these emerging architectures suffer from large model sizes and high computational costs, calling for efficient model compression solutions. To date, pruning ViTs has been well studied, while other compression strategies that have been widely applied in CNN compression, e.g., model factorization, are little explored in the context of ViT compression. This paper explores an efficient method for compressing vision transformers to enrich the toolset for obtaining compact attention-based vision models. Based on the new insight on the multi-head attention layer, we develop a highly efficient ViT compression solution, which outperforms the state-of-the-art pruning methods. For compressing DeiT-small and DeiT-base models on ImageNet, our proposed approach can achieve 0.45% and 0.76% higher top-1 accuracy even with fewer parameters. Our finding can also be applied to improve the customization efficiency of text-to-image diffusion models, with much faster training (up to 2.6× speedup) and lower extra storage cost (up to 1927.5× reduction) than the existing works.
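
Model factorization, which the abstract contrasts with pruning, replaces a dense m × n weight matrix with the product of an m × r and an r × n matrix. The arithmetic below shows when that saves parameters. It is a generic low-rank sketch, not the paper's multi-head-attention-specific scheme, and the 384 × 384 shape in the usage note is only an illustrative size.

```python
def dense_params(m, n):
    """Parameter count of a dense m x n weight matrix."""
    return m * n


def factorized_params(m, n, rank):
    """Parameter count of an (m x rank) times (rank x n) factorization."""
    return rank * (m + n)


def max_rank_for_savings(m, n):
    """Largest rank r with r * (m + n) strictly less than m * n."""
    return (m * n - 1) // (m + n)
```

For a 384 × 384 matrix, the dense form holds 147,456 parameters, a rank-64 factorization holds 49,152 (a 3× reduction), and any rank above 191 stops saving parameters.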

[CV-119] Segmentation of Coronary Artery Stenosis in X-ray Angiography using Mamba Models

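[Quick Read]: This paper addresses the automated identification of coronary artery stenosis in X-ray images for the diagnosis of coronary artery disease. The key to the solution is the use of five Mamba-based model variants and one Swin Transformer-based variant, primarily built on the U-Net architecture, to localize stenosis. The U-Mamba BOT model performs best, with an F1 score of 68.79%, an 11.8% improvement over the semi-supervised approach.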

Link: https://arxiv.org/abs/2412.02568
Authors: Ali Rostami, Fatemeh Fouladi, Hedieh Sajedi
Keywords (EN): global mortality rates, Coronary artery disease, artery disease stands, Coronary artery, coronary artery stenosis
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Coronary artery disease stands as one of the primary contributors to global mortality rates. The automated identification of coronary artery stenosis from X-ray images plays a critical role in the diagnostic process for coronary heart disease. This task is challenging due to the complex structure of coronary arteries, intrinsic noise in X-ray images, and the fact that stenotic coronary arteries appear narrow and blurred in X-ray angiographies. This study employs five different variants of the Mamba-based model and one variant of the Swin Transformer-based model, primarily based on the U-Net architecture, for the localization of stenosis in coronary artery disease. Our best results showed an F1 score of 68.79% for the U-Mamba BOT model, representing an 11.8% improvement over the semi-supervised approach.
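
For reference, the F1 score reported above is the harmonic mean of precision and recall. The sketch below computes it from true positive, false positive, and false negative counts. The counts in the usage note are made-up illustrative numbers, not results from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall).

    Returns 0.0 when there are no true positives, which avoids
    dividing by zero when precision and recall are both zero.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With 6 true positives, 2 false positives, and 4 false negatives, precision is 0.75, recall is 0.6, and F1 is about 0.667, which equals 2·TP / (2·TP + FP + FN).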

[CV-120] Multi-scale and Multi-path Cascaded Convolutional Network for Semantic Segmentation of Colorectal Polyps

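[Quick Read]: This paper addresses two limitations of existing colorectal polyp segmentation models: inadequate representation of spatial dependence and the lack of multi-level feature integration in the decoding stage. The key to the solution is a new framework, the Multi-Scale and Multi-Path Cascaded Convolution Network (MMCC-Net), which integrates multi-scale and multi-path cascaded convolutions and strengthens feature aggregation with dual attention modules, skip connections, and a feature enhancer, improving pixel-level identification of polyp regions. MMCC-Net was tested on six public datasets and compared with eight state-of-the-art models. The confidence intervals of its Dice and Mean Intersection over Union (MIoU) scores indicate accurate and efficient segmentation, making it a useful tool for early detection and prevention of colorectal cancer.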

Link: https://arxiv.org/abs/2412.02443
Authors: Malik Abdul Manan, Feng Jinchao, Muhammad Yaqub, Shahzad Ahmed, Syed Muhammad Ali Imran, Imran Shabir Chuhan, Haroon Ahmed Khan
Keywords (EN): Cascaded Convolution Network, Multi-Path Cascaded Convolution, colorectal polyp segmentation, structural abnormalities, gastrointestinal tract
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Colorectal polyps are structural abnormalities of the gastrointestinal tract that can potentially become cancerous in some cases. The study introduces a novel framework for colorectal polyp segmentation named the Multi-Scale and Multi-Path Cascaded Convolution Network (MMCC-Net), aimed at addressing the limitations of existing models, such as inadequate spatial dependence representation and the absence of multi-level feature integration during the decoding stage. It integrates multi-scale and multi-path cascaded convolutional techniques and enhances feature aggregation through dual attention modules, skip connections, and a feature enhancer. MMCC-Net achieves superior performance in identifying polyp areas at the pixel level. The proposed MMCC-Net was tested across six public datasets and compared against eight SOTA models to demonstrate its efficiency in polyp segmentation. The MMCC-Net’s performance shows Dice scores with confidence intervals ranging between (77.08, 77.56) and (94.19, 94.71) and Mean Intersection over Union (MIoU) scores with confidence intervals ranging from (72.20, 73.00) to (89.69, 90.53) on the six databases. These results highlight the model’s potential as a powerful tool for accurate and efficient polyp segmentation, contributing to early detection and prevention strategies in colorectal cancer.
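
The Dice and IoU scores quoted above are both overlap measures between a predicted mask and a ground-truth mask. The sketch below computes them for binary masks stored as nested lists. Returning 1.0 when both masks are empty is a convention assumed here, and evaluation protocols differ on that edge case.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as 2D lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    dice = 2 * inter / total
    iou = inter / (total - inter)
    return dice, iou
```

For a single pair of masks the two are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU, consistent with the Dice intervals above sitting higher than the MIoU intervals.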

[CV-121] Initial Study On Improving Segmentation By Combining Preoperative CT And Intraoperative CBCT Using Synthetic Data

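[Quick Read]: This paper addresses poor segmentation performance in computer-assisted interventions caused by the degraded image quality of cone-beam computed tomography (CBCT). The key to the solution is a multimodal learning method that fuses roughly aligned CBCT and CT scans to improve segmentation. The fusion setup improves segmentation performance in 18 of the 20 investigated setups.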

Link: https://arxiv.org/abs/2412.02294
Authors: Maximilian E. Tschuchnig, Philipp Steininger, Michael Gadermayr
Keywords (EN): minimally invasive procedures, Interventions enable clinicians, Computer-Assisted Interventions enable, advanced imaging methods, Computer-Assisted Interventions
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at BVM 2025. arXiv admin note: text overlap with arXiv:2406.11650


Abstract:Computer-Assisted Interventions enable clinicians to perform precise, minimally invasive procedures, often relying on advanced imaging methods. Cone-beam computed tomography (CBCT) can be used to facilitate computer-assisted interventions, despite often suffering from artifacts that pose challenges for accurate interpretation. While the degraded image quality can affect image analysis, the availability of high quality, preoperative scans offers potential for improvements. Here we consider a setting where preoperative CT and intraoperative CBCT scans are available, however, the alignment (registration) between the scans is imperfect to simulate a real world scenario. We propose a multimodal learning method that fuses roughly aligned CBCT and CT scans and investigate the effect on segmentation performance. For this experiment we use synthetically generated data containing real CT and synthetic CBCT volumes with corresponding voxel annotations. We show that this fusion setup improves segmentation performance in 18 out of 20 investigated setups.
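
One simple way to picture the setting is early fusion: the CT slice is deliberately shifted to simulate imperfect registration and is then stacked with the CBCT slice as a second input channel. The sketch below does exactly that on 2D lists. Channel stacking and a pure row shift are illustrative assumptions, since the abstract does not describe the fusion architecture or the misalignment model.

```python
def shift_rows(image, dy, fill=0.0):
    """Shift a 2D slice down by dy rows (up if dy is negative), padding with fill."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        src = y - dy
        if 0 <= src < height:
            out[y] = list(image[src])
    return out


def fuse_channels(cbct, ct):
    """Stack two same-sized slices into per-pixel [cbct, ct] feature pairs."""
    return [[[a, b] for a, b in zip(row_cbct, row_ct)]
            for row_cbct, row_ct in zip(cbct, ct)]
```

A segmentation network fed such pairs must learn to use the cleaner CT signal even though it does not line up exactly with the CBCT pixels, which is the effect the study measures.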

[CV-122] U-Net in Medical Image Segmentation: A Review of Its Applications Across Modalities

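[Quick Read]: This paper addresses the labor intensity and inter-expert variability of manual feature-extraction methods in medical image analysis. The key to the solution is the use of Artificial Intelligence (AI) and Deep Learning (DL), in particular convolutional models such as U-Net and its variants (U-Net++ and U-Net 3+), to automate medical image segmentation (MIS) and improve its accuracy. With efficient pixel-wise classification, these models overcome the limitations of manual segmentation and improve segmentation efficiency and accuracy across imaging modalities.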

Link: https://arxiv.org/abs/2412.02242
Authors: Fnu Neha, Deepshikha Bhati, Deepak Kumar Shukla, Sonavi Makarand Dalvi, Nikolaos Mantzou, Safa Shubbar
Keywords (EN): provide key insights, Magnetic Resonance Imaging, anatomy and pathology, aiding in diagnosis, diagnosis and treatment
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:


Abstract:Medical imaging is essential in healthcare to provide key insights into patient anatomy and pathology, aiding in diagnosis and treatment. Non-invasive techniques such as X-ray, Magnetic Resonance Imaging (MRI), Computed Tomography (CT), and Ultrasound (US), capture detailed images of organs, tissues, and abnormalities. Effective analysis of these images requires precise segmentation to delineate regions of interest (ROI), such as organs or lesions. Traditional segmentation methods, relying on manual feature-extraction, are labor-intensive and vary across experts. Recent advancements in Artificial Intelligence (AI) and Deep Learning (DL), particularly convolutional models such as U-Net and its variants (U-Net++ and U-Net 3+), have transformed medical image segmentation (MIS) by automating the process and enhancing accuracy. These models enable efficient, precise pixel-wise classification across various imaging modalities, overcoming the limitations of manual segmentation. This review explores various medical imaging techniques, examines the U-Net architectures and their adaptations, and discusses their application across different modalities. It also identifies common challenges in MIS and proposes potential solutions.

[CV-123] A Classic-Quantum Hybrid Network Framework: CQH-Net

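[Quick Read]: This paper addresses the unclear intrinsic mechanisms and decision transparency of quantum neural networks (QNNs) in artificial intelligence. The key to the solution is a classic-quantum hybrid network framework (CQH-Net), which uses traditional machine learning methods for feature extraction and quantum neural networks for classification. In image classification experiments on public datasets, CQH-Net improves the average convergence rate by 72.8% over classical convolutional networks (CNNs) with identical parameters, and on Fashion MNIST it reaches a final accuracy of 99.02%, 5.07% higher than CNNs. The study also explores visual explanations of CQH-Net's decision-making, showing that the model captures key data features during training and associates them with the corresponding categories. The authors conclude that quantization strengthens the model's ability to handle complex classification problems while keeping its decisions transparent, further supporting quantum advantages in machine learning.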

Link: https://arxiv.org/abs/2412.02059
Authors: Ao Liu, Cuihong Wen, Jieci Wang
Keywords: shown remarkable capabilities, Deep Learning, pattern recognition, shown remarkable, remarkable capabilities
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 8 figures

Abstract: Deep Learning has shown remarkable capabilities in pattern recognition, feature extraction, and classification decision-making. With the rise of quantum computing, the potential of quantum neural networks (QNNs) in Artificial Intelligence is emerging. However, the intrinsic mechanisms and decision transparency of QNNs remain unclear. In this paper, we propose a classic-quantum hybrid network framework (CQH-Net), which uses traditional machine learning methods for feature extraction and quantizes neural networks for classification tasks. We apply CQH-Net to image classification on public datasets. Experimentally, CQH-Net achieves an average convergence rate improvement of 72.8% compared to classical convolutional networks (CNNs) with identical parameters. On the Fashion MNIST dataset, it reaches a final accuracy of 99.02%, representing a significant increase of 5.07% over CNNs. Furthermore, we explore visual explanations for CQH-Net’s decision-making process. Results show that the model effectively captures key data features during training and establishes associations between these features and their corresponding categories. This study demonstrates that quantization enhances the model's ability to tackle complex classification problems while providing transparency in its decision-making process, further supporting quantum advantages in machine learning.

[CV-124] FoveaSPAD: Exploiting Depth Priors for Adaptive and Efficient Single-Photon 3D Imaging

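Quick read: This paper tackles two key obstacles to the practical use of LiDAR based on single-photon avalanche diodes (SPADs): sensitivity to ambient light, and the compute and storage needed to process large volumes of raw photon data. The key to the solution is a set of new algorithms and sensing policies that use external signals to guide ("foveate") how the SPAD system estimates scene depth, which improves signal-to-noise ratio (SNR) and reduces the amount of data to process. Concretely, the method "zooms into" the signal region of interest, cutting the raw photon data that must be stored and transferred while improving resilience to ambient light. Experiments report up to a 1548-fold reduction in memory usage in hardware emulation, and the approach can be applied to newly available and future SPAD arrays.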

Link: https://arxiv.org/abs/2412.02052
Authors: Justin Folden, Atul Ingle, Sanjeev J. Koppal
Keywords: autonomous vehicles, accurate depth-sensing, depth-sensing is important, important for safety-critical, safety-critical applications
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Fast, efficient, and accurate depth-sensing is important for safety-critical applications such as autonomous vehicles. Direct time-of-flight LiDAR has the potential to fulfill these demands, thanks to its ability to provide high-precision depth measurements at long standoff distances. While conventional LiDAR relies on avalanche photodiodes (APDs), single-photon avalanche diodes (SPADs) are an emerging image-sensing technology that offer many advantages such as extreme sensitivity and time resolution. In this paper, we remove the key challenges to widespread adoption of SPAD-based LiDARs: their susceptibility to ambient light and the large amount of raw photon data that must be processed to obtain in-pixel depth estimates. We propose new algorithms and sensing policies that improve signal-to-noise ratio (SNR) and increase computing and memory efficiency for SPAD-based LiDARs. During capture, we use external signals to foveate, i.e., guide how the SPAD system estimates scene depths. This foveated approach allows our method to "zoom into" the signal of interest, reducing the amount of raw photon data that needs to be stored and transferred from the SPAD sensor, while also improving resilience to ambient light. We show results both in simulation and also with real hardware emulation, with specific implementations achieving a 1548-fold reduction in memory usage, and our algorithms can be applied to newly available and future SPAD arrays.
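
The memory argument behind foveation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' algorithm: a per-pixel depth prior (`prior_bin`) stands in for the "external signals", and only a short window of timing bins around that prior is stored instead of the full histogram. The bin count, window width, photon counts, and all function names are hypothetical.

```python
import random

def full_histogram(timestamps, num_bins):
    """Count photon arrivals in every time bin (conventional per-pixel storage)."""
    hist = [0] * num_bins
    for t in timestamps:
        if 0 <= t < num_bins:
            hist[t] += 1
    return hist

def foveated_histogram(timestamps, prior_bin, half_width, num_bins):
    """Keep only the bins within half_width of a prior depth estimate."""
    lo = max(0, prior_bin - half_width)
    hi = min(num_bins, prior_bin + half_width + 1)
    window = [0] * (hi - lo)
    for t in timestamps:
        if lo <= t < hi:
            window[t - lo] += 1
    return lo, window

def peak_bin(offset, hist):
    """Return the absolute index of the most populated bin."""
    return offset + max(range(len(hist)), key=hist.__getitem__)

random.seed(0)
NUM_BINS, TRUE_BIN = 1024, 400
# Synthetic pixel: uniform ambient photons plus a laser return near TRUE_BIN.
ambient = [random.randrange(NUM_BINS) for _ in range(2000)]
signal = [TRUE_BIN + random.choice((-1, 0, 1)) for _ in range(300)]
photons = ambient + signal

full = full_histogram(photons, NUM_BINS)
# The prior is deliberately a few bins off to mimic an imperfect external cue.
offset, fov = foveated_histogram(photons, prior_bin=405, half_width=16, num_bins=NUM_BINS)
memory_ratio = len(full) / len(fov)
estimated_bin = peak_bin(offset, fov)
```

In this toy setup the window also discards most ambient photons, which is the intuition behind the claimed robustness to ambient light. The ratio computed here depends entirely on the chosen window and has no relation to the 1548-fold figure reported in the paper.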

[CV-125] ASANet: Asymmetric Semantic Aligning Network for RGB and SAR image land cover classification

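Quick read: This paper addresses the failure of existing multimodal Land Cover Classification (LCC) methods to fully exploit the complementary features of Synthetic Aperture Radar (SAR) and RGB images. The key to the solution is a new architecture, the Asymmetric Semantic Aligning Network (ASANet), which introduces asymmetry at the feature level so that complementary features are used rather than lost. Its core is the Semantic Focusing Module (SFM), which explicitly computes differential weights for each modality to account for modality-specific features. ASANet also includes a Cascade Fusion Module (CFM), which examines channel and spatial representations in depth to select and fuse features from the two modalities efficiently. Together, the two modules let ASANet learn feature correlations between the modalities and remove noise caused by feature differences.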

Link: https://arxiv.org/abs/2412.02044
Authors: Pan Zhang, Baochai Peng, Chaoran Lu, Quanjin Huang
Keywords: Synthetic Aperture Radar, Land Cover Classification, multimodal Land Cover, Synthetic Aperture, Aperture Radar
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Synthetic Aperture Radar (SAR) images have proven to be a valuable cue for multimodal Land Cover Classification (LCC) when combined with RGB images. Most existing studies on cross-modal fusion assume that consistent feature information is necessary between the two modalities, and as a result, they construct networks without adequately addressing the unique characteristics of each modality. In this paper, we propose a novel architecture, named the Asymmetric Semantic Aligning Network (ASANet), which introduces asymmetry at the feature level to address the issue that multi-modal architectures frequently fail to fully utilize complementary features. The core of this network is the Semantic Focusing Module (SFM), which explicitly calculates differential weights for each modality to account for the modality-specific features. Furthermore, ASANet incorporates a Cascade Fusion Module (CFM), which delves deeper into channel and spatial representations to efficiently select features from the two modalities for fusion. Through the collaborative effort of these two modules, the proposed ASANet effectively learns feature correlations between the two modalities and eliminates noise caused by feature differences. Comprehensive experiments demonstrate that ASANet achieves excellent performance on three multimodal datasets. Additionally, we have established a new RGB-SAR multimodal dataset, on which our ASANet outperforms other mainstream methods with improvements ranging from 1.21% to 17.69%. The ASANet runs at 48.7 frames per second (FPS) when the input image is 256x256 pixels. The source code is available at this https URL

[CV-126] INSIGHT: Explainable Weakly-Supervised Medical Image Analysis

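Quick read: This paper addresses the difficulty existing methods have in localizing small but clinically critical details when processing large volumetric scans and whole-slide pathology images (WSIs). Existing methods typically extract embeddings from local regions and pass them to an aggregator for prediction, but they need post-hoc visualization techniques (e.g., Grad-CAM) and often fail to localize such details accurately. The proposed solution is INSIGHT, a novel weakly-supervised aggregator that builds heatmap generation in as an inductive bias. Starting from pre-trained feature maps, INSIGHT uses a detection module with small convolutional kernels to capture fine details and a context module with a broader receptive field to suppress local false positives. The method achieves state-of-the-art classification results on CT and WSI benchmarks, along with high weakly-labeled semantic segmentation performance.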

Link: https://arxiv.org/abs/2412.02012
Authors: Wenbo Zhang, Junyu Chen, Christopher Kanan
Keywords: whole-slide pathology images, aggregator makes predictions, large sizes, volumetric scans, pathology images
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Due to their large sizes, volumetric scans and whole-slide pathology images (WSIs) are often processed by extracting embeddings from local regions and then an aggregator makes predictions from this set. However, current methods require post-hoc visualization techniques (e.g., Grad-CAM) and often fail to localize small yet clinically crucial details. To address these limitations, we introduce INSIGHT, a novel weakly-supervised aggregator that integrates heatmap generation as an inductive bias. Starting from pre-trained feature maps, INSIGHT employs a detection module with small convolutional kernels to capture fine details and a context module with a broader receptive field to suppress local false positives. The resulting internal heatmap highlights diagnostically relevant regions. On CT and WSI benchmarks, INSIGHT achieves state-of-the-art classification results and high weakly-labeled semantic segmentation performance. Project website and code are available at: this https URL

[CV-127] MPBD-LSTM: A Predictive Model for Colorectal Liver Metastases Using Time Series Multi-phase Contrast-Enhanced CT Scans

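Quick read: This paper addresses early detection of colorectal cancer liver metastasis (CRLM). The key to the solution is a multi-plane deep learning model that handles five-dimensional data (time, phase, and the three planes of 3D CT). Specifically, the paper proposes a multi-plane architecture based on 3D bi-directional LSTM (MPBD-LSTM), which performs best on time-series contrast-enhanced CT scans and reaches an AUC of 0.79. The results nonetheless indicate substantial room for improvement.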

Link: https://arxiv.org/abs/2412.01973
Authors: Xueyang Li, Han Xiao, Weixiang Weng, Xiaowei Xu, Yiyu Shi
Keywords: develop colorectal cancer, colorectal cancer liver, patients develop colorectal, cancer liver metastasis, Colorectal cancer
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Colorectal cancer is a prevalent form of cancer, and many patients develop colorectal cancer liver metastasis (CRLM) as a result. Early detection of CRLM is critical for improving survival rates. Radiologists usually rely on a series of multi-phase contrast-enhanced computed tomography (CECT) scans done during follow-up visits to perform early detection of the potential CRLM. These scans form unique five-dimensional data (time, phase, and axial, sagittal, and coronal planes in 3D CT). Most of the existing deep learning models can readily handle four-dimensional data (e.g., time-series 3D CT images) and it is not clear how well they can be extended to handle the additional dimension of phase. In this paper, we build a dataset of time-series CECT scans to aid in the early diagnosis of CRLM, and build upon state-of-the-art deep learning techniques to evaluate how to best predict CRLM. Our experimental results show that a multi-plane architecture based on 3D bi-directional LSTM, which we call MPBD-LSTM, works best, achieving an area under curve (AUC) of 0.79. On the other hand, analysis of the results shows that there is still great room for further improvement.

[CV-128] Volumetric Reconstruction of Prostatectomy Specimens from Histology

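Quick read: This paper addresses the visualization and integration of the complex information behind surgical pathology reports for prostate cancer. The key to the solution is 3D-SLIVER, an open-source 3DSlicer extension that performs 3D tissue reconstruction through four sub-modules: digitization of the slicing protocol, virtual slicing of arbitrary 3D models based on that protocol, registration of slides with virtual slices using the Coherent Point Drift algorithm, and 3D reconstruction of the registered information using convex hulls, Gaussian splatter, and linear extrusion. The tool aims to simplify integration into pathology workflows and to support interdisciplinary communication and multi-modal research.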

Link: https://arxiv.org/abs/2412.01855
Authors: Tom Bisson, Isil Dogan O, Iris Piwonski, Tim-Rasmus Kiehl, Georg Lukas Baumgärtner, Rita Carvalho, Peter Hufnagl, Tobias Penzkofer, Norman Zerbe, Sefer Elezkurtaj
Keywords: involves organ removal, Surgical treatment, organ removal, involves organ, Surgical
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Surgical treatment for prostate cancer often involves organ removal, i.e., prostatectomy. Pathology reports on these specimens convey treatment-relevant information. Beyond these reports, the diagnostic process generates extensive and complex information that is difficult to represent in reports, although it is of significant interest to the other medical specialties involved. 3D tissue reconstruction would allow for better spatial visualization, as well as combinations with other imaging modalities. Existing approaches in this area have proven labor-intensive and challenging to integrate into clinical workflows. 3D-SLIVER provides a simplified solution, implemented as an open-source 3DSlicer extension. We outline three specific real-world scenarios to illustrate its potential to improve transparency in diagnostic workflows and contribute to multi-modal research endeavors. Implementing the 3D reconstruction process involved four sub-modules of 3D-SLIVER: digitization of slicing protocol, virtual slicing of arbitrary 3D models based on that protocol, registration of slides with virtual slices using the Coherent Point Drift algorithm, and 3D reconstruction of registered information using convex hulls, Gaussian splatter and linear extrusion. Three use cases to employ 3D-SLIVER are presented: a low-effort approach to pathology workflow integration and two research-related use cases illustrating how to perform retrospective evaluations of PI-RADS predictions and statistically model 3D distributions of morphological patterns. 3D-SLIVER allows for improved interdisciplinary communication among specialties. It is designed for simplicity in application, allowing for flexible integration into various workflows and use cases. Here we focused on the clinical care of prostate cancer patients, but future possibilities are extensive with other neoplasms and in education and research.

[CV-129] You can monitor your hydration level using your smartphone camera

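Quick read: This paper addresses non-invasive self-monitoring of a person's hydration level with a smartphone. The key to the solution is to record a fingertip video with the smartphone camera, extract a photoplethysmography (PPG) signal from it, and process and classify the PPG data with machine learning, deep learning, and transformer models, giving both binary and four-class hydration assessments. The method reports high classification accuracy (95% to 99%). An alternative pipeline uses a deep learning model for feature extraction and t-SNE for feature selection and dimensionality reduction, and the model's decisions are interpreted under a SHAP-based framework. The result is a rapid, cost-effective, at-home self-test that is in line with the United Nations sustainable development goals.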

Link: https://arxiv.org/abs/2402.07467
Authors: Rose Alaslani, Levina Perzhilla, Muhammad Mahboob Ur Rahman, Taous-Meriem Laleg-Kirati, Tareq Y. Al-Naffouri
Keywords: popular assistive gadget, assistive gadget, utilize the regular, popular assistive, hydration level
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 13 figures, 6 tables, under review with a journal

Abstract: This work proposes for the first time to utilize the regular smartphone – a popular assistive gadget – to design a novel, non-invasive method for self-monitoring of one’s hydration level on a scale of 1 to 4. The proposed method involves recording a small video of a fingertip using the smartphone camera. Subsequently, a photoplethysmography (PPG) signal is extracted from the video data, capturing the fluctuations in peripheral blood volume as a reflection of a person’s hydration level changes over time. To train and evaluate the artificial intelligence models, a custom multi-session labeled dataset was constructed by collecting video-PPG data from 25 fasting subjects during the month of Ramadan in 2023. With this, we solve two distinct problems: 1) binary classification (whether a person is hydrated or not), 2) four-class classification (whether a person is fully hydrated, mildly dehydrated, moderately dehydrated, or extremely dehydrated). For both classification problems, we feed the pre-processed and augmented PPG data to a number of machine learning, deep learning and transformer models, which provide very high accuracy, i.e., in the range of 95% to 99%. We also propose an alternate method where we feed high-dimensional PPG time-series data to a DL model for feature extraction, followed by the t-SNE method for feature selection and dimensionality reduction, followed by a number of ML classifiers that do dehydration level classification. Finally, we interpret the decisions by the developed deep learning model under the SHAP-based explainable artificial intelligence framework. The proposed method allows rapid, do-it-yourself, at-home testing of one’s hydration level, is cost-effective and thus in line with the sustainable development goals 3 and 10 of the United Nations, and a step forward to patient-centric healthcare systems, smart homes, and smart cities of the future.
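
The abstract does not say how the PPG signal is extracted from the fingertip video. A common approach, assumed here purely for illustration, is to average the red channel of each frame and remove the mean. The sketch below runs that idea on a synthetic video; the frame layout (rows of `(r, g, b)` tuples), the function names, and the synthetic pulse are all hypothetical.

```python
import math

def extract_ppg(frames):
    """Average the red channel of each frame, then subtract the mean (crude detrend)."""
    raw = []
    for frame in frames:
        reds = [pixel[0] for row in frame for pixel in row]
        raw.append(sum(reds) / len(reds))
    mean = sum(raw) / len(raw)
    return [value - mean for value in raw]

def synthetic_fingertip_video(num_frames, fps, pulse_hz, size=4):
    """Build frames whose red intensity oscillates at the given pulse frequency."""
    frames = []
    for n in range(num_frames):
        red = 200.0 + 10.0 * math.sin(2 * math.pi * pulse_hz * n / fps)
        frames.append([[(red, 40.0, 30.0)] * size for _ in range(size)])
    return frames

FPS = 30
video = synthetic_fingertip_video(num_frames=300, fps=FPS, pulse_hz=1.2)
ppg = extract_ppg(video)  # one sample per frame, centered on zero
```

A real recording would also need band-pass filtering and motion-artifact handling before any classifier sees the signal; those steps are omitted here.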

[CV-130] Your smartphone could act as a pulse-oximeter and as a single-lead ECG

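Quick read: This paper addresses how, in the post-pandemic era, key body vitals such as pulse rate (PR), blood oxygen saturation (SpO2), and respiratory rate (RR) can be monitored rapidly and continuously in a convenient, low-cost, non-invasive way. The key to the solution is to use deep learning to turn the smartphone, a ubiquitous personal device, into a diagnostic tool. The user records a fingertip video with the rear camera; the video is pre-processed to extract a filtered or detrended video-photoplethysmography (vPPG) signal; that signal is fed to custom-built convolutional neural networks (CNNs), which output the pulse rate, blood oxygen saturation, and respiratory rate, and a single-lead electrocardiogram (ECG) can also be extracted. The contributions are: 1) estimating the three vitals from vPPG data with custom CNNs, a vision transformer, and a CLIP model; 2) a new method based on the discrete cosine transform and a feedforward neural network that translates the recorded video-PPG signal into a single-lead ECG signal.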

Link: https://arxiv.org/abs/2305.12583
Authors: Ahsan Mehmood, Asma Sarauji, M. Mahboob Ur Rahman, Tareq Y. Al-Naffouri
Keywords: state of well-being, increased concern, masses to learn, body vitals, single-lead ECG signal
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
Comments: 14 pages, 16 figures

Abstract: In the post-covid19 era, every new wave of the pandemic causes an increased concern among the masses to learn more about their state of well-being. Therefore, it is the need of the hour to come up with ubiquitous, low-cost, non-invasive tools for rapid and continuous monitoring of body vitals that reflect the status of one’s overall health. Against this backdrop, this work proposes a deep learning approach to turn a smartphone, the popular hand-held personal gadget, into a diagnostic tool to measure/monitor the three most important body vitals, i.e., pulse rate (PR), blood oxygen saturation level (aka SpO2), and respiratory rate (RR). Furthermore, we propose another method that could extract a single-lead electrocardiograph (ECG) of the subject. The proposed methods include the following core steps: the subject records a small video of his/her fingertip by placing his/her finger on the rear camera of the smartphone, and the recorded video is pre-processed to extract the filtered and/or detrended video-photoplethysmography (vPPG) signal, which is then fed to custom-built convolutional neural networks (CNN), which eventually output the vitals (PR, SpO2, and RR) as well as a single-lead ECG of the subject. To be precise, the contribution of this paper is two-fold: 1) estimation of the three body vitals (PR, SpO2, RR) from the vPPG data using custom-built CNNs, a vision transformer, and most importantly a CLIP model; 2) a novel discrete cosine transform+feedforward neural network-based method that translates the recorded video-PPG signal to a single-lead ECG signal. The proposed method is anticipated to find its application in several use-case scenarios, e.g., remote healthcare, mobile health, fitness, sports, etc.
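
For intuition about what "pulse rate from a vPPG signal" involves, here is a classical spectral baseline. It is explicitly not the paper's method, which uses CNNs, a vision transformer, and a CLIP model. The sketch scans candidate heart rates and returns the one with the most spectral power; the function name, the 40 to 180 bpm search range, and the synthetic signal are assumptions.

```python
import math

def pulse_rate_bpm(signal, fps, min_bpm=40, max_bpm=180):
    """Return the candidate rate (beats per minute) with the largest spectral power."""
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]
    best_bpm, best_power = min_bpm, -1.0
    for bpm in range(min_bpm, max_bpm + 1):
        freq = bpm / 60.0  # candidate frequency in Hz
        re = sum(x * math.cos(2 * math.pi * freq * n / fps) for n, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * freq * n / fps) for n, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm

FPS = 30
# Synthetic vPPG: a 1.25 Hz pulse plus a slower, weaker respiration-like component.
vppg = [
    math.sin(2 * math.pi * 1.25 * n / FPS) + 0.3 * math.sin(2 * math.pi * 0.25 * n / FPS)
    for n in range(600)
]
estimate = pulse_rate_bpm(vppg, FPS)
```

Restricting the search to a plausible heart-rate band keeps the slow respiration-like component from being mistaken for the pulse. SpO2 and ECG reconstruction need considerably more than this and are outside the scope of the sketch.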

Artificial Intelligence

[AI-0] The Asymptotic Behavior of Attention in Transformers

Link: https://arxiv.org/abs/2412.02682
Authors: Álvaro Rodríguez Abella, João Pedro Silvestre, Paulo Tabuada
Keywords: attention mechanism orchestrating, key component, mechanism orchestrating, influences the propagation, attention mechanism
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Comments:

Abstract: A key component of transformers is the attention mechanism orchestrating how each token influences the propagation of every other token through a transformer. In this paper we provide a rigorous, mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion: all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.
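
The convergence phenomenon can be seen in a toy simulation. The sketch below is a deliberately simplified model, not the paper's setting or proofs: it assumes identity query, key, and value maps, with no residual connections, layer normalization, or MLP blocks. It applies softmax self-attention repeatedly and tracks the largest distance between any two tokens. All names and sizes are hypothetical.

```python
import math
import random

def attention_layer(tokens):
    """One softmax self-attention step with identity query, key, and value maps."""
    dim = len(tokens[0])
    updated = []
    for query in tokens:
        logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(logit - top) for logit in logits]
        total = sum(weights)
        updated.append([
            sum(w * value[c] for w, value in zip(weights, tokens)) / total
            for c in range(dim)
        ])
    return updated

def diameter(tokens):
    """Largest Euclidean distance between any two tokens."""
    return max(math.dist(a, b) for a in tokens for b in tokens)

random.seed(0)
tokens = []
for _ in range(8):
    vec = [random.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(v * v for v in vec))
    tokens.append([v / norm for v in vec])  # start on the unit sphere

initial_spread = diameter(tokens)
for _ in range(30):
    tokens = attention_layer(tokens)
final_spread = diameter(tokens)
```

In this simplified model every updated token is a convex combination of the previous tokens with strictly positive weights, so the spread shrinks at each layer. The paper's results concern more general assumptions than this toy case.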

[AI-1] Adaptive Informed Deep Neural Networks for Power Flow Analysis

Link: https://arxiv.org/abs/2412.02659
Authors: Zeynab Kaseb, Stavros Orfanoudakis, Pedro P. Vergara, Peter Palensky
Keywords: deep learning architecture, large-scale modern power, modern power systems, study introduces, deep learning
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 10 pages, 7 figures, 4 tables

Abstract:This study introduces PINN4PF, an end-to-end deep learning architecture for power flow (PF) analysis that effectively captures the nonlinear dynamics of large-scale modern power systems. The proposed neural network (NN) architecture consists of two important advancements in the training pipeline: (A) a double-head feed-forward NN that aligns with PF analysis, including an activation function that adjusts to active and reactive power consumption patterns, and (B) a physics-based loss function that partially incorporates power system topology information. The effectiveness of the proposed architecture is illustrated through 4-bus, 15-bus, 290-bus, and 2224-bus test systems and is evaluated against two baselines: a linear regression model (LR) and a black-box NN (MLP). The comparison is based on (i) generalization ability, (ii) robustness, (iii) impact of training dataset size on generalization ability, (iv) accuracy in approximating derived PF quantities (specifically line current, line active power, and line reactive power), and (v) scalability. Results demonstrate that PINN4PF outperforms both baselines across all test systems by up to two orders of magnitude not only in terms of direct criteria, e.g., generalization ability but also in terms of approximating derived physical quantities.

[AI-2] Medical Multimodal Foundation Models in Clinical Diagnosis and Treatment: Applications, Challenges, and Future Directions

Link: https://arxiv.org/abs/2412.02621
Authors: Kai Sun, Siyan Xue, Fuchun Sun, Haoran Sun, Yu Luo, Ling Wang, Siyuan Wang, Na Guo, Lei Liu, Tian Zhao, Xinzhou Wang, Lei Yang, Shuo Jin, Jun Yan, Jiahong Dong
Keywords: improve diagnostic precision, diverse clinical domains, Medical Multimodal Foundation, precision medicine, offering novel approaches
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in deep learning have significantly revolutionized the field of clinical diagnosis and treatment, offering novel approaches to improve diagnostic precision and treatment efficacy across diverse clinical domains, thus driving the pursuit of precision medicine. The growing availability of multi-organ and multimodal datasets has accelerated the development of large-scale Medical Multimodal Foundation Models (MMFMs). These models, known for their strong generalization capabilities and rich representational power, are increasingly being adapted to address a wide range of clinical tasks, from early diagnosis to personalized treatment strategies. This review offers a comprehensive analysis of recent developments in MMFMs, focusing on three key aspects: datasets, model architectures, and clinical applications. We also explore the challenges and opportunities in optimizing multimodal representations and discuss how these advancements are shaping the future of healthcare by enabling improved patient outcomes and more efficient clinical workflows.

[AI-3] Projection Abstractions in Planning Under the Lenses of Abstractions for MDPs

Link: https://arxiv.org/abs/2412.02615
Authors: Giuseppe Canonaco, Alberto Pozanco, Daniel Borrajo
Keywords: Markov Decision Processes, discounted Markov Decision, Decision Processes, Markov Decision, discounted Markov
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The concept of abstraction has been independently developed both in the context of AI Planning and discounted Markov Decision Processes (MDPs). However, the way abstractions are built and used in the context of Planning and MDPs is different even though lots of commonalities can be highlighted. To this day there is no work trying to relate and unify the two fields on the matter of abstractions unraveling all the different assumptions and their effect on the way they can be used. Therefore, in this paper we aim to do so by looking at projection abstractions in Planning through the lenses of discounted MDPs. Starting from a projection abstraction built according to Classical or Probabilistic Planning techniques, we will show how the same abstraction can be obtained under the abstraction frameworks available for discounted MDPs. Along the way, we will focus on computational as well as representational advantages and disadvantages of both worlds pointing out new research directions that are of interest for both fields.

[AI-4] AI-Driven Resource Allocation Framework for Microservices in Hybrid Cloud Platforms

Link: https://arxiv.org/abs/2412.02610
Authors: Biman Barua, M. Shamim Kaiser
Keywords: hybrid cloud, hybrid cloud platforms, efficient resource management, resource allocation, resource
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Performance (cs.PF); Software Engineering (cs.SE); Systems and Control (eess.SY)
Comments: 25 pages, 14 figures

Abstract:The increasing demand for scalable, efficient resource management in hybrid cloud environments has led to the exploration of AI-driven approaches for dynamic resource allocation. This paper presents an AI-driven framework for resource allocation among microservices in hybrid cloud platforms. The framework employs reinforcement learning (RL)-based resource utilization optimization to reduce costs and improve performance. The framework integrates AI models with cloud management tools to respond to challenges of dynamic scaling and cost-efficient low-latency service delivery. The reinforcement learning model continuously adjusts provisioned resources as required by the microservices and predicts the future consumption trends to minimize both under- and over-provisioning of resources. Preliminary simulation results indicate that using AI in the provision of resources related to costs can reduce expenditure by up to 30-40% compared to manual provisioning and threshold-based auto-scaling approaches. It is also estimated that the efficiency in resource utilization is expected to improve by 20%-30% with a corresponding latency cut of 15%-20% during the peak demand periods. This study compares the AI-driven approach with existing static and rule-based resource allocation methods, demonstrating the capability of this new model to outperform them in terms of flexibility and real-time interests. The results indicate that reinforcement learning can make optimization of hybrid cloud platforms even better, offering a 25-35% improvement in cost efficiency and the power of scaling for microservice-based applications. The proposed framework is a strong and scalable solution to managing cloud resources in dynamic and performance-critical environments.

[AI-5] PrefixLLM: LLM-aided Prefix Circuit Design

Link: https://arxiv.org/abs/2412.02594
Authors: Weihua Xiao, Venkata Sai Charan Putrevu, Raghu Vamshi Hemadri, Siddharth Garg, Ramesh Karri
Keywords: calculating carry signals, digital systems due, digital adders, components in digital, carry signals
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract: Prefix circuits are fundamental components in digital adders, widely used in digital systems due to their efficiency in calculating carry signals. Synthesizing prefix circuits with minimized area and delay is crucial for enhancing the performance of modern computing systems. Recently, large language models (LLMs) have demonstrated a surprising ability to perform text generation tasks. We propose PrefixLLM, which leverages LLMs for prefix circuit synthesis. PrefixLLM transforms the prefix circuit synthesis task into a structured text generation problem, termed the Structured Prefix Circuit Representation (SPCR), and introduces an iterative framework to automatically and accurately generate valid SPCRs. We further present a design space exploration (DSE) framework that uses LLMs to iteratively search for area and delay optimized prefix circuits. Compared to the state of the art, PrefixLLM can reduce the area by 3.70% under the same delay constraint. This work highlights the use of LLMs in the synthesis of arithmetic circuits, which can be transformed into structured text generation problems.
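
For readers unfamiliar with prefix circuits, the sketch below shows what one computes. It does not reproduce the SPCR text format, which the abstract does not describe. It simulates a Kogge-Stone style parallel-prefix network, one classical topology chosen here as an assumption, that derives every carry from bitwise generate and propagate signals. The function name and the default bit width are hypothetical.

```python
def prefix_add(a, b, width=8):
    """Add two unsigned integers using a Kogge-Stone style parallel-prefix carry network."""
    # Bitwise generate (both inputs are 1) and propagate (exactly one input is 1).
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    G, P = g[:], p[:]
    dist = 1
    while dist < width:
        # Combine each position with the one `dist` bits below it using the
        # prefix operator (G, P) o (G', P') = (G | (P & G'), P & P').
        new_G, new_P = G[:], P[:]
        for i in range(dist, width):
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2
    # G[i] is now the carry out of bit i; the carry into bit i is G[i - 1].
    carries_in = [0] + G[:-1]
    total = sum((p[i] ^ carries_in[i]) << i for i in range(width))
    return total | (G[width - 1] << width)

example = prefix_add(200, 100)
```

Each pass of the `while` loop corresponds to one logic level, so the number of levels grows with the logarithm of the bit width. Different prefix topologies trade this depth against the number of operator nodes, which is the area and delay trade-off the abstract refers to.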

[AI-6] Explainable CTR Prediction via LLM Reasoning WSDM2025

Link: https://arxiv.org/abs/2412.02588
Authors: Xiaohan Yu, Li Zhang, Chong Chen
Keywords: modern user experiences, Recommendation Systems, decision-making processes, integral to modern, lack transparency
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: WSDM 2025

Abstract:Recommendation Systems have become integral to modern user experiences, but lack transparency in their decision-making processes. Existing explainable recommendation methods are hindered by reliance on a post-hoc paradigm, wherein explanation generators are trained independently of the underlying recommender models. This paradigm necessitates substantial human effort in data construction and raises concerns about explanation reliability. In this paper, we present ExpCTR, a novel framework that integrates large language model based explanation generation directly into the CTR prediction process. Inspired by recent advances in reinforcement learning, we employ two carefully designed reward mechanisms, LC alignment, which ensures explanations reflect user intentions, and IC alignment, which maintains consistency with traditional ID-based CTR models. Our approach incorporates an efficient training paradigm with LoRA and a three-stage iterative process. ExpCTR circumvents the need for extensive explanation datasets while fostering synergy between CTR prediction and explanation generation. Experimental results demonstrate that ExpCTR significantly enhances both recommendation accuracy and interpretability across three real-world datasets.

[AI-7] Factored space models: Towards causality between levels of abstraction

Link: https://arxiv.org/abs/2412.02579
Authors: Scott Garrabrant, Matthias Georg Mayer, Magdalena Wache, Leon Lang, Sam Eisenstat, Holger Dell
Keywords: understanding intelligent behavior, Causality plays, causal graphs, intelligent behavior, plays an important
Categories: Artificial Intelligence (cs.AI)
Comments: 29 pages

Abstract:Causality plays an important role in understanding intelligent behavior, and there is a wealth of literature on mathematical models for causality, most of which is focused on causal graphs. Causal graphs are a powerful tool for a wide range of applications, in particular when the relevant variables are known and at the same level of abstraction. However, the given variables can also be unstructured data, like pixels of an image. Meanwhile, the causal variables, such as the positions of objects in the image, can be arbitrary deterministic functions of the given variables. Moreover, the causal variables may form a hierarchy of abstractions, in which the macro-level variables are deterministic functions of the micro-level variables. Causal graphs are limited when it comes to modeling this kind of situation. In the presence of deterministic relationships there is generally no causal graph that satisfies both the Markov condition and the faithfulness condition. We introduce factored space models as an alternative to causal graphs which naturally represent both probabilistic and deterministic relationships at all levels of abstraction. Moreover, we introduce structural independence and establish that it is equivalent to statistical independence in every distribution that factorizes over the factored space. This theorem generalizes the classical soundness and completeness theorem for d-separation.

[AI-8] Generating Critical Scenarios for Testing Automated Driving Systems

Link: https://arxiv.org/abs/2412.02574
Authors: Trung-Hieu Nguyen, Truong-Giang Vuong, Hong-Nam Duong, Son Nguyen, Hieu Dinh Vo, Toshiaki Aoki, Thu-Trang Nguyen
Keywords: demonstrated significant potential, Autonomous Driving System, Autonomous vehicles, revolutionizing transportation, demonstrated significant
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Autonomous vehicles (AVs) have demonstrated significant potential in revolutionizing transportation, yet ensuring their safety and reliability remains a critical challenge, especially when exposed to dynamic and unpredictable environments. Real-world testing of an Autonomous Driving System (ADS) is both expensive and risky, making simulation-based testing a preferred approach. In this paper, we propose AVASTRA, a Reinforcement Learning (RL)-based approach to generate realistic critical scenarios for testing ADSs in simulation environments. To capture the complexity of driving scenarios, AVASTRA comprehensively represents the environment by both the internal states of an ADS under-test (e.g., the status of the ADS’s core components, speed, or acceleration) and the external states of the surrounding factors in the simulation environment (e.g., weather, traffic flow, or road condition). AVASTRA trains the RL agent to effectively configure the simulation environment that places the AV in dangerous situations and potentially leads it to collisions. We introduce a diverse set of actions that allows the RL agent to systematically configure both environmental conditions and traffic participants. Additionally, based on established safety requirements, we enforce heuristic constraints to ensure the realism and relevance of the generated test scenarios. AVASTRA is evaluated on two popular simulation maps with four different road configurations. Our results show AVASTRA’s ability to outperform the state-of-the-art approach by generating 30% to 115% more collision scenarios. Compared to the baseline based on Random Search, AVASTRA achieves up to 275% better performance. These results highlight the effectiveness of AVASTRA in enhancing the safety testing of AVs through realistic comprehensive critical scenario generation.

[AI-9] TAB-Fields: A Maximum Entropy Framework for Mission-Aware Adversarial Planning

Link: https://arxiv.org/abs/2412.02570
Authors: Gokul Puthumanaillam,Jae Hyuk Song,Nurzhan Yesmagambet,Shinkyu Park,Melkior Ornik
Keywords: Autonomous agents operating, adversarial scenarios face, Markov Decision Process, Autonomous agents, Observable Markov Decision
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Autonomous agents operating in adversarial scenarios face a fundamental challenge: while they may know their adversaries’ high-level objectives, such as reaching specific destinations within time constraints, the exact policies these adversaries will employ remain unknown. Traditional approaches address this challenge by treating the adversary’s state as a partially observable element, leading to a formulation as a Partially Observable Markov Decision Process (POMDP). However, the induced belief-space dynamics in a POMDP require knowledge of the system’s transition dynamics, which, in this case, depend on the adversary’s unknown policy. Our key observation is that while an adversary’s exact policy is unknown, their behavior is necessarily constrained by their mission objectives and the physical environment, allowing us to characterize the space of possible behaviors without assuming specific policies. In this paper, we develop Task-Aware Behavior Fields (TAB-Fields), a representation that captures adversary state distributions over time by computing the most unbiased probability distribution consistent with known constraints. We construct TAB-Fields by solving a constrained optimization problem that minimizes additional assumptions about adversary behavior beyond mission and environmental requirements. We integrate TAB-Fields with standard planning algorithms by introducing TAB-conditioned POMCP, an adaptation of Partially Observable Monte Carlo Planning. Through experiments in simulation with underwater robots and hardware implementations with ground robots, we demonstrate that our approach achieves superior performance compared to baselines that either assume specific adversary policies or neglect mission constraints altogether. Evaluation videos and code are available at this https URL.

[AI-10] Graph-Powered Defense: Controller Area Network Intrusion Detection for Unmanned Aerial Vehicles

Link: https://arxiv.org/abs/2412.02539
Authors: Reek Majumder,Gurcan Comert,David Werth,Adrian Gale,Mashrur Chowdhury,M Sabbir Salek
Keywords: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, experienced exponential expansion, Controller Area Network
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The network of services, including delivery, farming, and environmental monitoring, has experienced exponential expansion in the past decade with Unmanned Aerial Vehicles (UAVs). Yet, UAVs are not robust enough against cyberattacks, especially on the Controller Area Network (CAN) bus. The CAN bus is a general-purpose vehicle-bus standard to enable microcontrollers and in-vehicle computers to interact, primarily connecting different Electronic Control Units (ECUs). In this study, we focus on solving some of the most critical security weaknesses in UAVs by developing a novel graph-based intrusion detection system (IDS) leveraging the Uncomplicated Application-level Vehicular Communication and Networking (UAVCAN) protocol. First, we decode CAN messages based on UAVCAN protocol specification; second, we present a comprehensive method of transforming tabular UAVCAN messages into graph structures. Lastly, we apply various graph-based machine learning models for detecting cyber-attacks on the CAN bus, including graph convolutional neural networks (GCNNs), graph attention networks (GATs), Graph Sample and Aggregate Networks (GraphSAGE), and graph structure-based transformers. Our findings show that inductive models such as GATs, GraphSAGE, and graph-based transformers can achieve competitive and even better accuracy than transductive models like GCNNs in detecting various types of intrusions, with minimum information on protocol specification, thus providing a generic robust solution for CAN bus security for the UAVs. We also compared our results with baseline single-layer Long Short-Term Memory (LSTM) and found that all our graph-based models perform better without using any decoded features based on the UAVCAN protocol, highlighting higher detection performance with protocol-independent capability.

[AI-11] Bias Analysis of AI Models for Undergraduate Student Admissions

Link: https://arxiv.org/abs/2412.02528
Authors: Kelly Van Busum,Shiaofen Fang
Keywords: machine learning, Bias detection, detection and mitigation, active area, standardized test scores
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Bias detection and mitigation is an active area of research in machine learning. This work extends previous research done by the authors to provide a rigorous and more complete analysis of the bias found in AI predictive models. Admissions data spanning six years was used to create an AI model to determine whether a given student would be directly admitted into the School of Science under various scenarios at a large urban research university. During this time, submission of standardized test scores as part of an application became optional which led to interesting questions about the impact of standardized test scores on admission decisions. We developed and analyzed AI models to understand which variables are important in admissions decisions, and how the decision to exclude test scores affects the demographics of the students who are admitted. We then evaluated the predictive models to detect and analyze biases these models may carry with respect to three variables chosen to represent sensitive populations: gender, race, and whether a student was the first in his or her family to attend college. We also extended our analysis to show that the biases detected were persistent. Finally, we included several fairness metrics in our analysis and discussed the uses and limitations of these metrics.
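
The abstract does not name the fairness metrics it uses. As an illustration only, the sketch below computes two commonly used ones, demographic parity difference and equal opportunity difference, for a binary admit/deny prediction and a binary sensitive attribute. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def selection_rate(preds: Sequence[int]) -> float:
    """Fraction of positive (admit) predictions; 0.0 for an empty group."""
    return sum(preds) / len(preds) if preds else 0.0


def demographic_parity_difference(preds: Sequence[int], groups: Sequence[int]) -> float:
    """Absolute gap in admit rate between group 1 and group 0."""
    g0 = [p for p, g in zip(preds, groups) if g == 0]
    g1 = [p for p, g in zip(preds, groups) if g == 1]
    return abs(selection_rate(g1) - selection_rate(g0))


def equal_opportunity_difference(
    preds: Sequence[int], labels: Sequence[int], groups: Sequence[int]
) -> float:
    """Absolute gap in true-positive rate between the two groups."""
    def tpr(group: int) -> float:
        # Predictions for members of `group` whose true label is positive.
        hits = [p for p, y, g in zip(preds, labels, groups) if g == group and y == 1]
        return selection_rate(hits)

    return abs(tpr(1) - tpr(0))


# Toy data (hypothetical): 1 = admitted; the group flag could mark,
# for example, first-generation college students.
preds = [1, 0, 1, 1, 0, 1, 0, 0]
labels = [1, 0, 1, 0, 1, 1, 0, 1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

dpd = demographic_parity_difference(preds, groups)
eod = equal_opportunity_difference(preds, labels, groups)
print(f"demographic parity difference: {dpd:.3f}")
print(f"equal opportunity difference:  {eod:.3f}")
```

A value near zero on either metric means the two groups are treated similarly under that definition. The two metrics can disagree, which is one reason a single metric is rarely sufficient.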

[AI-12] Cooperative Cruising: Reinforcement Learning based Time-Headway Control for Increased Traffic Efficiency

Link: https://arxiv.org/abs/2412.02520
Authors: Yaron Veksler,Sharon Hornstein,Han Wang,Maria Laura Delle Monache,Daniel Urieli
Keywords: Connected Automated Vehicles, proliferation of Connected, Automated Vehicles represents, improving driving efficiency, Connected Automated
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:The proliferation of Connected Automated Vehicles represents an unprecedented opportunity for improving driving efficiency and alleviating traffic congestion. However, existing research fails to address realistic multi-lane highway scenarios without assuming connectivity, perception, and control capabilities that are typically unavailable in current vehicles. This paper proposes a novel AI system that is the first to improve highway traffic efficiency compared with human-like traffic in realistic, simulated multi-lane scenarios, while relying on existing connectivity, perception, and control capabilities. At the core of our approach is a reinforcement learning based controller that dynamically communicates time-headways to automated vehicles near bottlenecks based on real-time traffic conditions. These desired time-headways are then used by Adaptive Cruise Control (ACC) systems to adjust their following distance. By (i) integrating existing traffic estimation technology and low-bandwidth vehicle-to-infrastructure connectivity, (ii) leveraging safety-certified ACC systems, and (iii) targeting localized bottleneck challenges that can be addressed independently in different locations, we propose a practical, safe, and scalable system that can positively impact numerous road users.
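
To make the role of the time-headway concrete: an ACC system typically turns a desired time-headway into a desired following gap that grows with speed. The sketch below uses the common constant time-headway spacing policy. The standstill distance, the function name, and the example values are assumptions for illustration; the paper's RL controller, which chooses the headway, is not shown.

```python
def desired_gap_m(speed_mps: float, time_headway_s: float, standstill_m: float = 2.0) -> float:
    """Desired bumper-to-bumper gap under a constant time-headway policy.

    gap = standstill distance + time-headway * current speed
    """
    if speed_mps < 0 or time_headway_s < 0:
        raise ValueError("speed and time-headway must be non-negative")
    return standstill_m + time_headway_s * speed_mps


# Hypothetical example: at 30 m/s (108 km/h), shortening the commanded
# headway from 1.5 s to 1.0 s reduces the desired gap by 15 m.
gap_long = desired_gap_m(30.0, 1.5)
gap_short = desired_gap_m(30.0, 1.0)
print(gap_long, gap_short)
```

Because the gap scales linearly with the headway, a controller that communicates different headways near a bottleneck changes how densely vehicles pack at a given speed.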

[AI-13] Pre-Deployment Information Sharing: A Zoning Taxonomy for Precursory Capabilities

Link: https://arxiv.org/abs/2412.02512
Authors: Matteo Pistillo,Charlotte Stix
Keywords: early warning shots, warning shots long, reaching red lines, early warning, warning shots
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:High-impact and potentially dangerous capabilities can and should be broken down into early warning shots long before reaching red lines. Each of these early warning shots should correspond to a precursory capability. Each precursory capability sits on a spectrum indicating its proximity to a final high-impact capability, corresponding to a red line. To meaningfully detect and track capability progress, we propose a taxonomy of dangerous capability zones (a zoning taxonomy) tied to a staggered information exchange framework that enables relevant bodies to take action accordingly. In the Frontier AI Safety Commitments, signatories commit to sharing more detailed information with trusted actors, including an appointed body, as appropriate (Commitment VII). Building on our zoning taxonomy, this paper makes four recommendations for specifying information sharing as detailed in Commitment VII. (1) Precursory capabilities should be shared as soon as they become known through internal evaluations before deployment. (2) AI Safety Institutes (AISIs) should be the trusted actors appointed to receive and coordinate information on precursory components. (3) AISIs should establish adequate information protection infrastructure and guarantee increased information security as precursory capabilities move through the zones and towards red lines, including, if necessary, by classifying the information on precursory capabilities or marking it as controlled. (4) High-impact capability progress in one geographical region may translate to risk in other regions and necessitates more comprehensive risk assessment internationally. As such, AISIs should exchange information on precursory capabilities with other AISIs, relying on the existing frameworks on international classified exchanges and applying lessons learned from other regulated high-risk sectors.

[AI-14] FCL-ViT: Task-Aware Attention Tuning for Continual Learning

Link: https://arxiv.org/abs/2412.02509
Authors: Anestis Kaimakamidis,Ioannis Pitas
Keywords: Deep Neural Network, prior Deep Neural, Neural Network, Deep Neural, Continual Learning Vision
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Continual Learning (CL) involves adapting the prior Deep Neural Network (DNN) knowledge to new tasks, without forgetting the old ones. However, modern CL techniques focus on provisioning memory capabilities to existing DNN models rather than designing new ones that are able to adapt according to the task at hand. This paper presents the novel Feedback Continual Learning Vision Transformer (FCL-ViT) that uses a feedback mechanism to generate real-time dynamic attention features tailored to the current task. The FCL-ViT operates in two phases. In phase 1, generic image features are produced and determine where the Transformer should attend on the current image. In phase 2, task-specific image features are generated that leverage dynamic attention. To this end, Tunable self-Attention Blocks (TABs) and Task Specific Blocks (TSBs) are introduced that operate in both phases and are responsible for tuning the TABs' attention, respectively. The FCL-ViT surpasses state-of-the-art performance on Continual Learning compared to benchmark methods, while retaining a small number of trainable DNN parameters.

[AI-15] F-SE-LSTM: A Time Series Anomaly Detection Method with Frequency Domain Information

Link: https://arxiv.org/abs/2412.02474
Authors: Yi-Xiang Lu,Xiao-Bo Jin,Jian Chen,Dong-Jie Liu,Guang-Gang Geng
Keywords: series anomaly detection, anomaly detection, time series anomaly, anomaly detection plays, time series
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 7 figures

Click to view abstract

Abstract:With the development of society, time series anomaly detection plays an important role in network and IoT services. However, most existing anomaly detection methods directly analyze time series in the time domain and cannot distinguish some relatively hidden anomaly sequences. We attempt to analyze the impact of frequency on time series from a frequency domain perspective, thus proposing a new time series anomaly detection method called F-SE-LSTM. This method utilizes two sliding windows and fast Fourier transform (FFT) to construct a frequency matrix. Simultaneously, Squeeze-and-Excitation Networks (SENet) and Long Short-Term Memory (LSTM) are employed to extract frequency-related features within and between periods. Through comparative experiments on multiple datasets such as Yahoo Webscope S5 and Numenta Anomaly Benchmark, the results demonstrate that the frequency matrix constructed by F-SE-LSTM exhibits better discriminative ability than ordinary time domain and frequency domain data. Furthermore, F-SE-LSTM outperforms existing state-of-the-art deep learning anomaly detection methods in terms of anomaly detection capability and execution efficiency.
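
One way to picture "two sliding windows and FFT" is sketched below: an outer window selects a segment of the series, an inner window slides across that segment, and the FFT magnitude spectrum of each inner window becomes one row of the frequency matrix. This is an assumed construction for illustration, using NumPy. The window sizes, strides, function names, and the use of magnitudes are not taken from the paper, whose exact procedure may differ.

```python
import numpy as np


def frequency_matrix(segment: np.ndarray, inner: int, step: int = 1) -> np.ndarray:
    """Rows are FFT magnitude spectra of inner windows slid across one segment."""
    starts = range(0, len(segment) - inner + 1, step)
    windows = np.stack([segment[s:s + inner] for s in starts])
    # rfft keeps only the non-negative frequencies of a real-valued signal.
    return np.abs(np.fft.rfft(windows, axis=1))


def frequency_matrices(series: np.ndarray, outer: int, inner: int, stride: int = 1) -> np.ndarray:
    """Slide the outer window over the series and build one matrix per position."""
    starts = range(0, len(series) - outer + 1, stride)
    return np.stack([frequency_matrix(series[s:s + outer], inner) for s in starts])


# Synthetic series (hypothetical): a sine wave with a period of 16 samples.
t = np.arange(256)
series = np.sin(2 * np.pi * t / 16)

mats = frequency_matrices(series, outer=64, inner=32, stride=64)
print(mats.shape)            # (outer positions, inner positions, frequency bins)
print(mats[0, 0].argmax())   # dominant frequency bin of the first inner window
```

Each resulting matrix could then be fed to a downstream feature extractor, with one axis tracking position within the segment and the other tracking frequency.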

[AI-16] Knowledge-Enhanced Conversational Recommendation via Transformer-based Sequential Modelling

Link: https://arxiv.org/abs/2412.02415
Authors: Jie Zou,Aixin Sun,Cheng Long,Evangelos Kanoulas
Keywords: item-related entities, conversational recommender systems, potential sequential dependencies, TSCR, items and item-related
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM TOIS

Click to view abstract

Abstract:In conversational recommender systems (CRSs), conversations usually involve a set of items and item-related entities or attributes, e.g., director is a related entity of a movie. These items and item-related entities are often mentioned along the development of a dialog, leading to potential sequential dependencies among them. However, most existing CRSs neglect these potential sequential dependencies. In this article, we first propose a Transformer-based sequential conversational recommendation method, named TSCR, to model the sequential dependencies in the conversations to improve CRS. In TSCR, we represent conversations by items and the item-related entities, and construct user sequences to discover user preferences by considering both the mentioned items and item-related entities. Based on the constructed sequences, we deploy a Cloze task to predict the recommended items along a sequence. Meanwhile, in certain domains, knowledge graphs formed by the items and their related entities are readily available, which provide various kinds of associations among them. Given that TSCR does not benefit from such knowledge graphs, we then propose a knowledge graph enhanced version of TSCR, called TSCRKG. Specifically, we leverage the knowledge graph to initialize our model TSCRKG offline, and augment the user sequence of conversations (i.e., sequence of the mentioned items and item-related entities in the conversation) with multi-hop paths in the knowledge graph. Experimental results demonstrate that our TSCR model significantly outperforms state-of-the-art baselines, and the enhanced version TSCRKG further improves recommendation performance on top of TSCR.
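
A Cloze task hides some positions in a sequence and trains the model to recover them from the surrounding context. The sketch below shows how a masked training example could be built from a user sequence that mixes items and item-related entities, masking only items because items are what the model must recommend. The `item:`/`entity:` prefixes, the mask probability, and the function name are hypothetical and are not taken from the paper.

```python
import random
from typing import Callable, List, Optional, Tuple

MASK = "[MASK]"


def make_cloze_example(
    sequence: List[str],
    is_item: Callable[[str], bool],
    mask_prob: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (inputs, targets) for one masked sequence.

    Only item tokens are candidates for masking. `targets` holds the
    original token at masked positions and None everywhere else.
    """
    rng = rng or random.Random()
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for token in sequence:
        if is_item(token) and rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(token)
        else:
            inputs.append(token)
            targets.append(None)
    return inputs, targets


# Hypothetical conversation: two movies and one related entity (a director).
conversation = ["item:Inception", "entity:Christopher Nolan", "item:Interstellar"]
inputs, targets = make_cloze_example(
    conversation,
    is_item=lambda tok: tok.startswith("item:"),
    mask_prob=0.5,
    rng=random.Random(0),
)
print(inputs)
print(targets)
```

At inference time, a common variant of this setup appends a single mask token to the end of the sequence and ranks candidate items for that position.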

[AI-17] A Multi-Agent Framework for Extensible Structured Text Generation in PLCs

Link: https://arxiv.org/abs/2412.02410
Authors: Donghao Yang,Aolang Wu,Tianyi Zhang,Li Zhang,Fang Liu,Xiaoli Lian,Yuming Ren,Jiaji Tian
Keywords: Programmable Logic Controllers, automating factory operations, Programmable Logic, Logic Controllers, factory operations
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Programmable Logic Controllers (PLCs) are microcomputers essential for automating factory operations. Structured Text (ST), a high-level language adhering to the IEC 61131-3 standard, is pivotal for PLCs due to its ability to express logic succinctly and to seamlessly integrate with other languages within the same standard. However, vendors develop their own customized versions of ST, and the lack of comprehensive and standardized documentation for the full semantics of ST has contributed to inconsistencies in how the language is implemented. Consequently, the steep learning curve associated with ST, combined with ever-evolving industrial requirements, presents significant challenges for developers. In response to these issues, we present AutoPLC, an LLM-based approach designed to automate the generation of vendor-specific ST code. To facilitate effective code generation, we first built a comprehensive knowledge base, including Rq2ST Case Library (requirements and corresponding implementations) and Instruction libraries. Then we developed a retrieval module to incorporate the domain-specific knowledge by identifying pertinent cases and instructions, guiding the LLM to generate code that meets the requirements. In order to verify and improve the quality of the generated code, we designed an adaptable code checker. If errors are detected, we initiate an iterative self-improvement process to instruct the LLM to revise the generated code. We evaluate AutoPLC’s performance against seven state-of-the-art baselines using three benchmarks, one for open-source basic ST and two for commercial Structured Control Language (SCL) from Siemens. The results show that our approach consistently achieves superior performance across all benchmarks. An ablation study emphasizes the significance of our modules. Further manual analysis confirms the practical utility of the ST code generated by AutoPLC.

[AI-18] HERO: Hint-Based Efficient and Reliable Query Optimizer VLDB2025

Link: https://arxiv.org/abs/2412.02372
Authors: Sergey Zinchenko,Sergey Iazov
Keywords: execution plans, query, query hints leading, optimization, model
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to VLDB 2025; 13 pages; 13 figures

Click to view abstract

Abstract:We propose a novel model for learned query optimization which provides query hints leading to better execution plans. The model addresses the three key challenges in learned hint-based query optimization: reliable hint recommendation (ensuring non-degradation of query latency), efficient hint exploration, and fast inference. We provide an in-depth analysis of existing NN-based approaches to hint-based optimization and experimentally confirm the named challenges for them. Our alternative solution consists of a new inference schema based on an ensemble of context-aware models and a graph storage for reliable hint suggestion and fast inference, and a budget-controlled training procedure with a local search algorithm that solves the issue of exponential search space exploration. In experiments on standard benchmarks, our model demonstrates optimization capability close to the best achievable with coarse-grained hints. Controlling the degree of parallelism (query dop) in addition to operator-related hints enables our model to achieve 3x latency improvement on JOB benchmark which sets a new standard for optimization. Our model is interpretable and easy to debug, which is particularly important for deployment in production.

[AI-19] Dynamic Prompt Middleware: Contextual Prompt Refinement Controls for Comprehension Tasks

Link: https://arxiv.org/abs/2412.02357
Authors: Ian Drosos,Jack Williams,Advait Sarkar,Nicholas Wilson
Keywords: explaining spreadsheet formulas, Python code, Dynamic PRC, Dynamic PRC approach, Effective prompting
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Effective prompting of generative AI is challenging for many users, particularly in expressing context for comprehension tasks such as explaining spreadsheet formulas, Python code, and text passages. Prompt middleware aims to address this barrier by assisting in prompt construction, but barriers remain for users in expressing adequate control so that they can receive AI responses that match their preferences. We conduct a formative survey (n=38) investigating user needs for control over AI-generated explanations in comprehension tasks, which uncovers a trade-off between standardized but predictable support for prompting, and adaptive but unpredictable support tailored to the user and task. To explore this trade-off, we implement two prompt middleware approaches: Dynamic Prompt Refinement Control (Dynamic PRC) and Static Prompt Refinement Control (Static PRC). The Dynamic PRC approach generates context-specific UI elements that provide prompt refinements based on the user’s prompt and user needs from the AI, while the Static PRC approach offers a preset list of generally applicable refinements. We evaluate these two approaches with a controlled user study (n=16) to assess the impact of these approaches on user control of AI responses for crafting better explanations. Results show a preference for the Dynamic PRC approach as it afforded more control, lowered barriers to providing context, and encouraged exploration and reflection of the tasks, but that reasoning about the effects of different generated controls on the final output remains challenging. Drawing on participant feedback, we discuss design implications for future Dynamic PRC systems that enhance user control of AI responses. Our findings suggest that dynamic prompt middleware can improve the user experience of generative AI workflows by affording greater control and guiding users to a better AI response.

[AI-20] Sample Efficient Robot Learning in Supervised Effect Prediction Tasks

Link: https://arxiv.org/abs/2412.02331
Authors: Mehmet Arda Eren,Erhan Oztop
Keywords: robots actively explore, learning, actively explore, generate data, data by acting
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 18 figures

Click to view abstract

Abstract:In self-supervised robot learning, robots actively explore their environments and generate data by acting on entities in the environment. Therefore, an exploration policy is desired that ensures sample efficiency to minimize robot execution costs while still providing accurate learning. For this purpose, the robotic community has adopted Intrinsic Motivation (IM)-based approaches such as Learning Progress (LP). On the machine learning front, Active Learning (AL) has been used successfully, especially for classification tasks. In this work, we develop a novel AL framework geared towards robotics regression tasks, such as action-effect prediction and, more generally, for world model learning, which we call MUSEL - Model Uncertainty for Sample Efficient Learning. MUSEL aims to extract model uncertainty from the total uncertainty estimate given by a suitable learning engine by making use of learning progress and input diversity, and to use it to improve sample efficiency beyond the state-of-the-art action-effect prediction methods. We demonstrate the feasibility of our model by using a Stochastic Variational Gaussian Process (SVGP) as the learning engine and testing the system on a set of robotic experiments in simulation. The efficacy of MUSEL is demonstrated by comparing its performance to standard methods used in robot action-effect learning. In a robotic tabletop environment in which a robot manipulator is tasked with learning the effect of its actions, the experiments show that MUSEL facilitates higher accuracy in learning action effects while ensuring sample efficiency.

[AI-21] Switchable deep beamformer for high-quality and real-time passive acoustic mapping

Link: https://arxiv.org/abs/2412.02327
Authors: Yi Zeng,Jinwei Li,Hui Zhu,Shukuan Lu,Jianfeng Li,Xiran Cai
Keywords: Passive acoustic mapping, deep beamformer, Passive acoustic, Data-adaptive beamformers, deep beamformer reduced
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Passive acoustic mapping (PAM) is a promising tool for monitoring acoustic cavitation activities in the applications of ultrasound therapy. Data-adaptive beamformers for PAM have better image quality compared to the time exposure acoustics (TEA) algorithms. However, the computational cost of data-adaptive beamformers is considerably high. In this work, we develop a deep beamformer based on a generative adversarial network, which can switch between different transducer arrays and reconstruct high-quality PAM images directly from radio frequency ultrasound signals with low computational cost. The deep beamformer was trained on the dataset consisting of simulated and experimental cavitation signals of single and multiple microbubble clouds measured by different (linear and phased) arrays covering 1-15 MHz. We compared the performance of the deep beamformer to TEA and three different data-adaptive beamformers using the simulated and experimental test dataset. Compared with TEA, the deep beamformer reduced the energy spread area by 18.9%-65.0% and improved the image signal-to-noise ratio by 9.3-22.9 dB on average for the different arrays in our data. Compared to the data-adaptive beamformers, the deep beamformer reduced the computational cost by three orders of magnitude, achieving 10.5 ms image reconstruction speed in our data, while the image quality was as good as that of the data-adaptive beamformers. These results demonstrated the potential of the deep beamformer for high-resolution monitoring of microbubble cavitation activities for ultrasound therapy.

[AI-22] Enhanced Photovoltaic Power Forecasting: An iTransformer and LSTM-Based Model Integrating Temporal and Covariate Interactions

Link: https://arxiv.org/abs/2412.02302
Authors: Guang Wu,Yun Wang,Qian Zhou,Ziyang Zhang
Keywords: optimizing real-time energy, real-time energy management, integrating renewable energy, renewable energy sources, ensuring energy reliability
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate photovoltaic (PV) power forecasting is critical for integrating renewable energy sources into the grid, optimizing real-time energy management, and ensuring energy reliability amidst increasing demand. However, existing models often struggle with effectively capturing the complex relationships between target variables and covariates, as well as the interactions between temporal dynamics and multivariate data, leading to suboptimal forecasting accuracy. To address these challenges, we propose a novel model architecture that leverages the iTransformer for feature extraction from target variables and employs long short-term memory (LSTM) to extract features from covariates. A cross-attention mechanism is integrated to fuse the outputs of both models, followed by a Kolmogorov-Arnold network (KAN) mapping for enhanced representation. The effectiveness of the proposed model is validated using publicly available datasets from Australia, with experiments conducted across four seasons. Results demonstrate that the proposed model effectively captures seasonal variations in PV power generation and improves forecasting accuracy.

[AI-23] CADMR: Cross-Attention and Disentangled Learning for Multimodal Recommender Systems

Link: https://arxiv.org/abs/2412.02295
Authors: Yasser Khalafaoui(Alteca),Martino Lovisetto(Alteca),Basarab Matei,Nistor Grozavu(CY)
Keywords: enhancing recommendation accuracy, recommender systems offer, increasing availability, availability and diversity, offer new avenues
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The increasing availability and diversity of multimodal data in recommender systems offer new avenues for enhancing recommendation accuracy and user satisfaction. However, these systems must contend with high-dimensional, sparse user-item rating matrices, where reconstructing the matrix with only small subsets of preferred items for each user poses a significant challenge. To address this, we propose CADMR, a novel autoencoder-based multimodal recommender system framework. CADMR leverages multi-head cross-attention mechanisms and Disentangled Learning to effectively integrate and utilize heterogeneous multimodal data in reconstructing the rating matrix. Our approach first disentangles modality-specific features while preserving their interdependence, thereby learning a joint latent representation. The multi-head cross-attention mechanism is then applied to enhance user-item interaction representations with respect to the learned multimodal item latent representations. We evaluate CADMR on three benchmark datasets, demonstrating significant performance improvements over state-of-the-art methods.

[AI-24] Conformal Symplectic Optimization for Stable Reinforcement Learning

Link: https://arxiv.org/abs/2412.02291
Authors: Yao Lyu,Xiangteng Zhang,Shengbo Eben Li,Jingliang Duan,Letian Tao,Qing Xu,Lei He,Keqiang Li
Keywords: agents necessitates overcoming, Training deep reinforcement, stochastic optimization inherent, deep reinforcement learning, highly unstable nonconvex
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Training deep reinforcement learning (RL) agents necessitates overcoming the highly unstable nonconvex stochastic optimization inherent in the trial-and-error mechanism. To tackle this challenge, we propose a physics-inspired optimization algorithm called relativistic adaptive gradient descent (RAD), which enhances long-term training stability. By conceptualizing neural network (NN) training as the evolution of a conformal Hamiltonian system, we present a universal framework for transferring long-term stability from conformal symplectic integrators to iterative NN updating rules, where the choice of kinetic energy governs the dynamical properties of resulting optimization algorithms. By utilizing relativistic kinetic energy, RAD incorporates principles from special relativity and limits parameter updates below a finite speed, effectively mitigating abnormal gradient influences. Additionally, RAD models NN optimization as the evolution of a multi-particle system where each trainable parameter acts as an independent particle with an individual adaptive learning rate. We prove RAD’s sublinear convergence under general nonconvex settings, where smaller gradient variance and larger batch sizes contribute to tighter convergence. Notably, RAD degrades to the well-known adaptive moment estimation (ADAM) algorithm when its speed coefficient is chosen as one and symplectic factor as a small positive value. Experimental results show RAD outperforming nine baseline optimizers with five RL algorithms across twelve environments, including standard benchmarks and challenging scenarios. Notably, RAD achieves up to a 155.1% performance improvement over ADAM in Atari games, showcasing its efficacy in stabilizing and accelerating RL training.

[AI-25] GQWformer: A Quantum-based Transformer for Graph Representation Learning

Link: https://arxiv.org/abs/2412.02285
Authors: Lei Yu,Hongyang Chen,Jingsong Lv,Linyao Yang
Keywords: demonstrated significant advantages, Graph, Quantum Walk Transformer, quantum, graph representation learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Graph Transformers (GTs) have demonstrated significant advantages in graph representation learning through their global attention mechanisms. However, the self-attention mechanism in GTs tends to neglect the inductive biases inherent in graph structures, making it challenging to effectively capture essential structural information. To address this issue, we propose a novel approach that integrates graph inductive bias into self-attention mechanisms by leveraging quantum technology for structural encoding. In this paper, we introduce the Graph Quantum Walk Transformer (GQWformer), a groundbreaking GNN framework that utilizes quantum walks on attributed graphs to generate node quantum states. These quantum states encapsulate rich structural attributes and serve as inductive biases for the transformer, thereby enabling the generation of more meaningful attention scores. By subsequently incorporating a recurrent neural network, our design amplifies the model’s ability to focus on both local and global information. We conducted comprehensive experiments across five publicly available datasets to evaluate the effectiveness of our model. These results clearly indicate that GQWformer outperforms existing state-of-the-art graph classification algorithms. These findings highlight the significant potential of integrating quantum computing methodologies with traditional GNNs to advance the field of graph representation learning, providing a promising direction for future research and applications.

[AI-26] Connecting Large Language Models with Blockchain: Advancing the Evolution of Smart Contracts from Automation to Intelligence

Link: https://arxiv.org/abs/2412.02263
Authors: Youquan Xian, Xueying Zeng, Duancheng Xuan, Danping Yang, Chunpei Li, Peng Fan, Peng Liu
Keywords: including decentralized finance, smart contracts, Large Language Models, Blockchain smart contracts, current smart contracts
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures

Abstract:Blockchain smart contracts have catalyzed the development of decentralized applications across various domains, including decentralized finance. However, due to constraints in computational resources and the prevalence of data silos, current smart contracts face significant challenges in fully leveraging the powerful capabilities of Large Language Models (LLMs) for tasks such as intelligent analysis and reasoning. To address this gap, this paper proposes and implements a universal framework for integrating LLMs with blockchain data, \sysname, effectively overcoming the interoperability barriers between blockchain and LLMs. By combining semantic relatedness with truth discovery methods, we introduce an innovative data aggregation approach, \funcname, which significantly enhances the accuracy and trustworthiness of data generated by LLMs. To validate the framework’s effectiveness, we construct a dataset consisting of three types of questions, capturing Q&A interactions between 10 oracle nodes and 5 LLM models. Experimental results demonstrate that, even with 40% malicious nodes, the proposed solution improves data accuracy by an average of 17.74% compared to the optimal baseline. This research not only provides an innovative solution for the intelligent enhancement of smart contracts but also highlights the potential for deep integration between LLMs and blockchain technology, paving the way for more intelligent and complex applications of smart contracts in the future.

[AI-27] Deep learning approach for predicting the replicator equation in evolutionary game theory

Link: https://arxiv.org/abs/2412.02222
Authors: Advait Chandorkar
Keywords: allowing accurate forecasting, physics-informed deep learning, deep learning approach, allowing accurate, paper presents
Categories: Artificial Intelligence (cs.AI)

Abstract:This paper presents a physics-informed deep learning approach for predicting the replicator equation, allowing accurate forecasting of population dynamics. This methodological innovation allows us to derive governing differential or difference equations for systems that lack explicit mathematical models. We used the SINDy model first introduced by Fasel, Kaiser, Kutz, Brunton, and Brunt 2016a to get the replicator equation, which will significantly advance our understanding of evolutionary biology, economic systems, and social dynamics. By refining predictive models across multiple disciplines, including ecology, social structures, and moral behaviours, our work offers new insights into the complex interplay of variables shaping evolutionary outcomes in dynamic systems.
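
For readers unfamiliar with the replicator equation mentioned above, the following is a minimal sketch of its standard textbook form, dx_i/dt = x_i * ((A x)_i - x^T A x), integrated with forward Euler. It is not the paper's SINDy pipeline. The payoff matrix, step size, and function names are illustrative assumptions.

```python
def replicator_step(x, payoff, dt=0.01):
    """One forward-Euler step of dx_i/dt = x_i * (f_i - mean_fitness)."""
    n = len(x)
    # Fitness of each strategy against the current population mix.
    fitness = [sum(payoff[i][j] * x[j] for j in range(n)) for i in range(n)]
    mean_fitness = sum(x[i] * fitness[i] for i in range(n))
    new_x = [x[i] + dt * x[i] * (fitness[i] - mean_fitness) for i in range(n)]
    total = sum(new_x)
    # Renormalise to guard against floating-point drift off the simplex.
    return [v / total for v in new_x]


def simulate(x0, payoff, steps=1000, dt=0.01):
    """Return the list of population states visited from x0."""
    x = list(x0)
    trajectory = [x]
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
        trajectory.append(x)
    return trajectory


# Hypothetical Hawk-Dove payoffs (resource value 2, fight cost 4).
PAYOFF = [[-1.0, 2.0],
          [0.0, 1.0]]
final_state = simulate([0.9, 0.1], PAYOFF, steps=5000)[-1]
```

For this particular matrix the two fitness values are equal when half the population plays Hawk, so the simulated hawk share settles near 0.5 from an interior starting point.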

[AI-28] Recovering implicit physics model under real-world constraints ECAI2024

Link: https://arxiv.org/abs/2412.02215
Authors: Ayan Banerjee, Sandeep K.S. Gupta
Keywords: recent interest, real-world data, data, model, real-world
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper is published in ECAI 2024

Abstract:Recovering a physics-driven model, i.e. a governing set of equations of the underlying dynamical systems, from real-world data has been of recent interest. Most existing methods either operate on simulation data with unrealistically high sampling rates or require explicit measurements of all system variables, which is impractical in real-world deployments. Moreover, they assume the timestamps of external perturbations to the physical system are known a priori, without uncertainty, implicitly discounting any sensor time-synchronization or human reporting errors. In this paper, we propose a novel liquid time constant neural network (LTC-NN) based architecture to recover the underlying model of physical dynamics from real-world data. The automatic differentiation property of LTC-NN nodes overcomes problems associated with low sampling rates, the input dependent time constant in the forward pass of the hidden layer of LTC-NN nodes creates a massive search space of implicit physical dynamics, the physics model solver based data reconstruction loss guides the search for the correct set of implicit dynamics, and the use of the dropout regularization in the dense layer ensures extraction of the sparsest model. Further, to account for the perturbation timing error, we utilize dense layer nodes to search through input shifts that result in the lowest reconstruction loss. Experiments on four benchmark dynamical systems, three with simulation data and one with real-world data, show that the LTC-NN architecture is more accurate in recovering implicit physics model coefficients than the state-of-the-art sparse model recovery approaches. We also introduce four additional case studies (total eight) on real-life medical examples in simulation and with real-world clinical data to show the effectiveness of our approach in recovering the underlying model in practice.

[AI-29] Comparative Performance of Machine Learning Algorithms for Early Genetic Disorder and Subclass Classification

Link: https://arxiv.org/abs/2412.02189
Authors: Abu Bakar Siddik, Faisal R. Badal, Afroza Islam
Keywords: types remains elusive, remains elusive, great deal, deal of effort, devoted to discovering
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 11 figures, 9 tables

Abstract:A great deal of effort has been devoted to discovering a particular genetic disorder, but its classification across a broad spectrum of disorder classes and types remains elusive. Early diagnosis of genetic disorders enables timely interventions and improves outcomes. This study implements machine learning models using basic clinical indicators measurable at birth or infancy to enable diagnosis in preliminary life stages. Supervised learning algorithms were implemented on a dataset of 22083 instances with 42 features like family history, newborn metrics, and basic lab tests. Extensive hyperparameter tuning, feature engineering, and selection were undertaken. Two multi-class classifiers were developed: one for predicting disorder classes (mitochondrial, multifactorial, and single-gene) and one for subtypes (9 disorders). Performance was evaluated using accuracy, precision, recall, and the F1-score. The CatBoost classifier achieved the highest accuracy of 77% for predicting genetic disorder classes. For subtypes, SVM attained a maximum accuracy of 80%. The study demonstrates the feasibility of using basic clinical data in machine learning models for early categorization and diagnosis across various genetic disorders. Applying ML with basic clinical indicators can enable timely interventions once validated on larger datasets. It is necessary to conduct further studies to improve model performance on this dataset.

[AI-30] Generalizing Weisfeiler-Lehman Kernels to Subgraphs

Link: https://arxiv.org/abs/2412.02181
Authors: Dongkwan Kim, Alice Oh
Keywords: Subgraph representation learning, representation learning, effective in solving, Subgraph representation, real-world problems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 15 pages

Abstract:Subgraph representation learning has been effective in solving various real-world problems. However, current graph neural networks (GNNs) produce suboptimal results for subgraph-level tasks due to their inability to capture complex interactions within and between subgraphs. To provide a more expressive and efficient alternative, we propose WLKS, a Weisfeiler-Lehman (WL) kernel generalized for subgraphs by applying the WL algorithm on induced k-hop neighborhoods. We combine kernels across different k-hop levels to capture richer structural information that is not fully encoded in existing models. Our approach can balance expressiveness and efficiency by eliminating the need for neighborhood sampling. In experiments on eight real-world and synthetic benchmarks, WLKS significantly outperforms leading approaches on five datasets while reducing training time, ranging from 0.01x to 0.25x compared to the state-of-the-art.
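
The idea of running the WL algorithm on induced k-hop neighborhoods can be illustrated with a small sketch. This is not the authors' WLKS implementation: the function names, the nested-tuple colour encoding, and the plain sum used to combine k levels are all assumptions made for illustration.

```python
from collections import Counter


def k_hop_induced(adj, nodes, k):
    """Adjacency of the subgraph induced by `nodes` plus everything within k hops."""
    frontier, seen = set(nodes), set(nodes)
    for _ in range(k):
        frontier = {u for v in frontier for u in adj[v]} - seen
        seen |= frontier
    return {v: [u for u in adj[v] if u in seen] for v in seen}


def wl_histogram(adj, labels, iterations=2):
    """Count WL colours over the initial labelling and each refinement round."""
    colours = {v: labels[v] for v in adj}
    hist = Counter(colours.values())
    for _ in range(iterations):
        # New colour = (own colour, sorted multiset of neighbour colours).
        colours = {
            v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
            for v in adj
        }
        hist.update(colours.values())
    return hist


def wl_subgraph_kernel(adj, labels, sub_a, sub_b, k=1, iterations=2):
    """Dot product of WL colour histograms of two k-hop neighbourhoods."""
    hist_a = wl_histogram(k_hop_induced(adj, sub_a, k), labels, iterations)
    hist_b = wl_histogram(k_hop_induced(adj, sub_b, k), labels, iterations)
    return sum(count * hist_b[colour] for colour, count in hist_a.items())


def combined_kernel(adj, labels, sub_a, sub_b, ks=(0, 1, 2)):
    """Assumed combination rule: unweighted sum over k-hop levels."""
    return sum(wl_subgraph_kernel(adj, labels, sub_a, sub_b, k=k) for k in ks)


# Toy example: a path graph 0-1-2-3-4-5 with identical node labels.
PATH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
LABELS = {v: 0 for v in PATH}
```

On the path graph, the 1-hop neighborhoods of the two end nodes are isomorphic, so their kernel value equals each neighborhood's self-similarity, while an end node and an interior node score lower.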

[AI-31] Self-Supervised Learning-Based Path Planning and Obstacle Avoidance Using PPO and B-Splines in Unknown Environments

Link: https://arxiv.org/abs/2412.02176
Authors: Shahab Shokouhi, Oguzhan Oruc, May-Win Thein
Keywords: paper introduces SmartBSP, advanced self-supervised learning, self-supervised learning framework, autonomous robotics navigating, Proximal Policy Optimization
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:This paper introduces SmartBSP, an advanced self-supervised learning framework for real-time path planning and obstacle avoidance in autonomous robotics navigating through complex environments. The proposed system integrates Proximal Policy Optimization (PPO) with Convolutional Neural Networks (CNN) and Actor-Critic architecture to process limited LIDAR inputs and compute spatial decision-making probabilities. The robot’s perceptual field is discretized into a grid format, which the CNN analyzes to produce a spatial probability distribution. During training, a nuanced cost function is minimized that accounts for path curvature, endpoint proximity, and obstacle avoidance. Simulation results validate the algorithm’s resilience and adaptability across diverse operational scenarios. Subsequently, real-time experiments employing the Robot Operating System (ROS) were carried out to assess the efficacy of the proposed algorithm.

[AI-32] Keeping Experts in the Loop: Expert-Guided Optimization for Clinical Data Classification using Large Language Models

Link: https://arxiv.org/abs/2412.02173
Authors: Nader Karayanni, Aya Awwad, Chein-Lien Hsiao, Surish P Shanmugam
Keywords: Large Language Models, Language Models, Large Language, emergence of Large, center stage
Categories: Artificial Intelligence (cs.AI)

Abstract:Since the emergence of Large Language Models (LLMs), the challenge of effectively leveraging their potential in healthcare has taken center stage. A critical barrier to using LLMs for extracting insights from unstructured clinical notes lies in the prompt engineering process. Despite its pivotal role in determining task performance, a clear framework for prompt optimization remains absent. Current methods to address this gap either rely on manual prompt refinement, where domain experts collaborate with prompt engineers to create an optimal prompt, which is time-intensive and difficult to scale, or employ automatic prompt optimization, where the value of domain experts’ input is not fully realized. To address this, we propose StructEase, a novel framework that bridges the gap between automation and the input of human expertise in prompt engineering. A core innovation of the framework is SamplEase, an iterative sampling algorithm that identifies high-value cases where expert feedback drives significant performance improvements. This targeted approach minimizes expert intervention, reduces labeling redundancy, mitigates human error, and enhances classification outcomes. We evaluated the performance of StructEase using a dataset of de-identified clinical narratives from the US National Electronic Injury Surveillance System (NEISS), demonstrating significant gains in classification performance compared to current methods. Our findings underscore the value of expert integration in LLM workflows, achieving notable improvements in F1 score while maintaining minimal expert effort. By combining transparency, flexibility, and scalability, StructEase sets the foundation for a framework to integrate expert input into LLM workflows in healthcare and beyond.

[AI-33] Analyzing the Impact of AI Tools on Student Study Habits and Academic Performance

Link: https://arxiv.org/abs/2412.02166
Authors: Ben Ward, Deepshikha Bhati, Fnu Neha, Angela Guercio
Keywords: improving study habits, enhancing student learning, time management, specifically in improving, support personalized learning
Categories: Artificial Intelligence (cs.AI)

Abstract:This study explores the effectiveness of AI tools in enhancing student learning, specifically in improving study habits, time management, and feedback mechanisms. The research focuses on how AI tools can support personalized learning, adaptive test adjustments, and provide real-time classroom analysis. Student feedback revealed strong support for these features, and the study found a significant reduction in study hours alongside an increase in GPA, suggesting positive academic outcomes. Despite these benefits, challenges such as over-reliance on AI and difficulties in integrating AI with traditional teaching methods were also identified, emphasizing the need for AI tools to complement conventional educational strategies rather than replace them. Data were collected through a survey with a Likert scale and follow-up interviews, providing both quantitative and qualitative insights. The analysis involved descriptive statistics to summarize demographic data, AI usage patterns, and perceived effectiveness, as well as inferential statistics (T-tests, ANOVA) to examine the impact of demographic factors on AI adoption. Regression analysis identified predictors of AI adoption, and qualitative responses were thematically analyzed to understand students’ perspectives on the future of AI in education. This mixed-methods approach provided a comprehensive view of AI’s role in education and highlighted the importance of privacy, transparency, and continuous refinement of AI features to maximize their educational benefits.

[AI-34] CausalMob: Causal Human Mobility Prediction with LLMs-derived Human Intentions toward Public Events KDD2025

Link: https://arxiv.org/abs/2412.02155
Authors: Xiaojie Yang, Hangli Ge, Jiawei Wang, Zipei Fan, Renhe Jiang, Ryosuke Shibasaki, Noboru Koshizuka
Keywords: mobility exhibits spatial, decision making, exhibits spatial, spatial and temporal, assist policymakers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: Accepted by KDD 2025

Abstract:Large-scale human mobility exhibits spatial and temporal patterns that can assist policymakers in decision making. Although traditional prediction models attempt to capture these patterns, they are often disrupted by non-periodic public events, such as disasters and occasional celebrations. Since regular human mobility patterns are heavily affected by these events, estimating their causal effects is critical to accurate mobility predictions. Although news articles provide unique perspectives on these events, their unstructured format makes them challenging to process. In this study, we propose a causality-augmented prediction model, called CausalMob, to analyze the causal effects of public events. We first utilize large language models (LLMs) to extract human intentions from news articles and transform them into features that act as causal treatments. Next, the model learns representations of spatio-temporal regional covariates from multiple data sources to serve as confounders for causal inference. Finally, we present a causal effect estimation framework to ensure event features remain independent of confounders during prediction. Based on large-scale real-world data, the experimental results show that the proposed model excels in human mobility prediction, outperforming state-of-the-art models.

[AI-35] Failure Probability Estimation for Black-Box Autonomous Systems using State-Dependent Importance Sampling Proposals

Link: https://arxiv.org/abs/2412.02154
Authors: Harrison Delecki, Sydney M. Katz, Mykel J. Kochenderfer
Keywords: developing safety-critical autonomous, safety-critical autonomous systems, critical step, step in developing, developing safety-critical
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Submitted to L4DC 2025

Abstract:Estimating the probability of failure is a critical step in developing safety-critical autonomous systems. Direct estimation methods such as Monte Carlo sampling are often impractical due to the rarity of failures in these systems. Existing importance sampling approaches do not scale to sequential decision-making systems with large state spaces and long horizons. We propose an adaptive importance sampling algorithm to address these limitations. Our method minimizes the forward Kullback-Leibler divergence between a state-dependent proposal distribution and a relaxed form of the optimal importance sampling distribution. Our method uses Markov score ascent methods to estimate this objective. We evaluate our approach on four sequential systems and show that it provides more accurate failure probability estimates than baseline Monte Carlo and importance sampling techniques. This work is open sourced.
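
To see why plain Monte Carlo struggles with rare failures and how importance sampling helps, here is a minimal one-dimensional sketch. It uses a fixed Gaussian proposal, not the adaptive state-dependent proposal the paper describes. The "failure" event (a standard normal exceeding a threshold), the proposal, and all names are assumptions for illustration.

```python
import math
import random


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def failure_probability_is(threshold, n_samples, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) by importance sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Proposal N(threshold, 1) places about half its mass in the failure region.
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # Importance weight: nominal density divided by proposal density.
            total += normal_pdf(x) / normal_pdf(x, mean=threshold)
    return total / n_samples


def failure_probability_mc(threshold, n_samples, seed=0):
    """Direct Monte Carlo estimate of the same probability, for comparison."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n_samples


def exact_tail(threshold):
    """Closed-form tail probability of the standard normal."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


estimate = failure_probability_is(4.0, 20000)
reference = exact_tail(4.0)
```

With a threshold of 4, the true probability is about 3e-5, so a direct Monte Carlo run of 20,000 samples usually sees no failures or only one, while the weighted estimate from the shifted proposal lands close to the closed-form value.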

[AI-36] Revisiting the Initial Steps in Adaptive Gradient Descent Optimization NEURIPS2024

Link: https://arxiv.org/abs/2412.02153
Authors: Abulikemu Abuduweili, Changliu Liu
Keywords: deep neural networks, diverse machine learning, machine learning tasks, learning tasks due, training deep neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: OPT workshop at NeurIPS 2024

Abstract:Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from suboptimal generalization compared to stochastic gradient descent (SGD) and exhibit instability, particularly when training Transformer models. In this work, we show the standard initialization of the second-order moment estimation (v_0 = 0) as a significant factor contributing to these limitations. We introduce simple yet effective solutions: initializing the second-order moment estimation with non-zero values, using either data-driven or random initialization strategies. Empirical evaluations demonstrate that our approach not only stabilizes convergence but also enhances the final performance of adaptive gradient optimizers. Furthermore, by adopting the proposed initialization strategies, Adam achieves performance comparable to many recently proposed variants of adaptive gradient optimization methods, highlighting the practical impact of this straightforward modification.
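
The role of the second-moment initialization can be made concrete with a scalar Adam sketch. With the standard zero initialization, the first update has magnitude close to the learning rate regardless of the gradient scale. The `v0` argument below and the choice to skip bias correction of the second moment when `v0` is non-zero are assumptions for illustration; they are not taken from the paper.

```python
import math


def adam_minimise(grad, x0, steps, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, v0=0.0):
    """Scalar Adam; returns the list of iterates starting with x0."""
    x, m, v = x0, 0.0, v0
    history = [x]
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        # Assumption: bias-correct v only for the standard zero initialisation,
        # since the correction would otherwise inflate a non-zero v0.
        v_hat = v / (1 - beta2 ** t) if v0 == 0.0 else v
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(x)
    return history


# First step on f(x) = x^2 and on the much steeper f(x) = 100 x^2, from x = 5.
shallow = adam_minimise(lambda x: 2 * x, 5.0, steps=1)
steep = adam_minimise(lambda x: 200 * x, 5.0, steps=1)
# Hypothetical non-zero initialisation, here four times the squared first gradient.
damped = adam_minimise(lambda x: 2 * x, 5.0, steps=1, v0=400.0)
```

Both zero-initialized runs move by roughly `lr` on the first step even though their gradients differ by a factor of 100, whereas the run with a non-zero `v0` takes a smaller first step.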

[AI-37] Mining Tweets to Predict Future Bitcoin Price

Link: https://arxiv.org/abs/2412.02148
Authors: Ashutosh Hathidara, Gaurav Atavale, Suyash Chaudhary
Keywords: increased investment interests, increased investment, investment interests, Bitcoin, Bitcoin price
Categories: Artificial Intelligence (cs.AI)

Abstract:Investor interest in Bitcoin has grown over the last decade. We have seen an increase in the number of posts on social media platforms about cryptocurrency, especially Bitcoin. This project focuses on analyzing user tweet data in combination with Bitcoin price data to examine the relationship between price fluctuations and the conversation between millions of people on Twitter. This study also exploits this relationship between user tweets and Bitcoin prices to predict the future Bitcoin price. We are utilizing novel techniques and methods to analyze the data and make price predictions.

[AI-38] Effective Mitigations for Systemic Risks from General-Purpose AI

Link: https://arxiv.org/abs/2412.02145
Authors: Risto Uuk, Annemieke Brouwer, Tim Schreier, Noemi Dreksler, Valeria Pulignano, Rishi Bommasani
Keywords: mitigations remains underexplored, systemic risks, systemic risks posed, growing concern, remains underexplored
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 78 pages, 7 figures, 2 tables

Abstract:The systemic risks posed by general-purpose AI models are a growing concern, yet the effectiveness of mitigations remains underexplored. Previous research has proposed frameworks for risk mitigation, but has left gaps in our understanding of the perceived effectiveness of measures for mitigating systemic risks. Our study addresses this gap by evaluating how experts perceive different mitigations that aim to reduce the systemic risks of general-purpose AI models. We surveyed 76 experts whose expertise spans AI safety; critical infrastructure; democratic processes; chemical, biological, radiological, and nuclear risks (CBRN); and discrimination and bias. Among 27 mitigations identified through a literature review, we find that a broad range of risk mitigation measures are perceived as effective in reducing various systemic risks and technically feasible by domain experts. In particular, three mitigation measures stand out: safety incident reports and security information sharing, third-party pre-deployment model audits, and pre-deployment risk assessments. These measures show both the highest expert agreement ratings (60%) across all four risk areas and are most frequently selected in experts’ preferred combinations of measures (40%). The surveyed experts highlighted that external scrutiny, proactive evaluation and transparency are key principles for effective mitigation of systemic risks. We provide policy recommendations for implementing the most promising measures, incorporating the qualitative contributions from experts. These insights should inform regulatory frameworks and industry practices for mitigating the systemic risks associated with general-purpose AI.

[AI-39] Graph Learning for Planning: The Story Thus Far and Open Challenges

Link: https://arxiv.org/abs/2412.02136
Authors: Dillon Z. Chen, Mingyu Hao, Sylvie Thiébaux, Felipe Trevizan
Keywords: exploit relational structures, relational structures exhibited, input planning instances, Graph learning, graph learning architectures
Categories: Artificial Intelligence (cs.AI)

Abstract:Graph learning is naturally well suited for use in planning due to its ability to exploit relational structures exhibited in planning domains and to take as input planning instances with an arbitrary number of objects. In this paper, we study the usage of graph learning for planning thus far by studying the theoretical and empirical effects on learning and planning performance of (1) graph representations of planning tasks, (2) graph learning architectures, and (3) optimisation formulations for learning. Our studies culminate in the GOOSE framework, which learns domain knowledge from small planning tasks in order to scale up to much larger planning tasks. In this paper, we also highlight and propose the 5 open challenges in the general Learning for Planning field that we believe need to be addressed for advancing the state-of-the-art.

[AI-40] A privacy-preserving distributed credible evidence fusion algorithm for collective decision-making

Link: https://arxiv.org/abs/2412.02130
Authors: Chaoxiong Ma, Yan Liang, Xinyu Yang, Han Wu, Huixia Zhang
Keywords: evidence, recent years, applied to collective, collective decision-making, decision-making in recent
Categories: Artificial Intelligence (cs.AI)

Abstract:The theory of evidence reasoning has been applied to collective decision-making in recent years. However, existing distributed evidence fusion methods lead to participants’ preference leakage and fusion failures as they directly exchange raw evidence and do not assess evidence credibility like centralized credible evidence fusion (CCEF) does. To address these issues, a privacy-preserving distributed credible evidence fusion method with three-level consensus (PCEF) is proposed in this paper. In evidence difference measure (EDM) neighbor consensus, an evidence-free equivalent expression of EDM among neighbored agents is derived with the shared dot product protocol for pignistic probability and the identical judgment of two events with maximal subjective probabilities, so that evidence privacy is guaranteed due to such irreversible evidence transformation. In EDM network consensus, the non-neighbored EDMs are inferred and neighbored EDMs reach uniformity via interaction between linear average consensus (LAC) and low-rank matrix completion with rank adaptation to guarantee EDM consensus convergence and no solution of inferring raw evidence in numerical iteration style. In fusion network consensus, a privacy-preserving LAC with a self-cancelling differential privacy term is proposed, where each agent adds its randomness to the sharing content and step-by-step cancels such randomness in consensus iterations. Besides, the sufficient condition of the convergence to the CCEF is explored, and it is proven that raw evidence cannot be inferred in such an iterative consensus. The simulations show that PCEF is close to CCEF both in credibility and fusion results and obtains higher decision accuracy with less time consumption than existing methods.
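
The linear average consensus (LAC) building block named in the abstract can be sketched in a few lines. This is only the plain, non-private iteration; the paper's self-cancelling differential privacy term and the EDM steps are not reproduced. The ring topology, step size, and function name are assumptions.

```python
def average_consensus(values, neighbours, step=0.3, iterations=100):
    """Plain linear average consensus on an undirected graph.

    Each agent repeatedly moves toward its neighbours' values. For a connected
    graph and step < 1 / max_degree, every agent approaches the global mean.
    """
    x = list(values)
    for _ in range(iterations):
        # Synchronous update: all agents use the previous round's values.
        x = [
            xi + step * sum(x[j] - xi for j in neighbours[i])
            for i, xi in enumerate(x)
        ]
    return x


# Four agents on a ring; each exchanges values only with its two neighbours.
RING = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
agreed = average_consensus([1.0, 2.0, 3.0, 4.0], RING)
```

Because the graph is undirected, the sum of all values is preserved at every round, which is why the agents agree on the mean of the initial values (2.5 here) without any agent seeing all inputs directly.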

[AI-41] Benchmarking symbolic regression constant optimization schemes

Link: https://arxiv.org/abs/2412.02126
Authors: L.G.A. dos Reis, V.L.P.S. Caminha, T.J.P. Penna
Keywords: machine learning technique, genetic programming approaches, learning technique, programming approaches, greatly increases GPSR
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: 9 pages, 10 figures, 2 tables

Abstract:Symbolic regression is a machine learning technique, and it has seen many advancements in recent years, especially in genetic programming approaches (GPSR). Furthermore, it has been known for many years that constant optimization of parameters, during the evolutionary search, greatly increases GPSR performance. However, different authors approach such tasks differently and no consensus exists regarding which methods perform best. In this work, we evaluate eight different parameter optimization methods, applied during evolutionary search, over ten known benchmark problems, in two different scenarios. We also propose using an under-explored metric called Tree Edit Distance (TED), aiming to identify symbolic accuracy. In conjunction with classical error measures, we develop a combined analysis of model performance in symbolic regression. We then show that different constant optimization methods perform better in certain scenarios and that there is no overall best choice for every problem. Finally, we discuss how common metric decisions may be biased and appear to generate better models in comparison.

[AI-42] Optimizing Latent Goal by Learning from Trajectory Preference

Link: https://arxiv.org/abs/2412.02125
Authors: Guangyu Zhao, Kewei Lian, Haowei Lin, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang
Keywords: human intentions, glowing body, body of work, work has emerged, emerged focusing
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:A growing body of work has emerged focusing on instruction-following policies for open-world agents, aiming to better align the agent’s behavior with human intentions. However, the performance of these policies is highly susceptible to the initial prompt, which leads to extra efforts in selecting the best instructions. We propose a framework named Preference Goal Tuning (PGT). PGT allows an instruction-following policy to interact with the environment to collect several trajectories, which will be categorized into positive and negative samples based on preference. Then we use preference learning to fine-tune the initial goal latent representation with the categorized trajectories while keeping the policy backbone frozen. Experimental results show that with minimal data and training, PGT achieves an average relative improvement of 72.0% and 81.6% over 17 tasks in 2 different foundation policies respectively, and outperforms the best human-selected instructions. Moreover, PGT surpasses full fine-tuning in the out-of-distribution (OOD) task-execution environments by 13.4%, indicating that our approach retains strong generalization capabilities. Since our approach stores a single latent representation for each task independently, it can be viewed as an efficient method for continual learning, without the risk of catastrophic forgetting or task interference. In short, PGT enhances the performance of agents across nearly all tasks in the Minecraft Skillforge benchmark and demonstrates robustness to the execution environment.

[AI-43] Trust & Safety of LLMs and LLMs in Trust & Safety

Link: https://arxiv.org/abs/2412.02113
Authors: Doohee You, Dan Chon
Keywords: Large Language Models, language processing tasks, natural language processing, Large Language, Language Models
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages

Abstract:In recent years, Large Language Models (LLMs) have garnered considerable attention for their remarkable abilities in natural language processing tasks. However, their widespread adoption has raised concerns pertaining to trust and safety. This systematic review investigates the current research landscape on trust and safety in LLMs, with a particular focus on the novel application of LLMs within the field of Trust and Safety itself. We delve into the complexities of utilizing LLMs in domains where maintaining trust and safety is paramount, offering a consolidated perspective on this emerging trend. By synthesizing findings from various studies, we identify key challenges and potential solutions, aiming to benefit researchers and practitioners seeking to understand the nuanced interplay between LLMs and Trust and Safety. This review provides insights on best practices for using LLMs in Trust and Safety, and explores emerging risks such as prompt injection and jailbreak attacks. Ultimately, this study contributes to a deeper understanding of how LLMs can be effectively and responsibly utilized to enhance trust and safety in the digital realm.

[AI-44] The Problem of Social Cost in Multi-Agent General Reinforcement Learning: Survey and Synthesis

Link: https://arxiv.org/abs/2412.02091
Authors: Kee Siong Ng, Samuel Yang-Zhao, Timothy Cadogan-Cowper
Keywords: catastrophic collateral damage, narrow objective, safety literature, literature is full, blindly pursuing
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 49 pages

Abstract:The AI safety literature is full of examples of powerful AI agents that, in blindly pursuing a specific and usually narrow objective, end up causing unacceptable and even catastrophic collateral damage to others. In this paper, we consider the problem of social harms that can result from actions taken by learning and utility-maximising agents in a multi-agent environment. The problem of measuring social harms or impacts in such multi-agent settings, especially when the agents are artificial generally intelligent (AGI) agents, was listed as an open problem in Everitt et al., 2018. We attempt a partial answer to that open problem in the form of market-based mechanisms to quantify and control the cost of such social harms. The proposed setup captures many well-studied special cases and is more general than existing formulations of multi-agent reinforcement learning with mechanism design in two ways: (i) the underlying environment is a history-based general reinforcement learning environment like in AIXI; (ii) the reinforcement-learning agents participating in the environment can have different learning strategies and planning horizons. To demonstrate the practicality of the proposed setup, we survey some key classes of learning algorithms and present a few applications, including a discussion of the Paperclips problem and pollution control with a cap-and-trade system.

[AI-45] Evolution of Collective AI Beyond Individual Optimization

Link: https://arxiv.org/abs/2412.02085
Authors: Ryosuke Takata,Yujin Tang,Yingtao Tian,Norihiro Maruyama,Hiroki Kojima,Takashi Ikegami
Keywords: homogeneous individuals optimized, specific capability, study investigates collective, group, group of homogeneous
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)

Abstract:This study investigates collective behaviors that emerge from a group of homogeneous individuals optimized for a specific capability. We created a group of simple, identical neural network based agents modeled after chemotaxis-driven vehicles that follow pheromone trails and examined multi-agent simulations using clones of these evolved individuals. Our results show that the evolution of individuals led to population differentiation. Surprisingly, we observed that collective fitness declined significantly during later evolutionary stages, even though individual performance remained high and neural architectures became simpler. This decline occurred when agents developed reduced sensor-motor coupling, suggesting that over-optimization of individual agents can lead to less effective group behavior. Our research investigates how, and through what evolutionary pathways, individual differentiation can evolve.

[AI-46] Comparative Analysis of Black-Box and White-Box Machine Learning Model in Phishing Detection

Link: https://arxiv.org/abs/2412.02084
Authors: Abdullah Fajar,Setiadi Yazid,Indra Budi
Keywords: phishing attack mitigation, phishing detection, attack mitigation, mitigation by increasing, increasing trust
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Background: Explainability in phishing detection models can support further phishing attack mitigation by increasing trust and improving understanding of how phishing can be detected. Objective: This study aims to determine and recommend an approach whose components can fulfil these critical needs. Methods: The methodology starts by analyzing both black-box and white-box models to identify their pros and cons specifically in phishing detection. The conclusions of the analysis are then validated by experiments using a set of well-known algorithms and public phishing datasets. The experimental metrics cover three measurements, including predictive accuracy and explainability metrics. Conclusion: Both models are comparable in terms of interpretability and consistency, with room for improvement on diverse datasets. EBM, as an example of a white-box model, is generally better suited for applications requiring explainability and actionable insights. Finally, both white-box and black-box models have positive and negative aspects with respect to performance metrics and explainability metrics. It is important to consider the objective of model usage.

[AI-47] Construction and optimization of health behavior prediction model for the elderly in smart elderly care

Link: https://arxiv.org/abs/2412.02062
Authors: Qian Guo,Peiyuan Chen
Keywords: health behavior prediction, smart elderly care, elderly care, global aging, social attention
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 23 pages

Abstract:With the intensification of global aging, health management of the elderly has become a focus of social attention. This study designs and implements a smart elderly care service model to address issues such as data diversity, health status complexity, long-term dependence and data loss, sudden changes in behavior, and data privacy in the prediction of health behaviors of the elderly. The model achieves accurate prediction and dynamic management of health behaviors of the elderly through modules such as multimodal data fusion, data loss processing, nonlinear prediction, emergency detection, and privacy protection. In the experimental design, based on multi-source data sets and market research results, the model demonstrates excellent performance in health behavior prediction, emergency detection, and personalized services. The experimental results show that the model can effectively improve the accuracy and robustness of health behavior prediction and meet the actual application needs in the field of smart elderly care. In the future, with the integration of more data and further optimization of technology, the model will provide more powerful technical support for smart elderly care services.

[AI-48] Comparative Analysis of Multi-Agent Reinforcement Learning Policies for Crop Planning Decision Support

Link: https://arxiv.org/abs/2412.02057
Authors: Anubha Mahajan,Shreya Hegde,Ethan Shay,Daniel Wu,Aviva Prins
Keywords: economic losses due, Multi-agent Rollout Policy, small or marginal, climate risks, classified as small
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:In India, the majority of farmers are classified as small or marginal, making their livelihoods particularly vulnerable to economic losses due to market saturation and climate risks. Effective crop planning can significantly impact their expected income, yet existing decision support systems (DSS) often provide generic recommendations that fail to account for real-time market dynamics and the interactions among multiple farmers. In this paper, we evaluate the viability of three multi-agent reinforcement learning (MARL) approaches for optimizing total farmer income and promoting fairness in crop planning: Independent Q-Learning (IQL), where each farmer acts independently without coordination, Agent-by-Agent (ABA), which sequentially optimizes each farmer’s policy in relation to the others, and the Multi-agent Rollout Policy, which jointly optimizes all farmers’ actions for global reward maximization. Our results demonstrate that while IQL offers computational efficiency with linear runtime, it struggles with coordination among agents, leading to lower total rewards and an unequal distribution of income. Conversely, the Multi-agent Rollout policy achieves the highest total rewards and promotes equitable income distribution among farmers but requires significantly more computational resources, making it less practical for large numbers of agents. ABA strikes a balance between runtime efficiency and reward optimization, offering reasonable total rewards with acceptable fairness and scalability. These findings highlight the importance of selecting appropriate MARL approaches in DSS to provide personalized and equitable crop planning recommendations, advancing the development of more adaptive and farmer-centric agricultural decision-making systems.
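To make the Independent Q-Learning (IQL) baseline in the abstract above concrete, here is a minimal sketch. It is not the authors' environment: the one-step market model (`crop_income`), the linear price drop per extra farmer (`saturation`), and all parameter values are assumptions chosen only for illustration. Each farmer keeps a private Q-table and updates it from its own income, with no coordination.

```python
import random

def crop_income(choices, base_price, saturation):
    """Hypothetical market: a crop's price drops linearly with each extra farmer growing it."""
    counts = {}
    for crop in choices:
        counts[crop] = counts.get(crop, 0) + 1
    return [max(base_price[c] - saturation * (counts[c] - 1), 0.0) for c in choices]

def independent_q_learning(n_farmers, base_price, saturation,
                           episodes=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Each farmer learns its own stateless Q-values with epsilon-greedy exploration."""
    rng = random.Random(seed)
    n_crops = len(base_price)
    q = [[0.0] * n_crops for _ in range(n_farmers)]
    for _ in range(episodes):
        choices = []
        for i in range(n_farmers):
            if rng.random() < epsilon:
                choices.append(rng.randrange(n_crops))  # explore
            else:
                choices.append(max(range(n_crops), key=lambda c: q[i][c]))  # exploit
        rewards = crop_income(choices, base_price, saturation)
        # Independent updates: farmer i only sees its own reward.
        for i, (crop, reward) in enumerate(zip(choices, rewards)):
            q[i][crop] += alpha * (reward - q[i][crop])
    return q

if __name__ == "__main__":
    q_tables = independent_q_learning(n_farmers=3, base_price=[10.0, 8.0], saturation=3.0)
    for i, row in enumerate(q_tables):
        print(f"farmer {i}: Q = {[round(v, 2) for v in row]}")
```

Because every farmer treats the others as part of the environment, the learned values depend on what the others happen to plant, which is the coordination weakness the abstract attributes to IQL.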

[AI-49] Future of Information Retrieval Research in the Age of Generative AI

Link: https://arxiv.org/abs/2412.02043
Authors: James Allan,Eunsol Choi,Daniel P. Lopresti,Hamed Zamani
Keywords: large language models, Computing Community Consortium, information retrieval, transforming how users, users search
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:In the fast-evolving field of information retrieval (IR), the integration of generative AI technologies such as large language models (LLMs) is transforming how users search for and interact with information. Recognizing this paradigm shift at the intersection of IR and generative AI (IR-GenAI), a visioning workshop supported by the Computing Community Consortium (CCC) was held in July 2024 to discuss the future of IR in the age of generative AI. This workshop convened 44 experts in information retrieval, natural language processing, human-computer interaction, and artificial intelligence from academia, industry, and government to explore how generative AI can enhance IR and vice versa, and to identify the major challenges and opportunities in this rapidly advancing field. This report contains a summary of discussions as potentially important research topics and contains a list of recommendations for academics, industry practitioners, institutions, evaluation campaigns, and funding agencies.

[AI-50] LLMs4Life: Large Language Models for Ontology Learning in Life Sciences

Link: https://arxiv.org/abs/2412.02035
Authors: Nadeen Fathallah,Steffen Staab,Alsayed Algergawy
Keywords: current Large Language, Large Language Models, Large Language, poses significant challenges, current Large
Categories: Artificial Intelligence (cs.AI)

Abstract:Ontology learning in complex domains, such as life sciences, poses significant challenges for current Large Language Models (LLMs). Existing LLMs struggle to generate ontologies with multiple hierarchical levels, rich interconnections, and comprehensive class coverage due to constraints on the number of tokens they can generate and inadequate domain adaptation. To address these issues, we extend the NeOn-GPT pipeline for ontology learning using LLMs with advanced prompt engineering techniques and ontology reuse to enhance the generated ontologies’ domain-specific reasoning and structural depth. Our work evaluates the capabilities of LLMs in ontology learning in the context of highly specialized and complex domains such as life science domains. To assess the logical consistency, completeness, and scalability of the generated ontologies, we use the AquaDiva ontology developed and used in the collaborative research center AquaDiva as a case study. Our evaluation shows the viability of LLMs for ontology learning in specialized domains, providing solutions to longstanding limitations in model performance and scalability.

[AI-51] PKRD-CoT: A Unified Chain-of-thought Prompting for Multi-Modal Large Language Models in Autonomous Driving ICONIP2024

Link: https://arxiv.org/abs/2412.02025
Authors: Xuewen Luo,Fan Ding,Yinsheng Song,Xiaofeng Zhang,Junnyong Loo
Keywords: Large Language Models, Multi-Modal Large Language, robust Multi-Modal Large, Large Language, Language Models
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for presentation at ICONIP 2024

Abstract:There is growing interest in leveraging the capabilities of robust Multi-Modal Large Language Models (MLLMs) directly within autonomous driving contexts. However, the high costs and complexity of designing and training end-to-end autonomous driving models make them challenging for many enterprises and research entities. To address this, our study explores a seamless integration of MLLMs into autonomous driving systems by proposing a Zero-Shot Chain-of-Thought (Zero-Shot-CoT) prompt design named PKRD-CoT. PKRD-CoT is based on the four fundamental capabilities of autonomous driving: perception, knowledge, reasoning, and decision-making. This makes it particularly suitable for understanding and responding to dynamic driving environments by mimicking human thought processes step by step, thus enhancing decision-making in real-time scenarios. Our design enables MLLMs to tackle problems without prior experience, thereby increasing their utility within unstructured autonomous driving environments. In experiments, we demonstrate the exceptional performance of GPT-4.0 with PKRD-CoT across autonomous driving tasks, highlighting its effectiveness in autonomous driving scenarios. Additionally, our benchmark analysis reveals the promising viability of PKRD-CoT for other MLLMs, such as Claude, LLava1.6, and Qwen-VL-Plus. Overall, this study contributes a novel and unified prompt-design framework for GPT-4.0 and other MLLMs in autonomous driving, while also rigorously evaluating the efficacy of these widely recognized MLLMs in the autonomous driving domain through comprehensive comparisons.

[AI-52] Explore Reinforced: Equilibrium Approximation with Reinforcement Learning

Link: https://arxiv.org/abs/2412.02016
Authors: Ryan Yu,Mateusz Nowak,Qintong Xie,Michelle Yilin Feng,Peter Chin
Keywords: Coarse Correlated Equilibria, Current approximate Coarse, approximate Coarse Correlated, Correlated Equilibria, Coarse Correlated
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)

Abstract:Current approximate Coarse Correlated Equilibria (CCE) algorithms struggle with equilibrium approximation for games in large stochastic environments but are theoretically guaranteed to converge to a strong solution concept. In contrast, modern Reinforcement Learning (RL) algorithms provide faster training yet yield weaker solutions. We introduce Exp3-IXrl - a blend of RL and a game-theoretic approach that separates the RL agent’s action selection from the equilibrium computation while preserving the integrity of the learning process. We demonstrate that our algorithm expands the application of equilibrium approximation algorithms to new environments. Specifically, we show improved performance in a complex and adversarial cybersecurity network environment - the Cyber Operations Research Gym - and in the classical multi-armed bandit setting.
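For readers unfamiliar with the bandit algorithm that Exp3-IXrl takes its name from, the following is a minimal sketch of plain Exp3-IX (exponential weights with implicit exploration) on a multi-armed bandit. It is not the paper's Exp3-IXrl, which additionally involves an RL agent. The Bernoulli loss means and the parameter choice `gamma = eta / 2` are assumptions made for this illustration.

```python
import math
import random

def exp3_ix(loss_fn, n_arms, horizon, eta, gamma, rng):
    """Plain Exp3-IX: exponential weights over biased importance-weighted loss estimates."""
    cum_loss_est = [0.0] * n_arms
    pulls = [0] * n_arms
    probs = [1.0 / n_arms] * n_arms
    for _ in range(horizon):
        # Subtract the minimum before exponentiating for numerical stability.
        low = min(cum_loss_est)
        weights = [math.exp(-eta * (est - low)) for est in cum_loss_est]
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        loss = loss_fn(arm, rng)  # assumed to lie in [0, 1]
        # Implicit exploration: gamma in the denominator caps the estimate's variance.
        cum_loss_est[arm] += loss / (probs[arm] + gamma)
        pulls[arm] += 1
    return pulls, probs

if __name__ == "__main__":
    means = [0.8, 0.5, 0.2]  # hypothetical Bernoulli loss means; arm 2 is best
    horizon = 5000
    eta = math.sqrt(2 * math.log(len(means)) / (len(means) * horizon))
    rng = random.Random(0)
    bernoulli = lambda arm, r: 1.0 if r.random() < means[arm] else 0.0
    pulls, probs = exp3_ix(bernoulli, len(means), horizon, eta, eta / 2, rng)
    print("pull counts:", pulls)
```

The only difference from classic Exp3 is the `+ gamma` term in the loss estimate, which trades a small bias for much lower variance.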

[AI-53] Who’s Gaming the System? A Causally-Motivated Approach for Detecting Strategic Adaptation NEURIPS2024

Link: https://arxiv.org/abs/2412.02000
Authors: Trenton Chang,Lindsay Warrenburg,Sae-Hwan Park,Ravi B. Parikh,Maggie Makar,Jenna Wiens
Keywords: machine learning models, machine learning, impact individuals, learning models, inform decisions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 38 pages, 31 figures. NeurIPS 2024

Abstract:In many settings, machine learning models may be used to inform decisions that impact individuals or entities who interact with the model. Such entities, or agents, may game model decisions by manipulating their inputs to the model to obtain better outcomes and maximize some utility. We consider a multi-agent setting where the goal is to identify the “worst offenders:” agents that are gaming most aggressively. However, identifying such agents is difficult without knowledge of their utility function. Thus, we introduce a framework in which each agent’s tendency to game is parameterized via a scalar. We show that this gaming parameter is only partially identifiable. By recasting the problem as a causal effect estimation problem where different agents represent different “treatments,” we prove that a ranking of all agents by their gaming parameters is identifiable. We present empirical results in a synthetic data study validating the usage of causal effect estimation for gaming detection and show in a case study of diagnosis coding behavior in the U.S. that our approach highlights features associated with gaming.

[AI-54] ChatCollab: Exploring Collaboration Between Humans and AI Agents in Software Teams

Link: https://arxiv.org/abs/2412.01992
Authors: Benjamin Klieger,Charis Charitsis,Miroslav Suzara,Sierra Wang,Nick Haber,John C. Mitchell
Keywords: Artificial Intelligence, conducting initial tests, enables multiple human, productive team-based collaboration, humans and Artificial
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Preprint, 25 pages, 7 figures

Abstract:We explore the potential for productive team-based collaboration between humans and Artificial Intelligence (AI) by presenting and conducting initial tests with a general framework that enables multiple human and AI agents to work together as peers. ChatCollab’s novel architecture allows agents - human or AI - to join collaborations in any role, autonomously engage in tasks and communication within Slack, and remain agnostic to whether their collaborators are human or AI. Using software engineering as a case study, we find that our AI agents successfully identify their roles and responsibilities, coordinate with other agents, and await requested inputs or deliverables before proceeding. In relation to three prior multi-agent AI systems for software development, we find ChatCollab AI agents produce comparable or better software in an interactive game development task. We also propose an automated method for analyzing collaboration dynamics that effectively identifies behavioral characteristics of agents with distinct roles, allowing us to quantitatively compare collaboration dynamics in a range of experimental conditions. For example, in comparing ChatCollab AI agents, we find that an AI CEO agent generally provides suggestions 2-4 times more often than an AI product manager or AI developer, suggesting agents within ChatCollab can meaningfully adopt differentiated collaborative roles. Our code and data can be found at: this https URL.

[AI-55] Human-centred test and evaluation of military AI

Link: https://arxiv.org/abs/2412.01978
Authors: David Helmer,Michael Boardman,S. Kate Conroy,Adam J. Hepworth,Manoj Harjani
Keywords: Blueprint for Action, Action states, military domain, ethical and human-centric, test and evaluation
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 11 pages, summary report from ‘Human-centred test and evaluation of military AI’ panel at Responsible AI in the Military Domain 2024, Seoul Korea, 9-10 September 2024

Abstract:The REAIM 2024 Blueprint for Action states that AI applications in the military domain should be ethical and human-centric and that humans must remain responsible and accountable for their use and effects. Developing rigorous test and evaluation, verification and validation (TEVV) frameworks will contribute to robust oversight mechanisms. TEVV in the development and deployment of AI systems needs to involve human users throughout the lifecycle. Traditional human-centred test and evaluation methods from human factors need to be adapted for deployed AI systems that require ongoing monitoring and evaluation. The language around AI-enabled systems should be shifted to include the human(s) as a component of the system. Standards and requirements supporting this adjusted definition are needed, as are metrics and means to evaluate them. The need for dialogue between technologists and policymakers on human-centred TEVV will be evergreen, but dialogue needs to be initiated with an objective in mind for it to be productive. Development of TEVV throughout the system lifecycle is critical to support this evolution, including the issue of human scalability and its impact on the scale of achievable testing. Communication between technical and non-technical communities must be improved to ensure operators and policymakers understand the risk assumed by system use and to better inform research and development. Test and evaluation in support of responsible AI deployment must include the effect of the human to reflect operationally realised system performance. Means of communicating the results of TEVV to those using and making decisions regarding the use of AI-based systems will be key in informing risk-based decisions regarding use.

[AI-56] Usage Governance Advisor: from Intent to AI Governance AAAI

Link: https://arxiv.org/abs/2412.01957
Authors: Elizabeth M. Daly,Sean Rooney,Seshu Tirupathi,Luis Garces-Erice,Inge Vejsbjerg,Frank Bagehorn,Dhaval Salwala,Christopher Giblin,Mira L. Wolf-Bauwens,Ioana Giurgiu,Michael Hind,Peter Urbanetz
Keywords: pressing concern, concern for organizations, organizations deploying, Evaluating the safety, Evaluating
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 figures, AAAI workshop submission

Abstract:Evaluating the safety of AI Systems is a pressing concern for organizations deploying them. In addition to the societal damage done by the lack of fairness of those systems, deployers are concerned about the legal repercussions and the reputational damage incurred by the use of models that are unsafe. Safety covers both what a model does; e.g., can it be used to reveal personal information from its training set, and how a model was built; e.g., was it only trained on licensed data sets. Determining the safety of an AI system requires gathering information from a wide set of heterogeneous sources including safety benchmarks and technical documentation for the set of models used in that system. In addition, responsible use is encouraged through mechanisms that advise and help the user to take mitigating actions where safety risks are detected. We present Usage Governance Advisor which creates semi-structured governance information, identifies and prioritizes risks according to the intended use case, recommends appropriate benchmarks and risk assessments and importantly proposes mitigation strategies and actions.

[AI-57] Identifying Key Nodes for the Influence Spread using a Machine Learning Approach

Link: https://arxiv.org/abs/2412.01949
Authors: Mateusz Stolarski,Adam Piróg,Piotr Bródka
Keywords: network science areas, science areas, network science, including viral marketing, important topic
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)

Abstract:The identification of key nodes in complex networks is an important topic in many network science areas. It is vital to a variety of real-world applications, including viral marketing, epidemic spreading and influence maximization. In recent years, machine learning algorithms have proven to outperform the conventional, centrality-based methods in accuracy and consistency, but this approach still requires further refinement. What information about the influencers can be extracted from the network? How can we precisely obtain the labels required for training? Can these models generalize well? In this paper, we answer these questions by presenting an enhanced machine learning-based framework for the influence spread problem. We focus on identifying key nodes for the Independent Cascade model, which is a popular reference method. Our main contribution is an improved process of obtaining the labels required for training by introducing ‘Smart Bins’ and proving their advantage over known methods. Next, we show that our methodology allows ML models not only to predict the influence of a given node, but also to determine other characteristics of the spreading process, which is another novelty in the relevant literature. Finally, we extensively test our framework and its ability to generalize beyond complex networks of different types and sizes, gaining important insight into the properties of these methods.
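The Independent Cascade model mentioned above is easy to simulate, and repeated simulation is a common way to obtain a node's influence score for use as a training label. The sketch below is a generic Monte Carlo estimator, not the paper's labelling pipeline (it does not implement ‘Smart Bins’). The uniform activation probability `p`, the adjacency-dict graph format, and the star graph in the usage example are assumptions for illustration.

```python
import random

def simulate_ic(graph, seeds, p, rng):
    """One Independent Cascade run; returns the number of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Each newly active node gets exactly one chance to activate each neighbour.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)

def estimate_spread(graph, seeds, p, runs=1000, seed=0):
    """Monte Carlo estimate of the expected cascade size from the given seed set."""
    rng = random.Random(seed)
    return sum(simulate_ic(graph, seeds, p, rng) for _ in range(runs)) / runs

if __name__ == "__main__":
    # Hypothetical directed star: node 0 points to ten leaves.
    star = {0: list(range(1, 11))}
    print("hub spread :", estimate_spread(star, [0], p=0.3))
    print("leaf spread:", estimate_spread(star, [1], p=0.3))
```

Ranking nodes by `estimate_spread` gives the kind of ground-truth ordering that a learned model would then be trained to predict from cheaper structural features.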

[AI-58] The Evolution and Future Perspectives of Artificial Intelligence Generated Content

Link: https://arxiv.org/abs/2412.01948
Authors: Chengzhang Zhu,Luobin Cui,Ying Tang,Jiacun Wang
Keywords: Artificial intelligence generated, rapidly advancing technology, intelligence generated content, Artificial intelligence, transforming content creation
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 16 figures

Abstract:Artificial intelligence generated content (AIGC), a rapidly advancing technology, is transforming content creation across domains, such as text, images, audio, and video. Its growing potential has attracted more and more researchers and investors to explore and expand its possibilities. This review traces AIGC’s evolution through four developmental milestones (ranging from early rule-based systems to modern transfer learning models) within a unified framework that highlights how each milestone contributes uniquely to content generation. In particular, the paper employs a common example across all milestones to illustrate the capabilities and limitations of methods within each phase, providing a consistent evaluation of AIGC methodologies and their development. Furthermore, this paper addresses critical challenges associated with AIGC and proposes actionable strategies to mitigate them. This study aims to guide researchers and practitioners in selecting and optimizing AIGC models to enhance the quality and efficiency of content creation across diverse domains.

[AI-59] The Reality of AI and Biorisk

Link: https://arxiv.org/abs/2412.01946
Authors: Aidan Peppin,Anka Reuel,Stephen Casper,Elliot Jones,Andrew Strait,Usman Anwar,Anurag Agrawal,Sayash Kapoor,Sanmi Koyejo,Marie Pellat,Rishi Bommasani,Nick Frosst,Sara Hooker
Keywords: sound theoretical threat, theoretical threat model, system increase biorisk, biorisk threat models, threat model
Categories: Artificial Intelligence (cs.AI)

Abstract:To accurately and confidently answer the question ‘could an AI model or system increase biorisk’, it is necessary to have both a sound theoretical threat model for how AI models or systems could increase biorisk and a robust method for testing that threat model. This paper provides an analysis of existing available research surrounding two AI and biorisk threat models: 1) access to information and planning via large language models (LLMs), and 2) the use of AI-enabled biological tools (BTs) in synthesizing novel biological artifacts. We find that existing studies around AI-related biorisk are nascent, often speculative in nature, or limited in terms of their methodological maturity and transparency. The available literature suggests that current LLMs and BTs do not pose an immediate risk, and more work is needed to develop rigorous approaches to understanding how future models could increase biorisks. We end with recommendations about how empirical work can be expanded to more precisely target biorisk and ensure rigor and validity of findings.

[AI-60] Approximately Optimal Search on a Higher-dimensional Sliding Puzzle

Link: https://arxiv.org/abs/2412.01937
Authors: Nono SC Merleau,Miguel O’Malley,Érika Roldán,Sayan Mukherjee
Keywords: Higher-dimensional sliding puzzles, distinctly coloured, vertices, dimensional hypercube, Higher-dimensional sliding
Categories: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 20 pages, 8 figures

Abstract:Higher-dimensional sliding puzzles are constructed on the vertices of a d-dimensional hypercube, where 2^d - l vertices are distinctly coloured. Rings with the same colours are initially set randomly on the vertices of the hypercube. The goal of the puzzle is to move each of the 2^d - l rings to pre-defined target vertices on the cube. In this setting, the k-rule constraint represents a generalisation of edge collision for the movement of colours between vertices, allowing movement only when a hypercube face of dimension k containing a ring is completely free of other rings. Starting from an initial configuration, what is the minimum number of moves needed to make ring colours match the vertex colours? An algorithm that provides us with such a number is called God’s algorithm. When such an algorithm exists, it does not have a polynomial time complexity, at least in the case of the 15-puzzle corresponding to k=1 in the cubical puzzle. This paper presents a comprehensive computational study of different scenarios of the higher-dimensional puzzle. A benchmark of three computational techniques, an exact algorithm (the A* search) and two approximately optimal search techniques (an evolutionary algorithm (EA) and reinforcement learning (RL)), is presented in this work. The experiments show that all three methods can successfully solve the puzzle of dimension three for different face dimensions and across various difficulty levels. When the dimension increases, the A* search fails, and RL and EA methods can still provide a generally acceptable solution, i.e. a distribution of a number of moves with a median value of less than 30. Overall, the EA method consistently requires less computational time, while failing in most cases to minimise the number of moves for the puzzle dimensions d=4 and d=5.
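As a rough illustration of the exact A* baseline named in the abstract above, the sketch below searches the k=1 case under one assumed reading of the rule: a ring may slide along a hypercube edge whenever the vertex at the other end is empty. The vertex encoding (integers whose bits are coordinates), the state layout (ring `i` sits at `state[i]`), and the summed Hamming-distance heuristic are choices made for this example, not details taken from the paper.

```python
import heapq

def hamming(a, b):
    """Number of hypercube edges on a shortest path between vertices a and b."""
    return bin(a ^ b).count("1")

def min_moves(d, start, target):
    """A* over ring placements on a d-cube; returns the minimum move count or None."""
    start, target = tuple(start), tuple(target)

    def heuristic(state):
        # Admissible: one move shifts one ring along one edge.
        return sum(hamming(pos, goal) for pos, goal in zip(state, target))

    best = {start: 0}
    heap = [(heuristic(start), 0, start)]
    while heap:
        _, cost, state = heapq.heappop(heap)
        if state == target:
            return cost
        if cost > best[state]:
            continue  # stale heap entry
        occupied = set(state)
        for i, pos in enumerate(state):
            for bit in range(d):
                nxt = pos ^ (1 << bit)  # neighbour across one edge
                if nxt in occupied:
                    continue
                new_state = state[:i] + (nxt,) + state[i + 1:]
                if cost + 1 < best.get(new_state, float("inf")):
                    best[new_state] = cost + 1
                    heapq.heappush(heap, (cost + 1 + heuristic(new_state), cost + 1, new_state))
    return None  # target not reachable from start

if __name__ == "__main__":
    # d = 2 (a square) with three rings and one free vertex.
    print(min_moves(2, (0, 1, 3), (2, 1, 3)))
```

The state space grows factorially with the number of rings, which is consistent with the abstract's remark that exact search stops being practical as the dimension increases.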

[AI-61] Kernel-Free Universum Quadratic Surface Twin Support Vector Machines for Imbalanced Data

Link: https://arxiv.org/abs/2412.01936
Authors: Hossein Moosaei,Milan Hladík,Ahmad Mousavi,Zheming Gao,Haojie Fu
Keywords: classes pose significant, pose significant challenges, imbalanced classes pose, classes pose, pose significant
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)

Abstract:Binary classification tasks with imbalanced classes pose significant challenges in machine learning. Traditional classifiers often struggle to accurately capture the characteristics of the minority class, resulting in biased models with subpar predictive performance. In this paper, we introduce a novel approach to tackle this issue by leveraging Universum points to support the minority class within quadratic twin support vector machine models. Unlike traditional classifiers, our models utilize quadratic surfaces instead of hyperplanes for binary classification, providing greater flexibility in modeling complex decision boundaries. By incorporating Universum points, our approach enhances classification accuracy and generalization performance on imbalanced datasets. We generated four artificial datasets to demonstrate the flexibility of the proposed methods. Additionally, we validated the effectiveness of our approach through empirical evaluations on benchmark datasets, showing superior performance compared to conventional classifiers and existing methods for imbalanced classification.

[AI-62] Cross Domain Adaptation using Adversarial networks with Cyclic loss

Link: https://arxiv.org/abs/2412.01935
Authors: Manpreet Kaur,Ankur Tomar,Srijan Mishra,Shashwat Verma
Keywords: Deep Learning methods, methods are highly, highly local, local and sensitive, Deep Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 14 figures

Abstract:Deep Learning methods are highly local and sensitive to the domain of data they are trained with. Even a slight deviation from the domain distribution affects prediction accuracy of deep networks significantly. In this work, we have investigated a set of techniques aimed at increasing accuracy of generator networks which perform translation from one domain to the other in an adversarial setting. In particular, we experimented with activations and encoder-decoder network architectures, and introduced a loss called cyclic loss to constrain the generator network so that it learns effective source-target translation. This machine learning problem is motivated by myriad applications that can be derived from domain adaptation networks, like generating labeled data from synthetic inputs in an unsupervised fashion, and using these translation networks in conjunction with the original domain network to generalize deep learning networks across domains.

[AI-63] Recurrent Neural Network on PICTURE Model

Link: https://arxiv.org/abs/2412.01933
Authors: Weihan Xu
Keywords: provide critical care, Intensive Care Units, Predicting Intensive Care, Intensive Care, Intensive Care Transfers
Categories: Artificial Intelligence (cs.AI)
Comments: University of Michigan, Senior Honor Thesis

Abstract:Intensive Care Units (ICUs) provide critical care and life support for the most severely ill and injured patients in the hospital. With the need for ICUs growing rapidly and unprecedentedly, especially during COVID-19, accurately identifying the most critical patients helps hospitals to allocate resources more efficiently and save more lives. The Predicting Intensive Care Transfers and Other Unforeseen Events (PICTURE) model predicts patient deterioration by separating those at high risk for imminent intensive care unit transfer, respiratory failure, or death from those at lower risk. This study aims to implement a deep learning model and benchmark its performance against the XGBoost model, an existing model which has competitive results on prediction.

[AI-64] ECG-SleepNet: Deep Learning-Based Comprehensive Sleep Stage Classification Using ECG Signals

Link: https://arxiv.org/abs/2412.01929
Authors: Poorya Aghaomidi,Ge Wang
Keywords: Accurate sleep stage, understanding sleep disorders, Accurate sleep, ECG signals, sleep stage classification
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 10 pages, 5 figures, 4 tables

Abstract:Accurate sleep stage classification is essential for understanding sleep disorders and improving overall health. This study proposes a novel three-stage approach for sleep stage classification using ECG signals, offering a more accessible alternative to traditional methods that often rely on complex modalities like EEG. In Stages 1 and 2, we initialize the weights of two networks, which are then integrated in Stage 3 for comprehensive classification. In the first stage, we estimate key features using Feature Imitating Networks (FINs) to achieve higher accuracy and faster convergence. The second stage focuses on identifying the N1 sleep stage through the time-frequency representation of ECG signals. Finally, the third stage integrates models from the previous stages and employs a Kolmogorov-Arnold Network (KAN) to classify five distinct sleep stages. Additionally, data augmentation techniques, particularly SMOTE, are used to enhance classification capabilities for underrepresented stages like N1. Our results demonstrate significant improvements in classification performance, with an overall accuracy of 80.79% and an overall kappa of 0.73. The model achieves specific accuracies of 86.70% for Wake, 60.36% for N1, 83.89% for N2, 84.85% for N3, and 87.16% for REM. This study emphasizes the importance of weight initialization and data augmentation in optimizing sleep stage classification with ECG signals.
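
The abstract reports both an overall accuracy and an overall kappa. As a rough illustration of how the two differ, the sketch below computes Cohen's kappa from a confusion matrix. It assumes the reported kappa is Cohen's kappa, which the abstract does not state; the function name `cohen_kappa` and the toy 2x2 matrix are hypothetical and are not taken from the paper.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: true, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal (plain accuracy).
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Expected agreement by chance, from the row and column marginals.
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Toy two-class example (made-up counts): accuracy is 0.85, chance agreement is 0.5.
toy = [[40, 10],
       [5, 45]]
kappa = cohen_kappa(toy)
```

Kappa discounts the agreement a classifier would reach by chance given the class marginals, which is why it is usually lower than accuracy on imbalanced sleep-stage data.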

[AI-65] MALT: Improving Reasoning with Multi-Agent LLM Training

Link: https://arxiv.org/abs/2412.01928
Authors: Sumeet Ramesh Motwani,Chandler Smith,Rocktim Jyoti Das,Markian Rybchuk,Philip H. S. Torr,Ivan Laptev,Fabio Pizzati,Ronald Clark,Christian Schroeder de Witt
Keywords: Enabling effective collaboration, Enabling effective, developing autonomous systems, autonomous systems capable, effective collaboration
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preliminary work

Abstract:Enabling effective collaboration among LLMs is a crucial step toward developing autonomous systems capable of solving complex problems. While LLMs are typically used as single-model generators, where humans critique and refine their outputs, the potential for jointly-trained collaborative models remains largely unexplored. Despite promising results in multi-agent communication and debate settings, little progress has been made in training models to work together on tasks. In this paper, we present a first step toward “Multi-agent LLM training” (MALT) on reasoning problems. Our approach employs a sequential multi-agent setup with heterogeneous LLMs assigned specialized roles: a generator, verifier, and refinement model iteratively solving problems. We propose a trajectory-expansion-based synthetic data generation process and a credit assignment strategy driven by joint outcome based rewards. This enables our post-training setup to utilize both positive and negative trajectories to autonomously improve each model’s specialized capabilities as part of a joint sequential system. We evaluate our approach across MATH, GSM8k, and CQA, where MALT on Llama 3.1 8B models achieves relative improvements of 14.14%, 7.12%, and 9.40% respectively over the same baseline model. This demonstrates an early advance in multi-agent cooperative capabilities for performance on mathematical and common sense reasoning questions. More generally, our work provides a concrete direction for research around multi-agent LLM training approaches.

[AI-66] Learning Aggregation Rules in Participatory Budgeting: A Data-Driven Approach

Link: https://arxiv.org/abs/2412.01864
Authors: Roy Fairstein,Dan Vilenchik,Kobi Gal
Keywords: Participatory Budgeting, allocate public funds, offers a democratic, projects through voting, democratic process
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)

Abstract:Participatory Budgeting (PB) offers a democratic process for communities to allocate public funds across various projects through voting. In practice, PB organizers face challenges in selecting aggregation rules, either because they are not familiar with the literature and the exact details of every existing rule or because no existing rule matches their expectations. This paper presents a novel data-driven approach that uses machine learning to address this challenge. By training neural networks on PB instances, our approach learns aggregation rules that balance social welfare, representation, and other societally beneficial goals. It is able to generalize from small-scale synthetic PB examples to large, real-world PB instances. It can learn existing aggregation rules and also generate new rules that adapt to diverse objectives, providing a more nuanced, compromise-driven solution for PB processes. The effectiveness of our approach is demonstrated through extensive experiments with synthetic and real-world PB data, and the approach can expand the use and deployment of PB solutions.

[AI-67] Towards Data-centric Machine Learning on Directed Graphs: a Survey

Link: https://arxiv.org/abs/2412.01849
Authors: Henan Sun,Xunkai Li,Daohan Su,Junyi Han,Rong-Hua Li,Guoren Wang
Keywords: Graph Neural Networks, Neural Networks, made significant advances, Graph Neural, made significant
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
Comments: In Progress

Abstract:In recent years, Graph Neural Networks (GNNs) have made significant advances in processing structured data. However, most of them primarily adopt a model-centric approach, which simplifies graphs by converting them into undirected formats and emphasizes model design. This approach is inherently constrained in real-world applications due to inevitable information loss in simple undirected graphs and data-driven model optimization dilemmas associated with exceeding the upper bounds of representational capacity. As a result, there has been a shift toward data-centric methods that prioritize improving graph quality and representation. Specifically, various types of graphs can be derived from naturally structured data, including heterogeneous graphs, hypergraphs, and directed graphs. Among these, directed graphs offer distinct advantages in topological systems by modeling causal relationships, and directed GNNs have been extensively studied in recent years. However, a comprehensive survey of this emerging topic is still lacking. Therefore, we aim to provide a comprehensive review of directed graph learning, with a particular focus on a data-centric perspective. Specifically, we first introduce a novel taxonomy for existing studies. Subsequently, we re-examine these methods from the data-centric perspective, with an emphasis on understanding and improving data representation. This re-examination demonstrates that a deep understanding of directed graphs and their quality plays a crucial role in model performance. Additionally, we explore the diverse applications of directed GNNs across 10+ domains, highlighting their broad applicability. Finally, we identify key opportunities and challenges within the field, offering insights that can guide future research and development in directed graph learning.

[AI-68] Zonal Architecture Development with evolution of Artificial Intelligence

Link: https://arxiv.org/abs/2412.01840
Authors: Sneha Sudhir Shetiya,Vikas Vyas,Shreyas Renukuntla
Keywords: traditional centralized architectures, distributed zonal approaches, explains how traditional, traditional centralized, transitioning to distributed
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:This paper explains how traditional centralized architectures are transitioning to distributed zonal approaches to address challenges in scalability, reliability, performance, and cost-effectiveness. The role of edge computing and neural networks in enabling sophisticated sensor fusion and decision-making capabilities for autonomous vehicles is examined. Additionally, this paper discusses the impact of zonal architectures on vehicle diagnostics, power distribution, and smart power management systems. Key design considerations for implementing effective zonal architectures are presented, along with an overview of current challenges and future directions. The objective of this paper is to provide a comprehensive understanding of how zonal architectures are shaping the future of automotive technology, particularly in the context of self-driving vehicles and artificial intelligence integration.

[AI-69] Why Reinforcement Learning in Energy Systems Needs Explanations

Link: https://arxiv.org/abs/2405.18823
Authors: Hallah Shahid Butt,Benjamin Schäfer
Keywords: economic development, increased drastically, complexity of infrastructure, infrastructure has increased
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:With economic development, the complexity of infrastructure has increased drastically. Similarly, with the shift from fossil fuels to renewable sources of energy, there is a dire need for systems that not only predict and forecast with accuracy but also help in understanding the process of prediction. Artificial intelligence and machine learning techniques have helped in finding well-performing solutions to different problems in the energy sector. However, the usage of state-of-the-art techniques like reinforcement learning is not surprisingly convincing. This paper discusses the application of reinforcement techniques in energy systems and how explanations of these models can be helpful.

[AI-70] Scaffold or Crutch? Examining College Students' Use and Views of Generative AI Tools for STEM Education

Link: https://arxiv.org/abs/2412.02653
Authors: Karen D. Wang,Zhangyang Wu,L’Nard Tufts II,Carl Wieman,Shima Salehi,Nick Haber
Keywords: Developing problem-solving competency, central to Science, STEM, Developing problem-solving, genAI tools
Subjects: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI)

Abstract:Developing problem-solving competency is central to Science, Technology, Engineering, and Mathematics (STEM) education, yet translating this priority into effective approaches to problem-solving instruction and assessment remains a significant challenge. The recent proliferation of generative artificial intelligence (genAI) tools like ChatGPT in higher education introduces new considerations about how these tools can help or hinder students’ development of STEM problem-solving competency. Our research examines these considerations by studying how and why college students use genAI tools in their STEM coursework, focusing on their problem-solving support. We surveyed 40 STEM college students from diverse U.S. institutions and 28 STEM faculty to understand instructor perspectives on effective genAI tool use and guidance in STEM courses. Our findings reveal high adoption rates and diverse applications of genAI tools among STEM students. The most common use cases include finding explanations, exploring related topics, summarizing readings, and helping with problem-set questions. The primary motivation for using genAI tools was to save time. Moreover, over half of student participants reported simply inputting problems for AI to generate solutions, potentially bypassing their own problem-solving processes. These findings indicate that despite high adoption rates, students’ current approaches to utilizing genAI tools often fall short in enhancing their own STEM problem-solving competencies. The study also explored students’ and STEM instructors’ perceptions of the benefits and risks associated with using genAI tools in STEM education. Our findings provide insights into how to guide students on appropriate genAI use in STEM courses and how to design genAI-based tools to foster students’ problem-solving competency.

[AI-71] Reinforcement learning to learn quantum states for Heisenberg scaling accuracy

Link: https://arxiv.org/abs/2412.02334
Authors: Jeongwoo Jae,Jeonghoon Hong,Jinho Choo,Yeong-Dae Kwon
Keywords: Learning quantum states, quantum information technology, quantum states, Learning quantum, quantum
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures

Abstract:Learning quantum states is a crucial task for realizing the potential of quantum information technology. Recently, neural approaches have emerged as promising methods for learning quantum states. We propose a meta-learning model that employs reinforcement learning (RL) to optimize the process of learning quantum states. For learning quantum states, our scheme trains a hardware-efficient ansatz with a black-box optimization algorithm called evolution strategy (ES). To enhance the efficiency of ES, an RL agent dynamically adjusts the hyperparameters of ES. To facilitate the RL training, we introduce an action repetition strategy inspired by curriculum learning. The RL agent significantly improves the sample efficiency of learning random quantum states and achieves infidelity scaling close to the Heisenberg limit. We show that the RL agent trained using 3-qubit states can be generalized to learning up to 5-qubit states. These results highlight the utility of RL-driven meta-learning for enhancing the efficiency and generalizability of learning quantum states. Our approach may be applicable to improving quantum control, quantum optimization, and quantum machine learning.
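
For readers unfamiliar with evolution strategies, the following is a minimal sketch of a (1+λ) ES minimizing a toy quadratic. It is only a generic illustration of black-box optimization: the objective `sphere`, the fixed step size `sigma`, and the population size `offspring` are assumptions made for this example. In the paper an RL agent adjusts the ES hyperparameters during training, and the objective is a quantum-state infidelity; neither is modeled here.

```python
import random


def sphere(x):
    """Toy objective (hypothetical stand-in for an infidelity): sum of squares."""
    return sum(v * v for v in x)


def one_plus_lambda_es(loss, x0, sigma=0.3, offspring=10, generations=200, seed=0):
    """(1+lambda) evolution strategy with a fixed mutation step size sigma."""
    rng = random.Random(seed)
    parent, parent_loss = list(x0), loss(x0)
    for _ in range(generations):
        # Sample offspring by adding Gaussian noise to the current parent.
        children = [[v + rng.gauss(0.0, sigma) for v in parent]
                    for _ in range(offspring)]
        best = min(children, key=loss)
        best_loss = loss(best)
        # Elitist selection: keep the parent unless a child is strictly better.
        if best_loss < parent_loss:
            parent, parent_loss = best, best_loss
    return parent, parent_loss


start = [3.0, -2.0]
solution, final_loss = one_plus_lambda_es(sphere, start)
```

Because selection is elitist, the loss never increases across generations. The step size `sigma` and the number of offspring are exactly the kind of hyperparameters a controller could tune online.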

[AI-72] Deep Matrix Factorization with Adaptive Weights for Multi-View Clustering

Link: https://arxiv.org/abs/2412.02292
Authors: Yasser Khalafaoui(Alteca),Basarab Matei,Martino Lovisetto(Alteca),Nistor Grozavu(CY)
Keywords: deep matrix factorization, achieving promising results, deep matrix, matrix factorization, unsupervised tasks
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recently, deep matrix factorization has been established as a powerful model for unsupervised tasks, achieving promising results, especially for multi-view clustering. However, existing methods often lack effective feature selection mechanisms and rely on empirical hyperparameter selection. To address these issues, we introduce a novel Deep Matrix Factorization with Adaptive Weights for Multi-View Clustering (DMFAW). Our method simultaneously incorporates feature selection and generates local partitions, enhancing clustering results. Notably, the feature weights are controlled and adjusted by a parameter that is dynamically updated using a control-theory-inspired mechanism, which not only improves the model’s stability and adaptability to diverse datasets but also accelerates convergence. A late fusion approach is then proposed to align the weighted local partitions with the consensus partition. Finally, the optimization problem is solved via an alternating optimization algorithm with theoretically guaranteed convergence. Extensive experiments on benchmark datasets show that DMFAW outperforms state-of-the-art methods in terms of clustering performance.

[AI-73] VR Based Emotion Recognition Using Deep Multimodal Fusion With Biosignals Across Multiple Anatomical Domains

Link: https://arxiv.org/abs/2412.02283
Authors: Pubudu L. Indrasiri,Bipasha Kashyap,Chandima Kolambahewage,Bahareh Nakisa,Kiran Ijaz,Pubudu N. Pathirana
Keywords: Meta Quest Pro, integrating multimodal biosignals, significantly enhanced, enhanced by integrating, integrating multimodal
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures

Abstract:Emotion recognition is significantly enhanced by integrating multimodal biosignals and IMU data from multiple domains. In this paper, we introduce a novel multi-scale attention-based LSTM architecture, combined with Squeeze-and-Excitation (SE) blocks, that leverages multi-domain signals from the head (Meta Quest Pro VR headset), trunk (Equivital Vest), and periphery (Empatica Embrace Plus) during affect elicitation via visual stimuli. Signals from 23 participants were recorded, alongside self-assessed valence and arousal ratings after each stimulus. LSTM layers extract features from each modality, while multi-scale attention captures fine-grained temporal dependencies, and SE blocks recalibrate feature importance prior to classification. We assess which domain’s signals carry the most distinctive emotional information during VR experiences, identifying key biosignals contributing to emotion detection. The proposed architecture, validated in a user study, demonstrates superior performance in classifying valence and arousal levels (high / low), showcasing the efficacy of multi-domain and multi-modal fusion of biosignals (e.g., TEMP, EDA) with IMU data (e.g., accelerometer) for emotion recognition in real-world applications.

[AI-74] Selective Reviews of Bandit Problems in AI via a Statistical View

Link: https://arxiv.org/abs/2412.02251
Authors: Pengjie Zhou,Haoyu Wei,Huiming Zhang
Keywords: Reinforcement Learning, widely researched area, teaching agents decision-making, widely researched, researched area
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM); Probability (math.PR)
Comments: 46 pages, 5 figures

Abstract:Reinforcement Learning (RL) is a widely researched area in artificial intelligence that focuses on teaching agents decision-making through interactions with their environment. A key subset includes stochastic multi-armed bandit (MAB) and stochastic continuum-armed bandit (SCAB) problems, which model sequential decision-making under uncertainty. This review outlines the foundational models and assumptions of bandit problems, explores non-asymptotic theoretical tools like concentration inequalities and minimax regret bounds, and compares frequentist and Bayesian algorithms for managing exploration-exploitation trade-offs. We also extend the discussion to K-armed contextual bandits and SCAB, examining their methodologies and regret analyses and discussing the relation between SCAB problems and functional data analysis. Finally, we highlight recent advances and ongoing challenges in the field.
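
As a concrete instance of the exploration-exploitation trade-off mentioned in the abstract, here is a minimal sketch of the classical UCB1 index policy on a Bernoulli bandit. UCB1 is a standard frequentist algorithm chosen for illustration; the abstract does not say which algorithms the review covers in detail. The arm means, horizon, and function name `ucb1` are assumptions for this example.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on a simulated Bernoulli bandit; return the pull count per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            # Index = empirical mean + confidence bonus sqrt(2 ln t / n_a).
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


pulls = ucb1([0.2, 0.8], horizon=2000)
```

The confidence bonus shrinks as an arm is pulled more often, so a suboptimal arm is sampled only about logarithmically often in the horizon.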

[AI-75] Implementing An Artificial Quantum Perceptron

Link: https://arxiv.org/abs/2412.02083
Authors: Ashutosh Hathidara,Lalit Pandey
Keywords: fundamental building block, neural network, fundamental building, building block, building intelligent systems
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)

Abstract:A perceptron is a fundamental building block of a neural network. The flexibility and scalability of the perceptron make it ubiquitous in building intelligent systems. Studies have shown the efficacy of a single neuron in making intelligent decisions. Here, we examined and compared two perceptrons with distinct mechanisms, and developed a quantum version of one of those perceptrons. As a part of this modeling, we implemented the quantum circuit for an artificial perceptron, generated a dataset, and simulated the training. Through these experiments, we show that there is an exponential growth advantage and test different qubit versions. Our findings show that this quantum model of an individual perceptron can be used as a pattern classifier. For the second type of model, we provide an understanding of how to design and simulate a spike-dependent quantum perceptron. Our code is available at this https URL
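
For context, the sketch below shows the classical perceptron learning rule on a tiny pattern-classification task (the logical AND of two bits). This is the textbook classical model only, not the quantum circuit from the paper; the dataset, learning rate, and function names are assumptions made for this example.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Classical perceptron rule: nudge the weights by lr * error * input."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly: stop early
            break
    return weights, bias


# Logical AND is linearly separable, so the rule converges on it.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
predictions = [predict(w, b, x) for x, _ in and_data]
```

A single threshold unit can only separate classes with a hyperplane, which is why a linearly separable target such as AND is used here.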

[AI-76] Learning a Filtered Backprojection Reconstruction Method for Photoacoustic Computed Tomography with Hemispherical Measurement Geometries

Link: https://arxiv.org/abs/2412.01971
Authors: Panpan Chen,Seonyeong Park,Refik Mert Cam,Hsuan-Kai Huang,Alexander A. Oraevsky,Umberto Villa,Mark A. Anastasio
Keywords: half-scan FBP method, photoacoustic computed tomography, hemispherical measurement apertures, half-scan FBP, FBP method
Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)

Abstract:In certain three-dimensional (3D) applications of photoacoustic computed tomography (PACT), including in vivo breast imaging, hemispherical measurement apertures that enclose the object within their convex hull are employed for data acquisition. Data acquired with such measurement geometries are referred to as half-scan data, as only half of a complete spherical measurement aperture is employed. Although previous studies have demonstrated that half-scan data can uniquely and stably reconstruct the sought-after object, no closed-form reconstruction formula for use with half-scan data has been reported. To address this, a semi-analytic reconstruction method in the form of filtered backprojection (FBP), referred to as the half-scan FBP method, is developed in this work. Because the explicit form of the filtering operation in the half-scan FBP method is not currently known, a learning-based method is proposed to approximate it. The proposed method is systematically investigated by use of virtual imaging studies of 3D breast PACT that employ ensembles of numerical breast phantoms and a physics-based model of the data acquisition process. The method is subsequently applied to experimental data acquired in an in vivo breast PACT study. The results confirm that the half-scan FBP method can accurately reconstruct 3D images from half-scan data. Importantly, because the sought-after inverse mapping is well-posed, the reconstruction method remains accurate even when applied to data that differ considerably from those employed to learn the filtering operation.

Machine Learning

[LG-0] An ADHD Diagnostic Interface Based on EEG Spectrograms and Deep Learning Techniques

Link: https://arxiv.org/abs/2412.02695
Authors: Medha Pappula,Syed Muhammad Anwar
Keywords: employing deep learning, approach to Attention-deficit, hyperactivity disorder, techniques on electroencephalography, diagnosis by employing
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Presented at SIPAIM 2024

Abstract:This paper introduces an innovative approach to attention-deficit/hyperactivity disorder (ADHD) diagnosis by employing deep learning (DL) techniques on electroencephalography (EEG) signals. This method addresses the limitations of current behavior-based diagnostic methods, which often lead to misdiagnosis and gender bias. Using a publicly available EEG dataset with the signals converted into spectrograms, a ResNet-18 convolutional neural network (CNN) architecture was used to extract features for ADHD classification. The model achieved high precision and recall and an overall F1 score of 0.9. Feature extraction highlighted significant brain regions (frontopolar, parietal, and occipital lobes) associated with ADHD. These insights guided the creation of a three-part digital diagnostic system, facilitating cost-effective and accessible ADHD screening, especially in school environments. This system enables earlier and more accurate identification of students at risk for ADHD, providing timely support to enhance their developmental outcomes. This study showcases the potential of integrating EEG analysis with DL to enhance ADHD diagnostics, presenting a viable alternative to traditional methods.

[LG-1] Interpretable Generalized Additive Models for Datasets with Missing Values NEURIPS2024

Link: https://arxiv.org/abs/2412.02646
Authors: Hayden McTavish,Jon Donnelly,Margo Seltzer,Cynthia Rudin
Keywords: important datasets, datasets contain samples, missing, sparsity
Subjects: Machine Learning (cs.LG)
Comments: Published in NeurIPS 2024

Abstract:Many important datasets contain samples that are missing one or more feature values. Maintaining the interpretability of machine learning models in the presence of such missing data is challenging. Singly or multiply imputing missing values complicates the model’s mapping from features to labels. On the other hand, reasoning on indicator variables that represent missingness introduces a potentially large number of additional terms, sacrificing sparsity. We solve these problems with M-GAM, a sparse, generalized, additive modeling approach that incorporates missingness indicators and their interaction terms while maintaining sparsity through $\ell_0$ regularization. We show that M-GAM provides similar or superior accuracy to prior methods while significantly improving sparsity relative to either imputation or naive inclusion of indicator variables.

[LG-2] The Space Complexity of Approximating Logistic Loss

Link: https://arxiv.org/abs/2412.02639
Authors: Gregory Dexter,Petros Drineas,Rajiv Khanna
Keywords: logistic regression problem, space complexity lower, approximate logistic loss, provide space complexity
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2303.14284

Abstract:We provide space complexity lower bounds for data structures that approximate logistic loss up to $\epsilon$-relative error on a logistic regression problem with data $\mathbf{X} \in \mathbb{R}^{n \times d}$ and labels $\mathbf{y} \in \{-1,1\}^d$. The space complexity of existing coreset constructions depends on a natural complexity measure $\mu_{\mathbf{y}}(\mathbf{X})$, first defined in (Munteanu, 2018). We give an $\tilde{\Omega}(\frac{d}{\epsilon^2})$ space complexity lower bound in the regime $\mu_{\mathbf{y}}(\mathbf{X}) = O(1)$ that shows existing coresets are optimal in this regime up to lower order factors. We also prove a general $\tilde{\Omega}(d \cdot \mu_{\mathbf{y}}(\mathbf{X}))$ space lower bound when $\epsilon$ is constant, showing that the dependency on $\mu_{\mathbf{y}}(\mathbf{X})$ is not an artifact of mergeable coresets. Finally, we refute a prior conjecture that $\mu_{\mathbf{y}}(\mathbf{X})$ is hard to compute by providing an efficient linear programming formulation, and we empirically compare our algorithm to prior approximate methods.
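
To make the notion of approximating logistic loss up to relative error concrete, the sketch below evaluates the standard logistic loss, the sum over samples of log(1 + exp(-y_i x_i·β)), and checks whether an approximation lies within an ε-relative error of the exact value. The summed (unnormalized) form of the loss, the toy data, and both function names are assumptions for this example; the paper's data structures and lower bounds are not reproduced.

```python
import math


def logistic_loss(X, y, beta):
    """Sum over samples of log(1 + exp(-y_i * <x_i, beta>)), computed stably."""
    total = 0.0
    for row, label in zip(X, y):
        margin = label * sum(xi * bi for xi, bi in zip(row, beta))
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            # Rewrite log(1 + e^{-m}) as -m + log(1 + e^{m}) to avoid overflow.
            total += -margin + math.log1p(math.exp(margin))
    return total


def within_relative_error(approx, exact, eps):
    """True if |approx - exact| <= eps * exact."""
    return abs(approx - exact) <= eps * exact


# Toy data (made up): three samples with two features, labels in {-1, +1}.
X = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
y = [1, -1, 1]
loss_at_zero = logistic_loss(X, y, [0.0, 0.0])
```

At β = 0 every margin is zero, so the loss equals n·ln 2. A coreset or sketch is useful only if a check like `within_relative_error` holds for every β simultaneously, which is what makes the space requirement nontrivial.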

[LG-3] Wasserstein Markets for Differentially-Private Data

Link: https://arxiv.org/abs/2412.02609
Authors: Saurab Chhachhi,Fei Teng
Keywords: increasingly vital component, decision making processes, processes across industries, increasingly vital, vital component
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
Comments: 35 pages, 15 figures

Abstract:Data is an increasingly vital component of decision making processes across industries. However, data access raises privacy concerns motivating the need for privacy-preserving techniques such as differential privacy. Data markets provide a means to enable wider access as well as determine the appropriate privacy-utility trade-off. Existing data market frameworks either require a trusted third party to perform computationally expensive valuations or are unable to capture the combinatorial nature of data value and do not endogenously model the effect of differential privacy. This paper addresses these shortcomings by proposing a valuation mechanism based on the Wasserstein distance for differentially-private data, and corresponding procurement mechanisms by leveraging incentive mechanism design theory, for task-agnostic data procurement, and task-specific procurement co-optimisation. The mechanisms are reformulated into tractable mixed-integer second-order cone programs, which are validated with numerical studies.
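
As background for the Wasserstein distance on which the valuation mechanism is based, the sketch below computes the 1-Wasserstein distance between two one-dimensional empirical samples of equal size, where it reduces to the mean absolute difference of the sorted values. This is only the simplest special case: the paper's valuation and procurement mechanisms and its treatment of differential privacy are not reproduced, and the function name is hypothetical.

```python
def wasserstein_1d(a, b):
    """1-Wasserstein distance between equal-size 1-D empirical samples."""
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    # In one dimension the optimal transport plan matches sorted values in order.
    xs, ys = sorted(a), sorted(b)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Shifting every point by 1 moves each unit of mass a distance of 1.
shifted = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
```

The distance depends only on the two distributions, not on any downstream task, which is consistent with the task-agnostic procurement setting described in the abstract.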

[LG-4] Private Linear Regression with Differential Privacy and PAC Privacy

Link: https://arxiv.org/abs/2412.02578
Authors: Hillary Yang
Keywords: linear regression methods, Linear regression, satisfy provable privacy, provable privacy guarantees, privacy-preserving linear regression
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 8 pages, 6 figures

Abstract:Linear regression is a fundamental tool for statistical analysis, which has motivated the development of linear regression methods that satisfy provable privacy guarantees so that the learned model reveals little about any one data point used to construct it. Most existing privacy-preserving linear regression methods rely on the well-established framework of differential privacy, while the newly proposed PAC Privacy has not yet been explored in this context. In this paper, we systematically compare linear regression models trained with differential privacy and PAC privacy across three real-world datasets, observing several key findings that impact the performance of privacy-preserving linear regression.

[LG-5] Fractional Order Distributed Optimization

Link: https://arxiv.org/abs/2412.02546
Authors: Andrei Lixandru,Marcel van Gerven,Sergio Pequito
Keywords: machine learning applications, modern machine learning, Distributed optimization, fundamental to modern, modern machine
Subjects: Machine Learning (cs.LG)

Abstract:Distributed optimization is fundamental to modern machine learning applications like federated learning, but existing methods often struggle with ill-conditioned problems and face stability-versus-speed tradeoffs. We introduce fractional order distributed optimization (FrODO), a theoretically grounded framework that incorporates fractional-order memory terms to enhance convergence properties in challenging optimization landscapes. Our approach achieves provable linear convergence for any strongly connected network. Our empirical results suggest that FrODO achieves up to 4 times faster convergence than baselines on ill-conditioned problems and a 2-3 times speedup in federated neural network training, while maintaining stability and theoretical guarantees.

[LG-6] On the Privacy, Security, and Trustworthy for Distributed Wireless Large AI Model (WLAM)

Link: https://arxiv.org/abs/2412.02538
Authors: Zhaohui Yang,Wei Xu,Le Liang,Yuanhao Cui,Zhijin Qin,Merouane Debbah
Keywords: Combining wireless communication, large artificial intelligence, distributed WLAM, Combining wireless, artificial intelligence
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 12 pages, 4 figures

Abstract:Combining wireless communication with large artificial intelligence (AI) models can open up a myriad of novel application scenarios. In sixth generation (6G) networks, ubiquitous communication and computing resources allow large AI models to deliver democratized large-AI-model services that enable real-time applications like autonomous vehicles, smart cities, and Internet of Things (IoT) ecosystems. However, security considerations and sustainable communication resource constraints limit the deployment of large AI models over distributed wireless networks. This paper provides a comprehensive overview of privacy, security, and trustworthiness for the distributed wireless large AI model (WLAM). In particular, a detailed privacy and security analysis for distributed WLAM is first presented. The classifications and theoretical findings about privacy and security in distributed WLAM are then discussed. Next, trustworthiness and ethics in implementing distributed WLAM are described. Finally, comprehensive applications of distributed WLAM are provided from the perspective of electromagnetic signal processing.

[LG-7] Defending Against Diverse Attacks in Federated Learning Through Consensus-Based Bi-Level Optimization

Link: https://arxiv.org/abs/2412.02535
Authors: Nicolás García Trillos,Aditya Kumar Akash,Sixu Li,Konstantin Riedl,Yuhua Zhu
Keywords: pose significant challenges, attacks pose significant, Adversarial attacks pose, malicious agents seek, machine learning applications
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA); Analysis of PDEs (math.AP)

Abstract:Adversarial attacks pose significant challenges in many machine learning applications, particularly in the setting of distributed training and federated learning, where malicious agents seek to corrupt the training process with the goal of compromising the performance and reliability of the final models. In this paper, we address the problem of robust federated learning in the presence of such attacks by formulating the training task as a bi-level optimization problem. We conduct a theoretical analysis of the resilience of consensus-based bi-level optimization (CB$^2$O), an interacting multi-particle metaheuristic optimization method, in adversarial settings. Specifically, we provide a global convergence analysis of CB$^2$O in mean-field law in the presence of malicious agents, demonstrating the robustness of CB$^2$O against a diverse range of attacks. We thereby offer insights into how specific hyperparameter choices help mitigate adversarial effects. On the practical side, we extend CB$^2$O to the clustered federated learning setting by proposing FedCB$^2$O, a novel interacting multi-particle system, and design a practical algorithm that addresses the demands of real-world applications. Extensive experiments demonstrate the robustness of the FedCB$^2$O algorithm against label-flipping attacks in decentralized clustered federated learning scenarios, showcasing its effectiveness in practical contexts.

[LG-8] CA-MoE: Channel-Adapted MoE for Incremental Weather Forecasting

Link: https://arxiv.org/abs/2412.02503
Authors: Hao Chen, Han Tao, Guo Song, Jie Zhang, Yunlong Yu, Yonghan Dong, Chuang Yang, Lei Bai
Keywords: geography and aerospace, science is intricately, intricately connected, Atmospheric science, Atmospheric
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract:Atmospheric science is intricately connected with other fields, e.g., geography and aerospace. Most existing approaches involve training a joint atmospheric and geographic model from scratch, which incurs significant computational costs and overlooks the potential for incremental learning of weather variables across different domains. In this paper, we introduce incremental learning to weather forecasting and propose a novel structure that allows for the flexible expansion of variables within the model. Specifically, our method presents a Channel-Adapted MoE (CA-MoE) that employs a divide-and-conquer strategy. This strategy assigns variable training tasks to different experts by index embedding and reduces computational complexity through a channel-wise Top-K strategy. Experiments conducted on the widely utilized ERA5 dataset reveal that our method, utilizing only approximately 15% of trainable parameters during the incremental stage, attains performance that is on par with state-of-the-art competitors. Notably, in the context of variable incremental experiments, our method demonstrates negligible issues with catastrophic forgetting.
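
To make the channel-wise Top-K idea concrete, the sketch below routes each channel (weather variable) to its k highest-scoring experts and mixes their outputs with softmax weights. It is an illustration under assumptions, not the authors' implementation: router scores are passed in directly (the paper derives routing from index embeddings), the experts are toy scalar functions, and the names `top_k_gate` and `channel_wise_moe` are hypothetical.

```python
import math


def top_k_gate(scores, k):
    """Softmax over the k largest scores; all other experts get a gate of 0."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    norm = sum(exps.values())
    return [exps[i] / norm if i in exps else 0.0 for i in range(len(scores))]


def channel_wise_moe(channel_inputs, channel_scores, experts, k):
    """Apply Top-K routing separately to every channel.

    channel_inputs[c] is the (scalar) input of channel c and
    channel_scores[c] holds one router score per expert for that channel.
    """
    outputs = []
    for x, scores in zip(channel_inputs, channel_scores):
        gates = top_k_gate(scores, k)
        # Only experts with a non-zero gate are evaluated, which is where
        # the compute saving of Top-K routing comes from.
        outputs.append(sum(g * experts[e](x) for e, g in enumerate(gates) if g > 0.0))
    return outputs


# Toy experts (assumption): each is just a scalar function.
experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x]
outputs = channel_wise_moe(
    [1.0, 3.0],
    [[5.0, 0.0, 0.0], [0.0, 2.0, 1.0]],
    experts,
    k=1,
)
```

With `k=1`, the first channel is handled only by the first expert and the second channel only by the second expert, so adding a new variable would only require scores (and possibly experts) for the new channel.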

[LG-9] The Cost of Consistency: Submodular Maximization with Constant Recourse

Link: https://arxiv.org/abs/2412.02492
Authors: Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Ola Svensson, Morteza Zadimoghaddam
Keywords: stable solution impacts, online submodular maximization, study online submodular, study online, requirement of maintaining
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In this work, we study online submodular maximization, and how the requirement of maintaining a stable solution impacts the approximation. In particular, we seek bounds on the best-possible approximation ratio that is attainable when the algorithm is allowed to make at most a constant number of updates per step. We show a tight information-theoretic bound of $\tfrac{2}{3}$ for general monotone submodular functions, and an improved (also tight) bound of $\tfrac{3}{4}$ for coverage functions. Since both these bounds are attained by non-poly-time algorithms, we also give a poly-time randomized algorithm that achieves a $0.51$-approximation. Combined with an information-theoretic hardness of $\tfrac{1}{2}$ for deterministic algorithms from prior work, our work thus shows a separation between deterministic and randomized algorithms, both information theoretically and for poly-time algorithms.
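
The setting can be made concrete with a small sketch: a coverage function (a standard example of a monotone submodular function) and a naive online rule that changes the maintained solution by at most one deletion and one insertion per arrival. This is only an illustration of "constant recourse" under a cardinality constraint, which is an assumption here; it is not one of the paper's algorithms and carries none of the stated approximation guarantees. The names `coverage` and `online_swap` are hypothetical.

```python
def coverage(sets):
    """Coverage function: number of distinct items covered by a collection of sets."""
    covered = set()
    for s in sets:
        covered |= s
    return len(covered)


def online_swap(stream, k):
    """Keep at most k sets; per arrival, make at most one deletion and one insertion."""
    solution = []
    changes_per_step = []
    for new in stream:
        changes = 0
        if len(solution) < k:
            solution.append(new)
            changes = 1
        else:
            best_value, best_idx = coverage(solution), None
            # Try replacing each kept set by the new one; keep the best strict improvement.
            for i in range(len(solution)):
                candidate = solution[:i] + solution[i + 1:] + [new]
                value = coverage(candidate)
                if value > best_value:
                    best_value, best_idx = value, i
            if best_idx is not None:
                solution.pop(best_idx)
                solution.append(new)
                changes = 2  # one deletion plus one insertion
        changes_per_step.append(changes)
    return solution, changes_per_step


stream = [{1, 2}, {2, 3}, {1, 2, 3, 4}, {5, 6}, {4}]
solution, changes = online_swap(stream, k=2)
```

On this stream the rule ends with the sets `{1, 2, 3, 4}` and `{5, 6}`, covering six items, and never makes more than two changes in a step. The paper asks how good such bounded-change algorithms can be in the worst case.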

[LG-10] Vector Optimization with Gaussian Process Bandits

Link: https://arxiv.org/abs/2412.02484
Authors: İlter Onat Korkmaz, Yaşar Cahit Yıldırım, Çağın Ararat, Cem Tekin
Keywords: Learning problems, multiple conflicting objectives, including engineering, drug design, environmental management
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Abstract:Learning problems in which multiple conflicting objectives must be considered simultaneously often arise in various fields, including engineering, drug design, and environmental management. Traditional methods for dealing with multiple black-box objective functions, such as scalarization and identification of the Pareto set under the componentwise order, have limitations in incorporating objective preferences and exploring the solution space accordingly. While vector optimization offers improved flexibility and adaptability via specifying partial orders based on ordering cones, current techniques designed for sequential experiments either suffer from high sample complexity or lack theoretical guarantees. To address these issues, we propose Vector Optimization with Gaussian Process (VOGP), a probably approximately correct adaptive elimination algorithm that performs black-box vector optimization using Gaussian process bandits. VOGP allows users to convey objective preferences through ordering cones while performing efficient sampling by exploiting the smoothness of the objective function, resulting in a more effective optimization process that requires fewer evaluations. We establish theoretical guarantees for VOGP and derive information gain-based and kernel-specific sample complexity bounds. We also conduct experiments on both real-world and synthetic datasets to compare VOGP with the state-of-the-art methods.
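
The role of an ordering cone can be illustrated with a short sketch: for a polyhedral cone C = {v : W v >= 0}, a point q dominates p when q - p lies in C, and the Pareto set contains the points that no other point dominates. The code below is a noise-free illustration under assumptions, not the VOGP algorithm: there is no Gaussian process, sampling, or elimination, and the names `in_cone` and `pareto_set` as well as the example cone are hypothetical.

```python
def in_cone(W, v, tol=1e-12):
    """True if v lies in the polyhedral cone {v : W v >= 0}."""
    return all(sum(w * c for w, c in zip(row, v)) >= -tol for row in W)


def pareto_set(points, W):
    """Points that are not dominated by any other point under the cone order."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            j != i and q != p and in_cone(W, [a - b for a, b in zip(q, p)])
            for j, q in enumerate(points)
        )
        if not dominated:
            front.append(p)
    return front


points = [(0.0, 3.0), (2.0, 2.0), (3.0, 0.0), (1.0, 1.0)]

# Positive orthant: the usual componentwise order.
orthant = [[1.0, 0.0], [0.0, 1.0]]
# A wider cone (assumption): tolerates a small loss in one objective
# in exchange for a larger gain in the other.
wide = [[1.0, 0.5], [0.5, 1.0]]

front_orthant = pareto_set(points, orthant)
front_wide = pareto_set(points, wide)
```

Under the componentwise order three points are Pareto optimal, while under the wider cone only the balanced point `(2.0, 2.0)` remains. Widening the cone makes more pairs comparable, which is how the cone encodes a preference over trade-offs.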

[LG-11] What should a neuron aim for? Designing local objective functions based on information theory

Link: https://arxiv.org/abs/2412.02482
Authors: Andreas C. Schneider, Valentin Neuhaus, David A. Ehrlich, Abdullah Makkeh, Alexander S. Ecker, Viola Priesemann, Michael Wibral
Keywords: deep neural networks, modern deep neural, neural networks, modern deep, deep neural
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 24 pages, 11 figures

Abstract:In modern deep neural networks, the learning dynamics of the individual neurons is often obscure, as the networks are trained via global optimization. Conversely, biological systems build on self-organized, local learning, achieving robustness and efficiency with limited global information. We here show how self-organization between individual artificial neurons can be achieved by designing abstract bio-inspired local learning goals. These goals are parameterized using a recent extension of information theory, Partial Information Decomposition (PID), which decomposes the information that a set of information sources holds about an outcome into unique, redundant and synergistic contributions. Our framework enables neurons to locally shape the integration of information from various input classes, i.e. feedforward, feedback, and lateral, by selecting which of the three inputs should contribute uniquely, redundantly or synergistically to the output. This selection is expressed as a weighted sum of PID terms, which, for a given problem, can be directly derived from intuitive reasoning or via numerical optimization, offering a window into understanding task-relevant local information processing. Achieving neuron-level interpretability while enabling strong performance using local learning, our work advances a principled information-theoretic foundation for local learning strategies.

[LG-12] COMET: Combined Matrix for Elucidating Targets

Link: https://arxiv.org/abs/2412.02471
Authors: Haojie Wang, Zhe Zhang, Haotian Gao, Xiangying Zhang, Zhihang Chen, Xinchong Chen, Yifei Qi, Yan Li, Renxiao Wang
Keywords: pharmacological effects, foundational element, element for deciphering, deciphering their pharmacological, COMET
Categories: Machine Learning (cs.LG)

Abstract:Identifying the interaction targets of bioactive compounds is a foundational element for deciphering their pharmacological effects. Target prediction algorithms equip researchers with an effective tool to rapidly scope and explore potential targets. Here, we introduce COMET, a multi-technological modular target prediction tool that provides comprehensive predictive insights, including similar active compounds, three-dimensional predicted binding modes, and probability scores, all within an average processing time of less than 10 minutes per task. With meticulously curated data, the COMET database encompasses 990,944 drug-target interaction pairs and 45,035 binding pockets, enabling predictions for 2,685 targets, which span confirmed and exploratory therapeutic targets for human diseases. In comparative testing using datasets from ChEMBL and BindingDB, COMET outperformed five other well-known algorithms, offering nearly an 80% probability of accurately identifying at least one true target within the top 15 predictions for a given compound. COMET also features a user-friendly, freely accessible web server (the link is given in the arXiv abstract).

[LG-13] Improved Localized Machine Unlearning Through the Lens of Memorization

Link: https://arxiv.org/abs/2412.02432
Authors: Reihaneh Torkzadehmahani, Reza Nasirigerdeh, Georgios Kaissis, Daniel Rueckert, Gintare Karolina Dziugaite, Eleni Triantafillou
Keywords: machine learning model, Machine unlearning refers, machine learning, Machine, Machine unlearning
Categories: Machine Learning (cs.LG)

Abstract:Machine unlearning refers to removing the influence of a specified subset of training data from a machine learning model, efficiently, after it has already been trained. This is important for key applications, including making the model more accurate by removing outdated, mislabeled, or poisoned data. In this work, we study localized unlearning, where the unlearning algorithm operates on a (small) identified subset of parameters. Drawing inspiration from the memorization literature, we propose an improved localization strategy that yields strong results when paired with existing unlearning algorithms. We also propose a new unlearning algorithm, Deletion by Example Localization (DEL), that resets the parameters deemed-to-be most critical according to our localization strategy, and then finetunes them. Our extensive experiments on different datasets, forget sets and metrics reveal that DEL sets a new state-of-the-art for unlearning metrics, against both localized and full-parameter methods, while modifying a small subset of parameters, and outperforms the state-of-the-art localized unlearning in terms of test accuracy too.

[LG-14] Time-Series-Informed Closed-loop Learning for Sequential Decision Making and Control

Link: https://arxiv.org/abs/2412.02423
Authors: Sebastian Hirt, Lukas Theiner, Rolf Findeisen
Keywords: model predictive control, predictive control, depends strongly, cost functions, model predictive
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 12 pages, 3 figures, submitted to L4DC 2025

Abstract:Closed-loop performance of sequential decision making algorithms, such as model predictive control, depends strongly on the parameters of cost functions, models, and constraints. Bayesian optimization is a common approach to learning these parameters based on closed-loop experiments. However, traditional Bayesian optimization approaches treat the learning problem as a black box, ignoring valuable information and knowledge about the structure of the underlying problem, resulting in slow convergence and high experimental resource use. We propose a time-series-informed optimization framework that incorporates intermediate performance evaluations from early iterations of each experimental episode into the learning procedure. Additionally, probabilistic early stopping criteria are proposed to terminate unpromising experiments, significantly reducing experimental time. Simulation results show that our approach achieves baseline performance with approximately half the resources. Moreover, with the same resource budget, our approach outperforms the baseline in terms of final closed-loop performance, highlighting its efficiency in sequential decision making scenarios.

[LG-15] Leveraging Ensemble-Based Semi-Supervised Learning for Illicit Account Detection in Ethereum DeFi Transactions

Link: https://arxiv.org/abs/2412.02408
Authors: Shabnam Fazliani, Mohammad Mowlavi Sorond, Arsalan Masoudifard
Keywords: Decentralized Finance, offering substantial rewards, Ethereum blockchain, rise of Decentralized, offering substantial
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG); General Finance (q-fin.GN)
Comments: 12 pages, 6 figures

Abstract:The advent of smart contracts has enabled the rapid rise of Decentralized Finance (DeFi) on the Ethereum blockchain, offering substantial rewards in financial innovation and inclusivity. However, this growth has also introduced significant security risks, including the proliferation of illicit accounts involved in fraudulent activities. Traditional detection methods are limited by the scarcity of labeled data and the evolving tactics of malicious actors. In this paper, we propose a novel Self-Learning Ensemble-based Illicit account Detection (SLEID) framework to address these challenges. SLEID employs an Isolation Forest for initial outlier detection and a self-training mechanism to iteratively generate pseudo-labels for unlabeled accounts, thereby enhancing detection accuracy. Extensive experiments demonstrate that SLEID significantly outperforms traditional supervised approaches and recent semi-supervised models, achieving superior precision, recall, and F1-scores, particularly in detecting illicit accounts. Compared to state-of-the-art methods, our approach achieves better detection performance while reducing reliance on labeled data. The results affirm SLEID’s efficacy as a robust solution for safeguarding the DeFi ecosystem and mitigating risks posed by malicious accounts.

[LG-16] LoRA Diffusion: Zero-Shot LoRA Synthesis for Diffusion Model Personalization

Link: https://arxiv.org/abs/2412.02352
Authors: Ethan Smith, Rami Seid, Alberto Hojel, Paramita Mishra, Jianbo Wu
Keywords: methods provide low-memory, Low-Rank Adaptation, provide low-memory, storage-efficient solutions, solutions for personalizing
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 6 figures

Abstract:Low-Rank Adaptation (LoRA) and other parameter-efficient fine-tuning (PEFT) methods provide low-memory, storage-efficient solutions for personalizing text-to-image models. However, these methods offer little to no improvement in wall-clock training time or the number of steps needed for convergence compared to full model fine-tuning. While PEFT methods assume that shifts in generated distributions (from base to fine-tuned models) can be effectively modeled through weight changes in a low-rank subspace, they fail to leverage knowledge of common use cases, which typically focus on capturing specific styles or identities. Observing that desired outputs often comprise only a small subset of the possible domain covered by LoRA training, we propose reducing the search space by incorporating a prior over regions of interest. We demonstrate that training a hypernetwork model to generate LoRA weights can achieve competitive quality for specific domains while enabling near-instantaneous conditioning on user input, in contrast to traditional training methods that require thousands of steps.

[LG-17] Federated Analytics in Practice: Engineering for Privacy Scalability and Practicality

Link: https://arxiv.org/abs/2412.02340
Authors: Harish Srinivas, Graham Cormode, Mehrdad Honarkhah, Samuel Lurye, Jonathan Hehir, Lunwen He, George Hong, Ahmed Magdy, Dzmitry Huba, Kaikai Wang, Shen Guo, Shoubhik Bhattacharya
Keywords: distributed computation paradigm, computation paradigm designed, Cross-device Federated Analytics, answer analytics queries, data held locally
Categories: Machine Learning (cs.LG)

Abstract:Cross-device Federated Analytics (FA) is a distributed computation paradigm designed to answer analytics queries about and derive insights from data held locally on users’ devices. On-device computations combined with other privacy and security measures ensure that only minimal data is transmitted off-device, achieving a high standard of data protection. Despite FA’s broad relevance, the applicability of existing FA systems is limited by compromised accuracy; lack of flexibility for data analytics; and an inability to scale effectively. In this paper, we describe our approach to combine privacy, scalability, and practicality to build and deploy a system that overcomes these limitations. Our FA system leverages trusted execution environments (TEEs) and optimizes the use of on-device computing resources to facilitate federated data processing across large fleets of devices, while ensuring robust, defensible, and verifiable privacy safeguards. We focus on federated analytics (statistics and monitoring), in contrast to systems for federated learning (ML workloads), and we flag the key differences.

[LG-18] An Adaptive Grasping Force Tracking Strategy for Nonlinear and Time-Varying Object Behaviors

Link: https://arxiv.org/abs/2412.02335
Authors: Ziyang Cheng, Xiangyu Tian, Ruomin Sui, Tiemin Li, Yao Jiang
Keywords: Accurate grasp force, Accurate grasp, grasp force control, force tracking, damage-free robotic grasping
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Accurate grasp force control is one of the key skills for ensuring successful and damage-free robotic grasping of objects. Although existing methods have conducted in-depth research on slip detection and grasping force planning, they often overlook the issue of adaptive tracking of the actual force to the target force when handling objects with different material properties. The optimal parameters of a force tracking controller are significantly influenced by the object’s stiffness, and many adaptive force tracking algorithms rely on stiffness estimation. However, real-world objects often exhibit viscous, plastic, or other more complex nonlinear time-varying behaviors, and existing studies provide insufficient support for these materials in terms of stiffness definition and estimation. To address this, this paper introduces the concept of generalized stiffness, extending the definition of stiffness to nonlinear time-varying grasp system models, and proposes an online generalized stiffness estimator based on Long Short-Term Memory (LSTM) networks. Based on generalized stiffness, this paper proposes an adaptive parameter adjustment strategy using a PI controller as an example, enabling dynamic force tracking for objects with varying characteristics. Experimental results demonstrate that the proposed method achieves high precision and short probing time, while showing better adaptability to non-ideal objects compared to existing methods. The method effectively solves the problem of grasp force tracking in unknown, nonlinear, and time-varying grasp systems, enhancing the robotic grasping ability in unstructured environments.
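
The dependence of good controller gains on stiffness can be seen in a toy simulation. The sketch below tracks a target force with a discrete PI controller whose gains are divided by a stiffness estimate, so the closed-loop behavior stays the same for soft and stiff objects. Everything here is an assumption for illustration: the contact is a linear spring (the paper targets nonlinear, time-varying behavior), the stiffness estimate is supplied directly instead of coming from an LSTM estimator, the 1/stiffness gain rule is not claimed to be the paper's adjustment strategy, and `track_force` is a hypothetical name.

```python
def track_force(stiffness, stiffness_estimate, target=5.0, steps=1000, dt=0.01,
                kp_base=20.0, ki_base=50.0):
    """Simulate PI force tracking on a linear-spring contact; return the final force."""
    # Gain scheduling (assumption): scale both gains by 1 / estimated stiffness.
    kp = kp_base / stiffness_estimate
    ki = ki_base / stiffness_estimate
    position, integral = 0.0, 0.0
    for _ in range(steps):
        force = stiffness * position           # linear-spring contact model
        error = target - force
        integral += error * dt
        velocity = kp * error + ki * integral  # PI law outputs a finger velocity command
        position += velocity * dt
    return stiffness * position


# With a correct stiffness estimate, soft and stiff objects behave alike.
soft = track_force(stiffness=100.0, stiffness_estimate=100.0)
stiff = track_force(stiffness=10000.0, stiffness_estimate=10000.0)

# Gains tuned for the soft object but applied to the stiff one destabilize the loop.
mismatched = track_force(stiffness=10000.0, stiffness_estimate=100.0, steps=20)
```

In this toy model both scheduled runs settle at the 5.0 target, while the mismatched run diverges within a few steps, which is the kind of sensitivity that motivates estimating stiffness online.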

[LG-19] Efficient Model Compression Techniques with FishLeg NEURIPS2024

Link: https://arxiv.org/abs/2412.02328
Authors: Jamie McGowan, Wei Sheng Lai, Weibin Chen, Henry Aldridge, Jools Clarke, Jezabel Garcia, Rui Xia, Yilei Liang, Guillaume Hennequin, Alberto Bernacchia
Keywords: limited computational resources, players with limited, second-order pruning method, computational resources, Fisher information matrix
Categories: Machine Learning (cs.LG)
Comments: Published in NeurIPS 2024 - Neural Compression Workshop, 13 pages, 6 figures

Abstract:In many domains, the most successful AI models tend to be the largest, indeed often too large to be handled by AI players with limited computational resources. To mitigate this, a number of compression methods have been developed, including methods that prune the network down to high sparsity whilst retaining performance. The best-performing pruning techniques are often those that use second-order curvature information (such as an estimate of the Fisher information matrix) to score the importance of each weight and to predict the optimal compensation for weight deletion. However, these methods are difficult to scale to high-dimensional parameter spaces without making heavy approximations. Here, we propose the FishLeg surgeon (FLS), a new second-order pruning method based on the Fisher-Legendre (FishLeg) optimizer. At the heart of FishLeg is a meta-learning approach to amortising the action of the inverse FIM, which brings a number of advantages. Firstly, the parameterisation enables the use of flexible tensor factorisation techniques to improve computational and memory efficiency without sacrificing much accuracy, alleviating challenges associated with scalability of most second-order pruning methods. Secondly, directly estimating the inverse FIM leads to less sensitivity to the amplification of stochasticity during inversion, thereby resulting in more precise estimates. Thirdly, our approach also allows for progressive assimilation of the curvature into the parameterisation. In the gradual pruning regime, this results in a more efficient estimate refinement as opposed to re-estimation. We find that FishLeg achieves higher or comparable performance against two common baselines in the area, most notably in the high sparsity regime when considering a ResNet18 model on CIFAR-10 (84% accuracy at 95% sparsity vs 60% for OBS) and TinyIM (53% accuracy at 80% sparsity vs 48% for OBS).
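
For readers unfamiliar with second-order pruning, the classical Optimal Brain Surgeon (OBS) expressions show why the inverse Fisher information matrix (FIM) is the key quantity. The block below states the standard OBS importance score and compensating update with the FIM $F$ used as the curvature estimate; $w_i$ is the weight being removed and $e_i$ the $i$-th unit vector. This is textbook OBS given as background, not the FishLeg surgeon's specific parameterization of $F^{-1}$.

```latex
% Importance (saliency) of removing weight w_i
\rho_i = \frac{w_i^{2}}{2\,[F^{-1}]_{ii}}

% Compensating update applied to the remaining weights
\delta w = -\frac{w_i}{[F^{-1}]_{ii}}\, F^{-1} e_i
```

Both expressions need entries or columns of $F^{-1}$ rather than $F$ itself, which is why an optimizer that estimates the inverse FIM directly is a natural fit for this kind of pruning.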

[LG-20] Optimizing Plastic Waste Collection in Water Bodies Using Heterogeneous Autonomous Surface Vehicles with Deep Reinforcement Learning

Link: https://arxiv.org/abs/2412.02316
Authors: Alejandro Mendoza Barrionuevo, Samuel Yanes Luis, Daniel Gutiérrez Reina, Sergio L. Toral Marín
Keywords: informative path planning, autonomous surface vehicles, model-free deep reinforcement, paper presents, presents a model-free
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: This article is currently under revision for the Robotics and Automation Letters (IEEE)

Abstract:This paper presents a model-free deep reinforcement learning framework for informative path planning with heterogeneous fleets of autonomous surface vehicles to locate and collect plastic waste. The system employs two teams of vehicles: scouts and cleaners. Coordination between these teams is achieved through a deep reinforcement learning approach, allowing agents to learn strategies to maximize cleaning efficiency. The primary objective is for the scout team to provide an up-to-date contamination model, while the cleaner team collects as much waste as possible following this model. This strategy leads to heterogeneous teams that optimize fleet efficiency through inter-team cooperation supported by a tailored reward function. Differently trained variants of the proposed algorithm are compared with other state-of-the-art heuristics in two distinct scenarios, one with high convexity and another with narrow corridors and challenging access. The results show that deep reinforcement learning based algorithms outperform other benchmark heuristics, exhibiting superior adaptability. In addition, training with greedy actions further enhances performance, particularly in scenarios with intricate layouts.

[LG-21] Learn More by Using Less: Distributed Learning with Energy-Constrained Devices

Link: https://arxiv.org/abs/2412.02289
Authors: Roberto Pereira, Cristian J. Vaca-Rubio, Luis Blanco
Keywords: Federated Learning, constrain real-world implementations, solution for distributed, capacities of participating, distributed model training
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)

Abstract:Federated Learning (FL) has emerged as a solution for distributed model training across decentralized, privacy-preserving devices, but the different energy capacities of participating devices (system heterogeneity) constrain real-world implementations. These energy limitations not only reduce model accuracy but also increase dropout rates, which hurts convergence in practical FL deployments. In this work, we propose LeanFed, an energy-aware FL framework designed to optimize client selection and training workloads on battery-constrained devices. LeanFed leverages adaptive data usage by dynamically adjusting the fraction of local data each device utilizes during training, thereby maximizing device participation across communication rounds while ensuring they do not run out of battery during the process. We rigorously evaluate LeanFed against traditional FedAvg on CIFAR-10 and CIFAR-100 datasets, simulating various levels of data heterogeneity and device participation rates. Results show that LeanFed consistently enhances model accuracy and stability, particularly in settings with high data heterogeneity and limited battery life, by mitigating client dropout and extending device availability. This approach demonstrates the potential of energy-efficient, privacy-preserving FL in real-world, large-scale applications, setting a foundation for robust and sustainable pervasive AI on resource-constrained networks.
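
The idea of adjusting the fraction of local data to a device's battery can be sketched with a simple budget rule. The function below picks the largest fraction of local samples a device can train on in every remaining round without exhausting its energy budget. The linear energy model, the rule itself, and all names (`data_fraction`, `joules_per_sample`, and so on) are assumptions for illustration and are not taken from LeanFed.

```python
def data_fraction(battery_joules, num_samples, joules_per_sample, remaining_rounds):
    """Largest per-round data fraction that keeps the device within its energy budget.

    Assumes (hypothetically) that training energy grows linearly with the
    number of samples processed and is the same in every round.
    """
    full_cost = num_samples * joules_per_sample * remaining_rounds
    if full_cost <= 0:
        return 1.0  # nothing left to spend energy on
    return min(1.0, battery_joules / full_cost)


# A device that can afford only half of the full workload uses half of its data.
constrained = data_fraction(250.0, num_samples=100, joules_per_sample=0.5, remaining_rounds=10)
# A device with ample battery trains on all of its data.
unconstrained = data_fraction(1000.0, num_samples=100, joules_per_sample=0.5, remaining_rounds=10)
```

Under this rule a weak device stays in every round with a reduced workload instead of dropping out early, which matches the participation goal described in the abstract.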

[LG-22] Step-by-Step Guidance to Differential Anemia Diagnosis with Real-World Data and Deep Reinforcement Learning

Link: https://arxiv.org/abs/2412.02273
Authors: Lillian Muyama, Estelle Lu, Geoffrey Cheminet, Jacques Pouchot, Bastien Rance, Anne-Isabelle Tropeano, Antoine Neuraz, Adrien Coulet
Keywords: Clinical diagnostic guidelines, Clinical diagnostic, outline the key, key questions, questions to answer
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2404.05913

Abstract:Clinical diagnostic guidelines outline the key questions to answer to reach a diagnosis. Inspired by guidelines, we aim to develop a model that learns from electronic health records to determine the optimal sequence of actions for accurate diagnosis. Focusing on anemia and its sub-types, we employ deep reinforcement learning (DRL) algorithms and evaluate their performance on both a synthetic dataset, which is based on expert-defined diagnostic pathways, and a real-world dataset. We investigate the performance of these algorithms across various scenarios. Our experimental results demonstrate that DRL algorithms perform competitively with state-of-the-art methods while offering the significant advantage of progressively generating pathways to the suggested diagnosis, providing a transparent decision-making process that can guide and explain diagnostic reasoning.

[LG-23] BOTracle: A framework for Discriminating Bots and Humans ESORICS

Link: https://arxiv.org/abs/2412.02266
Authors: Jan Kadel, August See, Ritwik Sinha, Mathias Fischer
Keywords: portion of Internet, multiple domains, Internet traffic, constitute a significant, significant portion
Categories: Machine Learning (cs.LG)
Comments: Bot Detection; User Behaviour Analysis; Published at ESORICS International Workshops 2024

Abstract:Bots constitute a significant portion of Internet traffic and are a source of various issues across multiple domains. Modern bots often become indistinguishable from real users, as they employ similar methods to browse the web, including using real browsers. We address the challenge of bot detection in high-traffic scenarios by analyzing three distinct detection methods. The first method operates on heuristics, allowing for rapid detection. The second method utilizes well-known technical features, such as IP address, window size, and user agent. It serves primarily for comparison with the third method. In the third method, we rely solely on browsing behavior, omitting all static features and focusing exclusively on how clients behave on a website. In contrast to related work, we evaluate our approaches using real-world e-commerce traffic data, comprising 40 million monthly page visits. We further compare our methods against another bot detection approach, Botcha, on the same dataset. Our performance metrics, including precision, recall, and AUC, reach 98 percent or higher, surpassing Botcha.

[LG-24] Technical Report on Reinforcement Learning Control on the Lucas-Nülle Inverted Pendulum

Link: https://arxiv.org/abs/2412.02264
Authors: Maximilian Schenke, Shalbus Bukarov
Keywords: discipline of automatic, domain of machine, making increased, machine learning, sequential decision making
Categories: Systems and Control (eess.SY); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:The discipline of automatic control is making increased use of concepts that originate from the domain of machine learning. Herein, reinforcement learning (RL) takes an elevated role, as it is inherently designed for sequential decision making, and can be applied to optimal control problems without the need for a plant system model. To advance education of control engineers and operators in this field, this contribution targets an RL framework that can be applied to educational hardware provided by the Lucas-Nülle company. Specifically, the goal of inverted pendulum control is pursued by means of RL, including both swing-up and stabilization within a single holistic design approach. Herein, the actual learning is enabled by separating corresponding computations from the real-time control computer and outsourcing them to separate hardware. This distributed architecture, however, necessitates communication between the involved components, which is realized via CAN bus. The experimental proof of concept is presented with an applied safeguarding algorithm that prevents the plant from being operated harmfully during the trial-and-error training phase.

[LG-25] On Simplifying Large-Scale Spatial Vectors: Fast Memory-Efficient and Cost-Predictable k-means

Link: https://arxiv.org/abs/2412.02244
Authors: Yushuai Ji, Zepeng Liu, Sheng Wang, Yuan Sun, Zhiyong Peng
Keywords: support fast analytics, point clouds, simplify large-scale spatial, analytics and learning, k-means
Categories: Machine Learning (cs.LG)

Abstract:The k-means algorithm can simplify large-scale spatial vectors, such as 2D geo-locations and 3D point clouds, to support fast analytics and learning. However, when processing large-scale datasets, existing k-means algorithms have been developed to achieve high performance with significant computational resources, such as memory and CPU usage time. These algorithms, though effective, are not well-suited for resource-constrained devices. In this paper, we propose a fast, memory-efficient, and cost-predictable k-means called Dask-means. We first accelerate k-means by designing a memory-efficient accelerator, which utilizes an optimized nearest neighbor search over a memory-tunable index to assign spatial vectors to clusters in batches. We then design a lightweight cost estimator to predict the memory cost and runtime of the k-means task, allowing it to request appropriate memory from devices or adjust the accelerator’s required space to meet memory constraints, and ensure sufficient CPU time for running k-means. Experiments show that when simplifying datasets at a scale of around $10^6$ vectors, Dask-means uses less than 30 MB of memory and achieves a speedup of more than 168 times over the widely used Lloyd’s algorithm. We also validate Dask-means on mobile devices, where it demonstrates significant speedup and low memory cost compared to other state-of-the-art (SOTA) k-means algorithms. Our cost estimator estimates the memory cost with a difference of less than 3% from the actual ones and predicts runtime with an MSE up to 33.3% lower than SOTA methods.
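
For reference, Lloyd's algorithm, the baseline mentioned in the abstract, alternates between assigning each vector to its nearest centroid and recomputing each centroid as the mean of its members. The sketch below is a plain, unoptimized version in pure Python, meant only to show what is being accelerated; it contains none of Dask-means' indexing, batching, or cost estimation, and the function name `lloyd_kmeans` is hypothetical.

```python
def lloyd_kmeans(points, centroids, max_iters=100):
    """Plain Lloyd's algorithm; returns (centroids, assignment)."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iters):
        # Assignment step: every point scans all centroids (the costly part).
        assignment = [
            min(
                range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers, labels = lloyd_kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The assignment step compares every point with every centroid in each iteration, which is the cost that a nearest neighbor search over an index, as described in the abstract, is meant to reduce.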

[LG-26] ESA: Example Sieve Approach for Multi-Positive and Unlabeled Learning

Link: https://arxiv.org/abs/2412.02240
Authors: Zhongnian Li, Meng Wei, Peng Ying, Xinzheng Xu
Keywords: gradually attracted significant, attracted significant attention, Learning from Multi-Positive, Multi-Positive and Unlabeled, data has gradually
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 6 figures

Abstract:Learning from Multi-Positive and Unlabeled (MPU) data has gradually attracted significant attention from practical applications. Unfortunately, the MPU risk also suffers from a shift of the minimum risk, particularly when the models are very flexible. In this paper, to alleviate the problem of the shifting minimum risk, we propose an Example Sieve Approach (ESA) to select examples for training a multi-class classifier. Specifically, we sieve out some examples by utilizing the Certain Loss (CL) value of each example in the training stage and analyze the consistency of the proposed risk estimator. Besides, we show that the estimation error of the proposed ESA attains the optimal parametric convergence rate. Extensive experiments on various real-world datasets show that the proposed approach outperforms previous methods.

[LG-27] Learning from Concealed Labels

Link: https://arxiv.org/abs/2412.02230
Authors: Zhongnian Li, Meng Wei, Peng Ying, Tongfeng Sun, Xinzheng Xu
Keywords-EN: Annotating data, concealed labels, poses a potential, labels, potential threats
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 2 figures

Abstract: Annotating data for sensitive labels (e.g., disease, smoking) poses a potential threat to individual privacy in many real-world scenarios. To cope with this problem, we propose a novel setting to protect the privacy of each instance, namely learning from concealed labels for multi-class classification. Concealed labels prevent sensitive labels from appearing in the label set during the label collection stage: sensitive data are annotated with a concealed label set that consists of "none" and some randomly sampled insensitive labels. In this paper, we show that an unbiased estimator can be established from concealed data under mild assumptions, and that the learned multi-class classifier can not only classify instances from insensitive labels accurately but also recognize instances from the sensitive labels. Moreover, we bound the estimation error and show that the multi-class classifier achieves the optimal parametric convergence rate. Experiments demonstrate the significance and effectiveness of the proposed method for concealed labels on synthetic and real-world datasets.

[LG-28] An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction

Link: https://arxiv.org/abs/2412.02211
Authors: Yaxin Liang, Xinshi Li, Xin Huang, Ziqi Zhang, Yue Yao
Keywords-EN: automated data mining, mining framework based, data dimensionality reduction, study proposes, proposes an automated
Categories: Machine Learning (cs.LG)

Abstract:This study proposes an automated data mining framework based on autoencoders and experimentally verifies its effectiveness in feature extraction and data dimensionality reduction. Through the encoding-decoding structure, the autoencoder can capture the data’s potential characteristics and achieve noise reduction and anomaly detection, providing an efficient and stable solution for the data mining process. The experiment compared the performance of the autoencoder with traditional dimensionality reduction methods (such as PCA, FA, T-SNE, and UMAP). The results showed that the autoencoder performed best in terms of reconstruction error and root mean square error and could better retain data structure and enhance the generalization ability of the model. The autoencoder-based framework not only reduces manual intervention but also significantly improves the automation of data processing. In the future, with the advancement of deep learning and big data technology, the autoencoder method combined with a generative adversarial network (GAN) or graph neural network (GNN) is expected to be more widely used in the fields of complex data processing, real-time data analysis and intelligent decision-making.

[LG-29] SA-GNAS: Seed Architecture Expansion for Efficient Large-scale Graph Neural Architecture Search

Link: https://arxiv.org/abs/2412.02196
Authors: Guanghui Zhu, Zipeng Ji, Jingyan Chen, Limin Wang, Chunfeng Yuan, Yihua Huang
Keywords-EN: optimal graph neural, Graph Neural Architecture, multiple downstream tasks, Neural Architecture Search, demonstrated great effectiveness
Categories: Machine Learning (cs.LG)

Abstract:GNAS (Graph Neural Architecture Search) has demonstrated great effectiveness in automatically designing the optimal graph neural architectures for multiple downstream tasks, such as node classification and link prediction. However, most existing GNAS methods cannot efficiently handle large-scale graphs containing more than million-scale nodes and edges due to the expensive computational and memory overhead. To scale GNAS on large graphs while achieving better performance, we propose SA-GNAS, a novel framework based on seed architecture expansion for efficient large-scale GNAS. Similar to the cell expansion in biotechnology, we first construct a seed architecture and then expand the seed architecture iteratively. Specifically, we first propose a performance ranking consistency-based seed architecture selection method, which selects the architecture searched on the subgraph that best matches the original large-scale graph. Then, we propose an entropy minimization-based seed architecture expansion method to further improve the performance of the seed architecture. Extensive experimental results on five large-scale graphs demonstrate that the proposed SA-GNAS outperforms human-designed state-of-the-art GNN architectures and existing graph NAS methods. Moreover, SA-GNAS can significantly reduce the search time, showing better search efficiency. For the largest graph with billion edges, SA-GNAS can achieve 2.8 times speedup compared to the SOTA large-scale GNAS method GAUSS. Additionally, since SA-GNAS is inherently parallelized, the search efficiency can be further improved with more GPUs. SA-GNAS is available at this https URL.

[LG-30] Deep Learning Machine Learning Advancing Big Data Analytics and Management

Link: https://arxiv.org/abs/2412.02187
Authors: Weiche Hsieh, Ziqian Bi, Keyu Chen, Benji Peng, Sen Zhang, Jiawei Xu, Jinlang Wang, Caitlyn Heqi Yin, Yichao Zhang, Pohsun Feng, Yizhu Wen, Tianyang Wang, Ming Li, Chia Xin Liang, Jintao Ren, Qian Niu, Silin Chen, Lawrence K.Q. Yan, Han Xu, Hong-Ming Tseng, Xinyuan Song, Bowen Jing, Junjie Yang, Junhao Song, Junyu Liu, Ming Liu
Keywords-EN: deep learning, catalyzed the transformation, pivotal domains, domains for research, data
Categories: Machine Learning (cs.LG)
Comments: 174 pages

Abstract:Advancements in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management into pivotal domains for research and application. This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies, emphasizing their role in uncovering actionable insights from massive, high-dimensional datasets. The study presents a systematic overview of data preprocessing techniques, including data cleaning, normalization, integration, and dimensionality reduction, to prepare raw data for analysis. Core analytics methodologies such as classification, clustering, regression, and anomaly detection are examined, with a focus on algorithmic innovation and scalability. Furthermore, the text delves into state-of-the-art frameworks for data mining and predictive modeling, highlighting the role of neural networks, support vector machines, and ensemble methods in tackling complex analytical challenges. Special emphasis is placed on the convergence of big data with distributed computing paradigms, including cloud and edge computing, to address challenges in storage, computation, and real-time analytics. The integration of ethical considerations, including data privacy and compliance with global standards, ensures a holistic perspective on data management. Practical applications across healthcare, finance, marketing, and policy-making illustrate the real-world impact of these technologies. Through comprehensive case studies and Python-based implementations, this work equips researchers, practitioners, and data enthusiasts with the tools to navigate the complexities of modern data analytics. It bridges the gap between theory and practice, fostering the development of innovative solutions for managing and leveraging data in the era of artificial intelligence.

[LG-31] Towards the efficacy of federated prediction for epidemics on networks

Link: https://arxiv.org/abs/2412.02161
Authors: Chengpeng Fu, Tong Li, Hao Chen, Wen Du, Zhidong He
Keywords-EN: enabling early intervention, resource allocation, enabling early, early intervention, strategic planning
Categories: Social and Information Networks (cs.SI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract: Epidemic prediction is of practical significance in public health, enabling early intervention, resource allocation, and strategic planning. However, privacy concerns often hinder the sharing of health data among institutions, limiting the development of accurate prediction models. In this paper, we develop a general privacy-preserving framework for node-level epidemic prediction on networks based on federated learning (FL). We frame the spatio-temporal spread of epidemics across multiple data-isolated subnetworks, where each node state represents the aggregate epidemic severity within a community. Then, both a purely temporal LSTM model and a spatio-temporal model, i.e., the Spatio-Temporal Graph Attention Network (STGAT), are proposed to address federated epidemic prediction. Extensive experiments are conducted on various epidemic processes using a practical airline network, offering a comprehensive assessment of FL efficacy under diverse scenarios. By introducing the efficacy energy metric to measure system robustness under various client configurations, we systematically explore key factors influencing FL performance, including client numbers, aggregation strategies, graph partitioning, and missing infection reports. Numerical results show that STGAT excels at capturing spatio-temporal dependencies in dynamic processes, whereas LSTM performs well on simpler patterns. Moreover, our findings highlight the importance of balancing feature consistency and volume uniformity among clients, as well as the prediction dilemma between information richness and the intrinsic stochasticity of dynamic processes. This study offers practical insights into the efficacy of FL in epidemic management and demonstrates the potential of FL to address broader collective dynamics.
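The abstract lists aggregation strategies among the factors studied but does not say which ones. As a point of reference only, the sketch below shows sample-size-weighted federated averaging (FedAvg), a common baseline aggregation rule. Treating it as one of the strategies in the paper is an assumption, and `fed_avg` is a hypothetical helper name.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts.

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training samples held by each client.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two clients; the second holds three times as much data, so it dominates.
global_weights = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only parameter vectors and sample counts cross institutional boundaries in this scheme; the raw health records stay with each client.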

[LG-32] Evaluating the Impact of Data Augmentation on Predictive Model Performance

Link: https://arxiv.org/abs/2412.02108
Authors: Valdemar Švábenský, Conrad Borchers, Elizabeth B. Cloude, Atsushi Shimada
Keywords-EN: supervised machine learning, large training datasets, valid results, supervised machine, datasets are essential
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: Published in LAK 2025 conference proceedings in the ACM Digital Library, see this https URL

Abstract:In supervised machine learning (SML) research, large training datasets are essential for valid results. However, obtaining primary data in learning analytics (LA) is challenging. Data augmentation can address this by expanding and diversifying data, though its use in LA remains underexplored. This paper systematically compares data augmentation techniques and their impact on prediction performance in a typical LA task: prediction of academic outcomes. Augmentation is demonstrated on four SML models, which we successfully replicated from a previous LAK study based on AUC values. Among 21 augmentation techniques, SMOTE-ENN sampling performed the best, improving the average AUC by 0.01 and approximately halving the training time compared to the baseline models. In addition, we compared 99 combinations of chaining 21 techniques, and found minor, although statistically significant, improvements across models when adding noise to SMOTE-ENN (+0.014). Notably, some augmentation techniques significantly lowered predictive performance or increased performance fluctuation related to random chance. This paper’s contribution is twofold. Primarily, our empirical findings show that sampling techniques provide the most statistically reliable performance improvements for LA applications of SML, and are computationally more efficient than deep generation methods with complex hyperparameter settings. Second, the LA community may benefit from validating a recent study through independent replication.

[LG-33] Beyond Tree Models: A Hybrid Model of KAN and gMLP for Large-Scale Financial Tabular Data

Link: https://arxiv.org/abs/2412.02097
Authors: Mingming Zhang, Jiahao Hu, Pengfei Shi, Ningtao Wang, Ruizhe Gao, Guandong Sun, Feng Zhao, Yulin Kang, Xing Fu, Weiqiang Wang, Junbo Zhao
Keywords-EN: Tabular data, Tabular data plays, plays a critical, critical role, data
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 4 figures

Abstract: Tabular data plays a critical role in real-world financial scenarios. Traditionally, tree models have dominated in handling tabular data. However, financial datasets in the industry often encounter some challenges, such as data heterogeneity, the predominance of numerical features and the large scale of the data, which can range from tens of millions to hundreds of millions of records. These challenges can lead to significant memory and computational issues when using tree-based models. Consequently, there is a growing need for neural network-based solutions that can outperform these models. In this paper, we introduce TKGMLP, a hybrid network for tabular data that combines shallow Kolmogorov-Arnold Networks with a Gated Multilayer Perceptron. This model leverages the strengths of both architectures to improve performance and scalability. We validate TKGMLP on a real-world credit scoring dataset, where it achieves state-of-the-art results and outperforms current benchmarks. Furthermore, our findings demonstrate that the model continues to improve as the dataset size increases, making it highly scalable. Additionally, we propose a novel feature encoding method for numerical data, specifically designed to address the predominance of numerical features in financial datasets. The integration of this feature encoding method within TKGMLP significantly improves prediction accuracy. This research not only advances table prediction technology but also offers a practical and effective solution for handling large-scale numerical tabular data in various industrial applications.

[LG-34] Crash Severity Risk Modeling Strategies under Data Imbalance

Link: https://arxiv.org/abs/2412.02094
Authors: Abdullah Al Mamun(1), Abyad Enan(1), Debbie A. Indah(2), Judith Mwakalonge(3), Gurcan Comert(4), Mashrur Chowdhury(5) ((1) Graduate Student, Glenn Department of Civil Engineering, Clemson University, (2) Graduate Student, Department of Engineering, South Carolina State University, (3) Professor, Department of Engineering, South Carolina State University, (4) Associate Professor, Computational Data Science and Engineering Department, North Carolina A&T State University, (5) Professor, Glenn Department of Civil Engineering, Clemson University)
Keywords-EN: involving large vehicles, risk modeling strategies, South Carolina work, Carolina work zones, severity risk modeling
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP)
Comments: This version has been resubmitted to the Transportation Research Record: Journal of the Transportation Research Board after addressing the reviewers’ comments and is currently awaiting the final decision

Abstract: This study investigates crash severity risk modeling strategies for work zones involving large vehicles (i.e., trucks, buses, and vans) when the crash data are imbalanced between low-severity (LS) and high-severity (HS) crashes. We utilized crash data involving large vehicles in South Carolina work zones for the period between 2014 and 2018, which included 4 times more LS crashes than HS crashes. The objective of this study is to explore the crash severity prediction performance of various models under different feature selection and data balancing techniques. The findings of this study highlight a disparity between LS and HS predictions, with less accurate prediction of HS crashes compared to LS crashes due to class imbalance and feature overlaps between LS and HS crashes. Combining features from multiple feature selection techniques (statistical correlation, feature importance, recursive elimination, statistical tests, and mutual information) slightly improves HS crash prediction performance. Data balancing techniques such as NearMiss-1 and RandomUnderSampler maximize HS recall when paired with certain prediction models, such as Bayesian Mixed Logit (BML), NeuralNet, and RandomForest, making them suitable for HS crash prediction. Conversely, RandomOverSampler, HS Class Weighting, and Kernel-based Synthetic Minority Oversampling (K-SMOTE), used with certain prediction models such as BML, CatBoost, and LightGBM, achieve a balanced performance, defined as an equitable trade-off between LS and HS prediction performance metrics. These insights provide safety analysts with guidance to select models, feature selection techniques, and data balancing techniques that align with their specific safety objectives, offering a robust foundation for enhancing work-zone crash severity prediction.
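To make the balancing idea concrete, here is a minimal standard-library sketch of random undersampling together with the recall metric mentioned in the abstract. It mimics only the general behaviour of a random undersampler. It is not the authors' pipeline or any library's API, and the function names and toy data are illustrative assumptions.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Drop majority-class rows at random until every class has equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` cases that were also predicted positive."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return hits / actual if actual else 0.0


# Toy data with the 4:1 LS-to-HS ratio described in the abstract.
rows = list(range(10))
labels = ["LS"] * 8 + ["HS"] * 2
balanced_rows, balanced_labels = random_undersample(rows, labels)
```

Undersampling discards majority-class information, which is one plausible reason such techniques favour HS recall over balanced LS/HS performance.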

[LG-35] Offline Stochastic Optimization of Black-Box Objective Functions

Link: https://arxiv.org/abs/2412.02089
Authors: Juncheng Dong, Zihao Wu, Hamid Jafarkhani, Ali Pezeshki, Vahid Tarokh
Keywords-EN: communication network design, involve optimizing complex, vast search spaces, expensive black-box functions, science and engineering
Categories: Machine Learning (cs.LG)

Abstract:Many challenges in science and engineering, such as drug discovery and communication network design, involve optimizing complex and expensive black-box functions across vast search spaces. Thus, it is essential to leverage existing data to avoid costly active queries of these black-box functions. To this end, while Offline Black-Box Optimization (BBO) is effective for deterministic problems, it may fall short in capturing the stochasticity of real-world scenarios. To address this, we introduce Stochastic Offline BBO (SOBBO), which tackles both black-box objectives and uncontrolled uncertainties. We propose two solutions: for large-data regimes, a differentiable surrogate allows for gradient-based optimization, while for scarce-data regimes, we directly estimate gradients under conservative field constraints, improving robustness, convergence, and data efficiency. Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world tasks.

[LG-36] GNN-based Auto-Encoder for Short Linear Block Codes: A DRL Approach

Link: https://arxiv.org/abs/2412.02053
Authors: Kou Tian, Chentao Yue, Changyang She, Yonghui Li, Branka Vucetic
Keywords-EN: Markov Decision Process, paper presents, Markov Decision, Decision Process, code
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: 13 pages; submitted to IEEE Trans. arXiv admin note: text overlap with arXiv:2211.06962

Abstract: This paper presents a novel auto-encoder-based scheme for end-to-end channel encoding and decoding. It integrates deep reinforcement learning (DRL) and graph neural networks (GNN) in code design by modeling the generation of code parity-check matrices as a Markov Decision Process (MDP), to optimize key coding performance metrics such as error rates and code algebraic properties. An edge-weighted GNN (EW-GNN) decoder is proposed, which operates on the Tanner graph with an iterative message-passing structure. Once trained on a single linear block code, the EW-GNN decoder can be directly used to decode other linear block codes of different code lengths and code rates. An iterative joint training of the DRL-based code designer and the EW-GNN decoder is performed to optimize the end-to-end encoding and decoding process. Simulation results show the proposed auto-encoder significantly surpasses several traditional coding schemes at short block lengths, including low-density parity-check (LDPC) codes with the belief propagation (BP) decoding and the maximum-likelihood decoding (MLD), and BCH with BP decoding, offering superior error-correction capabilities while maintaining low decoding complexity.

[LG-37] Predicting the Impact of Scope Changes on Project Cost and Schedule Using Machine Learning Techniques

Link: https://arxiv.org/abs/2412.02041
Authors: Soheila Sadeghi
Keywords-EN: cost, significantly impact project, project, scope, inevitable reality
Categories: Machine Learning (cs.LG)

Abstract: In the dynamic landscape of project management, scope changes are an inevitable reality that can significantly impact project performance. These changes, whether initiated by stakeholders, external factors, or internal project dynamics, can lead to cost overruns and schedule delays. Accurately predicting the consequences of these changes is crucial for effective project control and informed decision-making. This study aims to develop predictive models to estimate the impact of scope changes on project cost and schedule using machine learning techniques. The research utilizes a comprehensive dataset containing detailed information on project tasks, including the Work Breakdown Structure (WBS), task type, productivity rate, estimated cost, actual cost, duration, task dependencies, scope change magnitude, and scope change timing. Multiple machine learning models are developed and evaluated to predict the impact of scope changes on project cost and schedule. These models include Linear Regression, Decision Tree, Ridge Regression, Random Forest, Gradient Boosting, and XGBoost. The dataset is split into training and testing sets, and the models are trained using the preprocessed data. Model robustness and generalization are assessed using cross-validation techniques. To evaluate the performance of the models, we use Mean Squared Error (MSE) and the coefficient of determination (R^2). Residual plots are generated to assess the goodness of fit and identify any patterns or outliers. Hyperparameter tuning is performed to optimize the XGBoost model and improve its predictive accuracy. The study identifies the most influential project attributes in determining the magnitude of cost and schedule deviations caused by scope modifications. Productivity rate, scope change magnitude, task dependencies, estimated cost, actual cost, duration, and specific WBS elements are found to be powerful predictors.
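The two evaluation metrics named in the abstract have standard definitions, sketched below in plain Python. The toy numbers are invented for illustration and have nothing to do with the paper's dataset.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical cost deviations (actual vs. predicted), for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
error = mse(actual, predicted)   # one residual of 2 over four samples
score = r2(actual, predicted)    # well below 1 because of that residual
```

MSE is expressed in squared units of the target, whereas R^2 is unitless and compares the model against always predicting the mean, so the two are usually reported together.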

[LG-38] Generalized EXTRA stochastic gradient Langevin dynamics

Link: https://arxiv.org/abs/2412.01993
Authors: Mert Gurbuzbalaban, Mohammad Rafiqul Islam, Xiaoyu Wang, Lingjiong Zhu
Keywords-EN: Markov Chain Monte, Chain Monte Carlo, Monte Carlo methods, popular Markov Chain, Markov Chain
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Langevin algorithms are popular Markov Chain Monte Carlo methods for Bayesian learning, particularly when the aim is to sample from the posterior distribution of a parametric model, given the input data and the prior distribution over the model parameters. Their stochastic versions such as stochastic gradient Langevin dynamics (SGLD) allow iterative learning based on randomly sampled mini-batches of large datasets and are scalable to large datasets. However, when data is decentralized across a network of agents subject to communication and privacy constraints, standard SGLD algorithms cannot be applied. Instead, we employ decentralized SGLD (DE-SGLD) algorithms, where Bayesian learning is performed collaboratively by a network of agents without sharing individual data. Nonetheless, existing DE-SGLD algorithms induce a bias at every agent that can negatively impact performance; this bias persists even when using full batches and is attributable to network effects. Motivated by the EXTRA algorithm and its generalizations for decentralized optimization, we propose the generalized EXTRA stochastic gradient Langevin dynamics, which eliminates this bias in the full-batch setting. Moreover, we show that, in the mini-batch setting, our algorithm provides performance bounds that significantly improve upon those of standard DE-SGLD algorithms in the literature. Our numerical results also demonstrate the efficiency of the proposed approach.
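As background for the abstract above, the following sketch shows the standard single-agent SGLD update, theta <- theta - step * grad + sqrt(2 * step) * noise, applied to a toy Gaussian-mean posterior. It is not the decentralized or generalized EXTRA variant proposed in the paper. The toy model (unit-variance likelihood, flat prior) and all names are assumptions chosen for illustration.

```python
import math
import random


def sgld_step(theta, grad_estimate, step, noise):
    """One SGLD update: gradient step on the potential plus scaled Gaussian noise."""
    return [
        t - step * g + math.sqrt(2.0 * step) * z
        for t, g, z in zip(theta, grad_estimate, noise)
    ]


def sample_gaussian_mean(data, steps=2000, step=0.001, batch=10, seed=0):
    """Sample the posterior of a mean (unit-variance likelihood, flat prior)."""
    rng = random.Random(seed)
    n = len(data)
    theta, trace = [0.0], []
    for _ in range(steps):
        minibatch = rng.sample(data, batch)
        # Unbiased mini-batch estimate of the gradient of
        # U(theta) = sum_i (theta - x_i)^2 / 2 over the full dataset.
        grad = [n / batch * sum(theta[0] - x for x in minibatch)]
        theta = sgld_step(theta, grad, step, [rng.gauss(0.0, 1.0)])
        trace.append(theta[0])
    return trace
```

In the decentralized setting described in the abstract, each agent would run such updates on its own data while also mixing iterates with its neighbours. The network-induced bias of existing schemes arises in that mixing step, which this single-agent sketch does not model.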

[LG-39] FGATT: A Robust Framework for Wireless Data Imputation Using Fuzzy Graph Attention Networks and Transformer Encoders

Link: https://arxiv.org/abs/2412.01979
Authors: Jinming Xing, Ruilin Xing, Yan Sun
Keywords-EN: Graph Attention Network, Fuzzy Graph Attention, pervasive challenge, compromising the performance, performance of machine
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)

Abstract:Missing data is a pervasive challenge in wireless networks and many other domains, often compromising the performance of machine learning and deep learning models. To address this, we propose a novel framework, FGATT, that combines the Fuzzy Graph Attention Network (FGAT) with the Transformer encoder to perform robust and accurate data imputation. FGAT leverages fuzzy rough sets and graph attention mechanisms to capture spatial dependencies dynamically, even in scenarios where predefined spatial information is unavailable. The Transformer encoder is employed to model temporal dependencies, utilizing its self-attention mechanism to focus on significant time-series patterns. A self-adaptive graph construction method is introduced to enable dynamic connectivity learning, ensuring the framework’s applicability to a wide range of wireless datasets. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in imputation accuracy and robustness, particularly in scenarios with substantial missing data. The proposed model is well-suited for applications in wireless sensor networks and IoT environments, where data integrity is critical.

[LG-40] Geometry-aware PINNs for Turbulent Flow Prediction

Link: https://arxiv.org/abs/2412.01954
Authors: Shinjan Ghosh, Julian Busch, Georgia Olympia Brikis, Biswadip Dey
Keywords-EN: computational fluid dynamics, fluid dynamics, exploration or optimization, optimization using computational, computational fluid
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
Comments: Machine Learning and the Physical Sciences (ML4PS) Workshop at NeurIPS’2024

Abstract: Design exploration or optimization using computational fluid dynamics (CFD) is commonly used in the industry. Geometric variation is a key component of such design problems, especially in turbulent flow scenarios, which involve running costly simulations at every design iteration. While parametric RANS-PINN type approaches have been proven to make effective turbulent surrogates, as a means of predicting unknown Reynolds number flows for a given geometry in near real time, geometry-aware physics-informed surrogates with the ability to predict varying geometries are a relatively less studied topic. A novel geometry-aware parametric PINN surrogate model has been created, which can predict flow fields for NACA 4-digit airfoils in turbulent conditions, for unseen shapes as well as inlet flow conditions. A local+global approach for embedding has been proposed, where known global design parameters for an airfoil as well as local SDF values can be used as inputs to the model along with velocity inlet/Reynolds number ($\mathcal{R}_e$) to predict the flow fields. A RANS formulation of the Navier-Stokes equations with a 2-equation k-epsilon turbulence model has been used for the PDE losses, in addition to limited CFD data from 8 different NACA airfoils for training. The models have then been validated with unknown NACA airfoils at unseen Reynolds numbers.

[LG-41] The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications

Link: https://arxiv.org/abs/2412.01953
Authors: Philippe Brouillard, Chandler Squires, Jonas Wahl, Konrad P. Kording, Karen Sachs, Alexandre Drouin, Dhanya Sridhar
Keywords-EN: automatically uncover causal, uncover causal relationships, relationships from data, scientific disciplines, aims to automatically
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 39 pages, 8 figures

Abstract:Causal discovery aims to automatically uncover causal relationships from data, a capability with significant potential across many scientific disciplines. However, its real-world applications remain limited. Current methods often rely on unrealistic assumptions and are evaluated only on simple synthetic toy datasets, often with inadequate evaluation metrics. In this paper, we substantiate these claims by performing a systematic review of the recent causal discovery literature. We present applications in biology, neuroscience, and Earth sciences - fields where causal discovery holds promise for addressing key challenges. We highlight available simulated and real-world datasets from these domains and discuss common assumption violations that have spurred the development of new methods. Our goal is to encourage the community to adopt better evaluation practices by utilizing realistic datasets and more adequate metrics.

[LG-42] A Novel Generative Multi-Task Representation Learning Approach for Predicting Postoperative Complications in Cardiac Surgery Patients

Link: https://arxiv.org/abs/2412.01950
Authors: Junbo Shen, Bing Xue, Thomas Kannampallil, Chenyang Lu, Joanna Abraham
Keywords-EN: Early detection, proactive risk mitigation, postoperative complications, surgical Variational Autoencoder, timely therapy
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Codes are publicly available at: this https URL

Abstract: Early detection of surgical complications allows for timely therapy and proactive risk mitigation. Machine learning (ML) can be leveraged to identify and predict patient risks for postoperative complications. We developed and validated the effectiveness of predicting postoperative complications using a novel surgical Variational Autoencoder (surgVAE) that uncovers intrinsic patterns via cross-task and cross-cohort representation learning. This retrospective cohort study used data from the electronic health records of adult surgical patients over four years (2018 - 2021). Six key postoperative complications for cardiac surgery were assessed: acute kidney injury, atrial fibrillation, cardiac arrest, deep vein thrombosis or pulmonary embolism, blood transfusion, and other intraoperative cardiac events. We compared the prediction performance of surgVAE against widely used ML models and advanced representation learning and generative models under 5-fold cross-validation. A total of 89,246 surgeries (49% male, median (IQR) age: 57 (45-69)) were included, with 6,502 in the targeted cardiac surgery cohort (61% male, median (IQR) age: 60 (53-70)). surgVAE demonstrated superior performance over existing ML solutions across all postoperative complications of cardiac surgery patients, achieving macro-averaged AUPRC of 0.409 and macro-averaged AUROC of 0.831, which were 3.4% and 3.7% higher, respectively, than the best alternative method (by AUPRC scores). Model interpretation using Integrated Gradients highlighted key risk factors based on preoperative variable importance. surgVAE showed excellent discriminatory performance for predicting postoperative complications and addressing the challenges of data complexity, small cohort sizes, and low-frequency positive events. surgVAE enables data-driven predictions of patient risks and prognosis while enhancing the interpretability of patient risk profiles.

[LG-43] Down with the Hierarchy: The H in HNSW Stands for “Hubs”

Link: https://arxiv.org/abs/2412.01940
Authors: Blaise Munyampirwa, Vihan Lakshman, Benjamin Coleman
Keywords-EN: neural representation learning, critical computational workload, recent breakthrough advances, Driven by recent, ANN search
Categories: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 10 pages

Abstract: Driven by recent breakthrough advances in neural representation learning, approximate near-neighbor (ANN) search over vector embeddings has emerged as a critical computational workload. With the introduction of the seminal Hierarchical Navigable Small World (HNSW) algorithm, graph-based indexes have established themselves as the overwhelmingly dominant paradigm for efficient and scalable ANN search. As the name suggests, HNSW searches a layered hierarchical graph to quickly identify neighborhoods of similar points to a given query vector. But is this hierarchy even necessary? A rigorous experimental analysis to answer this question would provide valuable insights into the nature of algorithm design for ANN search and motivate directions for future work in this increasingly crucial domain. To that end, we conduct an extensive benchmarking study covering more large-scale datasets than prior investigations of this question. We ultimately find that a flat graph retains all of the benefits of HNSW on high-dimensional datasets, with latency and recall performance essentially identical to the original algorithm but with less memory overhead. Furthermore, we go a step further and study why the hierarchy of HNSW provides no benefit in high dimensions, hypothesizing that navigable small world graphs contain a well-connected, frequently traversed "highway" of hub nodes that maintain the same purported function as the hierarchical layers. We present compelling empirical evidence that the Hub Highway Hypothesis holds for real datasets and investigate the mechanisms by which the highway forms. The implications of this hypothesis may also provide future research directions in developing enhancements to graph-based ANN search.
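To illustrate what searching a "flat" graph means, the sketch below runs a best-first search over a single-layer neighbor graph, in the style of the per-layer search used by graph-based ANN indexes. It is not the authors' implementation: graph construction is omitted, the tiny line graph is a made-up example, and `greedy_search` and its `ef` parameter are illustrative names.

```python
import heapq
import math


def greedy_search(graph, vectors, query, entry, ef=4):
    """Best-first search over a single-layer neighbor graph.

    graph: dict node -> list of neighbor nodes; vectors: dict node -> tuple.
    Returns up to `ef` visited nodes, closest to `query` first.
    """
    def dist(n):
        return math.dist(vectors[n], query)

    visited = {entry}
    candidates = [(dist(entry), entry)]  # min-heap of nodes still to expand
    best = [(-dist(entry), entry)]       # max-heap (negated) of kept results
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # closest unexpanded node is worse than every kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # evict the current furthest result
    return [n for _, n in sorted((-d, n) for d, n in best)]


# Six points on a line, each linked only to its immediate neighbors.
vectors = {i: (float(i),) for i in range(6)}
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
nearest = greedy_search(graph, vectors, query=(4.2,), entry=0, ef=2)
```

On this chain the search has to walk every intermediate node. The hypothesis in the abstract is that in high-dimensional data a small set of well-connected hub nodes shortens such walks without any explicit layering.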

[LG-44] Beyond Pairwise Correlations: Higher-Order Redundancies in Self-Supervised Representation Learning

Link: https://arxiv.org/abs/2412.01926
Authors: David Zollikofer, Béni Egressy, Frederik Benzing, Matthias Otth, Roger Wattenhofer
Keywords: embedding space, approaches have shown, embedding space redundancy, effective tool, tool for representation
Categories: Machine Learning (cs.LG)
Comments: 12 pages main paper, 24 pages total

Abstract:Several self-supervised learning (SSL) approaches have shown that redundancy reduction in the feature embedding space is an effective tool for representation learning. However, these methods consider a narrow notion of redundancy, focusing on pairwise correlations between features. To address this limitation, we formalize the notion of embedding space redundancy and introduce redundancy measures that capture more complex, higher-order dependencies. We mathematically analyze the relationships between these metrics, and empirically measure these redundancies in the embedding spaces of common SSL methods. Based on our findings, we propose Self Supervised Learning with Predictability Minimization (SSLPM) as a method for reducing redundancy in the embedding space. SSLPM combines an encoder network with a predictor engaging in a competitive game of reducing and exploiting dependencies respectively. We demonstrate that SSLPM is competitive with state-of-the-art methods and find that the best performing SSL methods exhibit low embedding space redundancy, suggesting that even methods without explicit redundancy reduction mechanisms perform redundancy reduction implicitly.
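
To make the gap between pairwise and higher-order redundancy concrete, here is a small toy example that is not taken from the paper: three binary features where the third is the XOR of the first two. Every pairwise correlation is zero, yet the third feature is fully determined by the other two. The helper name `pearson` and the data are assumptions made for illustration.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# All four equally likely combinations of two fair bits, plus their XOR.
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]
f1, f2, f3 = (list(col) for col in zip(*samples))

# Pairwise view: no feature pair looks redundant.
pairwise = [pearson(f1, f2), pearson(f1, f3), pearson(f2, f3)]

# Higher-order view: a predictor that sees both other features recovers f3 exactly.
predicted = [a ^ b for a, b in zip(f1, f2)]
```

A redundancy measure built only on the correlation matrix would report these features as independent, which is the limitation the abstract describes.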

[LG-45] Dynamics of Resource Allocation in O-RANs: An In-depth Exploration of On-Policy and Off-Policy Deep Reinforcement Learning for Real-Time Applications

Link: https://arxiv.org/abs/2412.01839
Authors: Manal Mehdaoui, Amine Abouaomar
Keywords: Deep Reinforcement Learning, Deep Reinforcement, Reinforcement Learning, Radio Access Networks, Open Radio Access
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 6 pages

Abstract:Deep Reinforcement Learning (DRL) is a powerful tool used for addressing complex challenges in mobile networks. This paper investigates the application of two DRL models, on-policy and off-policy, in the field of resource allocation for Open Radio Access Networks (O-RAN). The on-policy model is the Proximal Policy Optimization (PPO), and the off-policy model is the Sample Efficient Actor-Critic with Experience Replay (ACER), which focuses on resolving the challenges of resource allocation associated with a Quality of Service (QoS) application that has strict requirements. Motivated by the original work of Nessrine Hammami and Kim Khoa Nguyen, this study is a replication to validate and prove the findings. Both PPO and ACER are used within the same experimental setup to assess their performance in a scenario of latency-sensitive and latency-tolerant users and compare them. The aim is to verify the efficacy of on-policy and off-policy DRL models in the context of O-RAN resource allocation. Results from this replication contribute to the ongoing scientific research and offer insights into the reproducibility and generalizability of the original research. This analysis reaffirms that both on-policy and off-policy DRL models have better performance than greedy algorithms in O-RAN settings. In addition, it confirms the original observations that the on-policy model (PPO) gives a favorable balance between energy consumption and user latency, while the off-policy model (ACER) shows a faster convergence. These findings give good insights to optimize resource allocation strategies in O-RANs. Index Terms: 5G, O-RAN, resource allocation, ML, DRL, PPO, ACER.

[LG-46] Enabling Explainable Recommendation in E-commerce with LLM-powered Product Knowledge Graph IJCAI2024

Link: https://arxiv.org/abs/2412.01837
Authors: Menghan Wang, Yuchen Guo, Duanfeng Zhang, Jianian Jin, Minnie Li, Dan Schonfeld, Shawn Zhou
Keywords: leverage large language, large language model, language model superior, model superior capability, hot topic
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: This paper was accepted by The First International OpenKG Workshop Large Knowledge-Enhanced Models @IJCAI 2024

Abstract:How to leverage the superior capabilities of large language models (LLMs) in e-commerce recommendation has been a hot topic. In this paper, we propose LLM-PKG, an efficient approach that distills the knowledge of LLMs into a product knowledge graph (PKG) and then applies the PKG to provide explainable recommendations. Specifically, we first build the PKG by feeding curated prompts to an LLM, and then map the LLM responses to real enterprise products. To mitigate the risks associated with LLM hallucination, we employ rigorous evaluation and pruning methods to ensure the reliability and availability of the KG. Through an A/B test conducted on an e-commerce website, we demonstrate that LLM-PKG significantly drives user engagement and transactions.

[LG-47] Monolithic Hybrid Recommender System for Suggesting Relevant Movies

Link: https://arxiv.org/abs/2412.01835
Authors: Mahdi Rezapour
Keywords: users information access, facilitate users information, recommendation system works, fundamental services, services to facilitate
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Recommendation systems have become fundamental services for facilitating users' access to information. Generally, a recommendation system works by filtering historical behaviors to understand and learn users' preferences. With the growth of online information, recommendations have become crucially important in information filtering to prevent the information overload problem. In this study, we consider a hybrid post-fusion of two collaborative filtering approaches, using sequences of watched movies and the ratings of related movies. After applying both techniques and a weight matrix, the recommendations are adjusted to correspond to the user's preferences as needed. We discuss how the weights can be set based on the use case. For instance, where ratings are available for most classes, a higher weight is assigned to the rating matrix, and where ratings are unavailable for the majority of cases, a higher weight might be assigned to the sequential dataset. The sequence of watched movies is used in conjunction with the ratings because the sequential model might be inadequate for distinguishing users' long-term preferences and does not account for the ratings of the watched movies, so that model alone might not suffice. The literature and the methodological approach to the problem are discussed extensively.
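
The weighting idea described in the abstract can be sketched as a simple weighted late fusion of two score sources. This is only an illustrative sketch under assumptions: the function `fuse_scores`, the scalar weight `w_rating` (the abstract mentions a weight matrix), and the toy scores are hypothetical and not taken from the paper.

```python
def fuse_scores(rating_scores, sequence_scores, w_rating):
    """Blend two per-movie score dicts; w_rating in [0, 1] weights the rating-based model."""
    if not 0.0 <= w_rating <= 1.0:
        raise ValueError("w_rating must lie in [0, 1]")
    movies = set(rating_scores) | set(sequence_scores)
    fused = {}
    for m in movies:
        r = rating_scores.get(m, 0.0)      # a missing score is treated as 0 (an assumption)
        s = sequence_scores.get(m, 0.0)
        fused[m] = w_rating * r + (1.0 - w_rating) * s
    # Best score first; ties broken by title so the order is deterministic.
    return sorted(fused, key=lambda m: (-fused[m], m))


rating_scores = {"A": 0.9, "B": 0.2, "C": 0.5}
sequence_scores = {"A": 0.1, "B": 0.8, "C": 0.5}

# Ratings are plentiful: trust the rating-based model more.
rating_heavy = fuse_scores(rating_scores, sequence_scores, w_rating=0.8)
# Ratings are sparse: lean on the watch-sequence model instead.
sequence_heavy = fuse_scores(rating_scores, sequence_scores, w_rating=0.2)
```

With these toy numbers the ranking flips from A, C, B to B, C, A as the weight moves from the rating model to the sequence model.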

[LG-48] The effect of priors on Learning with Restricted Boltzmann Machines

Link: https://arxiv.org/abs/2412.02623
Authors: Gianluca Manzan, Daniele Tantari
Keywords: Restricted Boltzmann Machines, Restricted Boltzmann, rich underlying structure, generative models designed, student RBM learns
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)

Abstract:Restricted Boltzmann Machines (RBMs) are generative models designed to learn from data with a rich underlying structure. In this work, we explore a teacher-student setting where a student RBM learns from examples generated by a teacher RBM, with a focus on the effect of the unit priors on learning efficiency. We consider a parametric class of priors that interpolate between continuous (Gaussian) and binary variables. This approach models various possible choices of visible units, hidden units, and weights for both the teacher and student RBMs. By analyzing the phase diagram of the posterior distribution in both the Bayes optimal and mismatched regimes, we demonstrate the existence of a triple point that defines the critical dataset size necessary for learning through generalization. The critical size is strongly influenced by the properties of the teacher, and thus the data, but is unaffected by the properties of the student RBM. Nevertheless, a prudent choice of student priors can facilitate training by expanding the so-called signal retrieval region, where the machine generalizes effectively.

[LG-49] Plug-and-Play Half-Quadratic Splitting for Ptychography

Link: https://arxiv.org/abs/2412.02548
Authors: Alexander Denker, Johannes Hertrich, Zeljko Kereta, Silvia Cipiccia, Ecem Erin, Simon Arridge
Keywords: reconstruct complex-valued images, reconstruct complex-valued, coherent diffraction imaging, Abstract, phase retrieval
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Ptychography is a coherent diffraction imaging method that uses phase retrieval techniques to reconstruct complex-valued images. It achieves this by sequentially illuminating overlapping regions of a sample with a coherent beam and recording the diffraction pattern. Although this addresses traditional imaging system challenges, it is computationally intensive and highly sensitive to noise, especially with reduced illumination overlap. Data-driven regularisation techniques have been applied in phase retrieval to improve reconstruction quality. In particular, plug-and-play (PnP) offers flexibility by integrating data-driven denoisers as implicit priors. In this work, we propose a half-quadratic splitting framework for using PnP and other data-driven priors for ptychography. We evaluate our method both on natural images and real test objects to validate its effectiveness for ptychographic image reconstruction.
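
For readers unfamiliar with the technique, the generic textbook form of half-quadratic splitting with a plug-and-play denoiser is sketched below. The notation is assumed here rather than taken from the paper: $f$ is a data-fidelity term, $g$ a regulariser, $\mu$ a penalty weight, and $D_\sigma$ a denoiser standing in for the second sub-problem.

```latex
\min_{x}\; f(x) + \lambda\, g(x)
\;\;\longrightarrow\;\;
\min_{x,z}\; f(x) + \lambda\, g(z) + \frac{\mu}{2}\,\lVert x - z\rVert_2^2,
\\[4pt]
x^{k+1} = \arg\min_{x}\; f(x) + \frac{\mu}{2}\,\lVert x - z^{k}\rVert_2^2,
\qquad
z^{k+1} = \arg\min_{z}\; \lambda\, g(z) + \frac{\mu}{2}\,\lVert x^{k+1} - z\rVert_2^2
\;\approx\; D_{\sigma}\!\left(x^{k+1}\right).
```

The $z$-update has the form of a denoising problem, which is why a learned denoiser can be substituted for it without writing $g$ down explicitly.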

[LG-50] Active learning of neural population dynamics using two-photon holographic optogenetics NEURIPS2024

Link: https://arxiv.org/abs/2412.02529
Authors: Andrew Wagenmaker, Lu Mi, Marton Rozsa, Matthew S. Bull, Karel Svoboda, Kayvon Daie, Matthew D. Golub, Kevin Jamieson
Keywords: Recent advances, perturbing neural populations, neural population, advances in techniques, techniques for monitoring
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: NeurIPS 2024

Abstract:Recent advances in techniques for monitoring and perturbing neural populations have greatly enhanced our ability to study circuits in the brain. In particular, two-photon holographic optogenetics now enables precise photostimulation of experimenter-specified groups of individual neurons, while simultaneous two-photon calcium imaging enables the measurement of ongoing and induced activity across the neural population. Despite the enormous space of potential photostimulation patterns and the time-consuming nature of photostimulation experiments, very little algorithmic work has been done to determine the most effective photostimulation patterns for identifying the neural population dynamics. Here, we develop methods to efficiently select which neurons to stimulate such that the resulting neural responses will best inform a dynamical model of the neural population activity. Using neural population responses to photostimulation in mouse motor cortex, we demonstrate the efficacy of a low-rank linear dynamical systems model, and develop an active learning procedure which takes advantage of low-rank structure to determine informative photostimulation patterns. We demonstrate our approach on both real and synthetic data, obtaining in some cases as much as a two-fold reduction in the amount of data required to reach a given predictive power. Our active stimulation design method is based on a novel active learning procedure for low-rank regression, which may be of independent interest.

[LG-51] Nature versus nurture in galaxy formation: the effect of environment on star formation with causal machine learning

Link: https://arxiv.org/abs/2412.02439
Authors: Sunil Mucesh, William G. Hartley, Ciarán M. Gilligan-Lee, Ofer Lahav
Keywords: Understanding how galaxies, modern astronomy, galaxies form, form and evolve, heart of modern
Categories: Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 16 pages, 4 figures

Abstract:Understanding how galaxies form and evolve is at the heart of modern astronomy. With the advent of large-scale surveys and simulations, remarkable progress has been made in the last few decades. Despite this, the physical processes behind the phenomena, and particularly their importance, remain far from known, as correlations have primarily been established rather than the underlying causality. We address this challenge by applying the causal inference framework. Specifically, we tackle the fundamental open question of whether galaxy formation and evolution depends more on nature (i.e., internal processes) or nurture (i.e., external processes), by estimating the causal effect of environment on star-formation rate in the IllustrisTNG simulations. To do so, we develop a comprehensive causal model and employ cutting-edge techniques from epidemiology to overcome the long-standing problem of disentangling nature and nurture. We find that the causal effect is negative and substantial, with environment suppressing the SFR by a maximal factor of $\sim 100$. While the overall effect at $z = 0$ is negative, in the early universe, environment is discovered to have a positive impact, boosting star formation by a factor of $\sim 10$ at $z \sim 1$ and by even greater amounts at higher redshifts. Furthermore, we show that: (i) nature also plays an important role, as ignoring it underestimates the causal effect in intermediate-density environments by a factor of $\sim 2$, (ii) controlling for the stellar mass at a snapshot in time, as is common in the literature, is not only insufficient to disentangle nature and nurture but actually has an adverse effect, though (iii) stellar mass is an adequate proxy of the effects of nature. Finally, this work may prove a useful blueprint for extracting causal insights in other fields that deal with dynamical systems with closed feedback loops, such as the Earth’s climate.

[LG-52] Transformer-based Koopman Autoencoder for Linearizing Fisher's Equation

Link: https://arxiv.org/abs/2412.02430
Authors: Kanav Singh Rana, Nitu Kumari
Keywords: linearizing Fisher reaction-diffusion, Fisher reaction-diffusion equation, linearizing Fisher, Transformer-based Koopman autoencoder, Fisher reaction-diffusion
Categories: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:A Transformer-based Koopman autoencoder is proposed for linearizing Fisher’s reaction-diffusion equation. The primary focus of this study is on using deep learning techniques to find complex spatiotemporal patterns in the reaction-diffusion system. The emphasis is on not just solving the equation but also transforming the system’s dynamics into a more comprehensible, linear form. Global coordinate transformations are achieved through the autoencoder, which learns to capture the underlying dynamics by training on a dataset with 60,000 initial conditions. Extensive testing on multiple datasets was used to assess the efficacy of the proposed model, demonstrating its ability to accurately predict the system’s evolution as well as to generalize. We provide a thorough comparison study, comparing our suggested design to a few other comparable methods using experiments on various PDEs, such as the Kuramoto-Sivashinsky equation and Burgers’ equation. Results show improved accuracy, highlighting the capabilities of the Transformer-based Koopman autoencoder. The proposed architecture is significantly ahead of other architectures in terms of solving different types of PDEs using a single architecture. Our method relies entirely on the data, without requiring any knowledge of the underlying equations. This makes it applicable even to datasets where the governing equations are not known.
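
As background, Fisher's reaction-diffusion equation and the generic structure of a Koopman autoencoder can be written as below. This is the standard textbook setup, not the paper's specific architecture: the encoder $\varphi$, decoder $\psi$, linear operator $K$, and the coefficients $D$ and $r$ are notation assumed for this sketch.

```latex
\partial_t u = D\,\partial_{xx} u + r\,u\,(1 - u),
\qquad
z_k = \varphi(u_k),\quad
z_{k+1} = K z_k,\quad
\hat{u}_{k+1} = \psi(z_{k+1}) = \psi\big(K\,\varphi(u_k)\big).
```

The nonlinear term $r\,u\,(1-u)$ is what makes the original dynamics hard to analyze; in the latent coordinates $z$ the time step is a single matrix multiplication, so an $m$-step prediction is $\psi\big(K^{m}\varphi(u_k)\big)$.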

[LG-53] Improved Complexity for Smooth Nonconvex Optimization: A Two-Level Online Learning Approach with Quasi-Newton Methods

Link: https://arxiv.org/abs/2412.02175
Authors: Ruichen Jiang, Aryan Mokhtari, Francisco Patitucci
Keywords: epsilon, first-order stationary point, online learning problem, gradient, problem
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 35 pages

Abstract:We study the problem of finding an $\epsilon$-first-order stationary point (FOSP) of a smooth function, given access only to gradient information. The best-known gradient query complexity for this task, assuming both the gradient and Hessian of the objective function are Lipschitz continuous, is $O(\epsilon^{-7/4})$. In this work, we propose a method with a gradient complexity of $O(d^{1/4}\epsilon^{-13/8})$, where $d$ is the problem dimension, leading to an improved complexity when $d = O(\epsilon^{-1/2})$. To achieve this result, we design an optimization algorithm that, underneath, involves solving two online learning problems. Specifically, we first reformulate the task of finding a stationary point for a nonconvex problem as minimizing the regret in an online convex optimization problem, where the loss is determined by the gradient of the objective function. Then, we introduce a novel optimistic quasi-Newton method to solve this online learning problem, with the Hessian approximation update itself framed as an online learning problem in the space of matrices. Beyond improving the complexity bound for achieving an $\epsilon$-FOSP using a gradient oracle, our result provides the first guarantee suggesting that quasi-Newton methods can potentially outperform gradient descent-type methods in nonconvex settings.
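
The stated condition on the dimension follows from comparing the two rates quoted in the abstract directly, ignoring constants:

```latex
d^{1/4}\,\epsilon^{-13/8} \le \epsilon^{-7/4}
\iff d^{1/4} \le \epsilon^{-7/4 + 13/8} = \epsilon^{-1/8}
\iff d \le \epsilon^{-1/2}.
```

For example, with $\epsilon = 10^{-4}$ the new bound is no worse than the previous one whenever $d \le 100$.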

[LG-54] Machine Learning Methods for Automated Interstellar Object Classification with LSST

Link: https://arxiv.org/abs/2412.02112
Authors: Richard Cloete, Peter Vereš, Abraham Loeb
Keywords: Space and Time, Rubin Observatory, Vera C. Rubin, LSST data, simulated LSST data
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures, 6 tables

Abstract:The Legacy Survey of Space and Time, to be conducted with the Vera C. Rubin Observatory, is poised to revolutionize our understanding of the Solar System by providing an unprecedented wealth of data on various objects, including the elusive interstellar objects (ISOs). Detecting and classifying ISOs is crucial for studying the composition and diversity of materials from other planetary systems. However, the rarity and brief observation windows of ISOs, coupled with the vast quantities of data to be generated by LSST, create significant challenges for their identification and classification. This study aims to address these challenges by exploring the application of machine learning algorithms to the automated classification of ISO tracklets in simulated LSST data. We employed various machine learning algorithms, including random forests (RFs), stochastic gradient descent (SGD), gradient boosting machines (GBMs), and neural networks (NNs), to classify ISO tracklets in simulated LSST data. We demonstrate that GBM and RF algorithms outperform SGD and NN algorithms in accurately distinguishing ISOs from other Solar System objects. RF analysis shows that many derived Digest2 values are more important than direct observables in classifying ISOs from the LSST tracklets. The GBM model achieves the highest precision, recall, and F1 score, with values of 0.9987, 0.9986, and 0.9987, respectively. These findings lay the foundation for the development of an efficient and robust automated system for ISO discovery using LSST data, paving the way for a deeper understanding of the materials and processes that shape planetary systems beyond our own. The integration of our proposed machine learning approach into the LSST data processing pipeline will optimize the survey’s potential for identifying these rare and valuable objects, enabling timely follow-up observations and further characterization.

[LG-55] MEP-Net: Generating Solutions to Scientific Problems with Limited Knowledge by Maximum Entropy Principle

Link: https://arxiv.org/abs/2412.02090
Authors: Wuyue Yang, Liangrong Peng, Guojie Li, Liu Hong
Keywords: Maximum entropy principle, inferring unknown probability, unknown probability distributions, learn complex distributions, probability distributions
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 35 pages, 6 figures, 2 tables

Abstract:Maximum entropy principle (MEP) offers an effective and unbiased approach to inferring unknown probability distributions when faced with incomplete information, while neural networks provide the flexibility to learn complex distributions from data. This paper proposes a novel neural network architecture, the MEP-Net, which combines the MEP with neural networks to generate probability distributions from moment constraints. We also provide a comprehensive overview of the fundamentals of the maximum entropy principle, its mathematical formulations, and a rigorous justification for its applicability for non-equilibrium systems based on the large deviations principle. Through fruitful numerical experiments, we demonstrate that the MEP-Net can be particularly useful in modeling the evolution of probability distributions in biochemical reaction networks and in generating complex distributions from data.
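
For reference, the standard textbook form of the maximum entropy problem under moment constraints is sketched below. The symbols $f_k$, $\mu_k$, and $\lambda_k$ are generic notation assumed here and are not taken from the paper.

```latex
\max_{p}\; -\int p(x)\ln p(x)\,dx
\quad\text{s.t.}\quad \int p(x)\,dx = 1,\qquad \int f_k(x)\,p(x)\,dx = \mu_k,\; k = 1,\dots,K,
\qquad\Longrightarrow\qquad
p^*(x) = \frac{1}{Z(\lambda)}\exp\!\Big(-\sum_{k=1}^{K}\lambda_k f_k(x)\Big),
\quad Z(\lambda) = \int \exp\!\Big(-\sum_{k=1}^{K}\lambda_k f_k(x)\Big)\,dx .
```

The Lagrange multipliers $\lambda_k$ are chosen so that the moment constraints hold. Solving for them is the difficult part in practice, which is presumably where a learned model such as the one proposed becomes useful.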

[LG-56] Diffusion models learn distributions generated by complex Langevin dynamics

Link: https://arxiv.org/abs/2412.01919
Authors: Diaa E. Habibi, Gert Aarts, Lingxiao Wang, Kai Zhou
Keywords: complex Langevin process, probability distribution effectively, distribution effectively sampled, complex Langevin, hard to understand
Categories: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
Comments: 8 pages + references. Proceedings of the 41st International Symposium on Lattice Field Theory (Lattice 2024), July 28th - August 3rd, 2024, University of Liverpool, UK

Abstract:The probability distribution effectively sampled by a complex Langevin process for theories with a sign problem is not known a priori and notoriously hard to understand. Diffusion models, a class of generative AI, can learn distributions from data. In this contribution, we explore the ability of diffusion models to learn the distributions created by a complex Langevin process.

[LG-57] Enhancing Brain Age Estimation with a Multimodal 3D CNN Approach Combining Structural MRI and AI-Synthesized Cerebral Blood Volume Data

Link: https://arxiv.org/abs/2412.01865
Authors: Jordan Jomsky, Zongyu Li, Yiren Zhang, Jia Guo
Keywords: population necessitates enhanced, necessitates enhanced methods, Age Gap Estimation, growing global aging, global aging population
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures

Abstract:The growing global aging population necessitates enhanced methods for assessing brain aging and related neurodegenerative changes. Brain Age Gap Estimation (BrainAGE) offers a neuroimaging biomarker for understanding these changes by predicting brain age from MRI scans. Current approaches primarily use T1-weighted magnetic resonance imaging (T1w MRI) data, capturing only structural brain information. To address the lack of functional data, we integrated AI-generated Cerebral Blood Volume (AICBV) with T1w MRI, combining both structural and functional metrics. We developed a deep learning model using a VGG-based architecture to predict brain age. Our model achieved a mean absolute error (MAE) of 3.95 years and a correlation of $R^2 = 0.94$ on the test set ($n = 288$), outperforming existing models trained on similar data. We have further created gradient-based class activation maps (Grad-CAM) to visualize the regions of the brain that most influenced the model’s predictions, providing interpretable insights into the structural and functional contributors to brain aging.

[LG-58] Late fusion ensembles for speech recognition on diverse input audio representations

Link: https://arxiv.org/abs/2412.01861
Authors: Marin Jezidžić, Matej Mihelčić
Keywords: Automatic Speech Recognition, applied to Automatic, late fusion ensemble, Speech Recognition, Automatic Speech
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)

Abstract:We explore diverse representations of speech audio and their effect on the performance of a late fusion ensemble of E-Branchformer models applied to the Automatic Speech Recognition (ASR) task. Although it is generally known that ensemble methods often improve the performance of a system, even for speech recognition, it is very interesting to explore how ensembles of complex state-of-the-art models, such as medium-sized and large E-Branchformers, cope in this setting when their base models are trained on diverse representations of the input speech audio. The results are evaluated on four widely used benchmark datasets (Librispeech, Aishell, Gigaspeech, and TEDLIUMv2) and show that improvements of 1% - 14% can still be achieved over the state-of-the-art models trained using comparable techniques on these datasets. A noteworthy observation is that such an ensemble offers improvements even with the use of language models, although the gap is closing.

[LG-59] MQFL-FHE: Multimodal Quantum Federated Learning Framework with Fully Homomorphic Encryption

Link: https://arxiv.org/abs/2412.01858
Authors: Siddhant Dutta, Nouhaila Innan, Sadok Ben Yahia, Muhammad Shafique, David Esteban Bernal Neira
Keywords: fully homomorphic encryption, homomorphic encryption, integration of fully, fully homomorphic, led to significant
Categories: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures, 5 Tables. Under Review

Abstract:The integration of fully homomorphic encryption (FHE) in federated learning (FL) has led to significant advances in data privacy. However, during the aggregation phase, it often results in performance degradation of the aggregated model, hindering the development of robust representational generalization. In this work, we propose a novel multimodal quantum federated learning framework that utilizes quantum computing to counteract the performance drop resulting from FHE. For the first time in FL, our framework combines a multimodal quantum mixture of experts (MQMoE) model with FHE, incorporating multimodal datasets for enriched representation and task-specific learning. Our MQMoE framework enhances performance on multimodal datasets and combined genomics and brain MRI scans, especially for underrepresented categories. Our results also demonstrate that the quantum-enhanced approach mitigates the performance degradation associated with FHE and improves classification accuracy across diverse datasets, validating the potential of quantum interventions in enhancing privacy in FL.

Information Retrieval

[IR-0] Improving Sequential Recommender Systems with Online and In-store User Behavior

Link: https://arxiv.org/abs/2412.02122
Authors: Luyi Ma, Aashika Padmanabhan, Anjana Ganesh, Shengwei Tang, Jiao Chen, Xiaohan Li, Lalitesh Morishetti, Kaushiki Nag, Malay Patel, Jason Cho, Sushant Kumar, Kannan Achan
Keywords: extending in-store shopping, exploring in-store shopping, in-store shopping, Online e-commerce platforms, canonical online browsing
Categories: Information Retrieval (cs.IR)
Comments: 6 pages, IEEE BigData 2024 Workshop

Abstract:Online e-commerce platforms have been extending in-store shopping, which allows users to keep the canonical online browsing and checkout experience while exploring in-store shopping. However, the growing transition between online and in-store becomes a challenge to sequential recommender systems for future online interaction prediction due to the lack of holistic modeling of hybrid user behaviors (online and in-store). The challenges are twofold. First, combining online and in-store user behavior data into a single data schema and supporting multiple stages in the model life cycle (pre-training, training, inference, etc.) organically needs a new data pipeline design. Second, online recommender systems, which solely rely on online user behavior sequences, must be redesigned to support online and in-store user data as input under the sequential modeling setting. To overcome the first challenge, we propose a hybrid, omnichannel data pipeline to compile online and in-store user behavior data by caching information from diverse data sources. Later, we introduce a model-agnostic encoder module to the sequential recommender system to interpret the user in-store transaction and augment the modeling capacity for better online interaction prediction given the hybrid user behavior.

[IR-1] Improving feature interactions at Pinterest under industry constraints

Link: https://arxiv.org/abs/2412.01985
Authors: Siddarth Malreddy, Matthew Lawhon, Usha Amrutha Nookala, Aditya Mantha, Dhruvil Deven Badani
Keywords: Adopting advances, industrial settings due, due to unique, recommendation systems, Adopting
Categories: Information Retrieval (cs.IR)

Abstract:Adopting advances in recommendation systems is often challenging in industrial settings due to unique constraints. This paper aims to highlight these constraints through the lens of feature interactions. Feature interactions are critical for accurately predicting user behavior in recommendation systems and online advertising. Despite numerous novel techniques showing superior performance on benchmark datasets like Criteo, their direct application in industrial settings is hindered by constraints such as model latency, GPU memory limitations and model reproducibility. In this paper, we share our learnings from improving feature interactions in Pinterest’s Homefeed ranking model under such constraints. We provide details about the specific challenges encountered, the strategies employed to address them, and the trade-offs made to balance performance with practical limitations. Additionally, we present a set of learning experiments that help guide the feature interaction architecture selection. We believe these insights will be useful for engineers who are interested in improving their model through better feature interaction learning.
