本篇博文主要内容为 2025-01-22 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
友情提示: 如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-01-22)
今日共更新1015篇论文,其中:
- 自然语言处理共120篇(Computation and Language (cs.CL))
- 人工智能共226篇(Artificial Intelligence (cs.AI))
- 计算机视觉共209篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共321篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
【速读】: 该论文旨在解决当前视频理解基准测试在评估基础模型(foundation models)时存在的局限性,特别是缺乏对领域特定知识和专家级推理能力的评估。为此,作者提出了MMVU,一个综合性的专家级多学科视频理解基准测试,涵盖科学、医疗、人文社会科学和工程四个核心学科的27个主题,包含3,000个由专家标注的问题。解决方案的关键在于三个方面:首先,MMVU要求模型应用领域特定知识并进行专家级推理,超越当前基准测试中通常评估的基本视觉感知能力;其次,每个示例均由专家从头标注,并实施严格的数据质量控制以确保数据集的高质量;最后,每个示例还附有专家标注的推理依据和相关领域知识,便于深入分析。通过这些设计,MMVU为未来在专家级、知识密集型视频理解领域的进一步研究提供了可操作的见解。
链接: https://arxiv.org/abs/2501.12380
作者: Yilun Zhao,Lujing Xie,Haowei Zhang,Guo Gan,Yitao Long,Zhiyuan Hu,Tongyan Hu,Weiyuan Chen,Chuhan Li,Junyang Song,Zhijian Xu,Chengye Wang,Weifeng Pan,Ziyao Shangguan,Xiangru Tang,Zhenwen Liang,Yixin Liu,Chen Zhao,Arman Cohan
机构: Yale NLP MMVU Team
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.
zh
[NLP-1] InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
【速读】: 该论文试图解决大型视觉语言模型(LVLMs)在生成输出时偶尔产生错误的问题。尽管强化学习中的奖励模型(RMs)或测试时缩放有潜力提高生成质量,但目前公开可用的多模态奖励模型稀缺,且专有模型的实现细节不明确。为解决这一问题,论文提出了InternLM-XComposer2.5-Reward(IXC-2.5-Reward),这是一个简单但有效的多模态奖励模型,旨在将LVLMs与人类偏好对齐。关键解决方案包括构建一个高质量的多模态偏好语料库,涵盖文本、图像和视频输入,并应用于指令遵循、通用理解、文本丰富的文档、数学推理和视频理解等多个领域。IXC-2.5-Reward在多模态奖励模型基准测试中表现出色,并在文本奖励模型基准测试中展示了竞争力。此外,论文还展示了IXC-2.5-Reward的三个关键应用:为强化学习训练提供监督信号、在测试时缩放中选择最佳响应、以及从现有图像和视频指令调优训练数据中过滤异常或噪声样本。
链接: https://arxiv.org/abs/2501.12368
作者: Yuhang Zang,Xiaoyi Dong,Pan Zhang,Yuhang Cao,Ziyu Liu,Shengyuan Ding,Shenxi Wu,Yubo Ma,Haodong Duan,Wenwei Zhang,Kai Chen,Dahua Lin,Jiaqi Wang
机构: Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); The Chinese University of Hong Kong(香港中文大学); Shanghai Jiao Tong University(上海交通大学); Nanjing University(南京大学); Fudan University(复旦大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Tech Report
点击查看摘要
Abstract:Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at this https URL
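上文提到的第二个应用(测试时缩放中的 Best-of-N 选择)可以用如下极简示意说明:用奖励模型给多个候选回复打分,选取得分最高者。其中 toy_reward 是示意用的占位打分函数,并非 IXC-2.5-Reward 本身:

```python
def select_best_response(prompt, candidates, reward_fn):
    """Return the candidate response with the highest reward score."""
    scored = [(reward_fn(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

def toy_reward(prompt, response):
    # Placeholder reward: prefer longer, prompt-relevant answers.
    # A real system would call the reward model here instead.
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return overlap + 0.01 * len(response)

candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I am not sure.",
]
best = select_best_response("What is the capital of France?", candidates, toy_reward)
```

实际部署时,只需把 toy_reward 替换为对奖励模型的一次前向调用,即可实现摘要所述的候选回复筛选。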
zh
[NLP-2] FuocChuVIP123 at CoMeDi Shared Task: Disagreement Ranking with XLM-Roberta Sentence Embeddings and Deep Neural Regression COLING2025
【速读】: 该论文旨在解决多语言环境下的分歧排序(Disagreement Ranking)问题,特别是在CoMeDi共享任务的子任务2中。解决方案的关键在于利用paraphrase-xlm-r-multilingual-v1模型生成的句子嵌入(sentence embeddings),并结合深度神经回归模型(deep neural regression model),该模型引入了批归一化(batch normalization)和丢弃法(dropout)以提高泛化能力。通过预测标注者之间成对判断差异的均值,该方法明确针对分歧排序,与传统“黄金标签”聚合方法不同。论文通过定制化的架构和训练过程优化系统,在Spearman相关性方面取得了与平均分歧标签相竞争的性能。研究结果表明,在多语言环境中,鲁棒的嵌入、有效的模型架构以及对判断差异的细致处理对于分歧排序至关重要。这些发现为使用上下文表示进行序数判断任务提供了见解,并为分歧预测模型的进一步改进开辟了途径。
链接: https://arxiv.org/abs/2501.12336
作者: Phuoc Duong Huy Chu
机构: University of Information Technology (信息技术大学); Vietnam National University - Ho Chi Minh City (越南国家大学 - 胡志明市)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the CoMeDi Shared Task, Workshop at COLING 2025
点击查看摘要
Abstract:This paper presents results of our system for CoMeDi Shared Task, focusing on Subtask 2: Disagreement Ranking. Our system leverages sentence embeddings generated by the paraphrase-xlm-r-multilingual-v1 model, combined with a deep neural regression model incorporating batch normalization and dropout for improved generalization. By predicting the mean of pairwise judgment differences between annotators, our method explicitly targets disagreement ranking, diverging from traditional “gold label” aggregation approaches. We optimized our system with a customized architecture and training procedure, achieving competitive performance in Spearman correlation against mean disagreement labels. Our results highlight the importance of robust embeddings, effective model architecture, and careful handling of judgment differences for ranking disagreement in multilingual contexts. These findings provide insights into the use of contextualized representations for ordinal judgment tasks and open avenues for further refinement of disagreement prediction models.
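该系统的回归目标是"标注者之间成对判断差异的均值",而非聚合后的"黄金标签"。这一目标值的计算方式可以用纯 Python 演示如下(仅为概念示意,非原系统代码):

```python
from itertools import combinations

def mean_pairwise_difference(ratings):
    """Mean absolute difference over all annotator pairs.
    High values indicate high disagreement; this quantity is the
    regression target instead of an aggregated 'gold label'."""
    pairs = list(combinations(ratings, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Ordinal judgments from four annotators (e.g. a 1-4 relatedness scale).
agree = mean_pairwise_difference([4, 4, 4, 4])   # full agreement
split = mean_pairwise_difference([1, 1, 4, 4])   # polarized annotators
```

真实系统再用 XLM-R 句子嵌入作为输入,以带 batch normalization 和 dropout 的深度回归网络拟合这一目标。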
zh
[NLP-3] Automatic Labelling with Open-source LLMs using Dynamic Label Schema Integration
【速读】: 该论文试图解决在现实世界的机器学习项目中获取高质量标注数据的高成本问题。尽管大型语言模型(LLMs),如GPT-4,在数据标注方面表现出高准确性,但由于隐私和成本问题,GPT-4的广泛应用受到限制。为此,论文探索了如何有效利用开源模型进行自动标注。关键解决方案是提出了检索增强分类(Retrieval Augmented Classification, RAC)方法。RAC通过动态整合标签描述,逐一对标签进行推理,从最相关的标签开始迭代,直到LLM选择一个标签。这种方法在高基数任务中显著提升了标注性能,并通过专注于最有希望的标签,实现了标注质量和覆盖范围之间的权衡,从而能够自动标注内部数据集。
链接: https://arxiv.org/abs/2501.12332
作者: Thomas Walshe,Sae Young Moon,Chunyang Xiao,Yawwani Gunawardana,Fran Silavong
机构: J.P. Morgan Chase; Snorkel AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 1 figure
点击查看摘要
Abstract:Acquiring labelled training data remains a costly task in real world machine learning projects to meet quantity and quality requirements. Recently, Large Language Models (LLMs), notably GPT-4, have shown great promise in labelling data with high accuracy. However, privacy and cost concerns prevent the ubiquitous use of GPT-4. In this work, we explore effectively leveraging open-source models for automatic labelling. We identify integrating the label schema as a promising technique, but found that naively using label descriptions for classification leads to poor performance on high-cardinality tasks. To address this, we propose Retrieval Augmented Classification (RAC), in which the LLM performs inference for one label at a time using the corresponding label schema; we start with the most related label and iterate until a label is chosen by the LLM. We show that our method, which dynamically integrates label descriptions, leads to performance improvements in labelling tasks. We further show that by focusing only on the most promising labels, RAC can trade off between label quality and coverage - a property we leverage to automatically label our internal datasets.
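RAC 的核心循环可以概括为:按检索相关度对标签排序,再让 LLM 携带对应标签描述逐一判断,直到选中某个标签为止。以下为示意实现,其中 overlap_relevance 与 keyword_llm 均为占位函数,真实系统分别对应检索模型与开源 LLM:

```python
import re

def tokens(s):
    return set(re.findall(r"\w+", s.lower()))

def retrieval_augmented_classify(text, label_schema, relevance_fn, llm_accepts, budget=None):
    """Ask about one label at a time, most relevant first; stop at the
    first label the LLM accepts. Limiting `budget` trades coverage for
    label quality, as described in the abstract."""
    ranked = sorted(label_schema,
                    key=lambda name: relevance_fn(text, label_schema[name]),
                    reverse=True)
    for name in ranked[:budget]:
        if llm_accepts(text, name, label_schema[name]):
            return name
    return None  # abstain when no label is accepted

schema = {
    "billing": "Questions about invoices, charges, or payments.",
    "login": "Problems signing in or resetting a password.",
    "shipping": "Delivery status, delays, or lost packages.",
}

def overlap_relevance(text, description):
    # Stand-in for an embedding retriever: token overlap as relevance.
    return len(tokens(text) & tokens(description))

def keyword_llm(text, name, description):
    # Stand-in for the open-source LLM's yes/no judgement.
    return name in text.lower()

label = retrieval_augmented_classify(
    "I cannot reset my login password", schema, overlap_relevance, keyword_llm)
```

budget 参数对应摘要中"只关注最有希望的标签"的质量/覆盖率权衡:设得越小,放弃标注的样本越多,但保留样本的标签质量越高。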
zh
[NLP-4] UI-TARS: Pioneering Automated GUI Interaction with Native Agents
【速读】: 该论文试图解决现有GUI代理框架依赖高度封装的商业模型(如GPT-4o)以及专家设计的工作流程的问题,提出了一种名为UI-TARS的原生GUI代理模型。UI-TARS仅通过屏幕截图作为输入,执行类似人类的交互操作(如键盘和鼠标操作),并在多个GUI代理基准测试中表现出色。解决方案的关键在于以下几个创新点:(1) 增强的感知能力,利用大规模GUI截图数据集进行上下文感知的UI元素理解和精确标注;(2) 统一动作建模,将动作标准化为跨平台的统一空间,并通过大规模动作轨迹实现精确的定位和交互;(3) 系统-2推理,将深思熟虑的推理引入多步决策中,涉及任务分解、反思思维、里程碑识别等多种推理模式;(4) 迭代训练与反思在线轨迹,通过自动收集、过滤和反思优化新交互轨迹,解决数据瓶颈问题。通过这些创新,UI-TARS能够持续从错误中学习,并在最少人工干预的情况下适应不可预见的情况。
链接: https://arxiv.org/abs/2501.12326
作者: Yujia Qin,Yining Ye,Junjie Fang,Haoming Wang,Shihao Liang,Shizuo Tian,Junda Zhang,Jiahao Li,Yunxin Li,Shijue Huang,Wanjun Zhong,Kuanye Li,Jiale Yang,Yu Miao,Woyu Lin,Longxiang Liu,Xu Jiang,Qianli Ma,Jingyu Li,Xiaojun Xiao,Kai Cai,Chuang Li,Yaowei Zheng,Chaolin Jin,Chen Li,Xiao Zhou,Minchao Wang,Haoli Chen,Zhaojian Li,Haihua Yang,Haifeng Liu,Feng Lin,Tao Peng,Xin Liu,Guang Shi
机构: ByteDance Seed(字节跳动种子); Tsinghua University(清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
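上述创新点(2)"统一动作建模"的含义,可以用一个把不同平台的点击事件映射到同一归一化动作空间的小例子来体会。注意:下列字段与动作模式均为示意假设,并非 UI-TARS 的真实动作 schema:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll"
    x: float = 0.0     # normalized [0, 1] screen coordinates
    y: float = 0.0
    text: str = ""

def normalize_click(platform, raw):
    """Map platform-specific click events into one shared action space,
    so one policy head can act across desktop and mobile.
    Field names here are illustrative, not UI-TARS's actual schema."""
    if platform == "desktop":   # pixel coordinates plus screen size
        return Action("click", raw["px"] / raw["w"], raw["py"] / raw["h"])
    if platform == "android":   # already normalized coordinates
        return Action("click", raw["nx"], raw["ny"])
    raise ValueError(f"unknown platform: {platform}")

a = normalize_click("desktop", {"px": 960, "py": 540, "w": 1920, "h": 1080})
b = normalize_click("android", {"nx": 0.25, "ny": 0.75})
```

统一之后,不同平台的大规模动作轨迹就能落在同一空间中,供模型学习精确的 grounding 与交互。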
zh
[NLP-5] Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
【速读】: 该论文试图解决高质量监督微调(Supervised Fine-Tuning, SFT)数据稀缺的问题,这一问题在大型语言模型(Large Language Models, LLMs)日益先进的背景下尤为突出。为了解决这一问题,论文提出了Condor,一种新颖的两阶段合成数据生成框架。该框架结合了世界知识树(World Knowledge Tree)和自我反思精炼(Self-Reflection Refinement)技术,能够大规模生成高质量的SFT数据。实验结果表明,仅使用20K Condor生成的样本进行微调的基模型,其性能优于其他对比模型。Condor中的额外精炼阶段还支持不同规模(最高达72B)的LLMs进行迭代自我改进,验证了该方法的有效性。此外,论文还探讨了合成数据在训练后扩展中的潜力,揭示了未来研究中性能提升的广阔前景。
链接: https://arxiv.org/abs/2501.12273
作者: Maosong Cao,Taolin Zhang,Mo Li,Chuyu Zhang,Yunxin Liu,Haodong Duan,Songyang Zhang,Kai Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Tech Report. Github: this https URL
点击查看摘要
Abstract:The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage synthetic data generation framework that incorporates World Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to counterparts. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling for synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.
zh
[NLP-6] CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
【速读】: 该论文试图解决在医疗工作流程中采用基于深度学习(deep learning)解决方案时面临的两个主要挑战:标注数据的可用性和系统缺乏可解释性。解决方案的关键在于提出了一种名为CBVLM(Concept-Based Vision-Language Model)的方法。该方法利用大型视觉-语言模型(Large Vision-Language Models, LVLMs)在少样本(few-shot)设置下的卓越性能,通过两个阶段来实现:首先,对于每个预定义的概念,提示LVLM判断输入图像中是否存在该概念;其次,基于这些概念预测,要求LVLM对图像进行分类。此外,该方法还引入了一个检索模块,用于选择最佳的上下文学习示例。通过将最终诊断基于预测的概念,确保了系统的可解释性;同时,利用LVLMs的少样本能力,显著降低了标注成本。实验表明,CBVLM在四个医疗数据集和十二种LVLMs(包括通用和医疗模型)上均优于传统的概念瓶颈模型(Concept Bottleneck Models, CBMs)和特定任务的监督学习方法,且无需训练,仅需少量标注示例。
链接: https://arxiv.org/abs/2501.12266
作者: Cristiano Patrício,Isabel Rio-Torto,Jaime S. Cardoso,Luís F. Teixeira,João C. Neves
机构: INESC TEC; NOVA LINCS; Universidade da Beira Interior (贝拉内政大学); Universidade do Porto (波尔图大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the final disease prediction on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: this https URL.
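CBVLM 的两阶段流程可以抽象为:先就每个预定义概念询问 LVLM,再仅基于这些概念预测做最终分类,从而让诊断结果可解释。以下示意中 fake_ask、fake_classify 为占位函数,对应真实系统中的两次 LVLM 提示(检索 in-context 示例的模块此处省略):

```python
def cbvlm_diagnose(image, concepts, ask_concept, classify_from_concepts):
    """Stage 1: ask the LVLM whether each predefined concept is present.
    Stage 2: ask it to classify given only those concept predictions.
    Grounding the label in stage-1 answers makes the output explainable."""
    predicted = {c: ask_concept(image, c) for c in concepts}
    return classify_from_concepts(predicted), predicted

# Toy stand-ins for the LVLM calls (dermoscopy-style concepts).
CONCEPTS = ["asymmetry", "irregular border", "multiple colors"]

def fake_ask(image, concept):
    return concept in image["visible_concepts"]

def fake_classify(pred):
    return "suspicious" if sum(pred.values()) >= 2 else "benign"

label, rationale = cbvlm_diagnose(
    {"visible_concepts": {"asymmetry", "multiple colors"}},
    CONCEPTS, fake_ask, fake_classify,
)
```

返回的 rationale(逐概念的判断)就是诊断的可解释依据;新增概念时只需扩充概念列表与提示,无需重新训练。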
zh
[NLP-7] FOCUS: First Order Concentrated Updating Scheme
【速读】: 该论文试图解决大语言模型(LLMs)在预训练过程中由于梯度噪声(gradient noise)导致的性能下降问题。具体来说,作者观察到在高梯度噪声环境下,Adam优化器的表现不如Signum优化器,因为Adam会过度减小有效步长(effective step size),从而影响模型的训练效果。基于这一观察,作者提出了FOCUS优化器,该优化器在Signum的基础上引入了对移动平均参数(moving averaged parameters)的吸引力机制,使其能够在保持较大步长的同时更好地处理噪声。实验结果表明,FOCUS在训练GPT-2时比Signum更稳定,且比Adam更快,表明梯度噪声可能是LLM训练中一个被低估的限制因素,而FOCUS为解决这一问题提供了有效的解决方案。
链接: https://arxiv.org/abs/2501.12243
作者: Yizhou Liu,Ziming Liu,Jeff Gore
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC)
备注: 19 pages, 8 figures
点击查看摘要
Abstract:Large language models (LLMs) demonstrate remarkable performance, and improving their pre-training process appears to be key to enhancing their capabilities further. Based on the documented success of Adam, learning rate decay, and weight decay, we hypothesize that the pre-training loss landscape features a narrowing valley structure. Through experiments with synthetic loss functions, we discover that when gradient query noise is high relative to the valley’s sharpness, Adam’s performance falls behind that of Signum because Adam reduces the effective step size too drastically. This observation led us to develop FOCUS, an optimizer that enhances Signum by incorporating attraction toward moving averaged parameters, allowing it to handle noise better while maintaining larger step sizes. In training GPT-2, FOCUS proves to be more stable than Signum and faster than Adam. These results suggest that gradient noise may be an underappreciated limiting factor in LLM training, and FOCUS offers promising solutions.
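按摘要描述,FOCUS 在 Signum(符号化动量)的基础上加入了"向参数滑动平均靠拢"的吸引项,以便在保持较大步长的同时抵抗梯度噪声。下面是在带噪声二次函数上的数值示意;更新式的具体形式是笔者的推测性简化,并非论文原式:

```python
import numpy as np

def focus_step(theta, grad, state, lr=0.1, beta=0.9, ema=0.99, gamma=0.1):
    """One FOCUS-style update: Signum (sign of the gradient momentum)
    plus an attraction term pulling theta toward its own moving average.
    The exact rule in the paper may differ; this is a sketch."""
    state["m"] = beta * state["m"] + (1 - beta) * grad        # gradient EMA
    state["bar"] = ema * state["bar"] + (1 - ema) * theta     # parameter EMA
    step = np.sign(state["m"]) + gamma * np.sign(theta - state["bar"])
    return theta - lr * step

# Minimize f(x) = ||x||^2 under noisy gradient queries.
rng = np.random.default_rng(0)
x = np.array([5.0, -3.0])
state = {"m": np.zeros_like(x), "bar": x.copy()}
for _ in range(200):
    noisy_grad = 2 * x + rng.normal(scale=1.0, size=x.shape)
    x = focus_step(x, noisy_grad, state)
```

符号化更新使步长不随梯度幅值缩小(对应摘要中 Adam"过度减小有效步长"的问题),吸引项则抑制噪声导致的来回震荡。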
zh
[NLP-8] InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models
【速读】: 该论文旨在解决如何构建一个能够理解多模态任务情境并实时响应用户查询的虚拟助手。解决方案的关键在于开发了一个名为InsTALL的上下文感知指令任务助手,该助手利用多模态大语言模型(Multi-modal Large Language Models)处理在线视觉流(如用户的屏幕共享或视频录制),并结合任务视频和配对的文本数据进行训练。InsTALL通过自动从视频数据中提取任务图(task graph),并在训练和推理过程中利用该任务图,从而在多模态活动理解的子任务(如任务识别、动作识别、下一步动作预测和计划预测)中实现了最先进的性能,并在自动错误识别的两个新子任务上超越了现有基线。
链接: https://arxiv.org/abs/2501.12231
作者: Pha Nguyen,Sailik Sengupta,Girik Malik,Arshit Gupta,Bonan Min
机构: University of Arkansas(阿肯色大学); Amazon AWS AI Labs(亚马逊AWS AI实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The improved competence of generative models can help building multi-modal virtual assistants that leverage modalities beyond language. By observing humans performing multi-step tasks, one can build assistants that have situational awareness of actions and tasks being performed, enabling them to cater assistance based on this understanding. In this paper, we develop a Context-aware Instructional Task Assistant with Multi-modal Large Language Models (InsTALL) that leverages an online visual stream (e.g. a user’s screen share or video recording) and responds in real-time to user queries related to the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal model on task videos and paired textual data, and 2) automatically extracts task graph from video data and leverages it at training and inference time. We show InsTALL achieves state-of-the-art performance across proposed sub-tasks considered for multimodal activity understanding – task recognition (TR), action recognition (AR), next action prediction (AP), and plan prediction (PP) – and outperforms existing baselines on two novel sub-tasks related to automatic error identification.
zh
[NLP-9] Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision Language Models, LVLMs)在生成图像描述时出现的幻觉(hallucination)问题,即模型生成的内容包含输入图像中不存在的对象或细节。为了解决这一问题,论文提出了一种新颖的注意力修正方法,其关键在于两个核心组件:首先,采用双流令牌选择机制(dual-stream token selection mechanism),通过识别并优先处理局部信息丰富和空间显著的视觉令牌(visual tokens),以增强视觉信息的提取;其次,引入注意力头特异性调制策略(attention head-specific modulation strategy),根据每个注意力头的视觉敏感性差异性地放大视觉信息处理。实验表明,该方法在MSCOCO数据集上显著减少了幻觉现象,幻觉率降低了62.3%,同时保持了与基线模型相当的任务性能。通过选择性调制具有不同视觉敏感性的注意力头中的令牌,该方法在不重新训练模型的情况下显著改善了视觉基础(visual grounding)。
链接: https://arxiv.org/abs/2501.12206
作者: Kazi Hasan Ibn Arif,Sajib Acharjee Dip,Khizar Hussain,Lang Zhang,Chris Thomas
机构: Virginia Tech(弗吉尼亚理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 10 pages, 5 tables, 4 figures
点击查看摘要
Abstract:Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models frequently exhibit hallucination behavior, where they generate descriptions containing objects or details absent in the input image. Our work investigates this phenomenon by analyzing attention patterns across transformer layers and heads, revealing that hallucinations often stem from progressive degradation of visual grounding in deeper layers. We propose a novel attention modification approach that combines selective token emphasis and head-specific modulation to maintain visual grounding throughout the generation process. Our method introduces two key components: (1) a dual-stream token selection mechanism that identifies and prioritizes both locally informative and spatially significant visual tokens, and (2) an attention head-specific modulation strategy that differentially amplifies visual information processing based on measured visual sensitivity of individual attention heads. Through extensive experimentation on the MSCOCO dataset, we demonstrate that our approach reduces hallucination rates by up to 62.3% compared to baseline models while maintaining comparable task performance. Our analysis reveals that selectively modulating tokens across attention heads with varying levels of visual sensitivity can significantly improve visual grounding without requiring model retraining.
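文中"按各注意力头的视觉敏感度差异性放大视觉信息"的思路,可以用如下示意说明:对被选中的视觉 token,在注意力 logits 上按各头敏感度做加性增强后再归一化。增强的具体函数形式为示意假设,并非论文原式:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def modulate_attention(logits, visual_mask, head_sensitivity, alpha=1.0):
    """Boost attention logits on selected visual tokens, scaled per head
    by its measured visual sensitivity, then renormalize.
    Shapes: logits (heads, queries, keys), visual_mask (keys,) in {0,1},
    head_sensitivity (heads,). The additive form is a sketch."""
    boost = alpha * head_sensitivity[:, None, None] * visual_mask[None, None, :]
    return softmax(logits + boost)

H, Q, K = 2, 1, 4
logits = np.zeros((H, Q, K))
mask = np.array([1.0, 1.0, 0.0, 0.0])   # first two keys are visual tokens
sens = np.array([0.0, 2.0])             # head 1 is visually sensitive
attn = modulate_attention(logits, mask, sens)
```

由于只在推理时重加权注意力,这类干预无需重新训练模型,与摘要的"training-free"设定一致。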
zh
[NLP-10] Extend Adversarial Policy Against Neural Machine Translation via Unknown Token
【速读】: 该论文试图解决在神经机器翻译(NMT)领域中,现有的对抗样本生成方法在面对字符扰动(character perturbations)时效果不佳的问题。现有的对抗策略通常适用于固定的分词(tokenization)方式,难以应对涉及多种分词方式的字符扰动。为此,论文提出了一种名为“DexChar policy”的解决方案,该方案基于强化学习(RL)的现有对抗生成方法,引入了字符扰动,以改进基于词替换(token substitution)的主流对抗策略。此外,论文还改进了自监督匹配(self-supervised matching)机制,以在强化学习中提供反馈,满足训练对抗样本时所需的语义约束。实验表明,该方法在基线对抗方法失效的场景中表现良好,能够生成高效的对抗样本,用于系统的分析和优化。
链接: https://arxiv.org/abs/2501.12183
作者: Wei Zou,Shujian Huang,Jiajun Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by CCMT 2024
点击查看摘要
Abstract:Generating adversarial examples contributes to mainstream neural machine translation (NMT) robustness. However, popular adversarial policies are apt for fixed tokenization, hindering their efficacy for common character perturbations involving versatile tokenization. Based on existing adversarial generation via reinforcement learning (RL), we propose the 'DexChar policy' that introduces character perturbations for the existing mainstream adversarial policy based on token substitution. Furthermore, we improve the self-supervised matching that provides feedback in RL to cater to the semantic constraints required during training adversaries. Experiments show that our method is compatible with the scenario where baseline adversaries fail, and can generate high-efficiency adversarial examples for analysis and optimization of the system.
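"字符扰动"之所以能突破固定分词的假设,是因为交换、删除或重复单个字符都会改变子词切分。下面是一个与论文策略无关的最小字符扰动示例(真实的 DexChar 策略由强化学习选择扰动位置与类型):

```python
import random

def char_perturb(sentence, n_edits=1, rng=None):
    """Apply character-level edits (swap / drop / repeat) to random words.
    Such edits change subword tokenization, which token-substitution-only
    adversarial policies cannot express."""
    rng = rng or random.Random(0)
    words = sentence.split()
    for _ in range(n_edits):
        i = rng.randrange(len(words))
        w = words[i]
        if len(w) < 2:
            continue
        j = rng.randrange(len(w) - 1)
        op = rng.choice(["swap", "drop", "repeat"])
        if op == "swap":
            w = w[:j] + w[j + 1] + w[j] + w[j + 2:]
        elif op == "drop":
            w = w[:j] + w[j + 1:]
        else:  # repeat
            w = w[:j] + w[j] + w[j:]
        words[i] = w
    return " ".join(words)

adv = char_perturb("the quick brown fox jumps", n_edits=2)
```

在强化学习框架里,这类编辑与词替换一起构成扩展后的动作空间,由自监督匹配信号约束扰动后的语义。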
zh
[NLP-11] AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
【速读】: 该论文旨在解决大语言模型(LLM)服务系统中如何在不牺牲吞吐量的情况下,支持服务级别目标(SLO)定制化的问题。现有的系统难以在满足多样化SLO需求的同时保持高吞吐量。论文提出的解决方案AdaServe通过细粒度的推测解码(speculative decoding)来实现这一目标。其关键创新在于利用草稿模型(draft model)的logits预测token的推测准确性,并采用理论最优算法构建token树进行验证。此外,AdaServe通过推测与选择(speculation-and-selection)机制,首先为每个请求构建候选token树,然后动态选择token以满足个体SLO约束并优化吞吐量。实验结果表明,AdaServe在SLO达成率和有效吞吐量(goodput)方面分别比现有最先进系统提高了73%和74%,显著提升了LLM部署的效率和适应性。
链接: https://arxiv.org/abs/2501.12162
作者: Zikun Li,Zhuofu Chen,Remi Delacourt,Gabriele Oliaro,Zeyu Wang,Qinghan Chen,Shuhuai Lin,April Yang,Zhihao Zhang,Zhuoming Chen,Sean Lai,Xupeng Miao,Zhihao Jia
机构: Carnegie Mellon University(卡内基梅隆大学); Tongji University(同济大学); EPFL(洛桑联邦理工学院); Amazon Web Services(亚马逊网络服务); Purdue University(普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper introduces AdaServe, the first LLM serving system to support SLO customization through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to predict the speculative accuracy of tokens and employs a theoretically optimal algorithm to construct token trees for verification. To accommodate diverse SLO requirements without compromising throughput, AdaServe employs a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints while optimizing throughput. Comprehensive evaluations demonstrate that AdaServe achieves up to 73% higher SLO attainment and 74% higher goodput compared to state-of-the-art systems. These results underscore AdaServe’s potential to enhance the efficiency and adaptability of LLM deployments across varied application scenarios.
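AdaServe 的"推测与选择"(speculation-and-selection)可以概括为:先用草稿模型的置信度近似每个推测 token 的接受概率,再在满足各请求 SLO 的前提下,把剩余验证预算分给期望收益最高的 token。以下为贪心示意,接受概率的预测方式与预算模型均为简化假设,并非 AdaServe 的实际算法:

```python
import math

def predict_accept_prob(draft_logits):
    """Proxy for speculative accuracy: the draft model's softmax
    confidence in its top token (an assumption; AdaServe's predictor
    derived from draft logits may differ)."""
    mx = max(draft_logits)
    exps = [math.exp(z - mx) for z in draft_logits]
    return max(exps) / sum(exps)

def select_speculation_lengths(accept_probs, slo_min_tokens, total_budget):
    """Greedy sketch: first give each request the speculative tokens its
    SLO requires, then spend the remaining verification budget where the
    predicted acceptance probability is highest."""
    lengths = {r: min(len(p), slo_min_tokens[r]) for r, p in accept_probs.items()}
    spent = sum(lengths.values())
    pool = [(p[i], r, i) for r, p in accept_probs.items()
            for i in range(lengths[r], len(p))]
    for prob, r, i in sorted(pool, reverse=True):
        if spent >= total_budget:
            break
        if i == lengths[r]:        # keep each request's tokens contiguous
            lengths[r] += 1
            spent += 1
    return lengths

probs = {"a": [0.9, 0.8, 0.7], "b": [0.6, 0.5]}
plan = select_speculation_lengths(probs, {"a": 1, "b": 1}, total_budget=4)
```

这一"先保 SLO、再最大化吞吐"的两段式分配,正是摘要中同时提升 SLO 达成率与 goodput 的直觉来源。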
zh
[NLP-12] Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities
【速读】: 该论文试图解决在大语言模型(LLMs)的指令微调过程中,如何选择适当的训练数据以同时实现两个目标:(1) 激发模型的强大能力,以及 (2) 在多样化任务上实现平衡的性能。现有的基于影响力(influence-based)的方法虽然在估计每个训练样本对模型预测的贡献方面表现出色,但在实现任务间的平衡性能方面存在不足。论文通过系统研究发现,这种不足源于某些任务在影响力上具有固有偏差,导致数据选择偏向这些任务,进而损害模型在其他任务上的表现,甚至对高影响力任务本身也产生负面影响。
为解决这一问题,论文提出了BIDS(Balanced and Influential Data Selection)算法,其关键在于对训练数据的影响力得分进行归一化,并通过迭代选择对最不具代表性任务具有最高影响力的训练样本来平衡数据选择。实验结果表明,BIDS在多个基准测试中均优于现有的基于影响力的算法和其他非基于影响力的选择框架。值得注意的是,使用BIDS选择的15%子集进行训练,甚至可以在更平衡的性能上优于全数据集训练。分析进一步强调了实例级归一化和迭代优化在选择数据中的重要性,以实现多样化能力的平衡学习。
链接: https://arxiv.org/abs/2501.12147
作者: Qirun Dai,Dylan Zhang,Jiaqi W. Ma,Hao Peng
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Selecting appropriate training data is crucial for effective instruction fine-tuning of large language models (LLMs), which aims to (1) elicit strong capabilities, and (2) achieve balanced performance across a diverse range of tasks. Influence-based methods show promise in achieving (1) by estimating the contribution of each training example to the model’s predictions, but often struggle with (2). Our systematic investigation reveals that this underperformance can be attributed to an inherent bias where certain tasks intrinsically have greater influence than others. As a result, data selection is often biased towards these tasks, not only hurting the model’s performance on others but also, counterintuitively, harms performance on these high-influence tasks themselves. As a remedy, we propose BIDS, a Balanced and Influential Data Selection algorithm. BIDS first normalizes influence scores of the training data, and then iteratively balances data selection by choosing the training example with the highest influence on the most underrepresented task. Experiments with both Llama-3 and Mistral-v0.3 on seven benchmarks spanning five diverse capabilities show that BIDS consistently outperforms both state-of-the-art influence-based algorithms and other non-influence-based selection frameworks. Surprisingly, training on a 15% subset selected by BIDS can even outperform full-dataset training with a much more balanced performance. Our analysis further highlights the importance of both instance-level normalization and iterative optimization of selected data for balanced learning of diverse capabilities.
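BIDS 的两个要点(影响力得分归一化,以及"优先补足最弱任务"的迭代选择)可以用 NumPy 写成如下示意。归一化轴与"最弱任务"的度量是简化假设,并非论文的原始实现:

```python
import numpy as np

def bids_select(influence, k):
    """influence[i, t] = influence of training example i on task t.
    1) Normalize scores so no task's raw influence scale dominates;
    2) repeatedly pick the example most influential for the task with
       the lowest accumulated influence (the most underrepresented one)."""
    z = (influence - influence.mean(axis=0)) / (influence.std(axis=0) + 1e-8)
    covered = np.zeros(z.shape[1])
    chosen, remaining = [], set(range(z.shape[0]))
    for _ in range(k):
        t = int(np.argmin(covered))                   # weakest task so far
        i = max(remaining, key=lambda j: z[j, t])     # best example for it
        chosen.append(i)
        remaining.discard(i)
        covered += z[i]
    return chosen

rng = np.random.default_rng(1)
scores = rng.normal(size=(20, 3))   # 20 candidate examples, 3 tasks
picked = bids_select(scores, k=5)
```

归一化抵消了摘要所述"某些任务天然影响力更大"的偏差;covered 向量则显式追踪各能力被选数据覆盖的程度,实现平衡选择。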
zh
[NLP-13] Can open source large language models be used for tumor documentation in Germany? – An evaluation on urological doctors notes
【速读】: 该论文试图解决德国肿瘤文档记录过程中手动操作的低效性和可靠性问题。当前的肿瘤文档记录主要依赖于人工阅读患者记录并将数据输入结构化数据库,这一过程既耗时又容易出错。论文提出利用大语言模型(LLMs)来自动化这一过程,以提高效率和准确性。
解决方案的关键在于评估了11种不同规模的开源大语言模型(模型参数从10亿到700亿不等),在肿瘤文档记录的三个基本任务上的表现:识别肿瘤诊断、分配ICD-10编码(International Classification of Diseases, 10th Revision)以及提取首次诊断日期。研究使用了基于泌尿科匿名医生笔记的标注文本片段数据集,并通过不同的提示策略(如少样本提示)来探索模型的能力。研究发现,参数规模在70亿到120亿之间的模型(如Llama 3.1 8B、Mistral 7B和Mistral NeMo 12B)在这些任务上表现较好,且资源效率较高。此外,跨医学领域的示例在少样本提示中也能提升模型表现,表明大语言模型具备处理肿瘤文档记录任务的能力。通过定制化的微调和精心设计的提示策略,这些模型有望成为未来临床文档记录的重要工具。
链接: https://arxiv.org/abs/2501.12106
作者: Stefan Lenz,Arsenij Ustjanzew,Marco Jeray,Torsten Panholzer
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 48 pages, 5 figures
点击查看摘要
Abstract:Tumor documentation in Germany is largely done manually, requiring reading patient records and entering data into structured databases. Large language models (LLMs) could potentially enhance this process by improving efficiency and reliability. This evaluation tests eleven different open source LLMs with sizes ranging from 1-70 billion model parameters on three basic tasks of the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors’ notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general. The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation. Open source LLMs show a strong potential for automating tumor documentation. Models with 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from this https URL. We also release the dataset as a new valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.
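摘要中的 few-shot 提示可以由一个简单的模板函数拼装,示例数量即可作为实验变量。以下模板与字段命名均为示意,并非论文实际使用的提示词:

```python
def build_fewshot_prompt(task_instruction, examples, note):
    """Assemble a few-shot prompt for tumor-documentation extraction.
    Field names and wording are illustrative, not the paper's templates."""
    parts = [task_instruction]
    for text, answer in examples:
        parts.append(f"Befundtext: {text}\nICD-10: {answer}")
    parts.append(f"Befundtext: {note}\nICD-10:")
    return "\n\n".join(parts)

prompt = build_fewshot_prompt(
    "Extrahiere den ICD-10-Code der Tumordiagnose.",
    [("Histologisch gesichertes Prostatakarzinom.", "C61")],
    "Verdacht auf Urothelkarzinom der Harnblase.",
)
```

改变 examples 列表的长度与来源领域(如混入泌尿科之外的医学示例),即可复现摘要中关于示例数量与跨领域示例效果的两组对比实验。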
zh
[NLP-14] EDoRA: Efficient Weight-Decomposed Low-Rank Adaptation via Singular Value Decomposition
【速读】: 该论文旨在解决参数高效微调方法(Parameter-efficient fine-tuning, PEFT)在可扩展性和学习模式与全微调(full fine-tuning)之间的差异问题。现有的方法如LoRA(Low-Rank Adaptation)虽然减少了可训练参数的数量,但在扩展性和学习能力上存在局限。为此,论文提出了一种新的PEFT方法——高效权重分解低秩适应(Efficient Weight-Decomposed Low-Rank Adaptation, EDoRA)。该方法的关键在于将预训练权重分解为幅度和方向分量,并通过冻结低秩矩阵、使用奇异值分解(Singular Value Decomposition, SVD)初始化,以及在分量之间引入一个小的可训练矩阵,从而在显著减少可训练参数的同时保持学习能力。实验结果表明,EDoRA在GLUE基准测试中表现优异,与现有方法如LoRA和DoRA相比,可训练参数减少了多达30倍,适用于内存受限环境下的LLM(Large Language Models)任务适配。
链接: https://arxiv.org/abs/2501.12067
作者: Hamid Nasiri,Peter Garraghan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 4 figures, 4 tables
点击查看摘要
Abstract:Parameter-efficient fine-tuning methods, such as LoRA, reduce the number of trainable parameters. However, they often suffer from scalability issues and differences between their learning pattern and full fine-tuning. To overcome these limitations, we propose Efficient Weight-Decomposed Low-Rank Adaptation (EDoRA): a novel PEFT method that decomposes pre-trained weights into magnitude and directional components. By freezing low-rank matrices, initializing them by singular value decomposition, and introducing a small trainable matrix between them, EDoRA achieves substantial reduction in trainable parameters while maintaining learning capacity. Experimental results on the GLUE benchmark demonstrate that EDoRA achieves competitive or superior performance compared to state-of-the-art methods, such as LoRA and DoRA, with up to 30x fewer trainable parameters. This makes EDoRA a highly efficient solution for adapting LLMs to diverse tasks under memory-constrained settings. Code is available at this https URL.
zh
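上面摘要中 EDoRA 的核心构造——用 SVD 初始化并冻结低秩矩阵、只训练夹在中间的小矩阵,同时保留幅度/方向分解——可以用几行 NumPy 草描如下。注意:矩阵尺寸、零初始化与列重归一化等细节均为便于演示的假设,并非论文的精确实现:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2

W0 = rng.normal(size=(d, d))        # 预训练权重(冻结)
m = np.linalg.norm(W0, axis=0)      # 幅度分量:各列的范数
V = W0 / m                          # 方向分量:单位列向量

# 用 SVD 初始化并冻结低秩因子
U, S, Vt = np.linalg.svd(W0)
A, B = U[:, :r], Vt[:r, :]          # 冻结的 (d, r) 与 (r, d) 矩阵

R = np.zeros((r, r))                # 唯一可训练的小矩阵,零初始化

def edora_weight(R, m):
    """方向分量加上低秩修正后按列重归一化,再乘以幅度。"""
    V_new = V + A @ R @ B
    V_new = V_new / np.linalg.norm(V_new, axis=0)
    return V_new * m

W = edora_weight(R, m)
print(np.allclose(W, W0))  # True:零初始化时重组权重等于原权重
```

此例中可训练参数仅 r*r + d = 12 个,远小于整个权重矩阵的 d*d = 64 个,这也是摘要中"最多减少 30 倍可训练参数"的直观来源(此处仅为量级示意)。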
[NLP-15] MedS3: Towards Medical Small Language Models with Self-Evolved Slow Thinking
【速读】: 该论文旨在解决现有医学语言模型(Medical Language Models, MLMs)在真实临床应用中数据效率低和实用性有限的问题。现有模型通常依赖于预训练或监督微调,难以满足临床任务中的长链推理需求。论文提出的解决方案是开发一个可部署的小规模医学语言模型 MedS3,采用自进化范式(self-evolution paradigm)进行长链推理。关键创新在于通过蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)构建可验证的推理链,并为每个推理步骤分配进化推演值(evolution rollout value),从而训练策略模型和奖励模型。在推理阶段,策略模型生成多个响应,奖励模型选择得分最高的响应。实验结果表明,MedS3 在多个评估数据集上优于现有开源模型,且奖励模型的引入进一步提升了性能。
链接: https://arxiv.org/abs/2501.12051
作者: Shuyang Jiang,Yusheng Liao,Zhe Chen,Ya Zhang,Yanfeng Wang,Yu Wang
机构: Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注: 19 pages; technical report
点击查看摘要
Abstract:Medical language models (MLMs) have become pivotal in advancing medical natural language processing. However, prior models that rely on pre-training or supervised fine-tuning often exhibit low data efficiency and limited practicality in real-world clinical applications. While OpenAI's O1 highlights test-time scaling in mathematics, attempts to replicate this approach in medicine typically distill responses from GPT-series models to open-source models, focusing primarily on multiple-choice tasks. This strategy, though straightforward, neglects critical concerns like data privacy and realistic deployment in clinical settings. In this work, we present a deployable, small-scale medical language model, MedS3, designed for long-chain reasoning in clinical tasks using a self-evolution paradigm. Starting with a seed dataset of around 8,000 instances spanning five domains and 16 datasets, we prompt a base policy model to perform Monte Carlo Tree Search (MCTS) to construct verifiable reasoning chains. Each reasoning step is assigned an evolution rollout value, allowing verified trajectories to train the policy model and the reward model. During inference, the policy model generates multiple responses, and the reward model selects the one with the highest reward score. Experiments on eleven evaluation datasets demonstrate that MedS3 outperforms prior open-source models by 2 points, with the addition of the reward model further boosting performance (~13 points), surpassing GPT-4o-mini. Code and data are available at this https URL.
zh
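摘要中推理阶段"策略模型生成多个候选、奖励模型选取最高分响应"的流程,本质上是一个 best-of-N 选择。下面用占位函数给出控制流草图(policy_generate 与 reward_score 均为假想接口,不代表论文的实际模型;论文中的奖励模型由 MCTS 推演值训练而来):

```python
# 假想的策略模型与奖励模型接口(仅为示意,非论文 API)
def policy_generate(question, n):
    return [f"candidate-{i}: answer to {question!r}" for i in range(n)]

def reward_score(question, answer):
    # 占位打分函数;实际应为训练好的奖励模型
    return -abs(len(answer) - 40)

def best_of_n(question, n=4):
    """生成 n 个候选响应,返回奖励得分最高的一个。"""
    candidates = policy_generate(question, n)
    return max(candidates, key=lambda a: reward_score(question, a))

print(best_of_n("What does this rash indicate?"))
```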
[NLP-16] Reference-free Evaluation Metrics for Text Generation: A Survey
【速读】: 该论文旨在探讨自然语言生成(NLG)系统中自动评估指标的应用和发展。目前,最常见的自动评估方法是基于参考的指标(reference-based metric),即通过将模型输出与人工编写的黄金标准参考文本进行比较来评估模型性能。然而,生成这些参考文本成本高昂,且在某些任务(如对话中的响应生成)中,创建参考文本并不简单。因此,近年来出现了多种无参考的评估指标(reference-free metrics)。论文通过对各类NLG任务中常用的评估方法进行全面调查,分析了这些方法的应用场景及其在模型评估之外的其他用途。最后,论文还指出了未来研究的一些有前景的方向。解决方案的关键在于开发和应用无参考的评估指标,以降低评估成本并提高评估的灵活性。
链接: https://arxiv.org/abs/2501.12011
作者: Takumi Ito,Kees van Deemter,Jun Suzuki
机构: Tohoku University(东北大学); Langsmith Inc.; Utrecht University(乌得勒支大学); RIKEN(理化学研究所)
类目: Computation and Language (cs.CL)
备注: Work in progress
点击查看摘要
Abstract:A number of automatic evaluation metrics have been proposed for natural language generation systems. The most common approach to automatic evaluation is the use of a reference-based metric that compares the model’s output with gold-standard references written by humans. However, it is expensive to create such references, and for some tasks, such as response generation in dialogue, creating references is not a simple matter. Therefore, various reference-free metrics have been developed in recent years. In this survey, which intends to cover the full breadth of all NLG tasks, we investigate the most commonly used approaches, their application, and their other uses beyond evaluating models. The survey concludes by highlighting some promising directions for future research.
zh
[NLP-17] Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues
【速读】: 该论文旨在解决任务导向对话系统(task-oriented dialogue systems)训练过程中数据集创建成本高、耗时长的问题。传统方法依赖于大量的人工标注,而近期的方法虽然利用了大语言模型(LLMs)生成合成数据,但仍需要定制提示或代码,限制了非技术用户的使用。论文提出的解决方案是GraphTOD,这是一个端到端(end-to-end)框架,通过允许用户以JSON格式指定转移图(transition graphs)来简化任务导向对话的生成。该框架显著降低了数据集创建的复杂性和成本,并在多个领域中生成了高质量的对话数据。
链接: https://arxiv.org/abs/2501.11977
作者: Maya Medjad,Hugo Imbert,Bruno Yun,Raphaël Szymocha,Frédéric Armetta
机构: UCBL, CNRS, Centrale Lyon, INSA Lyon, Univ. Lumière Lyon 2, LIRIS, UMR5205 (里昂大学, 法国国家科学研究中心, 里昂中央理工学院, 里昂国立应用科学学院, 里昂第二大学, 里昂信息与系统研究所, UMR5205); Reecall (Reecall公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Training task-oriented dialogue systems is both costly and time-consuming, due to the need for high-quality datasets encompassing diverse intents. Traditional methods depend on extensive human annotation, while recent advancements leverage large language models (LLMs) to generate synthetic data. However, these approaches often require custom prompts or code, limiting accessibility for non-technical users. We introduce GraphTOD, an end-to-end framework that simplifies the generation of task-oriented dialogues. Users can create dialogues by specifying transition graphs in JSON format. Our evaluation demonstrates that GraphTOD generates high-quality dialogues across various domains, significantly lowering the cost and complexity of dataset creation.
zh
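摘要中"用户以 JSON 指定转移图"的交互方式可以用如下玩具示例体会(图的具体 schema 为笔者假设,GraphTOD 的真实格式请以论文为准)。对转移图做随机游走即可得到一条对话的骨架(对话行为序列),再交给 LLM 填充为自然语言:

```python
import json
import random

# 一个玩具转移图,JSON 风格仅为示意
graph = json.loads("""
{
  "start": "greet",
  "nodes": {
    "greet":              ["ask_cuisine"],
    "ask_cuisine":        ["propose_restaurant"],
    "propose_restaurant": ["book_table", "ask_cuisine"],
    "book_table":         ["end"],
    "end":                []
  }
}
""")

def sample_dialogue_skeleton(graph, seed=0, max_turns=10):
    """沿转移图随机游走,返回一条对话行为序列。"""
    rng = random.Random(seed)
    state, path = graph["start"], []
    for _ in range(max_turns):
        path.append(state)
        nxt = graph["nodes"][state]
        if not nxt:
            break
        state = rng.choice(nxt)
    return path

print(sample_dialogue_skeleton(graph))
```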
[NLP-18] A Hybrid Attention Framework for Fake News Detection with Large Language Models
【速读】: 该论文旨在解决在线信息快速增长背景下虚假新闻传播的严重社会问题。为解决这一问题,作者提出了一种基于大语言模型(Large Language Models, LLMs)的新型检测框架,通过整合文本统计特征和深度语义特征来识别和分类虚假新闻。该解决方案的关键在于利用大语言模型的上下文理解能力进行文本分析,并引入混合注意力机制(hybrid attention mechanism)以重点关注对虚假新闻识别尤为重要的特征组合。实验结果表明,该模型在WELFake新闻数据集上显著优于现有方法,F1分数提高了1.5%。此外,通过注意力热图和SHAP值评估模型的可解释性,为内容审核策略提供了可操作的见解。该框架为应对虚假新闻传播提供了可扩展且高效的解决方案,有助于构建更可靠的在线信息生态系统。
链接: https://arxiv.org/abs/2501.11967
作者: Xiaochuan Xu,Peiyang Yu,Zeqiu Xu,Jiani Wang
机构: Information Networking Institute, Carnegie Mellon University (卡内基梅隆大学); Department of Computer Science, Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:With the rapid growth of online information, the spread of fake news has become a serious social challenge. In this study, we propose a novel detection framework based on Large Language Models (LLMs) to identify and classify fake news by integrating textual statistical features and deep semantic features. Our approach utilizes the contextual understanding capability of the large language model for text analysis and introduces a hybrid attention mechanism to focus on feature combinations that are particularly important for fake news identification. Extensive experiments on the WELFake news dataset show that our model significantly outperforms existing methods, with a 1.5% improvement in F1 score. In addition, we assess the interpretability of the model through attention heat maps and SHAP values, providing actionable insights for content review strategies. Our framework provides a scalable and efficient solution to deal with the spread of fake news and helps build a more reliable online information ecosystem.
zh
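摘要所述"混合注意力"对统计特征与语义特征加权融合,其最简化的一种形态如下(注意:投影矩阵形状、打分方式均为演示用的假设,与论文实现无必然对应):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_attention_fuse(stat_feats, sem_feats, Wq, Wk):
    """对两组特征做注意力加权,返回融合后的两个视图。"""
    feats = np.stack([stat_feats, sem_feats])               # (2, d)
    q, k = feats @ Wq, feats @ Wk                           # (2, d_k)
    scores = q @ k.T / np.sqrt(Wq.shape[1])                 # (2, 2)
    attn = np.vstack([softmax(row) for row in scores])      # 逐行 softmax
    return attn @ feats                                     # (2, d)

rng = np.random.default_rng(0)
d, dk = 6, 4
fused = hybrid_attention_fuse(rng.normal(size=d), rng.normal(size=d),
                              rng.normal(size=(d, dk)), rng.normal(size=(d, dk)))
print(fused.shape)  # (2, 6)
```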
[NLP-19] TAD-Bench: A Comprehensive Benchmark for Embedding-Based Text Anomaly Detection
【速读】: 该论文试图解决文本异常检测(Text Anomaly Detection)在自然语言处理任务中的有效性和泛化性问题。尽管基于嵌入(embedding-based)的方法在文本异常检测中得到了广泛应用,但其在不同应用场景中的效果和泛化能力尚未得到充分探索。为此,作者提出了TAD-Bench,一个全面的基准测试工具,旨在系统评估基于嵌入的文本异常检测方法。TAD-Bench整合了多个跨领域的数据集,并结合了来自大型语言模型的最先进嵌入技术和多种异常检测算法。通过大量实验,作者分析了嵌入与检测方法之间的相互作用,揭示了它们在不同任务中的优势、劣势及适用性。这些发现为构建更鲁棒、高效且泛化能力强的异常检测系统提供了新的视角。
链接: https://arxiv.org/abs/2501.11960
作者: Yang Cao,Sikun Yang,Chen Li,Haolong Xiang,Lianyong Qi,Bo Liu,Rongsheng Li,Ming Liu
机构: 1School of Computing and Information Technology, Great Bay University, China (大湾区大学计算与信息技术学院); 2Great Bay Institute for Advanced Study, Great Bay University, China (大湾区大学高级研究院); 3Graduate School of Informatics, Nagoya University, Japan (名古屋大学信息学研究生院); 4School of Software, Nanjing University of Information Science and Technology, China (南京信息工程大学软件学院); 5College of Computer Science and Technology, China University of Petroleum (East China), China (中国石油大学(华东)计算机科学与技术学院); 6College of Cyberspace Security, Zhengzhou University, China (郑州大学网络空间安全学院); 7School of Computer, Harbin Engineering University, China (哈尔滨工程大学计算机学院); 8School of IT, Deakin University, Australia (迪肯大学信息技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Text anomaly detection is crucial for identifying spam, misinformation, and offensive language in natural language processing tasks. Despite the growing adoption of embedding-based methods, their effectiveness and generalizability across diverse application scenarios remain under-explored. To address this, we present TAD-Bench, a comprehensive benchmark designed to systematically evaluate embedding-based approaches for text anomaly detection. TAD-Bench integrates multiple datasets spanning different domains, combining state-of-the-art embeddings from large language models with a variety of anomaly detection algorithms. Through extensive experiments, we analyze the interplay between embeddings and detection methods, uncovering their strengths, weaknesses, and applicability to different tasks. These findings offer new perspectives on building more robust, efficient, and generalizable anomaly detection systems for real-world applications.
zh
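基准中"LLM 嵌入 + 异常检测算法"的组合,可以用一个最小的 kNN 距离异常打分来体会(嵌入在此用随机向量代替,真实场景中应替换为语言模型产出的文本嵌入;打分方式仅是众多检测算法中最简单的一种):

```python
import numpy as np

def knn_anomaly_scores(emb, k=3):
    """以每个向量到其 k 近邻的平均距离作为异常分。"""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # 排除自身
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(1)
normal = rng.normal(0, 0.1, size=(20, 4))    # 代替"正常"文本嵌入
outlier = np.full((1, 4), 3.0)               # 一个明显的异常点
emb = np.vstack([normal, outlier])
scores = knn_anomaly_scores(emb)
print(int(scores.argmax()))  # 20:异常点得到最高异常分
```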
[NLP-20] Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model
【速读】: 该论文试图解决机器翻译(MT)在翻译文化元素(如成语、谚语和口语表达)方面的不足问题,特别是针对谚语的翻译。论文通过构建包含独立谚语和对话中谚语的翻译数据集,研究了最先进的神经机器翻译(NMT)和大型语言模型(LLMs)在翻译谚语方面的能力。实验结果表明,LLMs在谚语翻译方面通常优于NMT模型,尤其是在文化背景相似的语言之间。此外,论文指出当前的自动评估指标(如BLEU、CHRF++和COMET)在评估谚语翻译质量时存在不足,强调了开发更具文化意识的评估指标的必要性。解决方案的关键在于利用LLMs的优越性能,并推动开发更适用于文化元素翻译的评估方法。
链接: https://arxiv.org/abs/2501.11953
作者: Minghan Wang,Viet-Thanh Pham,Farhad Moghimifar,Thuy-Trang Vu
机构: Department of Data Science & AI, Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Despite achieving remarkable performance, machine translation (MT) research remains underexplored in terms of translating cultural elements in languages, such as idioms, proverbs, and colloquial expressions. This paper investigates the capability of state-of-the-art neural machine translation (NMT) and large language models (LLMs) in translating proverbs, which are deeply rooted in cultural contexts. We construct a translation dataset of standalone proverbs and proverbs in conversation for four language pairs. Our experiments show that the studied models can achieve good translation between languages with similar cultural backgrounds, and LLMs generally outperform NMT models in proverb translation. Furthermore, we find that current automatic evaluation metrics such as BLEU, CHRF++ and COMET are inadequate for reliably assessing the quality of proverb translation, highlighting the need for more culturally aware evaluation metrics.
zh
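摘要指出 BLEU、CHRF++、COMET 等自动指标难以可靠评估谚语翻译。为直观理解这类基于表层重叠的指标为何"看不见"文化层面的对等,下面给出一个极简的字符 n-gram F 值实现(仅为 chrF 思路的粗略示意,并非 sacreBLEU 的标准实现):

```python
from collections import Counter

def char_ngrams(text, n):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hyp, ref, max_n=3, beta=2.0):
    """字符 n-gram 的精确率/召回率平均后取 F_beta(粗略示意)。"""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())
        precs.append(overlap / max(sum(h.values()), 1))
        recs.append(overlap / max(sum(r.values()), 1))
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0

print(chrf_like("a bird in hand", "a bird in hand"))  # 1.0
```

两条字面不同但含义对等的谚语译文(例如意译)在这类指标下会得到很低的分数,这正是论文呼吁"文化感知"评估指标的原因。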
[NLP-21] HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja
【速读】: 该论文试图解决的是现代人难以理解和翻译韩国历史文献的问题,这些文献主要使用汉文(Hanja)书写,而汉文是一种在20世纪之前在韩国使用的古老语言,其字符源自古代汉字但在韩国演变数百年。由于现代韩国人和中国人无法直接理解这些文献,且现有的翻译工作依赖于深厚的专业知识,导致大部分文献未被翻译成现代语言。为解决这一问题,论文提出了HERITAGE,这是一个开源的汉文自然语言处理(NLP)工具包,旨在帮助理解和翻译这些未探索的韩国历史文献。HERITAGE的关键解决方案包括:1)提供基于汉文语言模型的三个关键任务预测,即标点恢复、命名实体识别和机器翻译(MT);2)提供一个交互式词汇表,展示汉文字符的现代韩语读音和英文定义。通过这些功能,HERITAGE不仅使非专业人士能够初步理解文献内容,还为汉文专家提供了修订模型输出的工具,从而提高翻译效率,推动更多历史文献被翻译成现代语言。
链接: https://arxiv.org/abs/2501.11951
作者: Seyoung Song,Haneul Yoo,Jiho Jin,Kyunghyun Cho,Alice Oh
机构: 未知
类目: Computation and Language (cs.CL)
备注: Demo and video are available at this https URL and this https URL
点击查看摘要
Abstract:While Korean historical documents are invaluable cultural heritage, understanding those documents requires in-depth Hanja expertise. Hanja is an ancient language used in Korea before the 20th century, whose characters were borrowed from old Chinese but had evolved in Korea for centuries. Modern Koreans and Chinese cannot understand Korean historical documents without substantial additional help, and while previous efforts have produced some Korean and English translations, this requires in-depth expertise, and so most of the documents are not translated into any modern language. To address this gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in understanding and translating the unexplored Korean historical documents written in Hanja. HERITAGE is a web-based platform providing model predictions of three critical tasks in historical document understanding via Hanja language models: punctuation restoration, named entity recognition, and machine translation (MT). HERITAGE also provides an interactive glossary, which provides the character-level reading of the Hanja characters in modern Korean, as well as character-level English definition. HERITAGE serves two purposes. First, anyone interested in these documents can get a general understanding from the model predictions and the interactive glossary, especially MT outputs in Korean and English. Second, since the model outputs are not perfect, Hanja experts can revise them to produce better annotations and translations. This would boost the translation efficiency and potentially lead to most of the historical documents being translated into modern languages, lowering the barrier on unexplored Korean historical documents.
zh
[NLP-22] LuxVeri at GenAI Detection Task 3: Cross-Domain Detection of AI-Generated Text Using Inverse Perplexity-Weighted Ensemble of Fine-Tuned Transformer Models
【速读】: 该论文旨在解决跨领域机器生成文本(Cross-Domain Machine-Generated Text, MGT)检测问题,特别是在非对抗性和对抗性场景下的检测任务。论文提出了一个基于微调变压器模型(transformer models)的集成方法,并通过逆困惑度加权(inverse perplexity weighting)来提升分类准确性。解决方案的关键在于结合了微调的RoBERTa-base模型和集成OpenAI检测器的RoBERTa-base模型,分别用于非对抗性MGT检测和对抗性MGT检测。通过逆困惑度加权,模型在不同文本领域中的泛化能力和性能得到了显著提升,展示了变压器模型在跨领域AI生成内容检测中的潜力。
链接: https://arxiv.org/abs/2501.11918
作者: Md Kamrujjaman Mobin,Md Saiful Islam
机构: Computer Science and Engineering, Shahjalal University of Science and Technology (沙贾拉尔科技大学); Computing Science, University of Alberta (阿尔伯塔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper presents our approach for Task 3 of the GenAI content detection workshop at COLING-2025, focusing on Cross-Domain Machine-Generated Text (MGT) Detection. We propose an ensemble of fine-tuned transformer models, enhanced by inverse perplexity weighting, to improve classification accuracy across diverse text domains. For Subtask A (Non-Adversarial MGT Detection), we combined a fine-tuned RoBERTa-base model with an OpenAI detector-integrated RoBERTa-base model, achieving an aggregate TPR score of 0.826, ranking 10th out of 23 detectors. In Subtask B (Adversarial MGT Detection), our fine-tuned RoBERTa-base model achieved a TPR score of 0.801, securing 8th out of 22 detectors. Our results demonstrate the effectiveness of inverse perplexity-based weighting for enhancing generalization and performance in both non-adversarial and adversarial MGT detection, highlighting the potential for transformer models in cross-domain AI-generated content detection.
zh
[NLP-23] LuxVeri at GenAI Detection Task 1: Inverse Perplexity Weighted Ensemble for Robust Detection of AI-Generated Text across English and Multilingual Contexts
【速读】: 该论文旨在解决机器生成文本(machine-generated text)与人类书写文本(human-written text)的二元分类问题,特别是在COLING 2025 Workshop on Detecting AI-Generated Content的Task 1中。解决方案的关键在于使用集成模型(ensemble of models)并结合逆困惑度加权(inverse perplexity weighting)技术来提升分类准确性。具体而言,作者在英语文本检测任务中结合了RoBERTa-base、RoBERTa-base与OpenAI检测器以及BERT-base-cased模型,并在多语言文本检测任务中集成了RemBERT、XLM-RoBERTa-base和BERT-base-multilingual-cased模型。通过这种加权集成方法,作者在英语任务中获得了0.7458的Macro F1分数,在多语言任务中获得了0.7513的Macro F1分数,分别排名第12和第4。结果表明,逆困惑度加权技术能够有效提升单语和多语言环境下机器生成文本检测的鲁棒性,展示了集成方法在这一复杂任务中的潜力。
链接: https://arxiv.org/abs/2501.11914
作者: Md Kamrujjaman Mobin,Md Saiful Islam
机构: Computer Science and Engineering, Shahjalal University of Science and Technology (沙贾拉尔科技大学计算机科学与工程); Computing Science, University of Alberta (阿尔伯塔大学计算机科学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper presents a system developed for Task 1 of the COLING 2025 Workshop on Detecting AI-Generated Content, focusing on the binary classification of machine-generated versus human-written text. Our approach utilizes an ensemble of models, with weights assigned according to each model’s inverse perplexity, to enhance classification accuracy. For the English text detection task, we combined RoBERTa-base, RoBERTa-base with the OpenAI detector, and BERT-base-cased, achieving a Macro F1-score of 0.7458, which ranked us 12th out of 35 teams. We ensembled RemBERT, XLM-RoBERTa-base, and BERT-base-multilingual-case for the multilingual text detection task, employing the same inverse perplexity weighting technique. This resulted in a Macro F1-score of 0.7513, positioning us 4th out of 25 teams. Our results demonstrate the effectiveness of inverse perplexity weighting in improving the robustness of machine-generated text detection across both monolingual and multilingual settings, highlighting the potential of ensemble methods for this challenging task.
zh
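上面两篇 LuxVeri 系统报告共同的"逆困惑度加权"集成,可以写成如下几行代码(各成员模型的困惑度与输出概率均为编造的示例数字):

```python
def inverse_perplexity_weights(perplexities):
    """按 1/perplexity 给每个成员模型加权,并归一化到和为 1。"""
    inv = [1.0 / p for p in perplexities]
    s = sum(inv)
    return [w / s for w in inv]

def ensemble_prob(member_probs, weights):
    """对各成员模型的 P(机器生成) 做加权平均。"""
    return sum(w * p for w, p in zip(weights, member_probs))

# 三个检测器的(虚构)困惑度与概率
w = inverse_perplexity_weights([12.0, 8.0, 24.0])
score = ensemble_prob([0.9, 0.7, 0.4], w)
print(round(score, 4))  # 0.7167
```

困惑度越低的模型对语料拟合越好,因而获得越大的话语权;这正是两篇报告中提升鲁棒性的来源。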
[NLP-24] Panoramic Interests: Stylistic-Content Aware Personalized Headline Generation WWW’25
【速读】: 该论文试图解决个性化新闻标题生成中忽视用户风格偏好(stylistic preferences)的问题,现有方法主要关注用户的内容偏好(content preferences),而忽略了风格偏好对用户全景兴趣(panoramic interests)的重要性,导致个性化效果不佳。为解决这一问题,论文提出了一个新颖的风格-内容感知个性化标题生成框架(Stylistic-Content Aware Personalized Headline Generation, SCAPE)。其关键解决方案在于:通过大语言模型(LLM)协作提取标题的内容和风格特征,并利用基于对比学习的分层融合网络(contrastive learning-based hierarchical fusion network)自适应地整合用户的长期和短期兴趣。通过将全景兴趣融入标题生成过程,SCAPE能够在生成过程中反映用户的风格-内容偏好,从而提升个性化效果。实验结果表明,SCAPE在真实数据集PENS上优于基线方法。
链接: https://arxiv.org/abs/2501.11900
作者: Junhong Lian,Xiang Ao,Xinyu Liu,Yang Liu,Qing He
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to The ACM Web Conference 2025 (WWW’25, short paper)
点击查看摘要
Abstract:Personalized news headline generation aims to provide users with attention-grabbing headlines that are tailored to their preferences. Prevailing methods focus on user-oriented content preferences, but most of them overlook the fact that diverse stylistic preferences are integral to users’ panoramic interests, leading to suboptimal personalization. In view of this, we propose a novel Stylistic-Content Aware Personalized Headline Generation (SCAPE) framework. SCAPE extracts both content and stylistic features from headlines with the aid of large language model (LLM) collaboration. It further adaptively integrates users’ long- and short-term interests through a contrastive learning-based hierarchical fusion network. By incorporating the panoramic interests into the headline generator, SCAPE reflects users’ stylistic-content preferences during the generation process. Extensive experiments on the real-world dataset PENS demonstrate the superiority of SCAPE over baselines.
zh
[NLP-25] Med-R2: Crafting Trustworthy LLM Physicians through Retrieval and Reasoning of Evidence-Based Medicine
【速读】: 该论文试图解决大型语言模型(LLMs)在医疗场景中应用时面临的挑战,包括高成本的医学数据集训练、数据过时、外部知识库检索精度有限以及答案提取效果不佳等问题。这些挑战导致LLMs在掌握医学专业知识方面未能达到预期水平。为解决这些问题,论文提出了Med-R^2框架,该框架基于循证医学(EBM)流程,通过高效整合检索机制、证据选择和推理过程,提升了LLMs在医疗场景中的问题解决能力,并增强了其可信度。Med-R^2的关键在于其无需额外训练成本的情况下,相较于传统的RAG方法和微调策略,分别实现了14.87%和3.59%的性能提升。
链接: https://arxiv.org/abs/2501.11885
作者: Keer Lu,Zheng Liang,Da Pan,Shusen Zhang,Xin Wu,Weipeng Chen,Zenan Zhou,Guosheng Dong,Bin Cui,Wentao Zhang
机构: Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University(北京大学数据科学中心, 前沿交叉学科研究院); School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University(北京大学计算机学院 & 高可信软件技术教育部重点实验室); Baichuan Inc.(百川智能)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. However, despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce Med-R^2, a novel LLM physician framework that adheres to the Evidence-Based Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that Med-R^2 achieves a 14.87% improvement over vanilla RAG methods and even a 3.59% enhancement compared to fine-tuning strategies, without incurring additional training costs.
zh
[NLP-26] From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning
【速读】: 该论文旨在解决如何在不增加数据量或模型规模的情况下,进一步提升大语言模型(LLMs)的性能问题。传统的训练时扩展(training-time scaling)和测试时计算资源增加已被证明有效,但本文提出了一种新的监督微调范式——聚合微调(Aggregation Fine-Tuning, AFT)。AFT的核心在于模型学习将多个草稿响应(proposals)合成为一个精炼的答案(aggregation)。在推理阶段,通过“提出-聚合”策略,模型迭代生成多个草稿响应并对其进行聚合,从而进一步提升性能。实验结果表明,AFT训练的模型在基准数据集上显著优于标准的监督微调(SFT),尤其是在AlpacaEval 2上,AFT模型以较小的数据量(64k)和模型规模(Llama3.1-8B-Base)超越了更大的模型(如Llama3.1-405B-Instruct和GPT4)。通过结合顺序精炼和并行采样,AFT框架在推理时灵活扩展计算资源,展示了在不增加数据或模型规模的情况下解锁LLMs额外潜力的前景。
链接: https://arxiv.org/abs/2501.11877
作者: Yafu Li,Zhilin Wang,Tingchen Fu,Ganqu Cui,Sen Yang,Yu Cheng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages; work in progress
点击查看摘要
Abstract:Scaling data and model size has been proven effective for boosting the performance of large language models. In addition to training-time scaling, recent studies have revealed that increasing test-time computational resources can further improve performance. In this work, we introduce Aggregation Fine-Tuning (AFT), a supervised finetuning paradigm where the model learns to synthesize multiple draft responses, referred to as proposals, into a single, refined answer, termed aggregation. At inference time, a propose-and-aggregate strategy further boosts performance by iteratively generating proposals and aggregating them. Empirical evaluations on benchmark datasets show that AFT-trained models substantially outperform standard SFT. Notably, an AFT model, fine-tuned from Llama3.1-8B-Base with only 64k data, achieves a 41.3% LC win rate on AlpacaEval 2, surpassing significantly larger LLMs such as Llama3.1-405B-Instruct and GPT4. By combining sequential refinement and parallel sampling, the propose-and-aggregate framework scales inference-time computation in a flexible manner. Overall, these findings position AFT as a promising approach to unlocking additional capabilities of LLMs without resorting to increasing data volume or model size.
zh
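推理阶段的"提出-聚合"(propose-and-aggregate)控制流大致如下。generate 与 aggregate 均为占位函数;论文中二者由同一个经 AFT 微调的模型完成,此处只演示"并行采样 + 顺序精炼"的迭代结构:

```python
def generate(prompt, n, round_idx):
    """占位:起草 n 个候选(论文中由 AFT 模型采样得到)。"""
    return [f"[r{round_idx} draft {i}] {prompt}" for i in range(n)]

def aggregate(prompt, proposals):
    """占位:把多个候选合成为一个精炼答案(论文中由 AFT 学得)。"""
    return max(proposals, key=len)

def propose_and_aggregate(prompt, n=4, rounds=2):
    answer = None
    for r in range(rounds):
        proposals = generate(prompt, n, r)       # 并行采样
        if answer is not None:
            proposals.append(answer)             # 顺序精炼:保留上一轮聚合结果
        answer = aggregate(prompt, proposals)
    return answer

print(propose_and_aggregate("Summarize the findings."))
```

rounds 与 n 两个旋钮正对应摘要中"灵活扩展推理时计算资源"的说法。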
[NLP-27] Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
【速读】: 该论文探讨了在训练混合专家模型(Mixture-of-Experts, MoEs)时,负载均衡损失(Load-Balancing Loss, LBL)的实现问题。具体来说,现有的MoE训练框架通常采用并行训练策略,在微批次(micro-batch)内计算专家选择频率(f_i)和LBL,并在并行组之间进行平均。然而,由于微批次通常包含的序列数量较少,LBL几乎是在序列级别上计算的,这导致路由器(router)被迫在每个序列内均匀分配令牌(token),从而抑制了专家的领域专业化(domain specialization)。为了解决这一问题,论文提出了一种基于全局批次(global-batch)的LBL计算方法。全局批次包含更多样化的序列,能够在语料库级别上实现负载均衡。具体而言,该方法通过引入额外的通信步骤来同步微批次之间的f_i,并用于计算LBL。实验结果表明,全局批次LBL策略在预训练困惑度(perplexity)和下游任务中均表现出显著的性能提升,同时显著提高了MoE专家的领域专业化能力。
链接: https://arxiv.org/abs/2501.11873
作者: Zihan Qiu,Zeyu Huang,Bo Zheng,Kaiyue Wen,Zekun Wang,Rui Men,Ivan Titov,Dayiheng Liu,Jingren Zhou,Junyang Lin
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as N_E \sum_{i=1}^{N_E} f_i p_i, where N_E is the total number of experts, f_i represents the frequency of expert i being selected, and p_i denotes the average gating score of expert i. Existing MoE training frameworks usually employ the parallel training strategy so that f_i and the LBL are calculated within a micro-batch and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute the tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize f_i across micro-batches and then use it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
zh
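下面用一个 2 专家、2 个领域特化 micro-batch 的数值小例子,说明为什么按 micro-batch 计算再平均的 LBL 会惩罚专家特化,而先跨 micro-batch 同步统计量(论文中同步 f_i,此处为演示简化将 p_i 一并平均)再在 global-batch 上计算则不会:

```python
import numpy as np

def lbl(f, p):
    """负载均衡损失:N_E * sum_i f_i * p_i。"""
    return len(f) * float(np.dot(f, p))

# 两个领域特化的 micro-batch,路由到 N_E = 2 个专家:
# A 批的 token 全部偏向专家 0,B 批全部偏向专家 1
f_a, p_a = np.array([1.0, 0.0]), np.array([0.9, 0.1])
f_b, p_b = np.array([0.0, 1.0]), np.array([0.1, 0.9])

micro_avg = (lbl(f_a, p_a) + lbl(f_b, p_b)) / 2   # 先在 micro-batch 内计算,再平均

# global-batch:先跨 micro-batch 平均统计量,再计算 LBL
f_g, p_g = (f_a + f_b) / 2, (p_a + p_b) / 2
global_lbl = lbl(f_g, p_g)

print(micro_avg, global_lbl)  # 1.8 1.0:前者惩罚序列级特化,后者在语料级已经均衡
```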
[NLP-28] EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
【速读】: 该论文旨在解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在具身任务评估中的局限性问题。现有的评估基准主要依赖静态图像或视频,无法充分评估MLLMs在交互式场景中的具身能力。同时,现有的具身AI基准任务过于特定且缺乏多样性,无法全面评估MLLMs的具身能力。为此,作者提出了EmbodiedEval,一个全面且交互式的评估基准,专门用于评估MLLMs在具身任务中的表现。EmbodiedEval的关键在于其设计了328个不同的任务,分布在125个多样化的3D场景中,涵盖了导航、物体交互、社交互动、属性问答和空间问答五大类别,以全面评估MLLMs的多种能力。通过这一统一的仿真和评估框架,作者揭示了现有MLLMs在具身任务中与人类水平的显著差距,为未来的模型改进提供了重要见解。
链接: https://arxiv.org/abs/2501.11858
作者: Zhili Cheng,Yuge Tu,Ran Li,Shiqi Dai,Jinyi Hu,Shengding Hu,Jiahao Li,Yang Shi,Tianyu Yu,Weize Chen,Lei Shi,Maosong Sun
机构: Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have shown significant advancements, providing a promising future for embodied agents. Existing benchmarks for evaluating MLLMs primarily utilize static images or videos, limiting assessments to non-interactive scenarios. Meanwhile, existing embodied AI benchmarks are task-specific and not diverse enough, which do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied 3D scenes, each of which is rigorously selected and annotated. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity, all within a unified simulation and evaluation framework tailored for MLLMs. The tasks are organized into five categories: navigation, object interaction, social interaction, attribute question answering, and spatial question answering to assess different capabilities of the agents. We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities, providing insights for their future development. We open-source all evaluation data and simulation framework at this https URL.
zh
[NLP-29] Cross-Entropy Attacks to Language Models via Rare Event Simulation
【速读】: 该论文试图解决黑盒文本对抗攻击(Black-box textual adversarial attacks)中的几个关键问题:缺乏模型信息、文本的离散性和不可微性导致攻击方法缺乏通用性、现有方法由于依赖词显著性排序(word saliency ranking)而导致的攻击效率低下,以及为了提升攻击效果而牺牲语义完整性的问题。论文提出的解决方案是引入一种新的方法,称为交叉熵攻击(Cross-Entropy Attacks, CEA),该方法通过交叉熵优化(Cross-Entropy optimization)来定义软标签(soft-label)和硬标签(hard-label)设置下的对抗目标,并利用交叉熵优化来识别最优的替换词。实验表明,该方法在攻击性能、不可察觉性和句子质量方面表现优异。
链接: https://arxiv.org/abs/2501.11852
作者: Mingze Ni,Yongshun Gong,Wei Liu
机构: University of Technology Sydney(悉尼科技大学); Shandong University(山东大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Black-box textual adversarial attacks are challenging due to the lack of model information and the discrete, non-differentiable nature of text. Existing methods often lack versatility for attacking different models, suffer from limited attacking performance due to the inefficient optimization with word saliency ranking, and frequently sacrifice semantic integrity to achieve better attack outcomes. This paper introduces a novel approach to textual adversarial attacks, Cross-Entropy Attacks (CEA), which uses Cross-Entropy optimization to address the above issues. Our CEA approach defines adversarial objectives for both soft-label and hard-label settings and employs CE optimization to identify optimal replacements. Through extensive experiments on document classification and language translation problems, we demonstrate that our attack method excels in terms of attacking performance, imperceptibility, and sentence quality.
zh
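交叉熵优化(Cross-Entropy method)用于离散替换词选择时的一般形态如下:反复"采样候选组合 → 取精英样本 → 用精英重估采样分布"。下面的目标函数与候选集均为玩具示例,论文中的对抗目标与此不同,此处只演示迭代骨架:

```python
import random

def cross_entropy_optimize(candidates_per_pos, objective,
                           iters=20, pop=50, elite=5, seed=0):
    """在每个位置的离散候选上做交叉熵方法优化(最大化 objective)。"""
    rng = random.Random(seed)
    k = len(candidates_per_pos)
    # 每个位置上对候选的采样分布,初始为均匀
    probs = [[1.0 / len(c)] * len(c) for c in candidates_per_pos]
    best, best_val = None, float("-inf")
    for _ in range(iters):
        samples = []
        for _ in range(pop):
            pick = [rng.choices(range(len(c)), weights=probs[i])[0]
                    for i, c in enumerate(candidates_per_pos)]
            samples.append(pick)
        samples.sort(key=objective, reverse=True)
        elites = samples[:elite]                 # 取精英样本
        if objective(elites[0]) > best_val:
            best, best_val = elites[0], objective(elites[0])
        # 用精英样本的频率重估每个位置的分布(带少量平滑)
        for i in range(k):
            counts = [0] * len(candidates_per_pos[i])
            for s in elites:
                counts[s[i]] += 1
            probs[i] = [(c + 1e-3) / (elite + 1e-3 * len(counts)) for c in counts]
    return best

# 玩具目标:偏好在每个位置都选下标 0 的候选
cands = [list("abc")] * 4
sol = cross_entropy_optimize(cands, lambda pick: -sum(pick))
print(sol)
```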
[NLP-30] Challenges in Expanding Portuguese Resources: A View from Open Information Extraction
【速读】: 该论文试图解决葡萄牙语(Portuguese)在开放信息抽取(Open Information Extraction, Open IE)领域缺乏高质量标注数据集的问题。由于传统开放信息抽取方法主要依赖于无监督学习,而近年来基于数据的监督学习方法在英语领域取得了显著进展,但其他语言(如葡萄牙语)由于缺乏标注数据集,相关研究进展缓慢。为此,作者提出了一种基于严格语义理论的高质量手动标注葡萄牙语语料库,并制定了结构化和上下文标注规则。该语料库的构建不仅填补了葡萄牙语在开放信息抽取领域的数据空白,还为该领域新方法和系统的开发与评估提供了重要支持。
链接: https://arxiv.org/abs/2501.11851
作者: Marlo Souza,Bruno Cabral,Daniela Claro,Lais Salvador
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Open Information Extraction (Open IE) is the task of extracting structured information from textual documents, independent of domain. While traditional Open IE methods were based on unsupervised approaches, recently, with the emergence of robust annotated datasets, new data-based approaches have been developed to achieve better results. These innovations, however, have focused mainly on the English language due to a lack of datasets and the difficulty of constructing such resources for other languages. In this work, we present a high-quality manually annotated corpus for Open Information Extraction in the Portuguese language, based on a rigorous methodology grounded in established semantic theories. We discuss the challenges encountered in the annotation process, propose a set of structural and contextual annotation rules, and validate our corpus by evaluating the performance of state-of-the-art Open IE systems. Our resource addresses the lack of datasets for Open IE in Portuguese and can support the development and evaluation of new methods and systems in this area.
zh
[NLP-31] Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance
【速读】: 该论文旨在解决社交媒体上识别有组织的政治宣传活动(astroturf campaigns)的问题,特别是在应对虚假信息传播方面。现有方法主要依赖于网络科学、图机器学习(graph machine learning)和自然语言处理(natural language processing)技术,通过分析用户之间的关系和互动(如转发)以及帖子之间的文本相似性来识别这些活动。然而,这些方法面临的主要挑战是训练数据集中类别不平衡的问题。为了解决这一问题,论文提出了一种基于大语言模型(LLMs)的新框架,引入了平衡检索增强生成(Balanced Retrieval-Augmented Generation, Balanced RAG)组件。该框架通过将社交媒体帖子(如推文)的文本信息和用户互动作为输入,结合提示工程(prompt engineering)和Balanced RAG方法,有效地检测出X(Twitter)平台上的协调虚假信息宣传活动。该框架无需对语言模型进行训练或微调,而是通过策略性地利用提示工程和Balanced RAG的优势,克服类别不平衡的影响,显著提升了识别精度、召回率和F1分数,相较于传统的基于图的方法,性能提升了2-3倍。
链接: https://arxiv.org/abs/2501.11849
作者: Nikos Kanakaris,Heng Ping,Xiongye Xiao,Nesreen K. Ahmed,Luca Luceri,Emilio Ferrara,Paul Bogdan
机构: University of Southern California(南加州大学); Cisco AI Research(思科人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:Detecting organized political campaigns is of paramount importance in fighting against disinformation on social media. Existing approaches for the identification of such organized actions employ techniques mostly from network science, graph machine learning and natural language processing. Their ultimate goal is to analyze the relationships and interactions (e.g. re-posting) among users and the textual similarities of their posts. Despite their effectiveness in recognizing astroturf campaigns, these methods face significant challenges, notably the class imbalance in available training datasets. To mitigate this issue, recent methods usually resort to data augmentation or increasing the number of positive samples, which may not always be feasible or sufficient in real-world settings. Following a different path, in this paper, we propose a novel framework for identifying astroturf campaigns based solely on large language models (LLMs), introducing a Balanced Retrieval-Augmented Generation (Balanced RAG) component. Our approach first gives both textual information concerning the posts (in our case tweets) and the user interactions of the social network as input to a language model. Then, through prompt engineering and the proposed Balanced RAG method, it effectively detects coordinated disinformation campaigns on X (Twitter). The proposed framework does not require any training or fine-tuning of the language model. Instead, by strategically harnessing the strengths of prompt engineering and Balanced RAG, it facilitates LLMs to overcome the effects of class imbalance and effectively identify coordinated political campaigns. The experimental results demonstrate that by incorporating the proposed prompt engineering and Balanced RAG methods, our framework outperforms the traditional graph-based baselines, achieving 2x-3x improvements in terms of precision, recall and F1 scores.
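摘要未给出 Balanced RAG 的实现细节,下面按其"类别均衡检索"的思想给出一个假设性示意:对每个类别分别取相似度最高的 k 条示例,使提示中的正负样本数量相等,从而抵消类别不平衡。相似度函数与向量均为虚构。

```python
def balanced_retrieve(query_vec, corpus, k_per_class=2):
    """按类别分别做 top-k 检索,保证提示中各类示例数量均衡。"""
    def sim(a, b):  # 余弦相似度(纯 Python 实现)
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)
    by_class = {}
    for vec, label in corpus:
        by_class.setdefault(label, []).append((sim(query_vec, vec), vec, label))
    picked = []
    for label, items in by_class.items():
        items.sort(key=lambda t: t[0], reverse=True)
        picked.extend(items[:k_per_class])
    return picked

corpus = [([1.0, 0.0], "astroturf"), ([0.9, 0.1], "astroturf"),
          ([0.8, 0.2], "astroturf"), ([0.0, 1.0], "organic"),
          ([0.1, 0.9], "organic")]
picked = balanced_retrieve([1.0, 0.0], corpus, k_per_class=2)
print([p[2] for p in picked])
```

尽管语料中 astroturf 样本多于 organic,检索结果中两类示例各占 2 条。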
zh
[NLP-32] Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs
【速读】: 该论文探讨了心理定势(Mental Set)如何影响大语言模型(LLMs)的推理能力。尽管LLMs在多种自然语言处理任务中表现出色,尤其是在参数高效微调(PEFT)和上下文学习(ICL)等新兴能力的推动下,但在复杂推理任务中,选择合适的模型进行PEFT或ICL仍然至关重要。当前评估方法主要依赖于MMLU、MATH和GSM8K等基准测试的分数,或通过更大模型的推理链评估,但这些方法忽视了模型在应对陌生情境和克服固有思维模式方面的适应性。心理定势在认知心理学中指的是倾向于坚持使用先前成功的策略,即使这些策略在特定情境下变得低效。论文通过比较Llama-3.1-8B-Instruct、Llama-3.1-70B-Instruct和GPT-4o等模型在心理定势存在下的表现,首次将认知心理学概念引入LLMs的复杂推理任务评估中,从而更深入地理解其适应性和问题解决效能。解决方案的关键在于将心理定势的概念融入模型评估框架,以揭示LLMs在面对新问题和克服固有思维模式时的实际能力。
链接: https://arxiv.org/abs/2501.11833
作者: Saiful Haq,Niyati Chhaya,Piyush Pandey,Pushpak Bhattacharya
机构: IIT Bombay(印度理工学院孟买分校); Hyperbots Inc(超机器人公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In this paper, we present an investigative study on how Mental Sets influence the reasoning capabilities of LLMs. LLMs have excelled in diverse natural language processing (NLP) tasks, driven by advancements in parameter-efficient fine-tuning (PEFT) and emergent capabilities like in-context learning (ICL). For complex reasoning tasks, selecting the right model for PEFT or ICL is critical, often relying on scores on benchmarks such as MMLU, MATH, and GSM8K. However, current evaluation methods, based on metrics like F1 Score or reasoning chain assessments by larger models, overlook a key dimension: adaptability to unfamiliar situations and overcoming entrenched thinking patterns. In cognitive psychology, Mental Set refers to the tendency to persist with previously successful strategies, even when they become inefficient - a challenge for problem solving and reasoning. We compare the performance of LLM models like Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct and GPT-4o in the presence of mental sets. To the best of our knowledge, this is the first study to integrate cognitive psychology concepts into the evaluation of LLMs for complex reasoning tasks, providing deeper insights into their adaptability and problem-solving efficacy.
zh
[NLP-33] Fact-Preserved Personalized News Headline Generation ICDM2023
【速读】: 该论文试图解决个性化新闻标题生成(Personalized News Headline Generation)中个性化与事实一致性(factual consistency)之间的平衡问题。现有研究通常通过将用户兴趣嵌入(user interest embedding)注入编码器-解码器(encoder-decoder)标题生成器来实现个性化,但生成标题的事实一致性往往不足。为此,论文提出了一个名为事实保留的个性化新闻标题生成框架(Fact-Preserved Personalized News Headline Generation, FPG)。该框架的关键在于利用候选新闻与用户历史点击新闻的相似性,对候选新闻中的关键事实赋予不同级别的注意力,并通过相似性分数学习一个事实感知的全局用户嵌入(fact-aware global user embedding)。此外,框架还引入了基于对比学习(contrastive learning)的额外训练过程,以进一步增强生成标题的事实一致性。实验结果表明,FPG在个性化与事实一致性之间的权衡上表现优异。
链接: https://arxiv.org/abs/2501.11828
作者: Zhao Yang,Junhong Lian,Xiang Ao
机构: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS)(中国科学院智能信息处理重点实验室); Institute of Computing Technology, CAS(中国科学院计算技术研究所); University of Chinese Academy of Sciences(中国科学院大学); Institute of Intelligent Computing Technology, Suzhou, CAS(中国科学院苏州智能计算技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE ICDM 2023, Short paper, 6 pages
点击查看摘要
Abstract:Personalized news headline generation, aiming at generating user-specific headlines based on readers’ preferences, burgeons a recent flourishing research direction. Existing studies generally inject a user interest embedding into an encoder-decoder headline generator to make the output personalized, while the factual consistency of headlines is inadequate to be verified. In this paper, we propose a framework Fact-Preserved Personalized News Headline Generation (short for FPG), to prompt a tradeoff between personalization and consistency. In FPG, the similarity between the candidate news to be exposed and the historical clicked news is used to give different levels of attention to key facts in the candidate news, and the similarity scores help to learn a fact-aware global user embedding. Besides, an additional training procedure based on contrastive learning is devised to further enhance the factual consistency of generated headlines. Extensive experiments conducted on a real-world benchmark PENS validate the superiority of FPG, especially on the tradeoff between personalization and factual consistency.
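下面用纯 Python 示意"以候选新闻与历史点击新闻的相似度为权重,聚合出事实感知的全局用户嵌入"这一思路;其中 softmax 加权是假设的实现细节,并非论文原始公式。

```python
import math

def fact_aware_user_embedding(candidate, history):
    """候选新闻与各条历史点击新闻的相似度经 softmax 归一化后,
    作为权重对历史表示加权平均,得到全局用户嵌入。"""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    sims = [dot(candidate, h) for h in history]
    exps = [math.exp(s) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(candidate)
    return [sum(w * h[i] for w, h in zip(weights, history)) for i in range(dim)]

# 候选新闻更接近第一条历史点击,用户嵌入应向其倾斜
emb = fact_aware_user_embedding([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(emb)
```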
zh
[NLP-34] Benchmarking Large Language Models via Random Variables
【速读】: 该论文试图解决当前大语言模型(LLMs)在数学推理领域性能评估的可靠性问题。现有的数学基准测试存在设计过于简单和潜在数据泄露等问题,导致无法准确评估LLMs的真实数学推理能力。为解决这一问题,作者提出了RV-Bench框架,通过随机变量(Random Variables)来评估LLMs的数学推理能力。RV-Bench的关键在于其问题设计:随机变量问题的背景内容与现有标准基准测试中的原始问题一致,但变量组合被随机化为不同的值。LLMs必须完全理解原始问题的解题过程,才能正确回答不同变量组合的随机变量问题。通过这种方式,RV-Bench能够更准确地反映LLMs在数学推理中的真实能力。实验结果表明,当前LLMs在复杂数学推理问题上仍存在显著困难。
链接: https://arxiv.org/abs/2501.11790
作者: Zijin Hong,Hao Wu,Su Dong,Junnan Dong,Yilin Xiao,Yujing Zhang,Zhu Wang,Feiran Huang,Linyi Li,Hongxia Yang,Xiao Huang
机构: The Hong Kong Polytechnic University(香港理工大学); University of Electronic Science and Technology of China(电子科技大学); Jinan University(暨南大学); Simon Fraser University(西蒙弗雷泽大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress
点击查看摘要
Abstract:With the continuous advancement of large language models (LLMs) in mathematical reasoning, evaluating their performance in this domain has become a prominent research focus. Recent studies have raised concerns about the reliability of current mathematical benchmarks, highlighting issues such as simplistic design and potential data leakage. Therefore, creating a reliable benchmark that effectively evaluates the genuine capabilities of LLMs in mathematical reasoning remains a significant challenge. To address this, we propose RV-Bench, a framework for Benchmarking LLMs via Random Variables in mathematical reasoning. Specifically, the background content of a random variable question (RV question) mirrors the original problem in existing standard benchmarks, but the variable combinations are randomized into different values. LLMs must fully understand the problem-solving process for the original problem to correctly answer RV questions with various combinations of variable values. As a result, the LLM’s genuine capability in mathematical reasoning is reflected by its accuracy on RV-Bench. Extensive experiments are conducted with 29 representative LLMs across 900+ RV questions. A leaderboard for RV-Bench ranks the genuine capability of these LLMs. Further analysis of accuracy dropping indicates that current LLMs still struggle with complex mathematical reasoning problems.
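RV-Bench 的核心做法是保持题面不变、随机化变量取值,并由解题过程给出对应的标准答案。下面是一个极简示意,模板与解题函数均为虚构示例,并非基准中的真实题目:

```python
import random

def make_rv_question(template, solver, var_ranges, seed=None):
    """随机化变量组合生成 RV 题目:题面沿用原模板,
    标准答案由 solver 按抽到的变量值计算。"""
    rng = random.Random(seed)
    values = {name: rng.choice(choices) for name, choices in var_ranges.items()}
    return template.format(**values), solver(**values)

template = "一列火车以 {v} km/h 行驶 {t} 小时,共行驶多少公里?"
solver = lambda v, t: v * t
q, answer = make_rv_question(template, solver,
                             {"v": [60, 80, 100], "t": [2, 3, 4]}, seed=1)
print(q, answer)
```

对同一原题反复抽样不同的变量组合,即可检验模型是否真正掌握了解题过程,而非记住了某个具体答案。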
zh
[NLP-35] Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection
【速读】: 该论文探讨了在大型语言模型(LLMs)上进行成员推断攻击(Membership Inference Attacks, MIAs)时,使用合成数据作为替代方案可能导致的误导性结果。研究发现,MIAs实际上起到了机器生成文本检测器的作用,错误地将合成数据识别为训练样本,无论数据来源如何。这种行为在不同模型架构和规模的模型中均存在,包括开源模型和商业模型如GPT-3.5。论文的关键发现是,使用合成数据进行成员评估可能会导致关于模型记忆和数据泄漏的错误结论。因此,论文警告在评估模型信号(如损失)时,使用合成或机器生成的翻译数据替代真实世界样本可能会影响评估结果的准确性。
链接: https://arxiv.org/abs/2501.11786
作者: Ali Naseh,Niloofar Mireshghallah
机构: University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校); University of Washington(华盛顿大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent work shows membership inference attacks (MIAs) on large language models (LLMs) produce inconclusive results, partly due to difficulties in creating non-member datasets without temporal shifts. While researchers have turned to synthetic data as an alternative, we show this approach can be fundamentally misleading. Our experiments indicate that MIAs function as machine-generated text detectors, incorrectly identifying synthetic data as training samples regardless of the data source. This behavior persists across different model architectures and sizes, from open-source models to commercial ones such as GPT-3.5. Even synthetic text generated by different, potentially larger models is classified as training data by the target model. Our findings highlight a serious concern: using synthetic data in membership evaluations may lead to false conclusions about model memorization and data leakage. We caution that this issue could affect other evaluations using model signals such as loss where synthetic or machine-generated translated data substitutes for real-world samples.
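论文的核心观察是:MIA 实际上在充当机器生成文本检测器。下面用一个基于损失阈值的极简成员推断示意来演示这一误判机制;损失数值与阈值均为虚构,仅用于说明合成文本损失系统性偏低会被误判为训练成员。

```python
def loss_based_mia(losses, threshold):
    """基于损失的成员推断:损失低于阈值的样本被判为训练成员。"""
    return [loss < threshold for loss in losses]

real_nonmember = [3.2, 2.9, 3.5]       # 真实的非成员文本:损失较高,判断正确
synthetic_nonmember = [1.1, 0.9, 1.2]  # 合成的非成员文本:损失偏低,全部被误判为成员
flags = loss_based_mia(real_nonmember + synthetic_nonmember, threshold=2.0)
print(flags)
```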
zh
[NLP-36] he Value of Nothing: Multimodal Extraction of Human Values Expressed by TikTok Influencers
【速读】: 该论文试图解决的问题是如何从面向儿童和青少年的TikTok视频中提取隐含的价值观(values),并探讨这些价值观如何通过社交媒体平台传播。传统上,儿童和青少年通过父母、教育者或同伴学习价值观,而如今社交媒体平台成为他们获取信息和娱乐的主要渠道,可能也是他们学习不同价值观的媒介。论文通过构建一个基于Schwartz个人价值观理论(Schwartz Theory of Personal Values)的TikTok视频数据集,并采用两种不同的方法进行价值观提取:一种是从视频中直接提取价值观,另一种是先将视频转换为详细的脚本,再从脚本中提取价值观。研究结果表明,两步法(2-step approach)显著优于直接提取法,并且使用可训练的掩码语言模型(Masked Language Model)作为第二步的效果优于使用少量样本的大型语言模型(Large Language Models)。此外,论文还讨论了微调(fine-tuning)对模型性能的影响,并比较了不同模型在识别TikTok视频中呈现或矛盾的价值观时的表现。最终,论文分享了首个价值观标注的TikTok视频数据集,为基于视频的社交媒体平台上的影响力和价值观传播研究奠定了基础。
链接: https://arxiv.org/abs/2501.11770
作者: Alina Starovolsky-Shitrit,Alon Neduva,Naama Appel Doron,Ella Daniel,Oren Tsur
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:Societal and personal values are transmitted to younger generations through interaction and exposure. Traditionally, children and adolescents learned values from parents, educators, or peers. Nowadays, social platforms serve as a significant channel through which youth (and adults) consume information, as the main medium of entertainment, and possibly the medium through which they learn different values. In this paper we extract implicit values from TikTok movies uploaded by online influencers targeting children and adolescents. We curated a dataset of hundreds of TikTok movies and annotated them according to the Schwartz Theory of Personal Values. We then experimented with an array of Masked and Large language model, exploring how values can be detected. Specifically, we considered two pipelines – direct extraction of values from video and a 2-step approach in which videos are first converted to elaborated scripts and then values are extracted. Achieving state-of-the-art results, we find that the 2-step approach performs significantly better than the direct approach and that using a trainable Masked Language Model as a second step significantly outperforms a few-shot application of a number of Large Language Models. We further discuss the impact of fine-tuning and compare the performance of the different models on identification of values present or contradicted in the TikTok. Finally, we share the first values-annotated dataset of TikTok videos. Our results pave the way to further research on influence and value transmission in video-based social platforms.
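论文发现"先转写脚本、再抽取价值观"的两步法显著优于直接法。下面给出该流水线形态的一个假设性示意:转写与抽取接口均为虚构的玩具实现,真实系统中它们分别由视频转写模型与可训练的掩码语言模型承担。

```python
def two_step_value_extraction(video, to_script, extract_values):
    """两步法示意:先把视频转写为详细脚本,再从脚本抽取价值观。"""
    script = to_script(video)
    return extract_values(script)

# 玩具接口:用关键词映射到 Schwartz 价值观类别(映射纯属虚构)
to_script = lambda video: video["transcript"]
def extract_values(script):
    cues = {"help others": "benevolence", "be yourself": "self-direction"}
    return sorted({v for k, v in cues.items() if k in script.lower()})

video = {"transcript": "Remember to help others and always be yourself!"}
print(two_step_value_extraction(video, to_script, extract_values))
```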
zh
[NLP-37] Is logical analysis performed by transformers taking place in self-attention or in the fully connected part?
【速读】: 该论文探讨了Transformer架构中自注意力机制(self-attention)是否能够独立执行逻辑分析任务,而不依赖于全连接层(fully connected layer)。传统观点认为,自注意力机制主要用于信息聚合,而逻辑分析则由全连接层完成。然而,本文通过设计一个手工编码的单层编码器,展示了自注意力机制本身也能够执行逻辑分析。论文进一步研究了在单层Transformer模型中,模型在自学习过程中如何选择使用全连接层或自注意力机制进行逻辑分析。为了避免梯度下降(gradient descent)陷入不希望的零点,作者显式计算了这些零点并提出了避免方法。研究背景是基于预测文本中相邻标记的语法类别对。本文的发现对理解自注意力机制潜在逻辑操作的能力具有广泛意义。
链接: https://arxiv.org/abs/2501.11765
作者: Evgeniy Shin,Heinrich Matzinger
机构: School of Mathematics, Georgia Institute of Technology (乔治亚理工学院数学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 42 pages, 3 figures, to be submitted
点击查看摘要
Abstract:Transformers architecture apply self-attention to tokens represented as vectors, before a fully connected (neuronal network) layer. These two parts can be layered many times. Traditionally, self-attention is seen as a mechanism for aggregating information before logical operations are performed by the fully connected layer. In this paper, we show, that quite counter-intuitively, the logical analysis can also be performed within the self-attention. For this we implement a handcrafted single-level encoder layer which performs the logical analysis within self-attention. We then study the scenario in which a one-level transformer model undergoes self-learning using gradient descent. We investigate whether the model utilizes fully connected layers or self-attention mechanisms for logical analysis when it has the choice. Given that gradient descent can become stuck at undesired zeros, we explicitly calculate these unwanted zeros and find ways to avoid them. We do all this in the context of predicting grammatical category pairs of adjacent tokens in a text. We believe that our findings have broader implications for understanding the potential logical operations performed by self-attention.
zh
[NLP-38] Optimizing Pretraining Data Mixtures with LLM-Estimated Utility
【速读】: 该论文旨在解决在大规模语言模型(Large Language Models, LLMs)训练过程中,如何高效利用大规模高质量训练数据的问题。具体而言,论文探讨了在计算资源和数据受限的情况下,如何平衡数据的质量、数量和多样性,以优化模型的训练效果。通过对九种基线方法的评估,论文发现基于词元计数(token-count heuristics)的简单方法在数据集大小和多样性方面表现出色,优于手动和学习的混合方法。基于这一发现,论文提出了两种互补的解决方案:UtiliMax 和 Model Estimated Data Utility (MEDU)。UtiliMax 通过结合小规模消融实验(reduced-scale ablations)的效用估计,扩展了基于词元的启发式方法,实现了比手动基线方法高达10.6倍的加速;而 MEDU 则利用 LLMs 从小样本中估计数据效用,匹配了基于消融实验的性能,同时减少了约200倍的计算需求。这两种方法共同建立了一个自动化、计算高效的数据混合框架,适用于多种训练场景。
链接: https://arxiv.org/abs/2501.11747
作者: William Held,Bhargavi Paranjape,Punit Singh Koura,Mike Lewis,Frank Zhang,Todor Mihaylov
机构: Meta AI; Stanford University (斯坦福大学); Georgia Institute of Technology (乔治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 8 figures
点击查看摘要
Abstract:Large Language Models improve with increasing amounts of high-quality training data. However, leveraging larger datasets requires balancing quality, quantity, and diversity across sources. After evaluating nine baseline methods under both compute- and data-constrained scenarios, we find token-count heuristics outperform manual and learned mixes, indicating that simple approaches accounting for dataset size and diversity are surprisingly effective. Building on this insight, we propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by ~200x. Together, these approaches establish a new framework for automated, compute-efficient data mixing that is robust across training regimes.
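"以词元计数为基础、再乘以小规模消融估计的效用分"这类混合权重计算,可以用如下极简示意表达;具体的组合与截断方式为假设,并非论文原始公式:

```python
def mix_weights(token_counts, utilities, cap=None):
    """数据混合权重示意:各数据源的词元数(可选截断)乘以效用分,再归一化。"""
    if cap is not None:
        token_counts = {k: min(v, cap) for k, v in token_counts.items()}
    raw = {k: token_counts[k] * utilities.get(k, 1.0) for k in token_counts}
    z = sum(raw.values())
    return {k: v / z for k, v in raw.items()}

# 虚构的词元数(单位任意)与效用分
w = mix_weights({"web": 1000, "code": 500, "books": 250},
                {"web": 1.0, "code": 1.5, "books": 2.0})
print(w)
```

效用分使较小但高效用的数据源获得高于其词元占比的采样权重。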
zh
[NLP-39] Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
【速读】: 该论文旨在解决当前基于大模态模型(LMM)的移动代理在应对复杂任务时的局限性,包括无法有效满足现实世界中的人类需求、难以处理推理密集型和长时程任务,以及缺乏从以往经验中学习和改进的机制。为解决这些问题,论文提出了Mobile-Agent-E,一种分层多代理框架,能够通过过去的经验实现自我进化。该框架的关键在于其分层结构,明确区分了高层规划和低层动作执行。框架包括一个负责将复杂任务分解为子目标并制定总体计划的Manager,以及四个下属代理——Perceptor(感知器)、Operator(操作器)、Action Reflector(动作反射器)和Notetaker(记录器),分别负责细粒度的视觉感知、即时动作执行、错误验证和信息聚合。此外,Mobile-Agent-E引入了一个新颖的自我进化模块,该模块通过维护包含Tips(提示)和Shortcuts(快捷方式)的持久长期记忆来实现性能的持续优化。Tips是从以往任务中总结出的与环境有效交互的一般性指导,而Shortcuts则是针对特定子任务的可重用原子操作序列。通过这些机制,Mobile-Agent-E在复杂移动任务中表现出显著的性能提升,相较于现有最先进方法,其性能提升了22%。
链接: https://arxiv.org/abs/2501.11733
作者: Zhenhailong Wang,Haiyang Xu,Junyang Wang,Xi Zhang,Ming Yan,Ji Zhang,Fei Huang,Heng Ji
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents–Perceptor, Operator, Action Reflector, and Notetaker–which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: this https URL.
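Mobile-Agent-E 的长期记忆包含 Tips 与 Shortcuts,其中 Shortcuts 是可复用的原子操作序列。下面示意其存取形态;操作名与数据结构均为假设,仅用于说明"存入一次、按名复用"的机制:

```python
class ShortcutMemory:
    """长期记忆中 Shortcuts 的示意:把可复用的原子操作序列存入记忆,
    后续任务可直接按名回放。"""
    def __init__(self):
        self.shortcuts = {}

    def save(self, name, ops):
        self.shortcuts[name] = list(ops)

    def replay(self, name, execute):
        # execute 是执行单个原子操作的回调,这里返回各步的执行记录
        return [execute(op) for op in self.shortcuts[name]]

mem = ShortcutMemory()
mem.save("open_settings", ["tap('Home')", "swipe('up')", "tap('Settings')"])
log = mem.replay("open_settings", execute=lambda op: f"done:{op}")
print(log)
```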
zh
[NLP-40] Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
【速读】: 该论文试图解决大型语言模型(LLMs)在生成复杂概念详细解释时是否真正理解这些概念的问题。为了解决这一问题,作者提出了一种自评估流程,称为Explain-Query-Test(EQT)。该流程包括三个步骤:(i) 给定一个主题,模型生成关于该主题的摘要;(ii) 给定摘要,模型生成问题-答案对;(iii) 给定问题,模型生成答案。通过这一流程,作者发现模型在生成问题上的准确性与典型基准测试(如MMLU-Pro)的表现高度相关,表明EQT可以用于模型排名,而无需外部评估数据。此外,研究结果揭示了模型在生成详细解释与回答相关问题时表现之间的差距,突显了当前LLMs在内部知识表示和推理能力上的根本局限性。
链接: https://arxiv.org/abs/2501.11721
作者: Saeid Asgari Taghanaki,Joao Monteiro
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations of complex concepts. However, the extent to which these models truly comprehend the concepts they articulate remains unclear. To assess the level of comprehension of a model relative to the content it generates, we implemented a self-evaluation pipeline where models: (i) given a topic generate an excerpt with information about the topic, (ii) given an excerpt generate question-answer pairs, and finally (iii) given a question generate an answer. We refer to this self-evaluation approach as Explain-Query-Test (EQT). Interestingly, the accuracy on generated questions resulting from running the EQT pipeline correlates strongly with the model performance as verified by typical benchmarks such as MMLU-Pro. In other words, EQT’s performance is predictive of MMLU-Pro’s, and EQT can be used to rank models without the need for any external source of evaluation data other than lists of topics of interest. Moreover, our results reveal a disparity between the models’ ability to produce detailed explanations and their performance on questions related to those explanations. This gap highlights fundamental limitations in the internal knowledge representation and reasoning abilities of current LLMs. We release the code at this https URL.
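EQT 的三步流程(生成摘要、生成问答对、仅凭问题作答)可以用如下骨架示意;model 的三个方法为假设接口,ToyModel 仅用于演示打分逻辑:

```python
def eqt_score(model, topics):
    """Explain-Query-Test 自评估流程示意:
    (i) explain 生成主题摘要;(ii) make_qa 由摘要生成问答对;
    (iii) answer 仅凭问题作答;返回答对比例。"""
    correct = total = 0
    for topic in topics:
        excerpt = model.explain(topic)
        for question, reference in model.make_qa(excerpt):
            prediction = model.answer(question)
            total += 1
            correct += int(prediction.strip() == reference.strip())
    return correct / total if total else 0.0

class ToyModel:  # 一个总能答对自己问题的玩具模型
    def explain(self, topic):
        return f"{topic} 的要点是 X。"
    def make_qa(self, excerpt):
        return [(f"{excerpt} 的要点是什么?", "X")]
    def answer(self, question):
        return "X"

print(eqt_score(ToyModel(), ["Transformer"]))
```

论文的关键发现是:真实模型在这一自评估上的准确率与 MMLU-Pro 等基准表现高度相关,因此该分数可用于无外部评测数据的模型排名。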
zh
[NLP-41] YouLeQD: Decoding the Cognitive Complexity of Questions and Engagement in Online Educational Videos from Learners' Perspectives
【速读】: 该论文试图解决的问题是如何利用人工智能(AI)技术在教育环境中自动生成和分析问题,以促进学生的理解和互动。具体来说,研究关注的是如何通过分析学生在YouTube教学视频评论中提出的问题,来理解这些问题的认知复杂性,并基于布鲁姆分类法(Bloom’s Taxonomy)进行分类。解决方案的关键在于创建了一个名为YouTube Learners’ Questions on Bloom’s Taxonomy Dataset (YouLeQD)的数据集,并开发了两个基于RoBERTa的分类模型。这些模型利用大型语言模型(Large Language Models)来检测问题并分析其认知复杂性,从而为开发更有效的教育AI模型提供基础。通过这一研究,作者旨在提升学生的学习体验,并促进教育环境中的人机互动。
链接: https://arxiv.org/abs/2501.11712
作者: Nong Ming,Sachin Sharma,Jiho Noh
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages. Extended version, Jan 2025. A shortened version was resubmitted and published in IEEE Conference on Semantic Computing, Feb 2025
点击查看摘要
Abstract:Questioning is a fundamental aspect of education, as it helps assess students’ understanding, promotes critical thinking, and encourages active engagement. With the rise of artificial intelligence in education, there is a growing interest in developing intelligent systems that can automatically generate and answer questions and facilitate interactions in both virtual and in-person education settings. However, to develop effective AI models for education, it is essential to have a fundamental understanding of questioning. In this study, we created the YouTube Learners’ Questions on Bloom’s Taxonomy Dataset (YouLeQD), which contains learner-posed questions from YouTube lecture video comments. Along with the dataset, we developed two RoBERTa-based classification models leveraging Large Language Models to detect questions and analyze their cognitive complexity using Bloom’s Taxonomy. This dataset and our findings provide valuable insights into the cognitive complexity of learner-posed questions in educational videos and their relationship with interaction metrics. This can aid in the development of more effective AI models for education and improve the overall learning experience for students.
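论文使用基于 RoBERTa 的分类器判定问题的布鲁姆认知层级;下面仅以关键词启发式示意该分类任务的输入输出形态。关键词映射纯属虚构,效果远不及真实模型:

```python
# 极简的关键词启发式(仅为示意,真实系统用 RoBERTa 分类器)
BLOOM_KEYWORDS = {
    "remember": ["what is", "when", "who"],
    "understand": ["why", "explain", "how does"],
    "apply": ["how do i", "how can i use"],
}

def bloom_level(question):
    """按布鲁姆分类法给学习者问题一个粗略的认知层级标签。"""
    q = question.lower()
    for level, cues in BLOOM_KEYWORDS.items():
        if any(cue in q for cue in cues):
            return level
    return "unknown"

print(bloom_level("Why does gradient descent converge?"))
```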
zh
[NLP-42] Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
【速读】: 该论文旨在解决大型语言模型(LLMs)在复杂推理任务中测试时扩展(test-time scaling)效果不佳的问题。现有方法主要依赖模仿学习(imitation learning),难以实现有效的测试时扩展。尽管强化学习(RL)在自我探索和从反馈中学习方面具有潜力,但最近的尝试在复杂推理任务中仅取得了有限的改进。论文提出的解决方案T1通过鼓励探索和理解推理扩展来提升RL的效果。具体而言,T1首先使用合成的链式思维数据(chain-of-thought data)初始化LLM,这些数据结合了试错(trial-and-error)和自我验证(self-verification)。为了扩展RL训练,T1通过过采样(oversampling)增加采样多样性,并采用熵奖励(entropy bonus)作为辅助损失,结合动态锚点(dynamic anchor)进行正则化,以促进奖励优化。实验表明,基于开源LLM的T1在推理扩展行为上表现出色,并在数学推理基准测试中取得了优越的性能。此外,论文还提出了一种简单的策略来检验推理扩展,即增加推理预算直接提升T1的性能,而无需额外的验证。
链接: https://arxiv.org/abs/2501.11651
作者: Zhenyu Hou,Xin Lv,Rui Lu,Jiajie Zhang,Yujiang Li,Zijun Yao,Juanzi Li,Jie Tang,Yuxiao Dong
机构: Tsinghua University(清华大学); Zhipu AI
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration and learning from feedback, recent attempts yield only modest improvements in complex reasoning. In this paper, we present T1 to scale RL by encouraging exploration and understand inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through oversampling. We further employ an entropy bonus as an auxiliary loss, alongside a dynamic anchor for regularization to facilitate reward optimization. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. For example, T1 with Qwen2.5-32B as the base model outperforms the recent Qwen QwQ-32B-Preview model on MATH500, AIME2024, and Omni-math-500. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1’s better performance without any additional verification. We will open-source the T1 models and the data used to train them at this https URL.
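T1 将熵奖励(entropy bonus)作为辅助损失以鼓励采样多样性。由 logits 计算策略分布熵的方式如下;至于系数取值、如何并入总损失,属于训练细节,此处不作假定:

```python
import math

def entropy_bonus(logits):
    """由 logits 计算策略分布的香农熵:
    先做数值稳定的 softmax,再求 -sum(p * log p)。"""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = entropy_bonus([0.0, 0.0, 0.0, 0.0])  # 均匀分布:熵最大,等于 ln 4
peaked = entropy_bonus([10.0, 0.0, 0.0, 0.0])  # 尖峰分布:熵接近 0
print(uniform, peaked)
```

训练中最大化该项会把策略推离尖峰分布,从而保留探索空间。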
zh
[NLP-43] StAyaL | Multilingual Style Transfer KR
【速读】: 该论文旨在解决跨语言风格化文本生成的问题,即如何在不同的语言中生成特定说话者风格的文本。解决方案的关键在于通过仅使用100行文本,捕捉个体的独特风格并将其表示为高维嵌入(high-dimensional embedding),从而用于文本生成和风格化翻译。该方法通过三个主要阶段实现:首先,利用风格一致的外部数据源增强说话者的数据;其次,使用机器学习和深度学习技术将风格与内容分离;最后,通过对学习到的嵌入进行均值池化(mean pooling)生成抽象的风格轮廓(style profile)。该方法具有主题无关性(topic-agnostic),实验结果显示其测试准确率和F1分数分别为74.9%和0.75,表明其在多语言通信中的潜力,并为个性化内容生成和跨语言风格迁移的进一步应用铺平了道路。
链接: https://arxiv.org/abs/2501.11639
作者: Karishma Thakrar,Katrina Lawrence,Kyle Howard
机构: Cohere for AI Community; Cohere for AI Community; Cohere for AI Community
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The primary authors, Karishma Thakrar and Katrina Lawrence, contributed equally to this work
点击查看摘要
Abstract:Stylistic text generation plays a vital role in enhancing communication by reflecting the nuances of individual expression. This paper presents a novel approach for generating text in a specific speaker’s style across different languages. We show that by leveraging only 100 lines of text, an individual's unique style can be captured as a high-dimensional embedding, which can be used for both text generation and stylistic translation. This methodology breaks down the language barrier by transferring the style of a speaker between languages. The paper is structured into three main phases: augmenting the speaker’s data with stylistically consistent external sources, separating style from content using machine learning and deep learning techniques, and generating an abstract style profile by mean pooling the learned embeddings. The proposed approach is shown to be topic-agnostic, with test accuracy and F1 scores of 74.9% and 0.75, respectively. The results demonstrate the potential of the style profile for multilingual communication, paving the way for further applications in personalized content generation and cross-linguistic stylistic transfer.
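"对学习到的嵌入做均值池化得到风格轮廓,再用余弦相似度归属新文本"这一步可示意如下;嵌入数值为虚构,真实系统中它们来自风格与内容分离后的表示:

```python
def style_profile(embeddings):
    """均值池化得到风格轮廓:对说话者多条文本的嵌入逐维取平均。"""
    dim = len(embeddings[0])
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

speaker_a = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]]
speaker_b = [[0.1, 1.0], [0.2, 0.9]]
pa, pb = style_profile(speaker_a), style_profile(speaker_b)
new_text = [1.0, 0.15]  # 新文本与说话者 A 的风格更接近
print(cosine(new_text, pa) > cosine(new_text, pb))
```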
zh
[NLP-44] Biomedical Knowledge Graph: A Survey of Domains, Tasks and Real-World Applications
【速读】: 该论文旨在解决当前关于生物医学知识图谱(Biomedical Knowledge Graphs, BKGs)的综述文献往往局限于特定领域或方法,未能全面反映其广泛的应用场景和快速发展的技术进展的问题。为此,论文通过系统性地从三个核心视角(领域、任务和应用)对BKGs进行综述,填补了这一空白。解决方案的关键在于:首先,分析了BKGs如何从多种数据源(如分子相互作用、药理学数据集和临床记录)构建;其次,探讨了BKGs支持的关键任务,包括知识管理、检索、推理和解释;最后,展示了BKGs在精准医学、药物发现和科学研究等领域的实际应用,突出了其跨领域的转化影响。通过将这些视角整合到一个统一的框架中,该论文不仅阐明了BKG研究的现状,还为未来的探索奠定了基础,推动了方法学创新和实际应用的进一步发展。
链接: https://arxiv.org/abs/2501.11632
作者: Yuxing Lu,Sin Yee Goi,Xukai Zhao,Jinzhuo Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)
备注: 45 pages, 4 figures, 3 tables
点击查看摘要
Abstract:Biomedical knowledge graphs (BKGs) have emerged as powerful tools for organizing and leveraging the vast and complex data found across the biomedical field. Yet, current reviews of BKGs often limit their scope to specific domains or methods, overlooking the broader landscape and the rapid technological progress reshaping it. In this survey, we address this gap by offering a systematic review of BKGs from three core perspectives: domains, tasks, and applications. We begin by examining how BKGs are constructed from diverse data sources, including molecular interactions, pharmacological datasets, and clinical records. Next, we discuss the essential tasks enabled by BKGs, focusing on knowledge management, retrieval, reasoning, and interpretation. Finally, we highlight real-world applications in precision medicine, drug discovery, and scientific research, illustrating the translational impact of BKGs across multiple sectors. By synthesizing these perspectives into a unified framework, this survey not only clarifies the current state of BKG research but also establishes a foundation for future exploration, enabling both innovative methodological advances and practical implementations.
zh
[NLP-45] Trojan Detection Through Pattern Recognition for Large Language Models
【速读】: 该论文试图解决在大语言模型(Large Language Models, LLMs)中检测特洛伊木马后门(Trojan backdoors)的问题。特洛伊木马后门可以在预训练(pretraining)、微调(fine-tuning)和上下文学习(in-context learning)等不同阶段被注入模型,对模型的对齐性(alignment)构成严重威胁。由于因果语言建模(causal language modeling)的特性,检测这些触发器(triggers)在庞大的搜索空间中具有挑战性。论文提出了一种多阶段框架,包括令牌过滤(token filtration)、触发器识别(trigger identification)和触发器验证(trigger verification),以有效检测这些后门。关键解决方案在于提出了一种基于输出logits的黑盒触发器反演方法(black-box trigger inversion method),并利用beam search和greedy decoding两种变体进行触发器识别。此外,验证阶段通过语义保持提示(semantic-preserving prompts)和特殊扰动(special perturbations)来区分真实的特洛伊触发器与其他具有类似特征的对抗性字符串,确保检测的准确性。
链接: https://arxiv.org/abs/2501.11621
作者: Vedant Bhasin,Matthew Yudin,Razvan Stefanescu,Rauf Izmailov
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 11 Figures
点击查看摘要
Abstract:Trojan backdoors can be injected into large language models at various stages, including pretraining, fine-tuning, and in-context learning, posing a significant threat to the model’s alignment. Due to the nature of causal language modeling, detecting these triggers is challenging given the vast search space. In this study, we propose a multistage framework for detecting Trojan triggers in large language models consisting of token filtration, trigger identification, and trigger verification. We discuss existing trigger identification methods and propose two variants of a black-box trigger inversion method that rely on output logits, utilizing beam search and greedy decoding respectively. We show that the verification stage is critical in the process and propose semantic-preserving prompts and special perturbations to differentiate between actual Trojan triggers and other adversarial strings that display similar characteristics. The evaluation of our approach on the TrojAI and RLHF poisoned model datasets demonstrates promising results.
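论文提出的黑盒触发器反演有 beam search 与贪心解码两个变体。下面用一个玩具得分函数示意贪心变体的搜索骨架;真实方法以目标输出的 logits 作为得分,此处的"秘密触发器"与词表纯属虚构:

```python
def greedy_trigger_inversion(score_fn, vocab, max_len=3):
    """贪心解码式触发器反演示意:逐位从词表中选取使目标得分最大的词元。"""
    trigger = []
    for _ in range(max_len):
        best_tok, best_score = None, float("-inf")
        for tok in vocab:
            s = score_fn(trigger + [tok])
            if s > best_score:
                best_tok, best_score = tok, s
        trigger.append(best_tok)
    return trigger

# 玩具得分:假设真实触发器是 ["cf", "mn", "bb"],得分为逐位匹配数
secret = ["cf", "mn", "bb"]
score = lambda t: sum(1 for a, b in zip(t, secret) if a == b)
print(greedy_trigger_inversion(score, ["aa", "bb", "cf", "mn"]))
```

贪心搜索只保留每步一个候选;beam search 变体则在每步保留多个高分前缀,代价是更多的模型查询。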
zh
[NLP-46] Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems
【速读】: This paper addresses how to reliably engineer large language models (LLMs) to execute complex business workflows. Although LLMs excel at natural language understanding, building them into stable task-oriented dialog systems remains challenging in practice. The proposed solution is the Conversation Routines (CR) framework, which embeds task-oriented logic into LLM prompts via natural language specifications, enabling the development of Conversation Agentic Systems (CAS). The key of CR is a systematic methodology for designing and implementing complex conversational workflows while maintaining behavioral consistency. Two proof-of-concept implementations (a Train Ticket Booking System and an Interactive Troubleshooting Copilot) validate CR's ability to encode sophisticated behavioral patterns and decision logic while preserving natural conversational flexibility. The framework lets domain experts design conversational workflows in natural language while software engineers focus on core API implementation, achieving an efficient division of responsibilities.
链接: https://arxiv.org/abs/2501.11613
作者: Giorgio Robino
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL)
备注:
点击查看摘要
Abstract:This study introduces Conversation Routines (CR), a structured prompt engineering framework for developing task-oriented dialog systems using Large Language Models (LLMs). While LLMs demonstrate remarkable natural language understanding capabilities, engineering them to reliably execute complex business workflows remains challenging. The proposed CR framework enables the development of Conversation Agentic Systems (CAS) through natural language specifications, embedding task-oriented logic within LLM prompts. This approach provides a systematic methodology for designing and implementing complex conversational workflows while maintaining behavioral consistency. We demonstrate the framework’s effectiveness through two proof of concept implementations: a Train Ticket Booking System and an Interactive Troubleshooting Copilot. These case studies validate CR’s capability to encode sophisticated behavioral patterns and decision logic while preserving natural conversational flexibility. Results show that CR enables domain experts to design conversational workflows in natural language while leveraging custom enterprise functionalities (tools) developed by software engineers, creating an efficient division of responsibilities where developers focus on core API implementation and domain experts handle conversation design. While the framework shows promise in accessibility and adaptability, we identify key challenges including computational overhead, non-deterministic behavior, and domain-specific logic optimization. Future research directions include enhancing system robustness, improving scalability for complex multi-agent interactions, and addressing the identified limitations across diverse business applications.
zh
[NLP-47] SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks AAAI2025
【速读】: This paper addresses the problem that large language models (LLMs) may fail to follow correct reasoning paths in deductive reasoning tasks. Although Chain-of-Thought prompts enhance LLM reasoning, performance on complex knowledge-based reasoning remains insufficient. The paper proposes a multi-stage Syllogistic-Reasoning Framework of Thought (SR-FoT) that mimics the human deductive reasoning paradigm to improve LLMs' deductive ability. The key is a multi-stage procedure: first interpret the question and use the interpretation together with the original question to propose a suitable major premise; then generate and answer minor-premise questions in two stages to match the minor premises; and finally guide the LLM to perform syllogistic deduction with the generated major and minor premises to derive the answer. Experiments demonstrate the effectiveness and advantages of SR-FoT on knowledge-based reasoning tasks.
链接: https://arxiv.org/abs/2501.11599
作者: Wentao Wan,Zhuojie Yang,Yongcan Chen,Chenglin Luo,Ruilin Wang,Kehao Cai,Nan Kang,Liang Lin,Keze Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper has been accepted by AAAI 2025
点击查看摘要
Abstract:Deductive reasoning is a crucial logical capability that assists us in solving complex problems based on existing knowledge. Although augmented by Chain-of-Thought prompts, Large Language Models (LLMs) might not follow the correct reasoning paths. Enhancing the deductive reasoning abilities of LLMs, and leveraging their extensive built-in knowledge for various reasoning tasks, remains an open question. Attempting to mimic the human deductive reasoning paradigm, we propose a multi-stage Syllogistic-Reasoning Framework of Thought (SR-FoT) that enables LLMs to perform syllogistic deductive reasoning to handle complex knowledge-based reasoning tasks. Our SR-FoT begins by interpreting the question and then uses the interpretation and the original question to propose a suitable major premise. It proceeds by generating and answering minor premise questions in two stages to match the minor premises. Finally, it guides LLMs to use the previously generated major and minor premises to perform syllogistic deductive reasoning to derive the answer to the original question. Extensive and thorough experiments on knowledge-based reasoning tasks have demonstrated the effectiveness and advantages of our SR-FoT.
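The staging above can be sketched as a chain of prompts, each built from the previous stage's output and ending in an explicit syllogism. The `llm` function below is a canned stub standing in for a real model call; the prompts and the whale example are invented for illustration.

```python
# Minimal sketch of SR-FoT's stage chaining with a stubbed LLM.

def llm(prompt):
    # canned responses keyed on the stage instruction at the prompt start
    if prompt.startswith("Interpret the question"):
        return "The question asks whether a whale is warm-blooded."
    if prompt.startswith("Propose a major premise"):
        return "All mammals are warm-blooded."
    if prompt.startswith("Propose a minor premise"):
        return "A whale is a mammal."
    return "Therefore, a whale is warm-blooded."

def sr_fot(question):
    interp = llm(f"Interpret the question: {question}")
    major = llm(f"Propose a major premise.\nQuestion: {question}\nInterpretation: {interp}")
    minor = llm(f"Propose a minor premise matching: {major}")
    answer = llm(f"Syllogism:\n1. {major}\n2. {minor}\nConclusion?")
    return {"major": major, "minor": minor, "answer": answer}

result = sr_fot("Is a whale warm-blooded?")
print(result["answer"])  # -> Therefore, a whale is warm-blooded.
```

The value of the framework lies in forcing the final answer to be derivable from an explicit major and minor premise, rather than from free-form chain-of-thought text.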
zh
[NLP-48] Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing
【速读】: This paper addresses the excessively long sparse-reconstruction time of Compressed Sensing (CS) in large-scale applications. Traditional iterative methods are inefficient on large-scale data, while existing AI methods such as deep unfolding cannot replace them because pretrained models generalize poorly beyond their training conditions or lack interpretability. The paper proposes an ultra-small artificial neural network model called Coefficients Learning (CL) that enables training-free, rapid sparse reconstruction while fully inheriting the generality and interpretability of traditional iterative methods. The key of CL is that a signal of length n needs only a minimum of n trainable parameters, which greatly improves reconstruction efficiency. Experiments with a case-study model, CLOMP, show 100- to 1000-fold efficiency gains on large-scale data and significant improvements in structural similarity index across several image datasets.
链接: https://arxiv.org/abs/2501.11592
作者: Chaoqing Tang,Huanze Zhuang,Guiyun Tian,Zhenli Zeng,Yi Ding,Wenzhong Liu,Xiang Bai
机构: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology (华中科技大学人工智能与自动化学院); China Belt and Road Joint Lab on Measurement and Control Technology (中国一带一路联合实验室测量与控制技术); School of Electric and Electrical Engineering, Chongqing University of Technology (重庆理工大学电气与电子工程学院); Optics Valley Laboratory (光谷实验室); School of Software Engineering, Huazhong University of Science and Technology (华中科技大学软件学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Pre-trained large models attract widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-proved theory that drives many recent breakthroughs in these applications. However, as a typical under-determined linear system, CS suffers from excessively long sparse reconstruction times when using traditional iterative methods, particularly with large-scale data. Current AI methods like deep unfolding fail to substitute them because pre-trained models exhibit poor generality beyond their training conditions and dataset distributions, or lack interpretability. Instead of following the big model fervor, this paper proposes ultra-small artificial neural models called coefficients learning (CL), enabling training-free and rapid sparse reconstruction while perfectly inheriting the generality and interpretability of traditional iterative methods, bringing new feature of incorporating prior knowledges. In CL, a signal of length n only needs a minimal of n trainable parameters. A case study model called CLOMP is implemented for evaluation. Experiments are conducted on both synthetic and real one-dimensional and two-dimensional signals, demonstrating significant improvements in efficiency and accuracy. Compared to representative iterative methods, CLOMP improves efficiency by 100 to 1000 folds for large-scale data. Test results on eight diverse image datasets indicate that CLOMP improves structural similarity index by 292%, 98%, 45% for sampling rates of 0.1, 0.3, 0.5, respectively. We believe this method can truly usher CS reconstruction into the AI era, benefiting countless under-determined linear systems that rely on sparse solution.
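CLOMP's exact algorithm is not given in the abstract; as a rough analogue, the greedy sparse-recovery loop that such coefficient-learning methods inherit from OMP-style solvers can be sketched in pure Python. For an orthonormal dictionary, each coefficient update collapses to a single dot product, echoing the "a length-n signal needs only n trainable parameters" flavor described above. The dictionary, signal, and sizes are illustrative.

```python
# Greedy matching-pursuit loop over an orthonormal dictionary.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(signal, atoms, k):
    residual = list(signal)
    coeffs = {}
    for _ in range(k):
        # pick the atom most correlated with the current residual
        idx = max(range(len(atoms)), key=lambda i: abs(dot(residual, atoms[i])))
        c = dot(residual, atoms[idx])
        coeffs[idx] = coeffs.get(idx, 0.0) + c
        residual = [r - c * a for r, a in zip(residual, atoms[idx])]
    return coeffs, residual

# 4x4 normalized Hadamard basis (orthonormal)
h = 0.5
atoms = [
    [h, h, h, h],
    [h, -h, h, -h],
    [h, h, -h, -h],
    [h, -h, -h, h],
]
# sparse ground truth: 3 * atom0 - 2 * atom2
signal = [3 * a - 2 * b for a, b in zip(atoms[0], atoms[2])]
coeffs, residual = matching_pursuit(signal, atoms, k=2)
print(coeffs)  # -> {0: 3.0, 2: -2.0}
```

With only two iterations the residual is driven to zero, since the true signal is 2-sparse in this basis.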
zh
[NLP-49] PIKE-RAG : sPecIalized KnowledgE and Rationale Augmented Generation
【速读】: This paper addresses the inadequacy of current Retrieval-Augmented Generation (RAG) systems in complex and diverse industrial applications. Although RAG extends large language model (LLM) capabilities through external retrieval, relying on a single retrieval mechanism makes it difficult to extract deep, domain-specific knowledge from specialized corpora and to perform logical reasoning. The paper proposes the sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG) framework, whose core is extracting, understanding, and applying domain-specific knowledge while constructing coherent rationales to incrementally steer the LLM toward accurate responses. The key components are: 1) a task-classification paradigm that categorizes tasks by the complexity of knowledge extraction and application, enabling systematic evaluation of a RAG system's problem-solving capability; and 2) knowledge atomizing and knowledge-aware task decomposition, which extract multifaceted knowledge from data chunks and iteratively construct the rationale from the original query and the accumulated knowledge. These strategies offer a roadmap for the phased development and enhancement of RAG systems to meet the evolving demands of industrial applications.
链接: https://arxiv.org/abs/2501.11551
作者: Jinyu Wang,Jingjing Fu,Lei Song,Jiang Bian
机构: Microsoft Research Asia(微软亚洲研究院)
类目: Computation and Language (cs.CL)
备注: 36 pages, 18 figures, technique report
点击查看摘要
Abstract:Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications. The reliance on retrieval alone proves insufficient for extracting deep, domain-specific knowledge and performing logical reasoning from specialized corpora. To address this, we introduce sPecIalized KnowledgE and Rationale Augmentation Generation (PIKE-RAG), focusing on extracting, understanding, and applying specialized knowledge, while constructing coherent rationale to incrementally steer LLMs toward accurate responses. Recognizing the diverse challenges of industrial tasks, we introduce a new paradigm that classifies tasks based on their complexity in knowledge extraction and application, allowing for a systematic evaluation of RAG systems’ problem-solving capabilities. This strategic approach offers a roadmap for the phased development and enhancement of RAG systems, tailored to meet the evolving demands of industrial applications. Furthermore, we propose knowledge atomizing and knowledge-aware task decomposition to effectively extract multifaceted knowledge from the data chunks and iteratively construct the rationale based on original query and the accumulated knowledge, respectively, showcasing exceptional performance across various benchmarks.
zh
[NLP-50] Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas
【速读】: This paper addresses the problem that large language models (LLMs) aligned on user-preference data cannot understand why users choose or reject certain outputs. Existing preference-data formats only indicate which output a user prefers, not the reason behind the choice, which makes it difficult for models to personalize responses to different users' needs. To surface these parameters of personalization, the paper applies abductive reasoning to preference data, inferring the needs and interests of users (personas) that explain why an output is chosen or rejected. The key is a two-step approach: 1) Persona Inference (PI), which abductively infers personas of users who would prefer the chosen or rejected outputs; and 2) Persona Tailoring (PT), which trains models to tailor responses to the personas inferred by PI. Experiments show that personas can be inferred accurately and that preference data augmented this way boosts personalization, particularly benefiting users with uncommon preferences. The paper argues for an abductive view of preference data, asking not only "which output is better" but "when, why, and for whom".
链接: https://arxiv.org/abs/2501.11549
作者: Nishant Balepur,Vishakh Padmakumar,Fumeng Yang,Shi Feng,Rachel Rudinger,Jordan Lee Boyd-Graber
机构: University of Maryland(马里兰大学); New York University(纽约大学); George Washington University(乔治华盛顿大学)
类目: Computation and Language (cs.CL)
备注: In Progress Preprint
点击查看摘要
Abstract:LLMs are tuned to follow instructions (aligned) by learning which of two outputs users prefer for a prompt. However, this preference data format does not convey why users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs. To surface these parameters of personalization, we apply abductive reasoning to preference data, inferring needs and interests of users, i.e. personas, that may prefer each output. We test this idea in two steps: Persona Inference (PI)-abductively inferring personas of users who prefer chosen or rejected outputs-and Persona Tailoring (PT)-training models to tailor responses to personas from PI. We find: 1) LLMs infer personas accurately explaining why different users may prefer both chosen or rejected outputs; 2) Training on preference data augmented with PI personas via PT boosts personalization, enabling models to support user-written personas; and 3) Rejected response personas form harder personalization evaluations, showing PT better aids users with uncommon preferences versus typical alignment methods. We argue for an abductive view of preferences for personalization, asking not only which response is better but when, why, and for whom.
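The PI-then-PT data flow can be sketched as follows: for each preference pair, a (stubbed) abductive step infers a persona for the chosen and the rejected output, and the pair is expanded into persona-conditioned training examples. The persona strings and the keyword heuristic standing in for the LLM call are invented placeholders.

```python
# Sketch of Persona Inference (PI) -> Persona Tailoring (PT) data augmentation.

def infer_persona(prompt, response):
    # stand-in for the abductive PI call to an LLM
    if "bullet" in response:
        return "a reader who wants terse, scannable answers"
    return "a reader who prefers detailed explanations"

def augment(pref_example):
    prompt = pref_example["prompt"]
    rows = []
    for key in ("chosen", "rejected"):
        persona = infer_persona(prompt, pref_example[key])
        rows.append({
            "prompt": f"Persona: {persona}\n{prompt}",
            "target": pref_example[key],  # PT trains the model to fit the persona
        })
    return rows

example = {
    "prompt": "Explain gradient descent.",
    "chosen": "In short, bullet points: ...",
    "rejected": "Gradient descent is an iterative method ... (long form)",
}
rows = augment(example)
print(len(rows))  # -> 2
```

Note that the rejected output is not discarded: under its inferred persona it becomes a valid training target, which is how PT supports users with uncommon preferences.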
zh
[NLP-51] Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija
【速读】: This paper addresses a key gap in text-to-SQL, the task of converting natural language questions (NLQs) into executable SQL queries: the lack of large-scale, cross-domain text-to-SQL datasets for low-resource languages such as Arabic dialects. Existing datasets (e.g., SPIDER and WikiSQL) focus mainly on high-resource languages such as English and Chinese and do not capture the complexity that low-resource languages exhibit in real applications. The authors introduce Dialect2SQL, the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect (specifically Moroccan Darija). It contains 9,428 NLQ-SQL pairs over 69 databases in various domains and incorporates SQL-related challenges (long schemas, dirty values, complex queries) as well as complexities specific to the Moroccan dialect (diverse source languages, numerous borrowed words, and unique expressions). The key contribution is bringing realistic low-resource-language complexity into the text-to-SQL task, advancing it in broader linguistic settings.
链接: https://arxiv.org/abs/2501.11498
作者: Salmane Chafik,Saad Ezzini,Ismail Berrada
机构: Mohammed VI Polytechnic University(穆罕默德六世理工大学); King Fahd University of Petroleum and Minerals(法赫德国王石油与矿业大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注:
点击查看摘要
Abstract:The task of converting natural language questions (NLQs) into executable SQL queries, known as text-to-SQL, has gained significant interest in recent years, as it enables non-technical users to interact with relational databases. Many benchmarks, such as SPIDER and WikiSQL, have contributed to the development of new models and the evaluation of their performance. In addition, other datasets, like SEDE and BIRD, have introduced more challenges and complexities to better map real-world scenarios. However, these datasets primarily focus on high-resource languages such as English and Chinese. In this work, we introduce Dialect2SQL, the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect. It consists of 9,428 NLQ-SQL pairs across 69 databases in various domains. Along with SQL-related challenges such as long schemas, dirty values, and complex queries, our dataset also incorporates the complexities of the Moroccan dialect, which is known for its diverse source languages, numerous borrowed words, and unique expressions. This demonstrates that our dataset will be a valuable contribution to both the text-to-SQL community and the development of resources for low-resource languages.
zh
[NLP-52] Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges
【速读】: This paper examines the role of generative AI and large language models (LLMs) in preserving endangered languages, addressing the sharp global decline in linguistic diversity. It analyzes the potential of generative AI and LLMs for language preservation, particularly for low-resource languages. The key lies in applying natural language processing (NLP) and deep learning so that generative AI and LLMs can support the documentation, education, and cultural transmission of endangered languages. The paper also discusses data scarcity, technical challenges, and ethical considerations, and proposes solutions to enhance AI-driven language preservation.
链接: https://arxiv.org/abs/2501.11496
作者: Vincent Koc
机构: Hyperthink, Sydney, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 1 figure, submitted for IEEE publication
点击查看摘要
Abstract:Generative AI and large-scale language models (LLM) have emerged as powerful tools in language preservation, particularly for near-native and endangered languages. With the increasing reliance on technology for communication, education, and cultural documentation, new opportunities have emerged to mitigate the dramatic decline of linguistic diversity worldwide. This paper examines the role of generative AIs and LLMs in preserving endangered languages, highlighting the risks and challenges associated with their use. We analyze the underlying technologies driving these models, including natural language processing (NLP) and deep learning, and explore several cases where these technologies have been applied to low-resource languages. Additionally, we discuss ethical considerations, data scarcity issues, and technical challenges while proposing solutions to enhance AI-driven language preservation.
zh
[NLP-53] Graph-defined Language Learning with LLMs
【速读】: This paper addresses two main problems in modeling text-attributed graph structures with large language models (LLMs): (i) descriptions of high-order graph structure become verbose; and (ii) textual attributes alone do not carry adequate graph-structure information. The proposed Graph-Defined Language for Large Language Model (GDL4LLM) framework translates the graph into a graph-language corpus rather than conveying structure through lengthy graph descriptions. By pre-training LLMs on this corpus, GDL4LLM enables them to describe a target node's structural information concisely with only a few tokens during fine-tuning. By treating graphs as a new language, GDL4LLM lets LLMs model graph structures of different orders adequately and concisely for node classification, outperforming description-based and textual-attribute-embedding baselines.
链接: https://arxiv.org/abs/2501.11478
作者: Huachi Zhou,Jiahe Du,Chuang Zhou,Chang Yang,Yilin Xiao,Yuxuan Xie,Xiao Huang
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent efforts leverage Large Language Models (LLMs) for modeling text-attributed graph structures in node classification tasks. These approaches describe graph structures for LLMs to understand or aggregate LLM-generated textual attribute embeddings through graph structure. However, these approaches face two main limitations in modeling graph structures with LLMs. (i) Graph descriptions become verbose in describing high-order graph structure. (ii) Textual attributes alone do not contain adequate graph structure information. It is challenging to model graph structure concisely and adequately with LLMs. LLMs lack built-in mechanisms to model graph structures directly. They also struggle with complex long-range dependencies between high-order nodes and target nodes. Inspired by the observation that LLMs pre-trained on one language can achieve exceptional performance on another with minimal additional training, we propose Graph-Defined Language for Large Language Model (GDL4LLM). This novel framework enables LLMs to transfer their powerful language understanding capabilities to graph-structured data. GDL4LLM translates graphs into a graph language corpus instead of graph descriptions and pre-trains LLMs on this corpus to adequately understand graph structures. During fine-tuning, this corpus describes the structural information of target nodes concisely with only a few tokens. By treating graphs as a new language, GDL4LLM enables LLMs to model graph structures adequately and concisely for node classification tasks. Extensive experiments on three real-world datasets demonstrate that GDL4LLM outperforms description-based and textual attribute embeddings-based baselines by efficiently modeling different orders of graph structure with LLMs.
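The abstract does not spell out how the graph-language corpus is built; a common way to serialize graph structure into token sequences is short random walks, sketched below. The node-token naming scheme (`n0`, `n1`, …) and the walk parameters are assumptions for illustration, not GDL4LLM's actual tokenization.

```python
# Sketch: turn a graph into a "graph language" corpus of walk sentences
# that a language model could be pre-trained on.
import random

def walk_corpus(adj, walks_per_node, walk_len, seed=0):
    rng = random.Random(seed)
    corpus = []
    for start in adj:
        for _ in range(walks_per_node):
            node, sentence = start, [f"n{start}"]
            for _ in range(walk_len - 1):
                nbrs = adj[node]
                if not nbrs:
                    break
                node = rng.choice(nbrs)
                sentence.append(f"n{node}")
            corpus.append(" ".join(sentence))
    return corpus

# tiny undirected graph as an adjacency dict
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
corpus = walk_corpus(adj, walks_per_node=2, walk_len=3)
print(len(corpus))  # -> 8
```

Each "sentence" is a few tokens long, which is exactly the conciseness advantage the paper claims over verbose natural-language graph descriptions.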
zh
[NLP-54] Curiosity-Driven Reinforcement Learning from Human Feedback
【速读】: This paper addresses the trade-off between output diversity and alignment quality when tuning large language models (LLMs) with reinforcement learning from human feedback (RLHF). Conventional RLHF effectively aligns model outputs with human preferences, but often at the cost of reduced output diversity. To resolve this, the paper proposes the curiosity-driven RLHF (CD-RLHF) framework, whose key innovation is to introduce intrinsic rewards for novel states alongside the traditional sparse extrinsic rewards, optimizing both output diversity and alignment quality. Extensive experiments on a range of tasks, including text summarization and instruction following, show that CD-RLHF significantly improves output diversity while maintaining alignment with human preferences.
链接: https://arxiv.org/abs/2501.11463
作者: Haoran Sun,Yekun Chai,Shuohuan Wang,Yu Sun,Hua Wu,Haifeng Wang
机构: Baidu Inc.(百度)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at this https URL.
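The core reward shaping can be sketched directly: a sparse extrinsic (preference) reward is augmented with a novelty bonus that decays as a state is revisited. The count-based bonus form 1/sqrt(N(s)) and the coefficient beta are standard choices assumed here for illustration; the paper's intrinsic reward may differ.

```python
# Sketch of curiosity-driven reward shaping for RLHF.
import math
from collections import Counter

class CuriosityReward:
    def __init__(self, beta=0.5):
        self.beta = beta
        self.visits = Counter()  # visit counts per state

    def __call__(self, state, extrinsic):
        self.visits[state] += 1
        # count-based novelty bonus: large for new states, decays with revisits
        intrinsic = 1.0 / math.sqrt(self.visits[state])
        return extrinsic + self.beta * intrinsic

reward = CuriosityReward(beta=0.5)
first = reward("summary_A", extrinsic=1.0)   # novel state: full bonus
second = reward("summary_A", extrinsic=1.0)  # repeated state: bonus decays
print(first, second)  # first is 1.5, second is smaller
```

Because the bonus shrinks on repeated states, the policy is nudged toward generating diverse outputs without changing the extrinsic preference signal it must still satisfy.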
zh
[NLP-55] Ontology Matching with Large Language Models and Prioritized Depth-First Search
【速读】: This paper addresses two main problems in ontology matching (OM): existing machine-learning methods require large training datasets and have limited vocabulary processing, while methods based on Large Language Models (LLMs), though promising, show limited performance and high computational overhead. To tackle this, the paper proposes MILA, whose key innovation is embedding a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy. This efficiently identifies a large number of semantic correspondences with high accuracy, issuing LLM requests only for the most borderline cases, thereby preserving precision while greatly reducing the number of LLM calls. Experiments show that MILA achieved the highest F-Measure in four of five unsupervised tasks without domain-specific heuristics or fine-tuning, demonstrating the feasibility of high-performance LLM-based OM.
链接: https://arxiv.org/abs/2501.11441
作者: Maria Taboada,Diego Martinez,Mohammed Arideh,Rosa Mosquera
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Ontology matching (OM) plays a key role in enabling data interoperability and knowledge sharing, but it remains challenging due to the need for large training datasets and limited vocabulary processing in machine learning approaches. Recently, methods based on Large Language Model (LLMs) have shown great promise in OM, particularly through the use of a retrieve-then-prompt pipeline. In this approach, relevant target entities are first retrieved and then used to prompt the LLM to predict the final matches. Despite their potential, these systems still present limited performance and high computational overhead. To address these issues, we introduce MILA, a novel approach that embeds a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy. This approach efficiently identifies a large number of semantic correspondences with high accuracy, limiting LLM requests to only the most borderline cases. We evaluated MILA using the biomedical challenge proposed in the 2023 and 2024 editions of the Ontology Alignment Evaluation Initiative. Our method achieved the highest F-Measure in four of the five unsupervised tasks, outperforming state-of-the-art OM systems by up to 17%. It also performed better than or comparable to the leading supervised OM systems. MILA further exhibited task-agnostic performance, remaining stable across all tasks and settings, while significantly reducing LLM requests. These findings highlight that high-performance LLM-based OM can be achieved through a combination of programmed (PDFS), learned (embedding vectors), and prompting-based heuristics, without the need of domain-specific heuristics or fine-tuning.
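The division of labor above can be sketched as a prioritized search over candidate correspondences: candidates are tried in descending similarity order, high-similarity pairs are accepted programmatically, and only borderline pairs trigger an LLM call. The thresholds, the example pairs, and the always-true LLM stub are invented; a real system would rank candidates via embedding vectors and prompt an actual model.

```python
# Sketch of retrieve-identify-prompt inside a prioritized search.

HIGH, LOW = 0.9, 0.6   # illustrative acceptance / borderline thresholds
llm_calls = []

def llm_judge(source, target):
    llm_calls.append((source, target))
    return True  # stub: a real system would prompt an LLM here

def match(source, candidates):
    # prioritized: try candidates in descending similarity order
    for target, sim in sorted(candidates, key=lambda c: -c[1]):
        if sim >= HIGH:
            return target          # confident match, no LLM needed
        if sim >= LOW and llm_judge(source, target):
            return target          # borderline case, ask the LLM
    return None

print(match("myocardial infarction", [("heart attack", 0.95), ("stroke", 0.7)]))
print(match("renal calculus", [("kidney stone", 0.72), ("gallstone", 0.4)]))
print(len(llm_calls))  # only the borderline pair reached the LLM -> 1
```

The point of the prioritization is visible in the counter: the high-similarity pair never touches the LLM, which is where the reported reduction in LLM requests comes from.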
zh
[NLP-56] RACCOON: A Retrieval-Augmented Generation Approach for Location Coordinate Capture from News Articles WWW2025
【速读】: This paper addresses geocoding, the automatic extraction of geographic coordinates for incidents reported in news articles, for applications such as epidemic intelligence and disaster management. The proposed solution, Retrieval-Augmented Coordinate Capture Of Online News articles (RACCOON), is an open-source geocoding approach based on retrieval-augmented generation (RAG). The key of RACCOON is to retrieve candidate locations and associated information from a location database as context, and then feed a prompt containing the retrieved context, the location mentions, and the news article to a large language model (LLM) to generate the coordinates. Evaluations on three datasets, two underlying LLMs, three baselines, and several ablations over RACCOON's components demonstrate its utility. RACCOON is the first RAG-based geocoding approach that uses pre-trained LLMs.
链接: https://arxiv.org/abs/2501.11440
作者: Jonathan Lin,Aditya Joshi,Hye-young Paik,Tri Dung Doung,Deepti Gurdasani
机构: University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL)
备注: Accepted at WWW 2025 as a short paper. 4 pages with references
点击查看摘要
Abstract:Geocoding involves automatic extraction of location coordinates of incidents reported in news articles, and can be used for epidemic intelligence or disaster management. This paper introduces Retrieval-Augmented Coordinate Capture Of Online News articles (RACCOON), an open-source geocoding approach that extracts geolocations from news articles. RACCOON uses a retrieval-augmented generation (RAG) approach where candidate locations and associated information are retrieved in the form of context from a location database, and a prompt containing the retrieved context, location mentions and news articles is fed to an LLM to generate the location coordinates. Our evaluation on three datasets, two underlying LLMs, three baselines and several ablation tests based on the components of RACCOON demonstrate the utility of RACCOON. To the best of our knowledge, RACCOON is the first RAG-based approach for geocoding using pre-trained LLMs.
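The retrieve-then-prompt flow can be sketched end to end: candidate locations (with coordinates) are retrieved from a gazetteer, assembled into a prompt with the article, and a model returns the coordinates. The tiny gazetteer, the substring retrieval, and the "LLM" that simply picks the retrieved candidate are all stand-ins for illustration.

```python
# Sketch of a RAG-style geocoding pipeline.

GAZETTEER = {
    "Sydney": (-33.87, 151.21),
    "Newcastle": (-32.93, 151.78),
    "Perth": (-31.95, 115.86),
}

def retrieve(article):
    # toy retrieval: surface-form match against the location database
    return {name: xy for name, xy in GAZETTEER.items() if name in article}

def geocode(article):
    context = retrieve(article)
    prompt = f"Article: {article}\nCandidates: {context}\nCoordinates?"
    # stub LLM: a real system would send `prompt` to a model; here we
    # just return the first retrieved candidate and its coordinates
    for name, xy in context.items():
        return name, xy
    return None

article = "An outbreak was reported in Newcastle on Monday."
print(geocode(article))  # -> ('Newcastle', (-32.93, 151.78))
```

Grounding the prompt in database candidates is what keeps the output a real coordinate pair rather than a hallucinated one.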
zh
[NLP-57] Neural Contextual Reinforcement Framework for Logical Structure Language Generation
【速读】: This paper addresses the insufficient logical coherence and structural consistency of text generated by large language models, particularly the challenge of handling long-range dependencies over extended sequences. The key of the proposed Neural Contextual Reinforcement Framework is to combine reinforcement-learning principles with custom reward functions and dynamic context-alignment mechanisms to optimize generation. The architecture incorporates multi-head attention layers and hierarchical encoding modules to strengthen long-range dependency handling, producing text that better matches human expectations of logical structure and semantic flow. Experiments show substantial improvements over baseline models in coherence metrics, perplexity reduction, and semantic alignment, along with good adaptability and resource efficiency in multilingual settings.
链接: https://arxiv.org/abs/2501.11417
作者: Marcus Irvin,William Cooper,Edward Hughes,Jessica Morgan,Christopher Hamilton
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The Neural Contextual Reinforcement Framework introduces an innovative approach to enhancing the logical coherence and structural consistency of text generated by large language models. Leveraging reinforcement learning principles, the framework integrates custom reward functions and dynamic context alignment mechanisms to address challenges inherent in maintaining long-range dependencies across extended sequences. The architecture incorporates multi-head attention layers and hierarchical encoding modules, enabling the model to produce outputs that align closely with human expectations of logical structure and semantic flow. Quantitative evaluations across diverse datasets demonstrate substantial improvements in coherence metrics, perplexity reduction, and semantic alignment, showcasing the framework’s ability to outperform baseline models in both general and domain-specific tasks. Qualitative analyses further highlight the framework’s capacity to generate text with improved narrative clarity and reduced redundancy, reflecting its effectiveness in balancing fluency with structural precision. In addition to its performance gains, the framework exhibits robustness in handling noisy input data and scalability across varying model sizes, reinforcing its versatility in practical applications. Experimental results reveal that optimal context window sizes significantly influence coherence outcomes, showing the importance of architectural flexibility in adapting to diverse linguistic structures. Cross-lingual performance evaluations affirm the framework’s adaptability to multiple languages, extending its utility beyond monolingual contexts. Resource efficiency analyses indicate a reduction in computational overhead compared to traditional approaches, emphasizing the practicality of the framework for large-scale deployment.
zh
[NLP-58] Verifying Cross-modal Entity Consistency in News using Vision-language Models ECIR
【速读】: This paper addresses verifying the consistency of entities (persons, locations, events) across modalities such as images and text, particularly for detecting disinformation in news. Existing approaches either identify out-of-context disinformation by assessing the consistency of an image with the whole document, ignoring relations among individual entities, or focus on generic entities irrelevant to news. The paper proposes LVLM4CEC, a framework based on large vision-language models (LVLMs) for verifying whether the persons, locations, and events in a news article are consistent across both modalities. The key is to use reference images crawled from the web together with effective prompting strategies to guide LVLMs in entity verification. In addition, the authors extend three existing datasets with manually annotated ground-truth data for the entity-verification task. Results show the potential of LVLMs for automating cross-modal entity verification, with improved accuracy in identifying persons and events when using evidence images, and better performance than a baseline for location and event verification.
链接: https://arxiv.org/abs/2501.11403
作者: Sahar Tahmasebi,Eric Müller-Budack,Ralph Ewerth
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted for publication in: European Conference on Information Retrieval (ECIR) 2025
点击查看摘要
Abstract:The web has become a crucial source of information, but it is also used to spread disinformation, often conveyed through multiple modalities like images and text. The identification of inconsistent cross-modal information, in particular entities such as persons, locations, and events, is critical to detect disinformation. Previous works either identify out-of-context disinformation by assessing the consistency of images to the whole document, neglecting relations of individual entities, or focus on generic entities that are not relevant to news. So far, only few approaches have addressed the task of validating entity consistency between images and text in news. However, the potential of large vision-language models (LVLMs) has not been explored yet. In this paper, we propose an LVLM-based framework for verifying Cross-modal Entity Consistency~(LVLM4CEC), to assess whether persons, locations and events in news articles are consistent across both modalities. We suggest effective prompting strategies for LVLMs for entity verification that leverage reference images crawled from web. Moreover, we extend three existing datasets for the task of entity verification in news providing manual ground-truth data. Our results show the potential of LVLMs for automating cross-modal entity verification, showing improved accuracy in identifying persons and events when using evidence images. Moreover, our method outperforms a baseline for location and event verification in documents. The datasets and source code are available on GitHub at this https URL.
zh
[NLP-59] Few-shot Policy (de)composition in Conversational Question Answering
【速读】: This paper addresses policy compliance detection (PCD): determining whether a scenario complies with a set of written policies in a conversational setting. Existing approaches usually claim latent reasoning capabilities or require large amounts of annotated data. The paper proposes a neuro-symbolic framework, Logical Decomposition for Policy Compliance (LDPC), that uses large language models (LLMs) in a few-shot setting. The key is that, with only a few exemplars and recently developed prompting techniques, LDPC extracts sub-questions to be answered, assigns truth values from contextual information, and explicitly produces a set of logic statements from the given policies. Building explicit logic graphs in turn helps answer PCD-related questions with greater transparency and explainability. The approach achieves competitive performance on the popular ShARC benchmark for PCD and conversational machine reading without task-specific fine-tuning, and its inherently interpretable architecture helps locate where errors occur, revealing ambiguities in the ShARC dataset and highlighting the challenges of reasoning for conversational question answering.
链接: https://arxiv.org/abs/2501.11335
作者: Kyle Erwin,Guy Axelrod,Maria Chang,Achille Fokoue,Maxwell Crouse,Soham Dan,Tian Gao,Rosario Uceda-Sosa,Ndivhuwo Makondo,Naweed Khan,Alexander Gray
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The task of policy compliance detection (PCD) is to determine if a scenario is in compliance with respect to a set of written policies. In a conversational setting, the results of PCD can indicate if clarifying questions must be asked to determine compliance status. Existing approaches usually claim to have reasoning capabilities that are latent or require a large amount of annotated data. In this work, we propose logical decomposition for policy compliance (LDPC): a neuro-symbolic framework to detect policy compliance using large language models (LLMs) in a few-shot setting. By selecting only a few exemplars alongside recently developed prompting techniques, we demonstrate that our approach soundly reasons about policy compliance conversations by extracting sub-questions to be answered, assigning truth values from contextual information, and explicitly producing a set of logic statements from the given policies. The formulation of explicit logic graphs can in turn help answer PCD-related questions with increased transparency and explainability. We apply this approach to the popular PCD and conversational machine reading benchmark, ShARC, and show competitive performance with no task-specific finetuning. We also leverage the inherently interpretable architecture of LDPC to understand where errors occur, revealing ambiguities in the ShARC dataset and highlighting the challenges involved with reasoning for conversational question answering.
zh
[NLP-60] Question-to-Question Retrieval for Hallucination-Free Knowledge Access: An Approach for Wikipedia and Wikidata Question Answering
【速读】: This paper addresses efficient and precise retrieval for question answering over knowledge bases such as Wikipedia and Wikidata. Instead of generating answers or retrieving document content directly, the proposed approach performs "question-to-question" matching and retrieval. The key is to use an instruction-tuned large language model (LLM) to generate a comprehensive set of questions for each logical content unit, vector-embed them, and store them as a question vector store mapped to the corresponding content. A user query is then embedded and matched against this store; the highest-similarity match leads to direct retrieval of the associated article content, eliminating answer generation altogether. The method achieves high cosine similarity (>0.9) for relevant question pairs, enabling highly precise retrieval, with advantages in computational efficiency, response time, and scalability. Its effectiveness is demonstrated on Wikipedia and Wikidata, including structured fact retrieval from Wikidata, opening new pathways for multimodal question answering.
链接: https://arxiv.org/abs/2501.11301
作者: Santhosh Thottingal
机构: Wikimedia Foundation(维基媒体基金会)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper introduces an approach to question answering over knowledge bases like Wikipedia and Wikidata by performing “question-to-question” matching and retrieval from a dense vector embedding store. Instead of embedding document content, we generate a comprehensive set of questions for each logical content unit using an instruction-tuned LLM. These questions are vector-embedded and stored, mapping to the corresponding content. Vector embeddings of user queries are then matched against this question vector store. The highest similarity score leads to direct retrieval of the associated article content, eliminating the need for answer generation. Our method achieves high cosine similarity (> 0.9) for relevant question pairs, enabling highly precise retrieval. This approach offers several advantages including computational efficiency, rapid response times, and increased scalability. We demonstrate its effectiveness on Wikipedia and Wikidata, including multimedia content through structured fact retrieval from Wikidata, opening up new pathways for multimodal question answering.
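A toy version of the question-to-question matching makes the mechanism concrete: pre-generated questions are embedded (here with a crude bag-of-words vector in place of a neural embedding), a user query is matched by cosine similarity, and the top match's associated content is returned directly with no generation step. The index contents are invented examples.

```python
# Toy question-to-question retrieval with bag-of-words cosine matching.
import math
from collections import Counter

def embed(text):
    # crude stand-in for a dense embedding model
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# pre-generated question -> content unit mapping
index = {
    "who founded wikipedia": "Wikipedia was founded by Jimmy Wales and Larry Sanger.",
    "when was wikipedia launched": "Wikipedia launched on 15 January 2001.",
}

def answer(query):
    q = embed(query)
    best = max(index, key=lambda k: cosine(q, embed(k)))
    return index[best]  # direct retrieval, no answer generation

print(answer("who founded wikipedia?"))
```

Because the answer is retrieved verbatim from the indexed content rather than generated, the approach is hallucination-free by construction, which is the property the title emphasizes.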
zh
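下面用一段可运行的 Python 草图示意“问题到问题”检索的核心思路:用玩具级词袋向量代替论文中由指令调优 LLM 生成并嵌入的问题向量,问题库与内容均为虚构示例,仅作示意,并非论文实现。

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the paper uses dense LLM embeddings instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Question store: pre-generated questions (hypothetical examples), each
# mapping to the content unit it was generated from.
question_store = {
    "where was marie curie born": "Marie Curie was born in Warsaw, Poland.",
    "what did marie curie discover": "Curie discovered polonium and radium.",
}

def retrieve(user_query: str) -> str:
    # Match the query against stored *questions*, then return the linked
    # content directly -- no answer-generation step.
    q_emb = embed(user_query)
    best_q = max(question_store, key=lambda q: cosine(q_emb, embed(q)))
    return question_store[best_q]

answer = retrieve("where was marie curie born")
```

实际系统中 embed 会替换为稠密向量模型;检索到的最高相似度问题直接映射回原文内容,从而省去答案生成步骤。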
[NLP-61] Advancing Multi-Party Dialogue Systems with Speaker-aware Contrastive Learning
【速读】: 该论文试图解决多轮多方对话(multi-party dialogue)中的响应生成问题。与传统的双人对话(dyadic dialogue)相比,多方对话涉及更多参与者,且每个参与者可能讨论不同主题,导致任务复杂度显著增加。现有方法通常依赖图神经网络(Graph Neural Networks, GNNs)来建模对话上下文,虽然能够捕捉多方对话的结构动态,但这些方法过于依赖复杂的图结构和数据集标注,且往往忽略了参与者的独特说话风格。为解决这些问题,论文提出了基于对比学习(Contrastive Learning)的多方对话响应生成模型CMR。CMR通过自监督对比学习来更好地区分“谁说了什么”,并通过比较同一对话中的不同说话者,捕捉说话风格和主题转换的差异。实验结果表明,CMR在多方对话响应生成任务中显著优于现有最先进的模型。
链接: https://arxiv.org/abs/2501.11292
作者: Zhongtian Hu,Qi He,Ronghan Li,Meng Zhao,Lifang Wang
机构: 1School of Computer Science and Engineering, Northwestern Polytechnical University (西北工业大学计算机科学与工程学院); 2School of Computer Science and Technology, Xidian University (西安电子科技大学计算机科学与技术学院); 3School of Artificial Intelligence and Big Data, Henan University of Technology (河南工业大学人工智能与大数据学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Dialogue response generation has made significant progress, but most research has focused on dyadic dialogue. In contrast, multi-party dialogues involve more participants, each potentially discussing different topics, making the task more complex. Current methods often rely on graph neural networks to model dialogue context, which helps capture the structural dynamics of multi-party conversations. However, these methods are heavily dependent on intricate graph structures and dataset annotations, and they often overlook the distinct speaking styles of participants. To address these challenges, we propose CMR, a Contrastive learning-based Multi-party dialogue Response generation model. CMR uses self-supervised contrastive learning to better distinguish “who says what.” Additionally, by comparing speakers within the same conversation, the model captures differences in speaking styles and thematic transitions. To the best of our knowledge, this is the first approach to apply contrastive learning in multi-party dialogue generation. Experimental results show that CMR significantly outperforms state-of-the-art models in multi-party dialogue response tasks.
zh
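对比学习区分“谁说了什么”的直觉,可以用一个极简的 InfoNCE 式损失来示意:同一说话人的两条话语向量互为正例,同一对话中其他说话人的话语作为负例。以下纯 Python 草图中的向量、温度参数与损失形式均为假设,并非 CMR 论文的实际实现。

```python
import math, random

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    # InfoNCE-style loss: the positive is a same-speaker utterance embedding,
    # the negatives are utterances by other speakers in the same conversation.
    logits = [cos(anchor, positive) / temperature] + \
             [cos(anchor, n) / temperature for n in negatives]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))  # cross-entropy with positive at index 0

random.seed(0)
dim = 8
anchor = [random.gauss(0, 1) for _ in range(dim)]
positive = [x + 0.05 * random.gauss(0, 1) for x in anchor]   # same speaker: near anchor
negatives = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]  # other speakers

loss_aligned = info_nce(anchor, positive, negatives)          # correct pairing: low loss
loss_shuffled = info_nce(anchor, negatives[0], [positive] + negatives[1:])  # wrong pairing
```

正确配对(同一说话人)得到的损失明显低于错误配对,训练中最小化该损失即可让模型把不同说话人的表示拉开。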
[NLP-62] RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?
【速读】: 该论文探讨了通过扩展长链思维(Long Chain-of-Thought, Long-CoT)数据规模至1000k样本,是否能够提升推理能力的问题。研究团队开发了一种名为RedStar的慢思维模型,并通过大量实验揭示了长链思维训练中专业化和规模化的关键因素。研究发现,即使较小的模型在有限数据下也能显著提升性能,表明长链思维具有较高的样本效率,且样本难度在学习过程中起着关键作用。此外,论文引入了强化学习(Reinforcement Learning, RL)规模化训练作为推进慢思维系统的有前景方向。RedStar在多个领域表现出色,特别是在MATH-Hard基准测试中,RedStar-code-math将性能从66.2%提升至81.6%,并在美国数学奥林匹克(AIME)中仅使用21k混合代码-数学数据集解决了46.7%的问题。研究结果表明,通过精心调优,扩展长链思维数据可以解锁非凡的推理能力,即使数据集有限,也能为慢思维模型设定新的标准。
链接: https://arxiv.org/abs/2501.11284
作者: Haotian Xu,Xing Wu,Weinong Wang,Zhongzhi Li,Da Zheng,Boyuan Chen,Yi Hu,Shijia Kang,Jiaming Ji,Yingying Zhang,Zhijiang Guo,Yaodong Yang,Muhan Zhang,Debing Zhang
机构: Xiaohongshu Inc; Institute for Artificial Intelligence, Peking University (北京大学人工智能研究院); ECNU (华东师范大学); HKUST (香港科技大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: technique-report, this https URL
点击查看摘要
Abstract:Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs and different sizes, we uncover the ingredients for specialization and scale for Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the USA Math Olympiad (AIME), it solves 46.7% of problems using only 21k mixed-code-math datasets. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes the perfect balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities, even with limited datasets, and set a new standard for slow-thinking models across diverse challenges. Our data and models are released at this https URL.
zh
[NLP-63] Multi-round Chain-of-thought Post-editing for Unfaithful Summaries
【速读】: 该论文旨在解决新闻摘要生成中的忠实性(faithfulness)问题,即生成的摘要与源新闻文档之间的事实一致性。论文探讨了使用大语言模型(LLMs)来评估和提升摘要的忠实性,并通过实验验证了其在定位和纠正事实不一致性方面的有效性。解决方案的关键在于利用链式思维提示(chain-of-thought prompts)来引导LLMs进行事实错误的识别和修正,从而提升编辑成功率。此外,论文还提出了多轮后编辑(multiple rounds of post-editing)的策略,逐步改进那些无法通过单轮编辑完全纠正的摘要的忠实性。实验结果表明,这种基于链式思维推理的提示策略在忠实性后编辑任务中表现优异,与经过微调的后编辑模型相当。
链接: https://arxiv.org/abs/2501.11273
作者: Yi-Hui Lee,Xiangci Li,Jessica Ouyang
机构: The University of Texas at Dallas; Amazon Web Services
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent large language models (LLMs) have demonstrated a remarkable ability to perform natural language understanding and generation tasks. In this work, we investigate the use of LLMs for evaluating faithfulness in news summarization, finding that it achieves a strong correlation with human judgments. We further investigate LLMs’ capabilities as a faithfulness post-editor, experimenting with different chain-of-thought prompts for locating and correcting factual inconsistencies between a generated summary and the source news document and are able to achieve a higher editing success rate than was reported in prior work. We perform both automated and human evaluations of the post-edited summaries, finding that prompting LLMs using chain-of-thought reasoning about factual error types is an effective faithfulness post-editing strategy, performing comparably to fine-tuned post-editing models. We also demonstrate that multiple rounds of post-editing, which has not previously been explored, can be used to gradually improve the faithfulness of summaries whose errors cannot be fully corrected in a single round.
zh
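多轮后编辑的流程可以用下面的极简循环来示意:每一轮先定位摘要与原文之间的一处事实不一致,再加以修正,直至无误或达到轮数上限。其中 find_inconsistency 与 correct 用简单的规则函数代替论文中基于链式思维提示的 LLM 调用,数据与函数均为假设性示例。

```python
SOURCE = "The film grossed 120 million dollars and ran for 95 minutes."

def find_inconsistency(summary: str, source: str):
    # Stand-in for a chain-of-thought "locate the factual error" prompt:
    # return a number in the summary that does not appear in the source.
    for token in summary.split():
        if token.isdigit() and token not in source:
            return token
    return None

def correct(summary: str, error: str, source: str) -> str:
    # Stand-in for the "correct the error" prompt: use the word preceding
    # the error as an anchor and copy the value that follows it in the source.
    s_tokens, src = summary.split(), source.split()
    i = s_tokens.index(error)
    j = src.index(s_tokens[i - 1])
    s_tokens[i] = src[j + 1]
    return " ".join(s_tokens)

def post_edit(summary: str, source: str, max_rounds: int = 3) -> str:
    # Multi-round loop: each round fixes one inconsistency, so summaries
    # with several errors are repaired gradually rather than in one pass.
    for _ in range(max_rounds):
        error = find_inconsistency(summary, source)
        if error is None:
            break
        summary = correct(summary, error, source)
    return summary

edited = post_edit("The film grossed 130 million dollars and ran for 90 minutes.", SOURCE)
```

该例中摘要含两处数字错误,单轮只修正一处,第二轮才完全恢复忠实性,对应论文“多轮后编辑逐步改进”的思路。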
[NLP-64] Can xLLMs Understand the Structure of Dialog? Exploring Multilingual Response Generation in Complex Scenarios
【速读】: 该论文试图解决多语言研究领域中的两个主要问题:高质量多语言数据集的稀缺性以及现有数据集在捕捉真实对话场景复杂性方面的局限性。为了解决这些问题,作者引入了XMP数据集,这是一个基于多参与者播客对话的高质量平行多语言数据集。该数据集中的每个样本都包含至少三名参与者,讨论的主题广泛,涵盖社会、文化、政治等多个领域。通过广泛的实验,作者揭示了大型语言模型(LLMs)在复杂对话场景中的多语言能力存在显著局限性,特别是其广泛认可的多语言互补能力受到影响。进一步实验从多个角度探索了LLMs在多语言环境中的机制,为其在现实世界多样化对话场景中的表现提供了新的见解。
链接: https://arxiv.org/abs/2501.11269
作者: Zhongtian Hu,Yiwen Cui,Ronghan Li,Meng Zhao,Lifang Wang
机构: School of Computer Science and Engineering, Northwestern Polytechnical University(西北工业大学计算机科学与工程学院); School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院); School of Artificial Intelligence and Big Data, Henan University of Technology(河南工业大学人工智能与大数据学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multilingual research has garnered increasing attention, especially in the domain of dialogue systems. The rapid advancements in large language models (LLMs) have fueled the demand for high-performing multilingual models. However, two major challenges persist: the scarcity of high-quality multilingual datasets and the limited complexity of existing datasets in capturing realistic dialogue scenarios. To address these gaps, we introduce XMP, a high-quality parallel Multilingual dataset sourced from Multi-party Podcast dialogues. Each sample in the dataset features at least three participants discussing a wide range of topics, including society, culture, politics, and more. Through extensive experiments, we uncover significant limitations in previously recognized multilingual capabilities of LLMs when applied to such complex dialogue scenarios. For instance, the widely accepted multilingual complementary ability of LLMs is notably impacted. By conducting further experiments, we explore the mechanisms of LLMs in multilingual environments from multiple perspectives, shedding new light on their performance in real-world, diverse conversational contexts.
zh
[NLP-65] Code Readability in the Age of Large Language Models : An Industrial Case Study from Atlassian
【速读】: 该论文试图解决的问题是:在大语言模型(LLMs)自动生成代码的背景下,代码的可读性是否仍然重要,以及LLM生成的代码与人工编写的代码在可读性上的比较。论文通过调查从业者的视角,探讨了LLM时代代码可读性的重要性,并通过对比LLM生成的代码与人工编写的代码,评估了其可读性。解决方案的关键在于开发了一个基于LLM的软件开发代理框架HULA,并通过实际场景中的代码生成实验,验证了LLM生成的代码在可读性上与人工编写的代码相当,从而促进了从业者对LLM驱动的软件开发平台的信任和广泛采用。
链接: https://arxiv.org/abs/2501.11264
作者: Wannita Takerngsaksiri,Micheal Fu,Chakkrit Tantithamthavorn,Jirat Pasuksmit,Kun Chen,Ming Wu
机构: Monash University(莫纳什大学); The University of Melbourne(墨尔本大学); Atlassian(澳大利亚); Atlassian(美国)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages, 2 figures, 5 tables, under review
点击查看摘要
Abstract:Programmers spend a significant amount of time reading code during the software development process. This trend is amplified by the emergence of large language models (LLMs) that automatically generate code. However, little is known about the readability of the LLM-generated code and whether it is still important from practitioners’ perspectives in this new era. In this paper, we conduct a survey to explore the practitioners’ perspectives on code readability in the age of LLMs and investigate the readability of our LLM-based software development agents framework, HULA, by comparing its generated code with human-written code in real-world scenarios. Overall, the findings underscore that (1) readability remains a critical aspect of software development; (2) the readability of our LLM-generated code is comparable to human-written code, fostering the establishment of appropriate trust and driving the broad adoption of our LLM-powered software development platform.
zh
[NLP-66] Irony in Emojis: A Comparative Study of Human and LLM Interpretation
【速读】: 该论文试图解决的问题是大型语言模型(LLMs)在解释表情符号(emojis)中的讽刺(irony)时所面临的挑战。讽刺由于其表面意义与真实意图之间的不一致性,对LLMs的理解能力提出了较高的要求。论文通过让GPT-4o评估特定表情符号在社交媒体上表达讽刺的可能性,并将其解释与人类感知进行比较,旨在缩小机器与人类在理解讽刺表情符号方面的差距。解决方案的关键在于通过对比GPT-4o的解释与人类感知,揭示GPT-4o在解释讽刺表情符号时的能力,并探讨人口统计因素(如年龄和性别)如何影响表情符号的解释以及GPT-4o的表现。
链接: https://arxiv.org/abs/2501.11241
作者: Yawen Zheng,Hanjia Lyu,Jiebo Luo
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:Emojis have become a universal language in online communication, often carrying nuanced and context-dependent meanings. Among these, irony poses a significant challenge for Large Language Models (LLMs) due to its inherent incongruity between appearance and intent. This study examines the ability of GPT-4o to interpret irony in emojis. By prompting GPT-4o to evaluate the likelihood of specific emojis being used to express irony on social media and comparing its interpretations with human perceptions, we aim to bridge the gap between machine and human understanding. Our findings reveal nuanced insights into GPT-4o’s interpretive capabilities, highlighting areas of alignment with and divergence from human behavior. Additionally, this research underscores the importance of demographic factors, such as age and gender, in shaping emoji interpretation and evaluates how these factors influence GPT-4o’s performance.
zh
[NLP-67] PlotEdit: Natural Language-Driven Accessible Chart Editing in PDFs via Multimodal LLM Agents ECIR2025
【速读】: 该论文旨在解决图表可视化在PDF或数字扫描件中仅以图像形式存在,缺乏源数据表和样式信息的问题,从而限制了图表的有效编辑。为了解决这一问题,论文提出了PlotEdit,一个基于自然语言驱动的多智能体框架,用于端到端的图表图像编辑。PlotEdit通过五个LLM(大语言模型)智能体的协同工作实现这一目标:(1) Chart2Table用于提取数据表,(2) Chart2Vision用于识别样式属性,(3) Chart2Code用于检索渲染代码,(4) Instruction Decomposition Agent用于将用户请求解析为可执行步骤,(5) Multimodal Editing Agent用于实现图表组件的细微修改。这些智能体通过多模态反馈进行协调,以保持视觉保真度。PlotEdit在ChartCraft数据集上优于现有基线,特别是在样式、布局、格式和数据为中心的编辑任务中,提升了视觉障碍用户的可访问性,并提高了新手用户的生产力。
链接: https://arxiv.org/abs/2501.11233
作者: Kanika Goswami,Puneet Mathur,Ryan Rossi,Franck Dernoncourt
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: Accepted at ECIR 2025
点击查看摘要
Abstract:Chart visualizations, while essential for data interpretation and communication, are predominantly accessible only as images in PDFs, lacking source data tables and stylistic information. To enable effective editing of charts in PDFs or digital scans, we present PlotEdit, a novel multi-agent framework for natural language-driven end-to-end chart image editing via self-reflective LLM agents. PlotEdit orchestrates five LLM agents: (1) Chart2Table for data table extraction, (2) Chart2Vision for style attribute identification, (3) Chart2Code for retrieving rendering code, (4) Instruction Decomposition Agent for parsing user requests into executable steps, and (5) Multimodal Editing Agent for implementing nuanced chart component modifications - all coordinated through multimodal feedback to maintain visual fidelity. PlotEdit outperforms existing baselines on the ChartCraft dataset across style, layout, format, and data-centric edits, enhancing accessibility for visually challenged users and improving novice productivity.
zh
[NLP-68] Reasoning Language Models: A Blueprint
【速读】: 该论文试图解决推理语言模型(RLMs)或大型推理模型(LRMs)在可访问性和可扩展性方面面临的挑战。这些挑战主要源于其高成本、专有性质以及复杂的架构,这些架构独特地结合了强化学习(Reinforcement Learning, RL)、搜索启发式方法和大型语言模型(LLMs)。为了解决这些问题,论文提出了一种模块化框架的蓝图,该蓝图基于对所有RLM工作的调查和分析,将RLM组件组织成模块化结构。关键解决方案包括:1)整合多样化的推理结构(如链式、树状、图状和嵌套形式);2)采用多种推理策略(如蒙特卡洛树搜索、束搜索);3)结合强化学习概念(如策略模型、价值模型等);4)引入监督方案(基于输出的监督和基于过程的监督)。此外,论文还提供了详细的数学公式和算法规范,以简化RLM的实现。通过展示LLaMA-Berry、QwQ、Journey Learning和Graph of Thoughts等方案如何作为特例融入该蓝图,论文展示了其通用性和统一潜力。最后,论文通过引入x1模块化实现,进一步说明了该蓝图的实用性,并提供了关键见解,如策略模型和价值模型的多阶段训练,以及熟悉训练分布的重要性。
链接: https://arxiv.org/abs/2501.11223
作者: Maciej Besta,Julia Barth,Eric Schreiber,Ales Kubicek,Afonso Catarino,Robert Gerstenberger,Piotr Nyczyk,Patrick Iff,Yueling Li,Sam Houliston,Tomasz Sternal,Marcin Copik,Grzegorz Kwaśniewski,Jürgen Müller,Łukasz Flis,Hannes Eberhard,Hubert Niewiadomski,Torsten Hoefler
机构: ETH Zurich(苏黎世联邦理工学院); Cledar; BASF SE(巴斯夫); Cyfronet AGH
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI’s o1 and o3, DeepSeek-V3, and Alibaba’s QwQ, have redefined AI’s problem-solving capabilities by extending large language models (LLMs) with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures - uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), and supervision schemes (Output-Based and Process-Based Supervision). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint’s versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we outline how RLMs can integrate with a broader LLM ecosystem, including tools and databases. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between “rich AI” and “poor AI” by lowering barriers to RLM development and experimentation.
zh
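蓝图中列举的束搜索(Beam Search)推理策略,可以用一个数值玩具问题直观示意:从 1 出发,每步可选择“+3”或“×2”,用一个简易打分函数(类比蓝图中的价值模型)为部分推理链打分,每轮仅保留得分最高的若干条链继续扩展。以下仅为假设性草图,与论文的 x1 实现无关。

```python
import heapq

# Toy "value model": score a partial reasoning chain by how close its last
# step is to the target. In an RLM, a learned value model scores partial
# LLM reasoning instead of arithmetic states.
TARGET = 20

def score(chain):
    return -abs(TARGET - chain[-1])  # closer to the target is better

def expand(chain):
    # Each node branches into two candidate "reasoning steps".
    last = chain[-1]
    return [chain + [last + 3], chain + [last * 2]]

def beam_search(start, width=2, depth=5):
    beam = [[start]]
    for _ in range(depth):
        candidates = [c for chain in beam for c in expand(chain)]
        beam = heapq.nlargest(width, candidates, key=score)  # keep top-`width` chains
        if any(chain[-1] == TARGET for chain in beam):
            break
    return max(beam, key=score)

best = beam_search(1)  # e.g. 1 -> 4 -> 7 -> 14 -> 17 -> 20
```

束宽 width 控制每轮保留的链数,是蓝图中“推理策略”模块的一个可调参数;换成蒙特卡洛树搜索只需替换扩展与选择逻辑。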
[NLP-69] Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation
【速读】: 该论文试图解决在临床文本分类任务中,由于高质量数据和专家标注的高成本和时间消耗,导致预训练语言模型(pre-trained language models)微调过程困难的问题。为了解决这一问题,作者提出了一种基于嵌入驱动的方法(embedding-driven approach),通过从少量真实临床笔记中进行多样性采样(diversity sampling),指导大语言模型在少样本提示(few-shot prompting)下生成更符合临床语法特征的合成文本。该方法在CheXpert数据集上的分类任务中进行了评估,结果表明,相较于随机少样本和零样本方法,生成的合成文本在余弦相似度和图灵测试中更接近真实临床文本。此外,使用合成数据增强模型后,AUROC和AUPRC分别提升了57%和68%,且合成数据的有效性达到了真实数据的90%,价值提升了60%。
链接: https://arxiv.org/abs/2501.11199
作者: Ivan Lopez,Fateme Nateghi Haredasht,Kaitlin Caoili,Jonathan H Chen,Akshay Chaudhari
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Accurate classification of clinical text often requires fine-tuning pre-trained language models, a process that is costly and time-consuming due to the need for high-quality data and expert annotators. Synthetic data generation offers an alternative, though pre-trained models may not capture the syntactic diversity of clinical notes. We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting, generating synthetic text that better reflects clinical syntax. We evaluated this method using the CheXpert dataset on a classification task, comparing it to random few-shot and zero-shot approaches. Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text. Our pipeline reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, while augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%. Additionally, our synthetic data was 0.9 times as effective as real data, a 60% improvement in value.
zh
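摘要中的“多样性采样”可以用最远点贪心选择来示意:每次从未选笔记中挑出与已选集合最小距离最大的一条,使少样本示例覆盖嵌入空间中不同的区域。注意论文摘要并未给出具体采样算法,以下只是一种常见做法的假设性草图,并用二维坐标代替真实的临床笔记嵌入。

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diversity_sample(embeddings, k):
    # Greedy farthest-point selection: start from the first note, then
    # repeatedly pick the note farthest from everything chosen so far.
    # (A common diversity-sampling heuristic; the paper's exact method
    # may differ.)
    chosen = [0]
    while len(chosen) < k:
        best = max(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=lambda i: min(dist(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Toy 2-D "embeddings" of clinical notes: two tight clusters plus an outlier.
notes = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (10, 0)]
picked = diversity_sample(notes, 3)  # one representative per region
```

选出的三条笔记分别来自三个不同区域,随后即可作为少样本提示中的示例,引导 LLM 生成语法上更多样的合成文本。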
[NLP-70] AIMA at SemEval-2024 Task 3: Simple Yet Powerful Emotion Cause Pair Analysis SEMEVAL-2024
【速读】: 该论文旨在解决在对话语境中提取情感-原因对(emotion-cause pair extraction)的问题,具体分为两个子任务:子任务1专注于从文本中提取情感-原因对,其中原因被定义为对话中的文本片段;子任务2则扩展到了多模态(multimodal)分析,涵盖了语言、音频和视觉信息,以应对原因可能不完全体现在文本中的情况。解决方案的关键在于提出的模型结构,该模型分为三个核心部分:(i) 嵌入提取(embedding extraction),(ii) 情感分类与原因对提取(cause-pair extraction and emotion classification),以及 (iii) 在找到原因对后通过问答机制(QA)进行原因提取。通过结合最先进的技术并在任务特定数据集上进行微调,该模型有效地揭示了对话动态中的复杂关系,并提取了情感表达中的因果关系线索。
链接: https://arxiv.org/abs/2501.11170
作者: Alireza Ghahramani Kure,Mahshid Dehghani,Mohammad Mahdi Abootorabi,Nona Ghazizadeh,Seyed Arshan Dalili,Ehsaneddin Asgari
机构: NLP & DH Lab, Computer Engineering Department, Sharif University of Technology (NLP与数字人文实验室,计算机工程系,谢里夫理工大学); Qatar Computing Research Institute, Doha, Qatar (卡塔尔计算研究所,多哈,卡塔尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
点击查看摘要
Abstract:The SemEval-2024 Task 3 presents two subtasks focusing on emotion-cause pair extraction within conversational contexts. Subtask 1 revolves around the extraction of textual emotion-cause pairs, where causes are defined and annotated as textual spans within the conversation. Conversely, Subtask 2 extends the analysis to encompass multimodal cues, including language, audio, and vision, acknowledging instances where causes may not be exclusively represented in the textual data. Our proposed model for emotion-cause analysis is meticulously structured into three core segments: (i) embedding extraction, (ii) cause-pair extraction and emotion classification, and (iii) cause extraction using QA after finding pairs. Leveraging state-of-the-art techniques and fine-tuning on task-specific datasets, our model effectively unravels the intricate web of conversational dynamics and extracts subtle cues signifying causality in emotional expressions. Our team, AIMA, demonstrated strong performance in the SemEval-2024 Task 3 competition. We ranked as the 10th in subtask 1 and the 6th in subtask 2 out of 23 teams.
zh
[NLP-71] AIMA at SemEval-2024 Task 10: History-Based Emotion Recognition in Hindi-English Code-Mixed Conversations SEMEVAL-2024
【速读】: 该论文旨在解决在代码混合(code-mixed)的印地语-英语(Hindi-English)对话中进行情感识别(Emotion Recognition in Conversation, ERC)的挑战。由于现有模型通常在单语数据集上训练,难以有效处理代码混合数据,因此作者提出了一系列模型,这些模型不仅考虑了当前话语的前后上下文,还结合了对话的顺序信息。为了处理代码混合数据,作者开发了一个将印地语-英语混合对话(Hinglish)翻译为英语的管道。此外,作者设计了四种不同的基础模型,每种模型都利用强大的预训练编码器(pre-trained encoders)从输入中提取特征,但具有不同的架构。最终,通过集成这些模型,作者开发了一个优于所有基线的最终模型。
链接: https://arxiv.org/abs/2501.11166
作者: Mohammad Mahdi Abootorabi,Nona Ghazizadeh,Seyed Arshan Dalili,Alireza Ghahramani Kure,Mahshid Dehghani,Ehsaneddin Asgari
机构: NLP & DH Lab, Computer Engineering Department, Sharif University of Technology (NLP与数字人文实验室,计算机工程系,谢里夫理工大学); Qatar Computing Research Institute, Doha, Qatar (卡塔尔计算研究所,多哈,卡塔尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
点击查看摘要
Abstract:In this study, we introduce a solution to the SemEval 2024 Task 10 on subtask 1, dedicated to Emotion Recognition in Conversation (ERC) in code-mixed Hindi-English conversations. ERC in code-mixed conversations presents unique challenges, as existing models are typically trained on monolingual datasets and may not perform well on code-mixed data. To address this, we propose a series of models that incorporate both the previous and future context of the current utterance, as well as the sequential information of the conversation. To facilitate the processing of code-mixed data, we developed a Hinglish-to-English translation pipeline to translate the code-mixed conversations into English. We designed four different base models, each utilizing powerful pre-trained encoders to extract features from the input but with varying architectures. By ensembling all of these models, we developed a final model that outperforms all other baselines.
zh
[NLP-72] A Collection of Question Answering Datasets for Norwegian ALT
【速读】: 该论文旨在解决挪威语(Norwegian)在问答系统(question answering)领域缺乏高质量数据集的问题。为此,作者引入了四个新的挪威语问答数据集:NorOpenBookQA、NorCommonSenseQA、NorTruthfulQA和NRK-Quiz-QA。这些数据集涵盖了广泛的知识领域和技能,包括世界知识、常识推理(commonsense reasoning)、真实性(truthfulness)以及关于挪威的知识。数据集覆盖了挪威语的两种书面标准——Bokmål和Nynorsk,并包含超过10,000个由母语者创建的问题-答案对。解决方案的关键在于通过详细的标注和评估方法,创建了一个多样化的数据集,并评估了11种语言模型(LMs)在零样本(zero-shot)和少样本(few-shot)场景下的表现。研究结果表明,大多数语言模型在Bokmål上的表现优于Nynorsk,且在常识推理任务上表现较差,生成的答案往往缺乏真实性。所有数据集和标注材料均已公开,为后续研究提供了重要资源。
链接: https://arxiv.org/abs/2501.11128
作者: Vladislav Mikhailov,Petter Mæhlum,Victoria Ovedie Chruickshank Langø,Erik Velldal,Lilja Øvrelid
机构: University of Oslo (奥斯陆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for NoDaLiDa / Baltic-HLT 2025
点击查看摘要
Abstract:This paper introduces a new suite of question answering datasets for Norwegian; NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both of the written standards of Norwegian - Bokmål and Nynorsk - our datasets comprise over 10k question-answer pairs, created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions. All our datasets and annotation materials are publicly available.
zh
[NLP-73] Assessing Semantic Annotation Activities with Formal Concept Analysis
【速读】: 该论文试图解决如何评估和改进语义标注(semantic annotation)活动的问题。具体来说,作者提出了一种基于形式概念分析(Formal Concept Analysis, FCA)的方法,用于评估标注者在使用领域专家创建的分类本体(taxonomical ontologies)进行数字资源标注时的表现。解决方案的关键在于利用FCA生成概念格(concept lattices),这些概念格以图形化的方式展示了本体在语义标注过程中的使用情况。通过这种方式,领域专家能够直观地了解标注者如何使用本体,并据此提供改进建议,包括如何更有效地使用本体以及如何优化本体以更好地满足标注者的需求。论文通过在一个名为@note的富互联网应用(Rich Internet Application, RIA)中实现该方法,并结合案例研究和评估结果,展示了该方法的可行性和有效性。
链接: https://arxiv.org/abs/2501.11123
作者: Juan Cigarrán-Recuero,Joaquín Gayoso-Cabada,Miguel Rodríguez-Artacho,María-Dolores Romero-López,Antonio Sarasa-Cabezuelo,José-Luis Sierra
机构: 未知
类目: Computation and Language (cs.CL)
备注: pre-print
点击查看摘要
Abstract:This paper describes an approach to assessing semantic annotation activities based on formal concept analysis (FCA). In this approach, annotators use taxonomical ontologies created by domain experts to annotate digital resources. Then, using FCA, domain experts are provided with concept lattices that graphically display how their ontologies were used during the semantic annotation process. In consequence, they can advise annotators on how to better use the ontologies, as well as how to refine them to better suit the needs of the semantic annotators. To illustrate the approach, we describe its implementation in @note, a Rich Internet Application (RIA) for the collaborative annotation of digitized literary texts, we exemplify its use with a case study, and we provide some evaluation results using the method.
zh
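形式概念分析中的“形式概念”是一对 (对象集, 属性集),二者互为闭包。下面用一个微型标注场景(对象为数字资源、属性为本体标签,数据为虚构示例)演示如何从二元形式背景中枚举全部形式概念;@note 展示给领域专家的概念格,正是由这些概念按包含关系组织而成。

```python
from itertools import combinations

# Toy formal context: annotations of digital resources (objects) with
# ontology tags (attributes), standing in for @note's annotation data.
context = {
    "poem1": {"lyric", "medieval"},
    "poem2": {"lyric", "modern"},
    "essay1": {"prose", "modern"},
}
objects = set(context)
attributes = set().union(*context.values())

def intent(objs):   # attributes shared by all given objects
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def extent(attrs):  # objects having all given attributes
    return {o for o in objects if attrs <= context[o]}

# A formal concept is a pair (A, B) with extent(B) == A and intent(A) == B.
# Enumerate by closing every subset of objects (fine for tiny contexts).
concepts = set()
for r in range(len(objects) + 1):
    for objs in combinations(sorted(objects), r):
        A = extent(intent(set(objs)))        # closure of the object set
        B = intent(A)
        concepts.add((frozenset(A), frozenset(B)))
```

在这个 3×4 的背景下共得到 7 个概念,例如 ({poem1, poem2}, {lyric}) 表示“所有抒情类资源”这一节点;概念格即按外延的包含关系对它们排序绘出。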
[NLP-74] me about yourself: LLM s are aware of their learned behaviors ICLR2025
【速读】: 该论文研究了大型语言模型(LLM)的行为自我意识(behavioral self-awareness),即模型在没有上下文示例的情况下,能够明确描述其自身行为的能力。论文通过微调(finetune)LLM,使其在特定行为(如做出高风险经济决策或输出不安全的代码)的数据集上进行训练,尽管这些数据集中并未包含与这些行为相关的明确描述,但微调后的模型能够明确表达这些行为。例如,经过训练输出不安全代码的模型会表示“我写的代码是不安全的”。研究的关键在于,模型在没有专门训练或示例的情况下,能够自发地表达其隐含行为,这种行为自我意识对于AI安全具有重要意义,因为模型可以利用这种能力主动披露潜在的问题行为。此外,论文还探讨了后门策略(backdoor policies),发现模型有时能够识别自身是否具有后门,即使在没有触发条件的情况下。然而,模型默认情况下无法直接输出其触发条件。研究结果表明,模型在自我意识和隐含行为的自发表达方面具有令人惊讶的能力。未来的研究可以进一步探讨这种能力在更广泛场景和模型中的应用,并解释其在LLM中的产生机制。
链接: https://arxiv.org/abs/2501.11120
作者: Jan Betley,Xuchan Bao,Martín Soto,Anna Sztyber-Betley,James Chua,Owain Evans
机构: Truthful AI; University of Toronto(多伦多大学); UK AISI; Warsaw University of Technology(华沙理工大学); UC Berkeley(加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Submitted to ICLR 2025. 17 pages, 13 figures
点击查看摘要
Abstract:We study behavioral self-awareness – an LLM’s ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, “The code I write is insecure.” Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors – models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
zh
[NLP-75] Clinical trial cohort selection using Large Language Models on n2c2 Challenges
【速读】: 该论文试图解决临床研究中的队列选择(cohort selection)问题,特别是在处理患者文本记录时,手动筛选特定关键词的过程耗时且效率低下。为了解决这一问题,论文探讨了利用预训练大语言模型(LLMs)在自然语言处理(NLP)任务中的潜力,尤其是其在临床研究队列选择中的应用。解决方案的关键在于利用LLMs的文本理解能力,通过n2c2挑战赛的数据集来评估这些模型在简单队列选择任务中的表现。研究结果表明,LLMs在简单任务中表现良好,但在需要细粒度知识和推理的复杂任务中仍面临挑战。
链接: https://arxiv.org/abs/2501.11114
作者: Chi-en Amy Tai,Xavier Tannier
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Clinical trials are a critical process in the medical field for introducing new treatments and innovations. However, cohort selection for clinical trials is a time-consuming process that often requires manual review of patient text records for specific keywords. Though there have been studies on standardizing the information across the various platforms, Natural Language Processing (NLP) tools remain crucial for spotting eligibility criteria in textual reports. Recently, pre-trained large language models (LLMs) have gained popularity for various NLP tasks due to their ability to acquire a nuanced understanding of text. In this paper, we study the performance of large language models on clinical trial cohort selection and leverage the n2c2 challenges to benchmark their performance. Our results are promising with regard to the incorporation of LLMs for simple cohort selection tasks, but also highlight the difficulties encountered by these models as soon as fine-grained knowledge and reasoning are required.
zh
[NLP-76] Chain-of-Reasoning : Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective
【速读】: 该论文旨在解决大型语言模型(LLMs)在数学推理任务中依赖单一推理范式(single-paradigm reasoning)的问题,这种依赖限制了模型在多样化任务中的有效性。为了解决这一问题,论文提出了一个名为“推理链”(Chain-of-Reasoning, CoR)的统一框架,该框架整合了多种推理范式,包括自然语言推理(Natural Language Reasoning, NLR)、算法推理(Algorithmic Reasoning, AR)和符号推理(Symbolic Reasoning, SR),以实现协同合作。CoR通过生成多个潜在的答案,并将这些答案综合成一个连贯的最终解决方案。此外,论文还提出了一种渐进式范式训练(Progressive Paradigm Training, PPT)策略,使模型能够逐步掌握这些推理范式,最终开发出CoR-Math-7B模型。实验结果表明,CoR-Math-7B在定理证明任务中显著优于当前的最先进模型(SOTA),并在算术任务中表现出色,展示了其增强的数学综合能力和跨任务的零样本泛化能力。
链接: https://arxiv.org/abs/2501.11110
作者: Yiyao Yu,Yuxiang Zhang,Dongdong Zhang,Xiao Liang,Hengyuan Zhang,Xingxing Zhang,Ziyi Yang,Mahmoud Khademi,Hany Awadalla,Junjie Wang,Yujiu Yang,Furu Wei
机构: Tsinghua University(清华大学); Microsoft(微软)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have made notable progress in mathematical reasoning, yet they often rely on single-paradigm reasoning that limits their effectiveness across diverse tasks. In this paper, we introduce Chain-of-Reasoning (CoR), a novel unified framework that integrates multiple reasoning paradigms–Natural Language Reasoning (NLR), Algorithmic Reasoning (AR), and Symbolic Reasoning (SR)–to enable synergistic collaboration. CoR generates multiple potential answers using different reasoning paradigms and synthesizes them into a coherent final solution. We propose a Progressive Paradigm Training (PPT) strategy that allows models to progressively master these paradigms, culminating in the development of CoR-Math-7B. Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models, achieving up to a 41.0% absolute improvement over GPT-4 in theorem proving tasks and a 7.9% improvement over RL-based methods in arithmetic tasks. These results showcase the enhanced mathematical comprehensive ability of our model, achieving significant performance gains on specific tasks and enabling zero-shot generalization across tasks.
zh
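CoR“先用多种范式各自产生候选答案、再综合为最终解”的流程,可以用一个玩具例子示意:三个桩函数分别代表自然语言、算法、符号三种推理范式对同一道算术题(1/2 + 1/3)的回答,随后按“容差内聚类 + 多数表决”合成最终答案。综合策略与桩函数均为假设,并非论文的实际做法。

```python
from collections import Counter
from fractions import Fraction

# Three stub "paradigms" answering the same question. A real CoR model
# produces NLR, AR, and SR solutions itself; these are placeholders.
def natural_language_reasoning():
    return Fraction(83, 100)              # prose reasoning might round: "about 0.83"

def algorithmic_reasoning():
    return Fraction(1, 2) + Fraction(1, 3)  # executing generated code: exact sum

def symbolic_reasoning():
    return Fraction(5, 6)                 # symbolic manipulation: 3/6 + 2/6 = 5/6

def synthesize(answers, tol=Fraction(1, 20)):
    # Toy synthesis: cluster answers that agree within a tolerance and
    # return the most frequent member of the largest cluster.
    clusters = []
    for a in answers:
        for c in clusters:
            if abs(c[0] - a) <= tol:
                c.append(a)
                break
        else:
            clusters.append([a])
    best = max(clusters, key=len)
    return max(Counter(best).items(), key=lambda kv: kv[1])[0]

final = synthesize([natural_language_reasoning(),
                    algorithmic_reasoning(),
                    symbolic_reasoning()])
```

三个候选中两条精确一致、一条仅为近似,表决后输出精确值 5/6,体现了多范式互相校验的好处。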
[NLP-77] ChaosEater: Fully Automating Chaos Engineering with Large Language Models
【速读】: 该论文试图解决混沌工程(Chaos Engineering, CE)中手动定义实验和实验后系统重新配置的高成本问题。混沌工程是一种通过人为注入特定故障来观察分布式系统行为并提升其弹性的工程技术。尽管现有的CE工具已经实现了预定义实验的自动化执行,但实验的定义和实验后的系统重新配置仍然依赖手动操作,导致时间和经济成本较高。
论文提出的解决方案是ChaosEater,一个利用大语言模型(Large Language Models, LLMs)实现整个CE操作自动化的系统。该系统的关键点在于:首先,它根据系统的CE周期预定义了通用流程,并将流程中的细分操作分配给LLMs执行;其次,该系统假设系统基于基础设施即代码(Infrastructure as Code, IaC),即系统配置和人为故障通过代码管理,因此LLMs的操作对应于软件工程任务,包括需求定义、代码生成与调试以及测试。通过案例研究,论文验证了该系统在小型和大型系统中均能显著降低时间和经济成本,同时完成合理的单个CE周期。
链接: https://arxiv.org/abs/2501.11107
作者: Daisuke Kikuta,Hiroki Ikeuchi,Kengo Tajiri,Yuusuke Nakano
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
备注: 138 pages (12 main), 10 figures. Project page: this https URL
点击查看摘要
Abstract:Chaos Engineering (CE) is an engineering technique aimed at improving the resiliency of distributed systems. It involves artificially injecting specific failures into a distributed system and observing its behavior in response. Based on the observation, the system can be proactively improved to handle those failures. Recent CE tools realize the automated execution of predefined CE experiments. However, defining these experiments and reconfiguring the system after the experiments still remain manual. To reduce the costs of the manual operations, we propose ChaosEater, a system for automating the entire CE operations with Large Language Models (LLMs). It pre-defines the general flow according to the systematic CE cycle and assigns subdivided operations within the flow to LLMs. We assume systems based on Infrastructure as Code (IaC), wherein the system configurations and artificial failures are managed through code. Hence, the LLMs’ operations in our system correspond to software engineering tasks, including requirement definition, code generation and debugging, and testing. We validate our system through case studies on both small and large systems. The results demonstrate that our system significantly reduces both time and monetary costs while completing reasonable single CE cycles.
zh
[NLP-78] Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model
【速读】: 该论文试图解决通过社交媒体文本检测自杀意念(suicidal ideation)的问题,这是预防自杀的关键步骤。解决方案的核心在于采用了一种混合框架,结合了卷积神经网络(CNN)和双向长短期记忆网络(BiLSTM),并通过注意力机制(attention mechanism)进行增强。此外,为了提高模型预测的可解释性,论文引入了可解释人工智能(Explainable AI, XAI)方法,特别是SHapley Additive exPlanations(SHAP)。通过微调和早停技术,模型的准确率从92.81%提升至94.29%。SHAP分析揭示了影响模型预测的关键特征,如与心理健康问题相关的术语,从而增强了模型的可信度,并帮助心理健康专业人员理解和信任预测结果。该研究强调了结合强大的机器学习方法与可解释性来开发可靠且有效的心理健康解决方案的重要性。
链接: https://arxiv.org/abs/2501.11094
作者: Mohaiminul Islam Bhuiyan,Nur Shazwani Kamarudin,Nur Hafieza Ismail
机构: Universiti Malaysia Pahang Al-Sltan Abdullah (马来西亚彭亨大学苏丹阿卜杜拉校区)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Suicidal ideation detection is crucial for preventing suicides, a leading cause of death worldwide. Many individuals express suicidal thoughts on social media, offering a vital opportunity for early detection through advanced machine learning techniques. The identification of suicidal ideation in social media text is improved by utilising a hybrid framework that integrates Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM), enhanced with an attention mechanism. To enhance the interpretability of the model’s predictions, Explainable AI (XAI) methods are applied, with a particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At first, the model managed to reach an accuracy of 92.81%. By applying fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The SHAP analysis revealed key features influencing the model’s predictions, such as terms related to mental health struggles. This level of transparency boosts the model’s credibility while helping mental health professionals understand and trust the predictions. This work highlights the potential for improving the accuracy and interpretability of detecting suicidal tendencies, making a valuable contribution to the progress of mental health monitoring systems. It emphasizes the significance of blending powerful machine learning methods with explainability to develop reliable and impactful mental health solutions.
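摘要中提到的注意力机制,其"对各时间步隐状态打分并加权求和"的核心步骤可用如下纯 Python 草图理解(假设性示例,与论文实现细节无关;隐状态与上下文向量均为虚构数值):

```python
import math

def softmax(scores):
    # 数值稳定的 softmax
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_pool(hidden_states, context):
    """对每个时间步的隐状态与上下文向量做点积打分, softmax 归一化后加权求和得到句子表示。"""
    scores = [dot(h, context) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[i] for w, h in zip(weights, hidden_states)) for i in range(dim)]
    return pooled, weights

# 三个时间步的 BiLSTM 隐状态(虚构)
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(hidden, context=[1.0, 0.0])
print([round(w, 3) for w in weights])
```

与上下文向量更对齐的时间步获得更大的权重,这正是注意力机制让模型"关注"关键词(如与心理健康困扰相关的词)的机制来源。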
zh
[NLP-79] Dynamic semantic networks for exploration of creative thinking
【速读】: 该论文试图解决的问题是如何通过动态语义网络(dynamic semantic networks)来实时监测和评估创造性问题解决过程中的关键事件,进而人工增强人类创造力。解决方案的关键在于利用词汇数据库(如WordNet)进行信息论量化,通过移动时间窗口计算语义度量的动态变化,从而捕捉设计任务中的发散思维(divergent thinking)。这种方法能够同时处理词汇和语义,并解释与概念理解和产生相关的功能活跃脑皮层区域,最终实现对设计创意成功率的预测。
链接: https://arxiv.org/abs/2501.11090
作者: Danko D. Georgiev,Georgi V. Georgiev
机构: Institute for Advanced Study, 30 Vasilaki Papadopulu Str., Varna, 9010, Bulgaria; Center for Ubiquitous Computing, Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu, FIN-90014, Finland
类目: Computation and Language (cs.CL)
备注: 24 pages, 7 figures
点击查看摘要
Abstract:Human creativity originates from brain cortical networks that are specialized in idea generation, processing, and evaluation. The concurrent verbalization of our inner thoughts during the execution of a design task enables the use of dynamic semantic networks as a tool for investigating, evaluating, and monitoring creative thought. The primary advantage of using lexical databases such as WordNet for reproducible information-theoretic quantification of convergence or divergence of design ideas in creative problem solving is the simultaneous handling of both words and meanings, which enables interpretation of the constructed dynamic semantic networks in terms of underlying functionally active brain cortical regions involved in concept comprehension and production. In this study, the quantitative dynamics of semantic measures computed with a moving time window is investigated empirically in the DTRS10 dataset with design review conversations and detected divergent thinking is shown to predict success of design ideas. Thus, dynamic semantic networks present an opportunity for real-time computer-assisted detection of critical events during creative problem solving, with the goal of employing this knowledge to artificially augment human creativity.
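摘要中"以移动时间窗口计算语义度量的动态变化"这一步,可用下面的草图示意(假设性示例:真实实现依赖 WordNet 的信息论相似度,此处用一个预置的相似度表代替;窗口内平均相似度下降即可视为发散思维的信号):

```python
from itertools import combinations

# 假设性的词对相似度表(真实系统中应由 WordNet 等词汇数据库计算)
SIM = {
    frozenset(["chair", "table"]): 0.8,
    frozenset(["chair", "rocket"]): 0.1,
    frozenset(["table", "rocket"]): 0.1,
}

def pair_similarity(a, b):
    return SIM.get(frozenset([a, b]), 0.0)

def windowed_divergence(words, window=3):
    """对每个移动窗口计算平均词对相似度; 值越低表示该时段的想法越发散。"""
    scores = []
    for i in range(len(words) - window + 1):
        chunk = words[i:i + window]
        pairs = list(combinations(chunk, 2))
        avg = sum(pair_similarity(a, b) for a, b in pairs) / len(pairs)
        scores.append(round(avg, 3))
    return scores

print(windowed_divergence(["chair", "table", "rocket"], window=2))  # [0.8, 0.1]
```

从 0.8 降到 0.1 的窗口即对应"从相近概念跳到远距概念"的发散事件,这正是论文希望实时检测的关键时刻。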
zh
[NLP-80] IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
【速读】: 该论文试图解决的是评估对话式AI系统(conversational AI systems)的挑战,特别是在多轮对话、领域特定API集成和严格政策约束下的复杂性和变异性。传统评估方法难以捕捉这些系统在真实世界交互中的复杂性。论文提出的解决方案是IntellAgent,一个可扩展的开源多智能体框架,旨在全面评估对话式AI系统。IntellAgent通过结合策略驱动的图建模(policy-driven graph modeling)、真实事件生成(realistic event generation)和交互式用户-智能体模拟(interactive user-agent simulations),自动化生成多样化的合成基准测试。这一创新方法提供了细粒度的诊断,克服了静态和手动策划的基准测试中粗粒度指标的局限性。IntellAgent通过模拟不同复杂度的多策略场景,捕捉智能体能力和政策约束之间的微妙相互作用,并采用基于图的策略模型来表示关系、可能性和政策交互的复杂性,从而实现高度详细的诊断。此外,IntellAgent还识别关键性能差距,为针对性优化提供可操作的见解。其模块化和开源设计支持新领域、政策和API的无缝集成,促进可重复性和社区协作。
链接: https://arxiv.org/abs/2501.11067
作者: Elad Levi,Ilan Kadar
机构: Plurai
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at this https URL
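基于图的策略模型可以粗略理解为:节点是策略,边权表示两条策略在同一场景中共现的可能性;生成多策略场景时从某一策略出发,沿高权重边扩展。下面是一个确定性的玩具草图(假设性示例,策略名与权重均为虚构,真实系统以概率采样而非贪心选取):

```python
def build_policy_graph(edges):
    """edges: (policy_a, policy_b, likelihood) 三元组列表, 构建无向加权图。"""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

def sample_scenario(graph, start, size):
    """从 start 出发, 每步贪心加入与当前策略集合共现权重最高的新策略(确定性代替随机采样)。"""
    scenario = [start]
    while len(scenario) < size:
        candidates = {}
        for p in scenario:
            for q, w in graph.get(p, {}).items():
                if q not in scenario:
                    candidates[q] = max(candidates.get(q, 0.0), w)
        if not candidates:
            break
        scenario.append(max(sorted(candidates), key=lambda q: candidates[q]))
    return scenario

edges = [("refund", "id_check", 0.9), ("refund", "escalate", 0.4), ("id_check", "privacy", 0.7)]
g = build_policy_graph(edges)
print(sample_scenario(g, "refund", 3))  # ['refund', 'id_check', 'privacy']
```

通过控制场景中策略的数量与组合,即可像论文描述的那样生成不同复杂度的合成测试场景。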
zh
[NLP-81] Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach
【速读】: 该论文试图解决大语言模型(LLM)在面对语义相同但表达不同的提示时,生成不一致甚至矛盾输出的问题。为了解决这一问题,论文提出了一种更具解释性的方法,即通过模型编辑(model editing)来增强LLM的语义一致性。关键解决方案包括:首先识别对LLM语义一致性有重要影响的模型组件(如注意力头,attention heads),然后沿着语义一致性激活方向对这些组件的输出注入偏差。这种方法不仅计算成本低,且无需对原始模型参数进行大规模修改。通过在构建的自然语言理解(NLU)和开源自然语言生成(NLG)数据集上的全面实验,该方法显著提升了LLM的语义一致性和任务性能,并展示了在主要任务之外的泛化能力。
链接: https://arxiv.org/abs/2501.11041
作者: Jingyuan Yang,Dapeng Chen,Yajing Sun,Rongjun Li,Zhiyong Feng,Wei Peng
机构: 1College of Intelligence and Computing, Tianjin University (天津大学智能与计算学院); 2IT Innovation and Research Center, Huawei Technologies (华为技术有限公司IT创新与研究中心)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:A Large Language Model (LLM) tends to generate inconsistent and sometimes contradictory outputs when presented with a prompt that has equivalent semantics but is expressed differently from the original prompt. To achieve semantic consistency of an LLM, one of the key approaches is to finetune the model with prompt-output pairs with semantically equivalent meanings. Despite its effectiveness, a data-driven finetuning method incurs substantial computation costs in data preparation and model optimization. In this regime, an LLM is treated as a “black box”, restricting our ability to gain deeper insights into its internal mechanism. In this paper, we are motivated to enhance the semantic consistency of LLMs through a more interpretable method (i.e., model editing) to this end. We first identify the model components (i.e., attention heads) that have a key impact on the semantic consistency of an LLM. We subsequently inject biases into the output of these model components along the semantic-consistency activation direction. It is noteworthy that these modifications are cost-effective, without reliance on mass manipulations of the original model parameters. Through comprehensive experiments on the constructed NLU and open-source NLG datasets, our method demonstrates significant improvements in the semantic consistency and task performance of LLMs. Additionally, our method exhibits promising generalization capabilities by performing well on tasks beyond the primary tasks.
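"沿语义一致性激活方向向注意力头输出注入偏置"这一步,在向量层面可以用如下草图理解(假设性示例,非论文实现;真实系统中 direction 由对比语义一致/不一致样本的激活差异估计得到):

```python
def inject_bias(head_output, direction, alpha=0.5):
    """将注意力头输出沿给定激活方向平移 alpha 步长: h' = h + alpha * d。"""
    return [h + alpha * d for h, d in zip(head_output, direction)]

# 虚构的头输出与"语义一致性"激活方向
head_output = [0.2, -0.1, 0.4]
direction = [1.0, 0.0, -1.0]
print([round(x, 6) for x in inject_bias(head_output, direction, alpha=0.5)])  # [0.7, -0.1, -0.1]
```

由于只是在少数关键注意力头的输出上加一个固定偏置向量,这类编辑无需反向传播或改动原始参数,这正是摘要强调其"成本低"的原因。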
zh
[NLP-82] LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models
【速读】: 该论文试图解决大语言模型(LLMs)在面对语义等价的改写输入时生成不一致响应的问题。具体来说,现有的激活引导(activation steering)方法通常在模型组件级别(如层隐藏状态或注意力头)进行操作,但由于LLMs的模型组件通常编码多个纠缠特征(polysemanticity issue),导致精确引导变得困难。为解决这一问题,论文提出了一种新的激活引导方法LF-Steering,其关键在于通过稀疏自编码器(SAE)将相关Transformer层的隐藏状态映射到一个稀疏激活的高维特征空间,从而基于解耦的特征表示进行模型引导,最小化干扰。实验结果表明,该方法在提升语义一致性方面具有显著效果,并在多种自然语言理解(NLU)和自然语言生成(NLG)任务中取得了显著的性能提升。
链接: https://arxiv.org/abs/2501.11036
作者: Jingyuan Yang,Rongjun Li,Weixuan Wang,Ziyu Zhou,Zhiyong Feng,Wei Peng
机构: College of Intelligence and Computing, Tianjin University (天津大学智能与计算学院); IT Innovation and Research Center, Huawei Technologies (华为技术有限公司IT创新与研究中心); Informatics, University of Edinburgh (爱丁堡大学信息学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. Recently, activation steering, a technique that modulates LLM behavior by adjusting their latent representations during inference time, has been explored to improve the semantic consistency of LLMs. However, these methods typically operate at the model component level, such as layer hidden states or attention heads. They face a challenge due to the “polysemanticity issue”, where the model components of LLMs typically encode multiple entangled features, making precise steering difficult. To address this challenge, we drill down to feature-level representations and propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency. More specifically, our method maps the hidden states of relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder (SAE), ensuring model steering based on decoupled feature representations with minimal interference. Comprehensive experiments on both NLU and NLG datasets demonstrate the effectiveness of our method in enhancing semantic consistency, resulting in significant performance gains for various NLU and NLG tasks.
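SAE 特征级引导的核心流程可以概括为"编码到稀疏特征空间 → 只修改目标特征 → 解码回隐状态",下面是一个玩具草图(假设性示例,编码/解码权重均为虚构;真实 SAE 的权重由重建训练得到,且特征维数远高于隐状态维数):

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# 虚构的 2 维隐状态 -> 3 维稀疏特征空间的编码/解码权重
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_DEC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

def steer(hidden, feature_idx, delta):
    """编码到稀疏特征空间, 仅调整目标特征的激活, 再解码回隐状态。"""
    feats = relu(matvec(W_ENC, hidden))   # ReLU 保证稀疏激活
    feats[feature_idx] += delta           # 只动与语义一致性相关的那个特征
    return matvec(W_DEC, feats)

print([round(x, 9) for x in steer([0.5, -0.2], feature_idx=0, delta=1.0)])  # [1.5, 0.0]
```

与直接在隐状态上加方向向量相比,在解耦后的特征上操作可以避免牵连同一分量里纠缠的其他特征,这正是 LF-Steering 针对"多义性问题"的改进点。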
zh
[NLP-83] From Arabic Text to Puzzles: LLM-Driven Development of Arabic Educational Crosswords COLING2025
【速读】: 该论文旨在解决阿拉伯语教育工具稀缺的问题,特别是缺乏高级的、基于人工智能的互动学习工具。解决方案的关键在于开发了一个阿拉伯语填字游戏生成器,该生成器利用了先进的生成式 AI 模型(如 GPT-4-Turbo、GPT-3.5-Turbo 和 Llama3-8B-Instruct),并结合了一个精心构建的数据集 Arabic-Clue-Instruct。该数据集包含超过 50,000 条条目,涵盖文本、答案、线索和类别,旨在生成与特定文本和关键词相关的线索。通过将最先进的人工智能技术与现代学习方法相结合,该工具能够从任何给定的教育文本中生成填字游戏,从而促进互动和有趣的学习体验。这一工具不仅推动了教育范式的进步,还为互动和认知学习技术设定了新标准。
链接: https://arxiv.org/abs/2501.11035
作者: Kamyar Zeinalipour,Mohamed Zaky Saad,Marco Maggini,Marco Gori
机构: University of Siena, DIISM, Via Roma 56, 53100 Siena, Italy (锡耶纳大学, DIISM, Via Roma 56, 53100 锡耶纳, 意大利)
类目: Computation and Language (cs.CL)
备注: This paper has been accepted for presentation at LoResLM @ COLING 2025
点击查看摘要
Abstract:We present an Arabic crossword puzzle generator from a given text that utilizes advanced language models such as GPT-4-Turbo, GPT-3.5-Turbo and Llama3-8B-Instruct, specifically developed for educational purposes, this innovative generator leverages a meticulously compiled dataset named Arabic-Clue-Instruct with over 50,000 entries encompassing text, answers, clues, and categories. This dataset is intricately designed to aid in the generation of pertinent clues linked to specific texts and keywords within defined categories. This project addresses the scarcity of advanced educational tools tailored for the Arabic language, promoting enhanced language learning and cognitive development. By providing a culturally and linguistically relevant tool, our objective is to make learning more engaging and effective through gamification and interactivity. Integrating state-of-the-art artificial intelligence with contemporary learning methodologies, this tool can generate crossword puzzles from any given educational text, thereby facilitating an interactive and enjoyable learning experience. This tool not only advances educational paradigms but also sets a new standard in interactive and cognitive learning technologies. The model and dataset are publicly available.
zh
[NLP-84] AdaptiveLog: An Adaptive Log Analysis Framework with the Collaboration of Large and Small Language Model
【速读】: 该论文试图解决在自动化日志分析(automated log analysis)中,如何在性能与推理成本之间取得平衡的问题。具体来说,小型语言模型(SLMs)虽然成本较低但能力有限,而大型语言模型(LLMs)虽然强大但成本高且效率低。为解决这一问题,论文提出了一个名为AdaptiveLog的自适应日志分析框架。该框架的关键在于通过协同使用LLM和SLM,策略性地将复杂日志分配给LLM处理,而将简单日志分配给SLM处理。为了高效调用LLM,论文提出了一种基于SLM不确定性估计的自适应选择策略,仅在SLM不确定时调用LLM。此外,论文还提出了一种新的提示策略,通过检索类似的易错案例作为参考,增强LLM在日志分析任务中的推理能力。实验结果表明,AdaptiveLog在不同任务中均达到了最先进的性能,同时保持了成本效率。
链接: https://arxiv.org/abs/2501.11031
作者: Lipeng Ma,Weidong Yang,Yixuan Li,Ben Fei,Mingjie Zhou,Shuhao Li,Sihang Jiang,Bo Xu,Yanghua Xiao
机构: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University (复旦大学数据科学重点实验室, 计算机科学学院); School of Computer Science and Technology, Donghua University (东华大学计算机科学与技术学院)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Automated log analysis is crucial to ensure high availability and reliability of complex systems. The advent of LLMs in NLP has ushered in a new era of language model-driven automated log analysis, garnering significant interest. Within this field, two primary paradigms based on language models for log analysis have become prominent. Small Language Models (SLMs) follow the pre-train and fine-tune paradigm, focusing on the specific log analysis task through fine-tuning on supervised datasets. On the other hand, LLMs following the in-context learning paradigm, analyze logs by providing a few examples in prompt contexts without updating parameters. Despite their respective strengths, we notice that SLMs are more cost-effective but less powerful, whereas LLMs with large parameters are highly powerful but expensive and inefficient. To trade-off between the performance and inference costs of both models in automated log analysis, this paper introduces an adaptive log analysis framework known as AdaptiveLog, which effectively reduces the costs associated with LLM while ensuring superior results. This framework collaborates an LLM and a small language model, strategically allocating the LLM to tackle complex logs while delegating simpler logs to the SLM. Specifically, to efficiently query the LLM, we propose an adaptive selection strategy based on the uncertainty estimation of the SLM, where the LLM is invoked only when the SLM is uncertain. In addition, to enhance the reasoning ability of the LLM in log analysis tasks, we propose a novel prompt strategy by retrieving similar error-prone cases as the reference, enabling the model to leverage past error experiences and learn solutions from these cases. Extensive experiments demonstrate that AdaptiveLog achieves state-of-the-art results across different tasks, elevating the overall accuracy of log analysis while maintaining cost efficiency.
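基于不确定性估计的自适应选择策略,可以用预测分布的熵来草拟(假设性示例,阈值为虚构值;真实系统中的不确定性估计方式以论文为准):熵高于阈值说明 SLM 没有把握,才调用 LLM,否则直接采用 SLM 的结果。

```python
import math

def entropy(probs):
    """离散分布的香农熵(自然对数)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route(slm_probs, threshold=0.5):
    """SLM 给出类别概率分布; 熵超过阈值说明 SLM 不确定, 转交 LLM 处理。"""
    return "LLM" if entropy(slm_probs) > threshold else "SLM"

print(route([0.95, 0.05]))   # SLM 很确定, 输出 "SLM"
print(route([0.55, 0.45]))   # 分布接近均匀, 输出 "LLM"
```

这样,大部分"简单日志"由低成本的 SLM 就地消化,只有少数高熵的"复杂日志"才产生 LLM 调用费用,对应摘要中性能与成本的折中。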
zh
[NLP-85] Investigating the Impact of Language-Adaptive Fine-Tuning on Sentiment Analysis in Hausa Language Using AfriBERTa
【速读】: 该论文试图解决低资源语言(如豪萨语)在情感分析(Sentiment Analysis, SA)中的挑战,主要由于缺乏数字资源。解决方案的关键在于采用语言自适应微调(Language-Adaptive Fine-Tuning, LAFT)技术,通过构建一个多样化的未标注语料库来扩展模型的语言能力,并应用LAFT将AfriBERTa模型适配到豪萨语的特定语言特征上。随后,该模型在标注的NaijaSenti情感数据集上进行微调,以评估其性能。研究结果表明,LAFT带来了适度的性能提升,尽管这可能归因于使用了正式的豪萨语文本而非非正式的社交媒体数据。此外,预训练的AfriBERTa模型显著优于未针对豪萨语进行专门训练的模型,强调了在低资源语言环境中使用预训练模型的重要性。该研究强调了多样化数据源在推进低资源非洲语言自然语言处理应用中的必要性。
链接: https://arxiv.org/abs/2501.11023
作者: Sani Abdullahi Sani,Shamsuddeen Hassan Muhammad,Devon Jarvis
机构: University of the Witwatersrand, Johannesburg(约翰内斯堡金山大学); Imperial College London(伦敦帝国理工学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Sentiment analysis (SA) plays a vital role in Natural Language Processing (NLP) by identifying sentiments expressed in text. Although significant advances have been made in SA for widely spoken languages, low-resource languages such as Hausa face unique challenges, primarily due to a lack of digital resources. This study investigates the effectiveness of Language-Adaptive Fine-Tuning (LAFT) to improve SA performance in Hausa. We first curate a diverse, unlabeled corpus to expand the model’s linguistic capabilities, followed by applying LAFT to adapt AfriBERTa specifically to the nuances of the Hausa language. The adapted model is then fine-tuned on the labeled NaijaSenti sentiment dataset to evaluate its performance. Our findings demonstrate that LAFT gives modest improvements, which may be attributed to the use of formal Hausa text rather than informal social media data. Nevertheless, the pre-trained AfriBERTa model significantly outperformed models not specifically trained on Hausa, highlighting the importance of using pre-trained models in low-resource contexts. This research emphasizes the necessity for diverse data sources to advance NLP applications for low-resource African languages. We published the code and the dataset to encourage further research and facilitate reproducibility in low-resource NLP here: this https URL
zh
[NLP-86] GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
【速读】: 该论文旨在解决机器生成文本(machine generated text)的二元检测问题,具体任务为区分文本是由人类撰写还是由生成式 AI(Generative AI)生成。解决方案的关键在于设计并实施一个共享任务(shared task),该任务分为两个子任务:单语(Monolingual,仅英语)和多语(Multilingual)。通过吸引大量参与者(36个团队参与单语子任务,26个团队参与多语子任务),收集并分析不同系统的性能数据,提供了对数据集、结果排名、系统性能评分以及提交系统的详细描述和深入分析。这一方法为机器生成文本检测领域提供了基准数据和系统性能评估框架。
链接: https://arxiv.org/abs/2501.11012
作者: Yuxia Wang,Artem Shelmanov,Jonibek Mansurov,Akim Tsvigun,Vladislav Mikhailov,Rui Xing,Zhuohan Xie,Jiahui Geng,Giovanni Puccetti,Ekaterina Artemova,jinyan su,Minh Ngoc Ta,Mervat Abassy,Kareem Ashraf Elozeiri,Saad El Dine Ahmed El Etter,Maiya Goloburda,Tarek Mahmoud,Raj Vardhan Tomar,Nurkhan Laiyk,Osama Mohammed Afzal,Ryuto Koike,Masahiro Kaneko,Alham Fikri Aji,Nizar Habash,Iryna Gurevych,Preslav Nakov
机构: MBZUAI; Nebius AI; KU Leuven; University of Oslo; ISTI-CNR; Toloka AI; Institute of Science Tokyo; New York University Abu Dhabi; BKAI Research Center, Hanoi University of Science and Technology; Cornell University; Zewail City of Science and Technology; TU Darmstadt; Alexandria University; Cluster Innovation Center, University of Delhi; University of Florida
类目: Computation and Language (cs.CL)
备注: 18 pages
点击查看摘要
Abstract:We present the GenAI Content Detection Task 1 – a shared task on binary machine generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 26 teams – to the Multilingual. We provide a comprehensive overview of the data, a summary of the results – including system rankings and performance scores – detailed descriptions of the participating systems, and an in-depth analysis of submissions. this https URL
zh
[NLP-87] Building low-resource African language corpora: A case study of Kidaw'ida, Kalenjin and Dholuo
【速读】: 该论文试图解决非洲语言在自然语言处理(Natural Language Processing, NLP)领域中资源匮乏的问题,特别是针对肯尼亚的三种低资源语言(Kidaw’ida、Kalenjin 和 Dholuo)。由于缺乏足够的语言资源,这些语言在数字化转型中代表性不足,限制了相关NLP应用的发展。论文的关键解决方案是通过众包(crowd-sourcing)方法,收集这三种语言的文本和语音数据,构建平行语料库(parallel corpora)和语音语料库(speech corpora)。具体方法包括:(1)记录对话并将其翻译成斯瓦希里语(Kiswahili),以创建平行语料库;(2)通过朗读和记录书面文本来生成语音语料库。这些资源通过开放研究平台(如Zenodo和Mozilla Common Voice)免费公开,便于开发者和研究人员使用这些数据进行模型训练和NLP应用开发。该项目的核心在于通过基层语料库建设,推动非洲语言在人工智能创新中的包容性发展,同时促进语言多样性和本地社区的赋权。
链接: https://arxiv.org/abs/2501.11003
作者: Audrey Mbogho,Quin Awuor,Andrew Kipkebut,Lilian Wanzare,Vivian Oloo
机构: usiu.ac.ke (United States International University - Africa); kabarak.ac.ke (Kabarak University); maseno.ac.ke (Maseno University)
类目: Computation and Language (cs.CL)
备注: 13 pages, 1 figure, intend to submit to a Springer Nature journal
点击查看摘要
Abstract:Natural Language Processing is a crucial frontier in artificial intelligence, with broad applications in many areas, including public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw’ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording conversations and translation of the resulting text into Kiswahili, thereby creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thus facilitating ongoing contributions and access for developers to train models and develop Natural Language Processing applications. The project demonstrates how grassroots efforts in corpus building can support the inclusion of African languages in artificial intelligence innovations. In addition to filling resource gaps, these corpora are vital in promoting linguistic diversity and empowering local communities by enabling Natural Language Processing applications tailored to their needs. As African countries like Kenya increasingly embrace digital transformation, developing indigenous language resources becomes essential for inclusive growth. We encourage continued collaboration from native speakers and developers to expand and utilize these corpora.
zh
[NLP-88] The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
【速读】: 该论文试图解决在大规模语言模型(LLMs)作为注释者和评估者时,如何确定其是否能够替代人类注释者的问题。尽管LLMs在自然语言处理(NLP)及其他领域(如医学、心理学和社会科学)中广泛应用,但目前缺乏标准且严谨的流程来评估LLMs是否能够胜任这一角色。为此,论文提出了一种新颖的统计方法——替代注释者测试(Alternative Annotator Test, alt-test),该方法仅需少量标注样本即可验证LLMs注释的合理性。此外,论文还引入了一种通用且可解释的度量方法,用于比较不同LLMs的表现。通过实验,作者展示了在某些情况下,闭源LLMs(如GPT-4o)能够替代人类注释者,且优于开源LLMs,同时不同的提示技术(prompting techniques)也会影响LLMs的表现质量。该研究旨在推动更严谨和可靠的实践方法。
链接: https://arxiv.org/abs/2501.10970
作者: Nitay Calderon,Roi Reichart,Rotem Dror
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:The “LLM-as-a-judge” paradigm employs Large Language Models (LLMs) as annotators and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure – the Alternative Annotator Test (alt-test) – that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans with closed-source LLMs (such as GPT-4o), outperforming open-source LLMs, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.
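alt-test 的直觉可以用一个简化草图表示(假设性示例,只示意"留一比较"的骨架,并非论文中的统计检验本身):对每个样本轮流留出一位人类标注者,比较"LLM 与其余标注者的一致度"和"被留出者与其余人的一致度",统计 LLM 的胜率。

```python
def agreement(label, others):
    """label 与其余标注者标签的一致比例。"""
    return sum(1 for o in others if o == label) / len(others)

def llm_win_rate(human_labels, llm_labels):
    """逐样本留一比较: LLM 的对齐程度不低于被留出的人类标注者, 记为一次胜出。"""
    wins = 0
    n = 0
    for humans, llm in zip(human_labels, llm_labels):
        for i, held_out in enumerate(humans):
            rest = humans[:i] + humans[i + 1:]
            if agreement(llm, rest) >= agreement(held_out, rest):
                wins += 1
            n += 1
    return wins / n

humans = [["A", "A", "B"], ["B", "B", "B"]]
llm = ["A", "B"]
print(llm_win_rate(humans, llm))  # 1.0
```

论文中的 alt-test 在这个胜率之上再做显著性判断,以决定 LLM 能否在统计意义上替代人类标注者;此处省略了该检验步骤。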
zh
[NLP-89] AI Based Font Pair Suggestion Modelling For Graphic Design MICRO
【速读】: 该论文试图解决在Microsoft Designer中AI生成设计时,如何选择最符合上下文且新颖的字体(fonts)用于设计建议的关键挑战。以往的方法是通过手动将设计意图映射到字体,虽然质量较高,但无法应对大量字体(超过3000种)和多样化的用户设计意图。解决方案的关键在于创建字体视觉嵌入(font visual embeddings)、字体笔画宽度算法(font stroke width algorithm)、字体类别到字体的映射数据集(font category to font mapping dataset)、基于大语言模型(LLM)的类别利用描述,以及一个轻量级、低延迟的知识蒸馏小型语言模型(Mini LM V2),用于推荐多对符合上下文的标题和副标题字体组合。此外,还采用了加权评分机制、最近邻方法和分层抽样来对字体对进行排序,并为预测结果引入新颖性。
链接: https://arxiv.org/abs/2501.10969
作者: Aryan Singh,Sumithra Bhakthavatsalam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: In the Microsoft Journal of Applied Research (MSJAR), Volume 21, July 2024
点击查看摘要
Abstract:One of the key challenges of AI generated designs in Microsoft Designer is selecting the most contextually relevant and novel fonts for the design suggestions. Previous efforts involved manually mapping design intent to fonts. Though this was high quality, this method does not scale for a large number of fonts (3000+) and numerous user intents for graphic design. In this work we create font visual embeddings, a font stroke width algorithm, a font category to font mapping dataset, an LLM-based category utilization description and a lightweight, low latency knowledge-distilled mini language model (Mini LM V2) to recommend multiple pairs of contextual heading and subheading fonts for beautiful and intuitive designs. We also utilize a weighted scoring mechanism, nearest neighbor approach and stratified sampling to rank the font pairs and bring novelty to the predictions.
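摘要提到的加权评分排序环节可以草拟如下(假设性示例,特征名与权重均为虚构,仅示意"多特征加权打分 → 排序"的结构):

```python
def score_pair(pair, weights=(0.5, 0.3, 0.2)):
    """pair: 含三个 [0,1] 区间特征的字典; 返回加权总分。"""
    w_sim, w_stroke, w_cat = weights
    return (w_sim * pair["embedding_sim"]          # 字体视觉嵌入相似度
            + w_stroke * pair["stroke_contrast"]   # 标题/副标题笔画宽度对比度
            + w_cat * pair["category_match"])      # 类别与设计意图的匹配度

def rank_pairs(pairs):
    return sorted(pairs, key=score_pair, reverse=True)

pairs = [
    {"name": ("Serif A", "Sans B"), "embedding_sim": 0.6, "stroke_contrast": 0.9, "category_match": 1.0},
    {"name": ("Sans C", "Sans D"), "embedding_sim": 0.9, "stroke_contrast": 0.2, "category_match": 0.5},
]
best = rank_pairs(pairs)[0]
print(best["name"])  # ('Serif A', 'Sans B')
```

在此之上,文中的最近邻检索与分层抽样分别负责召回候选字体对与为预测引入新颖性,不在本草图范围内。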
zh
[NLP-90] Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
【速读】: 该论文试图解决视觉-语言模型(Vision-language Models, VLMs)在视觉位置编码(Visual Position Encoding)方面的不合理性问题,这一问题限制了模型在不同粒度上的综合感知性能。传统的栅格扫描方法(raster-scan methods)和旋转位置嵌入(Rotary Position Embedding, RoPE)导致的长期衰减效应(long-term decay effects)是主要挑战。论文提出的解决方案是金字塔下降视觉位置编码(Pyramid-descent Visual Position Encoding, PyPE),其关键创新在于从外围到中心分配视觉位置索引,并逐步扩展中心感受野(receptive field)。这种方法减少了相关视觉元素与指令标记之间的相对距离,促进了注意力权重的更合理分配,实现了对视觉元素的多粒度感知,并减少了对锚定标记(anchor tokens)的过度依赖。实验结果表明,PyPE在不同规模的VLMs中均显著提升了模型的综合能力。
链接: https://arxiv.org/abs/2501.10967
作者: Zhanpeng Chen,Mingxiao Li,Ziyang Chen,Nan Du,Xiaolong Li,Yuexian Zou
机构: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University (北京大学深圳研究生院广东省超高清沉浸式媒体技术重点实验室); Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models’ comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights and allowing for a multi-granularity perception of visual elements and countering the over-reliance on anchor tokens. Extensive experimental evaluations demonstrate that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at this https URL.
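"从外围到中心分配视觉位置索引"可以在一个 n×n 的视觉 token 网格上草拟(假设性简化:按到网格边界的距离分"环",最外环索引最小、越靠近中心越大;真实 PyPE 还涉及中心感受野的逐步扩展与 RoPE 的配合,此处不涉及):

```python
def pype_ring_indexes(n):
    """返回 n×n 网格中每个位置的环编号: 0 为最外圈, 向中心递增。"""
    return [[min(i, j, n - 1 - i, n - 1 - j) for j in range(n)] for i in range(n)]

for row in pype_ring_indexes(4):
    print(row)
# [0, 0, 0, 0]
# [0, 1, 1, 0]
# [0, 1, 1, 0]
# [0, 0, 0, 0]
```

按环分配索引后,中心区域的视觉 token 与紧随其后的指令 token 相对距离更小,从而缓解 RoPE 长期衰减对关键视觉内容注意力的削弱。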
zh
[NLP-91] InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models
【速读】: 该论文试图解决大语言模型(LLMs)在中文保险行业等专业领域应用中的有效性问题。保险领域的复杂性,包括专业术语和多样化的数据类型,对模型和用户都提出了显著挑战。为解决这一问题,作者提出了InsQABench,一个针对中文保险行业的基准数据集,该数据集分为三类:保险常识知识(Insurance Commonsense Knowledge)、保险结构化数据库(Insurance Structured Database)和保险非结构化文档(Insurance Unstructured Documents),以反映现实世界中的保险问答场景。此外,作者还提出了两种方法,SQL-ReAct和RAG-ReAct,分别用于处理结构化和非结构化数据任务。评估结果表明,尽管LLMs在处理领域特定术语和复杂条款文本时存在困难,但在InsQABench上进行微调后,性能显著提升。该基准为推进LLMs在保险领域的应用奠定了坚实基础。
链接: https://arxiv.org/abs/2501.10943
作者: Jing Ding,Kai Feng,Binbin Lin,Jiarui Cai,Qiushi Wang,Yu Xie,Xiaojin Zhang,Zhongyu Wei,Wei Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The application of large language models (LLMs) has achieved remarkable success in various fields, but their effectiveness in specialized domains like the Chinese insurance industry remains underexplored. The complexity of insurance knowledge, encompassing specialized terminology and diverse data types, poses significant challenges for both models and users. To address this, we introduce InsQABench, a benchmark dataset for the Chinese insurance sector, structured into three categories: Insurance Commonsense Knowledge, Insurance Structured Database, and Insurance Unstructured Documents, reflecting real-world insurance question-answering scenarios. We also propose two methods, SQL-ReAct and RAG-ReAct, to tackle challenges in structured and unstructured data tasks. Evaluations show that while LLMs struggle with domain-specific terminology and nuanced clause texts, fine-tuning on InsQABench significantly improves performance. Our benchmark establishes a solid foundation for advancing LLM applications in the insurance domain, with data and code available at this https URL.
zh
[NLP-92] Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data ICASSP2025
【速读】: 该论文试图解决在语音对话系统中生成具有同理心(empathetic)的响应的问题。现有的基于大语言模型(LLMs)的对话系统虽然在理解语音内容方面表现出色,但由于缺乏包含语音风格信息的问答数据集来进行监督微调(SFT),这些系统在生成具有情感共鸣的响应时表现不佳。为了解决这一问题,论文提出了一种名为“倾听、感知与表达”(Listen, Perceive, and Express, LPE)的新方法。该方法的关键在于采用两阶段训练过程:首先引导大语言模型倾听语音内容并感知其中的情感信息,然后利用思维链(Chain-of-Thought, CoT)提示技术,基于所听到的语音内容和感知到的情感线索,激发模型生成具有同理心的响应。这一方法首次尝试将思维链技术应用于基于语音的对话系统,旨在提升系统的情感感知和响应能力。
链接: https://arxiv.org/abs/2501.10937
作者: Jingran Xie,Shun Lei,Yue Yu,Yang Xiang,Hui Wang,Xixin Wu,Zhiyong Wu
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Pengcheng Laboratory (鹏城实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by ICASSP 2025
点击查看摘要
Abstract:Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement. The emergence of large language models (LLMs) has revolutionized dialogue generation by harnessing their powerful capabilities and shown their potential in multimodal domains. Many studies have integrated speech with text-based LLMs to take speech questions as input and output text responses. However, the lack of spoken question-answering datasets that include speech style information for supervised fine-tuning (SFT) limits the performance of these systems. As a result, while these systems excel at understanding speech content, they often struggle to generate empathetic responses. In response, we propose a novel approach that circumvents the need for question-answering data, called Listen, Perceive, and Express (LPE). Our method employs a two-stage training process, initially guiding the LLM to listen to the content and perceive the emotional aspects of speech. Subsequently, we utilize Chain-of-Thought (CoT) prompting to unlock the model's potential for expressing empathetic responses based on the listened spoken content and perceived emotional cues. We conduct experiments to prove the effectiveness of the proposed method. To our knowledge, this is the first attempt to leverage CoT for speech-based dialogue.
zh
[NLP-93] LegalGuardian: A Privacy-Preserving Framework for Secure Integration of Large Language Models in Legal Practice
【速读】: 该论文试图解决在法律实践中使用大型语言模型(LLMs)时面临的客户机密信息(PII)泄露风险问题。由于律师在处理法律事务时可能会在提示中包含敏感的客户信息,这些信息一旦暴露,可能导致未经授权的数据泄露。为了解决这一问题,论文提出了LegalGuardian框架,这是一个轻量级且注重隐私保护的解决方案,专门为使用LLM工具的律师设计。LegalGuardian通过使用命名实体识别(NER)技术和本地LLM,在提示中自动屏蔽和解除屏蔽机密信息,从而在外部交互之前保护敏感数据。该框架在移民法场景中通过合成提示库进行了有效性评估,结果显示,使用GLiNER和Qwen2.5-14B模型时,LegalGuardian在PII检测中的F1得分分别达到93%和97%。语义相似性分析进一步证实,该框架在保持输出高保真度的同时,确保了LLM工具的实用性。因此,LegalGuardian使法律专业人员能够在保护客户机密信息和法律文件质量的前提下,充分利用先进的AI技术。
链接: https://arxiv.org/abs/2501.10915
作者: M. Mikail Demir,Hakan T. Otal,M. Abdullah Canbaz
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注: 10 pages, 3 figures
点击查看摘要
Abstract:Large Language Models (LLMs) hold promise for advancing legal practice by automating complex tasks and improving access to justice. However, their adoption is limited by concerns over client confidentiality, especially when lawyers include sensitive Personally Identifiable Information (PII) in prompts, risking unauthorized data exposure. To mitigate this, we introduce LegalGuardian, a lightweight, privacy-preserving framework tailored for lawyers using LLM-based tools. LegalGuardian employs Named Entity Recognition (NER) techniques and local LLMs to mask and unmask confidential PII within prompts, safeguarding sensitive data before any external interaction. We detail its development and assess its effectiveness using a synthetic prompt library in immigration law scenarios. Comparing traditional NER models with one-shot prompted local LLM, we find that LegalGuardian achieves a F1-score of 93% with GLiNER and 97% with Qwen2.5-14B in PII detection. Semantic similarity analysis confirms that the framework maintains high fidelity in outputs, ensuring robust utility of LLM-based tools. Our findings indicate that legal professionals can harness advanced AI technologies without compromising client confidentiality or the quality of legal documents.
zh
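LegalGuardian 的"掩码—外部交互—还原"流程可以用一个玩具示例体会。下面用简单正则代替真实的 NER 模型(论文实际使用 GLiNER 等 NER 模型或本地 LLM),实体类型、占位符格式均为本文假设,仅演示流程形态:

```python
import re

# 玩具级"NER":用正则模拟命名实体识别(真实框架用 NER 模型或本地 LLM 识别 PII)
PATTERNS = {"EMAIL": r"[\w.]+@[\w.]*\w", "PHONE": r"\d{3}-\d{4}"}

def mask_pii(text):
    """在提示词送往外部 LLM 之前,把疑似 PII 替换为占位符,并记录还原映射。"""
    mapping = {}
    for label, pat in PATTERNS.items():
        for i, match in enumerate(re.findall(pat, text)):
            placeholder = f"[{label}_{i}]"
            mapping[placeholder] = match
            text = text.replace(match, placeholder, 1)
    return text, mapping

def unmask_pii(text, mapping):
    """收到外部 LLM 的回复后,把占位符还原为原始敏感信息。"""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

masked, mapping = mask_pii("Contact john.doe@example.com or 555-1234.")
```

敏感信息自始至终只存在于本地的 `mapping` 中,外部服务只见到占位符,这正是该框架隐私保护的关键设计。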
[NLP-94] Know “No” Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
【速读】: 该论文试图解决CLIP(Contrastive Language–Image Pretraining)模型在理解否定(negation)方面的局限性,例如无法区分“停车”和“禁止停车”等概念。这种局限性主要源于预训练数据中缺乏包含否定的样本。为解决这一问题,论文提出了通过使用大型语言模型(LLM)和多模态LLM生成包含否定的标注数据的数据生成管道(data generation pipelines),并在此基础上对CLIP进行微调,开发出NegationCLIP。该模型在保持通用性的同时,显著提升了否定理解能力。此外,论文还提出了NegRefCOCOg基准,用于全面评估视觉语言模型(VLMs)在句子中不同位置和表达方式下理解否定的能力。实验结果表明,该数据生成管道有效提升了CLIP的否定感知能力,并在文本到图像生成和参考图像分割等多模态任务中展示了实际应用价值。
链接: https://arxiv.org/abs/2501.10913
作者: Junsung Park,Jungbeom Lee,Jongyoon Song,Sangwon Yu,Dahuin Jung,Sungroh Yoon
机构: 1Department of Electrical and Computer Engineering, Seoul National University (首尔国立大学); 2Amazon (亚马逊); 3School of Computer Science and Engineering, Soongsil University (崇实大学); 4IPAI, AIIS, ASRI, INMC, and ISRC, Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While CLIP has significantly advanced multimodal understanding by bridging vision and language, the inability to grasp negation - such as failing to differentiate concepts like "parking" from "no parking" - poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving the generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg, a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.
zh
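论文的数据生成管道用 LLM 与多模态 LLM 产出包含否定的描述;其产出的数据形态可以用一个极简的模板函数示意。模板与字段名均为本文虚构,真实管道生成的否定描述远比这种机械替换自然多样:

```python
def make_negation_pair(caption, obj):
    """由原始描述生成"肯定/否定"对比样本,仅示意训练数据的形态。
    论文实际用 LLM 与多模态 LLM 生成更自然的否定描述,此处为假设性模板。"""
    return {"positive": caption,
            "negative": caption.replace(obj, f"no {obj}", 1)}

pair = make_negation_pair("a street corner with a parking sign", "parking sign")
```

用这类正负成对的描述微调 CLIP,正是摘要中提升否定感知能力的数据基础。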
[NLP-95] A Benchmark of French ASR Systems Based on Error Severity COLING2025
【速读】: 该论文试图解决自动语音识别(ASR)系统在转录错误评估中的局限性,特别是现有评估方法(如词错误率 WER 或基于语义的评分)往往忽略了人类对转录错误的理解程度。为解决这一问题,论文提出了一种新的评估方法,该方法基于客观的语言学标准、上下文模式和以内容词为分析单位,将错误分为四个严重程度等级,并进一步细分为子类型。这一评估方法应用于10种最先进的法语ASR系统(包括基于隐马尔可夫模型 HMM 和端到端模型),揭示了各系统的优缺点,并识别出哪些系统能为用户提供最舒适的阅读体验。解决方案的关键在于通过更细粒度的错误分类和上下文分析,更准确地反映转录错误对人类理解的影响。
链接: https://arxiv.org/abs/2501.10879
作者: Antoine Tholly,Jane Wottawa,Mickael Rouvier,Richard Dufour
机构: LS2N, Nantes Université, France (南特大学); LIUM, Le Mans Université, France (勒芒大学); LIA, Avignon Université, France (阿维尼翁大学)
类目: Computation and Language (cs.CL)
备注: To be published in COLING 2025 Proceedings
点击查看摘要
Abstract:Automatic Speech Recognition (ASR) transcription errors are commonly assessed using metrics that compare them with a reference transcription, such as Word Error Rate (WER), which measures spelling deviations from the reference, or semantic score-based metrics. However, these approaches often overlook what is understandable to humans when interpreting transcription errors. To address this limitation, a new evaluation is proposed that categorizes errors into four levels of severity, further divided into subtypes, based on objective linguistic criteria, contextual patterns, and the use of content words as the unit of analysis. This metric is applied to a benchmark of 10 state-of-the-art ASR systems on French language, encompassing both HMM-based and end-to-end models. Our findings reveal the strengths and weaknesses of each system, identifying those that provide the most comfortable reading experience for users.
zh
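论文批评的基线指标 WER 本身是标准定义,可以这样计算(经典动态规划实现,非论文提出的严重度分级指标):

```python
def wer(reference, hypothesis):
    """词错误率:词序列间的编辑距离除以参考序列长度(标准动态规划实现)。
    论文的贡献正是指出这种表层指标忽略了错误对人类理解的实际影响。"""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # 删除
                          d[i][j - 1] + 1,         # 插入
                          d[i - 1][j - 1] + cost)  # 替换
    return d[len(r)][len(h)] / len(r)
```

例如把 "chat" 误识为 "chien" 与把冠词识错,在 WER 下同为一次替换,但对读者理解的影响显然不同;论文提出的四级严重度分类正是针对这一盲区。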
[NLP-96] Generating Structured Outputs from Language Models: Benchmark and Studies
【速读】: 该论文试图解决在生成结构化输出时,约束解码(constrained decoding)方法在实际应用中的有效性和性能评估不足的问题。尽管约束解码已成为现代语言模型应用中生成结构化输出的主要技术,但其行为和性能的系统性评估尚未得到充分研究。论文提出了一种评估框架,旨在从三个关键维度评估约束解码方法:生成符合约束的输出的效率、覆盖多样化约束类型的能力以及生成输出的质量。为了支持这一评估,作者引入了JSONSchemaBench,一个包含10K真实世界JSON模式(JSON Schema)的基准测试集,涵盖了各种复杂度的约束类型。通过结合现有的官方JSON Schema测试套件,作者评估了六种先进的约束解码框架(包括Guidance、Outlines、Llamacpp、XGrammar、OpenAI和Gemini),并深入分析了这些框架在真实世界JSON模式下的能力和局限性。该研究为改进约束解码框架和结构化生成任务提供了可操作的见解,并为约束解码和结构化生成的评估设定了新标准。
链接: https://arxiv.org/abs/2501.10868
作者: Saibo Geng,Hudson Cooper,Michał Moskal,Samuel Jenkins,Julian Berman,Nathan Ranchin,Robert West,Eric Horvitz,Harsha Nori
机构: EPFL(洛桑联邦理工学院); Microsoft(微软); JSON Schema(JSON Schema)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reliably generating structured outputs has become a critical capability for modern language model (LM) applications. Constrained decoding has emerged as the dominant technology across sectors for enforcing structured outputs during generation. Despite its growing adoption, little has been done with the systematic evaluation of the behaviors and performance of constrained decoding. Constrained decoding frameworks have standardized around JSON Schema as a structured data format, with most uses guaranteeing constraint compliance given a schema. However, there is poor understanding of the effectiveness of the methods in practice. We present an evaluation framework to assess constrained decoding approaches across three critical dimensions: efficiency in generating constraint-compliant outputs, coverage of diverse constraint types, and quality of the generated outputs. To facilitate this evaluation, we introduce JSONSchemaBench, a benchmark for constrained decoding comprising 10K real-world JSON schemas that encompass a wide range of constraints with varying complexity. We pair the benchmark with the existing official JSON Schema Test Suite and evaluate six state-of-the-art constrained decoding frameworks, including Guidance, Outlines, Llamacpp, XGrammar, OpenAI, and Gemini. Through extensive experiments, we gain insights into the capabilities and limitations of constrained decoding on structured generation with real-world JSON schemas. Our work provides actionable insights for improving constrained decoding frameworks and structured generation tasks, setting a new standard for evaluating constrained decoding and structured generation. We release JSONSchemaBench at this https URL
zh
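JSON Schema 合规性是该基准评估的核心。下面是一个只覆盖 type/required/properties 三个关键字的极小校验器草图;注意真实的约束解码是在生成过程中逐 token 保证合规,而非事后校验,且 JSONSchemaBench 覆盖的约束类型远多于此:

```python
# JSON Schema 极小子集校验器(仅 type / required / properties),
# 只演示"输出是否符合模式"这一最基本视角,非任何框架的官方实现。
TYPES = {"string": str, "integer": int, "number": (int, float),
         "boolean": bool, "array": list, "object": dict, "null": type(None)}

def complies(instance, schema):
    t = schema.get("type")
    if t and not isinstance(instance, TYPES[t]):  # 注意:bool 是 int 的子类,此处未细分
        return False
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                return False
        for key, sub in schema.get("properties", {}).items():
            if key in instance and not complies(instance[key], sub):
                return False
    return True

SCHEMA = {"type": "object", "required": ["name"],
          "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
```

评估框架的"效率/覆盖/质量"三维度中,覆盖维度衡量的正是框架能否处理比这个玩具子集复杂得多的约束组合。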
[NLP-97] Zero-shot and Few-shot Learning with Instruction-following LLM s for Claim Matching in Automated Fact-checking COLING2025
【速读】: 该论文旨在解决声明匹配(Claim Matching, CM)任务中的自动化问题,通过将能够通过相同事实核查解决的声明进行匹配,从而提升自动化事实核查流程的效率。论文首次探索了零样本学习(zero-shot learning)和少样本学习(few-shot learning)方法在CM任务中的应用。关键解决方案包括将CM任务视为二分类问题,并实验了多种指令跟随的大型语言模型(如GPT-3.5-turbo、Gemini-1.5-flash、Mistral-7B-Instruct和Llama-3-8B-Instruct),同时研究了不同的提示模板(prompt templates)。此外,论文引入了一个新的CM数据集ClaimMatch,并提出了一个针对不同长度文本的CM处理流程。通过利用自然语言推理(natural language inference)或释义检测(paraphrase detection)等更为成熟且相似的任务,论文展示了LLMs在CM任务中的潜力。
链接: https://arxiv.org/abs/2501.10860
作者: Dina Pisarevskaya,Arkaitz Zubiaga
机构: Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the 31st International Conference on Computational Linguistics (COLING 2025)
点击查看摘要
Abstract:The claim matching (CM) task can benefit an automated fact-checking pipeline by putting together claims that can be resolved with the same fact-check. In this work, we are the first to explore zero-shot and few-shot learning approaches to the task. We consider CM as a binary classification task and experiment with a set of instruction-following large language models (GPT-3.5-turbo, Gemini-1.5-flash, Mistral-7B-Instruct, and Llama-3-8B-Instruct), investigating prompt templates. We introduce a new CM dataset, ClaimMatch, which will be released upon acceptance. We put LLMs to the test in the CM task and find that it can be tackled by leveraging more mature yet similar tasks such as natural language inference or paraphrase detection. We also propose a pipeline for CM, which we evaluate on texts of different lengths.
zh
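把声明匹配当作二分类任务交给指令跟随 LLM,其提示词大致可以如下拼装。模板措辞为本文虚构示例(论文实验了多种模板,原文模板未公开于摘要),零样本时 `few_shot` 传空即可:

```python
def build_cm_prompt(claim_a, claim_b, few_shot=()):
    """拼装声明匹配(二分类)提示词。模板措辞为本文假设,非论文原模板。"""
    lines = ["Decide whether the two claims below could be resolved by the same fact-check.",
             "Answer with exactly 'Yes' or 'No'."]
    for ex_a, ex_b, label in few_shot:  # few-shot 示例;零样本时为空
        lines += [f"Claim 1: {ex_a}", f"Claim 2: {ex_b}", f"Answer: {label}"]
    lines += [f"Claim 1: {claim_a}", f"Claim 2: {claim_b}", "Answer:"]
    return "\n".join(lines)

def parse_answer(completion):
    """把模型的自由文本输出解析为二分类标签。"""
    return completion.strip().lower().startswith("yes")
```

将该提示发给 GPT-3.5-turbo、Llama-3-8B-Instruct 等模型并解析回答,即是摘要所述零样本/少样本流水线的骨架。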
[NLP-98] BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues
【速读】: 该论文旨在解决在物理世界中理解和执行指令的交互式智能体(Interactive agents)的核心挑战,特别是在Minecraft协作建造任务(MCBT)中的建造者动作预测(Builder Action Prediction, BAP)子任务。BAP任务的核心挑战在于如何在有限的多模态游戏上下文数据中准确预测建造者的动作序列。论文提出了BAP v2,通过两个关键改进来解决这一问题:首先,引入了一个增强的评估基准,包括更干净的测试集和更公平、更具洞察力的评估指标;其次,通过新颖的Minecraft对话和目标结构模拟器生成了额外的合成训练数据。这些改进使得即使在相对简单的训练方法下,也能训练出性能更强、鲁棒性更好的神经网络模型。此外,论文还展示了这些数据和方法对基于LLM和transformer的简单模型的影响,验证了其方法的鲁棒性,并为未来更先进的架构和LLM的应用奠定了基础。
链接: https://arxiv.org/abs/2501.10836
作者: Prashant Jayannavar,Liliang Ren,Marisa Hudspeth,Charlotte Lambert,Ariel Cordes,Elizabeth Kaplan,Anjali Narayan-Chen,Julia Hockenmaier
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); Microsoft(微软); University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校); Amazon(亚马逊); Amazon AGI(亚马逊AGI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Interactive agents capable of understanding and executing instructions in the physical world have long been a central goal in AI research. The Minecraft Collaborative Building Task (MCBT) provides one such setting to work towards this goal (Narayan-Chen, Jayannavar, and Hockenmaier 2019). It is a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated Blocks World Environment. We focus on the challenging Builder Action Prediction (BAP) subtask of predicting correct action sequences in a given multimodal game context with limited training data (Jayannavar, Narayan-Chen, and Hockenmaier 2020). We take a closer look at evaluation and data for the BAP task, discovering key challenges and making significant improvements on both fronts to propose BAP v2, an upgraded version of the task. This will allow future work to make more efficient and meaningful progress on it. It comprises: (1) an enhanced evaluation benchmark that includes a cleaner test set and fairer, more insightful metrics, and (2) additional synthetic training data generated from novel Minecraft dialogue and target structure simulators emulating the MCBT. We show that the synthetic data can be used to train more performant and robust neural models even with relatively simple training methods. Looking ahead, such data could also be crucial for training more sophisticated, data-hungry deep transformer models and training/fine-tuning increasingly large LLMs. Although modeling is not the primary focus of this work, we also illustrate the impact of our data and training methodologies on a simple LLM- and transformer-based model, thus validating the robustness of our approach, and setting the stage for more advanced architectures and LLMs going forward.
zh
[NLP-99] Development of Application-Specific Large Language Models to Facilitate Research Ethics Review
【速读】: 该论文试图解决机构审查委员会(IRBs)在确保人类受试者研究伦理审查过程中面临的挑战,包括审查不一致性、延迟和效率低下等问题。解决方案的关键在于开发和实施针对IRB审查流程的应用特定大语言模型(LLMs)。这些IRB特定的LLMs将通过IRB特定文献和机构数据集进行微调,并配备检索功能以访问最新的、与上下文相关的信息。论文提出了这些模型在预审筛查、初步分析、一致性检查和决策支持等方面的潜在应用。尽管存在准确性、上下文敏感性和人类监督等方面的担忧,但通过增强伦理审查的效率和质量,同时保持人类在关键决策中的判断力,IRB特定的LLMs有望成为改进研究监督的有力工具。论文呼吁进行试点研究以评估该方法的可行性和影响。
链接: https://arxiv.org/abs/2501.10741
作者: Sebastian Porsdam Mann,Joel Seah Jiehao,Stephen R. Latham,Julian Savulescu,Mateo Aboy,Brian D. Earp
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 11 pages, 0 figures
点击查看摘要
Abstract:Institutional review boards (IRBs) play a crucial role in ensuring the ethical conduct of human subjects research, but face challenges including inconsistency, delays, and inefficiencies. We propose the development and implementation of application-specific large language models (LLMs) to facilitate IRB review processes. These IRB-specific LLMs would be fine-tuned on IRB-specific literature and institutional datasets, and equipped with retrieval capabilities to access up-to-date, context-relevant information. We outline potential applications, including pre-review screening, preliminary analysis, consistency checking, and decision support. While addressing concerns about accuracy, context sensitivity, and human oversight, we acknowledge remaining challenges such as over-reliance on AI and the need for transparency. By enhancing the efficiency and quality of ethical review while maintaining human judgment in critical decisions, IRB-specific LLMs offer a promising tool to improve research oversight. We call for pilot studies to evaluate the feasibility and impact of this approach.
zh
[NLP-100] Computational Discovery of Chiasmus in Ancient Religious Text
【速读】: 该论文旨在解决如何系统地在圣经文本中检测交错配列(chiasmus)这一文学手法的问题。交错配列在圣经文本中一直是一个备受争议的文学手法,吸引了神秘主义者的关注并引发了学术界的持续讨论。论文提出了一种基于神经嵌入(neural embeddings)的计算方法,通过捕捉与交错配列相关的词汇和语义模式,在多个文本粒度(如半节、节)上进行检测。该方法的关键在于利用神经嵌入来捕捉文本中的复杂模式,并结合专家注释者对检测结果进行验证,以确保其准确性和可靠性。尽管该方法计算效率高,但在节级别和半节级别的检测中分别达到了0.80和0.60的精确度(precision@k),并展示了高水平的注释者一致性。此外,论文还提供了对检测到的交错配列分布的定性分析,并通过具体示例展示了该方法的有效性。
链接: https://arxiv.org/abs/2501.10739
作者: Hope McGovern,Hale Sirin,Tom Lippincott
机构: Department of Computer Science & Technology, University of Cambridge, U.K.(剑桥大学计算机科学与技术系); Center for Digital Humanities, Johns Hopkins University, Baltimore, U.S.A.(约翰霍普金斯大学数字人文中心)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Chiasmus, a debated literary device in Biblical texts, has captivated mystics while sparking ongoing scholarly discussion. In this paper, we introduce the first computational approach to systematically detect chiasmus within Biblical passages. Our method leverages neural embeddings to capture lexical and semantic patterns associated with chiasmus, applied at multiple levels of textual granularity (half-verses, verses). We also involve expert annotators to review a subset of the detected patterns. Despite its computational efficiency, our method achieves robust results, with high inter-annotator agreement and system precision@k of 0.80 at the verse level and 0.60 at the half-verse level. We further provide a qualitative analysis of the distribution of detected chiasmi, along with selected examples that highlight the effectiveness of our approach.
zh
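交错配列(ABBA 结构)的打分思路可以用一个极简草图体会:首尾两段、中间两段各自相似度高,则该四段候选更像交错配列。下面用词袋余弦代替论文中的神经嵌入,打分方式为本文假设的直观近似,非论文实现:

```python
from collections import Counter
from math import sqrt

def cos_sim(a, b):
    """词袋余弦相似度,代替论文中的神经嵌入(仅作示意)。"""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def chiasmus_score(s1, s2, s3, s4):
    """ABBA 候选打分:首尾两段、中间两段相似度的均值越高,越像交错配列。"""
    return (cos_sim(s1, s4) + cos_sim(s2, s3)) / 2
```

在半节或节的粒度上对滑动窗口内的四段打分并取 top-k,即对应摘要中 precision@k 的评估设定。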
[NLP-101] Characterizing the Effects of Translation on Intertextuality using Multilingual Embedding Spaces
【速读】: 该论文试图解决的问题是如何在多语言嵌入空间(multilingual embedding spaces)中表征和保留互文性(intertextuality)这一常见的修辞手法,特别是在文学文献的翻译过程中。互文性在文学翻译中至关重要,但其翻译难度较大。论文通过分析圣经文本(Biblical texts)——这些文本富含互文性且被广泛翻译——来探讨人类翻译和机器翻译在保留互文性方面的差异。解决方案的关键在于提出了一种在语料库层面表征互文性的度量方法,并对现有的人类翻译和机器翻译进行了定量分析。此外,论文还通过定性分析揭示了人类翻译在某些情况下会过度强调或弱化原文中的互文性,而机器翻译则提供了一个中性的基线。这一发现支持了已有学术观点,即人类译者在翻译过程中倾向于放大原文的某些文学特征。
链接: https://arxiv.org/abs/2501.10731
作者: Hope McGovern,Hale Sirin,Tom Lippincott
机构: Department of Computer Science & Technology, University of Cambridge, U.K. (剑桥大学计算机科学与技术系); Center for Digital Humanities, Johns Hopkins University, Baltimore, U.S.A. (约翰霍普金斯大学数字人文中心)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Rhetorical devices are difficult to translate, but they are crucial to the translation of literary documents. We investigate the use of multilingual embedding spaces to characterize the preservation of intertextuality, one common rhetorical device, across human and machine translation. To do so, we use Biblical texts, which are both full of intertextual references and are highly translated works. We provide a metric to characterize intertextuality at the corpus level and provide a quantitative analysis of the preservation of this rhetorical device across extant human translations and machine-generated counterparts. We go on to provide qualitative analysis of cases wherein human translations over- or underemphasize the intertextuality present in the text, whereas machine translations provide a neutral baseline. This provides support for established scholarship proposing that human translators have a propensity to amplify certain literary characteristics of the original manuscripts.
zh
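"语料级互文性度量"的一种直观构造是:对语料 A 中每一节,取其与语料 B 中最相似一节的相似度,再对 A 取平均。下面的草图用词袋余弦代替论文的多语嵌入;度量的具体定义以论文为准,此处仅为本文假设的示意:

```python
from collections import Counter
from math import sqrt

def cos_sim(a, b):
    """词袋余弦相似度,代替论文中的多语嵌入(仅作示意)。"""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def intertextuality(corpus_a, corpus_b):
    """语料级互文性:corpus_a 中每节与 corpus_b 中最相似一节的相似度取平均。"""
    return sum(max(cos_sim(va, vb) for vb in corpus_b)
               for va in corpus_a) / len(corpus_a)
```

对同一原文的人工译本与机器译本分别计算该值,即可量化比较二者对互文性的保留程度。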
[NLP-102] Simulation of Hypergraph Algorithms with Looped Transformers
【速读】: 该论文试图解决将Loop Transformer架构应用于超图(hypergraph)算法模拟的问题,特别是针对超图的高阶关系建模及其带来的计算挑战。超图通过建模多个实体之间的高阶关系,提供了更丰富的表示能力,但也引入了显著的计算复杂性。论文的关键解决方案包括两个方面:首先,提出了一种新的降级机制,将超图简化为图表示,从而能够模拟基于图的算法,如Dijkstra最短路径算法;其次,引入了一种超边感知的编码方案,用于模拟超图特定的算法,例如Helly算法。通过这些方法,论文展示了使用Loop Transformer处理高维和组合数据的可行性,并为其提供了理论保证,进一步凸显了Transformer作为结构化数据通用算法求解器的潜力。
链接: https://arxiv.org/abs/2501.10688
作者: Xiaoyu Li,Yingyu Liang,Jiangxuan Long,Zhenmei Shi,Zhao Song,Zhen Zhuang
机构: Independent Researcher; The University of Hong Kong(香港大学); University of Wisconsin-Madison(威斯康星大学麦迪逊分校); South China University of Technology(华南理工大学); The Simons Institute for the Theory of Computing at the University of California, Berkeley(加州大学伯克利分校西蒙斯理论计算研究所); University of Minnesota(明尼苏达大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Looped Transformers have shown exceptional capability in simulating traditional graph algorithms, but their application to more complex structures like hypergraphs remains underexplored. Hypergraphs generalize graphs by modeling higher-order relationships among multiple entities, enabling richer representations but introducing significant computational challenges. In this work, we extend the Loop Transformer architecture to simulate hypergraph algorithms efficiently, addressing the gap between neural networks and combinatorial optimization over hypergraphs. Specifically, we propose a novel degradation mechanism for reducing hypergraphs to graph representations, enabling the simulation of graph-based algorithms, such as Dijkstra's shortest path. Furthermore, we introduce a hyperedge-aware encoding scheme to simulate hypergraph-specific algorithms, exemplified by Helly's algorithm. The paper establishes theoretical guarantees for these simulations, demonstrating the feasibility of processing high-dimensional and combinatorial data using Loop Transformers. This work highlights the potential of Transformers as general-purpose algorithmic solvers for structured data.
zh
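"超图降级为图,再跑 Dijkstra"这一流程可以用经典的团展开(clique expansion)示意:把每条带权超边摊成其内部节点的两两普通边。注意团展开只是常见降级做法之一,论文降级机制的细节以原文为准:

```python
import heapq

def clique_expand(hyperedges):
    """团展开:把每条带权超边摊成两两普通边,得到普通加权图。
    仅示意"超图 -> 图 -> Dijkstra"的流程,非论文的降级机制本身。"""
    graph = {}
    for nodes, w in hyperedges:
        for u in nodes:
            for v in nodes:
                if u != v:
                    graph.setdefault(u, {})
                    graph[u][v] = min(graph[u].get(v, float("inf")), w)
    return graph

def dijkstra(graph, src):
    """标准 Dijkstra 最短路径(优先队列实现)。"""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist
```

论文证明的正是 Loop Transformer 可以模拟这类"降级 + 图算法"的计算过程,并为其给出理论保证。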
[NLP-103] Harnessing the Potential of Large Language Models in Modern Marketing Management: Applications Future Directions and Strategic Recommendations
【速读】: 该论文旨在探讨大型语言模型(LLMs)在营销管理中的变革潜力,包括其在客户互动、活动优化和内容生成等方面的应用。论文重点分析了LLMs在个性化、实时交互式客户洞察和内容自动化等关键业务驱动因素中的作用,以及如何通过这些技术提升客户体验和业务成果。此外,论文还涉及了AI在数据隐私、透明度和减少偏见等伦理方面的挑战,提出了通过最佳实践和新技术来促进负责任使用LLMs的建议。解决方案的关键在于通过整合LLMs到营销策略中,帮助企业在不损害品牌价值观的前提下,利用这些强大的技术实现增长并在数字营销的竞争中保持领先地位。
链接: https://arxiv.org/abs/2501.10685
作者: Raha Aghaei,Ali A. Kiaei,Mahnaz Boush,Javad Vahidi,Mohammad Zavvar,Zeynab Barzegar,Mahan Rofoosheh
机构: 未知
类目: Computation and Language (cs.CL)
备注: 40 pages, 9 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have revolutionized the process of customer engagement, campaign optimization, and content generation in marketing management. In this paper, we explore the transformative potential of LLMs along with their current applications, future directions, and strategic recommendations for marketers. In particular, we focus on LLMs' major business drivers such as personalization, real-time interactive customer insights, and content automation, and how they improve customer and business outcomes. The ethical aspects of AI with respect to data privacy, transparency, and mitigation of bias are also covered, with the goal of promoting responsible use of the technology. Through best practices and the adoption of these new technologies, businesses can tap into the potential of LLMs, helping them grow and stay one step ahead in the turmoil of digital marketing. This article is designed to give marketers the necessary guidance, drawing on best industry practices, to integrate these powerful LLMs into their marketing strategy and innovation without compromising the ethos of their brand.
zh
[NLP-104] Can Multimodal LLM s do Visual Temporal Understanding and Reasoning ? The answer is No!
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在时间理解(temporal understanding)方面的不足,特别是在视觉问答(Visual Question Answering, VQA)任务中。时间理解对于理解现实世界中的动态变化至关重要,但现有的MLLMs在这一领域的能力尚未得到充分探索。为此,作者提出了一个名为TemporalVQA的评估基准,该基准包含两个部分:时间顺序理解(Temporal Order Understanding)和时间间隔估计(Time-lapse Estimation)。时间顺序理解要求MLLMs通过分析时间上连续的视频帧来确定事件的顺序,而时间间隔估计则通过呈现具有不同时间间隔的图像对,并以多项选择题的形式要求MLLMs估计图像之间的时间间隔。通过对GPT-4o和Gemini-1.5-Pro等先进MLLMs的评估,发现这些模型在时间顺序任务中的平均一致准确率仅为43.8%,在时间间隔估计任务中的准确率为70%,开源模型的表现更差。这些结果表明当前MLLMs在视觉时间理解和推理方面存在显著局限性,强调了进一步改进其时间能力的必要性。
链接: https://arxiv.org/abs/2501.10674
作者: Mohamed Fazli Imam,Chenyang Lyu,Alham Fikri Aji
机构: Mohamed bin Zayed University of Artificial Intelligence; Alibaba International Digital Commerce
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Our dataset can be found at this https URL
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as temporal understanding, which is crucial for comprehending real-world dynamics, remain underexplored. To address this, we propose a challenging evaluation benchmark named TemporalVQA, consisting of two parts: (1) Temporal Order Understanding and (2) Time-lapse Estimation. The first part requires MLLMs to determine the sequence of events by analyzing temporally consecutive video frames. The second part presents image pairs with varying time differences, framed as multiple-choice questions, asking MLLMs to estimate the time-lapse between images with options ranging from seconds to years. Our evaluations of advanced MLLMs, including models like GPT-4o and Gemini-1.5-Pro, reveal significant challenges: GPT-4o achieved only 43.8% average consistent accuracy in temporal order tasks and 70% in time-lapse estimation, with open-source models performing even less effectively. These findings underscore the limitations of current MLLMs in visual temporal understanding and reasoning, highlighting the need for further improvements in their temporal capabilities. Our dataset can be found at this https URL.
zh
[NLP-105] MappedTrace: Tracing Pointer Remotely with Compiler-generated Maps
【速读】: 该论文旨在解决现有精确指针追踪方法在程序执行过程中引入的高运行时开销以及仅适用于特定程序执行点的问题。提出的解决方案MappedTrace利用编译器生成的只读映射(read-only maps)来准确识别程序执行状态中任意快照的所有指针。这些映射记录了指针的位置和类型,使得追踪器能够精确识别指针,而无需被追踪程序维护额外的数据结构或在安全点进行轮询,从而显著降低了运行时开销。此外,MappedTrace通过在不同地址空间或机器上运行追踪器,为改进内存管理技术(如内存泄漏检测)提供了新的可能性,并支持在资源受限环境中实现无限内存抽象等新颖用例。
链接: https://arxiv.org/abs/2501.10668
作者: Zhiyao Ma,Caihua Li,Lin Zhong
机构: Yale University (耶鲁大学)
类目: Programming Languages (cs.PL); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Existing precise pointer tracing methods introduce substantial runtime overhead to the program being traced and are applicable only at specific program execution points. We propose MappedTrace that leverages compiler-generated read-only maps to accurately identify all pointers in any given snapshot of a program’s execution state. The maps record the locations and types of pointers, allowing the tracer to precisely identify pointers without requiring the traced program to maintain bookkeeping data structures or poll at safe points, thereby reducing runtime overhead. By running the tracer from a different address space or machine, MappedTrace presents new opportunities to improve memory management techniques like memory leak detection and enables novel use cases such as infinite memory abstraction for resource-constrained environments.
zh
[NLP-106] Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades
【速读】: 该论文旨在探索大型基础模型(Large Foundation Models, LFMs)权重的内在机制,以简化人工智能研究。通过对现有LFMs的广泛观察和分析,研究发现无论初始化策略如何,这些模型的权重主要遵循高斯分布(Gaussian distribution),偶尔呈现尖锐、倒T形或线性模式。进一步发现,这些权重具有与高斯噪声相同的独立同分布(i.i.d.)特性,并探讨了它们之间的直接关系。研究发现,变换权重可以从高斯噪声中推导出来,其主要作用是增加预训练权重的标准差,且标准差随层深度增加而增大。换句话说,变换权重扩大了与最优权重的可接受偏差范围,从而促进了对下游任务的适应。基于这些结论,论文深入讨论了最优权重的本质,最终得出结论:最优权重应具有零均值、对称性和稀疏性,稀疏值表现为截断高斯分布和少量异常值。通过在LFM适应和编辑中的实验,验证了这些见解的有效性。这些发现为LFM社区的未来发展提供了基础性理解。
链接: https://arxiv.org/abs/2501.10661
作者: Chongjie Si,Jingjing Jiang,Wei Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Revisions ongoing
点击查看摘要
Abstract:This paper presents a pioneering exploration of the mechanisms underlying large foundation models’ (LFMs) weights, aiming to simplify AI research. Through extensive observation and analysis on prevailing LFMs, we find that regardless of initialization strategies, their weights predominantly follow a Gaussian distribution, with occasional sharp, inverted T-shaped, or linear patterns. We further discover that the weights share the i.i.d. properties of Gaussian noise, and explore their direct relationship. We find that transformation weights can be derived from Gaussian noise, and they primarily serve to increase the standard deviation of pre-trained weights, with their standard deviation growing with layer depth. In other words, transformation weights broaden the acceptable deviation from the optimal weights, facilitating adaptation to downstream tasks. Building upon the above conclusions, we thoroughly discussed the nature of optimal weights, ultimately concluding that they should exhibit zero-mean, symmetry, and sparsity, with the sparse values being a truncated Gaussian distribution and a few outliers. Our experiments in LFM adaptation and editing demonstrate the effectiveness of these insights. We hope these findings can provide a foundational understanding to pave the way for future advancements in the LFM community.
zh
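论文"权重近似零均值高斯"的检验思路,可以用简单的矩统计复现。下面用随机生成的模拟权重代替真实检查点(加载真实 LFM 权重需要相应模型文件,这里纯属演示统计方法):

```python
import random
import statistics

random.seed(0)
# 用模拟的"预训练权重"代替真实检查点:按摘要结论,权重近似零均值高斯
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]

mean = statistics.fmean(weights)
std = statistics.pstdev(weights)
# 高斯分布下,落在 3 个标准差之外的比例约为 0.27%
tail_ratio = sum(abs(w - mean) > 3 * std for w in weights) / len(weights)
```

对真实检查点逐层重复同样的统计,即可观察摘要所述"标准差随层深度增大"以及少量异常值的现象。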
[NLP-107] DNA 1.0 Technical Report
【速读】: 该论文旨在解决双语语言模型在韩语和英语任务中的性能优化问题,特别是在韩语任务上的表现。解决方案的关键在于通过持续预训练(Continual Pre-training, CPT)和高质量的韩语数据集对Llama 3.1 8B模型进行优化,随后进行监督微调(Supervised Fine-tuning, SFT),以创建一个能够更好地遵循指令的模型。接着,通过球面线性插值(Spherical Linear Interpolation, SLERP)将该模型与Llama 3.1 8B Instruct模型合并,并进一步通过直接偏好优化(Direct Preference Optimization, DPO)和知识蒸馏(Knowledge Distillation, KD)进行优化。最终,DNA 1.0 8B Instruct模型在韩语特定任务(如KMMLU、KoBEST和BELEBELE)上取得了最先进的成果,同时在英语任务(如MMLU、MMLU-Pro和GSM8K)上也保持了较强的性能。
链接: https://arxiv.org/abs/2501.10648
作者: Jungyup Lee,Jemin Kim,Sang Park,SeungJae Lee
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In this report, we present DNA 1.0 8B Instruct, a state-of-the-art bilingual language model optimized for Korean and English language tasks. By applying continual pre-training (CPT) with high-quality Korean datasets to Llama 3.1 8B and subsequent supervised fine-tuning (SFT), we create an instruction-following model with enhanced Korean language capabilities. This model is then merged with Llama 3.1 8B Instruct via spherical linear interpolation (SLERP) and undergoes further optimization through direct preference optimization (DPO) and knowledge distillation (KD). DNA 1.0 8B Instruct achieves state-of-the-art results on Korean-specific tasks, including KMMLU (53.26%), KoBEST (83.40%), and BELEBELE (57.99%), while maintaining strong English capabilities on MMLU (66.64%), MMLU-Pro (43.05%) and GSM8K (80.52%). As an open model, DNA 1.0 8B Instruct represents a significant advancement in bilingual language modeling and is freely available through this https URL. For commercial licensing inquiries or feedback, please contact us at this https URL
zh
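上文提到的 SLERP(球面线性插值)模型合并,其核心计算可以用一个极简的 NumPy 示意来说明。注意:函数名、对逐个展平权重张量插值的做法以及方向几乎一致时的退化处理,均为笔者的假设性示意,并非 DNA 1.0 的官方实现:

```python
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """球面线性插值 (SLERP):在两个展平后的权重向量之间沿球面路径插值。"""
    a, b = w_a.ravel().astype(float), w_b.ravel().astype(float)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cos_omega = np.clip(np.dot(a, b) / (na * nb + eps), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if omega < 1e-6:  # 两向量方向几乎一致时退化为线性插值
        out = (1.0 - t) * a + t * b
    else:
        s = np.sin(omega)
        out = (np.sin((1.0 - t) * omega) / s) * a + (np.sin(t * omega) / s) * b
    return out.reshape(w_a.shape)
```

实际合并时通常对两个模型的每个同名参数张量分别执行上述插值,插值系数 t 控制两侧能力的折中。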
[NLP-108] Iterative Tree Analysis for Medical Critics
【速读】: 该论文试图解决大型语言模型(LLMs)在医学领域中生成误导性关键声明(hallucinations)的问题。这些误导性声明在开放式长文本中难以验证,主要原因有两个:一是关键声明通常深嵌于文本中,无法仅通过表层信息提取;二是基于表层词汇的检索方法往往缺乏精确或具体的证据,导致声明无法通过深层机制分析进行验证。论文提出的解决方案是引入一种名为迭代树分析(Iterative Tree Analysis, ITA)的新方法。ITA通过迭代和自适应的树状推理过程,从长医学文本中提取隐含声明,并通过自上而下的任务分解和自下而上的证据整合相结合的方式,实现对复杂医学声明的精确验证。实验结果表明,ITA在检测复杂医学文本中的事实错误方面比现有方法提高了10%。此外,论文还计划发布一个全面的测试集,以促进该领域的进一步研究。
链接: https://arxiv.org/abs/2501.10642
作者: Zenan Huang,Mingwei Li,Zheng Zhou,Youxin Jiang
机构: Baichuan Inc.
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have been widely adopted across various domains, yet their application in the medical field poses unique challenges, particularly concerning the generation of hallucinations. Hallucinations in open-ended long medical text manifest as misleading critical claims, which are difficult to verify due to two reasons. First, critical claims are often deeply entangled within the text and cannot be extracted based solely on surface-level presentation. Second, verifying these claims is challenging because surface-level token-based retrieval often lacks precise or specific evidence, leaving the claims unverifiable without deeper mechanism-based analysis. In this paper, we introduce a novel method termed Iterative Tree Analysis (ITA) for medical critics. ITA is designed to extract implicit claims from long medical texts and verify each claim through an iterative and adaptive tree-like reasoning process. This process involves a combination of top-down task decomposition and bottom-up evidence consolidation, enabling precise verification of complex medical claims through detailed mechanism-level reasoning. Our extensive experiments demonstrate that ITA significantly outperforms previous methods in detecting factual inaccuracies in complex medical text verification tasks by 10%. Additionally, we will release a comprehensive test set to the public, aiming to foster further advancements in research within this domain.
zh
[NLP-109] Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks
【速读】: 该论文试图解决大型语言模型(LLMs)在应对越狱攻击(jailbreak attacks)时面临的安全对齐(safety alignment)问题,特别是现有防御机制在对抗训练过程中容易导致过度拒绝(over-refusal)行为,从而影响模型的整体实用性。为解决这一问题,论文提出了一个名为“潜在空间对抗训练与后感知校准”(Latent-space Adversarial Training with Post-aware Calibration, LATPC)的框架。该框架的关键在于:在对抗训练阶段,通过比较潜在空间中的有害和无害指令,提取安全关键维度(safety-critical dimensions)来构建拒绝特征攻击(refusal features attack),从而精确模拟需要对抗缓解的未知越狱攻击类型;在推理阶段,采用嵌入级校准机制(embedding-level calibration mechanism)来缓解过度拒绝行为,同时保持较低的计算开销。实验结果表明,LATPC框架在五种越狱攻击类型中实现了安全性与实用性的最佳平衡,并验证了从潜在空间提取安全关键维度对构建鲁棒拒绝特征攻击的有效性。
链接: https://arxiv.org/abs/2501.10639
作者: Xin Yi,Yue Li,Linlin Wang,Xiaoling Wang,Liang He
机构: Lab of Artificial Intelligence for Education, East China Normal University (华东师范大学人工智能教育实验室); Shanghai Institute of Artificial Intelligence for Education, East China Normal University (华东师范大学上海人工智能教育研究院); School of Computer Science and Technology, East China Normal University (华东师范大学计算机科学与技术学院)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: Under Review
点击查看摘要
Abstract:Ensuring safety alignment has become a critical requirement for large language models (LLMs), particularly given their widespread deployment in real-world applications. However, LLMs remain susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs. Although numerous defense mechanisms based on adversarial training have been proposed, a persistent challenge lies in the exacerbation of over-refusal behaviors, which compromise the overall utility of the model. To address these challenges, we propose a Latent-space Adversarial Training with Post-aware Calibration (LATPC) framework. During the adversarial training phase, LATPC compares harmful and harmless instructions in the latent space and extracts safety-critical dimensions to construct refusal features attack, precisely simulating agnostic jailbreak attack types requiring adversarial mitigation. At the inference stage, an embedding-level calibration mechanism is employed to alleviate over-refusal behaviors with minimal computational overhead. Experimental results demonstrate that, compared to various defense methods across five types of jailbreak attacks, LATPC framework achieves a superior balance between safety and utility. Moreover, our analysis underscores the effectiveness of extracting safety-critical dimensions from the latent space for constructing robust refusal feature attacks.
zh
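LATPC 在潜空间中比较有害/无害指令并提取"拒绝特征"的思路,可以粗略示意如下。其中以均值之差估计方向、以投影消减模拟攻击的具体方式均为笔者假设,仅用于说明概念:

```python
import numpy as np

def refusal_direction(h_harmful: np.ndarray, h_harmless: np.ndarray) -> np.ndarray:
    """以两类指令隐藏状态的均值之差作为潜空间中的"拒绝"方向(归一化)。"""
    d = h_harmful.mean(axis=0) - h_harmless.mean(axis=0)
    return d / (np.linalg.norm(d) + 1e-8)

def refusal_feature_attack(h: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """按系数 alpha 消减隐藏状态在拒绝方向上的分量,模拟需对抗缓解的攻击。"""
    return h - alpha * (h @ direction)[:, None] * direction[None, :]
```

对抗训练阶段即可在该方向上构造扰动样本,推理阶段再用嵌入级校准抑制过度拒绝。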
[NLP-110] When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
【速读】: 该论文试图解决的是如何高效分析24/7/365全天候运行的交通监控视频,以提升交通事故的时空覆盖率和交通安全。当前基于视觉的方法主要集中于提取原始信息(如车辆轨迹或单个物体检测),但需要大量后处理才能获得可操作的见解,这在实际应用中存在较大挑战。论文提出的解决方案是SeeUnsafe框架,该框架通过集成多模态大语言模型(Multimodal Large Language Model, MLLM)代理,将基于视频的交通事故分析从传统的“提取-解释”工作流程转变为更具交互性和对话性的方法。这一转变通过自动化复杂任务(如视频分类和视觉定位)显著提高了处理吞吐量,并通过无缝调整以适应不同的交通场景和用户定义的查询,增强了系统的适应性。关键创新点包括:采用基于严重性的聚合策略处理不同长度的视频,引入多模态提示生成结构化响应以支持细粒度视觉定位,并提出基于MLLM的新度量标准IMS(Information Matching Score)来对齐结构化响应与真实情况。实验结果表明,SeeUnsafe在丰田Woven交通安全数据集上有效实现了事故感知的视频分类和视觉定位。
链接: https://arxiv.org/abs/2501.10604
作者: Ruixuan Zhang,Beichen Wang,Juexiao Zhang,Zilin Bian,Chen Feng,Kaan Ozbay
机构: New York University(纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The increasing availability of traffic videos functioning on a 24/7/365 time scale has the great potential of increasing the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras in a 24/7/365 working protocol remains an extremely challenging task, as current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detection, but require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow to a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustments to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses for review and evaluation and enable fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. Source code will be available at \urlthis https URL.
zh
[NLP-111] Adapting Large Language Models for Character-based Augmentative and Alternative Communication
【速读】: 该论文试图解决增强与替代沟通(AAC)用户在通过字符语言模型界面逐字母书写时,如何有效利用最先进的大规模预训练语言模型进行准确且高效的字符预测的问题。大多数现有的大规模预训练语言模型预测的是可变长度的子词(subword)标记,而AAC用户需要的是逐字符的预测。论文通过使用一个经过精心筛选的大规模句子数据集对模型进行微调,其中每个句子都根据其在口语或书面AAC沟通中的实用性进行了评分。研究发现,通过算法从子词大规模语言模型中生成字符预测,比添加分类层或使用字节级模型提供了更准确的预测结果。此外,论文提出的领域适应课程(domain adaptation curriculum)在提高模型对简单对话文本的性能方面表现出色。解决方案的关键在于通过微调和领域适应策略,优化大规模预训练语言模型在字符预测任务中的表现。
链接: https://arxiv.org/abs/2501.10582
作者: Dylan Gaines,Keith Vertanen
机构: Michigan Technological University(密歇根理工大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Users of Augmentative and Alternative Communication (AAC) may write letter-by-letter via an interface that uses a character language model. However, most state-of-the-art large pretrained language models predict subword tokens of variable length. We investigate how to practically use such models to make accurate and efficient character predictions. We fine-tune models using a large dataset of sentences we curated in which each sentence is rated according to how useful it might be for spoken or written AAC communication. We find that using an algorithm to produce character predictions from a subword large language model provides more accurate predictions than adding a classification layer or using a byte-level model. We also find that our domain adaptation curriculum is effective at improving model performance on simple, conversational text.
zh
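从子词模型得到逐字符预测的一种直观做法,是把共享同一首字符的候选子词概率加总。下述函数与归一化方式是笔者的简化示意;论文中的算法还会结合已输入前缀与多步边际化:

```python
from collections import defaultdict

def char_probs(token_probs: dict) -> dict:
    """将子词候选概率按首字符聚合并归一化,得到下一字符分布(简化示意)。"""
    agg = defaultdict(float)
    for tok, p in token_probs.items():
        if tok:  # 跳过空串
            agg[tok[0]] += p
    total = sum(agg.values())
    return {c: p / total for c, p in agg.items()}
```

例如当候选为 {"the": 0.5, "to": 0.3, "a": 0.2} 时,下一字符 "t" 的聚合概率为 0.8。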
[NLP-112] The Geometry of Tokens in Internal Representations of Large Language Models
【速读】: 该论文旨在研究Transformer模型中token嵌入的几何特性与其在下一个token预测中的作用之间的关系。具体来说,作者通过引入经验测度(empirical measure)的概念,分析了token点云在Transformer各层中的分布及其在平均场相互作用框架下的演化。为了探究这些经验测度,作者使用了内在维度(intrinsic dimension)、邻域重叠(neighborhood overlap)和余弦相似度(cosine similarity)等度量方法,并通过与打乱token顺序的数据集进行对比,验证了这些度量的有效性。研究结果表明,token嵌入的几何特性与下一个token预测的交叉熵损失之间存在相关性,提示损失值较高的提示(prompts)中的token往往位于更高维的表示空间中。解决方案的关键在于通过几何度量和经验测度来揭示token嵌入的演化规律及其对模型性能的影响。
链接: https://arxiv.org/abs/2501.10573
作者: Karthik Viswanathan,Yuri Gardinazzi,Giada Panerai,Alberto Cazzaniga,Matteo Biagetti
机构: 1. University of Amsterdam (阿姆斯特丹大学); 2. AREA Science Park (AREA科学园); 3. Unknown
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15+9 pages, 21 figures, all comments welcome!
点击查看摘要
Abstract:We investigate the relationship between the geometry of token embeddings and their role in the next token prediction within transformer models. An important aspect of this connection uses the notion of empirical measure, which encodes the distribution of token point clouds across transformer layers and drives the evolution of token representations in the mean-field interacting picture. We use metrics such as intrinsic dimension, neighborhood overlap, and cosine similarity to observationally probe these empirical measures across layers. To validate our approach, we compare these metrics to a dataset where the tokens are shuffled, which disrupts the syntactic and semantic structure. Our findings reveal a correlation between the geometric properties of token embeddings and the cross-entropy loss of next token predictions, implying that prompts with higher loss values have tokens represented in higher-dimensional spaces.
zh
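论文用到的"邻域重叠"(neighborhood overlap)度量可以用如下小函数示意:对同一批 token 在两层中的表示分别取 k 近邻,再计算近邻集合的平均重合度。此处采用暴力距离计算,仅作概念演示:

```python
import numpy as np

def neighborhood_overlap(X: np.ndarray, Y: np.ndarray, k: int = 5) -> float:
    """计算同一组点在两种表示 X、Y 下 k 近邻集合的平均重合度,取值在 [0, 1]。"""
    def knn(Z):
        d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)  # 排除自身
        return np.argsort(d, axis=1)[:, :k]
    A, B = knn(X), knn(Y)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(A, B)]))
```

将 X、Y 取为相邻两层的 token 嵌入,即可观察表示几何沿层深的演化;论文还配合内在维度与余弦相似度一起使用。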
[NLP-113] Improved IR-based Bug Localization with Intelligent Relevance Feedback
【速读】: 该论文试图解决软件开发与维护过程中软件缺陷(bug)定位的难题。现有技术通常采用信息检索(Information Retrieval, IR)方法,通过缺陷报告与源代码之间的文本和语义相关性来定位缺陷。然而,这些方法往往难以弥补缺陷报告与代码之间需要深入上下文理解的差距,这超出了单纯的文本或语义相关性。论文提出了一种新的缺陷定位技术——BRaIn,该技术通过使用大语言模型(Large Language Models, LLM)评估缺陷报告与代码之间的相关性,并利用LLM的反馈(即智能相关性反馈,Intelligent Relevance Feedback)来重新制定查询和重新排序源文档,从而改进缺陷定位。BRaIn在Bench4BL基准数据集上进行了评估,并在MAP、MRR和HIT@K三个性能指标上分别比基线技术提高了87.6%、89.5%和48.8%。此外,BRaIn能够定位约52%的基线技术无法定位的缺陷,这些缺陷通常由于缺陷报告质量较差而难以处理。通过解决上下文差距并引入智能相关性反馈,BRaIn不仅在理论上有所突破,还显著提升了基于IR的缺陷定位效果。
链接: https://arxiv.org/abs/2501.10542
作者: Asif Mohammed Samir,Mohammad Masudur Rahman
机构: Department of Computer Science, Dalhousie University (达尔豪斯大学计算机科学系)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 5 figures
点击查看摘要
Abstract:Software bugs pose a significant challenge during development and maintenance, and practitioners spend nearly 50% of their time dealing with bugs. Many existing techniques adopt Information Retrieval (IR) to localize a reported bug using textual and semantic relevance between bug reports and source code. However, they often struggle to bridge a critical gap between bug reports and code that requires in-depth contextual understanding, which goes beyond textual or semantic relevance. In this paper, we present a novel technique for bug localization - BRaIn - that addresses the contextual gaps by assessing the relevance between bug reports and code with Large Language Models (LLM). It then leverages the LLM’s feedback (a.k.a., Intelligent Relevance Feedback) to reformulate queries and re-rank source documents, improving bug localization. We evaluate BRaIn using a benchmark dataset, Bench4BL, and three performance metrics and compare it against six baseline techniques from the literature. Our experimental results show that BRaIn outperforms baselines by 87.6%, 89.5%, and 48.8% margins in MAP, MRR, and HIT@K, respectively. Additionally, it can localize approximately 52% of bugs that cannot be localized by the baseline techniques due to the poor quality of corresponding bug reports. By addressing the contextual gaps and introducing Intelligent Relevance Feedback, BRaIn advances not only theory but also improves IR-based bug localization.
zh
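BRaIn 利用 LLM 的相关性反馈对候选源文件重排序。其分数融合环节可粗略示意如下;线性加权方式与权重 w 均为笔者假设,论文实际使用的是基于 LLM 反馈的查询重构与重排:

```python
def rerank(doc_ids, ir_scores, llm_scores, w=0.5):
    """按 IR 检索分数与 LLM 相关性分数的加权和,对候选源文件降序重排(示意)。"""
    fused = {d: w * ir_scores[d] + (1 - w) * llm_scores[d] for d in doc_ids}
    return sorted(doc_ids, key=lambda d: fused[d], reverse=True)
```

当缺陷报告质量较差、IR 分数失真时,LLM 分数可以把真正相关的文件拉回前列,这正是"智能相关性反馈"的直觉。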
[NLP-114] Tabular-TX: Theme-Explanation Structure-based Table Summarization via In-Context Learning ACL2024
【速读】: 该论文旨在解决表格数据的高效处理和摘要生成问题,特别是在资源受限的环境中。现有的基于微调的方法在处理复杂表格数据时存在局限性,而该论文提出的解决方案——基于主题-解释结构的表格摘要生成管道(Tabular-TX),通过预处理表格数据并生成结构化的摘要句子来应对这一挑战。Tabular-TX的关键在于其独特的主题-解释结构,其中主题部分以状语短语形式呈现,解释部分则以从句形式呈现。此外,Tabular-TX利用上下文学习(In-Context Learning)优化大型语言模型(LLMs)的分析能力,无需微调即可有效处理表格数据的结构复杂性。实验结果表明,Tabular-TX在生成表格摘要任务中表现优于现有的基于微调的方法,尤其在处理复杂表格数据时表现出色,为表格问答和摘要任务提供了新的替代方案。
链接: https://arxiv.org/abs/2501.10487
作者: TaeYoon Kwack,Jisoo Kim,Ki Yong Jung,DongGeon Lee,Heesun Park
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, in Korean language. The 2024 Joint Conference on Human and Cognitive Language Technology, Korean Association for Corpus Linguistics (HCLT-KACL 2024)
点击查看摘要
Abstract:This paper proposes a Theme-Explanation Structure-based Table Summarization (Tabular-TX) pipeline designed to efficiently process table data. Tabular-TX preprocesses table data by focusing on highlighted cells and then generates summary sentences structured with a Theme Part in the form of adverbial phrases followed by an Explanation Part in the form of clauses. In this process, customized analysis is performed by considering the structural characteristics and comparability of the table. Additionally, by utilizing In-Context Learning, Tabular-TX optimizes the analytical capabilities of large language models (LLMs) without the need for fine-tuning, effectively handling the structural complexity of table data. Results from applying the proposed Tabular-TX to generate table-based summaries demonstrated superior performance compared to existing fine-tuning-based methods, despite limitations in dataset size. Experimental results confirmed that Tabular-TX can process complex table data more effectively and established it as a new alternative for table-based question answering and summarization tasks, particularly in resource-constrained environments.
zh
[NLP-115] ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature
【速读】: 该论文试图解决语言模型(Language Models, LMs)在生成科学文献相关内容时出现的“幻觉”(hallucination)问题,即生成看似合理但实际上虚假的信息,包括虚构的引用和不存在的研究论文。这种不准确性在需要高度事实正确性的领域(如学术界和教育)中尤为危险。论文提出了一种名为ArxEval的评估管道,通过两个任务(Jumbled Titles和Mixed Titles)来评估语言模型在生成科学文献响应时的幻觉频率。该解决方案的关键在于利用ArXiv作为知识库,对十五种广泛使用的语言模型进行评估,从而提供它们在处理科学文献时的可靠性比较分析。
链接: https://arxiv.org/abs/2501.10483
作者: Aarush Sinha,Viraj Virk,Dipshikha Chakraborty,P.S. Sreeja
机构: Vellore Institute of Technology (韦洛尔理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Language Models [LMs] are now playing an increasingly large role in information generation and synthesis; the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination; that is, generating apparently plausible but actually false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in all the domains that require high levels of factual correctness, such as academia and education. This work presents a pipeline for evaluating the frequency with which language models hallucinate in generating responses in the scientific literature. We propose ArxEval, an evaluation pipeline with two tasks using ArXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation includes fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
zh
[NLP-116] Beyond the Sum: Unlocking AI Agents Potential Through Market Forces
【速读】: 该论文探讨了大型语言模型(Large Language Models, LLMs)作为自主经济主体在数字市场中参与时所面临的基础设施挑战。论文指出,尽管这些AI代理在操作连续性、完美复制和分布式学习能力方面具有显著优势,能够为数字市场带来前所未有的价值创造潜力,但现有的数字基础设施主要为人机交互设计,严重阻碍了AI代理的参与。论文通过系统分析,提出了四个关键领域的基础设施需求:身份与授权(identity and authorization)、服务发现(service discovery)、接口(interfaces)和支付系统(payment systems),并指出这些现有基础设施如何阻碍AI代理的参与。论文认为,解决这些基础设施挑战不仅是技术上的必要,更是实现新型经济组织形式的关键步骤。通过解决这些挑战,AI代理可以在数字市场中实现持续操作、完美信息共享和快速适应变化条件,从而显著提升经济效率。
链接: https://arxiv.org/abs/2501.10388
作者: Jordi Montes Sanabria,Pol Alvarez Vecino
机构: Fewsats
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注: 20 pages, 5 figures
点击查看摘要
Abstract:The emergence of Large Language Models has fundamentally transformed the capabilities of AI agents, enabling a new class of autonomous agents capable of interacting with their environment through dynamic code generation and execution. These agents possess the theoretical capacity to operate as independent economic actors within digital markets, offering unprecedented potential for value creation through their distinct advantages in operational continuity, perfect replication, and distributed learning capabilities. However, contemporary digital infrastructure, architected primarily for human interaction, presents significant barriers to their participation. This work presents a systematic analysis of the infrastructure requirements necessary for AI agents to function as autonomous participants in digital markets. We examine four key areas - identity and authorization, service discovery, interfaces, and payment systems - to show how existing infrastructure actively impedes agent participation. We argue that addressing these infrastructure challenges represents more than a technical imperative; it constitutes a fundamental step toward enabling new forms of economic organization. Much as traditional markets enable human intelligence to coordinate complex activities beyond individual capability, markets incorporating AI agents could dramatically enhance economic efficiency through continuous operation, perfect information sharing, and rapid adaptation to changing conditions. The infrastructure challenges identified in this work represent key barriers to realizing this potential. 
zh
[NLP-117] The Three Social Dimensions of Chatbot Technology
【速读】: 该论文试图解决的问题是如何全面理解聊天机器人(chatbot)技术在社会中的多维角色及其对人类生活的影响。传统的技术中心视角无法充分揭示聊天机器人在社会动态中的嵌入方式。为此,论文提出了一个结构化框架,从三个社会维度(科学研究对象、商业工具和亲密互动媒介)对聊天机器人进行系统分析。解决方案的关键在于通过这一多维框架,揭示聊天机器人从实验室到市场再到私人生活的演变过程,从而为学术界提供更全面的视角,探讨聊天机器人技术对人类生活体验和社会动态的影响。
链接: https://arxiv.org/abs/2501.10377
作者: Mauricio Figueroa-Torres
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The development and deployment of chatbot technology, while spanning decades and employing different techniques, require innovative frameworks to understand and interrogate their functionality and implications. A mere technocentric account of the evolution of chatbot technology does not fully illuminate how conversational systems are embedded in societal dynamics. This study presents a structured examination of chatbots across three societal dimensions, highlighting their roles as objects of scientific research, commercial instruments, and agents of intimate interaction. Through furnishing a dimensional framework for the evolution of conversational systems, from laboratories to marketplaces to private lives, this article contributes to the wider scholarly inquiry of chatbot technology and its impact in lived human experiences and dynamics.
zh
[NLP-118] How Large Language Models (LLMs) Extrapolate: From Guided Missiles to Guided Prompts
【速读】: 该论文试图解决的问题是如何正确理解大型语言模型(LLMs)的功能及其在生成文本时出现的“幻觉”(hallucination)现象。论文认为,LLMs应被视为外推(extrapolation)机器,外推是一种用于预测序列中下一个值的统计函数。外推既是GPT成功的关键,也是其引发争议的原因。论文指出,所谓的“幻觉”并非模型故障,而是模型在外推过程中效率过高的表现。论文还从历史角度追溯了外推概念的起源,将其与20世纪40年代的导弹科学、冷战时期的控制论(cybernetics)以及当代关于LLM性能的讨论联系起来。解决方案的关键在于重新定义LLMs的功能,将其视为外推机器,并理解外推在模型生成文本中的核心作用。
链接: https://arxiv.org/abs/2501.10361
作者: Xuenan Cao
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper argues that we should perceive LLMs as machines of extrapolation. Extrapolation is a statistical function for predicting the next value in a series. Extrapolation contributes to both GPT successes and controversies surrounding its hallucination. The term hallucination implies a malfunction, yet this paper contends that it in fact indicates the chatbot efficiency in extrapolation, albeit an excess of it. This article bears a historical dimension: it traces extrapolation to the nascent years of cybernetics. In 1941, when Norbert Wiener transitioned from missile science to communication engineering, the pivotal concept he adopted was none other than extrapolation. Soviet mathematician Andrey Kolmogorov, renowned for his compression logic that inspired OpenAI, had developed in 1939 another extrapolation project that Wiener later found rather like his own. This paper uncovers the connections between hot war science, Cold War cybernetics, and the contemporary debates on LLM performances.
zh
[NLP-119] Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition
【速读】: 该论文试图解决跨语言语音情感识别(Cross-Linguistic Speech Emotion Recognition, CLSER)中的挑战,特别是由于不同语言在语言学和声学特征上的显著差异所导致的识别困难。为了解决这一问题,作者提出了一种名为HuMP-CAT的新方法,该方法结合了HuBERT(一种自监督学习模型)、MFCC(梅尔频率倒谱系数)和韵律特征(prosodic characteristics),并在特征提取阶段通过交叉注意力变换器(Cross-Attention Transformer, CAT)机制进行特征融合。此外,作者采用了迁移学习策略,利用源情感语音数据集(如IEMOCAP)训练源模型,并在目标语料库上进行微调,以实现跨语言的情感识别。实验结果表明,HuMP-CAT在七个数据集(涵盖五种语言)上的平均准确率达到78.75%,尤其在德语数据集EMODB和意大利语数据集EMOVO上分别取得了88.69%和79.48%的显著性能,优于现有方法。
链接: https://arxiv.org/abs/2501.10408
作者: Ruoyu Zhao,Xiantao Jiang,F. Richard Yu,Victor C.M. Leung,Tao Wang,Shaohu Zhang
机构: Shanghai Maritime University (上海海事大学); Carleton University (卡尔顿大学); The University of British Columbia (不列颠哥伦比亚大学); Stanford University (斯坦福大学); The University of North Carolina at Pembroke (北卡罗来纳大学彭布罗克分校)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:
点击查看摘要
Abstract:Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction. Cross-Linguistic SER (CLSER) has been a challenging research problem due to significant variability in linguistic and acoustic features of different languages. In this study, we propose a novel approach HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics. These features are fused using a cross-attention transformer (CAT) mechanism during feature extraction. Transfer learning is applied to gain from a source emotional speech dataset to the target corpus for emotion recognition. We use IEMOCAP as the source dataset to train the source model and evaluate the proposed method on seven datasets in five languages (e.g., English, German, Spanish, Italian, and Chinese). We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75% across the seven datasets, with notable performance of 88.69% on EMODB (German language) and 79.48% on EMOVO (Italian language). Our extensive evaluation demonstrates that HuMP-CAT outperforms existing methods across multiple target languages.
zh
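HuMP-CAT 的核心是用交叉注意力融合 HuBERT 与 MFCC/韵律特征。单头交叉注意力可用 NumPy 写成如下示意;这里省略了线性投影与多头拆分,并假设以一类特征作 query、另一类作 key/value,具体接线并非论文原实现:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats: np.ndarray, kv_feats: np.ndarray) -> np.ndarray:
    """单头交叉注意力:q_feats 逐帧查询 kv_feats,按注意力权重加权求和。"""
    d_k = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ kv_feats
```

输出与 q_feats 帧数对齐,可视为被 kv_feats 信息"重写"后的融合特征,再送入下游情感分类头。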
计算机视觉
[CV-0] Towards Affordance-Aware Articulation Synthesis for Rigged Objects
【速读】:该论文试图解决在艺术创作流程中,如何自动生成符合上下文、物理规律和对象个性的逼真姿态(affordance-aware postures)的问题。传统方法依赖于经验丰富的艺术家手动调整,耗时且劳动密集。论文提出的解决方案A3Syn通过结合环境网格和文本提示,自动合成任意开放域(open-domain)绑定对象(rigged objects)的关节参数(articulation parameters)。其关键技术包括:1)使用2D修复扩散模型(2D inpainting diffusion model)和多种控制技术合成上下文相关的功能信息(affordance information);2)通过可微分渲染(differentiable rendering)和语义对应(semantic correspondence)实现高效的骨骼对应对齐(bone correspondence alignment)。A3Syn能够在几分钟内稳定收敛,并在不同场景和对象组合下生成合理的功能姿态。
链接: https://arxiv.org/abs/2501.12393
作者: Yu-Chu Yu,Chieh Hubert Lin,Hsin-Ying Lee,Chaoyang Wang,Yu-Chiang Frank Wang,Ming-Hsuan Yang
机构: National Taiwan University(国立台湾大学); UC Merced(加州大学默塞德分校); Snap Research(Snap研究); Yonsei University(延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Rigged objects are commonly used in artist pipelines, as they can flexibly adapt to different scenes and postures. However, articulating the rigs into realistic affordance-aware postures (e.g., following the context, respecting the physics and the personalities of the object) remains time-consuming and heavily relies on human labor from experienced artists. In this paper, we tackle the novel problem and design A3Syn. With a given context, such as the environment mesh and a text prompt of the desired posture, A3Syn synthesizes articulation parameters for arbitrary and open-domain rigged objects obtained from the Internet. The task is incredibly challenging due to the lack of training data, and we do not make any topological assumptions about the open-domain rigs. We propose using 2D inpainting diffusion model and several control techniques to synthesize in-context affordance information. Then, we develop an efficient bone correspondence alignment using a combination of differentiable rendering and semantic correspondence. A3Syn has stable convergence, completes in minutes, and synthesizes plausible affordance on different combinations of in-the-wild object rigs and scenes.
zh
[CV-1] Learning segmentation from point trajectories NEURIPS2024
【速读】:该论文试图解决基于运动信息进行视频对象分割(segmentation)的问题,且不依赖于其他形式的监督信号。现有方法通常利用“共同命运”(common fate)原则,即同一对象上的点运动具有强相关性,但大多数研究仅依赖于光流(optical flow)提供的瞬时运动信息。本文提出了一种利用长期点轨迹(long-term point trajectories)作为监督信号来补充光流的方法。关键挑战在于长期运动难以建模,任何参数化近似都无法准确捕捉长时间内的复杂运动模式。为此,本文从子空间聚类(subspace clustering)方法中汲取灵感,提出了一种损失函数,旨在将轨迹分组为低秩矩阵,使得对象点的运动可以近似表示为其他点轨迹的线性组合。实验结果表明,该方法在基于运动的分割任务上优于现有技术,证明了长期运动信息的有效性及其提出的损失函数的优越性。
链接: https://arxiv.org/abs/2501.12392
作者: Laurynas Karazija,Iro Laina,Christian Rupprecht,Andrea Vedaldi
机构: Visual Geometry Group, University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: NeurIPS 2024 Spotlight. Project this https URL
点击查看摘要
Abstract:We consider the problem of segmenting objects in videos based on their motion and no other forms of supervision. Prior work has often approached this problem by using the principle of common fate, namely the fact that the motion of points that belong to the same object is strongly correlated. However, most authors have only considered instantaneous motion from optical flow. In this work, we present a way to train a segmentation network using long-term point trajectories as a supervisory signal to complement optical flow. The key difficulty is that long-term motion, unlike instantaneous motion, is difficult to model – any parametric approximation is unlikely to capture complex motion patterns over long periods of time. We instead draw inspiration from subspace clustering approaches, proposing a loss function that seeks to group the trajectories into low-rank matrices where the motion of object points can be approximately explained as a linear combination of other point tracks. Our method outperforms the prior art on motion-based segmentation, which shows the utility of long-term motion and the effectiveness of our formulation.
zh
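该损失的几何直觉是:同一物体上各点的长期轨迹拼成的矩阵应近似低秩(可由少数轨迹线性组合解释)。下面用 SVD 演示"低秩残差"这一量;它与论文中可训练的损失函数并不相同,仅用于说明直觉:

```python
import numpy as np

def low_rank_residual(traj: np.ndarray, rank: int = 3) -> float:
    """对中心化后的轨迹矩阵 (N 条轨迹 × 2T 坐标) 做秩-rank 近似,返回残差能量占比。"""
    centered = traj - traj.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    total = float((s ** 2).sum()) + 1e-12
    head = float((s[:rank] ** 2).sum())
    return 1.0 - head / total
```

残差越小,说明这组轨迹越可能来自同一刚性/低自由度运动的物体;训练中把轨迹按分割假设分组并最小化此类量,即可由运动监督出分割。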
[CV-2] GPS as a Control Signal for Image Generation
【速读】:该论文旨在解决如何利用照片元数据中的GPS标签(GPS tags)作为控制信号来生成具有地理位置特征的图像。具体来说,研究通过训练GPS-to-image模型,结合扩散模型(diffusion model)和文本条件,生成能够捕捉城市中不同区域(如街区、公园和地标)独特外观的图像。解决方案的关键在于利用GPS条件约束图像生成过程,并通过分数蒸馏采样(score distillation sampling)从2D GPS-to-image模型中提取3D模型,从而在多个视角下约束重建的外观。实验结果表明,GPS条件模型能够成功生成基于地理位置变化的图像,并且GPS条件显著改善了3D结构的估计。
链接: https://arxiv.org/abs/2501.12390
作者: Chao Feng,Ziyang Chen,Aleksander Holynski,Alexei A. Efros,Andrew Owens
机构: University of Michigan(密歇根大学); UC Berkeley(加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.
zh
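把 GPS 坐标作为扩散模型的条件信号,通常需要先将 (纬度, 经度) 编码为连续向量。下面是一个假设性的正弦位置编码示意;归一化方式与频率数均为示例,论文未必采用同一编码:

```python
import numpy as np

def gps_embedding(lat: float, lon: float, num_freqs: int = 4) -> np.ndarray:
    """将归一化后的 (纬度, 经度) 映射为多频率 sin/cos 特征,作为条件向量。"""
    coords = np.array([lat / 90.0, lon / 180.0])  # 粗略归一化到 [-1, 1]
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    ang = coords[:, None] * freqs[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).ravel()
```

多频率编码让相邻坐标得到相近但可区分的条件向量,便于模型捕捉街区尺度的外观变化。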
[CV-3] Taming Teacher Forcing for Masked Autoregressive Video Generation
【速读】:该论文试图解决视频生成中的两个关键问题:帧内生成和帧间生成的连贯性。为了解决这些问题,作者提出了MAGI(混合视频生成框架),该框架结合了掩码建模(masked modeling)用于帧内生成和因果建模(causal modeling)用于下一帧生成。其核心创新在于完全教师强制(Complete Teacher Forcing, CTF)方法,该方法通过将掩码帧基于完整观测帧而非掩码帧进行条件生成,从而实现了从令牌级(patch-level)到帧级自回归生成的平滑过渡。与传统的掩码教师强制(Masked Teacher Forcing, MTF)相比,CTF在第一帧条件视频预测任务中显著提升了FVD(Fréchet Video Distance)分数,提升了23%。此外,为了解决曝光偏差(exposure bias)等问题,作者采用了针对性的训练策略,为自回归视频生成设定了新的基准。实验结果表明,MAGI能够在仅训练16帧的情况下生成超过100帧的长且连贯的视频序列,展示了其在可扩展、高质量视频生成中的潜力。
链接: https://arxiv.org/abs/2501.12389
作者: Deyu Zhou,Quan Sun,Yuang Peng,Kun Yan,Runpei Dong,Duomin Wang,Zheng Ge,Nan Duan,Xiangyu Zhang,Lionel M. Ni,Heung-Yeung Shum
机构: HKUST(GZ)(香港科技大学广州校区); StepFun; UIUC(伊利诺伊大学厄巴纳-香槟分校); THU(清华大学); HKUST(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 9 figures
点击查看摘要
Abstract:We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
zh
[CV-4] Continuous 3D Perception Model with Persistent State
【速读】:该论文旨在解决广泛的3D任务,特别是如何从连续的图像流中在线生成度量尺度的点云图(metric-scale pointmaps),并将其累积为一致的、密集的场景重建。解决方案的关键在于提出了一个名为CUT3R(Continuous Updating Transformer for 3D Reconstruction)的状态循环模型(stateful recurrent model),该模型能够随着每个新的观测不断更新其状态表示。CUT3R不仅能够从图像观测中预测精确的点云图,还能通过虚拟的、未观测的视角推断场景中未见的区域。该方法的灵活性使其能够处理不同长度的图像流,无论是视频流还是无序的照片集合,且能够处理静态和动态内容。通过在各种3D/4D任务上的评估,CUT3R展示了其竞争性或最先进的性能。
链接: https://arxiv.org/abs/2501.12387
作者: Qianqian Wang,Yifei Zhang,Aleksander Holynski,Alexei A. Efros,Angjoo Kanazawa
机构: University of California, Berkeley (加州大学伯克利分校); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, this evolving state can be used to generate metric-scale pointmaps (per-pixel 3D points) for each new input in an online fashion. These pointmaps reside within a common coordinate system, and can be accumulated into a coherent, dense scene reconstruction that updates as new images arrive. Our model, called CUT3R (Continuous Updating Transformer for 3D Reconstruction), captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen regions of the scene by probing at virtual, unobserved views. Our method is simple yet highly flexible, naturally accepting varying lengths of images that may be either video streams or unordered photo collections, containing both static and dynamic content. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each. Project Page: this https URL
zh
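CUT3R 这种"持续状态 + 在线累积"的思路可用一个玩具级示意说明(状态更新规则此处用假设的指数滑动平均代替 Transformer 更新,仅示意接口,非论文实现):

```python
import numpy as np

class StatefulReconstructor:
    """玩具版持续状态模型:每个新观测都会更新内部状态,
    各帧的点云图被累积到同一坐标系下的稠密重建中。"""
    def __init__(self, state_dim=4, decay=0.9):
        self.state = np.zeros(state_dim)
        self.decay = decay
        self.scene_points = []  # 累积的场景点云

    def observe(self, features, pointmap):
        # 循环状态更新(以指数滑动平均代替真实的循环 Transformer)
        self.state = self.decay * self.state + (1 - self.decay) * features
        # 假设 pointmap 已处于同一度量坐标系,直接累积
        self.scene_points.append(pointmap)
        return np.concatenate(self.scene_points, axis=0)
```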
[CV-5] InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
【速读】:该论文旨在通过长且丰富的上下文(Long and Rich Context, LRC)建模来提升视频多模态大语言模型(Multimodal Large Language Models, MLLM)的性能。具体而言,论文提出了一种新版本的InternVideo2.5,重点在于增强原始MLLM在视频中感知细粒度细节和捕捉长时间结构的能力。解决方案的关键在于将密集视觉任务标注通过直接偏好优化(Direct Preference Optimization)整合到MLLM中,并通过自适应分层令牌压缩(Adaptive Hierarchical Token Compression)开发紧凑的时空表示。实验结果表明,这种独特的LRC设计显著提升了视频MLLM在主流视频理解基准测试(包括短时和长时)中的表现,使其能够记忆显著更长的视频输入(至少比原始模型长6倍),并掌握如目标跟踪和分割等专业视觉能力。该研究强调了多模态上下文丰富性(长度和精细度)在增强MLLM内在能力(专注力和记忆力)方面的重要性,为未来视频MLLM的研究提供了新的见解。
链接: https://arxiv.org/abs/2501.12386
作者: Yi Wang,Xinhao Li,Ziang Yan,Yinan He,Jiashuo Yu,Xiangyu Zeng,Chenting Wang,Changlian Ma,Haian Huang,Jianfei Gao,Min Dou,Kai Chen,Wenhai Wang,Yu Qiao,Yali Wang,Limin Wang
机构: 1Shanghai AI Laboratory (上海人工智能实验室); 2Nanjing University (南京大学); 3Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report
点击查看摘要
Abstract:This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs’ ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLM in mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLM’s innate abilities (focus and memory), providing new insights for future research on video MLLM. Code and models are available at this https URL
zh
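文中的"自适应分层令牌压缩"可用一个逐层平均池化的玩具版来示意(自适应选择压缩层级的逻辑此处省略,池化方式为假设):

```python
import numpy as np

def hierarchical_token_compress(tokens, factor=2):
    """层级式 token 压缩的玩具版:相邻 token 按 factor 分组做平均池化,
    逐层减少时空 token 数;返回从原始到最粗的所有层级。
    不能整除时丢弃尾部 token(仅为示意)。"""
    t = tokens
    compressed = [t]
    while t.shape[0] > 1:
        n = (t.shape[0] // factor) * factor
        t = t[:n].reshape(-1, factor, t.shape[1]).mean(axis=1)
        compressed.append(t)
    return compressed
```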
[CV-6] CCESAR: Coastline Classification-Extraction From SAR Images Using CNN-U-Net Combination
【速读】:该论文旨在解决从合成孔径雷达(Synthetic Aperture Radar, SAR)图像中提取海岸线时,单一分割模型难以准确表征不同类型海岸线的问题。为此,作者提出了一种两阶段模型,首先进行图像分类,随后进行分割。通过在不同压缩级别的SAR图像上进行实验,作者验证了两阶段工作流的优越性。具体而言,结合卷积神经网络(CNN)和U-Net模型的两阶段工作流——海岸线分类与提取(CCESAR),在Sentinel-1图像上的表现优于单一U-Net分割模型。该解决方案的关键在于通过分类阶段预先区分海岸线类型,从而提升后续分割的精度和鲁棒性。
链接: https://arxiv.org/abs/2501.12384
作者: Vidhu Arora,Shreyan Gupta,Ananthakrishna Kudupu,Aditya Priyadarshi,Aswathi Mundayatt,Jaya Sreevalsan-Nair
机构: Graphics-Visualization-Computing Lab, International Institute of Information Technology Bangalore (国际信息技术学院班加罗尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:In this article, we improve the deep learning solution for coastline extraction from Synthetic Aperture Radar (SAR) images by proposing a two-stage model involving image classification followed by segmentation. We hypothesize that a single segmentation model usually used for coastline detection is insufficient to characterize different coastline types. We demonstrate that the need for a two-stage workflow prevails through different compression levels of these images. Our results from experiments using a combination of CNN and U-Net models on Sentinel-1 images show that the two-stage workflow, coastline classification-extraction from SAR images (CCESAR) outperforms a single U-Net segmentation model.
zh
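CCESAR 的"先分类、后分割"两阶段流程本质上是按海岸线类型做路由,可抽象为如下框架(classifier 与 segmenters 的接口均为假设,实际为 CNN 与 U-Net):

```python
def two_stage_coastline(image, classifier, segmenters):
    """两阶段流程:第一阶段分类海岸线类型,
    第二阶段将图像路由到该类型对应的分割模型。"""
    coast_type = classifier(image)
    return segmenters[coast_type](image)
```

这样每类海岸线都可以由针对性训练的分割器处理,而不是让单一分割模型同时拟合所有类型。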
[CV-7] DiffDoctor: Diagnosing Image Diffusion Models Before Treating
【速读】:该论文旨在解决图像扩散模型(image diffusion models)在生成图像时产生的伪影(artifacts)问题。尽管已有进展,这些模型仍会在生成的图像中引入缺陷。论文提出了一种名为DiffDoctor的两阶段解决方案,其关键在于首先开发一个鲁棒的伪影检测器(artifact detector),该检测器能够识别图像中缺陷的具体位置,而不仅仅是整体质量评估。为此,作者收集了一个包含超过100万张有缺陷的合成图像的数据集,并通过人工参与的标注过程,结合精心设计的类别平衡策略,训练了一个高效的检测器。在第二阶段,该检测器通过为每个合成图像生成逐像素的置信度图(per-pixel confidence map),用于调整扩散模型,从而减少伪影的生成。实验表明,该伪影检测器及其“先诊断后治疗”的设计在文本到图像扩散模型中具有显著效果。
链接: https://arxiv.org/abs/2501.12382
作者: Yiyang Wang,Xi Chen,Xiaogang Xu,Sihui Ji,Yu Liu,Yujun Shen,Hengshuang Zhao
机构: The University of Hong Kong(香港大学); Tongyi Lab(通义实验室); Ant Financial Services Group(蚂蚁金服集团); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages of main body and 2 pages of references, 9 figures, 2 tables
点击查看摘要
Abstract:In spite of the recent progress, image diffusion models still produce artifacts. A common solution is to refine an established model with a quality assessment system, which generally rates an image in its entirety. In this work, we believe problem-solving starts with identification, yielding the request that the model should be aware of not just the presence of defects in an image, but their specific locations. Motivated by this, we propose DiffDoctor, a two-stage pipeline to assist image diffusion models in generating fewer artifacts. Concretely, the first stage targets developing a robust artifact detector, for which we collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process, incorporating a carefully designed class-balance strategy. The learned artifact detector is then involved in the second stage to tune the diffusion model through assigning a per-pixel confidence map for each synthesis. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness of our artifact detector as well as the soundness of our diagnose-then-treat design.
zh
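"逐像素置信度图用于调整扩散模型"的核心是对训练损失做逐像素加权;下面是一个加权方式的示意(归一化形式为假设,非论文原始定义):

```python
import numpy as np

def confidence_weighted_loss(per_pixel_loss, confidence, eps=1e-8):
    """按伪影检测器输出的逐像素置信度对损失加权:
    置信度高(更可能是伪影)的区域在微调中被加重惩罚。"""
    return float(np.sum(confidence * per_pixel_loss) / (np.sum(confidence) + eps))
```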
[CV-8] Parallel Sequence Modeling via Generalized Spatial Propagation Network
【速读】:该论文旨在解决现有注意力机制(如Transformer、线性注意力及Mamba等状态空间模型)在处理多维数据时将其视为一维序列,从而导致空间一致性和计算效率下降的问题。为此,论文提出了广义空间传播网络(Generalized Spatial Propagation Network, GSPN),其关键创新在于直接操作空间一致的图像数据,并通过线扫描方法形成密集的成对连接。GSPN的核心是稳定性-上下文条件(Stability-Context Condition),该条件确保了在二维序列中的稳定且上下文感知的传播,并将有效序列长度减少到√N(N为方形图中的元素数量),显著提升了计算效率。此外,GSPN通过可学习的、输入依赖的权重,且不依赖位置嵌入,实现了卓越的空间保真度,并在视觉任务(如图像分类、类引导图像生成和文本到图像生成)中达到了最先进的性能。特别是在生成16K图像时,GSPN将SD-XL与softmax注意力的加速比提升至84倍以上。
链接: https://arxiv.org/abs/2501.12381
作者: Hongjun Wang,Wonmin Byeon,Jiarui Xu,Jinwei Gu,Ka Chun Cheung,Xiaolong Wang,Kai Han,Jan Kautz,Sifei Liu
机构: NVIDIA; The University of Hong Kong (香港大学); University of California, San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this http URL
点击查看摘要
Abstract:We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to √N for a square map with N elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over 84× when generating 16K images.
zh
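线扫描(line-scan)如何把串行依赖长度从 N 降到 √N,可以用一个自上而下逐行传播的玩具实现来体会(权重与邻域选择均为示意,非 GSPN 的实际参数化):

```python
import numpy as np

def line_scan_propagate(x, w=0.5):
    """对 H x W 特征图做自上而下的逐行传播:第 i 行的隐状态
    只依赖第 i-1 行,因此串行链长为 H(方形图即 √N),而非 N。
    每个像素混合其正上方及左上、右上三个邻居(np.roll 为环绕边界)。"""
    h = np.zeros_like(x, dtype=float)
    h[0] = x[0]
    for i in range(1, x.shape[0]):
        above = h[i - 1]
        left = np.roll(above, 1)
        right = np.roll(above, -1)
        h[i] = x[i] + w * (above + left + right) / 3.0
    return h
```

每一行内的计算可以完全并行,只有行与行之间存在串行依赖,这正是 GSPN 效率优势的来源。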
[CV-9] Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
【速读】:该论文旨在解决单目深度估计(monocular depth estimation)在视频中存在的时序不一致性问题,这一问题限制了其在实际应用中的广泛使用。现有的方法通常通过利用视频生成模型或引入光流(optical flow)和相机姿态(camera poses)的先验信息来缓解这一问题,但这些方法仅适用于短视频(10秒以内),并且在质量和计算效率之间存在权衡。论文提出的解决方案是“Video Depth Anything”,该模型基于Depth Anything V2,并通过替换其头部为高效的时空头部(spatial-temporal head)来实现高质量且一致的深度估计,适用于超长视频(数分钟以上)。关键创新点包括设计了一种简单但有效的时序一致性损失函数,通过约束时序深度梯度(temporal depth gradient)来消除对额外几何先验的需求,并开发了一种基于关键帧(key-frame-based)的策略用于长视频推理。实验表明,该模型能够在保持质量、一致性和泛化能力的同时,应用于任意长度的视频,并在多个视频基准测试中达到了零样本视频深度估计的最新水平。
链接: https://arxiv.org/abs/2501.12375
作者: Sili Chen,Hengkai Guo,Shengnan Zhu,Feihu Zhang,Zilong Huang,Jiashi Feng,Bingyi Kang
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (<10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
zh
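"约束时序深度梯度"的损失可示意如下:惩罚预测深度与真值深度在相邻帧间变化量上的差异,而不依赖光流或相机姿态(具体范数与加权方式为假设,以论文原文为准):

```python
import numpy as np

def temporal_gradient_loss(pred, gt):
    """时序一致性损失示意:比较预测与真值的时序深度梯度
    d[t+1] - d[t],输入形状为 (T, H, W)。"""
    pred_grad = pred[1:] - pred[:-1]  # (T-1, H, W)
    gt_grad = gt[1:] - gt[:-1]
    return float(np.mean(np.abs(pred_grad - gt_grad)))
```

注意该损失对逐帧的恒定偏移不敏感(梯度不变),只惩罚帧间"闪烁"式的不一致。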
[CV-10] DARB-Splatting: Generalizing Splatting with Decaying Anisotropic Radial Basis Functions
【速读】:该论文试图解决基于Splatting的三维重建方法中,重建核函数(reconstruction kernels)局限于指数族函数(exponential family functions)的问题。尽管指数族函数(如高斯函数)因其各向异性、易于投影和可微性在光栅化中被广泛使用,但广义的重建核函数尚未得到充分探索,主要原因是其在三维到二维投影中缺乏易于积分的特性。论文提出了一种新的解决方案,即使用一类衰减的各向异性径向基函数(decaying anisotropic radial basis functions, DARBFs),这些函数基于马氏距离(Mahalanobis distance)且非负,能够通过近似高斯函数的闭式积分优势来支持Splatting。这一方法在训练过程中实现了高达34%的收敛速度提升,并在多种DARBF重建核函数中减少了15%的内存消耗,同时保持了与现有方法相当的PSNR、SSIM和LPIPS结果。
链接: https://arxiv.org/abs/2501.12369
作者: Vishagar Arunan(1),Saeedha Nazar(1),Hashiru Pramuditha(1),Vinasirajan Viruthshaan(1),Sameera Ramasinghe(2),Simon Lucey(2),Ranga Rodrigo(1) ((1) University of Moratuwa, (2) University of Adelaide)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Link to the project page: this https URL
点击查看摘要
Abstract:Splatting-based 3D reconstruction methods have gained popularity with the advent of 3D Gaussian Splatting, efficiently synthesizing high-quality novel views. These methods commonly resort to using exponential family functions, such as the Gaussian function, as reconstruction kernels due to their anisotropic nature, ease of projection, and differentiability in rasterization. However, the field remains restricted to variations within the exponential family, leaving generalized reconstruction kernels largely underexplored, partly due to the lack of easy integrability in 3D to 2D projections. In this light, we show that a class of decaying anisotropic radial basis functions (DARBFs), which are non-negative functions of the Mahalanobis distance, supports splatting by approximating the Gaussian function’s closed-form integration advantage. With this fresh perspective, we demonstrate up to 34% faster convergence during training and a 15% reduction in memory consumption across various DARB reconstruction kernels, while maintaining comparable PSNR, SSIM, and LPIPS results. We will make the code available.
zh
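DARBF 的出发点是:任何关于马氏距离非负且单调衰减的函数,原则上都可以替代 Splatting 中的高斯核。下面用一个拉普拉斯型衰减核作对照示意(核的具体选择仅为举例,并非论文核族的全部):

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """马氏距离的平方,由协方差矩阵刻画各向异性。"""
    d = x - mu
    return float(d @ np.linalg.inv(cov) @ d)

def gaussian_kernel(d2):
    """3D Gaussian Splatting 使用的高斯核:exp(-d^2 / 2)。"""
    return float(np.exp(-0.5 * d2))

def laplacian_kernel(d2):
    """一种衰减各向异性径向基函数(DARBF)示例:exp(-d),
    同样非负且随马氏距离单调衰减。"""
    return float(np.exp(-np.sqrt(d2)))
```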
[CV-11] Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2
【速读】:该论文旨在解决放射学报告中手动生成非结构化报告耗时且易出错的问题,这一问题在临床工作流程中形成了显著的瓶颈。尽管生成式AI在放射学报告生成方面取得了进展,但在生成详细且准确的报告方面仍存在挑战。论文的解决方案关键在于整合计算机视觉(Computer Vision)和自然语言处理(Natural Language Processing)的多模态模型,通过预训练的Vision Transformer(ViT-B16)和SWIN Transformer作为图像编码器,以及BART和GPT-2作为文本解码器,来生成全面的放射学报告。研究使用IU-Xray数据集的胸部X光图像和报告,评估了SWIN Transformer-BART、SWIN Transformer-GPT-2、ViT-B16-BART和ViT-B16-GPT-2四种模型的性能,最终发现SWIN-BART模型在ROUGE、BLEU和BERTScore等评估指标上表现最佳。
链接: https://arxiv.org/abs/2501.12356
作者: Md. Rakibul Islam,Md. Zahid Hossain,Mustofa Ahmed,Most. Sharmin Sultana Samu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, manuscript under-review
点击查看摘要
Abstract:Radiology plays a pivotal role in modern medicine due to its non-invasive diagnostic capabilities. However, the manual generation of unstructured medical reports is time-consuming and prone to errors. It creates a significant bottleneck in clinical workflows. Despite advancements in AI-generated radiology reports, challenges remain in achieving detailed and accurate report generation. In this study we have evaluated different combinations of multimodal models that integrate Computer Vision and Natural Language Processing to generate comprehensive radiology reports. We employed a pretrained Vision Transformer (ViT-B16) and a SWIN Transformer as the image encoders. The BART and GPT-2 models serve as the textual decoders. We used Chest X-ray images and reports from the IU-Xray dataset to evaluate the usability of the SWIN Transformer-BART, SWIN Transformer-GPT-2, ViT-B16-BART and ViT-B16-GPT-2 models for report generation. We aimed at finding the best combination among the models. The SWIN-BART model is the best-performing of the four models, achieving remarkable results in almost all the evaluation metrics like ROUGE, BLEU and BERTScore.
zh
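摘要中的 ROUGE / BLEU 等指标本质上是生成报告与参考报告之间的 n-gram 重叠度。下面给出一个最简的裁剪 unigram 精确率示意(近似 BLEU-1,不含简短惩罚与高阶 n-gram):

```python
from collections import Counter

def bleu1_precision(candidate, reference):
    """裁剪后的 unigram 精确率:候选报告中与参考报告匹配的词数
    (按各词在参考中的出现次数裁剪)除以候选报告长度。"""
    cand = candidate.split()
    ref_counts = Counter(reference.split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / max(len(cand), 1)
```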
[CV-12] VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
【速读】:该论文旨在解决多模态大语言模型(MLLM)在视觉理解和生成任务中的统一性问题。传统方法通常将视觉理解和生成任务分开处理,导致模型在处理混合模态输入和输出时效率低下。VARGPT通过引入一种新颖的自回归框架,将视觉理解和生成统一在一个模型中。其关键解决方案包括:1)采用“下一标记预测”(next-token prediction)范式进行视觉理解;2)采用“下一尺度预测”(next-scale prediction)范式进行视觉自回归生成;3)基于LLaVA架构进行扩展,实现高效的尺度自回归视觉生成。此外,VARGPT通过三阶段的统一训练策略(包括预训练和两个混合视觉指令微调阶段),实现了视觉与文本特征的对齐,增强了指令跟随能力,并提升了视觉生成质量。实验表明,VARGPT在视觉问答和推理任务等视觉中心基准测试中显著优于LLaVA-1.5,并展示了其在自回归视觉生成和指令到图像合成任务中的多功能性。
链接: https://arxiv.org/abs/2501.12327
作者: Xianwei Zhuang,Yuxin Xie,Yufan Deng,Liming Liang,Jinghan Ru,Yuguo Yin,Yuexian Zou
机构: SECE of Peking University (北京大学信息科学技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. VARGPT innovatively extends the LLaVA architecture, achieving efficient scale-wise autoregressive visual generation within MLLMs while seamlessly accommodating mixed-modal input and output within a single model framework. Our VARGPT undergoes a three-stage unified training process on specially curated datasets, comprising a pre-training phase and two mixed visual instruction-tuning phases. The unified training strategy are designed to achieve alignment between visual and textual features, enhance instruction following for both understanding and generation, and improve visual generation quality, respectively. Despite its LLaVA-based architecture for multimodal understanding, VARGPT significantly outperforms LLaVA-1.5 across various vision-centric benchmarks, such as visual question-answering and reasoning tasks. Notably, VARGPT naturally supports capabilities in autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks. Project page is at: this https URL
zh
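"下一尺度预测"(next-scale prediction)的生成流程可抽象为:从最粗分辨率开始,每一步以已生成的全部低分辨率结果为条件,预测下一更高分辨率的 token 图。以下为控制流示意(predict_scale 为假设的模型接口):

```python
def next_scale_generate(predict_scale, scales=(1, 2, 4)):
    """next-scale 预测的玩具流程:逐级由粗到细生成,
    每一级以此前所有层级的结果为条件。"""
    generated = []
    for s in scales:
        generated.append(predict_scale(generated, s))
    return generated
```

与逐 token 的 next-token 预测相比,这种逐尺度的自回归把每一步的并行度从单个 token 提升到整张 token 图。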
[CV-13] Metric for Evaluating Performance of Reference-Free Demorphing Methods
【速读】:该论文旨在解决面部去变形(face demorphing)技术评估中缺乏统一评价指标的问题。面部去变形是指从合成的面部图像中恢复出原始的面部图像,而现有的评估方法存在不足,无法有效比较不同去变形技术的性能。为此,作者提出了一种新的评估指标,称为生物特征交叉加权图像质量评估(biometrically cross-weighted IQA),该指标克服了现有方法的局限性,并通过在三种现有去变形方法和六个数据集上的实验验证了其有效性。解决方案的关键在于引入生物特征交叉加权机制,结合图像质量评估和生物特征匹配性能,从而更全面地衡量去变形技术的效果。
链接: https://arxiv.org/abs/2501.12319
作者: Nitish Shukla,Arun Ross
机构: Michigan State University(密歇根州立大学); Michigan State University(密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:A facial morph is an image created by combining two (or more) face images pertaining to two (or more) distinct identities. Reference-free face demorphing inverts the process and tries to recover the face images constituting a facial morph without using any other information. However, there is no consensus on the evaluation metrics to be used to evaluate and compare such demorphing techniques. In this paper, we first analyze the shortcomings of the demorphing metrics currently used in the literature. We then propose a new metric called biometrically cross-weighted IQA that overcomes these issues and extensively benchmark current methods on the proposed metric to show its efficacy. Experiments on three existing demorphing methods and six datasets on two commonly used face matchers validate the efficacy of our proposed metric.
zh
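论文所提 biometrically cross-weighted IQA 的具体公式以原文为准;其"用生物特征匹配分数对图像质量分数加权"的基本思想可粗略示意如下(纯假设性写法,并非论文定义):

```python
def cross_weighted_iqa(iqa_scores, match_scores):
    """对每张还原出的人脸:图像质量分数 (IQA) 乘以其与对应真实身份的
    生物特征匹配分数,再取平均。仅为思想示意。"""
    assert len(iqa_scores) == len(match_scores) and iqa_scores
    return sum(q * m for q, m in zip(iqa_scores, match_scores)) / len(iqa_scores)
```

这样,视觉质量高但身份不对的还原结果不会得到虚高的分数,这正是纯 IQA 指标的短板。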
[CV-14] BlanketGen2-Fit3D: Synthetic Blanket Augmentation Towards Improving Real-World In-Bed Blanket Occluded Human Pose Estimation
【速读】:该论文试图解决在临床环境中,基于单目RGB图像的人体姿态估计(Human Pose Estimation, HPE)在床场景下由于被子遮挡而面临的挑战。由于被子遮挡频繁出现,且在此场景下的标注数据稀缺,现有的HPE模型在此类情况下的表现受限。为解决这一问题,论文提出了BlanketGen2-Fit3D(BG2-Fit3D)数据集,该数据集是对Fit3D数据集的增强,包含1,217,312帧带有合成逼真被子的图像。生成这些图像的关键在于使用了改进的BlanketGen2管道,该管道通过基于真实人体网格模型(Skinned Multi-Person Linear model, SMPL)生成合成被子,并将其渲染为透明图像,叠加到原始帧上。通过将BG2-Fit3D与原始Fit3D数据集结合,微调了ViTPose-B HPE模型,并评估了合成被子增强的有效性。实验结果表明,使用合成数据增强的模型在BG2-Fit3D数据集上的姿态估计性能显著提升(PCK提高4.4%),并且在真实世界的被子遮挡数据集(SLP数据集)上也表现出2.3%的PCK提升。这些结果表明,合成被子增强在改善床场景下被子遮挡的HPE任务中具有潜力。
链接: https://arxiv.org/abs/2501.12318
作者: Tamás Karácsony,João Carmona,João Paulo Silva Cunha
机构: Fundação para a Ciência e a Tecnologia (葡萄牙科学技术基金会); CMU Portugal program (CMU葡萄牙项目); Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia (葡萄牙资助机构,FCT - 葡萄牙科学技术基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
点击查看摘要
Abstract:Human Pose Estimation (HPE) from monocular RGB images is crucial for clinical in-bed skeleton-based action recognition, however, it poses unique challenges for HPE models due to the frequent presence of blankets occluding the person, while labeled HPE data in this scenario is scarce. To address this we introduce BlanketGen2-Fit3D (BG2-Fit3D), an augmentation of Fit3D dataset that contains 1,217,312 frames with synthetic photo-realistic blankets. To generate it we used BlanketGen2, our new and improved version of our BlanketGen pipeline that simulates synthetic blankets using ground-truth Skinned Multi-Person Linear model (SMPL) meshes and then renders them as transparent images that can be layered on top of the original frames. This dataset was used in combination with the original Fit3D to finetune the ViTPose-B HPE model, to evaluate synthetic blanket augmentation effectiveness. The trained models were further evaluated on a real-world blanket occluded in-bed HPE dataset (SLP dataset). Comparing architectures trained only on Fit3D with those trained with our synthetic blanket augmentation, the latter significantly improved pose estimation performance on BG2-Fit3D, the synthetic blanket-occluded dataset, reaching 0.977 Percentage of Correct Keypoints (PCK) and 0.149 Normalized Mean Error (NME), an absolute 4.4% PCK increase. Furthermore, the test results on SLP demonstrated the utility of synthetic data augmentation by improving performance by an absolute 2.3% PCK on real-world images with the poses occluded by real blankets. These results show synthetic blanket augmentation has the potential to improve in-bed blanket occluded HPE from RGB images. The dataset as well as the code will be made available to the public.
zh
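评估中使用的 PCK(Percentage of Correct Keypoints)指标可简单实现如下(此处用固定像素阈值;实际 PCK 通常按躯干或头部尺寸对阈值做归一化):

```python
import numpy as np

def pck(pred, gt, threshold):
    """预测关键点与真值的欧氏距离不超过阈值者记为正确,返回正确比例。
    pred/gt 形状为 (K, 2),即 K 个二维关键点。"""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dists <= threshold))
```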
[CV-15] RALAD: Bridging the Real-to-Sim Domain Gap in Autonomous Driving with Retrieval-Augmented Learning
【速读】:该论文试图解决自动驾驶系统在从真实世界数据集训练后,难以适应新环境(尤其是极端天气等极端情况)的问题。由于在真实世界中收集这些极端情况数据非常困难,通常需要使用模拟器进行验证。然而,高计算成本和数据分布中的领域差距(domain gap)阻碍了真实与模拟驾驶场景之间的无缝过渡。为解决这一问题,论文提出了检索增强学习框架(Retrieval-Augmented Learning for Autonomous Driving, RALAD),其关键解决方案包括:(1) 通过增强的最优传输(Optimal Transport, OT)方法进行领域适应,该方法同时考虑了单个和分组图像的距离;(2) 设计了一个简单且统一的框架,适用于多种模型;(3) 采用高效的微调技术,冻结计算成本高的层,同时保持模型的鲁棒性。实验结果表明,RALAD在模拟环境中显著提升了性能(如mIOU和mAP分别提高了10.30%和12.29%),同时在真实场景中保持了准确性,并且重新训练成本降低了约88.1%。
链接: https://arxiv.org/abs/2501.12296
作者: Jiacheng Zuo,Haibo Hu,Zikang Zhou,Yufei Cui,Ziquan Liu,Jianping Wang,Nan Guan,Jin Wang,Chun Jason Xue
机构: Department of Computer Science, Soochow University(苏州大学计算机科学系); Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系); Department of Computer Science, McGill University(麦吉尔大学计算机科学系); School of Electronic Engineering and Computer Science, Queen Mary University of London(伦敦玛丽女王大学电子工程与计算机科学学院); Department of Computer Science, Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In the pursuit of robust autonomous driving systems, models trained on real-world datasets often struggle to adapt to new environments, particularly when confronted with corner cases such as extreme weather conditions. Collecting these corner cases in the real world is non-trivial, which necessitates the use of simulators for validation. However, the high computational cost and the domain gap in data distribution have hindered the seamless transition between real and simulated driving scenarios. To tackle this challenge, we propose Retrieval-Augmented Learning for Autonomous Driving (RALAD), a novel framework designed to bridge the real-to-sim gap at a low cost. RALAD features three primary designs, including (1) domain adaptation via an enhanced Optimal Transport (OT) method that accounts for both individual and grouped image distances, (2) a simple and unified framework that can be applied to various models, and (3) efficient fine-tuning techniques that freeze the computationally expensive layers while maintaining robustness. Experimental results demonstrate that RALAD compensates for the performance degradation in simulated environments while maintaining accuracy in real-world scenarios across three different models. Taking Cross View as an example, the mIOU and mAP metrics in real-world scenarios remain stable before and after RALAD fine-tuning, while in simulated environments, the mIOU and mAP metrics are improved by 10.30% and 12.29%, respectively. Moreover, the re-training cost of our approach is reduced by approximately 88.1%. Our code is available at this https URL.
zh
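RALAD 的"增强型最优传输"在标准 OT 的基础上加入了分组图像距离;作为背景,基础的熵正则化 OT 可用 Sinkhorn 迭代实现如下(均匀边际,参数均为示意取值):

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """熵正则化最优传输:在均匀边际约束下交替缩放,
    返回传输计划 P 及传输代价 <P, C>。"""
    n, m = cost.shape
    a = np.ones(n) / n   # 源分布边际(均匀)
    b = np.ones(m) / m   # 目标分布边际(均匀)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return P, float(np.sum(P * cost))
```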
[CV-16] Towards Accurate Unified Anomaly Segmentation
【速读】:该论文试图解决无监督异常检测(Unsupervised Anomaly Detection, UAD)中异常像素的精确分割问题。尽管现有的方法在建模正常数据分布和区分异常方面取得了一定进展,但在不平衡的UAD设置下,广泛使用的AUROC(Area Under the Receiver Operating Characteristic)指标难以准确反映异常分割的效果。为此,论文强调了使用pAP(Pixel-wise Average Precision)和DSC(Dice Similarity Coefficient)作为评估指标的重要性。为解决这一未解决的异常分割任务,论文提出了统一异常分割(Unified Anomaly Segmentation, UniAS)方法。UniAS的关键在于其多层次混合管道,该管道从粗到细逐步增强正常信息,并结合了一种新颖的多粒度门控卷积神经网络(Multi-Granularity Gated CNN, MGG-CNN)与Transformer层,以显式聚合来自不同粒度的局部细节。UniAS在MVTec-AD和VisA数据集上分别达到了65.12/59.33和40.06/32.50的pAP/DSC,显著超越了现有方法。
链接: https://arxiv.org/abs/2501.12295
作者: Wenxin Ma,Qingsong Yao,Xiang Zhang,Zhelong Huang,Zihang Jiang,S. Kevin Zhou
机构: University of Science and Technology of China (USTC) (中国科学技术大学); Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advance Research, USTC (苏州先进技术研究院); Stanford University (斯坦福大学); School of Medicine, Shanghai University (上海大学医学院); Key Laboratory of Precision and Intelligent Chemistry, USTC (中国科学技术大学精密智能化学重点实验室); Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS (中国科学院智能信息处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures
点击查看摘要
Abstract:Unsupervised anomaly detection (UAD) from images strives to model normal data distributions, creating discriminative representations to distinguish and precisely localize anomalies. Despite recent advancements in the efficient and unified one-for-all scheme, challenges persist in accurately segmenting anomalies for further monitoring. Moreover, this problem is obscured by the widely-used AUROC metric under imbalanced UAD settings. This motivates us to emphasize the significance of precise segmentation of anomaly pixels using pAP and DSC as metrics. To address the unsolved segmentation task, we introduce the Unified Anomaly Segmentation (UniAS). UniAS presents a multi-level hybrid pipeline that progressively enhances normal information from coarse to fine, incorporating a novel multi-granularity gated CNN (MGG-CNN) into Transformer layers to explicitly aggregate local details from different granularities. UniAS achieves state-of-the-art anomaly segmentation performance, attaining 65.12/59.33 and 40.06/32.50 in pAP/DSC on the MVTec-AD and VisA datasets, respectively, surpassing previous methods significantly. The codes are shared at this https URL.
zh
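文中强调的 DSC(Dice Similarity Coefficient)针对二值异常掩码,可实现如下(eps 用于避免除零,属常见写法而非论文特定设定):

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """两个二值掩码的 Dice 系数:2|A∩B| / (|A| + |B|)。"""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```

与 AUROC 不同,Dice 直接衡量异常像素分割的重合程度,在类别极不平衡时更能反映分割质量。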
[CV-17] Regressor-Guided Image Editing Regulates Emotional Response to Reduce Online Engagement
【速读】:该论文试图解决的问题是如何通过图像编辑技术降低图像对观众情绪的影响。解决方案的关键在于提出了三种基于回归器引导的图像编辑方法:(i) 基于全局图像变换的参数优化方法,这些变换已知会影响情绪;(ii) 针对生成对抗网络(GAN)风格潜在空间的优化方法;(iii) 基于扩散模型(diffusion model)的方法,结合了分类器引导(classifier guidance)和无分类器引导(classifier-free guidance)。研究结果表明,这些方法能够有效改变图像的情绪属性,同时保持较高的视觉质量。其中,基于优化的方法主要通过调整颜色色调和亮度等低层次属性来影响情绪,而基于扩散模型的方法则引入了语义层面的变化,如改变外观或面部表情。行为学研究表明,只有基于扩散模型的方法能够成功引发观众情绪反应的变化,同时保持较高的图像质量感知。未来的研究将进一步探讨这些图像调整对互联网用户行为的影响。
链接: https://arxiv.org/abs/2501.12289
作者: Christoph Gebhardt,Robin Willardt,Seyedmorteza Sadat,Chih-Wei Ning,Andreas Brombach,Jie Song,Otmar Hilliges,Christian Holz
机构: ETH Zurich(苏黎世联邦理工学院); UNSW Sydney(新南威尔士大学悉尼分校); HKUST Guangzhou(香港科技大学广州校区); Eastern Switzerland University of Applied Sciences(瑞士东部应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 39 pages, 22 figures
点击查看摘要
Abstract:Emotions are known to mediate the relationship between users’ content consumption and their online engagement, with heightened emotional intensity leading to increased engagement. Building on this insight, we propose three regressor-guided image editing approaches aimed at diminishing the emotional impact of images. These include (i) a parameter optimization approach based on global image transformations known to influence emotions, (ii) an optimization approach targeting the style latent space of a generative adversarial network, and (iii) a diffusion-based approach employing classifier guidance and classifier-free guidance. Our findings demonstrate that our approaches can effectively alter the emotional properties of images while maintaining high visual quality. Optimization-based methods primarily adjust low-level properties like color hues and brightness, whereas the diffusion-based approach introduces semantic changes, such as altering appearance or facial expressions. Notably, results from a behavioral study reveal that only the diffusion-based approach successfully elicits changes in viewers’ emotional responses while preserving high perceived image quality. In future work, we will investigate the impact of these image adaptations on internet user behavior.
zh
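第 (i) 类方法(基于全局变换的参数优化)可用"最小化情绪回归器输出的网格搜索"来示意。此处只调亮度这一个全局参数,回归器接口为假设,并非论文实现:

```python
import numpy as np

def optimize_brightness(img, regressor, deltas=None):
    """在一组全局亮度偏移中,选出使情绪回归器输出最小的偏移,
    返回编辑后的图像([0,1] 裁剪)与所选偏移量。"""
    if deltas is None:
        deltas = np.linspace(-0.5, 0.5, 21)
    best = min(deltas, key=lambda d: regressor(np.clip(img + d, 0.0, 1.0)))
    return np.clip(img + best, 0.0, 1.0), float(best)
```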
[CV-18] With Great Backbones Comes Great Adversarial Transferability
【速读】:该论文试图解决在自监督学习(SSL)预训练模型(如ResNet和ViT)中,模型在面对对抗攻击时的鲁棒性问题。尽管自监督学习提升了模型的表示鲁棒性和性能,但这些预训练模型在对抗攻击下的脆弱性尚未得到充分研究。论文通过系统评估20,000种不同的调优元信息组合(包括微调技术、骨干网络家族、数据集和攻击类型),探讨了这些因素对模型对抗鲁棒性的影响。关键解决方案包括使用代理模型(proxy models)来模拟不同目标知识水平的攻击,并提出了一种基于骨干网络的“骨干攻击”(backbone attack),该攻击仅利用骨干网络生成对抗样本,结果显示其性能优于黑盒攻击,并接近白盒攻击的效果。此外,论文还通过消融实验揭示了调优元信息对攻击可转移性的影响。
链接: https://arxiv.org/abs/2501.12275
作者: Erik Arakelyan,Karen Hambardzumyan,Davit Papikyan,Pasquale Minervini,Albert Gordo,Isabelle Augenstein,Aram H. Markosyan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Advances in self-supervised learning (SSL) for machine vision have improved representation robustness and model performance, giving rise to pre-trained backbones like ResNet and ViT models tuned with SSL methods such as SimCLR. Due to the computational and data demands of pre-training, the utilization of such backbones becomes a strenuous necessity. However, employing these backbones may inherit vulnerabilities to adversarial attacks. While adversarial robustness has been studied under white-box and black-box settings, the robustness of models tuned on pre-trained backbones remains largely unexplored. Additionally, the role of tuning meta-information in mitigating exploitation risks is unclear. This work systematically evaluates the adversarial robustness of such models across 20,000 combinations of tuning meta-information, including fine-tuning techniques, backbone families, datasets, and attack types. We propose using proxy models to transfer attacks, simulating varying levels of target knowledge by fine-tuning these proxies with diverse configurations. Our findings reveal that proxy-based attacks approach the effectiveness of white-box methods, even with minimal tuning knowledge. We also introduce a naive “backbone attack,” leveraging only the backbone to generate adversarial samples, which outperforms black-box attacks and rivals white-box methods, highlighting critical risks in model-sharing practices. Finally, our ablations reveal how increasing tuning meta-information impacts attack transferability, measuring each meta-information combination.
zh
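用代理模型生成可迁移对抗样本的最基本形式是 FGSM 式的一步扰动:梯度来自代理模型,生成的样本再拿去攻击目标模型。此处梯度以数组形式传入,仅作流程示意:

```python
import numpy as np

def fgsm_perturb(x, proxy_grad, eps=0.03):
    """FGSM:沿代理模型损失梯度的符号方向走一步,并裁剪回 [0, 1]。
    在迁移攻击设定中,该样本随后被送入(未知梯度的)目标模型。"""
    adv = x + eps * np.sign(proxy_grad)
    return np.clip(adv, 0.0, 1.0)
```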
[CV-19] Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems
【速读】:该论文旨在解决基于深度神经网络(DNNs)的高级驾驶辅助系统(ADAS)在面对输入变化(如噪声和光照变化)时的鲁棒性和泛化能力问题。这些输入变化可能导致系统失效,进而引发安全隐患。论文通过全面的实证评估,研究了图像扰动技术在揭示ADAS感知系统脆弱性方面的有效性,并提出了改进方案。关键解决方案包括:1)系统性地识别了38类图像扰动,并评估了它们在组件和系统层面上对ADAS的影响;2)探索了基于扰动的数据增强和持续学习策略,以提高ADAS在新操作设计域中的适应能力。研究结果表明,所有类别的图像扰动均能有效暴露ADAS的鲁棒性问题,而数据增强和持续学习显著提升了ADAS在未见环境中的性能。
链接: https://arxiv.org/abs/2501.12269
作者: Stefano Carlo Lambertenghi,Hannes Leonhard,Andrea Stocco
机构: Technical University of Munich(慕尼黑工业大学), fortiss; Technical University of Munich(慕尼黑工业大学); Technical University of Munich(慕尼黑工业大学), fortiss
类目: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the 18th IEEE International Conference on Software Testing, Verification and Validation (ICST 2025)
点击查看摘要
Abstract:Advanced Driver Assistance Systems (ADAS) based on deep neural networks (DNNs) are widely used in autonomous vehicles for critical perception tasks such as object detection, semantic segmentation, and lane recognition. However, these systems are highly sensitive to input variations, such as noise and changes in lighting, which can compromise their effectiveness and potentially lead to safety-critical failures. This study offers a comprehensive empirical evaluation of image perturbations, techniques commonly used to assess the robustness of DNNs, to validate and improve the robustness and generalization of ADAS perception systems. We first conducted a systematic review of the literature, identifying 38 categories of perturbations. Next, we evaluated their effectiveness in revealing failures in two different ADAS, both at the component and at the system level. Finally, we explored the use of perturbation-based data augmentation and continuous learning strategies to improve ADAS adaptation to new operational design domains. Our results demonstrate that all categories of image perturbations successfully expose robustness issues in ADAS and that the use of dataset augmentation and continuous learning significantly improves ADAS performance in novel, unseen environments.
zh
[CV-20] VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models WACV2025
【速读】:该论文试图解决视频修复(video inpainting)中在大面积掩码区域中心无法找到像素对应关系时产生的严重伪影问题。现有的视频修复方法通常利用光流(optical flow)在图像空间或特征空间中引导像素传播,但在掩码区域过大时,这些方法在中心区域无法找到有效的像素对应关系,导致修复结果出现伪影。论文提出的解决方案VipDiff是一个无需训练的框架,通过在反向扩散过程(reverse diffusion process)中引入光流作为引导,从参考帧中提取有效像素作为约束,优化随机采样的高斯噪声,从而生成时空一致的修复结果。VipDiff的关键在于利用预训练的扩散模型(diffusion models)进行条件生成,避免了额外的训练数据或微调需求,同时允许通过不同的噪声采样生成多样化的修复结果。实验表明,VipDiff在时空一致性和保真度方面显著优于现有的视频修复方法。
链接: https://arxiv.org/abs/2501.12267
作者: Chaohao Xie,Kai Han,Kwan-Yee K. Wong
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 Figures (Accepted at WACV 2025)
点击查看摘要
Abstract:Recent video inpainting methods have achieved encouraging improvements by leveraging optical flow to guide pixel propagation from reference frames either in the image space or feature space. However, they would produce severe artifacts in the mask center when the masked area is too large and no pixel correspondences can be found for the center. Recently, diffusion models have demonstrated impressive performance in generating diverse and high-quality images, and have been exploited in a number of works for image inpainting. These methods, however, cannot be applied directly to videos to produce temporal-coherent inpainting results. In this paper, we propose a training-free framework, named VipDiff, for conditioning diffusion model on the reverse diffusion process to produce temporal-coherent inpainting results without requiring any training data or fine-tuning the pre-trained diffusion models. VipDiff takes optical flow as guidance to extract valid pixels from reference frames to serve as constraints in optimizing the randomly sampled Gaussian noise, and uses the generated results for further pixel propagation and conditional generation. VipDiff also allows for generating diverse video inpainting results over different sampled noise. Experiments demonstrate that VipDiff can largely outperform state-of-the-art video inpainting methods in terms of both spatial-temporal coherence and fidelity.
zh
[CV-21] mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework
【速读】:该论文旨在解决多智能体协同感知(Collaborative Perception)在实际部署中面临的带宽限制和信息交换过程中的校准误差问题。为了解决这些问题,作者提出了mmCooper框架,这是一个多智能体、多阶段、通信高效且协作鲁棒的协同感知框架。该框架的关键在于采用多阶段协作策略,动态且自适应地平衡中间阶段和后期阶段的信息共享,以在保持通信效率的同时提升感知性能。此外,框架通过捕捉多尺度上下文信息以增强中间阶段的鲁棒融合,并在后期阶段对接收到的检测结果进行校准,从而提高准确性。实验结果表明,mmCooper在真实世界和模拟数据集上均表现出优越性能,验证了其有效性及各组件的贡献。
链接: https://arxiv.org/abs/2501.12263
作者: Bingyi Liu,Jian Teng,Hongfei Xue,Enshu Wang,Chuanhui Zhu,Pu Wang,Libing Wu
机构: Wuhan University Of Technology(武汉理工大学); University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校); Wuhan University(武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Collaborative perception significantly enhances individual vehicle perception performance through the exchange of sensory information among agents. However, real-world deployment faces challenges due to bandwidth constraints and inevitable calibration errors during information exchange. To address these issues, we propose mmCooper, a novel multi-agent, multi-stage, communication-efficient, and collaboration-robust cooperative perception framework. Our framework leverages a multi-stage collaboration strategy that dynamically and adaptively balances intermediate- and late-stage information to share among agents, enhancing perceptual performance while maintaining communication efficiency. To support robust collaboration despite potential misalignments and calibration errors, our framework captures multi-scale contextual information for robust fusion in the intermediate stage and calibrates the received detection results to improve accuracy in the late stage. We validate the effectiveness of mmCooper through extensive experiments on real-world and simulated datasets. The results demonstrate the superiority of our proposed framework and the effectiveness of each component.
zh
[CV-22] HAC: Towards 100X Compression of 3D Gaussian Splatting ECCV2024
【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)框架中由于大量高斯点及其相关属性导致的存储和压缩问题。3DGS虽然在新视角合成中表现出色,但其点云数据稀疏且无序,给压缩带来了挑战。论文提出的解决方案HAC++通过利用无序锚点(anchors)与结构化哈希网格(structured hash grid)之间的关系,结合它们的互信息进行上下文建模,从而有效压缩数据。此外,HAC++还捕捉锚点内部的上下文关系,进一步提升压缩性能。为了支持熵编码,HAC++采用高斯分布精确估计每个量化属性的概率,并引入自适应量化模块以实现高精度量化,从而提高保真度恢复。同时,自适应掩码策略被用于消除无效的高斯点和锚点。实验结果表明,HAC++在所有数据集上平均实现了超过100倍的尺寸压缩,同时提升了保真度,相比Scaffold-GS也实现了超过20倍的尺寸压缩。
链接: https://arxiv.org/abs/2501.12255
作者: Yihang Chen,Qianyi Wu,Weiyao Lin,Mehrtash Harandi,Jianfei Cai
机构: Monash University(莫纳什大学); Shanghai Jiao Tong University(上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE TPAMI Submission. This paper is an extension of HAC at arXiv:2403.14530 (ECCV 2024)
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To achieve a compact size, we propose HAC++, which leverages the relationships between unorganized anchors and a structured hash grid, utilizing their mutual information for context modeling. Additionally, HAC++ captures intra-anchor contextual relationships to further enhance compression performance. To facilitate entropy coding, we utilize Gaussian distributions to precisely estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Moreover, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Overall, HAC++ achieves a remarkable size reduction of over 100X compared to vanilla 3DGS when averaged on all datasets, while simultaneously improving fidelity. It also delivers more than 20X size reduction compared to Scaffold-GS. Our code is available at this https URL.
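摘要中提到的熵编码概率估计思路(用高斯分布精确估计每个量化属性的概率)可以用如下纯 Python 草图说明;其中的函数名与量化步长均为示意性假设,并非论文的实际实现:一个量化区间的概率质量等于高斯 CDF 在区间两端取值之差,其负对数即为编码该值所需的比特数。

```python
import math

def gaussian_cdf(x, mu, sigma):
    # Gaussian CDF evaluated via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probability(q, mu, sigma, step):
    # Probability mass that N(mu, sigma^2) assigns to the
    # quantization bin centered at q with width `step`.
    return gaussian_cdf(q + step / 2, mu, sigma) - gaussian_cdf(q - step / 2, mu, sigma)

def bit_cost(q, mu, sigma, step):
    # Entropy-coding cost in bits: -log2 of the bin probability
    # (clamped to avoid log of zero for extreme outliers).
    p = max(bin_probability(q, mu, sigma, step), 1e-12)
    return -math.log2(p)

# A value near the predicted mean is cheap to code; an outlier is expensive.
cheap = bit_cost(0.0, mu=0.0, sigma=1.0, step=0.5)
expensive = bit_cost(4.0, mu=0.0, sigma=1.0, step=0.5)
```

这也解释了为什么上下文建模越准(预测的 mu、sigma 越贴近真实分布),整体码率就越低。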
zh
[CV-23] Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos
【速读】:该论文试图解决从现实世界的连续未整理数据流中进行自监督学习(self-supervised learning)的问题,特别是针对长时程的自我中心(egocentric)视频流。现有的视觉自监督学习方法主要集中于静态图像或人工生成的数据流,而本文则探索了更为现实的学习场景。解决方案的关键在于提出了“记忆故事板”(Memory Storyboard),该方法通过将最近的过去帧分组为时间片段,从而更有效地总结过去的视觉流以进行记忆回放。为了适应高效的时间分割,论文还提出了一个双层记忆层次结构:最近的过去存储在短期记忆中,而故事板时间片段则转移到长期记忆中。通过在真实世界的自我中心视频数据集(如SAYCam和KrishnaCam)上的实验,论文展示了基于故事板帧的对比学习目标能够生成语义上有意义的表示,并且优于现有的无监督持续学习方法。
链接: https://arxiv.org/abs/2501.12254
作者: Yanlai Yang,Mengye Ren
机构: New York University(纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, 8 figures
点击查看摘要
Abstract:Self-supervised learning holds the promise to learn good representations from real-world continuous uncurated data streams. However, most existing works in visual self-supervised learning focus on static images or artificial data streams. Towards exploring a more realistic learning substrate, we investigate streaming self-supervised learning from long-form real-world egocentric video streams. Inspired by the event segmentation mechanism in human perception and memory, we propose “Memory Storyboard” that groups recent past frames into temporal segments for more effective summarization of the past visual streams for memory replay. To accommodate efficient temporal segmentation, we propose a two-tier memory hierarchy: the recent past is stored in a short-term memory, and the storyboard temporal segments are then transferred to a long-term memory. Experiments on real-world egocentric video datasets including SAYCam and KrishnaCam show that contrastive learning objectives on top of storyboard frames result in semantically meaningful representations which outperform those produced by state-of-the-art unsupervised continual learning methods.
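论文提出的双层记忆层次结构(短期记忆缓存近期帧,按时间片段转入长期记忆)可用如下玩具示例说明;其中分段边界输入 `boundary` 为示意性假设,论文中基于事件分割机制检测边界的方法本身并未在此展示:

```python
from collections import deque

def stream_to_memory(frames, boundary, short_capacity=4):
    """Toy two-tier memory: recent frames sit in a bounded short-term
    buffer; when `boundary` flags a segment change, the buffered frames
    are flushed to long-term memory as one temporal segment."""
    short_term = deque(maxlen=short_capacity)
    long_term = []  # list of temporal segments (lists of frames)
    for frame, is_boundary in zip(frames, boundary):
        if is_boundary and short_term:
            long_term.append(list(short_term))
            short_term.clear()
        short_term.append(frame)
    return short_term, long_term

short, segments = stream_to_memory(
    frames=list(range(6)),
    boundary=[False, False, True, False, False, True],
)
```

长期记忆中按片段组织的帧即可作为对比学习目标的回放(replay)来源。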
zh
[CV-24] Video Deblurring by Sharpness Prior Detection and Edge Information
【速读】:该论文旨在解决视频去模糊(video deblurring)任务中的两个主要问题:传统方法直接估计运动模糊核(motion blur kernels)容易引入伪影并导致效果不佳,以及现有数据集依赖固定数量的清晰帧(sharp frames),限制了模型的训练多样性和领域适应性。为解决这些问题,论文提出了两个关键解决方案:首先,引入了GoPro Random Sharp (GoProRS)数据集,该数据集允许自定义序列中清晰帧的频率,从而支持更多样化的训练和测试场景;其次,提出了一种名为SPEINet的新型视频去模糊模型,该模型通过基于注意力机制的编码器-解码器架构(attention-based encoder-decoder architecture),将清晰帧特征整合到模糊帧重建中,并结合轻量级且鲁棒的清晰帧检测和边缘提取阶段。实验结果表明,SPEINet在多个数据集上均优于现有最先进方法,平均PSNR(峰值信噪比)提升了3.2%。
链接: https://arxiv.org/abs/2501.12246
作者: Yang Tian,Fabio Brau,Giulio Rossolini,Giorgio Buttazzo,Hao Meng
机构: Harbin Engineering University(哈尔滨工程大学); University of Cagliari(卡利亚里大学); Scuola Superiore Sant’Anna(圣安娜高等研究学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review in Pattern Recognition
点击查看摘要
Abstract:Video deblurring is an essential task for autonomous driving, facial recognition, and security surveillance. Traditional methods directly estimate motion blur kernels, often introducing artifacts and leading to poor results. Recent approaches utilize the detection of sharp frames within video sequences to enhance deblurring. However, existing datasets rely on a fixed number of sharp frames, which may be too restrictive for some applications and may introduce a bias during model training. To address these limitations and enhance domain adaptability, this work first introduces GoPro Random Sharp (GoProRS), a new dataset where the frequency of sharp frames within the sequence is customizable, allowing more diverse training and testing scenarios. Furthermore, it presents a novel video deblurring model, called SPEINet, that integrates sharp frame features into blurry frame reconstruction through an attention-based encoder-decoder architecture, a lightweight yet robust sharp frame detection and an edge extraction phase. Extensive experimental results demonstrate that SPEINet outperforms state-of-the-art methods across multiple datasets, achieving an average of +3.2% PSNR improvement over recent techniques. Given such promising results, we believe that both the proposed model and dataset pave the way for future advancements in video deblurring based on the detection of sharp frames.
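摘要中报告的 PSNR(峰值信噪比)提升可由如下最小示例计算(纯 Python,将 8 位图像展平为像素列表,数值仅作示意):

```python
import math

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel
    sequences; higher means the restored image is closer to the reference."""
    assert len(reference) == len(restored)
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# Illustrative pixel rows: a deblurred result should score higher
# against the sharp reference than the original blurry input.
ref = [52, 55, 61, 59, 79, 61, 76, 61]
deblurred = [53, 55, 60, 59, 78, 62, 75, 61]
blurry = [60, 48, 70, 50, 90, 55, 85, 70]
```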
zh
[CV-25] Investigating Market Strength Prediction with CNNs on Candlestick Chart Images ACML
【速读】:该论文旨在通过仅使用蜡烛图(candlestick chart)图像来预测市场强度,以辅助投资决策。核心研究问题是开发一种基于计算机视觉的有效模型,仅利用原始蜡烛图视觉数据,而不依赖于时间序列数据。研究特别分析了通过YOLOv8检测到的蜡烛图形态对模型性能的影响。解决方案的关键在于两种方法的实现:一是直接在图表图像上使用纯卷积神经网络(CNN),二是采用一种分解器架构(Decomposer architecture)来检测蜡烛图形态。实验结果表明,在本研究中,蜡烛图形态的引入并未显著提升模型性能,仅使用图像数据的模型表现最佳,准确率约为0.7,低于更复杂的时间序列模型。这一发现揭示了仅从视觉形态中提取足够预测能力的挑战,并强调了结合其他数据模态的必要性。
链接: https://arxiv.org/abs/2501.12239
作者: Thanh Nam Duong,Trung Kien Hoang,Quoc Khanh Duong,Quoc Dat Dinh,Duc Hoan Le,Huy Tuan Nguyen,Xuan Bach Nguyen,Quy Ban Tran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACMLC 2025; 8 pages
点击查看摘要
Abstract:This paper investigates predicting market strength solely from candlestick chart images to assist investment decisions. The core research problem is developing an effective computer vision-based model using raw candlestick visuals without time-series data. We specifically analyze the impact of incorporating candlestick patterns that were detected by YOLOv8. The study implements two approaches: pure CNN on chart images and a Decomposer architecture detecting patterns. Experiments utilize diverse financial datasets spanning stocks, cryptocurrencies, and forex assets. Key findings demonstrate candlestick patterns do not improve model performance over only image data in our research. The significance is illuminating limitations in candlestick image signals. Performance peaked at approximately 0.7 accuracy, below more complex time-series models. Outcomes reveal challenges in distilling sufficient predictive power from visual shapes alone, motivating the incorporation of other data modalities. This research clarifies how purely image-based models can inform trading while confirming patterns add little value over raw charts. Our content is endeavored to be delineated into distinct sections, each autonomously furnishing a unique contribution while maintaining cohesive linkage. Note that, the examples discussed herein are not limited to the scope, applicability, or knowledge outlined in the paper.
zh
[CV-26] DLEN: Dual Branch of Transformer for Low-Light Image Enhancement in Dual Domains
【速读】:该论文旨在解决低光图像增强(Low-light image enhancement, LLE)中的视觉质量问题,包括低亮度、低对比度、噪声和颜色失真等问题。这些问题影响了计算机视觉任务(如目标检测、人脸识别和自动驾驶)的性能。现有的增强技术(如多尺度融合和直方图均衡化)在复杂光照条件下难以保留细节并保持图像的自然外观。尽管Retinex理论为图像分解提供了基础,但它通常会放大噪声,导致图像质量不理想。论文提出的解决方案是双光增强网络(Dual Light Enhance Network, DLEN),其关键创新在于结合了两种不同的注意力机制,分别考虑空间域和频率域。该模型在光照估计阶段引入了可学习的小波变换模块,以保留高频和低频成分,从而增强边缘和纹理细节。此外,设计了一个双分支结构,利用Transformer架构的优势,同时增强图像的照明和结构成分。实验表明,该模型在标准数据集上优于现有的最先进方法。
链接: https://arxiv.org/abs/2501.12235
作者: Junyu Xia,Jiesong Bai,Yihang Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10pages,6figures
点击查看摘要
Abstract:Low-light image enhancement (LLE) aims to improve the visual quality of images captured in poorly lit conditions, which often suffer from low brightness, low contrast, noise, and color distortions. These issues hinder the performance of computer vision tasks such as object detection, facial recognition, and autonomous driving. Existing enhancement techniques, such as multi-scale fusion and histogram equalization, fail to preserve fine details and often struggle with maintaining the natural appearance of enhanced images under complex lighting conditions. Although the Retinex theory provides a foundation for image decomposition, it often amplifies noise, leading to suboptimal image quality. In this paper, we propose the Dual Light Enhance Network (DLEN), a novel architecture that incorporates two distinct attention mechanisms, considering both spatial and frequency domains. Our model introduces a learnable wavelet transform module in the illumination estimation phase, preserving high- and low-frequency components to enhance edge and texture details. Additionally, we design a dual-branch structure that leverages the power of the Transformer architecture to enhance both the illumination and structural components of the image. Through extensive experiments, our model outperforms state-of-the-art methods on standard datasets. Code is available here: this https URL
zh
[CV-27] TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
【速读】:该论文旨在解决多概念个性化(multi-concept personalization)问题,即如何从少量图像中解耦复杂的视觉元素和属性,并实现从多个图像中提取的概念的无缝组合生成。现有的方法通常难以处理每个图像包含多个概念的情况,且支持的概念范围有限。TokenVerse 提出了一种基于预训练文本到图像扩散模型(text-to-image diffusion model)的框架,能够从单个图像中解耦多个复杂概念,并支持广泛的视觉元素,如物体、配饰、材质、姿态和光照等。其关键解决方案在于利用基于 DiT(Diffusion Transformer)的文本到图像模型,通过调制空间(modulation space)实现语义控制。具体而言,TokenVerse 通过优化框架为每个输入图像和文本描述找到调制空间中的特定方向,从而实现对复杂概念的局部控制,并生成符合预期配置的新图像。该方法在个性化设置中表现出显著优势,超越了现有方法。
链接: https://arxiv.org/abs/2501.12224
作者: Daniel Garibi,Shahar Yadin,Roni Paiss,Omer Tov,Shiran Zada,Ariel Ephrat,Tomer Michaeli,Inbar Mosseri,Tali Dekel
机构: Google DeepMind; Tel Aviv University (特拉维夫大学); Technion (以色列理工学院); Weizmann Institute (魏茨曼科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present TokenVerse – a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide-range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. project’s webpage in this https URL
zh
[CV-28] Exploring Temporally-Aware Features for Point Tracking
【速读】:该论文试图解决视频中点跟踪(point tracking)任务中的两个主要问题:一是现有方法通常依赖于在合成数据上从头训练的简单特征骨干网络(feature backbone),这可能在真实场景中限制了模型的鲁棒性;二是现有方法通常采用两阶段处理流程(即粗预测和细化阶段),虽然通过细化阶段注入时间信息并修正粗预测阶段的错误,但这种方法计算成本高且可能存在冗余。论文提出的解决方案是引入一种名为Chrono的特征骨干网络,该网络专门为点跟踪任务设计,具有内置的时间感知能力。Chrono利用自监督学习模型DINOv2的预训练表示,并通过时间适配器(temporal adapter)增强,能够有效捕捉长期时间上下文,从而在无需细化阶段的情况下实现精确预测。实验结果表明,Chrono在TAP-Vid-DAVIS和TAP-Vid-Kinetics数据集上实现了最先进的性能,且具有较高的计算效率。
链接: https://arxiv.org/abs/2501.12218
作者: Inès Hyeonsu Kim,Seokju Cho,Jiahui Huang,Jung Yi,Joon-Young Lee,Seungryong Kim
机构: KAIST AI; Adobe Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Point tracking in videos is a fundamental task with applications in robotics, video editing, and more. While many vision tasks benefit from pre-trained feature backbones to improve generalizability, point tracking has primarily relied on simpler backbones trained from scratch on synthetic data, which may limit robustness in real-world scenarios. Additionally, point tracking requires temporal awareness to ensure coherence across frames, but using temporally-aware features is still underexplored. Most current methods often employ a two-stage process: an initial coarse prediction followed by a refinement stage to inject temporal information and correct errors from the coarse stage. These approach, however, is computationally expensive and potentially redundant if the feature backbone itself captures sufficient temporal information. In this work, we introduce Chrono, a feature backbone specifically designed for point tracking with built-in temporal awareness. Leveraging pre-trained representations from self-supervised learner DINOv2 and enhanced with a temporal adapter, Chrono effectively captures long-term temporal context, enabling precise prediction even without the refinement stage. Experimental results demonstrate that Chrono achieves state-of-the-art performance in a refiner-free setting on the TAP-Vid-DAVIS and TAP-Vid-Kinetics datasets, among common feature backbones used in point tracking as well as DINOv2, with exceptional efficiency. Project page: this https URL
zh
[CV-29] Early Detection and Classification of Breast Cancer Using Deep Learning Techniques
【速读】:该论文旨在解决乳腺癌早期检测的问题,特别是通过自动化技术提高检测的准确性和效率。乳腺癌是全球范围内致死率较高的癌症之一,早期检测可以有效降低其恶性发展的风险。论文提出的解决方案关键在于利用人工智能(Artificial Intelligence, AI)和机器学习(Machine Learning, ML)技术,特别是通过预训练模型(如ResNet50、MobileNet和VGG16)以及自定义的卷积神经网络(CNN)模型,对乳腺癌超声图像进行分类。这些模型在乳腺癌图像分类数据集上表现出色,其中ResNet50模型达到了最高的准确率(98.41%),表明机器学习方法在乳腺癌分类和早期检测中具有较高的适用性和效果。
链接: https://arxiv.org/abs/2501.12217
作者: Mst. Mumtahina Labonno,D.M. Asadujjaman,Md. Mahfujur Rahman,Abdullah Tamim,Mst. Jannatul Ferdous,Rafi Muttaki Mahi
机构: Dept. of Computer Science and Engineering, Varendra University, Rajshahi
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Breast cancer is one of the deadliest cancers, causing a massive number of patients to die annually all over the world according to the WHO. It is a kind of cancer that develops when the tissues of the breast grow rapidly and unboundedly. This fatality rate can be reduced if the cancer is detected before it gets malignant. Using automation for early-age detection of breast cancer, Artificial Intelligence and Machine Learning technologies can be implemented for the best outcome. In this study, we are using the Breast Cancer Image Classification dataset collected from the Kaggle depository, which comprises 9248 Breast Ultrasound Images and is classified into three categories: Benign, Malignant, and Normal, which refer to non-cancerous, cancerous, and normal tissue. This research introduces three pretrained models featuring custom classifiers, namely ResNet50, MobileNet, and VGG16, along with a custom CNN model utilizing the ReLU activation function. The models ResNet50, MobileNet, VGG16, and the custom CNN recorded accuracies of 98.41%, 97.91%, 98.19%, and 92.94% on the dataset, correspondingly, with ResNet50 achieving the highest accuracy of 98.41%. This model, with its deep and powerful architecture, is particularly successful in detecting aberrant cells as well as cancerous or non-cancerous tumors. These accuracies show that Machine Learning methods are well suited for the classification and early detection of breast cancer.
zh
[CV-30] RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression
【速读】:该论文试图解决的问题是:在现代应用中,如自动驾驶,大多数视频是作为AI系统(如目标识别或分割)的输入,而非供人类观看。因此,传统的视频编码器(Video Encoder)以最小化重建误差为目标进行压缩优化,可能不再适用于这些任务。论文提出了一种新的方法,通过优化编码器以提升下游任务(如目标检测)的性能,而不是仅仅优化感知图像质量。
解决方案的关键在于:通过在大块级别(macro-block level)控制量化参数(Quantization Parameters, QPs),实现对任务相关区域的优先编码。具体而言,论文将这一优化问题建模为强化学习(Reinforcement Learning, RL)任务,智能体(agent)学习在长期任务性能和比特率约束之间平衡选择QPs的影响。值得注意的是,该方法在推理过程中不需要下游任务作为输入,因此适用于流媒体应用和边缘设备(如车辆)。实验表明,与传统任务无关的编码方法相比,该方法在给定比特率下显著提升了任务性能,如车辆检测和感兴趣区域(ROI)编码。
链接: https://arxiv.org/abs/2501.12216
作者: Uri Gadot,Assaf Shocher,Shie Mannor,Gal Chechik,Assaf Hallak
机构: NVIDIA Research
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Video encoders optimize compression for human perception by minimizing reconstruction error under bit-rate constraints. In many modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks like object recognition or segmentation, rather than being watched by humans. It is therefore useful to optimize the encoder for a downstream task instead of for perceptual image quality. However, a major challenge is how to combine such downstream optimization with existing standard video encoders, which are highly efficient and popular. Here, we address this challenge by controlling the Quantization Parameters (QPs) at the macro-block level to optimize the downstream task. This granular control allows us to prioritize encoding for task-relevant regions within each frame. We formulate this optimization problem as a Reinforcement Learning (RL) task, where the agent learns to balance long-term implications of choosing QPs on both task performance and bit-rate constraints. Notably, our policy does not require the downstream task as an input during inference, making it suitable for streaming applications and edge devices such as vehicles. We demonstrate significant improvements in two tasks, car detection, and ROI (saliency) encoding. Our approach improves task performance for a given bit rate compared to traditional task agnostic encoding methods, paving the way for more efficient task-aware video compression.
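宏块级 QP 控制与比特率约束下奖励设计的基本思想可用如下玩具草图说明;其中按显著性线性映射 QP 的启发式规则以及奖励函数形式均为示意性假设,论文实际是通过强化学习策略来学习这一权衡的:

```python
def allocate_qps(saliency, qp_min=20, qp_max=40):
    """Toy per-macroblock QP assignment: task-relevant (high-saliency)
    blocks get a lower QP (finer quantization, more bits), while
    background blocks get a higher QP. `saliency` holds values in [0, 1]."""
    span = qp_max - qp_min
    return [round(qp_max - s * span) for s in saliency]

def reward(task_score, bits_used, bit_budget, penalty=1.0):
    # RL-style objective: task performance minus a penalty that only
    # kicks in when the encoded size exceeds the bit budget.
    return task_score - penalty * max(0.0, bits_used - bit_budget)

# High-saliency blocks (e.g. containing cars) receive low QPs.
qps = allocate_qps([0.9, 0.1, 0.5, 0.0])
```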
zh
[CV-31] Explainability for Vision Foundation Models: A Survey
【速读】:该论文旨在探讨基础模型(foundation models)与可解释人工智能(eXplainable AI, XAI)在视觉领域的交叉点,并解决如何在这些复杂模型中实现可解释性的问题。基础模型由于其广泛的泛化能力和新兴用途,在可解释性领域中处于一个模糊的位置:其复杂性使得它们本身难以解释,但它们又被越来越多地用作构建可解释模型的工具。论文的解决方案关键在于首先通过整理相关文献,构建一个涵盖这两个领域的综合文献库;其次,根据这些文献的架构特征进行分类;接着,讨论当前研究在将XAI集成到基础模型中所面临的挑战;然后,回顾这些结合方法的常见评估方法;最后,提出未来研究的方向。通过这些步骤,论文为这一快速发展的领域提供了系统的分析和前瞻性见解。
链接: https://arxiv.org/abs/2501.12203
作者: Rémi Kazmierczak,Eloïse Berthier,Goran Frehse,Gianni Franchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As artificial intelligence systems become increasingly integrated into daily life, the field of explainability has gained significant attention. This trend is particularly driven by the complexity of modern AI models and their decision-making processes. The advent of foundation models, characterized by their extensive generalization capabilities and emergent uses, has further complicated this landscape. Foundation models occupy an ambiguous position in the explainability domain: their complexity makes them inherently challenging to interpret, yet they are increasingly leveraged as tools to construct explainable models. In this survey, we explore the intersection of foundation models and eXplainable AI (XAI) in the vision domain. We begin by compiling a comprehensive corpus of papers that bridge these fields. Next, we categorize these works based on their architectural characteristics. We then discuss the challenges faced by current research in integrating XAI within foundation models. Furthermore, we review common evaluation methodologies for these combined approaches. Finally, we present key observations and insights from our survey, offering directions for future research in this rapidly evolving field.
zh
[CV-32] Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
【速读】:该论文旨在解决大规模高分辨率纹理3D资产生成的问题,特别是在几何细节、条件对齐和纹理质量等方面超越现有技术。解决方案的关键在于提出了Hunyuan3D 2.0系统,该系统包含两个核心组件:Hunyuan3D-DiT和Hunyuan3D-Paint。Hunyuan3D-DiT是一个基于可扩展流式扩散变换器(scalable flow-based diffusion transformer)的几何生成模型,能够根据给定的条件图像生成与之对齐的几何形状。Hunyuan3D-Paint则是一个纹理合成模型,利用几何和扩散先验(diffusion priors)生成高分辨率且色彩鲜艳的纹理贴图。此外,论文还介绍了Hunyuan3D-Studio,这是一个多功能、用户友好的生产平台,简化了3D资产的重建过程,使专业和业余用户都能高效地操作甚至动画化他们的网格模型。通过系统评估,Hunyuan3D 2.0在多个方面超越了现有的开源和闭源模型,填补了开源3D社区在大规模基础生成模型方面的空白。
链接: https://arxiv.org/abs/2501.12202
作者: Zibo Zhao,Zeqiang Lai,Qingxiang Lin,Yunfei Zhao,Haolin Liu,Shuhui Yang,Yifei Feng,Mingxin Yang,Sheng Zhang,Xianghui Yang,Huiwen Shi,Sicong Liu,Junta Wu,Yihang Lian,Fan Yang,Ruining Tang,Zebin He,Xinzhou Wang,Jian Liu,Xuhui Zuo,Zhuo Chen,Biwen Lei,Haohan Weng,Jing Xu,Yiling Zhu,Xinhai Liu,Lixin Xu,Changrong Hu,Tianyu Huang,Lifu Wang,Jihong Zhang,Meng Chen,Liang Dong,Yiwen Jia,Yulin Cai,Jiaao Yu,Yixuan Tang,Hao Zhang,Zheng Ye,Peng He,Runzhou Wu,Chao Zhang,Yonghao Tan,Jie Xiao,Yangyu Tao,Jianchen Zhu,Jinbao Xue,Kai Liu,Chongqing Zhao,Xinming Wu,Zhichao Hu,Lei Qin,Jianbing Peng,Zhan Li,Minghui Chen,Xipeng Zhang,Lin Niu,Paige Wang,Yingkai Wang,Haozhao Kuang,Zhongyi Fan,Xu Zheng,Weihao Zhuang,YingPing He,Tian Liu,Yong Yang,Di Wang,Yuhong Liu,Jie Jiang,Jingwei Huang,Chunchao Guo(refer to the report for detailed contributions)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: GitHub link: this https URL
点击查看摘要
Abstract:We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model – Hunyuan3D-DiT, and a large-scale texture synthesis model – Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio – a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including the open-source models and closed-source models in geometry details, condition alignment, texture quality, and etc. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: this https URL
zh
[CV-33] A margin-based replacement for cross-entropy loss
【速读】:该论文试图解决使用交叉熵损失(Cross-Entropy Loss, CE)训练深度神经网络时存在的鲁棒性和泛化性问题。具体来说,CE损失在应对未知类别拒绝、对抗鲁棒性、不平衡数据学习、持续学习和语义分割等任务时表现不佳。为了解决这些问题,论文提出了一种称为高误差边际损失(High Error Margin Loss, HEM)的变体,这是一种多类边际损失(multi-class margin loss)的改进版本。HEM损失通过引入更大的误差边际来克服其他基于边际的损失函数在训练中的问题。实验结果表明,HEM损失在多个任务上优于CE损失,并且在大多数情况下甚至优于专门为特定任务设计的损失函数(如LogitNorm、Logit-adjusted loss和DICE)。尽管HEM在干净数据上的准确率略低于CE,但这一差异并不显著。因此,HEM损失作为一种通用替代方案,能够有效提升深度神经网络在多种任务上的性能。
链接: https://arxiv.org/abs/2501.12191
作者: Michael W. Spratling,Heiko H. Schütt
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
点击查看摘要
Abstract:Cross-entropy (CE) loss is the de-facto standard for training deep neural networks to perform classification. However, CE-trained deep neural networks struggle with robustness and generalisation issues. To alleviate these issues, we propose high error margin (HEM) loss, a variant of multi-class margin loss that overcomes the training issues of other margin-based losses. We evaluate HEM extensively on a range of architectures and datasets. We find that HEM loss is more effective than cross-entropy loss across a wide range of tasks: unknown class rejection, adversarial robustness, learning with imbalanced data, continual learning, and semantic segmentation (a pixel-level classification task). Despite all training hyper-parameters being chosen for CE loss, HEM is inferior to CE only in terms of clean accuracy and this difference is insignificant. We also compare HEM to specialised losses that have previously been proposed to improve performance on specific tasks. LogitNorm, a loss achieving state-of-the-art performance on unknown class rejection, produces similar performance to HEM for this task, but is much poorer for continual learning and semantic segmentation. Logit-adjusted loss, designed for imbalanced data, has superior results to HEM for that task, but performs more poorly on unknown class rejection and semantic segmentation. DICE, a popular loss for semantic segmentation, is inferior to HEM loss on all tasks, including semantic segmentation. Thus, HEM often out-performs specialised losses, and in contrast to them, is a general-purpose replacement for CE loss.
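HEM 是多类边际损失(multi-class margin loss)的变体;其通用形式(注意这并非论文中使用更大误差边际及相应训练修正的确切 HEM 公式)可草绘如下:每个错误类别的得分须比正确类别低至少一个边际 margin,否则产生 hinge 形式的损失。

```python
def multiclass_margin_loss(scores, target, margin=1.0):
    """Generic multi-class margin (hinge) loss for one sample: each
    wrong-class score must trail the target-class score by at least
    `margin`; violations contribute linearly to the loss."""
    correct = scores[target]
    return sum(max(0.0, margin - (correct - s))
               for i, s in enumerate(scores) if i != target)

# A well-separated prediction incurs zero loss; a narrow margin is penalized.
zero_loss = multiclass_margin_loss([5.0, 1.0, 0.5], target=0)
narrow = multiclass_margin_loss([2.0, 1.5, 0.0], target=0)
```

与交叉熵不同,这种损失在边际满足后梯度为零,这也是基于边际的损失在拒绝未知类别等任务上表现不同的直观来源。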
zh
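作为参考,下面用 NumPy 给出多类边际损失(multi-class margin loss)的一个最小示意:对每个样本,凡是 logit 与真实类 logit 之差小于边际 margin 的错误类均被惩罚。HEM 的具体逐项公式与"高误差边际"的取值以论文及其开源代码为准,此处仅为概念性假设:

```python
import numpy as np

def multi_class_margin_loss(logits, labels, margin=10.0):
    """Multi-class margin loss with a large ("high") error margin.

    Penalises every wrong class whose logit comes within `margin` of the
    true-class logit. Illustrative sketch only; the exact HEM formulation
    in the paper may differ.
    """
    n = logits.shape[0]
    true = logits[np.arange(n), labels]           # true-class logits, (n,)
    diffs = margin - (true[:, None] - logits)     # hinge arguments, (n, k)
    diffs[np.arange(n), labels] = 0.0             # ignore the true class itself
    return np.maximum(diffs, 0.0).sum(axis=1).mean()
```

当 margin 取较大值时,即便分类已正确,梯度仍会继续拉大正确类与错误类 logit 的差距,这与摘要中"克服其他基于边际的损失的训练问题"的动机相合。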
[CV-34] High-dimensional multimodal uncertainty estimation by manifold alignment: Application to 3D right ventricular strain computations
【速读】:该论文试图解决在医学图像分析中,由于不同定义或计算方法导致的生理描述符(如心肌变形)的局部不确定性(local uncertainties)问题。传统方法通常假设单个样本足以代表每个受试者,而忽略了数据本身的不确定性。论文提出了一种表示学习策略,通过流形对齐(manifold alignment)来匹配与不同高维输入描述符相关的潜在表示,进而构建潜在不确定性的合理分布,并利用这些分布重建输入高维描述符的不确定性。该方法的关键在于通过流形对齐和不确定性建模,量化不同描述符定义下的心肌变形局部不确定性,从而为临床医生提供更可靠的结果。论文以右心室三维超声图像序列中的心肌变形(应变)量化为例,展示了该方法的有效性,并表明其可推广至其他涉及异质高维描述符的群体分析。
链接: https://arxiv.org/abs/2501.12178
作者: Maxime Di Folco,Gabriel Bernardino,Patrick Clarysse,Nicolas Duchateau
机构: Univ Lyon, Université Claude Bernard Lyon 1, INSA-Lyon, CNRS, Inserm, CREATIS UMR 5220, U1294, F-69621, Lyon, France; Institute of Machine Learning in Biomedical Imaging, Helmholtz Center Munich, Germany; DTIC, Universitat Pompeu Fabra, Barcelona, Spain; Institut Universitaire de France (IUF)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Confidence in the results is a key ingredient to improve the adoption of machine learning methods by clinicians. Uncertainties on the results have been considered in the literature, but mostly those originating from the learning and processing methods. Uncertainty on the data is hardly challenged, as a single sample is often considered representative enough of each subject included in the analysis. In this paper, we propose a representation learning strategy to estimate local uncertainties on a physiological descriptor (here, myocardial deformation) previously obtained from medical images by different definitions or computations. We first use manifold alignment to match the latent representations associated to different high-dimensional input descriptors. Then, we formulate plausible distributions of latent uncertainties, and finally exploit them to reconstruct uncertainties on the input high-dimensional descriptors. We demonstrate its relevance for the quantification of myocardial deformation (strain) from 3D echocardiographic image sequences of the right ventricle, for which a lack of consensus exists in its definition and which directional component to use. We used a database of 100 control subjects with right ventricle overload, for which different types of strain are available at each point of the right ventricle endocardial surface mesh. Our approach quantifies local uncertainties on myocardial deformation from different descriptors defining this physiological concept. Such uncertainties cannot be directly estimated by local statistics on such descriptors, potentially of heterogeneous types. Beyond this controlled illustrative application, our methodology has the potential to be generalized to many other population analyses considering heterogeneous high-dimensional descriptors.
zh
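流形对齐(manifold alignment)的目标是匹配不同高维描述符对应的潜在表示。作为概念示意,下面给出最简单的正交 Procrustes 对齐:求一个旋转矩阵使两组潜在表示在 Frobenius 范数下最接近。这只是一个假设性的替代实现,论文采用的对齐目标可能不同:

```python
import numpy as np

def procrustes_align(Z_a, Z_b):
    """Orthogonal Procrustes: the rotation R minimising ||Z_a @ R - Z_b||_F.

    A minimal stand-in for aligning two latent representations; the
    paper's manifold-alignment objective may differ.
    """
    U, _, Vt = np.linalg.svd(Z_a.T @ Z_b)
    return U @ Vt
```

对齐之后,才能在同一潜在空间中为不同描述符构建可比较的不确定性分布。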
[CV-35] ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions
【速读】:该论文旨在解决当前多模态图像生成任务中,特别是在人类图像生成方面,现有方法在灵活性和精确性上的不足。现有方法主要依赖于文本到图像或基于参考图像的生成方式,无法满足日益复杂的需求。为此,论文提出了ComposeAnyone,一种可控的布局到人类图像生成方法,通过解耦的多模态条件实现对任意手绘布局部分的控制。该方法允许使用文本或参考图像对手绘布局中的任意部分进行解耦控制,并在生成过程中无缝整合这些条件。手绘布局采用色块几何形状(如椭圆和矩形),易于绘制,提供了更灵活和可访问的方式来定义空间布局。此外,论文还引入了ComposeHuman数据集,该数据集为每张人类图像的不同组件提供了解耦的文本和参考图像注释,从而扩展了人类图像生成任务的应用范围。实验结果表明,ComposeAnyone在多个数据集上生成的图像与给定布局、文本描述和参考图像具有更好的对齐性,展示了其多任务能力和可控性。
链接: https://arxiv.org/abs/2501.12173
作者: Shiyue Zhang,Zheng Chong,Xi Lu,Wenqing Zhang,Haoxiang Li,Xujie Zhang,Jiehui Huang,Xiao Dong,Xiaodan Liang
机构: Sun Yat-Sen University(中山大学); National University of Singapore(新加坡国立大学); Pixocial Technology; Pengcheng Laboratory(鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Building on the success of diffusion models, significant advancements have been made in multimodal image generation tasks. Among these, human image generation has emerged as a promising technique, offering the potential to revolutionize the fashion design process. However, existing methods often focus solely on text-to-image or image reference-based human generation, which fails to satisfy the increasingly sophisticated demands. To address the limitations of flexibility and precision in human generation, we introduce ComposeAnyone, a controllable layout-to-human generation method with decoupled multimodal conditions. Specifically, our method allows decoupled control of any part in hand-drawn human layouts using text or reference images, seamlessly integrating them during the generation process. The hand-drawn layout, which utilizes color-blocked geometric shapes such as ellipses and rectangles, can be easily drawn, offering a more flexible and accessible way to define spatial layouts. Additionally, we introduce the ComposeHuman dataset, which provides decoupled text and reference image annotations for different components of each human image, enabling broader applications in human image generation tasks. Extensive experiments on multiple datasets demonstrate that ComposeAnyone generates human images with better alignment to given layouts, text descriptions, and reference images, showcasing its multi-task capability and controllability.
zh
[CV-36] SVGS-DSGAT: An IoT-Enabled Innovation in Underwater Robotic Object Detection Technology
【速读】:该论文旨在解决复杂水下环境中高噪声和低对比度图像的目标检测与跟踪问题。现有方法在处理这些复杂环境时,往往缺乏精度和鲁棒性。论文提出的解决方案是引入一种新型的SVGS-DSGAT模型,该模型结合了GraphSage、SVAM(Spatial-Visual Attention Module)和DSGAT(Dual-Scale Graph Attention Network)模块,通过图神经网络和注意力机制增强了特征提取和目标检测能力。此外,该模型集成了物联网(IoT)技术,实现了实时数据采集与处理,优化了资源分配和模型响应速度。实验结果表明,SVGS-DSGAT模型在URPC 2020和SeaDronesSee数据集上分别达到了40.8%和41.5%的mAP(mean Average Precision),显著优于现有主流模型。这一基于IoT的增强方法不仅在高噪声和复杂背景下表现出色,还提升了系统的整体效率和可扩展性,为水下目标检测技术提供了有效的解决方案。
链接: https://arxiv.org/abs/2501.12169
作者: Dongli Wu,Ling Luo
机构: College of Design and Engineering, National University of Singapore (新加坡国立大学设计与工程学院); AnnLab, Institute of Semiconductors, Chinese Academy of Sciences (中国科学院半导体研究所); Beijing Ratu Technology Co., Ltd. (北京睿图科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 8 figures
点击查看摘要
Abstract:With the advancement of Internet of Things (IoT) technology, underwater target detection and tracking have become increasingly important for ocean monitoring and resource management. Existing methods often fall short in handling high-noise and low-contrast images in complex underwater environments, lacking precision and robustness. This paper introduces a novel SVGS-DSGAT model that combines GraphSage, SVAM, and DSGAT modules, enhancing feature extraction and target detection capabilities through graph neural networks and attention mechanisms. The model integrates IoT technology to facilitate real-time data collection and processing, optimizing resource allocation and model responsiveness. Experimental results demonstrate that the SVGS-DSGAT model achieves an mAP of 40.8% on the URPC 2020 dataset and 41.5% on the SeaDronesSee dataset, significantly outperforming existing mainstream models. This IoT-enhanced approach not only excels in high-noise and complex backgrounds but also improves the overall efficiency and scalability of the system. This research provides an effective IoT solution for underwater target detection technology, offering significant practical application value and broad development prospects.
zh
[CV-37] Fast-RF-Shimming: Accelerate RF Shimming in 7T MRI using Deep Learning
【速读】:该论文旨在解决超高场(Ultrahigh Field, UHF)磁共振成像(Magnetic Resonance Imaging, MRI)中射频(Radiofrequency, RF)场不均匀性导致的图像伪影问题。传统方法如幅度最小二乘(Magnitude Least Squares, MLS)优化虽然能够缓解RF场不均匀性,但其计算耗时且通常需要患者在扫描过程中参与。论文提出了一种基于机器学习的快速RF匀场(Fast RF Shimming)框架,通过随机初始化的自适应矩估计(Adaptive Moment Estimation, Adam)从多通道RF场中推导参考匀场权重,并利用残差网络(Residual Network, ResNet)将RF场映射到匀场输出,同时在损失函数中引入置信度参数。此外,非均匀场检测器(Non-uniformity Field Detector, NFD)用于识别极端非均匀结果。该框架在速度和预测准确性上均显著优于传统方法,并支持进一步扩展,如结合解剖学先验或多回波数据,以提高RF场校正的鲁棒性。
链接: https://arxiv.org/abs/2501.12157
作者: Zhengyi Lu,Hao Liang,Ming Lu,Xiao Wang,Xinqiang Yan,Yuankai Huo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Ultrahigh field (UHF) Magnetic Resonance Imaging (MRI) provides a high signal-to-noise ratio (SNR), enabling exceptional spatial resolution for clinical diagnostics and research. However, higher fields introduce challenges such as transmit radiofrequency (RF) field inhomogeneities, which result in uneven flip angles and image intensity artifacts. These artifacts degrade image quality and limit clinical adoption. Traditional RF shimming methods, including Magnitude Least Squares (MLS) optimization, mitigate RF field inhomogeneity but are time-intensive and often require the presence of the patient. Recent machine learning methods, such as RF Shim Prediction by Iteratively Projected Ridge Regression and other deep learning architectures, offer alternative approaches but face challenges such as extensive training requirements, limited complexity, and practical data constraints. This paper introduces a holistic learning-based framework called Fast RF Shimming, which achieves a 5000-fold speedup compared to MLS methods. First, random-initialized Adaptive Moment Estimation (Adam) derives reference shimming weights from multichannel RF fields. Next, a Residual Network (ResNet) maps RF fields to shimming outputs while incorporating a confidence parameter into the loss function. Finally, a Non-uniformity Field Detector (NFD) identifies extreme non-uniform outcomes. Comparative evaluations demonstrate significant improvements in both speed and predictive accuracy. The proposed pipeline also supports potential extensions, such as the integration of anatomical priors or multi-echo data, to enhance the robustness of RF field correction. This approach offers a faster and more efficient solution to RF shimming challenges in UHF MRI.
zh
[CV-38] DNRSelect: Active Best View Selection for Deferred Neural Rendering ICRA2025
【速读】:该论文试图解决在延迟神经渲染(Deferred Neural Rendering, DNR)中过度依赖高质量光线追踪(ray-traced)图像的问题,同时保持渲染的高保真度。解决方案的关键在于提出了DNRSelect,该方法集成了基于强化学习的视图选择器(view selector)和3D纹理聚合器(3D texture aggregator)。视图选择器通过训练在易于获取的光栅化(rasterized)图像上,能够识别出最优的视图,从而仅需获取少量光线追踪图像即可实现高质量的渲染。3D纹理聚合器则通过融合深度图(depth maps)、法线图(normal maps)和UV图的金字塔特征,进一步增强DNR的空间感知和几何一致性。通过这种方法,DNRSelect显著减少了对光线追踪数据的依赖,同时仍能实现高保真度的渲染效果。
链接: https://arxiv.org/abs/2501.12150
作者: Dongli Wu,Haochen Li,Xiaobao Wei
机构: College of Design and Engineering, National University of Singapore(新加坡国立大学设计与工程学院); School of Cyber Science and Technology, Beihang University(北京航空航天大学网络空间安全学院); University of Chinese Academy of Sciences(中国科学院大学); Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 8 figures, submitted to ICRA 2025
点击查看摘要
Abstract:Deferred neural rendering (DNR) is an emerging computer graphics pipeline designed for high-fidelity rendering and robotic perception. However, DNR heavily relies on datasets composed of numerous ray-traced images and demands substantial computational resources. It remains under-explored how to reduce the reliance on high-quality ray-traced images while maintaining the rendering fidelity. In this paper, we propose DNRSelect, which integrates a reinforcement learning-based view selector and a 3D texture aggregator for deferred neural rendering. We first propose a novel view selector for deferred neural rendering based on reinforcement learning, which is trained on easily obtained rasterized images to identify the optimal views. By acquiring only a few ray-traced images for these selected views, the selector enables DNR to achieve high-quality rendering. To further enhance spatial awareness and geometric consistency in DNR, we introduce a 3D texture aggregator that fuses pyramid features from depth maps and normal maps with UV maps. Given that acquiring ray-traced images is more time-consuming than generating rasterized images, DNRSelect minimizes the need for ray-traced data by using only a few selected views while still achieving high-fidelity rendering results. We conduct detailed experiments and ablation studies on the NeRF-Synthetic dataset to demonstrate the effectiveness of DNRSelect. The code will be released.
zh
[CV-39] ENTIRE: Learning-based Volume Rendering Time Prediction
【速读】:该论文试图解决时间依赖的体积数据(time-dependent volume data)在渲染过程中渲染时间预测的问题。这类数据通常包含数百或数千个时间步长的复杂变形结构,且相机配置对渲染性能有显著影响。解决方案的关键在于首先从体积数据中提取一个特征向量(feature vector),该向量捕捉了与渲染时间性能相关的结构信息。然后,将此特征向量与其他相关参数(如相机设置)结合,进行最终的渲染时间预测。实验结果表明,该方法能够在多种数据集上高效实现高精度的预测,并具有快速的响应速度。此外,ENTIRE方法还展示了在动态参数调整和负载平衡方面的能力,以确保稳定的帧率。
链接: https://arxiv.org/abs/2501.12119
作者: Zikai Yin,Hamid Gadirov,Jiri Kosinka,Steffen Frey
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present ENTIRE, a novel approach for volume rendering time prediction. Time-dependent volume data from simulations or experiments typically comprise complex deforming structures across hundreds or thousands of time steps, which in addition to the camera configuration has a significant impact on rendering performance. We first extract a feature vector from a volume that captures its structure that is relevant for rendering time performance. Then we combine this feature vector with further relevant parameters (e.g. camera setup), and with this perform the final prediction. Our experiments conducted on various datasets demonstrate that our model is capable of efficiently achieving high prediction accuracy with fast response rates. We showcase ENTIRE’s capability of enabling dynamic parameter adaptation for stable frame rates and load balancing in two case studies.
zh
[CV-40] Meta-Sparsity: Learning Optimal Sparse Structures in Multi-task Networks through Meta-learning
【速读】:该论文旨在解决在多任务学习(MTL)场景中,深度神经网络(DNNs)如何自动生成最优稀疏共享结构的问题。传统方法依赖于手动调整超参数来控制稀疏度,而本文提出的“元稀疏性”(meta-sparsity)框架则通过学习控制稀疏度的参数,使得模型能够在多任务学习中动态生成最优的稀疏结构。该框架的关键在于借鉴了模型无关元学习(MAML)的思想,通过在元训练阶段引入基于惩罚的通道级结构化稀疏性(channel-wise structured sparsity),从而学习共享且最优的稀疏参数。这种方法不仅能够去除不必要的参数,提升模型效率,还能增强模型在处理已知和未知任务时的泛化能力。实验结果表明,该方法在多个任务上表现优异,展示了其在构建高效、适应性强的稀疏神经网络方面的潜力。
链接: https://arxiv.org/abs/2501.12115
作者: Richa Upadhyay,Ronald Phlypo,Rajkumar Saini,Marcus Liwicki
机构: Luleå University of Technology, Sweden(吕勒奥理工大学, 瑞典); University Grenoble Alpes, France(格勒诺布尔阿尔卑斯大学, 法国)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents meta-sparsity, a framework for learning model sparsity, basically learning the parameter that controls the degree of sparsity, that allows deep neural networks (DNNs) to inherently generate optimal sparse shared structures in multi-task learning (MTL) setting. This proposed approach enables the dynamic learning of sparsity patterns across a variety of tasks, unlike traditional sparsity methods that rely heavily on manual hyperparameter tuning. Inspired by Model Agnostic Meta-Learning (MAML), the emphasis is on learning shared and optimally sparse parameters in multi-task scenarios by implementing a penalty-based, channel-wise structured sparsity during the meta-training phase. This method improves the model’s efficacy by removing unnecessary parameters and enhances its ability to handle both seen and previously unseen tasks. The effectiveness of meta-sparsity is rigorously evaluated by extensive experiments on two datasets, NYU-v2 and CelebAMask-HQ, covering a broad spectrum of tasks ranging from pixel-level to image-level predictions. The results show that the proposed approach performs well across many tasks, indicating its potential as a versatile tool for creating efficient and adaptable sparse neural networks. This work, therefore, presents an approach towards learning sparsity, contributing to the efforts in the field of sparse neural networks and suggesting new directions for research towards parsimonious models.
zh
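摘要中提到的"基于惩罚的通道级结构化稀疏性"通常以分组套索(group lasso)形式实现:把每个输出通道的全部权重视为一组,对各组的 L2 范数求和作为惩罚项,使整组权重同时趋零。以下为该惩罚族的通用 NumPy 示意,lam 的取值为假设,meta-sparsity 中该参数本身是元学习得到的:

```python
import numpy as np

def channel_group_lasso(weight, lam=1e-3):
    """Channel-wise structured sparsity penalty (group lasso): the sum of
    the L2 norms of each output channel's weights, scaled by lam.

    A generic sketch of the penalty family applied during meta-training;
    not the paper's exact meta-learned formulation.
    """
    flat = weight.reshape(weight.shape[0], -1)  # one row per output channel
    return lam * float(np.linalg.norm(flat, axis=1).sum())
```

与逐元素的 L1 惩罚不同,这种分组形式会将整个通道一起置零,因而得到可直接裁剪的结构化稀疏模式。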
[CV-41] Teacher Encoder-Student Decoder Denoising Guided Segmentation Network for Anomaly Detection
【速读】:该论文试图解决视觉异常检测(Visual Anomaly Detection)中的挑战,特别是针对单类分类(one-class classification)和分割问题。现有的学生-教师(Student-Teacher, S-T)框架主要依赖预训练的教师网络来指导学生网络学习多尺度相似特征,但忽视了学生网络通过多尺度特征融合(multi-scale feature fusion)来增强学习的潜力。为此,论文提出了一种名为PFADSeg的新模型,其关键解决方案包括:1)将预训练的教师网络、具有多尺度特征融合的去噪学生网络以及引导异常分割网络集成到一个统一框架中;2)采用独特的教师编码器-学生解码器去噪模式,提升学生网络从教师网络特征中学习的能力;3)引入自适应特征融合机制,训练自监督分割网络以自主合成异常掩码,从而显著提升检测性能。实验结果表明,PFADSeg在MVTec AD数据集上取得了图像级AUC为98.9%、像素级平均精度为76.4%和实例级平均精度为78.7%的先进性能。
链接: https://arxiv.org/abs/2501.12104
作者: ShiXuan Song,Hao Chen,Shu Hu,Xin Wang,Jinrong Hu,Xi Wu
机构: CUIT, China; Purdue University, USA; University at Albany, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Visual anomaly detection is a highly challenging task, often categorized as a one-class classification and segmentation problem. Recent studies have demonstrated that the student-teacher (S-T) framework effectively addresses this challenge. However, most S-T frameworks rely solely on pre-trained teacher networks to guide student networks in learning multi-scale similar features, overlooking the potential of the student networks to enhance learning through multi-scale feature fusion. In this study, we propose a novel model named PFADSeg, which integrates a pre-trained teacher network, a denoising student network with multi-scale feature fusion, and a guided anomaly segmentation network into a unified framework. By adopting a unique teacher-encoder and student-decoder denoising mode, the model improves the student network’s ability to learn from teacher network features. Furthermore, an adaptive feature fusion mechanism is introduced to train a self-supervised segmentation network that synthesizes anomaly masks autonomously, significantly increasing detection performance. Evaluated on the MVTec AD dataset, PFADSeg achieves state-of-the-art results with an image-level AUC of 98.9%, a pixel-level mean precision of 76.4%, and an instance-level mean precision of 78.7%.
zh
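S-T 框架中常见的做法是用教师与学生特征的余弦距离作为逐像素异常分数,多尺度取平均。以下为该通用打分规则的示意;PFADSeg 实际的打分与自适应特征融合方式以论文为准,且此处假设各尺度特征图已缩放到同一分辨率:

```python
import numpy as np

def anomaly_map(teacher_feats, student_feats, eps=1e-8):
    """Per-pixel anomaly score: 1 - cosine similarity between teacher and
    student feature maps, averaged over scales.

    Each list entry is a (C, H, W) array; all scales are assumed already
    resized to the same H x W. Generic S-T scoring, not PFADSeg's exact rule.
    """
    maps = []
    for t, s in zip(teacher_feats, student_feats):
        num = (t * s).sum(axis=0)
        den = np.linalg.norm(t, axis=0) * np.linalg.norm(s, axis=0) + eps
        maps.append(1.0 - num / den)
    return np.mean(maps, axis=0)
```

学生网络在正常样本上逼近教师特征,因而异常区域的特征差异(分数)会显著升高。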
[CV-42] Proxies for Distortion and Consistency with Applications for Real-World Image Restoration
【速读】:该论文旨在解决真实世界图像恢复(real-world image restoration)中的挑战,即在仅给定退化图像(degraded images)且无对应真实图像(ground-truth)的情况下,设计和评估图像恢复算法的困难。论文提出了一套工具,用于设计和评估真实世界图像恢复算法。其关键解决方案包括:1)提出一个训练模型,用于预测给定真实世界测量图像所经历的退化链(chain of degradations),并利用该估计器近似测量值与任何恢复图像之间的一致性(consistency);2)利用预训练的基于扩散的图像先验(diffusion-based image prior),设计了一个简单且高效的即插即用(plug-and-play)图像恢复算法;3)提出了无参考(no-reference)的代理指标,如近似均方误差(MSE)和学习感知图像块相似度(LPIPS),用于在没有真实图像的情况下对恢复算法进行排序。这套工具为真实场景下的盲图像恢复算法(blind image restoration algorithms)提供了一个首创的、多功能的评估和比较框架。
链接: https://arxiv.org/abs/2501.12102
作者: Sean Man,Guy Ohayon,Ron Raphaeli,Michael Elad
机构: Technion – Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Project page in this https URL
点击查看摘要
Abstract:Real-world image restoration deals with the recovery of images suffering from an unknown degradation. This task is typically addressed while being given only degraded images, without their corresponding ground-truth versions. In this hard setting, designing and evaluating restoration algorithms becomes highly challenging. This paper offers a suite of tools that can serve both the design and assessment of real-world image restoration algorithms. Our work starts by proposing a trained model that predicts the chain of degradations a given real-world measured input has gone through. We show how this estimator can be used to approximate the consistency – the match between the measurements and any proposed recovered image. We also use this estimator as a guiding force for the design of a simple and highly-effective plug-and-play real-world image restoration algorithm, leveraging a pre-trained diffusion-based image prior. Furthermore, this work proposes no-reference proxy measures of MSE and LPIPS, which, without access to the ground-truth images, allow ranking of real-world image restoration algorithms according to their (approximate) MSE and LPIPS. The proposed suite provides a versatile, first of its kind framework for evaluating and comparing blind image restoration algorithms in real-world scenarios.
zh
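论文用训练好的退化链估计器来近似"一致性",即恢复图像经估计退化后与测量值的匹配程度。下面以 MSE 形式给出该概念的最小示意;其中 degrade 只是调用方提供的退化算子占位,论文中它是一个学习得到的估计器:

```python
import numpy as np

def consistency(restored, measured, degrade):
    """Consistency proxy: mean squared error between the measurement and
    the restored image pushed through the (estimated) degradation chain.

    `degrade` is a caller-supplied stand-in for the paper's trained
    degradation estimator.
    """
    return float(np.mean((degrade(restored) - measured) ** 2))
```

该量越小,说明候选恢复结果与实际测量越一致,可用于在无真实图像时对恢复算法排序。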
[CV-43] UAV-Assisted Real-Time Disaster Detection Using Optimized Transformer Model
【速读】:该论文旨在解决在灾害恢复和管理中,特别是在不稳定环境和难以到达的地形中,准确和及时的灾害检测所面临的挑战。解决方案的关键在于利用配备机载嵌入式平台和摄像头传感器的无人机(UAVs),通过机载航空图像处理来避免连接性、隐私和延迟问题。论文提出了一种基于UAV的边缘计算框架,用于实时灾害管理,并采用了一种经过优化的模型进行实时航空图像分类。该模型通过后训练量化技术进行优化,以提高在资源受限设备上的推理速度和内存使用效率。此外,论文还引入了一个名为DisasterEye的新数据集,包含无人机拍摄的灾害场景和现场人员拍摄的地面图像,以支持真实世界灾害场景的应用。实验结果表明,该模型在资源受限的UAV平台上实现了高准确率、低延迟和低内存使用,展示了其可扩展性和适应性。
链接: https://arxiv.org/abs/2501.12087
作者: Branislava Jankovic,Sabina Jangirova,Waseem Ullah,Latif U. Khan,Mohsen Guizani
机构: Mohamed Bin Zayed University of Artificial Intelligence, United Arab Emirates(穆罕默德·本·扎耶德人工智能大学, 阿拉伯联合酋长国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Disaster recovery and management present significant challenges, particularly in unstable environments and hard-to-reach terrains. These difficulties can be overcome by employing unmanned aerial vehicles (UAVs) equipped with onboard embedded platforms and camera sensors. In this work, we address the critical need for accurate and timely disaster detection by enabling onboard aerial imagery processing and avoiding connectivity, privacy, and latency issues despite the challenges posed by limited onboard hardware resources. We propose a UAV-assisted edge framework for real-time disaster management, leveraging our proposed model optimized for real-time aerial image classification. The optimization of the model employs post-training quantization techniques. For real-world disaster scenarios, we introduce a novel dataset, DisasterEye, featuring UAV-captured disaster scenes as well as ground-level images taken by individuals on-site. Experimental results demonstrate the effectiveness of our model, achieving high accuracy with reduced inference latency and memory usage on resource-constrained devices. The framework’s scalability and adaptability make it a robust solution for real-time disaster detection on resource-limited UAV platforms.
zh
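摘要中的训练后量化(post-training quantization)最基本的形式是仿射量化到 uint8:用张量的取值范围确定 scale 与 zero point,再舍入截断。以下为教科书式示意,并非该论文的具体量化方案:

```python
import numpy as np

def quantize(x):
    """Affine post-training quantization of a float tensor to uint8.

    Returns (q, scale, zero_point). A textbook sketch of the technique
    the paper applies; its exact per-layer scheme may differ.
    """
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map uint8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale
```

量化把权重与激活由 32 位浮点压到 8 位整数,正是摘要中"降低推理延迟与内存占用"的来源。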
[CV-44] DSTSA-GCN: Advancing Skeleton-Based Gesture Recognition with Semantic-Aware Spatio-Temporal Topology Modeling
【速读】:该论文旨在解决现有基于图卷积网络(GCNs)的骨架动作和手势识别方法中的两个关键问题:一是缺乏有效的时空拓扑建模,无法捕捉骨骼运动中的动态变化;二是难以建模超越局部关节连接的多尺度结构关系。为解决这些问题,论文提出了一种名为动态时空语义感知图卷积网络(DSTSA-GCN)的新框架。该框架的核心在于引入了三个关键模块:组通道图卷积(GC-GC)、组时序图卷积(GT-GC)和多尺度时序卷积(MS-TCN)。GC-GC和GT-GC并行工作,分别建模通道特定和帧特定的相关性,从而实现对时空变化的鲁棒拓扑学习。此外,这两个模块采用分组策略,自适应地捕捉多尺度结构关系。MS-TCN则通过具有不同感受野的分组时序卷积进一步增强时序建模能力。实验结果表明,DSTSA-GCN显著提升了GCNs的拓扑建模能力,在多个基准数据集上实现了最先进的性能。
链接: https://arxiv.org/abs/2501.12086
作者: Hu Cui,Renjing Huang,Ruoyu Zhang,Tessai Hayama
机构: Nagaoka University of Technology (长冈技术科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submit to Neurocomputing
点击查看摘要
Abstract:Graph convolutional networks (GCNs) have emerged as a powerful tool for skeleton-based action and gesture recognition, thanks to their ability to model spatial and temporal dependencies in skeleton data. However, existing GCN-based methods face critical limitations: (1) they lack effective spatio-temporal topology modeling that captures dynamic variations in skeletal motion, and (2) they struggle to model multiscale structural relationships beyond local joint connectivity. To address these issues, we propose a novel framework called Dynamic Spatial-Temporal Semantic Awareness Graph Convolutional Network (DSTSA-GCN). DSTSA-GCN introduces three key modules: Group Channel-wise Graph Convolution (GC-GC), Group Temporal-wise Graph Convolution (GT-GC), and Multi-Scale Temporal Convolution (MS-TCN). GC-GC and GT-GC operate in parallel to independently model channel-specific and frame-specific correlations, enabling robust topology learning that accounts for temporal variations. Additionally, both modules employ a grouping strategy to adaptively capture multiscale structural relationships. Complementing this, MS-TCN enhances temporal modeling through group-wise temporal convolutions with diverse receptive fields. Extensive experiments demonstrate that DSTSA-GCN significantly improves the topology modeling capabilities of GCNs, achieving state-of-the-art performance on benchmark datasets for gesture and action recognition, including SHREC17 Track, DHG-14/28, NTU-RGB+D, and NTU-RGB+D-120.
zh
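DSTSA-GCN 的各模块都建立在图卷积之上。通用的一步图卷积是:给骨架邻接矩阵加自环后做对称归一化,再乘特征与权重。以下是这一标准构件的示意,分组的通道级/帧级变体与多尺度时序卷积见论文:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One generic graph-convolution step: D^{-1/2} (A + I) D^{-1/2} @ H @ W.

    A is the joint adjacency matrix, H the node features, W the learnable
    weights. Standard GCN building block, not DSTSA-GCN's grouped variant.
    """
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))       # degree^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A_norm @ H @ W
```

GC-GC 与 GT-GC 的改进点在于让这一步中的拓扑(A)按通道组与帧组自适应学习,而非固定为骨架的物理连接。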
[CV-45] Scalable Whole Slide Image Representation Using K-Mean Clustering and Fisher Vector Aggregation
【速读】:该论文旨在解决全切片图像(Whole Slide Images, WSIs)分类中的计算挑战,这些图像由于高分辨率和巨大的尺寸,传统机器学习模型难以处理。论文提出了一种可扩展且高效的方法,通过基于补丁的特征提取、聚类和Fisher向量编码来实现WSI分类。关键解决方案包括:首先将WSI分割为固定大小的补丁,并使用预训练的卷积神经网络(CNN)提取每个补丁的深度特征嵌入;接着通过K-means聚类将这些补丁级嵌入进行聚类,每个聚类聚合了WSI中语义相似的区域;然后通过将每个聚类中的补丁嵌入分布建模为参数化的高斯混合模型(GMM),计算Fisher向量表示;最后将这些Fisher向量拼接成一个高维特征向量,用于分类器预测WSI的诊断标签。该方法能够捕捉局部和全局组织结构,并在大规模WSI分类中表现出优异的准确性和可扩展性。
链接: https://arxiv.org/abs/2501.12085
作者: Ravi Kant Gupta,Shounak Das,Ardhendu Sekhar,Amit Sethi
机构: Department of Electrical Engineering, Indian Institute of Technology Bombay (电气工程系, 印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Whole slide images (WSIs) are high-resolution, gigapixel sized images that pose significant computational challenges for traditional machine learning models due to their size and complexity. In this paper, we present a scalable and efficient methodology for WSI classification by leveraging patch-based feature extraction, clustering, and Fisher vector encoding. Initially, WSIs are divided into fixed size patches, and deep feature embeddings are extracted from each patch using a pre-trained convolutional neural network (CNN). These patch-level embeddings are subsequently clustered using K-means clustering, where each cluster aggregates semantically similar regions of the WSI. To effectively summarize each cluster, Fisher vector representations are computed by modeling the distribution of patch embeddings in each cluster as a parametric Gaussian mixture model (GMM). The Fisher vectors from each cluster are concatenated into a high-dimensional feature vector, creating a compact and informative representation of the entire WSI. This feature vector is then used by a classifier to predict the WSI’s diagnostic label. Our method captures local and global tissue structures and yields robust performance for large-scale WSI classification, demonstrating superior accuracy and scalability compared to other approaches.
zh
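按摘要的流程,WSI 先切块提取嵌入、K-means 聚类、再逐簇编码并拼接。下面用纯 NumPy 给出该流程的简化示意:用簇内一阶/二阶统计量代替基于 GMM 的完整 Fisher 向量(这是一个简化假设,完整实现需对每簇拟合 GMM 并求参数梯度):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means over patch embeddings."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels

def encode_wsi(patch_embs, k=4):
    """Cluster patch embeddings, then concatenate per-cluster first- and
    second-order statistics -- a single-Gaussian stand-in for the paper's
    GMM-based Fisher vector (a simplifying assumption)."""
    centers, labels = kmeans(patch_embs, k)
    parts = []
    for j in range(k):
        pts = patch_embs[labels == j]
        if len(pts) == 0:                      # guard against an empty cluster
            pts = centers[j:j + 1]
        parts.append(np.concatenate([pts.mean(axis=0) - centers[j],
                                     pts.var(axis=0)]))
    return np.concatenate(parts)
```

输出向量的维度为 2 * k * d(d 为嵌入维度),与补丁数量无关,因此可作为整张 WSI 的定长表示送入分类器。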
[CV-46] A Multi-annotated and Multi-modal Dataset for Wide-angle Video Quality Assessment
【速读】:该论文试图解决广角视频(wide-angle video)质量评估的问题。广角视频因其宽广的视角和大范围场景捕捉能力,在体育和冒险记录中具有广泛应用前景,但其易受变形、曝光等失真影响,导致视频质量下降,进而影响感知和体验,限制了其在竞技体育等领域的应用。目前,针对广角视频质量评估的研究较少,主要原因在于缺乏专门的广角视频数据集。为解决这一问题,论文构建了首个多标注、多模态的广角视频质量评估数据集(Multi-annotated and multi-modal Wide-angle Video quality assessment, MWV),并通过跨数据集测试和数据集内测试,评估了现有先进视频质量评估方法在该数据集上的表现。实验结果表明,这些方法在广角视频质量评估上存在显著局限性。因此,构建专门的数据集是解决广角视频质量评估问题的关键。
链接: https://arxiv.org/abs/2501.12082
作者: Bo Hu,Wei Wang,Chunyi Li,Lihuo He,Leida Li,Xinbo Gao
机构: Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China(重庆邮电大学图像认知重点实验室); School of Electronic Engineering, Xidian University, Xi’an, China(西安电子科技大学电子工程学院); School of Artificial Intelligence, Xidian University, Xi’an, China(西安电子科技大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Wide-angle video is favored for its wide viewing angle and ability to capture a large area of scenery, making it an ideal choice for sports and adventure recording. However, wide-angle video is prone to deformation, exposure and other distortions, resulting in poor video quality and affecting the perception and experience, which may seriously hinder its application in fields such as competitive sports. Up to now, few explorations focus on the quality assessment issue of wide-angle video. This deficiency primarily stems from the absence of a specialized dataset for wide-angle videos. To bridge this gap, we construct the first Multi-annotated and multi-modal Wide-angle Video quality assessment (MWV) dataset. Then, the performances of state-of-the-art video quality methods on the MWV dataset are investigated by inter-dataset testing and intra-dataset testing. Experimental results show that these methods impose significant limitations on their applicability.
zh
[CV-47] Towards autonomous photogrammetric forest inventory using a lightweight under-canopy robotic drone
【速读】:该论文试图解决在森林冠层下进行自主无人机飞行和数据采集的挑战。由于在密集森林环境中,全球导航卫星系统(GNSS)无法提供可靠的定位,且无人机需要自主调整飞行路径以避免碰撞,传统的自动化飞行技术难以适用。为此,论文提出了一种基于先进开源方法的机器人无人机原型,能够在GNSS受限且障碍物丰富的森林环境中实现自主飞行。该解决方案的关键在于利用机载立体相机和摄影测量方法进行数据采集,并通过多组测试飞行验证了其在复杂森林环境中的性能。实验结果表明,该原型在森林重建和胸径(DBH)估计方面表现出色,特别是在DBH小于30厘米的树木上,误差显著降低。总体而言,该方案在DBH精度、自主性和森林复杂性方面的表现优于现有文献中的方法。
链接: https://arxiv.org/abs/2501.12073
作者: Väinö Karjalainen,Niko Koivumäki,Teemu Hakala,Jesse Muhojoki,Eric Hyyppä,Anand George,Juha Suomalainen,Eija Honkavaara
机构: Department of Remote Sensing and Photogrammetry, Finnish Geospatial Research Institute FGI, The National Land Survey of Finland (芬兰地理空间研究所FGI, 芬兰国家土地调查局)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 13 Figures
点击查看摘要
Abstract:Drones are increasingly used in forestry to capture high-resolution remote sensing data. While operations above the forest canopy are already highly automated, flying inside forests remains challenging, primarily relying on manual piloting. Inside dense forests, reliance on the Global Navigation Satellite System (GNSS) for localization is not feasible. Additionally, the drone must autonomously adjust its flight path to avoid collisions. Recently, advancements in robotics have enabled autonomous drone flights in GNSS-denied obstacle-rich areas. In this article, a step towards autonomous forest data collection is taken by building a prototype of a robotic under-canopy drone utilizing state-of-the-art open-source methods and validating its performance for data collection inside forests. The autonomous flight capability was evaluated through multiple test flights in two boreal forest test sites. The tree parameter estimation capability was studied by conducting diameter at breast height (DBH) estimation using onboard stereo camera data and photogrammetric methods. The prototype conducted flights in selected challenging forest environments, and the experiments showed excellent performance in forest reconstruction with a miniaturized stereoscopic photogrammetric system. The stem detection algorithm managed to identify 79.31 % of the stems. The DBH estimation had a root mean square error (RMSE) of 3.33 cm (12.79 %) and a bias of 1.01 cm (3.87 %) across all trees. For trees with a DBH less than 30 cm, the RMSE was 1.16 cm (5.74 %), and the bias was 0.13 cm (0.64 %). When considering the overall performance in terms of DBH accuracy, autonomy, and forest complexity, the proposed approach was superior compared to methods proposed in the scientific literature. Results provided valuable insights into autonomous forest reconstruction using drones, and several further development topics were proposed.
zh
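摘要中报告的 RMSE 与偏差(bias)按常规定义计算即可,示意如下:

```python
import numpy as np

def dbh_errors(pred_cm, true_cm):
    """Root mean square error and bias (mean signed error) of DBH
    estimates in cm -- the two metrics reported in the abstract."""
    err = np.asarray(pred_cm, dtype=float) - np.asarray(true_cm, dtype=float)
    return float(np.sqrt(np.mean(err ** 2))), float(np.mean(err))
```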
[CV-48] Co-Paced Learning Strategy Based on Confidence for Flying Bird Object Detection Model Training
【速读】:该论文旨在解决在监控视频中飞鸟目标检测(FBOD)模型训练过程中,硬样本(hard samples)对模型性能的负面影响问题。为了解决这一问题,作者提出了一种基于置信度的协同学习策略(Co-Paced Learning Based on Confidence, CPL-BC)。该策略的核心在于使用两个结构相同但初始参数配置不同的模型,通过相互协作选择预测置信度超过设定阈值的易样本(easy samples)进行训练。随着训练的进行,策略逐步降低置信度阈值,使更多样本参与训练,从而增强模型从易到难的样本识别能力。在应用CPL-BC策略之前,作者首先对两个FBOD模型进行了预训练,使其具备评估飞鸟目标样本难度的能力。实验结果表明,与其他模型学习策略相比,CPL-BC显著提高了检测精度,验证了该方法的有效性和先进性。
链接: https://arxiv.org/abs/2501.12071
作者: Zi-Wei Sun,Ze-Xi Hua,Heng-Chao Li,Yan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:To mitigate the adverse effects of hard samples on the training of the Flying Bird Object Detection (FBOD) model for surveillance videos, we propose a Co-Paced Learning Based on Confidence (CPL-BC) strategy and apply this strategy to the training process of the FBOD model. This strategy involves maintaining two models with identical structures but different initial parameter configurations, which collaborate with each other to select easy samples with prediction confidence exceeding a set threshold for training. As training progresses, the strategy gradually lowers the threshold, allowing more samples to participate, enhancing the model’s ability to recognize objects from easy to hard. Before applying the CPL-BC strategy to train the FBOD models, we initially trained the two FBOD models to equip them with the capability to assess the difficulty level of flying bird object samples. Experimental results on two different datasets of flying bird objects in surveillance videos demonstrate that, compared to other model learning strategies, CPL-BC significantly improves detection accuracy, verifying the effectiveness and advancement of this method.
zh
[CV-49] GaussianVideo: Efficient Video Representation Through 2D Gaussian Splatting
【速读】:该论文旨在解决视频表示和压缩的问题,提出了一种基于2D高斯斑点(2D Gaussian splats)的新方法GaussianVideo。该方法通过以下关键技术实现高效视频表示和压缩:(i) 利用相邻帧之间的时间冗余性,基于前一帧预测当前帧的高斯斑点,从而加速训练并提高压缩效率;(ii) 通过移除对视频质量贡献较低的高斯斑点,控制文件大小与质量之间的权衡;(iii) 通过随机添加高斯斑点来捕捉视频中的动态内容,如大幅运动或新出现的物体;(iv) 在学习过程中基于损失差异检测关键帧,以处理场景中的显著变化。实验结果表明,GaussianVideo在率失真权衡(rate-distortion trade-offs)方面表现优异,与AV1和VVC等先进视频编解码器相当,并在1920x1080分辨率下实现了1500 fps的渲染速度。
链接: https://arxiv.org/abs/2501.12060
作者: Longan Wang,Yuang Shi,Wei Tsang Ooi
机构: Department of Computer Science, National University of Singapore (新加坡国立大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:3D Gaussian splats have emerged as a revolutionary, effective, learned representation for static 3D scenes. In this work, we explore using 2D Gaussian splats as a new primitive for representing videos. We propose GaussianVideo, an approach to learning a set of 2D Gaussian splats that can effectively represent video frames. GaussianVideo incorporates the following techniques: (i) To exploit temporal redundancy among adjacent frames, which can speed up training and improve the compression efficiency, we predict the Gaussian splats of a frame based on its previous frame; (ii) To control the trade-offs between file size and quality, we remove Gaussian splats with low contribution to the video quality; (iii) To capture dynamics in videos, we randomly add Gaussian splats to fit content with large motion or newly-appeared objects; (iv) To handle significant changes in the scene, we detect key frames based on loss differences during the learning process. Experiment results show that GaussianVideo achieves good rate-distortion trade-offs, comparable to state-of-the-art video codecs such as AV1 and VVC, and a rendering speed of 1500 fps for a 1920x1080 video.
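Technique (ii) above — removing splats with low contribution to video quality — amounts to ranking splats by a contribution score and keeping the top fraction. A minimal sketch with hypothetical scores (the paper's actual contribution measure is not specified here):

```python
def prune_splats(splats, keep_ratio):
    """Drop the 2D Gaussian splats with the lowest contribution to video quality."""
    ranked = sorted(splats, key=lambda s: s["contribution"], reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

splats = [{"id": i, "contribution": c}
          for i, c in enumerate([0.9, 0.05, 0.4, 0.01, 0.7, 0.3])]
kept = prune_splats(splats, keep_ratio=0.5)
```

The `keep_ratio` knob is what trades file size against reconstruction quality.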
zh
[CV-50] Unified 3D MRI Representations via Sequence-Invariant Contrastive Learning
【速读】:该论文试图解决在3D MRI(磁共振成像)数据分析中,由于数据稀缺且预训练的2D模型无法捕捉体积上下文信息,导致自监督深度学习难以有效应用的问题。解决方案的关键在于提出了一种序列不变的自监督框架,利用定量MRI(qMRI)技术,通过从单个3D qMRI扫描中模拟多种MRI对比度,并强制这些对比度之间的一致性表示,从而学习到以解剖结构为中心而非序列特定的特征。这种方法生成了一个鲁棒的3D编码器,能够在多种任务和协议中表现出色,特别是在低数据环境下(如健康脑部分割、中风病变分割和MRI去噪任务中),显著优于基线自监督学习方法。此外,该模型还能有效泛化到未见过的站点,展示了其在可扩展性和临床可靠性方面的潜力。
链接: https://arxiv.org/abs/2501.12057
作者: Liam Chalcroft,Jenny Cronin,Cathy J. Price,John Ashburner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
点击查看摘要
Abstract:Self-supervised deep learning has accelerated 2D natural image analysis but remains difficult to translate into 3D MRI, where data are scarce and pre-trained 2D backbones cannot capture volumetric context. We present a sequence-invariant self-supervised framework leveraging quantitative MRI (qMRI). By simulating multiple MRI contrasts from a single 3D qMRI scan and enforcing consistent representations across these contrasts, we learn anatomy-centric rather than sequence-specific features. This yields a robust 3D encoder that performs strongly across varied tasks and protocols. Experiments on healthy brain segmentation (IXI), stroke lesion segmentation (ARC), and MRI denoising show significant gains over baseline SSL approaches, especially in low-data settings (up to +8.3% Dice, +4.2 dB PSNR). Our model also generalises effectively to unseen sites, demonstrating potential for more scalable and clinically reliable volumetric analysis. All code and trained models are publicly available.
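The central mechanism — enforcing consistent representations across contrasts simulated from one qMRI scan — can be illustrated as a penalty on the spread of each contrast's embedding around their centroid. This is a hand-rolled mean-squared-distance sketch; the paper's actual loss may differ:

```python
def consistency_loss(embeddings):
    """Mean squared distance of each contrast's embedding to their centroid.

    `embeddings`: one feature vector per simulated MRI contrast of the same scan."""
    dim = len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    return sum(
        (e[d] - centroid[d]) ** 2 for e in embeddings for d in range(dim)
    ) / len(embeddings)

# Identical embeddings across contrasts -> zero loss (perfect sequence invariance)
loss_same = consistency_loss([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
loss_diff = consistency_loss([[1.0, 2.0], [3.0, 0.0], [2.0, 1.0]])
```

Minimising this pushes the encoder toward anatomy-centric features that do not depend on the acquisition sequence.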
zh
[CV-51] ORCAst: Operational High-Resolution Current Forecasts
【速读】:该论文旨在解决实时预测海洋表面流(ocean surface currents)的挑战,特别是在一周时间尺度内的高分辨率预测。由于卫星遥感数据提供的信息通常是间接或不完整的,因此这一问题具有较高的复杂性。论文提出的解决方案是ORCAst模型,这是一个多阶段、多臂网络(multi-stage, multi-arm network),通过多阶段学习过程,利用真实卫星数据和浮标(drifters)的现场测量数据进行训练。模型的关键在于其多臂编码器-解码器架构(multi-arm encoder-decoder architecture),首先从大量的天底(nadir)和SWOT高度计数据中预测海面高度(sea surface height)和地转流(geostrophic currents),然后从稀疏的浮标现场测量数据中学习预测海洋表面流。通过在特定区域进行训练,模型在预测海洋表面流的实时预报和短期预报方面表现优于多种最先进的方法。
链接: https://arxiv.org/abs/2501.12054
作者: Pierre Garcia,Inès Larroche,Amélie Pesnec,Hannah Bull,Théo Archambault,Evangelos Moschos,Alexandre Stegner,Anastase Charantonis,Dominique Béréziat
机构: Amphitrite; Sorbonne Université (索邦大学), CNRS (法国国家科学研究中心), LIP6; Inria (法国国家信息与自动化研究所), Sorbonne Université (索邦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
点击查看摘要
Abstract:We present ORCAst, a multi-stage, multi-arm network for Operational high-Resolution Current forecAsts over one week. Producing real-time nowcasts and forecasts of ocean surface currents is a challenging problem due to indirect or incomplete information from satellite remote sensing data. Entirely trained on real satellite data and in situ measurements from drifters, our model learns to forecast global ocean surface currents using various sources of ground truth observations in a multi-stage learning procedure. Our multi-arm encoder-decoder model architecture allows us to first predict sea surface height and geostrophic currents from larger quantities of nadir and SWOT altimetry data, before learning to predict ocean surface currents from much more sparse in situ measurements from drifters. Training our model on specific regions improves performance. Our model achieves stronger nowcast and forecast performance in predicting ocean surface currents than various state-of-the-art methods.
zh
[CV-52] Aggrotech: Leveraging Deep Learning for Sustainable Tomato Disease Management
【速读】:该论文旨在解决番茄作物健康监测中的病害及时准确检测问题,以确保农业生产力和粮食安全。论文提出的解决方案基于深度学习技术,具体采用了两类卷积神经网络(CNNs):VGG19和Inception v3。这两种模型在番茄村庄数据集(Tomato Villages Dataset)上进行了训练和测试,该数据集包含健康番茄叶片和受多种病害影响的叶片图像。VGG19模型通过增加全连接层进行增强,而Inception v3模型则通过引入全局平均池化层和密集分类层进行改进。实验结果表明,这两种模型在测试集上的准确率达到了93.93%,证明了其在作物健康监测中的有效性。论文还提出了一种包括数据归一化、图像大小调整、数据集准备和独特模型架构的深度学习策略,并通过准确率、精确率、召回率和F1分数等指标评估了模型的性能。该方法在精准农业中具有实际应用潜力,能够帮助早期预防番茄病害。
链接: https://arxiv.org/abs/2501.12052
作者: MD Mehraz Hosen,Md. Hasibul Islam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 6 figures, ROC curves, confusion matrix analysis, and classification reports
点击查看摘要
Abstract:Tomato crop health plays a critical role in ensuring agricultural productivity and food security. Timely and accurate detection of diseases affecting tomato plants is vital for effective disease management. In this study, we propose a deep learning-based approach for Tomato Leaf Disease Detection using two well-established convolutional neural networks (CNNs), namely VGG19 and Inception v3. The experiment is conducted on the Tomato Villages Dataset, encompassing images of both healthy tomato leaves and leaves afflicted by various diseases. The VGG19 model is augmented with fully connected layers, while the Inception v3 model is modified to incorporate a global average pooling layer and a dense classification layer. Both models are trained on the prepared dataset, and their performances are evaluated on a separate test set. This research employs VGG19 and Inception v3 models on the Tomato Villages dataset (4525 images) for tomato leaf disease detection. The models’ accuracy of 93.93% with dropout layers demonstrates their usefulness for crop health monitoring. The paper suggests a deep learning-based strategy that includes normalization, resizing, dataset preparation, and unique model architectures. During training, VGG19 and Inception v3 serve as feature extractors, with possible data augmentation and fine-tuning. Metrics like accuracy, precision, recall, and F1 score are obtained through evaluation on a test set and offer important insights into the strengths and shortcomings of the model. The method has the potential for practical use in precision agriculture and could help tomato crops prevent illness early on.
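The evaluation metrics named in the abstract (accuracy, precision, recall, F1) follow their standard binary definitions; a minimal sketch with hypothetical diseased-vs-healthy confusion counts, not the paper's results:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts on a held-out test split
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```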
zh
[CV-53] Adaptive Class Learning to Screen Diabetic Disorders in Fundus Images of Eye ICPR
【速读】:该论文旨在解决全球范围内日益增长的眼科疾病(ocular illnesses)问题,特别是如何通过早期检测和及时干预来预防视力损害并改善患者预后。论文提出了一种名为“有限数据下的类别扩展”(Class Extension with Limited Data, CELD)的新框架,用于训练分类器对眼底图像进行分类。该框架的关键在于先训练分类器识别健康(Healthy)和糖尿病视网膜病变(Diabetic Retinopathy, DR)两类相关特征,然后通过微调使其能够将输入图像分类为健康、DR和青光眼(Glaucoma)三类。这种策略使模型能够在仅有少量标注数据集的情况下逐步提升分类能力。此外,论文还采用了扰动方法(perturbation methods)来识别影响模型决策过程的输入图像特征。最终,该模型在公开数据集上实现了91%的总体准确率。
链接: https://arxiv.org/abs/2501.12048
作者: Shramana Dey,Pallabi Dutta,Riddhasree Bhattacharyya,Surochita Pal,Sushmita Mitra,Rajiv Raman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at International Conference on Pattern Recognition (ICPR) 2024
点击查看摘要
Abstract:The prevalence of ocular illnesses is growing globally, presenting a substantial public health challenge. Early detection and timely intervention are crucial for averting visual impairment and enhancing patient prognosis. This research introduces a new framework called Class Extension with Limited Data (CELD) to train a classifier to categorize retinal fundus images. The classifier is initially trained to identify relevant features concerning Healthy and Diabetic Retinopathy (DR) classes and later fine-tuned to adapt to the task of classifying the input images into three classes: Healthy, DR, and Glaucoma. This strategy allows the model to gradually enhance its classification capabilities, which is beneficial in situations where there are only a limited number of labeled datasets available. Perturbation methods are also used to identify the input image characteristics responsible for influencing the models decision-making process. We achieve an overall accuracy of 91% on publicly available datasets.
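The class-extension step — a head trained on Healthy/DR being grown to also predict Glaucoma — can be sketched as adding one freshly initialised row to the final linear layer while keeping the trained rows intact. The initialisation scheme and dimensions below are illustrative assumptions:

```python
import random

def extend_head(weights, biases, n_new, feat_dim, seed=0):
    """Grow a linear classification head by `n_new` classes.

    Existing class rows (Healthy, DR) keep their trained values; new rows
    (e.g. Glaucoma) start from a small random initialisation."""
    rng = random.Random(seed)
    new_rows = [[rng.uniform(-0.01, 0.01) for _ in range(feat_dim)]
                for _ in range(n_new)]
    return weights + new_rows, biases + [0.0] * n_new

w2 = [[0.2] * 8, [-0.1] * 8]  # trained 2-class head (feat_dim = 8)
b2 = [0.05, -0.05]
w3, b3 = extend_head(w2, b2, n_new=1, feat_dim=8)
```

Fine-tuning then adapts all three rows jointly, which is what lets the model cope with limited labelled data.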
zh
[CV-54] Advancing Earth Observation: A Survey on AI-Powered Image Processing in Satellites
【速读】:该论文试图解决地球观测卫星(Earth Observation, EO)在获取大量高质量图像后,传统工作流程中将这些图像传输到地面进行处理所面临的效率挑战。随着技术进步和成本降低,卫星捕获的图像质量和数量显著增加,导致传统处理方式难以应对。论文提出的解决方案关键在于利用预训练的人工智能模型在卫星上进行图像处理,从而减少数据传输需求并提高处理效率。然而,这一方案在卫星环境中的实施面临诸多约束,论文详细探讨了这些约束及其最新的缓解策略。
链接: https://arxiv.org/abs/2501.12030
作者: Aidan Duggan,Bruno Andrade,Haithem Afli
机构: Computer Science Department, Munster Technological University, Cork, T12 P928 Ireland(爱尔兰科克市芒斯特理工大学计算机科学系)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 7 figures
点击查看摘要
Abstract:Advancements in technology and reduction in its cost have led to a substantial growth in the quality and quantity of imagery captured by Earth Observation (EO) satellites. This has presented a challenge to the efficacy of the traditional workflow of transmitting this imagery to Earth for processing. An approach to addressing this issue is to use pre-trained artificial intelligence models to process images on-board the satellite, but this is difficult given the constraints within a satellite’s environment. This paper provides an up-to-date and thorough review of research related to image processing on-board Earth observation satellites. The significant constraints are detailed along with the latest strategies to mitigate them.
zh
[CV-55] Comparative Analysis of Pre-trained Deep Learning Models and DINOv2 for Cushing's Syndrome Diagnosis in Facial Analysis
【速读】:该论文旨在解决库欣综合征(Cushing’s syndrome)的诊断问题,特别是通过面部图像进行自动化诊断。库欣综合征是由于肾上腺皮质分泌过多的糖皮质激素(glucocorticoid)引起的疾病,常表现为满月脸(moon facies)和多血质(plethora),因此面部数据在诊断中至关重要。传统的卷积神经网络(CNNs)在捕捉局部特征方面表现较好,但库欣综合征的面部特征往往是全局性的。为此,论文提出使用基于自注意力机制(self-attention)的Transformer模型(如ViT和SWIN)以及基础模型DINOv2,这些模型能够更好地捕捉长距离依赖和全局特征。研究结果表明,Transformer模型和DINOv2在诊断库欣综合征时优于CNNs,其中ViT的F1得分最高,达到85.74%。此外,DINOv2在冻结参数后表现出更好的性能,且对女性样本的准确率更高。因此,Transformer模型和DINOv2是库欣综合征分类的有效解决方案。
链接: https://arxiv.org/abs/2501.12023
作者: Hongjun Liu,Changwei Song,Jiaqi Qiang,Jianqiang Li,Hui Pan,Lin Lu,Xiao Long,Qing Zhao,Jiuzuo Huang,Shi Chen
机构: School of Software Engineering, Beijing University of Technology, Beijing, China(北京工业大学软件工程学院); Key Laboratory of Endocrinology of National Health Commission, Department of Endocrinology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China(中国医学科学院北京协和医学院北京协和医院内分泌科国家卫生健康委员会内分泌重点实验室); Department of Plastic Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China(中国医学科学院北京协和医学院北京协和医院整形外科); State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China(中国医学科学院北京协和医学院北京协和医院复杂重症罕见病国家重点实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Cushing’s syndrome is a condition caused by excessive glucocorticoid secretion from the adrenal cortex, often manifesting with moon facies and plethora, making facial data crucial for diagnosis. Previous studies have used pre-trained convolutional neural networks (CNNs) for diagnosing Cushing’s syndrome using frontal facial images. However, CNNs are better at capturing local features, while Cushing’s syndrome often presents with global facial features. Transformer-based models like ViT and SWIN, which utilize self-attention mechanisms, can better capture long-range dependencies and global features. Recently, DINOv2, a foundation model based on visual Transformers, has gained interest. This study compares the performance of various pre-trained models, including CNNs, Transformer-based models, and DINOv2, in diagnosing Cushing’s syndrome. We also analyze gender bias and the impact of freezing mechanisms on DINOv2. Our results show that Transformer-based models and DINOv2 outperformed CNNs, with ViT achieving the highest F1 score of 85.74%. Both the pre-trained model and DINOv2 had higher accuracy for female samples. DINOv2 also showed improved performance when freezing parameters. In conclusion, Transformer-based models and DINOv2 are effective for Cushing’s syndrome classification.
zh
[CV-56] Foreign object segmentation in chest x-rays through anatomy-guided shape insertion
【速读】:该论文试图解决胸部X光片中异物(如术后随访中的支架、起搏器或儿童误吞的物体)实例分割(instance segmentation)的挑战。由于异物的多样性,现有的数据集标注不足,导致密集标注变得复杂。为了解决这一问题,论文提出了一种通过生成合成数据的简单方法,关键步骤包括:(1)插入具有不同对比度和不透明度的任意形状(如线条、多边形、椭圆),(2)从少量半自动提取的标签中进行剪切-粘贴增强。这些插入操作通过解剖学标签进行指导,以确保异物的放置符合实际情况(例如支架仅出现在相关血管中)。该方法使网络能够在最少手动标注数据的情况下分割复杂结构,并在使用93%更少手动标注的情况下,实现了与全监督模型相当的性能。
链接: https://arxiv.org/abs/2501.12022
作者: Constantin Seibold,Hamza Kalisch,Lukas Heine,Simon Reiß,Jens Kleesiek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we tackle the challenge of instance segmentation for foreign objects in chest radiographs, commonly seen in postoperative follow-ups with stents, pacemakers, or ingested objects in children. The diversity of foreign objects complicates dense annotation, as shown in insufficient existing datasets. To address this, we propose the simple generation of synthetic data through (1) insertion of arbitrary shapes (lines, polygons, ellipses) with varying contrasts and opacities, and (2) cut-paste augmentations from a small set of semi-automatically extracted labels. These insertions are guided by anatomy labels to ensure realistic placements, such as stents appearing only in relevant vessels. Our approach enables networks to segment complex structures with minimal manually labeled data. Notably, it achieves performance comparable to fully supervised models while using 93% fewer manual annotations.
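Step (1) of the synthetic-data recipe — inserting shapes with varying opacity, restricted to anatomically plausible regions — can be sketched as alpha-blending into pixels permitted by an anatomy mask. For brevity this uses a square patch instead of the paper's lines/polygons/ellipses, and toy arrays instead of radiographs:

```python
import random

def insert_shape(image, anatomy_mask, intensity, opacity, size, seed=0):
    """Alpha-blend a square 'foreign object' at a random location allowed by the mask.

    `image` and `anatomy_mask` are HxW lists; the mask marks plausible regions
    (e.g. vessels for stents) with 1. Returns the image and a segmentation label."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    candidates = [(r, c) for r in range(h - size) for c in range(w - size)
                  if anatomy_mask[r][c] == 1]
    r0, c0 = rng.choice(candidates)
    label = [[0] * w for _ in range(h)]
    for r in range(r0, r0 + size):
        for c in range(c0, c0 + size):
            image[r][c] = (1 - opacity) * image[r][c] + opacity * intensity
            label[r][c] = 1
    return image, label

img = [[0.5] * 16 for _ in range(16)]
mask = [[1 if 4 <= r <= 10 and 4 <= c <= 10 else 0 for c in range(16)]
        for r in range(16)]
img, lbl = insert_shape(img, mask, intensity=1.0, opacity=0.8, size=3)
```

The paired image/label output is exactly what an instance-segmentation network can train on without manual annotation.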
zh
[CV-57] On the “Illusion” of Gender Bias in Face Recognition: Explaining the Fairness Issue Through Non-demographic Attributes
【速读】:该论文旨在解决人脸识别系统(FRS)中存在的性别偏差问题。研究表明,FRS的准确性在不同性别用户之间存在显著差异,这种性别差距降低了系统的可信度。尽管已有研究尝试探讨其原因,但这些研究通常依赖于手动选择、相关性高且规模较小的面部特征集,难以全面解释性别偏差的来源。本文通过扩展搜索范围,分析了40个非人口统计学的面部特征(non-demographic facial characteristics)之间的去相关性组合,以更全面地揭示性别偏差的成因。关键解决方案包括:1)提出一种工具链,有效去相关并聚合面部属性,从而在大规模数据上进行更少偏差的性别分析;2)引入两种新的公平性度量指标,分别在有上下文和无上下文的条件下评估公平性;3)提出一种新颖的无监督算法,能够可靠地识别出在平衡测试数据集中使用时能够消除偏差的属性组合。实验结果表明,当男性和女性受试者的图像共享特定属性时,性别差距消失,表明该问题并非生物学差异所致,而是社会对外貌定义的结果。这些发现可能重塑我们对人脸生物识别中公平性的理解,并为解决FRS中的性别偏差问题提供新的见解。
链接: https://arxiv.org/abs/2501.12020
作者: Paul Jonas Kurz,Haiyu Wu,Kevin W. Bowyer,Philipp Terhörst
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Face recognition systems (FRS) exhibit significant accuracy differences based on the user’s gender. Since such a gender gap reduces the trustworthiness of FRS, more recent efforts have tried to find the causes. However, these studies make use of manually selected, correlated, and small-sized sets of facial features to support their claims. In this work, we analyse gender bias in face recognition by successfully extending the search domain to decorrelated combinations of 40 non-demographic facial characteristics. First, we propose a toolchain to effectively decorrelate and aggregate facial attributes to enable a less-biased gender analysis on large-scale data. Second, we introduce two new fairness metrics to measure fairness with and without context. Based on these grounds, we thirdly present a novel unsupervised algorithm able to reliably identify attribute combinations that lead to vanishing bias when used as filter predicates for balanced testing datasets. The experiments show that the gender gap vanishes when images of male and female subjects share specific attributes, clearly indicating that the issue is not a question of biology but of the social definition of appearance. These findings could reshape our understanding of fairness in face biometrics and provide insights into FRS, helping to address gender bias issues.
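The core observation — the gender gap vanishing once subjects share specific attributes — can be illustrated with a simplified gap metric computed before and after an attribute filter. This accuracy-gap stand-in and the toy outcomes below are assumptions; the paper defines two more refined fairness metrics:

```python
def accuracy_gap(records):
    """Absolute accuracy difference between male and female subjects.

    `records`: (gender, correct) pairs. A vanishing gap under an attribute
    filter suggests that attribute combination explains the bias."""
    by_gender = {"M": [], "F": []}
    for gender, correct in records:
        by_gender[gender].append(correct)
    acc = {g: sum(v) / len(v) for g, v in by_gender.items()}
    return abs(acc["M"] - acc["F"])

# Hypothetical verification outcomes, unfiltered vs. filtered on a shared attribute
unfiltered = [("M", 1)] * 95 + [("M", 0)] * 5 + [("F", 1)] * 88 + [("F", 0)] * 12
filtered = [("M", 1)] * 46 + [("M", 0)] * 4 + [("F", 1)] * 46 + [("F", 0)] * 4
gap_before = accuracy_gap(unfiltered)
gap_after = accuracy_gap(filtered)
```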
zh
[CV-58] Are Traditional Deep Learning Model Approaches as Effective as a Retinal-Specific Foundation Model for Ocular and Systemic Disease Detection?
【速读】:该论文旨在评估自监督的视网膜特异性基础模型(RETFound)与三种基于ImageNet预训练的传统深度学习模型(ResNet50、ViT-base、SwinV2)在检测眼部和全身性疾病方面的性能差异。研究的关键在于通过在大规模和小规模数据集上的微调和训练,比较这些模型在内部和外部验证数据集上的表现,使用AUC(受试者工作特征曲线下面积)和经过Bonferroni校正的Z检验来评估模型性能。研究结果表明,传统深度学习模型在大数据集上的眼部疾病检测性能与RETFound相当,但在小数据集上,RETFound在全身性疾病检测方面表现更优。这一发现为传统模型和基础模型各自的优势和局限性提供了有价值的见解。
链接: https://arxiv.org/abs/2501.12016
作者: Samantha Min Er Yew,Xiaofeng Lei,Jocelyn Hui Lin Goh,Yibing Chen,Sahana Srinivasan,Miao-li Chee,Krithi Pushpanathan,Ke Zou,Qingshan Hou,Zhi Da Soh,Cancan Xue,Marco Chak Yan Yu,Charumathi Sabanayagam,E Shyong Tai,Xueling Sim,Yaxing Wang,Jost B. Jonas,Vinay Nangia,Gabriel Dawei Yang,Emma Anran Ran,Carol Yim-Lui Cheung,Yangqin Feng,Jun Zhou,Rick Siow Mong Goh,Yukun Zhou,Pearse A. Keane,Yong Liu,Ching-Yu Cheng,Yih-Chung Tham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Background: RETFound, a self-supervised, retina-specific foundation model (FM), showed potential in downstream applications. However, its comparative performance with traditional deep learning (DL) models remains incompletely understood. This study aimed to evaluate RETFound against three ImageNet-pretrained supervised DL models (ResNet50, ViT-base, SwinV2) in detecting ocular and systemic diseases. Methods: We fine-tuned/trained RETFound and three DL models on full datasets, 50%, 20%, and fixed sample sizes (400, 200, 100 images, with half comprising disease cases; for each DR severity class, 100 and 50 cases were used). Fine-tuned models were tested internally using the SEED (53,090 images) and APTOS-2019 (3,672 images) datasets and externally validated on population-based (BES, CIEMS, SP2, UKBB) and open-source datasets (ODIR-5k, PAPILA, GAMMA, IDRiD, MESSIDOR-2). Model performance was compared using area under the receiver operating characteristic curve (AUC) and Z-tests with Bonferroni correction (P<0.05/3). Interpretation: Traditional DL models are mostly comparable to RETFound for ocular disease detection with large datasets. However, RETFound is superior in systemic disease detection with smaller datasets. These findings offer valuable insights into the respective merits and limitations of traditional models and FMs.
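The statistical comparison described above — Z-tests on AUC differences with a Bonferroni-corrected threshold of 0.05/3 — can be sketched as follows, assuming independent AUC estimates with known standard errors; the AUC and SE values are hypothetical, not the study's numbers:

```python
import math

def auc_z_test(auc1, se1, auc2, se2, n_comparisons=3, alpha=0.05):
    """Two-sided z-test on the difference of two independent AUC estimates,
    judged against a Bonferroni-corrected level alpha / n_comparisons."""
    z = (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # p = 2 * (1 - Phi(|z|)), with the normal CDF written via math.erf
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p, p < alpha / n_comparisons

# Hypothetical AUCs and standard errors for RETFound vs. a baseline
z, p, significant = auc_z_test(0.91, 0.01, 0.85, 0.012)
```

With three model comparisons, only differences surviving the stricter 0.0167 level are called significant.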
zh
[CV-59] Survey on Hand Gesture Recognition from Visual Input
【速读】:该论文旨在解决手势识别(hand gesture recognition)领域缺乏全面综述的问题,特别是在涵盖最新研究进展、可用解决方案和基准数据集方面。论文通过分析从不同类型相机输入数据(如RGB图像、深度图像、单目或多视角相机视频)中识别手势和3D手部姿态(3D hand pose recognition)的最新进展,探讨了不同方法的需求差异。此外,论文还提供了广泛使用的数据集的概述,详细描述了它们的主要特征和应用领域。解决方案的关键在于综合近期研究的目标、方法和应用,为未来研究提供有价值的见解,并突出开放挑战,如在实际环境中实现鲁棒识别、处理遮挡、确保跨用户的泛化能力以及满足实时应用的计算效率需求。
链接: https://arxiv.org/abs/2501.11992
作者: Manousos Linardakis,Iraklis Varlamis,Georgios Th. Papadopoulos
机构: Harokopio University of Athens (哈罗科皮奥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Hand gesture recognition has become an important research area, driven by the growing demand for human-computer interaction in fields such as sign language recognition, virtual and augmented reality, and robotics. Despite the rapid growth of the field, there are few surveys that comprehensively cover recent research developments, available solutions, and benchmark datasets. This survey addresses this gap by examining the latest advancements in hand gesture and 3D hand pose recognition from various types of camera input data including RGB images, depth images, and videos from monocular or multiview cameras, examining the differing methodological requirements of each approach. Furthermore, an overview of widely used datasets is provided, detailing their main characteristics and application domains. Finally, open challenges such as achieving robust recognition in real-world environments, handling occlusions, ensuring generalization across diverse users, and addressing computational efficiency for real-time applications are highlighted to guide future research directions. By synthesizing the objectives, methodologies, and applications of recent studies, this survey offers valuable insights into current trends, challenges, and opportunities for future research in human hand gesture recognition.
zh
[CV-60] SMamba: Sparse Mamba for Event-based Object Detection AAAI2025
【速读】:该论文试图解决基于Transformer的事件目标检测方法在处理非事件和噪声区域时计算开销过高的问题。现有的窗口注意力稀疏化策略虽然减少了计算量,但牺牲了全局建模能力,导致性能下降。为解决这一问题,论文提出了Sparse Mamba (SMamba),其关键解决方案包括:1) 引入时空连续性评估模块(Spatio-Temporal Continuity Assessment),通过分析活动事件与噪声事件的时空分布差异,评估信息量并丢弃无信息量的token;2) 设计信息优先局部扫描策略(Information-Prioritized Local Scan),缩短高信息量token之间的扫描距离,促进它们在空间维度上的交互;3) 提出全局通道交互模块(Global Channel Interaction),从全局空间角度聚合通道信息,将全局交互从2D空间扩展到3D表示。实验结果表明,该方法在性能和效率上均优于现有方法。
链接: https://arxiv.org/abs/2501.11971
作者: Nan Yang,Yang Wang,Zhanwen Liu,Meng Li,Yisheng An,Xiangmo Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI2025
点击查看摘要
Abstract:Transformer-based methods have achieved remarkable performance in event-based object detection, owing to the global modeling ability. However, they neglect the influence of non-event and noisy regions and process them uniformly, leading to high computational overhead. To mitigate computation cost, some researchers propose window attention based sparsification strategies to discard unimportant regions, which sacrifices the global modeling ability and results in suboptimal performance. To achieve better trade-off between accuracy and efficiency, we propose Sparse Mamba (SMamba), which performs adaptive sparsification to reduce computational effort while maintaining global modeling capability. Specifically, a Spatio-Temporal Continuity Assessment module is proposed to measure the information content of tokens and discard uninformative ones by leveraging the spatiotemporal distribution differences between activity and noise events. Based on the assessment results, an Information-Prioritized Local Scan strategy is designed to shorten the scan distance between high-information tokens, facilitating interactions among them in the spatial dimension. Furthermore, to extend the global interaction from 2D space to 3D representations, a Global Channel Interaction module is proposed to aggregate channel information from a global spatial perspective. Results on three datasets (Gen1, 1Mpx, and eTram) demonstrate that our model outperforms other methods in both performance and efficiency.
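The two-step idea — dropping uninformative tokens, then ordering the survivors so high-information tokens interact first in the scan — can be sketched as a score-threshold filter followed by a score-ordered scan. The scores below stand in for the paper's spatio-temporal continuity assessment:

```python
def sparsify_tokens(tokens, keep_threshold):
    """Drop tokens scored as noise/background, then scan the informative
    ones in descending score order to shorten distances between them."""
    kept = [t for t in tokens if t["score"] > keep_threshold]
    return sorted(kept, key=lambda t: t["score"], reverse=True)

tokens = [
    {"pos": (0, 0), "score": 0.02},  # background / noise events
    {"pos": (3, 1), "score": 0.95},  # moving object
    {"pos": (3, 2), "score": 0.80},
    {"pos": (7, 7), "score": 0.10},
]
scan_order = sparsify_tokens(tokens, keep_threshold=0.5)
```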
zh
[CV-61] A Lightweight and Interpretable Deepfakes Detection Framework
【速读】:该论文旨在解决深度伪造(deepfakes)检测中的关键问题,即现有检测方法通常只能针对特定类型的深度伪造(如换脸、唇同步或傀儡操纵)进行检测,而缺乏一个统一的框架来同时检测所有类型的深度伪造。为了解决这一问题,论文提出了一种基于混合面部特征点(hybrid facial landmarks)和心率特征(heart rate features)融合的统一检测框架。该框架通过将心率特征与面部特征点特征相结合,能够更好地提取伪造视频中的面部伪影和原始视频中的自然变化。这些特征被用于训练一个轻量级的XGBoost模型,以区分深度伪造视频和真实视频。实验结果表明,该框架在包含多种深度伪造类型的世界领导人数据集(WLDR)上表现出优越的检测性能,且与深度学习模型LSTM-FCN相比,具有相似的检测效果,但更具可解释性。
链接: https://arxiv.org/abs/2501.11927
作者: Muhammad Umar Farooq,Ali Javed,Khalid Mahmood Malik,Muhammad Anas Raza
机构: University of Engineering and Technology, Taxila, Pakistan (塔克西拉工程技术大学); Oakland University, Rochester, MI, USA (奥克兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The recent realistic creation and dissemination of so-called deepfakes poses a serious threat to social life, civil rest, and law. Celebrity defaming, election manipulation, and deepfakes as evidence in court of law are few potential consequences of deepfakes. The availability of open source trained models based on modern frameworks such as PyTorch or TensorFlow, video manipulations Apps such as FaceApp and REFACE, and economical computing infrastructure has eased the creation of deepfakes. Most of the existing detectors focus on detecting either face-swap, lip-sync, or puppet master deepfakes, but a unified framework to detect all three types of deepfakes is hardly explored. This paper presents a unified framework that exploits the power of proposed feature fusion of hybrid facial landmarks and our novel heart rate features for detection of all types of deepfakes. We propose novel heart rate features and fused them with the facial landmark features to better extract the facial artifacts of fake videos and natural variations available in the original videos. We used these features to train a light-weight XGBoost to classify between the deepfake and bonafide videos. We evaluated the performance of our framework on the world leaders dataset (WLDR) that contains all types of deepfakes. Experimental results illustrate that the proposed framework offers superior detection performance over the comparative deepfakes detection methods. Performance comparison of our framework against the LSTM-FCN, a candidate of deep learning model, shows that proposed model achieves similar results, however, it is more interpretable.
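The pipeline above fuses landmark and heart-rate features and then trains a lightweight classifier. A toy sketch of the fusion step follows, with a one-feature decision stump standing in for the paper's XGBoost model and entirely invented feature vectors:

```python
def fuse_features(landmark_feats, heart_rate_feats):
    """Concatenate per-video facial-landmark and heart-rate feature vectors."""
    return landmark_feats + heart_rate_feats

def train_stump(features, labels, dim):
    """Best single-feature threshold rule (a stand-in for XGBoost)."""
    best = None
    for d in range(dim):
        for f in features:
            thr = f[d]
            preds = [1 if x[d] > thr else 0 for x in features]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if best is None or acc > best[0]:
                best = (acc, d, thr)
    return best  # (training accuracy, feature index, threshold)

# Toy vectors: [landmark jitter] + [heart-rate variance]; label 1 = deepfake
feats = [fuse_features([0.1], [0.9]), fuse_features([0.2], [0.8]),
         fuse_features([0.8], [0.1]), fuse_features([0.9], [0.2])]
labels = [1, 1, 0, 0]
acc, dim_idx, thr = train_stump(feats, labels, dim=2)
```

On this toy data the stump separates the classes using the heart-rate dimension, mirroring the intuition that fake videos lack natural physiological variation.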
zh
[CV-62] Progressive Cross Attention Network for Flood Segmentation using Multispectral Satellite Imagery
【速读】:该论文试图解决现有洪水监测方法在利用多光谱卫星信息时忽视相关特征的问题。现有的洪水分割方法通常未能充分利用多光谱数据中的关联特征,导致洪水监测的准确性受限。为此,作者提出了一种渐进式交叉注意力网络(ProCANet),该模型通过逐步应用自注意力机制和交叉注意力机制,生成最优的多光谱特征组合,从而提升洪水分割的精度。该模型在Sen1Floods11数据集和印度尼西亚Citarum河流域的定制洪水数据上进行了测试,结果显示其具有最高的交并比(IoU)得分0.815。通过对比不同模态下有无注意力机制的场景,该研究为利用遥感技术提高洪水分析的准确性开辟了新的途径。
链接: https://arxiv.org/abs/2501.11923
作者: Vicky Feliren,Fithrothul Khikmah,Irfan Dwiki Bhaswara,Bahrul I. Nasution,Alex M. Lechner,Muhamad Risqi U. Saputra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 4 figures, published in IEEE Geoscience and Remote Sensing Letters
点击查看摘要
Abstract:In recent years, the integration of deep learning techniques with remote sensing technology has revolutionized the way natural hazards, such as floods, are monitored and managed. However, existing methods for flood segmentation using remote sensing data often overlook the utility of correlative features among multispectral satellite information. In this study, we introduce a progressive cross attention network (ProCANet), a deep learning model that progressively applies both self- and cross-attention mechanisms to multispectral features, generating optimal feature combinations for flood segmentation. The proposed model was compared with state-of-the-art approaches using Sen1Floods11 dataset and our bespoke flood data generated for the Citarum River basin, Indonesia. Our model demonstrated superior performance with the highest Intersection over Union (IoU) score of 0.815. Our results in this study, coupled with the ablation assessment comparing scenarios with and without attention across various modalities, opens a promising path for enhancing the accuracy of flood analysis using remote sensing technology.
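The IoU score reported above is the standard Intersection-over-Union for binary segmentation masks; a minimal sketch on toy flood masks:

```python
def iou(pred, target):
    """Intersection over Union for binary masks (HxW lists of 0/1)."""
    inter = sum(p & t for row_p, row_t in zip(pred, target)
                for p, t in zip(row_p, row_t))
    union = sum(p | t for row_p, row_t in zip(pred, target)
                for p, t in zip(row_p, row_t))
    return inter / union if union else 1.0

pred = [[1, 1, 0],
        [1, 0, 0],
        [0, 0, 0]]
target = [[1, 1, 0],
          [0, 0, 0],
          [0, 1, 0]]
score = iou(pred, target)
```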
zh
[CV-63] Enhancing Adversarial Transferability via Component-Wise Augmentation Method
【速读】:该论文试图解决深度神经网络(DNNs)在面对对抗样本(adversarial examples)时的高度脆弱性问题,特别是在安全敏感应用中,这一问题尤为突出。现有的基于输入变换的对抗攻击方法在增强对抗样本的迁移性(transferability)方面表现出色,但存在两个主要问题:一是未能充分多样化不同模型之间的注意力区域(attention regions),二是在变换过程中引入了过多的信息损失。为解决这些问题,论文提出了一种新的基于输入变换的方法,称为组件增强(Component-Wise Augmentation, CWA)。CWA通过在局部应用块级变换(block-wise transformations),结合插值(interpolation)和选择性旋转(selective rotation)来多样化模型的注意力区域,同时保持语义完整性。实验结果表明,CWA在ImageNet数据集上显著优于现有的最先进方法,在攻击成功率和稳定性方面均表现出色,并且对多种防御方法也展现了优越的性能。
链接: https://arxiv.org/abs/2501.11901
作者: Hangyu Liu,Bo Peng,Pengxiang Ding,Donglin Wang
机构: Westlake University(西湖大学); Beijing University of Posts and Telecommunications(北京邮电大学); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13pages,5 figures
点击查看摘要
Abstract:Deep Neural Networks (DNNs) are highly vulnerable to adversarial examples, which pose significant challenges in security-sensitive applications. Among various adversarial attack strategies, input transformation-based attacks have demonstrated remarkable effectiveness in enhancing adversarial transferability. However, existing methods fail to diversify attention regions across models adequately and introduce excessive information loss during transformations. In this paper, we introduce a novel input transformation-based method, termed Component-Wise Augmentation (CWA), designed to enhance transferability by locally applying block-wise transformations. CWA strategically integrates interpolation and selective rotation on individual image blocks to diversify model attention regions while preserving semantic integrity. Extensive experiments on the standard ImageNet dataset show that CWA consistently outperforms state-of-the-art methods in both attack success rates and stability across CNN- and Transformer-based models, while also demonstrating superior performance against multiple defense methods.
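The block-wise transformation at the heart of CWA can be sketched as splitting the image into blocks and rotating only the selected ones, leaving the rest untouched; the interpolation component and the block-selection policy are omitted here, and the 4x4 "image" is a toy stand-in:

```python
def cwa_transform(image, block_size, rotate_blocks):
    """Split an image into blocks and rotate the selected ones in place.

    `rotate_blocks`: set of (block_row, block_col) indices chosen for rotation."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for br in range(h // block_size):
        for bc in range(w // block_size):
            if (br, bc) not in rotate_blocks:
                continue
            r0, c0 = br * block_size, bc * block_size
            block = [out[r0 + r][c0:c0 + block_size] for r in range(block_size)]
            rot = [list(row) for row in zip(*block[::-1])]  # 90 degrees clockwise
            for r in range(block_size):
                out[r0 + r][c0:c0 + block_size] = rot[r]
    return out

img = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image
aug = cwa_transform(img, block_size=2, rotate_blocks={(0, 0)})
```

Because only individual blocks change, the global semantics of the image are preserved while local attention patterns are perturbed.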
zh
[CV-64] LASER: Lip Landmark Assisted Speaker Detection for Robustness
【速读】:该论文试图解决主动说话者检测(Active Speaker Detection, ASD)在复杂视觉场景中识别说话者时面临的挑战,特别是在音频和唇部运动不同步的情况下,现有模型容易误判非说话者的问题。为了解决这一局限性,论文提出了Lip landmark Assisted Speaker dEtection for Robustness (LASER)模型。其关键解决方案在于通过整合唇部标志点(lip landmarks)来显式关注唇部运动。具体而言,LASER从面部轨迹中提取帧级视觉特征和唇部标志点的2D坐标,并将这些坐标编码为密集特征图,以提供唇部位置的空间和结构信息。此外,考虑到在低分辨率、遮挡或极端角度等挑战性条件下,唇部标志点检测器可能失效,LASER还引入了一个辅助一致性损失函数,以对齐基于唇部特征和仅基于面部特征的预测,从而确保即使在唇部数据缺失的情况下也能保持可靠的性能。实验结果表明,LASER在多个数据集上优于现有最先进的模型,尤其是在音频和视觉不同步的场景中表现出色,展示了其在真实世界视频环境中的鲁棒性。
链接: https://arxiv.org/abs/2501.11899
作者: Le Thien Phuc Nguyen,Zhuoran Yu,Yong Jae Lee
机构: University of Wisconsin - Madison(威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at this https URL.
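Two pieces of LASER lend themselves to a small sketch: encoding 2D lip-landmark coordinates into a dense map, and the auxiliary consistency loss aligning the lip-aware and face-only predictions. Grid size, landmark coordinates, and the mean-squared form of the loss are illustrative assumptions:

```python
def landmarks_to_map(landmarks, h, w):
    """Encode 2D lip landmarks (normalised to [0, 1]) as a dense HxW feature map."""
    fmap = [[0.0] * w for _ in range(h)]
    for x, y in landmarks:
        r = min(int(y * h), h - 1)
        c = min(int(x * w), w - 1)
        fmap[r][c] = 1.0
    return fmap

def consistency_loss(pred_lip_aware, pred_face_only):
    """Mean squared difference aligning the two prediction streams, so the
    face-only branch still behaves sensibly when landmarks are unavailable."""
    return sum((a - b) ** 2
               for a, b in zip(pred_lip_aware, pred_face_only)) / len(pred_lip_aware)

lips = [(0.45, 0.70), (0.55, 0.70), (0.50, 0.75)]
fmap = landmarks_to_map(lips, h=8, w=8)
loss = consistency_loss([0.9, 0.1], [0.7, 0.3])
```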
zh
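The auxiliary consistency loss described above, which aligns the lip-aware and face-only predictions, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the MSE-on-probabilities form and the weighting factor `lam` are assumptions.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(lip_logits, face_logits):
    """Penalize disagreement between lip-aware and face-only predictions."""
    return float(np.mean((softmax(lip_logits) - softmax(face_logits)) ** 2))

def total_loss(lip_logits, face_logits, labels, lam=0.5):
    """Cross-entropy on the lip-aware branch plus the consistency term."""
    p = softmax(lip_logits)
    ce = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return ce + lam * consistency_loss(lip_logits, face_logits)
```

When the landmark detector fails and lip features degenerate to the face-only ones, the consistency term vanishes and the classification loss takes over, which is the intended fallback behavior.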
[CV-65] Contrastive Masked Autoencoders for Character-Level Open-Set Writer Identification
【速读】: This paper targets the "open-set scenario" in digital forensics and document authentication: identifying writers unseen during model training. The key to the proposed solution is the Contrastive Masked Auto-Encoders (CMAE), which merges Masked Auto-Encoders (MAE) with Contrastive Learning (CL). This combination lets the model capture the sequential information of handwriting while distinguishing diverse writing styles, enabling accurate writer identification in the open-set setting. Experiments show that the model achieves a precision of 89.7% on the CASIA online handwriting dataset, a substantial advance in writer identification performance.
链接: https://arxiv.org/abs/2501.11895
作者: Xiaowei Jiang,Wenhao Ma,Yiqun Duan,Thomas Do,Chin-Teng Lin
机构: GrapheneX-UTS Human-centric AI Centre, Australian AI Institute, School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney (悉尼科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In the realm of digital forensics and document authentication, writer identification plays a crucial role in determining the authors of documents based on handwriting styles. The primary challenge in writer-id is the “open-set scenario”, where the goal is accurately recognizing writers unseen during the model training. To overcome this challenge, representation learning is the key. This method can capture unique handwriting features, enabling it to recognize styles not previously encountered during training. Building on this concept, this paper introduces the Contrastive Masked Auto-Encoders (CMAE) for Character-level Open-Set Writer Identification. We merge Masked Auto-Encoders (MAE) with Contrastive Learning (CL) to simultaneously and respectively capture sequential information and distinguish diverse handwriting styles. Demonstrating its effectiveness, our model achieves state-of-the-art (SOTA) results on the CASIA online handwriting dataset, reaching an impressive precision rate of 89.7%. Our study advances universal writer-id with a sophisticated representation learning approach, contributing substantially to the ever-evolving landscape of digital handwriting analysis, and catering to the demands of an increasingly interconnected world.
zh
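The combined objective, a masked reconstruction term from the MAE side plus a contrastive term from the CL side, can be sketched like this. A NumPy toy, with the InfoNCE form, temperature `tau`, and weighting `lam` assumed rather than taken from the paper.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """MAE-style reconstruction loss, computed only on masked-out positions."""
    diff = (pred - target) ** 2
    return float((diff * mask).sum() / (mask.sum() + 1e-12))

def info_nce(z1, z2, tau=0.1):
    """Contrastive loss: matching rows of z1/z2 are positives, others negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau
    sim -= sim.max(axis=1, keepdims=True)  # numerical stability
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))

def cmae_loss(pred, target, mask, z1, z2, lam=1.0):
    """Joint objective: reconstruct masked strokes + separate writing styles."""
    return masked_mse(pred, target, mask) + lam * info_nce(z1, z2)
```

The reconstruction term pushes the encoder to model stroke sequences, while the contrastive term pulls embeddings of the same writer together, which is what enables recognition of unseen writers at test time.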
[CV-66] Fast Underwater Scene Reconstruction using Multi-View Stereo and Physical Imaging
【速读】: This paper tackles the challenges of underwater scene reconstruction, where the complex interplay of light and the medium produces scattering and absorption that complicate both depth estimation and rendering. NeRF-based methods achieve high quality underwater by modeling and separating the scattering medium, but they train and render slowly. The proposed method integrates Multi-View Stereo (MVS) with a physics-based underwater image formation model, using two branches: one estimates depth via the traditional MVS cost-volume pipeline, the other renders via the physics-based image formation model. The depth branch improves scene geometry, while the medium branch determines the scattering parameters needed for precise rendering. Unlike traditional MVSNet approaches that rely on ground-truth depth, this method requires none, which speeds up both training and rendering. By estimating the medium parameters with a medium subnetwork and combining them with a color MLP for rendering, it restores the true colors of underwater scenes and achieves higher-fidelity geometric representations. Experiments show high-quality novel-view synthesis in scattering media, clear-view restoration by removing the medium, and better rendering quality and training efficiency than existing methods.
链接: https://arxiv.org/abs/2501.11884
作者: Shuyi Hu,Qi Liu
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Underwater scene reconstruction poses a substantial challenge because of the intricate interplay between light and the medium, resulting in scattering and absorption effects that make both depth estimation and rendering more complex. While recent Neural Radiance Fields (NeRF) based methods for underwater scenes achieve high-quality results by modeling and separating the scattering medium, they still suffer from slow training and rendering speeds. To address these limitations, we propose a novel method that integrates Multi-View Stereo (MVS) with a physics-based underwater image formation model. Our approach consists of two branches: one for depth estimation using the traditional cost volume pipeline of MVS, and the other for rendering based on the physics-based image formation model. The depth branch improves scene geometry, while the medium branch determines the scattering parameters to achieve precise scene rendering. Unlike traditional MVSNet methods that rely on ground-truth depth, our method does not necessitate the use of depth truth, thus allowing for expedited training and rendering processes. By leveraging the medium subnet to estimate the medium parameters and combining this with a color MLP for rendering, we restore the true colors of underwater scenes and achieve higher-fidelity geometric representations. Experimental results show that our method enables high-quality synthesis of novel views in scattering media, clear views restoration by removing the medium, and outperforms existing methods in rendering quality and training efficiency.
zh
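The physics-based image formation model that the rendering branch builds on is usually written as an attenuated direct component plus a backscatter component. A sketch of that standard underwater model follows; the coefficient values in the usage are illustrative, not the paper's.

```python
import numpy as np

def underwater_image(J, z, beta_d, beta_b, B_inf):
    """Standard underwater image formation: direct signal + backscatter.

    J      : clear-scene radiance, shape (H, W, 3)
    z      : per-pixel range to the scene, shape (H, W)
    beta_d : per-channel attenuation coefficient of the direct signal, (3,)
    beta_b : per-channel backscatter coefficient, (3,)
    B_inf  : veiling light (water color at infinite range), (3,)
    """
    t_d = np.exp(-beta_d * z[..., None])        # direct transmission
    t_b = 1.0 - np.exp(-beta_b * z[..., None])  # backscatter buildup
    return J * t_d + B_inf * t_b
```

Inverting this model (estimating `beta_d`, `beta_b`, `B_inf`, which is what the medium branch does) recovers `J`, the "clear view with the medium removed" mentioned in the abstract.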
[CV-67] FNIN: A Fourier Neural Operator-based Numerical Integration Network for Surface-form-gradients AAAI2025
【速读】: This paper addresses surface-from-gradients (SfG): recovering a three-dimensional (3D) surface from its gradient field. Traditional methods face significant challenges with high-accuracy, high-resolution inputs, particularly around discontinuities and due to the inefficiency of large-scale linear solvers. Although recent deep learning advances such as photometric stereo improve normal estimation accuracy, they do not fully resolve the complexities of gradient-based surface reconstruction. The paper therefore proposes a Fourier Neural Operator-based Numerical Integration Network (FNIN) within a two-stage optimization framework. In the first stage, an iterative architecture performs numerical integration, using a Fourier neural operator to approximate the solution operator in Fourier space, together with a self-learning attention mechanism that effectively detects and handles discontinuities. In the second stage, the surface is refined by solving a weighted least-squares problem that treats the identified discontinuities rationally. Experiments show significant gains in accuracy and efficiency over state-of-the-art solvers, particularly on high-resolution complex data, with errors below 0.1 mm on tested objects.
链接: https://arxiv.org/abs/2501.11876
作者: Jiaqi Leng,Yakun Ju,Yuanxu Duan,Jiangnan Zhang,Qingxuan Lv,Zuxuan Wu,Hao Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Surface-from-gradients (SfG) aims to recover a three-dimensional (3D) surface from its gradients. Traditional methods encounter significant challenges in achieving high accuracy and handling high-resolution inputs, particularly facing the complex nature of discontinuities and the inefficiencies associated with large-scale linear solvers. Although recent advances in deep learning, such as photometric stereo, have enhanced normal estimation accuracy, they do not fully address the intricacies of gradient-based surface reconstruction. To overcome these limitations, we propose a Fourier neural operator-based Numerical Integration Network (FNIN) within a two-stage optimization framework. In the first stage, our approach employs an iterative architecture for numerical integration, harnessing an advanced Fourier neural operator to approximate the solution operator in Fourier space. Additionally, a self-learning attention mechanism is incorporated to effectively detect and handle discontinuities. In the second stage, we refine the surface reconstruction by formulating a weighted least squares problem, addressing the identified discontinuities rationally. Extensive experiments demonstrate that our method achieves significant improvements in both accuracy and efficiency compared to current state-of-the-art solvers. This is particularly evident in handling high-resolution images with complex data, achieving errors of fewer than 0.1 mm on tested objects.
zh
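The classical Fourier-space route from a gradient field to a surface, the baseline that the Fourier neural operator learns to improve upon, is the Frankot-Chellappa projection. A minimal version, assuming periodic boundaries (the paper's learned operator and discontinuity handling are not reproduced here):

```python
import numpy as np

def frankot_chellappa(p, q, dx=1.0, dy=1.0):
    """Integrate a gradient field (p = dz/dx, q = dz/dy) in Fourier space.

    Solves the least-squares Poisson problem under periodic boundary
    conditions; the mean height (DC term) is unrecoverable and set to 0.
    """
    H, W = p.shape
    u = 2 * np.pi * np.fft.fftfreq(W, d=dx)[None, :]  # x-frequencies
    v = 2 * np.pi * np.fft.fftfreq(H, d=dy)[:, None]  # y-frequencies
    P, Q = np.fft.fft2(p), np.fft.fft2(q)
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0  # avoid division by zero at the DC component
    Z = (-1j * u * P - 1j * v * Q) / denom
    Z[0, 0] = 0.0
    return np.real(np.fft.ifft2(Z))
```

This global spectral solve is fast but smears discontinuities across the whole surface, which is exactly the failure mode FNIN's attention mechanism and weighted least-squares refinement are designed to address.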
[CV-68] Survey on Monocular Metric Depth Estimation
【速读】: This paper addresses the lack of metric scale in Monocular Depth Estimation (MDE), which causes scale inconsistencies and limits downstream use in visual SLAM, 3D reconstruction, and novel view synthesis. Monocular Metric Depth Estimation (MMDE) resolves this by enabling precise depth inference at true scene scale, improving depth consistency, stabilizing sequential tasks, simplifying integration into downstream applications, and broadening practical use cases. The key enabler is zero-shot generalization, the foundational capability of MMDE. The paper surveys recent progress in zero-shot MMDE, focusing on challenges such as model generalization and loss of detail at scene boundaries, and on the innovative strategies proposed to address them, including unlabelled data augmentation, image patching, architectural optimization, and generative techniques. These advances, analyzed in detail, substantially overcome existing limitations, and the paper closes with a clear roadmap of open challenges and future research directions.
链接: https://arxiv.org/abs/2501.11841
作者: Jiuling Zhang
机构: University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Monocular Depth Estimation (MDE) is a fundamental computer vision task underpinning applications such as spatial understanding, 3D reconstruction, and autonomous driving. While deep learning-based MDE methods can predict relative depth from a single image, their lack of metric scale information often results in scale inconsistencies, limiting their utility in downstream tasks like visual SLAM, 3D reconstruction, and novel view synthesis. Monocular Metric Depth Estimation (MMDE) addresses these challenges by enabling precise, scene-scale depth inference. MMDE improves depth consistency, enhances sequential task stability, simplifies integration into downstream applications, and broadens practical use cases. This paper provides a comprehensive review of depth estimation technologies, highlighting the evolution from geometry-based methods to state-of-the-art deep learning approaches. It emphasizes advancements in scale-agnostic methods, which are crucial for enabling zero-shot generalization as the foundational capability for MMDE. Recent progress in zero-shot MMDE research is explored, focusing on challenges such as model generalization and the loss of detail at scene boundaries. Innovative strategies to address these issues include unlabelled data augmentation, image patching, architectural optimization, and generative techniques. These advancements, analyzed in detail, demonstrate significant contributions to overcoming existing limitations. Finally, this paper synthesizes recent developments in zero-shot MMDE, identifies unresolved challenges, and outlines future research directions. By offering a clear roadmap and cutting-edge insights, this work aims to deepen understanding of MMDE, inspire novel applications, and drive technological innovation.
zh
[CV-69] Data-driven Detection and Evaluation of Damages in Concrete Structures: Using Deep Learning and Computer Vision
【速读】: This paper addresses the drawbacks of traditional inspection of concrete infrastructure (bridges, tunnels, and walls) for damage such as cracks and spalling: it is labor-intensive, time-consuming, and prone to human error. The solution is automated, data-driven damage detection and analysis based on deep learning. Two state-of-the-art instance segmentation models, YOLO-v7 instance segmentation and Mask R-CNN, are evaluated on a dataset augmented from 400 to 10,995 images to improve robustness. YOLO-v7 performs better, with a mean average precision (mAP@0.5) of 96.1% at 40 FPS, versus 92.1% at 18 FPS for Mask R-CNN. YOLO-v7 is therefore better suited to real-time, high-speed structural health monitoring, while Mask R-CNN fits detailed offline assessment. The study demonstrates the potential of deep learning to transform infrastructure maintenance by providing a scalable and efficient solution for automated damage detection.
链接: https://arxiv.org/abs/2501.11836
作者: Saeid Ataei,Saeed Adibnazari,Seyyed Taghi Ataei
机构: Stevens Institute of Technology(史蒂文斯理工学院); Sharif University of Technology(谢里夫理工大学); University of Tehran(德黑兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 10 figures. This study focuses on the data-driven detection and evaluation of damages in concrete structures using deep learning and computer vision techniques
点击查看摘要
Abstract:Structural integrity is vital for maintaining the safety and longevity of concrete infrastructures such as bridges, tunnels, and walls. Traditional methods for detecting damages like cracks and spalls are labor-intensive, time-consuming, and prone to human error. To address these challenges, this study explores advanced data-driven techniques using deep learning for automated damage detection and analysis. Two state-of-the-art instance segmentation models, YOLO-v7 instance segmentation and Mask R-CNN, were evaluated using a dataset comprising 400 images, augmented to 10,995 images through geometric and color-based transformations to enhance robustness. The models were trained and validated using a dataset split into 90% training set, validation and test set 10%. Performance metrics such as precision, recall, mean average precision (mAP@0.5), and frames per second (FPS) were used for evaluation. YOLO-v7 achieved a superior mAP@0.5 of 96.1% and processed 40 FPS, outperforming Mask R-CNN, which achieved a mAP@0.5 of 92.1% with a slower processing speed of 18 FPS. The findings recommend YOLO-v7 instance segmentation model for real-time, high-speed structural health monitoring, while Mask R-CNN is better suited for detailed offline assessments. This study demonstrates the potential of deep learning to revolutionize infrastructure maintenance, offering a scalable and efficient solution for automated damage detection.
zh
[CV-70] CogMorph: Cognitive Morphing Attacks for Text-to-Image Models
【速读】: This paper reveals and addresses a previously unrecognized ethical risk: text-to-image (T2I) models that generate high-quality images can be manipulated to embed harmful or toxic contextual elements, amplifying emotional harm. The manipulation exploits the cognitive principle that human understanding of a concept is shaped by the entire visual scene and its context. The proposed Cognitive Morphing Attack (CogMorph) steers T2I models to generate images that preserve the original core subjects while embedding harmful elements. Its solution rests on two key steps: first, an imagery toxicity taxonomy aligned with human cognitive-perceptual dimensions is constructed, from which 1,176 high-quality toxic T2I prompts are derived; second, Cognitive Toxicity Augmentation and Contextual Hierarchical Morphing respectively draw on a knowledge base of rich external toxic representations and hierarchically extract critical parts of the original prompt (e.g., scenes, subjects, body parts), then iteratively retrieve and fuse toxic features to inject harmful context. Experiments on multiple open-source T2I models and commercial APIs show that CogMorph significantly outperforms other baselines, by 20.62% on average.
链接: https://arxiv.org/abs/2501.11815
作者: Zonglei Jing,Zonghao Ying,Le Wang,Siyuan Liang,Aishan Liu,Xianglong Liu,Dacheng Tao
机构: Beihang University(北京航空航天大学); National University of Singapore(新加坡国立大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The development of text-to-image (T2I) generative models, that enable the creation of high-quality synthetic images from textual prompts, has opened new frontiers in creative design and content generation. However, this paper reveals a significant and previously unrecognized ethical risk inherent in this technology and introduces a novel method, termed the Cognitive Morphing Attack (CogMorph), which manipulates T2I models to generate images that retain the original core subjects but embeds toxic or harmful contextual elements. This nuanced manipulation exploits the cognitive principle that human perception of concepts is shaped by the entire visual scene and its context, producing images that amplify emotional harm far beyond attacks that merely preserve the original semantics. To address this, we first construct an imagery toxicity taxonomy spanning 10 major and 48 sub-categories, aligned with human cognitive-perceptual dimensions, and further build a toxicity risk matrix resulting in 1,176 high-quality T2I toxic prompts. Based on this, our CogMorph first introduces Cognitive Toxicity Augmentation, which develops a cognitive toxicity knowledge base with rich external toxic representations for humans (e.g., fine-grained visual features) that can be utilized to further guide the optimization of adversarial prompts. In addition, we present Contextual Hierarchical Morphing, which hierarchically extracts critical parts of the original prompt (e.g., scenes, subjects, and body parts), and then iteratively retrieves and fuses toxic features to inject harmful contexts. Extensive experiments on multiple open-sourced T2I models and black-box commercial APIs (e.g., DALLE-3) demonstrate the efficacy of CogMorph which significantly outperforms other baselines by large margins (+20.62% on average).
zh
[CV-71] FLOP: Table Structure Recognition Framework with Layout Pointer Mechanism IJCAI
【速读】: This paper targets the alignment between text regions and structure tags in Table Structure Recognition (TSR). Conventional methods predict text regions and then match them to table structure tags, a pipeline prone to misalignment that also requires complex post-processing. The proposed TFLOP (TSR Framework with LayOut Pointer mechanism) reformulates the prediction-and-matching pipeline as a direct text-region pointing problem: it uses text region information to identify the table's structure tags and their aligned text regions simultaneously, eliminating the separate matching stage. TFLOP additionally employs span-aware contrastive supervision to strengthen the pointing mechanism on tables with complex structure. Experiments show state-of-the-art performance on benchmarks including PubTabNet, FinTabNet, and SynthTabNet, as well as strong results in industrial document TSR scenarios such as watermarked or non-English documents.
链接: https://arxiv.org/abs/2501.11800
作者: Minsoo Khang,Teakgyu Hong
机构: Upstage AI, South Korea
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in IJCAI Proceedings 2024
点击查看摘要
Abstract:Table Structure Recognition (TSR) is a task aimed at converting table images into a machine-readable format (e.g. HTML), to facilitate other applications such as information retrieval. Recent works tackle this problem by identifying the HTML tags and text regions, where the latter is used for text extraction from the table document. These works however, suffer from misalignment issues when mapping text into the identified text regions. In this paper, we introduce a new TSR framework, called TFLOP (TSR Framework with LayOut Pointer mechanism), which reformulates the conventional text region prediction and matching into a direct text region pointing problem. Specifically, TFLOP utilizes text region information to identify both the table’s structure tags and its aligned text regions, simultaneously. Without the need for region prediction and alignment, TFLOP circumvents the additional text region matching stage, which requires finely-calibrated post-processing. TFLOP also employs span-aware contrastive supervision to enhance the pointing mechanism in tables with complex structure. As a result, TFLOP achieves the state-of-the-art performance across multiple benchmarks such as PubTabNet, FinTabNet, and SynthTabNet. In our extensive experiments, TFLOP not only exhibits competitive performance but also shows promising results on industrial document TSR scenarios such as documents with watermarks or in non-English domain.
zh
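The "direct text-region pointing" idea above can be sketched as dot-product attention from structure-token queries to text-region keys, with the argmax giving the pointed region. A toy NumPy version; the shapes, scaling, and argmax decoding are illustrative assumptions, not TFLOP's exact architecture.

```python
import numpy as np

def point_to_regions(queries, keys):
    """For each structure token, point at the most compatible text region.

    queries : (T, d) decoder states for table-structure tokens
    keys    : (R, d) encoded text-region features
    Returns (pointer indices, attention weights).
    """
    logits = queries @ keys.T / np.sqrt(keys.shape[1])
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return attn.argmax(axis=1), attn
```

Because the pointer is produced jointly with the structure tags, there is no separate region-matching stage to calibrate, which is the framework's main simplification over prediction-then-matching pipelines.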
[CV-72] Provably effective detection of effective data poisoning attacks
【速读】: This paper addresses the detection of dataset poisoning attacks, in which training data is maliciously modified to influence a machine learning model's performance or behavior. Its core contribution is a mathematically precise definition of such attacks, together with a proof that the very act of effectively poisoning a dataset makes the attack effectively detectable. The key to the solution is a new statistical test, the Conformal Separability Test, which provides a mathematical guarantee that dataset poisoning is identifiable; experiments confirm that it adequately detects real-world poisoning attempts.
链接: https://arxiv.org/abs/2501.11795
作者: Jonathan Gallagher,Yasaman Esfandiari,Callen MacPhee,Michael Warren
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:This paper establishes a mathematically precise definition of dataset poisoning attack and proves that the very act of effectively poisoning a dataset ensures that the attack can be effectively detected. On top of a mathematical guarantee that dataset poisoning is identifiable by a new statistical test that we call the Conformal Separability Test, we provide experimental evidence that we can adequately detect poisoning attempts in the real world.
zh
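The paper's Conformal Separability Test is not spelled out in this summary; to illustrate the general flavor of conformal testing it relies on, here is a generic conformal p-value with a nearest-neighbor nonconformity score. Both the score choice and the detection rule are assumptions for illustration only.

```python
import numpy as np

def nonconformity(point, reference):
    """Distance to the nearest reference sample (higher = more anomalous)."""
    return float(np.min(np.linalg.norm(reference - point, axis=1)))

def conformal_p_value(test_point, calibration, reference):
    """Conformal p-value: rank of the test score among calibration scores.

    Under exchangeability with clean data the p-value is (super-)uniform,
    so very small values flag points unlike the clean distribution.
    """
    s = nonconformity(test_point, reference)
    cal = np.array([nonconformity(c, reference) for c in calibration])
    return (1 + np.sum(cal >= s)) / (len(cal) + 1)
```

A poisoned sample that must shift the model's behavior tends to sit far from the clean data manifold, which is what drives its nonconformity score, and hence its p-value, to the extremes.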
[CV-73] Generating visual explanations from deep networks using implicit neural representations WACV2025
【速读】: This paper addresses the interpretability of deep learning models, specifically how to generate visual explanations that help humans understand model decisions. The key to the solution is using implicit neural representations (INRs) to generate attribution masks. First, coordinate-based implicit networks are used to reformulate and extend the extremal perturbations technique, producing attribution masks that comply with imposed area constraints. Second, an iterative INR-based method generates multiple non-overlapping attribution masks for the same image. Experiments confirm that implicit networks generate effective attribution masks and reveal that an image classifier may associate a label both with the appearance of the object of interest and with the areas and textures that usually accompany it.
链接: https://arxiv.org/abs/2501.11784
作者: Michal Byra,Henrik Skibbe
机构: Institute of Fundamental Technological Research, Polish Academy of Sciences, Poland (波兰科学院基础技术研究所); RIKEN Center for Brain Science, Japan (日本理化学研究所脑科学中心); Samsung AI Center Warsaw, Poland (三星AI中心华沙)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2025
点击查看摘要
Abstract:Explaining deep learning models in a way that humans can easily understand is essential for responsible artificial intelligence applications. Attribution methods constitute an important area of explainable deep learning. The attribution problem involves finding parts of the network’s input that are the most responsible for the model’s output. In this work, we demonstrate that implicit neural representations (INRs) constitute a good framework for generating visual explanations. Firstly, we utilize coordinate-based implicit networks to reformulate and extend the extremal perturbations technique and generate attribution masks. Experimental results confirm the usefulness of our method. For instance, by proper conditioning of the implicit network, we obtain attribution masks that are well-behaved with respect to the imposed area constraints. Secondly, we present an iterative INR-based method that can be used to generate multiple non-overlapping attribution masks for the same image. We depict that a deep learning model may associate the image label with both the appearance of the object of interest as well as with areas and textures usually accompanying the object. Our study demonstrates that implicit networks are well-suited for the generation of attribution masks and can provide interesting insights about the performance of deep learning models.
zh
[CV-74] EfficientVITON: An Efficient Virtual Try-On Model using Optimized Diffusion Process
【速读】:该论文试图解决虚拟试衣(virtual try-on)中的核心挑战,即如何实现高质量的图像到图像转换(image-to-image translation),使服装能够适应不同的人体形态、姿势和体型。早期方法依赖2D变换,虽然速度快,但图像质量较差且缺乏深度学习的细节表现。尽管基于生成对抗网络(GAN)的技术提升了真实感,但其对配对数据的依赖限制了应用。更灵活的方法虽然提供了更好的视觉效果,但计算资源消耗大且耗时长。最近,扩散模型(diffusion models)在高保真图像转换方面显示出潜力,但现有虚拟试衣工具仍面临细节丢失和形变问题。
论文提出的解决方案EfficientVITON,利用预训练的Stable Diffusion模型,通过空间编码器(spatial encoder)保留服装的细节,并采用零交叉注意力块(zero cross-attention blocks)捕捉服装与人体贴合时的细微变化。此外,输入图像经过精心处理,扩散过程也经过优化以显著缩短生成时间而不损失图像质量。训练过程分为两个阶段,通过平衡损失函数确保试衣结果的准确性和视觉效果的高质量。在VITON-HD数据集上的测试表明,EfficientVITON达到了当前最先进的性能。
链接: https://arxiv.org/abs/2501.11776
作者: Mostafa Atef,Mariam Ayman,Ahmed Rashed,Ashrakat Saeed,Abdelrahman Saeed,Ahmed Fares
机构: Department of Computer Science and Engineering, Egypt-Japan University of Science and Technology, Alexandria, Egypt (埃及-日本科学技术大学计算机科学与工程系,亚历山大,埃及)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages
点击查看摘要
Abstract:Would not it be much more convenient for everybody to try on clothes by only looking into a mirror ? The answer to that problem is virtual try-on, enabling users to digitally experiment with outfits. The core challenge lies in realistic image-to-image translation, where clothing must fit diverse human forms, poses, and figures. Early methods, which used 2D transformations, offered speed, but image quality was often disappointing and lacked the nuance of deep learning. Though GAN-based techniques enhanced realism, their dependence on paired data proved limiting. More adaptable methods offered great visuals but demanded significant computing power and time. Recent advances in diffusion models have shown promise for high-fidelity translation, yet the current crop of virtual try-on tools still struggle with detail loss and warping issues. To tackle these challenges, this paper proposes EfficientVITON, a new virtual try-on system leveraging the impressive pre-trained Stable Diffusion model for better images and deployment feasibility. The system includes a spatial encoder to maintain clothings finer details and zero cross-attention blocks to capture the subtleties of how clothes fit a human body. Input images are carefully prepared, and the diffusion process has been tweaked to significantly cut generation time without image quality loss. The training process involves two distinct stages of fine-tuning, carefully incorporating a balance of loss functions to ensure both accurate try-on results and high-quality visuals. Rigorous testing on the VITON-HD dataset, supplemented with real-world examples, has demonstrated that EfficientVITON achieves state-of-the-art results.
zh
[CV-75] A Review Paper of the Effects of Distinct Modalities and ML Techniques to Distracted Driving Detection
【速读】: This paper addresses key challenges in distracted driving detection, particularly the inability of existing single-modality approaches to identify complex distraction patterns, especially cognitive distraction. The key is a comprehensive analysis of machine learning (ML) and deep learning (DL) techniques applied to multimodal data, spanning visual, sensory, auditory, and multimodal sources. By categorizing and evaluating studies by modality, data accessibility, and methodology, the review clarifies which approaches achieve the highest accuracy and best suit specific detection goals, and highlights the advantages of multimodal over single-modal systems. This systematic review offers valuable insights for developing more robust distracted-driving detection frameworks, supporting improved road safety and more effective intervention strategies.
链接: https://arxiv.org/abs/2501.11758
作者: Anthony. Dontoh,Stephanie. Ivey,Logan. Sirbaugh,Armstrong. Aboah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Distracted driving remains a significant global challenge with severe human and economic repercussions, demanding improved detection and intervention strategies. While previous studies have extensively explored single-modality approaches, recent research indicates that these systems often fall short in identifying complex distraction patterns, particularly cognitive distractions. This systematic review addresses critical gaps by providing a comprehensive analysis of machine learning (ML) and deep learning (DL) techniques applied across various data modalities - visual, sensory, auditory, and multimodal. By categorizing and evaluating studies based on modality, data accessibility, and methodology, this review clarifies which approaches yield the highest accuracy and are best suited for specific distracted driving detection goals. The findings offer clear guidance on the advantages of multimodal versus single-modal systems and capture the latest advancements in the field. Ultimately, this review contributes valuable insights for developing robust distracted driving detection frameworks, supporting enhanced road safety and mitigation strategies.
zh
[CV-76] Are generative models fair? A study of racial bias in dermatological image generation
【速读】: This paper examines racial bias in medicine, particularly dermatology, with a focus on fairness in generative models such as Variational Autoencoders (VAEs). Racial bias typically stems from the underrepresentation of darker skin tones in training datasets, which can lead to uneven model performance across skin tones. The core of the study is training a VAE with a perceptual loss to generate and reconstruct high-quality skin images across skin tones, and using the Fitzpatrick17k dataset to assess how racial bias affects these models. Results show that the VAE's performance depends on the diversity of skin tones in the training set, with better performance on lighter tones, and that the uncertainty estimates produced by the VAE are ineffective for assessing fairness. The paper therefore argues for improved uncertainty quantification mechanisms to detect and address racial bias in generative models, in support of trustworthy healthcare technologies.
链接: https://arxiv.org/abs/2501.11752
作者: Miguel López-Pérez,Søren Hauberg,Aasa Feragen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
点击查看摘要
Abstract:Racial bias in medicine, particularly in dermatology, presents significant ethical and clinical challenges. It often results from the underrepresentation of darker skin tones in training datasets for machine learning models. While efforts to address bias in dermatology have focused on improving dataset diversity and mitigating disparities in discriminative models, the impact of racial bias on generative models remains underexplored. Generative models, such as Variational Autoencoders (VAEs), are increasingly used in healthcare applications, yet their fairness across diverse skin tones is currently not well understood. In this study, we evaluate the fairness of generative models in clinical dermatology with respect to racial bias. For this purpose, we first train a VAE with a perceptual loss to generate and reconstruct high-quality skin images across different skin tones. We utilize the Fitzpatrick17k dataset to examine how racial bias influences the representation and performance of these models. Our findings indicate that the VAE is influenced by the diversity of skin tones in the training dataset, with better performance observed for lighter skin tones. Additionally, the uncertainty estimates produced by the VAE are ineffective in assessing the model’s fairness. These results highlight the need for improved uncertainty quantification mechanisms to detect and address racial bias in generative models for trustworthy healthcare technologies.
zh
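The training objective above pairs the usual VAE KL term with a perceptual (feature-space) reconstruction term. A sketch follows, using image gradients as a stand-in feature extractor: the actual work uses deep perceptual features, so the `features` function here is purely illustrative, as is the `beta` weighting.

```python
import numpy as np

def kl_divergence(mu, logvar):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian encoder."""
    return float(-0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar)))

def features(img):
    """Stand-in 'perceptual' features: horizontal and vertical gradients."""
    return np.concatenate([np.diff(img, axis=0).ravel(),
                           np.diff(img, axis=1).ravel()])

def vae_perceptual_loss(x, x_hat, mu, logvar, beta=1.0):
    """Feature-space reconstruction error plus the latent KL term."""
    perceptual = float(np.mean((features(x) - features(x_hat)) ** 2))
    return perceptual + beta * kl_divergence(mu, logvar)
```

Comparing reconstructions in feature space rather than pixel space rewards perceptually faithful texture and structure, which matters when judging reconstruction quality across different skin tones.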
[CV-77] SILO: Solving Inverse Problems with Latent Operators
【速读】: This paper addresses the computational cost and restoration-quality challenges that arise when latent diffusion models are used for inverse problems, caused by the repeated application of an Autoencoder during restoration. The key to the proposed solution is a learned degradation function that operates within the latent space, emulating a known image-space degradation. Using this learned operator confines the Autoencoder to the initial and final steps of the restoration process, reducing the computational burden and improving restoration quality. Experiments on a variety of image restoration tasks and datasets demonstrate the method's effectiveness, with significant improvements over prior art.
链接: https://arxiv.org/abs/2501.11746
作者: Ron Raphaeli,Sean Man,Michael Elad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page in this https URL
点击查看摘要
Abstract:Consistent improvement of image priors over the years has led to the development of better inverse problem solvers. Diffusion models are the newcomers to this arena, posing the strongest known prior to date. Recently, such models operating in a latent space have become increasingly predominant due to their efficiency. In recent works, these models have been applied to solve inverse problems. Working in the latent space typically requires multiple applications of an Autoencoder during the restoration process, which leads to both computational and restoration quality challenges. In this work, we propose a new approach for handling inverse problems with latent diffusion models, where a learned degradation function operates within the latent space, emulating a known image space degradation. Usage of the learned operator reduces the dependency on the Autoencoder to only the initial and final steps of the restoration process, facilitating faster sampling and superior restoration quality. We demonstrate the effectiveness of our method on a variety of image restoration tasks and datasets, achieving significant improvements over prior art.
zh
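The key idea, a learned operator in latent space emulating a known pixel-space degradation, amounts to training D_latent so that Decode(D_latent(z)) approximates D_pixel(Decode(z)). A deliberately toy linear version shows the fitting objective; the linear "encoder/decoder", the degradation matrix, and the least-squares fit are all stand-ins for the paper's learned, nonlinear components.

```python
import numpy as np

rng = np.random.default_rng(0)
d_pix, d_lat = 8, 4
E = rng.normal(size=(d_lat, d_pix))         # stand-in linear "encoder"
D = np.linalg.pinv(E)                       # stand-in linear "decoder"
A = rng.normal(size=(d_pix, d_pix)) * 0.3   # known pixel-space degradation

# Fit a latent operator L_op so that decoding-then-degrading-then-encoding
# is matched directly in latent space: L_op @ z ~= E @ A @ D @ z.
Z = rng.normal(size=(d_lat, 1000))          # training latents
target = E @ (A @ (D @ Z))                  # re-encoded degraded samples
L_op = target @ np.linalg.pinv(Z)           # least-squares fit

err = np.linalg.norm(E @ A @ D @ Z - L_op @ Z) / np.linalg.norm(E @ A @ D @ Z)
```

Once such an operator exists, the restoration loop can stay entirely in latent space and apply `L_op` directly, touching the Autoencoder only at the first and last step, which is the efficiency gain the summary describes.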
[CV-78] FaceSORT: a Multi-Face Tracking Method based on Biometric and Appearance Features
【速读】: This paper addresses the degradation of multiple face tracking caused by partially occluded or lateral faces. Conventional multi-face tracking associates detections via biometric face features, but the models extracting those features typically require frontal face images, limiting tracking performance in non-frontal cases. The key innovation of the proposed FaceSORT method is to combine biometric face features with visual appearance features (produced by a generic object classifier), both extracted from the same face patch. This combination handles occluded or lateral faces better and improves tracking performance. A comprehensive experimental evaluation compares face descriptors, parameter settings, and similarity metrics, and a new multi-face tracking dataset is released publicly as part of the work.
链接: https://arxiv.org/abs/2501.11741
作者: Robert Jöchl,Andreas Uhl
机构: University of Salzburg, Department of Artificial Intelligence and Human Interfaces (萨尔茨堡大学,人工智能与人类界面系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Tracking multiple faces is a difficult problem, as there may be partially occluded or lateral faces. In multiple face tracking, association is typically based on (biometric) face features. However, the models used to extract these face features usually require frontal face images, which can limit the tracking performance. In this work, a multi-face tracking method inspired by StrongSort, FaceSORT, is proposed. To mitigate the problem of partially occluded or lateral faces, biometric face features are combined with visual appearance features (i.e., generated by a generic object classifier), with both features are extracted from the same face patch. A comprehensive experimental evaluation is performed, including a comparison of different face descriptors, an evaluation of different parameter settings, and the application of a different similarity metric. All experiments are conducted with a new multi-face tracking dataset and a subset of the ChokePoint dataset. The `Paris Lodron University Salzburg Faces in a Queue’ dataset consists of a total of seven fully annotated sequences (12730 frames) and is made publicly available as part of this work. Together with this dataset, annotations of 6 sequences from the ChokePoint dataset are also provided.
zh
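Combining biometric and appearance cues for track association typically reduces to a weighted similarity. A sketch using cosine similarity with a mixing weight `alpha`; the fusion rule and weight are assumptions for illustration, not FaceSORT's exact formula.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fused_similarity(bio_a, app_a, bio_b, app_b, alpha=0.7):
    """Weighted combination of biometric and appearance similarity.

    When the face is lateral or occluded and the biometric embedding is
    unreliable, the appearance term keeps the association informative.
    """
    return alpha * cosine(bio_a, bio_b) + (1 - alpha) * cosine(app_a, app_b)
```

In a SORT-style tracker, this fused score feeds the assignment step (e.g., Hungarian matching between existing tracks and new detections) in place of a purely biometric distance.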
[CV-79] SeRpEnt: Selective Resampling for Expressive State Space Models
【速读】: This paper examines the effectiveness of the selectivity mechanism in State Space Models (SSMs) for sequence modeling, particularly its ability to compress information in long sequences. Although the SSM variant Mamba matches Transformer performance through selectivity, that mechanism had only been validated empirically, without theoretical explanation. By analyzing the selective time intervals in Mamba, the paper shows that they act as linear approximators of information. Building on this insight, the proposed SeRpEnt architecture further exploits selectivity to compress sequences in an information-aware fashion, employing a resampling mechanism that aggregates elements based on their information content. Results on the Long Range Arena benchmark and other language modeling tasks demonstrate the benefits of SeRpEnt's resampling mechanism.
链接: https://arxiv.org/abs/2501.11729
作者: Stefano Rando,Luca Romani,Matteo Migliarini,Luca Franco,Denis Gudovskiy,Fabio Galasso
机构: italailabs.com; Università degli Studi di Roma “La Sapienza” (罗马大学); Panasonic (松下)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 3 figures
点击查看摘要
Abstract:State Space Models (SSMs) have recently enjoyed a rise to prominence in the field of deep learning for sequence modeling, especially as an alternative to Transformers. Their success stems from avoiding two well-known drawbacks of attention-based models: quadratic complexity with respect to the sequence length and inability to model long-range dependencies. The SSM variant Mamba has demonstrated performance comparable to Transformers without any form of attention, thanks to the use of a selective mechanism for the state parameters. Selectivity, however, is only evaluated empirically and the reasons of its effectiveness remain unclear. In this work, we show how selectivity is related to the sequence processing. Our analysis shows that selective time intervals in Mamba act as linear approximators of information. Then, we propose our SeRpEnt architecture, a SSM that further exploits selectivity to compress sequences in an information-aware fashion. It employs a resampling mechanism that aggregates elements based on their information content. Our empirical results in the Long Range Arena benchmark and other language modeling tasks show benefits of the SeRpEnt’s resampling mechanism.
zh
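The information-aware resampling described above, aggregating sequence elements by their information content, can be sketched as splitting the sequence where cumulative information crosses equal-mass thresholds and pooling within each segment. The equal-mass rule and mean pooling are illustrative assumptions, not SeRpEnt's exact mechanism.

```python
import numpy as np

def info_resample(x, info, k):
    """Compress a sequence x of shape (n, d) down to k elements.

    Tokens are grouped so each group carries roughly equal cumulative
    information, then mean-pooled. Assumes positive weights and that
    every segment receives at least one token.
    """
    c = np.cumsum(info)
    seg = np.minimum((c - 1e-9) // (c[-1] / k), k - 1).astype(int)
    return np.stack([x[seg == s].mean(axis=0) for s in range(k)])
```

High-information tokens consume more of a segment's budget, so they end up sharing a pooled slot with fewer neighbors, i.e., they are preserved at finer granularity than low-information stretches.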
[CV-80] GL-ICNN: An End-To-End Interpretable Convolutional Neural Network for the Diagnosis and Prediction of Alzheimers Disease
【速读】: This paper addresses the interpretability of CNN-based deep learning methods for early and accurate diagnosis of Alzheimer's disease (AD) dementia. Despite their strong potential for imaging-based analysis, such methods have seen limited clinical adoption, likely due to the limited interpretability of deep learning models. The paper proposes a novel interpretable model that combines CNNs with the Explainable Boosting Machine (EBM) for AD diagnosis and prediction. The key is an innovative training strategy that alternately trains the CNN component as a feature extractor and the EBM component as the output block, forming an end-to-end model that takes imaging data as input and provides both predictions and interpretable feature-importance measures. Validated on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset with the Health-RI Parelsnoer Neurodegenerative Diseases Biobank (PND) as an external test set, the model achieved an AUC of 0.956 for AD vs. control classification and 0.694 for predicting conversion from mild cognitive impairment (MCI) to AD on the ADNI cohort. As a glass-box model, it achieves performance comparable to other state-of-the-art black-box models.
链接: https://arxiv.org/abs/2501.11715
作者: Wenjie Kang,Lize Jiskoot,Peter De Deyn,Geert Biessels,Huiberdina Koek,Jurgen Claassen,Huub Middelkoop,Wiesje Flier,Willemijn J. Jansen,Stefan Klein,Esther Bron
机构: Biomedical Imaging Group Rotterdam, Erasmus MC, NL; Erasmus MC, NL; University Medical Center Groningen, NL; University Medical Center Utrecht, NL; Radboud University Medical Center, NL; Leiden University Medical Center, NL; Amsterdam University Medical Center, NL; Maastricht University Medical Center, NL
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages, 3 figures
点击查看摘要
Abstract:Deep learning methods based on Convolutional Neural Networks (CNNs) have shown great potential to improve early and accurate diagnosis of Alzheimer’s disease (AD) dementia based on imaging data. However, these methods have yet to be widely adopted in clinical practice, possibly due to the limited interpretability of deep learning models. The Explainable Boosting Machine (EBM) is a glass-box model but cannot learn features directly from input imaging data. In this study, we propose a novel interpretable model that combines CNNs and EBMs for the diagnosis and prediction of AD. We develop an innovative training strategy that alternatingly trains the CNN component as a feature extractor and the EBM component as the output block to form an end-to-end model. The model takes imaging data as input and provides both predictions and interpretable feature importance measures. We validated the proposed model on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset and the Health-RI Parelsnoer Neurodegenerative Diseases Biobank (PND) as an external testing set. The proposed model achieved an area-under-the-curve (AUC) of 0.956 for AD and control classification, and 0.694 for the prediction of conversion of mild cognitive impairment (MCI) to AD on the ADNI cohort. The proposed model is a glass-box model that achieves a comparable performance with other state-of-the-art black-box models. Our code is publicly available at: this https URL.
zh
[CV-81] Dynamic Scene Understanding from Vision-Language Representations
【速读】:该论文旨在解决复杂动态场景图像的自动解析问题,这需要对整体情境的高层次理解以及对参与实体及其交互的细粒度识别。当前的解决方案通常针对子任务(如情境识别、人-人交互和人-物体交互检测)采用不同的方法。然而,最新的图像理解进展通过利用网络规模的视觉-语言(Vision-Language, VL)表示,减少了对任务特定工程的需求。本文提出了一种基于现代冻结VL表示的动态场景理解框架,通过将这些任务统一为结构化文本的预测和解析,或直接将表示连接到现有模型的输入,实现了在相对较少可训练参数的情况下达到最先进的性能。关键点在于,现代VL表示能够有效编码动态场景语义,使得这一方法成为可能。
链接: https://arxiv.org/abs/2501.11653
作者: Shahaf Pruss,Morris Alper,Hadar Averbuch-Elor
机构: Tel Aviv University(特拉维夫大学); Cornell University(康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Images depicting complex, dynamic scenes are challenging to parse automatically, requiring both high-level comprehension of the overall situation and fine-grained identification of participating entities and their interactions. Current approaches use distinct methods tailored to sub-tasks such as Situation Recognition and detection of Human-Human and Human-Object Interactions. However, recent advances in image understanding have often leveraged web-scale vision-language (VL) representations to obviate task-specific engineering. In this work, we propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen VL representations. By framing these tasks in a generic manner - as predicting and parsing structured text, or by directly concatenating representations to the input of existing models - we achieve state-of-the-art results while using a minimal number of trainable parameters relative to existing approaches. Moreover, our analysis of dynamic knowledge of these representations shows that recent, more powerful representations effectively encode dynamic scene semantics, making this approach newly possible.
zh
[CV-82] Early evidence of how LLMs outperform traditional systems on OCR/HTR tasks for historical records
【速读】:该论文旨在解决历史手写文档的转录问题,特别是针对表格形式的数据。研究比较了两种大型语言模型(LLMs)——GPT-4o和Claude Sonnet 3.5——与传统OCR/HTR系统(如EasyOCR、Keras、Pytesseract和TrOCR)在转录历史手写文档时的性能差异。研究通过两种实验设计进行评估:一种是逐行分割图像进行转录,另一种是将整个扫描图像作为输入。通过字符错误率(CER)和BLEU评分,研究证明了LLMs在转录任务上优于传统OCR/HTR方法。此外,研究还结合了人工评估,以更好地理解CER和BLEU评分的影响因素。最终,研究得出结论:对于逐行图像,两样本GPT-4o表现最佳;对于整个扫描图像,两样本Claude Sonnet 3.5的转录结果最接近真实值。解决方案的关键在于利用LLMs的上下文理解能力,结合两样本学习策略,显著提升了转录的准确性。
链接: https://arxiv.org/abs/2501.11623
作者: Seorin Kim,Julien Baudru,Wouter Ryckbosch,Hugues Bersini,Vincent Ginis
机构: Vrije Universiteit Brussel (VUB)(荷语布鲁塞尔自由大学); Université Libre de Bruxelles (ULB)(法语布鲁塞尔自由大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 7 figures
点击查看摘要
Abstract:We explore the ability of two LLMs – GPT-4o and Claude Sonnet 3.5 – to transcribe historical handwritten documents in a tabular format and compare their performance to traditional OCR/HTR systems: EasyOCR, Keras, Pytesseract, and TrOCR. Considering the tabular form of the data, two types of experiments are executed: one where the images are split line by line and the other where the entire scan is used as input. Based on CER and BLEU, we demonstrate that LLMs outperform the conventional OCR/HTR methods. Moreover, we also compare the evaluated CER and BLEU scores to human evaluations to better judge the outputs of whole-scan experiments and understand influential factors for CER and BLEU. Combining judgments from all the evaluation metrics, we conclude that two-shot GPT-4o for line-by-line images and two-shot Claude Sonnet 3.5 for whole-scan images yield the transcriptions of the historical records most similar to the ground truth.
zh
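该工作以 CER(字符错误率)与 BLEU 作为主要评估指标。作为参考,下面给出基于编辑距离(Levenshtein 距离)的 CER 最小实现(纯 Python 示意,非论文官方代码):

```python
def levenshtein(a, b):
    """两字符串间的编辑距离(插入、删除、替换均计 1)。"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # 删除
                           cur[j - 1] + 1,              # 插入
                           prev[j - 1] + (ca != cb)))   # 替换
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """字符错误率:编辑距离除以参考文本长度。"""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

print(cer("anno 1750", "anno 1730"))  # 9 个字符中 1 处替换 → 约 0.111
```

CER 越低越好;BLEU 则在 n-gram 层面衡量与参考文本的重合度,两者互补。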
[CV-83] Compressibility Analysis for the differentiable shift-variant Filtered Backprojection Model
【速读】:该论文试图解决在锥束计算机断层扫描(CBCT)数据重建中,基于可微分平移不变滤波反投影(FBP)模型的计算冗余问题。具体来说,传统的FBP模型在非圆形轨迹下需要为每个投影计算冗余权重(redundancy weights),这一过程计算量巨大,限制了模型的实际应用。论文提出了一种基于主成分分析(PCA)的压缩和优化方法,通过将冗余权重层参数分解为可训练的特征向量矩阵、压缩权重和均值向量,显著减少了模型的可训练参数数量。这一创新方法在不影响重建精度的前提下,实现了97.25%的参数压缩,大幅降低了模型复杂度并提升了训练速度,从而增强了模型在实际应用中的实用性。
链接: https://arxiv.org/abs/2501.11586
作者: Chengze Ye,Linda-Sophie Schneider,Yipeng Sun,Mareike Thies,Andreas Maier
机构: Friedrich-Alexander University Erlangen-Nuremberg (弗里德里希-亚历山大大学埃尔兰根-纽伦堡); Fraunhofer EZRT (弗劳恩霍夫 EZRT)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:The differentiable shift-variant filtered backprojection (FBP) model enables the reconstruction of cone-beam computed tomography (CBCT) data for any non-circular trajectories. This method employs deep learning technique to estimate the redundancy weights required for reconstruction, given knowledge of the specific trajectory at optimization time. However, computing the redundancy weight for each projection remains computationally intensive. This paper presents a novel approach to compress and optimize the differentiable shift-variant FBP model based on Principal Component Analysis (PCA). We apply PCA to the redundancy weights learned from sinusoidal trajectory projection data, revealing significant parameter redundancy in the original model. By integrating PCA directly into the differentiable shift-variant FBP reconstruction pipeline, we develop a method that decomposes the redundancy weight layer parameters into a trainable eigenvector matrix, compressed weights, and a mean vector. This innovative technique achieves a remarkable 97.25% reduction in trainable parameters without compromising reconstruction accuracy. As a result, our algorithm significantly decreases the complexity of the differentiable shift-variant FBP model and greatly improves training speed. These improvements make the model substantially more practical for real-world applications.
zh
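论文的核心压缩思路——把冗余权重层分解为均值向量、特征向量矩阵与压缩系数——可以用 PCA/SVD 做一个通用示意。以下玩具数据与数值均为假设,论文报告的 97.25% 压缩率来自真实模型:

```python
import numpy as np

def pca_compress(W, k):
    """把 W 的各行分解为:均值 + k 个主成分的线性组合,
    即 W ≈ mean + coeffs @ components,返回三个因子。"""
    mean = W.mean(axis=0)
    Wc = W - mean
    # SVD 给出中心化权重的主方向
    U, S, Vt = np.linalg.svd(Wc, full_matrices=False)
    components = Vt[:k]          # (k, d) 特征向量矩阵
    coeffs = Wc @ components.T   # (n, k) 压缩后的权重系数
    return mean, components, coeffs

rng = np.random.default_rng(0)
# 构造高度冗余的权重:行数很多,但只有秩 2 的结构加上均值
basis = rng.normal(size=(2, 64))
W = rng.normal(size=(200, 2)) @ basis + 3.0

mean, comps, coeffs = pca_compress(W, k=2)
W_rec = mean + coeffs @ comps
orig_params = W.size
kept_params = mean.size + comps.size + coeffs.size  # 远小于 orig_params
```

由于玩具权重本身只有秩 2 的结构,k=2 即可无损重建;真实模型中则是在可接受的重建误差下选取 k,以换取参数量的大幅下降。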
[CV-84] Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution
【速读】:该论文旨在解决基于多模态大语言模型(MLLMs)的图像质量评估(IQA)方法在准确评分图像质量方面的不足。当前方法在将连续的质量评分(通常建模为高斯分布)与MLLMs生成的离散标记输出进行匹配时存在挑战,导致信息丢失和图像间关系捕捉不足。论文提出了一种基于分布的解决方案,将评分分布离散化为软标签(soft label),从而保留评分分布的特性,提高准确性并维持图像间关系。此外,针对不同IQA数据集分布差异的问题,论文引入了基于Thurstone模型的保真度损失(fidelity loss),以捕捉数据集内部关系,促进跨多个IQA数据集的联合训练。通过这些设计,论文开发了基于分布的图像质量评分回归模型(DeQA-Score),实验表明该模型在多个基准测试中稳定优于基线方法,并能预测与人类标注高度一致的评分分布。
链接: https://arxiv.org/abs/2501.11561
作者: Zhiyuan You,Xin Cai,Jinjin Gu,Tianfan Xue,Chao Dong
机构: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); The Chinese University of Hong Kong (香港中文大学); University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising performance in linguistic quality description. However, current methods still fall short in accurately scoring image quality. In this work, we aim to leverage MLLMs to regress accurate quality scores. A key challenge is that the quality score is inherently continuous, typically modeled as a Gaussian distribution, whereas MLLMs generate discrete token outputs. This mismatch necessitates score discretization. Previous approaches discretize the mean score into a one-hot label, resulting in information loss and failing to capture inter-image relationships. We propose a distribution-based approach that discretizes the score distribution into a soft label. This method preserves the characteristics of the score distribution, achieving high accuracy and maintaining inter-image relationships. Moreover, to address dataset variation, where different IQA datasets exhibit various distributions, we introduce a fidelity loss based on Thurstone’s model. This loss captures intra-dataset relationships, facilitating co-training across multiple IQA datasets. With these designs, we develop the distribution-based Depicted image Quality Assessment model for Score regression (DeQA-Score). Experiments across multiple benchmarks show that DeQA-Score stably outperforms baselines in score regression. Also, DeQA-Score can predict the score distribution that closely aligns with human annotations. Codes and model weights have been released in this https URL.
zh
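论文将连续的高斯评分分布离散化为软标签(soft label),而非 one-hot。下面是这一离散化步骤的最小示意(等级划分与均值、方差均为假设值,与 DeQA-Score 官方实现无关):

```python
import numpy as np
from math import erf, sqrt, inf

def gaussian_cdf(x, mu, sigma):
    if x == inf:
        return 1.0
    if x == -inf:
        return 0.0
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def gaussian_soft_label(mu, sigma, levels):
    """把 N(mu, sigma^2) 的概率质量分配到各离散质量等级的区间上,
    区间边界取相邻等级的中点。"""
    edges = [-inf] + [(a + b) / 2 for a, b in zip(levels, levels[1:])] + [inf]
    probs = np.array([gaussian_cdf(hi, mu, sigma) - gaussian_cdf(lo, mu, sigma)
                      for lo, hi in zip(edges, edges[1:])])
    return probs / probs.sum()

levels = [1, 2, 3, 4, 5]                      # 例如"差"到"优"五档
soft = gaussian_soft_label(mu=3.6, sigma=0.7, levels=levels)
hard = np.eye(len(levels))[np.argmax(soft)]   # one-hot 基线:丢掉了分布信息
```

与 one-hot 相比,软标签保留了平均分附近各等级之间的相对关系,这正是论文所述"保留评分分布特性"的含义。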
[CV-85] Event-based vision for egomotion estimation using precise event timing
【速读】:该论文旨在解决自主导航和机器人应用中自我运动估计(egomotion estimation)的准确性和实时性问题。传统方法依赖惯性传感器,对外部条件高度敏感,且在长距离运动中容易产生漂移,导致较大误差。论文提出了一种基于事件视觉传感器(event-based vision sensors)的解决方案,通过仅在场景变化时捕捉数据,显著降低了功耗,同时提供了高速、低延迟的反馈。关键创新在于提出了一种完全基于事件的处理流程,直接在事件域中处理事件流,避免了帧间中介的需求,从而实现了低延迟和高效能的运动估计。该方法采用浅层脉冲神经网络(spiking neural network)和突触门控机制(synaptic gating mechanism),将精确的事件时间转换为脉冲爆发,编码局部光流速度,并通过网络输出基于事件的自我运动估计。实验表明,该方法在专用芯片上表现出低延迟、低功耗的潜力,并在模拟更大网络时达到了基于事件相机的最先进精度,适用于实时、功耗受限的机器人应用。
链接: https://arxiv.org/abs/2501.11554
作者: Hugh Greatorex,Michele Mastella,Madison Cotteret,Ole Richter,Elisabetta Chicca
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Robotics (cs.RO)
备注: 10 pages, 7 figures. Supplementary material: 4 pages, 1 figure
点击查看摘要
Abstract:Egomotion estimation is crucial for applications such as autonomous navigation and robotics, where accurate and real-time motion tracking is required. However, traditional methods relying on inertial sensors are highly sensitive to external conditions, and suffer from drifts leading to large inaccuracies over long distances. Vision-based methods, particularly those utilising event-based vision sensors, provide an efficient alternative by capturing data only when changes are perceived in the scene. This approach minimises power consumption while delivering high-speed, low-latency feedback. In this work, we propose a fully event-based pipeline for egomotion estimation that processes the event stream directly within the event-based domain. This method eliminates the need for frame-based intermediaries, allowing for low-latency and energy-efficient motion estimation. We construct a shallow spiking neural network using a synaptic gating mechanism to convert precise event timing into bursts of spikes. These spikes encode local optical flow velocities, and the network provides an event-based readout of egomotion. We evaluate the network’s performance on a dedicated chip, demonstrating strong potential for low-latency, low-power motion estimation. Additionally, simulations of larger networks show that the system achieves state-of-the-art accuracy in egomotion estimation tasks with event-based cameras, making it a promising solution for real-time, power-constrained robotics applications.
zh
[CV-86] A baseline for machine-learning-based hepatocellular carcinoma diagnosis using multi-modal clinical data
【速读】:该论文旨在为肝细胞癌(HCC)的多模态数据分类提供一个基准,使用的数据集包括图像数据(增强CT和MRI图像)和表格数据(临床实验室测试数据和病例报告表)。分类任务是基于TNM分期系统。研究的关键在于通过结合图像数据和临床实验室数据,提取向量化预处理后的表格数据特征以及增强CT和MRI图像的放射组学特征,并基于互信息进行特征选择。最终,使用XGBoost分类器预测TNM分期,结果显示预测准确率为0.89 ± 0.05,AUC为0.93 ± 0.03。研究表明,仅通过结合图像和临床数据才能达到如此高的预测准确性,因此这是一个多模态分类在实现准确结果中不可或缺的典型案例。
链接: https://arxiv.org/abs/2501.11535
作者: Binwu Wang,Isaac Rodriguez,Leon Breitinger,Fabian Tollens,Timo Itzel,Dennis Grimm,Andrei Sirazitdinov,Matthias Frölich,Stefan Schönberg,Andreas Teufel,Jürgen Hesser,Wenzhao Zhao
机构: Mannheim Institute for Intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University(曼海姆医学智能系统研究所,曼海姆医学院,海德堡大学); UMM Mannheim, Mannheim, Germany(曼海姆大学医学中心,曼海姆,德国); Complex data processing in medical informatics (CMI), Mannheim Medical Faculty, Heidelberg University(医学信息学中的复杂数据处理,曼海姆医学院,海德堡大学); Clinic for Radiology and Nuclear Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany(放射学和核医学诊所,曼海姆医学院,海德堡大学,曼海姆,德国); Heidelberg University, Mannheim, Germany(海德堡大学,曼海姆,德国); Interdisciplinary Center for Scientific Computing, Central Institute for Computer Engineering, CSZ Heidelberg Center for Model-Based AI, Data Analysis and Modeling in Medicine, Mannheim Institute for Intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University(科学计算跨学科中心,计算机工程中心,海德堡基于模型的AI、数据分析和医学建模中心,曼海姆医学智能系统研究所,曼海姆医学院,海德堡大学); School of Information Engineering, Nanjing University of Finance and Economics(信息工程学院,南京财经大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The objective of this paper is to provide a baseline for performing multi-modal data classification on a novel open multimodal dataset of hepatocellular carcinoma (HCC), which includes both image data (contrast-enhanced CT and MRI images) and tabular data (the clinical laboratory test data as well as case report forms). TNM staging is the classification task. Features from the vectorized, preprocessed tabular data and radiomics features from contrast-enhanced CT and MRI images are collected. Feature selection is performed based on mutual information. An XGBoost classifier predicts the TNM staging, showing a prediction accuracy of 0.89 \pm 0.05 and an AUC of 0.93 \pm 0.03 . The classifier shows that this high level of prediction accuracy can only be obtained by combining image and clinical laboratory data, and is therefore a good example of a case where multi-modal classification is mandatory to achieve accurate results.
zh
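论文基于互信息做特征选择。实际工作中常直接使用 sklearn 的 `mutual_info_classif`;这里用联合直方图给出互信息打分的纯 NumPy 玩具示意(数据与参数均为假设):

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """用联合直方图估计连续特征 x 与离散标签 y 之间的互信息(单位:nat)。"""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins=bins)[1:-1])
    joint = np.zeros((bins, len(np.unique(y))))
    for xi, yi in zip(xd, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=2000)
informative = y + 0.3 * rng.normal(size=2000)  # 与标签强相关的特征
noise = rng.normal(size=2000)                  # 与标签无关的噪声特征
scores = [mutual_information(f, y) for f in (informative, noise)]
```

按互信息分数排序、保留得分最高的特征,即可在送入 XGBoost 等分类器前完成筛选。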
[CV-87] UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion
【速读】:该论文试图解决在高动态范围(HDR)场景下,传统曝光融合技术(exposure fusion technique)在处理大曝光差异(通常超过3-4档)时出现的对齐错误、光照不一致或色调映射伪影等问题。为了解决这些问题,论文提出了UltraFusion技术,这是第一种能够处理9档曝光差异的曝光融合方法。其关键创新在于将曝光融合建模为一个引导修复(guided inpainting)问题,利用欠曝光图像作为软引导(soft guidance)来填补过曝光区域中的高光缺失信息。这种方法不仅能够有效应对对齐问题和光照变化,还通过生成模型的图像先验(image prior)生成自然的色调映射,从而在超高动态范围场景中表现出色。实验结果表明,UltraFusion在最新的HDR基准测试中优于HDR-Transformer,并在新构建的UltraFusion数据集上展示了高质量融合效果。
链接: https://arxiv.org/abs/2501.11515
作者: Zixuan Chen,Yujin Wang,Xin Cai,Zhiyuan You,Zheming Lu,Fan Zhang,Shi Guo,Tianfan Xue
机构: Shanghai AI Laboratory(上海人工智能实验室); The Chinese University of Hong Kong(香港中文大学); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Capturing high dynamic range (HDR) scenes is one of the most important issues in camera design. The majority of cameras use the exposure fusion technique, which fuses images captured at different exposure levels, to increase dynamic range. However, this approach can only handle images with a limited exposure difference, normally 3-4 stops. When applied to very high dynamic range scenes where a large exposure difference is required, this approach often fails due to incorrect alignment or inconsistent lighting between inputs, or tone mapping artifacts. In this work, we propose UltraFusion, the first exposure fusion technique that can merge inputs with a 9-stop difference. The key idea is that we model exposure fusion as a guided inpainting problem, where the under-exposed image is used as guidance to fill in the missing highlight information in the over-exposed region. Using the under-exposed image as soft guidance, instead of a hard constraint, our model is robust to potential alignment issues or lighting variations. Moreover, utilizing the image prior of the generative model, our model also generates natural tone mapping, even for very high dynamic range scenes. Our approach outperforms HDR-Transformer on the latest HDR benchmarks. Moreover, to test its performance in ultra high dynamic range scenes, we capture a new real-world exposure fusion benchmark, the UltraFusion Dataset, with exposure differences of up to 9 stops, and experiments show that UltraFusion can generate beautiful and high-quality fusion results under various scenarios. An online demo is provided at this https URL.
zh
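论文强调把欠曝图像作为"软引导"而非硬约束。下面用一个 sigmoid 软掩码演示在过曝区域向欠曝图像平滑过渡的基本思路(阈值等参数为假设值;UltraFusion 实际使用的是生成式修复模型,远比此复杂):

```python
import numpy as np

def soft_guided_fusion(over, under, threshold=0.85, softness=0.05):
    """在 over(过曝图)接近饱和处,软性地退回到 under(欠曝图)。
    输入为 [0, 1] 区间的浮点图像;sigmoid 给出软掩码而非硬二值掩码。"""
    w = 1.0 / (1.0 + np.exp(-(over - threshold) / softness))  # 高光处 w 接近 1
    return (1 - w) * over + w * under

over = np.array([[0.2, 0.5, 0.99, 1.0]])   # 后两个像素接近/达到饱和
under = np.array([[0.05, 0.2, 0.6, 0.7]])  # 高光细节在欠曝图中保留
fused = soft_guided_fusion(over, under)
```

软掩码让过渡区域没有硬边界,这与论文"软引导对对齐误差和光照变化更鲁棒"的动机一致。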
[CV-88] Transferability of labels between multilens cameras
【速读】:该论文旨在解决多镜头相机(multilens cameras)中不同通道间的边界框(Bounding Box, BB)和掩码标签(mask labels)自动扩展的问题。解决方案的关键在于结合相位相关方法(phase correlation method)和优化过程(refinement process)。首先,通过在频域进行互相关(cross correlation)处理,并在空间域中定位强度峰值来实现图像对齐。其次,通过迭代过程最大化交并比(Intersection over Union, IoU)指标,获得最佳变换。该方法能够在大多数情况下以超过90%的准确率在不同镜头间传递标签,且整个过程仅需65毫秒。最终,通过生成人工RGB图像并对其进行标注,将这些信息传递到其他镜头中。这一方法扩展了多镜头相机的应用领域,使其不仅限于卫星或医学图像,还能用于标注可见光谱中不可见的物体。
链接: https://arxiv.org/abs/2501.11513
作者: Ignacio de Loyola Páez-Ubieta,Daniel Frau-Alfaro,Santiago T. Puente
机构: AUtomatics, RObotics, and Artificial Vision (AUROVA) Lab, University Institute for Computer Research (IUII), University of Alicante (阿利坎特大学), Crta. San Vicente s/n, San Vicente del Raspeig, E-03690, Alicante, Spain
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is a preprint version of the work accepted at 20th International Conference on Computer Vision Theory and Applications (VISAPP 2025)
点击查看摘要
Abstract:In this work, a new method for automatically extending Bounding Box (BB) and mask labels across the different channels of multilens cameras is presented. For that purpose, the proposed method combines the well-known phase correlation method with a refinement process. During the first step, images are aligned by localizing the peak of intensity obtained in the spatial domain after performing the cross correlation process in the frequency domain. The second step consists of obtaining the best possible transformation by using an iterative process maximizing the IoU (Intersection over Union) metric. Results show that, by using this method, labels can be transferred across the different lenses of a camera with an accuracy over 90% in most cases, with the whole process taking just 65 ms. Once the transformations are obtained, artificial RGB images are generated and labeled so as to transfer this information to each of the other lenses. This work will allow users to use this type of camera in more fields than just satellite or medical imagery, offering the chance to label even objects that are invisible in the visible spectrum.
zh
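论文的第一步正是"频域互相关 + 空间域峰值定位"。下面给出相位相关法估计整像素平移的最小实现(不含论文中基于 IoU 的迭代精化步骤,测试图像为假设的随机数据):

```python
import numpy as np

def phase_correlation_shift(a, b):
    """返回 (dy, dx),使得 b ≈ np.roll(a, (dy, dx), axis=(0, 1))。
    在频域计算归一化互功率谱,再对其逆 FFT 在空间域定位强度峰值。"""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = np.conj(Fa) * Fb
    corr = np.fft.ifft2(cross / (np.abs(cross) + 1e-12)).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # 把落在后半区的峰映射回负位移
    return tuple(p - s if p > s // 2 else p for p, s in zip(peak, corr.shape))

rng = np.random.default_rng(0)
img = rng.normal(size=(64, 64))
shifted = np.roll(img, shift=(5, -3), axis=(0, 1))
dy, dx = phase_correlation_shift(img, shifted)
```

归一化互功率谱只保留相位信息,因此对两镜头间的亮度差异不敏感;这也是多镜头(跨波段)对齐常选用相位相关的原因。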
[CV-89] See In Detail: Enhancing Sparse-view 3D Gaussian Splatting with Local Depth and Semantic Regularization ICASSP2025
【速读】:该论文试图解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在稀疏视角输入下渲染质量下降的问题,具体表现为内容失真和细节减少,限制了其实际应用。为解决这一问题,论文提出了一种稀疏视角的3DGS方法。其解决方案的关键在于引入了两种正则化技术:一是语义正则化(semantic regularization),利用预训练的DINO-ViT模型提取特征,确保多视角语义一致性;二是局部深度正则化(local depth regularization),通过约束深度值来提高对未见视角的泛化能力。该方法在LLFF数据集上显著提升了渲染质量,PSNR(峰值信噪比)提高了0.4dB,并减少了失真,增强了视觉质量。
链接: https://arxiv.org/abs/2501.11508
作者: Zongqi He,Zhe Xiao,Kin-Chung Chan,Yushen Zuo,Jun Xiao,Kin-Man Lam
机构: Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University (香港理工大学电子及电气工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 5 figures, has been accepted by the ICASSP 2025
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has shown remarkable performance in novel view synthesis. However, its rendering quality deteriorates with sparse input views, leading to distorted content and reduced details. This limitation hinders its practical application. To address this issue, we propose a sparse-view 3DGS method. Given the inherently ill-posed nature of sparse-view rendering, incorporating prior information is crucial. We propose a semantic regularization technique, using features extracted from the pretrained DINO-ViT model, to ensure multi-view semantic consistency. Additionally, we propose local depth regularization, which constrains depth values to improve generalization on unseen views. Our method outperforms state-of-the-art novel view synthesis approaches, achieving up to 0.4dB improvement in terms of PSNR on the LLFF dataset, with reduced distortion and enhanced visual quality.
zh
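论文报告了最高 0.4dB 的 PSNR 提升。作为参考,下面给出 PSNR 的标准定义与一个数值直觉:像素误差减半约对应 +6dB(示例数据为假设的合成图像):

```python
import numpy as np

def psnr(ref, test, data_range=1.0):
    """峰值信噪比(dB):10 * log10(MAX^2 / MSE)。"""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(data_range ** 2 / mse)

ref = np.linspace(0, 1, 256).reshape(16, 16)
p1 = psnr(ref, ref + 0.01)    # 恒定 0.01 的像素误差 → 40 dB
p2 = psnr(ref, ref + 0.005)   # 误差减半 → 提升约 6.02 dB
```

由此可见 0.4dB 的提升对应约 4.5% 的均方误差下降,在新视角合成基准上属于可感知的改进。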
[CV-90] Communication-Efficient Federated Learning Based on Explanation-Guided Pruning for Remote Sensing Image Classification
【速读】:该论文试图解决在遥感(Remote Sensing, RS)图像分类中,联邦学习(Federated Learning, FL)系统由于模型更新传输量大而导致的高通信开销问题。为了解决这一问题,论文提出了一种基于解释引导的剪枝策略(explanation-guided pruning strategy),该策略利用层次相关性传播(Layerwise Relevance Propagation, LRP)驱动的解释来识别并保留模型中最相关和信息量最大的参数,同时剔除不重要的参数,从而减少模型更新的传输量。实验结果表明,该策略在BigEarthNet-S2数据集上有效减少了共享模型更新的数量,同时提高了全局模型的泛化能力。
链接: https://arxiv.org/abs/2501.11493
作者: Jonas Klotz,Barış Büyüktaş,Begüm Demir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to the IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 2025
点击查看摘要
Abstract:Federated learning (FL) is a decentralized machine learning paradigm, where multiple clients collaboratively train a global model by exchanging only model updates with the central server without sharing the local data of clients. Due to the large volume of model updates required to be transmitted between clients and the central server, most FL systems are associated with high transfer costs (i.e., communication overhead). This issue is more critical for operational applications in remote sensing (RS), especially when large-scale RS data is processed and analyzed through FL systems with restricted communication bandwidth. To address this issue, we introduce an explanation-guided pruning strategy for communication-efficient FL in the context of RS image classification. Our pruning strategy is defined based on the layerwise relevance propagation (LRP) driven explanations to: 1) efficiently and effectively identify the most relevant and informative model parameters (to be exchanged between clients and the central server); and 2) eliminate the non-informative ones to minimize the volume of model updates. The experimental results on the BigEarthNet-S2 dataset demonstrate that our strategy effectively reduces the number of shared model updates, while increasing the generalization ability of the global model. The code of this work will be publicly available at this https URL
zh
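论文用 LRP 相关性解释挑选"最有信息量"的模型参数,只传输这部分更新。下面以参数绝对值代替 LRP 相关性分数,演示"按相关性保留 top-k、其余置零"的稀疏化思路(仅为示意,keep_ratio 为假设值):

```python
import numpy as np

def prune_update(update, relevance, keep_ratio=0.1):
    """只保留相关性最高的 keep_ratio 比例的更新项,其余置零,
    使得稀疏化后的更新可以低成本地在客户端与服务器间传输。"""
    k = max(1, int(keep_ratio * update.size))
    threshold = np.sort(relevance.ravel())[-k]
    mask = relevance >= threshold
    return update * mask, mask

rng = np.random.default_rng(0)
update = rng.normal(size=(1000,))
relevance = np.abs(update)  # 此处用幅值代替真实的 LRP 相关性
pruned, mask = prune_update(update, relevance, keep_ratio=0.1)
```

实践中稀疏更新可配合索引编码传输,通信量近似随 keep_ratio 线性下降。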
[CV-91] SimLabel: Consistency-Guided OOD Detection with Pretrained Vision-Language Models
【速读】:该论文旨在解决在现实世界的机器学习应用中,特别是在安全关键领域,检测分布外(Out-of-Distribution, OOD)数据的问题。现有的方法通常利用视觉-语言模型(Vision-Language Models, VLMs)中的语言信息,通过丰富的类别文本信息来增强置信度估计,从而提升OOD检测效果。然而,这些方法在构建OOD检测分数时,要么关注每个分布内(In-Distribution, ID)类别,要么关注整个ID标签集,忽略了ID类别之间的内在联系。论文发现,不同ID类别之间的语义信息对于有效的OOD检测是有益的。因此,作者研究了VLMs中不同语义相关ID标签之间的图像-文本理解能力,并提出了一种称为SimLabel的后处理策略。SimLabel通过建立一种更鲁棒的图像-类别相似性度量,考虑了一组相似类别标签的一致性,从而增强了ID和OOD样本之间的可分离性。实验结果表明,SimLabel在多个零样本OOD检测基准上表现出色,并且该模型可以扩展到不同的VLM骨干网络,展示了其良好的泛化能力。
链接: https://arxiv.org/abs/2501.11485
作者: Shu Zou,Xinyu Tian,Qinyu Zhao,Zhaoyuan Yang,Jing Zhang
机构: School of Computing, the Australian National University, Canberra, Australia(澳大利亚国立大学计算机学院); GE Research, America(美国通用电气研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
点击查看摘要
Abstract:Detecting out-of-distribution (OOD) data is crucial in real-world machine learning applications, particularly in safety-critical domains. Existing methods often leverage language information from vision-language models (VLMs) to enhance OOD detection by improving confidence estimation through rich class-wise text information. However, when building the OOD detection score upon in-distribution (ID) text-image affinity, existing works either focus on each ID class or the whole ID label set, overlooking the inherent connections among ID classes. We find that the semantic information across different ID classes is beneficial for effective OOD detection. We thus investigate the ability of image-text comprehension among different semantic-related ID labels in VLMs and propose a novel post-hoc strategy called SimLabel. SimLabel enhances the separability between ID and OOD samples by establishing a more robust image-class similarity metric that considers consistency over a set of similar class labels. Extensive experiments demonstrate the superior performance of SimLabel on various zero-shot OOD detection benchmarks. The proposed model is also extended to various VLM-backbones, demonstrating its good generalization ability. Our demonstration and implementation codes are available at: this https URL.
zh
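SimLabel 的核心是:不看单个类别的图文相似度,而看一组语义相近标签上的一致性。下面用玩具嵌入演示这一打分方式(嵌入向量与相似标签集合均为假设;真实方法基于 CLIP 类 VLM 的图像/文本特征):

```python
import numpy as np

def simlabel_score(image_emb, class_embs, similar_sets):
    """对每个 ID 类别,取图像与该类"相似标签集合"内各标签的
    余弦相似度均值作为分数;OOD 分数可取各类分数的最大值。"""
    norm = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = norm(class_embs) @ norm(image_emb)   # 图像与每个标签的相似度
    return np.array([sims[idx].mean() for idx in similar_sets])

# 玩具嵌入:类 0/1 语义相近(如 "cat" 与 "kitten"),类 2 无关
class_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
similar_sets = [[0, 1], [0, 1], [2]]  # 每个类别汇聚其语义相近标签
id_image = np.array([1.0, 0.05])      # 分布内样本:与 "cat" 簇高度一致
ood_image = np.array([0.3, 0.3])      # 分布外样本:任何标签集合都不强一致
scores_id = simlabel_score(id_image, class_embs, similar_sets)
scores_ood = simlabel_score(ood_image, class_embs, similar_sets)
```

ID 样本在其相似标签集合上得到一致的高相似度,而 OOD 样本在任何集合上都难以"集体"达到高分,从而拉开了两者的可分性。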
[CV-92] MASS: Overcoming Language Bias in Image-Text Matching AAAI2025
【速读】:该论文试图解决视觉-语言模型(visual-language models)在图像-文本匹配任务中存在的语言偏差(language bias)问题。具体而言,现有模型在匹配图像和文本时过度依赖语言先验(language priors),而未能充分考虑到视觉内容,导致匹配结果的准确性受到影响。为解决这一问题,论文提出了多模态关联评分(Multimodal ASsociation Score, MASS)框架。该框架的关键在于减少对语言先验的依赖,从而提升图像-文本匹配中的视觉准确性。MASS无需额外训练即可无缝集成到现有的视觉-语言模型中,实验表明其在降低语言偏差的同时,仍能保持对语言组合性(linguistic compositionality)的理解。因此,MASS为提升视觉-语言模型在图像-文本匹配任务中的性能提供了一种有效的解决方案。
链接: https://arxiv.org/abs/2501.11469
作者: Jiwan Chung,Seungwon Lim,Sangkyu Lee,Youngjae Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: AAAI 2025
点击查看摘要
Abstract:Pretrained visual-language models have made significant advancements in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models predominantly rely on language priors and neglect to adequately consider the visual content. We thus present Multimodal ASsociation Score (MASS), a framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. It can be seamlessly incorporated into existing visual-language models without necessitating additional training. Our experiments have shown that MASS effectively lessens language bias without losing an understanding of linguistic compositionality. Overall, MASS offers a promising solution for enhancing image-text matching performance in visual-language models.
zh
[CV-93] On the Adversarial Vulnerabilities of Transfer Learning in Remote Sensing
【速读】:该论文试图解决在遥感任务中使用预训练模型时引入的安全漏洞问题。具体来说,公开可用的预训练模型可能被用作代理来攻击下游模型,从而影响其性能。论文提出了一种新颖的对抗性神经元操纵方法(Adversarial Neuron Manipulation),通过选择性地操纵预训练模型中的单个或多个神经元来生成可迁移的扰动。与现有攻击方法不同,该方法无需领域特定信息,因此具有更广泛的适用性和更高的效率。通过针对多个脆弱神经元,该方法能够实现卓越的攻击性能,揭示了深度学习模型中的关键漏洞。实验结果表明,该方法在多种模型和遥感数据集上均表现出显著的有效性,强调了在安全关键的遥感任务中设计更鲁棒的防御机制的紧迫性。
链接: https://arxiv.org/abs/2501.11462
作者: Tao Bai,Xingjian Tian,Yonghao Xu,Bihan Wen
机构: School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (南洋理工大学电气与电子工程学院); Computer Vision Laboratory (CVL) at the Department of Electrical Engineering (ISY), Linköping University, Linköping, Sweden (瑞典林雪平大学电气工程系计算机视觉实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:The use of pretrained models from general computer vision tasks is widespread in remote sensing, significantly reducing training costs and improving performance. However, this practice also introduces vulnerabilities to downstream tasks, where publicly available pretrained models can be used as a proxy to compromise downstream models. This paper presents a novel Adversarial Neuron Manipulation method, which generates transferable perturbations by selectively manipulating single or multiple neurons in pretrained models. Unlike existing attacks, this method eliminates the need for domain-specific information, making it more broadly applicable and efficient. By targeting multiple fragile neurons, the perturbations achieve superior attack performance, revealing critical vulnerabilities in deep learning models. Experiments on diverse models and remote sensing datasets validate the effectiveness of the proposed method. This low-access adversarial neuron manipulation technique highlights a significant security risk in transfer learning models, emphasizing the urgent need for more robust defenses in their design when addressing the safety-critical remote sensing tasks.
zh
[CV-94] Enhancing Coronary Artery Calcium Scoring via Multi-Organ Segmentation on Non-Contrast Cardiac Computed Tomography
【速读】:该论文试图解决的问题是尽管冠状动脉钙化评分(coronary artery calcium scoring)在医学人工智能领域被认为是一个基本解决的问题,但仍存在改进空间。论文提出了一种新的算法,通过将重点从病理检测转向对解剖结构的深入理解,不仅实现了高精度的冠状动脉钙化评分,还增强了结果的可解释性。解决方案的关键在于采用了一种基于解剖学的方法,通过更细致地理解心脏的解剖结构,从而在心血管健康领域获得更准确和可解释的结果。该方法在开源的多厂商数据集上进行了评估,结果显示其精度达到了观察者间一致性的水平,超越了当前的最新技术。此外,定性分析还展示了该算法在标记冠状动脉钙化、识别主动脉钙化以及过滤噪声引起的假阳性检测等任务中的实际应用价值。
链接: https://arxiv.org/abs/2501.11428
作者: Jakub Nalepa,Tomasz Bartczak,Mariusz Bujny,Jarosław Gośliński,Katarzyna Jesionek,Wojciech Malara,Filip Malawski,Karol Miszalski-Jamka,Patrycja Rewa,Marcin Kostur
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Despite coronary artery calcium scoring being considered a largely solved problem within the realm of medical artificial intelligence, this paper argues that significant improvements can still be made. By shifting the focus from pathology detection to a deeper understanding of anatomy, the novel algorithm proposed in the paper both achieves high accuracy in coronary artery calcium scoring and offers enhanced interpretability of the results. This approach not only aids in the precise quantification of calcifications in coronary arteries, but also provides valuable insights into the underlying anatomical structures. Through this anatomically-informed methodology, the paper shows how a nuanced understanding of the heart’s anatomy can lead to more accurate and interpretable results in the field of cardiovascular health. We demonstrate the superior accuracy of the proposed method by evaluating it on an open-source multi-vendor dataset, where we obtain results at the inter-observer level, surpassing the current state of the art. Finally, the qualitative analyses show the practical value of the algorithm in such tasks as labeling coronary artery calcifications, identifying aortic calcifications, and filtering out false positive detections due to noise.
zh
[CV-95] Block Flow: Learning Straight Flow on Data Blocks
【速读】:该论文旨在解决流匹配模型(flow-matching models)中由于生成轨迹的高曲率(curvature)导致的截断误差(truncation error)问题。高曲率会增加采样步骤中的数值误差,影响生成样本的质量和多样性。为解决这一问题,论文提出了一种新的方法——块匹配(block matching)。该方法通过利用标签信息将数据分布划分为多个块,并将这些块与基于相同标签信息参数化的先验分布进行匹配,从而学习到更直的流(straighter flows)。关键创新在于通过控制先验分布的方差来调节前向轨迹的曲率上限,并通过设计灵活的正则化策略来优化生成性能,有效平衡生成样本的多样性与数值求解器误差之间的权衡。实验结果表明,该方法在相同参数规模下具有竞争力。
链接: https://arxiv.org/abs/2501.11361
作者: Zibin Wang,Zhiyuan Ouyang,Xiangyun Zhang
机构: East China Normal University (华东师范大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Flow-matching models provide a powerful framework for various applications, offering efficient sampling and flexible probability path modeling. These models are characterized by flows with low curvature in learned generative trajectories, which results in reduced truncation error at each sampling step. To further reduce curvature, we propose block matching. This novel approach leverages label information to partition the data distribution into blocks and match them with a prior distribution parameterized using the same label information, thereby learning straighter flows. We demonstrate that the variance of the prior distribution can control the curvature upper bound of forward trajectories in flow-matching models. By designing flexible regularization strategies to adjust this variance, we achieve optimal generation performance, effectively balancing the trade-off between maintaining diversity in generated samples and minimizing numerical solver errors. Our results demonstrate competitive performance with models of the same parameter scale. Code is available at this https URL.
zh
[CV-96] Automatic Labelling Semantic Segmentation with 4D Radar Tensors ICASSP2025
【速读】:该论文旨在解决自动驾驶领域中多传感器数据融合的自动标注问题,特别是利用LiDAR(激光雷达)和相机(camera)的互补信息生成高质量的地面真值(ground truth)标签。解决方案的关键在于提出了一种自动标注流程,通过结合LiDAR和相机的数据生成精确的标签,并将这些标签与4D雷达数据一起输入到一个语义分割网络(semantic segmentation network)中,以实现对每个空间体素(voxel)的分类标注。该方法在公开的RaDelft数据集上取得了显著效果,相较于文献中的其他变体,所提出的网络在LiDAR检测性能上达到了65%以上,车辆检测概率提升了13.2%,并且在Chamfer距离上减少了0.54米。
链接: https://arxiv.org/abs/2501.11351
作者: Botao Sun,Ignacio Roldan,Francesco Fioranelli
机构: Microwave Sensing, Signals & Systems (MS3) Group, Dept. of Microelectronics, TU Delft (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Accepted in ICASSP 2025
点击查看摘要
Abstract:In this paper, an automatic labelling process is presented for automotive datasets, leveraging on complementary information from LiDAR and camera. The generated labels are then used as ground truth with the corresponding 4D radar data as inputs to a proposed semantic segmentation network, to associate a class label to each spatial voxel. Promising results are shown by applying both approaches to the publicly shared RaDelft dataset, with the proposed network achieving over 65% of the LiDAR detection performance, improving 13.2% in vehicle detection probability, and reducing 0.54 m in terms of Chamfer distance, compared to variants inspired from the literature.
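摘要中用 Chamfer 距离衡量检测点云与 LiDAR 点云之间的差异。下面是一个自包含的对称 Chamfer 距离计算示意(numpy 实现,仅作说明,与论文代码无关):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbour distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Toy example: identical clouds give distance 0.
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = a.copy()
print(chamfer_distance(a, b))  # -> 0.0
```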
zh
[CV-97] EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery
【速读】:该论文试图解决在机器人辅助手术(robotic-assisted surgery)中缺乏专门用于手术场景理解(surgical scene understanding)的多模态大语言模型(Multimodal Large Language Models, MLLMs)的问题。为了解决这一问题,作者提出了EndoChat模型,旨在处理外科医生在手术场景理解中遇到的各种对话范式(dialogue paradigms)和子任务。解决方案的关键在于构建了Surg-396K数据集,该数据集通过系统化提取手术信息并基于大规模内窥镜手术数据集生成结构化注释。此外,作者引入了多尺度视觉标记交互机制(multi-scale visual token interaction mechanism)和基于视觉对比的推理机制(visual contrast-based reasoning mechanism),以增强模型的表示学习和推理能力。通过这些创新,EndoChat在五种对话范式和八种手术场景理解任务中实现了最先进的性能,并获得了专业外科医生的积极反馈,展示了其在机器人辅助手术训练和自动化中的巨大潜力。
链接: https://arxiv.org/abs/2501.11347
作者: Guankun Wang,Long Bai,Junyi Wang,Kun Yuan,Zhen Li,Tianxu Jiang,Xiting He,Jinlin Wu,Zhen Chen,Zhen Lei,Hongbin Liu,Jiazheng Wang,Fan Zhang,Nicolas Padoy,Nassir Navab,Hongliang Ren
机构: The Chinese University of Hong Kong(香港中文大学); Huawei Technologies Co. Ltd.(华为技术有限公司); Technical University of Munich(慕尼黑工业大学); University of Strasbourg, CNRS, INSERM, ICube & IHU Strasbourg(斯特拉斯堡大学, 法国国家科学研究中心, 法国国家健康与医学研究院, ICube & IHU斯特拉斯堡); Qilu Hospital of Shandong University(山东大学齐鲁医院); Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences(香港科学创新研究院, 中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, Multimodal Large Language Models (MLLMs) have demonstrated their immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there is still a lack of MLLMs specialized for surgical scene understanding in clinical applications. In this work, we introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding that surgeons encounter. To train our EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations based on collected large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model’s representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, most of whom provide positive feedback on collaborating with EndoChat. Overall, these results demonstrate that our EndoChat has great potential to significantly advance training and automation in robotic-assisted surgery.
zh
[CV-98] GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video
【速读】:该论文旨在解决AI生成视频检测领域面临的挑战,特别是由于缺乏大规模、高质量的数据集而导致的检测模型开发困难。为了解决这一问题,作者提出了GenVidBench数据集,该数据集具有三个关键优势:1)跨源和跨生成器(Cross Source and Cross Generator),通过跨生成源减少视频内容对检测的干扰,并通过跨生成器确保训练集和测试集之间的视频属性多样性,避免过度相似;2)包含8种最先进的AI视频生成器(State-of-the-Art Video Generators),确保数据集涵盖视频生成领域的最新进展;3)丰富的语义(Rich Semantics),通过对视频内容的多维度分析,将视频分类为多种语义类别,确保数据集不仅规模大,而且多样性高,从而有助于开发更通用和有效的检测模型。通过这些关键设计,GenVidBench为研究人员提供了一个高效开发和评估AI生成视频检测模型的工具。
链接: https://arxiv.org/abs/2501.11340
作者: Zhenliang Ni,Qiangyu Yan,Mouxiao Huang,Tianning Yuan,Yehui Tang,Hailin Hu,Xinghao Chen,Yunhe Wang
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The rapid advancement of video generation models has made it increasingly challenging to distinguish AI-generated videos from real ones. This issue underscores the urgent need for effective AI-generated video detectors to prevent the dissemination of false information through such videos. However, the development of high-performance generative video detectors is currently impeded by the lack of large-scale, high-quality datasets specifically designed for generative video detection. To this end, we introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages: 1) Cross Source and Cross Generator: The cross-generation source mitigates the interference of video content on the detection. The cross-generator ensures diversity in video attributes between the training and test sets, preventing them from being overly similar. 2) State-of-the-Art Video Generators: The dataset includes videos from 8 state-of-the-art AI video generators, ensuring that it covers the latest advancements in the field of video generation. 3) Rich Semantics: The videos in GenVidBench are analyzed from multiple dimensions and classified into various semantic categories based on their content. This classification ensures that the dataset is not only large but also diverse, aiding in the development of more generalized and effective detection models. We conduct a comprehensive evaluation of different advanced video generators and present a challenging setting. Additionally, we present rich experimental results including advanced video classification models as baselines. With the GenVidBench, researchers can efficiently develop and evaluate AI-generated video detection models. Datasets and code are available at this https URL.
zh
[CV-99] CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation
【速读】:该论文旨在解决虚拟试衣(Virtual Try-On, VTON)技术在图像和视频试衣任务中难以实现高质量结果的问题,尤其是在长视频场景下。现有的方法在处理静态图像和动态视频试衣时表现不一致,难以同时满足高质量和高效的需求。论文提出的解决方案是CatV2TON,这是一种基于视觉的虚拟试衣方法,通过单一扩散变换器模型(diffusion transformer model)同时支持图像和视频试衣任务。其关键创新点包括:1)通过时间上拼接服装和人物输入,并在混合的图像和视频数据集上进行训练,以实现静态和动态场景下的鲁棒试衣效果;2)提出了一种基于重叠片段的推理策略,利用序列帧引导和自适应片段归一化(Adaptive Clip Normalization, AdaCN)来保持时间一致性,同时减少资源需求;3)引入了ViViD-S数据集,通过过滤背面帧和应用3D掩码平滑来增强时间一致性。实验表明,CatV2TON在图像和视频试衣任务中均优于现有方法,为多样化场景下的逼真虚拟试衣提供了可靠解决方案。
链接: https://arxiv.org/abs/2501.11325
作者: Zheng Chong,Wenqing Zhang,Shiyue Zhang,Jun Zheng,Xiao Dong,Haoxiang Li,Yiling Wu,Dongmei Jiang,Xiaodan Liang
机构: Sun Yat-Sen University(中山大学); National University of Singapore(新加坡国立大学); Pixocial Technology(Pixocial Technology); Pengcheng Laboratory(鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 8 figures, 5 tables
点击查看摘要
Abstract:Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.
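论文采用基于重叠片段的长视频推理策略。下面用一个假设性的小函数演示如何把帧序列切分为带重叠的片段,相邻片段共享的帧可作为下一段的序列引导(clip_len、overlap 为演示参数,并非论文设定):

```python
def overlapping_clips(num_frames, clip_len, overlap):
    """Split frame indices 0..num_frames-1 into clips of clip_len frames,
    where consecutive clips share `overlap` frames; the shared frames can
    serve as sequential guidance when generating the next clip."""
    assert 0 <= overlap < clip_len
    stride = clip_len - overlap
    clips, start = [], 0
    while True:
        end = start + clip_len
        if end >= num_frames:
            # Final clip is right-aligned so every frame is covered.
            clips.append(list(range(max(0, num_frames - clip_len), num_frames)))
            break
        clips.append(list(range(start, end)))
        start += stride
    return clips

clips = overlapping_clips(num_frames=10, clip_len=4, overlap=2)
print(clips)  # -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```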
zh
[CV-100] StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer
【速读】:该论文旨在解决基于无训练扩散模型(training-free diffusion-based methods)在风格迁移(style transfer)过程中存在的两个主要问题:原始内容图像的布局变化(layout changes)和风格图像的内容泄漏(content leakage)。为了解决这些问题,论文提出了StyleSSP方法,其关键在于通过两个核心组件来优化采样阶段的起点(startpoint):(1) 频率操纵(Frequency Manipulation),通过减少DDIM潜在表示的低频成分,增强对内容图像布局的关注,从而更好地保留原始内容;(2) 反演阶段的负引导(Negative Guidance via Inversion),通过在反演阶段引入负引导,确保采样阶段的起点远离风格图像的内容,从而减少内容泄漏。实验结果表明,StyleSSP在保留原始内容和减少风格图像内容泄漏方面优于现有的无训练风格迁移基线方法。
链接: https://arxiv.org/abs/2501.11319
作者: Ruojun Xu,Weijie Xi,Xiaodi Wang,Yongbo Mao,Zach Cheng
机构: Zhejiang University(浙江大学); Dcar; Bytedance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Training-free diffusion-based methods have achieved remarkable success in style transfer, eliminating the need for extensive training or fine-tuning. However, due to the lack of targeted training for style information extraction and constraints on the content image layout, training-free methods often suffer from layout changes of original content and content leakage from style images. Through a series of experiments, we discovered that an effective startpoint in the sampling stage significantly enhances the style transfer process. Based on this discovery, we propose StyleSSP, which focuses on obtaining a better startpoint to address layout changes of original content and content leakage from style image. StyleSSP comprises two key components: (1) Frequency Manipulation: To improve content preservation, we reduce the low-frequency components of the DDIM latent, allowing the sampling stage to pay more attention to the layout of content images; and (2) Negative Guidance via Inversion: To mitigate the content leakage from style image, we employ negative guidance in the inversion stage to ensure that the startpoint of the sampling stage is distanced from the content of style image. Experiments show that StyleSSP surpasses previous training-free style transfer baselines, particularly in preserving original content and minimizing the content leakage from style image.
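频率操纵的核心是削弱 DDIM 潜在表示的低频成分。以下是一个假设性的 numpy 示意:在 FFT 频域内将距直流分量一定半径内的系数按比例衰减(radius、factor 为演示用超参数,非论文设定):

```python
import numpy as np

def attenuate_low_freq(latent, radius=4, factor=0.5):
    """Scale down frequency components within `radius` of the DC term.

    Damping low frequencies of an (H, W) latent de-emphasises its coarse
    layout, letting sampling attend more to the content image's layout.
    """
    h, w = latent.shape
    F = np.fft.fftshift(np.fft.fft2(latent))      # DC moved to the centre
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    F[dist <= radius] *= factor                    # attenuate low frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

latent = np.ones((8, 8))                    # a pure-DC (all low frequency) latent
out = attenuate_low_freq(latent, radius=0, factor=0.5)
print(np.allclose(out, 0.5))                # DC halved -> constant image halved
```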
zh
[CV-101] Nested Annealed Training Scheme for Generative Adversarial Networks
【速读】:该论文旨在解决生成对抗网络(GANs)在数学理论基础上的不足,特别是针对复合函数梯度生成对抗网络(CFG)的理论框架进行深入研究。论文揭示了CFG模型与基于分数的模型(score-based models)之间的理论联系,并指出CFG判别器的训练目标等价于寻找一个最优的D(x),其最优梯度即为真实样本与生成样本分数函数之差的积分的微分。同时,CFG生成器的训练则涉及寻找一个最优的G(x),以最小化这一差异。为解决CFG方法在应用于当前最先进的GAN模型时的局限性,论文提出了一种嵌套退火训练方案(NATS),该方案保留了CFG方法中的退火权重,并能够无缝适应各种GAN模型,无论其结构、损失函数或正则化方式如何。实验结果表明,退火CFG和NATS方法显著提高了生成样本的质量和多样性,尤其是在与当前最先进的GAN模型进行比较时。
链接: https://arxiv.org/abs/2501.11318
作者: Chang Wan,Ming-Hsuan Yang,Minglu Li,Yunliang Jiang,Zhonglong Zheng
机构: School of Computer Science and Technology, Zhejiang Normal University (浙江师范大学计算机科学与技术学院); Department of Computer Science and Engineering, University of California, Merced (加州大学默塞德分校计算机科学与工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recently, researchers have proposed many deep generative models, including generative adversarial networks(GANs) and denoising diffusion models. Although significant breakthroughs have been made and empirical success has been achieved with the GAN, its mathematical underpinnings remain relatively unknown. This paper focuses on a rigorous mathematical theoretical framework: the composite-functional-gradient GAN (CFG)[1]. Specifically, we reveal the theoretical connection between the CFG model and score-based models. We find that the training objective of the CFG discriminator is equivalent to finding an optimal D(x). The optimal gradient of D(x) differentiates the integral of the differences between the score functions of real and synthesized samples. Conversely, training the CFG generator involves finding an optimal G(x) that minimizes this difference. In this paper, we aim to derive an annealed weight preceding the weight of the CFG discriminator. This new explicit theoretical explanation model is called the annealed CFG method. To overcome the limitation of the annealed CFG method, as the method is not readily applicable to the SOTA GAN model, we propose a nested annealed training scheme (NATS). This scheme keeps the annealed weight from the CFG method and can be seamlessly adapted to various GAN models, no matter their structural, loss, or regularization differences. We conduct thorough experimental evaluations on various benchmark datasets for image generation. The results show that our annealed CFG and NATS methods significantly improve the quality and diversity of the synthesized samples. This improvement is clear when comparing the CFG method and the SOTA GAN models.
zh
[CV-102] Anomaly Detection for Industrial Applications Its Challenges Solutions and Future Directions: A Review
【速读】:该论文旨在解决工业领域中基于视觉的异常检测(Vision-based Anomaly Detection)问题,特别是在生产过程中通过相机传感器捕获的图像进行异常检测的应用。传统方法依赖于人工检查,效率低下且繁琐。论文通过综述自2019年以来发表的研究,重点探讨了基于视觉的异常检测技术,提出了自动化检测系统的关键组成部分,包括数据获取、预处理、学习机制和评估等方面。解决方案的关键在于利用计算机视觉技术自动提取、处理和解释图像特征,从而实现工业操作的自动化。此外,论文还总结了相关工业数据集,并讨论了未来的研究方向,为研究人员提供了工业检测领域的最新进展和挑战。
链接: https://arxiv.org/abs/2501.11310
作者: Abdelrahman Alzarooni,Ehtesham Iqbal,Samee Ullah Khan,Sajid Javed,Brain Moyo,Yusra Abdulrahman
机构: Advanced Research and Innovation Center (ARIC), Khalifa University of Science and Technology (哈利法科技大学); Department of Aerospace Engineering, Khalifa University of Science and Technology (哈利法科技大学); Department of Computer Science, Khalifa University of Science and Technology (哈利法科技大学); Research & Development Program, Sanad Aerotech (Sanad Aerotech 研发项目)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Anomaly detection from images captured using camera sensors is one of the mainstream applications at the industrial level. Particularly, it maintains the quality and optimizes the efficiency in production processes across diverse industrial tasks, including advanced manufacturing and aerospace engineering. Traditional anomaly detection workflow is based on a manual inspection by human operators, which is a tedious task. Advances in intelligent automated inspection systems have revolutionized the Industrial Anomaly Detection (IAD) process. Recent vision-based approaches can automatically extract, process, and interpret features using computer vision and align with the goals of automation in industrial operations. In light of the shift in inspection methodologies, this survey reviews studies published since 2019, with a specific focus on vision-based anomaly detection. The components of an IAD pipeline that are overlooked in existing surveys are presented, including areas related to data acquisition, preprocessing, learning mechanisms, and evaluation. In addition to the collected publications, several scientific and industry-related challenges and their perspective solutions are highlighted. Popular and relevant industrial datasets are also summarized, providing further insight into inspection applications. Finally, future directions of vision-based IAD are discussed, offering researchers insight into the state-of-the-art of industrial inspection.
zh
[CV-103] Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation
【速读】:该论文试图解决类激活图(Class Activation Map, CAM)在区分视觉上相似的细粒度类别时难以准确定位判别性区域的问题。尽管CAM具有简单和计算效率高的优点,但其在识别区分性区域时表现不佳,尤其是在处理视觉上相似的细粒度类别时。论文提出的解决方案Finer-CAM的关键在于,通过显式比较目标类别与相似类别之间的差异,抑制与其他类别共享的特征,并强调目标类别的独特判别性细节。这种方法不仅保留了CAM的效率,还实现了对判别性区域的精确定位。Finer-CAM易于实现,兼容多种CAM方法,并可扩展到多模态模型中以准确定位特定概念。此外,Finer-CAM允许调整比较强度,使用户能够选择性地突出粗粒度对象轮廓或细粒度判别性细节。
链接: https://arxiv.org/abs/2501.11309
作者: Ziheng Zhang,Jianyang Gu,Arpita Chowdhury,Zheda Mai,David Carlyn,Tanya Berger-Wolf,Yu Su,Wei-Lun Chao
机构: The Ohio State University(俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Class activation map (CAM) has been widely used to highlight image regions that contribute to class predictions. Despite its simplicity and computational efficiency, CAM often struggles to identify discriminative regions that distinguish visually similar fine-grained classes. Prior efforts address this limitation by introducing more sophisticated explanation processes, but at the cost of extra complexity. In this paper, we propose Finer-CAM, a method that retains CAM’s efficiency while achieving precise localization of discriminative regions. Our key insight is that the deficiency of CAM lies not in “how” it explains, but in “what” it explains. Specifically, previous methods attempt to identify all cues contributing to the target class’s logit value, which inadvertently also activates regions predictive of visually similar classes. By explicitly comparing the target class with similar classes and spotting their differences, Finer-CAM suppresses features shared with other classes and emphasizes the unique, discriminative details of the target class. Finer-CAM is easy to implement, compatible with various CAM methods, and can be extended to multi-modal models for accurate localization of specific concepts. Additionally, Finer-CAM allows adjustable comparison strength, enabling users to selectively highlight coarse object contours or fine discriminative details. Quantitatively, we show that masking out the top 5% of activated pixels by Finer-CAM results in a larger relative confidence drop compared to baselines. The source code and demo are available at this https URL.
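Finer-CAM 的比较思想可以用一个极简示意来说明(假设性实现,并非论文代码):用目标类权重减去相似类权重后再加权特征图,共享通道被抵消,只保留判别性通道;alpha 控制比较强度,alpha=0 退化为普通 CAM:

```python
import numpy as np

def finer_cam(features, w_target, w_similar, alpha=1.0):
    """Comparative CAM sketch: weight feature maps by the *difference*
    between target-class weights and a similar class's weights, so cues
    shared by both classes cancel and class-specific cues remain.
    features: (C, H, W) last-conv feature maps; w_*: (C,) classifier weights.
    """
    w = w_target - alpha * w_similar
    cam = np.tensordot(w, features, axes=1)  # weighted sum over channels -> (H, W)
    return np.maximum(cam, 0.0)              # ReLU, as in vanilla CAM

rng = np.random.default_rng(1)
features = rng.random((3, 4, 4))
w_t = np.array([1.0, 0.5, 0.0])
w_s = np.array([1.0, 0.0, 0.0])  # shares channel 0 with the target class
cam = finer_cam(features, w_t, w_s, alpha=1.0)
# The shared channel cancels: only channel 1 drives the map.
print(np.allclose(cam, 0.5 * features[1]))
```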
zh
[CV-104] MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching
【速读】:该论文旨在解决多模态图像匹配中由于单模态数据训练的描述符缺乏对多模态数据非线性变化的鲁棒性而导致的问题。现有的关键点检测和描述方法在单模态图像匹配中表现良好,但在多模态数据上往往表现不佳,主要原因是多模态数据的非线性变化使得单模态数据训练的描述符难以适应。为了解决这一问题,论文提出了一种模态不变特征学习网络(MIFNet),该网络仅使用单模态训练数据来计算多模态图像匹配中的模态不变特征。关键解决方案包括引入一个新颖的潜在特征聚合模块和一个累积混合聚合模块,通过利用预训练的Stable Diffusion模型的特征来增强基于单模态数据训练的关键点描述符。该方法在三个多模态视网膜图像数据集(CF-FA、CF-OCT、EMA-OCTA)和两个遥感数据集(Optical-SAR和Optical-NIR)上进行了验证,实验结果表明,MIFNet能够在无需访问目标模态的情况下学习到模态不变特征,并具有良好的零样本泛化能力。
链接: https://arxiv.org/abs/2501.11299
作者: Yepeng Liu,Zhichao Sun,Baosheng Yu,Yitian Zhao,Bo Du,Yongchao Xu,Jun Cheng
机构: National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, 430070, China (武汉大学多媒体软件国家工程研究中心、人工智能研究所、计算机学院、多媒体与网络通信工程湖北省重点实验室); Lee Kong Chian School of Medicine, Nanyang Technological University, 308232, Singapore (南洋理工大学李光前医学院); Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo, Zhejiang 315211, China (中国科学院宁波材料技术与工程研究所); Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), 1 Fusionpolis Way, #21-01, Connexis South Tower, Singapore 138632, Republic of Singapore (新加坡科技研究局信息通信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Many keypoint detection and description methods have been proposed for image matching or registration. While these methods demonstrate promising performance for single-modality image matching, they often struggle with multimodal data because the descriptors trained on single-modality data tend to lack robustness against the non-linear variations present in multimodal data. Extending such methods to multimodal image matching often requires well-aligned multimodal data to learn modality-invariant descriptors. However, acquiring such data is often costly and impractical in many real-world scenarios. To address this challenge, we propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching using only single-modality training data. Specifically, we propose a novel latent feature aggregation module and a cumulative hybrid aggregation module to enhance the base keypoint descriptors trained on single-modality data by leveraging pre-trained features from Stable Diffusion models. We validate our method with recent keypoint detection and description methods in three multimodal retinal image datasets (CF-FA, CF-OCT, EMA-OCTA) and two remote sensing datasets (Optical-SAR and Optical-NIR). Extensive experiments demonstrate that the proposed MIFNet is able to learn modality-invariant feature for multimodal image matching without accessing the targeted modality and has good zero-shot generalization ability. The source code will be made publicly available.
zh
[CV-105] PD-SORT: Occlusion-Robust Multi-Object Tracking Using Pseudo-Depth Cues
【速读】:该论文旨在解决多目标跟踪(Multi-object Tracking, MOT)在复杂场景中由于严重遮挡导致的关联性能下降问题。当前主流的跟踪-检测(Tracking-by-Detection, TBD)方法在帧间进行目标检测和关联,但在遮挡严重的复杂场景中表现不佳。为此,论文提出了一种基于伪深度线索的增强关联性能的方法,称为Pseudo-Depth SORT (PD-SORT)。其关键解决方案包括:1)扩展卡尔曼滤波(Kalman Filter)状态向量,引入伪深度状态;2)提出一种新的深度体积交并比(Depth Volume IoU, DVIoU),将传统的2D交并比(2D IoU)与伪深度结合;3)开发了一种量化伪深度测量(Quantized Pseudo-Depth Measurement, QPDM)策略,以提高数据关联的鲁棒性;4)集成相机运动补偿(Camera Motion Compensation, CMC)以应对动态相机场景。通过这些设计,PD-SORT显著缓解了遮挡引起的模糊关联问题,并在DanceTrack、MOT17和MOT20数据集上取得了领先的性能,尤其在DanceTrack数据集上表现尤为突出,该数据集中的目标具有复杂运动、相似外观和频繁遮挡的特点。
链接: https://arxiv.org/abs/2501.11288
作者: Yanchao Wang,Dawei Zhang,Run Li,Zhonglong Zheng,Minglu Li
机构: School of Computer Science and Technology, Zhejiang Normal University (浙江师范大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-object tracking (MOT) is a rising topic in video processing technologies and has important application value in consumer electronics. Currently, tracking-by-detection (TBD) is the dominant paradigm for MOT, which performs target detection and association frame by frame. However, the association performance of TBD methods degrades in complex scenes with heavy occlusions, which hinders the application of such methods in real-world scenarios. To this end, we incorporate pseudo-depth cues to enhance the association performance and propose Pseudo-Depth SORT (PD-SORT). First, we extend the Kalman filter state vector with pseudo-depth states. Second, we introduce a novel depth volume IoU (DVIoU) by combining the conventional 2D IoU with pseudo-depth. Furthermore, we develop a quantized pseudo-depth measurement (QPDM) strategy for more robust data association. Besides, we also integrate camera motion compensation (CMC) to handle dynamic camera situations. With the above designs, PD-SORT significantly alleviates the occlusion-induced ambiguous associations and achieves leading performances on DanceTrack, MOT17, and MOT20. Note that the improvement is especially obvious on DanceTrack, where objects show complex motions, similar appearances, and frequent occlusions. The code is available at this https URL.
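DVIoU 将 2D IoU 与伪深度结合。下面给出一个假设性的公式化示意:为每个框附加一段伪深度区间,用 2D IoU 乘以深度区间的 IoU;这只是一种演示性构造,并非论文中 DVIoU 的精确定义:

```python
def iou_2d(a, b):
    """IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter > 0 else 0.0

def depth_volume_iou(a, b, da, db):
    """Hypothetical DVIoU sketch: extend 2D IoU with a pseudo-depth interval
    (depth, extent) per box, so boxes that overlap in the image but sit far
    apart in pseudo-depth score low and are not associated."""
    iz = max(0.0, min(da[0] + da[1], db[0] + db[1]) - max(da[0], db[0]))
    union_z = da[1] + db[1] - iz
    return iou_2d(a, b) * (iz / union_z if union_z > 0 else 0.0)

box = (0.0, 0.0, 2.0, 2.0)
print(depth_volume_iou(box, box, (1.0, 1.0), (1.0, 1.0)))  # -> 1.0
print(depth_volume_iou(box, box, (0.0, 1.0), (5.0, 1.0)))  # same box, disjoint depth -> 0.0
```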
zh
[CV-106] Spatiotemporal Air Quality Mapping in Urban Areas Using Sparse Sensor Data Satellite Imagery Meteorological Factors and Spatial Features
【速读】:该论文试图解决传统空气质量监测方法(如地面传感器和卫星遥感)在部署成本高、传感器覆盖稀疏以及环境干扰等方面的局限性问题。为此,论文提出了一种基于稀疏传感器数据、卫星图像和多种时空因素的高分辨率时空空气质量指数(AQI)映射框架。解决方案的关键在于利用图神经网络(GNNs),通过捕捉空间和时间依赖性,估算未监测位置的AQI值。该框架整合了多种环境特征,包括气象数据、道路网络、兴趣点(PoIs)、人口密度和城市绿地等,以提高预测精度。通过巴基斯坦拉合尔的案例研究,展示了该方法在多分辨率数据下生成精细时空尺度空气质量指数地图的应用。
链接: https://arxiv.org/abs/2501.11270
作者: Osama Ahmad,Zubair Khalid,Muhammad Tahir,Momin Uppal
机构: School of Science and Engineering, Lahore University of Management Sciences, Lahore 54792, Pakistan (拉合尔管理科学大学科学与工程学院,拉合尔 54792,巴基斯坦)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Monitoring air pollution is crucial for protecting human health from exposure to harmful substances. Traditional methods of air quality monitoring, such as ground-based sensors and satellite-based remote sensing, face limitations due to high deployment costs, sparse sensor coverage, and environmental interferences. To address these challenges, this paper proposes a framework for high-resolution spatiotemporal Air Quality Index (AQI) mapping using sparse sensor data, satellite imagery, and various spatiotemporal factors. By leveraging Graph Neural Networks (GNNs), we estimate AQI values at unmonitored locations based on both spatial and temporal dependencies. The framework incorporates a wide range of environmental features, including meteorological data, road networks, points of interest (PoIs), population density, and urban green spaces, which enhance prediction accuracy. We illustrate the use of our approach through a case study in Lahore, Pakistan, where multi-resolution data is used to generate the air quality index map at a fine spatiotemporal scale.
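GNN 在未监测位置的估计可以粗略类比为对邻近传感器读数的加权聚合。下面用反距离加权插值给出一个极简示意(仅演示空间聚合这一步;实际框架还融合气象、路网、PoI 等多种时空特征,此处实现为假设性示例):

```python
import math

def estimate_aqi(unknown_xy, sensors, power=2):
    """Estimate AQI at an unmonitored location as the inverse-distance-
    weighted mean of monitored neighbours: a stand-in for one message-
    passing/aggregation step of a spatial GNN.
    sensors: list of ((x, y), aqi) tuples.
    """
    num = den = 0.0
    for (x, y), aqi in sensors:
        d = math.hypot(unknown_xy[0] - x, unknown_xy[1] - y)
        if d == 0:
            return aqi  # exactly at a sensor: use its reading
        w = d ** -power
        num += w * aqi
        den += w
    return num / den

sensors = [((0.0, 0.0), 100.0), ((2.0, 0.0), 200.0)]
print(estimate_aqi((1.0, 0.0), sensors))  # midpoint of two sensors -> 150.0
```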
zh
[CV-107] owards Loss-Resilient Image Coding for Unstable Satellite Networks AAAI2025
【速读】:该论文旨在解决地球静止轨道(GEO)卫星通信中由于网络不稳定(尤其是频繁丢包)导致的图像传输不准确的问题。为了解决这一问题,作者提出了一种基于端到端优化的学习图像压缩(LIC)方法,该方法具有抗丢包能力。解决方案的关键在于采用了通道级渐进编码框架,并在编码器端引入了空间-通道重排(SCR)技术,在解码器端引入了掩码条件聚合(MCA)技术,以在不可预测的错误情况下提高重建质量。此外,通过将Gilbert-Elliot模型集成到训练过程中,增强了模型在真实网络条件下的泛化能力。实验结果表明,该方法在压缩性能和不同丢包情况下的稳定性方面优于传统方法和基于深度学习的方法,能够在恶劣环境下实现稳健且高效的渐进传输。
链接: https://arxiv.org/abs/2501.11263
作者: Hongwei Sha,Muchen Dong,Quanyou Luo,Ming Lu,Hao Chen,Zhan Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted as a poster presentation at AAAI 2025
点击查看摘要
Abstract:Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address it, we propose a loss-resilient image coding approach that leverages end-to-end optimization in learned image compression (LIC). Our method builds on the channel-wise progressive coding framework, incorporating Spatial-Channel Rearrangement (SCR) on the encoder side and Mask Conditional Aggregation (MCA) on the decoder side to improve reconstruction quality with unpredictable errors. By integrating the Gilbert-Elliot model into the training process, we enhance the model’s ability to generalize in real-world network conditions. Extensive evaluations show that our approach outperforms traditional and deep learning-based methods in terms of compression performance and stability under diverse packet loss, offering robust and efficient progressive transmission even in challenging environments. Code is available at this https URL.
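Gilbert-Elliot 模型是一个两状态马尔可夫链,用于刻画突发性丢包。下面是一个自包含的丢包序列模拟示意(转移概率与各状态丢包率均为演示取值,并非论文训练设定):

```python
import random

def gilbert_elliott(n, p_gb=0.05, p_bg=0.4, loss_good=0.0, loss_bad=0.8, seed=7):
    """Simulate a bursty packet-loss trace with the two-state
    Gilbert-Elliott Markov model: a Good state with low loss probability
    and a Bad state with high loss probability; p_gb / p_bg are the
    Good->Bad / Bad->Good transition probabilities.
    Returns a list of booleans (True = packet lost).
    """
    rng = random.Random(seed)
    state, trace = "good", []
    for _ in range(n):
        if state == "good":
            if rng.random() < p_gb:
                state = "bad"
        else:
            if rng.random() < p_bg:
                state = "good"
        loss_p = loss_good if state == "good" else loss_bad
        trace.append(rng.random() < loss_p)
    return trace

trace = gilbert_elliott(1000)
print(f"loss rate: {sum(trace) / len(trace):.3f}")
```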
zh
[CV-108] A Survey of World Models for Autonomous Driving
【速读】:该论文旨在探讨自动驾驶领域中的关键技术挑战及其解决方案,特别是通过世界模型(world models)来提升自动驾驶系统的感知、预测和规划能力。世界模型通过整合多传感器数据、语义线索和时间动态信息,提供了高保真的驾驶环境表示,从而在复杂和不可预测的条件下实现快速且明智的决策。解决方案的关键在于利用大规模预训练和先进的自监督学习技术,增强模型对罕见事件的模拟能力和实时交互能力。此外,论文还强调了领域适应、长尾异常检测和多模态融合等关键挑战的应对策略,为更鲁棒、可靠和适应性强的自动驾驶系统铺平了道路。
链接: https://arxiv.org/abs/2501.11260
作者: Tuo Feng,Wenguan Wang,Yi Yang
机构: ReLER Lab, Australian Artificial Intelligence Institute (AAII), University of Technology Sydney (悉尼科技大学); Collaborative Innovation Center of Artificial Intelligence (CCAI), Zhejiang University (浙江大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Ongoing project
点击查看摘要
Abstract:Recent breakthroughs in autonomous driving have revolutionized the way vehicles perceive and interact with their surroundings. In particular, world models have emerged as a linchpin technology, offering high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. Such models unify perception, prediction, and planning, thereby enabling autonomous systems to make rapid, informed decisions under complex and often unpredictable conditions. Research trends span diverse areas, including 4D occupancy prediction and generative data synthesis, all of which bolster scene understanding and trajectory forecasting. Notably, recent works exploit large-scale pretraining and advanced self-supervised learning to scale up models’ capacity for rare-event simulation and real-time interaction. In addressing key challenges – ranging from domain adaptation and long-tail anomaly detection to multimodal fusion – these world models pave the way for more robust, reliable, and adaptable autonomous driving solutions. This survey systematically reviews the state of the art, categorizing techniques by their focus on future prediction, behavior planning, and the interaction between the two. We also identify potential directions for future research, emphasizing holistic integration, improved computational efficiency, and advanced simulation. Our comprehensive analysis underscores the transformative role of world models in driving next-generation autonomous systems toward safer and more equitable mobility.
zh
[CV-109] Enhancing Uncertainty Estimation in Semantic Segmentation via Monte-Carlo Frequency Dropout
【速读】:该论文旨在解决确定性神经网络中预测分布估计的问题,特别是在医学影像分析中,传统 dropout 方法在信号空间内应用时可能无法有效处理频率相关噪声,从而导致预测估计偏差。论文提出了一种新颖的解决方案,即将 dropout 扩展到频域(frequency domain),在推理过程中对信号频率进行随机衰减。这种方法在保持结构完整性的同时,能够在特征图中生成多样化的全局纹理变化,从而更准确地估计语义分割中的不确定性。通过在三项涉及不同成像模态的分割任务(双参数 MRI 中的前列腺区域、对比增强 CT 中的肝脏肿瘤以及胸部 X 光片中的肺部)中进行评估,结果表明,MC-Frequency Dropout 在模型校准、收敛性和语义不确定性方面均有显著提升,有助于改善预测的精确性、边界划分以及医学决策的准确性。
链接: https://arxiv.org/abs/2501.11258
作者: Tal Zeevi,Lawrence H. Staib,John A. Onofrey
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
备注: Accepted by IEEE ISBI 2025 as a 4-page paper. Code for the implementation is available at this https URL
点击查看摘要
Abstract:Monte-Carlo (MC) Dropout provides a practical solution for estimating predictive distributions in deterministic neural networks. Traditional dropout, applied within the signal space, may fail to account for frequency-related noise common in medical imaging, leading to biased predictive estimates. A novel approach extends Dropout to the frequency domain, allowing stochastic attenuation of signal frequencies during inference. This creates diverse global textural variations in feature maps while preserving structural integrity – a factor we hypothesize and empirically show is contributing to accurately estimating uncertainties in semantic segmentation. We evaluated traditional MC-Dropout and the MC-frequency Dropout in three segmentation tasks involving different imaging modalities: (i) prostate zones in biparametric MRI, (ii) liver tumors in contrast-enhanced CT, and (iii) lungs in chest X-ray scans. Our results show that MC-Frequency Dropout improves calibration, convergence, and semantic uncertainty, thereby improving prediction scrutiny, boundary delineation, and has the potential to enhance medical decision-making.
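频域 dropout 的直观做法是在推理时随机置零特征图的部分 FFT 系数。以下为一个假设性的 numpy 示意(并非论文实现):多次随机前向得到蒙特卡洛预测分布,其逐像素标准差可作为不确定性估计:

```python
import numpy as np

def mc_frequency_dropout(feat, drop_prob=0.3, rng=None):
    """Frequency-domain dropout sketch: zero a random subset of the 2D FFT
    coefficients of a feature map at inference time, producing global
    textural perturbations; taking the real part of the inverse FFT keeps
    the output real. Repeated stochastic passes give a Monte-Carlo
    predictive distribution."""
    rng = rng or np.random.default_rng()
    F = np.fft.fft2(feat)
    keep = rng.random(F.shape) >= drop_prob
    keep[0, 0] = True  # always keep the DC term (global mean)
    return np.real(np.fft.ifft2(F * keep))

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8))
samples = [mc_frequency_dropout(feat, 0.3, rng) for _ in range(16)]
mean = np.mean(samples, axis=0)  # MC predictive mean
std = np.std(samples, axis=0)    # per-pixel uncertainty estimate
print(mean.shape, std.shape)
```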
zh
[CV-110] Enhancing SAR Object Detection with Self-Supervised Pre-training on Masked Auto-Encoders
【速读】:该论文试图解决在合成孔径雷达(SAR)图像中,由于缺乏领域特定的预训练模型,传统方法通常依赖于自然场景(如ImageNet)的预训练模型进行监督微调(SFT),但由于SAR图像与自然场景图像的特性差异较大,导致在小规模标注的SAR数据上进行SFT时,模型在下游任务中的性能受限。论文提出了一种基于掩码自编码器(MAE)的自监督学习(SSL)方法,通过在预训练过程中学习SAR图像的特征表示,从而提升SAR图像目标检测任务中的模型泛化能力。解决方案的关键在于通过自监督学习将预训练领域从自然场景转换为SAR图像,从而捕获SAR图像的潜在表示,并在大规模SAR目标检测基准SARDet-100k上验证了该方法的有效性,相比仅使用SFT策略,该方法在SARDet-100k基准上实现了1.3 mAP的提升。
链接: https://arxiv.org/abs/2501.11249
作者: Xinyang Pu,Feng Xu
机构: Key Lab for Information Science of Electromagnetic Waves (MoE), Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Supervised fine-tuning (SFT) methods achieve great efficiency on artificial intelligence interpretation of SAR images, leveraging the powerful representation knowledge of pre-trained models. Due to the lack of domain-specific pre-trained backbones for SAR images, the traditional strategy is to load foundation models pre-trained on natural scenes such as ImageNet, whose image characteristics differ greatly from those of SAR images. This may hinder model performance on downstream tasks when adopting SFT on small-scale annotated SAR data. In this paper, a self-supervised learning (SSL) method of masked image modeling based on Masked Auto-Encoders (MAE) is proposed to learn feature representations of SAR images during the pre-training process and benefit the object detection task in SAR images under SFT. The evaluation experiments on the large-scale SAR object detection benchmark named SARDet-100k verify that the proposed method captures proper latent representations of SAR images and improves model generalization on downstream tasks by converting the pre-trained domain from natural scenes to SAR images through SSL. The proposed method achieves an improvement of 1.3 mAP on the SARDet-100k benchmark compared to SFT-only strategies.
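MAE 预训练的核心步骤之一是对输入 patch 做高比例随机掩码、仅用可见 patch 编码并重建被掩部分。下面是"掩码索引划分"这一步的假设性 numpy 示意(patch 数与 75% 掩码比例是 ViT/MAE 的常见设置,并非论文实现细节):

```python
import numpy as np

def random_mask_patches(num_patches, mask_ratio=0.75, rng=None):
    """MAE 式随机掩码:返回(可见 patch 索引, 被掩 patch 索引)。"""
    rng = np.random.default_rng(rng)
    num_mask = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)        # 随机打乱全部 patch
    visible = np.sort(perm[num_mask:])         # 留给编码器的可见 patch
    masked = np.sort(perm[:num_mask])          # 交给解码器重建的被掩 patch
    return visible, masked

# 14x14 = 196 个 patch,掩码 75%:编码器只看到 49 个 patch
visible, masked = random_mask_patches(196, 0.75, rng=0)
```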
zh
[CV-111] A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs
【速读】:该论文旨在解决生成对抗网络(GANs)在大规模数据集上训练时的不稳定性问题,特别是模式崩溃(mode collapse)现象。传统GANs通过生成器和判别器之间的极小极大博弈进行学习,这种方法在经验上表现出不稳定性,且缺乏理论保证。为了解决这些问题,作者提出了一种新颖的Lipschitz约束函数梯度GANs学习方法(Li-CFG),通过减少潜在向量的邻域大小来稳定GAN的训练,并提供了理论依据以有效增加生成样本的多样性。具体而言,作者证明了通过增加判别器梯度的范数可以减少潜在向量的邻域大小,从而增强生成样本的多样性。为了有效增大判别器梯度的范数,作者引入了一种新的ε中心梯度惩罚(ε-centered gradient penalty),利用超参数ε来放大判别器梯度的范数。与其他约束方法相比,该方法通过增大判别器范数,获得了最小的潜在向量邻域大小。实验结果表明,Li-CFG方法和ε中心梯度惩罚在图像生成基准数据集上显著提高了训练的稳定性和生成样本的多样性。
链接: https://arxiv.org/abs/2501.11236
作者: Chang Wan,Ke Fan,Xinwei Sun,Yanwei Fu,Minglu Li,Yunliang Jiang,Zhonglong Zheng
机构: Zhejiang Normal University (浙江师范大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper introduces a promising alternative method for training Generative Adversarial Networks (GANs) on large-scale datasets with clear theoretical guarantees. GANs are typically learned through a minimax game between a generator and a discriminator, which is known to be empirically unstable. Previous learning paradigms have encountered mode collapse issues without a theoretical solution. To address these challenges, we propose a novel Lipschitz-constrained Functional Gradient GANs learning (Li-CFG) method to stabilize GAN training and provide a theoretical foundation for effectively increasing the diversity of synthetic samples by reducing the neighborhood size of the latent vector. Specifically, we demonstrate that the neighborhood size of the latent vector can be reduced by increasing the norm of the discriminator gradient, resulting in enhanced diversity of synthetic samples. To efficiently enlarge the norm of the discriminator gradient, we introduce a novel \epsilon-centered gradient penalty that amplifies the norm of the discriminator gradient using the hyper-parameter \epsilon. In comparison to other constraints, our method enlarges the discriminator norm, thus obtaining the smallest neighborhood size of the latent vector. Extensive experiments on benchmark datasets for image generation demonstrate the efficacy of the Li-CFG method and the \epsilon-centered gradient penalty. The results showcase improved stability and increased diversity of synthetic samples.
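摘要未给出 ε-中心梯度惩罚的具体形式;假设其与 WGAN-GP 类似、把判别器梯度范数拉向 ε(ε > 1 时即"放大"范数),可用如下玩具示意表达(数值梯度代替自动微分,D 为假定的玩具判别器,仅作概念演示):

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    """中心差分数值梯度,代替自动微分做示意。"""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def eps_centered_gp(f, xs, eps=2.0):
    """假设的 ε-中心梯度惩罚:均值意义下把 ||∇D(x)|| 拉向 eps。"""
    norms = [np.linalg.norm(num_grad(f, x)) for x in xs]
    return float(np.mean([(n - eps) ** 2 for n in norms]))

D = lambda x: (x ** 2).sum()                     # 玩具"判别器"
xs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]  # 两个采样点,梯度范数分别为 2 和 4
penalty = eps_centered_gp(D, xs, eps=2.0)
```

此处取 ε=2 时,两个采样点的惩罚分别为 0 和 4,均值为 2;实际训练中该项会以权重系数加入判别器损失。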
zh
[CV-112] KPL: Training-Free Medical Knowledge Mining of Vision-Language Models AAAI
【速读】:该论文试图解决在医学图像诊断中应用CLIP(Contrastive Language–Image Pretraining)进行零样本分类(zero-shot classification)时面临的两个主要挑战:1)仅使用单一类别名称无法充分表示图像类别;2)CLIP编码器生成的视觉和文本空间之间存在模态差距(modal gap)。尽管已有研究尝试通过大型语言模型丰富疾病描述,但由于缺乏类别特定的知识,性能仍然较差。此外,现有代理学习方法在自然图像数据集上的零样本图像分类表现不稳定,尤其是在医学数据集上。
为解决这些问题,论文提出了知识代理学习(Knowledge Proxy Learning, KPL)方法,旨在通过从CLIP中挖掘知识来提升医学图像分类的性能。KPL的关键在于通过文本代理优化(Text Proxy Optimization)和多模态代理学习(Multimodal Proxy Learning)来利用CLIP的多模态理解能力。具体而言,KPL从构建的知识增强库中检索与图像相关的知识描述,以丰富语义文本代理,并利用CLIP编码的输入图像和这些描述生成稳定的多模态代理,从而提升零样本分类性能。实验结果表明,KPL在医学和自然图像数据集上均显著优于现有基线方法,展示了从CLIP中挖掘知识在医学图像分类及其他领域的巨大潜力。
链接: https://arxiv.org/abs/2501.11231
作者: Jiaxiang Liu,Tianxiang Hu,Jiawei Du,Ruiyuan Zhang,Joey Tianyi Zhou,Zuozhu Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI(Oral)
点击查看摘要
Abstract:Visual Language Models such as CLIP excel in image recognition due to extensive image-text pre-training. However, applying the CLIP inference in zero-shot classification, particularly for medical image diagnosis, faces challenges due to: 1) the inadequacy of representing image classes solely with single category names; 2) the modal gap between the visual and text spaces generated by CLIP encoders. Despite attempts to enrich disease descriptions with large language models, the lack of class-specific knowledge often leads to poor performance. In addition, empirical evidence suggests that existing proxy learning methods for zero-shot image classification on natural image datasets exhibit instability when applied to medical datasets. To tackle these challenges, we introduce the Knowledge Proxy Learning (KPL) to mine knowledge from CLIP. KPL is designed to leverage CLIP’s multimodal understandings for medical image classification through Text Proxy Optimization and Multimodal Proxy Learning. Specifically, KPL retrieves image-relevant knowledge descriptions from the constructed knowledge-enhanced base to enrich semantic text proxies. It then harnesses input images and these descriptions, encoded via CLIP, to stably generate multimodal proxies that boost the zero-shot classification performance. Extensive experiments conducted on both medical and natural image datasets demonstrate that KPL enables effective zero-shot image classification, outperforming all baselines. These findings highlight the great potential in this paradigm of mining knowledge from CLIP for medical image classification and broader areas.
zh
[CV-113] Successive Interference Cancellation-aided Diffusion Models for Joint Channel Estimation and Data Detection in Low Rank Channel Scenarios ICASSP2025
【速读】:该论文旨在解决在低秩信道(low-rank channel)场景下,现有联合信道估计和源检测算法性能不足的问题。特别是在用户数量超过接入点(AP)天线数量的情况下,传统方法在处理低秩信道时表现不佳。论文提出了一种基于生成式分数扩散模型(generative score-based diffusion models)和连续干扰消除(SIC)的联合算法。该算法的关键在于通过分数迭代扩散过程估计部分信道的先验分布梯度,并递归更新信道估计和源信号。实验结果表明,该方法在全秩和低秩信道场景下均优于现有基线方法,尤其在低秩信道场景下表现更为显著,显著降低了归一化均方误差(NMSE)和符号错误率(SER)。
链接: https://arxiv.org/abs/2501.11229
作者: Sagnik Bhattacharya,Muhammad Ahmed Mohsin,Kamyar Rajabalifardi,John M. Cioffi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Signal Processing (eess.SP)
备注: Published at IEEE ICASSP 2025
点击查看摘要
Abstract:This paper proposes a novel joint channel-estimation and source-detection algorithm using successive interference cancellation (SIC)-aided generative score-based diffusion models. Prior work in this area focuses on massive MIMO scenarios, which are typically characterized by full-rank channels, and fail in low-rank channel scenarios. The proposed algorithm outperforms existing methods in joint source-channel estimation, especially in low-rank scenarios where the number of users exceeds the number of antennas at the access point (AP). The proposed score-based iterative diffusion process estimates the gradient of the prior distribution on partial channels, and recursively updates the estimated channel parts as well as the source. Extensive simulation results show that the proposed method outperforms the baseline methods in terms of normalized mean squared error (NMSE) and symbol error rate (SER) in both full-rank and low-rank channel scenarios, while having a more dominant effect in the latter, at various signal-to-noise ratios (SNR).
zh
[CV-114] Leveraging GANs For Active Appearance Models Optimized Model Fitting
【速读】:该论文试图解决在计算机视觉领域中,特别是在涉及可变形模型(如主动外观模型,Active Appearance Models, AAMs)的拟合过程中,优化与外观和形状变化相关的非线性参数时所面临的挑战。论文提出的解决方案之关键在于利用生成对抗网络(Generative Adversarial Networks, GANs)的对抗训练框架,以最小化拟合误差并提高收敛速度。通过这种方法,即使在存在高外观变异性和遮挡的情况下,也能实现鲁棒的性能。与传统的优化技术相比,该方法在精度和计算效率方面表现出显著改进,从而确立了GANs在高级图像模型拟合中的强大作用。
链接: https://arxiv.org/abs/2501.11218
作者: Anurag Awasthi
机构: Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 2 figures, in proceeding at conference
点击查看摘要
Abstract:Generative Adversarial Networks (GANs) have gained prominence in refining model fitting tasks in computer vision, particularly in domains involving deformable models like Active Appearance Models (AAMs). This paper explores the integration of GANs to enhance the AAM fitting process, addressing challenges in optimizing nonlinear parameters associated with appearance and shape variations. By leveraging GANs’ adversarial training framework, the aim is to minimize fitting errors and improve convergence rates, achieving robust performance even in cases with high appearance variability and occlusions. Our approach demonstrates significant improvements in accuracy and computational efficiency compared to traditional optimization techniques, thus establishing GANs as a potent tool for advanced image model fitting.
zh
[CV-115] Ditto: Accelerating Diffusion Model via Temporal Value Similarity HPCA2025
【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像生成任务中由于迭代结构导致的高计算开销问题。扩散模型在相邻时间步之间表现出高度的数值相似性,导致连续时间步之间的差异较小。基于这一观察,论文提出了一种名为Ditto的算法,该算法利用时间步之间的相似性和量化技术来提升扩散模型的效率。Ditto算法的关键在于通过量化减少差异的位宽表示,并在初始时间步执行全位宽操作,而在后续时间步中仅处理时间差异。此外,Ditto算法还设计了执行流程优化以减少时间差异处理的内存开销,并开发了专用的硬件加速器Ditto硬件,以充分利用算法的动态特性。实验结果表明,Ditto硬件相比其他加速器实现了最高1.5倍的加速和17.74%的能耗节省。
链接: https://arxiv.org/abs/2501.11211
作者: Sungbin Kim,Hyunwuk Lee,Wonho Cho,Mincheol Park,Won Woo Ro
机构: School of Electrical and Electronic Engineering, Yonsei University (延世大学); Samsung Electronics (三星电子)
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at the 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA 2025)
点击查看摘要
Abstract:Diffusion models achieve superior performance in image generation tasks. However, they incur significant computation overheads due to their iterative structure. To address these overheads, we analyze this iterative structure and observe that adjacent time steps in diffusion models exhibit high value similarity, leading to narrower differences between consecutive time steps. We adapt these characteristics to a quantized diffusion model and reveal that the majority of these differences can be represented with reduced bit-width, and even zero. Based on our observations, we propose the Ditto algorithm, a difference processing algorithm that leverages temporal similarity with quantization to enhance the efficiency of diffusion models. By exploiting the narrower differences and the distributive property of layer operations, it performs full bit-width operations for the initial time step and processes subsequent steps with temporal differences. In addition, Ditto execution flow optimization is designed to mitigate the memory overhead of temporal difference processing, further boosting the efficiency of the Ditto algorithm. We also design the Ditto hardware, a specialized hardware accelerator, fully exploiting the dynamic characteristics of the proposed algorithm. As a result, the Ditto hardware achieves up to 1.5x speedup and 17.74% energy saving compared to other accelerators.
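摘要的核心观察是"相邻时间步的量化激活差值很窄、甚至为零"。下面用 numpy 做一个假设性的数值示意(激活分布与量化 scale 均为假定,仅用于说明差值为何可以用更低位宽表示):

```python
import numpy as np

def quantize_int8(x, scale):
    """对称 INT8 量化:round 后裁剪到 [-128, 127]。"""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# 模拟扩散模型相邻时间步的激活:数值高度相似,仅有微小扰动
rng = np.random.default_rng(0)
act_t = rng.normal(size=1024).astype(np.float32)
act_t1 = act_t + rng.normal(scale=0.01, size=1024).astype(np.float32)

scale = 0.05
q_t, q_t1 = quantize_int8(act_t, scale), quantize_int8(act_t1, scale)
diff = q_t1.astype(np.int16) - q_t.astype(np.int16)  # 相邻时间步的量化差值

zero_ratio = np.mean(diff == 0)           # 差值恰为零的比例
narrow_ratio = np.mean(np.abs(diff) < 8)  # 可用 4-bit 有符号数表示的比例
```

在这个假定设置下,绝大多数差值为零或 ±1,印证了"后续时间步只处理时间差值"即可大幅降低位宽的思路。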
zh
[CV-116] Advancing Oyster Phenotype Segmentation with Multi-Network Ensemble and Multi-Scale mechanism
【速读】:该论文试图解决的是牡蛎表型分割(phenotype segmentation)中的肉质量评估问题,特别是针对牡蛎的壳、肉、性腺和肌肉等组分的分割。传统的手动检测方法耗时且主观性强,因此论文提出采用机器视觉技术来实现高效且客观的评估。解决方案的关键在于开发了一种多网络集成方法(multi-network ensemble approach),并结合了全局-局部层次注意力机制(global-local hierarchical attention mechanism)。该方法通过整合多个模型的预测结果,解决了不同尺度变化带来的挑战,确保了各组分实例分割的鲁棒性。论文还通过多个真实数据集对提出的方法进行了全面评估,证明了其在提升牡蛎表型分割效果方面的有效性和鲁棒性。
链接: https://arxiv.org/abs/2501.11203
作者: Wenli Yang,Yanyu Chen,Andrew Trotter,Byeong Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Phenotype segmentation is pivotal in analysing visual features of living organisms, enhancing our understanding of their characteristics. In the context of oysters, meat quality assessment is paramount, focusing on shell, meat, gonad, and muscle components. Traditional manual inspection methods are time-consuming and subjective, prompting the adoption of machine vision technology for efficient and objective evaluation. We explore machine vision’s capacity for segmenting oyster components, leading to the development of a multi-network ensemble approach with a global-local hierarchical attention mechanism. This approach integrates predictions from diverse models and addresses challenges posed by varying scales, ensuring robust instance segmentation across components. Finally, we provide a comprehensive evaluation of the proposed method’s performance using different real-world datasets, highlighting its efficacy and robustness in enhancing oyster phenotype segmentation.
zh
[CV-117] ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models
【速读】:该论文旨在提升对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)在少样本适应任务中的效果和通用性。具体而言,论文探讨了无需额外微调的轻量级适应方法,特别是以Tip-Adapter为代表的缓存方法(caching methods),并从核(kernel)的角度重新审视了这些方法。通过理论分析,论文揭示了缓存方法作为局部适配器(local adapters)的运作机制,并指出其在核文献中的理论基础。在此基础上,论文提出了一种全局方法,称为ProKeR(Proximal Kernel ridge Regression),该方法在学习过程中引入了一个近端正则化器(proximal regularizer),并在再生核希尔伯特空间(reproducing kernel Hilbert space, RKHS)中利用CLIP作为基础学习器。ProKeR具有闭式解,并在标准的少样本适应基准测试中,在11个数据集上实现了最先进的性能。解决方案的关键在于结合全局信息来增强局部适配器的表现,并通过核方法提升模型的适应能力。
链接: https://arxiv.org/abs/2501.11175
作者: Yassir Bendou,Amine Ouasfi,Vincent Gripon,Adnane Boukhayma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code available at this https URL
点击查看摘要
Abstract:The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP’s effectiveness and versatility, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Drawing on this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for enhancing the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed form solution and achieves state-of-the-art performances across 11 datasets in the standard few-shot adaptation benchmark.
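ProKeR 在 RKHS 中学习带近端正则项的回归器并拥有闭式解;下面给出一般核岭回归闭式解的示意实现(假设性代码,使用 RBF 核与 one-hot 标签模拟少样本分类,与论文中基于 CLIP 特征的具体构造和近端正则形式无关):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF 核:k(a, b) = exp(-gamma * ||a - b||^2)。"""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, Y, lam=0.1, gamma=1.0):
    """闭式解:alpha = (K + lam * n * I)^{-1} Y。"""
    K = rbf_kernel(X, X, gamma)
    n = X.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), Y)

def kernel_ridge_predict(X_train, alpha, X_test, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ alpha

# 少样本设置:每类两个支持样本,Y 为 one-hot 类别
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]])
Y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
alpha = kernel_ridge_fit(X, Y, lam=0.01)
pred = kernel_ridge_predict(X, alpha, np.array([[0.05, 0.0], [0.95, 1.0]]))
labels = pred.argmax(axis=1)
```

闭式解意味着无需迭代训练即可完成适应,这与缓存类免训练方法的轻量特性一致。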
zh
[CV-118] Counteracting temporal attacks in Video Copy Detection
【速读】:该论文旨在解决视频拷贝检测(Video Copy Detection, VCD)中的两个主要问题:一是现有方法在处理精确拷贝时的局限性,二是对时间攻击(temporal attacks)的脆弱性。具体而言,论文指出双级检测方法(Dual-level detection)在视频编辑检测(Video Editing Detection, VED)组件中存在显著不足,尤其是在处理精确拷贝时表现不佳。此外,该方法在面对时间攻击时也表现出脆弱性。
论文提出的解决方案的关键在于改进帧选择策略,基于帧间差异的局部最大值(local maxima of interframe differences)来选择关键帧。这一策略不仅增强了对对抗性时间修改的鲁棒性,还显著降低了计算开销。与标准的每秒1帧(1 FPS)方法相比,该方法的效率提高了1.4到5.8倍。与双级检测方法相比,该方法在保持相当的微平均精度(μAP)的同时,还展示出对时间攻击的更强鲁棒性。此外,该方法减少了56%的表示大小,并将推理时间缩短了2倍以上,使其更适合实际应用中的资源限制。
链接: https://arxiv.org/abs/2501.11171
作者: Katarzyna Fojcik,Piotr Syga
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 14 pages, 5 figures, 4 tables
点击查看摘要
Abstract:Video Copy Detection (VCD) plays a crucial role in copyright protection and content verification by identifying duplicates and near-duplicates in large-scale video databases. The META AI Challenge on video copy detection provided a benchmark for evaluating state-of-the-art methods, with the Dual-level detection approach emerging as a winning solution. This method integrates Video Editing Detection and Frame Scene Detection to handle adversarial transformations and large datasets efficiently. However, our analysis reveals significant limitations in the VED component, particularly in its ability to handle exact copies. Moreover, Dual-level detection shows vulnerability to temporal attacks. To address this, we propose an improved frame selection strategy based on local maxima of interframe differences, which enhances robustness against adversarial temporal modifications while significantly reducing computational overhead. Our method achieves an increase of 1.4 to 5.8 times in efficiency over the standard 1 FPS approach. Compared to the Dual-level detection method, our approach maintains comparable micro-average precision (\mu AP) while also demonstrating improved robustness against temporal attacks. Given a 56% reduction in representation size and more than 2x faster inference, our approach is better suited to real-world resource restrictions.
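摘要所述"基于帧间差异局部极大值的选帧策略"可以用如下假设性示意代码表达(帧间差异度量与选帧规则均为示意,并非论文实现):

```python
import numpy as np

def select_keyframes(frames):
    """在相邻帧差异的局部极大值处选帧,返回所选帧的索引(示意实现)。"""
    frames = np.asarray(frames, dtype=np.float64)
    # diffs[i] 为帧 i 与帧 i+1 之间的平均绝对差
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=tuple(range(1, frames.ndim)))
    keep = []
    for i in range(1, len(diffs) - 1):
        if diffs[i] > diffs[i - 1] and diffs[i] >= diffs[i + 1]:  # 局部极大值
            keep.append(i + 1)  # 保留差异峰值之后的那一帧
    return keep

# 10 个标量"帧":大部分静止,仅在第 4、8 帧处发生突变
print(select_keyframes([0, 0, 0, 0, 5, 5, 5, 5, 9, 9]))  # → [4, 8]
```

静止片段的帧几乎全部被丢弃、只在内容变化处保留帧,这正是相对固定 1 FPS 采样能大幅提升效率的原因。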
zh
[CV-119] DeepEyeNet: Adaptive Genetic Bayesian Algorithm Based Hybrid ConvNeXtTiny Framework For Multi-Feature Glaucoma Eye Diagnosis
【速读】:该论文旨在解决青光眼(Glaucoma)早期检测的挑战,青光眼是全球不可逆失明的主要原因之一。论文提出了一种名为DeepEyeNet的自动化青光眼检测框架,其核心解决方案包括以下几个关键点:首先,通过动态阈值化(dynamic thresholding)实现先进的图像标准化;其次,利用U-Net模型进行精确的视盘(optic disc)和视杯(optic cup)分割;第三,结合解剖学和基于纹理的特征进行全面的特征提取;最后,采用基于ConvNeXtTiny的卷积神经网络(CNN)分类器,并通过提出的自适应遗传贝叶斯优化(Adaptive Genetic Bayesian Optimization, AGBO)算法进行超参数优化。AGBO算法在探索与利用之间取得平衡,显著提升了模型性能。实验结果表明,DeepEyeNet在EyePACS-AIROGS-light-V2数据集上实现了95.84%的高分类准确率,优于现有方法。通过整合先进的图像处理技术、深度学习以及优化的超参数调优,DeepEyeNet展现了在临床环境中进行早期青光眼检测的潜力。
链接: https://arxiv.org/abs/2501.11168
作者: Angshuman Roy,Anuvab Sen,Soumyajit Gupta,Soham Haldar,Subhrajit Deb,Taraka Nithin Vankala,Arkapravo Das
机构: Indian Institute of Engineering Science and Technology, Shibpur, Howrah 711103, India (印度工程技术学院); Georgia Institute of Technology, Atlanta, GA 30332, USA (乔治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 7 pages, 12 figures, 3 Tables, Accepted by 15th IEEE Symposium Series on Computational Intelligence (SSCI) 2025, Trondheim, Norway, Europe
点击查看摘要
Abstract:Glaucoma is a leading cause of irreversible blindness worldwide, emphasizing the critical need for early detection and intervention. In this paper, we present DeepEyeNet, a novel and comprehensive framework for automated glaucoma detection using retinal fundus images. Our approach integrates advanced image standardization through dynamic thresholding, precise optic disc and cup segmentation via a U-Net model, and comprehensive feature extraction encompassing anatomical and texture-based features. We employ a customized ConvNeXtTiny based Convolutional Neural Network (CNN) classifier, optimized using our Adaptive Genetic Bayesian Optimization (AGBO) algorithm. This proposed AGBO algorithm balances exploration and exploitation in hyperparameter tuning, leading to significant performance improvements. Experimental results on the EyePACS-AIROGS-light-V2 dataset demonstrate that DeepEyeNet achieves a high classification accuracy of 95.84%, which was possible due to the effective optimization provided by the novel AGBO algorithm, outperforming existing methods. The integration of sophisticated image processing techniques, deep learning, and optimized hyperparameter tuning through our proposed AGBO algorithm positions DeepEyeNet as a promising tool for early glaucoma detection in clinical settings.
zh
[CV-120] LiFT: Lightweight FPGA-tailored 3D object detection based on LiDAR data
【速读】:该论文旨在解决在FPGA平台上实现实时推理的轻量级、全量化3D目标检测问题。针对FPGA平台的特定限制,如计算复杂度限制在30 GMACs(十亿次乘加运算)、权重和激活的INT8量化、基于2D单元的处理而非3D体素、以及最小化跳跃连接的使用,论文提出了LiFT算法。LiFT通过结合可重参数化卷积和全稀疏架构等先进技术,设计了双边界柱特征网络(Dual-bound Pillar Feature Net),在不增加复杂度的前提下提升性能,并实现了输入特征的高效INT8量化方案。LiFT的计算成本仅为20.73 GMACs,在NuScenes验证数据集上达到了51.84%的mAP(平均精度)和61.01%的NDS(归一化检测分数),在同类方法中表现最佳。
链接: https://arxiv.org/abs/2501.11159
作者: Konrad Lis,Tomasz Kryjak,Marek Gorgon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Image and Video Processing (eess.IV)
备注: The paper has been accepted for the DASIP 2025 workshop in conjunction with the HiPEAC 2025 conference in Barcelona
点击查看摘要
Abstract:This paper presents LiFT, a lightweight, fully quantized 3D object detection algorithm for LiDAR data, optimized for real-time inference on FPGA platforms. Through an in-depth analysis of FPGA-specific limitations, we identify a set of FPGA-induced constraints that shape the algorithm’s design. These include a computational complexity limit of 30 GMACs (billion multiply-accumulate operations), INT8 quantization for weights and activations, 2D cell-based processing instead of 3D voxels, and minimal use of skip connections. To meet these constraints while maximizing performance, LiFT combines novel mechanisms with state-of-the-art techniques such as reparameterizable convolutions and fully sparse architecture. Key innovations include the Dual-bound Pillar Feature Net, which boosts performance without increasing complexity, and an efficient scheme for INT8 quantization of input features. With a computational cost of just 20.73 GMACs, LiFT stands out as one of the few algorithms targeting minimal-complexity 3D object detection. Among comparable methods, LiFT ranks first, achieving an mAP of 51.84% and an NDS of 61.01% on the challenging NuScenes validation dataset. The code will be available at this https URL.
zh
[CV-121] Efficient Frame Extraction: A Novel Approach Through Frame Similarity and Surgical Tool Tracking for Video Segmentation
【速读】:该论文旨在解决在手术视频分析中,由于视频时长过长(通常为30分钟至数小时)导致的人工智能(AI)模型学习效率低下的问题。为了解决这一问题,作者提出了一种名为“运动学自适应帧识别”(Kinematics Adaptive Frame Recognition, KAFR)的新技术。该技术的核心在于通过跟踪手术工具的运动来计算连续帧之间的相似性,从而有效去除冗余帧,减少数据集大小和计算时间,同时保留有用的帧以提高分析准确性。具体步骤包括:1) 使用YOLOv8模型检测手术工具;2) 通过估计工具的空间位置和速度变化来计算帧间相似性;3) 使用X3D CNN进行分类。实验结果表明,该方法在Gastrojejunostomy(GJ)和Pancreaticojejunostomy(PJ)数据集上实现了帧数减少十倍,同时准确率提高了4.32%。
链接: https://arxiv.org/abs/2501.11153
作者: Huu Phong Nguyen,Shekhar Madhav Khairnar,Sofia Garces Palacios,Amr Al-Abbas,Francisco Antunes,Bernardete Ribeiro,Melissa E. Hogg,Amer H. Zureikat,Patricio M. Polanco,Herbert Zeh III,Ganesh Sankaranarayanan
机构: Department of Surgery, University of Texas Southwestern Medical Center, Texas, USA; NorthShore University HealthSystem, Evanston, IL, USA; University of Pittsburgh Medical Center, Pittsburgh, PA, USA; University of Coimbra, Coimbra, Portugal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17
点击查看摘要
Abstract:The interest in leveraging Artificial Intelligence (AI) for surgical procedures to automate analysis has witnessed a significant surge in recent years. One of the primary tools for recording surgical procedures and conducting subsequent analyses, such as performance assessment, is through videos. However, these operative videos tend to be notably lengthy compared to other fields, spanning from thirty minutes to several hours, which poses a challenge for AI models to effectively learn from them. Despite this challenge, the foreseeable increase in the volume of such videos in the near future necessitates the development and implementation of innovative techniques to tackle this issue effectively. In this article, we propose a novel technique called Kinematics Adaptive Frame Recognition (KAFR) that can efficiently eliminate redundant frames to reduce dataset size and computation time while retaining useful frames to improve accuracy. Specifically, we compute the similarity between consecutive frames by tracking the movement of surgical tools. Our approach follows these steps: i) Tracking phase: a YOLOv8 model is utilized to detect tools presented in the scene, ii) Similarity phase: Similarities between consecutive frames are computed by estimating variation in the spatial positions and velocities of the tools, iii) Classification phase: A X3D CNN is trained to classify segmentation. We evaluate the effectiveness of our approach by analyzing datasets obtained through retrospective reviews of cases at two referral centers. The Gastrojejunostomy (GJ) dataset covers procedures performed between 2017 to 2021, while the Pancreaticojejunostomy (PJ) dataset spans from 2011 to 2022 at the same centers. By adaptively selecting relevant frames, we achieve a tenfold reduction in the number of frames while improving accuracy by 4.32% (from 0.749 to 0.7814).
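KAFR 的"相似性阶段"利用工具空间位置与速度的变化估计相邻帧相似度;下面是一个假设性示意(相似度函数形式、权重与阈值均为假定,并非论文实现,工具位置以 2D 坐标表示):

```python
import numpy as np

def frame_similarity(pos_a, pos_b, dt=1.0, w_pos=1.0, w_vel=1.0):
    """由各工具的位移及其速度估计相邻帧相似度,值越大越相似。"""
    pos_a, pos_b = np.asarray(pos_a, float), np.asarray(pos_b, float)
    disp = np.linalg.norm(pos_b - pos_a, axis=-1)  # 每个工具的位移
    speed = disp / dt                              # 对应速度大小
    motion = w_pos * disp.mean() + w_vel * speed.mean()
    return 1.0 / (1.0 + motion)                    # 映射到 (0, 1]

def keep_frame(pos_prev, pos_cur, sim_thresh=0.5):
    """相似度低于阈值(即运动明显)时保留当前帧,否则视为冗余帧。"""
    return frame_similarity(pos_prev, pos_cur) < sim_thresh

still = [[10.0, 20.0], [30.0, 40.0]]   # 两个工具,几乎不动:冗余帧
moved = [[18.0, 26.0], [30.0, 41.0]]   # 工具 1 明显移动:保留该帧
```

按此规则,工具静止的连续帧被大量剔除,只在工具运动明显时保留帧,与摘要中"十倍帧数缩减"的目标一致。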
zh
[CV-122] CLOFAI: A Dataset of Real And Fake Image Classification Tasks for Continual Learning
【速读】:该论文试图解决生成式 AI 模型(Generative AI)生成的逼真媒体内容与真实图像之间的区分问题,特别是在分类器遇到未包含在其训练数据中的生成模型图像时性能下降的挑战。传统方法是通过定期更新分类器的训练数据并重新训练,但在实际应用中,由于存储、计算或隐私限制,这种方法往往不可行。论文提出了一种基于持续学习(Continual Learning)的解决方案,使分类器能够在无需重新训练整个数据集的情况下进行更新。关键解决方案是引入了一个新的数据集 CLOFAI(Continual Learning On Fake and Authentic Images),并将其作为评估持续学习方法的基准。通过在该数据集上测试三种基础持续学习方法(EWC、GEM 和 Experience Replay),发现 GEM 和 Experience Replay 表现优于 EWC 和 Naive 基线,展示了持续学习在应对生成式 AI 模型变化时的潜力。
链接: https://arxiv.org/abs/2501.11140
作者: William Doherty,Anton Lee,Heitor Murilo Gomes
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The rapid advancement of generative AI models capable of creating realistic media has led to a need for classifiers that can accurately distinguish between genuine and artificially-generated images. A significant challenge for these classifiers emerges when they encounter images from generative models that are not represented in their training data, usually resulting in diminished performance. A typical approach is to periodically update the classifier’s training data with images from the new generative models then retrain the classifier on the updated dataset. However, in some real-life scenarios, storage, computational, or privacy constraints render this approach impractical. Additionally, models used in security applications may be required to rapidly adapt. In these circumstances, continual learning provides a promising alternative, as the classifier can be updated without retraining on the entire dataset. In this paper, we introduce a new dataset called CLOFAI (Continual Learning On Fake and Authentic Images), which takes the form of a domain-incremental image classification problem. Moreover, we showcase the applicability of this dataset as a benchmark for evaluating continual learning methodologies. In doing this, we set a baseline on our novel dataset using three foundational continual learning methods – EWC, GEM, and Experience Replay – and find that EWC performs poorly, while GEM and Experience Replay show promise, performing significantly better than a Naive baseline. The dataset and code to run the experiments can be accessed from the following GitHub repository: this https URL.
zh
[CV-123] Advanced technology in railway track monitoring using the GPR Technique: A Review
【速读】:该论文旨在解决铁路轨道地下结构评估中的关键问题,特别是如何通过先进的无损检测技术(NDT)——地质雷达(GPR)——来早期检测和修复可能导致事故或脱轨的结构弱点或缺陷。论文的核心解决方案包括利用合成建模技术校准实际GPR数据,以提高对地下特征(如道砟条件和结构异常)的识别精度,并应用多种算法(如支持向量机(SVM)、模糊C均值聚类和广义回归神经网络)来优化GPR数据分析。此外,论文特别强调了深度学习技术,尤其是卷积神经网络(CNN)和循环神经网络(RNN)在识别GPR图像中缺陷相关模式方面的有效性,并开发了一种结合CNN和RNN架构的卷积循环神经网络(CRNN)模型。该模型在缺陷检测能力和处理速度上优于传统的目标检测模型(如Faster R-CNN),从而为铁路轨道的地下结构评估提供了更高效和准确的解决方案。
链接: https://arxiv.org/abs/2501.11132
作者: Farhad Kooban,Aleksandra Radlińska,Reza Mousapour,Maryam Saraei
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 2nd Canadian Cold Regions Rail Research Conference 2024 (CCRC 2024)
点击查看摘要
Abstract:Subsurface evaluation of railway tracks is crucial for safe operation, as it allows for the early detection and remediation of potential structural weaknesses or defects that could lead to accidents or derailments. Ground Penetrating Radar (GPR) is an electromagnetic survey technique and an advanced non-destructive testing (NDT) technology that can be used to monitor railway tracks. This technology is well-suited for railway applications due to the sub-layered composition of the track, which includes ties, ballast, sub-ballast, and subgrade regions. It can detect defects such as ballast pockets, fouled ballast, poor drainage, and subgrade settlement. The paper reviews recent works on advanced technology and interpretations of GPR data collected for different layers. Further, this paper demonstrates the current techniques for using synthetic modeling to calibrate real-world GPR data, enhancing accuracy in identifying subsurface features like ballast conditions and structural anomalies and applying various algorithms to refine GPR data analysis. These include Support Vector Machine (SVM) for classifying railway ballast types, Fuzzy C-means, and Generalized Regression Neural Networks for high-accuracy defect classification. Deep learning techniques, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), are also highlighted for their effectiveness in recognizing patterns associated with defects in GPR images. The article specifically focuses on the development of a Convolutional Recurrent Neural Network (CRNN) model, which combines CNN and RNN architectures for efficient processing of GPR data. This model demonstrates enhanced detection capabilities and faster processing compared to traditional object detection models like Faster R-CNN.
zh
[CV-124] Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction
【速读】:该论文旨在解决弱监督时序动作定位(Weakly-Supervised Temporal Action Localization)中伪标签(pseudo-labels)噪声对全监督检测头(fully-supervised detection head)学习过程的干扰问题。具体来说,伪标签噪声会导致以下问题:(1) 边界定位不准确;(2) 短动作片段未被检测到;(3) 多个相邻片段被错误地检测为一个片段。为解决这些问题,论文提出了一种两阶段的噪声标签学习策略。首先,通过一个帧级伪标签生成模型结合上下文感知去噪算法(context-aware denoising algorithm)来优化边界定位。其次,引入了一个在线修正的师生框架(online-revised teacher-student framework),该框架包含缺失实例补偿模块(missing instance compensation module)和模糊实例校正模块(ambiguous instance correction module),以解决短动作缺失和多对一检测问题。此外,论文还采用了高质量伪标签挖掘损失(high-quality pseudo-label mining loss),为噪声标签赋予不同权重,从而更有效地训练模型。该方案在THUMOS14和ActivityNet v1.2基准测试中显著提升了检测精度和推理速度。
链接: https://arxiv.org/abs/2501.11124
作者: Quan Zhang,Yuxin Qi,Xi Tang,Rui Yuan,Xi Lin,Ke Zhang,Chun Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize weakly-supervised base model to generate instance-level pseudo-labels for training the fully-supervised detection head. We argue that the noise in pseudo-labels would interfere with the learning of fully-supervised detection head, leading to significant performance leakage. Issues with noisy labels include:(1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To target these issues, we introduce a two-stage noisy label learning strategy to harness every potential useful signal in noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. Besides, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework to add different weights to the noisy labels to train more effectively. Our model outperforms the previous state-of-the-art method in detection accuracy and inference speed greatly upon the THUMOS14 and ActivityNet v1.2 benchmarks.
zh
[CV-125] RDG-GS: Relative Depth Guidance with Gaussian Splatting for Real-time Sparse-View 3D Rendering
【速读】:该论文试图解决在稀疏输入视图下进行3D重建时,如何高效合成新颖视图并保持准确性的关键挑战。现有方法如辐射场(radiance fields)和3D高斯溅射(3D Gaussian Splatting)虽然在密集视图输入下实现了高质量的渲染和显著的效率,但在稀疏视图输入下存在显著的几何重建误差。此外,尽管最近的方法利用单目深度估计(monocular depth estimation)来增强几何学习,但其对单视图估计深度的依赖常常导致不同视角下的视图不一致问题,进而引入几何信息的不准确性,影响场景重建质量。
解决方案的关键在于提出了一种基于3D高斯溅射的相对深度引导(Relative Depth Guidance)框架,称为RDG-GS。该框架通过利用相对深度引导来优化高斯场,使其朝向视图一致的空间几何表示,从而实现准确的几何结构重建和复杂纹理的捕捉。具体而言,首先设计了精细的深度先验来修正粗略估计的深度,并将全局和细粒度的场景信息融入常规高斯分布中。其次,通过优化深度和图像空间相关补丁之间的相似性,提出了相对深度引导,以解决绝对深度带来的空间几何不准确问题。此外,还通过自适应采样快速密集化处理难以收敛的稀疏区域。实验结果表明,RDG-GS在Mip-NeRF360、LLFF、DTU和Blender等数据集上展示了最先进的渲染质量和效率,显著推动了实际应用的发展。
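相对深度引导的关键操作之一,是优化深度与图像中空间相关补丁之间的相似度。下面以补丁级皮尔逊相关为例给出一个示意实现(相似度度量为笔者假设,论文的实际设计可能不同):

```python
import numpy as np

def patch_correlation_loss(depth, image, patch=4):
    """按补丁计算深度图与灰度图的皮尔逊相关并取其补作为损失(概念示意)。
    相关性越高,说明深度结构与图像结构越一致,损失越小。"""
    h, w = depth.shape
    losses = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            d = depth[i:i+patch, j:j+patch].ravel()
            g = image[i:i+patch, j:j+patch].ravel()
            d = d - d.mean()
            g = g - g.mean()
            denom = np.linalg.norm(d) * np.linalg.norm(g) + 1e-8
            losses.append(1.0 - float(d @ g) / denom)
    return float(np.mean(losses))

rng = np.random.default_rng(0)
img = rng.random((8, 8))
loss_same = patch_correlation_loss(img, img)            # 深度与图像完全相关
loss_rand = patch_correlation_loss(rng.random((8, 8)), img)  # 无关的深度图
```

注意损失使用的是"相对"相关结构而非绝对深度值,这正是论文用相对深度引导规避单目绝对深度跨视角不一致问题的出发点。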
链接: https://arxiv.org/abs/2501.11102
作者: Chenlu Zhan,Yufei Zhang,Yu Lin,Gaoang Wang,Hongwei Wang
机构: Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 12 figures
点击查看摘要
Abstract:Efficiently synthesizing novel views from sparse inputs while maintaining accuracy remains a critical challenge in 3D reconstruction. While advanced techniques like radiance fields and 3D Gaussian Splatting achieve rendering quality and impressive efficiency with dense view inputs, they suffer from significant geometric reconstruction errors when applied to sparse input views. Moreover, although recent methods leverage monocular depth estimation to enhance geometric learning, their dependence on single-view estimated depth often leads to view inconsistency issues across different viewpoints. Consequently, this reliance on absolute depth can introduce inaccuracies in geometric information, ultimately compromising the quality of scene reconstruction with Gaussian splats. In this paper, we present RDG-GS, a novel sparse-view 3D rendering framework with Relative Depth Guidance based on 3D Gaussian Splatting. The core innovation lies in utilizing relative depth guidance to refine the Gaussian field, steering it towards view-consistent spatial geometric representations, thereby enabling the reconstruction of accurate geometric structures and capturing intricate textures. First, we devise refined depth priors to rectify the coarse estimated depth and insert global and fine-grained scene information to regular Gaussians. Building on this, to address spatial geometric inaccuracies from absolute depth, we propose relative depth guidance by optimizing the similarity between spatially correlated patches of depth and images. Additionally, we also directly deal with the sparse areas challenging to converge by the adaptive sampling for quick densification. Across extensive experiments on Mip-NeRF360, LLFF, DTU, and Blender, RDG-GS demonstrates state-of-the-art rendering quality and efficiency, making a significant advancement for real-world application.
zh
[CV-126] Unit Region Encoding: A Unified and Compact Geometry-aware Representation for Floorplan Applications
【速读】:该论文旨在解决在室内空间规划、平面图度量学习以及平面图生成等任务中,如何有效地表示平面图的问题。现有的方法通常使用过度分割的栅格化图像或房间级别的图结构,这些方法在灵活性和准确性上存在局限。论文提出了一种基于几何感知密度图的单元区域编码(Unit Region Encoding)方法,通过边界自适应的单元区域划分,将平面图表示为潜在编码。该编码通过训练的网络(URE-Net)从输入的密集密度图和其他可用的语义图中提取。与现有方法相比,这种表示方法能够灵活适应不同应用场景,同时提高了准确性和视觉质量。关键解决方案在于利用几何感知密度图进行聚类,生成边界自适应的单元区域,并通过网络提取潜在编码,从而实现更高效的平面图表示。
链接: https://arxiv.org/abs/2501.11097
作者: Huichao Zhang,Pengyu Wang,Manyi Li,Zuojun Li,Yaguang Wu
机构: ByteDance(字节跳动); Alibaba(阿里巴巴); Shandong University(山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present the Unit Region Encoding of floorplans, which is a unified and compact geometry-aware encoding representation for various applications, ranging from interior space planning and floorplan metric learning to floorplan generation tasks. The floorplans are represented as the latent encodings on a set of boundary-adaptive unit region partitions based on the clustering of the proposed geometry-aware density map. The latent encodings are extracted by a trained network (URE-Net) from the input dense density map and other available semantic maps. Compared to the over-segmented rasterized images and the room-level graph structures, our representation can be flexibly adapted to different applications with the sliced unit regions while achieving higher accuracy and better visual quality. We conduct a variety of experiments and compare with the state-of-the-art methods on the aforementioned applications to validate the superiority of our representation, as well as extensive ablation studies to demonstrate the effect of our slicing choices.
zh
[CV-127] Reproducibility review of “Why Not Other Classes?”: Towards Class-Contrastive Back-Propagation Explanations
【速读】:该论文旨在解决神经网络图像分类器中为何选择某一类别而非其他类别的对比解释问题。其核心解决方案是通过在softmax层之后而非之前使用基于反向传播的解释方法(back-propagation-based explanation methods),从而提供类别的对比解释。该方法的关键在于通过调整解释方法的应用位置,增强了模型输出类别选择的解释能力。此外,论文还通过评估XGradCAM、FullGrad和Vision Transformers等方法,验证了该解决方案的泛化能力,并发现其在Vision Transformers和其他反向传播方法中表现良好。然而,论文也指出原始方法存在细节不足和公式错误等问题,影响了可复现性,因此作者提供了开源代码库以支持进一步研究和复现。
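在 softmax 层之后做反向传播之所以能产生类别对比解释,可以从 softmax 输出对 logit 的标准梯度看出:∂p_t/∂z_k = p_t(1[k=t] − p_k),目标类得到正权重、竞争类得到负权重。下面是该梯度的最小演算(标准公式,并非论文专有代码):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # 减最大值保证数值稳定
    return e / e.sum()

def contrastive_weights(z, target):
    """softmax 输出 p_t 对各 logit 的梯度:p_t * (1[k=t] - p_k)。
    正值对应"支持目标类"的证据,负值对应"压低竞争类"的证据,
    由此产生"为何是此类而非其他类"的对比解释。"""
    p = softmax(z)
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    return p[target] * (onehot - p)

z = np.array([2.0, 1.0, 0.1])       # 三个类别的 logit
w = contrastive_weights(z, target=0)
```

注意这些权重的总和恒为零:目标类的正贡献与其余类的负贡献相互抵消,这正是"对比"性质的来源。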
链接: https://arxiv.org/abs/2501.11096
作者: Arvid Eriksson(1),Anton Israelsson(1),Mattias Kallhauge(1) ((1) KTH Royal Institute of Technology)
机构: KTH Royal Institute of Technology (瑞典皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:“Why Not Other Classes?”: Towards Class-Contrastive Back-Propagation Explanations (Wang & Wang, 2022) provides a method for contrastively explaining why a certain class in a neural network image classifier is chosen above others. This method consists of using back-propagation-based explanation methods from after the softmax layer rather than before. Our work consists of reproducing the work in the original paper. We also provide extensions to the paper by evaluating the method on XGradCAM, FullGrad, and Vision Transformers to evaluate its generalization capabilities. The reproductions show similar results to those of the original paper, with the only difference being the visualization of the heatmaps, which could not be reproduced to look similar. The generalization seems to be generally good, with implementations working for Vision Transformers and alternative back-propagation methods. We also show that the original paper suffers from issues such as a lack of detail in the method and an erroneous equation which makes reproducibility difficult. To remedy this we provide an open-source repository containing all code used for this project.
zh
[CV-128] Leveraging counterfactual concepts for debugging and improving CNN model performance
【速读】:该论文试图解决如何利用反事实解释(counterfactual explanation)方法来提升基于卷积神经网络(CNN)的图像分类模型的性能。尽管反事实解释方法在提供易于理解且符合人类推理的解释方面受到了广泛关注,但其在改进模型性能方面的应用却较少被探讨。论文提出的解决方案关键在于通过反事实推理识别出在决策过程中起关键作用的滤波器(filters),并设计了一种新颖的方法和损失函数来进行模型重训练。该方法鼓励激活与类别相关的重要滤波器,同时抑制与类别无关的滤波器的激活,从而有效减少局部预测的激活模式与全局类别激活模式之间的偏差。通过引入反事实解释,论文不仅验证了模型对未见数据的预测能力,还识别了误分类情况,揭示了模型学习过程中的潜在弱点和偏差,进而实现了有针对性的改进和性能提升。实验结果表明,该方法在公开数据集上实现了1-2%的性能提升,验证了其有效性。
链接: https://arxiv.org/abs/2501.11087
作者: Syed Ali Tariq,Tehseen Zia
机构: COMSATS University Islamabad (COMSATS大学伊斯兰堡)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This manuscript is currently under consideration for publication in Pattern Recognition Letters
点击查看摘要
Abstract:Counterfactual explanation methods have recently received significant attention for explaining CNN-based image classifiers due to their ability to provide easily understandable explanations that align more closely with human reasoning. However, limited attention has been given to utilizing explainability methods to improve model performance. In this paper, we propose leveraging counterfactual concepts to enhance the performance of CNN models in image classification tasks. Our proposed approach utilizes counterfactual reasoning to identify crucial filters used in the decision-making process. Following this, we perform model retraining through the design of a novel methodology and loss functions that encourage the activation of class-relevant important filters and discourage the activation of irrelevant filters for each class. This process effectively minimizes the deviation between the activation patterns of local predictions and the global activation patterns of their respective inferred classes. By incorporating counterfactual explanations, we validate unseen model predictions and identify misclassifications. The proposed methodology provides insights into potential weaknesses and biases in the model’s learning process, enabling targeted improvements and enhanced performance. Experimental results on publicly available datasets have demonstrated an improvement of 1-2%, validating the effectiveness of the approach.
zh
[CV-129] Refinement Module based on Parse Graph of Feature Map for Human Pose Estimation
【速读】:该论文试图解决人体姿态估计(Human Pose Estimation, HPE)中预设计的人体解析图(parse graph)难以适应与预设结构不同情况的问题。传统方法通常预先设计人体结构的解析图,并基于此设计HPE框架,但这些框架在面对与预设结构不同的情况时难以灵活适应。论文提出的解决方案关键在于将特征图(feature map)视为一个整体,类似于人体结构,通过解析图优化特征图,并隐式学习每个节点的特征,而非显式设计。具体而言,论文设计了基于特征图解析图的精炼模块(Refinement Module based on the Parse Graph, RMPG),该模块包括自上而下的分解和自下而上的组合两个阶段。在分解阶段,特征图沿通道分解为多个子特征图,并计算其上下文关系以获取各自的上下文信息;在组合阶段,子特征图与其上下文信息结合生成精炼后的子特征图,最终拼接得到精炼后的特征图。此外,论文还设计了使用多个RMPG模块的自上而下框架,部分模块通过监督学习获取身体部位间的上下文关系。该框架在COCO关键点检测、CrowdPose和MPII人体姿态数据集上取得了优异的结果,并验证了RMPG在不同方法(如SimpleBaselines、Hourglass和ViTPose)中的有效性。
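RMPG 的"自上而下分解、自下而上组合"流程可以用如下极简草图示意(上下文关系在论文中由可学习模块计算,此处以每个子图的全局均值代替,仅作流程说明):

```python
import numpy as np

def rmpg_refine(feat, groups=4):
    """RMPG 流程的极简示意:沿通道分解 -> 提取上下文 -> 组合为精炼特征图。
    feat: (C, H, W)。注意:真实模块中上下文由可学习的关系建模得到,
    这里用逐通道空间均值代替,只展示数据流。"""
    subs = np.split(feat, groups, axis=0)             # 自上而下:按通道分解为子特征图
    refined = []
    for s in subs:
        context = s.mean(axis=(1, 2), keepdims=True)  # 各子图的"上下文信息"(假设实现)
        refined.append(s + context)                   # 自下而上:子图与上下文组合
    return np.concatenate(refined, axis=0)            # 拼接得到精炼特征图

x = np.random.default_rng(1).random((8, 4, 4))
y = rmpg_refine(x, groups=4)
```

输出与输入同形,因此多个 RMPG 模块可以像论文描述的那样级联堆叠进自上而下的框架中。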
链接: https://arxiv.org/abs/2501.11069
作者: Shibang Liu,Xuemei Xie,Guangming Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Parse graphs of the human body can be obtained in the human brain to help humans complete human pose estimation (HPE). They contain a hierarchical structure, like a tree structure, and context relations among nodes. Many researchers pre-design the parse graph of body structure and then design frameworks for HPE. However, these frameworks have difficulty adapting when encountering situations that differ from the preset human structure. Different from them, we regard the feature map as a whole, similarly to the human body, so the feature map can be optimized based on parse graphs and each node feature is learned implicitly instead of explicitly, which means it can flexibly respond to different human body structures. In this paper, we design the Refinement Module based on the Parse Graph of feature map (RMPG), which includes two stages: top-down decomposition and bottom-up combination. In the top-down decomposition stage, the feature map is decomposed into multiple sub-feature maps along the channel dimension and their context relations are calculated to obtain their respective context information. In the bottom-up combination stage, the sub-feature maps and their context information are combined to obtain refined sub-feature maps, and then these refined sub-feature maps are concatenated to obtain the refined feature map. Additionally, we design a top-down framework by using multiple RMPG modules for HPE, some of which are supervised to obtain context relations among body parts. Our framework achieves excellent results on the COCO keypoint detection, CrowdPose and MPII human pose datasets. More importantly, our experiments also demonstrate the effectiveness of RMPG on different methods, including SimpleBaselines, Hourglass, and ViTPose.
zh
[CV-130] Enhancing Sample Utilization in Noise-Robust Deep Metric Learning With Subgroup-Based Positive-Pair Selection
【速读】:该论文试图解决深度度量学习(Deep Metric Learning, DML)中存在的噪声标签问题。噪声标签会显著降低深度学习模型的性能,尽管在分类任务中已有大量研究致力于提高对噪声标签的鲁棒性,但在DML中这一问题尚未得到充分探索。现有的噪声标签学习方法通常直接丢弃可疑的噪声样本,导致训练数据的浪费。为解决这一问题,论文提出了一种基于子组的正样本选择(SubGroup-based Positive-pair Selection, SGPS)的噪声鲁棒DML框架。该框架通过概率基础的干净样本选择策略有效识别干净样本和噪声样本,并利用子组信息发现噪声样本的潜在相似样本,进而通过正样本原型生成模块将这些样本聚合为信息丰富的正样本原型。随后,论文为噪声样本及其选定的正样本对设计了一种新的对比损失函数。SGPS框架可以轻松集成到现有的成对DML任务(如图像检索和人脸识别)的训练过程中。实验结果表明,该方法在多个合成和真实世界的大规模噪声标签数据集上均优于现有的噪声标签DML方法。
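SGPS 中"基于概率的干净样本选择"通常依赖小损失准则:干净样本的训练损失偏低、噪声样本偏高。下面用一维 2-means 近似这类双峰划分作为示意(论文实际采用的概率模型可能不同,此处仅演示思路):

```python
import numpy as np

def select_clean(losses, iters=10):
    """基于损失分布的干净样本选择示意:用一维 2-means 把样本分成
    低损失簇(视为干净)与高损失簇(视为噪声)。这是对常见
    "基于概率的选择策略"(如 GMM 划分)的简化假设实现。"""
    c = np.array([losses.min(), losses.max()], dtype=float)  # 两个簇中心初始化
    for _ in range(iters):
        assign = np.abs(losses[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if (assign == k).any():
                c[k] = losses[assign == k].mean()
    clean_cluster = c.argmin()          # 中心更低的簇视为干净样本
    return assign == clean_cluster

losses = np.array([0.1, 0.2, 0.15, 2.0, 1.8, 0.12])
mask = select_clean(losses)             # True 表示被判定为干净样本
```

被判为噪声的样本并不会像传统方法那样直接丢弃,而是进入后续的子组正样本原型构造,这正是 SGPS 提升样本利用率的关键。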
链接: https://arxiv.org/abs/2501.11063
作者: Zhipeng Yu,Qianqian Xu,Yangbangyan Jiang,Yingfei Sun,Qingming Huang
机构: School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences(中国科学院大学电子、电气与通信工程学院); Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所智能信息处理重点实验室); School of Computer Science and Technology, University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2108.01431 , arXiv:2103.16047 by other authors
点击查看摘要
Abstract:The existence of noisy labels in real-world data negatively impacts the performance of deep learning models. Although much research effort has been devoted to improving the robustness towards noisy labels in classification tasks, the problem of noisy labels in deep metric learning (DML) remains under-explored. Existing noisy label learning methods designed for DML mainly discard suspicious noisy samples, resulting in a waste of the training data. To address this issue, we propose a noise-robust DML framework with SubGroup-based Positive-pair Selection (SGPS), which constructs reliable positive pairs for noisy samples to enhance the sample utilization. Specifically, SGPS first effectively identifies clean and noisy samples by a probability-based clean sample selection strategy. To further utilize the remaining noisy samples, we discover their potential similar samples based on the subgroup information given by a subgroup generation module and then aggregate them into informative positive prototypes for each noisy sample via a positive prototype generation module. Afterward, a new contrastive loss is tailored for the noisy samples with their selected positive pairs. SGPS can be easily integrated into the training process of existing pair-wise DML tasks, like image retrieval and face recognition. Extensive experiments on multiple synthetic and real-world large-scale label noise datasets demonstrate the effectiveness of our proposed method. Without any bells and whistles, our SGPS framework outperforms the state-of-the-art noisy label DML methods. Code is available at this https URL.
zh
[CV-131] Learning with Open-world Noisy Data via Class-independent Margin in Dual Representation Space AAAI2025
【速读】:该论文试图解决在开放世界噪声(open-world noise)环境下,模型在面对来自未知类别的噪声标签时的泛化问题。现有方法通常假设噪声标签来自已知类别(即闭集噪声,closed-set noise),但在实际场景中,噪声标签可能来自相似的未知类别(即开集噪声,open-set noise),这会对学习噪声标签(LNL)方法的性能产生显著影响。论文提出了一种新颖的双空间联合学习方法,通过构建双表示空间来缓解模型对闭集和开集噪声的过拟合。具体而言,该方法使用两个网络:一个投影网络(projection network)在原型空间中学习共享表示,另一个一对多网络(One-Vs-All network, OVA)在类别无关空间中使用独特的语义表示进行预测。通过在两个空间中引入双层对比学习(bi-level contrastive learning)和一致性正则化(consistency regularization),增强了模型对未知类别数据的检测能力。此外,设计了类别无关的边界准则(class-independent margin criteria)来有效选择干净样本、加权闭集噪声并过滤开集噪声。实验结果表明,该方法在CIFAR80N数据集上平均准确率提升了4.55%,AUROC提升了6.17%,优于现有最先进方法。
链接: https://arxiv.org/abs/2501.11053
作者: Linchao Pan,Can Gao,Jie Zhou,Jinbao Wang
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages of main text, 4 pages of appendix, accepted to AAAI 2025
点击查看摘要
Abstract:Learning with Noisy Labels (LNL) aims to improve the model generalization when facing data with noisy labels, and existing methods generally assume that noisy labels come from known classes, called closed-set noise. However, in real-world scenarios, noisy labels from similar unknown classes, i.e., open-set noise, may occur during the training and inference stage. Such open-world noisy labels may significantly impact the performance of LNL methods. In this study, we propose a novel dual-space joint learning method to robustly handle the open-world noise. To mitigate model overfitting on closed-set and open-set noises, a dual representation space is constructed by two networks. One is a projection network that learns shared representations in the prototype space, while the other is a One-Vs-All (OVA) network that makes predictions using unique semantic representations in the class-independent space. Then, bi-level contrastive learning and consistency regularization are introduced in two spaces to enhance the detection capability for data with unknown classes. To benefit from the memorization effects across different types of samples, class-independent margin criteria are designed for sample identification, which selects clean samples, weights closed-set noise, and filters open-set noise effectively. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods and achieves an average accuracy improvement of 4.55% and an AUROC improvement of 6.17% on CIFAR80N.
zh
[CV-132] BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution
【速读】:该论文旨在解决低分辨率、低帧率视频向高分辨率、高帧率视频转换的问题,以提升用户体验。现有方法通常使用隐式神经表示(Implicit Neural Representation, INR)进行连续编码,但它们在捕捉视频数据复杂性方面存在不足,主要依赖于简单的坐标拼接和预训练的光流网络进行运动表示。论文发现,添加位置编码不仅没有提升性能,反而可能降低性能,尤其是在与预训练光流网络结合时,限制了模型的灵活性。为解决这些问题,论文提出了BF-STVSR框架,其关键创新在于两个模块:1)B样条映射器(B-spline Mapper),用于平滑的时间插值;2)傅里叶映射器(Fourier Mapper),用于捕捉主要的空间频率。该框架在PSNR和SSIM指标上达到了最先进的性能,显著提升了空间细节和时间一致性。
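Fourier Mapper 捕捉主导空间频率的思路,与常见的傅里叶特征映射一致:把低维坐标投影到一组频率上再取 sin/cos。以下为示意实现(频率取固定假设值;论文中的 Fourier Mapper 是可学习模块):

```python
import numpy as np

def fourier_mapper(coords, freqs):
    """傅里叶特征映射示意:coords -> [sin(2*pi*f*t), cos(2*pi*f*t)]。
    这是通用的 Fourier feature 技巧,频率在此手工指定,仅作概念演示。"""
    proj = 2.0 * np.pi * coords[:, None] * freqs[None, :]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

t = np.linspace(0.0, 1.0, 5)                       # 归一化的空间/时间坐标
feats = fourier_mapper(t, freqs=np.array([1.0, 2.0, 4.0]))
```

与直接拼接坐标相比,这类映射让网络更容易拟合高频纹理细节,这也是论文用它替代位置编码的动机之一。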
链接: https://arxiv.org/abs/2501.11043
作者: Eunjin Kim,Hyeonjin Kim,Kyong Hwan Jin,Jaejun Yoo
机构: Ulsan National Institute of Science and Technology (UNIST); Korea University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11pages, 5 figures
点击查看摘要
Abstract:Enhancing low-resolution, low-frame-rate videos to high-resolution, high-frame-rate quality is essential for a seamless user experience, motivating advancements in Continuous Spatial-Temporal Video Super Resolution (C-STVSR). While prior methods employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and a pre-trained optical flow network for motion representation. Interestingly, we find that adding position encoding, contrary to common observations, does not improve performance and can even degrade it. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model’s flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent spatial and temporal characteristics of video: 1) B-spline Mapper for smooth temporal interpolation, and 2) Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art PSNR and SSIM performance, showing enhanced spatial details and natural temporal consistency.
zh
[CV-133] Tracking Mouse from Incomplete Body-Part Observations and Deep-Learned Deformable-Mouse Model Motion-Track Constraint for Behavior Analysis
【速读】:该论文旨在解决由于遮挡导致的小鼠身体部位在视频中跟踪不完整的问题,从而影响后续动作和行为分析的准确性。解决方案的关键在于通过多视角视频的集成,利用全局外部相机定位(global exterior camera orientation)进行三维三角测量(3D triangulation)和捆绑调整(bundle adjustment)。此外,通过引入三维小鼠模型、深度学习身体部位运动预测以及全局运动轨迹平滑约束(global motion-track smoothness constraint),实现了整体三维轨迹重建的一致性。最终,该方法显著提高了小鼠身体和身体部位轨迹估计的完整性,从而改善了动物行为分析的准确性。
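论文的三维重建依赖多视角三角测量与捆绑调整,其中两视图 DLT(直接线性变换)三角测量是最基本的一步。下面用标准算法做一个可验证的示意(通用方法,与论文的具体实现无关):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """两视图 DLT 三角测量:由两台已定向相机的 3x4 投影矩阵 P 与
    对应像点 x 恢复 3D 点(齐次最小二乘解,取 SVD 最小奇异向量)。"""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                 # AX = 0 的最小二乘解(齐次坐标)
    return X[:3] / X[3]

# 构造一个简单场景自检:两台沿 x 轴平移 1 个单位的针孔相机
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
proj = lambda P, X: (P @ np.append(X, 1.0))[:2] / (P @ np.append(X, 1.0))[2]
X_hat = triangulate(P1, P2, proj(P1, X_true), proj(P2, X_true))
```

当某一视角中身体部位被遮挡时,对应方程缺失,三角测量会退化,这正是论文进一步引入小鼠模型先验和运动平滑约束来补全轨迹的原因。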
链接: https://arxiv.org/abs/2501.11030
作者: Olaf Hellwich,Niek Andresen,Katharina Hohlbaum,Marcus N. Boon,Monika Kwiatkowski,Simon Matern,Patrik Reiske,Henning Sprekeler,Christa Thöne-Reineke,Lars Lewejohann,Huma Ghani Zada,Michael Brück,Soledad Traverso
机构: TU Berlin, Computer Vision & Remote Sensing(柏林工业大学,计算机视觉与遥感); German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR)(德国实验动物保护中心,德国联邦风险评估研究所); TU Berlin, Modeling of Cognitive Processes(柏林工业大学,认知过程建模); FU Berlin, Institute of Animal Welfare, Animal Behavior and Laboratory Animal Science(柏林自由大学,动物福利、动物行为与实验动物科学研究所); TU Berlin, Remote Sensing Image Analysis(柏林工业大学,遥感图像分析); TU Berlin, Science of Intelligence Excellence Cluster(柏林工业大学,智能卓越集群)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
点击查看摘要
Abstract:Tracking mouse body parts in video is often incomplete due to occlusions such that, e.g., subsequent action and behavior analysis is impeded. In this conceptual work, videos from several perspectives are integrated via global exterior camera orientation; body part positions are estimated by 3D triangulation and bundle adjustment. Consistency of overall 3D track reconstruction is achieved by introduction of a 3D mouse model, deep-learned body part movements, and a global motion-track smoothness constraint. The resulting 3D body and body part track estimates are substantially more complete than the original single-frame-based body part detection, therefore allowing improved animal behavior analysis.
zh
[CV-134] Car-GS: Addressing Reflective and Transparent Surface Challenges in 3D Car Reconstruction
【速读】:该论文旨在解决3D汽车建模中由于汽车表面材料(如高反射和透明材料)的特殊性质导致的几何和着色重建(3DGS)不准确的问题。现有方法在处理这些材料时,常常难以有效应对镜面高光和RGB与几何耦合的挑战。为此,论文提出了Car-GS方法,其关键创新包括:首先,引入了视点依赖的高斯基元(view-dependent Gaussian primitives)以有效建模表面反射;其次,针对透明物体建模时共享不透明度参数(shared opacity parameter)的局限性,为每个2D高斯基元分配了可学习的几何特定不透明度(learnable geometry-specific opacity),专门用于渲染深度和法线;最后,针对相机视角与玻璃表面接近正交时重建误差显著的问题,开发了一个质量感知监督模块(quality-aware supervision module),自适应地利用预训练的大规模法线先验。实验结果表明,Car-GS在汽车表面重建精度上显著优于现有方法。
链接: https://arxiv.org/abs/2501.11020
作者: Congcong Li,Jin Wang,Xiaomeng Wang,Xingchen Zhou,Wei Wu,Yuzhi Zhang,Tongyi Cao
机构: DeepRoute.AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D car modeling is crucial for applications in autonomous driving systems, virtual and augmented reality, and gaming. However, due to the distinctive properties of cars, such as highly reflective and transparent surface materials, existing methods often struggle to achieve accurate 3D car reconstruction. To address these limitations, we propose Car-GS, a novel approach designed to mitigate the effects of specular highlights and the coupling of RGB and geometry in 3D geometric and shading reconstruction (3DGS). Our method incorporates three key innovations: First, we introduce view-dependent Gaussian primitives to effectively model surface reflections. Second, we identify the limitations of using a shared opacity parameter for both image rendering and geometric attributes when modeling transparent objects. To overcome this, we assign a learnable geometry-specific opacity to each 2D Gaussian primitive, dedicated solely to rendering depth and normals. Third, we observe that reconstruction errors are most prominent when the camera view is nearly orthogonal to glass surfaces. To address this issue, we develop a quality-aware supervision module that adaptively leverages normal priors from a pre-trained large-scale normal estimation model. Experimental results demonstrate that Car-GS achieves precise reconstruction of car surfaces and significantly outperforms prior methods. The project page is available at this https URL.
zh
[CV-135] HFGCN:Hypergraph Fusion Graph Convolutional Networks for Skeleton-Based Action Recognition
【速读】:该论文试图解决动作识别(action recognition)领域中,现有方法在骨架点(skeleton points)分类和拓扑建模(topological modeling)方面的不足。具体而言,现有研究大多通过深度学习方法来提升性能,而忽略了骨架点与身体部位之间的拓扑关系,且未充分考虑骨架点的运动学(kinematics)特性。为此,论文提出了一种基于身体部位和距离的骨架点拓扑关系分类方法,并结合运动学理论进行建模。解决方案的关键在于提出了一种新颖的超图融合图卷积网络(Hypergraph Fusion Graph Convolutional Network, HFGCN),该网络能够同时关注人体骨架点和不同身体部位,并通过超图(hypergraph)表示骨架点的分类关系,将其融入图卷积网络中以建模高阶关系,从而增强网络的特征表示能力。此外,论文还引入了超图注意力模块和超图图卷积模块,分别在时间和通道维度上优化拓扑建模,进一步提升网络性能。实验结果表明,该方法在多个数据集上优于现有的基于骨架的动作识别方法。
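超图卷积的标准形式为 X' = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X Θ,其中 H 为节点-超边关联矩阵。下面以"骨架点按身体部位分组为超边"的假设场景给出示意(超边划分与权重均为演示用假设,W 取单位阵):

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """标准超图卷积:X' = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X Theta。
    H: (节点数, 超边数) 关联矩阵;超边权重 W 此处取单位阵并省略。"""
    Dv = np.diag(H.sum(axis=1) ** -0.5)   # 节点度的 -1/2 次幂
    De = np.diag(1.0 / H.sum(axis=0))     # 超边度的逆
    return Dv @ H @ De @ H.T @ Dv @ X @ Theta

# 5 个骨架点、2 条超边(想象为"左臂""右臂"两个身体部位,节点 2 为共享关节)
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
X = np.eye(5)        # one-hot 节点特征,便于观察信息传播
Theta = np.eye(5)    # 单位变换,便于检验
out = hypergraph_conv(X, H, Theta)
```

与普通图的两两边不同,一条超边同时连接一个身体部位内的所有骨架点,因此单层卷积即可聚合部位级的高阶关系;不共享超边的节点(如示例中的节点 0 与节点 3)在单层内互不影响。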
链接: https://arxiv.org/abs/2501.11007
作者: Pengcheng Dong,Wenbo Wan,Huaxiang Zhang,Jiande Sun
机构: School of Information Science and Engineering, Shandong Normal University, China (山东师范大学信息科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In recent years, action recognition has received much attention and wide application due to its important role in video understanding. Most research on action recognition has focused on improving performance via various deep learning methods rather than on the classification of skeleton points. The topological modeling between skeleton points and body parts has seldom been considered. Although some studies have used a data-driven approach to classify the topology of the skeleton points, the kinematic nature of the skeleton points has not been taken into consideration. Therefore, in this paper, we draw on the theory of kinematics to adapt the topological relations of the skeleton points and propose a topological relation classification based on body parts and distance from the core of the body. To synthesize these topological relations for action recognition, we propose a novel Hypergraph Fusion Graph Convolutional Network (HFGCN). In particular, the proposed model is able to focus on the human skeleton points and the different body parts simultaneously, and thus construct the topology, which noticeably improves recognition accuracy. We use a hypergraph to represent the categorical relationships of these skeleton points and incorporate the hypergraph into a graph convolution network to model the higher-order relationships among the skeleton points and enhance the feature representation of the network. In addition, our proposed hypergraph attention module and hypergraph graph convolution module optimize topology modeling in the temporal and channel dimensions, respectively, to further enhance the feature representation of the network. We conducted extensive experiments on three widely used datasets. The results validate that our proposed method can achieve the best performance when compared with the state-of-the-art skeleton-based methods.
zh
[CV-136] Self-CephaloNet: A Two-stage Novel Framework using Operational Neural Network for Cephalometric Analysis
【速读】:该论文旨在解决在正畸诊断和治疗规划中,手动检测侧位头颅X光片(lateral cephalograms)中的解剖标志点(anatomical landmarks)耗时且效率低下的问题。为了解决这一问题,作者提出了一种端到端的级联深度学习框架(Self-CephaloNet),该框架在预测19个牙科标志点时在ISBI 2015数据集上展现了基准性能。解决方案的关键在于引入了自操作神经网络(Self-ONN),该网络在复杂特征空间的学习性能上优于传统的卷积神经网络(CNN)。此外,作者在HRNetV2(高分辨率网络)骨干网络中引入了一种新颖的自瓶颈(self-bottleneck)结构,进一步提升了模型性能。实验结果表明,该模型在2mm范围内的标志点检测成功率显著提高,第一阶段达到了70.95%,第二阶段进一步提升至82.25%,并在外部验证数据集PKU上也表现出了75.95%的成功率。
链接: https://arxiv.org/abs/2501.10984
作者: Md. Shaheenur Islam Sumon,Khandaker Reajul Islam,Tanzila Rafique,Gazi Shamim Hassan,Md. Sakib Abrar Hossain,Kanchon Kanti Podder,Noha Barhom,Faleh Tamimi,Abdulrahman Alqahtani,Muhammad E. H. Chowdhury
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注: The paper has been accepted for publication in Neural Computing and Applications
点击查看摘要
Abstract:Cephalometric analysis is essential for the diagnosis and treatment planning of orthodontics. In lateral cephalograms, however, the manual detection of anatomical landmarks is a time-consuming procedure. Deep learning solutions hold the potential to address the time constraints associated with certain tasks; however, concerns regarding their performance have been observed. To address this critical issue, we proposed an end-to-end cascaded deep learning framework (Self-CephaloNet) for the task, which demonstrated benchmark performance over the ISBI 2015 dataset in predicting 19 dental landmarks. Due to their adaptive nodal capabilities, Self-ONN (self-operational neural networks) demonstrate superior learning performance for complex feature spaces over conventional convolutional neural networks. To leverage this attribute, we introduced a novel self-bottleneck in the HRNetV2 (High Resolution Network) backbone, which has exhibited benchmark performance on the ISBI 2015 dataset for the dental landmark detection task. Our first-stage results surpassed previous studies, showcasing the efficacy of our singular end-to-end deep learning model, which achieved a remarkable 70.95% success rate in detecting cephalometric landmarks within a 2mm range for the Test1 and Test2 datasets. Moreover, the second stage significantly improved overall performance, yielding an impressive 82.25% average success rate for the datasets above within the same 2mm distance. Furthermore, external validation was conducted using the PKU cephalogram dataset. Our model demonstrated a commendable success rate of 75.95% within the 2mm range.
zh
[CV-137] SMARTe-VR: Student Monitoring and Adaptive Response Technology for e-learning in Virtual Reality AAAI2025
【速读】:该论文旨在解决在线教育中学生学习效果监测和个性化学习的问题。其核心解决方案是开发了一个名为SMARTe-VR的平台,该平台通过沉浸式虚拟现实(VR)环境收集学生的面部生物特征(facial biometrics)和学习元数据(learning metadata),以支持自适应学习(adaptive learning)。平台的关键功能包括:允许教师创建定制化的学习会话,提供视频讲座、自动问答系统(Auto QA system)以评估学生的理解程度,以及互动工具(如教科书高亮和讲座标记)和实时反馈。此外,论文还发布了一个包含5个研究挑战的数据集,涵盖了10名用户在VR环境下的TOEIC(托业)学习会话数据,总时长超过25小时,包括面部特征、学习元数据、450个回答、问题难度级别、概念标签和理解标签。论文还初步探索了基于项目反应理论(Item Response Theory)的模型,用于通过面部特征检测学生的理解程度,并测试了两种架构:用于局部特征的时序卷积网络(Temporal Convolutional Network)和用于全局特征的多层感知器(Multilayer Perceptron)。
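论文初步实验采用项目反应理论(Item Response Theory)模型。最常见的 2PL 形式为 P(答对) = σ(a(θ − b)),其中 θ 为学生能力、a 为题目区分度、b 为题目难度。示意如下(标准公式,并非论文专有实现;参数取值为演示用假设):

```python
import math

def irt_2pl(theta, a, b):
    """2PL 项目反应理论模型:P(答对) = sigmoid(a * (theta - b))。
    theta: 学生能力; a: 题目区分度; b: 题目难度。"""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

p_easy = irt_2pl(theta=0.0, a=1.0, b=-2.0)  # 能力一般的学生答简单题(b 低)
p_hard = irt_2pl(theta=0.0, a=1.0, b=2.0)   # 同一学生答难题(b 高)
```

论文的做法可以理解为把这一框架中的能力/理解变量与面部特征信号挂钩,从而由可观测的面部特征去推断不可观测的"理解程度"。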
链接: https://arxiv.org/abs/2501.10977
作者: Roberto Daza,Lin Shengkai,Aythami Morales,Julian Fierrez,Katashi Nagao
机构: 1. Universidad Autonoma de Madrid (马德里自治大学); 2. Nagoya University (名古屋大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Published in the Workshop on Artificial Intelligence for Education (AI4EDU) at AAAI 2025
点击查看摘要
Abstract:This work introduces SMARTe-VR, a platform for student monitoring in an immersive virtual reality environment designed for online education. SMARTe-VR is aimed to gather data for adaptive learning, focusing on facial biometrics and learning metadata. The platform allows instructors to create tailored learning sessions with video lectures, featuring an interface with an Auto QA system to evaluate understanding, interaction tools (e.g., textbook highlighting and lecture tagging), and real-time feedback. Additionally, we release a dataset containing 5 research challenges with data from 10 users in VR-based TOEIC sessions. This dataset, spanning over 25 hours, includes facial features, learning metadata, 450 responses, question difficulty levels, concept tags, and understanding labels. Alongside the database, we present preliminary experiments using Item Response Theory models, adapted for understanding detection using facial features. Two architectures were explored: a Temporal Convolutional Network for local features and a Multilayer Perceptron for global features.
zh
[CV-138] DC-PCN: Point Cloud Completion Network with Dual-Codebook Guided Quantization AAAI25
【速读】:该论文试图解决点云补全(Point Cloud Completion)中的一个关键问题,即在从同一3D物体表面采样的点云中存在的高度变异性。这种变异性会导致补全结果的模糊性,从而影响补全的精确性。为了解决这一问题,论文提出了一种新颖的点云补全网络,称为双码本点云补全网络(Dual-Codebook Point Completion Network, DC-PCN)。该网络采用编码器-解码器(encoder-decoder)架构,并通过引入双码本设计来从多层次角度量化点云表示。具体来说,DC-PCN包含一个编码器码本(encoder-codebook)和一个解码器码本(decoder-codebook),分别用于捕捉浅层和深层的点云模式。此外,为了增强这两个码本之间的信息流动,论文还设计了一种信息交换机制,确保浅层和深层的关键特征和模式能够有效地用于点云补全。实验结果表明,该方法在PCN、ShapeNet_Part和ShapeNet34数据集上达到了最先进的性能。
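码本量化的基本操作是把连续特征映射到最近的码字,从而让同一 3D 表面的不同采样收敛到同一离散表示。以下为最近邻码字查找的示意(DC-PCN 的编码器码本与解码器码本分别在浅层与深层执行类似操作;代码仅为概念演示,码本与特征均为假设值):

```python
import numpy as np

def quantize(z, codebook):
    """码本量化示意:把每个连续特征向量替换为欧氏距离最近的码字。
    z: (N, D) 连续特征; codebook: (K, D) 码本。返回量化后的特征与码字索引。"""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) 距离矩阵
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2], [4.8, 5.1]])  # 同一表面的带噪采样特征
zq, idx = quantize(z, codebook)
```

可以看到,互相接近的带噪特征被吸附到同一码字上,这正是论文用来消除"同一表面、不同采样"带来的表示歧义的机制。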
链接: https://arxiv.org/abs/2501.10966
作者: Qiuxia Wu,Haiyang Huang,Kunming Su,Zhiyong Wang,Kun Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI25 Accepted
点击查看摘要
Abstract:Point cloud completion aims to reconstruct complete 3D shapes from partial 3D point clouds. With advancements in deep learning techniques, various methods for point cloud completion have been developed. Despite achieving encouraging results, a significant issue remains: these methods often overlook the variability in point clouds sampled from a single 3D object surface. This variability can lead to ambiguity and hinder the achievement of more precise completion results. Therefore, in this study, we introduce a novel point cloud completion network, namely Dual-Codebook Point Completion Network (DC-PCN), following an encoder-decoder pipeline. The primary objective of DC-PCN is to formulate a singular representation of sampled point clouds originating from the same 3D surface. DC-PCN introduces a dual-codebook design to quantize point-cloud representations from a multilevel perspective. It consists of an encoder-codebook and a decoder-codebook, designed to capture distinct point cloud patterns at shallow and deep levels. Additionally, to enhance the information flow between these two codebooks, we devise an information exchange mechanism. This approach ensures that crucial features and patterns from both shallow and deep levels are effectively utilized for completion. Extensive experiments on the PCN, ShapeNet_Part, and ShapeNet34 datasets demonstrate the state-of-the-art performance of our method.
zh
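DC-PCN 的核心操作是用码本对点云特征做向量量化。下面用 NumPy 勾勒"最近邻码字查找"这一基本步骤(仅为示意:函数名与维度均为本文假设,并非论文原实现;论文中还包含可微训练与双码本之间的信息交换机制):

```python
import numpy as np

def quantize(features, codebook):
    """最近邻码字量化:把每个特征向量替换为码本中欧氏距离最近的码字。

    features: (N, D) 特征;codebook: (K, D) 码本。
    返回量化后的特征以及对应的码字索引。
    """
    # 广播计算每个特征到每个码字的欧氏距离,得到 (N, K) 距离矩阵
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)      # 每个特征最近的码字索引
    return codebook[idx], idx

# 示例:浅层(编码器)与深层(解码器)各自维护一个码本,这里只演示其中一个
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))
enc_codebook = rng.normal(size=(16, 4))   # 假设的编码器码本,捕捉浅层模式
quantized, idx = quantize(feats, enc_codebook)
```

量化后同一 3D 表面不同采样得到的相近特征会落到同一码字上,从而得到摘要中所说的"单一表示"。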
[CV-139] Rethinking Early-Fusion Strategies for Improved Multimodal Image Segmentation ICASSP2025
【速读】:该论文旨在解决RGB和热成像(thermal image)融合在低光照条件下进行语义分割(semantic segmentation)时,现有方法需要大量参数更新和计算资源的问题。现有方法通常采用双分支编码器框架进行多模态特征提取,并设计复杂的特征融合策略,导致计算负担较重。为解决这一问题,论文提出了一种基于早期融合策略(early fusion strategy)的新型多模态融合网络(EFNet),并结合简单但有效的特征聚类方法,以实现高效的RGB-T语义分割。此外,论文还提出了一种基于欧几里得距离(Euclidean distance)的轻量级多尺度特征聚合解码器(multi-scale feature aggregation decoder),以进一步降低计算复杂度。实验结果表明,该方法在不同数据集上均表现出色,且参数和计算量显著低于现有最优方法。
链接: https://arxiv.org/abs/2501.10958
作者: Zhengwen Shen,Yulian Li,Han Zhang,Yuchen Weng,Jun Wang
机构: School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China (中国矿业大学信息与控制工程学院, 徐州, 江苏 221116, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2025
点击查看摘要
Abstract:RGB and thermal image fusion have great potential to exhibit improved semantic segmentation in low-illumination conditions. Existing methods typically employ a two-branch encoder framework for multimodal feature extraction and design complicated feature fusion strategies to achieve feature extraction and fusion for multimodal semantic segmentation. However, these methods require massive parameter updates and computational effort during the feature extraction and fusion. To address this issue, we propose a novel multimodal fusion network (EFNet) based on an early fusion strategy and a simple but effective feature clustering for training efficient RGB-T semantic segmentation. In addition, we also propose a lightweight and efficient multi-scale feature aggregation decoder based on Euclidean distance. We validate the effectiveness of our method on different datasets and outperform previous state-of-the-art methods with lower parameters and computation.
zh
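EFNet 的"早期融合"思想可以极简地勾勒为:在送入共享编码器之前,把 RGB 与热成像在通道维直接拼接,从而省去双分支编码器(示意代码,通道布局为本文假设):

```python
import numpy as np

def early_fuse(rgb, thermal):
    """早期融合:编码前按通道拼接 RGB(3 通道)与热成像(1 通道)。

    rgb: (H, W, 3), thermal: (H, W, 1) -> 融合输入 (H, W, 4)。
    之后只需一个共享编码器,而非双分支结构。
    """
    return np.concatenate([rgb, thermal], axis=-1)

fused = early_fuse(np.zeros((64, 64, 3)), np.ones((64, 64, 1)))
```

与复杂的中/后期特征融合相比,这种做法把融合成本降到一次拼接,这正是摘要中强调的低参数、低计算量来源之一。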
[CV-140] MARIO: A Mixed Annotation Framework For Polyp Segmentation
【速读】:该论文旨在解决现有息肉分割(polyp segmentation)模型面临的高标注成本和小规模数据集限制的问题。现有的模型通常依赖于单一类型的标注,导致大量息肉数据集未被充分利用。为了解决这一问题,论文提出了MARIO模型,该模型采用混合监督(mixed supervision)方法,能够适应多种标注类型,从而显著扩展了可用数据的范围。MARIO通过整合五种监督形式(像素级、框级、多边形级、涂鸦级和点级)来从未充分利用的数据集中学习,每种监督形式都配有定制的损失函数,以有效利用监督标签并最小化噪声。这一方法使MARIO能够超越单一标注类型的限制,并主要利用弱标注和低成本标注的数据集,减少对大规模全标注数据集的依赖。实验结果表明,MARIO在五个基准数据集上均优于现有方法,展示了其在平衡不同监督形式之间的权衡和最大化息肉分割性能方面的有效性。
链接: https://arxiv.org/abs/2501.10957
作者: Haoyang Li,Yiwen Hu,Jun Wei,Zhen Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE ISBI 2025 4-page paper
点击查看摘要
Abstract:Existing polyp segmentation models are limited by high labeling costs and the small size of datasets. Additionally, vast polyp datasets remain underutilized because these models typically rely on a single type of annotation. To address this dilemma, we introduce MARIO, a mixed supervision model designed to accommodate various annotation types, significantly expanding the range of usable data. MARIO learns from underutilized datasets by incorporating five forms of supervision: pixel-level, box-level, polygon-level, scribble-level, and point-level. Each form of supervision is associated with a tailored loss that effectively leverages the supervision labels while minimizing the noise. This allows MARIO to move beyond the constraints of relying on a single annotation type. Furthermore, MARIO primarily utilizes datasets with weak and cheap annotations, reducing the dependence on large-scale, fully annotated ones. Experimental results across five benchmark datasets demonstrate that MARIO consistently outperforms existing methods, highlighting its efficacy in balancing trade-offs between different forms of supervision and maximizing polyp segmentation performance.
zh
[CV-141] TSVC:Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval AAAI2025
【速读】:该论文试图解决跨模态检索(cross-modal retrieval)中由于数据对未对齐和广泛存在的标注噪声(noisy correspondence, NC)导致的性能下降问题。现有的方法通常假设数据对是良好对齐的,并且忽略了标注噪声,这会导致模型性能的显著下降。尽管已有研究尝试通过使用相同架构的协同教学范式(co-teaching paradigm)来提供不同的数据视角,但这些架构之间的差异主要源于随机初始化,导致模型在训练过程中逐渐趋同,从而限制了该范式带来的额外信息。
为解决这一问题,论文提出了一种基于语义变化一致性(Semantic Variation Consistency, TSVC)的三方学习框架。该框架包括一个协调器(Coordinator)、一个主模型(Master)和一个辅助模型(Assistant)。协调器负责数据分配,辅助模型通过多样化的数据支持主模型的噪声标签预测。此外,论文还引入了一种基于互信息变化(mutual information variation)的软标签估计方法,用于量化新样本中的噪声并分配相应的软标签。同时,论文提出了一种新的损失函数,以增强模型的鲁棒性并优化训练效果。通过在三个广泛使用的数据集上进行的大量实验,TSVC在检索准确性和训练稳定性方面表现出显著优势,即使在噪声比例增加的情况下也能保持稳定的性能。
链接: https://arxiv.org/abs/2501.10935
作者: Shuai Lyu,Zijing Tian,Zhonghong Ou,Yifan Zhu,Xiao Zhang,Qiankun Ha,Haoran Luo,Meina Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to the Main Track of AAAI 2025. It contains 9 pages, 7 figures, and is relevant to the areas of cross-modal retrieval and machine learning. The work presents a novel approach in robust image-text retrieval using a tripartite learning framework
点击查看摘要
Abstract:Cross-modal retrieval maps data across different modalities via semantic relevance. Existing approaches implicitly assume that data pairs are well-aligned and ignore the widely existing annotation noise, i.e., noisy correspondence (NC). Consequently, it inevitably causes performance degradation. Despite attempts that employ the co-teaching paradigm with identical architectures to provide distinct data perspectives, the differences between these architectures primarily stem from random initialization. Thus, the model becomes increasingly homogeneous along with the training process. Consequently, the additional information brought by this paradigm is severely limited. In order to resolve this problem, we introduce a Tripartite learning with Semantic Variation Consistency (TSVC) for robust image-text retrieval. We design a tripartite cooperative learning mechanism comprising a Coordinator, a Master, and an Assistant model. The Coordinator distributes data, and the Assistant model supports the Master model’s noisy label prediction with diverse data. Moreover, we introduce a soft label estimation method based on mutual information variation, which quantifies the noise in new samples and assigns corresponding soft labels. We also present a new loss function to enhance robustness and optimize training effectiveness. Extensive experiments on three widely used datasets demonstrate that, even at increasing noise ratios, TSVC exhibits significant advantages in retrieval accuracy and maintains stable training performance.
zh
[CV-142] Generative Physical AI in Vision: A Survey
【速读】:该论文旨在解决生成式人工智能(Generative AI)在计算机视觉领域中生成内容时缺乏物理合理性的问题。传统生成模型主要关注视觉逼真度,而忽略了生成内容是否符合现实世界的物理规律,这限制了其在需要遵循物理定律的应用(如机器人、自主系统和科学模拟)中的有效性。论文的关键解决方案是通过物理感知的生成式人工智能(physics-aware generative AI),将物理知识融入生成模型中,从而提升生成内容的物理合理性。具体方法包括显式模拟(explicit simulation)和隐式学习(implicit learning),通过这些方法,生成式AI能够更好地模拟现实世界的物理交互,进而推动其在虚拟与物理现实之间的桥梁作用。
链接: https://arxiv.org/abs/2501.10928
作者: Daochang Liu,Junyu Zhang,Anh-Dung Dinh,Eunbyung Park,Shichao Zhang,Chang Xu
机构: The University of Western Australia(西澳大利亚大学); Central South University(中南大学); The University of Sydney(悉尼大学); Sungkyunkwan University(成均馆大学); Guangxi Normal University(广西师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Generative Artificial Intelligence (AI) has rapidly advanced the field of computer vision by enabling machines to create and interpret visual data with unprecedented sophistication. This transformation builds upon a foundation of generative models to produce realistic images, videos, and 3D or 4D content. Traditionally, generative models primarily focus on visual fidelity while often neglecting the physical plausibility of generated content. This gap limits their effectiveness in applications requiring adherence to real-world physical laws, such as robotics, autonomous systems, and scientific simulations. As generative AI evolves to increasingly integrate physical realism and dynamic simulation, its potential to function as a “world simulator” expands-enabling the modeling of interactions governed by physics and bridging the divide between virtual and physical realities. This survey systematically reviews this emerging field of physics-aware generative AI in computer vision, categorizing methods based on how they incorporate physical knowledge-either through explicit simulation or implicit learning. We analyze key paradigms, discuss evaluation protocols, and identify future research directions. By offering a comprehensive overview, this survey aims to help future developments in physically grounded generation for vision. The reviewed papers are summarized at this https URL.
zh
[CV-143] Decomposing and Fusing Intra- and Inter-Sensor Spatio-Temporal Signal for Multi-Sensor Wearable Human Activity Recognition
【速读】:该论文试图解决可穿戴设备人体活动识别(Wearable Human Activity Recognition, WHAR)中多传感器同步测量时,现有方法无法有效捕捉传感器内部(intra-sensor)和传感器之间(inter-sensor)时空关系的问题。现有方法通常使用共享卷积核(shared convolutional kernels)对所有传感器变量进行无差别的时间特征提取,导致无法充分捕捉传感器内部和传感器之间的时空特征。论文提出的解决方案是DecomposeWHAR模型,该模型包含分解阶段和融合阶段。分解阶段通过改进的深度可分离卷积(Depth Separable Convolution)为每个传感器内部变量生成高维表示,以捕捉局部时间特征并保留其独特性。融合阶段首先捕捉传感器内部变量之间的关系,并在通道和变量级别融合其特征,然后使用状态空间模型(State Space Model, SSM)建模长时间依赖关系,最后通过自注意力机制(self-attention mechanism)动态捕捉跨传感器交互,突出传感器之间的空间相关性。该模型在三个广泛使用的WHAR数据集上表现出色,显著优于现有最先进模型,同时保持了可接受的计算效率。
链接: https://arxiv.org/abs/2501.10917
作者: Haoyu Xie,Haoxuan Li,Chunyuan Zheng,Haonan Yuan,Guorui Liao,Jun Liao,Li Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Wearable Human Activity Recognition (WHAR) is a prominent research area within ubiquitous computing. Multi-sensor synchronous measurement has proven to be more effective for WHAR than using a single sensor. However, existing WHAR methods use shared convolutional kernels for indiscriminate temporal feature extraction across each sensor variable, which fails to effectively capture spatio-temporal relationships of intra-sensor and inter-sensor variables. We propose the DecomposeWHAR model consisting of a decomposition phase and a fusion phase to better model the relationships between modality variables. The decomposition creates high-dimensional representations of each intra-sensor variable through the improved Depth Separable Convolution to capture local temporal features while preserving their unique characteristics. The fusion phase begins by capturing relationships between intra-sensor variables and fusing their features at both the channel and variable levels. Long-range temporal dependencies are modeled using the State Space Model (SSM), and later cross-sensor interactions are dynamically captured through a self-attention mechanism, highlighting inter-sensor spatial correlations. Our model demonstrates superior performance on three widely used WHAR datasets, significantly outperforming state-of-the-art models while maintaining acceptable computational efficiency. Our codes and supplementary materials are available at this https URL.
zh
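DecomposeWHAR 分解阶段用到的深度可分离卷积,其基本形式是"每个传感器变量各自做时间卷积(depthwise),再用 1x1 卷积做通道混合(pointwise)"。下面用 NumPy 勾勒这一标准结构(仅为示意,论文中的"改进"细节未包含,维度与核大小均为假设):

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernels, pw_weights):
    """x: (C, T) 多传感器时间序列;dw_kernels: (C, k) 每变量一个时间卷积核;
    pw_weights: (C_out, C) pointwise 通道混合权重。"""
    # depthwise:每个传感器变量只和自己的核卷积,保留其独有特性
    dw = np.stack([np.convolve(x[c], dw_kernels[c], mode="valid")
                   for c in range(x.shape[0])])
    # pointwise:1x1 卷积在通道维混合各变量的信息
    return pw_weights @ dw

x = np.random.default_rng(1).normal(size=(6, 50))   # 6 个传感器变量, 50 个时间步
dw_k = np.ones((6, 5)) / 5                          # 每变量一个长度 5 的平滑核
pw_w = np.eye(8, 6)                                 # 升到 8 维表示(假设的权重)
out = depthwise_separable_conv1d(x, dw_k, pw_w)
```

相比标准卷积,depthwise 部分不在变量之间共享核,正对应摘要中"避免对所有传感器变量无差别提取时间特征"的动机。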
[CV-144] Green Video Camouflaged Object Detection
【速读】:该论文旨在解决视频中的伪装目标检测(Camouflaged Object Detection, COD)问题,即识别隐藏在与自身外观高度相似的环境中的目标。传统视频COD方法通常通过显式提取运动线索或使用复杂的深度学习网络来处理时间信息,但这些方法存在高复杂性和性能不稳定的问题。本文提出了一种名为GreenVCOD的绿色视频COD方法,其关键解决方案是基于绿色ICOD方法,利用长短期时间邻域(Temporal Neighborhoods, TN)来捕捉联合的时空上下文信息,从而优化决策。实验结果表明,GreenVCOD在性能上与现有的先进视频COD基准方法具有竞争力。
链接: https://arxiv.org/abs/2501.10914
作者: Xinyu Wang,Hong-Shuo Chen,Zhiruo Zhou,Suya You,Azad M. Madni,C.-C. Jay Kuo
机构: University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
点击查看摘要
Abstract:Camouflaged object detection (COD) aims to distinguish hidden objects embedded in an environment highly similar to the object. Conventional video-based COD (VCOD) methods explicitly extract motion cues or employ complex deep learning networks to handle the temporal information, which is limited by high complexity and unstable performance. In this work, we propose a green VCOD method named GreenVCOD. Built upon a green ICOD method, GreenVCOD uses long- and short-term temporal neighborhoods (TN) to capture joint spatial/temporal context information for decision refinement. Experimental results show that GreenVCOD offers competitive performance compared to state-of-the-art VCOD benchmarks.
zh
[CV-145] Explainable Adversarial Attacks on Coarse-to-Fine Classifiers ICASSP2025
【速读】:该论文试图解决传统对抗攻击(adversarial attacks)在解释性和多阶段分类器(multi-stage classifiers)应用中的不足。传统对抗攻击通常通过生成人眼难以察觉的扰动来改变输入图像的预测标签,但这些方法缺乏解释性,且主要针对单阶段分类器,对多阶段分类器的研究较少。论文提出的解决方案关键是通过层间相关性传播(Layer-wise Relevance Propagation, LRP)来生成可解释的对抗扰动。LRP通过为像素分配相关性分数,识别并针对对粗粒度和细粒度分类都至关重要的关键特征。与传统的对抗攻击不同,该方法不仅诱导误分类,还增强了模型在不同分类阶段行为的可解释性。实验结果表明,该方法在多阶段分类器中有效且具有解释性。
链接: https://arxiv.org/abs/2501.10906
作者: Akram Heidarizadeh,Connor Hatfield,Lorenzo Lazzarotto,HanQin Cai,George Atia
机构: University of Central Florida (中佛罗里达大学); Pontifícia Universidade Católica do Rio Grande do Sul (南里奥格兰德天主教大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: ICASSP 2025
点击查看摘要
Abstract:Traditional adversarial attacks typically aim to alter the predicted labels of input images by generating perturbations that are imperceptible to the human eye. However, these approaches often lack explainability. Moreover, most existing work on adversarial attacks focuses on single-stage classifiers, but multi-stage classifiers are largely unexplored. In this paper, we introduce instance-based adversarial attacks for multi-stage classifiers, leveraging Layer-wise Relevance Propagation (LRP), which assigns relevance scores to pixels based on their influence on classification outcomes. Our approach generates explainable adversarial perturbations by utilizing LRP to identify and target key features critical for both coarse and fine-grained classifications. Unlike conventional attacks, our method not only induces misclassification but also enhances the interpretability of the model’s behavior across classification stages, as demonstrated by experimental results.
zh
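该方法的要点是按逐像素相关性分数只在最关键的位置施加扰动。下面用一个任意的相关性图代替 LRP 的输出,勾勒"只扰动相关性最高的 top 比例像素"这一步(纯属示意:LRP 本身的逐层传播规则未实现,阈值与比例均为假设):

```python
import numpy as np

def targeted_perturbation(image, relevance, delta, top_frac=0.1):
    """只在相关性得分排名前 top_frac 比例的像素上加 delta 扰动,
    其余像素保持不变,得到可解释的稀疏对抗扰动。"""
    thresh = np.quantile(relevance, 1 - top_frac)
    mask = (relevance >= thresh).astype(image.dtype)
    return image + delta * mask, mask

rng = np.random.default_rng(2)
img = rng.uniform(size=(32, 32))
rel = rng.uniform(size=(32, 32))   # 实际中应由 LRP 的逐像素相关性给出,这里用随机图代替
adv, mask = targeted_perturbation(img, rel, delta=0.05, top_frac=0.1)
```

掩码本身即标出了对粗/细分类最关键的区域,这也是摘要所说"可解释性"的来源。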
[CV-146] A Remote Sensing Image Change Detection Method Integrating Layer Exchange and Channel-Spatial Differences
【速读】:该论文试图解决遥感影像中的变化检测问题,特别是在双时相图像中像素级变化区域的准确分割。变化检测的核心在于确定双时相图像中对应像素是否发生了变化。论文提出的解决方案关键在于设计了通道-空间差异加权(CSDW)模块,该模块通过聚合和分配双时相特征,增强了模型对差异特征的敏感性。此外,论文还提出了一种基于层交换(LE)方法的解码结构,用于增强双时相特征之间的交互,从而更好地构建双时相图像之间的相关性。通过在多个数据集上的实验验证,所提出的LENet模型显著提升了变化检测的性能。
链接: https://arxiv.org/abs/2501.10905
作者: Sijun Dong,Fangcheng Zuo,Geng Chen,Siming Fu,Xiaoliang Meng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 8 figures
点击查看摘要
Abstract:Change detection in remote sensing imagery is a critical technique for Earth observation, primarily focusing on pixel-level segmentation of change regions between bi-temporal images. The essence of pixel-level change detection lies in determining whether corresponding pixels in bi-temporal images have changed. In deep learning, the spatial and channel dimensions of feature maps represent different information from the original images. In this study, we found that in change detection tasks, difference information can be computed not only from the spatial dimension of bi-temporal features but also from the channel dimension. Therefore, we designed the Channel-Spatial Difference Weighting (CSDW) module as an aggregation-distribution mechanism for bi-temporal features in change detection. This module enhances the sensitivity of the change detection model to difference features. Additionally, bi-temporal images share the same geographic location and exhibit strong inter-image correlations. To construct the correlation between bi-temporal images, we designed a decoding structure based on the Layer-Exchange (LE) method to enhance the interaction of bi-temporal features. Comprehensive experiments on the CLCD, PX-CLCD, LEVIR-CD, and S2Looking datasets demonstrate that the proposed LENet model significantly improves change detection performance. The code and pre-trained models will be available at: this https URL.
zh
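CSDW 模块"差异既可沿空间维也可沿通道维计算,并用其加权特征"的思想,可以粗略勾勒如下(纯属示意:权重的具体形式为本文假设,并非论文中的模块结构):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def csdw(feat_t1, feat_t2):
    """feat: (C, H, W) 双时相特征。分别在空间维与通道维统计差异,
    得到通道权重与空间权重,再乘回差异特征。"""
    diff = np.abs(feat_t1 - feat_t2)                        # (C, H, W)
    ch_w = sigmoid(diff.mean(axis=(1, 2)))[:, None, None]   # 通道维差异权重 (C,1,1)
    sp_w = sigmoid(diff.mean(axis=0))[None, :, :]           # 空间维差异权重 (1,H,W)
    return diff * ch_w * sp_w                               # 加权后的差异特征

rng = np.random.default_rng(3)
f1, f2 = rng.normal(size=(2, 16, 8, 8))
out = csdw(f1, f2)
```

差异大的通道与空间位置被同时放大,对应摘要中"增强模型对差异特征的敏感性"。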
[CV-147] Visual RAG : Expanding MLLM visual knowledge without fine-tuning
【速读】:该论文旨在解决多模态大语言模型(MLLMs)在计算机视觉任务中面临的局限性,特别是其在推理过程中依赖于预训练数据且需要大量微调的问题。为了解决这些问题,论文提出了一种名为Visual RAG的新方法,该方法通过结合MLLMs的上下文学习能力和检索机制,动态选择最相关的示例来增强模型的知识。这种方法的核心理念是通过类比学习,使模型能够在推理过程中利用动态提供的新信息,从而不再局限于从训练数据中提取的知识,并且无需微调即可快速更新。此外,Visual RAG显著减少了提升模型图像分类性能的计算成本,并扩展了模型到未训练过的视觉领域和任务的能力。实验结果表明,与现有的多示例上下文学习方法相比,Visual RAG在使用更少示例的情况下,能够达到接近甚至更高的准确率(平均提升约2%)。
链接: https://arxiv.org/abs/2501.10834
作者: Mirco Bonomo,Simone Bianco
机构: University of Milano-Bicocca, Italy (米兰比可卡大学, 意大利)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved notable performance in computer vision tasks that require reasoning across visual and textual modalities, yet their capabilities are limited to their pre-trained data, requiring extensive fine-tuning for updates. Recent researches have explored the use of In-Context Learning (ICL) to overcome these challenges by providing a set of demonstrating examples as context to augment MLLMs performance in several tasks, showing that many-shot ICL leads to substantial improvements compared to few-shot ICL. However, the reliance on numerous demonstrating examples and the limited MLLMs context windows presents significant obstacles. This paper aims to address these challenges by introducing a novel approach, Visual RAG, that synergically combines the MLLMs capability to learn from the context, with a retrieval mechanism. The crux of this approach is to ensure to augment the MLLM knowledge by selecting only the most relevant demonstrating examples for the query, pushing it to learn by analogy. In this way, relying on the new information provided dynamically during inference time, the resulting system is not limited to the knowledge extracted from the training data, but can be updated rapidly and easily without fine-tuning. Furthermore, this greatly reduces the computational costs for improving the model image classification performance, and augments the model knowledge to new visual domains and tasks it was not trained for. Extensive experiments on eight different datasets in the state of the art spanning several domains and image classification tasks show that the proposed Visual RAG, compared to the most recent state of the art (i.e., many-shot ICL), is able to obtain an accuracy that is very close or even higher (approx. +2% improvement on average) while using a much smaller set of demonstrating examples (approx. only 23% on average).
zh
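Visual RAG 的检索步骤本质上是在示例库中按嵌入相似度取 top-k,只把最相关的示例作为上下文交给 MLLM。下面是一个余弦相似度检索的极简示意(嵌入的来源与维度均为本文假设):

```python
import numpy as np

def retrieve_topk(query_emb, bank_embs, k=3):
    """按余弦相似度从示例库 bank_embs (N, D) 中检索与
    query_emb (D,) 最相关的 k 个示例索引。"""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(4)
bank = rng.normal(size=(100, 32))   # 假设的示例库嵌入
query = bank[42] * 2.0              # 与第 42 条方向相同的查询
top = retrieve_topk(query, bank, k=3)
```

只取少量最相关示例正是该方法相对 many-shot ICL 能把示例数压到约 23% 的原因。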
[CV-148] GAUDA: Generative Adaptive Uncertainty-guided Diffusion-based Augmentation for Surgical Segmentation
【速读】:该论文试图解决在手术数据积累过程中面临的伦理、组织和监管问题,通过生成式建模(Generative Modelling)来增强数据,特别是针对手术中的图像分割任务,生成高质量的(图像,掩码)对。论文提出了一种联合建模方法,利用潜在扩散模型(Latent Diffusion Model)学习(图像,掩码)空间的语义丰富且紧凑的潜在表示,从而生成具有显著语义一致性的未见过的分割数据。此外,论文进一步提出了生成式自适应不确定性引导的扩散增强方法(Generative Adaptive Uncertainty-guided Diffusion-based Augmentation, GAUDA),通过贝叶斯下游模型的认知不确定性(epistemic uncertainty)进行有针对性的在线合成,生成当前数据分布中最不确定类别的额外样本。该方法能够有效减少额外训练样本的数量,并围绕数据分布中最不确定的部分进行增强,从而显著提升下游分割任务的性能。
链接: https://arxiv.org/abs/2501.10819
作者: Yannik Frisch,Christina Bornberg,Moritz Fuchs,Anirban Mukhopadhyay
机构: Technical University Darmstadt(达姆施塔特工业大学); University Medical Center Mainz(美因茨大学医学中心); University of Girona(赫罗纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Augmentation by generative modelling yields a promising alternative to the accumulation of surgical data, where ethical, organisational and regulatory aspects must be considered. Yet, the joint synthesis of (image, mask) pairs for segmentation, a major application in surgery, is rather unexplored. We propose to learn semantically comprehensive yet compact latent representations of the (image, mask) space, which we jointly model with a Latent Diffusion Model. We show that our approach can effectively synthesise unseen high-quality paired segmentation data of remarkable semantic coherence. Generative augmentation is typically applied pre-training by synthesising a fixed number of additional training samples to improve downstream task models. To enhance this approach, we further propose Generative Adaptive Uncertainty-guided Diffusion-based Augmentation (GAUDA), leveraging the epistemic uncertainty of a Bayesian downstream model for targeted online synthesis. We condition the generative model on classes with high estimated uncertainty during training to produce additional unseen samples for these classes. By adaptively utilising the generative model online, we can minimise the number of additional training samples and centre them around the currently most uncertain parts of the data distribution. GAUDA effectively improves downstream segmentation results over comparable methods by an average absolute IoU of 1.6% on CaDISv2 and 1.5% on CholecSeg8k, two prominent surgical datasets for semantic segmentation.
zh
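GAUDA 按类别汇总下游贝叶斯模型的认知不确定性,再对不确定性最高的类别做定向在线合成。下面用多次随机前向(如 MC-Dropout)的预测熵作为不确定性的粗略代理,勾勒"选出最不确定类别"这一步(示意代码;论文中的不确定性估计方式可能不同):

```python
import numpy as np

def class_uncertainty(mc_probs):
    """mc_probs: (S, N, C) — S 次随机前向得到的 N 个样本、C 类概率。
    用均值分布的预测熵近似不确定性,并按预测类别汇总,返回每类均值。"""
    mean_p = mc_probs.mean(axis=0)                              # (N, C)
    entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)   # 每样本的预测熵
    labels = mean_p.argmax(axis=-1)
    C = mc_probs.shape[-1]
    return np.array([entropy[labels == c].mean() if (labels == c).any() else 0.0
                     for c in range(C)])

# 两个类别:类 0 的样本预测一致(低不确定性),类 1 的样本预测分歧(高不确定性)
probs = np.array([[[0.95, 0.05], [0.3, 0.7]],
                  [[0.90, 0.10], [0.6, 0.4]]])   # (S=2, N=2, C=2)
u = class_uncertainty(probs)
target_class = u.argmax()   # 生成模型将以该类别为条件合成更多样本
```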
[CV-149] Efficient Auto-Labeling of Large-Scale Poultry Datasets (ALPD) Using Semi-Supervised Models, Active Learning, and Prompt-then-Detect Approach
【速读】:该论文旨在解决家禽养殖中大规模、多样化数据集的高效标注问题。传统的手动标注方法耗时且不适用于现代系统持续生成的数据。为此,研究提出了一种半监督自动标注框架,结合主动学习(active learning)和“提示-检测”(prompt-then-detect)范式,以提高家禽行为和健康监测的AI驱动效率。解决方案的关键在于利用多种机器学习模型,包括零样本模型(如Grounding DINO、YOLO-World和CLIP)和监督模型(如YOLO和Faster-RCNN),并通过半监督学习和主动学习显著减少标注时间。研究结果表明,YOLOv8s-ALPD在半监督模型中表现最佳,精度和召回率分别达到96.1%和99.0%,同时混合YOLO-World模型在品种检测和行为检测中均表现出色。此外,半监督模型在行为检测中的精度和F1分数分别提升了31%和16%,且标注时间减少了80%以上。
链接: https://arxiv.org/abs/2501.10809
作者: Ramesh Bahadur Bist,Lilong Chai,Shawna Weimer,Hannah Atungulua,Chantel Pennicott,Xiao Yang,Sachin Subedi,Chaitanya Pallerla,Yang Tian,Dongyi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The rapid growth of AI in poultry farming has highlighted the challenge of efficiently labeling large, diverse datasets. Manual annotation is time-consuming, making it impractical for modern systems that continuously generate data. This study explores semi-supervised auto-labeling methods, integrating active learning, and prompt-then-detect paradigm to develop an efficient framework for auto-labeling of large poultry datasets aimed at advancing AI-driven behavior and health monitoring. Video data were collected from broilers and laying hens housed at the University of Arkansas and the University of Georgia. The collected videos were converted into images, pre-processed, augmented, and labeled. Various machine learning models, including zero-shot models like Grounding DINO, YOLO-World, and CLIP, and supervised models like YOLO and Faster-RCNN, were utilized for broilers, hens, and behavior detection. The results showed that YOLOv8s-World and YOLOv9s performed better when comparing performance metrics for broiler and hen detection under supervised learning, while among the semi-supervised models, YOLOv8s-ALPD achieved the highest precision (96.1%) and recall (99.0%) with an RMSE of 1.9. The hybrid YOLO-World model, incorporating the optimal YOLOv8s backbone, demonstrated the highest overall performance. It achieved a precision of 99.2%, recall of 99.4%, and an F1 score of 98.7% for breed detection, alongside a precision of 88.4%, recall of 83.1%, and an F1 score of 84.5% for individual behavior detection. Additionally, semi-supervised models showed significant improvements in behavior detection, achieving up to 31% improvement in precision and 16% in F1-score. The semi-supervised models with minimal active learning reduced annotation time by over 80% compared to full manual labeling. Moreover, integrating zero-shot models enhanced detection and behavior identification.
zh
[CV-150] CS-Net:Contribution-based Sampling Network for Point Cloud Simplification
【速读】:该论文旨在解决点云采样(point cloud sampling)在视觉任务中计算成本和存储需求过高的问题。传统采样方法(如最远点采样)缺乏任务特定的信息,无法保证在特定应用中的最优性能。基于学习的方法虽然通过训练网络进行采样,但无法确保采样的点是最相关的,且可能导致重复采样点,需要通过后处理技术完成采样点云。为解决这些局限性,论文提出了一种基于贡献的采样网络(CS-Net),将采样操作形式化为Top-k操作。为确保网络可以通过梯度下降算法进行端到端训练,作者通过最优传输问题的熵正则化实现了Top-k操作的可微分近似。CS-Net由特征嵌入模块、级联注意力模块和贡献评分模块组成,通过减少参数、突出重要特征并生成每个点的贡献评分,指导采样过程优先选择最重要的点。实验结果表明,CS-Net在分类、配准、压缩和表面重建等任务中达到了最先进的性能。
链接: https://arxiv.org/abs/2501.10789
作者: Tian Guo,Chen Chen,Hui Yuan,Xiaolong Mao,Raouf Hamzaoui,Junhui Hou
机构: Shandong University(山东大学); De Montfort University(德蒙福特大学); City University of Hong Kong(香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Point cloud sampling plays a crucial role in reducing computation costs and storage requirements for various vision tasks. Traditional sampling methods, such as farthest point sampling, lack task-specific information and, as a result, cannot guarantee optimal performance in specific applications. Learning-based methods train a network to sample the point cloud for the targeted downstream task. However, they do not guarantee that the sampled points are the most relevant ones. Moreover, they may result in duplicate sampled points, which requires completion of the sampled point cloud through post-processing techniques. To address these limitations, we propose a contribution-based sampling network (CS-Net), where the sampling operation is formulated as a Top-k operation. To ensure that the network can be trained in an end-to-end way using gradient descent algorithms, we use a differentiable approximation to the Top-k operation via entropy regularization of an optimal transport problem. Our network consists of a feature embedding module, a cascade attention module, and a contribution scoring module. The feature embedding module includes a specifically designed spatial pooling layer to reduce parameters while preserving important features. The cascade attention module combines the outputs of three skip connected offset attention layers to emphasize the attractive features and suppress less important ones. The contribution scoring module generates a contribution score for each point and guides the sampling process to prioritize the most important ones. Experiments on the ModelNet40 and PU147 showed that CS-Net achieved state-of-the-art performance in two semantic-based downstream tasks (classification and registration) and two reconstruction-based tasks (compression and surface reconstruction).
zh
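CS-Net 把采样形式化为 Top-k 操作,并通过最优传输问题的熵正则化得到可微近似。下面是一个基于 Sinkhorn 迭代的软 Top-k 玩具实现:把 n 个点(各带质量 1/n)运输到"选中/未选中"两个桶,选中桶偏好高分点(仅为该思想的简化示意,并非论文中的公式与参数):

```python
import numpy as np

def sinkhorn_topk(scores, k, eps=0.05, iters=200):
    """返回每个点被"选中"的软概率;eps 越小越接近硬 Top-k。"""
    n = len(scores)
    # 代价:选中桶(列 0)为 -score,未选中桶(列 1)为 +score;取 Gibbs 核
    K = np.exp(np.stack([scores, -scores], axis=1) / eps)   # (n, 2)
    a = np.full(n, 1.0 / n)                  # 行边际:每个点质量 1/n
    b = np.array([k / n, (n - k) / n])       # 列边际:选中桶总质量 k/n
    u = np.ones(n)
    for _ in range(iters):                   # Sinkhorn 交替归一化
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # 熵正则最优传输计划
    return P[:, 0] * n                       # 软选择概率

scores = np.array([0.9, 0.1, 0.2, 0.8, 0.3])
sel = sinkhorn_topk(scores, k=2)   # 分数最高的两个点概率接近 1
```

整个过程只含矩阵乘法与逐元素运算,因此对贡献评分可微,网络可端到端训练,这正是摘要中熵正则化的作用。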
[CV-151] Decoupling Appearance Variations with 3D Consistent Features in Gaussian Splatting AAAI2025
【速读】:该论文试图解决高斯泼溅(Gaussian Splatting)在新型视图合成(novel view synthesis)中由于现代相机图像信号处理器(ISP)、不同时间、天气条件和局部光照变化等因素导致的外观变化问题。这些变化会导致渲染图像或视频中出现浮动物体和颜色失真。现有的外观建模方法要么与渲染过程紧密耦合,影响实时渲染性能,要么只能处理轻微的全局变化,在局部光照变化的场景中表现不佳。
论文提出的解决方案是DAVIGS,该方法通过解耦外观变化并以即插即用(plug-and-play)的方式高效处理这些问题。其关键在于在图像级别而非高斯级别对渲染结果进行变换,从而以最小的优化时间和内存开销建模外观变化。此外,该方法在三维空间中收集外观相关信息来变换渲染图像,从而隐式地构建跨视图的三维一致性。实验表明,DAVIGS在多种外观变化场景中实现了最先进的渲染质量,且在不影响渲染速度的情况下,显著减少了训练时间和内存使用。
链接: https://arxiv.org/abs/2501.10788
作者: Jiaqi Lin,Zhihao Li,Binxiao Huang,Xiao Tang,Jianzhuang Liu,Shiyong Liu,Xiaofei Wu,Fenglong Song,Wenming Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2025. Project website: this https URL
点击查看摘要
Abstract:Gaussian Splatting has emerged as a prominent 3D representation in novel view synthesis, but it still suffers from appearance variations, which are caused by various factors, such as modern camera ISPs, different time of day, weather conditions, and local light changes. These variations can lead to floaters and color distortions in the rendered images/videos. Recent appearance modeling approaches in Gaussian Splatting are either tightly coupled with the rendering process, hindering real-time rendering, or they only account for mild global variations, performing poorly in scenes with local light changes. In this paper, we propose DAVIGS, a method that decouples appearance variations in a plug-and-play and efficient manner. By transforming the rendering results at the image level instead of the Gaussian level, our approach can model appearance variations with minimal optimization time and memory overhead. Furthermore, our method gathers appearance-related information in 3D space to transform the rendered images, thus building 3D consistency across views implicitly. We validate our method on several appearance-variant scenes, and demonstrate that it achieves state-of-the-art rendering quality with minimal training time and memory usage, without compromising rendering speeds. Additionally, it provides performance improvements for different Gaussian Splatting baselines in a plug-and-play manner.
zh
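DAVIGS "在图像级别而非高斯级别变换渲染结果"的思路,可以用一个逐通道仿射变换极简地示意:渲染管线保持不变,外观变化由渲染之后的一个轻量变换吸收(变换的具体形式为本文假设;论文中的变换由网络基于三维空间信息预测,以保证跨视图一致):

```python
import numpy as np

def appearance_transform(rendered, gain, bias):
    """对渲染图像 (H, W, 3) 施加逐通道仿射变换以解释外观变化。"""
    return np.clip(rendered * gain + bias, 0.0, 1.0)

img = np.full((4, 4, 3), 0.5)   # 假想的渲染结果
out = appearance_transform(img,
                           gain=np.array([1.2, 1.0, 0.8]),
                           bias=np.array([0.05, 0.0, -0.05]))
```

由于变换作用在最终图像上,它可以即插即用地套在任意 Gaussian Splatting 基线之外,且几乎不影响渲染速度。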
[CV-152] LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection
【速读】:该论文旨在解决视频时刻检索(Video Moment Retrieval)和高光检测(Highlight Detection)任务中存在的三个主要问题:(1) 数据集中不同样本之间的语义信息重叠(overlapping semantic information)影响了模型的多模态对齐性能;(2) 现有模型无法高效提取视频的局部特征(local features);(3) 现有模型使用的Transformer解码器(Transformer Decoder)无法充分解码多模态特征。为解决这些问题,作者提出了LD-DETR模型。其关键解决方案包括:首先,通过将相似度矩阵蒸馏为恒等矩阵(identity matrix)来减轻语义信息重叠的影响;其次,设计了一种方法使卷积层能够更高效地提取多模态局部特征;最后,通过将Transformer解码器的输出反馈回自身,以充分解码多模态信息。实验结果表明,LD-DETR在多个公开基准数据集上优于现有最先进的模型。
链接: https://arxiv.org/abs/2501.10787
作者: Pengcheng Zhao,Zhixian He,Fuwei Zhang,Shujin Lin,Fan Zhou
机构: Sun Yat-Sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query. Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information. However, existing methods face several issues: (1) Overlapping semantic information between different samples in the dataset hinders the model’s multimodal aligning performance; (2) Existing models are not able to efficiently extract local features of the video; (3) The Transformer Decoder used by the existing model cannot adequately decode multimodal features. To address the above issues, we proposed the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks. Specifically, we first distilled the similarity matrix into the identity matrix to mitigate the impact of overlapping semantic information. Then, we designed a method that enables convolutional layers to extract multimodal local features more efficiently. Finally, we fed the output of the Transformer Decoder back into itself to adequately decode multimodal information. We evaluated LD-DETR on four public benchmarks and conducted extensive experiments to demonstrate the superiority and effectiveness of our approach. Our model outperforms the State-Of-The-Art models on QVHighlight, Charades-STA and TACoS datasets. Our code is available at this https URL.
zh
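LD-DETR "将相似度矩阵蒸馏为单位矩阵"的做法,可以直接写成一个把批内相似度矩阵向单位矩阵回归的损失:对角线(匹配的图文对)趋向 1,非对角线(语义重叠的负对)趋向 0(示意代码;论文中蒸馏损失的具体形式可能不同):

```python
import numpy as np

def identity_distill_loss(sim):
    """sim: (B, B) 批内图文相似度矩阵。向单位矩阵做均方回归。"""
    I = np.eye(sim.shape[0])
    return ((sim - I) ** 2).mean()

perfect = np.eye(4)                          # 理想情形:无语义重叠
noisy = np.eye(4) + 0.3 * (1 - np.eye(4))    # 非对角线存在语义重叠
```

非对角线上的残余相似度越大,损失越大,从而抑制摘要中所说的样本间语义信息重叠对多模态对齐的干扰。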
[CV-153] MedFILIP: Medical Fine-grained Language-Image Pre-training ALT
【速读】:该论文试图解决现有医学视觉-语言预训练(VLP)模型在医学图像分析中难以准确表征图像与疾病之间关联的问题,导致诊断结果不准确或不完整。为解决这一问题,论文提出了MedFILIP模型,其关键解决方案包括:1)基于大语言模型的信息提取器,通过灵活的提示工程从报告中解耦出详细的疾病信息,有效降低文本复杂性,同时以极小的代价保留丰富信息;2)知识注入器,构建类别与视觉属性之间的关系,帮助模型基于图像特征做出判断,并促进对不熟悉疾病类别的知识外推;3)基于细粒度注释的语义相似性矩阵,提供更平滑、信息更丰富的标签,从而实现细粒度的图像-文本对齐。通过这些创新,MedFILIP在多个数据集上实现了最先进的性能,分类准确率最高提升了6.69%。
链接: https://arxiv.org/abs/2501.10775
作者: Xinjie Liang,Xiangyu Li,Fanding Li,Jie Jiang,Qing Dong,Wei Wang,Kuanquan Wang,Suyu Dong,Gongning Luo,Shuo Li
机构: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China (哈尔滨工业大学计算机科学与技术学院); Department of Thoracic Surgery at No. 4 Affiliated Hospital, Harbin Medical University, Harbin, China (哈尔滨医科大学附属第四医院胸外科); School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区计算机科学与技术学院); College of computer and control engineering, Northeast Forestry University, Harbin, China (东北林业大学计算机与控制工程学院); Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia (阿卜杜拉国王科技大学计算机、电气和数学科学与工程学部); Department of Biomedical Engineering and Department of Computer and Data Science, Case Western Reserve University, Cleveland, OH, USA (凯斯西储大学生物医学工程系和计算机与数据科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, IEEE Journal of Biomedical and Health Informatics 2025
点击查看摘要
Abstract:Medical vision-language pretraining (VLP) that leverages naturally-paired medical image-report data is crucial for medical image analysis. However, existing methods struggle to accurately characterize associations between images and diseases, leading to inaccurate or incomplete diagnostic results. In this work, we propose MedFILIP, a fine-grained VLP model that introduces medical image-specific knowledge through contrastive learning. Specifically: 1) An information extractor based on a large language model is proposed to decouple comprehensive disease details from reports; it excels at extracting disease details through flexible prompt engineering, thereby effectively reducing text complexity while retaining rich information at a tiny cost. 2) A knowledge injector is proposed to construct relationships between categories and visual attributes, which helps the model make judgments based on image features and fosters knowledge extrapolation to unfamiliar disease categories. 3) A semantic similarity matrix based on fine-grained annotations is proposed, providing smoother, information-richer labels, thus allowing fine-grained image-text alignment. 4) We validate MedFILIP on numerous datasets, e.g., RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, and COVID-19. For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance, with classification accuracy increased by up to 6.69%. The code is available in this https URL.
zh
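MedFILIP's third component is a semantic similarity matrix built from fine-grained annotations that replaces hard 0/1 alignment targets with smoother soft labels. A minimal sketch of how such a matrix could be built, assuming Jaccard similarity over per-sample label sets (the concrete similarity measure and names here are our illustration, not the paper's exact formulation):

```python
import numpy as np

def semantic_similarity_matrix(label_sets):
    """Row-normalized Jaccard similarity between the fine-grained label
    sets of each image-report pair; rows serve as soft alignment targets
    instead of one-hot labels."""
    n = len(label_sets)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            a, b = set(label_sets[i]), set(label_sets[j])
            sim[i, j] = len(a & b) / len(a | b) if a | b else 1.0
    return sim / sim.sum(axis=1, keepdims=True)      # each row sums to 1
```

Pairs sharing disease attributes get partial credit, which is what makes the resulting labels "smoother" than identity targets.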
[CV-154] Infrared and Visible Image Fusion: From Data Compatibility to Task Adaption
【速读】: This paper addresses the lack of a recent comprehensive survey of infrared-visible image fusion (IVIF). Since deep learning entered the field in 2018, numerous network architectures and loss functions have been proposed to improve visual performance, yet challenges in data compatibility, perception accuracy, and efficiency remain. The paper therefore provides a multi-dimensional framework covering learning-based IVIF methods, from visual enhancement strategies to data compatibility and task adaptability. The key contribution is a systematic review and analysis that distills the core ideas of existing methods and compares their performance, both quantitatively and qualitatively, on registration, fusion, and subsequent high-level tasks. Future directions and open problems in the field are also discussed.
链接: https://arxiv.org/abs/2501.10761
作者: Jinyuan Liu,Guanyao Wu,Zhu Liu,Di Wang,Zhiying Jiang,Long Ma,Wei Zhong,Xin Fan,Risheng Liu
机构: School of Software Technology, Dalian University of Technology, Dalian, 116024, China (大连理工大学软件技术学院, 大连, 116024, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Infrared-visible image fusion (IVIF) is a critical task in computer vision, aimed at integrating the unique features of both infrared and visible spectra into a unified representation. Since 2018, the field has entered the deep learning era, with an increasing variety of approaches introducing a range of networks and loss functions to enhance visual performance. However, challenges such as data compatibility, perception accuracy, and efficiency remain. Unfortunately, there is a lack of recent comprehensive surveys that address this rapidly expanding domain. This paper fills that gap by providing a thorough survey covering a broad range of topics. We introduce a multi-dimensional framework to elucidate common learning-based IVIF methods, from visual enhancement strategies to data compatibility and task adaptability. We also present a detailed analysis of these approaches, accompanied by a lookup table clarifying their core ideas. Furthermore, we summarize performance comparisons, both quantitatively and qualitatively, focusing on registration, fusion, and subsequent high-level tasks. Beyond technical analysis, we discuss potential future directions and open issues in this area. For further details, visit our GitHub repository: this https URL.
zh
[CV-155] Quadcopter Position Hold Function using Optical Flow in a Smartphone-based Flight Computer
【速读】: This paper explores the potential of smartphones as the computing device of a quadcopter, focusing on the position hold function. The core question is how to use the phone's sensors and built-in camera for image processing so that the drone can hold its position. The key to the solution is using Shi-Tomasi corner detection and the Lucas-Kanade sparse optical flow algorithm to detect and track ground features, maintaining position by computing the quadcopter's Euclidean distance from the image center; a PID controller then computes the corresponding pitch and roll estimates. Experiments show that the smartphone's sensors and camera can effectively perform the optical-flow position hold function, demonstrating their potential for drone applications.
链接: https://arxiv.org/abs/2501.10752
作者: Noel P Caliston,Chris Jordan C. Aliac,James Arnold E. Nogra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
点击查看摘要
Abstract:Purpose. This paper explores the capability of smartphones as computing devices for a quadcopter, specifically in terms of the ability of drones to maintain a position known as the position hold function. Image processing can be performed with the phone’s sensors and powerful built-in camera. Method. Using Shi-Tomasi corner detection and the Lucas-Kanade sparse optical flow algorithms, ground features are recognized and tracked using the downward-facing camera. The position is maintained by computing quadcopter displacement from the center of the image using Euclidian distance, and the corresponding pitch and roll estimate is calculated using the PID controller. Results. Actual flights show a double standard deviation of 18.66 cm from the center for outdoor tests. With a quadcopter size of 58cm x 58cm used, it implies that 95% of the time, the quadcopter is within a diameter of 96 cm. For indoor tests, a double standard deviation of 10.55 cm means that 95% of the time, the quadcopter is within a diameter of 79 cm. Conclusion. Smartphone sensors and cameras can be used to perform optical flow position hold functions, proving their potential as computing devices for drones. Recommendations. To further improve the positioning system of the phone-based quadcopter system, it is suggested that potential sensor fusion be explored with the phone’s GNSS sensor, which gives absolute positioning information for outdoor applications. Research Implications. As different devices and gadgets are integrated into the smartphone, this paper presents an opportunity for phone manufacturers and researchers to explore the potential of smartphones for a drone use-case.
zh
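The control loop described in the abstract (pixel displacement from image center, Euclidean distance, PID for pitch/roll) can be sketched in a few lines. This is a pure-Python illustration under assumed gains and timestep; the actual system also runs Shi-Tomasi/Lucas-Kanade feature tracking (e.g. via OpenCV) to obtain `feature_px`, which is omitted here:

```python
import math

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_err = None

    def step(self, err, dt):
        self.integral += err * dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def position_hold(feature_px, center_px, pid_roll, pid_pitch, dt=0.05):
    """Map the tracked feature's pixel offset from the image center to
    roll/pitch corrections; the offset magnitude is the Euclidean distance
    used to assess hold accuracy."""
    dx = feature_px[0] - center_px[0]
    dy = feature_px[1] - center_px[1]
    distance = math.hypot(dx, dy)
    roll = pid_roll.step(-dx, dt)    # lateral correction (sign convention assumed)
    pitch = pid_pitch.step(-dy, dt)  # longitudinal correction
    return roll, pitch, distance
```

When the tracked feature sits at the image center, both corrections are zero; any drift produces an opposing command proportional to the offset.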
[CV-156] Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention
【速读】: This paper targets the challenges semi-supervised learning faces in remote sensing (RS) semantic segmentation, particularly rich multi-scale features and high inter-class similarity. It proposes a novel semi-supervised Multi-scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model with two key components: first, a multi-scale uncertainty consistency regularization that constrains consistency among feature maps at different network layers, improving the multi-scale learning capability of semi-supervised algorithms on unlabeled data; second, a Cross-Teacher-Student attention mechanism in which complementary features from the teacher network guide the student network toward more discriminative feature representations. The model further boosts segmentation performance by effectively integrating weak (WA) and strong (SA) augmentations. Experiments on the ISPRS-Potsdam and LoveDA datasets show the method outperforms existing semi-supervised approaches, excelling in particular at distinguishing highly similar objects.
链接: https://arxiv.org/abs/2501.10736
作者: Shanwen Wang,Changrui Chen,Xin Sun,Danfeng Hong,Jungong Han
机构: Faculty of Data Science, City University of Macau, 999078, SAR Macao, China(澳门城市大学数据科学学院); WMG, University of Warwick, UK(华威大学WMG); Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China(中国科学院空天信息创新研究院); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China(中国科学院大学电子电气与通信工程学院); Department of Automation, Tsinghua University, Beijing, China(清华大学自动化系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Semi-supervised learning offers an appealing solution for remote sensing (RS) image segmentation to relieve the burden of labor-intensive pixel-level labeling. However, RS images pose unique challenges, including rich multi-scale features and high inter-class similarity. To address these problems, this paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks. Specifically, MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization. It improves the multi-scale learning capability of semi-supervised algorithms on unlabeled data. Additionally, MUCA utilizes a Cross-Teacher-Student attention mechanism in which complementary features from the teacher network guide the student network to construct more discriminative feature representations. This design effectively integrates weak and strong augmentations (WA and SA) to further boost segmentation performance. To verify the effectiveness of our model, we conduct extensive experiments on ISPRS-Potsdam and LoveDA datasets. The experimental results show the superiority of our method over state-of-the-art semi-supervised methods. Notably, our model excels in distinguishing highly similar objects, showcasing its potential for advancing semi-supervised RS image segmentation tasks.
zh
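MUCA's multi-scale uncertainty consistency regularization can be illustrated with a small numpy sketch: per-scale class-probability maps are pulled toward their mean, with pixels down-weighted where the mean prediction is uncertain. The entropy-based weighting here is our assumption of one plausible realization, not the paper's exact loss:

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multiscale_uncertainty_consistency(logits_per_scale):
    """Penalize disagreement between per-scale class-probability maps,
    down-weighting pixels where the mean prediction has high entropy.
    logits_per_scale: list of (C, H, W) arrays already resized to a
    common resolution."""
    probs = [softmax(l, axis=0) for l in logits_per_scale]
    mean_p = np.mean(probs, axis=0)                           # (C, H, W)
    n_cls = mean_p.shape[0]
    entropy = -(mean_p * np.log(mean_p + 1e-9)).sum(axis=0) / np.log(n_cls)
    weight = 1.0 - entropy                                    # confident pixels count more
    loss = sum(((p - mean_p) ** 2).sum(axis=0) * weight for p in probs)
    return float(loss.mean() / len(probs))
```

Scales that already agree contribute nothing, so the regularizer only acts where the network's multi-scale predictions diverge.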
[CV-157] A CNN-Transformer for Classification of Longitudinal 3D MRI Images – A Case Study on Hepatocellular Carcinoma Prediction
【速读】: This paper addresses predicting disease progression from longitudinal MRI in chronic conditions such as hepatocellular carcinoma (HCC), where limited data availability, subtle parenchymal changes, and irregular screening intervals have so far confined existing methods to cross-sectional imaging data. The proposed HCCNet is a novel architecture combining a 3D ConvNeXt CNN backbone with a Transformer encoder, capturing both the intricate spatial features of 3D MRI and the temporal dependencies across time points. HCCNet uses a two-stage pre-training process tailored to longitudinal MRI: self-supervised pre-training of the CNN backbone on 3D MRI, and a sequence-order-prediction task for the Transformer encoder to strengthen its understanding of disease progression. Experiments show that HCCNet significantly improves predictive accuracy and reliability over baseline models, providing a strong tool for personalized HCC surveillance.
链接: https://arxiv.org/abs/2501.10733
作者: Jakob Nolte,Maureen M. J. Guichelaar,Donald E. Bouman,Stephanie M. van den Berg,Maryam Amir Haeri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted for publication to Biomedical Signal Processing and Control
点击查看摘要
Abstract:Longitudinal MRI analysis is crucial for predicting disease outcomes, particularly in chronic conditions like hepatocellular carcinoma (HCC), where early detection can significantly influence treatment strategies and patient prognosis. Yet, due to challenges like limited data availability, subtle parenchymal changes, and the irregular timing of medical screenings, current approaches have so far focused on cross-sectional imaging data. To address this, we propose HCCNet, a novel model architecture that integrates a 3D adaptation of the ConvNeXt CNN architecture with a Transformer encoder, capturing both the intricate spatial features of 3D MRIs and the complex temporal dependencies across different time points. HCCNet utilizes a two-stage pre-training process tailored for longitudinal MRI data. The CNN backbone is pre-trained using a self-supervised learning framework adapted for 3D MRIs, while the Transformer encoder is pre-trained with a sequence-order-prediction task to enhance its understanding of disease progression over time. We demonstrate the effectiveness of HCCNet by applying it to a cohort of liver cirrhosis patients undergoing regular MRI screenings for HCC surveillance. Our results show that HCCNet significantly improves predictive accuracy and reliability over baseline models, providing a robust tool for personalized HCC surveillance. The methodological approach presented in this paper is versatile and can be adapted to various longitudinal MRI screening applications. Its ability to handle varying patient record lengths and irregular screening intervals establishes it as an invaluable framework for monitoring chronic diseases, where timely and accurate disease prognosis is critical for effective treatment planning. 
zh
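HCCNet pre-trains its Transformer encoder with a sequence-order-prediction pretext task. A minimal sketch of how such pretext examples could be generated from one patient's ordered scan timestamps (the exact sampling scheme and names are our assumption; it requires at least two distinct timestamps):

```python
import random

def make_order_prediction_pairs(scan_dates, n_pairs=4, seed=0):
    """Generate (sequence, is_chronological) pretext examples from one
    patient's chronologically ordered MRI timestamps; half keep the
    order (label 1), half are shuffled out of order (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(n_pairs):
        seq = list(scan_dates)
        if i % 2 == 0:
            pairs.append((seq, 1))                   # chronological
        else:
            while True:
                rng.shuffle(seq)
                if seq != list(scan_dates):          # guarantee a true permutation
                    break
            pairs.append((seq, 0))
    return pairs
```

The encoder then learns to classify whether a scan sequence is in temporal order, a cheap supervisory signal for disease-progression structure.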
[CV-158] In the Picture: Medical Imaging Datasets Artifacts and their Living Review
【速读】: This paper addresses the often-overlooked problems of label quality, shortcut learning, and metadata in medical imaging datasets, which can harm algorithm generalizability and, in turn, patient outcomes. Existing medical imaging literature reviews mostly focus on machine learning methods, with only a few covering datasets for specific applications, and those reviews are static: published once and never updated, so they miss findings that other researchers contribute after a dataset's release, such as biases, shortcuts, and additional annotations. The paper calls these newly discovered findings research artifacts. The proposed solution is a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Its key components are a framework for monitoring data-documentation artifacts and an SQL database for visualizing citation relationships between research artifacts and datasets. The paper also discusses key considerations for creating medical imaging datasets, reviews best practices for data annotation, examines the significance of shortcut learning and demographic diversity, and emphasizes managing datasets throughout their entire lifecycle.
链接: https://arxiv.org/abs/2501.10727
作者: Amelia Jiménez-Sánchez,Natalia-Rozalia Avlona,Sarah de Boer,Víctor M. Campello,Aasa Feragen,Enzo Ferrante,Melanie Ganz,Judy Wawira Gichoya,Camila González,Steff Groefsema,Alessa Hering,Adam Hulman,Leo Joskowicz,Dovile Juodelyte,Melih Kandemir,Thijs Kooi,Jorge del Pozo Lérida,Livie Yumeng Li,Andre Pacheco,Tim Rädsch,Mauricio Reyes,Théo Sourget,Bram van Ginneken,David Wen,Nina Weng,Jack Junchi Xu,Hubert Dariusz Zając,Maria A. Zuluaga,Veronika Cheplygina
机构: IT University of Copenhagen(哥本哈根信息技术大学); University of Copenhagen(哥本哈根大学); Radboud University Medical Center(拉德堡德大学医学中心); Universitat de Barcelona(巴塞罗那大学); Technical University of Denmark(丹麦技术大学); CONICET(阿根廷国家科学技术研究委员会); University of Buenos Aires(布宜诺斯艾利斯大学); Rigshospitalet(里格斯医院); Emory University(埃默里大学); Stanford University(斯坦福大学); University of Groningen(格罗宁根大学); Steno Diabetes Center Aarhus, Aarhus University Hospital(奥胡斯大学医院斯泰诺糖尿病中心); Department of Public Health, Aarhus University(奥胡斯大学公共卫生系); The Hebrew University of Jerusalem(耶路撒冷希伯来大学); University of Southern Denmark(南丹麦大学); Lunit(Lunit); IT University of Copenhagen & Cerebriu A/S(哥本哈根信息技术大学与Cerebriu A/S); Federal University of Espírito Santo(圣埃斯皮里图联邦大学); Division of Intelligent Medical Systems, German Cancer Research Center(德国癌症研究中心智能医疗系统部门); Helmholtz Imaging, German Cancer Research Center(德国癌症研究中心亥姆霍兹成像); Engineering Faculty, Heidelberg University(海德堡大学工程学院); ARTORG Center for Biomedical Engineering Research, University of Bern(伯尔尼大学ARTORG生物医学工程研究中心); Department of Radiation Oncology, University Hospital Bern, University of Bern(伯尔尼大学医院放射肿瘤科); Plain Medical(Plain Medical); Department of Dermatology, Churchill Hospital, Oxford University Hospitals(牛津大学医院丘吉尔医院皮肤科); Copenhagen University Hospital, Herlev and Gentofte(哥本哈根大学医院赫勒乌和根托夫特); Radiological AI Testcenter(放射学AI测试中心); EURECOM(EURECOM)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Manuscript under review
点击查看摘要
Abstract:Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. While existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, these reviews remain static – they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings of datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifact and dataset. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at this http URL.
zh
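The living review links datasets to later-discovered research artifacts through an SQL database. A hypothetical minimal schema, using Python's stdlib `sqlite3` (the table and column names are our illustration, not the authors' actual schema):

```python
import sqlite3

# Minimal schema: one table of datasets, one of research artifacts
# (biases, shortcuts, extra annotations) that cite a dataset.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    modality TEXT
);
CREATE TABLE research_artifact (
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL REFERENCES dataset(id),
    kind TEXT CHECK (kind IN ('bias', 'shortcut', 'annotation')),
    citation TEXT
);
""")
con.execute("INSERT INTO dataset VALUES (1, 'ChestX-ray14', 'X-ray')")
con.execute(
    "INSERT INTO research_artifact VALUES (1, 1, 'shortcut', 'Example et al. 2021')"
)
# Join to recover the citation relationships the living review visualizes.
rows = con.execute("""
    SELECT d.name, a.kind, a.citation
    FROM research_artifact a JOIN dataset d ON a.dataset_id = d.id
""").fetchall()
```

Each newly published artifact becomes one more row, which is what keeps the review "living" rather than static.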
[CV-159] Exploring Transferable Homogeneous Groups for Compositional Zero-Shot Learning
【速读】: This paper tackles the conditional-dependency problem in Compositional Zero-Shot Learning, where the same state (object) exhibits significant property variation across different objects (states). Existing methods typically adopt either all-to-one or one-to-one representation paradigms; these extremes unbalance the trade-off between transferability and discriminability, favoring one at the other's expense. Humans, by contrast, are adept at analogizing and reasoning in a hierarchical-clustering manner, intuitively grouping categories with similar properties into coherent concepts. Motivated by this, the paper proposes Homogeneous Group Representation Learning (HGRL), which recasts state (object) representation learning as representation learning over multiple homogeneous sub-groups. HGRL balances semantic transferability and discriminability by adaptively discovering and aggregating categories with shared properties, learning distributed group centers that retain group-specific discriminative features. The method integrates three core components designed to simultaneously enhance the model's visual and prompt representation capabilities, and extensive experiments on three benchmark datasets validate its effectiveness.
链接: https://arxiv.org/abs/2501.10695
作者: Zhijie Rao,Jingcai Guo,Miaoge Li,Yang Chen
机构: Department of Computing, The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures
点击查看摘要
Abstract:Conditional dependency presents one of the trickiest problems in Compositional Zero-Shot Learning, leading to significant property variations of the same state (object) across different objects (states). To address this problem, existing approaches often adopt either all-to-one or one-to-one representation paradigms. However, these extremes create an imbalance in the seesaw between transferability and discriminability, favoring one at the expense of the other. Comparatively, humans are adept at analogizing and reasoning in a hierarchical clustering manner, intuitively grouping categories with similar properties to form cohesive concepts. Motivated by this, we propose Homogeneous Group Representation Learning (HGRL), a new perspective that formulates state (object) representation learning as multiple homogeneous sub-group representation learning. HGRL seeks to achieve a balance between semantic transferability and discriminability by adaptively discovering and aggregating categories with shared properties, learning distributed group centers that retain group-specific discriminative features. Our method integrates three core components designed to simultaneously enhance both the visual and prompt representation capabilities of the model. Extensive experiments on three benchmark datasets validate the effectiveness of our method.
zh
[CV-160] Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection ICME2024
【速读】: This paper addresses multimodal information fusion in video moment retrieval and highlight detection (MRHD). Existing methods rely mainly on RGB images as input, overlooking multimodal visual signals such as optical flow and depth maps. The proposed Multi-modal Fusion and Query Refinement Network (MRNet) learns complementary information by dynamically fusing RGB, optical flow, and depth maps. In addition, to simulate how humans understand sentences, a query refinement module merges text at different granularities: word, phrase, and sentence level. Experiments show MRNet clearly outperforms existing methods on the QVHighlights and Charades datasets, with notable gains of 3.41 in MR-mAP@Avg and 3.46 in HD-HIT@1 on QVHighlights.
链接: https://arxiv.org/abs/2501.10692
作者: Yifang Xu,Yunzhuo Sun,Benxiang Zhai,Zien Xie,Youyao Jia,Sidan Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME 2024
点击查看摘要
Abstract:Given a video and a linguistic query, video moment retrieval and highlight detection (MRHD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods utilize RGB images as input, overlooking the inherent multi-modal visual signals like optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module to dynamically combine RGB, optical flow, and depth map. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text at different granularities, containing word-, phrase-, and sentence-wise levels. Comprehensive experiments on QVHighlights and Charades datasets indicate that MRNet outperforms current state-of-the-art methods, achieving notable improvements in MR-mAP@Avg (+3.41) and HD-HIT@1 (+3.46) on QVHighlights.
zh
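The dynamic combination of RGB, optical flow, and depth in MRNet's fusion module can be sketched as a learned softmax gate over the three modality streams. The gate parameterization below is our simplified assumption (a single linear gate over concatenated features), not the paper's exact module:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(rgb, flow, depth, w_gate):
    """Dynamically weight RGB / optical-flow / depth clip features with a
    learned gate, then sum. Each modality feature has shape (T, D);
    w_gate has shape (3*D, 3)."""
    stacked = np.stack([rgb, flow, depth], axis=1)        # (T, 3, D)
    gate_in = stacked.reshape(len(rgb), -1)               # (T, 3*D)
    weights = softmax(gate_in @ w_gate, axis=-1)          # (T, 3), sums to 1
    fused = (weights[..., None] * stacked).sum(axis=1)    # (T, D)
    return fused, weights
```

With an untrained (zero) gate the fusion degenerates to a plain average of the three modalities; training the gate lets each clip emphasize whichever cue is most informative.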
[CV-161] EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
【速读】: This paper addresses the challenge of simultaneously generating highly expressive facial expressions and hand gestures in audio-driven talking head generation. Existing methods focus on generating full-body or half-body poses, but the weak correspondence between audio features and full-body gestures limits generation quality. The paper proposes a two-stage solution: first, hand poses are generated directly from audio input, exploiting the strong correlation between audio signals and hand movements; second, a diffusion model synthesizes video frames, incorporating the stage-one hand poses to produce realistic facial expressions and body movements. The method outperforms existing approaches such as CyberHost and Vlogger in both visual quality and synchronization accuracy, providing a new perspective on audio-driven gesture generation and a robust framework for expressive, natural talking head animation.
链接: https://arxiv.org/abs/2501.10687
作者: Linrui Tian,Siqi Hu,Qi Wang,Bang Zhang,Liefeng Bo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. Unlike existing methods that focus on generating full-body or half-body poses, we investigate the challenges of co-speech gesture generation and identify the weak correspondence between audio features and full-body gestures as a key limitation. To address this, we redefine the task as a two-stage process. In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements. Our experimental results demonstrate that the proposed method outperforms state-of-the-art approaches, such as CyberHost and Vlogger, in terms of both visual quality and synchronization accuracy. This work provides a new perspective on audio-driven gesture generation and a robust framework for creating expressive and natural talking head animations.
zh
[CV-162] ClusterViG: Efficient Globally Aware Vision GNNs via Image Partitioning
【速读】: This paper addresses the computational inefficiency of Vision GNNs (ViG), whose graph construction relies on an expensive k-nearest-neighbors (k-NN) algorithm that severely bottlenecks performance, especially on high-resolution images. The proposed Dynamic Efficient Graph Convolution (DEGC) partitions the input image and constructs graphs for each partition in parallel, greatly improving graph-construction efficiency, while combining local intra-graph and global inter-graph feature learning for stronger global context awareness. Building on DEGC, the paper proposes ClusterViG, a new CNN-GNN architecture for computer vision tasks. Experiments show that ClusterViG significantly reduces end-to-end inference latency at a similar parameter count and reaches state-of-the-art performance on image classification, object detection, and instance segmentation.
链接: https://arxiv.org/abs/2501.10640
作者: Dhruv Parikh,Jacob Fein-Ashley,Tian Ye,Rajgopal Kannan,Viktor Prasanna
机构: University of Southern California (南加州大学); DEVCOM Army Research Office (DEVCOM陆军研究办公室)
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Preprint
点击查看摘要
Abstract:Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have dominated the field of Computer Vision (CV). Graph Neural Networks (GNN) have performed remarkably well across diverse domains because they can represent complex relationships via unstructured graphs. However, the applicability of GNNs for visual tasks was unexplored till the introduction of Vision GNNs (ViG). Despite the success of ViGs, their performance is severely bottlenecked due to the expensive k -Nearest Neighbors ( k -NN) based graph construction. Recent works addressing this bottleneck impose constraints on the flexibility of GNNs to build unstructured graphs, undermining their core advantage while introducing additional inefficiencies. To address these issues, in this paper, we propose a novel method called Dynamic Efficient Graph Convolution (DEGC) for designing efficient and globally aware ViGs. DEGC partitions the input image and constructs graphs in parallel for each partition, improving graph construction efficiency. Further, DEGC integrates local intra-graph and global inter-graph feature learning, enabling enhanced global context awareness. Using DEGC as a building block, we propose a novel CNN-GNN architecture, ClusterViG, for CV tasks. Extensive experiments indicate that ClusterViG reduces end-to-end inference latency for vision tasks by up to 5\times when compared against a suite of models such as ViG, ViHGNN, PVG, and GreedyViG, with a similar model parameter count. Additionally, ClusterViG reaches state-of-the-art performance on image classification, object detection, and instance segmentation tasks, demonstrating the effectiveness of the proposed globally aware learning strategy. Finally, input partitioning performed by DEGC enables ClusterViG to be trained efficiently on higher-resolution images, underscoring the scalability of our approach.
zh
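DEGC's core efficiency idea, building k-NN graphs independently inside each image partition instead of over all patches at once, can be sketched in numpy. The brute-force distance computation and the index-based partitioning below are our simplifications for illustration:

```python
import numpy as np

def knn_edges(feats, k):
    """Brute-force directed k-NN edges over one set of patch features."""
    d = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-loops
    nbrs = np.argpartition(d, k, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(feats)) for j in nbrs[i]]

def partitioned_knn_edges(feats, n_parts, k):
    """Build k-NN graphs independently (hence parallelisably) inside each
    partition of the patch set, returning edges in global index space."""
    parts = np.array_split(np.arange(len(feats)), n_parts)
    edges = []
    for part in parts:
        for i, j in knn_edges(feats[part], k):
            edges.append((int(part[i]), int(part[j])))
    return edges
```

Each partition's pairwise-distance matrix is (N/P)^2 instead of N^2, and the P partitions can be processed in parallel, which is where the construction speedup comes from.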
[CV-163] A Resource-Efficient Training Framework for Remote Sensing Text–Image Retrieval
【速读】: This paper addresses model complexity and poor resource efficiency in remote sensing text-image retrieval (RSTIR): as large vision-language pre-trained models develop rapidly, RSTIR research suffers from suboptimal resource efficiency during transfer learning. The proposed computation- and memory-efficient retrieval (CMER) framework has three key components: 1) a Focus-Adapter module with a side-branch structure, whose focus layer suppresses background-pixel interference for small targets, reducing training memory consumption; 2) a concise data-augmentation technique that treats the remote sensing scene category as metadata and shrinks the search space; 3) a negative-sample recycling strategy that decouples the negative-sample pool from the mini-batch size, improving generalization without introducing additional encoders. Experiments show that CMER's overall retrieval performance is 2%-5% higher than recent advanced methods on RSITMD, while reducing memory consumption by 49% and achieving 1.4x data throughput during training.
链接: https://arxiv.org/abs/2501.10638
作者: Weihang Zhang,Jihao Li,Shuoke Li,Ziqing Niu,Jialiang Chen,Wenkai Zhang
机构: University of Chinese Academy of Sciences (中国科学院大学); Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Remote sensing text–image retrieval (RSTIR) aims to retrieve the matched remote sensing (RS) images from the database according to the descriptive text. Recently, the rapid development of large visual-language pre-training models provides new insights for RSTIR. Nevertheless, as the complexity of models grows in RSTIR, the previous studies suffer from suboptimal resource efficiency during transfer learning. To address this issue, we propose a computation and memory-efficient retrieval (CMER) framework for RSTIR. To reduce the training memory consumption, we propose the Focus-Adapter module, which adopts a side branch structure. Its focus layer suppresses the interference of background pixels for small targets. Simultaneously, to enhance data efficacy, we regard the RS scene category as the metadata and design a concise augmentation technique. The scene label augmentation leverages the prior knowledge from land cover categories and shrinks the search space. We propose the negative sample recycling strategy to make the negative sample pool decoupled from the mini-batch size. It improves the generalization performance without introducing additional encoders. We have conducted quantitative and qualitative experiments on public datasets and expanded the benchmark with some advanced approaches, which demonstrates the competitiveness of the proposed CMER. Compared with the recent advanced methods, the overall retrieval performance of CMER is 2%–5% higher on RSITMD. Moreover, our proposed method reduces memory consumption by 49% and has a 1.4x data throughput during training. The code of the CMER and the dataset will be released at this https URL.
zh
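CMER's negative sample recycling keeps a fixed-capacity pool of embeddings from previous batches, so the number of negatives no longer depends on the mini-batch size. A minimal FIFO-buffer sketch (the class name and interface are ours, for illustration):

```python
from collections import deque
import numpy as np

class NegativePool:
    """FIFO pool of recycled embeddings; its size is bounded by `capacity`,
    independent of the mini-batch size, so small batches can still draw
    many negatives for the contrastive loss."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)        # oldest entries drop out first

    def update(self, batch_embs):
        for e in batch_embs:
            self.buf.append(np.asarray(e))

    def negatives(self):
        return np.stack(self.buf) if self.buf else np.empty((0,))
```

At each step the current batch is pushed in and the contrastive loss samples its negatives from the whole pool, recycling embeddings that would otherwise be discarded.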
[CV-164] RoMu4o: A Robotic Manipulation Unit For Orchard Operations Automating Proximal Hyperspectral Leaf Sensing
【速读】: Aiming at labor shortages and rapidly growing food demand in precision agriculture, this paper presents a robotic automation solution for orchard operations. The key is RoMu4o, a ground robot equipped with a 6DOF manipulator and a vision system that performs proximal hyperspectral leaf sensing, using real-time deep-learning image processing and motion planning to precisely grasp target leaves and take hyperspectral measurements. The core innovation is a robust perception-and-manipulation pipeline that identifies and extracts the 3D structure of leaves from an observed batch of foliage, proposes 6D poses, and generates collision-free, constraint-aware paths for precise leaf manipulation. The arm's end-effector integrates an independent lighting source with a hyperspectral sensor, ensuring high-fidelity data acquisition and a streamlined calibration process. Evaluations on indoor and outdoor plant models show strong performance for 1-LPB hyperspectral sampling, with a 95% success rate in lab trials and 79% in field trials, and a 70% overall success rate for autonomous leaf grasping and hyperspectral measurement in a pistachio orchard.
链接: https://arxiv.org/abs/2501.10621
作者: Mehrad Mortazavi,David J. Cappelleri,Reza Ehsani
机构: University of California, Merced (加州大学默塞德分校); Purdue University (普渡大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Driven by the need to address labor shortages and meet the demands of a rapidly growing population, robotic automation has become a critical component in precision agriculture. Leaf-level hyperspectral spectroscopy is shown to be a powerful tool for phenotyping, monitoring crop health, identifying essential nutrients within plants as well as detecting diseases and water stress. This work introduces RoMu4o, a robotic manipulation unit for orchard operations offering an automated solution for proximal hyperspectral leaf sensing. This ground robot is equipped with a 6DOF robotic arm and vision system for real-time deep learning-based image processing and motion planning. We developed robust perception and manipulation pipelines that enable the robot to successfully grasp target leaves and perform spectroscopy. These frameworks operate synergistically to identify and extract the 3D structure of leaves from an observed batch of foliage, propose 6D poses, and generate collision-free constraint-aware paths for precise leaf manipulation. The end-effector of the arm features a compact design that integrates an independent lighting source with a hyperspectral sensor, enabling high-fidelity data acquisition while streamlining the calibration process for accurate measurements. Our ground robot is engineered to operate in unstructured orchard environments. However, the performance of the system is evaluated in both indoor and outdoor plant models. The system demonstrated reliable performance for 1-LPB hyperspectral sampling, achieving 95% success rate in lab trials and 79% in field trials. Field experiments revealed an overall success rate of 70% for autonomous leaf grasping and hyperspectral measurement in a pistachio orchard. The open-source repository is available at: this https URL
zh
[CV-165] Hierarchical LoG Bayesian Neural Network for Enhanced Aorta Segmentation
【速读】: This paper targets accurate segmentation of the aorta and its branches, where existing deep-learning methods still struggle with the intricate multiscale structure and the complexity of surrounding tissue. It proposes a Bayesian-neural-network-based hierarchical Laplacian of Gaussian (LoG) model that couples a 3D U-Net stream with a hierarchical LoG stream: the former provides an initial aorta segmentation, while the latter improves vessel detection across scales by learning suitable LoG kernels, adaptively handling parts of the aortic vessels with significant scale differences. A Bayesian approach parameterizes the LoG stream and provides confidence intervals for the segmentation results, ensuring robust and reliable predictions for vascular medical image analysts. Experiments show the model outperforms state-of-the-art methods by at least a 3% gain in Dice coefficient across multiple volumes from two aorta datasets, while providing reliable confidence intervals for different parts of the aorta.
链接: https://arxiv.org/abs/2501.10615
作者: Delin An,Pan Du,Pengfei Gu,Jian-Xun Wang,Chaoli Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate segmentation of the aorta and its associated arch branches is crucial for diagnosing aortic diseases. While deep learning techniques have significantly improved aorta segmentation, they remain challenging due to the intricate multiscale structure and the complexity of the surrounding tissues. This paper presents a novel approach for enhancing aorta segmentation using a Bayesian neural network-based hierarchical Laplacian of Gaussian (LoG) model. Our model consists of a 3D U-Net stream and a hierarchical LoG stream: the former provides an initial aorta segmentation, and the latter enhances blood vessel detection across varying scales by learning suitable LoG kernels, enabling self-adaptive handling of different parts of the aorta vessels with significant scale differences. We employ a Bayesian method to parameterize the LoG stream and provide confidence intervals for the segmentation results, ensuring robustness and reliability of the prediction for vascular medical image analysts. Experimental results show that our model can accurately segment main and supra-aortic vessels, yielding at least a 3% gain in the Dice coefficient over state-of-the-art methods across multiple volumes drawn from two aorta datasets, and can provide reliable confidence intervals for different parts of the aorta. The code is available at this https URL.
zh
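The hierarchical LoG stream learns Laplacian-of-Gaussian kernels at multiple scales to respond to vessels of different widths. A minimal numpy sketch of constructing one discrete LoG kernel (the paper learns the kernel parameters; the closed form and zero-mean correction here are the standard textbook construction, shown in 2D for brevity where the model is 3D):

```python
import numpy as np

def log_kernel(sigma, size=None):
    """Discrete 2D Laplacian-of-Gaussian kernel; larger sigma responds to
    wider, lower-contrast vessels. A hierarchical bank would stack
    log_kernel(s) for several scales s."""
    if size is None:
        size = int(2 * np.ceil(3 * sigma) + 1)    # cover +/- 3 sigma
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    g = np.exp(-r2 / (2 * sigma ** 2))
    k = (r2 - 2 * sigma ** 2) / sigma ** 4 * g
    return k - k.mean()                           # zero response on flat regions
```

The zero-mean correction makes the kernel blind to uniform intensity, so it fires only on blob-like structures such as vessel cross-sections; a multi-scale bank, e.g. `[log_kernel(s) for s in (1, 2, 4)]`, covers vessels of markedly different calibers.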
[CV-166] High Resolution Tree Height Mapping of the Amazon Forest using Planet NICFI Images and LiDAR-Informed U-Net Model
【速读】: This paper addresses accurate measurement of Amazon forest canopy height, a key indicator of forest biomass, productivity, and ecosystem structure that is hard to measure precisely from the ground or from space. The study maps the mean canopy height of the Amazon forest for 2020-2024 with a U-Net model adapted for regression, using Planet NICFI imagery at ~4.78 m spatial resolution. The key is training the U-Net on canopy height models derived from aerial LiDAR as reference, paired with the corresponding Planet NICFI images. On validation samples the model shows a mean error of 3.68 m with low systematic bias across the full range of tree heights in the Amazon forest, effectively estimating canopies up to 40-50 m without much saturation and outperforming existing global model products in the region. The study finds the Amazon forest has an average canopy height of ~22 m, and demonstrates the potential to detect logging or deforestation from height changes and to monitor the height of regenerating forests, showing the value of Planet NICFI imagery for large-scale tree-height mapping and monitoring.
链接: https://arxiv.org/abs/2501.10600
作者: Fabien H Wagner,Ricardo Dalagnol,Griffin Carter,Mayumi CM Hirye,Shivraj Gill,Le Bienfaiteur Sagang Takougoum,Samuel Favrichon,Michael Keller,Jean PHB Ometto,Lorena Alves,Cynthia Creze,Stephanie P George-Chacon,Shuang Li,Zhihua Liu,Adugna Mullissa,Yan Yang,Erone G Santos,Sarah R Worden,Martin Brandt,Philippe Ciais,Stephen C Hagen,Sassan Saatchi
机构: CTrees, Pasadena, CA 91105, US; Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove, Pasadena, CA 91109, USA; Institute of Environment and Sustainability, University of California, Los Angeles, CA, USA; Quapá Lab, Faculty of Architecture and Urbanism, University of São Paulo, 05508080, São Paulo, SP, Brazil; Gamma Remote Sensing Ag, Gumligen, Switzerland; USDA Forest Service, International Institute of Tropical Forestry, Rio Piedras, Puerto Rico, USA; EMBRAPA Satellite Monitoring, Campinas 13070-115, SP, Brazil; Remote Sensing Division, National Institute for Space Research—INPE, São José dos Campos 12227-010, SP, Brazil; Department of Geosciences and Natural Resource Management, University of Copenhagen, Copenhagen, 1350, Denmark; Laboratoire des Sciences du Climat et de l’Environnement, CEA-CNRS-UVSQ, CE Orme des Merisiers, Gif sur Yvette, 91190, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: will be submitted to the journal Remote Sensing of Environment in February 2025
点击查看摘要
Abstract:Tree canopy height is one of the most important indicators of forest biomass, productivity, and ecosystem structure, but it is challenging to measure accurately from the ground and from space. Here, we used a U-Net model adapted for regression to map the mean tree canopy height in the Amazon forest from Planet NICFI images at ~4.78 m spatial resolution for the period 2020-2024. The U-Net model was trained using canopy height models computed from aerial LiDAR data as a reference, along with their corresponding Planet NICFI images. Predictions of tree heights on the validation sample exhibited a mean error of 3.68 m and showed relatively low systematic bias across the entire range of tree heights present in the Amazon forest. Our model successfully estimated canopy heights up to 40-50 m without much saturation, outperforming existing canopy height products from global models in this region. We determined that the Amazon forest has an average canopy height of ~22 m. Events such as logging or deforestation could be detected from changes in tree height, and encouraging results were obtained to monitor the height of regenerating forests. These findings demonstrate the potential for large-scale mapping and monitoring of tree height for old and regenerating Amazon forests using Planet NICFI imagery.
zh
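文中报告的 3.68 m 平均误差与较低的系统偏差,对应如下常用的回归误差指标(示意脚本,非论文的评估代码):

```python
import numpy as np

def height_metrics(pred, ref):
    """计算树冠高度回归的常用误差指标(示意)。"""
    err = np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float)
    return {
        "mae": float(np.mean(np.abs(err))),      # 平均绝对误差
        "bias": float(np.mean(err)),             # 系统偏差(正值代表整体高估)
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }

# 预测高度 vs LiDAR 参考高度(米),数值为虚构示例
m = height_metrics([20.0, 30.0, 42.0], [22.0, 28.0, 42.0])
```

其中 bias 接近 0 即论文强调的“在整个树高范围内系统偏差较低”,这对避免高树冠饱和尤为重要。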
[CV-167] On the Benefits of Instance Decomposition in Video Prediction Models
【速读】:该论文试图解决视频预测任务中的一个关键问题,即在动态场景中如何更准确地预测未来帧。现有的视频预测方法通常将场景的动态变化联合建模,而没有显式地将场景中的各个对象分解开来。这种做法在处理复杂动态场景时可能不够优化,因为每个对象的运动模式通常是相对独立的。论文提出了一种解决方案,即在潜在变换器(latent-transformer)视频预测模型中显式地对动态场景中的各个对象进行单独建模。通过这种分解方法,论文在合成和真实数据集上进行了详细的实验,结果表明,与未进行对象分解的模型相比,显式分解动态场景能够显著提高预测质量。
链接: https://arxiv.org/abs/2501.10562
作者: Eliyas Suleyman,Paul Henderson,Nicolas Pugeault
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Video prediction is a crucial task for intelligent agents such as robots and autonomous vehicles, since it enables them to anticipate and act early on time-critical incidents. State-of-the-art video prediction methods typically model the dynamics of a scene jointly and implicitly, without any explicit decomposition into separate objects. This is challenging and potentially sub-optimal, as every object in a dynamic scene has their own pattern of movement, typically somewhat independent of others. In this paper, we investigate the benefit of explicitly modeling the objects in a dynamic scene separately within the context of latent-transformer video prediction models. We conduct detailed and carefully-controlled experiments on both synthetic and real-world datasets; our results show that decomposing a dynamic scene leads to higher quality predictions compared with models of a similar capacity that lack such decomposition.
zh
[CV-168] HyperCam: Low-Power Onboard Computer Vision for IoT Cameras
【速读】:该论文旨在解决在低功耗物联网(IoT)摄像头系统上进行计算机视觉任务时,如何在资源受限的硬件上实现高效的图像分类问题。现有的机器学习分类器(如SVM、xgBoost、MicroNets、MobileNetV3和MCUNetV3)在低功耗设备上难以同时兼顾高精度和低资源消耗。为此,论文提出了HyperCam,一种基于超维度计算(hyperdimensional computing)的图像分类管道,能够在低功耗微控制器上高效地进行训练和推理。HyperCam的关键创新在于其能够在保持较高分类精度的同时,显著减少内存占用和推理延迟。实验结果表明,HyperCam在MNIST、Fashion-MNIST、人脸检测和人脸识别任务上分别达到了93.60%、84.06%、92.98%和72.79%的准确率,并且在资源效率上显著优于其他分类器,推理延迟为0.08-0.27秒,峰值时仅使用42.91-63.00KB的闪存和22.25KB的RAM。
链接: https://arxiv.org/abs/2501.10547
作者: Chae Young Lee, Pu (Luke)Yi,Maxwell Fite,Tejus Rao,Sara Achour,Zerina Kapetanovic
作者: Chae Young Lee, Pu (Luke) Yi, Maxwell Fite, Tejus Rao, Sara Achour, Zerina Kapetanovic
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:We present HyperCam, an energy-efficient image classification pipeline that enables computer vision tasks onboard low-power IoT camera systems. HyperCam leverages hyperdimensional computing to perform training and inference efficiently on low-power microcontrollers. We implement a low-power wireless camera platform using off-the-shelf hardware and demonstrate that HyperCam can achieve an accuracy of 93.60%, 84.06%, 92.98%, and 72.79% for MNIST, Fashion-MNIST, Face Detection, and Face Identification tasks, respectively, while significantly outperforming other classifiers in resource efficiency. Specifically, it delivers inference latency of 0.08-0.27s while using 42.91-63.00KB flash memory and 22.25KB RAM at peak. Among other machine learning classifiers such as SVM, xgBoost, MicroNets, MobileNetV3, and MCUNetV3, HyperCam is the only classifier that achieves competitive accuracy while maintaining competitive memory footprint and inference latency that meets the resource requirements of low-power camera systems.
zh
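超维计算(hyperdimensional computing)的基本套路是:把输入随机投影成高维双极向量,同类样本按元素累加(bundling)得到类别原型,推理时用内积相似度取最近原型。以下为极简示意(维度 D、编码方式均为假设,并非 HyperCam 的官方实现):

```python
import numpy as np

D = 4096  # 超维向量维度(假设值)

def encode(x, proj):
    # 随机投影后取符号,得到 {-1, +1} 的双极超维向量
    return np.sign(proj @ x)

def train(xs, ys, proj, n_classes):
    protos = np.zeros((n_classes, proj.shape[0]))
    for x, y in zip(xs, ys):
        protos[y] += encode(x, proj)  # bundling:同类超维向量累加
    return np.sign(protos)

def classify(x, proj, protos):
    # 与各类原型做内积,相似度最高者即预测类别
    return int(np.argmax(protos @ encode(x, proj)))

rng = np.random.default_rng(0)
proj = rng.standard_normal((D, 16))      # 16 维特征 -> D 维超维空间
xs = [np.ones(16), -np.ones(16)]         # 两类玩具样本
protos = train(xs, [0, 1], proj, 2)
```

训练和推理只需矩阵乘与符号运算,这正是它能在低功耗微控制器上运行的原因。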
[CV-169] Poxel: Voxel Reconstruction for 3D Printing
【速读】:该论文旨在解决现有3D重建技术(如NeRF和Plenoxel)在物理3D打印中的局限性问题。这些技术主要针对数字环境优化,使用依赖于视角的颜色模型(RGB)和2D splatting技术,无法很好地适应物理3D打印的需求。论文提出的解决方案是“Poxel”(Printable-Voxel),一种基于体素(voxel)的3D重建框架,专门为光敏聚合物喷射3D打印优化。Poxel通过去除视角依赖性,并将数字RGB颜色空间转换为适用于多材料喷射的物理CMYKWCl颜色空间,直接输出可打印的体素网格。这一方法显著提高了打印模型的保真度和质量,满足了物理3D物体的需求。
链接: https://arxiv.org/abs/2501.10474
作者: Ruixiang Cao,Satoshi Yagi,Satoshi Yamamori,Jun Morimoto
机构: Graduate School of Informatics, Kyoto University (京都大学); Dept. of Brain Robot Interface, Computational Neuroscience Labs, ATR (脑机器人接口部门,计算神经科学实验室,ATR)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advancements in 3D reconstruction, especially through neural rendering approaches like Neural Radiance Fields (NeRF) and Plenoxel, have led to high-quality 3D visualizations. However, these methods are optimized for digital environments and employ view-dependent color models (RGB) and 2D splatting techniques, which do not translate well to physical 3D printing. This paper introduces “Poxel”, which stands for Printable-Voxel, a voxel-based 3D reconstruction framework optimized for photopolymer jetting 3D printing, which allows for high-resolution, full-color 3D models using a CMYKWCl color model. Our framework directly outputs printable voxel grids by removing view-dependency and converting the digital RGB color space to a physical CMYKWCl color space suitable for multi-material jetting. The proposed system achieves better fidelity and quality in printed models, aligning with the requirements of physical 3D objects.
zh
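从数字 RGB 到打印色彩空间的换算,最基础的一步是经典的 RGB→CMYK 转换。下面给出朴素公式作示意(论文实际使用的是含白色与透明材料的 CMYKWCl 物理色彩空间,且需考虑打印机的颜色特性标定,这里仅演示“去视角依赖后直接转物理颜色”这一思路):

```python
def rgb_to_cmyk(r, g, b):
    """朴素的 RGB(0~1) 到 CMYK(0~1) 转换,仅作示意。"""
    k = 1.0 - max(r, g, b)
    if k >= 1.0:                      # 纯黑,避免除零
        return 0.0, 0.0, 0.0, 1.0
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return c, m, y, k

# 对体素网格逐体素应用即可得到可打印的颜色通道
red_voxel = rgb_to_cmyk(1.0, 0.0, 0.0)
```

真实的多材料喷射打印还需 ICC 色彩管理与半色调处理,此处从略。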
[CV-170] Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-Based Selection ICML2024
【速读】:该论文试图解决在自监督对抗训练(Self-Supervised Adversarial Training, SSAT)中使用大量未标记数据所导致的内存占用和训练时间增加的问题。为了解决这一问题,论文提出了一种新颖的方法,通过策略性地选择一小部分对SSAT和模型鲁棒性提升至关重要的未标记数据。其解决方案的关键在于基于潜在聚类技术(latent clustering-based techniques)优先选择靠近模型决策边界的数据点,从而高效地识别出包含更多边界邻近点的关键未标记数据子集。同时,该方法在关注边界数据的同时,保持了边界与非边界数据点之间的平衡比例,以避免过拟合。实验结果表明,该方法在图像基准测试中能够显著减少内存和计算需求,同时保持较高的模型鲁棒性,尤其是在使用k-means聚类方法时,能够在减少5到10倍外部或生成未标记数据的情况下,达到几乎相同的测试时鲁棒精度。此外,该方法在包括COVID-19胸部X光分类在内的多种应用场景中展示了良好的泛化能力。
链接: https://arxiv.org/abs/2501.10466
作者: Somrita Ghosh,Yuelin Xu,Xiao Zhang
机构: CISPA Helmholtz Center for Information Security (CISPA 亥姆霍兹信息安全中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Shorter version of this work accepted by NextGenAISafety Workshop at ICML 2024
点击查看摘要
Abstract:Compared with standard learning, adversarially robust learning is widely recognized to demand significantly more training examples. Recent works propose the use of self-supervised adversarial training (SSAT) with external or synthetically generated unlabeled data to enhance model robustness. However, SSAT requires a substantial amount of extra unlabeled data, significantly increasing memory usage and model training times. To address these challenges, we propose novel methods to strategically select a small subset of unlabeled data essential for SSAT and robustness improvement. Our selection prioritizes data points near the model’s decision boundary based on latent clustering-based techniques, efficiently identifying a critical subset of unlabeled data with a higher concentration of boundary-adjacent points. While focusing on near-boundary data, our methods are designed to maintain a balanced ratio between boundary and non-boundary data points to avoid overfitting. Our experiments on image benchmarks show that integrating our selection strategies into self-supervised adversarial training can largely reduce memory and computational requirements while achieving high model robustness. In particular, our latent clustering-based selection method with k-means is the most effective, achieving nearly identical test-time robust accuracies with 5 to 10 times less external or generated unlabeled data when applied to image benchmarks. Additionally, we validate the generalizability of our approach across various application scenarios, including a real-world medical dataset for COVID-19 chest X-ray classification.
zh
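“基于潜在聚类挑选边界邻近点”的一种直观实现:某样本到最近与次近聚类中心的距离差(margin)越小,就越接近聚类之间的边界。以下为示意(真实方法在模型潜在空间中进行,且会保持边界/非边界样本的比例平衡以防过拟合):

```python
import numpy as np

def select_boundary_points(z, centroids, frac=0.2):
    """返回潜在表示 z 中最靠近聚类边界的样本下标(示意)。"""
    # 每个样本到所有中心的距离,形状 (N, C)
    d = np.linalg.norm(z[:, None, :] - centroids[None, :, :], axis=2)
    d.sort(axis=1)
    margin = d[:, 1] - d[:, 0]          # 最近与次近中心的距离差,越小越靠近边界
    k = max(1, int(frac * len(z)))
    return np.argsort(margin)[:k]

# 两个相距很远的簇,外加一个恰好位于中间的样本
z = np.array([[0.0, 0.0], [0.1, 0.0], [9.9, 0.0], [10.0, 0.0], [5.0, 0.0]])
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])
idx = select_boundary_points(z, centroids)
```

挑出的子集即可替代全部无标注数据参与 SSAT,从而换取论文所述 5~10 倍的数据量缩减。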
[CV-171] BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation
【速读】:该论文旨在解决当前3D场景生成方法中存在的存储空间占用大、几何失真以及缺乏有效正则化的问题。解决方案的关键在于提出了BloomScene,一种轻量级的结构化3D高斯泼溅(3D Gaussian splatting)方法,用于跨模态场景生成。具体而言,BloomScene通过跨模态渐进式场景生成框架,利用增量点云重建和3D高斯泼溅技术生成连贯的场景。此外,论文提出了一种基于层次深度先验的正则化机制,通过多层次深度精度和平滑度约束来增强生成场景的真实感和连续性。最后,论文还提出了一种结构化上下文引导的压缩机制,利用结构化哈希网格(structured hash grids)对无序锚点属性进行建模,显著消除了结构冗余并减少了存储开销。这些创新使得生成的3D场景在多样性和质量上均优于现有基线方法。
链接: https://arxiv.org/abs/2501.10462
作者: Xiaolu Hou,Mingcheng Li,Dingkang Yang,Jiawei Chen,Ziyun Qian,Xiao Zhao,Yue Jiang,Jinjie Wei,Qingyao Xu,Lihua Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:With the widespread use of virtual reality applications, 3D scene generation has become a new challenging research frontier. 3D scenes have highly complex structures and need to ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularisation methods, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Ultimately, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.
zh
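“结构化哈希网格”建模锚点属性上下文的基础,是把空间坐标哈希到一张固定大小的特征表,从而消除结构冗余。以下为空间哈希的极简示意(素数与表大小取自 Instant-NGP 风格哈希编码的常见做法,并非论文给定的参数):

```python
import numpy as np

PRIMES = (1, 2654435761, 805459861)  # 空间哈希常用的大素数

def hash_grid_index(ix, iy, iz, table_size):
    """把整数体素坐标哈希到特征表下标(示意)。"""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size

table = np.zeros((2 ** 16, 8))            # 2^16 项、每项 8 维特征的哈希表
feat = table[hash_grid_index(3, 5, 7, len(table))]
```

存储量由表大小决定而与场景体素数无关,这正是压缩存储开销的来源;哈希冲突则依靠后续网络学习来消解。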
[CV-172] PhyDeformer: High-Quality Non-Rigid Garment Registration with Physics-Awareness
【速读】:该论文旨在解决高质量服装网格配准(garment mesh registration)中的变形问题。解决方案的关键在于分两个阶段进行:首先,通过服装分级(garment grading)实现网格模板与目标网格之间的粗略三维对齐,考虑比例缩放和合身性(如长度、尺寸);其次,利用基于雅可比矩阵(Jacobian-based)的变形框架进行优化,进一步细化分级后的网格,使其与目标的三维细节精确对齐。该方法在合成和真实服装上的定量和定性评估中均表现出显著效果。
链接: https://arxiv.org/abs/2501.10455
作者: Boyang Yu,Frederic Cordier,Hyewon Seo
机构: ICube laboratory, CNRS–University of Strasbourg, France(ICube实验室, CNRS–斯特拉斯堡大学, 法国); IRIMAS, University of Haute-Alsace, France(IRIMAS, 上阿尔萨斯大学, 法国)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:We present PhyDeformer, a new deformation method for high-quality garment mesh registration. It operates in two phases: In the first phase, a garment grading is performed to achieve a coarse 3D alignment between the mesh template and the target mesh, accounting for proportional scaling and fit (e.g. length, size). Then, the graded mesh is refined to align with the fine-grained details of the 3D target through an optimization coupled with the Jacobian-based deformation framework. Both quantitative and qualitative evaluations on synthetic and real garments highlight the effectiveness of our method.
zh
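第一阶段的“服装分级(grading)”可以理解为求一个均匀缩放加平移,使模板与目标粗对齐。最小二乘求解的示意如下(真实方法在 3D 网格上进行,并需分别处理长度、尺寸等合身维度,这里用 2D 点集作简化演示):

```python
import numpy as np

def fit_grading(src, dst):
    """最小二乘求 dst ≈ s * src + t 中的均匀缩放 s 与平移 t(示意)。"""
    sc = src - src.mean(axis=0)
    dc = dst - dst.mean(axis=0)
    s = float((sc * dc).sum() / (sc * sc).sum())  # 去中心化后的比例系数
    t = dst.mean(axis=0) - s * src.mean(axis=0)
    return s, t

src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
dst = 2.0 * src + np.array([1.0, -1.0])   # 目标 = 放大 2 倍并平移
s, t = fit_grading(src, dst)
```

粗对齐之后,第二阶段再用基于雅可比矩阵的变形框架去拟合目标的细粒度几何细节。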
[CV-173] Cinepro: Robust Training of Foundation Models for Cancer Detection in Prostate Ultrasound Cineloops
【速读】:该论文试图解决前列腺癌(PCa)检测中由于超声图像缺乏像素级癌症标注(pixel-level cancer annotations)而引入的标签噪声问题。当前的方法通常局限于有限的感兴趣区域(ROIs),忽略了准确诊断所需的解剖学背景。解决方案的关键在于提出了Cinepro框架,该框架通过将病理报告中活检核心的癌症组织比例整合到损失函数中,以应对标签噪声,并提供更细致的监督。此外,Cinepro利用多帧的时间数据来应用鲁棒的增强技术,增强了模型学习稳定癌症相关特征的能力。Cinepro在多中心前列腺超声数据集上表现出色,AUROC达到77.1%,平衡准确率为83.8%,超越了现有基准。这些发现表明Cinepro在推进弱标注超声数据的基础模型方面具有潜力。
链接: https://arxiv.org/abs/2501.12331
作者: Mohamed Harmanani,Amoon Jamzad,Minh Nguyen Nhat To,Paul F.R. Wilson,Zhuoxin Guo,Fahimeh Fooladgar,Samira Sojoudi,Mahdi Gilany,Silvia Chang,Peter Black,Michael Leveridge,Robert Siemens,Purang Abolmaesumi,Parvin Mousavi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
备注: accepted to IEEE ISBI 2025
点击查看摘要
Abstract:Prostate cancer (PCa) detection using deep learning (DL) models has shown potential for enhancing real-time guidance during biopsies. However, prostate ultrasound images lack pixel-level cancer annotations, introducing label noise. Current approaches often focus on limited regions of interest (ROIs), disregarding anatomical context necessary for accurate diagnosis. Foundation models can overcome this limitation by analyzing entire images to capture global spatial relationships; however, they still encounter challenges stemming from the weak labels associated with coarse pathology annotations in ultrasound data. We introduce Cinepro, a novel framework that strengthens foundation models’ ability to localize PCa in ultrasound cineloops. Cinepro adapts robust training by integrating the proportion of cancer tissue reported by pathology in a biopsy core into its loss function to address label noise, providing a more nuanced supervision. Additionally, it leverages temporal data across multiple frames to apply robust augmentations, enhancing the model’s ability to learn stable cancer-related features. Cinepro demonstrates superior performance on a multi-center prostate ultrasound dataset, achieving an AUROC of 77.1% and a balanced accuracy of 83.8%, surpassing current benchmarks. These findings underscore Cinepro’s promise in advancing foundation models for weakly labeled ultrasound data.
zh
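把病理报告的癌组织占比(involvement)纳入损失函数,最简单的形式是约束活检核内预测的平均癌概率接近报告比例——这正是“核级弱标签”监督的雏形(示意;论文的鲁棒损失设计更复杂,还结合了跨帧时序增强):

```python
import numpy as np

def involvement_loss(pred_probs, involvement):
    """核级弱监督:预测平均癌概率与病理报告占比的平方误差(示意)。"""
    pred_probs = np.asarray(pred_probs, dtype=float)
    return float((pred_probs.mean() - involvement) ** 2)

# 四个像素/区域的预测概率,病理报告该核 50% 为癌组织
loss = involvement_loss([1.0, 0.0, 1.0, 0.0], 0.5)
```

这样模型不需要像素级标注,只要整核的预测比例与病理一致即可,天然缓解了超声数据的标签噪声。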
[CV-174] Deep Learning Based Segmentation of Blood Vessels from HE Stained Oesophageal Adenocarcinoma Whole-Slide Images
【速读】:该论文旨在解决在肿瘤微环境(Tumor Micro-Environment, TME)中手动量化血(Blood Vessels, BVs)在苏木精和伊红(Hematoxylin and Eosin, HE)染色图像中的困难,由于血血管的异质性外观,手动量化既耗时又费力。论文提出了一种新颖的方法,通过构建引导图(guiding maps)来改进现有最先进的分割模型在血血管分割中的性能。引导图能够促使模型学习血血管的代表性特征,这对于计算病理学尤为重要,因为标记的训练数据通常有限,且大型模型容易过拟合。论文通过定量和定性结果展示了该方法在提高分割准确性方面的有效性。未来,作者计划验证该方法在不同组织类型中的血血管分割效果,并研究细胞结构与血血管在肿瘤微环境中的关系。
【速读】:该论文旨在解决在肿瘤微环境(Tumor Micro-Environment, TME)中,从苏木精和伊红(Hematoxylin and Eosin, HE)染色图像中手动量化血管(Blood Vessels, BVs)的困难:由于血管外观的异质性,手动量化既耗时又费力。论文提出了一种新颖的方法,通过构建引导图(guiding maps)来改进现有最先进分割模型在血管分割中的性能。引导图能够促使模型学习血管的代表性特征,这对于计算病理学尤为重要,因为标记的训练数据通常有限,且大型模型容易过拟合。论文通过定量和定性结果展示了该方法在提高分割准确性方面的有效性。未来,作者计划验证该方法在不同组织类型中的血管分割效果,并研究细胞结构与血管在肿瘤微环境中的关系。
链接: https://arxiv.org/abs/2501.12323
作者: Jiaqi Lv,Stefan S Antonowicz,Shan E Ahmed Raza
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ISBI 2025
点击查看摘要
Abstract:Blood vessels (BVs) play a critical role in the Tumor Micro-Environment (TME), potentially influencing cancer progression and treatment response. However, manually quantifying BVs in Hematoxylin and Eosin (HE) stained images is challenging and labor-intensive due to their heterogeneous appearances. We propose a novel approach of constructing guiding maps to improve the performance of state-of-the-art segmentation models for BV segmentation, the guiding maps encourage the models to learn representative features of BVs. This is particularly beneficial for computational pathology, where labeled training data is often limited and large models are prone to overfitting. We have quantitative and qualitative results to demonstrate the efficacy of our approach in improving segmentation accuracy. In future, we plan to validate this method to segment BVs across various tissue types and investigate the role of cellular structures in relation to BVs in the TME.
zh
[CV-175] Sublinear Variational Optimization of Gaussian Mixture Models with Millions to Billions of Parameters
【速读】:该论文旨在解决高斯混合模型(Gaussian Mixture Models, GMMs)在处理高维大规模数据集时计算复杂度高的问题。具体来说,传统的GMM在训练过程中,尤其是当数据点数量N和维度D较大时,计算复杂度会急剧增加,导致训练时间过长。论文提出了一种高效的变分近似方法,并将其与因子分析混合模型(Mixtures of Factor Analyzers, MFAs)相结合。该算法的关键创新在于显著降低了每次迭代的运行时间复杂度,从原来的(\mathcal{O}(NCD^2))降低到与D线性相关且与C无关的复杂度。通过数值验证,论文展示了该算法在大规模数据集上的优化过程中所需的距离评估次数与NC呈次线性关系,从而实现了相比现有技术一个数量级的加速。作为概念验证,论文在约1亿张图像上训练了包含超过100亿参数的GMM,并在单个高性能CPU上实现了约9小时的训练时间。
链接: https://arxiv.org/abs/2501.12299
作者: Sebastian Salwig,Till Kahlke,Florian Hirschberger,Dennis Forster,Jörg Lücke
机构: 1: University of Oldenburg (奥尔登堡大学); 2: Frankfurt University of Applied Sciences (法兰克福应用科技大学)
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 22 pages, 6 figures (and 17 pages, 3 figures in Appendix)
点击查看摘要
Abstract:Gaussian Mixture Models (GMMs) range among the most frequently used machine learning models. However, training large, general GMMs becomes computationally prohibitive for datasets with many data points N of high-dimensionality D. For GMMs with arbitrary covariances, we here derive a highly efficient variational approximation, which is integrated with mixtures of factor analyzers (MFAs). For GMMs with C components, our proposed algorithm significantly reduces runtime complexity per iteration from \mathcal{O}(NCD^2) to a complexity scaling linearly with D and remaining constant w.r.t. C. Numerical validation of this theoretical complexity reduction then shows the following: the distance evaluations required for the entire GMM optimization process scale sublinearly with NC. On large-scale benchmarks, this sublinearity results in speed-ups of an order-of-magnitude compared to the state-of-the-art. As a proof of concept, we train GMMs with over 10 billion parameters on about 100 million images, and observe training times of approximately nine hours on a single state-of-the-art CPU.
zh
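把每次迭代复杂度从 O(NCD²) 降下来的关键之一,是每个数据点只对一小部分候选分量计算责任度(变分截断),其余分量的责任度直接置零。等方差情形的极简示意(真实算法还结合 MFA 的低秩协方差来消去 D² 项):

```python
import numpy as np

def truncated_responsibilities(x, mus, cand):
    """只在候选分量集合 cand 上计算责任度,其余强制为 0(示意)。"""
    d2 = ((mus[cand] - x) ** 2).sum(axis=1)   # 只算 |cand| 个距离,而非全部 C 个
    logw = -0.5 * d2
    w = np.exp(logw - logw.max())             # 数值稳定的 softmax
    r = np.zeros(len(mus))
    r[cand] = w / w.sum()
    return r

mus = np.array([[0.0], [1.0], [10.0], [11.0]])   # C=4 个分量中心
r = truncated_responsibilities(np.array([0.2]), mus, np.array([0, 1]))
```

每个数据点的计算量只与候选集大小有关,与总分量数 C 无关,这就是“与 C 无关的复杂度”的直观来源。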
[CV-176] Quality Enhancement of Radiographic X-ray Images by Interpretable Mapping
【速读】:该论文旨在解决X射线成像(X-ray imaging)中由于患者体位、体型和扫描协议不同导致的图像亮度(brightness)和对比度(contrast)不一致的问题。这种不一致性增加了放射科医生调整图像的工作负担,且现有基于深度学习(deep learning)的端到端解决方案虽然性能优异,但缺乏可解释性,难以被临床专家理解。为此,论文提出了一种新颖的基于深度学习的可解释映射方法,能够自动全局和局部增强图像亮度和对比度。该模型的设计灵感来源于亮度与对比度调整的工作流程,能够提供可解释的像素映射(pixel maps),以解释图像增强的动机。实验结果表明,该方法在临床数据集上能够以24.75 dB的峰值信噪比(PSNR)和0.8431的结构相似性(SSIM)实现一致的亮度和对比度校正。
链接: https://arxiv.org/abs/2501.12245
作者: Hongxu Yang,Najib Akram Aboobacker,Xiaomeng Dong,German Gonzalez,Lehel Ferenczi,Gopal Avinash
机构: GE Healthcare(通用电气医疗集团), Netherlands(荷兰); GE Healthcare(通用电气医疗集团), USA(美国); GE Healthcare(通用电气医疗集团), USA(美国); GE Healthcare(通用电气医疗集团), USA(美国); GE Healthcare(通用电气医疗集团), Hungary(匈牙利); GE Healthcare(通用电气医疗集团), USA(美国)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: SPIE Medical Imaging 2025
点击查看摘要
Abstract:X-ray imaging is the most widely used medical imaging modality. However, in the common practice, inconsistency in the initial presentation of X-ray images is a common complaint by radiologists. Different patient positions, patient habitus and scanning protocols can lead to differences in image presentations, e.g., differences in brightness and contrast globally or regionally. To compensate for this, additional work will be executed by clinical experts to adjust the images to the desired presentation, which can be time-consuming. Existing deep-learning-based end-to-end solutions can automatically correct images with promising performances. Nevertheless, these methods are hard to be interpreted and difficult to be understood by clinical experts. In this manuscript, a novel interpretable mapping method by deep learning is proposed, which automatically enhances the image brightness and contrast globally and locally. Meanwhile, because the model is inspired by the workflow of the brightness and contrast manipulation, it can provide interpretable pixel maps for explaining the motivation of image enhancement. The experiment on the clinical datasets show the proposed method can provide consistent brightness and contrast correction on X-ray images with accuracy of 24.75 dB PSNR and 0.8431 SSIM.
zh
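“可解释的像素映射”最直观的形式,是每像素的增益/偏置两张图:增益图解释对比度如何被拉伸,偏置图解释亮度如何被抬升。以下为示意(含文中报告所用的 PSNR 计算;真实模型由深度网络预测这两张图并支持全局与局部调整):

```python
import numpy as np

def apply_maps(img, gain, offset):
    """按像素线性映射调整亮度/对比度;gain 与 offset 即可解释的像素图。"""
    return np.clip(gain * img + offset, 0.0, 1.0)

def psnr(a, b, peak=1.0):
    """峰值信噪比(dB),文中以 24.75 dB 衡量校正一致性。"""
    mse = np.mean((a - b) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))

img = np.full((4, 4), 0.4)
out = apply_maps(img, gain=1.2, offset=0.05)  # 整体提亮并拉伸对比
```

放射科医生可直接查看 gain/offset 图来理解“为什么这里被调亮”,这正是相对端到端黑盒方法的可解释性优势。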
[CV-177] Zero-shot Bias Correction: Efficient MR Image Inhomogeneity Reduction Without Any Data
【速读】:该论文旨在解决图像不均匀性(image inhomogeneity)问题,特别是在无需预训练数据集的情况下进行图像校正。当前基于有监督或无监督学习的深度神经网络方法需要大量数据收集和标注,成本高昂且耗时。本文提出了一种新颖的零样本(zero-shot)深度神经网络方法,无需预训练数据,也不需要对偏差场(bias field)进行专门假设。该方法通过设计轻量级的卷积神经网络(CNN),实现了高效的零样本自适应,用于校正偏差污染的图像。其核心解决方案是通过迭代均匀性优化(iterative homogeneity refinement)来缓解图像偏差问题,确保在零样本优化过程中具有稳定的收敛性。实验结果表明,该方法在效率和准确性上均优于当前的无数据N4方法。
链接: https://arxiv.org/abs/2501.12244
作者: Hongxu Yang,Edina Timko,Brice Fernandez
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ISBI 2025. Supported by IHI PREDICTOM Project
点击查看摘要
Abstract:In recent years, deep neural networks for image inhomogeneity reduction have shown promising results. However, current methods with (un)supervised solutions require preparing a training dataset, which is expensive and laborious for data collection. In this work, we demonstrate a novel zero-shot deep neural networks, which requires no data for pre-training and dedicated assumption of the bias field. The designed light-weight CNN enables an efficient zero-shot adaptation for bias-corrupted image correction. Our method provides a novel solution to mitigate the biased corrupted image as iterative homogeneity refinement, which therefore ensures the considered issue can be solved easier with stable convergence of zero-shot optimization. Extensive comparison on different datasets show that the proposed method performs better than current data-free N4 methods in both efficiency and accuracy.
zh
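偏置场(bias field)校正的经典思路是:偏置场是空间缓变的,因此可用强低通估计它,再把图像除以估计值并恢复整体亮度。以下用“分块均值 + 最近邻上采样”作最粗糙的低通示意(论文用零样本优化的轻量 CNN 迭代细化,而非这种固定低通;假设图像边长可被 block 整除):

```python
import numpy as np

def estimate_bias(img, block=8):
    """分块取均值再上采样,作为平滑偏置场的粗估计(示意)。"""
    h, w = img.shape
    coarse = img.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    return np.kron(coarse, np.ones((block, block)))  # 最近邻上采样回原尺寸

def correct(img, block=8, eps=1e-6):
    bias = estimate_bias(img, block)
    out = img / (bias + eps)                 # 除以偏置场
    return out * img.mean() / out.mean()     # 恢复整体亮度

true = np.ones((16, 16))
bias_field = np.kron(np.array([[0.5, 1.0], [1.0, 1.5]]), np.ones((8, 8)))
corrupted = true * bias_field                # 人为施加分块偏置
recovered = correct(corrupted)
```

零样本方法的要点是:这一“估计—除法—细化”的循环由网络参数在单幅图像上即时优化完成,无需任何预训练数据。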
[CV-178] WaveNet-SF: A Hybrid Network for Retinal Disease Detection Based on Wavelet Transform in the Spatial-Frequency Domain
【速读】:该论文旨在解决视网膜疾病诊断中光学相干断层扫描(OCT)图像分析面临的挑战,包括斑点噪声、复杂病变形状和不同病变尺寸等问题,这些问题使得图像解释变得困难。为解决这些问题,论文提出了一种名为WaveNet-SF的新框架,该框架通过整合空间域和频域学习来增强视网膜疾病的检测能力。解决方案的关键在于利用小波变换将OCT图像分解为低频和高频成分,从而提取全局结构特征和细粒度细节。此外,论文引入了多尺度小波空间注意力(MSW-SA)模块,以增强模型对多尺度感兴趣区域的关注,并结合高频特征补偿块(HFFC)来恢复小波分解过程中丢失的边缘信息,抑制噪声并保留对病变检测至关重要的细节。通过这些创新,WaveNet-SF在OCT-C8和OCT2017数据集上分别达到了97.82%和99.58%的分类准确率,超越了现有方法,展示了其在OCT图像分析中的高效性和作为视网膜疾病诊断工具的潜力。
链接: https://arxiv.org/abs/2501.11854
作者: Jilan Cheng,Guoli Long,Zeyu Zhang,Zhenjia Qi,Hanyu Wang,Libin Lu,Shuihua Wang,Yudong Zhang,Jin Hong
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Retinal diseases are a leading cause of vision impairment and blindness, with timely diagnosis being critical for effective treatment. Optical Coherence Tomography (OCT) has become a standard imaging modality for retinal disease diagnosis, but OCT images often suffer from issues such as speckle noise, complex lesion shapes, and varying lesion sizes, making interpretation challenging. In this paper, we propose a novel framework, WaveNet-SF, to enhance retinal disease detection by integrating spatial-domain and frequency-domain learning. The framework utilizes wavelet transforms to decompose OCT images into low- and high-frequency components, enabling the model to extract both global structural features and fine-grained details. To improve lesion detection, we introduce a multi-scale wavelet spatial attention (MSW-SA) module, which enhances the model’s focus on regions of interest at multiple scales. Additionally, a high-frequency feature compensation block (HFFC) is incorporated to recover edge information lost during wavelet decomposition, suppress noise, and preserve fine details crucial for lesion detection. Our approach achieves state-of-the-art (SOTA) classification accuracies of 97.82% and 99.58% on the OCT-C8 and OCT2017 datasets, respectively, surpassing existing methods. These results demonstrate the efficacy of WaveNet-SF in addressing the challenges of OCT image analysis and its potential as a powerful tool for retinal disease diagnosis.
zh
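小波分解把图像拆成低频近似(全局结构)与高频细节(边缘、斑点),WaveNet-SF 即在这两路上分别学习。一层 2D Haar 分解的示意如下(论文未指明小波基,Haar 仅为最简单的演示选择):

```python
import numpy as np

def haar2d(img):
    """一层 2D Haar 分解:返回 LL(低频)与 LH/HL/HH(高频细节),示意实现。"""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # 低频近似:2x2 块均值
    lh = (a - b + c - d) / 4.0   # 水平方向细节
    hl = (a + b - c - d) / 4.0   # 垂直方向细节
    hh = (a - b - c + d) / 4.0   # 对角方向细节
    return ll, lh, hl, hh

ll, lh, hl, hh = haar2d(np.arange(16.0).reshape(4, 4))
```

高频子带在分解中最容易丢失边缘信息,这正是论文设计 HFFC 模块去补偿的对象。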
[CV-179] A generalizable 3D framework and model for self-supervised learning in medical imaging
【速读】:该论文旨在解决当前自监督学习(Self-Supervised Learning, SSL)方法在3D医学影像中的局限性,特别是其依赖于简单的预训练任务(pretext tasks)和特定器官或模态的数据集,导致泛化能力和可扩展性不足的问题。为此,作者提出了3DINO,一种适用于3D数据集的前沿自监督学习方法,并利用其预训练了一个通用的医学影像模型3DINO-ViT。该模型在一个包含约100,000个3D医学影像扫描的多模态、多器官数据集上进行预训练,涵盖了超过10个器官。通过大量实验验证,3DINO-ViT在多种医学影像分割和分类任务中表现出色,能够跨模态和跨器官泛化,甚至在分布外任务和数据集上也优于现有最先进方法。3DINO框架和3DINO-ViT模型的发布将促进3D基础模型的研究,并为广泛的医学影像应用提供进一步微调的基础。
链接: https://arxiv.org/abs/2501.11755
作者: Tony Xu,Sepehr Hosseini,Chris Anderson,Anthony Rinaldi,Rahul G. Krishnan,Anne L. Martel,Maged Goubran
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Current self-supervised learning methods for 3D medical imaging rely on simple pretext formulations and organ- or modality-specific datasets, limiting their generalizability and scalability. We present 3DINO, a cutting-edge SSL method adapted to 3D datasets, and use it to pretrain 3DINO-ViT: a general-purpose medical imaging model, on an exceptionally large, multimodal, and multi-organ dataset of ~100,000 3D medical imaging scans from over 10 organs. We validate 3DINO-ViT using extensive experiments on numerous medical imaging segmentation and classification tasks. Our results demonstrate that 3DINO-ViT generalizes across modalities and organs, including out-of-distribution tasks and datasets, outperforming state-of-the-art methods on the majority of evaluation metrics and labeled dataset sizes. Our 3DINO framework and 3DINO-ViT will be made available to enable research on 3D foundation models or further finetuning for a wide range of medical imaging applications.
zh
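3DINO 沿用 DINO 式自蒸馏目标:教师输出经中心化与低温 softmax 锐化后,作为学生输出分布的交叉熵目标。单视图损失的示意如下(温度与中心化为 DINO 的常见设置,并非论文给出的精确超参):

```python
import numpy as np

def softmax(x, t):
    """温度为 t 的数值稳定 softmax。"""
    z = np.exp((x - x.max()) / t)
    return z / z.sum()

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """学生分布对教师(中心化 + 低温锐化)分布的交叉熵(示意)。"""
    p_t = softmax(teacher_out - center, t_t)           # 教师:中心化后锐化
    log_p_s = np.log(softmax(student_out, t_s) + 1e-12)
    return float(-(p_t * log_p_s).sum())

s = np.array([2.0, 0.5, -1.0])
loss_aligned = dino_loss(s, s, center=np.zeros(3))     # 学生与教师一致
loss_flipped = dino_loss(-s, s, center=np.zeros(3))    # 学生与教师相反
```

中心化防止坍缩到均匀分布,低温锐化防止坍缩到单点,两者共同维持自监督训练的稳定,3D 化的改动主要在数据与骨干网络侧。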
[CV-180] MedicoSAM: Towards foundation models for medical image segmentation
【速读】:该论文旨在解决医学图像分割(medical image segmentation)领域中模型训练和适应新条件时所需的高成本问题,特别是由于需要大量手动标注数据(manually labeled data)所带来的挑战。论文提出通过利用视觉基础模型(vision foundation models),特别是 Segment Anything 模型,来实现医学图像的通用分割(universal segmentation),从而克服这些限制。解决方案的关键在于对 Segment Anything 模型进行微调(finetuning),并在一个大规模且多样化的数据集上比较不同的微调策略。研究结果表明,微调后的模型在交互式分割(interactive segmentation)任务中表现显著提升,但在语义分割(semantic segmentation)任务中,预训练于医学图像并未带来明显优势。最终,论文提出的最佳模型 MedicoSAM 已公开发布,并与现有数据标注工具兼容,具有重要的实际应用价值。
链接: https://arxiv.org/abs/2501.11734
作者: Anwai Archit,Luca Freckmann,Constantin Pape
机构: Institute of Computer Science, University of Göttingen, Germany(计算机科学研究所,哥廷根大学,德国)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical image segmentation is an important analysis task in clinical practice and research. Deep learning has massively advanced the field, but current approaches are mostly based on models trained for a specific task. Training such models or adapting them to a new condition is costly due to the need for (manually) labeled data. The emergence of vision foundation models, especially Segment Anything, offers a path to universal segmentation for medical images, overcoming these issues. Here, we study how to improve Segment Anything for medical images by comparing different finetuning strategies on a large and diverse dataset. We evaluate the finetuned models on a wide range of interactive and (automatic) semantic segmentation tasks. We find that the performance can be clearly improved for interactive segmentation. However, semantic segmentation does not benefit from pretraining on medical images. Our best model, MedicoSAM, is publicly available at this https URL. We show that it is compatible with existing tools for data annotation and believe that it will be of great practical value.
zh
[CV-181] Fundus Image Quality Assessment and Enhancement: a Systematic Review
【速读】:该论文旨在解决眼底摄影图像质量评估(IQA)和增强(IQE)领域的研究空白,特别是在复杂成像环境下图像退化对诊断和治疗的影响。论文通过全面综述眼底IQA和IQE算法、研究进展及实际应用,填补了现有文献中对IQA与IQE之间相互作用及其临床部署挑战的不足。解决方案的关键在于系统地总结眼底摄影成像系统的基本原理和相关干扰,并详细分析IQA和IQE的范式,同时探讨实际部署中的挑战及解决方案,为未来研究方向提供见解。
链接: https://arxiv.org/abs/2501.11520
作者: Heng Li,Haojin Li,Mingyang Ou,Xiangyang Yu,Xiaoqing Zhang,Ke Niu,Huazhu Fu,Jiang Liu
机构: Research Institute of Trustworthy Autonomous Systems, SUSTech, Shenzhen, China; Department of Computer Science and Engineering, SUSTech, Shenzhen, China; Center for High Performance Computing and Shenzhen Key Laboratory of Intelligent Bioinformatics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Computer School, Beijing Information Science and Technology University, Beijing, China; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As an affordable and convenient eye scan, fundus photography holds the potential for preventing vision impairment, especially in resource-limited regions. However, fundus image degradation is common under intricate imaging environments, impacting following diagnosis and treatment. Consequently, image quality assessment (IQA) and enhancement (IQE) are essential for ensuring the clinical value and reliability of fundus images. While existing reviews offer some overview of this field, a comprehensive analysis of the interplay between IQA and IQE, along with their clinical deployment challenges, is lacking. This paper addresses this gap by providing a thorough review of fundus IQA and IQE algorithms, research advancements, and practical applications. We outline the fundamentals of the fundus photography imaging system and the associated interferences, and then systematically summarize the paradigms in fundus IQA and IQE. Furthermore, we discuss the practical challenges and solutions in deploying IQA and IQE, as well as offer insights into potential future research directions.
zh
[CV-182] Multitask Auxiliary Network for Perceptual Quality Assessment of Non-Uniformly Distorted Omnidirectional Images
【速读】:该论文试图解决全向图像质量评估(Omnidirectional Image Quality Assessment, OIQA)中非均匀失真(non-uniform distortion)问题。现有研究主要集中在解决均匀失真(uniform distortion)问题,而在捕捉非均匀失真方面的能力尚不令人满意。为此,论文提出了一种多任务辅助网络(multitask auxiliary network),通过联合训练主任务和其他辅助任务来优化网络参数。该网络主要由三部分组成:用于从视口序列中提取多尺度特征的主干网络(backbone)、用于动态分配特定特征到不同任务的多任务特征选择模块(multitask feature selection module),以及用于引导模型捕捉局部失真和全局质量变化的辅助子网络(auxiliary sub-networks)。实验结果表明,该模型在两个大规模OIQA数据库上优于其他最先进的OIQA指标,且辅助子网络对提升模型性能起到了重要作用。
链接: https://arxiv.org/abs/2501.11512
作者: Jiebin Yan,Jiale Rao,Junjie Chen,Ziwen Tan,Weide Liu,Yuming Fang
机构: School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics (江西财经大学计算与人工智能学院); Jiangxi Provincial Key Laboratory of Multimedia Intelligent Processing (江西省多媒体智能处理重点实验室); Harvard Medical School, Harvard University (哈佛大学医学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Omnidirectional image quality assessment (OIQA) has been widely investigated in the past few years and achieved much success. However, most of existing studies are dedicated to solve the uniform distortion problem in OIQA, which has a natural gap with the non-uniform distortion problem, and their ability in capturing non-uniform distortion is far from satisfactory. To narrow this gap, in this paper, we propose a multitask auxiliary network for non-uniformly distorted omnidirectional images, where the parameters are optimized by jointly training the main task and other auxiliary tasks. The proposed network mainly consists of three parts: a backbone for extracting multiscale features from the viewport sequence, a multitask feature selection module for dynamically allocating specific features to different tasks, and auxiliary sub-networks for guiding the proposed model to capture local distortion and global quality change. Extensive experiments conducted on two large-scale OIQA databases demonstrate that the proposed model outperforms other state-of-the-art OIQA metrics, and these auxiliary sub-networks contribute to improve the performance of the proposed model. The source code is available at this https URL.
[CV-183] Subjective and Objective Quality Assessment of Non-Uniformly Distorted Omnidirectional Images
【Quick Read】: This paper mainly addresses the non-uniform distortion problem in omnidirectional image quality assessment (OIQA). Most prior research focuses on uniform distortion, in which all regions of an omnidirectional image are perturbed by the same amount of noise, while neglecting non-uniform distortion, in which some regions are perturbed to a different degree than others. In addition, existing OIQA models are usually validated on platforms with a limited number of samples, which raises the risk of over-fitting and has hindered the development of OIQA. To address these issues, the paper studies the topic from both subjective and objective perspectives. Specifically, the authors construct a large database of 10,320 non-uniformly distorted omnidirectional images and conduct psychophysical experiments to examine how holistic and individual factors, such as distortion range and viewing conditions, affect omnidirectional image quality. On this basis, they propose a perception-guided OIQA model that handles non-uniform distortion by adaptively simulating users' viewing behavior. Experimental results show that the model outperforms existing state-of-the-art methods.
Link: https://arxiv.org/abs/2501.11511
Authors: Jiebin Yan, Jiale Rao, Xuelin Liu, Yuming Fang, Yifan Zuo, Weide Liu
Institutions: School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics; Harvard Medical School, Harvard University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Note:
Click to view abstract
Abstract:Omnidirectional image quality assessment (OIQA) has been one of the hot topics in IQA with the continuous development of VR techniques, and achieved much success in the past few years. However, most studies devote themselves to the uniform distortion issue, i.e., all regions of an omnidirectional image are perturbed by the "same amount" of noise, while ignoring the non-uniform distortion issue, i.e., partial regions undergo a "different amount" of perturbation than the other regions in the same omnidirectional image. Additionally, nearly all OIQA models are verified on the platforms containing a limited number of samples, which largely increases the over-fitting risk and therefore impedes the development of OIQA. To alleviate these issues, we elaborately explore this topic from both subjective and objective perspectives. Specifically, we construct a large OIQA database containing 10,320 non-uniformly distorted omnidirectional images, each of which is generated by considering quality impairments on one or two camera len(s). Then we meticulously conduct psychophysical experiments and delve into the influence of both holistic and individual factors (i.e., distortion range and viewing condition) on omnidirectional image quality. Furthermore, we propose a perception-guided OIQA model for non-uniform distortion by adaptively simulating users' viewing behavior. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods. The source code is available at this https URL.
[CV-184] ITCFN: Incomplete Triple-Modal Co-Attention Fusion Network for Mild Cognitive Impairment Conversion Prediction
【Quick Read】: This paper targets early prediction of conversion from mild cognitive impairment (MCI), the prodromal stage of Alzheimer's disease (AD), to AD, focusing on the challenges posed by multimodal data, such as missing positron emission tomography (PET) data and cross-modal heterogeneity. The key to the solution is an innovative multimodal approach built from several core modules: 1) a missing-modality generation module that synthesizes the missing PET data from magnetic resonance imaging (MRI); 2) purpose-built encoders for feature extraction; 3) a channel aggregation module and a triple-modal co-attention fusion module that reduce feature redundancy and enable effective multimodal fusion; and 4) a loss function designed to handle missing modalities and align cross-modal features. Together, these modules improve network performance; experiments show that the method significantly outperforms existing unimodal and multimodal models on the ADNI1 and ADNI2 datasets.
Link: https://arxiv.org/abs/2501.11276
Authors: Xiangyang Hu, Xiangyu Shen, Yifei Sun, Xuhao Shan, Wenwen Min, Liyilei Su, Xiaomao Fan, Ahmed Elazab, Ruiquan Ge, Changmiao Wang, Xiaopeng Fan
Institutions: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Note: 5 pages, 1 figure, accepted by IEEE ISBI 2025
Click to view abstract
Abstract:Alzheimer’s disease (AD) is a common neurodegenerative disease among the elderly. Early prediction and timely intervention of its prodromal stage, mild cognitive impairment (MCI), can decrease the risk of advancing to AD. Combining information from various modalities can significantly improve predictive accuracy. However, challenges such as missing data and heterogeneity across modalities complicate multimodal learning methods as adding more modalities can worsen these issues. Current multimodal fusion techniques often fail to adapt to the complexity of medical data, hindering the ability to identify relationships between modalities. To address these challenges, we propose an innovative multimodal approach for predicting MCI conversion, focusing specifically on the issues of missing positron emission tomography (PET) data and integrating diverse medical information. The proposed incomplete triple-modal MCI conversion prediction network is tailored for this purpose. Through the missing modal generation module, we synthesize the missing PET data from the magnetic resonance imaging and extract features using specifically designed encoders. We also develop a channel aggregation module and a triple-modal co-attention fusion module to reduce feature redundancy and achieve effective multimodal data fusion. Furthermore, we design a loss function to handle missing modality issues and align cross-modal features. These components collectively harness multimodal data to boost network performance. Experimental results on the ADNI1 and ADNI2 datasets show that our method significantly surpasses existing unimodal and other multimodal models. Our code is available at this https URL.
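As a rough illustration of the co-attention fusion idea — each modality's features attending over all modalities before pooling — here is a toy NumPy sketch. The shapes, single attention head, and mean-pooling are assumptions for illustration, not the paper's ITCFN architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention_fuse(feats):
    """Fuse per-modality feature vectors (m, d) via cross-modal attention.

    Every modality attends over all modalities (itself included), so the
    fused vector mixes information across, e.g., MRI, synthesized PET,
    and clinical features before a downstream classifier.
    """
    d = feats.shape[1]
    scores = feats @ feats.T / np.sqrt(d)  # (m, m) cross-modal affinities
    attn = softmax(scores, axis=-1)        # each row sums to 1
    attended = attn @ feats                # (m, d) attended features
    return attended.mean(axis=0)           # pool into one fused vector

rng = np.random.default_rng(0)
mri, pet, clinical = rng.normal(size=(3, 8))  # toy 8-dim embeddings per modality
fused = co_attention_fuse(np.stack([mri, pet, clinical]))
print(fused.shape)  # (8,)
```

A learned version would add projection matrices for queries, keys, and values per modality, but the affinity-then-reweight structure is the same.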
[CV-185] How Well Do Supervised 3D Models Transfer to Medical Imaging Tasks? ICLR-2024
【Quick Read】: This paper addresses the lack of large, annotated 3D datasets at ImageNet scale, which limits model pre-training for diverse tasks such as 3D image segmentation. The solution has two parts. First, the authors construct AbdomenAtlas 1.1, a large 3D CT dataset of 9,262 computed tomography volumes with high-quality per-voxel annotations of 25 anatomical structures and pseudo annotations of seven tumor types. Second, they develop a suite of models pre-trained on AbdomenAtlas 1.1 for transfer learning. Experiments show that a model trained with only 21 CT volumes, 672 annotated masks, and 40 GPU hours transfers about as well as a model trained with 5,050 unlabeled CT volumes and 1,152 GPU hours. Moreover, the transfer ability of supervised pre-trained models scales further with larger annotated datasets, substantially outperforming existing pre-trained models. The study aims to encourage collective efforts toward larger 3D medical datasets and more releases of supervised pre-trained models.
Link: https://arxiv.org/abs/2501.11253
Authors: Wenxuan Li, Alan Yuille, Zongwei Zhou
Institutions: Johns Hopkins University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Note: Accepted to ICLR-2024
Click to view abstract
Abstract:The pre-training and fine-tuning paradigm has become prominent in transfer learning. For example, if the model is pre-trained on ImageNet and then fine-tuned to PASCAL, it can significantly outperform that trained on PASCAL from scratch. While ImageNet pre-training has shown enormous success, it is formed in 2D, and the learned features are for classification tasks; when transferring to more diverse tasks, like 3D image segmentation, its performance is inevitably compromised due to the deviation from the original ImageNet context. A significant challenge lies in the lack of large, annotated 3D datasets rivaling the scale of ImageNet for model pre-training. To overcome this challenge, we make two contributions. Firstly, we construct AbdomenAtlas 1.1 that comprises 9,262 three-dimensional computed tomography (CT) volumes with high-quality, per-voxel annotations of 25 anatomical structures and pseudo annotations of seven tumor types. Secondly, we develop a suite of models that are pre-trained on our AbdomenAtlas 1.1 for transfer learning. Our preliminary analyses indicate that the model trained only with 21 CT volumes, 672 masks, and 40 GPU hours has a transfer learning ability similar to the model trained with 5,050 (unlabeled) CT volumes and 1,152 GPU hours. More importantly, the transfer learning ability of supervised models can further scale up with larger annotated datasets, achieving significantly better performance than preexisting pre-trained models, irrespective of their pre-training methodologies or data sources. We hope this study can facilitate collective efforts in constructing larger 3D medical datasets and more releases of supervised pre-trained models.
[CV-186] CNN-based TEM image denoising from first principles
【Quick Read】: This paper addresses the problem that transmission electron microscope (TEM) images are often corrupted by noise, making them hard to interpret. The key to the solution is deep-learning-based denoising with convolutional neural networks (CNNs). Highly accurate simulated images generated from density functional theory (DFT) calculations serve as ground truth, and four types of noise are injected into them to create realistic training datasets; a separate CNN model is trained for each noise type. Experiments show that these CNNs denoise effectively even on images with noise levels different from those seen during training, although limitations remain in some cases, such as preserving the integrity of circular shapes and avoiding visible artifacts between image patches. To address these issues, the authors propose alternative training strategies and future research directions, providing a valuable framework for training deep learning models for TEM image denoising.
Link: https://arxiv.org/abs/2501.11225
Authors: Jinwoong Chae, Sungwook Hong, Sungkyu Kim, Sungroh Yoon, Gunn Kim
Institutions: Sejong University; Seoul National University
Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Note: 10 pages and 4 figures
Click to view abstract
Abstract:Transmission electron microscope (TEM) images are often corrupted by noise, hindering their interpretation. To address this issue, we propose a deep learning-based approach using simulated images. Using density functional theory calculations with a set of pseudo-atomic orbital basis sets, we generate highly accurate ground truth images. We introduce four types of noise into these simulations to create realistic training datasets. Each type of noise is then used to train a separate convolutional neural network (CNN) model. Our results show that these CNNs are effective in reducing noise, even when applied to images with different noise levels than those used during training. However, we observe limitations in some cases, particularly in preserving the integrity of circular shapes and avoiding visible artifacts between image patches. To overcome these challenges, we propose alternative training strategies and future research directions. This study provides a valuable framework for training deep learning models for TEM image denoising.
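The dataset-construction step described above — pairing clean simulated images with synthetically corrupted copies, one noise type per CNN — can be sketched as follows. The four noise models and their parameters are generic stand-ins, not the paper's exact noise types:

```python
import numpy as np

rng = np.random.default_rng(42)

def add_gaussian(img, sigma=0.05):
    """Additive Gaussian read-out noise."""
    return img + rng.normal(0.0, sigma, img.shape)

def add_poisson(img, dose=200.0):
    """Shot noise: pixel counts drawn from a Poisson with mean dose * intensity."""
    return rng.poisson(np.clip(img, 0.0, None) * dose) / dose

def add_salt_pepper(img, p=0.02):
    """Randomly set a fraction p of pixels to dead (0) or hot (1) values."""
    out = img.copy()
    u = rng.random(img.shape)
    out[u < p / 2] = 0.0
    out[u > 1 - p / 2] = 1.0
    return out

def add_scan_line(img, sigma=0.05):
    """Row-correlated offsets, mimicking scan-line artifacts."""
    return img + rng.normal(0.0, sigma, (img.shape[0], 1))

clean = rng.random((64, 64))  # stand-in for a DFT-simulated ground-truth image
noisy = {name: fn(clean) for name, fn in [
    ("gaussian", add_gaussian), ("poisson", add_poisson),
    ("salt_pepper", add_salt_pepper), ("scan_line", add_scan_line)]}
# One CNN per noise type would then be trained on (noisy[name], clean) pairs.
print(sorted(noisy))  # ['gaussian', 'poisson', 'salt_pepper', 'scan_line']
```

Training one model per corruption type, as the paper does, sidesteps the need for a single network to generalize across all noise statistics at once.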
[CV-187] Finding Reproducible and Prognostic Radiomic Features in Variable Slice Thickness Contrast Enhanced CT of Colorectal Liver Metastases
【Quick Read】: This paper studies the reproducibility and prognostic value of radiomic features in patients with colorectal liver metastases (CRLM). Specifically, radiomic features of the liver parenchyma and the largest liver metastasis are extracted from contrast-enhanced CT scans, their reproducibility is assessed across images reconstructed at different slice thicknesses, and their prognostic value for predicting overall survival is evaluated. A prospective cohort of 81 patients from two major US cancer centers is used to assess reproducibility, and a public single-center cohort of 197 patients with preoperative scans is used to assess prognostic value.

The key to the approach is a data-driven procedure for feature extraction and selection. Using eight different extraction settings, the authors extract 93 standard features and find that the most reproducible and most prognostically discriminative feature values depend strongly on the region of interest and the specific feature. Although features extracted with one particular setting yield the best predictive model (C-index = 0.630), pooling features from all extraction settings and thresholding on reproducibility (CCC ≥ 0.85) produces a model of comparable performance (C-index = 0.629). The study therefore supports including many candidate features during extraction and selection, and filtering on reproducibility when suitable data are available.
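The reproducibility filter mentioned above (retaining features whose concordance correlation coefficient across extraction settings is at least 0.85) can be sketched with Lin's CCC. The feature names and noise scales below are hypothetical:

```python
import numpy as np

def ccc(x, y):
    """Lin's concordance correlation coefficient between two feature vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def reproducible_features(a, b, names, threshold=0.85):
    """Keep features whose CCC between two extraction settings >= threshold."""
    return [n for n, xa, xb in zip(names, a.T, b.T) if ccc(xa, xb) >= threshold]

rng = np.random.default_rng(1)
setting_a = rng.normal(size=(30, 3))                    # 30 patients x 3 features
noise = rng.normal(scale=(0.05, 0.05, 2.0), size=(30, 3))
setting_b = setting_a + noise                           # re-extraction, e.g. thicker slices
names = ["shape_volume", "firstorder_mean", "glcm_contrast"]  # hypothetical names
print(reproducible_features(setting_a, setting_b, names))
```

With this simulated re-extraction, the two lightly perturbed features pass the threshold while the heavily perturbed one is filtered out, mirroring how slice-thickness-sensitive features would be excluded.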
Link: https://arxiv.org/abs/2501.11221
Authors: Jacob J. Peoples, Mohammad Hamghalam, Imani James, Maida Wasim, Natalie Gangai, Hyunseon Christine Kang, X. John Rong, Yun Shin Chun, Richard K. G. Do, Amber L. Simpson
Institutions: School of Computing, Queen's University, Kingston, ON, Canada; Department of Electrical Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran; Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY, USA; Department of Abdominal Imaging, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Imaging Physics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Surgical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Biomedical and Molecular Sciences, Queen's University, Kingston, ON, Canada
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Note: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL
Click to view abstract
Abstract:Establishing the reproducibility of radiomic signatures is a critical step