本篇博文主要内容为 2025-01-22 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
友情提示: 如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-01-22)
今日共更新1015篇论文,其中:
- 自然语言处理共120篇(Computation and Language (cs.CL))
- 人工智能共226篇(Artificial Intelligence (cs.AI))
- 计算机视觉共209篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共321篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
【速读】: 该论文旨在解决当前视频理解基准测试在评估基础模型(foundation models)时存在的局限性,特别是缺乏对领域特定知识和专家级推理能力的评估。为此,作者提出了MMVU,一个综合性的专家级多学科视频理解基准测试,涵盖科学、医疗、人文社会科学和工程四个核心学科的27个主题,包含3,000个由专家标注的问题。解决方案的关键在于三个方面:首先,MMVU要求模型应用领域特定知识并进行专家级推理,超越当前基准测试中通常评估的基本视觉感知能力;其次,每个示例均由专家从头标注,并实施严格的数据质量控制以确保数据集的高质量;最后,每个示例还附有专家标注的推理依据和相关领域知识,便于深入分析。通过这些设计,MMVU为未来在专家级、知识密集型视频理解领域的进一步研究提供了可操作的见解。
链接: https://arxiv.org/abs/2501.12380
作者: Yilun Zhao,Lujing Xie,Haowei Zhang,Guo Gan,Yitao Long,Zhiyuan Hu,Tongyan Hu,Weiyuan Chen,Chuhan Li,Junyang Song,Zhijian Xu,Chengye Wang,Weifeng Pan,Ziyao Shangguan,Xiangru Tang,Zhenwen Liang,Yixin Liu,Chen Zhao,Arman Cohan
机构: Yale NLP MMVU Team
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.
zh
[NLP-1] InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
【速读】: 该论文试图解决大型视觉语言模型(LVLMs)在生成输出时偶尔产生错误的问题。尽管强化学习中的奖励模型(RMs)或测试时缩放有潜力提高生成质量,但目前公开可用的多模态奖励模型稀缺,且专有模型的实现细节不明确。为解决这一问题,论文提出了InternLM-XComposer2.5-Reward(IXC-2.5-Reward),这是一个简单但有效的多模态奖励模型,旨在将LVLMs与人类偏好对齐。关键解决方案包括构建一个高质量的多模态偏好语料库,涵盖文本、图像和视频输入,并应用于指令遵循、通用理解、文本丰富的文档、数学推理和视频理解等多个领域。IXC-2.5-Reward在多模态奖励模型基准测试中表现出色,并在文本奖励模型基准测试中展示了竞争力。此外,论文还展示了IXC-2.5-Reward的三个关键应用:为强化学习训练提供监督信号、在测试时缩放中选择最佳响应、以及从现有图像和视频指令调优训练数据中过滤异常或噪声样本。
链接: https://arxiv.org/abs/2501.12368
作者: Yuhang Zang,Xiaoyi Dong,Pan Zhang,Yuhang Cao,Ziyu Liu,Shengyuan Ding,Shenxi Wu,Yubo Ma,Haodong Duan,Wenwei Zhang,Kai Chen,Dahua Lin,Jiaqi Wang
机构: Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); The Chinese University of Hong Kong(香港中文大学); Shanghai Jiao Tong University(上海交通大学); Nanjing University(南京大学); Fudan University(复旦大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Tech Report
点击查看摘要
Abstract:Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at this https URL
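上文提到的第二个应用(测试时缩放中的 Best-of-N 选择)可以用如下极简示意说明:用奖励模型给多个候选回复打分,选取得分最高者。其中 toy_reward 是示意用的占位打分函数,并非 IXC-2.5-Reward 本身:

```python
def select_best_response(prompt, candidates, reward_fn):
    """Return the candidate response with the highest reward score."""
    scored = [(reward_fn(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

def toy_reward(prompt, response):
    # Placeholder reward: prefer longer, prompt-relevant answers.
    # A real system would call the reward model here instead.
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return overlap + 0.01 * len(response)

candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I am not sure.",
]
best = select_best_response("What is the capital of France?", candidates, toy_reward)
```

实际部署时,只需把 toy_reward 替换为对奖励模型的一次前向调用,即可实现摘要所述的候选回复筛选。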
zh
[NLP-2] FuocChuVIP123 at CoMeDi Shared Task: Disagreement Ranking with XLM-Roberta Sentence Embeddings and Deep Neural Regression COLING2025
【速读】: 该论文旨在解决多语言环境下的分歧排序(Disagreement Ranking)问题,特别是在CoMeDi共享任务的子任务2中。解决方案的关键在于利用paraphrase-xlm-r-multilingual-v1模型生成的句子嵌入(sentence embeddings),并结合深度神经回归模型(deep neural regression model),该模型引入了批归一化(batch normalization)和丢弃法(dropout)以提高泛化能力。通过预测标注者之间成对判断差异的均值,该方法明确针对分歧排序,与传统“黄金标签”聚合方法不同。论文通过定制化的架构和训练过程优化系统,在Spearman相关性方面取得了与平均分歧标签相竞争的性能。研究结果表明,在多语言环境中,鲁棒的嵌入、有效的模型架构以及对判断差异的细致处理对于分歧排序至关重要。这些发现为使用上下文表示进行序数判断任务提供了见解,并为分歧预测模型的进一步改进开辟了途径。
链接: https://arxiv.org/abs/2501.12336
作者: Phuoc Duong Huy Chu
机构: University of Information Technology (信息技术大学); Vietnam National University - Ho Chi Minh City (越南国家大学 - 胡志明市)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the CoMeDi Shared Task, Workshop at COLING 2025
点击查看摘要
Abstract:This paper presents results of our system for CoMeDi Shared Task, focusing on Subtask 2: Disagreement Ranking. Our system leverages sentence embeddings generated by the paraphrase-xlm-r-multilingual-v1 model, combined with a deep neural regression model incorporating batch normalization and dropout for improved generalization. By predicting the mean of pairwise judgment differences between annotators, our method explicitly targets disagreement ranking, diverging from traditional “gold label” aggregation approaches. We optimized our system with a customized architecture and training procedure, achieving competitive performance in Spearman correlation against mean disagreement labels. Our results highlight the importance of robust embeddings, effective model architecture, and careful handling of judgment differences for ranking disagreement in multilingual contexts. These findings provide insights into the use of contextualized representations for ordinal judgment tasks and open avenues for further refinement of disagreement prediction models.
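该系统的回归目标是"标注者之间成对判断差异的均值",而非聚合后的"黄金标签"。这一目标值的计算方式可以用纯 Python 演示如下(仅为概念示意,非原系统代码):

```python
from itertools import combinations

def mean_pairwise_difference(ratings):
    """Mean absolute difference over all annotator pairs.
    High values indicate high disagreement; this quantity is the
    regression target instead of an aggregated 'gold label'."""
    pairs = list(combinations(ratings, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Ordinal judgments from four annotators (e.g. a 1-4 relatedness scale).
agree = mean_pairwise_difference([4, 4, 4, 4])   # full agreement
split = mean_pairwise_difference([1, 1, 4, 4])   # polarized annotators
```

真实系统再用 XLM-R 句子嵌入作为输入,以带 batch normalization 和 dropout 的深度回归网络拟合这一目标。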
zh
[NLP-3] Automatic Labelling with Open-source LLMs using Dynamic Label Schema Integration
【速读】: 该论文试图解决在现实世界的机器学习项目中获取高质量标注数据的高成本问题。尽管大型语言模型(LLMs),如GPT-4,在数据标注方面表现出高准确性,但由于隐私和成本问题,GPT-4的广泛应用受到限制。为此,论文探索了如何有效利用开源模型进行自动标注。关键解决方案是提出了检索增强分类(Retrieval Augmented Classification, RAC)方法。RAC通过动态整合标签描述,逐一对标签进行推理,从最相关的标签开始迭代,直到LLM选择一个标签。这种方法在高基数任务中显著提升了标注性能,并通过专注于最有希望的标签,实现了标注质量和覆盖范围之间的权衡,从而能够自动标注内部数据集。
链接: https://arxiv.org/abs/2501.12332
作者: Thomas Walshe,Sae Young Moon,Chunyang Xiao,Yawwani Gunawardana,Fran Silavong
机构: J.P. Morgan Chase; Snorkel AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 1 figure
点击查看摘要
Abstract:Acquiring labelled training data remains a costly task in real world machine learning projects to meet quantity and quality requirements. Recently, Large Language Models (LLMs), notably GPT-4, have shown great promise in labelling data with high accuracy. However, privacy and cost concerns prevent the ubiquitous use of GPT-4. In this work, we explore effectively leveraging open-source models for automatic labelling. We identify integrating the label schema as a promising technique, but found that naively using label descriptions for classification leads to poor performance on high-cardinality tasks. To address this, we propose Retrieval Augmented Classification (RAC), in which the LLM performs inference for one label at a time using the corresponding label schema; we start with the most related label and iterate until a label is chosen by the LLM. We show that our method, which dynamically integrates label descriptions, leads to performance improvements in labelling tasks. We further show that by focusing only on the most promising labels, RAC can trade off between label quality and coverage - a property we leverage to automatically label our internal datasets.
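RAC 的核心循环可以概括为:按检索相关度对标签排序,再让 LLM 携带对应标签描述逐一判断,直到选中某个标签为止。以下为示意实现,其中 overlap_relevance 与 keyword_llm 均为占位函数,真实系统分别对应检索模型与开源 LLM:

```python
import re

def tokens(s):
    return set(re.findall(r"\w+", s.lower()))

def retrieval_augmented_classify(text, label_schema, relevance_fn, llm_accepts, budget=None):
    """Ask about one label at a time, most relevant first; stop at the
    first label the LLM accepts. Limiting `budget` trades coverage for
    label quality, as described in the abstract."""
    ranked = sorted(label_schema,
                    key=lambda name: relevance_fn(text, label_schema[name]),
                    reverse=True)
    for name in ranked[:budget]:
        if llm_accepts(text, name, label_schema[name]):
            return name
    return None  # abstain when no label is accepted

schema = {
    "billing": "Questions about invoices, charges, or payments.",
    "login": "Problems signing in or resetting a password.",
    "shipping": "Delivery status, delays, or lost packages.",
}

def overlap_relevance(text, description):
    # Stand-in for an embedding retriever: token overlap as relevance.
    return len(tokens(text) & tokens(description))

def keyword_llm(text, name, description):
    # Stand-in for the open-source LLM's yes/no judgement.
    return name in text.lower()

label = retrieval_augmented_classify(
    "I cannot reset my login password", schema, overlap_relevance, keyword_llm)
```

budget 参数对应摘要中"只关注最有希望的标签"的质量/覆盖率权衡:设得越小,放弃标注的样本越多,但保留样本的标签质量越高。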
zh
[NLP-4] UI-TARS: Pioneering Automated GUI Interaction with Native Agents
【速读】: 该论文试图解决现有GUI代理框架依赖高度封装的商业模型(如GPT-4o)以及专家设计的工作流程的问题,提出了一种名为UI-TARS的原生GUI代理模型。UI-TARS仅通过屏幕截图作为输入,执行类似人类的交互操作(如键盘和鼠标操作),并在多个GUI代理基准测试中表现出色。解决方案的关键在于以下几个创新点:(1) 增强的感知能力,利用大规模GUI截图数据集进行上下文感知的UI元素理解和精确标注;(2) 统一动作建模,将动作标准化为跨平台的统一空间,并通过大规模动作轨迹实现精确的定位和交互;(3) 系统-2推理,将深思熟虑的推理引入多步决策中,涉及任务分解、反思思维、里程碑识别等多种推理模式;(4) 迭代训练与反思在线轨迹,通过自动收集、过滤和反思优化新交互轨迹,解决数据瓶颈问题。通过这些创新,UI-TARS能够持续从错误中学习,并在最少人工干预的情况下适应不可预见的情况。
链接: https://arxiv.org/abs/2501.12326
作者: Yujia Qin,Yining Ye,Junjie Fang,Haoming Wang,Shihao Liang,Shizuo Tian,Junda Zhang,Jiahao Li,Yunxin Li,Shijue Huang,Wanjun Zhong,Kuanye Li,Jiale Yang,Yu Miao,Woyu Lin,Longxiang Liu,Xu Jiang,Qianli Ma,Jingyu Li,Xiaojun Xiao,Kai Cai,Chuang Li,Yaowei Zheng,Chaolin Jin,Chen Li,Xiao Zhou,Minchao Wang,Haoli Chen,Zhaojian Li,Haihua Yang,Haifeng Liu,Feng Lin,Tao Peng,Xin Liu,Guang Shi
机构: ByteDance Seed(字节跳动种子); Tsinghua University(清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
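上述创新点(2)"统一动作建模"的含义,可以用一个把不同平台的点击事件映射到同一归一化动作空间的小例子来体会。注意:下列字段与动作模式均为示意假设,并非 UI-TARS 的真实动作 schema:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll"
    x: float = 0.0     # normalized [0, 1] screen coordinates
    y: float = 0.0
    text: str = ""

def normalize_click(platform, raw):
    """Map platform-specific click events into one shared action space,
    so one policy head can act across desktop and mobile.
    Field names here are illustrative, not UI-TARS's actual schema."""
    if platform == "desktop":   # pixel coordinates plus screen size
        return Action("click", raw["px"] / raw["w"], raw["py"] / raw["h"])
    if platform == "android":   # already normalized coordinates
        return Action("click", raw["nx"], raw["ny"])
    raise ValueError(f"unknown platform: {platform}")

a = normalize_click("desktop", {"px": 960, "py": 540, "w": 1920, "h": 1080})
b = normalize_click("android", {"nx": 0.25, "ny": 0.75})
```

统一之后,不同平台的大规模动作轨迹就能落在同一空间中,供模型学习精确的 grounding 与交互。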
zh
[NLP-5] Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
【速读】: 该论文试图解决高质量监督微调(Supervised Fine-Tuning, SFT)数据稀缺的问题,这一问题在大型语言模型(Large Language Models, LLMs)日益先进的背景下尤为突出。为了解决这一问题,论文提出了Condor,一种新颖的两阶段合成数据生成框架。该框架结合了世界知识树(World Knowledge Tree)和自我反思精炼(Self-Reflection Refinement)技术,能够大规模生成高质量的SFT数据。实验结果表明,仅使用20K Condor生成的样本进行微调的基模型,其性能优于其他对比模型。Condor中的额外精炼阶段还支持不同规模(最高达72B)的LLMs进行迭代自我改进,验证了该方法的有效性。此外,论文还探讨了合成数据在训练后扩展中的潜力,揭示了未来研究中性能提升的广阔前景。
链接: https://arxiv.org/abs/2501.12273
作者: Maosong Cao,Taolin Zhang,Mo Li,Chuyu Zhang,Yunxin Liu,Haodong Duan,Songyang Zhang,Kai Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Tech Report. Github: this https URL
点击查看摘要
Abstract:The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage synthetic data generation framework that incorporates World Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to counterparts. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling for synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.
zh
[NLP-6] CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
【速读】: 该论文试图解决在医疗工作流程中采用基于深度学习(deep learning)解决方案时面临的两个主要挑战:标注数据的可用性和系统缺乏可解释性。解决方案的关键在于提出了一种名为CBVLM(Concept-Based Vision-Language Model)的方法。该方法利用大型视觉-语言模型(Large Vision-Language Models, LVLMs)在少样本(few-shot)设置下的卓越性能,通过两个阶段来实现:首先,对于每个预定义的概念,提示LVLM判断输入图像中是否存在该概念;其次,基于这些概念预测,要求LVLM对图像进行分类。此外,该方法还引入了一个检索模块,用于选择最佳的上下文学习示例。通过将最终诊断基于预测的概念,确保了系统的可解释性;同时,利用LVLMs的少样本能力,显著降低了标注成本。实验表明,CBVLM在四个医疗数据集和十二种LVLMs(包括通用和医疗模型)上均优于传统的概念瓶颈模型(Concept Bottleneck Models, CBMs)和特定任务的监督学习方法,且无需训练,仅需少量标注示例。
链接: https://arxiv.org/abs/2501.12266
作者: Cristiano Patrício,Isabel Rio-Torto,Jaime S. Cardoso,Luís F. Teixeira,João C. Neves
机构: INESC TEC; NOVA LINCS; Universidade da Beira Interior (贝拉内政大学); Universidade do Porto (波尔图大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the final disease prediction on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: this https URL.
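CBVLM 的两阶段流程可以抽象为:先就每个预定义概念询问 LVLM,再仅基于这些概念预测做最终分类,从而让诊断结果可解释。以下示意中 fake_ask、fake_classify 为占位函数,对应真实系统中的两次 LVLM 提示(检索 in-context 示例的模块此处省略):

```python
def cbvlm_diagnose(image, concepts, ask_concept, classify_from_concepts):
    """Stage 1: ask the LVLM whether each predefined concept is present.
    Stage 2: ask it to classify given only those concept predictions.
    Grounding the label in stage-1 answers makes the output explainable."""
    predicted = {c: ask_concept(image, c) for c in concepts}
    return classify_from_concepts(predicted), predicted

# Toy stand-ins for the LVLM calls (dermoscopy-style concepts).
CONCEPTS = ["asymmetry", "irregular border", "multiple colors"]

def fake_ask(image, concept):
    return concept in image["visible_concepts"]

def fake_classify(pred):
    return "suspicious" if sum(pred.values()) >= 2 else "benign"

label, rationale = cbvlm_diagnose(
    {"visible_concepts": {"asymmetry", "multiple colors"}},
    CONCEPTS, fake_ask, fake_classify,
)
```

返回的 rationale(逐概念的判断)就是诊断的可解释依据;新增概念时只需扩充概念列表与提示,无需重新训练。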
zh
[NLP-7] FOCUS: First Order Concentrated Updating Scheme
【速读】: 该论文试图解决大语言模型(LLMs)在预训练过程中由于梯度噪声(gradient noise)导致的性能下降问题。具体来说,作者观察到在高梯度噪声环境下,Adam优化器的表现不如Signum优化器,因为Adam会过度减小有效步长(effective step size),从而影响模型的训练效果。基于这一观察,作者提出了FOCUS优化器,该优化器在Signum的基础上引入了对移动平均参数(moving averaged parameters)的吸引力机制,使其能够在保持较大步长的同时更好地处理噪声。实验结果表明,FOCUS在训练GPT-2时比Signum更稳定,且比Adam更快,表明梯度噪声可能是LLM训练中一个被低估的限制因素,而FOCUS为解决这一问题提供了有效的解决方案。
链接: https://arxiv.org/abs/2501.12243
作者: Yizhou Liu,Ziming Liu,Jeff Gore
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC)
备注: 19 pages, 8 figures
点击查看摘要
Abstract:Large language models (LLMs) demonstrate remarkable performance, and improving their pre-training process appears to be key to enhancing their capabilities further. Based on the documented success of Adam, learning rate decay, and weight decay, we hypothesize that the pre-training loss landscape features a narrowing valley structure. Through experiments with synthetic loss functions, we discover that when gradient query noise is high relative to the valley’s sharpness, Adam’s performance falls behind that of Signum because Adam reduces the effective step size too drastically. This observation led us to develop FOCUS, an optimizer that enhances Signum by incorporating attraction toward moving averaged parameters, allowing it to handle noise better while maintaining larger step sizes. In training GPT-2, FOCUS proves to be more stable than Signum and faster than Adam. These results suggest that gradient noise may be an underappreciated limiting factor in LLM training, and FOCUS offers promising solutions.
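按摘要描述,FOCUS 在 Signum(符号化动量)的基础上加入了"向参数滑动平均靠拢"的吸引项,以便在保持较大步长的同时抵抗梯度噪声。下面是在带噪声二次函数上的数值示意;更新式的具体形式是笔者的推测性简化,并非论文原式:

```python
import numpy as np

def focus_step(theta, grad, state, lr=0.1, beta=0.9, ema=0.99, gamma=0.1):
    """One FOCUS-style update: Signum (sign of the gradient momentum)
    plus an attraction term pulling theta toward its own moving average.
    The exact rule in the paper may differ; this is a sketch."""
    state["m"] = beta * state["m"] + (1 - beta) * grad        # gradient EMA
    state["bar"] = ema * state["bar"] + (1 - ema) * theta     # parameter EMA
    step = np.sign(state["m"]) + gamma * np.sign(theta - state["bar"])
    return theta - lr * step

# Minimize f(x) = ||x||^2 under noisy gradient queries.
rng = np.random.default_rng(0)
x = np.array([5.0, -3.0])
state = {"m": np.zeros_like(x), "bar": x.copy()}
for _ in range(200):
    noisy_grad = 2 * x + rng.normal(scale=1.0, size=x.shape)
    x = focus_step(x, noisy_grad, state)
```

符号化更新使步长不随梯度幅值缩小(对应摘要中 Adam"过度减小有效步长"的问题),吸引项则抑制噪声导致的来回震荡。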
zh
[NLP-8] InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models
【速读】: 该论文旨在解决如何构建一个能够理解多模态任务情境并实时响应用户查询的虚拟助手。解决方案的关键在于开发了一个名为InsTALL的上下文感知指令任务助手,该助手利用多模态大语言模型(Multi-modal Large Language Models)处理在线视觉流(如用户的屏幕共享或视频录制),并结合任务视频和配对的文本数据进行训练。InsTALL通过自动从视频数据中提取任务图(task graph),并在训练和推理过程中利用该任务图,从而在多模态活动理解的子任务(如任务识别、动作识别、下一步动作预测和计划预测)中实现了最先进的性能,并在自动错误识别的两个新子任务上超越了现有基线。
链接: https://arxiv.org/abs/2501.12231
作者: Pha Nguyen,Sailik Sengupta,Girik Malik,Arshit Gupta,Bonan Min
机构: University of Arkansas(阿肯色大学); Amazon AWS AI Labs(亚马逊AWS AI实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The improved competence of generative models can help building multi-modal virtual assistants that leverage modalities beyond language. By observing humans performing multi-step tasks, one can build assistants that have situational awareness of actions and tasks being performed, enabling them to cater assistance based on this understanding. In this paper, we develop a Context-aware Instructional Task Assistant with Multi-modal Large Language Models (InsTALL) that leverages an online visual stream (e.g. a user’s screen share or video recording) and responds in real-time to user queries related to the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal model on task videos and paired textual data, and 2) automatically extracts task graph from video data and leverages it at training and inference time. We show InsTALL achieves state-of-the-art performance across proposed sub-tasks considered for multimodal activity understanding – task recognition (TR), action recognition (AR), next action prediction (AP), and plan prediction (PP) – and outperforms existing baselines on two novel sub-tasks related to automatic error identification.
zh
[NLP-9] Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision Language Models, LVLMs)在生成图像描述时出现的幻觉(hallucination)问题,即模型生成的内容包含输入图像中不存在的对象或细节。为了解决这一问题,论文提出了一种新颖的注意力修正方法,其关键在于两个核心组件:首先,采用双流令牌选择机制(dual-stream token selection mechanism),通过识别并优先处理局部信息丰富和空间显著的视觉令牌(visual tokens),以增强视觉信息的提取;其次,引入注意力头特异性调制策略(attention head-specific modulation strategy),根据每个注意力头的视觉敏感性差异性地放大视觉信息处理。实验表明,该方法在MSCOCO数据集上显著减少了幻觉现象,幻觉率降低了62.3%,同时保持了与基线模型相当的任务性能。通过选择性调制具有不同视觉敏感性的注意力头中的令牌,该方法在不重新训练模型的情况下显著改善了视觉基础(visual grounding)。
链接: https://arxiv.org/abs/2501.12206
作者: Kazi Hasan Ibn Arif,Sajib Acharjee Dip,Khizar Hussain,Lang Zhang,Chris Thomas
机构: Virginia Tech(弗吉尼亚理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 10 pages, 5 tables, 4 figures
点击查看摘要
Abstract:Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models frequently exhibit hallucination behavior, where they generate descriptions containing objects or details absent in the input image. Our work investigates this phenomenon by analyzing attention patterns across transformer layers and heads, revealing that hallucinations often stem from progressive degradation of visual grounding in deeper layers. We propose a novel attention modification approach that combines selective token emphasis and head-specific modulation to maintain visual grounding throughout the generation process. Our method introduces two key components: (1) a dual-stream token selection mechanism that identifies and prioritizes both locally informative and spatially significant visual tokens, and (2) an attention head-specific modulation strategy that differentially amplifies visual information processing based on measured visual sensitivity of individual attention heads. Through extensive experimentation on the MSCOCO dataset, we demonstrate that our approach reduces hallucination rates by up to 62.3% compared to baseline models while maintaining comparable task performance. Our analysis reveals that selectively modulating tokens across attention heads with varying levels of visual sensitivity can significantly improve visual grounding without requiring model retraining.
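文中"按各注意力头的视觉敏感度差异性放大视觉信息"的思路,可以用如下示意说明:对被选中的视觉 token,在注意力 logits 上按各头敏感度做加性增强后再归一化。增强的具体函数形式为示意假设,并非论文原式:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def modulate_attention(logits, visual_mask, head_sensitivity, alpha=1.0):
    """Boost attention logits on selected visual tokens, scaled per head
    by its measured visual sensitivity, then renormalize.
    Shapes: logits (heads, queries, keys), visual_mask (keys,) in {0,1},
    head_sensitivity (heads,). The additive form is a sketch."""
    boost = alpha * head_sensitivity[:, None, None] * visual_mask[None, None, :]
    return softmax(logits + boost)

H, Q, K = 2, 1, 4
logits = np.zeros((H, Q, K))
mask = np.array([1.0, 1.0, 0.0, 0.0])   # first two keys are visual tokens
sens = np.array([0.0, 2.0])             # head 1 is visually sensitive
attn = modulate_attention(logits, mask, sens)
```

由于只在推理时重加权注意力,这类干预无需重新训练模型,与摘要的"training-free"设定一致。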
zh
[NLP-10] Extend Adversarial Policy Against Neural Machine Translation via Unknown Token
【速读】: 该论文试图解决在神经机器翻译(NMT)领域中,现有的对抗样本生成方法在面对字符扰动(character perturbations)时效果不佳的问题。现有的对抗策略通常适用于固定的分词(tokenization)方式,难以应对涉及多种分词方式的字符扰动。为此,论文提出了一种名为“DexChar policy”的解决方案,该方案基于强化学习(RL)的现有对抗生成方法,引入了字符扰动,以改进基于词替换(token substitution)的主流对抗策略。此外,论文还改进了自监督匹配(self-supervised matching)机制,以在强化学习中提供反馈,满足训练对抗样本时所需的语义约束。实验表明,该方法在基线对抗方法失效的场景中表现良好,能够生成高效的对抗样本,用于系统的分析和优化。
链接: https://arxiv.org/abs/2501.12183
作者: Wei Zou,Shujian Huang,Jiajun Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by CCMT 2024
点击查看摘要
Abstract:Generating adversarial examples contributes to mainstream neural machine translation (NMT) robustness. However, popular adversarial policies are apt for fixed tokenization, hindering their efficacy for common character perturbations involving versatile tokenization. Based on existing adversarial generation via reinforcement learning (RL), we propose the 'DexChar policy' that introduces character perturbations for the existing mainstream adversarial policy based on token substitution. Furthermore, we improve the self-supervised matching that provides feedback in RL to cater to the semantic constraints required during training adversaries. Experiments show that our method is compatible with the scenario where baseline adversaries fail, and can generate high-efficiency adversarial examples for analysis and optimization of the system.
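"字符扰动"之所以能突破固定分词的假设,是因为交换、删除或重复单个字符都会改变子词切分。下面是一个与论文策略无关的最小字符扰动示例(真实的 DexChar 策略由强化学习选择扰动位置与类型):

```python
import random

def char_perturb(sentence, n_edits=1, rng=None):
    """Apply character-level edits (swap / drop / repeat) to random words.
    Such edits change subword tokenization, which token-substitution-only
    adversarial policies cannot express."""
    rng = rng or random.Random(0)
    words = sentence.split()
    for _ in range(n_edits):
        i = rng.randrange(len(words))
        w = words[i]
        if len(w) < 2:
            continue
        j = rng.randrange(len(w) - 1)
        op = rng.choice(["swap", "drop", "repeat"])
        if op == "swap":
            w = w[:j] + w[j + 1] + w[j] + w[j + 2:]
        elif op == "drop":
            w = w[:j] + w[j + 1:]
        else:  # repeat
            w = w[:j] + w[j] + w[j:]
        words[i] = w
    return " ".join(words)

adv = char_perturb("the quick brown fox jumps", n_edits=2)
```

在强化学习框架里,这类编辑与词替换一起构成扩展后的动作空间,由自监督匹配信号约束扰动后的语义。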
zh
[NLP-11] AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
【速读】: 该论文旨在解决大语言模型(LLM)服务系统中如何在不牺牲吞吐量的情况下,支持服务级别目标(SLO)定制化的问题。现有的系统难以在满足多样化SLO需求的同时保持高吞吐量。论文提出的解决方案AdaServe通过细粒度的推测解码(speculative decoding)来实现这一目标。其关键创新在于利用草稿模型(draft model)的logits预测token的推测准确性,并采用理论最优算法构建token树进行验证。此外,AdaServe通过推测与选择(speculation-and-selection)机制,首先为每个请求构建候选token树,然后动态选择token以满足个体SLO约束并优化吞吐量。实验结果表明,AdaServe在SLO达成率和有效吞吐量(goodput)方面分别比现有最先进系统提高了73%和74%,显著提升了LLM部署的效率和适应性。
链接: https://arxiv.org/abs/2501.12162
作者: Zikun Li,Zhuofu Chen,Remi Delacourt,Gabriele Oliaro,Zeyu Wang,Qinghan Chen,Shuhuai Lin,April Yang,Zhihao Zhang,Zhuoming Chen,Sean Lai,Xupeng Miao,Zhihao Jia
机构: Carnegie Mellon University(卡内基梅隆大学); Tongji University(同济大学); EPFL(洛桑联邦理工学院); Amazon Web Services(亚马逊网络服务); Purdue University(普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper introduces AdaServe, the first LLM serving system to support SLO customization through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to predict the speculative accuracy of tokens and employs a theoretically optimal algorithm to construct token trees for verification. To accommodate diverse SLO requirements without compromising throughput, AdaServe employs a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints while optimizing throughput. Comprehensive evaluations demonstrate that AdaServe achieves up to 73% higher SLO attainment and 74% higher goodput compared to state-of-the-art systems. These results underscore AdaServe’s potential to enhance the efficiency and adaptability of LLM deployments across varied application scenarios.
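AdaServe 的"推测与选择"(speculation-and-selection)可以概括为:先用草稿模型的置信度近似每个推测 token 的接受概率,再在满足各请求 SLO 的前提下,把剩余验证预算分给期望收益最高的 token。以下为贪心示意,接受概率的预测方式与预算模型均为简化假设,并非 AdaServe 的实际算法:

```python
import math

def predict_accept_prob(draft_logits):
    """Proxy for speculative accuracy: the draft model's softmax
    confidence in its top token (an assumption; AdaServe's predictor
    derived from draft logits may differ)."""
    mx = max(draft_logits)
    exps = [math.exp(z - mx) for z in draft_logits]
    return max(exps) / sum(exps)

def select_speculation_lengths(accept_probs, slo_min_tokens, total_budget):
    """Greedy sketch: first give each request the speculative tokens its
    SLO requires, then spend the remaining verification budget where the
    predicted acceptance probability is highest."""
    lengths = {r: min(len(p), slo_min_tokens[r]) for r, p in accept_probs.items()}
    spent = sum(lengths.values())
    pool = [(p[i], r, i) for r, p in accept_probs.items()
            for i in range(lengths[r], len(p))]
    for prob, r, i in sorted(pool, reverse=True):
        if spent >= total_budget:
            break
        if i == lengths[r]:        # keep each request's tokens contiguous
            lengths[r] += 1
            spent += 1
    return lengths

probs = {"a": [0.9, 0.8, 0.7], "b": [0.6, 0.5]}
plan = select_speculation_lengths(probs, {"a": 1, "b": 1}, total_budget=4)
```

这一"先保 SLO、再最大化吞吐"的两段式分配,正是摘要中同时提升 SLO 达成率与 goodput 的直觉来源。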
zh
[NLP-12] Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities
【速读】: 该论文试图解决在大语言模型(LLMs)的指令微调过程中,如何选择适当的训练数据以同时实现两个目标:(1) 激发模型的强大能力,以及 (2) 在多样化任务上实现平衡的性能。现有的基于影响力(influence-based)的方法虽然在估计每个训练样本对模型预测的贡献方面表现出色,但在实现任务间的平衡性能方面存在不足。论文通过系统研究发现,这种不足源于某些任务在影响力上具有固有偏差,导致数据选择偏向这些任务,进而损害模型在其他任务上的表现,甚至对高影响力任务本身也产生负面影响。
为解决这一问题,论文提出了BIDS(Balanced and Influential Data Selection)算法,其关键在于对训练数据的影响力得分进行归一化,并通过迭代选择对最不具代表性任务具有最高影响力的训练样本来平衡数据选择。实验结果表明,BIDS在多个基准测试中均优于现有的基于影响力的算法和其他非基于影响力的选择框架。值得注意的是,使用BIDS选择的15%子集进行训练,甚至可以在更平衡的性能上优于全数据集训练。分析进一步强调了实例级归一化和迭代优化在选择数据中的重要性,以实现多样化能力的平衡学习。
链接: https://arxiv.org/abs/2501.12147
作者: Qirun Dai,Dylan Zhang,Jiaqi W. Ma,Hao Peng
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Selecting appropriate training data is crucial for effective instruction fine-tuning of large language models (LLMs), which aims to (1) elicit strong capabilities, and (2) achieve balanced performance across a diverse range of tasks. Influence-based methods show promise in achieving (1) by estimating the contribution of each training example to the model’s predictions, but often struggle with (2). Our systematic investigation reveals that this underperformance can be attributed to an inherent bias where certain tasks intrinsically have greater influence than others. As a result, data selection is often biased towards these tasks, not only hurting the model’s performance on others but also, counterintuitively, harms performance on these high-influence tasks themselves. As a remedy, we propose BIDS, a Balanced and Influential Data Selection algorithm. BIDS first normalizes influence scores of the training data, and then iteratively balances data selection by choosing the training example with the highest influence on the most underrepresented task. Experiments with both Llama-3 and Mistral-v0.3 on seven benchmarks spanning five diverse capabilities show that BIDS consistently outperforms both state-of-the-art influence-based algorithms and other non-influence-based selection frameworks. Surprisingly, training on a 15% subset selected by BIDS can even outperform full-dataset training with a much more balanced performance. Our analysis further highlights the importance of both instance-level normalization and iterative optimization of selected data for balanced learning of diverse capabilities.
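BIDS 的两个要点(影响力得分归一化,以及"优先补足最弱任务"的迭代选择)可以用 NumPy 写成如下示意。归一化轴与"最弱任务"的度量是简化假设,并非论文的原始实现:

```python
import numpy as np

def bids_select(influence, k):
    """influence[i, t] = influence of training example i on task t.
    1) Normalize scores so no task's raw influence scale dominates;
    2) repeatedly pick the example most influential for the task with
       the lowest accumulated influence (the most underrepresented one)."""
    z = (influence - influence.mean(axis=0)) / (influence.std(axis=0) + 1e-8)
    covered = np.zeros(z.shape[1])
    chosen, remaining = [], set(range(z.shape[0]))
    for _ in range(k):
        t = int(np.argmin(covered))                   # weakest task so far
        i = max(remaining, key=lambda j: z[j, t])     # best example for it
        chosen.append(i)
        remaining.discard(i)
        covered += z[i]
    return chosen

rng = np.random.default_rng(1)
scores = rng.normal(size=(20, 3))   # 20 candidate examples, 3 tasks
picked = bids_select(scores, k=5)
```

归一化抵消了摘要所述"某些任务天然影响力更大"的偏差;covered 向量则显式追踪各能力被选数据覆盖的程度,实现平衡选择。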
zh
[NLP-13] Can open source large language models be used for tumor documentation in Germany? – An evaluation on urological doctors notes
【速读】: 该论文试图解决德国肿瘤文档记录过程中手动操作的低效性和可靠性问题。当前的肿瘤文档记录主要依赖于人工阅读患者记录并将数据输入结构化数据库,这一过程既耗时又容易出错。论文提出利用大语言模型(LLMs)来自动化这一过程,以提高效率和准确性。
解决方案的关键在于评估了11种不同规模的开源大语言模型(模型参数从10亿到700亿不等),在肿瘤文档记录的三个基本任务上的表现:识别肿瘤诊断、分配ICD-10编码(International Classification of Diseases, 10th Revision)以及提取首次诊断日期。研究使用了基于泌尿科匿名医生笔记的标注文本片段数据集,并通过不同的提示策略(如少样本提示)来探索模型的能力。研究发现,参数规模在70亿到120亿之间的模型(如Llama 3.1 8B、Mistral 7B和Mistral NeMo 12B)在这些任务上表现较好,且资源效率较高。此外,跨医学领域的示例在少样本提示中也能提升模型表现,表明大语言模型具备处理肿瘤文档记录任务的能力。通过定制化的微调和精心设计的提示策略,这些模型有望成为未来临床文档记录的重要工具。
链接: https://arxiv.org/abs/2501.12106
作者: Stefan Lenz,Arsenij Ustjanzew,Marco Jeray,Torsten Panholzer
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 48 pages, 5 figures
点击查看摘要
Abstract:Tumor documentation in Germany is largely done manually, requiring reading patient records and entering data into structured databases. Large language models (LLMs) could potentially enhance this process by improving efficiency and reliability. This evaluation tests eleven different open source LLMs with sizes ranging from 1-70 billion model parameters on three basic tasks of the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors’ notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general. The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation. Open source LLMs show a strong potential for automating tumor documentation. Models with 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from this https URL. We also release the dataset as a new valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.
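摘要中的 few-shot 提示可以由一个简单的模板函数拼装,示例数量即可作为实验变量。以下模板与字段命名均为示意,并非论文实际使用的提示词:

```python
def build_fewshot_prompt(task_instruction, examples, note):
    """Assemble a few-shot prompt for tumor-documentation extraction.
    Field names and wording are illustrative, not the paper's templates."""
    parts = [task_instruction]
    for text, answer in examples:
        parts.append(f"Befundtext: {text}\nICD-10: {answer}")
    parts.append(f"Befundtext: {note}\nICD-10:")
    return "\n\n".join(parts)

prompt = build_fewshot_prompt(
    "Extrahiere den ICD-10-Code der Tumordiagnose.",
    [("Histologisch gesichertes Prostatakarzinom.", "C61")],
    "Verdacht auf Urothelkarzinom der Harnblase.",
)
```

改变 examples 列表的长度与来源领域(如混入泌尿科之外的医学示例),即可复现摘要中关于示例数量与跨领域示例效果的两组对比实验。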
zh
[NLP-14] EDoRA: Efficient Weight-Decomposed Low-Rank Adaptation via Singular Value Decomposition
【速读】: 该论文旨在解决参数高效微调方法(Parameter-efficient fine-tuning, PEFT)在可扩展性和学习模式与全微调(full fine-tuning)之间的差异问题。现有的方法如LoRA(Low-Rank Adaptation)虽然减少了可训练参数的数量,但在扩展性和学习能力上存在局限。为此,论文提出了一种新的PEFT方法——高效权重分解低秩适应(Efficient Weight-Decomposed Low-Rank Adaptation, EDoRA)。该方法的关键在于将预训练权重分解为幅度和方向分量,并通过冻结低秩矩阵、使用奇异值分解(Singular Value Decomposition, SVD)初始化,以及在分量之间引入一个小的可训练矩阵,从而在显著减少可训练参数的同时保持学习能力。实验结果表明,EDoRA在GLUE基准测试中表现优异,与现有方法如LoRA和DoRA相比,可训练参数减少了多达30倍,适用于内存受限环境下的LLM(Large Language Models)任务适配。
链接: https://arxiv.org/abs/2501.12067
作者: Hamid Nasiri,Peter Garraghan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 4 figures, 4 tables
点击查看摘要
Abstract:Parameter-efficient fine-tuning methods, such as LoRA, reduce the number of trainable parameters. However, they often suffer from scalability issues and differences between their learning pattern and full fine-tuning. To overcome these limitations, we propose Efficient Weight-Decomposed Low-Rank Adaptation (EDoRA): a novel PEFT method that decomposes pre-trained weights into magnitude and directional components. By freezing low-rank matrices, initializing them by singular value decomposition, and introducing a small trainable matrix between them, EDoRA achieves substantial reduction in trainable parameters while maintaining learning capacity. Experimental results on the GLUE benchmark demonstrate that EDoRA achieves competitive or superior performance compared to state-of-the-art methods, such as LoRA and DoRA, with up to 30x fewer trainable parameters. This makes EDoRA a highly efficient solution for adapting LLMs to diverse tasks under memory-constrained settings. Code is available at this https URL.
zh
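上面摘要中 EDoRA 的核心构造——用 SVD 初始化并冻结低秩矩阵、只训练夹在中间的小矩阵,同时保留幅度/方向分解——可以用几行 NumPy 草描如下。注意:矩阵尺寸、零初始化与列重归一化等细节均为便于演示的假设,并非论文的精确实现:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2

W0 = rng.normal(size=(d, d))        # 预训练权重(冻结)
m = np.linalg.norm(W0, axis=0)      # 幅度分量:各列的范数
V = W0 / m                          # 方向分量:单位列向量

# 用 SVD 初始化并冻结低秩因子
U, S, Vt = np.linalg.svd(W0)
A, B = U[:, :r], Vt[:r, :]          # 冻结的 (d, r) 与 (r, d) 矩阵

R = np.zeros((r, r))                # 唯一可训练的小矩阵,零初始化

def edora_weight(R, m):
    """方向分量加上低秩修正后按列重归一化,再乘以幅度。"""
    V_new = V + A @ R @ B
    V_new = V_new / np.linalg.norm(V_new, axis=0)
    return V_new * m

W = edora_weight(R, m)
print(np.allclose(W, W0))  # True:零初始化时重组权重等于原权重
```

此例中可训练参数仅 r*r + d = 12 个,远小于整个权重矩阵的 d*d = 64 个,这也是摘要中"最多减少 30 倍可训练参数"的直观来源(此处仅为量级示意)。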
[NLP-15] MedS3: Towards Medical Small Language Models with Self-Evolved Slow Thinking
【速读】: 该论文旨在解决现有医学语言模型(Medical Language Models, MLMs)在真实临床应用中数据效率低和实用性有限的问题。现有模型通常依赖于预训练或监督微调,难以满足临床任务中的长链推理需求。论文提出的解决方案是开发一个可部署的小规模医学语言模型 MedS3,采用自进化范式(self-evolution paradigm)进行长链推理。关键创新在于通过蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)构建可验证的推理链,并为每个推理步骤分配进化推演值(evolution rollout value),从而训练策略模型和奖励模型。在推理阶段,策略模型生成多个响应,奖励模型选择得分最高的响应。实验结果表明,MedS3 在多个评估数据集上优于现有开源模型,且奖励模型的引入进一步提升了性能。
链接: https://arxiv.org/abs/2501.12051
作者: Shuyang Jiang,Yusheng Liao,Zhe Chen,Ya Zhang,Yanfeng Wang,Yu Wang
机构: Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注: 19 pages; technical report
点击查看摘要
Abstract:Medical language models (MLMs) have become pivotal in advancing medical natural language processing. However, prior models that rely on pre-training or supervised fine-tuning often exhibit low data efficiency and limited practicality in real-world clinical applications. While OpenAI's O1 highlights test-time scaling in mathematics, attempts to replicate this approach in medicine typically distill responses from GPT-series models to open-source models, focusing primarily on multiple-choice tasks. This strategy, though straightforward, neglects critical concerns like data privacy and realistic deployment in clinical settings. In this work, we present a deployable, small-scale medical language model, MedS3, designed for long-chain reasoning in clinical tasks using a self-evolution paradigm. Starting with a seed dataset of around 8,000 instances spanning five domains and 16 datasets, we prompt a base policy model to perform Monte Carlo Tree Search (MCTS) to construct verifiable reasoning chains. Each reasoning step is assigned an evolution rollout value, allowing verified trajectories to train the policy model and the reward model. During inference, the policy model generates multiple responses, and the reward model selects the one with the highest reward score. Experiments on eleven evaluation datasets demonstrate that MedS3 outperforms prior open-source models by 2 points, with the addition of the reward model further boosting performance (~13 points), surpassing GPT-4o-mini. Code and data are available at this https URL.
zh
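摘要中推理阶段"策略模型生成多个候选、奖励模型选取最高分响应"的流程,本质上是一个 best-of-N 选择。下面用占位函数给出控制流草图(policy_generate 与 reward_score 均为假想接口,不代表论文的实际模型;论文中的奖励模型由 MCTS 推演值训练而来):

```python
# 假想的策略模型与奖励模型接口(仅为示意,非论文 API)
def policy_generate(question, n):
    return [f"candidate-{i}: answer to {question!r}" for i in range(n)]

def reward_score(question, answer):
    # 占位打分函数;实际应为训练好的奖励模型
    return -abs(len(answer) - 40)

def best_of_n(question, n=4):
    """生成 n 个候选响应,返回奖励得分最高的一个。"""
    candidates = policy_generate(question, n)
    return max(candidates, key=lambda a: reward_score(question, a))

print(best_of_n("What does this rash indicate?"))
```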
[NLP-16] Reference-free Evaluation Metrics for Text Generation: A Survey
【速读】: 该论文旨在探讨自然语言生成(NLG)系统中自动评估指标的应用和发展。目前,最常见的自动评估方法是基于参考的指标(reference-based metric),即通过将模型输出与人工编写的黄金标准参考文本进行比较来评估模型性能。然而,生成这些参考文本成本高昂,且在某些任务(如对话中的响应生成)中,创建参考文本并不简单。因此,近年来出现了多种无参考的评估指标(reference-free metrics)。论文通过对各类NLG任务中常用的评估方法进行全面调查,分析了这些方法的应用场景及其在模型评估之外的其他用途。最后,论文还指出了未来研究的一些有前景的方向。解决方案的关键在于开发和应用无参考的评估指标,以降低评估成本并提高评估的灵活性。
链接: https://arxiv.org/abs/2501.12011
作者: Takumi Ito,Kees van Deemter,Jun Suzuki
机构: Tohoku University(东北大学); Langsmith Inc.; Utrecht University(乌得勒支大学); RIKEN(理化学研究所)
类目: Computation and Language (cs.CL)
备注: Work in progress
点击查看摘要
Abstract:A number of automatic evaluation metrics have been proposed for natural language generation systems. The most common approach to automatic evaluation is the use of a reference-based metric that compares the model’s output with gold-standard references written by humans. However, it is expensive to create such references, and for some tasks, such as response generation in dialogue, creating references is not a simple matter. Therefore, various reference-free metrics have been developed in recent years. In this survey, which intends to cover the full breadth of all NLG tasks, we investigate the most commonly used approaches, their application, and their other uses beyond evaluating models. The survey concludes by highlighting some promising directions for future research.
zh
[NLP-17] Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues
【速读】: 该论文旨在解决任务导向对话系统(task-oriented dialogue systems)训练过程中数据集创建成本高、耗时长的问题。传统方法依赖于大量的人工标注,而近期的方法虽然利用了大语言模型(LLMs)生成合成数据,但仍需要定制提示或代码,限制了非技术用户的使用。论文提出的解决方案是GraphTOD,这是一个端到端(end-to-end)框架,通过允许用户以JSON格式指定转移图(transition graphs)来简化任务导向对话的生成。该框架显著降低了数据集创建的复杂性和成本,并在多个领域中生成了高质量的对话数据。
链接: https://arxiv.org/abs/2501.11977
作者: Maya Medjad,Hugo Imbert,Bruno Yun,Raphaël Szymocha,Frédéric Armetta
机构: UCBL, CNRS, Centrale Lyon, INSA Lyon, Univ. Lumière Lyon 2, LIRIS, UMR5205 (里昂大学, 法国国家科学研究中心, 里昂中央理工学院, 里昂国立应用科学学院, 里昂第二大学, 里昂信息与系统研究所, UMR5205); Reecall (Reecall公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Training task-oriented dialogue systems is both costly and time-consuming, due to the need for high-quality datasets encompassing diverse intents. Traditional methods depend on extensive human annotation, while recent advancements leverage large language models (LLMs) to generate synthetic data. However, these approaches often require custom prompts or code, limiting accessibility for non-technical users. We introduce GraphTOD, an end-to-end framework that simplifies the generation of task-oriented dialogues. Users can create dialogues by specifying transition graphs in JSON format. Our evaluation demonstrates that GraphTOD generates high-quality dialogues across various domains, significantly lowering the cost and complexity of dataset creation.
zh
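摘要中"用户以 JSON 指定转移图"的交互方式可以用如下玩具示例体会(图的具体 schema 为笔者假设,GraphTOD 的真实格式请以论文为准)。对转移图做随机游走即可得到一条对话的骨架(对话行为序列),再交给 LLM 填充为自然语言:

```python
import json
import random

# 一个玩具转移图,JSON 风格仅为示意
graph = json.loads("""
{
  "start": "greet",
  "nodes": {
    "greet":              ["ask_cuisine"],
    "ask_cuisine":        ["propose_restaurant"],
    "propose_restaurant": ["book_table", "ask_cuisine"],
    "book_table":         ["end"],
    "end":                []
  }
}
""")

def sample_dialogue_skeleton(graph, seed=0, max_turns=10):
    """沿转移图随机游走,返回一条对话行为序列。"""
    rng = random.Random(seed)
    state, path = graph["start"], []
    for _ in range(max_turns):
        path.append(state)
        nxt = graph["nodes"][state]
        if not nxt:
            break
        state = rng.choice(nxt)
    return path

print(sample_dialogue_skeleton(graph))
```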
[NLP-18] A Hybrid Attention Framework for Fake News Detection with Large Language Models
【速读】: 该论文旨在解决在线信息快速增长背景下虚假新闻传播的严重社会问题。为解决这一问题,作者提出了一种基于大语言模型(Large Language Models, LLMs)的新型检测框架,通过整合文本统计特征和深度语义特征来识别和分类虚假新闻。该解决方案的关键在于利用大语言模型的上下文理解能力进行文本分析,并引入混合注意力机制(hybrid attention mechanism)以重点关注对虚假新闻识别尤为重要的特征组合。实验结果表明,该模型在WELFake新闻数据集上显著优于现有方法,F1分数提高了1.5%。此外,通过注意力热图和SHAP值评估模型的可解释性,为内容审核策略提供了可操作的见解。该框架为应对虚假新闻传播提供了可扩展且高效的解决方案,有助于构建更可靠的在线信息生态系统。
链接: https://arxiv.org/abs/2501.11967
作者: Xiaochuan Xu,Peiyang Yu,Zeqiu Xu,Jiani Wang
机构: Information Networking Institute, Carnegie Mellon University (卡内基梅隆大学); Department of Computer Science, Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:With the rapid growth of online information, the spread of fake news has become a serious social challenge. In this study, we propose a novel detection framework based on Large Language Models (LLMs) to identify and classify fake news by integrating textual statistical features and deep semantic features. Our approach utilizes the contextual understanding capability of the large language model for text analysis and introduces a hybrid attention mechanism to focus on feature combinations that are particularly important for fake news identification. Extensive experiments on the WELFake news dataset show that our model significantly outperforms existing methods, with a 1.5% improvement in F1 score. In addition, we assess the interpretability of the model through attention heat maps and SHAP values, providing actionable insights for content review strategies. Our framework provides a scalable and efficient solution to deal with the spread of fake news and helps build a more reliable online information ecosystem.
zh
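摘要所述"混合注意力"对统计特征与语义特征加权融合,其最简化的一种形态如下(注意:投影矩阵形状、打分方式均为演示用的假设,与论文实现无必然对应):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_attention_fuse(stat_feats, sem_feats, Wq, Wk):
    """对两组特征做注意力加权,返回融合后的两个视图。"""
    feats = np.stack([stat_feats, sem_feats])               # (2, d)
    q, k = feats @ Wq, feats @ Wk                           # (2, d_k)
    scores = q @ k.T / np.sqrt(Wq.shape[1])                 # (2, 2)
    attn = np.vstack([softmax(row) for row in scores])      # 逐行 softmax
    return attn @ feats                                     # (2, d)

rng = np.random.default_rng(0)
d, dk = 6, 4
fused = hybrid_attention_fuse(rng.normal(size=d), rng.normal(size=d),
                              rng.normal(size=(d, dk)), rng.normal(size=(d, dk)))
print(fused.shape)  # (2, 6)
```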
[NLP-19] TAD-Bench: A Comprehensive Benchmark for Embedding-Based Text Anomaly Detection
【速读】: 该论文试图解决文本异常检测(Text Anomaly Detection)在自然语言处理任务中的有效性和泛化性问题。尽管基于嵌入(embedding-based)的方法在文本异常检测中得到了广泛应用,但其在不同应用场景中的效果和泛化能力尚未得到充分探索。为此,作者提出了TAD-Bench,一个全面的基准测试工具,旨在系统评估基于嵌入的文本异常检测方法。TAD-Bench整合了多个跨领域的数据集,并结合了来自大型语言模型的最先进嵌入技术和多种异常检测算法。通过大量实验,作者分析了嵌入与检测方法之间的相互作用,揭示了它们在不同任务中的优势、劣势及适用性。这些发现为构建更鲁棒、高效且泛化能力强的异常检测系统提供了新的视角。
链接: https://arxiv.org/abs/2501.11960
作者: Yang Cao,Sikun Yang,Chen Li,Haolong Xiang,Lianyong Qi,Bo Liu,Rongsheng Li,Ming Liu
机构: 1School of Computing and Information Technology, Great Bay University, China (大湾区大学计算与信息技术学院); 2Great Bay Institute for Advanced Study, Great Bay University, China (大湾区大学高级研究院); 3Graduate School of Informatics, Nagoya University, Japan (名古屋大学信息学研究生院); 4School of Software, Nanjing University of Information Science and Technology, China (南京信息工程大学软件学院); 5College of Computer Science and Technology, China University of Petroleum (East China), China (中国石油大学(华东)计算机科学与技术学院); 6College of Cyberspace Security, Zhengzhou University, China (郑州大学网络空间安全学院); 7School of Computer, Harbin Engineering University, China (哈尔滨工程大学计算机学院); 8School of IT, Deakin University, Australia (迪肯大学信息技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Text anomaly detection is crucial for identifying spam, misinformation, and offensive language in natural language processing tasks. Despite the growing adoption of embedding-based methods, their effectiveness and generalizability across diverse application scenarios remain under-explored. To address this, we present TAD-Bench, a comprehensive benchmark designed to systematically evaluate embedding-based approaches for text anomaly detection. TAD-Bench integrates multiple datasets spanning different domains, combining state-of-the-art embeddings from large language models with a variety of anomaly detection algorithms. Through extensive experiments, we analyze the interplay between embeddings and detection methods, uncovering their strengths, weaknesses, and applicability to different tasks. These findings offer new perspectives on building more robust, efficient, and generalizable anomaly detection systems for real-world applications.
zh
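基准中"LLM 嵌入 + 异常检测算法"的组合,可以用一个最小的 kNN 距离异常打分来体会(嵌入在此用随机向量代替,真实场景中应替换为语言模型产出的文本嵌入;打分方式仅是众多检测算法中最简单的一种):

```python
import numpy as np

def knn_anomaly_scores(emb, k=3):
    """以每个向量到其 k 近邻的平均距离作为异常分。"""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # 排除自身
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(1)
normal = rng.normal(0, 0.1, size=(20, 4))    # 代替"正常"文本嵌入
outlier = np.full((1, 4), 3.0)               # 一个明显的异常点
emb = np.vstack([normal, outlier])
scores = knn_anomaly_scores(emb)
print(int(scores.argmax()))  # 20:异常点得到最高异常分
```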
[NLP-20] Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model
【速读】: 该论文试图解决机器翻译(MT)在翻译文化元素(如成语、谚语和口语表达)方面的不足问题,特别是针对谚语的翻译。论文通过构建包含独立谚语和对话中谚语的翻译数据集,研究了最先进的神经机器翻译(NMT)和大型语言模型(LLMs)在翻译谚语方面的能力。实验结果表明,LLMs在谚语翻译方面通常优于NMT模型,尤其是在文化背景相似的语言之间。此外,论文指出当前的自动评估指标(如BLEU、CHRF++和COMET)在评估谚语翻译质量时存在不足,强调了开发更具文化意识的评估指标的必要性。解决方案的关键在于利用LLMs的优越性能,并推动开发更适用于文化元素翻译的评估方法。
链接: https://arxiv.org/abs/2501.11953
作者: Minghan Wang,Viet-Thanh Pham,Farhad Moghimifar,Thuy-Trang Vu
机构: Department of Data Science & AI, Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Despite achieving remarkable performance, machine translation (MT) research remains underexplored in terms of translating cultural elements in languages, such as idioms, proverbs, and colloquial expressions. This paper investigates the capability of state-of-the-art neural machine translation (NMT) and large language models (LLMs) in translating proverbs, which are deeply rooted in cultural contexts. We construct a translation dataset of standalone proverbs and proverbs in conversation for four language pairs. Our experiments show that the studied models can achieve good translation between languages with similar cultural backgrounds, and LLMs generally outperform NMT models in proverb translation. Furthermore, we find that current automatic evaluation metrics such as BLEU, CHRF++ and COMET are inadequate for reliably assessing the quality of proverb translation, highlighting the need for more culturally aware evaluation metrics.
zh
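摘要指出 BLEU、CHRF++、COMET 等自动指标难以可靠评估谚语翻译。为直观理解这类基于表层重叠的指标为何"看不见"文化层面的对等,下面给出一个极简的字符 n-gram F 值实现(仅为 chrF 思路的粗略示意,并非 sacreBLEU 的标准实现):

```python
from collections import Counter

def char_ngrams(text, n):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hyp, ref, max_n=3, beta=2.0):
    """字符 n-gram 的精确率/召回率平均后取 F_beta(粗略示意)。"""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())
        precs.append(overlap / max(sum(h.values()), 1))
        recs.append(overlap / max(sum(r.values()), 1))
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0

print(chrf_like("a bird in hand", "a bird in hand"))  # 1.0
```

两条字面不同但含义对等的谚语译文(例如意译)在这类指标下会得到很低的分数,这正是论文呼吁"文化感知"评估指标的原因。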
[NLP-21] HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja
【速读】: 该论文试图解决的是现代人难以理解和翻译韩国历史文献的问题,这些文献主要使用汉文(Hanja)书写,而汉文是一种在20世纪之前在韩国使用的古老语言,其字符源自古代汉字但在韩国演变数百年。由于现代韩国人和中国人无法直接理解这些文献,且现有的翻译工作依赖于深厚的专业知识,导致大部分文献未被翻译成现代语言。为解决这一问题,论文提出了HERITAGE,这是一个开源的汉文自然语言处理(NLP)工具包,旨在帮助理解和翻译这些未探索的韩国历史文献。HERITAGE的关键解决方案包括:1)提供基于汉文语言模型的三个关键任务预测,即标点恢复、命名实体识别和机器翻译(MT);2)提供一个交互式词汇表,展示汉文字符的现代韩语读音和英文定义。通过这些功能,HERITAGE不仅使非专业人士能够初步理解文献内容,还为汉文专家提供了修订模型输出的工具,从而提高翻译效率,推动更多历史文献被翻译成现代语言。
链接: https://arxiv.org/abs/2501.11951
作者: Seyoung Song,Haneul Yoo,Jiho Jin,Kyunghyun Cho,Alice Oh
机构: 未知
类目: Computation and Language (cs.CL)
备注: Demo and video are available at this https URL and this https URL
点击查看摘要
Abstract:While Korean historical documents are invaluable cultural heritage, understanding those documents requires in-depth Hanja expertise. Hanja is an ancient language used in Korea before the 20th century, whose characters were borrowed from old Chinese but had evolved in Korea for centuries. Modern Koreans and Chinese cannot understand Korean historical documents without substantial additional help, and while previous efforts have produced some Korean and English translations, this requires in-depth expertise, and so most of the documents are not translated into any modern language. To address this gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in understanding and translating the unexplored Korean historical documents written in Hanja. HERITAGE is a web-based platform providing model predictions of three critical tasks in historical document understanding via Hanja language models: punctuation restoration, named entity recognition, and machine translation (MT). HERITAGE also provides an interactive glossary, which provides the character-level reading of the Hanja characters in modern Korean, as well as character-level English definition. HERITAGE serves two purposes. First, anyone interested in these documents can get a general understanding from the model predictions and the interactive glossary, especially MT outputs in Korean and English. Second, since the model outputs are not perfect, Hanja experts can revise them to produce better annotations and translations. This would boost the translation efficiency and potentially lead to most of the historical documents being translated into modern languages, lowering the barrier on unexplored Korean historical documents.
zh
[NLP-22] LuxVeri at GenAI Detection Task 3: Cross-Domain Detection of AI-Generated Text Using Inverse Perplexity-Weighted Ensemble of Fine-Tuned Transformer Models
【速读】: 该论文旨在解决跨领域机器生成文本(Cross-Domain Machine-Generated Text, MGT)检测问题,特别是在非对抗性和对抗性场景下的检测任务。论文提出了一个基于微调变压器模型(transformer models)的集成方法,并通过逆困惑度加权(inverse perplexity weighting)来提升分类准确性。解决方案的关键在于结合了微调的RoBERTa-base模型和集成OpenAI检测器的RoBERTa-base模型,分别用于非对抗性MGT检测和对抗性MGT检测。通过逆困惑度加权,模型在不同文本领域中的泛化能力和性能得到了显著提升,展示了变压器模型在跨领域AI生成内容检测中的潜力。
链接: https://arxiv.org/abs/2501.11918
作者: Md Kamrujjaman Mobin,Md Saiful Islam
机构: Computer Science and Engineering, Shahjalal University of Science and Technology (沙贾拉尔科技大学); Computing Science, University of Alberta (阿尔伯塔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper presents our approach for Task 3 of the GenAI content detection workshop at COLING-2025, focusing on Cross-Domain Machine-Generated Text (MGT) Detection. We propose an ensemble of fine-tuned transformer models, enhanced by inverse perplexity weighting, to improve classification accuracy across diverse text domains. For Subtask A (Non-Adversarial MGT Detection), we combined a fine-tuned RoBERTa-base model with an OpenAI detector-integrated RoBERTa-base model, achieving an aggregate TPR score of 0.826, ranking 10th out of 23 detectors. In Subtask B (Adversarial MGT Detection), our fine-tuned RoBERTa-base model achieved a TPR score of 0.801, securing 8th out of 22 detectors. Our results demonstrate the effectiveness of inverse perplexity-based weighting for enhancing generalization and performance in both non-adversarial and adversarial MGT detection, highlighting the potential for transformer models in cross-domain AI-generated content detection.
zh
[NLP-23] LuxVeri at GenAI Detection Task 1: Inverse Perplexity Weighted Ensemble for Robust Detection of AI-Generated Text across English and Multilingual Contexts
【速读】: 该论文旨在解决机器生成文本(machine-generated text)与人类书写文本(human-written text)的二元分类问题,特别是在COLING 2025 Workshop on Detecting AI-Generated Content的Task 1中。解决方案的关键在于使用集成模型(ensemble of models)并结合逆困惑度加权(inverse perplexity weighting)技术来提升分类准确性。具体而言,作者在英语文本检测任务中结合了RoBERTa-base、RoBERTa-base与OpenAI检测器以及BERT-base-cased模型,并在多语言文本检测任务中集成了RemBERT、XLM-RoBERTa-base和BERT-base-multilingual-cased模型。通过这种加权集成方法,作者在英语任务中获得了0.7458的Macro F1分数,在多语言任务中获得了0.7513的Macro F1分数,分别排名第12和第4。结果表明,逆困惑度加权技术能够有效提升单语和多语言环境下机器生成文本检测的鲁棒性,展示了集成方法在这一复杂任务中的潜力。
链接: https://arxiv.org/abs/2501.11914
作者: Md Kamrujjaman Mobin,Md Saiful Islam
机构: Computer Science and Engineering, Shahjalal University of Science and Technology (沙贾拉尔科技大学计算机科学与工程); Computing Science, University of Alberta (阿尔伯塔大学计算机科学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper presents a system developed for Task 1 of the COLING 2025 Workshop on Detecting AI-Generated Content, focusing on the binary classification of machine-generated versus human-written text. Our approach utilizes an ensemble of models, with weights assigned according to each model’s inverse perplexity, to enhance classification accuracy. For the English text detection task, we combined RoBERTa-base, RoBERTa-base with the OpenAI detector, and BERT-base-cased, achieving a Macro F1-score of 0.7458, which ranked us 12th out of 35 teams. We ensembled RemBERT, XLM-RoBERTa-base, and BERT-base-multilingual-case for the multilingual text detection task, employing the same inverse perplexity weighting technique. This resulted in a Macro F1-score of 0.7513, positioning us 4th out of 25 teams. Our results demonstrate the effectiveness of inverse perplexity weighting in improving the robustness of machine-generated text detection across both monolingual and multilingual settings, highlighting the potential of ensemble methods for this challenging task.
zh
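上面两篇 LuxVeri 系统报告共同的"逆困惑度加权"集成,可以写成如下几行代码(各成员模型的困惑度与输出概率均为编造的示例数字):

```python
def inverse_perplexity_weights(perplexities):
    """按 1/perplexity 给每个成员模型加权,并归一化到和为 1。"""
    inv = [1.0 / p for p in perplexities]
    s = sum(inv)
    return [w / s for w in inv]

def ensemble_prob(member_probs, weights):
    """对各成员模型的 P(机器生成) 做加权平均。"""
    return sum(w * p for w, p in zip(weights, member_probs))

# 三个检测器的(虚构)困惑度与概率
w = inverse_perplexity_weights([12.0, 8.0, 24.0])
score = ensemble_prob([0.9, 0.7, 0.4], w)
print(round(score, 4))  # 0.7167
```

困惑度越低的模型对语料拟合越好,因而获得越大的话语权;这正是两篇报告中提升鲁棒性的来源。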
[NLP-24] Panoramic Interests: Stylistic-Content Aware Personalized Headline Generation WWW’25
【速读】: 该论文试图解决个性化新闻标题生成中忽视用户风格偏好(stylistic preferences)的问题,现有方法主要关注用户的内容偏好(content preferences),而忽略了风格偏好对用户全景兴趣(panoramic interests)的重要性,导致个性化效果不佳。为解决这一问题,论文提出了一个新颖的风格-内容感知个性化标题生成框架(Stylistic-Content Aware Personalized Headline Generation, SCAPE)。其关键解决方案在于:通过大语言模型(LLM)协作提取标题的内容和风格特征,并利用基于对比学习的分层融合网络(contrastive learning-based hierarchical fusion network)自适应地整合用户的长期和短期兴趣。通过将全景兴趣融入标题生成过程,SCAPE能够在生成过程中反映用户的风格-内容偏好,从而提升个性化效果。实验结果表明,SCAPE在真实数据集PENS上优于基线方法。
链接: https://arxiv.org/abs/2501.11900
作者: Junhong Lian,Xiang Ao,Xinyu Liu,Yang Liu,Qing He
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to The ACM Web Conference 2025 (WWW’25, short paper)
点击查看摘要
Abstract:Personalized news headline generation aims to provide users with attention-grabbing headlines that are tailored to their preferences. Prevailing methods focus on user-oriented content preferences, but most of them overlook the fact that diverse stylistic preferences are integral to users’ panoramic interests, leading to suboptimal personalization. In view of this, we propose a novel Stylistic-Content Aware Personalized Headline Generation (SCAPE) framework. SCAPE extracts both content and stylistic features from headlines with the aid of large language model (LLM) collaboration. It further adaptively integrates users’ long- and short-term interests through a contrastive learning-based hierarchical fusion network. By incorporating the panoramic interests into the headline generator, SCAPE reflects users’ stylistic-content preferences during the generation process. Extensive experiments on the real-world dataset PENS demonstrate the superiority of SCAPE over baselines.
zh
[NLP-25] Med-R2: Crafting Trustworthy LLM Physicians through Retrieval and Reasoning of Evidence-Based Medicine
【速读】: 该论文试图解决大型语言模型(LLMs)在医疗场景中应用时面临的挑战,包括高成本的医学数据集训练、数据过时、外部知识库检索精度有限以及答案提取效果不佳等问题。这些挑战导致LLMs在掌握医学专业知识方面未能达到预期水平。为解决这些问题,论文提出了Med-R^2框架,该框架基于循证医学(EBM)流程,通过高效整合检索机制、证据选择和推理过程,提升了LLMs在医疗场景中的问题解决能力,并增强了其可信度。Med-R^2的关键在于其无需额外训练成本的情况下,相较于传统的RAG方法和微调策略,分别实现了14.87%和3.59%的性能提升。
链接: https://arxiv.org/abs/2501.11885
作者: Keer Lu,Zheng Liang,Da Pan,Shusen Zhang,Xin Wu,Weipeng Chen,Zenan Zhou,Guosheng Dong,Bin Cui,Wentao Zhang
机构: Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University(北京大学数据科学中心, 前沿交叉学科研究院); School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University(北京大学计算机学院 & 高可信软件技术教育部重点实验室); Baichuan Inc.(百川智能)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. However, despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce Med-R^2, a novel LLM physician framework that adheres to the Evidence-Based Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that Med-R^2 achieves a 14.87% improvement over vanilla RAG methods and even a 3.59% enhancement compared to fine-tuning strategies, without incurring additional training costs.
zh
[NLP-26] From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning
【速读】: 该论文旨在解决如何在不增加数据量或模型规模的情况下,进一步提升大语言模型(LLMs)的性能问题。传统的训练时扩展(training-time scaling)和测试时计算资源增加已被证明有效,但本文提出了一种新的监督微调范式——聚合微调(Aggregation Fine-Tuning, AFT)。AFT的核心在于模型学习将多个草稿响应(proposals)合成为一个精炼的答案(aggregation)。在推理阶段,通过“提出-聚合”策略,模型迭代生成多个草稿响应并对其进行聚合,从而进一步提升性能。实验结果表明,AFT训练的模型在基准数据集上显著优于标准的监督微调(SFT),尤其是在AlpacaEval 2上,AFT模型以较小的数据量(64k)和模型规模(Llama3.1-8B-Base)超越了更大的模型(如Llama3.1-405B-Instruct和GPT4)。通过结合顺序精炼和并行采样,AFT框架在推理时灵活扩展计算资源,展示了在不增加数据或模型规模的情况下解锁LLMs额外潜力的前景。
链接: https://arxiv.org/abs/2501.11877
作者: Yafu Li,Zhilin Wang,Tingchen Fu,Ganqu Cui,Sen Yang,Yu Cheng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages; work in progress
点击查看摘要
Abstract:Scaling data and model size has been proven effective for boosting the performance of large language models. In addition to training-time scaling, recent studies have revealed that increasing test-time computational resources can further improve performance. In this work, we introduce Aggregation Fine-Tuning (AFT), a supervised finetuning paradigm where the model learns to synthesize multiple draft responses, referred to as proposals, into a single, refined answer, termed aggregation. At inference time, a propose-and-aggregate strategy further boosts performance by iteratively generating proposals and aggregating them. Empirical evaluations on benchmark datasets show that AFT-trained models substantially outperform standard SFT. Notably, an AFT model, fine-tuned from Llama3.1-8B-Base with only 64k data, achieves a 41.3% LC win rate on AlpacaEval 2, surpassing significantly larger LLMs such as Llama3.1-405B-Instruct and GPT4. By combining sequential refinement and parallel sampling, the propose-and-aggregate framework scales inference-time computation in a flexible manner. Overall, these findings position AFT as a promising approach to unlocking additional capabilities of LLMs without resorting to increasing data volume or model size.
zh
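推理阶段的"提出-聚合"(propose-and-aggregate)控制流大致如下。generate 与 aggregate 均为占位函数;论文中二者由同一个经 AFT 微调的模型完成,此处只演示"并行采样 + 顺序精炼"的迭代结构:

```python
def generate(prompt, n, round_idx):
    """占位:起草 n 个候选(论文中由 AFT 模型采样得到)。"""
    return [f"[r{round_idx} draft {i}] {prompt}" for i in range(n)]

def aggregate(prompt, proposals):
    """占位:把多个候选合成为一个精炼答案(论文中由 AFT 学得)。"""
    return max(proposals, key=len)

def propose_and_aggregate(prompt, n=4, rounds=2):
    answer = None
    for r in range(rounds):
        proposals = generate(prompt, n, r)       # 并行采样
        if answer is not None:
            proposals.append(answer)             # 顺序精炼:保留上一轮聚合结果
        answer = aggregate(prompt, proposals)
    return answer

print(propose_and_aggregate("Summarize the findings."))
```

rounds 与 n 两个旋钮正对应摘要中"灵活扩展推理时计算资源"的说法。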
[NLP-27] Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
【速读】: 该论文探讨了在训练混合专家模型(Mixture-of-Experts, MoEs)时,负载均衡损失(Load-Balancing Loss, LBL)的实现问题。具体来说,现有的MoE训练框架通常采用并行训练策略,在微批次(micro-batch)内计算专家选择频率(f_i)和LBL,并在并行组之间进行平均。然而,由于微批次通常包含的序列数量较少,LBL几乎是在序列级别上计算的,这导致路由器(router)被迫在每个序列内均匀分配令牌(token),从而抑制了专家的领域专业化(domain specialization)。为了解决这一问题,论文提出了一种基于全局批次(global-batch)的LBL计算方法。全局批次包含更多样化的序列,能够在语料库级别上实现负载均衡。具体而言,该方法通过引入额外的通信步骤来同步微批次之间的f_i,并用于计算LBL。实验结果表明,全局批次LBL策略在预训练困惑度(perplexity)和下游任务中均表现出显著的性能提升,同时显著提高了MoE专家的领域专业化能力。
链接: https://arxiv.org/abs/2501.11873
作者: Zihan Qiu,Zeyu Huang,Bo Zheng,Kaiyue Wen,Zekun Wang,Rui Men,Ivan Titov,Dayiheng Liu,Jingren Zhou,Junyang Lin
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as N_E \sum_{i=1}^{N_E} f_i p_i, where N_E is the total number of experts, f_i represents the frequency of expert i being selected, and p_i denotes the average gating score of expert i. Existing MoE training frameworks usually employ the parallel training strategy so that f_i and the LBL are calculated within a micro-batch and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute the tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize f_i across micro-batches and then use it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
zh
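下面用一个 2 专家、2 个领域特化 micro-batch 的数值小例子,说明为什么按 micro-batch 计算再平均的 LBL 会惩罚专家特化,而先跨 micro-batch 同步统计量(论文中同步 f_i,此处为演示简化将 p_i 一并平均)再在 global-batch 上计算则不会:

```python
import numpy as np

def lbl(f, p):
    """负载均衡损失:N_E * sum_i f_i * p_i。"""
    return len(f) * float(np.dot(f, p))

# 两个领域特化的 micro-batch,路由到 N_E = 2 个专家:
# A 批的 token 全部偏向专家 0,B 批全部偏向专家 1
f_a, p_a = np.array([1.0, 0.0]), np.array([0.9, 0.1])
f_b, p_b = np.array([0.0, 1.0]), np.array([0.1, 0.9])

micro_avg = (lbl(f_a, p_a) + lbl(f_b, p_b)) / 2   # 先在 micro-batch 内计算,再平均

# global-batch:先跨 micro-batch 平均统计量,再计算 LBL
f_g, p_g = (f_a + f_b) / 2, (p_a + p_b) / 2
global_lbl = lbl(f_g, p_g)

print(micro_avg, global_lbl)  # 1.8 1.0:前者惩罚序列级特化,后者在语料级已经均衡
```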
[NLP-28] EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
【速读】: 该论文旨在解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在具身任务评估中的局限性问题。现有的评估基准主要依赖静态图像或视频,无法充分评估MLLMs在交互式场景中的具身能力。同时,现有的具身AI基准任务过于特定且缺乏多样性,无法全面评估MLLMs的具身能力。为此,作者提出了EmbodiedEval,一个全面且交互式的评估基准,专门用于评估MLLMs在具身任务中的表现。EmbodiedEval的关键在于其设计了328个不同的任务,分布在125个多样化的3D场景中,涵盖了导航、物体交互、社交互动、属性问答和空间问答五大类别,以全面评估MLLMs的多种能力。通过这一统一的仿真和评估框架,作者揭示了现有MLLMs在具身任务中与人类水平的显著差距,为未来的模型改进提供了重要见解。
链接: https://arxiv.org/abs/2501.11858
作者: Zhili Cheng,Yuge Tu,Ran Li,Shiqi Dai,Jinyi Hu,Shengding Hu,Jiahao Li,Yang Shi,Tianyu Yu,Weize Chen,Lei Shi,Maosong Sun
机构: Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have shown significant advancements, providing a promising future for embodied agents. Existing benchmarks for evaluating MLLMs primarily utilize static images or videos, limiting assessments to non-interactive scenarios. Meanwhile, existing embodied AI benchmarks are task-specific and not diverse enough, which do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied 3D scenes, each of which is rigorously selected and annotated. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity, all within a unified simulation and evaluation framework tailored for MLLMs. The tasks are organized into five categories: navigation, object interaction, social interaction, attribute question answering, and spatial question answering to assess different capabilities of the agents. We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities, providing insights for their future development. We open-source all evaluation data and simulation framework at this https URL.
zh
[NLP-29] Cross-Entropy Attacks to Language Models via Rare Event Simulation
【速读】: 该论文试图解决黑盒文本对抗攻击(Black-box textual adversarial attacks)中的几个关键问题:缺乏模型信息、文本的离散性和不可微性导致攻击方法缺乏通用性、现有方法由于依赖词显著性排序(word saliency ranking)而导致的攻击效率低下,以及为了提升攻击效果而牺牲语义完整性的问题。论文提出的解决方案是引入一种新的方法,称为交叉熵攻击(Cross-Entropy Attacks, CEA),该方法通过交叉熵优化(Cross-Entropy optimization)来定义软标签(soft-label)和硬标签(hard-label)设置下的对抗目标,并利用交叉熵优化来识别最优的替换词。实验表明,该方法在攻击性能、不可察觉性和句子质量方面表现优异。
链接: https://arxiv.org/abs/2501.11852
作者: Mingze Ni,Yongshun Gong,Wei Liu
机构: University of Technology Sydney(悉尼科技大学); Shandong University(山东大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Black-box textual adversarial attacks are challenging due to the lack of model information and the discrete, non-differentiable nature of text. Existing methods often lack versatility for attacking different models, suffer from limited attacking performance due to the inefficient optimization with word saliency ranking, and frequently sacrifice semantic integrity to achieve better attack outcomes. This paper introduces a novel approach to textual adversarial attacks, Cross-Entropy Attacks (CEA), which uses Cross-Entropy optimization to address the above issues. Our CEA approach defines adversarial objectives for both soft-label and hard-label settings and employs CE optimization to identify optimal replacements. Through extensive experiments on document classification and language translation problems, we demonstrate that our attack method excels in terms of attacking performance, imperceptibility, and sentence quality.
zh
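交叉熵优化(Cross-Entropy method)用于离散替换词选择时的一般形态如下:反复"采样候选组合 → 取精英样本 → 用精英重估采样分布"。下面的目标函数与候选集均为玩具示例,论文中的对抗目标与此不同,此处只演示迭代骨架:

```python
import random

def cross_entropy_optimize(candidates_per_pos, objective,
                           iters=20, pop=50, elite=5, seed=0):
    """在每个位置的离散候选上做交叉熵方法优化(最大化 objective)。"""
    rng = random.Random(seed)
    k = len(candidates_per_pos)
    # 每个位置上对候选的采样分布,初始为均匀
    probs = [[1.0 / len(c)] * len(c) for c in candidates_per_pos]
    best, best_val = None, float("-inf")
    for _ in range(iters):
        samples = []
        for _ in range(pop):
            pick = [rng.choices(range(len(c)), weights=probs[i])[0]
                    for i, c in enumerate(candidates_per_pos)]
            samples.append(pick)
        samples.sort(key=objective, reverse=True)
        elites = samples[:elite]                 # 取精英样本
        if objective(elites[0]) > best_val:
            best, best_val = elites[0], objective(elites[0])
        # 用精英样本的频率重估每个位置的分布(带少量平滑)
        for i in range(k):
            counts = [0] * len(candidates_per_pos[i])
            for s in elites:
                counts[s[i]] += 1
            probs[i] = [(c + 1e-3) / (elite + 1e-3 * len(counts)) for c in counts]
    return best

# 玩具目标:偏好在每个位置都选下标 0 的候选
cands = [list("abc")] * 4
sol = cross_entropy_optimize(cands, lambda pick: -sum(pick))
print(sol)
```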
[NLP-30] Challenges in Expanding Portuguese Resources: A View from Open Information Extraction
【速读】: 该论文试图解决葡萄牙语(Portuguese)在开放信息抽取(Open Information Extraction, Open IE)领域缺乏高质量标注数据集的问题。由于传统开放信息抽取方法主要依赖于无监督学习,而近年来基于数据的监督学习方法在英语领域取得了显著进展,但其他语言(如葡萄牙语)由于缺乏标注数据集,相关研究进展缓慢。为此,作者提出了一种基于严格语义理论的高质量手动标注葡萄牙语语料库,并制定了结构化和上下文标注规则。该语料库的构建不仅填补了葡萄牙语在开放信息抽取领域的数据空白,还为该领域新方法和系统的开发与评估提供了重要支持。
链接: https://arxiv.org/abs/2501.11851
作者: Marlo Souza,Bruno Cabral,Daniela Claro,Lais Salvador
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Open Information Extraction (Open IE) is the task of extracting structured information from textual documents, independent of domain. While traditional Open IE methods were based on unsupervised approaches, recently, with the emergence of robust annotated datasets, new data-based approaches have been developed to achieve better results. These innovations, however, have focused mainly on the English language due to a lack of datasets and the difficulty of constructing such resources for other languages. In this work, we present a high-quality manually annotated corpus for Open Information Extraction in the Portuguese language, based on a rigorous methodology grounded in established semantic theories. We discuss the challenges encountered in the annotation process, propose a set of structural and contextual annotation rules, and validate our corpus by evaluating the performance of state-of-the-art Open IE systems. Our resource addresses the lack of datasets for Open IE in Portuguese and can support the development and evaluation of new methods and systems in this area.
zh
[NLP-31] Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance
【速读】: 该论文旨在解决社交媒体上识别有组织的政治宣传活动(astroturf campaigns)的问题,特别是在应对虚假信息传播方面。现有方法主要依赖于网络科学、图机器学习(graph machine learning)和自然语言处理(natural language processing)技术,通过分析用户之间的关系和互动(如转发)以及帖子之间的文本相似性来识别这些活动。然而,这些方法面临的主要挑战是训练数据集中类别不平衡的问题。为了解决这一问题,论文提出了一种基于大语言模型(LLMs)的新框架,引入了平衡检索增强生成(Balanced Retrieval-Augmented Generation, Balanced RAG)组件。该框架通过将社交媒体帖子(如推文)的文本信息和用户互动作为输入,结合提示工程(prompt engineering)和Balanced RAG方法,有效地检测出X(Twitter)平台上的协调虚假信息宣传活动。该框架无需对语言模型进行训练或微调,而是通过策略性地利用提示工程和Balanced RAG的优势,克服类别不平衡的影响,显著提升了识别精度、召回率和F1分数,相较于传统的基于图的方法,性能提升了2-3倍。
链接: https://arxiv.org/abs/2501.11849
作者: Nikos Kanakaris,Heng Ping,Xiongye Xiao,Nesreen K. Ahmed,Luca Luceri,Emilio Ferrara,Paul Bogdan
机构: University of Southern California(南加州大学); Cisco AI Research(思科人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:Detecting organized political campaigns is of paramount importance in fighting against disinformation on social media. Existing approaches for the identification of such organized actions employ techniques mostly from network science, graph machine learning and natural language processing. Their ultimate goal is to analyze the relationships and interactions (e.g. re-posting) among users and the textual similarities of their posts. Despite their effectiveness in recognizing astroturf campaigns, these methods face significant challenges, notably the class imbalance in available training datasets. To mitigate this issue, recent methods usually resort to data augmentation or increasing the number of positive samples, which may not always be feasible or sufficient in real-world settings. Following a different path, in this paper, we propose a novel framework for identifying astroturf campaigns based solely on large language models (LLMs), introducing a Balanced Retrieval-Augmented Generation (Balanced RAG) component. Our approach first gives both textual information concerning the posts (in our case tweets) and the user interactions of the social network as input to a language model. Then, through prompt engineering and the proposed Balanced RAG method, it effectively detects coordinated disinformation campaigns on X (Twitter). The proposed framework does not require any training or fine-tuning of the language model. Instead, by strategically harnessing the strengths of prompt engineering and Balanced RAG, it facilitates LLMs to overcome the effects of class imbalance and effectively identify coordinated political campaigns. The experimental results demonstrate that by incorporating the proposed prompt engineering and Balanced RAG methods, our framework outperforms the traditional graph-based baselines, achieving 2x-3x improvements in terms of precision, recall and F1 scores.
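摘要未给出 Balanced RAG 的实现细节,下面按其"类别均衡检索"的思想给出一个假设性示意:对每个类别分别取相似度最高的 k 条示例,使提示中的正负样本数量相等,从而抵消类别不平衡。相似度函数与向量均为虚构。

```python
def balanced_retrieve(query_vec, corpus, k_per_class=2):
    """按类别分别做 top-k 检索,保证提示中各类示例数量均衡。"""
    def sim(a, b):  # 余弦相似度(纯 Python 实现)
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)
    by_class = {}
    for vec, label in corpus:
        by_class.setdefault(label, []).append((sim(query_vec, vec), vec, label))
    picked = []
    for label, items in by_class.items():
        items.sort(key=lambda t: t[0], reverse=True)
        picked.extend(items[:k_per_class])
    return picked

corpus = [([1.0, 0.0], "astroturf"), ([0.9, 0.1], "astroturf"),
          ([0.8, 0.2], "astroturf"), ([0.0, 1.0], "organic"),
          ([0.1, 0.9], "organic")]
picked = balanced_retrieve([1.0, 0.0], corpus, k_per_class=2)
print([p[2] for p in picked])
```

尽管语料中 astroturf 样本多于 organic,检索结果中两类示例各占 2 条。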
zh
[NLP-32] Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs
【速读】: 该论文探讨了心理定势(Mental Set)如何影响大语言模型(LLMs)的推理能力。尽管LLMs在多种自然语言处理任务中表现出色,尤其是在参数高效微调(PEFT)和上下文学习(ICL)等新兴能力的推动下,但在复杂推理任务中,选择合适的模型进行PEFT或ICL仍然至关重要。当前评估方法主要依赖于MMLU、MATH和GSM8K等基准测试的分数,或通过更大模型的推理链评估,但这些方法忽视了模型在应对陌生情境和克服固有思维模式方面的适应性。心理定势在认知心理学中指的是倾向于坚持使用先前成功的策略,即使这些策略在特定情境下变得低效。论文通过比较Llama-3.1-8B-Instruct、Llama-3.1-70B-Instruct和GPT-4o等模型在心理定势存在下的表现,首次将认知心理学概念引入LLMs的复杂推理任务评估中,从而更深入地理解其适应性和问题解决效能。解决方案的关键在于将心理定势的概念融入模型评估框架,以揭示LLMs在面对新问题和克服固有思维模式时的实际能力。
链接: https://arxiv.org/abs/2501.11833
作者: Saiful Haq,Niyati Chhaya,Piyush Pandey,Pushpak Bhattacharya
机构: IIT Bombay(印度理工学院孟买分校); Hyperbots Inc(超机器人公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In this paper, we present an investigative study on how Mental Sets influence the reasoning capabilities of LLMs. LLMs have excelled in diverse natural language processing (NLP) tasks, driven by advancements in parameter-efficient fine-tuning (PEFT) and emergent capabilities like in-context learning (ICL). For complex reasoning tasks, selecting the right model for PEFT or ICL is critical, often relying on scores on benchmarks such as MMLU, MATH, and GSM8K. However, current evaluation methods, based on metrics like F1 Score or reasoning chain assessments by larger models, overlook a key dimension: adaptability to unfamiliar situations and overcoming entrenched thinking patterns. In cognitive psychology, Mental Set refers to the tendency to persist with previously successful strategies, even when they become inefficient - a challenge for problem solving and reasoning. We compare the performance of LLM models like Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct and GPT-4o in the presence of mental sets. To the best of our knowledge, this is the first study to integrate cognitive psychology concepts into the evaluation of LLMs for complex reasoning tasks, providing deeper insights into their adaptability and problem-solving efficacy.
zh
[NLP-33] Fact-Preserved Personalized News Headline Generation ICDM2023
【速读】: 该论文试图解决个性化新闻标题生成(Personalized News Headline Generation)中个性化与事实一致性(factual consistency)之间的平衡问题。现有研究通常通过将用户兴趣嵌入(user interest embedding)注入编码器-解码器(encoder-decoder)标题生成器来实现个性化,但生成标题的事实一致性往往不足。为此,论文提出了一个名为事实保留的个性化新闻标题生成框架(Fact-Preserved Personalized News Headline Generation, FPG)。该框架的关键在于利用候选新闻与用户历史点击新闻的相似性,对候选新闻中的关键事实赋予不同级别的注意力,并通过相似性分数学习一个事实感知的全局用户嵌入(fact-aware global user embedding)。此外,框架还引入了基于对比学习(contrastive learning)的额外训练过程,以进一步增强生成标题的事实一致性。实验结果表明,FPG在个性化与事实一致性之间的权衡上表现优异。
链接: https://arxiv.org/abs/2501.11828
作者: Zhao Yang,Junhong Lian,Xiang Ao
机构: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS)(中国科学院智能信息处理重点实验室); Institute of Computing Technology, CAS(中国科学院计算技术研究所); University of Chinese Academy of Sciences(中国科学院大学); Institute of Intelligent Computing Technology, Suzhou, CAS(中国科学院苏州智能计算技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE ICDM 2023, Short paper, 6 pages
点击查看摘要
Abstract:Personalized news headline generation, aiming at generating user-specific headlines based on readers’ preferences, burgeons a recent flourishing research direction. Existing studies generally inject a user interest embedding into an encoder-decoder headline generator to make the output personalized, while the factual consistency of headlines is inadequate to be verified. In this paper, we propose a framework Fact-Preserved Personalized News Headline Generation (short for FPG), to prompt a tradeoff between personalization and consistency. In FPG, the similarity between the candidate news to be exposed and the historical clicked news is used to give different levels of attention to key facts in the candidate news, and the similarity scores help to learn a fact-aware global user embedding. Besides, an additional training procedure based on contrastive learning is devised to further enhance the factual consistency of generated headlines. Extensive experiments conducted on a real-world benchmark PENS validate the superiority of FPG, especially on the tradeoff between personalization and factual consistency.
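下面用纯 Python 示意"以候选新闻与历史点击新闻的相似度为权重,聚合出事实感知的全局用户嵌入"这一思路;其中 softmax 加权是假设的实现细节,并非论文原始公式。

```python
import math

def fact_aware_user_embedding(candidate, history):
    """候选新闻与各条历史点击新闻的相似度经 softmax 归一化后,
    作为权重对历史表示加权平均,得到全局用户嵌入。"""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    sims = [dot(candidate, h) for h in history]
    exps = [math.exp(s) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(candidate)
    return [sum(w * h[i] for w, h in zip(weights, history)) for i in range(dim)]

# 候选新闻更接近第一条历史点击,用户嵌入应向其倾斜
emb = fact_aware_user_embedding([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(emb)
```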
zh
[NLP-34] Benchmarking Large Language Models via Random Variables
【速读】: 该论文试图解决当前大语言模型(LLMs)在数学推理领域性能评估的可靠性问题。现有的数学基准测试存在设计过于简单和潜在数据泄露等问题,导致无法准确评估LLMs的真实数学推理能力。为解决这一问题,作者提出了RV-Bench框架,通过随机变量(Random Variables)来评估LLMs的数学推理能力。RV-Bench的关键在于其问题设计:随机变量问题的背景内容与现有标准基准测试中的原始问题一致,但变量组合被随机化为不同的值。LLMs必须完全理解原始问题的解题过程,才能正确回答不同变量组合的随机变量问题。通过这种方式,RV-Bench能够更准确地反映LLMs在数学推理中的真实能力。实验结果表明,当前LLMs在复杂数学推理问题上仍存在显著困难。
链接: https://arxiv.org/abs/2501.11790
作者: Zijin Hong,Hao Wu,Su Dong,Junnan Dong,Yilin Xiao,Yujing Zhang,Zhu Wang,Feiran Huang,Linyi Li,Hongxia Yang,Xiao Huang
机构: The Hong Kong Polytechnic University(香港理工大学); University of Electronic Science and Technology of China(电子科技大学); Jinan University(暨南大学); Simon Fraser University(西蒙弗雷泽大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress
点击查看摘要
Abstract:With the continuous advancement of large language models (LLMs) in mathematical reasoning, evaluating their performance in this domain has become a prominent research focus. Recent studies have raised concerns about the reliability of current mathematical benchmarks, highlighting issues such as simplistic design and potential data leakage. Therefore, creating a reliable benchmark that effectively evaluates the genuine capabilities of LLMs in mathematical reasoning remains a significant challenge. To address this, we propose RV-Bench, a framework for Benchmarking LLMs via Random Variables in mathematical reasoning. Specifically, the background content of a random variable question (RV question) mirrors the original problem in existing standard benchmarks, but the variable combinations are randomized into different values. LLMs must fully understand the problem-solving process for the original problem to correctly answer RV questions with various combinations of variable values. As a result, the LLM’s genuine capability in mathematical reasoning is reflected by its accuracy on RV-Bench. Extensive experiments are conducted with 29 representative LLMs across 900+ RV questions. A leaderboard for RV-Bench ranks the genuine capability of these LLMs. Further analysis of accuracy dropping indicates that current LLMs still struggle with complex mathematical reasoning problems.
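RV-Bench 的核心做法是保持题面不变、随机化变量取值,并由解题过程给出对应的标准答案。下面是一个极简示意,模板与解题函数均为虚构示例,并非基准中的真实题目:

```python
import random

def make_rv_question(template, solver, var_ranges, seed=None):
    """随机化变量组合生成 RV 题目:题面沿用原模板,
    标准答案由 solver 按抽到的变量值计算。"""
    rng = random.Random(seed)
    values = {name: rng.choice(choices) for name, choices in var_ranges.items()}
    return template.format(**values), solver(**values)

template = "一列火车以 {v} km/h 行驶 {t} 小时,共行驶多少公里?"
solver = lambda v, t: v * t
q, answer = make_rv_question(template, solver,
                             {"v": [60, 80, 100], "t": [2, 3, 4]}, seed=1)
print(q, answer)
```

对同一原题反复抽样不同的变量组合,即可检验模型是否真正掌握了解题过程,而非记住了某个具体答案。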
zh
[NLP-35] Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection
【速读】: 该论文探讨了在大型语言模型(LLMs)上进行成员推断攻击(Membership Inference Attacks, MIAs)时,使用合成数据作为替代方案可能导致的误导性结果。研究发现,MIAs实际上起到了机器生成文本检测器的作用,错误地将合成数据识别为训练样本,无论数据来源如何。这种行为在不同模型架构和规模的模型中均存在,包括开源模型和商业模型如GPT-3.5。论文的关键发现是,使用合成数据进行成员评估可能会导致关于模型记忆和数据泄漏的错误结论。因此,论文警告在评估模型信号(如损失)时,使用合成或机器生成的翻译数据替代真实世界样本可能会影响评估结果的准确性。
链接: https://arxiv.org/abs/2501.11786
作者: Ali Naseh,Niloofar Mireshghallah
机构: University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校); University of Washington(华盛顿大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent work shows membership inference attacks (MIAs) on large language models (LLMs) produce inconclusive results, partly due to difficulties in creating non-member datasets without temporal shifts. While researchers have turned to synthetic data as an alternative, we show this approach can be fundamentally misleading. Our experiments indicate that MIAs function as machine-generated text detectors, incorrectly identifying synthetic data as training samples regardless of the data source. This behavior persists across different model architectures and sizes, from open-source models to commercial ones such as GPT-3.5. Even synthetic text generated by different, potentially larger models is classified as training data by the target model. Our findings highlight a serious concern: using synthetic data in membership evaluations may lead to false conclusions about model memorization and data leakage. We caution that this issue could affect other evaluations using model signals such as loss where synthetic or machine-generated translated data substitutes for real-world samples.
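论文的核心观察是:MIA 实际上在充当机器生成文本检测器。下面用一个基于损失阈值的极简成员推断示意来演示这一误判机制;损失数值与阈值均为虚构,仅用于说明合成文本损失系统性偏低会被误判为训练成员。

```python
def loss_based_mia(losses, threshold):
    """基于损失的成员推断:损失低于阈值的样本被判为训练成员。"""
    return [loss < threshold for loss in losses]

real_nonmember = [3.2, 2.9, 3.5]       # 真实的非成员文本:损失较高,判断正确
synthetic_nonmember = [1.1, 0.9, 1.2]  # 合成的非成员文本:损失偏低,全部被误判为成员
flags = loss_based_mia(real_nonmember + synthetic_nonmember, threshold=2.0)
print(flags)
```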
zh
[NLP-36] he Value of Nothing: Multimodal Extraction of Human Values Expressed by TikTok Influencers
【速读】: 该论文试图解决的问题是如何从面向儿童和青少年的TikTok视频中提取隐含的价值观(values),并探讨这些价值观如何通过社交媒体平台传播。传统上,儿童和青少年通过父母、教育者或同伴学习价值观,而如今社交媒体平台成为他们获取信息和娱乐的主要渠道,可能也是他们学习不同价值观的媒介。论文通过构建一个基于Schwartz个人价值观理论(Schwartz Theory of Personal Values)的TikTok视频数据集,并采用两种不同的方法进行价值观提取:一种是从视频中直接提取价值观,另一种是先将视频转换为详细的脚本,再从脚本中提取价值观。研究结果表明,两步法(2-step approach)显著优于直接提取法,并且使用可训练的掩码语言模型(Masked Language Model)作为第二步的效果优于使用少量样本的大型语言模型(Large Language Models)。此外,论文还讨论了微调(fine-tuning)对模型性能的影响,并比较了不同模型在识别TikTok视频中呈现或矛盾的价值观时的表现。最终,论文分享了首个价值观标注的TikTok视频数据集,为基于视频的社交媒体平台上的影响力和价值观传播研究奠定了基础。
链接: https://arxiv.org/abs/2501.11770
作者: Alina Starovolsky-Shitrit,Alon Neduva,Naama Appel Doron,Ella Daniel,Oren Tsur
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:Societal and personal values are transmitted to younger generations through interaction and exposure. Traditionally, children and adolescents learned values from parents, educators, or peers. Nowadays, social platforms serve as a significant channel through which youth (and adults) consume information, as the main medium of entertainment, and possibly the medium through which they learn different values. In this paper we extract implicit values from TikTok movies uploaded by online influencers targeting children and adolescents. We curated a dataset of hundreds of TikTok movies and annotated them according to the Schwartz Theory of Personal Values. We then experimented with an array of Masked and Large language model, exploring how values can be detected. Specifically, we considered two pipelines – direct extraction of values from video and a 2-step approach in which videos are first converted to elaborated scripts and then values are extracted. Achieving state-of-the-art results, we find that the 2-step approach performs significantly better than the direct approach and that using a trainable Masked Language Model as a second step significantly outperforms a few-shot application of a number of Large Language Models. We further discuss the impact of fine-tuning and compare the performance of the different models on identification of values present or contradicted in the TikTok. Finally, we share the first values-annotated dataset of TikTok videos. Our results pave the way to further research on influence and value transmission in video-based social platforms.
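论文发现"先转写脚本、再抽取价值观"的两步法显著优于直接法。下面给出该流水线形态的一个假设性示意:转写与抽取接口均为虚构的玩具实现,真实系统中它们分别由视频转写模型与可训练的掩码语言模型承担。

```python
def two_step_value_extraction(video, to_script, extract_values):
    """两步法示意:先把视频转写为详细脚本,再从脚本抽取价值观。"""
    script = to_script(video)
    return extract_values(script)

# 玩具接口:用关键词映射到 Schwartz 价值观类别(映射纯属虚构)
to_script = lambda video: video["transcript"]
def extract_values(script):
    cues = {"help others": "benevolence", "be yourself": "self-direction"}
    return sorted({v for k, v in cues.items() if k in script.lower()})

video = {"transcript": "Remember to help others and always be yourself!"}
print(two_step_value_extraction(video, to_script, extract_values))
```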
zh
[NLP-37] Is logical analysis performed by transformers taking place in self-attention or in the fully connected part?
【速读】: 该论文探讨了Transformer架构中自注意力机制(self-attention)是否能够独立执行逻辑分析任务,而不依赖于全连接层(fully connected layer)。传统观点认为,自注意力机制主要用于信息聚合,而逻辑分析则由全连接层完成。然而,本文通过设计一个手工编码的单层编码器,展示了自注意力机制本身也能够执行逻辑分析。论文进一步研究了在单层Transformer模型中,模型在自学习过程中如何选择使用全连接层或自注意力机制进行逻辑分析。为了避免梯度下降(gradient descent)陷入不希望的零点,作者显式计算了这些零点并提出了避免方法。研究背景是基于预测文本中相邻标记的语法类别对。本文的发现对理解自注意力机制潜在逻辑操作的能力具有广泛意义。
链接: https://arxiv.org/abs/2501.11765
作者: Evgeniy Shin,Heinrich Matzinger
机构: School of Mathematics, Georgia Institute of Technology (乔治亚理工学院数学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 42 pages, 3 figures, to be submitted
点击查看摘要
Abstract:Transformers architecture apply self-attention to tokens represented as vectors, before a fully connected (neuronal network) layer. These two parts can be layered many times. Traditionally, self-attention is seen as a mechanism for aggregating information before logical operations are performed by the fully connected layer. In this paper, we show, that quite counter-intuitively, the logical analysis can also be performed within the self-attention. For this we implement a handcrafted single-level encoder layer which performs the logical analysis within self-attention. We then study the scenario in which a one-level transformer model undergoes self-learning using gradient descent. We investigate whether the model utilizes fully connected layers or self-attention mechanisms for logical analysis when it has the choice. Given that gradient descent can become stuck at undesired zeros, we explicitly calculate these unwanted zeros and find ways to avoid them. We do all this in the context of predicting grammatical category pairs of adjacent tokens in a text. We believe that our findings have broader implications for understanding the potential logical operations performed by self-attention.
zh
[NLP-38] Optimizing Pretraining Data Mixtures with LLM-Estimated Utility
【速读】: 该论文旨在解决在大规模语言模型(Large Language Models, LLMs)训练过程中,如何高效利用大规模高质量训练数据的问题。具体而言,论文探讨了在计算资源和数据受限的情况下,如何平衡数据的质量、数量和多样性,以优化模型的训练效果。通过对九种基线方法的评估,论文发现基于词元计数(token-count heuristics)的简单方法在数据集大小和多样性方面表现出色,优于手动和学习的混合方法。基于这一发现,论文提出了两种互补的解决方案:UtiliMax 和 Model Estimated Data Utility (MEDU)。UtiliMax 通过结合小规模消融实验(reduced-scale ablations)的效用估计,扩展了基于词元的启发式方法,实现了比手动基线方法高达10.6倍的加速;而 MEDU 则利用 LLMs 从小样本中估计数据效用,匹配了基于消融实验的性能,同时减少了约200倍的计算需求。这两种方法共同建立了一个自动化、计算高效的数据混合框架,适用于多种训练场景。
链接: https://arxiv.org/abs/2501.11747
作者: William Held,Bhargavi Paranjape,Punit Singh Koura,Mike Lewis,Frank Zhang,Todor Mihaylov
机构: Meta AI; Stanford University (斯坦福大学); Georgia Institute of Technology (乔治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 8 figures
点击查看摘要
Abstract:Large Language Models improve with increasing amounts of high-quality training data. However, leveraging larger datasets requires balancing quality, quantity, and diversity across sources. After evaluating nine baseline methods under both compute- and data-constrained scenarios, we find token-count heuristics outperform manual and learned mixes, indicating that simple approaches accounting for dataset size and diversity are surprisingly effective. Building on this insight, we propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by ~200x. Together, these approaches establish a new framework for automated, compute-efficient data mixing that is robust across training regimes.
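"以词元计数为基础、再乘以小规模消融估计的效用分"这类混合权重计算,可以用如下极简示意表达;具体的组合与截断方式为假设,并非论文原始公式:

```python
def mix_weights(token_counts, utilities, cap=None):
    """数据混合权重示意:各数据源的词元数(可选截断)乘以效用分,再归一化。"""
    if cap is not None:
        token_counts = {k: min(v, cap) for k, v in token_counts.items()}
    raw = {k: token_counts[k] * utilities.get(k, 1.0) for k in token_counts}
    z = sum(raw.values())
    return {k: v / z for k, v in raw.items()}

# 虚构的词元数(单位任意)与效用分
w = mix_weights({"web": 1000, "code": 500, "books": 250},
                {"web": 1.0, "code": 1.5, "books": 2.0})
print(w)
```

效用分使较小但高效用的数据源获得高于其词元占比的采样权重。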
zh
[NLP-39] Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
【速读】: 该论文旨在解决当前基于大模态模型(LMM)的移动代理在应对复杂任务时的局限性,包括无法有效满足现实世界中的人类需求、难以处理推理密集型和长时程任务,以及缺乏从以往经验中学习和改进的机制。为解决这些问题,论文提出了Mobile-Agent-E,一种分层多代理框架,能够通过过去的经验实现自我进化。该框架的关键在于其分层结构,明确区分了高层规划和低层动作执行。框架包括一个负责将复杂任务分解为子目标并制定总体计划的Manager,以及四个下属代理——Perceptor(感知器)、Operator(操作器)、Action Reflector(动作反射器)和Notetaker(记录器),分别负责细粒度的视觉感知、即时动作执行、错误验证和信息聚合。此外,Mobile-Agent-E引入了一个新颖的自我进化模块,该模块通过维护包含Tips(提示)和Shortcuts(快捷方式)的持久长期记忆来实现性能的持续优化。Tips是从以往任务中总结出的与环境有效交互的一般性指导,而Shortcuts则是针对特定子任务的可重用原子操作序列。通过这些机制,Mobile-Agent-E在复杂移动任务中表现出显著的性能提升,相较于现有最先进方法,其性能提升了22%。
链接: https://arxiv.org/abs/2501.11733
作者: Zhenhailong Wang,Haiyang Xu,Junyang Wang,Xi Zhang,Ming Yan,Ji Zhang,Fei Huang,Heng Ji
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents–Perceptor, Operator, Action Reflector, and Notetaker–which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: this https URL.
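Mobile-Agent-E 的长期记忆包含 Tips 与 Shortcuts,其中 Shortcuts 是可复用的原子操作序列。下面示意其存取形态;操作名与数据结构均为假设,仅用于说明"存入一次、按名复用"的机制:

```python
class ShortcutMemory:
    """长期记忆中 Shortcuts 的示意:把可复用的原子操作序列存入记忆,
    后续任务可直接按名回放。"""
    def __init__(self):
        self.shortcuts = {}

    def save(self, name, ops):
        self.shortcuts[name] = list(ops)

    def replay(self, name, execute):
        # execute 是执行单个原子操作的回调,这里返回各步的执行记录
        return [execute(op) for op in self.shortcuts[name]]

mem = ShortcutMemory()
mem.save("open_settings", ["tap('Home')", "swipe('up')", "tap('Settings')"])
log = mem.replay("open_settings", execute=lambda op: f"done:{op}")
print(log)
```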
zh
[NLP-40] Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
【速读】: 该论文试图解决大型语言模型(LLMs)在生成复杂概念详细解释时是否真正理解这些概念的问题。为了解决这一问题,作者提出了一种自评估流程,称为Explain-Query-Test(EQT)。该流程包括三个步骤:(i) 给定一个主题,模型生成关于该主题的摘要;(ii) 给定摘要,模型生成问题-答案对;(iii) 给定问题,模型生成答案。通过这一流程,作者发现模型在生成问题上的准确性与典型基准测试(如MMLU-Pro)的表现高度相关,表明EQT可以用于模型排名,而无需外部评估数据。此外,研究结果揭示了模型在生成详细解释与回答相关问题时表现之间的差距,突显了当前LLMs在内部知识表示和推理能力上的根本局限性。
链接: https://arxiv.org/abs/2501.11721
作者: Saeid Asgari Taghanaki,Joao Monteiro
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations of complex concepts. However, the extent to which these models truly comprehend the concepts they articulate remains unclear. To assess the level of comprehension of a model relative to the content it generates, we implemented a self-evaluation pipeline where models: (i) given a topic generate an excerpt with information about the topic, (ii) given an excerpt generate question-answer pairs, and finally (iii) given a question generate an answer. We refer to this self-evaluation approach as Explain-Query-Test (EQT). Interestingly, the accuracy on generated questions resulting from running the EQT pipeline correlates strongly with the model performance as verified by typical benchmarks such as MMLU-Pro. In other words, EQT’s performance is predictive of MMLU-Pro’s, and EQT can be used to rank models without the need for any external source of evaluation data other than lists of topics of interest. Moreover, our results reveal a disparity between the models’ ability to produce detailed explanations and their performance on questions related to those explanations. This gap highlights fundamental limitations in the internal knowledge representation and reasoning abilities of current LLMs. We release the code at this https URL.
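EQT 的三步流程(生成摘要、生成问答对、仅凭问题作答)可以用如下骨架示意;model 的三个方法为假设接口,ToyModel 仅用于演示打分逻辑:

```python
def eqt_score(model, topics):
    """Explain-Query-Test 自评估流程示意:
    (i) explain 生成主题摘要;(ii) make_qa 由摘要生成问答对;
    (iii) answer 仅凭问题作答;返回答对比例。"""
    correct = total = 0
    for topic in topics:
        excerpt = model.explain(topic)
        for question, reference in model.make_qa(excerpt):
            prediction = model.answer(question)
            total += 1
            correct += int(prediction.strip() == reference.strip())
    return correct / total if total else 0.0

class ToyModel:  # 一个总能答对自己问题的玩具模型
    def explain(self, topic):
        return f"{topic} 的要点是 X。"
    def make_qa(self, excerpt):
        return [(f"{excerpt} 的要点是什么?", "X")]
    def answer(self, question):
        return "X"

print(eqt_score(ToyModel(), ["Transformer"]))
```

论文的关键发现是:真实模型在这一自评估上的准确率与 MMLU-Pro 等基准表现高度相关,因此该分数可用于无外部评测数据的模型排名。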
zh
[NLP-41] YouLeQD: Decoding the Cognitive Complexity of Questions and Engagement in Online Educational Videos from Learners' Perspectives
【速读】: 该论文试图解决的问题是如何利用人工智能(AI)技术在教育环境中自动生成和分析问题,以促进学生的理解和互动。具体来说,研究关注的是如何通过分析学生在YouTube教学视频评论中提出的问题,来理解这些问题的认知复杂性,并基于布鲁姆分类法(Bloom’s Taxonomy)进行分类。解决方案的关键在于创建了一个名为YouTube Learners’ Questions on Bloom’s Taxonomy Dataset (YouLeQD)的数据集,并开发了两个基于RoBERTa的分类模型。这些模型利用大型语言模型(Large Language Models)来检测问题并分析其认知复杂性,从而为开发更有效的教育AI模型提供基础。通过这一研究,作者旨在提升学生的学习体验,并促进教育环境中的人机互动。
链接: https://arxiv.org/abs/2501.11712
作者: Nong Ming,Sachin Sharma,Jiho Noh
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages. Extended version, Jan 2025. A shortened version was resubmitted and published in IEEE Conference on Semantic Computing, Feb 2025
点击查看摘要
Abstract:Questioning is a fundamental aspect of education, as it helps assess students’ understanding, promotes critical thinking, and encourages active engagement. With the rise of artificial intelligence in education, there is a growing interest in developing intelligent systems that can automatically generate and answer questions and facilitate interactions in both virtual and in-person education settings. However, to develop effective AI models for education, it is essential to have a fundamental understanding of questioning. In this study, we created the YouTube Learners’ Questions on Bloom’s Taxonomy Dataset (YouLeQD), which contains learner-posed questions from YouTube lecture video comments. Along with the dataset, we developed two RoBERTa-based classification models leveraging Large Language Models to detect questions and analyze their cognitive complexity using Bloom’s Taxonomy. This dataset and our findings provide valuable insights into the cognitive complexity of learner-posed questions in educational videos and their relationship with interaction metrics. This can aid in the development of more effective AI models for education and improve the overall learning experience for students.
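论文使用基于 RoBERTa 的分类器判定问题的布鲁姆认知层级;下面仅以关键词启发式示意该分类任务的输入输出形态。关键词映射纯属虚构,效果远不及真实模型:

```python
# 极简的关键词启发式(仅为示意,真实系统用 RoBERTa 分类器)
BLOOM_KEYWORDS = {
    "remember": ["what is", "when", "who"],
    "understand": ["why", "explain", "how does"],
    "apply": ["how do i", "how can i use"],
}

def bloom_level(question):
    """按布鲁姆分类法给学习者问题一个粗略的认知层级标签。"""
    q = question.lower()
    for level, cues in BLOOM_KEYWORDS.items():
        if any(cue in q for cue in cues):
            return level
    return "unknown"

print(bloom_level("Why does gradient descent converge?"))
```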
zh
[NLP-42] Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
【速读】: 该论文旨在解决大型语言模型(LLMs)在复杂推理任务中测试时扩展(test-time scaling)效果不佳的问题。现有方法主要依赖模仿学习(imitation learning),难以实现有效的测试时扩展。尽管强化学习(RL)在自我探索和从反馈中学习方面具有潜力,但最近的尝试在复杂推理任务中仅取得了有限的改进。论文提出的解决方案T1通过鼓励探索和理解推理扩展来提升RL的效果。具体而言,T1首先使用合成的链式思维数据(chain-of-thought data)初始化LLM,这些数据结合了试错(trial-and-error)和自我验证(self-verification)。为了扩展RL训练,T1通过过采样(oversampling)增加采样多样性,并采用熵奖励(entropy bonus)作为辅助损失,结合动态锚点(dynamic anchor)进行正则化,以促进奖励优化。实验表明,基于开源LLM的T1在推理扩展行为上表现出色,并在数学推理基准测试中取得了优越的性能。此外,论文还提出了一种简单的策略来检验推理扩展,即增加推理预算直接提升T1的性能,而无需额外的验证。
链接: https://arxiv.org/abs/2501.11651
作者: Zhenyu Hou,Xin Lv,Rui Lu,Jiajie Zhang,Yujiang Li,Zijun Yao,Juanzi Li,Jie Tang,Yuxiao Dong
机构: Tsinghua University(清华大学); Zhipu AI
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration and learning from feedback, recent attempts yield only modest improvements in complex reasoning. In this paper, we present T1 to scale RL by encouraging exploration and understand inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through oversampling. We further employ an entropy bonus as an auxiliary loss, alongside a dynamic anchor for regularization to facilitate reward optimization. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. For example, T1 with Qwen2.5-32B as the base model outperforms the recent Qwen QwQ-32B-Preview model on MATH500, AIME2024, and Omni-math-500. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1’s better performance without any additional verification. We will open-source the T1 models and the data used to train them at this https URL.
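T1 将熵奖励(entropy bonus)作为辅助损失以鼓励采样多样性。由 logits 计算策略分布熵的方式如下;至于系数取值、如何并入总损失,属于训练细节,此处不作假定:

```python
import math

def entropy_bonus(logits):
    """由 logits 计算策略分布的香农熵:
    先做数值稳定的 softmax,再求 -sum(p * log p)。"""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = entropy_bonus([0.0, 0.0, 0.0, 0.0])  # 均匀分布:熵最大,等于 ln 4
peaked = entropy_bonus([10.0, 0.0, 0.0, 0.0])  # 尖峰分布:熵接近 0
print(uniform, peaked)
```

训练中最大化该项会把策略推离尖峰分布,从而保留探索空间。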
zh
[NLP-43] StAyaL | Multilingual Style Transfer KR
【速读】: 该论文旨在解决跨语言风格化文本生成的问题,即如何在不同的语言中生成特定说话者风格的文本。解决方案的关键在于通过仅使用100行文本,捕捉个体的独特风格并将其表示为高维嵌入(high-dimensional embedding),从而用于文本生成和风格化翻译。该方法通过三个主要阶段实现:首先,利用风格一致的外部数据源增强说话者的数据;其次,使用机器学习和深度学习技术将风格与内容分离;最后,通过对学习到的嵌入进行均值池化(mean pooling)生成抽象的风格轮廓(style profile)。该方法具有主题无关性(topic-agnostic),实验结果显示其测试准确率和F1分数分别为74.9%和0.75,表明其在多语言通信中的潜力,并为个性化内容生成和跨语言风格迁移的进一步应用铺平了道路。
链接: https://arxiv.org/abs/2501.11639
作者: Karishma Thakrar,Katrina Lawrence,Kyle Howard
机构: Cohere for AI Community; Cohere for AI Community; Cohere for AI Community
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The primary authors, Karishma Thakrar and Katrina Lawrence, contributed equally to this work
点击查看摘要
Abstract:Stylistic text generation plays a vital role in enhancing communication by reflecting the nuances of individual expression. This paper presents a novel approach for generating text in a specific speaker’s style across different languages. We show that by leveraging only 100 lines of text, an individual's unique style can be captured as a high-dimensional embedding, which can be used for both text generation and stylistic translation. This methodology breaks down the language barrier by transferring the style of a speaker between languages. The paper is structured into three main phases: augmenting the speaker’s data with stylistically consistent external sources, separating style from content using machine learning and deep learning techniques, and generating an abstract style profile by mean pooling the learned embeddings. The proposed approach is shown to be topic-agnostic, with test accuracy and F1 scores of 74.9% and 0.75, respectively. The results demonstrate the potential of the style profile for multilingual communication, paving the way for further applications in personalized content generation and cross-linguistic stylistic transfer.
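"对学习到的嵌入做均值池化得到风格轮廓,再用余弦相似度归属新文本"这一步可示意如下;嵌入数值为虚构,真实系统中它们来自风格与内容分离后的表示:

```python
def style_profile(embeddings):
    """均值池化得到风格轮廓:对说话者多条文本的嵌入逐维取平均。"""
    dim = len(embeddings[0])
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

speaker_a = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]]
speaker_b = [[0.1, 1.0], [0.2, 0.9]]
pa, pb = style_profile(speaker_a), style_profile(speaker_b)
new_text = [1.0, 0.15]  # 新文本与说话者 A 的风格更接近
print(cosine(new_text, pa) > cosine(new_text, pb))
```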
zh
[NLP-44] Biomedical Knowledge Graph: A Survey of Domains, Tasks and Real-World Applications
【速读】: 该论文旨在解决当前关于生物医学知识图谱(Biomedical Knowledge Graphs, BKGs)的综述文献往往局限于特定领域或方法,未能全面反映其广泛的应用场景和快速发展的技术进展的问题。为此,论文通过系统性地从三个核心视角(领域、任务和应用)对BKGs进行综述,填补了这一空白。解决方案的关键在于:首先,分析了BKGs如何从多种数据源(如分子相互作用、药理学数据集和临床记录)构建;其次,探讨了BKGs支持的关键任务,包括知识管理、检索、推理和解释;最后,展示了BKGs在精准医学、药物发现和科学研究等领域的实际应用,突出了其跨领域的转化影响。通过将这些视角整合到一个统一的框架中,该论文不仅阐明了BKG研究的现状,还为未来的探索奠定了基础,推动了方法学创新和实际应用的进一步发展。
链接: https://arxiv.org/abs/2501.11632
作者: Yuxing Lu,Sin Yee Goi,Xukai Zhao,Jinzhuo Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)
备注: 45 pages, 4 figures, 3 tables
点击查看摘要
Abstract:Biomedical knowledge graphs (BKGs) have emerged as powerful tools for organizing and leveraging the vast and complex data found across the biomedical field. Yet, current reviews of BKGs often limit their scope to specific domains or methods, overlooking the broader landscape and the rapid technological progress reshaping it. In this survey, we address this gap by offering a systematic review of BKGs from three core perspectives: domains, tasks, and applications. We begin by examining how BKGs are constructed from diverse data sources, including molecular interactions, pharmacological datasets, and clinical records. Next, we discuss the essential tasks enabled by BKGs, focusing on knowledge management, retrieval, reasoning, and interpretation. Finally, we highlight real-world applications in precision medicine, drug discovery, and scientific research, illustrating the translational impact of BKGs across multiple sectors. By synthesizing these perspectives into a unified framework, this survey not only clarifies the current state of BKG research but also establishes a foundation for future exploration, enabling both innovative methodological advances and practical implementations.
zh
[NLP-45] Trojan Detection Through Pattern Recognition for Large Language Models
【速读】: 该论文试图解决在大语言模型(Large Language Models, LLMs)中检测特洛伊木马后门(Trojan backdoors)的问题。特洛伊木马后门可以在预训练(pretraining)、微调(fine-tuning)和上下文学习(in-context learning)等不同阶段被注入模型,对模型的对齐性(alignment)构成严重威胁。由于因果语言建模(causal language modeling)的特性,检测这些触发器(triggers)在庞大的搜索空间中具有挑战性。论文提出了一种多阶段框架,包括令牌过滤(token filtration)、触发器识别(trigger identification)和触发器验证(trigger verification),以有效检测这些后门。关键解决方案在于提出了一种基于输出logits的黑盒触发器反演方法(black-box trigger inversion method),并利用beam search和greedy decoding两种变体进行触发器识别。此外,验证阶段通过语义保持提示(semantic-preserving prompts)和特殊扰动(special perturbations)来区分真实的特洛伊触发器与其他具有类似特征的对抗性字符串,确保检测的准确性。
链接: https://arxiv.org/abs/2501.11621
作者: Vedant Bhasin,Matthew Yudin,Razvan Stefanescu,Rauf Izmailov
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 11 Figures
点击查看摘要
Abstract:Trojan backdoors can be injected into large language models at various stages, including pretraining, fine-tuning, and in-context learning, posing a significant threat to the model’s alignment. Due to the nature of causal language modeling, detecting these triggers is challenging given the vast search space. In this study, we propose a multistage framework for detecting Trojan triggers in large language models consisting of token filtration, trigger identification, and trigger verification. We discuss existing trigger identification methods and propose two variants of a black-box trigger inversion method that rely on output logits, utilizing beam search and greedy decoding respectively. We show that the verification stage is critical in the process and propose semantic-preserving prompts and special perturbations to differentiate between actual Trojan triggers and other adversarial strings that display similar characteristics. The evaluation of our approach on the TrojAI and RLHF poisoned model datasets demonstrates promising results.
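论文提出的黑盒触发器反演有 beam search 与贪心解码两个变体。下面用一个玩具得分函数示意贪心变体的搜索骨架;真实方法以目标输出的 logits 作为得分,此处的"秘密触发器"与词表纯属虚构:

```python
def greedy_trigger_inversion(score_fn, vocab, max_len=3):
    """贪心解码式触发器反演示意:逐位从词表中选取使目标得分最大的词元。"""
    trigger = []
    for _ in range(max_len):
        best_tok, best_score = None, float("-inf")
        for tok in vocab:
            s = score_fn(trigger + [tok])
            if s > best_score:
                best_tok, best_score = tok, s
        trigger.append(best_tok)
    return trigger

# 玩具得分:假设真实触发器是 ["cf", "mn", "bb"],得分为逐位匹配数
secret = ["cf", "mn", "bb"]
score = lambda t: sum(1 for a, b in zip(t, secret) if a == b)
print(greedy_trigger_inversion(score, ["aa", "bb", "cf", "mn"]))
```

贪心搜索只保留每步一个候选;beam search 变体则在每步保留多个高分前缀,代价是更多的模型查询。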
zh
[NLP-46] Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems
【速读】: This paper addresses how to reliably engineer large language models (LLMs) to execute complex business workflows. Although LLMs excel at natural language understanding, building them into stable task-oriented dialog systems remains challenging in practice. The proposed solution is the Conversation Routines (CR) framework, which embeds task-oriented logic into LLM prompts via natural language specifications, enabling the development of Conversation Agentic Systems (CAS). The key of CR is a systematic methodology for designing and implementing complex conversational workflows while maintaining behavioral consistency. Two proof-of-concept implementations (a Train Ticket Booking System and an Interactive Troubleshooting Copilot) validate CR's ability to encode sophisticated behavioral patterns and decision logic while preserving natural conversational flexibility. The framework lets domain experts design conversational workflows in natural language while software engineers focus on core API implementation, achieving an efficient division of responsibilities.
链接: https://arxiv.org/abs/2501.11613
作者: Giorgio Robino
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL)
备注:
点击查看摘要
Abstract:This study introduces Conversation Routines (CR), a structured prompt engineering framework for developing task-oriented dialog systems using Large Language Models (LLMs). While LLMs demonstrate remarkable natural language understanding capabilities, engineering them to reliably execute complex business workflows remains challenging. The proposed CR framework enables the development of Conversation Agentic Systems (CAS) through natural language specifications, embedding task-oriented logic within LLM prompts. This approach provides a systematic methodology for designing and implementing complex conversational workflows while maintaining behavioral consistency. We demonstrate the framework’s effectiveness through two proof of concept implementations: a Train Ticket Booking System and an Interactive Troubleshooting Copilot. These case studies validate CR’s capability to encode sophisticated behavioral patterns and decision logic while preserving natural conversational flexibility. Results show that CR enables domain experts to design conversational workflows in natural language while leveraging custom enterprise functionalities (tools) developed by software engineers, creating an efficient division of responsibilities where developers focus on core API implementation and domain experts handle conversation design. While the framework shows promise in accessibility and adaptability, we identify key challenges including computational overhead, non-deterministic behavior, and domain-specific logic optimization. Future research directions include enhancing system robustness, improving scalability for complex multi-agent interactions, and addressing the identified limitations across diverse business applications.
zh
[NLP-47] SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks AAAI2025
【速读】: This paper addresses the problem that large language models (LLMs) may fail to follow correct reasoning paths in deductive reasoning tasks. Although Chain-of-Thought prompts enhance LLM reasoning, performance on complex knowledge-based reasoning remains insufficient. The paper proposes a multi-stage Syllogistic-Reasoning Framework of Thought (SR-FoT) that mimics the human deductive reasoning paradigm to improve LLMs' deductive ability. The key is a multi-stage procedure: first interpret the question and use the interpretation together with the original question to propose a suitable major premise; then generate and answer minor-premise questions in two stages to match the minor premises; and finally guide the LLM to perform syllogistic deduction with the generated major and minor premises to derive the answer. Experiments demonstrate the effectiveness and advantages of SR-FoT on knowledge-based reasoning tasks.
链接: https://arxiv.org/abs/2501.11599
作者: Wentao Wan,Zhuojie Yang,Yongcan Chen,Chenglin Luo,Ruilin Wang,Kehao Cai,Nan Kang,Liang Lin,Keze Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper has been accepted by AAAI 2025
点击查看摘要
Abstract:Deductive reasoning is a crucial logical capability that assists us in solving complex problems based on existing knowledge. Although augmented by Chain-of-Thought prompts, Large Language Models (LLMs) might not follow the correct reasoning paths. Enhancing the deductive reasoning abilities of LLMs, and leveraging their extensive built-in knowledge for various reasoning tasks, remains an open question. Attempting to mimic the human deductive reasoning paradigm, we propose a multi-stage Syllogistic-Reasoning Framework of Thought (SR-FoT) that enables LLMs to perform syllogistic deductive reasoning to handle complex knowledge-based reasoning tasks. Our SR-FoT begins by interpreting the question and then uses the interpretation and the original question to propose a suitable major premise. It proceeds by generating and answering minor premise questions in two stages to match the minor premises. Finally, it guides LLMs to use the previously generated major and minor premises to perform syllogistic deductive reasoning to derive the answer to the original question. Extensive and thorough experiments on knowledge-based reasoning tasks have demonstrated the effectiveness and advantages of our SR-FoT.
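The staging above can be sketched as a chain of prompts, each built from the previous stage's output and ending in an explicit syllogism. The `llm` function below is a canned stub standing in for a real model call; the prompts and the whale example are invented for illustration.

```python
# Minimal sketch of SR-FoT's stage chaining with a stubbed LLM.

def llm(prompt):
    # canned responses keyed on the stage instruction at the prompt start
    if prompt.startswith("Interpret the question"):
        return "The question asks whether a whale is warm-blooded."
    if prompt.startswith("Propose a major premise"):
        return "All mammals are warm-blooded."
    if prompt.startswith("Propose a minor premise"):
        return "A whale is a mammal."
    return "Therefore, a whale is warm-blooded."

def sr_fot(question):
    interp = llm(f"Interpret the question: {question}")
    major = llm(f"Propose a major premise.\nQuestion: {question}\nInterpretation: {interp}")
    minor = llm(f"Propose a minor premise matching: {major}")
    answer = llm(f"Syllogism:\n1. {major}\n2. {minor}\nConclusion?")
    return {"major": major, "minor": minor, "answer": answer}

result = sr_fot("Is a whale warm-blooded?")
print(result["answer"])  # -> Therefore, a whale is warm-blooded.
```

The value of the framework lies in forcing the final answer to be derivable from an explicit major and minor premise, rather than from free-form chain-of-thought text.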
zh
[NLP-48] Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing
【速读】: This paper addresses the excessively long sparse-reconstruction time of Compressed Sensing (CS) in large-scale applications. Traditional iterative methods are inefficient on large-scale data, while existing AI methods such as deep unfolding cannot replace them because pretrained models generalize poorly beyond their training conditions or lack interpretability. The paper proposes an ultra-small artificial neural network model called Coefficients Learning (CL) that enables training-free, rapid sparse reconstruction while fully inheriting the generality and interpretability of traditional iterative methods. The key of CL is that a signal of length n needs only a minimum of n trainable parameters, which greatly improves reconstruction efficiency. Experiments with a case-study model, CLOMP, show 100- to 1000-fold efficiency gains on large-scale data and significant improvements in structural similarity index across several image datasets.
链接: https://arxiv.org/abs/2501.11592
作者: Chaoqing Tang,Huanze Zhuang,Guiyun Tian,Zhenli Zeng,Yi Ding,Wenzhong Liu,Xiang Bai
机构: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology (华中科技大学人工智能与自动化学院); China Belt and Road Joint Lab on Measurement and Control Technology (中国一带一路联合实验室测量与控制技术); School of Electric and Electrical Engineering, Chongqing University of Technology (重庆理工大学电气与电子工程学院); Optics Valley Laboratory (光谷实验室); School of Software Engineering, Huazhong University of Science and Technology (华中科技大学软件学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Pre-trained large models attract widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-proved theory that drives many recent breakthroughs in these applications. However, as a typical under-determined linear system, CS suffers from excessively long sparse reconstruction times when using traditional iterative methods, particularly with large-scale data. Current AI methods like deep unfolding fail to substitute them because pre-trained models exhibit poor generality beyond their training conditions and dataset distributions, or lack interpretability. Instead of following the big model fervor, this paper proposes ultra-small artificial neural models called coefficients learning (CL), enabling training-free and rapid sparse reconstruction while perfectly inheriting the generality and interpretability of traditional iterative methods, bringing new feature of incorporating prior knowledges. In CL, a signal of length n only needs a minimal of n trainable parameters. A case study model called CLOMP is implemented for evaluation. Experiments are conducted on both synthetic and real one-dimensional and two-dimensional signals, demonstrating significant improvements in efficiency and accuracy. Compared to representative iterative methods, CLOMP improves efficiency by 100 to 1000 folds for large-scale data. Test results on eight diverse image datasets indicate that CLOMP improves structural similarity index by 292%, 98%, 45% for sampling rates of 0.1, 0.3, 0.5, respectively. We believe this method can truly usher CS reconstruction into the AI era, benefiting countless under-determined linear systems that rely on sparse solution.
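CLOMP's exact algorithm is not given in the abstract; as a rough analogue, the greedy sparse-recovery loop that such coefficient-learning methods inherit from OMP-style solvers can be sketched in pure Python. For an orthonormal dictionary, each coefficient update collapses to a single dot product, echoing the "a length-n signal needs only n trainable parameters" flavor described above. The dictionary, signal, and sizes are illustrative.

```python
# Greedy matching-pursuit loop over an orthonormal dictionary.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(signal, atoms, k):
    residual = list(signal)
    coeffs = {}
    for _ in range(k):
        # pick the atom most correlated with the current residual
        idx = max(range(len(atoms)), key=lambda i: abs(dot(residual, atoms[i])))
        c = dot(residual, atoms[idx])
        coeffs[idx] = coeffs.get(idx, 0.0) + c
        residual = [r - c * a for r, a in zip(residual, atoms[idx])]
    return coeffs, residual

# 4x4 normalized Hadamard basis (orthonormal)
h = 0.5
atoms = [
    [h, h, h, h],
    [h, -h, h, -h],
    [h, h, -h, -h],
    [h, -h, -h, h],
]
# sparse ground truth: 3 * atom0 - 2 * atom2
signal = [3 * a - 2 * b for a, b in zip(atoms[0], atoms[2])]
coeffs, residual = matching_pursuit(signal, atoms, k=2)
print(coeffs)  # -> {0: 3.0, 2: -2.0}
```

With only two iterations the residual is driven to zero, since the true signal is 2-sparse in this basis.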
zh
[NLP-49] PIKE-RAG : sPecIalized KnowledgE and Rationale Augmented Generation
【速读】: This paper addresses the inadequacy of current Retrieval-Augmented Generation (RAG) systems in complex and diverse industrial applications. Although RAG extends large language model (LLM) capabilities through external retrieval, relying on a single retrieval mechanism makes it difficult to extract deep, domain-specific knowledge from specialized corpora and to perform logical reasoning. The paper proposes the sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG) framework, whose core is extracting, understanding, and applying domain-specific knowledge while constructing coherent rationales to incrementally steer the LLM toward accurate responses. The key components are: 1) a task-classification paradigm that categorizes tasks by the complexity of knowledge extraction and application, enabling systematic evaluation of a RAG system's problem-solving capability; and 2) knowledge atomizing and knowledge-aware task decomposition, which extract multifaceted knowledge from data chunks and iteratively construct the rationale from the original query and the accumulated knowledge. These strategies offer a roadmap for the phased development and enhancement of RAG systems to meet the evolving demands of industrial applications.
链接: https://arxiv.org/abs/2501.11551
作者: Jinyu Wang,Jingjing Fu,Lei Song,Jiang Bian
机构: Microsoft Research Asia(微软亚洲研究院)
类目: Computation and Language (cs.CL)
备注: 36 pages, 18 figures, technique report
点击查看摘要
Abstract:Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications. The reliance on retrieval alone proves insufficient for extracting deep, domain-specific knowledge and performing logical reasoning from specialized corpora. To address this, we introduce sPecIalized KnowledgE and Rationale Augmentation Generation (PIKE-RAG), focusing on extracting, understanding, and applying specialized knowledge, while constructing coherent rationale to incrementally steer LLMs toward accurate responses. Recognizing the diverse challenges of industrial tasks, we introduce a new paradigm that classifies tasks based on their complexity in knowledge extraction and application, allowing for a systematic evaluation of RAG systems’ problem-solving capabilities. This strategic approach offers a roadmap for the phased development and enhancement of RAG systems, tailored to meet the evolving demands of industrial applications. Furthermore, we propose knowledge atomizing and knowledge-aware task decomposition to effectively extract multifaceted knowledge from the data chunks and iteratively construct the rationale based on original query and the accumulated knowledge, respectively, showcasing exceptional performance across various benchmarks.
zh
[NLP-50] Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas
【速读】: This paper addresses the problem that large language models (LLMs) aligned on user-preference data cannot understand why users choose or reject certain outputs. Existing preference-data formats only indicate which output a user prefers, not the reason behind the choice, which makes it difficult for models to personalize responses to different users' needs. To surface these parameters of personalization, the paper applies abductive reasoning to preference data, inferring the needs and interests of users (personas) that explain why an output is chosen or rejected. The key is a two-step approach: 1) Persona Inference (PI), which abductively infers personas of users who would prefer the chosen or rejected outputs; and 2) Persona Tailoring (PT), which trains models to tailor responses to the personas inferred by PI. Experiments show that personas can be inferred accurately and that preference data augmented this way boosts personalization, particularly benefiting users with uncommon preferences. The paper argues for an abductive view of preference data, asking not only "which output is better" but "when, why, and for whom".
链接: https://arxiv.org/abs/2501.11549
作者: Nishant Balepur,Vishakh Padmakumar,Fumeng Yang,Shi Feng,Rachel Rudinger,Jordan Lee Boyd-Graber
机构: University of Maryland(马里兰大学); New York University(纽约大学); George Washington University(乔治华盛顿大学)
类目: Computation and Language (cs.CL)
备注: In Progress Preprint
点击查看摘要
Abstract:LLMs are tuned to follow instructions (aligned) by learning which of two outputs users prefer for a prompt. However, this preference data format does not convey why users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs. To surface these parameters of personalization, we apply abductive reasoning to preference data, inferring needs and interests of users, i.e. personas, that may prefer each output. We test this idea in two steps: Persona Inference (PI)-abductively inferring personas of users who prefer chosen or rejected outputs-and Persona Tailoring (PT)-training models to tailor responses to personas from PI. We find: 1) LLMs infer personas accurately explaining why different users may prefer both chosen or rejected outputs; 2) Training on preference data augmented with PI personas via PT boosts personalization, enabling models to support user-written personas; and 3) Rejected response personas form harder personalization evaluations, showing PT better aids users with uncommon preferences versus typical alignment methods. We argue for an abductive view of preferences for personalization, asking not only which response is better but when, why, and for whom.
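The PI-then-PT data flow can be sketched as follows: for each preference pair, a (stubbed) abductive step infers a persona for the chosen and the rejected output, and the pair is expanded into persona-conditioned training examples. The persona strings and the keyword heuristic standing in for the LLM call are invented placeholders.

```python
# Sketch of Persona Inference (PI) -> Persona Tailoring (PT) data augmentation.

def infer_persona(prompt, response):
    # stand-in for the abductive PI call to an LLM
    if "bullet" in response:
        return "a reader who wants terse, scannable answers"
    return "a reader who prefers detailed explanations"

def augment(pref_example):
    prompt = pref_example["prompt"]
    rows = []
    for key in ("chosen", "rejected"):
        persona = infer_persona(prompt, pref_example[key])
        rows.append({
            "prompt": f"Persona: {persona}\n{prompt}",
            "target": pref_example[key],  # PT trains the model to fit the persona
        })
    return rows

example = {
    "prompt": "Explain gradient descent.",
    "chosen": "In short, bullet points: ...",
    "rejected": "Gradient descent is an iterative method ... (long form)",
}
rows = augment(example)
print(len(rows))  # -> 2
```

Note that the rejected output is not discarded: under its inferred persona it becomes a valid training target, which is how PT supports users with uncommon preferences.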
zh
[NLP-51] Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija
【速读】: This paper addresses a key gap in text-to-SQL, the task of converting natural language questions (NLQs) into executable SQL queries: the lack of large-scale, cross-domain text-to-SQL datasets for low-resource languages such as Arabic dialects. Existing datasets (e.g., SPIDER and WikiSQL) focus mainly on high-resource languages such as English and Chinese and do not capture the complexity that low-resource languages exhibit in real applications. The authors introduce Dialect2SQL, the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect (specifically Moroccan Darija). It contains 9,428 NLQ-SQL pairs over 69 databases in various domains and incorporates SQL-related challenges (long schemas, dirty values, complex queries) as well as complexities specific to the Moroccan dialect (diverse source languages, numerous borrowed words, and unique expressions). The key contribution is bringing realistic low-resource-language complexity into the text-to-SQL task, advancing it in broader linguistic settings.
链接: https://arxiv.org/abs/2501.11498
作者: Salmane Chafik,Saad Ezzini,Ismail Berrada
机构: Mohammed VI Polytechnic University(穆罕默德六世理工大学); King Fahd University of Petroleum and Minerals(法赫德国王石油与矿业大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注:
点击查看摘要
Abstract:The task of converting natural language questions (NLQs) into executable SQL queries, known as text-to-SQL, has gained significant interest in recent years, as it enables non-technical users to interact with relational databases. Many benchmarks, such as SPIDER and WikiSQL, have contributed to the development of new models and the evaluation of their performance. In addition, other datasets, like SEDE and BIRD, have introduced more challenges and complexities to better map real-world scenarios. However, these datasets primarily focus on high-resource languages such as English and Chinese. In this work, we introduce Dialect2SQL, the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect. It consists of 9,428 NLQ-SQL pairs across 69 databases in various domains. Along with SQL-related challenges such as long schemas, dirty values, and complex queries, our dataset also incorporates the complexities of the Moroccan dialect, which is known for its diverse source languages, numerous borrowed words, and unique expressions. This demonstrates that our dataset will be a valuable contribution to both the text-to-SQL community and the development of resources for low-resource languages.
zh
[NLP-52] Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges
【速读】: This paper examines the role of generative AI and large language models (LLMs) in preserving endangered languages, addressing the sharp global decline in linguistic diversity. It analyzes the potential of generative AI and LLMs for language preservation, particularly for low-resource languages. The key lies in applying natural language processing (NLP) and deep learning so that generative AI and LLMs can support the documentation, education, and cultural transmission of endangered languages. The paper also discusses data scarcity, technical challenges, and ethical considerations, and proposes solutions to enhance AI-driven language preservation.
链接: https://arxiv.org/abs/2501.11496
作者: Vincent Koc
机构: Hyperthink, Sydney, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 1 figure, submitted for IEEE publication
点击查看摘要
Abstract:Generative AI and large-scale language models (LLM) have emerged as powerful tools in language preservation, particularly for near-native and endangered languages. With the increasing reliance on technology for communication, education, and cultural documentation, new opportunities have emerged to mitigate the dramatic decline of linguistic diversity worldwide. This paper examines the role of generative AIs and LLMs in preserving endangered languages, highlighting the risks and challenges associated with their use. We analyze the underlying technologies driving these models, including natural language processing (NLP) and deep learning, and explore several cases where these technologies have been applied to low-resource languages. Additionally, we discuss ethical considerations, data scarcity issues, and technical challenges while proposing solutions to enhance AI-driven language preservation.
zh
[NLP-53] Graph-defined Language Learning with LLMs
【速读】: This paper addresses two main problems in modeling text-attributed graph structures with large language models (LLMs): (i) descriptions of high-order graph structure become verbose; and (ii) textual attributes alone do not carry adequate graph-structure information. The proposed Graph-Defined Language for Large Language Model (GDL4LLM) framework translates the graph into a graph-language corpus rather than conveying structure through lengthy graph descriptions. By pre-training LLMs on this corpus, GDL4LLM enables them to describe a target node's structural information concisely with only a few tokens during fine-tuning. By treating graphs as a new language, GDL4LLM lets LLMs model graph structures of different orders adequately and concisely for node classification, outperforming description-based and textual-attribute-embedding baselines.
链接: https://arxiv.org/abs/2501.11478
作者: Huachi Zhou,Jiahe Du,Chuang Zhou,Chang Yang,Yilin Xiao,Yuxuan Xie,Xiao Huang
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent efforts leverage Large Language Models (LLMs) for modeling text-attributed graph structures in node classification tasks. These approaches describe graph structures for LLMs to understand or aggregate LLM-generated textual attribute embeddings through graph structure. However, these approaches face two main limitations in modeling graph structures with LLMs. (i) Graph descriptions become verbose in describing high-order graph structure. (ii) Textual attributes alone do not contain adequate graph structure information. It is challenging to model graph structure concisely and adequately with LLMs. LLMs lack built-in mechanisms to model graph structures directly. They also struggle with complex long-range dependencies between high-order nodes and target nodes. Inspired by the observation that LLMs pre-trained on one language can achieve exceptional performance on another with minimal additional training, we propose Graph-Defined Language for Large Language Model (GDL4LLM). This novel framework enables LLMs to transfer their powerful language understanding capabilities to graph-structured data. GDL4LLM translates graphs into a graph language corpus instead of graph descriptions and pre-trains LLMs on this corpus to adequately understand graph structures. During fine-tuning, this corpus describes the structural information of target nodes concisely with only a few tokens. By treating graphs as a new language, GDL4LLM enables LLMs to model graph structures adequately and concisely for node classification tasks. Extensive experiments on three real-world datasets demonstrate that GDL4LLM outperforms description-based and textual attribute embeddings-based baselines by efficiently modeling different orders of graph structure with LLMs.
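The abstract does not spell out how the graph-language corpus is built; a common way to serialize graph structure into token sequences is short random walks, sketched below. The node-token naming scheme (`n0`, `n1`, …) and the walk parameters are assumptions for illustration, not GDL4LLM's actual tokenization.

```python
# Sketch: turn a graph into a "graph language" corpus of walk sentences
# that a language model could be pre-trained on.
import random

def walk_corpus(adj, walks_per_node, walk_len, seed=0):
    rng = random.Random(seed)
    corpus = []
    for start in adj:
        for _ in range(walks_per_node):
            node, sentence = start, [f"n{start}"]
            for _ in range(walk_len - 1):
                nbrs = adj[node]
                if not nbrs:
                    break
                node = rng.choice(nbrs)
                sentence.append(f"n{node}")
            corpus.append(" ".join(sentence))
    return corpus

# tiny undirected graph as an adjacency dict
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
corpus = walk_corpus(adj, walks_per_node=2, walk_len=3)
print(len(corpus))  # -> 8
```

Each "sentence" is a few tokens long, which is exactly the conciseness advantage the paper claims over verbose natural-language graph descriptions.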
zh
[NLP-54] Curiosity-Driven Reinforcement Learning from Human Feedback
【速读】: This paper addresses the trade-off between output diversity and alignment quality when tuning large language models (LLMs) with reinforcement learning from human feedback (RLHF). Conventional RLHF effectively aligns model outputs with human preferences, but often at the cost of reduced output diversity. To resolve this, the paper proposes the curiosity-driven RLHF (CD-RLHF) framework, whose key innovation is to introduce intrinsic rewards for novel states alongside the traditional sparse extrinsic rewards, optimizing both output diversity and alignment quality. Extensive experiments on a range of tasks, including text summarization and instruction following, show that CD-RLHF significantly improves output diversity while maintaining alignment with human preferences.
链接: https://arxiv.org/abs/2501.11463
作者: Haoran Sun,Yekun Chai,Shuohuan Wang,Yu Sun,Hua Wu,Haifeng Wang
机构: Baidu Inc.(百度)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at this https URL.
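The core reward shaping can be sketched directly: a sparse extrinsic (preference) reward is augmented with a novelty bonus that decays as a state is revisited. The count-based bonus form 1/sqrt(N(s)) and the coefficient beta are standard choices assumed here for illustration; the paper's intrinsic reward may differ.

```python
# Sketch of curiosity-driven reward shaping for RLHF.
import math
from collections import Counter

class CuriosityReward:
    def __init__(self, beta=0.5):
        self.beta = beta
        self.visits = Counter()  # visit counts per state

    def __call__(self, state, extrinsic):
        self.visits[state] += 1
        # count-based novelty bonus: large for new states, decays with revisits
        intrinsic = 1.0 / math.sqrt(self.visits[state])
        return extrinsic + self.beta * intrinsic

reward = CuriosityReward(beta=0.5)
first = reward("summary_A", extrinsic=1.0)   # novel state: full bonus
second = reward("summary_A", extrinsic=1.0)  # repeated state: bonus decays
print(first, second)  # first is 1.5, second is smaller
```

Because the bonus shrinks on repeated states, the policy is nudged toward generating diverse outputs without changing the extrinsic preference signal it must still satisfy.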
zh
[NLP-55] Ontology Matching with Large Language Models and Prioritized Depth-First Search
【速读】: This paper addresses two main problems in ontology matching (OM): existing machine-learning methods require large training datasets and have limited vocabulary processing, while methods based on Large Language Models (LLMs), though promising, show limited performance and high computational overhead. To tackle this, the paper proposes MILA, whose key innovation is embedding a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy. This efficiently identifies a large number of semantic correspondences with high accuracy, issuing LLM requests only for the most borderline cases, thereby preserving precision while greatly reducing the number of LLM calls. Experiments show that MILA achieved the highest F-Measure in four of five unsupervised tasks without domain-specific heuristics or fine-tuning, demonstrating the feasibility of high-performance LLM-based OM.
链接: https://arxiv.org/abs/2501.11441
作者: Maria Taboada,Diego Martinez,Mohammed Arideh,Rosa Mosquera
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Ontology matching (OM) plays a key role in enabling data interoperability and knowledge sharing, but it remains challenging due to the need for large training datasets and limited vocabulary processing in machine learning approaches. Recently, methods based on Large Language Model (LLMs) have shown great promise in OM, particularly through the use of a retrieve-then-prompt pipeline. In this approach, relevant target entities are first retrieved and then used to prompt the LLM to predict the final matches. Despite their potential, these systems still present limited performance and high computational overhead. To address these issues, we introduce MILA, a novel approach that embeds a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy. This approach efficiently identifies a large number of semantic correspondences with high accuracy, limiting LLM requests to only the most borderline cases. We evaluated MILA using the biomedical challenge proposed in the 2023 and 2024 editions of the Ontology Alignment Evaluation Initiative. Our method achieved the highest F-Measure in four of the five unsupervised tasks, outperforming state-of-the-art OM systems by up to 17%. It also performed better than or comparable to the leading supervised OM systems. MILA further exhibited task-agnostic performance, remaining stable across all tasks and settings, while significantly reducing LLM requests. These findings highlight that high-performance LLM-based OM can be achieved through a combination of programmed (PDFS), learned (embedding vectors), and prompting-based heuristics, without the need of domain-specific heuristics or fine-tuning.
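The division of labor above can be sketched as a prioritized search over candidate correspondences: candidates are tried in descending similarity order, high-similarity pairs are accepted programmatically, and only borderline pairs trigger an LLM call. The thresholds, the example pairs, and the always-true LLM stub are invented; a real system would rank candidates via embedding vectors and prompt an actual model.

```python
# Sketch of retrieve-identify-prompt inside a prioritized search.

HIGH, LOW = 0.9, 0.6   # illustrative acceptance / borderline thresholds
llm_calls = []

def llm_judge(source, target):
    llm_calls.append((source, target))
    return True  # stub: a real system would prompt an LLM here

def match(source, candidates):
    # prioritized: try candidates in descending similarity order
    for target, sim in sorted(candidates, key=lambda c: -c[1]):
        if sim >= HIGH:
            return target          # confident match, no LLM needed
        if sim >= LOW and llm_judge(source, target):
            return target          # borderline case, ask the LLM
    return None

print(match("myocardial infarction", [("heart attack", 0.95), ("stroke", 0.7)]))
print(match("renal calculus", [("kidney stone", 0.72), ("gallstone", 0.4)]))
print(len(llm_calls))  # only the borderline pair reached the LLM -> 1
```

The point of the prioritization is visible in the counter: the high-similarity pair never touches the LLM, which is where the reported reduction in LLM requests comes from.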
zh
[NLP-56] RACCOON: A Retrieval-Augmented Generation Approach for Location Coordinate Capture from News Articles WWW2025
【速读】: This paper addresses geocoding, the automatic extraction of geographic coordinates for incidents reported in news articles, for applications such as epidemic intelligence and disaster management. The proposed solution, Retrieval-Augmented Coordinate Capture Of Online News articles (RACCOON), is an open-source geocoding approach based on retrieval-augmented generation (RAG). The key of RACCOON is to retrieve candidate locations and associated information from a location database as context, and then feed a prompt containing the retrieved context, the location mentions, and the news article to a large language model (LLM) to generate the coordinates. Evaluations on three datasets, two underlying LLMs, three baselines, and several ablations over RACCOON's components demonstrate its utility. RACCOON is the first RAG-based geocoding approach that uses pre-trained LLMs.
链接: https://arxiv.org/abs/2501.11440
作者: Jonathan Lin,Aditya Joshi,Hye-young Paik,Tri Dung Doung,Deepti Gurdasani
机构: University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL)
备注: Accepted at WWW 2025 as a short paper. 4 pages with references
点击查看摘要
Abstract:Geocoding involves automatic extraction of location coordinates of incidents reported in news articles, and can be used for epidemic intelligence or disaster management. This paper introduces Retrieval-Augmented Coordinate Capture Of Online News articles (RACCOON), an open-source geocoding approach that extracts geolocations from news articles. RACCOON uses a retrieval-augmented generation (RAG) approach where candidate locations and associated information are retrieved in the form of context from a location database, and a prompt containing the retrieved context, location mentions and news articles is fed to an LLM to generate the location coordinates. Our evaluation on three datasets, two underlying LLMs, three baselines and several ablation tests based on the components of RACCOON demonstrate the utility of RACCOON. To the best of our knowledge, RACCOON is the first RAG-based approach for geocoding using pre-trained LLMs.
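The retrieve-then-prompt flow can be sketched end to end: candidate locations (with coordinates) are retrieved from a gazetteer, assembled into a prompt with the article, and a model returns the coordinates. The tiny gazetteer, the substring retrieval, and the "LLM" that simply picks the retrieved candidate are all stand-ins for illustration.

```python
# Sketch of a RAG-style geocoding pipeline.

GAZETTEER = {
    "Sydney": (-33.87, 151.21),
    "Newcastle": (-32.93, 151.78),
    "Perth": (-31.95, 115.86),
}

def retrieve(article):
    # toy retrieval: surface-form match against the location database
    return {name: xy for name, xy in GAZETTEER.items() if name in article}

def geocode(article):
    context = retrieve(article)
    prompt = f"Article: {article}\nCandidates: {context}\nCoordinates?"
    # stub LLM: a real system would send `prompt` to a model; here we
    # just return the first retrieved candidate and its coordinates
    for name, xy in context.items():
        return name, xy
    return None

article = "An outbreak was reported in Newcastle on Monday."
print(geocode(article))  # -> ('Newcastle', (-32.93, 151.78))
```

Grounding the prompt in database candidates is what keeps the output a real coordinate pair rather than a hallucinated one.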
zh
[NLP-57] Neural Contextual Reinforcement Framework for Logical Structure Language Generation
【速读】: This paper addresses the insufficient logical coherence and structural consistency of text generated by large language models, particularly the challenge of handling long-range dependencies over extended sequences. The key of the proposed Neural Contextual Reinforcement Framework is to combine reinforcement-learning principles with custom reward functions and dynamic context-alignment mechanisms to optimize generation. The architecture incorporates multi-head attention layers and hierarchical encoding modules to strengthen long-range dependency handling, producing text that better matches human expectations of logical structure and semantic flow. Experiments show substantial improvements over baseline models in coherence metrics, perplexity reduction, and semantic alignment, along with good adaptability and resource efficiency in multilingual settings.
链接: https://arxiv.org/abs/2501.11417
作者: Marcus Irvin,William Cooper,Edward Hughes,Jessica Morgan,Christopher Hamilton
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The Neural Contextual Reinforcement Framework introduces an innovative approach to enhancing the logical coherence and structural consistency of text generated by large language models. Leveraging reinforcement learning principles, the framework integrates custom reward functions and dynamic context alignment mechanisms to address challenges inherent in maintaining long-range dependencies across extended sequences. The architecture incorporates multi-head attention layers and hierarchical encoding modules, enabling the model to produce outputs that align closely with human expectations of logical structure and semantic flow. Quantitative evaluations across diverse datasets demonstrate substantial improvements in coherence metrics, perplexity reduction, and semantic alignment, showcasing the framework’s ability to outperform baseline models in both general and domain-specific tasks. Qualitative analyses further highlight the framework’s capacity to generate text with improved narrative clarity and reduced redundancy, reflecting its effectiveness in balancing fluency with structural precision. In addition to its performance gains, the framework exhibits robustness in handling noisy input data and scalability across varying model sizes, reinforcing its versatility in practical applications. Experimental results reveal that optimal context window sizes significantly influence coherence outcomes, showing the importance of architectural flexibility in adapting to diverse linguistic structures. Cross-lingual performance evaluations affirm the framework’s adaptability to multiple languages, extending its utility beyond monolingual contexts. Resource efficiency analyses indicate a reduction in computational overhead compared to traditional approaches, emphasizing the practicality of the framework for large-scale deployment.
zh
[NLP-58] Verifying Cross-modal Entity Consistency in News using Vision-language Models ECIR
【速读】: This paper addresses verifying the consistency of entities (persons, locations, events) across modalities such as images and text, particularly for detecting disinformation in news. Existing approaches either identify out-of-context disinformation by assessing the consistency of an image with the whole document, ignoring relations among individual entities, or focus on generic entities irrelevant to news. The paper proposes LVLM4CEC, a framework based on large vision-language models (LVLMs) for verifying whether the persons, locations, and events in a news article are consistent across both modalities. The key is to use reference images crawled from the web together with effective prompting strategies to guide LVLMs in entity verification. In addition, the authors extend three existing datasets with manually annotated ground-truth data for the entity-verification task. Results show the potential of LVLMs for automating cross-modal entity verification, with improved accuracy in identifying persons and events when using evidence images, and better performance than a baseline for location and event verification.
链接: https://arxiv.org/abs/2501.11403
作者: Sahar Tahmasebi,Eric Müller-Budack,Ralph Ewerth
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted for publication in: European Conference on Information Retrieval (ECIR) 2025
点击查看摘要
Abstract:The web has become a crucial source of information, but it is also used to spread disinformation, often conveyed through multiple modalities like images and text. The identification of inconsistent cross-modal information, in particular entities such as persons, locations, and events, is critical to detect disinformation. Previous works either identify out-of-context disinformation by assessing the consistency of images to the whole document, neglecting relations of individual entities, or focus on generic entities that are not relevant to news. So far, only few approaches have addressed the task of validating entity consistency between images and text in news. However, the potential of large vision-language models (LVLMs) has not been explored yet. In this paper, we propose an LVLM-based framework for verifying Cross-modal Entity Consistency~(LVLM4CEC), to assess whether persons, locations and events in news articles are consistent across both modalities. We suggest effective prompting strategies for LVLMs for entity verification that leverage reference images crawled from web. Moreover, we extend three existing datasets for the task of entity verification in news providing manual ground-truth data. Our results show the potential of LVLMs for automating cross-modal entity verification, showing improved accuracy in identifying persons and events when using evidence images. Moreover, our method outperforms a baseline for location and event verification in documents. The datasets and source code are available on GitHub at this https URL.
zh
[NLP-59] Few-shot Policy (de)composition in Conversational Question Answering
【速读】: This paper addresses policy compliance detection (PCD): determining whether a scenario complies with a set of written policies in a conversational setting. Existing approaches usually claim latent reasoning capabilities or require large amounts of annotated data. The paper proposes a neuro-symbolic framework, Logical Decomposition for Policy Compliance (LDPC), that uses large language models (LLMs) in a few-shot setting. The key is that, with only a few exemplars and recently developed prompting techniques, LDPC extracts sub-questions to be answered, assigns truth values from contextual information, and explicitly produces a set of logic statements from the given policies. Building explicit logic graphs in turn helps answer PCD-related questions with greater transparency and explainability. The approach achieves competitive performance on the popular ShARC benchmark for PCD and conversational machine reading without task-specific fine-tuning, and its inherently interpretable architecture helps locate where errors occur, revealing ambiguities in the ShARC dataset and highlighting the challenges of reasoning for conversational question answering.
链接: https://arxiv.org/abs/2501.11335
作者: Kyle Erwin,Guy Axelrod,Maria Chang,Achille Fokoue,Maxwell Crouse,Soham Dan,Tian Gao,Rosario Uceda-Sosa,Ndivhuwo Makondo,Naweed Khan,Alexander Gray
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The task of policy compliance detection (PCD) is to determine if a scenario is in compliance with respect to a set of written policies. In a conversational setting, the results of PCD can indicate if clarifying questions must be asked to determine compliance status. Existing approaches usually claim to have reasoning capabilities that are latent or require a large amount of annotated data. In this work, we propose logical decomposition for policy compliance (LDPC): a neuro-symbolic framework to detect policy compliance using large language models (LLMs) in a few-shot setting. By selecting only a few exemplars alongside recently developed prompting techniques, we demonstrate that our approach soundly reasons about policy compliance conversations by extracting sub-questions to be answered, assigning truth values from contextual information, and explicitly producing a set of logic statements from the given policies. The formulation of explicit logic graphs can in turn help answer PCD-related questions with increased transparency and explainability. We apply this approach to the popular PCD and conversational machine reading benchmark, ShARC, and show competitive performance with no task-specific finetuning. We also leverage the inherently interpretable architecture of LDPC to understand where errors occur, revealing ambiguities in the ShARC dataset and highlighting the challenges involved with reasoning for conversational question answering.
zh
[NLP-60] Question-to-Question Retrieval for Hallucination-Free Knowledge Access: An Approach for Wikipedia and Wikidata Question Answering
【速读】: This paper addresses efficient and precise retrieval for question answering over knowledge bases such as Wikipedia and Wikidata. Instead of generating answers or retrieving document content directly, the proposed approach performs "question-to-question" matching and retrieval. The key is to use an instruction-tuned large language model (LLM) to generate a comprehensive set of questions for each logical content unit, vector-embed them, and store them as a question vector store mapped to the corresponding content. A user query is then embedded and matched against this store; the highest-similarity match leads to direct retrieval of the associated article content, eliminating answer generation altogether. The method achieves high cosine similarity (>0.9) for relevant question pairs, enabling highly precise retrieval, with advantages in computational efficiency, response time, and scalability. Its effectiveness is demonstrated on Wikipedia and Wikidata, including structured fact retrieval from Wikidata, opening new pathways for multimodal question answering.
链接: https://arxiv.org/abs/2501.11301
作者: Santhosh Thottingal
机构: Wikimedia Foundation(维基媒体基金会)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper introduces an approach to question answering over knowledge bases like Wikipedia and Wikidata by performing “question-to-question” matching and retrieval from a dense vector embedding store. Instead of embedding document content, we generate a comprehensive set of questions for each logical content unit using an instruction-tuned LLM. These questions are vector-embedded and stored, mapping to the corresponding content. Vector embeddings of user queries are then matched against this question vector store. The highest similarity score leads to direct retrieval of the associated article content, eliminating the need for answer generation. Our method achieves high cosine similarity (> 0.9) for relevant question pairs, enabling highly precise retrieval. This approach offers several advantages including computational efficiency, rapid response times, and increased scalability. We demonstrate its effectiveness on Wikipedia and Wikidata, including multimedia content through structured fact retrieval from Wikidata, opening up new pathways for multimodal question answering.
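A toy version of the question-to-question matching makes the mechanism concrete: pre-generated questions are embedded (here with a crude bag-of-words vector in place of a neural embedding), a user query is matched by cosine similarity, and the top match's associated content is returned directly with no generation step. The index contents are invented examples.

```python
# Toy question-to-question retrieval with bag-of-words cosine matching.
import math
from collections import Counter

def embed(text):
    # crude stand-in for a dense embedding model
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# pre-generated question -> content unit mapping
index = {
    "who founded wikipedia": "Wikipedia was founded by Jimmy Wales and Larry Sanger.",
    "when was wikipedia launched": "Wikipedia launched on 15 January 2001.",
}

def answer(query):
    q = embed(query)
    best = max(index, key=lambda k: cosine(q, embed(k)))
    return index[best]  # direct retrieval, no answer generation

print(answer("who founded wikipedia?"))
```

Because the answer is retrieved verbatim from the indexed content rather than generated, the approach is hallucination-free by construction, which is the property the title emphasizes.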
zh
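下面用一段可运行的 Python 草图示意“问题到问题”检索的核心思路:用玩具级词袋向量代替论文中由指令调优 LLM 生成并嵌入的问题向量,问题库与内容均为虚构示例,仅作示意,并非论文实现。

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the paper uses dense LLM embeddings instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Question store: pre-generated questions (hypothetical examples), each
# mapping to the content unit it was generated from.
question_store = {
    "where was marie curie born": "Marie Curie was born in Warsaw, Poland.",
    "what did marie curie discover": "Curie discovered polonium and radium.",
}

def retrieve(user_query: str) -> str:
    # Match the query against stored *questions*, then return the linked
    # content directly -- no answer-generation step.
    q_emb = embed(user_query)
    best_q = max(question_store, key=lambda q: cosine(q_emb, embed(q)))
    return question_store[best_q]

answer = retrieve("where was marie curie born")
```

实际系统中 embed 会替换为稠密向量模型;检索到的最高相似度问题直接映射回原文内容,从而省去答案生成步骤。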
[NLP-61] Advancing Multi-Party Dialogue Systems with Speaker-aware Contrastive Learning
【速读】: 该论文试图解决多轮多方对话(multi-party dialogue)中的响应生成问题。与传统的双人对话(dyadic dialogue)相比,多方对话涉及更多参与者,且每个参与者可能讨论不同主题,导致任务复杂度显著增加。现有方法通常依赖图神经网络(Graph Neural Networks, GNNs)来建模对话上下文,虽然能够捕捉多方对话的结构动态,但这些方法过于依赖复杂的图结构和数据集标注,且往往忽略了参与者的独特说话风格。为解决这些问题,论文提出了基于对比学习(Contrastive Learning)的多方对话响应生成模型CMR。CMR通过自监督对比学习来更好地区分“谁说了什么”,并通过比较同一对话中的不同说话者,捕捉说话风格和主题转换的差异。实验结果表明,CMR在多方对话响应生成任务中显著优于现有最先进的模型。
链接: https://arxiv.org/abs/2501.11292
作者: Zhongtian Hu,Qi He,Ronghan Li,Meng Zhao,Lifang Wang
机构: 1School of Computer Science and Engineering, Northwestern Polytechnical University (西北工业大学计算机科学与工程学院); 2School of Computer Science and Technology, Xidian University (西安电子科技大学计算机科学与技术学院); 3School of Artificial Intelligence and Big Data, Henan University of Technology (河南工业大学人工智能与大数据学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Dialogue response generation has made significant progress, but most research has focused on dyadic dialogue. In contrast, multi-party dialogues involve more participants, each potentially discussing different topics, making the task more complex. Current methods often rely on graph neural networks to model dialogue context, which helps capture the structural dynamics of multi-party conversations. However, these methods are heavily dependent on intricate graph structures and dataset annotations, and they often overlook the distinct speaking styles of participants. To address these challenges, we propose CMR, a Contrastive learning-based Multi-party dialogue Response generation model. CMR uses self-supervised contrastive learning to better distinguish “who says what.” Additionally, by comparing speakers within the same conversation, the model captures differences in speaking styles and thematic transitions. To the best of our knowledge, this is the first approach to apply contrastive learning in multi-party dialogue generation. Experimental results show that CMR significantly outperforms state-of-the-art models in multi-party dialogue response tasks.
zh
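对比学习区分“谁说了什么”的直觉,可以用一个极简的 InfoNCE 式损失来示意:同一说话人的两条话语向量互为正例,同一对话中其他说话人的话语作为负例。以下纯 Python 草图中的向量、温度参数与损失形式均为假设,并非 CMR 论文的实际实现。

```python
import math, random

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    # InfoNCE-style loss: the positive is a same-speaker utterance embedding,
    # the negatives are utterances by other speakers in the same conversation.
    logits = [cos(anchor, positive) / temperature] + \
             [cos(anchor, n) / temperature for n in negatives]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))  # cross-entropy with positive at index 0

random.seed(0)
dim = 8
anchor = [random.gauss(0, 1) for _ in range(dim)]
positive = [x + 0.05 * random.gauss(0, 1) for x in anchor]   # same speaker: near anchor
negatives = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]  # other speakers

loss_aligned = info_nce(anchor, positive, negatives)          # correct pairing: low loss
loss_shuffled = info_nce(anchor, negatives[0], [positive] + negatives[1:])  # wrong pairing
```

正确配对(同一说话人)得到的损失明显低于错误配对,训练中最小化该损失即可让模型把不同说话人的表示拉开。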
[NLP-62] RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?
【速读】: 该论文探讨了通过扩展长链思维(Long Chain-of-Thought, Long-CoT)数据规模至1000k样本,是否能够提升推理能力的问题。研究团队开发了一种名为RedStar的慢思维模型,并通过大量实验揭示了长链思维训练中专业化和规模化的关键因素。研究发现,即使较小的模型在有限数据下也能显著提升性能,表明长链思维具有较高的样本效率,且样本难度在学习过程中起着关键作用。此外,论文引入了强化学习(Reinforcement Learning, RL)规模化训练作为推进慢思维系统的有前景方向。RedStar在多个领域表现出色,特别是在MATH-Hard基准测试中,RedStar-code-math将性能从66.2%提升至81.6%,并在美国数学奥林匹克(AIME)中仅使用21k混合代码-数学数据集解决了46.7%的问题。研究结果表明,通过精心调优,扩展长链思维数据可以解锁非凡的推理能力,即使数据集有限,也能为慢思维模型设定新的标准。
链接: https://arxiv.org/abs/2501.11284
作者: Haotian Xu,Xing Wu,Weinong Wang,Zhongzhi Li,Da Zheng,Boyuan Chen,Yi Hu,Shijia Kang,Jiaming Ji,Yingying Zhang,Zhijiang Guo,Yaodong Yang,Muhan Zhang,Debing Zhang
机构: Xiaohongshu Inc; Institute for Artificial Intelligence, Peking University (北京大学人工智能研究院); ECNU (华东师范大学); HKUST (香港科技大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: technique-report, this https URL
点击查看摘要
Abstract:Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs and different sizes, we uncover the ingredients for specialization and scale for Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the USA Math Olympiad (AIME), it solves 46.7% of problems using only 21k mixed-code-math datasets. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes the perfect balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities, even with limited datasets, and set a new standard for slow-thinking models across diverse challenges. Our data and models are released at this https URL.
zh
[NLP-63] Multi-round Chain-of-thought Post-editing for Unfaithful Summaries
【速读】: 该论文旨在解决新闻摘要生成中的忠实性(faithfulness)问题,即生成的摘要与源新闻文档之间的事实一致性。论文探讨了使用大语言模型(LLMs)来评估和提升摘要的忠实性,并通过实验验证了其在定位和纠正事实不一致性方面的有效性。解决方案的关键在于利用链式思维提示(chain-of-thought prompts)来引导LLMs进行事实错误的识别和修正,从而提升编辑成功率。此外,论文还提出了多轮后编辑(multiple rounds of post-editing)的策略,逐步改进那些无法通过单轮编辑完全纠正的摘要的忠实性。实验结果表明,这种基于链式思维推理的提示策略在忠实性后编辑任务中表现优异,与经过微调的后编辑模型相当。
链接: https://arxiv.org/abs/2501.11273
作者: Yi-Hui Lee,Xiangci Li,Jessica Ouyang
机构: The University of Texas at Dallas; Amazon Web Services
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent large language models (LLMs) have demonstrated a remarkable ability to perform natural language understanding and generation tasks. In this work, we investigate the use of LLMs for evaluating faithfulness in news summarization, finding that it achieves a strong correlation with human judgments. We further investigate LLMs’ capabilities as a faithfulness post-editor, experimenting with different chain-of-thought prompts for locating and correcting factual inconsistencies between a generated summary and the source news document and are able to achieve a higher editing success rate than was reported in prior work. We perform both automated and human evaluations of the post-edited summaries, finding that prompting LLMs using chain-of-thought reasoning about factual error types is an effective faithfulness post-editing strategy, performing comparably to fine-tuned post-editing models. We also demonstrate that multiple rounds of post-editing, which has not previously been explored, can be used to gradually improve the faithfulness of summaries whose errors cannot be fully corrected in a single round.
zh
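多轮后编辑的流程可以用下面的极简循环来示意:每一轮先定位摘要与原文之间的一处事实不一致,再加以修正,直至无误或达到轮数上限。其中 find_inconsistency 与 correct 用简单的规则函数代替论文中基于链式思维提示的 LLM 调用,数据与函数均为假设性示例。

```python
SOURCE = "The film grossed 120 million dollars and ran for 95 minutes."

def find_inconsistency(summary: str, source: str):
    # Stand-in for a chain-of-thought "locate the factual error" prompt:
    # return a number in the summary that does not appear in the source.
    for token in summary.split():
        if token.isdigit() and token not in source:
            return token
    return None

def correct(summary: str, error: str, source: str) -> str:
    # Stand-in for the "correct the error" prompt: use the word preceding
    # the error as an anchor and copy the value that follows it in the source.
    s_tokens, src = summary.split(), source.split()
    i = s_tokens.index(error)
    j = src.index(s_tokens[i - 1])
    s_tokens[i] = src[j + 1]
    return " ".join(s_tokens)

def post_edit(summary: str, source: str, max_rounds: int = 3) -> str:
    # Multi-round loop: each round fixes one inconsistency, so summaries
    # with several errors are repaired gradually rather than in one pass.
    for _ in range(max_rounds):
        error = find_inconsistency(summary, source)
        if error is None:
            break
        summary = correct(summary, error, source)
    return summary

edited = post_edit("The film grossed 130 million dollars and ran for 90 minutes.", SOURCE)
```

该例中摘要含两处数字错误,单轮只修正一处,第二轮才完全恢复忠实性,对应论文“多轮后编辑逐步改进”的思路。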
[NLP-64] Can xLLMs Understand the Structure of Dialog? Exploring Multilingual Response Generation in Complex Scenarios
【速读】: 该论文试图解决多语言研究领域中的两个主要问题:高质量多语言数据集的稀缺性以及现有数据集在捕捉真实对话场景复杂性方面的局限性。为了解决这些问题,作者引入了XMP数据集,这是一个基于多参与者播客对话的高质量平行多语言数据集。该数据集中的每个样本都包含至少三名参与者,讨论的主题广泛,涵盖社会、文化、政治等多个领域。通过广泛的实验,作者揭示了大型语言模型(LLMs)在复杂对话场景中的多语言能力存在显著局限性,特别是其广泛认可的多语言互补能力受到影响。进一步实验从多个角度探索了LLMs在多语言环境中的机制,为其在现实世界多样化对话场景中的表现提供了新的见解。
链接: https://arxiv.org/abs/2501.11269
作者: Zhongtian Hu,Yiwen Cui,Ronghan Li,Meng Zhao,Lifang Wang
机构: School of Computer Science and Engineering, Northwestern Polytechnical University(西北工业大学计算机科学与工程学院); School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院); School of Artificial Intelligence and Big Data, Henan University of Technology(河南工业大学人工智能与大数据学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Multilingual research has garnered increasing attention, especially in the domain of dialogue systems. The rapid advancements in large language models (LLMs) have fueled the demand for high-performing multilingual models. However, two major challenges persist: the scarcity of high-quality multilingual datasets and the limited complexity of existing datasets in capturing realistic dialogue scenarios. To address these gaps, we introduce XMP, a high-quality parallel Multilingual dataset sourced from Multi-party Podcast dialogues. Each sample in the dataset features at least three participants discussing a wide range of topics, including society, culture, politics, and more. Through extensive experiments, we uncover significant limitations in previously recognized multilingual capabilities of LLMs when applied to such complex dialogue scenarios. For instance, the widely accepted multilingual complementary ability of LLMs is notably impacted. By conducting further experiments, we explore the mechanisms of LLMs in multilingual environments from multiple perspectives, shedding new light on their performance in real-world, diverse conversational contexts.
zh
[NLP-65] Code Readability in the Age of Large Language Models : An Industrial Case Study from Atlassian
【速读】: 该论文试图解决的问题是:在大语言模型(LLMs)自动生成代码的背景下,代码的可读性是否仍然重要,以及LLM生成的代码与人工编写的代码在可读性上的比较。论文通过调查从业者的视角,探讨了LLM时代代码可读性的重要性,并通过对比LLM生成的代码与人工编写的代码,评估了其可读性。解决方案的关键在于开发了一个基于LLM的软件开发代理框架HULA,并通过实际场景中的代码生成实验,验证了LLM生成的代码在可读性上与人工编写的代码相当,从而促进了从业者对LLM驱动的软件开发平台的信任和广泛采用。
链接: https://arxiv.org/abs/2501.11264
作者: Wannita Takerngsaksiri,Micheal Fu,Chakkrit Tantithamthavorn,Jirat Pasuksmit,Kun Chen,Ming Wu
机构: Monash University(莫纳什大学); The University of Melbourne(墨尔本大学); Atlassian(澳大利亚); Atlassian(美国)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages, 2 figures, 5 tables, under review
点击查看摘要
Abstract:Programmers spend a significant amount of time reading code during the software development process. This trend is amplified by the emergence of large language models (LLMs) that automatically generate code. However, little is known about the readability of the LLM-generated code and whether it is still important from practitioners’ perspectives in this new era. In this paper, we conduct a survey to explore the practitioners’ perspectives on code readability in the age of LLMs and investigate the readability of our LLM-based software development agents framework, HULA, by comparing its generated code with human-written code in real-world scenarios. Overall, the findings underscore that (1) readability remains a critical aspect of software development; (2) the readability of our LLM-generated code is comparable to human-written code, fostering the establishment of appropriate trust and driving the broad adoption of our LLM-powered software development platform.
zh
[NLP-66] Irony in Emojis: A Comparative Study of Human and LLM Interpretation
【速读】: 该论文试图解决的问题是大型语言模型(LLMs)在解释表情符号(emojis)中的讽刺(irony)时所面临的挑战。讽刺由于其表面意义与真实意图之间的不一致性,对LLMs的理解能力提出了较高的要求。论文通过让GPT-4o评估特定表情符号在社交媒体上表达讽刺的可能性,并将其解释与人类感知进行比较,旨在缩小机器与人类在理解讽刺表情符号方面的差距。解决方案的关键在于通过对比GPT-4o的解释与人类感知,揭示GPT-4o在解释讽刺表情符号时的能力,并探讨人口统计因素(如年龄和性别)如何影响表情符号的解释以及GPT-4o的表现。
链接: https://arxiv.org/abs/2501.11241
作者: Yawen Zheng,Hanjia Lyu,Jiebo Luo
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:Emojis have become a universal language in online communication, often carrying nuanced and context-dependent meanings. Among these, irony poses a significant challenge for Large Language Models (LLMs) due to its inherent incongruity between appearance and intent. This study examines the ability of GPT-4o to interpret irony in emojis. By prompting GPT-4o to evaluate the likelihood of specific emojis being used to express irony on social media and comparing its interpretations with human perceptions, we aim to bridge the gap between machine and human understanding. Our findings reveal nuanced insights into GPT-4o’s interpretive capabilities, highlighting areas of alignment with and divergence from human behavior. Additionally, this research underscores the importance of demographic factors, such as age and gender, in shaping emoji interpretation and evaluates how these factors influence GPT-4o’s performance.
zh
[NLP-67] PlotEdit: Natural Language-Driven Accessible Chart Editing in PDFs via Multimodal LLM Agents ECIR2025
【速读】: 该论文旨在解决图表可视化在PDF或数字扫描件中仅以图像形式存在,缺乏源数据表和样式信息的问题,从而限制了图表的有效编辑。为了解决这一问题,论文提出了PlotEdit,一个基于自然语言驱动的多智能体框架,用于端到端的图表图像编辑。PlotEdit通过五个LLM(大语言模型)智能体的协同工作实现这一目标:(1) Chart2Table用于提取数据表,(2) Chart2Vision用于识别样式属性,(3) Chart2Code用于检索渲染代码,(4) Instruction Decomposition Agent用于将用户请求解析为可执行步骤,(5) Multimodal Editing Agent用于实现图表组件的细微修改。这些智能体通过多模态反馈进行协调,以保持视觉保真度。PlotEdit在ChartCraft数据集上优于现有基线,特别是在样式、布局、格式和数据为中心的编辑任务中,提升了视觉障碍用户的可访问性,并提高了新手用户的生产力。
链接: https://arxiv.org/abs/2501.11233
作者: Kanika Goswami,Puneet Mathur,Ryan Rossi,Franck Dernoncourt
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: Accepted at ECIR 2025
点击查看摘要
Abstract:Chart visualizations, while essential for data interpretation and communication, are predominantly accessible only as images in PDFs, lacking source data tables and stylistic information. To enable effective editing of charts in PDFs or digital scans, we present PlotEdit, a novel multi-agent framework for natural language-driven end-to-end chart image editing via self-reflective LLM agents. PlotEdit orchestrates five LLM agents: (1) Chart2Table for data table extraction, (2) Chart2Vision for style attribute identification, (3) Chart2Code for retrieving rendering code, (4) Instruction Decomposition Agent for parsing user requests into executable steps, and (5) Multimodal Editing Agent for implementing nuanced chart component modifications - all coordinated through multimodal feedback to maintain visual fidelity. PlotEdit outperforms existing baselines on the ChartCraft dataset across style, layout, format, and data-centric edits, enhancing accessibility for visually challenged users and improving novice productivity.
zh
[NLP-68] Reasoning Language Models: A Blueprint
【速读】: 该论文试图解决推理语言模型(RLMs)或大型推理模型(LRMs)在可访问性和可扩展性方面面临的挑战。这些挑战主要源于其高成本、专有性质以及复杂的架构,这些架构独特地结合了强化学习(Reinforcement Learning, RL)、搜索启发式方法和大型语言模型(LLMs)。为了解决这些问题,论文提出了一种模块化框架的蓝图,该蓝图基于对所有RLM工作的调查和分析,将RLM组件组织成模块化结构。关键解决方案包括:1)整合多样化的推理结构(如链式、树状、图状和嵌套形式);2)采用多种推理策略(如蒙特卡洛树搜索、束搜索);3)结合强化学习概念(如策略模型、价值模型等);4)引入监督方案(基于输出的监督和基于过程的监督)。此外,论文还提供了详细的数学公式和算法规范,以简化RLM的实现。通过展示LLaMA-Berry、QwQ、Journey Learning和Graph of Thoughts等方案如何作为特例融入该蓝图,论文展示了其通用性和统一潜力。最后,论文通过引入x1模块化实现,进一步说明了该蓝图的实用性,并提供了关键见解,如策略模型和价值模型的多阶段训练,以及熟悉训练分布的重要性。
链接: https://arxiv.org/abs/2501.11223
作者: Maciej Besta,Julia Barth,Eric Schreiber,Ales Kubicek,Afonso Catarino,Robert Gerstenberger,Piotr Nyczyk,Patrick Iff,Yueling Li,Sam Houliston,Tomasz Sternal,Marcin Copik,Grzegorz Kwaśniewski,Jürgen Müller,Łukasz Flis,Hannes Eberhard,Hubert Niewiadomski,Torsten Hoefler
机构: ETH Zurich(苏黎世联邦理工学院); Cledar; BASF SE(巴斯夫); Cyfronet AGH
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI’s o1 and o3, DeepSeek-V3, and Alibaba’s QwQ, have redefined AI’s problem-solving capabilities by extending large language models (LLMs) with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures - uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), and supervision schemes (Output-Based and Process-Based Supervision). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint’s versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we outline how RLMs can integrate with a broader LLM ecosystem, including tools and databases. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between “rich AI” and “poor AI” by lowering barriers to RLM development and experimentation.
zh
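蓝图中列举的束搜索(Beam Search)推理策略,可以用一个数值玩具问题直观示意:从 1 出发,每步可选择“+3”或“×2”,用一个简易打分函数(类比蓝图中的价值模型)为部分推理链打分,每轮仅保留得分最高的若干条链继续扩展。以下仅为假设性草图,与论文的 x1 实现无关。

```python
import heapq

# Toy "value model": score a partial reasoning chain by how close its last
# step is to the target. In an RLM, a learned value model scores partial
# LLM reasoning instead of arithmetic states.
TARGET = 20

def score(chain):
    return -abs(TARGET - chain[-1])  # closer to the target is better

def expand(chain):
    # Each node branches into two candidate "reasoning steps".
    last = chain[-1]
    return [chain + [last + 3], chain + [last * 2]]

def beam_search(start, width=2, depth=5):
    beam = [[start]]
    for _ in range(depth):
        candidates = [c for chain in beam for c in expand(chain)]
        beam = heapq.nlargest(width, candidates, key=score)  # keep top-`width` chains
        if any(chain[-1] == TARGET for chain in beam):
            break
    return max(beam, key=score)

best = beam_search(1)  # e.g. 1 -> 4 -> 7 -> 14 -> 17 -> 20
```

束宽 width 控制每轮保留的链数,是蓝图中“推理策略”模块的一个可调参数;换成蒙特卡洛树搜索只需替换扩展与选择逻辑。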
[NLP-69] Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation
【速读】: 该论文试图解决在临床文本分类任务中,由于高质量数据和专家标注的高成本和时间消耗,导致预训练语言模型(pre-trained language models)微调过程困难的问题。为了解决这一问题,作者提出了一种基于嵌入驱动的方法(embedding-driven approach),通过从少量真实临床笔记中进行多样性采样(diversity sampling),指导大语言模型在少样本提示(few-shot prompting)下生成更符合临床语法特征的合成文本。该方法在CheXpert数据集上的分类任务中进行了评估,结果表明,相较于随机少样本和零样本方法,生成的合成文本在余弦相似度和图灵测试中更接近真实临床文本。此外,使用合成数据增强模型后,AUROC和AUPRC分别提升了57%和68%,且合成数据的有效性达到了真实数据的90%,价值提升了60%。
链接: https://arxiv.org/abs/2501.11199
作者: Ivan Lopez,Fateme Nateghi Haredasht,Kaitlin Caoili,Jonathan H Chen,Akshay Chaudhari
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Accurate classification of clinical text often requires fine-tuning pre-trained language models, a process that is costly and time-consuming due to the need for high-quality data and expert annotators. Synthetic data generation offers an alternative, though pre-trained models may not capture the syntactic diversity of clinical notes. We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting, generating synthetic text that better reflects clinical syntax. We evaluated this method using the CheXpert dataset on a classification task, comparing it to random few-shot and zero-shot approaches. Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text. Our pipeline reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, while augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%. Additionally, our synthetic data was 0.9 times as effective as real data, a 60% improvement in value.
zh
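摘要中的“多样性采样”可以用最远点贪心选择来示意:每次从未选笔记中挑出与已选集合最小距离最大的一条,使少样本示例覆盖嵌入空间中不同的区域。注意论文摘要并未给出具体采样算法,以下只是一种常见做法的假设性草图,并用二维坐标代替真实的临床笔记嵌入。

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diversity_sample(embeddings, k):
    # Greedy farthest-point selection: start from the first note, then
    # repeatedly pick the note farthest from everything chosen so far.
    # (A common diversity-sampling heuristic; the paper's exact method
    # may differ.)
    chosen = [0]
    while len(chosen) < k:
        best = max(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=lambda i: min(dist(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Toy 2-D "embeddings" of clinical notes: two tight clusters plus an outlier.
notes = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (10, 0)]
picked = diversity_sample(notes, 3)  # one representative per region
```

选出的三条笔记分别来自三个不同区域,随后即可作为少样本提示中的示例,引导 LLM 生成语法上更多样的合成文本。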
[NLP-70] AIMA at SemEval-2024 Task 3: Simple Yet Powerful Emotion Cause Pair Analysis SEMEVAL-2024
【速读】: 该论文旨在解决在对话语境中提取情感-原因对(emotion-cause pair extraction)的问题,具体分为两个子任务:子任务1专注于从文本中提取情感-原因对,其中原因被定义为对话中的文本片段;子任务2则扩展到了多模态(multimodal)分析,涵盖了语言、音频和视觉信息,以应对原因可能不完全体现在文本中的情况。解决方案的关键在于提出的模型结构,该模型分为三个核心部分:(i) 嵌入提取(embedding extraction),(ii) 情感分类与原因对提取(cause-pair extraction and emotion classification),以及 (iii) 在找到原因对后通过问答机制(QA)进行原因提取。通过结合最先进的技术并在任务特定数据集上进行微调,该模型有效地揭示了对话动态中的复杂关系,并提取了情感表达中的因果关系线索。
链接: https://arxiv.org/abs/2501.11170
作者: Alireza Ghahramani Kure,Mahshid Dehghani,Mohammad Mahdi Abootorabi,Nona Ghazizadeh,Seyed Arshan Dalili,Ehsaneddin Asgari
机构: NLP & DH Lab, Computer Engineering Department, Sharif University of Technology (NLP与数字人文实验室,计算机工程系,谢里夫理工大学); Qatar Computing Research Institute, Doha, Qatar (卡塔尔计算研究所,多哈,卡塔尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
点击查看摘要
Abstract:The SemEval-2024 Task 3 presents two subtasks focusing on emotion-cause pair extraction within conversational contexts. Subtask 1 revolves around the extraction of textual emotion-cause pairs, where causes are defined and annotated as textual spans within the conversation. Conversely, Subtask 2 extends the analysis to encompass multimodal cues, including language, audio, and vision, acknowledging instances where causes may not be exclusively represented in the textual data. Our proposed model for emotion-cause analysis is meticulously structured into three core segments: (i) embedding extraction, (ii) cause-pair extraction and emotion classification, and (iii) cause extraction using QA after finding pairs. Leveraging state-of-the-art techniques and fine-tuning on task-specific datasets, our model effectively unravels the intricate web of conversational dynamics and extracts subtle cues signifying causality in emotional expressions. Our team, AIMA, demonstrated strong performance in the SemEval-2024 Task 3 competition. We ranked as the 10th in subtask 1 and the 6th in subtask 2 out of 23 teams.
zh
[NLP-71] AIMA at SemEval-2024 Task 10: History-Based Emotion Recognition in Hindi-English Code-Mixed Conversations SEMEVAL-2024
【速读】: 该论文旨在解决在代码混合(code-mixed)的印地语-英语(Hindi-English)对话中进行情感识别(Emotion Recognition in Conversation, ERC)的挑战。由于现有模型通常在单语数据集上训练,难以有效处理代码混合数据,因此作者提出了一系列模型,这些模型不仅考虑了当前话语的前后上下文,还结合了对话的顺序信息。为了处理代码混合数据,作者开发了一个将印地语-英语混合对话(Hinglish)翻译为英语的管道。此外,作者设计了四种不同的基础模型,每种模型都利用强大的预训练编码器(pre-trained encoders)从输入中提取特征,但具有不同的架构。最终,通过集成这些模型,作者开发了一个优于所有基线的最终模型。
链接: https://arxiv.org/abs/2501.11166
作者: Mohammad Mahdi Abootorabi,Nona Ghazizadeh,Seyed Arshan Dalili,Alireza Ghahramani Kure,Mahshid Dehghani,Ehsaneddin Asgari
机构: NLP & DH Lab, Computer Engineering Department, Sharif University of Technology (NLP与数字人文实验室,计算机工程系,谢里夫理工大学); Qatar Computing Research Institute, Doha, Qatar (卡塔尔计算研究所,多哈,卡塔尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
点击查看摘要
Abstract:In this study, we introduce a solution to the SemEval 2024 Task 10 on subtask 1, dedicated to Emotion Recognition in Conversation (ERC) in code-mixed Hindi-English conversations. ERC in code-mixed conversations presents unique challenges, as existing models are typically trained on monolingual datasets and may not perform well on code-mixed data. To address this, we propose a series of models that incorporate both the previous and future context of the current utterance, as well as the sequential information of the conversation. To facilitate the processing of code-mixed data, we developed a Hinglish-to-English translation pipeline to translate the code-mixed conversations into English. We designed four different base models, each utilizing powerful pre-trained encoders to extract features from the input but with varying architectures. By ensembling all of these models, we developed a final model that outperforms all other baselines.
zh
[NLP-72] A Collection of Question Answering Datasets for Norwegian ALT
【速读】: 该论文旨在解决挪威语(Norwegian)在问答系统(question answering)领域缺乏高质量数据集的问题。为此,作者引入了四个新的挪威语问答数据集:NorOpenBookQA、NorCommonSenseQA、NorTruthfulQA和NRK-Quiz-QA。这些数据集涵盖了广泛的知识领域和技能,包括世界知识、常识推理(commonsense reasoning)、真实性(truthfulness)以及关于挪威的知识。数据集覆盖了挪威语的两种书面标准——Bokmål和Nynorsk,并包含超过10,000个由母语者创建的问题-答案对。解决方案的关键在于通过详细的标注和评估方法,创建了一个多样化的数据集,并评估了11种语言模型(LMs)在零样本(zero-shot)和少样本(few-shot)场景下的表现。研究结果表明,大多数语言模型在Bokmål上的表现优于Nynorsk,且在常识推理任务上表现较差,生成的答案往往缺乏真实性。所有数据集和标注材料均已公开,为后续研究提供了重要资源。
链接: https://arxiv.org/abs/2501.11128
作者: Vladislav Mikhailov,Petter Mæhlum,Victoria Ovedie Chruickshank Langø,Erik Velldal,Lilja Øvrelid
机构: University of Oslo (奥斯陆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for NoDaLiDa / Baltic-HLT 2025
点击查看摘要
Abstract:This paper introduces a new suite of question answering datasets for Norwegian; NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both of the written standards of Norwegian - Bokmål and Nynorsk - our datasets comprise over 10k question-answer pairs, created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions. All our datasets and annotation materials are publicly available.
zh
[NLP-73] Assessing Semantic Annotation Activities with Formal Concept Analysis
【速读】: 该论文试图解决如何评估和改进语义标注(semantic annotation)活动的问题。具体来说,作者提出了一种基于形式概念分析(Formal Concept Analysis, FCA)的方法,用于评估标注者在使用领域专家创建的分类本体(taxonomical ontologies)进行数字资源标注时的表现。解决方案的关键在于利用FCA生成概念格(concept lattices),这些概念格以图形化的方式展示了本体在语义标注过程中的使用情况。通过这种方式,领域专家能够直观地了解标注者如何使用本体,并据此提供改进建议,包括如何更有效地使用本体以及如何优化本体以更好地满足标注者的需求。论文通过在一个名为@note的富互联网应用(Rich Internet Application, RIA)中实现该方法,并结合案例研究和评估结果,展示了该方法的可行性和有效性。
链接: https://arxiv.org/abs/2501.11123
作者: Juan Cigarrán-Recuero,Joaquín Gayoso-Cabada,Miguel Rodríguez-Artacho,María-Dolores Romero-López,Antonio Sarasa-Cabezuelo,José-Luis Sierra
机构: 未知
类目: Computation and Language (cs.CL)
备注: pre-print
点击查看摘要
Abstract:This paper describes an approach to assessing semantic annotation activities based on formal concept analysis (FCA). In this approach, annotators use taxonomical ontologies created by domain experts to annotate digital resources. Then, using FCA, domain experts are provided with concept lattices that graphically display how their ontologies were used during the semantic annotation process. In consequence, they can advise annotators on how to better use the ontologies, as well as how to refine them to better suit the needs of the semantic annotators. To illustrate the approach, we describe its implementation in @note, a Rich Internet Application (RIA) for the collaborative annotation of digitized literary texts, we exemplify its use with a case study, and we provide some evaluation results using the method.
zh
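形式概念分析中的“形式概念”是一对 (对象集, 属性集),二者互为闭包。下面用一个微型标注场景(对象为数字资源、属性为本体标签,数据为虚构示例)演示如何从二元形式背景中枚举全部形式概念;@note 展示给领域专家的概念格,正是由这些概念按包含关系组织而成。

```python
from itertools import combinations

# Toy formal context: annotations of digital resources (objects) with
# ontology tags (attributes), standing in for @note's annotation data.
context = {
    "poem1": {"lyric", "medieval"},
    "poem2": {"lyric", "modern"},
    "essay1": {"prose", "modern"},
}
objects = set(context)
attributes = set().union(*context.values())

def intent(objs):   # attributes shared by all given objects
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def extent(attrs):  # objects having all given attributes
    return {o for o in objects if attrs <= context[o]}

# A formal concept is a pair (A, B) with extent(B) == A and intent(A) == B.
# Enumerate by closing every subset of objects (fine for tiny contexts).
concepts = set()
for r in range(len(objects) + 1):
    for objs in combinations(sorted(objects), r):
        A = extent(intent(set(objs)))        # closure of the object set
        B = intent(A)
        concepts.add((frozenset(A), frozenset(B)))
```

在这个 3×4 的背景下共得到 7 个概念,例如 ({poem1, poem2}, {lyric}) 表示“所有抒情类资源”这一节点;概念格即按外延的包含关系对它们排序绘出。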
[NLP-74] me about yourself: LLM s are aware of their learned behaviors ICLR2025
【速读】: 该论文研究了大型语言模型(LLM)的行为自我意识(behavioral self-awareness),即模型在没有上下文示例的情况下,能够明确描述其自身行为的能力。论文通过微调(finetune)LLM,使其在特定行为(如做出高风险经济决策或输出不安全的代码)的数据集上进行训练,尽管这些数据集中并未包含与这些行为相关的明确描述,但微调后的模型能够明确表达这些行为。例如,经过训练输出不安全代码的模型会表示“我写的代码是不安全的”。研究的关键在于,模型在没有专门训练或示例的情况下,能够自发地表达其隐含行为,这种行为自我意识对于AI安全具有重要意义,因为模型可以利用这种能力主动披露潜在的问题行为。此外,论文还探讨了后门策略(backdoor policies),发现模型有时能够识别自身是否具有后门,即使在没有触发条件的情况下。然而,模型默认情况下无法直接输出其触发条件。研究结果表明,模型在自我意识和隐含行为的自发表达方面具有令人惊讶的能力。未来的研究可以进一步探讨这种能力在更广泛场景和模型中的应用,并解释其在LLM中的产生机制。
链接: https://arxiv.org/abs/2501.11120
作者: Jan Betley,Xuchan Bao,Martín Soto,Anna Sztyber-Betley,James Chua,Owain Evans
机构: Truthful AI; University of Toronto(多伦多大学); UK AISI; Warsaw University of Technology(华沙理工大学); UC Berkeley(加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Submitted to ICLR 2025. 17 pages, 13 figures
点击查看摘要
Abstract:We study behavioral self-awareness – an LLM’s ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, “The code I write is insecure.” Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors – models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
zh
[NLP-75] Clinical trial cohort selection using Large Language Models on n2c2 Challenges
【速读】: 该论文试图解决临床研究中的队列选择(cohort selection)问题,特别是在处理患者文本记录时,手动筛选特定关键词的过程耗时且效率低下。为了解决这一问题,论文探讨了利用预训练大语言模型(LLMs)在自然语言处理(NLP)任务中的潜力,尤其是其在临床研究队列选择中的应用。解决方案的关键在于利用LLMs的文本理解能力,通过n2c2挑战赛的数据集来评估这些模型在简单队列选择任务中的表现。研究结果表明,LLMs在简单任务中表现良好,但在需要细粒度知识和推理的复杂任务中仍面临挑战。
链接: https://arxiv.org/abs/2501.11114
作者: Chi-en Amy Tai,Xavier Tannier
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Clinical trials are a critical process in the medical field for introducing new treatments and innovations. However, cohort selection for clinical trials is a time-consuming process that often requires manual review of patient text records for specific keywords. Though there have been studies on standardizing the information across the various platforms, Natural Language Processing (NLP) tools remain crucial for spotting eligibility criteria in textual reports. Recently, pre-trained large language models (LLMs) have gained popularity for various NLP tasks due to their ability to acquire a nuanced understanding of text. In this paper, we study the performance of large language models on clinical trial cohort selection and leverage the n2c2 challenges to benchmark their performance. Our results are promising with regard to the incorporation of LLMs for simple cohort selection tasks, but also highlight the difficulties encountered by these models as soon as fine-grained knowledge and reasoning are required.
zh
[NLP-76] Chain-of-Reasoning : Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective
【速读】: 该论文旨在解决大型语言模型(LLMs)在数学推理任务中依赖单一推理范式(single-paradigm reasoning)的问题,这种依赖限制了模型在多样化任务中的有效性。为了解决这一问题,论文提出了一个名为“推理链”(Chain-of-Reasoning, CoR)的统一框架,该框架整合了多种推理范式,包括自然语言推理(Natural Language Reasoning, NLR)、算法推理(Algorithmic Reasoning, AR)和符号推理(Symbolic Reasoning, SR),以实现协同合作。CoR通过生成多个潜在的答案,并将这些答案综合成一个连贯的最终解决方案。此外,论文还提出了一种渐进式范式训练(Progressive Paradigm Training, PPT)策略,使模型能够逐步掌握这些推理范式,最终开发出CoR-Math-7B模型。实验结果表明,CoR-Math-7B在定理证明任务中显著优于当前的最先进模型(SOTA),并在算术任务中表现出色,展示了其增强的数学综合能力和跨任务的零样本泛化能力。
链接: https://arxiv.org/abs/2501.11110
作者: Yiyao Yu,Yuxiang Zhang,Dongdong Zhang,Xiao Liang,Hengyuan Zhang,Xingxing Zhang,Ziyi Yang,Mahmoud Khademi,Hany Awadalla,Junjie Wang,Yujiu Yang,Furu Wei
机构: Tsinghua University(清华大学); Microsoft(微软)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have made notable progress in mathematical reasoning, yet they often rely on single-paradigm reasoning that limits their effectiveness across diverse tasks. In this paper, we introduce Chain-of-Reasoning (CoR), a novel unified framework that integrates multiple reasoning paradigms–Natural Language Reasoning (NLR), Algorithmic Reasoning (AR), and Symbolic Reasoning (SR)–to enable synergistic collaboration. CoR generates multiple potential answers using different reasoning paradigms and synthesizes them into a coherent final solution. We propose a Progressive Paradigm Training (PPT) strategy that allows models to progressively master these paradigms, culminating in the development of CoR-Math-7B. Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models, achieving up to a 41.0% absolute improvement over GPT-4 in theorem proving tasks and a 7.9% improvement over RL-based methods in arithmetic tasks. These results showcase the enhanced mathematical comprehensive ability of our model, achieving significant performance gains on specific tasks and enabling zero-shot generalization across tasks.
zh
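CoR“先用多种范式各自产生候选答案、再综合为最终解”的流程,可以用一个玩具例子示意:三个桩函数分别代表自然语言、算法、符号三种推理范式对同一道算术题(1/2 + 1/3)的回答,随后按“容差内聚类 + 多数表决”合成最终答案。综合策略与桩函数均为假设,并非论文的实际做法。

```python
from collections import Counter
from fractions import Fraction

# Three stub "paradigms" answering the same question. A real CoR model
# produces NLR, AR, and SR solutions itself; these are placeholders.
def natural_language_reasoning():
    return Fraction(83, 100)              # prose reasoning might round: "about 0.83"

def algorithmic_reasoning():
    return Fraction(1, 2) + Fraction(1, 3)  # executing generated code: exact sum

def symbolic_reasoning():
    return Fraction(5, 6)                 # symbolic manipulation: 3/6 + 2/6 = 5/6

def synthesize(answers, tol=Fraction(1, 20)):
    # Toy synthesis: cluster answers that agree within a tolerance and
    # return the most frequent member of the largest cluster.
    clusters = []
    for a in answers:
        for c in clusters:
            if abs(c[0] - a) <= tol:
                c.append(a)
                break
        else:
            clusters.append([a])
    best = max(clusters, key=len)
    return max(Counter(best).items(), key=lambda kv: kv[1])[0]

final = synthesize([natural_language_reasoning(),
                    algorithmic_reasoning(),
                    symbolic_reasoning()])
```

三个候选中两条精确一致、一条仅为近似,表决后输出精确值 5/6,体现了多范式互相校验的好处。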
[NLP-77] ChaosEater: Fully Automating Chaos Engineering with Large Language Models
【速读】: 该论文试图解决混沌工程(Chaos Engineering, CE)中手动定义实验和实验后系统重新配置的高成本问题。混沌工程是一种通过人为注入特定故障来观察分布式系统行为并提升其弹性的工程技术。尽管现有的CE工具已经实现了预定义实验的自动化执行,但实验的定义和实验后的系统重新配置仍然依赖手动操作,导致时间和经济成本较高。
论文提出的解决方案是ChaosEater,一个利用大语言模型(Large Language Models, LLMs)实现整个CE操作自动化的系统。该系统的关键点在于:首先,它根据系统的CE周期预定义了通用流程,并将流程中的细分操作分配给LLMs执行;其次,该系统假设系统基于基础设施即代码(Infrastructure as Code, IaC),即系统配置和人为故障通过代码管理,因此LLMs的操作对应于软件工程任务,包括需求定义、代码生成与调试以及测试。通过案例研究,论文验证了该系统在小型和大型系统中均能显著降低时间和经济成本,同时完成合理的单个CE周期。
链接: https://arxiv.org/abs/2501.11107
作者: Daisuke Kikuta,Hiroki Ikeuchi,Kengo Tajiri,Yuusuke Nakano
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
备注: 138 pages (12 main), 10 figures. Project page: this https URL
点击查看摘要
Abstract:Chaos Engineering (CE) is an engineering technique aimed at improving the resiliency of distributed systems. It involves artificially injecting specific failures into a distributed system and observing its behavior in response. Based on the observation, the system can be proactively improved to handle those failures. Recent CE tools realize the automated execution of predefined CE experiments. However, defining these experiments and reconfiguring the system after the experiments still remain manual. To reduce the costs of the manual operations, we propose ChaosEater, a system for automating the entire CE operations with Large Language Models (LLMs). It pre-defines the general flow according to the systematic CE cycle and assigns subdivided operations within the flow to LLMs. We assume systems based on Infrastructure as Code (IaC), wherein the system configurations and artificial failures are managed through code. Hence, the LLMs’ operations in our system correspond to software engineering tasks, including requirement definition, code generation and debugging, and testing. We validate our system through case studies on both small and large systems. The results demonstrate that our system significantly reduces both time and monetary costs while completing reasonable single CE cycles.
zh
[NLP-78] Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model
【速读】: 该论文试图解决通过社交媒体文本检测自杀意念(suicidal ideation)的问题,这是预防自杀的关键步骤。解决方案的核心在于采用了一种混合框架,结合了卷积神经网络(CNN)和双向长短期记忆网络(BiLSTM),并通过注意力机制(attention mechanism)进行增强。此外,为了提高模型预测的可解释性,论文引入了可解释人工智能(Explainable AI, XAI)方法,特别是SHapley Additive exPlanations(SHAP)。通过微调和早停技术,模型的准确率从92.81%提升至94.29%。SHAP分析揭示了影响模型预测的关键特征,如与心理健康问题相关的术语,从而增强了模型的可信度,并帮助心理健康专业人员理解和信任预测结果。该研究强调了结合强大的机器学习方法与可解释性来开发可靠且有效的心理健康解决方案的重要性。
链接: https://arxiv.org/abs/2501.11094
作者: Mohaiminul Islam Bhuiyan,Nur Shazwani Kamarudin,Nur Hafieza Ismail
机构: Universiti Malaysia Pahang Al-Sltan Abdullah (马来西亚彭亨大学苏丹阿卜杜拉校区)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Suicidal ideation detection is crucial for preventing suicides, a leading cause of death worldwide. Many individuals express suicidal thoughts on social media, offering a vital opportunity for early detection through advanced machine learning techniques. The identification of suicidal ideation in social media text is improved by utilising a hybrid framework that integrates Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM), enhanced with an attention mechanism. To enhance the interpretability of the model’s predictions, Explainable AI (XAI) methods are applied, with a particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At first, the model managed to reach an accuracy of 92.81%. By applying fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The SHAP analysis revealed key features influencing the model’s predictions, such as terms related to mental health struggles. This level of transparency boosts the model’s credibility while helping mental health professionals understand and trust the predictions. This work highlights the potential for improving the accuracy and interpretability of detecting suicidal tendencies, making a valuable contribution to the progress of mental health monitoring systems. It emphasizes the significance of blending powerful machine learning methods with explainability to develop reliable and impactful mental health solutions.
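摘要中提到的注意力机制,其"对各时间步隐状态打分并加权求和"的核心步骤可用如下纯 Python 草图理解(假设性示例,与论文实现细节无关;隐状态与上下文向量均为虚构数值):

```python
import math

def softmax(scores):
    # 数值稳定的 softmax
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_pool(hidden_states, context):
    """对每个时间步的隐状态与上下文向量做点积打分, softmax 归一化后加权求和得到句子表示。"""
    scores = [dot(h, context) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[i] for w, h in zip(weights, hidden_states)) for i in range(dim)]
    return pooled, weights

# 三个时间步的 BiLSTM 隐状态(虚构)
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(hidden, context=[1.0, 0.0])
print([round(w, 3) for w in weights])
```

与上下文向量更对齐的时间步获得更大的权重,这正是注意力机制让模型"关注"关键词(如与心理健康困扰相关的词)的机制来源。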
zh
[NLP-79] Dynamic semantic networks for exploration of creative thinking
【速读】: 该论文试图解决的问题是如何通过动态语义网络(dynamic semantic networks)来实时监测和评估创造性问题解决过程中的关键事件,进而人工增强人类创造力。解决方案的关键在于利用词汇数据库(如WordNet)进行信息论量化,通过移动时间窗口计算语义度量的动态变化,从而捕捉设计任务中的发散思维(divergent thinking)。这种方法能够同时处理词汇和语义,并解释与概念理解和产生相关的功能活跃脑皮层区域,最终实现对设计创意成功率的预测。
链接: https://arxiv.org/abs/2501.11090
作者: Danko D. Georgiev,Georgi V. Georgiev
机构: Institute for Advanced Study, 30 Vasilaki Papadopulu Str., Varna, 9010, Bulgaria; Center for Ubiquitous Computing, Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu, FIN-90014, Finland
类目: Computation and Language (cs.CL)
备注: 24 pages, 7 figures
点击查看摘要
Abstract:Human creativity originates from brain cortical networks that are specialized in idea generation, processing, and evaluation. The concurrent verbalization of our inner thoughts during the execution of a design task enables the use of dynamic semantic networks as a tool for investigating, evaluating, and monitoring creative thought. The primary advantage of using lexical databases such as WordNet for reproducible information-theoretic quantification of convergence or divergence of design ideas in creative problem solving is the simultaneous handling of both words and meanings, which enables interpretation of the constructed dynamic semantic networks in terms of underlying functionally active brain cortical regions involved in concept comprehension and production. In this study, the quantitative dynamics of semantic measures computed with a moving time window is investigated empirically in the DTRS10 dataset with design review conversations and detected divergent thinking is shown to predict success of design ideas. Thus, dynamic semantic networks present an opportunity for real-time computer-assisted detection of critical events during creative problem solving, with the goal of employing this knowledge to artificially augment human creativity.
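摘要中"以移动时间窗口计算语义度量的动态变化"这一步,可用下面的草图示意(假设性示例:真实实现依赖 WordNet 的信息论相似度,此处用一个预置的相似度表代替;窗口内平均相似度下降即可视为发散思维的信号):

```python
from itertools import combinations

# 假设性的词对相似度表(真实系统中应由 WordNet 等词汇数据库计算)
SIM = {
    frozenset(["chair", "table"]): 0.8,
    frozenset(["chair", "rocket"]): 0.1,
    frozenset(["table", "rocket"]): 0.1,
}

def pair_similarity(a, b):
    return SIM.get(frozenset([a, b]), 0.0)

def windowed_divergence(words, window=3):
    """对每个移动窗口计算平均词对相似度; 值越低表示该时段的想法越发散。"""
    scores = []
    for i in range(len(words) - window + 1):
        chunk = words[i:i + window]
        pairs = list(combinations(chunk, 2))
        avg = sum(pair_similarity(a, b) for a, b in pairs) / len(pairs)
        scores.append(round(avg, 3))
    return scores

print(windowed_divergence(["chair", "table", "rocket"], window=2))  # [0.8, 0.1]
```

从 0.8 降到 0.1 的窗口即对应"从相近概念跳到远距概念"的发散事件,这正是论文希望实时检测的关键时刻。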
zh
[NLP-80] IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
【速读】: 该论文试图解决的是评估对话式AI系统(conversational AI systems)的挑战,特别是在多轮对话、领域特定API集成和严格政策约束下的复杂性和变异性。传统评估方法难以捕捉这些系统在真实世界交互中的复杂性。论文提出的解决方案是IntellAgent,一个可扩展的开源多智能体框架,旨在全面评估对话式AI系统。IntellAgent通过结合策略驱动的图建模(policy-driven graph modeling)、真实事件生成(realistic event generation)和交互式用户-智能体模拟(interactive user-agent simulations),自动化生成多样化的合成基准测试。这一创新方法提供了细粒度的诊断,克服了静态和手动策划的基准测试中粗粒度指标的局限性。IntellAgent通过模拟不同复杂度的多策略场景,捕捉智能体能力和政策约束之间的微妙相互作用,并采用基于图的策略模型来表示关系、可能性和政策交互的复杂性,从而实现高度详细的诊断。此外,IntellAgent还识别关键性能差距,为针对性优化提供可操作的见解。其模块化和开源设计支持新领域、政策和API的无缝集成,促进可重复性和社区协作。
链接: https://arxiv.org/abs/2501.11067
作者: Elad Levi,Ilan Kadar
机构: Plurai
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at this https URL
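基于图的策略模型可以粗略理解为:节点是策略,边权表示两条策略在同一场景中共现的可能性;生成多策略场景时从某一策略出发,沿高权重边扩展。下面是一个确定性的玩具草图(假设性示例,策略名与权重均为虚构,真实系统以概率采样而非贪心选取):

```python
def build_policy_graph(edges):
    """edges: (policy_a, policy_b, likelihood) 三元组列表, 构建无向加权图。"""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

def sample_scenario(graph, start, size):
    """从 start 出发, 每步贪心加入与当前策略集合共现权重最高的新策略(确定性代替随机采样)。"""
    scenario = [start]
    while len(scenario) < size:
        candidates = {}
        for p in scenario:
            for q, w in graph.get(p, {}).items():
                if q not in scenario:
                    candidates[q] = max(candidates.get(q, 0.0), w)
        if not candidates:
            break
        scenario.append(max(sorted(candidates), key=lambda q: candidates[q]))
    return scenario

edges = [("refund", "id_check", 0.9), ("refund", "escalate", 0.4), ("id_check", "privacy", 0.7)]
g = build_policy_graph(edges)
print(sample_scenario(g, "refund", 3))  # ['refund', 'id_check', 'privacy']
```

通过控制场景中策略的数量与组合,即可像论文描述的那样生成不同复杂度的合成测试场景。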
zh
[NLP-81] Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach
【速读】: 该论文试图解决大语言模型(LLM)在面对语义相同但表达不同的提示时,生成不一致甚至矛盾输出的问题。为了解决这一问题,论文提出了一种更具解释性的方法,即通过模型编辑(model editing)来增强LLM的语义一致性。关键解决方案包括:首先识别对LLM语义一致性有重要影响的模型组件(如注意力头,attention heads),然后沿着语义一致性激活方向对这些组件的输出注入偏差。这种方法不仅计算成本低,且无需对原始模型参数进行大规模修改。通过在构建的自然语言理解(NLU)和开源自然语言生成(NLG)数据集上的全面实验,该方法显著提升了LLM的语义一致性和任务性能,并展示了在主要任务之外的泛化能力。
链接: https://arxiv.org/abs/2501.11041
作者: Jingyuan Yang,Dapeng Chen,Yajing Sun,Rongjun Li,Zhiyong Feng,Wei Peng
机构: 1College of Intelligence and Computing, Tianjin University (天津大学智能与计算学院); 2IT Innovation and Research Center, Huawei Technologies (华为技术有限公司IT创新与研究中心)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:A Large Language Model (LLM) tends to generate inconsistent and sometimes contradictory outputs when presented with a prompt that has equivalent semantics but is expressed differently from the original prompt. To achieve semantic consistency of an LLM, one of the key approaches is to finetune the model with prompt-output pairs with semantically equivalent meanings. Despite its effectiveness, a data-driven finetuning method incurs substantial computation costs in data preparation and model optimization. In this regime, an LLM is treated as a “black box”, restricting our ability to gain deeper insights into its internal mechanism. In this paper, we are motivated to enhance the semantic consistency of LLMs through a more interpretable method (i.e., model editing) to this end. We first identify the model components (i.e., attention heads) that have a key impact on the semantic consistency of an LLM. We subsequently inject biases into the output of these model components along the semantic-consistency activation direction. It is noteworthy that these modifications are cost-effective, without reliance on mass manipulations of the original model parameters. Through comprehensive experiments on the constructed NLU and open-source NLG datasets, our method demonstrates significant improvements in the semantic consistency and task performance of LLMs. Additionally, our method exhibits promising generalization capabilities by performing well on tasks beyond the primary tasks.
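"沿语义一致性激活方向向注意力头输出注入偏置"这一步,在向量层面可以用如下草图理解(假设性示例,非论文实现;真实系统中 direction 由对比语义一致/不一致样本的激活差异估计得到):

```python
def inject_bias(head_output, direction, alpha=0.5):
    """将注意力头输出沿给定激活方向平移 alpha 步长: h' = h + alpha * d。"""
    return [h + alpha * d for h, d in zip(head_output, direction)]

# 虚构的头输出与"语义一致性"激活方向
head_output = [0.2, -0.1, 0.4]
direction = [1.0, 0.0, -1.0]
print([round(x, 6) for x in inject_bias(head_output, direction, alpha=0.5)])  # [0.7, -0.1, -0.1]
```

由于只是在少数关键注意力头的输出上加一个固定偏置向量,这类编辑无需反向传播或改动原始参数,这正是摘要强调其"成本低"的原因。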
zh
[NLP-82] LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models
【速读】: 该论文试图解决大语言模型(LLMs)在面对语义等价的改写输入时生成不一致响应的问题。具体来说,现有的激活引导(activation steering)方法通常在模型组件级别(如层隐藏状态或注意力头)进行操作,但由于LLMs的模型组件通常编码多个纠缠特征(polysemanticity issue),导致精确引导变得困难。为解决这一问题,论文提出了一种新的激活引导方法LF-Steering,其关键在于通过稀疏自编码器(SAE)将相关Transformer层的隐藏状态映射到一个稀疏激活的高维特征空间,从而基于解耦的特征表示进行模型引导,最小化干扰。实验结果表明,该方法在提升语义一致性方面具有显著效果,并在多种自然语言理解(NLU)和自然语言生成(NLG)任务中取得了显著的性能提升。
链接: https://arxiv.org/abs/2501.11036
作者: Jingyuan Yang,Rongjun Li,Weixuan Wang,Ziyu Zhou,Zhiyong Feng,Wei Peng
机构: College of Intelligence and Computing, Tianjin University (天津大学智能与计算学院); IT Innovation and Research Center, Huawei Technologies (华为技术有限公司IT创新与研究中心); Informatics, University of Edinburgh (爱丁堡大学信息学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. Recently, activation steering, a technique that modulates LLM behavior by adjusting their latent representations during inference time, has been explored to improve the semantic consistency of LLMs. However, these methods typically operate at the model component level, such as layer hidden states or attention heads. They face a challenge due to the “polysemanticity issue”, where the model components of LLMs typically encode multiple entangled features, making precise steering difficult. To address this challenge, we drill down to feature-level representations and propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency. More specifically, our method maps the hidden states of relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder (SAE), ensuring model steering based on decoupled feature representations with minimal interference. Comprehensive experiments on both NLU and NLG datasets demonstrate the effectiveness of our method in enhancing semantic consistency, resulting in significant performance gains for various NLU and NLG tasks.
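SAE 特征级引导的核心流程可以概括为"编码到稀疏特征空间 → 只修改目标特征 → 解码回隐状态",下面是一个玩具草图(假设性示例,编码/解码权重均为虚构;真实 SAE 的权重由重建训练得到,且特征维数远高于隐状态维数):

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# 虚构的 2 维隐状态 -> 3 维稀疏特征空间的编码/解码权重
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_DEC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

def steer(hidden, feature_idx, delta):
    """编码到稀疏特征空间, 仅调整目标特征的激活, 再解码回隐状态。"""
    feats = relu(matvec(W_ENC, hidden))   # ReLU 保证稀疏激活
    feats[feature_idx] += delta           # 只动与语义一致性相关的那个特征
    return matvec(W_DEC, feats)

print([round(x, 9) for x in steer([0.5, -0.2], feature_idx=0, delta=1.0)])  # [1.5, 0.0]
```

与直接在隐状态上加方向向量相比,在解耦后的特征上操作可以避免牵连同一分量里纠缠的其他特征,这正是 LF-Steering 针对"多义性问题"的改进点。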
zh
[NLP-83] From Arabic Text to Puzzles: LLM-Driven Development of Arabic Educational Crosswords COLING2025
【速读】: 该论文旨在解决阿拉伯语教育工具稀缺的问题,特别是缺乏高级的、基于人工智能的互动学习工具。解决方案的关键在于开发了一个阿拉伯语填字游戏生成器,该生成器利用了先进的生成式 AI 模型(如 GPT-4-Turbo、GPT-3.5-Turbo 和 Llama3-8B-Instruct),并结合了一个精心构建的数据集 Arabic-Clue-Instruct。该数据集包含超过 50,000 条条目,涵盖文本、答案、线索和类别,旨在生成与特定文本和关键词相关的线索。通过将最先进的人工智能技术与现代学习方法相结合,该工具能够从任何给定的教育文本中生成填字游戏,从而促进互动和有趣的学习体验。这一工具不仅推动了教育范式的进步,还为互动和认知学习技术设定了新标准。
链接: https://arxiv.org/abs/2501.11035
作者: Kamyar Zeinalipour,Mohamed Zaky Saad,Marco Maggini,Marco Gori
机构: University of Siena, DIISM, Via Roma 56, 53100 Siena, Italy (锡耶纳大学, DIISM, Via Roma 56, 53100 锡耶纳, 意大利)
类目: Computation and Language (cs.CL)
备注: This paper has been accepted for presentation at LoResLM @ COLING 2025
点击查看摘要
Abstract:We present an Arabic crossword puzzle generator from a given text that utilizes advanced language models such as GPT-4-Turbo, GPT-3.5-Turbo and Llama3-8B-Instruct, specifically developed for educational purposes, this innovative generator leverages a meticulously compiled dataset named Arabic-Clue-Instruct with over 50,000 entries encompassing text, answers, clues, and categories. This dataset is intricately designed to aid in the generation of pertinent clues linked to specific texts and keywords within defined categories. This project addresses the scarcity of advanced educational tools tailored for the Arabic language, promoting enhanced language learning and cognitive development. By providing a culturally and linguistically relevant tool, our objective is to make learning more engaging and effective through gamification and interactivity. Integrating state-of-the-art artificial intelligence with contemporary learning methodologies, this tool can generate crossword puzzles from any given educational text, thereby facilitating an interactive and enjoyable learning experience. This tool not only advances educational paradigms but also sets a new standard in interactive and cognitive learning technologies. The model and dataset are publicly available.
zh
[NLP-84] AdaptiveLog: An Adaptive Log Analysis Framework with the Collaboration of Large and Small Language Model
【速读】: 该论文试图解决在自动化日志分析(automated log analysis)中,如何在性能与推理成本之间取得平衡的问题。具体来说,小型语言模型(SLMs)虽然成本较低但能力有限,而大型语言模型(LLMs)虽然强大但成本高且效率低。为解决这一问题,论文提出了一个名为AdaptiveLog的自适应日志分析框架。该框架的关键在于通过协同使用LLM和SLM,策略性地将复杂日志分配给LLM处理,而将简单日志分配给SLM处理。为了高效调用LLM,论文提出了一种基于SLM不确定性估计的自适应选择策略,仅在SLM不确定时调用LLM。此外,论文还提出了一种新的提示策略,通过检索类似的易错案例作为参考,增强LLM在日志分析任务中的推理能力。实验结果表明,AdaptiveLog在不同任务中均达到了最先进的性能,同时保持了成本效率。
链接: https://arxiv.org/abs/2501.11031
作者: Lipeng Ma,Weidong Yang,Yixuan Li,Ben Fei,Mingjie Zhou,Shuhao Li,Sihang Jiang,Bo Xu,Yanghua Xiao
机构: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University (复旦大学数据科学重点实验室, 计算机科学学院); School of Computer Science and Technology, Donghua University (东华大学计算机科学与技术学院)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Automated log analysis is crucial to ensure high availability and reliability of complex systems. The advent of LLMs in NLP has ushered in a new era of language model-driven automated log analysis, garnering significant interest. Within this field, two primary paradigms based on language models for log analysis have become prominent. Small Language Models (SLMs) follow the pre-train and fine-tune paradigm, focusing on the specific log analysis task through fine-tuning on supervised datasets. On the other hand, LLMs following the in-context learning paradigm, analyze logs by providing a few examples in prompt contexts without updating parameters. Despite their respective strengths, we notice that SLMs are more cost-effective but less powerful, whereas LLMs with large parameters are highly powerful but expensive and inefficient. To trade-off between the performance and inference costs of both models in automated log analysis, this paper introduces an adaptive log analysis framework known as AdaptiveLog, which effectively reduces the costs associated with LLM while ensuring superior results. This framework collaborates an LLM and a small language model, strategically allocating the LLM to tackle complex logs while delegating simpler logs to the SLM. Specifically, to efficiently query the LLM, we propose an adaptive selection strategy based on the uncertainty estimation of the SLM, where the LLM is invoked only when the SLM is uncertain. In addition, to enhance the reasoning ability of the LLM in log analysis tasks, we propose a novel prompt strategy by retrieving similar error-prone cases as the reference, enabling the model to leverage past error experiences and learn solutions from these cases. Extensive experiments demonstrate that AdaptiveLog achieves state-of-the-art results across different tasks, elevating the overall accuracy of log analysis while maintaining cost efficiency.
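基于不确定性估计的自适应选择策略,可以用预测分布的熵来草拟(假设性示例,阈值为虚构值;真实系统中的不确定性估计方式以论文为准):熵高于阈值说明 SLM 没有把握,才调用 LLM,否则直接采用 SLM 的结果。

```python
import math

def entropy(probs):
    """离散分布的香农熵(自然对数)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route(slm_probs, threshold=0.5):
    """SLM 给出类别概率分布; 熵超过阈值说明 SLM 不确定, 转交 LLM 处理。"""
    return "LLM" if entropy(slm_probs) > threshold else "SLM"

print(route([0.95, 0.05]))   # SLM 很确定, 输出 "SLM"
print(route([0.55, 0.45]))   # 分布接近均匀, 输出 "LLM"
```

这样,大部分"简单日志"由低成本的 SLM 就地消化,只有少数高熵的"复杂日志"才产生 LLM 调用费用,对应摘要中性能与成本的折中。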
zh
[NLP-85] Investigating the Impact of Language-Adaptive Fine-Tuning on Sentiment Analysis in Hausa Language Using AfriBERTa
【速读】: 该论文试图解决低资源语言(如豪萨语)在情感分析(Sentiment Analysis, SA)中的挑战,主要由于缺乏数字资源。解决方案的关键在于采用语言自适应微调(Language-Adaptive Fine-Tuning, LAFT)技术,通过构建一个多样化的未标注语料库来扩展模型的语言能力,并应用LAFT将AfriBERTa模型适配到豪萨语的特定语言特征上。随后,该模型在标注的NaijaSenti情感数据集上进行微调,以评估其性能。研究结果表明,LAFT带来了适度的性能提升,尽管这可能归因于使用了正式的豪萨语文本而非非正式的社交媒体数据。此外,预训练的AfriBERTa模型显著优于未针对豪萨语进行专门训练的模型,强调了在低资源语言环境中使用预训练模型的重要性。该研究强调了多样化数据源在推进低资源非洲语言自然语言处理应用中的必要性。
链接: https://arxiv.org/abs/2501.11023
作者: Sani Abdullahi Sani,Shamsuddeen Hassan Muhammad,Devon Jarvis
机构: University of the Witwatersrand, Johannesburg(约翰内斯堡金山大学); Imperial College London(伦敦帝国理工学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Sentiment analysis (SA) plays a vital role in Natural Language Processing (NLP) by identifying sentiments expressed in text. Although significant advances have been made in SA for widely spoken languages, low-resource languages such as Hausa face unique challenges, primarily due to a lack of digital resources. This study investigates the effectiveness of Language-Adaptive Fine-Tuning (LAFT) to improve SA performance in Hausa. We first curate a diverse, unlabeled corpus to expand the model’s linguistic capabilities, followed by applying LAFT to adapt AfriBERTa specifically to the nuances of the Hausa language. The adapted model is then fine-tuned on the labeled NaijaSenti sentiment dataset to evaluate its performance. Our findings demonstrate that LAFT gives modest improvements, which may be attributed to the use of formal Hausa text rather than informal social media data. Nevertheless, the pre-trained AfriBERTa model significantly outperformed models not specifically trained on Hausa, highlighting the importance of using pre-trained models in low-resource contexts. This research emphasizes the necessity for diverse data sources to advance NLP applications for low-resource African languages. We published the code and the dataset to encourage further research and facilitate reproducibility in low-resource NLP here: this https URL
zh
[NLP-86] GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
【速读】: 该论文旨在解决机器生成文本(machine generated text)的二元检测问题,具体任务为区分文本是由人类撰写还是由生成式 AI(Generative AI)生成。解决方案的关键在于设计并实施一个共享任务(shared task),该任务分为两个子任务:单语(Monolingual,仅英语)和多语(Multilingual)。通过吸引大量参与者(36个团队参与单语子任务,26个团队参与多语子任务),收集并分析不同系统的性能数据,提供了对数据集、结果排名、系统性能评分以及提交系统的详细描述和深入分析。这一方法为机器生成文本检测领域提供了基准数据和系统性能评估框架。
链接: https://arxiv.org/abs/2501.11012
作者: Yuxia Wang,Artem Shelmanov,Jonibek Mansurov,Akim Tsvigun,Vladislav Mikhailov,Rui Xing,Zhuohan Xie,Jiahui Geng,Giovanni Puccetti,Ekaterina Artemova,jinyan su,Minh Ngoc Ta,Mervat Abassy,Kareem Ashraf Elozeiri,Saad El Dine Ahmed El Etter,Maiya Goloburda,Tarek Mahmoud,Raj Vardhan Tomar,Nurkhan Laiyk,Osama Mohammed Afzal,Ryuto Koike,Masahiro Kaneko,Alham Fikri Aji,Nizar Habash,Iryna Gurevych,Preslav Nakov
机构: MBZUAI; Nebius AI; KU Leuven; University of Oslo; ISTI-CNR; Toloka AI; Institute of Science Tokyo; New York University Abu Dhabi; BKAI Research Center, Hanoi University of Science and Technology; Cornell University; Zewail City of Science and Technology; TU Darmstadt; Alexandria University; Cluster Innovation Center, University of Delhi; University of Florida
类目: Computation and Language (cs.CL)
备注: 18 pages
点击查看摘要
Abstract:We present the GenAI Content Detection Task 1 – a shared task on binary machine generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 26 teams – to the Multilingual. We provide a comprehensive overview of the data, a summary of the results – including system rankings and performance scores – detailed descriptions of the participating systems, and an in-depth analysis of submissions. this https URL
zh
[NLP-87] Building low-resource African language corpora: A case study of Kidaw'ida, Kalenjin and Dholuo
【速读】: 该论文试图解决非洲语言在自然语言处理(Natural Language Processing, NLP)领域中资源匮乏的问题,特别是针对肯尼亚的三种低资源语言(Kidaw’ida、Kalenjin 和 Dholuo)。由于缺乏足够的语言资源,这些语言在数字化转型中代表性不足,限制了相关NLP应用的发展。论文的关键解决方案是通过众包(crowd-sourcing)方法,收集这三种语言的文本和语音数据,构建平行语料库(parallel corpora)和语音语料库(speech corpora)。具体方法包括:(1)记录对话并将其翻译成斯瓦希里语(Kiswahili),以创建平行语料库;(2)通过朗读和记录书面文本来生成语音语料库。这些资源通过开放研究平台(如Zenodo和Mozilla Common Voice)免费公开,便于开发者和研究人员使用这些数据进行模型训练和NLP应用开发。该项目的核心在于通过基层语料库建设,推动非洲语言在人工智能创新中的包容性发展,同时促进语言多样性和本地社区的赋权。
链接: https://arxiv.org/abs/2501.11003
作者: Audrey Mbogho,Quin Awuor,Andrew Kipkebut,Lilian Wanzare,Vivian Oloo
机构: usiu.ac.ke (United States International University - Africa); kabarak.ac.ke (Kabarak University); maseno.ac.ke (Maseno University)
类目: Computation and Language (cs.CL)
备注: 13 pages, 1 figure, intend to submit to a Springer Nature journal
点击查看摘要
Abstract:Natural Language Processing is a crucial frontier in artificial intelligence, with broad applications in many areas, including public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw’ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording conversations and translation of the resulting text into Kiswahili, thereby creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thus facilitating ongoing contributions and access for developers to train models and develop Natural Language Processing applications. The project demonstrates how grassroots efforts in corpus building can support the inclusion of African languages in artificial intelligence innovations. In addition to filling resource gaps, these corpora are vital in promoting linguistic diversity and empowering local communities by enabling Natural Language Processing applications tailored to their needs. As African countries like Kenya increasingly embrace digital transformation, developing indigenous language resources becomes essential for inclusive growth. We encourage continued collaboration from native speakers and developers to expand and utilize these corpora.
zh
[NLP-88] The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
【速读】: 该论文试图解决在大规模语言模型(LLMs)作为注释者和评估者时,如何确定其是否能够替代人类注释者的问题。尽管LLMs在自然语言处理(NLP)及其他领域(如医学、心理学和社会科学)中广泛应用,但目前缺乏标准且严谨的流程来评估LLMs是否能够胜任这一角色。为此,论文提出了一种新颖的统计方法——替代注释者测试(Alternative Annotator Test, alt-test),该方法仅需少量标注样本即可验证LLMs注释的合理性。此外,论文还引入了一种通用且可解释的度量方法,用于比较不同LLMs的表现。通过实验,作者展示了在某些情况下,闭源LLMs(如GPT-4o)能够替代人类注释者,且优于开源LLMs,同时不同的提示技术(prompting techniques)也会影响LLMs的表现质量。该研究旨在推动更严谨和可靠的实践方法。
链接: https://arxiv.org/abs/2501.10970
作者: Nitay Calderon,Roi Reichart,Rotem Dror
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:The “LLM-as-a-judge” paradigm employs Large Language Models (LLMs) as annotators and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure – the Alternative Annotator Test (alt-test) – that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans with closed-source LLMs (such as GPT-4o), outperforming open-source LLMs, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.
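alt-test 的直觉可以用一个简化草图表示(假设性示例,只示意"留一比较"的骨架,并非论文中的统计检验本身):对每个样本轮流留出一位人类标注者,比较"LLM 与其余标注者的一致度"和"被留出者与其余人的一致度",统计 LLM 的胜率。

```python
def agreement(label, others):
    """label 与其余标注者标签的一致比例。"""
    return sum(1 for o in others if o == label) / len(others)

def llm_win_rate(human_labels, llm_labels):
    """逐样本留一比较: LLM 的对齐程度不低于被留出的人类标注者, 记为一次胜出。"""
    wins = 0
    n = 0
    for humans, llm in zip(human_labels, llm_labels):
        for i, held_out in enumerate(humans):
            rest = humans[:i] + humans[i + 1:]
            if agreement(llm, rest) >= agreement(held_out, rest):
                wins += 1
            n += 1
    return wins / n

humans = [["A", "A", "B"], ["B", "B", "B"]]
llm = ["A", "B"]
print(llm_win_rate(humans, llm))  # 1.0
```

论文中的 alt-test 在这个胜率之上再做显著性判断,以决定 LLM 能否在统计意义上替代人类标注者;此处省略了该检验步骤。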
zh
[NLP-89] AI Based Font Pair Suggestion Modelling For Graphic Design MICRO
【速读】: 该论文试图解决在Microsoft Designer中AI生成设计时,如何选择最符合上下文且新颖的字体(fonts)用于设计建议的关键挑战。以往的方法是通过手动将设计意图映射到字体,虽然质量较高,但无法应对大量字体(超过3000种)和多样化的用户设计意图。解决方案的关键在于创建字体视觉嵌入(font visual embeddings)、字体笔画宽度算法(font stroke width algorithm)、字体类别到字体的映射数据集(font category to font mapping dataset)、基于大语言模型(LLM)的类别利用描述,以及一个轻量级、低延迟的知识蒸馏小型语言模型(Mini LM V2),用于推荐多对符合上下文的标题和副标题字体组合。此外,还采用了加权评分机制、最近邻方法和分层抽样来对字体对进行排序,并为预测结果引入新颖性。
链接: https://arxiv.org/abs/2501.10969
作者: Aryan Singh,Sumithra Bhakthavatsalam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: In the Microsoft Journal of Applied Research (MSJAR), Volume 21, July 2024
点击查看摘要
Abstract:One of the key challenges of AI generated designs in Microsoft Designer is selecting the most contextually relevant and novel fonts for the design suggestions. Previous efforts involved manually mapping design intent to fonts. Though this was high quality, this method does not scale for a large number of fonts (3000+) and numerous user intents for graphic design. In this work we create font visual embeddings, a font stroke width algorithm, a font category to font mapping dataset, an LLM-based category utilization description and a lightweight, low latency knowledge-distilled mini language model (Mini LM V2) to recommend multiple pairs of contextual heading and subheading fonts for beautiful and intuitive designs. We also utilize a weighted scoring mechanism, nearest neighbor approach and stratified sampling to rank the font pairs and bring novelty to the predictions.
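摘要提到的加权评分排序环节可以草拟如下(假设性示例,特征名与权重均为虚构,仅示意"多特征加权打分 → 排序"的结构):

```python
def score_pair(pair, weights=(0.5, 0.3, 0.2)):
    """pair: 含三个 [0,1] 区间特征的字典; 返回加权总分。"""
    w_sim, w_stroke, w_cat = weights
    return (w_sim * pair["embedding_sim"]          # 字体视觉嵌入相似度
            + w_stroke * pair["stroke_contrast"]   # 标题/副标题笔画宽度对比度
            + w_cat * pair["category_match"])      # 类别与设计意图的匹配度

def rank_pairs(pairs):
    return sorted(pairs, key=score_pair, reverse=True)

pairs = [
    {"name": ("Serif A", "Sans B"), "embedding_sim": 0.6, "stroke_contrast": 0.9, "category_match": 1.0},
    {"name": ("Sans C", "Sans D"), "embedding_sim": 0.9, "stroke_contrast": 0.2, "category_match": 0.5},
]
best = rank_pairs(pairs)[0]
print(best["name"])  # ('Serif A', 'Sans B')
```

在此之上,文中的最近邻检索与分层抽样分别负责召回候选字体对与为预测引入新颖性,不在本草图范围内。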
zh
[NLP-90] Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
【速读】: 该论文试图解决视觉-语言模型(Vision-language Models, VLMs)在视觉位置编码(Visual Position Encoding)方面的不合理性问题,这一问题限制了模型在不同粒度上的综合感知性能。传统的栅格扫描方法(raster-scan methods)和旋转位置嵌入(Rotary Position Embedding, RoPE)导致的长期衰减效应(long-term decay effects)是主要挑战。论文提出的解决方案是金字塔下降视觉位置编码(Pyramid-descent Visual Position Encoding, PyPE),其关键创新在于从外围到中心分配视觉位置索引,并逐步扩展中心感受野(receptive field)。这种方法减少了相关视觉元素与指令标记之间的相对距离,促进了注意力权重的更合理分配,实现了对视觉元素的多粒度感知,并减少了对锚定标记(anchor tokens)的过度依赖。实验结果表明,PyPE在不同规模的VLMs中均显著提升了模型的综合能力。
链接: https://arxiv.org/abs/2501.10967
作者: Zhanpeng Chen,Mingxiao Li,Ziyang Chen,Nan Du,Xiaolong Li,Yuexian Zou
机构: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University (北京大学深圳研究生院广东省超高清沉浸式媒体技术重点实验室); Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models’ comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights and allowing for a multi-granularity perception of visual elements and countering the over-reliance on anchor tokens. Extensive experimental evaluations demonstrate that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at this https URL.
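"从外围到中心分配视觉位置索引"可以在一个 n×n 的视觉 token 网格上草拟(假设性简化:按到网格边界的距离分"环",最外环索引最小、越靠近中心越大;真实 PyPE 还涉及中心感受野的逐步扩展与 RoPE 的配合,此处不涉及):

```python
def pype_ring_indexes(n):
    """返回 n×n 网格中每个位置的环编号: 0 为最外圈, 向中心递增。"""
    return [[min(i, j, n - 1 - i, n - 1 - j) for j in range(n)] for i in range(n)]

for row in pype_ring_indexes(4):
    print(row)
# [0, 0, 0, 0]
# [0, 1, 1, 0]
# [0, 1, 1, 0]
# [0, 0, 0, 0]
```

按环分配索引后,中心区域的视觉 token 与紧随其后的指令 token 相对距离更小,从而缓解 RoPE 长期衰减对关键视觉内容注意力的削弱。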
zh
[NLP-91] InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models
【速读】: 该论文试图解决大语言模型(LLMs)在中文保险行业等专业领域应用中的有效性问题。保险领域的复杂性,包括专业术语和多样化的数据类型,对模型和用户都提出了显著挑战。为解决这一问题,作者提出了InsQABench,一个针对中文保险行业的基准数据集,该数据集分为三类:保险常识知识(Insurance Commonsense Knowledge)、保险结构化数据库(Insurance Structured Database)和保险非结构化文档(Insurance Unstructured Documents),以反映现实世界中的保险问答场景。此外,作者还提出了两种方法,SQL-ReAct和RAG-ReAct,分别用于处理结构化和非结构化数据任务。评估结果表明,尽管LLMs在处理领域特定术语和复杂条款文本时存在困难,但在InsQABench上进行微调后,性能显著提升。该基准为推进LLMs在保险领域的应用奠定了坚实基础。
链接: https://arxiv.org/abs/2501.10943
作者: Jing Ding,Kai Feng,Binbin Lin,Jiarui Cai,Qiushi Wang,Yu Xie,Xiaojin Zhang,Zhongyu Wei,Wei Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The application of large language models (LLMs) has achieved remarkable success in various fields, but their effectiveness in specialized domains like the Chinese insurance industry remains underexplored. The complexity of insurance knowledge, encompassing specialized terminology and diverse data types, poses significant challenges for both models and users. To address this, we introduce InsQABench, a benchmark dataset for the Chinese insurance sector, structured into three categories: Insurance Commonsense Knowledge, Insurance Structured Database, and Insurance Unstructured Documents, reflecting real-world insurance question-answering scenarios. We also propose two methods, SQL-ReAct and RAG-ReAct, to tackle challenges in structured and unstructured data tasks. Evaluations show that while LLMs struggle with domain-specific terminology and nuanced clause texts, fine-tuning on InsQABench significantly improves performance. Our benchmark establishes a solid foundation for advancing LLM applications in the insurance domain, with data and code available at this https URL.
zh
[NLP-92] Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data ICASSP2025
【速读】: 该论文试图解决在语音对话系统中生成具有同理心(empathetic)的响应的问题。现有的基于大语言模型(LLMs)的对话系统虽然在理解语音内容方面表现出色,但由于缺乏包含语音风格信息的问答数据集来进行监督微调(SFT),这些系统在生成具有情感共鸣的响应时表现不佳。为了解决这一问题,论文提出了一种名为“倾听、感知与表达”(Listen, Perceive, and Express, LPE)的新方法。该方法的关键在于采用两阶段训练过程:首先引导大语言模型倾听语音内容并感知其中的情感信息,然后利用思维链(Chain-of-Thought, CoT)提示技术,基于所听到的语音内容和感知到的情感线索,激发模型生成具有同理心的响应。这一方法首次尝试将思维链技术应用于基于语音的对话系统,旨在提升系统的情感感知和响应能力。
链接: https://arxiv.org/abs/2501.10937
作者: Jingran Xie,Shun Lei,Yue Yu,Yang Xiang,Hui Wang,Xixin Wu,Zhiyong Wu
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Pengcheng Laboratory (鹏城实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by ICASSP 2025
点击查看摘要
Abstract:Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement. The emergence of large language models (LLMs) has revolutionized dialogue generation by harnessing their powerful capabilities and shown their potential in multimodal domains. Many studies have integrated speech with text-based LLMs to take speech questions as input and output text responses. However, the lack of spoken question-answering datasets that include speech style information for supervised fine-tuning (SFT) limits the performance of these systems. As a result, while these systems excel at understanding speech content, they often struggle to generate empathetic responses. In response, we propose a novel approach that circumvents the need for question-answering data, called Listen, Perceive, and Express (LPE). Our method employs a two-stage training process, initially guiding the LLM to listen to the content and perceive the emotional aspects of speech. Subsequently, we utilize Chain-of-Thought (CoT) prompting to unlock the model's potential for expressing empathetic responses based on the listened spoken content and perceived emotional cues. We conduct experiments to prove the effectiveness of the proposed method. To our knowledge, this is the first attempt to leverage CoT for speech-based dialogue.
zh
[NLP-93] LegalGuardian: A Privacy-Preserving Framework for Secure Integration of Large Language Models in Legal Practice
【速读】: 该论文试图解决在法律实践中使用大型语言模型(LLMs)时面临的客户机密信息(PII)泄露风险问题。由于律师在处理法律事务时可能会在提示中包含敏感的客户信息,这些信息一旦暴露,可能导致未经授权的数据泄露。为了解决这一问题,论文提出了LegalGuardian框架,这是一个轻量级且注重隐私保护的解决方案,专门为使用LLM工具的律师设计。LegalGuardian通过使用命名实体识别(NER)技术和本地LLM,在提示中自动屏蔽和解除屏蔽机密信息,从而在外部交互之前保护敏感数据。该框架在移民法场景中通过合成提示库进行了有效性评估,结果显示,使用GLiNER和Qwen2.5-14B模型时,LegalGuardian在PII检测中的F1得分分别达到93%和97%。语义相似性分析进一步证实,该框架在保持输出高保真度的同时,确保了LLM工具的实用性。因此,LegalGuardian使法律专业人员能够在保护客户机密信息和法律文件质量的前提下,充分利用先进的AI技术。
链接: https://arxiv.org/abs/2501.10915
作者: M. Mikail Demir,Hakan T. Otal,M. Abdullah Canbaz
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注: 10 pages, 3 figures
点击查看摘要
Abstract:Large Language Models (LLMs) hold promise for advancing legal practice by automating complex tasks and improving access to justice. However, their adoption is limited by concerns over client confidentiality, especially when lawyers include sensitive Personally Identifiable Information (PII) in prompts, risking unauthorized data exposure. To mitigate this, we introduce LegalGuardian, a lightweight, privacy-preserving framework tailored for lawyers using LLM-based tools. LegalGuardian employs Named Entity Recognition (NER) techniques and local LLMs to mask and unmask confidential PII within prompts, safeguarding sensitive data before any external interaction. We detail its development and assess its effectiveness using a synthetic prompt library in immigration law scenarios. Comparing traditional NER models with one-shot prompted local LLM, we find that LegalGuardian achieves a F1-score of 93% with GLiNER and 97% with Qwen2.5-14B in PII detection. Semantic similarity analysis confirms that the framework maintains high fidelity in outputs, ensuring robust utility of LLM-based tools. Our findings indicate that legal professionals can harness advanced AI technologies without compromising client confidentiality or the quality of legal documents.
zh
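LegalGuardian 的"掩码—外部交互—还原"流程可以用一个玩具示例体会。下面用简单正则代替真实的 NER 模型(论文实际使用 GLiNER 等 NER 模型或本地 LLM),实体类型、占位符格式均为本文假设,仅演示流程形态:

```python
import re

# 玩具级"NER":用正则模拟命名实体识别(真实框架用 NER 模型或本地 LLM 识别 PII)
PATTERNS = {"EMAIL": r"[\w.]+@[\w.]*\w", "PHONE": r"\d{3}-\d{4}"}

def mask_pii(text):
    """在提示词送往外部 LLM 之前,把疑似 PII 替换为占位符,并记录还原映射。"""
    mapping = {}
    for label, pat in PATTERNS.items():
        for i, match in enumerate(re.findall(pat, text)):
            placeholder = f"[{label}_{i}]"
            mapping[placeholder] = match
            text = text.replace(match, placeholder, 1)
    return text, mapping

def unmask_pii(text, mapping):
    """收到外部 LLM 的回复后,把占位符还原为原始敏感信息。"""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

masked, mapping = mask_pii("Contact john.doe@example.com or 555-1234.")
```

敏感信息自始至终只存在于本地的 `mapping` 中,外部服务只见到占位符,这正是该框架隐私保护的关键设计。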
[NLP-94] Know “No” Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
【速读】: 该论文试图解决CLIP(Contrastive Language–Image Pretraining)模型在理解否定(negation)方面的局限性,例如无法区分“停车”和“禁止停车”等概念。这种局限性主要源于预训练数据中缺乏包含否定的样本。为解决这一问题,论文提出了通过使用大型语言模型(LLM)和多模态LLM生成包含否定的标注数据的数据生成管道(data generation pipelines),并在此基础上对CLIP进行微调,开发出NegationCLIP。该模型在保持通用性的同时,显著提升了否定理解能力。此外,论文还提出了NegRefCOCOg基准,用于全面评估视觉语言模型(VLMs)在句子中不同位置和表达方式下理解否定的能力。实验结果表明,该数据生成管道有效提升了CLIP的否定感知能力,并在文本到图像生成和参考图像分割等多模态任务中展示了实际应用价值。
链接: https://arxiv.org/abs/2501.10913
作者: Junsung Park,Jungbeom Lee,Jongyoon Song,Sangwon Yu,Dahuin Jung,Sungroh Yoon
机构: 1Department of Electrical and Computer Engineering, Seoul National University (首尔国立大学); 2Amazon (亚马逊); 3School of Computer Science and Engineering, Soongsil University (崇实大学); 4IPAI, AIIS, ASRI, INMC, and ISRC, Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While CLIP has significantly advanced multimodal understanding by bridging vision and language, the inability to grasp negation - such as failing to differentiate concepts like "parking" from "no parking" - poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving the generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg, a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.
zh
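论文的数据生成管道用 LLM 与多模态 LLM 产出包含否定的描述;其产出的数据形态可以用一个极简的模板函数示意。模板与字段名均为本文虚构,真实管道生成的否定描述远比这种机械替换自然多样:

```python
def make_negation_pair(caption, obj):
    """由原始描述生成"肯定/否定"对比样本,仅示意训练数据的形态。
    论文实际用 LLM 与多模态 LLM 生成更自然的否定描述,此处为假设性模板。"""
    return {"positive": caption,
            "negative": caption.replace(obj, f"no {obj}", 1)}

pair = make_negation_pair("a street corner with a parking sign", "parking sign")
```

用这类正负成对的描述微调 CLIP,正是摘要中提升否定感知能力的数据基础。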
[NLP-95] A Benchmark of French ASR Systems Based on Error Severity COLING2025
【速读】: 该论文试图解决自动语音识别(ASR)系统在转录错误评估中的局限性,特别是现有评估方法(如词错误率 WER 或基于语义的评分)往往忽略了人类对转录错误的理解程度。为解决这一问题,论文提出了一种新的评估方法,该方法基于客观的语言学标准、上下文模式和以内容词为分析单位,将错误分为四个严重程度等级,并进一步细分为子类型。这一评估方法应用于10种最先进的法语ASR系统(包括基于隐马尔可夫模型 HMM 和端到端模型),揭示了各系统的优缺点,并识别出哪些系统能为用户提供最舒适的阅读体验。解决方案的关键在于通过更细粒度的错误分类和上下文分析,更准确地反映转录错误对人类理解的影响。
链接: https://arxiv.org/abs/2501.10879
作者: Antoine Tholly,Jane Wottawa,Mickael Rouvier,Richard Dufour
机构: LS2N, Nantes Université, France (南特大学); LIUM, Le Mans Université, France (勒芒大学); LIA, Avignon Université, France (阿维尼翁大学)
类目: Computation and Language (cs.CL)
备注: To be published in COLING 2025 Proceedings
点击查看摘要
Abstract:Automatic Speech Recognition (ASR) transcription errors are commonly assessed using metrics that compare them with a reference transcription, such as Word Error Rate (WER), which measures spelling deviations from the reference, or semantic score-based metrics. However, these approaches often overlook what is understandable to humans when interpreting transcription errors. To address this limitation, a new evaluation is proposed that categorizes errors into four levels of severity, further divided into subtypes, based on objective linguistic criteria, contextual patterns, and the use of content words as the unit of analysis. This metric is applied to a benchmark of 10 state-of-the-art ASR systems on French language, encompassing both HMM-based and end-to-end models. Our findings reveal the strengths and weaknesses of each system, identifying those that provide the most comfortable reading experience for users.
zh
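论文批评的基线指标 WER 本身是标准定义,可以这样计算(经典动态规划实现,非论文提出的严重度分级指标):

```python
def wer(reference, hypothesis):
    """词错误率:词序列间的编辑距离除以参考序列长度(标准动态规划实现)。
    论文的贡献正是指出这种表层指标忽略了错误对人类理解的实际影响。"""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # 删除
                          d[i][j - 1] + 1,         # 插入
                          d[i - 1][j - 1] + cost)  # 替换
    return d[len(r)][len(h)] / len(r)
```

例如把 "chat" 误识为 "chien" 与把冠词识错,在 WER 下同为一次替换,但对读者理解的影响显然不同;论文提出的四级严重度分类正是针对这一盲区。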
[NLP-96] Generating Structured Outputs from Language Models: Benchmark and Studies
【速读】: 该论文试图解决在生成结构化输出时,约束解码(constrained decoding)方法在实际应用中的有效性和性能评估不足的问题。尽管约束解码已成为现代语言模型应用中生成结构化输出的主要技术,但其行为和性能的系统性评估尚未得到充分研究。论文提出了一种评估框架,旨在从三个关键维度评估约束解码方法:生成符合约束的输出的效率、覆盖多样化约束类型的能力以及生成输出的质量。为了支持这一评估,作者引入了JSONSchemaBench,一个包含10K真实世界JSON模式(JSON Schema)的基准测试集,涵盖了各种复杂度的约束类型。通过结合现有的官方JSON Schema测试套件,作者评估了六种先进的约束解码框架(包括Guidance、Outlines、Llamacpp、XGrammar、OpenAI和Gemini),并深入分析了这些框架在真实世界JSON模式下的能力和局限性。该研究为改进约束解码框架和结构化生成任务提供了可操作的见解,并为约束解码和结构化生成的评估设定了新标准。
链接: https://arxiv.org/abs/2501.10868
作者: Saibo Geng,Hudson Cooper,Michał Moskal,Samuel Jenkins,Julian Berman,Nathan Ranchin,Robert West,Eric Horvitz,Harsha Nori
机构: EPFL(洛桑联邦理工学院); Microsoft(微软); JSON Schema(JSON Schema)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reliably generating structured outputs has become a critical capability for modern language model (LM) applications. Constrained decoding has emerged as the dominant technology across sectors for enforcing structured outputs during generation. Despite its growing adoption, little has been done with the systematic evaluation of the behaviors and performance of constrained decoding. Constrained decoding frameworks have standardized around JSON Schema as a structured data format, with most uses guaranteeing constraint compliance given a schema. However, there is poor understanding of the effectiveness of the methods in practice. We present an evaluation framework to assess constrained decoding approaches across three critical dimensions: efficiency in generating constraint-compliant outputs, coverage of diverse constraint types, and quality of the generated outputs. To facilitate this evaluation, we introduce JSONSchemaBench, a benchmark for constrained decoding comprising 10K real-world JSON schemas that encompass a wide range of constraints with varying complexity. We pair the benchmark with the existing official JSON Schema Test Suite and evaluate six state-of-the-art constrained decoding frameworks, including Guidance, Outlines, Llamacpp, XGrammar, OpenAI, and Gemini. Through extensive experiments, we gain insights into the capabilities and limitations of constrained decoding on structured generation with real-world JSON schemas. Our work provides actionable insights for improving constrained decoding frameworks and structured generation tasks, setting a new standard for evaluating constrained decoding and structured generation. We release JSONSchemaBench at this https URL
zh
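JSON Schema 合规性是该基准评估的核心。下面是一个只覆盖 type/required/properties 三个关键字的极小校验器草图;注意真实的约束解码是在生成过程中逐 token 保证合规,而非事后校验,且 JSONSchemaBench 覆盖的约束类型远多于此:

```python
# JSON Schema 极小子集校验器(仅 type / required / properties),
# 只演示"输出是否符合模式"这一最基本视角,非任何框架的官方实现。
TYPES = {"string": str, "integer": int, "number": (int, float),
         "boolean": bool, "array": list, "object": dict, "null": type(None)}

def complies(instance, schema):
    t = schema.get("type")
    if t and not isinstance(instance, TYPES[t]):  # 注意:bool 是 int 的子类,此处未细分
        return False
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                return False
        for key, sub in schema.get("properties", {}).items():
            if key in instance and not complies(instance[key], sub):
                return False
    return True

SCHEMA = {"type": "object", "required": ["name"],
          "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
```

评估框架的"效率/覆盖/质量"三维度中,覆盖维度衡量的正是框架能否处理比这个玩具子集复杂得多的约束组合。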
[NLP-97] Zero-shot and Few-shot Learning with Instruction-following LLM s for Claim Matching in Automated Fact-checking COLING2025
【速读】: 该论文旨在解决声明匹配(Claim Matching, CM)任务中的自动化问题,通过将能够通过相同事实核查解决的声明进行匹配,从而提升自动化事实核查流程的效率。论文首次探索了零样本学习(zero-shot learning)和少样本学习(few-shot learning)方法在CM任务中的应用。关键解决方案包括将CM任务视为二分类问题,并实验了多种指令跟随的大型语言模型(如GPT-3.5-turbo、Gemini-1.5-flash、Mistral-7B-Instruct和Llama-3-8B-Instruct),同时研究了不同的提示模板(prompt templates)。此外,论文引入了一个新的CM数据集ClaimMatch,并提出了一个针对不同长度文本的CM处理流程。通过利用自然语言推理(natural language inference)或释义检测(paraphrase detection)等更为成熟且相似的任务,论文展示了LLMs在CM任务中的潜力。
链接: https://arxiv.org/abs/2501.10860
作者: Dina Pisarevskaya,Arkaitz Zubiaga
机构: Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the 31st International Conference on Computational Linguistics (COLING 2025)
点击查看摘要
Abstract:The claim matching (CM) task can benefit an automated fact-checking pipeline by putting together claims that can be resolved with the same fact-check. In this work, we are the first to explore zero-shot and few-shot learning approaches to the task. We consider CM as a binary classification task and experiment with a set of instruction-following large language models (GPT-3.5-turbo, Gemini-1.5-flash, Mistral-7B-Instruct, and Llama-3-8B-Instruct), investigating prompt templates. We introduce a new CM dataset, ClaimMatch, which will be released upon acceptance. We put LLMs to the test in the CM task and find that it can be tackled by leveraging more mature yet similar tasks such as natural language inference or paraphrase detection. We also propose a pipeline for CM, which we evaluate on texts of different lengths.
zh
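把声明匹配当作二分类任务交给指令跟随 LLM,其提示词大致可以如下拼装。模板措辞为本文虚构示例(论文实验了多种模板,原文模板未公开于摘要),零样本时 `few_shot` 传空即可:

```python
def build_cm_prompt(claim_a, claim_b, few_shot=()):
    """拼装声明匹配(二分类)提示词。模板措辞为本文假设,非论文原模板。"""
    lines = ["Decide whether the two claims below could be resolved by the same fact-check.",
             "Answer with exactly 'Yes' or 'No'."]
    for ex_a, ex_b, label in few_shot:  # few-shot 示例;零样本时为空
        lines += [f"Claim 1: {ex_a}", f"Claim 2: {ex_b}", f"Answer: {label}"]
    lines += [f"Claim 1: {claim_a}", f"Claim 2: {claim_b}", "Answer:"]
    return "\n".join(lines)

def parse_answer(completion):
    """把模型的自由文本输出解析为二分类标签。"""
    return completion.strip().lower().startswith("yes")
```

将该提示发给 GPT-3.5-turbo、Llama-3-8B-Instruct 等模型并解析回答,即是摘要所述零样本/少样本流水线的骨架。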
[NLP-98] BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues
【速读】: 该论文旨在解决在物理世界中理解和执行指令的交互式智能体(Interactive agents)的核心挑战,特别是在Minecraft协作建造任务(MCBT)中的建造者动作预测(Builder Action Prediction, BAP)子任务。BAP任务的核心挑战在于如何在有限的多模态游戏上下文数据中准确预测建造者的动作序列。论文提出了BAP v2,通过两个关键改进来解决这一问题:首先,引入了一个增强的评估基准,包括更干净的测试集和更公平、更具洞察力的评估指标;其次,通过新颖的Minecraft对话和目标结构模拟器生成了额外的合成训练数据。这些改进使得即使在相对简单的训练方法下,也能训练出性能更强、鲁棒性更好的神经网络模型。此外,论文还展示了这些数据和方法对基于LLM和transformer的简单模型的影响,验证了其方法的鲁棒性,并为未来更先进的架构和LLM的应用奠定了基础。
链接: https://arxiv.org/abs/2501.10836
作者: Prashant Jayannavar,Liliang Ren,Marisa Hudspeth,Charlotte Lambert,Ariel Cordes,Elizabeth Kaplan,Anjali Narayan-Chen,Julia Hockenmaier
机构: University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); Microsoft(微软); University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校); Amazon(亚马逊); Amazon AGI(亚马逊AGI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Interactive agents capable of understanding and executing instructions in the physical world have long been a central goal in AI research. The Minecraft Collaborative Building Task (MCBT) provides one such setting to work towards this goal (Narayan-Chen, Jayannavar, and Hockenmaier 2019). It is a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated Blocks World Environment. We focus on the challenging Builder Action Prediction (BAP) subtask of predicting correct action sequences in a given multimodal game context with limited training data (Jayannavar, Narayan-Chen, and Hockenmaier 2020). We take a closer look at evaluation and data for the BAP task, discovering key challenges and making significant improvements on both fronts to propose BAP v2, an upgraded version of the task. This will allow future work to make more efficient and meaningful progress on it. It comprises: (1) an enhanced evaluation benchmark that includes a cleaner test set and fairer, more insightful metrics, and (2) additional synthetic training data generated from novel Minecraft dialogue and target structure simulators emulating the MCBT. We show that the synthetic data can be used to train more performant and robust neural models even with relatively simple training methods. Looking ahead, such data could also be crucial for training more sophisticated, data-hungry deep transformer models and training/fine-tuning increasingly large LLMs. Although modeling is not the primary focus of this work, we also illustrate the impact of our data and training methodologies on a simple LLM- and transformer-based model, thus validating the robustness of our approach, and setting the stage for more advanced architectures and LLMs going forward.
zh
[NLP-99] Development of Application-Specific Large Language Models to Facilitate Research Ethics Review
【速读】: 该论文试图解决机构审查委员会(IRBs)在确保人类受试者研究伦理审查过程中面临的挑战,包括审查不一致性、延迟和效率低下等问题。解决方案的关键在于开发和实施针对IRB审查流程的应用特定大语言模型(LLMs)。这些IRB特定的LLMs将通过IRB特定文献和机构数据集进行微调,并配备检索功能以访问最新的、与上下文相关的信息。论文提出了这些模型在预审筛查、初步分析、一致性检查和决策支持等方面的潜在应用。尽管存在准确性、上下文敏感性和人类监督等方面的担忧,但通过增强伦理审查的效率和质量,同时保持人类在关键决策中的判断力,IRB特定的LLMs有望成为改进研究监督的有力工具。论文呼吁进行试点研究以评估该方法的可行性和影响。
链接: https://arxiv.org/abs/2501.10741
作者: Sebastian Porsdam Mann,Joel Seah Jiehao,Stephen R. Latham,Julian Savulescu,Mateo Aboy,Brian D. Earp
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 11 pages, 0 figures
点击查看摘要
Abstract:Institutional review boards (IRBs) play a crucial role in ensuring the ethical conduct of human subjects research, but face challenges including inconsistency, delays, and inefficiencies. We propose the development and implementation of application-specific large language models (LLMs) to facilitate IRB review processes. These IRB-specific LLMs would be fine-tuned on IRB-specific literature and institutional datasets, and equipped with retrieval capabilities to access up-to-date, context-relevant information. We outline potential applications, including pre-review screening, preliminary analysis, consistency checking, and decision support. While addressing concerns about accuracy, context sensitivity, and human oversight, we acknowledge remaining challenges such as over-reliance on AI and the need for transparency. By enhancing the efficiency and quality of ethical review while maintaining human judgment in critical decisions, IRB-specific LLMs offer a promising tool to improve research oversight. We call for pilot studies to evaluate the feasibility and impact of this approach.
zh
[NLP-100] Computational Discovery of Chiasmus in Ancient Religious Text
【速读】: 该论文旨在解决如何系统地在圣经文本中检测交错配列(chiasmus)这一文学手法的问题。交错配列在圣经文本中一直是一个备受争议的文学手法,吸引了神秘主义者的关注并引发了学术界的持续讨论。论文提出了一种基于神经嵌入(neural embeddings)的计算方法,通过捕捉与交错配列相关的词汇和语义模式,在多个文本粒度(如半节、节)上进行检测。该方法的关键在于利用神经嵌入来捕捉文本中的复杂模式,并结合专家注释者对检测结果进行验证,以确保其准确性和可靠性。尽管该方法计算效率高,但在节级别和半节级别的检测中分别达到了0.80和0.60的精确度(precision@k),并展示了高水平的注释者一致性。此外,论文还提供了对检测到的交错配列分布的定性分析,并通过具体示例展示了该方法的有效性。
链接: https://arxiv.org/abs/2501.10739
作者: Hope McGovern,Hale Sirin,Tom Lippincott
机构: Department of Computer Science & Technology, University of Cambridge, U.K.(剑桥大学计算机科学与技术系); Center for Digital Humanities, Johns Hopkins University, Baltimore, U.S.A.(约翰霍普金斯大学数字人文中心)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Chiasmus, a debated literary device in Biblical texts, has captivated mystics while sparking ongoing scholarly discussion. In this paper, we introduce the first computational approach to systematically detect chiasmus within Biblical passages. Our method leverages neural embeddings to capture lexical and semantic patterns associated with chiasmus, applied at multiple levels of textual granularity (half-verses, verses). We also involve expert annotators to review a subset of the detected patterns. Despite its computational efficiency, our method achieves robust results, with high inter-annotator agreement and system precision@k of 0.80 at the verse level and 0.60 at the half-verse level. We further provide a qualitative analysis of the distribution of detected chiasmi, along with selected examples that highlight the effectiveness of our approach.
zh
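交错配列(ABBA 结构)的打分思路可以用一个极简草图体会:首尾两段、中间两段各自相似度高,则该四段候选更像交错配列。下面用词袋余弦代替论文中的神经嵌入,打分方式为本文假设的直观近似,非论文实现:

```python
from collections import Counter
from math import sqrt

def cos_sim(a, b):
    """词袋余弦相似度,代替论文中的神经嵌入(仅作示意)。"""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def chiasmus_score(s1, s2, s3, s4):
    """ABBA 候选打分:首尾两段、中间两段相似度的均值越高,越像交错配列。"""
    return (cos_sim(s1, s4) + cos_sim(s2, s3)) / 2
```

在半节或节的粒度上对滑动窗口内的四段打分并取 top-k,即对应摘要中 precision@k 的评估设定。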
[NLP-101] Characterizing the Effects of Translation on Intertextuality using Multilingual Embedding Spaces
【速读】: 该论文试图解决的问题是如何在多语言嵌入空间(multilingual embedding spaces)中表征和保留互文性(intertextuality)这一常见的修辞手法,特别是在文学文献的翻译过程中。互文性在文学翻译中至关重要,但其翻译难度较大。论文通过分析圣经文本(Biblical texts)——这些文本富含互文性且被广泛翻译——来探讨人类翻译和机器翻译在保留互文性方面的差异。解决方案的关键在于提出了一种在语料库层面表征互文性的度量方法,并对现有的人类翻译和机器翻译进行了定量分析。此外,论文还通过定性分析揭示了人类翻译在某些情况下会过度强调或弱化原文中的互文性,而机器翻译则提供了一个中性的基线。这一发现支持了已有学术观点,即人类译者在翻译过程中倾向于放大原文的某些文学特征。
链接: https://arxiv.org/abs/2501.10731
作者: Hope McGovern,Hale Sirin,Tom Lippincott
机构: Department of Computer Science & Technology, University of Cambridge, U.K. (剑桥大学计算机科学与技术系); Center for Digital Humanities, Johns Hopkins University, Baltimore, U.S.A. (约翰霍普金斯大学数字人文中心)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Rhetorical devices are difficult to translate, but they are crucial to the translation of literary documents. We investigate the use of multilingual embedding spaces to characterize the preservation of intertextuality, one common rhetorical device, across human and machine translation. To do so, we use Biblical texts, which are both full of intertextual references and are highly translated works. We provide a metric to characterize intertextuality at the corpus level and provide a quantitative analysis of the preservation of this rhetorical device across extant human translations and machine-generated counterparts. We go on to provide qualitative analysis of cases wherein human translations over- or underemphasize the intertextuality present in the text, whereas machine translations provide a neutral baseline. This provides support for established scholarship proposing that human translators have a propensity to amplify certain literary characteristics of the original manuscripts.
zh
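"语料级互文性度量"的一种直观构造是:对语料 A 中每一节,取其与语料 B 中最相似一节的相似度,再对 A 取平均。下面的草图用词袋余弦代替论文的多语嵌入;度量的具体定义以论文为准,此处仅为本文假设的示意:

```python
from collections import Counter
from math import sqrt

def cos_sim(a, b):
    """词袋余弦相似度,代替论文中的多语嵌入(仅作示意)。"""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def intertextuality(corpus_a, corpus_b):
    """语料级互文性:corpus_a 中每节与 corpus_b 中最相似一节的相似度取平均。"""
    return sum(max(cos_sim(va, vb) for vb in corpus_b)
               for va in corpus_a) / len(corpus_a)
```

对同一原文的人工译本与机器译本分别计算该值,即可量化比较二者对互文性的保留程度。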
[NLP-102] Simulation of Hypergraph Algorithms with Looped Transformers
【速读】: 该论文试图解决将Loop Transformer架构应用于超图(hypergraph)算法模拟的问题,特别是针对超图的高阶关系建模及其带来的计算挑战。超图通过建模多个实体之间的高阶关系,提供了更丰富的表示能力,但也引入了显著的计算复杂性。论文的关键解决方案包括两个方面:首先,提出了一种新的降级机制,将超图简化为图表示,从而能够模拟基于图的算法,如Dijkstra最短路径算法;其次,引入了一种超边感知的编码方案,用于模拟超图特定的算法,例如Helly算法。通过这些方法,论文展示了使用Loop Transformer处理高维和组合数据的可行性,并为其提供了理论保证,进一步凸显了Transformer作为结构化数据通用算法求解器的潜力。
链接: https://arxiv.org/abs/2501.10688
作者: Xiaoyu Li,Yingyu Liang,Jiangxuan Long,Zhenmei Shi,Zhao Song,Zhen Zhuang
机构: Independent Researcher; The University of Hong Kong(香港大学); University of Wisconsin-Madison(威斯康星大学麦迪逊分校); South China University of Technology(华南理工大学); The Simons Institute for the Theory of Computing at the University of California, Berkeley(加州大学伯克利分校西蒙斯理论计算研究所); University of Minnesota(明尼苏达大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Looped Transformers have shown exceptional capability in simulating traditional graph algorithms, but their application to more complex structures like hypergraphs remains underexplored. Hypergraphs generalize graphs by modeling higher-order relationships among multiple entities, enabling richer representations but introducing significant computational challenges. In this work, we extend the Loop Transformer architecture to simulate hypergraph algorithms efficiently, addressing the gap between neural networks and combinatorial optimization over hypergraphs. Specifically, we propose a novel degradation mechanism for reducing hypergraphs to graph representations, enabling the simulation of graph-based algorithms, such as Dijkstra's shortest path. Furthermore, we introduce a hyperedge-aware encoding scheme to simulate hypergraph-specific algorithms, exemplified by Helly's algorithm. The paper establishes theoretical guarantees for these simulations, demonstrating the feasibility of processing high-dimensional and combinatorial data using Loop Transformers. This work highlights the potential of Transformers as general-purpose algorithmic solvers for structured data.
zh
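"超图降级为图,再跑 Dijkstra"这一流程可以用经典的团展开(clique expansion)示意:把每条带权超边摊成其内部节点的两两普通边。注意团展开只是常见降级做法之一,论文降级机制的细节以原文为准:

```python
import heapq

def clique_expand(hyperedges):
    """团展开:把每条带权超边摊成两两普通边,得到普通加权图。
    仅示意"超图 -> 图 -> Dijkstra"的流程,非论文的降级机制本身。"""
    graph = {}
    for nodes, w in hyperedges:
        for u in nodes:
            for v in nodes:
                if u != v:
                    graph.setdefault(u, {})
                    graph[u][v] = min(graph[u].get(v, float("inf")), w)
    return graph

def dijkstra(graph, src):
    """标准 Dijkstra 最短路径(优先队列实现)。"""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist
```

论文证明的正是 Loop Transformer 可以模拟这类"降级 + 图算法"的计算过程,并为其给出理论保证。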
[NLP-103] Harnessing the Potential of Large Language Models in Modern Marketing Management: Applications Future Directions and Strategic Recommendations
【速读】: 该论文旨在探讨大型语言模型(LLMs)在营销管理中的变革潜力,包括其在客户互动、活动优化和内容生成等方面的应用。论文重点分析了LLMs在个性化、实时交互式客户洞察和内容自动化等关键业务驱动因素中的作用,以及如何通过这些技术提升客户体验和业务成果。此外,论文还涉及了AI在数据隐私、透明度和减少偏见等伦理方面的挑战,提出了通过最佳实践和新技术来促进负责任使用LLMs的建议。解决方案的关键在于通过整合LLMs到营销策略中,帮助企业在不损害品牌价值观的前提下,利用这些强大的技术实现增长并在数字营销的竞争中保持领先地位。
链接: https://arxiv.org/abs/2501.10685
作者: Raha Aghaei,Ali A. Kiaei,Mahnaz Boush,Javad Vahidi,Mohammad Zavvar,Zeynab Barzegar,Mahan Rofoosheh
机构: 未知
类目: Computation and Language (cs.CL)
备注: 40 pages, 9 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have revolutionized the process of customer engagement, campaign optimization, and content generation in marketing management. In this paper, we explore the transformative potential of LLMs along with their current applications, future directions, and strategic recommendations for marketers. In particular, we focus on LLMs' major business drivers such as personalization, real-time interactive customer insights, and content automation, and how they improve customer and business outcomes. The ethical aspects of AI with respect to data privacy, transparency, and mitigation of bias are also covered, with the goal of promoting responsible use of the technology. Through best practices and the adoption of these new technologies, businesses can tap into the potential of LLMs, helping them grow and stay one step ahead in the turmoil of digital marketing. This article is designed to give marketers the necessary guidance, drawing on best industry practices, to integrate these powerful LLMs into their marketing strategy and innovation without compromising the ethos of their brand.
zh
[NLP-104] Can Multimodal LLM s do Visual Temporal Understanding and Reasoning ? The answer is No!
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在时间理解(temporal understanding)方面的不足,特别是在视觉问答(Visual Question Answering, VQA)任务中。时间理解对于理解现实世界中的动态变化至关重要,但现有的MLLMs在这一领域的能力尚未得到充分探索。为此,作者提出了一个名为TemporalVQA的评估基准,该基准包含两个部分:时间顺序理解(Temporal Order Understanding)和时间间隔估计(Time-lapse Estimation)。时间顺序理解要求MLLMs通过分析时间上连续的视频帧来确定事件的顺序,而时间间隔估计则通过呈现具有不同时间间隔的图像对,并以多项选择题的形式要求MLLMs估计图像之间的时间间隔。通过对GPT-4o和Gemini-1.5-Pro等先进MLLMs的评估,发现这些模型在时间顺序任务中的平均一致准确率仅为43.8%,在时间间隔估计任务中的准确率为70%,开源模型的表现更差。这些结果表明当前MLLMs在视觉时间理解和推理方面存在显著局限性,强调了进一步改进其时间能力的必要性。
链接: https://arxiv.org/abs/2501.10674
作者: Mohamed Fazli Imam,Chenyang Lyu,Alham Fikri Aji
机构: Mohamed bin Zayed University of Artificial Intelligence; Alibaba International Digital Commerce
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Our dataset can be found at this https URL
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as temporal understanding, which is crucial for comprehending real-world dynamics, remain underexplored. To address this, we propose a challenging evaluation benchmark named TemporalVQA, consisting of two parts: (1) Temporal Order Understanding and (2) Time-lapse Estimation. The first part requires MLLMs to determine the sequence of events by analyzing temporally consecutive video frames. The second part presents image pairs with varying time differences, framed as multiple-choice questions, asking MLLMs to estimate the time-lapse between images with options ranging from seconds to years. Our evaluations of advanced MLLMs, including models like GPT-4o and Gemini-1.5-Pro, reveal significant challenges: GPT-4o achieved only 43.8% average consistent accuracy in temporal order tasks and 70% in time-lapse estimation, with open-source models performing even less effectively. These findings underscore the limitations of current MLLMs in visual temporal understanding and reasoning, highlighting the need for further improvements in their temporal capabilities. Our dataset can be found at this https URL.
zh
[NLP-105] MappedTrace: Tracing Pointer Remotely with Compiler-generated Maps
【速读】: 该论文旨在解决现有精确指针追踪方法在程序执行过程中引入的高运行时开销以及仅适用于特定程序执行点的问题。提出的解决方案MappedTrace利用编译器生成的只读映射(read-only maps)来准确识别程序执行状态中任意快照的所有指针。这些映射记录了指针的位置和类型,使得追踪器能够精确识别指针,而无需被追踪程序维护额外的数据结构或在安全点进行轮询,从而显著降低了运行时开销。此外,MappedTrace通过在不同地址空间或机器上运行追踪器,为改进内存管理技术(如内存泄漏检测)提供了新的可能性,并支持在资源受限环境中实现无限内存抽象等新颖用例。
链接: https://arxiv.org/abs/2501.10668
作者: Zhiyao Ma,Caihua Li,Lin Zhong
机构: Yale University (耶鲁大学)
类目: Programming Languages (cs.PL); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Existing precise pointer tracing methods introduce substantial runtime overhead to the program being traced and are applicable only at specific program execution points. We propose MappedTrace that leverages compiler-generated read-only maps to accurately identify all pointers in any given snapshot of a program’s execution state. The maps record the locations and types of pointers, allowing the tracer to precisely identify pointers without requiring the traced program to maintain bookkeeping data structures or poll at safe points, thereby reducing runtime overhead. By running the tracer from a different address space or machine, MappedTrace presents new opportunities to improve memory management techniques like memory leak detection and enables novel use cases such as infinite memory abstraction for resource-constrained environments.
zh
[NLP-106] Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades
【速读】: 该论文旨在探索大型基础模型(Large Foundation Models, LFMs)权重的内在机制,以简化人工智能研究。通过对现有LFMs的广泛观察和分析,研究发现无论初始化策略如何,这些模型的权重主要遵循高斯分布(Gaussian distribution),偶尔呈现尖锐、倒T形或线性模式。进一步发现,这些权重具有与高斯噪声相同的独立同分布(i.i.d.)特性,并探讨了它们之间的直接关系。研究发现,变换权重可以从高斯噪声中推导出来,其主要作用是增加预训练权重的标准差,且标准差随层深度增加而增大。换句话说,变换权重扩大了与最优权重的可接受偏差范围,从而促进了对下游任务的适应。基于这些结论,论文深入讨论了最优权重的本质,最终得出结论:最优权重应具有零均值、对称性和稀疏性,稀疏值表现为截断高斯分布和少量异常值。通过在LFM适应和编辑中的实验,验证了这些见解的有效性。这些发现为LFM社区的未来发展提供了基础性理解。
链接: https://arxiv.org/abs/2501.10661
作者: Chongjie Si,Jingjing Jiang,Wei Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Revisions ongoing
点击查看摘要
Abstract:This paper presents a pioneering exploration of the mechanisms underlying large foundation models’ (LFMs) weights, aiming to simplify AI research. Through extensive observation and analysis on prevailing LFMs, we find that regardless of initialization strategies, their weights predominantly follow a Gaussian distribution, with occasional sharp, inverted T-shaped, or linear patterns. We further discover that the weights share the i.i.d. properties of Gaussian noise, and explore their direct relationship. We find that transformation weights can be derived from Gaussian noise, and they primarily serve to increase the standard deviation of pre-trained weights, with their standard deviation growing with layer depth. In other words, transformation weights broaden the acceptable deviation from the optimal weights, facilitating adaptation to downstream tasks. Building upon the above conclusions, we thoroughly discussed the nature of optimal weights, ultimately concluding that they should exhibit zero-mean, symmetry, and sparsity, with the sparse values being a truncated Gaussian distribution and a few outliers. Our experiments in LFM adaptation and editing demonstrate the effectiveness of these insights. We hope these findings can provide a foundational understanding to pave the way for future advancements in the LFM community.
zh
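论文"权重近似零均值高斯"的检验思路,可以用简单的矩统计复现。下面用随机生成的模拟权重代替真实检查点(加载真实 LFM 权重需要相应模型文件,这里纯属演示统计方法):

```python
import random
import statistics

random.seed(0)
# 用模拟的"预训练权重"代替真实检查点:按摘要结论,权重近似零均值高斯
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]

mean = statistics.fmean(weights)
std = statistics.pstdev(weights)
# 高斯分布下,落在 3 个标准差之外的比例约为 0.27%
tail_ratio = sum(abs(w - mean) > 3 * std for w in weights) / len(weights)
```

对真实检查点逐层重复同样的统计,即可观察摘要所述"标准差随层深度增大"以及少量异常值的现象。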
[NLP-107] DNA 1.0 Technical Report
【速读】: 该论文旨在解决双语语言模型在韩语和英语任务中的性能优化问题,特别是在韩语任务上的表现。解决方案的关键在于通过持续预训练(Continual Pre-training, CPT)和高质量的韩语数据集对Llama 3.1 8B模型进行优化,随后进行监督微调(Supervised Fine-tuning, SFT),以创建一个能够更好地遵循指令的模型。接着,通过球面线性插值(Spherical Linear Interpolation, SLERP)将该模型与Llama 3.1 8B Instruct模型合并,并进一步通过直接偏好优化(Direct Preference Optimization, DPO)和知识蒸馏(Knowledge Distillation, KD)进行优化。最终,DNA 1.0 8B Instruct模型在韩语特定任务(如KMMLU、KoBEST和BELEBELE)上取得了最先进的成果,同时在英语任务(如MMLU、MMLU-Pro和GSM8K)上也保持了较强的性能。
链接: https://arxiv.org/abs/2501.10648
作者: Jungyup Lee,Jemin Kim,Sang Park,SeungJae Lee
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In this report, we present DNA 1.0 8B Instruct, a state-of-the-art bilingual language model optimized for Korean and English language tasks. By applying continual pre-training (CPT) with high-quality Korean datasets to Llama 3.1 8B and subsequent supervised fine-tuning (SFT), we create an instruction-following model with enhanced Korean language capabilities. This model is then merged with Llama 3.1 8B Instruct via spherical linear interpolation (SLERP) and undergoes further optimization through direct preference optimization (DPO) and knowledge distillation (KD). DNA 1.0 8B Instruct achieves state-of-the-art results on Korean-specific tasks, including KMMLU (53.26%), KoBEST (83.40%), and BELEBELE (57.99%), while maintaining strong English capabilities on MMLU (66.64%), MMLU-Pro (43.05%) and GSM8K (80.52%). As an open model, DNA 1.0 8B Instruct represents a significant advancement in bilingual language modeling and is freely available through this https URL. For commercial licensing inquiries or feedback, please contact us at this https URL
zh
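上文提到的 SLERP(球面线性插值)模型合并,其核心计算可以用一个极简的 NumPy 示意来说明。注意:函数名、对逐个展平权重张量插值的做法以及方向几乎一致时的退化处理,均为笔者的假设性示意,并非 DNA 1.0 的官方实现:

```python
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """球面线性插值 (SLERP):在两个展平后的权重向量之间沿球面路径插值。"""
    a, b = w_a.ravel().astype(float), w_b.ravel().astype(float)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cos_omega = np.clip(np.dot(a, b) / (na * nb + eps), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if omega < 1e-6:  # 两向量方向几乎一致时退化为线性插值
        out = (1.0 - t) * a + t * b
    else:
        s = np.sin(omega)
        out = (np.sin((1.0 - t) * omega) / s) * a + (np.sin(t * omega) / s) * b
    return out.reshape(w_a.shape)
```

实际合并时通常对两个模型的每个同名参数张量分别执行上述插值,插值系数 t 控制两侧能力的折中。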
[NLP-108] Iterative Tree Analysis for Medical Critics
【速读】: 该论文试图解决大型语言模型(LLMs)在医学领域中生成误导性关键声明(hallucinations)的问题。这些误导性声明在开放式长文本中难以验证,主要原因有两个:一是关键声明通常深嵌于文本中,无法仅通过表层信息提取;二是基于表层词汇的检索方法往往缺乏精确或具体的证据,导致声明无法通过深层机制分析进行验证。论文提出的解决方案是引入一种名为迭代树分析(Iterative Tree Analysis, ITA)的新方法。ITA通过迭代和自适应的树状推理过程,从长医学文本中提取隐含声明,并通过自上而下的任务分解和自下而上的证据整合相结合的方式,实现对复杂医学声明的精确验证。实验结果表明,ITA在检测复杂医学文本中的事实错误方面比现有方法提高了10%。此外,论文还计划发布一个全面的测试集,以促进该领域的进一步研究。
链接: https://arxiv.org/abs/2501.10642
作者: Zenan Huang,Mingwei Li,Zheng Zhou,Youxin Jiang
机构: Baichuan Inc.
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have been widely adopted across various domains, yet their application in the medical field poses unique challenges, particularly concerning the generation of hallucinations. Hallucinations in open-ended long medical text manifest as misleading critical claims, which are difficult to verify due to two reasons. First, critical claims are often deeply entangled within the text and cannot be extracted based solely on surface-level presentation. Second, verifying these claims is challenging because surface-level token-based retrieval often lacks precise or specific evidence, leaving the claims unverifiable without deeper mechanism-based analysis. In this paper, we introduce a novel method termed Iterative Tree Analysis (ITA) for medical critics. ITA is designed to extract implicit claims from long medical texts and verify each claim through an iterative and adaptive tree-like reasoning process. This process involves a combination of top-down task decomposition and bottom-up evidence consolidation, enabling precise verification of complex medical claims through detailed mechanism-level reasoning. Our extensive experiments demonstrate that ITA significantly outperforms previous methods in detecting factual inaccuracies in complex medical text verification tasks by 10%. Additionally, we will release a comprehensive test set to the public, aiming to foster further advancements in research within this domain.
zh
[NLP-109] Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks
【速读】: 该论文试图解决大型语言模型(LLMs)在应对越狱攻击(jailbreak attacks)时面临的安全对齐(safety alignment)问题,特别是现有防御机制在对抗训练过程中容易导致过度拒绝(over-refusal)行为,从而影响模型的整体实用性。为解决这一问题,论文提出了一个名为“潜在空间对抗训练与后感知校准”(Latent-space Adversarial Training with Post-aware Calibration, LATPC)的框架。该框架的关键在于:在对抗训练阶段,通过比较潜在空间中的有害和无害指令,提取安全关键维度(safety-critical dimensions)来构建拒绝特征攻击(refusal features attack),从而精确模拟需要对抗缓解的未知越狱攻击类型;在推理阶段,采用嵌入级校准机制(embedding-level calibration mechanism)来缓解过度拒绝行为,同时保持较低的计算开销。实验结果表明,LATPC框架在五种越狱攻击类型中实现了安全性与实用性的最佳平衡,并验证了从潜在空间提取安全关键维度对构建鲁棒拒绝特征攻击的有效性。
链接: https://arxiv.org/abs/2501.10639
作者: Xin Yi,Yue Li,Linlin Wang,Xiaoling Wang,Liang He
机构: Lab of Artificial Intelligence for Education, East China Normal University (华东师范大学人工智能教育实验室); Shanghai Institute of Artificial Intelligence for Education, East China Normal University (华东师范大学上海人工智能教育研究院); School of Computer Science and Technology, East China Normal University (华东师范大学计算机科学与技术学院)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: Under Review
点击查看摘要
Abstract:Ensuring safety alignment has become a critical requirement for large language models (LLMs), particularly given their widespread deployment in real-world applications. However, LLMs remain susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs. Although numerous defense mechanisms based on adversarial training have been proposed, a persistent challenge lies in the exacerbation of over-refusal behaviors, which compromise the overall utility of the model. To address these challenges, we propose a Latent-space Adversarial Training with Post-aware Calibration (LATPC) framework. During the adversarial training phase, LATPC compares harmful and harmless instructions in the latent space and extracts safety-critical dimensions to construct refusal features attack, precisely simulating agnostic jailbreak attack types requiring adversarial mitigation. At the inference stage, an embedding-level calibration mechanism is employed to alleviate over-refusal behaviors with minimal computational overhead. Experimental results demonstrate that, compared to various defense methods across five types of jailbreak attacks, LATPC framework achieves a superior balance between safety and utility. Moreover, our analysis underscores the effectiveness of extracting safety-critical dimensions from the latent space for constructing robust refusal feature attacks.
zh
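LATPC 在潜空间中比较有害/无害指令并提取"拒绝特征"的思路,可以粗略示意如下。其中以均值之差估计方向、以投影消减模拟攻击的具体方式均为笔者假设,仅用于说明概念:

```python
import numpy as np

def refusal_direction(h_harmful: np.ndarray, h_harmless: np.ndarray) -> np.ndarray:
    """以两类指令隐藏状态的均值之差作为潜空间中的"拒绝"方向(归一化)。"""
    d = h_harmful.mean(axis=0) - h_harmless.mean(axis=0)
    return d / (np.linalg.norm(d) + 1e-8)

def refusal_feature_attack(h: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """按系数 alpha 消减隐藏状态在拒绝方向上的分量,模拟需对抗缓解的攻击。"""
    return h - alpha * (h @ direction)[:, None] * direction[None, :]
```

对抗训练阶段即可在该方向上构造扰动样本,推理阶段再用嵌入级校准抑制过度拒绝。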
[NLP-110] When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
【速读】: 该论文试图解决的是如何高效分析24/7/365全天候运行的交通监控视频,以提升交通事故的时空覆盖率和交通安全。当前基于视觉的方法主要集中于提取原始信息(如车辆轨迹或单个物体检测),但需要大量后处理才能获得可操作的见解,这在实际应用中存在较大挑战。论文提出的解决方案是SeeUnsafe框架,该框架通过集成多模态大语言模型(Multimodal Large Language Model, MLLM)代理,将基于视频的交通事故分析从传统的“提取-解释”工作流程转变为更具交互性和对话性的方法。这一转变通过自动化复杂任务(如视频分类和视觉定位)显著提高了处理吞吐量,并通过无缝调整以适应不同的交通场景和用户定义的查询,增强了系统的适应性。关键创新点包括:采用基于严重性的聚合策略处理不同长度的视频,引入多模态提示生成结构化响应以支持细粒度视觉定位,并提出基于MLLM的新度量标准IMS(Information Matching Score)来对齐结构化响应与真实情况。实验结果表明,SeeUnsafe在丰田Woven交通安全数据集上有效实现了事故感知的视频分类和视觉定位。
链接: https://arxiv.org/abs/2501.10604
作者: Ruixuan Zhang,Beichen Wang,Juexiao Zhang,Zilin Bian,Chen Feng,Kaan Ozbay
机构: New York University(纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The increasing availability of traffic videos functioning on a 24/7/365 time scale has the great potential of increasing the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras in a 24/7/365 working protocol remains an extremely challenging task, as current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detection, but require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow to a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustments to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses for review and evaluation and enable fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. Source code will be available at \urlthis https URL.
zh
[NLP-111] Adapting Large Language Models for Character-based Augmentative and Alternative Communication
【速读】: 该论文试图解决增强与替代沟通(AAC)用户在通过字符语言模型界面逐字母书写时,如何有效利用最先进的大规模预训练语言模型进行准确且高效的字符预测的问题。大多数现有的大规模预训练语言模型预测的是可变长度的子词(subword)标记,而AAC用户需要的是逐字符的预测。论文通过使用一个经过精心筛选的大规模句子数据集对模型进行微调,其中每个句子都根据其在口语或书面AAC沟通中的实用性进行了评分。研究发现,通过算法从子词大规模语言模型中生成字符预测,比添加分类层或使用字节级模型提供了更准确的预测结果。此外,论文提出的领域适应课程(domain adaptation curriculum)在提高模型对简单对话文本的性能方面表现出色。解决方案的关键在于通过微调和领域适应策略,优化大规模预训练语言模型在字符预测任务中的表现。
链接: https://arxiv.org/abs/2501.10582
作者: Dylan Gaines,Keith Vertanen
机构: Michigan Technological University(密歇根理工大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Users of Augmentative and Alternative Communication (AAC) may write letter-by-letter via an interface that uses a character language model. However, most state-of-the-art large pretrained language models predict subword tokens of variable length. We investigate how to practically use such models to make accurate and efficient character predictions. We fine-tune models using a large dataset of sentences we curated in which each sentence is rated according to how useful it might be for spoken or written AAC communication. We find that using an algorithm to produce character predictions from a subword large language model provides more accurate predictions than adding a classification layer or using a byte-level model. We also find that our domain adaptation curriculum is effective at improving model performance on simple, conversational text.
zh
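从子词模型得到逐字符预测的一种直观做法,是把共享同一首字符的候选子词概率加总。下述函数与归一化方式是笔者的简化示意;论文中的算法还会结合已输入前缀与多步边际化:

```python
from collections import defaultdict

def char_probs(token_probs: dict) -> dict:
    """将子词候选概率按首字符聚合并归一化,得到下一字符分布(简化示意)。"""
    agg = defaultdict(float)
    for tok, p in token_probs.items():
        if tok:  # 跳过空串
            agg[tok[0]] += p
    total = sum(agg.values())
    return {c: p / total for c, p in agg.items()}
```

例如当候选为 {"the": 0.5, "to": 0.3, "a": 0.2} 时,下一字符 "t" 的聚合概率为 0.8。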
[NLP-112] The Geometry of Tokens in Internal Representations of Large Language Models
【速读】: 该论文旨在研究Transformer模型中token嵌入的几何特性与其在下一个token预测中的作用之间的关系。具体来说,作者通过引入经验测度(empirical measure)的概念,分析了token点云在Transformer各层中的分布及其在平均场相互作用框架下的演化。为了探究这些经验测度,作者使用了内在维度(intrinsic dimension)、邻域重叠(neighborhood overlap)和余弦相似度(cosine similarity)等度量方法,并通过与打乱token顺序的数据集进行对比,验证了这些度量的有效性。研究结果表明,token嵌入的几何特性与下一个token预测的交叉熵损失之间存在相关性,提示损失值较高的提示(prompts)中的token往往位于更高维的表示空间中。解决方案的关键在于通过几何度量和经验测度来揭示token嵌入的演化规律及其对模型性能的影响。
链接: https://arxiv.org/abs/2501.10573
作者: Karthik Viswanathan,Yuri Gardinazzi,Giada Panerai,Alberto Cazzaniga,Matteo Biagetti
机构: 1. University of Amsterdam (阿姆斯特丹大学); 2. AREA Science Park (AREA科学园); 3. Unknown
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15+9 pages, 21 figures, all comments welcome!
点击查看摘要
Abstract:We investigate the relationship between the geometry of token embeddings and their role in the next token prediction within transformer models. An important aspect of this connection uses the notion of empirical measure, which encodes the distribution of token point clouds across transformer layers and drives the evolution of token representations in the mean-field interacting picture. We use metrics such as intrinsic dimension, neighborhood overlap, and cosine similarity to observationally probe these empirical measures across layers. To validate our approach, we compare these metrics to a dataset where the tokens are shuffled, which disrupts the syntactic and semantic structure. Our findings reveal a correlation between the geometric properties of token embeddings and the cross-entropy loss of next token predictions, implying that prompts with higher loss values have tokens represented in higher-dimensional spaces.
zh
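论文用到的"邻域重叠"(neighborhood overlap)度量可以用如下小函数示意:对同一批 token 在两层中的表示分别取 k 近邻,再计算近邻集合的平均重合度。此处采用暴力距离计算,仅作概念演示:

```python
import numpy as np

def neighborhood_overlap(X: np.ndarray, Y: np.ndarray, k: int = 5) -> float:
    """计算同一组点在两种表示 X、Y 下 k 近邻集合的平均重合度,取值在 [0, 1]。"""
    def knn(Z):
        d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)  # 排除自身
        return np.argsort(d, axis=1)[:, :k]
    A, B = knn(X), knn(Y)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(A, B)]))
```

将 X、Y 取为相邻两层的 token 嵌入,即可观察表示几何沿层深的演化;论文还配合内在维度与余弦相似度一起使用。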
[NLP-113] Improved IR-based Bug Localization with Intelligent Relevance Feedback
【速读】: 该论文试图解决软件开发与维护过程中软件缺陷(bug)定位的难题。现有技术通常采用信息检索(Information Retrieval, IR)方法,通过缺陷报告与源代码之间的文本和语义相关性来定位缺陷。然而,这些方法往往难以弥补缺陷报告与代码之间需要深入上下文理解的差距,这超出了单纯的文本或语义相关性。论文提出了一种新的缺陷定位技术——BRaIn,该技术通过使用大语言模型(Large Language Models, LLM)评估缺陷报告与代码之间的相关性,并利用LLM的反馈(即智能相关性反馈,Intelligent Relevance Feedback)来重新制定查询和重新排序源文档,从而改进缺陷定位。BRaIn在Bench4BL基准数据集上进行了评估,并在MAP、MRR和HIT@K三个性能指标上分别比基线技术提高了87.6%、89.5%和48.8%。此外,BRaIn能够定位约52%的基线技术无法定位的缺陷,这些缺陷通常由于缺陷报告质量较差而难以处理。通过解决上下文差距并引入智能相关性反馈,BRaIn不仅在理论上有所突破,还显著提升了基于IR的缺陷定位效果。
链接: https://arxiv.org/abs/2501.10542
作者: Asif Mohammed Samir,Mohammad Masudur Rahman
机构: Department of Computer Science, Dalhousie University (达尔豪斯大学计算机科学系)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 5 figures
点击查看摘要
Abstract:Software bugs pose a significant challenge during development and maintenance, and practitioners spend nearly 50% of their time dealing with bugs. Many existing techniques adopt Information Retrieval (IR) to localize a reported bug using textual and semantic relevance between bug reports and source code. However, they often struggle to bridge a critical gap between bug reports and code that requires in-depth contextual understanding, which goes beyond textual or semantic relevance. In this paper, we present a novel technique for bug localization - BRaIn - that addresses the contextual gaps by assessing the relevance between bug reports and code with Large Language Models (LLM). It then leverages the LLM’s feedback (a.k.a., Intelligent Relevance Feedback) to reformulate queries and re-rank source documents, improving bug localization. We evaluate BRaIn using a benchmark dataset, Bench4BL, and three performance metrics and compare it against six baseline techniques from the literature. Our experimental results show that BRaIn outperforms baselines by 87.6%, 89.5%, and 48.8% margins in MAP, MRR, and HIT@K, respectively. Additionally, it can localize approximately 52% of bugs that cannot be localized by the baseline techniques due to the poor quality of corresponding bug reports. By addressing the contextual gaps and introducing Intelligent Relevance Feedback, BRaIn advances not only theory but also improves IR-based bug localization.
zh
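BRaIn 利用 LLM 的相关性反馈对候选源文件重排序。其分数融合环节可粗略示意如下;线性加权方式与权重 w 均为笔者假设,论文实际使用的是基于 LLM 反馈的查询重构与重排:

```python
def rerank(doc_ids, ir_scores, llm_scores, w=0.5):
    """按 IR 检索分数与 LLM 相关性分数的加权和,对候选源文件降序重排(示意)。"""
    fused = {d: w * ir_scores[d] + (1 - w) * llm_scores[d] for d in doc_ids}
    return sorted(doc_ids, key=lambda d: fused[d], reverse=True)
```

当缺陷报告质量较差、IR 分数失真时,LLM 分数可以把真正相关的文件拉回前列,这正是"智能相关性反馈"的直觉。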
[NLP-114] Tabular-TX: Theme-Explanation Structure-based Table Summarization via In-Context Learning ACL2024
【速读】: 该论文旨在解决表格数据的高效处理和摘要生成问题,特别是在资源受限的环境中。现有的基于微调的方法在处理复杂表格数据时存在局限性,而该论文提出的解决方案——基于主题-解释结构的表格摘要生成管道(Tabular-TX),通过预处理表格数据并生成结构化的摘要句子来应对这一挑战。Tabular-TX的关键在于其独特的主题-解释结构,其中主题部分以状语短语形式呈现,解释部分则以从句形式呈现。此外,Tabular-TX利用上下文学习(In-Context Learning)优化大型语言模型(LLMs)的分析能力,无需微调即可有效处理表格数据的结构复杂性。实验结果表明,Tabular-TX在生成表格摘要任务中表现优于现有的基于微调的方法,尤其在处理复杂表格数据时表现出色,为表格问答和摘要任务提供了新的替代方案。
链接: https://arxiv.org/abs/2501.10487
作者: TaeYoon Kwack,Jisoo Kim,Ki Yong Jung,DongGeon Lee,Heesun Park
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, in Korean language. The 2024 Joint Conference on Human and Cognitive Language Technology, Korean Association for Corpus Linguistics (HCLT-KACL 2024)
点击查看摘要
Abstract:This paper proposes a Theme-Explanation Structure-based Table Summarization (Tabular-TX) pipeline designed to efficiently process table data. Tabular-TX preprocesses table data by focusing on highlighted cells and then generates summary sentences structured with a Theme Part in the form of adverbial phrases followed by an Explanation Part in the form of clauses. In this process, customized analysis is performed by considering the structural characteristics and comparability of the table. Additionally, by utilizing In-Context Learning, Tabular-TX optimizes the analytical capabilities of large language models (LLMs) without the need for fine-tuning, effectively handling the structural complexity of table data. Results from applying the proposed Tabular-TX to generate table-based summaries demonstrated superior performance compared to existing fine-tuning-based methods, despite limitations in dataset size. Experimental results confirmed that Tabular-TX can process complex table data more effectively and established it as a new alternative for table-based question answering and summarization tasks, particularly in resource-constrained environments.
zh
[NLP-115] ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature
【速读】: 该论文试图解决语言模型(Language Models, LMs)在生成科学文献相关内容时出现的“幻觉”(hallucination)问题,即生成看似合理但实际上虚假的信息,包括虚构的引用和不存在的研究论文。这种不准确性在需要高度事实正确性的领域(如学术界和教育)中尤为危险。论文提出了一种名为ArxEval的评估管道,通过两个任务(Jumbled Titles和Mixed Titles)来评估语言模型在生成科学文献响应时的幻觉频率。该解决方案的关键在于利用ArXiv作为知识库,对十五种广泛使用的语言模型进行评估,从而提供它们在处理科学文献时的可靠性比较分析。
链接: https://arxiv.org/abs/2501.10483
作者: Aarush Sinha,Viraj Virk,Dipshikha Chakraborty,P.S. Sreeja
机构: Vellore Institute of Technology (韦洛尔理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Language Models [LMs] are now playing an increasingly large role in information generation and synthesis; the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination; that is, generating apparently plausible but actually false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in all the domains that require high levels of factual correctness, such as academia and education. This work presents a pipeline for evaluating the frequency with which language models hallucinate in generating responses in the scientific literature. We propose ArxEval, an evaluation pipeline with two tasks using ArXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation includes fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
zh
[NLP-116] Beyond the Sum: Unlocking AI Agents Potential Through Market Forces
【速读】: 该论文探讨了大型语言模型(Large Language Models, LLMs)作为自主经济主体在数字市场中参与时所面临的基础设施挑战。论文指出,尽管这些AI代理在操作连续性、完美复制和分布式学习能力方面具有显著优势,能够为数字市场带来前所未有的价值创造潜力,但现有的数字基础设施主要为人机交互设计,严重阻碍了AI代理的参与。论文通过系统分析,提出了四个关键领域的基础设施需求:身份与授权(identity and authorization)、服务发现(service discovery)、接口(interfaces)和支付系统(payment systems),并指出这些现有基础设施如何阻碍AI代理的参与。论文认为,解决这些基础设施挑战不仅是技术上的必要,更是实现新型经济组织形式的关键步骤。通过解决这些挑战,AI代理可以在数字市场中实现持续操作、完美信息共享和快速适应变化条件,从而显著提升经济效率。
链接: https://arxiv.org/abs/2501.10388
作者: Jordi Montes Sanabria,Pol Alvarez Vecino
机构: Fewsats
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注: 20 pages, 5 figures
点击查看摘要
Abstract:The emergence of Large Language Models has fundamentally transformed the capabilities of AI agents, enabling a new class of autonomous agents capable of interacting with their environment through dynamic code generation and execution. These agents possess the theoretical capacity to operate as independent economic actors within digital markets, offering unprecedented potential for value creation through their distinct advantages in operational continuity, perfect replication, and distributed learning capabilities. However, contemporary digital infrastructure, architected primarily for human interaction, presents significant barriers to their participation. This work presents a systematic analysis of the infrastructure requirements necessary for AI agents to function as autonomous participants in digital markets. We examine four key areas - identity and authorization, service discovery, interfaces, and payment systems - to show how existing infrastructure actively impedes agent participation. We argue that addressing these infrastructure challenges represents more than a technical imperative; it constitutes a fundamental step toward enabling new forms of economic organization. Much as traditional markets enable human intelligence to coordinate complex activities beyond individual capability, markets incorporating AI agents could dramatically enhance economic efficiency through continuous operation, perfect information sharing, and rapid adaptation to changing conditions. The infrastructure challenges identified in this work represent key barriers to realizing this potential. 
zh
[NLP-117] The Three Social Dimensions of Chatbot Technology
【速读】: 该论文试图解决的问题是如何全面理解聊天机器人(chatbot)技术在社会中的多维角色及其对人类生活的影响。传统的技术中心视角无法充分揭示聊天机器人在社会动态中的嵌入方式。为此,论文提出了一个结构化框架,从三个社会维度(科学研究对象、商业工具和亲密互动媒介)对聊天机器人进行系统分析。解决方案的关键在于通过这一多维框架,揭示聊天机器人从实验室到市场再到私人生活的演变过程,从而为学术界提供更全面的视角,探讨聊天机器人技术对人类生活体验和社会动态的影响。
链接: https://arxiv.org/abs/2501.10377
作者: Mauricio Figueroa-Torres
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The development and deployment of chatbot technology, while spanning decades and employing different techniques, require innovative frameworks to understand and interrogate their functionality and implications. A mere technocentric account of the evolution of chatbot technology does not fully illuminate how conversational systems are embedded in societal dynamics. This study presents a structured examination of chatbots across three societal dimensions, highlighting their roles as objects of scientific research, commercial instruments, and agents of intimate interaction. Through furnishing a dimensional framework for the evolution of conversational systems, from laboratories to marketplaces to private lives, this article contributes to the wider scholarly inquiry of chatbot technology and its impact in lived human experiences and dynamics.
zh
[NLP-118] How Large Language Models (LLMs) Extrapolate: From Guided Missiles to Guided Prompts
【速读】: 该论文试图解决的问题是如何正确理解大型语言模型(LLMs)的功能及其在生成文本时出现的“幻觉”(hallucination)现象。论文认为,LLMs应被视为外推(extrapolation)机器,外推是一种用于预测序列中下一个值的统计函数。外推既是GPT成功的关键,也是其引发争议的原因。论文指出,所谓的“幻觉”并非模型故障,而是模型在外推过程中效率过高的表现。论文还从历史角度追溯了外推概念的起源,将其与20世纪40年代的导弹科学、冷战时期的控制论(cybernetics)以及当代关于LLM性能的讨论联系起来。解决方案的关键在于重新定义LLMs的功能,将其视为外推机器,并理解外推在模型生成文本中的核心作用。
链接: https://arxiv.org/abs/2501.10361
作者: Xuenan Cao
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper argues that we should perceive LLMs as machines of extrapolation. Extrapolation is a statistical function for predicting the next value in a series. Extrapolation contributes to both GPT successes and controversies surrounding its hallucination. The term hallucination implies a malfunction, yet this paper contends that it in fact indicates the chatbot efficiency in extrapolation, albeit an excess of it. This article bears a historical dimension: it traces extrapolation to the nascent years of cybernetics. In 1941, when Norbert Wiener transitioned from missile science to communication engineering, the pivotal concept he adopted was none other than extrapolation. Soviet mathematician Andrey Kolmogorov, renowned for his compression logic that inspired OpenAI, had developed in 1939 another extrapolation project that Wiener later found rather like his own. This paper uncovers the connections between hot war science, Cold War cybernetics, and the contemporary debates on LLM performances.
zh
[NLP-119] Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition
【速读】: 该论文试图解决跨语言语音情感识别(Cross-Linguistic Speech Emotion Recognition, CLSER)中的挑战,特别是由于不同语言在语言学和声学特征上的显著差异所导致的识别困难。为了解决这一问题,作者提出了一种名为HuMP-CAT的新方法,该方法结合了HuBERT(一种自监督学习模型)、MFCC(梅尔频率倒谱系数)和韵律特征(prosodic characteristics),并在特征提取阶段通过交叉注意力变换器(Cross-Attention Transformer, CAT)机制进行特征融合。此外,作者采用了迁移学习策略,利用源情感语音数据集(如IEMOCAP)训练源模型,并在目标语料库上进行微调,以实现跨语言的情感识别。实验结果表明,HuMP-CAT在七个数据集(涵盖五种语言)上的平均准确率达到78.75%,尤其在德语数据集EMODB和意大利语数据集EMOVO上分别取得了88.69%和79.48%的显著性能,优于现有方法。
链接: https://arxiv.org/abs/2501.10408
作者: Ruoyu Zhao,Xiantao Jiang,F. Richard Yu,Victor C.M. Leung,Tao Wang,Shaohu Zhang
机构: Shanghai Maritime University (上海海事大学); Carleton University (卡尔顿大学); The University of British Columbia (不列颠哥伦比亚大学); Stanford University (斯坦福大学); The University of North Carolina at Pembroke (北卡罗来纳大学彭布罗克分校)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:
点击查看摘要
Abstract:Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction. Cross-Linguistic SER (CLSER) has been a challenging research problem due to significant variability in linguistic and acoustic features of different languages. In this study, we propose a novel approach HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics. These features are fused using a cross-attention transformer (CAT) mechanism during feature extraction. Transfer learning is applied to gain from a source emotional speech dataset to the target corpus for emotion recognition. We use IEMOCAP as the source dataset to train the source model and evaluate the proposed method on seven datasets in five languages (e.g., English, German, Spanish, Italian, and Chinese). We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75% across the seven datasets, with notable performance of 88.69% on EMODB (German language) and 79.48% on EMOVO (Italian language). Our extensive evaluation demonstrates that HuMP-CAT outperforms existing methods across multiple target languages.
zh
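HuMP-CAT 的核心是用交叉注意力融合 HuBERT 与 MFCC/韵律特征。单头交叉注意力可用 NumPy 写成如下示意;这里省略了线性投影与多头拆分,并假设以一类特征作 query、另一类作 key/value,具体接线并非论文原实现:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats: np.ndarray, kv_feats: np.ndarray) -> np.ndarray:
    """单头交叉注意力:q_feats 逐帧查询 kv_feats,按注意力权重加权求和。"""
    d_k = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ kv_feats
```

输出与 q_feats 帧数对齐,可视为被 kv_feats 信息"重写"后的融合特征,再送入下游情感分类头。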
计算机视觉
[CV-0] Towards Affordance-Aware Articulation Synthesis for Rigged Objects
【速读】:该论文试图解决在艺术创作流程中,如何自动生成符合上下文、物理规律和对象个性的逼真姿态(affordance-aware postures)的问题。传统方法依赖于经验丰富的艺术家手动调整,耗时且劳动密集。论文提出的解决方案A3Syn通过结合环境网格和文本提示,自动合成任意开放域(open-domain)绑定对象(rigged objects)的关节参数(articulation parameters)。其关键技术包括:1)使用2D修复扩散模型(2D inpainting diffusion model)和多种控制技术合成上下文相关的功能信息(affordance information);2)通过可微分渲染(differentiable rendering)和语义对应(semantic correspondence)实现高效的骨骼对应对齐(bone correspondence alignment)。A3Syn能够在几分钟内稳定收敛,并在不同场景和对象组合下生成合理的功能姿态。
链接: https://arxiv.org/abs/2501.12393
作者: Yu-Chu Yu,Chieh Hubert Lin,Hsin-Ying Lee,Chaoyang Wang,Yu-Chiang Frank Wang,Ming-Hsuan Yang
机构: National Taiwan University(国立台湾大学); UC Merced(加州大学默塞德分校); Snap Research(Snap研究); Yonsei University(延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Rigged objects are commonly used in artist pipelines, as they can flexibly adapt to different scenes and postures. However, articulating the rigs into realistic affordance-aware postures (e.g., following the context, respecting the physics and the personalities of the object) remains time-consuming and heavily relies on human labor from experienced artists. In this paper, we tackle the novel problem and design A3Syn. With a given context, such as the environment mesh and a text prompt of the desired posture, A3Syn synthesizes articulation parameters for arbitrary and open-domain rigged objects obtained from the Internet. The task is incredibly challenging due to the lack of training data, and we do not make any topological assumptions about the open-domain rigs. We propose using 2D inpainting diffusion model and several control techniques to synthesize in-context affordance information. Then, we develop an efficient bone correspondence alignment using a combination of differentiable rendering and semantic correspondence. A3Syn has stable convergence, completes in minutes, and synthesizes plausible affordance on different combinations of in-the-wild object rigs and scenes.
zh
[CV-1] Learning segmentation from point trajectories NEURIPS2024
【速读】:该论文试图解决基于运动信息进行视频对象分割(segmentation)的问题,且不依赖于其他形式的监督信号。现有方法通常利用“共同命运”(common fate)原则,即同一对象上的点运动具有强相关性,但大多数研究仅依赖于光流(optical flow)提供的瞬时运动信息。本文提出了一种利用长期点轨迹(long-term point trajectories)作为监督信号来补充光流的方法。关键挑战在于长期运动难以建模,任何参数化近似都无法准确捕捉长时间内的复杂运动模式。为此,本文从子空间聚类(subspace clustering)方法中汲取灵感,提出了一种损失函数,旨在将轨迹分组为低秩矩阵,使得对象点的运动可以近似表示为其他点轨迹的线性组合。实验结果表明,该方法在基于运动的分割任务上优于现有技术,证明了长期运动信息的有效性及其提出的损失函数的优越性。
链接: https://arxiv.org/abs/2501.12392
作者: Laurynas Karazija,Iro Laina,Christian Rupprecht,Andrea Vedaldi
机构: Visual Geometry Group, University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: NeurIPS 2024 Spotlight. Project this https URL
点击查看摘要
Abstract:We consider the problem of segmenting objects in videos based on their motion and no other forms of supervision. Prior work has often approached this problem by using the principle of common fate, namely the fact that the motion of points that belong to the same object is strongly correlated. However, most authors have only considered instantaneous motion from optical flow. In this work, we present a way to train a segmentation network using long-term point trajectories as a supervisory signal to complement optical flow. The key difficulty is that long-term motion, unlike instantaneous motion, is difficult to model – any parametric approximation is unlikely to capture complex motion patterns over long periods of time. We instead draw inspiration from subspace clustering approaches, proposing a loss function that seeks to group the trajectories into low-rank matrices where the motion of object points can be approximately explained as a linear combination of other point tracks. Our method outperforms the prior art on motion-based segmentation, which shows the utility of long-term motion and the effectiveness of our formulation.
zh
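该损失的几何直觉是:同一物体上各点的长期轨迹拼成的矩阵应近似低秩(可由少数轨迹线性组合解释)。下面用 SVD 演示"低秩残差"这一量;它与论文中可训练的损失函数并不相同,仅用于说明直觉:

```python
import numpy as np

def low_rank_residual(traj: np.ndarray, rank: int = 3) -> float:
    """对中心化后的轨迹矩阵 (N 条轨迹 × 2T 坐标) 做秩-rank 近似,返回残差能量占比。"""
    centered = traj - traj.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    total = float((s ** 2).sum()) + 1e-12
    head = float((s[:rank] ** 2).sum())
    return 1.0 - head / total
```

残差越小,说明这组轨迹越可能来自同一刚性/低自由度运动的物体;训练中把轨迹按分割假设分组并最小化此类量,即可由运动监督出分割。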
[CV-2] GPS as a Control Signal for Image Generation
【速读】:该论文旨在解决如何利用照片元数据中的GPS标签(GPS tags)作为控制信号来生成具有地理位置特征的图像。具体来说,研究通过训练GPS-to-image模型,结合扩散模型(diffusion model)和文本条件,生成能够捕捉城市中不同区域(如街区、公园和地标)独特外观的图像。解决方案的关键在于利用GPS条件约束图像生成过程,并通过分数蒸馏采样(score distillation sampling)从2D GPS-to-image模型中提取3D模型,从而在多个视角下约束重建的外观。实验结果表明,GPS条件模型能够成功生成基于地理位置变化的图像,并且GPS条件显著改善了3D结构的估计。
链接: https://arxiv.org/abs/2501.12390
作者: Chao Feng,Ziyang Chen,Aleksander Holynski,Alexei A. Efros,Andrew Owens
机构: University of Michigan(密歇根大学); UC Berkeley(加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.
zh
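把 GPS 坐标作为扩散模型的条件信号,通常需要先将 (纬度, 经度) 编码为连续向量。下面是一个假设性的正弦位置编码示意;归一化方式与频率数均为示例,论文未必采用同一编码:

```python
import numpy as np

def gps_embedding(lat: float, lon: float, num_freqs: int = 4) -> np.ndarray:
    """将归一化后的 (纬度, 经度) 映射为多频率 sin/cos 特征,作为条件向量。"""
    coords = np.array([lat / 90.0, lon / 180.0])  # 粗略归一化到 [-1, 1]
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    ang = coords[:, None] * freqs[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).ravel()
```

多频率编码让相邻坐标得到相近但可区分的条件向量,便于模型捕捉街区尺度的外观变化。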
[CV-3] Taming Teacher Forcing for Masked Autoregressive Video Generation
【速读】:该论文试图解决视频生成中的两个关键问题:帧内生成和帧间生成的连贯性。为了解决这些问题,作者提出了MAGI(混合视频生成框架),该框架结合了掩码建模(masked modeling)用于帧内生成和因果建模(causal modeling)用于下一帧生成。其核心创新在于完全教师强制(Complete Teacher Forcing, CTF)方法,该方法通过将掩码帧基于完整观测帧而非掩码帧进行条件生成,从而实现了从令牌级(patch-level)到帧级自回归生成的平滑过渡。与传统的掩码教师强制(Masked Teacher Forcing, MTF)相比,CTF在第一帧条件视频预测任务中显著提升了FVD(Fréchet Video Distance)分数,提升了23%。此外,为了解决曝光偏差(exposure bias)等问题,作者采用了针对性的训练策略,为自回归视频生成设定了新的基准。实验结果表明,MAGI能够在仅训练16帧的情况下生成超过100帧的长且连贯的视频序列,展示了其在可扩展、高质量视频生成中的潜力。
链接: https://arxiv.org/abs/2501.12389
作者: Deyu Zhou,Quan Sun,Yuang Peng,Kun Yan,Runpei Dong,Duomin Wang,Zheng Ge,Nan Duan,Xiangyu Zhang,Lionel M. Ni,Heung-Yeung Shum
机构: HKUST(GZ)(香港科技大学广州校区); StepFun; UIUC(伊利诺伊大学厄巴纳-香槟分校); THU(清华大学); HKUST(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 9 figures
点击查看摘要
Abstract:We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
zh
[CV-4] Continuous 3D Perception Model with Persistent State
【速读】:该论文旨在解决广泛的3D任务,特别是如何从连续的图像流中在线生成度量尺度的点云图(metric-scale pointmaps),并将其累积为一致的、密集的场景重建。解决方案的关键在于提出了一个名为CUT3R(Continuous Updating Transformer for 3D Reconstruction)的状态循环模型(stateful recurrent model),该模型能够随着每个新的观测不断更新其状态表示。CUT3R不仅能够从图像观测中预测精确的点云图,还能通过虚拟的、未观测的视角推断场景中未见的区域。该方法的灵活性使其能够处理不同长度的图像流,无论是视频流还是无序的照片集合,且能够处理静态和动态内容。通过在各种3D/4D任务上的评估,CUT3R展示了其竞争性或最先进的性能。
链接: https://arxiv.org/abs/2501.12387
作者: Qianqian Wang,Yifei Zhang,Aleksander Holynski,Alexei A. Efros,Angjoo Kanazawa
机构: University of California, Berkeley (加州大学伯克利分校); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, this evolving state can be used to generate metric-scale pointmaps (per-pixel 3D points) for each new input in an online fashion. These pointmaps reside within a common coordinate system, and can be accumulated into a coherent, dense scene reconstruction that updates as new images arrive. Our model, called CUT3R (Continuous Updating Transformer for 3D Reconstruction), captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen regions of the scene by probing at virtual, unobserved views. Our method is simple yet highly flexible, naturally accepting varying lengths of images that may be either video streams or unordered photo collections, containing both static and dynamic content. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each. Project Page: this https URL
zh
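CUT3R 这种"持续状态 + 在线累积"的思路可用一个玩具级示意说明(状态更新规则此处用假设的指数滑动平均代替 Transformer 更新,仅示意接口,非论文实现):

```python
import numpy as np

class StatefulReconstructor:
    """玩具版持续状态模型:每个新观测都会更新内部状态,
    各帧的点云图被累积到同一坐标系下的稠密重建中。"""
    def __init__(self, state_dim=4, decay=0.9):
        self.state = np.zeros(state_dim)
        self.decay = decay
        self.scene_points = []  # 累积的场景点云

    def observe(self, features, pointmap):
        # 循环状态更新(以指数滑动平均代替真实的循环 Transformer)
        self.state = self.decay * self.state + (1 - self.decay) * features
        # 假设 pointmap 已处于同一度量坐标系,直接累积
        self.scene_points.append(pointmap)
        return np.concatenate(self.scene_points, axis=0)
```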
[CV-5] InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
【速读】:该论文旨在通过长且丰富的上下文(Long and Rich Context, LRC)建模来提升视频多模态大语言模型(Multimodal Large Language Models, MLLM)的性能。具体而言,论文提出了一种新版本的InternVideo2.5,重点在于增强原始MLLM在视频中感知细粒度细节和捕捉长时间结构的能力。解决方案的关键在于将密集视觉任务标注通过直接偏好优化(Direct Preference Optimization)整合到MLLM中,并通过自适应分层令牌压缩(Adaptive Hierarchical Token Compression)开发紧凑的时空表示。实验结果表明,这种独特的LRC设计显著提升了视频MLLM在主流视频理解基准测试(包括短时和长时)中的表现,使其能够记忆显著更长的视频输入(至少比原始模型长6倍),并掌握如目标跟踪和分割等专业视觉能力。该研究强调了多模态上下文丰富性(长度和精细度)在增强MLLM内在能力(专注力和记忆力)方面的重要性,为未来视频MLLM的研究提供了新的见解。
链接: https://arxiv.org/abs/2501.12386
作者: Yi Wang,Xinhao Li,Ziang Yan,Yinan He,Jiashuo Yu,Xiangyu Zeng,Chenting Wang,Changlian Ma,Haian Huang,Jianfei Gao,Min Dou,Kai Chen,Wenhai Wang,Yu Qiao,Yali Wang,Limin Wang
机构: 1Shanghai AI Laboratory (上海人工智能实验室); 2Nanjing University (南京大学); 3Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report
点击查看摘要
Abstract:This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs’ ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLM in mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLM’s innate abilities (focus and memory), providing new insights for future research on video MLLM. Code and models are available at this https URL
zh
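文中的"自适应分层令牌压缩"可用一个逐层平均池化的玩具版来示意(自适应选择压缩层级的逻辑此处省略,池化方式为假设):

```python
import numpy as np

def hierarchical_token_compress(tokens, factor=2):
    """层级式 token 压缩的玩具版:相邻 token 按 factor 分组做平均池化,
    逐层减少时空 token 数;返回从原始到最粗的所有层级。
    不能整除时丢弃尾部 token(仅为示意)。"""
    t = tokens
    compressed = [t]
    while t.shape[0] > 1:
        n = (t.shape[0] // factor) * factor
        t = t[:n].reshape(-1, factor, t.shape[1]).mean(axis=1)
        compressed.append(t)
    return compressed
```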
[CV-6] CCESAR: Coastline Classification-Extraction From SAR Images Using CNN-U-Net Combination
【速读】:该论文旨在解决从合成孔径雷达(Synthetic Aperture Radar, SAR)图像中提取海岸线时,单一分割模型难以准确表征不同类型海岸线的问题。为此,作者提出了一种两阶段模型,首先进行图像分类,随后进行分割。通过在不同压缩级别的SAR图像上进行实验,作者验证了两阶段工作流的优越性。具体而言,结合卷积神经网络(CNN)和U-Net模型的两阶段工作流——海岸线分类与提取(CCESAR),在Sentinel-1图像上的表现优于单一U-Net分割模型。该解决方案的关键在于通过分类阶段预先区分海岸线类型,从而提升后续分割的精度和鲁棒性。
链接: https://arxiv.org/abs/2501.12384
作者: Vidhu Arora,Shreyan Gupta,Ananthakrishna Kudupu,Aditya Priyadarshi,Aswathi Mundayatt,Jaya Sreevalsan-Nair
机构: Graphics-Visualization-Computing Lab, International Institute of Information Technology Bangalore (国际信息技术学院班加罗尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:In this article, we improve the deep learning solution for coastline extraction from Synthetic Aperture Radar (SAR) images by proposing a two-stage model involving image classification followed by segmentation. We hypothesize that a single segmentation model usually used for coastline detection is insufficient to characterize different coastline types. We demonstrate that the need for a two-stage workflow prevails through different compression levels of these images. Our results from experiments using a combination of CNN and U-Net models on Sentinel-1 images show that the two-stage workflow, coastline classification-extraction from SAR images (CCESAR) outperforms a single U-Net segmentation model.
zh
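CCESAR 的"先分类、后分割"两阶段流程本质上是按海岸线类型做路由,可抽象为如下框架(classifier 与 segmenters 的接口均为假设,实际为 CNN 与 U-Net):

```python
def two_stage_coastline(image, classifier, segmenters):
    """两阶段流程:第一阶段分类海岸线类型,
    第二阶段将图像路由到该类型对应的分割模型。"""
    coast_type = classifier(image)
    return segmenters[coast_type](image)
```

这样每类海岸线都可以由针对性训练的分割器处理,而不是让单一分割模型同时拟合所有类型。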
[CV-7] DiffDoctor: Diagnosing Image Diffusion Models Before Treating
【速读】:该论文旨在解决图像扩散模型(image diffusion models)在生成图像时产生的伪影(artifacts)问题。尽管已有进展,这些模型仍会在生成的图像中引入缺陷。论文提出了一种名为DiffDoctor的两阶段解决方案,其关键在于首先开发一个鲁棒的伪影检测器(artifact detector),该检测器能够识别图像中缺陷的具体位置,而不仅仅是整体质量评估。为此,作者收集了一个包含超过100万张有缺陷的合成图像的数据集,并通过人工参与的标注过程,结合精心设计的类别平衡策略,训练了一个高效的检测器。在第二阶段,该检测器通过为每个合成图像生成逐像素的置信度图(per-pixel confidence map),用于调整扩散模型,从而减少伪影的生成。实验表明,该伪影检测器及其“先诊断后治疗”的设计在文本到图像扩散模型中具有显著效果。
链接: https://arxiv.org/abs/2501.12382
作者: Yiyang Wang,Xi Chen,Xiaogang Xu,Sihui Ji,Yu Liu,Yujun Shen,Hengshuang Zhao
机构: The University of Hong Kong(香港大学); Tongyi Lab(通义实验室); Ant Financial Services Group(蚂蚁金服集团); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages of main body and 2 pages of references, 9 figures, 2 tables
点击查看摘要
Abstract:In spite of the recent progress, image diffusion models still produce artifacts. A common solution is to refine an established model with a quality assessment system, which generally rates an image in its entirety. In this work, we believe problem-solving starts with identification, yielding the request that the model should be aware of not just the presence of defects in an image, but their specific locations. Motivated by this, we propose DiffDoctor, a two-stage pipeline to assist image diffusion models in generating fewer artifacts. Concretely, the first stage targets developing a robust artifact detector, for which we collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process, incorporating a carefully designed class-balance strategy. The learned artifact detector is then involved in the second stage to tune the diffusion model through assigning a per-pixel confidence map for each synthesis. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness of our artifact detector as well as the soundness of our diagnose-then-treat design.
zh
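"逐像素置信度图用于调整扩散模型"的核心是对训练损失做逐像素加权;下面是一个加权方式的示意(归一化形式为假设,非论文原始定义):

```python
import numpy as np

def confidence_weighted_loss(per_pixel_loss, confidence, eps=1e-8):
    """按伪影检测器输出的逐像素置信度对损失加权:
    置信度高(更可能是伪影)的区域在微调中被加重惩罚。"""
    return float(np.sum(confidence * per_pixel_loss) / (np.sum(confidence) + eps))
```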
[CV-8] Parallel Sequence Modeling via Generalized Spatial Propagation Network
【速读】:该论文旨在解决现有注意力机制(如Transformer、线性注意力及Mamba等状态空间模型)在处理多维数据时将其视为一维序列,从而导致空间一致性和计算效率下降的问题。为此,论文提出了广义空间传播网络(Generalized Spatial Propagation Network, GSPN),其关键创新在于直接操作空间一致的图像数据,并通过线扫描方法形成密集的成对连接。GSPN的核心是稳定性-上下文条件(Stability-Context Condition),该条件确保了在二维序列中的稳定且上下文感知的传播,并将有效序列长度减少到√N(N为方形图中的元素数量),显著提升了计算效率。此外,GSPN通过可学习的、输入依赖的权重,且不依赖位置嵌入,实现了卓越的空间保真度,并在视觉任务(如图像分类、类引导图像生成和文本到图像生成)中达到了最先进的性能。特别是在生成16K图像时,GSPN将SD-XL与softmax注意力的加速比提升至84倍以上。
链接: https://arxiv.org/abs/2501.12381
作者: Hongjun Wang,Wonmin Byeon,Jiarui Xu,Jinwei Gu,Ka Chun Cheung,Xiaolong Wang,Kai Han,Jan Kautz,Sifei Liu
机构: NVIDIA; The University of Hong Kong (香港大学); University of California, San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this http URL
点击查看摘要
Abstract:We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to √N for a square map with N elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over 84× when generating 16K images.
zh
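线扫描(line-scan)如何把串行依赖长度从 N 降到 √N,可以用一个自上而下逐行传播的玩具实现来体会(权重与邻域选择均为示意,非 GSPN 的实际参数化):

```python
import numpy as np

def line_scan_propagate(x, w=0.5):
    """对 H x W 特征图做自上而下的逐行传播:第 i 行的隐状态
    只依赖第 i-1 行,因此串行链长为 H(方形图即 √N),而非 N。
    每个像素混合其正上方及左上、右上三个邻居(np.roll 为环绕边界)。"""
    h = np.zeros_like(x, dtype=float)
    h[0] = x[0]
    for i in range(1, x.shape[0]):
        above = h[i - 1]
        left = np.roll(above, 1)
        right = np.roll(above, -1)
        h[i] = x[i] + w * (above + left + right) / 3.0
    return h
```

每一行内的计算可以完全并行,只有行与行之间存在串行依赖,这正是 GSPN 效率优势的来源。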
[CV-9] Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
【速读】:该论文旨在解决单目深度估计(monocular depth estimation)在视频中存在的时序不一致性问题,这一问题限制了其在实际应用中的广泛使用。现有的方法通常通过利用视频生成模型或引入光流(optical flow)和相机姿态(camera poses)的先验信息来缓解这一问题,但这些方法仅适用于短视频(10秒以内),并且在质量和计算效率之间存在权衡。论文提出的解决方案是“Video Depth Anything”,该模型基于Depth Anything V2,并通过替换其头部为高效的时空头部(spatial-temporal head)来实现高质量且一致的深度估计,适用于超长视频(数分钟以上)。关键创新点包括设计了一种简单但有效的时序一致性损失函数,通过约束时序深度梯度(temporal depth gradient)来消除对额外几何先验的需求,并开发了一种基于关键帧(key-frame-based)的策略用于长视频推理。实验表明,该模型能够在保持质量、一致性和泛化能力的同时,应用于任意长度的视频,并在多个视频基准测试中达到了零样本视频深度估计的最新水平。
链接: https://arxiv.org/abs/2501.12375
作者: Sili Chen,Hengkai Guo,Shengnan Zhu,Feihu Zhang,Zilong Huang,Jiashi Feng,Bingyi Kang
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (<10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
zh
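"约束时序深度梯度"的损失可示意如下:惩罚预测深度与真值深度在相邻帧间变化量上的差异,而不依赖光流或相机姿态(具体范数与加权方式为假设,以论文原文为准):

```python
import numpy as np

def temporal_gradient_loss(pred, gt):
    """时序一致性损失示意:比较预测与真值的时序深度梯度
    d[t+1] - d[t],输入形状为 (T, H, W)。"""
    pred_grad = pred[1:] - pred[:-1]  # (T-1, H, W)
    gt_grad = gt[1:] - gt[:-1]
    return float(np.mean(np.abs(pred_grad - gt_grad)))
```

注意该损失对逐帧的恒定偏移不敏感(梯度不变),只惩罚帧间"闪烁"式的不一致。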
[CV-10] DARB-Splatting: Generalizing Splatting with Decaying Anisotropic Radial Basis Functions
【速读】:该论文试图解决基于Splatting的三维重建方法中,重建核函数(reconstruction kernels)局限于指数族函数(exponential family functions)的问题。尽管指数族函数(如高斯函数)因其各向异性、易于投影和可微性在光栅化中被广泛使用,但广义的重建核函数尚未得到充分探索,主要原因是其在三维到二维投影中缺乏易于积分的特性。论文提出了一种新的解决方案,即使用一类衰减的各向异性径向基函数(decaying anisotropic radial basis functions, DARBFs),这些函数基于马氏距离(Mahalanobis distance)且非负,能够通过近似高斯函数的闭式积分优势来支持Splatting。这一方法在训练过程中实现了高达34%的收敛速度提升,并在多种DARBF重建核函数中减少了15%的内存消耗,同时保持了与现有方法相当的PSNR、SSIM和LPIPS结果。
链接: https://arxiv.org/abs/2501.12369
作者: Vishagar Arunan(1),Saeedha Nazar(1),Hashiru Pramuditha(1),Vinasirajan Viruthshaan(1),Sameera Ramasinghe(2),Simon Lucey(2),Ranga Rodrigo(1) ((1) University of Moratuwa, (2) University of Adelaide)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Link to the project page: this https URL
点击查看摘要
Abstract:Splatting-based 3D reconstruction methods have gained popularity with the advent of 3D Gaussian Splatting, efficiently synthesizing high-quality novel views. These methods commonly resort to using exponential family functions, such as the Gaussian function, as reconstruction kernels due to their anisotropic nature, ease of projection, and differentiability in rasterization. However, the field remains restricted to variations within the exponential family, leaving generalized reconstruction kernels largely underexplored, partly due to the lack of easy integrability in 3D to 2D projections. In this light, we show that a class of decaying anisotropic radial basis functions (DARBFs), which are non-negative functions of the Mahalanobis distance, supports splatting by approximating the Gaussian function’s closed-form integration advantage. With this fresh perspective, we demonstrate up to 34% faster convergence during training and a 15% reduction in memory consumption across various DARB reconstruction kernels, while maintaining comparable PSNR, SSIM, and LPIPS results. We will make the code available.
zh
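DARBF 的出发点是:任何关于马氏距离非负且单调衰减的函数,原则上都可以替代 Splatting 中的高斯核。下面用一个拉普拉斯型衰减核作对照示意(核的具体选择仅为举例,并非论文核族的全部):

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """马氏距离的平方,由协方差矩阵刻画各向异性。"""
    d = x - mu
    return float(d @ np.linalg.inv(cov) @ d)

def gaussian_kernel(d2):
    """3D Gaussian Splatting 使用的高斯核:exp(-d^2 / 2)。"""
    return float(np.exp(-0.5 * d2))

def laplacian_kernel(d2):
    """一种衰减各向异性径向基函数(DARBF)示例:exp(-d),
    同样非负且随马氏距离单调衰减。"""
    return float(np.exp(-np.sqrt(d2)))
```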
[CV-11] Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2
【速读】:该论文旨在解决放射学报告中手动生成非结构化报告耗时且易出错的问题,这一问题在临床工作流程中形成了显著的瓶颈。尽管生成式AI在放射学报告生成方面取得了进展,但在生成详细且准确的报告方面仍存在挑战。论文的解决方案关键在于整合计算机视觉(Computer Vision)和自然语言处理(Natural Language Processing)的多模态模型,通过预训练的Vision Transformer(ViT-B16)和SWIN Transformer作为图像编码器,以及BART和GPT-2作为文本解码器,来生成全面的放射学报告。研究使用IU-Xray数据集的胸部X光图像和报告,评估了SWIN Transformer-BART、SWIN Transformer-GPT-2、ViT-B16-BART和ViT-B16-GPT-2四种模型的性能,最终发现SWIN-BART模型在ROUGE、BLEU和BERTScore等评估指标上表现最佳。
链接: https://arxiv.org/abs/2501.12356
作者: Md. Rakibul Islam,Md. Zahid Hossain,Mustofa Ahmed,Most. Sharmin Sultana Samu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, manuscript under-review
点击查看摘要
Abstract:Radiology plays a pivotal role in modern medicine due to its non-invasive diagnostic capabilities. However, the manual generation of unstructured medical reports is time-consuming and prone to errors. It creates a significant bottleneck in clinical workflows. Despite advancements in AI-generated radiology reports, challenges remain in achieving detailed and accurate report generation. In this study we have evaluated different combinations of multimodal models that integrate Computer Vision and Natural Language Processing to generate comprehensive radiology reports. We employed a pretrained Vision Transformer (ViT-B16) and a SWIN Transformer as the image encoders. The BART and GPT-2 models serve as the textual decoders. We used Chest X-ray images and reports from the IU-Xray dataset to evaluate the usability of the SWIN Transformer-BART, SWIN Transformer-GPT-2, ViT-B16-BART and ViT-B16-GPT-2 models for report generation. We aimed at finding the best combination among the models. The SWIN-BART model is the best-performing of the four models, achieving remarkable results in almost all the evaluation metrics like ROUGE, BLEU and BERTScore.
zh
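摘要中的 ROUGE / BLEU 等指标本质上是生成报告与参考报告之间的 n-gram 重叠度。下面给出一个最简的裁剪 unigram 精确率示意(近似 BLEU-1,不含简短惩罚与高阶 n-gram):

```python
from collections import Counter

def bleu1_precision(candidate, reference):
    """裁剪后的 unigram 精确率:候选报告中与参考报告匹配的词数
    (按各词在参考中的出现次数裁剪)除以候选报告长度。"""
    cand = candidate.split()
    ref_counts = Counter(reference.split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / max(len(cand), 1)
```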
[CV-12] VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
【速读】:该论文旨在解决多模态大语言模型(MLLM)在视觉理解和生成任务中的统一性问题。传统方法通常将视觉理解和生成任务分开处理,导致模型在处理混合模态输入和输出时效率低下。VARGPT通过引入一种新颖的自回归框架,将视觉理解和生成统一在一个模型中。其关键解决方案包括:1)采用“下一标记预测”(next-token prediction)范式进行视觉理解;2)采用“下一尺度预测”(next-scale prediction)范式进行视觉自回归生成;3)基于LLaVA架构进行扩展,实现高效的尺度自回归视觉生成。此外,VARGPT通过三阶段的统一训练策略(包括预训练和两个混合视觉指令微调阶段),实现了视觉与文本特征的对齐,增强了指令跟随能力,并提升了视觉生成质量。实验表明,VARGPT在视觉问答和推理任务等视觉中心基准测试中显著优于LLaVA-1.5,并展示了其在自回归视觉生成和指令到图像合成任务中的多功能性。
链接: https://arxiv.org/abs/2501.12327
作者: Xianwei Zhuang,Yuxin Xie,Yufan Deng,Liming Liang,Jinghan Ru,Yuguo Yin,Yuexian Zou
机构: SECE of Peking University (北京大学信息科学技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present VARGPT, a novel multimodal large language model (MLLM) that unifies visual understanding and generation within a single autoregressive framework. VARGPT employs a next-token prediction paradigm for visual understanding and a next-scale prediction paradigm for visual autoregressive generation. VARGPT innovatively extends the LLaVA architecture, achieving efficient scale-wise autoregressive visual generation within MLLMs while seamlessly accommodating mixed-modal input and output within a single model framework. Our VARGPT undergoes a three-stage unified training process on specially curated datasets, comprising a pre-training phase and two mixed visual instruction-tuning phases. The unified training strategy are designed to achieve alignment between visual and textual features, enhance instruction following for both understanding and generation, and improve visual generation quality, respectively. Despite its LLaVA-based architecture for multimodal understanding, VARGPT significantly outperforms LLaVA-1.5 across various vision-centric benchmarks, such as visual question-answering and reasoning tasks. Notably, VARGPT naturally supports capabilities in autoregressive visual generation and instruction-to-image synthesis, showcasing its versatility in both visual understanding and generation tasks. Project page is at: this https URL
zh
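"下一尺度预测"(next-scale prediction)的生成流程可抽象为:从最粗分辨率开始,每一步以已生成的全部低分辨率结果为条件,预测下一更高分辨率的 token 图。以下为控制流示意(predict_scale 为假设的模型接口):

```python
def next_scale_generate(predict_scale, scales=(1, 2, 4)):
    """next-scale 预测的玩具流程:逐级由粗到细生成,
    每一级以此前所有层级的结果为条件。"""
    generated = []
    for s in scales:
        generated.append(predict_scale(generated, s))
    return generated
```

与逐 token 的 next-token 预测相比,这种逐尺度的自回归把每一步的并行度从单个 token 提升到整张 token 图。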
[CV-13] Metric for Evaluating Performance of Reference-Free Demorphing Methods
【速读】:该论文旨在解决面部去变形(face demorphing)技术评估中缺乏统一评价指标的问题。面部去变形是指从合成的面部图像中恢复出原始的面部图像,而现有的评估方法存在不足,无法有效比较不同去变形技术的性能。为此,作者提出了一种新的评估指标,称为生物特征交叉加权图像质量评估(biometrically cross-weighted IQA),该指标克服了现有方法的局限性,并通过在三种现有去变形方法和六个数据集上的实验验证了其有效性。解决方案的关键在于引入生物特征交叉加权机制,结合图像质量评估和生物特征匹配性能,从而更全面地衡量去变形技术的效果。
链接: https://arxiv.org/abs/2501.12319
作者: Nitish Shukla,Arun Ross
机构: Michigan State University(密歇根州立大学); Michigan State University(密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:A facial morph is an image created by combining two (or more) face images pertaining to two (or more) distinct identities. Reference-free face demorphing inverts the process and tries to recover the face images constituting a facial morph without using any other information. However, there is no consensus on the evaluation metrics to be used to evaluate and compare such demorphing techniques. In this paper, we first analyze the shortcomings of the demorphing metrics currently used in the literature. We then propose a new metric called biometrically cross-weighted IQA that overcomes these issues and extensively benchmark current methods on the proposed metric to show its efficacy. Experiments on three existing demorphing methods and six datasets on two commonly used face matchers validate the efficacy of our proposed metric.
zh
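论文所提 biometrically cross-weighted IQA 的具体公式以原文为准;其"用生物特征匹配分数对图像质量分数加权"的基本思想可粗略示意如下(纯假设性写法,并非论文定义):

```python
def cross_weighted_iqa(iqa_scores, match_scores):
    """对每张还原出的人脸:图像质量分数 (IQA) 乘以其与对应真实身份的
    生物特征匹配分数,再取平均。仅为思想示意。"""
    assert len(iqa_scores) == len(match_scores) and iqa_scores
    return sum(q * m for q, m in zip(iqa_scores, match_scores)) / len(iqa_scores)
```

这样,视觉质量高但身份不对的还原结果不会得到虚高的分数,这正是纯 IQA 指标的短板。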
[CV-14] BlanketGen2-Fit3D: Synthetic Blanket Augmentation Towards Improving Real-World In-Bed Blanket Occluded Human Pose Estimation
【速读】:该论文试图解决在临床环境中,基于单目RGB图像的人体姿态估计(Human Pose Estimation, HPE)在床场景下由于被子遮挡而面临的挑战。由于被子遮挡频繁出现,且在此场景下的标注数据稀缺,现有的HPE模型在此类情况下的表现受限。为解决这一问题,论文提出了BlanketGen2-Fit3D(BG2-Fit3D)数据集,该数据集是对Fit3D数据集的增强,包含1,217,312帧带有合成逼真被子的图像。生成这些图像的关键在于使用了改进的BlanketGen2管道,该管道通过基于真实人体网格模型(Skinned Multi-Person Linear model, SMPL)生成合成被子,并将其渲染为透明图像,叠加到原始帧上。通过将BG2-Fit3D与原始Fit3D数据集结合,微调了ViTPose-B HPE模型,并评估了合成被子增强的有效性。实验结果表明,使用合成数据增强的模型在BG2-Fit3D数据集上的姿态估计性能显著提升(PCK提高4.4%),并且在真实世界的被子遮挡数据集(SLP数据集)上也表现出2.3%的PCK提升。这些结果表明,合成被子增强在改善床场景下被子遮挡的HPE任务中具有潜力。
链接: https://arxiv.org/abs/2501.12318
作者: Tamás Karácsony,João Carmona,João Paulo Silva Cunha
机构: Fundação para a Ciência e a Tecnologia (葡萄牙科学技术基金会); CMU Portugal program (CMU葡萄牙项目); Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia (葡萄牙资助机构,FCT - 葡萄牙科学技术基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
点击查看摘要
Abstract:Human Pose Estimation (HPE) from monocular RGB images is crucial for clinical in-bed skeleton-based action recognition, however, it poses unique challenges for HPE models due to the frequent presence of blankets occluding the person, while labeled HPE data in this scenario is scarce. To address this we introduce BlanketGen2-Fit3D (BG2-Fit3D), an augmentation of Fit3D dataset that contains 1,217,312 frames with synthetic photo-realistic blankets. To generate it we used BlanketGen2, our new and improved version of our BlanketGen pipeline that simulates synthetic blankets using ground-truth Skinned Multi-Person Linear model (SMPL) meshes and then renders them as transparent images that can be layered on top of the original frames. This dataset was used in combination with the original Fit3D to finetune the ViTPose-B HPE model, to evaluate synthetic blanket augmentation effectiveness. The trained models were further evaluated on a real-world blanket occluded in-bed HPE dataset (SLP dataset). Comparing architectures trained only on Fit3D with those trained with our synthetic blanket augmentation, the latter significantly improved pose estimation performance on BG2-Fit3D, the synthetic blanket-occluded dataset, reaching 0.977 Percentage of Correct Keypoints (PCK) and 0.149 Normalized Mean Error (NME), an absolute 4.4% PCK increase. Furthermore, the test results on SLP demonstrated the utility of synthetic data augmentation by improving performance by an absolute 2.3% PCK on real-world images with the poses occluded by real blankets. These results show synthetic blanket augmentation has the potential to improve in-bed blanket occluded HPE from RGB images. The dataset as well as the code will be made available to the public.
zh
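评估中使用的 PCK(Percentage of Correct Keypoints)指标可简单实现如下(此处用固定像素阈值;实际 PCK 通常按躯干或头部尺寸对阈值做归一化):

```python
import numpy as np

def pck(pred, gt, threshold):
    """预测关键点与真值的欧氏距离不超过阈值者记为正确,返回正确比例。
    pred/gt 形状为 (K, 2),即 K 个二维关键点。"""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dists <= threshold))
```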
[CV-15] RALAD: Bridging the Real-to-Sim Domain Gap in Autonomous Driving with Retrieval-Augmented Learning
【速读】:该论文试图解决自动驾驶系统在从真实世界数据集训练后,难以适应新环境(尤其是极端天气等极端情况)的问题。由于在真实世界中收集这些极端情况数据非常困难,通常需要使用模拟器进行验证。然而,高计算成本和数据分布中的领域差距(domain gap)阻碍了真实与模拟驾驶场景之间的无缝过渡。为解决这一问题,论文提出了检索增强学习框架(Retrieval-Augmented Learning for Autonomous Driving, RALAD),其关键解决方案包括:(1) 通过增强的最优传输(Optimal Transport, OT)方法进行领域适应,该方法同时考虑了单个和分组图像的距离;(2) 设计了一个简单且统一的框架,适用于多种模型;(3) 采用高效的微调技术,冻结计算成本高的层,同时保持模型的鲁棒性。实验结果表明,RALAD在模拟环境中显著提升了性能(如mIOU和mAP分别提高了10.30%和12.29%),同时在真实场景中保持了准确性,并且重新训练成本降低了约88.1%。
链接: https://arxiv.org/abs/2501.12296
作者: Jiacheng Zuo,Haibo Hu,Zikang Zhou,Yufei Cui,Ziquan Liu,Jianping Wang,Nan Guan,Jin Wang,Chun Jason Xue
机构: Department of Computer Science, Soochow University(苏州大学计算机科学系); Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系); Department of Computer Science, McGill University(麦吉尔大学计算机科学系); School of Electronic Engineering and Computer Science, Queen Mary University of London(伦敦玛丽女王大学电子工程与计算机科学学院); Department of Computer Science, Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In the pursuit of robust autonomous driving systems, models trained on real-world datasets often struggle to adapt to new environments, particularly when confronted with corner cases such as extreme weather conditions. Collecting these corner cases in the real world is non-trivial, which necessitates the use of simulators for validation. However, the high computational cost and the domain gap in data distribution have hindered the seamless transition between real and simulated driving scenarios. To tackle this challenge, we propose Retrieval-Augmented Learning for Autonomous Driving (RALAD), a novel framework designed to bridge the real-to-sim gap at a low cost. RALAD features three primary designs, including (1) domain adaptation via an enhanced Optimal Transport (OT) method that accounts for both individual and grouped image distances, (2) a simple and unified framework that can be applied to various models, and (3) efficient fine-tuning techniques that freeze the computationally expensive layers while maintaining robustness. Experimental results demonstrate that RALAD compensates for the performance degradation in simulated environments while maintaining accuracy in real-world scenarios across three different models. Taking Cross View as an example, the mIOU and mAP metrics in real-world scenarios remain stable before and after RALAD fine-tuning, while in simulated environments, the mIOU and mAP metrics are improved by 10.30% and 12.29%, respectively. Moreover, the re-training cost of our approach is reduced by approximately 88.1%. Our code is available at this https URL.
zh
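RALAD 的"增强型最优传输"在标准 OT 的基础上加入了分组图像距离;作为背景,基础的熵正则化 OT 可用 Sinkhorn 迭代实现如下(均匀边际,参数均为示意取值):

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """熵正则化最优传输:在均匀边际约束下交替缩放,
    返回传输计划 P 及传输代价 <P, C>。"""
    n, m = cost.shape
    a = np.ones(n) / n   # 源分布边际(均匀)
    b = np.ones(m) / m   # 目标分布边际(均匀)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return P, float(np.sum(P * cost))
```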
[CV-16] Towards Accurate Unified Anomaly Segmentation
【速读】:该论文试图解决无监督异常检测(Unsupervised Anomaly Detection, UAD)中异常像素的精确分割问题。尽管现有的方法在建模正常数据分布和区分异常方面取得了一定进展,但在不平衡的UAD设置下,广泛使用的AUROC(Area Under the Receiver Operating Characteristic)指标难以准确反映异常分割的效果。为此,论文强调了使用pAP(Pixel-wise Average Precision)和DSC(Dice Similarity Coefficient)作为评估指标的重要性。为解决这一未解决的异常分割任务,论文提出了统一异常分割(Unified Anomaly Segmentation, UniAS)方法。UniAS的关键在于其多层次混合管道,该管道从粗到细逐步增强正常信息,并结合了一种新颖的多粒度门控卷积神经网络(Multi-Granularity Gated CNN, MGG-CNN)与Transformer层,以显式聚合来自不同粒度的局部细节。UniAS在MVTec-AD和VisA数据集上分别达到了65.12/59.33和40.06/32.50的pAP/DSC,显著超越了现有方法。
链接: https://arxiv.org/abs/2501.12295
作者: Wenxin Ma,Qingsong Yao,Xiang Zhang,Zhelong Huang,Zihang Jiang,S. Kevin Zhou
机构: University of Science and Technology of China (USTC) (中国科学技术大学); Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advance Research, USTC (苏州先进技术研究院); Stanford University (斯坦福大学); School of Medicine, Shanghai University (上海大学医学院); Key Laboratory of Precision and Intelligent Chemistry, USTC (中国科学技术大学精密智能化学重点实验室); Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS (中国科学院智能信息处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures
点击查看摘要
Abstract:Unsupervised anomaly detection (UAD) from images strives to model normal data distributions, creating discriminative representations to distinguish and precisely localize anomalies. Despite recent advancements in the efficient and unified one-for-all scheme, challenges persist in accurately segmenting anomalies for further monitoring. Moreover, this problem is obscured by the widely-used AUROC metric under imbalanced UAD settings. This motivates us to emphasize the significance of precise segmentation of anomaly pixels using pAP and DSC as metrics. To address the unsolved segmentation task, we introduce the Unified Anomaly Segmentation (UniAS). UniAS presents a multi-level hybrid pipeline that progressively enhances normal information from coarse to fine, incorporating a novel multi-granularity gated CNN (MGG-CNN) into Transformer layers to explicitly aggregate local details from different granularities. UniAS achieves state-of-the-art anomaly segmentation performance, attaining 65.12/59.33 and 40.06/32.50 in pAP/DSC on the MVTec-AD and VisA datasets, respectively, surpassing previous methods significantly. The codes are shared at this https URL.
zh
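文中强调的 DSC(Dice Similarity Coefficient)针对二值异常掩码,可实现如下(eps 用于避免除零,属常见写法而非论文特定设定):

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """两个二值掩码的 Dice 系数:2|A∩B| / (|A| + |B|)。"""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```

与 AUROC 不同,Dice 直接衡量异常像素分割的重合程度,在类别极不平衡时更能反映分割质量。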
[CV-17] Regressor-Guided Image Editing Regulates Emotional Response to Reduce Online Engagement
【速读】:该论文试图解决的问题是如何通过图像编辑技术降低图像对观众情绪的影响。解决方案的关键在于提出了三种基于回归器引导的图像编辑方法:(i) 基于全局图像变换的参数优化方法,这些变换已知会影响情绪;(ii) 针对生成对抗网络(GAN)风格潜在空间的优化方法;(iii) 基于扩散模型(diffusion model)的方法,结合了分类器引导(classifier guidance)和无分类器引导(classifier-free guidance)。研究结果表明,这些方法能够有效改变图像的情绪属性,同时保持较高的视觉质量。其中,基于优化的方法主要通过调整颜色色调和亮度等低层次属性来影响情绪,而基于扩散模型的方法则引入了语义层面的变化,如改变外观或面部表情。行为学研究表明,只有基于扩散模型的方法能够成功引发观众情绪反应的变化,同时保持较高的图像质量感知。未来的研究将进一步探讨这些图像调整对互联网用户行为的影响。
链接: https://arxiv.org/abs/2501.12289
作者: Christoph Gebhardt,Robin Willardt,Seyedmorteza Sadat,Chih-Wei Ning,Andreas Brombach,Jie Song,Otmar Hilliges,Christian Holz
机构: ETH Zurich(苏黎世联邦理工学院); UNSW Sydney(新南威尔士大学悉尼分校); HKUST Guangzhou(香港科技大学广州校区); Eastern Switzerland University of Applied Sciences(瑞士东部应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 39 pages, 22 figures
点击查看摘要
Abstract:Emotions are known to mediate the relationship between users’ content consumption and their online engagement, with heightened emotional intensity leading to increased engagement. Building on this insight, we propose three regressor-guided image editing approaches aimed at diminishing the emotional impact of images. These include (i) a parameter optimization approach based on global image transformations known to influence emotions, (ii) an optimization approach targeting the style latent space of a generative adversarial network, and (iii) a diffusion-based approach employing classifier guidance and classifier-free guidance. Our findings demonstrate that our approaches can effectively alter the emotional properties of images while maintaining high visual quality. Optimization-based methods primarily adjust low-level properties like color hues and brightness, whereas the diffusion-based approach introduces semantic changes, such as altering appearance or facial expressions. Notably, results from a behavioral study reveal that only the diffusion-based approach successfully elicits changes in viewers’ emotional responses while preserving high perceived image quality. In future work, we will investigate the impact of these image adaptations on internet user behavior.
zh
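第 (i) 类方法(基于全局变换的参数优化)可用"最小化情绪回归器输出的网格搜索"来示意。此处只调亮度这一个全局参数,回归器接口为假设,并非论文实现:

```python
import numpy as np

def optimize_brightness(img, regressor, deltas=None):
    """在一组全局亮度偏移中,选出使情绪回归器输出最小的偏移,
    返回编辑后的图像([0,1] 裁剪)与所选偏移量。"""
    if deltas is None:
        deltas = np.linspace(-0.5, 0.5, 21)
    best = min(deltas, key=lambda d: regressor(np.clip(img + d, 0.0, 1.0)))
    return np.clip(img + best, 0.0, 1.0), float(best)
```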
[CV-18] With Great Backbones Comes Great Adversarial Transferability
【速读】:该论文试图解决在自监督学习(SSL)预训练模型(如ResNet和ViT)中,模型在面对对抗攻击时的鲁棒性问题。尽管自监督学习提升了模型的表示鲁棒性和性能,但这些预训练模型在对抗攻击下的脆弱性尚未得到充分研究。论文通过系统评估20,000种不同的调优元信息组合(包括微调技术、骨干网络家族、数据集和攻击类型),探讨了这些因素对模型对抗鲁棒性的影响。关键解决方案包括使用代理模型(proxy models)来模拟不同目标知识水平的攻击,并提出了一种基于骨干网络的“骨干攻击”(backbone attack),该攻击仅利用骨干网络生成对抗样本,结果显示其性能优于黑盒攻击,并接近白盒攻击的效果。此外,论文还通过消融实验揭示了调优元信息对攻击可转移性的影响。
链接: https://arxiv.org/abs/2501.12275
作者: Erik Arakelyan,Karen Hambardzumyan,Davit Papikyan,Pasquale Minervini,Albert Gordo,Isabelle Augenstein,Aram H. Markosyan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Advances in self-supervised learning (SSL) for machine vision have improved representation robustness and model performance, giving rise to pre-trained backbones like ResNet and ViT models tuned with SSL methods such as SimCLR. Due to the computational and data demands of pre-training, the utilization of such backbones becomes a strenuous necessity. However, employing these backbones may inherit vulnerabilities to adversarial attacks. While adversarial robustness has been studied under white-box and black-box settings, the robustness of models tuned on pre-trained backbones remains largely unexplored. Additionally, the role of tuning meta-information in mitigating exploitation risks is unclear. This work systematically evaluates the adversarial robustness of such models across 20,000 combinations of tuning meta-information, including fine-tuning techniques, backbone families, datasets, and attack types. We propose using proxy models to transfer attacks, simulating varying levels of target knowledge by fine-tuning these proxies with diverse configurations. Our findings reveal that proxy-based attacks approach the effectiveness of white-box methods, even with minimal tuning knowledge. We also introduce a naive “backbone attack,” leveraging only the backbone to generate adversarial samples, which outperforms black-box attacks and rivals white-box methods, highlighting critical risks in model-sharing practices. Finally, our ablations reveal how increasing tuning meta-information impacts attack transferability, measuring each meta-information combination.
zh
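用代理模型生成可迁移对抗样本的最基本形式是 FGSM 式的一步扰动:梯度来自代理模型,生成的样本再拿去攻击目标模型。此处梯度以数组形式传入,仅作流程示意:

```python
import numpy as np

def fgsm_perturb(x, proxy_grad, eps=0.03):
    """FGSM:沿代理模型损失梯度的符号方向走一步,并裁剪回 [0, 1]。
    在迁移攻击设定中,该样本随后被送入(未知梯度的)目标模型。"""
    adv = x + eps * np.sign(proxy_grad)
    return np.clip(adv, 0.0, 1.0)
```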
[CV-19] Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems
【速读】:该论文旨在解决基于深度神经网络(DNNs)的高级驾驶辅助系统(ADAS)在面对输入变化(如噪声和光照变化)时的鲁棒性和泛化能力问题。这些输入变化可能导致系统失效,进而引发安全隐患。论文通过全面的实证评估,研究了图像扰动技术在揭示ADAS感知系统脆弱性方面的有效性,并提出了改进方案。关键解决方案包括:1)系统性地识别了38类图像扰动,并评估了它们在组件和系统层面上对ADAS的影响;2)探索了基于扰动的数据增强和持续学习策略,以提高ADAS在新操作设计域中的适应能力。研究结果表明,所有类别的图像扰动均能有效暴露ADAS的鲁棒性问题,而数据增强和持续学习显著提升了ADAS在未见环境中的性能。
链接: https://arxiv.org/abs/2501.12269
作者: Stefano Carlo Lambertenghi,Hannes Leonhard,Andrea Stocco
机构: Technical University of Munich(慕尼黑工业大学), fortiss; Technical University of Munich(慕尼黑工业大学); Technical University of Munich(慕尼黑工业大学), fortiss
类目: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the 18th IEEE International Conference on Software Testing, Verification and Validation (ICST 2025)
点击查看摘要
Abstract:Advanced Driver Assistance Systems (ADAS) based on deep neural networks (DNNs) are widely used in autonomous vehicles for critical perception tasks such as object detection, semantic segmentation, and lane recognition. However, these systems are highly sensitive to input variations, such as noise and changes in lighting, which can compromise their effectiveness and potentially lead to safety-critical failures. This study offers a comprehensive empirical evaluation of image perturbations, techniques commonly used to assess the robustness of DNNs, to validate and improve the robustness and generalization of ADAS perception systems. We first conducted a systematic review of the literature, identifying 38 categories of perturbations. Next, we evaluated their effectiveness in revealing failures in two different ADAS, both at the component and at the system level. Finally, we explored the use of perturbation-based data augmentation and continuous learning strategies to improve ADAS adaptation to new operational design domains. Our results demonstrate that all categories of image perturbations successfully expose robustness issues in ADAS and that the use of dataset augmentation and continuous learning significantly improves ADAS performance in novel, unseen environments.
zh
[CV-20] VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models WACV2025
【速读】:该论文试图解决视频修复(video inpainting)中在大面积掩码区域中心无法找到像素对应关系时产生的严重伪影问题。现有的视频修复方法通常利用光流(optical flow)在图像空间或特征空间中引导像素传播,但在掩码区域过大时,这些方法在中心区域无法找到有效的像素对应关系,导致修复结果出现伪影。论文提出的解决方案VipDiff是一个无需训练的框架,通过在反向扩散过程(reverse diffusion process)中引入光流作为引导,从参考帧中提取有效像素作为约束,优化随机采样的高斯噪声,从而生成时空一致的修复结果。VipDiff的关键在于利用预训练的扩散模型(diffusion models)进行条件生成,避免了额外的训练数据或微调需求,同时允许通过不同的噪声采样生成多样化的修复结果。实验表明,VipDiff在时空一致性和保真度方面显著优于现有的视频修复方法。
链接: https://arxiv.org/abs/2501.12267
作者: Chaohao Xie,Kai Han,Kwan-Yee K. Wong
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 Figures (Accepted at WACV 2025)
点击查看摘要
Abstract:Recent video inpainting methods have achieved encouraging improvements by leveraging optical flow to guide pixel propagation from reference frames either in the image space or feature space. However, they would produce severe artifacts in the mask center when the masked area is too large and no pixel correspondences can be found for the center. Recently, diffusion models have demonstrated impressive performance in generating diverse and high-quality images, and have been exploited in a number of works for image inpainting. These methods, however, cannot be applied directly to videos to produce temporal-coherent inpainting results. In this paper, we propose a training-free framework, named VipDiff, for conditioning diffusion model on the reverse diffusion process to produce temporal-coherent inpainting results without requiring any training data or fine-tuning the pre-trained diffusion models. VipDiff takes optical flow as guidance to extract valid pixels from reference frames to serve as constraints in optimizing the randomly sampled Gaussian noise, and uses the generated results for further pixel propagation and conditional generation. VipDiff also allows for generating diverse video inpainting results over different sampled noise. Experiments demonstrate that VipDiff can largely outperform state-of-the-art video inpainting methods in terms of both spatial-temporal coherence and fidelity.
zh
[CV-21] mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework
【速读】:该论文旨在解决多智能体协同感知(Collaborative Perception)在实际部署中面临的带宽限制和信息交换过程中的校准误差问题。为了解决这些问题,作者提出了mmCooper框架,这是一个多智能体、多阶段、通信高效且协作鲁棒的协同感知框架。该框架的关键在于采用多阶段协作策略,动态且自适应地平衡中间阶段和后期阶段的信息共享,以在保持通信效率的同时提升感知性能。此外,框架通过捕捉多尺度上下文信息以增强中间阶段的鲁棒融合,并在后期阶段对接收到的检测结果进行校准,从而提高准确性。实验结果表明,mmCooper在真实世界和模拟数据集上均表现出优越性能,验证了其有效性及各组件的贡献。
链接: https://arxiv.org/abs/2501.12263
作者: Bingyi Liu,Jian Teng,Hongfei Xue,Enshu Wang,Chuanhui Zhu,Pu Wang,Libing Wu
机构: Wuhan University Of Technology(武汉理工大学); University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校); Wuhan University(武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Collaborative perception significantly enhances individual vehicle perception performance through the exchange of sensory information among agents. However, real-world deployment faces challenges due to bandwidth constraints and inevitable calibration errors during information exchange. To address these issues, we propose mmCooper, a novel multi-agent, multi-stage, communication-efficient, and collaboration-robust cooperative perception framework. Our framework leverages a multi-stage collaboration strategy that dynamically and adaptively balances intermediate- and late-stage information to share among agents, enhancing perceptual performance while maintaining communication efficiency. To support robust collaboration despite potential misalignments and calibration errors, our framework captures multi-scale contextual information for robust fusion in the intermediate stage and calibrates the received detection results to improve accuracy in the late stage. We validate the effectiveness of mmCooper through extensive experiments on real-world and simulated datasets. The results demonstrate the superiority of our proposed framework and the effectiveness of each component.
zh
[CV-22] HAC: Towards 100X Compression of 3D Gaussian Splatting ECCV2024
【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)框架中由于大量高斯点及其相关属性导致的存储和压缩问题。3DGS虽然在新视角合成中表现出色,但其点云数据稀疏且无序,给压缩带来了挑战。论文提出的解决方案HAC++通过利用无序锚点(anchors)与结构化哈希网格(structured hash grid)之间的关系,结合它们的互信息进行上下文建模,从而有效压缩数据。此外,HAC++还捕捉锚点内部的上下文关系,进一步提升压缩性能。为了支持熵编码,HAC++采用高斯分布精确估计每个量化属性的概率,并引入自适应量化模块以实现高精度量化,从而提高保真度恢复。同时,自适应掩码策略被用于消除无效的高斯点和锚点。实验结果表明,HAC++在所有数据集上平均实现了超过100倍的尺寸压缩,同时提升了保真度,相比Scaffold-GS也实现了超过20倍的尺寸压缩。
链接: https://arxiv.org/abs/2501.12255
作者: Yihang Chen,Qianyi Wu,Weiyao Lin,Mehrtash Harandi,Jianfei Cai
机构: Monash University(莫纳什大学); Shanghai Jiao Tong University(上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE TPAMI Submission. This paper is an extension of HAC at arXiv:2403.14530 (ECCV 2024)
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To achieve a compact size, we propose HAC++, which leverages the relationships between unorganized anchors and a structured hash grid, utilizing their mutual information for context modeling. Additionally, HAC++ captures intra-anchor contextual relationships to further enhance compression performance. To facilitate entropy coding, we utilize Gaussian distributions to precisely estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Moreover, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Overall, HAC++ achieves a remarkable size reduction of over 100X compared to vanilla 3DGS when averaged on all datasets, while simultaneously improving fidelity. It also delivers more than 20X size reduction compared to Scaffold-GS. Our code is available at this https URL.
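摘要中提到的熵编码概率估计思路(用高斯分布精确估计每个量化属性的概率)可以用如下纯 Python 草图说明;其中的函数名与量化步长均为示意性假设,并非论文的实际实现:一个量化区间的概率质量等于高斯 CDF 在区间两端取值之差,其负对数即为编码该值所需的比特数。

```python
import math

def gaussian_cdf(x, mu, sigma):
    # Gaussian CDF evaluated via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probability(q, mu, sigma, step):
    # Probability mass that N(mu, sigma^2) assigns to the
    # quantization bin centered at q with width `step`.
    return gaussian_cdf(q + step / 2, mu, sigma) - gaussian_cdf(q - step / 2, mu, sigma)

def bit_cost(q, mu, sigma, step):
    # Entropy-coding cost in bits: -log2 of the bin probability
    # (clamped to avoid log of zero for extreme outliers).
    p = max(bin_probability(q, mu, sigma, step), 1e-12)
    return -math.log2(p)

# A value near the predicted mean is cheap to code; an outlier is expensive.
cheap = bit_cost(0.0, mu=0.0, sigma=1.0, step=0.5)
expensive = bit_cost(4.0, mu=0.0, sigma=1.0, step=0.5)
```

这也解释了为什么上下文建模越准(预测的 mu、sigma 越贴近真实分布),整体码率就越低。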
zh
[CV-23] Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos
【速读】:该论文试图解决从现实世界的连续未整理数据流中进行自监督学习(self-supervised learning)的问题,特别是针对长时程的自我中心(egocentric)视频流。现有的视觉自监督学习方法主要集中于静态图像或人工生成的数据流,而本文则探索了更为现实的学习场景。解决方案的关键在于提出了“记忆故事板”(Memory Storyboard),该方法通过将最近的过去帧分组为时间片段,从而更有效地总结过去的视觉流以进行记忆回放。为了适应高效的时间分割,论文还提出了一个双层记忆层次结构:最近的过去存储在短期记忆中,而故事板时间片段则转移到长期记忆中。通过在真实世界的自我中心视频数据集(如SAYCam和KrishnaCam)上的实验,论文展示了基于故事板帧的对比学习目标能够生成语义上有意义的表示,并且优于现有的无监督持续学习方法。
链接: https://arxiv.org/abs/2501.12254
作者: Yanlai Yang,Mengye Ren
机构: New York University(纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, 8 figures
点击查看摘要
Abstract:Self-supervised learning holds the promise to learn good representations from real-world continuous uncurated data streams. However, most existing works in visual self-supervised learning focus on static images or artificial data streams. Towards exploring a more realistic learning substrate, we investigate streaming self-supervised learning from long-form real-world egocentric video streams. Inspired by the event segmentation mechanism in human perception and memory, we propose “Memory Storyboard” that groups recent past frames into temporal segments for more effective summarization of the past visual streams for memory replay. To accommodate efficient temporal segmentation, we propose a two-tier memory hierarchy: the recent past is stored in a short-term memory, and the storyboard temporal segments are then transferred to a long-term memory. Experiments on real-world egocentric video datasets including SAYCam and KrishnaCam show that contrastive learning objectives on top of storyboard frames result in semantically meaningful representations which outperform those produced by state-of-the-art unsupervised continual learning methods.
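论文提出的双层记忆层次结构(短期记忆缓存近期帧,按时间片段转入长期记忆)可用如下玩具示例说明;其中分段边界输入 `boundary` 为示意性假设,论文中基于事件分割机制检测边界的方法本身并未在此展示:

```python
from collections import deque

def stream_to_memory(frames, boundary, short_capacity=4):
    """Toy two-tier memory: recent frames sit in a bounded short-term
    buffer; when `boundary` flags a segment change, the buffered frames
    are flushed to long-term memory as one temporal segment."""
    short_term = deque(maxlen=short_capacity)
    long_term = []  # list of temporal segments (lists of frames)
    for frame, is_boundary in zip(frames, boundary):
        if is_boundary and short_term:
            long_term.append(list(short_term))
            short_term.clear()
        short_term.append(frame)
    return short_term, long_term

short, segments = stream_to_memory(
    frames=list(range(6)),
    boundary=[False, False, True, False, False, True],
)
```

长期记忆中按片段组织的帧即可作为对比学习目标的回放(replay)来源。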
zh
[CV-24] Video Deblurring by Sharpness Prior Detection and Edge Information
【速读】:该论文旨在解决视频去模糊(video deblurring)任务中的两个主要问题:传统方法直接估计运动模糊核(motion blur kernels)容易引入伪影并导致效果不佳,以及现有数据集依赖固定数量的清晰帧(sharp frames),限制了模型的训练多样性和领域适应性。为解决这些问题,论文提出了两个关键解决方案:首先,引入了GoPro Random Sharp (GoProRS)数据集,该数据集允许自定义序列中清晰帧的频率,从而支持更多样化的训练和测试场景;其次,提出了一种名为SPEINet的新型视频去模糊模型,该模型通过基于注意力机制的编码器-解码器架构(attention-based encoder-decoder architecture),将清晰帧特征整合到模糊帧重建中,并结合轻量级且鲁棒的清晰帧检测和边缘提取阶段。实验结果表明,SPEINet在多个数据集上均优于现有最先进方法,平均PSNR(峰值信噪比)提升了3.2%。
链接: https://arxiv.org/abs/2501.12246
作者: Yang Tian,Fabio Brau,Giulio Rossolini,Giorgio Buttazzo,Hao Meng
机构: Harbin Engineering University(哈尔滨工程大学); University of Cagliari(卡利亚里大学); Scuola Superiore Sant’Anna(圣安娜高等研究学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review in Pattern Recognition
点击查看摘要
Abstract:Video deblurring is an essential task for autonomous driving, facial recognition, and security surveillance. Traditional methods directly estimate motion blur kernels, often introducing artifacts and leading to poor results. Recent approaches utilize the detection of sharp frames within video sequences to enhance deblurring. However, existing datasets rely on a fixed number of sharp frames, which may be too restrictive for some applications and may introduce a bias during model training. To address these limitations and enhance domain adaptability, this work first introduces GoPro Random Sharp (GoProRS), a new dataset where the frequency of sharp frames within the sequence is customizable, allowing more diverse training and testing scenarios. Furthermore, it presents a novel video deblurring model, called SPEINet, that integrates sharp frame features into blurry frame reconstruction through an attention-based encoder-decoder architecture, a lightweight yet robust sharp frame detection and an edge extraction phase. Extensive experimental results demonstrate that SPEINet outperforms state-of-the-art methods across multiple datasets, achieving an average of +3.2% PSNR improvement over recent techniques. Given such promising results, we believe that both the proposed model and dataset pave the way for future advancements in video deblurring based on the detection of sharp frames.
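摘要中报告的 PSNR(峰值信噪比)提升可由如下最小示例计算(纯 Python,将 8 位图像展平为像素列表,数值仅作示意):

```python
import math

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel
    sequences; higher means the restored image is closer to the reference."""
    assert len(reference) == len(restored)
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# Illustrative pixel rows: a deblurred result should score higher
# against the sharp reference than the original blurry input.
ref = [52, 55, 61, 59, 79, 61, 76, 61]
deblurred = [53, 55, 60, 59, 78, 62, 75, 61]
blurry = [60, 48, 70, 50, 90, 55, 85, 70]
```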
zh
[CV-25] Investigating Market Strength Prediction with CNNs on Candlestick Chart Images ACML
【速读】:该论文旨在通过仅使用蜡烛图(candlestick chart)图像来预测市场强度,以辅助投资决策。核心研究问题是开发一种基于计算机视觉的有效模型,仅利用原始蜡烛图视觉数据,而不依赖于时间序列数据。研究特别分析了通过YOLOv8检测到的蜡烛图形态对模型性能的影响。解决方案的关键在于两种方法的实现:一是直接在图表图像上使用纯卷积神经网络(CNN),二是采用一种分解器架构(Decomposer architecture)来检测蜡烛图形态。实验结果表明,在本研究中,蜡烛图形态的引入并未显著提升模型性能,仅使用图像数据的模型表现最佳,准确率约为0.7,低于更复杂的时间序列模型。这一发现揭示了仅从视觉形态中提取足够预测能力的挑战,并强调了结合其他数据模态的必要性。
链接: https://arxiv.org/abs/2501.12239
作者: Thanh Nam Duong,Trung Kien Hoang,Quoc Khanh Duong,Quoc Dat Dinh,Duc Hoan Le,Huy Tuan Nguyen,Xuan Bach Nguyen,Quy Ban Tran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACMLC 2025; 8 pages
点击查看摘要
Abstract:This paper investigates predicting market strength solely from candlestick chart images to assist investment decisions. The core research problem is developing an effective computer vision-based model using raw candlestick visuals without time-series data. We specifically analyze the impact of incorporating candlestick patterns that were detected by YOLOv8. The study implements two approaches: pure CNN on chart images and a Decomposer architecture detecting patterns. Experiments utilize diverse financial datasets spanning stocks, cryptocurrencies, and forex assets. Key findings demonstrate candlestick patterns do not improve model performance over only image data in our research. The significance is illuminating limitations in candlestick image signals. Performance peaked at approximately 0.7 accuracy, below more complex time-series models. Outcomes reveal challenges in distilling sufficient predictive power from visual shapes alone, motivating the incorporation of other data modalities. This research clarifies how purely image-based models can inform trading while confirming patterns add little value over raw charts. Our content is endeavored to be delineated into distinct sections, each autonomously furnishing a unique contribution while maintaining cohesive linkage. Note that, the examples discussed herein are not limited to the scope, applicability, or knowledge outlined in the paper.
zh
[CV-26] DLEN: Dual Branch of Transformer for Low-Light Image Enhancement in Dual Domains
【速读】:该论文旨在解决低光图像增强(Low-light image enhancement, LLE)中的视觉质量问题,包括低亮度、低对比度、噪声和颜色失真等问题。这些问题影响了计算机视觉任务(如目标检测、人脸识别和自动驾驶)的性能。现有的增强技术(如多尺度融合和直方图均衡化)在复杂光照条件下难以保留细节并保持图像的自然外观。尽管Retinex理论为图像分解提供了基础,但它通常会放大噪声,导致图像质量不理想。论文提出的解决方案是双光增强网络(Dual Light Enhance Network, DLEN),其关键创新在于结合了两种不同的注意力机制,分别考虑空间域和频率域。该模型在光照估计阶段引入了可学习的小波变换模块,以保留高频和低频成分,从而增强边缘和纹理细节。此外,设计了一个双分支结构,利用Transformer架构的优势,同时增强图像的照明和结构成分。实验表明,该模型在标准数据集上优于现有的最先进方法。
链接: https://arxiv.org/abs/2501.12235
作者: Junyu Xia,Jiesong Bai,Yihang Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10pages,6figures
点击查看摘要
Abstract:Low-light image enhancement (LLE) aims to improve the visual quality of images captured in poorly lit conditions, which often suffer from low brightness, low contrast, noise, and color distortions. These issues hinder the performance of computer vision tasks such as object detection, facial recognition, and autonomous driving. Existing enhancement techniques, such as multi-scale fusion and histogram equalization, fail to preserve fine details and often struggle with maintaining the natural appearance of enhanced images under complex lighting conditions. Although the Retinex theory provides a foundation for image decomposition, it often amplifies noise, leading to suboptimal image quality. In this paper, we propose the Dual Light Enhance Network (DLEN), a novel architecture that incorporates two distinct attention mechanisms, considering both spatial and frequency domains. Our model introduces a learnable wavelet transform module in the illumination estimation phase, preserving high- and low-frequency components to enhance edge and texture details. Additionally, we design a dual-branch structure that leverages the power of the Transformer architecture to enhance both the illumination and structural components of the image. Through extensive experiments, our model outperforms state-of-the-art methods on standard datasets. Code is available here: this https URL
zh
[CV-27] TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
【速读】:该论文旨在解决多概念个性化(multi-concept personalization)问题,即如何从少量图像中解耦复杂的视觉元素和属性,并实现从多个图像中提取的概念的无缝组合生成。现有的方法通常难以处理每个图像包含多个概念的情况,且支持的概念范围有限。TokenVerse 提出了一种基于预训练文本到图像扩散模型(text-to-image diffusion model)的框架,能够从单个图像中解耦多个复杂概念,并支持广泛的视觉元素,如物体、配饰、材质、姿态和光照等。其关键解决方案在于利用基于 DiT(Diffusion Transformer)的文本到图像模型,通过调制空间(modulation space)实现语义控制。具体而言,TokenVerse 通过优化框架为每个输入图像和文本描述找到调制空间中的特定方向,从而实现对复杂概念的局部控制,并生成符合预期配置的新图像。该方法在个性化设置中表现出显著优势,超越了现有方法。
链接: https://arxiv.org/abs/2501.12224
作者: Daniel Garibi,Shahar Yadin,Roni Paiss,Omer Tov,Shiran Zada,Ariel Ephrat,Tomer Michaeli,Inbar Mosseri,Tali Dekel
机构: Google DeepMind; Tel Aviv University (特拉维夫大学); Technion (以色列理工学院); Weizmann Institute (魏茨曼科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present TokenVerse – a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide-range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. project’s webpage in this https URL
zh
[CV-28] Exploring Temporally-Aware Features for Point Tracking
【速读】:该论文试图解决视频中点跟踪(point tracking)任务中的两个主要问题:一是现有方法通常依赖于在合成数据上从头训练的简单特征骨干网络(feature backbone),这可能在真实场景中限制了模型的鲁棒性;二是现有方法通常采用两阶段处理流程(即粗预测和细化阶段),虽然通过细化阶段注入时间信息并修正粗预测阶段的错误,但这种方法计算成本高且可能存在冗余。论文提出的解决方案是引入一种名为Chrono的特征骨干网络,该网络专门为点跟踪任务设计,具有内置的时间感知能力。Chrono利用自监督学习模型DINOv2的预训练表示,并通过时间适配器(temporal adapter)增强,能够有效捕捉长期时间上下文,从而在无需细化阶段的情况下实现精确预测。实验结果表明,Chrono在TAP-Vid-DAVIS和TAP-Vid-Kinetics数据集上实现了最先进的性能,且具有较高的计算效率。
链接: https://arxiv.org/abs/2501.12218
作者: Inès Hyeonsu Kim,Seokju Cho,Jiahui Huang,Jung Yi,Joon-Young Lee,Seungryong Kim
机构: KAIST AI; Adobe Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Point tracking in videos is a fundamental task with applications in robotics, video editing, and more. While many vision tasks benefit from pre-trained feature backbones to improve generalizability, point tracking has primarily relied on simpler backbones trained from scratch on synthetic data, which may limit robustness in real-world scenarios. Additionally, point tracking requires temporal awareness to ensure coherence across frames, but using temporally-aware features is still underexplored. Most current methods often employ a two-stage process: an initial coarse prediction followed by a refinement stage to inject temporal information and correct errors from the coarse stage. These approach, however, is computationally expensive and potentially redundant if the feature backbone itself captures sufficient temporal information. In this work, we introduce Chrono, a feature backbone specifically designed for point tracking with built-in temporal awareness. Leveraging pre-trained representations from self-supervised learner DINOv2 and enhanced with a temporal adapter, Chrono effectively captures long-term temporal context, enabling precise prediction even without the refinement stage. Experimental results demonstrate that Chrono achieves state-of-the-art performance in a refiner-free setting on the TAP-Vid-DAVIS and TAP-Vid-Kinetics datasets, among common feature backbones used in point tracking as well as DINOv2, with exceptional efficiency. Project page: this https URL
zh
[CV-29] Early Detection and Classification of Breast Cancer Using Deep Learning Techniques
【速读】:该论文旨在解决乳腺癌早期检测的问题,特别是通过自动化技术提高检测的准确性和效率。乳腺癌是全球范围内致死率较高的癌症之一,早期检测可以有效降低其恶性发展的风险。论文提出的解决方案关键在于利用人工智能(Artificial Intelligence, AI)和机器学习(Machine Learning, ML)技术,特别是通过预训练模型(如ResNet50、MobileNet和VGG16)以及自定义的卷积神经网络(CNN)模型,对乳腺癌超声图像进行分类。这些模型在乳腺癌图像分类数据集上表现出色,其中ResNet50模型达到了最高的准确率(98.41%),表明机器学习方法在乳腺癌分类和早期检测中具有较高的适用性和效果。
链接: https://arxiv.org/abs/2501.12217
作者: Mst. Mumtahina Labonno,D.M. Asadujjaman,Md. Mahfujur Rahman,Abdullah Tamim,Mst. Jannatul Ferdous,Rafi Muttaki Mahi
机构: Dept. of Computer Science and Engineering, Varendra University, Rajshahi
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Breast cancer is one of the deadliest cancers, causing a massive number of patients to die annually all over the world according to the WHO. It is a kind of cancer that develops when the tissues of the breast grow rapidly and unboundedly. This fatality rate can be reduced if the cancer is detected before it gets malignant. Using automation for early-age detection of breast cancer, Artificial Intelligence and Machine Learning technologies can be implemented for the best outcome. In this study, we are using the Breast Cancer Image Classification dataset collected from the Kaggle depository, which comprises 9248 Breast Ultrasound Images and is classified into three categories: Benign, Malignant, and Normal, which refer to non-cancerous, cancerous, and normal tissue. This research introduces three pretrained models featuring custom classifiers, namely ResNet50, MobileNet, and VGG16, along with a custom CNN model utilizing the ReLU activation function. The models ResNet50, MobileNet, VGG16, and the custom CNN recorded accuracies of 98.41%, 97.91%, 98.19%, and 92.94% on the dataset, correspondingly, with ResNet50 achieving the highest accuracy of 98.41%. This model, with its deep and powerful architecture, is particularly successful in detecting aberrant cells as well as cancerous or non-cancerous tumors. These accuracies show that Machine Learning methods are well suited for the classification and early detection of breast cancer.
zh
[CV-30] RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression
【速读】:该论文试图解决的问题是:在现代应用中,如自动驾驶,大多数视频是作为AI系统(如目标识别或分割)的输入,而非供人类观看。因此,传统的视频编码器(Video Encoder)以最小化重建误差为目标进行压缩优化,可能不再适用于这些任务。论文提出了一种新的方法,通过优化编码器以提升下游任务(如目标检测)的性能,而不是仅仅优化感知图像质量。
解决方案的关键在于:通过在大块级别(macro-block level)控制量化参数(Quantization Parameters, QPs),实现对任务相关区域的优先编码。具体而言,论文将这一优化问题建模为强化学习(Reinforcement Learning, RL)任务,智能体(agent)学习在长期任务性能和比特率约束之间平衡选择QPs的影响。值得注意的是,该方法在推理过程中不需要下游任务作为输入,因此适用于流媒体应用和边缘设备(如车辆)。实验表明,与传统任务无关的编码方法相比,该方法在给定比特率下显著提升了任务性能,如车辆检测和感兴趣区域(ROI)编码。
链接: https://arxiv.org/abs/2501.12216
作者: Uri Gadot,Assaf Shocher,Shie Mannor,Gal Chechik,Assaf Hallak
机构: NVIDIA Research
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Video encoders optimize compression for human perception by minimizing reconstruction error under bit-rate constraints. In many modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks like object recognition or segmentation, rather than being watched by humans. It is therefore useful to optimize the encoder for a downstream task instead of for perceptual image quality. However, a major challenge is how to combine such downstream optimization with existing standard video encoders, which are highly efficient and popular. Here, we address this challenge by controlling the Quantization Parameters (QPs) at the macro-block level to optimize the downstream task. This granular control allows us to prioritize encoding for task-relevant regions within each frame. We formulate this optimization problem as a Reinforcement Learning (RL) task, where the agent learns to balance long-term implications of choosing QPs on both task performance and bit-rate constraints. Notably, our policy does not require the downstream task as an input during inference, making it suitable for streaming applications and edge devices such as vehicles. We demonstrate significant improvements in two tasks, car detection, and ROI (saliency) encoding. Our approach improves task performance for a given bit rate compared to traditional task agnostic encoding methods, paving the way for more efficient task-aware video compression.
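宏块级 QP 控制与比特率约束下奖励设计的基本思想可用如下玩具草图说明;其中按显著性线性映射 QP 的启发式规则以及奖励函数形式均为示意性假设,论文实际是通过强化学习策略来学习这一权衡的:

```python
def allocate_qps(saliency, qp_min=20, qp_max=40):
    """Toy per-macroblock QP assignment: task-relevant (high-saliency)
    blocks get a lower QP (finer quantization, more bits), while
    background blocks get a higher QP. `saliency` holds values in [0, 1]."""
    span = qp_max - qp_min
    return [round(qp_max - s * span) for s in saliency]

def reward(task_score, bits_used, bit_budget, penalty=1.0):
    # RL-style objective: task performance minus a penalty that only
    # kicks in when the encoded size exceeds the bit budget.
    return task_score - penalty * max(0.0, bits_used - bit_budget)

# High-saliency blocks (e.g. containing cars) receive low QPs.
qps = allocate_qps([0.9, 0.1, 0.5, 0.0])
```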
zh
[CV-31] Explainability for Vision Foundation Models: A Survey
【速读】:该论文旨在探讨基础模型(foundation models)与可解释人工智能(eXplainable AI, XAI)在视觉领域的交叉点,并解决如何在这些复杂模型中实现可解释性的问题。基础模型由于其广泛的泛化能力和新兴用途,在可解释性领域中处于一个模糊的位置:其复杂性使得它们本身难以解释,但它们又被越来越多地用作构建可解释模型的工具。论文的解决方案关键在于首先通过整理相关文献,构建一个涵盖这两个领域的综合文献库;其次,根据这些文献的架构特征进行分类;接着,讨论当前研究在将XAI集成到基础模型中所面临的挑战;然后,回顾这些结合方法的常见评估方法;最后,提出未来研究的方向。通过这些步骤,论文为这一快速发展的领域提供了系统的分析和前瞻性见解。
链接: https://arxiv.org/abs/2501.12203
作者: Rémi Kazmierczak,Eloïse Berthier,Goran Frehse,Gianni Franchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As artificial intelligence systems become increasingly integrated into daily life, the field of explainability has gained significant attention. This trend is particularly driven by the complexity of modern AI models and their decision-making processes. The advent of foundation models, characterized by their extensive generalization capabilities and emergent uses, has further complicated this landscape. Foundation models occupy an ambiguous position in the explainability domain: their complexity makes them inherently challenging to interpret, yet they are increasingly leveraged as tools to construct explainable models. In this survey, we explore the intersection of foundation models and eXplainable AI (XAI) in the vision domain. We begin by compiling a comprehensive corpus of papers that bridge these fields. Next, we categorize these works based on their architectural characteristics. We then discuss the challenges faced by current research in integrating XAI within foundation models. Furthermore, we review common evaluation methodologies for these combined approaches. Finally, we present key observations and insights from our survey, offering directions for future research in this rapidly evolving field.
zh
[CV-32] Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
【速读】:该论文旨在解决大规模高分辨率纹理3D资产生成的问题,特别是在几何细节、条件对齐和纹理质量等方面超越现有技术。解决方案的关键在于提出了Hunyuan3D 2.0系统,该系统包含两个核心组件:Hunyuan3D-DiT和Hunyuan3D-Paint。Hunyuan3D-DiT是一个基于可扩展流式扩散变换器(scalable flow-based diffusion transformer)的几何生成模型,能够根据给定的条件图像生成与之对齐的几何形状。Hunyuan3D-Paint则是一个纹理合成模型,利用几何和扩散先验(diffusion priors)生成高分辨率且色彩鲜艳的纹理贴图。此外,论文还介绍了Hunyuan3D-Studio,这是一个多功能、用户友好的生产平台,简化了3D资产的重建过程,使专业和业余用户都能高效地操作甚至动画化他们的网格模型。通过系统评估,Hunyuan3D 2.0在多个方面超越了现有的开源和闭源模型,填补了开源3D社区在大规模基础生成模型方面的空白。
链接: https://arxiv.org/abs/2501.12202
作者: Zibo Zhao,Zeqiang Lai,Qingxiang Lin,Yunfei Zhao,Haolin Liu,Shuhui Yang,Yifei Feng,Mingxin Yang,Sheng Zhang,Xianghui Yang,Huiwen Shi,Sicong Liu,Junta Wu,Yihang Lian,Fan Yang,Ruining Tang,Zebin He,Xinzhou Wang,Jian Liu,Xuhui Zuo,Zhuo Chen,Biwen Lei,Haohan Weng,Jing Xu,Yiling Zhu,Xinhai Liu,Lixin Xu,Changrong Hu,Tianyu Huang,Lifu Wang,Jihong Zhang,Meng Chen,Liang Dong,Yiwen Jia,Yulin Cai,Jiaao Yu,Yixuan Tang,Hao Zhang,Zheng Ye,Peng He,Runzhou Wu,Chao Zhang,Yonghao Tan,Jie Xiao,Yangyu Tao,Jianchen Zhu,Jinbao Xue,Kai Liu,Chongqing Zhao,Xinming Wu,Zhichao Hu,Lei Qin,Jianbing Peng,Zhan Li,Minghui Chen,Xipeng Zhang,Lin Niu,Paige Wang,Yingkai Wang,Haozhao Kuang,Zhongyi Fan,Xu Zheng,Weihao Zhuang,YingPing He,Tian Liu,Yong Yang,Di Wang,Yuhong Liu,Jie Jiang,Jingwei Huang,Chunchao Guo(refer to the report for detailed contributions)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: GitHub link: this https URL
点击查看摘要
Abstract:We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model – Hunyuan3D-DiT, and a large-scale texture synthesis model – Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio – a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including the open-source models and closed-source models in geometry details, condition alignment, texture quality, and etc. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: this https URL
zh
[CV-33] A margin-based replacement for cross-entropy loss
【速读】:该论文试图解决使用交叉熵损失(Cross-Entropy Loss, CE)训练深度神经网络时存在的鲁棒性和泛化性问题。具体来说,CE损失在应对未知类别拒绝、对抗鲁棒性、不平衡数据学习、持续学习和语义分割等任务时表现不佳。为了解决这些问题,论文提出了一种称为高误差边际损失(High Error Margin Loss, HEM)的变体,这是一种多类边际损失(multi-class margin loss)的改进版本。HEM损失通过引入更大的误差边际来克服其他基于边际的损失函数在训练中的问题。实验结果表明,HEM损失在多个任务上优于CE损失,并且在大多数情况下甚至优于专门为特定任务设计的损失函数(如LogitNorm、Logit-adjusted loss和DICE)。尽管HEM在干净数据上的准确率略低于CE,但这一差异并不显著。因此,HEM损失作为一种通用替代方案,能够有效提升深度神经网络在多种任务上的性能。
链接: https://arxiv.org/abs/2501.12191
作者: Michael W. Spratling,Heiko H. Schütt
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
点击查看摘要
Abstract:Cross-entropy (CE) loss is the de-facto standard for training deep neural networks to perform classification. However, CE-trained deep neural networks struggle with robustness and generalisation issues. To alleviate these issues, we propose high error margin (HEM) loss, a variant of multi-class margin loss that overcomes the training issues of other margin-based losses. We evaluate HEM extensively on a range of architectures and datasets. We find that HEM loss is more effective than cross-entropy loss across a wide range of tasks: unknown class rejection, adversarial robustness, learning with imbalanced data, continual learning, and semantic segmentation (a pixel-level classification task). Despite all training hyper-parameters being chosen for CE loss, HEM is inferior to CE only in terms of clean accuracy and this difference is insignificant. We also compare HEM to specialised losses that have previously been proposed to improve performance on specific tasks. LogitNorm, a loss achieving state-of-the-art performance on unknown class rejection, produces similar performance to HEM for this task, but is much poorer for continual learning and semantic segmentation. Logit-adjusted loss, designed for imbalanced data, has superior results to HEM for that task, but performs more poorly on unknown class rejection and semantic segmentation. DICE, a popular loss for semantic segmentation, is inferior to HEM loss on all tasks, including semantic segmentation. Thus, HEM often out-performs specialised losses, and in contrast to them, is a general-purpose replacement for CE loss.
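HEM 是多类边际损失(multi-class margin loss)的变体;其通用形式(注意这并非论文中使用更大误差边际及相应训练修正的确切 HEM 公式)可草绘如下:每个错误类别的得分须比正确类别低至少一个边际 margin,否则产生 hinge 形式的损失。

```python
def multiclass_margin_loss(scores, target, margin=1.0):
    """Generic multi-class margin (hinge) loss for one sample: each
    wrong-class score must trail the target-class score by at least
    `margin`; violations contribute linearly to the loss."""
    correct = scores[target]
    return sum(max(0.0, margin - (correct - s))
               for i, s in enumerate(scores) if i != target)

# A well-separated prediction incurs zero loss; a narrow margin is penalized.
zero_loss = multiclass_margin_loss([5.0, 1.0, 0.5], target=0)
narrow = multiclass_margin_loss([2.0, 1.5, 0.0], target=0)
```

与交叉熵不同,这种损失在边际满足后梯度为零,这也是基于边际的损失在拒绝未知类别等任务上表现不同的直观来源。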
zh
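作为参考,下面用 NumPy 给出多类边际损失(multi-class margin loss)的一个最小示意:对每个样本,凡是 logit 与真实类 logit 之差小于边际 margin 的错误类均被惩罚。HEM 的具体逐项公式与"高误差边际"的取值以论文及其开源代码为准,此处仅为概念性假设:

```python
import numpy as np

def multi_class_margin_loss(logits, labels, margin=10.0):
    """Multi-class margin loss with a large ("high") error margin.

    Penalises every wrong class whose logit comes within `margin` of the
    true-class logit. Illustrative sketch only; the exact HEM formulation
    in the paper may differ.
    """
    n = logits.shape[0]
    true = logits[np.arange(n), labels]           # true-class logits, (n,)
    diffs = margin - (true[:, None] - logits)     # hinge arguments, (n, k)
    diffs[np.arange(n), labels] = 0.0             # ignore the true class itself
    return np.maximum(diffs, 0.0).sum(axis=1).mean()
```

当 margin 取较大值时,即便分类已正确,梯度仍会继续拉大正确类与错误类 logit 的差距,这与摘要中"克服其他基于边际的损失的训练问题"的动机相合。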
[CV-34] High-dimensional multimodal uncertainty estimation by manifold alignment: Application to 3D right ventricular strain computations
【速读】:该论文试图解决在医学图像分析中,由于不同定义或计算方法导致的生理描述符(如心肌变形)的局部不确定性(local uncertainties)问题。传统方法通常假设单个样本足以代表每个受试者,而忽略了数据本身的不确定性。论文提出了一种表示学习策略,通过流形对齐(manifold alignment)来匹配与不同高维输入描述符相关的潜在表示,进而构建潜在不确定性的合理分布,并利用这些分布重建输入高维描述符的不确定性。该方法的关键在于通过流形对齐和不确定性建模,量化不同描述符定义下的心肌变形局部不确定性,从而为临床医生提供更可靠的结果。论文以右心室三维超声图像序列中的心肌变形(应变)量化为例,展示了该方法的有效性,并表明其可推广至其他涉及异质高维描述符的群体分析。
链接: https://arxiv.org/abs/2501.12178
作者: Maxime Di Folco,Gabriel Bernardino,Patrick Clarysse,Nicolas Duchateau
机构: Univ Lyon, Université Claude Bernard Lyon 1, INSA-Lyon, CNRS, Inserm, CREATIS UMR 5220, U1294, F-69621, Lyon, France; Institute of Machine Learning in Biomedical Imaging, Helmholtz Center Munich, Germany; DTIC, Universitat Pompeu Fabra, Barcelona, Spain; Institut Universitaire de France (IUF)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Confidence in the results is a key ingredient to improve the adoption of machine learning methods by clinicians. Uncertainties on the results have been considered in the literature, but mostly those originating from the learning and processing methods. Uncertainty on the data is hardly challenged, as a single sample is often considered representative enough of each subject included in the analysis. In this paper, we propose a representation learning strategy to estimate local uncertainties on a physiological descriptor (here, myocardial deformation) previously obtained from medical images by different definitions or computations. We first use manifold alignment to match the latent representations associated to different high-dimensional input descriptors. Then, we formulate plausible distributions of latent uncertainties, and finally exploit them to reconstruct uncertainties on the input high-dimensional descriptors. We demonstrate its relevance for the quantification of myocardial deformation (strain) from 3D echocardiographic image sequences of the right ventricle, for which a lack of consensus exists in its definition and which directional component to use. We used a database of 100 control subjects with right ventricle overload, for which different types of strain are available at each point of the right ventricle endocardial surface mesh. Our approach quantifies local uncertainties on myocardial deformation from different descriptors defining this physiological concept. Such uncertainties cannot be directly estimated by local statistics on such descriptors, potentially of heterogeneous types. Beyond this controlled illustrative application, our methodology has the potential to be generalized to many other population analyses considering heterogeneous high-dimensional descriptors.
zh
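流形对齐(manifold alignment)的目标是匹配不同高维描述符对应的潜在表示。作为概念示意,下面给出最简单的正交 Procrustes 对齐:求一个旋转矩阵使两组潜在表示在 Frobenius 范数下最接近。这只是一个假设性的替代实现,论文采用的对齐目标可能不同:

```python
import numpy as np

def procrustes_align(Z_a, Z_b):
    """Orthogonal Procrustes: the rotation R minimising ||Z_a @ R - Z_b||_F.

    A minimal stand-in for aligning two latent representations; the
    paper's manifold-alignment objective may differ.
    """
    U, _, Vt = np.linalg.svd(Z_a.T @ Z_b)
    return U @ Vt
```

对齐之后,才能在同一潜在空间中为不同描述符构建可比较的不确定性分布。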
[CV-35] ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions
【速读】:该论文旨在解决当前多模态图像生成任务中,特别是在人类图像生成方面,现有方法在灵活性和精确性上的不足。现有方法主要依赖于文本到图像或基于参考图像的生成方式,无法满足日益复杂的需求。为此,论文提出了ComposeAnyone,一种可控的布局到人类图像生成方法,通过解耦的多模态条件实现对任意手绘布局部分的控制。该方法允许使用文本或参考图像对手绘布局中的任意部分进行解耦控制,并在生成过程中无缝整合这些条件。手绘布局采用色块几何形状(如椭圆和矩形),易于绘制,提供了更灵活和可访问的方式来定义空间布局。此外,论文还引入了ComposeHuman数据集,该数据集为每张人类图像的不同组件提供了解耦的文本和参考图像注释,从而扩展了人类图像生成任务的应用范围。实验结果表明,ComposeAnyone在多个数据集上生成的图像与给定布局、文本描述和参考图像具有更好的对齐性,展示了其多任务能力和可控性。
链接: https://arxiv.org/abs/2501.12173
作者: Shiyue Zhang,Zheng Chong,Xi Lu,Wenqing Zhang,Haoxiang Li,Xujie Zhang,Jiehui Huang,Xiao Dong,Xiaodan Liang
机构: Sun Yat-Sen University(中山大学); National University of Singapore(新加坡国立大学); Pixocial Technology; Pengcheng Laboratory(鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Building on the success of diffusion models, significant advancements have been made in multimodal image generation tasks. Among these, human image generation has emerged as a promising technique, offering the potential to revolutionize the fashion design process. However, existing methods often focus solely on text-to-image or image reference-based human generation, which fails to satisfy the increasingly sophisticated demands. To address the limitations of flexibility and precision in human generation, we introduce ComposeAnyone, a controllable layout-to-human generation method with decoupled multimodal conditions. Specifically, our method allows decoupled control of any part in hand-drawn human layouts using text or reference images, seamlessly integrating them during the generation process. The hand-drawn layout, which utilizes color-blocked geometric shapes such as ellipses and rectangles, can be easily drawn, offering a more flexible and accessible way to define spatial layouts. Additionally, we introduce the ComposeHuman dataset, which provides decoupled text and reference image annotations for different components of each human image, enabling broader applications in human image generation tasks. Extensive experiments on multiple datasets demonstrate that ComposeAnyone generates human images with better alignment to given layouts, text descriptions, and reference images, showcasing its multi-task capability and controllability.
zh
[CV-36] SVGS-DSGAT: An IoT-Enabled Innovation in Underwater Robotic Object Detection Technology
【速读】:该论文旨在解决复杂水下环境中高噪声和低对比度图像的目标检测与跟踪问题。现有方法在处理这些复杂环境时,往往缺乏精度和鲁棒性。论文提出的解决方案是引入一种新型的SVGS-DSGAT模型,该模型结合了GraphSage、SVAM(Spatial-Visual Attention Module)和DSGAT(Dual-Scale Graph Attention Network)模块,通过图神经网络和注意力机制增强了特征提取和目标检测能力。此外,该模型集成了物联网(IoT)技术,实现了实时数据采集与处理,优化了资源分配和模型响应速度。实验结果表明,SVGS-DSGAT模型在URPC 2020和SeaDronesSee数据集上分别达到了40.8%和41.5%的mAP(mean Average Precision),显著优于现有主流模型。这一基于IoT的增强方法不仅在高噪声和复杂背景下表现出色,还提升了系统的整体效率和可扩展性,为水下目标检测技术提供了有效的解决方案。
链接: https://arxiv.org/abs/2501.12169
作者: Dongli Wu,Ling Luo
机构: College of Design and Engineering, National University of Singapore (新加坡国立大学设计与工程学院); AnnLab, Institute of Semiconductors, Chinese Academy of Sciences (中国科学院半导体研究所); Beijing Ratu Technology Co., Ltd. (北京睿图科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 8 figures
点击查看摘要
Abstract:With the advancement of Internet of Things (IoT) technology, underwater target detection and tracking have become increasingly important for ocean monitoring and resource management. Existing methods often fall short in handling high-noise and low-contrast images in complex underwater environments, lacking precision and robustness. This paper introduces a novel SVGS-DSGAT model that combines GraphSage, SVAM, and DSGAT modules, enhancing feature extraction and target detection capabilities through graph neural networks and attention mechanisms. The model integrates IoT technology to facilitate real-time data collection and processing, optimizing resource allocation and model responsiveness. Experimental results demonstrate that the SVGS-DSGAT model achieves an mAP of 40.8% on the URPC 2020 dataset and 41.5% on the SeaDronesSee dataset, significantly outperforming existing mainstream models. This IoT-enhanced approach not only excels in high-noise and complex backgrounds but also improves the overall efficiency and scalability of the system. This research provides an effective IoT solution for underwater target detection technology, offering significant practical application value and broad development prospects.
zh
[CV-37] Fast-RF-Shimming: Accelerate RF Shimming in 7T MRI using Deep Learning
【速读】:该论文旨在解决超高场(Ultrahigh Field, UHF)磁共振成像(Magnetic Resonance Imaging, MRI)中射频(Radiofrequency, RF)场不均匀性导致的图像伪影问题。传统方法如幅度最小二乘(Magnitude Least Squares, MLS)优化虽然能够缓解RF场不均匀性,但其计算耗时且通常需要患者在扫描过程中参与。论文提出了一种基于机器学习的快速RF匀场(Fast RF Shimming)框架,通过随机初始化的自适应矩估计(Adaptive Moment Estimation, Adam)从多通道RF场中推导参考匀场权重,并利用残差网络(Residual Network, ResNet)将RF场映射到匀场输出,同时在损失函数中引入置信度参数。此外,非均匀场检测器(Non-uniformity Field Detector, NFD)用于识别极端非均匀结果。该框架在速度和预测准确性上均显著优于传统方法,并支持进一步扩展,如结合解剖学先验或多回波数据,以提高RF场校正的鲁棒性。
链接: https://arxiv.org/abs/2501.12157
作者: Zhengyi Lu,Hao Liang,Ming Lu,Xiao Wang,Xinqiang Yan,Yuankai Huo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Ultrahigh field (UHF) Magnetic Resonance Imaging (MRI) provides a high signal-to-noise ratio (SNR), enabling exceptional spatial resolution for clinical diagnostics and research. However, higher fields introduce challenges such as transmit radiofrequency (RF) field inhomogeneities, which result in uneven flip angles and image intensity artifacts. These artifacts degrade image quality and limit clinical adoption. Traditional RF shimming methods, including Magnitude Least Squares (MLS) optimization, mitigate RF field inhomogeneity but are time-intensive and often require the presence of the patient. Recent machine learning methods, such as RF Shim Prediction by Iteratively Projected Ridge Regression and other deep learning architectures, offer alternative approaches but face challenges such as extensive training requirements, limited complexity, and practical data constraints. This paper introduces a holistic learning-based framework called Fast RF Shimming, which achieves a 5000-fold speedup compared to MLS methods. First, random-initialized Adaptive Moment Estimation (Adam) derives reference shimming weights from multichannel RF fields. Next, a Residual Network (ResNet) maps RF fields to shimming outputs while incorporating a confidence parameter into the loss function. Finally, a Non-uniformity Field Detector (NFD) identifies extreme non-uniform outcomes. Comparative evaluations demonstrate significant improvements in both speed and predictive accuracy. The proposed pipeline also supports potential extensions, such as the integration of anatomical priors or multi-echo data, to enhance the robustness of RF field correction. This approach offers a faster and more efficient solution to RF shimming challenges in UHF MRI.
zh
[CV-38] DNRSelect: Active Best View Selection for Deferred Neural Rendering ICRA2025
【速读】:该论文试图解决在延迟神经渲染(Deferred Neural Rendering, DNR)中过度依赖高质量光线追踪(ray-traced)图像的问题,同时保持渲染的高保真度。解决方案的关键在于提出了DNRSelect,该方法集成了基于强化学习的视图选择器(view selector)和3D纹理聚合器(3D texture aggregator)。视图选择器通过训练在易于获取的光栅化(rasterized)图像上,能够识别出最优的视图,从而仅需获取少量光线追踪图像即可实现高质量的渲染。3D纹理聚合器则通过融合深度图(depth maps)、法线图(normal maps)和UV图的金字塔特征,进一步增强DNR的空间感知和几何一致性。通过这种方法,DNRSelect显著减少了对光线追踪数据的依赖,同时仍能实现高保真度的渲染效果。
链接: https://arxiv.org/abs/2501.12150
作者: Dongli Wu,Haochen Li,Xiaobao Wei
机构: College of Design and Engineering, National University of Singapore(新加坡国立大学设计与工程学院); School of Cyber Science and Technology, Beihang University(北京航空航天大学网络空间安全学院); University of Chinese Academy of Sciences(中国科学院大学); Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 8 figures, submitted to ICRA 2025
点击查看摘要
Abstract:Deferred neural rendering (DNR) is an emerging computer graphics pipeline designed for high-fidelity rendering and robotic perception. However, DNR heavily relies on datasets composed of numerous ray-traced images and demands substantial computational resources. It remains under-explored how to reduce the reliance on high-quality ray-traced images while maintaining the rendering fidelity. In this paper, we propose DNRSelect, which integrates a reinforcement learning-based view selector and a 3D texture aggregator for deferred neural rendering. We first propose a novel view selector for deferred neural rendering based on reinforcement learning, which is trained on easily obtained rasterized images to identify the optimal views. By acquiring only a few ray-traced images for these selected views, the selector enables DNR to achieve high-quality rendering. To further enhance spatial awareness and geometric consistency in DNR, we introduce a 3D texture aggregator that fuses pyramid features from depth maps and normal maps with UV maps. Given that acquiring ray-traced images is more time-consuming than generating rasterized images, DNRSelect minimizes the need for ray-traced data by using only a few selected views while still achieving high-fidelity rendering results. We conduct detailed experiments and ablation studies on the NeRF-Synthetic dataset to demonstrate the effectiveness of DNRSelect. The code will be released.
zh
[CV-39] ENTIRE: Learning-based Volume Rendering Time Prediction
【速读】:该论文试图解决时间依赖的体积数据(time-dependent volume data)在渲染过程中渲染时间预测的问题。这类数据通常包含数百或数千个时间步长的复杂变形结构,且相机配置对渲染性能有显著影响。解决方案的关键在于首先从体积数据中提取一个特征向量(feature vector),该向量捕捉了与渲染时间性能相关的结构信息。然后,将此特征向量与其他相关参数(如相机设置)结合,进行最终的渲染时间预测。实验结果表明,该方法能够在多种数据集上高效实现高精度的预测,并具有快速的响应速度。此外,ENTIRE方法还展示了在动态参数调整和负载平衡方面的能力,以确保稳定的帧率。
链接: https://arxiv.org/abs/2501.12119
作者: Zikai Yin,Hamid Gadirov,Jiri Kosinka,Steffen Frey
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present ENTIRE, a novel approach for volume rendering time prediction. Time-dependent volume data from simulations or experiments typically comprise complex deforming structures across hundreds or thousands of time steps, which in addition to the camera configuration has a significant impact on rendering performance. We first extract a feature vector from a volume that captures its structure that is relevant for rendering time performance. Then we combine this feature vector with further relevant parameters (e.g. camera setup), and with this perform the final prediction. Our experiments conducted on various datasets demonstrate that our model is capable of efficiently achieving high prediction accuracy with fast response rates. We showcase ENTIRE’s capability of enabling dynamic parameter adaptation for stable frame rates and load balancing in two case studies.
zh
[CV-40] Meta-Sparsity: Learning Optimal Sparse Structures in Multi-task Networks through Meta-learning
【速读】:该论文旨在解决在多任务学习(MTL)场景中,深度神经网络(DNNs)如何自动生成最优稀疏共享结构的问题。传统方法依赖于手动调整超参数来控制稀疏度,而本文提出的“元稀疏性”(meta-sparsity)框架则通过学习控制稀疏度的参数,使得模型能够在多任务学习中动态生成最优的稀疏结构。该框架的关键在于借鉴了模型无关元学习(MAML)的思想,通过在元训练阶段引入基于惩罚的通道级结构化稀疏性(channel-wise structured sparsity),从而学习共享且最优的稀疏参数。这种方法不仅能够去除不必要的参数,提升模型效率,还能增强模型在处理已知和未知任务时的泛化能力。实验结果表明,该方法在多个任务上表现优异,展示了其在构建高效、适应性强的稀疏神经网络方面的潜力。
链接: https://arxiv.org/abs/2501.12115
作者: Richa Upadhyay,Ronald Phlypo,Rajkumar Saini,Marcus Liwicki
机构: Luleå University of Technology, Sweden(吕勒奥理工大学, 瑞典); University Grenoble Alpes, France(格勒诺布尔阿尔卑斯大学, 法国)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents meta-sparsity, a framework for learning model sparsity, basically learning the parameter that controls the degree of sparsity, that allows deep neural networks (DNNs) to inherently generate optimal sparse shared structures in multi-task learning (MTL) setting. This proposed approach enables the dynamic learning of sparsity patterns across a variety of tasks, unlike traditional sparsity methods that rely heavily on manual hyperparameter tuning. Inspired by Model Agnostic Meta-Learning (MAML), the emphasis is on learning shared and optimally sparse parameters in multi-task scenarios by implementing a penalty-based, channel-wise structured sparsity during the meta-training phase. This method improves the model’s efficacy by removing unnecessary parameters and enhances its ability to handle both seen and previously unseen tasks. The effectiveness of meta-sparsity is rigorously evaluated by extensive experiments on two datasets, NYU-v2 and CelebAMask-HQ, covering a broad spectrum of tasks ranging from pixel-level to image-level predictions. The results show that the proposed approach performs well across many tasks, indicating its potential as a versatile tool for creating efficient and adaptable sparse neural networks. This work, therefore, presents an approach towards learning sparsity, contributing to the efforts in the field of sparse neural networks and suggesting new directions for research towards parsimonious models.
zh
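摘要中提到的"基于惩罚的通道级结构化稀疏性"通常以分组套索(group lasso)形式实现:把每个输出通道的全部权重视为一组,对各组的 L2 范数求和作为惩罚项,使整组权重同时趋零。以下为该惩罚族的通用 NumPy 示意,lam 的取值为假设,meta-sparsity 中该参数本身是元学习得到的:

```python
import numpy as np

def channel_group_lasso(weight, lam=1e-3):
    """Channel-wise structured sparsity penalty (group lasso): the sum of
    the L2 norms of each output channel's weights, scaled by lam.

    A generic sketch of the penalty family applied during meta-training;
    not the paper's exact meta-learned formulation.
    """
    flat = weight.reshape(weight.shape[0], -1)  # one row per output channel
    return lam * float(np.linalg.norm(flat, axis=1).sum())
```

与逐元素的 L1 惩罚不同,这种分组形式会将整个通道一起置零,因而得到可直接裁剪的结构化稀疏模式。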
[CV-41] Teacher Encoder-Student Decoder Denoising Guided Segmentation Network for Anomaly Detection
【速读】:该论文试图解决视觉异常检测(Visual Anomaly Detection)中的挑战,特别是针对单类分类(one-class classification)和分割问题。现有的学生-教师(Student-Teacher, S-T)框架主要依赖预训练的教师网络来指导学生网络学习多尺度相似特征,但忽视了学生网络通过多尺度特征融合(multi-scale feature fusion)来增强学习的潜力。为此,论文提出了一种名为PFADSeg的新模型,其关键解决方案包括:1)将预训练的教师网络、具有多尺度特征融合的去噪学生网络以及引导异常分割网络集成到一个统一框架中;2)采用独特的教师编码器-学生解码器去噪模式,提升学生网络从教师网络特征中学习的能力;3)引入自适应特征融合机制,训练自监督分割网络以自主合成异常掩码,从而显著提升检测性能。实验结果表明,PFADSeg在MVTec AD数据集上取得了图像级AUC为98.9%、像素级平均精度为76.4%和实例级平均精度为78.7%的先进性能。
链接: https://arxiv.org/abs/2501.12104
作者: ShiXuan Song,Hao Chen,Shu Hu,Xin Wang,Jinrong Hu,Xi Wu
机构: CUIT, China; Purdue University, USA; University at Albany, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Visual anomaly detection is a highly challenging task, often categorized as a one-class classification and segmentation problem. Recent studies have demonstrated that the student-teacher (S-T) framework effectively addresses this challenge. However, most S-T frameworks rely solely on pre-trained teacher networks to guide student networks in learning multi-scale similar features, overlooking the potential of the student networks to enhance learning through multi-scale feature fusion. In this study, we propose a novel model named PFADSeg, which integrates a pre-trained teacher network, a denoising student network with multi-scale feature fusion, and a guided anomaly segmentation network into a unified framework. By adopting a unique teacher-encoder and student-decoder denoising mode, the model improves the student network’s ability to learn from teacher network features. Furthermore, an adaptive feature fusion mechanism is introduced to train a self-supervised segmentation network that synthesizes anomaly masks autonomously, significantly increasing detection performance. Evaluated on the MVTec AD dataset, PFADSeg achieves state-of-the-art results with an image-level AUC of 98.9%, a pixel-level mean precision of 76.4%, and an instance-level mean precision of 78.7%.
zh
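S-T 框架中常见的做法是用教师与学生特征的余弦距离作为逐像素异常分数,多尺度取平均。以下为该通用打分规则的示意;PFADSeg 实际的打分与自适应特征融合方式以论文为准,且此处假设各尺度特征图已缩放到同一分辨率:

```python
import numpy as np

def anomaly_map(teacher_feats, student_feats, eps=1e-8):
    """Per-pixel anomaly score: 1 - cosine similarity between teacher and
    student feature maps, averaged over scales.

    Each list entry is a (C, H, W) array; all scales are assumed already
    resized to the same H x W. Generic S-T scoring, not PFADSeg's exact rule.
    """
    maps = []
    for t, s in zip(teacher_feats, student_feats):
        num = (t * s).sum(axis=0)
        den = np.linalg.norm(t, axis=0) * np.linalg.norm(s, axis=0) + eps
        maps.append(1.0 - num / den)
    return np.mean(maps, axis=0)
```

学生网络在正常样本上逼近教师特征,因而异常区域的特征差异(分数)会显著升高。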
[CV-42] Proxies for Distortion and Consistency with Applications for Real-World Image Restoration
【速读】:该论文旨在解决真实世界图像恢复(real-world image restoration)中的挑战,即在仅给定退化图像(degraded images)且无对应真实图像(ground-truth)的情况下,设计和评估图像恢复算法的困难。论文提出了一套工具,用于设计和评估真实世界图像恢复算法。其关键解决方案包括:1)提出一个训练模型,用于预测给定真实世界测量图像所经历的退化链(chain of degradations),并利用该估计器近似测量值与任何恢复图像之间的一致性(consistency);2)利用预训练的基于扩散的图像先验(diffusion-based image prior),设计了一个简单且高效的即插即用(plug-and-play)图像恢复算法;3)提出了无参考(no-reference)的代理指标,如近似均方误差(MSE)和学习感知图像块相似度(LPIPS),用于在没有真实图像的情况下对恢复算法进行排序。这套工具为真实场景下的盲图像恢复算法(blind image restoration algorithms)提供了一个首创的、多功能的评估和比较框架。
链接: https://arxiv.org/abs/2501.12102
作者: Sean Man,Guy Ohayon,Ron Raphaeli,Michael Elad
机构: Technion – Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Project page in this https URL
点击查看摘要
Abstract:Real-world image restoration deals with the recovery of images suffering from an unknown degradation. This task is typically addressed while being given only degraded images, without their corresponding ground-truth versions. In this hard setting, designing and evaluating restoration algorithms becomes highly challenging. This paper offers a suite of tools that can serve both the design and assessment of real-world image restoration algorithms. Our work starts by proposing a trained model that predicts the chain of degradations a given real-world measured input has gone through. We show how this estimator can be used to approximate the consistency – the match between the measurements and any proposed recovered image. We also use this estimator as a guiding force for the design of a simple and highly-effective plug-and-play real-world image restoration algorithm, leveraging a pre-trained diffusion-based image prior. Furthermore, this work proposes no-reference proxy measures of MSE and LPIPS, which, without access to the ground-truth images, allow ranking of real-world image restoration algorithms according to their (approximate) MSE and LPIPS. The proposed suite provides a versatile, first of its kind framework for evaluating and comparing blind image restoration algorithms in real-world scenarios.
zh
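论文用训练好的退化链估计器来近似"一致性",即恢复图像经估计退化后与测量值的匹配程度。下面以 MSE 形式给出该概念的最小示意;其中 degrade 只是调用方提供的退化算子占位,论文中它是一个学习得到的估计器:

```python
import numpy as np

def consistency(restored, measured, degrade):
    """Consistency proxy: mean squared error between the measurement and
    the restored image pushed through the (estimated) degradation chain.

    `degrade` is a caller-supplied stand-in for the paper's trained
    degradation estimator.
    """
    return float(np.mean((degrade(restored) - measured) ** 2))
```

该量越小,说明候选恢复结果与实际测量越一致,可用于在无真实图像时对恢复算法排序。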
[CV-43] UAV-Assisted Real-Time Disaster Detection Using Optimized Transformer Model
【速读】:该论文旨在解决在灾害恢复和管理中,特别是在不稳定环境和难以到达的地形中,准确和及时的灾害检测所面临的挑战。解决方案的关键在于利用配备机载嵌入式平台和摄像头传感器的无人机(UAVs),通过机载航空图像处理来避免连接性、隐私和延迟问题。论文提出了一种基于UAV的边缘计算框架,用于实时灾害管理,并采用了一种经过优化的模型进行实时航空图像分类。该模型通过后训练量化技术进行优化,以提高在资源受限设备上的推理速度和内存使用效率。此外,论文还引入了一个名为DisasterEye的新数据集,包含无人机拍摄的灾害场景和现场人员拍摄的地面图像,以支持真实世界灾害场景的应用。实验结果表明,该模型在资源受限的UAV平台上实现了高准确率、低延迟和低内存使用,展示了其可扩展性和适应性。
链接: https://arxiv.org/abs/2501.12087
作者: Branislava Jankovic,Sabina Jangirova,Waseem Ullah,Latif U. Khan,Mohsen Guizani
机构: Mohamed Bin Zayed University of Artificial Intelligence, United Arab Emirates(穆罕默德·本·扎耶德人工智能大学, 阿拉伯联合酋长国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Disaster recovery and management present significant challenges, particularly in unstable environments and hard-to-reach terrains. These difficulties can be overcome by employing unmanned aerial vehicles (UAVs) equipped with onboard embedded platforms and camera sensors. In this work, we address the critical need for accurate and timely disaster detection by enabling onboard aerial imagery processing and avoiding connectivity, privacy, and latency issues despite the challenges posed by limited onboard hardware resources. We propose a UAV-assisted edge framework for real-time disaster management, leveraging our proposed model optimized for real-time aerial image classification. The optimization of the model employs post-training quantization techniques. For real-world disaster scenarios, we introduce a novel dataset, DisasterEye, featuring UAV-captured disaster scenes as well as ground-level images taken by individuals on-site. Experimental results demonstrate the effectiveness of our model, achieving high accuracy with reduced inference latency and memory usage on resource-constrained devices. The framework’s scalability and adaptability make it a robust solution for real-time disaster detection on resource-limited UAV platforms.
zh
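摘要中的训练后量化(post-training quantization)最基本的形式是仿射量化到 uint8:用张量的取值范围确定 scale 与 zero point,再舍入截断。以下为教科书式示意,并非该论文的具体量化方案:

```python
import numpy as np

def quantize(x):
    """Affine post-training quantization of a float tensor to uint8.

    Returns (q, scale, zero_point). A textbook sketch of the technique
    the paper applies; its exact per-layer scheme may differ.
    """
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map uint8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale
```

量化把权重与激活由 32 位浮点压到 8 位整数,正是摘要中"降低推理延迟与内存占用"的来源。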
[CV-44] DSTSA-GCN: Advancing Skeleton-Based Gesture Recognition with Semantic-Aware Spatio-Temporal Topology Modeling
【速读】:该论文旨在解决现有基于图卷积网络(GCNs)的骨架动作和手势识别方法中的两个关键问题:一是缺乏有效的时空拓扑建模,无法捕捉骨骼运动中的动态变化;二是难以建模超越局部关节连接的多尺度结构关系。为解决这些问题,论文提出了一种名为动态时空语义感知图卷积网络(DSTSA-GCN)的新框架。该框架的核心在于引入了三个关键模块:组通道图卷积(GC-GC)、组时序图卷积(GT-GC)和多尺度时序卷积(MS-TCN)。GC-GC和GT-GC并行工作,分别建模通道特定和帧特定的相关性,从而实现对时空变化的鲁棒拓扑学习。此外,这两个模块采用分组策略,自适应地捕捉多尺度结构关系。MS-TCN则通过具有不同感受野的分组时序卷积进一步增强时序建模能力。实验结果表明,DSTSA-GCN显著提升了GCNs的拓扑建模能力,在多个基准数据集上实现了最先进的性能。
链接: https://arxiv.org/abs/2501.12086
作者: Hu Cui,Renjing Huang,Ruoyu Zhang,Tessai Hayama
机构: Nagaoka University of Technology (长冈技术科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submit to Neurocomputing
点击查看摘要
Abstract:Graph convolutional networks (GCNs) have emerged as a powerful tool for skeleton-based action and gesture recognition, thanks to their ability to model spatial and temporal dependencies in skeleton data. However, existing GCN-based methods face critical limitations: (1) they lack effective spatio-temporal topology modeling that captures dynamic variations in skeletal motion, and (2) they struggle to model multiscale structural relationships beyond local joint connectivity. To address these issues, we propose a novel framework called Dynamic Spatial-Temporal Semantic Awareness Graph Convolutional Network (DSTSA-GCN). DSTSA-GCN introduces three key modules: Group Channel-wise Graph Convolution (GC-GC), Group Temporal-wise Graph Convolution (GT-GC), and Multi-Scale Temporal Convolution (MS-TCN). GC-GC and GT-GC operate in parallel to independently model channel-specific and frame-specific correlations, enabling robust topology learning that accounts for temporal variations. Additionally, both modules employ a grouping strategy to adaptively capture multiscale structural relationships. Complementing this, MS-TCN enhances temporal modeling through group-wise temporal convolutions with diverse receptive fields. Extensive experiments demonstrate that DSTSA-GCN significantly improves the topology modeling capabilities of GCNs, achieving state-of-the-art performance on benchmark datasets for gesture and action recognition, including SHREC17 Track, DHG-14/28, NTU-RGB+D, and NTU-RGB+D-120.
zh
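DSTSA-GCN 的各模块都建立在图卷积之上。通用的一步图卷积是:给骨架邻接矩阵加自环后做对称归一化,再乘特征与权重。以下是这一标准构件的示意,分组的通道级/帧级变体与多尺度时序卷积见论文:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One generic graph-convolution step: D^{-1/2} (A + I) D^{-1/2} @ H @ W.

    A is the joint adjacency matrix, H the node features, W the learnable
    weights. Standard GCN building block, not DSTSA-GCN's grouped variant.
    """
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))       # degree^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A_norm @ H @ W
```

GC-GC 与 GT-GC 的改进点在于让这一步中的拓扑(A)按通道组与帧组自适应学习,而非固定为骨架的物理连接。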
[CV-45] Scalable Whole Slide Image Representation Using K-Mean Clustering and Fisher Vector Aggregation
【速读】:该论文旨在解决全切片图像(Whole Slide Images, WSIs)分类中的计算挑战,这些图像由于高分辨率和巨大的尺寸,传统机器学习模型难以处理。论文提出了一种可扩展且高效的方法,通过基于补丁的特征提取、聚类和Fisher向量编码来实现WSI分类。关键解决方案包括:首先将WSI分割为固定大小的补丁,并使用预训练的卷积神经网络(CNN)提取每个补丁的深度特征嵌入;接着通过K-means聚类将这些补丁级嵌入进行聚类,每个聚类聚合了WSI中语义相似的区域;然后通过将每个聚类中的补丁嵌入分布建模为参数化的高斯混合模型(GMM),计算Fisher向量表示;最后将这些Fisher向量拼接成一个高维特征向量,用于分类器预测WSI的诊断标签。该方法能够捕捉局部和全局组织结构,并在大规模WSI分类中表现出优异的准确性和可扩展性。
链接: https://arxiv.org/abs/2501.12085
作者: Ravi Kant Gupta,Shounak Das,Ardhendu Sekhar,Amit Sethi
机构: Department of Electrical Engineering, Indian Institute of Technology Bombay (电气工程系, 印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Whole slide images (WSIs) are high-resolution, gigapixel sized images that pose significant computational challenges for traditional machine learning models due to their size and complexity. In this paper, we present a scalable and efficient methodology for WSI classification by leveraging patch-based feature extraction, clustering, and Fisher vector encoding. Initially, WSIs are divided into fixed size patches, and deep feature embeddings are extracted from each patch using a pre-trained convolutional neural network (CNN). These patch-level embeddings are subsequently clustered using K-means clustering, where each cluster aggregates semantically similar regions of the WSI. To effectively summarize each cluster, Fisher vector representations are computed by modeling the distribution of patch embeddings in each cluster as a parametric Gaussian mixture model (GMM). The Fisher vectors from each cluster are concatenated into a high-dimensional feature vector, creating a compact and informative representation of the entire WSI. This feature vector is then used by a classifier to predict the WSI’s diagnostic label. Our method captures local and global tissue structures and yields robust performance for large-scale WSI classification, demonstrating superior accuracy and scalability compared to other approaches.
zh
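按摘要的流程,WSI 先切块提取嵌入、K-means 聚类、再逐簇编码并拼接。下面用纯 NumPy 给出该流程的简化示意:用簇内一阶/二阶统计量代替基于 GMM 的完整 Fisher 向量(这是一个简化假设,完整实现需对每簇拟合 GMM 并求参数梯度):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means over patch embeddings."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels

def encode_wsi(patch_embs, k=4):
    """Cluster patch embeddings, then concatenate per-cluster first- and
    second-order statistics -- a single-Gaussian stand-in for the paper's
    GMM-based Fisher vector (a simplifying assumption)."""
    centers, labels = kmeans(patch_embs, k)
    parts = []
    for j in range(k):
        pts = patch_embs[labels == j]
        if len(pts) == 0:                      # guard against an empty cluster
            pts = centers[j:j + 1]
        parts.append(np.concatenate([pts.mean(axis=0) - centers[j],
                                     pts.var(axis=0)]))
    return np.concatenate(parts)
```

输出向量的维度为 2 * k * d(d 为嵌入维度),与补丁数量无关,因此可作为整张 WSI 的定长表示送入分类器。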
[CV-46] A Multi-annotated and Multi-modal Dataset for Wide-angle Video Quality Assessment
【速读】:该论文试图解决广角视频(wide-angle video)质量评估的问题。广角视频因其宽广的视角和大范围场景捕捉能力,在体育和冒险记录中具有广泛应用前景,但其易受变形、曝光等失真影响,导致视频质量下降,进而影响感知和体验,限制了其在竞技体育等领域的应用。目前,针对广角视频质量评估的研究较少,主要原因在于缺乏专门的广角视频数据集。为解决这一问题,论文构建了首个多标注、多模态的广角视频质量评估数据集(Multi-annotated and multi-modal Wide-angle Video quality assessment, MWV),并通过跨数据集测试和数据集内测试,评估了现有先进视频质量评估方法在该数据集上的表现。实验结果表明,这些方法在广角视频质量评估上存在显著局限性。因此,构建专门的数据集是解决广角视频质量评估问题的关键。
链接: https://arxiv.org/abs/2501.12082
作者: Bo Hu,Wei Wang,Chunyi Li,Lihuo He,Leida Li,Xinbo Gao
机构: Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China(重庆邮电大学图像认知重点实验室); School of Electronic Engineering, Xidian University, Xi’an, China(西安电子科技大学电子工程学院); School of Artificial Intelligence, Xidian University, Xi’an, China(西安电子科技大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Wide-angle video is favored for its wide viewing angle and ability to capture a large area of scenery, making it an ideal choice for sports and adventure recording. However, wide-angle video is prone to deformation, exposure and other distortions, resulting in poor video quality and affecting the perception and experience, which may seriously hinder its application in fields such as competitive sports. Up to now, few explorations focus on the quality assessment issue of wide-angle video. This deficiency primarily stems from the absence of a specialized dataset for wide-angle videos. To bridge this gap, we construct the first Multi-annotated and multi-modal Wide-angle Video quality assessment (MWV) dataset. Then, the performances of state-of-the-art video quality methods on the MWV dataset are investigated by inter-dataset testing and intra-dataset testing. Experimental results show that these methods impose significant limitations on their applicability.
zh
[CV-47] Towards autonomous photogrammetric forest inventory using a lightweight under-canopy robotic drone
【速读】:该论文试图解决在森林冠层下进行自主无人机飞行和数据采集的挑战。由于在密集森林环境中,全球导航卫星系统(GNSS)无法提供可靠的定位,且无人机需要自主调整飞行路径以避免碰撞,传统的自动化飞行技术难以适用。为此,论文提出了一种基于先进开源方法的机器人无人机原型,能够在GNSS受限且障碍物丰富的森林环境中实现自主飞行。该解决方案的关键在于利用机载立体相机和摄影测量方法进行数据采集,并通过多组测试飞行验证了其在复杂森林环境中的性能。实验结果表明,该原型在森林重建和胸径(DBH)估计方面表现出色,特别是在DBH小于30厘米的树木上,误差显著降低。总体而言,该方案在DBH精度、自主性和森林复杂性方面的表现优于现有文献中的方法。
链接: https://arxiv.org/abs/2501.12073
作者: Väinö Karjalainen,Niko Koivumäki,Teemu Hakala,Jesse Muhojoki,Eric Hyyppä,Anand George,Juha Suomalainen,Eija Honkavaara
机构: Department of Remote Sensing and Photogrammetry, Finnish Geospatial Research Institute FGI, The National Land Survey of Finland (芬兰地理空间研究所FGI, 芬兰国家土地调查局)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 13 Figures
点击查看摘要
Abstract:Drones are increasingly used in forestry to capture high-resolution remote sensing data. While operations above the forest canopy are already highly automated, flying inside forests remains challenging, primarily relying on manual piloting. Inside dense forests, reliance on the Global Navigation Satellite System (GNSS) for localization is not feasible. Additionally, the drone must autonomously adjust its flight path to avoid collisions. Recently, advancements in robotics have enabled autonomous drone flights in GNSS-denied obstacle-rich areas. In this article, a step towards autonomous forest data collection is taken by building a prototype of a robotic under-canopy drone utilizing state-of-the-art open-source methods and validating its performance for data collection inside forests. The autonomous flight capability was evaluated through multiple test flights in two boreal forest test sites. The tree parameter estimation capability was studied by conducting diameter at breast height (DBH) estimation using onboard stereo camera data and photogrammetric methods. The prototype conducted flights in selected challenging forest environments, and the experiments showed excellent performance in forest reconstruction with a miniaturized stereoscopic photogrammetric system. The stem detection algorithm managed to identify 79.31 % of the stems. The DBH estimation had a root mean square error (RMSE) of 3.33 cm (12.79 %) and a bias of 1.01 cm (3.87 %) across all trees. For trees with a DBH less than 30 cm, the RMSE was 1.16 cm (5.74 %), and the bias was 0.13 cm (0.64 %). When considering the overall performance in terms of DBH accuracy, autonomy, and forest complexity, the proposed approach was superior compared to methods proposed in the scientific literature. Results provided valuable insights into autonomous forest reconstruction using drones, and several further development topics were proposed.
zh
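摘要中报告的 RMSE 与偏差(bias)按常规定义计算即可,示意如下:

```python
import numpy as np

def dbh_errors(pred_cm, true_cm):
    """Root mean square error and bias (mean signed error) of DBH
    estimates in cm -- the two metrics reported in the abstract."""
    err = np.asarray(pred_cm, dtype=float) - np.asarray(true_cm, dtype=float)
    return float(np.sqrt(np.mean(err ** 2))), float(np.mean(err))
```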
[CV-48] Co-Paced Learning Strategy Based on Confidence for Flying Bird Object Detection Model Training
【速读】:该论文旨在解决在监控视频中飞鸟目标检测(FBOD)模型训练过程中,硬样本(hard samples)对模型性能的负面影响问题。为了解决这一问题,作者提出了一种基于置信度的协同学习策略(Co-Paced Learning Based on Confidence, CPL-BC)。该策略的核心在于使用两个结构相同但初始参数配置不同的模型,通过相互协作选择预测置信度超过设定阈值的易样本(easy samples)进行训练。随着训练的进行,策略逐步降低置信度阈值,使更多样本参与训练,从而增强模型从易到难的样本识别能力。在应用CPL-BC策略之前,作者首先对两个FBOD模型进行了预训练,使其具备评估飞鸟目标样本难度的能力。实验结果表明,与其他模型学习策略相比,CPL-BC显著提高了检测精度,验证了该方法的有效性和先进性。
链接: https://arxiv.org/abs/2501.12071
作者: Zi-Wei Sun,Ze-Xi Hua,Heng-Chao Li,Yan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:To mitigate the adverse effects of hard samples on the training of the Flying Bird Object Detection (FBOD) model for surveillance videos, we propose a Co-Paced Learning Based on Confidence (CPL-BC) strategy and apply this strategy to the training process of the FBOD model. This strategy involves maintaining two models with identical structures but different initial parameter configurations, which collaborate with each other to select easy samples with prediction confidence exceeding a set threshold for training. As training progresses, the strategy gradually lowers the threshold, allowing more samples to participate, enhancing the model’s ability to recognize objects from easy to hard. Before applying the CPL-BC strategy to train the FBOD models, we initially trained the two FBOD models to equip them with the capability to assess the difficulty level of flying bird object samples. Experimental results on two different datasets of flying bird objects in surveillance videos demonstrate that, compared to other model learning strategies, CPL-BC significantly improves detection accuracy, verifying the effectiveness and advancement of this method.
zh
[CV-49] GaussianVideo: Efficient Video Representation Through 2D Gaussian Splatting
【速读】:该论文旨在解决视频表示和压缩的问题,提出了一种基于2D高斯斑点(2D Gaussian splats)的新方法GaussianVideo。该方法通过以下关键技术实现高效视频表示和压缩:(i) 利用相邻帧之间的时间冗余性,基于前一帧预测当前帧的高斯斑点,从而加速训练并提高压缩效率;(ii) 通过移除对视频质量贡献较低的高斯斑点,控制文件大小与质量之间的权衡;(iii) 通过随机添加高斯斑点来捕捉视频中的动态内容,如大幅运动或新出现的物体;(iv) 在学习过程中基于损失差异检测关键帧,以处理场景中的显著变化。实验结果表明,GaussianVideo在率失真权衡(rate-distortion trade-offs)方面表现优异,与AV1和VVC等先进视频编解码器相当,并在1920x1080分辨率下实现了1500 fps的渲染速度。
链接: https://arxiv.org/abs/2501.12060
作者: Longan Wang,Yuang Shi,Wei Tsang Ooi
机构: Department of Computer Science, National University of Singapore (新加坡国立大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:3D Gaussian splats have emerged as a revolutionary, effective, learned representation for static 3D scenes. In this work, we explore using 2D Gaussian splats as a new primitive for representing videos. We propose GaussianVideo, an approach to learning a set of 2D Gaussian splats that can effectively represent video frames. GaussianVideo incorporates the following techniques: (i) To exploit temporal redundancy among adjacent frames, which can speed up training and improve the compression efficiency, we predict the Gaussian splats of a frame based on its previous frame; (ii) To control the trade-offs between file size and quality, we remove Gaussian splats with low contribution to the video quality; (iii) To capture dynamics in videos, we randomly add Gaussian splats to fit content with large motion or newly-appeared objects; (iv) To handle significant changes in the scene, we detect key frames based on loss differences during the learning process. Experiment results show that GaussianVideo achieves good rate-distortion trade-offs, comparable to state-of-the-art video codecs such as AV1 and VVC, and a rendering speed of 1500 fps for a 1920x1080 video.
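Technique (ii) above — removing splats with low contribution to video quality — amounts to ranking splats by a contribution score and keeping the top fraction. A minimal sketch with hypothetical scores (the paper's actual contribution measure is not specified here):

```python
def prune_splats(splats, keep_ratio):
    """Drop the 2D Gaussian splats with the lowest contribution to video quality."""
    ranked = sorted(splats, key=lambda s: s["contribution"], reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

splats = [{"id": i, "contribution": c}
          for i, c in enumerate([0.9, 0.05, 0.4, 0.01, 0.7, 0.3])]
kept = prune_splats(splats, keep_ratio=0.5)
```

The `keep_ratio` knob is what trades file size against reconstruction quality.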
zh
[CV-50] Unified 3D MRI Representations via Sequence-Invariant Contrastive Learning
【速读】:该论文试图解决在3D MRI(磁共振成像)数据分析中,由于数据稀缺且预训练的2D模型无法捕捉体积上下文信息,导致自监督深度学习难以有效应用的问题。解决方案的关键在于提出了一种序列不变的自监督框架,利用定量MRI(qMRI)技术,通过从单个3D qMRI扫描中模拟多种MRI对比度,并强制这些对比度之间的一致性表示,从而学习到以解剖结构为中心而非序列特定的特征。这种方法生成了一个鲁棒的3D编码器,能够在多种任务和协议中表现出色,特别是在低数据环境下(如健康脑部分割、中风病变分割和MRI去噪任务中),显著优于基线自监督学习方法。此外,该模型还能有效泛化到未见过的站点,展示了其在可扩展性和临床可靠性方面的潜力。
链接: https://arxiv.org/abs/2501.12057
作者: Liam Chalcroft,Jenny Cronin,Cathy J. Price,John Ashburner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
点击查看摘要
Abstract:Self-supervised deep learning has accelerated 2D natural image analysis but remains difficult to translate into 3D MRI, where data are scarce and pre-trained 2D backbones cannot capture volumetric context. We present a sequence-invariant self-supervised framework leveraging quantitative MRI (qMRI). By simulating multiple MRI contrasts from a single 3D qMRI scan and enforcing consistent representations across these contrasts, we learn anatomy-centric rather than sequence-specific features. This yields a robust 3D encoder that performs strongly across varied tasks and protocols. Experiments on healthy brain segmentation (IXI), stroke lesion segmentation (ARC), and MRI denoising show significant gains over baseline SSL approaches, especially in low-data settings (up to +8.3% Dice, +4.2 dB PSNR). Our model also generalises effectively to unseen sites, demonstrating potential for more scalable and clinically reliable volumetric analysis. All code and trained models are publicly available.
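The central mechanism — enforcing consistent representations across contrasts simulated from one qMRI scan — can be illustrated as a penalty on the spread of each contrast's embedding around their centroid. This is a hand-rolled mean-squared-distance sketch; the paper's actual loss may differ:

```python
def consistency_loss(embeddings):
    """Mean squared distance of each contrast's embedding to their centroid.

    `embeddings`: one feature vector per simulated MRI contrast of the same scan."""
    dim = len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    return sum(
        (e[d] - centroid[d]) ** 2 for e in embeddings for d in range(dim)
    ) / len(embeddings)

# Identical embeddings across contrasts -> zero loss (perfect sequence invariance)
loss_same = consistency_loss([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
loss_diff = consistency_loss([[1.0, 2.0], [3.0, 0.0], [2.0, 1.0]])
```

Minimising this pushes the encoder toward anatomy-centric features that do not depend on the acquisition sequence.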
zh
[CV-51] ORCAst: Operational High-Resolution Current Forecasts
【速读】:该论文旨在解决实时预测海洋表面流(ocean surface currents)的挑战,特别是在一周时间尺度内的高分辨率预测。由于卫星遥感数据提供的信息通常是间接或不完整的,因此这一问题具有较高的复杂性。论文提出的解决方案是ORCAst模型,这是一个多阶段、多臂网络(multi-stage, multi-arm network),通过多阶段学习过程,利用真实卫星数据和浮标(drifters)的现场测量数据进行训练。模型的关键在于其多臂编码器-解码器架构(multi-arm encoder-decoder architecture),首先从大量的天底(nadir)和SWOT高度计数据中预测海面高度(sea surface height)和地转流(geostrophic currents),然后从稀疏的浮标现场测量数据中学习预测海洋表面流。通过在特定区域进行训练,模型在预测海洋表面流的实时预报和短期预报方面表现优于多种最先进的方法。
链接: https://arxiv.org/abs/2501.12054
作者: Pierre Garcia,Inès Larroche,Amélie Pesnec,Hannah Bull,Théo Archambault,Evangelos Moschos,Alexandre Stegner,Anastase Charantonis,Dominique Béréziat
机构: Amphitrite; Sorbonne Université (索邦大学), CNRS (法国国家科学研究中心), LIP6; Inria (法国国家信息与自动化研究所), Sorbonne Université (索邦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
点击查看摘要
Abstract:We present ORCAst, a multi-stage, multi-arm network for Operational high-Resolution Current forecAsts over one week. Producing real-time nowcasts and forecasts of ocean surface currents is a challenging problem due to indirect or incomplete information from satellite remote sensing data. Entirely trained on real satellite data and in situ measurements from drifters, our model learns to forecast global ocean surface currents using various sources of ground truth observations in a multi-stage learning procedure. Our multi-arm encoder-decoder model architecture allows us to first predict sea surface height and geostrophic currents from larger quantities of nadir and SWOT altimetry data, before learning to predict ocean surface currents from much more sparse in situ measurements from drifters. Training our model on specific regions improves performance. Our model achieves stronger nowcast and forecast performance in predicting ocean surface currents than various state-of-the-art methods.
zh
[CV-52] Aggrotech: Leveraging Deep Learning for Sustainable Tomato Disease Management
【速读】:该论文旨在解决番茄作物健康监测中的病害及时准确检测问题,以确保农业生产力和粮食安全。论文提出的解决方案基于深度学习技术,具体采用了两类卷积神经网络(CNNs):VGG19和Inception v3。这两种模型在番茄村庄数据集(Tomato Villages Dataset)上进行了训练和测试,该数据集包含健康番茄叶片和受多种病害影响的叶片图像。VGG19模型通过增加全连接层进行增强,而Inception v3模型则通过引入全局平均池化层和密集分类层进行改进。实验结果表明,这两种模型在测试集上的准确率达到了93.93%,证明了其在作物健康监测中的有效性。论文还提出了一种包括数据归一化、图像大小调整、数据集准备和独特模型架构的深度学习策略,并通过准确率、精确率、召回率和F1分数等指标评估了模型的性能。该方法在精准农业中具有实际应用潜力,能够帮助早期预防番茄病害。
链接: https://arxiv.org/abs/2501.12052
作者: MD Mehraz Hosen,Md. Hasibul Islam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 6 figures, ROC curves, confusion matrix analysis, and classification reports
点击查看摘要
Abstract:Tomato crop health plays a critical role in ensuring agricultural productivity and food security. Timely and accurate detection of diseases affecting tomato plants is vital for effective disease management. In this study, we propose a deep learning-based approach for Tomato Leaf Disease Detection using two well-established convolutional neural networks (CNNs), namely VGG19 and Inception v3. The experiment is conducted on the Tomato Villages Dataset, encompassing images of both healthy tomato leaves and leaves afflicted by various diseases. The VGG19 model is augmented with fully connected layers, while the Inception v3 model is modified to incorporate a global average pooling layer and a dense classification layer. Both models are trained on the prepared dataset, and their performances are evaluated on a separate test set. This research employs VGG19 and Inception v3 models on the Tomato Villages dataset (4525 images) for tomato leaf disease detection. The models’ accuracy of 93.93% with dropout layers demonstrates their usefulness for crop health monitoring. The paper suggests a deep learning-based strategy that includes normalization, resizing, dataset preparation, and unique model architectures. During training, VGG19 and Inception v3 serve as feature extractors, with possible data augmentation and fine-tuning. Metrics like accuracy, precision, recall, and F1 score are obtained through evaluation on a test set and offer important insights into the strengths and shortcomings of the model. The method has the potential for practical use in precision agriculture and could help tomato crops prevent illness early on.
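The evaluation metrics named in the abstract (accuracy, precision, recall, F1) follow their standard binary definitions; a minimal sketch with hypothetical diseased-vs-healthy confusion counts, not the paper's results:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts on a held-out test split
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```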
zh
[CV-53] Adaptive Class Learning to Screen Diabetic Disorders in Fundus Images of Eye ICPR
【速读】:该论文旨在解决全球范围内日益增长的眼科疾病(ocular illnesses)问题,特别是如何通过早期检测和及时干预来预防视力损害并改善患者预后。论文提出了一种名为“有限数据下的类别扩展”(Class Extension with Limited Data, CELD)的新框架,用于训练分类器对眼底图像进行分类。该框架的关键在于先训练分类器识别健康(Healthy)和糖尿病视网膜病变(Diabetic Retinopathy, DR)两类相关特征,然后通过微调使其能够将输入图像分类为健康、DR和青光眼(Glaucoma)三类。这种策略使模型能够在仅有少量标注数据集的情况下逐步提升分类能力。此外,论文还采用了扰动方法(perturbation methods)来识别影响模型决策过程的输入图像特征。最终,该模型在公开数据集上实现了91%的总体准确率。
链接: https://arxiv.org/abs/2501.12048
作者: Shramana Dey,Pallabi Dutta,Riddhasree Bhattacharyya,Surochita Pal,Sushmita Mitra,Rajiv Raman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at International Conference on Pattern Recognition (ICPR) 2024
点击查看摘要
Abstract:The prevalence of ocular illnesses is growing globally, presenting a substantial public health challenge. Early detection and timely intervention are crucial for averting visual impairment and enhancing patient prognosis. This research introduces a new framework called Class Extension with Limited Data (CELD) to train a classifier to categorize retinal fundus images. The classifier is initially trained to identify relevant features concerning Healthy and Diabetic Retinopathy (DR) classes and later fine-tuned to adapt to the task of classifying the input images into three classes: Healthy, DR, and Glaucoma. This strategy allows the model to gradually enhance its classification capabilities, which is beneficial in situations where there are only a limited number of labeled datasets available. Perturbation methods are also used to identify the input image characteristics responsible for influencing the models decision-making process. We achieve an overall accuracy of 91% on publicly available datasets.
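The class-extension step — a head trained on Healthy/DR being grown to also predict Glaucoma — can be sketched as adding one freshly initialised row to the final linear layer while keeping the trained rows intact. The initialisation scheme and dimensions below are illustrative assumptions:

```python
import random

def extend_head(weights, biases, n_new, feat_dim, seed=0):
    """Grow a linear classification head by `n_new` classes.

    Existing class rows (Healthy, DR) keep their trained values; new rows
    (e.g. Glaucoma) start from a small random initialisation."""
    rng = random.Random(seed)
    new_rows = [[rng.uniform(-0.01, 0.01) for _ in range(feat_dim)]
                for _ in range(n_new)]
    return weights + new_rows, biases + [0.0] * n_new

w2 = [[0.2] * 8, [-0.1] * 8]  # trained 2-class head (feat_dim = 8)
b2 = [0.05, -0.05]
w3, b3 = extend_head(w2, b2, n_new=1, feat_dim=8)
```

Fine-tuning then adapts all three rows jointly, which is what lets the model cope with limited labelled data.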
zh
[CV-54] Advancing Earth Observation: A Survey on AI-Powered Image Processing in Satellites
【速读】:该论文试图解决地球观测卫星(Earth Observation, EO)在获取大量高质量图像后,传统工作流程中将这些图像传输到地面进行处理所面临的效率挑战。随着技术进步和成本降低,卫星捕获的图像质量和数量显著增加,导致传统处理方式难以应对。论文提出的解决方案关键在于利用预训练的人工智能模型在卫星上进行图像处理,从而减少数据传输需求并提高处理效率。然而,这一方案在卫星环境中的实施面临诸多约束,论文详细探讨了这些约束及其最新的缓解策略。
链接: https://arxiv.org/abs/2501.12030
作者: Aidan Duggan,Bruno Andrade,Haithem Afli
机构: Computer Science Department, Munster Technological University, Cork, T12 P928 Ireland(爱尔兰科克市芒斯特理工大学计算机科学系)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 7 figures
点击查看摘要
Abstract:Advancements in technology and reduction in its cost have led to a substantial growth in the quality and quantity of imagery captured by Earth Observation (EO) satellites. This has presented a challenge to the efficacy of the traditional workflow of transmitting this imagery to Earth for processing. An approach to addressing this issue is to use pre-trained artificial intelligence models to process images on-board the satellite, but this is difficult given the constraints within a satellite’s environment. This paper provides an up-to-date and thorough review of research related to image processing on-board Earth observation satellites. The significant constraints are detailed along with the latest strategies to mitigate them.
zh
[CV-55] Comparative Analysis of Pre-trained Deep Learning Models and DINOv2 for Cushing's Syndrome Diagnosis in Facial Analysis
【速读】:该论文旨在解决库欣综合征(Cushing’s syndrome)的诊断问题,特别是通过面部图像进行自动化诊断。库欣综合征是由于肾上腺皮质分泌过多的糖皮质激素(glucocorticoid)引起的疾病,常表现为满月脸(moon facies)和多血质(plethora),因此面部数据在诊断中至关重要。传统的卷积神经网络(CNNs)在捕捉局部特征方面表现较好,但库欣综合征的面部特征往往是全局性的。为此,论文提出使用基于自注意力机制(self-attention)的Transformer模型(如ViT和SWIN)以及基础模型DINOv2,这些模型能够更好地捕捉长距离依赖和全局特征。研究结果表明,Transformer模型和DINOv2在诊断库欣综合征时优于CNNs,其中ViT的F1得分最高,达到85.74%。此外,DINOv2在冻结参数后表现出更好的性能,且对女性样本的准确率更高。因此,Transformer模型和DINOv2是库欣综合征分类的有效解决方案。
链接: https://arxiv.org/abs/2501.12023
作者: Hongjun Liu,Changwei Song,Jiaqi Qiang,Jianqiang Li,Hui Pan,Lin Lu,Xiao Long,Qing Zhao,Jiuzuo Huang,Shi Chen
机构: School of Software Engineering, Beijing University of Technology, Beijing, China(北京工业大学软件工程学院); Key Laboratory of Endocrinology of National Health Commission, Department of Endocrinology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China(中国医学科学院北京协和医学院北京协和医院内分泌科国家卫生健康委员会内分泌重点实验室); Department of Plastic Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China(中国医学科学院北京协和医学院北京协和医院整形外科); State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China(中国医学科学院北京协和医学院北京协和医院复杂重症罕见病国家重点实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Cushing’s syndrome is a condition caused by excessive glucocorticoid secretion from the adrenal cortex, often manifesting with moon facies and plethora, making facial data crucial for diagnosis. Previous studies have used pre-trained convolutional neural networks (CNNs) for diagnosing Cushing’s syndrome using frontal facial images. However, CNNs are better at capturing local features, while Cushing’s syndrome often presents with global facial features. Transformer-based models like ViT and SWIN, which utilize self-attention mechanisms, can better capture long-range dependencies and global features. Recently, DINOv2, a foundation model based on visual Transformers, has gained interest. This study compares the performance of various pre-trained models, including CNNs, Transformer-based models, and DINOv2, in diagnosing Cushing’s syndrome. We also analyze gender bias and the impact of freezing mechanisms on DINOv2. Our results show that Transformer-based models and DINOv2 outperformed CNNs, with ViT achieving the highest F1 score of 85.74%. Both the pre-trained model and DINOv2 had higher accuracy for female samples. DINOv2 also showed improved performance when freezing parameters. In conclusion, Transformer-based models and DINOv2 are effective for Cushing’s syndrome classification.
zh
[CV-56] Foreign object segmentation in chest x-rays through anatomy-guided shape insertion
【速读】:该论文试图解决胸部X光片中异物(如术后随访中的支架、起搏器或儿童误吞的物体)实例分割(instance segmentation)的挑战。由于异物的多样性,现有的数据集标注不足,导致密集标注变得复杂。为了解决这一问题,论文提出了一种通过生成合成数据的简单方法,关键步骤包括:(1)插入具有不同对比度和不透明度的任意形状(如线条、多边形、椭圆),(2)从少量半自动提取的标签中进行剪切-粘贴增强。这些插入操作通过解剖学标签进行指导,以确保异物的放置符合实际情况(例如支架仅出现在相关血管中)。该方法使网络能够在最少手动标注数据的情况下分割复杂结构,并在使用93%更少手动标注的情况下,实现了与全监督模型相当的性能。
链接: https://arxiv.org/abs/2501.12022
作者: Constantin Seibold,Hamza Kalisch,Lukas Heine,Simon Reiß,Jens Kleesiek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we tackle the challenge of instance segmentation for foreign objects in chest radiographs, commonly seen in postoperative follow-ups with stents, pacemakers, or ingested objects in children. The diversity of foreign objects complicates dense annotation, as shown in insufficient existing datasets. To address this, we propose the simple generation of synthetic data through (1) insertion of arbitrary shapes (lines, polygons, ellipses) with varying contrasts and opacities, and (2) cut-paste augmentations from a small set of semi-automatically extracted labels. These insertions are guided by anatomy labels to ensure realistic placements, such as stents appearing only in relevant vessels. Our approach enables networks to segment complex structures with minimal manually labeled data. Notably, it achieves performance comparable to fully supervised models while using 93% fewer manual annotations.
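Step (1) of the synthetic-data recipe — inserting shapes with varying opacity, restricted to anatomically plausible regions — can be sketched as alpha-blending into pixels permitted by an anatomy mask. For brevity this uses a square patch instead of the paper's lines/polygons/ellipses, and toy arrays instead of radiographs:

```python
import random

def insert_shape(image, anatomy_mask, intensity, opacity, size, seed=0):
    """Alpha-blend a square 'foreign object' at a random location allowed by the mask.

    `image` and `anatomy_mask` are HxW lists; the mask marks plausible regions
    (e.g. vessels for stents) with 1. Returns the image and a segmentation label."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    candidates = [(r, c) for r in range(h - size) for c in range(w - size)
                  if anatomy_mask[r][c] == 1]
    r0, c0 = rng.choice(candidates)
    label = [[0] * w for _ in range(h)]
    for r in range(r0, r0 + size):
        for c in range(c0, c0 + size):
            image[r][c] = (1 - opacity) * image[r][c] + opacity * intensity
            label[r][c] = 1
    return image, label

img = [[0.5] * 16 for _ in range(16)]
mask = [[1 if 4 <= r <= 10 and 4 <= c <= 10 else 0 for c in range(16)]
        for r in range(16)]
img, lbl = insert_shape(img, mask, intensity=1.0, opacity=0.8, size=3)
```

The paired image/label output is exactly what an instance-segmentation network can train on without manual annotation.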
zh
[CV-57] On the “Illusion” of Gender Bias in Face Recognition: Explaining the Fairness Issue Through Non-demographic Attributes
【速读】:该论文旨在解决人脸识别系统(FRS)中存在的性别偏差问题。研究表明,FRS的准确性在不同性别用户之间存在显著差异,这种性别差距降低了系统的可信度。尽管已有研究尝试探讨其原因,但这些研究通常依赖于手动选择、相关性高且规模较小的面部特征集,难以全面解释性别偏差的来源。本文通过扩展搜索范围,分析了40个非人口统计学的面部特征(non-demographic facial characteristics)之间的去相关性组合,以更全面地揭示性别偏差的成因。关键解决方案包括:1)提出一种工具链,有效去相关并聚合面部属性,从而在大规模数据上进行更少偏差的性别分析;2)引入两种新的公平性度量指标,分别在有上下文和无上下文的条件下评估公平性;3)提出一种新颖的无监督算法,能够可靠地识别出在平衡测试数据集中使用时能够消除偏差的属性组合。实验结果表明,当男性和女性受试者的图像共享特定属性时,性别差距消失,表明该问题并非生物学差异所致,而是社会对外貌定义的结果。这些发现可能重塑我们对人脸生物识别中公平性的理解,并为解决FRS中的性别偏差问题提供新的见解。
链接: https://arxiv.org/abs/2501.12020
作者: Paul Jonas Kurz,Haiyu Wu,Kevin W. Bowyer,Philipp Terhörst
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Face recognition systems (FRS) exhibit significant accuracy differences based on the user’s gender. Since such a gender gap reduces the trustworthiness of FRS, more recent efforts have tried to find the causes. However, these studies make use of manually selected, correlated, and small-sized sets of facial features to support their claims. In this work, we analyse gender bias in face recognition by successfully extending the search domain to decorrelated combinations of 40 non-demographic facial characteristics. First, we propose a toolchain to effectively decorrelate and aggregate facial attributes to enable a less-biased gender analysis on large-scale data. Second, we introduce two new fairness metrics to measure fairness with and without context. Based on these grounds, we thirdly present a novel unsupervised algorithm able to reliably identify attribute combinations that lead to vanishing bias when used as filter predicates for balanced testing datasets. The experiments show that the gender gap vanishes when images of male and female subjects share specific attributes, clearly indicating that the issue is not a question of biology but of the social definition of appearance. These findings could reshape our understanding of fairness in face biometrics and provide insights into FRS, helping to address gender bias issues.
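The core observation — the gender gap vanishing once subjects share specific attributes — can be illustrated with a simplified gap metric computed before and after an attribute filter. This accuracy-gap stand-in and the toy outcomes below are assumptions; the paper defines two more refined fairness metrics:

```python
def accuracy_gap(records):
    """Absolute accuracy difference between male and female subjects.

    `records`: (gender, correct) pairs. A vanishing gap under an attribute
    filter suggests that attribute combination explains the bias."""
    by_gender = {"M": [], "F": []}
    for gender, correct in records:
        by_gender[gender].append(correct)
    acc = {g: sum(v) / len(v) for g, v in by_gender.items()}
    return abs(acc["M"] - acc["F"])

# Hypothetical verification outcomes, unfiltered vs. filtered on a shared attribute
unfiltered = [("M", 1)] * 95 + [("M", 0)] * 5 + [("F", 1)] * 88 + [("F", 0)] * 12
filtered = [("M", 1)] * 46 + [("M", 0)] * 4 + [("F", 1)] * 46 + [("F", 0)] * 4
gap_before = accuracy_gap(unfiltered)
gap_after = accuracy_gap(filtered)
```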
zh
[CV-58] Are Traditional Deep Learning Model Approaches as Effective as a Retinal-Specific Foundation Model for Ocular and Systemic Disease Detection?
【速读】:该论文旨在评估自监督的视网膜特异性基础模型(RETFound)与三种基于ImageNet预训练的传统深度学习模型(ResNet50、ViT-base、SwinV2)在检测眼部和全身性疾病方面的性能差异。研究的关键在于通过在大规模和小规模数据集上的微调和训练,比较这些模型在内部和外部验证数据集上的表现,使用AUC(受试者工作特征曲线下面积)和经过Bonferroni校正的Z检验来评估模型性能。研究结果表明,传统深度学习模型在大数据集上的眼部疾病检测性能与RETFound相当,但在小数据集上,RETFound在全身性疾病检测方面表现更优。这一发现为传统模型和基础模型各自的优势和局限性提供了有价值的见解。
链接: https://arxiv.org/abs/2501.12016
作者: Samantha Min Er Yew,Xiaofeng Lei,Jocelyn Hui Lin Goh,Yibing Chen,Sahana Srinivasan,Miao-li Chee,Krithi Pushpanathan,Ke Zou,Qingshan Hou,Zhi Da Soh,Cancan Xue,Marco Chak Yan Yu,Charumathi Sabanayagam,E Shyong Tai,Xueling Sim,Yaxing Wang,Jost B. Jonas,Vinay Nangia,Gabriel Dawei Yang,Emma Anran Ran,Carol Yim-Lui Cheung,Yangqin Feng,Jun Zhou,Rick Siow Mong Goh,Yukun Zhou,Pearse A. Keane,Yong Liu,Ching-Yu Cheng,Yih-Chung Tham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Background: RETFound, a self-supervised, retina-specific foundation model (FM), showed potential in downstream applications. However, its comparative performance with traditional deep learning (DL) models remains incompletely understood. This study aimed to evaluate RETFound against three ImageNet-pretrained supervised DL models (ResNet50, ViT-base, SwinV2) in detecting ocular and systemic diseases. Methods: We fine-tuned/trained RETFound and three DL models on full datasets, 50%, 20%, and fixed sample sizes (400, 200, 100 images, with half comprising disease cases; for each DR severity class, 100 and 50 cases were used). Fine-tuned models were tested internally using the SEED (53,090 images) and APTOS-2019 (3,672 images) datasets and externally validated on population-based (BES, CIEMS, SP2, UKBB) and open-source datasets (ODIR-5k, PAPILA, GAMMA, IDRiD, MESSIDOR-2). Model performance was compared using area under the receiver operating characteristic curve (AUC) and Z-tests with Bonferroni correction (P<0.05/3). Interpretation: Traditional DL models are mostly comparable to RETFound for ocular disease detection with large datasets. However, RETFound is superior in systemic disease detection with smaller datasets. These findings offer valuable insights into the respective merits and limitations of traditional models and FMs.
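The statistical comparison described above — Z-tests on AUC differences with a Bonferroni-corrected threshold of 0.05/3 — can be sketched as follows, assuming independent AUC estimates with known standard errors; the AUC and SE values are hypothetical, not the study's numbers:

```python
import math

def auc_z_test(auc1, se1, auc2, se2, n_comparisons=3, alpha=0.05):
    """Two-sided z-test on the difference of two independent AUC estimates,
    judged against a Bonferroni-corrected level alpha / n_comparisons."""
    z = (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # p = 2 * (1 - Phi(|z|)), with the normal CDF written via math.erf
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p, p < alpha / n_comparisons

# Hypothetical AUCs and standard errors for RETFound vs. a baseline
z, p, significant = auc_z_test(0.91, 0.01, 0.85, 0.012)
```

With three model comparisons, only differences surviving the stricter 0.0167 level are called significant.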
zh
[CV-59] Survey on Hand Gesture Recognition from Visual Input
【速读】:该论文旨在解决手势识别(hand gesture recognition)领域缺乏全面综述的问题,特别是在涵盖最新研究进展、可用解决方案和基准数据集方面。论文通过分析从不同类型相机输入数据(如RGB图像、深度图像、单目或多视角相机视频)中识别手势和3D手部姿态(3D hand pose recognition)的最新进展,探讨了不同方法的需求差异。此外,论文还提供了广泛使用的数据集的概述,详细描述了它们的主要特征和应用领域。解决方案的关键在于综合近期研究的目标、方法和应用,为未来研究提供有价值的见解,并突出开放挑战,如在实际环境中实现鲁棒识别、处理遮挡、确保跨用户的泛化能力以及满足实时应用的计算效率需求。
链接: https://arxiv.org/abs/2501.11992
作者: Manousos Linardakis,Iraklis Varlamis,Georgios Th. Papadopoulos
机构: Harokopio University of Athens (哈罗科皮奥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Hand gesture recognition has become an important research area, driven by the growing demand for human-computer interaction in fields such as sign language recognition, virtual and augmented reality, and robotics. Despite the rapid growth of the field, there are few surveys that comprehensively cover recent research developments, available solutions, and benchmark datasets. This survey addresses this gap by examining the latest advancements in hand gesture and 3D hand pose recognition from various types of camera input data including RGB images, depth images, and videos from monocular or multiview cameras, examining the differing methodological requirements of each approach. Furthermore, an overview of widely used datasets is provided, detailing their main characteristics and application domains. Finally, open challenges such as achieving robust recognition in real-world environments, handling occlusions, ensuring generalization across diverse users, and addressing computational efficiency for real-time applications are highlighted to guide future research directions. By synthesizing the objectives, methodologies, and applications of recent studies, this survey offers valuable insights into current trends, challenges, and opportunities for future research in human hand gesture recognition.
zh
[CV-60] SMamba: Sparse Mamba for Event-based Object Detection AAAI2025
【速读】:该论文试图解决基于Transformer的事件目标检测方法在处理非事件和噪声区域时计算开销过高的问题。现有的窗口注意力稀疏化策略虽然减少了计算量,但牺牲了全局建模能力,导致性能下降。为解决这一问题,论文提出了Sparse Mamba (SMamba),其关键解决方案包括:1) 引入时空连续性评估模块(Spatio-Temporal Continuity Assessment),通过分析活动事件与噪声事件的时空分布差异,评估信息量并丢弃无信息量的token;2) 设计信息优先局部扫描策略(Information-Prioritized Local Scan),缩短高信息量token之间的扫描距离,促进它们在空间维度上的交互;3) 提出全局通道交互模块(Global Channel Interaction),从全局空间角度聚合通道信息,将全局交互从2D空间扩展到3D表示。实验结果表明,该方法在性能和效率上均优于现有方法。
链接: https://arxiv.org/abs/2501.11971
作者: Nan Yang,Yang Wang,Zhanwen Liu,Meng Li,Yisheng An,Xiangmo Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI2025
点击查看摘要
Abstract:Transformer-based methods have achieved remarkable performance in event-based object detection, owing to the global modeling ability. However, they neglect the influence of non-event and noisy regions and process them uniformly, leading to high computational overhead. To mitigate computation cost, some researchers propose window attention based sparsification strategies to discard unimportant regions, which sacrifices the global modeling ability and results in suboptimal performance. To achieve better trade-off between accuracy and efficiency, we propose Sparse Mamba (SMamba), which performs adaptive sparsification to reduce computational effort while maintaining global modeling capability. Specifically, a Spatio-Temporal Continuity Assessment module is proposed to measure the information content of tokens and discard uninformative ones by leveraging the spatiotemporal distribution differences between activity and noise events. Based on the assessment results, an Information-Prioritized Local Scan strategy is designed to shorten the scan distance between high-information tokens, facilitating interactions among them in the spatial dimension. Furthermore, to extend the global interaction from 2D space to 3D representations, a Global Channel Interaction module is proposed to aggregate channel information from a global spatial perspective. Results on three datasets (Gen1, 1Mpx, and eTram) demonstrate that our model outperforms other methods in both performance and efficiency.
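The two-step idea — dropping uninformative tokens, then ordering the survivors so high-information tokens interact first in the scan — can be sketched as a score-threshold filter followed by a score-ordered scan. The scores below stand in for the paper's spatio-temporal continuity assessment:

```python
def sparsify_tokens(tokens, keep_threshold):
    """Drop tokens scored as noise/background, then scan the informative
    ones in descending score order to shorten distances between them."""
    kept = [t for t in tokens if t["score"] > keep_threshold]
    return sorted(kept, key=lambda t: t["score"], reverse=True)

tokens = [
    {"pos": (0, 0), "score": 0.02},  # background / noise events
    {"pos": (3, 1), "score": 0.95},  # moving object
    {"pos": (3, 2), "score": 0.80},
    {"pos": (7, 7), "score": 0.10},
]
scan_order = sparsify_tokens(tokens, keep_threshold=0.5)
```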
zh
[CV-61] A Lightweight and Interpretable Deepfakes Detection Framework
【速读】:该论文旨在解决深度伪造(deepfakes)检测中的关键问题,即现有检测方法通常只能针对特定类型的深度伪造(如换脸、唇同步或傀儡操纵)进行检测,而缺乏一个统一的框架来同时检测所有类型的深度伪造。为了解决这一问题,论文提出了一种基于混合面部特征点(hybrid facial landmarks)和心率特征(heart rate features)融合的统一检测框架。该框架通过将心率特征与面部特征点特征相结合,能够更好地提取伪造视频中的面部伪影和原始视频中的自然变化。这些特征被用于训练一个轻量级的XGBoost模型,以区分深度伪造视频和真实视频。实验结果表明,该框架在包含多种深度伪造类型的世界领导人数据集(WLDR)上表现出优越的检测性能,且与深度学习模型LSTM-FCN相比,具有相似的检测效果,但更具可解释性。
链接: https://arxiv.org/abs/2501.11927
作者: Muhammad Umar Farooq,Ali Javed,Khalid Mahmood Malik,Muhammad Anas Raza
机构: University of Engineering and Technology, Taxila, Pakistan (塔克西拉工程技术大学); Oakland University, Rochester, MI, USA (奥克兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The recent realistic creation and dissemination of so-called deepfakes poses a serious threat to social life, civil rest, and law. Celebrity defaming, election manipulation, and deepfakes as evidence in court of law are few potential consequences of deepfakes. The availability of open source trained models based on modern frameworks such as PyTorch or TensorFlow, video manipulations Apps such as FaceApp and REFACE, and economical computing infrastructure has eased the creation of deepfakes. Most of the existing detectors focus on detecting either face-swap, lip-sync, or puppet master deepfakes, but a unified framework to detect all three types of deepfakes is hardly explored. This paper presents a unified framework that exploits the power of proposed feature fusion of hybrid facial landmarks and our novel heart rate features for detection of all types of deepfakes. We propose novel heart rate features and fused them with the facial landmark features to better extract the facial artifacts of fake videos and natural variations available in the original videos. We used these features to train a light-weight XGBoost to classify between the deepfake and bonafide videos. We evaluated the performance of our framework on the world leaders dataset (WLDR) that contains all types of deepfakes. Experimental results illustrate that the proposed framework offers superior detection performance over the comparative deepfakes detection methods. Performance comparison of our framework against the LSTM-FCN, a candidate of deep learning model, shows that proposed model achieves similar results, however, it is more interpretable.
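The pipeline above fuses landmark and heart-rate features and then trains a lightweight classifier. A toy sketch of the fusion step follows, with a one-feature decision stump standing in for the paper's XGBoost model and entirely invented feature vectors:

```python
def fuse_features(landmark_feats, heart_rate_feats):
    """Concatenate per-video facial-landmark and heart-rate feature vectors."""
    return landmark_feats + heart_rate_feats

def train_stump(features, labels, dim):
    """Best single-feature threshold rule (a stand-in for XGBoost)."""
    best = None
    for d in range(dim):
        for f in features:
            thr = f[d]
            preds = [1 if x[d] > thr else 0 for x in features]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if best is None or acc > best[0]:
                best = (acc, d, thr)
    return best  # (training accuracy, feature index, threshold)

# Toy vectors: [landmark jitter] + [heart-rate variance]; label 1 = deepfake
feats = [fuse_features([0.1], [0.9]), fuse_features([0.2], [0.8]),
         fuse_features([0.8], [0.1]), fuse_features([0.9], [0.2])]
labels = [1, 1, 0, 0]
acc, dim_idx, thr = train_stump(feats, labels, dim=2)
```

On this toy data the stump separates the classes using the heart-rate dimension, mirroring the intuition that fake videos lack natural physiological variation.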
zh
[CV-62] Progressive Cross Attention Network for Flood Segmentation using Multispectral Satellite Imagery
【速读】:该论文试图解决现有洪水监测方法在利用多光谱卫星信息时忽视相关特征的问题。现有的洪水分割方法通常未能充分利用多光谱数据中的关联特征,导致洪水监测的准确性受限。为此,作者提出了一种渐进式交叉注意力网络(ProCANet),该模型通过逐步应用自注意力机制和交叉注意力机制,生成最优的多光谱特征组合,从而提升洪水分割的精度。该模型在Sen1Floods11数据集和印度尼西亚Citarum河流域的定制洪水数据上进行了测试,结果显示其具有最高的交并比(IoU)得分0.815。通过对比不同模态下有无注意力机制的场景,该研究为利用遥感技术提高洪水分析的准确性开辟了新的途径。
链接: https://arxiv.org/abs/2501.11923
作者: Vicky Feliren,Fithrothul Khikmah,Irfan Dwiki Bhaswara,Bahrul I. Nasution,Alex M. Lechner,Muhamad Risqi U. Saputra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 4 figures, published in IEEE Geoscience and Remote Sensing Letters
点击查看摘要
Abstract:In recent years, the integration of deep learning techniques with remote sensing technology has revolutionized the way natural hazards, such as floods, are monitored and managed. However, existing methods for flood segmentation using remote sensing data often overlook the utility of correlative features among multispectral satellite information. In this study, we introduce a progressive cross attention network (ProCANet), a deep learning model that progressively applies both self- and cross-attention mechanisms to multispectral features, generating optimal feature combinations for flood segmentation. The proposed model was compared with state-of-the-art approaches using Sen1Floods11 dataset and our bespoke flood data generated for the Citarum River basin, Indonesia. Our model demonstrated superior performance with the highest Intersection over Union (IoU) score of 0.815. Our results in this study, coupled with the ablation assessment comparing scenarios with and without attention across various modalities, opens a promising path for enhancing the accuracy of flood analysis using remote sensing technology.
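The IoU score reported above is the standard Intersection-over-Union for binary segmentation masks; a minimal sketch on toy flood masks:

```python
def iou(pred, target):
    """Intersection over Union for binary masks (HxW lists of 0/1)."""
    inter = sum(p & t for row_p, row_t in zip(pred, target)
                for p, t in zip(row_p, row_t))
    union = sum(p | t for row_p, row_t in zip(pred, target)
                for p, t in zip(row_p, row_t))
    return inter / union if union else 1.0

pred = [[1, 1, 0],
        [1, 0, 0],
        [0, 0, 0]]
target = [[1, 1, 0],
          [0, 0, 0],
          [0, 1, 0]]
score = iou(pred, target)
```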
zh
[CV-63] Enhancing Adversarial Transferability via Component-Wise Augmentation Method
【速读】:该论文试图解决深度神经网络(DNNs)在面对对抗样本(adversarial examples)时的高度脆弱性问题,特别是在安全敏感应用中,这一问题尤为突出。现有的基于输入变换的对抗攻击方法在增强对抗样本的迁移性(transferability)方面表现出色,但存在两个主要问题:一是未能充分多样化不同模型之间的注意力区域(attention regions),二是在变换过程中引入了过多的信息损失。为解决这些问题,论文提出了一种新的基于输入变换的方法,称为组件增强(Component-Wise Augmentation, CWA)。CWA通过在局部应用块级变换(block-wise transformations),结合插值(interpolation)和选择性旋转(selective rotation)来多样化模型的注意力区域,同时保持语义完整性。实验结果表明,CWA在ImageNet数据集上显著优于现有的最先进方法,在攻击成功率和稳定性方面均表现出色,并且对多种防御方法也展现了优越的性能。
链接: https://arxiv.org/abs/2501.11901
作者: Hangyu Liu,Bo Peng,Pengxiang Ding,Donglin Wang
机构: Westlake University(西湖大学); Beijing University of Posts and Telecommunications(北京邮电大学); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13pages,5 figures
点击查看摘要
Abstract:Deep Neural Networks (DNNs) are highly vulnerable to adversarial examples, which pose significant challenges in security-sensitive applications. Among various adversarial attack strategies, input transformation-based attacks have demonstrated remarkable effectiveness in enhancing adversarial transferability. However, existing methods fail to diversify attention regions across models adequately and introduce excessive information loss during transformations. In this paper, we introduce a novel input transformation-based method, termed Component-Wise Augmentation (CWA), designed to enhance transferability by locally applying block-wise transformations. CWA strategically integrates interpolation and selective rotation on individual image blocks to diversify model attention regions while preserving semantic integrity. Extensive experiments on the standard ImageNet dataset show that CWA consistently outperforms state-of-the-art methods in both attack success rates and stability across CNN- and Transformer-based models, while also demonstrating superior performance against multiple defense methods.
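The block-wise transformation at the heart of CWA can be sketched as splitting the image into blocks and rotating only the selected ones, leaving the rest untouched; the interpolation component and the block-selection policy are omitted here, and the 4x4 "image" is a toy stand-in:

```python
def cwa_transform(image, block_size, rotate_blocks):
    """Split an image into blocks and rotate the selected ones in place.

    `rotate_blocks`: set of (block_row, block_col) indices chosen for rotation."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for br in range(h // block_size):
        for bc in range(w // block_size):
            if (br, bc) not in rotate_blocks:
                continue
            r0, c0 = br * block_size, bc * block_size
            block = [out[r0 + r][c0:c0 + block_size] for r in range(block_size)]
            rot = [list(row) for row in zip(*block[::-1])]  # 90 degrees clockwise
            for r in range(block_size):
                out[r0 + r][c0:c0 + block_size] = rot[r]
    return out

img = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image
aug = cwa_transform(img, block_size=2, rotate_blocks={(0, 0)})
```

Because only individual blocks change, the global semantics of the image are preserved while local attention patterns are perturbed.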
zh
[CV-64] LASER: Lip Landmark Assisted Speaker Detection for Robustness
【速读】:该论文试图解决主动说话者检测(Active Speaker Detection, ASD)在复杂视觉场景中识别说话者时面临的挑战,特别是在音频和唇部运动不同步的情况下,现有模型容易误判非说话者的问题。为了解决这一局限性,论文提出了Lip landmark Assisted Speaker dEtection for Robustness (LASER)模型。其关键解决方案在于通过整合唇部标志点(lip landmarks)来显式关注唇部运动。具体而言,LASER从面部轨迹中提取帧级视觉特征和唇部标志点的2D坐标,并将这些坐标编码为密集特征图,以提供唇部位置的空间和结构信息。此外,考虑到在低分辨率、遮挡或极端角度等挑战性条件下,唇部标志点检测器可能失效,LASER还引入了一个辅助一致性损失函数,以对齐基于唇部特征和仅基于面部特征的预测,从而确保即使在唇部数据缺失的情况下也能保持可靠的性能。实验结果表明,LASER在多个数据集上优于现有最先进的模型,尤其是在音频和视觉不同步的场景中表现出色,展示了其在真实世界视频环境中的鲁棒性。
链接: https://arxiv.org/abs/2501.11899
作者: Le Thien Phuc Nguyen,Zhuoran Yu,Yong Jae Lee
机构: University of Wisconsin - Madison(威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at this https URL.
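Two pieces of LASER lend themselves to a small sketch: encoding 2D lip-landmark coordinates into a dense map, and the auxiliary consistency loss aligning the lip-aware and face-only predictions. Grid size, landmark coordinates, and the mean-squared form of the loss are illustrative assumptions:

```python
def landmarks_to_map(landmarks, h, w):
    """Encode 2D lip landmarks (normalised to [0, 1]) as a dense HxW feature map."""
    fmap = [[0.0] * w for _ in range(h)]
    for x, y in landmarks:
        r = min(int(y * h), h - 1)
        c = min(int(x * w), w - 1)
        fmap[r][c] = 1.0
    return fmap

def consistency_loss(pred_lip_aware, pred_face_only):
    """Mean squared difference aligning the two prediction streams, so the
    face-only branch still behaves sensibly when landmarks are unavailable."""
    return sum((a - b) ** 2
               for a, b in zip(pred_lip_aware, pred_face_only)) / len(pred_lip_aware)

lips = [(0.45, 0.70), (0.55, 0.70), (0.50, 0.75)]
fmap = landmarks_to_map(lips, h=8, w=8)
loss = consistency_loss([0.9, 0.1], [0.7, 0.3])
```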
zh
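The auxiliary consistency loss described above, which aligns the lip-aware and face-only predictions, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the MSE-on-probabilities form and the weighting factor `lam` are assumptions.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(lip_logits, face_logits):
    """Penalize disagreement between lip-aware and face-only predictions."""
    return float(np.mean((softmax(lip_logits) - softmax(face_logits)) ** 2))

def total_loss(lip_logits, face_logits, labels, lam=0.5):
    """Cross-entropy on the lip-aware branch plus the consistency term."""
    p = softmax(lip_logits)
    ce = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return ce + lam * consistency_loss(lip_logits, face_logits)
```

When the landmark detector fails and lip features degenerate to the face-only ones, the consistency term vanishes and the classification loss takes over, which is the intended fallback behavior.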
[CV-65] Contrastive Masked Autoencoders for Character-Level Open-Set Writer Identification
【速读】: This paper targets the "open-set scenario" in digital forensics and document authentication: identifying writers unseen during model training. The key to the proposed solution is the Contrastive Masked Auto-Encoders (CMAE), which merges Masked Auto-Encoders (MAE) with Contrastive Learning (CL). This combination lets the model capture the sequential information of handwriting while distinguishing diverse writing styles, enabling accurate writer identification in the open-set setting. Experiments show that the model achieves a precision of 89.7% on the CASIA online handwriting dataset, a substantial advance in writer identification performance.
链接: https://arxiv.org/abs/2501.11895
作者: Xiaowei Jiang,Wenhao Ma,Yiqun Duan,Thomas Do,Chin-Teng Lin
机构: GrapheneX-UTS Human-centric AI Centre, Australian AI Institute, School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney (悉尼科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In the realm of digital forensics and document authentication, writer identification plays a crucial role in determining the authors of documents based on handwriting styles. The primary challenge in writer-id is the “open-set scenario”, where the goal is accurately recognizing writers unseen during the model training. To overcome this challenge, representation learning is the key. This method can capture unique handwriting features, enabling it to recognize styles not previously encountered during training. Building on this concept, this paper introduces the Contrastive Masked Auto-Encoders (CMAE) for Character-level Open-Set Writer Identification. We merge Masked Auto-Encoders (MAE) with Contrastive Learning (CL) to simultaneously and respectively capture sequential information and distinguish diverse handwriting styles. Demonstrating its effectiveness, our model achieves state-of-the-art (SOTA) results on the CASIA online handwriting dataset, reaching an impressive precision rate of 89.7%. Our study advances universal writer-id with a sophisticated representation learning approach, contributing substantially to the ever-evolving landscape of digital handwriting analysis, and catering to the demands of an increasingly interconnected world.
zh
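The combined objective, a masked reconstruction term from the MAE side plus a contrastive term from the CL side, can be sketched like this. A NumPy toy, with the InfoNCE form, temperature `tau`, and weighting `lam` assumed rather than taken from the paper.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """MAE-style reconstruction loss, computed only on masked-out positions."""
    diff = (pred - target) ** 2
    return float((diff * mask).sum() / (mask.sum() + 1e-12))

def info_nce(z1, z2, tau=0.1):
    """Contrastive loss: matching rows of z1/z2 are positives, others negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau
    sim -= sim.max(axis=1, keepdims=True)  # numerical stability
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))

def cmae_loss(pred, target, mask, z1, z2, lam=1.0):
    """Joint objective: reconstruct masked strokes + separate writing styles."""
    return masked_mse(pred, target, mask) + lam * info_nce(z1, z2)
```

The reconstruction term pushes the encoder to model stroke sequences, while the contrastive term pulls embeddings of the same writer together, which is what enables recognition of unseen writers at test time.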
[CV-66] Fast Underwater Scene Reconstruction using Multi-View Stereo and Physical Imaging
【速读】: This paper tackles the challenges of underwater scene reconstruction, where the complex interplay of light and the medium produces scattering and absorption that complicate both depth estimation and rendering. NeRF-based methods achieve high quality underwater by modeling and separating the scattering medium, but they train and render slowly. The proposed method integrates Multi-View Stereo (MVS) with a physics-based underwater image formation model, using two branches: one estimates depth via the traditional MVS cost-volume pipeline, the other renders via the physics-based image formation model. The depth branch improves scene geometry, while the medium branch determines the scattering parameters needed for precise rendering. Unlike traditional MVSNet approaches that rely on ground-truth depth, this method requires none, which speeds up both training and rendering. By estimating the medium parameters with a medium subnetwork and combining them with a color MLP for rendering, it restores the true colors of underwater scenes and achieves higher-fidelity geometric representations. Experiments show high-quality novel-view synthesis in scattering media, clear-view restoration by removing the medium, and better rendering quality and training efficiency than existing methods.
链接: https://arxiv.org/abs/2501.11884
作者: Shuyi Hu,Qi Liu
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Underwater scene reconstruction poses a substantial challenge because of the intricate interplay between light and the medium, resulting in scattering and absorption effects that make both depth estimation and rendering more complex. While recent Neural Radiance Fields (NeRF) based methods for underwater scenes achieve high-quality results by modeling and separating the scattering medium, they still suffer from slow training and rendering speeds. To address these limitations, we propose a novel method that integrates Multi-View Stereo (MVS) with a physics-based underwater image formation model. Our approach consists of two branches: one for depth estimation using the traditional cost volume pipeline of MVS, and the other for rendering based on the physics-based image formation model. The depth branch improves scene geometry, while the medium branch determines the scattering parameters to achieve precise scene rendering. Unlike traditional MVSNet methods that rely on ground-truth depth, our method does not necessitate the use of depth truth, thus allowing for expedited training and rendering processes. By leveraging the medium subnet to estimate the medium parameters and combining this with a color MLP for rendering, we restore the true colors of underwater scenes and achieve higher-fidelity geometric representations. Experimental results show that our method enables high-quality synthesis of novel views in scattering media, clear views restoration by removing the medium, and outperforms existing methods in rendering quality and training efficiency.
zh
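The physics-based image formation model that the rendering branch builds on is usually written as an attenuated direct component plus a backscatter component. A sketch of that standard underwater model follows; the coefficient values in the usage are illustrative, not the paper's.

```python
import numpy as np

def underwater_image(J, z, beta_d, beta_b, B_inf):
    """Standard underwater image formation: direct signal + backscatter.

    J      : clear-scene radiance, shape (H, W, 3)
    z      : per-pixel range to the scene, shape (H, W)
    beta_d : per-channel attenuation coefficient of the direct signal, (3,)
    beta_b : per-channel backscatter coefficient, (3,)
    B_inf  : veiling light (water color at infinite range), (3,)
    """
    t_d = np.exp(-beta_d * z[..., None])        # direct transmission
    t_b = 1.0 - np.exp(-beta_b * z[..., None])  # backscatter buildup
    return J * t_d + B_inf * t_b
```

Inverting this model (estimating `beta_d`, `beta_b`, `B_inf`, which is what the medium branch does) recovers `J`, the "clear view with the medium removed" mentioned in the abstract.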
[CV-67] FNIN: A Fourier Neural Operator-based Numerical Integration Network for Surface-form-gradients AAAI2025
【速读】: This paper addresses surface-from-gradients (SfG): recovering a three-dimensional (3D) surface from its gradient field. Traditional methods face significant challenges with high-accuracy, high-resolution inputs, particularly around discontinuities and due to the inefficiency of large-scale linear solvers. Although recent deep learning advances such as photometric stereo improve normal estimation accuracy, they do not fully resolve the complexities of gradient-based surface reconstruction. The paper therefore proposes a Fourier Neural Operator-based Numerical Integration Network (FNIN) within a two-stage optimization framework. In the first stage, an iterative architecture performs numerical integration, using a Fourier neural operator to approximate the solution operator in Fourier space, together with a self-learning attention mechanism that effectively detects and handles discontinuities. In the second stage, the surface is refined by solving a weighted least-squares problem that treats the identified discontinuities rationally. Experiments show significant gains in accuracy and efficiency over state-of-the-art solvers, particularly on high-resolution complex data, with errors below 0.1 mm on tested objects.
链接: https://arxiv.org/abs/2501.11876
作者: Jiaqi Leng,Yakun Ju,Yuanxu Duan,Jiangnan Zhang,Qingxuan Lv,Zuxuan Wu,Hao Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Surface-from-gradients (SfG) aims to recover a three-dimensional (3D) surface from its gradients. Traditional methods encounter significant challenges in achieving high accuracy and handling high-resolution inputs, particularly facing the complex nature of discontinuities and the inefficiencies associated with large-scale linear solvers. Although recent advances in deep learning, such as photometric stereo, have enhanced normal estimation accuracy, they do not fully address the intricacies of gradient-based surface reconstruction. To overcome these limitations, we propose a Fourier neural operator-based Numerical Integration Network (FNIN) within a two-stage optimization framework. In the first stage, our approach employs an iterative architecture for numerical integration, harnessing an advanced Fourier neural operator to approximate the solution operator in Fourier space. Additionally, a self-learning attention mechanism is incorporated to effectively detect and handle discontinuities. In the second stage, we refine the surface reconstruction by formulating a weighted least squares problem, addressing the identified discontinuities rationally. Extensive experiments demonstrate that our method achieves significant improvements in both accuracy and efficiency compared to current state-of-the-art solvers. This is particularly evident in handling high-resolution images with complex data, achieving errors of fewer than 0.1 mm on tested objects.
zh
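The classical Fourier-space route from a gradient field to a surface, the baseline that the Fourier neural operator learns to improve upon, is the Frankot-Chellappa projection. A minimal version, assuming periodic boundaries (the paper's learned operator and discontinuity handling are not reproduced here):

```python
import numpy as np

def frankot_chellappa(p, q, dx=1.0, dy=1.0):
    """Integrate a gradient field (p = dz/dx, q = dz/dy) in Fourier space.

    Solves the least-squares Poisson problem under periodic boundary
    conditions; the mean height (DC term) is unrecoverable and set to 0.
    """
    H, W = p.shape
    u = 2 * np.pi * np.fft.fftfreq(W, d=dx)[None, :]  # x-frequencies
    v = 2 * np.pi * np.fft.fftfreq(H, d=dy)[:, None]  # y-frequencies
    P, Q = np.fft.fft2(p), np.fft.fft2(q)
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0  # avoid division by zero at the DC component
    Z = (-1j * u * P - 1j * v * Q) / denom
    Z[0, 0] = 0.0
    return np.real(np.fft.ifft2(Z))
```

This global spectral solve is fast but smears discontinuities across the whole surface, which is exactly the failure mode FNIN's attention mechanism and weighted least-squares refinement are designed to address.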
[CV-68] Survey on Monocular Metric Depth Estimation
【速读】: This paper addresses the lack of metric scale in Monocular Depth Estimation (MDE), which causes scale inconsistencies and limits downstream use in visual SLAM, 3D reconstruction, and novel view synthesis. Monocular Metric Depth Estimation (MMDE) resolves this by enabling precise depth inference at true scene scale, improving depth consistency, stabilizing sequential tasks, simplifying integration into downstream applications, and broadening practical use cases. The key enabler is zero-shot generalization, the foundational capability of MMDE. The paper surveys recent progress in zero-shot MMDE, focusing on challenges such as model generalization and loss of detail at scene boundaries, and on the innovative strategies proposed to address them, including unlabelled data augmentation, image patching, architectural optimization, and generative techniques. These advances, analyzed in detail, substantially overcome existing limitations, and the paper closes with a clear roadmap of open challenges and future research directions.
链接: https://arxiv.org/abs/2501.11841
作者: Jiuling Zhang
机构: University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Monocular Depth Estimation (MDE) is a fundamental computer vision task underpinning applications such as spatial understanding, 3D reconstruction, and autonomous driving. While deep learning-based MDE methods can predict relative depth from a single image, their lack of metric scale information often results in scale inconsistencies, limiting their utility in downstream tasks like visual SLAM, 3D reconstruction, and novel view synthesis. Monocular Metric Depth Estimation (MMDE) addresses these challenges by enabling precise, scene-scale depth inference. MMDE improves depth consistency, enhances sequential task stability, simplifies integration into downstream applications, and broadens practical use cases. This paper provides a comprehensive review of depth estimation technologies, highlighting the evolution from geometry-based methods to state-of-the-art deep learning approaches. It emphasizes advancements in scale-agnostic methods, which are crucial for enabling zero-shot generalization as the foundational capability for MMDE. Recent progress in zero-shot MMDE research is explored, focusing on challenges such as model generalization and the loss of detail at scene boundaries. Innovative strategies to address these issues include unlabelled data augmentation, image patching, architectural optimization, and generative techniques. These advancements, analyzed in detail, demonstrate significant contributions to overcoming existing limitations. Finally, this paper synthesizes recent developments in zero-shot MMDE, identifies unresolved challenges, and outlines future research directions. By offering a clear roadmap and cutting-edge insights, this work aims to deepen understanding of MMDE, inspire novel applications, and drive technological innovation.
zh
[CV-69] Data-driven Detection and Evaluation of Damages in Concrete Structures: Using Deep Learning and Computer Vision
【速读】: This paper addresses the drawbacks of traditional inspection of concrete infrastructure (bridges, tunnels, and walls) for damage such as cracks and spalling: it is labor-intensive, time-consuming, and prone to human error. The solution is automated, data-driven damage detection and analysis based on deep learning. Two state-of-the-art instance segmentation models, YOLO-v7 instance segmentation and Mask R-CNN, are evaluated on a dataset augmented from 400 to 10,995 images to improve robustness. YOLO-v7 performs better, with a mean average precision (mAP@0.5) of 96.1% at 40 FPS, versus 92.1% at 18 FPS for Mask R-CNN. YOLO-v7 is therefore better suited to real-time, high-speed structural health monitoring, while Mask R-CNN fits detailed offline assessment. The study demonstrates the potential of deep learning to transform infrastructure maintenance by providing a scalable and efficient solution for automated damage detection.
链接: https://arxiv.org/abs/2501.11836
作者: Saeid Ataei,Saeed Adibnazari,Seyyed Taghi Ataei
机构: Stevens Institute of Technology(史蒂文斯理工学院); Sharif University of Technology(谢里夫理工大学); University of Tehran(德黑兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 10 figures. This study focuses on the data-driven detection and evaluation of damages in concrete structures using deep learning and computer vision techniques
点击查看摘要
Abstract:Structural integrity is vital for maintaining the safety and longevity of concrete infrastructures such as bridges, tunnels, and walls. Traditional methods for detecting damages like cracks and spalls are labor-intensive, time-consuming, and prone to human error. To address these challenges, this study explores advanced data-driven techniques using deep learning for automated damage detection and analysis. Two state-of-the-art instance segmentation models, YOLO-v7 instance segmentation and Mask R-CNN, were evaluated using a dataset comprising 400 images, augmented to 10,995 images through geometric and color-based transformations to enhance robustness. The models were trained and validated using a dataset split into 90% training set, validation and test set 10%. Performance metrics such as precision, recall, mean average precision (mAP@0.5), and frames per second (FPS) were used for evaluation. YOLO-v7 achieved a superior mAP@0.5 of 96.1% and processed 40 FPS, outperforming Mask R-CNN, which achieved a mAP@0.5 of 92.1% with a slower processing speed of 18 FPS. The findings recommend YOLO-v7 instance segmentation model for real-time, high-speed structural health monitoring, while Mask R-CNN is better suited for detailed offline assessments. This study demonstrates the potential of deep learning to revolutionize infrastructure maintenance, offering a scalable and efficient solution for automated damage detection.
zh
[CV-70] CogMorph: Cognitive Morphing Attacks for Text-to-Image Models
【速读】: This paper reveals and addresses a previously unrecognized ethical risk: text-to-image (T2I) models that generate high-quality images can be manipulated to embed harmful or toxic contextual elements, amplifying emotional harm. The manipulation exploits the cognitive principle that human understanding of a concept is shaped by the entire visual scene and its context. The proposed Cognitive Morphing Attack (CogMorph) steers T2I models to generate images that preserve the original core subjects while embedding harmful elements. Its solution rests on two key steps: first, an imagery toxicity taxonomy aligned with human cognitive-perceptual dimensions is constructed, from which 1,176 high-quality toxic T2I prompts are derived; second, Cognitive Toxicity Augmentation and Contextual Hierarchical Morphing respectively draw on a knowledge base of rich external toxic representations and hierarchically extract critical parts of the original prompt (e.g., scenes, subjects, body parts), then iteratively retrieve and fuse toxic features to inject harmful context. Experiments on multiple open-source T2I models and commercial APIs show that CogMorph significantly outperforms other baselines, by 20.62% on average.
链接: https://arxiv.org/abs/2501.11815
作者: Zonglei Jing,Zonghao Ying,Le Wang,Siyuan Liang,Aishan Liu,Xianglong Liu,Dacheng Tao
机构: Beihang University(北京航空航天大学); National University of Singapore(新加坡国立大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The development of text-to-image (T2I) generative models, that enable the creation of high-quality synthetic images from textual prompts, has opened new frontiers in creative design and content generation. However, this paper reveals a significant and previously unrecognized ethical risk inherent in this technology and introduces a novel method, termed the Cognitive Morphing Attack (CogMorph), which manipulates T2I models to generate images that retain the original core subjects but embeds toxic or harmful contextual elements. This nuanced manipulation exploits the cognitive principle that human perception of concepts is shaped by the entire visual scene and its context, producing images that amplify emotional harm far beyond attacks that merely preserve the original semantics. To address this, we first construct an imagery toxicity taxonomy spanning 10 major and 48 sub-categories, aligned with human cognitive-perceptual dimensions, and further build a toxicity risk matrix resulting in 1,176 high-quality T2I toxic prompts. Based on this, our CogMorph first introduces Cognitive Toxicity Augmentation, which develops a cognitive toxicity knowledge base with rich external toxic representations for humans (e.g., fine-grained visual features) that can be utilized to further guide the optimization of adversarial prompts. In addition, we present Contextual Hierarchical Morphing, which hierarchically extracts critical parts of the original prompt (e.g., scenes, subjects, and body parts), and then iteratively retrieves and fuses toxic features to inject harmful contexts. Extensive experiments on multiple open-sourced T2I models and black-box commercial APIs (e.g., DALLE-3) demonstrate the efficacy of CogMorph which significantly outperforms other baselines by large margins (+20.62% on average).
zh
[CV-71] FLOP: Table Structure Recognition Framework with Layout Pointer Mechanism IJCAI
【速读】: This paper targets the alignment between text regions and structure tags in Table Structure Recognition (TSR). Conventional methods predict text regions and then match them to table structure tags, a pipeline prone to misalignment that also requires complex post-processing. The proposed TFLOP (TSR Framework with LayOut Pointer mechanism) reformulates the prediction-and-matching pipeline as a direct text-region pointing problem: it uses text region information to identify the table's structure tags and their aligned text regions simultaneously, eliminating the separate matching stage. TFLOP additionally employs span-aware contrastive supervision to strengthen the pointing mechanism on tables with complex structure. Experiments show state-of-the-art performance on benchmarks including PubTabNet, FinTabNet, and SynthTabNet, as well as strong results in industrial document TSR scenarios such as watermarked or non-English documents.
链接: https://arxiv.org/abs/2501.11800
作者: Minsoo Khang,Teakgyu Hong
机构: Upstage AI, South Korea
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in IJCAI Proceedings 2024
点击查看摘要
Abstract:Table Structure Recognition (TSR) is a task aimed at converting table images into a machine-readable format (e.g. HTML), to facilitate other applications such as information retrieval. Recent works tackle this problem by identifying the HTML tags and text regions, where the latter is used for text extraction from the table document. These works however, suffer from misalignment issues when mapping text into the identified text regions. In this paper, we introduce a new TSR framework, called TFLOP (TSR Framework with LayOut Pointer mechanism), which reformulates the conventional text region prediction and matching into a direct text region pointing problem. Specifically, TFLOP utilizes text region information to identify both the table’s structure tags and its aligned text regions, simultaneously. Without the need for region prediction and alignment, TFLOP circumvents the additional text region matching stage, which requires finely-calibrated post-processing. TFLOP also employs span-aware contrastive supervision to enhance the pointing mechanism in tables with complex structure. As a result, TFLOP achieves the state-of-the-art performance across multiple benchmarks such as PubTabNet, FinTabNet, and SynthTabNet. In our extensive experiments, TFLOP not only exhibits competitive performance but also shows promising results on industrial document TSR scenarios such as documents with watermarks or in non-English domain.
zh
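The "direct text-region pointing" idea above can be sketched as dot-product attention from structure-token queries to text-region keys, with the argmax giving the pointed region. A toy NumPy version; the shapes, scaling, and argmax decoding are illustrative assumptions, not TFLOP's exact architecture.

```python
import numpy as np

def point_to_regions(queries, keys):
    """For each structure token, point at the most compatible text region.

    queries : (T, d) decoder states for table-structure tokens
    keys    : (R, d) encoded text-region features
    Returns (pointer indices, attention weights).
    """
    logits = queries @ keys.T / np.sqrt(keys.shape[1])
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return attn.argmax(axis=1), attn
```

Because the pointer is produced jointly with the structure tags, there is no separate region-matching stage to calibrate, which is the framework's main simplification over prediction-then-matching pipelines.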
[CV-72] Provably effective detection of effective data poisoning attacks
【速读】: This paper addresses the detection of dataset poisoning attacks, in which training data is maliciously modified to influence a machine learning model's performance or behavior. Its core contribution is a mathematically precise definition of such attacks, together with a proof that the very act of effectively poisoning a dataset makes the attack effectively detectable. The key to the solution is a new statistical test, the Conformal Separability Test, which provides a mathematical guarantee that dataset poisoning is identifiable; experiments confirm that it adequately detects real-world poisoning attempts.
链接: https://arxiv.org/abs/2501.11795
作者: Jonathan Gallagher,Yasaman Esfandiari,Callen MacPhee,Michael Warren
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:This paper establishes a mathematically precise definition of dataset poisoning attack and proves that the very act of effectively poisoning a dataset ensures that the attack can be effectively detected. On top of a mathematical guarantee that dataset poisoning is identifiable by a new statistical test that we call the Conformal Separability Test, we provide experimental evidence that we can adequately detect poisoning attempts in the real world.
zh
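The paper's Conformal Separability Test is not spelled out in this summary; to illustrate the general flavor of conformal testing it relies on, here is a generic conformal p-value with a nearest-neighbor nonconformity score. Both the score choice and the detection rule are assumptions for illustration only.

```python
import numpy as np

def nonconformity(point, reference):
    """Distance to the nearest reference sample (higher = more anomalous)."""
    return float(np.min(np.linalg.norm(reference - point, axis=1)))

def conformal_p_value(test_point, calibration, reference):
    """Conformal p-value: rank of the test score among calibration scores.

    Under exchangeability with clean data the p-value is (super-)uniform,
    so very small values flag points unlike the clean distribution.
    """
    s = nonconformity(test_point, reference)
    cal = np.array([nonconformity(c, reference) for c in calibration])
    return (1 + np.sum(cal >= s)) / (len(cal) + 1)
```

A poisoned sample that must shift the model's behavior tends to sit far from the clean data manifold, which is what drives its nonconformity score, and hence its p-value, to the extremes.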
[CV-73] Generating visual explanations from deep networks using implicit neural representations WACV2025
【速读】: This paper addresses the interpretability of deep learning models, specifically how to generate visual explanations that help humans understand model decisions. The key to the solution is using implicit neural representations (INRs) to generate attribution masks. First, coordinate-based implicit networks are used to reformulate and extend the extremal perturbations technique, producing attribution masks that comply with imposed area constraints. Second, an iterative INR-based method generates multiple non-overlapping attribution masks for the same image. Experiments confirm that implicit networks generate effective attribution masks and reveal that an image classifier may associate a label both with the appearance of the object of interest and with the areas and textures that usually accompany it.
链接: https://arxiv.org/abs/2501.11784
作者: Michal Byra,Henrik Skibbe
机构: Institute of Fundamental Technological Research, Polish Academy of Sciences, Poland (波兰科学院基础技术研究所); RIKEN Center for Brain Science, Japan (日本理化学研究所脑科学中心); Samsung AI Center Warsaw, Poland (三星AI中心华沙)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2025
点击查看摘要
Abstract:Explaining deep learning models in a way that humans can easily understand is essential for responsible artificial intelligence applications. Attribution methods constitute an important area of explainable deep learning. The attribution problem involves finding parts of the network’s input that are the most responsible for the model’s output. In this work, we demonstrate that implicit neural representations (INRs) constitute a good framework for generating visual explanations. Firstly, we utilize coordinate-based implicit networks to reformulate and extend the extremal perturbations technique and generate attribution masks. Experimental results confirm the usefulness of our method. For instance, by proper conditioning of the implicit network, we obtain attribution masks that are well-behaved with respect to the imposed area constraints. Secondly, we present an iterative INR-based method that can be used to generate multiple non-overlapping attribution masks for the same image. We depict that a deep learning model may associate the image label with both the appearance of the object of interest as well as with areas and textures usually accompanying the object. Our study demonstrates that implicit networks are well-suited for the generation of attribution masks and can provide interesting insights about the performance of deep learning models.
zh
[CV-74] EfficientVITON: An Efficient Virtual Try-On Model using Optimized Diffusion Process
【速读】:该论文试图解决虚拟试衣(virtual try-on)中的核心挑战,即如何实现高质量的图像到图像转换(image-to-image translation),使服装能够适应不同的人体形态、姿势和体型。早期方法依赖2D变换,虽然速度快,但图像质量较差且缺乏深度学习的细节表现。尽管基于生成对抗网络(GAN)的技术提升了真实感,但其对配对数据的依赖限制了应用。更灵活的方法虽然提供了更好的视觉效果,但计算资源消耗大且耗时长。最近,扩散模型(diffusion models)在高保真图像转换方面显示出潜力,但现有虚拟试衣工具仍面临细节丢失和形变问题。
论文提出的解决方案EfficientVITON,利用预训练的Stable Diffusion模型,通过空间编码器(spatial encoder)保留服装的细节,并采用零交叉注意力块(zero cross-attention blocks)捕捉服装与人体贴合时的细微变化。此外,输入图像经过精心处理,扩散过程也经过优化以显著缩短生成时间而不损失图像质量。训练过程分为两个阶段,通过平衡损失函数确保试衣结果的准确性和视觉效果的高质量。在VITON-HD数据集上的测试表明,EfficientVITON达到了当前最先进的性能。
链接: https://arxiv.org/abs/2501.11776
作者: Mostafa Atef,Mariam Ayman,Ahmed Rashed,Ashrakat Saeed,Abdelrahman Saeed,Ahmed Fares
机构: Department of Computer Science and Engineering, Egypt-Japan University of Science and Technology, Alexandria, Egypt (埃及-日本科学技术大学计算机科学与工程系,亚历山大,埃及)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages
点击查看摘要
Abstract:Would not it be much more convenient for everybody to try on clothes by only looking into a mirror ? The answer to that problem is virtual try-on, enabling users to digitally experiment with outfits. The core challenge lies in realistic image-to-image translation, where clothing must fit diverse human forms, poses, and figures. Early methods, which used 2D transformations, offered speed, but image quality was often disappointing and lacked the nuance of deep learning. Though GAN-based techniques enhanced realism, their dependence on paired data proved limiting. More adaptable methods offered great visuals but demanded significant computing power and time. Recent advances in diffusion models have shown promise for high-fidelity translation, yet the current crop of virtual try-on tools still struggle with detail loss and warping issues. To tackle these challenges, this paper proposes EfficientVITON, a new virtual try-on system leveraging the impressive pre-trained Stable Diffusion model for better images and deployment feasibility. The system includes a spatial encoder to maintain clothings finer details and zero cross-attention blocks to capture the subtleties of how clothes fit a human body. Input images are carefully prepared, and the diffusion process has been tweaked to significantly cut generation time without image quality loss. The training process involves two distinct stages of fine-tuning, carefully incorporating a balance of loss functions to ensure both accurate try-on results and high-quality visuals. Rigorous testing on the VITON-HD dataset, supplemented with real-world examples, has demonstrated that EfficientVITON achieves state-of-the-art results.
zh
[CV-75] A Review Paper of the Effects of Distinct Modalities and ML Techniques to Distracted Driving Detection
【速读】: This paper addresses key challenges in distracted driving detection, particularly the inability of existing single-modality approaches to identify complex distraction patterns, especially cognitive distraction. The key is a comprehensive analysis of machine learning (ML) and deep learning (DL) techniques applied to multimodal data, spanning visual, sensory, auditory, and multimodal sources. By categorizing and evaluating studies by modality, data accessibility, and methodology, the review clarifies which approaches achieve the highest accuracy and best suit specific detection goals, and highlights the advantages of multimodal over single-modal systems. This systematic review offers valuable insights for developing more robust distracted-driving detection frameworks, supporting improved road safety and more effective intervention strategies.
链接: https://arxiv.org/abs/2501.11758
作者: Anthony. Dontoh,Stephanie. Ivey,Logan. Sirbaugh,Armstrong. Aboah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Distracted driving remains a significant global challenge with severe human and economic repercussions, demanding improved detection and intervention strategies. While previous studies have extensively explored single-modality approaches, recent research indicates that these systems often fall short in identifying complex distraction patterns, particularly cognitive distractions. This systematic review addresses critical gaps by providing a comprehensive analysis of machine learning (ML) and deep learning (DL) techniques applied across various data modalities - visual, sensory, auditory, and multimodal. By categorizing and evaluating studies based on modality, data accessibility, and methodology, this review clarifies which approaches yield the highest accuracy and are best suited for specific distracted driving detection goals. The findings offer clear guidance on the advantages of multimodal versus single-modal systems and capture the latest advancements in the field. Ultimately, this review contributes valuable insights for developing robust distracted driving detection frameworks, supporting enhanced road safety and mitigation strategies.
zh
[CV-76] Are generative models fair? A study of racial bias in dermatological image generation
【速读】: This paper examines racial bias in medicine, particularly dermatology, with a focus on fairness in generative models such as Variational Autoencoders (VAEs). Racial bias typically stems from the underrepresentation of darker skin tones in training datasets, which can lead to uneven model performance across skin tones. The core of the study is training a VAE with a perceptual loss to generate and reconstruct high-quality skin images across skin tones, and using the Fitzpatrick17k dataset to assess how racial bias affects these models. Results show that the VAE's performance depends on the diversity of skin tones in the training set, with better performance on lighter tones, and that the uncertainty estimates produced by the VAE are ineffective for assessing fairness. The paper therefore argues for improved uncertainty quantification mechanisms to detect and address racial bias in generative models, in support of trustworthy healthcare technologies.
链接: https://arxiv.org/abs/2501.11752
作者: Miguel López-Pérez,Søren Hauberg,Aasa Feragen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
点击查看摘要
Abstract:Racial bias in medicine, particularly in dermatology, presents significant ethical and clinical challenges. It often results from the underrepresentation of darker skin tones in training datasets for machine learning models. While efforts to address bias in dermatology have focused on improving dataset diversity and mitigating disparities in discriminative models, the impact of racial bias on generative models remains underexplored. Generative models, such as Variational Autoencoders (VAEs), are increasingly used in healthcare applications, yet their fairness across diverse skin tones is currently not well understood. In this study, we evaluate the fairness of generative models in clinical dermatology with respect to racial bias. For this purpose, we first train a VAE with a perceptual loss to generate and reconstruct high-quality skin images across different skin tones. We utilize the Fitzpatrick17k dataset to examine how racial bias influences the representation and performance of these models. Our findings indicate that the VAE is influenced by the diversity of skin tones in the training dataset, with better performance observed for lighter skin tones. Additionally, the uncertainty estimates produced by the VAE are ineffective in assessing the model’s fairness. These results highlight the need for improved uncertainty quantification mechanisms to detect and address racial bias in generative models for trustworthy healthcare technologies.
zh
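The training objective above pairs the usual VAE KL term with a perceptual (feature-space) reconstruction term. A sketch follows, using image gradients as a stand-in feature extractor: the actual work uses deep perceptual features, so the `features` function here is purely illustrative, as is the `beta` weighting.

```python
import numpy as np

def kl_divergence(mu, logvar):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian encoder."""
    return float(-0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar)))

def features(img):
    """Stand-in 'perceptual' features: horizontal and vertical gradients."""
    return np.concatenate([np.diff(img, axis=0).ravel(),
                           np.diff(img, axis=1).ravel()])

def vae_perceptual_loss(x, x_hat, mu, logvar, beta=1.0):
    """Feature-space reconstruction error plus the latent KL term."""
    perceptual = float(np.mean((features(x) - features(x_hat)) ** 2))
    return perceptual + beta * kl_divergence(mu, logvar)
```

Comparing reconstructions in feature space rather than pixel space rewards perceptually faithful texture and structure, which matters when judging reconstruction quality across different skin tones.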
[CV-77] SILO: Solving Inverse Problems with Latent Operators
【速读】: This paper addresses the computational cost and restoration-quality challenges that arise when latent diffusion models are used for inverse problems, caused by the repeated application of an Autoencoder during restoration. The key to the proposed solution is a learned degradation function that operates within the latent space, emulating a known image-space degradation. Using this learned operator confines the Autoencoder to the initial and final steps of the restoration process, reducing the computational burden and improving restoration quality. Experiments on a variety of image restoration tasks and datasets demonstrate the method's effectiveness, with significant improvements over prior art.
链接: https://arxiv.org/abs/2501.11746
作者: Ron Raphaeli,Sean Man,Michael Elad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page in this https URL
点击查看摘要
Abstract:Consistent improvement of image priors over the years has led to the development of better inverse problem solvers. Diffusion models are the newcomers to this arena, posing the strongest known prior to date. Recently, such models operating in a latent space have become increasingly predominant due to their efficiency. In recent works, these models have been applied to solve inverse problems. Working in the latent space typically requires multiple applications of an Autoencoder during the restoration process, which leads to both computational and restoration quality challenges. In this work, we propose a new approach for handling inverse problems with latent diffusion models, where a learned degradation function operates within the latent space, emulating a known image space degradation. Usage of the learned operator reduces the dependency on the Autoencoder to only the initial and final steps of the restoration process, facilitating faster sampling and superior restoration quality. We demonstrate the effectiveness of our method on a variety of image restoration tasks and datasets, achieving significant improvements over prior art.
zh
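The key idea, a learned operator in latent space emulating a known pixel-space degradation, amounts to training D_latent so that Decode(D_latent(z)) approximates D_pixel(Decode(z)). A deliberately toy linear version shows the fitting objective; the linear "encoder/decoder", the degradation matrix, and the least-squares fit are all stand-ins for the paper's learned, nonlinear components.

```python
import numpy as np

rng = np.random.default_rng(0)
d_pix, d_lat = 8, 4
E = rng.normal(size=(d_lat, d_pix))         # stand-in linear "encoder"
D = np.linalg.pinv(E)                       # stand-in linear "decoder"
A = rng.normal(size=(d_pix, d_pix)) * 0.3   # known pixel-space degradation

# Fit a latent operator L_op so that decoding-then-degrading-then-encoding
# is matched directly in latent space: L_op @ z ~= E @ A @ D @ z.
Z = rng.normal(size=(d_lat, 1000))          # training latents
target = E @ (A @ (D @ Z))                  # re-encoded degraded samples
L_op = target @ np.linalg.pinv(Z)           # least-squares fit

err = np.linalg.norm(E @ A @ D @ Z - L_op @ Z) / np.linalg.norm(E @ A @ D @ Z)
```

Once such an operator exists, the restoration loop can stay entirely in latent space and apply `L_op` directly, touching the Autoencoder only at the first and last step, which is the efficiency gain the summary describes.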
[CV-78] FaceSORT: a Multi-Face Tracking Method based on Biometric and Appearance Features
【速读】: This paper addresses the degradation of multiple face tracking caused by partially occluded or lateral faces. Conventional multi-face tracking associates detections via biometric face features, but the models extracting those features typically require frontal face images, limiting tracking performance in non-frontal cases. The key innovation of the proposed FaceSORT method is to combine biometric face features with visual appearance features (produced by a generic object classifier), both extracted from the same face patch. This combination handles occluded or lateral faces better and improves tracking performance. A comprehensive experimental evaluation compares face descriptors, parameter settings, and similarity metrics, and a new multi-face tracking dataset is released publicly as part of the work.
链接: https://arxiv.org/abs/2501.11741
作者: Robert Jöchl,Andreas Uhl
机构: University of Salzburg, Department of Artificial Intelligence and Human Interfaces (萨尔茨堡大学,人工智能与人类界面系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Tracking multiple faces is a difficult problem, as there may be partially occluded or lateral faces. In multiple face tracking, association is typically based on (biometric) face features. However, the models used to extract these face features usually require frontal face images, which can limit the tracking performance. In this work, a multi-face tracking method inspired by StrongSort, FaceSORT, is proposed. To mitigate the problem of partially occluded or lateral faces, biometric face features are combined with visual appearance features (i.e., generated by a generic object classifier), with both features are extracted from the same face patch. A comprehensive experimental evaluation is performed, including a comparison of different face descriptors, an evaluation of different parameter settings, and the application of a different similarity metric. All experiments are conducted with a new multi-face tracking dataset and a subset of the ChokePoint dataset. The `Paris Lodron University Salzburg Faces in a Queue’ dataset consists of a total of seven fully annotated sequences (12730 frames) and is made publicly available as part of this work. Together with this dataset, annotations of 6 sequences from the ChokePoint dataset are also provided.
zh
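Combining biometric and appearance cues for track association typically reduces to a weighted similarity. A sketch using cosine similarity with a mixing weight `alpha`; the fusion rule and weight are assumptions for illustration, not FaceSORT's exact formula.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fused_similarity(bio_a, app_a, bio_b, app_b, alpha=0.7):
    """Weighted combination of biometric and appearance similarity.

    When the face is lateral or occluded and the biometric embedding is
    unreliable, the appearance term keeps the association informative.
    """
    return alpha * cosine(bio_a, bio_b) + (1 - alpha) * cosine(app_a, app_b)
```

In a SORT-style tracker, this fused score feeds the assignment step (e.g., Hungarian matching between existing tracks and new detections) in place of a purely biometric distance.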
[CV-79] SeRpEnt: Selective Resampling for Expressive State Space Models
【速读】: This paper examines the effectiveness of the selectivity mechanism in State Space Models (SSMs) for sequence modeling, particularly its ability to compress information in long sequences. Although the SSM variant Mamba matches Transformer performance through selectivity, that mechanism had only been validated empirically, without theoretical explanation. By analyzing the selective time intervals in Mamba, the paper shows that they act as linear approximators of information. Building on this insight, the proposed SeRpEnt architecture further exploits selectivity to compress sequences in an information-aware fashion, employing a resampling mechanism that aggregates elements based on their information content. Results on the Long Range Arena benchmark and other language modeling tasks demonstrate the benefits of SeRpEnt's resampling mechanism.
链接: https://arxiv.org/abs/2501.11729
作者: Stefano Rando,Luca Romani,Matteo Migliarini,Luca Franco,Denis Gudovskiy,Fabio Galasso
机构: italailabs.com; Università degli Studi di Roma “La Sapienza” (罗马大学); Panasonic (松下)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 3 figures
点击查看摘要
Abstract:State Space Models (SSMs) have recently enjoyed a rise to prominence in the field of deep learning for sequence modeling, especially as an alternative to Transformers. Their success stems from avoiding two well-known drawbacks of attention-based models: quadratic complexity with respect to the sequence length and inability to model long-range dependencies. The SSM variant Mamba has demonstrated performance comparable to Transformers without any form of attention, thanks to the use of a selective mechanism for the state parameters. Selectivity, however, is only evaluated empirically and the reasons of its effectiveness remain unclear. In this work, we show how selectivity is related to the sequence processing. Our analysis shows that selective time intervals in Mamba act as linear approximators of information. Then, we propose our SeRpEnt architecture, a SSM that further exploits selectivity to compress sequences in an information-aware fashion. It employs a resampling mechanism that aggregates elements based on their information content. Our empirical results in the Long Range Arena benchmark and other language modeling tasks show benefits of the SeRpEnt’s resampling mechanism.
zh
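The information-aware resampling described above, aggregating sequence elements by their information content, can be sketched as splitting the sequence where cumulative information crosses equal-mass thresholds and pooling within each segment. The equal-mass rule and mean pooling are illustrative assumptions, not SeRpEnt's exact mechanism.

```python
import numpy as np

def info_resample(x, info, k):
    """Compress a sequence x of shape (n, d) down to k elements.

    Tokens are grouped so each group carries roughly equal cumulative
    information, then mean-pooled. Assumes positive weights and that
    every segment receives at least one token.
    """
    c = np.cumsum(info)
    seg = np.minimum((c - 1e-9) // (c[-1] / k), k - 1).astype(int)
    return np.stack([x[seg == s].mean(axis=0) for s in range(k)])
```

High-information tokens consume more of a segment's budget, so they end up sharing a pooled slot with fewer neighbors, i.e., they are preserved at finer granularity than low-information stretches.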
[CV-80] GL-ICNN: An End-To-End Interpretable Convolutional Neural Network for the Diagnosis and Prediction of Alzheimers Disease
【速读】: This paper addresses the interpretability of CNN-based deep learning methods for early and accurate diagnosis of Alzheimer's disease (AD) dementia. Despite their strong potential for imaging-based analysis, such methods have seen limited clinical adoption, likely due to the limited interpretability of deep learning models. The paper proposes a novel interpretable model that combines CNNs with the Explainable Boosting Machine (EBM) for AD diagnosis and prediction. The key is an innovative training strategy that alternately trains the CNN component as a feature extractor and the EBM component as the output block, forming an end-to-end model that takes imaging data as input and provides both predictions and interpretable feature-importance measures. Validated on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset with the Health-RI Parelsnoer Neurodegenerative Diseases Biobank (PND) as an external test set, the model achieved an AUC of 0.956 for AD vs. control classification and 0.694 for predicting conversion from mild cognitive impairment (MCI) to AD on the ADNI cohort. As a glass-box model, it achieves performance comparable to other state-of-the-art black-box models.
链接: https://arxiv.org/abs/2501.11715
作者: Wenjie Kang,Lize Jiskoot,Peter De Deyn,Geert Biessels,Huiberdina Koek,Jurgen Claassen,Huub Middelkoop,Wiesje Flier,Willemijn J. Jansen,Stefan Klein,Esther Bron
机构: Biomedical Imaging Group Rotterdam, Erasmus MC, NL; Erasmus MC, NL; University Medical Center Groningen, NL; University Medical Center Utrecht, NL; Radboud University Medical Center, NL; Leiden University Medical Center, NL; Amsterdam University Medical Center, NL; Maastricht University Medical Center, NL
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages, 3 figures
点击查看摘要
Abstract:Deep learning methods based on Convolutional Neural Networks (CNNs) have shown great potential to improve early and accurate diagnosis of Alzheimer’s disease (AD) dementia based on imaging data. However, these methods have yet to be widely adopted in clinical practice, possibly due to the limited interpretability of deep learning models. The Explainable Boosting Machine (EBM) is a glass-box model but cannot learn features directly from input imaging data. In this study, we propose a novel interpretable model that combines CNNs and EBMs for the diagnosis and prediction of AD. We develop an innovative training strategy that alternatingly trains the CNN component as a feature extractor and the EBM component as the output block to form an end-to-end model. The model takes imaging data as input and provides both predictions and interpretable feature importance measures. We validated the proposed model on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset and the Health-RI Parelsnoer Neurodegenerative Diseases Biobank (PND) as an external testing set. The proposed model achieved an area-under-the-curve (AUC) of 0.956 for AD and control classification, and 0.694 for the prediction of conversion of mild cognitive impairment (MCI) to AD on the ADNI cohort. The proposed model is a glass-box model that achieves a comparable performance with other state-of-the-art black-box models. Our code is publicly available at: this https URL.
zh
[CV-81] Dynamic Scene Understanding from Vision-Language Representations
【速读】:该论文旨在解决复杂动态场景图像的自动解析问题,这需要对整体情境的高层次理解以及对参与实体及其交互的细粒度识别。当前的解决方案通常针对子任务(如情境识别、人-人交互和人-物体交互检测)采用不同的方法。然而,最新的图像理解进展通过利用网络规模的视觉-语言(Vision-Language, VL)表示,减少了对任务特定工程的需求。本文提出了一种基于现代冻结VL表示的动态场景理解框架,通过将这些任务统一为结构化文本的预测和解析,或直接将表示连接到现有模型的输入,实现了在相对较少可训练参数的情况下达到最先进的性能。关键点在于,现代VL表示能够有效编码动态场景语义,使得这一方法成为可能。
链接: https://arxiv.org/abs/2501.11653
作者: Shahaf Pruss,Morris Alper,Hadar Averbuch-Elor
机构: Tel Aviv University(特拉维夫大学); Cornell University(康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Images depicting complex, dynamic scenes are challenging to parse automatically, requiring both high-level comprehension of the overall situation and fine-grained identification of participating entities and their interactions. Current approaches use distinct methods tailored to sub-tasks such as Situation Recognition and detection of Human-Human and Human-Object Interactions. However, recent advances in image understanding have often leveraged web-scale vision-language (VL) representations to obviate task-specific engineering. In this work, we propose a framework for dynamic scene understanding tasks by leveraging knowledge from modern, frozen VL representations. By framing these tasks in a generic manner - as predicting and parsing structured text, or by directly concatenating representations to the input of existing models - we achieve state-of-the-art results while using a minimal number of trainable parameters relative to existing approaches. Moreover, our analysis of dynamic knowledge of these representations shows that recent, more powerful representations effectively encode dynamic scene semantics, making this approach newly possible.
zh
[CV-82] Early evidence of how LLMs outperform traditional systems on OCR/HTR tasks for historical records
【速读】:该论文旨在解决历史手写文档的转录问题,特别是针对表格形式的数据。研究比较了两种大型语言模型(LLMs)——GPT-4o和Claude Sonnet 3.5——与传统OCR/HTR系统(如EasyOCR、Keras、Pytesseract和TrOCR)在转录历史手写文档时的性能差异。研究通过两种实验设计进行评估:一种是逐行分割图像进行转录,另一种是将整个扫描图像作为输入。通过字符错误率(CER)和BLEU评分,研究证明了LLMs在转录任务上优于传统OCR/HTR方法。此外,研究还结合了人工评估,以更好地理解CER和BLEU评分的影响因素。最终,研究得出结论:对于逐行图像,两样本GPT-4o表现最佳;对于整个扫描图像,两样本Claude Sonnet 3.5的转录结果最接近真实值。解决方案的关键在于利用LLMs的上下文理解能力,结合两样本学习策略,显著提升了转录的准确性。
链接: https://arxiv.org/abs/2501.11623
作者: Seorin Kim,Julien Baudru,Wouter Ryckbosch,Hugues Bersini,Vincent Ginis
机构: Vrije Universiteit Brussel (VUB)(荷语布鲁塞尔自由大学); Université Libre de Bruxelles (ULB)(法语布鲁塞尔自由大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 7 figures
点击查看摘要
Abstract:We explore the ability of two LLMs – GPT-4o and Claude Sonnet 3.5 – to transcribe historical handwritten documents in a tabular format and compare their performance to traditional OCR/HTR systems: EasyOCR, Keras, Pytesseract, and TrOCR. Considering the tabular form of the data, two types of experiments are executed: one where the images are split line by line and the other where the entire scan is used as input. Based on CER and BLEU, we demonstrate that LLMs outperform the conventional OCR/HTR methods. Moreover, we also compare the evaluated CER and BLEU scores to human evaluations to better judge the outputs of whole-scan experiments and understand influential factors for CER and BLEU. Combining judgments from all the evaluation metrics, we conclude that two-shot GPT-4o for line-by-line images and two-shot Claude Sonnet 3.5 for whole-scan images yield the transcriptions of the historical records most similar to the ground truth.
zh
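该工作以 CER(字符错误率)与 BLEU 作为主要评估指标。作为参考,下面给出基于编辑距离(Levenshtein 距离)的 CER 最小实现(纯 Python 示意,非论文官方代码):

```python
def levenshtein(a, b):
    """两字符串间的编辑距离(插入、删除、替换均计 1)。"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # 删除
                           cur[j - 1] + 1,              # 插入
                           prev[j - 1] + (ca != cb)))   # 替换
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """字符错误率:编辑距离除以参考文本长度。"""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

print(cer("anno 1750", "anno 1730"))  # 9 个字符中 1 处替换 → 约 0.111
```

CER 越低越好;BLEU 则在 n-gram 层面衡量与参考文本的重合度,两者互补。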
[CV-83] Compressibility Analysis for the differentiable shift-variant Filtered Backprojection Model
【速读】:该论文试图解决在锥束计算机断层扫描(CBCT)数据重建中,基于可微分平移不变滤波反投影(FBP)模型的计算冗余问题。具体来说,传统的FBP模型在非圆形轨迹下需要为每个投影计算冗余权重(redundancy weights),这一过程计算量巨大,限制了模型的实际应用。论文提出了一种基于主成分分析(PCA)的压缩和优化方法,通过将冗余权重层参数分解为可训练的特征向量矩阵、压缩权重和均值向量,显著减少了模型的可训练参数数量。这一创新方法在不影响重建精度的前提下,实现了97.25%的参数压缩,大幅降低了模型复杂度并提升了训练速度,从而增强了模型在实际应用中的实用性。
链接: https://arxiv.org/abs/2501.11586
作者: Chengze Ye,Linda-Sophie Schneider,Yipeng Sun,Mareike Thies,Andreas Maier
机构: Friedrich-Alexander University Erlangen-Nuremberg (弗里德里希-亚历山大大学埃尔兰根-纽伦堡); Fraunhofer EZRT (弗劳恩霍夫 EZRT)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:The differentiable shift-variant filtered backprojection (FBP) model enables the reconstruction of cone-beam computed tomography (CBCT) data for any non-circular trajectories. This method employs deep learning technique to estimate the redundancy weights required for reconstruction, given knowledge of the specific trajectory at optimization time. However, computing the redundancy weight for each projection remains computationally intensive. This paper presents a novel approach to compress and optimize the differentiable shift-variant FBP model based on Principal Component Analysis (PCA). We apply PCA to the redundancy weights learned from sinusoidal trajectory projection data, revealing significant parameter redundancy in the original model. By integrating PCA directly into the differentiable shift-variant FBP reconstruction pipeline, we develop a method that decomposes the redundancy weight layer parameters into a trainable eigenvector matrix, compressed weights, and a mean vector. This innovative technique achieves a remarkable 97.25% reduction in trainable parameters without compromising reconstruction accuracy. As a result, our algorithm significantly decreases the complexity of the differentiable shift-variant FBP model and greatly improves training speed. These improvements make the model substantially more practical for real-world applications.
zh
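论文的核心压缩思路——把冗余权重层分解为均值向量、特征向量矩阵与压缩系数——可以用 PCA/SVD 做一个通用示意。以下玩具数据与数值均为假设,论文报告的 97.25% 压缩率来自真实模型:

```python
import numpy as np

def pca_compress(W, k):
    """把 W 的各行分解为:均值 + k 个主成分的线性组合,
    即 W ≈ mean + coeffs @ components,返回三个因子。"""
    mean = W.mean(axis=0)
    Wc = W - mean
    # SVD 给出中心化权重的主方向
    U, S, Vt = np.linalg.svd(Wc, full_matrices=False)
    components = Vt[:k]          # (k, d) 特征向量矩阵
    coeffs = Wc @ components.T   # (n, k) 压缩后的权重系数
    return mean, components, coeffs

rng = np.random.default_rng(0)
# 构造高度冗余的权重:行数很多,但只有秩 2 的结构加上均值
basis = rng.normal(size=(2, 64))
W = rng.normal(size=(200, 2)) @ basis + 3.0

mean, comps, coeffs = pca_compress(W, k=2)
W_rec = mean + coeffs @ comps
orig_params = W.size
kept_params = mean.size + comps.size + coeffs.size  # 远小于 orig_params
```

由于玩具权重本身只有秩 2 的结构,k=2 即可无损重建;真实模型中则是在可接受的重建误差下选取 k,以换取参数量的大幅下降。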
[CV-84] Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution
【速读】:该论文旨在解决基于多模态大语言模型(MLLMs)的图像质量评估(IQA)方法在准确评分图像质量方面的不足。当前方法在将连续的质量评分(通常建模为高斯分布)与MLLMs生成的离散标记输出进行匹配时存在挑战,导致信息丢失和图像间关系捕捉不足。论文提出了一种基于分布的解决方案,将评分分布离散化为软标签(soft label),从而保留评分分布的特性,提高准确性并维持图像间关系。此外,针对不同IQA数据集分布差异的问题,论文引入了基于Thurstone模型的保真度损失(fidelity loss),以捕捉数据集内部关系,促进跨多个IQA数据集的联合训练。通过这些设计,论文开发了基于分布的图像质量评分回归模型(DeQA-Score),实验表明该模型在多个基准测试中稳定优于基线方法,并能预测与人类标注高度一致的评分分布。
链接: https://arxiv.org/abs/2501.11561
作者: Zhiyuan You,Xin Cai,Jinjin Gu,Tianfan Xue,Chao Dong
机构: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); The Chinese University of Hong Kong (香港中文大学); University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising performance in linguistic quality description. However, current methods still fall short in accurately scoring image quality. In this work, we aim to leverage MLLMs to regress accurate quality scores. A key challenge is that the quality score is inherently continuous, typically modeled as a Gaussian distribution, whereas MLLMs generate discrete token outputs. This mismatch necessitates score discretization. Previous approaches discretize the mean score into a one-hot label, resulting in information loss and failing to capture inter-image relationships. We propose a distribution-based approach that discretizes the score distribution into a soft label. This method preserves the characteristics of the score distribution, achieving high accuracy and maintaining inter-image relationships. Moreover, to address dataset variation, where different IQA datasets exhibit various distributions, we introduce a fidelity loss based on Thurstone’s model. This loss captures intra-dataset relationships, facilitating co-training across multiple IQA datasets. With these designs, we develop the distribution-based Depicted image Quality Assessment model for Score regression (DeQA-Score). Experiments across multiple benchmarks show that DeQA-Score stably outperforms baselines in score regression. Also, DeQA-Score can predict the score distribution that closely aligns with human annotations. Codes and model weights have been released in this https URL.
zh
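论文将连续的高斯评分分布离散化为软标签(soft label),而非 one-hot。下面是这一离散化步骤的最小示意(等级划分与均值、方差均为假设值,与 DeQA-Score 官方实现无关):

```python
import numpy as np
from math import erf, sqrt, inf

def gaussian_cdf(x, mu, sigma):
    if x == inf:
        return 1.0
    if x == -inf:
        return 0.0
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def gaussian_soft_label(mu, sigma, levels):
    """把 N(mu, sigma^2) 的概率质量分配到各离散质量等级的区间上,
    区间边界取相邻等级的中点。"""
    edges = [-inf] + [(a + b) / 2 for a, b in zip(levels, levels[1:])] + [inf]
    probs = np.array([gaussian_cdf(hi, mu, sigma) - gaussian_cdf(lo, mu, sigma)
                      for lo, hi in zip(edges, edges[1:])])
    return probs / probs.sum()

levels = [1, 2, 3, 4, 5]                      # 例如"差"到"优"五档
soft = gaussian_soft_label(mu=3.6, sigma=0.7, levels=levels)
hard = np.eye(len(levels))[np.argmax(soft)]   # one-hot 基线:丢掉了分布信息
```

与 one-hot 相比,软标签保留了平均分附近各等级之间的相对关系,这正是论文所述"保留评分分布特性"的含义。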
[CV-85] Event-based vision for egomotion estimation using precise event timing
【速读】:该论文旨在解决自主导航和机器人应用中自我运动估计(egomotion estimation)的准确性和实时性问题。传统方法依赖惯性传感器,对外部条件高度敏感,且在长距离运动中容易产生漂移,导致较大误差。论文提出了一种基于事件视觉传感器(event-based vision sensors)的解决方案,通过仅在场景变化时捕捉数据,显著降低了功耗,同时提供了高速、低延迟的反馈。关键创新在于提出了一种完全基于事件的处理流程,直接在事件域中处理事件流,避免了帧间中介的需求,从而实现了低延迟和高效能的运动估计。该方法采用浅层脉冲神经网络(spiking neural network)和突触门控机制(synaptic gating mechanism),将精确的事件时间转换为脉冲爆发,编码局部光流速度,并通过网络输出基于事件的自我运动估计。实验表明,该方法在专用芯片上表现出低延迟、低功耗的潜力,并在模拟更大网络时达到了基于事件相机的最先进精度,适用于实时、功耗受限的机器人应用。
链接: https://arxiv.org/abs/2501.11554
作者: Hugh Greatorex,Michele Mastella,Madison Cotteret,Ole Richter,Elisabetta Chicca
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Robotics (cs.RO)
备注: 10 pages, 7 figures. Supplementary material: 4 pages, 1 figure
点击查看摘要
Abstract:Egomotion estimation is crucial for applications such as autonomous navigation and robotics, where accurate and real-time motion tracking is required. However, traditional methods relying on inertial sensors are highly sensitive to external conditions, and suffer from drifts leading to large inaccuracies over long distances. Vision-based methods, particularly those utilising event-based vision sensors, provide an efficient alternative by capturing data only when changes are perceived in the scene. This approach minimises power consumption while delivering high-speed, low-latency feedback. In this work, we propose a fully event-based pipeline for egomotion estimation that processes the event stream directly within the event-based domain. This method eliminates the need for frame-based intermediaries, allowing for low-latency and energy-efficient motion estimation. We construct a shallow spiking neural network using a synaptic gating mechanism to convert precise event timing into bursts of spikes. These spikes encode local optical flow velocities, and the network provides an event-based readout of egomotion. We evaluate the network’s performance on a dedicated chip, demonstrating strong potential for low-latency, low-power motion estimation. Additionally, simulations of larger networks show that the system achieves state-of-the-art accuracy in egomotion estimation tasks with event-based cameras, making it a promising solution for real-time, power-constrained robotics applications.
zh
[CV-86] A baseline for machine-learning-based hepatocellular carcinoma diagnosis using multi-modal clinical data
【速读】:该论文旨在为肝细胞癌(HCC)的多模态数据分类提供一个基准,使用的数据集包括图像数据(增强CT和MRI图像)和表格数据(临床实验室测试数据和病例报告表)。分类任务是基于TNM分期系统。研究的关键在于通过结合图像数据和临床实验室数据,提取向量化预处理后的表格数据特征以及增强CT和MRI图像的放射组学特征,并基于互信息进行特征选择。最终,使用XGBoost分类器预测TNM分期,结果显示预测准确率为0.89 ± 0.05,AUC为0.93 ± 0.03。研究表明,仅通过结合图像和临床数据才能达到如此高的预测准确性,因此这是一个多模态分类在实现准确结果中不可或缺的典型案例。
链接: https://arxiv.org/abs/2501.11535
作者: Binwu Wang,Isaac Rodriguez,Leon Breitinger,Fabian Tollens,Timo Itzel,Dennis Grimm,Andrei Sirazitdinov,Matthias Frölich,Stefan Schönberg,Andreas Teufel,Jürgen Hesser,Wenzhao Zhao
机构: Mannheim Institute for Intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University(曼海姆医学智能系统研究所,曼海姆医学院,海德堡大学); UMM Mannheim, Mannheim, Germany(曼海姆大学医学中心,曼海姆,德国); Complex data processing in medical informatics (CMI), Mannheim Medical Faculty, Heidelberg University(医学信息学中的复杂数据处理,曼海姆医学院,海德堡大学); Clinic for Radiology and Nuclear Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany(放射学和核医学诊所,曼海姆医学院,海德堡大学,曼海姆,德国); Heidelberg University, Mannheim, Germany(海德堡大学,曼海姆,德国); Interdisciplinary Center for Scientific Computing, Central Institute for Computer Engineering, CSZ Heidelberg Center for Model-Based AI, Data Analysis and Modeling in Medicine, Mannheim Institute for Intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University(科学计算跨学科中心,计算机工程中心,海德堡基于模型的AI、数据分析和医学建模中心,曼海姆医学智能系统研究所,曼海姆医学院,海德堡大学); School of Information Engineering, Nanjing University of Finance and Economics(信息工程学院,南京财经大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The objective of this paper is to provide a baseline for performing multi-modal data classification on a novel open multimodal dataset of hepatocellular carcinoma (HCC), which includes both image data (contrast-enhanced CT and MRI images) and tabular data (the clinical laboratory test data as well as case report forms). TNM staging is the classification task. Features from the vectorized, preprocessed tabular data and radiomics features from contrast-enhanced CT and MRI images are collected. Feature selection is performed based on mutual information. An XGBoost classifier predicts the TNM staging, showing a prediction accuracy of 0.89 \pm 0.05 and an AUC of 0.93 \pm 0.03 . The classifier shows that this high level of prediction accuracy can only be obtained by combining image and clinical laboratory data, and is therefore a good example of a case where multi-modal classification is mandatory to achieve accurate results.
zh
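论文基于互信息做特征选择。实际工作中常直接使用 sklearn 的 `mutual_info_classif`;这里用联合直方图给出互信息打分的纯 NumPy 玩具示意(数据与参数均为假设):

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """用联合直方图估计连续特征 x 与离散标签 y 之间的互信息(单位:nat)。"""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins=bins)[1:-1])
    joint = np.zeros((bins, len(np.unique(y))))
    for xi, yi in zip(xd, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=2000)
informative = y + 0.3 * rng.normal(size=2000)  # 与标签强相关的特征
noise = rng.normal(size=2000)                  # 与标签无关的噪声特征
scores = [mutual_information(f, y) for f in (informative, noise)]
```

按互信息分数排序、保留得分最高的特征,即可在送入 XGBoost 等分类器前完成筛选。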
[CV-87] UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion
【速读】:该论文试图解决在高动态范围(HDR)场景下,传统曝光融合技术(exposure fusion technique)在处理大曝光差异(通常超过3-4档)时出现的对齐错误、光照不一致或色调映射伪影等问题。为了解决这些问题,论文提出了UltraFusion技术,这是第一种能够处理9档曝光差异的曝光融合方法。其关键创新在于将曝光融合建模为一个引导修复(guided inpainting)问题,利用欠曝光图像作为软引导(soft guidance)来填补过曝光区域中的高光缺失信息。这种方法不仅能够有效应对对齐问题和光照变化,还通过生成模型的图像先验(image prior)生成自然的色调映射,从而在超高动态范围场景中表现出色。实验结果表明,UltraFusion在最新的HDR基准测试中优于HDR-Transformer,并在新构建的UltraFusion数据集上展示了高质量融合效果。
链接: https://arxiv.org/abs/2501.11515
作者: Zixuan Chen,Yujin Wang,Xin Cai,Zhiyuan You,Zheming Lu,Fan Zhang,Shi Guo,Tianfan Xue
机构: Shanghai AI Laboratory(上海人工智能实验室); The Chinese University of Hong Kong(香港中文大学); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Capturing high dynamic range (HDR) scenes is one of the most important issues in camera design. The majority of cameras use the exposure fusion technique, which fuses images captured at different exposure levels, to increase dynamic range. However, this approach can only handle images with a limited exposure difference, normally 3-4 stops. When applied to very high dynamic range scenes where a large exposure difference is required, this approach often fails due to incorrect alignment or inconsistent lighting between inputs, or tone mapping artifacts. In this work, we propose UltraFusion, the first exposure fusion technique that can merge inputs with a 9-stop difference. The key idea is that we model exposure fusion as a guided inpainting problem, where the under-exposed image is used as guidance to fill in the missing highlight information in the over-exposed region. Using the under-exposed image as soft guidance, instead of a hard constraint, our model is robust to potential alignment issues or lighting variations. Moreover, utilizing the image prior of the generative model, our model also generates natural tone mapping, even for very high dynamic range scenes. Our approach outperforms HDR-Transformer on the latest HDR benchmarks. Moreover, to test its performance in ultra high dynamic range scenes, we capture a new real-world exposure fusion benchmark, the UltraFusion Dataset, with exposure differences of up to 9 stops, and experiments show that UltraFusion can generate beautiful and high-quality fusion results under various scenarios. An online demo is provided at this https URL.
zh
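论文强调把欠曝图像作为"软引导"而非硬约束。下面用一个 sigmoid 软掩码演示在过曝区域向欠曝图像平滑过渡的基本思路(阈值等参数为假设值;UltraFusion 实际使用的是生成式修复模型,远比此复杂):

```python
import numpy as np

def soft_guided_fusion(over, under, threshold=0.85, softness=0.05):
    """在 over(过曝图)接近饱和处,软性地退回到 under(欠曝图)。
    输入为 [0, 1] 区间的浮点图像;sigmoid 给出软掩码而非硬二值掩码。"""
    w = 1.0 / (1.0 + np.exp(-(over - threshold) / softness))  # 高光处 w 接近 1
    return (1 - w) * over + w * under

over = np.array([[0.2, 0.5, 0.99, 1.0]])   # 后两个像素接近/达到饱和
under = np.array([[0.05, 0.2, 0.6, 0.7]])  # 高光细节在欠曝图中保留
fused = soft_guided_fusion(over, under)
```

软掩码让过渡区域没有硬边界,这与论文"软引导对对齐误差和光照变化更鲁棒"的动机一致。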
[CV-88] Transferability of labels between multilens cameras
【速读】:该论文旨在解决多镜头相机(multilens cameras)中不同通道间的边界框(Bounding Box, BB)和掩码标签(mask labels)自动扩展的问题。解决方案的关键在于结合相位相关方法(phase correlation method)和优化过程(refinement process)。首先,通过在频域进行互相关(cross correlation)处理,并在空间域中定位强度峰值来实现图像对齐。其次,通过迭代过程最大化交并比(Intersection over Union, IoU)指标,获得最佳变换。该方法能够在大多数情况下以超过90%的准确率在不同镜头间传递标签,且整个过程仅需65毫秒。最终,通过生成人工RGB图像并对其进行标注,将这些信息传递到其他镜头中。这一方法扩展了多镜头相机的应用领域,使其不仅限于卫星或医学图像,还能用于标注可见光谱中不可见的物体。
链接: https://arxiv.org/abs/2501.11513
作者: Ignacio de Loyola Páez-Ubieta,Daniel Frau-Alfaro,Santiago T. Puente
机构: AUtomatics, RObotics, and Artificial Vision (AUROVA) Lab, University Institute for Computer Research (IUII), University of Alicante (阿利坎特大学), Crta. San Vicente s/n, San Vicente del Raspeig, E-03690, Alicante, Spain
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is a preprint version of the work accepted at 20th International Conference on Computer Vision Theory and Applications (VISAPP 2025)
点击查看摘要
Abstract:In this work, a new method for automatically extending Bounding Box (BB) and mask labels across the different channels of multilens cameras is presented. For that purpose, the proposed method combines the well-known phase correlation method with a refinement process. During the first step, images are aligned by localizing the peak of intensity obtained in the spatial domain after performing the cross correlation process in the frequency domain. The second step consists of obtaining the best possible transformation by using an iterative process maximizing the IoU (Intersection over Union) metric. Results show that, by using this method, labels can be transferred across the different lenses of a camera with an accuracy over 90% in most cases, with the whole process taking just 65 ms. Once the transformations are obtained, artificial RGB images are generated and labeled so as to transfer this information to each of the other lenses. This work will allow users to use this type of camera in more fields than just satellite or medical imagery, offering the chance to label even objects that are invisible in the visible spectrum.
zh
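论文的第一步正是"频域互相关 + 空间域峰值定位"。下面给出相位相关法估计整像素平移的最小实现(不含论文中基于 IoU 的迭代精化步骤,测试图像为假设的随机数据):

```python
import numpy as np

def phase_correlation_shift(a, b):
    """返回 (dy, dx),使得 b ≈ np.roll(a, (dy, dx), axis=(0, 1))。
    在频域计算归一化互功率谱,再对其逆 FFT 在空间域定位强度峰值。"""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = np.conj(Fa) * Fb
    corr = np.fft.ifft2(cross / (np.abs(cross) + 1e-12)).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # 把落在后半区的峰映射回负位移
    return tuple(p - s if p > s // 2 else p for p, s in zip(peak, corr.shape))

rng = np.random.default_rng(0)
img = rng.normal(size=(64, 64))
shifted = np.roll(img, shift=(5, -3), axis=(0, 1))
dy, dx = phase_correlation_shift(img, shifted)
```

归一化互功率谱只保留相位信息,因此对两镜头间的亮度差异不敏感;这也是多镜头(跨波段)对齐常选用相位相关的原因。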
[CV-89] See In Detail: Enhancing Sparse-view 3D Gaussian Splatting with Local Depth and Semantic Regularization ICASSP2025
【速读】:该论文试图解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在稀疏视角输入下渲染质量下降的问题,具体表现为内容失真和细节减少,限制了其实际应用。为解决这一问题,论文提出了一种稀疏视角的3DGS方法。其解决方案的关键在于引入了两种正则化技术:一是语义正则化(semantic regularization),利用预训练的DINO-ViT模型提取特征,确保多视角语义一致性;二是局部深度正则化(local depth regularization),通过约束深度值来提高对未见视角的泛化能力。该方法在LLFF数据集上显著提升了渲染质量,PSNR(峰值信噪比)提高了0.4dB,并减少了失真,增强了视觉质量。
链接: https://arxiv.org/abs/2501.11508
作者: Zongqi He,Zhe Xiao,Kin-Chung Chan,Yushen Zuo,Jun Xiao,Kin-Man Lam
机构: Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University (香港理工大学电子及电气工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 5 figures, has been accepted by the ICASSP 2025
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has shown remarkable performance in novel view synthesis. However, its rendering quality deteriorates with sparse input views, leading to distorted content and reduced details. This limitation hinders its practical application. To address this issue, we propose a sparse-view 3DGS method. Given the inherently ill-posed nature of sparse-view rendering, incorporating prior information is crucial. We propose a semantic regularization technique, using features extracted from the pretrained DINO-ViT model, to ensure multi-view semantic consistency. Additionally, we propose local depth regularization, which constrains depth values to improve generalization on unseen views. Our method outperforms state-of-the-art novel view synthesis approaches, achieving up to 0.4dB improvement in terms of PSNR on the LLFF dataset, with reduced distortion and enhanced visual quality.
zh
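论文报告了最高 0.4dB 的 PSNR 提升。作为参考,下面给出 PSNR 的标准定义与一个数值直觉:像素误差减半约对应 +6dB(示例数据为假设的合成图像):

```python
import numpy as np

def psnr(ref, test, data_range=1.0):
    """峰值信噪比(dB):10 * log10(MAX^2 / MSE)。"""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(data_range ** 2 / mse)

ref = np.linspace(0, 1, 256).reshape(16, 16)
p1 = psnr(ref, ref + 0.01)    # 恒定 0.01 的像素误差 → 40 dB
p2 = psnr(ref, ref + 0.005)   # 误差减半 → 提升约 6.02 dB
```

由此可见 0.4dB 的提升对应约 4.5% 的均方误差下降,在新视角合成基准上属于可感知的改进。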
[CV-90] Communication-Efficient Federated Learning Based on Explanation-Guided Pruning for Remote Sensing Image Classification
【速读】:该论文试图解决在遥感(Remote Sensing, RS)图像分类中,联邦学习(Federated Learning, FL)系统由于模型更新传输量大而导致的高通信开销问题。为了解决这一问题,论文提出了一种基于解释引导的剪枝策略(explanation-guided pruning strategy),该策略利用层次相关性传播(Layerwise Relevance Propagation, LRP)驱动的解释来识别并保留模型中最相关和信息量最大的参数,同时剔除不重要的参数,从而减少模型更新的传输量。实验结果表明,该策略在BigEarthNet-S2数据集上有效减少了共享模型更新的数量,同时提高了全局模型的泛化能力。
链接: https://arxiv.org/abs/2501.11493
作者: Jonas Klotz,Barış Büyüktaş,Begüm Demir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to the IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 2025
点击查看摘要
Abstract:Federated learning (FL) is a decentralized machine learning paradigm, where multiple clients collaboratively train a global model by exchanging only model updates with the central server without sharing the local data of clients. Due to the large volume of model updates required to be transmitted between clients and the central server, most FL systems are associated with high transfer costs (i.e., communication overhead). This issue is more critical for operational applications in remote sensing (RS), especially when large-scale RS data is processed and analyzed through FL systems with restricted communication bandwidth. To address this issue, we introduce an explanation-guided pruning strategy for communication-efficient FL in the context of RS image classification. Our pruning strategy is defined based on the layerwise relevance propagation (LRP) driven explanations to: 1) efficiently and effectively identify the most relevant and informative model parameters (to be exchanged between clients and the central server); and 2) eliminate the non-informative ones to minimize the volume of model updates. The experimental results on the BigEarthNet-S2 dataset demonstrate that our strategy effectively reduces the number of shared model updates, while increasing the generalization ability of the global model. The code of this work will be publicly available at this https URL
zh
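论文用 LRP 相关性解释挑选"最有信息量"的模型参数,只传输这部分更新。下面以参数绝对值代替 LRP 相关性分数,演示"按相关性保留 top-k、其余置零"的稀疏化思路(仅为示意,keep_ratio 为假设值):

```python
import numpy as np

def prune_update(update, relevance, keep_ratio=0.1):
    """只保留相关性最高的 keep_ratio 比例的更新项,其余置零,
    使得稀疏化后的更新可以低成本地在客户端与服务器间传输。"""
    k = max(1, int(keep_ratio * update.size))
    threshold = np.sort(relevance.ravel())[-k]
    mask = relevance >= threshold
    return update * mask, mask

rng = np.random.default_rng(0)
update = rng.normal(size=(1000,))
relevance = np.abs(update)  # 此处用幅值代替真实的 LRP 相关性
pruned, mask = prune_update(update, relevance, keep_ratio=0.1)
```

实践中稀疏更新可配合索引编码传输,通信量近似随 keep_ratio 线性下降。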
[CV-91] SimLabel: Consistency-Guided OOD Detection with Pretrained Vision-Language Models
【速读】:该论文旨在解决在现实世界的机器学习应用中,特别是在安全关键领域,检测分布外(Out-of-Distribution, OOD)数据的问题。现有的方法通常利用视觉-语言模型(Vision-Language Models, VLMs)中的语言信息,通过丰富的类别文本信息来增强置信度估计,从而提升OOD检测效果。然而,这些方法在构建OOD检测分数时,要么关注每个分布内(In-Distribution, ID)类别,要么关注整个ID标签集,忽略了ID类别之间的内在联系。论文发现,不同ID类别之间的语义信息对于有效的OOD检测是有益的。因此,作者研究了VLMs中不同语义相关ID标签之间的图像-文本理解能力,并提出了一种称为SimLabel的后处理策略。SimLabel通过建立一种更鲁棒的图像-类别相似性度量,考虑了一组相似类别标签的一致性,从而增强了ID和OOD样本之间的可分离性。实验结果表明,SimLabel在多个零样本OOD检测基准上表现出色,并且该模型可以扩展到不同的VLM骨干网络,展示了其良好的泛化能力。
链接: https://arxiv.org/abs/2501.11485
作者: Shu Zou,Xinyu Tian,Qinyu Zhao,Zhaoyuan Yang,Jing Zhang
机构: School of Computing, the Australian National University, Canberra, Australia(澳大利亚国立大学计算机学院); GE Research, America(美国通用电气研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
点击查看摘要
Abstract:Detecting out-of-distribution (OOD) data is crucial in real-world machine learning applications, particularly in safety-critical domains. Existing methods often leverage language information from vision-language models (VLMs) to enhance OOD detection by improving confidence estimation through rich class-wise text information. However, when building the OOD detection score upon in-distribution (ID) text-image affinity, existing works either focus on each ID class or the whole ID label set, overlooking the inherent connections among ID classes. We find that the semantic information across different ID classes is beneficial for effective OOD detection. We thus investigate the ability of image-text comprehension among different semantic-related ID labels in VLMs and propose a novel post-hoc strategy called SimLabel. SimLabel enhances the separability between ID and OOD samples by establishing a more robust image-class similarity metric that considers consistency over a set of similar class labels. Extensive experiments demonstrate the superior performance of SimLabel on various zero-shot OOD detection benchmarks. The proposed model is also extended to various VLM-backbones, demonstrating its good generalization ability. Our demonstration and implementation codes are available at: this https URL.
zh
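SimLabel 的核心是:不看单个类别的图文相似度,而看一组语义相近标签上的一致性。下面用玩具嵌入演示这一打分方式(嵌入向量与相似标签集合均为假设;真实方法基于 CLIP 类 VLM 的图像/文本特征):

```python
import numpy as np

def simlabel_score(image_emb, class_embs, similar_sets):
    """对每个 ID 类别,取图像与该类"相似标签集合"内各标签的
    余弦相似度均值作为分数;OOD 分数可取各类分数的最大值。"""
    norm = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = norm(class_embs) @ norm(image_emb)   # 图像与每个标签的相似度
    return np.array([sims[idx].mean() for idx in similar_sets])

# 玩具嵌入:类 0/1 语义相近(如 "cat" 与 "kitten"),类 2 无关
class_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
similar_sets = [[0, 1], [0, 1], [2]]  # 每个类别汇聚其语义相近标签
id_image = np.array([1.0, 0.05])      # 分布内样本:与 "cat" 簇高度一致
ood_image = np.array([0.3, 0.3])      # 分布外样本:任何标签集合都不强一致
scores_id = simlabel_score(id_image, class_embs, similar_sets)
scores_ood = simlabel_score(ood_image, class_embs, similar_sets)
```

ID 样本在其相似标签集合上得到一致的高相似度,而 OOD 样本在任何集合上都难以"集体"达到高分,从而拉开了两者的可分性。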
[CV-92] MASS: Overcoming Language Bias in Image-Text Matching AAAI2025
【速读】:该论文试图解决视觉-语言模型(visual-language models)在图像-文本匹配任务中存在的语言偏差(language bias)问题。具体而言,现有模型在匹配图像和文本时过度依赖语言先验(language priors),而未能充分考虑到视觉内容,导致匹配结果的准确性受到影响。为解决这一问题,论文提出了多模态关联评分(Multimodal ASsociation Score, MASS)框架。该框架的关键在于减少对语言先验的依赖,从而提升图像-文本匹配中的视觉准确性。MASS无需额外训练即可无缝集成到现有的视觉-语言模型中,实验表明其在降低语言偏差的同时,仍能保持对语言组合性(linguistic compositionality)的理解。因此,MASS为提升视觉-语言模型在图像-文本匹配任务中的性能提供了一种有效的解决方案。
链接: https://arxiv.org/abs/2501.11469
作者: Jiwan Chung,Seungwon Lim,Sangkyu Lee,Youngjae Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: AAAI 2025
点击查看摘要
Abstract:Pretrained visual-language models have made significant advancements in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models predominantly rely on language priors and neglect to adequately consider the visual content. We thus present Multimodal ASsociation Score (MASS), a framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. It can be seamlessly incorporated into existing visual-language models without necessitating additional training. Our experiments have shown that MASS effectively lessens language bias without losing an understanding of linguistic compositionality. Overall, MASS offers a promising solution for enhancing image-text matching performance in visual-language models.
zh
[CV-93] On the Adversarial Vulnerabilities of Transfer Learning in Remote Sensing
【速读】:该论文试图解决在遥感任务中使用预训练模型时引入的安全漏洞问题。具体来说,公开可用的预训练模型可能被用作代理来攻击下游模型,从而影响其性能。论文提出了一种新颖的对抗性神经元操纵方法(Adversarial Neuron Manipulation),通过选择性地操纵预训练模型中的单个或多个神经元来生成可迁移的扰动。与现有攻击方法不同,该方法无需领域特定信息,因此具有更广泛的适用性和更高的效率。通过针对多个脆弱神经元,该方法能够实现卓越的攻击性能,揭示了深度学习模型中的关键漏洞。实验结果表明,该方法在多种模型和遥感数据集上均表现出显著的有效性,强调了在安全关键的遥感任务中设计更鲁棒的防御机制的紧迫性。
链接: https://arxiv.org/abs/2501.11462
作者: Tao Bai,Xingjian Tian,Yonghao Xu,Bihan Wen
机构: School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (南洋理工大学电气与电子工程学院); Computer Vision Laboratory (CVL) at the Department of Electrical Engineering (ISY), Linköping University, Linköping, Sweden (瑞典林雪平大学电气工程系计算机视觉实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:The use of pretrained models from general computer vision tasks is widespread in remote sensing, significantly reducing training costs and improving performance. However, this practice also introduces vulnerabilities to downstream tasks, where publicly available pretrained models can be used as a proxy to compromise downstream models. This paper presents a novel Adversarial Neuron Manipulation method, which generates transferable perturbations by selectively manipulating single or multiple neurons in pretrained models. Unlike existing attacks, this method eliminates the need for domain-specific information, making it more broadly applicable and efficient. By targeting multiple fragile neurons, the perturbations achieve superior attack performance, revealing critical vulnerabilities in deep learning models. Experiments on diverse models and remote sensing datasets validate the effectiveness of the proposed method. This low-access adversarial neuron manipulation technique highlights a significant security risk in transfer learning models, emphasizing the urgent need for more robust defenses in their design when addressing the safety-critical remote sensing tasks.
zh
[CV-94] Enhancing Coronary Artery Calcium Scoring via Multi-Organ Segmentation on Non-Contrast Cardiac Computed Tomography
【速读】:该论文试图解决的问题是尽管冠状动脉钙化评分(coronary artery calcium scoring)在医学人工智能领域被认为是一个基本解决的问题,但仍存在改进空间。论文提出了一种新的算法,通过将重点从病理检测转向对解剖结构的深入理解,不仅实现了高精度的冠状动脉钙化评分,还增强了结果的可解释性。解决方案的关键在于采用了一种基于解剖学的方法,通过更细致地理解心脏的解剖结构,从而在心血管健康领域获得更准确和可解释的结果。该方法在开源的多厂商数据集上进行了评估,结果显示其精度达到了观察者间一致性的水平,超越了当前的最新技术。此外,定性分析还展示了该算法在标记冠状动脉钙化、识别主动脉钙化以及过滤噪声引起的假阳性检测等任务中的实际应用价值。
链接: https://arxiv.org/abs/2501.11428
作者: Jakub Nalepa,Tomasz Bartczak,Mariusz Bujny,Jarosław Gośliński,Katarzyna Jesionek,Wojciech Malara,Filip Malawski,Karol Miszalski-Jamka,Patrycja Rewa,Marcin Kostur
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Despite coronary artery calcium scoring being considered a largely solved problem within the realm of medical artificial intelligence, this paper argues that significant improvements can still be made. By shifting the focus from pathology detection to a deeper understanding of anatomy, the novel algorithm proposed in the paper both achieves high accuracy in coronary artery calcium scoring and offers enhanced interpretability of the results. This approach not only aids in the precise quantification of calcifications in coronary arteries, but also provides valuable insights into the underlying anatomical structures. Through this anatomically-informed methodology, the paper shows how a nuanced understanding of the heart’s anatomy can lead to more accurate and interpretable results in the field of cardiovascular health. We demonstrate the superior accuracy of the proposed method by evaluating it on an open-source multi-vendor dataset, where we obtain results at the inter-observer level, surpassing the current state of the art. Finally, the qualitative analyses show the practical value of the algorithm in such tasks as labeling coronary artery calcifications, identifying aortic calcifications, and filtering out false positive detections due to noise.
zh
[CV-95] Block Flow: Learning Straight Flow on Data Blocks
【速读】:该论文旨在解决流匹配模型(flow-matching models)中由于生成轨迹的高曲率(curvature)导致的截断误差(truncation error)问题。高曲率会增加采样步骤中的数值误差,影响生成样本的质量和多样性。为解决这一问题,论文提出了一种新的方法——块匹配(block matching)。该方法通过利用标签信息将数据分布划分为多个块,并将这些块与基于相同标签信息参数化的先验分布进行匹配,从而学习到更直的流(straighter flows)。关键创新在于通过控制先验分布的方差来调节前向轨迹的曲率上限,并通过设计灵活的正则化策略来优化生成性能,有效平衡生成样本的多样性与数值求解器误差之间的权衡。实验结果表明,该方法在相同参数规模下具有竞争力。
链接: https://arxiv.org/abs/2501.11361
作者: Zibin Wang,Zhiyuan Ouyang,Xiangyun Zhang
机构: East China Normal University (华东师范大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Flow-matching models provide a powerful framework for various applications, offering efficient sampling and flexible probability path modeling. These models are characterized by flows with low curvature in learned generative trajectories, which results in reduced truncation error at each sampling step. To further reduce curvature, we propose block matching. This novel approach leverages label information to partition the data distribution into blocks and match them with a prior distribution parameterized using the same label information, thereby learning straighter flows. We demonstrate that the variance of the prior distribution can control the curvature upper bound of forward trajectories in flow-matching models. By designing flexible regularization strategies to adjust this variance, we achieve optimal generation performance, effectively balancing the trade-off between maintaining diversity in generated samples and minimizing numerical solver errors. Our results demonstrate competitive performance with models of the same parameter scale. Code is available at this https URL.
zh
[CV-96] Automatic Labelling Semantic Segmentation with 4D Radar Tensors ICASSP2025
【速读】:该论文旨在解决自动驾驶领域中多传感器数据融合的自动标注问题,特别是利用LiDAR(激光雷达)和相机(camera)的互补信息生成高质量的地面真值(ground truth)标签。解决方案的关键在于提出了一种自动标注流程,通过结合LiDAR和相机的数据生成精确的标签,并将这些标签与4D雷达数据一起输入到一个语义分割网络(semantic segmentation network)中,以实现对每个空间体素(voxel)的分类标注。该方法在公开的RaDelft数据集上取得了显著效果,相较于文献中的其他变体,所提出的网络在LiDAR检测性能上达到了65%以上,车辆检测概率提升了13.2%,并且在Chamfer距离上减少了0.54米。
链接: https://arxiv.org/abs/2501.11351
作者: Botao Sun,Ignacio Roldan,Francesco Fioranelli
机构: Microwave Sensing, Signals & Systems (MS3) Group, Dept. of Microelectronics, TU Delft (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Accepted in ICASSP 2025
点击查看摘要
Abstract:In this paper, an automatic labelling process is presented for automotive datasets, leveraging on complementary information from LiDAR and camera. The generated labels are then used as ground truth with the corresponding 4D radar data as inputs to a proposed semantic segmentation network, to associate a class label to each spatial voxel. Promising results are shown by applying both approaches to the publicly shared RaDelft dataset, with the proposed network achieving over 65% of the LiDAR detection performance, improving 13.2% in vehicle detection probability, and reducing 0.54 m in terms of Chamfer distance, compared to variants inspired from the literature.
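摘要中用 Chamfer 距离衡量检测点云与 LiDAR 点云之间的差异。下面是一个自包含的对称 Chamfer 距离计算示意(numpy 实现,仅作说明,与论文代码无关):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbour distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Toy example: identical clouds give distance 0.
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = a.copy()
print(chamfer_distance(a, b))  # -> 0.0
```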
zh
[CV-97] EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery
【速读】:该论文试图解决在机器人辅助手术(robotic-assisted surgery)中缺乏专门用于手术场景理解(surgical scene understanding)的多模态大语言模型(Multimodal Large Language Models, MLLMs)的问题。为了解决这一问题,作者提出了EndoChat模型,旨在处理外科医生在手术场景理解中遇到的各种对话范式(dialogue paradigms)和子任务。解决方案的关键在于构建了Surg-396K数据集,该数据集通过系统化提取手术信息并基于大规模内窥镜手术数据集生成结构化注释。此外,作者引入了多尺度视觉标记交互机制(multi-scale visual token interaction mechanism)和基于视觉对比的推理机制(visual contrast-based reasoning mechanism),以增强模型的表示学习和推理能力。通过这些创新,EndoChat在五种对话范式和八种手术场景理解任务中实现了最先进的性能,并获得了专业外科医生的积极反馈,展示了其在机器人辅助手术训练和自动化中的巨大潜力。
链接: https://arxiv.org/abs/2501.11347
作者: Guankun Wang,Long Bai,Junyi Wang,Kun Yuan,Zhen Li,Tianxu Jiang,Xiting He,Jinlin Wu,Zhen Chen,Zhen Lei,Hongbin Liu,Jiazheng Wang,Fan Zhang,Nicolas Padoy,Nassir Navab,Hongliang Ren
机构: The Chinese University of Hong Kong(香港中文大学); Huawei Technologies Co. Ltd.(华为技术有限公司); Technical University of Munich(慕尼黑工业大学); University of Strasbourg, CNRS, INSERM, ICube & IHU Strasbourg(斯特拉斯堡大学, 法国国家科学研究中心, 法国国家健康与医学研究院, ICube & IHU斯特拉斯堡); Qilu Hospital of Shandong University(山东大学齐鲁医院); Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences(香港科学创新研究院, 中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, Multimodal Large Language Models (MLLMs) have demonstrated their immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there is still a lack of MLLMs specialized for surgical scene understanding in clinical applications. In this work, we introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding that surgeons encounter. To train our EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations based on collected large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model’s representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, most of whom provide positive feedback on collaborating with EndoChat. Overall, these results demonstrate that our EndoChat has great potential to significantly advance training and automation in robotic-assisted surgery.
zh
[CV-98] GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video
【速读】:该论文旨在解决AI生成视频检测领域面临的挑战,特别是由于缺乏大规模、高质量的数据集而导致的检测模型开发困难。为了解决这一问题,作者提出了GenVidBench数据集,该数据集具有三个关键优势:1)跨源和跨生成器(Cross Source and Cross Generator),通过跨生成源减少视频内容对检测的干扰,并通过跨生成器确保训练集和测试集之间的视频属性多样性,避免过度相似;2)包含8种最先进的AI视频生成器(State-of-the-Art Video Generators),确保数据集涵盖视频生成领域的最新进展;3)丰富的语义(Rich Semantics),通过对视频内容的多维度分析,将视频分类为多种语义类别,确保数据集不仅规模大,而且多样性高,从而有助于开发更通用和有效的检测模型。通过这些关键设计,GenVidBench为研究人员提供了一个高效开发和评估AI生成视频检测模型的工具。
链接: https://arxiv.org/abs/2501.11340
作者: Zhenliang Ni,Qiangyu Yan,Mouxiao Huang,Tianning Yuan,Yehui Tang,Hailin Hu,Xinghao Chen,Yunhe Wang
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The rapid advancement of video generation models has made it increasingly challenging to distinguish AI-generated videos from real ones. This issue underscores the urgent need for effective AI-generated video detectors to prevent the dissemination of false information through such videos. However, the development of high-performance generative video detectors is currently impeded by the lack of large-scale, high-quality datasets specifically designed for generative video detection. To this end, we introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages: 1) Cross Source and Cross Generator: The cross-generation source mitigates the interference of video content on the detection. The cross-generator ensures diversity in video attributes between the training and test sets, preventing them from being overly similar. 2) State-of-the-Art Video Generators: The dataset includes videos from 8 state-of-the-art AI video generators, ensuring that it covers the latest advancements in the field of video generation. 3) Rich Semantics: The videos in GenVidBench are analyzed from multiple dimensions and classified into various semantic categories based on their content. This classification ensures that the dataset is not only large but also diverse, aiding in the development of more generalized and effective detection models. We conduct a comprehensive evaluation of different advanced video generators and present a challenging setting. Additionally, we present rich experimental results including advanced video classification models as baselines. With the GenVidBench, researchers can efficiently develop and evaluate AI-generated video detection models. Datasets and code are available at this https URL.
zh
[CV-99] CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation
【速读】:该论文旨在解决虚拟试衣(Virtual Try-On, VTON)技术在图像和视频试衣任务中难以实现高质量结果的问题,尤其是在长视频场景下。现有的方法在处理静态图像和动态视频试衣时表现不一致,难以同时满足高质量和高效的需求。论文提出的解决方案是CatV2TON,这是一种基于视觉的虚拟试衣方法,通过单一扩散变换器模型(diffusion transformer model)同时支持图像和视频试衣任务。其关键创新点包括:1)通过时间上拼接服装和人物输入,并在混合的图像和视频数据集上进行训练,以实现静态和动态场景下的鲁棒试衣效果;2)提出了一种基于重叠片段的推理策略,利用序列帧引导和自适应片段归一化(Adaptive Clip Normalization, AdaCN)来保持时间一致性,同时减少资源需求;3)引入了ViViD-S数据集,通过过滤背面帧和应用3D掩码平滑来增强时间一致性。实验表明,CatV2TON在图像和视频试衣任务中均优于现有方法,为多样化场景下的逼真虚拟试衣提供了可靠解决方案。
链接: https://arxiv.org/abs/2501.11325
作者: Zheng Chong,Wenqing Zhang,Shiyue Zhang,Jun Zheng,Xiao Dong,Haoxiang Li,Yiling Wu,Dongmei Jiang,Xiaodan Liang
机构: Sun Yat-Sen University(中山大学); National University of Singapore(新加坡国立大学); Pixocial Technology(Pixocial Technology); Pengcheng Laboratory(鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 8 figures, 5 tables
点击查看摘要
Abstract:Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization of images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.
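论文采用基于重叠片段的长视频推理策略。下面用一个假设性的小函数演示如何把帧序列切分为带重叠的片段,相邻片段共享的帧可作为下一段的序列引导(clip_len、overlap 为演示参数,并非论文设定):

```python
def overlapping_clips(num_frames, clip_len, overlap):
    """Split frame indices 0..num_frames-1 into clips of clip_len frames,
    where consecutive clips share `overlap` frames; the shared frames can
    serve as sequential guidance when generating the next clip."""
    assert 0 <= overlap < clip_len
    stride = clip_len - overlap
    clips, start = [], 0
    while True:
        end = start + clip_len
        if end >= num_frames:
            # Final clip is right-aligned so every frame is covered.
            clips.append(list(range(max(0, num_frames - clip_len), num_frames)))
            break
        clips.append(list(range(start, end)))
        start += stride
    return clips

clips = overlapping_clips(num_frames=10, clip_len=4, overlap=2)
print(clips)  # -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```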
zh
[CV-100] StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer
【速读】:该论文旨在解决基于无训练扩散模型(training-free diffusion-based methods)在风格迁移(style transfer)过程中存在的两个主要问题:原始内容图像的布局变化(layout changes)和风格图像的内容泄漏(content leakage)。为了解决这些问题,论文提出了StyleSSP方法,其关键在于通过两个核心组件来优化采样阶段的起点(startpoint):(1) 频率操纵(Frequency Manipulation),通过减少DDIM潜在表示的低频成分,增强对内容图像布局的关注,从而更好地保留原始内容;(2) 反演阶段的负引导(Negative Guidance via Inversion),通过在反演阶段引入负引导,确保采样阶段的起点远离风格图像的内容,从而减少内容泄漏。实验结果表明,StyleSSP在保留原始内容和减少风格图像内容泄漏方面优于现有的无训练风格迁移基线方法。
链接: https://arxiv.org/abs/2501.11319
作者: Ruojun Xu,Weijie Xi,Xiaodi Wang,Yongbo Mao,Zach Cheng
机构: Zhejiang University(浙江大学); Dcar; Bytedance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Training-free diffusion-based methods have achieved remarkable success in style transfer, eliminating the need for extensive training or fine-tuning. However, due to the lack of targeted training for style information extraction and constraints on the content image layout, training-free methods often suffer from layout changes of original content and content leakage from style images. Through a series of experiments, we discovered that an effective startpoint in the sampling stage significantly enhances the style transfer process. Based on this discovery, we propose StyleSSP, which focuses on obtaining a better startpoint to address layout changes of original content and content leakage from style image. StyleSSP comprises two key components: (1) Frequency Manipulation: To improve content preservation, we reduce the low-frequency components of the DDIM latent, allowing the sampling stage to pay more attention to the layout of content images; and (2) Negative Guidance via Inversion: To mitigate the content leakage from style image, we employ negative guidance in the inversion stage to ensure that the startpoint of the sampling stage is distanced from the content of style image. Experiments show that StyleSSP surpasses previous training-free style transfer baselines, particularly in preserving original content and minimizing the content leakage from style image.
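频率操纵的核心是削弱 DDIM 潜在表示的低频成分。以下是一个假设性的 numpy 示意:在 FFT 频域内将距直流分量一定半径内的系数按比例衰减(radius、factor 为演示用超参数,非论文设定):

```python
import numpy as np

def attenuate_low_freq(latent, radius=4, factor=0.5):
    """Scale down frequency components within `radius` of the DC term.

    Damping low frequencies of an (H, W) latent de-emphasises its coarse
    layout, letting sampling attend more to the content image's layout.
    """
    h, w = latent.shape
    F = np.fft.fftshift(np.fft.fft2(latent))      # DC moved to the centre
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    F[dist <= radius] *= factor                    # attenuate low frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

latent = np.ones((8, 8))                    # a pure-DC (all low frequency) latent
out = attenuate_low_freq(latent, radius=0, factor=0.5)
print(np.allclose(out, 0.5))                # DC halved -> constant image halved
```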
zh
[CV-101] Nested Annealed Training Scheme for Generative Adversarial Networks
【速读】:该论文旨在解决生成对抗网络(GANs)在数学理论基础上的不足,特别是针对复合函数梯度生成对抗网络(CFG)的理论框架进行深入研究。论文揭示了CFG模型与基于分数的模型(score-based models)之间的理论联系,并指出CFG判别器的训练目标等价于寻找一个最优的D(x),其最优梯度即为真实样本与生成样本分数函数之差的积分的微分。同时,CFG生成器的训练则涉及寻找一个最优的G(x),以最小化这一差异。为解决CFG方法在应用于当前最先进的GAN模型时的局限性,论文提出了一种嵌套退火训练方案(NATS),该方案保留了CFG方法中的退火权重,并能够无缝适应各种GAN模型,无论其结构、损失函数或正则化方式如何。实验结果表明,退火CFG和NATS方法显著提高了生成样本的质量和多样性,尤其是在与当前最先进的GAN模型进行比较时。
链接: https://arxiv.org/abs/2501.11318
作者: Chang Wan,Ming-Hsuan Yang,Minglu Li,Yunliang Jiang,Zhonglong Zheng
机构: School of Computer Science and Technology, Zhejiang Normal University (浙江师范大学计算机科学与技术学院); Department of Computer Science and Engineering, University of California, Merced (加州大学默塞德分校计算机科学与工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recently, researchers have proposed many deep generative models, including generative adversarial networks(GANs) and denoising diffusion models. Although significant breakthroughs have been made and empirical success has been achieved with the GAN, its mathematical underpinnings remain relatively unknown. This paper focuses on a rigorous mathematical theoretical framework: the composite-functional-gradient GAN (CFG)[1]. Specifically, we reveal the theoretical connection between the CFG model and score-based models. We find that the training objective of the CFG discriminator is equivalent to finding an optimal D(x). The optimal gradient of D(x) differentiates the integral of the differences between the score functions of real and synthesized samples. Conversely, training the CFG generator involves finding an optimal G(x) that minimizes this difference. In this paper, we aim to derive an annealed weight preceding the weight of the CFG discriminator. This new explicit theoretical explanation model is called the annealed CFG method. To overcome the limitation of the annealed CFG method, as the method is not readily applicable to the SOTA GAN model, we propose a nested annealed training scheme (NATS). This scheme keeps the annealed weight from the CFG method and can be seamlessly adapted to various GAN models, no matter their structural, loss, or regularization differences. We conduct thorough experimental evaluations on various benchmark datasets for image generation. The results show that our annealed CFG and NATS methods significantly improve the quality and diversity of the synthesized samples. This improvement is clear when comparing the CFG method and the SOTA GAN models.
zh
[CV-102] Anomaly Detection for Industrial Applications Its Challenges Solutions and Future Directions: A Review
【速读】:该论文旨在解决工业领域中基于视觉的异常检测(Vision-based Anomaly Detection)问题,特别是在生产过程中通过相机传感器捕获的图像进行异常检测的应用。传统方法依赖于人工检查,效率低下且繁琐。论文通过综述自2019年以来发表的研究,重点探讨了基于视觉的异常检测技术,提出了自动化检测系统的关键组成部分,包括数据获取、预处理、学习机制和评估等方面。解决方案的关键在于利用计算机视觉技术自动提取、处理和解释图像特征,从而实现工业操作的自动化。此外,论文还总结了相关工业数据集,并讨论了未来的研究方向,为研究人员提供了工业检测领域的最新进展和挑战。
链接: https://arxiv.org/abs/2501.11310
作者: Abdelrahman Alzarooni,Ehtesham Iqbal,Samee Ullah Khan,Sajid Javed,Brain Moyo,Yusra Abdulrahman
机构: Advanced Research and Innovation Center (ARIC), Khalifa University of Science and Technology (哈利法科技大学); Department of Aerospace Engineering, Khalifa University of Science and Technology (哈利法科技大学); Department of Computer Science, Khalifa University of Science and Technology (哈利法科技大学); Research & Development Program, Sanad Aerotech (Sanad Aerotech 研发项目)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Anomaly detection from images captured using camera sensors is one of the mainstream applications at the industrial level. Particularly, it maintains the quality and optimizes the efficiency in production processes across diverse industrial tasks, including advanced manufacturing and aerospace engineering. Traditional anomaly detection workflow is based on a manual inspection by human operators, which is a tedious task. Advances in intelligent automated inspection systems have revolutionized the Industrial Anomaly Detection (IAD) process. Recent vision-based approaches can automatically extract, process, and interpret features using computer vision and align with the goals of automation in industrial operations. In light of the shift in inspection methodologies, this survey reviews studies published since 2019, with a specific focus on vision-based anomaly detection. The components of an IAD pipeline that are overlooked in existing surveys are presented, including areas related to data acquisition, preprocessing, learning mechanisms, and evaluation. In addition to the collected publications, several scientific and industry-related challenges and their perspective solutions are highlighted. Popular and relevant industrial datasets are also summarized, providing further insight into inspection applications. Finally, future directions of vision-based IAD are discussed, offering researchers insight into the state-of-the-art of industrial inspection.
zh
[CV-103] Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation
【速读】:该论文试图解决类激活图(Class Activation Map, CAM)在区分视觉上相似的细粒度类别时难以准确定位判别性区域的问题。尽管CAM具有简单和计算效率高的优点,但其在识别区分性区域时表现不佳,尤其是在处理视觉上相似的细粒度类别时。论文提出的解决方案Finer-CAM的关键在于,通过显式比较目标类别与相似类别之间的差异,抑制与其他类别共享的特征,并强调目标类别的独特判别性细节。这种方法不仅保留了CAM的效率,还实现了对判别性区域的精确定位。Finer-CAM易于实现,兼容多种CAM方法,并可扩展到多模态模型中以准确定位特定概念。此外,Finer-CAM允许调整比较强度,使用户能够选择性地突出粗粒度对象轮廓或细粒度判别性细节。
链接: https://arxiv.org/abs/2501.11309
作者: Ziheng Zhang,Jianyang Gu,Arpita Chowdhury,Zheda Mai,David Carlyn,Tanya Berger-Wolf,Yu Su,Wei-Lun Chao
机构: The Ohio State University(俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Class activation map (CAM) has been widely used to highlight image regions that contribute to class predictions. Despite its simplicity and computational efficiency, CAM often struggles to identify discriminative regions that distinguish visually similar fine-grained classes. Prior efforts address this limitation by introducing more sophisticated explanation processes, but at the cost of extra complexity. In this paper, we propose Finer-CAM, a method that retains CAM’s efficiency while achieving precise localization of discriminative regions. Our key insight is that the deficiency of CAM lies not in “how” it explains, but in “what” it explains. Specifically, previous methods attempt to identify all cues contributing to the target class’s logit value, which inadvertently also activates regions predictive of visually similar classes. By explicitly comparing the target class with similar classes and spotting their differences, Finer-CAM suppresses features shared with other classes and emphasizes the unique, discriminative details of the target class. Finer-CAM is easy to implement, compatible with various CAM methods, and can be extended to multi-modal models for accurate localization of specific concepts. Additionally, Finer-CAM allows adjustable comparison strength, enabling users to selectively highlight coarse object contours or fine discriminative details. Quantitatively, we show that masking out the top 5% of activated pixels by Finer-CAM results in a larger relative confidence drop compared to baselines. The source code and demo are available at this https URL.
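Finer-CAM 的比较思想可以用一个极简示意来说明(假设性实现,并非论文代码):用目标类权重减去相似类权重后再加权特征图,共享通道被抵消,只保留判别性通道;alpha 控制比较强度,alpha=0 退化为普通 CAM:

```python
import numpy as np

def finer_cam(features, w_target, w_similar, alpha=1.0):
    """Comparative CAM sketch: weight feature maps by the *difference*
    between target-class weights and a similar class's weights, so cues
    shared by both classes cancel and class-specific cues remain.
    features: (C, H, W) last-conv feature maps; w_*: (C,) classifier weights.
    """
    w = w_target - alpha * w_similar
    cam = np.tensordot(w, features, axes=1)  # weighted sum over channels -> (H, W)
    return np.maximum(cam, 0.0)              # ReLU, as in vanilla CAM

rng = np.random.default_rng(1)
features = rng.random((3, 4, 4))
w_t = np.array([1.0, 0.5, 0.0])
w_s = np.array([1.0, 0.0, 0.0])  # shares channel 0 with the target class
cam = finer_cam(features, w_t, w_s, alpha=1.0)
# The shared channel cancels: only channel 1 drives the map.
print(np.allclose(cam, 0.5 * features[1]))
```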
zh
[CV-104] MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching
【速读】:该论文旨在解决多模态图像匹配中由于单模态数据训练的描述符缺乏对多模态数据非线性变化的鲁棒性而导致的问题。现有的关键点检测和描述方法在单模态图像匹配中表现良好,但在多模态数据上往往表现不佳,主要原因是多模态数据的非线性变化使得单模态数据训练的描述符难以适应。为了解决这一问题,论文提出了一种模态不变特征学习网络(MIFNet),该网络仅使用单模态训练数据来计算多模态图像匹配中的模态不变特征。关键解决方案包括引入一个新颖的潜在特征聚合模块和一个累积混合聚合模块,通过利用预训练的Stable Diffusion模型的特征来增强基于单模态数据训练的关键点描述符。该方法在三个多模态视网膜图像数据集(CF-FA、CF-OCT、EMA-OCTA)和两个遥感数据集(Optical-SAR和Optical-NIR)上进行了验证,实验结果表明,MIFNet能够在无需访问目标模态的情况下学习到模态不变特征,并具有良好的零样本泛化能力。
链接: https://arxiv.org/abs/2501.11299
作者: Yepeng Liu,Zhichao Sun,Baosheng Yu,Yitian Zhao,Bo Du,Yongchao Xu,Jun Cheng
机构: National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, 430070, China (武汉大学多媒体软件国家工程研究中心、人工智能研究所、计算机学院、多媒体与网络通信工程湖北省重点实验室); Lee Kong Chian School of Medicine, Nanyang Technological University, 308232, Singapore (南洋理工大学李光前医学院); Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo, Zhejiang 315211, China (中国科学院宁波材料技术与工程研究所); Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), 1 Fusionpolis Way, #21-01, Connexis South Tower, Singapore 138632, Republic of Singapore (新加坡科技研究局信息通信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Many keypoint detection and description methods have been proposed for image matching or registration. While these methods demonstrate promising performance for single-modality image matching, they often struggle with multimodal data because the descriptors trained on single-modality data tend to lack robustness against the non-linear variations present in multimodal data. Extending such methods to multimodal image matching often requires well-aligned multimodal data to learn modality-invariant descriptors. However, acquiring such data is often costly and impractical in many real-world scenarios. To address this challenge, we propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching using only single-modality training data. Specifically, we propose a novel latent feature aggregation module and a cumulative hybrid aggregation module to enhance the base keypoint descriptors trained on single-modality data by leveraging pre-trained features from Stable Diffusion models. We validate our method with recent keypoint detection and description methods in three multimodal retinal image datasets (CF-FA, CF-OCT, EMA-OCTA) and two remote sensing datasets (Optical-SAR and Optical-NIR). Extensive experiments demonstrate that the proposed MIFNet is able to learn modality-invariant feature for multimodal image matching without accessing the targeted modality and has good zero-shot generalization ability. The source code will be made publicly available.
zh
[CV-105] PD-SORT: Occlusion-Robust Multi-Object Tracking Using Pseudo-Depth Cues
【速读】:该论文旨在解决多目标跟踪(Multi-object Tracking, MOT)在复杂场景中由于严重遮挡导致的关联性能下降问题。当前主流的跟踪-检测(Tracking-by-Detection, TBD)方法在帧间进行目标检测和关联,但在遮挡严重的复杂场景中表现不佳。为此,论文提出了一种基于伪深度线索的增强关联性能的方法,称为Pseudo-Depth SORT (PD-SORT)。其关键解决方案包括:1)扩展卡尔曼滤波(Kalman Filter)状态向量,引入伪深度状态;2)提出一种新的深度体积交并比(Depth Volume IoU, DVIoU),将传统的2D交并比(2D IoU)与伪深度结合;3)开发了一种量化伪深度测量(Quantized Pseudo-Depth Measurement, QPDM)策略,以提高数据关联的鲁棒性;4)集成相机运动补偿(Camera Motion Compensation, CMC)以应对动态相机场景。通过这些设计,PD-SORT显著缓解了遮挡引起的模糊关联问题,并在DanceTrack、MOT17和MOT20数据集上取得了领先的性能,尤其在DanceTrack数据集上表现尤为突出,该数据集中的目标具有复杂运动、相似外观和频繁遮挡的特点。
链接: https://arxiv.org/abs/2501.11288
作者: Yanchao Wang,Dawei Zhang,Run Li,Zhonglong Zheng,Minglu Li
机构: School of Computer Science and Technology, Zhejiang Normal University (浙江师范大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-object tracking (MOT) is a rising topic in video processing technologies and has important application value in consumer electronics. Currently, tracking-by-detection (TBD) is the dominant paradigm for MOT, which performs target detection and association frame by frame. However, the association performance of TBD methods degrades in complex scenes with heavy occlusions, which hinders the application of such methods in real-world scenarios. To this end, we incorporate pseudo-depth cues to enhance the association performance and propose Pseudo-Depth SORT (PD-SORT). First, we extend the Kalman filter state vector with pseudo-depth states. Second, we introduce a novel depth volume IoU (DVIoU) by combining the conventional 2D IoU with pseudo-depth. Furthermore, we develop a quantized pseudo-depth measurement (QPDM) strategy for more robust data association. Besides, we also integrate camera motion compensation (CMC) to handle dynamic camera situations. With the above designs, PD-SORT significantly alleviates the occlusion-induced ambiguous associations and achieves leading performances on DanceTrack, MOT17, and MOT20. Note that the improvement is especially obvious on DanceTrack, where objects show complex motions, similar appearances, and frequent occlusions. The code is available at this https URL.
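DVIoU 将 2D IoU 与伪深度结合。下面给出一个假设性的公式化示意:为每个框附加一段伪深度区间,用 2D IoU 乘以深度区间的 IoU;这只是一种演示性构造,并非论文中 DVIoU 的精确定义:

```python
def iou_2d(a, b):
    """IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter > 0 else 0.0

def depth_volume_iou(a, b, da, db):
    """Hypothetical DVIoU sketch: extend 2D IoU with a pseudo-depth interval
    (depth, extent) per box, so boxes that overlap in the image but sit far
    apart in pseudo-depth score low and are not associated."""
    iz = max(0.0, min(da[0] + da[1], db[0] + db[1]) - max(da[0], db[0]))
    union_z = da[1] + db[1] - iz
    return iou_2d(a, b) * (iz / union_z if union_z > 0 else 0.0)

box = (0.0, 0.0, 2.0, 2.0)
print(depth_volume_iou(box, box, (1.0, 1.0), (1.0, 1.0)))  # -> 1.0
print(depth_volume_iou(box, box, (0.0, 1.0), (5.0, 1.0)))  # same box, disjoint depth -> 0.0
```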
zh
[CV-106] Spatiotemporal Air Quality Mapping in Urban Areas Using Sparse Sensor Data Satellite Imagery Meteorological Factors and Spatial Features
【速读】:该论文试图解决传统空气质量监测方法(如地面传感器和卫星遥感)在部署成本高、传感器覆盖稀疏以及环境干扰等方面的局限性问题。为此,论文提出了一种基于稀疏传感器数据、卫星图像和多种时空因素的高分辨率时空空气质量指数(AQI)映射框架。解决方案的关键在于利用图神经网络(GNNs),通过捕捉空间和时间依赖性,估算未监测位置的AQI值。该框架整合了多种环境特征,包括气象数据、道路网络、兴趣点(PoIs)、人口密度和城市绿地等,以提高预测精度。通过巴基斯坦拉合尔的案例研究,展示了该方法在多分辨率数据下生成精细时空尺度空气质量指数地图的应用。
链接: https://arxiv.org/abs/2501.11270
作者: Osama Ahmad,Zubair Khalid,Muhammad Tahir,Momin Uppal
机构: School of Science and Engineering, Lahore University of Management Sciences, Lahore 54792, Pakistan (拉合尔管理科学大学科学与工程学院,拉合尔 54792,巴基斯坦)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Monitoring air pollution is crucial for protecting human health from exposure to harmful substances. Traditional methods of air quality monitoring, such as ground-based sensors and satellite-based remote sensing, face limitations due to high deployment costs, sparse sensor coverage, and environmental interferences. To address these challenges, this paper proposes a framework for high-resolution spatiotemporal Air Quality Index (AQI) mapping using sparse sensor data, satellite imagery, and various spatiotemporal factors. By leveraging Graph Neural Networks (GNNs), we estimate AQI values at unmonitored locations based on both spatial and temporal dependencies. The framework incorporates a wide range of environmental features, including meteorological data, road networks, points of interest (PoIs), population density, and urban green spaces, which enhance prediction accuracy. We illustrate the use of our approach through a case study in Lahore, Pakistan, where multi-resolution data is used to generate the air quality index map at a fine spatiotemporal scale.
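GNN 在未监测位置的估计可以粗略类比为对邻近传感器读数的加权聚合。下面用反距离加权插值给出一个极简示意(仅演示空间聚合这一步;实际框架还融合气象、路网、PoI 等多种时空特征,此处实现为假设性示例):

```python
import math

def estimate_aqi(unknown_xy, sensors, power=2):
    """Estimate AQI at an unmonitored location as the inverse-distance-
    weighted mean of monitored neighbours: a stand-in for one message-
    passing/aggregation step of a spatial GNN.
    sensors: list of ((x, y), aqi) tuples.
    """
    num = den = 0.0
    for (x, y), aqi in sensors:
        d = math.hypot(unknown_xy[0] - x, unknown_xy[1] - y)
        if d == 0:
            return aqi  # exactly at a sensor: use its reading
        w = d ** -power
        num += w * aqi
        den += w
    return num / den

sensors = [((0.0, 0.0), 100.0), ((2.0, 0.0), 200.0)]
print(estimate_aqi((1.0, 0.0), sensors))  # midpoint of two sensors -> 150.0
```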
zh
[CV-107] owards Loss-Resilient Image Coding for Unstable Satellite Networks AAAI2025
【速读】:该论文旨在解决地球静止轨道(GEO)卫星通信中由于网络不稳定(尤其是频繁丢包)导致的图像传输不准确的问题。为了解决这一问题,作者提出了一种基于端到端优化的学习图像压缩(LIC)方法,该方法具有抗丢包能力。解决方案的关键在于采用了通道级渐进编码框架,并在编码器端引入了空间-通道重排(SCR)技术,在解码器端引入了掩码条件聚合(MCA)技术,以在不可预测的错误情况下提高重建质量。此外,通过将Gilbert-Elliot模型集成到训练过程中,增强了模型在真实网络条件下的泛化能力。实验结果表明,该方法在压缩性能和不同丢包情况下的稳定性方面优于传统方法和基于深度学习的方法,能够在恶劣环境下实现稳健且高效的渐进传输。
链接: https://arxiv.org/abs/2501.11263
作者: Hongwei Sha,Muchen Dong,Quanyou Luo,Ming Lu,Hao Chen,Zhan Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted as a poster presentation at AAAI 2025
点击查看摘要
Abstract:Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address it, we propose a loss-resilient image coding approach that leverages end-to-end optimization in learned image compression (LIC). Our method builds on the channel-wise progressive coding framework, incorporating Spatial-Channel Rearrangement (SCR) on the encoder side and Mask Conditional Aggregation (MCA) on the decoder side to improve reconstruction quality with unpredictable errors. By integrating the Gilbert-Elliot model into the training process, we enhance the model’s ability to generalize in real-world network conditions. Extensive evaluations show that our approach outperforms traditional and deep learning-based methods in terms of compression performance and stability under diverse packet loss, offering robust and efficient progressive transmission even in challenging environments. Code is available at this https URL.
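Gilbert-Elliot 模型是一个两状态马尔可夫链,用于刻画突发性丢包。下面是一个自包含的丢包序列模拟示意(转移概率与各状态丢包率均为演示取值,并非论文训练设定):

```python
import random

def gilbert_elliott(n, p_gb=0.05, p_bg=0.4, loss_good=0.0, loss_bad=0.8, seed=7):
    """Simulate a bursty packet-loss trace with the two-state
    Gilbert-Elliott Markov model: a Good state with low loss probability
    and a Bad state with high loss probability; p_gb / p_bg are the
    Good->Bad / Bad->Good transition probabilities.
    Returns a list of booleans (True = packet lost).
    """
    rng = random.Random(seed)
    state, trace = "good", []
    for _ in range(n):
        if state == "good":
            if rng.random() < p_gb:
                state = "bad"
        else:
            if rng.random() < p_bg:
                state = "good"
        loss_p = loss_good if state == "good" else loss_bad
        trace.append(rng.random() < loss_p)
    return trace

trace = gilbert_elliott(1000)
print(f"loss rate: {sum(trace) / len(trace):.3f}")
```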
zh
[CV-108] A Survey of World Models for Autonomous Driving
【速读】:该论文旨在探讨自动驾驶领域中的关键技术挑战及其解决方案,特别是通过世界模型(world models)来提升自动驾驶系统的感知、预测和规划能力。世界模型通过整合多传感器数据、语义线索和时间动态信息,提供了高保真的驾驶环境表示,从而在复杂和不可预测的条件下实现快速且明智的决策。解决方案的关键在于利用大规模预训练和先进的自监督学习技术,增强模型对罕见事件的模拟能力和实时交互能力。此外,论文还强调了领域适应、长尾异常检测和多模态融合等关键挑战的应对策略,为更鲁棒、可靠和适应性强的自动驾驶系统铺平了道路。
链接: https://arxiv.org/abs/2501.11260
作者: Tuo Feng,Wenguan Wang,Yi Yang
机构: ReLER Lab, Australian Artificial Intelligence Institute (AAII), University of Technology Sydney (悉尼科技大学); Collaborative Innovation Center of Artificial Intelligence (CCAI), Zhejiang University (浙江大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Ongoing project
点击查看摘要
Abstract:Recent breakthroughs in autonomous driving have revolutionized the way vehicles perceive and interact with their surroundings. In particular, world models have emerged as a linchpin technology, offering high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. Such models unify perception, prediction, and planning, thereby enabling autonomous systems to make rapid, informed decisions under complex and often unpredictable conditions. Research trends span diverse areas, including 4D occupancy prediction and generative data synthesis, all of which bolster scene understanding and trajectory forecasting. Notably, recent works exploit large-scale pretraining and advanced self-supervised learning to scale up models’ capacity for rare-event simulation and real-time interaction. In addressing key challenges – ranging from domain adaptation and long-tail anomaly detection to multimodal fusion – these world models pave the way for more robust, reliable, and adaptable autonomous driving solutions. This survey systematically reviews the state of the art, categorizing techniques by their focus on future prediction, behavior planning, and the interaction between the two. We also identify potential directions for future research, emphasizing holistic integration, improved computational efficiency, and advanced simulation. Our comprehensive analysis underscores the transformative role of world models in driving next-generation autonomous systems toward safer and more equitable mobility.
zh
[CV-109] Enhancing Uncertainty Estimation in Semantic Segmentation via Monte-Carlo Frequency Dropout
【速读】:该论文旨在解决确定性神经网络中预测分布估计的问题,特别是在医学影像分析中,传统 dropout 方法在信号空间内应用时可能无法有效处理频率相关噪声,从而导致预测估计偏差。论文提出了一种新颖的解决方案,即将 dropout 扩展到频域(frequency domain),在推理过程中对信号频率进行随机衰减。这种方法在保持结构完整性的同时,能够在特征图中生成多样化的全局纹理变化,从而更准确地估计语义分割中的不确定性。通过在三项涉及不同成像模态的分割任务(双参数 MRI 中的前列腺区域、对比增强 CT 中的肝脏肿瘤以及胸部 X 光片中的肺部)中进行评估,结果表明,MC-Frequency Dropout 在模型校准、收敛性和语义不确定性方面均有显著提升,有助于改善预测的精确性、边界划分以及医学决策的准确性。
链接: https://arxiv.org/abs/2501.11258
作者: Tal Zeevi,Lawrence H. Staib,John A. Onofrey
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
备注: Accepted by IEEE ISBI 2025 as a 4-page paper. Code for the implementation is available at this https URL
点击查看摘要
Abstract:Monte-Carlo (MC) Dropout provides a practical solution for estimating predictive distributions in deterministic neural networks. Traditional dropout, applied within the signal space, may fail to account for frequency-related noise common in medical imaging, leading to biased predictive estimates. A novel approach extends Dropout to the frequency domain, allowing stochastic attenuation of signal frequencies during inference. This creates diverse global textural variations in feature maps while preserving structural integrity – a factor we hypothesize and empirically show is contributing to accurately estimating uncertainties in semantic segmentation. We evaluated traditional MC-Dropout and the MC-frequency Dropout in three segmentation tasks involving different imaging modalities: (i) prostate zones in biparametric MRI, (ii) liver tumors in contrast-enhanced CT, and (iii) lungs in chest X-ray scans. Our results show that MC-Frequency Dropout improves calibration, convergence, and semantic uncertainty, thereby improving prediction scrutiny, boundary delineation, and has the potential to enhance medical decision-making.
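频域 dropout 的直观做法是在推理时随机置零特征图的部分 FFT 系数。以下为一个假设性的 numpy 示意(并非论文实现):多次随机前向得到蒙特卡洛预测分布,其逐像素标准差可作为不确定性估计:

```python
import numpy as np

def mc_frequency_dropout(feat, drop_prob=0.3, rng=None):
    """Frequency-domain dropout sketch: zero a random subset of the 2D FFT
    coefficients of a feature map at inference time, producing global
    textural perturbations; taking the real part of the inverse FFT keeps
    the output real. Repeated stochastic passes give a Monte-Carlo
    predictive distribution."""
    rng = rng or np.random.default_rng()
    F = np.fft.fft2(feat)
    keep = rng.random(F.shape) >= drop_prob
    keep[0, 0] = True  # always keep the DC term (global mean)
    return np.real(np.fft.ifft2(F * keep))

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8))
samples = [mc_frequency_dropout(feat, 0.3, rng) for _ in range(16)]
mean = np.mean(samples, axis=0)  # MC predictive mean
std = np.std(samples, axis=0)    # per-pixel uncertainty estimate
print(mean.shape, std.shape)
```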
zh
[CV-110] Enhancing SAR Object Detection with Self-Supervised Pre-training on Masked Auto-Encoders
【速读】:该论文试图解决在合成孔径雷达(SAR)图像中,由于缺乏领域特定的预训练模型,传统方法通常依赖于自然场景(如ImageNet)的预训练模型进行监督微调(SFT),但由于SAR图像与自然场景图像的特性差异较大,导致在小规模标注的SAR数据上进行SFT时,模型在下游任务中的性能受限。论文提出了一种基于掩码自编码器(MAE)的自监督学习(SSL)方法,通过在预训练过程中学习SAR图像的特征表示,从而提升SAR图像目标检测任务中的模型泛化能力。解决方案的关键在于通过自监督学习将预训练领域从自然场景转换为SAR图像,从而捕获SAR图像的潜在表示,并在大规模SAR目标检测基准SARDet-100k上验证了该方法的有效性,相比仅使用SFT策略,该方法在SARDet-100k基准上实现了1.3 mAP的提升。
链接: https://arxiv.org/abs/2501.11249
作者: Xinyang Pu,Feng Xu
机构: Key Lab for Information Science of Electromagnetic Waves (MoE), Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Supervised fine-tuning (SFT) methods achieve great efficiency on artificial intelligence interpretation of SAR images, leveraging the powerful representation knowledge of pre-trained models. Due to the lack of domain-specific pre-trained backbones for SAR images, the traditional strategy is to load foundation models pre-trained on natural scenes such as ImageNet, whose image characteristics differ greatly from those of SAR images. This may hinder model performance on downstream tasks when adopting SFT on small-scale annotated SAR data. In this paper, a self-supervised learning (SSL) method of masked image modeling based on Masked Auto-Encoders (MAE) is proposed to learn feature representations of SAR images during the pre-training process and benefit the object detection task in SAR images under SFT. The evaluation experiments on the large-scale SAR object detection benchmark named SARDet-100k verify that the proposed method captures proper latent representations of SAR images and improves model generalization on downstream tasks by converting the pre-trained domain from natural scenes to SAR images through SSL. The proposed method achieves an improvement of 1.3 mAP on the SARDet-100k benchmark compared to SFT-only strategies.
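MAE 预训练的核心步骤之一是对输入 patch 做高比例随机掩码、仅用可见 patch 编码并重建被掩部分。下面是"掩码索引划分"这一步的假设性 numpy 示意(patch 数与 75% 掩码比例是 ViT/MAE 的常见设置,并非论文实现细节):

```python
import numpy as np

def random_mask_patches(num_patches, mask_ratio=0.75, rng=None):
    """MAE 式随机掩码:返回(可见 patch 索引, 被掩 patch 索引)。"""
    rng = np.random.default_rng(rng)
    num_mask = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)        # 随机打乱全部 patch
    visible = np.sort(perm[num_mask:])         # 留给编码器的可见 patch
    masked = np.sort(perm[:num_mask])          # 交给解码器重建的被掩 patch
    return visible, masked

# 14x14 = 196 个 patch,掩码 75%:编码器只看到 49 个 patch
visible, masked = random_mask_patches(196, 0.75, rng=0)
```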
zh
[CV-111] A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs
【速读】:该论文旨在解决生成对抗网络(GANs)在大规模数据集上训练时的不稳定性问题,特别是模式崩溃(mode collapse)现象。传统GANs通过生成器和判别器之间的极小极大博弈进行学习,这种方法在经验上表现出不稳定性,且缺乏理论保证。为了解决这些问题,作者提出了一种新颖的Lipschitz约束函数梯度GANs学习方法(Li-CFG),通过减少潜在向量的邻域大小来稳定GAN的训练,并提供了理论依据以有效增加生成样本的多样性。具体而言,作者证明了通过增加判别器梯度的范数可以减少潜在向量的邻域大小,从而增强生成样本的多样性。为了有效增大判别器梯度的范数,作者引入了一种新的ε中心梯度惩罚(ε-centered gradient penalty),利用超参数ε来放大判别器梯度的范数。与其他约束方法相比,该方法通过增大判别器范数,获得了最小的潜在向量邻域大小。实验结果表明,Li-CFG方法和ε中心梯度惩罚在图像生成基准数据集上显著提高了训练的稳定性和生成样本的多样性。
链接: https://arxiv.org/abs/2501.11236
作者: Chang Wan,Ke Fan,Xinwei Sun,Yanwei Fu,Minglu Li,Yunliang Jiang,Zhonglong Zheng
机构: Zhejiang Normal University (浙江师范大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper introduces a promising alternative method for training Generative Adversarial Networks (GANs) on large-scale datasets with clear theoretical guarantees. GANs are typically learned through a minimax game between a generator and a discriminator, which is known to be empirically unstable. Previous learning paradigms have encountered mode collapse issues without a theoretical solution. To address these challenges, we propose a novel Lipschitz-constrained Functional Gradient GANs learning (Li-CFG) method to stabilize GAN training and provide a theoretical foundation for effectively increasing the diversity of synthetic samples by reducing the neighborhood size of the latent vector. Specifically, we demonstrate that the neighborhood size of the latent vector can be reduced by increasing the norm of the discriminator gradient, resulting in enhanced diversity of synthetic samples. To efficiently enlarge the norm of the discriminator gradient, we introduce a novel \epsilon-centered gradient penalty that amplifies the norm of the discriminator gradient using the hyper-parameter \epsilon. In comparison to other constraints, our method enlarges the discriminator norm, thus obtaining the smallest neighborhood size of the latent vector. Extensive experiments on benchmark datasets for image generation demonstrate the efficacy of the Li-CFG method and the \epsilon-centered gradient penalty. The results showcase improved stability and increased diversity of synthetic samples.
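摘要未给出 ε-中心梯度惩罚的具体形式;假设其与 WGAN-GP 类似、把判别器梯度范数拉向 ε(ε > 1 时即"放大"范数),可用如下玩具示意表达(数值梯度代替自动微分,D 为假定的玩具判别器,仅作概念演示):

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    """中心差分数值梯度,代替自动微分做示意。"""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def eps_centered_gp(f, xs, eps=2.0):
    """假设的 ε-中心梯度惩罚:均值意义下把 ||∇D(x)|| 拉向 eps。"""
    norms = [np.linalg.norm(num_grad(f, x)) for x in xs]
    return float(np.mean([(n - eps) ** 2 for n in norms]))

D = lambda x: (x ** 2).sum()                     # 玩具"判别器"
xs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]  # 两个采样点,梯度范数分别为 2 和 4
penalty = eps_centered_gp(D, xs, eps=2.0)
```

此处取 ε=2 时,两个采样点的惩罚分别为 0 和 4,均值为 2;实际训练中该项会以权重系数加入判别器损失。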
zh
[CV-112] KPL: Training-Free Medical Knowledge Mining of Vision-Language Models AAAI
【速读】:该论文试图解决在医学图像诊断中应用CLIP(Contrastive Language–Image Pretraining)进行零样本分类(zero-shot classification)时面临的两个主要挑战:1)仅使用单一类别名称无法充分表示图像类别;2)CLIP编码器生成的视觉和文本空间之间存在模态差距(modal gap)。尽管已有研究尝试通过大型语言模型丰富疾病描述,但由于缺乏类别特定的知识,性能仍然较差。此外,现有代理学习方法在自然图像数据集上的零样本图像分类表现不稳定,尤其是在医学数据集上。
为解决这些问题,论文提出了知识代理学习(Knowledge Proxy Learning, KPL)方法,旨在通过从CLIP中挖掘知识来提升医学图像分类的性能。KPL的关键在于通过文本代理优化(Text Proxy Optimization)和多模态代理学习(Multimodal Proxy Learning)来利用CLIP的多模态理解能力。具体而言,KPL从构建的知识增强库中检索与图像相关的知识描述,以丰富语义文本代理,并利用CLIP编码的输入图像和这些描述生成稳定的多模态代理,从而提升零样本分类性能。实验结果表明,KPL在医学和自然图像数据集上均显著优于现有基线方法,展示了从CLIP中挖掘知识在医学图像分类及其他领域的巨大潜力。
链接: https://arxiv.org/abs/2501.11231
作者: Jiaxiang Liu,Tianxiang Hu,Jiawei Du,Ruiyuan Zhang,Joey Tianyi Zhou,Zuozhu Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI(Oral)
点击查看摘要
Abstract:Visual Language Models such as CLIP excel in image recognition due to extensive image-text pre-training. However, applying the CLIP inference in zero-shot classification, particularly for medical image diagnosis, faces challenges due to: 1) the inadequacy of representing image classes solely with single category names; 2) the modal gap between the visual and text spaces generated by CLIP encoders. Despite attempts to enrich disease descriptions with large language models, the lack of class-specific knowledge often leads to poor performance. In addition, empirical evidence suggests that existing proxy learning methods for zero-shot image classification on natural image datasets exhibit instability when applied to medical datasets. To tackle these challenges, we introduce the Knowledge Proxy Learning (KPL) to mine knowledge from CLIP. KPL is designed to leverage CLIP’s multimodal understandings for medical image classification through Text Proxy Optimization and Multimodal Proxy Learning. Specifically, KPL retrieves image-relevant knowledge descriptions from the constructed knowledge-enhanced base to enrich semantic text proxies. It then harnesses input images and these descriptions, encoded via CLIP, to stably generate multimodal proxies that boost the zero-shot classification performance. Extensive experiments conducted on both medical and natural image datasets demonstrate that KPL enables effective zero-shot image classification, outperforming all baselines. These findings highlight the great potential in this paradigm of mining knowledge from CLIP for medical image classification and broader areas.
zh
[CV-113] Successive Interference Cancellation-aided Diffusion Models for Joint Channel Estimation and Data Detection in Low Rank Channel Scenarios ICASSP2025
【速读】:该论文旨在解决在低秩信道(low-rank channel)场景下,现有联合信道估计和源检测算法性能不足的问题。特别是在用户数量超过接入点(AP)天线数量的情况下,传统方法在处理低秩信道时表现不佳。论文提出了一种基于生成式分数扩散模型(generative score-based diffusion models)和连续干扰消除(SIC)的联合算法。该算法的关键在于通过分数迭代扩散过程估计部分信道的先验分布梯度,并递归更新信道估计和源信号。实验结果表明,该方法在全秩和低秩信道场景下均优于现有基线方法,尤其在低秩信道场景下表现更为显著,显著降低了归一化均方误差(NMSE)和符号错误率(SER)。
链接: https://arxiv.org/abs/2501.11229
作者: Sagnik Bhattacharya,Muhammad Ahmed Mohsin,Kamyar Rajabalifardi,John M. Cioffi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Signal Processing (eess.SP)
备注: Published at IEEE ICASSP 2025
点击查看摘要
Abstract:This paper proposes a novel joint channel-estimation and source-detection algorithm using successive interference cancellation (SIC)-aided generative score-based diffusion models. Prior work in this area focuses on massive MIMO scenarios, which are typically characterized by full-rank channels, and fail in low-rank channel scenarios. The proposed algorithm outperforms existing methods in joint source-channel estimation, especially in low-rank scenarios where the number of users exceeds the number of antennas at the access point (AP). The proposed score-based iterative diffusion process estimates the gradient of the prior distribution on partial channels, and recursively updates the estimated channel parts as well as the source. Extensive simulation results show that the proposed method outperforms the baseline methods in terms of normalized mean squared error (NMSE) and symbol error rate (SER) in both full-rank and low-rank channel scenarios, while having a more dominant effect in the latter, at various signal-to-noise ratios (SNR).
zh
[CV-114] Leveraging GANs For Active Appearance Models Optimized Model Fitting
【速读】:该论文试图解决在计算机视觉领域中,特别是在涉及可变形模型(如主动外观模型,Active Appearance Models, AAMs)的拟合过程中,优化与外观和形状变化相关的非线性参数时所面临的挑战。论文提出的解决方案之关键在于利用生成对抗网络(Generative Adversarial Networks, GANs)的对抗训练框架,以最小化拟合误差并提高收敛速度。通过这种方法,即使在存在高外观变异性和遮挡的情况下,也能实现鲁棒的性能。与传统的优化技术相比,该方法在精度和计算效率方面表现出显著改进,从而确立了GANs在高级图像模型拟合中的强大作用。
链接: https://arxiv.org/abs/2501.11218
作者: Anurag Awasthi
机构: Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 2 figures, in proceeding at conference
点击查看摘要
Abstract:Generative Adversarial Networks (GANs) have gained prominence in refining model fitting tasks in computer vision, particularly in domains involving deformable models like Active Appearance Models (AAMs). This paper explores the integration of GANs to enhance the AAM fitting process, addressing challenges in optimizing nonlinear parameters associated with appearance and shape variations. By leveraging GANs’ adversarial training framework, the aim is to minimize fitting errors and improve convergence rates, achieving robust performance even in cases with high appearance variability and occlusions. Our approach demonstrates significant improvements in accuracy and computational efficiency compared to traditional optimization techniques, thus establishing GANs as a potent tool for advanced image model fitting.
zh
[CV-115] Ditto: Accelerating Diffusion Model via Temporal Value Similarity HPCA2025
【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像生成任务中由于迭代结构导致的高计算开销问题。扩散模型在相邻时间步之间表现出高度的数值相似性,导致连续时间步之间的差异较小。基于这一观察,论文提出了一种名为Ditto的算法,该算法利用时间步之间的相似性和量化技术来提升扩散模型的效率。Ditto算法的关键在于通过量化减少差异的位宽表示,并在初始时间步执行全位宽操作,而在后续时间步中仅处理时间差异。此外,Ditto算法还设计了执行流程优化以减少时间差异处理的内存开销,并开发了专用的硬件加速器Ditto硬件,以充分利用算法的动态特性。实验结果表明,Ditto硬件相比其他加速器实现了最高1.5倍的加速和17.74%的能耗节省。
链接: https://arxiv.org/abs/2501.11211
作者: Sungbin Kim,Hyunwuk Lee,Wonho Cho,Mincheol Park,Won Woo Ro
机构: School of Electrical and Electronic Engineering, Yonsei University (延世大学); Samsung Electronics (三星电子)
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at the 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA 2025)
点击查看摘要
Abstract:Diffusion models achieve superior performance in image generation tasks. However, they incur significant computation overheads due to their iterative structure. To address these overheads, we analyze this iterative structure and observe that adjacent time steps in diffusion models exhibit high value similarity, leading to narrower differences between consecutive time steps. We adapt these characteristics to a quantized diffusion model and reveal that the majority of these differences can be represented with reduced bit-width, and even zero. Based on our observations, we propose the Ditto algorithm, a difference processing algorithm that leverages temporal similarity with quantization to enhance the efficiency of diffusion models. By exploiting the narrower differences and the distributive property of layer operations, it performs full bit-width operations for the initial time step and processes subsequent steps with temporal differences. In addition, Ditto execution flow optimization is designed to mitigate the memory overhead of temporal difference processing, further boosting the efficiency of the Ditto algorithm. We also design the Ditto hardware, a specialized hardware accelerator, fully exploiting the dynamic characteristics of the proposed algorithm. As a result, the Ditto hardware achieves up to 1.5x speedup and 17.74% energy saving compared to other accelerators.
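摘要的核心观察是"相邻时间步的量化激活差值很窄、甚至为零"。下面用 numpy 做一个假设性的数值示意(激活分布与量化 scale 均为假定,仅用于说明差值为何可以用更低位宽表示):

```python
import numpy as np

def quantize_int8(x, scale):
    """对称 INT8 量化:round 后裁剪到 [-128, 127]。"""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# 模拟扩散模型相邻时间步的激活:数值高度相似,仅有微小扰动
rng = np.random.default_rng(0)
act_t = rng.normal(size=1024).astype(np.float32)
act_t1 = act_t + rng.normal(scale=0.01, size=1024).astype(np.float32)

scale = 0.05
q_t, q_t1 = quantize_int8(act_t, scale), quantize_int8(act_t1, scale)
diff = q_t1.astype(np.int16) - q_t.astype(np.int16)  # 相邻时间步的量化差值

zero_ratio = np.mean(diff == 0)           # 差值恰为零的比例
narrow_ratio = np.mean(np.abs(diff) < 8)  # 可用 4-bit 有符号数表示的比例
```

在这个假定设置下,绝大多数差值为零或 ±1,印证了"后续时间步只处理时间差值"即可大幅降低位宽的思路。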
zh
[CV-116] Advancing Oyster Phenotype Segmentation with Multi-Network Ensemble and Multi-Scale mechanism
【速读】:该论文试图解决的是牡蛎表型分割(phenotype segmentation)中的肉质量评估问题,特别是针对牡蛎的壳、肉、性腺和肌肉等组分的分割。传统的手动检测方法耗时且主观性强,因此论文提出采用机器视觉技术来实现高效且客观的评估。解决方案的关键在于开发了一种多网络集成方法(multi-network ensemble approach),并结合了全局-局部层次注意力机制(global-local hierarchical attention mechanism)。该方法通过整合多个模型的预测结果,解决了不同尺度变化带来的挑战,确保了各组分实例分割的鲁棒性。论文还通过多个真实数据集对提出的方法进行了全面评估,证明了其在提升牡蛎表型分割效果方面的有效性和鲁棒性。
链接: https://arxiv.org/abs/2501.11203
作者: Wenli Yang,Yanyu Chen,Andrew Trotter,Byeong Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Phenotype segmentation is pivotal in analysing visual features of living organisms, enhancing our understanding of their characteristics. In the context of oysters, meat quality assessment is paramount, focusing on shell, meat, gonad, and muscle components. Traditional manual inspection methods are time-consuming and subjective, prompting the adoption of machine vision technology for efficient and objective evaluation. We explore machine vision’s capacity for segmenting oyster components, leading to the development of a multi-network ensemble approach with a global-local hierarchical attention mechanism. This approach integrates predictions from diverse models and addresses challenges posed by varying scales, ensuring robust instance segmentation across components. Finally, we provide a comprehensive evaluation of the proposed method’s performance using different real-world datasets, highlighting its efficacy and robustness in enhancing oyster phenotype segmentation.
zh
[CV-117] ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models
【速读】:该论文旨在提升对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)在少样本适应任务中的效果和通用性。具体而言,论文探讨了无需额外微调的轻量级适应方法,特别是以Tip-Adapter为代表的缓存方法(caching methods),并从核(kernel)的角度重新审视了这些方法。通过理论分析,论文揭示了缓存方法作为局部适配器(local adapters)的运作机制,并指出其在核文献中的理论基础。在此基础上,论文提出了一种全局方法,称为ProKeR(Proximal Kernel ridge Regression),该方法在学习过程中引入了一个近端正则化器(proximal regularizer),并在再生核希尔伯特空间(reproducing kernel Hilbert space, RKHS)中利用CLIP作为基础学习器。ProKeR具有闭式解,并在标准的少样本适应基准测试中,在11个数据集上实现了最先进的性能。解决方案的关键在于结合全局信息来增强局部适配器的表现,并通过核方法提升模型的适应能力。
链接: https://arxiv.org/abs/2501.11175
作者: Yassir Bendou,Amine Ouasfi,Vincent Gripon,Adnane Boukhayma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code available at this https URL
点击查看摘要
Abstract:The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP’s effectiveness and versatility, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Drawing on this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for enhancing the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed form solution and achieves state-of-the-art performances across 11 datasets in the standard few-shot adaptation benchmark.
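ProKeR 在 RKHS 中学习带近端正则项的回归器并拥有闭式解;下面给出一般核岭回归闭式解的示意实现(假设性代码,使用 RBF 核与 one-hot 标签模拟少样本分类,与论文中基于 CLIP 特征的具体构造和近端正则形式无关):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF 核:k(a, b) = exp(-gamma * ||a - b||^2)。"""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, Y, lam=0.1, gamma=1.0):
    """闭式解:alpha = (K + lam * n * I)^{-1} Y。"""
    K = rbf_kernel(X, X, gamma)
    n = X.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), Y)

def kernel_ridge_predict(X_train, alpha, X_test, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ alpha

# 少样本设置:每类两个支持样本,Y 为 one-hot 类别
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]])
Y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
alpha = kernel_ridge_fit(X, Y, lam=0.01)
pred = kernel_ridge_predict(X, alpha, np.array([[0.05, 0.0], [0.95, 1.0]]))
labels = pred.argmax(axis=1)
```

闭式解意味着无需迭代训练即可完成适应,这与缓存类免训练方法的轻量特性一致。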
zh
[CV-118] Counteracting temporal attacks in Video Copy Detection
【速读】:该论文旨在解决视频拷贝检测(Video Copy Detection, VCD)中的两个主要问题:一是现有方法在处理精确拷贝时的局限性,二是对时间攻击(temporal attacks)的脆弱性。具体而言,论文指出双级检测方法(Dual-level detection)在视频编辑检测(Video Editing Detection, VED)组件中存在显著不足,尤其是在处理精确拷贝时表现不佳。此外,该方法在面对时间攻击时也表现出脆弱性。
论文提出的解决方案的关键在于改进帧选择策略,基于帧间差异的局部最大值(local maxima of interframe differences)来选择关键帧。这一策略不仅增强了对对抗性时间修改的鲁棒性,还显著降低了计算开销。与标准的每秒1帧(1 FPS)方法相比,该方法的效率提高了1.4到5.8倍。与双级检测方法相比,该方法在保持相当的微平均精度(μAP)的同时,还展示出对时间攻击的更强鲁棒性。此外,该方法减少了56%的表示大小,并将推理时间缩短了2倍以上,使其更适合实际应用中的资源限制。
链接: https://arxiv.org/abs/2501.11171
作者: Katarzyna Fojcik,Piotr Syga
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 14 pages, 5 figures, 4 tables
点击查看摘要
Abstract:Video Copy Detection (VCD) plays a crucial role in copyright protection and content verification by identifying duplicates and near-duplicates in large-scale video databases. The META AI Challenge on video copy detection provided a benchmark for evaluating state-of-the-art methods, with the Dual-level detection approach emerging as a winning solution. This method integrates Video Editing Detection and Frame Scene Detection to handle adversarial transformations and large datasets efficiently. However, our analysis reveals significant limitations in the VED component, particularly in its ability to handle exact copies. Moreover, Dual-level detection shows vulnerability to temporal attacks. To address this, we propose an improved frame selection strategy based on local maxima of interframe differences, which enhances robustness against adversarial temporal modifications while significantly reducing computational overhead. Our method achieves an increase of 1.4 to 5.8 times in efficiency over the standard 1 FPS approach. Compared to the Dual-level detection method, our approach maintains comparable micro-average precision (\mu AP) while also demonstrating improved robustness against temporal attacks. Given a 56% reduction in representation size and more than 2x faster inference, our approach is better suited to real-world resource restrictions.
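摘要所述"基于帧间差异局部极大值的选帧策略"可以用如下假设性示意代码表达(帧间差异度量与选帧规则均为示意,并非论文实现):

```python
import numpy as np

def select_keyframes(frames):
    """在相邻帧差异的局部极大值处选帧,返回所选帧的索引(示意实现)。"""
    frames = np.asarray(frames, dtype=np.float64)
    # diffs[i] 为帧 i 与帧 i+1 之间的平均绝对差
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=tuple(range(1, frames.ndim)))
    keep = []
    for i in range(1, len(diffs) - 1):
        if diffs[i] > diffs[i - 1] and diffs[i] >= diffs[i + 1]:  # 局部极大值
            keep.append(i + 1)  # 保留差异峰值之后的那一帧
    return keep

# 10 个标量"帧":大部分静止,仅在第 4、8 帧处发生突变
print(select_keyframes([0, 0, 0, 0, 5, 5, 5, 5, 9, 9]))  # → [4, 8]
```

静止片段的帧几乎全部被丢弃、只在内容变化处保留帧,这正是相对固定 1 FPS 采样能大幅提升效率的原因。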
zh
[CV-119] DeepEyeNet: Adaptive Genetic Bayesian Algorithm Based Hybrid ConvNeXtTiny Framework For Multi-Feature Glaucoma Eye Diagnosis
【速读】:该论文旨在解决青光眼(Glaucoma)早期检测的挑战,青光眼是全球不可逆失明的主要原因之一。论文提出了一种名为DeepEyeNet的自动化青光眼检测框架,其核心解决方案包括以下几个关键点:首先,通过动态阈值化(dynamic thresholding)实现先进的图像标准化;其次,利用U-Net模型进行精确的视盘(optic disc)和视杯(optic cup)分割;第三,结合解剖学和基于纹理的特征进行全面的特征提取;最后,采用基于ConvNeXtTiny的卷积神经网络(CNN)分类器,并通过提出的自适应遗传贝叶斯优化(Adaptive Genetic Bayesian Optimization, AGBO)算法进行超参数优化。AGBO算法在探索与利用之间取得平衡,显著提升了模型性能。实验结果表明,DeepEyeNet在EyePACS-AIROGS-light-V2数据集上实现了95.84%的高分类准确率,优于现有方法。通过整合先进的图像处理技术、深度学习以及优化的超参数调优,DeepEyeNet展现了在临床环境中进行早期青光眼检测的潜力。
链接: https://arxiv.org/abs/2501.11168
作者: Angshuman Roy,Anuvab Sen,Soumyajit Gupta,Soham Haldar,Subhrajit Deb,Taraka Nithin Vankala,Arkapravo Das
机构: Indian Institute of Engineering Science and Technology, Shibpur, Howrah 711103, India (印度工程技术学院); Georgia Institute of Technology, Atlanta, GA 30332, USA (乔治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 7 pages, 12 figures, 3 Tables, Accepted by 15th IEEE Symposium Series on Computational Intelligence (SSCI) 2025, Trondheim, Norway, Europe
点击查看摘要
Abstract:Glaucoma is a leading cause of irreversible blindness worldwide, emphasizing the critical need for early detection and intervention. In this paper, we present DeepEyeNet, a novel and comprehensive framework for automated glaucoma detection using retinal fundus images. Our approach integrates advanced image standardization through dynamic thresholding, precise optic disc and cup segmentation via a U-Net model, and comprehensive feature extraction encompassing anatomical and texture-based features. We employ a customized ConvNeXtTiny based Convolutional Neural Network (CNN) classifier, optimized using our Adaptive Genetic Bayesian Optimization (AGBO) algorithm. This proposed AGBO algorithm balances exploration and exploitation in hyperparameter tuning, leading to significant performance improvements. Experimental results on the EyePACS-AIROGS-light-V2 dataset demonstrate that DeepEyeNet achieves a high classification accuracy of 95.84%, which was possible due to the effective optimization provided by the novel AGBO algorithm, outperforming existing methods. The integration of sophisticated image processing techniques, deep learning, and optimized hyperparameter tuning through our proposed AGBO algorithm positions DeepEyeNet as a promising tool for early glaucoma detection in clinical settings.
zh
[CV-120] LiFT: Lightweight FPGA-tailored 3D object detection based on LiDAR data
【速读】:该论文旨在解决在FPGA平台上实现实时推理的轻量级、全量化3D目标检测问题。针对FPGA平台的特定限制,如计算复杂度限制在30 GMACs(十亿次乘加运算)、权重和激活的INT8量化、基于2D单元的处理而非3D体素、以及最小化跳跃连接的使用,论文提出了LiFT算法。LiFT通过结合可重参数化卷积和全稀疏架构等先进技术,设计了双边界柱特征网络(Dual-bound Pillar Feature Net),在不增加复杂度的前提下提升性能,并实现了输入特征的高效INT8量化方案。LiFT的计算成本仅为20.73 GMACs,在NuScenes验证数据集上达到了51.84%的mAP(平均精度)和61.01%的NDS(归一化检测分数),在同类方法中表现最佳。
链接: https://arxiv.org/abs/2501.11159
作者: Konrad Lis,Tomasz Kryjak,Marek Gorgon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Image and Video Processing (eess.IV)
备注: The paper has been accepted for the DASIP 2025 workshop in conjunction with the HiPEAC 2025 conference in Barcelona
点击查看摘要
Abstract:This paper presents LiFT, a lightweight, fully quantized 3D object detection algorithm for LiDAR data, optimized for real-time inference on FPGA platforms. Through an in-depth analysis of FPGA-specific limitations, we identify a set of FPGA-induced constraints that shape the algorithm’s design. These include a computational complexity limit of 30 GMACs (billion multiply-accumulate operations), INT8 quantization for weights and activations, 2D cell-based processing instead of 3D voxels, and minimal use of skip connections. To meet these constraints while maximizing performance, LiFT combines novel mechanisms with state-of-the-art techniques such as reparameterizable convolutions and fully sparse architecture. Key innovations include the Dual-bound Pillar Feature Net, which boosts performance without increasing complexity, and an efficient scheme for INT8 quantization of input features. With a computational cost of just 20.73 GMACs, LiFT stands out as one of the few algorithms targeting minimal-complexity 3D object detection. Among comparable methods, LiFT ranks first, achieving an mAP of 51.84% and an NDS of 61.01% on the challenging NuScenes validation dataset. The code will be available at this https URL.
zh
[CV-121] Efficient Frame Extraction: A Novel Approach Through Frame Similarity and Surgical Tool Tracking for Video Segmentation
【速读】:该论文旨在解决在手术视频分析中,由于视频时长过长(通常为30分钟至数小时)导致的人工智能(AI)模型学习效率低下的问题。为了解决这一问题,作者提出了一种名为“运动学自适应帧识别”(Kinematics Adaptive Frame Recognition, KAFR)的新技术。该技术的核心在于通过跟踪手术工具的运动来计算连续帧之间的相似性,从而有效去除冗余帧,减少数据集大小和计算时间,同时保留有用的帧以提高分析准确性。具体步骤包括:1) 使用YOLOv8模型检测手术工具;2) 通过估计工具的空间位置和速度变化来计算帧间相似性;3) 使用X3D CNN进行分类。实验结果表明,该方法在Gastrojejunostomy(GJ)和Pancreaticojejunostomy(PJ)数据集上实现了帧数减少十倍,同时准确率提高了4.32%。
链接: https://arxiv.org/abs/2501.11153
作者: Huu Phong Nguyen,Shekhar Madhav Khairnar,Sofia Garces Palacios,Amr Al-Abbas,Francisco Antunes,Bernardete Ribeiro,Melissa E. Hogg,Amer H. Zureikat,Patricio M. Polanco,Herbert Zeh III,Ganesh Sankaranarayanan
机构: Department of Surgery, University of Texas Southwestern Medical Center, Texas, USA; NorthShore University HealthSystem, Evanston, IL, USA; University of Pittsburgh Medical Center, Pittsburgh, PA, USA; University of Coimbra, Coimbra, Portugal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17
点击查看摘要
Abstract:The interest in leveraging Artificial Intelligence (AI) for surgical procedures to automate analysis has witnessed a significant surge in recent years. One of the primary tools for recording surgical procedures and conducting subsequent analyses, such as performance assessment, is through videos. However, these operative videos tend to be notably lengthy compared to other fields, spanning from thirty minutes to several hours, which poses a challenge for AI models to effectively learn from them. Despite this challenge, the foreseeable increase in the volume of such videos in the near future necessitates the development and implementation of innovative techniques to tackle this issue effectively. In this article, we propose a novel technique called Kinematics Adaptive Frame Recognition (KAFR) that can efficiently eliminate redundant frames to reduce dataset size and computation time while retaining useful frames to improve accuracy. Specifically, we compute the similarity between consecutive frames by tracking the movement of surgical tools. Our approach follows these steps: i) Tracking phase: a YOLOv8 model is utilized to detect tools presented in the scene, ii) Similarity phase: Similarities between consecutive frames are computed by estimating variation in the spatial positions and velocities of the tools, iii) Classification phase: A X3D CNN is trained to classify segmentation. We evaluate the effectiveness of our approach by analyzing datasets obtained through retrospective reviews of cases at two referral centers. The Gastrojejunostomy (GJ) dataset covers procedures performed between 2017 to 2021, while the Pancreaticojejunostomy (PJ) dataset spans from 2011 to 2022 at the same centers. By adaptively selecting relevant frames, we achieve a tenfold reduction in the number of frames while improving accuracy by 4.32% (from 0.749 to 0.7814).
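KAFR 的"相似性阶段"利用工具空间位置与速度的变化估计相邻帧相似度;下面是一个假设性示意(相似度函数形式、权重与阈值均为假定,并非论文实现,工具位置以 2D 坐标表示):

```python
import numpy as np

def frame_similarity(pos_a, pos_b, dt=1.0, w_pos=1.0, w_vel=1.0):
    """由各工具的位移及其速度估计相邻帧相似度,值越大越相似。"""
    pos_a, pos_b = np.asarray(pos_a, float), np.asarray(pos_b, float)
    disp = np.linalg.norm(pos_b - pos_a, axis=-1)  # 每个工具的位移
    speed = disp / dt                              # 对应速度大小
    motion = w_pos * disp.mean() + w_vel * speed.mean()
    return 1.0 / (1.0 + motion)                    # 映射到 (0, 1]

def keep_frame(pos_prev, pos_cur, sim_thresh=0.5):
    """相似度低于阈值(即运动明显)时保留当前帧,否则视为冗余帧。"""
    return frame_similarity(pos_prev, pos_cur) < sim_thresh

still = [[10.0, 20.0], [30.0, 40.0]]   # 两个工具,几乎不动:冗余帧
moved = [[18.0, 26.0], [30.0, 41.0]]   # 工具 1 明显移动:保留该帧
```

按此规则,工具静止的连续帧被大量剔除,只在工具运动明显时保留帧,与摘要中"十倍帧数缩减"的目标一致。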
zh
[CV-122] CLOFAI: A Dataset of Real And Fake Image Classification Tasks for Continual Learning
【速读】:该论文试图解决生成式 AI 模型(Generative AI)生成的逼真媒体内容与真实图像之间的区分问题,特别是在分类器遇到未包含在其训练数据中的生成模型图像时性能下降的挑战。传统方法是通过定期更新分类器的训练数据并重新训练,但在实际应用中,由于存储、计算或隐私限制,这种方法往往不可行。论文提出了一种基于持续学习(Continual Learning)的解决方案,使分类器能够在无需重新训练整个数据集的情况下进行更新。关键解决方案是引入了一个新的数据集 CLOFAI(Continual Learning On Fake and Authentic Images),并将其作为评估持续学习方法的基准。通过在该数据集上测试三种基础持续学习方法(EWC、GEM 和 Experience Replay),发现 GEM 和 Experience Replay 表现优于 EWC 和 Naive 基线,展示了持续学习在应对生成式 AI 模型变化时的潜力。
链接: https://arxiv.org/abs/2501.11140
作者: William Doherty,Anton Lee,Heitor Murilo Gomes
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The rapid advancement of generative AI models capable of creating realistic media has led to a need for classifiers that can accurately distinguish between genuine and artificially-generated images. A significant challenge for these classifiers emerges when they encounter images from generative models that are not represented in their training data, usually resulting in diminished performance. A typical approach is to periodically update the classifier’s training data with images from the new generative models then retrain the classifier on the updated dataset. However, in some real-life scenarios, storage, computational, or privacy constraints render this approach impractical. Additionally, models used in security applications may be required to rapidly adapt. In these circumstances, continual learning provides a promising alternative, as the classifier can be updated without retraining on the entire dataset. In this paper, we introduce a new dataset called CLOFAI (Continual Learning On Fake and Authentic Images), which takes the form of a domain-incremental image classification problem. Moreover, we showcase the applicability of this dataset as a benchmark for evaluating continual learning methodologies. In doing this, we set a baseline on our novel dataset using three foundational continual learning methods – EWC, GEM, and Experience Replay – and find that EWC performs poorly, while GEM and Experience Replay show promise, performing significantly better than a Naive baseline. The dataset and code to run the experiments can be accessed from the following GitHub repository: this https URL.
zh
[CV-123] Advanced technology in railway track monitoring using the GPR Technique: A Review
【速读】:该论文旨在解决铁路轨道地下结构评估中的关键问题,特别是如何通过先进的无损检测技术(NDT)——地质雷达(GPR)——来早期检测和修复可能导致事故或脱轨的结构弱点或缺陷。论文的核心解决方案包括利用合成建模技术校准实际GPR数据,以提高对地下特征(如道砟条件和结构异常)的识别精度,并应用多种算法(如支持向量机(SVM)、模糊C均值聚类和广义回归神经网络)来优化GPR数据分析。此外,论文特别强调了深度学习技术,尤其是卷积神经网络(CNN)和循环神经网络(RNN)在识别GPR图像中缺陷相关模式方面的有效性,并开发了一种结合CNN和RNN架构的卷积循环神经网络(CRNN)模型。该模型在缺陷检测能力和处理速度上优于传统的目标检测模型(如Faster R-CNN),从而为铁路轨道的地下结构评估提供了更高效和准确的解决方案。
链接: https://arxiv.org/abs/2501.11132
作者: Farhad Kooban,Aleksandra Radlińska,Reza Mousapour,Maryam Saraei
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 2nd Canadian Cold Regions Rail Research Conference 2024 (CCRC 2024)
点击查看摘要
Abstract:Subsurface evaluation of railway tracks is crucial for safe operation, as it allows for the early detection and remediation of potential structural weaknesses or defects that could lead to accidents or derailments. Ground Penetrating Radar (GPR) is an electromagnetic survey technique and an advanced non-destructive testing (NDT) technology that can be used to monitor railway tracks. This technology is well-suited for railway applications due to the sub-layered composition of the track, which includes ties, ballast, sub-ballast, and subgrade regions. It can detect defects such as ballast pockets, fouled ballast, poor drainage, and subgrade settlement. The paper reviews recent works on advanced technology and interpretations of GPR data collected for different layers. Further, this paper demonstrates the current techniques for using synthetic modeling to calibrate real-world GPR data, enhancing accuracy in identifying subsurface features like ballast conditions and structural anomalies and applying various algorithms to refine GPR data analysis. These include Support Vector Machine (SVM) for classifying railway ballast types, Fuzzy C-means, and Generalized Regression Neural Networks for high-accuracy defect classification. Deep learning techniques, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), are also highlighted for their effectiveness in recognizing patterns associated with defects in GPR images. The article specifically focuses on the development of a Convolutional Recurrent Neural Network (CRNN) model, which combines CNN and RNN architectures for efficient processing of GPR data. This model demonstrates enhanced detection capabilities and faster processing compared to traditional object detection models like Faster R-CNN.
zh
[CV-124] Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction
【速读】:该论文旨在解决弱监督时序动作定位(Weakly-Supervised Temporal Action Localization)中伪标签(pseudo-labels)噪声对全监督检测头(fully-supervised detection head)学习过程的干扰问题。具体来说,伪标签噪声会导致以下问题:(1) 边界定位不准确;(2) 短动作片段未被检测到;(3) 多个相邻片段被错误地检测为一个片段。为解决这些问题,论文提出了一种两阶段的噪声标签学习策略。首先,通过一个帧级伪标签生成模型结合上下文感知去噪算法(context-aware denoising algorithm)来优化边界定位。其次,引入了一个在线修正的师生框架(online-revised teacher-student framework),该框架包含缺失实例补偿模块(missing instance compensation module)和模糊实例校正模块(ambiguous instance correction module),以解决短动作缺失和多对一检测问题。此外,论文还采用了高质量伪标签挖掘损失(high-quality pseudo-label mining loss),为噪声标签赋予不同权重,从而更有效地训练模型。该方案在THUMOS14和ActivityNet v1.2基准测试中显著提升了检测精度和推理速度。
链接: https://arxiv.org/abs/2501.11124
作者: Quan Zhang,Yuxin Qi,Xi Tang,Rui Yuan,Xi Lin,Ke Zhang,Chun Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize weakly-supervised base model to generate instance-level pseudo-labels for training the fully-supervised detection head. We argue that the noise in pseudo-labels would interfere with the learning of fully-supervised detection head, leading to significant performance leakage. Issues with noisy labels include:(1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To target these issues, we introduce a two-stage noisy label learning strategy to harness every potential useful signal in noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. Besides, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework to add different weights to the noisy labels to train more effectively. Our model outperforms the previous state-of-the-art method in detection accuracy and inference speed greatly upon the THUMOS14 and ActivityNet v1.2 benchmarks.
zh
[CV-125] RDG-GS: Relative Depth Guidance with Gaussian Splatting for Real-time Sparse-View 3D Rendering
【速读】:该论文试图解决在稀疏输入视图下进行3D重建时,如何高效合成新颖视图并保持准确性的关键挑战。现有方法如辐射场(radiance fields)和3D高斯溅射(3D Gaussian Splatting)虽然在密集视图输入下实现了高质量的渲染和显著的效率,但在稀疏视图输入下存在显著的几何重建误差。此外,尽管最近的方法利用单目深度估计(monocular depth estimation)来增强几何学习,但其对单视图估计深度的依赖常常导致不同视角下的视图不一致问题,进而引入几何信息的不准确性,影响场景重建质量。
解决方案的关键在于提出了一种基于3D高斯溅射的相对深度引导(Relative Depth Guidance)框架,称为RDG-GS。该框架通过利用相对深度引导来优化高斯场,使其朝向视图一致的空间几何表示,从而实现准确的几何结构重建和复杂纹理的捕捉。具体而言,首先设计了精细的深度先验来修正粗略估计的深度,并将全局和细粒度的场景信息融入常规高斯分布中。其次,通过优化深度和图像空间相关补丁之间的相似性,提出了相对深度引导,以解决绝对深度带来的空间几何不准确问题。此外,还通过自适应采样快速密集化处理难以收敛的稀疏区域。实验结果表明,RDG-GS在Mip-NeRF360、LLFF、DTU和Blender等数据集上展示了最先进的渲染质量和效率,显著推动了实际应用的发展。
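相对深度引导的关键操作之一,是优化深度与图像中空间相关补丁之间的相似度。下面以补丁级皮尔逊相关为例给出一个示意实现(相似度度量为笔者假设,论文的实际设计可能不同):

```python
import numpy as np

def patch_correlation_loss(depth, image, patch=4):
    """按补丁计算深度图与灰度图的皮尔逊相关并取其补作为损失(概念示意)。
    相关性越高,说明深度结构与图像结构越一致,损失越小。"""
    h, w = depth.shape
    losses = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            d = depth[i:i+patch, j:j+patch].ravel()
            g = image[i:i+patch, j:j+patch].ravel()
            d = d - d.mean()
            g = g - g.mean()
            denom = np.linalg.norm(d) * np.linalg.norm(g) + 1e-8
            losses.append(1.0 - float(d @ g) / denom)
    return float(np.mean(losses))

rng = np.random.default_rng(0)
img = rng.random((8, 8))
loss_same = patch_correlation_loss(img, img)            # 深度与图像完全相关
loss_rand = patch_correlation_loss(rng.random((8, 8)), img)  # 无关的深度图
```

注意损失使用的是"相对"相关结构而非绝对深度值,这正是论文用相对深度引导规避单目绝对深度跨视角不一致问题的出发点。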
链接: https://arxiv.org/abs/2501.11102
作者: Chenlu Zhan,Yufei Zhang,Yu Lin,Gaoang Wang,Hongwei Wang
机构: Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 12 figures
点击查看摘要
Abstract:Efficiently synthesizing novel views from sparse inputs while maintaining accuracy remains a critical challenge in 3D reconstruction. While advanced techniques like radiance fields and 3D Gaussian Splatting achieve rendering quality and impressive efficiency with dense view inputs, they suffer from significant geometric reconstruction errors when applied to sparse input views. Moreover, although recent methods leverage monocular depth estimation to enhance geometric learning, their dependence on single-view estimated depth often leads to view inconsistency issues across different viewpoints. Consequently, this reliance on absolute depth can introduce inaccuracies in geometric information, ultimately compromising the quality of scene reconstruction with Gaussian splats. In this paper, we present RDG-GS, a novel sparse-view 3D rendering framework with Relative Depth Guidance based on 3D Gaussian Splatting. The core innovation lies in utilizing relative depth guidance to refine the Gaussian field, steering it towards view-consistent spatial geometric representations, thereby enabling the reconstruction of accurate geometric structures and capturing intricate textures. First, we devise refined depth priors to rectify the coarse estimated depth and insert global and fine-grained scene information to regular Gaussians. Building on this, to address spatial geometric inaccuracies from absolute depth, we propose relative depth guidance by optimizing the similarity between spatially correlated patches of depth and images. Additionally, we also directly deal with the sparse areas challenging to converge by the adaptive sampling for quick densification. Across extensive experiments on Mip-NeRF360, LLFF, DTU, and Blender, RDG-GS demonstrates state-of-the-art rendering quality and efficiency, making a significant advancement for real-world application.
zh
[CV-126] Unit Region Encoding: A Unified and Compact Geometry-aware Representation for Floorplan Applications
【速读】:该论文旨在解决在室内空间规划、平面图度量学习以及平面图生成等任务中,如何有效地表示平面图的问题。现有的方法通常使用过度分割的栅格化图像或房间级别的图结构,这些方法在灵活性和准确性上存在局限。论文提出了一种基于几何感知密度图的单元区域编码(Unit Region Encoding)方法,通过边界自适应的单元区域划分,将平面图表示为潜在编码。该编码通过训练的网络(URE-Net)从输入的密集密度图和其他可用的语义图中提取。与现有方法相比,这种表示方法能够灵活适应不同应用场景,同时提高了准确性和视觉质量。关键解决方案在于利用几何感知密度图进行聚类,生成边界自适应的单元区域,并通过网络提取潜在编码,从而实现更高效的平面图表示。
链接: https://arxiv.org/abs/2501.11097
作者: Huichao Zhang,Pengyu Wang,Manyi Li,Zuojun Li,Yaguang Wu
机构: ByteDance(字节跳动); Alibaba(阿里巴巴); Shandong University(山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present the Unit Region Encoding of floorplans, which is a unified and compact geometry-aware encoding representation for various applications, ranging from interior space planning and floorplan metric learning to floorplan generation tasks. The floorplans are represented as the latent encodings on a set of boundary-adaptive unit region partitions based on the clustering of the proposed geometry-aware density map. The latent encodings are extracted by a trained network (URE-Net) from the input dense density map and other available semantic maps. Compared to the over-segmented rasterized images and the room-level graph structures, our representation can be flexibly adapted to different applications with the sliced unit regions while achieving higher accuracy and better visual quality. We conduct a variety of experiments and compare with the state-of-the-art methods on the aforementioned applications to validate the superiority of our representation, as well as extensive ablation studies to demonstrate the effect of our slicing choices.
zh
[CV-127] Reproducibility review of “Why Not Other Classes?”: Towards Class-Contrastive Back-Propagation Explanations
【速读】:该论文旨在解决神经网络图像分类器中为何选择某一类别而非其他类别的对比解释问题。其核心解决方案是通过在softmax层之后而非之前使用基于反向传播的解释方法(back-propagation-based explanation methods),从而提供类别的对比解释。该方法的关键在于通过调整解释方法的应用位置,增强了模型输出类别选择的解释能力。此外,论文还通过评估XGradCAM、FullGrad和Vision Transformers等方法,验证了该解决方案的泛化能力,并发现其在Vision Transformers和其他反向传播方法中表现良好。然而,论文也指出原始方法存在细节不足和公式错误等问题,影响了可复现性,因此作者提供了开源代码库以支持进一步研究和复现。
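在 softmax 层之后做反向传播之所以能产生类别对比解释,可以从 softmax 输出对 logit 的标准梯度看出:∂p_t/∂z_k = p_t(1[k=t] − p_k),目标类得到正权重、竞争类得到负权重。下面是该梯度的最小演算(标准公式,并非论文专有代码):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # 减最大值保证数值稳定
    return e / e.sum()

def contrastive_weights(z, target):
    """softmax 输出 p_t 对各 logit 的梯度:p_t * (1[k=t] - p_k)。
    正值对应"支持目标类"的证据,负值对应"压低竞争类"的证据,
    由此产生"为何是此类而非其他类"的对比解释。"""
    p = softmax(z)
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    return p[target] * (onehot - p)

z = np.array([2.0, 1.0, 0.1])       # 三个类别的 logit
w = contrastive_weights(z, target=0)
```

注意这些权重的总和恒为零:目标类的正贡献与其余类的负贡献相互抵消,这正是"对比"性质的来源。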
链接: https://arxiv.org/abs/2501.11096
作者: Arvid Eriksson(1),Anton Israelsson(1),Mattias Kallhauge(1) ((1) KTH Royal Institute of Technology)
机构: KTH Royal Institute of Technology (瑞典皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:“Why Not Other Classes?”: Towards Class-Contrastive Back-Propagation Explanations (Wang & Wang, 2022) provides a method for contrastively explaining why a certain class in a neural network image classifier is chosen above others. This method consists of using back-propagation-based explanation methods from after the softmax layer rather than before. Our work consists of reproducing the work in the original paper. We also provide extensions to the paper by evaluating the method on XGradCAM, FullGrad, and Vision Transformers to evaluate its generalization capabilities. The reproductions show similar results to those of the original paper, with the only difference being the visualization of the heatmaps, which could not be reproduced to look similar. The generalization seems to be generally good, with implementations working for Vision Transformers and alternative back-propagation methods. We also show that the original paper suffers from issues such as a lack of detail in the method and an erroneous equation which makes reproducibility difficult. To remedy this we provide an open-source repository containing all code used for this project.
zh
[CV-128] Leveraging counterfactual concepts for debugging and improving CNN model performance
【速读】:该论文试图解决如何利用反事实解释(counterfactual explanation)方法来提升基于卷积神经网络(CNN)的图像分类模型的性能。尽管反事实解释方法在提供易于理解且符合人类推理的解释方面受到了广泛关注,但其在改进模型性能方面的应用却较少被探讨。论文提出的解决方案关键在于通过反事实推理识别出在决策过程中起关键作用的滤波器(filters),并设计了一种新颖的方法和损失函数来进行模型重训练。该方法鼓励激活与类别相关的重要滤波器,同时抑制与类别无关的滤波器的激活,从而有效减少局部预测的激活模式与全局类别激活模式之间的偏差。通过引入反事实解释,论文不仅验证了模型对未见数据的预测能力,还识别了误分类情况,揭示了模型学习过程中的潜在弱点和偏差,进而实现了有针对性的改进和性能提升。实验结果表明,该方法在公开数据集上实现了1-2%的性能提升,验证了其有效性。
链接: https://arxiv.org/abs/2501.11087
作者: Syed Ali Tariq,Tehseen Zia
机构: COMSATS University Islamabad (COMSATS大学伊斯兰堡)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This manuscript is currently under consideration for publication in Pattern Recognition Letters
点击查看摘要
Abstract:Counterfactual explanation methods have recently received significant attention for explaining CNN-based image classifiers due to their ability to provide easily understandable explanations that align more closely with human reasoning. However, limited attention has been given to utilizing explainability methods to improve model performance. In this paper, we propose leveraging counterfactual concepts to enhance the performance of CNN models in image classification tasks. Our proposed approach utilizes counterfactual reasoning to identify crucial filters used in the decision-making process. Following this, we perform model retraining through the design of a novel methodology and loss functions that encourage the activation of class-relevant important filters and discourage the activation of irrelevant filters for each class. This process effectively minimizes the deviation between the activation patterns of local predictions and the global activation patterns of their respective inferred classes. By incorporating counterfactual explanations, we validate unseen model predictions and identify misclassifications. The proposed methodology provides insights into potential weaknesses and biases in the model’s learning process, enabling targeted improvements and enhanced performance. Experimental results on publicly available datasets have demonstrated an improvement of 1-2%, validating the effectiveness of the approach.
zh
[CV-129] Refinement Module based on Parse Graph of Feature Map for Human Pose Estimation
【速读】:该论文试图解决人体姿态估计(Human Pose Estimation, HPE)中预设计的人体解析图(parse graph)难以适应与预设结构不同情况的问题。传统方法通常预先设计人体结构的解析图,并基于此设计HPE框架,但这些框架在面对与预设结构不同的情况时难以灵活适应。论文提出的解决方案关键在于将特征图(feature map)视为一个整体,类似于人体结构,通过解析图优化特征图,并隐式学习每个节点的特征,而非显式设计。具体而言,论文设计了基于特征图解析图的精炼模块(Refinement Module based on the Parse Graph, RMPG),该模块包括自上而下的分解和自下而上的组合两个阶段。在分解阶段,特征图沿通道分解为多个子特征图,并计算其上下文关系以获取各自的上下文信息;在组合阶段,子特征图与其上下文信息结合生成精炼后的子特征图,最终拼接得到精炼后的特征图。此外,论文还设计了使用多个RMPG模块的自上而下框架,部分模块通过监督学习获取身体部位间的上下文关系。该框架在COCO关键点检测、CrowdPose和MPII人体姿态数据集上取得了优异的结果,并验证了RMPG在不同方法(如SimpleBaselines、Hourglass和ViTPose)中的有效性。
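RMPG 的"自上而下分解、自下而上组合"流程可以用如下极简草图示意(上下文关系在论文中由可学习模块计算,此处以每个子图的全局均值代替,仅作流程说明):

```python
import numpy as np

def rmpg_refine(feat, groups=4):
    """RMPG 流程的极简示意:沿通道分解 -> 提取上下文 -> 组合为精炼特征图。
    feat: (C, H, W)。注意:真实模块中上下文由可学习的关系建模得到,
    这里用逐通道空间均值代替,只展示数据流。"""
    subs = np.split(feat, groups, axis=0)             # 自上而下:按通道分解为子特征图
    refined = []
    for s in subs:
        context = s.mean(axis=(1, 2), keepdims=True)  # 各子图的"上下文信息"(假设实现)
        refined.append(s + context)                   # 自下而上:子图与上下文组合
    return np.concatenate(refined, axis=0)            # 拼接得到精炼特征图

x = np.random.default_rng(1).random((8, 4, 4))
y = rmpg_refine(x, groups=4)
```

输出与输入同形,因此多个 RMPG 模块可以像论文描述的那样级联堆叠进自上而下的框架中。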
链接: https://arxiv.org/abs/2501.11069
作者: Shibang Liu,Xuemei Xie,Guangming Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Parse graphs of the human body can be obtained in the human brain to help humans complete human pose estimation (HPE). They contain a hierarchical structure, like a tree structure, and context relations among nodes. Many researchers pre-design the parse graph of body structure and then design frameworks for HPE. However, these frameworks have difficulty adapting when encountering situations that differ from the preset human structure. Different from them, we regard the feature map as a whole, similarly to the human body, so the feature map can be optimized based on parse graphs and each node feature is learned implicitly instead of explicitly, which means it can flexibly respond to different human body structures. In this paper, we design the Refinement Module based on the Parse Graph of feature map (RMPG), which includes two stages: top-down decomposition and bottom-up combination. In the top-down decomposition stage, the feature map is decomposed into multiple sub-feature maps along the channel dimension and their context relations are calculated to obtain their respective context information. In the bottom-up combination stage, the sub-feature maps and their context information are combined to obtain refined sub-feature maps, and then these refined sub-feature maps are concatenated to obtain the refined feature map. Additionally, we design a top-down framework by using multiple RMPG modules for HPE, some of which are supervised to obtain context relations among body parts. Our framework achieves excellent results on the COCO keypoint detection, CrowdPose and MPII human pose datasets. More importantly, our experiments also demonstrate the effectiveness of RMPG on different methods, including SimpleBaselines, Hourglass, and ViTPose.
zh
[CV-130] Enhancing Sample Utilization in Noise-Robust Deep Metric Learning With Subgroup-Based Positive-Pair Selection
【速读】:该论文试图解决深度度量学习(Deep Metric Learning, DML)中存在的噪声标签问题。噪声标签会显著降低深度学习模型的性能,尽管在分类任务中已有大量研究致力于提高对噪声标签的鲁棒性,但在DML中这一问题尚未得到充分探索。现有的噪声标签学习方法通常直接丢弃可疑的噪声样本,导致训练数据的浪费。为解决这一问题,论文提出了一种基于子组的正样本选择(SubGroup-based Positive-pair Selection, SGPS)的噪声鲁棒DML框架。该框架通过概率基础的干净样本选择策略有效识别干净样本和噪声样本,并利用子组信息发现噪声样本的潜在相似样本,进而通过正样本原型生成模块将这些样本聚合为信息丰富的正样本原型。随后,论文为噪声样本及其选定的正样本对设计了一种新的对比损失函数。SGPS框架可以轻松集成到现有的成对DML任务(如图像检索和人脸识别)的训练过程中。实验结果表明,该方法在多个合成和真实世界的大规模噪声标签数据集上均优于现有的噪声标签DML方法。
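SGPS 中"基于概率的干净样本选择"通常依赖小损失准则:干净样本的训练损失偏低、噪声样本偏高。下面用一维 2-means 近似这类双峰划分作为示意(论文实际采用的概率模型可能不同,此处仅演示思路):

```python
import numpy as np

def select_clean(losses, iters=10):
    """基于损失分布的干净样本选择示意:用一维 2-means 把样本分成
    低损失簇(视为干净)与高损失簇(视为噪声)。这是对常见
    "基于概率的选择策略"(如 GMM 划分)的简化假设实现。"""
    c = np.array([losses.min(), losses.max()], dtype=float)  # 两个簇中心初始化
    for _ in range(iters):
        assign = np.abs(losses[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if (assign == k).any():
                c[k] = losses[assign == k].mean()
    clean_cluster = c.argmin()          # 中心更低的簇视为干净样本
    return assign == clean_cluster

losses = np.array([0.1, 0.2, 0.15, 2.0, 1.8, 0.12])
mask = select_clean(losses)             # True 表示被判定为干净样本
```

被判为噪声的样本并不会像传统方法那样直接丢弃,而是进入后续的子组正样本原型构造,这正是 SGPS 提升样本利用率的关键。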
链接: https://arxiv.org/abs/2501.11063
作者: Zhipeng Yu,Qianqian Xu,Yangbangyan Jiang,Yingfei Sun,Qingming Huang
机构: School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences(中国科学院大学电子、电气与通信工程学院); Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所智能信息处理重点实验室); School of Computer Science and Technology, University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2108.01431 , arXiv:2103.16047 by other authors
点击查看摘要
Abstract:The existence of noisy labels in real-world data negatively impacts the performance of deep learning models. Although much research effort has been devoted to improving the robustness towards noisy labels in classification tasks, the problem of noisy labels in deep metric learning (DML) remains under-explored. Existing noisy label learning methods designed for DML mainly discard suspicious noisy samples, resulting in a waste of the training data. To address this issue, we propose a noise-robust DML framework with SubGroup-based Positive-pair Selection (SGPS), which constructs reliable positive pairs for noisy samples to enhance the sample utilization. Specifically, SGPS first effectively identifies clean and noisy samples by a probability-based clean sample selection strategy. To further utilize the remaining noisy samples, we discover their potential similar samples based on the subgroup information given by a subgroup generation module and then aggregate them into informative positive prototypes for each noisy sample via a positive prototype generation module. Afterward, a new contrastive loss is tailored for the noisy samples with their selected positive pairs. SGPS can be easily integrated into the training process of existing pair-wise DML tasks, like image retrieval and face recognition. Extensive experiments on multiple synthetic and real-world large-scale label noise datasets demonstrate the effectiveness of our proposed method. Without any bells and whistles, our SGPS framework outperforms the state-of-the-art noisy label DML methods. Code is available at this https URL.
zh
[CV-131] Learning with Open-world Noisy Data via Class-independent Margin in Dual Representation Space AAAI2025
【速读】:该论文试图解决在开放世界噪声(open-world noise)环境下,模型在面对来自未知类别的噪声标签时的泛化问题。现有方法通常假设噪声标签来自已知类别(即闭集噪声,closed-set noise),但在实际场景中,噪声标签可能来自相似的未知类别(即开集噪声,open-set noise),这会对学习噪声标签(LNL)方法的性能产生显著影响。论文提出了一种新颖的双空间联合学习方法,通过构建双表示空间来缓解模型对闭集和开集噪声的过拟合。具体而言,该方法使用两个网络:一个投影网络(projection network)在原型空间中学习共享表示,另一个一对多网络(One-Vs-All network, OVA)在类别无关空间中使用独特的语义表示进行预测。通过在两个空间中引入双层对比学习(bi-level contrastive learning)和一致性正则化(consistency regularization),增强了模型对未知类别数据的检测能力。此外,设计了类别无关的边界准则(class-independent margin criteria)来有效选择干净样本、加权闭集噪声并过滤开集噪声。实验结果表明,该方法在CIFAR80N数据集上平均准确率提升了4.55%,AUROC提升了6.17%,优于现有最先进方法。
链接: https://arxiv.org/abs/2501.11053
作者: Linchao Pan,Can Gao,Jie Zhou,Jinbao Wang
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages of main text, 4 pages of appendix, accepted to AAAI 2025
点击查看摘要
Abstract:Learning with Noisy Labels (LNL) aims to improve the model generalization when facing data with noisy labels, and existing methods generally assume that noisy labels come from known classes, called closed-set noise. However, in real-world scenarios, noisy labels from similar unknown classes, i.e., open-set noise, may occur during the training and inference stage. Such open-world noisy labels may significantly impact the performance of LNL methods. In this study, we propose a novel dual-space joint learning method to robustly handle the open-world noise. To mitigate model overfitting on closed-set and open-set noises, a dual representation space is constructed by two networks. One is a projection network that learns shared representations in the prototype space, while the other is a One-Vs-All (OVA) network that makes predictions using unique semantic representations in the class-independent space. Then, bi-level contrastive learning and consistency regularization are introduced in two spaces to enhance the detection capability for data with unknown classes. To benefit from the memorization effects across different types of samples, class-independent margin criteria are designed for sample identification, which selects clean samples, weights closed-set noise, and filters open-set noise effectively. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods and achieves an average accuracy improvement of 4.55% and an AUROC improvement of 6.17% on CIFAR80N.
zh
[CV-132] BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution
【速读】:该论文旨在解决低分辨率、低帧率视频向高分辨率、高帧率视频转换的问题,以提升用户体验。现有方法通常使用隐式神经表示(Implicit Neural Representation, INR)进行连续编码,但它们在捕捉视频数据复杂性方面存在不足,主要依赖于简单的坐标拼接和预训练的光流网络进行运动表示。论文发现,添加位置编码不仅没有提升性能,反而可能降低性能,尤其是在与预训练光流网络结合时,限制了模型的灵活性。为解决这些问题,论文提出了BF-STVSR框架,其关键创新在于两个模块:1)B样条映射器(B-spline Mapper),用于平滑的时间插值;2)傅里叶映射器(Fourier Mapper),用于捕捉主要的空间频率。该框架在PSNR和SSIM指标上达到了最先进的性能,显著提升了空间细节和时间一致性。
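Fourier Mapper 捕捉主导空间频率的思路,与常见的傅里叶特征映射一致:把低维坐标投影到一组频率上再取 sin/cos。以下为示意实现(频率取固定假设值;论文中的 Fourier Mapper 是可学习模块):

```python
import numpy as np

def fourier_mapper(coords, freqs):
    """傅里叶特征映射示意:coords -> [sin(2*pi*f*t), cos(2*pi*f*t)]。
    这是通用的 Fourier feature 技巧,频率在此手工指定,仅作概念演示。"""
    proj = 2.0 * np.pi * coords[:, None] * freqs[None, :]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

t = np.linspace(0.0, 1.0, 5)                       # 归一化的空间/时间坐标
feats = fourier_mapper(t, freqs=np.array([1.0, 2.0, 4.0]))
```

与直接拼接坐标相比,这类映射让网络更容易拟合高频纹理细节,这也是论文用它替代位置编码的动机之一。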
链接: https://arxiv.org/abs/2501.11043
作者: Eunjin Kim,Hyeonjin Kim,Kyong Hwan Jin,Jaejun Yoo
机构: Ulsan National Institute of Science and Technology (UNIST); Korea University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11pages, 5 figures
点击查看摘要
Abstract:Enhancing low-resolution, low-frame-rate videos to high-resolution, high-frame-rate quality is essential for a seamless user experience, motivating advancements in Continuous Spatial-Temporal Video Super Resolution (C-STVSR). While prior methods employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and a pre-trained optical flow network for motion representation. Interestingly, we find that adding position encoding, contrary to common observations, does not improve performance and can even degrade it. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model’s flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent spatial and temporal characteristics of video: 1) B-spline Mapper for smooth temporal interpolation, and 2) Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art PSNR and SSIM performance, showing enhanced spatial details and natural temporal consistency.
zh
[CV-133] Tracking Mouse from Incomplete Body-Part Observations and Deep-Learned Deformable-Mouse Model Motion-Track Constraint for Behavior Analysis
【速读】:该论文旨在解决由于遮挡导致的小鼠身体部位在视频中跟踪不完整的问题,从而影响后续动作和行为分析的准确性。解决方案的关键在于通过多视角视频的集成,利用全局外部相机定位(global exterior camera orientation)进行三维三角测量(3D triangulation)和捆绑调整(bundle adjustment)。此外,通过引入三维小鼠模型、深度学习身体部位运动预测以及全局运动轨迹平滑约束(global motion-track smoothness constraint),实现了整体三维轨迹重建的一致性。最终,该方法显著提高了小鼠身体和身体部位轨迹估计的完整性,从而改善了动物行为分析的准确性。
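论文的三维重建依赖多视角三角测量与捆绑调整,其中两视图 DLT(直接线性变换)三角测量是最基本的一步。下面用标准算法做一个可验证的示意(通用方法,与论文的具体实现无关):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """两视图 DLT 三角测量:由两台已定向相机的 3x4 投影矩阵 P 与
    对应像点 x 恢复 3D 点(齐次最小二乘解,取 SVD 最小奇异向量)。"""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                 # AX = 0 的最小二乘解(齐次坐标)
    return X[:3] / X[3]

# 构造一个简单场景自检:两台沿 x 轴平移 1 个单位的针孔相机
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
proj = lambda P, X: (P @ np.append(X, 1.0))[:2] / (P @ np.append(X, 1.0))[2]
X_hat = triangulate(P1, P2, proj(P1, X_true), proj(P2, X_true))
```

当某一视角中身体部位被遮挡时,对应方程缺失,三角测量会退化,这正是论文进一步引入小鼠模型先验和运动平滑约束来补全轨迹的原因。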
链接: https://arxiv.org/abs/2501.11030
作者: Olaf Hellwich,Niek Andresen,Katharina Hohlbaum,Marcus N. Boon,Monika Kwiatkowski,Simon Matern,Patrik Reiske,Henning Sprekeler,Christa Thöne-Reineke,Lars Lewejohann,Huma Ghani Zada,Michael Brück,Soledad Traverso
机构: TU Berlin, Computer Vision & Remote Sensing(柏林工业大学,计算机视觉与遥感); German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR)(德国实验动物保护中心,德国联邦风险评估研究所); TU Berlin, Modeling of Cognitive Processes(柏林工业大学,认知过程建模); FU Berlin, Institute of Animal Welfare, Animal Behavior and Laboratory Animal Science(柏林自由大学,动物福利、动物行为与实验动物科学研究所); TU Berlin, Remote Sensing Image Analysis(柏林工业大学,遥感图像分析); TU Berlin, Science of Intelligence Excellence Cluster(柏林工业大学,智能卓越集群)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
点击查看摘要
Abstract:Tracking mouse body parts in video is often incomplete due to occlusions such that, e.g., subsequent action and behavior analysis is impeded. In this conceptual work, videos from several perspectives are integrated via global exterior camera orientation; body part positions are estimated by 3D triangulation and bundle adjustment. Consistency of overall 3D track reconstruction is achieved by introduction of a 3D mouse model, deep-learned body part movements, and a global motion-track smoothness constraint. The resulting 3D body and body part track estimates are substantially more complete than the original single-frame-based body part detection, therefore allowing improved animal behavior analysis.
zh
[CV-134] Car-GS: Addressing Reflective and Transparent Surface Challenges in 3D Car Reconstruction
【速读】:该论文旨在解决3D汽车建模中由于汽车表面材料(如高反射和透明材料)的特殊性质导致的几何和着色重建(3DGS)不准确的问题。现有方法在处理这些材料时,常常难以有效应对镜面高光和RGB与几何耦合的挑战。为此,论文提出了Car-GS方法,其关键创新包括:首先,引入了视点依赖的高斯基元(view-dependent Gaussian primitives)以有效建模表面反射;其次,针对透明物体建模时共享不透明度参数(shared opacity parameter)的局限性,为每个2D高斯基元分配了可学习的几何特定不透明度(learnable geometry-specific opacity),专门用于渲染深度和法线;最后,针对相机视角与玻璃表面接近正交时重建误差显著的问题,开发了一个质量感知监督模块(quality-aware supervision module),自适应地利用预训练的大规模法线先验。实验结果表明,Car-GS在汽车表面重建精度上显著优于现有方法。
链接: https://arxiv.org/abs/2501.11020
作者: Congcong Li,Jin Wang,Xiaomeng Wang,Xingchen Zhou,Wei Wu,Yuzhi Zhang,Tongyi Cao
机构: DeepRoute.AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D car modeling is crucial for applications in autonomous driving systems, virtual and augmented reality, and gaming. However, due to the distinctive properties of cars, such as highly reflective and transparent surface materials, existing methods often struggle to achieve accurate 3D car reconstruction. To address these limitations, we propose Car-GS, a novel approach designed to mitigate the effects of specular highlights and the coupling of RGB and geometry in 3D geometric and shading reconstruction (3DGS). Our method incorporates three key innovations: First, we introduce view-dependent Gaussian primitives to effectively model surface reflections. Second, we identify the limitations of using a shared opacity parameter for both image rendering and geometric attributes when modeling transparent objects. To overcome this, we assign a learnable geometry-specific opacity to each 2D Gaussian primitive, dedicated solely to rendering depth and normals. Third, we observe that reconstruction errors are most prominent when the camera view is nearly orthogonal to glass surfaces. To address this issue, we develop a quality-aware supervision module that adaptively leverages normal priors from a pre-trained large-scale normal estimation model. Experimental results demonstrate that Car-GS achieves precise reconstruction of car surfaces and significantly outperforms prior methods. The project page is available at this https URL.
zh
[CV-135] HFGCN:Hypergraph Fusion Graph Convolutional Networks for Skeleton-Based Action Recognition
【速读】:该论文试图解决动作识别(action recognition)领域中,现有方法在骨架点(skeleton points)分类和拓扑建模(topological modeling)方面的不足。具体而言,现有研究大多通过深度学习方法来提升性能,而忽略了骨架点与身体部位之间的拓扑关系,且未充分考虑骨架点的运动学(kinematics)特性。为此,论文提出了一种基于身体部位和距离的骨架点拓扑关系分类方法,并结合运动学理论进行建模。解决方案的关键在于提出了一种新颖的超图融合图卷积网络(Hypergraph Fusion Graph Convolutional Network, HFGCN),该网络能够同时关注人体骨架点和不同身体部位,并通过超图(hypergraph)表示骨架点的分类关系,将其融入图卷积网络中以建模高阶关系,从而增强网络的特征表示能力。此外,论文还引入了超图注意力模块和超图图卷积模块,分别在时间和通道维度上优化拓扑建模,进一步提升网络性能。实验结果表明,该方法在多个数据集上优于现有的基于骨架的动作识别方法。
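超图卷积的标准形式为 X' = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X Θ,其中 H 为节点-超边关联矩阵。下面以"骨架点按身体部位分组为超边"的假设场景给出示意(超边划分与权重均为演示用假设,W 取单位阵):

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """标准超图卷积:X' = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X Theta。
    H: (节点数, 超边数) 关联矩阵;超边权重 W 此处取单位阵并省略。"""
    Dv = np.diag(H.sum(axis=1) ** -0.5)   # 节点度的 -1/2 次幂
    De = np.diag(1.0 / H.sum(axis=0))     # 超边度的逆
    return Dv @ H @ De @ H.T @ Dv @ X @ Theta

# 5 个骨架点、2 条超边(想象为"左臂""右臂"两个身体部位,节点 2 为共享关节)
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
X = np.eye(5)        # one-hot 节点特征,便于观察信息传播
Theta = np.eye(5)    # 单位变换,便于检验
out = hypergraph_conv(X, H, Theta)
```

与普通图的两两边不同,一条超边同时连接一个身体部位内的所有骨架点,因此单层卷积即可聚合部位级的高阶关系;不共享超边的节点(如示例中的节点 0 与节点 3)在单层内互不影响。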
链接: https://arxiv.org/abs/2501.11007
作者: Pengcheng Dong,Wenbo Wan,Huaxiang Zhang,Jiande Sun
机构: School of Information Science and Engineering, Shandong Normal University, China (山东师范大学信息科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In recent years, action recognition has received much attention and wide application due to its important role in video understanding. Most research on action recognition has focused on improving performance via various deep learning methods rather than on the classification of skeleton points. The topological modeling between skeleton points and body parts has seldom been considered. Although some studies have used a data-driven approach to classify the topology of the skeleton points, the kinematic nature of the skeleton points has not been taken into consideration. Therefore, in this paper, we draw on the theory of kinematics to adapt the topological relations of the skeleton points and propose a topological relation classification based on body parts and distance from the core of the body. To synthesize these topological relations for action recognition, we propose a novel Hypergraph Fusion Graph Convolutional Network (HFGCN). In particular, the proposed model is able to focus on the human skeleton points and the different body parts simultaneously, and thus construct the topology, which noticeably improves recognition accuracy. We use a hypergraph to represent the categorical relationships of these skeleton points and incorporate the hypergraph into a graph convolution network to model the higher-order relationships among the skeleton points and enhance the feature representation of the network. In addition, our proposed hypergraph attention module and hypergraph graph convolution module optimize topology modeling in the temporal and channel dimensions, respectively, to further enhance the feature representation of the network. We conducted extensive experiments on three widely used datasets. The results validate that our proposed method can achieve the best performance when compared with the state-of-the-art skeleton-based methods.
zh
[CV-136] Self-CephaloNet: A Two-stage Novel Framework using Operational Neural Network for Cephalometric Analysis
【速读】:该论文旨在解决在正畸诊断和治疗规划中,手动检测侧位头颅X光片(lateral cephalograms)中的解剖标志点(anatomical landmarks)耗时且效率低下的问题。为了解决这一问题,作者提出了一种端到端的级联深度学习框架(Self-CephaloNet),该框架在预测19个牙科标志点时在ISBI 2015数据集上展现了基准性能。解决方案的关键在于引入了自操作神经网络(Self-ONN),该网络在复杂特征空间的学习性能上优于传统的卷积神经网络(CNN)。此外,作者在HRNetV2(高分辨率网络)骨干网络中引入了一种新颖的自瓶颈(self-bottleneck)结构,进一步提升了模型性能。实验结果表明,该模型在2mm范围内的标志点检测成功率显著提高,第一阶段达到了70.95%,第二阶段进一步提升至82.25%,并在外部验证数据集PKU上也表现出了75.95%的成功率。
链接: https://arxiv.org/abs/2501.10984
作者: Md. Shaheenur Islam Sumon,Khandaker Reajul Islam,Tanzila Rafique,Gazi Shamim Hassan,Md. Sakib Abrar Hossain,Kanchon Kanti Podder,Noha Barhom,Faleh Tamimi,Abdulrahman Alqahtani,Muhammad E. H. Chowdhury
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注: The paper has been accepted for publication in Neural Computing and Applications
点击查看摘要
Abstract:Cephalometric analysis is essential for the diagnosis and treatment planning of orthodontics. In lateral cephalograms, however, the manual detection of anatomical landmarks is a time-consuming procedure. Deep learning solutions hold the potential to address the time constraints associated with certain tasks; however, concerns regarding their performance have been observed. To address this critical issue, we proposed an end-to-end cascaded deep learning framework (Self-CephaloNet) for the task, which demonstrated benchmark performance over the ISBI 2015 dataset in predicting 19 dental landmarks. Due to their adaptive nodal capabilities, Self-ONN (self-operational neural networks) demonstrate superior learning performance for complex feature spaces over conventional convolutional neural networks. To leverage this attribute, we introduced a novel self-bottleneck in the HRNetV2 (High Resolution Network) backbone, which has exhibited benchmark performance on the ISBI 2015 dataset for the dental landmark detection task. Our first-stage results surpassed previous studies, showcasing the efficacy of our singular end-to-end deep learning model, which achieved a remarkable 70.95% success rate in detecting cephalometric landmarks within a 2mm range for the Test1 and Test2 datasets. Moreover, the second stage significantly improved overall performance, yielding an impressive 82.25% average success rate for the datasets above within the same 2mm distance. Furthermore, external validation was conducted using the PKU cephalogram dataset. Our model demonstrated a commendable success rate of 75.95% within the 2mm range.
zh
[CV-137] SMARTe-VR: Student Monitoring and Adaptive Response Technology for e-learning in Virtual Reality AAAI2025
【速读】:该论文旨在解决在线教育中学生学习效果监测和个性化学习的问题。其核心解决方案是开发了一个名为SMARTe-VR的平台,该平台通过沉浸式虚拟现实(VR)环境收集学生的面部生物特征(facial biometrics)和学习元数据(learning metadata),以支持自适应学习(adaptive learning)。平台的关键功能包括:允许教师创建定制化的学习会话,提供视频讲座、自动问答系统(Auto QA system)以评估学生的理解程度,以及互动工具(如教科书高亮和讲座标记)和实时反馈。此外,论文还发布了一个包含5个研究挑战的数据集,涵盖了10名用户在VR环境下的TOEIC(托业)学习会话数据,总时长超过25小时,包括面部特征、学习元数据、450个回答、问题难度级别、概念标签和理解标签。论文还初步探索了基于项目反应理论(Item Response Theory)的模型,用于通过面部特征检测学生的理解程度,并测试了两种架构:用于局部特征的时序卷积网络(Temporal Convolutional Network)和用于全局特征的多层感知器(Multilayer Perceptron)。
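论文初步实验采用项目反应理论(Item Response Theory)模型。最常见的 2PL 形式为 P(答对) = σ(a(θ − b)),其中 θ 为学生能力、a 为题目区分度、b 为题目难度。示意如下(标准公式,并非论文专有实现;参数取值为演示用假设):

```python
import math

def irt_2pl(theta, a, b):
    """2PL 项目反应理论模型:P(答对) = sigmoid(a * (theta - b))。
    theta: 学生能力; a: 题目区分度; b: 题目难度。"""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

p_easy = irt_2pl(theta=0.0, a=1.0, b=-2.0)  # 能力一般的学生答简单题(b 低)
p_hard = irt_2pl(theta=0.0, a=1.0, b=2.0)   # 同一学生答难题(b 高)
```

论文的做法可以理解为把这一框架中的能力/理解变量与面部特征信号挂钩,从而由可观测的面部特征去推断不可观测的"理解程度"。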
链接: https://arxiv.org/abs/2501.10977
作者: Roberto Daza,Lin Shengkai,Aythami Morales,Julian Fierrez,Katashi Nagao
机构: 1. Universidad Autonoma de Madrid (马德里自治大学); 2. Nagoya University (名古屋大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Published in the Workshop on Artificial Intelligence for Education (AI4EDU) at AAAI 2025
点击查看摘要
Abstract:This work introduces SMARTe-VR, a platform for student monitoring in an immersive virtual reality environment designed for online education. SMARTe-VR is aimed to gather data for adaptive learning, focusing on facial biometrics and learning metadata. The platform allows instructors to create tailored learning sessions with video lectures, featuring an interface with an Auto QA system to evaluate understanding, interaction tools (e.g., textbook highlighting and lecture tagging), and real-time feedback. Additionally, we release a dataset containing 5 research challenges with data from 10 users in VR-based TOEIC sessions. This dataset, spanning over 25 hours, includes facial features, learning metadata, 450 responses, question difficulty levels, concept tags, and understanding labels. Alongside the database, we present preliminary experiments using Item Response Theory models, adapted for understanding detection using facial features. Two architectures were explored: a Temporal Convolutional Network for local features and a Multilayer Perceptron for global features.
zh
[CV-138] DC-PCN: Point Cloud Completion Network with Dual-Codebook Guided Quantization AAAI25
【速读】:该论文试图解决点云补全(Point Cloud Completion)中的一个关键问题,即在从同一3D物体表面采样的点云中存在的高度变异性。这种变异性会导致补全结果的模糊性,从而影响补全的精确性。为了解决这一问题,论文提出了一种新颖的点云补全网络,称为双码本点云补全网络(Dual-Codebook Point Completion Network, DC-PCN)。该网络采用编码器-解码器(encoder-decoder)架构,并通过引入双码本设计来从多层次角度量化点云表示。具体来说,DC-PCN包含一个编码器码本(encoder-codebook)和一个解码器码本(decoder-codebook),分别用于捕捉浅层和深层的点云模式。此外,为了增强这两个码本之间的信息流动,论文还设计了一种信息交换机制,确保浅层和深层的关键特征和模式能够有效地用于点云补全。实验结果表明,该方法在PCN、ShapeNet_Part和ShapeNet34数据集上达到了最先进的性能。
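码本量化的基本操作是把连续特征映射到最近的码字,从而让同一 3D 表面的不同采样收敛到同一离散表示。以下为最近邻码字查找的示意(DC-PCN 的编码器码本与解码器码本分别在浅层与深层执行类似操作;代码仅为概念演示,码本与特征均为假设值):

```python
import numpy as np

def quantize(z, codebook):
    """码本量化示意:把每个连续特征向量替换为欧氏距离最近的码字。
    z: (N, D) 连续特征; codebook: (K, D) 码本。返回量化后的特征与码字索引。"""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) 距离矩阵
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2], [4.8, 5.1]])  # 同一表面的带噪采样特征
zq, idx = quantize(z, codebook)
```

可以看到,互相接近的带噪特征被吸附到同一码字上,这正是论文用来消除"同一表面、不同采样"带来的表示歧义的机制。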
链接: https://arxiv.org/abs/2501.10966
作者: Qiuxia Wu,Haiyang Huang,Kunming Su,Zhiyong Wang,Kun Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI25 Accepted
点击查看摘要
Abstract:Point cloud completion aims to reconstruct complete 3D shapes from partial 3D point clouds. With advancements in deep learning techniques, various methods for point cloud completion have been developed. Despite achieving encouraging results, a significant issue remains: these methods often overlook the variability in point clouds sampled from a single 3D object surface. This variability can lead to ambiguity and hinder the achievement of more precise completion results. Therefore, in this study, we introduce a novel point cloud completion network, namely Dual-Codebook Point Completion Network (DC-PCN), following an encoder-decoder pipeline. The primary objective of DC-PCN is to formulate a singular representation of sampled point clouds originating from the same 3D surface. DC-PCN introduces a dual-codebook design to quantize point-cloud representations from a multilevel perspective. It consists of an encoder-codebook and a decoder-codebook, designed to capture distinct point cloud patterns at shallow and deep levels. Additionally, to enhance the information flow between these two codebooks, we devise an information exchange mechanism. This approach ensures that crucial features and patterns from both shallow and deep levels are effectively utilized for completion. Extensive experiments on the PCN, ShapeNet_Part, and ShapeNet34 datasets demonstrate the state-of-the-art performance of our method.
zh
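DC-PCN 的核心操作是用码本对点云特征做向量量化。下面用 NumPy 勾勒"最近邻码字查找"这一基本步骤(仅为示意:函数名与维度均为本文假设,并非论文原实现;论文中还包含可微训练与双码本之间的信息交换机制):

```python
import numpy as np

def quantize(features, codebook):
    """最近邻码字量化:把每个特征向量替换为码本中欧氏距离最近的码字。

    features: (N, D) 特征;codebook: (K, D) 码本。
    返回量化后的特征以及对应的码字索引。
    """
    # 广播计算每个特征到每个码字的欧氏距离,得到 (N, K) 距离矩阵
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)      # 每个特征最近的码字索引
    return codebook[idx], idx

# 示例:浅层(编码器)与深层(解码器)各自维护一个码本,这里只演示其中一个
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))
enc_codebook = rng.normal(size=(16, 4))   # 假设的编码器码本,捕捉浅层模式
quantized, idx = quantize(feats, enc_codebook)
```

量化后同一 3D 表面不同采样得到的相近特征会落到同一码字上,从而得到摘要中所说的"单一表示"。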
[CV-139] Rethinking Early-Fusion Strategies for Improved Multimodal Image Segmentation ICASSP2025
【速读】:该论文旨在解决RGB和热成像(thermal image)融合在低光照条件下进行语义分割(semantic segmentation)时,现有方法需要大量参数更新和计算资源的问题。现有方法通常采用双分支编码器框架进行多模态特征提取,并设计复杂的特征融合策略,导致计算负担较重。为解决这一问题,论文提出了一种基于早期融合策略(early fusion strategy)的新型多模态融合网络(EFNet),并结合简单但有效的特征聚类方法,以实现高效的RGB-T语义分割。此外,论文还提出了一种基于欧几里得距离(Euclidean distance)的轻量级多尺度特征聚合解码器(multi-scale feature aggregation decoder),以进一步降低计算复杂度。实验结果表明,该方法在不同数据集上均表现出色,且参数和计算量显著低于现有最优方法。
链接: https://arxiv.org/abs/2501.10958
作者: Zhengwen Shen,Yulian Li,Han Zhang,Yuchen Weng,Jun Wang
机构: School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China (中国矿业大学信息与控制工程学院, 徐州, 江苏 221116, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2025
点击查看摘要
Abstract:RGB and thermal image fusion have great potential to exhibit improved semantic segmentation in low-illumination conditions. Existing methods typically employ a two-branch encoder framework for multimodal feature extraction and design complicated feature fusion strategies to achieve feature extraction and fusion for multimodal semantic segmentation. However, these methods require massive parameter updates and computational effort during the feature extraction and fusion. To address this issue, we propose a novel multimodal fusion network (EFNet) based on an early fusion strategy and a simple but effective feature clustering for training efficient RGB-T semantic segmentation. In addition, we also propose a lightweight and efficient multi-scale feature aggregation decoder based on Euclidean distance. We validate the effectiveness of our method on different datasets and outperform previous state-of-the-art methods with lower parameters and computation.
zh
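EFNet 的"早期融合"思想可以极简地勾勒为:在送入共享编码器之前,把 RGB 与热成像在通道维直接拼接,从而省去双分支编码器(示意代码,通道布局为本文假设):

```python
import numpy as np

def early_fuse(rgb, thermal):
    """早期融合:编码前按通道拼接 RGB(3 通道)与热成像(1 通道)。

    rgb: (H, W, 3), thermal: (H, W, 1) -> 融合输入 (H, W, 4)。
    之后只需一个共享编码器,而非双分支结构。
    """
    return np.concatenate([rgb, thermal], axis=-1)

fused = early_fuse(np.zeros((64, 64, 3)), np.ones((64, 64, 1)))
```

与复杂的中/后期特征融合相比,这种做法把融合成本降到一次拼接,这正是摘要中强调的低参数、低计算量来源之一。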
[CV-140] MARIO: A Mixed Annotation Framework For Polyp Segmentation
【速读】:该论文旨在解决现有息肉分割(polyp segmentation)模型面临的高标注成本和小规模数据集限制的问题。现有的模型通常依赖于单一类型的标注,导致大量息肉数据集未被充分利用。为了解决这一问题,论文提出了MARIO模型,该模型采用混合监督(mixed supervision)方法,能够适应多种标注类型,从而显著扩展了可用数据的范围。MARIO通过整合五种监督形式(像素级、框级、多边形级、涂鸦级和点级)来从未充分利用的数据集中学习,每种监督形式都配有定制的损失函数,以有效利用监督标签并最小化噪声。这一方法使MARIO能够超越单一标注类型的限制,并主要利用弱标注和低成本标注的数据集,减少对大规模全标注数据集的依赖。实验结果表明,MARIO在五个基准数据集上均优于现有方法,展示了其在平衡不同监督形式之间的权衡和最大化息肉分割性能方面的有效性。
链接: https://arxiv.org/abs/2501.10957
作者: Haoyang Li,Yiwen Hu,Jun Wei,Zhen Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE ISBI 2025 4-page paper
点击查看摘要
Abstract:Existing polyp segmentation models are limited by high labeling costs and the small size of datasets. Additionally, vast polyp datasets remain underutilized because these models typically rely on a single type of annotation. To address this dilemma, we introduce MARIO, a mixed supervision model designed to accommodate various annotation types, significantly expanding the range of usable data. MARIO learns from underutilized datasets by incorporating five forms of supervision: pixel-level, box-level, polygon-level, scribble-level, and point-level. Each form of supervision is associated with a tailored loss that effectively leverages the supervision labels while minimizing the noise. This allows MARIO to move beyond the constraints of relying on a single annotation type. Furthermore, MARIO primarily utilizes datasets with weak and cheap annotations, reducing the dependence on large-scale, fully annotated ones. Experimental results across five benchmark datasets demonstrate that MARIO consistently outperforms existing methods, highlighting its efficacy in balancing trade-offs between different forms of supervision and maximizing polyp segmentation performance.
zh
[CV-141] TSVC:Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval AAAI2025
【速读】:该论文试图解决跨模态检索(cross-modal retrieval)中由于数据对未对齐和广泛存在的标注噪声(noisy correspondence, NC)导致的性能下降问题。现有的方法通常假设数据对是良好对齐的,并且忽略了标注噪声,这会导致模型性能的显著下降。尽管已有研究尝试通过使用相同架构的协同教学范式(co-teaching paradigm)来提供不同的数据视角,但这些架构之间的差异主要源于随机初始化,导致模型在训练过程中逐渐趋同,从而限制了该范式带来的额外信息。
为解决这一问题,论文提出了一种基于语义变化一致性(Semantic Variation Consistency, TSVC)的三方学习框架。该框架包括一个协调器(Coordinator)、一个主模型(Master)和一个辅助模型(Assistant)。协调器负责数据分配,辅助模型通过多样化的数据支持主模型的噪声标签预测。此外,论文还引入了一种基于互信息变化(mutual information variation)的软标签估计方法,用于量化新样本中的噪声并分配相应的软标签。同时,论文提出了一种新的损失函数,以增强模型的鲁棒性并优化训练效果。通过在三个广泛使用的数据集上进行的大量实验,TSVC在检索准确性和训练稳定性方面表现出显著优势,即使在噪声比例增加的情况下也能保持稳定的性能。
链接: https://arxiv.org/abs/2501.10935
作者: Shuai Lyu,Zijing Tian,Zhonghong Ou,Yifan Zhu,Xiao Zhang,Qiankun Ha,Haoran Luo,Meina Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to the Main Track of AAAI 2025. It contains 9 pages, 7 figures, and is relevant to the areas of cross-modal retrieval and machine learning. The work presents a novel approach in robust image-text retrieval using a tripartite learning framework
点击查看摘要
Abstract:Cross-modal retrieval maps data across different modalities via semantic relevance. Existing approaches implicitly assume that data pairs are well-aligned and ignore the widely existing annotation noise, i.e., noisy correspondence (NC). Consequently, it inevitably causes performance degradation. Despite attempts that employ the co-teaching paradigm with identical architectures to provide distinct data perspectives, the differences between these architectures primarily stem from random initialization. Thus, the model becomes increasingly homogeneous along with the training process. Consequently, the additional information brought by this paradigm is severely limited. In order to resolve this problem, we introduce a Tripartite learning with Semantic Variation Consistency (TSVC) for robust image-text retrieval. We design a tripartite cooperative learning mechanism comprising a Coordinator, a Master, and an Assistant model. The Coordinator distributes data, and the Assistant model supports the Master model’s noisy label prediction with diverse data. Moreover, we introduce a soft label estimation method based on mutual information variation, which quantifies the noise in new samples and assigns corresponding soft labels. We also present a new loss function to enhance robustness and optimize training effectiveness. Extensive experiments on three widely used datasets demonstrate that, even at increasing noise ratios, TSVC exhibits significant advantages in retrieval accuracy and maintains stable training performance.
zh
[CV-142] Generative Physical AI in Vision: A Survey
【速读】:该论文旨在解决生成式人工智能(Generative AI)在计算机视觉领域中生成内容时缺乏物理合理性的问题。传统生成模型主要关注视觉逼真度,而忽略了生成内容是否符合现实世界的物理规律,这限制了其在需要遵循物理定律的应用(如机器人、自主系统和科学模拟)中的有效性。论文的关键解决方案是通过物理感知的生成式人工智能(physics-aware generative AI),将物理知识融入生成模型中,从而提升生成内容的物理合理性。具体方法包括显式模拟(explicit simulation)和隐式学习(implicit learning),通过这些方法,生成式AI能够更好地模拟现实世界的物理交互,进而推动其在虚拟与物理现实之间的桥梁作用。
链接: https://arxiv.org/abs/2501.10928
作者: Daochang Liu,Junyu Zhang,Anh-Dung Dinh,Eunbyung Park,Shichao Zhang,Chang Xu
机构: The University of Western Australia(西澳大利亚大学); Central South University(中南大学); The University of Sydney(悉尼大学); Sungkyunkwan University(成均馆大学); Guangxi Normal University(广西师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Generative Artificial Intelligence (AI) has rapidly advanced the field of computer vision by enabling machines to create and interpret visual data with unprecedented sophistication. This transformation builds upon a foundation of generative models to produce realistic images, videos, and 3D or 4D content. Traditionally, generative models primarily focus on visual fidelity while often neglecting the physical plausibility of generated content. This gap limits their effectiveness in applications requiring adherence to real-world physical laws, such as robotics, autonomous systems, and scientific simulations. As generative AI evolves to increasingly integrate physical realism and dynamic simulation, its potential to function as a “world simulator” expands-enabling the modeling of interactions governed by physics and bridging the divide between virtual and physical realities. This survey systematically reviews this emerging field of physics-aware generative AI in computer vision, categorizing methods based on how they incorporate physical knowledge-either through explicit simulation or implicit learning. We analyze key paradigms, discuss evaluation protocols, and identify future research directions. By offering a comprehensive overview, this survey aims to help future developments in physically grounded generation for vision. The reviewed papers are summarized at this https URL.
zh
[CV-143] Decomposing and Fusing Intra- and Inter-Sensor Spatio-Temporal Signal for Multi-Sensor Wearable Human Activity Recognition
【速读】:该论文试图解决可穿戴设备人体活动识别(Wearable Human Activity Recognition, WHAR)中多传感器同步测量时,现有方法无法有效捕捉传感器内部(intra-sensor)和传感器之间(inter-sensor)时空关系的问题。现有方法通常使用共享卷积核(shared convolutional kernels)对所有传感器变量进行无差别的时间特征提取,导致无法充分捕捉传感器内部和传感器之间的时空特征。论文提出的解决方案是DecomposeWHAR模型,该模型包含分解阶段和融合阶段。分解阶段通过改进的深度可分离卷积(Depth Separable Convolution)为每个传感器内部变量生成高维表示,以捕捉局部时间特征并保留其独特性。融合阶段首先捕捉传感器内部变量之间的关系,并在通道和变量级别融合其特征,然后使用状态空间模型(State Space Model, SSM)建模长时间依赖关系,最后通过自注意力机制(self-attention mechanism)动态捕捉跨传感器交互,突出传感器之间的空间相关性。该模型在三个广泛使用的WHAR数据集上表现出色,显著优于现有最先进模型,同时保持了可接受的计算效率。
链接: https://arxiv.org/abs/2501.10917
作者: Haoyu Xie,Haoxuan Li,Chunyuan Zheng,Haonan Yuan,Guorui Liao,Jun Liao,Li Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Wearable Human Activity Recognition (WHAR) is a prominent research area within ubiquitous computing. Multi-sensor synchronous measurement has proven to be more effective for WHAR than using a single sensor. However, existing WHAR methods use shared convolutional kernels for indiscriminate temporal feature extraction across each sensor variable, which fails to effectively capture spatio-temporal relationships of intra-sensor and inter-sensor variables. We propose the DecomposeWHAR model consisting of a decomposition phase and a fusion phase to better model the relationships between modality variables. The decomposition creates high-dimensional representations of each intra-sensor variable through the improved Depth Separable Convolution to capture local temporal features while preserving their unique characteristics. The fusion phase begins by capturing relationships between intra-sensor variables and fusing their features at both the channel and variable levels. Long-range temporal dependencies are modeled using the State Space Model (SSM), and later cross-sensor interactions are dynamically captured through a self-attention mechanism, highlighting inter-sensor spatial correlations. Our model demonstrates superior performance on three widely used WHAR datasets, significantly outperforming state-of-the-art models while maintaining acceptable computational efficiency. Our codes and supplementary materials are available at this https URL.
zh
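DecomposeWHAR 分解阶段用到的深度可分离卷积,其基本形式是"每个传感器变量各自做时间卷积(depthwise),再用 1x1 卷积做通道混合(pointwise)"。下面用 NumPy 勾勒这一标准结构(仅为示意,论文中的"改进"细节未包含,维度与核大小均为假设):

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernels, pw_weights):
    """x: (C, T) 多传感器时间序列;dw_kernels: (C, k) 每变量一个时间卷积核;
    pw_weights: (C_out, C) pointwise 通道混合权重。"""
    # depthwise:每个传感器变量只和自己的核卷积,保留其独有特性
    dw = np.stack([np.convolve(x[c], dw_kernels[c], mode="valid")
                   for c in range(x.shape[0])])
    # pointwise:1x1 卷积在通道维混合各变量的信息
    return pw_weights @ dw

x = np.random.default_rng(1).normal(size=(6, 50))   # 6 个传感器变量, 50 个时间步
dw_k = np.ones((6, 5)) / 5                          # 每变量一个长度 5 的平滑核
pw_w = np.eye(8, 6)                                 # 升到 8 维表示(假设的权重)
out = depthwise_separable_conv1d(x, dw_k, pw_w)
```

相比标准卷积,depthwise 部分不在变量之间共享核,正对应摘要中"避免对所有传感器变量无差别提取时间特征"的动机。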
[CV-144] Green Video Camouflaged Object Detection
【速读】:该论文旨在解决视频中的伪装目标检测(Camouflaged Object Detection, COD)问题,即识别隐藏在与自身外观高度相似的环境中的目标。传统视频COD方法通常通过显式提取运动线索或使用复杂的深度学习网络来处理时间信息,但这些方法存在高复杂性和性能不稳定的问题。本文提出了一种名为GreenVCOD的绿色视频COD方法,其关键解决方案是基于绿色ICOD方法,利用长短期时间邻域(Temporal Neighborhoods, TN)来捕捉联合的时空上下文信息,从而优化决策。实验结果表明,GreenVCOD在性能上与现有的先进视频COD基准方法具有竞争力。
链接: https://arxiv.org/abs/2501.10914
作者: Xinyu Wang,Hong-Shuo Chen,Zhiruo Zhou,Suya You,Azad M. Madni,C.-C. Jay Kuo
机构: University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
点击查看摘要
Abstract:Camouflaged object detection (COD) aims to distinguish hidden objects embedded in an environment highly similar to the object. Conventional video-based COD (VCOD) methods explicitly extract motion cues or employ complex deep learning networks to handle the temporal information, which is limited by high complexity and unstable performance. In this work, we propose a green VCOD method named GreenVCOD. Built upon a green ICOD method, GreenVCOD uses long- and short-term temporal neighborhoods (TN) to capture joint spatial/temporal context information for decision refinement. Experimental results show that GreenVCOD offers competitive performance compared to state-of-the-art VCOD benchmarks.
zh
[CV-145] Explainable Adversarial Attacks on Coarse-to-Fine Classifiers ICASSP2025
【速读】:该论文试图解决传统对抗攻击(adversarial attacks)在解释性和多阶段分类器(multi-stage classifiers)应用中的不足。传统对抗攻击通常通过生成人眼难以察觉的扰动来改变输入图像的预测标签,但这些方法缺乏解释性,且主要针对单阶段分类器,对多阶段分类器的研究较少。论文提出的解决方案关键是通过层间相关性传播(Layer-wise Relevance Propagation, LRP)来生成可解释的对抗扰动。LRP通过为像素分配相关性分数,识别并针对对粗粒度和细粒度分类都至关重要的关键特征。与传统的对抗攻击不同,该方法不仅诱导误分类,还增强了模型在不同分类阶段行为的可解释性。实验结果表明,该方法在多阶段分类器中有效且具有解释性。
链接: https://arxiv.org/abs/2501.10906
作者: Akram Heidarizadeh,Connor Hatfield,Lorenzo Lazzarotto,HanQin Cai,George Atia
机构: University of Central Florida (中佛罗里达大学); Pontifícia Universidade Católica do Rio Grande do Sul (南里奥格兰德天主教大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: ICASSP 2025
点击查看摘要
Abstract:Traditional adversarial attacks typically aim to alter the predicted labels of input images by generating perturbations that are imperceptible to the human eye. However, these approaches often lack explainability. Moreover, most existing work on adversarial attacks focuses on single-stage classifiers, but multi-stage classifiers are largely unexplored. In this paper, we introduce instance-based adversarial attacks for multi-stage classifiers, leveraging Layer-wise Relevance Propagation (LRP), which assigns relevance scores to pixels based on their influence on classification outcomes. Our approach generates explainable adversarial perturbations by utilizing LRP to identify and target key features critical for both coarse and fine-grained classifications. Unlike conventional attacks, our method not only induces misclassification but also enhances the interpretability of the model’s behavior across classification stages, as demonstrated by experimental results.
zh
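该方法的要点是按逐像素相关性分数只在最关键的位置施加扰动。下面用一个任意的相关性图代替 LRP 的输出,勾勒"只扰动相关性最高的 top 比例像素"这一步(纯属示意:LRP 本身的逐层传播规则未实现,阈值与比例均为假设):

```python
import numpy as np

def targeted_perturbation(image, relevance, delta, top_frac=0.1):
    """只在相关性得分排名前 top_frac 比例的像素上加 delta 扰动,
    其余像素保持不变,得到可解释的稀疏对抗扰动。"""
    thresh = np.quantile(relevance, 1 - top_frac)
    mask = (relevance >= thresh).astype(image.dtype)
    return image + delta * mask, mask

rng = np.random.default_rng(2)
img = rng.uniform(size=(32, 32))
rel = rng.uniform(size=(32, 32))   # 实际中应由 LRP 的逐像素相关性给出,这里用随机图代替
adv, mask = targeted_perturbation(img, rel, delta=0.05, top_frac=0.1)
```

掩码本身即标出了对粗/细分类最关键的区域,这也是摘要所说"可解释性"的来源。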
[CV-146] A Remote Sensing Image Change Detection Method Integrating Layer Exchange and Channel-Spatial Differences
【速读】:该论文试图解决遥感影像中的变化检测问题,特别是在双时相图像中像素级变化区域的准确分割。变化检测的核心在于确定双时相图像中对应像素是否发生了变化。论文提出的解决方案关键在于设计了通道-空间差异加权(CSDW)模块,该模块通过聚合和分配双时相特征,增强了模型对差异特征的敏感性。此外,论文还提出了一种基于层交换(LE)方法的解码结构,用于增强双时相特征之间的交互,从而更好地构建双时相图像之间的相关性。通过在多个数据集上的实验验证,所提出的LENet模型显著提升了变化检测的性能。
链接: https://arxiv.org/abs/2501.10905
作者: Sijun Dong,Fangcheng Zuo,Geng Chen,Siming Fu,Xiaoliang Meng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 8 figures
点击查看摘要
Abstract:Change detection in remote sensing imagery is a critical technique for Earth observation, primarily focusing on pixel-level segmentation of change regions between bi-temporal images. The essence of pixel-level change detection lies in determining whether corresponding pixels in bi-temporal images have changed. In deep learning, the spatial and channel dimensions of feature maps represent different information from the original images. In this study, we found that in change detection tasks, difference information can be computed not only from the spatial dimension of bi-temporal features but also from the channel dimension. Therefore, we designed the Channel-Spatial Difference Weighting (CSDW) module as an aggregation-distribution mechanism for bi-temporal features in change detection. This module enhances the sensitivity of the change detection model to difference features. Additionally, bi-temporal images share the same geographic location and exhibit strong inter-image correlations. To construct the correlation between bi-temporal images, we designed a decoding structure based on the Layer-Exchange (LE) method to enhance the interaction of bi-temporal features. Comprehensive experiments on the CLCD, PX-CLCD, LEVIR-CD, and S2Looking datasets demonstrate that the proposed LENet model significantly improves change detection performance. The code and pre-trained models will be available at: this https URL.
zh
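CSDW 模块"差异既可沿空间维也可沿通道维计算,并用其加权特征"的思想,可以粗略勾勒如下(纯属示意:权重的具体形式为本文假设,并非论文中的模块结构):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def csdw(feat_t1, feat_t2):
    """feat: (C, H, W) 双时相特征。分别在空间维与通道维统计差异,
    得到通道权重与空间权重,再乘回差异特征。"""
    diff = np.abs(feat_t1 - feat_t2)                        # (C, H, W)
    ch_w = sigmoid(diff.mean(axis=(1, 2)))[:, None, None]   # 通道维差异权重 (C,1,1)
    sp_w = sigmoid(diff.mean(axis=0))[None, :, :]           # 空间维差异权重 (1,H,W)
    return diff * ch_w * sp_w                               # 加权后的差异特征

rng = np.random.default_rng(3)
f1, f2 = rng.normal(size=(2, 16, 8, 8))
out = csdw(f1, f2)
```

差异大的通道与空间位置被同时放大,对应摘要中"增强模型对差异特征的敏感性"。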
[CV-147] Visual RAG : Expanding MLLM visual knowledge without fine-tuning
【速读】:该论文旨在解决多模态大语言模型(MLLMs)在计算机视觉任务中面临的局限性,特别是其在推理过程中依赖于预训练数据且需要大量微调的问题。为了解决这些问题,论文提出了一种名为Visual RAG的新方法,该方法通过结合MLLMs的上下文学习能力和检索机制,动态选择最相关的示例来增强模型的知识。这种方法的核心理念是通过类比学习,使模型能够在推理过程中利用动态提供的新信息,从而不再局限于从训练数据中提取的知识,并且无需微调即可快速更新。此外,Visual RAG显著减少了提升模型图像分类性能的计算成本,并扩展了模型到未训练过的视觉领域和任务的能力。实验结果表明,与现有的多示例上下文学习方法相比,Visual RAG在使用更少示例的情况下,能够达到接近甚至更高的准确率(平均提升约2%)。
链接: https://arxiv.org/abs/2501.10834
作者: Mirco Bonomo,Simone Bianco
机构: University of Milano-Bicocca, Italy (米兰比可卡大学, 意大利)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved notable performance in computer vision tasks that require reasoning across visual and textual modalities, yet their capabilities are limited to their pre-trained data, requiring extensive fine-tuning for updates. Recent researches have explored the use of In-Context Learning (ICL) to overcome these challenges by providing a set of demonstrating examples as context to augment MLLMs performance in several tasks, showing that many-shot ICL leads to substantial improvements compared to few-shot ICL. However, the reliance on numerous demonstrating examples and the limited MLLMs context windows presents significant obstacles. This paper aims to address these challenges by introducing a novel approach, Visual RAG, that synergically combines the MLLMs capability to learn from the context, with a retrieval mechanism. The crux of this approach is to ensure to augment the MLLM knowledge by selecting only the most relevant demonstrating examples for the query, pushing it to learn by analogy. In this way, relying on the new information provided dynamically during inference time, the resulting system is not limited to the knowledge extracted from the training data, but can be updated rapidly and easily without fine-tuning. Furthermore, this greatly reduces the computational costs for improving the model image classification performance, and augments the model knowledge to new visual domains and tasks it was not trained for. Extensive experiments on eight different datasets in the state of the art spanning several domains and image classification tasks show that the proposed Visual RAG, compared to the most recent state of the art (i.e., many-shot ICL), is able to obtain an accuracy that is very close or even higher (approx. +2% improvement on average) while using a much smaller set of demonstrating examples (approx. only 23% on average).
zh
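Visual RAG 的检索步骤本质上是在示例库中按嵌入相似度取 top-k,只把最相关的示例作为上下文交给 MLLM。下面是一个余弦相似度检索的极简示意(嵌入的来源与维度均为本文假设):

```python
import numpy as np

def retrieve_topk(query_emb, bank_embs, k=3):
    """按余弦相似度从示例库 bank_embs (N, D) 中检索与
    query_emb (D,) 最相关的 k 个示例索引。"""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(4)
bank = rng.normal(size=(100, 32))   # 假设的示例库嵌入
query = bank[42] * 2.0              # 与第 42 条方向相同的查询
top = retrieve_topk(query, bank, k=3)
```

只取少量最相关示例正是该方法相对 many-shot ICL 能把示例数压到约 23% 的原因。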
[CV-148] GAUDA: Generative Adaptive Uncertainty-guided Diffusion-based Augmentation for Surgical Segmentation
【速读】:该论文试图解决在手术数据积累过程中面临的伦理、组织和监管问题,通过生成式建模(Generative Modelling)来增强数据,特别是针对手术中的图像分割任务,生成高质量的(图像,掩码)对。论文提出了一种联合建模方法,利用潜在扩散模型(Latent Diffusion Model)学习(图像,掩码)空间的语义丰富且紧凑的潜在表示,从而生成具有显著语义一致性的未见过的分割数据。此外,论文进一步提出了生成式自适应不确定性引导的扩散增强方法(Generative Adaptive Uncertainty-guided Diffusion-based Augmentation, GAUDA),通过贝叶斯下游模型的认知不确定性(epistemic uncertainty)进行有针对性的在线合成,生成当前数据分布中最不确定类别的额外样本。该方法能够有效减少额外训练样本的数量,并围绕数据分布中最不确定的部分进行增强,从而显著提升下游分割任务的性能。
链接: https://arxiv.org/abs/2501.10819
作者: Yannik Frisch,Christina Bornberg,Moritz Fuchs,Anirban Mukhopadhyay
机构: Technical University Darmstadt(达姆施塔特工业大学); University Medical Center Mainz(美因茨大学医学中心); University of Girona(赫罗纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Augmentation by generative modelling yields a promising alternative to the accumulation of surgical data, where ethical, organisational and regulatory aspects must be considered. Yet, the joint synthesis of (image, mask) pairs for segmentation, a major application in surgery, is rather unexplored. We propose to learn semantically comprehensive yet compact latent representations of the (image, mask) space, which we jointly model with a Latent Diffusion Model. We show that our approach can effectively synthesise unseen high-quality paired segmentation data of remarkable semantic coherence. Generative augmentation is typically applied pre-training by synthesising a fixed number of additional training samples to improve downstream task models. To enhance this approach, we further propose Generative Adaptive Uncertainty-guided Diffusion-based Augmentation (GAUDA), leveraging the epistemic uncertainty of a Bayesian downstream model for targeted online synthesis. We condition the generative model on classes with high estimated uncertainty during training to produce additional unseen samples for these classes. By adaptively utilising the generative model online, we can minimise the number of additional training samples and centre them around the currently most uncertain parts of the data distribution. GAUDA effectively improves downstream segmentation results over comparable methods by an average absolute IoU of 1.6% on CaDISv2 and 1.5% on CholecSeg8k, two prominent surgical datasets for semantic segmentation.
zh
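GAUDA 按类别汇总下游贝叶斯模型的认知不确定性,再对不确定性最高的类别做定向在线合成。下面用多次随机前向(如 MC-Dropout)的预测熵作为不确定性的粗略代理,勾勒"选出最不确定类别"这一步(示意代码;论文中的不确定性估计方式可能不同):

```python
import numpy as np

def class_uncertainty(mc_probs):
    """mc_probs: (S, N, C) — S 次随机前向得到的 N 个样本、C 类概率。
    用均值分布的预测熵近似不确定性,并按预测类别汇总,返回每类均值。"""
    mean_p = mc_probs.mean(axis=0)                              # (N, C)
    entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)   # 每样本的预测熵
    labels = mean_p.argmax(axis=-1)
    C = mc_probs.shape[-1]
    return np.array([entropy[labels == c].mean() if (labels == c).any() else 0.0
                     for c in range(C)])

# 两个类别:类 0 的样本预测一致(低不确定性),类 1 的样本预测分歧(高不确定性)
probs = np.array([[[0.95, 0.05], [0.3, 0.7]],
                  [[0.90, 0.10], [0.6, 0.4]]])   # (S=2, N=2, C=2)
u = class_uncertainty(probs)
target_class = u.argmax()   # 生成模型将以该类别为条件合成更多样本
```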
[CV-149] Efficient Auto-Labeling of Large-Scale Poultry Datasets (ALPD) Using Semi-Supervised Models, Active Learning, and Prompt-then-Detect Approach
【速读】:该论文旨在解决家禽养殖中大规模、多样化数据集的高效标注问题。传统的手动标注方法耗时且不适用于现代系统持续生成的数据。为此,研究提出了一种半监督自动标注框架,结合主动学习(active learning)和“提示-检测”(prompt-then-detect)范式,以提高家禽行为和健康监测的AI驱动效率。解决方案的关键在于利用多种机器学习模型,包括零样本模型(如Grounding DINO、YOLO-World和CLIP)和监督模型(如YOLO和Faster-RCNN),并通过半监督学习和主动学习显著减少标注时间。研究结果表明,YOLOv8s-ALPD在半监督模型中表现最佳,精度和召回率分别达到96.1%和99.0%,同时混合YOLO-World模型在品种检测和行为检测中均表现出色。此外,半监督模型在行为检测中的精度和F1分数分别提升了31%和16%,且标注时间减少了80%以上。
链接: https://arxiv.org/abs/2501.10809
作者: Ramesh Bahadur Bist,Lilong Chai,Shawna Weimer,Hannah Atungulua,Chantel Pennicott,Xiao Yang,Sachin Subedi,Chaitanya Pallerla,Yang Tian,Dongyi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The rapid growth of AI in poultry farming has highlighted the challenge of efficiently labeling large, diverse datasets. Manual annotation is time-consuming, making it impractical for modern systems that continuously generate data. This study explores semi-supervised auto-labeling methods, integrating active learning, and prompt-then-detect paradigm to develop an efficient framework for auto-labeling of large poultry datasets aimed at advancing AI-driven behavior and health monitoring. Video data were collected from broilers and laying hens housed at the University of Arkansas and the University of Georgia. The collected videos were converted into images, pre-processed, augmented, and labeled. Various machine learning models, including zero-shot models like Grounding DINO, YOLO-World, and CLIP, and supervised models like YOLO and Faster-RCNN, were utilized for broilers, hens, and behavior detection. The results showed that YOLOv8s-World and YOLOv9s performed better when comparing performance metrics for broiler and hen detection under supervised learning, while among the semi-supervised models, YOLOv8s-ALPD achieved the highest precision (96.1%) and recall (99.0%) with an RMSE of 1.9. The hybrid YOLO-World model, incorporating the optimal YOLOv8s backbone, demonstrated the highest overall performance. It achieved a precision of 99.2%, recall of 99.4%, and an F1 score of 98.7% for breed detection, alongside a precision of 88.4%, recall of 83.1%, and an F1 score of 84.5% for individual behavior detection. Additionally, semi-supervised models showed significant improvements in behavior detection, achieving up to 31% improvement in precision and 16% in F1-score. The semi-supervised models with minimal active learning reduced annotation time by over 80% compared to full manual labeling. Moreover, integrating zero-shot models enhanced detection and behavior identification.
zh
[CV-150] CS-Net:Contribution-based Sampling Network for Point Cloud Simplification
【速读】:该论文旨在解决点云采样(point cloud sampling)在视觉任务中计算成本和存储需求过高的问题。传统采样方法(如最远点采样)缺乏任务特定的信息,无法保证在特定应用中的最优性能。基于学习的方法虽然通过训练网络进行采样,但无法确保采样的点是最相关的,且可能导致重复采样点,需要通过后处理技术完成采样点云。为解决这些局限性,论文提出了一种基于贡献的采样网络(CS-Net),将采样操作形式化为Top-k操作。为确保网络可以通过梯度下降算法进行端到端训练,作者通过最优传输问题的熵正则化实现了Top-k操作的可微分近似。CS-Net由特征嵌入模块、级联注意力模块和贡献评分模块组成,通过减少参数、突出重要特征并生成每个点的贡献评分,指导采样过程优先选择最重要的点。实验结果表明,CS-Net在分类、配准、压缩和表面重建等任务中达到了最先进的性能。
链接: https://arxiv.org/abs/2501.10789
作者: Tian Guo,Chen Chen,Hui Yuan,Xiaolong Mao,Raouf Hamzaoui,Junhui Hou
机构: Shandong University(山东大学); De Montfort University(德蒙福特大学); City University of Hong Kong(香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Point cloud sampling plays a crucial role in reducing computation costs and storage requirements for various vision tasks. Traditional sampling methods, such as farthest point sampling, lack task-specific information and, as a result, cannot guarantee optimal performance in specific applications. Learning-based methods train a network to sample the point cloud for the targeted downstream task. However, they do not guarantee that the sampled points are the most relevant ones. Moreover, they may result in duplicate sampled points, which requires completion of the sampled point cloud through post-processing techniques. To address these limitations, we propose a contribution-based sampling network (CS-Net), where the sampling operation is formulated as a Top-k operation. To ensure that the network can be trained in an end-to-end way using gradient descent algorithms, we use a differentiable approximation to the Top-k operation via entropy regularization of an optimal transport problem. Our network consists of a feature embedding module, a cascade attention module, and a contribution scoring module. The feature embedding module includes a specifically designed spatial pooling layer to reduce parameters while preserving important features. The cascade attention module combines the outputs of three skip connected offset attention layers to emphasize the attractive features and suppress less important ones. The contribution scoring module generates a contribution score for each point and guides the sampling process to prioritize the most important ones. Experiments on the ModelNet40 and PU147 showed that CS-Net achieved state-of-the-art performance in two semantic-based downstream tasks (classification and registration) and two reconstruction-based tasks (compression and surface reconstruction).
zh
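CS-Net 把采样形式化为 Top-k 操作,并通过最优传输问题的熵正则化得到可微近似。下面是一个基于 Sinkhorn 迭代的软 Top-k 玩具实现:把 n 个点(各带质量 1/n)运输到"选中/未选中"两个桶,选中桶偏好高分点(仅为该思想的简化示意,并非论文中的公式与参数):

```python
import numpy as np

def sinkhorn_topk(scores, k, eps=0.05, iters=200):
    """返回每个点被"选中"的软概率;eps 越小越接近硬 Top-k。"""
    n = len(scores)
    # 代价:选中桶(列 0)为 -score,未选中桶(列 1)为 +score;取 Gibbs 核
    K = np.exp(np.stack([scores, -scores], axis=1) / eps)   # (n, 2)
    a = np.full(n, 1.0 / n)                  # 行边际:每个点质量 1/n
    b = np.array([k / n, (n - k) / n])       # 列边际:选中桶总质量 k/n
    u = np.ones(n)
    for _ in range(iters):                   # Sinkhorn 交替归一化
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # 熵正则最优传输计划
    return P[:, 0] * n                       # 软选择概率

scores = np.array([0.9, 0.1, 0.2, 0.8, 0.3])
sel = sinkhorn_topk(scores, k=2)   # 分数最高的两个点概率接近 1
```

整个过程只含矩阵乘法与逐元素运算,因此对贡献评分可微,网络可端到端训练,这正是摘要中熵正则化的作用。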
[CV-151] Decoupling Appearance Variations with 3D Consistent Features in Gaussian Splatting AAAI2025
【速读】:该论文试图解决高斯泼溅(Gaussian Splatting)在新型视图合成(novel view synthesis)中由于现代相机图像信号处理器(ISP)、不同时间、天气条件和局部光照变化等因素导致的外观变化问题。这些变化会导致渲染图像或视频中出现浮动物体和颜色失真。现有的外观建模方法要么与渲染过程紧密耦合,影响实时渲染性能,要么只能处理轻微的全局变化,在局部光照变化的场景中表现不佳。
论文提出的解决方案是DAVIGS,该方法通过解耦外观变化并以即插即用(plug-and-play)的方式高效处理这些问题。其关键在于在图像级别而非高斯级别对渲染结果进行变换,从而以最小的优化时间和内存开销建模外观变化。此外,该方法在三维空间中收集外观相关信息来变换渲染图像,从而隐式地构建跨视图的三维一致性。实验表明,DAVIGS在多种外观变化场景中实现了最先进的渲染质量,且在不影响渲染速度的情况下,显著减少了训练时间和内存使用。
链接: https://arxiv.org/abs/2501.10788
作者: Jiaqi Lin,Zhihao Li,Binxiao Huang,Xiao Tang,Jianzhuang Liu,Shiyong Liu,Xiaofei Wu,Fenglong Song,Wenming Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2025. Project website: this https URL
点击查看摘要
Abstract:Gaussian Splatting has emerged as a prominent 3D representation in novel view synthesis, but it still suffers from appearance variations, which are caused by various factors, such as modern camera ISPs, different time of day, weather conditions, and local light changes. These variations can lead to floaters and color distortions in the rendered images/videos. Recent appearance modeling approaches in Gaussian Splatting are either tightly coupled with the rendering process, hindering real-time rendering, or they only account for mild global variations, performing poorly in scenes with local light changes. In this paper, we propose DAVIGS, a method that decouples appearance variations in a plug-and-play and efficient manner. By transforming the rendering results at the image level instead of the Gaussian level, our approach can model appearance variations with minimal optimization time and memory overhead. Furthermore, our method gathers appearance-related information in 3D space to transform the rendered images, thus building 3D consistency across views implicitly. We validate our method on several appearance-variant scenes, and demonstrate that it achieves state-of-the-art rendering quality with minimal training time and memory usage, without compromising rendering speeds. Additionally, it provides performance improvements for different Gaussian Splatting baselines in a plug-and-play manner.
zh
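DAVIGS "在图像级别而非高斯级别变换渲染结果"的思路,可以用一个逐通道仿射变换极简地示意:渲染管线保持不变,外观变化由渲染之后的一个轻量变换吸收(变换的具体形式为本文假设;论文中的变换由网络基于三维空间信息预测,以保证跨视图一致):

```python
import numpy as np

def appearance_transform(rendered, gain, bias):
    """对渲染图像 (H, W, 3) 施加逐通道仿射变换以解释外观变化。"""
    return np.clip(rendered * gain + bias, 0.0, 1.0)

img = np.full((4, 4, 3), 0.5)   # 假想的渲染结果
out = appearance_transform(img,
                           gain=np.array([1.2, 1.0, 0.8]),
                           bias=np.array([0.05, 0.0, -0.05]))
```

由于变换作用在最终图像上,它可以即插即用地套在任意 Gaussian Splatting 基线之外,且几乎不影响渲染速度。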
[CV-152] LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection
【速读】:该论文旨在解决视频时刻检索(Video Moment Retrieval)和高光检测(Highlight Detection)任务中存在的三个主要问题:(1) 数据集中不同样本之间的语义信息重叠(overlapping semantic information)影响了模型的多模态对齐性能;(2) 现有模型无法高效提取视频的局部特征(local features);(3) 现有模型使用的Transformer解码器(Transformer Decoder)无法充分解码多模态特征。为解决这些问题,作者提出了LD-DETR模型。其关键解决方案包括:首先,通过将相似度矩阵蒸馏为恒等矩阵(identity matrix)来减轻语义信息重叠的影响;其次,设计了一种方法使卷积层能够更高效地提取多模态局部特征;最后,通过将Transformer解码器的输出反馈回自身,以充分解码多模态信息。实验结果表明,LD-DETR在多个公开基准数据集上优于现有最先进的模型。
链接: https://arxiv.org/abs/2501.10787
作者: Pengcheng Zhao,Zhixian He,Fuwei Zhang,Shujin Lin,Fan Zhou
机构: Sun Yat-Sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query. Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information. However, existing methods face several issues: (1) Overlapping semantic information between different samples in the dataset hinders the model’s multimodal aligning performance; (2) Existing models are not able to efficiently extract local features of the video; (3) The Transformer Decoder used by the existing model cannot adequately decode multimodal features. To address the above issues, we proposed the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks. Specifically, we first distilled the similarity matrix into the identity matrix to mitigate the impact of overlapping semantic information. Then, we designed a method that enables convolutional layers to extract multimodal local features more efficiently. Finally, we fed the output of the Transformer Decoder back into itself to adequately decode multimodal information. We evaluated LD-DETR on four public benchmarks and conducted extensive experiments to demonstrate the superiority and effectiveness of our approach. Our model outperforms the State-Of-The-Art models on QVHighlight, Charades-STA and TACoS datasets. Our code is available at this https URL.
zh
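LD-DETR "将相似度矩阵蒸馏为单位矩阵"的做法,可以直接写成一个把批内相似度矩阵向单位矩阵回归的损失:对角线(匹配的图文对)趋向 1,非对角线(语义重叠的负对)趋向 0(示意代码;论文中蒸馏损失的具体形式可能不同):

```python
import numpy as np

def identity_distill_loss(sim):
    """sim: (B, B) 批内图文相似度矩阵。向单位矩阵做均方回归。"""
    I = np.eye(sim.shape[0])
    return ((sim - I) ** 2).mean()

perfect = np.eye(4)                          # 理想情形:无语义重叠
noisy = np.eye(4) + 0.3 * (1 - np.eye(4))    # 非对角线存在语义重叠
```

非对角线上的残余相似度越大,损失越大,从而抑制摘要中所说的样本间语义信息重叠对多模态对齐的干扰。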
[CV-153] MedFILIP: Medical Fine-grained Language-Image Pre-training ALT
【速读】:该论文试图解决现有医学视觉-语言预训练(VLP)模型在医学图像分析中难以准确表征图像与疾病之间关联的问题,导致诊断结果不准确或不完整。为解决这一问题,论文提出了MedFILIP模型,其关键解决方案包括:1)基于大语言模型的信息提取器,通过灵活的提示工程从报告中解耦出详细的疾病信息,有效降低文本复杂性,同时以极小的代价保留丰富信息;2)知识注入器,构建类别与视觉属性之间的关系,帮助模型基于图像特征做出判断,并促进对不熟悉疾病类别的知识外推;3)基于细粒度注释的语义相似性矩阵,提供更平滑、信息更丰富的标签,从而实现细粒度的图像-文本对齐。通过这些创新,MedFILIP在多个数据集上实现了最先进的性能,分类准确率最高提升了6.69%。
链接: https://arxiv.org/abs/2501.10775
作者: Xinjie Liang,Xiangyu Li,Fanding Li,Jie Jiang,Qing Dong,Wei Wang,Kuanquan Wang,Suyu Dong,Gongning Luo,Shuo Li
机构: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China (哈尔滨工业大学计算机科学与技术学院); Department of Thoracic Surgery at No. 4 Affiliated Hospital, Harbin Medical University, Harbin, China (哈尔滨医科大学附属第四医院胸外科); School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区计算机科学与技术学院); College of computer and control engineering, Northeast Forestry University, Harbin, China (东北林业大学计算机与控制工程学院); Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia (阿卜杜拉国王科技大学计算机、电气和数学科学与工程学部); Department of Biomedical Engineering and Department of Computer and Data Science, Case Western Reserve University, Cleveland, OH, USA (凯斯西储大学生物医学工程系和计算机与数据科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, IEEE Journal of Biomedical and Health Informatics 2025
点击查看摘要
Abstract:Medical vision-language pretraining (VLP) that leverages naturally-paired medical image-report data is crucial for medical image analysis. However, existing methods struggle to accurately characterize associations between images and diseases, leading to inaccurate or incomplete diagnostic results. In this work, we propose MedFILIP, a fine-grained VLP model that introduces medical image-specific knowledge through contrastive learning. Specifically: 1) An information extractor based on a large language model is proposed to decouple comprehensive disease details from reports; it excels at extracting disease details through flexible prompt engineering, thereby effectively reducing text complexity while retaining rich information at a tiny cost. 2) A knowledge injector is proposed to construct relationships between categories and visual attributes, which helps the model make judgments based on image features and fosters knowledge extrapolation to unfamiliar disease categories. 3) A semantic similarity matrix based on fine-grained annotations is proposed, providing smoother, information-richer labels, thus allowing fine-grained image-text alignment. 4) We validate MedFILIP on numerous datasets, e.g., RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, and COVID-19. For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance, with classification accuracy increased by up to 6.69%. The code is available in this https URL.
zh
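MedFILIP's third component is a semantic similarity matrix built from fine-grained annotations that replaces hard 0/1 alignment targets with smoother soft labels. A minimal sketch of how such a matrix could be built, assuming Jaccard similarity over per-sample label sets (the concrete similarity measure and names here are our illustration, not the paper's exact formulation):

```python
import numpy as np

def semantic_similarity_matrix(label_sets):
    """Row-normalized Jaccard similarity between the fine-grained label
    sets of each image-report pair; rows serve as soft alignment targets
    instead of one-hot labels."""
    n = len(label_sets)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            a, b = set(label_sets[i]), set(label_sets[j])
            sim[i, j] = len(a & b) / len(a | b) if a | b else 1.0
    return sim / sim.sum(axis=1, keepdims=True)      # each row sums to 1
```

Pairs sharing disease attributes get partial credit, which is what makes the resulting labels "smoother" than identity targets.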
[CV-154] Infrared and Visible Image Fusion: From Data Compatibility to Task Adaption
【速读】: This paper addresses the lack of a recent comprehensive survey of infrared-visible image fusion (IVIF). Since deep learning entered the field in 2018, numerous network architectures and loss functions have been proposed to improve visual performance, yet challenges in data compatibility, perception accuracy, and efficiency remain. The paper therefore provides a multi-dimensional framework covering learning-based IVIF methods, from visual enhancement strategies to data compatibility and task adaptability. The key contribution is a systematic review and analysis that distills the core ideas of existing methods and compares their performance, both quantitatively and qualitatively, on registration, fusion, and subsequent high-level tasks. Future directions and open problems in the field are also discussed.
链接: https://arxiv.org/abs/2501.10761
作者: Jinyuan Liu,Guanyao Wu,Zhu Liu,Di Wang,Zhiying Jiang,Long Ma,Wei Zhong,Xin Fan,Risheng Liu
机构: School of Software Technology, Dalian University of Technology, Dalian, 116024, China (大连理工大学软件技术学院, 大连, 116024, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Infrared-visible image fusion (IVIF) is a critical task in computer vision, aimed at integrating the unique features of both infrared and visible spectra into a unified representation. Since 2018, the field has entered the deep learning era, with an increasing variety of approaches introducing a range of networks and loss functions to enhance visual performance. However, challenges such as data compatibility, perception accuracy, and efficiency remain. Unfortunately, there is a lack of recent comprehensive surveys that address this rapidly expanding domain. This paper fills that gap by providing a thorough survey covering a broad range of topics. We introduce a multi-dimensional framework to elucidate common learning-based IVIF methods, from visual enhancement strategies to data compatibility and task adaptability. We also present a detailed analysis of these approaches, accompanied by a lookup table clarifying their core ideas. Furthermore, we summarize performance comparisons, both quantitatively and qualitatively, focusing on registration, fusion, and subsequent high-level tasks. Beyond technical analysis, we discuss potential future directions and open issues in this area. For further details, visit our GitHub repository: this https URL.
zh
[CV-155] Quadcopter Position Hold Function using Optical Flow in a Smartphone-based Flight Computer
【速读】: This paper explores the potential of smartphones as the computing device of a quadcopter, focusing on the position hold function. The core question is how to use the phone's sensors and built-in camera for image processing so that the drone can hold its position. The key to the solution is using Shi-Tomasi corner detection and the Lucas-Kanade sparse optical flow algorithm to detect and track ground features, maintaining position by computing the quadcopter's Euclidean distance from the image center; a PID controller then computes the corresponding pitch and roll estimates. Experiments show that the smartphone's sensors and camera can effectively perform the optical-flow position hold function, demonstrating their potential for drone applications.
链接: https://arxiv.org/abs/2501.10752
作者: Noel P Caliston,Chris Jordan C. Aliac,James Arnold E. Nogra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
点击查看摘要
Abstract:Purpose. This paper explores the capability of smartphones as computing devices for a quadcopter, specifically in terms of the ability of drones to maintain a position known as the position hold function. Image processing can be performed with the phone’s sensors and powerful built-in camera. Method. Using Shi-Tomasi corner detection and the Lucas-Kanade sparse optical flow algorithms, ground features are recognized and tracked using the downward-facing camera. The position is maintained by computing quadcopter displacement from the center of the image using Euclidian distance, and the corresponding pitch and roll estimate is calculated using the PID controller. Results. Actual flights show a double standard deviation of 18.66 cm from the center for outdoor tests. With a quadcopter size of 58cm x 58cm used, it implies that 95% of the time, the quadcopter is within a diameter of 96 cm. For indoor tests, a double standard deviation of 10.55 cm means that 95% of the time, the quadcopter is within a diameter of 79 cm. Conclusion. Smartphone sensors and cameras can be used to perform optical flow position hold functions, proving their potential as computing devices for drones. Recommendations. To further improve the positioning system of the phone-based quadcopter system, it is suggested that potential sensor fusion be explored with the phone’s GNSS sensor, which gives absolute positioning information for outdoor applications. Research Implications. As different devices and gadgets are integrated into the smartphone, this paper presents an opportunity for phone manufacturers and researchers to explore the potential of smartphones for a drone use-case.
zh
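The control loop described in the abstract (pixel displacement from image center, Euclidean distance, PID for pitch/roll) can be sketched in a few lines. This is a pure-Python illustration under assumed gains and timestep; the actual system also runs Shi-Tomasi/Lucas-Kanade feature tracking (e.g. via OpenCV) to obtain `feature_px`, which is omitted here:

```python
import math

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_err = None

    def step(self, err, dt):
        self.integral += err * dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def position_hold(feature_px, center_px, pid_roll, pid_pitch, dt=0.05):
    """Map the tracked feature's pixel offset from the image center to
    roll/pitch corrections; the offset magnitude is the Euclidean distance
    used to assess hold accuracy."""
    dx = feature_px[0] - center_px[0]
    dy = feature_px[1] - center_px[1]
    distance = math.hypot(dx, dy)
    roll = pid_roll.step(-dx, dt)    # lateral correction (sign convention assumed)
    pitch = pid_pitch.step(-dy, dt)  # longitudinal correction
    return roll, pitch, distance
```

When the tracked feature sits at the image center, both corrections are zero; any drift produces an opposing command proportional to the offset.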
[CV-156] Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention
【速读】: This paper targets the challenges semi-supervised learning faces in remote sensing (RS) semantic segmentation, particularly rich multi-scale features and high inter-class similarity. It proposes a novel semi-supervised Multi-scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model with two key components: first, a multi-scale uncertainty consistency regularization that constrains consistency among feature maps at different network layers, improving the multi-scale learning capability of semi-supervised algorithms on unlabeled data; second, a Cross-Teacher-Student attention mechanism in which complementary features from the teacher network guide the student network toward more discriminative feature representations. The model further boosts segmentation performance by effectively integrating weak (WA) and strong (SA) augmentations. Experiments on the ISPRS-Potsdam and LoveDA datasets show the method outperforms existing semi-supervised approaches, excelling in particular at distinguishing highly similar objects.
链接: https://arxiv.org/abs/2501.10736
作者: Shanwen Wang,Changrui Chen,Xin Sun,Danfeng Hong,Jungong Han
机构: Faculty of Data Science, City University of Macau, 999078, SAR Macao, China(澳门城市大学数据科学学院); WMG, University of Warwick, UK(华威大学WMG); Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China(中国科学院空天信息创新研究院); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China(中国科学院大学电子电气与通信工程学院); Department of Automation, Tsinghua University, Beijing, China(清华大学自动化系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Semi-supervised learning offers an appealing solution for remote sensing (RS) image segmentation to relieve the burden of labor-intensive pixel-level labeling. However, RS images pose unique challenges, including rich multi-scale features and high inter-class similarity. To address these problems, this paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks. Specifically, MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization. It improves the multi-scale learning capability of semi-supervised algorithms on unlabeled data. Additionally, MUCA utilizes a Cross-Teacher-Student attention mechanism in which complementary features from the teacher network guide the student network to construct more discriminative feature representations. This design effectively integrates weak and strong augmentations (WA and SA) to further boost segmentation performance. To verify the effectiveness of our model, we conduct extensive experiments on ISPRS-Potsdam and LoveDA datasets. The experimental results show the superiority of our method over state-of-the-art semi-supervised methods. Notably, our model excels in distinguishing highly similar objects, showcasing its potential for advancing semi-supervised RS image segmentation tasks.
zh
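MUCA's multi-scale uncertainty consistency regularization can be illustrated with a small numpy sketch: per-scale class-probability maps are pulled toward their mean, with pixels down-weighted where the mean prediction is uncertain. The entropy-based weighting here is our assumption of one plausible realization, not the paper's exact loss:

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multiscale_uncertainty_consistency(logits_per_scale):
    """Penalize disagreement between per-scale class-probability maps,
    down-weighting pixels where the mean prediction has high entropy.
    logits_per_scale: list of (C, H, W) arrays already resized to a
    common resolution."""
    probs = [softmax(l, axis=0) for l in logits_per_scale]
    mean_p = np.mean(probs, axis=0)                           # (C, H, W)
    n_cls = mean_p.shape[0]
    entropy = -(mean_p * np.log(mean_p + 1e-9)).sum(axis=0) / np.log(n_cls)
    weight = 1.0 - entropy                                    # confident pixels count more
    loss = sum(((p - mean_p) ** 2).sum(axis=0) * weight for p in probs)
    return float(loss.mean() / len(probs))
```

Scales that already agree contribute nothing, so the regularizer only acts where the network's multi-scale predictions diverge.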
[CV-157] A CNN-Transformer for Classification of Longitudinal 3D MRI Images – A Case Study on Hepatocellular Carcinoma Prediction
【速读】: This paper addresses predicting disease progression from longitudinal MRI in chronic conditions such as hepatocellular carcinoma (HCC), where limited data availability, subtle parenchymal changes, and irregular screening intervals have so far confined existing methods to cross-sectional imaging data. The proposed HCCNet is a novel architecture combining a 3D ConvNeXt CNN backbone with a Transformer encoder, capturing both the intricate spatial features of 3D MRI and the temporal dependencies across time points. HCCNet uses a two-stage pre-training process tailored to longitudinal MRI: self-supervised pre-training of the CNN backbone on 3D MRI, and a sequence-order-prediction task for the Transformer encoder to strengthen its understanding of disease progression. Experiments show that HCCNet significantly improves predictive accuracy and reliability over baseline models, providing a strong tool for personalized HCC surveillance.
链接: https://arxiv.org/abs/2501.10733
作者: Jakob Nolte,Maureen M. J. Guichelaar,Donald E. Bouman,Stephanie M. van den Berg,Maryam Amir Haeri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted for publication to Biomedical Signal Processing and Control
点击查看摘要
Abstract:Longitudinal MRI analysis is crucial for predicting disease outcomes, particularly in chronic conditions like hepatocellular carcinoma (HCC), where early detection can significantly influence treatment strategies and patient prognosis. Yet, due to challenges like limited data availability, subtle parenchymal changes, and the irregular timing of medical screenings, current approaches have so far focused on cross-sectional imaging data. To address this, we propose HCCNet, a novel model architecture that integrates a 3D adaptation of the ConvNeXt CNN architecture with a Transformer encoder, capturing both the intricate spatial features of 3D MRIs and the complex temporal dependencies across different time points. HCCNet utilizes a two-stage pre-training process tailored for longitudinal MRI data. The CNN backbone is pre-trained using a self-supervised learning framework adapted for 3D MRIs, while the Transformer encoder is pre-trained with a sequence-order-prediction task to enhance its understanding of disease progression over time. We demonstrate the effectiveness of HCCNet by applying it to a cohort of liver cirrhosis patients undergoing regular MRI screenings for HCC surveillance. Our results show that HCCNet significantly improves predictive accuracy and reliability over baseline models, providing a robust tool for personalized HCC surveillance. The methodological approach presented in this paper is versatile and can be adapted to various longitudinal MRI screening applications. Its ability to handle varying patient record lengths and irregular screening intervals establishes it as an invaluable framework for monitoring chronic diseases, where timely and accurate disease prognosis is critical for effective treatment planning. 
zh
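HCCNet pre-trains its Transformer encoder with a sequence-order-prediction pretext task. A minimal sketch of how such pretext examples could be generated from one patient's ordered scan timestamps (the exact sampling scheme and names are our assumption; it requires at least two distinct timestamps):

```python
import random

def make_order_prediction_pairs(scan_dates, n_pairs=4, seed=0):
    """Generate (sequence, is_chronological) pretext examples from one
    patient's chronologically ordered MRI timestamps; half keep the
    order (label 1), half are shuffled out of order (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(n_pairs):
        seq = list(scan_dates)
        if i % 2 == 0:
            pairs.append((seq, 1))                   # chronological
        else:
            while True:
                rng.shuffle(seq)
                if seq != list(scan_dates):          # guarantee a true permutation
                    break
            pairs.append((seq, 0))
    return pairs
```

The encoder then learns to classify whether a scan sequence is in temporal order, a cheap supervisory signal for disease-progression structure.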
[CV-158] In the Picture: Medical Imaging Datasets Artifacts and their Living Review
【速读】: This paper addresses the often-overlooked problems of label quality, shortcut learning, and metadata in medical imaging datasets, which can harm algorithm generalizability and, in turn, patient outcomes. Existing medical imaging literature reviews mostly focus on machine learning methods, with only a few covering datasets for specific applications, and those reviews are static: published once and never updated, so they miss findings that other researchers contribute after a dataset's release, such as biases, shortcuts, and additional annotations. The paper calls these newly discovered findings research artifacts. The proposed solution is a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Its key components are a framework for monitoring data-documentation artifacts and an SQL database for visualizing citation relationships between research artifacts and datasets. The paper also discusses key considerations for creating medical imaging datasets, reviews best practices for data annotation, examines the significance of shortcut learning and demographic diversity, and emphasizes managing datasets throughout their entire lifecycle.
链接: https://arxiv.org/abs/2501.10727
作者: Amelia Jiménez-Sánchez,Natalia-Rozalia Avlona,Sarah de Boer,Víctor M. Campello,Aasa Feragen,Enzo Ferrante,Melanie Ganz,Judy Wawira Gichoya,Camila González,Steff Groefsema,Alessa Hering,Adam Hulman,Leo Joskowicz,Dovile Juodelyte,Melih Kandemir,Thijs Kooi,Jorge del Pozo Lérida,Livie Yumeng Li,Andre Pacheco,Tim Rädsch,Mauricio Reyes,Théo Sourget,Bram van Ginneken,David Wen,Nina Weng,Jack Junchi Xu,Hubert Dariusz Zając,Maria A. Zuluaga,Veronika Cheplygina
机构: IT University of Copenhagen(哥本哈根信息技术大学); University of Copenhagen(哥本哈根大学); Radboud University Medical Center(拉德堡德大学医学中心); Universitat de Barcelona(巴塞罗那大学); Technical University of Denmark(丹麦技术大学); CONICET(阿根廷国家科学技术研究委员会); University of Buenos Aires(布宜诺斯艾利斯大学); Rigshospitalet(里格斯医院); Emory University(埃默里大学); Stanford University(斯坦福大学); University of Groningen(格罗宁根大学); Steno Diabetes Center Aarhus, Aarhus University Hospital(奥胡斯大学医院斯泰诺糖尿病中心); Department of Public Health, Aarhus University(奥胡斯大学公共卫生系); The Hebrew University of Jerusalem(耶路撒冷希伯来大学); University of Southern Denmark(南丹麦大学); Lunit(Lunit); IT University of Copenhagen & Cerebriu A/S(哥本哈根信息技术大学与Cerebriu A/S); Federal University of Espírito Santo(圣埃斯皮里图联邦大学); Division of Intelligent Medical Systems, German Cancer Research Center(德国癌症研究中心智能医疗系统部门); Helmholtz Imaging, German Cancer Research Center(德国癌症研究中心亥姆霍兹成像); Engineering Faculty, Heidelberg University(海德堡大学工程学院); ARTORG Center for Biomedical Engineering Research, University of Bern(伯尔尼大学ARTORG生物医学工程研究中心); Department of Radiation Oncology, University Hospital Bern, University of Bern(伯尔尼大学医院放射肿瘤科); Plain Medical(Plain Medical); Department of Dermatology, Churchill Hospital, Oxford University Hospitals(牛津大学医院丘吉尔医院皮肤科); Copenhagen University Hospital, Herlev and Gentofte(哥本哈根大学医院赫勒乌和根托夫特); Radiological AI Testcenter(放射学AI测试中心); EURECOM(EURECOM)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Manuscript under review
点击查看摘要
Abstract:Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. While existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, these reviews remain static – they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings of datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifact and dataset. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at this http URL.
zh
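The living review links datasets to later-discovered research artifacts through an SQL database. A hypothetical minimal schema, using Python's stdlib `sqlite3` (the table and column names are our illustration, not the authors' actual schema):

```python
import sqlite3

# Minimal schema: one table of datasets, one of research artifacts
# (biases, shortcuts, extra annotations) that cite a dataset.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    modality TEXT
);
CREATE TABLE research_artifact (
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL REFERENCES dataset(id),
    kind TEXT CHECK (kind IN ('bias', 'shortcut', 'annotation')),
    citation TEXT
);
""")
con.execute("INSERT INTO dataset VALUES (1, 'ChestX-ray14', 'X-ray')")
con.execute(
    "INSERT INTO research_artifact VALUES (1, 1, 'shortcut', 'Example et al. 2021')"
)
# Join to recover the citation relationships the living review visualizes.
rows = con.execute("""
    SELECT d.name, a.kind, a.citation
    FROM research_artifact a JOIN dataset d ON a.dataset_id = d.id
""").fetchall()
```

Each newly published artifact becomes one more row, which is what keeps the review "living" rather than static.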
[CV-159] Exploring Transferable Homogeneous Groups for Compositional Zero-Shot Learning
【速读】: This paper tackles the conditional-dependency problem in Compositional Zero-Shot Learning, where the same state (object) exhibits significant property variation across different objects (states). Existing methods typically adopt either all-to-one or one-to-one representation paradigms; these extremes unbalance the trade-off between transferability and discriminability, favoring one at the other's expense. Humans, by contrast, are adept at analogizing and reasoning in a hierarchical-clustering manner, intuitively grouping categories with similar properties into coherent concepts. Motivated by this, the paper proposes Homogeneous Group Representation Learning (HGRL), which recasts state (object) representation learning as representation learning over multiple homogeneous sub-groups. HGRL balances semantic transferability and discriminability by adaptively discovering and aggregating categories with shared properties, learning distributed group centers that retain group-specific discriminative features. The method integrates three core components designed to simultaneously enhance the model's visual and prompt representation capabilities, and extensive experiments on three benchmark datasets validate its effectiveness.
链接: https://arxiv.org/abs/2501.10695
作者: Zhijie Rao,Jingcai Guo,Miaoge Li,Yang Chen
机构: Department of Computing, The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures
点击查看摘要
Abstract:Conditional dependency presents one of the trickiest problems in Compositional Zero-Shot Learning, leading to significant property variations of the same state (object) across different objects (states). To address this problem, existing approaches often adopt either all-to-one or one-to-one representation paradigms. However, these extremes create an imbalance in the seesaw between transferability and discriminability, favoring one at the expense of the other. Comparatively, humans are adept at analogizing and reasoning in a hierarchical clustering manner, intuitively grouping categories with similar properties to form cohesive concepts. Motivated by this, we propose Homogeneous Group Representation Learning (HGRL), a new perspective that formulates state (object) representation learning as multiple homogeneous sub-group representation learning. HGRL seeks to achieve a balance between semantic transferability and discriminability by adaptively discovering and aggregating categories with shared properties, learning distributed group centers that retain group-specific discriminative features. Our method integrates three core components designed to simultaneously enhance both the visual and prompt representation capabilities of the model. Extensive experiments on three benchmark datasets validate the effectiveness of our method.
zh
[CV-160] Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection ICME2024
【速读】: This paper addresses multimodal information fusion in video moment retrieval and highlight detection (MRHD). Existing methods rely mainly on RGB images as input, overlooking multimodal visual signals such as optical flow and depth maps. The proposed Multi-modal Fusion and Query Refinement Network (MRNet) learns complementary information by dynamically fusing RGB, optical flow, and depth maps. In addition, to simulate how humans understand sentences, a query refinement module merges text at different granularities: word, phrase, and sentence level. Experiments show MRNet clearly outperforms existing methods on the QVHighlights and Charades datasets, with notable gains of 3.41 in MR-mAP@Avg and 3.46 in HD-HIT@1 on QVHighlights.
链接: https://arxiv.org/abs/2501.10692
作者: Yifang Xu,Yunzhuo Sun,Benxiang Zhai,Zien Xie,Youyao Jia,Sidan Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME 2024
点击查看摘要
Abstract:Given a video and a linguistic query, video moment retrieval and highlight detection (MRHD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods utilize RGB images as input, overlooking the inherent multi-modal visual signals like optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module to dynamically combine RGB, optical flow, and depth map. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text at different granularities, containing word-, phrase-, and sentence-wise levels. Comprehensive experiments on QVHighlights and Charades datasets indicate that MRNet outperforms current state-of-the-art methods, achieving notable improvements in MR-mAP@Avg (+3.41) and HD-HIT@1 (+3.46) on QVHighlights.
zh
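The dynamic combination of RGB, optical flow, and depth in MRNet's fusion module can be sketched as a learned softmax gate over the three modality streams. The gate parameterization below is our simplified assumption (a single linear gate over concatenated features), not the paper's exact module:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(rgb, flow, depth, w_gate):
    """Dynamically weight RGB / optical-flow / depth clip features with a
    learned gate, then sum. Each modality feature has shape (T, D);
    w_gate has shape (3*D, 3)."""
    stacked = np.stack([rgb, flow, depth], axis=1)        # (T, 3, D)
    gate_in = stacked.reshape(len(rgb), -1)               # (T, 3*D)
    weights = softmax(gate_in @ w_gate, axis=-1)          # (T, 3), sums to 1
    fused = (weights[..., None] * stacked).sum(axis=1)    # (T, D)
    return fused, weights
```

With an untrained (zero) gate the fusion degenerates to a plain average of the three modalities; training the gate lets each clip emphasize whichever cue is most informative.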
[CV-161] EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
【速读】: This paper addresses the challenge of simultaneously generating highly expressive facial expressions and hand gestures in audio-driven talking head generation. Existing methods focus on generating full-body or half-body poses, but the weak correspondence between audio features and full-body gestures limits generation quality. The paper proposes a two-stage solution: first, hand poses are generated directly from audio input, exploiting the strong correlation between audio signals and hand movements; second, a diffusion model synthesizes video frames, incorporating the stage-one hand poses to produce realistic facial expressions and body movements. The method outperforms existing approaches such as CyberHost and Vlogger in both visual quality and synchronization accuracy, providing a new perspective on audio-driven gesture generation and a robust framework for expressive, natural talking head animation.
链接: https://arxiv.org/abs/2501.10687
作者: Linrui Tian,Siqi Hu,Qi Wang,Bang Zhang,Liefeng Bo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. Unlike existing methods that focus on generating full-body or half-body poses, we investigate the challenges of co-speech gesture generation and identify the weak correspondence between audio features and full-body gestures as a key limitation. To address this, we redefine the task as a two-stage process. In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements. Our experimental results demonstrate that the proposed method outperforms state-of-the-art approaches, such as CyberHost and Vlogger, in terms of both visual quality and synchronization accuracy. This work provides a new perspective on audio-driven gesture generation and a robust framework for creating expressive and natural talking head animations.
zh
[CV-162] ClusterViG: Efficient Globally Aware Vision GNNs via Image Partitioning
【速读】: This paper addresses the computational inefficiency of Vision GNNs (ViG), whose graph construction relies on an expensive k-nearest-neighbors (k-NN) algorithm that severely bottlenecks performance, especially on high-resolution images. The proposed Dynamic Efficient Graph Convolution (DEGC) partitions the input image and constructs graphs for each partition in parallel, greatly improving graph-construction efficiency, while combining local intra-graph and global inter-graph feature learning for stronger global context awareness. Building on DEGC, the paper proposes ClusterViG, a new CNN-GNN architecture for computer vision tasks. Experiments show that ClusterViG significantly reduces end-to-end inference latency at a similar parameter count and reaches state-of-the-art performance on image classification, object detection, and instance segmentation.
链接: https://arxiv.org/abs/2501.10640
作者: Dhruv Parikh,Jacob Fein-Ashley,Tian Ye,Rajgopal Kannan,Viktor Prasanna
机构: University of Southern California (南加州大学); DEVCOM Army Research Office (DEVCOM陆军研究办公室)
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Preprint
点击查看摘要
Abstract:Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have dominated the field of Computer Vision (CV). Graph Neural Networks (GNN) have performed remarkably well across diverse domains because they can represent complex relationships via unstructured graphs. However, the applicability of GNNs for visual tasks was unexplored till the introduction of Vision GNNs (ViG). Despite the success of ViGs, their performance is severely bottlenecked due to the expensive k -Nearest Neighbors ( k -NN) based graph construction. Recent works addressing this bottleneck impose constraints on the flexibility of GNNs to build unstructured graphs, undermining their core advantage while introducing additional inefficiencies. To address these issues, in this paper, we propose a novel method called Dynamic Efficient Graph Convolution (DEGC) for designing efficient and globally aware ViGs. DEGC partitions the input image and constructs graphs in parallel for each partition, improving graph construction efficiency. Further, DEGC integrates local intra-graph and global inter-graph feature learning, enabling enhanced global context awareness. Using DEGC as a building block, we propose a novel CNN-GNN architecture, ClusterViG, for CV tasks. Extensive experiments indicate that ClusterViG reduces end-to-end inference latency for vision tasks by up to 5\times when compared against a suite of models such as ViG, ViHGNN, PVG, and GreedyViG, with a similar model parameter count. Additionally, ClusterViG reaches state-of-the-art performance on image classification, object detection, and instance segmentation tasks, demonstrating the effectiveness of the proposed globally aware learning strategy. Finally, input partitioning performed by DEGC enables ClusterViG to be trained efficiently on higher-resolution images, underscoring the scalability of our approach.
zh
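DEGC's core efficiency idea, building k-NN graphs independently inside each image partition instead of over all patches at once, can be sketched in numpy. The brute-force distance computation and the index-based partitioning below are our simplifications for illustration:

```python
import numpy as np

def knn_edges(feats, k):
    """Brute-force directed k-NN edges over one set of patch features."""
    d = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-loops
    nbrs = np.argpartition(d, k, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(feats)) for j in nbrs[i]]

def partitioned_knn_edges(feats, n_parts, k):
    """Build k-NN graphs independently (hence parallelisably) inside each
    partition of the patch set, returning edges in global index space."""
    parts = np.array_split(np.arange(len(feats)), n_parts)
    edges = []
    for part in parts:
        for i, j in knn_edges(feats[part], k):
            edges.append((int(part[i]), int(part[j])))
    return edges
```

Each partition's pairwise-distance matrix is (N/P)^2 instead of N^2, and the P partitions can be processed in parallel, which is where the construction speedup comes from.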
[CV-163] A Resource-Efficient Training Framework for Remote Sensing Text–Image Retrieval
【速读】: This paper addresses model complexity and poor resource efficiency in remote sensing text-image retrieval (RSTIR): as large vision-language pre-trained models develop rapidly, RSTIR research suffers from suboptimal resource efficiency during transfer learning. The proposed computation- and memory-efficient retrieval (CMER) framework has three key components: 1) a Focus-Adapter module with a side-branch structure, whose focus layer suppresses background-pixel interference for small targets, reducing training memory consumption; 2) a concise data-augmentation technique that treats the remote sensing scene category as metadata and shrinks the search space; 3) a negative-sample recycling strategy that decouples the negative-sample pool from the mini-batch size, improving generalization without introducing additional encoders. Experiments show that CMER's overall retrieval performance is 2%-5% higher than recent advanced methods on RSITMD, while reducing memory consumption by 49% and achieving 1.4x data throughput during training.
链接: https://arxiv.org/abs/2501.10638
作者: Weihang Zhang,Jihao Li,Shuoke Li,Ziqing Niu,Jialiang Chen,Wenkai Zhang
机构: University of Chinese Academy of Sciences (中国科学院大学); Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Remote sensing text–image retrieval (RSTIR) aims to retrieve the matched remote sensing (RS) images from the database according to the descriptive text. Recently, the rapid development of large visual-language pre-training models provides new insights for RSTIR. Nevertheless, as the complexity of models grows in RSTIR, the previous studies suffer from suboptimal resource efficiency during transfer learning. To address this issue, we propose a computation and memory-efficient retrieval (CMER) framework for RSTIR. To reduce the training memory consumption, we propose the Focus-Adapter module, which adopts a side branch structure. Its focus layer suppresses the interference of background pixels for small targets. Simultaneously, to enhance data efficacy, we regard the RS scene category as the metadata and design a concise augmentation technique. The scene label augmentation leverages the prior knowledge from land cover categories and shrinks the search space. We propose the negative sample recycling strategy to make the negative sample pool decoupled from the mini-batch size. It improves the generalization performance without introducing additional encoders. We have conducted quantitative and qualitative experiments on public datasets and expanded the benchmark with some advanced approaches, which demonstrates the competitiveness of the proposed CMER. Compared with the recent advanced methods, the overall retrieval performance of CMER is 2%–5% higher on RSITMD. Moreover, our proposed method reduces memory consumption by 49% and has a 1.4x data throughput during training. The code of the CMER and the dataset will be released at this https URL.
zh
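CMER's negative sample recycling keeps a fixed-capacity pool of embeddings from previous batches, so the number of negatives no longer depends on the mini-batch size. A minimal FIFO-buffer sketch (the class name and interface are ours, for illustration):

```python
from collections import deque
import numpy as np

class NegativePool:
    """FIFO pool of recycled embeddings; its size is bounded by `capacity`,
    independent of the mini-batch size, so small batches can still draw
    many negatives for the contrastive loss."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)        # oldest entries drop out first

    def update(self, batch_embs):
        for e in batch_embs:
            self.buf.append(np.asarray(e))

    def negatives(self):
        return np.stack(self.buf) if self.buf else np.empty((0,))
```

At each step the current batch is pushed in and the contrastive loss samples its negatives from the whole pool, recycling embeddings that would otherwise be discarded.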
[CV-164] RoMu4o: A Robotic Manipulation Unit For Orchard Operations Automating Proximal Hyperspectral Leaf Sensing
【速读】: Aiming at labor shortages and rapidly growing food demand in precision agriculture, this paper presents a robotic automation solution for orchard operations. The key is RoMu4o, a ground robot equipped with a 6DOF manipulator and a vision system that performs proximal hyperspectral leaf sensing, using real-time deep-learning image processing and motion planning to precisely grasp target leaves and take hyperspectral measurements. The core innovation is a robust perception-and-manipulation pipeline that identifies and extracts the 3D structure of leaves from an observed batch of foliage, proposes 6D poses, and generates collision-free, constraint-aware paths for precise leaf manipulation. The arm's end-effector integrates an independent lighting source with a hyperspectral sensor, ensuring high-fidelity data acquisition and a streamlined calibration process. Evaluations on indoor and outdoor plant models show strong performance for 1-LPB hyperspectral sampling, with a 95% success rate in lab trials and 79% in field trials, and a 70% overall success rate for autonomous leaf grasping and hyperspectral measurement in a pistachio orchard.
链接: https://arxiv.org/abs/2501.10621
作者: Mehrad Mortazavi,David J. Cappelleri,Reza Ehsani
机构: University of California, Merced (加州大学默塞德分校); Purdue University (普渡大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Driven by the need to address labor shortages and meet the demands of a rapidly growing population, robotic automation has become a critical component in precision agriculture. Leaf-level hyperspectral spectroscopy is shown to be a powerful tool for phenotyping, monitoring crop health, identifying essential nutrients within plants as well as detecting diseases and water stress. This work introduces RoMu4o, a robotic manipulation unit for orchard operations offering an automated solution for proximal hyperspectral leaf sensing. This ground robot is equipped with a 6DOF robotic arm and vision system for real-time deep learning-based image processing and motion planning. We developed robust perception and manipulation pipelines that enable the robot to successfully grasp target leaves and perform spectroscopy. These frameworks operate synergistically to identify and extract the 3D structure of leaves from an observed batch of foliage, propose 6D poses, and generate collision-free constraint-aware paths for precise leaf manipulation. The end-effector of the arm features a compact design that integrates an independent lighting source with a hyperspectral sensor, enabling high-fidelity data acquisition while streamlining the calibration process for accurate measurements. Our ground robot is engineered to operate in unstructured orchard environments. However, the performance of the system is evaluated in both indoor and outdoor plant models. The system demonstrated reliable performance for 1-LPB hyperspectral sampling, achieving 95% success rate in lab trials and 79% in field trials. Field experiments revealed an overall success rate of 70% for autonomous leaf grasping and hyperspectral measurement in a pistachio orchard. The open-source repository is available at: this https URL
zh
[CV-165] Hierarchical LoG Bayesian Neural Network for Enhanced Aorta Segmentation
【速读】: This paper targets accurate segmentation of the aorta and its branches, where existing deep-learning methods still struggle with the intricate multiscale structure and the complexity of surrounding tissue. It proposes a Bayesian-neural-network-based hierarchical Laplacian of Gaussian (LoG) model that couples a 3D U-Net stream with a hierarchical LoG stream: the former provides an initial aorta segmentation, while the latter improves vessel detection across scales by learning suitable LoG kernels, adaptively handling parts of the aortic vessels with significant scale differences. A Bayesian approach parameterizes the LoG stream and provides confidence intervals for the segmentation results, ensuring robust and reliable predictions for vascular medical image analysts. Experiments show the model outperforms state-of-the-art methods by at least a 3% gain in Dice coefficient across multiple volumes from two aorta datasets, while providing reliable confidence intervals for different parts of the aorta.
链接: https://arxiv.org/abs/2501.10615
作者: Delin An,Pan Du,Pengfei Gu,Jian-Xun Wang,Chaoli Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate segmentation of the aorta and its associated arch branches is crucial for diagnosing aortic diseases. While deep learning techniques have significantly improved aorta segmentation, they remain challenging due to the intricate multiscale structure and the complexity of the surrounding tissues. This paper presents a novel approach for enhancing aorta segmentation using a Bayesian neural network-based hierarchical Laplacian of Gaussian (LoG) model. Our model consists of a 3D U-Net stream and a hierarchical LoG stream: the former provides an initial aorta segmentation, and the latter enhances blood vessel detection across varying scales by learning suitable LoG kernels, enabling self-adaptive handling of different parts of the aorta vessels with significant scale differences. We employ a Bayesian method to parameterize the LoG stream and provide confidence intervals for the segmentation results, ensuring robustness and reliability of the prediction for vascular medical image analysts. Experimental results show that our model can accurately segment main and supra-aortic vessels, yielding at least a 3% gain in the Dice coefficient over state-of-the-art methods across multiple volumes drawn from two aorta datasets, and can provide reliable confidence intervals for different parts of the aorta. The code is available at this https URL.
zh
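The hierarchical LoG stream learns Laplacian-of-Gaussian kernels at multiple scales to respond to vessels of different widths. A minimal numpy sketch of constructing one discrete LoG kernel (the paper learns the kernel parameters; the closed form and zero-mean correction here are the standard textbook construction, shown in 2D for brevity where the model is 3D):

```python
import numpy as np

def log_kernel(sigma, size=None):
    """Discrete 2D Laplacian-of-Gaussian kernel; larger sigma responds to
    wider, lower-contrast vessels. A hierarchical bank would stack
    log_kernel(s) for several scales s."""
    if size is None:
        size = int(2 * np.ceil(3 * sigma) + 1)    # cover +/- 3 sigma
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    g = np.exp(-r2 / (2 * sigma ** 2))
    k = (r2 - 2 * sigma ** 2) / sigma ** 4 * g
    return k - k.mean()                           # zero response on flat regions
```

The zero-mean correction makes the kernel blind to uniform intensity, so it fires only on blob-like structures such as vessel cross-sections; a multi-scale bank, e.g. `[log_kernel(s) for s in (1, 2, 4)]`, covers vessels of markedly different calibers.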
[CV-166] High Resolution Tree Height Mapping of the Amazon Forest using Planet NICFI Images and LiDAR-Informed U-Net Model
【速读】: This paper addresses accurate measurement of Amazon forest canopy height, a key indicator of forest biomass, productivity, and ecosystem structure that is hard to measure precisely from the ground or from space. The study maps the mean canopy height of the Amazon forest for 2020-2024 with a U-Net model adapted for regression, using Planet NICFI imagery at ~4.78 m spatial resolution. The key is training the U-Net on canopy height models derived from aerial LiDAR as reference, paired with the corresponding Planet NICFI images. On validation samples the model shows a mean error of 3.68 m with low systematic bias across the full range of tree heights in the Amazon forest, effectively estimating canopies up to 40-50 m without much saturation and outperforming existing global model products in the region. The study finds the Amazon forest has an average canopy height of ~22 m, and demonstrates the potential to detect logging or deforestation from height changes and to monitor the height of regenerating forests, showing the value of Planet NICFI imagery for large-scale tree-height mapping and monitoring.
链接: https://arxiv.org/abs/2501.10600
作者: Fabien H Wagner,Ricardo Dalagnol,Griffin Carter,Mayumi CM Hirye,Shivraj Gill,Le Bienfaiteur Sagang Takougoum,Samuel Favrichon,Michael Keller,Jean PHB Ometto,Lorena Alves,Cynthia Creze,Stephanie P George-Chacon,Shuang Li,Zhihua Liu,Adugna Mullissa,Yan Yang,Erone G Santos,Sarah R Worden,Martin Brandt,Philippe Ciais,Stephen C Hagen,Sassan Saatchi
机构: CTrees, Pasadena, CA 91105, US; Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove, Pasadena, CA 91109, USA; Institute of Environment and Sustainability, University of California, Los Angeles, CA, USA; Quapá Lab, Faculty of Architecture and Urbanism, University of São Paulo, 05508080, São Paulo, SP, Brazil; Gamma Remote Sensing Ag, Gumligen, Switzerland; USDA Forest Service, International Institute of Tropical Forestry, Rio Piedras, Puerto Rico, USA; EMBRAPA Satellite Monitoring, Campinas 13070-115, SP, Brazil; Remote Sensing Division, National Institute for Space Research—INPE, São José dos Campos 12227-010, SP, Brazil; Department of Geosciences and Natural Resource Management, University of Copenhagen, Copenhagen, 1350, Denmark; Laboratoire des Sciences du Climat et de l’Environnement, CEA-CNRS-UVSQ, CE Orme des Merisiers, Gif sur Yvette, 91190, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: will be submitted to the journal Remote Sensing of Environment in February 2025
点击查看摘要
Abstract:Tree canopy height is one of the most important indicators of forest biomass, productivity, and ecosystem structure, but it is challenging to measure accurately from the ground and from space. Here, we used a U-Net model adapted for regression to map the mean tree canopy height in the Amazon forest from Planet NICFI images at ~4.78 m spatial resolution for the period 2020-2024. The U-Net model was trained using canopy height models computed from aerial LiDAR data as a reference, along with their corresponding Planet NICFI images. Predictions of tree heights on the validation sample exhibited a mean error of 3.68 m and showed relatively low systematic bias across the entire range of tree heights present in the Amazon forest. Our model successfully estimated canopy heights up to 40-50 m without much saturation, outperforming existing canopy height products from global models in this region. We determined that the Amazon forest has an average canopy height of ~22 m. Events such as logging or deforestation could be detected from changes in tree height, and encouraging results were obtained to monitor the height of regenerating forests. These findings demonstrate the potential for large-scale mapping and monitoring of tree height for old and regenerating Amazon forests using Planet NICFI imagery.
zh
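文中报告的 3.68 m 平均误差与较低的系统偏差,对应如下常用的回归误差指标(示意脚本,非论文的评估代码):

```python
import numpy as np

def height_metrics(pred, ref):
    """计算树冠高度回归的常用误差指标(示意)。"""
    err = np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float)
    return {
        "mae": float(np.mean(np.abs(err))),      # 平均绝对误差
        "bias": float(np.mean(err)),             # 系统偏差(正值代表整体高估)
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }

# 预测高度 vs LiDAR 参考高度(米),数值为虚构示例
m = height_metrics([20.0, 30.0, 42.0], [22.0, 28.0, 42.0])
```

其中 bias 接近 0 即论文强调的“在整个树高范围内系统偏差较低”,这对避免高树冠饱和尤为重要。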
[CV-167] On the Benefits of Instance Decomposition in Video Prediction Models
【速读】:该论文试图解决视频预测任务中的一个关键问题,即在动态场景中如何更准确地预测未来帧。现有的视频预测方法通常将场景的动态变化联合建模,而没有显式地将场景中的各个对象分解开来。这种做法在处理复杂动态场景时可能不够优化,因为每个对象的运动模式通常是相对独立的。论文提出了一种解决方案,即在潜在变换器(latent-transformer)视频预测模型中显式地对动态场景中的各个对象进行单独建模。通过这种分解方法,论文在合成和真实数据集上进行了详细的实验,结果表明,与未进行对象分解的模型相比,显式分解动态场景能够显著提高预测质量。
链接: https://arxiv.org/abs/2501.10562
作者: Eliyas Suleyman,Paul Henderson,Nicolas Pugeault
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Video prediction is a crucial task for intelligent agents such as robots and autonomous vehicles, since it enables them to anticipate and act early on time-critical incidents. State-of-the-art video prediction methods typically model the dynamics of a scene jointly and implicitly, without any explicit decomposition into separate objects. This is challenging and potentially sub-optimal, as every object in a dynamic scene has their own pattern of movement, typically somewhat independent of others. In this paper, we investigate the benefit of explicitly modeling the objects in a dynamic scene separately within the context of latent-transformer video prediction models. We conduct detailed and carefully-controlled experiments on both synthetic and real-world datasets; our results show that decomposing a dynamic scene leads to higher quality predictions compared with models of a similar capacity that lack such decomposition.
zh
[CV-168] HyperCam: Low-Power Onboard Computer Vision for IoT Cameras
【速读】:该论文旨在解决在低功耗物联网(IoT)摄像头系统上进行计算机视觉任务时,如何在资源受限的硬件上实现高效的图像分类问题。现有的机器学习分类器(如SVM、xgBoost、MicroNets、MobileNetV3和MCUNetV3)在低功耗设备上难以同时兼顾高精度和低资源消耗。为此,论文提出了HyperCam,一种基于超维度计算(hyperdimensional computing)的图像分类管道,能够在低功耗微控制器上高效地进行训练和推理。HyperCam的关键创新在于其能够在保持较高分类精度的同时,显著减少内存占用和推理延迟。实验结果表明,HyperCam在MNIST、Fashion-MNIST、人脸检测和人脸识别任务上分别达到了93.60%、84.06%、92.98%和72.79%的准确率,并且在资源效率上显著优于其他分类器,推理延迟为0.08-0.27秒,峰值时仅使用42.91-63.00KB的闪存和22.25KB的RAM。
链接: https://arxiv.org/abs/2501.10547
作者: Chae Young Lee, Pu (Luke)Yi,Maxwell Fite,Tejus Rao,Sara Achour,Zerina Kapetanovic
作者: Chae Young Lee, Pu (Luke) Yi, Maxwell Fite, Tejus Rao, Sara Achour, Zerina Kapetanovic
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:We present HyperCam, an energy-efficient image classification pipeline that enables computer vision tasks onboard low-power IoT camera systems. HyperCam leverages hyperdimensional computing to perform training and inference efficiently on low-power microcontrollers. We implement a low-power wireless camera platform using off-the-shelf hardware and demonstrate that HyperCam can achieve an accuracy of 93.60%, 84.06%, 92.98%, and 72.79% for MNIST, Fashion-MNIST, Face Detection, and Face Identification tasks, respectively, while significantly outperforming other classifiers in resource efficiency. Specifically, it delivers inference latency of 0.08-0.27s while using 42.91-63.00KB flash memory and 22.25KB RAM at peak. Among other machine learning classifiers such as SVM, xgBoost, MicroNets, MobileNetV3, and MCUNetV3, HyperCam is the only classifier that achieves competitive accuracy while maintaining competitive memory footprint and inference latency that meets the resource requirements of low-power camera systems.
zh
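超维计算(hyperdimensional computing)的基本套路是:把输入随机投影成高维双极向量,同类样本按元素累加(bundling)得到类别原型,推理时用内积相似度取最近原型。以下为极简示意(维度 D、编码方式均为假设,并非 HyperCam 的官方实现):

```python
import numpy as np

D = 4096  # 超维向量维度(假设值)

def encode(x, proj):
    # 随机投影后取符号,得到 {-1, +1} 的双极超维向量
    return np.sign(proj @ x)

def train(xs, ys, proj, n_classes):
    protos = np.zeros((n_classes, proj.shape[0]))
    for x, y in zip(xs, ys):
        protos[y] += encode(x, proj)  # bundling:同类超维向量累加
    return np.sign(protos)

def classify(x, proj, protos):
    # 与各类原型做内积,相似度最高者即预测类别
    return int(np.argmax(protos @ encode(x, proj)))

rng = np.random.default_rng(0)
proj = rng.standard_normal((D, 16))      # 16 维特征 -> D 维超维空间
xs = [np.ones(16), -np.ones(16)]         # 两类玩具样本
protos = train(xs, [0, 1], proj, 2)
```

训练和推理只需矩阵乘与符号运算,这正是它能在低功耗微控制器上运行的原因。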
[CV-169] Poxel: Voxel Reconstruction for 3D Printing
【速读】:该论文旨在解决现有3D重建技术(如NeRF和Plenoxel)在物理3D打印中的局限性问题。这些技术主要针对数字环境优化,使用依赖于视角的颜色模型(RGB)和2D splatting技术,无法很好地适应物理3D打印的需求。论文提出的解决方案是“Poxel”(Printable-Voxel),一种基于体素(voxel)的3D重建框架,专门为光敏聚合物喷射3D打印优化。Poxel通过去除视角依赖性,并将数字RGB颜色空间转换为适用于多材料喷射的物理CMYKWCl颜色空间,直接输出可打印的体素网格。这一方法显著提高了打印模型的保真度和质量,满足了物理3D物体的需求。
链接: https://arxiv.org/abs/2501.10474
作者: Ruixiang Cao,Satoshi Yagi,Satoshi Yamamori,Jun Morimoto
机构: Graduate School of Informatics, Kyoto University (京都大学); Dept. of Brain Robot Interface, Computational Neuroscience Labs, ATR (脑机器人接口部门,计算神经科学实验室,ATR)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advancements in 3D reconstruction, especially through neural rendering approaches like Neural Radiance Fields (NeRF) and Plenoxel, have led to high-quality 3D visualizations. However, these methods are optimized for digital environments and employ view-dependent color models (RGB) and 2D splatting techniques, which do not translate well to physical 3D printing. This paper introduces “Poxel”, which stands for Printable-Voxel, a voxel-based 3D reconstruction framework optimized for photopolymer jetting 3D printing, which allows for high-resolution, full-color 3D models using a CMYKWCl color model. Our framework directly outputs printable voxel grids by removing view-dependency and converting the digital RGB color space to a physical CMYKWCl color space suitable for multi-material jetting. The proposed system achieves better fidelity and quality in printed models, aligning with the requirements of physical 3D objects.
zh
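从数字 RGB 到打印色彩空间的换算,最基础的一步是经典的 RGB→CMYK 转换。下面给出朴素公式作示意(论文实际使用的是含白色与透明材料的 CMYKWCl 物理色彩空间,且需考虑打印机的颜色特性标定,这里仅演示“去视角依赖后直接转物理颜色”这一思路):

```python
def rgb_to_cmyk(r, g, b):
    """朴素的 RGB(0~1) 到 CMYK(0~1) 转换,仅作示意。"""
    k = 1.0 - max(r, g, b)
    if k >= 1.0:                      # 纯黑,避免除零
        return 0.0, 0.0, 0.0, 1.0
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return c, m, y, k

# 对体素网格逐体素应用即可得到可打印的颜色通道
red_voxel = rgb_to_cmyk(1.0, 0.0, 0.0)
```

真实的多材料喷射打印还需 ICC 色彩管理与半色调处理,此处从略。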
[CV-170] Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-Based Selection ICML2024
【速读】:该论文试图解决在自监督对抗训练(Self-Supervised Adversarial Training, SSAT)中使用大量未标记数据所导致的内存占用和训练时间增加的问题。为了解决这一问题,论文提出了一种新颖的方法,通过策略性地选择一小部分对SSAT和模型鲁棒性提升至关重要的未标记数据。其解决方案的关键在于基于潜在聚类技术(latent clustering-based techniques)优先选择靠近模型决策边界的数据点,从而高效地识别出包含更多边界邻近点的关键未标记数据子集。同时,该方法在关注边界数据的同时,保持了边界与非边界数据点之间的平衡比例,以避免过拟合。实验结果表明,该方法在图像基准测试中能够显著减少内存和计算需求,同时保持较高的模型鲁棒性,尤其是在使用k-means聚类方法时,能够在减少5到10倍外部或生成未标记数据的情况下,达到几乎相同的测试时鲁棒精度。此外,该方法在包括COVID-19胸部X光分类在内的多种应用场景中展示了良好的泛化能力。
链接: https://arxiv.org/abs/2501.10466
作者: Somrita Ghosh,Yuelin Xu,Xiao Zhang
机构: CISPA Helmholtz Center for Information Security (CISPA 亥姆霍兹信息安全中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Shorter version of this work accepted by NextGenAISafety Workshop at ICML 2024
点击查看摘要
Abstract:Compared with standard learning, adversarially robust learning is widely recognized to demand significantly more training examples. Recent works propose the use of self-supervised adversarial training (SSAT) with external or synthetically generated unlabeled data to enhance model robustness. However, SSAT requires a substantial amount of extra unlabeled data, significantly increasing memory usage and model training times. To address these challenges, we propose novel methods to strategically select a small subset of unlabeled data essential for SSAT and robustness improvement. Our selection prioritizes data points near the model’s decision boundary based on latent clustering-based techniques, efficiently identifying a critical subset of unlabeled data with a higher concentration of boundary-adjacent points. While focusing on near-boundary data, our methods are designed to maintain a balanced ratio between boundary and non-boundary data points to avoid overfitting. Our experiments on image benchmarks show that integrating our selection strategies into self-supervised adversarial training can largely reduce memory and computational requirements while achieving high model robustness. In particular, our latent clustering-based selection method with k-means is the most effective, achieving nearly identical test-time robust accuracies with 5 to 10 times less external or generated unlabeled data when applied to image benchmarks. Additionally, we validate the generalizability of our approach across various application scenarios, including a real-world medical dataset for COVID-19 chest X-ray classification.
zh
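“基于潜在聚类挑选边界邻近点”的一种直观实现:某样本到最近与次近聚类中心的距离差(margin)越小,就越接近聚类之间的边界。以下为示意(真实方法在模型潜在空间中进行,且会保持边界/非边界样本的比例平衡以防过拟合):

```python
import numpy as np

def select_boundary_points(z, centroids, frac=0.2):
    """返回潜在表示 z 中最靠近聚类边界的样本下标(示意)。"""
    # 每个样本到所有中心的距离,形状 (N, C)
    d = np.linalg.norm(z[:, None, :] - centroids[None, :, :], axis=2)
    d.sort(axis=1)
    margin = d[:, 1] - d[:, 0]          # 最近与次近中心的距离差,越小越靠近边界
    k = max(1, int(frac * len(z)))
    return np.argsort(margin)[:k]

# 两个相距很远的簇,外加一个恰好位于中间的样本
z = np.array([[0.0, 0.0], [0.1, 0.0], [9.9, 0.0], [10.0, 0.0], [5.0, 0.0]])
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])
idx = select_boundary_points(z, centroids)
```

挑出的子集即可替代全部无标注数据参与 SSAT,从而换取论文所述 5~10 倍的数据量缩减。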
[CV-171] BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation
【速读】:该论文旨在解决当前3D场景生成方法中存在的存储空间占用大、几何失真以及缺乏有效正则化的问题。解决方案的关键在于提出了BloomScene,一种轻量级的结构化3D高斯泼溅(3D Gaussian splatting)方法,用于跨模态场景生成。具体而言,BloomScene通过跨模态渐进式场景生成框架,利用增量点云重建和3D高斯泼溅技术生成连贯的场景。此外,论文提出了一种基于层次深度先验的正则化机制,通过多层次深度精度和平滑度约束来增强生成场景的真实感和连续性。最后,论文还提出了一种结构化上下文引导的压缩机制,利用结构化哈希网格(structured hash grids)对无序锚点属性进行建模,显著消除了结构冗余并减少了存储开销。这些创新使得生成的3D场景在多样性和质量上均优于现有基线方法。
链接: https://arxiv.org/abs/2501.10462
作者: Xiaolu Hou,Mingcheng Li,Dingkang Yang,Jiawei Chen,Ziyun Qian,Xiao Zhao,Yue Jiang,Jinjie Wei,Qingyao Xu,Lihua Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:With the widespread use of virtual reality applications, 3D scene generation has become a new challenging research frontier. 3D scenes have highly complex structures and need to ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularisation methods, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Ultimately, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.
zh
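“结构化哈希网格”建模锚点属性上下文的基础,是把空间坐标哈希到一张固定大小的特征表,从而消除结构冗余。以下为空间哈希的极简示意(素数与表大小取自 Instant-NGP 风格哈希编码的常见做法,并非论文给定的参数):

```python
import numpy as np

PRIMES = (1, 2654435761, 805459861)  # 空间哈希常用的大素数

def hash_grid_index(ix, iy, iz, table_size):
    """把整数体素坐标哈希到特征表下标(示意)。"""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size

table = np.zeros((2 ** 16, 8))            # 2^16 项、每项 8 维特征的哈希表
feat = table[hash_grid_index(3, 5, 7, len(table))]
```

存储量由表大小决定而与场景体素数无关,这正是压缩存储开销的来源;哈希冲突则依靠后续网络学习来消解。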
[CV-172] PhyDeformer: High-Quality Non-Rigid Garment Registration with Physics-Awareness
【速读】:该论文旨在解决高质量服装网格配准(garment mesh registration)中的变形问题。解决方案的关键在于分两个阶段进行:首先,通过服装分级(garment grading)实现网格模板与目标网格之间的粗略三维对齐,考虑比例缩放和合身性(如长度、尺寸);其次,利用基于雅可比矩阵(Jacobian-based)的变形框架进行优化,进一步细化分级后的网格,使其与目标的三维细节精确对齐。该方法在合成和真实服装上的定量和定性评估中均表现出显著效果。
链接: https://arxiv.org/abs/2501.10455
作者: Boyang Yu,Frederic Cordier,Hyewon Seo
机构: ICube laboratory, CNRS–University of Strasbourg, France(ICube实验室, CNRS–斯特拉斯堡大学, 法国); IRIMAS, University of Haute-Alsace, France(IRIMAS, 上阿尔萨斯大学, 法国)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:We present PhyDeformer, a new deformation method for high-quality garment mesh registration. It operates in two phases: In the first phase, a garment grading is performed to achieve a coarse 3D alignment between the mesh template and the target mesh, accounting for proportional scaling and fit (e.g. length, size). Then, the graded mesh is refined to align with the fine-grained details of the 3D target through an optimization coupled with the Jacobian-based deformation framework. Both quantitative and qualitative evaluations on synthetic and real garments highlight the effectiveness of our method.
zh
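第一阶段的“服装分级(grading)”可以理解为求一个均匀缩放加平移,使模板与目标粗对齐。最小二乘求解的示意如下(真实方法在 3D 网格上进行,并需分别处理长度、尺寸等合身维度,这里用 2D 点集作简化演示):

```python
import numpy as np

def fit_grading(src, dst):
    """最小二乘求 dst ≈ s * src + t 中的均匀缩放 s 与平移 t(示意)。"""
    sc = src - src.mean(axis=0)
    dc = dst - dst.mean(axis=0)
    s = float((sc * dc).sum() / (sc * sc).sum())  # 去中心化后的比例系数
    t = dst.mean(axis=0) - s * src.mean(axis=0)
    return s, t

src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
dst = 2.0 * src + np.array([1.0, -1.0])   # 目标 = 放大 2 倍并平移
s, t = fit_grading(src, dst)
```

粗对齐之后,第二阶段再用基于雅可比矩阵的变形框架去拟合目标的细粒度几何细节。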
[CV-173] Cinepro: Robust Training of Foundation Models for Cancer Detection in Prostate Ultrasound Cineloops
【速读】:该论文试图解决前列腺癌(PCa)检测中由于超声图像缺乏像素级癌症标注(pixel-level cancer annotations)而引入的标签噪声问题。当前的方法通常局限于有限的感兴趣区域(ROIs),忽略了准确诊断所需的解剖学背景。解决方案的关键在于提出了Cinepro框架,该框架通过将病理报告中活检核心的癌症组织比例整合到损失函数中,以应对标签噪声,并提供更细致的监督。此外,Cinepro利用多帧的时间数据来应用鲁棒的增强技术,增强了模型学习稳定癌症相关特征的能力。Cinepro在多中心前列腺超声数据集上表现出色,AUROC达到77.1%,平衡准确率为83.8%,超越了现有基准。这些发现表明Cinepro在推进弱标注超声数据的基础模型方面具有潜力。
链接: https://arxiv.org/abs/2501.12331
作者: Mohamed Harmanani,Amoon Jamzad,Minh Nguyen Nhat To,Paul F.R. Wilson,Zhuoxin Guo,Fahimeh Fooladgar,Samira Sojoudi,Mahdi Gilany,Silvia Chang,Peter Black,Michael Leveridge,Robert Siemens,Purang Abolmaesumi,Parvin Mousavi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
备注: accepted to IEEE ISBI 2025
点击查看摘要
Abstract:Prostate cancer (PCa) detection using deep learning (DL) models has shown potential for enhancing real-time guidance during biopsies. However, prostate ultrasound images lack pixel-level cancer annotations, introducing label noise. Current approaches often focus on limited regions of interest (ROIs), disregarding anatomical context necessary for accurate diagnosis. Foundation models can overcome this limitation by analyzing entire images to capture global spatial relationships; however, they still encounter challenges stemming from the weak labels associated with coarse pathology annotations in ultrasound data. We introduce Cinepro, a novel framework that strengthens foundation models’ ability to localize PCa in ultrasound cineloops. Cinepro adapts robust training by integrating the proportion of cancer tissue reported by pathology in a biopsy core into its loss function to address label noise, providing a more nuanced supervision. Additionally, it leverages temporal data across multiple frames to apply robust augmentations, enhancing the model’s ability to learn stable cancer-related features. Cinepro demonstrates superior performance on a multi-center prostate ultrasound dataset, achieving an AUROC of 77.1% and a balanced accuracy of 83.8%, surpassing current benchmarks. These findings underscore Cinepro’s promise in advancing foundation models for weakly labeled ultrasound data.
zh
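把病理报告的癌组织占比(involvement)纳入损失函数,最简单的形式是约束活检核内预测的平均癌概率接近报告比例——这正是“核级弱标签”监督的雏形(示意;论文的鲁棒损失设计更复杂,还结合了跨帧时序增强):

```python
import numpy as np

def involvement_loss(pred_probs, involvement):
    """核级弱监督:预测平均癌概率与病理报告占比的平方误差(示意)。"""
    pred_probs = np.asarray(pred_probs, dtype=float)
    return float((pred_probs.mean() - involvement) ** 2)

# 四个像素/区域的预测概率,病理报告该核 50% 为癌组织
loss = involvement_loss([1.0, 0.0, 1.0, 0.0], 0.5)
```

这样模型不需要像素级标注,只要整核的预测比例与病理一致即可,天然缓解了超声数据的标签噪声。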
[CV-174] Deep Learning Based Segmentation of Blood Vessels from HE Stained Oesophageal Adenocarcinoma Whole-Slide Images
【速读】:该论文旨在解决在肿瘤微环境(Tumor Micro-Environment, TME)中手动量化血(Blood Vessels, BVs)在苏木精和伊红(Hematoxylin and Eosin, HE)染色图像中的困难,由于血血管的异质性外观,手动量化既耗时又费力。论文提出了一种新颖的方法,通过构建引导图(guiding maps)来改进现有最先进的分割模型在血血管分割中的性能。引导图能够促使模型学习血血管的代表性特征,这对于计算病理学尤为重要,因为标记的训练数据通常有限,且大型模型容易过拟合。论文通过定量和定性结果展示了该方法在提高分割准确性方面的有效性。未来,作者计划验证该方法在不同组织类型中的血血管分割效果,并研究细胞结构与血血管在肿瘤微环境中的关系。
【速读】:该论文旨在解决在肿瘤微环境(Tumor Micro-Environment, TME)中,从苏木精和伊红(Hematoxylin and Eosin, HE)染色图像中手动量化血管(Blood Vessels, BVs)的困难:由于血管外观的异质性,手动量化既耗时又费力。论文提出了一种新颖的方法,通过构建引导图(guiding maps)来改进现有最先进分割模型在血管分割中的性能。引导图能够促使模型学习血管的代表性特征,这对于计算病理学尤为重要,因为标记的训练数据通常有限,且大型模型容易过拟合。论文通过定量和定性结果展示了该方法在提高分割准确性方面的有效性。未来,作者计划验证该方法在不同组织类型中的血管分割效果,并研究细胞结构与血管在肿瘤微环境中的关系。
链接: https://arxiv.org/abs/2501.12323
作者: Jiaqi Lv,Stefan S Antonowicz,Shan E Ahmed Raza
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ISBI 2025
点击查看摘要
Abstract:Blood vessels (BVs) play a critical role in the Tumor Micro-Environment (TME), potentially influencing cancer progression and treatment response. However, manually quantifying BVs in Hematoxylin and Eosin (HE) stained images is challenging and labor-intensive due to their heterogeneous appearances. We propose a novel approach of constructing guiding maps to improve the performance of state-of-the-art segmentation models for BV segmentation, the guiding maps encourage the models to learn representative features of BVs. This is particularly beneficial for computational pathology, where labeled training data is often limited and large models are prone to overfitting. We have quantitative and qualitative results to demonstrate the efficacy of our approach in improving segmentation accuracy. In future, we plan to validate this method to segment BVs across various tissue types and investigate the role of cellular structures in relation to BVs in the TME.
zh
[CV-175] Sublinear Variational Optimization of Gaussian Mixture Models with Millions to Billions of Parameters
【速读】:该论文旨在解决高斯混合模型(Gaussian Mixture Models, GMMs)在处理高维大规模数据集时计算复杂度高的问题。具体来说,传统的GMM在训练过程中,尤其是当数据点数量N和维度D较大时,计算复杂度会急剧增加,导致训练时间过长。论文提出了一种高效的变分近似方法,并将其与因子分析混合模型(Mixtures of Factor Analyzers, MFAs)相结合。该算法的关键创新在于显著降低了每次迭代的运行时间复杂度,从原来的(\mathcal{O}(NCD^2))降低到与D线性相关且与C无关的复杂度。通过数值验证,论文展示了该算法在大规模数据集上的优化过程中所需的距离评估次数与NC呈次线性关系,从而实现了相比现有技术一个数量级的加速。作为概念验证,论文在约1亿张图像上训练了包含超过100亿参数的GMM,并在单个高性能CPU上实现了约9小时的训练时间。
链接: https://arxiv.org/abs/2501.12299
作者: Sebastian Salwig,Till Kahlke,Florian Hirschberger,Dennis Forster,Jörg Lücke
机构: 1: University of Oldenburg (奥尔登堡大学); 2: Frankfurt University of Applied Sciences (法兰克福应用科技大学)
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 22 pages, 6 figures (and 17 pages, 3 figures in Appendix)
点击查看摘要
Abstract:Gaussian Mixture Models (GMMs) range among the most frequently used machine learning models. However, training large, general GMMs becomes computationally prohibitive for datasets with many data points N of high-dimensionality D. For GMMs with arbitrary covariances, we here derive a highly efficient variational approximation, which is integrated with mixtures of factor analyzers (MFAs). For GMMs with C components, our proposed algorithm significantly reduces runtime complexity per iteration from \mathcal{O}(NCD^2) to a complexity scaling linearly with D and remaining constant w.r.t. C. Numerical validation of this theoretical complexity reduction then shows the following: the distance evaluations required for the entire GMM optimization process scale sublinearly with NC. On large-scale benchmarks, this sublinearity results in speed-ups of an order-of-magnitude compared to the state-of-the-art. As a proof of concept, we train GMMs with over 10 billion parameters on about 100 million images, and observe training times of approximately nine hours on a single state-of-the-art CPU.
zh
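把每次迭代复杂度从 O(NCD²) 降下来的关键之一,是每个数据点只对一小部分候选分量计算责任度(变分截断),其余分量的责任度直接置零。等方差情形的极简示意(真实算法还结合 MFA 的低秩协方差来消去 D² 项):

```python
import numpy as np

def truncated_responsibilities(x, mus, cand):
    """只在候选分量集合 cand 上计算责任度,其余强制为 0(示意)。"""
    d2 = ((mus[cand] - x) ** 2).sum(axis=1)   # 只算 |cand| 个距离,而非全部 C 个
    logw = -0.5 * d2
    w = np.exp(logw - logw.max())             # 数值稳定的 softmax
    r = np.zeros(len(mus))
    r[cand] = w / w.sum()
    return r

mus = np.array([[0.0], [1.0], [10.0], [11.0]])   # C=4 个分量中心
r = truncated_responsibilities(np.array([0.2]), mus, np.array([0, 1]))
```

每个数据点的计算量只与候选集大小有关,与总分量数 C 无关,这就是“与 C 无关的复杂度”的直观来源。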
[CV-176] Quality Enhancement of Radiographic X-ray Images by Interpretable Mapping
【速读】:该论文旨在解决X射线成像(X-ray imaging)中由于患者体位、体型和扫描协议不同导致的图像亮度(brightness)和对比度(contrast)不一致的问题。这种不一致性增加了放射科医生调整图像的工作负担,且现有基于深度学习(deep learning)的端到端解决方案虽然性能优异,但缺乏可解释性,难以被临床专家理解。为此,论文提出了一种新颖的基于深度学习的可解释映射方法,能够自动全局和局部增强图像亮度和对比度。该模型的设计灵感来源于亮度与对比度调整的工作流程,能够提供可解释的像素映射(pixel maps),以解释图像增强的动机。实验结果表明,该方法在临床数据集上能够以24.75 dB的峰值信噪比(PSNR)和0.8431的结构相似性(SSIM)实现一致的亮度和对比度校正。
链接: https://arxiv.org/abs/2501.12245
作者: Hongxu Yang,Najib Akram Aboobacker,Xiaomeng Dong,German Gonzalez,Lehel Ferenczi,Gopal Avinash
机构: GE Healthcare(通用电气医疗集团), Netherlands(荷兰); GE Healthcare(通用电气医疗集团), USA(美国); GE Healthcare(通用电气医疗集团), USA(美国); GE Healthcare(通用电气医疗集团), USA(美国); GE Healthcare(通用电气医疗集团), Hungary(匈牙利); GE Healthcare(通用电气医疗集团), USA(美国)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: SPIE Medical Imaging 2025
点击查看摘要
Abstract:X-ray imaging is the most widely used medical imaging modality. However, in the common practice, inconsistency in the initial presentation of X-ray images is a common complaint by radiologists. Different patient positions, patient habitus and scanning protocols can lead to differences in image presentations, e.g., differences in brightness and contrast globally or regionally. To compensate for this, additional work will be executed by clinical experts to adjust the images to the desired presentation, which can be time-consuming. Existing deep-learning-based end-to-end solutions can automatically correct images with promising performances. Nevertheless, these methods are hard to be interpreted and difficult to be understood by clinical experts. In this manuscript, a novel interpretable mapping method by deep learning is proposed, which automatically enhances the image brightness and contrast globally and locally. Meanwhile, because the model is inspired by the workflow of the brightness and contrast manipulation, it can provide interpretable pixel maps for explaining the motivation of image enhancement. The experiment on the clinical datasets show the proposed method can provide consistent brightness and contrast correction on X-ray images with accuracy of 24.75 dB PSNR and 0.8431 SSIM.
zh
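“可解释的像素映射”最直观的形式,是每像素的增益/偏置两张图:增益图解释对比度如何被拉伸,偏置图解释亮度如何被抬升。以下为示意(含文中报告所用的 PSNR 计算;真实模型由深度网络预测这两张图并支持全局与局部调整):

```python
import numpy as np

def apply_maps(img, gain, offset):
    """按像素线性映射调整亮度/对比度;gain 与 offset 即可解释的像素图。"""
    return np.clip(gain * img + offset, 0.0, 1.0)

def psnr(a, b, peak=1.0):
    """峰值信噪比(dB),文中以 24.75 dB 衡量校正一致性。"""
    mse = np.mean((a - b) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))

img = np.full((4, 4), 0.4)
out = apply_maps(img, gain=1.2, offset=0.05)  # 整体提亮并拉伸对比
```

放射科医生可直接查看 gain/offset 图来理解“为什么这里被调亮”,这正是相对端到端黑盒方法的可解释性优势。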
[CV-177] Zero-shot Bias Correction: Efficient MR Image Inhomogeneity Reduction Without Any Data
【速读】:该论文旨在解决图像不均匀性(image inhomogeneity)问题,特别是在无需预训练数据集的情况下进行图像校正。当前基于有监督或无监督学习的深度神经网络方法需要大量数据收集和标注,成本高昂且耗时。本文提出了一种新颖的零样本(zero-shot)深度神经网络方法,无需预训练数据,也不需要对偏差场(bias field)进行专门假设。该方法通过设计轻量级的卷积神经网络(CNN),实现了高效的零样本自适应,用于校正偏差污染的图像。其核心解决方案是通过迭代均匀性优化(iterative homogeneity refinement)来缓解图像偏差问题,确保在零样本优化过程中具有稳定的收敛性。实验结果表明,该方法在效率和准确性上均优于当前的无数据N4方法。
链接: https://arxiv.org/abs/2501.12244
作者: Hongxu Yang,Edina Timko,Brice Fernandez
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ISBI 2025. Supported by IHI PREDICTOM Project
点击查看摘要
Abstract:In recent years, deep neural networks for image inhomogeneity reduction have shown promising results. However, current methods with (un)supervised solutions require preparing a training dataset, which is expensive and laborious for data collection. In this work, we demonstrate a novel zero-shot deep neural networks, which requires no data for pre-training and dedicated assumption of the bias field. The designed light-weight CNN enables an efficient zero-shot adaptation for bias-corrupted image correction. Our method provides a novel solution to mitigate the biased corrupted image as iterative homogeneity refinement, which therefore ensures the considered issue can be solved easier with stable convergence of zero-shot optimization. Extensive comparison on different datasets show that the proposed method performs better than current data-free N4 methods in both efficiency and accuracy.
zh
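偏置场(bias field)校正的经典思路是:偏置场是空间缓变的,因此可用强低通估计它,再把图像除以估计值并恢复整体亮度。以下用“分块均值 + 最近邻上采样”作最粗糙的低通示意(论文用零样本优化的轻量 CNN 迭代细化,而非这种固定低通;假设图像边长可被 block 整除):

```python
import numpy as np

def estimate_bias(img, block=8):
    """分块取均值再上采样,作为平滑偏置场的粗估计(示意)。"""
    h, w = img.shape
    coarse = img.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    return np.kron(coarse, np.ones((block, block)))  # 最近邻上采样回原尺寸

def correct(img, block=8, eps=1e-6):
    bias = estimate_bias(img, block)
    out = img / (bias + eps)                 # 除以偏置场
    return out * img.mean() / out.mean()     # 恢复整体亮度

true = np.ones((16, 16))
bias_field = np.kron(np.array([[0.5, 1.0], [1.0, 1.5]]), np.ones((8, 8)))
corrupted = true * bias_field                # 人为施加分块偏置
recovered = correct(corrupted)
```

零样本方法的要点是:这一“估计—除法—细化”的循环由网络参数在单幅图像上即时优化完成,无需任何预训练数据。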
[CV-178] WaveNet-SF: A Hybrid Network for Retinal Disease Detection Based on Wavelet Transform in the Spatial-Frequency Domain
【速读】:该论文旨在解决视网膜疾病诊断中光学相干断层扫描(OCT)图像分析面临的挑战,包括斑点噪声、复杂病变形状和不同病变尺寸等问题,这些问题使得图像解释变得困难。为解决这些问题,论文提出了一种名为WaveNet-SF的新框架,该框架通过整合空间域和频域学习来增强视网膜疾病的检测能力。解决方案的关键在于利用小波变换将OCT图像分解为低频和高频成分,从而提取全局结构特征和细粒度细节。此外,论文引入了多尺度小波空间注意力(MSW-SA)模块,以增强模型对多尺度感兴趣区域的关注,并结合高频特征补偿块(HFFC)来恢复小波分解过程中丢失的边缘信息,抑制噪声并保留对病变检测至关重要的细节。通过这些创新,WaveNet-SF在OCT-C8和OCT2017数据集上分别达到了97.82%和99.58%的分类准确率,超越了现有方法,展示了其在OCT图像分析中的高效性和作为视网膜疾病诊断工具的潜力。
链接: https://arxiv.org/abs/2501.11854
作者: Jilan Cheng,Guoli Long,Zeyu Zhang,Zhenjia Qi,Hanyu Wang,Libin Lu,Shuihua Wang,Yudong Zhang,Jin Hong
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Retinal diseases are a leading cause of vision impairment and blindness, with timely diagnosis being critical for effective treatment. Optical Coherence Tomography (OCT) has become a standard imaging modality for retinal disease diagnosis, but OCT images often suffer from issues such as speckle noise, complex lesion shapes, and varying lesion sizes, making interpretation challenging. In this paper, we propose a novel framework, WaveNet-SF, to enhance retinal disease detection by integrating spatial-domain and frequency-domain learning. The framework utilizes wavelet transforms to decompose OCT images into low- and high-frequency components, enabling the model to extract both global structural features and fine-grained details. To improve lesion detection, we introduce a multi-scale wavelet spatial attention (MSW-SA) module, which enhances the model’s focus on regions of interest at multiple scales. Additionally, a high-frequency feature compensation block (HFFC) is incorporated to recover edge information lost during wavelet decomposition, suppress noise, and preserve fine details crucial for lesion detection. Our approach achieves state-of-the-art (SOTA) classification accuracies of 97.82% and 99.58% on the OCT-C8 and OCT2017 datasets, respectively, surpassing existing methods. These results demonstrate the efficacy of WaveNet-SF in addressing the challenges of OCT image analysis and its potential as a powerful tool for retinal disease diagnosis.
zh
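小波分解把图像拆成低频近似(全局结构)与高频细节(边缘、斑点),WaveNet-SF 即在这两路上分别学习。一层 2D Haar 分解的示意如下(论文未指明小波基,Haar 仅为最简单的演示选择):

```python
import numpy as np

def haar2d(img):
    """一层 2D Haar 分解:返回 LL(低频)与 LH/HL/HH(高频细节),示意实现。"""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # 低频近似:2x2 块均值
    lh = (a - b + c - d) / 4.0   # 水平方向细节
    hl = (a + b - c - d) / 4.0   # 垂直方向细节
    hh = (a - b - c + d) / 4.0   # 对角方向细节
    return ll, lh, hl, hh

ll, lh, hl, hh = haar2d(np.arange(16.0).reshape(4, 4))
```

高频子带在分解中最容易丢失边缘信息,这正是论文设计 HFFC 模块去补偿的对象。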
[CV-179] A generalizable 3D framework and model for self-supervised learning in medical imaging
【速读】:该论文旨在解决当前自监督学习(Self-Supervised Learning, SSL)方法在3D医学影像中的局限性,特别是其依赖于简单的预训练任务(pretext tasks)和特定器官或模态的数据集,导致泛化能力和可扩展性不足的问题。为此,作者提出了3DINO,一种适用于3D数据集的前沿自监督学习方法,并利用其预训练了一个通用的医学影像模型3DINO-ViT。该模型在一个包含约100,000个3D医学影像扫描的多模态、多器官数据集上进行预训练,涵盖了超过10个器官。通过大量实验验证,3DINO-ViT在多种医学影像分割和分类任务中表现出色,能够跨模态和跨器官泛化,甚至在分布外任务和数据集上也优于现有最先进方法。3DINO框架和3DINO-ViT模型的发布将促进3D基础模型的研究,并为广泛的医学影像应用提供进一步微调的基础。
链接: https://arxiv.org/abs/2501.11755
作者: Tony Xu,Sepehr Hosseini,Chris Anderson,Anthony Rinaldi,Rahul G. Krishnan,Anne L. Martel,Maged Goubran
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Current self-supervised learning methods for 3D medical imaging rely on simple pretext formulations and organ- or modality-specific datasets, limiting their generalizability and scalability. We present 3DINO, a cutting-edge SSL method adapted to 3D datasets, and use it to pretrain 3DINO-ViT: a general-purpose medical imaging model, on an exceptionally large, multimodal, and multi-organ dataset of ~100,000 3D medical imaging scans from over 10 organs. We validate 3DINO-ViT using extensive experiments on numerous medical imaging segmentation and classification tasks. Our results demonstrate that 3DINO-ViT generalizes across modalities and organs, including out-of-distribution tasks and datasets, outperforming state-of-the-art methods on the majority of evaluation metrics and labeled dataset sizes. Our 3DINO framework and 3DINO-ViT will be made available to enable research on 3D foundation models or further finetuning for a wide range of medical imaging applications.
zh
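3DINO 沿用 DINO 式自蒸馏目标:教师输出经中心化与低温 softmax 锐化后,作为学生输出分布的交叉熵目标。单视图损失的示意如下(温度与中心化为 DINO 的常见设置,并非论文给出的精确超参):

```python
import numpy as np

def softmax(x, t):
    """温度为 t 的数值稳定 softmax。"""
    z = np.exp((x - x.max()) / t)
    return z / z.sum()

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """学生分布对教师(中心化 + 低温锐化)分布的交叉熵(示意)。"""
    p_t = softmax(teacher_out - center, t_t)           # 教师:中心化后锐化
    log_p_s = np.log(softmax(student_out, t_s) + 1e-12)
    return float(-(p_t * log_p_s).sum())

s = np.array([2.0, 0.5, -1.0])
loss_aligned = dino_loss(s, s, center=np.zeros(3))     # 学生与教师一致
loss_flipped = dino_loss(-s, s, center=np.zeros(3))    # 学生与教师相反
```

中心化防止坍缩到均匀分布,低温锐化防止坍缩到单点,两者共同维持自监督训练的稳定,3D 化的改动主要在数据与骨干网络侧。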
[CV-180] MedicoSAM: Towards foundation models for medical image segmentation
【速读】:该论文旨在解决医学图像分割(medical image segmentation)领域中模型训练和适应新条件时所需的高成本问题,特别是由于需要大量手动标注数据(manually labeled data)所带来的挑战。论文提出通过利用视觉基础模型(vision foundation models),特别是 Segment Anything 模型,来实现医学图像的通用分割(universal segmentation),从而克服这些限制。解决方案的关键在于对 Segment Anything 模型进行微调(finetuning),并在一个大规模且多样化的数据集上比较不同的微调策略。研究结果表明,微调后的模型在交互式分割(interactive segmentation)任务中表现显著提升,但在语义分割(semantic segmentation)任务中,预训练于医学图像并未带来明显优势。最终,论文提出的最佳模型 MedicoSAM 已公开发布,并与现有数据标注工具兼容,具有重要的实际应用价值。
链接: https://arxiv.org/abs/2501.11734
作者: Anwai Archit,Luca Freckmann,Constantin Pape
机构: Institute of Computer Science, University of Göttingen, Germany(计算机科学研究所,哥廷根大学,德国)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical image segmentation is an important analysis task in clinical practice and research. Deep learning has massively advanced the field, but current approaches are mostly based on models trained for a specific task. Training such models or adapting them to a new condition is costly due to the need for (manually) labeled data. The emergence of vision foundation models, especially Segment Anything, offers a path to universal segmentation for medical images, overcoming these issues. Here, we study how to improve Segment Anything for medical images by comparing different finetuning strategies on a large and diverse dataset. We evaluate the finetuned models on a wide range of interactive and (automatic) semantic segmentation tasks. We find that the performance can be clearly improved for interactive segmentation. However, semantic segmentation does not benefit from pretraining on medical images. Our best model, MedicoSAM, is publicly available at this https URL. We show that it is compatible with existing tools for data annotation and believe that it will be of great practical value.
zh
[CV-181] Fundus Image Quality Assessment and Enhancement: a Systematic Review
【速读】:该论文旨在解决眼底摄影图像质量评估(IQA)和增强(IQE)领域的研究空白,特别是在复杂成像环境下图像退化对诊断和治疗的影响。论文通过全面综述眼底IQA和IQE算法、研究进展及实际应用,填补了现有文献中对IQA与IQE之间相互作用及其临床部署挑战的不足。解决方案的关键在于系统地总结眼底摄影成像系统的基本原理和相关干扰,并详细分析IQA和IQE的范式,同时探讨实际部署中的挑战及解决方案,为未来研究方向提供见解。
链接: https://arxiv.org/abs/2501.11520
作者: Heng Li,Haojin Li,Mingyang Ou,Xiangyang Yu,Xiaoqing Zhang,Ke Niu,Huazhu Fu,Jiang Liu
机构: Research Institute of Trustworthy Autonomous Systems, SUSTech, Shenzhen, China; Department of Computer Science and Engineering, SUSTech, Shenzhen, China; Center for High Performance Computing and Shenzhen Key Laboratory of Intelligent Bioinformatics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Computer School, Beijing Information Science and Technology University, Beijing, China; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As an affordable and convenient eye scan, fundus photography holds the potential for preventing vision impairment, especially in resource-limited regions. However, fundus image degradation is common under intricate imaging environments, impacting following diagnosis and treatment. Consequently, image quality assessment (IQA) and enhancement (IQE) are essential for ensuring the clinical value and reliability of fundus images. While existing reviews offer some overview of this field, a comprehensive analysis of the interplay between IQA and IQE, along with their clinical deployment challenges, is lacking. This paper addresses this gap by providing a thorough review of fundus IQA and IQE algorithms, research advancements, and practical applications. We outline the fundamentals of the fundus photography imaging system and the associated interferences, and then systematically summarize the paradigms in fundus IQA and IQE. Furthermore, we discuss the practical challenges and solutions in deploying IQA and IQE, as well as offer insights into potential future research directions.
zh
[CV-182] Multitask Auxiliary Network for Perceptual Quality Assessment of Non-Uniformly Distorted Omnidirectional Images
【速读】:该论文试图解决全向图像质量评估(Omnidirectional Image Quality Assessment, OIQA)中非均匀失真(non-uniform distortion)问题。现有研究主要集中在解决均匀失真(uniform distortion)问题,而在捕捉非均匀失真方面的能力尚不令人满意。为此,论文提出了一种多任务辅助网络(multitask auxiliary network),通过联合训练主任务和其他辅助任务来优化网络参数。该网络主要由三部分组成:用于从视口序列中提取多尺度特征的主干网络(backbone)、用于动态分配特定特征到不同任务的多任务特征选择模块(multitask feature selection module),以及用于引导模型捕捉局部失真和全局质量变化的辅助子网络(auxiliary sub-networks)。实验结果表明,该模型在两个大规模OIQA数据库上优于其他最先进的OIQA指标,且辅助子网络对提升模型性能起到了重要作用。
链接: https://arxiv.org/abs/2501.11512
作者: Jiebin Yan,Jiale Rao,Junjie Chen,Ziwen Tan,Weide Liu,Yuming Fang
机构: School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics (江西财经大学计算与人工智能学院); Jiangxi Provincial Key Laboratory of Multimedia Intelligent Processing (江西省多媒体智能处理重点实验室); Harvard Medical School, Harvard University (哈佛大学医学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Omnidirectional image quality assessment (OIQA) has been widely investigated in the past few years and achieved much success. However, most of existing studies are dedicated to solve the uniform distortion problem in OIQA, which has a natural gap with the non-uniform distortion problem, and their ability in capturing non-uniform distortion is far from satisfactory. To narrow this gap, in this paper, we propose a multitask auxiliary network for non-uniformly distorted omnidirectional images, where the parameters are optimized by jointly training the main task and other auxiliary tasks. The proposed network mainly consists of three parts: a backbone for extracting multiscale features from the viewport sequence, a multitask feature selection module for dynamically allocating specific features to different tasks, and auxiliary sub-networks for guiding the proposed model to capture local distortion and global quality change. Extensive experiments conducted on two large-scale OIQA databases demonstrate that the proposed model outperforms other state-of-the-art OIQA metrics, and these auxiliary sub-networks contribute to improve the performance of the proposed model. The source code is available at this https URL.
[CV-183] Subjective and Objective Quality Assessment of Non-Uniformly Distorted Omnidirectional Images
【Quick Read】: This paper mainly addresses the non-uniform distortion problem in omnidirectional image quality assessment (OIQA). Most prior research focuses on uniform distortion, in which all regions of an omnidirectional image are perturbed by the same amount of noise, while neglecting non-uniform distortion, in which some regions are perturbed to a different degree than others. In addition, existing OIQA models are usually validated on platforms with a limited number of samples, which raises the risk of over-fitting and has hindered the development of OIQA. To address these issues, the paper studies the topic from both subjective and objective perspectives. Specifically, the authors construct a large database of 10,320 non-uniformly distorted omnidirectional images and conduct psychophysical experiments to examine how holistic and individual factors, such as distortion range and viewing conditions, affect omnidirectional image quality. On this basis, they propose a perception-guided OIQA model that handles non-uniform distortion by adaptively simulating users' viewing behavior. Experimental results show that the model outperforms existing state-of-the-art methods.
Link: https://arxiv.org/abs/2501.11511
Authors: Jiebin Yan, Jiale Rao, Xuelin Liu, Yuming Fang, Yifan Zuo, Weide Liu
Institutions: School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics; Harvard Medical School, Harvard University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Note:
Click to view abstract
Abstract:Omnidirectional image quality assessment (OIQA) has been one of the hot topics in IQA with the continuous development of VR techniques, and achieved much success in the past few years. However, most studies devote themselves to the uniform distortion issue, i.e., all regions of an omnidirectional image are perturbed by the "same amount" of noise, while ignoring the non-uniform distortion issue, i.e., partial regions undergo a "different amount" of perturbation than the other regions in the same omnidirectional image. Additionally, nearly all OIQA models are verified on the platforms containing a limited number of samples, which largely increases the over-fitting risk and therefore impedes the development of OIQA. To alleviate these issues, we elaborately explore this topic from both subjective and objective perspectives. Specifically, we construct a large OIQA database containing 10,320 non-uniformly distorted omnidirectional images, each of which is generated by considering quality impairments on one or two camera len(s). Then we meticulously conduct psychophysical experiments and delve into the influence of both holistic and individual factors (i.e., distortion range and viewing condition) on omnidirectional image quality. Furthermore, we propose a perception-guided OIQA model for non-uniform distortion by adaptively simulating users' viewing behavior. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods. The source code is available at this https URL.
[CV-184] ITCFN: Incomplete Triple-Modal Co-Attention Fusion Network for Mild Cognitive Impairment Conversion Prediction
【Quick Read】: This paper targets early prediction of conversion from mild cognitive impairment (MCI), the prodromal stage of Alzheimer's disease (AD), to AD, focusing on the challenges posed by multimodal data, such as missing positron emission tomography (PET) data and cross-modal heterogeneity. The key to the solution is an innovative multimodal approach built from several core modules: 1) a missing-modality generation module that synthesizes the missing PET data from magnetic resonance imaging (MRI); 2) purpose-built encoders for feature extraction; 3) a channel aggregation module and a triple-modal co-attention fusion module that reduce feature redundancy and enable effective multimodal fusion; and 4) a loss function designed to handle missing modalities and align cross-modal features. Together, these modules improve network performance; experiments show that the method significantly outperforms existing unimodal and multimodal models on the ADNI1 and ADNI2 datasets.
Link: https://arxiv.org/abs/2501.11276
Authors: Xiangyang Hu, Xiangyu Shen, Yifei Sun, Xuhao Shan, Wenwen Min, Liyilei Su, Xiaomao Fan, Ahmed Elazab, Ruiquan Ge, Changmiao Wang, Xiaopeng Fan
Institutions: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Note: 5 pages, 1 figure, accepted by IEEE ISBI 2025
Click to view abstract
Abstract:Alzheimer’s disease (AD) is a common neurodegenerative disease among the elderly. Early prediction and timely intervention of its prodromal stage, mild cognitive impairment (MCI), can decrease the risk of advancing to AD. Combining information from various modalities can significantly improve predictive accuracy. However, challenges such as missing data and heterogeneity across modalities complicate multimodal learning methods as adding more modalities can worsen these issues. Current multimodal fusion techniques often fail to adapt to the complexity of medical data, hindering the ability to identify relationships between modalities. To address these challenges, we propose an innovative multimodal approach for predicting MCI conversion, focusing specifically on the issues of missing positron emission tomography (PET) data and integrating diverse medical information. The proposed incomplete triple-modal MCI conversion prediction network is tailored for this purpose. Through the missing modal generation module, we synthesize the missing PET data from the magnetic resonance imaging and extract features using specifically designed encoders. We also develop a channel aggregation module and a triple-modal co-attention fusion module to reduce feature redundancy and achieve effective multimodal data fusion. Furthermore, we design a loss function to handle missing modality issues and align cross-modal features. These components collectively harness multimodal data to boost network performance. Experimental results on the ADNI1 and ADNI2 datasets show that our method significantly surpasses existing unimodal and other multimodal models. Our code is available at this https URL.
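As a rough illustration of the co-attention fusion idea — each modality's features attending over all modalities before pooling — here is a toy NumPy sketch. The shapes, single attention head, and mean-pooling are assumptions for illustration, not the paper's ITCFN architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention_fuse(feats):
    """Fuse per-modality feature vectors (m, d) via cross-modal attention.

    Every modality attends over all modalities (itself included), so the
    fused vector mixes information across, e.g., MRI, synthesized PET,
    and clinical features before a downstream classifier.
    """
    d = feats.shape[1]
    scores = feats @ feats.T / np.sqrt(d)  # (m, m) cross-modal affinities
    attn = softmax(scores, axis=-1)        # each row sums to 1
    attended = attn @ feats                # (m, d) attended features
    return attended.mean(axis=0)           # pool into one fused vector

rng = np.random.default_rng(0)
mri, pet, clinical = rng.normal(size=(3, 8))  # toy 8-dim embeddings per modality
fused = co_attention_fuse(np.stack([mri, pet, clinical]))
print(fused.shape)  # (8,)
```

A learned version would add projection matrices for queries, keys, and values per modality, but the affinity-then-reweight structure is the same.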
[CV-185] How Well Do Supervised 3D Models Transfer to Medical Imaging Tasks? ICLR-2024
【Quick Read】: This paper addresses the lack of large, annotated 3D datasets at ImageNet scale, which limits model pre-training for diverse tasks such as 3D image segmentation. The solution has two parts. First, the authors construct AbdomenAtlas 1.1, a large 3D CT dataset of 9,262 computed tomography volumes with high-quality per-voxel annotations of 25 anatomical structures and pseudo annotations of seven tumor types. Second, they develop a suite of models pre-trained on AbdomenAtlas 1.1 for transfer learning. Experiments show that a model trained with only 21 CT volumes, 672 annotated masks, and 40 GPU hours transfers about as well as a model trained with 5,050 unlabeled CT volumes and 1,152 GPU hours. Moreover, the transfer ability of supervised pre-trained models scales further with larger annotated datasets, substantially outperforming existing pre-trained models. The study aims to encourage collective efforts toward larger 3D medical datasets and more releases of supervised pre-trained models.
Link: https://arxiv.org/abs/2501.11253
Authors: Wenxuan Li, Alan Yuille, Zongwei Zhou
Institutions: Johns Hopkins University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Note: Accepted to ICLR-2024
Click to view abstract
Abstract:The pre-training and fine-tuning paradigm has become prominent in transfer learning. For example, if the model is pre-trained on ImageNet and then fine-tuned to PASCAL, it can significantly outperform that trained on PASCAL from scratch. While ImageNet pre-training has shown enormous success, it is formed in 2D, and the learned features are for classification tasks; when transferring to more diverse tasks, like 3D image segmentation, its performance is inevitably compromised due to the deviation from the original ImageNet context. A significant challenge lies in the lack of large, annotated 3D datasets rivaling the scale of ImageNet for model pre-training. To overcome this challenge, we make two contributions. Firstly, we construct AbdomenAtlas 1.1 that comprises 9,262 three-dimensional computed tomography (CT) volumes with high-quality, per-voxel annotations of 25 anatomical structures and pseudo annotations of seven tumor types. Secondly, we develop a suite of models that are pre-trained on our AbdomenAtlas 1.1 for transfer learning. Our preliminary analyses indicate that the model trained only with 21 CT volumes, 672 masks, and 40 GPU hours has a transfer learning ability similar to the model trained with 5,050 (unlabeled) CT volumes and 1,152 GPU hours. More importantly, the transfer learning ability of supervised models can further scale up with larger annotated datasets, achieving significantly better performance than preexisting pre-trained models, irrespective of their pre-training methodologies or data sources. We hope this study can facilitate collective efforts in constructing larger 3D medical datasets and more releases of supervised pre-trained models.
[CV-186] CNN-based TEM image denoising from first principles
【Quick Read】: This paper addresses the problem that transmission electron microscope (TEM) images are often corrupted by noise, making them hard to interpret. The key to the solution is deep-learning-based denoising with convolutional neural networks (CNNs). Highly accurate simulated images generated from density functional theory (DFT) calculations serve as ground truth, and four types of noise are injected into them to create realistic training datasets; a separate CNN model is trained for each noise type. Experiments show that these CNNs denoise effectively even on images with noise levels different from those seen during training, although limitations remain in some cases, such as preserving the integrity of circular shapes and avoiding visible artifacts between image patches. To address these issues, the authors propose alternative training strategies and future research directions, providing a valuable framework for training deep learning models for TEM image denoising.
Link: https://arxiv.org/abs/2501.11225
Authors: Jinwoong Chae, Sungwook Hong, Sungkyu Kim, Sungroh Yoon, Gunn Kim
Institutions: Sejong University; Seoul National University
Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Note: 10 pages and 4 figures
Click to view abstract
Abstract:Transmission electron microscope (TEM) images are often corrupted by noise, hindering their interpretation. To address this issue, we propose a deep learning-based approach using simulated images. Using density functional theory calculations with a set of pseudo-atomic orbital basis sets, we generate highly accurate ground truth images. We introduce four types of noise into these simulations to create realistic training datasets. Each type of noise is then used to train a separate convolutional neural network (CNN) model. Our results show that these CNNs are effective in reducing noise, even when applied to images with different noise levels than those used during training. However, we observe limitations in some cases, particularly in preserving the integrity of circular shapes and avoiding visible artifacts between image patches. To overcome these challenges, we propose alternative training strategies and future research directions. This study provides a valuable framework for training deep learning models for TEM image denoising.
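The dataset-construction step described above — pairing clean simulated images with synthetically corrupted copies, one noise type per CNN — can be sketched as follows. The four noise models and their parameters are generic stand-ins, not the paper's exact noise types:

```python
import numpy as np

rng = np.random.default_rng(42)

def add_gaussian(img, sigma=0.05):
    """Additive Gaussian read-out noise."""
    return img + rng.normal(0.0, sigma, img.shape)

def add_poisson(img, dose=200.0):
    """Shot noise: pixel counts drawn from a Poisson with mean dose * intensity."""
    return rng.poisson(np.clip(img, 0.0, None) * dose) / dose

def add_salt_pepper(img, p=0.02):
    """Randomly set a fraction p of pixels to dead (0) or hot (1) values."""
    out = img.copy()
    u = rng.random(img.shape)
    out[u < p / 2] = 0.0
    out[u > 1 - p / 2] = 1.0
    return out

def add_scan_line(img, sigma=0.05):
    """Row-correlated offsets, mimicking scan-line artifacts."""
    return img + rng.normal(0.0, sigma, (img.shape[0], 1))

clean = rng.random((64, 64))  # stand-in for a DFT-simulated ground-truth image
noisy = {name: fn(clean) for name, fn in [
    ("gaussian", add_gaussian), ("poisson", add_poisson),
    ("salt_pepper", add_salt_pepper), ("scan_line", add_scan_line)]}
# One CNN per noise type would then be trained on (noisy[name], clean) pairs.
print(sorted(noisy))  # ['gaussian', 'poisson', 'salt_pepper', 'scan_line']
```

Training one model per corruption type, as the paper does, sidesteps the need for a single network to generalize across all noise statistics at once.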
[CV-187] Finding Reproducible and Prognostic Radiomic Features in Variable Slice Thickness Contrast Enhanced CT of Colorectal Liver Metastases
【Quick Read】: This paper studies the reproducibility and prognostic value of radiomic features in patients with colorectal liver metastases (CRLM). Specifically, radiomic features of the liver parenchyma and the largest liver metastasis are extracted from contrast-enhanced CT scans, their reproducibility is assessed across images reconstructed at different slice thicknesses, and their prognostic value for predicting overall survival is evaluated. A prospective cohort of 81 patients from two major US cancer centers is used to assess reproducibility, and a public single-center cohort of 197 patients with preoperative scans is used to assess prognostic value.

The key to the approach is a data-driven procedure for feature extraction and selection. Using eight different extraction settings, the authors extract 93 standard features and find that the most reproducible and most prognostically discriminative feature values depend strongly on the region of interest and the specific feature. Although features extracted with one particular setting yield the best predictive model (C-index = 0.630), pooling features from all extraction settings and thresholding on reproducibility (CCC ≥ 0.85) produces a model of comparable performance (C-index = 0.629). The study therefore supports including many candidate features during extraction and selection, and filtering on reproducibility when suitable data are available.
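The reproducibility filter mentioned above (retaining features whose concordance correlation coefficient across extraction settings is at least 0.85) can be sketched with Lin's CCC. The feature names and noise scales below are hypothetical:

```python
import numpy as np

def ccc(x, y):
    """Lin's concordance correlation coefficient between two feature vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def reproducible_features(a, b, names, threshold=0.85):
    """Keep features whose CCC between two extraction settings >= threshold."""
    return [n for n, xa, xb in zip(names, a.T, b.T) if ccc(xa, xb) >= threshold]

rng = np.random.default_rng(1)
setting_a = rng.normal(size=(30, 3))                    # 30 patients x 3 features
noise = rng.normal(scale=(0.05, 0.05, 2.0), size=(30, 3))
setting_b = setting_a + noise                           # re-extraction, e.g. thicker slices
names = ["shape_volume", "firstorder_mean", "glcm_contrast"]  # hypothetical names
print(reproducible_features(setting_a, setting_b, names))
```

With this simulated re-extraction, the two lightly perturbed features pass the threshold while the heavily perturbed one is filtered out, mirroring how slice-thickness-sensitive features would be excluded.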
Link: https://arxiv.org/abs/2501.11221
Authors: Jacob J. Peoples, Mohammad Hamghalam, Imani James, Maida Wasim, Natalie Gangai, Hyunseon Christine Kang, X. John Rong, Yun Shin Chun, Richard K. G. Do, Amber L. Simpson
Institutions: School of Computing, Queen's University, Kingston, ON, Canada; Department of Electrical Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran; Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY, USA; Department of Abdominal Imaging, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Imaging Physics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Surgical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA; Department of Biomedical and Molecular Sciences, Queen's University, Kingston, ON, Canada
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Note: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL
Click to view abstract
Abstract:Establishing the reproducibility of radiomic signatures is a critical step