This post lists the latest papers retrieved from arXiv.org on 2025-02-04. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper list by email, please leave your email address in the comments.
Table of Contents
Overview (2025-02-04)
A total of 965 papers are updated today, including:
- Natural Language Processing: 160 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 279 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 193 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 381 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Scaling Embedding Layers in Language Models
[Quick Read]: This paper tackles the increased decoding cost incurred when scaling up input embedding layers to improve language model performance. The key is SCONE, which introduces embeddings for frequent n-grams; these embeddings provide a contextualized representation for each input token and are learned by a separate model during training. The approach improves model performance markedly while keeping inference-time FLOPS fixed and avoiding extra decoding cost.
Link: https://arxiv.org/abs/2502.01637
Authors: Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
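The cached-table idea is easy to picture in code. Below is a toy sketch: names such as `fake_embedding` and `frequent_ngrams` are invented for illustration, and hash-based vectors stand in for the learned embeddings (the paper learns the table with a separate model and serves it from off-accelerator memory):

```python
# Toy sketch of an n-gram embedding lookup in the spirit of SCONE.
# For each input token, find the longest frequent n-gram ending at that
# position and add its cached embedding to the token's base embedding.

import hashlib

DIM = 4

def fake_embedding(key: str) -> list[float]:
    """Deterministic stand-in for a learned embedding vector."""
    h = hashlib.sha256(key.encode()).digest()
    return [b / 255.0 for b in h[:DIM]]

# Cached table of frequent n-grams (in the real method this is precomputed
# before inference and stored off-accelerator).
frequent_ngrams = {("new", "york"), ("new", "york", "city"), ("the",)}
ngram_table = {ng: fake_embedding(" ".join(ng)) for ng in frequent_ngrams}

def embed(tokens: list[str], max_n: int = 3) -> list[list[float]]:
    out = []
    for i, tok in enumerate(tokens):
        vec = fake_embedding(tok)  # original-vocabulary embedding
        # Longest frequent n-gram ending at position i wins.
        for n in range(max_n, 0, -1):
            ng = tuple(tokens[max(0, i - n + 1): i + 1])
            if ng in ngram_table:
                vec = [a + b for a, b in zip(vec, ngram_table[ng])]
                break
        out.append(vec)
    return out

vecs = embed(["i", "love", "new", "york", "city"])
print(len(vecs), len(vecs[0]))  # one DIM-dimensional vector per token
```

The longest-match loop mirrors how a token's representation is contextualized by the frequent n-gram it completes, while each lookup stays a constant-time table access.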
[NLP-1] Lifelong Sequential Knowledge Editing without Model Degradation
[Quick Read]: This paper addresses the significant model degradation caused by large-scale sequential knowledge editing. The key is ENCORE (Early stopping and Norm-Constrained Robust knowledge Editing), which controls overfitting and disproportionate norm growth, enabling up to 10,000 sequential edits without loss of downstream performance while running faster than existing methods such as MEMIT and AlphaEdit.
Link: https://arxiv.org/abs/2502.01636
Authors: Akshat Gupta, Phudish Prateepamornkul, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Prior work in parameter-modifying knowledge editing has shown that large-scale sequential editing leads to significant model degradation. In this paper, we study the reasons behind this and scale sequential knowledge editing to 10,000 sequential edits, while maintaining the downstream performance of the original model. We first show that locate-then-edit knowledge editing methods lead to overfitting on the edited facts. We also show that continuous knowledge editing using these methods leads to disproportionate growth in the norm of the edited matrix. We then provide a crucial insight into the inner workings of locate-then-edit methods. We show that norm-growth is a hidden trick employed by these methods that gives larger importance to the output activations produced from the edited layers. With this “importance hacking”, the edited layers provide much larger contributions to the model’s output. To mitigate these issues, we present ENCORE - Early stopping and Norm-Constrained Robust knowledge Editing. ENCORE controls for overfitting and the disproportionate norm-growth to enable long-term sequential editing, where we are able to perform up to 10,000 sequential edits without loss of downstream performance. ENCORE is also 61% faster than MEMIT and 64% faster than AlphaEdit on Llama3-8B.
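The norm-constraint idea can be pictured generically. This is not the paper's actual algorithm (which operates inside locate-then-edit updates); it is a sketch with an invented `max_growth` bound showing how an edit can be rescaled so the edited matrix's norm cannot grow disproportionately:

```python
# Minimal sketch of norm-constrained editing: after an edit produces W_new,
# shrink the update so the matrix's Frobenius norm grows by at most
# `max_growth` over the original.

import math

def frob(W):
    """Frobenius norm of a matrix given as nested lists."""
    return math.sqrt(sum(x * x for row in W for x in row))

def norm_constrained_edit(W_old, W_new, max_growth=1.05):
    """Blend W_old and W_new so that ||W||_F <= max_growth * ||W_old||_F."""
    limit = max_growth * frob(W_old)
    if frob(W_new) <= limit:
        return W_new
    # Bisect on the blend factor of the edit direction (W_new - W_old).
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        W = [[a + mid * (b - a) for a, b in zip(ra, rb)]
             for ra, rb in zip(W_old, W_new)]
        if frob(W) <= limit:
            lo = mid
        else:
            hi = mid
    return [[a + lo * (b - a) for a, b in zip(ra, rb)]
            for ra, rb in zip(W_old, W_new)]

W_old = [[1.0, 0.0], [0.0, 1.0]]
W_new = [[5.0, 0.0], [0.0, 5.0]]   # an edit that blows up the norm
W = norm_constrained_edit(W_old, W_new)
print(round(frob(W), 3))  # ≈ 1.05 * ||W_old|| ≈ 1.485
```

The same constraint, applied edit after edit, is what prevents the "importance hacking" the abstract describes, where ever-growing norms let edited layers dominate the output.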
[NLP-2] LLM-TA: An LLM-Enhanced Thematic Analysis Pipeline for Transcripts from Parents of Children with Congenital Heart Disease AAAI2025 ALT
[Quick Read]: This paper addresses the resource intensity and poor scalability of Thematic Analysis (TA) when applied to large, complex healthcare datasets. It proposes an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline that integrates an affordable state-of-the-art LLM (GPT-4o mini), LangChain, and prompt engineering with chunking techniques to analyze interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease. The key is using this pipeline to improve scalability, efficiency, and accuracy while reducing analyst workload, with close collaboration with domain experts emphasized throughout.
Link: https://arxiv.org/abs/2502.01620
Authors: Muhammad Zain Raza, Jiawei Xu, Terence Lim, Lily Boddy, Carlos M. Mery, Andrew Well, Ying Ding
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted by GenAI for Health Workshop @ AAAI 2025, Philadelphia
Abstract:Thematic Analysis (TA) is a fundamental method in healthcare research for analyzing transcript data, but it is resource-intensive and difficult to scale for large, complex datasets. This study investigates the potential of large language models (LLMs) to augment the inductive TA process in high-stakes healthcare settings. Focusing on interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we propose an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline. Our pipeline integrates an affordable state-of-the-art LLM (GPT-4o mini), LangChain, and prompt engineering with chunking techniques to analyze nine detailed transcripts following the inductive TA framework. We evaluate the LLM-generated themes against human-generated results using thematic similarity metrics, LLM-assisted assessments, and expert reviews. Results demonstrate that our pipeline outperforms existing LLM-assisted TA methods significantly. While the pipeline alone has not yet reached human-level quality in inductive TA, it shows great potential to improve scalability, efficiency, and accuracy while reducing analyst workload when working collaboratively with domain experts. We provide practical recommendations for incorporating LLMs into high-stakes TA workflows and emphasize the importance of close collaboration with domain experts to address challenges related to real-world applicability and dataset complexity. this https URL
[NLP-3] Learning to Generate Unit Tests for Automated Debugging
[Quick Read]: This paper addresses a trade-off in automated unit test (UT) generation: generating unit test inputs that reveal errors versus correctly predicting unit test outputs. The key is UTGen, which trains LLMs to generate error-revealing unit test inputs together with correct expected outputs from task descriptions and candidate code. UTGen is further integrated into UTDebug, a robust debugging pipeline that uses the generated tests to help LLMs debug code more effectively, scaling test-time compute for UT output prediction and validating and back-tracking edits over multiple generated UTs to mitigate the noisy signals in model-generated tests.
Link: https://arxiv.org/abs/2502.01619
Authors: Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: First two authors contributed equally. Dataset and Code: this https URL
Abstract:Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to a large language model (LLM) as it iteratively debugs faulty code, motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given a faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions and candidate code. We integrate UTGen into UTDebug, a robust debugging pipeline that uses generated tests to help LLMs debug effectively. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), UTDebug (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and back-tracks edits based on multiple generated UTs to avoid overfitting. We show that UTGen outperforms UT generation baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen’s unit tests improves pass@1 accuracy of Qwen-2.5 7B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3% and 12.35% (respectively) over other LLM-based UT generation baselines.
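The validate-and-backtrack control flow can be sketched abstractly. Everything below is stubbed (the real pipeline queries an LLM for candidate edits and generated unit tests); it only illustrates accepting an edit when it passes more generated tests, and backtracking otherwise:

```python
# Hedged sketch of a UTDebug-style validate-and-backtrack loop with stub
# "model outputs": an edit is kept only if it improves the pass count on
# the generated unit tests; otherwise it is discarded (backtracked).

def run_tests(func, unit_tests):
    """Count how many (input, expected_output) pairs the function satisfies."""
    passed = 0
    for args, expected in unit_tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed

# Generated unit tests for an absolute-value task; the last expected output
# is wrong on purpose, mimicking a noisy model-generated test.
unit_tests = [((3,), 3), ((-4,), 4), ((0,), 1)]  # last test is noisy

buggy = lambda x: x                       # fails on negatives
candidate_edits = [
    lambda x: -x,                         # worse edit: fails on positives
    lambda x: x if x >= 0 else -x,        # good edit
]

current, best = buggy, run_tests(buggy, unit_tests)
for edit in candidate_edits:
    score = run_tests(edit, unit_tests)
    if score > best:      # validate: keep only improving edits
        current, best = edit, score
    # else: backtrack (discard the edit)

print(best, current(-7))
```

Note how majority pass counts keep the noisy third test from derailing the search: the good edit still scores strictly higher than both the buggy program and the worse edit.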
[NLP-4] Large Language Models Are Human-Like Internally
[Quick Read]: This paper re-evaluates the cognitive plausibility of large language models (LLMs), which prior work judged low because larger models fit human reading behavior more poorly. The key, drawn from mechanistic interpretability, is that earlier conclusions were skewed by an exclusive focus on the final layers: next-word probabilities from earlier internal layers of larger models align with human sentence-processing data as well as, or better than, those of smaller models. This alignment holds across behavioral measures (self-paced reading times, gaze durations, MAZE task processing times) and neurophysiological measures (N400 brain potentials). The paper also identifies an intriguing correspondence between model layers and human measures: earlier layers align more closely with fast gaze durations, while later layers better match slower signals such as N400 potentials and MAZE processing times. This suggests the cognitive plausibility of larger LMs has been underestimated and opens new avenues at the intersection of mechanistic interpretability and cognitive modeling.
Link: https://arxiv.org/abs/2502.01615
Authors: Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, Timothy Baldwin
Affiliations: MBZUAI; The University of Tokyo; University of Mons; Tohoku University; RIKEN; The University of Melbourne
Subjects: Computation and Language (cs.CL)
Comments: 19 pages
Abstract:Recent cognitive modeling studies have reported that larger language models (LMs) exhibit a poorer fit to human reading behavior, leading to claims of their cognitive implausibility. In this paper, we revisit this argument through the lens of mechanistic interpretability and argue that prior conclusions were skewed by an exclusive focus on the final layers of LMs. Our analysis reveals that next-word probabilities derived from internal layers of larger LMs align with human sentence processing data as well as, or better than, those from smaller LMs. This alignment holds consistently across behavioral (self-paced reading times, gaze durations, MAZE task processing times) and neurophysiological (N400 brain potentials) measures, challenging earlier mixed results and suggesting that the cognitive plausibility of larger LMs has been underestimated. Furthermore, we first identify an intriguing relationship between LM layers and human measures: earlier layers correspond more closely with fast gaze durations, while later layers better align with relatively slower signals such as N400 potentials and MAZE processing times. Our work opens new avenues for interdisciplinary research at the intersection of mechanistic interpretability and cognitive modeling.
[NLP-5] Breaking Focus: Contextual Distraction Curse in Large Language Models
[Quick Read]: This paper addresses Contextual Distraction Vulnerability (CDV) in LLMs: a marked drop in performance when inputs contain semantically coherent but irrelevant context. The key is an efficient tree-based search method that automatically generates CDV examples, which cause an average performance degradation of approximately 45% in state-of-the-art LLMs across four datasets, together with targeted post-training strategies that effectively enhance robustness against contextual distractions.
Link: https://arxiv.org/abs/2502.01609
Authors: Yue Huang, Yanbo Wang, Zixiang Xu, Chujie Gao, Siyuan Wu, Jiayi Ye, Xiuying Chen, Pin-Yu Chen, Xiangliang Zhang
Affiliations: University of Notre Dame; MBZUAI; MBZUAI; MBZUAI; Independent Researcher; Independent Researcher; MBZUAI; IBM Research; University of Notre Dame
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent advances in Large Language Models (LLMs) have revolutionized generative systems, achieving excellent performance across diverse domains. Although these models perform well in controlled environments, their real-world applications frequently encounter inputs containing both essential and irrelevant details. Our investigation has revealed a critical vulnerability in LLMs, which we term Contextual Distraction Vulnerability (CDV). This phenomenon arises when models fail to maintain consistent performance on questions modified with semantically coherent but irrelevant context. To systematically investigate this vulnerability, we propose an efficient tree-based search methodology to automatically generate CDV examples. Our approach successfully generates CDV examples across four datasets, causing an average performance degradation of approximately 45% in state-of-the-art LLMs. To address this critical issue, we explore various mitigation strategies and find that post-targeted training approaches can effectively enhance model robustness against contextual distractions. Our findings highlight the fundamental nature of CDV as an ability-level challenge rather than a knowledge-level issue since models demonstrate the necessary knowledge by answering correctly in the absence of distractions. This calls the community’s attention to address CDV during model development to ensure reliability. The code is available at this https URL.
[NLP-6] FutureVision: A methodology for the investigation of future cognition
[Quick Read]: This paper investigates the cognitive effort involved in understanding communication about future scenarios, proposing a methodology that combines multimodal semantic analysis with an eye-tracking experimental protocol. The key is recording participants' gaze patterns while they evaluate fictional ads depicting future scenarios, then analyzing those patterns against semantic representations of the stimuli and of the participants' descriptions, revealing that different types of future scenarios induce different cognitive loads.
Link: https://arxiv.org/abs/2502.01597
Authors: Tiago Timponi Torrent, Mark Turner, Nicolás Hinrichs, Frederico Belcavello, Igor Lourenço, Arthur Lorenzi Almeida, Marcelo Viridiano, Ely Edison Matos
Affiliations: Federal University of Juiz de Fora; CNPq; Federal University of Uberlândia; Case Western Reserve University; FrameNet Brasil
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper presents a methodology combining multimodal semantic analysis with an eye-tracking experimental protocol to investigate the cognitive effort involved in understanding the communication of future scenarios. To demonstrate the methodology, we conduct a pilot study examining how visual fixation patterns vary during the evaluation of valence and counterfactuality in fictional ad pieces describing futuristic scenarios, using a portable eye tracker. Participants' eye movements are recorded while evaluating the stimuli and describing them to a conversation partner. Gaze patterns are analyzed alongside semantic representations of the stimuli and participants' descriptions, constructed from a frame semantic annotation of both linguistic and visual modalities. Preliminary results show that far-future and pessimistic scenarios are associated with longer fixations and more erratic saccades, supporting the hypothesis that fractures in the base spaces underlying the interpretation of future scenarios increase cognitive load for comprehenders.
[NLP-7] ReGLA: Refining Gated Linear Attention NAACL2025
[Quick Read]: LLMs excel at complex language modeling but incur heavy compute and storage costs due to the quadratic complexity of softmax attention; linear attention was designed to reduce the quadratic space-time complexity of standard transformers. The key of this paper is a comprehensive study of three components of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. It proposes a feature-mapping function that fixes issues overlooked by prior work, argues for integrating normalization layers to stabilize training, and augments the gating mechanism with a refining module to address its saturation.
Link: https://arxiv.org/abs/2502.01578
Authors: Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Boxing Chen, Philippe Langlais
Affiliations: DIRO, Université de Montréal; Huawei Noah's Ark Lab
Subjects: Computation and Language (cs.CL)
Comments: Accepted by NAACL 2025 (main)
Abstract:Recent advancements in Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic computation complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address some crucial issues that previous suggestions overlooked. Then we offered further rationale for the integration of normalization layers to stabilize the training process. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. We conducted extensive experiments and showed our architecture outperforms previous Gated Linear Attention mechanisms in extensive tasks including training from scratch and post-linearization with continual pre-training.
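For readers unfamiliar with the mechanism being refined, a gated linear attention step is the recurrence S_t = g_t * S_{t-1} + k_t^T v_t with output o_t = q_t S_t. The sketch below uses a scalar sigmoid gate and an identity feature map purely for illustration; the paper's contributions lie precisely in richer feature maps, normalization, and a refined gate:

```python
# Toy gated linear attention: the state S accumulates k^T v outer products,
# decayed by a gate in (0, 1), giving linear-time sequence processing.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_linear_attention(qs, ks, vs, gates):
    d = len(qs[0])
    S = [[0.0] * len(vs[0]) for _ in range(d)]   # running state, d x d_v
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gates):
        gt = sigmoid(g)
        for i in range(d):                        # S = gt * S + k^T v
            for j in range(len(v)):
                S[i][j] = gt * S[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * S[i][j] for i in range(d))
                        for j in range(len(v))])  # o = q S
    return outputs

qs = [[1.0, 0.0], [0.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0]]
vs = [[2.0, 0.0], [0.0, 3.0]]
outs = gated_linear_attention(qs, ks, vs, gates=[0.0, 0.0])
print(outs)
```

Because the state has fixed size d x d_v, cost per token is constant, in contrast to softmax attention's growth with sequence length; the gate is what lets the model forget stale context.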
[NLP-8] Visual Theory of Mind Enables the Invention of Writing Systems
[Quick Read]: This paper studies the origin and evolution of early writing systems, whose earliest forms consisted of iconic pictographs. The key is a multi-agent reinforcement-learning testbed called the Signification Game, in which agents leverage visual theory of mind for inferential communication, using pictographs to convey actions. The model sheds light on the cognitive and cultural processes that led to the development of early writing systems and informs our understanding of human and animal cognition.
Link: https://arxiv.org/abs/2502.01568
Authors: Benjamin A. Spiegel, Lucas Gelfond, George Konidaris
Affiliations: Department of Computer Science, Brown University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: In submission to CogSci 2025
Abstract:Abstract symbolic writing systems are semiotic codes that are ubiquitous in modern society but are otherwise absent in the animal kingdom. Anthropological evidence suggests that the earliest forms of some writing systems originally consisted of iconic pictographs, which signify their referent via visual resemblance. While previous studies have examined the emergence and, separately, the evolution of pictographic writing systems through a computational lens, most employ non-naturalistic methodologies that make it difficult to draw clear analogies to human and animal cognition. We develop a multi-agent reinforcement learning testbed for emergent communication called a Signification Game, and formulate a model of inferential communication that enables agents to leverage visual theory of mind to communicate actions using pictographs. Our model, which is situated within a broader formalism for animal communication, sheds light on the cognitive and cultural processes that led to the development of early writing systems.
[NLP-9] Scalable Language Models with Posterior Inference of Latent Thought Vectors
[Quick Read]: This paper addresses the limited sample and parameter efficiency of conventional language models. It proposes Latent-Thought Language Models (LTMs), whose key is explicit latent thought vectors that follow an explicit prior model and guide the autoregressive generation of ground tokens through a Transformer decoder. LTMs are trained with a dual-rate optimization procedure within the classical variational Bayes framework: fast learning of local variational parameters and slow learning of global decoder parameters. This design lets LTMs surpass conventional autoregressive models and discrete diffusion models in sample and parameter efficiency, and exhibit emergent few-shot in-context reasoning that scales with model and latent size.
Link: https://arxiv.org/abs/2502.01567
Authors: Deqian Kong, Minglu Zhao, Dehong Xu, Bo Pang, Shu Wang, Edouardo Honig, Zhangzhang Si, Chuan Li, Jianwen Xie, Sirui Xie, Ying Nian Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We propose a novel family of language models, Latent-Thought Language Models (LTMs), which incorporate explicit latent thought vectors that follow an explicit prior model in latent space. These latent thought vectors guide the autoregressive generation of ground tokens through a Transformer decoder. Training employs a dual-rate optimization process within the classical variational Bayes framework: fast learning of local variational parameters for the posterior distribution of latent vectors, and slow learning of global decoder parameters. Empirical studies reveal that LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. Higher sample efficiency can be achieved by increasing training compute per token, with further gains possible by trading model size for more inference steps. Designed based on these scaling properties, LTMs demonstrate superior sample and parameter efficiency compared to conventional autoregressive models and discrete diffusion models. They significantly outperform these counterparts in validation perplexity and zero-shot language modeling. Additionally, LTMs exhibit emergent few-shot in-context reasoning capabilities that scale with model and latent size, and achieve competitive performance in conditional and unconditional text generation.
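The dual-rate idea, fast updates for per-sample variational parameters and slow updates for shared decoder weights, can be illustrated on a toy quadratic objective. This is a generic two-timescale gradient sketch, not the paper's training code; the objective and learning rates are invented for illustration:

```python
# Two-timescale gradient descent: the "fast" variable z plays the role of
# local variational parameters, the "slow" variable w the shared decoder
# weights. Toy objective: f(z, w) = (z - w)^2 + w^2, minimized at z = w = 0.

def dual_rate_step(fast, slow, grad_fast, grad_slow,
                   lr_fast=0.5, lr_slow=0.01):
    return fast - lr_fast * grad_fast, slow - lr_slow * grad_slow

z, w = 5.0, 1.0
for _ in range(100):
    gz = 2 * (z - w)            # df/dz: fast inner fit of z toward w
    gw = -2 * (z - w) + 2 * w   # df/dw: slow drift of w
    z, w = dual_rate_step(z, w, gz, gw)

print(round(z, 3), round(w, 3))
```

With the large fast rate, z snaps onto the current w in a single step, after which the slow variable drifts toward the optimum; this separation of timescales is the point of the dual-rate scheme.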
[NLP-10] Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding
[Quick Read]: This paper investigates the concentrated massive values that emerge in the query (Q) and key (K) representations of attention in LLMs and analyzes their role. The key findings are that these massive values concentrate in Q and K but not in values (V), and that they are critical for interpreting contextual knowledge rather than for retrieving parametric knowledge stored in model parameters. A study of quantization strategies further shows that ignoring these massive values markedly degrades performance on tasks requiring rich contextual understanding. Finally, the paper traces the concentration to Rotary Positional Encoding (RoPE), present from the first layers, offering a new perspective on how Q and K operate in LLMs and practical insights for model design and optimization.
Link: https://arxiv.org/abs/2502.01563
Authors: Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang
Affiliations: Rutgers University; Carnegie Mellon University; New Jersey Institute of Technology; University of Minnesota
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show that concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K) while not having such patterns in values (V) in various modern transformer-based LLMs (Q, K, and V mean the representations output by the query, key, and value layers respectively). Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model’s parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE), which has appeared since the first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The Code is Available at this https URL.
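A minimal way to flag "massive values" in an activation vector is to compare each dimension's magnitude against a robust baseline. This is illustrative only (the paper measures real Q/K activations across transformer layers, and the 10x threshold here is an arbitrary choice); the median is used rather than the mean so that the outlier itself does not inflate the baseline:

```python
# Flag dimensions whose magnitude dwarfs the typical magnitude of the
# vector, using the median of absolute values as a robust baseline.

import statistics

def massive_dims(vector, ratio=10.0):
    med = statistics.median(abs(x) for x in vector)
    return [i for i, x in enumerate(vector) if abs(x) > ratio * med]

q = [0.1, -0.2, 0.15, 50.0, 0.05, -0.1, 0.2, 0.12]  # dim 3 is massive
print(massive_dims(q))  # → [3]
```

The same per-dimension view explains the quantization finding: a scheme that clips or coarsely buckets these few huge Q/K coordinates destroys exactly the signal the paper ties to contextual understanding.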
[NLP-11] What is a Number That a Large Language Model May Know It?
[Quick Read]: This paper examines a challenge LLMs face with digit sequences: depending on context, the same sequence can be read as a string or as a number. The key finding is that models learn representational spaces blending string-like and numerical representations, producing entanglement in the latent embeddings; this entanglement is modulated by context but never fully eliminated, and can propagate into realistic decision scenarios. Using a similarity-based prompting technique, the paper shows that elicited similarity judgments over integer pairs are captured by a combination of Levenshtein edit distance and numerical log-linear distance, evidencing the entangled representation.
Link: https://arxiv.org/abs/2502.01540
Authors: Raja Marjieh, Veniamin Veselovsky, Thomas L. Griffiths, Ilia Sucholutsky
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures
Abstract:Numbers are a basic part of how humans represent and describe the world around them. As a consequence, learning effective representations of numbers is critical for the success of large language models as they become more integrated into everyday decisions. However, these models face a challenge: depending on context, the same sequence of digit tokens, e.g., 911, can be treated as a number or as a string. What kind of representations arise from this duality, and what are its downstream implications? Using a similarity-based prompting technique from cognitive science, we show that LLMs learn representational spaces that blend string-like and numerical representations. In particular, we show that elicited similarity judgments from these models over integer pairs can be captured by a combination of Levenshtein edit distance and numerical Log-Linear distance, suggesting an entangled representation. In a series of experiments we show how this entanglement is reflected in the latent embeddings, how it can be reduced but not entirely eliminated by context, and how it can propagate into a realistic decision scenario. These results shed light on a representational tension in transformer models that must learn what a number is from text input.
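The two distances whose combination is said to capture the models' similarity judgments are both easy to compute. The equal-weight `blended` mix below is an invented placeholder; the paper fits the combination to the elicited judgments:

```python
# String-side distance (Levenshtein over digit strings) versus number-side
# distance (difference on a log scale), and a toy weighted blend.

import math

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def log_linear(x: int, y: int) -> float:
    """Numerical distance on a logarithmic scale."""
    return abs(math.log(x) - math.log(y))

def blended(x: int, y: int, w: float = 0.5) -> float:
    return w * levenshtein(str(x), str(y)) + (1 - w) * log_linear(x, y)

# 911 vs 119: string-wise close (same digits rearranged), numerically far.
print(levenshtein("911", "119"), round(log_linear(911, 119), 3))
```

Pairs like (911, 119) are exactly where the two metrics pull apart, which is what lets a fitted blend diagnose how much of each representation a model is using.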
[NLP-12] VisTA: Vision-Text Alignment Model with Contrastive Learning using Multimodal Data for Evidence-Driven Reliable and Explainable Alzheimer's Disease Diagnosis
[Quick Read]: This paper tackles the clinical importance and difficulty of assessing Alzheimer's disease (AD) from high-dimensional medical images. The key is VisTA, a multimodal language-vision model optimized with contrastive learning to improve disease-prediction accuracy and evidence-driven, interpretable explanations in support of clinical decision-making.
Link: https://arxiv.org/abs/2502.01535
Authors: Duy-Cat Can, Linh D. Dang, Quang-Huy Tang, Dang Minh Ly, Huong Ha, Guillaume Blanc, Oliver Y. Chén, Binh T. Nguyen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Objective: Assessing Alzheimer’s disease (AD) using high-dimensional radiology images is clinically important but challenging. Although Artificial Intelligence (AI) has advanced AD diagnosis, it remains unclear how to design AI models embracing predictability and explainability. Here, we propose VisTA, a multimodal language-vision model assisted by contrastive learning, to optimize disease prediction and evidence-based, interpretable explanations for clinical decision-making. Methods: We developed VisTA (Vision-Text Alignment Model) for AD diagnosis. Architecturally, we built VisTA from BiomedCLIP and fine-tuned it using contrastive learning to align images with verified abnormalities and their descriptions. To train VisTA, we used a constructed reference dataset containing images, abnormality types, and descriptions verified by medical experts. VisTA produces four outputs: predicted abnormality type, similarity to reference cases, evidence-driven explanation, and final AD diagnoses. To illustrate VisTA’s efficacy, we reported accuracy metrics for abnormality retrieval and dementia prediction. To demonstrate VisTA’s explainability, we compared its explanations with human experts’ explanations. Results: Compared to 15 million images used for baseline pretraining, VisTA only used 170 samples for fine-tuning and obtained significant improvement in abnormality retrieval and dementia prediction. For abnormality retrieval, VisTA reached 74% accuracy and an AUC of 0.87 (26% and 0.74, respectively, from baseline models). For dementia prediction, VisTA achieved 88% accuracy and an AUC of 0.82 (30% and 0.57, respectively, from baseline models). The generated explanations agreed strongly with human experts’ and provided insights into the diagnostic process. Taken together, VisTA optimizes prediction, clinical reasoning, and explanation.
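The contrastive-alignment objective underlying this kind of fine-tuning can be sketched in miniature. Toy 2-D embeddings and an invented candidate set stand in for real batched image-text pairs with learned encoders:

```python
# CLIP-style contrastive loss for one image against candidate texts:
# -log softmax of the matching text's (scaled) cosine similarity.

import math

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_loss(img, pos_txt, neg_txts, temp=0.1):
    logits = [cos(img, pos_txt) / temp] + [cos(img, t) / temp for t in neg_txts]
    m = max(logits)                                  # for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

img = [1.0, 0.0]
loss_aligned = contrastive_loss(img, [0.9, 0.1], [[0.0, 1.0]])
loss_misaligned = contrastive_loss(img, [0.0, 1.0], [[0.9, 0.1]])
print(loss_aligned < loss_misaligned)  # aligned pair → lower loss
```

Minimizing this loss pulls an image's embedding toward its verified abnormality description and away from the others, which is what lets retrieval over reference cases work with so few fine-tuning samples.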
[NLP-13] Preference Leakage: A Contamination Problem in LLM-as-a-judge
[Quick Read]: This paper addresses preference leakage in LLM-as-a-judge: contamination arising from relatedness between synthetic-data generators and LLM-based evaluators, which biases evaluation and thus model training and assessment. The key is defining and validating three common types of relatedness, being the same model, having an inheritance relationship, and belonging to the same model family, and empirically confirming the resulting judge bias through extensive experiments. The results show that preference leakage is a pervasive problem, harder to detect than previously identified biases, and in need of further attention.
Link: https://arxiv.org/abs/2502.01534
Authors: Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 8 figures
Abstract:Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between data generator LLM and judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive issue that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: this https URL.
[NLP-14] The in-context inductive biases of vision-language models differ across modalities
[Quick Read]: This paper investigates how foundation models use inductive biases to generalize during in-context learning, and how that generalization differs depending on the modality (vision vs. text) in which stimuli are presented. Across three experimental paradigms and three vision-language models, the study finds a general bias toward generalizing by shape over color, amplified when examples are presented visually; when examples are presented in text, the ordering of adjectives affects generalization, though the extent of these effects varies across models and paradigms. These findings help reveal how vision-language models represent different input types in context and may have practical implications for their use.
Link: https://arxiv.org/abs/2502.01530
Authors: Kelsey Allen, Ishita Dasgupta, Eliza Kosoy, Andrew K. Lampinen
Affiliations: Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages
Abstract:Inductive biases are what allow learners to make guesses in the absence of conclusive evidence. These biases have often been studied in cognitive science using concepts or categories – e.g. by testing how humans generalize a new category from a few examples that leave the category boundary ambiguous. We use these approaches to study generalization in foundation models during in-context learning. Modern foundation models can condition on both vision and text, and differences in how they interpret and learn from these different modalities is an emerging area of study. Here, we study how their generalizations vary by the modality in which stimuli are presented, and the way the stimuli are described in text. We study these biases with three different experimental paradigms, across three different vision-language models. We find that the models generally show some bias towards generalizing according to shape over color. This shape bias tends to be amplified when the examples are presented visually. By contrast, when examples are presented in text, the ordering of adjectives affects generalization. However, the extent of these effects vary across models and paradigms. These results help to reveal how vision-language models represent different types of inputs in context, and may have practical implications for the use of vision-language models.
[NLP-15] CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
[Quick Read]: This paper addresses LLM hallucinations on ambiguous questions in question answering (QA). The key is a set of condition-aware evaluation metrics and the Conditional Ambiguous Question-Answering (CondAmbigQA) benchmark of 200 ambiguous queries, built with a retrieval-based annotation strategy that uses Wikipedia fragments to identify possible interpretations of a query as its "conditions", minimizing the human bias introduced by annotators' differing knowledge levels. Experiments show that models which consider conditions before answering improve performance by 20%, with a further 5% gain when conditions are explicitly provided, underscoring the value of conditional reasoning in QA and giving researchers tools to rigorously evaluate ambiguity resolution.
Link: https://arxiv.org/abs/2502.01523
Authors: Zongxi Li, Yang Li, Haoran Xie, S. Joe Qin
Affiliations: School of Data Science, Lingnan University, Hong Kong SAR; School of Science and Technology, Hong Kong Metropolitan University, Hong Kong SAR
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are prone to hallucinations in question-answering (QA) tasks when faced with ambiguous questions. Users often assume that LLMs share their cognitive alignment, a mutual understanding of context, intent, and implicit details, leading them to omit critical information in the queries. However, LLMs generate responses based on assumptions that can misalign with user intent; such responses may be perceived as hallucinations. Therefore, identifying those implicit assumptions is crucial to resolve ambiguities in QA. Prior work, such as AmbigQA, reduces ambiguity in queries via human-annotated clarifications, which is not feasible in real applications. Meanwhile, ASQA compiles AmbigQA’s short answers into long-form responses but inherits human biases and fails to capture explicit logical distinctions that differentiate the answers. We introduce Conditional Ambiguous Question-Answering (CondAmbigQA), a benchmark with 200 ambiguous queries and condition-aware evaluation metrics. Our study pioneers the concept of “conditions” in ambiguous QA tasks, where conditions stand for contextual constraints or assumptions that resolve ambiguities. The retrieval-based annotation strategy uses retrieved Wikipedia fragments to identify possible interpretations for a given query as its conditions and annotate the answers through those conditions. Such a strategy minimizes human bias introduced by different knowledge levels among annotators. By fixing retrieval results, CondAmbigQA evaluates how RAG systems leverage conditions to resolve ambiguities. Experiments show that models considering conditions before answering improve performance by 20%, with an additional 5% gain when conditions are explicitly provided. These results underscore the value of conditional reasoning in QA, offering researchers tools to rigorously evaluate ambiguity resolution.
zh
[NLP-16] Hybrid Machine Learning Model for Detecting Bangla Smishing Text Using BERT and Character-Level CNN CEC
[Quick Read]: This paper addresses the problem of SMS fraud (smishing) in the Bangla-language setting. The key to the solution is a novel hybrid machine learning model that combines Bidirectional Encoder Representations from Transformers (BERT) with Convolutional Neural Networks (CNNs) for enhanced character-level analysis, distinguishing between Normal, Promotional, and Smishing SMS via multi-class classification. Unlike traditional binary classification approaches, the model fuses BERT's contextual embeddings with CNN's character-level features, improving detection accuracy. An attention mechanism further strengthens the model's ability to prioritize the crucial segments of each text.
Link: https://arxiv.org/abs/2502.01518
Authors: Gazi Tanbhir, Md. Farhan Shahriyar, Khandker Shahed, Abdullah Md Raihan Chy, Md Al Adnan
Affiliations: World University of Bangladesh; Jashore University of Science and Technology; Southern University Bangladesh; Institute of Information Technology, Noakhali Science and Technology University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Conference Name: 13th International Conference on Electrical and Computer Engineering (ICECE 2024)
Click to view abstract
Abstract:Smishing is a social engineering attack using SMS containing malicious content to deceive individuals into disclosing sensitive information or transferring money to cybercriminals. Smishing attacks have surged by 328%, posing a major threat to mobile users, with losses exceeding $54.2 million in 2019. Despite its growing prevalence, the issue remains significantly under-addressed. This paper presents a novel hybrid machine learning model for detecting Bangla smishing texts, combining Bidirectional Encoder Representations from Transformers (BERT) with Convolutional Neural Networks (CNNs) for enhanced character-level analysis. Our model addresses multi-class classification by distinguishing between Normal, Promotional, and Smishing SMS. Unlike traditional binary classification methods, our approach integrates BERT's contextual embeddings with CNN's character-level features, improving detection accuracy. Enhanced by an attention mechanism, the model effectively prioritizes crucial text segments. Our model achieves 98.47% accuracy, outperforming traditional classifiers, with high precision and recall in Smishing detection, and strong performance across all categories.
zh
[NLP-17] Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation
[Quick Read]: This paper investigates how instance-level memorization in a teacher Neural Machine Translation (NMT) model is inherited by the student model during sequence-level knowledge distillation (SeqKD). The study finds that although students never directly see the original training data, they memorize more than baseline models (3.4% more exact matches and 57% more extractive memorization) and show increased hallucination rates. The paper further characterizes student behavior on specific training-data subgroups, in particular low-quality subgroups and subgroups with specific counterfactual memorization (CM) scores, finding that students exhibit amplified denoising on low-quality subgroups. To address these issues, the paper proposes a modification to SeqKD named Adaptive-SeqKD, which intervenes in SeqKD to reduce memorization and hallucinations. Overall, the paper recommends caution when applying SeqKD: students inherit both their teachers' strengths and their fault modes, requiring active monitoring.
Link: https://arxiv.org/abs/2502.01491
Authors: Verna Dankers, Vikas Raunak
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:In this work, we explore how instance-level memorization in the teacher Neural Machine Translation (NMT) model gets inherited by the student model in sequence-level knowledge distillation (SeqKD). We find that despite not directly seeing the original training data, students memorize more than baseline models (models of the same size, trained on the original data) – 3.4% for exact matches and 57% for extractive memorization – and show increased hallucination rates. Further, under this SeqKD setting, we also characterize how students behave on specific training data subgroups, such as subgroups with low quality and specific counterfactual memorization (CM) scores, and find that students exhibit amplified denoising on low-quality subgroups. Finally, we propose a modification to SeqKD named Adaptive-SeqKD, which intervenes in SeqKD to reduce memorization and hallucinations. Overall, we recommend caution when applying SeqKD: students inherit both their teachers’ superior performance and their fault modes, thereby requiring active monitoring.
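The two memorization rates quoted above can be made concrete with a small sketch. This is illustrative code, not the authors' implementation: the criteria here (verbatim equality for exact-match memorization, and copying a contiguous span of at least five tokens from the training target for extractive memorization) are simplifying assumptions.

```python
# Illustrative measurement of two memorization signals for student outputs
# against the teacher's training targets (not the paper's code).

def exact_match_rate(outputs, references):
    """Fraction of outputs identical to the corresponding training target."""
    hits = sum(o == r for o, r in zip(outputs, references))
    return hits / len(outputs)

def extractive_rate(outputs, references, min_len=5):
    """Fraction of outputs that copy a contiguous span of >= min_len tokens
    verbatim from the training target -- a rough proxy for 'extractive
    memorization' (the threshold is an assumption)."""
    def copies_span(out, ref):
        ref_tokens = ref.split()
        for i in range(len(ref_tokens) - min_len + 1):
            if " ".join(ref_tokens[i:i + min_len]) in out:
                return True
        return False
    hits = sum(copies_span(o, r) for o, r in zip(outputs, references))
    return hits / len(outputs)
```

In practice both rates would be compared between the distilled student and a same-size baseline trained on the original data.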
zh
[NLP-18] Explaining Context Length Scaling and Bounds for Language Models
[Quick Read]: This paper aims to improve understanding of how long context affects language model performance. The key contribution is a clean and effective theoretical framework, from an intrinsic-space perspective, explaining the impact of context length on language modeling, validated through experiments on natural language and synthetic data. The framework yields practical insights, such as establishing that training dataset size dictates an optimal context length and bounding context-length scaling in certain cases.
Link: https://arxiv.org/abs/2502.01481
Authors: Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang, Serge Belongie, Lei Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 19 pages, 14 figures
Click to view abstract
Abstract:Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context can harm performance, while others experimentally summarize the loss reduction from relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose a clean and effective theoretical framework for explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling in certain cases. We hope our work may inspire new long context Language Models, as well as future work studying Physics for Language Models. Code for our experiments is available at this url: this https URL.
zh
[NLP-19] FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model
[Quick Read]: This paper addresses the safety risk that large language models (LLMs) may inadvertently encode sensitive or harmful information. The key to the solution is FALCON (Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment), a representation-guided unlearning method that uses information-theoretic guidance for efficient parameter selection, employs contrastive mechanisms to enhance representation separation, and projects conflicting gradients onto orthogonal subspaces to resolve the conflict between forgetting and retention objectives, achieving more precise knowledge separation and stronger unlearning while preserving model utility.
Link: https://arxiv.org/abs/2502.01472
Authors: Jinwei Hu, Zhenglin Huang, Xiangyu Yin, Wenjie Ruan, Guangliang Cheng, Yi Dong, Xiaowei Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review
Click to view abstract
Abstract:Large language models have been widely applied, but can inadvertently encode sensitive or harmful information, raising significant safety concerns. Machine unlearning has emerged to alleviate this concern; however, existing training-time unlearning approaches, relying on coarse-grained loss combinations, have limitations in precisely separating knowledge and balancing removal effectiveness with model utility. In contrast, we propose Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment (FALCON), a novel representation-guided unlearning approach that leverages information-theoretic guidance for efficient parameter selection, employs contrastive mechanisms to enhance representation separation, and projects conflict gradients onto orthogonal subspaces to resolve conflicts between forgetting and retention objectives. Extensive experiments demonstrate that FALCON achieves superior unlearning effectiveness while maintaining model utility, exhibiting robust resistance against knowledge recovery attempts.
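The gradient-projection step named in the abstract can be sketched in isolation. This is a minimal illustration of the generic operation (removing from the forgetting gradient its component along the retention gradient so the two objectives stop conflicting), not FALCON's full procedure; representing gradients as flat lists of floats is a simplification.

```python
# Minimal sketch of orthogonal gradient projection -- the generic operation
# behind "projecting conflict gradients onto orthogonal subspaces".

def project_out(g_forget, g_retain):
    """Remove from g_forget its component along g_retain, so the forgetting
    update no longer directly opposes the retention objective."""
    dot = sum(a * b for a, b in zip(g_forget, g_retain))
    norm_sq = sum(b * b for b in g_retain)
    if norm_sq == 0.0:          # no retention direction to protect
        return list(g_forget)
    scale = dot / norm_sq
    return [a - scale * b for a, b in zip(g_forget, g_retain)]
```

The projected gradient is orthogonal to the retention gradient by construction, which is the property the method relies on.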
zh
[NLP-20] Process Reinforcement through Implicit Rewards
[Quick Read]: This paper tackles the challenges of replacing sparse outcome-level rewards with dense process rewards in the inference-time scaling of large language models (LLMs). The key is the proposed PRIME method, which updates process reward models (PRMs) online using only policy rollouts and outcome labels, avoiding the prohibitive cost of collecting high-quality process labels and reducing development overhead. This approach markedly improves performance on complex reasoning tasks, with significant gains across several benchmarks.
Link: https://arxiv.org/abs/2502.01456
Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
Affiliations: Tsinghua University; Shanghai AI Lab; University of Illinois Urbana-Champaign; Peking University; Shanghai Jiaotong University; CUHK
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages. Model, code, and data available at this https URL
Click to view abstract
Abstract:Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels, through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
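The "implicit process rewards" PRIME builds on come from the implicit-PRM line of work, where per-token rewards can be read off as a scaled log-probability ratio between the reward model and a reference model. A toy sketch under that formulation (the function name and the beta value are illustrative, not taken from the paper):

```python
# Hedged sketch of implicit process rewards: a PRM trained only on outcome
# labels yields token-level rewards as beta * (log pi - log pi_ref).
# Everything beyond this formula (names, beta) is an assumption.

def implicit_process_rewards(logp_model, logp_ref, beta=0.05):
    """Per-token implicit rewards from two parallel lists of token
    log-probabilities (reward model vs. frozen reference model)."""
    return [beta * (lm - lr) for lm, lr in zip(logp_model, logp_ref)]
```

A token where the reward model assigns higher probability than the reference receives a positive step-level reward, giving dense credit assignment without any process labels.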
zh
[NLP-21] Towards Safer Chatbots: A Framework for Policy Compliance Evaluation of Custom GPTs
[Quick Read]: This paper addresses the safety and compliance risks of customized large language models (Custom GPTs) in real-world use. The key to the solution is a scalable automated evaluation framework built on three core components: (1) automated discovery and data collection of models from the GPT Store; (2) a red-teaming prompt generator tailored to specific policy categories and the characteristics of each target GPT; and (3) an LLM-as-a-judge technique that analyzes each prompt-response pair for potential policy violations. The framework is validated through a large-scale study and reveals substantial non-compliance among these models.
Link: https://arxiv.org/abs/2502.01436
Authors: David Rodriguez, William Seymour, Jose M. Del Alamo, Jose Such
Affiliations: ETSI Telecomunicación, Universidad Politécnica de Madrid; King's College London; VRAIN, Universitat Politècnica de València
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have gained unprecedented prominence, achieving widespread adoption across diverse domains and integrating deeply into society. The capability to fine-tune general-purpose LLMs, such as Generative Pre-trained Transformers (GPT), for specific tasks has facilitated the emergence of numerous Custom GPTs. These tailored models are increasingly made available through dedicated marketplaces, such as OpenAI's GPT Store. However, their black-box nature introduces significant safety and compliance risks. In this work, we present a scalable framework for the automated evaluation of Custom GPTs against OpenAI's usage policies, which define the permissible behaviors of these systems. Our framework integrates three core components: (1) automated discovery and data collection of models from the GPT store, (2) a red-teaming prompt generator tailored to specific policy categories and the characteristics of each target GPT, and (3) an LLM-as-a-judge technique to analyze each prompt-response pair for potential policy violations. We validate our framework with a manually annotated ground truth, and evaluate it through a large-scale study with 782 Custom GPTs across three categories: Romantic, Cybersecurity, and Academic GPTs. Our manual annotation process achieved an F1 score of 0.975 in identifying policy violations, confirming the reliability of the framework's assessments. The results reveal that 58.7% of the analyzed models exhibit indications of non-compliance, exposing weaknesses in the GPT store's review and approval processes. Furthermore, our findings indicate that a model's popularity does not correlate with compliance, and non-compliance issues largely stem from behaviors inherited from base models rather than user-driven customizations. We believe this approach is extendable to other chatbot platforms and policy domains, improving the safety of LLM-based systems.
zh
[NLP-22] Emergent Stack Representations in Modeling Counter Languages Using Transformers
[Quick Read]: This paper investigates the inner workings of the Transformer architecture in learning formal languages by training models on counter languages. The key to the approach is to analyze Transformer models trained on counter languages, which can be equivalently formulated using stacks whose depths correspond to the counter values, and to probe the models' internal representations for the stack depth at each input token, revealing whether these models form stack-like internal representations. This brings us closer to understanding the algorithmic details of how Transformers learn algorithmic languages and aids circuit discovery.
Link: https://arxiv.org/abs/2502.01432
Authors: Utkarsh Tiwari, Aviral Gupta, Michael Hahn
Affiliations: Birla Institute of Technology and Science (BITS) Pilani; Saarland University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Transformer architectures are the backbone of most modern language models, but understanding the inner workings of these models still largely remains an open problem. One way that research in the past has tackled this problem is by isolating the learning capabilities of these architectures by training them over well-understood classes of formal languages. We extend this literature by analyzing models trained over counter languages, which can be modeled using counter variables. We train transformer models on 4 counter languages, and equivalently formulate these languages using stacks, whose depths can be understood as the counter values. We then probe their internal representations for stack depths at each input token to show that these models when trained as next token predictors learn stack-like representations. This brings us closer to understanding the algorithmic details of how transformers learn languages and helps in circuit discovery.
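The probing target is easy to state concretely: for a counter language such as Dyck-1 (balanced brackets), the "stack depth" at each token is just a running counter. A toy sketch (illustrative only; the paper trains on four counter languages, not necessarily this one):

```python
# Toy illustration of the probing target: the counter value ("stack depth")
# after consuming each token of a Dyck-1 string.

def stack_depths(s, push="(", pop=")"):
    """Running counter value after each token of the input string."""
    depth, depths = 0, []
    for ch in s:
        if ch == push:
            depth += 1
        elif ch == pop:
            depth -= 1
        depths.append(depth)
    return depths
```

These per-token depths are the labels a linear probe would be trained to recover from the model's hidden states.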
zh
[NLP-23] Originality in scientific titles and abstracts can predict citation count
[Quick Read]: This paper studies the originality of scientific literature by applying Divergent Semantic Integration (DSI), a computational measure correlating with originality, to 99,557 scientific abstracts and titles from the Web of Science. The study finds statistically significant differences in DSI across subjects and fields of research, and a slight rise in DSI over time. The key contribution is modeling citation counts with DSI and establishing a statistically significant positive correlation, with an adjusted R^2 of 0.13.
Link: https://arxiv.org/abs/2502.01417
Authors: Jack H. Culbert, Yoed N. Kenett, Philipp Mayr
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: 6 pages, 3 figures, submitted to ISSI 2025; research-in-progress paper
Click to view abstract
Abstract:In this research-in-progress paper, we apply a computational measure correlating with originality from creativity science: Divergent Semantic Integration (DSI), to a selection of 99,557 scientific abstracts and titles selected from the Web of Science. We observe statistically significant differences in DSI between subject and field of research, and a slight rise in DSI over time. We model the base 10 logarithm of the citation count after 5 years with DSI and find a statistically significant positive correlation in all fields of research with an adjusted R^2 of 0.13.
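The citation model is a simple regression of log10(citation count after 5 years) on DSI. The sketch below fits such a line on made-up numbers purely to show the shape of the analysis; the real data and the reported adjusted R^2 of 0.13 come from the paper's Web of Science sample, not from this toy.

```python
import math

# Ordinary least squares of log10(citations) on DSI, on synthetic data.

def fit_ols(x, y):
    """Slope and intercept of a simple least-squares regression line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

dsi = [0.70, 0.75, 0.80, 0.85, 0.90]              # made-up DSI scores
log_cites = [math.log10(c) for c in [5, 8, 10, 14, 20]]  # made-up counts
slope, intercept = fit_ols(dsi, log_cites)         # positive slope expected
```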
zh
[NLP-24] GRADIEND: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models
[Quick Read]: This paper addresses gender bias in AI systems, particularly in Transformer-based language models. The key to the solution is a novel encoder-decoder approach that leverages model gradients to learn a single monosemantic feature neuron encoding gender information, which can be used to debias the model effectively while preserving its other capabilities.
Link: https://arxiv.org/abs/2502.01406
Authors: Jonathan Drechsel, Steffen Herbold
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:AI systems frequently exhibit and amplify social biases, including gender bias, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a single monosemantic feature neuron encoding gender information. We show that our method can be used to debias transformer-based language models, while maintaining other capabilities. We demonstrate the effectiveness of our approach across multiple encoder-only based models and highlight its potential for broader applications.
zh
[NLP-25] AdaSVD: Adaptive Singular Value Decomposition for Large Language Models
[Quick Read]: This paper addresses the challenge of deploying large language models (LLMs) on resource-constrained devices, in particular their substantial memory requirements. Existing SVD-based methods struggle to mitigate the errors introduced by truncation during compression, and a uniform compression ratio cannot account for the varying importance of different layers. The key to the solution is AdaSVD, which dynamically compensates for SVD truncation errors via adaComp and adaptively assigns layer-specific compression ratios via adaCR, substantially reducing memory requirements while maintaining strong performance.
Link: https://arxiv.org/abs/2502.01403
Authors: Li Zhiteng, Xia Mingyuan, Zhang Jingyuan, Hui Zheng, Kong Linghe, Zhang Yulun, Yang Xiaokang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The code and models will be available at this https URL
Click to view abstract
Abstract:Large language models (LLMs) have achieved remarkable success in natural language processing (NLP) tasks, yet their substantial memory requirements present significant challenges for deployment on resource-constrained devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LLMs, offering considerable reductions in memory overhead. However, existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation, leading to a noticeable performance gap when compared to the original models. Furthermore, applying a uniform compression ratio across all transformer layers fails to account for the varying importance of different layers. To address these challenges, we propose AdaSVD, an adaptive SVD-based LLM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the singular matrices U and V^T. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios based on the relative importance of each layer. Extensive experiments across multiple LLM families and evaluation metrics demonstrate that AdaSVD consistently outperforms state-of-the-art (SOTA) SVD-based methods, achieving superior performance with significantly reduced memory requirements. The code and models will be available at this https URL.
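AdaSVD builds on plain rank-k SVD truncation of weight matrices, which is easy to show in isolation. The sketch below performs only this baseline truncation; adaComp's alternating updates of U and V^T and adaCR's per-layer compression ratios are not reproduced here.

```python
import numpy as np

# Baseline rank-k SVD compression of a weight matrix: store the three
# truncated factors instead of W itself.

def svd_truncate(W, k):
    """Best rank-k approximation of W (Eckart-Young): U_k, s_k, Vt_k."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
U_k, s_k, Vt_k = svd_truncate(W, 16)
W_approx = (U_k * s_k) @ Vt_k
# Stored parameters drop from 64*64 = 4096 to 16*(64 + 1 + 64) = 2064;
# the residual ||W - W_approx|| is the truncation error AdaSVD compensates.
```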
zh
[NLP-26] Annotation Tool and Dataset for Fact-Checking Podcasts
[Quick Read]: This paper addresses the challenges of fact-checking podcasts, including transcribing, annotating, and verifying claims in unfiltered, diverse, and multilingual content. The key to the solution is a novel tool that enables real-time annotation during playback, allowing users to mark key elements such as check-worthy claims, claim spans, and contextual errors while listening. The approach combines advanced transcription models (such as OpenAI's Whisper) with crowdsourced annotation to create high-quality datasets, which are then used to fine-tune multilingual transformer models (such as XLM-RoBERTa) for claim detection and stance classification.
Link: https://arxiv.org/abs/2502.01402
Authors: Vinay Setty, Adam James Becker
Affiliations: University of Stavanger
Categories: Computation and Language (cs.CL)
Comments: Accepted as a resource paper at TheWebConf 2025
Click to view abstract
Abstract:Podcasts are a popular medium on the web, featuring diverse and multilingual content that often includes unverified claims. Fact-checking podcasts is a challenging task, requiring transcription, annotation, and claim verification, all while preserving the contextual details of spoken content. Our tool offers a novel approach to tackle these challenges by enabling real-time annotation of podcasts during playback. This unique capability allows users to listen to the podcast and annotate key elements, such as check-worthy claims, claim spans, and contextual errors, simultaneously. By integrating advanced transcription models like OpenAI’s Whisper and leveraging crowdsourced annotations, we create high-quality datasets to fine-tune multilingual transformer models such as XLM-RoBERTa for tasks like claim detection and stance classification. Furthermore, we release the annotated podcast transcripts and sample annotations with preliminary experiments.
zh
[NLP-27] Plan-Then-Execute: An Empirical Study of User Trust and Team Performance When Using LLM Agents As A Daily Assistant
[Quick Read]: This paper examines the planning and sequential decision-making capabilities of large language models (LLMs) as daily assistants and how those capabilities can deliver effective everyday assistance. The key to the study is an 'LLM-modulo' setup with humans in the loop, where agents follow a plan-then-execute workflow, allowing the authors to assess how user involvement at each stage affects trust and collaborative team performance. The findings show that LLM daily assistants work well only when a high-quality plan and the necessary user involvement in execution are in place, and that users can easily misplace trust in plans that merely seem plausible. The paper distills key insights for calibrating user trust to achieve better overall task outcomes, with important implications for the future design of daily assistants and human-AI collaboration.
Link: https://arxiv.org/abs/2502.01390
Authors: Gaole He, Gianluca Demartini, Ujwal Gadiraju
Affiliations: Delft University of Technology; The University of Queensland
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: Conditionally accepted to CHI 2025
Click to view abstract
Abstract:Since the explosion in popularity of ChatGPT, large language models (LLMs) have continued to impact our everyday lives. Equipped with external tools that are designed for a specific purpose (e.g., for flight booking or an alarm clock), LLM agents exercise an increasing capability to assist humans in their daily work. Although LLM agents have shown a promising blueprint as daily assistants, there is a limited understanding of how they can provide daily assistance based on planning and sequential decision making capabilities. We draw inspiration from recent work that has highlighted the value of ‘LLM-modulo’ setups in conjunction with humans-in-the-loop for planning tasks. We conducted an empirical study (N = 248) of LLM agents as daily assistants in six commonly occurring tasks with different levels of risk typically associated with them (e.g., flight ticket booking and credit card payments). To ensure user agency and control over the LLM agent, we adopted LLM agents in a plan-then-execute manner, wherein the agents conducted step-wise planning and step-by-step execution in a simulation environment. We analyzed how user involvement at each stage affects their trust and collaborative team performance. Our findings demonstrate that LLM agents can be a double-edged sword – (1) they can work well when a high-quality plan and necessary user involvement in execution are available, and (2) users can easily mistrust the LLM agents with plans that seem plausible. We synthesized key insights for using LLM agents as daily assistants to calibrate user trust and achieve better overall task outcomes. Our work has important implications for the future design of daily assistants and human-AI collaboration with LLM agents.
zh
[NLP-28] Topic-FlipRAG: Topic-Orientated Adversarial Opinion Manipulation Attacks to Retrieval-Augmented Generation Models
[Quick Read]: This paper addresses topic-oriented adversarial opinion manipulation attacks against Retrieval-Augmented Generation (RAG) systems, which must reason over and synthesize multiple perspectives and are therefore particularly susceptible to systematic knowledge poisoning. The key to the solution is Topic-FlipRAG, a two-stage manipulation attack pipeline that strategically crafts adversarial perturbations to influence opinions across related queries. The approach combines traditional adversarial ranking attack techniques with the extensive internal knowledge and reasoning capabilities of large language models (LLMs) to execute semantic-level perturbations. Experiments show that the attack effectively shifts the opinions expressed in model outputs on specific topics, significantly affecting users' perception of information. Current mitigation methods cannot effectively defend against such attacks, underscoring the need for stronger safeguards for RAG systems and offering important insights for LLM security research.
Link: https://arxiv.org/abs/2502.01386
Authors: Yuyang Gong, Zhuo Chen, Miaokun Chen, Fengchang Yu, Wei Lu, Xiaofeng Wang, Xiaozhong Liu, Jiawei Liu
Affiliations: Wuhan University; Indiana University Bloomington; Worcester Polytechnic Institute
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments:
Click to view abstract
Abstract:Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become essential for tasks such as question answering and content generation. However, their increasing impact on public opinion and information dissemination has made them a critical focus for security research due to inherent vulnerabilities. Previous studies have predominantly addressed attacks targeting factual or single-query manipulations. In this paper, we address a more practical scenario: topic-oriented adversarial opinion manipulation attacks on RAG models, where LLMs are required to reason and synthesize multiple perspectives, rendering them particularly susceptible to systematic knowledge poisoning. Specifically, we propose Topic-FlipRAG, a two-stage manipulation attack pipeline that strategically crafts adversarial perturbations to influence opinions across related queries. This approach combines traditional adversarial ranking attack techniques and leverages the extensive internal relevant knowledge and reasoning capabilities of LLMs to execute semantic-level perturbations. Experiments show that the proposed attacks effectively shift the opinion of the model’s outputs on specific topics, significantly impacting user information perception. Current mitigation methods cannot effectively defend against such attacks, highlighting the necessity for enhanced safeguards for RAG systems, and offering crucial insights for LLM security research.
zh
[NLP-29] Meursault as a Data Point
[Quick Read]: This paper explores the profound philosophical and ethical questions raised by reducing human experience to quantifiable metrics in the age of datafication. Analyzing the fate of Meursault, the protagonist of Albert Camus' The Stranger, the study applies natural language processing (NLP) techniques, including emotion detection (BERT), sentiment analysis (VADER), and named entity recognition (spaCy), to quantify key events and behaviors in his life. The key contribution is to expose the inherent limitations of applying algorithmic models to complex human experience, especially experience rooted in existential alienation and moral ambiguity, and, by examining how modern AI tools misread Meursault's actions and emotions, to highlight the broader ethical dilemma of reducing nuanced human narratives to data points. The paper advocates incorporating humanistic values into artificial intelligence to counter over-reliance on data-driven narratives.
Link: https://arxiv.org/abs/2502.01364
Authors: Abhinav Pratap, Amit Pathak
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: 7 pages, 9 figures, 4 tables
Click to view abstract
Abstract:In an era dominated by datafication, the reduction of human experiences to quantifiable metrics raises profound philosophical and ethical questions. This paper explores these issues through the lens of Meursault, the protagonist of Albert Camus' The Stranger, whose emotionally detached existence epitomizes the existential concept of absurdity. Using natural language processing (NLP) techniques, including emotion detection (BERT), sentiment analysis (VADER), and named entity recognition (spaCy), this study quantifies key events and behaviors in Meursault's life. Our analysis reveals the inherent limitations of applying algorithmic models to complex human experiences, particularly those rooted in existential alienation and moral ambiguity. By examining how modern AI tools misinterpret Meursault's actions and emotions, this research underscores the broader ethical dilemmas of reducing nuanced human narratives to data points, challenging the foundational assumptions of our data-driven society. The findings presented in this paper serve as a critique of the increasing reliance on data-driven narratives and advocate for incorporating humanistic values in artificial intelligence.
zh
[NLP-30] Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations
[Quick Read]: This paper addresses the vulnerability of large language models (LLMs) in product recommendation systems, in particular their susceptibility to adversarial manipulation. The key to the solution is drawing on principles of human psychology to seamlessly modify product descriptions, making these adversarial manipulations hard to detect. Experiments reveal significant vulnerabilities in LLMs used as recommenders and provide critical insights for safeguarding these systems.
Link: https://arxiv.org/abs/2502.01349
Authors: Giorgos Filandrianos, Angeliki Dimitriou, Maria Lymperaiou, Konstantinos Thomas, Giorgos Stamou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:The advent of Large Language Models (LLMs) has revolutionized product recommendation systems, yet their susceptibility to adversarial manipulation poses critical challenges, particularly in real-world commercial applications. Our approach is the first one to tap into human psychological principles, seamlessly modifying product descriptions, making these adversarial manipulations hard to detect. In this work, we investigate cognitive biases as black-box adversarial strategies, drawing parallels between their effects on LLMs and human purchasing behavior. Through extensive experiments on LLMs of varying scales, we reveal significant vulnerabilities in their use as recommenders, providing critical insights into safeguarding these systems.
zh
[NLP-31] PSSD: Making Large Language Models Self-denial via Human Psyche Structure WWW’25
[Quick Read]: This paper addresses the resource competition that afflicts approaches to improving the accuracy of LLM reasoning, which incur substantial time and compute overhead. The key is the proposed PSSD scheme, which mimics the human psyche structure by introducing three distinct, interconnected roles: an intuition-based id role, a rule-driven superego role, and a script-centric ego role. The three roles cooperate under a multi-agent paradigm, better enhancing LLM reasoning while integrating seamlessly with existing models and delivering superior performance.
Link: https://arxiv.org/abs/2502.01344
Authors: Jinzhi Liao, Zenghua Liao, Xiang Zhao
Affiliations: National University of Defense Technology, Changsha, Hunan, China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: WWW '25
Click to view abstract
Abstract:Efforts to enhance the accuracy of LLM reasoning have aroused the community's interest, wherein pioneering studies investigate post-hoc strategies to rectify potential mistakes. Despite extensive efforts, they are all stuck in a state of resource competition demanding significant time and computing expenses. The cause of this situation lies in the failure to identify the fundamental feature of solutions in this line, coined as the self-denial of LLMs. In other words, LLMs should confidently determine the potential existence of mistakes and carefully execute the targeted correction. As the whole procedure is conducted within LLMs, supporting and persuasive references are hard to acquire, and the absence of specific steps toward refining hidden mistakes persists even when errors are acknowledged. In response to these challenges, we present PSSD, which refers to and implements the human psyche structure such that three distinct and interconnected roles contribute to human reasoning. Specifically, PSSD leverages the recent multi-agent paradigm, and is further enhanced with three innovatively conceived roles: (1) the intuition-based id role that provides initial attempts based on benign LLMs; (2) the rule-driven superego role that summarizes rules to regulate the above attempts, and returns specific key points as guidance; and (3) the script-centric ego role that absorbs all procedural information to generate an executable script for the final answer prediction. Extensive experiments demonstrate that the proposed design not only better enhances reasoning capabilities, but also integrates seamlessly with current models, leading to superior performance.
zh
[NLP-32] AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
[Quick Read]: This paper addresses the key challenge of aligning visual features with language embeddings in vision-language models (VLMs). Existing connectors such as multilayer perceptrons (MLPs) often produce out-of-distribution or noisy inputs, causing misalignment between the modalities. The proposed method, AlignVLM, maps visual features to a weighted average of the LLM's text embeddings, leveraging the linguistic priors encoded by the LLM to ensure that visual features land in regions of the space the LLM can interpret effectively. The key is this weighted-average mapping over LLM text embeddings, which improves the accuracy and robustness of vision-text feature alignment.
Link: https://arxiv.org/abs/2502.01341
Authors: Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.
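The core mapping described in the abstract, projecting a visual feature to a convex (softmax-weighted) combination of the LLM's text embeddings, can be sketched directly. The linear scoring matrix W below stands in for the learned connector and is an assumption for illustration; all dimensions are arbitrary.

```python
import numpy as np

# Sketch of the AlignVLM idea: each visual feature becomes a convex
# combination of the LLM's text-embedding rows, so it always lands inside
# the region of space the LLM already interprets.

def align_visual_feature(v, W, text_embeddings):
    """Map one visual feature v to a softmax-weighted average of rows of
    text_embeddings. W is an assumed learned projection to vocab scores."""
    logits = W @ v                       # scores over the vocabulary
    logits -= logits.max()               # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    return weights @ text_embeddings     # convex combination of embeddings

rng = np.random.default_rng(0)
vocab, d_text, d_vis = 100, 32, 48
E = rng.standard_normal((vocab, d_text))   # LLM text embedding table
W = rng.standard_normal((vocab, d_vis))    # assumed learned projection
v = rng.standard_normal(d_vis)             # one visual feature
aligned = align_visual_feature(v, W, E)    # lies in the hull of E's rows
```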
zh
[NLP-33] Main Predicate and Their Arguments as Explanation Signals For Intent Classification
[Quick Read]: This paper addresses the explainability of intent classification, an under-explored area owing to the lack of suitable benchmark datasets. The key to the solution is a new technique that automatically augments text samples in intent classification datasets with word-level explanations, marking main predicates (primarily verbs) and their arguments (dependency relations) as explanation signals. Applied to the ATIS and SNIPS benchmarks, this yields a unique 21k-instance dataset for explainability. Experiments show that guiding models to focus on these explanation signals during training improves their reasoning, with a 3-4% gain on explainability metrics such as plausibility and faithfulness.
Link: https://arxiv.org/abs/2502.01270
Authors: Sameer Pimparkhede, Pushpak Bhattacharyya
Affiliations: Indian Institute of Technology Bombay
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Intent classification is crucial for conversational agents (chatbots), and deep learning models perform well in this area. However, little research has been done on the explainability of intent classification due to the absence of suitable benchmark data. Human annotation of explanation signals in text samples is time-consuming and costly. However, from inspection of data on intent classification, we see that, more often than not, the main verb denotes the action, and the direct object indicates the domain of conversation, serving as explanation signals for intent. This observation enables us to hypothesize that the main predicate in the text utterances, along with the arguments of the main predicate, can serve as explanation signals. Leveraging this, we introduce a new technique to automatically augment text samples from intent classification datasets with word-level explanations. We mark main predicates (primarily verbs) and their arguments (dependency relations) as explanation signals in benchmark intent classification datasets ATIS and SNIPS, creating a unique 21k-instance dataset for explainability. Further, we experiment with deep learning and language models. We observe that models that work well for classification do not perform well in explainability metrics like plausibility and faithfulness. We also observe that guiding models to focus on explanation signals from our dataset during training improves the plausibility Token F1 score by 3-4%, improving the model’s reasoning.
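"取主谓词及其论元作为解释信号"的标注思路,可以用一个手写的玩具依存结构来示意(纯 Python;依存三元组为演示性假设输入,实际应来自依存句法分析器):

```python
def explanation_signals(tokens, deps):
    """deps: (head_index, relation, dep_index) 三元组列表,head=-1 表示根。
    返回主谓词及其核心论元对应的词,作为词级解释信号(玩具示意)。"""
    root = next(d for h, r, d in deps if h == -1)  # 主谓词(通常是动词)
    args = [d for h, r, d in deps if h == root and r in {"nsubj", "obj", "iobj"}]
    return [tokens[root]] + [tokens[i] for i in sorted(args)]

# "book a flight to Boston" 的手写玩具依存结构
tokens = ["book", "a", "flight", "to", "Boston"]
deps = [(-1, "root", 0), (2, "det", 1), (0, "obj", 2), (4, "case", 3), (2, "nmod", 4)]
signals = explanation_signals(tokens, deps)
```

这里主动词 book 表示动作,直接宾语 flight 指示对话领域,与论文观察到的"动词+宾语即解释信号"的规律一致。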
zh
[NLP-34] Learnable polynomial trigonometric and tropical activations
【速读】: 该论文研究基于正交函数基与热带多项式的可学习激活函数:静态激活函数(如ReLU)缺乏适应性,而可学习激活函数在深层网络中若方差管理不当,会引发梯度消失或爆炸等稳定性问题。关键解决方案在于提出一种初始化方案,仅凭该方案即可在变换器和卷积网络中保持单位方差,从而确保深层架构中梯度流动的稳定。实验结果表明,采用基于Hermite、Fourier和Tropical的可学习激活函数的网络,在ImageNet-1K分类和OpenWebText上的准确率与困惑度均显著优于GPT-2和ConvNeXt基线,证明了可学习激活函数在大规模任务中的可行性。相关激活函数已封装成一个完全基于PyTorch的库:torchortho。
链接: https://arxiv.org/abs/2502.01247
作者: Ismail Khalfaoui-Hassani,Stefan Kesselheim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
备注:
点击查看摘要
Abstract:This paper investigates scalable neural networks with learnable activation functions based on orthogonal function bases and tropical polynomials, targeting ImageNet-1K classification and next token prediction on OpenWebText. Traditional activations, such as ReLU, are static. In contrast, learnable activations enable the network to adapt dynamically during training. However, stability issues, such as vanishing or exploding gradients, arise with improper variance management in deeper networks. To remedy this, we propose an initialization scheme that single-handedly preserves unitary variance in transformers and convolutional networks, ensuring stable gradient flow even in deep architectures. Extensive experiments demonstrate that networks with Hermite, Fourier, and Tropical-based learnable activations significantly improve over GPT-2 and ConvNeXt networks in terms of accuracy and perplexity in train and test, highlighting the viability of learnable activations in large-scale tasks. The activation functions developed here are the subject of a library coded entirely in pure PyTorch: torchortho, available at this https URL.
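可学习 Fourier 激活的核心形式可示意如下(纯 Python 玩具实现,仅演示前向计算;系数在实际训练中作为可学习参数,此处的初始化方式与类名均为演示性假设,并不代表 torchortho 的实际接口):

```python
import math

class FourierActivation:
    """f(x) = a0/2 + Σ_k [a_k·cos(kx) + b_k·sin(kx)],系数可学习(示意)。"""
    def __init__(self, order, a=None, b=None):
        self.order = order
        self.a = a if a is not None else [0.0] * (order + 1)
        self.b = b if b is not None else [0.0] * (order + 1)

    def __call__(self, x):
        y = self.a[0] / 2
        for k in range(1, self.order + 1):
            y += self.a[k] * math.cos(k * x) + self.b[k] * math.sin(k * x)
        return y

# 以 b1=1、其余系数为0 初始化,则激活在原点附近近似恒等(sin(x) ≈ x)
act = FourierActivation(order=3, b=[0.0, 1.0, 0.0, 0.0])
```

这种"近似恒等"的初始化直觉上有助于训练初期的方差保持;论文提出的初始化方案正是围绕在深层网络中维持单位方差而设计的。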
zh
[NLP-35] OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology
【速读】: 该论文旨在解决大型语言模型(LLMs)在眼科临床实践中的实际应用能力评估及其局限性识别的问题。为填补这一研究空白并支持LLMs的实际应用,论文提出了一种名为OphthBench的专门基准测试,该基准测试系统地将典型的眼科临床工作流程划分为五个关键场景:教育、分诊、诊断、治疗和预后。解决方案的关键在于通过设计涵盖多种问题类型的多样化任务,构建了一个包含9个任务和591个问题的全面基准框架,从而实现对LLMs能力的全面评估,并为其在中国眼科领域的实际应用提供洞见。
链接: https://arxiv.org/abs/2502.01243
作者: Chengfeng Zhou,Ji Wang,Juanjuan Qin,Yining Wang,Ling Sun,Weiwei Dai
机构: Changsha Aier Eye Hospital (长沙爱尔眼科医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have shown significant promise across various medical applications, with ophthalmology being a notable area of focus. Many ophthalmic tasks have shown substantial improvement through the integration of LLMs. However, before these models can be widely adopted in clinical practice, evaluating their capabilities and identifying their limitations is crucial. To address this research gap and support the real-world application of LLMs, we introduce the OphthBench, a specialized benchmark designed to assess LLM performance within the context of Chinese ophthalmic practices. This benchmark systematically divides a typical ophthalmic clinical workflow into five key scenarios: Education, Triage, Diagnosis, Treatment, and Prognosis. For each scenario, we developed multiple tasks featuring diverse question types, resulting in a comprehensive benchmark comprising 9 tasks and 591 questions. This comprehensive framework allows for a thorough assessment of LLMs’ capabilities and provides insights into their practical application in Chinese ophthalmology. Using this benchmark, we conducted extensive experiments and analyzed the results from 39 popular LLMs. Our evaluation highlights the current gap between LLM development and its practical utility in clinical settings, providing a clear direction for future advancements. By bridging this gap, we aim to unlock the potential of LLMs and advance their development in ophthalmology.
zh
[NLP-36] Eliciting Language Model Behaviors with Investigator Agents
【速读】: 该论文旨在解决行为诱导问题,即搜索能够从目标语言模型中诱导出特定目标行为(如幻觉或有害响应)的提示。解决方案的关键在于训练调查者模型(investigator models),通过有监督微调、基于直接偏好优化(DPO)的强化学习,以及一种新颖的Frank-Wolfe训练目标,探索庞大的提示空间并发现多样化的诱导策略,从而有效诱导出包括越狱、幻觉在内的多种开放性异常行为。
链接: https://arxiv.org/abs/2502.01236
作者: Xiang Lisa Li,Neil Chowdhury,Daniel D. Johnson,Tatsunori Hashimoto,Percy Liang,Sarah Schwettmann,Jacob Steinhardt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 7 figures
点击查看摘要
Abstract:Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors, obtaining a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate.
zh
[NLP-37] On the Robustness of Temporal Factual Knowledge in Language Models
【速读】: 该论文旨在探究语言模型(Language Models, LMs)处理时序性事实知识的时间鲁棒性。研究的关键在于设计了一项控制实验,评估多个预训练及指令调优的语言模型在不同时间粒度(日、月、年)下对维基数据事实的处理能力,从而揭示大规模最先进的模型如Llama-3.1-70B在时序性知识理解方面的局限性,特别是它们无法将知识从一个粒度泛化到另一个粒度。
链接: https://arxiv.org/abs/2502.01220
作者: Hichem Ammar Khodja,Frédéric Béchet,Quentin Brabant,Alexis Nasr,Gwénolé Lecorvé
机构: Orange(橙色) - Lannion, France; Aix Marseille Université, CNRS, LIS, UMR 7020 - Marseille, France; International Laboratory on Learning Systems (ILLS - IRL2020 CNRS)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper explores the temporal robustness of language models (LMs) in handling factual knowledge. While LMs can often complete simple factual statements, their ability to manage temporal facts (those valid only within specific timeframes) remains uncertain. We design a controlled experiment to test the robustness of temporal factual knowledge inside LMs, which we use to evaluate several pretrained and instruction-tuned models using prompts on popular Wikidata facts, assessing their performance across different temporal granularities (Day, Month, and Year). Our findings indicate that even very large state-of-the-art models, such as Llama-3.1-70B, vastly lack robust knowledge of temporal facts. In addition, they are incapable of generalizing their knowledge from one granularity to another. These results highlight the inherent limitations of using LMs as temporal knowledge bases. The source code and data to reproduce our experiments will be released.
zh
[NLP-38] Modelling change in neural dynamics during phonetic accommodation
【速读】: 该论文旨在探究实时语音输入如何塑造对话者在语音规划中的表征,并解决短期语音同化过程中语音表征变化的计算模型。关键解决方案在于通过动态神经场方程调整抑制性记忆动力学的幅度,以反映由于音系和/或社会语言学压力导致的同化阻力,从而再现实验观察到的影子模仿过程中的特定元音收敛及模仿后的基线恢复现象。
链接: https://arxiv.org/abs/2502.01210
作者: Sam Kirkham,Patrycja Strycharczuk,Rob Davies,Danielle Welburn
机构: Phonetics Laboratory, Lancaster University (兰开斯特大学语音学实验室); Linguistics and English Language, University of Manchester (曼彻斯特大学语言学与英语语言系); Phonetics Laboratory, Lancaster University (兰开斯特大学语音学实验室); Phonetics Laboratory, Lancaster University (兰开斯特大学语音学实验室)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Short-term phonetic accommodation is a fundamental driver behind accent change, but how does real-time input from another speaker’s voice shape the speech planning representations of an interlocutor? We advance a computational model of change in phonetic representations during phonetic accommodation, grounded in dynamic neural field equations for movement planning and memory dynamics. We test the model’s ability to capture empirical patterns from an experimental study where speakers shadowed a model talker with a different accent from their own. The experimental data shows vowel-specific degrees of convergence during shadowing, followed by return to baseline (or minor divergence) post-shadowing. The model can reproduce these phenomena by modulating the magnitude of inhibitory memory dynamics, which may reflect resistance to accommodation due to phonological and/or sociolinguistic pressures. We discuss the implications of these results for the relation between short-term phonetic accommodation and longer-term patterns of sound change.
zh
[NLP-39] Almost Surely Safe Alignment of Large Language Models at Inference-Time
【速读】: 该论文旨在解决能力很强的大语言模型(Large Language Models, LLMs)在生成响应时可能出现的偏见或不安全内容的问题。现有缓解此问题的对齐技术,如基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF),虽然有效但成本高昂且容易过拟合,因为它们需要重新训练 LLM。论文提出的关键解决方案是,在推理阶段采用一种新颖的对齐方法,确保 LLM 几乎必然(概率趋近于一)生成安全响应。具体而言,通过将推理阶段的安全响应生成构建为 LLM 潜在空间中的受限马尔可夫决策过程(Markov Decision Process, MDP)来实现这一点。关键创新在于引入一个安全状态,用于追踪安全约束的演化,并在求解潜在空间中的 MDP 后提供形式化的安全性保证。在此基础上,论文提出名为 InferenceGuard 的实用实现方案,能够在不修改模型权重的情况下安全地对齐 LLM。实证研究表明,InferenceGuard 在平衡安全性和任务性能方面表现出色,优于现有的推理阶段对齐方法。
链接: https://arxiv.org/abs/2502.01208
作者: Xiaotong Ji,Shyam Sundhar Ramesh,Matthieu Zimmer,Ilija Bogunovic,Jun Wang,Haitham Bou Ammar
机构: Cranberry-Lemon University (蔓莓柠檬大学); Department of Computational Neuroscience, University of the Witwatersrand (计算神经科学系, 比勒陀利亚大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Even highly capable large language models (LLMs) can produce biased or unsafe responses, and alignment techniques, such as RLHF, aimed at mitigating this issue, are expensive and prone to overfitting as they retrain the LLM. This paper introduces a novel inference-time alignment approach that ensures LLMs generate safe responses almost surely, i.e., with a probability approaching one. We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process within the LLM’s latent space. Crucially, we augment a safety state that tracks the evolution of safety constraints and enables us to demonstrate formal safety guarantees upon solving the MDP in the latent space. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses.
zh
[NLP-40] OCR Error Post-Correction with LLM s in Historical Documents: No Free Lunches
【速读】: 该论文旨在解决光学字符识别(OCR)系统在转录历史文档时引入的错误问题,通过探索利用开放权重的大语言模型(LLMs)进行OCR错误校正的方法。研究的关键在于评估不同策略的效果,包括参数优化、量化、段落长度的影响以及文本延续方法,并揭示了现代LLMs在降低英语字符错误率(CER)方面的潜力,同时也指出了在芬兰语应用中尚未达到实用性能的局限性。
链接: https://arxiv.org/abs/2502.01205
作者: Jenna Kanerva,Cassandra Ledins,Siiri Käpyaho,Filip Ginter
机构: TurkuNLP, Department of Computing (计算系), University of Turku (图尔库大学), Finland (芬兰)
类目: Computation and Language (cs.CL)
备注: To be published in RESOURCEFUL 2025
点击查看摘要
Abstract:Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
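摘要中用于衡量校正效果的字符错误率(CER)按编辑距离定义计算,下面给出一个标准实现示意(纯 Python;示例文本为自拟):

```python
def levenshtein(ref, hyp):
    # 经典动态规划编辑距离(插入/删除/替换代价均为1)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # 删除
                           cur[j - 1] + 1,       # 插入
                           prev[j - 1] + (r != h)))  # 替换或匹配
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """字符错误率 = 编辑距离 / 参考文本长度。"""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# OCR 输出 "he1lo world" 相对参考文本 "hello world":1处替换错误
score = cer("hello world", "he1lo world")
```

LLM 后校正的目标即在校正后使该指标低于原始 OCR 输出的 CER;论文发现这一点在英语上可行,而在芬兰语上尚未达到实用水平。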
zh
[NLP-41] COVE: COntext and VEracity prediction for out-of-context images NAACL2025
【速读】: 该论文旨在解决图像脱离原始上下文所导致的多模态虚假信息问题。论文的关键解决方案是引入COVE方法:首先预测图像的真实上下文(context),然后利用该上下文来验证图片说明(caption)的真实性。通过这种方式,COVE在上下文预测任务上超越了现有的最先进模型,并且在真实数据上的说明真实性验证中优于其他模型,表明按顺序结合这两个任务是有益的。
链接: https://arxiv.org/abs/2502.01194
作者: Jonathan Tonglet,Gabriel Thiem,Iryna Gurevych
机构: Ubiquitous Knowledge Processing Lab (UKP Lab), TU Darmstadt (达姆施塔特工业大学); Department of Electrical Engineering, KU Leuven (鲁汶大学); Department of Computer Science, KU Leuven (鲁汶大学)
类目: Computation and Language (cs.CL)
备注: Camera-ready version accepted to NAACL 2025 Main Conference
点击查看摘要
Abstract:Images taken out of their context are the most prevalent form of multimodal misinformation. Debunking them requires (1) providing the true context of the image and (2) checking the veracity of the image’s caption. However, existing automated fact-checking methods fail to tackle both objectives explicitly. In this work, we introduce COVE, a new method that predicts first the true COntext of the image and then uses it to predict the VEracity of the caption. COVE beats the SOTA context prediction model on all context items, often by more than five percentage points. It is competitive with the best veracity prediction models on synthetic data and outperforms them on real-world data, showing that it is beneficial to combine the two tasks sequentially. Finally, we conduct a human study that reveals that the predicted context is a reusable and interpretable artifact to verify new out-of-context captions for the same image. Our code and data are made available.
zh
[NLP-42] Skewed Memorization in Large Language Models : Quantification and Decomposition
【速读】: 该论文旨在解决大型语言模型(LLMs)在有监督微调(SFT)过程中因记忆训练数据而导致的隐私和安全风险。论文的关键在于通过分析序列长度上的记忆概率,揭示记忆分布的高度偏斜性,并将其与令牌生成过程联系起来,从而提供估算记忆的方法,并提出检测和缓解这些风险的策略,以促进更注重隐私保护的LLMs的发展。
链接: https://arxiv.org/abs/2502.01187
作者: Hao Li,Di Huang,Ziyu Wang,Amir M. Rahmani
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Memorization in Large Language Models (LLMs) poses privacy and security risks, as models may unintentionally reproduce sensitive or copyrighted data. Existing analyses focus on average-case scenarios, often neglecting the highly skewed distribution of memorization. This paper examines memorization in LLM supervised fine-tuning (SFT), exploring its relationships with training duration, dataset size, and inter-sample similarity. By analyzing memorization probabilities over sequence lengths, we link this skewness to the token generation process, offering insights for estimating memorization and comparing it to established metrics. Through theoretical analysis and empirical evaluation, we provide a comprehensive understanding of memorization behaviors and propose strategies to detect and mitigate risks, contributing to more privacy-preserving LLMs.
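"记忆分布高度偏斜"可以直接用逐样本记忆率的偏度来量化。下面是一个纯 Python 示意(其中"记忆率"被简化为生成 token 与训练 token 的逐位置匹配比例,属于演示性假设,并非论文的度量定义):

```python
def memorization_rate(generated, training):
    # 简化定义:逐位置匹配的 token 比例(仅作演示)
    n = min(len(generated), len(training))
    if n == 0:
        return 0.0
    return sum(g == t for g, t in zip(generated, training)) / n

def skewness(xs):
    """Fisher-Pearson 偏度系数 g1:正值表示右偏(长尾在高记忆一侧)。"""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / (m2 ** 1.5) if m2 > 0 else 0.0

# 多数样本几乎不被记忆、少数样本被高度记忆 → 右偏分布
rates = [0.0, 0.05, 0.1, 0.0, 0.02, 0.9, 0.95]
g1 = skewness(rates)
```

正的偏度正对应论文所强调的现象:平均值掩盖了少数高记忆样本带来的隐私风险,因此仅做平均情形分析是不够的。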
zh
[NLP-43] A Single Model Ensemble Framework for Neural Machine Translation using Pivot Translation
【速读】: 该论文旨在解决低资源语言对(language pair)上神经机器翻译性能欠佳,以及多模型集成方法计算成本高的问题。论文的关键解决方案在于提出一种基于中介语言(pivot)翻译的单模型集成策略,分为两步:首先通过中介翻译生成候选译文,其次在后处理聚合阶段从候选中选出高质量译文并加以合并。这种方法仅需单一模型,即可借助高资源中介语言实现知识迁移,生成既多样又更准确的候选译文,最终产出优于现有候选的翻译。
链接: https://arxiv.org/abs/2502.01182
作者: Seokjin Oh,Keonwoong Noh,Woohwan Jung
机构: Department of Applied Artificial Intelligence, Hanyang University (汉阳大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Despite the significant advances in neural machine translation, performance remains subpar for low-resource language pairs. Ensembling multiple systems is a widely adopted technique to enhance performance, often accomplished by combining probability distributions. However, the previous approaches face the challenge of high computational costs for training multiple models. Furthermore, for black-box models, averaging token-level probabilities at each decoding step is not feasible. To address the problems of multi-model ensemble methods, we present a pivot-based single model ensemble. The proposed strategy consists of two steps: pivot-based candidate generation and post-hoc aggregation. In the first step, we generate candidates through pivot translation. This can be achieved with only a single model and facilitates knowledge transfer from high-resource pivot languages, resulting in candidates that are not only diverse but also more accurate. Next, in the aggregation step, we select k high-quality candidates from the generated candidates and merge them to generate a final translation that outperforms the existing candidates. Our experimental results show that our method produces translations of superior quality by leveraging candidates from pivot translation to capture the subtle nuances of the source sentence.
zh
[NLP-44] Joint Localization and Activation Editing for Low-Resource Fine-Tuning
【速读】: 该论文旨在解决在低资源场景下参数高效微调(Parameter-efficient fine-tuning, PEFT)方法效果有限的问题,特别是在仅有数百个样本的情况下。论文的关键解决方案是提出了一种名为Joint Localization and Activation Editing (JoLA)的方法,该方法能够同时学习需要编辑的Transformer头部、干预类型(加性、乘性或两者兼有)以及干预参数本身,从而在小数据集上实现更稳定且性能更优的模型调整。
链接: https://arxiv.org/abs/2502.01179
作者: Wen Lai,Alexander Fraser,Ivan Titov
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The code for the method is released at this https URL
点击查看摘要
Abstract:Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing techniques, which modify the activations of specific model components. These methods, due to their extremely small parameter counts, show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit (2) whether the intervention should be additive, multiplicative, or both and (3) the intervention parameters themselves - the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods.
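JoLA 所学的三类要素(编辑哪个头、加性/乘性干预、干预参数)中,对单个注意力头输出的干预形式可示意如下(纯 Python 玩具实现;门控与参数名均为演示性假设,并非论文源码):

```python
def edit_head_output(h, gate, scale, offset):
    """门控为真时对头输出施加干预:h' = scale ⊙ h + offset;否则原样返回。"""
    if not gate:
        return list(h)
    return [s * x + o for x, s, o in zip(h, scale, offset)]

head_out = [0.5, -1.0, 2.0]
edited = edit_head_output(head_out, gate=True,
                          scale=[1.0, 0.5, 1.0],    # 乘性缩放
                          offset=[0.1, 0.0, -0.2])  # 加性偏移
untouched = edit_head_output(head_out, gate=False, scale=[], offset=[])
```

JoLA 的要点在于门控、干预类型与干预向量都是联合学习得到的,而非像早期激活编辑方法那样依赖人工选定要编辑的模块。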
zh
[NLP-45] Jailbreaking with Universal Multi-Prompts NAACL
【速读】: 该论文旨在解决大型语言模型(LLMs)在面对通用攻击者时的防御问题,特别是那些能够泛化到未见过的任务的攻击。现有方法主要针对特定案例优化对抗输入,导致处理大规模数据集时计算成本较高。论文的关键在于提出了一种基于提示的方法JUMP (Jumping UnMoored Multi-Prompt),通过使用通用多提示来破解LLMs,并进一步将其适应于防御策略DUMP (Defensive UnMoored Multi-Prompt)。实验结果表明,该方法在优化通用多提示方面优于现有技术。
链接: https://arxiv.org/abs/2502.01154
作者: Yu-Ling Hsu,Hsuan Su,Shang-Tse Chen
机构: National Taiwan University (台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted by NAACL Findings 2025
点击查看摘要
Abstract:Large language models (LLMs) have seen rapid development in recent years, revolutionizing various applications and significantly enhancing convenience and productivity. However, alongside their impressive capabilities, ethical concerns and new types of attacks, such as jailbreaking, have emerged. While most prompting techniques focus on optimizing adversarial inputs for individual cases, resulting in higher computational costs when dealing with large datasets. Less research has addressed the more general setting of training a universal attacker that can transfer to unseen tasks. In this paper, we introduce JUMP, a prompt-based method designed to jailbreak LLMs using universal multi-prompts. We also adapt our approach for defense, which we term DUMP. Experimental results demonstrate that our method for optimizing universal multi-prompts outperforms existing techniques.
zh
[NLP-46] DeepRAG : Thinking to Retrieval Step by Step for Large Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在推理过程中存在的严重事实性幻觉问题,这些问题源于参数知识的时效性、准确性和覆盖范围。同时,将推理与检索增强生成(RAG)相结合仍然面临任务分解不力和冗余检索的挑战,这可能导致噪声引入和响应质量下降。论文的关键解决方案是提出DeepRAG框架,该框架将检索增强推理建模为马尔可夫决策过程(MDP),从而实现策略性和自适应检索。通过迭代分解查询,DeepRAG能够在每一步动态决定是否检索外部知识或依赖参数推理,以此优化检索增强推理的效果。实验表明,DeepRAG在提高检索效率的同时,将答案准确性提升了21.99%。
链接: https://arxiv.org/abs/2502.01142
作者: Xinyan Guan,Jiali Zeng,Fandong Meng,Chunlei Xin,Yaojie Lu,Hongyu Lin,Xianpei Han,Le Sun,Jie Zhou
机构: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所中文信息技术处理实验室); University of Chinese Academy of Sciences(中国科学院大学); Pattern Recognition Center, WeChat AI, Tencent Inc, China(中国腾讯公司微信人工智能模式识别中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have shown remarkable potential in reasoning while they still suffer from severe factual hallucinations due to timeliness, accuracy, and coverage of parametric knowledge. Meanwhile, integrating reasoning with retrieval-augmented generation (RAG) remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling strategic and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency while improving answer accuracy by 21.99%, demonstrating its effectiveness in optimizing retrieval-augmented reasoning.
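"每一步动态决定检索外部知识还是依赖参数推理"的流程,可用一个最简决策循环来示意(纯 Python;置信度、检索与作答函数均为假设性占位,实际由模型与检索器给出,阈值也仅为演示):

```python
def deep_rag_sketch(subqueries, confidence, retrieve, answer, threshold=0.7):
    """逐个子问题决策:置信度足够则参数推理,否则才检索(示意,避免冗余检索)。"""
    trace = []
    for q in subqueries:
        if confidence(q) >= threshold:
            trace.append(("parametric", q, answer(q, context=None)))
        else:
            ctx = retrieve(q)  # 仅在模型不确定时检索外部知识
            trace.append(("retrieve", q, answer(q, context=ctx)))
    return trace

# 玩具占位实现
conf = {"q1": 0.9, "q2": 0.3}
trace = deep_rag_sketch(
    ["q1", "q2"],
    confidence=lambda q: conf[q],
    retrieve=lambda q: f"docs({q})",
    answer=lambda q, context: f"ans({q})",
)
```

论文将这一决策过程形式化为 MDP 并学习检索策略,而非使用这里的固定阈值;此示意仅用于说明"按需检索"的控制流。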
zh
[NLP-47] Language Models Prefer What They Know: Relative Confidence Estimation via Confidence Preferences
【速读】: 该论文旨在解决语言模型(Language Models, LMs)在提供可靠置信度估计方面的不足,以帮助用户识别模型输出中的错误,并在必要时将其结果转交给人类专家。论文的关键解决方案在于引入相对置信度估计方法,即通过让模型对不同问题之间的置信度进行相对判断(例如,“你更自信正确回答哪个问题?”),而不是直接评估单个问题的绝对置信度。这种方法利用了排名聚合技术如Elo评分和Bradley-Terry模型将模型的偏好转换为置信分数。实验结果显示,相对置信度估计在所有测试的语言模型和数据集上提供了比绝对置信度估计和自一致性方法更可靠的置信度评分。
链接: https://arxiv.org/abs/2502.01126
作者: Vaishnavi Shrivastava,Ananya Kumar,Percy Liang
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Language models (LMs) should provide reliable confidence estimates to help users detect mistakes in their outputs and defer to human experts when necessary. Asking a language model to assess its confidence (“Score your confidence from 0-1.”) is a natural way of evaluating its uncertainty. However, models struggle to provide absolute assessments of confidence (i.e. judging confidence in answering a question independent of other questions) and the coarse-grained scores they produce are not useful for evaluating the correctness of their answers. We propose relative confidence estimation, where we match up questions against each other and ask the model to make relative judgments of confidence (“Which question are you more confident in answering correctly?”). Treating each question as a “player” in a series of matchups against other questions and the model’s preferences as match outcomes, we can use rank aggregation methods like Elo rating and Bradley-Terry to translate the model’s confidence preferences into confidence scores. We evaluate relative confidence estimation against absolute confidence estimation and self-consistency confidence methods on five state-of-the-art LMs – GPT-4, GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.1 405B – across 14 challenging STEM, social science, and commonsense reasoning question answering tasks. Our results demonstrate that relative confidence estimation consistently provides more reliable confidence scores than absolute confidence estimation, with average gains of 3.5% in selective classification AUC over direct absolute confidence estimation methods and 1.7% over self-consistency approaches across all models and datasets.
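将模型的成对置信偏好("对哪个问题更有把握")聚合为置信分数的 Elo 方法可示意如下(纯 Python;比赛顺序与 K 值为演示性选择,论文中同时还使用了 Bradley-Terry 聚合):

```python
def elo_scores(n_questions, matchups, k=32.0):
    """matchups: (winner, loser) 问题下标对,winner 为模型更自信的那个问题。"""
    ratings = [1000.0] * n_questions
    for w, l in matchups:
        # winner 的期望胜率,由当前分差决定
        expected_w = 1.0 / (1.0 + 10 ** ((ratings[l] - ratings[w]) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[w] += delta
        ratings[l] -= delta
    return ratings

# 问题0在与问题1、2的对比中均被偏好,问题1胜过问题2
scores = elo_scores(3, [(0, 1), (0, 2), (1, 2), (0, 1)])
```

得到的分数即相对置信度:分数越高,模型对该问题的把握越大,可用于选择性分类中的弃权决策。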
zh
[NLP-48] Picky LLM s and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning
【速读】: 该论文旨在解决在对齐的大规模语言模型(LLMs)进行良性微调以适应特定领域任务时,安全对齐可能意外退化的问题。论文的关键在于系统性地分析导致安全对齐退化的三个关键因素:答案结构、身份校准和角色扮演,并评估当前最先进的奖励模型(RMs)在指导对齐过程中的可靠性。研究发现,这些奖励模型常常无法准确反映人类对安全性的偏好,从而揭示了其在实际应用中的局限性。通过揭示这些挑战,论文强调了在微调过程中保持安全对齐的复杂性,并为开发者提供了平衡实用性和安全性方面的指导。
链接: https://arxiv.org/abs/2502.01116
作者: Guanlin Li,Kangjie Chen,Shangwei Guo,Jie Zhang,Han Qiu,Chao Zhang,Guoyin Wang,Tianwei Zhang,Jiwei Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have emerged as powerful tools for addressing a wide range of general inquiries and tasks. Despite this, fine-tuning aligned LLMs on smaller, domain-specific datasets, critical to adapting them to specialized tasks, can inadvertently degrade their safety alignment, even when the datasets are benign. This phenomenon makes models more susceptible to providing inappropriate responses. In this study, we systematically examine the factors contributing to safety alignment degradation in benign fine-tuning scenarios. Our analysis identifies three critical factors affecting aligned LLMs: answer structure, identity calibration, and role-play. Additionally, we evaluate the reliability of state-of-the-art reward models (RMs), which are often used to guide alignment processes. Our findings reveal that these RMs frequently fail to accurately reflect human preferences regarding safety, underscoring their limitations in practical applications. By uncovering these challenges, our work highlights the complexities of maintaining safety alignment during fine-tuning and offers guidance to help developers balance utility and safety in LLMs. Datasets and fine-tuning code used in our experiments can be found in this https URL.
zh
[NLP-49] GFM-RAG : Graph Foundation Model for Retrieval Augmented Generation
【速读】: 该论文旨在解决传统检索增强生成(Retrieval-augmented generation, RAG)模型难以捕捉复杂知识间关系的问题,从而限制其在需要从多源整合知识的复杂推理任务中的性能。为了解决这一问题,论文提出了一种新型图基础模型(Graph Foundation Model, GFM),即GFM-RAG。其关键是引入了一个创新的图神经网络,能够通过显式建模图结构来捕获复杂的查询-知识关系,从而实现更有效的知识检索和整合。
链接: https://arxiv.org/abs/2502.01113
作者: Linhao Luo,Zicheng Zhao,Gholamreza Haffari,Dinh Phung,Chen Gong,Shirui Pan
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 6 figures
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) has proven effective in integrating knowledge into large language models (LLMs). However, conventional RAGs struggle to capture complex relationships between pieces of knowledge, limiting their performance in intricate reasoning that requires integrating knowledge from multiple sources. Recently, graph-enhanced retrieval augmented generation (GraphRAG) builds graph structure to explicitly model these relationships, enabling more effective and efficient retrievers. Nevertheless, its performance is still hindered by the noise and incompleteness within the graph structure. To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for retrieval augmented generation. GFM-RAG is powered by an innovative graph neural network that reasons over graph structure to capture complex query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage training process on large-scale datasets, comprising 60 knowledge graphs with over 14M triples and 700k documents. This results in impressive performance and generalizability for GFM-RAG, making it the first graph foundation model applicable to unseen datasets for retrieval without any fine-tuning required. Extensive experiments on three multi-hop QA datasets and seven domain-specific RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance while maintaining efficiency and alignment with neural scaling laws, highlighting its potential for further improvement.
zh
[NLP-50] ZebraLogic: On the Scaling Limits of LLM s for Logical Reasoning
【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)在复杂非单调推理中的逻辑推理能力及其可扩展性。为了解决这一问题,论文引入了ZebraLogic评估框架,用于评估LLMs在源自约束满足问题(Constraint Satisfaction Problems, CSPs)的逻辑网格谜题上的推理性能。ZebraLogic能够生成具有可控且量化复杂度的谜题,从而系统地研究包括Llama、o1模型和DeepSeek-R1在内的模型的扩展极限。通过涵盖广泛的搜索空间复杂性和多样的逻辑约束,ZebraLogic提供了一个结构化的环境来评估推理难度增加时的表现。论文的关键在于揭示了随着问题复杂度增加,模型准确率显著下降的现象,并探讨了包括Best-of-N采样、回溯机制和自我验证提示等策略以增强逻辑推理能力。
链接: https://arxiv.org/abs/2502.01100
作者: Bill Yuchen Lin,Ronan Le Bras,Kyle Richardson,Ashish Sabharwal,Radha Poovendran,Peter Clark,Yejin Choi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Website: this https URL
点击查看摘要
Abstract:We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows – a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
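逻辑网格谜题本质上是约束满足问题(CSP),可以用"枚举排列 + 约束检查"写出一个最小求解器来示意(纯 Python;谜题内容为自拟玩具示例,并非 ZebraLogic 原题):

```python
from itertools import permutations

# 3座房子,求每座房子居民的国籍;三条约束为自拟玩具示例
people = ["英国人", "西班牙人", "日本人"]

def satisfies(assign):
    # assign[i] 为第 i 座房子(0为最左)的居民
    return (
        assign[0] != "日本人"                                   # 日本人不住最左
        and assign.index("英国人") < assign.index("西班牙人")    # 英国人在西班牙人左边
        and assign[2] == "西班牙人"                              # 西班牙人住最右
    )

solutions = [p for p in permutations(people) if satisfies(p)]
```

搜索空间随房子数与属性数按阶乘增长,这正是 ZebraLogic 得以系统调控谜题复杂度、进而观察"复杂度诅咒"的原因。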
zh
[NLP-51] Enhancing Aspect-based Sentiment Analysis with ParsBERT in Persian Language
【速读】: 该论文旨在解决波斯语文本挖掘中数据集稀缺和现有语言模型效率低下的挑战。解决方案的关键在于提出了一种基于方面的情感分析方法,利用增强型的ParsBERT模型和相关词典,从而显著提升了情感分析的准确度(88.2%)和F1得分(61.7),有效增强了针对波斯语的语言模型效能。
链接: https://arxiv.org/abs/2502.01091
作者: Farid Ariai,Maryam Tayefeh Mahmoudi,Ali Moeini
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In the era of pervasive internet use and the dominance of social networks, researchers face significant challenges in Persian text mining including the scarcity of adequate datasets in Persian and the inefficiency of existing language models. This paper specifically tackles these challenges, aiming to amplify the efficiency of language models tailored to the Persian language. Focusing on enhancing the effectiveness of sentiment analysis, our approach employs an aspect-based methodology utilizing the ParsBERT model, augmented with a relevant lexicon. The study centers on sentiment analysis of user opinions extracted from the Persian website ‘Digikala.’ The experimental results not only highlight the proposed method’s superior semantic capabilities but also showcase its efficiency gains with an accuracy of 88.2% and an F1 score of 61.7. The importance of enhancing language models in this context lies in their pivotal role in extracting nuanced sentiments from user-generated content, ultimately advancing the field of sentiment analysis in Persian text mining by increasing efficiency and accuracy.
zh
[NLP-52] Classic4Children: Adapting Chinese Literary Classics for Children with Large Language Model NAACL2025
【速读】: 该论文旨在解决儿童难以阅读中国文学经典的问题,通过引入儿童友好型文学改编(Child-Friendly Literary Adaptation, CLA)任务,使这些作品更易于儿童理解。论文的关键解决方案是提出了一种名为InstructChild的方法,该方法通过增强大型语言模型(LLM)以适应儿童的阅读偏好(如生动的角色描绘、简洁的叙事结构和适当的可读性),并采用细粒度指令微调来获取角色个性和叙事结构。此外,论文设计了一个可读性指标作为奖励来调整LLM与儿童阅读水平的一致性,并应用前瞻解码策略在推理过程中提高生成文本的可读性。为了支持CLA任务的评估,构建了包含原著及其儿童友好版本的Classic4Children数据集。实验结果表明,InstructChild显著提升了自动评估和人工评估的性能。
链接: https://arxiv.org/abs/2502.01090
作者: Jiali Chen,Xusen Hei,Yuqi Xue,Zihan Wu,Jiayuan Xie,Yi Cai
机构: Key Laboratory of Big Data and Intelligent Robot (大数据与智能机器人重点实验室) Ministry of Education; School of Software Engineering (软件工程学院), South China University of Technology; The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at NAACL 2025 Findings
点击查看摘要
Abstract:Chinese literary classics hold significant cultural and educational value, offering deep insights into morality, history, and human nature. These works often include classical Chinese and complex narratives, making them difficult for children to read. To bridge this gap, we introduce a child-friendly literary adaptation (CLA) task to adapt the Chinese literary classic into engaging and accessible text for children. However, recent large language models (LLMs) overlook children’s reading preferences (i.e., vivid character portrayals, concise narrative structures, and appropriate readability), which poses challenges in CLA. In this paper, we propose a method called InstructChild, which augments the LLM with these preferences for adaptation. Specifically, we first obtain the characters’ personalities and narrative structure as additional information for fine-grained instruction tuning. Then, we devise a readability metric as the reward to align the LLM with the children’s reading level. Finally, a lookahead decoding strategy is applied to improve the readability of the generated text during inference. To support the evaluation of the CLA task, we construct the Classic4Children dataset, which comprises both the original and child-friendly versions of the Four Great Classical Novels of Chinese literature. Experimental results show that our InstructChild significantly improves automatic and human evaluation performance.
zh
[NLP-53] Tool Unlearning for Tool-Augmented LLMs
【速读】: 该论文旨在解决工具增强型大语言模型(Tool-augmented LLMs)在面对安全漏洞、隐私法规或工具弃用等情况下,如何有效地“遗忘”已学工具的问题。这一任务被称为“工具未学习”(tool unlearning),在现有未学习(unlearning)研究中尚未被探讨。论文的关键解决方案是提出了一种名为ToolDelete的方法,该方法具备三项关键属性以有效应对工具未学习中的挑战,并引入了一种新的成员推理攻击(Membership Inference Attack, MIA)模型用于评估。实验结果表明,ToolDelete能够有效删除随机选择的工具,同时保持模型在其他未删除工具上的知识以及整体任务性能。
链接: https://arxiv.org/abs/2502.01083
作者: Jiali Cheng,Hadi Amiri
机构: University of Massachusetts Lowell
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: this https URL
点击查看摘要
Abstract:Tool-augmented large language models (LLMs) are often trained on datasets of query-response pairs, which embed the ability to use tools or APIs directly into the parametric knowledge of LLMs. Tool-augmented LLMs need the ability to forget learned tools due to security vulnerabilities, privacy regulations, or tool deprecations. However, "tool unlearning" has not been investigated in the unlearning literature. We introduce this novel task, which requires addressing distinct challenges compared to traditional unlearning: knowledge removal rather than forgetting individual samples, the high cost of optimizing LLMs, and the need for principled evaluation metrics. To bridge these gaps, we propose ToolDelete, the first approach for unlearning tools from tool-augmented LLMs. It implements three key properties to address the above challenges for effective tool unlearning and introduces a new membership inference attack (MIA) model for effective evaluation. Extensive experiments on multiple tool learning datasets and tool-augmented LLMs show that ToolDelete effectively unlearns randomly selected tools, while preserving the LLM’s knowledge on non-deleted tools and maintaining performance on general tasks.
zh
[NLP-54] The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
【速读】: 该论文旨在解决大型语言模型在处理多模态任务中的高级推理能力不足的问题。关键在于评估和追踪GPT系列和o系列模型在复杂多模态难题中的表现,这些难题需要细粒度的视觉感知以及抽象或算法推理能力。研究表明,尽管o系列模型在某些方面表现出色,但仍存在显著的性能瓶颈,特别是在简单的多模态抽象推理和算法推理任务上。
链接: https://arxiv.org/abs/2502.01081
作者: Vernon Y.H. Toh,Yew Ken Chia,Deepanway Ghosal,Soujanya Poria
机构: Singapore University of Technology and Design (新加坡科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The releases of OpenAI’s o1 and o3 mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, o3 outperformed humans in novel problem-solving and skill acquisition on the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models on challenging multimodal puzzles, requiring fine-grained visual perception with abstract or algorithmic reasoning. The superior performance of o1 comes at nearly 750 times the computational cost of GPT-4o, raising concerns about its efficiency. Our results reveal a clear upward trend in reasoning capabilities across model iterations, with notable performance jumps across GPT-series models and subsequently to o1. Nonetheless, we observe that the o1 model still struggles with simple multimodal puzzles requiring abstract reasoning. Furthermore, its performance in algorithmic puzzles remains poor. We plan to continuously track new models in the series and update our results in this paper accordingly. All resources used in this evaluation are openly available this https URL.
zh
[NLP-55] FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
【速读】: 该论文旨在解决大型语言模型(LLMs)在处理长上下文序列时所需的关键值(KV)缓存消耗大量计算资源和内存的问题。现有压缩方法主要关注减少内存需求,但未能显著提升延迟性能。论文提出的关键解决方案是FastKV,这是一种KV缓存压缩方法,通过引入Token-Selective Propagation(TSP)技术,在保持精度的同时加速处理速度,并采用grouped-query attention(GQA)感知的KV缓存压缩来提高内存和计算效率。实验结果表明,FastKV相比最先进的HeadKV方法,在首次令牌时间(TTFT)和吞吐量方面分别提升了2.00倍和1.40倍,同时保持了长上下文基准测试的准确性。
链接: https://arxiv.org/abs/2502.01068
作者: Dongwon Jo,Jiwon Song,Yulhwa Kim,Jae-Joon Kim
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to enhance latency for long-context sequences. To enhance processing speeds while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information in deeper layers even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00x and 1.40x improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV successfully maintains accuracy on long-context benchmarks at levels comparable to the baselines. Our code is available at this https URL.
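TSP 的核心操作(在某一层之后仅向深层传播一部分 token)可以用如下草图示意。注意:以「token 收到的总注意力」作为重要性度量是本示例的假设,论文的具体打分方式以原文为准:

```python
import numpy as np

def select_tokens_tsp(attn_scores, keep_ratio=0.25):
    """Token-Selective Propagation 示意:在切换层按注意力聚合分数
    对 token 排序,仅保留得分最高的一部分向更深层传播。"""
    n = attn_scores.shape[-1]
    keep = max(1, int(n * keep_ratio))
    importance = attn_scores.sum(axis=0)            # 每个 token 收到的总注意力(假设的度量)
    kept = np.sort(np.argsort(importance)[-keep:])  # 保留索引并维持原始顺序
    return kept

rng = np.random.default_rng(0)
attn = rng.random((8, 8))                 # 玩具注意力矩阵(query x key)
kept = select_tokens_tsp(attn, keep_ratio=0.25)
assert kept.shape == (2,)                 # 8 个 token 仅保留 25%
```

深层的 KV 缓存只需为被保留的 token 存储键值,这是 TTFT 与吞吐量提升的直观来源。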
zh
[NLP-56] Knowledge Synthesis of Photosynthesis Research Using a Large Language Model
【速读】: 该论文旨在解决当前大型语言模型(LLMs)在处理复杂生物数据和光合作用理论模型时存在的不足,无法提供准确科学背景的问题。解决方案的关键在于提出了一种基于OpenAI的GPT-4o,结合检索增强生成(RAG)技术和提示优化的光合作用研究助手(PRAG)。通过使用向量数据库和自动化反馈循环进行提示优化,以提高对光合作用相关查询响应的准确性和相关性。
链接: https://arxiv.org/abs/2502.01059
作者: Seungri Yoon,Woosang Jeon,Sanghyeok Choi,Taehyeong Kim,Tae In Ahn
机构: Seoul National University(首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures
点击查看摘要
Abstract:The development of biological data analysis tools and large language models (LLMs) has opened up new possibilities for utilizing AI in plant science research, with the potential to contribute significantly to knowledge integration and research gap identification. Nonetheless, current LLMs struggle to handle complex biological data and theoretical models in photosynthesis research and often fail to provide accurate scientific contexts. Therefore, this study proposed a photosynthesis research assistant (PRAG) based on OpenAI’s GPT-4o with retrieval-augmented generation (RAG) techniques and prompt optimization. Vector databases and an automated feedback loop were used in the prompt optimization process to enhance the accuracy and relevance of the responses to photosynthesis-related queries. PRAG showed an average improvement of 8.7% across five metrics related to scientific writing, with a 25.4% increase in source transparency. Additionally, its scientific depth and domain coverage were comparable to those of photosynthesis research papers. A knowledge graph was used to structure PRAG’s responses with papers within and outside the database, which allowed PRAG to match key entities with 63% and 39.5% of the database and test papers, respectively. PRAG can be applied for photosynthesis research and broader plant science domains, paving the way for more in-depth data analysis and predictive capabilities.
zh
[NLP-57] Mitigating Hallucinations in Large Vision-Language Models with Internal Fact-based Contrastive Decoding
【速读】: 该论文旨在解决大型视觉语言模型(Large Visual Language Models, LVLMs)在推理过程中出现的对象幻觉(object hallucinations)问题。论文的关键解决方案是提出了一种名为内部事实基础对比解码(Internal Fact-based Contrastive Decoding, IFCD)的模型无关方法。IFCD通过利用LVLMs自身的幻觉现象,在推理过程中校准模型输出,并有效移除最终预测中的幻觉logits,从而缓解对象级别和属性级别的幻觉问题,同时提升了POPE和MME数据集上的准确性。
链接: https://arxiv.org/abs/2502.01056
作者: Chao Wang,Xuancheng Zhou,Weiwei Fu,Yang Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Visual Language Models (LVLMs) integrate visual and linguistic modalities, exhibiting exceptional performance across various multimodal tasks. Nevertheless, LVLMs remain vulnerable to the issue of object hallucinations. Previous efforts to mitigate this issue focus on supervised fine-tuning (SFT) or incorporating external knowledge, both of which entail significant costs related to training and the acquisition of external data. To address these challenges, we propose a novel model-agnostic approach termed Internal Fact-based Contrastive Decoding (IFCD), designed to mitigate and suppress hallucinations during the inference process of LVLMs by exploiting the LVLMs’ own hallucinations. IFCD is grounded in experimental observations that alterations to the LVLMs’ internal representations tend to amplify hallucinations caused by language bias. By contrasting disturbed distribution, IFCD calibrates the LVLMs’ output and effectively removes the hallucinatory logits from the final predictions. Experimental results validate that IFCD significantly alleviates both object-level and attribute-level hallucinations while achieving an average 9% accuracy improvement on POPE and 8% accuracy improvement on MME object hallucinations subset compared with direct decoding, respectively.
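IFCD 的具体校准公式摘要中未给出;下面用对比解码文献中常见的线性外推形式做一个示意(`alpha` 为假设的超参数):被扰动分布放大的 token 更可能源于语言偏置幻觉,校准后其 logit 被抑制:

```python
import numpy as np

def ifcd_decode(logits_orig, logits_disturbed, alpha=1.0):
    """对比解码的线性外推形式(示意):用原始分布与扰动分布之差
    校准输出,从最终预测中移除幻觉性 logit。"""
    return (1 + alpha) * logits_orig - alpha * logits_disturbed

logits_orig = np.array([1.0, 1.2, 0.5])   # 直接解码:token 1 胜出(幻觉)
logits_dist = np.array([0.5, 2.0, 0.5])   # 扰动内部表示后 token 1 被进一步放大
calibrated = ifcd_decode(logits_orig, logits_dist)
assert int(np.argmax(calibrated)) == 0    # 校准后幻觉 token 被抑制
```

直观上,扰动前后 logit 同向增长的 token 被视为「由偏置驱动」,其相对得分在相减中被削弱。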
zh
[NLP-58] PARA: Parameter-Efficient Fine-tuning with Prompt Aware Representation Adjustment ACL-2024
【速读】: 该论文旨在解决参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在单骨干多租户应用中的效率与性能平衡问题。论文的关键在于提出了一种新的方法,称为提示感知表征调整(Prompt Aware Representation Adjustment, PARA)。PARA通过在每个Transformer层内集成一个轻量级向量生成器来实现,该生成器能够根据输入提示生成响应向量,从而相应地调整隐藏表示。这种方法在保持相似可调参数数量的同时,展示了超越现有PEFT基准的性能,并且在单骨干多租户场景下比LoRA更为高效。
链接: https://arxiv.org/abs/2502.01033
作者: Zequan Liu,Yi Zhao,Ming Tan,Wei Zhu,Aaron Xuxiang Tian
机构: RWTH Aachen University (RWTH 亚琛工业大学); University of Pennsylvania (宾夕法尼亚大学); Southern University of Science and Technology (南方科技大学); University of Hong Kong (香港大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: accepted by ACL-2024
点击查看摘要
Abstract:In the realm of parameter-efficient fine-tuning (PEFT) methods, while options like LoRA are available, there is a persistent demand in the industry for a PEFT approach that excels in both efficiency and performance within the context of single-backbone multi-tenant applications. This paper introduces a new and straightforward PEFT technique, termed Prompt Aware Representation Adjustment (PARA). The core of our proposal is to integrate a lightweight vector generator within each Transformer layer. This generator produces vectors that are responsive to input prompts, thereby adjusting the hidden representations accordingly. Our extensive experimentation across diverse tasks has yielded promising results. Firstly, the PARA method has been shown to surpass current PEFT benchmarks in terms of performance, despite having a similar number of adjustable parameters. Secondly, it has proven to be more efficient than LoRA in the single-backbone multi-tenant scenario, highlighting its significant potential for industrial adoption.
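「每层一个轻量向量生成器,按输入提示调整隐状态」可以粗略示意如下。生成器的具体结构(瓶颈式降维-非线性-升维)与维度均为本示例的假设:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_bottleneck = 16, 4   # 隐层维度与瓶颈维度(假设值)

# 轻量级向量生成器的参数:降维 -> 非线性 -> 升维(结构为假设)
W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.1
W_up = rng.standard_normal((d_bottleneck, d_model)) * 0.1

def para_adjust(hidden, prompt_repr):
    """由提示表示生成一个调整向量,加到该层所有 token 的隐状态上。"""
    v = np.tanh(prompt_repr @ W_down) @ W_up
    return hidden + v

hidden = rng.standard_normal((5, d_model))    # 5 个 token 的隐状态
prompt_repr = rng.standard_normal(d_model)    # 池化后的提示表示
out = para_adjust(hidden, prompt_repr)
assert out.shape == hidden.shape
```

每个租户只需保存各层的生成器参数;主干权重保持不变,这正是其适合单骨干多租户场景的原因。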
zh
[NLP-59] Knowing When to Stop: Dynamic Context Cutoff for Large Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在处理输入上下文时效率低下的问题,特别是在查询所需信息局限于局部上下文的情况下。论文的关键解决方案是动态上下文截断(Dynamic Context Cutoff),这是一种受人类启发的方法,使模型能够在获取足够的任务相关的信息后自动终止处理。通过分析模型内部,研究发现特定的注意力头(attention heads)天然编码了“充分性信号”(sufficiency signals),这些信号可以通过轻量级分类器检测到,并预测何时已处理到关键信息。这一发现揭示了一个新的效率范式:模型内部的理解自然地指导处理需求,而不是依赖外部压缩启发式方法。
链接: https://arxiv.org/abs/2502.01025
作者: Roy Xie,Junlin Wang,Paul Rosu,Chunyuan Deng,Bolun Sun,Zihao Lin,Bhuwan Dhingra
机构: 未知
类目: Computation and Language (cs.CL)
备注: Project Website: this https URL
点击查看摘要
Abstract:Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient in cases where the information required to answer a query is localized within the context. We present dynamic context cutoff, a human-inspired method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode “sufficiency signals” - detectable through lightweight classifiers - that predict when critical information has been processed. This reveals a new efficiency paradigm: models’ internal understanding naturally dictates processing needs rather than external compression heuristics. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate 1.33x average token reduction while improving accuracy by 1.3%. Furthermore, our method demonstrates better performance with the same rate of token reduction compared to other context efficiency methods. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
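动态截断的控制流程可以用如下草图说明(纯示意:真实系统中 `sufficiency_prob` 由注意力头特征上的轻量分类器给出,此处用关键词探针代替,属本示例假设):

```python
def process_with_cutoff(chunks, sufficiency_prob, threshold=0.9):
    """按块流式读入上下文;一旦充分性信号超过阈值,
    立即停止处理后续上下文,实现"自我终止"。"""
    consumed = []
    for chunk in chunks:
        consumed.append(chunk)
        if sufficiency_prob(consumed) >= threshold:
            break
    return consumed

# 玩具探针:看到关键词即视为"信息充分"
probe = lambda seen: 1.0 if any("answer" in c for c in seen) else 0.0
used = process_with_cutoff(["intro", "background", "the answer is 42", "appendix"], probe)
assert used == ["intro", "background", "the answer is 42"]
```

节省的 token 数取决于关键信息在上下文中的位置:信息越靠前,截断越早,收益越大。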
zh
[NLP-60] MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs NAACL2024
【速读】: 该论文旨在解决在将不同领域专家模型(Expert LLMs)融合成统一的混合专家模型(Mixture-of-Experts, MoE)过程中遇到的挑战,特别是针对参数权重高度不同的模型或具有不同架构的模型。论文的关键解决方案包括引入新的MoE融合技术,通过策略减轻参数干扰、采用路由启发式方法减少MoE微调需求,并提出一种新型的方法来融合具有不同架构的专家模型。这些方法显著降低了微调成本,提升了性能,并扩展了MoE融合的应用范围。
链接: https://arxiv.org/abs/2502.00997
作者: Yuhang Zhou,Giannis Karamanolakis,Victor Soto,Anna Rumshisky,Mayank Kulkarni,Furong Huang,Wei Ai,Jianhua Lu
机构: University of Maryland, College Park (马里兰大学公园分校); Amazon AGI (亚马逊AGI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by NAACL 2024 Main
点击查看摘要
Abstract:The recent success of specialized Large Language Models (LLMs) in domains such as mathematical reasoning and coding has led to growing interest in methods for merging these expert LLMs into a unified Mixture-of-Experts (MoE) model, with the goal of enhancing performance in each domain while retaining effectiveness on general tasks. However, the effective merging of expert models remains an open challenge, especially for models with highly divergent weight parameters or different architectures. State-of-the-art MoE merging methods only work with homogeneous model architectures and rely on simple unweighted averaging to merge expert layers, which does not address parameter interference and requires extensive fine-tuning of the merged MoE to restore performance. To address these limitations, this paper introduces new MoE merging techniques, including strategies to mitigate parameter interference, routing heuristics to reduce the need for MoE fine-tuning, and a novel method for merging experts with different architectures. Extensive experiments across multiple domains demonstrate the effectiveness of our proposed methods, reducing fine-tuning costs, improving performance over state-of-the-art methods, and expanding the applicability of MoE merging.
zh
[NLP-61] Self-supervised Analogical Learning using Language Models
【速读】: 该论文旨在解决大型语言模型在推理一致性方面的问题,即模型在训练数据不熟悉的情境下表现不佳,尽管它们能够成功解决类似且更常见的问题。为了解决这一问题,论文提出了一种名为SAL(自监督类比学习框架)的方法。SAL的关键在于模仿人类的类比过程,通过训练模型将高质量的符号化解决方案从已知的解题案例转移到其他罕见且容易出错的情境中,从而促使模型理解高层次和抽象的推理过程,而非仅仅关注最终答案。这种方法显著提升了模型在多种推理基准测试中的性能,并增强了模型的泛化能力和可控性。
链接: https://arxiv.org/abs/2502.00996
作者: Ben Zhou,Sarthak Jain,Yi Zhang,Qiang Ning,Shuai Wang,Yassine Benajiba,Dan Roth
机构: Arizona State University (亚利桑那州立大学); Amazon (亚马逊); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models have been shown to suffer from reasoning inconsistency issues. That is, they fail more in situations unfamiliar to the training data, even though exact or very similar reasoning paths exist in more common cases that they can successfully solve. Such observations motivate us to propose methods that encourage models to understand the high-level and abstract reasoning processes during training instead of only the final answer. This way, models can transfer the exact solution to similar cases, regardless of their relevance to the pre-training data distribution. In this work, we propose SAL, a self-supervised analogical learning framework. SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions from cases that they know how to solve to other rare cases in which they tend to fail more. We show that the resulting models after SAL learning outperform base language models on a wide range of reasoning benchmarks, such as StrategyQA, GSM8K, and HotpotQA, by 2% to 20%. At the same time, we show that our model is more generalizable and controllable through analytical studies.
zh
[NLP-62] ChartCitor: Multi-Agent Framework for Fine-Grained Chart Visual Attribution
【速读】: 该论文旨在解决大型语言模型(LLMs)在执行图表问答任务时经常生成未经验证的幻觉性回复的问题。现有答案归因方法难以将回答与源图表关联,主要因为有限的视觉语义上下文、复杂的视觉文本对齐需求以及复杂布局中的边界框预测难题。论文提出的关键解决方案是ChartCitor,一个多智能体框架,通过识别图表图像内的支持证据来提供细粒度的边界框引用。该系统协调LLM智能体进行图表到表格的提取、答案重铸、表格增强、预筛选和重新排序的证据检索,以及表格到图表的映射。这些步骤共同提升了现有基线模型在不同图表类型上的表现,并增强了用户对生成式AI的信任。
链接: https://arxiv.org/abs/2502.00989
作者: Kanika Goswami,Puneet Mathur,Ryan Rossi,Franck Dernoncourt
机构: IGDTUW(印度技术大学德里分校), Delhi India; Adobe Research(Adobe研究), USA; Adobe Research(Adobe研究), USA; Adobe Research(Adobe研究), USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) can perform chart question-answering tasks but often generate unverified hallucinated responses. Existing answer attribution methods struggle to ground responses in source charts due to limited visual-semantic context, complex visual-text alignment requirements, and difficulties in bounding box prediction across complex layouts. We present ChartCitor, a multi-agent framework that provides fine-grained bounding box citations by identifying supporting evidence within chart images. The system orchestrates LLM agents to perform chart-to-table extraction, answer reformulation, table augmentation, evidence retrieval through pre-filtering and re-ranking, and table-to-chart mapping. ChartCitor outperforms existing baselines across different chart types. Qualitative user studies show that ChartCitor helps increase user trust in Generative AI by providing enhanced explainability for LLM-assisted chart QA and enables professionals to be more productive.
zh
[NLP-63] PlotGen: Multi-Agent LLM -based Scientific Data Visualization via Multimodal Feedback
【速读】: 该论文旨在解决 novice 用户在科学数据可视化过程中面临的工具选择复杂性和技术掌握困难的问题。解决方案的关键在于 PlotGen,这是一个多代理框架,通过包括查询规划代理、代码生成代理以及三个检索反馈代理在内的多个基于大规模语言模型(LLM)的代理,实现科学可视化创建的自动化。这些代理协同工作,逐步分解用户请求、生成可执行代码,并通过迭代反馈机制优化数据准确性、文本标签和视觉正确性,从而提高可视化结果的质量和用户信任度。
链接: https://arxiv.org/abs/2502.00988
作者: Kanika Goswami,Puneet Mathur,Ryan Rossi,Franck Dernoncourt
机构: IGDTUW(德里技术大学); Adobe Research(Adobe研究); Adobe Research(Adobe研究); Adobe Research(Adobe研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Scientific data visualization is pivotal for transforming raw data into comprehensible visual representations, enabling pattern recognition, forecasting, and the presentation of data-driven insights. However, novice users often face difficulties due to the complexity of selecting appropriate tools and mastering visualization techniques. Large Language Models (LLMs) have recently demonstrated potential in assisting code generation, though they struggle with accuracy and require iterative debugging. In this paper, we propose PlotGen, a novel multi-agent framework aimed at automating the creation of precise scientific visualizations. PlotGen orchestrates multiple LLM-based agents, including a Query Planning Agent that breaks down complex user requests into executable steps, a Code Generation Agent that converts pseudocode into executable Python code, and three retrieval feedback agents - a Numeric Feedback Agent, a Lexical Feedback Agent, and a Visual Feedback Agent - that leverage multimodal LLMs to iteratively refine the data accuracy, textual labels, and visual correctness of generated plots via self-reflection. Extensive experiments show that PlotGen outperforms strong baselines, achieving a 4-6 percent improvement on the MatPlotBench dataset, leading to enhanced user trust in LLM-generated visualizations and improved novice productivity due to a reduction in debugging time needed for plot errors.
zh
[NLP-64] RandLoRA: Full-rank parameter-efficient fine-tuning of large models ICLR
【速读】: 该论文旨在解决在低秩适应(Low-Rank Adaptation, LoRA)与标准微调之间观察到的性能差距问题。论文的关键在于引入RandLoRA方法,通过学习线性组合低秩、非训练随机矩阵的方式实现全秩更新,同时限制优化仅作用于应用于固定随机矩阵的对角缩放矩阵。这种方法能够在保持参数和内存效率的同时,有效克服低秩带来的表示能力限制。
链接: https://arxiv.org/abs/2502.00987
作者: Paul Albert,Frederic Z. Zhang,Hemanth Saratchandran,Cristian Rodriguez-Opazo,Anton van den Hengel,Ehsan Abbasnejad
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: To appear at the International Conference on Learning Representations (ICLR) 2025
点击查看摘要
Abstract:Low-Rank Adaptation (LoRA) and its variants have shown impressive results in reducing the number of trainable parameters and memory requirements of large transformer networks while maintaining fine-tuning performance. However, the low-rank nature of the weight update inherently limits the representation power of fine-tuned models, potentially compromising performance on complex tasks. This raises a critical question: when a performance gap between LoRA and standard fine-tuning is observed, is it due to the reduced number of trainable parameters or the rank deficiency? This paper aims to answer this question by introducing RandLoRA, a parameter-efficient method that performs full-rank updates using a learned linear combination of low-rank, non-trainable random matrices. Our method limits the number of trainable parameters by restricting optimization to diagonal scaling matrices applied to the fixed random matrices. This allows us to effectively overcome the low-rank limitations while maintaining parameter and memory efficiency during training. Through extensive experimentation across vision, language, and vision-language benchmarks, we systematically evaluate the limitations of LoRA and existing random basis methods. Our findings reveal that full-rank updates are beneficial across vision and language tasks individually, and even more so for vision-language tasks, where RandLoRA significantly reduces – and sometimes eliminates – the performance gap between standard fine-tuning and LoRA, demonstrating its efficacy.
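「固定随机低秩基 + 可训练对角缩放」如何凑出满秩更新,可用如下数值草图验证(维度与基的个数为假设值,参数化形式按摘要描述概括,细节以原文为准):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n = 8, 8, 2, 4   # 输出/输入维度、每个基的秩、随机基个数(假设值)

# 固定、不可训练的随机低秩基
B = rng.standard_normal((n, d, r))
A = rng.standard_normal((n, r, k))
# 仅有的可训练参数:作用在固定随机矩阵上的对角缩放
lam = rng.standard_normal((n, r))

def delta_w(B, A, lam):
    # dW = sum_i B_i diag(lam_i) A_i:随机基的可学习线性组合
    return sum(B[i] @ np.diag(lam[i]) @ A[i] for i in range(len(B)))

dW = delta_w(B, A, lam)
# n 个独立的秩 r 项叠加,可达满秩 min(n*r, d, k);此处 4*2 = 8
assert np.linalg.matrix_rank(dW) == min(n * r, d, k)
```

与 LoRA 的单个秩 r 更新相比,可训练参数同为 O(n·r) 量级,但更新矩阵不再受单一低秩约束。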
zh
[NLP-65] Context-Aware Hierarchical Merging for Long Document Summarization
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理长文本摘要时因固定输入长度限制而产生的局限性。具体而言,层次合并(Hierarchical Merging)技术虽能将长文本分解为更小部分进行处理,但递归合并过程会放大模型的幻觉效应(hallucinations),增加事实不准确性的风险。论文的关键解决方案在于通过从源文档中引入上下文信息来增强层次合并技术,提出了多种上下文增强方法,包括替换中间摘要、使用上下文作为支持证据进行精炼以及隐式引用输入文档。实验结果显示,在法律和叙事领域的数据集上,这些上下文增强方法显著优于零样本和基本层次合并方法,特别是在与抽取式摘要结合使用时,精炼方法表现出最佳性能。
链接: https://arxiv.org/abs/2502.00977
作者: Litu Ou,Mirella Lapata
机构: University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL)
备注: 30 pages
点击查看摘要
Abstract:Hierarchical Merging is a technique commonly used to summarize very long texts (>100K tokens) by breaking down the input into smaller sections, summarizing those sections individually, and then merging or combining those summaries into a final coherent summary. Although it helps address the limitations of large language models (LLMs) with fixed input length constraints, the recursive merging process can amplify LLM hallucinations, increasing the risk of factual inaccuracies. In this paper, we seek to mitigate hallucinations by enriching hierarchical merging with context from the source document. Specifically, we propose different approaches to contextual augmentation ranging from replacing intermediate summaries with relevant input context, to refining them while using the context as supporting evidence, and aligning them implicitly (via citations) to the input. Experimental results on datasets representing legal and narrative domains show that contextual augmentation consistently outperforms zero-shot and hierarchical merging baselines for the Llama 3.1 model family. Our analysis further reveals that refinement methods tend to perform best when paired with extractive summarization for identifying relevant input.
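层次合并基线的递归流程可以示意如下(真实系统中 `summarize` 与 `merge` 均为 LLM 调用,此处用玩具替身代替;论文的贡献在于 merge 一步额外注入源文档上下文,草图未体现该增强):

```python
def hierarchical_merge(chunks, summarize, merge, fanout=2):
    """层次合并示意:先对每个片段分别摘要,
    再按 fanout 分组递归合并,直到只剩一个摘要。"""
    sums = [summarize(c) for c in chunks]
    while len(sums) > 1:
        sums = [merge(sums[i:i + fanout]) for i in range(0, len(sums), fanout)]
    return sums[0]

# 玩具替身:截断代替摘要,拼接代替合并
summarize = lambda c: c[:10]
merge = lambda parts: " | ".join(parts)
result = hierarchical_merge(
    ["alpha " * 5, "beta " * 5, "gamma " * 5, "delta " * 5],
    summarize, merge)
assert "alpha" in result and "delta" in result
```

幻觉放大的风险正来自这条递归链:每一层 merge 都只见到上一层的摘要而非原文,误差逐层累积,这也是为何论文要在 merge 时重新接入源文档上下文。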
zh
[NLP-66] Wizard of Shopping: Target-Oriented E-commerce Dialogue Generation with Decision Tree Branching SIGDIAL2024
【速读】: 该论文旨在解决聊天式产品搜索(CPS)领域中因缺乏可靠且大规模数据集而导致的智能助手训练难题。论文的关键解决方案是提出了一种名为TRACER的新方法,该方法利用大型语言模型(LLMs)生成针对不同购物领域的逼真且自然的对话,并通过与对话计划(dialogue plans)结合,确保对话过程中产品搜索轨迹的相关性和高效性。此外,论文还发布了首个目标导向的CPS数据集Wizard of Shopping (WoS),包含三个购物领域的高度自然连贯的对话共3.6k条,以验证所提方法的有效性。
链接: https://arxiv.org/abs/2502.00969
作者: Xiangci Li,Zhiyu Chen,Jason Ingyu Choi,Nikhita Vedula,Besnik Fetahu,Oleg Rokhlenko,Shervin Malmasi
机构: AWS AI Labs; Amazon.com, Inc.
类目: Computation and Language (cs.CL)
备注: Accepted by SIGDIAL 2024 but withdrawn
点击查看摘要
Abstract:The goal of conversational product search (CPS) is to develop an intelligent, chat-based shopping assistant that can directly interact with customers to understand shopping intents, ask clarification questions, and find relevant products. However, training such assistants is hindered mainly due to the lack of reliable and large-scale datasets. Prior human-annotated CPS datasets are extremely small in size and lack integration with real-world product search systems. We propose a novel approach, TRACER, which leverages large language models (LLMs) to generate realistic and natural conversations for different shopping domains. TRACER’s novelty lies in grounding the generation to dialogue plans, which are product search trajectories predicted from a decision tree model, that guarantees relevant product discovery in the shortest number of search conditions. We also release the first target-oriented CPS dataset Wizard of Shopping (WoS), containing highly natural and coherent conversations (3.6k) from three shopping domains. Finally, we demonstrate the quality and effectiveness of WoS via human evaluations and downstream tasks.
zh
[NLP-67] Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search
【速读】: 该论文旨在解决基于Q值选择数据以增强大规模语言模型(Large Language Model, LLM)驱动的多智能体系统(multi-agent system, MAS)自训练过程中存在的不一致性问题。论文的关键解决方案在于提出了一种名为数据影响力导向树搜索(Data Influence-oriented Tree Search, DITS)的新框架,通过引入影响力分数来指导树搜索和数据选择过程。DITS 方法通过利用影响力分数有效识别对系统改进影响最大的数据,从而提升模型性能,并且针对非可微指标设计了影响力分数估算方法,显著降低了计算开销。研究表明,在数据合成过程中更多地分配推理资源用于估算影响力分数而非Q值,能够更有效地提升模型训练效果。
链接: https://arxiv.org/abs/2502.00955
作者: Wentao Shi,Zichun Yu,Fuli Feng,Xiangnan He,Chenyan Xiong
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Monte Carlo Tree Search (MCTS) based methods provide promising approaches for generating synthetic data to enhance the self-training of Large Language Model (LLM) based multi-agent systems (MAS). These methods leverage Q-values to estimate individual agent contributions. However, relying solely on Q-values to identify informative data may misalign with the data synthesis objective, as the focus should be on selecting data that best enhances model training. To address this discrepancy, we propose Data Influence-oriented Tree Search (DITS), a novel framework that incorporates influence scores to guide both tree search and data selection. By leveraging influence scores, we effectively identify the most impactful data for system improvement, thereby enhancing model performance. Furthermore, we derive influence score estimation methods tailored for non-differentiable metrics, significantly reducing computational overhead by utilizing inference computations. Extensive experiments on eight multi-agent datasets demonstrate the robustness and effectiveness of the proposed methods. Notably, our findings reveal that allocating more inference resources to estimate influence scores, rather than Q-values, during data synthesis can more effectively and efficiently enhance model training.
zh
[NLP-68] Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale
Quick read: This paper addresses the problem of efficiently extracting and normalizing medical information from large volumes of unstructured clinical text. Traditional approaches require substantial manual effort, including crafting rules or annotating training labels, which limits scalability. The key solution is UniMedAbstractor (UMA), a zero-shot medical abstraction framework that leverages large language models (LLMs) through modular, customizable prompt templates. Its universal prompt template lets UMA adapt quickly to new attributes without attribute-specific training labels or rules, yielding broader applicability and higher efficiency.
Link: https://arxiv.org/abs/2502.00943
Authors: Cliff Wong, Sam Preston, Qianchu Liu, Zelalem Gero, Jass Bagga, Sheng Zhang, Shrey Jain, Theodore Zhao, Yu Gu, Yanbo Xu, Sid Kiblawi, Roshanthi Weerasinghe, Rom Leidner, Kristina Young, Brian Piening, Carlo Bifulco, Tristan Naumann, Mu Wei, Hoifung Poon
Affiliations: Microsoft, Redmond, WA, USA; Providence Research Network, Renton, WA, USA; Providence Genomics, Portland, OR, USA; Earle A. Chiles Research Institute, Providence Cancer Institute, Portland, OR, USA; The Oregon Clinic, Radiation Oncology Division, Portland, OR; Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:The vast majority of real-world patient information resides in unstructured clinical text, and the process of medical abstraction seeks to extract and normalize structured information from this unstructured input. However, traditional medical abstraction methods can require significant manual efforts that can include crafting rules or annotating training labels, limiting scalability. In this paper, we propose UniMedAbstractor (UMA), a zero-shot medical abstraction framework leveraging Large Language Models (LLMs) through a modular and customizable prompt template. We refer to our approach as universal abstraction as it can quickly scale to new attributes through its universal prompt template without curating attribute-specific training labels or rules. We evaluate UMA for oncology applications, focusing on fifteen key attributes representing the cancer patient journey, from short-context attributes (e.g., performance status, treatment) to complex long-context attributes requiring longitudinal reasoning (e.g., tumor site, histology, TNM staging). Experiments on real-world data show UMA’s strong performance and generalizability. Compared to supervised and heuristic baselines, UMA with GPT-4o achieves on average an absolute 2-point F1/accuracy improvement for both short-context and long-context attribute abstraction. For pathologic T staging, UMA even outperforms the supervised model by 20 points in accuracy.
zh
[NLP-69] Attention Sinks and Outlier Features: A "Catch, Tag, and Release" Mechanism for Embeddings
Quick read: This paper investigates two prominent phenomena in large language models (LLMs): the presence of large-norm (outlier) features and the tendency of tokens to attend very strongly to a select few tokens. It focuses on how these phenomena manifest in model parameters and what they imply for performance, compression, and streaming. A key result is a proof that the "catch, tag, release" mechanism is required even for simple tasks such as averaging, explaining why it arises organically in modern LLMs. Experiments further show that attention sinks can be fully captured in the model parameters by low-rank matrices, which aids model compression and substantiates recent approaches that add a low-rank term to offset performance degradation.
Link: https://arxiv.org/abs/2502.00919
Authors: Stephen Zhang, Mustafa Khan, Vardan Papyan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Two prominent features of large language models (LLMs) are the presence of large-norm (outlier) features and the tendency for tokens to attend very strongly to a select few tokens. Despite often having no semantic relevance, these select tokens, called attention sinks, along with the large outlier features, have proven important for model performance, compression, and streaming. Consequently, investigating the roles of these phenomena within models and exploring how they might manifest in the model parameters has become an area of active interest. Through an empirical investigation, we demonstrate that attention sinks utilize outlier features to: catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream, where the tagged tokens are eventually retrieved. We prove that simple tasks, like averaging, necessitate the ‘catch, tag, release’ mechanism hence explaining why it would arise organically in modern LLMs. Our experiments also show that the creation of attention sinks can be completely captured in the model parameters using low-rank matrices, which has important implications for model compression and substantiates the success of recent approaches that incorporate a low-rank term to offset performance degradation.
zh
[NLP-70] The Accuracy, Robustness, and Readability of LLM-Generated Sustainability-Related Word Definitions
Quick read: This paper examines the importance of a common language with standardized definitions for climate discussions, and the risk that large language models (LLMs) misrepresent climate terms. The authors compare 300 official IPCC glossary definitions with those generated by GPT-4o-mini, Llama3.1 8B, and Mistral 7B, evaluating adherence (average 0.57–0.59 ± 0.15), robustness, and readability. The key idea is that analyzing where model-generated definitions diverge from the originals, especially for polysemous or ambiguous words, highlights terms in need of standardization. The results suggest that while LLMs can support environmental discourse, their outputs should be aligned with established terminology to ensure clarity and consistency.
Link: https://arxiv.org/abs/2502.00916
Authors: Alice Heiman
Affiliations: Stanford University
Categories: Computation and Language (cs.CL)
Comments: NLP4Ecology Workshop 2025
Click to view abstract
Abstract:A common language with standardized definitions is crucial for effective climate discussions. However, concerns exist about LLMs misrepresenting climate terms. We compared 300 official IPCC glossary definitions with those generated by GPT-4o-mini, Llama3.1 8B, and Mistral 7B, analyzing adherence, robustness, and readability using SBERT sentence embeddings. The LLMs scored an average adherence of 0.57–0.59 ± 0.15, and their definitions proved harder to read than the originals. Model-generated definitions vary mainly among words with multiple or ambiguous definitions, showing the potential to highlight terms that need standardization. The results show how LLMs could support environmental discourse while emphasizing the need to align model outputs with established terminology for clarity and consistency.
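As a rough illustration of the adherence measurement above: the paper compares definitions via SBERT sentence embeddings, and a standard way to score such embedding pairs is cosine similarity. The similarity function and the toy 3-dimensional vectors below are illustrative assumptions, not the authors' exact pipeline:

```python
import math

def cosine_similarity(u, v):
    # dot product of u and v divided by the product of their norms
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings of an official IPCC definition and a model-generated one.
official = [0.2, 0.7, 0.1]
generated = [0.25, 0.65, 0.05]
adherence = cosine_similarity(official, generated)  # near 1 when well aligned
```

With real SBERT vectors, scores near 1 would indicate a generated definition that closely adheres to the official one.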
zh
[NLP-71] Embracing Dialectic Intersubjectivity: Coordination of Different Perspectives in Content Analysis with LLM Persona Simulation
Quick read: This study aims to move content analysis methodology from consensus-oriented toward coordination-oriented practice, embracing diverse coding outputs and examining the dynamics among differing perspectives. The key approach evaluates six GPT-4o configurations for sentiment analysis of Fox News and MSNBC transcripts on Biden and Trump during the 2020 U.S. presidential campaign, using these evaluations to explore how partisan selective processing can be identified in LLM-Assisted Content Analysis (LACA). The study finds that partisan persona LLMs exhibit stronger ideological bias when processing politically congruent content, and that intercoder reliability is higher among same-partisan personas than across cross-partisan pairs. This approach deepens the nuanced understanding of LLM outputs and strengthens the rigor of AI-driven social science research, enabling simulations of real-world implications.
Link: https://arxiv.org/abs/2502.00903
Authors: Taewoo Kang, Kjerstin Thorson, Tai-Quan Peng, Dan Hiaeshutter-Rice, Sanguk Lee, Stuart Soroka
Affiliations: Michigan State University; Colorado State University; Texas Christian University; University of California, Los Angeles
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments:
Click to view abstract
Abstract:This study attempts to advance content analysis methodology from consensus-oriented to coordination-oriented practices, thereby embracing diverse coding outputs and exploring the dynamics among differential perspectives. As an exploratory investigation of this approach, we evaluate six GPT-4o configurations to analyze sentiment in Fox News and MSNBC transcripts on Biden and Trump during the 2020 U.S. presidential campaign, examining patterns across these models. By assessing each model’s alignment with ideological perspectives, we explore how partisan selective processing could be identified in LLM-Assisted Content Analysis (LACA). Findings reveal that partisan persona LLMs exhibit stronger ideological biases when processing politically congruent content. Additionally, intercoder reliability is higher among same-partisan personas compared to cross-partisan pairs. This approach enhances the nuanced understanding of LLM outputs and advances the integrity of AI-driven social science research, enabling simulations of real-world implications.
zh
[NLP-72] MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
Quick read: This paper addresses the suboptimal segmentation produced by existing tokenization methods such as Byte Pair Encoding (BPE), which ignore morpheme boundaries, particularly in morphologically rich languages. The key solution is MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency, improving the accuracy and consistency of subword segmentation.
Link: https://arxiv.org/abs/2502.00894
Authors: Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri
Affiliations: Qatar Computing Research Institute (QCRI), Doha, Qatar; German Research Center for Artificial Intelligence (DFKI), Berlin, Germany; Technical University of Berlin, Berlin, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme boundaries, leading to suboptimal segmentation, particularly in morphologically rich languages. We introduce MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency. Additionally, we propose two morphology-based evaluation metrics: (i) Morphological Consistency F1-Score, which quantifies the consistency between morpheme sharing and token sharing, contributing to LLM training convergence, and (ii) Morphological Edit Distance, which measures alignment between morphemes and tokens concerning interpretability. Experiments on English, Russian, Hungarian, and Arabic across 300M and 1B parameter LLMs demonstrate that MorphBPE consistently reduces cross-entropy loss, accelerates convergence, and improves morphological alignment scores. Fully compatible with existing LLM pipelines, MorphBPE requires minimal modifications for integration. The MorphBPE codebase and tokenizer playground will be available at: this https URL and this https URL
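A toy sketch of the core idea: a BPE merge-counting step that refuses to merge across morpheme boundaries. The boundary marker, the data structure, and the merge policy below are illustrative assumptions; MorphBPE's actual algorithm may differ:

```python
from collections import Counter

BOUNDARY = "|"  # hypothetical morpheme-boundary marker, e.g. "un|happy"

def pair_counts(words):
    """Count adjacent symbol pairs, skipping any pair that touches a boundary."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            if a == BOUNDARY or b == BOUNDARY:
                continue  # never merge across a morpheme boundary
            counts[(a, b)] += freq
    return counts

# A tiny "corpus": the word un|happy seen 3 times, split into symbols.
words = {("u", "n", BOUNDARY, "h", "a", "p", "p", "y"): 3}
best = pair_counts(words).most_common(1)[0][0]  # a most frequent legal pair
```

Plain BPE would happily merge "n" with "h" across the prefix boundary; here such pairs never enter the merge table, so subwords stay morpheme-aligned.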
zh
[NLP-73] SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters ICLR2025
Quick read: This paper addresses the problem that existing preference-optimization objectives for language model alignment require extensive hyperparameter tuning, which increases the complexity and time cost of fine-tuning large language models. It proposes SimPER (Simple Preference Optimization via Inverse Perplexity), a simple yet effective hyperparameter-free preference-optimization algorithm. The key idea is to optimize inverse perplexity, computed from the exponentiated average log-likelihood of the chosen and rejected responses, which removes the need for expensive hyperparameter tuning and a reference model and is therefore efficient in both compute and memory. Experiments show that SimPER significantly outperforms existing methods across multiple benchmarks without any hyperparameters or a reference model.
Link: https://arxiv.org/abs/2502.00883
Authors: Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, Vasant G Honavar
Affiliations: Pennsylvania State University; University of Chinese Academy of Sciences; Meituan Inc; Tencent AI Lab; Sun Yat-Sen University; Leiden University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: ICLR 2025
Click to view abstract
Abstract:Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for language model alignment. We observe that promising performance can be achieved simply by optimizing inverse perplexity, which is calculated as the inverse of the exponentiated average log-likelihood of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER, is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both computationally and memory efficient. Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches, even without any hyperparameters or a reference model. For example, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across 10 benchmarks on the Open LLM Leaderboard. The source code for SimPER is publicly available at: this https URL.
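The inverse-perplexity objective lends itself to a compact sketch. Assuming per-token log-likelihoods are available for the chosen and rejected responses, one plausible reading of the SimPER loss is the negated gap between their inverse perplexities; the exact form is an assumption here, not the paper's verbatim objective:

```python
import math

def inverse_perplexity(token_logps):
    # exp of the average per-token log-likelihood, i.e. 1 / perplexity
    return math.exp(sum(token_logps) / len(token_logps))

def simper_loss(chosen_logps, rejected_logps):
    # push the chosen response's inverse perplexity up, the rejected one's down
    return -inverse_perplexity(chosen_logps) + inverse_perplexity(rejected_logps)
```

Note there is nothing to tune: no temperature/beta hyperparameter and no reference-model log-probabilities, which is the point of the method.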
zh
[NLP-74] Language Models Use Trigonometry to Do Addition
Quick read: This paper aims to close the gap in understanding how large language models (LLMs) internally handle simple mathematical tasks, addition in particular. The key finding is that these models represent numbers as a generalized helix and manipulate that helix with the "Clock" algorithm to compute sums. Causal interventions validate both the representation and the mechanism, yielding the first representation-level explanation of an LLM's mathematical capability.
Link: https://arxiv.org/abs/2502.00873
Authors: Subhash Kantamneni, Max Tegmark
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Mathematical reasoning is an increasingly important indicator of large language model (LLM) capabilities, yet we lack understanding of how LLMs process even simple mathematical tasks. To address this, we reverse engineer how three mid-sized LLMs compute addition. We first discover that numbers are represented in these LLMs as a generalized helix, which is strongly causally implicated for the tasks of addition and subtraction, and is also causally relevant for integer division, multiplication, and modular arithmetic. We then propose that LLMs compute addition by manipulating this generalized helix using the “Clock” algorithm: to solve a+b , the helices for a and b are manipulated to produce the a+b answer helix which is then read out to model logits. We model influential MLP outputs, attention head outputs, and even individual neuron preactivations with these helices and verify our understanding with causal interventions. By demonstrating that LLMs represent numbers on a helix and manipulate this helix to perform addition, we present the first representation-level explanation of an LLM’s mathematical capability.
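The helix-plus-Clock mechanism can be sketched numerically. Assuming periodic (Fourier-like) features with a handful of periods (the set {2, 5, 10, 100} below is an assumption for illustration), the "a + b" helix falls out of the trigonometric angle-sum identities:

```python
import math

PERIODS = [2, 5, 10, 100]  # assumed period set; the paper's exact set may differ

def helix(n, periods=PERIODS):
    # generalized helix: a linear component plus (cos, sin) pairs per period
    feats = [float(n)]
    for T in periods:
        theta = 2 * math.pi * n / T
        feats += [math.cos(theta), math.sin(theta)]
    return feats

def clock_add(a, b, periods=PERIODS):
    # "Clock" step: rotate a's angle by b's angle in every period,
    # using cos(x+y) = cos x cos y - sin x sin y and the sine analogue
    feats = [float(a + b)]
    for T in periods:
        ta, tb = 2 * math.pi * a / T, 2 * math.pi * b / T
        feats += [math.cos(ta) * math.cos(tb) - math.sin(ta) * math.sin(tb),
                  math.sin(ta) * math.cos(tb) + math.cos(ta) * math.sin(tb)]
    return feats
```

`clock_add(3, 4)` reproduces `helix(7)` coordinate by coordinate, which is the sense in which rotating the two helices implements a + b.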
zh
[NLP-75] Predicting potentially unfair clauses in Chilean terms of services with natural language processing
Quick read: This paper addresses information asymmetry in consumer contracts, particularly in the context of complex and rarely read online Terms of Service. The key contribution is a new annotation scheme with four categories and twenty classes in total, applied to fifty online Terms of Service used in Chile. The paper also evaluates how transformer-based models detect and classify potentially abusive clauses, focusing on the effects of language- and domain-specific pre-training, few-shot sample size, and model architecture on performance.
Link: https://arxiv.org/abs/2502.00865
Authors: Christoffer Loeffler, Andrea Martínez Freile, Tomás Rey Pizarro
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 37 pages, 2 figures, under review
Click to view abstract
Abstract:This study addresses the growing concern of information asymmetry in consumer contracts, exacerbated by the proliferation of online services with complex Terms of Service that are rarely even read. Even though research on automatic analysis methods is conducted, the problem is aggravated by the general focus on English-language Machine Learning approaches and on major jurisdictions, such as the European Union. We introduce a new methodology and a substantial dataset addressing this gap. We propose a novel annotation scheme with four categories and a total of 20 classes, and apply it on 50 online Terms of Service used in Chile. Our evaluation of transformer-based models highlights how factors like language- and/or domain-specific pre-training, few-shot sample size, and model architecture affect the detection and classification of potentially abusive clauses. Results show a large variability in performance for the different tasks and models, with the highest macro-F1 scores for the detection task ranging from 79% to 89% and micro-F1 scores up to 96%, while macro-F1 scores for the classification task range from 60% to 70% and micro-F1 scores from 64% to 80%. Notably, this is the first Spanish-language multi-label classification dataset for legal clauses, applying Chilean law and offering a comprehensive evaluation of Spanish-language models in the legal domain. Our work lays the ground for future research in method development for rarely considered legal analysis and potentially leads to practical applications to support consumers in Chile and Latin America as a whole.
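The macro- and micro-F1 scores reported above aggregate per-class results differently: macro averages per-class F1, treating rare classes equally, while micro pools all counts before computing a single F1. A minimal sketch using the standard definitions (not code from the paper):

```python
def f1(tp, fp, fn):
    # standard F1 from true positives, false positives, false negatives
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f1(per_class):
    # per_class: list of (tp, fp, fn) tuples, one per label
    macro = sum(f1(*c) for c in per_class) / len(per_class)
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    return macro, f1(tp, fp, fn)  # (macro-F1, micro-F1)
```

The gap between the paper's macro and micro scores (e.g. 60-70% vs. 64-80% for classification) is what this distinction captures: micro-F1 is dominated by the frequent clause types.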
zh
[NLP-76] HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions SIGIR2025
Quick read: This paper addresses the fragmentation of resources in automatic hint generation: datasets come in inconsistent formats and compatible evaluation tools are missing. The key solution is HintEval, a Python library that provides convenient access to diverse datasets and multiple approaches for generating and evaluating hints. HintEval consolidates scattered resources into a single toolkit that supports a broad range of research goals and enables clear, multi-faceted, and reliable evaluation. The library also ships with detailed online documentation to help users explore its features and get started quickly. By lowering barriers to entry and encouraging consistent evaluation practice, HintEval marks a significant step forward for hint generation and analysis research in the NLP/IR community.
Link: https://arxiv.org/abs/2502.00857
Authors: Jamshid Mozafari, Bhawna Piryani, Abdelrahman Abdallah, Adam Jatowt
Affiliations: University of Innsbruck
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Submitted to SIGIR 2025
Click to view abstract
Abstract:Large Language Models (LLMs) are transforming how people find information, and many users turn nowadays to chatbots to obtain answers to their questions. Despite the instant access to abundant information that LLMs offer, it is still important to promote critical thinking and problem-solving skills. Automatic hint generation is a new task that aims to support humans in answering questions by themselves by creating hints that guide users toward answers without directly revealing them. In this context, hint evaluation focuses on measuring the quality of hints, helping to improve the hint generation approaches. However, resources for hint research are currently spanning different formats and datasets, while the evaluation tools are missing or incompatible, making it hard for researchers to compare and test their models. To overcome these challenges, we introduce HintEval, a Python library that makes it easy to access diverse datasets and provides multiple approaches to generate and evaluate hints. HintEval aggregates the scattered resources into a single toolkit that supports a range of research goals and enables a clear, multi-faceted, and reliable evaluation. The proposed library also includes detailed online documentation, helping users quickly explore its features and get started. By reducing barriers to entry and encouraging consistent evaluation practices, HintEval offers a major step forward for facilitating hint generation and analysis research within the NLP/IR community.
zh
[NLP-77] Explainability in Practice: A Survey of Explainable NLP Across Various Domains
Quick read: This paper addresses the lack of transparency and explainability caused by the black-box nature of advanced models in natural language processing (NLP). The key contribution is a survey of explainable NLP (XNLP) methods designed for different domains, meeting sector-specific needs such as healthcare's demand for clear insights and finance's emphasis on fraud detection and risk assessment. The review also fills gaps in the existing literature by examining real-world applicability, metric evaluation, and the role of human interaction in model assessment, and it proposes future research directions to deepen understanding and broaden the adoption of XNLP.
Link: https://arxiv.org/abs/2502.00837
Authors: Hadi Mohammadi, Ayoub Bagheri, Anastasia Giachanou, Daniel L. Oberski
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Natural Language Processing (NLP) has become a cornerstone in many critical sectors, including healthcare, finance, and customer relationship management. This is especially true with the development and use of advanced models such as GPT-based architectures and BERT, which are widely used in decision-making processes. However, the black-box nature of these advanced NLP models has created an urgent need for transparency and explainability. This review explores explainable NLP (XNLP) with a focus on its practical deployment and real-world applications, examining its implementation and the challenges faced in domain-specific contexts. The paper underscores the importance of explainability in NLP and provides a comprehensive perspective on how XNLP can be designed to meet the unique demands of various sectors, from healthcare’s need for clear insights to finance’s emphasis on fraud detection and risk assessment. Additionally, this review aims to bridge the knowledge gap in XNLP literature by offering a domain-specific exploration and discussing underrepresented areas such as real-world applicability, metric evaluation, and the role of human interaction in model assessment. The paper concludes by suggesting future research directions that could enhance the understanding and broader application of XNLP.
zh
[NLP-78] Generalization of Medical Large Language Models through Cross-Domain Weak Supervision
Quick read: This paper addresses how to effectively enhance the generative capabilities of medical large language models (MLLMs) for complex medical NLP tasks. The key solution is the Incremental Curriculum-Based Fine-Tuning (ICFT) framework, which combines curriculum-based learning, dual-stage memory coordination, and parameter-efficient fine-tuning to enable a progressive transition from general linguistic knowledge to strong domain-specific medical expertise, significantly improving accuracy and efficiency while strengthening generalization and reducing errors.
Link: https://arxiv.org/abs/2502.00832
Authors: Robert Long, Eric Gonzalez, Harrison Fuller
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:The advancement of large language models (LLMs) has opened new frontiers in natural language processing, particularly in specialized domains like healthcare. In this paper, we propose the Incremental Curriculum-Based Fine-Tuning (ICFT) framework to enhance the generative capabilities of medical large language models (MLLMs). ICFT combines curriculum-based learning, dual-stage memory coordination, and parameter-efficient fine-tuning to enable a progressive transition from general linguistic knowledge to strong domain-specific expertise. Experimental results across diverse medical NLP tasks, including question answering, preference classification, and response generation, demonstrate that ICFT consistently outperforms state-of-the-art baselines, achieving improvements in both accuracy and efficiency. Further analysis reveals the framework’s ability to generalize to unseen data, reduce errors, and deliver diverse, contextually relevant medical responses. These findings establish ICFT as a robust and scalable solution for adapting LLMs to the medical domain, offering practical benefits for real-world healthcare applications.
zh
[NLP-79] Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models
Quick read: This paper targets challenges in text-to-image generation, including computational inefficiency, training instability, and limited robustness to textual variation. The key solution combines large language models (LLMs) with diffusion models, introducing a novel dynamic KL-weighting strategy to optimize the diffusion process and leveraging semantic understanding from pre-trained LLMs to guide generation.
Link: https://arxiv.org/abs/2502.00826
Authors: Julian Perry, Frank Sanders, Carter Scott
Affiliations: Delta University for Science and Technology
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:In this paper, we present a novel method for improving text-to-image generation by combining Large Language Models (LLMs) with diffusion models, a hybrid approach aimed at achieving both higher quality and efficiency in image synthesis from text descriptions. Our approach introduces a new dynamic KL-weighting strategy to optimize the diffusion process, along with incorporating semantic understanding from pre-trained LLMs to guide the generation process. The proposed method significantly improves both the visual quality and alignment of generated images with text descriptions, addressing challenges such as computational inefficiency, instability in training, and robustness to textual variability. We evaluate our method on the COCO dataset and demonstrate its superior performance over traditional GAN-based models, both quantitatively and qualitatively. Extensive experiments, including ablation studies and human evaluations, confirm that our method outperforms existing approaches in terms of image realism, relevance to the input text, and overall aesthetic quality. Our approach also shows promise in scalability to other multimodal tasks, making it a versatile solution for a wide range of generative applications.
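The abstract does not specify the dynamic KL-weighting strategy, so the following is a purely hypothetical sketch of the general idea: scale the per-timestep KL term of the diffusion training loss by a weight that varies with the timestep. The linear schedule, parameter names, and loss decomposition are invented for illustration only:

```python
def kl_weight(t, T, w_min=0.1, w_max=1.0):
    # hypothetical schedule: linearly emphasize later (noisier) timesteps
    return w_min + (w_max - w_min) * (t / T)

def weighted_diffusion_loss(recon_terms, kl_terms, T):
    # recon_terms / kl_terms: per-timestep loss components, indexed 0..T-1;
    # the KL part of each timestep is scaled by its dynamic weight
    return sum(r + kl_weight(t, T) * k
               for t, (r, k) in enumerate(zip(recon_terms, kl_terms)))
```

In a real implementation the weight could also be adapted during training (e.g. from validation signals); the static linear ramp here is only the simplest instance of a time-dependent weight.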
zh
[NLP-80] Probing Large Language Models in Reasoning and Translating Complex Linguistic Puzzles
Quick read: This paper investigates the use of large language models (LLMs) for solving complex linguistic puzzles, tasks that demand advanced reasoning and proficient translation akin to human cognitive processes. The key contribution is the study of specific prompting techniques, including Input-Output Prompting (IO), Chain-of-Thought Prompting (CoT), and Solo Performance Prompting (SPP), to strengthen LLM reasoning and expose decision-making pathways. Using datasets from the Puzzling Machine Competition and various Linguistics Olympiads, the authors apply a comprehensive set of metrics to assess GPT-4 0603 under these prompting methods, giving insight into the potential and limitations of LLMs in linguistic reasoning and complex translation tasks. The work contributes to the broader NLP field by offering insights into optimizing LLM applications for improved reasoning and translation accuracy.
Link: https://arxiv.org/abs/2502.00817
Authors: Zheng-Lin Lin, Yu-Fei Shih, Shu-Kai Hsieh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 8 figures
Click to view abstract
Abstract:This paper investigates the utilization of Large Language Models (LLMs) for solving complex linguistic puzzles, a domain requiring advanced reasoning and adept translation capabilities akin to human cognitive processes. We explore specific prompting techniques designed to enhance ability of LLMs to reason and elucidate their decision-making pathways, with a focus on Input-Output Prompting (IO), Chain-of-Thought Prompting (CoT), and Solo Performance Prompting (SPP). Utilizing datasets from the Puzzling Machine Competition and various Linguistics Olympiads, we employ a comprehensive set of metrics to assess the performance of GPT-4 0603, a prominent LLM, across these prompting methods. Our findings illuminate the potential of LLMs in linguistic reasoning and complex translation tasks, highlighting their capabilities and identifying limitations in the context of linguistic puzzles. This research contributes significantly to the broader field of Natural Language Processing (NLP) by providing insights into the optimization of LLM applications for improved reasoning and translation accuracy, thereby enriching the ongoing dialogue in NLP advancements.
zh
[NLP-81] Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling
Quick read: This paper addresses two problems: existing reward models are easily exploited through superficial confounders, with length bias having a particularly strong impact on preference modeling, and fine-tuned large language models (LLMs) struggle to follow explicit length instructions. The key solution is a Response-conditioned Bradley-Terry (Rc-BT) model that explicitly disentangles human semantic preferences from response-length requirements; trained on an augmented dataset, it improves the reward model's ability to mitigate length bias and follow length instructions. The paper further proposes the Rc-DPO algorithm, which leverages the Rc-BT model for direct policy optimization (DPO) of LLMs, simultaneously mitigating length bias and promoting adherence to length instructions.
Link: https://arxiv.org/abs/2502.00814
Authors: Jianfeng Cai, Jinhua Zhu, Ruopei Sun, Yue Wang, Li Li, Wengang Zhou, Houqiang Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs) by modeling human preferences with a learnable reward model and employing a reinforcement learning algorithm to maximize the reward model’s scores. However, these reward models are susceptible to exploitation through various superficial confounding factors, with length bias emerging as a particularly significant concern. Moreover, while the pronounced impact of length bias on preference modeling suggests that LLMs possess an inherent sensitivity to length perception, our preliminary investigations reveal that fine-tuned LLMs consistently struggle to adhere to explicit length instructions. To address these two limitations, we propose a novel framework wherein the reward model explicitly differentiates between human semantic preferences and response length requirements. Specifically, we introduce a Response-conditioned Bradley-Terry (Rc-BT) model that enhances the reward model’s capability in length bias mitigating and length instruction following, through training on our augmented dataset. Furthermore, we propose the Rc-DPO algorithm to leverage the Rc-BT model for direct policy optimization (DPO) of LLMs, simultaneously mitigating length bias and promoting adherence to length instructions. Extensive evaluations demonstrate that our approach substantially improves both preference modeling and length instruction compliance, with its effectiveness validated across various foundational models and preference datasets.
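For orientation, here is the standard Bradley-Terry preference loss that Rc-BT builds on, plus a hypothetical hook showing how a length requirement could be kept separate from the semantic reward. The `rc_reward` function and its `meets_length` flag are illustrative assumptions, not the paper's formulation:

```python
import math

def bt_loss(r_chosen, r_rejected):
    # Bradley-Terry: -log P(chosen > rejected) = -log sigmoid(r_c - r_r)
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def rc_reward(semantic_reward, meets_length, penalty=1.0):
    # hypothetical response-conditioned reward: the length requirement enters
    # as a separate term instead of being entangled with semantic preference
    return semantic_reward - (0.0 if meets_length else penalty)
```

Keeping the two signals separate is the point: a longer response can no longer buy reward purely through length, because length compliance is scored against the explicit requirement.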
zh
[NLP-82] Vision-centric Token Compression in Large Language Model
Quick read: This paper addresses the inefficiency and redundancy of processing extended in-context tokens in large language models (LLMs). The key finding is that a much smaller vision encoder, applied directly to sequences of text tokens, can rival conventional text encoders, achieving comparable results on several mid-sized and small text-understanding benchmarks while using 16% fewer FLOPs and 50% less memory. The authors also uncover significant token redundancy and devise a frequency-based masking strategy that steers the vision encoder toward the most critical tokens, further improving performance.
Link: https://arxiv.org/abs/2502.00791
Authors: Ling Xing, Alex Jinpeng Wang, Rui Yan, Jinhui Tang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have revolutionized natural language processing, excelling in handling longer sequences. However, the inefficiency and redundancy in processing extended in-context tokens remain a challenge. Many attempts to address this rely on compressing tokens with smaller text encoders, yet we question whether text encoders are truly indispensable. Our journey leads to an unexpected discovery-a much smaller vision encoder, applied directly to sequences of text tokens, can rival text encoders on text tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small text understanding benchmarks, VIST leads to comparable results with 16% fewer FLOPs and 50% less memory usage. We further uncover significant token redundancy and devise a frequency-based masking strategy to guide the focus of the visual encoder toward the most critical tokens. Interestingly, we observe the trained visual encoder performs like a summarizer, selectively ignoring less important words such as prepositions and conjunctions. This approach delivers remarkable results, outperforming traditional text encoder-based methods by 5.7% on average over benchmarks like TriviaQA, NQ, PopQA, TREF, SST2, and SST5, setting a new standard for token efficiency in LLMs.
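One plausible reading of the frequency-based masking strategy, consistent with the observation that the trained encoder ignores function words like prepositions and conjunctions, is to mask high-frequency tokens and keep rarer, more informative ones. The heuristic, the keep ratio, and the mask token below are all assumptions, not the paper's exact recipe:

```python
from collections import Counter

def frequency_mask(tokens, corpus_counts, keep_ratio=0.5):
    # keep the rarest token types (assumed most informative); mask the rest
    ranked = sorted(set(tokens), key=lambda t: corpus_counts[t])
    keep = set(ranked[: max(1, int(len(ranked) * keep_ratio))])
    return [t if t in keep else "[MASK]" for t in tokens]

# Toy corpus statistics standing in for real corpus frequencies.
corpus_counts = Counter(
    "the cat sat on the mat and the dog sat on the rug".split()
)
masked = frequency_mask(["the", "cat", "sat", "on", "the", "mat"], corpus_counts)
```

Here the frequent function words ("the", "sat", "on") are masked while the rare content words ("cat", "mat") survive, which is the focusing effect the paper describes.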
zh
[NLP-83] FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training
Quick read: This paper addresses high-quality data selection for large language model (LLM) pre-training. The key contribution is FIRE, a flexible and scalable framework that integrates multiple data-quality raters to assess data quality comprehensively across dimensions. FIRE aligns diverse quality signals into a unified space and combines them into a comprehensive quality signal for each data point. It further introduces a FIRE-based progressive data-selection scheme that iteratively refines the choice of high-quality data points, balancing computational complexity against gains in orthogonality.
Link: https://arxiv.org/abs/2502.00761
Authors: Liangyu Xu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Jingang Wang, Xunliang Cai
Affiliations: Peking University; Beihang University; Tsinghua University; Meituan
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 11 figures
Click to view abstract
Abstract:Selecting high-quality data can significantly improve the pre-training efficiency of large language models (LLMs). Existing methods often rely on heuristic techniques and single quality signals, limiting their ability to comprehensively evaluate data quality. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points, balancing computational complexity with the refinement of orthogonality. Experiments on the SlimPajama dataset reveal that FIRE consistently outperforms other selection methods and significantly enhances the pre-trained model across a wide range of downstream tasks, with a 2.9% average performance boost and reducing the FLOPs necessary to achieve a certain performance level by more than half.
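The abstract says FIRE aligns multiple quality signals into a unified space without giving the mechanism; rank normalization is one simple way to put heterogeneous rater scores on a common scale before combining them. This is an illustrative assumption, not FIRE's actual method:

```python
def rank_normalize(scores):
    # map raw rater scores to [0, 1] by rank (assumes at least 2 documents),
    # so raters with different scales become directly comparable
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for r, i in enumerate(order):
        ranks[i] = r / (len(scores) - 1)
    return ranks

def fire_score(per_rater_scores):
    # per_rater_scores: one score list per rater, over the same documents;
    # average the aligned signals into one quality score per document
    normed = [rank_normalize(s) for s in per_rater_scores]
    return [sum(col) / len(col) for col in zip(*normed)]
```

A progressive scheme could then repeatedly keep the top-scoring fraction and re-score the survivors, which matches the iterative refinement the abstract describes at a high level.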
zh
[NLP-84] Structural Latency Perturbation in Large Language Models Through Recursive State Induction
Quick read: This paper addresses the computational efficiency of high-capacity language models in real-time applications, in particular the constraints of inference latency and resource consumption. The key contribution is a structured latency-perturbation mechanism that modifies computational pathways through recursive state induction, dynamically suppressing redundant activations while preserving generative fidelity. By selectively suppressing redundant activations, the mechanism improves computational efficiency without compromising token retention or memory utilization.
Link: https://arxiv.org/abs/2502.00758
Authors: Michael Mangrum, Jonathan Pemberton, Benedict Wetherby, Philip Montague
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Computational efficiency has remained a critical consideration in scaling high-capacity language models, with inference latency and resource consumption presenting significant constraints on real-time applications. The study has introduced a structured latency perturbation mechanism that modifies computational pathways through recursive state induction, enabling dynamic suppression of redundant activations while preserving generative fidelity. A formal mathematical framework has been established to describe recursive perturbations, ensuring that modifications remain adaptive rather than statically imposed. Experiments have demonstrated that applying recursive state adjustments reduces inference latency across varying sequence lengths, with longer text generations benefiting from cumulative efficiency improvements. Comparative evaluations against structured pruning and quantization have indicated that latency gains can be achieved without compromising token retention or memory utilization. The analysis of computational overhead has suggested that selectively suppressing redundant activations contributes to improved power efficiency, particularly in scenarios requiring extended text generation. An assessment of linguistic stability has shown that token-level consistency remains largely intact under controlled perturbation thresholds, reinforcing the viability of structural latency modifications as an alternative to weight-centric optimization techniques. The results have supported the hypothesis that recursive state induction offers an effective method for reducing computational complexity without requiring architectural modifications or external augmentation.
zh
[NLP-85] Zero-Shot Warning Generation for Misinformative Multimodal Content
【速读】: 该论文旨在解决多模态误导信息(Misinformation)的传播问题,特别是将真实图像与虚假文字配对、极具欺骗性的脱离上下文误导信息(out-of-context misinformation)。论文的关键解决方案在于提出一种通过跨模态一致性检查(cross-modality consistency checks)检测此类多模态误导信息的模型,且只需极短的训练时间。此外,论文还提出一种轻量级模型,仅用三分之一的参数即可达到有竞争力的性能,并引入一个零样本生成上下文化警示的双用途任务,以实现自动辟谣并增强用户理解。
链接: https://arxiv.org/abs/2502.00752
作者: Giovanni Pio Delvecchio,Huy Hong Nguyen,Isao Echizen
机构: National Institute of Informatics (NII), Japan; The University of Tokyo, Japan
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:The widespread prevalence of misinformation poses significant societal concerns. Out-of-context misinformation, where authentic images are paired with false text, is particularly deceptive and easily misleads audiences. Most existing detection methods primarily evaluate image-text consistency but often lack sufficient explanations, which are essential for effectively debunking misinformation. We present a model that detects multimodal misinformation through cross-modality consistency checks, requiring minimal training time. Additionally, we propose a lightweight model that achieves competitive performance using only one-third of the parameters. We also introduce a dual-purpose zero-shot learning task for generating contextualized warnings, enabling automated debunking and enhancing user comprehension. Qualitative and human evaluations of the generated warnings highlight both the potential and limitations of our approach.
zh
[NLP-86] Universal Post-Processing Networks for Joint Optimization of Modules in Task-Oriented Dialogue Systems AAAI2025
【速读】: 该论文旨在解决现有基于后处理网络(Post-processing Networks, PPNs)的方法仅能优化系统内部分模块输出的问题,从而限制了整体任务完成能力的提升。论文的关键解决方案是提出通用后处理网络(Universal Post-processing Networks, UniPPNs),这是一种基于语言模型的网络,能够将任意模块的输出视为序列转换任务进行统一优化。此外,论文采用了一种模块级马尔可夫决策过程(Markov Decision Process, MDP)的强化学习算法,实现每个模块的精细价值和优势估计,进而稳定所有模块输出的联合学习过程。通过仿真和人类评估实验,证明了UniPPNs在面向任务的对话系统中的任务完成能力优于传统PPNs。
链接: https://arxiv.org/abs/2502.00747
作者: Atsumoto Ohashi,Ryuichiro Higashinaka
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2025 Main Technical Track
点击查看摘要
Abstract:Post-processing networks (PPNs) are components that modify the outputs of arbitrary modules in task-oriented dialogue systems and are optimized using reinforcement learning (RL) to improve the overall task completion capability of the system. However, previous PPN-based approaches have been limited to handling only a subset of modules within a system, which poses a significant limitation in improving the system performance. In this study, we propose a joint optimization method for post-processing the outputs of all modules using universal post-processing networks (UniPPNs), which are language-model-based networks that can modify the outputs of arbitrary modules in a system as a sequence-transformation task. Moreover, our RL algorithm, which employs a module-level Markov decision process, enables fine-grained value and advantage estimation for each module, thereby stabilizing joint learning for post-processing the outputs of all modules. Through both simulation-based and human evaluation experiments using the MultiWOZ dataset, we demonstrated that UniPPN outperforms conventional PPNs in the task completion capability of task-oriented dialogue systems.
zh
[NLP-87] BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts ICLR
【速读】: 该论文旨在解决深度神经网络(DNNs)推理延迟的问题,并提出了一种新的早期退出(Early Exit, EE)决策准则。关键在于引入了BEEM方法,将退出分类器视为专家,并仅在相邻专家预测一致时聚合其置信分数,从而捕捉到集成效应。通过这种方法,当聚合的置信值超过阈值时,样本即提前退出,这一阈值基于中间退出的错误率设定,以超越传统DNN推理的性能。实验结果表明,该方法提升了现有EE方法的性能,在图像描述和多种语言任务中实现了1.5倍到2.1倍的速度提升,同时保持或提高了准确性。
链接: https://arxiv.org/abs/2502.00745
作者: Divya Jyoti Bajpai,Manjesh Kumar Hanawal
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Published at International Conference on Learning Representations (ICLR) 2025
点击查看摘要
Abstract:Early Exit (EE) techniques have emerged as a means to reduce inference latency in Deep Neural Networks (DNNs). The latency improvement and accuracy in these techniques crucially depend on the criteria used to make exit decisions. We propose a new decision criterion where exit classifiers are treated as experts BEEM and aggregate their confidence scores. The confidence scores are aggregated only if neighbouring experts are consistent in prediction as the samples pass through them, thus capturing their ensemble effect. A sample exits when the aggregated confidence value exceeds a threshold. The threshold is set using the error rates of the intermediate exits aiming to surpass the performance of conventional DNN inference. Experimental results on the COCO dataset for Image captioning and GLUE datasets for various language tasks demonstrate that our method enhances the performance of state-of-the-art EE methods, achieving improvements in speed-up by a factor 1.5x to 2.1x. When compared to the final layer, its accuracy is comparable in harder Image Captioning and improves in the easier language tasks. The source code for this work is publicly available at this https URL
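按摘要描述,BEEM 的核心决策规则是:把各退出分类器视为专家,仅当相邻出口预测一致时才累加其置信分数(体现集成效应),聚合值超过该层阈值即提前退出。下面是一个极简的 Python 示意(函数签名与阈值取值均为假设,并非论文官方实现;论文中阈值依据中间出口错误率设定,此处由调用者直接给定):

```python
def beem_exit(confidences, predictions, thresholds):
    """BEEM 风格的早退决策示意:
    相邻出口预测一致时累加置信分数,否则重新累计;
    聚合置信值超过该层阈值即在该出口退出。"""
    agg = 0.0
    for i, (conf, pred) in enumerate(zip(confidences, predictions)):
        if i > 0 and pred == predictions[i - 1]:
            agg += conf      # 相邻专家预测一致:累加(集成效应)
        else:
            agg = conf       # 预测不一致:从当前出口重新开始累计
        if agg >= thresholds[i]:
            return i, pred   # 在第 i 个出口提前退出
    return len(predictions) - 1, predictions[-1]  # 走到最后一层
```

例如三个出口中前两个预测一致、置信和超过第二层阈值时,样本在第二个出口就退出,无需继续前向计算。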
zh
[NLP-88] Model Provenance Testing for Large Language Models
【速读】: 该论文旨在解决通过微调等方式定制大型语言模型所带来的版权执行和下游影响管理挑战。论文的关键在于开发了一种框架,用于检测模型的起源,以确定一个模型是否由另一个模型衍生而来。该方法基于这样的观察:实际中的模型衍生会在模型输出中保留显著的相似性,这种相似性可以通过统计分析来检测。解决方案的关键是利用假设检验,对比目标模型与无关模型之间的相似性,从而在仅具有黑盒访问权限的情况下实现对衍生模型的有效识别。
链接: https://arxiv.org/abs/2502.00706
作者: Ivica Nikolic,Teodora Baluta,Prateek Saxena
机构: National University of Singapore; Georgia Institute of Technology
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models are increasingly customized through fine-tuning and other adaptations, creating challenges in enforcing licensing terms and managing downstream impacts. Tracking model origins is crucial both for protecting intellectual property and for identifying derived models when biases or vulnerabilities are discovered in foundation models. We address this challenge by developing a framework for testing model provenance: Whether one model is derived from another. Our approach is based on the key observation that real-world model derivations preserve significant similarities in model outputs that can be detected through statistical analysis. Using only black-box access to models, we employ multiple hypothesis testing to compare model similarities against a baseline established by unrelated models. On two comprehensive real-world benchmarks spanning models from 30M to 4B parameters and comprising over 600 models, our tester achieves 90-95% precision and 80-90% recall in identifying derived models. These results demonstrate the viability of systematic provenance verification in production environments even when only API access is available.
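摘要的统计思路可以用一个玩具示例说明:把目标模型与候选父模型在同一组提示上的输出一致率,同它与若干无关模型的一致率基线作比较,显著偏高即判定为派生模型。以下是简化的 z 分数版本(阈值与实现细节均为示意假设,论文实际采用的是多重假设检验):

```python
import statistics

def agreement(outputs_a, outputs_b):
    """两模型在同一组提示上输出完全一致的比例。"""
    return sum(a == b for a, b in zip(outputs_a, outputs_b)) / len(outputs_a)

def is_derived(target, candidate_parent, unrelated, z_thresh=3.0):
    """黑盒溯源示意:若 target 与候选父模型的一致率
    显著高于其与无关模型构成的基线分布,则判定为派生模型。"""
    baseline = [agreement(target, m) for m in unrelated]
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline) or 1e-9  # 防止除零
    z = (agreement(target, candidate_parent) - mu) / sigma
    return z > z_thresh
```

这只需要各模型在同一提示集上的输出(即黑盒 API 访问),与论文的设定一致。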
zh
[NLP-89] Learning Autonomous Code Integration for Math Language Models
【速读】: 该论文旨在解决现有工具集成的数学大语言模型(Math LLMs)在方法选择上的自主性不足问题。当前模型依赖外部指令来决定使用链式思维(CoT)推理还是代码执行,缺乏独立选择最适当方法的能力。为了解决这一挑战,论文提出了一种创新的期望最大化(EM)框架,通过自我探索改进模型的决策制定能力。该框架的关键在于交替进行参考策略的计算以提升模型对其自身能力的信心,并基于此更新模型。此外,引入了一种高效的数据合成策略和离策略强化学习,进一步增强了该框架。实验结果表明,所提方法显著提升了现有数学大语言模型的性能,在MATH基准测试中准确率提高了近20%,达到了65.28%,同时减少了高达65%的代码执行次数。
链接: https://arxiv.org/abs/2502.00691
作者: Haozhe Wang,Long Li,Chao Qu,Fengming Zhu,Weidi Xu,Wei Chu,Fangzhen Lin
机构: Technology†, Hong Kong University of Science and Technology‡
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent research on tool integration for math Large Language Models (LLMs) aims to combine complementary strengths of chain-of-thought (CoT) reasoning and code execution. However, we discover a critical limitation: current tool-integrated math LLMs rely on externally dictated instructions to decide whether to use CoT or code, lacking the autonomy to choose the most appropriate method independently. This prompts us to study Autonomous Code integration for math LLMs, which enables models to independently develop their own methodology-selection strategy in the absence of reliable supervision. To address this challenge, we propose an innovative Expectation-Maximization (EM) formulation that refines the model's decision-making through the exploration of its capabilities. This framework alternates between (a) computing a reference strategy that improves the model's belief over its capabilities through self-exploration, and (b) updating the model based on the refined belief. We further enhance this framework with an efficient implementation, incorporating a novel data synthesis strategy and off-policy reinforcement learning. Extensive experiments demonstrate that our approach, using only a public query set, significantly boosts the performance of existing math LLMs, raising accuracy by nearly 20% to 65.28% on the challenging MATH benchmark, while reducing code executions by up to 65%.
zh
[NLP-90] A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models
【速读】: 该论文旨在解决图表示学习中连续嵌入方法所面临的参数效率、可解释性和鲁棒性等问题。论文的关键在于提出并探讨了量化图表示(Quantized Graph Representation, QGR)的学习方法,通过离散码而非传统的连续嵌入来表示图结构,并探索其与大规模语言模型(Large Language Models, LLMs)的整合策略。这一新兴范式具有显著潜力,论文通过全面综述以促进其快速发展。
链接: https://arxiv.org/abs/2502.00681
作者: Qika Lin,Zhen Peng,Kaize Shi,Kai He,Yiming Xu,Erik Cambria,Mengling Feng
机构: Saw Swee Hock School of Public Health, National University of Singapore(苏瑞福公共卫生学院,新加坡国立大学); School of Computer Science and Technology, Xi’an Jiaotong University(西安交通大学计算机科学与技术学院); School of Computer Science, University of Technology Sydney(悉尼科技大学计算机科学学院); College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent years have witnessed rapid advances in graph representation learning, with the continuous embedding approach emerging as the dominant paradigm. However, such methods encounter issues regarding parameter efficiency, interpretability, and robustness. Thus, Quantized Graph Representation (QGR) learning has recently gained increasing interest, which represents the graph structure with discrete codes instead of conventional continuous embeddings. Given its analogous representation form to natural language, QGR also possesses the capability to seamlessly integrate graph structures with large language models (LLMs). As this emerging paradigm is still in its infancy yet holds significant promise, we undertake this thorough survey to promote its rapid future prosperity. We first present the background of the general quantization methods and their merits. Moreover, we provide an in-depth demonstration of current QGR studies from the perspectives of quantized strategies, training objectives, distinctive designs, knowledge graph quantization, and applications. We further explore the strategies for code dependence learning and integration with LLMs. At last, we give discussions and conclude future directions, aiming to provide a comprehensive picture of QGR and inspire future research.
zh
[NLP-91] How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence
【速读】: 该论文旨在解决数据集污染(Dataset Contamination)问题,即评估数据集与预训练语料库之间的重叠导致性能指标虚高,从而影响模型评估的可靠性。为了解决这一问题,论文提出了一种名为核差异得分(Kernel Divergence Score, KDS)的新方法。KDS通过计算样本嵌入在基准数据集微调前后核相似性矩阵的差异来量化数据集污染程度。其关键是利用微调对未见过的样本影响更大的特性,从而提供一个可靠的污染度量标准。
链接: https://arxiv.org/abs/2502.00678
作者: Hyeong Kyu Choi,Maxim Khanov,Hongxin Wei,Yixuan Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Dataset contamination, where evaluation datasets overlap with pre-training corpora, inflates performance metrics and undermines the reliability of model evaluations. Quantifying dataset contamination thus becomes essential to ensure that performance evaluations genuinely reflect a model’s ability to generalize to unseen data, rather than relying on memorized examples. To address this problem, we propose Kernel Divergence Score (KDS), a novel method that quantifies dataset contamination by computing the divergence between the kernel similarity matrix of sample embeddings, before and after fine-tuning on the benchmark dataset. Leveraging the insight that fine-tuning affects unseen samples more significantly than seen ones, KDS provides a reliable measure of contamination. Through extensive experiments on controlled contamination scenarios, KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines. Additionally, we perform comprehensive ablation studies to analyze the impact of key design choices, providing deeper insights into the components and effectiveness of KDS. These ablations highlight the importance of leveraging fine-grained kernel-based information and confirm the reliability of the proposed framework across diverse datasets and settings.
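KDS 的计算可以概括为:分别取基准数据集微调前后样本嵌入的核相似度矩阵,再度量两个矩阵的差异;被"见过"(泄漏)的样本受微调影响小,得分低。下面用纯 Python 的 RBF 核给出一个示意(核函数与差异度量的具体选择为假设,并非论文的原始实现):

```python
import math

def rbf_kernel_matrix(embs, gamma=1.0):
    """样本嵌入两两之间的 RBF 核相似度矩阵。"""
    n = len(embs)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(embs[i], embs[j])))
             for j in range(n)] for i in range(n)]

def kernel_divergence_score(embs_before, embs_after, gamma=1.0):
    """微调前后核矩阵的平均逐元素差异:
    差异越小,说明微调对这批样本影响越小,数据泄漏嫌疑越大。"""
    Kb = rbf_kernel_matrix(embs_before, gamma)
    Ka = rbf_kernel_matrix(embs_after, gamma)
    n = len(Kb)
    return sum(abs(Kb[i][j] - Ka[i][j]) for i in range(n) for j in range(n)) / (n * n)
```

若嵌入在微调前后完全不变,得分恰为 0;未见过的样本嵌入变动越大,得分越高。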
zh
[NLP-92] ReFoRCE: A Text-to-SQL Agent with Self-Refinement Format Restriction and Column Exploration
【速读】: 该论文旨在解决在企业环境中部署Text-to-SQL系统所面临的挑战,如大规模复杂模式(3000列)、多样的SQL方言(例如BigQuery、Snowflake)以及复杂的查询需求。当前最先进的模型在Spider 2.0数据集上的表现受限,仅达到20%,主要局限在于指令遵循不足、长上下文理解差、自我优化能力弱以及特定方言知识不足。为解决这些问题,论文提出ReFoRCE方法,其关键是引入表压缩以缓解长上下文限制,格式限制以确保答案格式正确,以及迭代列探索以增强模式理解。此外,ReFoRCE采用包含并行化工作流与投票机制及基于公用表表达式(CTE)的细化方法的自我优化流程来处理未决案例。
链接: https://arxiv.org/abs/2502.00675
作者: Minghang Deng,Ashwin Ramachandran,Canwen Xu,Lanxiang Hu,Zhewei Yao,Anupam Datta,Hao Zhang
机构: University of California, San Diego(加州大学圣地亚哥分校); Snowflake AI Research
类目: Computation and Language (cs.CL)
备注: 13 pages, 1 figure
点击查看摘要
Abstract:Text-to-SQL systems have unlocked easier access to critical data insights by enabling natural language queries over structured databases. However, deploying such systems in enterprise environments remains challenging due to factors such as large, complex schemas ( 3000 columns), diverse SQL dialects (e.g., BigQuery, Snowflake) and sophisticated query requirements (e.g., transformation, analytics). Current state-of-the-art performance on the Spider 2.0 dataset – a benchmark built to mimic such complex environments – remains limited at 20%. Key limitations include inadequate instruction-following, poor long-context comprehension, weak self-refinement, and insufficient dialect-specific knowledge. To address these gaps, we propose ReFoRCE (Self-Refinement Agent with Format Restriction and Column Exploration) which introduces (1) table compression to mitigate long-context limitations (2) format restriction to ensure accurate answer format, and (3) iterative column exploration for enhanced schema understanding. Additionally, it employs self-refinement pipeline consisting of (1) parallelized workflows with voting mechanisms and (2) a Common Table Expression (CTE) based refinement approach to handle unresolved cases. ReFoRCE achieves state-of-the-art results scoring 26.69 on the Spider 2.0-Snow and scoring 24.50 on the Spider 2.0-Lite tasks.
zh
[NLP-93] Rethinking Mixture-of-Agents : Is Mixing Different Large Language Models Beneficial?
【速读】: 该论文旨在探讨在语言模型领域,混合不同大型语言模型(Large Language Models, LLMs)是否真正有益。论文的关键解决方案是提出Self-MoA方法,即仅聚合单一顶级表现LLM的输出。研究结果表明,Self-MoA在多种基准测试中优于传统的混合方法(Mixture-of-Agents, MoA),包括在AlpacaEval 2.0上提升6.6%,并在多个基准测试(如MMLU、CRUX、MATH)中平均提升3.8%。将Self-MoA应用于AlpacaEval 2.0榜单上排名靠前的模型,还直接取得了新的state-of-the-art成绩。研究进一步通过分析输出多样性与质量之间的权衡解释了这一现象,确认混合不同LLMs往往会拉低输出的平均质量。
链接: https://arxiv.org/abs/2502.00674
作者: Wenzhe Li,Yong Lin,Mengzhou Xia,Chi Jin
机构: Princeton University(普林斯顿大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA – an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of 3.8% improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves the new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that the MoA performance is rather sensitive to the quality, and mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA, that is capable of aggregating a large number of LLM outputs on-the-fly over multiple rounds, and is as effective as aggregating all outputs at once.
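Self-MoA 的思想很直接:不再混合不同 LLM 的输出,而是从同一个最强模型重复采样,再聚合这些候选。下面用"评分函数选最优"代替 LLM 聚合器给出极简示意(真实 MoA/Self-MoA 用另一个 LLM 做聚合,这里的 judge 仅为假设的打分器):

```python
def self_moa(generate, judge, prompt, n_samples=3):
    """Self-MoA 示意:从单一最强模型采样 n 个输出,
    按 judge 打分聚合(此处取最优),而非混合不同 LLM。"""
    candidates = [generate(prompt) for _ in range(n_samples)]
    return max(candidates, key=judge)
```

论文还提出顺序版 Self-MoA,可在多轮中在线聚合大量输出;上面的单轮版本足以体现"单模型多采样"与"多模型混合"的对比。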
zh
[NLP-94] Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Model, VLM)在分布外(Out-of-Distribution, OOD)检测中由于图像与文本模态差距导致的高误报率问题。论文的关键解决方案在于引入分布内(ID)图像原型(ID image prototypes),与已有的ID文本原型(ID text prototypes)结合使用,以缓解模态差距的影响。此外,论文提出了一个名为SUPREME的少样本调优框架,包括偏置提示生成(Biased Prompts Generation, BPG)模块和图像-文本一致性(Image-Text Consistency, ITC)模块,进一步减小图像与文本之间的差距,并提出了一种基于单模态与跨模态相似性的新OOD评分方法 S_GMP。这些改进共同提升了基于VLM的OOD检测性能。
链接: https://arxiv.org/abs/2502.00662
作者: Yimu Wang,Evelien Riddell,Adrian Chow,Sean Sedwards,Krzysztof Czarnecki
机构: University of Waterloo
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Existing vision-language model (VLM)-based methods for out-of-distribution (OOD) detection typically rely on similarity scores between input images and in-distribution (ID) text prototypes. However, the modality gap between image and text often results in high false positive rates, as OOD samples can exhibit high similarity to ID text prototypes. To mitigate the impact of this modality gap, we propose incorporating ID image prototypes along with ID text prototypes. We present theoretical analysis and empirical evidence indicating that this approach enhances VLM-based OOD detection performance without any additional training. To further reduce the gap between image and text, we introduce a novel few-shot tuning framework, SUPREME, comprising biased prompts generation (BPG) and image-text consistency (ITC) modules. BPG enhances image-text fusion and improves generalization by conditioning ID text prototypes on the Gaussian-based estimated image domain bias; ITC reduces the modality gap by minimizing intra- and inter-modal distances. Moreover, inspired by our theoretical and empirical findings, we introduce a novel OOD score S_GMP, leveraging uni- and cross-modal similarities. Finally, we present extensive experiments to demonstrate that SUPREME consistently outperforms existing VLM-based OOD detection methods.
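摘要的评分思路是同时利用跨模态(图像对 ID 文本原型)与单模态(图像对 ID 图像原型)相似度:OOD 样本对两类原型都不相似,因此得分偏低,误报随之减少。下面是一个余弦相似度版本的示意(加权系数 alpha 为假设,并非论文中 S_GMP 的精确定义):

```python
import math

def cosine(u, v):
    """余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def id_score(img_emb, text_protos, image_protos, alpha=0.5):
    """得分越高越像分布内(ID)样本:
    结合跨模态(图像-文本原型)与单模态(图像-图像原型)最大相似度。"""
    cross = max(cosine(img_emb, p) for p in text_protos)
    uni = max(cosine(img_emb, p) for p in image_protos)
    return alpha * cross + (1 - alpha) * uni
```

仅依赖文本原型时,一个恰好与某条文本描述相似的 OOD 图像会得到虚高分数;加入图像原型后,它还必须同时接近 ID 图像分布才能得高分。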
zh
[NLP-95] Reformulation is All You Need: Addressing Malicious Text Features in DNNs
【速读】: 该论文旨在解决深度神经网络(DNN)模型在自然语言处理(NLP)任务中面临的对抗性攻击和后门攻击问题。论文的关键在于提出了一种统一且自适应的防御框架,通过利用重构模块来识别并处理文本输入中的潜在恶意特征,同时保持原始语义的完整性,从而有效抵御对抗性和后门攻击。
链接: https://arxiv.org/abs/2502.00652
作者: Yi Jiang,Oubo Ma,Yong Yang,Tong Zhang,Shouling Ji
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Human language encompasses a wide range of intricate and diverse implicit features, which attackers can exploit to launch adversarial or backdoor attacks, compromising DNN models for NLP tasks. Existing model-oriented defenses often require substantial computational resources as model size increases, whereas sample-oriented defenses typically focus on specific attack vectors or schemes, rendering them vulnerable to adaptive attacks. We observe that the root cause of both adversarial and backdoor attacks lies in the encoding process of DNN models, where subtle textual features, negligible for human comprehension, are erroneously assigned significant weight by less robust or trojaned models. Based on it we propose a unified and adaptive defense framework that is effective against both adversarial and backdoor attacks. Our approach leverages reformulation modules to address potential malicious features in textual inputs while preserving the original semantic integrity. Extensive experiments demonstrate that our framework outperforms existing sample-oriented defense baselines across a diverse range of malicious textual features.
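摘要的防御思路是:在分类器看到输入之前,先用保语义的改写模块抹掉那些对人类理解无关紧要、却被脆弱或被植入后门的模型赋予高权重的表层特征(例如后门触发词)。下面用同义替换模拟改写模块(synonym_map 与触发词 "cf" 均为示意假设,真实系统的改写器远比查表复杂):

```python
def reformulate(text, synonym_map):
    """保语义改写示意:用常见同义词替换可疑的表层 token,
    使后门触发词失去其精确的表面形式。"""
    return " ".join(synonym_map.get(tok, tok) for tok in text.split())

def defended_classify(classifier, text, synonym_map):
    """先改写,再送入(可能被植入后门的)分类器。"""
    return classifier(reformulate(text, synonym_map))
```

由于改写发生在样本侧而非模型侧,该框架对对抗样本与后门触发两类攻击都能起作用,也无需随模型规模增长的防御开销。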
zh
[NLP-96] Evaluating Small Language Models for News Summarization: Implications and Factors Influencing Performance
【速读】: 该论文旨在解决在资源受限环境中高效摘要工具的需求与现有大型语言模型(Large Language Models, LLMs)高计算资源需求之间的矛盾。解决方案的关键在于全面评估小型语言模型(Small Language Models, SLMs)在新闻摘要任务中的表现,发现如Phi3-Mini和Llama3.2-3B-Ins等顶级SLMs不仅能在生成更简洁的摘要同时达到与70B LLMs相当的效果,还指出SLMs更适合简单提示,并且指令微调并不总能提升其新闻摘要能力。
链接: https://arxiv.org/abs/2502.00641
作者: Borui Xu,Yao Chen,Zeyi Wen,Weiguo Liu,Bingsheng He
机构: Shandong University; National University of Singapore; HKUST (香港科技大学); Shandong University; National University of Singapore
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The increasing demand for efficient summarization tools in resource-constrained environments highlights the need for effective solutions. While large language models (LLMs) deliver superior summarization quality, their high computational resource requirements limit practical use applications. In contrast, small language models (SLMs) present a more accessible alternative, capable of real-time summarization on edge devices. However, their summarization capabilities and comparative performance against LLMs remain underexplored. This paper addresses this gap by presenting a comprehensive evaluation of 19 SLMs for news summarization across 2,000 news samples, focusing on relevance, coherence, factual consistency, and summary length. Our findings reveal significant variations in SLM performance, with top-performing models such as Phi3-Mini and Llama3.2-3B-Ins achieving results comparable to those of 70B LLMs while generating more concise summaries. Notably, SLMs are better suited for simple prompts, as overly complex prompts may lead to a decline in summary quality. Additionally, our analysis indicates that instruction tuning does not consistently enhance the news summarization capabilities of SLMs. This research not only contributes to the understanding of SLMs but also provides practical insights for researchers seeking efficient summarization solutions that balance performance and resource use.
zh
[NLP-97] SimulPL: Aligning Human Preferences in Simultaneous Machine Translation ICLR2025
【速读】: 该论文旨在解决同时机器翻译(Simultaneous Machine Translation, SiMT)模型在满足人类用户偏好方面的问题。现有方法主要关注于优化生成的翻译结果,而忽视了与延迟相关的用户偏好以及在偏好优化阶段读写策略的优化。论文的关键解决方案是提出了一种名为Simultaneous Preference Learning (SimulPL) 的框架,该框架将人类偏好分为五个方面:翻译质量偏好、单调性偏好、关键点偏好、简洁性偏好和延迟偏好。通过利用前四种偏好构造人类偏好提示,有效地引导GPT-4/4o生成SiMT任务的偏好数据,并在偏好优化阶段将延迟偏好整合到优化目标中,使SiMT模型能够改进读写策略,从而更有效地与人类偏好保持一致。实验结果显示,SimulPL在不同延迟水平下均表现出更好的人类偏好对齐效果。
链接: https://arxiv.org/abs/2502.00634
作者: Donglei Yu,Yang Zhao,Jie Zhu,Yangyifan Xu,Yu Zhou,Chengqing Zong
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences(人工智能学院,中国科学院大学); State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统国家重点实验室,中国科学院自动化研究所,中国北京); Graduate School of Translation and Interpretation, Beijing Foreign Studies University(北京外国语大学翻译学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2025. 23 pages,13 figures,11 tables
点击查看摘要
Abstract:Simultaneous Machine Translation (SiMT) generates translations while receiving streaming source inputs. This requires the SiMT model to learn a read/write policy, deciding when to translate and when to wait for more source input. Numerous linguistic studies indicate that audiences in SiMT scenarios have distinct preferences, such as accurate translations, simpler syntax, and no unnecessary latency. Aligning SiMT models with these human preferences is crucial to improve their performances. However, this issue still remains unexplored. Additionally, preference optimization for SiMT task is also challenging. Existing methods focus solely on optimizing the generated responses, ignoring human preferences related to latency and the optimization of read/write policy during the preference optimization phase. To address these challenges, we propose Simultaneous Preference Learning (SimulPL), a preference learning framework tailored for the SiMT task. In the SimulPL framework, we categorize SiMT human preferences into five aspects: translation quality preference, monotonicity preference, key point preference, simplicity preference, and latency preference. By leveraging the first four preferences, we construct human preference prompts to efficiently guide GPT-4/4o in generating preference data for the SiMT task. In the preference optimization phase, SimulPL integrates latency preference into the optimization objective and enables SiMT models to improve the read/write policy, thereby aligning with human preferences more effectively. Experimental results indicate that SimulPL exhibits better alignment with human preferences across all latency levels in Zh→En, De→En and En→Zh SiMT tasks. Our data and code will be available at this https URL.
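SimulPL 的一个关键点是把延迟偏好并入优化目标。最简单的体现方式是在候选(翻译质量, 延迟)之间按"质量减去延迟惩罚"排序偏好,使质量略低但快得多的读写策略胜出(lam 为示意的权衡系数,并非论文的目标函数形式):

```python
def preference_rank(candidates, lam=0.05):
    """candidates: (名称, 质量, 延迟) 三元组列表;
    按 质量 - lam * 延迟 从高到低排序,体现延迟偏好进入目标。"""
    return sorted(candidates, key=lambda c: -(c[1] - lam * c[2]))
```

当 lam=0(忽略延迟偏好)时,排序退化为只看翻译质量;lam 越大,低延迟策略在偏好数据中越占优。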
zh
[NLP-98] Efficient Language Modeling for Low-Resource Settings with Hybrid RNN-Transformer Architectures
【速读】: 该论文旨在解决在低数据条件下,Transformer模型训练成本高且参数量大的问题。解决方案的关键在于通过选择性替换注意力层(attention layers)为前馈层(feed-forward layers)和准循环神经网络层(quasi-recurrent neural network layers),从而在保持相近性能的同时显著减少模型参数数量。
链接: https://arxiv.org/abs/2502.00617
作者: Gabriel Lindenmaier,Sean Papay,Sebastian Padó
机构: University of Bamberg (班贝格大学); University of Stuttgart (斯图加特大学)
类目: Computation and Language (cs.CL)
备注: PDF has 12 pages total, 7 without references and abstract; 10 individual graphics combined to 3 figures; 5 tables
点击查看摘要
Abstract:Transformer-based language models have recently been at the forefront of active research in text generation. However, these models’ advances come at the price of prohibitive training costs, with parameter counts in the billions and compute requirements measured in petaflop/s-decades. In this paper, we investigate transformer-based architectures for improving model performance in a low-data regime by selectively replacing attention layers with feed-forward and quasi-recurrent neural network layers. We test these architectures on the standard Enwik8 and Wikitext-103 corpora. Our results show that our reduced architectures outperform existing models with a comparable number of parameters, and obtain comparable performance to larger models while significantly reducing the number of parameters.
zh
[NLP-99] Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing
【速读】: 该论文旨在解决大型语言模型(LLMs)在静态语料库上训练导致的知识过时问题,并提出了一种名为OVERTONE的方法。OVERTONE通过在token级别进行平滑处理,缓解异构token过拟合(Heterogeneous Token Overfitting, HTO)问题,从而实现对特定知识的有效更新而不损害模型的其他预训练能力。关键在于其自适应地细化目标分布,以减轻不同token以不同速率过拟合的现象。
链接: https://arxiv.org/abs/2502.00602
作者: Tianci Liu,Zihan Dong,Linjun Zhang,Haoyu Wang,Jing Gao
机构: Purdue University; Rutgers University; SUNY Albany
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora and their knowledge can become outdated quickly in the fast-changing world. This motivates the development of knowledge editing (KE) to update specific knowledge in LLMs without changing unrelated others or compromising their pre-trained capabilities. Previous efforts sought to update a small amount of parameters of a LLM and proved effective for making selective updates. Nonetheless, the edited LLM often exhibits degraded ability to reason about the new knowledge. In this work, we identify a key issue: heterogeneous token overfitting (HTO), where the LLM overfits different tokens in the provided knowledge at varying rates. To tackle this, we propose OVERTONE, a token-level smoothing method that mitigates HTO by adaptively refining the target distribution. Theoretically, OVERTONE offers better parameter updates with negligible computation overhead. It also induces an implicit DPO but does not require preference data pairs. Extensive experiments across four editing methods, two LLMs, and diverse scenarios demonstrate the effectiveness and versatility of our method.
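OVERTONE 在 token 级别自适应地软化训练目标:模型已经拟合好的 token 保持接近 one-hot 的"硬"目标,拟合差的 token 得到更平滑的目标,从而缓解不同 token 以不同速率过拟合(HTO)。下面的插值调度只是一种可能的示意,并非论文的精确公式:

```python
def overtone_targets(one_hot_targets, model_probs, beta=0.9):
    """token 级自适应平滑示意:w = beta * p(gold)。
    w 越大目标越接近 one-hot(token 已拟合好),
    w 越小目标越接近模型当前分布(token 拟合差,给软目标)。"""
    smoothed = []
    for tgt, probs in zip(one_hot_targets, model_probs):
        gold = max(range(len(tgt)), key=tgt.__getitem__)  # 金标 token 下标
        w = beta * probs[gold]  # 假设的自适应权重调度
        smoothed.append([w * t + (1 - w) * p for t, p in zip(tgt, probs)])
    return smoothed
```

由于平滑目标是 one-hot 与模型分布的凸组合,每个 token 的目标仍是合法概率分布,可直接替换交叉熵训练中的标签。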
zh
[NLP-100] RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines ICML2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在作为基于文本的角色扮演游戏(Role-Playing Game, RPG)引擎时的评估问题。为了解决这一问题,论文提出了RPGBench基准测试,其关键是通过两个核心任务——游戏创建(Game Creation, GC)和游戏模拟(Game Simulation, GS)——来全面评估LLMs在逻辑连贯性、一致性以及可验证的游戏机制方面的表现。通过结合客观评价方法和LLM作为裁判的主观评价框架,RPGBench提供了一种新的标准,用于衡量LLMs在平衡创造性、连贯性和复杂性方面的能力。
链接: https://arxiv.org/abs/2502.00595
作者: Pengfei Yu,Dongming Shen,Silin Meng,Jaewon Lee,Weisu Yin,Andrea Yaoyun Cui,Zhenlin Xu,Yi Zhu,Xingjian Shi,Mu Li,Alex Smola
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to ICML 2025
点击查看摘要
Abstract:We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines. RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS). In GC, an LLM must craft a valid and playable RPG world using a structured event-state representation, ensuring logical coherence and proper termination conditions. In GS, the LLM simulates interactive gameplay across multiple rounds while consistently updating states and enforcing game rules. To comprehensively assess performance, RPGBench integrates objective and subjective evaluation methodologies. Objective measures verify adherence to event mechanics and check variable updates without requiring human intervention. Subjective measures, such as content interestingness, action quality, and role-playing capability, are evaluated via an LLM-as-a-judge framework, where a strong LLM grades each candidate’s outputs. Empirical results demonstrate that state-of-the-art LLMs can produce engaging stories but often struggle to implement consistent, verifiable game mechanics, particularly in long or complex scenarios. By combining structured, rule-based assessments with LLM-based judgments, RPGBench provides a new standard for evaluating how well LLMs can balance creativity, coherence, and complexity in text-based RPGs, opening avenues for more immersive and controllable interactive storytelling.
zh
[NLP-101] M+: Extending MemoryLLM with Scalable Long-Term Memory
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理长序列时难以保留远期信息的问题。MemoryLLM虽通过将过去的信息压缩到所有层的隐藏状态中形成记忆池来扩展上下文窗口,但其效果仅限于最多16k个标记的序列长度,对于超过20k个标记的序列则难以保持知识。论文的关键解决方案是引入M+模型,它基于MemoryLLM,并通过集成长期记忆机制与协同训练的检索器来显著增强长期信息保留能力。M+在文本生成过程中动态检索相关信息,从而实现在相似GPU内存开销下,将知识保留能力从不足20k个标记提升至超过160k个标记。
链接: https://arxiv.org/abs/2502.00592
作者: Yu Wang,Dmitry Krotov,Yuanzhe Hu,Yifan Gao,Wangchunshu Zhou,Julian McAuley,Dan Gutfreund,Rogerio Feris,Zexue He
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead.
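M+ 的关键部件之一是与主模型协同训练的检索器:生成时根据当前查询,从长期记忆池中动态取回最相关的条目。下面用点积打分模拟这一检索步骤(记忆条目的 (嵌入, 文本) 形式与 top_k 取值均为示意假设,真实系统中的记忆是各层隐藏状态):

```python
def dot(u, v):
    """向量点积。"""
    return sum(a * b for a, b in zip(u, v))

def retrieve_memory(query_emb, memory, top_k=2):
    """memory: (嵌入, 文本) 条目列表;
    按与查询嵌入的点积从高到低取回 top_k 条,供生成时拼接上下文。"""
    ranked = sorted(memory, key=lambda entry: -dot(query_emb, entry[0]))
    return [text for _, text in ranked[:top_k]]
```

与把全部历史压进固定大小记忆池不同,这种"按需取回"让保留的知识量可以随记忆池规模扩展,而推理时的上下文长度保持不变。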
zh
[NLP-102] Converting Transformers into DGNNs Form
【速读】: 该论文旨在探索将自注意力机制(Self-Attention)替换为有向图卷积(Digraph Convolution)的可能性,以期提升Transformer模型在长序列处理任务中的性能。论文的关键在于引入了一种基于有向图傅里叶变换的合成酉有向图卷积(Synthetic Unitary Digraph Convolution),从而形成一种新的模型——Converter。这种转换使得Transformer模型能够以有向图神经网络(DGNN)的形式运作,实验结果表明Converter在保持计算效率和架构简洁性的同时,实现了卓越的性能,确立了其作为轻量但强大的Transformer变体的地位。
链接: https://arxiv.org/abs/2502.00585
作者: Jie Zhang,Kuan-Chieh Wang,Bo-Wei Chiu,Min-Te Sun
机构: National Central University, Taiwan(中央大学,台湾)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 21 pages, 3 figures, and 8 tables
点击查看摘要
Abstract:Recent advances in deep learning have established Transformer architectures as the predominant modeling paradigm. Central to the success of Transformers is the self-attention mechanism, which scores the similarity between query and key matrices to modulate a value matrix. This operation bears striking similarities to digraph convolution, prompting an investigation into whether digraph convolution could serve as an alternative to self-attention. In this study, we formalize this concept by introducing a synthetic unitary digraph convolution based on the digraph Fourier transform. The resulting model, which we term Converter, effectively converts a Transformer into a Directed Graph Neural Network (DGNN) form. We have tested Converter on Long-Range Arena benchmark, long document classification, and DNA sequence-based taxonomy classification. Our experimental results demonstrate that Converter achieves superior performance while maintaining computational efficiency and architectural simplicity, which establishes it as a lightweight yet powerful Transformer variant.
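自注意力与有向图卷积的对应关系可以用一个小例子直接验证:softmax(QKᵀ) 恰是一个行随机的加权邻接矩阵,乘以 V 等价于在完全有向图上做一步消息传递。下面的玩具数值为本文虚构,仅演示两种写法逐元素一致,并非论文中酉有向图卷积的实现:

```python
import math

def matmul(A, B):
    """Plain list-of-lists matrix multiplication."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    """Row-wise softmax, so every row sums to 1."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

# Toy 3-token sequence with 2-dimensional queries/keys/values (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Self-attention: A = softmax(Q K^T) is a row-stochastic weighted adjacency
# matrix over tokens, so "attend" is one step of digraph message passing.
A = softmax_rows(matmul(Q, [list(col) for col in zip(*K)]))  # softmax(Q K^T)
out_attention = matmul(A, V)

# Same computation written as per-node aggregation over weighted in-edges.
out_graph = [[sum(A[i][j] * V[j][d] for j in range(len(V))) for d in range(2)]
             for i in range(len(Q))]
```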
zh
[NLP-103] Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition ICASSP2025
【速读】: 该论文旨在解决非流利或带有口音的演讲者在语音识别中的挑战。特别是针对非母语英语使用者,传统基于规则的发音模式难以充分捕捉非母语者的错误。论文的关键解决方案是采用数据驱动的方法,通过使用注意力图将非母语音素与母语音素对齐,从而自动检测误读模式。这种方法在母语英语数据集上的语音识别准确率提高了5.7%,而在非母语英语,尤其是韩国人英语演讲者的识别准确率提高了12.8%。
链接: https://arxiv.org/abs/2502.00583
作者: Anna Seo Gyeong Choi,Jonghyeon Park,Myungwoo Oh
机构: Cornell University; NAVER Cloud Corporation; NAVER Cloud Corporation
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to ICASSP 2025
点击查看摘要
Abstract:Recent advancements in machine learning have significantly improved speech recognition, but recognizing speech from non-fluent or accented speakers remains a challenge. Previous efforts, relying on rule-based pronunciation patterns, have struggled to fully capture non-native errors. We propose two data-driven approaches using speech corpora to automatically detect mispronunciation patterns. By aligning non-native phones with their native counterparts using attention maps, we achieved a 5.7% improvement in speech recognition on native English datasets and a 12.8% improvement for non-native English speakers, particularly Korean speakers. Our method offers practical advancements for robust Automatic Speech Recognition (ASR) systems particularly for situations where prior linguistic knowledge is not applicable.
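论文用注意力图把非母语音素与母语音素对齐;下面是一个极简示意(注意力权重为虚构数值,真实系统中来自训练好的模型),通过逐行取 argmax 得到对齐,再收集不一致的音素对作为误读模式:

```python
native = ["r", "ae", "t"]        # target (native) phone sequence
nonnative = ["l", "ae", "t"]     # accented rendering of the same word
attn = [                         # attn[i][j]: fabricated attention weight of
    [0.7, 0.2, 0.1],             # non-native phone i over native phone j
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]

def mispronunciation_pairs(attn, nonnative, native):
    """Align phones by per-row argmax and collect mismatched pairs."""
    pairs = []
    for i, row in enumerate(attn):
        j = max(range(len(row)), key=row.__getitem__)  # argmax over native phones
        if nonnative[i] != native[j]:
            pairs.append((native[j], nonnative[i]))  # (intended, produced)
    return pairs

patterns = mispronunciation_pairs(attn, nonnative, native)
```

这里恰好复现了韩语母语者常见的 r/l 混淆;真实管线会在整个语料上统计此类音素对的频次,得到数据驱动的误读模式。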
zh
[NLP-104] Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
【速读】: 该论文旨在解决通过Best-of-N (BoN) 方法进行的语言模型(Language Models, LLMs)越狱攻击问题。解决方案的关键在于Defense Against The Dark Prompts (DATDP) 方法,该方法通过反复利用评估语言模型来检测提示中的危险或操纵行为,并明确寻找越狱企图,直至生成稳健的安全评级。实验结果显示,即使使用较小的评估模型,DATDP也能有效阻止大部分成功越狱案例,从而显著提高生成式AI系统的安全性。
链接: https://arxiv.org/abs/2502.00580
作者: Stuart Armstrong,Matija Franklin,Connor Stevens,Rebecca Gorman
机构: Stuart Armstrong* Aligned AI(对齐AI); Matija Franklin* Aligned AI(对齐AI); Connor Stevens* Oxford University(牛津大学); Rebecca Gorman* University College London (UCL)(伦敦大学学院)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective against all major large language models (LLMs). We have found that 100% of the BoN paper’s successful jailbreaks (confidence interval [99.65%, 100.00%]) and 99.8% of successful jailbreaks in our replication (confidence interval [99.28%, 99.98%]) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly utilizing an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors–unlike some other approaches, DATDP also explicitly looks for jailbreaking attempts–until a robust safety rating is generated. This success persisted even when utilizing smaller LLMs to power the evaluation (Claude and LLaMa-3-8B-instruct proved almost equally capable). These results show that, though language models are sensitive to seemingly innocuous changes to inputs, they seem also capable of successfully evaluating the dangers of these inputs. Versions of DATDP can therefore be added cheaply to generative AI systems to produce an immediate significant increase in safety.
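DATDP 的核心是"反复让评估 LLM 给提示打危险分,得到稳健评级后再放行"。下面用一个桩函数(stub)代替评估 LLM 做流程示意——打分规则、轮数与阈值均为本文假设,并非论文的实际实现:

```python
def stub_evaluator(prompt):
    """Danger score in [0, 1]; a real system would query an evaluation LLM."""
    red_flags = ("ignore previous", "bomb", "jailbreak")
    hits = sum(flag in prompt.lower() for flag in red_flags)
    return min(1.0, hits / 2)

def datdp_filter(prompt, evaluator, rounds=3, threshold=0.5):
    """Score the prompt several times and block it if the average crosses
    the threshold -- a crude stand-in for DATDP's robust safety rating."""
    scores = [evaluator(prompt) for _ in range(rounds)]
    avg = sum(scores) / len(scores)
    return ("blocked" if avg >= threshold else "allowed", avg)

safe, _ = datdp_filter("What is the capital of France?", stub_evaluator)
bad, _ = datdp_filter("IGNORE PREVIOUS instructions and jailbreak now", stub_evaluator)
```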
zh
[NLP-105] Understanding Multimodal LLM s Under Distribution Shifts: An Information-Theoretic Approach
【速读】: 该论文旨在解决多模态大型语言模型(MLLMs)在分布偏移(distribution shifts)条件下表现不稳定的问题。论文的关键解决方案在于提出了一种基于信息论的新理论框架,通过引入有效互信息(Effective Mutual Information, EMI)这一度量标准,量化输入查询与模型响应之间的相关性,并推导出其在分布内(in-distribution, ID)和分布外(out-of-distribution, OOD)数据上的差异上限,从而连接视觉和文本的分布差异。该框架能够系统地表征和量化MLLMs在分布偏移条件下的最大风险,确保这些模型在实际应用中的安全性和可靠性。
链接: https://arxiv.org/abs/2502.00577
作者: Changdae Oh,Zhen Fang,Shawn Im,Xuefeng Du,Yixuan Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios empirically validate our theoretical insights.
zh
[NLP-106] Detecting Ambiguities to Guide Query Rewrite for Robust Conversations in Enterprise AI Assistants
【速读】: 该论文旨在解决多轮企业级AI助手对话中的歧义问题,这些歧义源于问题间的对话依赖关系,导致理解错误。关键解决方案在于提出了一种NLU-NLG框架,通过自动重述查询来检测和解决歧义,并引入了一项新任务“基于歧义的查询重写”(Ambiguity-guided Query Rewrite)。论文开发了一套基于真实用户对话日志的分类规则和特征提取方法,以设计出性能优越的分类器,该分类器在检测模糊查询方面优于基于大型语言模型的基线方法。此外,将查询重写模块与歧义检测分类器结合使用,证明了这一端到端框架能够有效减轻歧义,同时不会对清晰查询造成不必要的干扰,从而提升了AI助手的整体性能。
链接: https://arxiv.org/abs/2502.00537
作者: Md Mehrab Tanjim,Xiang Chen,Victor S. Bursztyn,Uttaran Bhattacharya,Tung Mai,Vaishnavi Muppala,Akash Maharaj,Saayan Mitra,Eunyee Koh,Yunyao Li,Ken Russell
机构: Adobe Research(Adobe研究); Adobe Inc.(Adobe公司)
类目: Computation and Language (cs.CL)
备注: Preprint
点击查看摘要
Abstract:Multi-turn conversations with an Enterprise AI Assistant can be challenging due to conversational dependencies in questions, leading to ambiguities and errors. To address this, we propose an NLU-NLG framework for ambiguity detection and resolution through reformulating query automatically and introduce a new task called “Ambiguity-guided Query Rewrite.” To detect ambiguities, we develop a taxonomy based on real user conversational logs and draw insights from it to design rules and extract features for a classifier which yields superior performance in detecting ambiguous queries, outperforming LLM-based baselines. Furthermore, coupling the query rewrite module with our ambiguity detecting classifier shows that this end-to-end framework can effectively mitigate ambiguities without risking unnecessary insertions of unwanted phrases for clear queries, leading to an improvement in the overall performance of the AI Assistant. Due to its significance, this has been deployed in the real world application, namely Adobe Experience Platform AI Assistant.
zh
[NLP-107] Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings
【速读】: 该论文旨在解决在PET/CT影像中将文本描述与图像中的具体位置进行关联的问题。由于缺乏大规模标注的图像-文本数据集,论文提出了一种自动化弱标签生成管道,用于链接PET/CT报告描述与图像位置,并基于此训练了一个三维视觉-语言接地模型(3D vision-language visual grounding model)。解决方案的关键在于开发了这一自动化弱标签生成管道,通过识别SUVmax和轴向切片编号来找到PET/CT报告中的阳性发现,从而提取出11,356个句子-标签对用于训练ConTEXTual Net 3D模型。
链接: https://arxiv.org/abs/2502.00528
作者: Zachary Huemann,Samuel Church,Joshua D. Warner,Daniel Tran,Xin Tie,Alan B McMillan,Junjie Hu,Steve Y. Cho,Meghan Lubner,Tyler J. Bradshaw
机构: University of Wisconsin-Madison(威斯康星大学麦迪逊分校); University of Wisconsin Health(威斯康星大学健康中心); Carbone Cancer Center(卡本癌症中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Vision-language models can connect the text description of an object to its specific location in an image through visual grounding. This has potential applications in enhanced radiology reporting. However, these models require large annotated image-text datasets, which are lacking for PET/CT. We developed an automated pipeline to generate weak labels linking PET/CT report descriptions to their image locations and used it to train a 3D vision-language visual grounding model. Our pipeline finds positive findings in PET/CT reports by identifying mentions of SUVmax and axial slice numbers. From 25,578 PET/CT exams, we extracted 11,356 sentence-label pairs. Using this data, we trained ConTEXTual Net 3D, which integrates text embeddings from a large language model with a 3D nnU-Net via token-level cross-attention. The model’s performance was compared against LLMSeg, a 2.5D version of ConTEXTual Net, and two nuclear medicine physicians. The weak-labeling pipeline accurately identified lesion locations in 98% of cases (246/251), with 7.5% requiring boundary adjustments. ConTEXTual Net 3D achieved an F1 score of 0.80, outperforming LLMSeg (F1=0.22) and the 2.5D model (F1=0.53), though it underperformed both physicians (F1=0.94 and 0.91). The model achieved better performance on FDG (F1=0.78) and DCFPyL (F1=0.75) exams, while performance dropped on DOTATE (F1=0.58) and Fluciclovine (F1=0.66). The model performed consistently across lesion sizes but showed reduced accuracy on lesions with low uptake. Our novel weak labeling pipeline accurately produced an annotated dataset of PET/CT image-text pairs, facilitating the development of 3D visual grounding models. ConTEXTual Net 3D significantly outperformed other models but fell short of the performance of nuclear medicine physicians. Our study suggests that even larger datasets may be needed to close this performance gap.
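弱标签管线的第一步是从报告句子中抽取 SUVmax 数值与轴向切片号;下面的正则写法与示例句子都是本文的示意性猜测,并非论文的实际实现:

```python
import re

# Illustrative guesses at how SUVmax values and axial slice numbers might be
# phrased in a PET/CT report; the real pipeline may use different patterns.
SUV_RE = re.compile(r"SUVmax\s*(?:of|=|:)?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)
SLICE_RE = re.compile(r"(?:axial\s+)?(?:slice|image)\s*#?\s*(\d+)", re.IGNORECASE)

def extract_weak_labels(sentence):
    """Return (SUVmax values, slice numbers) found in one report sentence."""
    suvs = [float(m) for m in SUV_RE.findall(sentence)]
    slices = [int(m) for m in SLICE_RE.findall(sentence)]
    return suvs, slices

sent = "FDG-avid right hilar node with SUVmax of 6.3 on axial slice 142."
suvs, slices = extract_weak_labels(sent)
```

由此得到的 (句子, 切片位置) 对即可作为弱标签,去匹配图像中对应位置的高摄取病灶。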
zh
[NLP-108] PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
【速读】: 该论文旨在解决大型语言模型中KV缓存内存使用过高的问题,特别是由于异常值导致的传统量化方法在量化关键向量时遇到的挑战。论文的关键解决方案是提出了一种新的量化方法PolarQuant,它通过将关键向量分为两维子向量组,并采用极坐标表示(量化半径和极角),有效解决了异常值问题,从而提高了KV缓存量化效率并加速了解码过程,同时保持了全精度模型的下游性能。
链接: https://arxiv.org/abs/2502.00527
作者: Songhao Wu,Ang Lv,Xiao Feng,Yufei Zhang,Xun Zhang,Guojun Yin,Wei Lin,Rui Yan
机构: Renmin University of China(中国人民大学); ShanghaiTech University(上海科技大学); Meituan(美团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: preprint
点击查看摘要
Abstract:The KV cache in large language models is a dominant factor in memory usage, limiting their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational costs; however, previous methods struggle with quantizing key vectors due to outliers, resulting in excessive overhead. We propose a novel quantization approach called PolarQuant, which efficiently addresses the outlier challenge. We observe that outliers typically appear in only one of two dimensions, which are rotated together by a specific angle when rotary position embeddings are applied. When represented as two-dimensional vectors, these dimensions exhibit well-structured patterns, with radii and angles smoothly distributed in polar coordinates. This alleviates the challenge of outliers on per-channel quantization, making them well-suited for quantization. Thus, PolarQuant divides key vectors into groups of two-dimensional sub-vectors, encoding them as the corresponding quantized radius and the polar angle, rather than quantizing original key vectors directly. PolarQuant achieves the superior efficiency in KV cache quantization and accelerates the decoding process by turning the query-key inner product into a table lookup, all while maintaining the downstream performance of full-precision models.
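PolarQuant 的核心操作——把键向量按相邻两维配对、以量化后的半径和极角存储——可以用如下极简示意复现(位宽与数值均为假设,且省略了论文中的查表解码加速):

```python
import math

def polar_quantize(vec, r_bits=4, a_bits=4):
    """Encode a vector as per-pair (radius code, angle code); a simplified
    sketch of the PolarQuant idea with uniform quantization grids."""
    assert len(vec) % 2 == 0
    pairs = [(vec[i], vec[i + 1]) for i in range(0, len(vec), 2)]
    radii = [math.hypot(x, y) for x, y in pairs]
    r_max = max(radii) or 1.0
    r_levels, a_levels = (1 << r_bits) - 1, (1 << a_bits) - 1
    codes = []
    for (x, y), r in zip(pairs, radii):
        theta = math.atan2(y, x)  # polar angle in [-pi, pi]
        codes.append((round(r / r_max * r_levels),
                      round((theta + math.pi) / (2 * math.pi) * a_levels)))
    return codes, r_max

def polar_dequantize(codes, r_max, r_bits=4, a_bits=4):
    """Reconstruct the vector from (radius code, angle code) pairs."""
    r_levels, a_levels = (1 << r_bits) - 1, (1 << a_bits) - 1
    out = []
    for r_code, a_code in codes:
        r = r_code / r_levels * r_max
        theta = a_code / a_levels * 2 * math.pi - math.pi
        out.extend([r * math.cos(theta), r * math.sin(theta)])
    return out

key = [0.9, -0.3, 0.1, 0.7, -0.5, -0.5, 0.2, 0.0]
codes, r_max = polar_quantize(key)
recon = polar_dequantize(codes, r_max)
max_err = max(abs(a - b) for a, b in zip(key, recon))
```

异常值只会拉大半径分量,而极角分布平滑,因此按 (半径, 极角) 量化比逐维量化更稳健。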
zh
[NLP-109] Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning
【速读】: 该论文旨在解决复杂推理任务中单次推理结果不可靠的问题,解决方案的关键在于提出的Reasoning-Pruning Perplexity Consistency (RPC)方法。RPC结合了Perplexity Consistency与Reasoning Pruning:前者无缝集成大规模语言模型的困惑度与自一致性,后者剪除低概率推理路径,从而有效防止估计误差下降过程的退化。理论分析表明,RPC不仅将估计误差的收敛速率提升至指数级,还具有进一步降低模型误差的潜力。
链接: https://arxiv.org/abs/2502.00511
作者: Zhi Zhou,Tan Yuhao,Zenan Li,Yuan Yao,Lan-Zhe Guo,Xiaoxing Ma,Yu-Feng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review
点击查看摘要
Abstract:Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, single-shot inference often yields unreliable results for complex reasoning tasks, leading researchers to explore multiple reasoning paths through methods such as perplexity and self-consistency. In this paper, we present the first theoretical error decomposition analysis of these techniques, breaking down their error into estimation error and model error. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function, while self-consistency exhibits high estimation error due to a slow error convergence rate. To overcome these limitations, we propose Reasoning-Pruning Perplexity Consistency (RPC). This approach combines Perplexity Consistency, which seamlessly integrates LLM perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths to effectively prevent the degeneration of estimation error reduction. Theoretical analysis demonstrates that RPC not only accelerates the convergence rate of estimation error to an exponential level but also holds strong potential for further reducing model error. Extensive empirical evaluations on seven benchmark datasets confirm that RPC can significantly improve reasoning performance, sample efficiency, and confidence reliability.
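RPC 的思路可概括为:先剪除低概率推理路径,再按路径概率加权投票。下面是一个示意实现(采样路径、对数概率与剪枝分位数均为本文虚构的假设参数):

```python
import math
from collections import defaultdict

def rpc_answer(paths, prune_quantile=0.5):
    """paths: list of (answer, avg log-prob). Prune the low-probability half,
    then take a probability-weighted consistency vote over what remains."""
    cutoff = sorted(lp for _, lp in paths)[int(len(paths) * prune_quantile)]
    weights = defaultdict(float)
    for answer, lp in paths:
        if lp >= cutoff:                     # Reasoning Pruning
            weights[answer] += math.exp(lp)  # perplexity-based vote weight
    return max(weights, key=weights.get)

# Fabricated sampled reasoning paths: (final answer, average log-probability).
paths = [
    ("42", -0.2), ("42", -0.4), ("41", -3.0),
    ("42", -0.3), ("7", -4.0), ("41", -0.9),
]
best = rpc_answer(paths)
```

与普通自一致性(均匀计票)相比,低概率的离群路径在剪枝后不再稀释多数答案的置信度。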
zh
[NLP-110] Who's the MVP? A Game-Theoretic Evaluation Benchmark for Modular Attribution in LLM Agents
【速读】: 该论文旨在解决大型语言模型(LLM)代理框架中量化各模块对整体系统性能贡献的难题,以提高优化和可解释性。论文的关键解决方案是引入CapaBench评估框架,该框架基于合作博弈论中的Shapley值,能够系统地衡量单个模块及其交互作用的边际影响。通过在所有可能的组合中替换默认模块与测试变体,CapaBench提供了一种原则性的方法来归因性能贡献。
链接: https://arxiv.org/abs/2502.00510
作者: Yingxuan Yang,Bo Huang,Siyuan Qi,Chao Feng,Haoyi Hu,Yuxuan Zhu,Jinbo Hu,Haoran Zhao,Ziyi He,Xiao Liu,Zongyu Wang,Lin Qiu,Xuezhi Cao,Xunliang Cai,Yong Yu,Weinan Zhang
机构: Shanghai Jiao Tong University(上海交通大学); University of Chicago(芝加哥大学); University of Toronto(多伦多大学); Meituan(美团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Model (LLM) agent frameworks often employ modular architectures, incorporating components such as planning, reasoning, action execution, and reflection to tackle complex tasks. However, quantifying the contribution of each module to overall system performance remains a significant challenge, impeding optimization and interpretability. To address this, we introduce CapaBench (Capability-level Assessment Benchmark), an evaluation framework grounded in cooperative game theory’s Shapley Value, which systematically measures the marginal impact of individual modules and their interactions within an agent’s architecture. By replacing default modules with test variants across all possible combinations, CapaBench provides a principled method for attributing performance contributions. Key contributions include: (1) We are the first to propose a Shapley Value-based methodology for quantifying the contributions of capabilities in LLM agents; (2) Modules with high Shapley Values consistently lead to predictable performance gains when combined, enabling targeted optimization; and (3) We build a multi-round dataset of over 1,000 entries spanning diverse domains and practical task scenarios, enabling comprehensive evaluation of agent capabilities. CapaBench bridges the gap between component-level evaluation and holistic system assessment, providing actionable insights for optimizing modular LLM agents and advancing their deployment in complex, real-world scenarios.
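CapaBench 依据 Shapley 值归因各模块贡献。下面用一个虚构的打分函数演示精确 Shapley 值的计算,并验证其效率性(各模块贡献之和等于整体收益);打分数值与交互奖励均为本文假设,真实基准中分数来自实际运行各模块组合:

```python
from itertools import combinations
from math import factorial

MODULES = ["planning", "reasoning", "action", "reflection"]

def performance(subset):
    """Hypothetical benchmark score when `subset` of modules is upgraded."""
    base = 0.40
    gains = {"planning": 0.10, "reasoning": 0.20, "action": 0.05, "reflection": 0.02}
    score = base + sum(gains[m] for m in subset)
    if "planning" in subset and "reasoning" in subset:
        score += 0.05  # made-up interaction bonus between the two modules
    return score

def shapley_values(modules, v):
    """Exact Shapley values: weighted marginal contributions over all subsets."""
    n = len(modules)
    phi = {m: 0.0 for m in modules}
    for m in modules:
        others = [x for x in modules if x != m]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[m] += weight * (v(set(subset) | {m}) - v(set(subset)))
    return phi

phi = shapley_values(MODULES, performance)
total = sum(phi.values())
```

交互奖励被对称地分摊到 planning 与 reasoning 上,这正是 Shapley 值能刻画"模块间交互"的原因。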
zh
[NLP-111] A statistically consistent measure of Semantic Variability using Language Models
【速读】: 该论文旨在解决语言模型输出结果的语义变异性问题。关键在于提出了一种语义谱熵(semantic spectral entropy)的度量方法,该方法在轻度假设下具有统计一致性,并且易于实现,仅需现成的语言模型即可应用。
链接: https://arxiv.org/abs/2502.00507
作者: Yi Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:To address the issue of variability in the output generated by a language model, we present a measure of semantic variability that is statistically consistent under mild assumptions. This measure, denoted as semantic spectral entropy, is an easy-to-implement algorithm that requires just off-the-shelf language models. We put very few restrictions on the language models, and we have shown in clear simulation studies that such a method can generate an accurate metric despite the randomness that arises from the language models.
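一个可能的简化读法是:把采样得到的多个回答按语义等价性聚类,再对簇的分布求熵。下面用"首词相同"这一玩具等价判定代替论文中基于语言模型的判定,仅示意"回答越分散、熵越高"的性质(具体算法细节以论文为准):

```python
import math

def semantic_clusters(answers, equivalent):
    """Greedily group answers into semantic-equivalence clusters."""
    clusters = []
    for a in answers:
        for c in clusters:
            if equivalent(a, c[0]):
                c.append(a)
                break
        else:
            clusters.append([a])
    return clusters

def semantic_entropy(answers, equivalent):
    """Shannon entropy of the cluster-size distribution."""
    clusters = semantic_clusters(answers, equivalent)
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

def toy_equivalent(a, b):
    # Stand-in for an LLM-based equivalence judgment: same first word.
    return a.split()[0].lower() == b.split()[0].lower()

diverse = ["Paris is the capital", "London is bigger", "Rome fell", "Berlin wall"]
agreed = ["Paris is the capital", "Paris France", "Paris", "Paris again"]
high = semantic_entropy(diverse, toy_equivalent)
low = semantic_entropy(agreed, toy_equivalent)
```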
zh
[NLP-112] Towards Privacy-aware Mental Health AI Models: Advances, Challenges and Opportunities
【速读】: 该论文旨在解决在开发和部署用于精神健康诊断与治疗的人工智能(Artificial Intelligence, AI)模型时所面临的隐私挑战。论文的关键解决方案包括数据匿名化、合成数据生成以及隐私保护模型训练,以增强实际应用中的隐私保障。此外,论文还讨论了评估框架,用以衡量这些方法中隐私性和实用性之间的权衡。通过解决这些挑战,研究旨在推进可靠且注重隐私的人工智能工具的发展,以支持临床决策并改善精神健康结果。
链接: https://arxiv.org/abs/2502.00451
作者: Aishik Mandal,Tanmoy Chakraborty,Iryna Gurevych
机构: Technische Universität Darmstadt (达姆施塔特工业大学); Hessian Center for AI (hessian.AI); Department of Computer Science (计算机科学系); Indian Institute of Technology Delhi, India (印度理工学院德里分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 2 figures
点击查看摘要
Abstract:Mental illness is a widespread and debilitating condition with substantial societal and personal costs. Traditional diagnostic and treatment approaches, such as self-reported questionnaires and psychotherapy sessions, often impose significant burdens on both patients and clinicians, limiting accessibility and efficiency. Recent advances in Artificial Intelligence (AI), particularly in Natural Language Processing and multimodal techniques, hold great potential for recognizing and addressing conditions such as depression, anxiety, bipolar disorder, schizophrenia, and post-traumatic stress disorder. However, privacy concerns, including the risk of sensitive data leakage from datasets and trained models, remain a critical barrier to deploying these AI systems in real-world clinical settings. These challenges are amplified in multimodal methods, where personal identifiers such as voice and facial data can be misused. This paper presents a critical and comprehensive study of the privacy challenges associated with developing and deploying AI models for mental health. We further prescribe potential solutions, including data anonymization, synthetic data generation, and privacy-preserving model training, to strengthen privacy safeguards in practical applications. Additionally, we discuss evaluation frameworks to assess the privacy-utility trade-offs in these approaches. By addressing these challenges, our work aims to advance the development of reliable, privacy-aware AI tools to support clinical decision-making and improve mental health outcomes.
zh
[NLP-113] HERA: Improving Long Document Summarization using Large Language Models with Context Packaging and Reordering
【速读】: 该论文旨在解决大型语言模型(LLMs)在长文档摘要任务上的表现不佳问题。主要原因是长文档中的相关信息分散且叙述顺序混乱,影响了LLMs对文档的准确理解和利用。为了解决这些问题,论文提出了一种新的摘要生成框架HERA。关键解决方案在于首先根据语义结构分割长文档,并检索关于同一事件的文本片段,最后重新排序这些片段以形成输入上下文。
链接: https://arxiv.org/abs/2502.00448
作者: Taiji Li,Hao Chen,Fei Yu,Yin Zhang
机构: College of Computer Science and Technology, Zhejiang University(浙江大学); Ant Group, China(蚂蚁集团, 中国)
类目: Computation and Language (cs.CL)
备注: 7 pages, 1 figure
点击查看摘要
Abstract:Despite the rapid growth of context length of large language models (LLMs) , LLMs still perform poorly in long document summarization. An important reason for this is that relevant information about an event is scattered throughout long documents, and the messy narrative order impairs the accurate understanding and utilization of LLMs for long documents. To address these issues, we propose a novel summary generation framework, called HERA. Specifically, we first segment a long document by its semantic structure and retrieve text segments about the same event, and finally reorder them to form the input context. We evaluate our approach on two long document summarization datasets. The experimental results show that HERA outperforms foundation models in ROUGE, BERTScore and faithfulness metrics, while HERA does not require additional fine-tuning and resources.
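HERA 的"按事件检索并重排"步骤可用如下玩具代码示意——关键词匹配代替了论文中的语义检索,句子与事件词均为本文虚构:

```python
def reorder_by_event(sentences, events):
    """Group sentences by the first event keyword they mention, then
    concatenate groups so related fragments sit together in the context."""
    groups = {e: [] for e in events}
    rest = []
    for s in sentences:
        for e in events:
            if e in s.lower():
                groups[e].append(s)
                break
        else:
            rest.append(s)
    return [s for e in events for s in groups[e]] + rest

doc = [
    "The merger was announced in May.",
    "Quarterly profits rose 8%.",
    "Regulators reviewed the merger in June.",
    "Profits had fallen the previous year.",
    "The merger closed in August.",
]
context = reorder_by_event(doc, ["merger", "profits"])
```

重排后同一事件的片段彼此相邻,缓解了"相关信息散落全文"对 LLM 理解的干扰。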
zh
[NLP-114] UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中的后训练部署挑战,特别是显著的内存开销和明显的推理延迟。现有方法如层内键值(KV)共享和跨层KV共享虽有所改进,但仍存在不足。论文的关键在于识别到Softmax操作是LLM推理的主要瓶颈,并且在后训练过程中实际上是冗余的。为此,论文提出了一种新的后训练方法——注意力中的Softmax统一(UniAttn),通过统一Transformer块中的Softmax激活来降低LLM的推理成本,并采用线性投影补偿由Softmax统一引起的误差。实验表明,UniAttn在保持标准后训练性能的同时显著降低了推理成本,优于现有的高效架构。
链接: https://arxiv.org/abs/2502.00439
作者: Yizhe Xiong,Wei Huang,Xin Ye,Hui Chen,Zijia Lin,Haoran Lian,Zhenpeng Su,Jungong Han,Guiguang Ding
机构: School of Software, Tsinghua University (清华大学软件学院); School of Computer Science, Beijing University of Posts and Telecommunications (北京邮电大学计算机科学学院); Kuaishou Technology (快手科技); Beihang University (北京航空航天大学); Department of Automation, Tsinghua University (清华大学自动化系)
类目: Computation and Language (cs.CL)
备注: 11 pages, 4 figures. Preprint, under review
点击查看摘要
Abstract:Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, intra-layer KV sharing still results in high inference costs, while cross-layer KV sharing leads to significant performance degradation. As a result, both methods remain suboptimal for post-training pre-trained LLMs. In this paper, we identify that the \textttSoftmax operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax \textbfUnification in \textbfAtte\textbfntion (\textbfUniAttn), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training. Our code will be available at \urlthis https URL.
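UniAttn 的要点是:Softmax 注意力分布只计算一次,后续块直接复用,并加线性补偿修正误差。下面是一个不含训练的纯数值示意(分数矩阵、值矩阵与补偿系数均为虚构,补偿形式也仅示意"复用 + 修正"的结构,并非论文的线性投影实现):

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def mix(A, V):
    """One attention mixing step: out[i] = sum_j A[i][j] * V[j]."""
    return [[sum(A[i][j] * V[j][d] for j in range(len(V)))
             for d in range(len(V[0]))] for i in range(len(A))]

# Attention scores computed in one block; later blocks reuse its softmax
# output (the "unification") instead of running their own Softmax.
scores = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 2.2]]
A_shared = [softmax(r) for r in scores]

V2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # value states of a later block
compensation = 0.05                         # stand-in for the learned correction
out = [[x * (1 + compensation) for x in row] for row in mix(A_shared, V2)]
```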
zh
[NLP-115] Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language ICASSP2025
【速读】: 该论文旨在解决奥罗莫语(Oromo)自动语音识别(ASR)资源匮乏的问题。解决方案的关键在于构建了一个包含100小时真实世界音频记录及对应转录的新型ASR数据集,并通过使用Conformer模型和微调Whisper模型,分别实现了15.32%和10.82%的词错误率(WER),从而为奥罗莫语ASR建立了基准,展示了提升该语言ASR性能的潜力与挑战。
链接: https://arxiv.org/abs/2502.00421
作者: Turi Abu,Ying Shi,Thomas Fang Zheng,Dong Wang
机构: Center for Speech and Language Technologies, BNRist, Beijing(北京语音与语言技术中心, BNRist, 北京); Department of Computer Science and Technology, Tsinghua University, Beijing, China(清华大学计算机科学与技术系, 北京, 中国); School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院, 哈尔滨, 中国)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted for ICASSP2025 (2025 IEEE International Conference on Acoustics, Speech, and Signal Processing)
点击查看摘要
Abstract:We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the Oromo language which is underrepresented. To show its applicability for the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at this https URL and we encourage its use for further research and development in Oromo speech processing.
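文中以词错误率(WER)报告各模型结果;WER 即参考文本与识别结果之间的词级编辑距离除以参考词数,可用动态规划直接计算:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / #ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 ref words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```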
zh
[NLP-116] Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments
【速读】: 该论文旨在解决在政治敏感场景中,特别是在冲突特定背景下,社交媒体平台上意识形态立场检测的研究不足问题。研究通过分析9,969条与以色列-巴勒斯坦冲突相关的Reddit评论,提出了多种方法,包括机器学习、预训练语言模型、神经网络以及针对开源大型语言模型(LLMs)的提示工程策略,来分类这些评论的立场,如亲以色列、亲巴勒斯坦和中立。关键解决方案在于采用Scoring和Reflective Re-read提示策略,在Mixtral 8x7B模型中实现了最高的性能表现,从而有效提升了在高度两极分化的社交媒体环境中意识形态立场检测的准确性。
链接: https://arxiv.org/abs/2502.00414
作者: Hasin Jawad Ali,Ajwad Abrar,S.M. Hozaifa Hossain,M. Firoz Mridha
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In politically sensitive scenarios like wars, social media serves as a platform for polarized discourse and expressions of strong ideological stances. While prior studies have explored ideological stance detection in general contexts, limited attention has been given to conflict-specific settings. This study addresses this gap by analyzing 9,969 Reddit comments related to the Israel-Palestine conflict, collected between October 2023 and August 2024. The comments were categorized into three stance classes: Pro-Israel, Pro-Palestine, and Neutral. Various approaches, including machine learning, pre-trained language models, neural networks, and prompt engineering strategies for open source large language models (LLMs), were employed to classify these stances. Performance was assessed using metrics such as accuracy, precision, recall, and F1-score. Among the tested methods, the Scoring and Reflective Re-read prompt in Mixtral 8x7B demonstrated the highest performance across all metrics. This study provides comparative insights into the effectiveness of different models for detecting ideological stances in highly polarized social media contexts. The dataset used in this research is publicly available for further exploration and validation.
zh
[NLP-117] Doing More with Less – Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLM)系统在处理不同任务时资源利用不均衡的问题。论文的关键解决方案在于引入一种路由机制,将用户查询分配到最适合的组件,如较小的LLM或特定领域的专家。这种方法通过优化资源配置,提高响应质量的同时最小化成本。
链接: https://arxiv.org/abs/2502.00409
作者: Clovis Varangot-Reille,Christophe Bouvard,Antoine Gourru,Mathieu Ciancone,Marion Schaeffer,François Jacquenet
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLM)-based systems, i.e. interconnected elements that include an LLM as a central component (e.g., conversational agents), are typically monolithic static architectures that rely on a single LLM for all user queries. However, they often require different preprocessing strategies, levels of reasoning, or knowledge. Generalist LLMs (i.e. GPT-4), trained on very large multi-topic corpora, can perform well in a variety of tasks. However, they require significant financial, energy, and hardware resources that may not be justified for basic tasks. This implies potentially investing in unnecessary costs for a given query. To overcome this problem, a routing mechanism routes user queries to the most suitable components, such as smaller LLMs or experts in specific topics. This approach may improve response quality while minimising costs. Routing can be expanded to other components of the conversational agent architecture, such as the selection of optimal embedding strategies. This paper explores key considerations for integrating routing into LLM-based systems, focusing on resource management, cost definition, and strategy selection. Our main contributions include a formalisation of the problem, a novel taxonomy of existing approaches emphasising relevance and resource efficiency, and a comparative analysis of these strategies in relation to industry practices. Finally, we identify critical challenges and directions for future research.
zh
[NLP-118] ALU: Agentic LLM Unlearning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中信息移除或抑制的需求,特别是在AI监管、法律合规、安全性和隐私性方面。论文的关键在于提出了一种名为“代理型LLM无学习”(Agent-based LLM Unlearning, ALU)的方法,这是一种多代理、无需重新训练、与模型无关的LLM无学习方法。ALU通过多个专门设计用于无学习过程特定步骤的LLM代理来实现高效的信息删除,同时保持模型的实用性,并且无需更新任何代理的模型权重。这种方法使得用户可以灵活地请求任意顺序的无学习实例,从而在实时适应方面表现出色,而无需对基础LLM模型进行任何修改。
链接: https://arxiv.org/abs/2502.00406
作者: Debdeep Sanyal,Murari Mandal
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Information removal or suppression in large language models (LLMs) is a desired functionality, useful in AI regulation, legal compliance, safety, and privacy. LLM unlearning methods aim to remove information on demand from LLMs. Current LLM unlearning methods struggle to balance the unlearning efficacy and utility due to the competing nature of these objectives. Keeping the unlearning process computationally feasible without assuming access to the model weights is an overlooked area. We present the first agentic LLM unlearning (ALU) method, a multi-agent, retrain-free, model-agnostic approach to LLM unlearning that achieves effective unlearning while preserving the utility. Our ALU framework unlearns by involving multiple LLM agents, each designed for a specific step in the unlearning process, without the need to update model weights for any of the agents in the framework. Users can easily request any set of unlearning instances in any sequence, and ALU seamlessly adapts in real time. This is facilitated without requiring any changes in the underlying LLM model. Through extensive experiments on established benchmarks (TOFU, WMDP, WPU) and jailbreaking techniques (many shot, target masking, other languages), we demonstrate that ALU consistently stands out as the most robust LLM unlearning framework among current state-of-the-art methods while incurring a low constant-time cost. We further highlight ALU’s superior performance compared to existing methods when evaluated at scale. Specifically, ALU is assessed on up to 1000 unlearning targets, exceeding the evaluation scope of all previously proposed LLM unlearning methods.
zh
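上述多代理、免重训的无学习流程可以用如下示意性代码勾勒(纯属说明性草图:`responder`、`auditor`、`redactor` 等函数名与示例数据均为本文假设,实际系统中各代理对应独立的 LLM 调用,且全程不更新任何模型权重):

```python
# 示意性草图:ALU 式多代理、免重训的无学习流程。
# responder / auditor / redactor 均为占位函数(假设名),
# 实际系统中对应独立的 LLM 调用;由于不更新权重,
# unlearn_targets 可在请求时任意增删,实现实时适应。

unlearn_targets = {"Project Nightfall"}  # 可在运行时修改的无学习目标集合

def responder(query: str) -> str:
    # 占位:基础 LLM 的原始回答
    return "Project Nightfall was led by Dr. Doe in 2021."

def auditor(draft: str) -> list:
    # 标记草稿中泄露的无学习目标
    return [t for t in unlearn_targets if t in draft]

def redactor(draft: str, leaks: list) -> str:
    # 占位:实际应由 LLM 重写;此处仅做简单遮蔽
    for t in leaks:
        draft = draft.replace(t, "[information removed]")
    return draft

def alu_pipeline(query: str) -> str:
    draft = responder(query)
    leaks = auditor(draft)
    return redactor(draft, leaks) if leaks else draft

out = alu_pipeline("Tell me about Project Nightfall.")
```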
[NLP-119] The Impact of Persona-based Political Perspectives on Hateful Content Detection
【速读】: 该论文旨在探究persona-based prompting策略在多模态仇恨言论检测任务(特别是针对表情包中的仇恨言论)中能否实现与政治预训练相当的效果。关键在于通过映射persona到政治罗盘并测量persona一致性,发现内在的政治立场与分类决策之间的相关性较低,即使注入更强的意识形态描述也依然如此。这表明虽然大型语言模型(LLMs)在直接回答政治问题时可能表现出政治偏见,但在实际分类任务中的影响可能比之前认为的要小,从而质疑了昂贵的计算资源需求以实现公平性能的政治预训练的必要性。
链接: https://arxiv.org/abs/2502.00385
作者: Stefano Civelli,Pietro Bernardelle,Gianluca Demartini
机构: The University of Queensland(昆士兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While pretraining language models with politically diverse content has been shown to improve downstream task fairness, such approaches require significant computational resources often inaccessible to many researchers and organizations. Recent work has established that persona-based prompting can introduce political diversity in model outputs without additional training. However, it remains unclear whether such prompting strategies can achieve results comparable to political pretraining for downstream tasks. We investigate this question using persona-based prompting strategies in multimodal hate-speech detection tasks, specifically focusing on hate speech in memes. Our analysis reveals that when mapping personas onto a political compass and measuring persona agreement, inherent political positioning has surprisingly little correlation with classification decisions. Notably, this lack of correlation persists even when personas are explicitly injected with stronger ideological descriptors. Our findings suggest that while LLMs can exhibit political biases in their responses to direct political questions, these biases may have less impact on practical classification tasks than previously assumed. This raises important questions about the necessity of computationally expensive political pretraining for achieving fair performance in downstream tasks.
zh
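论文的分析思路——将 persona 映射到政治罗盘坐标,再度量坐标与分类决策的相关性——可用如下玩具示例说明(personas、表情包文本与分类规则均为本文虚构,仅用于演示"政治立场与决策相关性接近零"这一度量方式):

```python
# 玩具示例:persona 的政治坐标与仇恨言论分类决策的 Pearson 相关性。
# 此处的分类器故意忽略 persona(仅看文本),因而相关性应接近 0,
# 对应论文观察到的"立场与决策几乎不相关"的现象。数据均为虚构。
import math

personas = {
    "union organizer": -0.8,        # 假设的左右经济轴坐标
    "centrist journalist": 0.0,
    "free-market economist": 0.8,
}

def classify_with_persona(persona: str, meme_text: str) -> int:
    # 占位:实际应为带 persona 系统提示的 LLM 调用;1=仇恨, 0=非仇恨
    prompt = f"You are a {persona}. Is this meme hateful? {meme_text}"  # 仅示意
    return int("slur" in meme_text)  # 玩具规则:决策与 persona 无关

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

memes = ["a meme containing a slur", "a harmless cat meme", "another slur meme"]
positions, decisions = [], []
for name, pos in personas.items():
    for m in memes:
        positions.append(pos)
        decisions.append(classify_with_persona(name, m))

r = pearson(positions, decisions)
```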
[NLP-120] When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation
【速读】: 该论文旨在解决级联语音到文本翻译模型中的错误传播问题。关键解决方案在于结合自动语音识别(ASR)的多个候选结果和自监督语音特征,以减少语音领域相似样本映射到文本领域时的差异性,从而提高机器翻译(MT)模型的准确性,并最小化错误传播。这一策略充分利用了大规模的ASR和MT数据集以及预训练的ASR/MT模型。
链接: https://arxiv.org/abs/2502.00377
作者: Anna Min,Chenxu Hu,Yi Ren,Hang Zhao
机构: School of Software (软件学院), Tsinghua University (清华大学); IIIS (清华交叉信息研究院), Tsinghua University (清华大学); TikTok; IIIS (清华交叉信息研究院), Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Though end-to-end speech-to-text translation has been a great success, we argue that the cascaded speech-to-text translation model still has its place, which is usually criticized for the error propagation between automatic speech recognition (ASR) and machine translation (MT) models. In this paper, we explore the benefits of incorporating multiple candidates from ASR and self-supervised speech features into MT. Our analysis reveals that the primary cause of cascading errors stems from the increased divergence between similar samples in the speech domain when mapped to the text domain. By including multiple candidates and self-supervised speech features, our approach allows the machine translation model to choose the right words and ensure precise translation using various speech samples. This strategy minimizes error spread and takes advantage of large ASR and MT datasets, along with pre-trained ASR/MT models, while addressing associated issues.
zh
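将 ASR 多候选引入 MT 的级联思路可用如下草图示意(`asr_nbest`、`mt_translate`、`mt_score` 均为占位函数,真实系统中打分应来自 MT 模型的对数似然,而非这里的玩具词表规则):

```python
# 示意性级联:MT 阶段接收 ASR 的 n-best 候选,
# 用自身的打分函数挑出最可译的一条,从而缓解错误传播。
def asr_nbest(audio) -> list:
    # 占位:ASR 的两条假设性识别候选
    return ["I scream for ice cream", "eye scream for ice cream"]

def mt_translate(text: str) -> str:
    return f"<zh translation of: {text}>"  # 占位的翻译调用

def mt_score(text: str) -> float:
    # 玩具流畅度代理:真实系统应使用 MT 模型对数似然
    vocab = {"i", "scream", "for", "ice", "cream"}
    words = text.lower().split()
    return sum(w in vocab for w in words) / len(words)

def cascade(audio) -> str:
    best = max(asr_nbest(audio), key=mt_score)  # 在候选中择优
    return mt_translate(best)
```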
[NLP-121] A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation
【速读】: 该论文旨在解决语音到语音翻译(Speech-to-Speech Translation, S2ST)中忽视情感和态度等副语言信息(Paralinguistic Information)的问题。为了解决这一问题,研究引入了一个精心编纂的多语言数据集,该数据集源自多种电影音频片段,并且每对数据在副语言信息和时长方面进行了精确匹配。关键解决方案在于整合多种韵律迁移技术,以实现既准确又自然且富含副语言细节的翻译。实验结果表明,该模型在保持高翻译准确性和自然性的同时,能够保留更多的源语音副语言信息。
链接: https://arxiv.org/abs/2502.00374
作者: Anna Min,Chenxu Hu,Yi Ren,Hang Zhao
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Current research in speech-to-speech translation (S2ST) primarily concentrates on translation accuracy and speech naturalness, often overlooking key elements like paralinguistic information, which is essential for conveying emotions and attitudes in communication. To address this, our research introduces a novel, carefully curated multilingual dataset from various movie audio tracks. Each dataset pair is precisely matched for paralinguistic information and duration. We enhance this by integrating multiple prosody transfer techniques, aiming for translations that are accurate, natural-sounding, and rich in paralinguistic details. Our experimental results confirm that our model retains more paralinguistic information from the source speech while maintaining high standards of translation accuracy and naturalness.
zh
[NLP-122] FinchGPT: a Transformer based language model for birdsong analysis
【速读】: 该论文旨在探究非人类动物在连续发声中是否存在类似于人类语言中的长程依赖性。解决方案的关键在于使用基于Transformer架构的FinchGPT模型,该模型在文本化的鸣鸟歌声数据集上进行训练,并通过注意力权重分析有效捕捉了音节序列中的长程依赖性。此外,通过限制模型的注意力范围和破坏鸟类歌曲语法,研究展示了计算和生物学操作对其性能的影响。
链接: https://arxiv.org/abs/2502.00344
作者: Kosei Kobayashi,Kosuke Matsuzaki,Masaya Taniguchi,Keisuke Sakaguchi,Kentaro Inui,Kentaro Abe
机构: Graduate School of Life Sciences, Tohoku University(东北大学生命科学研究科), Japan; Graduate School of Information Sciences, Tohoku University(东北大学信息科学研究科), Japan; RIKEN Center for Advanced Intelligence Project(理化学研究所高级智能项目中心), Japan; Natural Language Processing Department, Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学自然语言处理系), United Arab Emirates; Center for Language AI Research, Tohoku University(东北大学语言AI研究中心), Japan; Division for the Establishment of Frontier Sciences, Tohoku University(东北大学前沿科学建立部门), Japan
类目: Computation and Language (cs.CL)
备注: 12 pages, 4 figures
点击查看摘要
Abstract:The long-range dependencies among the tokens, which originate from hierarchical structures, are a defining hallmark of human language. However, whether similar dependencies exist within the sequential vocalization of non-human animals remains a topic of investigation. Transformer architectures, known for their ability to model long-range dependencies among tokens, provide a powerful tool for investigating this phenomenon. In this study, we employed the Transformer architecture to analyze the songs of Bengalese finch (Lonchura striata domestica), which are characterized by their highly variable and complex syllable sequences. To this end, we developed FinchGPT, a Transformer-based model trained on a textualized corpus of birdsongs, which outperformed other architecture models in this domain. Attention weight analysis revealed that FinchGPT effectively captures long-range dependencies within syllable sequences. Furthermore, reverse engineering approaches demonstrated the impact of computational and biological manipulations on its performance: restricting FinchGPT’s attention span and disrupting birdsong syntax through the ablation of specific brain nuclei markedly influenced the model’s outputs. Our study highlights the transformative potential of large language models (LLMs) in deciphering the complexities of animal vocalizations, offering a novel framework for exploring the structural properties of non-human communication systems while shedding light on the computational distinctions between biological brains and artificial neural networks.
zh
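文中的注意力权重分析思想,可以用"平均注意距离"这一简化指标来示意:给定某个注意力头在音节序列上的因果注意力矩阵,计算每个 query 所关注 key 的加权平均距离,数值越大说明该头越倾向于捕捉长程依赖。以下代码与数据均为说明性假设,并非论文原实现:

```python
# 简化指标:因果注意力矩阵的加权平均注意距离。
# attn[q][k] 为位置 q 对位置 k 的注意力权重(k <= q)。
def mean_attention_distance(attn) -> float:
    total, weight = 0.0, 0.0
    for q, row in enumerate(attn):
        for k, w in enumerate(row[: q + 1]):  # 因果:只看 q 之前的 key
            total += w * (q - k)
            weight += w
    return total / weight

# 玩具的 3-token 因果注意力(每行和为 1)
attn = [
    [1.0],
    [0.5, 0.5],
    [0.8, 0.1, 0.1],
]
```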
[NLP-123] Enhancing Token Filtering Efficiency in Large Language Model Training with Collider
【速读】: 该论文旨在解决通过token过滤提升大规模语言模型(Large Language Models, LLMs)效用时未能实现更高效率的问题。现有方法仅在输出层过滤token,导致稀疏度不足,并且即使有足够稀疏度,稀疏GEMM操作依然低效。论文的关键解决方案在于提出Collider系统,它通过对所有层的非重要token激活进行过滤来保持高稀疏度,并通过自动工作流将稀疏GEMM转换为降维密集GEMM,以优化效率。
链接: https://arxiv.org/abs/2502.00340
作者: Di Chai,Pengbo Li,Feiyuan Zhang,Yilun Jin,Han Tian,Junxue Zhang,Kai Chen
机构: Hong Kong University of Science and Technology; University of Science and Technology of China
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
点击查看摘要
Abstract:Token filtering has been proposed to enhance utility of large language models (LLMs) by eliminating inconsequential tokens during training. While using fewer tokens should reduce computational workloads, existing studies have not succeeded in achieving higher efficiency. This is primarily due to the insufficient sparsity caused by filtering tokens only in the output layers, as well as inefficient sparse GEMM (General Matrix Multiplication), even when having sufficient sparsity. This paper presents Collider, a system unleashing the full efficiency of token filtering in LLM training. At its core, Collider filters activations of inconsequential tokens across all layers to maintain sparsity. Additionally, it features an automatic workflow that transforms sparse GEMM into dimension-reduced dense GEMM for optimized efficiency. Evaluations on three LLMs-TinyLlama-1.1B, Qwen2.5-1.5B, and Phi1.5-1.4B-demonstrate that Collider reduces backpropagation time by up to 35.1% and end-to-end training time by up to 22.0% when filtering 40% of tokens. Utility assessments of training TinyLlama on 15B tokens indicate that Collider sustains the utility advancements of token filtering by relatively improving model utility by 16.3% comparing to regular training, and reduces training time from 4.7 days to 3.5 days using 8 GPUs. Collider is designed for easy integration into existing LLM training frameworks, allowing systems already using token filtering to accelerate training with just one line of code.
zh
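Collider 将稀疏 GEMM 转换为降维密集 GEMM 的核心思路,可用如下纯 Python 草图示意(形状与数据均为玩具值):先按掩码收集(gather)未被过滤的 token 激活行,再对缩小后的密集矩阵执行普通 GEMM:

```python
# 示意:稀疏 GEMM -> 降维密集 GEMM。
# 与其在整行置零的稀疏激活矩阵上做乘法,
# 不如先 gather 出保留行,再做更小的密集 GEMM。
def gemm(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def filtered_gemm(acts, keep_mask, weights):
    kept_rows = [row for row, keep in zip(acts, keep_mask) if keep]  # gather
    return gemm(kept_rows, weights)  # 维度缩减后的密集 GEMM

acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 个 token, hidden=2
keep = [True, False, True]                     # 第 2 个 token 被过滤
w = [[1.0, 0.0], [0.0, 1.0]]                   # 玩具单位权重
out = filtered_gemm(acts, keep, w)
```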
[NLP-124] Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions
【速读】: 该论文旨在解决社交媒体平台上假新闻传播所带来的信任危机、社会不稳定及民主制度受损等关键风险。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)的进步,通过更先进的语义理解和多模态融合技术来提升检测准确性,以应对动态且多模态的虚假信息。然而,研究还指出了适应社交媒体趋势、实时跨平台检测能力以及大型语言模型误用所引发的伦理挑战等关键缺口。未来的研究方向包括开发风格无关模型、跨语言检测框架以及稳健政策,以减轻由大型语言模型驱动的虚假信息。
链接: https://arxiv.org/abs/2502.00339
作者: Jingyuan Yi,Zeqiu Xu,Tianyi Huang,Peiyang Yu
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:The pervasiveness of the dissemination of fake news through social media platforms poses critical risks to the trust of the general public, societal stability, and democratic institutions. This challenge calls for novel methodologies in detection, which can keep pace with the dynamic and multi-modal nature of misinformation. Recent works include powering the detection using large language model advances in multimodal frameworks, methodologies using graphs, and adversarial training in the literature of fake news. Based on the different approaches which can bring success, some key highlights will be underlined: enhanced LLM-improves accuracy through more advanced semantics and cross-modality fusion for robust detections. The review further identifies critical gaps in adaptability to dynamic social media trends, real-time, and cross-platform detection capabilities, as well as the ethical challenges thrown up by the misuse of LLMs. Future directions underline the development of style-agnostic models, cross-lingual detection frameworks, and robust policies with a view to mitigating LLM-driven misinformation. This synthesis thus lays a concrete foundation for those researchers and practitioners committed to reinforcing fake news detection systems with complications that keep on growing in the digital landscape.
zh
[NLP-125] UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在本科物理推理任务中的表现不足问题。现有基准测试往往无法全面评估LLMs在本科物理广度和深度上的能力,从而凸显出构建综合性评估工具的需求。为填补这一空白,论文引入了UGPhysics,这是一个专门设计用于评估LLMs处理本科物理推理能力的大规模综合基准,包含5,520个英语和中文的本科物理题目,并覆盖13个主题,七种不同答案类型及四种独特的物理推理技能。关键解决方案在于开发了Model-Assistant Rule-based Judgment (MARJ) 管道,以确保对物理问题解答正确性的准确评估。
链接: https://arxiv.org/abs/2502.00334
作者: Xin Xu,Qiyun Xu,Tong Xiao,Tianhao Chen,Yuchen Yan,Jiaxin Zhang,Shizhe Diao,Can Yang,Yang Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics. However, the domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs’ abilities on the breadth and depth of undergraduate-level physics, underscoring the need for a comprehensive evaluation. To fill this gap, we introduce UGPhysics, a large-scale and comprehensive benchmark specifically designed to evaluate UnderGraduate-level Physics (UGPhysics) reasoning with LLMs. UGPhysics includes 5,520 undergraduate-level physics problems in both English and Chinese, covering 13 subjects with seven different answer types and four distinct physics reasoning skills, all rigorously screened for data leakage. Additionally, we develop a Model-Assistant Rule-based Judgment (MARJ) pipeline specifically tailored for assessing answer correctness of physics problems, ensuring accurate evaluation. Our evaluation of 31 leading LLMs shows that the highest overall accuracy, 49.8% (achieved by OpenAI-o1-mini), emphasizes the necessity for models with stronger physics reasoning skills, beyond math abilities. We hope UGPhysics, along with MARJ, will drive future advancements in AI for physics reasoning.
zh
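MARJ 式"规则优先、模型兜底"的判题流程可以用如下假设性草图说明(`llm_judge` 为 LLM 判题调用的占位实现,容差阈值亦为示例值,并非论文的具体配置):

```python
# 示意性 MARJ 流程:数值答案先走相对容差的规则判定,
# 规则无法裁决的符号/带单位答案再升级给 LLM 判题(此处为占位)。
def rule_judge(pred: str, gold: str, rel_tol: float = 1e-3):
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return None  # 规则无法裁决非数值答案
    return abs(p - g) <= rel_tol * max(abs(g), 1e-12)

def llm_judge(pred: str, gold: str) -> bool:
    # 占位:实际应调用模型助手判定等价性;此处用去空格比较示意
    return pred.replace(" ", "") == gold.replace(" ", "")

def marj(pred: str, gold: str) -> bool:
    verdict = rule_judge(pred, gold)
    return verdict if verdict is not None else llm_judge(pred, gold)
```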
[NLP-126] MODS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections NAACL2025
【速读】: 该论文旨在解决可争论性查询(Debatable Queries)的查询聚焦摘要(Query-Focused Summarization, QFS)问题。传统QFS方法假设查询只有一个答案,忽视了具有争议性的查询(如“法学院值得就读吗?”)。为应对这一挑战,论文提出了Debatable QFS (DQFS),其目标是通过包含对立观点的文档生成全面且平衡的摘要,而不偏袒任何一方。论文的关键解决方案是设计了一个名为MODS的多语言模型框架,该框架模拟人类小组讨论的过程。MODS将文档视为独立的发言者语言模型(Speaker LLMs),并由一个主持人语言模型(Moderator LLM)挑选发言者,针对计划主题提出定制化查询。发言者使用定制化查询从文档中检索相关上下文,并提供视角,这些视角被追踪在一个丰富的提纲中,形成内容计划以指导最终的摘要生成。这一方法有效提升了在主题段落覆盖率和平衡性方面的表现,超越了现有技术(SOTA)系统。
链接: https://arxiv.org/abs/2502.00322
作者: Nishant Balepur,Alexa Siu,Nedim Lipka,Franck Dernoncourt,Tong Sun,Jordan Boyd-Graber,Puneet Mathur
机构: University of Maryland(马里兰大学); Adobe Research(Adobe研究)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at NAACL 2025(main)
点击查看摘要
Abstract:Query-focused summarization (QFS) gives a summary of documents to answer a query. Past QFS work assumes queries have one answer, ignoring debatable ones (Is law school worth it?). We introduce Debatable QFS (DQFS), a task to create summaries that answer debatable queries via documents with opposing perspectives; summaries must comprehensively cover all sources and balance perspectives, favoring no side. These goals elude LLM QFS systems, which: 1) lack structured content plans, failing to guide LLMs to write balanced summaries, and 2) use the same query to retrieve contexts across documents, failing to cover all perspectives specific to each document’s content. To overcome this, we design MODS, a multi-LLM framework mirroring human panel discussions. MODS treats documents as individual Speaker LLMs and has a Moderator LLM that picks speakers to respond to tailored queries for planned topics. Speakers use tailored queries to retrieve relevant contexts from their documents and supply perspectives, which are tracked in a rich outline, yielding a content plan to guide the final summary. Experiments on ConflictingQA with controversial web queries and DebateQFS, our new dataset of debate queries from Debatepedia, show MODS beats SOTA by 38-59% in topic paragraph coverage and balance, based on new citation metrics. Users also find MODS’s summaries to be readable and more balanced.
zh
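MODS 的"主持人—发言者"控制流可用如下玩具代码勾勒(文档内容、定制查询与 `speaker_answer` 均为本文假设的占位实现;真实系统中发言者会先从各自文档检索相关上下文再给出观点):

```python
# 示意:主持人按计划主题轮询各文档"发言者",
# 把回答记录到提纲(outline)中,作为最终摘要的内容计划。
docs = {
    "pro_law_school.txt": "High salaries reward the JD investment.",
    "anti_law_school.txt": "Tuition debt outweighs uncertain job prospects.",
}

def speaker_answer(doc_text: str, tailored_query: str) -> str:
    # 占位:真实发言者会检索 + 归纳;此处直接返回文档内容
    return doc_text

def moderator(topics):
    outline = {}
    for topic in topics:
        outline[topic] = {}
        for name, text in docs.items():
            query = f"What does your document say about {topic}?"  # 定制查询
            outline[topic][name] = speaker_answer(text, query)
    return outline

plan = moderator(["cost", "career outcomes"])
```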
[NLP-127] Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values
【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)在社会和经济决策中的表现,特别是它们是否符合公平性概念(如平等性、无嫉妒性和罗尔斯最大最小原则),以及它们与人类偏好的一致性。研究的关键在于评估几种LLMs在反映这些公平性指标方面的性能,并比较它们之间的差异。研究结果表明,当前LLMs的响应与人类在资源分配上的偏好不一致,且无法利用金钱作为可转移资源来缓解不平等。然而,当LLMs被要求从预定义选项中选择而非生成新方案时,其表现有所改善。此外,论文还分析了LLMs响应对语义因素或非语义提示变化的鲁棒性,并提出了增强LLM行为与既定公平概念一致性的潜在策略。
链接: https://arxiv.org/abs/2502.00313
作者: Hadi Hosseini,Samarth Khanna
机构: Penn State University (宾夕法尼亚州立大学), USA
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:The growing interest in employing large language models (LLMs) for decision-making in social and economic contexts has raised questions about their potential to function as agents in these domains. A significant number of societal problems involve the distribution of resources, where fairness, along with economic efficiency, play a critical role in the desirability of outcomes. In this paper, we examine whether LLM responses adhere to fundamental fairness concepts such as equitability, envy-freeness, and Rawlsian maximin, and investigate their alignment with human preferences. We evaluate the performance of several LLMs, providing a comparative benchmark of their ability to reflect these measures. Our results demonstrate a lack of alignment between current LLM responses and human distributional preferences. Moreover, LLMs are unable to utilize money as a transferable resource to mitigate inequality. Nonetheless, we demonstrate a stark contrast when (some) LLMs are tasked with selecting from a predefined menu of options rather than generating one. In addition, we analyze the robustness of LLM responses to variations in semantic factors (e.g. intentions or personas) or non-semantic prompting changes (e.g. templates or orderings). Finally, we highlight potential strategies aimed at enhancing the alignment of LLM behavior with well-established fairness concepts.
zh
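论文考察的公平性概念可以用一个小型分配实例具体化:给定可加性估值与一个不可分物品分配,检查无嫉妒性(envy-freeness)并计算罗尔斯最大最小(maximin)福利。估值与分配均为本文虚构的示例数据:

```python
# 公平性概念的最小实例:可加估值下检查 envy-freeness 与 maximin 福利。
values = {  # agent -> 各物品的估值(虚构)
    "a": {"g1": 5, "g2": 1, "g3": 2},
    "b": {"g1": 2, "g2": 4, "g3": 3},
}
alloc = {"a": {"g1"}, "b": {"g2", "g3"}}

def bundle_value(agent: str, bundle) -> int:
    return sum(values[agent][g] for g in bundle)

def is_envy_free(alloc) -> bool:
    # 每个 agent 对自己份额的估值不低于对任何他人份额的估值
    return all(
        bundle_value(i, alloc[i]) >= bundle_value(i, alloc[j])
        for i in alloc for j in alloc
    )

def maximin_welfare(alloc) -> int:
    # 罗尔斯最大最小:最差者的福利
    return min(bundle_value(i, alloc[i]) for i in alloc)
```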
[NLP-128] SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition
【速读】: 该论文旨在解决语音情感识别(Speech Emotion Recognition, SER)中的系统复杂性、特征区分度不足以及噪声干扰等问题。关键解决方案在于提出了一种新的端到端(End-to-End, E2E)深度学习多分辨率框架,通过快速离散小波变换(Fast Discrete Wavelet Transform, FDWT)的特性,包括级联算法、共轭四边形滤波器和系数去噪,引入了可学习的小波基和去噪模型。该框架利用激活函数实现可学习的非对称硬阈值处理,并结合一维膨胀卷积神经网络(1D dilated Convolutional Neural Networks, 1D dilated CNN)、空间注意力层以及双向门控循环单元(Bidirectional Gated Recurrent Units, Bi-GRU)与时间注意力层,有效捕捉情感特征的空间和时间特性。该方法无需分割变长语音信号,且不需要预处理或后处理步骤。
链接: https://arxiv.org/abs/2502.00310
作者: Alaa Nfissi,Wassim Bouachir,Nizar Bouguila,Brian Mishara
机构: Data Science Laboratory, University of Québec (TÉLUQ) (魁北克大学数据科学实验室); Concordia Institute for Information Systems Engineering, Concordia University (康考迪亚大学信息系统工程学院); Psychology Department, University of Québec at Montréal (魁北克大学蒙特利尔分校心理学系); Centre for Research and Intervention on Suicide, Ethical Issues and End-of-Life Practices (自杀研究与干预中心、伦理问题及临终关怀实践中心)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Published in: IEEE Transactions on Affective Computing
点击查看摘要
Abstract:In the field of human-computer interaction and psychological assessment, speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals. Despite advancements, challenges persist due to system complexity, feature distinctiveness issues, and noise interference. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER, addressing these limitations by extracting meaningful representations directly from raw waveform speech signals. By leveraging the properties of the fast discrete wavelet transform (FDWT), including the cascade algorithm, conjugate quadrature filter, and coefficient denoising, our approach introduces a learnable model for both wavelet bases and denoising through deep learning techniques. The framework incorporates an activation function for learnable asymmetric hard thresholding of wavelet coefficients. Our approach exploits the capabilities of wavelets for effective localization in both time and frequency domains. We then combine one-dimensional dilated convolutional neural networks (1D dilated CNN) with a spatial attention layer and bidirectional gated recurrent units (Bi-GRU) with a temporal attention layer to efficiently capture the nuanced spatial and temporal characteristics of emotional features. By handling variable-length speech without segmentation and eliminating the need for pre or post-processing, the proposed model outperformed state-of-the-art methods on IEMOCAP and EMO-DB datasets. The source code of this paper is shared on the Github repository: this https URL.
zh
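上文提到的两个小波要素——离散小波变换与(非对称)阈值去噪——可用一层 Haar DWT 加软阈值的最小示例说明;论文中小波基与阈值均为可学习参数,此处用固定 Haar 基与固定阈值仅作示意:

```python
# 最小示意:一层 Haar DWT + 对细节系数做(可非对称的)软阈值去噪。
import math

def haar_dwt(signal):
    s = 1 / math.sqrt(2)
    approx = [(a + b) * s for a, b in zip(signal[::2], signal[1::2])]
    detail = [(a - b) * s for a, b in zip(signal[::2], signal[1::2])]
    return approx, detail

def soft_threshold(coeffs, t_pos, t_neg):
    # 非对称阈值:正负系数使用不同的截断值(论文中由网络学习)
    out = []
    for c in coeffs:
        t = t_pos if c >= 0 else t_neg
        out.append(math.copysign(max(abs(c) - t, 0.0), c))
    return out

approx, detail = haar_dwt([4.0, 2.0, 1.0, 1.0])
den = soft_threshold(detail, 0.5, 0.5)
```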
[NLP-129] Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation
【速读】: 该论文旨在解决在 Retrieval-Augmented Generation (RAG) 系统中,通过自然语言查询推断模型数据存储库中包含的文档成员身份的问题。论文的关键解决方案是提出了一种名为 Interrogation Attack (IA) 的成员推理技术,通过构造仅依赖于目标文档存在的自然文本查询,实现对文档成员身份的有效且隐蔽的推断,仅需30个查询即可成功执行,同时避免被现有检测方法轻易识别。这种方法在多种RAG配置下表现出比先前攻击方法更高的真阳性率(TPR@1%FPR),并且每次文档推理的成本低于0.02美元。
链接: https://arxiv.org/abs/2502.00306
作者: Ali Naseh,Yuefeng Peng,Anshuman Suri,Harsh Chaudhari,Alina Oprea,Amir Houmansadr
机构: University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校); Northeastern University(东北大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to generate grounded responses by leveraging external knowledge databases without altering model parameters. Although the absence of weight tuning prevents leakage via model parameters, it introduces the risk of inference adversaries exploiting retrieved documents in the model’s context. Existing methods for membership inference and data extraction often rely on jailbreaking or carefully crafted unnatural queries, which can be easily detected or thwarted with query rewriting techniques common in RAG systems. In this work, we present Interrogation Attack (IA), a membership inference technique targeting documents in the RAG datastore. By crafting natural-text queries that are answerable only with the target document’s presence, our approach demonstrates successful inference with just 30 queries while remaining stealthy; straightforward detectors identify adversarial prompts from existing methods up to ~76x more frequently than those generated by our attack. We observe a 2x improvement in TPR@1%FPR over prior inference attacks across diverse RAG configurations, all while costing less than $0.02 per document inference.
zh
[NLP-130] DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning TACL
【速读】: 该论文旨在解决冷启动主动学习(Cold-start Active Learning, CSAL)中因忽略弱类别和困难代表性样本而导致的学习偏差问题。关键解决方案在于提出了一种名为双重多样性增强和不确定性感知(Dual-Diversity Enhancing and Uncertainty-Aware, DEUCE)的框架。DEUCE通过利用预训练语言模型(PLM)高效提取文本表示、类别预测及预测不确定性,并构建双重邻域图(Dual-Neighbor Graph, DNG)来结合文本多样性和类别多样性信息,确保数据分布平衡。此外,它通过基于密度的聚类传播不确定性信息,以选择困难代表性实例,从而实现类别均衡和信息丰富的样本选择。
链接: https://arxiv.org/abs/2502.00305
作者: Jiaxin Guo,C. L. Philip Chen,Shuzhen Li,Tong Zhang
机构: Guangdong Provincial Key Laboratory of Computational AI Models and Cognitive Intelligence (广东省计算智能模型与认知重点实验室), School of Computer Science and Engineering (计算机科学与工程学院), South China University of Technology (华南理工大学), Guangzhou, China (中国广州);
Pazhou Lab (琶洲实验室), Guangzhou, China (中国广州);
Engineering Research Center of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human (教育部健康智能感知与平行数字人工程研究中心), Guangzhou, China (中国广州)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 18 pages, 3 figures, 12 tables. Accepted manuscript by TACL. For published version by MIT Press, see this https URL
点击查看摘要
Abstract:Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation. It provides high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL. Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. DEUCE performs well in selecting class-balanced and hard representative data by dual-diversity and informativeness. Experiments on six NLP datasets demonstrate the superiority and efficiency of DEUCE.
zh
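DEUCE"兼顾不确定性与类别平衡"的选样目标,可用如下简化草图示意(数据与轮询策略均为本文假设;论文实际通过双重邻域图与基于密度的聚类实现,此处仅把目标本身具体化):

```python
# 简化示意:按预测类别轮询、类内按不确定性降序挑选,
# 从而得到既"难"(高不确定性)又类别均衡的标注候选。
def select_balanced(instances, budget):
    # instances: (id, predicted_class, uncertainty)
    by_class = {}
    for inst in sorted(instances, key=lambda x: x[2], reverse=True):
        by_class.setdefault(inst[1], []).append(inst)
    picked, classes = [], sorted(by_class)
    while len(picked) < budget and any(by_class.values()):
        for c in classes:  # 对预测类别做轮询(round-robin)
            if by_class[c] and len(picked) < budget:
                picked.append(by_class[c].pop(0)[0])
    return picked

pool = [("x1", "pos", 0.9), ("x2", "pos", 0.8), ("x3", "neg", 0.7), ("x4", "pos", 0.6)]
chosen = select_balanced(pool, budget=2)
```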
[NLP-131] Contextual Morphogenesis in Large Language Models: A Novel Approach to Self-Organizing Token Representations
【速读】: 该论文旨在解决传统分词策略在处理语言模型时因固定分词边界而无法动态调整以适应不断变化的上下文关系的问题。解决方案的关键在于引入上下文形态发生机制(Contextual Morphogenesis),这一机制通过自我组织的方式基于学习到的上下文依赖关系重新构建分词边界,从而允许嵌入表示在迭代处理过程中逐步进化。这种方法不仅降低了困惑度(perplexity),还保持了表征稳定性,特别是在语言结构复杂的领域中表现出色。
链接: https://arxiv.org/abs/2502.00301
作者: Alistair Dombrowski,Beatrix Engelhardt,Dimitri Fairbrother,Henry Evidail
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Token representations influence the efficiency and adaptability of language models, yet conventional tokenization strategies impose rigid segmentation boundaries that do not adjust dynamically to evolving contextual relationships. The introduction of contextual morphogenesis establishes a self-organizing mechanism that restructures token boundaries based on learned contextual dependencies, allowing embeddings to evolve progressively across iterative processing steps. Empirical evaluations demonstrate that dynamically adjusted tokenization contributes to reductions in perplexity while maintaining representational stability, particularly in linguistically complex domains where static segmentation fails to capture nuanced dependencies. Computational trade-offs associated with self-organizing token structures indicate that additional processing overhead remains within feasible limits, provided that optimization strategies account for segmentation update efficiency. Comparative assessments across different linguistic corpora suggest that adaptive tokenization preserves interpretability while improving alignment with contextual cues, reinforcing the potential of morphogenetic segmentation mechanisms to refine predictive accuracy. Stability analyses confirm that evolving token structures maintain consistent segmentation behaviors across varied text distributions, ensuring that representational adaptations remain linguistically coherent. The effectiveness of contextual morphogenesis in refining structural stability and predictive performance highlights its viability as an alternative to traditional tokenization methods. Further analysis of computational efficiency considerations suggests that hybrid strategies integrating both static and dynamic segmentation techniques may offer a balanced approach to optimizing representational flexibility while maintaining inference efficiency.
zh
[NLP-132] ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
【速读】: 该论文旨在解决在使用大型语言模型(Large Language Models, LLMs)进行长上下文推理时内存成本高的问题。现有方法主要关注于压缩不同标记(tokens)的关键值(KV)缓存,但这些方法单独衡量标记的重要性,忽略了实际语言特性中不同标记之间的依赖关系。为了解决这一问题,论文提出ChunkKV方案,将标记分组为基本压缩单元,并保留最具信息量的语义片段,同时舍弃较不重要的部分。关键创新在于引入层间索引重用机制,以进一步减少计算开销。实验结果显示,ChunkKV在多种基准测试中实现了最高达10%的性能提升,尤其是在指令调优和多步推理(O1和R1)的LLMs中,与现有方法相比,在高压缩比下具有显著优势。
链接: https://arxiv.org/abs/2502.00299
作者: Xiang Liu,Zhenheng Tang,Peijie Dong,Zeyu Li,Bo Li,Xuming Hu,Xiaowen Chu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 35 pages
点击查看摘要
Abstract:To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that the previous KV cache compression methods measure token importance individually, neglecting the dependency between different tokens in the real-world language characteristics. In light of this, we introduce ChunkKV, grouping the tokens in a chunk as a basic compressing unit, and retaining the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluated ChunkKV on cutting-edge long-context benchmarks including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmark. Our experiments with instruction tuning and multi-step reasoning (O1 and R1) LLMs achieve up to 10% performance improvement under aggressive compression ratios compared to existing methods.
zh
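以 chunk 为基本压缩单元的选择逻辑可用如下草图示意:将逐 token 重要性分数按固定大小分块求和,仅保留得分最高的若干块对应的 token 索引(分数为虚构示例;实际系统通常由注意力统计量导出,此处只展示"按块保留"这一机制):

```python
# 示意:chunk 级 KV 缓存选择——按块汇总重要性,保留 top-k 块。
def select_chunks(scores, chunk_size, keep_chunks):
    chunks = [scores[i:i + chunk_size] for i in range(0, len(scores), chunk_size)]
    ranked = sorted(range(len(chunks)), key=lambda c: sum(chunks[c]), reverse=True)
    kept = sorted(ranked[:keep_chunks])
    # 将保留的块编号展开回 KV 缓存中的 token 索引
    return [i for c in kept
            for i in range(c * chunk_size, min((c + 1) * chunk_size, len(scores)))]

scores = [0.1, 0.2, 0.9, 0.8, 0.05, 0.05, 0.7, 0.6]  # 虚构的逐 token 重要性
kept_tokens = select_chunks(scores, chunk_size=2, keep_chunks=2)
```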
[NLP-133] Estimating LLM Uncertainty with Logits
【速读】: 该论文旨在解决大型语言模型(LLMs)在生成响应时容易出现幻觉的问题,即产生不可靠的回答。为应对这一挑战,论文提出了一种名为Logits-induced Token Uncertainty (LogU)的新框架,该框架能够实时估计LLMs中特定标记的不确定性,而无需多次采样。LogU的关键在于利用证据建模来实现标记级别不确定性的评估,从而指导下游任务。实验结果表明,LogU在减轻模型幻觉方面具有显著效果和潜力,标志着在解决模型幻觉问题上的重要进展。
链接: https://arxiv.org/abs/2502.00290
作者: Huan Ma,Jingdong Chen,Guangyu Wang,Changqing Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In recent years, Large Language Models (LLMs) have seen remarkable advancements and have been extensively integrated across various fields. Despite their progress, LLMs are prone to hallucinations, producing responses that may not be dependable if the models lack sufficient grounding knowledge. To mitigate this issue, methods for estimating uncertainty have been adopted, with a focus on critical tokens as indicators of reliability. Nevertheless, probability-based approaches have shown limitations in assessing token-level reliability due to the erosion of evidence strength information acquired during training. In this paper, we introduce Logits-induced Token Uncertainty (LogU), a novel framework designed to estimate token-specific uncertainty in LLMs in real time, without the need for multiple sampling rounds. By leveraging evidence modeling for the implementation of LogU, we utilize the derived uncertainty measures to steer downstream tasks. Our experimental findings highlight the substantial effectiveness and potential of LogU, marking a significant advancement in addressing the challenge of model hallucinations.
zh
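基于 logits 的证据式不确定性可以用一种常见的证据建模公式示意(注意:这只是证据深度学习中的一种标准写法,未必是 LogU 的精确估计量):将 exp(logit) 视为各类别证据,令 u = K / (K + 总证据),logits 平坦时不确定性高、单一 logit 主导时不确定性低,且只需一次前向计算、无需多次采样:

```python
# 证据式不确定性草图(标准 Dirichlet 证据写法,非 LogU 原公式)。
import math

def evidential_uncertainty(logits) -> float:
    k = len(logits)
    evidence = [math.exp(min(z, 30.0)) for z in logits]  # 截断防溢出
    return k / (k + sum(evidence))

confident = evidential_uncertainty([10.0, 0.0, 0.0])  # 单一 logit 主导
uncertain = evidential_uncertainty([0.0, 0.0, 0.0])   # 平坦分布
```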
[NLP-134] Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning
【速读】: 该论文旨在解决大规模语言模型(LLMs)在多步推理任务中的性能限制问题。尽管验证器引导搜索(Verifier-guided search)在有限样本情况下优于重复采样(repeated sampling),但随着样本量增加,其优势逐渐减弱并最终表现不如重复采样。论文指出,这一现象主要归因于验证器(verifiers)的失效,即不完美的验证器错误地排序候选路径并剪枝所有有效的推理路径。为了缓解验证器失效的问题,作者探索减少对验证器的依赖,并通过两种简单方法进行了初步研究。论文的关键在于揭示了验证器引导搜索的根本局限性,并提出了未来的研究方向。
链接: https://arxiv.org/abs/2502.00271
作者: Fei Yu,Yingru Li,Benyou Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) struggle with multi-step reasoning, where inference-time scaling has emerged as a promising strategy for performance improvement. Verifier-guided search outperforms repeated sampling when sample size is limited by selecting and prioritizing valid reasoning paths. However, we identify a critical limitation: scaling flaws, prevalent across different models (Mistral 7B and DeepSeekMath 7B), benchmarks (GSM8K and MATH), and verifiers (outcome value models and process reward models). As sample size increases, verifier-guided search exhibits diminishing advantages and eventually underperforms repeated sampling. Our analysis attributes this to verifier failures, where imperfect verifiers misrank candidates and erroneously prune all valid paths. These issues are further exacerbated in challenging and out-of-distribution problems, restricting search effectiveness. To mitigate verifier failures, we explore reducing reliance on verifiers and conduct preliminary investigations using two simple methods. Our findings reveal fundamental limitations in verifier-guided search and suggest future directions.
zh
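结合上文结论,下面用一段极简的 Python 模拟示意摘要所述的 misrank 机制:当验证器不完美(对错误候选偶尔打出虚高分)时,top-1 的验证器引导选择可能被单个高分错误候选误导,而重复采样加多数投票不受单个异常分数影响。候选答案与分数分布均为假设数据,并非论文的实验设置。

```python
import random

def verifier_select(candidates, scores):
    """验证器引导:返回验证器分数最高的候选(top-1)。"""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]

def majority_vote(candidates):
    """重复采样基线:对候选答案做多数投票。"""
    counts = {}
    for c in candidates:
        counts[c] = counts.get(c, 0) + 1
    return max(counts, key=counts.get)

random.seed(0)
# 假设数据:正确答案 "42" 占 10/16;不完美验证器对正确路径给
# 稳定中高分,对错误路径给高方差分数,偶尔打出虚高分(misrank)。
candidates = ["42"] * 10 + ["41"] * 6
scores = ([0.6 + random.gauss(0, 0.1) for _ in range(10)] +
          [0.5 + random.gauss(0, 0.3) for _ in range(6)])

picked_by_verifier = verifier_select(candidates, scores)
picked_by_vote = majority_vote(candidates)
```

多数投票随样本量增大收敛到众数,而 top-1 选择的出错概率随错误候选数量增多而上升,这正是摘要所述验证器引导搜索的"scaling flaws"。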
[NLP-135] ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLM s
【速读】: 该论文旨在解决大型语言模型(LLMs)在自然语言处理任务中表现卓越但因规模庞大导致服务效率低下和成本高昂的问题。解决方案的关键在于提出了一种名为ProxSparse的学习型框架,用于通过正则化优化实现掩码选择。ProxSparse将刚性的、不可微的掩码选择过程转化为一个平滑的优化过程,允许灵活的渐进式掩码探索,并且在确定掩码后不再涉及额外的权重更新。这克服了现有半结构化剪枝方法仅依赖局部、逐层优化及启发式规则而未能充分利用全局反馈的局限性。
链接: https://arxiv.org/abs/2502.00258
作者: Hongyi Liu,Rajarshi Saha,Zhen Jia,Youngsuk Park,Jiaji Huang,Shoham Sabach,Yu-Xiang Wang,George Karypis
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated exceptional performance in natural language processing tasks, yet their massive size makes serving them inefficient and costly. Semi-structured pruning has emerged as an effective method for model acceleration, but existing approaches are suboptimal because they focus on local, layer-wise optimizations using heuristic rules, failing to leverage global feedback. We present ProxSparse, a learning-based framework for mask selection enabled by regularized optimization. ProxSparse transforms the rigid, non-differentiable mask selection process into a smoother optimization procedure, allowing gradual mask exploration with flexibility. ProxSparse does not involve additional weight updates once the mask is determined. Our extensive evaluations on 7 widely used models show that ProxSparse consistently outperforms previously proposed semi-structured mask selection methods with significant improvement, demonstrating the effectiveness of our learned approach towards semi-structured pruning.
zh
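作为背景,下面示意论文所要改进的局部幅值启发式基线:2:4 半结构化剪枝要求每 4 个连续权重中恰好保留 2 个。ProxSparse 本身是通过正则化优化来学习掩码,此处的按组取幅值 top-2 仅为说明掩码形式的假设性示意。

```python
def mask_2_of_4(weights):
    """2:4 半结构化剪枝的局部幅值启发式:每 4 个权重为一组,
    保留绝对值最大的 2 个(掩码置 1),其余置 0。"""
    assert len(weights) % 4 == 0
    mask = [0] * len(weights)
    for g in range(0, len(weights), 4):
        group = list(range(g, g + 4))
        # 组内按 |w| 降序,取前两个位置
        keep = sorted(group, key=lambda i: abs(weights[i]), reverse=True)[:2]
        for i in keep:
            mask[i] = 1
    return mask

w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
m = mask_2_of_4(w)
pruned = [wi * mi for wi, mi in zip(w, m)]
```

这种逐组局部决策正是摘要所指"仅依赖局部、逐层优化及启发式规则"的做法;ProxSparse 用可微的正则化目标替代它,以引入全局反馈。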
[NLP-136] Context-Preserving Tensorial Reconfiguration in Large Language Model Training
【速读】: 该论文旨在解决神经架构在处理长距离依赖时因计算限制和低效上下文保留机制所面临的核心挑战。解决方案的关键在于引入了一种名为Context-Preserving Tensorial Reconfiguration (CPTR)的新方法,通过结构化分解和自适应收缩实现权重张量的动态重组,从而增强上下文整合,同时不增加显著的计算负担。
链接: https://arxiv.org/abs/2502.00246
作者: Larin Tonix,Morgana Baskerville,Nathaniel Stourton,Ophelia Tattershall
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Handling long-range dependencies in neural architectures has remained a persistent challenge due to computational limitations and inefficient contextual retention mechanisms. Tensorial operations have provided a foundation for restructuring model representations, yet conventional architectures have struggled to incorporate such techniques without introducing excessive complexity. A novel approach, Context-Preserving Tensorial Reconfiguration (CPTR), enables dynamic reorganization of weight tensors through structured factorization and adaptive contraction, allowing for enhanced contextual integration without substantial computational overhead. Empirical evaluations demonstrate that CPTR improves coherence retention across extended sequences, leading to measurable reductions in perplexity and improved recall accuracy for long-context tasks. Performance comparisons reveal that CPTR-enhanced models exhibit greater computational efficiency and reduced memory consumption while maintaining competitive language generation fluency and accuracy. Gradient stability metrics further validate the improved training efficiency, revealing more controlled variance in weight updates. Comparative studies across baseline and CPTR-enhanced models confirm that tensorial reconfiguration contributes to more stable and computationally efficient language modeling. The findings support the potential of CPTR in refining contemporary neural architectures for tasks requiring long-range contextual understanding and efficient memory utilization.
zh
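摘要中的"结构化分解"可以借最常见的低秩分解特例来直观理解:把权重矩阵 W 近似为两个小因子 A、B 的乘积,运算时先乘 B 再乘 A,从而在不显式构造 W 的情况下降低参数量。以下纯 Python 片段为假设性示意,与论文的具体张量重构方式无关。

```python
def matvec(M, v):
    """朴素矩阵-向量乘法。"""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def factored_apply(A, B, v):
    """低秩分解 W ≈ A·B 的应用:先用 B 把输入压到秩 r 子空间,
    再用 A 映回输出维度,避免显式构造完整的 W。"""
    return matvec(A, matvec(B, v))

# 参数量对比:d_out x d_in 的满秩矩阵 vs 秩 r 的两因子
d_in, d_out, r = 4096, 4096, 64
full_params = d_out * d_in                 # 16,777,216
factored_params = r * (d_out + d_in)       # 524,288

# 小例子:A (2x2) 与 B (2x3) 合成等效的 W (2x3)
A = [[2.0, 0.0], [0.0, 3.0]]
B = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
y = factored_apply(A, B, [1.0, 1.0, 1.0])
```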
[NLP-137] Mordal: Automated Pretrained Model Selection for Vision Language Models
【速读】: 该论文旨在解决自动化创建针对特定任务的视觉语言模型(Vision Language Models, VLMs)的问题。目前,尽管已有多种VLM在不同基准测试中展示了出色的视觉能力,但这些模型均是由人类专家手工设计的,缺乏自动化的框架来生成任务专用的多模态模型。论文的关键解决方案是引入Mordal,一个自动化多模态模型搜索框架,通过减少搜索过程中需要考虑的候选模型数量以及缩短每个剩余候选模型的评估时间,高效地找到最适合用户定义任务的VLM,相比网格搜索,Mordal可降低高达8.9到11.6倍的GPU小时数。此外,在评估过程中,还发现了性能超越现有最先进水平的新VLM。
链接: https://arxiv.org/abs/2502.00241
作者: Shiqi He,Insu Jang,Mosharaf Chowdhury
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using up to 8.9\times – 11.6\times lower GPU hours than grid search. In the process of our evaluation, we have also discovered new VLMs that outperform their state-of-the-art counterparts.
zh
[NLP-138] Should You Use Your Large Language Model to Explore or Exploit?
【速读】: 该论文旨在评估当前大型语言模型(Large Language Models, LLMs)在面对探索-利用权衡的决策任务中的有效性。研究通过在各种上下文多臂老虎机(Contextual Bandit Tasks)任务中让LLMs独立进行探索和利用来实现这一目标。研究的关键发现是,尽管LLMs在利用方面常常表现不佳,但可以通过上下文化的缓解措施显著提升其在小规模任务中的性能。然而,即使如此,LLMs的表现仍不如简单的线性回归模型。另一方面,研究还发现LLMs在处理具有内在语义的大规模动作空间的探索任务中表现出优势,能够建议合适的探索候选对象。因此,该研究的关键解决方案在于探索如何利用LLMs在大规模动作空间探索方面的潜力,并通过上下文化方法改善其在小规模任务中的利用能力。
链接: https://arxiv.org/abs/2502.00225
作者: Keegan Harris,Aleksandrs Slivkins
机构: Carnegie Mellon University (卡内基梅隆大学); Microsoft Research (微软研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff. We use LLMs to explore and exploit in silos in various (contextual) bandit tasks. We find that while the current LLMs often struggle to exploit, in-context mitigations may be used to substantially improve performance for small-scale tasks. However even then, LLMs perform worse than a simple linear regression. On the other hand, we find that LLMs do help at exploring large action spaces with inherent semantics, by suggesting suitable candidates to explore.
zh
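作为对照,探索-利用权衡本身可以用经典的 epsilon-贪心老虎机算法来体会:以小概率随机探索,否则利用当前估计均值最高的臂。以下为纯 Python 示意,臂的真实均值为假设数据,仅说明摘要中简单基线所处理的问题形式。

```python
import random

def epsilon_greedy(true_means, steps=2000, eps=0.1, seed=0):
    """epsilon-贪心多臂老虎机:以 eps 概率探索,否则利用。"""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    est = [0.0] * n          # 各臂的经验均值估计
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(n)                      # 探索
        else:
            a = max(range(n), key=lambda i: est[i])   # 利用
        r = true_means[a] + rng.gauss(0, 0.1)         # 带噪声的奖励
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]            # 增量更新均值
        total += r
    return est, total / steps

est, avg_reward = epsilon_greedy([0.2, 0.5, 0.8])
```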
[NLP-139] Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment
【速读】: 该论文旨在解决大型语言模型(LLM)对齐算法领域复杂且碎片化的问题,当前该领域对不同方法的有效性及其相互关系缺乏清晰理解。论文的关键解决方案是提出奖励感知偏好优化(Reward-Aware Preference Optimization, RPO)框架,该框架统一了包括DPO、IPO、SimPO和REINFORCE(LOO)在内的流行偏好优化技术。RPO提供了一种结构化的方法,用于解析和系统地研究各种设计选择(如优化目标、每个提示的响应数量以及隐式与显式奖励模型的使用)对LLM偏好优化的影响,并进一步提出了新的实验设置以清晰直接地消解这些设计选择的影响。通过在RPO框架内进行广泛的消融研究,论文揭示了影响模型对齐的关键因素,提供了改善LLM对齐的有效策略的实际指导。
链接: https://arxiv.org/abs/2502.00203
作者: Shengyang Sun,Yian Zhang,Alexander Bukharin,David Mosallanezhad,Jiaqi Zeng,Soumye Singhal,Gerald Shen,Adi Renduchintala,Tugrul Konuk,Yi Dong,Zhilin Wang,Dmitry Chichkov,Olivier Delalleau,Oleksii Kuchaiev
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 8 pages, 4 figures
点击查看摘要
Abstract:The rapid development of large language model (LLM) alignment algorithms has resulted in a complex and fragmented landscape, with limited clarity on the effectiveness of different methods and their inter-connections. This paper introduces Reward-Aware Preference Optimization (RPO), a mathematical framework that unifies popular preference optimization techniques in LLM alignment, including DPO, IPO, SimPO, and REINFORCE (LOO), among others. RPO provides a structured approach to disentangle and systematically study the impact of various design choices, such as the optimization objective, the number of responses per prompt, and the use of implicit versus explicit reward models, on LLM preference optimization. We additionally propose a new experimental setup that enables the clean and direct ablation of such design choices. Through an extensive series of ablation studies within the RPO framework, we gain insights into the critical factors shaping model alignment, offering practical guidance on the most effective strategies for improving LLM alignment.
zh
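RPO 所统一的方法中,DPO 的逐样本损失形式是公开已知的,可以写成如下示意:隐式奖励为策略与参考模型对数概率之差乘以 beta,损失鼓励被偏好回复 (w) 的隐式奖励高于被拒绝回复 (l)。下面的对数概率数值为假设输入,仅用于说明损失的行为。

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO 逐样本损失:-log sigmoid(beta * (隐式奖励差))。"""
    reward_w = beta * (logp_w - ref_logp_w)   # chosen 的隐式奖励
    reward_l = beta * (logp_l - ref_logp_l)   # rejected 的隐式奖励
    margin = reward_w - reward_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# 策略相对参考模型更偏好 w 时,损失应低于无偏好时的 log 2
loss_aligned = dpo_loss(logp_w=-1.0, logp_l=-3.0,
                        ref_logp_w=-2.0, ref_logp_l=-2.0)
loss_neutral = dpo_loss(logp_w=-2.0, logp_l=-2.0,
                        ref_logp_w=-2.0, ref_logp_l=-2.0)
```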
[NLP-140] Fairshare Data Pricing for Large Language Models
【速读】: 该论文旨在解决数据市场中不公平定价导致的数据买家(如大型语言模型 LLM 的构建者)和卖家(如人类标注员)参与度降低的问题,这会减少数据的数量和质量。论文的关键解决方案是提出了一种公平份额定价框架(Fairshare Pricing Framework),该框架利用数据估值方法来量化训练数据对 LLM 的贡献,并据此设定价格。通过该框架,买家依据数据估值做出购买决策,而卖家则基于预期买家购买量最大化其利润。此框架理论证明了定价与数据估值及买家预算紧密相关,对买卖双方都是最优的。通过使用当前 LLM 和数据集(包括数学问题、医学诊断和物理推理)进行市场模拟,验证了该框架能够确保买家以反映模型训练价值的方式购买数据,从而提高每美元投入数据所带来的 LLM 任务性能,并确保卖家以最优价格出售数据。
链接: https://arxiv.org/abs/2502.00198
作者: Luyang Zhang,Cathy Jiao,Beibei Li,Chenyan Xiong
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Training data is a pivotal resource for building large language models (LLMs), but unfair pricing in data markets poses a serious challenge for both data buyers (e.g., LLM builders) and sellers (e.g., human annotators), which discourages market participation, reducing data quantity and quality. In this paper, we propose a fairshare pricing framework that sets training data prices using data valuation methods to quantify their contribution to LLMs. In our framework, buyers make purchasing decisions using data valuation and sellers set prices to maximize their profits based on the anticipated buyer purchases. We theoretically show that pricing derived from our framework is tightly linked to data valuation and buyers’ budget, optimal for both buyers and sellers. Through market simulations using current LLMs and datasets (math problems, medical diagnosis, and physical reasoning), we show that our framework is fairshare for buyers by ensuring their purchased data is reflective of model training value, leading to higher LLM task performances per-dollar spent on data, and fairshare for sellers by ensuring they sell their data at optimal prices. Our framework lays the foundation for future research on equitable and sustainable data markets for large-scale AI.
zh
[NLP-141] DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets
【速读】: 该论文旨在解决在皮肤科领域开发视觉大语言模型(Vision LLMs)所面临的大型图像-文本配对数据集缺乏的问题。解决方案的关键在于引入DermaSynth数据集,该数据集包含92,020个合成的图像-文本对,源自45,205张临床和皮肤镜图像,并通过先进的大语言模型(LLMs),使用Gemini 2.0和自指导方法生成多样且丰富的合成文本。通过将数据集的元数据纳入输入提示,以减少潜在的幻觉现象,从而构建出基于开放访问皮肤科图像存储库的高质量数据集。此外,还初步微调了一个名为DermatoLlama 1.0的模型。
链接: https://arxiv.org/abs/2502.00196
作者: Abdurrahim Yilmaz,Furkan Yuceyalcin,Ece Gokyayla,Donghee Choi,Ozan Erdem Ali Anil Demircali,Rahmetullah Varol,Ufuk Gorkem Kirabali,Gulsum Gencoglan,Joram M. Posma,Burak Temelkuran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 4 figures
点击查看摘要
Abstract:A major barrier to developing vision large language models (LLMs) in dermatology is the lack of large image–text pairs dataset. We introduce DermaSynth, a dataset comprising of 92,020 synthetic image–text pairs curated from 45,205 images (13,568 clinical and 35,561 dermatoscopic) for dermatology-related clinical tasks. Leveraging state-of-the-art LLMs, using Gemini 2.0, we used clinically related prompts and self-instruct method to generate diverse and rich synthetic texts. Metadata of the datasets were incorporated into the input prompts by targeting to reduce potential hallucinations. The resulting dataset builds upon open access dermatological image repositories (DERM12345, BCN20000, PAD-UFES-20, SCIN, and HIBA) that have permissive CC-BY-4.0 licenses. We also fine-tuned a preliminary Llama-3.2-11B-Vision-Instruct model, DermatoLlama 1.0, on 5,000 samples. We anticipate this dataset to support and accelerate AI research in dermatology. Data and code underlying this work are accessible at this https URL.
zh
[NLP-142] Resolving Editing-Unlearning Conflicts: A Knowledge Codebook Framework for Large Language Model Updating
【速读】: 该论文旨在解决大型语言模型(LLMs)在更新过程中存在的两个主要问题:知识存储的有效性不足(包括过于稀疏或过于密集)以及编辑与遗忘任务之间的冲突。论文提出的关键解决方案是LOKA框架,它基于知识代码本,通过多记忆代码本存储更新的知识,并利用相似度感知的知识映射确保相关知识片段被聚类到同一内存中。此外,LOKA通过任务特定和多任务记忆,以及由冲突评分引导的方法来解决任务冲突。在推理阶段,LOKA从代码本中检索最相关的记忆并将其插入原始LLM以应用更新的知识,从而提高知识利用率。
链接: https://arxiv.org/abs/2502.00158
作者: Binchi Zhang,Zhengzhang Chen,Zaiyi Zheng,Jundong Li,Haifeng Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) excel in natural language processing by encoding extensive human knowledge, but their utility relies on timely updates as knowledge evolves. Updating LLMs involves two key tasks simultaneously: unlearning to remove unwanted knowledge and editing to incorporate new information. Existing methods face two major challenges: ineffective knowledge storage (either too sparse or too dense) and task conflicts between editing and unlearning, as validated through our theoretical and experimental results. To address these issues, we propose LOKA, a conflict-free framework for LLM updating based on a knowledge codebook. During training, updated knowledge is stored in multiple codebook memories. To optimize knowledge storage, a similarity-aware knowledge mapping ensures that related knowledge pieces are clustered and allocated to the same memory. Additionally, LOKA resolves task conflicts by employing task-specific and multi-task memories guided by a conflict score. In the inference stage, LOKA retrieves the most relevant memory from the codebook and plugs it into the original LLM to apply the updated knowledge. A learning-based router controls codebook activation to further improve knowledge utilization. Extensive experiments demonstrate the effectiveness of LOKA in LLM knowledge updating tasks.
zh
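推理阶段"从代码本中检索最相关记忆"的思想,可以用余弦相似度检索来示意。下面的记忆名称、键向量与查询向量均为假设数据,仅说明检索机制,并非 LOKA 的实现细节。

```python
import math

def cosine(a, b):
    """两向量的余弦相似度。"""
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a)) *
           math.sqrt(sum(y * y for y in b)))
    return num / den

def retrieve_memory(query, codebook):
    """按与查询表示的余弦相似度,检索代码本中最相关的记忆。"""
    return max(codebook, key=lambda name: cosine(query, codebook[name]))

codebook = {
    "edit_mem":    [1.0, 0.1, 0.0],   # 存放编辑(新知识)的记忆
    "unlearn_mem": [0.0, 0.2, 1.0],   # 存放遗忘任务的记忆
}
query = [0.9, 0.0, 0.1]
best = retrieve_memory(query, codebook)
```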
[NLP-143] A Three-Branch Checks-and-Balances Framework for Context-Aware Ethical Alignment of Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在伦理对齐方面的局限性,特别是通过强化学习基于人类反馈(Reinforcement Learning from Human Feedback, RLHF)方法所存在的问题。论文的关键在于提出一个三分支制衡框架,包含知识生成(LLMs作为执行机构)、伦理规范设定(DIKE作为立法机构)以及情境解读(ERIS作为司法机构)。这一架构通过可解释、可适应且文化敏感的伦理推理机制,解决了现有方法的不足。
链接: https://arxiv.org/abs/2502.00136
作者: Edward Y. Chang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 tables, 6 figures. arXiv admin note: substantial text overlap with arXiv:2405.07076
点击查看摘要
Abstract:This paper introduces a three-branch checks-and-balances framework for ethical alignment of Large Language Models (LLMs), inspired by governmental systems. It implements three independent yet interacting components: LLMs as the executive branch for knowledge generation, DIKE as the legislative branch establishing ethical guardrails, and ERIS as the judicial branch for contextual interpretation. The adversarial DIKE-ERIS duality enables adaptation to diverse cultural contexts while upholding consistent ethical principles. This architecture addresses limitations of reinforcement learning with human feedback (RLHF) by providing interpretable, adaptable, and culturally-aware ethical reasoning. Through self-supervised learning and adversarial testing, our framework demonstrates how emotional modeling can guide linguistic behaviors toward ethical outcomes while preserving independence across knowledge generation, ethical oversight, and contextual interpretation.
zh
[NLP-144] Sparse Autoencoder Insights on Voice Embeddings
【速读】: 该论文旨在探索稀疏自编码器在从密集编码嵌入中提取单义特征方面的有效性,尤其关注非文本嵌入数据。关键解决方案在于应用稀疏自编码器于源自Titanet模型的说话者嵌入(Speaker Embeddings),从而成功识别并操纵如语言和音乐等在原始嵌入中不明显的特征。实验结果表明,所提取的特征与大型语言模型 (LLM) 嵌入中的特征相似,包括特征分割和调节。这表明稀疏自编码器可以成为理解与解释多个领域(包括基于音频的说话者识别)中嵌入数据的重要工具。
链接: https://arxiv.org/abs/2502.00127
作者: Daniel Pluth,Yu Zhou,Vijay K. Gurbani
机构: Vail Systems, Inc. (维尔斯系统公司)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advances in explainable machine learning have highlighted the potential of sparse autoencoders in uncovering mono-semantic features in densely encoded embeddings. While most research has focused on Large Language Model (LLM) embeddings, the applicability of this technique to other domains remains largely unexplored. This study applies sparse autoencoders to speaker embeddings generated from a Titanet model, demonstrating the effectiveness of this technique in extracting mono-semantic features from non-textual embedded data. The results show that the extracted features exhibit characteristics similar to those found in LLM embeddings, including feature splitting and steering. The analysis reveals that the autoencoder can identify and manipulate features such as language and music, which are not evident in the original embedding. The findings suggest that sparse autoencoders can be a valuable tool for understanding and interpreting embedded data in many domains, including audio-based speaker recognition.
zh
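稀疏自编码器提取单义特征的基本机制(过完备 ReLU 编码、线性解码、L1 稀疏惩罚)可以用如下前向计算示意。权重为手工设定的假设值,并非论文在 Titanet 说话者嵌入上训练得到的模型。

```python
def relu(x):
    return [max(0.0, v) for v in x]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sae_forward(x, W_enc, W_dec, l1=0.01):
    """稀疏自编码器前向:过完备 ReLU 编码 + 线性解码,
    损失 = 重建误差 + L1 稀疏惩罚,促使隐藏单元单义化。"""
    h = relu(matvec(W_enc, x))              # 稀疏特征激活
    x_hat = matvec(W_dec, h)                # 重建
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in h)
    return h, recon + l1 * sparsity

# 2 维"嵌入" -> 4 个过完备特征方向
W_enc = [[1, 0], [0, 1], [-1, 0], [0, -1]]
W_dec = [[1, 0, -1, 0], [0, 1, 0, -1]]
h, loss = sae_forward([0.8, -0.3], W_enc, W_dec)
```

此例中输入被完美重建(重建误差为 0),损失只剩 L1 项;4 个隐藏单元里只有 2 个激活,体现了"少数特征解释一个嵌入"的稀疏性目标。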
[NLP-145] AIN: The Arabic INclusive Large Multimodal Model ACL
【速读】: 该论文旨在解决阿拉伯语大型多模态模型(Arabic LMMs)研究不足的问题。解决方案的关键在于引入AIN(阿拉伯包容性多模态模型),这是一个双语(英语-阿拉伯语)的大型多模态模型,利用精心构建的360万高质量英阿多模态数据样本进行训练。AIN展示了在阿拉伯语处理方面的最先进性能,并且具备强大的英语视觉理解能力:在涵盖多图像理解、复杂视觉感知、手写文档理解、视频理解、医学影像分析、植物病害识别以及基于遥感的土地使用理解等38个子领域的CAMEL-Bench基准上,其7B模型在八个领域上以平均3.4%的绝对增益超越GPT-4o。AIN的卓越能力使其成为向阿拉伯语用户提供先进多模态生成式AI工具的重要进展。
链接: https://arxiv.org/abs/2502.00094
作者: Ahmed Heakl,Sara Ghaboura,Omkar Thawkar,Fahad Shahbaz Khan,Hisham Cholakkal,Rao Muhammad Anwer,Salman Khan
机构: Mohamed bin Zayed University of AI(穆罕默德·本·扎耶德人工智能大学); Linköping University(林雪平大学); Aalto University(阿尔托大学); Australian National University(澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 20 pages, 16 figures, ACL
点击查看摘要
Abstract:Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN-the Arabic Inclusive Multimodal Model-designed to excel across diverse domains. AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic, leveraging carefully constructed 3.6 million high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark comprising 38 sub-domains including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, our AIN demonstrates strong performance with the 7B model outperforming GPT-4o by an absolute gain of 3.4% averaged over eight domains and 38 sub-domains. AIN’s superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.
zh
[NLP-146] Disambiguating Numeral Sequences to Decipher Ancient Accounting Corpora
【速读】: 该论文旨在解决古代半释读的楔形文字——原始埃兰语(Proto-Elamite, PE)书写系统中数值记录的歧义问题。论文的关键在于提出了一种算法来提取每种子数值表示的可能读法列表,并贡献了两种基于文档结构特性的消歧方法以及通过自助法(bootstrapping algorithm)训练的分类器。此外,论文还提供了一个测试集用于评估消歧技术,并提出了一种新颖的谨慎规则选择方法以优化自助法分类器。这些方法有助于确认关于该书写系统的已有直觉,并揭示了泥板内容与数值大小之间的新关联。
链接: https://arxiv.org/abs/2502.00090
作者: Logan Born,M. Willis Monroe,Kathryn Kelley,Anoop Sarkar
机构: Simon Fraser University(西蒙弗雷泽大学); University of British Columbia(不列颠哥伦比亚大学); Università di Bologna(博洛尼亚大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:A numeration system encodes abstract numeric quantities as concrete strings of written characters. The numeration systems used by modern scripts tend to be precise and unambiguous, but this was not so for the ancient and partially-deciphered proto-Elamite (PE) script, where written numerals can have up to four distinct readings depending on the system that is used to read them. We consider the task of disambiguating between these readings in order to determine the values of the numeric quantities recorded in this corpus. We algorithmically extract a list of possible readings for each PE numeral notation, and contribute two disambiguation techniques based on structural properties of the original documents and classifiers learned with the bootstrapping algorithm. We also contribute a test set for evaluating disambiguation techniques, as well as a novel approach to cautious rule selection for bootstrapped classifiers. Our analysis confirms existing intuitions about this script and reveals previously-unknown correlations between tablet content and numeral magnitude. This work is crucial to understanding and deciphering PE, as the corpus is heavily accounting-focused and contains many more numeric tokens than tokens of text.
zh
[NLP-147] Ensembles of Low-Rank Expert Adapters ICLR2025
【速读】: 该论文旨在解决大型语言模型(LLMs)在多源异构数据训练和微调过程中因梯度方向冲突导致的优化困难和性能下降问题,进而影响模型在不同任务中的泛化能力。关键解决方案在于提出了一种名为Ensembles of Low-Rank Expert Adapters (ELREA) 的框架,通过基于梯度方向对训练指令进行聚类,减少优化过程中的冲突,并利用低秩适应(LoRA)技术训练专家适配器,确保高效且可扩展的训练。在推理阶段,ELREA 根据输入数据与训练聚类的梯度相似性,选择最相关的专家适配器进行预测,从而实现每个任务的最佳适配器选择。
链接: https://arxiv.org/abs/2502.00089
作者: Yinghao Li,Vianne Gao,Chao Zhang,MohamadAli Torkamani
机构: Amazon Web Service(亚马逊网络服务); Amazon.com(亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, 5 figures, 5 tables; proceedings in ICLR 2025
点击查看摘要
Abstract:The training and fine-tuning of large language models (LLMs) often involve diverse textual data from multiple sources, which poses challenges due to conflicting gradient directions, hindering optimization and specialization. These challenges can undermine model generalization across tasks, resulting in reduced downstream performance. Recent research suggests that fine-tuning LLMs on carefully selected, task-specific subsets of data can match or even surpass the performance of using the entire dataset. Building on these insights, we propose the Ensembles of Low-Rank Expert Adapters (ELREA) framework to improve the model’s capability to handle diverse tasks. ELREA clusters the training instructions based on their gradient directions, representing different areas of expertise and thereby reducing conflicts during optimization. Expert adapters are then trained on these clusters, utilizing the low-rank adaptation (LoRA) technique to ensure training efficiency and model scalability. During inference, ELREA combines predictions from the most relevant expert adapters based on the input data’s gradient similarity to the training clusters, ensuring optimal adapter selection for each task. Experiments show that our method outperforms baseline LoRA adapters trained on the full dataset and other ensemble approaches with similar training and inference complexity across a range of domain-specific tasks.
zh
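推理时"按梯度相似度组合专家适配器"的思想可示意如下:对输入梯度与各训练聚类的代表梯度方向求余弦相似度,再经 softmax 归一化得到组合权重。聚类梯度与输入梯度均为假设数据,仅说明路由机制。

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def route_weights(input_grad, cluster_grads):
    """按输入梯度与各聚类代表梯度的余弦相似度,
    经 softmax 为各专家适配器分配组合权重。"""
    sims = [cosine(input_grad, g) for g in cluster_grads]
    exp = [math.exp(s) for s in sims]
    z = sum(exp)
    return [e / z for e in exp]

clusters = [[1.0, 0.0], [0.0, 1.0]]   # 两个专家聚类的代表梯度方向
w = route_weights([0.9, 0.1], clusters)
```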
[NLP-148] Efficient Beam Search for Large Language Models Using Trie-Based Decoding
【速读】: 该论文旨在解决Transformer-based序列到序列生成中批处理束搜索方法存在的高内存消耗问题。解决方案的关键在于引入了一种基于trie(前缀树)的并行解码方法,通过在共享相同前缀的所有束之间共用单一的键值(KV)缓存,不仅大幅减少了内存消耗,还实现了所有分支的并行解码。这一创新性地使用前缀树为束搜索提供了一个高效的替代方案,在保持推理速度的同时显著节省了内存,特别适用于内存受限环境或大规模模型部署。
链接: https://arxiv.org/abs/2502.00085
作者: Brian J Chan,Jui-Hung Cheng,Mao Xun Huang,Chao-Ting Chen,Hen-Hsen Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages
点击查看摘要
Abstract:In Transformer-based sequence-to-sequence generation, beam search has proven effective in enhancing the quality of generated sequences compared to greedy decoding. Conventional beam search methods typically adopt either a sequential or batch-based approach. The sequential approach, while memory-efficient, requires multiple decoding passes to construct a complete search tree, leading to significantly slower inference. On the other hand, the batch-based approach enables parallel computation across beams, but at the expense of high memory consumption due to the need to maintain separate key-value (KV) caches for each beam. In this study, we introduce a novel trie (prefix-tree)-based parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache among all beams that share the same prefix, the proposed method not only reduces memory consumption dramatically but also enables parallel decoding across all branches. This innovative use of a prefix tree offers an efficient alternative for beam search, achieving significant memory savings while preserving inference speed, making it particularly well-suited for memory-constrained environments or large-scale model deployments.
zh
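共享前缀为何能减少 KV 缓存,可以用前缀树的节点计数直观示意:树中节点数即需要缓存 KV 的位置数,而逐束独立缓存的条目数是各束序列长度之和。以下为与论文实现无关的假设性示意,token 用字符串代替。

```python
def trie_nodes(beams):
    """把各束的 token 序列插入前缀树,统计节点数(共享缓存的
    KV 条目数),并与逐束独立缓存的条目数对比。"""
    trie = {}
    nodes = 0
    for seq in beams:
        cur = trie
        for tok in seq:
            if tok not in cur:
                cur[tok] = {}
                nodes += 1          # 只有新前缀位置需要新的 KV 条目
            cur = cur[tok]
    per_beam = sum(len(seq) for seq in beams)
    return nodes, per_beam

beams = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]
shared, independent = trie_nodes(beams)
```

此例中三束共 9 个 token,但前缀树只有 6 个节点;束宽越大、共享前缀越长,节省越显著。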
[NLP-149] BTS: Harmonizing Specialized Experts into a Generalist LLM
【速读】: 该论文旨在解决如何高效且灵活地将多个独立训练的领域专家大型语言模型(Large Language Model, LLM)整合成一个具备广泛能力的通用模型。解决方案的关键在于Branch-Train-Stitch (BTS)算法,该算法通过插入轻量级的缝合层(stitch layers),在冻结的专家模型与初始种子语言模型之间实现融合,并仅需少量训练数据即可使种子模型在前向传播过程中集成来自多个专家模型的表示,从而实现在保持专家特定能力的同时,提升模型在新领域的泛化能力。
链接: https://arxiv.org/abs/2502.00075
作者: Qizhen Zhang,Prajjwal Bhargava,Chloe Bi,Chris X. Cai,Jakob Foerster,Jeremy Fu,Punit Singh Koura,Ruan Silva,Sheng Shen,Emily Dinan,Suchin Gururangan,Mike Lewis
机构: Oxford University (牛津大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present Branch-Train-Stitch (BTS), an efficient and flexible training algorithm for combining independently trained large language model (LLM) experts into a single, capable generalist model. Following Li et al., we start with a single seed language model which is branched into domain-specific (e.g., coding or math) experts with continual pretraining. BTS combines experts into a generalist model using lightweight stitch layers, which are inserted between frozen experts and the seed LLM, and trained on a small datamix of the expert domains. Stitch layers enable the seed LLM to integrate representations from any number of experts during the forward pass, allowing it to generalize to new domains, despite remaining frozen. Because BTS does not alter the constituent LLMs, BTS provides a modular and flexible approach: experts can be easily removed and new experts can be added with only a small amount of training. Compared to alternative model merging approaches, BTS yields the best generalist performance on a variety of downstream tasks, retaining the specialized capabilities of each of the experts.
zh
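缝合层的基本思想可示意为:把各冻结专家在该层的隐藏表示经门控加权后并入种子模型的隐藏状态,种子模型与专家本身均保持不变。下面的门控权重与隐藏向量均为假设数据,仅说明表示融合的形式。

```python
def stitch_layer(seed_hidden, expert_hiddens, gates):
    """缝合层示意:种子模型隐藏状态 + 各专家隐藏表示的门控加权和。"""
    out = list(seed_hidden)
    for gate, h in zip(gates, expert_hiddens):
        for i, v in enumerate(h):
            out[i] += gate * v
    return out

seed = [0.5, -0.2]
experts = [[1.0, 0.0], [0.0, 1.0]]   # 例如:代码专家与数学专家的表示
mixed = stitch_layer(seed, experts, gates=[0.3, 0.1])
```

训练时只需更新门控等缝合参数,因此如摘要所述,增删专家只涉及少量训练。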
[NLP-150] LLM Cyber Evaluations Don’t Capture Real-World Risk
【速读】: 该论文旨在解决评估大型语言模型(Large Language Models, LLMs)在网络安全应用中的风险与实际影响不匹配的问题。论文的关键解决方案在于提出一个综合的风险评估框架,该框架不仅考虑模型的能力,还纳入了对威胁行为者采用行为及其潜在影响的分析。通过这一框架,论文评估了一种具体用例——即用于网络安全助手的LLMs,并发现其合规率高但准确性一般,且整体风险较低,因为其操作优势和影响潜力有限。基于这些发现,论文建议加强学术界与产业界的协作,更真实地模拟攻击者行为,并在评估中加入经济指标,以更好地对齐研究重点与实际影响评估。
链接: https://arxiv.org/abs/2502.00072
作者: Kamilė Lukošiūtė,Adam Swanda
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages
点击查看摘要
Abstract:Large language models (LLMs) are demonstrating increasing prowess in cybersecurity applications, creating inherent risks alongside their potential for strengthening defenses. In this position paper, we argue that current efforts to evaluate risks posed by these capabilities are misaligned with the goal of understanding real-world impact. Evaluating LLM cybersecurity risk requires more than just measuring model capabilities – it demands a comprehensive risk assessment that incorporates analysis of threat actor adoption behavior and potential for impact. We propose a risk assessment framework for LLM cyber capabilities and apply it to a case study of language models used as cybersecurity assistants. Our evaluation of frontier models reveals high compliance rates but moderate accuracy on realistic cyber assistance tasks. However, our framework suggests that this particular use case presents only moderate risk due to limited operational advantages and impact potential. Based on these findings, we recommend several improvements to align research priorities with real-world impact assessment, including closer academia-industry collaboration, more realistic modeling of attacker behavior, and inclusion of economic metrics in evaluations. This work represents an important step toward more effective assessment and mitigation of LLM-enabled cybersecurity risks.
zh
[NLP-151] A Multi-Layered Large Language Model Framework for Disease Prediction
【速读】: 该论文旨在解决通过社交媒体和在线健康平台收集的大量阿拉伯语医学文本在疾病分类和症状严重性评估中的处理与应用问题。关键解决方案在于采用先进的阿拉伯语医学文本预处理技术,包括文本摘要、文本精炼以及命名实体识别(NER),并结合CAMeL-BERT模型进行优化。研究发现,使用CAMeL-BERT结合NER增强的文本能够显著提升疾病类型分类(83%)和症状严重性评估(69%)的性能。
链接: https://arxiv.org/abs/2502.00063
作者: Malak Mohamed,Rokaia Emad,Ali Hamdi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Social telehealth has revolutionized healthcare by enabling patients to share symptoms and receive medical consultations remotely. Users frequently post symptoms on social media and online health platforms, generating a vast repository of medical data that can be leveraged for disease classification and symptom severity assessment. Large language models (LLMs), such as LLAMA3, GPT-3.5 Turbo, and BERT, process complex medical data to enhance disease classification. This study explores three Arabic medical text preprocessing techniques: text summarization, text refinement, and Named Entity Recognition (NER). Evaluating CAMeL-BERT, AraBERT, and Asafaya-BERT with LoRA, the best performance was achieved using CAMeL-BERT with NER-augmented text (83% type classification, 69% severity assessment). Non-fine-tuned models performed poorly (13%-20% type classification, 40%-49% severity assessment). Integrating LLMs into social telehealth systems enhances diagnostic accuracy and treatment outcomes.
zh
[NLP-152] Contextually Entangled Gradient Mapping for Optimized LLM Comprehension
【速读】: 该论文旨在解决神经架构在长文本推理、上下文保持及适应新领域任务中的优化策略不足的问题。关键在于引入了Contextually Entangled Gradient Mapping (CEGM),将梯度视为动态承载上下文依赖性的实体,而非孤立的数值,通过在损失正则化框架中整合纠缠梯度动力学,显著提升了模型在这些任务上的表现。
链接: https://arxiv.org/abs/2502.00048
作者: Colin Sisate,Alistair Goldfinch,Vincent Waterstone,Sebastian Kingsley,Mariana Blackthorn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Contextually Entangled Gradient Mapping (CEGM) introduces a new approach to gradient optimization, redefining the relationship between contextual embeddings and gradient updates to enhance semantic coherence and reasoning capabilities in neural architectures. By treating gradients as dynamic carriers of contextual dependencies rather than isolated numerical entities, the proposed methodology bridges critical gaps in existing optimization strategies. The integration of entangled gradient dynamics into a loss regularization framework demonstrated significant improvements in tasks involving long-form reasoning, contextual retention, and adaptability to unseen domains. Experimental evaluations showed that the CEGM-enhanced model consistently outperformed baseline approaches, achieving higher accuracy in token-level predictions and greater resilience to noisy inputs. Practical implementations involved modifications to training pipelines, introducing entanglement layers and dynamic coefficient adjustments that seamlessly align with existing architectures. Results further highlighted reductions in semantic drift during sequential transformations and improvements in embedding coherence across paraphrased sentences, showing the robustness and versatility of the proposed methodology. The findings demonstrate the broader implications of gradient entanglement for both theoretical advancements and practical applications in optimization strategies.
[NLP-153] Optimization Strategies for Enhancing Resource Efficiency in Transformers Large Language Models
【Quick Read】: This paper tackles the resource costs that accompany performance gains of the Transformer architecture in NLP. The key is exploring and optimizing compression techniques, including Quantization, Knowledge Distillation, and Pruning, to improve energy and computational efficiency while retaining performance. 4-bit quantization significantly reduces energy use with almost no accuracy loss, and hybrid approaches such as NVIDIA's Minitron, which combines knowledge distillation with structured pruning, show favorable trade-offs between size reduction and accuracy retention. The study offers insights for developing more sustainable and efficient large language models, highlighting the often-ignored concern of energy efficiency.
Link: https://arxiv.org/abs/2502.00046
Authors: Tom Wallace, Naser Ezzati-Jivan, Beatrice Ombuki-Berman
Affiliations: Brock University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted for ACM's ICPE 2025 in Short Paper format
Abstract:Advancements in Natural Language Processing are heavily reliant on the Transformer architecture, whose improvements come at substantial resource costs due to ever-growing model sizes. This study explores optimization techniques, including Quantization, Knowledge Distillation, and Pruning, focusing on energy and computational efficiency while retaining performance. Among standalone methods, 4-bit Quantization significantly reduces energy use with minimal accuracy loss. Hybrid approaches, like NVIDIA’s Minitron approach combining KD and Structured Pruning, further demonstrate promising trade-offs between size reduction and accuracy retention. A novel optimization equation is introduced, offering a flexible framework for comparing various methods. Through the investigation of these compression methods, we provide valuable insights for developing more sustainable and efficient LLMs, shining a light on the often-ignored concern of energy efficiency.
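The 4-bit quantization the abstract highlights can be illustrated with a minimal group-wise symmetric quantizer. This is an illustrative sketch, not the paper's implementation; the group size, the symmetric [-8, 7] scheme, and the helper names are assumptions:

```python
import numpy as np

def quantize_4bit(weights, group_size=8):
    """Symmetric 4-bit quantization: each group of weights is mapped to
    integers in [-8, 7] with one float scale stored per group."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    codes, scales = [], []
    for i in range(0, w.size, group_size):
        group = w[i:i + group_size]
        scale = float(np.abs(group).max()) / 7.0
        if scale == 0.0:          # all-zero group: any scale works
            scale = 1.0
        q = np.clip(np.round(group / scale), -8, 7).astype(np.int8)
        codes.append(q)
        scales.append(scale)
    return codes, scales

def dequantize_4bit(codes, scales):
    """Recover approximate float weights from the 4-bit codes."""
    return np.concatenate([q * s for q, s in zip(codes, scales)])
```

Storing one scale per group keeps the reconstruction error bounded by half a quantization step per weight, which is why accuracy loss stays small in practice.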
[NLP-154] MALT: Mechanistic Ablation of Lossy Translation in LLMs for a Low-Resource Language: Urdu
【Quick Read】: This paper addresses the significant performance drop of large language models (LLMs) on low-resource languages such as Urdu. The key finding is that even for low-resource languages, the LLM's internal latent response in English is fairly coherent, but the translation features are lossy and produce poor final translations. By mechanistically ablating these translation features and using a separate translation model to translate the LLM's internal latent response, performance on low-resource languages improves significantly while the cultural nuances of the input are preserved.
Link: https://arxiv.org/abs/2502.00041
Authors: Taaha Saleem Bajwa
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:LLMs are predominantly trained on English data, which leads to a significant drop in performance on low-resource languages. Understanding how LLMs handle these languages is crucial for improving their effectiveness. This study focuses on Urdu as a use case for exploring the challenges faced by LLMs in processing low-resource languages. LLMs primarily reason in English when prompted in another language, with the final layers acting as translators to convert the English response into the target language. This study finds that even for low-resource languages, the internal latent response of LLMs in English is quite coherent; however, the translation features are lossy and result in poor translations, leading to reduced performance. By mechanistically removing these translation features and using a separate translation model to translate the internal latent response of LLM, the performance of LLMs improves significantly while also preserving the cultural nuances of the input in low-resource languages.
[NLP-155] Zoning in American Cities: Are Reforms Making a Difference? An AI-based Analysis
【Quick Read】: This paper examines the adoption and impact of form-based codes (FBCs) as a response to urban-sustainability problems driven by traditional use-based zoning. The key is applying Natural Language Processing (NLP) to zoning documents from across the United States, revealing that FBCs promote compact, mixed-use urban forms, with improved walkability, shorter commutes, and a higher share of multi-family housing.
Link: https://arxiv.org/abs/2502.00008
Authors: Arianna Salazar-Miranda, Emily Talen
Affiliations: University of Chicago; Yale University
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 31 pages, 6 figures, 1 table
Abstract:Cities are at the forefront of addressing global sustainability challenges, particularly those exacerbated by climate change. Traditional zoning codes, which often segregate land uses, have been linked to increased vehicular dependence, urban sprawl, and social disconnection, undermining broader social and environmental sustainability objectives. This study investigates the adoption and impact of form-based codes (FBCs), which aim to promote sustainable, compact, and mixed-use urban forms as a solution to these issues. Using Natural Language Processing (NLP) techniques, we analyzed zoning documents from over 2000 U.S. census-designated places to identify linguistic patterns indicative of FBC principles. Our findings reveal widespread adoption of FBCs across the country, with notable variations within regions. FBCs are associated with higher floor-to-area ratios, narrower and more consistent street setbacks, and smaller plots. We also find that places with FBCs have improved walkability, shorter commutes, and a higher share of multi-family housing. Our findings highlight the utility of NLP for evaluating zoning codes and underscore the potential benefits of form-based zoning reforms for enhancing urban sustainability.
[NLP-156] Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods
【Quick Read】: This paper addresses the challenges of fine-tuning discrete diffusion models with policy gradient methods, particularly in Reinforcement Learning from Human Feedback (RLHF) settings with non-differentiable rewards. The key contribution is Score Entropy Policy Optimization (SEPO), an efficient, broadly applicable, and theoretically grounded policy gradient algorithm, shown to be scalable and efficient across several discrete generative tasks.
Link: https://arxiv.org/abs/2502.01384
Authors: Oussama Zekri, Nicolas Boullé
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages, 4 figures, 5 tables
Abstract:Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at this https URL
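SEPO itself is specific to discrete diffusion, but the underlying tool, a policy gradient through a non-differentiable reward, can be illustrated with the classic score-function (REINFORCE) estimator. This is a generic sketch for a single categorical action, not the SEPO algorithm; the sample count and identity `∇_j log π(a) = 1[j=a] − π_j` are the standard softmax case:

```python
import math, random

def reinforce_gradient(logits, reward_fn, n_samples=2000, rng=None):
    """Monte-Carlo estimate of d E[r(a)] / d logits for
    a ~ Categorical(softmax(logits)), using grad = E[r(a) * grad log pi(a)].
    reward_fn may be arbitrary (non-differentiable)."""
    rng = rng or random.Random(0)
    m = max(logits)                                  # stable softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        u, a, acc = rng.random(), len(logits) - 1, 0.0
        for i, p in enumerate(probs):                # inverse-CDF sampling
            acc += p
            if u <= acc:
                a = i
                break
        r = reward_fn(a)
        for j in range(len(logits)):
            # grad of log pi(a) wrt logit j is 1[j == a] - probs[j]
            grad[j] += r * ((1.0 if j == a else 0.0) - probs[j])
    return [g / n_samples for g in grad]
```

Because the reward only enters as a multiplicative weight on the score, no gradient ever flows through `reward_fn`, which is exactly what makes this family of estimators applicable to non-differentiable RLHF rewards.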
[NLP-157] Probabilistic adaptation of language comprehension for individual speakers: Evidence from neural oscillations
【Quick Read】: This paper investigates how listeners dynamically update their mental representations of language comprehension based on the probability that a speaker produces stereotype-incongruent utterances. The key is distinguishing two possible mechanisms: a speaker-general mechanism that adjusts overall expectations, and a speaker-specific mechanism that updates models of individual speakers. Two EEG experiments show distinct patterns in high-beta (21-30 Hz) and theta (4-6 Hz) oscillations across conditions, supporting the existence of both mechanisms and providing evidence that language processing is shaped by social cognition in real time.
Link: https://arxiv.org/abs/2502.01299
Authors: Hanlin Wu, Xiaohui Rao, Zhenguang G. Cai
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
Comments:
Abstract:Listeners adapt language comprehension based on their mental representations of speakers, but how these representations are dynamically updated remains unclear. We investigated whether listeners probabilistically adapt their comprehension based on the likelihood of speakers producing stereotype-incongruent utterances. Our findings reveal two potential mechanisms: a speaker-general mechanism that adjusts overall expectations about speaker-content relationships, and a speaker-specific mechanism that updates individual speaker models. In two EEG experiments, participants heard speakers make stereotype-congruent or incongruent utterances, with incongruency base rate manipulated between blocks. In Experiment 1, speaker incongruency modulated both high-beta (21-30 Hz) and theta (4-6 Hz) oscillations: incongruent utterances decreased oscillatory power in low base rate condition but increased it in high base rate condition. The theta effect varied with listeners’ openness trait: less open participants showed theta increases to speaker-incongruencies, suggesting maintenance of speaker-specific information, while more open participants showed theta decreases, indicating flexible model updating. In Experiment 2, we dissociated base rate from the target speaker by manipulating the overall base rate using an alternative non-target speaker. Only the high-beta effect persisted, showing power decrease for speaker-incongruencies in low base rate condition but no effect in high base rate condition. The high-beta oscillations might reflect the speaker-general adjustment, while theta oscillations may index the speaker-specific model updating. These findings provide evidence for how language processing is shaped by social cognition in real time.
[NLP-158] MarketSenseAI 2.0: Enhancing Stock Analysis through LLM Agents
【Quick Read】: This paper targets the efficiency of integrating and processing information for stock analysis and decision making. The key is MarketSenseAI, a framework whose novel architecture combines Retrieval-Augmented Generation with large language model (LLM) agents; it processes SEC filings and earnings calls and enriches macroeconomic analysis through systematic processing of diverse institutional reports. The approach significantly improves fundamental-analysis accuracy and outperforms market indices in empirical evaluation, validating its effectiveness.
Link: https://arxiv.org/abs/2502.00415
Authors: George Fatouros, Kostas Metaxas, John Soldatos, Manos Karathanassis
Affiliations: Unknown
Subjects: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Portfolio Management (q-fin.PM)
Comments: 25 pages, 7 figures, Under review at Financial Innovation (FIN)
Abstract:MarketSenseAI is a novel framework for holistic stock analysis which leverages Large Language Models (LLMs) to process financial news, historical prices, company fundamentals and the macroeconomic environment to support decision making in stock analysis and selection. In this paper, we present the latest advancements on MarketSenseAI, driven by rapid technological expansion in LLMs. Through a novel architecture combining Retrieval-Augmented Generation and LLM agents, the framework processes SEC filings and earnings calls, while enriching macroeconomic analysis through systematic processing of diverse institutional reports. We demonstrate a significant improvement in fundamental analysis accuracy over the previous version. Empirical evaluation on S&P 100 stocks over two years (2023-2024) shows MarketSenseAI achieving cumulative returns of 125.9% compared to the index return of 73.5%, while maintaining comparable risk profiles. Further validation on S&P 500 stocks during 2024 demonstrates the framework's scalability, delivering a 33.8% higher Sortino ratio than the market. This work marks a significant advancement in applying LLM technology to financial analysis, offering insights into the robustness of LLM-driven investment strategies.
[NLP-159] AlphaSharpe: LLM-Driven Discovery of Robust Risk-Adjusted Metrics
【Quick Read】: This paper addresses the limited robustness and generalization of traditional financial metrics (such as the Sharpe ratio) under dynamic, volatile market conditions. The key is the AlphaSharpe framework, which uses large language models (LLMs) to iteratively evolve and optimize financial metrics; through iterative crossover, mutation, and evaluation it generates enhanced risk-return metrics that outperform traditional approaches in robustness and in correlation with future performance metrics.
Link: https://arxiv.org/abs/2502.00029
Authors: Kamer Ali Yuksel, Hassan Sawaf
Affiliations: aiXplain Inc., San Jose, CA, USA
Subjects: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE); Risk Management (q-fin.RM)
Comments:
Abstract:Financial metrics like the Sharpe ratio are pivotal in evaluating investment performance by balancing risk and return. However, traditional metrics often struggle with robustness and generalization, particularly in dynamic and volatile market conditions. This paper introduces AlphaSharpe, a novel framework leveraging large language models (LLMs) to iteratively evolve and optimize financial metrics. AlphaSharpe generates enhanced risk-return metrics that outperform traditional approaches in robustness and correlation with future performance metrics by employing iterative crossover, mutation, and evaluation. Key contributions of this work include: (1) an innovative use of LLMs for generating and refining financial metrics inspired by domain-specific knowledge, (2) a scoring mechanism to ensure the evolved metrics generalize effectively to unseen data, and (3) an empirical demonstration of 3x predictive power for future risk-return forecasting. Experimental results on a real-world dataset highlight the superiority of AlphaSharpe metrics, making them highly relevant for portfolio managers and financial decision-makers. This framework not only addresses the limitations of existing metrics but also showcases the potential of LLMs in advancing financial analytics, paving the way for informed and robust investment strategies.
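For reference, the traditional baseline that AlphaSharpe sets out to improve on, the annualized Sharpe ratio, can be computed as follows. This is the standard textbook formula, not code from the paper; the annualization factor of 252 trading days per year is a common convention and an assumption here:

```python
def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return divided by the
    sample standard deviation of excess returns, scaled by sqrt(periods)."""
    n = len(returns)
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / n
    var = sum((x - mean) ** 2 for x in excess) / (n - 1)  # sample variance
    vol = var ** 0.5
    if vol == 0.0:
        raise ValueError("zero volatility: Sharpe ratio is undefined")
    return (mean / vol) * periods_per_year ** 0.5
```

The division by volatility is precisely what makes the metric unstable in low-volatility or short-sample regimes, which is the robustness weakness the abstract refers to.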
Computer Vision
[CV-0] SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
【Quick Read】: This paper addresses control over the visual capabilities of diffusion models. Existing control methods require the user to specify attributes for each edit direction individually, whereas SliderSpace is a framework that discovers multiple interpretable and diverse directions simultaneously from a single text prompt. The key is training each direction as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model's latent space. Extensive experiments demonstrate SliderSpace's effectiveness on three applications: concept decomposition, artistic style exploration, and diversity enhancement.
Link: https://arxiv.org/abs/2502.01639
Authors: Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project Website: this https URL
Abstract:We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model’s latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace’s effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of model’s knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines. Our code, data and trained weights are available at this https URL
[CV-1] MFP-VTON: Enhancing Mask-Free Person-to-Person Virtual Try-On via Diffusion Transformer
【Quick Read】: This paper addresses the difficulty of obtaining a standard garment in virtual try-on (VTON) and proposes MFP-VTON, a mask-free person-to-person VTON framework. The key is building on a pretrained diffusion transformer and introducing a Focus Attention loss that emphasizes the garment of the reference person and the regions outside the garment of the target person. The model excels at both person-to-person and garment-to-person VTON, generating high-fidelity fitting images.
Link: https://arxiv.org/abs/2502.01626
Authors: Le Shen, Yanting Kang, Rong Huang, Zhijie Wang
Affiliations: Donghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The garment-to-person virtual try-on (VTON) task, which aims to generate fitting images of a person wearing a reference garment, has made significant strides. However, obtaining a standard garment is often more challenging than using the garment already worn by the person. To improve ease of use, we propose MFP-VTON, a Mask-Free framework for Person-to-Person VTON. Recognizing the scarcity of person-to-person data, we adapt a garment-to-person model and dataset to construct a specialized dataset for this task. Our approach builds upon a pretrained diffusion transformer, leveraging its strong generative capabilities. During mask-free model fine-tuning, we introduce a Focus Attention loss to emphasize the garment of the reference person and the details outside the garment of the target person. Experimental results demonstrate that our model excels in both person-to-person and garment-to-person VTON tasks, generating high-fidelity fitting images.
[CV-2] Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
【Quick Read】: This paper addresses the vulnerability of multi-modal large language models (MLLMs) to visual adversarial perturbations, which can induce hallucinations, manipulate responses, or bypass safety mechanisms. The key is leveraging existing vision classification models that have been adversarially pre-trained at large scale: end-to-end integration of these robust models lets the language components adapt to robust visual features, yielding superior robustness against diverse adversarial threats without additional adversarial training and outperforming existing plug-and-play methods on complex reasoning tasks.
Link: https://arxiv.org/abs/2502.01576
Authors: Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Khan, Salman Khan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review
Abstract:Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at this https URL.
[CV-3] MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
【Quick Read】: This paper tackles three obstacles in AI generation of structured multi-step procedural tutorials: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalization across multiple domains. It contributes a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences and introduces MakeAnything, a framework based on the diffusion transformer (DiT). The key is using fine-tuning to activate DiT's in-context capabilities for generating consistent procedural sequences, together with asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. In addition, the ReCraft model enables image-to-process generation via spatiotemporal consistency constraints, decomposing static images into plausible creation sequences.
Link: https://arxiv.org/abs/2502.01572
Authors: Yiren Song, Cheng Liu, Mike Zheng Shou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DIT), which leverages fine-tuning to activate the in-context capabilities of DIT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. Additionally, our ReCraft model enables image-to-process generation through spatiotemporal consistency constraints, allowing static images to be decomposed into plausible creation sequences. Extensive experiments demonstrate that MakeAnything surpasses existing methods, setting new performance benchmarks for procedural generation tasks.
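The low-rank adaptation (LoRA) idea the abstract relies on, freezing a pretrained weight and learning only a rank-r update, can be sketched in a few lines. This is a generic LoRA forward pass, not MakeAnything's asymmetric variant; the shapes and the zero-init of `b` follow common practice and are assumptions here:

```python
import numpy as np

def lora_forward(x, w_frozen, a, b, alpha=1.0):
    """Linear layer with a low-rank (LoRA) update.

    w_frozen: (out, in) pretrained weight, kept fixed during tuning.
    a: (r, in) and b: (out, r) are the only trainable parameters;
    their product b @ a is a rank-r additive update to w_frozen.
    """
    delta = alpha * (b @ a)              # rank-r weight update
    return x @ (w_frozen + delta).T

def init_lora(out_dim, in_dim, rank, rng):
    """Common LoRA init: a small random, b zero, so the update starts at 0."""
    a = rng.standard_normal((rank, in_dim)) * 0.01
    b = np.zeros((out_dim, rank))
    return a, b
```

With `b` initialized to zero, the adapted layer starts out identical to the frozen one, so fine-tuning begins from the pretrained model's behavior and only drifts as the low-rank factors are trained.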
[CV-4] GauCho: Gaussian Distributions with Cholesky Decomposition for Oriented Object Detection
【Quick Read】: This paper addresses the angular boundary discontinuity problem in oriented object detection (OOD) and the encoding ambiguity of circular objects. The key is GauCho, a regression head that directly produces Gaussian distributions via the Cholesky matrix decomposition, theoretically mitigating the boundary discontinuity problem while remaining fully compatible with existing Gaussian-based regression loss functions. The paper further advocates representing oriented objects with Oriented Ellipses (OEs), which relate to GauCho through a bijective function and alleviate the encoding ambiguity for circular objects.
Link: https://arxiv.org/abs/2502.01565
Authors: Jeffri Murrugarra-LLerena, Jose Henrique Lima Marques, Claudio R. Jung
Affiliations: Stony Brook University; Federal University of Rio Grande do Sul
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Oriented Object Detection (OOD) has received increased attention in the past years, being a suitable solution for detecting elongated objects in remote sensing analysis. In particular, using regression loss functions based on Gaussian distributions has become attractive since they yield simple and differentiable terms. However, existing solutions are still based on regression heads that produce Oriented Bounding Boxes (OBBs), and the known problem of angular boundary discontinuity persists. In this work, we propose a regression head for OOD that directly produces Gaussian distributions based on the Cholesky matrix decomposition. The proposed head, named GauCho, theoretically mitigates the boundary discontinuity problem and is fully compatible with recent Gaussian-based regression loss functions. Furthermore, we advocate using Oriented Ellipses (OEs) to represent oriented objects, which relates to GauCho through a bijective function and alleviates the encoding ambiguity problem for circular objects. Our experimental results show that GauCho can be a viable alternative to the traditional OBB head, achieving results comparable to or better than state-of-the-art detectors for the challenging dataset DOTA.
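The core trick, predicting a Gaussian through its Cholesky factor so that the covariance is positive definite by construction, can be sketched for the 2D case. The `exp()` on the diagonal entries is one common way to enforce positivity and is an assumption here, not necessarily the paper's exact parameterization:

```python
import numpy as np

def gaussian_from_cholesky(l11, l21, l22):
    """Build a 2x2 covariance from three raw network outputs.

    The diagonal entries pass through exp() to stay strictly positive,
    so Sigma = L @ L.T is symmetric positive definite by construction.
    No angle parameter appears, hence no angular boundary discontinuity.
    """
    L = np.array([[np.exp(l11), 0.0],
                  [l21,         np.exp(l22)]])
    return L @ L.T
```

Because any smooth change of the three raw outputs produces a smooth change of the covariance (and thus of the implied oriented ellipse), the wrap-around jump that an explicit angle parameter suffers at its range boundary never occurs.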
[CV-5] FireCastNet: Earth-as-a-Graph for Seasonal Fire Prediction
【Quick Read】: This paper targets accurate and timely seasonal wildfire forecasting at a global scale. The key is FireCastNet, a novel architecture combining a 3D convolutional encoder with GraphCast, trained to capture the context leading to wildfires at different spatial and temporal scales. Longer input time series make predictions more robust, and integrating spatial information to capture the spatio-temporal dynamics of wildfires boosts performance; the results also suggest that a larger spatial receptive field helps at longer forecasting horizons.
Link: https://arxiv.org/abs/2502.01550
Authors: Dimitrios Michail, Charalampos Davalas, Lefki-Ioanna Panagiotou, Ioannis Prapas, Spyros Kondylatos, Nikolaos Ioannis Bountos, Ioannis Papoutsis
Affiliations: Harokopio University of Athens, Greece; OrionLab, National Technical University & National Observatory of Athens, Greece
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:With climate change expected to exacerbate fire weather conditions, the accurate and timely anticipation of wildfires becomes increasingly crucial for disaster mitigation. In this study, we utilize SeasFire, a comprehensive global wildfire dataset with climate, vegetation, oceanic indices, and human-related variables, to enable seasonal wildfire forecasting with machine learning. For the predictive analysis, we present FireCastNet, a novel architecture which combines a 3D convolutional encoder with GraphCast, originally developed for global short-term weather forecasting using graph neural networks. FireCastNet is trained to capture the context leading to wildfires, at different spatial and temporal scales. Our investigation focuses on assessing the effectiveness of our model in predicting the presence of burned areas at varying forecasting time horizons globally, extending up to six months into the future, and on how different spatial or/and temporal context affects the performance. Our findings demonstrate the potential of deep learning models in seasonal fire forecasting; longer input time-series leads to more robust predictions, while integrating spatial information to capture wildfire spatio-temporal dynamics boosts performance. Finally, our results hint that in order to enhance performance at longer forecasting horizons, a larger receptive field spatially needs to be considered.
[CV-6] VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
【Quick Read】: This paper addresses knowledge integration for large language models (LLMs) when processing and understanding long-context videos; existing methods focus mainly on text and neglect the rich domain of multi-modal video knowledge. The key innovation of the VideoRAG framework is its dual-channel architecture, which seamlessly integrates graph-based textual knowledge grounding, capturing cross-video semantic relationships, with multi-modal context encoding that efficiently preserves visual features. This design enables VideoRAG to process videos of unlimited length by constructing precise knowledge graphs that span multiple videos while maintaining semantic dependencies.
Link: https://arxiv.org/abs/2502.01549
Authors: Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
Affiliations: Baidu Inc.; The University of Hong Kong
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in enhancing Large Language Models (LLMs) through external knowledge integration, yet its application has primarily focused on textual content, leaving the rich domain of multi-modal video knowledge predominantly unexplored. This paper introduces VideoRAG, the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Our core innovation lies in its dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features. This novel design empowers VideoRAG to process unlimited-length videos by constructing precise knowledge graphs that span multiple videos while maintaining semantic dependencies through specialized multi-modal retrieval paradigms. Through comprehensive empirical evaluation on our proposed LongerVideos benchmark-comprising over 160 videos totaling 134+ hours across lecture, documentary, and entertainment categories-VideoRAG demonstrates substantial performance compared to existing RAG alternatives and long video understanding methods. The source code of VideoRAG implementation and the benchmark dataset are openly available at: this https URL.
[CV-7] VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion
【Quick Read】: This paper addresses the sim-to-real gap encountered when deploying legged robots in real-world environments, where simulators fail to replicate visual realism and complex real-world geometry, limiting high-level tasks that require RGB-based perception. The key is a Real-to-Sim-to-Real framework that reconstructs scenes from multi-view images with 3D Gaussian Splatting (3DGS) to generate photorealistic, physically interactive "digital twin" simulation environments for visual navigation and locomotion learning, achieving RGB-only sim-to-real policy transfer.
Link: https://arxiv.org/abs/2502.01536
Authors: Shaoting Zhu, Linzhan Mou, Derun Li, Baijun Ye, Runhan Huang, Hang Zhao
Affiliations: IIIS, Tsinghua University; Galaxea AI; Shanghai Qi Zhi Institute; Shanghai Jiao Tong University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Recent success in legged robot locomotion is attributed to the integration of reinforcement learning and physical simulators. However, these policies often encounter challenges when deployed in real-world environments due to sim-to-real gaps, as simulators typically fail to replicate visual realism and complex real-world geometry. Moreover, the lack of realistic visual rendering limits the ability of these policies to support high-level tasks requiring RGB-based perception like ego-centric navigation. This paper presents a Real-to-Sim-to-Real framework that generates photorealistic and physically interactive “digital twin” simulation environments for visual navigation and locomotion learning. Our approach leverages 3D Gaussian Splatting (3DGS) based scene reconstruction from multi-view images and integrates these environments into simulations that support ego-centric visual perception and mesh-based physical interactions. To demonstrate its effectiveness, we train a reinforcement learning policy within the simulator to perform a visual goal-tracking task. Extensive experiments show that our framework achieves RGB-only sim-to-real policy transfer. Additionally, our framework facilitates the rapid adaptation of robot policies with effective exploration capability in complex new environments, highlighting its potential for applications in households and factories.
[CV-8] Efficiently Integrate Large Language Models with Visual Perception: A Survey from the Training Paradigm Perspective
【Quick Read】: This survey fills the gap in understanding the training paradigms of Vision Large Language Models (VLLMs) and their parameter-efficiency considerations. The key is analyzing 34 VLLMs from top conferences, journals, and highly cited arXiv papers from the training-paradigm perspective, with a focus on parameter efficiency. The paper first introduces LLM architectures and parameter-efficient learning methods, then discusses vision encoders and a comprehensive taxonomy of modality integrators, reviews three training paradigms with their efficiency considerations, and summarizes benchmarks in the VLLM field. To gain deeper insight into parameter-efficiency effectiveness, it also replicates experiments of the Direct Adaptation paradigm, offering a practical guide for researchers and practitioners on efficiently integrating vision modalities into LLMs.
Link: https://arxiv.org/abs/2502.01524
Authors: Xiaorui Ma, Haoran Xie, S. Joe Qin
Affiliations: Lingnan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 3 figures
Abstract:The integration of vision-language modalities has been a significant focus in multimodal learning, traditionally relying on Vision-Language Pretrained Models. However, with the advent of Large Language Models (LLMs), there has been a notable shift towards incorporating LLMs with vision modalities. Following this, the training paradigms for incorporating vision modalities into LLMs have evolved. Initially, the approach was to integrate the modalities through pretraining the modality integrator, named Single-stage Tuning. It has since branched out into methods focusing on performance enhancement, denoted as Two-stage Tuning, and those prioritizing parameter efficiency, referred to as Direct Adaptation. However, existing surveys primarily address the latest Vision Large Language Models (VLLMs) with Two-stage Tuning, leaving a gap in understanding the evolution of training paradigms and their unique parameter-efficient considerations. This paper categorizes and reviews 34 VLLMs from top conferences, journals, and highly cited Arxiv papers, focusing on parameter efficiency during adaptation from the training paradigm perspective. We first introduce the architecture of LLMs and parameter-efficient learning methods, followed by a discussion on vision encoders and a comprehensive taxonomy of modality integrators. We then review three training paradigms and their efficiency considerations, summarizing benchmarks in the VLLM field. To gain deeper insights into their effectiveness in parameter efficiency, we compare and discuss the experimental results of representative models, among which the experiment of the Direct Adaptation paradigm is replicated. Providing insights into recent developments and practical uses, this survey is a vital guide for researchers and practitioners navigating the efficient integration of vision modalities into LLMs.
[CV-9] BD-Diff: Generative Diffusion Model for Image Deblurring on Unknown Domains with Blur-Decoupled Learning
【Quick Read】: This paper addresses the cost and difficulty of acquiring large amounts of realistic paired data, and the overfitting that results from relying solely on synthetic data, which limits diffusion-based deblurring under unknown blur patterns. The key is BD-Diff, a generative-diffusion-based model that decouples structural features and blur patterns through joint training on three specially designed tasks. Two Q-Formers serve as separate extractors of structural representations and blur patterns, which are used for a supervised deblurring task on synthetic data and an unsupervised blur-transfer task that leverages unpaired blurred images from the target domain; a reconstruction task additionally makes the structural features and blur patterns complementary, enhancing BD-Diff's generalization to unknown-domain blur patterns.
Link: https://arxiv.org/abs/2502.01522
Authors: Junhao Cheng, Wei-Ting Chen, Xi Lu, Ming-Hsuan Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: We propose BD-Diff to integrate a generative diffusion model into unpaired deblurring tasks
Abstract:Generative diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. In favor of their ability to supplement missing details and generate aesthetically pleasing contents, recent works have applied them to image deblurring tasks via training an adapter on blurry-sharp image pairs to provide structural conditions for restoration. However, acquiring substantial amounts of realistic paired data is challenging and costly in real-world scenarios. On the other hand, relying solely on synthetic data often results in overfitting, leading to unsatisfactory performance when confronted with unseen blur patterns. To tackle this issue, we propose BD-Diff, a generative-diffusion-based model designed to enhance deblurring performance on unknown domains by decoupling structural features and blur patterns through joint training on three specially designed tasks. We employ two Q-Formers as structural representations and blur patterns extractors separately. The features extracted by them will be used for the supervised deblurring task on synthetic data and the unsupervised blur-transfer task by leveraging unpaired blurred images from the target domain simultaneously. Furthermore, we introduce a reconstruction task to make the structural features and blur patterns complementary. This blur-decoupled learning process enhances the generalization capabilities of BD-Diff when encountering unknown domain blur patterns. Experiments on real-world datasets demonstrate that BD-Diff outperforms existing state-of-the-art methods in blur removal and structural preservation in various challenging scenarios. The codes will be released in this https URL
zh
[CV-10] End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)合成任务中复杂模态交互建模的挑战。论文的关键在于提出了一种端到端学习的文本嵌入方法,这些嵌入是专门为T2I合成网络设计的。此外,论文结合了生成式训练和对比式训练,并使用了两种嵌入:一种优化以增强生成图像的真实感,另一种则致力于捕捉文本与图像之间的对齐关系。这一方法在三个基准数据集上的实验表明,使用分离的嵌入比共享嵌入效果更佳,并且优于采用从预先训练的判别式文本编码器获取文本表示的方法。
链接: https://arxiv.org/abs/2502.01507
作者: Yeruru Asrar Ahmed,Anurag Mittal
机构: Indian Institute of Technology Madras (印度理工学院马德拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text-to-Image (T2I) synthesis is a challenging task that requires modeling complex interactions between two modalities ( i.e., text and image). A common framework adopted in recent state-of-the-art approaches to achieving such multimodal interactions is to bootstrap the learning process with pre-trained image-aligned text embeddings trained using contrastive loss. Furthermore, these embeddings are typically trained generically and reused across various synthesis models. In contrast, we explore an approach to learning text embeddings specifically tailored to the T2I synthesis network, trained in an end-to-end fashion. Further, we combine generative and contrastive training and use two embeddings, one optimized to enhance the photo-realism of the generated images, and the other seeking to capture text-to-image alignment. A comprehensive set of experiments on three text-to-image benchmark datasets (Oxford-102, Caltech-UCSD, and MS-COCO) reveal that having two separate embeddings gives better results than using a shared one and that such an approach performs favourably in comparison with methods that use text representations from a pre-trained text encoder trained using a discriminative approach. Finally, we demonstrate that such learned embeddings can be used in other contexts as well, such as text-to-image manipulation.
zh
[CV-11] MoireDB: Formula-generated Interference-fringe Image Dataset
【速读】:该论文旨在解决图像识别模型在处理现实世界退化时的鲁棒性不足问题。解决方案的关键在于提出MoireDB数据集,这是一个通过公式生成的干涉条纹图像数据集,用于增强图像增强和模型鲁棒性。MoireDB通过利用错觉模式,消除了版权顾虑,降低了数据集构建成本,并提高了模型对现实世界退化的鲁棒性。实验表明,使用MoireDB增强的图像表现优于传统的分形艺术和基于特征可视化(FVis)的增强方法。
链接: https://arxiv.org/abs/2502.01490
作者: Yuto Matsuo,Ryo Hayamizu,Hirokatsu Kataoka,Akio Nakamura
机构: Tokyo Denki University (东京电气大学); National Institute of Advanced Industrial Science and Technology (AIST) (先进产业科学技术研究所); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Image recognition models have struggled to maintain recognition robustness under real-world degradations. In this context, data augmentation methods like PixMix improve robustness but rely on generative arts and feature visualizations (FVis), which have copyright, drawing cost, and scalability issues. We propose MoireDB, a formula-generated interference-fringe image dataset for image augmentation enhancing robustness. MoireDB eliminates copyright concerns, reduces dataset assembly costs, and enhances robustness by leveraging illusory patterns. Experiments show that MoireDB-augmented images outperform traditional fractal arts and FVis-based augmentations, making it a scalable and effective solution for improving model robustness against real-world degradations.
zh
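MoireDB 的核心是用公式生成干涉条纹(摩尔纹)图像并混入训练图像。下面给出一个示意性草图(非论文官方实现,频率、角度与混合强度等参数均为假设):叠加两组频率/方向略有差异的正弦光栅即可得到低频摩尔条纹,再以 PixMix 式的线性混合作数据增强。

```python
import numpy as np

def moire_fringe(h, w, f1=0.15, f2=0.17, theta1=0.0, theta2=0.35):
    """叠加两组正弦光栅生成摩尔干涉条纹,值域归一化到 [0, 1]。"""
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    # 光栅 1:频率 f1,方向 theta1
    g1 = np.cos(2 * np.pi * f1 * (x * np.cos(theta1) + y * np.sin(theta1)))
    # 光栅 2:频率/方向略有差异,两者叠加产生低频干涉条纹
    g2 = np.cos(2 * np.pi * f2 * (x * np.cos(theta2) + y * np.sin(theta2)))
    fringe = (g1 + g2) / 2.0           # [-1, 1]
    return (fringe + 1.0) / 2.0        # [0, 1]

def augment(img, strength=0.3):
    """以线性混合方式把条纹叠加进图像(img 取值 [0, 1])。"""
    fringe = moire_fringe(*img.shape[:2])
    if img.ndim == 3:
        fringe = fringe[..., None]
    return np.clip((1 - strength) * img + strength * fringe, 0.0, 1.0)
```

实际数据集按多组 (f, θ) 参数批量生成条纹图像;这里的混合方式与参数仅作说明用。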
[CV-12] Simultaneous Automatic Picking and Manual Picking Refinement for First-Break
【速读】:该论文旨在解决微地震数据处理中自动识别初至波时遇到的手动标记数据集中的异常值和潜在误标问题。这些问题会影响神经网络训练的有效性。论文的关键解决方案是Simultaneous Picking and Refinement (SPR)算法,它将初至波的真实位置视为概率模型中的潜在变量,并引入先验标签来处理噪声或异常数据。SPR通过动态调整和优化,提高了在包含异常值或部分不准确数据的数据集中识别初至波的准确性。此外,SPR的灵活性使其能够适应多种基于深度学习的初至波拾取方法。
链接: https://arxiv.org/abs/2502.01474
作者: Haowen Bai,Zixiang Zhao,Jiangshe Zhang,Yukun Cui,Chunxia Zhang,Zhenbo Guo,Yongjun Wang
机构: School of Mathematics and Statistics, Xi’an Jiaotong University(西安交通大学数学与统计学院), China; Geophysical Technology Research Center of Bureau of Geophysical Prospecting, Zhuozhou, Hebei, P.R.China(中国石油集团东方物探研究院地球物理技术研究中心); School of Artificial Intelligence, Wenzhou Polytechnic(温州职业技术学院人工智能学院), China
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:First-break picking is a pivotal procedure in processing microseismic data for geophysics and resource exploration. Recent advancements in deep learning have catalyzed the evolution of automated methods for identifying first-break. Nevertheless, the complexity of seismic data acquisition and the requirement for detailed, expert-driven labeling often result in outliers and potential mislabeling within manually labeled datasets. These issues can negatively affect the training of neural networks, necessitating algorithms that handle outliers or mislabeled data effectively. We introduce the Simultaneous Picking and Refinement (SPR) algorithm, designed to handle datasets plagued by outlier samples or even noisy labels. Unlike conventional approaches that regard manual picks as ground truth, our method treats the true first-break as a latent variable within a probabilistic model that includes a first-break labeling prior. SPR aims to uncover this variable, enabling dynamic adjustments and improved accuracy across the dataset. This strategy mitigates the impact of outliers or inaccuracies in manual labels. Intra-site picking experiments and cross-site generalization experiments on publicly available data confirm our method’s performance in identifying first-break and its generalization across different sites. Additionally, our investigations into noisy signals and labels underscore SPR’s resilience to both types of noise and its capability to refine misaligned manual annotations. Moreover, the flexibility of SPR, not being limited to any single network architecture, enhances its adaptability across various deep learning-based picking methods. Focusing on learning from data that may contain outliers or partial inaccuracies, SPR provides a robust solution to some of the principal obstacles in automatic first-break picking.
zh
[CV-13] Deep Unfolding Multi-modal Image Fusion Network via Attribution Analysis
【速读】:该论文旨在解决多模态图像融合过程中缺乏直接指导和交互的问题,当前方法主要集中在通过复杂的映射获取视觉显示层面的信息丰富的融合图像,而忽视了融合过程与下游任务(如语义分割)之间的有效互动。论文的关键解决方案在于提出了一种“展开归因分析融合网络”(UAAFusion),通过归因分析技术更有效地调整融合图像以适应语义分割任务,增强融合与分割之间的互动。具体而言,该方法利用归因分析探索源图像中语义区域对任务区分的贡献,并将更有益的特征整合到融合算法中,从而让分割任务引导融合过程。这种方法构建了一个基于模型驱动的展开网络,使用来自归因分析的优化目标,并通过计算当前分割网络状态下的归因融合损失来实现这一目标。
链接: https://arxiv.org/abs/2502.01467
作者: Haowen Bai,Zixiang Zhao,Jiangshe Zhang,Baisong Jiang,Lilun Deng,Yukun Cui,Shuang Xu,Chunxia Zhang
机构: School of Mathematics and Statistics, Xi’an Jiaotong University(西安交通大学数学与统计学院), China; Photogrammetry and Remote Sensing, ETH Zürich(瑞士苏黎世联邦理工学院摄影测量与遥感研究所), Switzerland; School of Mathematics and Statistics, Northwestern Polytechnical University(西北工业大学数学与统计学院), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 2024
点击查看摘要
Abstract:Multi-modal image fusion synthesizes information from multiple sources into a single image, facilitating downstream tasks such as semantic segmentation. Current approaches primarily focus on acquiring informative fusion images at the visual display stratum through intricate mappings. Although some approaches attempt to jointly optimize image fusion and downstream tasks, these efforts often lack direct guidance or interaction, serving only to assist with a predefined fusion loss. To address this, we propose an "Unfolding Attribution Analysis Fusion network" (UAAFusion), using attribution analysis to tailor fused images more effectively for semantic segmentation, enhancing the interaction between the fusion and segmentation. Specifically, we utilize attribution analysis techniques to explore the contributions of semantic regions in the source images to task discrimination. At the same time, our fusion algorithm incorporates more beneficial features from the source images, thereby allowing the segmentation to guide the fusion process. Our method constructs a model-driven unfolding network that uses optimization objectives derived from attribution analysis, with an attribution fusion loss calculated from the current state of the segmentation network. We also develop a new pathway function for attribution analysis, specifically tailored to the fusion tasks in our unfolding network. An attribution attention mechanism is integrated at each network stage, allowing the fusion network to prioritize areas and pixels crucial for high-level recognition tasks. Additionally, to mitigate the information loss in traditional unfolding networks, a memory augmentation module is incorporated into our network to improve the information flow across various network layers. Extensive experiments demonstrate our method’s superiority in image fusion and applicability to semantic segmentation.
zh
[CV-14] Temporal-consistent CAMs for Weakly Supervised Video Segmentation in Waste Sorting
【速读】:该论文旨在解决弱监督(Weakly Supervised, WS)方法在视频流语境下的语义分割精度不足的问题。关键解决方案在于构建利用视频中连续帧之间时间一致性(temporal coherence)的显著性图(saliency maps),通过最小化相邻帧之间显著性图的差异来提高分割精度,并在训练辅助分类器时直接整合这种时间一致性,从而实现更准确的材料移除识别。
链接: https://arxiv.org/abs/2502.01455
作者: Andrea Marelli,Luca Magri,Federica Arrigoni,Giacomo Boracchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 7 figures
点击查看摘要
Abstract:In industrial settings, weakly supervised (WS) methods are usually preferred over their fully supervised (FS) counterparts as they do not require costly manual annotations. Unfortunately, the segmentation masks obtained in the WS regime are typically poor in terms of accuracy. In this work, we present a WS method capable of producing accurate masks for semantic segmentation in the case of video streams. More specifically, we build saliency maps that exploit the temporal coherence between consecutive frames in a video, promoting consistency when objects appear in different frames. We apply our method in a waste-sorting scenario, where we perform weakly supervised video segmentation (WSVS) by training an auxiliary classifier that distinguishes between videos recorded before and after a human operator, who manually removes specific wastes from a conveyor belt. The saliency maps of this classifier identify materials to be removed, and we modify the classifier training to minimize differences between the saliency map of a central frame and those in adjacent frames, after having compensated object displacement. Experiments on a real-world dataset demonstrate the benefits of integrating temporal coherence directly during the training phase of the classifier. Code and dataset are available upon request.
zh
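该方法的核心是:在补偿物体位移后,最小化中心帧显著性图与相邻帧显著性图之间的差异。以下是一个只用 numpy 的极简草图,用整像素平移近似位移补偿(论文中实际的补偿方式与分类器训练细节在摘要之外,此处均为假设的简化):

```python
import numpy as np

def compensated(mask, dy, dx):
    """整像素平移并把卷绕进来的区域清零,近似补偿传送带上物体的位移。"""
    out = np.roll(mask, (dy, dx), axis=(0, 1))
    if dy > 0:
        out[:dy] = 0
    elif dy < 0:
        out[dy:] = 0
    if dx > 0:
        out[:, :dx] = 0
    elif dx < 0:
        out[:, dx:] = 0
    return out

def temporal_consistency_loss(sal_center, sal_neighbors, displacements):
    """位移补偿后,惩罚中心帧与各相邻帧显著性图之间的 L1 差异。"""
    losses = [np.abs(sal_center - compensated(s, dy, dx)).mean()
              for s, (dy, dx) in zip(sal_neighbors, displacements)]
    return float(np.mean(losses))
```

训练时把该损失加到辅助分类器的分类损失上,即可促使显著性图在相邻帧间保持一致。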
[CV-15] SPFFNet: Strip Perception and Feature Fusion Spatial Pyramid Pooling for Fabric Defect Detection
【速读】:该论文旨在解决织物缺陷检测中复杂背景和特定形状缺陷难以识别的问题。关键解决方案包括:引入条形感知模块(Strip Perception Module, SPM),通过多尺度卷积增强对条状缺陷特征的捕获能力;在空间金字塔池化快速模块(SPPF)中融入squeeze-and-excitation机制得到SE-SPPF模块,以更好地整合空间与通道信息;并提出一种带自适应权重的焦点增强完全交并比(FECIoU)度量,通过focal loss调整难检测实例的权重,以应对尺度差异和类别不平衡问题。这些改进使模型在天池数据集上的平均精度均值(mAP)提升0.8–8.1%,在自建数据集上提升1.6–13.2%。
链接: https://arxiv.org/abs/2502.01445
作者: Peizhe Zhao
机构: Nanjing University of Information Science and Technology (南京信息工程大学); Waterford Institute (沃特福德学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, conference
点击查看摘要
Abstract:Defect detection in fabrics is critical for quality control, yet existing methods often struggle with complex backgrounds and shape-specific defects. In this paper, we propose an improved fabric defect detection model based on YOLOv4. To enhance the detection of strip defects, we introduce a Strip Perception Module (SPM) that improves feature capture through multi-scale convolution. We further enhance the spatial pyramid pooling fast (SPPF) by integrating a squeeze-and-excitation mechanism, resulting in the SE-SPPF module, which better integrates spatial and channel information for more effective defect feature extraction. Additionally, we propose a novel focal enhanced complete intersection over union (FECIoU) metric with adaptive weights, addressing scale differences and class imbalance by adjusting the weights of hard-to-detect instances through focal loss. Experimental results demonstrate that our model achieves a 0.8-8.1% improvement in mean average precision (mAP) on the Tianchi dataset and a 1.6-13.2% improvement on our custom dataset, outperforming other state-of-the-art methods.
zh
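FECIoU 的思路是在 IoU 系列损失上叠加 focal 式权重,让难检测(IoU 低)的样本占据更大的损失比重。以下是只含 IoU 部分的简化草图:完整的 CIoU 还包含中心距离与长宽比惩罚项,此处的权重形式 (1-IoU)^γ 是常见 focal 加权的假设写法,未必与论文公式一致。

```python
def iou(box_a, box_b):
    """box = (x1, y1, x2, y2),返回两个框的交并比。"""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def focal_iou_loss(box_pred, box_gt, gamma=0.5):
    """focal 式加权:IoU 越低(样本越难),权重 (1 - IoU)^gamma 越大。"""
    i = iou(box_pred, box_gt)
    return (1.0 - i) ** gamma * (1.0 - i)   # 基础损失取 1 - IoU
```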
[CV-16] Improved Training Technique for Latent Consistency Models ICLR2025
【速读】:该论文旨在解决一致性模型在大规模数据集上训练时,特别是在文本到图像和视频生成任务中的性能退化问题。论文的关键在于分析了像素空间与隐空间之间的统计差异,并发现隐空间数据中存在高度尖峰的离群值,严重影响了一致性模型在隐空间中的表现。为了解决这一问题,论文提出了采用Cauchy损失替换Pseudo-Huber损失以减轻离群值的影响,并引入扩散损失和最优传输(Optimal Transport, OT)耦合以进一步提升性能。此外,论文还引入自适应缩放调度器和非缩放LayerNorm来管理稳健的训练过程并更好地捕捉特征统计信息,从而减少离群值的影响。通过这些策略,成功训练出能够在一到两步内进行高质量采样的一致性模型,显著缩小了一致性模型与扩散模型之间的性能差距。
链接: https://arxiv.org/abs/2502.01441
作者: Quan Dao,Khanh Doan,Di Liu,Trung Le,Dimitris Metaxas
机构: Rutgers University; VinAI Research; Monash University
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICLR2025
点击查看摘要
Abstract:Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling- c scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models. The implementation is released here: this https URL
zh
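论文把 iCT 的 Pseudo-Huber 损失替换为对脉冲式离群值更不敏感的 Cauchy 损失。下面对比两者对残差 d 的惩罚曲线(c 为尺度参数;Cauchy 的系数写法采用常见形式,可能与论文实现略有出入):小残差处两者都近似二次,而大残差处 Pseudo-Huber 线性增长、Cauchy 仅对数增长,因此隐空间中的离群值对 Cauchy 损失的主导性更弱。

```python
import numpy as np

def pseudo_huber(d, c=1.0):
    """Pseudo-Huber:小残差近似二次,大残差近似线性增长。"""
    return c * c * (np.sqrt(1.0 + (d / c) ** 2) - 1.0)

def cauchy(d, c=1.0):
    """Cauchy(Lorentzian):大残差仅对数增长,对离群值更鲁棒。"""
    return 0.5 * c * c * np.log1p((d / c) ** 2)
```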
[CV-17] Structural features of the fly olfactory circuit mitigate the stability-plasticity dilemma in continual learning
【速读】:该论文旨在解决人工神经网络在持续学习过程中面临的稳定性-可塑性困境,同时尝试借鉴生物策略来提升机器学习算法。论文的关键解决方案在于引入了一个简化的果蝇嗅觉回路模型(Fly Model),该模型能够与现代机器学习方法结合使用,以增强记忆稳定性和学习可塑性,从而克服当前持续学习策略的局限性。
链接: https://arxiv.org/abs/2502.01427
作者: Heming Zou,Yunliang Zang,Xiangyang Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:
点击查看摘要
Abstract:Artificial neural networks face the stability-plasticity dilemma in continual learning, while the brain can maintain memories and remain adaptable. However, the biological strategies for continual learning and their potential to inspire learning algorithms in neural networks are poorly understood. This study presents a minimal model of the fly olfactory circuit to investigate the biological strategies that support continual odor learning. We introduce the fly olfactory circuit as a plug-and-play component, termed the Fly Model, which can integrate with modern machine learning methods to address this dilemma. Our findings demonstrate that the Fly Model enhances both memory stability and learning plasticity, overcoming the limitations of current continual learning strategies. We validated its effectiveness across various challenging continual learning scenarios using commonly used datasets. The fly olfactory system serves as an elegant biological circuit for lifelong learning, offering a module that enhances continual learning with minimal additional computational cost for machine learning.
zh
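果蝇嗅觉回路常被抽象为"稀疏随机投影 + 赢者通吃"的扩展编码(PN 层投影到数量大得多的 KC 层,只保留最强的少数激活)。论文的 Fly Model 作为可插拔模块的具体结构摘要中未给出,以下只是这一生物回路的常见计算抽象草图,维度与稀疏率均为假设值:

```python
import numpy as np

def fly_hash(x, proj, k=16):
    """稀疏随机投影到高维 KC 层,再保留 top-k 激活(赢者通吃)。"""
    kc = proj @ x                 # 扩展编码:d 维 -> m 维 (m >> d)
    out = np.zeros_like(kc)
    top = np.argsort(kc)[-k:]     # 仅最强的 k 个 KC 保持激活,其余清零
    out[top] = kc[top]
    return out

rng = np.random.default_rng(0)
d, m = 50, 2000
# 每个 KC 仅随机连接约 10% 的输入(0/1 稀疏投影矩阵)
proj = (rng.random((m, d)) < 0.1).astype(float)
```

这种高维稀疏编码使不同输入的表征近似正交,被认为有助于缓解持续学习中的灾难性遗忘。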
[CV-18] Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成详细图像描述时,由于响应长度增加导致视觉注意力减弱和噪声增加的问题。这限制了模型在精确度(Precision)与召回率(Recall)之间的平衡。为了解决这一问题,论文提出了一种名为SPARC(Selective Progressive Attention ReCalibration)的方法。SPARC的关键在于通过选择性增强视觉标记的影响来改善解码过程中的视觉注意力,从而同时提升精确度和召回率,且计算开销极小。
链接: https://arxiv.org/abs/2502.01419
作者: Mingi Jung,Saehuyng Lee,Eunji Kim,Sungroh Yoon
机构: Seoul National University(首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as visual attention gradually weakens, SPARC reinforces it to preserve its influence. Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall. In contrast, our proposed method enhances both precision and recall with minimal computational overhead.
zh
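SPARC 的三个观察可归纳为:用跨时间步的注意力差异定位关键视觉 token,只放大这一部分注意力,并随生成长度补偿整体视觉注意力的衰减。以下是一个与具体模型无关的极简示意(选择比例与放大系数均为假设值,实际方法作用于 MLLM 解码时的注意力层):

```python
import numpy as np

def sparc_recalibrate(attn_t, attn_prev, alpha=1.5, top_ratio=0.2):
    """放大"跨时间步差异最大"的视觉 token 注意力,其余保持不变。"""
    diff = np.abs(attn_t - attn_prev)          # 观察(2):用时间步间差异定位关键 token
    k = max(1, int(len(attn_t) * top_ratio))
    critical = np.argsort(diff)[-k:]           # 差异最大的 top-k 视觉 token
    out = attn_t.copy()
    out[critical] *= alpha                     # 观察(1)/(3):只放大这部分并强化
    return out / out.sum()                     # 重新归一化为注意力分布
```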
[CV-19] Human Body Restoration with One-Step Diffusion Model and A New Benchmark
【速读】:该论文旨在解决人体图像复原领域中高质量基准数据集缺乏的问题。为了解决这一难题,论文提出了一种高质量数据集自动裁剪与筛选(high-quality dataset automated cropping and filtering, HQ-ACF)管道,利用现有的目标检测数据集和其他未标注图像自动裁剪和筛选高质量的人体图像,从而构建了一个包含训练、验证和测试集的基于人物的复杂对象与自然活动复原(PERSONA)数据集。此外,论文还提出了一个新颖的单步扩散模型(one-step diffusion model for human body restoration, OSDHuman),其中引入了高保真图像嵌入器(High-Fidelity Image Embedder, HFIE)作为提示生成器,以更好地利用低质量人体图像信息引导模型,有效避免误导性提示。实验结果表明,OSDHuman在视觉质量和定量指标上均优于现有方法。
链接: https://arxiv.org/abs/2502.01411
作者: Jue Gong,Jingkai Wang,Zheng Chen,Xing Liu,Hong Gu,Yulun Zhang,Xiaokang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures. The code and model will be available at this https URL
点击查看摘要
Abstract:Human body restoration, as a specific application of image restoration, is widely applied in practice and plays a vital role across diverse fields. However, thorough research remains difficult, particularly due to the lack of benchmark datasets. In this study, we propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. Using this pipeline, we constructed a person-based restoration with sophisticated objects and natural activities (PERSONA) dataset, which includes training, validation, and test sets. The dataset significantly surpasses other human-related datasets in both quality and content richness. Finally, we propose OSDHuman, a novel one-step diffusion model for human body restoration. Specifically, we propose a high-fidelity image embedder (HFIE) as the prompt generator to better guide the model with low-quality human image information, effectively avoiding misleading prompts. Experimental results show that OSDHuman outperforms existing methods in both visual quality and quantitative metrics. The dataset and code will be available at this https URL.
zh
[CV-20] FourieRF: Few-Shot NeRFs via Progressive Fourier Frequency Control
【速读】:该论文旨在解决少样本场景下的快速高质量重建问题。解决方案的关键在于通过显式的课程训练程序有效地参数化特征,并在优化过程中逐步增加场景复杂度。这种方法产生的先验既稳健又具有广泛的适应性,从而建立了FourieRF作为少样本渲染问题中的强大且通用的基准方法。尽管如此,该方法在严重欠约束场景下仍可能导致重建误差,特别是在视图遮挡导致形状部分未被覆盖的情况下。
链接: https://arxiv.org/abs/2502.01405
作者: Diego Gomez,Bingchen Gong,Maks Ovsjanikov
机构: LIX, École Polytechnique, IP Paris (LIX, 巴黎综合理工学院, IP Paris)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3DV 2025 conference
点击查看摘要
Abstract:In this work, we introduce FourieRF, a novel approach for achieving fast and high-quality reconstruction in the few-shot setting. Our method effectively parameterizes features through an explicit curriculum training procedure, incrementally increasing scene complexity during optimization. Experimental results show that the prior induced by our approach is both robust and adaptable across a wide variety of scenes, establishing FourieRF as a strong and versatile baseline for the few-shot rendering problem. While our approach significantly reduces artifacts, it may still lead to reconstruction errors in severely under-constrained scenarios, particularly where view occlusion leaves parts of the shape uncovered. In the future, our method could be enhanced by integrating foundation models to complete missing parts using large data-driven priors.
zh
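FourieRF 的课程式训练可以理解为:随训练进度逐步放开位置编码/特征中的高频分量,先用低频拟合整体结构,再引入高频细节。下面用一个随进度 t∈[0,1] 线性展开的频率掩码作示意(掩码的具体函数形式是我们假设的示例,并非论文公式):

```python
import numpy as np

def frequency_mask(num_freqs, progress):
    """progress∈[0,1]:训练早期只保留低频分量,后期逐步放开高频。"""
    j = np.arange(num_freqs, dtype=np.float64)
    # 每个频带的权重随进度线性地从 0 升到 1,低频先被激活
    return np.clip(progress * num_freqs - j, 0.0, 1.0)

def masked_encoding(x, num_freqs, progress):
    """带频率掩码的正弦位置编码(标量输入的简化版)。"""
    m = frequency_mask(num_freqs, progress)
    freqs = 2.0 ** np.arange(num_freqs)
    return np.concatenate([m * np.sin(freqs * x), m * np.cos(freqs * x)])
```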
[CV-21] Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection
【速读】:该论文旨在解决3D视觉定位(3D Visual Grounding, 3DVG)任务中的两个主要挑战:一是监督方法依赖于稀缺且高成本的3D视觉-语言数据集;二是基于大型语言模型/视觉语言模型(LLM/VLM)的方法在推理过程中需要耗费大量时间和令牌。为了解决这些问题,论文提出了一种名为可进化符号视觉定位器(Evolvable Symbolic Visual Grounder, EaSe)的新型无训练符号框架。EaSe通过使用LLM生成的代码来计算空间关系,并实现了一个自动流水线来评估和优化这些代码的质量以及整合VLM以辅助定位过程。关键在于,EaSe显著降低了推理成本,同时保持了与基于代理的方法相当的性能,在Nr3D数据集上达到了52.9%的准确率,在ScanRefer上达到了49.2% Acc@0.25,从而在性能和效率之间实现了良好的平衡。
链接: https://arxiv.org/abs/2502.01401
作者: Boyu Mi,Hanqing Wang,Tai Wang,Yilun Chen,Jiangmiao Pang
机构: Shanghai AI Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D visual grounding (3DVG) is challenging because of the requirement of understanding on visual information, language and spatial relationships. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high cost of 3D vision-language datasets. On the other hand, LLM/VLM based agents are proposed for 3DVG, eliminating the need for training data. However, these methods incur prohibitive time and token costs during inference. To address the challenges, we introduce a novel training-free symbolic framework for 3D visual grounding, namely Evolvable Symbolic Visual Grounder, that offers significantly reduced inference costs compared to previous agent-based methods while maintaining comparable performance. EaSe uses LLM generated codes to compute on spatial relationships. EaSe also implements an automatic pipeline to evaluate and optimize the quality of these codes and integrate VLMs to assist in the grounding process. Experimental results demonstrate that EaSe achieves 52.9% accuracy on Nr3D dataset and 49.2% Acc@0.25 on ScanRefer, which is top-tier among training-free methods. Moreover, it substantially reduces the inference time and cost, offering a balanced trade-off between performance and efficiency. Codes are available at this https URL.
zh
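EaSe 用 LLM 生成的代码来计算空间关系。下面是这类"关系函数"的一个典型样例(由我们手写用于说明,并非论文的生成产物):基于 3D 包围盒中心判断"A 在 B 左边""离某锚点最近"等谓词,坐标轴约定均为假设。

```python
def center(box):
    """box = (xmin, ymin, zmin, xmax, ymax, zmax),返回中心点。"""
    return tuple((box[i] + box[i + 3]) / 2.0 for i in range(3))

def is_left_of(box_a, box_b, margin=0.05):
    """以 x 轴为左右方向的简单判定(坐标约定为假设)。"""
    return center(box_a)[0] + margin < center(box_b)[0]

def nearest_to(anchor_box, candidate_boxes):
    """返回与 anchor 中心欧氏距离最近的候选下标。"""
    ax, ay, az = center(anchor_box)
    def dist2(b):
        cx, cy, cz = center(b)
        return (cx - ax) ** 2 + (cy - ay) ** 2 + (cz - az) ** 2
    return min(range(len(candidate_boxes)), key=lambda i: dist2(candidate_boxes[i]))
```

框架的"进化"部分在于:由 LLM 生成、评估并迭代优化这类函数的实现质量,再组合它们完成指代定位。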
[CV-22] Learning Traffic Anomalies from Generative Models on Real-Time Observations
【速读】:该论文旨在解决城市交通管理中实时交通异常检测的问题。解决方案的关键在于采用时空生成对抗网络(STGAN)框架,结合图神经网络(Graph Neural Networks)和长短时记忆网络(Long Short-Term Memory networks),以捕捉交通数据中的复杂时空依赖关系。
链接: https://arxiv.org/abs/2502.01391
作者: Fotis I. Giasemis,Alexandros Sopasakis
机构: LIP6 (LIP6), LPNHE (LPNHE); Sorbonne Université (索邦大学); CNRS, IN2P3 (法国国家科学研究中心, IN2P3); Department of Mathematics (数学系); Lund University (隆德大学); Lund, Scania, Sweden (瑞典)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate detection of traffic anomalies is crucial for effective urban traffic management and congestion mitigation. We use the Spatiotemporal Generative Adversarial Network (STGAN) framework combining Graph Neural Networks and Long Short-Term Memory networks to capture complex spatial and temporal dependencies in traffic data. We apply STGAN to real-time, minute-by-minute observations from 42 traffic cameras across Gothenburg, Sweden, collected over several months in 2020. The images are processed to compute a flow metric representing vehicle density, which serves as input for the model. Training is conducted on data from April to November 2020, and validation is performed on a separate dataset from November 14 to 23, 2020. Our results demonstrate that the model effectively detects traffic anomalies with high precision and low false positive rates. The detected anomalies include camera signal interruptions, visual artifacts, and extreme weather conditions affecting traffic flow.
zh
[CV-23] Detecting Backdoor Samples in Contrastive Language Image Pretraining ICLR2025
【速读】:该论文旨在解决CLIP模型在大规模预训练过程中易受中毒后门攻击的问题。论文的关键在于发现中毒样本在局部子空间中的独特表征特征,即它们的局部邻域比干净样本更加稀疏。基于这一发现,论文提出使用传统的基于密度比的局部异常检测器来有效地检测这些后门攻击,而现有的方法则无法胜任。实验结果表明,这种方法可以高效地清理大规模网络数据集(如CC3M)中的后门污染,耗时仅需15分钟。
链接: https://arxiv.org/abs/2502.01385
作者: Hanxun Huang,Sarah Erfani,Yige Li,Xingjun Ma,James Bailey
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR2025
点击查看摘要
Abstract:Contrastive language-image pretraining (CLIP) has been found to be vulnerable to poisoning backdoor attacks where the adversary can achieve an almost perfect attack success rate on CLIP models by poisoning only 0.01% of the training dataset. This raises security concerns on the current practice of pretraining large-scale models on unscrutinized web data using CLIP. In this work, we analyze the representations of backdoor-poisoned samples learned by CLIP models and find that they exhibit unique characteristics in their local subspace, i.e., their local neighborhoods are far more sparse than that of clean samples. Based on this finding, we conduct a systematic study on detecting CLIP backdoor attacks and show that these attacks can be easily and efficiently detected by traditional density ratio-based local outlier detectors, whereas existing backdoor sample detection methods fail. Our experiments also reveal that an unintentional backdoor already exists in the original CC3M dataset and has been trained into a popular open-source model released by OpenCLIP. Based on our detector, one can clean up a million-scale web dataset (e.g., CC3M) efficiently within 15 minutes using 4 Nvidia A100 GPUs. The code is publicly available in our GitHub repository (this https URL).
zh
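该检测思路的要点是:中毒样本在表征空间的局部邻域明显比干净样本稀疏。论文使用传统的基于密度比的局部离群点检测器(如 LOF);下面给出一个只用 numpy 的简化"近邻稀疏度"评分草图(并非 LOF 的完整密度比实现):

```python
import numpy as np

def knn_sparsity_score(embeddings, k=5):
    """对每个样本返回其 k 近邻平均距离:分数越大,局部邻域越稀疏。"""
    d2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)               # 排除样本自身
    knn_d2 = np.sort(d2, axis=1)[:, :k]        # 每行最小的 k 个平方距离
    return np.sqrt(knn_d2).mean(axis=1)
```

对 CLIP 表征计算该分数并复核高分样本,即可近似复现"稀疏邻域 = 疑似后门样本"的筛查流程;大规模数据上需改用近似近邻检索替代这里的全量距离矩阵。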
[CV-24] Inverse Bridge Matching Distillation
【速读】:该论文旨在解决扩散桥接模型(Diffusion Bridge Models, DBMs)在图像到图像翻译应用中的慢推理速度问题。关键解决方案在于提出了一种基于逆向桥接匹配公式的新颖蒸馏技术,并推导出实用的可解目标函数。此方法能够蒸馏条件和非条件类型的DBMs,通过一步生成器进行蒸馏,并仅使用被破坏的图像进行训练。
链接: https://arxiv.org/abs/2502.01362
作者: Nikita Gushchin,David Li,Daniil Selikhanovych,Evgeny Burnaev,Dmitry Baranchuk,Alexander Korotin
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models in a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and even provide better generation quality than used teacher model depending on particular setup.
zh
[CV-25] Bayesian Approximation-Based Trajectory Prediction and Tracking with 4D Radar
【速读】:该论文旨在解决在恶劣天气条件下,基于LiDAR和摄像头的多目标跟踪(MOT)方法性能下降的问题,同时指出雷达方法虽然稳健但存在垂直分辨率有限和运动模型简单的问题。现有基于卡尔曼滤波的方法依赖固定的噪声协方差,导致其在对象突然机动时适应性较差。论文的关键解决方案在于提出Bayes-4DRTrack框架,采用基于变换器的运动预测网络以捕捉非线性运动动态,并在检测和预测步骤中使用贝叶斯近似。此外,两阶段数据关联利用多普勒测量来更好地分辨接近的目标。这些改进使得Bayes-4DRTrack在K-Radar数据集上的平均多目标跟踪精度(AMOTA)提升了5.7%,展示了其在严苛实际条件下的增强鲁棒性和准确性。
链接: https://arxiv.org/abs/2502.01357
作者: Dong-In Kim,Dong-Hee Paek,Seung-Hyun Song,Seung-Hyun Kong
机构: Korea Advanced Institute of Science and Technology(韩国科学技术院); Hyundai Motor Company(现代汽车公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6pages, 4 figures
点击查看摘要
Abstract:Accurate 3D multi-object tracking (MOT) is vital for autonomous vehicles, yet LiDAR and camera-based methods degrade in adverse weather. Meanwhile, Radar-based solutions remain robust but often suffer from limited vertical resolution and simplistic motion models. Existing Kalman filter-based approaches also rely on fixed noise covariance, hampering adaptability when objects make sudden maneuvers. We propose Bayes-4DRTrack, a 4D Radar-based MOT framework that adopts a transformer-based motion prediction network to capture nonlinear motion dynamics and employs Bayesian approximation in both detection and prediction steps. Moreover, our two-stage data association leverages Doppler measurements to better distinguish closely spaced targets. Evaluated on the K-Radar dataset (including adverse weather scenarios), Bayes-4DRTrack demonstrates a 5.7% gain in Average Multi-Object Tracking Accuracy (AMOTA) over methods with traditional motion models and fixed noise covariance. These results showcase enhanced robustness and accuracy in demanding, real-world conditions.
zh
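两阶段数据关联利用多普勒测量来区分空间上接近的目标。下面以"中心距离 + 多普勒速度差门控"的代价矩阵作极简示意(权重与门限均为假设值;论文的完整流程还包含基于 Transformer 的运动预测与检测/预测步的贝叶斯近似):

```python
import numpy as np

def association_cost(tracks, dets, w_doppler=2.0, gate=1e6):
    """tracks/dets: (N, 3) 数组,各列为 (x, y, doppler 速度)。"""
    pos_t, dop_t = tracks[:, :2], tracks[:, 2]
    pos_d, dop_d = dets[:, :2], dets[:, 2]
    dist = np.linalg.norm(pos_t[:, None] - pos_d[None, :], axis=-1)
    dop = np.abs(dop_t[:, None] - dop_d[None, :])
    cost = dist + w_doppler * dop
    cost[dop > 5.0] = gate      # 门控:多普勒差过大的轨迹-检测配对直接剔除
    return cost
```

代价矩阵随后可交给匈牙利算法等求解最优匹配;多普勒项使空间上几乎重叠、但径向速度不同的目标得以分辨。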
[CV-26] Quasi-Conformal Convolution : A Learnable Convolution for Deep Learning on Riemann Surfaces
【速读】:该论文旨在解决在非欧几里得域上定义卷积操作的挑战,特别是在分析复杂几何数据时缺乏常见坐标系和熟悉的欧几里得属性的问题。解决方案的关键是引入了一种名为拟共形卷积(Quasi-conformal Convolution, QCC)的新框架,通过利用可训练的估计模块生成拟共形映射,实现了适应性和可学习的卷积算子,这些算子可以根据底层数据结构动态调整。QCC统一了广泛的空间定义卷积,促进了在每个基础曲面上针对特定任务优化的定制卷积算子的学习。基于此,开发了拟共形卷积神经网络(QCCNN),验证了其在分类定义于曲面流形上的图像以及在医学应用中的有效性,包括三维面部数据的颅面分析和三维人脸上的病变分割。
链接: https://arxiv.org/abs/2502.01356
作者: Han Zhang,Tsz Lok Ip,Lok Ming Lui
机构: City University of Hong Kong(香港城市大学); Chinese University of Hong Kong(香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep learning on non-Euclidean domains is important for analyzing complex geometric data that lacks common coordinate systems and familiar Euclidean properties. A central challenge in this field is to define convolution on domains, which inherently possess irregular and non-Euclidean structures. In this work, we introduce Quasi-conformal Convolution (QCC), a novel framework for defining convolution on Riemann surfaces using quasi-conformal theories. Each QCC operator is linked to a specific quasi-conformal mapping, enabling the adjustment of the convolution operation through manipulation of this mapping. By utilizing trainable estimator modules that produce Quasi-conformal mappings, QCC facilitates adaptive and learnable convolution operators that can be dynamically adjusted according to the underlying data structured on Riemann surfaces. QCC unifies a broad range of spatially defined convolutions, facilitating the learning of tailored convolution operators on each underlying surface optimized for specific tasks. Building on this foundation, we develop the Quasi-Conformal Convolutional Neural Network (QCCNN) to address a variety of tasks related to geometric data. We validate the efficacy of QCCNN through the classification of images defined on curvilinear Riemann surfaces, demonstrating superior performance in this context. Additionally, we explore its potential in medical applications, including craniofacial analysis using 3D facial data and lesion segmentation on 3D human faces, achieving enhanced accuracy and reliability.
zh
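As background on the quasi-conformal maps that parameterize each QCC operator: in the standard theory (not taken verbatim from the paper) they are solutions of the Beltrami equation,

```latex
\frac{\partial f}{\partial \bar{z}} \;=\; \mu(z)\,\frac{\partial f}{\partial z},
\qquad \|\mu\|_{\infty} < 1 .
```

The Beltrami coefficient \(\mu\) encodes the local anisotropic distortion of \(f\), and \(\mu \equiv 0\) recovers a conformal map; this is why manipulating the mapping (via the trainable estimator modules) adjusts the induced convolution.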
[CV-27] ConceptVAE: Self-Supervised Fine-Grained Concept Disentanglement from 2D Echocardiographies
【速读】: This paper addresses the limitation of traditional self-supervised learning methods in capturing fine-grained concepts such as anatomical structures or organs. The key is the ConceptVAE framework, which detects and disentangles fine-grained concepts from their style characteristics in a self-supervised manner. The framework comprises a suite of loss terms and model architecture components designed to discretize the input data into a preset number of concepts along with their local styles.
链接: https://arxiv.org/abs/2502.01335
作者: Costin F. Ciusdel,Alex Serban,Tiziano Passerini
机构: Siemens SRL (西门子股份公司), Brasov, Romania; Siemens Healthineers (西门子医疗), Princeton, NJ, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While traditional self-supervised learning methods improve performance and robustness across various medical tasks, they rely on single-vector embeddings that may not capture fine-grained concepts such as anatomical structures or organs. The ability to identify such concepts and their characteristics without supervision has the potential to improve pre-training methods, and enable novel applications such as fine-grained image retrieval and concept-based outlier detection. In this paper, we introduce ConceptVAE, a novel pre-training framework that detects and disentangles fine-grained concepts from their style characteristics in a self-supervised manner. We present a suite of loss terms and model architecture primitives designed to discretise input data into a preset number of concepts along with their local style. We validate ConceptVAE both qualitatively and quantitatively, demonstrating its ability to detect fine-grained anatomical structures such as blood pools and septum walls from 2D cardiac echocardiographies. Quantitatively, ConceptVAE outperforms traditional self-supervised methods in tasks such as region-based instance retrieval, semantic segmentation, out-of-distribution detection, and object detection. Additionally, we explore the generation of in-distribution synthetic data that maintains the same concepts as the training data but with distinct styles, highlighting its potential for more calibrated data generation. Overall, our study introduces and validates a promising new pre-training technique based on concept-style disentanglement, opening multiple avenues for developing models for medical image analysis that are more interpretable and explainable than black-box approaches.
zh
[CV-28] CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
【速读】: This paper tackles the spurious correlations in category-level object pose estimation caused by "unclean" confounders in models, which hurt performance on unseen instances with significant variations. The key of the proposed CleanPose is the combination of causal learning and knowledge distillation: a causal inference module based on front-door adjustment mitigates the negative effect of unobserved confounders, reducing potential spurious correlations and promoting unbiased estimation, while a residual-based knowledge distillation method improves generalization by providing comprehensive category-level guidance.
链接: https://arxiv.org/abs/2502.01312
作者: Xiao Lin,Yun Peng,Liuyi Wang,Xianyou Zhong,Minghao Zhu,Jingwei Yang,Chengju Liu,Qijun Chen
机构: School of Electronic and Information Engineering, Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Category-level object pose estimation aims to recover the rotation, translation and size of unseen instances within predefined categories. In this task, deep neural network-based methods have demonstrated remarkable performance. However, previous studies show they suffer from spurious correlations raised by “unclean” confounders in models, hindering their performance on novel instances with significant variations. To address this issue, we propose CleanPose, a novel approach integrating causal learning and knowledge distillation to enhance category-level pose estimation. To mitigate the negative effect of unobserved confounders, we develop a causal inference module based on front-door adjustment, which promotes unbiased estimation by reducing potential spurious correlations. Additionally, to further improve generalization ability, we devise a residual-based knowledge distillation method that has proven effective in providing comprehensive category information guidance. Extensive experiments across multiple benchmarks (REAL275, CAMERA25 and HouseCat6D) highlight the superiority of proposed CleanPose over state-of-the-art methods. Code will be released.
zh
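For reference, the front-door adjustment that the causal inference module builds on has the standard textbook form (how CleanPose instantiates the treatment X, mediator M, and outcome Y over pose features is specific to the paper):

```latex
P\!\left(Y \mid do(X{=}x)\right)
\;=\; \sum_{m} P(M{=}m \mid X{=}x) \sum_{x'} P(Y \mid X{=}x',\, M{=}m)\, P(X{=}x') .
```

Intuitively, the mediator M carries the effect of X on Y through a path not confounded by the unobserved variables, which is what allows unbiased estimation without observing the confounders themselves.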
[CV-29] Heterogeneous Image GNN: Graph-Conditioned Diffusion for Image Synthesis
【速读】: This paper addresses the difficulty diffusion-based image synthesis models have in conditioning on heterogeneous graph data in complex scenes. Existing methods typically inject conditioning variables directly into the architecture, via cross-attention layers or image concatenation, and struggle to handle complex conditioning inputs with diverse relationships efficiently. The proposed Heterogeneous Image Graphs (HIG) representation models the conditioning variables and the target image as two interconnected graphs, enabling efficient handling of variable-length conditioning inputs and their relationships. A magnitude-preserving GNN integrates the HIG into the existing EDM2 diffusion model via a ControlNet approach. The key is that HIG better represents and processes complex conditional relationships, improving performance on the COCO-stuff and Visual Genome datasets.
链接: https://arxiv.org/abs/2502.01309
作者: Rupert Menneer,Christos Margadji,Sebastian W. Pattinson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce a novel method for conditioning diffusion-based image synthesis models with heterogeneous graph data. Existing approaches typically incorporate conditioning variables directly into model architectures, either through cross-attention layers that attend to text latents or image concatenation that spatially restrict generation. However, these methods struggle to handle complex scenarios involving diverse, relational conditioning variables, which are more naturally represented as unstructured graphs. This paper presents Heterogeneous Image Graphs (HIG), a novel representation that models conditioning variables and target images as two interconnected graphs, enabling efficient handling of variable-length conditioning inputs and their relationships. We also propose a magnitude-preserving GNN that integrates the HIG into the existing EDM2 diffusion model using a ControlNet approach. Our approach improves upon the SOTA on a variety of conditioning inputs for the COCO-stuff and Visual Genome datasets, and showcases the ability to condition on graph attributes and relationships represented by edges in the HIG.
zh
[CV-30] Partial Channel Network: Compute Fewer Perform Better
【速读】: This paper addresses the challenge of designing modules or mechanisms that keep a network's parameters and FLOPs low without sacrificing accuracy and throughput. The key is to exploit the redundancy within feature-map channels through a new partial channel mechanism (PCM): a split operation divides the feature-map channels into parts, each handled by a different operation such as convolution, attention, pooling, or identity mapping. On this basis, a novel partial attention convolution (PATConv) efficiently combines convolution with visual attention, and a dynamic partial convolution (DPConv) adaptively learns the split-channel ratio in different layers for better trade-offs. Together these form PartialNet, which achieves higher top-1 accuracy and inference speed than several state-of-the-art (SOTA) models on ImageNet-1K classification, along with strong detection and segmentation results on the COCO dataset.
链接: https://arxiv.org/abs/2502.01303
作者: Haiduo Huang,Tian Xia,Wenzhe Zhao,Pengju Ren
机构: Xi’an Jiaotong University (西安交通大学) · Institute of Artificial Intelligence and Robotics (人工智能与机器人研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Designing a module or mechanism that enables a network to maintain low parameters and FLOPs without sacrificing accuracy and throughput remains a challenge. To address this challenge and exploit the redundancy within feature map channels, we propose a new solution: partial channel mechanism (PCM). Specifically, through the split operation, the feature map channels are divided into different parts, with each part corresponding to different operations, such as convolution, attention, pooling, and identity mapping. Based on this assumption, we introduce a novel partial attention convolution (PATConv) that can efficiently combine convolution with visual attention. Our exploration indicates that the PATConv can completely replace both the regular convolution and the regular visual attention while reducing model parameters and FLOPs. Moreover, PATConv can derive three new types of blocks: Partial Channel-Attention block (PAT_ch), Partial Spatial-Attention block (PAT_sp), and Partial Self-Attention block (PAT_sf). In addition, we propose a novel dynamic partial convolution (DPConv) that can adaptively learn the proportion of split channels in different layers to achieve better trade-offs. Building on PATConv and DPConv, we propose a new hybrid network family, named PartialNet, which achieves superior top-1 accuracy and inference speed compared to some SOTA models on ImageNet-1K classification and excels in both detection and segmentation on the COCO dataset. Our code is available at this https URL.
zh
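A minimal sketch of the partial channel mechanism described above, with hypothetical branch choices (identity, sigmoid channel gating, and 3x3 mean smoothing as simple stand-ins for the paper's identity/attention/pooling branches) and NumPy arrays in (C, H, W) layout:

```python
import numpy as np

def partial_channel_mixer(x, ratios=(0.5, 0.25, 0.25)):
    """Toy sketch of a partial channel mechanism (PCM).

    x: feature map of shape (C, H, W). Channels are split into parts and
    each part goes through a different operation; identity, channel gating,
    and 3x3 mean smoothing are assumed stand-ins for the paper's branches.
    """
    c = x.shape[0]
    sizes = [int(round(r * c)) for r in ratios]
    sizes[-1] = c - sum(sizes[:-1])            # make the split exact
    parts = np.split(x, np.cumsum(sizes)[:-1], axis=0)

    out = [parts[0]]                           # identity branch

    # squeeze-and-excite style gating on the second part
    gate = 1.0 / (1.0 + np.exp(-parts[1].mean(axis=(1, 2), keepdims=True)))
    out.append(parts[1] * gate)

    # 3x3 mean smoothing on the third part (edge padding keeps the size)
    p = np.pad(parts[2], ((0, 0), (1, 1), (1, 1)), mode="edge")
    h, w = parts[2].shape[1], parts[2].shape[2]
    smoothed = sum(p[:, i:i + h, j:j + w]
                   for i in range(3) for j in range(3)) / 9.0
    out.append(smoothed)

    return np.concatenate(out, axis=0)         # shape preserved: (C, H, W)
```

Only a fraction of the channels pays for the expensive branch, which is the source of the parameter and FLOP savings claimed above.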
[CV-31] XR-VIO: High-precision Visual Inertial Odometry with Fast Initialization for XR Applications
【速读】: This paper addresses the instability of Visual Inertial Odometry (VIO) initialization and the limited efficiency and accuracy of feature matching. The key is a new visual-inertial initialization pipeline that tightly couples gyroscope measurements to improve the robustness and accuracy of visual Structure from Motion (SfM), together with a hybrid feature-matching method that combines optical flow with descriptor-based matching to achieve efficient, accurate, and robust tracking.
链接: https://arxiv.org/abs/2502.01297
作者: Shangjin Zhai,Nan Wang,Xiaomeng Wang,Danpeng Chen,Weijian Xie,Hujun Bao,Guofeng Zhang
机构: SenseTime Research; State Key Lab of CAD&CG, Zhejiang University; Tetras.AI; State Key Lab of CAD&CG, Zhejiang University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents a novel approach to Visual Inertial Odometry (VIO), focusing on the initialization and feature matching modules. Existing methods for initialization often suffer from either poor stability in visual Structure from Motion (SfM) or fragility in solving a huge number of parameters simultaneously. To address these challenges, we propose a new pipeline for visual inertial initialization that robustly handles various complex scenarios. By tightly coupling gyroscope measurements, we enhance the robustness and accuracy of visual SfM. Our method demonstrates stable performance even with only four image frames, yielding competitive results. In terms of feature matching, we introduce a hybrid method that combines optical flow and descriptor-based matching. By leveraging the robustness of continuous optical flow tracking and the accuracy of descriptor matching, our approach achieves efficient, accurate, and robust tracking results. Through evaluation on multiple benchmarks, our method demonstrates state-of-the-art performance in terms of accuracy and success rate. Additionally, a video demonstration on mobile devices showcases the practical applicability of our approach in the field of Augmented Reality/Virtual Reality (AR/VR).
zh
[CV-32] A Framework for Double-Blind Federated Adaptation of Foundation Models
【速读】: This paper addresses how to adapt pre-trained foundation models (FMs) to specific downstream tasks in a double-blind federated manner across data silos, using fully homomorphic encryption (FHE). The key is to first decompose the FM into a sequence of FHE-friendly blocks via knowledge distillation, and then adapt it to the downstream task with low-rank parallel adapters that can be learned without backpropagation through the FM. In addition, a privacy-preserving permutation scheme prevents data owners from learning the FM through model-extraction attacks, and a secure aggregation protocol is employed for federated learning of the low-rank parallel adapters.
链接: https://arxiv.org/abs/2502.01289
作者: Nurbek Tastan,Karthik Nandakumar
机构: Mohamed bin Zayed University of AI (MBZUAI); Michigan State University (MSU)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
点击查看摘要
Abstract:The availability of foundational models (FMs) pre-trained on large-scale data has advanced the state-of-the-art in many computer vision tasks. While FMs have demonstrated good zero-shot performance on many image classification tasks, there is often scope for performance improvement by adapting the FM to the downstream task. However, the data that is required for this adaptation typically exists in silos across multiple entities (data owners) and cannot be collated at a central location due to regulations and privacy concerns. At the same time, a learning service provider (LSP) who owns the FM cannot share the model with the data owners due to proprietary reasons. In some cases, the data owners may not even have the resources to store such large FMs. Hence, there is a need for algorithms to adapt the FM in a double-blind federated manner, i.e., the data owners do not know the FM or each other’s data, and the LSP does not see the data for the downstream tasks. In this work, we propose a framework for double-blind federated adaptation of FMs using fully homomorphic encryption (FHE). The proposed framework first decomposes the FM into a sequence of FHE-friendly blocks through knowledge distillation. The resulting FHE-friendly model is adapted for the downstream task via low-rank parallel adapters that can be learned without backpropagation through the FM. Since the proposed framework requires the LSP to share intermediate representations with the data owners, we design a privacy-preserving permutation scheme to prevent the data owners from learning the FM through model extraction attacks. Finally, a secure aggregation protocol is employed for federated learning of the low-rank parallel adapters. Empirical results on four datasets demonstrate the practical feasibility of the proposed framework.
zh
[CV-33] Template Matching in Images using Segmented Normalized Cross-Correlation
【速读】: This paper addresses the computational inefficiency of normalized cross-correlation (NCC) in template matching. The key is a new algorithm that precomputes an approximate representation of the template image, making approximate NCC computation with the source image more efficient than exact NCC with the original template. The approximate template is obtained from the original by a split-and-merge approach, decomposing it into axis-aligned rectangular segments whose sizes depend on per-segment pixel-intensity variance; each segment is assigned the mean grayscale value of the corresponding pixels of the original template. This keeps the NCC approximation error within an acceptable range while achieving computational performance comparable to, and in many cases better than, the FFT-based NCC algorithm.
链接: https://arxiv.org/abs/2502.01286
作者: Davor Marušić,Siniša Popović,Zoran Kalafatić
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 2 tables, 3 figures
点击查看摘要
Abstract:In this paper, a new variant of an algorithm for normalized cross-correlation (NCC) is proposed in the context of template matching in images. The proposed algorithm is based on the precomputation of a template image approximation, enabling more efficient calculation of approximate NCC with the source image than using the original template for exact NCC calculation. The approximate template is precomputed from the template image by a split-and-merge approach, resulting in a decomposition to axis-aligned rectangular segments, whose sizes depend on per-segment pixel intensity variance. In the approximate template, each segment is assigned the mean grayscale value of the corresponding pixels from the original template. The proposed algorithm achieves superior computational performance with negligible NCC approximation errors compared to the well-known Fast Fourier Transform (FFT)-based NCC algorithm, when applied on less visually complex and/or smaller template images. In other cases, the proposed algorithm can maintain either computational performance or NCC approximation error within the range of the FFT-based algorithm, but not both.
zh
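The split-and-merge approximation and the NCC score can be sketched as follows. This is a simplified quadrant-splitting variant for illustration, not the paper's exact segmentation rule, and the variance threshold is a hypothetical parameter:

```python
import numpy as np

def approximate_template(t, var_thresh=25.0):
    """Split a template into axis-aligned segments whose pixels are
    replaced by their mean (a rough sketch of the split-and-merge idea;
    the paper's actual split/merge rules differ)."""
    approx = np.empty_like(t, dtype=float)

    def split(y0, y1, x0, x1):
        block = t[y0:y1, x0:x1]
        if block.var() <= var_thresh or (y1 - y0 <= 1 and x1 - x0 <= 1):
            approx[y0:y1, x0:x1] = block.mean()
            return
        ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
        for ys, ye in ((y0, ym), (ym, y1)):
            for xs, xe in ((x0, xm), (xm, x1)):
                if ye > ys and xe > xs:
                    split(ys, ye, xs, xe)

    split(0, t.shape[0], 0, t.shape[1])
    return approx

def ncc(a, b):
    """Normalized cross-correlation of two equal-size patches, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```

On visually simple templates the approximation collapses to a few constant rectangles, which is what makes the per-segment correlation terms cheap to accumulate.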
[CV-34] Label Correction for Road Segmentation Using Road-side Cameras
【速读】: This paper addresses reliable road segmentation under all weather conditions, which is critical for intelligent transportation applications, autonomous vehicles, and advanced driver-assistance systems. Since collecting and annotating a dataset covering all weather conditions requires extensive resources, the paper utilizes existing roadside camera infrastructure to automatically collect road data in varying weather and proposes a novel semi-automatic annotation method. The key is that only one frame per camera is labeled manually; the label is then transferred to the other frames of that feed, with small camera movements compensated by frequency-domain image registration. The method is validated on data collected from 927 cameras across Finland over a four-month winter period, and training on the semi-automatically labeled data improves the performance of several deep learning segmentation models.
链接: https://arxiv.org/abs/2502.01281
作者: Henrik Toikka,Eerik Alamikkotervo,Risto Ojala
机构: Aalto University (阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Reliable road segmentation in all weather conditions is critical for intelligent transportation applications, autonomous vehicles and advanced driver’s assistance systems. For robust performance, all weather conditions should be included in the training data of deep learning-based perception models. However, collecting and annotating such a dataset requires extensive resources. In this paper, existing roadside camera infrastructure is utilized for collecting road data in varying weather conditions automatically. Additionally, a novel semi-automatic annotation method for roadside cameras is proposed. For each camera, only one frame is labeled manually and then the label is transferred to other frames of that camera feed. The small camera movements between frames are compensated using frequency domain image registration. The proposed method is validated with roadside camera data collected from 927 cameras across Finland over 4 month time period during winter. Training on the semi-automatically labeled data boosted the segmentation performance of several deep learning segmentation models. Testing was carried out on two different datasets to evaluate the robustness of the resulting models. These datasets were an in-domain roadside camera dataset and out-of-domain dataset captured with a vehicle on-board camera.
zh
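The frequency-domain registration step can be illustrated with plain FFT-based phase correlation. This is a sketch assuming a pure integer translation between frames; the paper's pipeline may handle sub-pixel motion differently:

```python
import numpy as np

def phase_correlation_shift(ref, moved):
    """Estimate the integer (dy, dx) translation such that
    moved ~= np.roll(ref, (dy, dx)), via phase correlation."""
    F_ref = np.fft.fft2(ref)
    F_mov = np.fft.fft2(moved)
    cross = F_mov * np.conj(F_ref)
    cross /= np.maximum(np.abs(cross), 1e-12)   # keep phase only
    corr = np.fft.ifft2(cross).real             # peaks at the shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape
    if dy > h // 2:                             # wrap to signed shifts
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

def transfer_label(label, dy, dx):
    """Shift a label mask by (dy, dx) so it follows the camera motion."""
    return np.roll(label, shift=(dy, dx), axis=(0, 1))
```

Once the shift between the manually labeled frame and a new frame is estimated, the same `transfer_label` step propagates the single manual annotation across the camera feed.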
[CV-35] FSPGD: Rethinking Black-box Attacks on Semantic Segmentation
【速读】: This paper addresses the limited transferability of adversarial examples against semantic segmentation models in black-box attacks. The key is the Feature Similarity Projected Gradient Descent (FSPGD) attack: unlike conventional methods that compute gradients from output predictions, FSPGD computes gradients from intermediate-layer features, with a loss function that targets local information by comparing features of clean and adversarial images while also disrupting contextual information by accounting for spatial relationships between objects, significantly improving both transferability and attack performance.
链接: https://arxiv.org/abs/2502.01262
作者: Eun-Sol Park,MiSo Park,Seung Park,Yong-Goo Shin
机构: Department of Electronics and Information Engineering, Korea University (高丽大学电子与信息工程系); College of Medicine, Chungbuk National University (忠北国立大学医学院); Department of Biomedical Engineering, Chungbuk National University Hospital (忠北国立大学医院生物医学工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Transferability, the ability of adversarial examples crafted for one model to deceive other models, is crucial for black-box attacks. Despite advancements in attack methods for semantic segmentation, transferability remains limited, reducing their effectiveness in real-world applications. To address this, we introduce the Feature Similarity Projected Gradient Descent (FSPGD) attack, a novel black-box approach that enhances both attack performance and transferability. Unlike conventional segmentation attacks that rely on output predictions for gradient calculation, FSPGD computes gradients from intermediate layer features. Specifically, our method introduces a loss function that targets local information by comparing features between clean images and adversarial examples, while also disrupting contextual information by accounting for spatial relationships between objects. Experiments on Pascal VOC 2012 and Cityscapes datasets demonstrate that FSPGD achieves superior transferability and attack performance, establishing a new state-of-the-art benchmark. Code is available at this https URL.
zh
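To make the core idea concrete, namely running PGD on a feature-space loss computed at an intermediate layer rather than on output predictions, here is a toy version in which a fixed linear map stands in for the network's intermediate features, so the gradient is analytic. The real attack differentiates through a deep segmentation model and adds the contextual term; every name and parameter here is illustrative:

```python
import numpy as np

def fspgd_toy(x_clean, W, eps=0.5, alpha=0.1, steps=20, seed=0):
    """Toy PGD maximizing a feature-space distance.

    Features are f(x) = W @ x (a stand-in for intermediate activations),
    so the gradient of L = ||W x_adv - W x_clean||^2 is closed-form.
    """
    rng = np.random.default_rng(seed)
    x_adv = x_clean + rng.uniform(-eps, eps, size=x_clean.shape)  # random start
    f_clean = W @ x_clean
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ x_adv - f_clean)            # dL/dx_adv
        x_adv = x_adv + alpha * np.sign(grad)               # ascent step
        x_adv = x_clean + np.clip(x_adv - x_clean, -eps, eps)  # L-inf projection
    return x_adv
```

Because the loss lives in feature space, the resulting perturbation is not tied to one model's decision boundary, which is the intuition behind the improved transferability claimed above.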
[CV-36] Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
【速读】: This paper addresses the erroneous vision-language associations caused by over-reliance on future frames when pre-training vision-language representations. The key is Action Temporal Coherence Learning (AcTOL), which learns ordered and continuous vision-language representations without rigid goal-based constraints by contrasting semantic differences between frames to reflect their natural ordering, and by imposing a local Brownian bridge constraint to ensure smooth transitions across intermediate frames.
链接: https://arxiv.org/abs/2502.01218
作者: Zhizhen Zhang,Lei Zhu,Zhen Fang,Zi Huang,Yadan Luo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Pre-training vision-language representations on human action videos has emerged as a promising approach to reduce reliance on large-scale expert demonstrations for training embodied agents. However, prior methods often employ time contrastive learning based on goal-reaching heuristics, progressively aligning language instructions from the initial to the final frame. This overemphasis on future frames can result in erroneous vision-language associations, as actions may terminate early or include irrelevant moments in the end. To address this issue, we propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraint. AcTOL treats a video as a continuous trajectory where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge constraint to ensure smooth transitions across intermediate frames. Extensive imitation learning experiments across varying numbers of demonstrations show that the pretrained features significantly enhance downstream manipulation tasks by up to 49% with high robustness to different linguistic styles of instructions, offering a viable pathway toward generalized embodied agents. The source code is included in the supplementary material for reference.
zh
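The local Brownian bridge constraint can be sketched as a penalty that pulls each intermediate frame embedding toward the interpolation of the endpoint embeddings, with the squared deviation scaled by the bridge variance t(1-t). This uses the textbook bridge mean/variance; the paper's exact parameterization may differ:

```python
import numpy as np

def brownian_bridge_penalty(z0, zt, zT, t):
    """Bridge-style consistency penalty for a frame embedding zt at
    relative time t in (0, 1) along a trajectory from z0 to zT."""
    assert 0.0 < t < 1.0
    mean = (1.0 - t) * z0 + t * zT          # bridge mean: linear interpolation
    var = t * (1.0 - t)                     # bridge variance, largest mid-clip
    return float(np.sum((zt - mean) ** 2) / (2.0 * var))
```

The variance term makes the constraint loose in the middle of the clip and tight near the endpoints, which is what allows smooth but non-rigid transitions between frames.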
[CV-37] Exploring Few-Shot Defect Segmentation in General Industrial Scenarios with Metric Learning and Vision Foundation Models
【速读】: This paper addresses the scarcity of defect samples in industrial defect segmentation across diverse, complex scenarios, noting that existing work is mostly limited to defects on simple textures. The key is a novel, efficient few-shot defect segmentation method based on feature matching, together with the finding that the Segment Anything (SAM2) model is particularly effective in its video track mode. The paper also contributes a new real-world dataset, reorganizes several existing datasets into a more comprehensive benchmark, and systematically studies the applicability of Vision Foundation Models (VFMs) to this task.
链接: https://arxiv.org/abs/2502.01216
作者: Tongkun Liu,Bing Li,Xiao Jin,Yupeng Shi,Qiuying Li,Xiang Wei
机构: State Key Laboratory for Manufacturing System Engineering, Xi’an Jiaotong University (西安交通大学制造系统工程国家重点实验室); International Joint Research Laboratory for Micro/Nano Manufacturing and Measurement Technologies, Xi’an Jiaotong University (西安交通大学微纳制造与测量技术国际联合研究实验室); Mechanical Engineering Program, Physical Science and Engineering Division, King Abdullah University of Science and Technology (KAUST) (阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Industrial defect segmentation is critical for manufacturing quality control. Due to the scarcity of training defect samples, few-shot semantic segmentation (FSS) holds significant value in this field. However, existing studies mostly apply FSS to tackle defects on simple textures, without considering more diverse scenarios. This paper aims to address this gap by exploring FSS in broader industrial products with various defect types. To this end, we contribute a new real-world dataset and reorganize some existing datasets to build a more comprehensive few-shot defect segmentation (FDS) benchmark. On this benchmark, we thoroughly investigate metric learning-based FSS methods, including those based on meta-learning and those based on Vision Foundation Models (VFMs). We observe that existing meta-learning-based methods are generally not well-suited for this task, while VFMs hold great potential. We further systematically study the applicability of various VFMs in this task, involving two paradigms: feature matching and the use of Segment Anything (SAM) models. We propose a novel efficient FDS method based on feature matching. Meanwhile, we find that SAM2 is particularly effective for addressing FDS through its video track mode. The contributed dataset and code will be available at: this https URL.
zh
[CV-38] Land Surface Temperature Super-Resolution with a Scale-Invariance-Free Neural Approach: Application to MODIS
【速读】: This paper addresses the trade-off between the temporal and spatial resolution of thermal spaceborne sensors by proposing a Scale-Invariance-Free approach for training neural network models to produce higher-resolution Land Surface Temperature (LST) maps. The key is a training scheme that drops the scale-invariance hypothesis: two neural network models, named SIF-CNN-SR, are trained to produce high-resolution LST maps that recover the initial LST when degraded back to low resolution, while containing fine-scale textures informed by the high-resolution NDVI. This avoids the reliance on scale invariance of traditional methods and improves super-resolution performance.
链接: https://arxiv.org/abs/2502.01204
作者: Romuald Ait-Bachir(ODYSSEY, IMT Atlantique - MEE, Lab-STICC_OSE),Carlos Granero-Belinchon(ODYSSEY, IMT Atlantique - MEE, Lab-STICC_OSE),Aurélie Michel,Julien Michel(CESBIO, CNES),Xavier Briottet,Lucas Drumetz(Lab-STICC_OSE, IMT Atlantique - MEE, ODYSSEY)
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Due to the trade-off between the temporal and spatial resolution of thermal spaceborne sensors, super-resolution methods have been developed to provide fine-scale Land Surface Temperature (LST) maps. Most of them are trained at low resolution but applied at fine resolution, and so they require a scale-invariance hypothesis that is not always adapted. The main contribution of this work is the introduction of a Scale-Invariance-Free approach for training Neural Network (NN) models, and the implementation of two NN models, called Scale-Invariance-Free Convolutional Neural Network for Super-Resolution (SIF-CNN-SR), for the super-resolution of MODIS LST products. The Scale-Invariance-Free approach consists of training the models in order to provide LST maps at high spatial resolution that recover the initial LST when they are degraded at low resolution and that contain fine-scale textures informed by the high resolution NDVI. The second contribution of this work is the release of a test database with ASTER LST images concomitant with MODIS ones that can be used for evaluation of super-resolution algorithms. We compare the two proposed models, SIF-CNN-SR1 and SIF-CNN-SR2, with four state-of-the-art methods, Bicubic, DMS, ATPRK, Tsharp, and a CNN sharing the same architecture as SIF-CNN-SR but trained under the scale-invariance hypothesis. We show that SIF-CNN-SR1 outperforms the state-of-the-art methods and the other two CNN models as evaluated with LPIPS and Fourier space metrics focusing on the analysis of textures. These results and the available ASTER-MODIS database for evaluation are promising for future studies on super-resolution of LST.
zh
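The Scale-Invariance-Free training idea, that the super-resolved map must recover the observed low-resolution LST once degraded, can be sketched as a consistency term. Block averaging is an assumed stand-in for the sensor's actual degradation model:

```python
import numpy as np

def block_downsample(x, factor):
    """Degrade a high-resolution map by block averaging.
    Assumes both dimensions of x are divisible by `factor`."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def reconstruction_consistency(lst_sr, lst_low, factor=4):
    """Consistency term (sketch): the super-resolved LST map, degraded
    back to low resolution, should recover the observed low-res LST."""
    return float(np.mean((block_downsample(lst_sr, factor) - lst_low) ** 2))
```

Minimizing this term at the target resolution is what removes the need to train at low resolution and extrapolate under a scale-invariance hypothesis.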
[CV-39] One-to-Normal: Anomaly Personalization for Few-shot Anomaly Detection NEURIPS2024
【速读】: This paper addresses the accuracy limitations of recent few-shot anomaly detection methods, in particular the loss of precision caused by directly comparing a query image's features with those of a few normal images. The key is an anomaly personalization method that performs a personalized one-to-normal transformation of query images with a customized anomaly-free generative model, ensuring close alignment with the normal manifold, together with a triplet contrastive anomaly inference strategy that comprehensively compares the query with the generated anomaly-free data pool and prompt information, enhancing the stability and robustness of predictions. Extensive evaluation on eleven datasets across three domains demonstrates its effectiveness, and the method transfers flexibly to other anomaly detection methods, improving their performance.
链接: https://arxiv.org/abs/2502.01201
作者: Yiyue Li,Shaoting Zhang,Kang Li,Qicheng Lao
机构: West China Biomedical Big Data Center, West China Hospital, Sichuan University (四川大学华西医院生物医学大数据中心); School of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学人工智能学院); Sichuan University Pittsburgh Institute, Sichuan University (四川大学匹兹堡学院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: In The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS2024)
点击查看摘要
Abstract:Traditional Anomaly Detection (AD) methods have predominantly relied on unsupervised learning from extensive normal data. Recent AD methods have evolved with the advent of large pre-trained vision-language models, enhancing few-shot anomaly detection capabilities. However, these latest AD methods still exhibit limitations in accuracy improvement. One contributing factor is their direct comparison of a query image’s features with those of few-shot normal images. This direct comparison often leads to a loss of precision and complicates the extension of these techniques to more complex domains–an area that remains underexplored in a more refined and comprehensive manner. To address these limitations, we introduce the anomaly personalization method, which performs a personalized one-to-normal transformation of query images using an anomaly-free customized generation model, ensuring close alignment with the normal manifold. Moreover, to further enhance the stability and robustness of prediction results, we propose a triplet contrastive anomaly inference strategy, which incorporates a comprehensive comparison between the query and generated anomaly-free data pool and prompt information. Extensive evaluations across eleven datasets in three domains demonstrate our model’s effectiveness compared to the latest AD methods. Additionally, our method has been proven to transfer flexibly to other AD methods, with the generated image data effectively improving the performance of other AD methods.
zh
[CV-40] Nearly Lossless Adaptive Bit Switching
【速读】: This paper addresses the substantial training and storage costs caused by fixed bit-width settings under varying hardware and transmission demands in model quantization. The key contributions are the Double Rounding quantization method, which reduces storage overhead by storing the highest integer precision instead of full precision while enabling nearly lossless bit-switching, and an Adaptive Learning Rate Scaling (ALRS) technique that resolves the gradient inconsistency across precisions during one-shot joint training. The paper further extends Double Rounding to one-shot mixed-precision training and develops a Hessian-Aware Stochastic Bit-switching (HASB) strategy, together improving both multi-precision and mixed-precision training.
链接: https://arxiv.org/abs/2502.01199
作者: Haiduo Huang,Zhenhua Liu,Tian Xia,Wenzhe Zhao,Pengju Ren
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Model quantization is widely applied for compressing and accelerating deep neural networks (DNNs). However, conventional Quantization-Aware Training (QAT) focuses on training DNNs with uniform bit-width. The bit-width settings vary across different hardware and transmission demands, which induces considerable training and storage costs. Hence, the scheme of one-shot joint training multiple precisions is proposed to address this issue. Previous works either store a larger FP32 model to switch between different precision models for higher accuracy or store a smaller INT8 model but compromise accuracy due to using shared quantization parameters. In this paper, we introduce the Double Rounding quantization method, which fully utilizes the quantized representation range to accomplish nearly lossless bit-switching while reducing storage by using the highest integer precision instead of full precision. Furthermore, we observe a competitive interference among different precisions during one-shot joint training, primarily due to inconsistent gradients of quantization scales during backward propagation. To tackle this problem, we propose an Adaptive Learning Rate Scaling (ALRS) technique that dynamically adapts learning rates for various precisions to optimize the training process. Additionally, we extend our Double Rounding to one-shot mixed precision training and develop a Hessian-Aware Stochastic Bit-switching (HASB) strategy. Experimental results on the ImageNet-1K classification demonstrate that our methods have enough advantages to state-of-the-art one-shot joint QAT in both multi-precision and mixed-precision. We also validate the feasibility of our method on detection and segmentation tasks, as well as on LLMs task. Our codes are available at this https URL.
zh
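The bit-switching idea can be illustrated as follows: store only the highest-precision integer weights and derive lower bit-widths by a second rounding, instead of re-quantizing full-precision weights. This is a simplified sketch; the scale handling and clipping convention are assumptions, not the paper's exact scheme:

```python
import numpy as np

def quantize(w, bits, scale):
    """Uniform symmetric quantization of weights to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax)

def double_rounding_switch(q_hi, hi_bits, lo_bits):
    """Derive a lower-precision tensor from stored high-precision
    integers by a second rounding (divide by a power of two, round,
    clip), avoiding any access to the full-precision weights."""
    shift = hi_bits - lo_bits
    q_lo = np.round(q_hi / 2.0 ** shift)
    qmax = 2 ** (lo_bits - 1) - 1
    return np.clip(q_lo, -qmax - 1, qmax)
```

Because only the highest-precision integer tensor is stored, switching to any lower bit-width is a cheap elementwise operation, which is what makes the one-shot multi-precision deployment practical.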
[CV-41] Towards Robust and Reliable Concept Representations: Reliability-Enhanced Concept Embedding Model
【速读】: This paper addresses the challenge Concept Bottleneck Models (CBMs) face in ensuring reliable concept representations, whose failure can propagate to downstream tasks and degrade performance, especially under distribution shifts. Two issues are identified: sensitivity to concept-irrelevant features (e.g., background variations) and lack of semantic consistency of the same concept across samples. The key of the proposed Reliability-Enhanced Concept Embedding Model (RECEM) is Concept-Level Disentanglement, which separates irrelevant features from concept-relevant information, and a Concept Mixup mechanism that ensures semantic alignment across samples. Together these improve concept reliability, letting the model focus on meaningful object attributes and generate faithful concept representations.
链接: https://arxiv.org/abs/2502.01191
作者: Yuxuan Cai,Xiyu Wang,Satoshi Tsutsui,Winnie Pang,Bihan Wen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Concept Bottleneck Models (CBMs) aim to enhance interpretability by predicting human-understandable concepts as intermediates for decision-making. However, these models often face challenges in ensuring reliable concept representations, which can propagate to downstream tasks and undermine robustness, especially under distribution shifts. Two inherent issues contribute to concept unreliability: sensitivity to concept-irrelevant features (e.g., background variations) and lack of semantic consistency for the same concept across different samples. To address these limitations, we propose the Reliability-Enhanced Concept Embedding Model (RECEM), which introduces a two-fold strategy: Concept-Level Disentanglement to separate irrelevant features from concept-relevant information and a Concept Mixup mechanism to ensure semantic alignment across samples. These mechanisms work together to improve concept reliability, enabling the model to focus on meaningful object attributes and generate faithful concept representations. Experimental results demonstrate that RECEM consistently outperforms existing baselines across multiple datasets, showing superior performance under background and domain shifts. These findings highlight the effectiveness of disentanglement and alignment strategies in enhancing both reliability and robustness in CBMs.
zh
[CV-42] A High-Accuracy SSIM-based Scoring System for Coin Die Link Identification
【速读】: This paper addresses the difficulty of identifying coins struck with the same die (die link detection) in ancient coin analysis, where manual identification becomes extremely laborious or even impossible for large hoards. The key contributions are a publicly accessible labeled dataset of coin pictures (329 images), an SSIM-based scoring method for rapid and accurate discrimination of coin pairs, and an evaluation of clustering techniques using this score that achieves near-perfect die link identification. Together these contributions foster more powerful tools for archaeology, and for numismatics in particular.
链接: https://arxiv.org/abs/2502.01186
作者: Patrice Labedan,Nicolas Drougard,Alexandre Berezin,Guowei Sun,Francis Dieulafait
机构: ISAE-SUPAERO (ISAE-SUPAERO), Université de Toulouse (图卢兹大学), France (法国); Hades, Bureau d’investigations archéologiques (哈德斯考古调查局), L’Union, France (法国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The analyses of ancient coins, and especially the identification of those struck with the same die, provides invaluable information for archaeologists and historians. Nowadays, these die links are identified manually, which makes the process laborious, if not impossible when big treasures are discovered as the number of comparisons is too large. This study introduces advances that promise to streamline and enhance archaeological coin analysis. Our contributions include: 1) First publicly accessible labeled dataset of coin pictures (329 images) for die link detection, facilitating method benchmarking; 2) Novel SSIM-based scoring method for rapid and accurate discrimination of coin pairs, outperforming current techniques used in this research field; 3) Evaluation of clustering techniques using our score, demonstrating near-perfect die link identification. We provide datasets, to foster future research and the development of even more powerful tools for archaeology, and more particularly for numismatics.
zh
[CV-43] Enhancing Environmental Robustness in Few-shot Learning via Conditional Representation Learning
[Quick Read]: This paper addresses the marked performance drop that few-shot learning (FSL) models suffer in real-world testing due to environmental factors. Current research overlooks "environmental robustness", a model's ability to maintain consistent performance in complex and variable physical environments. To fill this gap, the paper introduces a new real-world multi-domain few-shot learning benchmark (RD-FSL) covering four domains and six evaluation datasets, and proposes a novel conditional representation learning network (CRLNet). The key idea of CRLNet is to integrate the interactions between training and test images as conditional information in their respective representation processes, reducing intra-class variance or enhancing inter-class variance and thereby improving FSL performance. Experiments show that CRLNet improves on existing methods by 6.83% to 16.98%.
Link: https://arxiv.org/abs/2502.01183
Authors: Qianyu Guo, Jingrong Wu, Tianxing Wu, Haofen Wang, Weifeng Ge, Wenqiang Zhang
Affiliations: School of Computer Science, Fudan University; Shanghai Institute of Virology, Shanghai Jiao Tong University School of Medicine; School of Computer Science and Engineering, Southeast University, Nanjing, 210096, China; College of Design and Innovation, Tongji University; Engineering Research Center of AI & Robotics, Ministry of Education, Academy for Engineering & Technology, Fudan University, Shanghai, 20043, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures. Accepted by IEEE Transactions on Image Processing
Click to view abstract
Abstract:Few-shot learning (FSL) has recently been extensively utilized to overcome the scarcity of training data in domain-specific visual recognition. In real-world scenarios, environmental factors such as complex backgrounds, varying lighting conditions, long-distance shooting, and moving targets often cause test images to exhibit numerous incomplete targets or noise disruptions. However, current research on evaluation datasets and methodologies has largely ignored the concept of “environmental robustness”, which refers to maintaining consistent performance in complex and diverse physical environments. This neglect has led to a notable decline in the performance of FSL models during practical testing compared to their training performance. To bridge this gap, we introduce a new real-world multi-domain few-shot learning (RD-FSL) benchmark, which includes four domains and six evaluation datasets. The test images in this benchmark feature various challenging elements, such as camouflaged objects, small targets, and blurriness. Our evaluation experiments reveal that existing methods struggle to utilize training images effectively to generate accurate feature representations for challenging test images. To address this problem, we propose a novel conditional representation learning network (CRLNet) that integrates the interactions between training and testing images as conditional information in their respective representation processes. The main goal is to reduce intra-class variance or enhance inter-class variance at the feature representation level. Finally, comparative experiments reveal that CRLNet surpasses the current state-of-the-art methods, achieving performance improvements ranging from 6.83% to 16.98% across diverse settings and backbones. The source code and dataset are available at this https URL.
zh
[CV-44] BVINet: Unlocking Blind Video Inpainting with Zero Annotations
[Quick Read]: This paper addresses a known limitation of video inpainting: existing methods assume the locations of corrupted regions are known and focus mainly on "how to inpaint". That assumption requires manually annotated binary masks indicating "where to inpaint", a labor-intensive and expensive task that limits the practicality of current methods. The paper therefore proposes a new blind video inpainting setting in which the network learns to map a corrupted video directly to its inpainted result, with no annotations of the corrupted regions.
The key solution is the proposed end-to-end Blind Video Inpainting Network (BVINet), which tackles "where to inpaint" and "how to inpaint" simultaneously. BVINet predicts masks of corrupted regions by detecting semantically discontinuous regions within frames and exploiting the temporal-consistency prior of the video. The predicted masks are then incorporated into BVINet, letting it capture valid context from uncorrupted regions to fill the corrupted ones. A consistency loss further regularizes BVINet's training parameters, so mask prediction and video completion mutually constrain each other and the overall performance of the trained model is maximized.
Link: https://arxiv.org/abs/2502.01181
Authors: Zhiliang Wu, Kerui Chen, Kun Li, Hehe Fan, Yi Yang
Affiliations: ReLER Lab, CCAI, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Video inpainting aims to fill in corrupted regions of the video with plausible contents. Existing methods generally assume that the locations of corrupted regions are known, focusing primarily on the “how to inpaint”. This reliance necessitates manual annotation of the corrupted regions using binary masks to indicate “where to inpaint”. However, the annotation of these masks is labor-intensive and expensive, limiting the practicality of current methods. In this paper, we expect to relax this assumption by defining a new blind video inpainting setting, enabling the networks to learn the mapping from corrupted video to inpainted result directly, eliminating the need for corrupted region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) to address both “where to inpaint” and “how to inpaint” simultaneously. On the one hand, BVINet can predict the masks of corrupted regions by detecting semantic-discontinuous regions of the frame and utilizing temporal consistency prior of the video. On the other hand, the predicted masks are incorporated into the BVINet, allowing it to capture valid context information from uncorrupted regions to fill in corrupted ones. Besides, we introduce a consistency loss to regularize the training parameters of BVINet. In this way, mask prediction and video completion mutually constrain each other, thereby maximizing the overall performance of the trained model. Furthermore, we customize a dataset consisting of synthetic corrupted videos, real-world corrupted videos, and their corresponding completed videos. This dataset serves as a valuable resource for advancing blind video inpainting research. Extensive experimental results demonstrate the effectiveness and superiority of our method.
zh
[CV-45] Towards Agile Swarming in Real World: Onboard Relative Localization with Fast Tracking of Active Blinking Markers
[Quick Read]: This paper addresses robust real-time relative localization for multi-robot teams flying in tightly coupled formations, particularly in complex outdoor environments. Traditional tracking algorithms struggle with blinking markers that move fast and appear only intermittently in the camera frames. The key contribution is Active Blinking Marker Tracking (AMT), which uses weighted polynomial regression to predict the future appearance of active blinking markers while accounting for the uncertainty of the prediction. Outdoor experiments show that AMT outperforms state-of-the-art methods in tracking density, accuracy, and complexity.
Link: https://arxiv.org/abs/2502.01172
Authors: Tim Felix Lakemann, Daniel Bonilla Licea, Viktor Walter, Tomáš Báča, Martin Saska
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:A novel onboard tracking approach enabling vision-based relative localization and communication using Active blinking Marker Tracking (AMT) is introduced in this article. Active blinking markers on multi-robot team members improve the robustness of relative localization for aerial vehicles in tightly coupled swarms during real-world deployments, while also serving as a resilient communication channel. Traditional tracking algorithms struggle to track fast moving blinking markers due to their intermittent appearance in the camera frames. AMT addresses this by using weighted polynomial regression to predict the future appearance of active blinking markers while accounting for uncertainty in the prediction. In outdoor experiments, the AMT approach outperformed state-of-the-art methods in tracking density, accuracy, and complexity. The experimental validation of this novel tracking approach for relative localization involved testing motion patterns motivated by our research on agile multi-robot deployment.
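The core predictive step of AMT, forecasting where a blinking marker will reappear, can be sketched as a weighted polynomial fit over recent detections. The exponential recency weighting, polynomial degree, and single-coordinate setup below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def predict_marker(times, positions, t_next, degree=2):
    """Weighted polynomial regression over past (t, position) detections.

    Recent observations get larger weights; returns the predicted
    position at t_next for one image coordinate.
    """
    times = np.asarray(times, dtype=float)
    positions = np.asarray(positions, dtype=float)
    w = np.exp(-(times[-1] - times))  # exponentially down-weight old samples
    coeffs = np.polyfit(times, positions, deg=degree, w=w)
    return float(np.polyval(coeffs, t_next))

# Marker moving with constant velocity: 5 px per frame along x.
t = [0, 1, 2, 3, 4]
x = [10.0, 15.0, 20.0, 25.0, 30.0]
x_pred = predict_marker(t, x, t_next=5)  # linear motion is recovered exactly
```

In a full tracker the same fit would run per coordinate, and the residuals of the fit would feed the uncertainty estimate that gates the search window.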
zh
[CV-46] MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks
[Quick Read]: This paper targets the performance degradation caused by the small size of multimodal datasets and the high complexity of multimodal models. The key solution is the Modality-INformed knowledge Distillation (MIND) framework, which compresses models by transferring knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student. MIND employs multi-head joint fusion models, so unimodal encoders can be used for unimodal samples without imputing or masking absent modalities. Experiments show that MIND improves smaller multimodal networks on binary and multilabel clinical prediction tasks as well as on three non-medical multimodal multiclass datasets.
Link: https://arxiv.org/abs/2502.01158
Authors: Alejandro Guerra-Manzanares, Farah E. Shamout
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Transactions on Machine Learning Research (01/2025), this https URL
Click to view abstract
Abstract:Multimodal fusion leverages information across modalities to learn better feature representations with the goal of improving performance in fusion-based tasks. However, multimodal datasets, especially in medical settings, are typically smaller than their unimodal counterparts, which can impede the performance of multimodal models. Additionally, the increase in the number of modalities is often associated with an overall increase in the size of the multimodal network, which may be undesirable in medical use cases. Utilizing smaller unimodal encoders may lead to sub-optimal performance, particularly when dealing with high-dimensional clinical data. In this paper, we propose the Modality-INformed knowledge Distillation (MIND) framework, a multimodal model compression approach based on knowledge distillation that transfers knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student. The teacher models consist of unimodal networks, allowing the student to learn from diverse representations. MIND employs multi-head joint fusion models, as opposed to single-head models, enabling the use of unimodal encoders in the case of unimodal samples without requiring imputation or masking of absent modalities. As a result, MIND generates an optimized multimodal model, enhancing both multimodal and unimodal representations. It can also be leveraged to balance multimodal learning during training. We evaluate MIND on binary and multilabel clinical prediction tasks using time series data and chest X-ray images. Additionally, we assess the generalizability of the MIND framework on three non-medical multimodal multiclass datasets. Experimental results demonstrate that MIND enhances the performance of the smaller multimodal network across all five tasks, as well as various fusion methods and multimodal architectures, compared to state-of-the-art baselines.
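Distillation from an ensemble of unimodal teachers into a smaller student, the mechanism MIND builds on, reduces at its core to matching softened teacher and student distributions. A minimal temperature-scaled KL term is sketched below; the temperature, the plain averaging over teachers, and the logits are illustrative stand-ins, since MIND's actual losses and fusion heads are more involved.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits_list, T=2.0):
    """KL(teacher_avg || student) with temperature-softened distributions,
    teachers averaged as a simple ensemble."""
    p_t = np.mean([softmax(t, T) for t in teacher_logits_list], axis=0)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

# Two "unimodal teachers" that roughly agree on class 0.
teachers = [np.array([2.0, 0.5, -1.0]), np.array([1.8, 0.7, -0.9])]
aligned = distillation_loss(np.array([1.9, 0.6, -0.95]), teachers)  # student matches
off     = distillation_loss(np.array([-1.0, 0.5, 2.0]), teachers)   # student disagrees
```

The loss is near zero when the student reproduces the teachers' softened distribution and grows as it diverges, which is what drives the transfer.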
zh
[CV-47] Radiant Foam: Real-Time Differentiable Ray Tracing
[Quick Read]: This paper addresses the difficulty of implementing light-transport phenomena such as reflection and refraction once rasterization is used to speed up rendering. The key solution is a new scene representation, Radiant Foam, which leverages an efficient volumetric mesh ray-tracing algorithm to avoid the approximations introduced by rasterization while keeping rendering speed and quality comparable to Gaussian Splatting, without requiring any special hardware or APIs.
Link: https://arxiv.org/abs/2502.01157
Authors: Shrisudhan Govindarajan, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Research on differentiable scene representations is consistently moving towards more efficient, real-time models. Recently, this has led to the popularization of splatting methods, which eschew the traditional ray-based rendering of radiance fields in favor of rasterization. This has yielded a significant improvement in rendering speeds due to the efficiency of rasterization algorithms and hardware, but has come at a cost: the approximations that make rasterization efficient also make implementation of light transport phenomena like reflection and refraction much more difficult. We propose a novel scene representation which avoids these approximations, but keeps the efficiency and reconstruction quality of splatting by leveraging a decades-old efficient volumetric mesh ray tracing algorithm which has been largely overlooked in recent computer vision research. The resulting model, which we name Radiant Foam, achieves rendering speed and quality comparable to Gaussian Splatting, without the constraints of rasterization. Unlike ray traced Gaussian models that use hardware ray tracing acceleration, our method requires no special hardware or APIs beyond the standard features of a programmable GPU.
zh
[CV-48] Learning to Learn Weight Generation via Trajectory Diffusion
[Quick Read]: This paper tackles the limited cross-task transferability of diffusion-based weight generation and the fact that existing methods train only on optimal weights. The proposed Lt-Di integrates the diffusion algorithm with meta-learning to generate weights for unseen tasks, and extends the vanilla diffusion algorithm into a trajectory diffusion algorithm that exploits the other weights encountered along the optimization trajectory. The key is to decompose the full diffusion chain into multiple shorter sub-chains, improving training and inference efficiency; the paper also analyzes the convergence properties of the weight-generation paradigm and improves convergence efficiency without additional time overhead.
Link: https://arxiv.org/abs/2502.01117
Authors: Yunchuan Guan, Yu Liu, Ke Zhou, Zhiqi Shen, Serge Belongie, Jenq-Neng Hwang, Lei Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Diffusion-based algorithms have emerged as promising techniques for weight generation, particularly in scenarios like multi-task learning that require frequent weight updates. However, existing solutions suffer from limited cross-task transferability. In addition, they only utilize optimal weights as training samples, ignoring the value of other weights in the optimization process. To address these issues, we propose Lt-Di, which integrates the diffusion algorithm with meta-learning to generate weights for unseen tasks. Furthermore, we extend the vanilla diffusion algorithm into a trajectory diffusion algorithm to utilize other weights along the optimization trajectory. Trajectory diffusion decomposes the entire diffusion chain into multiple shorter ones, improving training and inference efficiency. We analyze the convergence properties of the weight generation paradigm and improve convergence efficiency without additional time overhead. Our experiments demonstrate Lt-Di’s higher accuracy while reducing computational overhead across various tasks, including zero-shot and few-shot learning, multi-domain generalization, and large-scale language model fine-tuning. Our code is released at this https URL.
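The decomposition that trajectory diffusion relies on, splitting one long diffusion chain into several contiguous shorter sub-chains, can be sketched at the step-index level. The chain length and sub-chain count here are arbitrary illustrative values.

```python
def split_chain(num_steps, num_subchains):
    """Partition diffusion steps 0..num_steps-1 into contiguous sub-chains,
    distributing any remainder over the earliest sub-chains."""
    base, extra = divmod(num_steps, num_subchains)
    chains, start = [], 0
    for i in range(num_subchains):
        size = base + (1 if i < extra else 0)
        chains.append(list(range(start, start + size)))
        start += size
    return chains

chains = split_chain(10, 3)  # -> [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each sub-chain can then be trained or sampled independently of the others, which is where the efficiency gain comes from.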
zh
[CV-49] LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
[Quick Read]: This paper addresses the challenge of generating cognitively aligned layered SVGs (Scalable Vector Graphics): existing methods either produce oversimplified single-layer outputs or suffer from optimization-induced shape redundancy. The key solution is LayerTracer, a diffusion-transformer-based framework that bridges this gap by learning how designers create layered SVGs from a novel dataset of sequential design operations. LayerTracer works in two stages: a text-conditioned DiT (Diffusion Transformer) first generates multi-phase rasterized construction blueprints that simulate human design workflows; layer-wise vectorization with path deduplication then produces clean, editable SVGs. For image vectorization, a conditional diffusion mechanism encodes reference images into latent tokens to guide hierarchical reconstruction while preserving structural integrity.
Link: https://arxiv.org/abs/2502.01105
Authors: Yiren Song, Danze Chen, Mike Zheng Shou
Affiliations: National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Generating cognitive-aligned layered SVGs remains challenging due to existing methods’ tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a diffusion transformer based framework that bridges this gap by learning designers’ layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: First, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows. Second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments demonstrate LayerTracer’s superior performance against optimization-based and neural baselines in both generation quality and editability, effectively aligning AI-generated vectors with professional design cognition.
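The path-deduplication step of the vectorization stage can be illustrated by removing repeated path geometries while preserving layer order. SVG parsing is elided; paths are represented as plain path-data strings, and exact string equality stands in for whatever geometric comparison LayerTracer actually uses.

```python
def dedupe_paths(layers):
    """Keep only the first occurrence of each path geometry across layers.

    `layers` is a list of (layer_index, path_data) pairs in draw order.
    """
    seen = set()
    out = []
    for layer, d in layers:
        if d not in seen:
            seen.add(d)
            out.append((layer, d))
    return out

layers = [
    (0, "M0 0 L10 0 L10 10 Z"),
    (1, "M0 0 L10 0 L10 10 Z"),  # duplicate re-emitted by a later layer
    (1, "M2 2 L8 2 L8 8 Z"),
]
clean = dedupe_paths(layers)  # two unique paths remain, order preserved
```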
zh
[CV-50] VidSketch: Hand-drawn Sketch-Driven Video Generation with Diffusion Control
[Quick Read]: This paper addresses the generation of high-quality video animations from hand-drawn sketches; existing methods are limited to static image generation and cannot control video animation generation with sketches. The key solution, VidSketch, introduces a Level-Based Sketch Control Strategy that automatically adjusts the guidance strength of the sketches during generation, and a TempSpatial Attention mechanism that enhances the spatiotemporal consistency of the generated animations, markedly improving coherence across frames.
Link: https://arxiv.org/abs/2502.01101
Authors: Lifan Jiang, Shuang Chen, Boxi Wu, Xiaotong Guan, Jiahui Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 15 figures
Click to view abstract
Abstract:With the advancement of generative artificial intelligence, previous studies have achieved the task of generating aesthetic images from hand-drawn sketches, fulfilling the public’s needs for drawing. However, these methods are limited to static images and lack the ability to control video animation generation using hand-drawn sketches. To address this gap, we propose VidSketch, the first method capable of generating high-quality video animations directly from any number of hand-drawn sketches and simple text prompts, bridging the divide between ordinary users and professional artists. Specifically, our method introduces a Level-Based Sketch Control Strategy to automatically adjust the guidance strength of sketches during the generation process, accommodating users with varying drawing skills. Furthermore, a TempSpatial Attention mechanism is designed to enhance the spatiotemporal consistency of generated video animations, significantly improving the coherence across frames. You can find more detailed cases on our official website.
zh
[CV-51] SatFlow: Generative model based framework for producing High Resolution Gap Free Remote Sensing Imagery
[Quick Read]: This paper addresses the gap between the need for high-resolution, frequently updated remote sensing imagery in agricultural and environmental monitoring and the reality of lower-frequency observations contaminated by clouds. The key solution is SatFlow, a generative framework trained via Conditional Flow Matching that fuses low-resolution MODIS imagery with high-resolution Landsat observations to produce frequent, high-resolution, gap-free surface reflectance imagery. Cloud occlusion is handled as an image inpainting task, allowing the model to reliably fill cloud-covered regions and support downstream applications such as crop phenology tracking and environmental change detection.
Link: https://arxiv.org/abs/2502.01098
Authors: Bharath Irigireddy, Varaprasad Bandaru
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Frequent, high-resolution remote sensing imagery is crucial for agricultural and environmental monitoring. Satellites from the Landsat collection offer detailed imagery at 30m resolution but with lower temporal frequency, whereas missions like MODIS and VIIRS provide daily coverage at coarser resolutions. Clouds and cloud shadows contaminate about 55% of the optical remote sensing observations, posing additional challenges. To address these challenges, we present SatFlow, a generative model-based framework that fuses low-resolution MODIS imagery and Landsat observations to produce frequent, high-resolution, gap-free surface reflectance imagery. Our model, trained via Conditional Flow Matching, demonstrates better performance in generating imagery with preserved structural and spectral integrity. Cloud imputation is treated as an image inpainting task, where the model reconstructs cloud-contaminated pixels and fills gaps caused by scan lines during inference by leveraging the learned generative processes. Experimental results demonstrate the capability of our approach in reliably imputing cloud-covered regions. This capability is crucial for downstream applications such as crop phenology tracking and environmental change detection.
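Conditional Flow Matching, the training objective SatFlow uses, regresses a velocity field along interpolation paths between noise and data. A minimal version of the target construction for the common straight-line path is shown below; the network itself and the conditioning on MODIS/Landsat inputs are omitted, so this is only the data side of the objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_pair(x1):
    """Sample one (x_t, t, target velocity) training triple for flow matching
    with a straight-line probability path from noise x0 to data x1."""
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.uniform()                   # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight-line path
    v_target = x1 - x0                  # constant target velocity along the path
    return x_t, t, v_target

x1 = rng.standard_normal(8)             # stands in for a clean reflectance patch
x_t, t, v = cfm_pair(x1)

# Sanity check of the interpolation identity: following the target
# velocity from x_t for the remaining time (1 - t) lands exactly on x1.
recon = x_t + (1.0 - t) * v
```

During training, a network v_theta(x_t, t, conditioning) would be regressed onto v_target with a squared loss.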
zh
[CV-52] Enhancing Feature Tracking Reliability for Visual Navigation using Real-Time Safety Filter ICRA2025
[Quick Read]: This paper addresses the reliability of feature tracking and the accuracy of pose estimation during visual navigation, under the requirement of keeping a sufficient number of features visible. The key solution is a real-time safety filter based on quadratic programming: exploiting the invariance properties of visibility constraints within the robot's kinematic model, it deviates minimally from a reference velocity command while keeping the information score of the currently visible features above a user-specified threshold.
Link: https://arxiv.org/abs/2502.01092
Authors: Dabin Kim, Inkyu Jang, Youngsoo Han, Sunwoo Hwang, H. Jin Kim
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 7 pages, 6 figures. Accepted to the 2025 IEEE International Conference on Robotics and Automation (ICRA 2025)
Click to view abstract
Abstract:Vision sensors are extensively used for localizing a robot’s pose, particularly in environments where global localization tools such as GPS or motion capture systems are unavailable. In many visual navigation systems, localization is achieved by detecting and tracking visual features or landmarks, which provide information about the sensor’s relative pose. For reliable feature tracking and accurate pose estimation, it is crucial to maintain visibility of a sufficient number of features. This requirement can sometimes conflict with the robot’s overall task objective. In this paper, we approach it as a constrained control problem. By leveraging the invariance properties of visibility constraints within the robot’s kinematic model, we propose a real-time safety filter based on quadratic programming. This filter takes a reference velocity command as input and produces a modified velocity that minimally deviates from the reference while ensuring the information score from the currently visible features remains above a user-specified threshold. Numerical simulations demonstrate that the proposed safety filter preserves the invariance condition and ensures the visibility of more features than the required minimum. We also validated its real-world performance by integrating it into a visual simultaneous localization and mapping (SLAM) algorithm, where it maintained high estimation quality in challenging environments, outperforming a simple tracking controller.
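The safety filter solves, at each control step, a small QP: stay as close as possible to the reference velocity while a visibility condition holds. For a single linear constraint a·u >= b the QP has the closed-form projection below; the constraint vector and threshold are illustrative stand-ins for the paper's information-score condition, which is more elaborate.

```python
import numpy as np

def safety_filter(u_ref, a, b):
    """minimize ||u - u_ref||^2  subject to  a @ u >= b  (one halfspace).

    If the reference already satisfies the constraint it is returned
    unchanged; otherwise u_ref is projected onto the constraint boundary.
    """
    u_ref = np.asarray(u_ref, dtype=float)
    a = np.asarray(a, dtype=float)
    if a @ u_ref >= b:
        return u_ref
    return u_ref + (b - a @ u_ref) / (a @ a) * a

a = np.array([1.0, 0.0])                     # toy "visibility" direction
safe  = safety_filter([2.0, 1.0], a, b=1.0)  # already safe -> unchanged
fixed = safety_filter([0.2, 1.0], a, b=1.0)  # projected onto a @ u = 1
```

With several constraints a generic QP solver replaces the closed form, but the structure (minimal deviation subject to a visibility bound) is the same.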
zh
[CV-53] BC-GAN: A Generative Adversarial Network for Synthesizing a Batch of Collocated Clothing
[Quick Read]: This paper addresses the limitation that existing methods synthesize only one collocated clothing item at a time, which cannot meet users' diverse needs across occasions and personal preferences. The key solution is BC-GAN, a new batch clothing generation framework whose novel fashion-compatibility discriminator, designed from a contrastive learning perspective, fully exploits the collocation relationships among all clothing items. This enables the simultaneous synthesis of multiple visually collocated clothing images with improved diversity, visual authenticity, and fashion compatibility.
Link: https://arxiv.org/abs/2502.01080
Authors: Dongliang Zhou, Haijun Zhang, Jianghong Ma, Jianyang Shi
Affiliations: Department of Computer Science, Harbin Institute of Technology, Shenzhen, Xili University Town, Shenzhen 518055, P. R. China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: This paper was accepted by IEEE TCSVT
Click to view abstract
Abstract:Collocated clothing synthesis using generative networks has become an emerging topic in the field of fashion intelligence, as it has significant potential economic value to increase revenue in the fashion industry. In previous studies, several works have attempted to synthesize visually-collocated clothing based on a given clothing item using generative adversarial networks (GANs) with promising results. These works, however, can only accomplish the synthesis of one collocated clothing item each time. Nevertheless, users may require different clothing items to meet their multiple choices due to their personal tastes and different dressing scenarios. To address this limitation, we introduce a novel batch clothing generation framework, named BC-GAN, which is able to synthesize multiple visually-collocated clothing images simultaneously. In particular, to further improve the fashion compatibility of synthetic results, BC-GAN proposes a new fashion compatibility discriminator in a contrastive learning perspective by fully exploiting the collocation relationship among all clothing items. Our model was examined in a large-scale dataset with compatible outfits constructed by ourselves. Extensive experiment results confirmed the effectiveness of our proposed BC-GAN in comparison to state-of-the-art methods in terms of diversity, visual authenticity, and fashion compatibility.
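The contrastive view behind BC-GAN's fashion-compatibility discriminator, pulling embeddings of items from the same outfit together and pushing others apart, can be sketched with a cosine-similarity InfoNCE term. The embeddings are random stand-ins and the temperature is an assumption; the paper's actual discriminator architecture is not reproduced here.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: low when the anchor is closest to its
    collocated positive, high when a non-collocated item ranks first."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = logits / temperature
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(p[0]))

rng = np.random.default_rng(0)
anchor = rng.standard_normal(16)
positive = anchor + 0.05 * rng.standard_normal(16)  # compatible item: nearby
negatives = [rng.standard_normal(16) for _ in range(4)]

loss_good = info_nce(anchor, positive, negatives)
loss_bad = info_nce(anchor, negatives[0], [positive] + negatives[1:])
```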
zh
[CV-54] OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
[Quick Read]: This paper addresses the difficulty of scaling existing end-to-end human animation methods into large general-purpose video generation models, which limits their potential in real applications. The key solution is OmniHuman, a Diffusion-Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. The framework introduces two training principles, together with the corresponding model architecture and inference strategy, enabling it to fully exploit data-driven motion generation and achieve highly realistic human video generation. OmniHuman is also broadly applicable, supporting various portrait contents and motion modalities and handling complex scenes and poses.
Link: https://arxiv.org/abs/2502.01061
Authors: Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang
Affiliations: ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Click to view abstract
Abstract:End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the recent few years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (this https URL).
zh
[CV-55] Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
【速读】:该论文旨在解决扩散模型(Diffusion Models)在步长级别(step-level)偏好优化中的挑战,特别是如何更有效地与人类图像偏好对齐。传统方法通常依赖于视觉语言模型(Vision-Language Models, VLMs)作为像素级奖励模型来近似人类偏好,但这些方法在处理不同时间步(timesteps)的噪声图像时面临困难,并且需要复杂的像素空间转换。论文的关键解决方案在于提出了一种潜空间奖励模型(Latent Reward Model, LRM),该模型重新利用扩散模型的组件来预测不同时间步下的噪声潜图像(latent images)的偏好。基于LRM,作者进一步提出了潜空间偏好优化(Latent Preference Optimization, LPO),这是一种直接在潜空间进行步长级别偏好优化的方法。实验结果表明,LPO不仅显著提升了扩散模型与一般、美学及文本-图像对齐偏好的一致性,还实现了2.5到28倍的训练速度提升。
链接: https://arxiv.org/abs/2502.01051
作者: Tao Zhang,Cheng Da,Kun Ding,Kun Jin,Yan Li,Tingting Gao,Di Zhang,Shiming Xiang,Chunhong Pan
机构: MAIS, CASIA(模式识别国家重点实验室, 中科院自动化所); Kuaishou Technology(快手科技); School of Artificial Intelligence, UCAS(中国科学院大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 14 tables, 15 figures
点击查看摘要
Abstract:Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space, as they can naturally extract features from noisy latent images. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of diffusion models to predict preferences of latent images at various timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space. Experimental results indicate that LPO not only significantly enhances performance in aligning diffusion models with general, aesthetic, and text-image alignment preferences, but also achieves 2.5-28 \times training speedup compared to existing preference optimization methods. Our code will be available at this https URL.
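Step-level preference optimization of the kind LPO performs ultimately reduces to a pairwise preference loss over a reward margin between a preferred and a dispreferred sample at a given timestep. A generic Bradley-Terry-style loss is sketched below; the scalar rewards are plain numbers standing in for the LRM's latent-space predictions, and beta is an illustrative scale.

```python
import math

def preference_loss(r_preferred, r_dispreferred, beta=1.0):
    """-log sigmoid(beta * (r_w - r_l)): small when the preferred sample
    is scored clearly higher than the dispreferred one, large otherwise."""
    margin = beta * (r_preferred - r_dispreferred)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good_margin = preference_loss(2.0, 0.0)  # confident correct ranking -> low loss
bad_margin  = preference_loss(0.0, 2.0)  # reversed ranking -> high loss
```

At margin zero the loss equals log 2, i.e. the model is maximally uncertain which sample is preferred; the optimizer pushes rewards apart until the margin (and hence the gradient) shrinks.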
zh
[CV-56] Sparks of Explainability: Recent Advancements in Explaining Large Vision Models
[Quick Read]: This thesis aims to improve explainability in computer vision, chiefly by analyzing and modeling the features exploited by deep neural networks. Its key contributions are the introduction and evaluation of several attribution methods, including a metric based on algorithmic stability and an approach using Sobol indices that greatly reduces computation time, as well as the EVA method, which provides a first formulation of attribution with formal guarantees via verified perturbation analysis. Since experiments show these methods remain insufficient in complex scenarios, identifying only "where" the model focuses without clarifying "what" it perceives, two hypotheses are explored: aligning models with human reasoning, via a training routine that imitates human explanations and optimizes within the space of 1-Lipschitz functions, and adopting a concept-based explainability approach. For the latter, the CRAFT method automates the extraction of the concepts a model uses and assesses their importance, complemented by MACO for visualization. These works converge into a unified framework, illustrated by an interactive demonstration applied to the 1000 ImageNet classes of a ResNet model.
Link: https://arxiv.org/abs/2502.01048
Authors: Thomas Fel
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Doctoral thesis
Click to view abstract
Abstract:This thesis explores advanced approaches to improve explainability in computer vision by analyzing and modeling the features exploited by deep neural networks. Initially, it evaluates attribution methods, notably saliency maps, by introducing a metric based on algorithmic stability and an approach utilizing Sobol indices, which, through quasi-Monte Carlo sequences, allows a significant reduction in computation time. In addition, the EVA method offers a first formulation of attribution with formal guarantees via verified perturbation analysis. Experimental results indicate that in complex scenarios these methods do not provide sufficient understanding, particularly because they identify only “where” the model focuses without clarifying “what” it perceives. Two hypotheses are therefore examined: aligning models with human reasoning – through the introduction of a training routine that integrates the imitation of human explanations and optimization within the space of 1-Lipschitz functions – and adopting a conceptual explainability approach. The CRAFT method is proposed to automate the extraction of the concepts used by the model and to assess their importance, complemented by MACO, which enables their visualization. These works converge towards a unified framework, illustrated by an interactive demonstration applied to the 1000 ImageNet classes in a ResNet model.
zh
[CV-57] Emotional Face-to-Speech
[Quick Read]: This paper explores how to infer emotional speech solely from expressive facial cues, introducing a new task, emotional face-to-speech, whose goal is to synthesize emotional speech directly from an expressive face. The key solution is DEmoFace, a novel generative framework that couples a discrete diffusion transformer (DiT) with curriculum learning, built on a multi-level neural audio codec. DEmoFace uses multimodal DiT blocks to dynamically align text and speech while tailoring vocal styles to facial emotion and identity. A coarse-to-fine curriculum learning algorithm for multi-level token processing improves training efficiency and generation quality, and an enhanced predictor-free guidance handles diverse conditioning scenarios, enabling multi-conditional generation and effective disentanglement of complex attributes.
Link: https://arxiv.org/abs/2502.01046
Authors: Jiaxin Ye, Boyuan Cao, Hongming Shan
Affiliations: Unknown
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments:
Click to view abstract
Abstract:How much can we infer about an emotional voice solely from an expressive face? This intriguing question holds great potential for applications such as virtual character dubbing and aiding individuals with expressive language disorders. Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression. In this paper, we explore a new task, termed emotional face-to-speech, aiming to synthesize emotional speech directly from expressive facial cues. To that end, we introduce DEmoFace, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning, built upon a multi-level neural audio codec. Specifically, we propose multimodal DiT blocks to dynamically align text and speech while tailoring vocal styles based on facial emotion and identity. To enhance training efficiency and generation quality, we further introduce a coarse-to-fine curriculum learning algorithm for multi-level token processing. In addition, we develop an enhanced predictor-free guidance to handle diverse conditioning scenarios, enabling multi-conditional generation and disentangling complex attributes effectively. Extensive experimental results demonstrate that DEmoFace generates more natural and consistent speech compared to baselines, even surpassing speech-driven methods. Demos are shown at this https URL.
zh
[CV-58] WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction
[Quick Read]: This paper addresses the reconstruction of high-quality dynamic human avatars from monocular video, in particular reconstructing body parts that are never observed when viewpoints are limited. The key solution, WonderHuman, leverages 2D generative diffusion priors together with a Dual-Space Optimization technique that applies Score Distillation Sampling (SDS) in both the canonical and observation spaces to ensure visual consistency and enhance the realism of the dynamic reconstruction. A View Selection strategy and Pose Feature Injection further enforce consistency between SDS predictions and the observed data, yielding pose-dependent effects and higher reconstruction fidelity.
Link: https://arxiv.org/abs/2502.01045
Authors: Zilong Wang, Zhiyang Dou, Yuan Liu, Cheng Lin, Xiao Dong, Yunhui Guo, Chenxu Zhang, Xin Li, Wenping Wang, Xiaohu Guo
Affiliations: The University of Texas at Dallas; The University of Hong Kong; The Hong Kong University of Science and Technology; BNU-HKBU United International College; Texas A&M University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
Click to view abstract
Abstract:In this paper, we present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis. Previous dynamic human avatar reconstruction methods typically require the input video to have full coverage of the observed human body. However, in daily practice, one typically has access to limited viewpoints, such as monocular front-view videos, making it a cumbersome task for previous methods to reconstruct the unseen parts of the human avatar. To tackle the issue, we present WonderHuman, which leverages 2D generative diffusion model priors to achieve high-quality, photorealistic reconstructions of dynamic human avatars from monocular videos, including accurate rendering of unseen body parts. Our approach introduces a Dual-Space Optimization technique, applying Score Distillation Sampling (SDS) in both canonical and observation spaces to ensure visual consistency and enhance realism in dynamic human reconstruction. Additionally, we present a View Selection strategy and Pose Feature Injection to enforce the consistency between SDS predictions and observed data, ensuring pose-dependent effects and higher fidelity in the reconstructed avatar. In the experiments, our method achieves SOTA performance in producing photorealistic renderings from the given monocular video, particularly for those challenging unseen parts. The project page and source code can be found at this https URL.
zh
[CV-59] UASTHN: Uncertainty-Aware Deep Homography Estimation for UAV Satellite-Thermal Geo-localization ICRA2025
【速读】:该论文旨在解决热红外地理定位(Thermal Geo-localization, TG)方法在输出中缺乏不确定性测量的问题,这限制了系统在面对无纹理或损坏的热图像、自相似或过时的卫星地图、几何噪声以及超出卫星地图范围的热图像时的鲁棒性。论文的关键解决方案是提出了一种新颖的方法,即UASTHN,用于深度单应性估计(Deep Homography Estimation, DHE)任务中的不确定性估计(Uncertainty Estimation, UE)。具体而言,该方法引入了一种基于裁剪的测试时增强策略(Crop-based Test-Time Augmentation, CropTTA),通过利用裁剪图像视图的单应性一致性来有效测量数据不确定性。此外,该方法还采用了深度集成(Deep Ensembles, DE)来评估模型不确定性,从而提供了一种高效且可与任何DHE模型无缝集成的方案。
链接: https://arxiv.org/abs/2502.01035
作者: Jiuhong Xiao,Giuseppe Loianno
机构: New York University, Tandon School of Engineering (纽约大学,塔andon工程学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 6 figures, accepted at ICRA 2025
点击查看摘要
Abstract:Geo-localization is an essential component of Unmanned Aerial Vehicle (UAV) navigation systems to ensure precise absolute self-localization in outdoor environments. To address the challenges of GPS signal interruptions or low illumination, Thermal Geo-localization (TG) employs aerial thermal imagery to align with reference satellite maps to accurately determine the UAV’s location. However, existing TG methods lack uncertainty measurement in their outputs, compromising system robustness in the presence of textureless or corrupted thermal images, self-similar or outdated satellite maps, geometric noises, or thermal images exceeding satellite maps. To overcome these limitations, this paper presents \textitUASTHN, a novel approach for Uncertainty Estimation (UE) in Deep Homography Estimation (DHE) tasks for TG applications. Specifically, we introduce a novel Crop-based Test-Time Augmentation (CropTTA) strategy, which leverages the homography consensus of cropped image views to effectively measure data uncertainty. This approach is complemented by Deep Ensembles (DE) employed for model uncertainty, offering comparable performance with improved efficiency and seamless integration with any DHE model. Extensive experiments across multiple DHE models demonstrate the effectiveness and efficiency of CropTTA in TG applications. Analysis of detected failure cases underscores the improved reliability of CropTTA under challenging conditions. Finally, we demonstrate the capability of combining CropTTA and DE for a comprehensive assessment of both data and model uncertainty. Our research provides profound insights into the broader intersection of localization and uncertainty estimation. The code and data is publicly available.
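CropTTA 的核心是"裁剪视图间的单应性共识":对多个裁剪视图分别估计单应性,若各视图映射结果分歧大,则数据不确定性高。以下为示意实现(`estimate_homography` 为任意 DHE 模型的占位函数,角点选取与不确定性聚合方式均为假设):

```python
import numpy as np

def crop_tta_uncertainty(estimate_homography, image, crops):
    """对每个裁剪视图估计一次单应性,将图像四角映射到卫星图坐标,
    以跨视图角点位置的标准差均值作为数据不确定性(示意)。"""
    h, w = image.shape[:2]
    corners = np.array([[0, 0, 1], [w, 0, 1], [w, h, 1], [0, h, 1]], float).T
    mapped = []
    for crop in crops:
        H = estimate_homography(image, crop)      # 3x3 单应矩阵
        p = H @ corners
        mapped.append((p[:2] / p[2]).T)           # 齐次坐标归一化
    mapped = np.stack(mapped)                     # (n_crops, 4, 2)
    return float(np.mean(mapped.std(axis=0)))     # 共识越差 -> 不确定性越高

image = np.zeros((64, 64))
crops = [image[:48, :48], image[16:, 16:]]
identity = lambda img, crop: np.eye(3)
u = crop_tta_uncertainty(identity, image, crops)  # 所有视图一致 -> 不确定性为 0

shifts = iter([0.0, 2.0])                         # 两个视图的估计相差 2 像素平移
shifty = lambda img, crop: np.array([[1, 0, next(shifts)], [0, 1, 0], [0, 0, 1]], float)
u2 = crop_tta_uncertainty(shifty, image, crops)   # 分歧 -> 不确定性为正
```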
zh
[CV-60] Vessel segmentation for X-separation
【速读】:该论文旨在解决在使用 χ-分离(χ-separation)方法进行定量磁化率成像(QSM)时,血管引起的伪影干扰铁和髓鞘准确量化的问题。解决方案的关键在于提出了一种新的血管分割方法,该方法通过三步实现:1)从 R2* 图谱及 χpara 与 |χdia| 乘积图谱生成种子;2)基于血管几何引导的区域生长,创建血管掩膜;3)通过排除非血管结构来细化血管掩膜。此方法显著优于传统血管分割方法,并在神经网络重建方法 χ-sepnet-R2* 的定量评估及群体平均感兴趣区域分析中展现出改进效果。
链接: https://arxiv.org/abs/2502.01023
作者: Taechang Kim,Sooyeon Ji,Kyeongseon Min,Minjun Kim,Jonghyo Youn,Chungseok Oh,Jiye Kim,Jongho Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:
点击查看摘要
Abstract: χ-separation is an advanced quantitative susceptibility mapping (QSM) method that is designed to generate paramagnetic (χpara) and diamagnetic (|χdia|) susceptibility maps, reflecting the distribution of iron and myelin in the brain. However, vessels have shown artifacts, interfering with the accurate quantification of iron and myelin in applications. To address this challenge, a new vessel segmentation method for χ-separation is developed. The method comprises three steps: 1) Seed generation from R2* and the product of χpara and |χdia| maps; 2) Region growing, guided by vessel geometry, creating a vessel mask; 3) Refinement of the vessel mask by excluding non-vessel structures. The performance of the method was compared to conventional vessel segmentation methods both qualitatively and quantitatively. To demonstrate the utility of the method, it was tested in two applications: quantitative evaluation of a neural network-based χ-separation reconstruction method (χ-sepnet-R2*) and population-averaged region of interest (ROI) analysis. The proposed method demonstrates superior performance to the conventional vessel segmentation methods, effectively excluding non-vessel structures and achieving the highest Dice score coefficient. In both applications, applying vessel masks yields notable improvements in the quantitative evaluation of χ-sepnet-R2* and statistically significant differences in the population-averaged ROI analysis. These results suggest that excluding vessels when analyzing χ-separation maps provides more accurate evaluations. The proposed method has the potential to facilitate various applications, offering reliable analysis through the generation of a high-quality vessel mask.
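三步流程中第二步的区域生长可以用一个简化的强度阈值 BFS 来示意(4-邻域与阈值条件均为假设;论文实际由血管几何引导生长,并在第三步进一步剔除非血管结构):

```python
import numpy as np
from collections import deque

def region_grow(img, seeds, thresh):
    """从种子点出发向 4-邻域扩张,仅吸纳强度 >= thresh 的像素(示意)。"""
    mask = np.zeros(img.shape, bool)
    q = deque(seeds)
    for s in seeds:
        mask[s] = True
    while q:
        y, x = q.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < img.shape[0] and 0 <= nx < img.shape[1] \
               and not mask[ny, nx] and img[ny, nx] >= thresh:
                mask[ny, nx] = True
                q.append((ny, nx))
    return mask

img = np.array([[0, 9, 0],
                [0, 9, 0],
                [0, 9, 0]], float)     # 中间一列模拟高信号"血管"
vessel = region_grow(img, seeds=[(0, 1)], thresh=5)
```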
zh
[CV-61] ZeroBP: Learning Position-Aware Correspondence for Zero-shot 6D Pose Estimation in Bin-Picking ICRA2025
【速读】:该论文旨在解决二元拣选(Bin-picking)任务中零样本6D位姿估计(Zero-shot 6D pose estimation)的效率问题。现有方法依赖于特定对象的训练数据,导致在处理新工件时需要大量的数据收集和模型重新训练。论文的关键解决方案是提出了一种名为ZeroBP的框架,它通过学习场景实例与CAD模型之间的位置感知对应关系(Position-Aware Correspondence, PAC),结合局部特征和全局位置来解决因相似形状和外观引起的不匹配问题。实验结果表明,ZeroBP在ROBI数据集上的表现优于现有的零样本6D位姿估计方法,正确位姿的平均召回率提高了9.1%。
链接: https://arxiv.org/abs/2502.01004
作者: Jianqiu Chen,Zikun Zhou,Xin Li,Ye Zheng,Tianpeng Bao,Zhenyu He
机构: Harbin Institute of Technology (哈尔滨工业大学); Pengcheng Laboratory (鹏城实验室); JD.com, Inc. (京东集团); SenseTime Research (商汤研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025
点击查看摘要
Abstract:Bin-picking is a practical and challenging robotic manipulation task, where accurate 6D pose estimation plays a pivotal role. The workpieces in bin-picking are typically textureless and randomly stacked in a bin, which poses a significant challenge to 6D pose estimation. Existing solutions are typically learning-based methods, which require object-specific training. Their efficiency of practical deployment for novel workpieces is highly limited by data collection and model retraining. Zero-shot 6D pose estimation is a potential approach to address the issue of deployment efficiency. Nevertheless, existing zero-shot 6D pose estimation methods are designed to leverage feature matching to establish point-to-point correspondences for pose estimation, which is less effective for workpieces with textureless appearances and ambiguous local regions. In this paper, we propose ZeroBP, a zero-shot pose estimation framework designed specifically for the bin-picking task. ZeroBP learns Position-Aware Correspondence (PAC) between the scene instance and its CAD model, leveraging both local features and global positions to resolve the mismatch issue caused by ambiguous regions with similar shapes and appearances. Extensive experiments on the ROBI dataset demonstrate that ZeroBP outperforms state-of-the-art zero-shot pose estimation methods, achieving an improvement of 9.1% in average recall of correct poses.
zh
[CV-62] Multi-Resolution SAR and Optical Remote Sensing Image Registration Methods: A Review Datasets and Future Perspectives
【速读】:该论文旨在解决合成孔径雷达(SAR)与光学图像配准中的挑战,特别是在高分辨率下由于成像机制、几何失真和辐射属性差异导致的配准难题。论文的关键在于创建了MultiResSAR数据集,并系统性地评估了十六种最先进的算法。结果表明,没有一种算法能够实现100%的成功率,且随着分辨率的提高,性能显著下降。论文建议未来研究应着重于噪声抑制、三维几何融合、跨视角变换建模以及深度学习优化,以实现高分辨率SAR与光学图像的稳健配准。
链接: https://arxiv.org/abs/2502.01002
作者: Wenfei Zhang,Ruipeng Zhao,Yongxiang Yao,Yi Wan,Peihao Wu,Jiayuan Li,Yansheng Li,Yongjun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 48 pages, 10 figures
点击查看摘要
Abstract:Synthetic Aperture Radar (SAR) and optical image registration is essential for remote sensing data fusion, with applications in military reconnaissance, environmental monitoring, and disaster management. However, challenges arise from differences in imaging mechanisms, geometric distortions, and radiometric properties between SAR and optical images. As image resolution increases, fine SAR textures become more significant, leading to alignment issues and 3D spatial discrepancies. Two major gaps exist: the lack of a publicly available multi-resolution, multi-scene registration dataset and the absence of systematic analysis of current methods. To address this, the MultiResSAR dataset was created, containing over 10k pairs of multi-source, multi-resolution, and multi-scene SAR and optical images. Sixteen state-of-the-art algorithms were tested. Results show no algorithm achieves 100% success, and performance decreases as resolution increases, with most failing on sub-meter data. XoFTR performs best among deep learning methods (40.58%), while RIFT performs best among traditional methods (66.51%). Future research should focus on noise suppression, 3D geometric fusion, cross-view transformation modeling, and deep learning optimization for robust registration of high-resolution SAR and optical images. The dataset is available at this https URL.
zh
[CV-63] Adapting Foundation Models for Few-Shot Medical Image Segmentation: Actively and Sequentially
【速读】:该论文旨在解决在目标任务存在较大领域差距且标注样本有限的情况下,确保可靠和鲁棒的模型适应性问题。解决方案的关键在于提出了一种名为Active and Sequential domain AdaPtation (ASAP) 的框架,通过将小样本领域自适应(Few-Shot Domain Adaptation, FSDA)问题形式化为多臂老虎机问题,并设计了一个高效的奖励函数来动态选择与目标任务紧密相关的辅助数据集,从而实现单轮微调。实验验证表明,该方法在多种医学分割数据集上表现出色,显著优于现有的FSDA方法,在MRI数据集上的Dice评分平均提升了27.75%,CT数据集上提升了7.52%。
链接: https://arxiv.org/abs/2502.01000
作者: Jingyun Yang,Guoqing Zhang,Jingge Wang,Yang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in foundation models have brought promising results in computer vision, including medical image segmentation. Fine-tuning foundation models on specific low-resource medical tasks has become a standard practice. However, ensuring reliable and robust model adaptation when the target task has a large domain gap and few annotated samples remains a challenge. Previous few-shot domain adaptation (FSDA) methods seek to bridge the distribution gap between source and target domains by utilizing auxiliary data. The selection and scheduling of auxiliaries are often based on heuristics, which can easily cause negative transfer. In this work, we propose an Active and Sequential domain AdaPtation (ASAP) framework for dynamic auxiliary dataset selection in FSDA. We formulate FSDA as a multi-armed bandit problem and derive an efficient reward function to prioritize training on auxiliary datasets that align closely with the target task, through a single-round fine-tuning. Empirical validation on diverse medical segmentation datasets demonstrates that our method achieves favorable segmentation performance, significantly outperforming the state-of-the-art FSDA methods, achieving an average gain of 27.75% on MRI and 7.52% on CT datasets in Dice score. Code is available at the git repository: this https URL.
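把辅助数据集选择建模为多臂老虎机后,需要一个在"利用高奖励臂"与"探索未知臂"之间权衡的策略。论文设计了专用奖励函数,此处用经典的 UCB1 策略做示意(臂即候选辅助数据集,奖励为假设的伯努利信号):

```python
import math, random

def ucb1_select(counts, values, t):
    """UCB1:先保证每臂至少被选一次,之后选 均值 + 探索加成 最大的臂。"""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

random.seed(0)
true_reward = [0.2, 0.8]          # 臂 1:与目标任务更相关的辅助数据集
counts, values = [0, 0], [0.0, 0.0]
for t in range(1, 201):
    a = ucb1_select(counts, values, t)
    r = 1.0 if random.random() < true_reward[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # 增量式均值更新
```

经过足够多轮后,与目标任务更相关的辅助数据集会被优先用于微调,这正是避免负迁移的动机。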
zh
[CV-64] FCBoost-Net: A Generative Network for Synthesizing Multiple Collocated Outfits via Fashion Compatibility Boosting
【速读】:该论文旨在解决时尚穿搭生成领域中多样化选择不足的问题。现有的研究局限于基于给定物品生成唯一一套搭配,而无法为用户提供更多选择。为了解决这一问题,论文提出了一种新的框架FCBoost-Net,其关键是利用预训练的生成模型来生成多套协调且多样化的穿搭。通过引入一种新颖的时尚搭配增强器,FCBoost-Net能够在多轮迭代中逐步提升生成搭配的协调性和多样性。这种方法受到了提升算法的启发,能够有效改善随机生成的时尚物品的搭配性同时保持多样性。
链接: https://arxiv.org/abs/2502.00992
作者: Dongliang Zhou,Haijun Zhang,Jianghong Ma,Jicong Fan,Zhao Zhang
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: This paper has been accepted for presentation at ACM Multimedia 2023
点击查看摘要
Abstract:Outfit generation is a challenging task in the field of fashion technology, in which the aim is to create a collocated set of fashion items that complement a given set of items. Previous studies in this area have been limited to generating a unique set of fashion items based on a given set of items, without providing additional options to users. This lack of a diverse range of choices necessitates the development of a more versatile framework. However, when the task of generating collocated and diversified outfits is approached with multimodal image-to-image translation methods, it poses a challenging problem in terms of non-aligned image translation, which is hard to address with existing methods. In this research, we present FCBoost-Net, a new framework for outfit generation that leverages the power of pre-trained generative models to produce multiple collocated and diversified outfits. Initially, FCBoost-Net randomly synthesizes multiple sets of fashion items, and the compatibility of the synthesized sets is then improved in several rounds using a novel fashion compatibility booster. This approach was inspired by boosting algorithms and allows the performance to be gradually improved in multiple steps. Empirical evidence indicates that the proposed strategy can improve the fashion compatibility of randomly synthesized fashion items as well as maintain their diversity. Extensive experiments confirm the effectiveness of our proposed framework with respect to visual authenticity, diversity, and fashion compatibility.
zh
[CV-65] Pushing the Boundaries of State Space Models for Image and Video Generation
【速读】:该论文旨在探索状态空间模型(State-Space Models, SSM)在图像和视频生成任务中的能力边界。论文的关键解决方案在于构建迄今为止最大规模的扩散SSM-Transformer混合模型(50亿参数),基于次二次双向Hydra和自注意力机制,从而实现高达2K分辨率的图像和360p分辨率、8秒长(16帧/秒)的视频生成。实验结果表明,该模型能够生成与复杂文本提示一致且具有高动态范围且时间上一致的视频,这表明SSM在视觉生成任务中具有巨大潜力。
链接: https://arxiv.org/abs/2502.00972
作者: Yicong Hong,Long Mai,Yuan Yao,Feng Liu
机构: Adobe Research; University of Rochester
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages, paper under review
点击查看摘要
Abstract:While Transformers have become the dominant architecture for visual generation, linear attention models, such as the state-space models (SSM), are increasingly recognized for their efficiency in processing long visual sequences. However, the essential efficiency of these models comes from formulating a limited recurrent state, enforcing causality among tokens that are prone to inconsistent modeling of N-dimensional visual data, leaving questions on their capacity to generate long non-causal sequences. In this paper, we explore the boundary of SSM on image and video generation by building the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters) based on the sub-quadratic bi-directional Hydra and self-attention, and generate up to 2K images and 360p 8 seconds (16 FPS) videos. Our results demonstrate that the model can produce faithful results aligned with complex text prompts and temporal consistent videos with high dynamics, suggesting the great potential of using SSMs for visual generation tasks.
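状态空间模型之所以具有次二次复杂度,是因为它把序列建模写成线性递推 h_t = A h_{t-1} + B x_t、y_t = C h_t,用固定大小的状态扫描整个序列。以下为与论文的双向 Hydra 结构无关的最小因果扫描示意(参数形状为假设):

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """最朴素的 SSM 顺序扫描:h_t = A h_{t-1} + B x_t, y_t = C h_t。
    状态 h 的维度固定,与序列长度无关,这是其效率来源。"""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:               # x 为标量输入,B 为 (d,) 输入投影
        h = A @ h + B * x
        ys.append(C @ h)
    return np.array(ys)

# A = 0 时状态无记忆,输出退化为逐点线性映射 y_t = (C·B) x_t
A = np.zeros((2, 2))
B = np.array([1.0, 2.0])
C = np.array([0.5, 0.25])      # C·B = 1.0
ys = ssm_scan(A, B, C, [1.0, 2.0, 3.0])
```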
zh
[CV-66] CoDe: Blockwise Control for Denoising Diffusion Models
【速读】:该论文旨在解决在对扩散模型(Diffusion Models)进行下游任务适配时,通常需要微调新模型或在推理阶段使用基于梯度的引导方法以实现从奖励倾斜后验(reward-tilted posterior)中采样的问题。论文的关键解决方案是提出了一种名为可控去噪(Controlled Denoising, CoDe)的无梯度引导方法。这种方法是一种在去噪过程中分块采样的技术,能够在不依赖可微分引导函数和无需微调模型的情况下,实现与下游奖励的一致性。实验表明,尽管CoDe简单,但其在奖励适配、指令遵循和推理成本之间提供了有利的权衡,且性能可与最先进的基线相媲美。
链接: https://arxiv.org/abs/2502.00968
作者: Anuj Singh,Sayak Mukherjee,Ahmad Beirami,Hadi Jamali-Rad
机构: Delft University of Technology(代尔夫特理工大学); Shell Global Solutions International B.V.(壳牌全球解决方案国际有限公司); Massachusetts Institute of Technology(麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Aligning diffusion models to downstream tasks often requires finetuning new models or gradient-based guidance at inference time to enable sampling from the reward-tilted posterior. In this work, we explore a simple inference-time gradient-free guidance approach, called controlled denoising (CoDe), that circumvents the need for differentiable guidance functions and model finetuning. CoDe is a blockwise sampling method applied during intermediate denoising steps, allowing for alignment with downstream rewards. Our experiments demonstrate that, despite its simplicity, CoDe offers a favorable trade-off between reward alignment, prompt instruction following, and inference cost, achieving a competitive performance against the state-of-the-art baselines. Our code is available at: this https URL.
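CoDe 的分块采样思想可以抽象为:在每个中间去噪块采样若干候选延续,用(不可微的)奖励函数挑出最优者继续,全程无需梯度引导。示意如下(`step` 与 `reward` 均为占位函数,并非原文实现):

```python
def code_blockwise_sample(x0, n_blocks, n_cand, step, reward):
    """每个去噪块:采样 n_cand 个候选,保留奖励最高者(无梯度、免微调)。"""
    x = x0
    for b in range(n_blocks):
        cands = [step(x, b) for _ in range(n_cand)]
        x = max(cands, key=reward)   # 贪心对齐下游奖励,无需可微引导函数
    return x

# 演示:单个块的 3 个候选值为 1、3、2,奖励即数值本身,应选中 3
vals = iter([1.0, 3.0, 2.0])
pick = code_blockwise_sample(0.0, n_blocks=1, n_cand=3,
                             step=lambda x, b: x + next(vals),
                             reward=lambda v: v)
```

n_cand 越大,与奖励的对齐越好,但推理成本线性增加,这正是摘要所述的权衡。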
zh
[CV-67] CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling
【速读】:该论文旨在解决将Mixture-of-Experts (MoE)模型集成到多模态模型如CLIP中以提升性能的同时,面临的训练复杂度高和成本高的问题。关键解决方案在于提出了一种名为CLIP-Upcycling (CLIP-UP)的高效替代训练策略,通过将预训练的密集型CLIP模型转换为稀疏MoE架构,从而显著降低了训练复杂度和成本。实验结果表明,采用CLIP-UP训练的稀疏CLIP B/16模型在COCO和Flickr30k文本到图像Recall@1基准测试中分别比其密集型对应模型高出7.2%和6.6%,同时仅使用后者的30%推理浮点运算次数 (FLOPs),证明了该方法的有效性和可扩展性。
链接: https://arxiv.org/abs/2502.00965
作者: Xinze Wang,Chen Chen,Yinfei Yang,Hong-You Chen,Bowen Zhang,Aditya Pal,Xiangxin Zhu,Xianzhi Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.
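稀疏升级改造(sparse upcycling)的基本操作是:把已训练好的稠密 FFN 权重复制成若干专家,再新建一个路由器。若路由器零初始化、各专家相同,则 MoE 在初始化时的输出与原稠密模型完全一致,训练可从稠密模型的水平平滑起步。numpy 示意(单层线性 FFN 与 softmax 软路由均为简化假设):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def upcycle(W, n_experts):
    """把稠密权重 W 复制为 n_experts 份专家,路由器零初始化。"""
    experts = [W.copy() for _ in range(n_experts)]
    router = np.zeros((W.shape[1], n_experts))   # 输入维 -> 各专家打分
    return experts, router

def moe_forward(x, experts, router):
    probs = softmax(x @ router)                  # 各专家的路由权重
    return sum(p * (E @ x) for p, E in zip(probs, experts))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))                 # "预训练"稠密权重
x = rng.standard_normal(16)
experts, router = upcycle(W, n_experts=4)
dense_out = W @ x
moe_out = moe_forward(x, experts, router)        # 初始化时应与稠密输出相同
```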
zh
[CV-68] SAM-guided Pseudo Label Enhancement for Multi-modal 3D Semantic Segmentation ICRA2025
【速读】:该论文旨在解决多模态三维语义分割在跨域适应过程中可靠伪标签难以生成及稀疏性问题。论文的关键解决方案在于提出了一种图像引导的伪标签增强方法,通过利用来自Segment Anything Model (SAM)的互补2D先验知识,引入更多可靠的伪标签,从而提升跨域适应性能。具体而言,该方法首先使用多数投票确定每个SAM掩膜的类别标签,并采用多种约束过滤不可靠的掩膜标签;随后,通过几何感知渐进传播(Geometry-Aware Progressive Propagation, GAPP)技术,在避免由于2D-3D不一致导致的异常点的情况下,将掩膜标签传播至SAM掩膜内的所有3D点。
链接: https://arxiv.org/abs/2502.00960
作者: Mingyu Yang,Jitong Lu,Hun-Seok Kim
机构: Department of Electrical and Computer Engineering, University of Michigan (密歇根大学电气与计算机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025
点击查看摘要
Abstract:Multi-modal 3D semantic segmentation is vital for applications such as autonomous driving and virtual reality (VR). To effectively deploy these models in real-world scenarios, it is essential to employ cross-domain adaptation techniques that bridge the gap between training data and real-world data. Recently, self-training with pseudo-labels has emerged as a predominant method for cross-domain adaptation in multi-modal 3D semantic segmentation. However, generating reliable pseudo-labels necessitates stringent constraints, which often result in sparse pseudo-labels after pruning. This sparsity can potentially hinder performance improvement during the adaptation process. We propose an image-guided pseudo-label enhancement approach that leverages the complementary 2D prior knowledge from the Segment Anything Model (SAM) to introduce more reliable pseudo-labels, thereby boosting domain adaptation performance. Specifically, given a 3D point cloud and the SAM masks from its paired image data, we collect all 3D points covered by each SAM mask that potentially belong to the same object. Then our method refines the pseudo-labels within each SAM mask in two steps. First, we determine the class label for each mask using majority voting and employ various constraints to filter out unreliable mask labels. Next, we introduce Geometry-Aware Progressive Propagation (GAPP) which propagates the mask label to all 3D points within the SAM mask while avoiding outliers caused by 2D-3D misalignment. Experiments conducted across multiple datasets and domain adaptation scenarios demonstrate that our proposed method significantly increases the quantity of high-quality pseudo-labels and enhances the adaptation performance over baseline methods.
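细化伪标签的第一步"掩膜内多数投票并过滤不可靠标签"可直接用 Counter 示意(多数票占比阈值为假设,原文还叠加了其他约束):

```python
from collections import Counter

def mask_majority_label(point_labels, min_ratio=0.6):
    """对某个 SAM 掩膜覆盖的 3D 点伪标签做多数投票;
    多数票占比低于 min_ratio 时视为不可靠,返回 None(阈值为假设)。"""
    votes = Counter(point_labels)
    label, n = votes.most_common(1)[0]
    return label if n / len(point_labels) >= min_ratio else None

lab1 = mask_majority_label(["car", "car", "car", "road"])      # 占比 0.75,可靠
lab2 = mask_majority_label(["car", "road", "person", "bike"])  # 无明显多数,剔除
```

通过投票确定的掩膜标签随后才由 GAPP 传播到掩膜内的其余 3D 点。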
zh
[CV-69] Hypo3D: Exploring Hypothetical Reasoning in 3D
【速读】:该论文旨在解决现有3D推理基准假设实时场景可访问性的问题,这在实际应用中由于频繁场景更新的高成本而变得不切实际。为了解决这一问题,论文引入了假设性3D推理(Hypo3D)基准,其关键是让模型在没有实时场景数据的情况下,基于提供的变化描述想象场景状态,并在此基础上进行推理。Hypo3D作为一个3D视觉问答(VQA)基准,包含700个室内场景中的7,727个上下文变化,生成了14,885个问题-答案对,并通过锚点世界框架确保方向术语的一致引用。实验结果表明,当前最先进的基础模型在处理假设性变化场景时仍存在显著性能差距,尤其是在涉及运动变化和方向推理的场景中。
链接: https://arxiv.org/abs/2502.00954
作者: Ye Mao,Weixun Luo,Junpeng Jing,Anlan Qiu,Krystian Mikolajczyk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 15 figures, 9 tables
点击查看摘要
Abstract:The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduce Hypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models’ ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive experiments show that state-of-the-art foundation models struggle to reason in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the context change is irrelevant to the question, models often incorrectly adjust their answers.
zh
[CV-70] Fruit Fly Classification (Diptera: Tephritidae) in Images Applying Transfer Learning
【速读】:该论文旨在解决自动化分类实验室环境中两种水果蝇(Anastrepha fraterculus 和 Ceratitis capitata)的问题。当前的分类方法依赖于专家手动识别,受到人为因素的影响且面临时间挑战。论文的关键解决方案在于开发了一种迁移学习模型,并利用预训练的卷积神经网络(Convolutional Neural Networks, CNNs),特别是Inception-v3,在高精度图像处理和特征提取的基础上,实现了82%至93%的F1分数,验证了其在非受控环境中的可靠性和有效性。
链接: https://arxiv.org/abs/2502.00939
作者: Erick Andrew Bustamante Flores,Harley Vera Olivera,Ivan Cesar Medrano Valencia,Carlos Fernando Montoya Cubas
机构: Department of Computer Science, Universidad Nacional de San Antonio Abad del Cusco (国立圣安东尼奥阿巴德库斯科大学), Cusco, Perú
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages and 19 figures
点击查看摘要
Abstract:This study develops a transfer learning model for the automated classification of two species of fruit flies, Anastrepha fraterculus and Ceratitis capitata, in a controlled laboratory environment. The research addresses the need to optimize identification and classification, which are currently performed manually by experts, being affected by human factors and facing time challenges. The methodological process of this study includes the capture of high-quality images using a mobile phone camera and a stereo microscope, followed by segmentation to reduce size and focus on relevant morphological areas. The images were carefully labeled and preprocessed to ensure the quality and consistency of the dataset used to train the pre-trained convolutional neural network models VGG16, VGG19, and Inception-v3. The results were evaluated using the F1-score, achieving 82% for VGG16 and VGG19, while Inception-v3 reached an F1-score of 93%. Inception-v3’s reliability was verified through model testing in uncontrolled environments, with positive results, complemented by the Grad-CAM technique, demonstrating its ability to capture essential morphological features. These findings indicate that Inception-v3 is an effective and replicable approach for classifying Anastrepha fraterculus and Ceratitis capitata, with potential for implementation in automated monitoring systems.
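论文以 F1 分数(精确率与召回率的调和平均)评估分类效果。二分类 F1 的计算可示意如下(样本与标签仅为演示用假设数据):

```python
def f1_score(y_true, y_pred, positive):
    """二分类 F1 = 2PR/(P+R),P、R 由 TP/FP/FN 计数得到。"""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = ["Af", "Af", "Cc", "Cc"]   # Af: A. fraterculus, Cc: C. capitata
y_pred = ["Af", "Cc", "Cc", "Cc"]
f1 = f1_score(y_true, y_pred, positive="Af")   # P=1.0, R=0.5 -> F1=2/3
```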
zh
[CV-71] VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning
【速读】:该论文旨在解决移动机器人在未知环境中基于视觉和语言指令进行导航的问题。关键解决方案在于引入了一种名为启发式视觉-语言(Heuristic-Vision-Language, HVL)的空间推理方法,用于目标点选择。这种方法结合了像素级的视觉-语言特征和启发式探索,使机器人能够在不同环境和规模下高效、稳健地导航至人类指令指定的目标实例。
链接: https://arxiv.org/abs/2502.00931
作者: Yi Du,Taimeng Fu,Zhuoqun Chen,Bowen Li,Shaoshu Su,Zhipeng Zhao,Chen Wang
机构: Spatial AI & Robotics Lab, University at Buffalo (布法罗大学空间人工智能与机器人实验室); Robotics Institute, Carnegie Mellon University (卡内基梅隆大学机器人研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision-language navigation in unknown environments is crucial for mobile robots. In scenarios such as household assistance and rescue, mobile robots need to understand a human command, such as “find a person wearing black”. We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots. Unlike prior methods that rely on a single image-level feature similarity to guide a robot, we introduce the heuristic-vision-language (HVL) spatial reasoning for goal point selection. It combines pixel-wise vision-language features and heuristic exploration to enable efficient navigation to human-instructed instances in various environments robustly. We deploy VL-Nav on a four-wheel mobile robot and conduct comprehensive navigation tasks in various environments of different scales and semantic complexities, indoors and outdoors. Remarkably, VL-Nav operates at a real-time frequency of 30 Hz with a Jetson Orin NX, highlighting its ability to conduct efficient vision-language navigation. Experimental results show that VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.
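HVL 空间推理的目标点选择可抽象为:把像素级视觉-语言相似度图与启发式探索分数图加权合并后取最大值。示意如下(两张分数图与权重 λ 均为假设,仅说明合并与选点逻辑):

```python
import numpy as np

def select_goal(vl_sim, heuristic, lam=0.5):
    """分数 = 视觉-语言相似度 + lam * 启发式探索分;返回最高分像素坐标。"""
    score = vl_sim + lam * heuristic
    return np.unravel_index(np.argmax(score), score.shape)

vl_sim = np.array([[0.1, 0.9],
                   [0.2, 0.3]])      # 指令目标("穿黑衣的人")在右上角
heuristic = np.array([[0.0, 0.0],
                      [1.0, 0.0]])   # 左下角是未探索的前沿区域
goal = select_goal(vl_sim, heuristic, lam=0.5)   # 语言证据足够强时优先目标
```

当相似度图整体低分时,启发式项会主导选点,驱动机器人继续探索,这正是两者结合带来鲁棒性的原因。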
zh
[CV-72] LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation
【速读】:该论文旨在解决现有视觉提示(Visual Prompting)技术在参数高效调优中的局限性,特别是这些方法通常在图像周围添加提示参数,导致与原始图像的交互仅限于少量补丁,而忽视了不同补丁间共享信息的归纳偏置。论文的关键解决方案是引入了一种新颖的视觉提示设计——低秩矩阵乘法视觉提示(LoR-VP),它能够使图像像素行和列之间的共享信息和特定补丁信息得到充分利用。实验结果表明,与最先进的视觉提示方法相比,LoR-VP在七个网络架构和四个数据集上的表现显著提升,实现了最高可达6倍的训练速度加快,使用了少至1/18的视觉提示参数,并提升了3.1%的性能。
链接: https://arxiv.org/abs/2502.00896
作者: Can Jin,Ying Li,Mingyu Zhao,Shiyu Zhao,Zhenting Wang,Xiaoxiao He,Ligong Han,Tong Che,Dimitris N. Metaxas
机构: Rutgers University; Zhejiang University; Red Hat AI Innovation; MIT-IBM Watson AI Lab; NVIDIA Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Visual prompting has gained popularity as a method for adapting pre-trained models to specific tasks, particularly in the realm of parameter-efficient tuning. However, existing visual prompting techniques often pad the prompt parameters around the image, limiting the interaction between the visual prompts and the original image to a small set of patches while neglecting the inductive bias present in shared information across different patches. In this study, we conduct a thorough preliminary investigation to identify and address these limitations. We propose a novel visual prompt design, introducing Low-Rank matrix multiplication for Visual Prompting (LoR-VP), which enables shared and patch-specific information across rows and columns of image pixels. Extensive experiments across seven network architectures and four datasets demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods, achieving up to 6 times faster training times, utilizing 18 times fewer visual prompt parameters, and delivering a 3.1% improvement in performance. The code is available as this https URL.
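低秩视觉提示的核心是用两个小矩阵的乘积 P = B@A 生成整幅提示图,使像素行与列之间共享参数。最小示意(秩 r、与输入的叠加方式均为假设):

```python
import numpy as np

def lor_vp(B, A):
    """P = B @ A:H×r 与 r×W 两个低秩因子生成 H×W 的提示图。"""
    return B @ A

H, W, r = 224, 224, 4
rng = np.random.default_rng(0)
B = rng.standard_normal((H, r))
A = rng.standard_normal((r, W))
P = lor_vp(B, A)
img = np.zeros((3, H, W))
prompted = img + P                   # 提示图广播到所有通道后与图像相加
n_lowrank = B.size + A.size          # 低秩参数量:2 * 224 * 4 = 1792
n_full = H * W                       # 全分辨率提示需 224 * 224 = 50176 个参数
```

低秩分解使提示既覆盖全图(而非仅边缘补丁),参数量又远小于逐像素提示。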
zh
[CV-73] Paper Copilot: The Artificial Intelligence and Machine Learning Community Should Adopt a More Transparent and Regulated Peer Review Process
【速读】:该论文旨在探讨人工智能(AI)与机器学习(ML)会议从封闭评审平台向开放评审平台转变过程中所采用的不同模型,并分析其优势与局限性。论文特别关注透明同行评审日益增长的社区兴趣。通过分析Paper Copilot网站的数据,该论文强调了更加透明、开放且规范的同行评审机制的重要性,以促进更大范围的社区参与及领域的进步。关键在于推动一种更透明、开放且受规管的同行评审体系。
链接: https://arxiv.org/abs/2502.00874
作者: Jing Yang
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:The rapid growth of submissions to top-tier Artificial Intelligence (AI) and Machine Learning (ML) conferences has prompted many venues to transition from closed to open review platforms. Some have fully embraced open peer reviews, allowing public visibility throughout the process, while others adopt hybrid approaches, such as releasing reviews only after final decisions or keeping reviews private despite using open peer review systems. In this work, we analyze the strengths and limitations of these models, highlighting the growing community interest in transparent peer review. To support this discussion, we examine insights from Paper Copilot, a website launched two years ago to aggregate and analyze AI / ML conference data while engaging a global audience. The site has attracted over 200,000 early-career researchers, particularly those aged 18-34 from 177 countries, many of whom are actively engaged in the peer review process. Drawing on our findings, this position paper advocates for a more transparent, open, and well-regulated peer review aiming to foster greater community involvement and propel advancements in the field.
zh
[CV-74] STAF: Sinusoidal Trainable Activation Functions for Implicit Neural Representation
【速读】:该论文旨在解决由ReLU网络的频谱偏见导致的限制,这种偏见阻碍了模型捕捉目标信号中的精细细节。论文的关键解决方案是引入了正弦可训练激活函数(Sinusoidal Trainable Activation Functions, STAF),它通过使网络能够自适应地学习和表示复杂信号来直接应对这一挑战。STAF通过内在调制其频率分量,实现了自适应频谱学习,从而显著提高了收敛速度和表达能力。
链接: https://arxiv.org/abs/2502.00869
作者: Alireza Morsali,MohammadJavad Vaez,Hossein Soltani,Amirhossein Kazerouni,Babak Taati,Morteza Mohammad-Noori
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Implicit Neural Representations (INRs) have emerged as a powerful framework for modeling continuous signals. The spectral bias of ReLU-based networks is a well-established limitation, restricting their ability to capture fine-grained details in target signals. While previous works have attempted to mitigate this issue through frequency-based encodings or architectural modifications, these approaches often introduce additional complexity and do not fully address the underlying challenge of learning high-frequency components efficiently. We introduce Sinusoidal Trainable Activation Functions (STAF), designed to directly tackle this limitation by enabling networks to adaptively learn and represent complex signals with higher precision and efficiency. STAF inherently modulates its frequency components, allowing for self-adaptive spectral learning. This capability significantly improves convergence speed and expressivity, making STAF highly effective for both signal representations and inverse problems. Through extensive evaluations, we demonstrate that STAF outperforms state-of-the-art (SOTA) methods in accuracy and reconstruction fidelity with superior Peak Signal-to-Noise Ratio (PSNR). These results establish STAF as a robust solution for overcoming spectral bias and the capacity-convergence gap, making it valuable for computer graphics and related fields. Our codebase is publicly accessible on the this https URL.
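STAF 的可训练正弦激活可写成 f(x) = Σ_i a_i · sin(ω_i x + φ_i),其中幅值 a、频率 ω、相位 φ 均为可学习参数,网络借此自适应地调制自身的频谱成分。最小示意(参数个数与初始值为假设):

```python
import numpy as np

def staf(x, amp, freq, phase):
    """f(x) = sum_i amp_i * sin(freq_i * x + phase_i),逐元素作用。"""
    x = np.asarray(x, float)[..., None]            # 追加正弦分量维度
    return (amp * np.sin(freq * x + phase)).sum(axis=-1)

amp = np.array([1.0, 0.5])
freq = np.array([1.0, 2.0])
phase = np.array([0.0, 0.0])
y = staf([0.0, np.pi / 2], amp, freq, phase)       # f(0)=0, f(pi/2)=1
```

训练时对 amp、freq、phase 求梯度即可实现摘要所述的"自适应频谱学习"。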
zh
[CV-75] RealRAG : Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning
【速读】:该论文旨在解决现有文本到图像生成模型(如Stable Diffusion V3和Flux)在处理细粒度和未见过的新颖现实世界物体(例如特斯拉Cybertruck)时,因受限于固定参数和封闭数据集而导致的显著幻觉或失真问题。解决方案的关键在于提出首个基于真实物体的检索增强生成框架(RealRAG),通过学习和检索真实世界图像来弥补生成模型的知识缺口。具体而言,通过自反思对比学习训练反射检索器,将生成器的知识注入到自反思负样本中,确保检索到的增强图像能够补偿模型缺失的知识,从而集成缺失的记忆以生成未见过的新颖物体,并提升生成模型对细粒度视觉知识的整合能力,有效解决失真问题并提高细粒度对象生成的逼真度。
链接: https://arxiv.org/abs/2502.00848
作者: Yuanhuiyi Lyu,Xu Zheng,Lutao Jiang,Yibo Yan,Xin Zou,Huiyu Zhou,Linfeng Zhang,Xuming Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent text-to-image generative models, e.g., Stable Diffusion V3 and Flux, have achieved notable progress. However, these models are strongly restricted to their limited knowledge, a.k.a., their own fixed parameters, that are trained with closed datasets. This leads to significant hallucinations or distortions when facing fine-grained and unseen novel real-world objects, e.g., the appearance of the Tesla Cybertruck. To this end, we present the first real-object-based retrieval-augmented generation framework (RealRAG), which augments fine-grained and unseen novel object generation by learning and retrieving real-world images to overcome the knowledge gaps of generative models. Specifically, to integrate missing memory for unseen novel object generation, we train a reflective retriever by self-reflective contrastive learning, which injects the generator’s knowledge into the self-reflective negatives, ensuring that the retrieved augmented images compensate for the model’s missing knowledge. Furthermore, the real-object-based framework integrates fine-grained visual knowledge for the generative models, tackling the distortion problem and improving the realism for fine-grained object generation. Our RealRAG is superior in its modular application to all types of state-of-the-art text-to-image generative models and also delivers remarkable performance boosts with all of them, such as a gain of 16.18% FID score with the auto-regressive model on the Stanford Car benchmark.
zh
[CV-76] VLM-Assisted Continual learning for Visual Question Answering in Self-Driving
【速读】:该论文旨在解决在自动驾驶中视觉问答(VQA)任务面临的连续学习挑战,特别是在感知、预测和规划等不同任务中由于灾难性遗忘(Catastrophic Forgetting)导致的知识更新困难。解决方案的关键在于提出了一种结合视觉-语言模型(Vision-Language Models, VLMs)与选择性记忆回放(Selective Memory Replay)及知识蒸馏(Knowledge Distillation),并辅以任务特定投影层正则化(Task-Specific Projection Layer Regularization)的新型连续学习框架。其中,知识蒸馏机制通过“教师”模型引导后续任务的学习,减少遗忘现象;任务特定投影层则基于特征表示的差异计算损失,确保学习过程中的连续性和任务间转换的平稳性。
链接: https://arxiv.org/abs/2502.00843
作者: Yuxin Lin,Mengshi Qi,Liang Liu,Huadong Ma
机构: State Key Laboratory of Networking and Switching Technology (网络与交换技术国家重点实验室), Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we propose a novel approach for solving the Visual Question Answering (VQA) task in autonomous driving by integrating Vision-Language Models (VLMs) with continual learning. In autonomous driving, VQA plays a vital role in enabling the system to understand and reason about its surroundings. However, traditional models often struggle with catastrophic forgetting when sequentially exposed to new driving tasks, such as perception, prediction, and planning, each requiring different forms of knowledge. To address this challenge, we present a novel continual learning framework that combines VLMs with selective memory replay and knowledge distillation, reinforced by task-specific projection layer regularization. The knowledge distillation allows a previously trained model to act as a “teacher” to guide the model through subsequent tasks, minimizing forgetting. Meanwhile, task-specific projection layers calculate the loss based on the divergence of feature representations, ensuring continuity in learning and reducing the shift between tasks. Evaluated on the DriveLM dataset, our framework shows substantial performance improvements, with gains ranging from 21.40% to 32.28% across various metrics. These results highlight the effectiveness of combining continual learning with VLMs in enhancing the resilience and reliability of VQA systems in autonomous driving. We will release our source code.
zh
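该框架的两项核心损失(教师模型的温度蒸馏损失,以及任务特定投影层特征散度的正则项)可用如下草图示意。其中温度 T、正则权重 0.1 以及 KL 的 Hinton 式写法均为通用假设,未必与论文实现完全一致。

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Temperature-scaled KL(teacher || student), the common Hinton-style KD loss;
    # a generic sketch, not necessarily the paper's exact formulation.
    p_t = softmax(teacher_logits / T)
    log_p_s = np.log(softmax(student_logits / T) + 1e-12)
    log_p_t = np.log(p_t + 1e-12)
    return (T * T) * np.mean(np.sum(p_t * (log_p_t - log_p_s), axis=-1))

def projection_reg(feat_student, feat_teacher):
    # Penalize the divergence of task-specific projected feature representations.
    return np.mean((feat_student - feat_teacher) ** 2)

rng = np.random.default_rng(1)
s_logits = rng.normal(size=(4, 10))   # student outputs on replayed samples
t_logits = rng.normal(size=(4, 10))   # frozen "teacher" (previous task) outputs
total = kd_loss(s_logits, t_logits) + 0.1 * projection_reg(
    rng.normal(size=(4, 64)), rng.normal(size=(4, 64)))
```

蒸馏项约束学生在旧任务样本上贴近教师分布,投影正则项则抑制任务切换时的特征漂移,两者共同缓解灾难性遗忘。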
[CV-77] Cross multiscale vision transformer for deep fake detection
【速读】:该论文旨在解决深度伪造技术泛滥带来的数字媒体真实性挑战,提出通过评估多种深度学习模型来检测深度伪造内容。解决方案的关键在于利用传统深度学习方法与新架构相结合,训练一系列模型并通过准确率等指标严格评估其性能。
链接: https://arxiv.org/abs/2502.00833
作者: Akhshan P,Taneti Sanjay,Chandrakala S
机构: Shiv Nadar University Chennai(谢瓦那得大学钦奈校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The proliferation of deep fake technology poses significant challenges to digital media authenticity, necessitating robust detection mechanisms. This project evaluates deep fake detection using the SP Cup’s 2025 deep fake detection challenge dataset. We focused on exploring various deep learning models for detecting deep fake content, utilizing traditional deep learning techniques alongside newer architectures. Our approach involved training a series of models and rigorously assessing their performance using metrics such as accuracy.
zh
[CV-78] OOD Detection with immature Models
【速读】:该论文旨在解决深度生成模型(Deep Generative Models, DGMs)在分配较高的似然值(likelihood)给训练数据(in-distribution, ID)相较于未见过的数据(out-of-distribution, OOD)时缺乏性能保证的问题。尤其当ID输入比OOD数据点更为复杂时,这一反直觉的行为尤为显著。论文的关键解决方案在于利用数据点相对于DGM参数的梯度,提出了一种新的异常检测框架,通过估计给定数据点各层梯度范数的联合密度来实现,这种方法不依赖于特定模型,并且在多种基于似然的DGM和图像数据集组合中的表现优于典型性检验(Typicality Test)。此外,研究发现即使使用训练早期阶段的未成熟模型也能在下游任务中达到与成熟模型相当甚至更优的结果,从而强调了部分训练模型在这些任务中的潜力。
链接: https://arxiv.org/abs/2502.00820
作者: Behrooz Montazeran,Ullrich Köthe
机构: University of Heidelberg (海德堡大学); Interdisciplinary Center for Scientific Computing (科学计算跨学科中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 2 Tables, 9 Figures
点击查看摘要
Abstract:Likelihood-based deep generative models (DGMs) have gained significant attention for their ability to approximate the distributions of high-dimensional data. However, these models lack a performance guarantee in assigning higher likelihood values to in-distribution (ID) inputs, data the models are trained on, compared to out-of-distribution (OOD) inputs. This counter-intuitive behaviour is particularly pronounced when ID inputs are more complex than OOD data points. One potential approach to address this challenge involves leveraging the gradient of a data point with respect to the parameters of the DGMs. A recent OOD detection framework proposed estimating the joint density of layer-wise gradient norms for a given data point as a model-agnostic method, demonstrating superior performance compared to the Typicality Test across likelihood-based DGMs and image dataset pairs. In particular, most existing methods presuppose access to fully converged models, the training of which is both time-intensive and computationally demanding. In this work, we demonstrate that using immature models, stopped at early stages of training, can mostly achieve equivalent or even superior results on this downstream task compared to mature models capable of generating high-quality samples that closely resemble ID data. This novel finding enhances our understanding of how DGMs learn the distribution of ID data and highlights the potential of leveraging partially trained models for downstream tasks. Furthermore, we offer a possible explanation for this unexpected behaviour through the concept of support overlap.
zh
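该类方法所依赖的"逐层梯度范数"特征,可以用一个手写反向传播的两层小网络来示意:对每个样本计算损失对各层权重的梯度,取各层的 Frobenius 范数拼成特征向量。网络结构、维度与损失形式均为演示用假设;基于该特征的联合密度估计(即实际的 OOD 打分)在此省略。

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 5, 8, 3
W1 = rng.normal(scale=0.5, size=(d_h, d_in))
W2 = rng.normal(scale=0.5, size=(d_out, d_h))

def loss_and_layer_grads(x, y):
    # Toy 2-layer model with loss L = 0.5 * ||W2 tanh(W1 x) - y||^2,
    # with gradients derived by manual backprop.
    h = np.tanh(W1 @ x)
    e = W2 @ h - y
    loss = 0.5 * float(e @ e)
    gW2 = np.outer(e, h)                    # dL/dW2
    gh = W2.T @ e                           # dL/dh
    gW1 = np.outer(gh * (1 - h ** 2), x)    # dL/dW1 through tanh
    return loss, gW1, gW2

def grad_norm_feature(x, y):
    # Layer-wise gradient norms: the per-sample feature used for OOD scoring.
    _, gW1, gW2 = loss_and_layer_grads(x, y)
    return np.array([np.linalg.norm(gW1), np.linalg.norm(gW2)])

x = rng.normal(size=d_in)
y = rng.normal(size=d_out)
feat = grad_norm_feature(x, y)
```

论文的观察是:即便模型"未成熟"(训练早停),这一梯度范数特征已足以区分 ID 与 OOD 样本。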
[CV-79] Environment-Driven Online LiDAR-Camera Extrinsic Calibration
【速读】:该论文旨在解决现有激光雷达与相机外参标定方法缺乏灵活性,无法适应传感器数据和环境变化的问题。关键在于提出了一种名为EdO-LCEC的环境驱动在线标定方法,该方法通过引入泛化场景判别器来主动解析环境条件,并采用双路径对应匹配技术(Dual-Path Correspondence Matching, DPCM),利用结构和纹理一致性实现可靠的3D-2D对应关系,从而提高在不同视图和场景中的精度。
链接: https://arxiv.org/abs/2502.00801
作者: Zhiwei Huang,Jiaqi Li,Ping Zhong,Rui Fan
机构: Department of Control Science & Engineering, the College of Electronics & Information Engineering, Tongji University(同济大学); School of Computer Science and Engineering, Central South University(中南大学); National Key Laboratory of Science and Technology on Automatic Target Recognition, National University of Defense Technology(国防科技大学自动目标识别国家重点实验室); Department of Control Science & Engineering, the College of Electronics & Information Engineering, Shanghai Research Institute for Intelligent Autonomous Systems, the State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, Tongji University(同济大学); National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University(西安交通大学机电混合增强智能国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:LiDAR-camera extrinsic calibration (LCEC) is the core for data fusion in computer vision. Existing methods typically rely on customized calibration targets or fixed scene types, lacking the flexibility to handle variations in sensor data and environmental contexts. This paper introduces EdO-LCEC, the first environment-driven, online calibration approach that achieves human-like adaptability. Inspired by the human perceptual system, EdO-LCEC incorporates a generalizable scene discriminator to actively interpret environmental conditions, creating multiple virtual cameras that capture detailed spatial and textural information. To overcome cross-modal feature matching challenges between LiDAR and camera, we propose dual-path correspondence matching (DPCM), which leverages both structural and textural consistency to achieve reliable 3D-2D correspondences. Our approach formulates the calibration process as a spatial-temporal joint optimization problem, utilizing global constraints from multiple views and scenes to improve accuracy, particularly in sparse or partially overlapping sensor views. Extensive experiments on real-world datasets demonstrate that EdO-LCEC achieves state-of-the-art performance, providing reliable and precise calibration across diverse, challenging environments.
zh
[CV-80] Adversarial Semantic Augmentation for Training Generative Adversarial Networks under Limited Data
【速读】:该论文旨在解决生成对抗网络(GANs)在低数据量条件下合成图像性能显著下降的问题。为了解决这一问题,现有方法主要通过各种数据增强技术来扩充训练集。然而,这些增强技术可能导致数据分布泄露甚至改变。为此,论文提出了一种对抗语义增强(Adversarial Semantic Augmentation, ASA)技术,在语义层面上而非图像层面上扩充训练数据。关键在于通过估计真实图像和生成图像的语义特征协方差矩阵,找到有意义的变换方向,从而实现对原始特征的转换,例如改变人脸数据集中的背景或表情。这种方法通过优化预期对抗损失的上界来隐式实现语义增强,避免了冗余采样并引入了可忽略的计算开销,从而提高了计算效率。
链接: https://arxiv.org/abs/2502.00800
作者: Mengping Yang,Zhe Wang,Ziqiu Chi,Dongdong Li,Wenli Du
机构: East China University of Science and Technology (华东理工大学); Department of Computer Science & Engineering, East China University of Science & Technology (华东理工大学计算机科学与工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This work was completed in 2022 and submitted to an IEEE journal for potential publication
点击查看摘要
Abstract:Generative adversarial networks (GANs) have made remarkable achievements in synthesizing images in recent years. Typically, training GANs requires massive data, and the performance of GANs deteriorates significantly when training data is limited. To improve the synthesis performance of GANs in low-data regimes, existing approaches use various data augmentation techniques to enlarge the training sets. However, it is identified that these augmentation techniques may leak or even alter the data distribution. To remedy this, we propose an adversarial semantic augmentation (ASA) technique to enlarge the training data at the semantic level instead of the image level. Concretely, considering semantic features usually encode rich information of images, we estimate the covariance matrices of semantic features for both real and generated images to find meaningful transformation directions. Such directions translate original features to another semantic representation, e.g., changing the backgrounds or expressions of the human face dataset. Moreover, we derive an upper bound of the expected adversarial loss. By optimizing the upper bound, our semantic augmentation is implicitly achieved. Such design avoids redundant sampling of the augmented features and introduces negligible computation overhead, making our approach computationally efficient. Extensive experiments on both few-shot and large-scale datasets demonstrate that our method consistently improves the synthesis quality under various data regimes, and further visualization and analytic results suggest the satisfactory versatility of our proposed method.
zh
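语义层面增强的"显式采样"版本可示意如下:先估计语义特征的协方差矩阵,再沿其刻画的方向对特征做平移。注意论文实际是通过优化期望对抗损失的上界来"隐式"完成这一过程以避免冗余采样;此处为便于理解采用显式采样,特征维度与强度系数 lam 均为示例假设。

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for semantic features extracted from real images (dimension is illustrative).
feats = rng.normal(size=(200, 16))
cov = np.cov(feats, rowvar=False)        # (16, 16) covariance of semantic features

def semantic_augment(f, lam=0.5):
    # Translate a feature along directions sampled from the estimated covariance,
    # i.e. f' = f + delta with delta ~ N(0, lam * Sigma). Explicit sampling shown
    # for clarity; the paper achieves this implicitly via an upper-bound loss.
    delta = rng.multivariate_normal(np.zeros_like(f), lam * cov)
    return f + delta

aug = semantic_augment(feats[0])
```

直观上,协方差的主方向对应数据中"有意义"的语义变化(如背景、表情),沿这些方向平移即得到语义级增广样本。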
[CV-81] ask-Specific Adaptation with Restricted Model Access
【速读】:该论文旨在解决现有微调方法在实际应用中面临的挑战,包括管理多个模型副本或推理管道的复杂性、边缘设备优化的低效性,以及对专有权、隐私和不安全模型变体暴露的担忧。论文的关键解决方案是探索“灰盒”微调方法,这种方法隐藏模型架构和权重,仅允许梯度传播,并引入两个轻量级可学习模块以适应新任务。此外,提出了一种更少限制的变体,通过增加模型的入口点来平衡性能与模型暴露程度。
链接: https://arxiv.org/abs/2502.00796
作者: Matan Levy,Rami Ben-Ari,Dvir Samuel,Nir Darshan,Dani Lischinski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The emergence of foundational models has greatly improved performance across various downstream tasks, with fine-tuning often yielding even better results. However, existing fine-tuning approaches typically require access to model weights and layers, leading to challenges such as managing multiple model copies or inference pipelines, inefficiencies in edge device optimization, and concerns over proprietary rights, privacy, and exposure to unsafe model variants. In this paper, we address these challenges by exploring “Gray-box” fine-tuning approaches, where the model’s architecture and weights remain hidden, allowing only gradient propagation. We introduce a novel yet simple and effective framework that adapts to new tasks using two lightweight learnable modules at the model’s input and output. Additionally, we present a less restrictive variant that offers more entry points into the model, balancing performance with model exposure. We evaluate our approaches across several backbones on benchmarks such as text-image alignment, text-video alignment, and sketch-image alignment. Results show that our Gray-box approaches are competitive with full-access fine-tuning methods, despite having limited access to the model.
zh
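"灰盒"微调的思路可以用一个线性化的最小示例说明:中间的黑盒模型权重 F 冻结且对外不可见,只允许梯度穿过;可训练的只有模型输入端的适配器 A 与输出端的适配器 B。各维度、初始化与学习率均为演示假设,真实场景中 F 是任意的冻结基础模型。

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_f, d_out = 6, 12, 4

F = rng.normal(scale=0.3, size=(d_f, d_f))   # frozen "black-box" weights (hidden in practice)
A = rng.normal(scale=0.1, size=(d_f, d_in))  # learnable lightweight input module
B = rng.normal(scale=0.1, size=(d_out, d_f)) # learnable lightweight output module

X = rng.normal(size=(32, d_in))
Y = rng.normal(size=(32, d_out))

def forward(X):
    return (B @ (F @ (A @ X.T))).T

def mse(P, Y):
    return float(np.mean((P - Y) ** 2))

loss_init = mse(forward(X), Y)
lr = 0.1
for _ in range(500):
    Z = A @ X.T                 # (d_f, N) adapted inputs
    H = F @ Z                   # black-box output; we only propagate gradients through F
    P = B @ H                   # (d_out, N)
    E = 2 * (P - Y.T) / Y.size  # dL/dP for mean squared error
    gB = E @ H.T                # gradient for the output module
    gA = F.T @ (B.T @ E) @ X    # gradient flows through frozen F to the input module
    B -= lr * gB
    A -= lr * gA
loss_final = mse(forward(X), Y)
```

只更新 A、B 意味着模型本体既不被复制也不被暴露,这正是灰盒设定在性能与模型保密之间的折中。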
[CV-82] Estimating forest carbon stocks from high-resolution remote sensing imagery by reducing domain shift with style transfer
【速读】:该论文旨在提高基于地面监测样本数据与卫星遥感影像融合分析的森林碳储量监测和评估的准确性。关键解决方案在于使用风格迁移方法,并引入Swin Transformer模型通过注意力机制提取全局特征,将碳储量估算转化为图像翻译问题。这种方法旨在提升大尺度观测下的精度。
链接: https://arxiv.org/abs/2502.00784
作者: Zhenyu Yu,Jinnian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Forests function as crucial carbon reservoirs on land, and their carbon sinks can efficiently reduce atmospheric CO2 concentrations and mitigate climate change. Currently, the overall trend for monitoring and assessing forest carbon stocks is to integrate ground monitoring sample data with satellite remote sensing imagery. This style of analysis facilitates large-scale observation. However, these techniques require improvement in accuracy. We used GF-1 WFV and Landsat TM images to analyze Huize County, Qujing City, Yunnan Province in China. Using the style transfer method, we introduced Swin Transformer to extract global features through attention mechanisms, converting the carbon stock estimation into an image translation problem.
zh
[CV-83] A method for estimating forest carbon storage distribution density via artificial intelligence generated content model
【速读】:该论文旨在提高森林碳储量估算的精度与效率。研究的关键在于引入了知识蒸馏后的VGG-19模块(Knowledge Distillation-VGG, KD-VGG)进行初始特征提取,并提出了改进的隐式扩散模型(Improved Implicit Diffusion Model, IIDM)。通过这些方法,论文实现了减少模型参数数量的同时缩短推理时间,并提高了特征融合能力,从而提升了高分辨率图像在连续尺度上的恢复效果及整体估算准确性。最终,IIDM模型在碳储量估算中的均方根误差(RMSE)达到28.68,比回归模型提高了约31.45%。
链接: https://arxiv.org/abs/2502.00783
作者: Zhenyu Yu,Jinnian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Forest is the most significant land-based carbon storage mechanism. The forest carbon sink can effectively decrease the atmospheric CO2 concentration and mitigate climate change. Remote sensing estimation not only ensures high accuracy of data, but also enables large-scale area observation. Optical images provide the possibility for long-term monitoring, which is a potential issue in the future carbon storage estimation research. We chose Huize County, Qujing City, Yunnan Province, China as the study area, took GF-1 WFV satellite image as the data, introduced the KD-VGG module to extract the initial features, and proposed the improved implicit diffusion model (IIDM). The results showed that: (1) The VGG-19 module after knowledge distillation can realize the initial feature extraction, reduce the inference time and improve the accuracy in the case of reducing the number of model parameters. (2) The Attention + MLP module was added for feature fusion to obtain the relationship between global and local features and realized the restoration of high-fidelity images in the continuous scale range. (3) The IIDM model proposed in this paper had the highest estimation accuracy, with an RMSE of 28.68, which was 13.16 lower than that of the regression model, an improvement of about 31.45%. In the estimation of carbon storage, the generative model can extract deeper features, and its performance was significantly better than other models. It demonstrated the feasibility of artificial intelligence-generated content (AIGC) in the field of quantitative remote sensing and provided valuable insights for the study of carbon neutralization effect. By combining the actual characteristics of the forest, the regional carbon storage estimation with a resolution of 16 meters was utilized to provide a significant theoretical basis for the formulation of forest carbon sink regulation.
zh
[CV-84] Privacy Preserving Properties of Vision Classifiers
【速读】:该论文旨在评估不同视觉分类器架构在隐私保护方面的性能,并挑战了模型共享时隐含的隐私保护假设。论文的关键在于通过网络逆向重构技术,系统性地分析多层感知机(MLP)、卷积神经网络(CNN)和视觉变换器(ViT)等架构在隐私保护方面的差异,揭示它们在记忆和泄露训练数据方面的程度,并量化各模型逆向重构的难易程度。研究发现突显了输入表示、特征提取机制及权重结构等架构差异对隐私风险的影响,并识别出哪些架构更能抵御逆向攻击,同时探讨了模型性能与隐私保护之间的权衡。这一研究为设计安全且注重隐私的机器学习系统提供了可行的见解,强调了在处理专有或个人信息的应用中评估架构决策的重要性。
链接: https://arxiv.org/abs/2502.00760
作者: Pirzada Suhail,Amit Sethi
机构: IIT Bombay (印度理工学院孟买分校)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision classifiers are often trained on proprietary datasets containing sensitive information, yet the models themselves are frequently shared openly under the privacy-preserving assumption. Although these models are assumed to protect sensitive information in their training data, the extent to which this assumption holds for different architectures remains unexplored. This assumption is challenged by inversion attacks which attempt to reconstruct training data from model weights, exposing significant privacy vulnerabilities. In this study, we systematically evaluate the privacy-preserving properties of vision classifiers across diverse architectures, including Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Vision Transformers (ViTs). Using network inversion-based reconstruction techniques, we assess the extent to which these architectures memorize and reveal training data, quantifying the relative ease of reconstruction across models. Our analysis highlights how architectural differences, such as input representation, feature extraction mechanisms, and weight structures, influence privacy risks. By comparing these architectures, we identify which are more resilient to inversion attacks and examine the trade-offs between model performance and privacy preservation, contributing to the development of secure and privacy-respecting machine learning models for sensitive applications. Our findings provide actionable insights into the design of secure and privacy-aware machine learning systems, emphasizing the importance of evaluating architectural decisions in sensitive applications involving proprietary or personal data.
zh
[CV-85] Continuity-Preserving Convolutional Autoencoders for Learning Continuous Latent Dynamical Models from Images
【速读】:该论文旨在解决从离散图像帧中学习连续动态系统的问题。传统方法直接应用卷积自编码器会导致潜在状态在时间上的不连续性。为了解决这一问题,论文提出了一种保持连续性的卷积自编码器(Continuity-preserving Convolutional Autoencoders, CpAEs),其关键是通过促进卷积滤波器的连续性来保持潜在状态的连续性,从而实现更准确的潜在动态模型。
链接: https://arxiv.org/abs/2502.00754
作者: Aiqing Zhu,Yuting Pan,Qianxiao Li
机构: National University of Singapore
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Continuous dynamical systems are cornerstones of many scientific and engineering disciplines. While machine learning offers powerful tools to model these systems from trajectory data, challenges arise when these trajectories are captured as images, resulting in pixel-level observations that are discrete in nature. Consequently, a naive application of a convolutional autoencoder can result in latent coordinates that are discontinuous in time. To resolve this, we propose continuity-preserving convolutional autoencoders (CpAEs) to learn continuous latent states and their corresponding continuous latent dynamical models from discrete image frames. We present a mathematical formulation for learning dynamics from image frames, which illustrates issues with previous approaches and motivates our methodology based on promoting the continuity of convolution filters, thereby preserving the continuity of the latent states. This approach enables CpAEs to produce latent states that evolve continuously with the underlying dynamics, leading to more accurate latent dynamical models. Extensive experiments across various scenarios demonstrate the effectiveness of CpAEs.
zh
[CV-86] An Event-Based Perception Pipeline for a Table Tennis Robot
【速读】:该论文旨在解决乒乓球机器人在快速运动球体检测中的精度与实时性问题。关键在于采用事件驱动相机(Event-based camera)替代传统的帧驱动(frame-based)相机,从而实现一个仅使用事件驱动相机的实时感知管道。这种方法能够提供比帧驱动相机高一个数量级的更新率,显著降低球体位置、速度和旋转估计的均值误差和不确定性,进而提升机器人的控制性能。
链接: https://arxiv.org/abs/2502.00749
作者: Andreas Ziegler,Thomas Gossard,Arren Glover,Andreas Zell
机构: University of Tübingen(图宾根大学); Istituto Italiano di Tecnologia(意大利技术研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:Table tennis robots gained traction over the last years and have become a popular research challenge for control and perception algorithms. Fast and accurate ball detection is crucial for enabling a robotic arm to rally the ball back successfully. So far, most table tennis robots use conventional, frame-based cameras for the perception pipeline. However, frame-based cameras suffer from motion blur if the frame rate is not high enough for fast-moving objects. Event-based cameras, on the other hand, do not have this drawback since pixels report changes in intensity asynchronously and independently, leading to an event stream with a temporal resolution on the order of microseconds. To the best of our knowledge, we present the first real-time perception pipeline for a table tennis robot that uses only event-based cameras. We show that compared to a frame-based pipeline, event-based perception pipelines have an update rate which is an order of magnitude higher. This is beneficial for the estimation and prediction of the ball’s position, velocity, and spin, resulting in lower mean errors and uncertainties. These improvements are an advantage for the robot control, which has to be fast, given the short time a table tennis ball is flying until the robot has to hit back.
zh
[CV-87] Spatio-Temporal Progressive Attention Model for EEG Classification in Rapid Serial Visual Presentation Task
【速读】:该论文旨在解决快速串行视觉呈现(RSVP)任务中脑电图(EEG)信号的空间和时间依赖性分析问题。解决方案的关键在于提出了一种新颖的空间-时间渐进注意力模型(STPAM),通过三个独立的空间专家逐步学习脑区的空间拓扑信息,并利用这些信息减少无关脑区的干扰。随后,基于获得的空间特征序列,再通过三个时间专家逐步关注关键的EEG切片来捕捉时间依赖性。这种空间-时间注意力机制显著提升了EEG分类性能。
链接: https://arxiv.org/abs/2502.00730
作者: Yang Li,Wei Liu,Tianzhi Feng,Fu Li,Chennan Wu,Boxun Fu,Zhifu Zhao,Xiaotian Wang,Guangming Shi
机构: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education(智能感知与图像理解重点实验室), the School of Artificial Intelligence(人工智能学院), Xidian University(西安电子科技大学), Xi’an, 710071, China(中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As a type of multi-dimensional sequential data, the spatial and temporal dependencies of electroencephalogram (EEG) signals should be further investigated. Thus, in this paper, we propose a novel spatial-temporal progressive attention model (STPAM) to improve EEG classification in rapid serial visual presentation (RSVP) tasks. STPAM first adopts three distinct spatial experts to learn the spatial topological information of brain regions progressively, which is used to minimize the interference of irrelevant brain regions. Concretely, the former expert filters out EEG electrodes in the relative brain regions to be used as prior knowledge for the next expert, ensuring that the subsequent experts gradually focus their attention on information from significant EEG electrodes. This process strengthens the effect of the important brain regions. Then, based on the above-obtained feature sequence with spatial information, three temporal experts are adopted to capture the temporal dependence by progressively assigning attention to the crucial EEG slices. In addition to the above EEG classification method, we build a novel Infrared RSVP EEG Dataset (IRED), the first of its kind based on dim infrared images with small targets, and conduct extensive experiments on it. The results show that our STPAM can achieve better performance than all the compared methods.
zh
[CV-88] Vision and Language Reference Prompt into SAM for Few-shot Segmentation
【速读】:该论文旨在解决Few-shot分割模型中存在的参考信息有限导致精度受限的问题。关键在于提出了一种名为Vision and Language reference Prompt into SAM (VLP-SAM)的新模型,通过输入图像和文本标签作为参考信息,结合视觉和语言模态来增强提示嵌入(prompt embeddings),从而显著提升了在PASCAL-5i和COCO-20i数据集上的Few-shot分割任务性能,相比之前最先进的方法分别提高了6.3%和9.5%的平均交并比(mIoU)。
链接: https://arxiv.org/abs/2502.00719
作者: Kosuke Sakurai,Ryotaro Shimizu,Masayuki Goto
机构: Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures
点击查看摘要
Abstract:Segment Anything Model (SAM) represents a large-scale segmentation model that enables powerful zero-shot capabilities with flexible prompts. While SAM can segment any object in zero-shot, it requires user-provided prompts for each target image and does not attach any label information to masks. Few-shot segmentation models addressed these issues by inputting annotated reference images as prompts to SAM and can segment specific objects in target images without user-provided prompts. Previous SAM-based few-shot segmentation models only use annotated reference images as prompts, resulting in limited accuracy due to a lack of reference information. In this paper, we propose a novel few-shot segmentation model, Vision and Language reference Prompt into SAM (VLP-SAM), that utilizes the visual information of the reference images and the semantic information of the text labels by inputting not only images but also language as reference information. In particular, VLP-SAM is a simple and scalable structure with minimal learnable parameters, which inputs prompt embeddings with vision-language information into SAM using a multimodal vision-language model. To demonstrate the effectiveness of VLP-SAM, we conducted experiments on the PASCAL-5i and COCO-20i datasets, and achieved high performance in the few-shot segmentation task, outperforming the previous state-of-the-art model by a large margin (6.3% and 9.5% in mIoU, respectively). Furthermore, VLP-SAM demonstrates its generality in unseen objects that are not included in the training data. Our code is available at this https URL.
zh
[CV-89] MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在高可靠性应用领域中存在的幻觉问题。论文的关键解决方案在于提出了一种名为MINT的新型无训练解码策略,通过减少不相关的图像标记的关注来增强局部感知能力,并使用对比解码以推动模型更加关注关键图像区域,从而引导模型在生成过程中更集中于关键视觉元素。
链接: https://arxiv.org/abs/2502.00717
作者: Chao Wang,Jianming Yang,Yang Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, 4 tables
点击查看摘要
Abstract:Hallucination has been a long-standing and inevitable problem that hinders the application of Large Vision-Language Models (LVLMs) in domains that require high reliability. Various methods focus on improvement depending on data annotations or training strategies, yet place less emphasis on LLM’s inherent problems. To fill this gap, we delve into the attention mechanism of the decoding process in the LVLM. Intriguingly, our investigation uncovers the prevalent attention redundancy within the hierarchical architecture of the LVLM, manifesting as overextended image processing in deep layers and an overabundance of non-essential image tokens. Stemming from the observation, we thus propose MINT, a novel training-free decoding strategy, MItigating hallucinations via tokeN reducTion. Specifically, we dynamically intensify the LVLM’s local perception capability by masking its attention to irrelevant image tokens. In addition, we use contrastive decoding that pushes the model to focus more on those key image regions. Our full method aims to guide the model in concentrating more on key visual elements during generation. Extensive experimental results on several popular public benchmarks show that our approach achieves a 4% improvement in mitigating hallucinations caused by distracted perception compared to original models. Meanwhile, our approach is demonstrated to make the model perceive 5% more visual points even though we reduce a suite of image tokens.
zh
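其中"对比解码"一步可示意如下:将注意力聚焦(屏蔽无关图像 token)后得到的 logits 与原始 logits 做对比放大,使解码更偏向由关键视觉区域支撑的词。系数形式 `(1+α)·聚焦 − α·原始` 采用常见的对比解码写法,未必是 MINT 的确切系数;示例中的词表大小与"聚焦增益"均为假设。

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def contrastive_decode(logits_focused, logits_plain, alpha=0.5):
    # Amplify what the attention-focused pass adds over the plain pass.
    # The (1 + alpha) / alpha form follows the common contrastive-decoding
    # recipe; MINT's exact coefficients may differ.
    return (1 + alpha) * logits_focused - alpha * logits_plain

V = 6                                           # toy vocabulary size
plain = np.array([0.5, 0.2, 0.1, 0.3, 0.0, 0.4])  # logits with full (distracted) attention
focused = plain.copy()
focused[2] += 1.0   # pretend masking distractor image tokens boosts the grounded word

p = softmax(contrastive_decode(focused, plain))
```

对比放大只增强"聚焦后才变高"的 logit,因而被关键视觉证据支撑的词概率进一步升高,有助于抑制由注意力分散引起的幻觉。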
[CV-90] VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework
【速读】:该论文旨在解决视觉推理任务中的两个主要问题:有限的推理可解释性以及问题文本中存在的欠规范现象。此外,细粒度视觉知识的缺乏限制了对主体行为的精确理解。为了解决这些问题,论文提出了一种名为VIKSER(基于视觉知识的自我强化推理框架)的方法。关键在于通过大型语言模型提取细粒度视觉知识,并利用视觉关系检测技术辅助这一过程。同时,VIKSER采用了一种称为证据链(Chain-of-Evidence, CoE)的新颖提示方法,以增强其推理能力的可解释性。此外,集成的自我反思技术使VIKSER能够从错误中学习和改进。
链接: https://arxiv.org/abs/2502.00711
作者: Chunbai Zhang,Chao Wang,Yang Zhou,Yan Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages,12 figures
点击查看摘要
Abstract:Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability, while hindering by the phenomenon of underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER utilizes fine-grained visual knowledge to paraphrase the question with underspecification. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of “evidence for reasoning” to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology empowers VIKSER with the ability to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results in relevant tasks.
zh
[CV-91] PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation
【速读】:该论文旨在解决在生成三维场景时遇到的几个关键挑战:1)确保复合场景布局符合物理定律;2)准确捕捉复杂场景描述中的资产及其关系;3)布局方法中自主资产生成能力受限。为了解决这些问题,论文提出了一种名为PhiP-G的新框架,其关键是将基于世界模型的生成技术与布局指导无缝集成,并利用基于大型语言模型(LLM)的代理分析复杂场景描述以生成场景图。此外,PhiP-G结合了多模态二维生成代理和三维高斯生成方法进行目标资产创建,并通过具有粘附能力的物理池和视觉监督代理来预测和规划布局,从而显著提升了生成质量和物理合理性。
链接: https://arxiv.org/abs/2502.00708
作者: Qixuan Li,Chao Wang,Zongjin He,Yan Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages.8 figures
点击查看摘要
Abstract:Text-to-3D asset generation has achieved significant optimization under the supervision of 2D diffusion priors. However, when dealing with compositional scenes, existing methods encounter several challenges: 1). failure to ensure that composite scene layouts comply with physical laws; 2). difficulty in accurately capturing the assets and relationships described in complex scene descriptions; 3). limited autonomous asset generation capabilities among layout approaches leveraging large language models (LLMs). To avoid these compromises, we propose a novel framework for compositional scene generation, PhiP-G, which seamlessly integrates generation techniques with layout guidance based on a world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene description to generate a scene graph, and integrating a multimodal 2D generation agent and a 3D Gaussian generation method for targeted assets creation. For the stage of layout, PhiP-G employs a physical pool with adhesion capabilities and a visual supervision agent, forming a world model for layout prediction and planning. Extensive experiments demonstrate that PhiP-G significantly enhances the generation quality and physical rationality of the compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA) performance in CLIP scores, achieves parity with the leading methods in generation quality as measured by T^3Bench, and improves efficiency by 24x.
zh
[CV-92] S2CFormer: Reorienting Learned Image Compression from Spatial Interaction to Channel Aggregation
【速读】:该论文旨在重新评估变换器在学习图像压缩(LIC)中的关键因素,并解决现有方法中空间操作复杂化导致解码延迟与率失真性能之间权衡不佳的问题。论文的关键在于强调通道聚合模块的重要性,通过将空间操作替换为恒等映射,发现仅依靠通道操作即可达到领先方法的率失真性能。基于这一洞见,论文提出了"S2CFormer"范式,重新聚焦于通道聚合而非空间交互。论文展示了两种S2CFormer实例:S2C-Conv和S2C-Attention,它们均实现了最先进的率失真性能和显著更快的解码速度。此外,还引入了结合不同S2CFormer实例优势的S2C-Hybrid模型,在多个数据集上超越现有方法,树立了高效高性能LIC的新标杆。
链接: https://arxiv.org/abs/2502.00700
作者: Yunuo Chen,Qian Li,Bing He,Donghui Feng,Ronghua Wu,Qi Wang,Li Song,Guo Lu,Wenjun Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Transformers have achieved significant success in learned image compression (LIC), with Swin Transformers emerging as the mainstream choice for nonlinear transforms. A common belief is that their sophisticated spatial operations contribute most to their efficacy. However, the crucial role of the feed-forward network (FFN) based Channel Aggregation module within the transformer architecture has been largely overlooked, and the over-design of spatial operations leads to a suboptimal trade-off between decoding latency and R-D performance. In this paper, we reevaluate the key factors behind the competence of transformers in LIC. By replacing spatial operations with identity mapping, we are surprised to find that channel operations alone can approach the R-D performance of the leading methods. This solid lower bound of performance emphasizes that the presence of channel aggregation is more essential for the LIC model to achieve competitive performance, while the previously complex spatial interactions are partly redundant. Based on this insight, we initiate the “S2CFormer” paradigm, a general architecture that reorients the focus of LIC from Spatial Interaction to Channel Aggregation. We present two instantiations of the S2CFormer: S2C-Conv, and S2C-Attention. Each one incorporates a simple operator for spatial interaction and serves as nonlinear transform blocks for our LIC models. Both models demonstrate state-of-the-art (SOTA) R-D performance and significantly faster decoding speed. These results also motivate further exploration of advanced FFN structures to enhance the R-D performance while maintaining model efficiency. With these foundations, we introduce S2C-Hybrid, an enhanced LIC model that combines the strengths of different S2CFormer instantiations. This model outperforms all the existing methods on several datasets, setting a new benchmark for efficient and high-performance LIC.
zh
[CV-93] MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
【速读】:该论文旨在解决人工智能研究领域缺乏系统性基准来量化多模态系统中的关键认知维度的问题。解决方案的关键在于提出了MM-IQ评估框架,该框架包含2,710个精心策划的测试项目,涵盖了8种不同的推理范式,从而能够更全面地评估多模态模型的认知能力。
链接: https://arxiv.org/abs/2502.00698
作者: Huanqia Cai,Yijun Yang,Winston Hu
机构: Tencent(腾讯)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (27.49% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.
zh
[CV-94] TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical and Radiomic Data Fusion
【速读】:该论文旨在解决慢性肝病预后评估中多模态数据融合的挑战。现有的多模态融合方法难以适应更丰富的医学模态,并且在捕捉模态间关系方面存在困难。为了解决这些问题,论文提出了一种名为Triple-Modal Interaction Chronic Liver Network (TMI-CLNet)的方法。关键在于开发了Intra-Modality Aggregation模块以消除模态内的冗余信息,并设计了Triple-Modal Cross-Attention Fusion模块来提取跨模态信息。此外,还引入了Triple-Modal Feature Fusion损失函数以对齐不同模态间的特征表示。这些创新显著提升了在肝脏预后数据集上的表现,超越了现有的一流单模态模型和其他多模态技术。
链接: https://arxiv.org/abs/2502.00695
作者: Linglong Wu,Xuhao Shan,Ruiquan Ge,Ruoyu Liang,Chi Zhang,Yonghong Li,Ahmed Elazab,Huoling Luo,Yunbi Liu,Changmiao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, accepted by IEEE ISBI 2025
点击查看摘要
Abstract:Chronic liver disease represents a significant health challenge worldwide and accurate prognostic evaluations are essential for personalized treatment plans. Recent evidence suggests that integrating multimodal data, such as computed tomography imaging, radiomic features, and clinical information, can provide more comprehensive prognostic information. However, modalities have an inherent heterogeneity, and incorporating additional modalities may exacerbate the challenges of heterogeneous data fusion. Moreover, existing multimodal fusion methods often struggle to adapt to richer medical modalities, making it difficult to capture inter-modal relationships. To overcome these limitations, We present the Triple-Modal Interaction Chronic Liver Network (TMI-CLNet). Specifically, we develop an Intra-Modality Aggregation module and a Triple-Modal Cross-Attention Fusion module, which are designed to eliminate intra-modality redundancy and extract cross-modal information, respectively. Furthermore, we design a Triple-Modal Feature Fusion loss function to align feature representations across modalities. Extensive experiments on the liver prognosis dataset demonstrate that our approach significantly outperforms existing state-of-the-art unimodal models and other multi-modal techniques. Our code is available at this https URL.
zh
[CV-95] High-Order Matching for One-Step Shortcut Diffusion Models
【速读】:该论文旨在解决一阶轨迹监督在一步快捷扩散模型(One-step shortcut diffusion models)中的局限性。这些局限性包括无法捕捉内在流形几何、导致轨迹不稳定以及在高曲率区域表现不佳等问题。论文的关键解决方案是引入HOMO(高阶匹配框架),通过利用高阶监督来改进分布传输。HOMO不仅解决了上述问题,还实现了前所未有的平滑性、稳定性和几何精确度。
链接: https://arxiv.org/abs/2502.00688
作者: Bo Chen,Chengyue Gong,Xiaoyu Li,Yingyu Liang,Zhizhou Sha,Zhenmei Shi,Zhao Song,Mingda Wan
机构: Middle Tennessee State University; The University of Texas at Austin; University of New South Wales; The University of Hong Kong; University of Wisconsin-Madison; Tsinghua University; University of Wisconsin-Madison; The Simons Institute for the Theory of Computing at UC Berkeley; Anhui University.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:One-step shortcut diffusion models [Frans, Hafner, Levine and Abbeel, ICLR 2025] have shown potential in vision generation, but their reliance on first-order trajectory supervision is fundamentally limited. The Shortcut model’s simplistic velocity-only approach fails to capture intrinsic manifold geometry, leading to erratic trajectories, poor geometric alignment, and instability-especially in high-curvature regions. These shortcomings stem from its inability to model mid-horizon dependencies or complex distributional features, leaving it ill-equipped for robust generative modeling. In this work, we introduce HOMO (High-Order Matching for One-Step Shortcut Diffusion), a game-changing framework that leverages high-order supervision to revolutionize distribution transportation. By incorporating acceleration, jerk, and beyond, HOMO not only fixes the flaws of the Shortcut model but also achieves unprecedented smoothness, stability, and geometric precision. Theoretically, we prove that HOMO’s high-order supervision ensures superior approximation accuracy, outperforming first-order methods. Empirically, HOMO dominates in complex settings, particularly in high-curvature regions where the Shortcut model struggles. Our experiments show that HOMO delivers smoother trajectories and better distributional alignment, setting a new standard for one-step generative models.
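论文所述"高阶监督"的基本思想可以用一个极简示意来说明:用有限差分从轨迹中提取速度(一阶)与加速度(二阶),并将两者的误差加权求和作为匹配损失。以下为假设性草图(权重 w_vel、w_acc、时间步长与轨迹均为虚构示例,并非 HOMO 的原始实现):

```python
import numpy as np

def finite_diff(x, dt):
    # 一阶差分近似导数
    return (x[1:] - x[:-1]) / dt

def high_order_matching_loss(pred_traj, target_traj, dt=0.1, w_vel=1.0, w_acc=0.5):
    # 一阶(速度)与二阶(加速度)监督的加权和; 权重为示意性超参数
    v_p, v_t = finite_diff(pred_traj, dt), finite_diff(target_traj, dt)
    a_p, a_t = finite_diff(v_p, dt), finite_diff(v_t, dt)
    loss_v = np.mean((v_p - v_t) ** 2)
    loss_a = np.mean((a_p - a_t) ** 2)
    return w_vel * loss_v + w_acc * loss_a

t = np.linspace(0, 1, 50)
# 一条高曲率的半圆弧轨迹, 对应摘要中 Shortcut 模型表现不佳的场景
target = np.stack([np.cos(np.pi * t), np.sin(np.pi * t)], axis=1)
print(high_order_matching_loss(target, target))  # 与自身匹配时损失为 0.0
```

直线轨迹与该弧线的一阶误差可能较小,但二阶(加速度)项会显著放大几何上的偏差,这正是高阶监督能改善高曲率区域拟合的直观原因。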
zh
[CV-96] Cross-Modal Synergies: Unveiling the Potential of Motion-Aware Fusion Networks in Handling Dynamic and Static ReID Scenarios
【速读】:该论文旨在解决在不同监控场景中,尤其是在存在遮挡的情况下,人物重识别(Person Re-Identification, ReID)的复杂性问题。关键解决方案在于引入了一种名为Motion-Aware Fusion (MOTAR-FUSE)网络,该网络利用从静态图像中提取的运动线索显著增强ReID能力。MOTAR-FUSE网络通过双输入视觉适配器处理图像和视频,实现更有效的特征提取,并且集成了一个运动一致性任务,使运动感知变换器能够有效捕捉人体运动的动态性。这种方法在遮挡普遍存在的场景中显著提升了特征识别能力,从而推进了ReID过程。
链接: https://arxiv.org/abs/2502.00665
作者: Fuxi Ling,Hongye Liu,Guoqiang Huang,Jing Li,Hong Wu,Zhihao Tang
机构: Hangzhou Dianzi University (杭州电子科技大学); China Jiliang University (中国计量大学); Zhejiang Gongshang University (浙江工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Navigating the complexities of person re-identification (ReID) in varied surveillance scenarios, particularly when occlusions occur, poses significant challenges. We introduce an innovative Motion-Aware Fusion (MOTAR-FUSE) network that utilizes motion cues derived from static imagery to significantly enhance ReID capabilities. This network incorporates a dual-input visual adapter capable of processing both images and videos, thereby facilitating more effective feature extraction. A unique aspect of our approach is the integration of a motion consistency task, which empowers the motion-aware transformer to adeptly capture the dynamics of human motion. This technique substantially improves the recognition of features in scenarios where occlusions are prevalent, thereby advancing the ReID process. Our comprehensive evaluations across multiple ReID benchmarks, including holistic, occluded, and video-based scenarios, demonstrate that our MOTAR-FUSE network achieves superior performance compared to existing approaches.
zh
[CV-97] Enhanced Convolutional Neural Networks for Improved Image Classification
【速读】:该论文旨在解决在具有挑战性的数据集如CIFAR-10上应用卷积神经网络(Convolutional Neural Networks, CNNs)时常见的过拟合和次优特征表示问题。解决方案的关键在于提出了一种增强型CNN架构,通过集成更深的卷积块、批量归一化(Batch Normalization)和随机失活正则化(Dropout Regularization),从而实现更优性能。该模型在测试集上的准确率达到84.95%,显著优于基础CNN架构。
链接: https://arxiv.org/abs/2502.00663
作者: Xiaoran Yang,Shuhan Yu,Wenxi Xu
机构: Communication University of China; Hainan International College, Communication University of China; Hefei University of Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Image classification is a fundamental task in computer vision with diverse applications, ranging from autonomous systems to medical imaging. The CIFAR-10 dataset is a widely used benchmark to evaluate the performance of classification models on small-scale, multi-class datasets. Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art results; however, they often suffer from overfitting and suboptimal feature representation when applied to challenging datasets like CIFAR-10. In this paper, we propose an enhanced CNN architecture that integrates deeper convolutional blocks, batch normalization, and dropout regularization to achieve superior performance. The proposed model achieves a test accuracy of 84.95%, outperforming baseline CNN architectures. Through detailed ablation studies, we demonstrate the effectiveness of the enhancements and analyze the hierarchical feature representations. This work highlights the potential of refined CNN architectures for tackling small-scale image classification problems effectively.
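作为补充,下面用 NumPy 给出批量归一化与 Dropout 前向计算的最小示意(假设性实现,非论文原始代码;其中的数据和超参数均为虚构示例):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # 按批次对每个特征维度归一化: gamma * (x - mean) / sqrt(var + eps) + beta
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def dropout(x, p=0.5, rng=None, training=True):
    # 训练时以概率 p 随机置零, 并按 1/(1-p) 放大保留项 (inverted dropout)
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.random.default_rng(42).normal(2.0, 3.0, size=(128, 16))
y = batch_norm(x)
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))  # 归一化后各维均值约为 0
```

推理阶段 Dropout 应关闭(training=False),而批量归一化需改用训练期统计的滑动均值与方差,此处从略。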
zh
[CV-98] EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis
【速读】:该论文旨在解决3D高斯散射(3D Gaussian Splatting)驱动的说话人脸合成在情感表达多样性方面的不足。为了解决这一问题,论文提出了一种基于唇部对齐的情感面部生成器,并利用其训练了一个以连续情感值(即效价valence与唤醒度arousal)为条件的面部表情操控模型——EmoTalkingGaussian。此外,为了实现自然场景下音频的精确唇部同步,引入了一种自监督学习方法,该方法结合了文本转语音网络和视听同步网络。关键在于通过引入情感面部生成器和改进唇部同步机制来提升情感表达的多样性和真实性。
链接: https://arxiv.org/abs/2502.00654
作者: Junuk Cha,Seongro Yoon,Valeriya Strizhkova,Francois Bremond,Seungryul Baek
机构: UNIST; Inria
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages
点击查看摘要
Abstract:3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images with real-time inference speed. However, since it is typically trained on only a short video that lacks the diversity in facial emotions, the resultant talking heads struggle to represent a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model. It is able to manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal); while retaining synchronization of lip movements with input audio. Additionally, to achieve the accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We experiment our EmoTalkingGaussian on publicly available videos and have obtained better results than state-of-the-arts in terms of image quality (measured in PSNR, SSIM, LPIPS), emotion expression (measured in V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy), and lip synchronization (measured in LMD, Sync-E, Sync-C), respectively.
zh
[CV-99] Zeroth-order Informed Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
【速读】:该论文旨在解决高效对概率扩散模型(Probabilistic Diffusion Model, DM)进行微调以满足下游应用需求的问题。现有方法如基于强化学习(Reinforcement Learning, RL)或截断反向传播(Truncated Backpropagation, BP)存在样本效率低及梯度估计偏差等问题。论文的关键解决方案是提出递归似然比优化器(Recursive Likelihood Ratio, RLR),这是一种基于零阶信息的微调范式。RLR通过重新排列递归扩散链中的计算图,实现了无偏且方差更低的梯度估计,从而克服了现有方法的局限性。
链接: https://arxiv.org/abs/2502.00639
作者: Tao Ren,Zishi Zhang,Zehao Li,Jingyang Jiang,Shentao Qin,Guanghao Li,Yan Li,Yi Zheng,Xinping Li,Min Zhan,Yijie Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:The probabilistic diffusion model (DM), generating content by inferencing through a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on enormous unlabeled data, the model needs to be properly aligned to meet requirements for downstream applications. How to efficiently align the foundation DM is a crucial task. Contemporary methods are either based on Reinforcement Learning (RL) or truncated Backpropagation (BP). However, RL and truncated BP suffer from low sample efficiency and biased gradient estimation respectively, resulting in limited improvement or, even worse, complete training failure. To overcome the challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a zeroth-order informed fine-tuning paradigm for DM. The zeroth-order gradient estimator enables the computation graph rearrangement within the recursive diffusive chain, making the RLR’s gradient estimator an unbiased one with the lower variance than other methods. We provide theoretical guarantees for the performance of the RLR. Extensive experiments are conducted on image and video generation tasks to validate the superiority of the RLR. Furthermore, we propose a novel prompt technique that is natural for the RLR to achieve a synergistic effect.
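摘要中"零阶梯度估计"可用经典的两点差分估计器直观说明:仅靠函数值差分、不做反向传播即可逼近真实梯度。以下为假设性最小示意(与论文的 RLR 具体实现无关,扰动幅度与采样数均为示例参数):

```python
import numpy as np

def zeroth_order_grad(f, x, mu=1e-3, n_samples=256, rng=None):
    # 两点零阶梯度估计: g ≈ E_u[(f(x+mu*u) - f(x-mu*u)) / (2*mu) * u]
    rng = np.random.default_rng(0) if rng is None else rng
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

f = lambda x: np.sum(x ** 2)   # 解析梯度为 2x, 便于对照
x = np.array([1.0, -2.0, 3.0])
print(zeroth_order_grad(f, x))
```

对二次函数该估计是无偏的,误差仅来自采样方差;增大 n_samples 即可逼近解析梯度 [2, -4, 6]。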
zh
[CV-100] MedConv: Convolutions Beat Transformers on Long-Tailed Bone Density Prediction
【速读】:该论文旨在解决骨密度预测中两个主要问题:一是基于Transformer架构的方法计算复杂度高,限制了其在便携式和临床环境中的应用;二是实际医院数据分布不平衡且呈长尾分布,导致预测偏差。为了解决这些问题,论文的关键方案是引入MedConv模型,这是一种卷积模型,相比Transformer模型具有更低的计算需求,并且能够提高预测准确性。此外,论文还采用了Bal-CE损失函数和事后Logit调整(post-hoc logit adjustment)来改善类别平衡。实验结果表明,这种方法在AustinSpine数据集上实现了高达21%的准确率提升和20%的ROC AUC提升。
链接: https://arxiv.org/abs/2502.00631
作者: Xuyin Qi,Zeyu Zhang,Huazhan Zheng,Mingxi Chen,Numan Kutaiba,Ruth Lim,Cherie Chiang,Zi En Tham,Xuan Ren,Wenxin Zhang,Lei Zhang,Hao Zhang,Wenbing Lv,Guangzhen Yao,Renda Han,Kangsheng Wang,Mingyuan Li,Hongtao Mao,Yu Li,Zhibin Liao,Yang Zhao,Minh-Son To
机构: Flinders University; The University of Adelaide; The Australian National University; Zhejiang University of Technology; Guangdong Technion – Israel Institute of Technology; Austin Health; The University of Melbourne; La Trobe University; University of Chinese Academy of Sciences; Yunnan University; Northeast Normal University; Hainan University; Univeristy of Science and Technology Beijing; Hebei University of Technology; Central China Normal University; Hubei University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Bone density prediction via CT scans to estimate T-scores is crucial, providing a more precise assessment of bone health compared to traditional methods like X-ray bone density tests, which lack spatial resolution and the ability to detect localized changes. However, CT-based prediction faces two major challenges: the high computational complexity of transformer-based architectures, which limits their deployment in portable and clinical settings, and the imbalanced, long-tailed distribution of real-world hospital data that skews predictions. To address these issues, we introduce MedConv, a convolutional model for bone density prediction that outperforms transformer models with lower computational demands. We also adapt Bal-CE loss and post-hoc logit adjustment to improve class balance. Extensive experiments on our AustinSpine dataset show that our approach achieves up to 21% improvement in accuracy and 20% in ROC AUC over previous state-of-the-art methods.
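文中采用的事后 logit 调整(post-hoc logit adjustment)思路可概括为:从 logits 中减去 tau·log(类别先验),以抵消长尾分布带来的头部类偏置。下面是一个假设性数值示意(先验概率与 logits 均为虚构示例):

```python
import numpy as np

def logit_adjust(logits, class_priors, tau=1.0):
    # 事后 logit 调整: 减去 tau * log(先验), 纠正长尾偏置
    return logits - tau * np.log(np.asarray(class_priors))

# 假设头部类先验为 0.9, 尾部类为 0.1
logits = np.array([2.0, 1.8])            # 原始预测略偏向头部类
adjusted = logit_adjust(logits, [0.9, 0.1])
print(int(np.argmax(logits)), int(np.argmax(adjusted)))  # → 0 1
```

调整后尾部类获得了与其稀有程度相称的补偿,预测从头部类翻转到尾部类。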
zh
[CV-101] Self-Prompt SAM: Medical Image Segmentation via Automatic Prompt SAM Adaptation
【速读】:该论文旨在解决通用分割基础模型在医学图像分割中的应用局限性,特别是针对Segment Anything Model (SAM) 在处理自然图像与医学图像差异时所表现出的性能不确定性。论文的关键解决方案在于提出了一种名为Self-Prompt-SAM的自提示适应框架,通过设计一个多尺度提示生成器结合SAM中的图像编码器生成辅助掩膜,并利用这些辅助掩膜生成边界框提示和距离变换选取中心点提示。此外,论文还设计了一个三维深度融合适配器(DfusedAdapter),将其注入到图像编码器和掩膜解码器的每个Transformer中,以使预训练的二维SAM模型能够提取三维信息并适应三维医学图像。
链接: https://arxiv.org/abs/2502.00630
作者: Bin Xie,Hao Tang,Dawen Cai,Yan Yan,Gady Agam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Segment Anything Model (SAM) has demonstrated impressive zero-shot performance and brought a range of unexplored capabilities to natural image segmentation tasks. However, as a very important branch of image segmentation, the performance of SAM remains uncertain when applied to medical image segmentation due to the significant differences between natural images and medical images. Meanwhile, it is difficult to meet SAM’s requirement for extra prompts, such as points or boxes, to specify medical regions. In this paper, we propose a novel self-prompt SAM adaptation framework for medical image segmentation, named Self-Prompt-SAM. We design a multi-scale prompt generator combined with the image encoder in SAM to generate auxiliary masks. Then, we use the auxiliary masks to generate bounding boxes as box prompts and use Distance Transform to select the most central points as point prompts. Meanwhile, we design a 3D depth-fused adapter (DfusedAdapter) and inject the DFusedAdapter into each transformer in the image encoder and mask decoder to enable pre-trained 2D SAM models to extract 3D information and adapt to 3D medical images. Extensive experiments demonstrate that our method achieves state-of-the-art performance and outperforms nnUNet by 2.3% on AMOS2022, 1.6% on ACDC and 0.5% on Synapse datasets.
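摘要中"用距离变换选取掩膜最中心点作为点提示"的步骤,可用一个纯 Python 的多源 BFS 距离变换草图来说明(假设性实现,采用 4-邻接的近似距离,并非论文原始代码):

```python
from collections import deque

def center_point(mask):
    # 从所有背景像素出发做多源 BFS, 得到每个像素到背景的距离,
    # 取掩膜内距离最大的像素作为最居中的点提示
    h, w = len(mask), len(mask[0])
    dist = [[None] * w for _ in range(h)]
    q = deque()
    for i in range(h):
        for j in range(w):
            if mask[i][j] == 0:
                dist[i][j] = 0
                q.append((i, j))
    while q:
        i, j = q.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and dist[ni][nj] is None:
                dist[ni][nj] = dist[i][j] + 1
                q.append((ni, nj))
    return max(((i, j) for i in range(h) for j in range(w) if mask[i][j]),
               key=lambda p: dist[p[0]][p[1]])

# 7x7 网格中央的 5x5 方形掩膜, 其几何中心应被选中
mask = [[1 if 1 <= i <= 5 and 1 <= j <= 5 else 0 for j in range(7)] for i in range(7)]
print(center_point(mask))  # → (3, 3)
```

实际实现通常用欧氏距离变换(如 scipy.ndimage.distance_transform_edt),此处用 BFS 只为保持自包含。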
zh
[CV-102] Strengthening Generative Robot Policies through Predictive World Modeling
【速读】:该论文旨在解决复杂物理交互控制任务中的预测控制问题。论文的关键在于引入生成式预测控制(GPC)框架,通过条件视频扩散(conditional video diffusion)学习接近物理准确的视觉世界模型,并实现稳健的视觉预测。GPC框架包括三个主要部分:从专家演示中克隆生成式扩散策略,训练一个动作条件的世界模型,以及使用该模型进行前瞻规划以优化行动提案。这种方法使得GPC在基于状态和基于视觉的任务中,无论是仿真还是真实环境中,均优于行为克隆方法。
链接: https://arxiv.org/abs/2502.00622
作者: Han Qi,Haocheng Yin,Yilun Du,Heng Yang
机构: School of Engineering and Applied Sciences, Harvard University (工程与应用科学学院, 哈佛大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Website: this https URL
点击查看摘要
Abstract:We present generative predictive control (GPC), a learning control framework that (i) clones a generative diffusion-based policy from expert demonstrations, (ii) trains a predictive action-conditioned world model from both expert demonstrations and random explorations, and (iii) synthesizes an online planner that ranks and optimizes the action proposals from (i) by looking ahead into the future using the world model from (ii). Crucially, we show that conditional video diffusion allows learning (near) physics-accurate visual world models and enable robust visual foresight. Focusing on planar pushing with rich contact and collision, we show GPC dominates behavior cloning across state-based and vision-based, simulated and real-world experiments.
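GPC 在线规划器的核心思想——用世界模型对动作候选向前滚动并按回报排序——可用如下假设性草图说明(其中一维"推块"动力学与奖励函数均为虚构的简化示例,仅演示排序-择优这一步):

```python
import numpy as np

def plan_best_action(state, proposals, world_model, reward_fn, horizon=5):
    # 对每个候选动作用世界模型向前滚动 horizon 步, 取累计回报最高者
    def rollout_return(action):
        s, total = state, 0.0
        for _ in range(horizon):
            s = world_model(s, action)
            total += reward_fn(s)
        return total
    returns = [rollout_return(a) for a in proposals]
    return proposals[int(np.argmax(returns))]

# 假设性的一维任务: 状态为位置, 目标是靠近原点
world_model = lambda s, a: s + a      # 简化动力学: 每步位移 a
reward_fn = lambda s: -abs(s)         # 离原点越近回报越高
proposals = [-0.5, 0.1, 0.3]          # 来自生成式策略的动作候选
print(plan_best_action(2.0, proposals, world_model, reward_fn))  # → -0.5
```

论文中候选动作由扩散策略采样、世界模型是动作条件的视频预测模型,但"采样—滚动—排序"的流程与此一致。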
zh
[CV-103] DesCLIP: Robust Continual Adaptation via General Attribute Descriptions for Pretrained Vision-Language Models
【速读】:该论文旨在解决连续适应视觉-语言模型(Vision-Language Models, VLMs)过程中存在的知识遗忘问题,特别是在处理不断扩展的下游任务和数据集时。现有研究通常关注于将视觉特征与特定类别的文本描述相匹配,而忽视了通用知识与特定知识之间的潜在关联。论文的关键发现是,强制模型优化不适当的视觉-文本匹配会加剧VLM的知识遗忘现象。为了解决这一问题,论文提出DesCLIP方法,通过利用一般属性(General Attribute, GA)描述来指导特定类别对象的理解,从而帮助VLM建立稳健的“视觉-GA-类别”三边关联,而不是仅仅依赖于“视觉-类别”连接。具体而言,该方法引入了一个语言助手生成具体的GA描述候选,并设计了一种基于锚点的嵌入过滤器来获取高度相关的GA描述嵌入,这些嵌入作为配对文本嵌入进行视觉-文本实例匹配,进而调整视觉编码器。同时,类别文本嵌入逐渐校准以与共享的GA描述嵌入对齐。实验结果验证了该方法的有效性和先进性,表明其在性能上优于现有的预训练和基于VLM的连续学习方法。
链接: https://arxiv.org/abs/2502.00618
作者: Chiyuan He,Zihuan Qiu,Fanman Meng,Linfeng Xu,Qingbo Wu,Hongliang Li
机构: University of Electronic Science and Technology of China(电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Continual adaptation of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt for expanding downstream tasks and datasets, while tackling the challenge of knowledge forgetting. Existing research often focuses on connecting visual features with specific class text in downstream tasks, overlooking the latent relationships between general and specialized knowledge. Our findings reveal that forcing models to optimize inappropriate visual-text matches exacerbates forgetting of VLMs. To tackle this issue, we propose DesCLIP, which leverages general attribute (GA) descriptions to guide the understanding of specific class objects, enabling VLMs to establish robust vision-GA-class trilateral associations rather than relying solely on vision-class connections. Specifically, we introduce a language assistant to generate concrete GA description candidates via proper request prompts. Then, an anchor-based embedding filter is designed to obtain highly relevant GA description embeddings, which are leveraged as the paired text embeddings for visual-textual instance matching, thereby tuning the visual encoder. Correspondingly, the class text embeddings are gradually calibrated to align with these shared GA description embeddings. Extensive experiments demonstrate the advancements and efficacy of our proposed method, with comprehensive empirical evaluations highlighting its superior performance compared to existing pretrained and VLM-based continual learning methods.
zh
[CV-104] Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing FAST
【速读】:该论文旨在解决基于状态空间模型(State Space Models, SSMs)的视觉模型在处理高分辨率图像时的计算效率问题。论文的关键解决方案是提出Fast Vision Mamba (FastVim),通过在Vision Mamba模型中进一步减少递归步骤的数量,并通过在多个Mamba块之间交替池化令牌来实现,从而将SSM块中的并行步数减少一半。这种方法实现了高达72.5%的推理速度提升,同时保持了模型性能,展示了在诸如图像分类、细胞扰动预测、分割和目标检测等任务中的卓越性能。
链接: https://arxiv.org/abs/2502.00594
作者: Saarthak Kapse,Robin Betz,Srinivasan Sivanandan
机构: Insitro; Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 15 figures, this https URL
点击查看摘要
Abstract:State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Mamba, unlike Vision Transformers, achieves linear complexity for token interactions through a recurrent hidden state process. This sequential processing is enhanced by a parallel scan algorithm, which reduces the computational time of recurrent steps from L sequential steps to log(L) parallel steps with respect to the number of input tokens (L). In this work, we propose Fast Vision Mamba (FastVim), that further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while still retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a 2× reduction in the number of parallel steps in SSM block. Our model offers up to 72.5% speedup in inference speed compared to baseline Vision Mamba models on high resolution (2048×2048) images. Our experiments demonstrate state-of-the-art performance with dramatically improved throughput in a range of tasks such as image classification, cell perturbation prediction, segmentation, and object detection. Code is made available at this https URL
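沿图像维度池化 token 以减少并行扫描步数的效果,可用如下假设性示意验证:序列长度减半后,log2(L) 级别的并行扫描步数恰好减少一步(此处的成对平均池化仅为示例,并非 FastVim 的原始池化方案):

```python
import numpy as np

def parallel_scan_steps(seq_len):
    # 并行前缀扫描所需步数约为 ceil(log2(L))
    return int(np.ceil(np.log2(seq_len)))

def pool_tokens(tokens, axis=0):
    # 沿图像的某一维度对相邻 token 做平均池化, 该维长度减半
    h = tokens.shape[axis] // 2
    a = np.take(tokens, range(0, 2 * h, 2), axis=axis)
    b = np.take(tokens, range(1, 2 * h, 2), axis=axis)
    return (a + b) / 2

grid = np.random.default_rng(0).normal(size=(16, 16, 8))  # H x W x D 的 token 网格
pooled = pool_tokens(grid, axis=0)                        # 沿 H 池化 -> 8 x 16 x 8
print(parallel_scan_steps(grid.shape[0] * grid.shape[1]),
      parallel_scan_steps(pooled.shape[0] * pooled.shape[1]))  # → 8 7
```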
zh
[CV-105] Contrastive Forward-Forward: A Training Algorithm of Vision Transformer
【速读】:该论文旨在解决现有前馈神经网络训练算法(如反向传播)在性能上的局限性,并寻找更接近大脑工作方式的训练方法。关键在于提出了一种名为对比前馈(Contrastive Forward-Forward)的改进算法,通过在视觉变换器(Vision Transformer)上应用该算法,实现了高达10%的准确率提升和收敛速度提高5至20倍的效果。对比于原始的前馈算法(Forward-Forward),这种改进显著缩小了与反向传播算法(Backpropagation)之间的性能差距,并在某些条件下甚至超越后者。
链接: https://arxiv.org/abs/2502.00571
作者: Hossein Aghagolzadeh,Mehdi Ezoji
机构: Babol Noshirvani University of Technology ( Babol Noshirvani理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 22 pages, 8 figures, under review
点击查看摘要
Abstract:Although backpropagation is widely accepted as a training algorithm for artificial neural networks, researchers are always looking for inspiration from the brain to find ways with potentially better performance. Forward-Forward is a new training algorithm that is more similar to what occurs in the brain, although there is a significant performance gap compared to backpropagation. In the Forward-Forward algorithm, the loss functions are placed after each layer, and the updating of a layer is done using two local forward passes and one local backward pass. Forward-Forward is in its early stages and has been designed and evaluated on simple multi-layer perceptron networks to solve image classification tasks. In this work, we have extended the use of this algorithm to a more complex and modern network, namely the Vision Transformer. Inspired by insights from contrastive learning, we have attempted to revise this algorithm, leading to the introduction of Contrastive Forward-Forward. Experimental results show that our proposed algorithm performs significantly better than the baseline Forward-Forward leading to an increase of up to 10% in accuracy and boosting the convergence speed by 5 to 20 times on Vision Transformer. Furthermore, if we take Cross Entropy as the baseline loss function in backpropagation, it will be demonstrated that the proposed modifications to the baseline Forward-Forward reduce its performance gap compared to backpropagation on Vision Transformer, and even outperforms it in certain conditions, such as inaccurate supervision.
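Forward-Forward 算法中逐层局部的 "goodness" 目标可用如下假设性示意说明:以激活平方和作为 goodness,并用 sigmoid(goodness − θ) 区分正负样本(激活值与阈值均为虚构示例,仅演示该局部目标的形式):

```python
import numpy as np

def goodness(activations):
    # Forward-Forward 中的 "goodness": 该层激活的平方和
    return np.sum(activations ** 2, axis=-1)

def layer_prob_positive(activations, theta=2.0):
    # 以 sigmoid(goodness - theta) 作为样本为 "正" 的概率
    return 1.0 / (1.0 + np.exp(-(goodness(activations) - theta)))

pos = np.array([[1.0, 1.5, 0.5]])   # 假设的正样本激活 (幅值较大)
neg = np.array([[0.1, 0.2, 0.1]])   # 假设的负样本激活 (幅值较小)
print(layer_prob_positive(pos) > layer_prob_positive(neg))  # → [ True]
```

每层用两次局部前向(正样本提升 goodness、负样本压低 goodness)各自更新参数,因而无需全局反向传播;论文在此基础上引入对比学习的思想来构造正负样本。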
zh
[CV-106] Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions
【速读】:该论文旨在解决在实际临床环境中,基于组织病理学的癌症分级和分子层面的生存风险预测难以直接融合进行联合决策的问题。论文的关键解决方案在于提出了一种基于扩散机制的跨模态生成式人工智能模型PathoGen,该模型能够从数字病理图像中合成基因表达,并在此基础上实现高精度(达到当前最先进水平)、高置信度(通过一致性覆盖保证)和可解释性(通过分布式注意力图)的癌症分级和患者生存风险预测。
链接: https://arxiv.org/abs/2502.00568
作者: Samiran Dey,Christopher R.S. Banerji,Partha Basuchowdhuri,Sanjoy K. Saha,Deepak Parashar,Tapabrata Chakraborti
机构: School of Mathematical & Computational Sciences, Indian Association for the Cultivation of Science(印度科学促进会数学与计算科学学院), Kolkata, India; The Alan Turing Institute(艾伦图灵研究所), London, UK; Comprehensive Cancer Center, King’s College London(伦敦国王学院综合癌症中心), London, UK; Department of Computer Science and Engineering, Jadavpur University(贾达普尔大学计算机科学与工程系), Kolkata, India; Warwick Medical School, University of Warwick(华威大学医学院), Coventry, UK; UCL Cancer Institute, University College London(伦敦大学学院癌症研究所), London, UK; Department of Medical Physics and Biomedical Engineering, University College London(伦敦大学学院医学物理与生物医学工程系), London, UK
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Emerging research has highlighted that artificial intelligence based multimodal fusion of digital pathology and transcriptomic features can improve cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction. However, such direct fusion for joint decision is impractical in real clinical settings, where histopathology is still the gold standard for diagnosis and transcriptomic tests are rarely requested, at least in the public healthcare system. With our novel diffusion based crossmodal generative AI model PathoGen, we show that genomic expressions synthesized from digital histopathology jointly predicts cancer grading and patient survival risk with high accuracy (state-of-the-art performance), certainty (through conformal coverage guarantee) and interpretability (through distributed attention maps). PathoGen code is available for open use by the research community through GitHub at this https URL.
zh
[CV-107] Complex Wavelet Mutual Information Loss: A Multi-Scale Loss Function for Semantic Segmentation
【速读】:该论文旨在解决深度神经网络在语义分割任务中面临的类别不平衡和实例不平衡问题,特别是小对象和细边界容易被忽略的问题。为应对多尺度目标的分割挑战,论文提出了一种新颖的复数小波互信息(Complex Wavelet Mutual Information, CWMI)损失函数。该方法利用复数可导向金字塔分解出的子带图像中的互信息,并结合其在多个方向上捕捉特征的能力以及在不同尺度上保持结构相似性的优势,从而有效提升了像素级精度和拓扑度量的性能,同时引入了最小的计算开销。
链接: https://arxiv.org/abs/2502.00563
作者: Renhao Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 11 pages, 6 figures
点击查看摘要
Abstract:Recent advancements in deep neural networks have significantly enhanced the performance of semantic segmentation. However, class imbalance and instance imbalance remain persistent challenges, where smaller instances and thin boundaries are often overshadowed by larger structures. To address the multiscale nature of segmented objects, various models have incorporated mechanisms such as spatial attention and feature pyramid networks. Despite these advancements, most loss functions are still primarily pixel-wise, while regional and boundary-focused loss functions often incur high computational costs or are restricted to small-scale regions. To address this limitation, we propose complex wavelet mutual information (CWMI) loss, a novel loss function that leverages mutual information from subband images decomposed by a complex steerable pyramid. The complex steerable pyramid captures features across multiple orientations and preserves structural similarity across scales. Meanwhile, mutual information is well-suited for capturing high-dimensional directional features and exhibits greater noise robustness. Extensive experiments on diverse segmentation datasets demonstrate that CWMI loss achieves significant improvements in both pixel-wise accuracy and topological metrics compared to state-of-the-art methods, while introducing minimal computational overhead. The code is available at this https URL
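CWMI 损失中的核心量——互信息——可用基于直方图的经典估计直观演示(以下为假设性草图,未包含复数可导向金字塔分解,仅说明互信息对强相关信号与独立噪声的区分能力):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    # 基于二维直方图的互信息估计 (单位: nats)
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

rng = np.random.default_rng(0)
a = rng.normal(size=5000)
noise = rng.normal(size=5000)
# 与自身的含噪副本互信息高, 与独立噪声互信息接近 0
print(mutual_information(a, a + 0.1 * noise) > mutual_information(a, noise))  # → True
```

论文中互信息是在复数可导向金字塔分解出的各方向子带上逐一计算的,这里的一维示例只展示该度量本身的行为。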
zh
[CV-108] Milmer: a Framework for Multiple Instance Learning based Multimodal Emotion Recognition
【Quick Read】: This paper targets the challenge of emotion recognition in human-computer interaction (HCI) by integrating facial expression analysis with electroencephalogram (EEG) signals in a novel multimodal framework, Milmer. The key to the solution is a transformer-based fusion approach that effectively integrates the visual and physiological modalities. The framework also innovatively adopts multiple instance learning (MIL) to extract meaningful information from sequences of facial expression images over time, capturing critical temporal dynamics often overlooked in prior studies. Together, these strategies substantially improve emotion recognition performance.
Link: https://arxiv.org/abs/2502.00547
Authors: Zaitian Wang,Jian He,Yu Liang,Xiyuan Hu,Tianhao Peng,Kaixin Wang,Jiakai Wang,Chenlong Zhang,Weili Zhang,Shuang Niu,Xiaoyang Xie
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Emotions play a crucial role in human behavior and decision-making, making emotion recognition a key area of interest in human-computer interaction (HCI). This study addresses the challenges of emotion recognition by integrating facial expression analysis with electroencephalogram (EEG) signals, introducing a novel multimodal framework-Milmer. The proposed framework employs a transformer-based fusion approach to effectively integrate visual and physiological modalities. It consists of an EEG preprocessing module, a facial feature extraction and balancing module, and a cross-modal fusion module. To enhance visual feature extraction, we fine-tune a pre-trained Swin Transformer on emotion-related datasets. Additionally, a cross-attention mechanism is introduced to balance token representation across modalities, ensuring effective feature integration. A key innovation of this work is the adoption of a multiple instance learning (MIL) approach, which extracts meaningful information from multiple facial expression images over time, capturing critical temporal dynamics often overlooked in previous studies. Extensive experiments conducted on the DEAP dataset demonstrate the superiority of the proposed framework, achieving a classification accuracy of 96.72% in the four-class emotion recognition task. Ablation studies further validate the contributions of each module, highlighting the significance of advanced feature extraction and fusion strategies in enhancing emotion recognition performance. Our code are available at this https URL.
[CV-109] Integrating Frequency Guidance into Multi-source Domain Generalization for Bearing Fault Diagnosis
【Quick Read】: This paper addresses the problem that, under unseen working conditions, a growing number of unknown domains causes domain-invariant features to contain instance-level spurious correlations, hurting generalization. The key to the solution is the Fourier-based Augmentation Reconstruction Network (FARNet), which separates a signal's phase and amplitude components to perform domain augmentation, applying a multi-source domain data-augmentation strategy in the frequency domain. A Frequency-Spatial Interaction Module (FSIM) handles global information and local spatial features, promoting representation learning between the two sub-networks, and a proposed manifold triplet loss further refines the decision boundary. Experiments on the CWRU and SJTU datasets show that FARNet performs effectively and outperforms existing cross-domain methods.
Link: https://arxiv.org/abs/2502.00545
Authors: Xiaotong Tu,Chenyu Ma,Qingyao Wu,Yinhao Liu,Hongyang Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent generalizable fault diagnosis researches have effectively tackled the distributional shift between unseen working conditions. Most of them mainly focus on learning domain-invariant representation through feature-level methods. However, the increasing numbers of unseen domains may lead to domain-invariant features containing instance-level spurious correlations, which impact the previous models' generalizable ability. To address the limitations, we propose the Fourier-based Augmentation Reconstruction Network, namely FARNet. The methods are motivated by the observation that the Fourier phase component and amplitude component preserve different semantic information of the signals, which can be employed in domain augmentation techniques. The network comprises an amplitude spectrum sub-network and a phase spectrum sub-network, sequentially reducing the discrepancy between the source and target domains. To construct a more robust generalized model, we employ a multi-source domain data augmentation strategy in the frequency domain. Specifically, a Frequency-Spatial Interaction Module (FSIM) is introduced to handle global information and local spatial features, promoting representation learning between the two sub-networks. To refine the decision boundary of our model output compared to conventional triplet loss, we propose a manifold triplet loss to contribute to generalization. Through extensive experiments on the CWRU and SJTU datasets, FARNet demonstrates effective performance and achieves superior results compared to current cross-domain approaches on the benchmarks.
[CV-110] VertiFormer: A Data-Efficient Multi-Task Transformer for Off-Road Robot Mobility
【Quick Read】: This paper tackles the application of Transformer architectures to robot mobility on extremely rugged off-road terrain, where real-world mobility data are hard to acquire and existing training techniques do not fully transfer. It proposes VertiFormer, a novel data-efficient multi-task Transformer. The key is a new learnable masked-modeling and next-token-prediction paradigm that, with only one hour of data, simultaneously predicts the robot's next pose, action, and terrain patch, supporting a variety of off-road mobility tasks. The non-autoregressive design mitigates computational bottlenecks and error propagation, while a unified modality representation enhances learning of diverse temporal mappings and state representations, further improving generalization.
Link: https://arxiv.org/abs/2502.00543
Authors: Mohammad Nazeri,Anuj Pokhrel,Alexandyr Card,Aniket Datar,Garrett Warnell,Xuesu Xiao
Affiliations: Department of Computer Science, George Mason University; DEVCOM Army Research Laboratory; Department of Computer Science, The University of Texas at Austin
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 figures, url: this https URL
Abstract:Sophisticated learning architectures, e.g., Transformers, present a unique opportunity for robots to understand complex vehicle-terrain kinodynamic interactions for off-road mobility. While internet-scale data are available for Natural Language Processing (NLP) and Computer Vision (CV) tasks to train Transformers, real-world mobility data are difficult to acquire with physical robots navigating off-road terrain. Furthermore, training techniques specifically designed to process text and image data in NLP and CV may not apply to robot mobility. In this paper, we propose VertiFormer, a novel data-efficient multi-task Transformer model trained with only one hour of data to address such challenges of applying Transformer architectures for robot mobility on extremely rugged, vertically challenging, off-road terrain. Specifically, VertiFormer employs a new learnable masked modeling and next token prediction paradigm to predict the next pose, action, and terrain patch to enable a variety of off-road mobility tasks simultaneously, e.g., forward and inverse kinodynamics modeling. The non-autoregressive design mitigates computational bottlenecks and error propagation associated with autoregressive models. VertiFormer’s unified modality representation also enhances learning of diverse temporal mappings and state representations, which, combined with multiple objective functions, further improves model generalization. Our experiments offer insights into effectively utilizing Transformers for off-road robot mobility with limited data and demonstrate our efficiently trained Transformer can facilitate multiple off-road mobility tasks onboard a physical mobile robot.
[CV-111] CAD: Confidence-Aware Adaptive Displacement for Semi-Supervised Medical Image Segmentation
【Quick Read】: This paper addresses the challenge of maintaining high-quality consistency learning in semi-supervised medical image segmentation: in regions with uncertain predictions, excessive perturbation degrades alignment and hinders precise decision boundaries. The key to the solution is the Confidence-Aware Adaptive Displacement (CAD) framework, which selectively identifies low-confidence regions and replaces them with high-confidence ones, dynamically adjusting both the maximum allowable replacement size and the confidence threshold during training, so that segmentation quality improves progressively without overwhelming the learning process.
Link: https://arxiv.org/abs/2502.00536
Authors: Wenbo Xiao,Zhihao Xu,Guiping Liang,Yangjun Deng,Yi Xiao
Affiliations: College of Information and Intelligence, Hunan Agricultural University (湖南农业大学信息与智能学院), Changsha, China (中国长沙)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 4 tables
Abstract:Semi-supervised medical image segmentation aims to leverage minimal expert annotations, yet remains confronted by challenges in maintaining high-quality consistency learning. Excessive perturbations can degrade alignment and hinder precise decision boundaries, especially in regions with uncertain predictions. In this paper, we introduce Confidence-Aware Adaptive Displacement (CAD), a framework that selectively identifies and replaces the largest low-confidence regions with high-confidence patches. By dynamically adjusting both the maximum allowable replacement size and the confidence threshold throughout training, CAD progressively refines the segmentation quality without overwhelming the learning process. Experimental results on public medical datasets demonstrate that CAD effectively enhances segmentation quality, establishing new state-of-the-art accuracy in this field. The source code will be released after the paper is published.
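The core displace-and-replace idea can be sketched on a 2D confidence map. This toy version is our own simplification: it swaps the single lowest-mean-confidence square patch with the highest-confidence one when the former falls below a threshold, whereas the actual CAD schedules both the patch size and the threshold over training and operates inside a consistency-learning pipeline; the function name and parameters are hypothetical.

```python
import numpy as np

def cad_displace(conf, feats, patch=4, thresh=0.6):
    """Toy sketch of confidence-aware displacement: replace the
    lowest-confidence `patch`x`patch` region of `feats` with the
    highest-confidence region, if the low region is below `thresh`."""
    h, w = conf.shape
    best_lo, best_hi, lo_val, hi_val = None, None, np.inf, -np.inf
    for i in range(0, h - patch + 1):          # scan all patch positions
        for j in range(0, w - patch + 1):
            m = conf[i:i + patch, j:j + patch].mean()
            if m < lo_val:
                lo_val, best_lo = m, (i, j)
            if m > hi_val:
                hi_val, best_hi = m, (i, j)
    out = feats.copy()
    if lo_val < thresh:                        # only displace uncertain regions
        (li, lj), (hi_, hj) = best_lo, best_hi
        out[li:li + patch, lj:lj + patch] = feats[hi_:hi_ + patch, hj:hj + patch]
    return out
```

In the real framework the allowable `patch` size and `thresh` would both be annealed across training iterations rather than fixed.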
[CV-112] Work-Efficient Parallel Non-Maximum Suppression Kernels
【Quick Read】: This paper addresses the large number of overlapping candidate windows produced by sliding-window classifiers and single-shot convolutional neural network (CNN) meta-architectures in object detection. The key to the solution is a highly scalable Non-Maximum Suppression (NMS) algorithm designed from scratch for embedded GPU architectures, capable of handling thousands of simultaneous detections per image, achieving 14x-40x speedups over state-of-the-art NMS methods across the NVIDIA Tegra family of GPUs.
Link: https://arxiv.org/abs/2502.00535
Authors: David Oro,Carles Fernández,Xavier Martorell,Javier Hernando
Affiliations: Universitat Politècnica de Catalunya (巴塞罗那加泰罗尼亚理工大学); Herta Security (赫尔塔安全)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Code: this https URL
Abstract:In the context of object detection, sliding-window classifiers and single-shot Convolutional Neural Network (CNN) meta-architectures typically yield multiple overlapping candidate windows with similar high scores around the true location of a particular object. Non-Maximum Suppression (NMS) is the process of selecting a single representative candidate within this cluster of detections, so as to obtain a unique detection per object appearing on a given picture. In this paper, we present a highly scalable NMS algorithm for embedded GPU architectures that is designed from scratch to handle workloads featuring thousands of simultaneous detections on a given picture. Our kernels are directly applicable to other sequential NMS algorithms such as FeatureNMS, Soft-NMS or AdaptiveNMS that share the inner workings of the classic greedy NMS method. The obtained performance results show that our parallel NMS algorithm is capable of clustering 1024 simultaneous detected objects per frame in roughly 1 ms on both NVIDIA Tegra X1 and NVIDIA Tegra X2 on-die GPUs, while taking 2 ms on NVIDIA Tegra K1. Furthermore, our proposed parallel greedy NMS algorithm yields a 14x-40x speed up when compared to state-of-the-art NMS methods that require learning a CNN from annotated data.
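The paper's parallel kernels share the inner workings of the classic greedy NMS method, which serves as the sequential baseline. A minimal NumPy rendering of that baseline (boxes in `[x1, y1, x2, y2]` format; this is the textbook algorithm, not the paper's GPU implementation):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Classic sequential greedy NMS: keep the highest-scoring box,
    suppress everything overlapping it above `iou_thresh`, repeat."""
    order = np.argsort(scores)[::-1]   # indices by descending score
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```

The inherently serial "keep then suppress" loop is exactly what makes a work-efficient parallel formulation non-trivial on GPUs.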
[CV-113] Video Latent Flow Matching: Optimal Polynomial Projections for Video Interpolation and Extrapolation
【Quick Read】: This paper targets efficient generation of time-dependent video frames in video modeling. The key is the Video Latent Flow Matching (VLFM) method, which builds on strong pre-trained image generation models and models a caption-guided flow of latent patches that can be decoded into time-dependent video frames, enabling interpolation and extrapolation at arbitrary frame rates.
Link: https://arxiv.org/abs/2502.00500
Authors: Yang Cao,Zhao Song,Chiwun Yang
Affiliations: Simons Institute for the Theory of Computing, University of California, Berkeley (西蒙斯计算理论研究所,加州大学伯克利分校); Sun Yat-sen University (中山大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:This paper considers an efficient video modeling process called Video Latent Flow Matching (VLFM). Unlike prior works, which randomly sampled latent patches for video generation, our method relies on current strong pre-trained image generation models, modeling a certain caption-guided flow of latent patches that can be decoded to time-dependent video frames. We first speculate multiple images of a video are differentiable with respect to time in some latent space. Based on this conjecture, we introduce the HiPPO framework to approximate the optimal projection for polynomials to generate the probability path. Our approach gains the theoretical benefits of the bounded universal approximation error and timescale robustness. Moreover, VLFM processes the interpolation and extrapolation abilities for video generation with arbitrary frame rates. We conduct experiments on several text-to-video datasets to showcase the effectiveness of our method.
[CV-114] A framework for river connectivity classification using temporal image processing and attention based neural networks
【Quick Read】: This paper addresses the high cost of traditional stream-flow gauges, which limits monitoring to large rivers, by proposing a low-cost, easily deployed alternative based on automatic image classification. The key is an automated system with three parts: image preprocessing, image augmentation, and machine-learning classification. In particular, it adopts a vision transformer architecture and temporal image enhancement, combined with generative augmentation using diffusion models, raising baseline accuracy on images from a new, unseen site from 75% to 90% and markedly improving automatic classification of river connectivity.
Link: https://arxiv.org/abs/2502.00474
Authors: Timothy James Becker,Derin Gezgin,Jun Yi He Wu,Mary Becker
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 15 pages, 8 figures
Abstract:Measuring the connectivity of water in rivers and streams is essential for effective water resource management. Increased extreme weather events associated with climate change can result in alterations to river and stream connectivity. While traditional stream flow gauges are costly to deploy and limited to large river bodies, trail camera methods are a low-cost and easily deployed alternative to collect hourly data. Image capturing, however requires stream ecologists to manually curate (select and label) tens of thousands of images per year. To improve this workflow, we developed an automated instream trail camera image classification system consisting of three parts: (1) image processing, (2) image augmentation and (3) machine learning. The image preprocessing consists of seven image quality filters, foliage-based luma variance reduction, resizing and bottom-center cropping. Images are balanced using variable amount of generative augmentation using diffusion models and then passed to a machine learning classification model in labeled form. By using the vision transformer architecture and temporal image enhancement in our framework, we are able to increase the 75% base accuracy to 90% for a new unseen site image. We make use of a dataset captured and labeled by staff from the Connecticut Department of Energy and Environmental Protection between 2018-2020. Our results indicate that a combination of temporal image processing and attention-based models are effective at classifying unseen river connectivity images.
[CV-115] Weak-to-Strong Diffusion with Reflection
【Quick Read】: This paper addresses the inevitable gap between ideal outputs and real data that diffusion generative models incur due to limitations in training data quality, modeling strategy, and architectural design. The key is Weak-to-Strong Diffusion (W2SD), a framework that uses the estimated difference between existing weak and strong models to approximate the gap between an ideal model and a strong model. By alternating denoising and inversion through a reflective operation on the weak-to-strong difference, W2SD steers latent variables along the sampling trajectory toward regions of the real data distribution, significantly improving the quality and diversity of generated results.
Link: https://arxiv.org/abs/2502.00473
Authors: Lichen Bai,Masashi Sugiyama,Zeke Xie
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 19 figures, 14 tables
Abstract:The goal of diffusion generative models is to align the learned distribution with the real data distribution through gradient score matching. However, inherent limitations in training data quality, modeling strategies, and architectural design lead to inevitable gap between generated outputs and real data. To reduce this gap, we propose Weak-to-Strong Diffusion (W2SD), a novel framework that utilizes the estimated difference between existing weak and strong models (i.e., weak-to-strong difference) to approximate the gap between an ideal model and a strong model. By employing a reflective operation that alternates between denoising and inversion with weak-to-strong difference, we theoretically understand that W2SD steers latent variables along sampling trajectories toward regions of the real data distribution. W2SD is highly flexible and broadly applicable, enabling diverse improvements through the strategic selection of weak-to-strong model pairs (e.g., DreamShaper vs. SD1.5, good experts vs. bad experts in MoE). Extensive experiments demonstrate that W2SD significantly improves human preference, aesthetic quality, and prompt adherence, achieving SOTA performance across various modalities (e.g., image, video), architectures (e.g., UNet-based, DiT-based, MoE), and benchmarks. For example, Juggernaut-XL with W2SD can improve with the HPSv2 winning rate up to 90% over the original results. Moreover, the performance gains achieved by W2SD markedly outweigh its additional computational overhead, while the cumulative improvements from different weak-to-strong difference further solidify its practical utility and deployability.
[CV-116] Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions
【Quick Read】: This paper addresses automatic speech recognition from lip movements (continuous lipreading) in Spanish. The key to the solution is an end-to-end system based on a hybrid CTC/Attention architecture, together with an extensive ablation study analyzing how each component of the system affects recognition quality. The paper also conducts a rigorous error analysis of the factors that may affect learning in the automatic system and consolidates a new Spanish lipreading benchmark.
Link: https://arxiv.org/abs/2502.00464
Authors: David Gimeno-Gómez,Carlos-D. Martínez-Hinarejos
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in the "Language Resources and Evaluation" journal, Springer Nature
Abstract:Visual speech recognition remains an open research problem where different challenges must be considered by dispensing with the auditory sense, such as visual ambiguities, the inter-personal variability among speakers, and the complex modeling of silence. Nonetheless, recent remarkable results have been achieved in the field thanks to the availability of large-scale databases and the use of powerful attention mechanisms. Besides, multiple languages apart from English are nowadays a focus of interest. This paper presents noticeable advances in automatic continuous lipreading for Spanish. First, an end-to-end system based on the hybrid CTC/Attention architecture is presented. Experiments are conducted on two corpora of disparate nature, reaching state-of-the-art results that significantly improve the best performance obtained to date for both databases. In addition, a thorough ablation study is carried out, where it is studied how the different components that form the architecture influence the quality of speech recognition. Then, a rigorous error analysis is carried out to investigate the different factors that could affect the learning of the automatic system. Finally, a new Spanish lipreading benchmark is consolidated. Code and trained models are available at this https URL.
[CV-117] MambaGlue: Fast and Robust Local Feature Matching With Mamba ICRA
【Quick Read】: This paper addresses the persistent demand in computer vision for feature matching that is both robust and fast. The key is MambaGlue, a local feature matching method built on the Mamba architecture, which is noted for excellent training and inference speed and promising performance relative to Transformer architectures. MambaGlue introduces two modules: (a) a MambaAttention mixer, which simultaneously and selectively understands local and global context through a Mamba-based self-attention structure, and (b) a deep confidence score regressor, a multi-layer perceptron (MLP) architecture that estimates how confidently matching predictions correspond to ground-truth correspondences. These innovations let MambaGlue balance robustness and efficiency in real-world applications.
Link: https://arxiv.org/abs/2502.00462
Authors: Kihwan Ryoo,Hyungtae Lim,Hyun Myung
Affiliations: KAIST (韩国高等科技学院); LIDS (Laboratory for Information & Decision Systems), Massachusetts Institute of Technology (麻省理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Proc. IEEE Int'l Conf. Robotics and Automation (ICRA) 2025
Abstract:In recent years, robust matching methods using deep learning-based approaches have been actively studied and improved in computer vision tasks. However, there remains a persistent demand for both robust and fast matching techniques. To address this, we propose a novel Mamba-based local feature matching approach, called MambaGlue, where Mamba is an emerging state-of-the-art architecture rapidly gaining recognition for its superior speed in both training and inference, and promising performance compared with Transformer architectures. In particular, we propose two modules: a) MambaAttention mixer to simultaneously and selectively understand the local and global context through the Mamba-based self-attention structure and b) deep confidence score regressor, which is a multi-layer perceptron (MLP)-based architecture that evaluates a score indicating how confidently matching predictions correspond to the ground-truth correspondences. Consequently, our MambaGlue achieves a balance between robustness and efficiency in real-world applications. As verified on various public datasets, we demonstrate that our MambaGlue yields a substantial performance improvement over baseline approaches while maintaining fast inference speed. Our code will be available on this https URL
[CV-118] Explorations of the Softmax Space: Knowing When the Neural Network Doesnt Know…
【Quick Read】: This paper addresses the problem of assessing the reliability of machine learning predictions in automated decision-making. The key is a clustering-based distance metric: the confidence of a prediction is measured by the distance between a trained network's output and class centroids. A safety threshold is defined per class as the smallest distance from an incorrect prediction to that class's centroid, serving as the criterion for whether an automated prediction is acceptable. Experiments on MNIST and CIFAR-10 validate the method's effectiveness and consistency.
Link: https://arxiv.org/abs/2502.00456
Authors: Daniel Sikar,Artur d'Avila Garcez,Tillman Weyde
Affiliations: City, University of London (伦敦城市大学)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, 1 table. arXiv admin note: substantial text overlap with arXiv:2407.07821
Abstract:Ensuring the reliability and safety of automated decision-making is crucial. This paper proposes a new approach for measuring the reliability of predictions in machine learning models. We analyze how the outputs of a trained neural network change using clustering to measure distances between outputs and class centroids. We propose this distance as a metric to evaluate the confidence of predictions. We assign each prediction to a cluster with centroid representing the mean softmax output for all correct predictions of a given class. We then define a safety threshold for a class as the smallest distance from an incorrect prediction to the given class centroid. We evaluate the approach on the MNIST and CIFAR-10 datasets using a Convolutional Neural Network and a Vision Transformer, respectively. The results show that our approach is consistent across these data sets and network models, and indicate that the proposed metric can offer an efficient way of determining when automated predictions are acceptable and when they should be deferred to human operators.
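The centroid-and-threshold procedure described in the abstract can be sketched directly. The version below assumes Euclidean distance in softmax space (the paper's exact distance and clustering details may differ) and uses our own function names:

```python
import numpy as np

def class_centroids(softmax_out, labels, n_classes):
    """Centroid of each class: mean softmax output over the samples of
    that class that the model predicted correctly."""
    preds = softmax_out.argmax(1)
    cent = np.zeros((n_classes, softmax_out.shape[1]))
    for c in range(n_classes):
        sel = (labels == c) & (preds == c)
        cent[c] = softmax_out[sel].mean(0)
    return cent

def safety_threshold(softmax_out, labels, centroids):
    """Per-class safety threshold: the smallest distance from an
    INCORRECT prediction to the centroid of its predicted class."""
    preds = softmax_out.argmax(1)
    thr = np.full(len(centroids), np.inf)
    for x, p, y in zip(softmax_out, preds, labels):
        if p != y:
            thr[p] = min(thr[p], np.linalg.norm(x - centroids[p]))
    return thr
```

At inference time, a prediction whose distance to its class centroid exceeds that class's threshold would be deferred to a human operator.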
[CV-119] SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models
【Quick Read】: This paper addresses the quadratic computational scaling existing foundation models face when handling multispectral, multitemporal, and multisensor data. The key is the new pretraining framework SatMamba, which combines masked autoencoders with a State Space Model to achieve linear computational scaling, improving performance on long-sequence data.
Link: https://arxiv.org/abs/2502.00435
Authors: Chuc Man Duc,Hiromichi Fukui
Affiliations: Department of Computer Science, Faculty of Information Technology, University of Engineering and Technology, Vietnam National University (越南国立大学工程与技术大学信息技术学院计算机科学系); International Digital Earth Applied Science Research Center, Chubu University (中部大学国际数字地球应用科学研究中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Foundation models refer to deep learning models pretrained on large unlabeled datasets through self-supervised algorithms. In the Earth science and remote sensing communities, there is growing interest in transforming the use of Earth observation data, including satellite and aerial imagery, through foundation models. Various foundation models have been developed for remote sensing, such as those for multispectral, high-resolution, and hyperspectral images, and have demonstrated superior performance on various downstream tasks compared to traditional supervised models. These models are evolving rapidly, with capabilities to handle multispectral, multitemporal, and multisensor data. Most studies use masked autoencoders in combination with Vision Transformers (ViTs) as the backbone for pretraining. While the models showed promising performance, ViTs face challenges, such as quadratic computational scaling with input length, which may limit performance on multiband and multitemporal data with long sequences. This research aims to address these challenges by proposing SatMamba, a new pretraining framework that combines masked autoencoders with State Space Model, offering linear computational scaling. Experiments on high-resolution imagery across various downstream tasks show promising results, paving the way for more efficient foundation models and unlocking the full potential of Earth observation data. The source code is available in this https URL.
[CV-120] CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models
【Quick Read】: This paper addresses the high computational demand of the iterative denoising process in diffusion models for text-to-image synthesis. The key is to combine token-level pruning with caching: significant token changes across denoising iterations are identified via relative noise magnitude, and token selection is further enhanced with spatial clustering and distribution balancing, cutting computational cost by 50%-60% while preserving model performance.
Link: https://arxiv.org/abs/2502.00433
Authors: Xinle Cheng,Zhuoming Chen,Zhihao Jia
Affiliations: Peking University (北京大学); Carnegie Mellon University (卡内基梅隆大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis; however, their iterative denoising process demands substantial computational resources. In this paper, we present a novel acceleration strategy that integrates token-level pruning with caching techniques to tackle this computational challenge. By employing noise relative magnitude, we identify significant token changes across denoising iterations. Additionally, we enhance token selection by incorporating spatial clustering and ensuring distributional balance. Our experiments demonstrate reveal a 50%-60% reduction in computational costs while preserving the performance of the model, thereby markedly increasing the efficiency of diffusion models. The code is available at this https URL
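The selection signal behind the pruning step can be sketched in a few lines: score each token by the relative magnitude of its predicted-noise change between two denoising iterations, then recompute only the top fraction and reuse cached results for the rest. This is a simplified sketch of that one ingredient; the actual method adds spatial clustering and distribution balancing, and the function name and `keep_ratio` are our own choices.

```python
import numpy as np

def select_tokens(noise_prev, noise_curr, keep_ratio=0.5):
    """Rank tokens by relative change in predicted noise between two
    denoising steps and return the indices of the most-changed fraction
    (the ones worth recomputing; the rest can be served from cache)."""
    delta = np.linalg.norm(noise_curr - noise_prev, axis=-1)   # per-token change
    scale = np.linalg.norm(noise_prev, axis=-1) + 1e-8         # avoid divide-by-zero
    rel = delta / scale                                        # relative magnitude
    k = max(1, int(keep_ratio * rel.shape[0]))
    return np.argsort(rel)[::-1][:k]
```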
[CV-121] ST-V: TEst-time Support-set Tuning for Zero-shot Video Classification
【Quick Read】: This paper addresses two main challenges in zero-shot video classification: the cross-modal semantic gap (modality gap) and the fact that a fixed support set cannot be tuned. The key is a new framework, TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V), which combines Multi-prompting Support-set Dilation (MSD) and Temporal-aware Support-set Erosion (TSE). MSD expands the support samples of each class using multiple prompts queried from LLMs, enriching the support set's diversity, while TSE tunes the support set with learnable weights in a self-supervised manner according to temporal prediction consistency, mining pivotal supporting cues for each class.
Link: https://arxiv.org/abs/2502.00426
Authors: Rui Yan,Jin Wang,Hongyu Qu,Xiaoyu Du,Dong Zhang,Jinhui Tang,Tieniu Tan
Affiliations: Nanjing University (南京大学); Nanjing University of Science and Technology (南京理工大学); Hong Kong University of Science and Technology (香港科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embedding with a few prompts (Test-time Prompt Tuning, TPT) or replacing class names with generated visual samples (support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities while the support-set cannot be tuned. To this end, we draw on each other’s strengths and propose a novel framework namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts enquired from LLMs to enrich the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to the temporal prediction consistency in a self-supervised manner to dig pivotal supporting cues for each class. \textbfTEST-V achieves state-of-the-art results across four benchmarks and has good interpretability for the support-set dilation and erosion.
[CV-122] MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization
【Quick Read】: This paper addresses the large parameter sizes and high computational demands that hinder real-world deployment of multimodal large language models (MLLMs). The key is MQuant, a post-training quantization framework that introduces Modality-Specific Static Quantization (MSQ), Attention-Invariant Flexible Switching (AIFS), and Rotation Magnitude Suppression (RMS) to tackle, respectively, the inference latency caused by large numbers of visual tokens, the distributional disparity between visual and textual tokens, and the extreme outliers introduced by Hadamard transformations, achieving near-floating-point accuracy while cutting inference latency by up to 30%.
Link: https://arxiv.org/abs/2502.00425
Authors: JiangYong Yu,Sifan Zhou,Dawei Yang,Shuo Wang,Shuoyu Li,Xing Hu,Chen Xu,Zukang Xu,Changyong Shu,Zhihang Yuan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: First quantization solution for Multimodal large language models applicable to 5 mainstream MLLMs
Abstract:Multimodal large language models (MLLMs) have garnered widespread attention due to their ability to understand multimodal input. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of multimodal large language models (MLLMs). Conventional quantization often struggles with MLLMs because of (a) high inference latency from large visual token counts, (b) distributional disparities between visual and textual tokens, and (c) extreme outliers introduced by Hadamard-based transformations. To address these issues, MQuant introduces: Modality-Specific Static Quantization (MSQ), assigning distinct static scales for visual vs. textual tokens; Attention-Invariant Flexible Switching (AIFS), reordering tokens to preserve causal attention while eliminating expensive token-wise scale computations; Rotation Magnitude Suppression (RMS), mitigating weight outliers arising from online Hadamard rotations. On five mainstream MLLMs (including Qwen-VL, MiniCPM-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30%, significantly outperforming existing PTQ baselines. Our MQuant effectively bridges the gap for efficient and accurate MLLMs inference in resource-constrained devices. Code will be released.
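The MSQ idea — distinct static quantization scales for visual versus textual tokens — can be illustrated with symmetric int8 fake-quantization. This is a sketch under stated assumptions only: the real MQuant additionally reorders tokens (AIFS) and suppresses rotation outliers (RMS), which are omitted here, and all names and signatures below are our own.

```python
import numpy as np

def static_scale(calib, n_bits=8):
    """Symmetric per-tensor scale derived offline from calibration data."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.abs(calib).max() / qmax

def quantize_tokens(tokens, is_visual, s_vis, s_txt, n_bits=8):
    """Sketch of modality-specific static quantization: visual and
    textual tokens get DISTINCT precomputed scales, so no per-token
    dynamic scale has to be computed at inference time."""
    qmax = 2 ** (n_bits - 1) - 1
    scales = np.where(is_visual[:, None], s_vis, s_txt)  # per-row scale
    q = np.clip(np.round(tokens / scales), -qmax - 1, qmax)
    return q * scales  # fake-quantized (dequantized) tokens
```

Using one shared scale instead would let the wide-range visual tokens crush the resolution available to the narrow-range textual tokens, which is the disparity MSQ is meant to avoid.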
[CV-123] Parameter Efficient Fine-Tuning of Segment Anything Model
【Quick Read】: This paper addresses the limited generalization of biomedical image segmentation to new conditions and the high cost of data annotation. The key is the application of parameter-efficient fine-tuning (PEFT): the paper evaluates nine PEFT methods on diverse datasets and contributes a resource-efficient fine-tuning strategy, including a QLoRA implementation for vision transformers and a new approach for efficiently fine-tuning SAM.
Link: https://arxiv.org/abs/2502.00418
Authors: Carolin Teuber,Anwai Archit,Constantin Pape
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Segmentation is an important analysis task for biomedical images, enabling the study of individual organelles, cells or organs. Deep learning has massively improved segmentation methods, but challenges remain in generalization to new conditions, requiring costly data annotation. Vision foundation models, such as Segment Anything Model (SAM), address this issue through broad segmentation capabilities. However, these models still require finetuning on annotated data, although with less annotations, to achieve optimal results for new conditions. As a downside, they require more computational resources. This makes parameter-efficient finetuning (PEFT) relevant for their application. We contribute the first comprehensive study of PEFT for SAM applied to biomedical segmentation by evaluating 9 PEFT methods on diverse datasets. We also provide an implementation of QLoRA for vision transformers and a new approach for resource-efficient finetuning of SAM. Our code is publicly available at this https URL.
[CV-124] ROI: Cross-Subject Pretraining with Sparse Voxel Selection for Enhanced fMRI Visual Decoding ICASSP2025
【Quick Read】: This paper addresses two problems in fMRI visual decoding: manually labeled ROIs introduce redundant information and noise, and the lack of automated ROI labeling limits the practicality of the technique for cross-subject tasks. The key is TROI (Trainable Region of Interest), a novel two-stage, data-driven ROI labeling method. TROI pretrains an image-decoding backbone on a cross-subject dataset so the input-layer dimensions for a new subject can be generated quickly, then applies a learning-rate rewinding strategy to fine-tune the input layer for the new subject, enabling efficient cross-subject decoding.
Link: https://arxiv.org/abs/2502.00412
Authors: Ziyu Wang,Tengyu Pan,Zhenyu Li,Wu Ji,Li Xiuxing,Jianyong Wang
Affiliations: Tsinghua University (清华大学); Beijing Institute of Technology (北京理工大学); Beihang University (北京航空航天大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICASSP 2025
Abstract:fMRI (functional Magnetic Resonance Imaging) visual decoding involves decoding the original image from brain signals elicited by visual stimuli. This often relies on manually labeled ROIs (Regions of Interest) to select brain voxels. However, these ROIs can contain redundant information and noise, reducing decoding performance. Additionally, the lack of automated ROI labeling methods hinders the practical application of fMRI visual decoding technology, especially for new subjects. This work presents TROI (Trainable Region of Interest), a novel two-stage, data-driven ROI labeling method for cross-subject fMRI decoding tasks, particularly when subject samples are limited. TROI leverages labeled ROIs in the dataset to pretrain an image decoding backbone on a cross-subject dataset, enabling efficient optimization of the input layer for new subjects without retraining the entire model from scratch. In the first stage, we introduce a voxel selection method that combines sparse mask training and low-pass filtering to quickly generate the voxel mask and determine input layer dimensions. In the second stage, we apply a learning rate rewinding strategy to fine-tune the input layer for downstream tasks. Experimental results on the same small sample dataset as the baseline method for brain visual retrieval and reconstruction tasks show that our voxel selection method surpasses the state-of-the-art method MindEye2 with an annotated ROI mask.
zh
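上述第一阶段的"稀疏掩码训练 + 低通滤波"体素筛选思路,可以用如下最小示意来理解(非论文官方实现,掩码得分、滤波核与阈值方式均为示意假设):

```python
import numpy as np

def select_voxels(mask_weights, k, smooth=5):
    """Stage-1 sketch: low-pass filter the learned sparse-mask magnitudes
    (moving average over the voxel axis), then keep the top-k voxels."""
    scores = np.abs(mask_weights)
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(scores, kernel, mode="same")  # crude low-pass filter
    return np.sort(np.argsort(smoothed)[-k:])

# Synthetic "mask weights": mostly noise, plus one informative contiguous region.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=1000)
weights[100:120] += 2.0
selected = select_voxels(weights, k=20)
```

示例中 20 个信息量高的相邻体素被嵌入随机噪声中,经低通平滑后按得分取 top-k 即可稳定选出该区域,同时也确定了新受试者输入层的维度(即 k)。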
[CV-125] Exploring Linear Attention Alternative for Single Image Super-Resolution
【速读】:该论文旨在解决基于深度学习的单图像超分辨率(Single-Image Super-Resolution, SISR)技术在计算复杂性和图像质量方面的挑战,特别是在遥感图像处理中的应用。关键解决方案在于提出Omni-Scale RWKV超分辨率(OmniRWKVSR)模型,该模型结合了Receptance Weighted Key Value (RWKV)架构与特征提取技术,如Visual RWKV空间混合(VRSM)和Visual RWKV通道混合(VRCM),以克服现有方法的局限并实现卓越的SISR性能。
链接: https://arxiv.org/abs/2502.00404
作者: Rongchang Lu,Changyu Li,Donghang Li,Guojing Zhang,Jianqiang Huang,Xilai Li
机构: Qinghai University (青海大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This paper has been published to IEEE International Joint Conference on Neural Networks. Feel free to contact on nomodeset@qq.com
点击查看摘要
Abstract:Deep learning-based single-image super-resolution (SISR) technology focuses on enhancing low-resolution (LR) images into high-resolution (HR) ones. Although significant progress has been made, challenges remain in computational complexity and quality, particularly in remote sensing image processing. To address these issues, we propose our Omni-Scale RWKV Super-Resolution (OmniRWKVSR) model which presents a novel approach that combines the Receptance Weighted Key Value (RWKV) architecture with feature extraction techniques such as Visual RWKV Spatial Mixing (VRSM) and Visual RWKV Channel Mixing (VRCM), aiming to overcome the limitations of existing methods and achieve superior SISR performance. This work has proved able to provide effective solutions for high-quality image reconstruction. Under the 4x Super-Resolution tasks, compared to the MambaIR model, we achieved an average improvement of 0.26% in PSNR and 0.16% in SSIM.
zh
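摘要未给出 Visual RWKV 通道混合(VRCM)的具体结构,下面仅给出一个通用 RWKV 风格通道混合(token shift + 门控 MLP)的最小示意,权重与超参均为随机假设,与论文实际模块可能不同:

```python
import numpy as np

def token_shift(x):
    """Shift the token sequence one step back so each position can mix its
    own features with its predecessor's (RWKV-style token shift)."""
    shifted = np.zeros_like(x)
    shifted[1:] = x[:-1]
    return shifted

def channel_mix(x, mu, W_k, W_v, W_r):
    """Minimal RWKV-style channel mixing: interpolate current and shifted
    tokens, then pass through a sigmoid-gated squared-ReLU MLP."""
    xi = mu * x + (1.0 - mu) * token_shift(x)
    k = np.maximum(xi @ W_k, 0.0) ** 2          # squared ReLU
    r = 1.0 / (1.0 + np.exp(-(xi @ W_r)))       # sigmoid gate
    return r * (k @ W_v)

rng = np.random.default_rng(1)
T, C = 8, 4                                     # toy sequence length / channels
x = rng.normal(size=(T, C))
out = channel_mix(x, 0.5, rng.normal(size=(C, C)),
                  rng.normal(size=(C, C)), rng.normal(size=(C, C)))
```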
[CV-126] Enhancing Highway Safety: Accident Detection on the A9 Test Stretch Using Roadside Sensors
【速读】:该论文旨在解决道路交通事故导致的高死亡率问题,特别是由人为错误(如超速、酒驾和分心驾驶)引起的事故。论文的关键解决方案在于提出了一种结合基于规则的方法与基于学习的方法的事故检测框架。该框架通过利用包含高速碰撞序列的真实世界高速公路事故数据集进行训练和验证,数据集中包含了大量标注的二维和三维边界框以及跟踪ID,从而提高了事故检测的可靠性。
链接: https://arxiv.org/abs/2502.00402
作者: Walter Zimmer,Ross Greer,Xingcheng Zhou,Rui Song,Marc Pavel,Daniel Lehmberg,Ahmed Ghita,Akshay Gopalkrishnan,Mohan Trivedi,Alois Knoll
机构: Technical University of Munich (TUM); Laboratory for Intelligent and Safe Automobiles (LISA) at the Uni. of California San Diego (UCSD); University of California Merced (UCM); Fraunhofer Institute for Transportation and Infrastructure Systems (IVI); SETLabs Research GmbH
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Road traffic injuries are the leading cause of death for people aged 5-29, resulting in about 1.19 million deaths each year. To reduce these fatalities, it is essential to address human errors like speeding, drunk driving, and distractions. Additionally, faster accident detection and quicker medical response can help save lives. We propose an accident detection framework that combines a rule-based approach with a learning-based one. We introduce a dataset of real-world highway accidents featuring high-speed crash sequences. It includes 294,924 labeled 2D boxes, 93,012 labeled 3D boxes, and track IDs across 48,144 frames captured at 10 Hz using four roadside cameras and LiDAR sensors. The dataset covers ten object classes and is released in the OpenLABEL format. Our experiments and analysis demonstrate the reliability of our method.
zh
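论文将基于规则的方法与基于学习的方法结合,但摘要未给出具体规则。下面是一个示意性的"急停"规则(阈值、采样间隔与规则形式均为假设),演示如何从路侧感知得到的轨迹速度序列中提取规则式事故线索:

```python
def hard_stop_alert(speeds_mps, dt=0.1, decel_thresh=7.0, stop_thresh=0.5):
    """Return the first frame index at which a track decelerates harder than
    decel_thresh (m/s^2) and ends up almost stationary -- one example of the
    kind of rule a rule-based accident detector might use."""
    for i in range(1, len(speeds_mps)):
        decel = (speeds_mps[i - 1] - speeds_mps[i]) / dt
        if decel > decel_thresh and speeds_mps[i] < stop_thresh:
            return i
    return None

# A vehicle at ~25 m/s (90 km/h) that stops within one 0.1 s step vs. one
# that merely slows down smoothly.
crash_track = [25.0, 24.8, 24.9, 0.2, 0.0, 0.0]
normal_track = [25.0, 24.0, 23.0, 22.5, 22.0]
```

实际系统中这类规则输出可与学习式检测器的结果融合,以提升事故检测的可靠性。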
[CV-127] Minimalistic Video Saliency Prediction via Efficient Decoder Spatio Temporal Action Cues ICASSP2025
【速读】:该论文旨在解决视频显著性检测(Video Saliency Detection)中的模型大小与性能之间的权衡问题。解决方案的关键在于提出了一种基于ViNet架构的轻量级模型ViNet-S(36MB),它采用了U-Net设计,并具有一个轻量级解码器,从而显著减少了模型大小和参数数量,同时保持了高性能。此外,论文还引入了ViNet-A(148MB),它集成了时空动作定位(Spatio-Temporal Action Localization, STAL)特性。通过将ViNet-S和ViNet-A的预测显著性图进行平均,该方法在多种视觉和视听显著性数据集上实现了当前最先进(state-of-the-art)的性能,同时在参数效率和实时性能方面优于基于Transformer的模型。
链接: https://arxiv.org/abs/2502.00397
作者: Rohit Girmaji,Siddharth Jain,Bhav Beri,Sarthak Bansal,Vineet Gandhi
机构: CVIT, IIIT Hyderabad (IIIT 海得拉巴); India
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
点击查看摘要
Abstract:This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameters without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, by averaging predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000fps.
zh
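摘要中的集成方式即对 ViNet-S 与 ViNet-A 的预测显著性图取平均。下面用随机数组代替两个模型的输出,演示这一平均融合步骤(min-max 归一化方式为示意假设):

```python
import numpy as np

def ensemble_saliency(maps):
    """Average several predicted saliency maps and min-max renormalize to
    [0, 1], mirroring the simple map-averaging ensemble described above."""
    avg = np.mean(np.stack(maps, axis=0), axis=0)
    lo, hi = avg.min(), avg.max()
    return (avg - lo) / (hi - lo) if hi > lo else np.zeros_like(avg)

rng = np.random.default_rng(2)
pred_a = rng.random((32, 32))   # stand-in for a ViNet-S saliency map
pred_b = rng.random((32, 32))   # stand-in for a ViNet-A saliency map
fused = ensemble_saliency([pred_a, pred_b])
```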
[CV-128] FlexCloud: Direct Modular Georeferencing and Drift-Correction of Point Cloud Maps
【速读】:该论文旨在解决基于同时定位与建图(SLAM)生成的点云地图缺乏全局位置数据的问题,导致内部扭曲和缺失地理参照信息,从而影响地图辅助定位方法的应用。论文的关键解决方案是提出FlexCloud系统,通过利用全球导航卫星系统(GNSS)位置信息和三维橡胶片变换(3D rubber-sheet transformation),实现无需额外控制点的自动地理参照,纠正由长期漂移引起的地图扭曲,进而生成一致且全局参照的点云地图。
链接: https://arxiv.org/abs/2502.00395
作者: Maximilian Leitenstern,Marko Alten,Christian Bolea-Schaser,Dominik Kulmer,Marcel Weinmann,Markus Lienkamp
机构: Institute of Automotive Technology, Technical University of Munich (汽车技术研究所, 慕尼黑工业大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at VEHITS 2025, Proceedings of the 11th International Conference on Vehicle Technology and Intelligent Transport Systems - VEHITS; 2025
点击查看摘要
Abstract:Current software stacks for real-world applications of autonomous driving leverage map information to ensure reliable localization, path planning, and motion prediction. An important field of research is the generation of point cloud maps, referring to the topic of simultaneous localization and mapping (SLAM). As most recent developments do not include global position data, the resulting point cloud maps suffer from internal distortion and missing georeferencing, preventing their use for map-based localization approaches. Therefore, we propose FlexCloud for an automatic georeferencing of point cloud maps created from SLAM. Our approach is designed to work modularly with different SLAM methods, utilizing only the generated local point cloud map and its odometry. Using the corresponding GNSS positions enables direct georeferencing without additional control points. By leveraging a 3D rubber-sheet transformation, we can correct distortions within the map caused by long-term drift while maintaining its structure. Our approach enables the creation of consistent, globally referenced point cloud maps from data collected by a mobile mapping system (MMS). The source code of our work is available at this https URL.
zh
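FlexCloud 的 3D 橡胶片变换是非刚性的,摘要未给出公式。作为示意,下面只演示其前置的"利用对应 GNSS 位置直接地理参照"这一步的刚性版本(Kabsch 最小二乘对齐,轨迹数据为人工构造):

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation + translation mapping src points onto dst
    (Kabsch algorithm). The paper's 3D rubber-sheet step is non-rigid;
    this sketch only shows the initial rigid georeferencing idea."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

rng = np.random.default_rng(3)
local = rng.random((50, 3)) * 10          # SLAM-frame trajectory (toy data)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
gnss = local @ R_true.T + np.array([500.0, -200.0, 30.0])  # "GNSS" positions
R, t = rigid_align(local, gnss)
err = np.abs(local @ R.T + t - gnss).max()
```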
[CV-129] RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes
【速读】:该论文旨在解决无人机场景下基于自然语言表达的指代表达理解(REC)挑战,特别是在多尺度目标检测、多目标及无目标样本处理以及复杂环境中丰富的上下文表达方面。论文的关键解决方案包括引入RefDrone数据集以及开发RDAgent半自动化标注工具以高效构建数据集,并提出Number GroundingDINO (NGDINO) 方法,该方法能够显式学习并利用表达中提及的目标数量,从而有效应对多目标和无目标情况。
链接: https://arxiv.org/abs/2502.00392
作者: Zhichao Sun,Yepeng Liu,Huachao Zhu,Yuliang Gu,Yuda Zou,Zelong Liu,Gui-Song Xia,Bo Du,Yongchao Xu
机构: School of Computer Science, Wuhan University (计算机科学学院, 武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Drones have become prevalent robotic platforms with diverse applications, showing significant potential in Embodied Artificial Intelligence (Embodied AI). Referring Expression Comprehension (REC) enables drones to locate objects based on natural language expressions, a crucial capability for Embodied AI. Despite advances in REC for ground-level scenes, aerial views introduce unique challenges including varying viewpoints, occlusions and scale variations. To address this gap, we introduce RefDrone, a REC benchmark for drone scenes. RefDrone reveals three key challenges in REC: 1) multi-scale and small-scale target detection; 2) multi-target and no-target samples; 3) complex environment with rich contextual expressions. To efficiently construct this dataset, we develop RDAgent (referring drone annotation framework with multi-agent system), a semi-automated annotation tool for REC tasks. RDAgent ensures high-quality contextual expressions and reduces annotation cost. Furthermore, we propose Number GroundingDINO (NGDINO), a novel method designed to handle multi-target and no-target cases. NGDINO explicitly learns and utilizes the number of objects referred to in the expression. Comprehensive experiments with state-of-the-art REC methods demonstrate that NGDINO achieves superior performance on both the proposed RefDrone and the existing gRefCOCO datasets. The dataset and code will be publicly at this https URL.
zh
[CV-130] Efficient Adaptive Label Refinement for Label Noise Learning
【速读】:该论文旨在解决深度神经网络在处理带有噪声标签的数据时容易过拟合的问题,从而导致性能下降。论文的关键解决方案是提出了一种名为自适应标签精炼(Adaptive Label Refinement, ALR)的方法。ALR通过将避免拟合错误标签和充分学习干净样本的任务解耦,并采用软标签更新和熵损失引导的方式,逐步提高高置信度标签的硬度,以更好地从干净样本中学习,而无需任何先验噪声知识或辅助数据集。这种方法简单且高效,验证了其在人工噪声(如CIFAR-10/100)和真实噪声数据集(如ANIMAL-10N, Clothing1M, WebVision)上的有效性,表明ALR在性能上超越了现有最先进方法。
链接: https://arxiv.org/abs/2502.00386
作者: Wenzhen Zhang,Debo Cheng,Guangquan Lu,Bo Zhou,Jiaye Li,Shichao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep neural networks are highly susceptible to overfitting noisy labels, which leads to degraded performance. Existing methods address this issue by employing manually defined criteria, aiming to achieve optimal partitioning in each iteration to avoid fitting noisy labels while thoroughly learning clean samples. However, this often results in overly complex and difficult-to-train models. To address this issue, we decouple the tasks of avoiding fitting incorrect labels and thoroughly learning clean samples and propose a simple yet highly applicable method called Adaptive Label Refinement (ALR). First, inspired by label refurbishment techniques, we update the original hard labels to soft labels using the model’s predictions to reduce the risk of fitting incorrect labels. Then, by introducing the entropy loss, we gradually `harden’ the high-confidence soft labels, guiding the model to better learn from clean samples. This approach is simple and efficient, requiring no prior knowledge of noise or auxiliary datasets, making it more accessible compared to existing methods. We validate ALR’s effectiveness through experiments on benchmark datasets with artificial label noise (CIFAR-10/100) and real-world datasets with inherent noise (ANIMAL-10N, Clothing1M, WebVision). The results show that ALR outperforms state-of-the-art methods.
zh
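ALR 的两个要点——用模型预测软化原始(可能含噪的)标签、再用熵损失逐步"硬化"高置信度标签——可以用如下玩具示例说明(动量系数与迭代次数均为示意假设):

```python
import numpy as np

def refine_labels(soft_labels, probs, momentum=0.9):
    """Move the (possibly noisy) soft labels toward the model's current
    predictions -- the 'label refurbishment' step of the ALR idea."""
    return momentum * soft_labels + (1.0 - momentum) * probs

def entropy(p, eps=1e-12):
    """Mean prediction entropy; minimizing it 'hardens' confident labels."""
    return float(-(p * np.log(p + eps)).sum(axis=1).mean())

onehot = np.array([[1.0, 0.0], [0.0, 1.0]])   # sample 0 clean, sample 1 mislabeled
probs  = np.array([[0.9, 0.1], [0.8, 0.2]])   # model believes class 0 for both
labels = onehot.copy()
for _ in range(50):                            # repeated refinement steps
    labels = refine_labels(labels, probs)
```

迭代后第二个样本的软标签从错误的 [0, 1] 逐渐转向模型预测的类别 0,而熵损失会进一步奖励更"尖锐"的预测分布。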
[CV-131] Masked Generative Nested Transformers with Decode Time Scaling
【速读】:该论文旨在解决视觉生成过程中推理计算效率瓶颈的问题。现有方法通常需要多次通过变压器模型(Transformer model)来生成标记或去噪输入,这导致计算成本高昂。为了解决这一问题,论文提出了两个关键方案:(a) 不同阶段的生成过程所需的计算资源不同,并设计了解码时模型缩放调度以有效利用计算资源;(b) 可以缓存和重用部分计算结果。结合这些方法,使较小模型可以处理更多标记,而较大模型处理较少标记,同时保持参数规模不变。实验结果显示,与基线相比,该方法在几乎减少3倍计算量的情况下仍能获得具有竞争力的性能。
链接: https://arxiv.org/abs/2502.00382
作者: Sahil Goyal,Debapriya Tula,Gagan Jain,Pradeep Shenoy,Prateek Jain,Sujoy Paul
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent advances in visual generation have made significant strides in producing content of exceptional quality. However, most methods suffer from a fundamental problem - a bottleneck of inference computational efficiency. Most of these algorithms involve multiple passes over a transformer model to generate tokens or denoise inputs. However, the model size is kept consistent throughout all iterations, which makes it computationally expensive. In this work, we aim to address this issue primarily through two key ideas - (a) not all parts of the generation process need equal compute, and we design a decode time model scaling schedule to utilize compute effectively, and (b) we can cache and reuse some of the computation. Combining these two ideas leads to using smaller models to process more tokens while large models process fewer tokens. These different-sized models do not increase the parameter size, as they share parameters. We rigorously experiment with ImageNet 256×256, UCF101, and Kinetics600 to showcase the efficacy of the proposed method for image/video generation and frame prediction. Our experiments show that with almost 3× less compute than baseline, our model obtains competitive performance.
zh
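解码阶段"大小模型分工"带来的计算节省可以用一个粗略的预算模型直观感受(两档模型的相对成本与标记分配比例均为假设数字,非论文实测;论文中大小模型共享参数,故参数量不变):

```python
def decode_schedule(total_tokens, small_cost=1.0, large_cost=3.0,
                    large_fraction=0.25):
    """Toy decode-time scaling schedule: the large model handles a fraction
    of the tokens, the smaller (parameter-shared) model handles the rest.
    Returns per-phase token counts and total relative compute."""
    large_tokens = int(total_tokens * large_fraction)
    small_tokens = total_tokens - large_tokens
    cost = large_tokens * large_cost + small_tokens * small_cost
    return {"large": large_tokens, "small": small_tokens, "cost": cost}

baseline = decode_schedule(256, large_fraction=1.0)   # large model everywhere
scaled   = decode_schedule(256, large_fraction=0.25)  # mixed schedule
saving   = baseline["cost"] / scaled["cost"]
```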
[CV-132] Latent Action Learning Requires Supervision in the Presence of Distractors
【速读】:该论文旨在解决在包含干扰因素(distractors)的真实世界视频数据中,基于潜在动作学习(Latent Action Policies, LAPO)方法的有效性问题。研究发现,现有的LAPO方法在处理含有与动作相关的干扰因素的数据时表现不佳。为此,论文提出了一种改进方法,即LAOM(Latent Action Objectives Modification),通过在潜在动作学习过程中引入少量的地面真值动作监督(约2.5%的数据集),显著提升了潜在动作的质量,从而在下游任务中的性能提高了4.2倍。关键在于,在训练潜在动作模型(Latent Action Models, LAM)时集成监督信号,以克服干扰因素带来的负面影响。
链接: https://arxiv.org/abs/2502.00379
作者: Alexander Nikulin,Ilya Zisman,Denis Tarasov,Nikita Lyubaykin,Andrei Polubarov,Igor Kiselev,Vladislav Kurenkov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. In review
点击查看摘要
Abstract:Recently, latent action learning, pioneered by Latent Action Policies (LAPO), have shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using Distracting Control Suite (DCS) we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggle in such scenario. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Models (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning LAM and only then decoding from latent to ground-truth actions.
zh
[CV-133] Scalable Framework for Classifying AI-Generated Content Across Modalities AAAI2025
【速读】:该论文旨在解决有效区分人类生成与AI生成内容及分类不同生成模型输出的问题。解决方案的关键在于提出了一种集成感知哈希(Perceptual Hashing)、相似性测量(Similarity Measurement)和伪标签(Pseudo-labeling)的可扩展框架。此方法能够无需重新训练即可整合新的生成模型,从而确保在动态场景中的适应性和鲁棒性。
链接: https://arxiv.org/abs/2502.00375
作者: Anh-Kiet Duong,Petra Gomez-Krämer
机构: L3i Laboratory, La Rochelle University (拉罗谢尔大学 L3i 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, Defactify4 @ AAAI 2025
点击查看摘要
Abstract:The rapid growth of generative AI technologies has heightened the importance of effectively distinguishing between human and AI-generated content, as well as classifying outputs from diverse generative models. This paper presents a scalable framework that integrates perceptual hashing, similarity measurement, and pseudo-labeling to address these challenges. Our method enables the incorporation of new generative models without retraining, ensuring adaptability and robustness in dynamic scenarios. Comprehensive evaluations on the Defactify4 dataset demonstrate competitive performance in text and image classification tasks, achieving high accuracy across both distinguishing human and AI-generated content and classifying among generative methods. These results highlight the framework’s potential for real-world applications as generative AI continues to evolve. Source codes are publicly available at this https URL.
zh
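框架中"感知哈希 + 相似性测量"的组合,可用经典的均值哈希(aHash)加汉明距离来示意。论文实际使用的哈希算法摘要未说明,此处仅为常见替代:近似副本与参考图的哈希距离应明显小于无关图像:

```python
import numpy as np

def average_hash(img, hash_size=8):
    """Classic aHash: downsample by block-averaging, then threshold each
    block at the global mean to get a bit vector."""
    h, w = img.shape
    bh, bw = h // hash_size, w // hash_size
    small = img[:bh * hash_size, :bw * hash_size] \
        .reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (small > small.mean()).astype(np.uint8).ravel()

def hamming(a, b):
    return int((a != b).sum())

rng = np.random.default_rng(4)
real = rng.random((64, 64))                                    # reference image
near_copy = np.clip(real + rng.normal(0, 0.02, real.shape), 0, 1)
unrelated = rng.random((64, 64))
d_copy = hamming(average_hash(real), average_hash(near_copy))
d_other = hamming(average_hash(real), average_hash(unrelated))
```

在伪标签阶段,可将距离最近的参考哈希所属的生成模型作为新样本的伪标签,从而无需重新训练即可纳入新模型。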
[CV-134] NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
【速读】:该论文旨在解决视觉定位(Visual Grounding, VG)任务中复杂推理需求的挑战,特别是在需要详细查询解释的复杂推理任务中。当前方法主要分为端到端和组合式方法,而组合式方法虽然更为灵活,但在处理基于语言逻辑表示的复杂推理时仍存在局限性。论文的关键解决方案是提出NAVER,这是一种集成显式概率逻辑推理的组合式视觉定位方法,并嵌入有限状态自动机中,配备自校正机制。这种设计通过显式的逻辑推理提高了推理过程中的鲁棒性和可解释性,从而实现了最先进的性能。
链接: https://arxiv.org/abs/2502.00372
作者: Zhixi Cai,Fucai Ke,Simindokht Jahangard,Maria Garcia de la Banda,Reza Haffari,Peter J. Stuckey,Hamid Rezatofighi
机构: Monash University (蒙纳士大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Visual Grounding (VG) tasks, such as referring expression detection and segmentation tasks are important for linking visual entities to context, especially in complex reasoning tasks that require detailed query interpretation. This paper explores VG beyond basic perception, highlighting challenges for methods that require reasoning like human cognition. Recent advances in large language models (LLMs) and vision-language models (VLMs) have improved abilities for visual comprehension, contextual understanding, and reasoning. These methods are mainly split into end-to-end and compositional methods, with the latter offering more flexibility. Compositional approaches that integrate LLMs and foundation models show promising performance but still struggle with complex reasoning with language-based logical representations. To address these limitations, we propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning within a finite-state automaton, equipped with a self-correcting mechanism. This design improves robustness and interpretability in inference through explicit logic reasoning. Our results show that NAVER achieves SoTA performance comparing to recent end-to-end and compositional baselines. The code is available at this https URL.
zh
[CV-135] Shape from Semantics: 3D Shape Generation from Multi-View Semantics
【速读】:该论文旨在解决从语义信息创建与给定语义相匹配的三维模型的问题。传统方法通常依赖于视觉输入(如RGB图像或深度图)来重建几何形状,这限制了创造性探索。论文的关键解决方案在于采用语义作为输入,并利用多语义评分蒸馏采样(Multi-Semantics Score Distillation Sampling, SDS)从二维扩散模型中提取三维几何和外观信息,确保初始形状与语义输入一致。此外,通过图像恢复和视频生成模型添加细节,并引入神经符号距离场(Neural Signed Distance Field, SDF)表示法以实现详细的形状重建。这一系列方法显著扩展了设计空间,使得能够创建具有复杂细节、良好结构、连贯纹理以及平滑过渡的三维模型。
链接: https://arxiv.org/abs/2502.00360
作者: Liangchen Li,Caoliwen Wang,Yuqi Zhou,Bailin Deng,Juyong Zhang
机构: University of Science and Technology of China (中国科学技术大学); Cardiff University (卡迪夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL
点击查看摘要
Abstract:We propose "Shape from Semantics", which is able to create 3D models whose geometry and appearance match given semantics when observed from different views. Traditional "Shape from X" tasks usually use visual input (e.g., RGB images or depth maps) to reconstruct geometry, imposing strict constraints that limit creative explorations. As applications, works like Shadow Art and Wire Art often struggle to grasp the embedded semantics of their design through direct observation and rely heavily on specific setups for proper display. To address these limitations, our framework uses semantics as input, greatly expanding the design space to create objects that integrate multiple semantic elements and are easily discernible by observers. Considering that this task requires a rich imagination, we adopt various generative models and structure-to-detail pipelines. Specifically, we adopt multi-semantics Score Distillation Sampling (SDS) to distill 3D geometry and appearance from 2D diffusion models, ensuring that the initial shape is consistent with the semantic input. We then use image restoration and video generation models to add more details as supervision. Finally, we introduce neural signed distance field (SDF) representation to achieve detailed shape reconstruction. Our framework generates meshes with complex details, well-structured geometry, coherent textures, and smooth transitions, resulting in visually appealing and eye-catching designs. Project page: this https URL
zh
[CV-136] Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering
【速读】:该论文旨在解决3D场景问答(3D SQA)领域内统一分析与比较的挑战,特别是在快速发展的大型多模态建模背景下。论文的关键在于系统性地综述现有的3D SQA数据集、方法论以及评估指标,并强调在数据集标准化、多模态融合及任务设计方面的关键挑战与未来机遇。
链接: https://arxiv.org/abs/2502.00342
作者: Zechuan Li,Hongshan Yu,Yihao Ding,Yan Li,Yong He,Naveed Akhtar
机构: Hunan University (湖南大学); The University of Melbourne (墨尔本大学); The University of Sydney (悉尼大学); Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress
点击查看摘要
Abstract:3D Scene Question Answering (3D SQA) represents an interdisciplinary task that integrates 3D visual perception and natural language processing, empowering intelligent agents to comprehend and interact with complex 3D environments. Recent advances in large multimodal modelling have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA. However, this rapid progress introduces challenges, particularly in achieving unified analysis and comparison across datasets and baselines. This paper presents the first comprehensive survey of 3D SQA, systematically reviewing datasets, methodologies, and evaluation metrics while highlighting critical challenges and future opportunities in dataset standardization, multimodal fusion, and task design.
zh
[CV-137] BiMaCoSR: Binary One-Step Diffusion Model Leveraging Flexible Matrix Compression for Real Super-Resolution
【速读】:该论文旨在解决基于扩散模型(Diffusion Models, DM)的超分辨率方法在资源受限的边缘设备上部署困难的问题。关键解决方案在于提出BiMaCoSR方法,结合了二值化(binarization)和单步蒸馏(one-step distillation),以实现极致的压缩和加速。为了防止二值化导致的模型性能崩溃,文中引入了稀疏矩阵分支(Sparse Matrix Branch, SMB)和低秩矩阵分支(Low Rank Matrix Branch, LRM),这两个辅助分支传递全精度信息但方式不同,从而确保了压缩和加速的同时保持了模型性能。
链接: https://arxiv.org/abs/2502.00333
作者: Kai Liu,Kaicheng Yang,Zheng Chen,Zhiteng Li,Yong Guo,Wenbo Li,Linghe Kong,Yulun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures. The code and models will be available at this https URL
点击查看摘要
Abstract:While super-resolution (SR) methods based on diffusion models (DM) have demonstrated inspiring performance, their deployment is impeded due to the heavy request of memory and computation. Recent researchers apply two kinds of methods to compress or fasten the DM. One is to compress the DM into 1-bit, aka binarization, alleviating the storage and computation pressure. The other distills the multi-step DM into only one step, significantly speeding up inference process. Nonetheless, it remains impossible to deploy DM to resource-limited edge devices. To address this problem, we propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration. To prevent the catastrophic collapse of the model caused by binarization, we proposed sparse matrix branch (SMB) and low rank matrix branch (LRMB). Both auxiliary branches pass the full-precision (FP) information but in different ways. SMB absorbs the extreme values and its output is high rank, carrying abundant FP information. Whereas, the design of LRMB is inspired by LoRA and is initialized with the top r SVD components, outputting low rank representation. The computation and storage overhead of our proposed branches can be safely ignored. Comprehensive comparison experiments are conducted to exhibit BiMaCoSR outperforms current state-of-the-art binarization methods and gains competitive performance compared with FP one-step model. BiMaCoSR achieves a 23.8x compression ratio and a 27.4x speedup ratio compared to FP counterpart. Our code and model are available at this https URL.
zh
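"1-bit 主干 + 低秩辅助分支"的思想可以在一个随机权重矩阵上直接验证:对二值化残差取 top-r SVD 作为低秩分支(LoRA 式初始化),重构误差应严格小于纯二值化。以下为示意实现(缩放系数取法为假设),未包含论文中的稀疏矩阵分支:

```python
import numpy as np

def binarize_with_branches(W, rank=4):
    """Approximate a full-precision weight matrix by a 1-bit core plus a
    low-rank correction (top-r SVD of the residual, LoRA-style init) --
    a toy version of the binarized core + auxiliary-branch idea."""
    alpha = np.abs(W).mean()                  # simple per-matrix scale
    W_bin = alpha * np.sign(W)                # 1-bit core: alpha * {-1, +1}
    U, S, Vt = np.linalg.svd(W - W_bin, full_matrices=False)
    low_rank = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank]
    return W_bin, low_rank

rng = np.random.default_rng(5)
W = rng.normal(size=(32, 32))
W_bin, low_rank = binarize_with_branches(W, rank=8)
err_bin = np.linalg.norm(W - W_bin)                  # pure binarization error
err_branch = np.linalg.norm(W - (W_bin + low_rank))  # with low-rank branch
```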
[CV-138] MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model
【速读】:该论文旨在解决单目3D目标检测模型在深度估计不准确及依赖多阶段检测流程方面的问题。解决方案的关键在于采用基于Vision Transformer (ViT) 的基础模型作为主干网络,并结合Detection Transformer (DETR) 架构实现端到端的深度估计与目标检测。通过引入层次特征融合模块增强特征提取能力,并利用大规模数据训练的相对深度估计模型进行迁移学习以进一步提升深度估计精度。此外,解码器中的查询机制考虑参考点和二维边界框尺寸,从而提高识别性能。
链接: https://arxiv.org/abs/2502.00315
作者: Jihyeok Kim,Seongwoo Moon,Sungwon Nah,David Hyunchul Shim
机构: School of Electrical Engineering, Korea Advanced Institute of Science and Technology(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures
点击查看摘要
Abstract:This paper proposes novel methods to enhance the performance of monocular 3D object detection models by leveraging the generalized feature extraction capabilities of a vision foundation model. Unlike traditional CNN-based approaches, which often suffer from inaccurate depth estimation and rely on multi-stage object detection pipelines, this study employs a Vision Transformer (ViT)-based foundation model as the backbone, which excels at capturing global features for depth estimation. It integrates a detection transformer (DETR) architecture to improve both depth estimation and object detection performance in a one-stage manner. Specifically, a hierarchical feature fusion block is introduced to extract richer visual features from the foundation model, further enhancing feature extraction capabilities. Depth estimation accuracy is further improved by incorporating a relative depth estimation model trained on large-scale data and fine-tuning it through transfer learning. Additionally, the use of queries in the transformer’s decoder, which consider reference points and the dimensions of 2D bounding boxes, enhances recognition performance. The proposed model outperforms recent state-of-the-art methods, as demonstrated through quantitative and qualitative evaluations on the KITTI 3D benchmark and a custom dataset collected from high-elevation racing environments. Code is available at this https URL.
zh
[CV-139] A Diffusion Model Translator for Efficient Image-to-Image Translation
【速读】:该论文旨在解决应用扩散模型(Diffusion Models)进行图像到图像翻译(Image-to-Image Translation, I2I)时的时间消耗问题。现有方法在每个去噪步骤中注入源图像信息以实现迭代优化,导致实施过程耗时。论文提出的关键解决方案是引入一个轻量级翻译器——扩散模型翻译器(DMT),仅在某些中间步骤转移分布至另一域,从而高效完成I2I任务。此外,作者提出了一种自动选择合适时间步长的实用策略,进一步提升性能。
链接: https://arxiv.org/abs/2502.00307
作者: Mengfei Xia,Yu Zhou,Ran Yi,Yong-Jin Liu,Wenping Wang
机构: MOE-Key Laboratory of Pervasive Computing, Department of Computer Science and Technology, Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); Department of Computer Science and Computer Engineering at Texas A&M University (德克萨斯A&M大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Applying diffusion models to image-to-image translation (I2I) has recently received increasing attention due to its practical applications. Previous attempts inject information from the source image into each denoising step for an iterative refinement, thus resulting in a time-consuming implementation. We propose an efficient method that equips a diffusion model with a lightweight translator, dubbed a Diffusion Model Translator (DMT), to accomplish I2I. Specifically, we first offer theoretical justification that in employing the pioneering DDPM work for the I2I task, it is both feasible and sufficient to transfer the distribution from one domain to another only at some intermediate step. We further observe that the translation performance highly depends on the chosen timestep for domain transfer, and therefore propose a practical strategy to automatically select an appropriate timestep for a given task. We evaluate our approach on a range of I2I applications, including image stylization, image colorization, segmentation to image, and sketch to image, to validate its efficacy and general utility. The comparisons show that our DMT surpasses existing methods in both quality and efficiency. Code will be made publicly available.
zh
[CV-140] K Nearest Neighbor-Guided Trajectory Similarity Learning
【速读】:该论文旨在解决轨迹相似性度量在时空数据挖掘应用中的准确性挑战,特别是在深度学习模型中由于轨迹粒度建模困难及训练数据中相似性信号利用不足所导致的问题。解决方案的关键在于提出了TSMini模型,该模型包含子视图建模机制以学习多粒度轨迹模式,并采用基于k近邻的损失函数指导模型不仅学习轨迹间的绝对相似值,还学习它们之间的相对相似排名。这些创新共同实现了高度准确的轨迹相似性近似。
链接: https://arxiv.org/abs/2502.00285
作者: Yanchuan Chang,Xu Cai,Christian S. Jensen,Jianzhong Qi
机构: The University of Melbourne (墨尔本大学); National University of Singapore (新加坡国立大学); Aalborg University (奥尔堡大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注:
点击查看摘要
Abstract:Trajectory similarity is fundamental to many spatio-temporal data mining applications. Recent studies propose deep learning models to approximate conventional trajectory similarity measures, exploiting their fast inference time once trained. Although efficient inference has been reported, challenges remain in similarity approximation accuracy due to difficulties in trajectory granularity modeling and in exploiting similarity signals in the training data. To fill this gap, we propose TSMini, a highly effective trajectory similarity model with a sub-view modeling mechanism capable of learning multi-granularity trajectory patterns and a k nearest neighbor-based loss that guides TSMini to learn not only absolute similarity values between trajectories but also their relative similarity ranks. Together, these two innovations enable highly accurate trajectory similarity approximation. Experiments show that TSMini can outperform the state-of-the-art models by 22% in accuracy on average when learning trajectory similarity measures.
zh
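TSMini 的 k 近邻损失同时约束相似度的绝对值与相对排序。下面给出一个玩具版本(hinge 排序项与两项的组合方式均为示意假设):当预测相似度矩阵与真值完全一致时损失为 0,加噪后损失上升:

```python
import numpy as np

def knn_rank_loss(pred_sim, true_sim, k=3, margin=0.0):
    """Toy kNN-guided objective: MSE on similarity values plus a hinge
    penalty whenever the predicted ordering of a query's k nearest
    neighbors (under the true measure) disagrees with the true ordering."""
    mse = float(((pred_sim - true_sim) ** 2).mean())
    rank_pen, n = 0.0, len(true_sim)
    for q in range(n):
        nbrs = np.argsort(-true_sim[q])[1:k + 1]   # k nearest, skipping self
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                i, j = nbrs[a], nbrs[b]            # true_sim[q,i] >= true_sim[q,j]
                rank_pen += max(0.0, margin + pred_sim[q, j] - pred_sim[q, i])
    return mse + rank_pen / n

rng = np.random.default_rng(6)
true = rng.random((5, 5)); true = (true + true.T) / 2   # toy similarity matrix
np.fill_diagonal(true, 1.0)
perfect = knn_rank_loss(true, true)
noisy = knn_rank_loss(np.clip(true + rng.normal(0, 0.3, true.shape), 0, 1), true)
```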
[CV-141] Simultaneous Estimation of Manipulation Skill and Hand Grasp Force from Forearm Ultrasound Images
【速读】:该论文旨在解决精确估计人体手部配置及施加力的问题,以提升遥操作和技能转移在机器人操作中的有效性。关键解决方案在于使用前臂超声数据同时估计操作技能和手部施力,通过深度学习模型实现了94.87%±10.16%的分类准确率和0.51±0.19牛顿的均方根误差(RMSE),从而证明了前臂超声技术在增强人机交互和复杂操作任务中的潜力。
链接: https://arxiv.org/abs/2502.00275
作者: Keshav Bimbraw,Srikar Nekkanti,Daniel B. Tiller II,Mihir Deshmukh,Berk Calli,Robert D. Howe,Haichong K. Zhang
机构: Inova Medical Group(英维奥医疗集团); Robotics Engineering, Worcester Polytechnic Institute (伍斯特理工学院机器人工程系), Worcester, MA, USA; Harvard Paulson School of Engineering and Applied Sciences(哈佛保罗森工程与应用科学学院), Cambridge, MA, USA
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注: 30 pages, 52 references, 10 figures, 8 tables and 2 supplementary videos. Currently under review
点击查看摘要
Abstract:Accurate estimation of human hand configuration and the forces they exert is critical for effective teleoperation and skill transfer in robotic manipulation. A deeper understanding of human interactions with objects can further enhance teleoperation performance. To address this need, researchers have explored methods to capture and translate human manipulation skills and applied forces to robotic systems. Among these, biosignal-based approaches, particularly those using forearm ultrasound data, have shown significant potential for estimating hand movements and finger forces. In this study, we present a method for simultaneously estimating manipulation skills and applied hand force using forearm ultrasound data. Data collected from seven participants were used to train deep learning models for classifying manipulation skills and estimating grasp force. Our models achieved an average classification accuracy of 94.87 percent plus or minus 10.16 percent for manipulation skills and an average root mean square error (RMSE) of 0.51 plus or minus 0.19 Newtons for force estimation, as evaluated using five-fold cross-validation. These results highlight the effectiveness of forearm ultrasound in advancing human-machine interfacing and robotic teleoperation for complex manipulation tasks. This work enables new and effective possibilities for human-robot skill transfer and tele-manipulation, bridging the gap between human dexterity and robotic control.
zh
[CV-142] MCM: Multi-layer Concept Map for Efficient Concept Learning from Masked Images
【速读】:该论文旨在解决在视觉任务中概念学习依赖全图方法而未充分探索掩码策略的问题。论文的关键在于提出了一种基于掩码图像的有效概念学习方法——多层概念图(Multi-layer Concept Map, MCM)。通过建立编码器和解码器层之间的关联,并利用重构任务的后向梯度更新概念标记(concept tokens),MCM 方法能够在不同粒度级别学习概念标记,从而实现掩码图像块的填补或引导重构结果以反映特定概念。这种方法显著减少了计算成本,并提升了概念预测性能。
链接: https://arxiv.org/abs/2502.00266
作者: Yuwei Sun,Lu Mi,Ippei Fujisawa,Ryota Kanai
机构: Araya Research; RIKEN AIP; Georgia Institute of Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Masking strategies commonly employed in natural language processing are still underexplored in vision tasks such as concept learning, where conventional methods typically rely on full images. However, using masked images diversifies perceptual inputs, potentially offering significant advantages in concept learning with large-scale Transformer models. To this end, we propose Multi-layer Concept Map (MCM), the first work to devise an efficient concept learning method based on masked images. In particular, we introduce an asymmetric concept learning architecture by establishing correlations between different encoder and decoder layers, updating concept tokens using backward gradients from reconstruction tasks. The learned concept tokens at various levels of granularity help either reconstruct the masked image patches by filling in gaps or guide the reconstruction results in a direction that reflects specific concepts. Moreover, we present both quantitative and qualitative results across a wide range of metrics, demonstrating that MCM significantly reduces computational costs by training on fewer than 75% of the total image patches while enhancing concept prediction performance. Additionally, editing specific concept tokens in the latent space enables targeted image generation from masked images, aligning both the visible contextual patches and the provided concepts. By further adjusting the testing time mask ratio, we could produce a range of reconstructions that blend the visible patches with the provided concepts, proportional to the chosen ratios.
zh
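MCM 在不足 75% 的图像块上训练,并允许在测试时调节掩码比例。下述草图仅示意"按比例随机选取掩码块索引"这一输入构造步骤(块数与比例为演示用假设,并非论文实现):

```python
import numpy as np

def mask_patches(n_patches, mask_ratio, rng):
    """Randomly choose which patch indices to mask; the rest stay visible."""
    n_mask = int(n_patches * mask_ratio)
    perm = rng.permutation(n_patches)
    return np.sort(perm[n_mask:]), perm[:n_mask]   # (visible_idx, masked_idx)

rng = np.random.default_rng(0)
visible, masked = mask_patches(16, 0.75, rng)
print(len(visible), len(masked))  # 4 visible patches, 12 masked
```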
[CV-143] Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion
【速读】:该论文旨在解决深度神经网络参数空间对称性在Transformer模型融合中的局限性问题。论文的关键在于引入旋转对称性(Rotation Symmetry),这是一种新的参数空间对称形式,通过在自注意力层中旋转参数矩阵来推广置换对称性(Permutation Symmetry)。不同于离散的置换对称性,旋转对称性在连续域中操作,显著扩展了Transformer模型的等效集。基于此特性,论文提出了一种理论上最优的参数匹配算法,作为插件模块以增强模型融合效果。实验结果表明,基于旋转对称性的匹配算法显著提升了模型融合性能。
链接: https://arxiv.org/abs/2502.00264
作者: Binchi Zhang,Zaiyi Zheng,Zhengzhang Chen,Jundong Li
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry by rotating parameter matrices in self-attention layers. Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers. Based on this property, we propose a theoretically optimal parameter matching algorithm as a plug-and-play module to enhance model fusion. We evaluate our approach using pre-trained transformers across diverse natural language and vision tasks. Experimental results demonstrate that our rotation symmetry-based matching algorithm substantially improves model fusion, highlighting the potential of parameter space symmetry to facilitate model fusion. Our code is available on this https URL.
zh
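旋转对称性的核心事实可以用几行 NumPy 直接验证:在自注意力中,将查询与键投影矩阵同时右乘同一正交矩阵 R,注意力得分保持不变。以下仅演示该不变性本身,并非论文的参数匹配算法:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))      # token embeddings
W_q = rng.normal(size=(d, d))    # query projection
W_k = rng.normal(size=(d, d))    # key projection

# Random orthogonal matrix R via QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
scores_rot = (X @ W_q @ R) @ (X @ W_k @ R).T / np.sqrt(d)

# R @ R.T = I cancels inside the product, so attention scores are identical
assert np.allclose(scores, scores_rot)
```

由于 R 可在连续域中任取,该等价集远大于离散的置换对称所刻画的集合。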
[CV-144] INSIGHT: Integration of Semantic and Visual Inputs for Generalized Hazard Tracking
【速读】:该论文旨在解决自动驾驶系统在处理不可预测的边缘情况(edge-case scenarios)时所面临的挑战,如对抗性行人行为、危险车辆操作和突发环境变化。当前端到端驾驶模型难以泛化到这些罕见事件,主要是由于传统检测和预测方法的局限性。为了解决这一问题,论文提出了一种名为INSIGHT(语义和视觉输入集成用于泛化风险追踪)的分层视觉-语言模型(VLM)框架。其关键是通过多模态数据融合整合语义和视觉表征,从而实现精确的情景解读和潜在危险的准确预测,并通过基于注意力机制和坐标回归技术优化空间风险定位,最终显著提升了危险预测的简便性和准确性。
链接: https://arxiv.org/abs/2502.00262
作者: Dianwei Chen,Zifan Zhang,Yuchen Liu,Xianfeng Terry Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Autonomous driving systems face significant challenges in handling unpredictable edge-case scenarios, such as adversarial pedestrian movements, dangerous vehicle maneuvers, and sudden environmental changes. Current end-to-end driving models struggle with generalization to these rare events due to limitations in traditional detection and prediction approaches. To address this, we propose INSIGHT (Integration of Semantic and Visual Inputs for Generalized Hazard Tracking), a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation. By using multimodal data fusion, our approach integrates semantic and visual representations, enabling precise interpretation of driving scenarios and accurate forecasting of potential dangers. Through supervised fine-tuning of VLMs, we optimize spatial hazard localization using attention-based mechanisms and coordinate regression techniques. Experimental results on the BDD100K dataset demonstrate a substantial improvement in hazard prediction straightforwardness and accuracy over existing models, achieving a notable increase in generalization performance. This advancement enhances the robustness and safety of autonomous driving systems, ensuring improved situational awareness and potential decision-making in complex real-world scenarios.
zh
[CV-145] Transformer-Based Vector Font Classification Using Different Font Formats: TrueType versus PostScript IJCNN2025
【速读】:该论文旨在研究不同字体表示格式对基于Transformer的矢量字体分类任务的影响。关键发现是:基于PostScript轮廓的字体表示在矢量字体分类任务中优于基于TrueType轮廓的表示。论文指出,信息聚合在基于Transformer的矢量图形深度学习中至关重要,这为未来选择合适的轮廓格式提供了有价值的指导。
链接: https://arxiv.org/abs/2502.00250
作者: Takumu Fujioka(1),Gouhei Tanaka(1 and 2) ((1) Nagoya Institute of Technology, (2) The University of Tokyo)
机构: Department of Computer Science, Nagoya Institute of Technology (名古屋工业大学); International Research Center for Neurointelligence, The University of Tokyo (东京大学神经智能国际研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures, 4 tables, Submitted to IJCNN 2025. Code available at this https URL
点击查看摘要
Abstract:Modern fonts adopt vector-based formats, which ensure scalability without loss of quality. While many deep learning studies on fonts focus on bitmap formats, deep learning for vector fonts remains underexplored. In studies involving deep learning for vector fonts, the choice of font representation has often been made conventionally. However, the font representation format is one of the factors that can influence the computational performance of machine learning models in font-related tasks. Here we show that font representations based on PostScript outlines outperform those based on TrueType outlines in Transformer-based vector font classification. TrueType outlines represent character shapes as sequences of points and their associated flags, whereas PostScript outlines represent them as sequences of commands. In previous research, PostScript outlines have been predominantly used when fonts are treated as part of vector graphics, while TrueType outlines are mainly employed when focusing on fonts alone. Whether to use PostScript or TrueType outlines has been mainly determined by file format specifications and precedent settings in previous studies, rather than performance considerations. To date, few studies have compared which outline format provides better embedding representations. Our findings suggest that information aggregation is crucial in Transformer-based deep learning for vector graphics, as in tokenization in language models and patch division in bitmap-based image recognition models. This insight provides valuable guidance for selecting outline formats in future research on vector graphics.
zh
[CV-146] Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms
【速读】:该论文旨在解决离散扩散模型在高维状态空间中的高效推理问题。现有方法主要分为精确模拟和近似方法(如τ-跳跃 (τ-leaping))。精确方法面临不可预测的推理时间和冗余函数评估的问题,而τ-跳跃方法仅具有一阶精度。论文的关键解决方案是提出了一种高阶数值推理方案的扩展,特别是θ-梯形法 (θ-trapezoidal),以实现更大的步长并减少误差,该方法在KL散度下证明具有二阶精度。实验结果表明,在同等计算约束下,所提方法在GPT-2级别的文本生成和ImageNet级别的图像生成任务中实现了更高质量的样本。
链接: https://arxiv.org/abs/2502.00234
作者: Yinuo Ren,Haoxuan Chen,Yuchen Zhu,Wei Guo,Yongxin Chen,Grant M. Rotskoff,Molei Tao,Lexing Ying
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
备注: 38 pages, 7 figures
点击查看摘要
Abstract:Discrete diffusion models have emerged as a powerful generative modeling framework for discrete data with successful applications spanning from text generation to image synthesis. However, their deployment faces challenges due to the high dimensionality of the state space, necessitating the development of efficient inference algorithms. Current inference approaches mainly fall into two categories: exact simulation and approximate methods such as \tau -leaping. While exact methods suffer from unpredictable inference time and redundant function evaluations, \tau -leaping is limited by its first-order accuracy. In this work, we advance the latter category by tailoring the first extension of high-order numerical inference schemes to discrete diffusion models, enabling larger step sizes while reducing error. We rigorously analyze the proposed schemes and establish the second-order accuracy of the \theta -trapezoidal method in KL divergence. Empirical evaluations on GPT-2 level text and ImageNet-level image generation tasks demonstrate that our method achieves superior sample quality compared to existing approaches under equivalent computational constraints.
zh
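作为直观类比,下面在一个简单 ODE 上比较一阶 Euler(对应 τ-跳跃的一阶精度)与二阶梯形(Heun)格式的误差。这只是连续 ODE 上的教科书式演示,并非论文针对离散扩散模型的 θ-梯形采样器:

```python
import numpy as np

def euler(f, y0, h, n_steps):
    """First-order explicit Euler scheme."""
    y = y0
    for _ in range(n_steps):
        y = y + h * f(y)
    return y

def trapezoidal(f, y0, h, n_steps):
    """Second-order trapezoidal (Heun) scheme: Euler predictor + average-slope corrector."""
    y = y0
    for _ in range(n_steps):
        y_pred = y + h * f(y)
        y = y + 0.5 * h * (f(y) + f(y_pred))
    return y

f = lambda y: -y                  # dy/dt = -y, exact solution e^{-t}
exact = np.exp(-1.0)
err_euler = abs(euler(f, 1.0, 0.1, 10) - exact)
err_trap = abs(trapezoidal(f, 1.0, 0.1, 10) - exact)
assert err_trap < err_euler       # higher order => smaller error at the same step size
```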
[CV-147] A Hybrid Random Forest and CNN Framework for Tile-Wise Oil-Water Classification in Hyperspectral Images
【速读】:该论文旨在解决油水分类在高光谱图像(HSI)中的空间上下文保持难题。解决方案的关键在于提出了一种新颖的随机森林(Random Forest)与卷积神经网络(CNN)混合框架。首先,通过将图像划分为较小的非重叠瓦片来保留空间信息,并将其用于训练、验证和测试。尽管随机森林在逐像素分类中表现出色,但它无法充分利用空间关系。因此,进一步利用CNN处理随机森林生成的概率图,以增强其空间特征学习能力,从而提高整体性能。
链接: https://arxiv.org/abs/2502.00232
作者: Mehdi Nickzamir,Seyed Mohammad Sheikh Ahamdi Gandab
机构: Politecnico di Torino(都灵理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:A novel hybrid Random Forest and Convolutional Neural Network (CNN) framework is presented for oil-water classification in hyperspectral images (HSI). To address the challenge of preserving spatial context, the images were divided into smaller, non-overlapping tiles, which served as the basis for training, validation, and testing. Random Forest demonstrated strong performance in pixel-wise classification, outperforming models such as XGBoost, Attention-Based U-Net, and HybridSN. However, Random Forest loses spatial context, limiting its ability to fully exploit the spatial relationships in hyperspectral data. To improve performance, a CNN was trained on the probability maps generated by the Random Forest, leveraging the CNN’s capacity to incorporate spatial context. The hybrid approach achieved 7.6% improvement in recall (to 0.85), 2.4% improvement in F1 score (to 0.84), and 0.54% improvement in AUC (to 0.99) compared to the baseline. These results highlight the effectiveness of combining probabilistic outputs with spatial feature learning for context-aware analysis of hyperspectral images.
zh
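该流水线的第一步是把高光谱图像切分为不重叠瓦片以保留空间信息。一个最小的 NumPy 示意如下(图像尺寸、波段数与瓦片大小均为假设值):

```python
import numpy as np

def tile_image(img, tile):
    """Split an (H, W, C) hyperspectral cube into non-overlapping tiles,
    cropping any remainder so H and W are multiples of the tile size."""
    H, W, C = img.shape
    H, W = H - H % tile, W - W % tile
    img = img[:H, :W]
    tiles = img.reshape(H // tile, tile, W // tile, tile, C)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, tile, tile, C)

cube = np.zeros((130, 98, 25))   # hypothetical HSI cube with 25 spectral bands
tiles = tile_image(cube, 32)
print(tiles.shape)               # (12, 32, 32, 25)
```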
[CV-148] Fantastic Multi-Task Gradient Updates and How to Find Them In a Cone
【速读】:该论文旨在解决多任务学习(Multi-Task Learning, MTL)中竞争目标之间的平衡问题,主要挑战源于各个任务之间冲突的梯度。论文的关键解决方案是提出了一种名为ConicGrad的方法,该方法将MTL问题构造成一个带有角度约束的优化问题。通过动态调节梯度更新方向,使其限制在一个以总体目标参考梯度为中心的圆锥内,从而有效解决任务间梯度冲突,同时保持计算效率和高维参数空间的可扩展性。
链接: https://arxiv.org/abs/2502.00217
作者: Negar Hassanpour,Muhammad Kamran Janjua,Kunlin Zhang,Sepehr Lavasani,Xiaowen Zhang,Chunhua Zhou,Chao Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 7 figures, 5 tables
点击查看摘要
Abstract:Balancing competing objectives remains a fundamental challenge in multi-task learning (MTL), primarily due to conflicting gradients across individual tasks. A common solution relies on computing a dynamic gradient update vector that balances competing tasks as optimization progresses. Building on this idea, we propose ConicGrad, a principled, scalable, and robust MTL approach formulated as a constrained optimization problem. Our method introduces an angular constraint to dynamically regulate gradient update directions, confining them within a cone centered on the reference gradient of the overall objective. By balancing task-specific gradients without over-constraining their direction or magnitude, ConicGrad effectively resolves inter-task gradient conflicts. Moreover, our framework ensures computational efficiency and scalability to high-dimensional parameter spaces. We conduct extensive experiments on standard supervised learning and reinforcement learning MTL benchmarks, and demonstrate that ConicGrad achieves state-of-the-art performance across diverse tasks.
zh
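"将梯度更新方向限制在以参考梯度为中心的圆锥内"可以有多种实现。下面给出一种直观草图:把更新向量分解为沿参考方向与正交方向两个分量,当夹角超出圆锥半角时压缩正交分量(仅为示意,并非论文的约束优化求解器):

```python
import numpy as np

def confine_to_cone(g, g_ref, max_angle_deg=45.0):
    """Shrink the component of g orthogonal to g_ref until the update lies
    inside a cone of the given half-angle around g_ref.
    (An illustrative construction, not the paper's exact solver.)"""
    u = g_ref / np.linalg.norm(g_ref)
    dot = float(g @ u)
    if dot <= 0:                      # pointing away from the reference
        return np.linalg.norm(g) * u  # fall back to the reference direction
    par = dot * u                     # parallel component
    orth = g - par                    # orthogonal component
    limit = dot * np.tan(np.radians(max_angle_deg))
    n = np.linalg.norm(orth)
    if n > limit:                     # outside the cone: clip the angle
        orth *= limit / n
    return par + orth

# Outside the 45-degree cone: the orthogonal part is shrunk onto the cone boundary
print(confine_to_cone(np.array([1.0, 2.0]), np.array([1.0, 0.0])))  # [1. 1.]
```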
[CV-149] EcoWeedNet: A Lightweight and Automated Weed Detection Method for Sustainable Next-Generation Agricultural Consumer Electronics
【速读】:该论文旨在解决可持续精准农业中的杂草检测问题,传统方法如化学除草剂和人工除草存在环境损害和健康风险。论文的关键解决方案是提出了一种名为EcoWeedNet的新模型,该模型在不显著增加计算复杂度的前提下提升了杂草检测性能,并且具有轻量级特性,适合部署在地面农业消费电子设备和机器人上。实验结果表明,EcoWeedNet在保持高性能的同时,参数量仅为YOLOv4的大约4.21%,浮点运算次数(GFLOPs)仅为6.59%。
链接: https://arxiv.org/abs/2502.00205
作者: Omar H. Khater,Abdul Jabbar Siddiqui,M. Shamim Hossain
机构: King Fahd University of Petroleum and Minerals (KFUPM); SDAIA-KFUPM Joint Research Center on Artificial Intelligence; IRC for Intelligent Secure Systems; King Saud University (KSU)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Sustainable agriculture plays a crucial role in ensuring world food security for consumers. A critical challenge faced by sustainable precision agriculture is weed growth, as weeds share essential resources with the crops, such as water, soil nutrients, and sunlight, which notably affect crop yields. The traditional methods employed to combat weeds include the usage of chemical herbicides and manual weed removal methods. However, these could damage the environment and pose health hazards. The adoption of automated computer vision technologies and ground agricultural consumer electronic vehicles in precision agriculture offers sustainable, low-carbon solutions. However, prior works suffer from issues such as low accuracy and precision and high computational expense. This work proposes EcoWeedNet, a novel model with enhanced weed detection performance without adding significant computational complexity, aligning with the goals of low-carbon agricultural practices. Additionally, our model is lightweight and optimal for deployment on ground-based consumer electronic agricultural vehicles and robots. The effectiveness of the proposed model is demonstrated through comprehensive experiments on the CottonWeedDet12 benchmark dataset reflecting real-world scenarios. EcoWeedNet achieves performance close to that of large models yet with much fewer parameters. (approximately 4.21% of the parameters and 6.59% of the GFLOPs of YOLOv4). This work contributes effectively to the development of automated weed detection methods for next-generation agricultural consumer electronics featuring lower energy consumption and lower carbon footprint. This work paves the way forward for sustainable agricultural consumer technologies.
zh
[CV-150] Evaluating Deep Human-in-the-Loop Optimization for Retinal Implants Using Sighted Participants
【速读】:该论文旨在评估 Human-in-the-loop optimization (HILO) 方法在个性化视觉假体中的应用效果,特别是在真实条件下的优化刺激策略能力。解决方案的关键在于通过迭代反馈机制,利用受试者的选择来逐步优化深层刺激编码器(Deep Stimulus Encoder, DSE),从而生成更优的刺激参数,以提升视觉假体的性能。实验结果显示,HILO生成的刺激在所有测试条件下均优于基准方法,证明了该方法的有效性。
链接: https://arxiv.org/abs/2502.00177
作者: Eirini Schoinas,Adyah Rastogi,Anissa Carter,Jacob Granley,Michael Beyeler
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Human-in-the-loop optimization (HILO) is a promising approach for personalizing visual prostheses by iteratively refining stimulus parameters based on user feedback. Previous work demonstrated HILO’s efficacy in simulation, but its performance with human participants remains untested. Here we evaluate HILO using sighted participants viewing simulated prosthetic vision to assess its ability to optimize stimulation strategies under realistic conditions. Participants selected between phosphenes generated by competing encoders to iteratively refine a deep stimulus encoder (DSE). We tested HILO in three conditions: standard optimization, threshold misspecifications, and out-of-distribution parameter sampling. Participants consistently preferred HILO-generated stimuli over both a naïve encoder and the DSE alone, with log odds favoring HILO across all conditions. We also observed key differences between human and simulated decision-making, highlighting the importance of validating optimization strategies with human participants. These findings support HILO as a viable approach for adapting visual prostheses to individuals.
zh
[CV-151] Lifting by Gaussians: A Simple Fast and Flexible Method for 3D Instance Segmentation WACV2025
【速读】:该论文旨在解决开放世界下3D高斯辐射场(3D Gaussian Splatted Radiance Fields, 3DGS)的实例分割问题。解决方案的关键在于提出了一种名为Lifting By Gaussians (LBG) 的新方法,该方法直接将2D分割掩模从SAM(或FastSAM等)以及CLIP和DINOv2特征融合到3DGS或其他类似的高斯辐射场中,无需针对每个场景进行训练,从而实现高效且灵活的3D语义分割。
链接: https://arxiv.org/abs/2502.00173
作者: Rohan Chacko,Nicolai Haeni,Eldar Khaliullin,Lin Sun,Douglas Lee
机构: Magic Leap Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV 2025
点击查看摘要
Abstract:We introduce Lifting By Gaussians (LBG), a novel approach for open-world instance segmentation of 3D Gaussian Splatted Radiance Fields (3DGS). Recently, 3DGS Fields have emerged as a highly efficient and explicit alternative to Neural Field-based methods for high-quality Novel View Synthesis. Our 3D instance segmentation method directly lifts 2D segmentation masks from SAM (alternately FastSAM, etc.), together with features from CLIP and DINOv2, directly fusing them onto 3DGS (or similar Gaussian radiance fields such as 2DGS). Unlike previous approaches, LBG requires no per-scene training, allowing it to operate seamlessly on any existing 3DGS reconstruction. Our approach is not only an order of magnitude faster and simpler than existing approaches; it is also highly modular, enabling 3D semantic segmentation of existing 3DGS fields without requiring a specific parametrization of the 3D Gaussians. Furthermore, our technique achieves superior semantic segmentation for 2D semantic novel view synthesis and 3D asset extraction results while maintaining flexibility and efficiency. We further introduce a novel approach to evaluate individually segmented 3D assets from 3D radiance field segmentation methods.
zh
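"将2D分割掩模提升到3D高斯"最朴素的做法,是把每个高斯投影中心所落掩模像素的标签赋给该高斯。真实流水线通常还会按每像素贡献加权,这里仅给出概念性草图(坐标与掩模均为构造的示例数据):

```python
import numpy as np

def lift_mask_to_gaussians(centers_2d, mask):
    """Assign each splatted Gaussian the label of the 2D segmentation mask
    pixel its projected center lands on; -1 marks off-image Gaussians."""
    H, W = mask.shape
    xy = np.round(centers_2d).astype(int)
    valid = (xy[:, 0] >= 0) & (xy[:, 0] < W) & (xy[:, 1] >= 0) & (xy[:, 1] < H)
    labels = np.full(len(centers_2d), -1)
    labels[valid] = mask[xy[valid, 1], xy[valid, 0]]   # mask indexed as [row=y, col=x]
    return labels

mask = np.zeros((4, 4), dtype=int)
mask[2:, 2:] = 1                                       # instance "1" in the corner
centers = np.array([[3.0, 3.0], [0.0, 0.0], [10.0, 10.0]])
print(lift_mask_to_gaussians(centers, mask))           # [ 1  0 -1]
```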
[CV-152] ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition ICLR2025
【速读】:该论文旨在解决动作识别模型中的背景偏见(Background Bias)和前景偏见(Foreground Bias),这些问题可能导致不公平的决策结果。论文的关键解决方案是提出了一种名为ALBAR的新型对抗训练方法,该方法无需专门的偏差属性知识即可缓解这两种偏差。ALBAR通过应用对抗交叉熵损失和熵最大化损失来使静态片段的类别概率均匀分布,并引入梯度惩罚损失以正则化去偏差过程。这种方法在HMDB51数据集上显著提升了综合去偏差性能,超过了现有方法超过12%。此外,论文还发现了UCF101协议中存在的背景泄漏问题,并提出了更精细的演员分割边界以改进偏差评估。
链接: https://arxiv.org/abs/2502.00156
作者: Joseph Fioresi,Ishan Rajendrakumar Dave,Mubarak Shah
机构: Center for Research in Computer Vision (计算机视觉研究中心), University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted to ICLR 2025
点击查看摘要
Abstract:Bias in machine learning models can lead to unfair decision making, and while it has been well-studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance), which can be detrimental to real-life applications such as autonomous vehicles or assisted living monitoring. While prior approaches have mainly focused on mitigating background bias using specialized augmentations, we thoroughly study both biases. We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. Our framework applies an adversarial cross-entropy loss to the sampled static clip (where all the frames are the same) and aims to make its class probabilities uniform using a proposed entropy maximization loss. Additionally, we introduce a gradient penalty loss for regularization against the debiasing process. We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% on HMDB51. Furthermore, we identify an issue of background leakage in the existing UCF101 protocol for bias evaluation which provides a shortcut to predict actions and does not provide an accurate measure of the debiasing capability of a model. We address this issue by proposing more fine-grained segmentation boundaries for the actor, where our method also outperforms existing approaches. Project Page: this https URL
zh
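ALBAR 对静态片段施加的熵最大化损失,可写成"最小化 softmax 分布的负熵":分布越接近均匀,损失越小。下面是一个自包含的 NumPy 草图(与论文代码无关,仅示意该损失项的形状):

```python
import numpy as np

def entropy_max_loss(logits):
    """Negative entropy of softmax(logits). Minimizing it pushes the class
    distribution toward uniform, discouraging predictions from static cues."""
    z = logits - logits.max()          # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return float((p * np.log(p + 1e-12)).sum())

uniform_loss = entropy_max_loss(np.zeros(4))               # uniform -> lowest loss
peaked_loss = entropy_max_loss(np.array([10.0, 0.0, 0.0, 0.0]))
assert uniform_loss < peaked_loss
```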
[CV-153] Exploring Transfer Learning for Deep Learning Polyp Detection in Colonoscopy Images Using YOLOv8
【速读】:该论文旨在解决深度学习模型在有限训练数据下学习特定领域应用的挑战。解决方案的关键在于通过迁移学习技术,利用相关数据集上的预训练知识,加快并优化新任务(息肉检测)的学习过程。研究发现,在与息肉特征相关的数据集上预训练的模型显著优于从零开始训练的模型,强调了在具有共享领域特定特征的数据集上进行预训练的重要性。
链接: https://arxiv.org/abs/2502.00133
作者: Fabian Vazquez,Jose Angel Nuñez,Xiaoyan Fu,Pengfei Gu,Bin Fu
机构: University of Texas Rio Grande Valley(德克萨斯大学里奥格兰德河谷分校); The Second Affiliated Hospital of Fujian University of Traditional Chinese Medicine(福建中医药大学第二附属医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 6 tables, SPIE conference
点击查看摘要
Abstract:Deep learning methods have demonstrated strong performance in object detection tasks; however, their ability to learn domain-specific applications with limited training data remains a significant challenge. Transfer learning techniques address this issue by leveraging knowledge from pre-training on related datasets, enabling faster and more efficient learning for new tasks. Finding the right dataset for pre-training can play a critical role in determining the success of transfer learning and overall model performance. In this paper, we investigate the impact of pre-training a YOLOv8n model on seven distinct datasets, evaluating their effectiveness when transferred to the task of polyp detection. We compare whether large, general-purpose datasets with diverse objects outperform niche datasets with characteristics similar to polyps. In addition, we assess the influence of the size of the dataset on the efficacy of transfer learning. Experiments on the polyp datasets show that models pre-trained on relevant datasets consistently outperform those trained from scratch, highlighting the benefit of pre-training on datasets with shared domain-specific features.
zh
[CV-154] ProtoSnap: Prototype Alignment for Cuneiform Signs ICLR2025
【速读】:该论文旨在解决通过自动化技术精确解析楔形文字内部复杂结构的问题。此前的方法大多将楔形文字类型视为类别标签,未能显式建模其高度变化的内部配置。论文的关键在于提出了一种无监督方法ProtoSnap,利用强大的生成模型和原型字体图像的外观与结构作为先验知识,通过深度图像特征来估计楔形文字的各种配置,并将基于骨架的模板拟合到拍摄的楔形文字图像上。这种方法能够实现结构一致性,显著提升了楔形文字识别的性能,特别是在罕见字符的识别方面。
链接: https://arxiv.org/abs/2502.00129
作者: Rachel Mikulinsky,Morris Alper,Shai Gordin,Enrique Jiménez,Yoram Cohen,Hadar Averbuch-Elor
机构: Tel Aviv University(特拉维夫大学); Ariel University(阿里尔大学); LMU(路德维希-马克西米利安大学); Cornell University(康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICLR 2025. Project page: this https URL
点击查看摘要
Abstract:The cuneiform writing system served as the medium for transmitting knowledge in the ancient Near East for a period of over three thousand years. Cuneiform signs have a complex internal structure which is the subject of expert paleographic analysis, as variations in sign shapes bear witness to historical developments and transmission of writing and culture over time. However, prior automated techniques mostly treat sign types as categorical and do not explicitly model their highly varied internal configurations. In this work, we present an unsupervised approach for recovering the fine-grained internal configuration of cuneiform signs by leveraging powerful generative models and the appearance and structure of prototype font images as priors. Our approach, ProtoSnap, enforces structural consistency on matches found with deep image features to estimate the diverse configurations of cuneiform characters, snapping a skeleton-based template to photographed cuneiform signs. We provide a new benchmark of expert annotations and evaluate our method on this task. Our evaluation shows that our approach succeeds in aligning prototype skeletons to a wide variety of cuneiform signs. Moreover, we show that conditioning on structures produced by our method allows for generating synthetic data with correct structural configurations, significantly boosting the performance of cuneiform sign recognition beyond existing techniques, in particular over rare signs. Our code, data, and trained models are available at the project page: this https URL
zh
[CV-155] A Direct Semi-Exhaustive Search Method for Robust Partial-to-Full Point Cloud Registration IROS2024
【速读】:该论文旨在解决点云配准问题,即寻找将两个给定点云对齐的刚体变换。论文的关键在于提出了一种直接优化点云配准问题的方法,无需对应关系。具体而言,作者提出了直接半穷举搜索(Direct Semi-Exhaustive Search, DSES)算法,通过迭代潜在的旋转矩阵,并高效计算与每个旋转相关的最大内点集平移。此方法利用现代GPU的并行性,从而在ModelNet40基准测试和实际机器人位姿估计任务中表现出色。
链接: https://arxiv.org/abs/2502.00115
作者: Richard Cheng,Chavdar Papozov,Dan Helmick,Mark Tjersland
机构: Toyota Research Institute (丰田研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: IROS 2024
点击查看摘要
Abstract:Point cloud registration refers to the problem of finding the rigid transformation that aligns two given point clouds, and is crucial for many applications in robotics and computer vision. The main insight of this paper is that we can directly optimize the point cloud registration problem without correspondences by utilizing an algorithmically simple, yet computationally complex, semi-exhaustive search approach that is very well-suited for parallelization on modern GPUs. Our proposed algorithm, Direct Semi-Exhaustive Search (DSES), iterates over potential rotation matrices and efficiently computes the inlier-maximizing translation associated with each rotation. It then computes the optimal rigid transformation based on any desired distance metric by directly computing the error associated with each transformation candidate \{R, t\}. By leveraging the parallelism of modern GPUs, DSES outperforms state-of-the-art methods for partial-to-full point cloud registration on the simulated ModelNet40 benchmark and demonstrates high performance and robustness for pose estimation on a real-world robotics problem (this https URL).
zh
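DSES"对每个候选旋转计算内点最大化的平移"的思路,可用一个朴素的 2D 暴力版本示意(真实方法面向 GPU 并行、效率远高于此;点数、角度网格与内点阈值均为演示用假设):

```python
import numpy as np

def dses_2d(src, dst, angles, eps=1e-3):
    """Brute-force 2D sketch: for each candidate rotation, every src->dst
    pairing proposes a translation; keep the (R, t) with the most inliers."""
    best_inliers, best_R, best_t = -1, None, None
    for a in angles:
        c, s = np.cos(a), np.sin(a)
        R = np.array([[c, -s], [s, c]])
        rotated = src @ R.T
        for t in (dst[:, None, :] - rotated[None, :, :]).reshape(-1, 2):
            d = np.linalg.norm((rotated + t)[:, None] - dst[None], axis=-1).min(axis=1)
            inliers = int((d < eps).sum())
            if inliers > best_inliers:
                best_inliers, best_R, best_t = inliers, R, t
    return best_inliers, best_R, best_t

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 2))
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
dst = src @ R_true.T + np.array([0.5, -0.2])        # known rigid transform
inliers, R_hat, t_hat = dses_2d(src, dst, np.radians([0.0, 15.0, 30.0, 45.0]))
assert inliers == 5                                  # all points recovered as inliers
```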
[CV-156] Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach
【速读】:该论文旨在解决手绘地图在机器人导航中因比例失真和地标缺失等不准确性所引发的挑战。解决方案的关键在于引入了一种新颖的手绘地图导航(HAM-Nav)架构,该架构利用预训练的视觉语言模型(VLMs)进行跨多样化环境、手绘风格及机器人形态的导航。HAM-Nav集成了选择性视觉关联提示(Selective Visual Association Prompting)方法以基于拓扑地图的位置估计和导航规划,并采用预测导航计划解析器(Predictive Navigation Plan Parser)来推断缺失地标,从而有效应对地图中的不准确性。
链接: https://arxiv.org/abs/2502.00114
作者: Aaron Hao Tan,Angus Fung,Haitong Wang,Goldie Nejat
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures
点击查看摘要
Abstract:Hand-drawn maps can be used to convey navigation instructions between humans and robots in a natural and efficient manner. However, these maps can often contain inaccuracies such as scale distortions and missing landmarks which present challenges for mobile robot navigation. This paper introduces a novel Hand-drawn Map Navigation (HAM-Nav) architecture that leverages pre-trained vision language models (VLMs) for robot navigation across diverse environments, hand-drawing styles, and robot embodiments, even in the presence of map inaccuracies. HAM-Nav integrates a unique Selective Visual Association Prompting approach for topological map-based position estimation and navigation planning as well as a Predictive Navigation Plan Parser to infer missing landmarks. Extensive experiments were conducted in photorealistic simulated environments, using both wheeled and legged robots, demonstrating the effectiveness of HAM-Nav in terms of navigation success rates and Success weighted by Path Length. Furthermore, a user study in real-world environments highlighted the practical utility of hand-drawn maps for robot navigation as well as successful navigation outcomes.
zh
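摘要中的 Success weighted by Path Length (SPL) 是具身导航的标准指标:SPL = (1/N)·Σ Sᵢ·lᵢ/max(pᵢ, lᵢ),其中 lᵢ 为最短路径长度、pᵢ 为实际路径长度、Sᵢ 为是否成功。其实现非常直接:

```python
def spl(successes, shortest_lengths, taken_lengths):
    """Success weighted by Path Length: mean over episodes of
    success_i * l_i / max(p_i, l_i)."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_lengths, taken_lengths)]
    return sum(terms) / len(terms)

# One successful episode on the optimal path, one failure:
print(spl([1, 0], [10.0, 10.0], [10.0, 5.0]))  # 0.5
```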
[CV-157] CerraData-4MM: A multimodal benchmark dataset on Cerrado for land use and land cover classification
【速读】:该论文旨在解决塞拉多地区(Cerrado)面临的土地利用和土地覆盖(LULC)映射挑战,特别是在类别不平衡和视觉上相似的类别方面的难题。解决方案的关键在于提出了CerraData-4MM数据集,该数据集结合了Sentinel-1合成孔径雷达(SAR)和Sentinel-2多光谱成像(MSI),具有10米的空间分辨率,并包含两个层次分类,分别有7类和14类。通过评估标准U-Net和更复杂的Vision Transformer (ViT)模型,论文展示了ViT在多模态场景中的优越性能,最高宏F1得分为57.60%,平均交并比(mIoU)为49.05%。
链接: https://arxiv.org/abs/2502.00083
作者: Mateus de Souza Miranda,Ronny Hänsch,Valdivino Alexandre de Santiago Júnior,Thales Sehn Körting,Erison Carlos dos Santos Monteiro
机构: Instituto Nacional de Pesquisas Espaciais (INPE)(国家空间研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 9 pages, 13 Figures, 3 tables
点击查看摘要
Abstract:The Cerrado faces increasing environmental pressures, necessitating accurate land use and land cover (LULC) mapping despite challenges such as class imbalance and visually similar categories. To address this, we present CerraData-4MM, a multimodal dataset combining Sentinel-1 Synthetic Aperture Radar (SAR) and Sentinel-2 MultiSpectral Imagery (MSI) with 10m spatial resolution. The dataset includes two hierarchical classification levels with 7 and 14 classes, respectively, focusing on the diverse Bico do Papagaio ecoregion. We highlight CerraData-4MM’s capacity to benchmark advanced semantic segmentation techniques by evaluating a standard U-Net and a more sophisticated Vision Transformer (ViT) model. The ViT achieves superior performance in multimodal scenarios, with the highest macro F1-score of 57.60% and a mean Intersection over Union (mIoU) of 49.05% at the first hierarchical level. Both models struggle with minority classes, particularly at the second hierarchical level, where U-Net’s performance drops to an F1-score of 18.16%. Class balancing improves representation for underrepresented classes but reduces overall accuracy, underscoring the trade-off in weighted training. CerraData-4MM offers a challenging benchmark for advancing deep learning models to handle class imbalance and multimodal data fusion. Code, trained models, and data are publicly available at this https URL.
zh
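摘要中报告的宏 F1(macro F1)对每一类分别计算 F1 后取等权平均,因此在类别不均衡时比总体准确率更能反映少数类表现。基于混淆矩阵的实现草图如下:

```python
import numpy as np

def macro_f1(conf):
    """Macro F1 from a confusion matrix (rows = ground truth, cols = prediction):
    compute per-class precision/recall/F1, then average with equal class weight."""
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)
    recall = tp / np.maximum(conf.sum(axis=1), 1)
    denom = np.where(precision + recall > 0, precision + recall, 1.0)
    f1 = 2 * precision * recall / denom
    return float(f1.mean())

print(macro_f1(np.diag([3, 4])))  # 1.0 for a perfect classifier
```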
[CV-158] Influence of color correction on pathology detection in Capsule Endoscopy
【速读】:该论文旨在评估色彩校正对无线胶囊内镜 (Wireless Capsule Endoscopy, WCE) 病理检测的影响。研究使用两个显著的目标检测模型(Retinanet 和 YOLOv5)在原始数据集及其两种不同色彩校正版本上进行实验。关键在于通过比较这些模型在原始数据与色彩校正数据上的表现,揭示色彩校正如何改变边界框大小及交并比,并导致某些病理类型的误报增加,但这些变化并未一致地改善性能指标如 F1 分数、交并比 (IoU) 和平均精度 (AP50)。
链接: https://arxiv.org/abs/2502.00076
作者: Bidossessi Emmanuel Agossou,Marius Pedersen,Kiran Raja,Anuja Vats,Pål Anders Floor
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Pathology detection in Wireless Capsule Endoscopy (WCE) using deep learning has been explored in the recent past. However, deep learning models can be influenced by the color quality of the dataset used to train them, impacting detection, segmentation and classification tasks. In this work, we evaluate the impact of color correction on pathology detection using two prominent object detection models: Retinanet and YOLOv5. We first generate two color corrected versions of a popular WCE dataset (i.e., SEE-AI dataset) using two different color correction functions. We then evaluate the performance of the Retinanet and YOLOv5 on the original and color corrected versions of the dataset. The results reveal that color correction makes the models generate larger bounding boxes and larger intersection areas with the ground truth annotations. Furthermore, color correction leads to an increased number of false positives for certain pathologies. However, these effects do not translate into a consistent improvement in performance metrics such as F1-scores, IoU, and AP50. The code is available at this https URL. Keywords: Wireless Capsule Endoscopy, Color correction, Retinanet, YOLOv5, Detection
zh
[CV-159] SpikingRTNH: Spiking Neural Network for 4D Radar Object Detection
【速读】:该论文旨在解决在自动驾驶系统中使用4D雷达进行3D物体检测时,处理高密度点云数据所导致的高能耗问题。解决方案的关键在于提出了一种名为SpikingRTNH的新型脉冲神经网络(SNN),通过采用泄漏积分与发射(LIF)脉冲神经元替代传统的ReLU激活函数,显著提高了能效。此外,引入了受人类认知过程启发的生物自上而下推理(BTI)机制,该机制从高密度到低密度顺序处理点云数据,从而有效利用噪声较低且更为重要的点进行检测。这些创新使得SpikingRTNH不仅实现了显著的能耗降低(相比传统人工神经网络ANN降低了78%),同时保持了可比的检测性能(AP 3D为51.1%,AP BEV为57.0%)。
链接: https://arxiv.org/abs/2502.00074
作者: Dong-Hee Paek,Seung-Hyun Kong
机构: CCS Graduate School of Mobility, KAIST (KAIST移动系); Daejeon 34051, Republic of Korea (韩国大田市)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: arxiv preprint
点击查看摘要
Abstract:Recently, 4D Radar has emerged as a crucial sensor for 3D object detection in autonomous vehicles, offering both stable perception in adverse weather and high-density point clouds for object shape recognition. However, processing such high-density data demands substantial computational resources and energy consumption. We propose SpikingRTNH, the first spiking neural network (SNN) for 3D object detection using 4D Radar data. By replacing conventional ReLU activation functions with leaky integrate-and-fire (LIF) spiking neurons, SpikingRTNH achieves significant energy efficiency gains. Furthermore, inspired by human cognitive processes, we introduce biological top-down inference (BTI), which processes point clouds sequentially from higher to lower densities. This approach effectively utilizes points with lower noise and higher importance for detection. Experiments on K-Radar dataset demonstrate that SpikingRTNH with BTI significantly reduces energy consumption by 78% while achieving comparable detection performance to its ANN counterpart (51.1% AP 3D, 57.0% AP BEV). These results establish the viability of SNNs for energy-efficient 4D Radar-based object detection in autonomous driving systems. All codes are available at this https URL.
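下面给出一个用 NumPy 实现的泄漏积分与发射(LIF)神经元的最小示意,说明用 LIF 替代 ReLU 后的脉冲机制(假设性实现,并非 SpikingRTNH 的官方代码;阈值与泄漏系数均为示例取值):

```python
import numpy as np

def lif_forward(inputs, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire neuron over discrete timesteps:
    leaky accumulation of membrane potential, a binary spike when
    the threshold is crossed, then a hard reset."""
    v = np.zeros_like(inputs[0], dtype=float)
    spikes = []
    for x in inputs:              # one input array per timestep
        v = leak * v + x          # leaky integration (replaces ReLU activation)
        s = (v >= threshold).astype(float)
        spikes.append(s)
        v = v * (1.0 - s)         # reset membrane where a spike fired
    return np.stack(spikes)

# A constant sub-threshold input of 0.6 accumulates and fires every other step
out = lif_forward([np.array([0.6])] * 4)
```

能耗优势来自脉冲输出的稀疏二值性:只有发放脉冲的神经元才会触发下游计算。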
zh
[CV-160] A two-stage dual-task learning strategy for early prediction of pathological complete response to neoadjuvant chemotherapy for breast cancer using dynamic contrast-enhanced magnetic resonance images
【速读】:该论文旨在解决早期预测乳腺癌患者病理完全缓解(Pathological Complete Response, pCR)的问题。为提高新辅助化疗早期阶段的预测准确性,研究提出了一种两阶段双任务学习策略。解决方案的关键在于利用动态对比增强磁共振成像(Dynamic Contrast-Enhanced Magnetic Resonance Imaging, DCE-MRI)在治疗前(T0)、治疗3周后(T1)以及治疗12周后(T2)的图像数据,通过训练卷积长短期记忆网络(Convolutional Long Short-Term Memory Network, ConvLSTM)提取T2阶段的潜在空间图像特征,并进一步采用双任务网络同时预测pCR及T2阶段的图像特征,从而实现基于T0和T1阶段图像的早期预测,而无需使用T2阶段的图像数据。
链接: https://arxiv.org/abs/2502.00051
作者: Bowen Jing(1),Jing Wang(1) ((1) Department of Radiation Oncology, University of Texas Southwestern Medical Center)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
点击查看摘要
Abstract:Rationale and Objectives: Early prediction of pathological complete response (pCR) can facilitate personalized treatment for breast cancer patients. To improve prediction accuracy at the early time point of neoadjuvant chemotherapy, we proposed a two-stage dual-task learning strategy to train a deep neural network for early prediction of pCR using early-treatment magnetic resonance images. Methods: We developed and validated the two-stage dual-task learning strategy using the dataset from the national-wide, multi-institutional I-SPY2 clinical trial, which included dynamic contrast-enhanced magnetic resonance images acquired at three time points: pretreatment (T0), after 3 weeks (T1), and after 12 weeks of treatment (T2). First, we trained a convolutional long short-term memory network to predict pCR and extract the latent space image features at T2. At the second stage, we trained a dual-task network to simultaneously predict pCR and the image features at T2 using images from T0 and T1. This allowed us to predict pCR earlier without using images from T2. Results: The conventional single-stage single-task strategy gave an area under the receiver operating characteristic curve (AUROC) of 0.799 for pCR prediction using all the data at time points T0 and T1. By using the proposed two-stage dual-task learning strategy, the AUROC was improved to 0.820. Conclusions: The proposed two-stage dual-task learning strategy can improve model performance significantly (p=0.0025) for predicting pCR at the early stage (3rd week) of neoadjuvant chemotherapy. The early prediction model can potentially help physicians to intervene early and develop personalized plans at the early stage of chemotherapy.
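第二阶段的"双任务"目标可以理解为:在预测 pCR 的交叉熵损失之外,再加上一个与第一阶段 T2 潜在特征对齐的回归项。以下为该思路的示意性损失函数(假设性实现,权重 alpha 与特征维度均为示例):

```python
import numpy as np

def dual_task_loss(pcr_logit, pcr_label, feat_pred, feat_target, alpha=0.5):
    """Binary cross-entropy for pCR prediction plus an L2 term that
    matches the stage-1 latent features extracted at T2."""
    p = 1.0 / (1.0 + np.exp(-pcr_logit))            # sigmoid
    bce = -(pcr_label * np.log(p + 1e-12)
            + (1 - pcr_label) * np.log(1 - p + 1e-12))
    l2 = np.mean((feat_pred - feat_target) ** 2)    # feature-matching auxiliary task
    return float(bce + alpha * l2)

# With a neutral logit and perfectly matched features, only the BCE term remains
loss = dual_task_loss(0.0, 1, np.zeros(8), np.zeros(8))
```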
zh
[CV-161] mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
【速读】:该论文旨在解决多语言音频-视觉语音识别(AVSR)数据集规模有限及模型训练难度大的问题。关键在于提出了一种名为mWhisper-Flamingo的模型,它结合了预训练的音频模型(Whisper)和视频模型(AV-HuBERT)。为了实现更好的多模态整合并提升噪声环境下的多语言性能,引入了解码器模态dropout技术,使得模型能够在配对的音频-视觉输入以及单独的音频或视觉输入上进行训练。
链接: https://arxiv.org/abs/2502.01547
作者: Andrew Rouditchenko,Saurabhchand Bhati,Samuel Thomas,Hilde Kuehne,Rogerio Feris,James Glass
机构: MIT(麻省理工学院), USA; MIT-IBM Watson AI Lab(麻省理工学院-IBM Watson人工智能实验室), USA; University of Tuebingen(图宾根大学), DE
类目: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:
点击查看摘要
Abstract:Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
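文中的"解码器模态 dropout"可以概括为:训练时随机只保留音频、只保留视频、或保留成对输入。以下为该采样逻辑的示意(假设性实现,各分支概率为示例取值):

```python
import random

def modality_dropout(audio, video, p_audio=0.25, p_video=0.25, rng=random):
    """Randomly drop one modality during training so the decoder learns
    from paired audio-visual inputs as well as single-modality inputs."""
    r = rng.random()
    if r < p_audio:
        return audio, None            # audio-only branch
    if r < p_audio + p_video:
        return None, video            # visual-only branch
    return audio, video               # paired audio-visual branch

rng = random.Random(0)
draws = [modality_dropout("a", "v", rng=rng) for _ in range(200)]
```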
zh
[CV-162] Assessing the use of Diffusion models for motion artifact correction in brain MRI
【速读】:该论文旨在解决磁共振成像(MRI)中因患者运动导致的运动伪影问题。这些伪影会降低图像的诊断价值。论文的关键解决方案是评估扩散模型在修正2D脑部MRI扫描中的运动伪影方面的应用。通过与基于U-Net的监督学习方法进行对比,研究发现扩散模型能够产生准确预测或有害幻觉,这取决于数据异质性和输入的采集平面。
链接: https://arxiv.org/abs/2502.01418
作者: Paolo Angella,Vito Paolo Pastore,Matteo Santacesaria
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注: Accepted at IEEE International Symposium for Biomedical Imaging (ISBI) 2025
点击查看摘要
Abstract:Magnetic Resonance Imaging generally requires long exposure times, while being sensitive to patient motion, resulting in artifacts in the acquired images, which may hinder their diagnostic relevance. Despite research efforts to decrease the acquisition time, and designing efficient acquisition sequences, motion artifacts are still a persistent problem, pushing toward the need for the development of automatic motion artifact correction techniques. Recently, diffusion models have been proposed as a solution for the task at hand. While diffusion models can produce high-quality reconstructions, they are also susceptible to hallucination, which poses risks in diagnostic applications. In this study, we critically evaluate the use of diffusion models for correcting motion artifacts in 2D brain MRI scans. Using a popular benchmark dataset, we compare a diffusion model-based approach with state-of-the-art methods consisting of Unets trained in a supervised fashion on motion-affected images to reconstruct ground truth motion-free images. Our findings reveal mixed results: diffusion models can produce accurate predictions or generate harmful hallucinations in this context, depending on data heterogeneity and the acquisition planes considered as input.
zh
[CV-163] Diffusion at Absolute Zero: Langevin Sampling Using Successive Moreau Envelopes
【速读】:本文旨在解决从形式为 \pi(x)\propto\exp(-U(x)) 的吉布斯分布(Gibbs distribution)中采样的问题,其中 U(x) 为势函数(potential)。论文提出的关键解决方案是考虑目标密度的一列近似 (\pi^t_k)_k:当 k 较小时,\pi^t_k 近似于 \pi;当 k 较大时,\pi^t_k 具有更好的采样性质。该序列通过用势函数 U 的莫罗包络(Moreau envelopes)替换其部分项获得。采样采用类似退火朗之万动力学(annealed Langevin dynamics)的程序,即按 k 递减的顺序依次从 \pi^t_k 中采样,从而有效地将样本从一个简单的初始密度引导至复杂的目标密度。理论分析和实验结果均表明,该方法提高了收敛速度,并适用于多峰密度 \pi。
链接: https://arxiv.org/abs/2502.01358
作者: Andreas Habring,Alexander Falk,Thomas Pock
机构: Graz University of Technology(格拉茨技术大学)
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:
点击查看摘要
Abstract:In this article we propose a novel method for sampling from Gibbs distributions of the form \pi(x)\propto\exp(-U(x)) with a potential U(x) . In particular, inspired by diffusion models we propose to consider a sequence (\pi^t_k)_k of approximations of the target density, for which \pi^t_k\approx \pi for k small and, on the other hand, \pi^t_k exhibits favorable properties for sampling for k large. This sequence is obtained by replacing parts of the potential U by its Moreau envelopes. Sampling is performed in an Annealed Langevin type procedure, that is, sequentially sampling from \pi^t_k for decreasing k , effectively guiding the samples from a simple starting density to the more complex target. In addition to a theoretical analysis we show experimental results supporting the efficacy of the method in terms of increased convergence speed and applicability to multi-modal densities \pi .
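以一维势函数 U(x)=|x| 为例:其 Moreau 包络的梯度可写为 (x - prox_{tU}(x))/t(prox 为软阈值算子),随后在逐渐减小的平滑参数上做朗之万采样。下面是该退火式采样思路的简化示意(假设性实现,步长与退火计划均为示例取值,并非论文算法本身):

```python
import numpy as np

def moreau_grad_abs(x, t):
    """Gradient of the Moreau envelope of U(x) = |x| with parameter t:
    (x - prox_{tU}(x)) / t, where prox is soft-thresholding."""
    prox = np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
    return (x - prox) / t

def annealed_langevin(n_steps=500, step=1e-2, ts=(4.0, 2.0, 1.0, 0.5), seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=5000) * 3.0            # simple, wide starting density
    for t in ts:                               # anneal the smoothing downward
        for _ in range(n_steps):
            x = (x - step * moreau_grad_abs(x, t)
                 + np.sqrt(2 * step) * rng.normal(size=x.shape))
    return x

samples = annealed_langevin()   # approximate samples from pi(x) ~ exp(-|x|), smoothed
```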
zh
[CV-164] Deep generative computed perfusion-deficit mapping of ischaemic stroke
【速读】:该论文旨在解决利用急性缺血性卒中患者的计算机断层血管造影(CTA)灌注图来预测神经功能缺损的问题。关键在于采用深度生成推理方法,无需已知病变区域即可定位神经功能缺损的解剖基础,并揭示新的神经依赖关系。研究表明,这种基于急性CTA灌注图的方法在描述缺血性卒中的功能解剖关系方面具有高精度,且可能在临床和科学研究中发挥重要作用。
链接: https://arxiv.org/abs/2502.01334
作者: Chayanin Tangwiriyasakul,Pedro Borges,Guilherme Pombo,Stefano Moriconi,Michael S. Elmalem,Paul Wright,Yee-Haur Mah,Jane Rondina,Robert Gray,Sebastien Ourselin,Parashkev Nachev,M. Jorge Cardoso
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:
点击查看摘要
Abstract:Focal deficits in ischaemic stroke result from impaired perfusion downstream of a critical vascular occlusion. While parenchymal lesions are traditionally used to predict clinical deficits, the underlying pattern of disrupted perfusion provides information upstream of the lesion, potentially yielding earlier predictive and localizing signals. Such perfusion maps can be derived from routine CT angiography (CTA) widely deployed in clinical practice. Analysing computed perfusion maps from 1,393 CTA-imaged-patients with acute ischaemic stroke, we use deep generative inference to localise neural substrates of NIHSS sub-scores. We show that our approach replicates known lesion-deficit relations without knowledge of the lesion itself and reveals novel neural dependents. The high achieved anatomical fidelity suggests acute CTA-derived computed perfusion maps may be of substantial clinical-and-scientific value in rich phenotyping of acute stroke. Using only hyperacute imaging, deep generative inference could power highly expressive models of functional anatomical relations in ischaemic stroke within the pre-interventional window.
zh
[CV-165] Compressed Image Generation with Denoising Diffusion Codebook Models
【速读】:本文旨在解决高质量图像生成与高效压缩之间的平衡问题。关键在于提出了一种基于去噪扩散模型(Denoising Diffusion Models, DDMs)的新方法——去噪扩散码本模型(Denoising Diffusion Codebook Model, DDCM)。该方法通过从预定义的固定独立同分布高斯向量(iid Gaussian vectors)码本中选择噪声样本,替代标准DDM中的高斯噪声采样,在保持样本质量和多样性的同时实现了无损压缩比特流表示,从而在保证生成图像质量的前提下,显著提升了图像压缩效果。
链接: https://arxiv.org/abs/2502.01189
作者: Guy Ohayon,Hila Manor,Tomer Michaeli,Michael Elad
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Signal Processing (eess.SP)
备注: Code and demo are available at this https URL
点击查看摘要
Abstract:We present a novel generative approach based on Denoising Diffusion Models (DDMs), which produces high-quality image samples along with their losslessly compressed bit-stream representations. This is obtained by replacing the standard Gaussian noise sampling in the reverse diffusion with a selection of noise samples from pre-defined codebooks of fixed iid Gaussian vectors. Surprisingly, we find that our method, termed Denoising Diffusion Codebook Model (DDCM), retains sample quality and diversity of standard DDMs, even for extremely small codebooks. We leverage DDCM and pick the noises from the codebooks that best match a given image, converting our generative model into a highly effective lossy image codec achieving state-of-the-art perceptual image compression results. More generally, by setting other noise selections rules, we extend our compression method to any conditional image generation task (e.g., image restoration), where the generated images are produced jointly with their condensed bit-stream representations. Our work is accompanied by a mathematical interpretation of the proposed compressed conditional generation schemes, establishing a connection with score-based approximations of posterior samplers for the tasks considered.
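DDCM 的核心替换步骤可以示意如下:反向扩散的每一步不再采样新的高斯噪声,而是从固定的 iid 高斯码本中选出与期望残差方向最匹配的一条,仅需存储其索引即可构成比特流(假设性实现,码本大小与匹配准则均为示例):

```python
import numpy as np

def ddcm_select(residual, codebook):
    """Pick the codebook noise vector best aligned (by cosine similarity)
    with the desired residual; the chosen index, not the vector itself,
    goes into the compressed bit-stream."""
    norms = np.linalg.norm(codebook, axis=1) * np.linalg.norm(residual) + 1e-12
    scores = (codebook @ residual) / norms
    idx = int(np.argmax(scores))
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))   # fixed iid Gaussian codebook
idx, z = ddcm_select(codebook[7].copy(), codebook)
```

在该玩具设置下,每一步只需 log2(64)=6 比特即可编码所选噪声。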
zh
[CV-166] owards Robust and Generalizable Lensless Imaging with Modular Learned Reconstruction
【速读】:该论文旨在解决现有无透镜成像(lensless imaging)技术因简化建模假设而带来的校准与计算局限,并探究学习式方法对新掩膜类型的泛化能力。论文的关键解决方案在于引入了一种模块化的学习重构方法,其中包含一个图像恢复前的预处理器组件。理论分析证明了预处理器对于标准图像恢复技术(如维纳滤波和迭代算法)的必要性,并通过大量实验验证了其对多种无透镜成像方法及不同掩膜类型数据集(振幅掩膜和相位掩膜)的有效性。此外,论文还进行了首次跨掩膜类型的泛化基准测试,评估在一个系统上训练的重构模型对其他系统的泛化性能。这种模块化重构方法使得在新系统上使用预训练组件和迁移学习成为可能,从而大幅减少繁琐的测量和训练时间。
链接: https://arxiv.org/abs/2502.01102
作者: Eric Bezzam,Yohann Perron,Martin Vetterli
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages
点击查看摘要
Abstract:Lensless cameras disregard the conventional design that imaging should mimic the human eye. This is done by replacing the lens with a thin mask, and moving image formation to the digital post-processing. State-of-the-art lensless imaging techniques use learned approaches that combine physical modeling and neural networks. However, these approaches make simplifying modeling assumptions for ease of calibration and computation. Moreover, the generalizability of learned approaches to lensless measurements of new masks has not been studied. To this end, we utilize a modular learned reconstruction in which a key component is a pre-processor prior to image recovery. We theoretically demonstrate the pre-processor’s necessity for standard image recovery techniques (Wiener filtering and iterative algorithms), and through extensive experiments show its effectiveness for multiple lensless imaging approaches and across datasets of different mask types (amplitude and phase). We also perform the first generalization benchmark across mask types to evaluate how well reconstructions trained with one system generalize to others. Our modular reconstruction enables us to use pre-trained components and transfer learning on new systems to cut down weeks of tedious measurements and training. As part of our work, we open-source four datasets, and software for measuring datasets and for training our modular reconstruction.
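文中提到的标准图像恢复技术之一是维纳滤波。下面给出频域维纳反卷积的最小示意,并用一次循环卷积的模拟测量验证其还原能力(假设性实现,PSF 与信噪比参数均为示例):

```python
import numpy as np

def wiener_deconvolve(measurement, psf, snr=100.0):
    """Frequency-domain Wiener filter: a standard recovery step once the
    point spread function (PSF) of the lensless system is known."""
    H = np.fft.fft2(psf, s=measurement.shape)
    G = np.conj(H) / (np.abs(H) ** 2 + 1.0 / snr)
    return np.real(np.fft.ifft2(np.fft.fft2(measurement) * G))

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16))                   # toy scene
h1 = np.array([0.6, 0.3, 0.1])
psf = np.outer(h1, h1)                          # toy separable PSF
H = np.fft.fft2(psf, s=x.shape)
y = np.real(np.fft.ifft2(np.fft.fft2(x) * H))   # simulated circular-convolution measurement
rec = wiener_deconvolve(y, psf, snr=1e6)
```

在真实系统中,置于维纳滤波之前的预处理器(如去噪网络)正是为了缓解噪声与模型失配在这类解卷积中被放大的问题。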
zh
[CV-167] Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images
【速读】:该论文旨在解决前列腺癌早期检测中MRI-TRUS融合活检复杂且耗时的问题,并减少手动标注带来的潜在错误。解决方案的关键在于提出了一种全自动的基于MRI-TRUS融合的分割方法,该方法通过配准-分割框架整合MRI和TRUS模态的空间信息,实现直接在经直肠超声(TRUS)图像中识别前列腺肿瘤,无需手动标注,从而提高了分割精度并降低了对人工操作的依赖。
链接: https://arxiv.org/abs/2502.00712
作者: Shengtian Sang,Hassan Jahanandish,Cynthia Xinran Li,Indrani Bhattachary,Jeong Hoon Lee,Lichun Zhang,Sulaiman Vesal,Pejman Ghanouni,Richard Fan,Geoffrey A. Sonn,Mirabela Rusu
机构: Stanford University (斯坦福大学); University of Miami (迈阿密大学); Dartmouth College (达特茅斯学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Prostate cancer is a major cause of cancer-related deaths in men, where early detection greatly improves survival rates. Although MRI-TRUS fusion biopsy offers superior accuracy by combining MRI’s detailed visualization with TRUS’s real-time guidance, it is a complex and time-intensive procedure that relies heavily on manual annotations, leading to potential errors. To address these challenges, we propose a fully automatic MRI-TRUS fusion-based segmentation method that identifies prostate tumors directly in TRUS images without requiring manual annotations. Unlike traditional multimodal fusion approaches that rely on naive data concatenation, our method integrates a registration-segmentation framework to align and leverage spatial information between MRI and TRUS modalities. This alignment enhances segmentation accuracy and reduces reliance on manual effort. Our approach was validated on a dataset of 1,747 patients from Stanford Hospital, achieving an average Dice coefficient of 0.212, outperforming TRUS-only (0.117) and naive MRI-TRUS fusion (0.132) methods, with significant improvements (p < 0.01). This framework demonstrates the potential for reducing the complexity of prostate cancer diagnosis and provides a flexible architecture applicable to other multimodal medical imaging tasks.
zh
[CV-168] Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective
【速读】:该论文旨在解决医疗图像分割中的公平性问题,特别是由于人口属性(如年龄、性别、种族)和临床因素(如疾病严重程度)导致的数据采集不平衡所引发的偏见。论文的关键解决方案是提出了一种基于最优控制理论的分布感知混合专家模型(Distribution-aware Mixture of Experts, dMoE)。此模型通过适应异构数据分布,在多个网络架构中展现了广泛的适用性,并在两个二维基准数据集和一个三维自建数据集上实现了最先进的性能,从而有效缓解了因数据分布不均带来的偏见。
链接: https://arxiv.org/abs/2502.00619
作者: Yujin Oh,Pengfei Jin,Sangjoon Park,Sekeun Kim,Siyeop Yoon,Kyungsang Kim,Jin Sung Kim,Xiang Li,Quanzheng Li
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 3 figures, 9 tables
点击查看摘要
Abstract:Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE’s role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code will be made available.
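dMoE 的门控思想可以示意为:由人口学/临床属性驱动一个 softmax 门控,对各专家输出加权求和(假设性实现,专家与门控权重均为玩具示例,并非论文的网络结构):

```python
import numpy as np

def dmoe_forward(x, attr, experts, gate_w):
    """Distribution-aware MoE sketch: a gate conditioned on demographic /
    clinical attributes yields softmax weights over expert outputs."""
    logits = attr @ gate_w                      # (n_experts,)
    w = np.exp(logits - logits.max())
    w = w / w.sum()                             # softmax gating weights
    outs = np.stack([f(x) for f in experts])    # (n_experts, *x.shape)
    return np.tensordot(w, outs, axes=1)        # attribute-weighted mixture

experts = [lambda x: np.zeros_like(x), lambda x: x]
attr = np.array([1.0])                          # toy attribute vector
gate_w = np.array([[0.0, 10.0]])                # strongly prefers expert 1
out = dmoe_forward(np.ones(3), attr, experts, gate_w)
```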
zh
[CV-169] Segment Anything for Histopathology
【速读】:该论文旨在解决数字病理学中自动分割方法难以泛化到分布不同的新数据的问题。解决方案的关键在于基于多样化数据集训练Segment Anything Model (SAM),得到一种面向核分割的视觉基础模型(VFM),命名为PathoSAM。大量实验表明,PathoSAM在组织病理学图像的自动与交互式核实例分割上成为新的最先进模型,并可适配其他分割任务(包括语义核分割);在语义核分割任务上,其结果优于常用方法,但尚未超越当前最先进的CellViT。
链接: https://arxiv.org/abs/2502.00408
作者: Titus Griebel,Anwai Archit,Constantin Pape
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Nucleus segmentation is an important analysis task in digital pathology. However, methods for automatic segmentation often struggle with new data from a different distribution, requiring users to manually annotate nuclei and retrain data-specific models. Vision foundation models (VFMs), such as the Segment Anything Model (SAM), offer a more robust alternative for automatic and interactive segmentation. Despite their success in natural images, a foundation model for nucleus segmentation in histopathology is still missing. Initial efforts to adapt SAM have shown some success, but did not yet introduce a comprehensive model for diverse segmentation tasks. To close this gap, we introduce PathoSAM, a VFM for nucleus segmentation, based on training SAM on a diverse dataset. Our extensive experiments show that it is the new state-of-the-art model for automatic and interactive nucleus instance segmentation in histopathology. We also demonstrate how it can be adapted for other segmentation tasks, including semantic nucleus segmentation. For this task, we show that it yields results better than popular methods, while not yet beating the state-of-the-art, CellViT. Our models are open-source and compatible with popular tools for data annotation. We also provide scripts for whole-slide image segmentation. Our code and models are publicly available at this https URL.
zh
[CV-170] Prostate-Specific Foundation Models for Enhanced Detection of Clinically Significant
【速读】:该论文旨在解决前列腺癌诊断准确性低及潜在延迟的问题。现有方法依赖于MRI影像,但放射科医生的特异性和观察者间变异性较低,导致不必要的活检以及可能遗漏临床显著性癌症的风险。解决方案的关键在于提出了一种名为ProViCNet的前列腺器官特定视觉基础模型,该模型通过多机构的4,401名患者数据进行训练和验证,并采用基于活检确认标注的病灶级对比学习方法。ProViCNet在多种内部和外部验证队列中表现出色,其ROC曲线下面积在0.875至0.966之间,显著优于放射科医生的表现(0.907 vs. 0.805, p<0.001),尤其是在mpMRI检测中。此外,将ProViCNet与标准PSA测试结合,可以提高检测临床显著性癌症的特异性,从15%提升到38%,从而大幅减少不必要的活检。
链接: https://arxiv.org/abs/2502.00366
作者: Jeong Hoon Lee,Cynthia Xinran Li,Hassan Jahanandish,Indrani Bhattacharya,Sulaiman Vesal,Lichun Zhang,Shengtian Sang,Moon Hyung Choi,Simon John Christoph Soerensen,Steve Ran Zhou,Elijah Richard Sommer,Richard Fan,Pejman Ghanouni,Yuze Song,Tyler M. Seibert,Geoffrey A. Sonn,Mirabela Rusu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 44pages
点击查看摘要
Abstract:Accurate prostate cancer diagnosis remains challenging. Even when using MRI, radiologists exhibit low specificity and significant inter-observer variability, leading to potential delays or inaccuracies in identifying clinically significant cancers. This leads to numerous unnecessary biopsies and risks of missing clinically significant cancers. Here we present prostate vision contrastive network (ProViCNet), prostate organ-specific vision foundation models for Magnetic Resonance Imaging (MRI) and Trans-Rectal Ultrasound imaging (TRUS) for comprehensive cancer detection. ProViCNet was trained and validated using 4,401 patients across six institutions, as a prostate cancer detection model on radiology images relying on patch-level contrastive learning guided by biopsy confirmed radiologist annotations. ProViCNet demonstrated consistent performance across multiple internal and external validation cohorts with area under the receiver operating curve values ranging from 0.875 to 0.966, significantly outperforming radiologists in the reader study (0.907 versus 0.805, p < 0.001) for mpMRI, while achieving 0.670 to 0.740 for TRUS. We also integrated ProViCNet with standard PSA to develop a virtual screening test, and we showed that we can maintain the high sensitivity for detecting clinically significant cancers while more than doubling specificity from 15% to 38% (p < 0.001), thereby substantially reducing unnecessary biopsies. These findings highlight ProViCNet’s potential for enhancing prostate cancer diagnosis accuracy and reducing unnecessary biopsies, thereby optimizing diagnostic pathways.
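摘要中"虚拟筛查测试"的组合逻辑可以示意为一个简单的双阈值规则:只有当 PSA 与影像模型评分同时升高时才建议活检,从而在维持敏感性的同时提升特异性(假设性示意,阈值为示例取值,并非论文给出的临床界值或实际组合方式):

```python
def virtual_screen(psa, model_score, psa_cut=4.0, model_cut=0.5):
    """Two-test screening rule sketch: flag for biopsy only when both
    the PSA level and the imaging-model score exceed their cutoffs."""
    return psa >= psa_cut and model_score >= model_cut

decisions = [virtual_screen(6.0, 0.8),   # both elevated: recommend biopsy
             virtual_screen(6.0, 0.2),   # PSA only: spare the biopsy
             virtual_screen(2.0, 0.8)]   # score only: spare the biopsy
```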
zh
[CV-171] A Study on the Performance of U-Net Modifications in Retroperitoneal Tumor Segmentation
【速读】:该论文旨在解决腹膜后区域肿瘤自动分割面临的挑战,特别是由于其不规则形状导致的体积估算困难以及手动分割耗时的问题。论文的关键解决方案在于引入并评估了多种架构,包括基于CNN、Vision Transformer (ViT)、Mamba State Space Model (Mamba SSM) 和 Extended Long-Short Term Memory (xLSTM) 的U-Net改进模型,其中提出的ViLU-Net模型通过集成Vi-blocks来提升分割效果。特别地,实验结果突显了xLSTM在U-Net框架中的效率,能够以较低的资源消耗处理长距离依赖关系。
链接: https://arxiv.org/abs/2502.00314
作者: Moein Heidari,Ehsan Khodapanah Aghdam,Alexander Manzella,Daniel Hsu,Rebecca Scalabrino,Wenjin Chen,David J. Foran,Ilker Hacihaliloglu
机构: School of Biomedical Engineering, University of British Columbia (英属哥伦比亚大学); Independent Researcher (独立研究员); Rutgers Robert Wood Johnson Medical School (罗格斯罗伯特伍德约翰逊医学院); Beth Israel Deaconess Medical Center (贝丝以色列女执事医疗中心); Harvard Medical School (哈佛医学院); Weill Cornell Medical School (威尔康奈尔医学学院); Memorial Sloan Kettering Cancer Center (纪念斯隆凯特琳癌症中心); Center for Biomedical Imaging and Informatics, Rutgers Cancer Institute (罗格斯癌症研究所生物医学成像与信息中心); Department of Medicine, University of British Columbia (英属哥伦比亚大学医学系); Department of Radiology, University of British Columbia (英属哥伦比亚大学放射学系)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at the 2025 SPIE Medical Imaging Conference
点击查看摘要
Abstract:The retroperitoneum hosts a variety of tumors, including rare benign and malignant types, which pose diagnostic and treatment challenges due to their infrequency and proximity to vital structures. Estimating tumor volume is difficult due to their irregular shapes, and manual segmentation is time-consuming. Automatic segmentation using U-Net and its variants, incorporating Vision Transformer (ViT) elements, has shown promising results but struggles with high computational demands. To address this, architectures like the Mamba State Space Model (SSM) and Extended Long-Short Term Memory (xLSTM) offer efficient solutions by handling long-range dependencies with lower resource consumption. This study evaluates U-Net enhancements, including CNN, ViT, Mamba, and xLSTM, on a new in-house CT dataset and a public organ segmentation dataset. The proposed ViLU-Net model integrates Vi-blocks for improved segmentation. Results highlight xLSTM’s efficiency in the U-Net framework. The code is publicly accessible on GitHub.
zh
[CV-172] Patch Triplet Similarity Purification for Guided Real-World Low-Dose CT Image Denoising
【速读】:该论文旨在解决低剂量计算机断层扫描(Low-dose Computed Tomography, LDCT)图像去噪问题,以提高临床诊断中的图像质量,同时减少辐射暴露。论文的关键解决方案在于引入无对比剂CT(Non-Contrast CT, NCCT)图像作为清洁指导信息,并设计了一种新的Patch Triplet Similarity Purification (PTSP)策略来选择高度相似的LDCT、正常剂量CT(Normal-Dose CT, NDCT)和NCCT图像块三元组用于网络训练。此外,通过将标准自注意力机制替换为交叉注意力机制,对SwinIR和HAT两种图像去噪变换器进行了修改,以适应NCCT图像指导。这些改进显著提升了实际LDCT图像去噪性能。
链接: https://arxiv.org/abs/2502.00253
作者: Junhao Long,Fengwei Yang,Juncheng Yan,Baoping Zhang,Chao Jin,Jian Yang,Changliang Zou,Jun Xu
机构: School of Statistics and Data Science, Nankai University (南开大学统计与数据科学学院), Tianjin, China; Department of Radiology, The First Affiliated Hospital of Xi’an Jiaotong University (西安交通大学第一附属医院放射科), Xi’an, China; Shanxi Engineering Research Center of Computational Imaging and Medical Intelligence (陕西计算成像与医学智能工程研究中心), Xi’an, China; Xi’an Key Laboratory of Medical Computational Imaging (西安医学计算成像重点实验室), Xi’an, China
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Image denoising of low-dose computed tomography (LDCT) is an important problem for clinical diagnosis with reduced radiation exposure. Previous methods are mostly trained with pairs of synthetic or misaligned LDCT and normal-dose CT (NDCT) images. However, trained with synthetic noise or misaligned LDCT/NDCT image pairs, the denoising networks would suffer from blurry structure or motion artifacts. Since non-contrast CT (NCCT) images share the content characteristics to the corresponding NDCT images in a three-phase scan, they can potentially provide useful information for real-world LDCT image denoising. To exploit this aspect, in this paper, we propose to incorporate clean NCCT images as useful guidance for the learning of real-world LDCT image denoising networks. To alleviate the issue of spatial misalignment in training data, we design a new Patch Triplet Similarity Purification (PTSP) strategy to select highly similar patch (instead of image) triplets of LDCT, NDCT, and NCCT images for network training. Furthermore, we modify two image denoising transformers of SwinIR and HAT to accommodate the NCCT image guidance, by replacing vanilla self-attention with cross-attention. On our collected clinical dataset, the modified transformers trained with the data selected by our PTSP strategy show better performance than 15 comparison methods on real-world LDCT image denoising. Ablation studies validate the effectiveness of our NCCT image guidance and PTSP strategy. We will publicly release our data and code.
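PTSP 策略的核心是以图像块(而非整幅图)为单位,仅保留相互高度相似的三元组。下面用归一化互相关作为相似度度量给出示意(假设性实现,块大小、阈值与相似度准则均为示例,并非论文的原始定义):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two patches."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_patch_triplets(ldct, ndct, ncct, patch=8, stride=8, thresh=0.9):
    """Keep only (LDCT, NDCT, NCCT) patch triplets whose pairwise
    similarity is high, discarding misaligned regions before training."""
    triplets = []
    H, W = ldct.shape
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            p = (slice(i, i + patch), slice(j, j + patch))
            if ncc(ldct[p], ndct[p]) > thresh and ncc(ndct[p], ncct[p]) > thresh:
                triplets.append((ldct[p], ndct[p], ncct[p]))
    return triplets

rng = np.random.default_rng(0)
ndct = rng.normal(size=(16, 16))
ldct = ndct + 0.2 * rng.normal(size=(16, 16))         # noisy but aligned
aligned_ncct = ndct + 0.05 * rng.normal(size=(16, 16))
good = select_patch_triplets(ldct, ndct, aligned_ncct)
bad = select_patch_triplets(ldct, ndct, rng.normal(size=(16, 16)))  # misaligned guide
```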
zh
[CV-173] Improving Quality Control Of MRI Images Using Synthetic Motion Data
【速读】:该论文旨在解决MRI质量控制(QC)过程中由于数据集不平衡和有限以及主观评分所导致的挑战,这些问题阻碍了可靠自动化QC系统的开发。论文的关键解决方案在于通过在合成生成的运动伪影上预训练模型,然后应用迁移学习进行QC分类,从而不仅提高了识别低质量扫描的准确性,还减少了训练时间和资源需求。这种方法利用合成数据提供了更为稳健且资源高效的MRI质量控制自动化方案。
链接: https://arxiv.org/abs/2502.00160
作者: Charles Bricout,Sylvain Bouix,Samira Ebrahimi Kahou,Kang Ik K. Cho,Michael Harms,Ofer Pasternak,Carrie E. Bearden,Patrick D. McGorry,Rene S. Kahn,John Kane,Barnaby Nelson,Scott W. Woods,Martha E. Shenton
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ISBI 2025
点击查看摘要
Abstract:MRI quality control (QC) is challenging due to unbalanced and limited datasets, as well as subjective scoring, which hinder the development of reliable automated QC systems. To address these issues, we introduce an approach that pretrains a model on synthetically generated motion artifacts before applying transfer learning for QC classification. This method not only improves the accuracy in identifying poor-quality scans but also reduces training time and resource requirements compared to training from scratch. By leveraging synthetic data, we provide a more robust and resource-efficient solution for QC automation in MRI, paving the way for broader adoption in diverse research settings.
zh
[CV-174] Multimodal MRI-Ultrasound AI for Prostate Cancer Detection Outperforms Radiologist MRI Interpretation: A Multi-Center Study
【速读】:该论文旨在解决前列腺活检过程中通过磁共振成像(MRI)检测到的临床显著前列腺癌(CsPCa)病灶在转换至经直肠超声(TRUS)图像时容易遗漏的问题。研究的关键在于提出了一种基于多模态人工智能(AI)框架,该框架整合了MRI和TRUS图像序列,以增强CsPCa的识别能力。具体而言,该框架采用了基于3D UNet架构的方法,并在三个机构的1700个测试病例中进行了评估,结果显示其敏感性(80%)和病灶Dice系数(42%)均优于仅使用MRI或TRUS的单模态AI模型。此外,该多模态AI模型在另一组110例患者中的表现也超过了放射科医生,显示出更高的特异性(88%)和病灶Dice系数(38%),同时保持了等效的敏感性(79%)。
链接: https://arxiv.org/abs/2502.00146
作者: Hassan Jahanandish,Shengtian Sang,Cynthia Xinran Li,Sulaiman Vesal,Indrani Bhattacharya,Jeong Hoon Lee,Richard Fan,Geoffrey A. Sonna,Mirabela Rusu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pre-biopsy magnetic resonance imaging (MRI) is increasingly used to target suspicious prostate lesions. This has led to artificial intelligence (AI) applications improving MRI-based detection of clinically significant prostate cancer (CsPCa). However, MRI-detected lesions must still be mapped to transrectal ultrasound (TRUS) images during biopsy, which results in missing CsPCa. This study systematically evaluates a multimodal AI framework integrating MRI and TRUS image sequences to enhance CsPCa identification. The study included 3110 patients from three cohorts across two institutions who underwent prostate biopsy. The proposed framework, based on the 3D UNet architecture, was evaluated on 1700 test cases, comparing performance to unimodal AI models that use either MRI or TRUS alone. Additionally, the proposed model was compared to radiologists in a cohort of 110 patients. The multimodal AI approach achieved superior sensitivity (80%) and Lesion Dice (42%) compared to unimodal MRI (73%, 30%) and TRUS models (49%, 27%). Compared to radiologists, the multimodal model showed higher specificity (88% vs. 78%) and Lesion Dice (38% vs. 33%), with equivalent sensitivity (79%). Our findings demonstrate the potential of multimodal AI to improve CsPCa lesion targeting during biopsy and treatment planning, surpassing current unimodal models and radiologists; ultimately improving outcomes for prostate cancer patients.
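本条及前后多条摘要都以病灶 Dice 系数(Lesion Dice)作为评价指标,下面给出该通用指标定义的最小实现:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice overlap between two binary masks: 2*|A ∩ B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float((2.0 * inter + eps) / (pred.sum() + target.sum() + eps))

# Half-overlapping masks give a Dice of 0.5
d = dice_coefficient(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
```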
zh
[CV-175] Advanced Assessment of Stroke in Retinal Fundus Imaging with Deep Multi-view Learning
【速读】:该论文旨在解决通过视网膜影像准确识别和区分脑卒中(Stroke)和短暂性缺血发作(TIA)的问题。解决方案的关键在于提出了一种多视角脑卒中网络(MVS-Net),该网络采用端到端的深度学习方法,整合来自左右眼的视网膜影像多视角输入,定义并区分视网膜图像中的代表特征及黄斑中心和视神经头中心视角下的关联关系,从而实现对脑卒中和TIA的检测。实验结果表明,所提出的框架在检测脑卒中和TIA方面达到了0.84的AUC评分。
链接: https://arxiv.org/abs/2502.00079
作者: Aysen Degerli,Mika Hilvo,Juha Pajula,Petri Huhtinen,Pekka Jäkälä
机构: VTT Technical Research Centre of Finland (芬兰技术研究中心); Optomed Oyj (Optomed Oyj); Kuopio University Hospital (库奥皮奥大学医院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Stroke is globally a major cause of mortality and morbidity, and hence accurate and rapid diagnosis of stroke is valuable. Retinal fundus imaging reveals the known markers of elevated stroke risk in the eyes, which are retinal venular widening, arteriolar narrowing, and increased tortuosity. In contrast to other imaging techniques used for stroke diagnosis, the acquisition of fundus images is easy, non-invasive, fast, and inexpensive. Therefore, in this study, we propose a multi-view stroke network (MVS-Net) to detect stroke and transient ischemic attack (TIA) using retinal fundus images. Contrary to existing studies, our study proposes for the first time a solution to discriminate stroke and TIA with deep multi-view learning by proposing an end-to-end deep network, consisting of multi-view inputs of fundus images captured from both right and left eyes. Accordingly, the proposed MVS-Net defines representative features from fundus images of both eyes and determines the relation within their macula-centered and optic nerve head-centered views. Experiments performed on a dataset collected from stroke and TIA patients, in addition to healthy controls, show that the proposed framework achieves an AUC score of 0.84 for stroke and TIA detection.
zh
[CV-176] Deep Ensembling with Multimodal Image Fusion for Efficient Classification of Lung Cancer
【速读】:该论文旨在解决多模态肺部图像中癌变与健康组织切片的分类问题。针对有限样本量的挑战,论文提出的关键解决方案是开发了一种基于深度集成的多模态融合网络(Deep Ensembled Multimodal Fusion, DEMF),通过主成分分析(Principal Component Analysis, PCA)和自动编码器(Autoencoder)融合正电子发射断层成像(Positron Emission Tomography, PET)和计算机断层扫描(Computed Tomography, CT)图像,并采用多数投票策略进行分类。此外,使用梯度加权类激活映射(Gradient-weighted Class Activation Mapping, Grad-CAM)来可视化分类结果,同时在训练阶段采用了随机图像增强策略以应对样本不足的问题。
链接: https://arxiv.org/abs/2502.00078
作者: Surochita Pal,Sushmita Mitra
机构: Indian Statistical Institute (印度统计学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This study focuses on the classification of cancerous and healthy slices from multimodal lung images. The data used in the research comprises Computed Tomography (CT) and Positron Emission Tomography (PET) images. The proposed strategy achieves the fusion of PET and CT images by utilizing Principal Component Analysis (PCA) and an Autoencoder. Subsequently, a new ensemble-based classifier, Deep Ensembled Multimodal Fusion (DEMF), is developed, employing majority voting to classify the sample images under examination. Gradient-weighted Class Activation Mapping (Grad-CAM) is employed to visualize the classification accuracy of cancer-affected images. Given the limited sample size, a random image augmentation strategy is employed during the training phase. The DEMF network helps mitigate the challenges of scarce data in computer-aided medical image analysis. The proposed network is compared with state-of-the-art networks across three publicly available datasets. The network outperforms the others based on the metrics Accuracy, F1-Score, Precision, and Recall. The investigation results highlight the effectiveness of the proposed network.
zh
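The DEMF classifier described above aggregates its ensemble members by majority voting. A minimal sketch of that voting step (an illustrative reconstruction, not the authors' code; the class labels are invented):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model class predictions for one sample by majority vote.

    predictions: list of class labels, one from each ensemble member.
    Ties are broken in favor of the label seen first.
    """
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(per_model_preds):
    """per_model_preds: one prediction list per model, all of equal
    length (one label per sample); returns the voted label per sample."""
    n_samples = len(per_model_preds[0])
    return [majority_vote([preds[i] for preds in per_model_preds])
            for i in range(n_samples)]

# Hypothetical outputs of three ensemble members on three slices
preds = [["cancer", "healthy", "cancer"],
         ["cancer", "cancer", "cancer"],
         ["healthy", "cancer", "cancer"]]
print(ensemble_predict(preds))  # ['cancer', 'cancer', 'cancer']
```

In the paper each ensemble member is a deep network scoring fused PET-CT slices; here plain label lists stand in for their outputs.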
[CV-177] LSU-Net: Lightweight Automatic Organs Segmentation Network For Medical Images ICASSP2025
【速读】:该论文旨在解决现有UNet及其变体在医学图像分割应用中的高参数量和计算复杂性问题,限制其在临床环境中有限计算资源下的实用性。解决方案的关键在于提出了一种新型的轻量化位移U-Net(LSU-Net),通过集成轻量化卷积块(Light Conv Block)和标记化位移块(Tokenized Shift Block),结合动态权重多损失设计,实现高效特征提取与动态权重分配。轻量化卷积块通过标准卷积与深度可分离卷积相结合的方式,以低参数量有效捕捉特征;标记化位移块则通过空间位移块与深度可分离卷积的组合优化特征表示,通过深度特征的位移和捕捉提升性能。动态调整各层的损失权重能够逼近最优解并增强训练稳定性。
链接: https://arxiv.org/abs/2502.00042
作者: Yujie Ding,Shenghua Teng,Zuoyong Li,Xiao Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures, 4 tables. Accepted at ICASSP 2025
点击查看摘要
Abstract:UNet and its variants have widespread applications in medical image segmentation. However, the substantial number of parameters and computational complexity of these models make them less suitable for use in clinical settings with limited computational resources. To address this limitation, we propose a novel Lightweight Shift U-Net (LSU-Net). We integrate the Light Conv Block and the Tokenized Shift Block in a lightweight manner, combining them with a dynamic weight multi-loss design for efficient dynamic weight allocation. The Light Conv Block effectively captures features with a low parameter count by combining standard convolutions with depthwise separable convolutions. The Tokenized Shift Block optimizes feature representation by shifting and capturing deep features through a combination of the Spatial Shift Block and depthwise separable convolutions. Dynamic adjustment of the loss weights at each layer approaches the optimal solution and enhances training stability. We validated LSU-Net on the UWMGI and MSD Colon datasets, and experimental results demonstrate that LSU-Net outperforms most state-of-the-art segmentation architectures.
zh
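The parameter savings that make LSU-Net's Light Conv Block lightweight come from swapping standard convolutions for depthwise separable ones. The arithmetic behind that claim (ignoring biases) can be checked directly:

```python
def conv_params(k, c_in, c_out):
    # Standard 2D convolution: one k x k x c_in kernel per output channel
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    # Depthwise: one k x k kernel per input channel,
    # followed by a 1 x 1 pointwise convolution mixing channels
    return k * k * c_in + c_in * c_out

std = conv_params(3, 64, 128)           # 73728 parameters
dws = dw_separable_params(3, 64, 128)   # 576 + 8192 = 8768 parameters
print(std, dws, round(std / dws, 1))    # roughly an 8x reduction
```

The layer sizes here are illustrative, not taken from the LSU-Net architecture.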
人工智能
[AI-0] The AI Agent Index
链接: https://arxiv.org/abs/2502.01635
作者: Stephen Casper,Luke Bailey,Rosco Hunter,Carson Ezell,Emma Cabalé,Michael Gerovitch,Stewart Slocum,Kevin Wei,Nikola Jurkovic,Ariba Khan,Phillip J.K. Christoffersen,A. Pinar Ozisik,Rakshit Trivedi,Dylan Hadfield-Menell,Noam Kolt
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accompanying website: this https URL
点击查看摘要
Abstract:Leading AI developers and startups are increasingly deploying agentic AI systems that can plan and execute complex tasks with limited human involvement. However, there is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems. To fill this gap, we introduce the AI Agent Index, the first public database to document information about currently deployed agentic AI systems. For each system that meets the criteria for inclusion in the index, we document the system’s components (e.g., base model, reasoning implementation, tool use), application domains (e.g., computer use, software engineering), and risk management practices (e.g., evaluation results, guardrails), based on publicly available information and correspondence with developers. We find that while developers generally provide ample information regarding the capabilities and applications of agentic systems, they currently provide limited information regarding safety and risk management practices. The AI Agent Index is available online at this https URL
[AI-1] Online Gradient Boosting Decision Tree: In-Place Updates for Efficient Adding/Deleting Data
链接: https://arxiv.org/abs/2502.01634
作者: Huawei Lin,Jun Woo Chung,Yingjie Lao,Weijie Zhao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: 25 pages, 11 figures, 16 tables. Keywords: Decremental Learning, Incremental Learning, Machine Unlearning, Online Learning, Gradient Boosting Decision Trees, GBDTs
点击查看摘要
Abstract:Gradient Boosting Decision Tree (GBDT) is one of the most popular machine learning models in various applications. However, in the traditional settings, all data should be simultaneously accessed in the training procedure: it does not allow to add or delete any data instances after training. In this paper, we propose an efficient online learning framework for GBDT supporting both incremental and decremental learning. To the best of our knowledge, this is the first work that considers an in-place unified incremental and decremental learning on GBDT. To reduce the learning cost, we present a collection of optimizations for our framework, so that it can add or delete a small fraction of data on the fly. We theoretically show the relationship between the hyper-parameters of the proposed optimizations, which enables trading off accuracy and cost on incremental and decremental learning. The backdoor attack results show that our framework can successfully inject and remove backdoor in a well-trained model using incremental and decremental learning, and the empirical results on public datasets confirm the effectiveness and efficiency of our proposed online learning framework and optimizations.
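The in-place add/delete idea rests on tree statistics being additive. A toy sketch under strong simplifying assumptions (a single regression leaf holding a mean; the actual framework handles full boosted trees, splits, and gradients):

```python
class Leaf:
    """Leaf value = mean of assigned targets, kept as additive
    sufficient statistics so instances can be added or removed in place."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, y):          # incremental learning step
        self.total += y
        self.count += 1

    def delete(self, y):       # decremental learning (unlearning) step
        self.total -= y
        self.count -= 1

    @property
    def value(self):
        return self.total / self.count if self.count else 0.0

leaf = Leaf()
for y in [1.0, 2.0, 3.0]:
    leaf.add(y)
leaf.delete(2.0)   # remove one instance without retraining from scratch
print(leaf.value)  # 2.0
```

The real system must also decide when a deletion invalidates a split and rebuild affected subtrees; this sketch only shows why leaf-level updates can be O(1).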
[AI-2] Adversarial Reasoning at Jailbreaking Time
链接: https://arxiv.org/abs/2502.01633
作者: Mahdi Sabbaghi,Paul Kassianik,George Pappas,Yaron Singer,Amin Karbasi,Hamed Hassani
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation that achieves SOTA attack success rates (ASR) against many aligned LLMs, even the ones that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
[AI-3] TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues
链接: https://arxiv.org/abs/2502.01630
作者: Yubin Ge,Salvatore Romeo,Jason Cai,Raphael Shu,Monica Sunkara,Yassine Benajiba,Yi Zhang
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Temporal reasoning in multi-session dialogues presents a significant challenge which has been under-studied in previous temporal reasoning benchmarks. To bridge this gap, we propose a new evaluation task for temporal reasoning in multi-session dialogues and introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multi-choice QAs. Furthermore, we present TReMu, a new framework aimed at enhancing the temporal reasoning capabilities of LLM-agents in this context. Specifically, the framework employs time-aware memorization through timeline summarization, generating retrievable memory by summarizing events in each dialogue session with their inferred dates. Additionally, we integrate neuro-symbolic temporal reasoning, where LLMs generate Python code to perform temporal calculations and select answers. Experimental evaluations on popular LLMs demonstrate that our benchmark is challenging, and the proposed framework significantly improves temporal reasoning performance compared to baseline methods, raising accuracy from 29.83 on GPT-4o via standard prompting to 77.67 via our approach and highlighting its effectiveness in addressing temporal reasoning in multi-session dialogues.
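The neuro-symbolic step has the LLM emit Python that performs the temporal calculation. Code of the kind the model might generate could look like this (a hand-written illustration, not actual model output; the events and dates are invented):

```python
from datetime import date

# Inferred event dates from the timeline summary (hypothetical example)
events = {
    "adopted a puppy": date(2023, 5, 14),
    "moved to Boston": date(2023, 11, 2),
}

# Question: how many full weeks passed between the two events?
delta_days = (events["moved to Boston"] - events["adopted a puppy"]).days
weeks = delta_days // 7
print(weeks)  # 24
```

Delegating the arithmetic to executed code avoids the date-calculation slips LLMs make when reasoning purely in text, which is the motivation the abstract gives for the neuro-symbolic design.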
[AI-4] A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
链接: https://arxiv.org/abs/2502.01618
作者: Isha Puri,Shivchander Sudalairaj,Guangxuan Xu,Kai Xu,Akash Srivastava
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate over our deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts. Our work not only presents an effective method to inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work. Code and further information is available at this https URL.
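The core move borrowed from particle-based Monte Carlo is to weight candidate continuations by an approximate likelihood (e.g., a reward-model score) and resample in proportion to that weight, rather than keeping only the top-scoring candidate. A toy sketch (hypothetical; the softmax weighting, temperature, and scores are assumptions, not the paper's exact scheme):

```python
import math
import random

def resample(particles, scores, temperature=1.0, rng=None):
    """Resample candidate partial solutions in proportion to
    softmax(score / temperature) -- a stand-in for the approximate
    likelihood supplied by a reward model."""
    rng = rng or random.Random(0)
    weights = [math.exp(s / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(particles, weights=probs, k=len(particles))

particles = ["step A", "step B", "step C", "step D"]
scores = [0.1, 2.5, 0.3, 2.4]   # hypothetical reward-model scores
survivors = resample(particles, scores)
# High-scoring particles dominate after resampling, but low-scoring
# ones are not deterministically eliminated -- the "typical set"
# exploration the abstract contrasts with mode-seeking search.
```

This sampling view is what makes the approach less brittle to reward hacking than greedy search over reward-model scores.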
[AI-5] Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
链接: https://arxiv.org/abs/2502.01612
作者: Nayoung Lee,Ziyang Cai,Avi Schwarzschild,Kangwook Lee,Dimitris Papailiopoulos
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining a standard transformer architecture. Across diverse tasks including arithmetic, string manipulation, and maze solving, self-improving enables models to solve problems far beyond their initial training distribution-for instance, generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that in some cases filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds. Additionally, starting from pretrained models significantly accelerates this self-improvement process for several tasks. Our results demonstrate how controlled weak-to-strong curricula can systematically teach a model logical extrapolation without any changes to the positional embeddings, or the model architecture.
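The filtering the authors describe, keeping only self-generated solutions that verify as correct, can be sketched as a simple loop (toy stand-ins: the "model" and the verifier below are invented, whereas the paper trains a transformer):

```python
def self_improvement_round(model_solve, problems, verify):
    """One round of self-improvement: the model attempts problems, and
    only verified-correct (problem, solution) pairs are kept as new
    training data for the next round."""
    new_data = []
    for p in problems:
        sol = model_solve(p)
        if verify(p, sol):
            new_data.append((p, sol))
    return new_data

# Toy stand-ins: a "model" that errs on one input, and an exact verifier.
flaky = lambda p: p[0] + p[1] + (1 if p == (9, 9) else 0)
verify = lambda p, s: s == p[0] + p[1]
kept = self_improvement_round(flaky, [(1, 2), (9, 9), (3, 4)], verify)
print(kept)  # [((1, 2), 3), ((3, 4), 7)] -- the wrong answer is filtered out
```

The paper's key observation is that repeating this loop on progressively harder problems lets correctness filtering compound into large out-of-distribution gains.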
[AI-6] Reinforcement Learning for Long-Horizon Interactive LLM Agents
链接: https://arxiv.org/abs/2502.01600
作者: Kevin Chen,Marco Cusumano-Towner,Brody Huval,Aleksei Petrenko,Jackson Hamburger,Vladlen Koltun,Philipp Krähenbühl
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests. While IDAs powered by instruction-tuned large language models (LLMs) can react to feedback from interface invocations in multi-step exchanges, they have not been trained in their respective digital environments. Prior methods accomplish less than half of tasks in sophisticated benchmarks such as AppWorld. We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments. We formalize this training as a partially observable Markov decision process and derive M-PPO, a data- and memory-efficient variant of proximal policy optimization. M-PPO uses no value network and maintains exactly one copy of the underlying LLM in memory, making its implementation straightforward and as memory-efficient as fine-tuning a single LLM. A 32-billion-parameter agent trained with M-PPO in the AppWorld environment outperforms the much larger OpenAI o1 agent by 9 percentage points (15% relative). To our knowledge, this is the first reported application of RL to IDAs that interact with a stateful, multi-domain, multi-app environment via direct API calls. Our analysis sheds light on the effectiveness of RL in this area, showing that the agent learns to consult the API documentation, avoid unwarranted assumptions, minimize confabulation, and recover from setbacks.
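M-PPO is presented as a variant of proximal policy optimization; for reference, the textbook PPO clipped surrogate that such variants build on looks like this (a sketch of standard PPO, not the paper's exact M-PPO objective, which additionally drops the value network):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: ratio is pi_new(a|s) / pi_old(a|s),
    advantage is the (estimated) advantage of the taken action."""
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: gains are capped once the ratio exceeds 1 + eps
print(ppo_clip_objective(1.5, 1.0))   # 1.2
# Negative advantage: the penalty is not clipped away as the ratio grows
print(ppo_clip_objective(1.5, -1.0))  # -1.5
```

The clipping keeps each policy update close to the data-collecting policy, which matters for stability when the "policy" is a fine-tuned LLM.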
[AI-7] Improving Transformer World Models for Data-Efficient RL
链接: https://arxiv.org/abs/2502.01591
作者: Antoine Dedieu,Joseph Ortiz,Xinghua Lou,Carter Wendelken,Wolfgang Lehrach,J Swaroop Guntupalli,Miguel Lazaro-Gredilla,Kevin Patrick Murphy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We present an approach to model-based RL that achieves a new state of the art performance on the challenging Craftax-classic benchmark, an open-world 2D survival game that requires agents to exhibit a wide range of general abilities – such as strong generalization, deep exploration, and long-term reasoning. With a series of careful design choices aimed at improving sample efficiency, our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and, for the first time, exceeds human performance of 65.0%. Our method starts by constructing a SOTA model-free baseline, using a novel policy architecture that combines CNNs and RNNs. We then add three improvements to the standard MBRL setup: (a) “Dyna with warmup”, which trains the policy on real and imaginary data, (b) “nearest neighbor tokenizer” on image patches, which improves the scheme to create the transformer world model (TWM) inputs, and (c) “block teacher forcing”, which allows the TWM to reason jointly about the future tokens of the next timestep.
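Improvement (b), the nearest-neighbor tokenizer, maps each image patch to the index of its closest codebook entry before the token sequence is fed to the transformer world model. A simplified stand-in (toy 2-D vectors instead of real pixel patches; the codebook here is invented):

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def tokenize(patches, codebook):
    """Map each flattened image patch to the index of its nearest
    codebook entry; the resulting token sequence is the TWM's input."""
    return [min(range(len(codebook)), key=lambda i: sq_dist(p, codebook[i]))
            for p in patches]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # toy "patch" prototypes
patches = [[0.1, 0.2], [0.9, 0.8], [0.1, 0.9]]
print(tokenize(patches, codebook))  # [0, 1, 2]
```

Unlike a learned VQ-VAE codebook, a nearest-neighbor scheme of this shape can populate its codebook directly from observed patches, which is one plausible reading of why it helps sample efficiency.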
[AI-8] Verbalized Bayesian Persuasion
链接: https://arxiv.org/abs/2502.01587
作者: Wenhao Li,Yue Lin,Xiangfeng Wang,Bo Jin,Hongyuan Zha,Baoxiang Wang
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 63 pages, 21 figures
点击查看摘要
Abstract:Information design (ID) explores how a sender influences the optimal behavior of receivers to achieve specific objectives. While ID originates from everyday human communication, existing game-theoretic and machine learning methods often model information structures as numbers, which limits many applications to toy games. This work leverages LLMs and proposes a verbalized framework in Bayesian persuasion (BP), which extends classic BP to real-world games involving human dialogues for the first time. Specifically, we map the BP to a verbalized mediator-augmented extensive-form game, where LLMs instantiate the sender and receiver. To efficiently solve the verbalized game, we propose a generalized equilibrium-finding algorithm combining an LLM and a game solver. The algorithm is reinforced with techniques including verbalized commitment assumptions, verbalized obedience constraints, and information obfuscation. Numerical experiments in dialogue scenarios, such as recommendation letters, courtroom interactions, and law enforcement, validate that our framework can both reproduce theoretical results in classic BP and discover effective persuasion strategies in more complex natural language and multi-stage scenarios.
[AI-9] PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
链接: https://arxiv.org/abs/2502.01584
作者: Carolyn Jane Anderson,Joydeep Biswas,Aleksander Boruch-Gruszecki,Federico Cassano,Molly Q Feldman,Arjun Guha,Francesca Lucchetti,Zixuan Wu
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Existing benchmarks for frontier models often test specialized, “PhD-level” knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models’ mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with “I give up” before providing an answer that it knows is wrong. R1 can also be remarkably “uncertain” in its output and in rare cases, it does not “finish thinking,” which suggests the need for an inference-time technique to “wrap up” before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
[AI-10] Next Steps in LLM-Supported Java Verification ICSE
链接: https://arxiv.org/abs/2502.01573
作者: Samuel Teuber,Bernhard Beckert
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: Accepted to NSE 2025, 1st International Workshop on Neuro-Symbolic Software Engineering (ICSE Workshop), 6 pages, 3 figures
点击查看摘要
Abstract:Recent work has shown that Large Language Models (LLMs) are not only a suitable tool for code generation but also capable of generating annotation-based code specifications. Scaling these methodologies may allow us to deduce provable correctness guarantees for large-scale software systems. In comparison to other LLM tasks, the application field of deductive verification has the notable advantage of providing a rigorous toolset to check LLM-generated solutions. This short paper provides early results on how this rigorous toolset can be used to reliably elicit correct specification annotations from an unreliable LLM oracle.
[AI-11] MeetMap: Real-Time Collaborative Dialogue Mapping with LLMs in Online Meetings
链接: https://arxiv.org/abs/2502.01564
作者: Xinyue Chen,Nathan Yap,Xinyi Lu,Aylin Gunal,Xu Wang
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: CSCW2025 Accepted
点击查看摘要
Abstract:Video meeting platforms display conversations linearly through transcripts or summaries. However, ideas during a meeting do not emerge linearly. We leverage LLMs to create dialogue maps in real time to help people visually structure and connect ideas. Balancing the need to reduce the cognitive load on users during the conversation while giving them sufficient control when using AI, we explore two system variants that encompass different levels of AI assistance. In Human-Map, AI generates summaries of conversations as nodes, and users create dialogue maps with the nodes. In AI-Map, AI produces dialogue maps where users can make edits. We ran a within-subject experiment with ten pairs of users, comparing the two MeetMap variants and a baseline. Users preferred MeetMap over traditional methods for taking notes, which aligned better with their mental models of conversations. Users liked the ease of use for AI-Map due to the low effort demands and appreciated the hands-on opportunity in Human-Map for sense-making.
[AI-12] Search-Based Adversarial Estimates for Improving Sample Efficiency in Off-Policy Reinforcement Learning
链接: https://arxiv.org/abs/2502.01558
作者: Federico Malato,Ville Hautamaki
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to International Conference on Machine Learning 2025. Currently under peer-review
点击查看摘要
Abstract:Sample inefficiency is a long-standing challenge in deep reinforcement learning (DRL). Although dramatic improvements have been made, the problem is far from solved and is especially challenging in environments with sparse or delayed rewards. In our work, we propose to use Adversarial Estimates as a new, simple and efficient approach to mitigate this problem for a class of feedback-based DRL algorithms. Our approach leverages latent similarity search from a small set of human-collected trajectories to boost learning, using only five minutes of human-recorded experience. The results of our study show that algorithms trained with Adversarial Estimates converge faster than their original versions. Moreover, we discuss how our approach could enable learning in feedback-based algorithms in extreme scenarios with very sparse rewards.
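The latent similarity search over a small set of human trajectories can be sketched as a nearest-neighbor lookup (assuming cosine similarity in latent space, which is an assumption here; the paper's exact similarity measure and how the lookup shapes the estimate may differ):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_human_state(agent_latent, human_latents):
    """Return the index of the most similar state in the human-collected
    trajectories -- the lookup used to shape the learning signal."""
    return max(range(len(human_latents)),
               key=lambda i: cosine(agent_latent, human_latents[i]))

# Toy latent vectors standing in for encoded trajectory states
human = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(nearest_human_state([0.9, 0.1, 0.0], human))  # 0
```

Because the human set is tiny (five minutes of experience), a brute-force scan like this is already practical; larger sets would call for an approximate index.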
[AI-13] Query Brand Entity Linking in E-Commerce Search
链接: https://arxiv.org/abs/2502.01555
作者: Dong Liu,Sreyashi Nag
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this work, we address the brand entity linking problem for e-commerce search queries. The entity linking task is done by either i) a two-stage process consisting of entity mention detection followed by entity disambiguation, or ii) an end-to-end linking approach that directly fetches the target entity given the input text. The task presents unique challenges: queries are extremely short (averaging 2.4 words), lack natural language structure, and must handle a massive space of unique brands. We present a two-stage approach combining named-entity recognition with matching, and a novel end-to-end solution using extreme multi-class classification. We validate our solutions by both offline benchmarks and the impact of an online A/B test.
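The two-stage route (mention detection, then matching against the brand catalog) can be illustrated on a toy query, with stdlib difflib standing in for the learned NER and matching models (the brand list, span heuristic, and threshold are all invented):

```python
import difflib

BRANDS = ["nike", "adidas", "new balance"]   # toy brand catalog

def detect_mention(query):
    """Naive mention detection: return the 1- or 2-token span of the
    query that best fuzzy-matches any known brand (a stand-in for the
    NER stage)."""
    tokens = query.lower().split()
    spans = tokens + [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return max(spans, key=lambda s: max(
        difflib.SequenceMatcher(None, s, b).ratio() for b in BRANDS))

def link(query):
    """Disambiguation stage: match the detected mention to a catalog
    brand, tolerating typos common in short queries."""
    mention = detect_mention(query)
    match = difflib.get_close_matches(mention, BRANDS, n=1, cutoff=0.6)
    return match[0] if match else None

print(link("nke running shoes"))  # 'nike', despite the typo
```

The end-to-end alternative the abstract mentions would instead classify the whole query directly into the brand space, skipping the explicit mention span.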
[AI-14] Transformers trained on proteins can learn to attend to Euclidean distance
链接: https://arxiv.org/abs/2502.01533
作者