This post lists the latest papers retrieved from arXiv.org on 2025-02-04. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper list by email, please leave your email address in the comments.
Table of Contents
Overview (2025-02-04)
A total of 965 papers are updated today, including:
- Natural Language Processing: 160 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 279 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 193 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 381 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Scaling Embedding Layers in Language Models
[Quick Read]: This paper tackles the increased decoding cost incurred when scaling up input embedding layers to improve language model performance. The key is SCONE, which introduces embeddings for frequent n-grams; these embeddings provide a contextualized representation for each input token and are learned by a separate model during training. The approach improves model performance markedly while keeping inference-time FLOPS fixed and avoiding extra decoding cost.
Link: https://arxiv.org/abs/2502.01637
Authors: Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
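The cached-table idea is easy to picture in code. Below is a toy sketch: names such as `fake_embedding` and `frequent_ngrams` are invented for illustration, and hash-based vectors stand in for the learned embeddings (the paper learns the table with a separate model and serves it from off-accelerator memory):

```python
# Toy sketch of an n-gram embedding lookup in the spirit of SCONE.
# For each input token, find the longest frequent n-gram ending at that
# position and add its cached embedding to the token's base embedding.

import hashlib

DIM = 4

def fake_embedding(key: str) -> list[float]:
    """Deterministic stand-in for a learned embedding vector."""
    h = hashlib.sha256(key.encode()).digest()
    return [b / 255.0 for b in h[:DIM]]

# Cached table of frequent n-grams (in the real method this is precomputed
# before inference and stored off-accelerator).
frequent_ngrams = {("new", "york"), ("new", "york", "city"), ("the",)}
ngram_table = {ng: fake_embedding(" ".join(ng)) for ng in frequent_ngrams}

def embed(tokens: list[str], max_n: int = 3) -> list[list[float]]:
    out = []
    for i, tok in enumerate(tokens):
        vec = fake_embedding(tok)  # original-vocabulary embedding
        # Longest frequent n-gram ending at position i wins.
        for n in range(max_n, 0, -1):
            ng = tuple(tokens[max(0, i - n + 1): i + 1])
            if ng in ngram_table:
                vec = [a + b for a, b in zip(vec, ngram_table[ng])]
                break
        out.append(vec)
    return out

vecs = embed(["i", "love", "new", "york", "city"])
print(len(vecs), len(vecs[0]))  # one DIM-dimensional vector per token
```

The longest-match loop mirrors how a token's representation is contextualized by the frequent n-gram it completes, while each lookup stays a constant-time table access.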
[NLP-1] Lifelong Sequential Knowledge Editing without Model Degradation
[Quick Read]: This paper addresses the significant model degradation caused by large-scale sequential knowledge editing. The key is ENCORE (Early stopping and Norm-Constrained Robust knowledge Editing), which controls overfitting and disproportionate norm growth, enabling up to 10,000 sequential edits without loss of downstream performance while running faster than existing methods such as MEMIT and AlphaEdit.
Link: https://arxiv.org/abs/2502.01636
Authors: Akshat Gupta, Phudish Prateepamornkul, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Prior work in parameter-modifying knowledge editing has shown that large-scale sequential editing leads to significant model degradation. In this paper, we study the reasons behind this and scale sequential knowledge editing to 10,000 sequential edits, while maintaining the downstream performance of the original model. We first show that locate-then-edit knowledge editing methods lead to overfitting on the edited facts. We also show that continuous knowledge editing using these methods leads to disproportionate growth in the norm of the edited matrix. We then provide a crucial insight into the inner workings of locate-then-edit methods. We show that norm-growth is a hidden trick employed by these methods that gives larger importance to the output activations produced from the edited layers. With this “importance hacking”, the edited layers provide much larger contributions to the model’s output. To mitigate these issues, we present ENCORE - Early stopping and Norm-Constrained Robust knowledge Editing. ENCORE controls for overfitting and the disproportionate norm-growth to enable long-term sequential editing, where we are able to perform up to 10,000 sequential edits without loss of downstream performance. ENCORE is also 61% faster than MEMIT and 64% faster than AlphaEdit on Llama3-8B.
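The norm-constraint idea can be pictured generically. This is not the paper's actual algorithm (which operates inside locate-then-edit updates); it is a sketch with an invented `max_growth` bound showing how an edit can be rescaled so the edited matrix's norm cannot grow disproportionately:

```python
# Minimal sketch of norm-constrained editing: after an edit produces W_new,
# shrink the update so the matrix's Frobenius norm grows by at most
# `max_growth` over the original.

import math

def frob(W):
    """Frobenius norm of a matrix given as nested lists."""
    return math.sqrt(sum(x * x for row in W for x in row))

def norm_constrained_edit(W_old, W_new, max_growth=1.05):
    """Blend W_old and W_new so that ||W||_F <= max_growth * ||W_old||_F."""
    limit = max_growth * frob(W_old)
    if frob(W_new) <= limit:
        return W_new
    # Bisect on the blend factor of the edit direction (W_new - W_old).
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        W = [[a + mid * (b - a) for a, b in zip(ra, rb)]
             for ra, rb in zip(W_old, W_new)]
        if frob(W) <= limit:
            lo = mid
        else:
            hi = mid
    return [[a + lo * (b - a) for a, b in zip(ra, rb)]
            for ra, rb in zip(W_old, W_new)]

W_old = [[1.0, 0.0], [0.0, 1.0]]
W_new = [[5.0, 0.0], [0.0, 5.0]]   # an edit that blows up the norm
W = norm_constrained_edit(W_old, W_new)
print(round(frob(W), 3))  # ≈ 1.05 * ||W_old|| ≈ 1.485
```

The same constraint, applied edit after edit, is what prevents the "importance hacking" the abstract describes, where ever-growing norms let edited layers dominate the output.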
[NLP-2] LLM-TA: An LLM-Enhanced Thematic Analysis Pipeline for Transcripts from Parents of Children with Congenital Heart Disease AAAI2025 ALT
[Quick Read]: This paper addresses the resource intensity and poor scalability of Thematic Analysis (TA) when applied to large, complex healthcare datasets. It proposes an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline that integrates an affordable state-of-the-art LLM (GPT-4o mini), LangChain, and prompt engineering with chunking techniques to analyze interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease. The key is using this pipeline to improve scalability, efficiency, and accuracy while reducing analyst workload, with close collaboration with domain experts emphasized throughout.
Link: https://arxiv.org/abs/2502.01620
Authors: Muhammad Zain Raza, Jiawei Xu, Terence Lim, Lily Boddy, Carlos M. Mery, Andrew Well, Ying Ding
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted by GenAI for Health Workshop @ AAAI 2025, Philadelphia
Abstract:Thematic Analysis (TA) is a fundamental method in healthcare research for analyzing transcript data, but it is resource-intensive and difficult to scale for large, complex datasets. This study investigates the potential of large language models (LLMs) to augment the inductive TA process in high-stakes healthcare settings. Focusing on interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we propose an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline. Our pipeline integrates an affordable state-of-the-art LLM (GPT-4o mini), LangChain, and prompt engineering with chunking techniques to analyze nine detailed transcripts following the inductive TA framework. We evaluate the LLM-generated themes against human-generated results using thematic similarity metrics, LLM-assisted assessments, and expert reviews. Results demonstrate that our pipeline outperforms existing LLM-assisted TA methods significantly. While the pipeline alone has not yet reached human-level quality in inductive TA, it shows great potential to improve scalability, efficiency, and accuracy while reducing analyst workload when working collaboratively with domain experts. We provide practical recommendations for incorporating LLMs into high-stakes TA workflows and emphasize the importance of close collaboration with domain experts to address challenges related to real-world applicability and dataset complexity. this https URL
[NLP-3] Learning to Generate Unit Tests for Automated Debugging
[Quick Read]: This paper addresses a trade-off in automated unit test (UT) generation: generating unit test inputs that reveal errors versus correctly predicting unit test outputs. The key is UTGen, which trains LLMs to generate error-revealing unit test inputs together with correct expected outputs from task descriptions and candidate code. UTGen is further integrated into UTDebug, a robust debugging pipeline that uses the generated tests to help LLMs debug code more effectively, scaling test-time compute for UT output prediction and validating and back-tracking edits over multiple generated UTs to mitigate the noisy signals in model-generated tests.
Link: https://arxiv.org/abs/2502.01619
Authors: Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: First two authors contributed equally. Dataset and Code: this https URL
Abstract:Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to a large language model (LLM) as it iteratively debugs faulty code, motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given a faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions and candidate code. We integrate UTGen into UTDebug, a robust debugging pipeline that uses generated tests to help LLMs debug effectively. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), UTDebug (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and back-tracks edits based on multiple generated UTs to avoid overfitting. We show that UTGen outperforms UT generation baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen’s unit tests improves pass@1 accuracy of Qwen-2.5 7B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3% and 12.35% (respectively) over other LLM-based UT generation baselines.
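The validate-and-backtrack control flow can be sketched abstractly. Everything below is stubbed (the real pipeline queries an LLM for candidate edits and generated unit tests); it only illustrates accepting an edit when it passes more generated tests, and backtracking otherwise:

```python
# Hedged sketch of a UTDebug-style validate-and-backtrack loop with stub
# "model outputs": an edit is kept only if it improves the pass count on
# the generated unit tests; otherwise it is discarded (backtracked).

def run_tests(func, unit_tests):
    """Count how many (input, expected_output) pairs the function satisfies."""
    passed = 0
    for args, expected in unit_tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed

# Generated unit tests for an absolute-value task; the last expected output
# is wrong on purpose, mimicking a noisy model-generated test.
unit_tests = [((3,), 3), ((-4,), 4), ((0,), 1)]  # last test is noisy

buggy = lambda x: x                       # fails on negatives
candidate_edits = [
    lambda x: -x,                         # worse edit: fails on positives
    lambda x: x if x >= 0 else -x,        # good edit
]

current, best = buggy, run_tests(buggy, unit_tests)
for edit in candidate_edits:
    score = run_tests(edit, unit_tests)
    if score > best:      # validate: keep only improving edits
        current, best = edit, score
    # else: backtrack (discard the edit)

print(best, current(-7))
```

Note how majority pass counts keep the noisy third test from derailing the search: the good edit still scores strictly higher than both the buggy program and the worse edit.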
[NLP-4] Large Language Models Are Human-Like Internally
[Quick Read]: This paper re-evaluates the cognitive plausibility of large language models (LLMs), which prior work judged low because larger models fit human reading behavior more poorly. The key, drawn from mechanistic interpretability, is that earlier conclusions were skewed by an exclusive focus on the final layers: next-word probabilities from earlier internal layers of larger models align with human sentence-processing data as well as, or better than, those of smaller models. This alignment holds across behavioral measures (self-paced reading times, gaze durations, MAZE task processing times) and neurophysiological measures (N400 brain potentials). The paper also identifies an intriguing correspondence between model layers and human measures: earlier layers align more closely with fast gaze durations, while later layers better match slower signals such as N400 potentials and MAZE processing times. This suggests the cognitive plausibility of larger LMs has been underestimated and opens new avenues at the intersection of mechanistic interpretability and cognitive modeling.
Link: https://arxiv.org/abs/2502.01615
Authors: Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, Timothy Baldwin
Affiliations: MBZUAI; The University of Tokyo; University of Mons; Tohoku University; RIKEN; The University of Melbourne
Subjects: Computation and Language (cs.CL)
Comments: 19 pages
Abstract:Recent cognitive modeling studies have reported that larger language models (LMs) exhibit a poorer fit to human reading behavior, leading to claims of their cognitive implausibility. In this paper, we revisit this argument through the lens of mechanistic interpretability and argue that prior conclusions were skewed by an exclusive focus on the final layers of LMs. Our analysis reveals that next-word probabilities derived from internal layers of larger LMs align with human sentence processing data as well as, or better than, those from smaller LMs. This alignment holds consistently across behavioral (self-paced reading times, gaze durations, MAZE task processing times) and neurophysiological (N400 brain potentials) measures, challenging earlier mixed results and suggesting that the cognitive plausibility of larger LMs has been underestimated. Furthermore, we first identify an intriguing relationship between LM layers and human measures: earlier layers correspond more closely with fast gaze durations, while later layers better align with relatively slower signals such as N400 potentials and MAZE processing times. Our work opens new avenues for interdisciplinary research at the intersection of mechanistic interpretability and cognitive modeling.
[NLP-5] Breaking Focus: Contextual Distraction Curse in Large Language Models
[Quick Read]: This paper addresses Contextual Distraction Vulnerability (CDV) in LLMs: a marked drop in performance when inputs contain semantically coherent but irrelevant context. The key is an efficient tree-based search method that automatically generates CDV examples, which cause an average performance degradation of approximately 45% in state-of-the-art LLMs across four datasets, together with targeted post-training strategies that effectively enhance robustness against contextual distractions.
Link: https://arxiv.org/abs/2502.01609
Authors: Yue Huang, Yanbo Wang, Zixiang Xu, Chujie Gao, Siyuan Wu, Jiayi Ye, Xiuying Chen, Pin-Yu Chen, Xiangliang Zhang
Affiliations: University of Notre Dame; MBZUAI; MBZUAI; MBZUAI; Independent Researcher; Independent Researcher; MBZUAI; IBM Research; University of Notre Dame
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent advances in Large Language Models (LLMs) have revolutionized generative systems, achieving excellent performance across diverse domains. Although these models perform well in controlled environments, their real-world applications frequently encounter inputs containing both essential and irrelevant details. Our investigation has revealed a critical vulnerability in LLMs, which we term Contextual Distraction Vulnerability (CDV). This phenomenon arises when models fail to maintain consistent performance on questions modified with semantically coherent but irrelevant context. To systematically investigate this vulnerability, we propose an efficient tree-based search methodology to automatically generate CDV examples. Our approach successfully generates CDV examples across four datasets, causing an average performance degradation of approximately 45% in state-of-the-art LLMs. To address this critical issue, we explore various mitigation strategies and find that post-targeted training approaches can effectively enhance model robustness against contextual distractions. Our findings highlight the fundamental nature of CDV as an ability-level challenge rather than a knowledge-level issue since models demonstrate the necessary knowledge by answering correctly in the absence of distractions. This calls the community’s attention to address CDV during model development to ensure reliability. The code is available at this https URL.
[NLP-6] FutureVision: A methodology for the investigation of future cognition
[Quick Read]: This paper investigates the cognitive effort involved in understanding communication about future scenarios, proposing a methodology that combines multimodal semantic analysis with an eye-tracking experimental protocol. The key is recording participants' gaze patterns while they evaluate fictional ads depicting future scenarios, then analyzing those patterns against semantic representations of the stimuli and of the participants' descriptions, revealing that different types of future scenarios induce different cognitive loads.
Link: https://arxiv.org/abs/2502.01597
Authors: Tiago Timponi Torrent, Mark Turner, Nicolás Hinrichs, Frederico Belcavello, Igor Lourenço, Arthur Lorenzi Almeida, Marcelo Viridiano, Ely Edison Matos
Affiliations: Federal University of Juiz de Fora; CNPq; Federal University of Uberlândia; Case Western Reserve University; FrameNet Brasil
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper presents a methodology combining multimodal semantic analysis with an eye-tracking experimental protocol to investigate the cognitive effort involved in understanding the communication of future scenarios. To demonstrate the methodology, we conduct a pilot study examining how visual fixation patterns vary during the evaluation of valence and counterfactuality in fictional ad pieces describing futuristic scenarios, using a portable eye tracker. Participants' eye movements are recorded while evaluating the stimuli and describing them to a conversation partner. Gaze patterns are analyzed alongside semantic representations of the stimuli and participants' descriptions, constructed from a frame semantic annotation of both linguistic and visual modalities. Preliminary results show that far-future and pessimistic scenarios are associated with longer fixations and more erratic saccades, supporting the hypothesis that fractures in the base spaces underlying the interpretation of future scenarios increase cognitive load for comprehenders.
[NLP-7] ReGLA: Refining Gated Linear Attention NAACL2025
[Quick Read]: LLMs excel at complex language modeling but incur heavy compute and storage costs due to the quadratic complexity of softmax attention; linear attention was designed to reduce the quadratic space-time complexity of standard transformers. The key of this paper is a comprehensive study of three components of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. It proposes a feature-mapping function that fixes issues overlooked by prior work, argues for integrating normalization layers to stabilize training, and augments the gating mechanism with a refining module to address its saturation.
Link: https://arxiv.org/abs/2502.01578
Authors: Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Boxing Chen, Philippe Langlais
Affiliations: DIRO, Université de Montréal; Huawei Noah's Ark Lab
Subjects: Computation and Language (cs.CL)
Comments: Accepted by NAACL 2025 (main)
Abstract:Recent advancements in Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic computation complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address some crucial issues that previous suggestions overlooked. Then we offered further rationale for the integration of normalization layers to stabilize the training process. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. We conducted extensive experiments and showed our architecture outperforms previous Gated Linear Attention mechanisms in extensive tasks including training from scratch and post-linearization with continual pre-training.
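For readers unfamiliar with the mechanism being refined, a gated linear attention step is the recurrence S_t = g_t * S_{t-1} + k_t^T v_t with output o_t = q_t S_t. The sketch below uses a scalar sigmoid gate and an identity feature map purely for illustration; the paper's contributions lie precisely in richer feature maps, normalization, and a refined gate:

```python
# Toy gated linear attention: the state S accumulates k^T v outer products,
# decayed by a gate in (0, 1), giving linear-time sequence processing.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_linear_attention(qs, ks, vs, gates):
    d = len(qs[0])
    S = [[0.0] * len(vs[0]) for _ in range(d)]   # running state, d x d_v
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gates):
        gt = sigmoid(g)
        for i in range(d):                        # S = gt * S + k^T v
            for j in range(len(v)):
                S[i][j] = gt * S[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * S[i][j] for i in range(d))
                        for j in range(len(v))])  # o = q S
    return outputs

qs = [[1.0, 0.0], [0.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0]]
vs = [[2.0, 0.0], [0.0, 3.0]]
outs = gated_linear_attention(qs, ks, vs, gates=[0.0, 0.0])
print(outs)
```

Because the state has fixed size d x d_v, cost per token is constant, in contrast to softmax attention's growth with sequence length; the gate is what lets the model forget stale context.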
[NLP-8] Visual Theory of Mind Enables the Invention of Writing Systems
[Quick Read]: This paper studies the origin and evolution of early writing systems, whose earliest forms consisted of iconic pictographs. The key is a multi-agent reinforcement-learning testbed called the Signification Game, in which agents leverage visual theory of mind for inferential communication, using pictographs to convey actions. The model sheds light on the cognitive and cultural processes that led to the development of early writing systems and informs our understanding of human and animal cognition.
Link: https://arxiv.org/abs/2502.01568
Authors: Benjamin A. Spiegel, Lucas Gelfond, George Konidaris
Affiliations: Department of Computer Science, Brown University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: In submission to CogSci 2025
Abstract:Abstract symbolic writing systems are semiotic codes that are ubiquitous in modern society but are otherwise absent in the animal kingdom. Anthropological evidence suggests that the earliest forms of some writing systems originally consisted of iconic pictographs, which signify their referent via visual resemblance. While previous studies have examined the emergence and, separately, the evolution of pictographic writing systems through a computational lens, most employ non-naturalistic methodologies that make it difficult to draw clear analogies to human and animal cognition. We develop a multi-agent reinforcement learning testbed for emergent communication called a Signification Game, and formulate a model of inferential communication that enables agents to leverage visual theory of mind to communicate actions using pictographs. Our model, which is situated within a broader formalism for animal communication, sheds light on the cognitive and cultural processes that led to the development of early writing systems.
[NLP-9] Scalable Language Models with Posterior Inference of Latent Thought Vectors
[Quick Read]: This paper addresses the limited sample and parameter efficiency of conventional language models. It proposes Latent-Thought Language Models (LTMs), whose key is explicit latent thought vectors that follow an explicit prior model and guide the autoregressive generation of ground tokens through a Transformer decoder. LTMs are trained with a dual-rate optimization procedure within the classical variational Bayes framework: fast learning of local variational parameters and slow learning of global decoder parameters. This design lets LTMs surpass conventional autoregressive models and discrete diffusion models in sample and parameter efficiency, and exhibit emergent few-shot in-context reasoning that scales with model and latent size.
Link: https://arxiv.org/abs/2502.01567
Authors: Deqian Kong, Minglu Zhao, Dehong Xu, Bo Pang, Shu Wang, Edouardo Honig, Zhangzhang Si, Chuan Li, Jianwen Xie, Sirui Xie, Ying Nian Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We propose a novel family of language models, Latent-Thought Language Models (LTMs), which incorporate explicit latent thought vectors that follow an explicit prior model in latent space. These latent thought vectors guide the autoregressive generation of ground tokens through a Transformer decoder. Training employs a dual-rate optimization process within the classical variational Bayes framework: fast learning of local variational parameters for the posterior distribution of latent vectors, and slow learning of global decoder parameters. Empirical studies reveal that LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. Higher sample efficiency can be achieved by increasing training compute per token, with further gains possible by trading model size for more inference steps. Designed based on these scaling properties, LTMs demonstrate superior sample and parameter efficiency compared to conventional autoregressive models and discrete diffusion models. They significantly outperform these counterparts in validation perplexity and zero-shot language modeling. Additionally, LTMs exhibit emergent few-shot in-context reasoning capabilities that scale with model and latent size, and achieve competitive performance in conditional and unconditional text generation.
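The dual-rate idea, fast updates for per-sample variational parameters and slow updates for shared decoder weights, can be illustrated on a toy quadratic objective. This is a generic two-timescale gradient sketch, not the paper's training code; the objective and learning rates are invented for illustration:

```python
# Two-timescale gradient descent: the "fast" variable z plays the role of
# local variational parameters, the "slow" variable w the shared decoder
# weights. Toy objective: f(z, w) = (z - w)^2 + w^2, minimized at z = w = 0.

def dual_rate_step(fast, slow, grad_fast, grad_slow,
                   lr_fast=0.5, lr_slow=0.01):
    return fast - lr_fast * grad_fast, slow - lr_slow * grad_slow

z, w = 5.0, 1.0
for _ in range(100):
    gz = 2 * (z - w)            # df/dz: fast inner fit of z toward w
    gw = -2 * (z - w) + 2 * w   # df/dw: slow drift of w
    z, w = dual_rate_step(z, w, gz, gw)

print(round(z, 3), round(w, 3))
```

With the large fast rate, z snaps onto the current w in a single step, after which the slow variable drifts toward the optimum; this separation of timescales is the point of the dual-rate scheme.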
[NLP-10] Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding
[Quick Read]: This paper investigates the concentrated massive values that emerge in the query (Q) and key (K) representations of attention in LLMs and analyzes their role. The key findings are that these massive values concentrate in Q and K but not in values (V), and that they are critical for interpreting contextual knowledge rather than for retrieving parametric knowledge stored in model parameters. A study of quantization strategies further shows that ignoring these massive values markedly degrades performance on tasks requiring rich contextual understanding. Finally, the paper traces the concentration to Rotary Positional Encoding (RoPE), present from the first layers, offering a new perspective on how Q and K operate in LLMs and practical insights for model design and optimization.
Link: https://arxiv.org/abs/2502.01563
Authors: Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang
Affiliations: Rutgers University; Carnegie Mellon University; New Jersey Institute of Technology; University of Minnesota
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show that concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K) while not having such patterns in values (V) in various modern transformer-based LLMs (Q, K, and V mean the representations output by the query, key, and value layers respectively). Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model’s parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE), which has appeared since the first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The Code is Available at this https URL.
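A minimal way to flag "massive values" in an activation vector is to compare each dimension's magnitude against a robust baseline. This is illustrative only (the paper measures real Q/K activations across transformer layers, and the 10x threshold here is an arbitrary choice); the median is used rather than the mean so that the outlier itself does not inflate the baseline:

```python
# Flag dimensions whose magnitude dwarfs the typical magnitude of the
# vector, using the median of absolute values as a robust baseline.

import statistics

def massive_dims(vector, ratio=10.0):
    med = statistics.median(abs(x) for x in vector)
    return [i for i, x in enumerate(vector) if abs(x) > ratio * med]

q = [0.1, -0.2, 0.15, 50.0, 0.05, -0.1, 0.2, 0.12]  # dim 3 is massive
print(massive_dims(q))  # → [3]
```

The same per-dimension view explains the quantization finding: a scheme that clips or coarsely buckets these few huge Q/K coordinates destroys exactly the signal the paper ties to contextual understanding.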
[NLP-11] What is a Number That a Large Language Model May Know It?
[Quick Read]: This paper examines a challenge LLMs face with digit sequences: depending on context, the same sequence can be read as a string or as a number. The key finding is that models learn representational spaces blending string-like and numerical representations, producing entanglement in the latent embeddings; this entanglement is modulated by context but never fully eliminated, and can propagate into realistic decision scenarios. Using a similarity-based prompting technique, the paper shows that elicited similarity judgments over integer pairs are captured by a combination of Levenshtein edit distance and numerical log-linear distance, evidencing the entangled representation.
Link: https://arxiv.org/abs/2502.01540
Authors: Raja Marjieh, Veniamin Veselovsky, Thomas L. Griffiths, Ilia Sucholutsky
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures
Abstract:Numbers are a basic part of how humans represent and describe the world around them. As a consequence, learning effective representations of numbers is critical for the success of large language models as they become more integrated into everyday decisions. However, these models face a challenge: depending on context, the same sequence of digit tokens, e.g., 911, can be treated as a number or as a string. What kind of representations arise from this duality, and what are its downstream implications? Using a similarity-based prompting technique from cognitive science, we show that LLMs learn representational spaces that blend string-like and numerical representations. In particular, we show that elicited similarity judgments from these models over integer pairs can be captured by a combination of Levenshtein edit distance and numerical Log-Linear distance, suggesting an entangled representation. In a series of experiments we show how this entanglement is reflected in the latent embeddings, how it can be reduced but not entirely eliminated by context, and how it can propagate into a realistic decision scenario. These results shed light on a representational tension in transformer models that must learn what a number is from text input.
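The two distances whose combination is said to capture the models' similarity judgments are both easy to compute. The equal-weight `blended` mix below is an invented placeholder; the paper fits the combination to the elicited judgments:

```python
# String-side distance (Levenshtein over digit strings) versus number-side
# distance (difference on a log scale), and a toy weighted blend.

import math

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def log_linear(x: int, y: int) -> float:
    """Numerical distance on a logarithmic scale."""
    return abs(math.log(x) - math.log(y))

def blended(x: int, y: int, w: float = 0.5) -> float:
    return w * levenshtein(str(x), str(y)) + (1 - w) * log_linear(x, y)

# 911 vs 119: string-wise close (same digits rearranged), numerically far.
print(levenshtein("911", "119"), round(log_linear(911, 119), 3))
```

Pairs like (911, 119) are exactly where the two metrics pull apart, which is what lets a fitted blend diagnose how much of each representation a model is using.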
[NLP-12] VisTA: Vision-Text Alignment Model with Contrastive Learning using Multimodal Data for Evidence-Driven Reliable and Explainable Alzheimer's Disease Diagnosis
[Quick Read]: This paper tackles the clinical importance and difficulty of assessing Alzheimer's disease (AD) from high-dimensional medical images. The key is VisTA, a multimodal language-vision model optimized with contrastive learning to improve disease-prediction accuracy and evidence-driven, interpretable explanations in support of clinical decision-making.
Link: https://arxiv.org/abs/2502.01535
Authors: Duy-Cat Can, Linh D. Dang, Quang-Huy Tang, Dang Minh Ly, Huong Ha, Guillaume Blanc, Oliver Y. Chén, Binh T. Nguyen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Objective: Assessing Alzheimer’s disease (AD) using high-dimensional radiology images is clinically important but challenging. Although Artificial Intelligence (AI) has advanced AD diagnosis, it remains unclear how to design AI models embracing predictability and explainability. Here, we propose VisTA, a multimodal language-vision model assisted by contrastive learning, to optimize disease prediction and evidence-based, interpretable explanations for clinical decision-making. Methods: We developed VisTA (Vision-Text Alignment Model) for AD diagnosis. Architecturally, we built VisTA from BiomedCLIP and fine-tuned it using contrastive learning to align images with verified abnormalities and their descriptions. To train VisTA, we used a constructed reference dataset containing images, abnormality types, and descriptions verified by medical experts. VisTA produces four outputs: predicted abnormality type, similarity to reference cases, evidence-driven explanation, and final AD diagnoses. To illustrate VisTA’s efficacy, we reported accuracy metrics for abnormality retrieval and dementia prediction. To demonstrate VisTA’s explainability, we compared its explanations with human experts’ explanations. Results: Compared to 15 million images used for baseline pretraining, VisTA only used 170 samples for fine-tuning and obtained significant improvement in abnormality retrieval and dementia prediction. For abnormality retrieval, VisTA reached 74% accuracy and an AUC of 0.87 (26% and 0.74, respectively, from baseline models). For dementia prediction, VisTA achieved 88% accuracy and an AUC of 0.82 (30% and 0.57, respectively, from baseline models). The generated explanations agreed strongly with human experts’ and provided insights into the diagnostic process. Taken together, VisTA optimizes prediction, clinical reasoning, and explanation.
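The contrastive-alignment objective underlying this kind of fine-tuning can be sketched in miniature. Toy 2-D embeddings and an invented candidate set stand in for real batched image-text pairs with learned encoders:

```python
# CLIP-style contrastive loss for one image against candidate texts:
# -log softmax of the matching text's (scaled) cosine similarity.

import math

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_loss(img, pos_txt, neg_txts, temp=0.1):
    logits = [cos(img, pos_txt) / temp] + [cos(img, t) / temp for t in neg_txts]
    m = max(logits)                                  # for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

img = [1.0, 0.0]
loss_aligned = contrastive_loss(img, [0.9, 0.1], [[0.0, 1.0]])
loss_misaligned = contrastive_loss(img, [0.0, 1.0], [[0.9, 0.1]])
print(loss_aligned < loss_misaligned)  # aligned pair → lower loss
```

Minimizing this loss pulls an image's embedding toward its verified abnormality description and away from the others, which is what lets retrieval over reference cases work with so few fine-tuning samples.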
[NLP-13] Preference Leakage: A Contamination Problem in LLM-as-a-judge
[Quick Read]: This paper addresses preference leakage in LLM-as-a-judge: contamination arising from relatedness between synthetic-data generators and LLM-based evaluators, which biases evaluation and thus model training and assessment. The key is defining and validating three common types of relatedness, being the same model, having an inheritance relationship, and belonging to the same model family, and empirically confirming the resulting judge bias through extensive experiments. The results show that preference leakage is a pervasive problem, harder to detect than previously identified biases, and in need of further attention.
Link: https://arxiv.org/abs/2502.01534
Authors: Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 8 figures
Abstract:Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between data generator LLM and judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive issue that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: this https URL.
[NLP-14] The in-context inductive biases of vision-language models differ across modalities
[Quick Read]: This paper investigates how foundation models use inductive biases to generalize during in-context learning, and how that generalization differs depending on the modality (vision vs. text) in which stimuli are presented. Across three experimental paradigms and three vision-language models, the study finds a general bias toward generalizing by shape over color, amplified when examples are presented visually; when examples are presented in text, the ordering of adjectives affects generalization, though the extent of these effects varies across models and paradigms. These findings help reveal how vision-language models represent different input types in context and may have practical implications for their use.
Link: https://arxiv.org/abs/2502.01530
Authors: Kelsey Allen, Ishita Dasgupta, Eliza Kosoy, Andrew K. Lampinen
Affiliations: Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages
Abstract:Inductive biases are what allow learners to make guesses in the absence of conclusive evidence. These biases have often been studied in cognitive science using concepts or categories – e.g. by testing how humans generalize a new category from a few examples that leave the category boundary ambiguous. We use these approaches to study generalization in foundation models during in-context learning. Modern foundation models can condition on both vision and text, and differences in how they interpret and learn from these different modalities is an emerging area of study. Here, we study how their generalizations vary by the modality in which stimuli are presented, and the way the stimuli are described in text. We study these biases with three different experimental paradigms, across three different vision-language models. We find that the models generally show some bias towards generalizing according to shape over color. This shape bias tends to be amplified when the examples are presented visually. By contrast, when examples are presented in text, the ordering of adjectives affects generalization. However, the extent of these effects vary across models and paradigms. These results help to reveal how vision-language models represent different types of inputs in context, and may have practical implications for the use of vision-language models.
[NLP-15] CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
[Quick Read]: This paper addresses LLM hallucinations on ambiguous questions in question answering (QA). The key is a set of condition-aware evaluation metrics and the Conditional Ambiguous Question-Answering (CondAmbigQA) benchmark of 200 ambiguous queries, built with a retrieval-based annotation strategy that uses Wikipedia fragments to identify possible interpretations of a query as its "conditions", minimizing the human bias introduced by annotators' differing knowledge levels. Experiments show that models which consider conditions before answering improve performance by 20%, with a further 5% gain when conditions are explicitly provided, underscoring the value of conditional reasoning in QA and giving researchers tools to rigorously evaluate ambiguity resolution.
Link: https://arxiv.org/abs/2502.01523
Authors: Zongxi Li, Yang Li, Haoran Xie, S. Joe Qin
Affiliations: School of Data Science, Lingnan University, Hong Kong SAR; School of Science and Technology, Hong Kong Metropolitan University, Hong Kong SAR
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are prone to hallucinations in question-answering (QA) tasks when faced with ambiguous questions. Users often assume that LLMs share their cognitive alignment, a mutual understanding of context, intent, and implicit details, leading them to omit critical information in the queries. However, LLMs generate responses based on assumptions that can misalign with user intent; such responses may be perceived as hallucinations. Therefore, identifying those implicit assumptions is crucial to resolve ambiguities in QA. Prior work, such as AmbigQA, reduces ambiguity in queries via human-annotated clarifications, which is not feasible in real applications. Meanwhile, ASQA compiles AmbigQA’s short answers into long-form responses but inherits human biases and fails to capture explicit logical distinctions that differentiate the answers. We introduce Conditional Ambiguous Question-Answering (CondAmbigQA), a benchmark with 200 ambiguous queries and condition-aware evaluation metrics. Our study pioneers the concept of “conditions” in ambiguous QA tasks, where conditions stand for contextual constraints or assumptions that resolve ambiguities. The retrieval-based annotation strategy uses retrieved Wikipedia fragments to identify possible interpretations for a given query as its conditions and annotate the answers through those conditions. Such a strategy minimizes human bias introduced by different knowledge levels among annotators. By fixing retrieval results, CondAmbigQA evaluates how RAG systems leverage conditions to resolve ambiguities. Experiments show that models considering conditions before answering improve performance by 20%, with an additional 5% gain when conditions are explicitly provided. These results underscore the value of conditional reasoning in QA, offering researchers tools to rigorously evaluate ambiguity resolution.
zh
[NLP-16] Hybrid Machine Learning Model for Detecting Bangla Smishing Text Using BERT and Character-Level CNN CEC
[Quick Read]: This paper addresses the problem of SMS fraud (smishing) in the Bangla-language setting. The key to the solution is a novel hybrid machine learning model that combines Bidirectional Encoder Representations from Transformers (BERT) with Convolutional Neural Networks (CNNs) for enhanced character-level analysis, distinguishing between Normal, Promotional, and Smishing SMS via multi-class classification. Unlike traditional binary classification approaches, the model fuses BERT's contextual embeddings with CNN's character-level features, improving detection accuracy. An attention mechanism further strengthens the model's ability to prioritize the crucial segments of each text.
Link: https://arxiv.org/abs/2502.01518
Authors: Gazi Tanbhir, Md. Farhan Shahriyar, Khandker Shahed, Abdullah Md Raihan Chy, Md Al Adnan
Affiliations: World University of Bangladesh; Jashore University of Science and Technology; Southern University Bangladesh; Institute of Information Technology, Noakhali Science and Technology University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Conference Name: 13th International Conference on Electrical and Computer Engineering (ICECE 2024)
Click to view abstract
Abstract:Smishing is a social engineering attack using SMS containing malicious content to deceive individuals into disclosing sensitive information or transferring money to cybercriminals. Smishing attacks have surged by 328%, posing a major threat to mobile users, with losses exceeding $54.2 million in 2019. Despite its growing prevalence, the issue remains significantly under-addressed. This paper presents a novel hybrid machine learning model for detecting Bangla smishing texts, combining Bidirectional Encoder Representations from Transformers (BERT) with Convolutional Neural Networks (CNNs) for enhanced character-level analysis. Our model addresses multi-class classification by distinguishing between Normal, Promotional, and Smishing SMS. Unlike traditional binary classification methods, our approach integrates BERT's contextual embeddings with CNN's character-level features, improving detection accuracy. Enhanced by an attention mechanism, the model effectively prioritizes crucial text segments. Our model achieves 98.47% accuracy, outperforming traditional classifiers, with high precision and recall in Smishing detection, and strong performance across all categories.
zh
[NLP-17] Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation
[Quick Read]: This paper investigates how instance-level memorization in a teacher Neural Machine Translation (NMT) model is inherited by the student model during sequence-level knowledge distillation (SeqKD). The study finds that although students never directly see the original training data, they memorize more than baseline models (3.4% more exact matches and 57% more extractive memorization) and show increased hallucination rates. The paper further characterizes student behavior on specific training-data subgroups, in particular low-quality subgroups and subgroups with specific counterfactual memorization (CM) scores, finding that students exhibit amplified denoising on low-quality subgroups. To address these issues, the paper proposes a modification to SeqKD named Adaptive-SeqKD, which intervenes in SeqKD to reduce memorization and hallucinations. Overall, the paper recommends caution when applying SeqKD: students inherit both their teachers' strengths and their fault modes, requiring active monitoring.
Link: https://arxiv.org/abs/2502.01491
Authors: Verna Dankers, Vikas Raunak
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:In this work, we explore how instance-level memorization in the teacher Neural Machine Translation (NMT) model gets inherited by the student model in sequence-level knowledge distillation (SeqKD). We find that despite not directly seeing the original training data, students memorize more than baseline models (models of the same size, trained on the original data) – 3.4% for exact matches and 57% for extractive memorization – and show increased hallucination rates. Further, under this SeqKD setting, we also characterize how students behave on specific training data subgroups, such as subgroups with low quality and specific counterfactual memorization (CM) scores, and find that students exhibit amplified denoising on low-quality subgroups. Finally, we propose a modification to SeqKD named Adaptive-SeqKD, which intervenes in SeqKD to reduce memorization and hallucinations. Overall, we recommend caution when applying SeqKD: students inherit both their teachers’ superior performance and their fault modes, thereby requiring active monitoring.
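The two memorization rates quoted above can be made concrete with a small sketch. This is illustrative code, not the authors' implementation: the criteria here (verbatim equality for exact-match memorization, and copying a contiguous span of at least five tokens from the training target for extractive memorization) are simplifying assumptions.

```python
# Illustrative measurement of two memorization signals for student outputs
# against the teacher's training targets (not the paper's code).

def exact_match_rate(outputs, references):
    """Fraction of outputs identical to the corresponding training target."""
    hits = sum(o == r for o, r in zip(outputs, references))
    return hits / len(outputs)

def extractive_rate(outputs, references, min_len=5):
    """Fraction of outputs that copy a contiguous span of >= min_len tokens
    verbatim from the training target -- a rough proxy for 'extractive
    memorization' (the threshold is an assumption)."""
    def copies_span(out, ref):
        ref_tokens = ref.split()
        for i in range(len(ref_tokens) - min_len + 1):
            if " ".join(ref_tokens[i:i + min_len]) in out:
                return True
        return False
    hits = sum(copies_span(o, r) for o, r in zip(outputs, references))
    return hits / len(outputs)
```

In practice both rates would be compared between the distilled student and a same-size baseline trained on the original data.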
zh
[NLP-18] Explaining Context Length Scaling and Bounds for Language Models
[Quick Read]: This paper aims to improve understanding of how long context affects language model performance. The key contribution is a clean and effective theoretical framework, from an intrinsic-space perspective, explaining the impact of context length on language modeling, validated through experiments on natural language and synthetic data. The framework yields practical insights, such as establishing that training dataset size dictates an optimal context length and bounding context-length scaling in certain cases.
Link: https://arxiv.org/abs/2502.01481
Authors: Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang, Serge Belongie, Lei Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 19 pages, 14 figures
Click to view abstract
Abstract:Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context can harm performance, while others experimentally summarize the loss reduction from relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose a clean and effective theoretical framework for explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling in certain cases. We hope our work may inspire new long context Language Models, as well as future work studying Physics for Language Models. Code for our experiments is available at this url: this https URL.
zh
[NLP-19] FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model
[Quick Read]: This paper addresses the safety risk that large language models (LLMs) may inadvertently encode sensitive or harmful information. The key to the solution is FALCON (Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment), a representation-guided unlearning method that uses information-theoretic guidance for efficient parameter selection, employs contrastive mechanisms to enhance representation separation, and projects conflicting gradients onto orthogonal subspaces to resolve the conflict between forgetting and retention objectives, achieving more precise knowledge separation and stronger unlearning while preserving model utility.
Link: https://arxiv.org/abs/2502.01472
Authors: Jinwei Hu, Zhenglin Huang, Xiangyu Yin, Wenjie Ruan, Guangliang Cheng, Yi Dong, Xiaowei Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review
Click to view abstract
Abstract:Large language models have been widely applied, but can inadvertently encode sensitive or harmful information, raising significant safety concerns. Machine unlearning has emerged to alleviate this concern; however, existing training-time unlearning approaches, relying on coarse-grained loss combinations, have limitations in precisely separating knowledge and balancing removal effectiveness with model utility. In contrast, we propose Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment (FALCON), a novel representation-guided unlearning approach that leverages information-theoretic guidance for efficient parameter selection, employs contrastive mechanisms to enhance representation separation, and projects conflict gradients onto orthogonal subspaces to resolve conflicts between forgetting and retention objectives. Extensive experiments demonstrate that FALCON achieves superior unlearning effectiveness while maintaining model utility, exhibiting robust resistance against knowledge recovery attempts.
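The gradient-projection step named in the abstract can be sketched in isolation. This is a minimal illustration of the generic operation (removing from the forgetting gradient its component along the retention gradient so the two objectives stop conflicting), not FALCON's full procedure; representing gradients as flat lists of floats is a simplification.

```python
# Minimal sketch of orthogonal gradient projection -- the generic operation
# behind "projecting conflict gradients onto orthogonal subspaces".

def project_out(g_forget, g_retain):
    """Remove from g_forget its component along g_retain, so the forgetting
    update no longer directly opposes the retention objective."""
    dot = sum(a * b for a, b in zip(g_forget, g_retain))
    norm_sq = sum(b * b for b in g_retain)
    if norm_sq == 0.0:          # no retention direction to protect
        return list(g_forget)
    scale = dot / norm_sq
    return [a - scale * b for a, b in zip(g_forget, g_retain)]
```

The projected gradient is orthogonal to the retention gradient by construction, which is the property the method relies on.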
zh
[NLP-20] Process Reinforcement through Implicit Rewards
[Quick Read]: This paper tackles the challenges of replacing sparse outcome-level rewards with dense process rewards in the inference-time scaling of large language models (LLMs). The key is the proposed PRIME method, which updates process reward models (PRMs) online using only policy rollouts and outcome labels, avoiding the prohibitive cost of collecting high-quality process labels and reducing development overhead. This approach markedly improves performance on complex reasoning tasks, with significant gains across several benchmarks.
Link: https://arxiv.org/abs/2502.01456
Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
Affiliations: Tsinghua University; Shanghai AI Lab; University of Illinois Urbana-Champaign; Peking University; Shanghai Jiaotong University; CUHK
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages. Model, code, and data available at this https URL
Click to view abstract
Abstract:Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels, through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
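The "implicit process rewards" PRIME builds on come from the implicit-PRM line of work, where per-token rewards can be read off as a scaled log-probability ratio between the reward model and a reference model. A toy sketch under that formulation (the function name and the beta value are illustrative, not taken from the paper):

```python
# Hedged sketch of implicit process rewards: a PRM trained only on outcome
# labels yields token-level rewards as beta * (log pi - log pi_ref).
# Everything beyond this formula (names, beta) is an assumption.

def implicit_process_rewards(logp_model, logp_ref, beta=0.05):
    """Per-token implicit rewards from two parallel lists of token
    log-probabilities (reward model vs. frozen reference model)."""
    return [beta * (lm - lr) for lm, lr in zip(logp_model, logp_ref)]
```

A token where the reward model assigns higher probability than the reference receives a positive step-level reward, giving dense credit assignment without any process labels.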
zh
[NLP-21] Towards Safer Chatbots: A Framework for Policy Compliance Evaluation of Custom GPTs
[Quick Read]: This paper addresses the safety and compliance risks of customized large language models (Custom GPTs) in real-world use. The key to the solution is a scalable automated evaluation framework built on three core components: (1) automated discovery and data collection of models from the GPT Store; (2) a red-teaming prompt generator tailored to specific policy categories and the characteristics of each target GPT; and (3) an LLM-as-a-judge technique that analyzes each prompt-response pair for potential policy violations. The framework is validated through a large-scale study and reveals substantial non-compliance among these models.
Link: https://arxiv.org/abs/2502.01436
Authors: David Rodriguez, William Seymour, Jose M. Del Alamo, Jose Such
Affiliations: ETSI Telecomunicación, Universidad Politécnica de Madrid; King's College London; VRAIN, Universitat Politècnica de València
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have gained unprecedented prominence, achieving widespread adoption across diverse domains and integrating deeply into society. The capability to fine-tune general-purpose LLMs, such as Generative Pre-trained Transformers (GPT), for specific tasks has facilitated the emergence of numerous Custom GPTs. These tailored models are increasingly made available through dedicated marketplaces, such as OpenAI's GPT Store. However, their black-box nature introduces significant safety and compliance risks. In this work, we present a scalable framework for the automated evaluation of Custom GPTs against OpenAI's usage policies, which define the permissible behaviors of these systems. Our framework integrates three core components: (1) automated discovery and data collection of models from the GPT store, (2) a red-teaming prompt generator tailored to specific policy categories and the characteristics of each target GPT, and (3) an LLM-as-a-judge technique to analyze each prompt-response pair for potential policy violations. We validate our framework with a manually annotated ground truth, and evaluate it through a large-scale study with 782 Custom GPTs across three categories: Romantic, Cybersecurity, and Academic GPTs. Our manual annotation process achieved an F1 score of 0.975 in identifying policy violations, confirming the reliability of the framework's assessments. The results reveal that 58.7% of the analyzed models exhibit indications of non-compliance, exposing weaknesses in the GPT store's review and approval processes. Furthermore, our findings indicate that a model's popularity does not correlate with compliance, and non-compliance issues largely stem from behaviors inherited from base models rather than user-driven customizations. We believe this approach is extendable to other chatbot platforms and policy domains, improving the safety of LLM-based systems.
zh
[NLP-22] Emergent Stack Representations in Modeling Counter Languages Using Transformers
[Quick Read]: This paper investigates the inner workings of the Transformer architecture in learning formal languages by training models on counter languages. The key to the approach is to analyze Transformer models trained on counter languages, which can be equivalently formulated using stacks whose depths correspond to the counter values, and to probe the models' internal representations for the stack depth at each input token, revealing whether these models form stack-like internal representations. This brings us closer to understanding the algorithmic details of how Transformers learn algorithmic languages and aids circuit discovery.
Link: https://arxiv.org/abs/2502.01432
Authors: Utkarsh Tiwari, Aviral Gupta, Michael Hahn
Affiliations: Birla Institute of Technology and Science (BITS) Pilani; Saarland University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Transformer architectures are the backbone of most modern language models, but understanding the inner workings of these models still largely remains an open problem. One way that research in the past has tackled this problem is by isolating the learning capabilities of these architectures by training them over well-understood classes of formal languages. We extend this literature by analyzing models trained over counter languages, which can be modeled using counter variables. We train transformer models on 4 counter languages, and equivalently formulate these languages using stacks, whose depths can be understood as the counter values. We then probe their internal representations for stack depths at each input token to show that these models when trained as next token predictors learn stack-like representations. This brings us closer to understanding the algorithmic details of how transformers learn languages and helps in circuit discovery.
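The probing target is easy to state concretely: for a counter language such as Dyck-1 (balanced brackets), the "stack depth" at each token is just a running counter. A toy sketch (illustrative only; the paper trains on four counter languages, not necessarily this one):

```python
# Toy illustration of the probing target: the counter value ("stack depth")
# after consuming each token of a Dyck-1 string.

def stack_depths(s, push="(", pop=")"):
    """Running counter value after each token of the input string."""
    depth, depths = 0, []
    for ch in s:
        if ch == push:
            depth += 1
        elif ch == pop:
            depth -= 1
        depths.append(depth)
    return depths
```

These per-token depths are the labels a linear probe would be trained to recover from the model's hidden states.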
zh
[NLP-23] Originality in scientific titles and abstracts can predict citation count
[Quick Read]: This paper studies the originality of scientific literature by applying Divergent Semantic Integration (DSI), a computational measure correlating with originality, to 99,557 scientific abstracts and titles from the Web of Science. The study finds statistically significant differences in DSI across subjects and fields of research, and a slight rise in DSI over time. The key contribution is modeling citation counts with DSI and establishing a statistically significant positive correlation, with an adjusted R^2 of 0.13.
Link: https://arxiv.org/abs/2502.01417
Authors: Jack H. Culbert, Yoed N. Kenett, Philipp Mayr
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: 6 pages, 3 figures, submitted to ISSI 2025; research-in-progress paper
Click to view abstract
Abstract:In this research-in-progress paper, we apply a computational measure correlating with originality from creativity science: Divergent Semantic Integration (DSI), to a selection of 99,557 scientific abstracts and titles selected from the Web of Science. We observe statistically significant differences in DSI between subject and field of research, and a slight rise in DSI over time. We model the base 10 logarithm of the citation count after 5 years with DSI and find a statistically significant positive correlation in all fields of research with an adjusted R^2 of 0.13.
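The citation model is a simple regression of log10(citation count after 5 years) on DSI. The sketch below fits such a line on made-up numbers purely to show the shape of the analysis; the real data and the reported adjusted R^2 of 0.13 come from the paper's Web of Science sample, not from this toy.

```python
import math

# Ordinary least squares of log10(citations) on DSI, on synthetic data.

def fit_ols(x, y):
    """Slope and intercept of a simple least-squares regression line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

dsi = [0.70, 0.75, 0.80, 0.85, 0.90]              # made-up DSI scores
log_cites = [math.log10(c) for c in [5, 8, 10, 14, 20]]  # made-up counts
slope, intercept = fit_ols(dsi, log_cites)         # positive slope expected
```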
zh
[NLP-24] GRADIEND: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models
[Quick Read]: This paper addresses gender bias in AI systems, particularly in Transformer-based language models. The key to the solution is a novel encoder-decoder approach that leverages model gradients to learn a single monosemantic feature neuron encoding gender information, which can be used to debias the model effectively while preserving its other capabilities.
Link: https://arxiv.org/abs/2502.01406
Authors: Jonathan Drechsel, Steffen Herbold
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:AI systems frequently exhibit and amplify social biases, including gender bias, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a single monosemantic feature neuron encoding gender information. We show that our method can be used to debias transformer-based language models, while maintaining other capabilities. We demonstrate the effectiveness of our approach across multiple encoder-only based models and highlight its potential for broader applications.
zh
[NLP-25] AdaSVD: Adaptive Singular Value Decomposition for Large Language Models
[Quick Read]: This paper addresses the challenge of deploying large language models (LLMs) on resource-constrained devices, in particular their substantial memory requirements. Existing SVD-based methods struggle to mitigate the errors introduced by truncation during compression, and a uniform compression ratio cannot account for the varying importance of different layers. The key to the solution is AdaSVD, which dynamically compensates for SVD truncation errors via adaComp and adaptively assigns layer-specific compression ratios via adaCR, substantially reducing memory requirements while maintaining strong performance.
Link: https://arxiv.org/abs/2502.01403
Authors: Li Zhiteng, Xia Mingyuan, Zhang Jingyuan, Hui Zheng, Kong Linghe, Zhang Yulun, Yang Xiaokang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The code and models will be available at this https URL
Click to view abstract
Abstract:Large language models (LLMs) have achieved remarkable success in natural language processing (NLP) tasks, yet their substantial memory requirements present significant challenges for deployment on resource-constrained devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LLMs, offering considerable reductions in memory overhead. However, existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation, leading to a noticeable performance gap when compared to the original models. Furthermore, applying a uniform compression ratio across all transformer layers fails to account for the varying importance of different layers. To address these challenges, we propose AdaSVD, an adaptive SVD-based LLM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the singular matrices U and V^T. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios based on the relative importance of each layer. Extensive experiments across multiple LLM families and evaluation metrics demonstrate that AdaSVD consistently outperforms state-of-the-art (SOTA) SVD-based methods, achieving superior performance with significantly reduced memory requirements. The code and models will be available at this https URL.
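AdaSVD builds on plain rank-k SVD truncation of weight matrices, which is easy to show in isolation. The sketch below performs only this baseline truncation; adaComp's alternating updates of U and V^T and adaCR's per-layer compression ratios are not reproduced here.

```python
import numpy as np

# Baseline rank-k SVD compression of a weight matrix: store the three
# truncated factors instead of W itself.

def svd_truncate(W, k):
    """Best rank-k approximation of W (Eckart-Young): U_k, s_k, Vt_k."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
U_k, s_k, Vt_k = svd_truncate(W, 16)
W_approx = (U_k * s_k) @ Vt_k
# Stored parameters drop from 64*64 = 4096 to 16*(64 + 1 + 64) = 2064;
# the residual ||W - W_approx|| is the truncation error AdaSVD compensates.
```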
zh
[NLP-26] Annotation Tool and Dataset for Fact-Checking Podcasts
[Quick Read]: This paper addresses the challenges of fact-checking podcasts, including transcribing, annotating, and verifying claims in unfiltered, diverse, and multilingual content. The key to the solution is a novel tool that enables real-time annotation during playback, allowing users to mark key elements such as check-worthy claims, claim spans, and contextual errors while listening. The approach combines advanced transcription models (such as OpenAI's Whisper) with crowdsourced annotation to create high-quality datasets, which are then used to fine-tune multilingual transformer models (such as XLM-RoBERTa) for claim detection and stance classification.
Link: https://arxiv.org/abs/2502.01402
Authors: Vinay Setty, Adam James Becker
Affiliations: University of Stavanger
Categories: Computation and Language (cs.CL)
Comments: Accepted as a resource paper at TheWebConf 2025
Click to view abstract
Abstract:Podcasts are a popular medium on the web, featuring diverse and multilingual content that often includes unverified claims. Fact-checking podcasts is a challenging task, requiring transcription, annotation, and claim verification, all while preserving the contextual details of spoken content. Our tool offers a novel approach to tackle these challenges by enabling real-time annotation of podcasts during playback. This unique capability allows users to listen to the podcast and annotate key elements, such as check-worthy claims, claim spans, and contextual errors, simultaneously. By integrating advanced transcription models like OpenAI’s Whisper and leveraging crowdsourced annotations, we create high-quality datasets to fine-tune multilingual transformer models such as XLM-RoBERTa for tasks like claim detection and stance classification. Furthermore, we release the annotated podcast transcripts and sample annotations with preliminary experiments.
zh
[NLP-27] Plan-Then-Execute: An Empirical Study of User Trust and Team Performance When Using LLM Agents As A Daily Assistant
[Quick Read]: This paper examines the planning and sequential decision-making capabilities of large language models (LLMs) as daily assistants and how those capabilities can deliver effective everyday assistance. The key to the study is an 'LLM-modulo' setup with humans in the loop, where agents follow a plan-then-execute workflow, allowing the authors to assess how user involvement at each stage affects trust and collaborative team performance. The findings show that LLM daily assistants work well only when a high-quality plan and the necessary user involvement in execution are in place, and that users can easily misplace trust in plans that merely seem plausible. The paper distills key insights for calibrating user trust to achieve better overall task outcomes, with important implications for the future design of daily assistants and human-AI collaboration.
Link: https://arxiv.org/abs/2502.01390
Authors: Gaole He, Gianluca Demartini, Ujwal Gadiraju
Affiliations: Delft University of Technology; The University of Queensland
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: Conditionally accepted to CHI 2025
Click to view abstract
Abstract:Since the explosion in popularity of ChatGPT, large language models (LLMs) have continued to impact our everyday lives. Equipped with external tools that are designed for a specific purpose (e.g., for flight booking or an alarm clock), LLM agents exercise an increasing capability to assist humans in their daily work. Although LLM agents have shown a promising blueprint as daily assistants, there is a limited understanding of how they can provide daily assistance based on planning and sequential decision making capabilities. We draw inspiration from recent work that has highlighted the value of ‘LLM-modulo’ setups in conjunction with humans-in-the-loop for planning tasks. We conducted an empirical study (N = 248) of LLM agents as daily assistants in six commonly occurring tasks with different levels of risk typically associated with them (e.g., flight ticket booking and credit card payments). To ensure user agency and control over the LLM agent, we adopted LLM agents in a plan-then-execute manner, wherein the agents conducted step-wise planning and step-by-step execution in a simulation environment. We analyzed how user involvement at each stage affects their trust and collaborative team performance. Our findings demonstrate that LLM agents can be a double-edged sword – (1) they can work well when a high-quality plan and necessary user involvement in execution are available, and (2) users can easily mistrust the LLM agents with plans that seem plausible. We synthesized key insights for using LLM agents as daily assistants to calibrate user trust and achieve better overall task outcomes. Our work has important implications for the future design of daily assistants and human-AI collaboration with LLM agents.
zh
[NLP-28] Topic-FlipRAG: Topic-Orientated Adversarial Opinion Manipulation Attacks to Retrieval-Augmented Generation Models
[Quick Read]: This paper addresses topic-oriented adversarial opinion manipulation attacks against Retrieval-Augmented Generation (RAG) systems, which must reason over and synthesize multiple perspectives and are therefore particularly susceptible to systematic knowledge poisoning. The key to the solution is Topic-FlipRAG, a two-stage manipulation attack pipeline that strategically crafts adversarial perturbations to influence opinions across related queries. The approach combines traditional adversarial ranking attack techniques with the extensive internal knowledge and reasoning capabilities of large language models (LLMs) to execute semantic-level perturbations. Experiments show that the attack effectively shifts the opinions expressed in model outputs on specific topics, significantly affecting users' perception of information. Current mitigation methods cannot effectively defend against such attacks, underscoring the need for stronger safeguards for RAG systems and offering important insights for LLM security research.
Link: https://arxiv.org/abs/2502.01386
Authors: Yuyang Gong, Zhuo Chen, Miaokun Chen, Fengchang Yu, Wei Lu, Xiaofeng Wang, Xiaozhong Liu, Jiawei Liu
Affiliations: Wuhan University; Indiana University Bloomington; Worcester Polytechnic Institute
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments:
Click to view abstract
Abstract:Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become essential for tasks such as question answering and content generation. However, their increasing impact on public opinion and information dissemination has made them a critical focus for security research due to inherent vulnerabilities. Previous studies have predominantly addressed attacks targeting factual or single-query manipulations. In this paper, we address a more practical scenario: topic-oriented adversarial opinion manipulation attacks on RAG models, where LLMs are required to reason and synthesize multiple perspectives, rendering them particularly susceptible to systematic knowledge poisoning. Specifically, we propose Topic-FlipRAG, a two-stage manipulation attack pipeline that strategically crafts adversarial perturbations to influence opinions across related queries. This approach combines traditional adversarial ranking attack techniques and leverages the extensive internal relevant knowledge and reasoning capabilities of LLMs to execute semantic-level perturbations. Experiments show that the proposed attacks effectively shift the opinion of the model’s outputs on specific topics, significantly impacting user information perception. Current mitigation methods cannot effectively defend against such attacks, highlighting the necessity for enhanced safeguards for RAG systems, and offering crucial insights for LLM security research.
zh
[NLP-29] Meursault as a Data Point
[Quick Read]: This paper explores the profound philosophical and ethical questions raised by reducing human experience to quantifiable metrics in the age of datafication. Analyzing the fate of Meursault, the protagonist of Albert Camus' The Stranger, the study applies natural language processing (NLP) techniques, including emotion detection (BERT), sentiment analysis (VADER), and named entity recognition (spaCy), to quantify key events and behaviors in his life. The key contribution is to expose the inherent limitations of applying algorithmic models to complex human experience, especially experience rooted in existential alienation and moral ambiguity, and, by examining how modern AI tools misread Meursault's actions and emotions, to highlight the broader ethical dilemma of reducing nuanced human narratives to data points. The paper advocates incorporating humanistic values into artificial intelligence to counter over-reliance on data-driven narratives.
Link: https://arxiv.org/abs/2502.01364
Authors: Abhinav Pratap, Amit Pathak
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: 7 pages, 9 figures, 4 tables
Click to view abstract
Abstract:In an era dominated by datafication, the reduction of human experiences to quantifiable metrics raises profound philosophical and ethical questions. This paper explores these issues through the lens of Meursault, the protagonist of Albert Camus' The Stranger, whose emotionally detached existence epitomizes the existential concept of absurdity. Using natural language processing (NLP) techniques, including emotion detection (BERT), sentiment analysis (VADER), and named entity recognition (spaCy), this study quantifies key events and behaviors in Meursault's life. Our analysis reveals the inherent limitations of applying algorithmic models to complex human experiences, particularly those rooted in existential alienation and moral ambiguity. By examining how modern AI tools misinterpret Meursault's actions and emotions, this research underscores the broader ethical dilemmas of reducing nuanced human narratives to data points, challenging the foundational assumptions of our data-driven society. The findings presented in this paper serve as a critique of the increasing reliance on data-driven narratives and advocate for incorporating humanistic values in artificial intelligence.
zh
[NLP-30] Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations
[Quick Read]: This paper addresses the vulnerability of large language models (LLMs) in product recommendation systems, in particular their susceptibility to adversarial manipulation. The key to the solution is drawing on principles of human psychology to seamlessly modify product descriptions, making these adversarial manipulations hard to detect. Experiments reveal significant vulnerabilities in LLMs used as recommenders and provide critical insights for safeguarding these systems.
Link: https://arxiv.org/abs/2502.01349
Authors: Giorgos Filandrianos, Angeliki Dimitriou, Maria Lymperaiou, Konstantinos Thomas, Giorgos Stamou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:The advent of Large Language Models (LLMs) has revolutionized product recommendation systems, yet their susceptibility to adversarial manipulation poses critical challenges, particularly in real-world commercial applications. Our approach is the first one to tap into human psychological principles, seamlessly modifying product descriptions, making these adversarial manipulations hard to detect. In this work, we investigate cognitive biases as black-box adversarial strategies, drawing parallels between their effects on LLMs and human purchasing behavior. Through extensive experiments on LLMs of varying scales, we reveal significant vulnerabilities in their use as recommenders, providing critical insights into safeguarding these systems.
zh
[NLP-31] PSSD: Making Large Language Models Self-denial via Human Psyche Structure WWW’25
[Quick Read]: This paper addresses the resource competition that afflicts approaches to improving the accuracy of LLM reasoning, which incur substantial time and compute overhead. The key is the proposed PSSD scheme, which mimics the human psyche structure by introducing three distinct, interconnected roles: an intuition-based id role, a rule-driven superego role, and a script-centric ego role. The three roles cooperate under a multi-agent paradigm, better enhancing LLM reasoning while integrating seamlessly with existing models and delivering superior performance.
Link: https://arxiv.org/abs/2502.01344
Authors: Jinzhi Liao, Zenghua Liao, Xiang Zhao
Affiliations: National University of Defense Technology, Changsha, Hunan, China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: WWW '25
Click to view abstract
Abstract:Efforts to enhance the accuracy of LLM reasoning have aroused the community's interest, wherein pioneering studies investigate post-hoc strategies to rectify potential mistakes. Despite extensive efforts, they are all stuck in a state of resource competition demanding significant time and computing expenses. The cause of this situation lies in the failure to identify the fundamental feature of solutions in this line, coined as the self-denial of LLMs. In other words, LLMs should confidently determine the potential existence of mistakes and carefully execute the targeted correction. As the whole procedure is conducted within LLMs, supporting and persuasive references are hard to acquire, and the absence of specific steps toward refining hidden mistakes persists even when errors are acknowledged. In response to these challenges, we present PSSD, which refers to and implements the human psyche structure such that three distinct and interconnected roles contribute to human reasoning. Specifically, PSSD leverages the recent multi-agent paradigm, and is further enhanced with three innovatively conceived roles: (1) the intuition-based id role that provides initial attempts based on benign LLMs; (2) the rule-driven superego role that summarizes rules to regulate the above attempts, and returns specific key points as guidance; and (3) the script-centric ego role that absorbs all procedural information to generate an executable script for the final answer prediction. Extensive experiments demonstrate that the proposed design not only better enhances reasoning capabilities, but also integrates seamlessly with current models, leading to superior performance.
zh
[NLP-32] AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
[Quick Read]: This paper addresses the key challenge of aligning visual features with language embeddings in vision-language models (VLMs). Existing connectors such as multilayer perceptrons (MLPs) often produce out-of-distribution or noisy inputs, causing misalignment between the modalities. The proposed method, AlignVLM, maps visual features to a weighted average of the LLM's text embeddings, leveraging the linguistic priors encoded by the LLM to ensure that visual features land in regions of the space the LLM can interpret effectively. The key is this weighted-average mapping over LLM text embeddings, which improves the accuracy and robustness of vision-text feature alignment.
Link: https://arxiv.org/abs/2502.01341
Authors: Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.
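The core mapping described in the abstract, projecting a visual feature to a convex (softmax-weighted) combination of the LLM's text embeddings, can be sketched directly. The linear scoring matrix W below stands in for the learned connector and is an assumption for illustration; all dimensions are arbitrary.

```python
import numpy as np

# Sketch of the AlignVLM idea: each visual feature becomes a convex
# combination of the LLM's text-embedding rows, so it always lands inside
# the region of space the LLM already interprets.

def align_visual_feature(v, W, text_embeddings):
    """Map one visual feature v to a softmax-weighted average of rows of
    text_embeddings. W is an assumed learned projection to vocab scores."""
    logits = W @ v                       # scores over the vocabulary
    logits -= logits.max()               # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    return weights @ text_embeddings     # convex combination of embeddings

rng = np.random.default_rng(0)
vocab, d_text, d_vis = 100, 32, 48
E = rng.standard_normal((vocab, d_text))   # LLM text embedding table
W = rng.standard_normal((vocab, d_vis))    # assumed learned projection
v = rng.standard_normal(d_vis)             # one visual feature
aligned = align_visual_feature(v, W, E)    # lies in the hull of E's rows
```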
zh
[NLP-33] Main Predicate and Their Arguments as Explanation Signals For Intent Classification
[Quick Read]: This paper addresses the explainability of intent classification, an under-explored area owing to the lack of suitable benchmark datasets. The key to the solution is a new technique that automatically augments text samples in intent classification datasets with word-level explanations, marking main predicates (primarily verbs) and their arguments (dependency relations) as explanation signals. Applied to the ATIS and SNIPS benchmarks, this yields a unique 21k-instance dataset for explainability. Experiments show that guiding models to focus on these explanation signals during training improves their reasoning, with a 3-4% gain on explainability metrics such as plausibility and faithfulness.
Link: https://arxiv.org/abs/2502.01270
Authors: Sameer Pimparkhede, Pushpak Bhattacharyya
Affiliations: Indian Institute of Technology Bombay
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Intent classification is crucial for conversational agents (chatbots), and deep learning models perform well in this area. However, little research has been done on the explainability of intent classification due to the absence of suitable benchmark data. Human annotation of explanation signals in text samples is time-consuming and costly. However, from inspection of data on intent classification, we see that, more often than not, the main verb denotes the action, and the direct object indicates the domain of conversation, serving as explanation signals for intent. This observation enables us to hypothesize that the main predicate in the text utterances, along with the arguments of the main predicate, can serve as explanation signals. Leveraging this, we introduce a new technique to automatically augment text samples from intent classification datasets with word-level explanations. We mark main predicates (primarily verbs) and their arguments (dependency relations) as explanation signals in benchmark intent classification datasets ATIS and SNIPS, creating a unique 21k-instance dataset for explainability. Further, we experiment with deep learning and language models. We observe that models that work well for classification do not perform well in explainability metrics like plausibility and faithfulness. We also observe that guiding models to focus on explanation signals from our dataset during training improves the plausibility Token F1 score by 3-4%, improving the model’s reasoning.
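"取主谓词及其论元作为解释信号"的标注思路,可以用一个手写的玩具依存结构来示意(纯 Python;依存三元组为演示性假设输入,实际应来自依存句法分析器):

```python
def explanation_signals(tokens, deps):
    """deps: (head_index, relation, dep_index) 三元组列表,head=-1 表示根。
    返回主谓词及其核心论元对应的词,作为词级解释信号(玩具示意)。"""
    root = next(d for h, r, d in deps if h == -1)  # 主谓词(通常是动词)
    args = [d for h, r, d in deps if h == root and r in {"nsubj", "obj", "iobj"}]
    return [tokens[root]] + [tokens[i] for i in sorted(args)]

# "book a flight to Boston" 的手写玩具依存结构
tokens = ["book", "a", "flight", "to", "Boston"]
deps = [(-1, "root", 0), (2, "det", 1), (0, "obj", 2), (4, "case", 3), (2, "nmod", 4)]
signals = explanation_signals(tokens, deps)
```

这里主动词 book 表示动作,直接宾语 flight 指示对话领域,与论文观察到的"动词+宾语即解释信号"的规律一致。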
zh
[NLP-34] Learnable polynomial trigonometric and tropical activations
【速读】: 该论文研究基于正交函数基与热带多项式的可学习激活函数:静态激活函数(如ReLU)缺乏适应性,而可学习激活函数在深层网络中若方差管理不当,会引发梯度消失或爆炸等稳定性问题。关键解决方案在于提出一种初始化方案,仅凭该方案即可在变换器和卷积网络中保持单位方差,从而确保深层架构中梯度流动的稳定。实验结果表明,采用基于Hermite、Fourier和Tropical的可学习激活函数的网络,在ImageNet-1K分类和OpenWebText上的准确率与困惑度均显著优于GPT-2和ConvNeXt基线,证明了可学习激活函数在大规模任务中的可行性。相关激活函数已封装成一个完全基于PyTorch的库:torchortho。
链接: https://arxiv.org/abs/2502.01247
作者: Ismail Khalfaoui-Hassani,Stefan Kesselheim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
备注:
点击查看摘要
Abstract:This paper investigates scalable neural networks with learnable activation functions based on orthogonal function bases and tropical polynomials, targeting ImageNet-1K classification and next token prediction on OpenWebText. Traditional activations, such as ReLU, are static. In contrast, learnable activations enable the network to adapt dynamically during training. However, stability issues, such as vanishing or exploding gradients, arise with improper variance management in deeper networks. To remedy this, we propose an initialization scheme that single-handedly preserves unitary variance in transformers and convolutional networks, ensuring stable gradient flow even in deep architectures. Extensive experiments demonstrate that networks with Hermite, Fourier, and Tropical-based learnable activations significantly improve over GPT-2 and ConvNeXt networks in terms of accuracy and perplexity in train and test, highlighting the viability of learnable activations in large-scale tasks. The activation functions developed here are the subject of a library coded entirely in pure PyTorch: torchortho, available at this https URL.
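可学习 Fourier 激活的核心形式可示意如下(纯 Python 玩具实现,仅演示前向计算;系数在实际训练中作为可学习参数,此处的初始化方式与类名均为演示性假设,并不代表 torchortho 的实际接口):

```python
import math

class FourierActivation:
    """f(x) = a0/2 + Σ_k [a_k·cos(kx) + b_k·sin(kx)],系数可学习(示意)。"""
    def __init__(self, order, a=None, b=None):
        self.order = order
        self.a = a if a is not None else [0.0] * (order + 1)
        self.b = b if b is not None else [0.0] * (order + 1)

    def __call__(self, x):
        y = self.a[0] / 2
        for k in range(1, self.order + 1):
            y += self.a[k] * math.cos(k * x) + self.b[k] * math.sin(k * x)
        return y

# 以 b1=1、其余系数为0 初始化,则激活在原点附近近似恒等(sin(x) ≈ x)
act = FourierActivation(order=3, b=[0.0, 1.0, 0.0, 0.0])
```

这种"近似恒等"的初始化直觉上有助于训练初期的方差保持;论文提出的初始化方案正是围绕在深层网络中维持单位方差而设计的。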
zh
[NLP-35] OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology
【速读】: 该论文旨在解决大型语言模型(LLMs)在眼科临床实践中的实际应用能力评估及其局限性识别的问题。为填补这一研究空白并支持LLMs的实际应用,论文提出了一种名为OphthBench的专门基准测试,该基准测试系统地将典型的眼科临床工作流程划分为五个关键场景:教育、分诊、诊断、治疗和预后。解决方案的关键在于通过设计涵盖多种问题类型的多样化任务,构建了一个包含9个任务和591个问题的全面基准框架,从而实现对LLMs能力的全面评估,并为其在中国眼科领域的实际应用提供洞见。
链接: https://arxiv.org/abs/2502.01243
作者: Chengfeng Zhou,Ji Wang,Juanjuan Qin,Yining Wang,Ling Sun,Weiwei Dai
机构: Changsha Aier Eye Hospital (长沙爱尔眼科医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have shown significant promise across various medical applications, with ophthalmology being a notable area of focus. Many ophthalmic tasks have shown substantial improvement through the integration of LLMs. However, before these models can be widely adopted in clinical practice, evaluating their capabilities and identifying their limitations is crucial. To address this research gap and support the real-world application of LLMs, we introduce the OphthBench, a specialized benchmark designed to assess LLM performance within the context of Chinese ophthalmic practices. This benchmark systematically divides a typical ophthalmic clinical workflow into five key scenarios: Education, Triage, Diagnosis, Treatment, and Prognosis. For each scenario, we developed multiple tasks featuring diverse question types, resulting in a comprehensive benchmark comprising 9 tasks and 591 questions. This comprehensive framework allows for a thorough assessment of LLMs’ capabilities and provides insights into their practical application in Chinese ophthalmology. Using this benchmark, we conducted extensive experiments and analyzed the results from 39 popular LLMs. Our evaluation highlights the current gap between LLM development and its practical utility in clinical settings, providing a clear direction for future advancements. By bridging this gap, we aim to unlock the potential of LLMs and advance their development in ophthalmology.
zh
[NLP-36] Eliciting Language Model Behaviors with Investigator Agents
【速读】: 该论文旨在解决行为诱导问题,即搜索能够从目标语言模型中诱导出特定目标行为(如幻觉或有害响应)的提示。解决方案的关键在于训练调查者模型(investigator models),通过有监督微调、基于直接偏好优化(DPO)的强化学习,以及一种新颖的Frank-Wolfe训练目标,探索庞大的提示空间并发现多样化的诱导策略,从而有效诱导出包括越狱、幻觉在内的多种开放性异常行为。
链接: https://arxiv.org/abs/2502.01236
作者: Xiang Lisa Li,Neil Chowdhury,Daniel D. Johnson,Tatsunori Hashimoto,Percy Liang,Sarah Schwettmann,Jacob Steinhardt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 7 figures
点击查看摘要
Abstract:Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors, obtaining a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate.
zh
[NLP-37] On the Robustness of Temporal Factual Knowledge in Language Models
【速读】: 该论文旨在探究语言模型(Language Models, LMs)处理时序性事实知识的时间鲁棒性。研究的关键在于设计了一项控制实验,评估多个预训练及指令调优的语言模型在不同时间粒度(日、月、年)下对维基数据事实的处理能力,从而揭示大规模最先进的模型如Llama-3.1-70B在时序性知识理解方面的局限性,特别是它们无法将知识从一个粒度泛化到另一个粒度。
链接: https://arxiv.org/abs/2502.01220
作者: Hichem Ammar Khodja,Frédéric Béchet,Quentin Brabant,Alexis Nasr,Gwénolé Lecorvé
机构: Orange(橙色) - Lannion, France; Aix Marseille Université, CNRS, LIS, UMR 7020 - Marseille, France; International Laboratory on Learning Systems (ILLS - IRL2020 CNRS)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper explores the temporal robustness of language models (LMs) in handling factual knowledge. While LMs can often complete simple factual statements, their ability to manage temporal facts (those valid only within specific timeframes) remains uncertain. We design a controlled experiment to test the robustness of temporal factual knowledge inside LMs, which we use to evaluate several pretrained and instruction-tuned models using prompts on popular Wikidata facts, assessing their performance across different temporal granularities (Day, Month, and Year). Our findings indicate that even very large state-of-the-art models, such as Llama-3.1-70B, vastly lack robust knowledge of temporal facts. In addition, they are incapable of generalizing their knowledge from one granularity to another. These results highlight the inherent limitations of using LMs as temporal knowledge bases. The source code and data to reproduce our experiments will be released.
zh
[NLP-38] Modelling change in neural dynamics during phonetic accommodation
【速读】: 该论文旨在探究实时语音输入如何塑造对话者在语音规划中的表征,并解决短期语音同化过程中语音表征变化的计算模型。关键解决方案在于通过动态神经场方程调整抑制性记忆动力学的幅度,以反映由于音系和/或社会语言学压力导致的同化阻力,从而再现实验观察到的影子模仿过程中的特定元音收敛及模仿后的基线恢复现象。
链接: https://arxiv.org/abs/2502.01210
作者: Sam Kirkham,Patrycja Strycharczuk,Rob Davies,Danielle Welburn
机构: Phonetics Laboratory, Lancaster University (兰开斯特大学语音学实验室); Linguistics and English Language, University of Manchester (曼彻斯特大学语言学与英语语言系); Phonetics Laboratory, Lancaster University (兰开斯特大学语音学实验室); Phonetics Laboratory, Lancaster University (兰开斯特大学语音学实验室)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Short-term phonetic accommodation is a fundamental driver behind accent change, but how does real-time input from another speaker’s voice shape the speech planning representations of an interlocutor? We advance a computational model of change in phonetic representations during phonetic accommodation, grounded in dynamic neural field equations for movement planning and memory dynamics. We test the model’s ability to capture empirical patterns from an experimental study where speakers shadowed a model talker with a different accent from their own. The experimental data shows vowel-specific degrees of convergence during shadowing, followed by return to baseline (or minor divergence) post-shadowing. The model can reproduce these phenomena by modulating the magnitude of inhibitory memory dynamics, which may reflect resistance to accommodation due to phonological and/or sociolinguistic pressures. We discuss the implications of these results for the relation between short-term phonetic accommodation and longer-term patterns of sound change.
zh
[NLP-39] Almost Surely Safe Alignment of Large Language Models at Inference-Time
【速读】: 该论文旨在解决能力很强的大语言模型(Large Language Models, LLMs)在生成响应时可能出现的偏见或不安全内容的问题。现有缓解此问题的对齐技术,如基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF),虽然有效但成本高昂且容易过拟合,因为它们需要重新训练 LLM。论文提出的关键解决方案是,在推理阶段采用一种新颖的对齐方法,确保 LLM 几乎必然(概率趋近于一)生成安全响应。具体而言,通过将推理阶段的安全响应生成构建为 LLM 潜在空间中的受限马尔可夫决策过程(Markov Decision Process, MDP)来实现这一点。关键创新在于引入一个安全状态,用于追踪安全约束的演化,并在求解潜在空间中的 MDP 后提供形式化的安全性保证。在此基础上,论文提出名为 InferenceGuard 的实用实现方案,能够在不修改模型权重的情况下安全地对齐 LLM。实证研究表明,InferenceGuard 在平衡安全性和任务性能方面表现出色,优于现有的推理阶段对齐方法。
链接: https://arxiv.org/abs/2502.01208
作者: Xiaotong Ji,Shyam Sundhar Ramesh,Matthieu Zimmer,Ilija Bogunovic,Jun Wang,Haitham Bou Ammar
机构: Cranberry-Lemon University (蔓莓柠檬大学); Department of Computational Neuroscience, University of the Witwatersrand (计算神经科学系, 比勒陀利亚大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Even highly capable large language models (LLMs) can produce biased or unsafe responses, and alignment techniques, such as RLHF, aimed at mitigating this issue, are expensive and prone to overfitting as they retrain the LLM. This paper introduces a novel inference-time alignment approach that ensures LLMs generate safe responses almost surely, i.e., with a probability approaching one. We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process within the LLM’s latent space. Crucially, we augment a safety state that tracks the evolution of safety constraints and enables us to demonstrate formal safety guarantees upon solving the MDP in the latent space. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses.
zh
[NLP-40] OCR Error Post-Correction with LLM s in Historical Documents: No Free Lunches
【速读】: 该论文旨在解决光学字符识别(OCR)系统在转录历史文档时引入的错误问题,通过探索利用开放权重的大语言模型(LLMs)进行OCR错误校正的方法。研究的关键在于评估不同策略的效果,包括参数优化、量化、段落长度的影响以及文本延续方法,并揭示了现代LLMs在降低英语字符错误率(CER)方面的潜力,同时也指出了在芬兰语应用中尚未达到实用性能的局限性。
链接: https://arxiv.org/abs/2502.01205
作者: Jenna Kanerva,Cassandra Ledins,Siiri Käpyaho,Filip Ginter
机构: TurkuNLP, Department of Computing (计算系), University of Turku (图尔库大学), Finland (芬兰)
类目: Computation and Language (cs.CL)
备注: To be published in RESOURCEFUL 2025
点击查看摘要
Abstract:Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
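摘要中用于衡量校正效果的字符错误率(CER)按编辑距离定义计算,下面给出一个标准实现示意(纯 Python;示例文本为自拟):

```python
def levenshtein(ref, hyp):
    # 经典动态规划编辑距离(插入/删除/替换代价均为1)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # 删除
                           cur[j - 1] + 1,       # 插入
                           prev[j - 1] + (r != h)))  # 替换或匹配
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """字符错误率 = 编辑距离 / 参考文本长度。"""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# OCR 输出 "he1lo world" 相对参考文本 "hello world":1处替换错误
score = cer("hello world", "he1lo world")
```

LLM 后校正的目标即在校正后使该指标低于原始 OCR 输出的 CER;论文发现这一点在英语上可行,而在芬兰语上尚未达到实用水平。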
zh
[NLP-41] COVE: COntext and VEracity prediction for out-of-context images NAACL2025
【速读】: 该论文旨在解决图像脱离原始上下文所导致的多模态虚假信息问题。论文的关键解决方案是引入COVE方法:首先预测图像的真实上下文(context),然后利用该上下文来验证图片说明(caption)的真实性。通过这种方式,COVE在上下文预测任务上超越了现有的最先进模型,并且在真实数据上的说明真实性验证中优于其他模型,表明按顺序结合这两个任务是有益的。
链接: https://arxiv.org/abs/2502.01194
作者: Jonathan Tonglet,Gabriel Thiem,Iryna Gurevych
机构: Ubiquitous Knowledge Processing Lab (UKP Lab), TU Darmstadt (达姆施塔特工业大学); Department of Electrical Engineering, KU Leuven (鲁汶大学); Department of Computer Science, KU Leuven (鲁汶大学)
类目: Computation and Language (cs.CL)
备注: Camera-ready version accepted to NAACL 2025 Main Conference
点击查看摘要
Abstract:Images taken out of their context are the most prevalent form of multimodal misinformation. Debunking them requires (1) providing the true context of the image and (2) checking the veracity of the image’s caption. However, existing automated fact-checking methods fail to tackle both objectives explicitly. In this work, we introduce COVE, a new method that predicts first the true COntext of the image and then uses it to predict the VEracity of the caption. COVE beats the SOTA context prediction model on all context items, often by more than five percentage points. It is competitive with the best veracity prediction models on synthetic data and outperforms them on real-world data, showing that it is beneficial to combine the two tasks sequentially. Finally, we conduct a human study that reveals that the predicted context is a reusable and interpretable artifact to verify new out-of-context captions for the same image. Our code and data are made available.
zh
[NLP-42] Skewed Memorization in Large Language Models : Quantification and Decomposition
【速读】: 该论文旨在解决大型语言模型(LLMs)在有监督微调(SFT)过程中因记忆训练数据而导致的隐私和安全风险。论文的关键在于通过分析序列长度上的记忆概率,揭示记忆分布的高度偏斜性,并将其与令牌生成过程联系起来,从而提供估算记忆的方法,并提出检测和缓解这些风险的策略,以促进更注重隐私保护的LLMs的发展。
链接: https://arxiv.org/abs/2502.01187
作者: Hao Li,Di Huang,Ziyu Wang,Amir M. Rahmani
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Memorization in Large Language Models (LLMs) poses privacy and security risks, as models may unintentionally reproduce sensitive or copyrighted data. Existing analyses focus on average-case scenarios, often neglecting the highly skewed distribution of memorization. This paper examines memorization in LLM supervised fine-tuning (SFT), exploring its relationships with training duration, dataset size, and inter-sample similarity. By analyzing memorization probabilities over sequence lengths, we link this skewness to the token generation process, offering insights for estimating memorization and comparing it to established metrics. Through theoretical analysis and empirical evaluation, we provide a comprehensive understanding of memorization behaviors and propose strategies to detect and mitigate risks, contributing to more privacy-preserving LLMs.
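"记忆分布高度偏斜"可以直接用逐样本记忆率的偏度来量化。下面是一个纯 Python 示意(其中"记忆率"被简化为生成 token 与训练 token 的逐位置匹配比例,属于演示性假设,并非论文的度量定义):

```python
def memorization_rate(generated, training):
    # 简化定义:逐位置匹配的 token 比例(仅作演示)
    n = min(len(generated), len(training))
    if n == 0:
        return 0.0
    return sum(g == t for g, t in zip(generated, training)) / n

def skewness(xs):
    """Fisher-Pearson 偏度系数 g1:正值表示右偏(长尾在高记忆一侧)。"""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / (m2 ** 1.5) if m2 > 0 else 0.0

# 多数样本几乎不被记忆、少数样本被高度记忆 → 右偏分布
rates = [0.0, 0.05, 0.1, 0.0, 0.02, 0.9, 0.95]
g1 = skewness(rates)
```

正的偏度正对应论文所强调的现象:平均值掩盖了少数高记忆样本带来的隐私风险,因此仅做平均情形分析是不够的。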
zh
[NLP-43] A Single Model Ensemble Framework for Neural Machine Translation using Pivot Translation
【速读】: 该论文旨在解决低资源语言对(language pair)上神经机器翻译性能欠佳,以及多模型集成方法计算成本高的问题。论文的关键解决方案在于提出一种基于中介语言(pivot)翻译的单模型集成策略,分为两步:首先通过中介翻译生成候选译文,其次在后处理聚合阶段从候选中选出高质量译文并加以合并。这种方法仅需单一模型,即可借助高资源中介语言实现知识迁移,生成既多样又更准确的候选译文,最终产出优于现有候选的翻译。
链接: https://arxiv.org/abs/2502.01182
作者: Seokjin Oh,Keonwoong Noh,Woohwan Jung
机构: Department of Applied Artificial Intelligence, Hanyang University (汉阳大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Despite the significant advances in neural machine translation, performance remains subpar for low-resource language pairs. Ensembling multiple systems is a widely adopted technique to enhance performance, often accomplished by combining probability distributions. However, the previous approaches face the challenge of high computational costs for training multiple models. Furthermore, for black-box models, averaging token-level probabilities at each decoding step is not feasible. To address the problems of multi-model ensemble methods, we present a pivot-based single model ensemble. The proposed strategy consists of two steps: pivot-based candidate generation and post-hoc aggregation. In the first step, we generate candidates through pivot translation. This can be achieved with only a single model and facilitates knowledge transfer from high-resource pivot languages, resulting in candidates that are not only diverse but also more accurate. Next, in the aggregation step, we select k high-quality candidates from the generated candidates and merge them to generate a final translation that outperforms the existing candidates. Our experimental results show that our method produces translations of superior quality by leveraging candidates from pivot translation to capture the subtle nuances of the source sentence.
zh
[NLP-44] Joint Localization and Activation Editing for Low-Resource Fine-Tuning
【速读】: 该论文旨在解决在低资源场景下参数高效微调(Parameter-efficient fine-tuning, PEFT)方法效果有限的问题,特别是在仅有数百个样本的情况下。论文的关键解决方案是提出了一种名为Joint Localization and Activation Editing (JoLA)的方法,该方法能够同时学习需要编辑的Transformer头部、干预类型(加性、乘性或两者兼有)以及干预参数本身,从而在小数据集上实现更稳定且性能更优的模型调整。
链接: https://arxiv.org/abs/2502.01179
作者: Wen Lai,Alexander Fraser,Ivan Titov
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The code for the method is released at this https URL
点击查看摘要
Abstract:Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing techniques, which modify the activations of specific model components. These methods, due to their extremely small parameter counts, show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit (2) whether the intervention should be additive, multiplicative, or both and (3) the intervention parameters themselves - the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods.
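JoLA 所学的三类要素(编辑哪个头、加性/乘性干预、干预参数)中,对单个注意力头输出的干预形式可示意如下(纯 Python 玩具实现;门控与参数名均为演示性假设,并非论文源码):

```python
def edit_head_output(h, gate, scale, offset):
    """门控为真时对头输出施加干预:h' = scale ⊙ h + offset;否则原样返回。"""
    if not gate:
        return list(h)
    return [s * x + o for x, s, o in zip(h, scale, offset)]

head_out = [0.5, -1.0, 2.0]
edited = edit_head_output(head_out, gate=True,
                          scale=[1.0, 0.5, 1.0],    # 乘性缩放
                          offset=[0.1, 0.0, -0.2])  # 加性偏移
untouched = edit_head_output(head_out, gate=False, scale=[], offset=[])
```

JoLA 的要点在于门控、干预类型与干预向量都是联合学习得到的,而非像早期激活编辑方法那样依赖人工选定要编辑的模块。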
zh
[NLP-45] Jailbreaking with Universal Multi-Prompts NAACL
【速读】: 该论文旨在解决大型语言模型(LLMs)在面对通用攻击者时的防御问题,特别是那些能够泛化到未见过的任务的攻击。现有方法主要针对特定案例优化对抗输入,导致处理大规模数据集时计算成本较高。论文的关键在于提出了一种基于提示的方法JUMP (Jumping UnMoored Multi-Prompt),通过使用通用多提示来破解LLMs,并进一步将其适应于防御策略DUMP (Defensive UnMoored Multi-Prompt)。实验结果表明,该方法在优化通用多提示方面优于现有技术。
链接: https://arxiv.org/abs/2502.01154
作者: Yu-Ling Hsu,Hsuan Su,Shang-Tse Chen
机构: National Taiwan University (台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted by NAACL Findings 2025
点击查看摘要
Abstract:Large language models (LLMs) have seen rapid development in recent years, revolutionizing various applications and significantly enhancing convenience and productivity. However, alongside their impressive capabilities, ethical concerns and new types of attacks, such as jailbreaking, have emerged. While most prompting techniques focus on optimizing adversarial inputs for individual cases, resulting in higher computational costs when dealing with large datasets. Less research has addressed the more general setting of training a universal attacker that can transfer to unseen tasks. In this paper, we introduce JUMP, a prompt-based method designed to jailbreak LLMs using universal multi-prompts. We also adapt our approach for defense, which we term DUMP. Experimental results demonstrate that our method for optimizing universal multi-prompts outperforms existing techniques.
zh
[NLP-46] DeepRAG : Thinking to Retrieval Step by Step for Large Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在推理过程中存在的严重事实性幻觉问题,这些问题源于参数知识的时效性、准确性和覆盖范围。同时,将推理与检索增强生成(RAG)相结合仍然面临任务分解不力和冗余检索的挑战,这可能导致噪声引入和响应质量下降。论文的关键解决方案是提出DeepRAG框架,该框架将检索增强推理建模为马尔可夫决策过程(MDP),从而实现策略性和自适应检索。通过迭代分解查询,DeepRAG能够在每一步动态决定是否检索外部知识或依赖参数推理,以此优化检索增强推理的效果。实验表明,DeepRAG在提高检索效率的同时,将答案准确性提升了21.99%。
链接: https://arxiv.org/abs/2502.01142
作者: Xinyan Guan,Jiali Zeng,Fandong Meng,Chunlei Xin,Yaojie Lu,Hongyu Lin,Xianpei Han,Le Sun,Jie Zhou
机构: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所中文信息技术处理实验室); University of Chinese Academy of Sciences(中国科学院大学); Pattern Recognition Center, WeChat AI, Tencent Inc, China(中国腾讯公司微信人工智能模式识别中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have shown remarkable potential in reasoning while they still suffer from severe factual hallucinations due to timeliness, accuracy, and coverage of parametric knowledge. Meanwhile, integrating reasoning with retrieval-augmented generation (RAG) remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling strategic and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency while improving answer accuracy by 21.99%, demonstrating its effectiveness in optimizing retrieval-augmented reasoning.
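"每一步动态决定检索外部知识还是依赖参数推理"的流程,可用一个最简决策循环来示意(纯 Python;置信度、检索与作答函数均为假设性占位,实际由模型与检索器给出,阈值也仅为演示):

```python
def deep_rag_sketch(subqueries, confidence, retrieve, answer, threshold=0.7):
    """逐个子问题决策:置信度足够则参数推理,否则才检索(示意,避免冗余检索)。"""
    trace = []
    for q in subqueries:
        if confidence(q) >= threshold:
            trace.append(("parametric", q, answer(q, context=None)))
        else:
            ctx = retrieve(q)  # 仅在模型不确定时检索外部知识
            trace.append(("retrieve", q, answer(q, context=ctx)))
    return trace

# 玩具占位实现
conf = {"q1": 0.9, "q2": 0.3}
trace = deep_rag_sketch(
    ["q1", "q2"],
    confidence=lambda q: conf[q],
    retrieve=lambda q: f"docs({q})",
    answer=lambda q, context: f"ans({q})",
)
```

论文将这一决策过程形式化为 MDP 并学习检索策略,而非使用这里的固定阈值;此示意仅用于说明"按需检索"的控制流。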
zh
[NLP-47] Language Models Prefer What They Know: Relative Confidence Estimation via Confidence Preferences
【速读】: 该论文旨在解决语言模型(Language Models, LMs)在提供可靠置信度估计方面的不足,以帮助用户识别模型输出中的错误,并在必要时将其结果转交给人类专家。论文的关键解决方案在于引入相对置信度估计方法,即通过让模型对不同问题之间的置信度进行相对判断(例如,“你更自信正确回答哪个问题?”),而不是直接评估单个问题的绝对置信度。这种方法利用了排名聚合技术如Elo评分和Bradley-Terry模型将模型的偏好转换为置信分数。实验结果显示,相对置信度估计在所有测试的语言模型和数据集上提供了比绝对置信度估计和自一致性方法更可靠的置信度评分。
链接: https://arxiv.org/abs/2502.01126
作者: Vaishnavi Shrivastava,Ananya Kumar,Percy Liang
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Language models (LMs) should provide reliable confidence estimates to help users detect mistakes in their outputs and defer to human experts when necessary. Asking a language model to assess its confidence (“Score your confidence from 0-1.”) is a natural way of evaluating its uncertainty. However, models struggle to provide absolute assessments of confidence (i.e. judging confidence in answering a question independent of other questions) and the coarse-grained scores they produce are not useful for evaluating the correctness of their answers. We propose relative confidence estimation, where we match up questions against each other and ask the model to make relative judgments of confidence (“Which question are you more confident in answering correctly?”). Treating each question as a “player” in a series of matchups against other questions and the model’s preferences as match outcomes, we can use rank aggregation methods like Elo rating and Bradley-Terry to translate the model’s confidence preferences into confidence scores. We evaluate relative confidence estimation against absolute confidence estimation and self-consistency confidence methods on five state-of-the-art LMs – GPT-4, GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.1 405B – across 14 challenging STEM, social science, and commonsense reasoning question answering tasks. Our results demonstrate that relative confidence estimation consistently provides more reliable confidence scores than absolute confidence estimation, with average gains of 3.5% in selective classification AUC over direct absolute confidence estimation methods and 1.7% over self-consistency approaches across all models and datasets.
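将模型的成对置信偏好("对哪个问题更有把握")聚合为置信分数的 Elo 方法可示意如下(纯 Python;比赛顺序与 K 值为演示性选择,论文中同时还使用了 Bradley-Terry 聚合):

```python
def elo_scores(n_questions, matchups, k=32.0):
    """matchups: (winner, loser) 问题下标对,winner 为模型更自信的那个问题。"""
    ratings = [1000.0] * n_questions
    for w, l in matchups:
        # winner 的期望胜率,由当前分差决定
        expected_w = 1.0 / (1.0 + 10 ** ((ratings[l] - ratings[w]) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[w] += delta
        ratings[l] -= delta
    return ratings

# 问题0在与问题1、2的对比中均被偏好,问题1胜过问题2
scores = elo_scores(3, [(0, 1), (0, 2), (1, 2), (0, 1)])
```

得到的分数即相对置信度:分数越高,模型对该问题的把握越大,可用于选择性分类中的弃权决策。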
zh
[NLP-48] Picky LLM s and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning
【速读】: 该论文旨在解决在对齐的大规模语言模型(LLMs)进行良性微调以适应特定领域任务时,安全对齐可能意外退化的问题。论文的关键在于系统性地分析导致安全对齐退化的三个关键因素:答案结构、身份校准和角色扮演,并评估当前最先进的奖励模型(RMs)在指导对齐过程中的可靠性。研究发现,这些奖励模型常常无法准确反映人类对安全性的偏好,从而揭示了其在实际应用中的局限性。通过揭示这些挑战,论文强调了在微调过程中保持安全对齐的复杂性,并为开发者提供了平衡实用性和安全性方面的指导。
链接: https://arxiv.org/abs/2502.01116
作者: Guanlin Li,Kangjie Chen,Shangwei Guo,Jie Zhang,Han Qiu,Chao Zhang,Guoyin Wang,Tianwei Zhang,Jiwei Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have emerged as powerful tools for addressing a wide range of general inquiries and tasks. Despite this, fine-tuning aligned LLMs on smaller, domain-specific datasets, critical to adapting them to specialized tasks, can inadvertently degrade their safety alignment, even when the datasets are benign. This phenomenon makes models more susceptible to providing inappropriate responses. In this study, we systematically examine the factors contributing to safety alignment degradation in benign fine-tuning scenarios. Our analysis identifies three critical factors affecting aligned LLMs: answer structure, identity calibration, and role-play. Additionally, we evaluate the reliability of state-of-the-art reward models (RMs), which are often used to guide alignment processes. Our findings reveal that these RMs frequently fail to accurately reflect human preferences regarding safety, underscoring their limitations in practical applications. By uncovering these challenges, our work highlights the complexities of maintaining safety alignment during fine-tuning and offers guidance to help developers balance utility and safety in LLMs. Datasets and fine-tuning code used in our experiments can be found in this https URL.
zh
[NLP-49] GFM-RAG : Graph Foundation Model for Retrieval Augmented Generation
【速读】: 该论文旨在解决传统检索增强生成(Retrieval-augmented generation, RAG)模型难以捕捉复杂知识间关系的问题,从而限制其在需要从多源整合知识的复杂推理任务中的性能。为了解决这一问题,论文提出了一种新型图基础模型(Graph Foundation Model, GFM),即GFM-RAG。其关键是引入了一个创新的图神经网络,能够通过显式建模图结构来捕获复杂的查询-知识关系,从而实现更有效的知识检索和整合。
链接: https://arxiv.org/abs/2502.01113
作者: Linhao Luo,Zicheng Zhao,Gholamreza Haffari,Dinh Phung,Chen Gong,Shirui Pan
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 6 figures
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) has proven effective in integrating knowledge into large language models (LLMs). However, conventional RAGs struggle to capture complex relationships between pieces of knowledge, limiting their performance in intricate reasoning that requires integrating knowledge from multiple sources. Recently, graph-enhanced retrieval augmented generation (GraphRAG) builds graph structure to explicitly model these relationships, enabling more effective and efficient retrievers. Nevertheless, its performance is still hindered by the noise and incompleteness within the graph structure. To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for retrieval augmented generation. GFM-RAG is powered by an innovative graph neural network that reasons over graph structure to capture complex query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage training process on large-scale datasets, comprising 60 knowledge graphs with over 14M triples and 700k documents. This results in impressive performance and generalizability for GFM-RAG, making it the first graph foundation model applicable to unseen datasets for retrieval without any fine-tuning required. Extensive experiments on three multi-hop QA datasets and seven domain-specific RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance while maintaining efficiency and alignment with neural scaling laws, highlighting its potential for further improvement.
zh
[NLP-50] ZebraLogic: On the Scaling Limits of LLM s for Logical Reasoning
【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)在复杂非单调推理中的逻辑推理能力及其可扩展性。为了解决这一问题,论文引入了ZebraLogic评估框架,用于评估LLMs在源自约束满足问题(Constraint Satisfaction Problems, CSPs)的逻辑网格谜题上的推理性能。ZebraLogic能够生成具有可控且量化复杂度的谜题,从而系统地研究包括Llama、o1模型和DeepSeek-R1在内的模型的扩展极限。通过涵盖广泛的搜索空间复杂性和多样的逻辑约束,ZebraLogic提供了一个结构化的环境来评估推理难度增加时的表现。论文的关键在于揭示了随着问题复杂度增加,模型准确率显著下降的现象,并探讨了包括Best-of-N采样、回溯机制和自我验证提示等策略以增强逻辑推理能力。
链接: https://arxiv.org/abs/2502.01100
作者: Bill Yuchen Lin,Ronan Le Bras,Kyle Richardson,Ashish Sabharwal,Radha Poovendran,Peter Clark,Yejin Choi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Website: this https URL
点击查看摘要
Abstract:We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows – a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
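逻辑网格谜题本质上是约束满足问题(CSP),可以用"枚举排列 + 约束检查"写出一个最小求解器来示意(纯 Python;谜题内容为自拟玩具示例,并非 ZebraLogic 原题):

```python
from itertools import permutations

# 3座房子,求每座房子居民的国籍;三条约束为自拟玩具示例
people = ["英国人", "西班牙人", "日本人"]

def satisfies(assign):
    # assign[i] 为第 i 座房子(0为最左)的居民
    return (
        assign[0] != "日本人"                                   # 日本人不住最左
        and assign.index("英国人") < assign.index("西班牙人")    # 英国人在西班牙人左边
        and assign[2] == "西班牙人"                              # 西班牙人住最右
    )

solutions = [p for p in permutations(people) if satisfies(p)]
```

搜索空间随房子数与属性数按阶乘增长,这正是 ZebraLogic 得以系统调控谜题复杂度、进而观察"复杂度诅咒"的原因。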
zh
[NLP-51] Enhancing Aspect-based Sentiment Analysis with ParsBERT in Persian Language
【速读】: 该论文旨在解决波斯语文本挖掘中数据集稀缺和现有语言模型效率低下的挑战。解决方案的关键在于提出了一种基于方面的情感分析方法,利用增强型的ParsBERT模型和相关词典,从而显著提升了情感分析的准确度(88.2%)和F1得分(61.7),有效增强了针对波斯语的语言模型效能。
链接: https://arxiv.org/abs/2502.01091
作者: Farid Ariai,Maryam Tayefeh Mahmoudi,Ali Moeini
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In the era of pervasive internet use and the dominance of social networks, researchers face significant challenges in Persian text mining including the scarcity of adequate datasets in Persian and the inefficiency of existing language models. This paper specifically tackles these challenges, aiming to amplify the efficiency of language models tailored to the Persian language. Focusing on enhancing the effectiveness of sentiment analysis, our approach employs an aspect-based methodology utilizing the ParsBERT model, augmented with a relevant lexicon. The study centers on sentiment analysis of user opinions extracted from the Persian website ‘Digikala.’ The experimental results not only highlight the proposed method’s superior semantic capabilities but also showcase its efficiency gains with an accuracy of 88.2% and an F1 score of 61.7. The importance of enhancing language models in this context lies in their pivotal role in extracting nuanced sentiments from user-generated content, ultimately advancing the field of sentiment analysis in Persian text mining by increasing efficiency and accuracy.
zh
[NLP-52] Classic4Children: Adapting Chinese Literary Classics for Children with Large Language Model NAACL2025
【速读】: 该论文旨在解决儿童难以阅读中国文学经典的问题,通过引入儿童友好型文学改编(Child-Friendly Literary Adaptation, CLA)任务,使这些作品更易于儿童理解。论文的关键解决方案是提出了一种名为InstructChild的方法,该方法通过增强大型语言模型(LLM)以适应儿童的阅读偏好(如生动的角色描绘、简洁的叙事结构和适当的可读性),并采用细粒度指令微调来获取角色个性和叙事结构。此外,论文设计了一个可读性指标作为奖励来调整LLM与儿童阅读水平的一致性,并应用前瞻解码策略在推理过程中提高生成文本的可读性。为了支持CLA任务的评估,构建了包含原著及其儿童友好版本的Classic4Children数据集。实验结果表明,InstructChild显著提升了自动评估和人工评估的性能。
链接: https://arxiv.org/abs/2502.01090
作者: Jiali Chen,Xusen Hei,Yuqi Xue,Zihan Wu,Jiayuan Xie,Yi Cai
机构: Key Laboratory of Big Data and Intelligent Robot (大数据与智能机器人重点实验室) Ministry of Education; School of Software Engineering (软件工程学院), South China University of Technology; The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at NAACL 2025 Findings
点击查看摘要
Abstract:Chinese literary classics hold significant cultural and educational value, offering deep insights into morality, history, and human nature. These works often include classical Chinese and complex narratives, making them difficult for children to read. To bridge this gap, we introduce a child-friendly literary adaptation (CLA) task to adapt the Chinese literary classic into engaging and accessible text for children. However, recent large language models (LLMs) overlook children’s reading preferences (i.e., vivid character portrayals, concise narrative structures, and appropriate readability), which poses challenges in CLA. In this paper, we propose a method called InstructChild, which augments the LLM with these preferences for adaptation. Specifically, we first obtain the characters’ personalities and narrative structure as additional information for fine-grained instruction tuning. Then, we devise a readability metric as the reward to align the LLM with the children’s reading level. Finally, a lookahead decoding strategy is applied to improve the readability of the generated text during inference. To support the evaluation of the CLA task, we construct the Classic4Children dataset, which comprises both the original and child-friendly versions of the Four Great Classical Novels of Chinese literature. Experimental results show that our InstructChild significantly improves automatic and human evaluation performance.
zh
[NLP-53] Tool Unlearning for Tool-Augmented LLMs
【速读】: 该论文旨在解决工具增强型大语言模型(Tool-augmented LLMs)在面对安全漏洞、隐私法规或工具弃用等情况下,如何有效地“遗忘”已学工具的问题。这一任务被称为“工具未学习”(tool unlearning),在现有未学习(unlearning)研究中尚未被探讨。论文的关键解决方案是提出了一种名为ToolDelete的方法,该方法具备三项关键属性以有效应对工具未学习中的挑战,并引入了一种新的成员推理攻击(Membership Inference Attack, MIA)模型用于评估。实验结果表明,ToolDelete能够有效删除随机选择的工具,同时保持模型在其他未删除工具上的知识以及整体任务性能。
链接: https://arxiv.org/abs/2502.01083
作者: Jiali Cheng,Hadi Amiri
机构: University of Massachusetts Lowell
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: this https URL
点击查看摘要
Abstract:Tool-augmented large language models (LLMs) are often trained on datasets of query-response pairs, which embed the ability to use tools or APIs directly into the parametric knowledge of LLMs. Tool-augmented LLMs need the ability to forget learned tools due to security vulnerabilities, privacy regulations, or tool deprecations. However, "tool unlearning" has not been investigated in the unlearning literature. We introduce this novel task, which requires addressing distinct challenges compared to traditional unlearning: knowledge removal rather than forgetting individual samples, the high cost of optimizing LLMs, and the need for principled evaluation metrics. To bridge these gaps, we propose ToolDelete, the first approach for unlearning tools from tool-augmented LLMs. It implements three key properties to address the above challenges for effective tool unlearning and introduces a new membership inference attack (MIA) model for effective evaluation. Extensive experiments on multiple tool learning datasets and tool-augmented LLMs show that ToolDelete effectively unlearns randomly selected tools, while preserving the LLM’s knowledge on non-deleted tools and maintaining performance on general tasks.
zh
[NLP-54] The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
【速读】: 该论文旨在解决大型语言模型在处理多模态任务中的高级推理能力不足的问题。关键在于评估和追踪GPT系列和o系列模型在复杂多模态难题中的表现,这些难题需要细粒度的视觉感知以及抽象或算法推理能力。研究表明,尽管o系列模型在某些方面表现出色,但仍存在显著的性能瓶颈,特别是在简单的多模态抽象推理和算法推理任务上。
链接: https://arxiv.org/abs/2502.01081
作者: Vernon Y.H. Toh,Yew Ken Chia,Deepanway Ghosal,Soujanya Poria
机构: Singapore University of Technology and Design (新加坡科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The releases of OpenAI’s o1 and o3 mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, o3 outperformed humans in novel problem-solving and skill acquisition on the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models on challenging multimodal puzzles, requiring fine-grained visual perception with abstract or algorithmic reasoning. The superior performance of o1 comes at nearly 750 times the computational cost of GPT-4o, raising concerns about its efficiency. Our results reveal a clear upward trend in reasoning capabilities across model iterations, with notable performance jumps across GPT-series models and subsequently to o1. Nonetheless, we observe that the o1 model still struggles with simple multimodal puzzles requiring abstract reasoning. Furthermore, its performance in algorithmic puzzles remains poor. We plan to continuously track new models in the series and update our results in this paper accordingly. All resources used in this evaluation are openly available this https URL.
zh
[NLP-55] FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
【速读】: 该论文旨在解决大型语言模型(LLMs)在处理长上下文序列时所需的关键值(KV)缓存消耗大量计算资源和内存的问题。现有压缩方法主要关注减少内存需求,但未能显著提升延迟性能。论文提出的关键解决方案是FastKV,这是一种KV缓存压缩方法,通过引入Token-Selective Propagation(TSP)技术,在保持精度的同时加速处理速度,并采用grouped-query attention(GQA)感知的KV缓存压缩来提高内存和计算效率。实验结果表明,FastKV相比最先进的HeadKV方法,在首次令牌时间(TTFT)和吞吐量方面分别提升了2.00倍和1.40倍,同时保持了长上下文基准测试的准确性。
链接: https://arxiv.org/abs/2502.01068
作者: Dongwon Jo,Jiwon Song,Yulhwa Kim,Jae-Joon Kim
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to enhance latency for long-context sequences. To enhance processing speeds while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information in deeper layers even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00x and 1.40x improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV successfully maintains accuracy on long-context benchmarks at levels comparable to the baselines. Our code is available at this https URL.
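TSP 的核心操作(在某一层之后仅向深层传播一部分 token)可以用如下草图示意。注意:以「token 收到的总注意力」作为重要性度量是本示例的假设,论文的具体打分方式以原文为准:

```python
import numpy as np

def select_tokens_tsp(attn_scores, keep_ratio=0.25):
    """Token-Selective Propagation 示意:在切换层按注意力聚合分数
    对 token 排序,仅保留得分最高的一部分向更深层传播。"""
    n = attn_scores.shape[-1]
    keep = max(1, int(n * keep_ratio))
    importance = attn_scores.sum(axis=0)            # 每个 token 收到的总注意力(假设的度量)
    kept = np.sort(np.argsort(importance)[-keep:])  # 保留索引并维持原始顺序
    return kept

rng = np.random.default_rng(0)
attn = rng.random((8, 8))                 # 玩具注意力矩阵(query x key)
kept = select_tokens_tsp(attn, keep_ratio=0.25)
assert kept.shape == (2,)                 # 8 个 token 仅保留 25%
```

深层的 KV 缓存只需为被保留的 token 存储键值,这是 TTFT 与吞吐量提升的直观来源。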
zh
[NLP-56] Knowledge Synthesis of Photosynthesis Research Using a Large Language Model
【速读】: 该论文旨在解决当前大型语言模型(LLMs)在处理复杂生物数据和光合作用理论模型时存在的不足,无法提供准确科学背景的问题。解决方案的关键在于提出了一种基于OpenAI的GPT-4o,结合检索增强生成(RAG)技术和提示优化的光合作用研究助手(PRAG)。通过使用向量数据库和自动化反馈循环进行提示优化,以提高对光合作用相关查询响应的准确性和相关性。
链接: https://arxiv.org/abs/2502.01059
作者: Seungri Yoon,Woosang Jeon,Sanghyeok Choi,Taehyeong Kim,Tae In Ahn
机构: Seoul National University(首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures
点击查看摘要
Abstract:The development of biological data analysis tools and large language models (LLMs) has opened up new possibilities for utilizing AI in plant science research, with the potential to contribute significantly to knowledge integration and research gap identification. Nonetheless, current LLMs struggle to handle complex biological data and theoretical models in photosynthesis research and often fail to provide accurate scientific contexts. Therefore, this study proposed a photosynthesis research assistant (PRAG) based on OpenAI’s GPT-4o with retrieval-augmented generation (RAG) techniques and prompt optimization. Vector databases and an automated feedback loop were used in the prompt optimization process to enhance the accuracy and relevance of the responses to photosynthesis-related queries. PRAG showed an average improvement of 8.7% across five metrics related to scientific writing, with a 25.4% increase in source transparency. Additionally, its scientific depth and domain coverage were comparable to those of photosynthesis research papers. A knowledge graph was used to structure PRAG’s responses with papers within and outside the database, which allowed PRAG to match key entities with 63% and 39.5% of the database and test papers, respectively. PRAG can be applied for photosynthesis research and broader plant science domains, paving the way for more in-depth data analysis and predictive capabilities.
zh
[NLP-57] Mitigating Hallucinations in Large Vision-Language Models with Internal Fact-based Contrastive Decoding
【速读】: 该论文旨在解决大型视觉语言模型(Large Visual Language Models, LVLMs)在推理过程中出现的对象幻觉(object hallucinations)问题。论文的关键解决方案是提出了一种名为内部事实基础对比解码(Internal Fact-based Contrastive Decoding, IFCD)的模型无关方法。IFCD通过利用LVLMs自身的幻觉现象,在推理过程中校准模型输出,并有效移除最终预测中的幻觉logits,从而缓解对象级别和属性级别的幻觉问题,同时提升了POPE和MME数据集上的准确性。
链接: https://arxiv.org/abs/2502.01056
作者: Chao Wang,Xuancheng Zhou,Weiwei Fu,Yang Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Visual Language Models (LVLMs) integrate visual and linguistic modalities, exhibiting exceptional performance across various multimodal tasks. Nevertheless, LVLMs remain vulnerable to the issue of object hallucinations. Previous efforts to mitigate this issue focus on supervised fine-tuning (SFT) or incorporating external knowledge, both of which entail significant costs related to training and the acquisition of external data. To address these challenges, we propose a novel model-agnostic approach termed Internal Fact-based Contrastive Decoding (IFCD), designed to mitigate and suppress hallucinations during the inference process of LVLMs by exploiting the LVLMs’ own hallucinations. IFCD is grounded in experimental observations that alterations to the LVLMs’ internal representations tend to amplify hallucinations caused by language bias. By contrasting disturbed distribution, IFCD calibrates the LVLMs’ output and effectively removes the hallucinatory logits from the final predictions. Experimental results validate that IFCD significantly alleviates both object-level and attribute-level hallucinations while achieving an average 9% accuracy improvement on POPE and 8% accuracy improvement on MME object hallucinations subset compared with direct decoding, respectively.
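IFCD 的具体校准公式摘要中未给出;下面用对比解码文献中常见的线性外推形式做一个示意(`alpha` 为假设的超参数):被扰动分布放大的 token 更可能源于语言偏置幻觉,校准后其 logit 被抑制:

```python
import numpy as np

def ifcd_decode(logits_orig, logits_disturbed, alpha=1.0):
    """对比解码的线性外推形式(示意):用原始分布与扰动分布之差
    校准输出,从最终预测中移除幻觉性 logit。"""
    return (1 + alpha) * logits_orig - alpha * logits_disturbed

logits_orig = np.array([1.0, 1.2, 0.5])   # 直接解码:token 1 胜出(幻觉)
logits_dist = np.array([0.5, 2.0, 0.5])   # 扰动内部表示后 token 1 被进一步放大
calibrated = ifcd_decode(logits_orig, logits_dist)
assert int(np.argmax(calibrated)) == 0    # 校准后幻觉 token 被抑制
```

直观上,扰动前后 logit 同向增长的 token 被视为「由偏置驱动」,其相对得分在相减中被削弱。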
zh
[NLP-58] PARA: Parameter-Efficient Fine-tuning with Prompt Aware Representation Adjustment ACL-2024
【速读】: 该论文旨在解决参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在单骨干多租户应用中的效率与性能平衡问题。论文的关键在于提出了一种新的方法,称为提示感知表征调整(Prompt Aware Representation Adjustment, PARA)。PARA通过在每个Transformer层内集成一个轻量级向量生成器来实现,该生成器能够根据输入提示生成响应向量,从而相应地调整隐藏表示。这种方法在保持相似可调参数数量的同时,展示了超越现有PEFT基准的性能,并且在单骨干多租户场景下比LoRA更为高效。
链接: https://arxiv.org/abs/2502.01033
作者: Zequan Liu,Yi Zhao,Ming Tan,Wei Zhu,Aaron Xuxiang Tian
机构: RWTH Aachen University (RWTH 亚琛工业大学); University of Pennsylvania (宾夕法尼亚大学); Southern University of Science and Technology (南方科技大学); University of Hong Kong (香港大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: accepted by ACL-2024
点击查看摘要
Abstract:In the realm of parameter-efficient fine-tuning (PEFT) methods, while options like LoRA are available, there is a persistent demand in the industry for a PEFT approach that excels in both efficiency and performance within the context of single-backbone multi-tenant applications. This paper introduces a new and straightforward PEFT technique, termed Prompt Aware Representation Adjustment (PARA). The core of our proposal is to integrate a lightweight vector generator within each Transformer layer. This generator produces vectors that are responsive to input prompts, thereby adjusting the hidden representations accordingly. Our extensive experimentation across diverse tasks has yielded promising results. Firstly, the PARA method has been shown to surpass current PEFT benchmarks in terms of performance, despite having a similar number of adjustable parameters. Secondly, it has proven to be more efficient than LoRA in the single-backbone multi-tenant scenario, highlighting its significant potential for industrial adoption.
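「每层一个轻量向量生成器,按输入提示调整隐状态」可以粗略示意如下。生成器的具体结构(瓶颈式降维-非线性-升维)与维度均为本示例的假设:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_bottleneck = 16, 4   # 隐层维度与瓶颈维度(假设值)

# 轻量级向量生成器的参数:降维 -> 非线性 -> 升维(结构为假设)
W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.1
W_up = rng.standard_normal((d_bottleneck, d_model)) * 0.1

def para_adjust(hidden, prompt_repr):
    """由提示表示生成一个调整向量,加到该层所有 token 的隐状态上。"""
    v = np.tanh(prompt_repr @ W_down) @ W_up
    return hidden + v

hidden = rng.standard_normal((5, d_model))    # 5 个 token 的隐状态
prompt_repr = rng.standard_normal(d_model)    # 池化后的提示表示
out = para_adjust(hidden, prompt_repr)
assert out.shape == hidden.shape
```

每个租户只需保存各层的生成器参数;主干权重保持不变,这正是其适合单骨干多租户场景的原因。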
zh
[NLP-59] Knowing When to Stop: Dynamic Context Cutoff for Large Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在处理输入上下文时效率低下的问题,特别是在查询所需信息局限于局部上下文的情况下。论文的关键解决方案是动态上下文截断(Dynamic Context Cutoff),这是一种受人类启发的方法,使模型能够在获取足够的任务相关的信息后自动终止处理。通过分析模型内部,研究发现特定的注意力头(attention heads)天然编码了“充分性信号”(sufficiency signals),这些信号可以通过轻量级分类器检测到,并预测何时已处理到关键信息。这一发现揭示了一个新的效率范式:模型内部的理解自然地指导处理需求,而不是依赖外部压缩启发式方法。
链接: https://arxiv.org/abs/2502.01025
作者: Roy Xie,Junlin Wang,Paul Rosu,Chunyuan Deng,Bolun Sun,Zihao Lin,Bhuwan Dhingra
机构: 未知
类目: Computation and Language (cs.CL)
备注: Project Website: this https URL
点击查看摘要
Abstract:Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient in cases where the information required to answer a query is localized within the context. We present dynamic context cutoff, a human-inspired method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode “sufficiency signals” - detectable through lightweight classifiers - that predict when critical information has been processed. This reveals a new efficiency paradigm: models’ internal understanding naturally dictates processing needs rather than external compression heuristics. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate 1.33x average token reduction while improving accuracy by 1.3%. Furthermore, our method demonstrates better performance with the same rate of token reduction compared to other context efficiency methods. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
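动态截断的控制流程可以用如下草图说明(纯示意:真实系统中 `sufficiency_prob` 由注意力头特征上的轻量分类器给出,此处用关键词探针代替,属本示例假设):

```python
def process_with_cutoff(chunks, sufficiency_prob, threshold=0.9):
    """按块流式读入上下文;一旦充分性信号超过阈值,
    立即停止处理后续上下文,实现"自我终止"。"""
    consumed = []
    for chunk in chunks:
        consumed.append(chunk)
        if sufficiency_prob(consumed) >= threshold:
            break
    return consumed

# 玩具探针:看到关键词即视为"信息充分"
probe = lambda seen: 1.0 if any("answer" in c for c in seen) else 0.0
used = process_with_cutoff(["intro", "background", "the answer is 42", "appendix"], probe)
assert used == ["intro", "background", "the answer is 42"]
```

节省的 token 数取决于关键信息在上下文中的位置:信息越靠前,截断越早,收益越大。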
zh
[NLP-60] MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs NAACL2024
【速读】: 该论文旨在解决在将不同领域专家模型(Expert LLMs)融合成统一的混合专家模型(Mixture-of-Experts, MoE)过程中遇到的挑战,特别是针对参数权重高度不同的模型或具有不同架构的模型。论文的关键解决方案包括引入新的MoE融合技术,通过策略减轻参数干扰、采用路由启发式方法减少MoE微调需求,并提出一种新型的方法来融合具有不同架构的专家模型。这些方法显著降低了微调成本,提升了性能,并扩展了MoE融合的应用范围。
链接: https://arxiv.org/abs/2502.00997
作者: Yuhang Zhou,Giannis Karamanolakis,Victor Soto,Anna Rumshisky,Mayank Kulkarni,Furong Huang,Wei Ai,Jianhua Lu
机构: University of Maryland, College Park (马里兰大学公园分校); Amazon AGI (亚马逊AGI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by NAACL 2024 Main
点击查看摘要
Abstract:The recent success of specialized Large Language Models (LLMs) in domains such as mathematical reasoning and coding has led to growing interest in methods for merging these expert LLMs into a unified Mixture-of-Experts (MoE) model, with the goal of enhancing performance in each domain while retaining effectiveness on general tasks. However, the effective merging of expert models remains an open challenge, especially for models with highly divergent weight parameters or different architectures. State-of-the-art MoE merging methods only work with homogeneous model architectures and rely on simple unweighted averaging to merge expert layers, which does not address parameter interference and requires extensive fine-tuning of the merged MoE to restore performance. To address these limitations, this paper introduces new MoE merging techniques, including strategies to mitigate parameter interference, routing heuristics to reduce the need for MoE fine-tuning, and a novel method for merging experts with different architectures. Extensive experiments across multiple domains demonstrate the effectiveness of our proposed methods, reducing fine-tuning costs, improving performance over state-of-the-art methods, and expanding the applicability of MoE merging.
zh
[NLP-61] Self-supervised Analogical Learning using Language Models
【速读】: 该论文旨在解决大型语言模型在推理一致性方面的问题,即模型在训练数据不熟悉的情境下表现不佳,尽管它们能够成功解决类似且更常见的问题。为了解决这一问题,论文提出了一种名为SAL(自监督类比学习框架)的方法。SAL的关键在于模仿人类的类比过程,通过训练模型将高质量的符号化解决方案从已知的解题案例转移到其他罕见且容易出错的情境中,从而促使模型理解高层次和抽象的推理过程,而非仅仅关注最终答案。这种方法显著提升了模型在多种推理基准测试中的性能,并增强了模型的泛化能力和可控性。
链接: https://arxiv.org/abs/2502.00996
作者: Ben Zhou,Sarthak Jain,Yi Zhang,Qiang Ning,Shuai Wang,Yassine Benajiba,Dan Roth
机构: Arizona State University (亚利桑那州立大学); Amazon (亚马逊); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models have been shown to suffer from reasoning inconsistency issues. That is, they fail more in situations unfamiliar to the training data, even though exact or very similar reasoning paths exist in more common cases that they can successfully solve. Such observations motivate us to propose methods that encourage models to understand the high-level and abstract reasoning processes during training instead of only the final answer. This way, models can transfer the exact solution to similar cases, regardless of their relevance to the pre-training data distribution. In this work, we propose SAL, a self-supervised analogical learning framework. SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions from cases that they know how to solve to other rare cases in which they tend to fail more. We show that the resulting models after SAL learning outperform base language models on a wide range of reasoning benchmarks, such as StrategyQA, GSM8K, and HotpotQA, by 2% to 20%. At the same time, we show that our model is more generalizable and controllable through analytical studies.
zh
[NLP-62] ChartCitor: Multi-Agent Framework for Fine-Grained Chart Visual Attribution
【速读】: 该论文旨在解决大型语言模型(LLMs)在执行图表问答任务时经常生成未经验证的幻觉性回复的问题。现有答案归因方法难以将回答与源图表关联,主要因为有限的视觉语义上下文、复杂的视觉文本对齐需求以及复杂布局中的边界框预测难题。论文提出的关键解决方案是ChartCitor,一个多智能体框架,通过识别图表图像内的支持证据来提供细粒度的边界框引用。该系统协调LLM智能体进行图表到表格的提取、答案重铸、表格增强、预筛选和重新排序的证据检索,以及表格到图表的映射。这些步骤共同提升了现有基线模型在不同图表类型上的表现,并增强了用户对生成式AI的信任。
链接: https://arxiv.org/abs/2502.00989
作者: Kanika Goswami,Puneet Mathur,Ryan Rossi,Franck Dernoncourt
机构: IGDTUW(印度技术大学德里分校), Delhi India; Adobe Research(Adobe研究), USA; Adobe Research(Adobe研究), USA; Adobe Research(Adobe研究), USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) can perform chart question-answering tasks but often generate unverified hallucinated responses. Existing answer attribution methods struggle to ground responses in source charts due to limited visual-semantic context, complex visual-text alignment requirements, and difficulties in bounding box prediction across complex layouts. We present ChartCitor, a multi-agent framework that provides fine-grained bounding box citations by identifying supporting evidence within chart images. The system orchestrates LLM agents to perform chart-to-table extraction, answer reformulation, table augmentation, evidence retrieval through pre-filtering and re-ranking, and table-to-chart mapping. ChartCitor outperforms existing baselines across different chart types. Qualitative user studies show that ChartCitor helps increase user trust in Generative AI by providing enhanced explainability for LLM-assisted chart QA and enables professionals to be more productive.
zh
[NLP-63] PlotGen: Multi-Agent LLM -based Scientific Data Visualization via Multimodal Feedback
【速读】: 该论文旨在解决 novice 用户在科学数据可视化过程中面临的工具选择复杂性和技术掌握困难的问题。解决方案的关键在于 PlotGen,这是一个多代理框架,通过包括查询规划代理、代码生成代理以及三个检索反馈代理在内的多个基于大规模语言模型(LLM)的代理,实现科学可视化创建的自动化。这些代理协同工作,逐步分解用户请求、生成可执行代码,并通过迭代反馈机制优化数据准确性、文本标签和视觉正确性,从而提高可视化结果的质量和用户信任度。
链接: https://arxiv.org/abs/2502.00988
作者: Kanika Goswami,Puneet Mathur,Ryan Rossi,Franck Dernoncourt
机构: IGDTUW(德里技术大学); Adobe Research(Adobe研究); Adobe Research(Adobe研究); Adobe Research(Adobe研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Scientific data visualization is pivotal for transforming raw data into comprehensible visual representations, enabling pattern recognition, forecasting, and the presentation of data-driven insights. However, novice users often face difficulties due to the complexity of selecting appropriate tools and mastering visualization techniques. Large Language Models (LLMs) have recently demonstrated potential in assisting code generation, though they struggle with accuracy and require iterative debugging. In this paper, we propose PlotGen, a novel multi-agent framework aimed at automating the creation of precise scientific visualizations. PlotGen orchestrates multiple LLM-based agents, including a Query Planning Agent that breaks down complex user requests into executable steps, a Code Generation Agent that converts pseudocode into executable Python code, and three retrieval feedback agents - a Numeric Feedback Agent, a Lexical Feedback Agent, and a Visual Feedback Agent - that leverage multimodal LLMs to iteratively refine the data accuracy, textual labels, and visual correctness of generated plots via self-reflection. Extensive experiments show that PlotGen outperforms strong baselines, achieving a 4-6 percent improvement on the MatPlotBench dataset, leading to enhanced user trust in LLM-generated visualizations and improved novice productivity due to a reduction in debugging time needed for plot errors.
zh
[NLP-64] RandLoRA: Full-rank parameter-efficient fine-tuning of large models ICLR
【速读】: 该论文旨在解决在低秩适应(Low-Rank Adaptation, LoRA)与标准微调之间观察到的性能差距问题。论文的关键在于引入RandLoRA方法,通过学习线性组合低秩、非训练随机矩阵的方式实现全秩更新,同时限制优化仅作用于应用于固定随机矩阵的对角缩放矩阵。这种方法能够在保持参数和内存效率的同时,有效克服低秩带来的表示能力限制。
链接: https://arxiv.org/abs/2502.00987
作者: Paul Albert,Frederic Z. Zhang,Hemanth Saratchandran,Cristian Rodriguez-Opazo,Anton van den Hengel,Ehsan Abbasnejad
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: To appear at the International Conference on Learning Representations (ICLR) 2025
点击查看摘要
Abstract:Low-Rank Adaptation (LoRA) and its variants have shown impressive results in reducing the number of trainable parameters and memory requirements of large transformer networks while maintaining fine-tuning performance. However, the low-rank nature of the weight update inherently limits the representation power of fine-tuned models, potentially compromising performance on complex tasks. This raises a critical question: when a performance gap between LoRA and standard fine-tuning is observed, is it due to the reduced number of trainable parameters or the rank deficiency? This paper aims to answer this question by introducing RandLoRA, a parameter-efficient method that performs full-rank updates using a learned linear combination of low-rank, non-trainable random matrices. Our method limits the number of trainable parameters by restricting optimization to diagonal scaling matrices applied to the fixed random matrices. This allows us to effectively overcome the low-rank limitations while maintaining parameter and memory efficiency during training. Through extensive experimentation across vision, language, and vision-language benchmarks, we systematically evaluate the limitations of LoRA and existing random basis methods. Our findings reveal that full-rank updates are beneficial across vision and language tasks individually, and even more so for vision-language tasks, where RandLoRA significantly reduces – and sometimes eliminates – the performance gap between standard fine-tuning and LoRA, demonstrating its efficacy.
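「固定随机低秩基 + 可训练对角缩放」如何凑出满秩更新,可用如下数值草图验证(维度与基的个数为假设值,参数化形式按摘要描述概括,细节以原文为准):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n = 8, 8, 2, 4   # 输出/输入维度、每个基的秩、随机基个数(假设值)

# 固定、不可训练的随机低秩基
B = rng.standard_normal((n, d, r))
A = rng.standard_normal((n, r, k))
# 仅有的可训练参数:作用在固定随机矩阵上的对角缩放
lam = rng.standard_normal((n, r))

def delta_w(B, A, lam):
    # dW = sum_i B_i diag(lam_i) A_i:随机基的可学习线性组合
    return sum(B[i] @ np.diag(lam[i]) @ A[i] for i in range(len(B)))

dW = delta_w(B, A, lam)
# n 个独立的秩 r 项叠加,可达满秩 min(n*r, d, k);此处 4*2 = 8
assert np.linalg.matrix_rank(dW) == min(n * r, d, k)
```

与 LoRA 的单个秩 r 更新相比,可训练参数同为 O(n·r) 量级,但更新矩阵不再受单一低秩约束。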
zh
[NLP-65] Context-Aware Hierarchical Merging for Long Document Summarization
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理长文本摘要时因固定输入长度限制而产生的局限性。具体而言,层次合并(Hierarchical Merging)技术虽能将长文本分解为更小部分进行处理,但递归合并过程会放大模型的幻觉效应(hallucinations),增加事实不准确性的风险。论文的关键解决方案在于通过从源文档中引入上下文信息来增强层次合并技术,提出了多种上下文增强方法,包括替换中间摘要、使用上下文作为支持证据进行精炼以及隐式引用输入文档。实验结果显示,在法律和叙事领域的数据集上,这些上下文增强方法显著优于零样本和基本层次合并方法,特别是在与抽取式摘要结合使用时,精炼方法表现出最佳性能。
链接: https://arxiv.org/abs/2502.00977
作者: Litu Ou,Mirella Lapata
机构: University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL)
备注: 30 pages
点击查看摘要
Abstract:Hierarchical Merging is a technique commonly used to summarize very long texts (>100K tokens) by breaking down the input into smaller sections, summarizing those sections individually, and then merging or combining those summaries into a final coherent summary. Although it helps address the limitations of large language models (LLMs) with fixed input length constraints, the recursive merging process can amplify LLM hallucinations, increasing the risk of factual inaccuracies. In this paper, we seek to mitigate hallucinations by enriching hierarchical merging with context from the source document. Specifically, we propose different approaches to contextual augmentation ranging from replacing intermediate summaries with relevant input context, to refining them while using the context as supporting evidence, and aligning them implicitly (via citations) to the input. Experimental results on datasets representing legal and narrative domains show that contextual augmentation consistently outperforms zero-shot and hierarchical merging baselines for the Llama 3.1 model family. Our analysis further reveals that refinement methods tend to perform best when paired with extractive summarization for identifying relevant input.
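层次合并基线的递归流程可以示意如下(真实系统中 `summarize` 与 `merge` 均为 LLM 调用,此处用玩具替身代替;论文的贡献在于 merge 一步额外注入源文档上下文,草图未体现该增强):

```python
def hierarchical_merge(chunks, summarize, merge, fanout=2):
    """层次合并示意:先对每个片段分别摘要,
    再按 fanout 分组递归合并,直到只剩一个摘要。"""
    sums = [summarize(c) for c in chunks]
    while len(sums) > 1:
        sums = [merge(sums[i:i + fanout]) for i in range(0, len(sums), fanout)]
    return sums[0]

# 玩具替身:截断代替摘要,拼接代替合并
summarize = lambda c: c[:10]
merge = lambda parts: " | ".join(parts)
result = hierarchical_merge(
    ["alpha " * 5, "beta " * 5, "gamma " * 5, "delta " * 5],
    summarize, merge)
assert "alpha" in result and "delta" in result
```

幻觉放大的风险正来自这条递归链:每一层 merge 都只见到上一层的摘要而非原文,误差逐层累积,这也是为何论文要在 merge 时重新接入源文档上下文。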
zh
[NLP-66] Wizard of Shopping: Target-Oriented E-commerce Dialogue Generation with Decision Tree Branching SIGDIAL2024
【速读】: 该论文旨在解决聊天式产品搜索(CPS)领域中因缺乏可靠且大规模数据集而导致的智能助手训练难题。论文的关键解决方案是提出了一种名为TRACER的新方法,该方法利用大型语言模型(LLMs)生成针对不同购物领域的逼真且自然的对话,并通过与对话计划(dialogue plans)结合,确保对话过程中产品搜索轨迹的相关性和高效性。此外,论文还发布了首个目标导向的CPS数据集Wizard of Shopping (WoS),包含三个购物领域的高度自然连贯的对话共3.6k条,以验证所提方法的有效性。
链接: https://arxiv.org/abs/2502.00969
作者: Xiangci Li,Zhiyu Chen,Jason Ingyu Choi,Nikhita Vedula,Besnik Fetahu,Oleg Rokhlenko,Shervin Malmasi
机构: AWS AI Labs; Amazon.com, Inc.
类目: Computation and Language (cs.CL)
备注: Accepted by SIGDIAL 2024 but withdrawn
点击查看摘要
Abstract:The goal of conversational product search (CPS) is to develop an intelligent, chat-based shopping assistant that can directly interact with customers to understand shopping intents, ask clarification questions, and find relevant products. However, training such assistants is hindered mainly due to the lack of reliable and large-scale datasets. Prior human-annotated CPS datasets are extremely small in size and lack integration with real-world product search systems. We propose a novel approach, TRACER, which leverages large language models (LLMs) to generate realistic and natural conversations for different shopping domains. TRACER’s novelty lies in grounding the generation to dialogue plans, which are product search trajectories predicted from a decision tree model, that guarantees relevant product discovery in the shortest number of search conditions. We also release the first target-oriented CPS dataset Wizard of Shopping (WoS), containing highly natural and coherent conversations (3.6k) from three shopping domains. Finally, we demonstrate the quality and effectiveness of WoS via human evaluations and downstream tasks.
zh
[NLP-67] Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search
【速读】: 该论文旨在解决基于Q值选择数据以增强大规模语言模型(Large Language Model, LLM)驱动的多智能体系统(multi-agent system, MAS)自训练过程中存在的不一致性问题。论文的关键解决方案在于提出了一种名为数据影响力导向树搜索(Data Influence-oriented Tree Search, DITS)的新框架,通过引入影响力分数来指导树搜索和数据选择过程。DITS 方法通过利用影响力分数有效识别对系统改进影响最大的数据,从而提升模型性能,并且针对非可微指标设计了影响力分数估算方法,显著降低了计算开销。研究表明,在数据合成过程中更多地分配推理资源用于估算影响力分数而非Q值,能够更有效地提升模型训练效果。
链接: https://arxiv.org/abs/2502.00955
作者: Wentao Shi,Zichun Yu,Fuli Feng,Xiangnan He,Chenyan Xiong
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Monte Carlo Tree Search (MCTS) based methods provide promising approaches for generating synthetic data to enhance the self-training of Large Language Model (LLM) based multi-agent systems (MAS). These methods leverage Q-values to estimate individual agent contributions. However, relying solely on Q-values to identify informative data may misalign with the data synthesis objective, as the focus should be on selecting data that best enhances model training. To address this discrepancy, we propose Data Influence-oriented Tree Search (DITS), a novel framework that incorporates influence scores to guide both tree search and data selection. By leveraging influence scores, we effectively identify the most impactful data for system improvement, thereby enhancing model performance. Furthermore, we derive influence score estimation methods tailored for non-differentiable metrics, significantly reducing computational overhead by utilizing inference computations. Extensive experiments on eight multi-agent datasets demonstrate the robustness and effectiveness of the proposed methods. Notably, our findings reveal that allocating more inference resources to estimate influence scores, rather than Q-values, during data synthesis can more effectively and efficiently enhance model training.
zh
[NLP-68] Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale
Quick read: This paper addresses the problem of efficiently extracting and normalizing medical information from large volumes of unstructured clinical text. Traditional approaches require substantial manual effort, including crafting rules or annotating training labels, which limits scalability. The key solution is UniMedAbstractor (UMA), a zero-shot medical abstraction framework that leverages large language models (LLMs) through modular, customizable prompt templates. Its universal prompt template lets UMA adapt quickly to new attributes without attribute-specific training labels or rules, yielding broader applicability and higher efficiency.
Link: https://arxiv.org/abs/2502.00943
Authors: Cliff Wong, Sam Preston, Qianchu Liu, Zelalem Gero, Jass Bagga, Sheng Zhang, Shrey Jain, Theodore Zhao, Yu Gu, Yanbo Xu, Sid Kiblawi, Roshanthi Weerasinghe, Rom Leidner, Kristina Young, Brian Piening, Carlo Bifulco, Tristan Naumann, Mu Wei, Hoifung Poon
Affiliations: Microsoft, Redmond, WA, USA; Providence Research Network, Renton, WA, USA; Providence Genomics, Portland, OR, USA; Earle A. Chiles Research Institute, Providence Cancer Institute, Portland, OR, USA; The Oregon Clinic, Radiation Oncology Division, Portland, OR; Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:The vast majority of real-world patient information resides in unstructured clinical text, and the process of medical abstraction seeks to extract and normalize structured information from this unstructured input. However, traditional medical abstraction methods can require significant manual efforts that can include crafting rules or annotating training labels, limiting scalability. In this paper, we propose UniMedAbstractor (UMA), a zero-shot medical abstraction framework leveraging Large Language Models (LLMs) through a modular and customizable prompt template. We refer to our approach as universal abstraction as it can quickly scale to new attributes through its universal prompt template without curating attribute-specific training labels or rules. We evaluate UMA for oncology applications, focusing on fifteen key attributes representing the cancer patient journey, from short-context attributes (e.g., performance status, treatment) to complex long-context attributes requiring longitudinal reasoning (e.g., tumor site, histology, TNM staging). Experiments on real-world data show UMA’s strong performance and generalizability. Compared to supervised and heuristic baselines, UMA with GPT-4o achieves on average an absolute 2-point F1/accuracy improvement for both short-context and long-context attribute abstraction. For pathologic T staging, UMA even outperforms the supervised model by 20 points in accuracy.
zh
[NLP-69] Attention Sinks and Outlier Features: A "Catch, Tag, and Release" Mechanism for Embeddings
Quick read: This paper investigates two prominent phenomena in large language models (LLMs): the presence of large-norm (outlier) features and the tendency of tokens to attend very strongly to a select few tokens. It focuses on how these phenomena manifest in model parameters and what they imply for performance, compression, and streaming. A key result is a proof that the "catch, tag, release" mechanism is required even for simple tasks such as averaging, explaining why it arises organically in modern LLMs. Experiments further show that attention sinks can be fully captured in the model parameters by low-rank matrices, which aids model compression and substantiates recent approaches that add a low-rank term to offset performance degradation.
Link: https://arxiv.org/abs/2502.00919
Authors: Stephen Zhang, Mustafa Khan, Vardan Papyan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Two prominent features of large language models (LLMs) are the presence of large-norm (outlier) features and the tendency for tokens to attend very strongly to a select few tokens. Despite often having no semantic relevance, these select tokens, called attention sinks, along with the large outlier features, have proven important for model performance, compression, and streaming. Consequently, investigating the roles of these phenomena within models and exploring how they might manifest in the model parameters has become an area of active interest. Through an empirical investigation, we demonstrate that attention sinks utilize outlier features to: catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream, where the tagged tokens are eventually retrieved. We prove that simple tasks, like averaging, necessitate the ‘catch, tag, release’ mechanism hence explaining why it would arise organically in modern LLMs. Our experiments also show that the creation of attention sinks can be completely captured in the model parameters using low-rank matrices, which has important implications for model compression and substantiates the success of recent approaches that incorporate a low-rank term to offset performance degradation.
zh
[NLP-70] The Accuracy, Robustness, and Readability of LLM-Generated Sustainability-Related Word Definitions
Quick read: This paper examines the importance of a common language with standardized definitions for climate discussions, and the risk that large language models (LLMs) misrepresent climate terms. The authors compare 300 official IPCC glossary definitions with those generated by GPT-4o-mini, Llama3.1 8B, and Mistral 7B, evaluating adherence (average 0.57–0.59 ± 0.15), robustness, and readability. The key idea is that analyzing where model-generated definitions diverge from the originals, especially for polysemous or ambiguous words, highlights terms in need of standardization. The results suggest that while LLMs can support environmental discourse, their outputs should be aligned with established terminology to ensure clarity and consistency.
Link: https://arxiv.org/abs/2502.00916
Authors: Alice Heiman
Affiliations: Stanford University
Categories: Computation and Language (cs.CL)
Comments: NLP4Ecology Workshop 2025
Click to view abstract
Abstract:A common language with standardized definitions is crucial for effective climate discussions. However, concerns exist about LLMs misrepresenting climate terms. We compared 300 official IPCC glossary definitions with those generated by GPT-4o-mini, Llama3.1 8B, and Mistral 7B, analyzing adherence, robustness, and readability using SBERT sentence embeddings. The LLMs scored an average adherence of 0.57–0.59 ± 0.15, and their definitions proved harder to read than the originals. Model-generated definitions vary mainly among words with multiple or ambiguous definitions, showing the potential to highlight terms that need standardization. The results show how LLMs could support environmental discourse while emphasizing the need to align model outputs with established terminology for clarity and consistency.
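As a rough illustration of the adherence measurement above: the paper compares definitions via SBERT sentence embeddings, and a standard way to score such embedding pairs is cosine similarity. The similarity function and the toy 3-dimensional vectors below are illustrative assumptions, not the authors' exact pipeline:

```python
import math

def cosine_similarity(u, v):
    # dot product of u and v divided by the product of their norms
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings of an official IPCC definition and a model-generated one.
official = [0.2, 0.7, 0.1]
generated = [0.25, 0.65, 0.05]
adherence = cosine_similarity(official, generated)  # near 1 when well aligned
```

With real SBERT vectors, scores near 1 would indicate a generated definition that closely adheres to the official one.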
zh
[NLP-71] Embracing Dialectic Intersubjectivity: Coordination of Different Perspectives in Content Analysis with LLM Persona Simulation
Quick read: This study aims to move content analysis methodology from consensus-oriented toward coordination-oriented practice, embracing diverse coding outputs and examining the dynamics among differing perspectives. The key approach evaluates six GPT-4o configurations for sentiment analysis of Fox News and MSNBC transcripts on Biden and Trump during the 2020 U.S. presidential campaign, using these evaluations to explore how partisan selective processing can be identified in LLM-Assisted Content Analysis (LACA). The study finds that partisan persona LLMs exhibit stronger ideological bias when processing politically congruent content, and that intercoder reliability is higher among same-partisan personas than across cross-partisan pairs. This approach deepens the nuanced understanding of LLM outputs and strengthens the rigor of AI-driven social science research, enabling simulations of real-world implications.
Link: https://arxiv.org/abs/2502.00903
Authors: Taewoo Kang, Kjerstin Thorson, Tai-Quan Peng, Dan Hiaeshutter-Rice, Sanguk Lee, Stuart Soroka
Affiliations: Michigan State University; Colorado State University; Texas Christian University; University of California, Los Angeles
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments:
Click to view abstract
Abstract:This study attempts to advance content analysis methodology from consensus-oriented to coordination-oriented practices, thereby embracing diverse coding outputs and exploring the dynamics among differential perspectives. As an exploratory investigation of this approach, we evaluate six GPT-4o configurations to analyze sentiment in Fox News and MSNBC transcripts on Biden and Trump during the 2020 U.S. presidential campaign, examining patterns across these models. By assessing each model’s alignment with ideological perspectives, we explore how partisan selective processing could be identified in LLM-Assisted Content Analysis (LACA). Findings reveal that partisan persona LLMs exhibit stronger ideological biases when processing politically congruent content. Additionally, intercoder reliability is higher among same-partisan personas compared to cross-partisan pairs. This approach enhances the nuanced understanding of LLM outputs and advances the integrity of AI-driven social science research, enabling simulations of real-world implications.
zh
[NLP-72] MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
Quick read: This paper addresses the suboptimal segmentation produced by existing tokenization methods such as Byte Pair Encoding (BPE), which ignore morpheme boundaries, particularly in morphologically rich languages. The key solution is MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency, improving the accuracy and consistency of subword segmentation.
Link: https://arxiv.org/abs/2502.00894
Authors: Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri
Affiliations: Qatar Computing Research Institute (QCRI), Doha, Qatar; German Research Center for Artificial Intelligence (DFKI), Berlin, Germany; Technical University of Berlin, Berlin, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme boundaries, leading to suboptimal segmentation, particularly in morphologically rich languages. We introduce MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency. Additionally, we propose two morphology-based evaluation metrics: (i) Morphological Consistency F1-Score, which quantifies the consistency between morpheme sharing and token sharing, contributing to LLM training convergence, and (ii) Morphological Edit Distance, which measures alignment between morphemes and tokens concerning interpretability. Experiments on English, Russian, Hungarian, and Arabic across 300M and 1B parameter LLMs demonstrate that MorphBPE consistently reduces cross-entropy loss, accelerates convergence, and improves morphological alignment scores. Fully compatible with existing LLM pipelines, MorphBPE requires minimal modifications for integration. The MorphBPE codebase and tokenizer playground will be available at: this https URL and this https URL
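A toy sketch of the core idea: a BPE merge-counting step that refuses to merge across morpheme boundaries. The boundary marker, the data structure, and the merge policy below are illustrative assumptions; MorphBPE's actual algorithm may differ:

```python
from collections import Counter

BOUNDARY = "|"  # hypothetical morpheme-boundary marker, e.g. "un|happy"

def pair_counts(words):
    """Count adjacent symbol pairs, skipping any pair that touches a boundary."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            if a == BOUNDARY or b == BOUNDARY:
                continue  # never merge across a morpheme boundary
            counts[(a, b)] += freq
    return counts

# A tiny "corpus": the word un|happy seen 3 times, split into symbols.
words = {("u", "n", BOUNDARY, "h", "a", "p", "p", "y"): 3}
best = pair_counts(words).most_common(1)[0][0]  # a most frequent legal pair
```

Plain BPE would happily merge "n" with "h" across the prefix boundary; here such pairs never enter the merge table, so subwords stay morpheme-aligned.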
zh
[NLP-73] SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters ICLR2025
Quick read: This paper addresses the problem that existing preference-optimization objectives for language model alignment require extensive hyperparameter tuning, which increases the complexity and time cost of fine-tuning large language models. It proposes SimPER (Simple Preference Optimization via Inverse Perplexity), a simple yet effective hyperparameter-free preference-optimization algorithm. The key idea is to optimize inverse perplexity, computed from the exponentiated average log-likelihood of the chosen and rejected responses, which removes the need for expensive hyperparameter tuning and a reference model and is therefore efficient in both compute and memory. Experiments show that SimPER significantly outperforms existing methods across multiple benchmarks without any hyperparameters or a reference model.
Link: https://arxiv.org/abs/2502.00883
Authors: Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, Vasant G Honavar
Affiliations: Pennsylvania State University; University of Chinese Academy of Sciences; Meituan Inc; Tencent AI Lab; Sun Yat-Sen University; Leiden University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: ICLR 2025
Click to view abstract
Abstract:Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for language model alignment. We observe that promising performance can be achieved simply by optimizing inverse perplexity, which is calculated as the inverse of the exponentiated average log-likelihood of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER, is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both computationally and memory efficient. Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches, even without any hyperparameters or a reference model. For example, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across 10 benchmarks on the Open LLM Leaderboard. The source code for SimPER is publicly available at: this https URL.
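The inverse-perplexity objective lends itself to a compact sketch. Assuming per-token log-likelihoods are available for the chosen and rejected responses, one plausible reading of the SimPER loss is the negated gap between their inverse perplexities; the exact form is an assumption here, not the paper's verbatim objective:

```python
import math

def inverse_perplexity(token_logps):
    # exp of the average per-token log-likelihood, i.e. 1 / perplexity
    return math.exp(sum(token_logps) / len(token_logps))

def simper_loss(chosen_logps, rejected_logps):
    # push the chosen response's inverse perplexity up, the rejected one's down
    return -inverse_perplexity(chosen_logps) + inverse_perplexity(rejected_logps)
```

Note there is nothing to tune: no temperature/beta hyperparameter and no reference-model log-probabilities, which is the point of the method.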
zh
[NLP-74] Language Models Use Trigonometry to Do Addition
Quick read: This paper aims to close the gap in understanding how large language models (LLMs) internally handle simple mathematical tasks, addition in particular. The key finding is that these models represent numbers as a generalized helix and manipulate that helix with the "Clock" algorithm to compute sums. Causal interventions validate both the representation and the mechanism, yielding the first representation-level explanation of an LLM's mathematical capability.
Link: https://arxiv.org/abs/2502.00873
Authors: Subhash Kantamneni, Max Tegmark
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Mathematical reasoning is an increasingly important indicator of large language model (LLM) capabilities, yet we lack understanding of how LLMs process even simple mathematical tasks. To address this, we reverse engineer how three mid-sized LLMs compute addition. We first discover that numbers are represented in these LLMs as a generalized helix, which is strongly causally implicated for the tasks of addition and subtraction, and is also causally relevant for integer division, multiplication, and modular arithmetic. We then propose that LLMs compute addition by manipulating this generalized helix using the “Clock” algorithm: to solve a+b , the helices for a and b are manipulated to produce the a+b answer helix which is then read out to model logits. We model influential MLP outputs, attention head outputs, and even individual neuron preactivations with these helices and verify our understanding with causal interventions. By demonstrating that LLMs represent numbers on a helix and manipulate this helix to perform addition, we present the first representation-level explanation of an LLM’s mathematical capability.
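The helix-plus-Clock mechanism can be sketched numerically. Assuming periodic (Fourier-like) features with a handful of periods (the set {2, 5, 10, 100} below is an assumption for illustration), the "a + b" helix falls out of the trigonometric angle-sum identities:

```python
import math

PERIODS = [2, 5, 10, 100]  # assumed period set; the paper's exact set may differ

def helix(n, periods=PERIODS):
    # generalized helix: a linear component plus (cos, sin) pairs per period
    feats = [float(n)]
    for T in periods:
        theta = 2 * math.pi * n / T
        feats += [math.cos(theta), math.sin(theta)]
    return feats

def clock_add(a, b, periods=PERIODS):
    # "Clock" step: rotate a's angle by b's angle in every period,
    # using cos(x+y) = cos x cos y - sin x sin y and the sine analogue
    feats = [float(a + b)]
    for T in periods:
        ta, tb = 2 * math.pi * a / T, 2 * math.pi * b / T
        feats += [math.cos(ta) * math.cos(tb) - math.sin(ta) * math.sin(tb),
                  math.sin(ta) * math.cos(tb) + math.cos(ta) * math.sin(tb)]
    return feats
```

`clock_add(3, 4)` reproduces `helix(7)` coordinate by coordinate, which is the sense in which rotating the two helices implements a + b.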
zh
[NLP-75] Predicting potentially unfair clauses in Chilean terms of services with natural language processing
Quick read: This paper addresses information asymmetry in consumer contracts, particularly in the context of complex and rarely read online Terms of Service. The key contribution is a new annotation scheme with four categories and twenty classes in total, applied to fifty online Terms of Service used in Chile. The paper also evaluates how transformer-based models detect and classify potentially abusive clauses, focusing on the effects of language- and domain-specific pre-training, few-shot sample size, and model architecture on performance.
Link: https://arxiv.org/abs/2502.00865
Authors: Christoffer Loeffler, Andrea Martínez Freile, Tomás Rey Pizarro
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 37 pages, 2 figures, under review
Click to view abstract
Abstract:This study addresses the growing concern of information asymmetry in consumer contracts, exacerbated by the proliferation of online services with complex Terms of Service that are rarely even read. Even though research on automatic analysis methods is conducted, the problem is aggravated by the general focus on English-language Machine Learning approaches and on major jurisdictions, such as the European Union. We introduce a new methodology and a substantial dataset addressing this gap. We propose a novel annotation scheme with four categories and a total of 20 classes, and apply it on 50 online Terms of Service used in Chile. Our evaluation of transformer-based models highlights how factors like language- and/or domain-specific pre-training, few-shot sample size, and model architecture affect the detection and classification of potentially abusive clauses. Results show a large variability in performance for the different tasks and models, with the highest macro-F1 scores for the detection task ranging from 79% to 89% and micro-F1 scores up to 96%, while macro-F1 scores for the classification task range from 60% to 70% and micro-F1 scores from 64% to 80%. Notably, this is the first Spanish-language multi-label classification dataset for legal clauses, applying Chilean law and offering a comprehensive evaluation of Spanish-language models in the legal domain. Our work lays the ground for future research in method development for rarely considered legal analysis and potentially leads to practical applications to support consumers in Chile and Latin America as a whole.
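The macro- and micro-F1 scores reported above aggregate per-class results differently: macro averages per-class F1, treating rare classes equally, while micro pools all counts before computing a single F1. A minimal sketch using the standard definitions (not code from the paper):

```python
def f1(tp, fp, fn):
    # standard F1 from true positives, false positives, false negatives
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f1(per_class):
    # per_class: list of (tp, fp, fn) tuples, one per label
    macro = sum(f1(*c) for c in per_class) / len(per_class)
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    return macro, f1(tp, fp, fn)  # (macro-F1, micro-F1)
```

The gap between the paper's macro and micro scores (e.g. 60-70% vs. 64-80% for classification) is what this distinction captures: micro-F1 is dominated by the frequent clause types.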
zh
[NLP-76] HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions SIGIR2025
Quick read: This paper addresses the fragmentation of resources in automatic hint generation: datasets come in inconsistent formats and compatible evaluation tools are missing. The key solution is HintEval, a Python library that provides convenient access to diverse datasets and multiple approaches for generating and evaluating hints. HintEval consolidates scattered resources into a single toolkit that supports a broad range of research goals and enables clear, multi-faceted, and reliable evaluation. The library also ships with detailed online documentation to help users explore its features and get started quickly. By lowering barriers to entry and encouraging consistent evaluation practice, HintEval marks a significant step forward for hint generation and analysis research in the NLP/IR community.
Link: https://arxiv.org/abs/2502.00857
Authors: Jamshid Mozafari, Bhawna Piryani, Abdelrahman Abdallah, Adam Jatowt
Affiliations: University of Innsbruck
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Submitted to SIGIR 2025
Click to view abstract
Abstract:Large Language Models (LLMs) are transforming how people find information, and many users turn nowadays to chatbots to obtain answers to their questions. Despite the instant access to abundant information that LLMs offer, it is still important to promote critical thinking and problem-solving skills. Automatic hint generation is a new task that aims to support humans in answering questions by themselves by creating hints that guide users toward answers without directly revealing them. In this context, hint evaluation focuses on measuring the quality of hints, helping to improve the hint generation approaches. However, resources for hint research are currently spanning different formats and datasets, while the evaluation tools are missing or incompatible, making it hard for researchers to compare and test their models. To overcome these challenges, we introduce HintEval, a Python library that makes it easy to access diverse datasets and provides multiple approaches to generate and evaluate hints. HintEval aggregates the scattered resources into a single toolkit that supports a range of research goals and enables a clear, multi-faceted, and reliable evaluation. The proposed library also includes detailed online documentation, helping users quickly explore its features and get started. By reducing barriers to entry and encouraging consistent evaluation practices, HintEval offers a major step forward for facilitating hint generation and analysis research within the NLP/IR community.
zh
[NLP-77] Explainability in Practice: A Survey of Explainable NLP Across Various Domains
Quick read: This paper addresses the lack of transparency and explainability caused by the black-box nature of advanced models in natural language processing (NLP). The key contribution is a survey of explainable NLP (XNLP) methods designed for different domains, meeting sector-specific needs such as healthcare's demand for clear insights and finance's emphasis on fraud detection and risk assessment. The review also fills gaps in the existing literature by examining real-world applicability, metric evaluation, and the role of human interaction in model assessment, and it proposes future research directions to deepen understanding and broaden the adoption of XNLP.
Link: https://arxiv.org/abs/2502.00837
Authors: Hadi Mohammadi, Ayoub Bagheri, Anastasia Giachanou, Daniel L. Oberski
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Natural Language Processing (NLP) has become a cornerstone in many critical sectors, including healthcare, finance, and customer relationship management. This is especially true with the development and use of advanced models such as GPT-based architectures and BERT, which are widely used in decision-making processes. However, the black-box nature of these advanced NLP models has created an urgent need for transparency and explainability. This review explores explainable NLP (XNLP) with a focus on its practical deployment and real-world applications, examining its implementation and the challenges faced in domain-specific contexts. The paper underscores the importance of explainability in NLP and provides a comprehensive perspective on how XNLP can be designed to meet the unique demands of various sectors, from healthcare’s need for clear insights to finance’s emphasis on fraud detection and risk assessment. Additionally, this review aims to bridge the knowledge gap in XNLP literature by offering a domain-specific exploration and discussing underrepresented areas such as real-world applicability, metric evaluation, and the role of human interaction in model assessment. The paper concludes by suggesting future research directions that could enhance the understanding and broader application of XNLP.
zh
[NLP-78] Generalization of Medical Large Language Models through Cross-Domain Weak Supervision
Quick read: This paper addresses how to effectively enhance the generative capabilities of medical large language models (MLLMs) for complex medical NLP tasks. The key solution is the Incremental Curriculum-Based Fine-Tuning (ICFT) framework, which combines curriculum-based learning, dual-stage memory coordination, and parameter-efficient fine-tuning to enable a progressive transition from general linguistic knowledge to strong domain-specific medical expertise, significantly improving accuracy and efficiency while strengthening generalization and reducing errors.
Link: https://arxiv.org/abs/2502.00832
Authors: Robert Long, Eric Gonzalez, Harrison Fuller
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:The advancement of large language models (LLMs) has opened new frontiers in natural language processing, particularly in specialized domains like healthcare. In this paper, we propose the Incremental Curriculum-Based Fine-Tuning (ICFT) framework to enhance the generative capabilities of medical large language models (MLLMs). ICFT combines curriculum-based learning, dual-stage memory coordination, and parameter-efficient fine-tuning to enable a progressive transition from general linguistic knowledge to strong domain-specific expertise. Experimental results across diverse medical NLP tasks, including question answering, preference classification, and response generation, demonstrate that ICFT consistently outperforms state-of-the-art baselines, achieving improvements in both accuracy and efficiency. Further analysis reveals the framework’s ability to generalize to unseen data, reduce errors, and deliver diverse, contextually relevant medical responses. These findings establish ICFT as a robust and scalable solution for adapting LLMs to the medical domain, offering practical benefits for real-world healthcare applications.
zh
[NLP-79] Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models
Quick read: This paper targets challenges in text-to-image generation, including computational inefficiency, training instability, and limited robustness to textual variation. The key solution combines large language models (LLMs) with diffusion models, introducing a novel dynamic KL-weighting strategy to optimize the diffusion process and leveraging semantic understanding from pre-trained LLMs to guide generation.
Link: https://arxiv.org/abs/2502.00826
Authors: Julian Perry, Frank Sanders, Carter Scott
Affiliations: Delta University for Science and Technology
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:In this paper, we present a novel method for improving text-to-image generation by combining Large Language Models (LLMs) with diffusion models, a hybrid approach aimed at achieving both higher quality and efficiency in image synthesis from text descriptions. Our approach introduces a new dynamic KL-weighting strategy to optimize the diffusion process, along with incorporating semantic understanding from pre-trained LLMs to guide the generation process. The proposed method significantly improves both the visual quality and alignment of generated images with text descriptions, addressing challenges such as computational inefficiency, instability in training, and robustness to textual variability. We evaluate our method on the COCO dataset and demonstrate its superior performance over traditional GAN-based models, both quantitatively and qualitatively. Extensive experiments, including ablation studies and human evaluations, confirm that our method outperforms existing approaches in terms of image realism, relevance to the input text, and overall aesthetic quality. Our approach also shows promise in scalability to other multimodal tasks, making it a versatile solution for a wide range of generative applications.
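The abstract does not specify the dynamic KL-weighting strategy, so the following is a purely hypothetical sketch of the general idea: scale the per-timestep KL term of the diffusion training loss by a weight that varies with the timestep. The linear schedule, parameter names, and loss decomposition are invented for illustration only:

```python
def kl_weight(t, T, w_min=0.1, w_max=1.0):
    # hypothetical schedule: linearly emphasize later (noisier) timesteps
    return w_min + (w_max - w_min) * (t / T)

def weighted_diffusion_loss(recon_terms, kl_terms, T):
    # recon_terms / kl_terms: per-timestep loss components, indexed 0..T-1;
    # the KL part of each timestep is scaled by its dynamic weight
    return sum(r + kl_weight(t, T) * k
               for t, (r, k) in enumerate(zip(recon_terms, kl_terms)))
```

In a real implementation the weight could also be adapted during training (e.g. from validation signals); the static linear ramp here is only the simplest instance of a time-dependent weight.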
zh
[NLP-80] Probing Large Language Models in Reasoning and Translating Complex Linguistic Puzzles
Quick read: This paper investigates the use of large language models (LLMs) for solving complex linguistic puzzles, tasks that demand advanced reasoning and proficient translation akin to human cognitive processes. The key contribution is the study of specific prompting techniques, including Input-Output Prompting (IO), Chain-of-Thought Prompting (CoT), and Solo Performance Prompting (SPP), to strengthen LLM reasoning and expose decision-making pathways. Using datasets from the Puzzling Machine Competition and various Linguistics Olympiads, the authors apply a comprehensive set of metrics to assess GPT-4 0603 under these prompting methods, giving insight into the potential and limitations of LLMs in linguistic reasoning and complex translation tasks. The work contributes to the broader NLP field by offering insights into optimizing LLM applications for improved reasoning and translation accuracy.
Link: https://arxiv.org/abs/2502.00817
Authors: Zheng-Lin Lin, Yu-Fei Shih, Shu-Kai Hsieh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 8 figures
Click to view abstract
Abstract:This paper investigates the utilization of Large Language Models (LLMs) for solving complex linguistic puzzles, a domain requiring advanced reasoning and adept translation capabilities akin to human cognitive processes. We explore specific prompting techniques designed to enhance ability of LLMs to reason and elucidate their decision-making pathways, with a focus on Input-Output Prompting (IO), Chain-of-Thought Prompting (CoT), and Solo Performance Prompting (SPP). Utilizing datasets from the Puzzling Machine Competition and various Linguistics Olympiads, we employ a comprehensive set of metrics to assess the performance of GPT-4 0603, a prominent LLM, across these prompting methods. Our findings illuminate the potential of LLMs in linguistic reasoning and complex translation tasks, highlighting their capabilities and identifying limitations in the context of linguistic puzzles. This research contributes significantly to the broader field of Natural Language Processing (NLP) by providing insights into the optimization of LLM applications for improved reasoning and translation accuracy, thereby enriching the ongoing dialogue in NLP advancements.
zh
[NLP-81] Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling
Quick read: This paper addresses two problems: existing reward models are easily exploited through superficial confounders, with length bias having a particularly strong impact on preference modeling, and fine-tuned large language models (LLMs) struggle to follow explicit length instructions. The key solution is a Response-conditioned Bradley-Terry (Rc-BT) model that explicitly disentangles human semantic preferences from response-length requirements; trained on an augmented dataset, it improves the reward model's ability to mitigate length bias and follow length instructions. The paper further proposes the Rc-DPO algorithm, which leverages the Rc-BT model for direct policy optimization (DPO) of LLMs, simultaneously mitigating length bias and promoting adherence to length instructions.
Link: https://arxiv.org/abs/2502.00814
Authors: Jianfeng Cai, Jinhua Zhu, Ruopei Sun, Yue Wang, Li Li, Wengang Zhou, Houqiang Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs) by modeling human preferences with a learnable reward model and employing a reinforcement learning algorithm to maximize the reward model’s scores. However, these reward models are susceptible to exploitation through various superficial confounding factors, with length bias emerging as a particularly significant concern. Moreover, while the pronounced impact of length bias on preference modeling suggests that LLMs possess an inherent sensitivity to length perception, our preliminary investigations reveal that fine-tuned LLMs consistently struggle to adhere to explicit length instructions. To address these two limitations, we propose a novel framework wherein the reward model explicitly differentiates between human semantic preferences and response length requirements. Specifically, we introduce a Response-conditioned Bradley-Terry (Rc-BT) model that enhances the reward model’s capability in length bias mitigating and length instruction following, through training on our augmented dataset. Furthermore, we propose the Rc-DPO algorithm to leverage the Rc-BT model for direct policy optimization (DPO) of LLMs, simultaneously mitigating length bias and promoting adherence to length instructions. Extensive evaluations demonstrate that our approach substantially improves both preference modeling and length instruction compliance, with its effectiveness validated across various foundational models and preference datasets.
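For orientation, here is the standard Bradley-Terry preference loss that Rc-BT builds on, plus a hypothetical hook showing how a length requirement could be kept separate from the semantic reward. The `rc_reward` function and its `meets_length` flag are illustrative assumptions, not the paper's formulation:

```python
import math

def bt_loss(r_chosen, r_rejected):
    # Bradley-Terry: -log P(chosen > rejected) = -log sigmoid(r_c - r_r)
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def rc_reward(semantic_reward, meets_length, penalty=1.0):
    # hypothetical response-conditioned reward: the length requirement enters
    # as a separate term instead of being entangled with semantic preference
    return semantic_reward - (0.0 if meets_length else penalty)
```

Keeping the two signals separate is the point: a longer response can no longer buy reward purely through length, because length compliance is scored against the explicit requirement.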
zh
[NLP-82] Vision-centric Token Compression in Large Language Model
Quick read: This paper addresses the inefficiency and redundancy of processing extended in-context tokens in large language models (LLMs). The key finding is that a much smaller vision encoder, applied directly to sequences of text tokens, can rival conventional text encoders, achieving comparable results on several mid-sized and small text-understanding benchmarks while using 16% fewer FLOPs and 50% less memory. The authors also uncover significant token redundancy and devise a frequency-based masking strategy that steers the vision encoder toward the most critical tokens, further improving performance.
Link: https://arxiv.org/abs/2502.00791
Authors: Ling Xing, Alex Jinpeng Wang, Rui Yan, Jinhui Tang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have revolutionized natural language processing, excelling in handling longer sequences. However, the inefficiency and redundancy in processing extended in-context tokens remain a challenge. Many attempts to address this rely on compressing tokens with smaller text encoders, yet we question whether text encoders are truly indispensable. Our journey leads to an unexpected discovery-a much smaller vision encoder, applied directly to sequences of text tokens, can rival text encoders on text tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small text understanding benchmarks, VIST leads to comparable results with 16% fewer FLOPs and 50% less memory usage. We further uncover significant token redundancy and devise a frequency-based masking strategy to guide the focus of the visual encoder toward the most critical tokens. Interestingly, we observe the trained visual encoder performs like a summarizer, selectively ignoring less important words such as prepositions and conjunctions. This approach delivers remarkable results, outperforming traditional text encoder-based methods by 5.7% on average over benchmarks like TriviaQA, NQ, PopQA, TREF, SST2, and SST5, setting a new standard for token efficiency in LLMs.
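One plausible reading of the frequency-based masking strategy, consistent with the observation that the trained encoder ignores function words like prepositions and conjunctions, is to mask high-frequency tokens and keep rarer, more informative ones. The heuristic, the keep ratio, and the mask token below are all assumptions, not the paper's exact recipe:

```python
from collections import Counter

def frequency_mask(tokens, corpus_counts, keep_ratio=0.5):
    # keep the rarest token types (assumed most informative); mask the rest
    ranked = sorted(set(tokens), key=lambda t: corpus_counts[t])
    keep = set(ranked[: max(1, int(len(ranked) * keep_ratio))])
    return [t if t in keep else "[MASK]" for t in tokens]

# Toy corpus statistics standing in for real corpus frequencies.
corpus_counts = Counter(
    "the cat sat on the mat and the dog sat on the rug".split()
)
masked = frequency_mask(["the", "cat", "sat", "on", "the", "mat"], corpus_counts)
```

Here the frequent function words ("the", "sat", "on") are masked while the rare content words ("cat", "mat") survive, which is the focusing effect the paper describes.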
zh
[NLP-83] FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training
Quick read: This paper addresses high-quality data selection for large language model (LLM) pre-training. The key contribution is FIRE, a flexible and scalable framework that integrates multiple data-quality raters to assess data quality comprehensively across dimensions. FIRE aligns diverse quality signals into a unified space and combines them into a comprehensive quality signal for each data point. It further introduces a FIRE-based progressive data-selection scheme that iteratively refines the choice of high-quality data points, balancing computational complexity against gains in orthogonality.
Link: https://arxiv.org/abs/2502.00761
Authors: Liangyu Xu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Jingang Wang, Xunliang Cai
Affiliations: Peking University; Beihang University; Tsinghua University; Meituan
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 11 figures
Click to view abstract
Abstract:Selecting high-quality data can significantly improve the pre-training efficiency of large language models (LLMs). Existing methods often rely on heuristic techniques and single quality signals, limiting their ability to comprehensively evaluate data quality. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points, balancing computational complexity with the refinement of orthogonality. Experiments on the SlimPajama dataset reveal that FIRE consistently outperforms other selection methods and significantly enhances the pre-trained model across a wide range of downstream tasks, with a 2.9% average performance boost and reducing the FLOPs necessary to achieve a certain performance level by more than half.
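The abstract says FIRE aligns multiple quality signals into a unified space without giving the mechanism; rank normalization is one simple way to put heterogeneous rater scores on a common scale before combining them. This is an illustrative assumption, not FIRE's actual method:

```python
def rank_normalize(scores):
    # map raw rater scores to [0, 1] by rank (assumes at least 2 documents),
    # so raters with different scales become directly comparable
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for r, i in enumerate(order):
        ranks[i] = r / (len(scores) - 1)
    return ranks

def fire_score(per_rater_scores):
    # per_rater_scores: one score list per rater, over the same documents;
    # average the aligned signals into one quality score per document
    normed = [rank_normalize(s) for s in per_rater_scores]
    return [sum(col) / len(col) for col in zip(*normed)]
```

A progressive scheme could then repeatedly keep the top-scoring fraction and re-score the survivors, which matches the iterative refinement the abstract describes at a high level.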
zh
[NLP-84] Structural Latency Perturbation in Large Language Models Through Recursive State Induction
Quick read: This paper addresses the computational efficiency of high-capacity language models in real-time applications, in particular the constraints of inference latency and resource consumption. The key contribution is a structured latency-perturbation mechanism that modifies computational pathways through recursive state induction, dynamically suppressing redundant activations while preserving generative fidelity. By selectively suppressing redundant activations, the mechanism improves computational efficiency without compromising token retention or memory utilization.
Link: https://arxiv.org/abs/2502.00758
Authors: Michael Mangrum, Jonathan Pemberton, Benedict Wetherby, Philip Montague
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Computational efficiency has remained a critical consideration in scaling high-capacity language models, with inference latency and resource consumption presenting significant constraints on real-time applications. The study has introduced a structured latency perturbation mechanism that modifies computational pathways through recursive state induction, enabling dynamic suppression of redundant activations while preserving generative fidelity. A formal mathematical framework has been established to describe recursive perturbations, ensuring that modifications remain adaptive rather than statically imposed. Experiments have demonstrated that applying recursive state adjustments reduces inference latency across varying sequence lengths, with longer text generations benefiting from cumulative efficiency improvements. Comparative evaluations against structured pruning and quantization have indicated that latency gains can be achieved without compromising token retention or memory utilization. The analysis of computational overhead has suggested that selectively suppressing redundant activations contributes to improved power efficiency, particularly in scenarios requiring extended text generation. An assessment of linguistic stability has shown that token-level consistency remains largely intact under controlled perturbation thresholds, reinforcing the viability of structural latency modifications as an alternative to weight-centric optimization techniques. The results have supported the hypothesis that recursive state induction offers an effective method for reducing computational complexity without requiring architectural modifications or external augmentation.
zh
[NLP-85] Zero-Shot Warning Generation for Misinformative Multimodal Content
【速读】: 该论文旨在解决多模态误导信息(Misinformation)的传播问题,特别是将真实图像与虚假文字配对、极具欺骗性的脱离上下文误导信息(out-of-context misinformation)。论文的关键解决方案在于提出一种通过跨模态一致性检查(cross-modality consistency checks)检测此类多模态误导信息的模型,且只需极短的训练时间。此外,论文还提出一种轻量级模型,仅用三分之一的参数即可达到有竞争力的性能,并引入一个零样本生成上下文化警示的双用途任务,以实现自动辟谣并增强用户理解。
链接: https://arxiv.org/abs/2502.00752
作者: Giovanni Pio Delvecchio,Huy Hong Nguyen,Isao Echizen
机构: National Institute of Informatics (NII), Japan; The University of Tokyo, Japan
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:The widespread prevalence of misinformation poses significant societal concerns. Out-of-context misinformation, where authentic images are paired with false text, is particularly deceptive and easily misleads audiences. Most existing detection methods primarily evaluate image-text consistency but often lack sufficient explanations, which are essential for effectively debunking misinformation. We present a model that detects multimodal misinformation through cross-modality consistency checks, requiring minimal training time. Additionally, we propose a lightweight model that achieves competitive performance using only one-third of the parameters. We also introduce a dual-purpose zero-shot learning task for generating contextualized warnings, enabling automated debunking and enhancing user comprehension. Qualitative and human evaluations of the generated warnings highlight both the potential and limitations of our approach.
zh
[NLP-86] Universal Post-Processing Networks for Joint Optimization of Modules in Task-Oriented Dialogue Systems AAAI2025
【速读】: 该论文旨在解决现有基于后处理网络(Post-processing Networks, PPNs)的方法仅能优化系统内部分模块输出的问题,从而限制了整体任务完成能力的提升。论文的关键解决方案是提出通用后处理网络(Universal Post-processing Networks, UniPPNs),这是一种基于语言模型的网络,能够将任意模块的输出视为序列转换任务进行统一优化。此外,论文采用了一种模块级马尔可夫决策过程(Markov Decision Process, MDP)的强化学习算法,实现每个模块的精细价值和优势估计,进而稳定所有模块输出的联合学习过程。通过仿真和人类评估实验,证明了UniPPNs在面向任务的对话系统中的任务完成能力优于传统PPNs。
链接: https://arxiv.org/abs/2502.00747
作者: Atsumoto Ohashi,Ryuichiro Higashinaka
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2025 Main Technical Track
点击查看摘要
Abstract:Post-processing networks (PPNs) are components that modify the outputs of arbitrary modules in task-oriented dialogue systems and are optimized using reinforcement learning (RL) to improve the overall task completion capability of the system. However, previous PPN-based approaches have been limited to handling only a subset of modules within a system, which poses a significant limitation in improving the system performance. In this study, we propose a joint optimization method for post-processing the outputs of all modules using universal post-processing networks (UniPPNs), which are language-model-based networks that can modify the outputs of arbitrary modules in a system as a sequence-transformation task. Moreover, our RL algorithm, which employs a module-level Markov decision process, enables fine-grained value and advantage estimation for each module, thereby stabilizing joint learning for post-processing the outputs of all modules. Through both simulation-based and human evaluation experiments using the MultiWOZ dataset, we demonstrated that UniPPN outperforms conventional PPNs in the task completion capability of task-oriented dialogue systems.
zh
[NLP-87] BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts ICLR
【速读】: 该论文旨在解决深度神经网络(DNNs)推理延迟的问题,并提出了一种新的早期退出(Early Exit, EE)决策准则。关键在于引入了BEEM方法,将退出分类器视为专家,并仅在相邻专家预测一致时聚合其置信分数,从而捕捉到集成效应。通过这种方法,当聚合的置信值超过阈值时,样本即提前退出,这一阈值基于中间退出的错误率设定,以超越传统DNN推理的性能。实验结果表明,该方法提升了现有EE方法的性能,在图像描述和多种语言任务中实现了1.5倍到2.1倍的速度提升,同时保持或提高了准确性。
链接: https://arxiv.org/abs/2502.00745
作者: Divya Jyoti Bajpai,Manjesh Kumar Hanawal
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Published at International Conference on Learning Representations (ICLR) 2025
点击查看摘要
Abstract:Early Exit (EE) techniques have emerged as a means to reduce inference latency in Deep Neural Networks (DNNs). The latency improvement and accuracy in these techniques crucially depend on the criteria used to make exit decisions. We propose a new decision criterion where exit classifiers are treated as experts BEEM and aggregate their confidence scores. The confidence scores are aggregated only if neighbouring experts are consistent in prediction as the samples pass through them, thus capturing their ensemble effect. A sample exits when the aggregated confidence value exceeds a threshold. The threshold is set using the error rates of the intermediate exits aiming to surpass the performance of conventional DNN inference. Experimental results on the COCO dataset for Image captioning and GLUE datasets for various language tasks demonstrate that our method enhances the performance of state-of-the-art EE methods, achieving improvements in speed-up by a factor 1.5x to 2.1x. When compared to the final layer, its accuracy is comparable in harder Image Captioning and improves in the easier language tasks. The source code for this work is publicly available at this https URL
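按摘要描述,BEEM 的核心决策规则是:把各退出分类器视为专家,仅当相邻出口预测一致时才累加其置信分数(体现集成效应),聚合值超过该层阈值即提前退出。下面是一个极简的 Python 示意(函数签名与阈值取值均为假设,并非论文官方实现;论文中阈值依据中间出口错误率设定,此处由调用者直接给定):

```python
def beem_exit(confidences, predictions, thresholds):
    """BEEM 风格的早退决策示意:
    相邻出口预测一致时累加置信分数,否则重新累计;
    聚合置信值超过该层阈值即在该出口退出。"""
    agg = 0.0
    for i, (conf, pred) in enumerate(zip(confidences, predictions)):
        if i > 0 and pred == predictions[i - 1]:
            agg += conf      # 相邻专家预测一致:累加(集成效应)
        else:
            agg = conf       # 预测不一致:从当前出口重新开始累计
        if agg >= thresholds[i]:
            return i, pred   # 在第 i 个出口提前退出
    return len(predictions) - 1, predictions[-1]  # 走到最后一层
```

例如三个出口中前两个预测一致、置信和超过第二层阈值时,样本在第二个出口就退出,无需继续前向计算。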
zh
[NLP-88] Model Provenance Testing for Large Language Models
【速读】: 该论文旨在解决通过微调等方式定制大型语言模型所带来的版权执行和下游影响管理挑战。论文的关键在于开发了一种框架,用于检测模型的起源,以确定一个模型是否由另一个模型衍生而来。该方法基于这样的观察:实际中的模型衍生会在模型输出中保留显著的相似性,这种相似性可以通过统计分析来检测。解决方案的关键是利用假设检验,对比目标模型与无关模型之间的相似性,从而在仅具有黑盒访问权限的情况下实现对衍生模型的有效识别。
链接: https://arxiv.org/abs/2502.00706
作者: Ivica Nikolic,Teodora Baluta,Prateek Saxena
机构: National University of Singapore; Georgia Institute of Technology
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models are increasingly customized through fine-tuning and other adaptations, creating challenges in enforcing licensing terms and managing downstream impacts. Tracking model origins is crucial both for protecting intellectual property and for identifying derived models when biases or vulnerabilities are discovered in foundation models. We address this challenge by developing a framework for testing model provenance: Whether one model is derived from another. Our approach is based on the key observation that real-world model derivations preserve significant similarities in model outputs that can be detected through statistical analysis. Using only black-box access to models, we employ multiple hypothesis testing to compare model similarities against a baseline established by unrelated models. On two comprehensive real-world benchmarks spanning models from 30M to 4B parameters and comprising over 600 models, our tester achieves 90-95% precision and 80-90% recall in identifying derived models. These results demonstrate the viability of systematic provenance verification in production environments even when only API access is available.
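摘要的统计思路可以用一个玩具示例说明:把目标模型与候选父模型在同一组提示上的输出一致率,同它与若干无关模型的一致率基线作比较,显著偏高即判定为派生模型。以下是简化的 z 分数版本(阈值与实现细节均为示意假设,论文实际采用的是多重假设检验):

```python
import statistics

def agreement(outputs_a, outputs_b):
    """两模型在同一组提示上输出完全一致的比例。"""
    return sum(a == b for a, b in zip(outputs_a, outputs_b)) / len(outputs_a)

def is_derived(target, candidate_parent, unrelated, z_thresh=3.0):
    """黑盒溯源示意:若 target 与候选父模型的一致率
    显著高于其与无关模型构成的基线分布,则判定为派生模型。"""
    baseline = [agreement(target, m) for m in unrelated]
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline) or 1e-9  # 防止除零
    z = (agreement(target, candidate_parent) - mu) / sigma
    return z > z_thresh
```

这只需要各模型在同一提示集上的输出(即黑盒 API 访问),与论文的设定一致。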
zh
[NLP-89] Learning Autonomous Code Integration for Math Language Models
【速读】: 该论文旨在解决现有工具集成的数学大语言模型(Math LLMs)在方法选择上的自主性不足问题。当前模型依赖外部指令来决定使用链式思维(CoT)推理还是代码执行,缺乏独立选择最适当方法的能力。为了解决这一挑战,论文提出了一种创新的期望最大化(EM)框架,通过自我探索改进模型的决策制定能力。该框架的关键在于交替进行参考策略的计算以提升模型对其自身能力的信心,并基于此更新模型。此外,引入了一种高效的数据合成策略和离策略强化学习,进一步增强了该框架。实验结果表明,所提方法显著提升了现有数学大语言模型的性能,在MATH基准测试中准确率提高了近20%,达到了65.28%,同时减少了高达65%的代码执行次数。
链接: https://arxiv.org/abs/2502.00691
作者: Haozhe Wang,Long Li,Chao Qu,Fengming Zhu,Weidi Xu,Wei Chu,Fangzhen Lin
机构: Technology†, Hong Kong University of Science and Technology‡
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent research on tool integration for math Large Language Models (LLMs) aims to combine complementary strengths of chain-of-thought (CoT) reasoning and code execution. However, we discover a critical limitation: current tool-integrated math LLMs rely on externally dictated instructions to decide whether to use CoT or code, lacking the autonomy to choose the most appropriate method independently. This prompts us to study Autonomous Code integration for math LLMs, which enables models to independently develop their own methodology-selection strategy in the absence of reliable supervision. To address this challenge, we propose an innovative Expectation-Maximization (EM) formulation that refines the model's decision-making through the exploration of its capabilities. This framework alternates between (a) computing a reference strategy that improves the model's belief over its capabilities through self-exploration, and (b) updating the model based on the refined belief. We further enhance this framework with an efficient implementation, incorporating a novel data synthesis strategy and off-policy reinforcement learning. Extensive experiments demonstrate that our approach, using only a public query set, significantly boosts the performance of existing math LLMs, raising accuracy by nearly 20% to 65.28% on the challenging MATH benchmark, while reducing code executions by up to 65%.
zh
[NLP-90] A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models
【速读】: 该论文旨在解决图表示学习中连续嵌入方法所面临的参数效率、可解释性和鲁棒性等问题。论文的关键在于提出并探讨了量化图表示(Quantized Graph Representation, QGR)的学习方法,通过离散码而非传统的连续嵌入来表示图结构,并探索其与大规模语言模型(Large Language Models, LLMs)的整合策略。这一新兴范式具有显著潜力,论文通过全面综述以促进其快速发展。
链接: https://arxiv.org/abs/2502.00681
作者: Qika Lin,Zhen Peng,Kaize Shi,Kai He,Yiming Xu,Erik Cambria,Mengling Feng
机构: Saw Swee Hock School of Public Health, National University of Singapore(苏瑞福公共卫生学院,新加坡国立大学); School of Computer Science and Technology, Xi’an Jiaotong University(西安交通大学计算机科学与技术学院); School of Computer Science, University of Technology Sydney(悉尼科技大学计算机科学学院); College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent years have witnessed rapid advances in graph representation learning, with the continuous embedding approach emerging as the dominant paradigm. However, such methods encounter issues regarding parameter efficiency, interpretability, and robustness. Thus, Quantized Graph Representation (QGR) learning has recently gained increasing interest, which represents the graph structure with discrete codes instead of conventional continuous embeddings. Given its analogous representation form to natural language, QGR also possesses the capability to seamlessly integrate graph structures with large language models (LLMs). As this emerging paradigm is still in its infancy yet holds significant promise, we undertake this thorough survey to promote its rapid future prosperity. We first present the background of the general quantization methods and their merits. Moreover, we provide an in-depth demonstration of current QGR studies from the perspectives of quantized strategies, training objectives, distinctive designs, knowledge graph quantization, and applications. We further explore the strategies for code dependence learning and integration with LLMs. At last, we give discussions and conclude future directions, aiming to provide a comprehensive picture of QGR and inspire future research.
zh
[NLP-91] How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence
【速读】: 该论文旨在解决数据集污染(Dataset Contamination)问题,即评估数据集与预训练语料库之间的重叠导致性能指标虚高,从而影响模型评估的可靠性。为了解决这一问题,论文提出了一种名为核差异得分(Kernel Divergence Score, KDS)的新方法。KDS通过计算样本嵌入在基准数据集微调前后核相似性矩阵的差异来量化数据集污染程度。其关键是利用微调对未见过的样本影响更大的特性,从而提供一个可靠的污染度量标准。
链接: https://arxiv.org/abs/2502.00678
作者: Hyeong Kyu Choi,Maxim Khanov,Hongxin Wei,Yixuan Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Dataset contamination, where evaluation datasets overlap with pre-training corpora, inflates performance metrics and undermines the reliability of model evaluations. Quantifying dataset contamination thus becomes essential to ensure that performance evaluations genuinely reflect a model’s ability to generalize to unseen data, rather than relying on memorized examples. To address this problem, we propose Kernel Divergence Score (KDS), a novel method that quantifies dataset contamination by computing the divergence between the kernel similarity matrix of sample embeddings, before and after fine-tuning on the benchmark dataset. Leveraging the insight that fine-tuning affects unseen samples more significantly than seen ones, KDS provides a reliable measure of contamination. Through extensive experiments on controlled contamination scenarios, KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines. Additionally, we perform comprehensive ablation studies to analyze the impact of key design choices, providing deeper insights into the components and effectiveness of KDS. These ablations highlight the importance of leveraging fine-grained kernel-based information and confirm the reliability of the proposed framework across diverse datasets and settings.
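KDS 的计算可以概括为:分别取基准数据集微调前后样本嵌入的核相似度矩阵,再度量两个矩阵的差异;被"见过"(泄漏)的样本受微调影响小,得分低。下面用纯 Python 的 RBF 核给出一个示意(核函数与差异度量的具体选择为假设,并非论文的原始实现):

```python
import math

def rbf_kernel_matrix(embs, gamma=1.0):
    """样本嵌入两两之间的 RBF 核相似度矩阵。"""
    n = len(embs)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(embs[i], embs[j])))
             for j in range(n)] for i in range(n)]

def kernel_divergence_score(embs_before, embs_after, gamma=1.0):
    """微调前后核矩阵的平均逐元素差异:
    差异越小,说明微调对这批样本影响越小,数据泄漏嫌疑越大。"""
    Kb = rbf_kernel_matrix(embs_before, gamma)
    Ka = rbf_kernel_matrix(embs_after, gamma)
    n = len(Kb)
    return sum(abs(Kb[i][j] - Ka[i][j]) for i in range(n) for j in range(n)) / (n * n)
```

若嵌入在微调前后完全不变,得分恰为 0;未见过的样本嵌入变动越大,得分越高。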
zh
[NLP-92] ReFoRCE: A Text-to-SQL Agent with Self-Refinement Format Restriction and Column Exploration
【速读】: 该论文旨在解决在企业环境中部署Text-to-SQL系统所面临的挑战,如大规模复杂模式(3000列)、多样的SQL方言(例如BigQuery、Snowflake)以及复杂的查询需求。当前最先进的模型在Spider 2.0数据集上的表现受限,仅达到20%,主要局限在于指令遵循不足、长上下文理解差、自我优化能力弱以及特定方言知识不足。为解决这些问题,论文提出ReFoRCE方法,其关键是引入表压缩以缓解长上下文限制,格式限制以确保答案格式正确,以及迭代列探索以增强模式理解。此外,ReFoRCE采用包含并行化工作流与投票机制及基于公用表表达式(CTE)的细化方法的自我优化流程来处理未决案例。
链接: https://arxiv.org/abs/2502.00675
作者: Minghang Deng,Ashwin Ramachandran,Canwen Xu,Lanxiang Hu,Zhewei Yao,Anupam Datta,Hao Zhang
机构: University of California, San Diego(加州大学圣地亚哥分校); Snowflake AI Research
类目: Computation and Language (cs.CL)
备注: 13 pages, 1 figure
点击查看摘要
Abstract:Text-to-SQL systems have unlocked easier access to critical data insights by enabling natural language queries over structured databases. However, deploying such systems in enterprise environments remains challenging due to factors such as large, complex schemas ( 3000 columns), diverse SQL dialects (e.g., BigQuery, Snowflake) and sophisticated query requirements (e.g., transformation, analytics). Current state-of-the-art performance on the Spider 2.0 dataset – a benchmark built to mimic such complex environments – remains limited at 20%. Key limitations include inadequate instruction-following, poor long-context comprehension, weak self-refinement, and insufficient dialect-specific knowledge. To address these gaps, we propose ReFoRCE (Self-Refinement Agent with Format Restriction and Column Exploration) which introduces (1) table compression to mitigate long-context limitations (2) format restriction to ensure accurate answer format, and (3) iterative column exploration for enhanced schema understanding. Additionally, it employs self-refinement pipeline consisting of (1) parallelized workflows with voting mechanisms and (2) a Common Table Expression (CTE) based refinement approach to handle unresolved cases. ReFoRCE achieves state-of-the-art results scoring 26.69 on the Spider 2.0-Snow and scoring 24.50 on the Spider 2.0-Lite tasks.
zh
[NLP-93] Rethinking Mixture-of-Agents : Is Mixing Different Large Language Models Beneficial?
【速读】: 该论文旨在探讨在语言模型领域,混合不同大型语言模型(Large Language Models, LLMs)是否真正有益。论文的关键解决方案是提出Self-MoA方法,即仅聚合单一顶级表现LLM的输出。研究结果表明,Self-MoA在多种基准测试中优于传统的混合方法(Mixture-of-Agents, MoA),包括在AlpacaEval 2.0上提升6.6%,并在多个基准测试(如MMLU、CRUX、MATH)中平均提升3.8%。将Self-MoA应用于AlpacaEval 2.0榜单上排名靠前的模型,还直接取得了新的state-of-the-art成绩。研究进一步通过分析输出多样性与质量之间的权衡解释了这一现象,确认混合不同LLMs往往会拉低输出的平均质量。
链接: https://arxiv.org/abs/2502.00674
作者: Wenzhe Li,Yong Lin,Mengzhou Xia,Chi Jin
机构: Princeton University(普林斯顿大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA – an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of 3.8% improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves the new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that the MoA performance is rather sensitive to the quality, and mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA, that is capable of aggregating a large number of LLM outputs on-the-fly over multiple rounds, and is as effective as aggregating all outputs at once.
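Self-MoA 的思想很直接:不再混合不同 LLM 的输出,而是从同一个最强模型重复采样,再聚合这些候选。下面用"评分函数选最优"代替 LLM 聚合器给出极简示意(真实 MoA/Self-MoA 用另一个 LLM 做聚合,这里的 judge 仅为假设的打分器):

```python
def self_moa(generate, judge, prompt, n_samples=3):
    """Self-MoA 示意:从单一最强模型采样 n 个输出,
    按 judge 打分聚合(此处取最优),而非混合不同 LLM。"""
    candidates = [generate(prompt) for _ in range(n_samples)]
    return max(candidates, key=judge)
```

论文还提出顺序版 Self-MoA,可在多轮中在线聚合大量输出;上面的单轮版本足以体现"单模型多采样"与"多模型混合"的对比。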
zh
[NLP-94] Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Model, VLM)在分布外(Out-of-Distribution, OOD)检测中由于图像与文本模态差距导致的高误报率问题。论文的关键解决方案在于引入分布内(ID)图像原型(ID image prototypes),与已有的ID文本原型(ID text prototypes)结合使用,以缓解模态差距的影响。此外,论文提出了一个名为SUPREME的少样本调优框架,包括偏置提示生成(Biased Prompts Generation, BPG)模块和图像-文本一致性(Image-Text Consistency, ITC)模块,进一步减小图像与文本之间的差距,并提出了一种基于单模态与跨模态相似性的新OOD评分方法 S_GMP。这些改进共同提升了基于VLM的OOD检测性能。
链接: https://arxiv.org/abs/2502.00662
作者: Yimu Wang,Evelien Riddell,Adrian Chow,Sean Sedwards,Krzysztof Czarnecki
机构: University of Waterloo
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Existing vision-language model (VLM)-based methods for out-of-distribution (OOD) detection typically rely on similarity scores between input images and in-distribution (ID) text prototypes. However, the modality gap between image and text often results in high false positive rates, as OOD samples can exhibit high similarity to ID text prototypes. To mitigate the impact of this modality gap, we propose incorporating ID image prototypes along with ID text prototypes. We present theoretical analysis and empirical evidence indicating that this approach enhances VLM-based OOD detection performance without any additional training. To further reduce the gap between image and text, we introduce a novel few-shot tuning framework, SUPREME, comprising biased prompts generation (BPG) and image-text consistency (ITC) modules. BPG enhances image-text fusion and improves generalization by conditioning ID text prototypes on the Gaussian-based estimated image domain bias; ITC reduces the modality gap by minimizing intra- and inter-modal distances. Moreover, inspired by our theoretical and empirical findings, we introduce a novel OOD score S_GMP, leveraging uni- and cross-modal similarities. Finally, we present extensive experiments to demonstrate that SUPREME consistently outperforms existing VLM-based OOD detection methods.
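摘要的评分思路是同时利用跨模态(图像对 ID 文本原型)与单模态(图像对 ID 图像原型)相似度:OOD 样本对两类原型都不相似,因此得分偏低,误报随之减少。下面是一个余弦相似度版本的示意(加权系数 alpha 为假设,并非论文中 S_GMP 的精确定义):

```python
import math

def cosine(u, v):
    """余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def id_score(img_emb, text_protos, image_protos, alpha=0.5):
    """得分越高越像分布内(ID)样本:
    结合跨模态(图像-文本原型)与单模态(图像-图像原型)最大相似度。"""
    cross = max(cosine(img_emb, p) for p in text_protos)
    uni = max(cosine(img_emb, p) for p in image_protos)
    return alpha * cross + (1 - alpha) * uni
```

仅依赖文本原型时,一个恰好与某条文本描述相似的 OOD 图像会得到虚高分数;加入图像原型后,它还必须同时接近 ID 图像分布才能得高分。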
zh
[NLP-95] Reformulation is All You Need: Addressing Malicious Text Features in DNNs
【速读】: 该论文旨在解决深度神经网络(DNN)模型在自然语言处理(NLP)任务中面临的对抗性攻击和后门攻击问题。论文的关键在于提出了一种统一且自适应的防御框架,通过利用重构模块来识别并处理文本输入中的潜在恶意特征,同时保持原始语义的完整性,从而有效抵御对抗性和后门攻击。
链接: https://arxiv.org/abs/2502.00652
作者: Yi Jiang,Oubo Ma,Yong Yang,Tong Zhang,Shouling Ji
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Human language encompasses a wide range of intricate and diverse implicit features, which attackers can exploit to launch adversarial or backdoor attacks, compromising DNN models for NLP tasks. Existing model-oriented defenses often require substantial computational resources as model size increases, whereas sample-oriented defenses typically focus on specific attack vectors or schemes, rendering them vulnerable to adaptive attacks. We observe that the root cause of both adversarial and backdoor attacks lies in the encoding process of DNN models, where subtle textual features, negligible for human comprehension, are erroneously assigned significant weight by less robust or trojaned models. Based on it we propose a unified and adaptive defense framework that is effective against both adversarial and backdoor attacks. Our approach leverages reformulation modules to address potential malicious features in textual inputs while preserving the original semantic integrity. Extensive experiments demonstrate that our framework outperforms existing sample-oriented defense baselines across a diverse range of malicious textual features.
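摘要的防御思路是:在分类器看到输入之前,先用保语义的改写模块抹掉那些对人类理解无关紧要、却被脆弱或被植入后门的模型赋予高权重的表层特征(例如后门触发词)。下面用同义替换模拟改写模块(synonym_map 与触发词 "cf" 均为示意假设,真实系统的改写器远比查表复杂):

```python
def reformulate(text, synonym_map):
    """保语义改写示意:用常见同义词替换可疑的表层 token,
    使后门触发词失去其精确的表面形式。"""
    return " ".join(synonym_map.get(tok, tok) for tok in text.split())

def defended_classify(classifier, text, synonym_map):
    """先改写,再送入(可能被植入后门的)分类器。"""
    return classifier(reformulate(text, synonym_map))
```

由于改写发生在样本侧而非模型侧,该框架对对抗样本与后门触发两类攻击都能起作用,也无需随模型规模增长的防御开销。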
zh
[NLP-96] Evaluating Small Language Models for News Summarization: Implications and Factors Influencing Performance
【速读】: 该论文旨在解决在资源受限环境中高效摘要工具的需求与现有大型语言模型(Large Language Models, LLMs)高计算资源需求之间的矛盾。解决方案的关键在于全面评估小型语言模型(Small Language Models, SLMs)在新闻摘要任务中的表现,发现如Phi3-Mini和Llama3.2-3B-Ins等顶级SLMs不仅能在生成更简洁的摘要同时达到与70B LLMs相当的效果,还指出SLMs更适合简单提示,并且指令微调并不总能提升其新闻摘要能力。
链接: https://arxiv.org/abs/2502.00641
作者: Borui Xu,Yao Chen,Zeyi Wen,Weiguo Liu,Bingsheng He
机构: Shandong University; National University of Singapore; HKUST (香港科技大学); Shandong University; National University of Singapore
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The increasing demand for efficient summarization tools in resource-constrained environments highlights the need for effective solutions. While large language models (LLMs) deliver superior summarization quality, their high computational resource requirements limit practical use applications. In contrast, small language models (SLMs) present a more accessible alternative, capable of real-time summarization on edge devices. However, their summarization capabilities and comparative performance against LLMs remain underexplored. This paper addresses this gap by presenting a comprehensive evaluation of 19 SLMs for news summarization across 2,000 news samples, focusing on relevance, coherence, factual consistency, and summary length. Our findings reveal significant variations in SLM performance, with top-performing models such as Phi3-Mini and Llama3.2-3B-Ins achieving results comparable to those of 70B LLMs while generating more concise summaries. Notably, SLMs are better suited for simple prompts, as overly complex prompts may lead to a decline in summary quality. Additionally, our analysis indicates that instruction tuning does not consistently enhance the news summarization capabilities of SLMs. This research not only contributes to the understanding of SLMs but also provides practical insights for researchers seeking efficient summarization solutions that balance performance and resource use.
zh
[NLP-97] SimulPL: Aligning Human Preferences in Simultaneous Machine Translation ICLR2025
【速读】: 该论文旨在解决同时机器翻译(Simultaneous Machine Translation, SiMT)模型在满足人类用户偏好方面的问题。现有方法主要关注于优化生成的翻译结果,而忽视了与延迟相关的用户偏好以及在偏好优化阶段读写策略的优化。论文的关键解决方案是提出了一种名为Simultaneous Preference Learning (SimulPL) 的框架,该框架将人类偏好分为五个方面:翻译质量偏好、单调性偏好、关键点偏好、简洁性偏好和延迟偏好。通过利用前四种偏好构造人类偏好提示,有效地引导GPT-4/4o生成SiMT任务的偏好数据,并在偏好优化阶段将延迟偏好整合到优化目标中,使SiMT模型能够改进读写策略,从而更有效地与人类偏好保持一致。实验结果显示,SimulPL在不同延迟水平下均表现出更好的人类偏好对齐效果。
链接: https://arxiv.org/abs/2502.00634
作者: Donglei Yu,Yang Zhao,Jie Zhu,Yangyifan Xu,Yu Zhou,Chengqing Zong
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences(人工智能学院,中国科学院大学); State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统国家重点实验室,中国科学院自动化研究所,中国北京); Graduate School of Translation and Interpretation, Beijing Foreign Studies University(北京外国语大学翻译学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2025. 23 pages,13 figures,11 tables
点击查看摘要
Abstract:Simultaneous Machine Translation (SiMT) generates translations while receiving streaming source inputs. This requires the SiMT model to learn a read/write policy, deciding when to translate and when to wait for more source input. Numerous linguistic studies indicate that audiences in SiMT scenarios have distinct preferences, such as accurate translations, simpler syntax, and no unnecessary latency. Aligning SiMT models with these human preferences is crucial to improve their performances. However, this issue still remains unexplored. Additionally, preference optimization for SiMT task is also challenging. Existing methods focus solely on optimizing the generated responses, ignoring human preferences related to latency and the optimization of read/write policy during the preference optimization phase. To address these challenges, we propose Simultaneous Preference Learning (SimulPL), a preference learning framework tailored for the SiMT task. In the SimulPL framework, we categorize SiMT human preferences into five aspects: translation quality preference, monotonicity preference, key point preference, simplicity preference, and latency preference. By leveraging the first four preferences, we construct human preference prompts to efficiently guide GPT-4/4o in generating preference data for the SiMT task. In the preference optimization phase, SimulPL integrates latency preference into the optimization objective and enables SiMT models to improve the read/write policy, thereby aligning with human preferences more effectively. Experimental results indicate that SimulPL exhibits better alignment with human preferences across all latency levels in Zh→En, De→En and En→Zh SiMT tasks. Our data and code will be available at this https URL.
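SimulPL 的一个关键点是把延迟偏好并入优化目标。最简单的体现方式是在候选(翻译质量, 延迟)之间按"质量减去延迟惩罚"排序偏好,使质量略低但快得多的读写策略胜出(lam 为示意的权衡系数,并非论文的目标函数形式):

```python
def preference_rank(candidates, lam=0.05):
    """candidates: (名称, 质量, 延迟) 三元组列表;
    按 质量 - lam * 延迟 从高到低排序,体现延迟偏好进入目标。"""
    return sorted(candidates, key=lambda c: -(c[1] - lam * c[2]))
```

当 lam=0(忽略延迟偏好)时,排序退化为只看翻译质量;lam 越大,低延迟策略在偏好数据中越占优。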
zh
[NLP-98] Efficient Language Modeling for Low-Resource Settings with Hybrid RNN-Transformer Architectures
【速读】: 该论文旨在解决在低数据条件下,Transformer模型训练成本高且参数量大的问题。解决方案的关键在于通过选择性替换注意力层(attention layers)为前馈层(feed-forward layers)和准循环神经网络层(quasi-recurrent neural network layers),从而在保持相近性能的同时显著减少模型参数数量。
链接: https://arxiv.org/abs/2502.00617
作者: Gabriel Lindenmaier,Sean Papay,Sebastian Padó
机构: University of Bamberg (班贝格大学); University of Stuttgart (斯图加特大学)
类目: Computation and Language (cs.CL)
备注: PDF has 12 pages total, 7 without references and abstract; 10 individual graphics combined to 3 figures; 5 tables
点击查看摘要
Abstract:Transformer-based language models have recently been at the forefront of active research in text generation. However, these models’ advances come at the price of prohibitive training costs, with parameter counts in the billions and compute requirements measured in petaflop/s-decades. In this paper, we investigate transformer-based architectures for improving model performance in a low-data regime by selectively replacing attention layers with feed-forward and quasi-recurrent neural network layers. We test these architectures on the standard Enwik8 and Wikitext-103 corpora. Our results show that our reduced architectures outperform existing models with a comparable number of parameters, and obtain comparable performance to larger models while significantly reducing the number of parameters.
zh
[NLP-99] Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing
【速读】: 该论文旨在解决大型语言模型(LLMs)在静态语料库上训练导致的知识过时问题,并提出了一种名为OVERTONE的方法。OVERTONE通过在token级别进行平滑处理,缓解异构token过拟合(Heterogeneous Token Overfitting, HTO)问题,从而实现对特定知识的有效更新而不损害模型的其他预训练能力。关键在于其自适应地细化目标分布,以减轻不同token以不同速率过拟合的现象。
链接: https://arxiv.org/abs/2502.00602
作者: Tianci Liu,Zihan Dong,Linjun Zhang,Haoyu Wang,Jing Gao
机构: Purdue University; Rutgers University; SUNY Albany
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora and their knowledge can become outdated quickly in the fast-changing world. This motivates the development of knowledge editing (KE) to update specific knowledge in LLMs without changing unrelated others or compromising their pre-trained capabilities. Previous efforts sought to update a small amount of parameters of a LLM and proved effective for making selective updates. Nonetheless, the edited LLM often exhibits degraded ability to reason about the new knowledge. In this work, we identify a key issue: heterogeneous token overfitting (HTO), where the LLM overfits different tokens in the provided knowledge at varying rates. To tackle this, we propose OVERTONE, a token-level smoothing method that mitigates HTO by adaptively refining the target distribution. Theoretically, OVERTONE offers better parameter updates with negligible computation overhead. It also induces an implicit DPO but does not require preference data pairs. Extensive experiments across four editing methods, two LLMs, and diverse scenarios demonstrate the effectiveness and versatility of our method.
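OVERTONE 在 token 级别自适应地软化训练目标:模型已经拟合好的 token 保持接近 one-hot 的"硬"目标,拟合差的 token 得到更平滑的目标,从而缓解不同 token 以不同速率过拟合(HTO)。下面的插值调度只是一种可能的示意,并非论文的精确公式:

```python
def overtone_targets(one_hot_targets, model_probs, beta=0.9):
    """token 级自适应平滑示意:w = beta * p(gold)。
    w 越大目标越接近 one-hot(token 已拟合好),
    w 越小目标越接近模型当前分布(token 拟合差,给软目标)。"""
    smoothed = []
    for tgt, probs in zip(one_hot_targets, model_probs):
        gold = max(range(len(tgt)), key=tgt.__getitem__)  # 金标 token 下标
        w = beta * probs[gold]  # 假设的自适应权重调度
        smoothed.append([w * t + (1 - w) * p for t, p in zip(tgt, probs)])
    return smoothed
```

由于平滑目标是 one-hot 与模型分布的凸组合,每个 token 的目标仍是合法概率分布,可直接替换交叉熵训练中的标签。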
zh
[NLP-100] RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines ICML2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在作为基于文本的角色扮演游戏(Role-Playing Game, RPG)引擎时的评估问题。为了解决这一问题,论文提出了RPGBench基准测试,其关键是通过两个核心任务——游戏创建(Game Creation, GC)和游戏模拟(Game Simulation, GS)——来全面评估LLMs在逻辑连贯性、一致性以及可验证的游戏机制方面的表现。通过结合客观评价方法和LLM作为裁判的主观评价框架,RPGBench提供了一种新的标准,用于衡量LLMs在平衡创造性、连贯性和复杂性方面的能力。
链接: https://arxiv.org/abs/2502.00595
作者: Pengfei Yu,Dongming Shen,Silin Meng,Jaewon Lee,Weisu Yin,Andrea Yaoyun Cui,Zhenlin Xu,Yi Zhu,Xingjian Shi,Mu Li,Alex Smola
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to ICML 2025
点击查看摘要
Abstract:We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines. RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS). In GC, an LLM must craft a valid and playable RPG world using a structured event-state representation, ensuring logical coherence and proper termination conditions. In GS, the LLM simulates interactive gameplay across multiple rounds while consistently updating states and enforcing game rules. To comprehensively assess performance, RPGBench integrates objective and subjective evaluation methodologies. Objective measures verify adherence to event mechanics and check variable updates without requiring human intervention. Subjective measures, such as content interestingness, action quality, and role-playing capability, are evaluated via an LLM-as-a-judge framework, where a strong LLM grades each candidate’s outputs. Empirical results demonstrate that state-of-the-art LLMs can produce engaging stories but often struggle to implement consistent, verifiable game mechanics, particularly in long or complex scenarios. By combining structured, rule-based assessments with LLM-based judgments, RPGBench provides a new standard for evaluating how well LLMs can balance creativity, coherence, and complexity in text-based RPGs, opening avenues for more immersive and controllable interactive storytelling.
zh
[NLP-101] M+: Extending MemoryLLM with Scalable Long-Term Memory
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理长序列时难以保留远期信息的问题。MemoryLLM虽通过将过去的信息压缩到所有层的隐藏状态中形成记忆池来扩展上下文窗口,但其效果仅限于最多16k个标记的序列长度,对于超过20k个标记的序列则难以保持知识。论文的关键解决方案是引入M+模型,它基于MemoryLLM,并通过集成长期记忆机制与协同训练的检索器来显著增强长期信息保留能力。M+在文本生成过程中动态检索相关信息,从而实现在相似GPU内存开销下,将知识保留能力从不足20k个标记提升至超过160k个标记。
链接: https://arxiv.org/abs/2502.00592
作者: Yu Wang,Dmitry Krotov,Yuanzhe Hu,Yifan Gao,Wangchunshu Zhou,Julian McAuley,Dan Gutfreund,Rogerio Feris,Zexue He
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead.
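M+ 的关键部件之一是与主模型协同训练的检索器:生成时根据当前查询,从长期记忆池中动态取回最相关的条目。下面用点积打分模拟这一检索步骤(记忆条目的 (嵌入, 文本) 形式与 top_k 取值均为示意假设,真实系统中的记忆是各层隐藏状态):

```python
def dot(u, v):
    """向量点积。"""
    return sum(a * b for a, b in zip(u, v))

def retrieve_memory(query_emb, memory, top_k=2):
    """memory: (嵌入, 文本) 条目列表;
    按与查询嵌入的点积从高到低取回 top_k 条,供生成时拼接上下文。"""
    ranked = sorted(memory, key=lambda entry: -dot(query_emb, entry[0]))
    return [text for _, text in ranked[:top_k]]
```

与把全部历史压进固定大小记忆池不同,这种"按需取回"让保留的知识量可以随记忆池规模扩展,而推理时的上下文长度保持不变。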
zh
[NLP-102] Converting Transformers into DGNNs Form
【速读】: 该论文旨在探索将自注意力机制(Self-Attention)替换为有向图卷积(Digraph Convolution)的可能性,以期提升Transformer模型在长序列处理任务中的性能。论文的关键在于引入了一种基于有向图傅里叶变换的合成酉有向图卷积(Synthetic Unitary Digraph Convolution),从而形成一种新的模型——Converter。这种转换使得Transformer模型能够以有向图神经网络(DGNN)的形式运作,实验结果表明Converter在保持计算效率和架构简洁性的同时,实现了卓越的性能,确立了其作为轻量但强大的Transformer变体的地位。
链接: https://arxiv.org/abs/2502.00585
作者: Jie Zhang,Kuan-Chieh Wang,Bo-Wei Chiu,Min-Te Sun
机构: National Central University, Taiwan(中央大学,台湾)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 21 pages, 3 figures, and 8 tables
点击查看摘要
Abstract:Recent advances in deep learning have established Transformer architectures as the predominant modeling paradigm. Central to the success of Transformers is the self-attention mechanism, which scores the similarity between query and key matrices to modulate a value matrix. This operation bears striking similarities to digraph convolution, prompting an investigation into whether digraph convolution could serve as an alternative to self-attention. In this study, we formalize this concept by introducing a synthetic unitary digraph convolution based on the digraph Fourier transform. The resulting model, which we term Converter, effectively converts a Transformer into a Directed Graph Neural Network (DGNN) form. We have tested Converter on Long-Range Arena benchmark, long document classification, and DNA sequence-based taxonomy classification. Our experimental results demonstrate that Converter achieves superior performance while maintaining computational efficiency and architectural simplicity, which establishes it as a lightweight yet powerful Transformer variant.
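自注意力与有向图卷积的对应关系可以用一个小例子直接验证:softmax(QKᵀ) 恰是一个行随机的加权邻接矩阵,乘以 V 等价于在完全有向图上做一步消息传递。下面的玩具数值为本文虚构,仅演示两种写法逐元素一致,并非论文中酉有向图卷积的实现:

```python
import math

def matmul(A, B):
    """Plain list-of-lists matrix multiplication."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    """Row-wise softmax, so every row sums to 1."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

# Toy 3-token sequence with 2-dimensional queries/keys/values (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Self-attention: A = softmax(Q K^T) is a row-stochastic weighted adjacency
# matrix over tokens, so "attend" is one step of digraph message passing.
A = softmax_rows(matmul(Q, [list(col) for col in zip(*K)]))  # softmax(Q K^T)
out_attention = matmul(A, V)

# Same computation written as per-node aggregation over weighted in-edges.
out_graph = [[sum(A[i][j] * V[j][d] for j in range(len(V))) for d in range(2)]
             for i in range(len(Q))]
```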
zh
[NLP-103] Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition ICASSP2025
【速读】: 该论文旨在解决非流利或带有口音的演讲者在语音识别中的挑战。特别是针对非母语英语使用者,传统基于规则的发音模式难以充分捕捉非母语者的错误。论文的关键解决方案是采用数据驱动的方法,通过使用注意力图将非母语音素与母语音素对齐,从而自动检测误读模式。这种方法在母语英语数据集上的语音识别准确率提高了5.7%,而在非母语英语,尤其是韩国人英语演讲者的识别准确率提高了12.8%。
链接: https://arxiv.org/abs/2502.00583
作者: Anna Seo Gyeong Choi,Jonghyeon Park,Myungwoo Oh
机构: Cornell University; NAVER Cloud Corporation; NAVER Cloud Corporation
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to ICASSP 2025
点击查看摘要
Abstract:Recent advancements in machine learning have significantly improved speech recognition, but recognizing speech from non-fluent or accented speakers remains a challenge. Previous efforts, relying on rule-based pronunciation patterns, have struggled to fully capture non-native errors. We propose two data-driven approaches using speech corpora to automatically detect mispronunciation patterns. By aligning non-native phones with their native counterparts using attention maps, we achieved a 5.7% improvement in speech recognition on native English datasets and a 12.8% improvement for non-native English speakers, particularly Korean speakers. Our method offers practical advancements for robust Automatic Speech Recognition (ASR) systems particularly for situations where prior linguistic knowledge is not applicable.
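论文用注意力图把非母语音素与母语音素对齐;下面是一个极简示意(注意力权重为虚构数值,真实系统中来自训练好的模型),通过逐行取 argmax 得到对齐,再收集不一致的音素对作为误读模式:

```python
native = ["r", "ae", "t"]        # target (native) phone sequence
nonnative = ["l", "ae", "t"]     # accented rendering of the same word
attn = [                         # attn[i][j]: fabricated attention weight of
    [0.7, 0.2, 0.1],             # non-native phone i over native phone j
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]

def mispronunciation_pairs(attn, nonnative, native):
    """Align phones by per-row argmax and collect mismatched pairs."""
    pairs = []
    for i, row in enumerate(attn):
        j = max(range(len(row)), key=row.__getitem__)  # argmax over native phones
        if nonnative[i] != native[j]:
            pairs.append((native[j], nonnative[i]))  # (intended, produced)
    return pairs

patterns = mispronunciation_pairs(attn, nonnative, native)
```

这里恰好复现了韩语母语者常见的 r/l 混淆;真实管线会在整个语料上统计此类音素对的频次,得到数据驱动的误读模式。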
zh
[NLP-104] Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
【速读】: 该论文旨在解决通过Best-of-N (BoN) 方法进行的语言模型(Language Models, LLMs)越狱攻击问题。解决方案的关键在于Defense Against The Dark Prompts (DATDP) 方法,该方法通过反复利用评估语言模型来检测提示中的危险或操纵行为,并明确寻找越狱企图,直至生成稳健的安全评级。实验结果显示,即使使用较小的评估模型,DATDP也能有效阻止大部分成功越狱案例,从而显著提高生成式AI系统的安全性。
链接: https://arxiv.org/abs/2502.00580
作者: Stuart Armstrong,Matija Franklin,Connor Stevens,Rebecca Gorman
机构: Stuart Armstrong* Aligned AI(对齐AI); Matija Franklin* Aligned AI(对齐AI); Connor Stevens* Oxford University(牛津大学); Rebecca Gorman* University College London (UCL)(伦敦大学学院)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective against all major large language models (LLMs). We have found that 100% of the BoN paper’s successful jailbreaks (confidence interval [99.65%, 100.00%]) and 99.8% of successful jailbreaks in our replication (confidence interval [99.28%, 99.98%]) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly utilizing an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors–unlike some other approaches, DATDP also explicitly looks for jailbreaking attempts–until a robust safety rating is generated. This success persisted even when utilizing smaller LLMs to power the evaluation (Claude and LLaMa-3-8B-instruct proved almost equally capable). These results show that, though language models are sensitive to seemingly innocuous changes to inputs, they seem also capable of successfully evaluating the dangers of these inputs. Versions of DATDP can therefore be added cheaply to generative AI systems to produce an immediate significant increase in safety.
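DATDP 的核心是"反复让评估 LLM 给提示打危险分,得到稳健评级后再放行"。下面用一个桩函数(stub)代替评估 LLM 做流程示意——打分规则、轮数与阈值均为本文假设,并非论文的实际实现:

```python
def stub_evaluator(prompt):
    """Danger score in [0, 1]; a real system would query an evaluation LLM."""
    red_flags = ("ignore previous", "bomb", "jailbreak")
    hits = sum(flag in prompt.lower() for flag in red_flags)
    return min(1.0, hits / 2)

def datdp_filter(prompt, evaluator, rounds=3, threshold=0.5):
    """Score the prompt several times and block it if the average crosses
    the threshold -- a crude stand-in for DATDP's robust safety rating."""
    scores = [evaluator(prompt) for _ in range(rounds)]
    avg = sum(scores) / len(scores)
    return ("blocked" if avg >= threshold else "allowed", avg)

safe, _ = datdp_filter("What is the capital of France?", stub_evaluator)
bad, _ = datdp_filter("IGNORE PREVIOUS instructions and jailbreak now", stub_evaluator)
```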
zh
[NLP-105] Understanding Multimodal LLM s Under Distribution Shifts: An Information-Theoretic Approach
【速读】: 该论文旨在解决多模态大型语言模型(MLLMs)在分布偏移(distribution shifts)条件下表现不稳定的问题。论文的关键解决方案在于提出了一种基于信息论的新理论框架,通过引入有效互信息(Effective Mutual Information, EMI)这一度量标准,量化输入查询与模型响应之间的相关性,并推导出其在分布内(in-distribution, ID)和分布外(out-of-distribution, OOD)数据上的差异上限,从而连接视觉和文本的分布差异。该框架能够系统地表征和量化MLLMs在分布偏移条件下的最大风险,确保这些模型在实际应用中的安全性和可靠性。
链接: https://arxiv.org/abs/2502.00577
作者: Changdae Oh,Zhen Fang,Shawn Im,Xuefeng Du,Yixuan Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios empirically validate our theoretical insights.
zh
[NLP-106] Detecting Ambiguities to Guide Query Rewrite for Robust Conversations in Enterprise AI Assistants
【速读】: 该论文旨在解决多轮企业级AI助手对话中的歧义问题,这些歧义源于问题间的对话依赖关系,导致理解错误。关键解决方案在于提出了一种NLU-NLG框架,通过自动重述查询来检测和解决歧义,并引入了一项新任务“基于歧义的查询重写”(Ambiguity-guided Query Rewrite)。论文开发了一套基于真实用户对话日志的分类规则和特征提取方法,以设计出性能优越的分类器,该分类器在检测模糊查询方面优于基于大型语言模型的基线方法。此外,将查询重写模块与歧义检测分类器结合使用,证明了这一端到端框架能够有效减轻歧义,同时不会对清晰查询造成不必要的干扰,从而提升了AI助手的整体性能。
链接: https://arxiv.org/abs/2502.00537
作者: Md Mehrab Tanjim,Xiang Chen,Victor S. Bursztyn,Uttaran Bhattacharya,Tung Mai,Vaishnavi Muppala,Akash Maharaj,Saayan Mitra,Eunyee Koh,Yunyao Li,Ken Russell
机构: Adobe Research(Adobe研究); Adobe Inc.(Adobe公司)
类目: Computation and Language (cs.CL)
备注: Preprint
点击查看摘要
Abstract:Multi-turn conversations with an Enterprise AI Assistant can be challenging due to conversational dependencies in questions, leading to ambiguities and errors. To address this, we propose an NLU-NLG framework for ambiguity detection and resolution through reformulating query automatically and introduce a new task called “Ambiguity-guided Query Rewrite.” To detect ambiguities, we develop a taxonomy based on real user conversational logs and draw insights from it to design rules and extract features for a classifier which yields superior performance in detecting ambiguous queries, outperforming LLM-based baselines. Furthermore, coupling the query rewrite module with our ambiguity detecting classifier shows that this end-to-end framework can effectively mitigate ambiguities without risking unnecessary insertions of unwanted phrases for clear queries, leading to an improvement in the overall performance of the AI Assistant. Due to its significance, this has been deployed in the real world application, namely Adobe Experience Platform AI Assistant.
zh
[NLP-107] Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings
【速读】: 该论文旨在解决在PET/CT影像中将文本描述与图像中的具体位置进行关联的问题。由于缺乏大规模标注的图像-文本数据集,论文提出了一种自动化弱标签生成管道,用于链接PET/CT报告描述与图像位置,并基于此训练了一个三维视觉-语言接地模型(3D vision-language visual grounding model)。解决方案的关键在于开发了这一自动化弱标签生成管道,通过识别SUVmax和轴向切片编号来找到PET/CT报告中的阳性发现,从而提取出11,356个句子-标签对用于训练ConTEXTual Net 3D模型。
链接: https://arxiv.org/abs/2502.00528
作者: Zachary Huemann,Samuel Church,Joshua D. Warner,Daniel Tran,Xin Tie,Alan B McMillan,Junjie Hu,Steve Y. Cho,Meghan Lubner,Tyler J. Bradshaw
机构: University of Wisconsin-Madison(威斯康星大学麦迪逊分校); University of Wisconsin Health(威斯康星大学健康中心); Carbone Cancer Center(卡本癌症中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Vision-language models can connect the text description of an object to its specific location in an image through visual grounding. This has potential applications in enhanced radiology reporting. However, these models require large annotated image-text datasets, which are lacking for PET/CT. We developed an automated pipeline to generate weak labels linking PET/CT report descriptions to their image locations and used it to train a 3D vision-language visual grounding model. Our pipeline finds positive findings in PET/CT reports by identifying mentions of SUVmax and axial slice numbers. From 25,578 PET/CT exams, we extracted 11,356 sentence-label pairs. Using this data, we trained ConTEXTual Net 3D, which integrates text embeddings from a large language model with a 3D nnU-Net via token-level cross-attention. The model’s performance was compared against LLMSeg, a 2.5D version of ConTEXTual Net, and two nuclear medicine physicians. The weak-labeling pipeline accurately identified lesion locations in 98% of cases (246/251), with 7.5% requiring boundary adjustments. ConTEXTual Net 3D achieved an F1 score of 0.80, outperforming LLMSeg (F1=0.22) and the 2.5D model (F1=0.53), though it underperformed both physicians (F1=0.94 and 0.91). The model achieved better performance on FDG (F1=0.78) and DCFPyL (F1=0.75) exams, while performance dropped on DOTATE (F1=0.58) and Fluciclovine (F1=0.66). The model performed consistently across lesion sizes but showed reduced accuracy on lesions with low uptake. Our novel weak labeling pipeline accurately produced an annotated dataset of PET/CT image-text pairs, facilitating the development of 3D visual grounding models. ConTEXTual Net 3D significantly outperformed other models but fell short of the performance of nuclear medicine physicians. Our study suggests that even larger datasets may be needed to close this performance gap.
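弱标签管线的第一步是从报告句子中抽取 SUVmax 数值与轴向切片号;下面的正则写法与示例句子都是本文的示意性猜测,并非论文的实际实现:

```python
import re

# Illustrative guesses at how SUVmax values and axial slice numbers might be
# phrased in a PET/CT report; the real pipeline may use different patterns.
SUV_RE = re.compile(r"SUVmax\s*(?:of|=|:)?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)
SLICE_RE = re.compile(r"(?:axial\s+)?(?:slice|image)\s*#?\s*(\d+)", re.IGNORECASE)

def extract_weak_labels(sentence):
    """Return (SUVmax values, slice numbers) found in one report sentence."""
    suvs = [float(m) for m in SUV_RE.findall(sentence)]
    slices = [int(m) for m in SLICE_RE.findall(sentence)]
    return suvs, slices

sent = "FDG-avid right hilar node with SUVmax of 6.3 on axial slice 142."
suvs, slices = extract_weak_labels(sent)
```

由此得到的 (句子, 切片位置) 对即可作为弱标签,去匹配图像中对应位置的高摄取病灶。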
zh
[NLP-108] PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
【速读】: 该论文旨在解决大型语言模型中KV缓存内存使用过高的问题,特别是由于异常值导致的传统量化方法在量化关键向量时遇到的挑战。论文的关键解决方案是提出了一种新的量化方法PolarQuant,它通过将关键向量分为两维子向量组,并采用极坐标表示(量化半径和极角),有效解决了异常值问题,从而提高了KV缓存量化效率并加速了解码过程,同时保持了全精度模型的下游性能。
链接: https://arxiv.org/abs/2502.00527
作者: Songhao Wu,Ang Lv,Xiao Feng,Yufei Zhang,Xun Zhang,Guojun Yin,Wei Lin,Rui Yan
机构: Renmin University of China(中国人民大学); ShanghaiTech University(上海科技大学); Meituan(美团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: preprint
点击查看摘要
Abstract:The KV cache in large language models is a dominant factor in memory usage, limiting their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational costs; however, previous methods struggle with quantizing key vectors due to outliers, resulting in excessive overhead. We propose a novel quantization approach called PolarQuant, which efficiently addresses the outlier challenge. We observe that outliers typically appear in only one of two dimensions, which are rotated together by a specific angle when rotary position embeddings are applied. When represented as two-dimensional vectors, these dimensions exhibit well-structured patterns, with radii and angles smoothly distributed in polar coordinates. This alleviates the challenge of outliers on per-channel quantization, making them well-suited for quantization. Thus, PolarQuant divides key vectors into groups of two-dimensional sub-vectors, encoding them as the corresponding quantized radius and the polar angle, rather than quantizing original key vectors directly. PolarQuant achieves the superior efficiency in KV cache quantization and accelerates the decoding process by turning the query-key inner product into a table lookup, all while maintaining the downstream performance of full-precision models.
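PolarQuant 的核心操作——把键向量按相邻两维配对、以量化后的半径和极角存储——可以用如下极简示意复现(位宽与数值均为假设,且省略了论文中的查表解码加速):

```python
import math

def polar_quantize(vec, r_bits=4, a_bits=4):
    """Encode a vector as per-pair (radius code, angle code); a simplified
    sketch of the PolarQuant idea with uniform quantization grids."""
    assert len(vec) % 2 == 0
    pairs = [(vec[i], vec[i + 1]) for i in range(0, len(vec), 2)]
    radii = [math.hypot(x, y) for x, y in pairs]
    r_max = max(radii) or 1.0
    r_levels, a_levels = (1 << r_bits) - 1, (1 << a_bits) - 1
    codes = []
    for (x, y), r in zip(pairs, radii):
        theta = math.atan2(y, x)  # polar angle in [-pi, pi]
        codes.append((round(r / r_max * r_levels),
                      round((theta + math.pi) / (2 * math.pi) * a_levels)))
    return codes, r_max

def polar_dequantize(codes, r_max, r_bits=4, a_bits=4):
    """Reconstruct the vector from (radius code, angle code) pairs."""
    r_levels, a_levels = (1 << r_bits) - 1, (1 << a_bits) - 1
    out = []
    for r_code, a_code in codes:
        r = r_code / r_levels * r_max
        theta = a_code / a_levels * 2 * math.pi - math.pi
        out.extend([r * math.cos(theta), r * math.sin(theta)])
    return out

key = [0.9, -0.3, 0.1, 0.7, -0.5, -0.5, 0.2, 0.0]
codes, r_max = polar_quantize(key)
recon = polar_dequantize(codes, r_max)
max_err = max(abs(a - b) for a, b in zip(key, recon))
```

异常值只会拉大半径分量,而极角分布平滑,因此按 (半径, 极角) 量化比逐维量化更稳健。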
zh
[NLP-109] Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning
【速读】: 该论文旨在解决复杂推理任务中单次推理结果不可靠的问题,解决方案的关键在于提出的Reasoning-Pruning Perplexity Consistency (RPC)方法。RPC结合了Perplexity Consistency与Reasoning Pruning:前者无缝集成大规模语言模型的困惑度与自一致性,后者剪除低概率推理路径,从而有效防止估计误差下降过程的退化。理论分析表明,RPC不仅将估计误差的收敛速率提升至指数级,还具有进一步降低模型误差的潜力。
链接: https://arxiv.org/abs/2502.00511
作者: Zhi Zhou,Tan Yuhao,Zenan Li,Yuan Yao,Lan-Zhe Guo,Xiaoxing Ma,Yu-Feng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review
点击查看摘要
Abstract:Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, single-shot inference often yields unreliable results for complex reasoning tasks, leading researchers to explore multiple reasoning paths through methods such as perplexity and self-consistency. In this paper, we present the first theoretical error decomposition analysis of these techniques, breaking down their error into estimation error and model error. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function, while self-consistency exhibits high estimation error due to a slow error convergence rate. To overcome these limitations, we propose Reasoning-Pruning Perplexity Consistency (RPC). This approach combines Perplexity Consistency, which seamlessly integrates LLM perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths to effectively prevent the degeneration of estimation error reduction. Theoretical analysis demonstrates that RPC not only accelerates the convergence rate of estimation error to an exponential level but also holds strong potential for further reducing model error. Extensive empirical evaluations on seven benchmark datasets confirm that RPC can significantly improve reasoning performance, sample efficiency, and confidence reliability.
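RPC 的思路可概括为:先剪除低概率推理路径,再按路径概率加权投票。下面是一个示意实现(采样路径、对数概率与剪枝分位数均为本文虚构的假设参数):

```python
import math
from collections import defaultdict

def rpc_answer(paths, prune_quantile=0.5):
    """paths: list of (answer, avg log-prob). Prune the low-probability half,
    then take a probability-weighted consistency vote over what remains."""
    cutoff = sorted(lp for _, lp in paths)[int(len(paths) * prune_quantile)]
    weights = defaultdict(float)
    for answer, lp in paths:
        if lp >= cutoff:                     # Reasoning Pruning
            weights[answer] += math.exp(lp)  # perplexity-based vote weight
    return max(weights, key=weights.get)

# Fabricated sampled reasoning paths: (final answer, average log-probability).
paths = [
    ("42", -0.2), ("42", -0.4), ("41", -3.0),
    ("42", -0.3), ("7", -4.0), ("41", -0.9),
]
best = rpc_answer(paths)
```

与普通自一致性(均匀计票)相比,低概率的离群路径在剪枝后不再稀释多数答案的置信度。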
zh
[NLP-110] Who's the MVP? A Game-Theoretic Evaluation Benchmark for Modular Attribution in LLM Agents
【速读】: 该论文旨在解决大型语言模型(LLM)代理框架中量化各模块对整体系统性能贡献的难题,以提高优化和可解释性。论文的关键解决方案是引入CapaBench评估框架,该框架基于合作博弈论中的Shapley值,能够系统地衡量单个模块及其交互作用的边际影响。通过在所有可能的组合中替换默认模块与测试变体,CapaBench提供了一种原则性的方法来归因性能贡献。
链接: https://arxiv.org/abs/2502.00510
作者: Yingxuan Yang,Bo Huang,Siyuan Qi,Chao Feng,Haoyi Hu,Yuxuan Zhu,Jinbo Hu,Haoran Zhao,Ziyi He,Xiao Liu,Zongyu Wang,Lin Qiu,Xuezhi Cao,Xunliang Cai,Yong Yu,Weinan Zhang
机构: Shanghai Jiao Tong University(上海交通大学); University of Chicago(芝加哥大学); University of Toronto(多伦多大学); Meituan(美团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Model (LLM) agent frameworks often employ modular architectures, incorporating components such as planning, reasoning, action execution, and reflection to tackle complex tasks. However, quantifying the contribution of each module to overall system performance remains a significant challenge, impeding optimization and interpretability. To address this, we introduce CapaBench (Capability-level Assessment Benchmark), an evaluation framework grounded in cooperative game theory’s Shapley Value, which systematically measures the marginal impact of individual modules and their interactions within an agent’s architecture. By replacing default modules with test variants across all possible combinations, CapaBench provides a principled method for attributing performance contributions. Key contributions include: (1) We are the first to propose a Shapley Value-based methodology for quantifying the contributions of capabilities in LLM agents; (2) Modules with high Shapley Values consistently lead to predictable performance gains when combined, enabling targeted optimization; and (3) We build a multi-round dataset of over 1,000 entries spanning diverse domains and practical task scenarios, enabling comprehensive evaluation of agent capabilities. CapaBench bridges the gap between component-level evaluation and holistic system assessment, providing actionable insights for optimizing modular LLM agents and advancing their deployment in complex, real-world scenarios.
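CapaBench 依据 Shapley 值归因各模块贡献。下面用一个虚构的打分函数演示精确 Shapley 值的计算,并验证其效率性(各模块贡献之和等于整体收益);打分数值与交互奖励均为本文假设,真实基准中分数来自实际运行各模块组合:

```python
from itertools import combinations
from math import factorial

MODULES = ["planning", "reasoning", "action", "reflection"]

def performance(subset):
    """Hypothetical benchmark score when `subset` of modules is upgraded."""
    base = 0.40
    gains = {"planning": 0.10, "reasoning": 0.20, "action": 0.05, "reflection": 0.02}
    score = base + sum(gains[m] for m in subset)
    if "planning" in subset and "reasoning" in subset:
        score += 0.05  # made-up interaction bonus between the two modules
    return score

def shapley_values(modules, v):
    """Exact Shapley values: weighted marginal contributions over all subsets."""
    n = len(modules)
    phi = {m: 0.0 for m in modules}
    for m in modules:
        others = [x for x in modules if x != m]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[m] += weight * (v(set(subset) | {m}) - v(set(subset)))
    return phi

phi = shapley_values(MODULES, performance)
total = sum(phi.values())
```

交互奖励被对称地分摊到 planning 与 reasoning 上,这正是 Shapley 值能刻画"模块间交互"的原因。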
zh
[NLP-111] A statistically consistent measure of Semantic Variability using Language Models
【速读】: 该论文旨在解决语言模型输出结果的语义变异性问题。关键在于提出了一种语义谱熵(semantic spectral entropy)的度量方法,该方法在轻度假设下具有统计一致性,并且易于实现,仅需现成的语言模型即可应用。
链接: https://arxiv.org/abs/2502.00507
作者: Yi Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:To address the issue of variability in the output generated by a language model, we present a measure of semantic variability that is statistically consistent under mild assumptions. This measure, denoted as semantic spectral entropy, is an easy-to-implement algorithm that requires just off-the-shelf language models. We put very few restrictions on the language models, and we have shown in clear simulation studies that such a method can generate an accurate metric despite the randomness that arises from the language models.
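一个可能的简化读法是:把采样得到的多个回答按语义等价性聚类,再对簇的分布求熵。下面用"首词相同"这一玩具等价判定代替论文中基于语言模型的判定,仅示意"回答越分散、熵越高"的性质(具体算法细节以论文为准):

```python
import math

def semantic_clusters(answers, equivalent):
    """Greedily group answers into semantic-equivalence clusters."""
    clusters = []
    for a in answers:
        for c in clusters:
            if equivalent(a, c[0]):
                c.append(a)
                break
        else:
            clusters.append([a])
    return clusters

def semantic_entropy(answers, equivalent):
    """Shannon entropy of the cluster-size distribution."""
    clusters = semantic_clusters(answers, equivalent)
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

def toy_equivalent(a, b):
    # Stand-in for an LLM-based equivalence judgment: same first word.
    return a.split()[0].lower() == b.split()[0].lower()

diverse = ["Paris is the capital", "London is bigger", "Rome fell", "Berlin wall"]
agreed = ["Paris is the capital", "Paris France", "Paris", "Paris again"]
high = semantic_entropy(diverse, toy_equivalent)
low = semantic_entropy(agreed, toy_equivalent)
```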
zh
[NLP-112] Towards Privacy-aware Mental Health AI Models: Advances, Challenges and Opportunities
【速读】: 该论文旨在解决在开发和部署用于精神健康诊断与治疗的人工智能(Artificial Intelligence, AI)模型时所面临的隐私挑战。论文的关键解决方案包括数据匿名化、合成数据生成以及隐私保护模型训练,以增强实际应用中的隐私保障。此外,论文还讨论了评估框架,用以衡量这些方法中隐私性和实用性之间的权衡。通过解决这些挑战,研究旨在推进可靠且注重隐私的人工智能工具的发展,以支持临床决策并改善精神健康结果。
链接: https://arxiv.org/abs/2502.00451
作者: Aishik Mandal,Tanmoy Chakraborty,Iryna Gurevych
机构: Technische Universität Darmstadt (达姆施塔特工业大学); Hessian Center for AI (hessian.AI); Department of Computer Science (计算机科学系); Indian Institute of Technology Delhi, India (印度理工学院德里分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 2 figures
点击查看摘要
Abstract:Mental illness is a widespread and debilitating condition with substantial societal and personal costs. Traditional diagnostic and treatment approaches, such as self-reported questionnaires and psychotherapy sessions, often impose significant burdens on both patients and clinicians, limiting accessibility and efficiency. Recent advances in Artificial Intelligence (AI), particularly in Natural Language Processing and multimodal techniques, hold great potential for recognizing and addressing conditions such as depression, anxiety, bipolar disorder, schizophrenia, and post-traumatic stress disorder. However, privacy concerns, including the risk of sensitive data leakage from datasets and trained models, remain a critical barrier to deploying these AI systems in real-world clinical settings. These challenges are amplified in multimodal methods, where personal identifiers such as voice and facial data can be misused. This paper presents a critical and comprehensive study of the privacy challenges associated with developing and deploying AI models for mental health. We further prescribe potential solutions, including data anonymization, synthetic data generation, and privacy-preserving model training, to strengthen privacy safeguards in practical applications. Additionally, we discuss evaluation frameworks to assess the privacy-utility trade-offs in these approaches. By addressing these challenges, our work aims to advance the development of reliable, privacy-aware AI tools to support clinical decision-making and improve mental health outcomes.
zh
[NLP-113] HERA: Improving Long Document Summarization using Large Language Models with Context Packaging and Reordering
【速读】: 该论文旨在解决大型语言模型(LLMs)在长文档摘要任务上的表现不佳问题。主要原因是长文档中的相关信息分散且叙述顺序混乱,影响了LLMs对文档的准确理解和利用。为了解决这些问题,论文提出了一种新的摘要生成框架HERA。关键解决方案在于首先根据语义结构分割长文档,并检索关于同一事件的文本片段,最后重新排序这些片段以形成输入上下文。
链接: https://arxiv.org/abs/2502.00448
作者: Taiji Li,Hao Chen,Fei Yu,Yin Zhang
机构: College of Computer Science and Technology, Zhejiang University(浙江大学); Ant Group, China(蚂蚁集团, 中国)
类目: Computation and Language (cs.CL)
备注: 7 pages, 1 figure
点击查看摘要
Abstract:Despite the rapid growth of context length of large language models (LLMs) , LLMs still perform poorly in long document summarization. An important reason for this is that relevant information about an event is scattered throughout long documents, and the messy narrative order impairs the accurate understanding and utilization of LLMs for long documents. To address these issues, we propose a novel summary generation framework, called HERA. Specifically, we first segment a long document by its semantic structure and retrieve text segments about the same event, and finally reorder them to form the input context. We evaluate our approach on two long document summarization datasets. The experimental results show that HERA outperforms foundation models in ROUGE, BERTScore and faithfulness metrics, while HERA does not require additional fine-tuning and resources.
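HERA 的"按事件检索并重排"步骤可用如下玩具代码示意——关键词匹配代替了论文中的语义检索,句子与事件词均为本文虚构:

```python
def reorder_by_event(sentences, events):
    """Group sentences by the first event keyword they mention, then
    concatenate groups so related fragments sit together in the context."""
    groups = {e: [] for e in events}
    rest = []
    for s in sentences:
        for e in events:
            if e in s.lower():
                groups[e].append(s)
                break
        else:
            rest.append(s)
    return [s for e in events for s in groups[e]] + rest

doc = [
    "The merger was announced in May.",
    "Quarterly profits rose 8%.",
    "Regulators reviewed the merger in June.",
    "Profits had fallen the previous year.",
    "The merger closed in August.",
]
context = reorder_by_event(doc, ["merger", "profits"])
```

重排后同一事件的片段彼此相邻,缓解了"相关信息散落全文"对 LLM 理解的干扰。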
zh
[NLP-114] UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中的后训练部署挑战,特别是显著的内存开销和明显的推理延迟。现有方法如层内键值(KV)共享和跨层KV共享虽有所改进,但仍存在不足。论文的关键在于识别到Softmax操作是LLM推理的主要瓶颈,并且在后训练过程中实际上是冗余的。为此,论文提出了一种新的后训练方法——注意力中的Softmax统一(UniAttn),通过统一Transformer块中的Softmax激活来降低LLM的推理成本,并采用线性投影补偿由Softmax统一引起的误差。实验表明,UniAttn在保持标准后训练性能的同时显著降低了推理成本,优于现有的高效架构。
链接: https://arxiv.org/abs/2502.00439
作者: Yizhe Xiong,Wei Huang,Xin Ye,Hui Chen,Zijia Lin,Haoran Lian,Zhenpeng Su,Jungong Han,Guiguang Ding
机构: School of Software, Tsinghua University (清华大学软件学院); School of Computer Science, Beijing University of Posts and Telecommunications (北京邮电大学计算机科学学院); Kuaishou Technology (快手科技); Beihang University (北京航空航天大学); Department of Automation, Tsinghua University (清华大学自动化系)
类目: Computation and Language (cs.CL)
备注: 11 pages, 4 figures. Preprint, under review
点击查看摘要
Abstract:Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, intra-layer KV sharing still results in high inference costs, while cross-layer KV sharing leads to significant performance degradation. As a result, both methods remain suboptimal for post-training pre-trained LLMs. In this paper, we identify that the \textttSoftmax operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax \textbfUnification in \textbfAtte\textbfntion (\textbfUniAttn), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training. Our code will be available at \urlthis https URL.
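UniAttn 的要点是:Softmax 注意力分布只计算一次,后续块直接复用,并加线性补偿修正误差。下面是一个不含训练的纯数值示意(分数矩阵、值矩阵与补偿系数均为虚构,补偿形式也仅示意"复用 + 修正"的结构,并非论文的线性投影实现):

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def mix(A, V):
    """One attention mixing step: out[i] = sum_j A[i][j] * V[j]."""
    return [[sum(A[i][j] * V[j][d] for j in range(len(V)))
             for d in range(len(V[0]))] for i in range(len(A))]

# Attention scores computed in one block; later blocks reuse its softmax
# output (the "unification") instead of running their own Softmax.
scores = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 2.2]]
A_shared = [softmax(r) for r in scores]

V2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # value states of a later block
compensation = 0.05                         # stand-in for the learned correction
out = [[x * (1 + compensation) for x in row] for row in mix(A_shared, V2)]
```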
zh
[NLP-115] Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language ICASSP2025
【速读】: 该论文旨在解决奥罗莫语(Oromo)自动语音识别(ASR)资源匮乏的问题。解决方案的关键在于构建了一个包含100小时真实世界音频记录及对应转录的新型ASR数据集,并通过使用Conformer模型和微调Whisper模型,分别实现了15.32%和10.82%的词错误率(WER),从而为奥罗莫语ASR建立了基准,展示了提升该语言ASR性能的潜力与挑战。
链接: https://arxiv.org/abs/2502.00421
作者: Turi Abu,Ying Shi,Thomas Fang Zheng,Dong Wang
机构: Center for Speech and Language Technologies, BNRist, Beijing(北京语音与语言技术中心, BNRist, 北京); Department of Computer Science and Technology, Tsinghua University, Beijing, China(清华大学计算机科学与技术系, 北京, 中国); School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院, 哈尔滨, 中国)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted for ICASSP2025 (2025 IEEE International Conference on Acoustics, Speech, and Signal Processing)
点击查看摘要
Abstract:We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the Oromo language which is underrepresented. To show its applicability for the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at this https URL and we encourage its use for further research and development in Oromo speech processing.
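文中以词错误率(WER)报告各模型结果;WER 即参考文本与识别结果之间的词级编辑距离除以参考词数,可用动态规划直接计算:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / #ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 ref words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```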
zh
[NLP-116] Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments
【速读】: 该论文旨在解决在政治敏感场景中,特别是在冲突特定背景下,社交媒体平台上意识形态立场检测的研究不足问题。研究通过分析9,969条与以色列-巴勒斯坦冲突相关的Reddit评论,提出了多种方法,包括机器学习、预训练语言模型、神经网络以及针对开源大型语言模型(LLMs)的提示工程策略,来分类这些评论的立场,如亲以色列、亲巴勒斯坦和中立。关键解决方案在于采用Scoring和Reflective Re-read提示策略,在Mixtral 8x7B模型中实现了最高的性能表现,从而有效提升了在高度两极分化的社交媒体环境中意识形态立场检测的准确性。
链接: https://arxiv.org/abs/2502.00414
作者: Hasin Jawad Ali,Ajwad Abrar,S.M. Hozaifa Hossain,M. Firoz Mridha
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In politically sensitive scenarios like wars, social media serves as a platform for polarized discourse and expressions of strong ideological stances. While prior studies have explored ideological stance detection in general contexts, limited attention has been given to conflict-specific settings. This study addresses this gap by analyzing 9,969 Reddit comments related to the Israel-Palestine conflict, collected between October 2023 and August 2024. The comments were categorized into three stance classes: Pro-Israel, Pro-Palestine, and Neutral. Various approaches, including machine learning, pre-trained language models, neural networks, and prompt engineering strategies for open source large language models (LLMs), were employed to classify these stances. Performance was assessed using metrics such as accuracy, precision, recall, and F1-score. Among the tested methods, the Scoring and Reflective Re-read prompt in Mixtral 8x7B demonstrated the highest performance across all metrics. This study provides comparative insights into the effectiveness of different models for detecting ideological stances in highly polarized social media contexts. The dataset used in this research is publicly available for further exploration and validation.
zh
[NLP-117] Doing More with Less – Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLM)系统在处理不同任务时资源利用不均衡的问题。论文的关键解决方案在于引入一种路由机制,将用户查询分配到最适合的组件,如较小的LLM或特定领域的专家。这种方法通过优化资源配置,提高响应质量的同时最小化成本。
链接: https://arxiv.org/abs/2502.00409
作者: Clovis Varangot-Reille,Christophe Bouvard,Antoine Gourru,Mathieu Ciancone,Marion Schaeffer,François Jacquenet
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLM)-based systems, i.e. interconnected elements that include an LLM as a central component (e.g., conversational agents), are typically monolithic static architectures that rely on a single LLM for all user queries. However, they often require different preprocessing strategies, levels of reasoning, or knowledge. Generalist LLMs (i.e. GPT-4), trained on very large multi-topic corpora, can perform well in a variety of tasks. However, they require significant financial, energy, and hardware resources that may not be justified for basic tasks. This implies potentially investing in unnecessary costs for a given query. To overcome this problem, a routing mechanism routes user queries to the most suitable components, such as smaller LLMs or experts in specific topics. This approach may improve response quality while minimising costs. Routing can be expanded to other components of the conversational agent architecture, such as the selection of optimal embedding strategies. This paper explores key considerations for integrating routing into LLM-based systems, focusing on resource management, cost definition, and strategy selection. Our main contributions include a formalisation of the problem, a novel taxonomy of existing approaches emphasising relevance and resource efficiency, and a comparative analysis of these strategies in relation to industry practices. Finally, we identify critical challenges and directions for future research.
zh
[NLP-118] ALU: Agentic LLM Unlearning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中信息移除或抑制的需求,特别是在AI监管、法律合规、安全性和隐私性方面。论文的关键在于提出了一种名为“代理型LLM无学习”(Agent-based LLM Unlearning, ALU)的方法,这是一种多代理、无需重新训练、与模型无关的LLM无学习方法。ALU通过多个专门设计用于无学习过程特定步骤的LLM代理来实现高效的信息删除,同时保持模型的实用性,并且无需更新任何代理的模型权重。这种方法使得用户可以灵活地请求任意顺序的无学习实例,从而在实时适应方面表现出色,而无需对基础LLM模型进行任何修改。
链接: https://arxiv.org/abs/2502.00406
作者: Debdeep Sanyal,Murari Mandal
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Information removal or suppression in large language models (LLMs) is a desired functionality, useful in AI regulation, legal compliance, safety, and privacy. LLM unlearning methods aim to remove information on demand from LLMs. Current LLM unlearning methods struggle to balance the unlearning efficacy and utility due to the competing nature of these objectives. Keeping the unlearning process computationally feasible without assuming access to the model weights is an overlooked area. We present the first agentic LLM unlearning (ALU) method, a multi-agent, retrain-free, model-agnostic approach to LLM unlearning that achieves effective unlearning while preserving the utility. Our ALU framework unlearns by involving multiple LLM agents, each designed for a specific step in the unlearning process, without the need to update model weights for any of the agents in the framework. Users can easily request any set of unlearning instances in any sequence, and ALU seamlessly adapts in real time. This is facilitated without requiring any changes in the underlying LLM model. Through extensive experiments on established benchmarks (TOFU, WMDP, WPU) and jailbreaking techniques (many shot, target masking, other languages), we demonstrate that ALU consistently stands out as the most robust LLM unlearning framework among current state-of-the-art methods while incurring a low constant-time cost. We further highlight ALU’s superior performance compared to existing methods when evaluated at scale. Specifically, ALU is assessed on up to 1000 unlearning targets, exceeding the evaluation scope of all previously proposed LLM unlearning methods.
zh
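上述多代理、免重训的无学习流程可以用如下示意性代码勾勒(纯属说明性草图:`responder`、`auditor`、`redactor` 等函数名与示例数据均为本文假设,实际系统中各代理对应独立的 LLM 调用,且全程不更新任何模型权重):

```python
# 示意性草图:ALU 式多代理、免重训的无学习流程。
# responder / auditor / redactor 均为占位函数(假设名),
# 实际系统中对应独立的 LLM 调用;由于不更新权重,
# unlearn_targets 可在请求时任意增删,实现实时适应。

unlearn_targets = {"Project Nightfall"}  # 可在运行时修改的无学习目标集合

def responder(query: str) -> str:
    # 占位:基础 LLM 的原始回答
    return "Project Nightfall was led by Dr. Doe in 2021."

def auditor(draft: str) -> list:
    # 标记草稿中泄露的无学习目标
    return [t for t in unlearn_targets if t in draft]

def redactor(draft: str, leaks: list) -> str:
    # 占位:实际应由 LLM 重写;此处仅做简单遮蔽
    for t in leaks:
        draft = draft.replace(t, "[information removed]")
    return draft

def alu_pipeline(query: str) -> str:
    draft = responder(query)
    leaks = auditor(draft)
    return redactor(draft, leaks) if leaks else draft

out = alu_pipeline("Tell me about Project Nightfall.")
```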
[NLP-119] The Impact of Persona-based Political Perspectives on Hateful Content Detection
【速读】: 该论文旨在探究persona-based prompting策略在多模态仇恨言论检测任务(特别是针对表情包中的仇恨言论)中能否实现与政治预训练相当的效果。关键在于通过映射persona到政治罗盘并测量persona一致性,发现内在的政治立场与分类决策之间的相关性较低,即使注入更强的意识形态描述也依然如此。这表明虽然大型语言模型(LLMs)在直接回答政治问题时可能表现出政治偏见,但在实际分类任务中的影响可能比之前认为的要小,从而质疑了昂贵的计算资源需求以实现公平性能的政治预训练的必要性。
链接: https://arxiv.org/abs/2502.00385
作者: Stefano Civelli,Pietro Bernardelle,Gianluca Demartini
机构: The University of Queensland(昆士兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While pretraining language models with politically diverse content has been shown to improve downstream task fairness, such approaches require significant computational resources often inaccessible to many researchers and organizations. Recent work has established that persona-based prompting can introduce political diversity in model outputs without additional training. However, it remains unclear whether such prompting strategies can achieve results comparable to political pretraining for downstream tasks. We investigate this question using persona-based prompting strategies in multimodal hate-speech detection tasks, specifically focusing on hate speech in memes. Our analysis reveals that when mapping personas onto a political compass and measuring persona agreement, inherent political positioning has surprisingly little correlation with classification decisions. Notably, this lack of correlation persists even when personas are explicitly injected with stronger ideological descriptors. Our findings suggest that while LLMs can exhibit political biases in their responses to direct political questions, these biases may have less impact on practical classification tasks than previously assumed. This raises important questions about the necessity of computationally expensive political pretraining for achieving fair performance in downstream tasks.
zh
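论文的分析思路——将 persona 映射到政治罗盘坐标,再度量坐标与分类决策的相关性——可用如下玩具示例说明(personas、表情包文本与分类规则均为本文虚构,仅用于演示"政治立场与决策相关性接近零"这一度量方式):

```python
# 玩具示例:persona 的政治坐标与仇恨言论分类决策的 Pearson 相关性。
# 此处的分类器故意忽略 persona(仅看文本),因而相关性应接近 0,
# 对应论文观察到的"立场与决策几乎不相关"的现象。数据均为虚构。
import math

personas = {
    "union organizer": -0.8,        # 假设的左右经济轴坐标
    "centrist journalist": 0.0,
    "free-market economist": 0.8,
}

def classify_with_persona(persona: str, meme_text: str) -> int:
    # 占位:实际应为带 persona 系统提示的 LLM 调用;1=仇恨, 0=非仇恨
    prompt = f"You are a {persona}. Is this meme hateful? {meme_text}"  # 仅示意
    return int("slur" in meme_text)  # 玩具规则:决策与 persona 无关

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

memes = ["a meme containing a slur", "a harmless cat meme", "another slur meme"]
positions, decisions = [], []
for name, pos in personas.items():
    for m in memes:
        positions.append(pos)
        decisions.append(classify_with_persona(name, m))

r = pearson(positions, decisions)
```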
[NLP-120] When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation
【速读】: 该论文旨在解决级联语音到文本翻译模型中的错误传播问题。关键解决方案在于结合自动语音识别(ASR)的多个候选结果和自监督语音特征,以减少语音领域相似样本映射到文本领域时的差异性,从而提高机器翻译(MT)模型的准确性,并最小化错误传播。这一策略充分利用了大规模的ASR和MT数据集以及预训练的ASR/MT模型。
链接: https://arxiv.org/abs/2502.00377
作者: Anna Min,Chenxu Hu,Yi Ren,Hang Zhao
机构: School of Software (软件学院), Tsinghua University (清华大学); IIIS (清华交叉信息研究院), Tsinghua University (清华大学); TikTok; IIIS (清华交叉信息研究院), Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Though end-to-end speech-to-text translation has been a great success, we argue that the cascaded speech-to-text translation model still has its place, which is usually criticized for the error propagation between automatic speech recognition (ASR) and machine translation (MT) models. In this paper, we explore the benefits of incorporating multiple candidates from ASR and self-supervised speech features into MT. Our analysis reveals that the primary cause of cascading errors stems from the increased divergence between similar samples in the speech domain when mapped to the text domain. By including multiple candidates and self-supervised speech features, our approach allows the machine translation model to choose the right words and ensure precise translation using various speech samples. This strategy minimizes error spread and takes advantage of large ASR and MT datasets, along with pre-trained ASR/MT models, while addressing associated issues.
zh
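将 ASR 多候选引入 MT 的级联思路可用如下草图示意(`asr_nbest`、`mt_translate`、`mt_score` 均为占位函数,真实系统中打分应来自 MT 模型的对数似然,而非这里的玩具词表规则):

```python
# 示意性级联:MT 阶段接收 ASR 的 n-best 候选,
# 用自身的打分函数挑出最可译的一条,从而缓解错误传播。
def asr_nbest(audio) -> list:
    # 占位:ASR 的两条假设性识别候选
    return ["I scream for ice cream", "eye scream for ice cream"]

def mt_translate(text: str) -> str:
    return f"<zh translation of: {text}>"  # 占位的翻译调用

def mt_score(text: str) -> float:
    # 玩具流畅度代理:真实系统应使用 MT 模型对数似然
    vocab = {"i", "scream", "for", "ice", "cream"}
    words = text.lower().split()
    return sum(w in vocab for w in words) / len(words)

def cascade(audio) -> str:
    best = max(asr_nbest(audio), key=mt_score)  # 在候选中择优
    return mt_translate(best)
```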
[NLP-121] A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation
【速读】: 该论文旨在解决语音到语音翻译(Speech-to-Speech Translation, S2ST)中忽视情感和态度等副语言信息(Paralinguistic Information)的问题。为了解决这一问题,研究引入了一个精心编纂的多语言数据集,该数据集源自多种电影音频片段,并且每对数据在副语言信息和时长方面进行了精确匹配。关键解决方案在于整合多种韵律迁移技术,以实现既准确又自然且富含副语言细节的翻译。实验结果表明,该模型在保持高翻译准确性和自然性的同时,能够保留更多的源语音副语言信息。
链接: https://arxiv.org/abs/2502.00374
作者: Anna Min,Chenxu Hu,Yi Ren,Hang Zhao
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Current research in speech-to-speech translation (S2ST) primarily concentrates on translation accuracy and speech naturalness, often overlooking key elements like paralinguistic information, which is essential for conveying emotions and attitudes in communication. To address this, our research introduces a novel, carefully curated multilingual dataset from various movie audio tracks. Each dataset pair is precisely matched for paralinguistic information and duration. We enhance this by integrating multiple prosody transfer techniques, aiming for translations that are accurate, natural-sounding, and rich in paralinguistic details. Our experimental results confirm that our model retains more paralinguistic information from the source speech while maintaining high standards of translation accuracy and naturalness.
zh
[NLP-122] FinchGPT: a Transformer based language model for birdsong analysis
【速读】: 该论文旨在探究非人类动物在连续发声中是否存在类似于人类语言中的长程依赖性。解决方案的关键在于使用基于Transformer架构的FinchGPT模型,该模型在文本化的鸣鸟歌声数据集上进行训练,并通过注意力权重分析有效捕捉了音节序列中的长程依赖性。此外,通过限制模型的注意力范围和破坏鸟类歌曲语法,研究展示了计算和生物学操作对其性能的影响。
链接: https://arxiv.org/abs/2502.00344
作者: Kosei Kobayashi,Kosuke Matsuzaki,Masaya Taniguchi,Keisuke Sakaguchi,Kentaro Inui,Kentaro Abe
机构: Graduate School of Life Sciences, Tohoku University(东北大学生命科学研究科), Japan; Graduate School of Information Sciences, Tohoku University(东北大学信息科学研究科), Japan; RIKEN Center for Advanced Intelligence Project(理化学研究所高级智能项目中心), Japan; Natural Language Processing Department, Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学自然语言处理系), United Arab Emirates; Center for Language AI Research, Tohoku University(东北大学语言AI研究中心), Japan; Division for the Establishment of Frontier Sciences, Tohoku University(东北大学前沿科学建立部门), Japan
类目: Computation and Language (cs.CL)
备注: 12 pages, 4 figures
点击查看摘要
Abstract:The long-range dependencies among the tokens, which originate from hierarchical structures, are a defining hallmark of human language. However, whether similar dependencies exist within the sequential vocalization of non-human animals remains a topic of investigation. Transformer architectures, known for their ability to model long-range dependencies among tokens, provide a powerful tool for investigating this phenomenon. In this study, we employed the Transformer architecture to analyze the songs of Bengalese finch (Lonchura striata domestica), which are characterized by their highly variable and complex syllable sequences. To this end, we developed FinchGPT, a Transformer-based model trained on a textualized corpus of birdsongs, which outperformed other architecture models in this domain. Attention weight analysis revealed that FinchGPT effectively captures long-range dependencies within syllable sequences. Furthermore, reverse engineering approaches demonstrated the impact of computational and biological manipulations on its performance: restricting FinchGPT’s attention span and disrupting birdsong syntax through the ablation of specific brain nuclei markedly influenced the model’s outputs. Our study highlights the transformative potential of large language models (LLMs) in deciphering the complexities of animal vocalizations, offering a novel framework for exploring the structural properties of non-human communication systems while shedding light on the computational distinctions between biological brains and artificial neural networks.
zh
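文中的注意力权重分析思想,可以用"平均注意距离"这一简化指标来示意:给定某个注意力头在音节序列上的因果注意力矩阵,计算每个 query 所关注 key 的加权平均距离,数值越大说明该头越倾向于捕捉长程依赖。以下代码与数据均为说明性假设,并非论文原实现:

```python
# 简化指标:因果注意力矩阵的加权平均注意距离。
# attn[q][k] 为位置 q 对位置 k 的注意力权重(k <= q)。
def mean_attention_distance(attn) -> float:
    total, weight = 0.0, 0.0
    for q, row in enumerate(attn):
        for k, w in enumerate(row[: q + 1]):  # 因果:只看 q 之前的 key
            total += w * (q - k)
            weight += w
    return total / weight

# 玩具的 3-token 因果注意力(每行和为 1)
attn = [
    [1.0],
    [0.5, 0.5],
    [0.8, 0.1, 0.1],
]
```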
[NLP-123] Enhancing Token Filtering Efficiency in Large Language Model Training with Collider
【速读】: 该论文旨在解决通过token过滤提升大规模语言模型(Large Language Models, LLMs)效用时未能实现更高效率的问题。现有方法仅在输出层过滤token,导致稀疏度不足,并且即使有足够稀疏度,稀疏GEMM操作依然低效。论文的关键解决方案在于提出Collider系统,它通过对所有层的非重要token激活进行过滤来保持高稀疏度,并通过自动工作流将稀疏GEMM转换为降维密集GEMM,以优化效率。
链接: https://arxiv.org/abs/2502.00340
作者: Di Chai,Pengbo Li,Feiyuan Zhang,Yilun Jin,Han Tian,Junxue Zhang,Kai Chen
机构: Hong Kong University of Science and Technology; University of Science and Technology of China
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
点击查看摘要
Abstract:Token filtering has been proposed to enhance utility of large language models (LLMs) by eliminating inconsequential tokens during training. While using fewer tokens should reduce computational workloads, existing studies have not succeeded in achieving higher efficiency. This is primarily due to the insufficient sparsity caused by filtering tokens only in the output layers, as well as inefficient sparse GEMM (General Matrix Multiplication), even when having sufficient sparsity. This paper presents Collider, a system unleashing the full efficiency of token filtering in LLM training. At its core, Collider filters activations of inconsequential tokens across all layers to maintain sparsity. Additionally, it features an automatic workflow that transforms sparse GEMM into dimension-reduced dense GEMM for optimized efficiency. Evaluations on three LLMs-TinyLlama-1.1B, Qwen2.5-1.5B, and Phi1.5-1.4B-demonstrate that Collider reduces backpropagation time by up to 35.1% and end-to-end training time by up to 22.0% when filtering 40% of tokens. Utility assessments of training TinyLlama on 15B tokens indicate that Collider sustains the utility advancements of token filtering by relatively improving model utility by 16.3% comparing to regular training, and reduces training time from 4.7 days to 3.5 days using 8 GPUs. Collider is designed for easy integration into existing LLM training frameworks, allowing systems already using token filtering to accelerate training with just one line of code.
zh
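Collider 将稀疏 GEMM 转换为降维密集 GEMM 的核心思路,可用如下纯 Python 草图示意(形状与数据均为玩具值):先按掩码收集(gather)未被过滤的 token 激活行,再对缩小后的密集矩阵执行普通 GEMM:

```python
# 示意:稀疏 GEMM -> 降维密集 GEMM。
# 与其在整行置零的稀疏激活矩阵上做乘法,
# 不如先 gather 出保留行,再做更小的密集 GEMM。
def gemm(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def filtered_gemm(acts, keep_mask, weights):
    kept_rows = [row for row, keep in zip(acts, keep_mask) if keep]  # gather
    return gemm(kept_rows, weights)  # 维度缩减后的密集 GEMM

acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 个 token, hidden=2
keep = [True, False, True]                     # 第 2 个 token 被过滤
w = [[1.0, 0.0], [0.0, 1.0]]                   # 玩具单位权重
out = filtered_gemm(acts, keep, w)
```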
[NLP-124] Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions
【速读】: 该论文旨在解决社交媒体平台上假新闻传播所带来的信任危机、社会不稳定及民主制度受损等关键风险。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)的进步,通过更先进的语义理解和多模态融合技术来提升检测准确性,以应对动态且多模态的虚假信息。然而,研究还指出了适应社交媒体趋势、实时跨平台检测能力以及大型语言模型误用所引发的伦理挑战等关键缺口。未来的研究方向包括开发风格无关模型、跨语言检测框架以及稳健政策,以减轻由大型语言模型驱动的虚假信息。
链接: https://arxiv.org/abs/2502.00339
作者: Jingyuan Yi,Zeqiu Xu,Tianyi Huang,Peiyang Yu
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:The pervasiveness of the dissemination of fake news through social media platforms poses critical risks to the trust of the general public, societal stability, and democratic institutions. This challenge calls for novel methodologies in detection, which can keep pace with the dynamic and multi-modal nature of misinformation. Recent works include powering the detection using large language model advances in multimodal frameworks, methodologies using graphs, and adversarial training in the literature of fake news. Based on the different approaches which can bring success, some key highlights will be underlined: enhanced LLM-improves accuracy through more advanced semantics and cross-modality fusion for robust detections. The review further identifies critical gaps in adaptability to dynamic social media trends, real-time, and cross-platform detection capabilities, as well as the ethical challenges thrown up by the misuse of LLMs. Future directions underline the development of style-agnostic models, cross-lingual detection frameworks, and robust policies with a view to mitigating LLM-driven misinformation. This synthesis thus lays a concrete foundation for those researchers and practitioners committed to reinforcing fake news detection systems with complications that keep on growing in the digital landscape.
zh
[NLP-125] UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在本科物理推理任务中的表现不足问题。现有基准测试往往无法全面评估LLMs在本科物理广度和深度上的能力,从而凸显出构建综合性评估工具的需求。为填补这一空白,论文引入了UGPhysics,这是一个专门设计用于评估LLMs处理本科物理推理能力的大规模综合基准,包含5,520个英语和中文的本科物理题目,并覆盖13个主题,七种不同答案类型及四种独特的物理推理技能。关键解决方案在于开发了Model-Assistant Rule-based Judgment (MARJ) 管道,以确保对物理问题解答正确性的准确评估。
链接: https://arxiv.org/abs/2502.00334
作者: Xin Xu,Qiyun Xu,Tong Xiao,Tianhao Chen,Yuchen Yan,Jiaxin Zhang,Shizhe Diao,Can Yang,Yang Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics. However, the domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs’ abilities on the breadth and depth of undergraduate-level physics, underscoring the need for a comprehensive evaluation. To fill this gap, we introduce UGPhysics, a large-scale and comprehensive benchmark specifically designed to evaluate UnderGraduate-level Physics (UGPhysics) reasoning with LLMs. UGPhysics includes 5,520 undergraduate-level physics problems in both English and Chinese, covering 13 subjects with seven different answer types and four distinct physics reasoning skills, all rigorously screened for data leakage. Additionally, we develop a Model-Assistant Rule-based Judgment (MARJ) pipeline specifically tailored for assessing answer correctness of physics problems, ensuring accurate evaluation. Our evaluation of 31 leading LLMs shows that the highest overall accuracy, 49.8% (achieved by OpenAI-o1-mini), emphasizes the necessity for models with stronger physics reasoning skills, beyond math abilities. We hope UGPhysics, along with MARJ, will drive future advancements in AI for physics reasoning.
zh
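MARJ 式"规则优先、模型兜底"的判题流程可以用如下假设性草图说明(`llm_judge` 为 LLM 判题调用的占位实现,容差阈值亦为示例值,并非论文的具体配置):

```python
# 示意性 MARJ 流程:数值答案先走相对容差的规则判定,
# 规则无法裁决的符号/带单位答案再升级给 LLM 判题(此处为占位)。
def rule_judge(pred: str, gold: str, rel_tol: float = 1e-3):
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return None  # 规则无法裁决非数值答案
    return abs(p - g) <= rel_tol * max(abs(g), 1e-12)

def llm_judge(pred: str, gold: str) -> bool:
    # 占位:实际应调用模型助手判定等价性;此处用去空格比较示意
    return pred.replace(" ", "") == gold.replace(" ", "")

def marj(pred: str, gold: str) -> bool:
    verdict = rule_judge(pred, gold)
    return verdict if verdict is not None else llm_judge(pred, gold)
```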
[NLP-126] MODS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections NAACL2025
【速读】: 该论文旨在解决可争论性查询(Debatable Queries)的查询聚焦摘要(Query-Focused Summarization, QFS)问题。传统QFS方法假设查询只有一个答案,忽视了具有争议性的查询(如“法学院值得就读吗?”)。为应对这一挑战,论文提出了Debatable QFS (DQFS),其目标是通过包含对立观点的文档生成全面且平衡的摘要,而不偏袒任何一方。论文的关键解决方案是设计了一个名为MODS的多语言模型框架,该框架模拟人类小组讨论的过程。MODS将文档视为独立的发言者语言模型(Speaker LLMs),并由一个主持人语言模型(Moderator LLM)挑选发言者,针对计划主题提出定制化查询。发言者使用定制化查询从文档中检索相关上下文,并提供视角,这些视角被追踪在一个丰富的提纲中,形成内容计划以指导最终的摘要生成。这一方法有效提升了在主题段落覆盖率和平衡性方面的表现,超越了现有技术(SOTA)系统。
链接: https://arxiv.org/abs/2502.00322
作者: Nishant Balepur,Alexa Siu,Nedim Lipka,Franck Dernoncourt,Tong Sun,Jordan Boyd-Graber,Puneet Mathur
机构: University of Maryland(马里兰大学); Adobe Research(Adobe研究)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at NAACL 2025(main)
点击查看摘要
Abstract:Query-focused summarization (QFS) gives a summary of documents to answer a query. Past QFS work assumes queries have one answer, ignoring debatable ones (Is law school worth it?). We introduce Debatable QFS (DQFS), a task to create summaries that answer debatable queries via documents with opposing perspectives; summaries must comprehensively cover all sources and balance perspectives, favoring no side. These goals elude LLM QFS systems, which: 1) lack structured content plans, failing to guide LLMs to write balanced summaries, and 2) use the same query to retrieve contexts across documents, failing to cover all perspectives specific to each document’s content. To overcome this, we design MODS, a multi-LLM framework mirroring human panel discussions. MODS treats documents as individual Speaker LLMs and has a Moderator LLM that picks speakers to respond to tailored queries for planned topics. Speakers use tailored queries to retrieve relevant contexts from their documents and supply perspectives, which are tracked in a rich outline, yielding a content plan to guide the final summary. Experiments on ConflictingQA with controversial web queries and DebateQFS, our new dataset of debate queries from Debatepedia, show MODS beats SOTA by 38-59% in topic paragraph coverage and balance, based on new citation metrics. Users also find MODS’s summaries to be readable and more balanced.
zh
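MODS 的"主持人—发言者"控制流可用如下玩具代码勾勒(文档内容、定制查询与 `speaker_answer` 均为本文假设的占位实现;真实系统中发言者会先从各自文档检索相关上下文再给出观点):

```python
# 示意:主持人按计划主题轮询各文档"发言者",
# 把回答记录到提纲(outline)中,作为最终摘要的内容计划。
docs = {
    "pro_law_school.txt": "High salaries reward the JD investment.",
    "anti_law_school.txt": "Tuition debt outweighs uncertain job prospects.",
}

def speaker_answer(doc_text: str, tailored_query: str) -> str:
    # 占位:真实发言者会检索 + 归纳;此处直接返回文档内容
    return doc_text

def moderator(topics):
    outline = {}
    for topic in topics:
        outline[topic] = {}
        for name, text in docs.items():
            query = f"What does your document say about {topic}?"  # 定制查询
            outline[topic][name] = speaker_answer(text, query)
    return outline

plan = moderator(["cost", "career outcomes"])
```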
[NLP-127] Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values
【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)在社会和经济决策中的表现,特别是它们是否符合公平性概念(如平等性、无嫉妒性和罗尔斯最大最小原则),以及它们与人类偏好的一致性。研究的关键在于评估几种LLMs在反映这些公平性指标方面的性能,并比较它们之间的差异。研究结果表明,当前LLMs的响应与人类在资源分配上的偏好不一致,且无法利用金钱作为可转移资源来缓解不平等。然而,当LLMs被要求从预定义选项中选择而非生成新方案时,其表现有所改善。此外,论文还分析了LLMs响应对语义因素或非语义提示变化的鲁棒性,并提出了增强LLM行为与既定公平概念一致性的潜在策略。
链接: https://arxiv.org/abs/2502.00313
作者: Hadi Hosseini,Samarth Khanna
机构: Penn State University (宾夕法尼亚州立大学), USA
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:The growing interest in employing large language models (LLMs) for decision-making in social and economic contexts has raised questions about their potential to function as agents in these domains. A significant number of societal problems involve the distribution of resources, where fairness, along with economic efficiency, play a critical role in the desirability of outcomes. In this paper, we examine whether LLM responses adhere to fundamental fairness concepts such as equitability, envy-freeness, and Rawlsian maximin, and investigate their alignment with human preferences. We evaluate the performance of several LLMs, providing a comparative benchmark of their ability to reflect these measures. Our results demonstrate a lack of alignment between current LLM responses and human distributional preferences. Moreover, LLMs are unable to utilize money as a transferable resource to mitigate inequality. Nonetheless, we demonstrate a stark contrast when (some) LLMs are tasked with selecting from a predefined menu of options rather than generating one. In addition, we analyze the robustness of LLM responses to variations in semantic factors (e.g. intentions or personas) or non-semantic prompting changes (e.g. templates or orderings). Finally, we highlight potential strategies aimed at enhancing the alignment of LLM behavior with well-established fairness concepts.
zh
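论文考察的公平性概念可以用一个小型分配实例具体化:给定可加性估值与一个不可分物品分配,检查无嫉妒性(envy-freeness)并计算罗尔斯最大最小(maximin)福利。估值与分配均为本文虚构的示例数据:

```python
# 公平性概念的最小实例:可加估值下检查 envy-freeness 与 maximin 福利。
values = {  # agent -> 各物品的估值(虚构)
    "a": {"g1": 5, "g2": 1, "g3": 2},
    "b": {"g1": 2, "g2": 4, "g3": 3},
}
alloc = {"a": {"g1"}, "b": {"g2", "g3"}}

def bundle_value(agent: str, bundle) -> int:
    return sum(values[agent][g] for g in bundle)

def is_envy_free(alloc) -> bool:
    # 每个 agent 对自己份额的估值不低于对任何他人份额的估值
    return all(
        bundle_value(i, alloc[i]) >= bundle_value(i, alloc[j])
        for i in alloc for j in alloc
    )

def maximin_welfare(alloc) -> int:
    # 罗尔斯最大最小:最差者的福利
    return min(bundle_value(i, alloc[i]) for i in alloc)
```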
[NLP-128] SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition
【速读】: 该论文旨在解决语音情感识别(Speech Emotion Recognition, SER)中的系统复杂性、特征区分度不足以及噪声干扰等问题。关键解决方案在于提出了一种新的端到端(End-to-End, E2E)深度学习多分辨率框架,通过快速离散小波变换(Fast Discrete Wavelet Transform, FDWT)的特性,包括级联算法、共轭四边形滤波器和系数去噪,引入了可学习的小波基和去噪模型。该框架利用激活函数实现可学习的非对称硬阈值处理,并结合一维膨胀卷积神经网络(1D dilated Convolutional Neural Networks, 1D dilated CNN)、空间注意力层以及双向门控循环单元(Bidirectional Gated Recurrent Units, Bi-GRU)与时间注意力层,有效捕捉情感特征的空间和时间特性。该方法无需分割变长语音信号,且不需要预处理或后处理步骤。
链接: https://arxiv.org/abs/2502.00310
作者: Alaa Nfissi,Wassim Bouachir,Nizar Bouguila,Brian Mishara
机构: Data Science Laboratory, University of Québec (TÉLUQ) (魁北克大学数据科学实验室); Concordia Institute for Information Systems Engineering, Concordia University (康考迪亚大学信息系统工程学院); Psychology Department, University of Québec at Montréal (魁北克大学蒙特利尔分校心理学系); Centre for Research and Intervention on Suicide, Ethical Issues and End-of-Life Practices (自杀研究与干预中心、伦理问题及临终关怀实践中心)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Published in: IEEE Transactions on Affective Computing
点击查看摘要
Abstract:In the field of human-computer interaction and psychological assessment, speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals. Despite advancements, challenges persist due to system complexity, feature distinctiveness issues, and noise interference. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER, addressing these limitations by extracting meaningful representations directly from raw waveform speech signals. By leveraging the properties of the fast discrete wavelet transform (FDWT), including the cascade algorithm, conjugate quadrature filter, and coefficient denoising, our approach introduces a learnable model for both wavelet bases and denoising through deep learning techniques. The framework incorporates an activation function for learnable asymmetric hard thresholding of wavelet coefficients. Our approach exploits the capabilities of wavelets for effective localization in both time and frequency domains. We then combine one-dimensional dilated convolutional neural networks (1D dilated CNN) with a spatial attention layer and bidirectional gated recurrent units (Bi-GRU) with a temporal attention layer to efficiently capture the nuanced spatial and temporal characteristics of emotional features. By handling variable-length speech without segmentation and eliminating the need for pre or post-processing, the proposed model outperformed state-of-the-art methods on IEMOCAP and EMO-DB datasets. The source code of this paper is shared on the Github repository: this https URL.
zh
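上文提到的两个小波要素——离散小波变换与(非对称)阈值去噪——可用一层 Haar DWT 加软阈值的最小示例说明;论文中小波基与阈值均为可学习参数,此处用固定 Haar 基与固定阈值仅作示意:

```python
# 最小示意:一层 Haar DWT + 对细节系数做(可非对称的)软阈值去噪。
import math

def haar_dwt(signal):
    s = 1 / math.sqrt(2)
    approx = [(a + b) * s for a, b in zip(signal[::2], signal[1::2])]
    detail = [(a - b) * s for a, b in zip(signal[::2], signal[1::2])]
    return approx, detail

def soft_threshold(coeffs, t_pos, t_neg):
    # 非对称阈值:正负系数使用不同的截断值(论文中由网络学习)
    out = []
    for c in coeffs:
        t = t_pos if c >= 0 else t_neg
        out.append(math.copysign(max(abs(c) - t, 0.0), c))
    return out

approx, detail = haar_dwt([4.0, 2.0, 1.0, 1.0])
den = soft_threshold(detail, 0.5, 0.5)
```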
[NLP-129] Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation
【速读】: 该论文旨在解决在 Retrieval-Augmented Generation (RAG) 系统中,通过自然语言查询推断模型数据存储库中包含的文档成员身份的问题。论文的关键解决方案是提出了一种名为 Interrogation Attack (IA) 的成员推理技术,通过构造仅依赖于目标文档存在的自然文本查询,实现对文档成员身份的有效且隐蔽的推断,仅需30个查询即可成功执行,同时避免被现有检测方法轻易识别。这种方法在多种RAG配置下表现出比先前攻击方法更高的真阳性率(TPR@1%FPR),并且每次文档推理的成本低于0.02美元。
链接: https://arxiv.org/abs/2502.00306
作者: Ali Naseh,Yuefeng Peng,Anshuman Suri,Harsh Chaudhari,Alina Oprea,Amir Houmansadr
机构: University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校); Northeastern University(东北大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to generate grounded responses by leveraging external knowledge databases without altering model parameters. Although the absence of weight tuning prevents leakage via model parameters, it introduces the risk of inference adversaries exploiting retrieved documents in the model’s context. Existing methods for membership inference and data extraction often rely on jailbreaking or carefully crafted unnatural queries, which can be easily detected or thwarted with query rewriting techniques common in RAG systems. In this work, we present Interrogation Attack (IA), a membership inference technique targeting documents in the RAG datastore. By crafting natural-text queries that are answerable only with the target document’s presence, our approach demonstrates successful inference with just 30 queries while remaining stealthy; straightforward detectors identify adversarial prompts from existing methods up to ~76x more frequently than those generated by our attack. We observe a 2x improvement in TPR@1%FPR over prior inference attacks across diverse RAG configurations, all while costing less than $0.02 per document inference.
zh
[NLP-130] DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning TACL
【速读】: 该论文旨在解决冷启动主动学习(Cold-start Active Learning, CSAL)中因忽略弱类别和困难代表性样本而导致的学习偏差问题。关键解决方案在于提出了一种名为双重多样性增强和不确定性感知(Dual-Diversity Enhancing and Uncertainty-Aware, DEUCE)的框架。DEUCE通过利用预训练语言模型(PLM)高效提取文本表示、类别预测及预测不确定性,并构建双重邻域图(Dual-Neighbor Graph, DNG)来结合文本多样性和类别多样性信息,确保数据分布平衡。此外,它通过基于密度的聚类传播不确定性信息,以选择困难代表性实例,从而实现类别均衡和信息丰富的样本选择。
链接: https://arxiv.org/abs/2502.00305
作者: Jiaxin Guo,C. L. Philip Chen,Shuzhen Li,Tong Zhang
机构: Guangdong Provincial Key Laboratory of Computational AI Models and Cognitive Intelligence (广东省计算智能模型与认知重点实验室), School of Computer Science and Engineering (计算机科学与工程学院), South China University of Technology (华南理工大学), Guangzhou, China (中国广州);
Pazhou Lab (琶洲实验室), Guangzhou, China (中国广州);
Engineering Research Center of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human (教育部健康智能感知与平行数字人工程研究中心), Guangzhou, China (中国广州)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 18 pages, 3 figures, 12 tables. Accepted manuscript by TACL. For published version by MIT Press, see this https URL
点击查看摘要
Abstract:Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation. It provides high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL. Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. DEUCE performs well in selecting class-balanced and hard representative data by dual-diversity and informativeness. Experiments on six NLP datasets demonstrate the superiority and efficiency of DEUCE.
zh
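DEUCE"兼顾不确定性与类别平衡"的选样目标,可用如下简化草图示意(数据与轮询策略均为本文假设;论文实际通过双重邻域图与基于密度的聚类实现,此处仅把目标本身具体化):

```python
# 简化示意:按预测类别轮询、类内按不确定性降序挑选,
# 从而得到既"难"(高不确定性)又类别均衡的标注候选。
def select_balanced(instances, budget):
    # instances: (id, predicted_class, uncertainty)
    by_class = {}
    for inst in sorted(instances, key=lambda x: x[2], reverse=True):
        by_class.setdefault(inst[1], []).append(inst)
    picked, classes = [], sorted(by_class)
    while len(picked) < budget and any(by_class.values()):
        for c in classes:  # 对预测类别做轮询(round-robin)
            if by_class[c] and len(picked) < budget:
                picked.append(by_class[c].pop(0)[0])
    return picked

pool = [("x1", "pos", 0.9), ("x2", "pos", 0.8), ("x3", "neg", 0.7), ("x4", "pos", 0.6)]
chosen = select_balanced(pool, budget=2)
```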
[NLP-131] Contextual Morphogenesis in Large Language Models: A Novel Approach to Self-Organizing Token Representations
【速读】: 该论文旨在解决传统分词策略在处理语言模型时因固定分词边界而无法动态调整以适应不断变化的上下文关系的问题。解决方案的关键在于引入上下文形态发生机制(Contextual Morphogenesis),这一机制通过自我组织的方式基于学习到的上下文依赖关系重新构建分词边界,从而允许嵌入表示在迭代处理过程中逐步进化。这种方法不仅降低了困惑度(perplexity),还保持了表征稳定性,特别是在语言结构复杂的领域中表现出色。
链接: https://arxiv.org/abs/2502.00301
作者: Alistair Dombrowski,Beatrix Engelhardt,Dimitri Fairbrother,Henry Evidail
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Token representations influence the efficiency and adaptability of language models, yet conventional tokenization strategies impose rigid segmentation boundaries that do not adjust dynamically to evolving contextual relationships. The introduction of contextual morphogenesis establishes a self-organizing mechanism that restructures token boundaries based on learned contextual dependencies, allowing embeddings to evolve progressively across iterative processing steps. Empirical evaluations demonstrate that dynamically adjusted tokenization contributes to reductions in perplexity while maintaining representational stability, particularly in linguistically complex domains where static segmentation fails to capture nuanced dependencies. Computational trade-offs associated with self-organizing token structures indicate that additional processing overhead remains within feasible limits, provided that optimization strategies account for segmentation update efficiency. Comparative assessments across different linguistic corpora suggest that adaptive tokenization preserves interpretability while improving alignment with contextual cues, reinforcing the potential of morphogenetic segmentation mechanisms to refine predictive accuracy. Stability analyses confirm that evolving token structures maintain consistent segmentation behaviors across varied text distributions, ensuring that representational adaptations remain linguistically coherent. The effectiveness of contextual morphogenesis in refining structural stability and predictive performance highlights its viability as an alternative to traditional tokenization methods. Further analysis of computational efficiency considerations suggests that hybrid strategies integrating both static and dynamic segmentation techniques may offer a balanced approach to optimizing representational flexibility while maintaining inference efficiency.
zh
[NLP-132] ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
【速读】: 该论文旨在解决在使用大型语言模型(Large Language Models, LLMs)进行长上下文推理时内存成本高的问题。现有方法主要关注于压缩不同标记(tokens)的关键值(KV)缓存,但这些方法单独衡量标记的重要性,忽略了实际语言特性中不同标记之间的依赖关系。为了解决这一问题,论文提出ChunkKV方案,将标记分组为基本压缩单元,并保留最具信息量的语义片段,同时舍弃较不重要的部分。关键创新在于引入层间索引重用机制,以进一步减少计算开销。实验结果显示,ChunkKV在多种基准测试中实现了最高达10%的性能提升,尤其是在指令调优和多步推理(O1和R1)的LLMs中,与现有方法相比,在高压缩比下具有显著优势。
链接: https://arxiv.org/abs/2502.00299
作者: Xiang Liu,Zhenheng Tang,Peijie Dong,Zeyu Li,Bo Li,Xuming Hu,Xiaowen Chu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 35 pages
点击查看摘要
Abstract:To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that the previous KV cache compression methods measure token importance individually, neglecting the dependency between different tokens in the real-world language characteristics. In light of this, we introduce ChunkKV, grouping the tokens in a chunk as a basic compressing unit, and retaining the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluated ChunkKV on cutting-edge long-context benchmarks including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmark. Our experiments with instruction tuning and multi-step reasoning (O1 and R1) LLMs achieve up to 10% performance improvement under aggressive compression ratios compared to existing methods.
zh
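以 chunk 为基本压缩单元的选择逻辑可用如下草图示意:将逐 token 重要性分数按固定大小分块求和,仅保留得分最高的若干块对应的 token 索引(分数为虚构示例;实际系统通常由注意力统计量导出,此处只展示"按块保留"这一机制):

```python
# 示意:chunk 级 KV 缓存选择——按块汇总重要性,保留 top-k 块。
def select_chunks(scores, chunk_size, keep_chunks):
    chunks = [scores[i:i + chunk_size] for i in range(0, len(scores), chunk_size)]
    ranked = sorted(range(len(chunks)), key=lambda c: sum(chunks[c]), reverse=True)
    kept = sorted(ranked[:keep_chunks])
    # 将保留的块编号展开回 KV 缓存中的 token 索引
    return [i for c in kept
            for i in range(c * chunk_size, min((c + 1) * chunk_size, len(scores)))]

scores = [0.1, 0.2, 0.9, 0.8, 0.05, 0.05, 0.7, 0.6]  # 虚构的逐 token 重要性
kept_tokens = select_chunks(scores, chunk_size=2, keep_chunks=2)
```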
[NLP-133] Estimating LLM Uncertainty with Logits
【速读】: 该论文旨在解决大型语言模型(LLMs)在生成响应时容易出现幻觉的问题,即产生不可靠的回答。为应对这一挑战,论文提出了一种名为Logits-induced Token Uncertainty (LogU)的新框架,该框架能够实时估计LLMs中特定标记的不确定性,而无需多次采样。LogU的关键在于利用证据建模来实现标记级别不确定性的评估,从而指导下游任务。实验结果表明,LogU在减轻模型幻觉方面具有显著效果和潜力,标志着在解决模型幻觉问题上的重要进展。
链接: https://arxiv.org/abs/2502.00290
作者: Huan Ma,Jingdong Chen,Guangyu Wang,Changqing Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In recent years, Large Language Models (LLMs) have seen remarkable advancements and have been extensively integrated across various fields. Despite their progress, LLMs are prone to hallucinations, producing responses that may not be dependable if the models lack sufficient grounding knowledge. To mitigate this issue, methods for estimating uncertainty have been adopted, with a focus on critical tokens as indicators of reliability. Nevertheless, probability-based approaches have shown limitations in assessing token-level reliability due to the erosion of evidence strength information acquired during training. In this paper, we introduce Logits-induced Token Uncertainty (LogU), a novel framework designed to estimate token-specific uncertainty in LLMs in real time, without the need for multiple sampling rounds. By leveraging evidence modeling for the implementation of LogU, we utilize the derived uncertainty measures to steer downstream tasks. Our experimental findings highlight the substantial effectiveness and potential of LogU, marking a significant advancement in addressing the challenge of model hallucinations.
zh
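基于 logits 的证据式不确定性可以用一种常见的证据建模公式示意(注意:这只是证据深度学习中的一种标准写法,未必是 LogU 的精确估计量):将 exp(logit) 视为各类别证据,令 u = K / (K + 总证据),logits 平坦时不确定性高、单一 logit 主导时不确定性低,且只需一次前向计算、无需多次采样:

```python
# 证据式不确定性草图(标准 Dirichlet 证据写法,非 LogU 原公式)。
import math

def evidential_uncertainty(logits) -> float:
    k = len(logits)
    evidence = [math.exp(min(z, 30.0)) for z in logits]  # 截断防溢出
    return k / (k + sum(evidence))

confident = evidential_uncertainty([10.0, 0.0, 0.0])  # 单一 logit 主导
uncertain = evidential_uncertainty([0.0, 0.0, 0.0])   # 平坦分布
```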
[NLP-134] Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning
【速读】: 该论文旨在解决大规模语言模型(LLMs)在多步推理任务中的性能限制问题。尽管验证器引导搜索(Verifier-guided search)在有限样本情况下优于重复采样(repeated sampling),但随着样本量增加,其优势逐渐减弱并最终表现不如重复采样。论文指出,这一现象主要归因于验证器(verifiers)的失效,即不完美的验证器错误地排序候选路径并剪枝所有有效的推理路径。为了缓解验证器失效的问题,作者探索减少对验证器的依赖,并通过两种简单方法进行了初步研究。论文的关键在于揭示了验证器引导搜索的根本局限性,并提出了未来的研究方向。
链接: https://arxiv.org/abs/2502.00271
作者: Fei Yu,Yingru Li,Benyou Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) struggle with multi-step reasoning, where inference-time scaling has emerged as a promising strategy for performance improvement. Verifier-guided search outperforms repeated sampling when sample size is limited by selecting and prioritizing valid reasoning paths. However, we identify a critical limitation: scaling flaws, prevalent across different models (Mistral 7B and DeepSeekMath 7B), benchmarks (GSM8K and MATH), and verifiers (outcome value models and process reward models). As sample size increases, verifier-guided search exhibits diminishing advantages and eventually underperforms repeated sampling. Our analysis attributes this to verifier failures, where imperfect verifiers misrank candidates and erroneously prune all valid paths. These issues are further exacerbated in challenging and out-of-distribution problems, restricting search effectiveness. To mitigate verifier failures, we explore reducing reliance on verifiers and conduct preliminary investigations using two simple methods. Our findings reveal fundamental limitations in verifier-guided search and suggest future directions.
zh
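结合上文结论,下面用一段极简的 Python 模拟示意摘要所述的 misrank 机制:当验证器不完美(对错误候选偶尔打出虚高分)时,top-1 的验证器引导选择可能被单个高分错误候选误导,而重复采样加多数投票不受单个异常分数影响。候选答案与分数分布均为假设数据,并非论文的实验设置。

```python
import random

def verifier_select(candidates, scores):
    """验证器引导:返回验证器分数最高的候选(top-1)。"""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]

def majority_vote(candidates):
    """重复采样基线:对候选答案做多数投票。"""
    counts = {}
    for c in candidates:
        counts[c] = counts.get(c, 0) + 1
    return max(counts, key=counts.get)

random.seed(0)
# 假设数据:正确答案 "42" 占 10/16;不完美验证器对正确路径给
# 稳定中高分,对错误路径给高方差分数,偶尔打出虚高分(misrank)。
candidates = ["42"] * 10 + ["41"] * 6
scores = ([0.6 + random.gauss(0, 0.1) for _ in range(10)] +
          [0.5 + random.gauss(0, 0.3) for _ in range(6)])

picked_by_verifier = verifier_select(candidates, scores)
picked_by_vote = majority_vote(candidates)
```

多数投票随样本量增大收敛到众数,而 top-1 选择的出错概率随错误候选数量增多而上升,这正是摘要所述验证器引导搜索的"scaling flaws"。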
[NLP-135] ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLM s
【速读】: 该论文旨在解决大型语言模型(LLMs)在自然语言处理任务中表现卓越但因规模庞大导致服务效率低下和成本高昂的问题。解决方案的关键在于提出了一种名为ProxSparse的学习型框架,用于通过正则化优化实现掩码选择。ProxSparse将刚性的、不可微的掩码选择过程转化为一个平滑的优化过程,允许灵活的渐进式掩码探索,并且在确定掩码后不再涉及额外的权重更新。这克服了现有半结构化剪枝方法仅依赖局部、逐层优化及启发式规则而未能充分利用全局反馈的局限性。
链接: https://arxiv.org/abs/2502.00258
作者: Hongyi Liu,Rajarshi Saha,Zhen Jia,Youngsuk Park,Jiaji Huang,Shoham Sabach,Yu-Xiang Wang,George Karypis
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated exceptional performance in natural language processing tasks, yet their massive size makes serving them inefficient and costly. Semi-structured pruning has emerged as an effective method for model acceleration, but existing approaches are suboptimal because they focus on local, layer-wise optimizations using heuristic rules, failing to leverage global feedback. We present ProxSparse, a learning-based framework for mask selection enabled by regularized optimization. ProxSparse transforms the rigid, non-differentiable mask selection process into a smoother optimization procedure, allowing gradual mask exploration with flexibility. ProxSparse does not involve additional weight updates once the mask is determined. Our extensive evaluations on 7 widely used models show that ProxSparse consistently outperforms previously proposed semi-structured mask selection methods with significant improvement, demonstrating the effectiveness of our learned approach towards semi-structured pruning.
zh
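作为背景,下面示意论文所要改进的局部幅值启发式基线:2:4 半结构化剪枝要求每 4 个连续权重中恰好保留 2 个。ProxSparse 本身是通过正则化优化来学习掩码,此处的按组取幅值 top-2 仅为说明掩码形式的假设性示意。

```python
def mask_2_of_4(weights):
    """2:4 半结构化剪枝的局部幅值启发式:每 4 个权重为一组,
    保留绝对值最大的 2 个(掩码置 1),其余置 0。"""
    assert len(weights) % 4 == 0
    mask = [0] * len(weights)
    for g in range(0, len(weights), 4):
        group = list(range(g, g + 4))
        # 组内按 |w| 降序,取前两个位置
        keep = sorted(group, key=lambda i: abs(weights[i]), reverse=True)[:2]
        for i in keep:
            mask[i] = 1
    return mask

w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
m = mask_2_of_4(w)
pruned = [wi * mi for wi, mi in zip(w, m)]
```

这种逐组局部决策正是摘要所指"仅依赖局部、逐层优化及启发式规则"的做法;ProxSparse 用可微的正则化目标替代它,以引入全局反馈。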
[NLP-136] Context-Preserving Tensorial Reconfiguration in Large Language Model Training
【速读】: 该论文旨在解决神经架构在处理长距离依赖时因计算限制和低效上下文保留机制所面临的核心挑战。解决方案的关键在于引入了一种名为Context-Preserving Tensorial Reconfiguration (CPTR)的新方法,通过结构化分解和自适应收缩实现权重张量的动态重组,从而增强上下文整合,同时不增加显著的计算负担。
链接: https://arxiv.org/abs/2502.00246
作者: Larin Tonix,Morgana Baskerville,Nathaniel Stourton,Ophelia Tattershall
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Handling long-range dependencies in neural architectures has remained a persistent challenge due to computational limitations and inefficient contextual retention mechanisms. Tensorial operations have provided a foundation for restructuring model representations, yet conventional architectures have struggled to incorporate such techniques without introducing excessive complexity. A novel approach, Context-Preserving Tensorial Reconfiguration (CPTR), enables dynamic reorganization of weight tensors through structured factorization and adaptive contraction, allowing for enhanced contextual integration without substantial computational overhead. Empirical evaluations demonstrate that CPTR improves coherence retention across extended sequences, leading to measurable reductions in perplexity and improved recall accuracy for long-context tasks. Performance comparisons reveal that CPTR-enhanced models exhibit greater computational efficiency and reduced memory consumption while maintaining competitive language generation fluency and accuracy. Gradient stability metrics further validate the improved training efficiency, revealing more controlled variance in weight updates. Comparative studies across baseline and CPTR-enhanced models confirm that tensorial reconfiguration contributes to more stable and computationally efficient language modeling. The findings support the potential of CPTR in refining contemporary neural architectures for tasks requiring long-range contextual understanding and efficient memory utilization.
zh
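摘要中的"结构化分解"可以借最常见的低秩分解特例来直观理解:把权重矩阵 W 近似为两个小因子 A、B 的乘积,运算时先乘 B 再乘 A,从而在不显式构造 W 的情况下降低参数量。以下纯 Python 片段为假设性示意,与论文的具体张量重构方式无关。

```python
def matvec(M, v):
    """朴素矩阵-向量乘法。"""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def factored_apply(A, B, v):
    """低秩分解 W ≈ A·B 的应用:先用 B 把输入压到秩 r 子空间,
    再用 A 映回输出维度,避免显式构造完整的 W。"""
    return matvec(A, matvec(B, v))

# 参数量对比:d_out x d_in 的满秩矩阵 vs 秩 r 的两因子
d_in, d_out, r = 4096, 4096, 64
full_params = d_out * d_in                 # 16,777,216
factored_params = r * (d_out + d_in)       # 524,288

# 小例子:A (2x2) 与 B (2x3) 合成等效的 W (2x3)
A = [[2.0, 0.0], [0.0, 3.0]]
B = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
y = factored_apply(A, B, [1.0, 1.0, 1.0])
```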
[NLP-137] Mordal: Automated Pretrained Model Selection for Vision Language Models
【速读】: 该论文旨在解决自动化创建针对特定任务的视觉语言模型(Vision Language Models, VLMs)的问题。目前,尽管已有多种VLM在不同基准测试中展示了出色的视觉能力,但这些模型均是由人类专家手工设计的,缺乏自动化的框架来生成任务专用的多模态模型。论文的关键解决方案是引入Mordal,一个自动化多模态模型搜索框架,通过减少搜索过程中需要考虑的候选模型数量以及缩短每个剩余候选模型的评估时间,高效地找到最适合用户定义任务的VLM,相比网格搜索,Mordal可降低高达8.9到11.6倍的GPU小时数。此外,在评估过程中,还发现了性能超越现有最先进水平的新VLM。
链接: https://arxiv.org/abs/2502.00241
作者: Shiqi He,Insu Jang,Mosharaf Chowdhury
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using up to 8.9\times – 11.6\times lower GPU hours than grid search. In the process of our evaluation, we have also discovered new VLMs that outperform their state-of-the-art counterparts.
zh
[NLP-138] Should You Use Your Large Language Model to Explore or Exploit?
【速读】: 该论文旨在评估当前大型语言模型(Large Language Models, LLMs)在面对探索-利用权衡的决策任务中的有效性。研究通过在各种上下文多臂老虎机(Contextual Bandit Tasks)任务中让LLMs独立进行探索和利用来实现这一目标。研究的关键发现是,尽管LLMs在利用方面常常表现不佳,但可以通过上下文化的缓解措施显著提升其在小规模任务中的性能。然而,即使如此,LLMs的表现仍不如简单的线性回归模型。另一方面,研究还发现LLMs在处理具有内在语义的大规模动作空间的探索任务中表现出优势,能够建议合适的探索候选对象。因此,该研究的关键解决方案在于探索如何利用LLMs在大规模动作空间探索方面的潜力,并通过上下文化方法改善其在小规模任务中的利用能力。
链接: https://arxiv.org/abs/2502.00225
作者: Keegan Harris,Aleksandrs Slivkins
机构: Carnegie Mellon University (卡内基梅隆大学); Microsoft Research (微软研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff. We use LLMs to explore and exploit in silos in various (contextual) bandit tasks. We find that while the current LLMs often struggle to exploit, in-context mitigations may be used to substantially improve performance for small-scale tasks. However even then, LLMs perform worse than a simple linear regression. On the other hand, we find that LLMs do help at exploring large action spaces with inherent semantics, by suggesting suitable candidates to explore.
zh
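作为对照,探索-利用权衡本身可以用经典的 epsilon-贪心老虎机算法来体会:以小概率随机探索,否则利用当前估计均值最高的臂。以下为纯 Python 示意,臂的真实均值为假设数据,仅说明摘要中简单基线所处理的问题形式。

```python
import random

def epsilon_greedy(true_means, steps=2000, eps=0.1, seed=0):
    """epsilon-贪心多臂老虎机:以 eps 概率探索,否则利用。"""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    est = [0.0] * n          # 各臂的经验均值估计
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(n)                      # 探索
        else:
            a = max(range(n), key=lambda i: est[i])   # 利用
        r = true_means[a] + rng.gauss(0, 0.1)         # 带噪声的奖励
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]            # 增量更新均值
        total += r
    return est, total / steps

est, avg_reward = epsilon_greedy([0.2, 0.5, 0.8])
```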
[NLP-139] Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment
【速读】: 该论文旨在解决大型语言模型(LLM)对齐算法领域复杂且碎片化的问题,当前该领域对不同方法的有效性及其相互关系缺乏清晰理解。论文的关键解决方案是提出奖励感知偏好优化(Reward-Aware Preference Optimization, RPO)框架,该框架统一了包括DPO、IPO、SimPO和REINFORCE(LOO)在内的流行偏好优化技术。RPO提供了一种结构化的方法,用于解析和系统地研究各种设计选择(如优化目标、每个提示的响应数量以及隐式与显式奖励模型的使用)对LLM偏好优化的影响,并进一步提出了新的实验设置以清晰直接地消解这些设计选择的影响。通过在RPO框架内进行广泛的消融研究,论文揭示了影响模型对齐的关键因素,提供了改善LLM对齐的有效策略的实际指导。
链接: https://arxiv.org/abs/2502.00203
作者: Shengyang Sun,Yian Zhang,Alexander Bukharin,David Mosallanezhad,Jiaqi Zeng,Soumye Singhal,Gerald Shen,Adi Renduchintala,Tugrul Konuk,Yi Dong,Zhilin Wang,Dmitry Chichkov,Olivier Delalleau,Oleksii Kuchaiev
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 8 pages, 4 figures
点击查看摘要
Abstract:The rapid development of large language model (LLM) alignment algorithms has resulted in a complex and fragmented landscape, with limited clarity on the effectiveness of different methods and their inter-connections. This paper introduces Reward-Aware Preference Optimization (RPO), a mathematical framework that unifies popular preference optimization techniques in LLM alignment, including DPO, IPO, SimPO, and REINFORCE (LOO), among others. RPO provides a structured approach to disentangle and systematically study the impact of various design choices, such as the optimization objective, the number of responses per prompt, and the use of implicit versus explicit reward models, on LLM preference optimization. We additionally propose a new experimental setup that enables the clean and direct ablation of such design choices. Through an extensive series of ablation studies within the RPO framework, we gain insights into the critical factors shaping model alignment, offering practical guidance on the most effective strategies for improving LLM alignment.
zh
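RPO 所统一的方法中,DPO 的逐样本损失形式是公开已知的,可以写成如下示意:隐式奖励为策略与参考模型对数概率之差乘以 beta,损失鼓励被偏好回复 (w) 的隐式奖励高于被拒绝回复 (l)。下面的对数概率数值为假设输入,仅用于说明损失的行为。

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO 逐样本损失:-log sigmoid(beta * (隐式奖励差))。"""
    reward_w = beta * (logp_w - ref_logp_w)   # chosen 的隐式奖励
    reward_l = beta * (logp_l - ref_logp_l)   # rejected 的隐式奖励
    margin = reward_w - reward_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# 策略相对参考模型更偏好 w 时,损失应低于无偏好时的 log 2
loss_aligned = dpo_loss(logp_w=-1.0, logp_l=-3.0,
                        ref_logp_w=-2.0, ref_logp_l=-2.0)
loss_neutral = dpo_loss(logp_w=-2.0, logp_l=-2.0,
                        ref_logp_w=-2.0, ref_logp_l=-2.0)
```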
[NLP-140] Fairshare Data Pricing for Large Language Models
【速读】: 该论文旨在解决数据市场中不公平定价导致的数据买家(如大型语言模型 LLM 的构建者)和卖家(如人类标注员)参与度降低的问题,这会减少数据的数量和质量。论文的关键解决方案是提出了一种公平份额定价框架(Fairshare Pricing Framework),该框架利用数据估值方法来量化训练数据对 LLM 的贡献,并据此设定价格。通过该框架,买家依据数据估值做出购买决策,而卖家则基于预期买家购买量最大化其利润。此框架理论证明了定价与数据估值及买家预算紧密相关,对买卖双方都是最优的。通过使用当前 LLM 和数据集(包括数学问题、医学诊断和物理推理)进行市场模拟,验证了该框架能够确保买家以反映模型训练价值的方式购买数据,从而提高每美元投入数据所带来的 LLM 任务性能,并确保卖家以最优价格出售数据。
链接: https://arxiv.org/abs/2502.00198
作者: Luyang Zhang,Cathy Jiao,Beibei Li,Chenyan Xiong
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Training data is a pivotal resource for building large language models (LLMs), but unfair pricing in data markets poses a serious challenge for both data buyers (e.g., LLM builders) and sellers (e.g., human annotators), which discourages market participation, reducing data quantity and quality. In this paper, we propose a fairshare pricing framework that sets training data prices using data valuation methods to quantify their contribution to LLMs. In our framework, buyers make purchasing decisions using data valuation and sellers set prices to maximize their profits based on the anticipated buyer purchases. We theoretically show that pricing derived from our framework is tightly linked to data valuation and buyers’ budget, optimal for both buyers and sellers. Through market simulations using current LLMs and datasets (math problems, medical diagnosis, and physical reasoning), we show that our framework is fairshare for buyers by ensuring their purchased data is reflective of model training value, leading to higher LLM task performances per-dollar spent on data, and fairshare for sellers by ensuring they sell their data at optimal prices. Our framework lays the foundation for future research on equitable and sustainable data markets for large-scale AI.
zh
[NLP-141] DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets
【速读】: 该论文旨在解决在皮肤科领域开发视觉大语言模型(Vision LLMs)所面临的大型图像-文本配对数据集缺乏的问题。解决方案的关键在于引入DermaSynth数据集,该数据集包含92,020个合成的图像-文本对,源自45,205张临床和皮肤镜图像,并通过先进的大语言模型(LLMs),使用Gemini 2.0和自指导方法生成多样且丰富的合成文本。通过将数据集的元数据纳入输入提示,以减少潜在的幻觉现象,从而构建出基于开放访问皮肤科图像存储库的高质量数据集。此外,还初步微调了一个名为DermatoLlama 1.0的模型。
链接: https://arxiv.org/abs/2502.00196
作者: Abdurrahim Yilmaz,Furkan Yuceyalcin,Ece Gokyayla,Donghee Choi,Ozan Erdem Ali Anil Demircali,Rahmetullah Varol,Ufuk Gorkem Kirabali,Gulsum Gencoglan,Joram M. Posma,Burak Temelkuran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 4 figures
点击查看摘要
Abstract:A major barrier to developing vision large language models (LLMs) in dermatology is the lack of large image–text pairs dataset. We introduce DermaSynth, a dataset comprising of 92,020 synthetic image–text pairs curated from 45,205 images (13,568 clinical and 35,561 dermatoscopic) for dermatology-related clinical tasks. Leveraging state-of-the-art LLMs, using Gemini 2.0, we used clinically related prompts and self-instruct method to generate diverse and rich synthetic texts. Metadata of the datasets were incorporated into the input prompts by targeting to reduce potential hallucinations. The resulting dataset builds upon open access dermatological image repositories (DERM12345, BCN20000, PAD-UFES-20, SCIN, and HIBA) that have permissive CC-BY-4.0 licenses. We also fine-tuned a preliminary Llama-3.2-11B-Vision-Instruct model, DermatoLlama 1.0, on 5,000 samples. We anticipate this dataset to support and accelerate AI research in dermatology. Data and code underlying this work are accessible at this https URL.
zh
[NLP-142] Resolving Editing-Unlearning Conflicts: A Knowledge Codebook Framework for Large Language Model Updating
【速读】: 该论文旨在解决大型语言模型(LLMs)在更新过程中存在的两个主要问题:知识存储的有效性不足(包括过于稀疏或过于密集)以及编辑与遗忘任务之间的冲突。论文提出的关键解决方案是LOKA框架,它基于知识代码本,通过多记忆代码本存储更新的知识,并利用相似度感知的知识映射确保相关知识片段被聚类到同一内存中。此外,LOKA通过任务特定和多任务记忆,以及由冲突评分引导的方法来解决任务冲突。在推理阶段,LOKA从代码本中检索最相关的记忆并将其插入原始LLM以应用更新的知识,从而提高知识利用率。
链接: https://arxiv.org/abs/2502.00158
作者: Binchi Zhang,Zhengzhang Chen,Zaiyi Zheng,Jundong Li,Haifeng Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) excel in natural language processing by encoding extensive human knowledge, but their utility relies on timely updates as knowledge evolves. Updating LLMs involves two key tasks simultaneously: unlearning to remove unwanted knowledge and editing to incorporate new information. Existing methods face two major challenges: ineffective knowledge storage (either too sparse or too dense) and task conflicts between editing and unlearning, as validated through our theoretical and experimental results. To address these issues, we propose LOKA, a conflict-free framework for LLM updating based on a knowledge codebook. During training, updated knowledge is stored in multiple codebook memories. To optimize knowledge storage, a similarity-aware knowledge mapping ensures that related knowledge pieces are clustered and allocated to the same memory. Additionally, LOKA resolves task conflicts by employing task-specific and multi-task memories guided by a conflict score. In the inference stage, LOKA retrieves the most relevant memory from the codebook and plugs it into the original LLM to apply the updated knowledge. A learning-based router controls codebook activation to further improve knowledge utilization. Extensive experiments demonstrate the effectiveness of LOKA in LLM knowledge updating tasks.
zh
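推理阶段"从代码本中检索最相关记忆"的思想,可以用余弦相似度检索来示意。下面的记忆名称、键向量与查询向量均为假设数据,仅说明检索机制,并非 LOKA 的实现细节。

```python
import math

def cosine(a, b):
    """两向量的余弦相似度。"""
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a)) *
           math.sqrt(sum(y * y for y in b)))
    return num / den

def retrieve_memory(query, codebook):
    """按与查询表示的余弦相似度,检索代码本中最相关的记忆。"""
    return max(codebook, key=lambda name: cosine(query, codebook[name]))

codebook = {
    "edit_mem":    [1.0, 0.1, 0.0],   # 存放编辑(新知识)的记忆
    "unlearn_mem": [0.0, 0.2, 1.0],   # 存放遗忘任务的记忆
}
query = [0.9, 0.0, 0.1]
best = retrieve_memory(query, codebook)
```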
[NLP-143] A Three-Branch Checks-and-Balances Framework for Context-Aware Ethical Alignment of Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在伦理对齐方面的局限性,特别是通过强化学习基于人类反馈(Reinforcement Learning from Human Feedback, RLHF)方法所存在的问题。论文的关键在于提出一个三分支制衡框架,包含知识生成(LLMs作为执行机构)、伦理规范设定(DIKE作为立法机构)以及情境解读(ERIS作为司法机构)。这一架构通过可解释、可适应且文化敏感的伦理推理机制,解决了现有方法的不足。
链接: https://arxiv.org/abs/2502.00136
作者: Edward Y. Chang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 tables, 6 figures. arXiv admin note: substantial text overlap with arXiv:2405.07076
点击查看摘要
Abstract:This paper introduces a three-branch checks-and-balances framework for ethical alignment of Large Language Models (LLMs), inspired by governmental systems. It implements three independent yet interacting components: LLMs as the executive branch for knowledge generation, DIKE as the legislative branch establishing ethical guardrails, and ERIS as the judicial branch for contextual interpretation. The adversarial DIKE-ERIS duality enables adaptation to diverse cultural contexts while upholding consistent ethical principles. This architecture addresses limitations of reinforcement learning with human feedback (RLHF) by providing interpretable, adaptable, and culturally-aware ethical reasoning. Through self-supervised learning and adversarial testing, our framework demonstrates how emotional modeling can guide linguistic behaviors toward ethical outcomes while preserving independence across knowledge generation, ethical oversight, and contextual interpretation.
zh
[NLP-144] Sparse Autoencoder Insights on Voice Embeddings
【速读】: 该论文旨在探索稀疏自编码器在从密集编码嵌入中提取单义特征方面的有效性,尤其关注非文本嵌入数据。关键解决方案在于应用稀疏自编码器于源自Titanet模型的说话者嵌入(Speaker Embeddings),从而成功识别并操纵如语言和音乐等在原始嵌入中不明显的特征。实验结果表明,所提取的特征与大型语言模型 (LLM) 嵌入中的特征相似,包括特征分割和调节。这表明稀疏自编码器可以成为理解与解释多个领域(包括基于音频的说话者识别)中嵌入数据的重要工具。
链接: https://arxiv.org/abs/2502.00127
作者: Daniel Pluth,Yu Zhou,Vijay K. Gurbani
机构: Vail Systems, Inc. (维尔斯系统公司)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advances in explainable machine learning have highlighted the potential of sparse autoencoders in uncovering mono-semantic features in densely encoded embeddings. While most research has focused on Large Language Model (LLM) embeddings, the applicability of this technique to other domains remains largely unexplored. This study applies sparse autoencoders to speaker embeddings generated from a Titanet model, demonstrating the effectiveness of this technique in extracting mono-semantic features from non-textual embedded data. The results show that the extracted features exhibit characteristics similar to those found in LLM embeddings, including feature splitting and steering. The analysis reveals that the autoencoder can identify and manipulate features such as language and music, which are not evident in the original embedding. The findings suggest that sparse autoencoders can be a valuable tool for understanding and interpreting embedded data in many domains, including audio-based speaker recognition.
zh
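稀疏自编码器提取单义特征的基本机制(过完备 ReLU 编码、线性解码、L1 稀疏惩罚)可以用如下前向计算示意。权重为手工设定的假设值,并非论文在 Titanet 说话者嵌入上训练得到的模型。

```python
def relu(x):
    return [max(0.0, v) for v in x]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sae_forward(x, W_enc, W_dec, l1=0.01):
    """稀疏自编码器前向:过完备 ReLU 编码 + 线性解码,
    损失 = 重建误差 + L1 稀疏惩罚,促使隐藏单元单义化。"""
    h = relu(matvec(W_enc, x))              # 稀疏特征激活
    x_hat = matvec(W_dec, h)                # 重建
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in h)
    return h, recon + l1 * sparsity

# 2 维"嵌入" -> 4 个过完备特征方向
W_enc = [[1, 0], [0, 1], [-1, 0], [0, -1]]
W_dec = [[1, 0, -1, 0], [0, 1, 0, -1]]
h, loss = sae_forward([0.8, -0.3], W_enc, W_dec)
```

此例中输入被完美重建(重建误差为 0),损失只剩 L1 项;4 个隐藏单元里只有 2 个激活,体现了"少数特征解释一个嵌入"的稀疏性目标。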
[NLP-145] AIN: The Arabic INclusive Large Multimodal Model ACL
【速读】: 该论文旨在解决阿拉伯语大型多模态模型(Arabic LMMs)研究不足的问题。解决方案的关键在于引入AIN(阿拉伯包容性多模态模型),这是一个双语(英语-阿拉伯语)的大型多模态模型,利用精心构建的360万高质量英阿多模态数据样本进行训练。AIN展示了在阿拉伯语处理方面的最先进性能,并且具备强大的英语视觉理解能力:在涵盖多图像理解、复杂视觉感知、手写文档理解、视频理解、医学影像分析、植物病害识别以及基于遥感的土地使用理解等38个子领域的CAMEL-Bench基准上,其7B模型在八个领域上以平均3.4%的绝对增益超越GPT-4o。AIN的卓越能力使其成为向阿拉伯语用户提供先进多模态生成式AI工具的重要进展。
链接: https://arxiv.org/abs/2502.00094
作者: Ahmed Heakl,Sara Ghaboura,Omkar Thawkar,Fahad Shahbaz Khan,Hisham Cholakkal,Rao Muhammad Anwer,Salman Khan
机构: Mohamed bin Zayed University of AI(穆罕默德·本·扎耶德人工智能大学); Linköping University(林雪平大学); Aalto University(阿尔托大学); Australian National University(澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 20 pages, 16 figures, ACL
点击查看摘要
Abstract:Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN-the Arabic Inclusive Multimodal Model-designed to excel across diverse domains. AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic, leveraging carefully constructed 3.6 million high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark comprising 38 sub-domains including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, our AIN demonstrates strong performance with the 7B model outperforming GPT-4o by an absolute gain of 3.4% averaged over eight domains and 38 sub-domains. AIN’s superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.
zh
[NLP-146] Disambiguating Numeral Sequences to Decipher Ancient Accounting Corpora
【速读】: 该论文旨在解决古代半释读的楔形文字——原始埃兰语(Proto-Elamite, PE)书写系统中数值记录的歧义问题。论文的关键在于提出了一种算法来提取每种子数值表示的可能读法列表,并贡献了两种基于文档结构特性的消歧方法以及通过自助法(bootstrapping algorithm)训练的分类器。此外,论文还提供了一个测试集用于评估消歧技术,并提出了一种新颖的谨慎规则选择方法以优化自助法分类器。这些方法有助于确认关于该书写系统的已有直觉,并揭示了泥板内容与数值大小之间的新关联。
链接: https://arxiv.org/abs/2502.00090
作者: Logan Born,M. Willis Monroe,Kathryn Kelley,Anoop Sarkar
机构: Simon Fraser University(西蒙弗雷泽大学); University of British Columbia(不列颠哥伦比亚大学); Università di Bologna(博洛尼亚大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:A numeration system encodes abstract numeric quantities as concrete strings of written characters. The numeration systems used by modern scripts tend to be precise and unambiguous, but this was not so for the ancient and partially-deciphered proto-Elamite (PE) script, where written numerals can have up to four distinct readings depending on the system that is used to read them. We consider the task of disambiguating between these readings in order to determine the values of the numeric quantities recorded in this corpus. We algorithmically extract a list of possible readings for each PE numeral notation, and contribute two disambiguation techniques based on structural properties of the original documents and classifiers learned with the bootstrapping algorithm. We also contribute a test set for evaluating disambiguation techniques, as well as a novel approach to cautious rule selection for bootstrapped classifiers. Our analysis confirms existing intuitions about this script and reveals previously-unknown correlations between tablet content and numeral magnitude. This work is crucial to understanding and deciphering PE, as the corpus is heavily accounting-focused and contains many more numeric tokens than tokens of text.
zh
[NLP-147] Ensembles of Low-Rank Expert Adapters ICLR2025
【速读】: 该论文旨在解决大型语言模型(LLMs)在多源异构数据训练和微调过程中因梯度方向冲突导致的优化困难和性能下降问题,进而影响模型在不同任务中的泛化能力。关键解决方案在于提出了一种名为Ensembles of Low-Rank Expert Adapters (ELREA) 的框架,通过基于梯度方向对训练指令进行聚类,减少优化过程中的冲突,并利用低秩适应(LoRA)技术训练专家适配器,确保高效且可扩展的训练。在推理阶段,ELREA 根据输入数据与训练聚类的梯度相似性,选择最相关的专家适配器进行预测,从而实现每个任务的最佳适配器选择。
链接: https://arxiv.org/abs/2502.00089
作者: Yinghao Li,Vianne Gao,Chao Zhang,MohamadAli Torkamani
机构: Amazon Web Service(亚马逊网络服务); Amazon.com(亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, 5 figures, 5 tables; proceedings in ICLR 2025
点击查看摘要
Abstract:The training and fine-tuning of large language models (LLMs) often involve diverse textual data from multiple sources, which poses challenges due to conflicting gradient directions, hindering optimization and specialization. These challenges can undermine model generalization across tasks, resulting in reduced downstream performance. Recent research suggests that fine-tuning LLMs on carefully selected, task-specific subsets of data can match or even surpass the performance of using the entire dataset. Building on these insights, we propose the Ensembles of Low-Rank Expert Adapters (ELREA) framework to improve the model’s capability to handle diverse tasks. ELREA clusters the training instructions based on their gradient directions, representing different areas of expertise and thereby reducing conflicts during optimization. Expert adapters are then trained on these clusters, utilizing the low-rank adaptation (LoRA) technique to ensure training efficiency and model scalability. During inference, ELREA combines predictions from the most relevant expert adapters based on the input data’s gradient similarity to the training clusters, ensuring optimal adapter selection for each task. Experiments show that our method outperforms baseline LoRA adapters trained on the full dataset and other ensemble approaches with similar training and inference complexity across a range of domain-specific tasks.
zh
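推理时"按梯度相似度组合专家适配器"的思想可示意如下:对输入梯度与各训练聚类的代表梯度方向求余弦相似度,再经 softmax 归一化得到组合权重。聚类梯度与输入梯度均为假设数据,仅说明路由机制。

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def route_weights(input_grad, cluster_grads):
    """按输入梯度与各聚类代表梯度的余弦相似度,
    经 softmax 为各专家适配器分配组合权重。"""
    sims = [cosine(input_grad, g) for g in cluster_grads]
    exp = [math.exp(s) for s in sims]
    z = sum(exp)
    return [e / z for e in exp]

clusters = [[1.0, 0.0], [0.0, 1.0]]   # 两个专家聚类的代表梯度方向
w = route_weights([0.9, 0.1], clusters)
```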
[NLP-148] Efficient Beam Search for Large Language Models Using Trie-Based Decoding
【速读】: 该论文旨在解决Transformer-based序列到序列生成中批处理束搜索方法存在的高内存消耗问题。解决方案的关键在于引入了一种基于trie(前缀树)的并行解码方法,通过在共享相同前缀的所有束之间共用单一的键值(KV)缓存,不仅大幅减少了内存消耗,还实现了所有分支的并行解码。这一创新性地使用前缀树为束搜索提供了一个高效的替代方案,在保持推理速度的同时显著节省了内存,特别适用于内存受限环境或大规模模型部署。
链接: https://arxiv.org/abs/2502.00085
作者: Brian J Chan,Jui-Hung Cheng,Mao Xun Huang,Chao-Ting Chen,Hen-Hsen Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages
点击查看摘要
Abstract:In Transformer-based sequence-to-sequence generation, beam search has proven effective in enhancing the quality of generated sequences compared to greedy decoding. Conventional beam search methods typically adopt either a sequential or batch-based approach. The sequential approach, while memory-efficient, requires multiple decoding passes to construct a complete search tree, leading to significantly slower inference. On the other hand, the batch-based approach enables parallel computation across beams, but at the expense of high memory consumption due to the need to maintain separate key-value (KV) caches for each beam. In this study, we introduce a novel trie (prefix-tree)-based parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache among all beams that share the same prefix, the proposed method not only reduces memory consumption dramatically but also enables parallel decoding across all branches. This innovative use of a prefix tree offers an efficient alternative for beam search, achieving significant memory savings while preserving inference speed, making it particularly well-suited for memory-constrained environments or large-scale model deployments.
zh
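共享前缀为何能减少 KV 缓存,可以用前缀树的节点计数直观示意:树中节点数即需要缓存 KV 的位置数,而逐束独立缓存的条目数是各束序列长度之和。以下为与论文实现无关的假设性示意,token 用字符串代替。

```python
def trie_nodes(beams):
    """把各束的 token 序列插入前缀树,统计节点数(共享缓存的
    KV 条目数),并与逐束独立缓存的条目数对比。"""
    trie = {}
    nodes = 0
    for seq in beams:
        cur = trie
        for tok in seq:
            if tok not in cur:
                cur[tok] = {}
                nodes += 1          # 只有新前缀位置需要新的 KV 条目
            cur = cur[tok]
    per_beam = sum(len(seq) for seq in beams)
    return nodes, per_beam

beams = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]
shared, independent = trie_nodes(beams)
```

此例中三束共 9 个 token,但前缀树只有 6 个节点;束宽越大、共享前缀越长,节省越显著。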
[NLP-149] BTS: Harmonizing Specialized Experts into a Generalist LLM
【速读】: 该论文旨在解决如何高效且灵活地将多个独立训练的领域专家大型语言模型(Large Language Model, LLM)整合成一个具备广泛能力的通用模型。解决方案的关键在于Branch-Train-Stitch (BTS)算法,该算法通过插入轻量级的缝合层(stitch layers),在冻结的专家模型与初始种子语言模型之间实现融合,并仅需少量训练数据即可使种子模型在前向传播过程中集成来自多个专家模型的表示,从而实现在保持专家特定能力的同时,提升模型在新领域的泛化能力。
链接: https://arxiv.org/abs/2502.00075
作者: Qizhen Zhang,Prajjwal Bhargava,Chloe Bi,Chris X. Cai,Jakob Foerster,Jeremy Fu,Punit Singh Koura,Ruan Silva,Sheng Shen,Emily Dinan,Suchin Gururangan,Mike Lewis
机构: Oxford University (牛津大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present Branch-Train-Stitch (BTS), an efficient and flexible training algorithm for combining independently trained large language model (LLM) experts into a single, capable generalist model. Following Li et al., we start with a single seed language model which is branched into domain-specific (e.g., coding or math) experts with continual pretraining. BTS combines experts into a generalist model using lightweight stitch layers, which are inserted between frozen experts and the seed LLM, and trained on a small datamix of the expert domains. Stitch layers enable the seed LLM to integrate representations from any number of experts during the forward pass, allowing it to generalize to new domains, despite remaining frozen. Because BTS does not alter the constituent LLMs, BTS provides a modular and flexible approach: experts can be easily removed and new experts can be added with only a small amount of training. Compared to alternative model merging approaches, BTS yields the best generalist performance on a variety of downstream tasks, retaining the specialized capabilities of each of the experts.
zh
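缝合层的基本思想可示意为:把各冻结专家在该层的隐藏表示经门控加权后并入种子模型的隐藏状态,种子模型与专家本身均保持不变。下面的门控权重与隐藏向量均为假设数据,仅说明表示融合的形式。

```python
def stitch_layer(seed_hidden, expert_hiddens, gates):
    """缝合层示意:种子模型隐藏状态 + 各专家隐藏表示的门控加权和。"""
    out = list(seed_hidden)
    for gate, h in zip(gates, expert_hiddens):
        for i, v in enumerate(h):
            out[i] += gate * v
    return out

seed = [0.5, -0.2]
experts = [[1.0, 0.0], [0.0, 1.0]]   # 例如:代码专家与数学专家的表示
mixed = stitch_layer(seed, experts, gates=[0.3, 0.1])
```

训练时只需更新门控等缝合参数,因此如摘要所述,增删专家只涉及少量训练。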
[NLP-150] LLM Cyber Evaluations Don’t Capture Real-World Risk
【速读】: 该论文旨在解决评估大型语言模型(Large Language Models, LLMs)在网络安全应用中的风险与实际影响不匹配的问题。论文的关键解决方案在于提出一个综合的风险评估框架,该框架不仅考虑模型的能力,还纳入了对威胁行为者采用行为及其潜在影响的分析。通过这一框架,论文评估了一种具体用例——即用于网络安全助手的LLMs,并发现其合规率高但准确性一般,且整体风险较低,因为其操作优势和影响潜力有限。基于这些发现,论文建议加强学术界与产业界的协作,更真实地模拟攻击者行为,并在评估中加入经济指标,以更好地对齐研究重点与实际影响评估。
链接: https://arxiv.org/abs/2502.00072
作者: Kamilė Lukošiūtė,Adam Swanda
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages
点击查看摘要
Abstract:Large language models (LLMs) are demonstrating increasing prowess in cybersecurity applications, creating inherent risks alongside their potential for strengthening defenses. In this position paper, we argue that current efforts to evaluate risks posed by these capabilities are misaligned with the goal of understanding real-world impact. Evaluating LLM cybersecurity risk requires more than just measuring model capabilities – it demands a comprehensive risk assessment that incorporates analysis of threat actor adoption behavior and potential for impact. We propose a risk assessment framework for LLM cyber capabilities and apply it to a case study of language models used as cybersecurity assistants. Our evaluation of frontier models reveals high compliance rates but moderate accuracy on realistic cyber assistance tasks. However, our framework suggests that this particular use case presents only moderate risk due to limited operational advantages and impact potential. Based on these findings, we recommend several improvements to align research priorities with real-world impact assessment, including closer academia-industry collaboration, more realistic modeling of attacker behavior, and inclusion of economic metrics in evaluations. This work represents an important step toward more effective assessment and mitigation of LLM-enabled cybersecurity risks.
zh
[NLP-151] A Multi-Layered Large Language Model Framework for Disease Prediction
【速读】: 该论文旨在解决通过社交媒体和在线健康平台收集的大量阿拉伯语医学文本在疾病分类和症状严重性评估中的处理与应用问题。关键解决方案在于采用先进的阿拉伯语医学文本预处理技术,包括文本摘要、文本精炼以及命名实体识别(NER),并结合CAMeL-BERT模型进行优化。研究发现,使用CAMeL-BERT结合NER增强的文本能够显著提升疾病类型分类(83%)和症状严重性评估(69%)的性能。
链接: https://arxiv.org/abs/2502.00063
作者: Malak Mohamed,Rokaia Emad,Ali Hamdi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Social telehealth has revolutionized healthcare by enabling patients to share symptoms and receive medical consultations remotely. Users frequently post symptoms on social media and online health platforms, generating a vast repository of medical data that can be leveraged for disease classification and symptom severity assessment. Large language models (LLMs), such as LLAMA3, GPT-3.5 Turbo, and BERT, process complex medical data to enhance disease classification. This study explores three Arabic medical text preprocessing techniques: text summarization, text refinement, and Named Entity Recognition (NER). Evaluating CAMeL-BERT, AraBERT, and Asafaya-BERT with LoRA, the best performance was achieved using CAMeL-BERT with NER-augmented text (83% type classification, 69% severity assessment). Non-fine-tuned models performed poorly (13%-20% type classification, 40%-49% severity assessment). Integrating LLMs into social telehealth systems enhances diagnostic accuracy and treatment outcomes.
zh
[NLP-152] Contextually Entangled Gradient Mapping for Optimized LLM Comprehension
【速读】: 该论文旨在解决神经架构在长文本推理、上下文保持及适应新领域任务中的优化策略不足的问题。关键在于引入了Contextually Entangled Gradient Mapping (CEGM),将梯度视为动态承载上下文依赖性的实体,而非孤立的数值,通过在损失正则化框架中整合纠缠梯度动力学,显著提升了模型在这些任务上的表现。
链接: https://arxiv.org/abs/2502.00048
作者: Colin Sisate,Alistair Goldfinch,Vincent Waterstone,Sebastian Kingsley,Mariana Blackthorn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Contextually Entangled Gradient Mapping (CEGM) introduces a new approach to gradient optimization, redefining the relationship between contextual embeddings and gradient updates to enhance semantic coherence and reasoning capabilities in neural architectures. By treating gradients as dynamic carriers of contextual dependencies rather than isolated numerical entities, the proposed methodology bridges critical gaps in existing optimization strategies. The integration of entangled gradient dynamics into a loss regularization framework demonstrated significant improvements in tasks involving long-form reasoning, contextual retention, and adaptability to unseen domains. Experimental evaluations showed that the CEGM-enhanced model consistently outperformed baseline approaches, achieving higher accuracy in token-level predictions and greater resilience to noisy inputs. Practical implementations involved modifications to training pipelines, introducing entanglement layers and dynamic coefficient adjustments that seamlessly align with existing architectures. Results further highlighted reductions in semantic drift during sequential transformations and improvements in embedding coherence across paraphrased sentences, showing the robustness and versatility of the proposed methodology. The findings demonstrate the broader implications of gradient entanglement for both theoretical advancements and practical applications in optimization strategies.
[NLP-153] Optimization Strategies for Enhancing Resource Efficiency in Transformers Large Language Models
【Quick Read】: This paper tackles the resource costs that accompany performance gains of the Transformer architecture in NLP. The key is exploring and optimizing compression techniques, including Quantization, Knowledge Distillation, and Pruning, to improve energy and computational efficiency while retaining performance. 4-bit quantization significantly reduces energy use with almost no accuracy loss, and hybrid approaches such as NVIDIA's Minitron, which combines knowledge distillation with structured pruning, show favorable trade-offs between size reduction and accuracy retention. The study offers insights for developing more sustainable and efficient large language models, highlighting the often-ignored concern of energy efficiency.
Link: https://arxiv.org/abs/2502.00046
Authors: Tom Wallace, Naser Ezzati-Jivan, Beatrice Ombuki-Berman
Affiliations: Brock University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted for ACM's ICPE 2025 in Short Paper format
Abstract:Advancements in Natural Language Processing are heavily reliant on the Transformer architecture, whose improvements come at substantial resource costs due to ever-growing model sizes. This study explores optimization techniques, including Quantization, Knowledge Distillation, and Pruning, focusing on energy and computational efficiency while retaining performance. Among standalone methods, 4-bit Quantization significantly reduces energy use with minimal accuracy loss. Hybrid approaches, like NVIDIA’s Minitron approach combining KD and Structured Pruning, further demonstrate promising trade-offs between size reduction and accuracy retention. A novel optimization equation is introduced, offering a flexible framework for comparing various methods. Through the investigation of these compression methods, we provide valuable insights for developing more sustainable and efficient LLMs, shining a light on the often-ignored concern of energy efficiency.
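The 4-bit quantization the abstract highlights can be illustrated with a minimal group-wise symmetric quantizer. This is an illustrative sketch, not the paper's implementation; the group size, the symmetric [-8, 7] scheme, and the helper names are assumptions:

```python
import numpy as np

def quantize_4bit(weights, group_size=8):
    """Symmetric 4-bit quantization: each group of weights is mapped to
    integers in [-8, 7] with one float scale stored per group."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    codes, scales = [], []
    for i in range(0, w.size, group_size):
        group = w[i:i + group_size]
        scale = float(np.abs(group).max()) / 7.0
        if scale == 0.0:          # all-zero group: any scale works
            scale = 1.0
        q = np.clip(np.round(group / scale), -8, 7).astype(np.int8)
        codes.append(q)
        scales.append(scale)
    return codes, scales

def dequantize_4bit(codes, scales):
    """Recover approximate float weights from the 4-bit codes."""
    return np.concatenate([q * s for q, s in zip(codes, scales)])
```

Storing one scale per group keeps the reconstruction error bounded by half a quantization step per weight, which is why accuracy loss stays small in practice.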
[NLP-154] MALT: Mechanistic Ablation of Lossy Translation in LLMs for a Low-Resource Language: Urdu
【Quick Read】: This paper addresses the significant performance drop of large language models (LLMs) on low-resource languages such as Urdu. The key finding is that even for low-resource languages, the LLM's internal latent response in English is fairly coherent, but the translation features are lossy and produce poor final translations. By mechanistically ablating these translation features and using a separate translation model to translate the LLM's internal latent response, performance on low-resource languages improves significantly while the cultural nuances of the input are preserved.
Link: https://arxiv.org/abs/2502.00041
Authors: Taaha Saleem Bajwa
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:LLMs are predominantly trained on English data, which leads to a significant drop in performance on low-resource languages. Understanding how LLMs handle these languages is crucial for improving their effectiveness. This study focuses on Urdu as a use case for exploring the challenges faced by LLMs in processing low-resource languages. LLMs primarily reason in English when prompted in another language, with the final layers acting as translators to convert the English response into the target language. This study finds that even for low-resource languages, the internal latent response of LLMs in English is quite coherent; however, the translation features are lossy and result in poor translations, leading to reduced performance. By mechanistically removing these translation features and using a separate translation model to translate the internal latent response of LLM, the performance of LLMs improves significantly while also preserving the cultural nuances of the input in low-resource languages.
[NLP-155] Zoning in American Cities: Are Reforms Making a Difference? An AI-based Analysis
【Quick Read】: This paper examines the adoption and impact of form-based codes (FBCs) as a response to urban-sustainability problems driven by traditional use-based zoning. The key is applying Natural Language Processing (NLP) to zoning documents from across the United States, revealing that FBCs promote compact, mixed-use urban forms, with improved walkability, shorter commutes, and a higher share of multi-family housing.
Link: https://arxiv.org/abs/2502.00008
Authors: Arianna Salazar-Miranda, Emily Talen
Affiliations: University of Chicago; Yale University
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 31 pages, 6 figures, 1 table
Abstract:Cities are at the forefront of addressing global sustainability challenges, particularly those exacerbated by climate change. Traditional zoning codes, which often segregate land uses, have been linked to increased vehicular dependence, urban sprawl, and social disconnection, undermining broader social and environmental sustainability objectives. This study investigates the adoption and impact of form-based codes (FBCs), which aim to promote sustainable, compact, and mixed-use urban forms as a solution to these issues. Using Natural Language Processing (NLP) techniques, we analyzed zoning documents from over 2000 U.S. census-designated places to identify linguistic patterns indicative of FBC principles. Our findings reveal widespread adoption of FBCs across the country, with notable variations within regions. FBCs are associated with higher floor-to-area ratios, narrower and more consistent street setbacks, and smaller plots. We also find that places with FBCs have improved walkability, shorter commutes, and a higher share of multi-family housing. Our findings highlight the utility of NLP for evaluating zoning codes and underscore the potential benefits of form-based zoning reforms for enhancing urban sustainability.
[NLP-156] Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods
【Quick Read】: This paper addresses the challenges of fine-tuning discrete diffusion models with policy gradient methods, particularly in Reinforcement Learning from Human Feedback (RLHF) settings with non-differentiable rewards. The key contribution is Score Entropy Policy Optimization (SEPO), an efficient, broadly applicable, and theoretically grounded policy gradient algorithm, shown to be scalable and efficient across several discrete generative tasks.
Link: https://arxiv.org/abs/2502.01384
Authors: Oussama Zekri, Nicolas Boullé
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages, 4 figures, 5 tables
Abstract:Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at this https URL
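SEPO itself is specific to discrete diffusion, but the underlying tool, a policy gradient through a non-differentiable reward, can be illustrated with the classic score-function (REINFORCE) estimator. This is a generic sketch for a single categorical action, not the SEPO algorithm; the sample count and identity `∇_j log π(a) = 1[j=a] − π_j` are the standard softmax case:

```python
import math, random

def reinforce_gradient(logits, reward_fn, n_samples=2000, rng=None):
    """Monte-Carlo estimate of d E[r(a)] / d logits for
    a ~ Categorical(softmax(logits)), using grad = E[r(a) * grad log pi(a)].
    reward_fn may be arbitrary (non-differentiable)."""
    rng = rng or random.Random(0)
    m = max(logits)                                  # stable softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        u, a, acc = rng.random(), len(logits) - 1, 0.0
        for i, p in enumerate(probs):                # inverse-CDF sampling
            acc += p
            if u <= acc:
                a = i
                break
        r = reward_fn(a)
        for j in range(len(logits)):
            # grad of log pi(a) wrt logit j is 1[j == a] - probs[j]
            grad[j] += r * ((1.0 if j == a else 0.0) - probs[j])
    return [g / n_samples for g in grad]
```

Because the reward only enters as a multiplicative weight on the score, no gradient ever flows through `reward_fn`, which is exactly what makes this family of estimators applicable to non-differentiable RLHF rewards.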
[NLP-157] Probabilistic adaptation of language comprehension for individual speakers: Evidence from neural oscillations
【Quick Read】: This paper investigates how listeners dynamically update their mental representations of language comprehension based on the probability that a speaker produces stereotype-incongruent utterances. The key is distinguishing two possible mechanisms: a speaker-general mechanism that adjusts overall expectations, and a speaker-specific mechanism that updates models of individual speakers. Two EEG experiments show distinct patterns in high-beta (21-30 Hz) and theta (4-6 Hz) oscillations across conditions, supporting the existence of both mechanisms and providing evidence that language processing is shaped by social cognition in real time.
Link: https://arxiv.org/abs/2502.01299
Authors: Hanlin Wu, Xiaohui Rao, Zhenguang G. Cai
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
Comments:
Abstract:Listeners adapt language comprehension based on their mental representations of speakers, but how these representations are dynamically updated remains unclear. We investigated whether listeners probabilistically adapt their comprehension based on the likelihood of speakers producing stereotype-incongruent utterances. Our findings reveal two potential mechanisms: a speaker-general mechanism that adjusts overall expectations about speaker-content relationships, and a speaker-specific mechanism that updates individual speaker models. In two EEG experiments, participants heard speakers make stereotype-congruent or incongruent utterances, with incongruency base rate manipulated between blocks. In Experiment 1, speaker incongruency modulated both high-beta (21-30 Hz) and theta (4-6 Hz) oscillations: incongruent utterances decreased oscillatory power in low base rate condition but increased it in high base rate condition. The theta effect varied with listeners’ openness trait: less open participants showed theta increases to speaker-incongruencies, suggesting maintenance of speaker-specific information, while more open participants showed theta decreases, indicating flexible model updating. In Experiment 2, we dissociated base rate from the target speaker by manipulating the overall base rate using an alternative non-target speaker. Only the high-beta effect persisted, showing power decrease for speaker-incongruencies in low base rate condition but no effect in high base rate condition. The high-beta oscillations might reflect the speaker-general adjustment, while theta oscillations may index the speaker-specific model updating. These findings provide evidence for how language processing is shaped by social cognition in real time.
[NLP-158] MarketSenseAI 2.0: Enhancing Stock Analysis through LLM Agents
【Quick Read】: This paper targets the efficiency of integrating and processing information for stock analysis and decision making. The key is MarketSenseAI, a framework whose novel architecture combines Retrieval-Augmented Generation with large language model (LLM) agents; it processes SEC filings and earnings calls and enriches macroeconomic analysis through systematic processing of diverse institutional reports. The approach significantly improves fundamental-analysis accuracy and outperforms market indices in empirical evaluation, validating its effectiveness.
Link: https://arxiv.org/abs/2502.00415
Authors: George Fatouros, Kostas Metaxas, John Soldatos, Manos Karathanassis
Affiliations: Unknown
Subjects: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Portfolio Management (q-fin.PM)
Comments: 25 pages, 7 figures, Under review at Financial Innovation (FIN)
Abstract:MarketSenseAI is a novel framework for holistic stock analysis which leverages Large Language Models (LLMs) to process financial news, historical prices, company fundamentals and the macroeconomic environment to support decision making in stock analysis and selection. In this paper, we present the latest advancements on MarketSenseAI, driven by rapid technological expansion in LLMs. Through a novel architecture combining Retrieval-Augmented Generation and LLM agents, the framework processes SEC filings and earnings calls, while enriching macroeconomic analysis through systematic processing of diverse institutional reports. We demonstrate a significant improvement in fundamental analysis accuracy over the previous version. Empirical evaluation on S&P 100 stocks over two years (2023-2024) shows MarketSenseAI achieving cumulative returns of 125.9% compared to the index return of 73.5%, while maintaining comparable risk profiles. Further validation on S&P 500 stocks during 2024 demonstrates the framework's scalability, delivering a 33.8% higher Sortino ratio than the market. This work marks a significant advancement in applying LLM technology to financial analysis, offering insights into the robustness of LLM-driven investment strategies.
[NLP-159] AlphaSharpe: LLM-Driven Discovery of Robust Risk-Adjusted Metrics
【Quick Read】: This paper addresses the limited robustness and generalization of traditional financial metrics (such as the Sharpe ratio) under dynamic, volatile market conditions. The key is the AlphaSharpe framework, which uses large language models (LLMs) to iteratively evolve and optimize financial metrics; through iterative crossover, mutation, and evaluation it generates enhanced risk-return metrics that outperform traditional approaches in robustness and in correlation with future performance metrics.
Link: https://arxiv.org/abs/2502.00029
Authors: Kamer Ali Yuksel, Hassan Sawaf
Affiliations: aiXplain Inc., San Jose, CA, USA
Subjects: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE); Risk Management (q-fin.RM)
Comments:
Abstract:Financial metrics like the Sharpe ratio are pivotal in evaluating investment performance by balancing risk and return. However, traditional metrics often struggle with robustness and generalization, particularly in dynamic and volatile market conditions. This paper introduces AlphaSharpe, a novel framework leveraging large language models (LLMs) to iteratively evolve and optimize financial metrics. AlphaSharpe generates enhanced risk-return metrics that outperform traditional approaches in robustness and correlation with future performance metrics by employing iterative crossover, mutation, and evaluation. Key contributions of this work include: (1) an innovative use of LLMs for generating and refining financial metrics inspired by domain-specific knowledge, (2) a scoring mechanism to ensure the evolved metrics generalize effectively to unseen data, and (3) an empirical demonstration of 3x predictive power for future risk-return forecasting. Experimental results on a real-world dataset highlight the superiority of AlphaSharpe metrics, making them highly relevant for portfolio managers and financial decision-makers. This framework not only addresses the limitations of existing metrics but also showcases the potential of LLMs in advancing financial analytics, paving the way for informed and robust investment strategies.
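For reference, the traditional baseline that AlphaSharpe sets out to improve on, the annualized Sharpe ratio, can be computed as follows. This is the standard textbook formula, not code from the paper; the annualization factor of 252 trading days per year is a common convention and an assumption here:

```python
def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return divided by the
    sample standard deviation of excess returns, scaled by sqrt(periods)."""
    n = len(returns)
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / n
    var = sum((x - mean) ** 2 for x in excess) / (n - 1)  # sample variance
    vol = var ** 0.5
    if vol == 0.0:
        raise ValueError("zero volatility: Sharpe ratio is undefined")
    return (mean / vol) * periods_per_year ** 0.5
```

The division by volatility is precisely what makes the metric unstable in low-volatility or short-sample regimes, which is the robustness weakness the abstract refers to.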
Computer Vision
[CV-0] SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
【Quick Read】: This paper addresses control over the visual capabilities of diffusion models. Existing control methods require the user to specify attributes for each edit direction individually, whereas SliderSpace is a framework that discovers multiple interpretable and diverse directions simultaneously from a single text prompt. The key is training each direction as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model's latent space. Extensive experiments demonstrate SliderSpace's effectiveness on three applications: concept decomposition, artistic style exploration, and diversity enhancement.
Link: https://arxiv.org/abs/2502.01639
Authors: Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project Website: this https URL
Abstract:We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model’s latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace’s effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of model’s knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines. Our code, data and trained weights are available at this https URL
[CV-1] MFP-VTON: Enhancing Mask-Free Person-to-Person Virtual Try-On via Diffusion Transformer
【Quick Read】: This paper addresses the difficulty of obtaining a standard garment in virtual try-on (VTON) and proposes MFP-VTON, a mask-free person-to-person VTON framework. The key is building on a pretrained diffusion transformer and introducing a Focus Attention loss that emphasizes the garment of the reference person and the regions outside the garment of the target person. The model excels at both person-to-person and garment-to-person VTON, generating high-fidelity fitting images.
Link: https://arxiv.org/abs/2502.01626
Authors: Le Shen, Yanting Kang, Rong Huang, Zhijie Wang
Affiliations: Donghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The garment-to-person virtual try-on (VTON) task, which aims to generate fitting images of a person wearing a reference garment, has made significant strides. However, obtaining a standard garment is often more challenging than using the garment already worn by the person. To improve ease of use, we propose MFP-VTON, a Mask-Free framework for Person-to-Person VTON. Recognizing the scarcity of person-to-person data, we adapt a garment-to-person model and dataset to construct a specialized dataset for this task. Our approach builds upon a pretrained diffusion transformer, leveraging its strong generative capabilities. During mask-free model fine-tuning, we introduce a Focus Attention loss to emphasize the garment of the reference person and the details outside the garment of the target person. Experimental results demonstrate that our model excels in both person-to-person and garment-to-person VTON tasks, generating high-fidelity fitting images.
[CV-2] Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
【Quick Read】: This paper addresses the vulnerability of multi-modal large language models (MLLMs) to visual adversarial perturbations, which can induce hallucinations, manipulate responses, or bypass safety mechanisms. The key is leveraging existing vision classification models that have been adversarially pre-trained at large scale: end-to-end integration of these robust models lets the language components adapt to robust visual features, yielding superior robustness against diverse adversarial threats without additional adversarial training and outperforming existing plug-and-play methods on complex reasoning tasks.
Link: https://arxiv.org/abs/2502.01576
Authors: Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Khan, Salman Khan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review
Abstract:Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at this https URL.
[CV-3] MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
【Quick Read】: This paper tackles three obstacles in AI generation of structured multi-step procedural tutorials: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalization across multiple domains. It contributes a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences and introduces MakeAnything, a framework based on the diffusion transformer (DiT). The key is using fine-tuning to activate DiT's in-context capabilities for generating consistent procedural sequences, together with asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. In addition, the ReCraft model enables image-to-process generation via spatiotemporal consistency constraints, decomposing static images into plausible creation sequences.
Link: https://arxiv.org/abs/2502.01572
Authors: Yiren Song, Cheng Liu, Mike Zheng Shou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DIT), which leverages fine-tuning to activate the in-context capabilities of DIT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. Additionally, our ReCraft model enables image-to-process generation through spatiotemporal consistency constraints, allowing static images to be decomposed into plausible creation sequences. Extensive experiments demonstrate that MakeAnything surpasses existing methods, setting new performance benchmarks for procedural generation tasks.
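The low-rank adaptation (LoRA) idea the abstract relies on, freezing a pretrained weight and learning only a rank-r update, can be sketched in a few lines. This is a generic LoRA forward pass, not MakeAnything's asymmetric variant; the shapes and the zero-init of `b` follow common practice and are assumptions here:

```python
import numpy as np

def lora_forward(x, w_frozen, a, b, alpha=1.0):
    """Linear layer with a low-rank (LoRA) update.

    w_frozen: (out, in) pretrained weight, kept fixed during tuning.
    a: (r, in) and b: (out, r) are the only trainable parameters;
    their product b @ a is a rank-r additive update to w_frozen.
    """
    delta = alpha * (b @ a)              # rank-r weight update
    return x @ (w_frozen + delta).T

def init_lora(out_dim, in_dim, rank, rng):
    """Common LoRA init: a small random, b zero, so the update starts at 0."""
    a = rng.standard_normal((rank, in_dim)) * 0.01
    b = np.zeros((out_dim, rank))
    return a, b
```

With `b` initialized to zero, the adapted layer starts out identical to the frozen one, so fine-tuning begins from the pretrained model's behavior and only drifts as the low-rank factors are trained.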
[CV-4] GauCho: Gaussian Distributions with Cholesky Decomposition for Oriented Object Detection
【Quick Read】: This paper addresses the angular boundary discontinuity problem in oriented object detection (OOD) and the encoding ambiguity of circular objects. The key is GauCho, a regression head that directly produces Gaussian distributions via the Cholesky matrix decomposition, theoretically mitigating the boundary discontinuity problem while remaining fully compatible with existing Gaussian-based regression loss functions. The paper further advocates representing oriented objects with Oriented Ellipses (OEs), which relate to GauCho through a bijective function and alleviate the encoding ambiguity for circular objects.
Link: https://arxiv.org/abs/2502.01565
Authors: Jeffri Murrugarra-LLerena, Jose Henrique Lima Marques, Claudio R. Jung
Affiliations: Stony Brook University; Federal University of Rio Grande do Sul
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Oriented Object Detection (OOD) has received increased attention in the past years, being a suitable solution for detecting elongated objects in remote sensing analysis. In particular, using regression loss functions based on Gaussian distributions has become attractive since they yield simple and differentiable terms. However, existing solutions are still based on regression heads that produce Oriented Bounding Boxes (OBBs), and the known problem of angular boundary discontinuity persists. In this work, we propose a regression head for OOD that directly produces Gaussian distributions based on the Cholesky matrix decomposition. The proposed head, named GauCho, theoretically mitigates the boundary discontinuity problem and is fully compatible with recent Gaussian-based regression loss functions. Furthermore, we advocate using Oriented Ellipses (OEs) to represent oriented objects, which relates to GauCho through a bijective function and alleviates the encoding ambiguity problem for circular objects. Our experimental results show that GauCho can be a viable alternative to the traditional OBB head, achieving results comparable to or better than state-of-the-art detectors for the challenging dataset DOTA.
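The core trick, predicting a Gaussian through its Cholesky factor so that the covariance is positive definite by construction, can be sketched for the 2D case. The `exp()` on the diagonal entries is one common way to enforce positivity and is an assumption here, not necessarily the paper's exact parameterization:

```python
import numpy as np

def gaussian_from_cholesky(l11, l21, l22):
    """Build a 2x2 covariance from three raw network outputs.

    The diagonal entries pass through exp() to stay strictly positive,
    so Sigma = L @ L.T is symmetric positive definite by construction.
    No angle parameter appears, hence no angular boundary discontinuity.
    """
    L = np.array([[np.exp(l11), 0.0],
                  [l21,         np.exp(l22)]])
    return L @ L.T
```

Because any smooth change of the three raw outputs produces a smooth change of the covariance (and thus of the implied oriented ellipse), the wrap-around jump that an explicit angle parameter suffers at its range boundary never occurs.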
[CV-5] FireCastNet: Earth-as-a-Graph for Seasonal Fire Prediction
【Quick Read】: This paper targets accurate and timely seasonal wildfire forecasting at a global scale. The key is FireCastNet, a novel architecture combining a 3D convolutional encoder with GraphCast, trained to capture the context leading to wildfires at different spatial and temporal scales. Longer input time series make predictions more robust, and integrating spatial information to capture the spatio-temporal dynamics of wildfires boosts performance; the results also suggest that a larger spatial receptive field helps at longer forecasting horizons.
Link: https://arxiv.org/abs/2502.01550
Authors: Dimitrios Michail, Charalampos Davalas, Lefki-Ioanna Panagiotou, Ioannis Prapas, Spyros Kondylatos, Nikolaos Ioannis Bountos, Ioannis Papoutsis
Affiliations: Harokopio University of Athens, Greece; OrionLab, National Technical University & National Observatory of Athens, Greece
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:With climate change expected to exacerbate fire weather conditions, the accurate and timely anticipation of wildfires becomes increasingly crucial for disaster mitigation. In this study, we utilize SeasFire, a comprehensive global wildfire dataset with climate, vegetation, oceanic indices, and human-related variables, to enable seasonal wildfire forecasting with machine learning. For the predictive analysis, we present FireCastNet, a novel architecture which combines a 3D convolutional encoder with GraphCast, originally developed for global short-term weather forecasting using graph neural networks. FireCastNet is trained to capture the context leading to wildfires, at different spatial and temporal scales. Our investigation focuses on assessing the effectiveness of our model in predicting the presence of burned areas at varying forecasting time horizons globally, extending up to six months into the future, and on how different spatial or/and temporal context affects the performance. Our findings demonstrate the potential of deep learning models in seasonal fire forecasting; longer input time-series leads to more robust predictions, while integrating spatial information to capture wildfire spatio-temporal dynamics boosts performance. Finally, our results hint that in order to enhance performance at longer forecasting horizons, a larger receptive field spatially needs to be considered.
[CV-6] VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
【Quick Read】: This paper addresses knowledge integration for large language models (LLMs) when processing and understanding long-context videos; existing methods focus mainly on text and neglect the rich domain of multi-modal video knowledge. The key innovation of the VideoRAG framework is its dual-channel architecture, which seamlessly integrates graph-based textual knowledge grounding, capturing cross-video semantic relationships, with multi-modal context encoding that efficiently preserves visual features. This design enables VideoRAG to process videos of unlimited length by constructing precise knowledge graphs that span multiple videos while maintaining semantic dependencies.
Link: https://arxiv.org/abs/2502.01549
Authors: Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
Affiliations: Baidu Inc.; The University of Hong Kong
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in enhancing Large Language Models (LLMs) through external knowledge integration, yet its application has primarily focused on textual content, leaving the rich domain of multi-modal video knowledge predominantly unexplored. This paper introduces VideoRAG, the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Our core innovation lies in its dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features. This novel design empowers VideoRAG to process unlimited-length videos by constructing precise knowledge graphs that span multiple videos while maintaining semantic dependencies through specialized multi-modal retrieval paradigms. Through comprehensive empirical evaluation on our proposed LongerVideos benchmark-comprising over 160 videos totaling 134+ hours across lecture, documentary, and entertainment categories-VideoRAG demonstrates substantial performance compared to existing RAG alternatives and long video understanding methods. The source code of VideoRAG implementation and the benchmark dataset are openly available at: this https URL.
[CV-7] VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion
【Quick Read】: This paper addresses the sim-to-real gap encountered when deploying legged robots in real-world environments, where simulators fail to replicate visual realism and complex real-world geometry, limiting high-level tasks that require RGB-based perception. The key is a Real-to-Sim-to-Real framework that reconstructs scenes from multi-view images with 3D Gaussian Splatting (3DGS) to generate photorealistic, physically interactive "digital twin" simulation environments for visual navigation and locomotion learning, achieving RGB-only sim-to-real policy transfer.
Link: https://arxiv.org/abs/2502.01536
Authors: Shaoting Zhu, Linzhan Mou, Derun Li, Baijun Ye, Runhan Huang, Hang Zhao
Affiliations: IIIS, Tsinghua University; Galaxea AI; Shanghai Qi Zhi Institute; Shanghai Jiao Tong University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Recent success in legged robot locomotion is attributed to the integration of reinforcement learning and physical simulators. However, these policies often encounter challenges when deployed in real-world environments due to sim-to-real gaps, as simulators typically fail to replicate visual realism and complex real-world geometry. Moreover, the lack of realistic visual rendering limits the ability of these policies to support high-level tasks requiring RGB-based perception like ego-centric navigation. This paper presents a Real-to-Sim-to-Real framework that generates photorealistic and physically interactive “digital twin” simulation environments for visual navigation and locomotion learning. Our approach leverages 3D Gaussian Splatting (3DGS) based scene reconstruction from multi-view images and integrates these environments into simulations that support ego-centric visual perception and mesh-based physical interactions. To demonstrate its effectiveness, we train a reinforcement learning policy within the simulator to perform a visual goal-tracking task. Extensive experiments show that our framework achieves RGB-only sim-to-real policy transfer. Additionally, our framework facilitates the rapid adaptation of robot policies with effective exploration capability in complex new environments, highlighting its potential for applications in households and factories.
[CV-8] Efficiently Integrate Large Language Models with Visual Perception: A Survey from the Training Paradigm Perspective
【Quick Read】: This survey fills the gap in understanding the training paradigms of Vision Large Language Models (VLLMs) and their parameter-efficiency considerations. The key is analyzing 34 VLLMs from top conferences, journals, and highly cited arXiv papers from the training-paradigm perspective, with a focus on parameter efficiency. The paper first introduces LLM architectures and parameter-efficient learning methods, then discusses vision encoders and a comprehensive taxonomy of modality integrators, reviews three training paradigms with their efficiency considerations, and summarizes benchmarks in the VLLM field. To gain deeper insight into parameter-efficiency effectiveness, it also replicates experiments of the Direct Adaptation paradigm, offering a practical guide for researchers and practitioners on efficiently integrating vision modalities into LLMs.
Link: https://arxiv.org/abs/2502.01524
Authors: Xiaorui Ma, Haoran Xie, S. Joe Qin
Affiliations: Lingnan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 3 figures
Abstract:The integration of vision-language modalities has been a significant focus in multimodal learning, traditionally relying on Vision-Language Pretrained Models. However, with the advent of Large Language Models (LLMs), there has been a notable shift towards incorporating LLMs with vision modalities. Following this, the training paradigms for incorporating vision modalities into LLMs have evolved. Initially, the approach was to integrate the modalities through pretraining the modality integrator, named Single-stage Tuning. It has since branched out into methods focusing on performance enhancement, denoted as Two-stage Tuning, and those prioritizing parameter efficiency, referred to as Direct Adaptation. However, existing surveys primarily address the latest Vision Large Language Models (VLLMs) with Two-stage Tuning, leaving a gap in understanding the evolution of training paradigms and their unique parameter-efficient considerations. This paper categorizes and reviews 34 VLLMs from top conferences, journals, and highly cited Arxiv papers, focusing on parameter efficiency during adaptation from the training paradigm perspective. We first introduce the architecture of LLMs and parameter-efficient learning methods, followed by a discussion on vision encoders and a comprehensive taxonomy of modality integrators. We then review three training paradigms and their efficiency considerations, summarizing benchmarks in the VLLM field. To gain deeper insights into their effectiveness in parameter efficiency, we compare and discuss the experimental results of representative models, among which the experiment of the Direct Adaptation paradigm is replicated. Providing insights into recent developments and practical uses, this survey is a vital guide for researchers and practitioners navigating the efficient integration of vision modalities into LLMs.
[CV-9] BD-Diff: Generative Diffusion Model for Image Deblurring on Unknown Domains with Blur-Decoupled Learning
【Quick Read】: This paper addresses the cost and difficulty of acquiring large amounts of realistic paired data, and the overfitting that results from relying solely on synthetic data, which limits diffusion-based deblurring under unknown blur patterns. The key is BD-Diff, a generative-diffusion-based model that decouples structural features and blur patterns through joint training on three specially designed tasks. Two Q-Formers serve as separate extractors of structural representations and blur patterns, which are used for a supervised deblurring task on synthetic data and an unsupervised blur-transfer task that leverages unpaired blurred images from the target domain; a reconstruction task additionally makes the structural features and blur patterns complementary, enhancing BD-Diff's generalization to unknown-domain blur patterns.
Link: https://arxiv.org/abs/2502.01522
Authors: Junhao Cheng, Wei-Ting Chen, Xi Lu, Ming-Hsuan Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: We propose BD-Diff to integrate a generative diffusion model into unpaired deblurring tasks
Abstract:Generative diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. In favor of their ability to supplement missing details and generate aesthetically pleasing contents, recent works have applied them to image deblurring tasks via training an adapter on blurry-sharp image pairs to provide structural conditions for restoration. However, acquiring substantial amounts of realistic paired data is challenging and costly in real-world scenarios. On the other hand, relying solely on synthetic data often results in overfitting, leading to unsatisfactory performance when confronted with unseen blur patterns. To tackle this issue, we propose BD-Diff, a generative-diffusion-based model designed to enhance deblurring performance on unknown domains by decoupling structural features and blur patterns through joint training on three specially designed tasks. We employ two Q-Formers as structural representations and blur patterns extractors separately. The features extracted by them will be used for the supervised deblurring task on synthetic data and the unsupervised blur-transfer task by leveraging unpaired blurred images from the target domain simultaneously. Furthermore, we introduce a reconstruction task to make the structural features and blur patterns complementary. This blur-decoupled learning process enhances the generalization capabilities of BD-Diff when encountering unknown domain blur patterns. Experiments on real-world datasets demonstrate that BD-Diff outperforms existing state-of-the-art methods in blur removal and structural preservation in various challenging scenarios. The codes will be released in this https URL
zh
[CV-10] End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)合成任务中复杂模态交互建模的挑战。论文的关键在于提出了一种端到端学习的文本嵌入方法,这些嵌入是专门为T2I合成网络设计的。此外,论文结合了生成式训练和对比式训练,并使用了两种嵌入:一种优化以增强生成图像的真实感,另一种则致力于捕捉文本与图像之间的对齐关系。这一方法在三个基准数据集上的实验表明,使用分离的嵌入比共享嵌入效果更佳,并且优于采用从预先训练的判别式文本编码器获取文本表示的方法。
链接: https://arxiv.org/abs/2502.01507
作者: Yeruru Asrar Ahmed,Anurag Mittal
机构: Indian Institute of Technology Madras (印度理工学院马德拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text-to-Image (T2I) synthesis is a challenging task that requires modeling complex interactions between two modalities ( i.e., text and image). A common framework adopted in recent state-of-the-art approaches to achieving such multimodal interactions is to bootstrap the learning process with pre-trained image-aligned text embeddings trained using contrastive loss. Furthermore, these embeddings are typically trained generically and reused across various synthesis models. In contrast, we explore an approach to learning text embeddings specifically tailored to the T2I synthesis network, trained in an end-to-end fashion. Further, we combine generative and contrastive training and use two embeddings, one optimized to enhance the photo-realism of the generated images, and the other seeking to capture text-to-image alignment. A comprehensive set of experiments on three text-to-image benchmark datasets (Oxford-102, Caltech-UCSD, and MS-COCO) reveal that having two separate embeddings gives better results than using a shared one and that such an approach performs favourably in comparison with methods that use text representations from a pre-trained text encoder trained using a discriminative approach. Finally, we demonstrate that such learned embeddings can be used in other contexts as well, such as text-to-image manipulation.
zh
[CV-11] MoireDB: Formula-generated Interference-fringe Image Dataset
【速读】:该论文旨在解决图像识别模型在处理现实世界退化时的鲁棒性不足问题。解决方案的关键在于提出MoireDB数据集,这是一个通过公式生成的干涉条纹图像数据集,用于增强图像增强和模型鲁棒性。MoireDB通过利用错觉模式,消除了版权顾虑,降低了数据集构建成本,并提高了模型对现实世界退化的鲁棒性。实验表明,使用MoireDB增强的图像表现优于传统的分形艺术和基于特征可视化(FVis)的增强方法。
链接: https://arxiv.org/abs/2502.01490
作者: Yuto Matsuo,Ryo Hayamizu,Hirokatsu Kataoka,Akio Nakamura
机构: Tokyo Denki University (东京电气大学); National Institute of Advanced Industrial Science and Technology (AIST) (先进产业科学技术研究所); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Image recognition models have struggled to maintain recognition robustness under real-world degradations. In this context, data augmentation methods like PixMix improve robustness but rely on generative arts and feature visualizations (FVis), which have copyright, drawing cost, and scalability issues. We propose MoireDB, a formula-generated interference-fringe image dataset for image augmentation enhancing robustness. MoireDB eliminates copyright concerns, reduces dataset assembly costs, and enhances robustness by leveraging illusory patterns. Experiments show that MoireDB-augmented images outperform traditional fractal arts and FVis-based augmentations, making it a scalable and effective solution for improving model robustness against real-world degradations.
zh
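MoireDB 的核心是用公式生成干涉条纹(摩尔纹)图像并混入训练图像。下面给出一个示意性草图(非论文官方实现,频率、角度与混合强度等参数均为假设):叠加两组频率/方向略有差异的正弦光栅即可得到低频摩尔条纹,再以 PixMix 式的线性混合作数据增强。

```python
import numpy as np

def moire_fringe(h, w, f1=0.15, f2=0.17, theta1=0.0, theta2=0.35):
    """叠加两组正弦光栅生成摩尔干涉条纹,值域归一化到 [0, 1]。"""
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    # 光栅 1:频率 f1,方向 theta1
    g1 = np.cos(2 * np.pi * f1 * (x * np.cos(theta1) + y * np.sin(theta1)))
    # 光栅 2:频率/方向略有差异,两者叠加产生低频干涉条纹
    g2 = np.cos(2 * np.pi * f2 * (x * np.cos(theta2) + y * np.sin(theta2)))
    fringe = (g1 + g2) / 2.0           # [-1, 1]
    return (fringe + 1.0) / 2.0        # [0, 1]

def augment(img, strength=0.3):
    """以线性混合方式把条纹叠加进图像(img 取值 [0, 1])。"""
    fringe = moire_fringe(*img.shape[:2])
    if img.ndim == 3:
        fringe = fringe[..., None]
    return np.clip((1 - strength) * img + strength * fringe, 0.0, 1.0)
```

实际数据集按多组 (f, θ) 参数批量生成条纹图像;这里的混合方式与参数仅作说明用。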
[CV-12] Simultaneous Automatic Picking and Manual Picking Refinement for First-Break
【速读】:该论文旨在解决微地震数据处理中自动识别初至波时遇到的手动标记数据集中的异常值和潜在误标问题。这些问题会影响神经网络训练的有效性。论文的关键解决方案是Simultaneous Picking and Refinement (SPR)算法,它将初至波的真实位置视为概率模型中的潜在变量,并引入先验标签来处理噪声或异常数据。SPR通过动态调整和优化,提高了在包含异常值或部分不准确数据的数据集中识别初至波的准确性。此外,SPR的灵活性使其能够适应多种基于深度学习的初至波拾取方法。
链接: https://arxiv.org/abs/2502.01474
作者: Haowen Bai,Zixiang Zhao,Jiangshe Zhang,Yukun Cui,Chunxia Zhang,Zhenbo Guo,Yongjun Wang
机构: School of Mathematics and Statistics, Xi’an Jiaotong University(西安交通大学数学与统计学院), China; Geophysical Technology Research Center of Bureau of Geophysical Prospecting, Zhuozhou, Hebei, P.R.China(中国石油集团东方物探研究院地球物理技术研究中心); School of Artificial Intelligence, Wenzhou Polytechnic(温州职业技术学院人工智能学院), China
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:First-break picking is a pivotal procedure in processing microseismic data for geophysics and resource exploration. Recent advancements in deep learning have catalyzed the evolution of automated methods for identifying first-break. Nevertheless, the complexity of seismic data acquisition and the requirement for detailed, expert-driven labeling often result in outliers and potential mislabeling within manually labeled datasets. These issues can negatively affect the training of neural networks, necessitating algorithms that handle outliers or mislabeled data effectively. We introduce the Simultaneous Picking and Refinement (SPR) algorithm, designed to handle datasets plagued by outlier samples or even noisy labels. Unlike conventional approaches that regard manual picks as ground truth, our method treats the true first-break as a latent variable within a probabilistic model that includes a first-break labeling prior. SPR aims to uncover this variable, enabling dynamic adjustments and improved accuracy across the dataset. This strategy mitigates the impact of outliers or inaccuracies in manual labels. Intra-site picking experiments and cross-site generalization experiments on publicly available data confirm our method’s performance in identifying first-break and its generalization across different sites. Additionally, our investigations into noisy signals and labels underscore SPR’s resilience to both types of noise and its capability to refine misaligned manual annotations. Moreover, the flexibility of SPR, not being limited to any single network architecture, enhances its adaptability across various deep learning-based picking methods. Focusing on learning from data that may contain outliers or partial inaccuracies, SPR provides a robust solution to some of the principal obstacles in automatic first-break picking.
zh
[CV-13] Deep Unfolding Multi-modal Image Fusion Network via Attribution Analysis
【速读】:该论文旨在解决多模态图像融合过程中缺乏直接指导和交互的问题,当前方法主要集中在通过复杂的映射获取视觉显示层面的信息丰富的融合图像,而忽视了融合过程与下游任务(如语义分割)之间的有效互动。论文的关键解决方案在于提出了一种“展开归因分析融合网络”(UAAFusion),通过归因分析技术更有效地调整融合图像以适应语义分割任务,增强融合与分割之间的互动。具体而言,该方法利用归因分析探索源图像中语义区域对任务区分的贡献,并将更有益的特征整合到融合算法中,从而让分割任务引导融合过程。这种方法构建了一个基于模型驱动的展开网络,使用来自归因分析的优化目标,并通过计算当前分割网络状态下的归因融合损失来实现这一目标。
链接: https://arxiv.org/abs/2502.01467
作者: Haowen Bai,Zixiang Zhao,Jiangshe Zhang,Baisong Jiang,Lilun Deng,Yukun Cui,Shuang Xu,Chunxia Zhang
机构: School of Mathematics and Statistics, Xi’an Jiaotong University(西安交通大学数学与统计学院), China; Photogrammetry and Remote Sensing, ETH Zürich(瑞士苏黎世联邦理工学院摄影测量与遥感研究所), Switzerland; School of Mathematics and Statistics, Northwestern Polytechnical University(西北工业大学数学与统计学院), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 2024
点击查看摘要
Abstract:Multi-modal image fusion synthesizes information from multiple sources into a single image, facilitating downstream tasks such as semantic segmentation. Current approaches primarily focus on acquiring informative fusion images at the visual display stratum through intricate mappings. Although some approaches attempt to jointly optimize image fusion and downstream tasks, these efforts often lack direct guidance or interaction, serving only to assist with a predefined fusion loss. To address this, we propose an "Unfolding Attribution Analysis Fusion network" (UAAFusion), using attribution analysis to tailor fused images more effectively for semantic segmentation, enhancing the interaction between the fusion and segmentation. Specifically, we utilize attribution analysis techniques to explore the contributions of semantic regions in the source images to task discrimination. At the same time, our fusion algorithm incorporates more beneficial features from the source images, thereby allowing the segmentation to guide the fusion process. Our method constructs a model-driven unfolding network that uses optimization objectives derived from attribution analysis, with an attribution fusion loss calculated from the current state of the segmentation network. We also develop a new pathway function for attribution analysis, specifically tailored to the fusion tasks in our unfolding network. An attribution attention mechanism is integrated at each network stage, allowing the fusion network to prioritize areas and pixels crucial for high-level recognition tasks. Additionally, to mitigate the information loss in traditional unfolding networks, a memory augmentation module is incorporated into our network to improve the information flow across various network layers. Extensive experiments demonstrate our method’s superiority in image fusion and applicability to semantic segmentation.
zh
[CV-14] Temporal-consistent CAMs for Weakly Supervised Video Segmentation in Waste Sorting
【速读】:该论文旨在解决弱监督(Weakly Supervised, WS)方法在视频流语境下的语义分割精度不足的问题。关键解决方案在于构建利用视频中连续帧之间时间一致性(temporal coherence)的显著性图(saliency maps),通过最小化相邻帧之间显著性图的差异来提高分割精度,并在训练辅助分类器时直接整合这种时间一致性,从而实现更准确的材料移除识别。
链接: https://arxiv.org/abs/2502.01455
作者: Andrea Marelli,Luca Magri,Federica Arrigoni,Giacomo Boracchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 7 figures
点击查看摘要
Abstract:In industrial settings, weakly supervised (WS) methods are usually preferred over their fully supervised (FS) counterparts as they do not require costly manual annotations. Unfortunately, the segmentation masks obtained in the WS regime are typically poor in terms of accuracy. In this work, we present a WS method capable of producing accurate masks for semantic segmentation in the case of video streams. More specifically, we build saliency maps that exploit the temporal coherence between consecutive frames in a video, promoting consistency when objects appear in different frames. We apply our method in a waste-sorting scenario, where we perform weakly supervised video segmentation (WSVS) by training an auxiliary classifier that distinguishes between videos recorded before and after a human operator, who manually removes specific wastes from a conveyor belt. The saliency maps of this classifier identify materials to be removed, and we modify the classifier training to minimize differences between the saliency map of a central frame and those in adjacent frames, after having compensated object displacement. Experiments on a real-world dataset demonstrate the benefits of integrating temporal coherence directly during the training phase of the classifier. Code and dataset are available upon request.
zh
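该方法的核心是:在补偿物体位移后,最小化中心帧显著性图与相邻帧显著性图之间的差异。以下是一个只用 numpy 的极简草图,用整像素平移近似位移补偿(论文中实际的补偿方式与分类器训练细节在摘要之外,此处均为假设的简化):

```python
import numpy as np

def compensated(mask, dy, dx):
    """整像素平移并把卷绕进来的区域清零,近似补偿传送带上物体的位移。"""
    out = np.roll(mask, (dy, dx), axis=(0, 1))
    if dy > 0:
        out[:dy] = 0
    elif dy < 0:
        out[dy:] = 0
    if dx > 0:
        out[:, :dx] = 0
    elif dx < 0:
        out[:, dx:] = 0
    return out

def temporal_consistency_loss(sal_center, sal_neighbors, displacements):
    """位移补偿后,惩罚中心帧与各相邻帧显著性图之间的 L1 差异。"""
    losses = [np.abs(sal_center - compensated(s, dy, dx)).mean()
              for s, (dy, dx) in zip(sal_neighbors, displacements)]
    return float(np.mean(losses))
```

训练时把该损失加到辅助分类器的分类损失上,即可促使显著性图在相邻帧间保持一致。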
[CV-15] SPFFNet: Strip Perception and Feature Fusion Spatial Pyramid Pooling for Fabric Defect Detection
【速读】:该论文旨在解决织物缺陷检测中复杂背景和特定形状缺陷难以识别的问题。关键解决方案包括:引入条形感知模块(Strip Perception Module, SPM),通过多尺度卷积增强对条状缺陷特征的捕获能力;在空间金字塔池化快速模块(SPPF)中融入squeeze-and-excitation机制得到SE-SPPF模块,以更好地整合空间与通道信息;并提出一种带自适应权重的焦点增强完全交并比(FECIoU)度量,通过focal loss调整难检测实例的权重,以应对尺度差异和类别不平衡问题。这些改进使模型在天池数据集上的平均精度均值(mAP)提升0.8–8.1%,在自建数据集上提升1.6–13.2%。
链接: https://arxiv.org/abs/2502.01445
作者: Peizhe Zhao
机构: Nanjing University of Information Science and Technology (南京信息工程大学); Waterford Institute (沃特福德学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, conference
点击查看摘要
Abstract:Defect detection in fabrics is critical for quality control, yet existing methods often struggle with complex backgrounds and shape-specific defects. In this paper, we propose an improved fabric defect detection model based on YOLOv4. To enhance the detection of strip defects, we introduce a Strip Perception Module (SPM) that improves feature capture through multi-scale convolution. We further enhance the spatial pyramid pooling fast (SPPF) by integrating a squeeze-and-excitation mechanism, resulting in the SE-SPPF module, which better integrates spatial and channel information for more effective defect feature extraction. Additionally, we propose a novel focal enhanced complete intersection over union (FECIoU) metric with adaptive weights, addressing scale differences and class imbalance by adjusting the weights of hard-to-detect instances through focal loss. Experimental results demonstrate that our model achieves a 0.8-8.1% improvement in mean average precision (mAP) on the Tianchi dataset and a 1.6-13.2% improvement on our custom dataset, outperforming other state-of-the-art methods.
zh
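FECIoU 的思路是在 IoU 系列损失上叠加 focal 式权重,让难检测(IoU 低)的样本占据更大的损失比重。以下是只含 IoU 部分的简化草图:完整的 CIoU 还包含中心距离与长宽比惩罚项,此处的权重形式 (1-IoU)^γ 是常见 focal 加权的假设写法,未必与论文公式一致。

```python
def iou(box_a, box_b):
    """box = (x1, y1, x2, y2),返回两个框的交并比。"""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def focal_iou_loss(box_pred, box_gt, gamma=0.5):
    """focal 式加权:IoU 越低(样本越难),权重 (1 - IoU)^gamma 越大。"""
    i = iou(box_pred, box_gt)
    return (1.0 - i) ** gamma * (1.0 - i)   # 基础损失取 1 - IoU
```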
[CV-16] Improved Training Technique for Latent Consistency Models ICLR2025
【速读】:该论文旨在解决一致性模型在大规模数据集上训练时,特别是在文本到图像和视频生成任务中的性能退化问题。论文的关键在于分析了像素空间与隐空间之间的统计差异,并发现隐空间数据中存在高度尖峰的离群值,严重影响了一致性模型在隐空间中的表现。为了解决这一问题,论文提出了采用Cauchy损失替换Pseudo-Huber损失以减轻离群值的影响,并引入扩散损失和最优传输(Optimal Transport, OT)耦合以进一步提升性能。此外,论文还引入自适应缩放调度器和非缩放LayerNorm来管理稳健的训练过程并更好地捕捉特征统计信息,从而减少离群值的影响。通过这些策略,成功训练出能够在一到两步内进行高质量采样的一致性模型,显著缩小了一致性模型与扩散模型之间的性能差距。
链接: https://arxiv.org/abs/2502.01441
作者: Quan Dao,Khanh Doan,Di Liu,Trung Le,Dimitris Metaxas
机构: Rutgers University; VinAI Research; Monash University
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICLR2025
点击查看摘要
Abstract:Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling- c scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models. The implementation is released here: this https URL
zh
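论文把 iCT 的 Pseudo-Huber 损失替换为对脉冲式离群值更不敏感的 Cauchy 损失。下面对比两者对残差 d 的惩罚曲线(c 为尺度参数;Cauchy 的系数写法采用常见形式,可能与论文实现略有出入):小残差处两者都近似二次,而大残差处 Pseudo-Huber 线性增长、Cauchy 仅对数增长,因此隐空间中的离群值对 Cauchy 损失的主导性更弱。

```python
import numpy as np

def pseudo_huber(d, c=1.0):
    """Pseudo-Huber:小残差近似二次,大残差近似线性增长。"""
    return c * c * (np.sqrt(1.0 + (d / c) ** 2) - 1.0)

def cauchy(d, c=1.0):
    """Cauchy(Lorentzian):大残差仅对数增长,对离群值更鲁棒。"""
    return 0.5 * c * c * np.log1p((d / c) ** 2)
```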
[CV-17] Structural features of the fly olfactory circuit mitigate the stability-plasticity dilemma in continual learning
【速读】:该论文旨在解决人工神经网络在持续学习过程中面临的稳定性-可塑性困境,同时尝试借鉴生物策略来提升机器学习算法。论文的关键解决方案在于引入了一个简化的果蝇嗅觉回路模型(Fly Model),该模型能够与现代机器学习方法结合使用,以增强记忆稳定性和学习可塑性,从而克服当前持续学习策略的局限性。
链接: https://arxiv.org/abs/2502.01427
作者: Heming Zou,Yunliang Zang,Xiangyang Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:
点击查看摘要
Abstract:Artificial neural networks face the stability-plasticity dilemma in continual learning, while the brain can maintain memories and remain adaptable. However, the biological strategies for continual learning and their potential to inspire learning algorithms in neural networks are poorly understood. This study presents a minimal model of the fly olfactory circuit to investigate the biological strategies that support continual odor learning. We introduce the fly olfactory circuit as a plug-and-play component, termed the Fly Model, which can integrate with modern machine learning methods to address this dilemma. Our findings demonstrate that the Fly Model enhances both memory stability and learning plasticity, overcoming the limitations of current continual learning strategies. We validated its effectiveness across various challenging continual learning scenarios using commonly used datasets. The fly olfactory system serves as an elegant biological circuit for lifelong learning, offering a module that enhances continual learning with minimal additional computational cost for machine learning.
zh
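果蝇嗅觉回路常被抽象为"稀疏随机投影 + 赢者通吃"的扩展编码(PN 层投影到数量大得多的 KC 层,只保留最强的少数激活)。论文的 Fly Model 作为可插拔模块的具体结构摘要中未给出,以下只是这一生物回路的常见计算抽象草图,维度与稀疏率均为假设值:

```python
import numpy as np

def fly_hash(x, proj, k=16):
    """稀疏随机投影到高维 KC 层,再保留 top-k 激活(赢者通吃)。"""
    kc = proj @ x                 # 扩展编码:d 维 -> m 维 (m >> d)
    out = np.zeros_like(kc)
    top = np.argsort(kc)[-k:]     # 仅最强的 k 个 KC 保持激活,其余清零
    out[top] = kc[top]
    return out

rng = np.random.default_rng(0)
d, m = 50, 2000
# 每个 KC 仅随机连接约 10% 的输入(0/1 稀疏投影矩阵)
proj = (rng.random((m, d)) < 0.1).astype(float)
```

这种高维稀疏编码使不同输入的表征近似正交,被认为有助于缓解持续学习中的灾难性遗忘。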
[CV-18] Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成详细图像描述时,由于响应长度增加导致视觉注意力减弱和噪声增加的问题。这限制了模型在精确度(Precision)与召回率(Recall)之间的平衡。为了解决这一问题,论文提出了一种名为SPARC(Selective Progressive Attention ReCalibration)的方法。SPARC的关键在于通过选择性增强视觉标记的影响来改善解码过程中的视觉注意力,从而同时提升精确度和召回率,且计算开销极小。
链接: https://arxiv.org/abs/2502.01419
作者: Mingi Jung,Saehuyng Lee,Eunji Kim,Sungroh Yoon
机构: Seoul National University(首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as visual attention gradually weakens, SPARC reinforces it to preserve its influence. Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall. In contrast, our proposed method enhances both precision and recall with minimal computational overhead.
zh
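SPARC 的三个观察可归纳为:用跨时间步的注意力差异定位关键视觉 token,只放大这一部分注意力,并随生成长度补偿整体视觉注意力的衰减。以下是一个与具体模型无关的极简示意(选择比例与放大系数均为假设值,实际方法作用于 MLLM 解码时的注意力层):

```python
import numpy as np

def sparc_recalibrate(attn_t, attn_prev, alpha=1.5, top_ratio=0.2):
    """放大"跨时间步差异最大"的视觉 token 注意力,其余保持不变。"""
    diff = np.abs(attn_t - attn_prev)          # 观察(2):用时间步间差异定位关键 token
    k = max(1, int(len(attn_t) * top_ratio))
    critical = np.argsort(diff)[-k:]           # 差异最大的 top-k 视觉 token
    out = attn_t.copy()
    out[critical] *= alpha                     # 观察(1)/(3):只放大这部分并强化
    return out / out.sum()                     # 重新归一化为注意力分布
```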
[CV-19] Human Body Restoration with One-Step Diffusion Model and A New Benchmark
【速读】:该论文旨在解决人体图像复原领域中高质量基准数据集缺乏的问题。为了解决这一难题,论文提出了一种高质量数据集自动裁剪与筛选(high-quality dataset automated cropping and filtering, HQ-ACF)管道,利用现有的目标检测数据集和其他未标注图像自动裁剪和筛选高质量的人体图像,从而构建了一个包含训练、验证和测试集的基于人物的复杂对象与自然活动复原(PERSONA)数据集。此外,论文还提出了一个新颖的单步扩散模型(one-step diffusion model for human body restoration, OSDHuman),其中引入了高保真图像嵌入器(High-Fidelity Image Embedder, HFIE)作为提示生成器,以更好地利用低质量人体图像信息引导模型,有效避免误导性提示。实验结果表明,OSDHuman在视觉质量和定量指标上均优于现有方法。
链接: https://arxiv.org/abs/2502.01411
作者: Jue Gong,Jingkai Wang,Zheng Chen,Xing Liu,Hong Gu,Yulun Zhang,Xiaokang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures. The code and model will be available at this https URL
点击查看摘要
Abstract:Human body restoration, as a specific application of image restoration, is widely applied in practice and plays a vital role across diverse fields. However, thorough research remains difficult, particularly due to the lack of benchmark datasets. In this study, we propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. Using this pipeline, we constructed a person-based restoration with sophisticated objects and natural activities (PERSONA) dataset, which includes training, validation, and test sets. The dataset significantly surpasses other human-related datasets in both quality and content richness. Finally, we propose OSDHuman, a novel one-step diffusion model for human body restoration. Specifically, we propose a high-fidelity image embedder (HFIE) as the prompt generator to better guide the model with low-quality human image information, effectively avoiding misleading prompts. Experimental results show that OSDHuman outperforms existing methods in both visual quality and quantitative metrics. The dataset and code will be available at this https URL.
zh
[CV-20] FourieRF: Few-Shot NeRFs via Progressive Fourier Frequency Control
【速读】:该论文旨在解决少样本场景下的快速高质量重建问题。解决方案的关键在于通过显式的课程训练程序有效地参数化特征,并在优化过程中逐步增加场景复杂度。这种方法产生的先验既稳健又具有广泛的适应性,从而建立了FourieRF作为少样本渲染问题中的强大且通用的基准方法。尽管如此,该方法在严重欠约束场景下仍可能导致重建误差,特别是在视图遮挡导致形状部分未被覆盖的情况下。
链接: https://arxiv.org/abs/2502.01405
作者: Diego Gomez,Bingchen Gong,Maks Ovsjanikov
机构: LIX, École Polytechnique, IP Paris (LIX, 巴黎综合理工学院, IP Paris)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3DV 2025 conference
点击查看摘要
Abstract:In this work, we introduce FourieRF, a novel approach for achieving fast and high-quality reconstruction in the few-shot setting. Our method effectively parameterizes features through an explicit curriculum training procedure, incrementally increasing scene complexity during optimization. Experimental results show that the prior induced by our approach is both robust and adaptable across a wide variety of scenes, establishing FourieRF as a strong and versatile baseline for the few-shot rendering problem. While our approach significantly reduces artifacts, it may still lead to reconstruction errors in severely under-constrained scenarios, particularly where view occlusion leaves parts of the shape uncovered. In the future, our method could be enhanced by integrating foundation models to complete missing parts using large data-driven priors.
zh
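FourieRF 的课程式训练可以理解为:随训练进度逐步放开位置编码/特征中的高频分量,先用低频拟合整体结构,再引入高频细节。下面用一个随进度 t∈[0,1] 线性展开的频率掩码作示意(掩码的具体函数形式是我们假设的示例,并非论文公式):

```python
import numpy as np

def frequency_mask(num_freqs, progress):
    """progress∈[0,1]:训练早期只保留低频分量,后期逐步放开高频。"""
    j = np.arange(num_freqs, dtype=np.float64)
    # 每个频带的权重随进度线性地从 0 升到 1,低频先被激活
    return np.clip(progress * num_freqs - j, 0.0, 1.0)

def masked_encoding(x, num_freqs, progress):
    """带频率掩码的正弦位置编码(标量输入的简化版)。"""
    m = frequency_mask(num_freqs, progress)
    freqs = 2.0 ** np.arange(num_freqs)
    return np.concatenate([m * np.sin(freqs * x), m * np.cos(freqs * x)])
```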
[CV-21] Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection
【速读】:该论文旨在解决3D视觉定位(3D Visual Grounding, 3DVG)任务中的两个主要挑战:一是监督方法依赖于稀缺且高成本的3D视觉-语言数据集;二是基于大型语言模型/视觉语言模型(LLM/VLM)的方法在推理过程中需要耗费大量时间和令牌。为了解决这些问题,论文提出了一种名为可进化符号视觉定位器(Evolvable Symbolic Visual Grounder, EaSe)的新型无训练符号框架。EaSe通过使用LLM生成的代码来计算空间关系,并实现了一个自动流水线来评估和优化这些代码的质量以及整合VLM以辅助定位过程。关键在于,EaSe显著降低了推理成本,同时保持了与基于代理的方法相当的性能,在Nr3D数据集上达到了52.9%的准确率,在ScanRefer上达到了49.2% Acc@0.25,从而在性能和效率之间实现了良好的平衡。
链接: https://arxiv.org/abs/2502.01401
作者: Boyu Mi,Hanqing Wang,Tai Wang,Yilun Chen,Jiangmiao Pang
机构: Shanghai AI Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D visual grounding (3DVG) is challenging because of the requirement of understanding on visual information, language and spatial relationships. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high cost of 3D vision-language datasets. On the other hand, LLM/VLM based agents are proposed for 3DVG, eliminating the need for training data. However, these methods incur prohibitive time and token costs during inference. To address the challenges, we introduce a novel training-free symbolic framework for 3D visual grounding, namely Evolvable Symbolic Visual Grounder, that offers significantly reduced inference costs compared to previous agent-based methods while maintaining comparable performance. EaSe uses LLM generated codes to compute on spatial relationships. EaSe also implements an automatic pipeline to evaluate and optimize the quality of these codes and integrate VLMs to assist in the grounding process. Experimental results demonstrate that EaSe achieves 52.9% accuracy on Nr3D dataset and 49.2% Acc@0.25 on ScanRefer, which is top-tier among training-free methods. Moreover, it substantially reduces the inference time and cost, offering a balanced trade-off between performance and efficiency. Codes are available at this https URL.
zh
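EaSe 用 LLM 生成的代码来计算空间关系。下面是这类"关系函数"的一个典型样例(由我们手写用于说明,并非论文的生成产物):基于 3D 包围盒中心判断"A 在 B 左边""离某锚点最近"等谓词,坐标轴约定均为假设。

```python
def center(box):
    """box = (xmin, ymin, zmin, xmax, ymax, zmax),返回中心点。"""
    return tuple((box[i] + box[i + 3]) / 2.0 for i in range(3))

def is_left_of(box_a, box_b, margin=0.05):
    """以 x 轴为左右方向的简单判定(坐标约定为假设)。"""
    return center(box_a)[0] + margin < center(box_b)[0]

def nearest_to(anchor_box, candidate_boxes):
    """返回与 anchor 中心欧氏距离最近的候选下标。"""
    ax, ay, az = center(anchor_box)
    def dist2(b):
        cx, cy, cz = center(b)
        return (cx - ax) ** 2 + (cy - ay) ** 2 + (cz - az) ** 2
    return min(range(len(candidate_boxes)), key=lambda i: dist2(candidate_boxes[i]))
```

框架的"进化"部分在于:由 LLM 生成、评估并迭代优化这类函数的实现质量,再组合它们完成指代定位。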
[CV-22] Learning Traffic Anomalies from Generative Models on Real-Time Observations
【速读】:该论文旨在解决城市交通管理中实时交通异常检测的问题。解决方案的关键在于采用时空生成对抗网络(STGAN)框架,结合图神经网络(Graph Neural Networks)和长短时记忆网络(Long Short-Term Memory networks),以捕捉交通数据中的复杂时空依赖关系。
链接: https://arxiv.org/abs/2502.01391
作者: Fotis I. Giasemis,Alexandros Sopasakis
机构: LIP6 (LIP6), LPNHE (LPNHE); Sorbonne Université (索邦大学); CNRS, IN2P3 (法国国家科学研究中心, IN2P3); Department of Mathematics (数学系); Lund University (隆德大学); Lund, Scania, Sweden (瑞典)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate detection of traffic anomalies is crucial for effective urban traffic management and congestion mitigation. We use the Spatiotemporal Generative Adversarial Network (STGAN) framework combining Graph Neural Networks and Long Short-Term Memory networks to capture complex spatial and temporal dependencies in traffic data. We apply STGAN to real-time, minute-by-minute observations from 42 traffic cameras across Gothenburg, Sweden, collected over several months in 2020. The images are processed to compute a flow metric representing vehicle density, which serves as input for the model. Training is conducted on data from April to November 2020, and validation is performed on a separate dataset from November 14 to 23, 2020. Our results demonstrate that the model effectively detects traffic anomalies with high precision and low false positive rates. The detected anomalies include camera signal interruptions, visual artifacts, and extreme weather conditions affecting traffic flow.
zh
[CV-23] Detecting Backdoor Samples in Contrastive Language Image Pretraining ICLR2025
【速读】:该论文旨在解决CLIP模型在大规模预训练过程中易受中毒后门攻击的问题。论文的关键在于发现中毒样本在局部子空间中的独特表征特征,即它们的局部邻域比干净样本更加稀疏。基于这一发现,论文提出使用传统的基于密度比的局部异常检测器来有效地检测这些后门攻击,而现有的方法则无法胜任。实验结果表明,这种方法可以高效地清理大规模网络数据集(如CC3M)中的后门污染,耗时仅需15分钟。
链接: https://arxiv.org/abs/2502.01385
作者: Hanxun Huang,Sarah Erfani,Yige Li,Xingjun Ma,James Bailey
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR2025
点击查看摘要
Abstract:Contrastive language-image pretraining (CLIP) has been found to be vulnerable to poisoning backdoor attacks where the adversary can achieve an almost perfect attack success rate on CLIP models by poisoning only 0.01% of the training dataset. This raises security concerns on the current practice of pretraining large-scale models on unscrutinized web data using CLIP. In this work, we analyze the representations of backdoor-poisoned samples learned by CLIP models and find that they exhibit unique characteristics in their local subspace, i.e., their local neighborhoods are far more sparse than that of clean samples. Based on this finding, we conduct a systematic study on detecting CLIP backdoor attacks and show that these attacks can be easily and efficiently detected by traditional density ratio-based local outlier detectors, whereas existing backdoor sample detection methods fail. Our experiments also reveal that an unintentional backdoor already exists in the original CC3M dataset and has been trained into a popular open-source model released by OpenCLIP. Based on our detector, one can clean up a million-scale web dataset (e.g., CC3M) efficiently within 15 minutes using 4 Nvidia A100 GPUs. The code is publicly available in our GitHub repository (this https URL).
zh
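该检测思路的要点是:中毒样本在表征空间的局部邻域明显比干净样本稀疏。论文使用传统的基于密度比的局部离群点检测器(如 LOF);下面给出一个只用 numpy 的简化"近邻稀疏度"评分草图(并非 LOF 的完整密度比实现):

```python
import numpy as np

def knn_sparsity_score(embeddings, k=5):
    """对每个样本返回其 k 近邻平均距离:分数越大,局部邻域越稀疏。"""
    d2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)               # 排除样本自身
    knn_d2 = np.sort(d2, axis=1)[:, :k]        # 每行最小的 k 个平方距离
    return np.sqrt(knn_d2).mean(axis=1)
```

对 CLIP 表征计算该分数并复核高分样本,即可近似复现"稀疏邻域 = 疑似后门样本"的筛查流程;大规模数据上需改用近似近邻检索替代这里的全量距离矩阵。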
[CV-24] Inverse Bridge Matching Distillation
【速读】:该论文旨在解决扩散桥接模型(Diffusion Bridge Models, DBMs)在图像到图像翻译应用中的慢推理速度问题。关键解决方案在于提出了一种基于逆向桥接匹配公式的新颖蒸馏技术,并推导出实用的可解目标函数。此方法能够蒸馏条件和非条件类型的DBMs,通过一步生成器进行蒸馏,并仅使用被破坏的图像进行训练。
链接: https://arxiv.org/abs/2502.01362
作者: Nikita Gushchin,David Li,Daniil Selikhanovych,Evgeny Burnaev,Dmitry Baranchuk,Alexander Korotin
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models in a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and even provide better generation quality than used teacher model depending on particular setup.
zh
[CV-25] Bayesian Approximation-Based Trajectory Prediction and Tracking with 4D Radar
【速读】:该论文旨在解决在恶劣天气条件下,基于LiDAR和摄像头的多目标跟踪(MOT)方法性能下降的问题,同时指出雷达方法虽然稳健但存在垂直分辨率有限和运动模型简单的问题。现有基于卡尔曼滤波的方法依赖固定的噪声协方差,导致其在对象突然机动时适应性较差。论文的关键解决方案在于提出Bayes-4DRTrack框架,采用基于变换器的运动预测网络以捕捉非线性运动动态,并在检测和预测步骤中使用贝叶斯近似。此外,两阶段数据关联利用多普勒测量来更好地分辨接近的目标。这些改进使得Bayes-4DRTrack在K-Radar数据集上的平均多目标跟踪精度(AMOTA)提升了5.7%,展示了其在严苛实际条件下的增强鲁棒性和准确性。
链接: https://arxiv.org/abs/2502.01357
作者: Dong-In Kim,Dong-Hee Paek,Seung-Hyun Song,Seung-Hyun Kong
机构: Korea Advanced Institute of Science and Technology(韩国科学技术院); Hyundai Motor Company(现代汽车公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6pages, 4 figures
点击查看摘要
Abstract:Accurate 3D multi-object tracking (MOT) is vital for autonomous vehicles, yet LiDAR and camera-based methods degrade in adverse weather. Meanwhile, Radar-based solutions remain robust but often suffer from limited vertical resolution and simplistic motion models. Existing Kalman filter-based approaches also rely on fixed noise covariance, hampering adaptability when objects make sudden maneuvers. We propose Bayes-4DRTrack, a 4D Radar-based MOT framework that adopts a transformer-based motion prediction network to capture nonlinear motion dynamics and employs Bayesian approximation in both detection and prediction steps. Moreover, our two-stage data association leverages Doppler measurements to better distinguish closely spaced targets. Evaluated on the K-Radar dataset (including adverse weather scenarios), Bayes-4DRTrack demonstrates a 5.7% gain in Average Multi-Object Tracking Accuracy (AMOTA) over methods with traditional motion models and fixed noise covariance. These results showcase enhanced robustness and accuracy in demanding, real-world conditions.
zh
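两阶段数据关联利用多普勒测量来区分空间上接近的目标。下面以"中心距离 + 多普勒速度差门控"的代价矩阵作极简示意(权重与门限均为假设值;论文的完整流程还包含基于 Transformer 的运动预测与检测/预测步的贝叶斯近似):

```python
import numpy as np

def association_cost(tracks, dets, w_doppler=2.0, gate=1e6):
    """tracks/dets: (N, 3) 数组,各列为 (x, y, doppler 速度)。"""
    pos_t, dop_t = tracks[:, :2], tracks[:, 2]
    pos_d, dop_d = dets[:, :2], dets[:, 2]
    dist = np.linalg.norm(pos_t[:, None] - pos_d[None, :], axis=-1)
    dop = np.abs(dop_t[:, None] - dop_d[None, :])
    cost = dist + w_doppler * dop
    cost[dop > 5.0] = gate      # 门控:多普勒差过大的轨迹-检测配对直接剔除
    return cost
```

代价矩阵随后可交给匈牙利算法等求解最优匹配;多普勒项使空间上几乎重叠、但径向速度不同的目标得以分辨。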
[CV-26] Quasi-Conformal Convolution : A Learnable Convolution for Deep Learning on Riemann Surfaces
【速读】:该论文旨在解决在非欧几里得域上定义卷积操作的挑战,特别是在分析复杂几何数据时缺乏常见坐标系和熟悉的欧几里得属性的问题。解决方案的关键是引入了一种名为拟共形卷积(Quasi-conformal Convolution, QCC)的新框架,通过利用可训练的估计模块生成拟共形映射,实现了适应性和可学习的卷积算子,这些算子可以根据底层数据结构动态调整。QCC统一了广泛的空间定义卷积,促进了在每个基础曲面上针对特定任务优化的定制卷积算子的学习。基于此,开发了拟共形卷积神经网络(QCCNN),验证了其在分类定义于曲面流形上的图像以及在医学应用中的有效性,包括三维面部数据的颅面分析和三维人脸上的病变分割。
链接: https://arxiv.org/abs/2502.01356
作者: Han Zhang,Tsz Lok Ip,Lok Ming Lui
机构: City University of Hong Kong(香港城市大学); Chinese University of Hong Kong(香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep learning on non-Euclidean domains is important for analyzing complex geometric data that lacks common coordinate systems and familiar Euclidean properties. A central challenge in this field is to define convolution on domains, which inherently possess irregular and non-Euclidean structures. In this work, we introduce Quasi-conformal Convolution (QCC), a novel framework for defining convolution on Riemann surfaces using quasi-conformal theories. Each QCC operator is linked to a specific quasi-conformal mapping, enabling the adjustment of the convolution operation through manipulation of this mapping. By utilizing trainable estimator modules that produce Quasi-conformal mappings, QCC facilitates adaptive and learnable convolution operators that can be dynamically adjusted according to the underlying data structured on Riemann surfaces. QCC unifies a broad range of spatially defined convolutions, facilitating the learning of tailored convolution operators on each underlying surface optimized for specific tasks. Building on this foundation, we develop the Quasi-Conformal Convolutional Neural Network (QCCNN) to address a variety of tasks related to geometric data. We validate the efficacy of QCCNN through the classification of images defined on curvilinear Riemann surfaces, demonstrating superior performance in this context. Additionally, we explore its potential in medical applications, including craniofacial analysis using 3D facial data and lesion segmentation on 3D human faces, achieving enhanced accuracy and reliability.
zh
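As background on the quasi-conformal maps that parameterize each QCC operator: in the standard theory (not taken verbatim from the paper) they are solutions of the Beltrami equation,

```latex
\frac{\partial f}{\partial \bar{z}} \;=\; \mu(z)\,\frac{\partial f}{\partial z},
\qquad \|\mu\|_{\infty} < 1 .
```

The Beltrami coefficient \(\mu\) encodes the local anisotropic distortion of \(f\), and \(\mu \equiv 0\) recovers a conformal map; this is why manipulating the mapping (via the trainable estimator modules) adjusts the induced convolution.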
[CV-27] ConceptVAE: Self-Supervised Fine-Grained Concept Disentanglement from 2D Echocardiographies
【速读】: This paper addresses the limitation of traditional self-supervised learning methods in capturing fine-grained concepts such as anatomical structures or organs. The key is the ConceptVAE framework, which detects and disentangles fine-grained concepts from their style characteristics in a self-supervised manner. The framework comprises a suite of loss terms and model architecture components designed to discretize the input data into a preset number of concepts along with their local styles.
链接: https://arxiv.org/abs/2502.01335
作者: Costin F. Ciusdel,Alex Serban,Tiziano Passerini
机构: Siemens SRL (西门子股份公司), Brasov, Romania; Siemens Healthineers (西门子医疗), Princeton, NJ, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While traditional self-supervised learning methods improve performance and robustness across various medical tasks, they rely on single-vector embeddings that may not capture fine-grained concepts such as anatomical structures or organs. The ability to identify such concepts and their characteristics without supervision has the potential to improve pre-training methods, and enable novel applications such as fine-grained image retrieval and concept-based outlier detection. In this paper, we introduce ConceptVAE, a novel pre-training framework that detects and disentangles fine-grained concepts from their style characteristics in a self-supervised manner. We present a suite of loss terms and model architecture primitives designed to discretise input data into a preset number of concepts along with their local style. We validate ConceptVAE both qualitatively and quantitatively, demonstrating its ability to detect fine-grained anatomical structures such as blood pools and septum walls from 2D cardiac echocardiographies. Quantitatively, ConceptVAE outperforms traditional self-supervised methods in tasks such as region-based instance retrieval, semantic segmentation, out-of-distribution detection, and object detection. Additionally, we explore the generation of in-distribution synthetic data that maintains the same concepts as the training data but with distinct styles, highlighting its potential for more calibrated data generation. Overall, our study introduces and validates a promising new pre-training technique based on concept-style disentanglement, opening multiple avenues for developing models for medical image analysis that are more interpretable and explainable than black-box approaches.
zh
[CV-28] CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
【速读】: This paper tackles the spurious correlations in category-level object pose estimation caused by "unclean" confounders in models, which hurt performance on unseen instances with significant variations. The key of the proposed CleanPose is the combination of causal learning and knowledge distillation: a causal inference module based on front-door adjustment mitigates the negative effect of unobserved confounders, reducing potential spurious correlations and promoting unbiased estimation, while a residual-based knowledge distillation method improves generalization by providing comprehensive category-level guidance.
链接: https://arxiv.org/abs/2502.01312
作者: Xiao Lin,Yun Peng,Liuyi Wang,Xianyou Zhong,Minghao Zhu,Jingwei Yang,Chengju Liu,Qijun Chen
机构: School of Electronic and Information Engineering, Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Category-level object pose estimation aims to recover the rotation, translation and size of unseen instances within predefined categories. In this task, deep neural network-based methods have demonstrated remarkable performance. However, previous studies show they suffer from spurious correlations raised by “unclean” confounders in models, hindering their performance on novel instances with significant variations. To address this issue, we propose CleanPose, a novel approach integrating causal learning and knowledge distillation to enhance category-level pose estimation. To mitigate the negative effect of unobserved confounders, we develop a causal inference module based on front-door adjustment, which promotes unbiased estimation by reducing potential spurious correlations. Additionally, to further improve generalization ability, we devise a residual-based knowledge distillation method that has proven effective in providing comprehensive category information guidance. Extensive experiments across multiple benchmarks (REAL275, CAMERA25 and HouseCat6D) highlight the superiority of proposed CleanPose over state-of-the-art methods. Code will be released.
zh
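For reference, the front-door adjustment that the causal inference module builds on has the standard textbook form (how CleanPose instantiates the treatment X, mediator M, and outcome Y over pose features is specific to the paper):

```latex
P\!\left(Y \mid do(X{=}x)\right)
\;=\; \sum_{m} P(M{=}m \mid X{=}x) \sum_{x'} P(Y \mid X{=}x',\, M{=}m)\, P(X{=}x') .
```

Intuitively, the mediator M carries the effect of X on Y through a path not confounded by the unobserved variables, which is what allows unbiased estimation without observing the confounders themselves.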
[CV-29] Heterogeneous Image GNN: Graph-Conditioned Diffusion for Image Synthesis
【速读】: This paper addresses the difficulty diffusion-based image synthesis models have in conditioning on heterogeneous graph data in complex scenes. Existing methods typically inject conditioning variables directly into the architecture, via cross-attention layers or image concatenation, and struggle to handle complex conditioning inputs with diverse relationships efficiently. The proposed Heterogeneous Image Graphs (HIG) representation models the conditioning variables and the target image as two interconnected graphs, enabling efficient handling of variable-length conditioning inputs and their relationships. A magnitude-preserving GNN integrates the HIG into the existing EDM2 diffusion model via a ControlNet approach. The key is that HIG better represents and processes complex conditional relationships, improving performance on the COCO-stuff and Visual Genome datasets.
链接: https://arxiv.org/abs/2502.01309
作者: Rupert Menneer,Christos Margadji,Sebastian W. Pattinson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce a novel method for conditioning diffusion-based image synthesis models with heterogeneous graph data. Existing approaches typically incorporate conditioning variables directly into model architectures, either through cross-attention layers that attend to text latents or image concatenation that spatially restrict generation. However, these methods struggle to handle complex scenarios involving diverse, relational conditioning variables, which are more naturally represented as unstructured graphs. This paper presents Heterogeneous Image Graphs (HIG), a novel representation that models conditioning variables and target images as two interconnected graphs, enabling efficient handling of variable-length conditioning inputs and their relationships. We also propose a magnitude-preserving GNN that integrates the HIG into the existing EDM2 diffusion model using a ControlNet approach. Our approach improves upon the SOTA on a variety of conditioning inputs for the COCO-stuff and Visual Genome datasets, and showcases the ability to condition on graph attributes and relationships represented by edges in the HIG.
zh
[CV-30] Partial Channel Network: Compute Fewer Perform Better
【速读】: This paper addresses the challenge of designing modules or mechanisms that keep a network's parameters and FLOPs low without sacrificing accuracy and throughput. The key is to exploit the redundancy within feature-map channels through a new partial channel mechanism (PCM): a split operation divides the feature-map channels into parts, each handled by a different operation such as convolution, attention, pooling, or identity mapping. On this basis, a novel partial attention convolution (PATConv) efficiently combines convolution with visual attention, and a dynamic partial convolution (DPConv) adaptively learns the split-channel ratio in different layers for better trade-offs. Together these form PartialNet, which achieves higher top-1 accuracy and inference speed than several state-of-the-art (SOTA) models on ImageNet-1K classification, along with strong detection and segmentation results on the COCO dataset.
链接: https://arxiv.org/abs/2502.01303
作者: Haiduo Huang,Tian Xia,Wenzhe Zhao,Pengju Ren
机构: Xi’an Jiaotong University (西安交通大学) · Institute of Artificial Intelligence and Robotics (人工智能与机器人研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Designing a module or mechanism that enables a network to maintain low parameters and FLOPs without sacrificing accuracy and throughput remains a challenge. To address this challenge and exploit the redundancy within feature map channels, we propose a new solution: partial channel mechanism (PCM). Specifically, through the split operation, the feature map channels are divided into different parts, with each part corresponding to different operations, such as convolution, attention, pooling, and identity mapping. Based on this assumption, we introduce a novel partial attention convolution (PATConv) that can efficiently combine convolution with visual attention. Our exploration indicates that the PATConv can completely replace both the regular convolution and the regular visual attention while reducing model parameters and FLOPs. Moreover, PATConv can derive three new types of blocks: Partial Channel-Attention block (PAT_ch), Partial Spatial-Attention block (PAT_sp), and Partial Self-Attention block (PAT_sf). In addition, we propose a novel dynamic partial convolution (DPConv) that can adaptively learn the proportion of split channels in different layers to achieve better trade-offs. Building on PATConv and DPConv, we propose a new hybrid network family, named PartialNet, which achieves superior top-1 accuracy and inference speed compared to some SOTA models on ImageNet-1K classification and excels in both detection and segmentation on the COCO dataset. Our code is available at this https URL.
zh
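A minimal sketch of the partial channel mechanism described above, with hypothetical branch choices (identity, sigmoid channel gating, and 3x3 mean smoothing as simple stand-ins for the paper's identity/attention/pooling branches) and NumPy arrays in (C, H, W) layout:

```python
import numpy as np

def partial_channel_mixer(x, ratios=(0.5, 0.25, 0.25)):
    """Toy sketch of a partial channel mechanism (PCM).

    x: feature map of shape (C, H, W). Channels are split into parts and
    each part goes through a different operation; identity, channel gating,
    and 3x3 mean smoothing are assumed stand-ins for the paper's branches.
    """
    c = x.shape[0]
    sizes = [int(round(r * c)) for r in ratios]
    sizes[-1] = c - sum(sizes[:-1])            # make the split exact
    parts = np.split(x, np.cumsum(sizes)[:-1], axis=0)

    out = [parts[0]]                           # identity branch

    # squeeze-and-excite style gating on the second part
    gate = 1.0 / (1.0 + np.exp(-parts[1].mean(axis=(1, 2), keepdims=True)))
    out.append(parts[1] * gate)

    # 3x3 mean smoothing on the third part (edge padding keeps the size)
    p = np.pad(parts[2], ((0, 0), (1, 1), (1, 1)), mode="edge")
    h, w = parts[2].shape[1], parts[2].shape[2]
    smoothed = sum(p[:, i:i + h, j:j + w]
                   for i in range(3) for j in range(3)) / 9.0
    out.append(smoothed)

    return np.concatenate(out, axis=0)         # shape preserved: (C, H, W)
```

Only a fraction of the channels pays for the expensive branch, which is the source of the parameter and FLOP savings claimed above.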
[CV-31] XR-VIO: High-precision Visual Inertial Odometry with Fast Initialization for XR Applications
【速读】: This paper addresses the instability of Visual Inertial Odometry (VIO) initialization and the limited efficiency and accuracy of feature matching. The key is a new visual-inertial initialization pipeline that tightly couples gyroscope measurements to improve the robustness and accuracy of visual Structure from Motion (SfM), together with a hybrid feature-matching method that combines optical flow with descriptor-based matching to achieve efficient, accurate, and robust tracking.
链接: https://arxiv.org/abs/2502.01297
作者: Shangjin Zhai,Nan Wang,Xiaomeng Wang,Danpeng Chen,Weijian Xie,Hujun Bao,Guofeng Zhang
机构: SenseTime Research; State Key Lab of CAD&CG, Zhejiang University; Tetras.AI; State Key Lab of CAD&CG, Zhejiang University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents a novel approach to Visual Inertial Odometry (VIO), focusing on the initialization and feature matching modules. Existing methods for initialization often suffer from either poor stability in visual Structure from Motion (SfM) or fragility in solving a huge number of parameters simultaneously. To address these challenges, we propose a new pipeline for visual inertial initialization that robustly handles various complex scenarios. By tightly coupling gyroscope measurements, we enhance the robustness and accuracy of visual SfM. Our method demonstrates stable performance even with only four image frames, yielding competitive results. In terms of feature matching, we introduce a hybrid method that combines optical flow and descriptor-based matching. By leveraging the robustness of continuous optical flow tracking and the accuracy of descriptor matching, our approach achieves efficient, accurate, and robust tracking results. Through evaluation on multiple benchmarks, our method demonstrates state-of-the-art performance in terms of accuracy and success rate. Additionally, a video demonstration on mobile devices showcases the practical applicability of our approach in the field of Augmented Reality/Virtual Reality (AR/VR).
zh
[CV-32] A Framework for Double-Blind Federated Adaptation of Foundation Models
【速读】: This paper addresses how to adapt pre-trained foundation models (FMs) to specific downstream tasks in a double-blind federated manner across data silos, using fully homomorphic encryption (FHE). The key is to first decompose the FM into a sequence of FHE-friendly blocks via knowledge distillation, and then adapt it to the downstream task with low-rank parallel adapters that can be learned without backpropagation through the FM. In addition, a privacy-preserving permutation scheme prevents data owners from learning the FM through model-extraction attacks, and a secure aggregation protocol is employed for federated learning of the low-rank parallel adapters.
链接: https://arxiv.org/abs/2502.01289
作者: Nurbek Tastan,Karthik Nandakumar
机构: Mohamed bin Zayed University of AI (MBZUAI); Michigan State University (MSU)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
点击查看摘要
Abstract:The availability of foundational models (FMs) pre-trained on large-scale data has advanced the state-of-the-art in many computer vision tasks. While FMs have demonstrated good zero-shot performance on many image classification tasks, there is often scope for performance improvement by adapting the FM to the downstream task. However, the data that is required for this adaptation typically exists in silos across multiple entities (data owners) and cannot be collated at a central location due to regulations and privacy concerns. At the same time, a learning service provider (LSP) who owns the FM cannot share the model with the data owners due to proprietary reasons. In some cases, the data owners may not even have the resources to store such large FMs. Hence, there is a need for algorithms to adapt the FM in a double-blind federated manner, i.e., the data owners do not know the FM or each other’s data, and the LSP does not see the data for the downstream tasks. In this work, we propose a framework for double-blind federated adaptation of FMs using fully homomorphic encryption (FHE). The proposed framework first decomposes the FM into a sequence of FHE-friendly blocks through knowledge distillation. The resulting FHE-friendly model is adapted for the downstream task via low-rank parallel adapters that can be learned without backpropagation through the FM. Since the proposed framework requires the LSP to share intermediate representations with the data owners, we design a privacy-preserving permutation scheme to prevent the data owners from learning the FM through model extraction attacks. Finally, a secure aggregation protocol is employed for federated learning of the low-rank parallel adapters. Empirical results on four datasets demonstrate the practical feasibility of the proposed framework.
zh
[CV-33] Template Matching in Images using Segmented Normalized Cross-Correlation
【速读】: This paper addresses the computational inefficiency of normalized cross-correlation (NCC) in template matching. The key is a new algorithm that precomputes an approximate representation of the template image, making approximate NCC computation with the source image more efficient than exact NCC with the original template. The approximate template is obtained from the original by a split-and-merge approach, decomposing it into axis-aligned rectangular segments whose sizes depend on per-segment pixel-intensity variance; each segment is assigned the mean grayscale value of the corresponding pixels of the original template. This keeps the NCC approximation error within an acceptable range while achieving computational performance comparable to, and in many cases better than, the FFT-based NCC algorithm.
链接: https://arxiv.org/abs/2502.01286
作者: Davor Marušić,Siniša Popović,Zoran Kalafatić
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 2 tables, 3 figures
点击查看摘要
Abstract:In this paper, a new variant of an algorithm for normalized cross-correlation (NCC) is proposed in the context of template matching in images. The proposed algorithm is based on the precomputation of a template image approximation, enabling more efficient calculation of approximate NCC with the source image than using the original template for exact NCC calculation. The approximate template is precomputed from the template image by a split-and-merge approach, resulting in a decomposition to axis-aligned rectangular segments, whose sizes depend on per-segment pixel intensity variance. In the approximate template, each segment is assigned the mean grayscale value of the corresponding pixels from the original template. The proposed algorithm achieves superior computational performance with negligible NCC approximation errors compared to the well-known Fast Fourier Transform (FFT)-based NCC algorithm, when applied on less visually complex and/or smaller template images. In other cases, the proposed algorithm can maintain either computational performance or NCC approximation error within the range of the FFT-based algorithm, but not both.
zh
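The split-and-merge approximation and the NCC score can be sketched as follows. This is a simplified quadrant-splitting variant for illustration, not the paper's exact segmentation rule, and the variance threshold is a hypothetical parameter:

```python
import numpy as np

def approximate_template(t, var_thresh=25.0):
    """Split a template into axis-aligned segments whose pixels are
    replaced by their mean (a rough sketch of the split-and-merge idea;
    the paper's actual split/merge rules differ)."""
    approx = np.empty_like(t, dtype=float)

    def split(y0, y1, x0, x1):
        block = t[y0:y1, x0:x1]
        if block.var() <= var_thresh or (y1 - y0 <= 1 and x1 - x0 <= 1):
            approx[y0:y1, x0:x1] = block.mean()
            return
        ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
        for ys, ye in ((y0, ym), (ym, y1)):
            for xs, xe in ((x0, xm), (xm, x1)):
                if ye > ys and xe > xs:
                    split(ys, ye, xs, xe)

    split(0, t.shape[0], 0, t.shape[1])
    return approx

def ncc(a, b):
    """Normalized cross-correlation of two equal-size patches, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```

On visually simple templates the approximation collapses to a few constant rectangles, which is what makes the per-segment correlation terms cheap to accumulate.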
[CV-34] Label Correction for Road Segmentation Using Road-side Cameras
【速读】: This paper addresses reliable road segmentation under all weather conditions, which is critical for intelligent transportation applications, autonomous vehicles, and advanced driver-assistance systems. Since collecting and annotating a dataset covering all weather conditions requires extensive resources, the paper utilizes existing roadside camera infrastructure to automatically collect road data in varying weather and proposes a novel semi-automatic annotation method. The key is that only one frame per camera is labeled manually; the label is then transferred to the other frames of that feed, with small camera movements compensated by frequency-domain image registration. The method is validated on data collected from 927 cameras across Finland over a four-month winter period, and training on the semi-automatically labeled data improves the performance of several deep learning segmentation models.
链接: https://arxiv.org/abs/2502.01281
作者: Henrik Toikka,Eerik Alamikkotervo,Risto Ojala
机构: Aalto University (阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Reliable road segmentation in all weather conditions is critical for intelligent transportation applications, autonomous vehicles and advanced driver’s assistance systems. For robust performance, all weather conditions should be included in the training data of deep learning-based perception models. However, collecting and annotating such a dataset requires extensive resources. In this paper, existing roadside camera infrastructure is utilized for collecting road data in varying weather conditions automatically. Additionally, a novel semi-automatic annotation method for roadside cameras is proposed. For each camera, only one frame is labeled manually and then the label is transferred to other frames of that camera feed. The small camera movements between frames are compensated using frequency domain image registration. The proposed method is validated with roadside camera data collected from 927 cameras across Finland over 4 month time period during winter. Training on the semi-automatically labeled data boosted the segmentation performance of several deep learning segmentation models. Testing was carried out on two different datasets to evaluate the robustness of the resulting models. These datasets were an in-domain roadside camera dataset and out-of-domain dataset captured with a vehicle on-board camera.
zh
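The frequency-domain registration step can be illustrated with plain FFT-based phase correlation. This is a sketch assuming a pure integer translation between frames; the paper's pipeline may handle sub-pixel motion differently:

```python
import numpy as np

def phase_correlation_shift(ref, moved):
    """Estimate the integer (dy, dx) translation such that
    moved ~= np.roll(ref, (dy, dx)), via phase correlation."""
    F_ref = np.fft.fft2(ref)
    F_mov = np.fft.fft2(moved)
    cross = F_mov * np.conj(F_ref)
    cross /= np.maximum(np.abs(cross), 1e-12)   # keep phase only
    corr = np.fft.ifft2(cross).real             # peaks at the shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape
    if dy > h // 2:                             # wrap to signed shifts
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

def transfer_label(label, dy, dx):
    """Shift a label mask by (dy, dx) so it follows the camera motion."""
    return np.roll(label, shift=(dy, dx), axis=(0, 1))
```

Once the shift between the manually labeled frame and a new frame is estimated, the same `transfer_label` step propagates the single manual annotation across the camera feed.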
[CV-35] FSPGD: Rethinking Black-box Attacks on Semantic Segmentation
【速读】: This paper addresses the limited transferability of adversarial examples against semantic segmentation models in black-box attacks. The key is the Feature Similarity Projected Gradient Descent (FSPGD) attack: unlike conventional methods that compute gradients from output predictions, FSPGD computes gradients from intermediate-layer features, with a loss function that targets local information by comparing features of clean and adversarial images while also disrupting contextual information by accounting for spatial relationships between objects, significantly improving both transferability and attack performance.
链接: https://arxiv.org/abs/2502.01262
作者: Eun-Sol Park,MiSo Park,Seung Park,Yong-Goo Shin
机构: Department of Electronics and Information Engineering, Korea University (高丽大学电子与信息工程系); College of Medicine, Chungbuk National University (忠北国立大学医学院); Department of Biomedical Engineering, Chungbuk National University Hospital (忠北国立大学医院生物医学工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Transferability, the ability of adversarial examples crafted for one model to deceive other models, is crucial for black-box attacks. Despite advancements in attack methods for semantic segmentation, transferability remains limited, reducing their effectiveness in real-world applications. To address this, we introduce the Feature Similarity Projected Gradient Descent (FSPGD) attack, a novel black-box approach that enhances both attack performance and transferability. Unlike conventional segmentation attacks that rely on output predictions for gradient calculation, FSPGD computes gradients from intermediate layer features. Specifically, our method introduces a loss function that targets local information by comparing features between clean images and adversarial examples, while also disrupting contextual information by accounting for spatial relationships between objects. Experiments on Pascal VOC 2012 and Cityscapes datasets demonstrate that FSPGD achieves superior transferability and attack performance, establishing a new state-of-the-art benchmark. Code is available at this https URL.
zh
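To make the core idea concrete, namely running PGD on a feature-space loss computed at an intermediate layer rather than on output predictions, here is a toy version in which a fixed linear map stands in for the network's intermediate features, so the gradient is analytic. The real attack differentiates through a deep segmentation model and adds the contextual term; every name and parameter here is illustrative:

```python
import numpy as np

def fspgd_toy(x_clean, W, eps=0.5, alpha=0.1, steps=20, seed=0):
    """Toy PGD maximizing a feature-space distance.

    Features are f(x) = W @ x (a stand-in for intermediate activations),
    so the gradient of L = ||W x_adv - W x_clean||^2 is closed-form.
    """
    rng = np.random.default_rng(seed)
    x_adv = x_clean + rng.uniform(-eps, eps, size=x_clean.shape)  # random start
    f_clean = W @ x_clean
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ x_adv - f_clean)            # dL/dx_adv
        x_adv = x_adv + alpha * np.sign(grad)               # ascent step
        x_adv = x_clean + np.clip(x_adv - x_clean, -eps, eps)  # L-inf projection
    return x_adv
```

Because the loss lives in feature space, the resulting perturbation is not tied to one model's decision boundary, which is the intuition behind the improved transferability claimed above.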
[CV-36] Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
【速读】: This paper addresses the erroneous vision-language associations caused by over-reliance on future frames when pre-training vision-language representations. The key is Action Temporal Coherence Learning (AcTOL), which learns ordered and continuous vision-language representations without rigid goal-based constraints by contrasting semantic differences between frames to reflect their natural ordering, and by imposing a local Brownian bridge constraint to ensure smooth transitions across intermediate frames.
链接: https://arxiv.org/abs/2502.01218
作者: Zhizhen Zhang,Lei Zhu,Zhen Fang,Zi Huang,Yadan Luo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Pre-training vision-language representations on human action videos has emerged as a promising approach to reduce reliance on large-scale expert demonstrations for training embodied agents. However, prior methods often employ time contrastive learning based on goal-reaching heuristics, progressively aligning language instructions from the initial to the final frame. This overemphasis on future frames can result in erroneous vision-language associations, as actions may terminate early or include irrelevant moments in the end. To address this issue, we propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraint. AcTOL treats a video as a continuous trajectory where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge constraint to ensure smooth transitions across intermediate frames. Extensive imitation learning experiments across varying numbers of demonstrations show that the pretrained features significantly enhance downstream manipulation tasks by up to 49% with high robustness to different linguistic styles of instructions, offering a viable pathway toward generalized embodied agents. The source code is included in the supplementary material for reference.
zh
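The local Brownian bridge constraint can be sketched as a penalty that pulls each intermediate frame embedding toward the interpolation of the endpoint embeddings, with the squared deviation scaled by the bridge variance t(1-t). This uses the textbook bridge mean/variance; the paper's exact parameterization may differ:

```python
import numpy as np

def brownian_bridge_penalty(z0, zt, zT, t):
    """Bridge-style consistency penalty for a frame embedding zt at
    relative time t in (0, 1) along a trajectory from z0 to zT."""
    assert 0.0 < t < 1.0
    mean = (1.0 - t) * z0 + t * zT          # bridge mean: linear interpolation
    var = t * (1.0 - t)                     # bridge variance, largest mid-clip
    return float(np.sum((zt - mean) ** 2) / (2.0 * var))
```

The variance term makes the constraint loose in the middle of the clip and tight near the endpoints, which is what allows smooth but non-rigid transitions between frames.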
[CV-37] Exploring Few-Shot Defect Segmentation in General Industrial Scenarios with Metric Learning and Vision Foundation Models
【速读】: This paper addresses the scarcity of defect samples in industrial defect segmentation across diverse, complex scenarios, noting that existing work is mostly limited to defects on simple textures. The key is a novel, efficient few-shot defect segmentation method based on feature matching, together with the finding that the Segment Anything (SAM2) model is particularly effective in its video track mode. The paper also contributes a new real-world dataset, reorganizes several existing datasets into a more comprehensive benchmark, and systematically studies the applicability of Vision Foundation Models (VFMs) to this task.
链接: https://arxiv.org/abs/2502.01216
作者: Tongkun Liu,Bing Li,Xiao Jin,Yupeng Shi,Qiuying Li,Xiang Wei
机构: State Key Laboratory for Manufacturing System Engineering, Xi’an Jiaotong University (西安交通大学制造系统工程国家重点实验室); International Joint Research Laboratory for Micro/Nano Manufacturing and Measurement Technologies, Xi’an Jiaotong University (西安交通大学微纳制造与测量技术国际联合研究实验室); Mechanical Engineering Program, Physical Science and Engineering Division, King Abdullah University of Science and Technology (KAUST) (阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Industrial defect segmentation is critical for manufacturing quality control. Due to the scarcity of training defect samples, few-shot semantic segmentation (FSS) holds significant value in this field. However, existing studies mostly apply FSS to tackle defects on simple textures, without considering more diverse scenarios. This paper aims to address this gap by exploring FSS in broader industrial products with various defect types. To this end, we contribute a new real-world dataset and reorganize some existing datasets to build a more comprehensive few-shot defect segmentation (FDS) benchmark. On this benchmark, we thoroughly investigate metric learning-based FSS methods, including those based on meta-learning and those based on Vision Foundation Models (VFMs). We observe that existing meta-learning-based methods are generally not well-suited for this task, while VFMs hold great potential. We further systematically study the applicability of various VFMs in this task, involving two paradigms: feature matching and the use of Segment Anything (SAM) models. We propose a novel efficient FDS method based on feature matching. Meanwhile, we find that SAM2 is particularly effective for addressing FDS through its video track mode. The contributed dataset and code will be available at: this https URL.
zh
[CV-38] Land Surface Temperature Super-Resolution with a Scale-Invariance-Free Neural Approach: Application to MODIS
【速读】: This paper addresses the trade-off between the temporal and spatial resolution of thermal spaceborne sensors by proposing a Scale-Invariance-Free approach for training neural network models to produce higher-resolution Land Surface Temperature (LST) maps. The key is a training scheme that drops the scale-invariance hypothesis: two neural network models, named SIF-CNN-SR, are trained to produce high-resolution LST maps that recover the initial LST when degraded back to low resolution, while containing fine-scale textures informed by the high-resolution NDVI. This avoids the reliance on scale invariance of traditional methods and improves super-resolution performance.
链接: https://arxiv.org/abs/2502.01204
作者: Romuald Ait-Bachir(ODYSSEY, IMT Atlantique - MEE, Lab-STICC_OSE),Carlos Granero-Belinchon(ODYSSEY, IMT Atlantique - MEE, Lab-STICC_OSE),Aurélie Michel,Julien Michel(CESBIO, CNES),Xavier Briottet,Lucas Drumetz(Lab-STICC_OSE, IMT Atlantique - MEE, ODYSSEY)
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Due to the trade-off between the temporal and spatial resolution of thermal spaceborne sensors, super-resolution methods have been developed to provide fine-scale Land Surface Temperature (LST) maps. Most of them are trained at low resolution but applied at fine resolution, and so they require a scale-invariance hypothesis that is not always adapted. The main contribution of this work is the introduction of a Scale-Invariance-Free approach for training Neural Network (NN) models, and the implementation of two NN models, called Scale-Invariance-Free Convolutional Neural Network for Super-Resolution (SIF-CNN-SR), for the super-resolution of MODIS LST products. The Scale-Invariance-Free approach consists of training the models in order to provide LST maps at high spatial resolution that recover the initial LST when they are degraded at low resolution and that contain fine-scale textures informed by the high resolution NDVI. The second contribution of this work is the release of a test database with ASTER LST images concomitant with MODIS ones that can be used for evaluation of super-resolution algorithms. We compare the two proposed models, SIF-CNN-SR1 and SIF-CNN-SR2, with four state-of-the-art methods, Bicubic, DMS, ATPRK, Tsharp, and a CNN sharing the same architecture as SIF-CNN-SR but trained under the scale-invariance hypothesis. We show that SIF-CNN-SR1 outperforms the state-of-the-art methods and the other two CNN models as evaluated with LPIPS and Fourier space metrics focusing on the analysis of textures. These results and the available ASTER-MODIS database for evaluation are promising for future studies on super-resolution of LST.
zh
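The Scale-Invariance-Free training idea, that the super-resolved map must recover the observed low-resolution LST once degraded, can be sketched as a consistency term. Block averaging is an assumed stand-in for the sensor's actual degradation model:

```python
import numpy as np

def block_downsample(x, factor):
    """Degrade a high-resolution map by block averaging.
    Assumes both dimensions of x are divisible by `factor`."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def reconstruction_consistency(lst_sr, lst_low, factor=4):
    """Consistency term (sketch): the super-resolved LST map, degraded
    back to low resolution, should recover the observed low-res LST."""
    return float(np.mean((block_downsample(lst_sr, factor) - lst_low) ** 2))
```

Minimizing this term at the target resolution is what removes the need to train at low resolution and extrapolate under a scale-invariance hypothesis.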
[CV-39] One-to-Normal: Anomaly Personalization for Few-shot Anomaly Detection NEURIPS2024
【速读】: This paper addresses the accuracy limitations of recent few-shot anomaly detection methods, in particular the loss of precision caused by directly comparing a query image's features with those of a few normal images. The key is an anomaly personalization method that performs a personalized one-to-normal transformation of query images with a customized anomaly-free generative model, ensuring close alignment with the normal manifold, together with a triplet contrastive anomaly inference strategy that comprehensively compares the query with the generated anomaly-free data pool and prompt information, enhancing the stability and robustness of predictions. Extensive evaluation on eleven datasets across three domains demonstrates its effectiveness, and the method transfers flexibly to other anomaly detection methods, improving their performance.
链接: https://arxiv.org/abs/2502.01201
作者: Yiyue Li,Shaoting Zhang,Kang Li,Qicheng Lao
机构: West China Biomedical Big Data Center, West China Hospital, Sichuan University (四川大学华西医院生物医学大数据中心); School of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学人工智能学院); Sichuan University Pittsburgh Institute, Sichuan University (四川大学匹兹堡学院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: In The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS2024)
点击查看摘要
Abstract:Traditional Anomaly Detection (AD) methods have predominantly relied on unsupervised learning from extensive normal data. Recent AD methods have evolved with the advent of large pre-trained vision-language models, enhancing few-shot anomaly detection capabilities. However, these latest AD methods still exhibit limitations in accuracy improvement. One contributing factor is their direct comparison of a query image’s features with those of few-shot normal images. This direct comparison often leads to a loss of precision and complicates the extension of these techniques to more complex domains–an area that remains underexplored in a more refined and comprehensive manner. To address these limitations, we introduce the anomaly personalization method, which performs a personalized one-to-normal transformation of query images using an anomaly-free customized generation model, ensuring close alignment with the normal manifold. Moreover, to further enhance the stability and robustness of prediction results, we propose a triplet contrastive anomaly inference strategy, which incorporates a comprehensive comparison between the query and generated anomaly-free data pool and prompt information. Extensive evaluations across eleven datasets in three domains demonstrate our model’s effectiveness compared to the latest AD methods. Additionally, our method has been proven to transfer flexibly to other AD methods, with the generated image data effectively improving the performance of other AD methods.
zh
[CV-40] Nearly Lossless Adaptive Bit Switching
【速读】: This paper addresses the substantial training and storage costs caused by fixed bit-width settings under varying hardware and transmission demands in model quantization. The key contributions are the Double Rounding quantization method, which reduces storage overhead by storing the highest integer precision instead of full precision while enabling nearly lossless bit-switching, and an Adaptive Learning Rate Scaling (ALRS) technique that resolves the gradient inconsistency across precisions during one-shot joint training. The paper further extends Double Rounding to one-shot mixed-precision training and develops a Hessian-Aware Stochastic Bit-switching (HASB) strategy, together improving both multi-precision and mixed-precision training.
链接: https://arxiv.org/abs/2502.01199
作者: Haiduo Huang,Zhenhua Liu,Tian Xia,Wenzhe Zhao,Pengju Ren
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Model quantization is widely applied for compressing and accelerating deep neural networks (DNNs). However, conventional Quantization-Aware Training (QAT) focuses on training DNNs with uniform bit-width. The bit-width settings vary across different hardware and transmission demands, which induces considerable training and storage costs. Hence, the scheme of one-shot joint training multiple precisions is proposed to address this issue. Previous works either store a larger FP32 model to switch between different precision models for higher accuracy or store a smaller INT8 model but compromise accuracy due to using shared quantization parameters. In this paper, we introduce the Double Rounding quantization method, which fully utilizes the quantized representation range to accomplish nearly lossless bit-switching while reducing storage by using the highest integer precision instead of full precision. Furthermore, we observe a competitive interference among different precisions during one-shot joint training, primarily due to inconsistent gradients of quantization scales during backward propagation. To tackle this problem, we propose an Adaptive Learning Rate Scaling (ALRS) technique that dynamically adapts learning rates for various precisions to optimize the training process. Additionally, we extend our Double Rounding to one-shot mixed precision training and develop a Hessian-Aware Stochastic Bit-switching (HASB) strategy. Experimental results on the ImageNet-1K classification demonstrate that our methods have enough advantages to state-of-the-art one-shot joint QAT in both multi-precision and mixed-precision. We also validate the feasibility of our method on detection and segmentation tasks, as well as on LLMs task. Our codes are available at this https URL.
zh
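The bit-switching idea can be illustrated as follows: store only the highest-precision integer weights and derive lower bit-widths by a second rounding, instead of re-quantizing full-precision weights. This is a simplified sketch; the scale handling and clipping convention are assumptions, not the paper's exact scheme:

```python
import numpy as np

def quantize(w, bits, scale):
    """Uniform symmetric quantization of weights to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax)

def double_rounding_switch(q_hi, hi_bits, lo_bits):
    """Derive a lower-precision tensor from stored high-precision
    integers by a second rounding (divide by a power of two, round,
    clip), avoiding any access to the full-precision weights."""
    shift = hi_bits - lo_bits
    q_lo = np.round(q_hi / 2.0 ** shift)
    qmax = 2 ** (lo_bits - 1) - 1
    return np.clip(q_lo, -qmax - 1, qmax)
```

Because only the highest-precision integer tensor is stored, switching to any lower bit-width is a cheap elementwise operation, which is what makes the one-shot multi-precision deployment practical.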
[CV-41] Towards Robust and Reliable Concept Representations: Reliability-Enhanced Concept Embedding Model
【速读】: This paper addresses the challenge Concept Bottleneck Models (CBMs) face in ensuring reliable concept representations, whose failure can propagate to downstream tasks and degrade performance, especially under distribution shifts. Two issues are identified: sensitivity to concept-irrelevant features (e.g., background variations) and lack of semantic consistency of the same concept across samples. The key of the proposed Reliability-Enhanced Concept Embedding Model (RECEM) is Concept-Level Disentanglement, which separates irrelevant features from concept-relevant information, and a Concept Mixup mechanism that ensures semantic alignment across samples. Together these improve concept reliability, letting the model focus on meaningful object attributes and generate faithful concept representations.
链接: https://arxiv.org/abs/2502.01191
作者: Yuxuan Cai,Xiyu Wang,Satoshi Tsutsui,Winnie Pang,Bihan Wen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Concept Bottleneck Models (CBMs) aim to enhance interpretability by predicting human-understandable concepts as intermediates for decision-making. However, these models often face challenges in ensuring reliable concept representations, which can propagate to downstream tasks and undermine robustness, especially under distribution shifts. Two inherent issues contribute to concept unreliability: sensitivity to concept-irrelevant features (e.g., background variations) and lack of semantic consistency for the same concept across different samples. To address these limitations, we propose the Reliability-Enhanced Concept Embedding Model (RECEM), which introduces a two-fold strategy: Concept-Level Disentanglement to separate irrelevant features from concept-relevant information and a Concept Mixup mechanism to ensure semantic alignment across samples. These mechanisms work together to improve concept reliability, enabling the model to focus on meaningful object attributes and generate faithful concept representations. Experimental results demonstrate that RECEM consistently outperforms existing baselines across multiple datasets, showing superior performance under background and domain shifts. These findings highlight the effectiveness of disentanglement and alignment strategies in enhancing both reliability and robustness in CBMs.
zh
[CV-42] A High-Accuracy SSIM-based Scoring System for Coin Die Link Identification
【速读】: This paper addresses the difficulty of identifying coins struck with the same die (die link detection) in ancient coin analysis, where manual identification becomes extremely laborious or even impossible for large hoards. The key contributions are a publicly accessible labeled dataset of coin pictures (329 images), an SSIM-based scoring method for rapid and accurate discrimination of coin pairs, and an evaluation of clustering techniques using this score that achieves near-perfect die link identification. Together these contributions foster more powerful tools for archaeology, and for numismatics in particular.
链接: https://arxiv.org/abs/2502.01186
作者: Patrice Labedan,Nicolas Drougard,Alexandre Berezin,Guowei Sun,Francis Dieulafait
机构: ISAE-SUPAERO (ISAE-SUPAERO), Université de Toulouse (图卢兹大学), France (法国); Hades, Bureau d’investigations archéologiques (哈德斯考古调查局), L’Union, France (法国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The analyses of ancient coins, and especially the identification of those struck with the same die, provides invaluable information for archaeologists and historians. Nowadays, these die links are identified manually, which makes the process laborious, if not impossible when big treasures are discovered as the number of comparisons is too large. This study introduces advances that promise to streamline and enhance archaeological coin analysis. Our contributions include: 1) First publicly accessible labeled dataset of coin pictures (329 images) for die link detection, facilitating method benchmarking; 2) Novel SSIM-based scoring method for rapid and accurate discrimination of coin pairs, outperforming current techniques used in this research field; 3) Evaluation of clustering techniques using our score, demonstrating near-perfect die link identification. We provide datasets, to foster future research and the development of even more powerful tools for archaeology, and more particularly for numismatics.
zh
[CV-43] Enhancing Environmental Robustness in Few-shot Learning via Conditional Representation Learning
[Quick Read]: This paper addresses the marked performance drop that few-shot learning (FSL) models suffer in real-world testing due to environmental factors. Current research overlooks "environmental robustness", a model's ability to maintain consistent performance in complex and variable physical environments. To fill this gap, the paper introduces a new real-world multi-domain few-shot learning benchmark (RD-FSL) covering four domains and six evaluation datasets, and proposes a novel conditional representation learning network (CRLNet). The key idea of CRLNet is to integrate the interactions between training and test images as conditional information in their respective representation processes, reducing intra-class variance or enhancing inter-class variance and thereby improving FSL performance. Experiments show that CRLNet improves on existing methods by 6.83% to 16.98%.
Link: https://arxiv.org/abs/2502.01183
Authors: Qianyu Guo, Jingrong Wu, Tianxing Wu, Haofen Wang, Weifeng Ge, Wenqiang Zhang
Affiliations: School of Computer Science, Fudan University; Shanghai Institute of Virology, Shanghai Jiao Tong University School of Medicine; School of Computer Science and Engineering, Southeast University, Nanjing, 210096, China; College of Design and Innovation, Tongji University; Engineering Research Center of AI & Robotics, Ministry of Education, Academy for Engineering & Technology, Fudan University, Shanghai, 20043, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures. Accepted by IEEE Transactions on Image Processing
Click to view abstract
Abstract:Few-shot learning (FSL) has recently been extensively utilized to overcome the scarcity of training data in domain-specific visual recognition. In real-world scenarios, environmental factors such as complex backgrounds, varying lighting conditions, long-distance shooting, and moving targets often cause test images to exhibit numerous incomplete targets or noise disruptions. However, current research on evaluation datasets and methodologies has largely ignored the concept of “environmental robustness”, which refers to maintaining consistent performance in complex and diverse physical environments. This neglect has led to a notable decline in the performance of FSL models during practical testing compared to their training performance. To bridge this gap, we introduce a new real-world multi-domain few-shot learning (RD-FSL) benchmark, which includes four domains and six evaluation datasets. The test images in this benchmark feature various challenging elements, such as camouflaged objects, small targets, and blurriness. Our evaluation experiments reveal that existing methods struggle to utilize training images effectively to generate accurate feature representations for challenging test images. To address this problem, we propose a novel conditional representation learning network (CRLNet) that integrates the interactions between training and testing images as conditional information in their respective representation processes. The main goal is to reduce intra-class variance or enhance inter-class variance at the feature representation level. Finally, comparative experiments reveal that CRLNet surpasses the current state-of-the-art methods, achieving performance improvements ranging from 6.83% to 16.98% across diverse settings and backbones. The source code and dataset are available at this https URL.
zh
[CV-44] BVINet: Unlocking Blind Video Inpainting with Zero Annotations
[Quick Read]: This paper addresses a known limitation of video inpainting: existing methods assume the locations of corrupted regions are known and focus mainly on "how to inpaint". That assumption requires manually annotated binary masks indicating "where to inpaint", a labor-intensive and expensive task that limits the practicality of current methods. The paper therefore proposes a new blind video inpainting setting in which the network learns to map a corrupted video directly to its inpainted result, with no annotations of the corrupted regions.
The key solution is the proposed end-to-end Blind Video Inpainting Network (BVINet), which tackles "where to inpaint" and "how to inpaint" simultaneously. BVINet predicts masks of corrupted regions by detecting semantically discontinuous regions within frames and exploiting the temporal-consistency prior of the video. The predicted masks are then incorporated into BVINet, letting it capture valid context from uncorrupted regions to fill the corrupted ones. A consistency loss further regularizes BVINet's training parameters, so mask prediction and video completion mutually constrain each other and the overall performance of the trained model is maximized.
Link: https://arxiv.org/abs/2502.01181
Authors: Zhiliang Wu, Kerui Chen, Kun Li, Hehe Fan, Yi Yang
Affiliations: ReLER Lab, CCAI, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Video inpainting aims to fill in corrupted regions of the video with plausible contents. Existing methods generally assume that the locations of corrupted regions are known, focusing primarily on the “how to inpaint”. This reliance necessitates manual annotation of the corrupted regions using binary masks to indicate “where to inpaint”. However, the annotation of these masks is labor-intensive and expensive, limiting the practicality of current methods. In this paper, we expect to relax this assumption by defining a new blind video inpainting setting, enabling the networks to learn the mapping from corrupted video to inpainted result directly, eliminating the need for corrupted region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) to address both “where to inpaint” and “how to inpaint” simultaneously. On the one hand, BVINet can predict the masks of corrupted regions by detecting semantic-discontinuous regions of the frame and utilizing temporal consistency prior of the video. On the other hand, the predicted masks are incorporated into the BVINet, allowing it to capture valid context information from uncorrupted regions to fill in corrupted ones. Besides, we introduce a consistency loss to regularize the training parameters of BVINet. In this way, mask prediction and video completion mutually constrain each other, thereby maximizing the overall performance of the trained model. Furthermore, we customize a dataset consisting of synthetic corrupted videos, real-world corrupted videos, and their corresponding completed videos. This dataset serves as a valuable resource for advancing blind video inpainting research. Extensive experimental results demonstrate the effectiveness and superiority of our method.
zh
[CV-45] Towards Agile Swarming in Real World: Onboard Relative Localization with Fast Tracking of Active Blinking Markers
[Quick Read]: This paper addresses robust real-time relative localization for multi-robot teams flying in tightly coupled formations, particularly in complex outdoor environments. Traditional tracking algorithms struggle with blinking markers that move fast and appear only intermittently in the camera frames. The key contribution is Active Blinking Marker Tracking (AMT), which uses weighted polynomial regression to predict the future appearance of active blinking markers while accounting for the uncertainty of the prediction. Outdoor experiments show that AMT outperforms state-of-the-art methods in tracking density, accuracy, and complexity.
Link: https://arxiv.org/abs/2502.01172
Authors: Tim Felix Lakemann, Daniel Bonilla Licea, Viktor Walter, Tomáš Báča, Martin Saska
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:A novel onboard tracking approach enabling vision-based relative localization and communication using Active blinking Marker Tracking (AMT) is introduced in this article. Active blinking markers on multi-robot team members improve the robustness of relative localization for aerial vehicles in tightly coupled swarms during real-world deployments, while also serving as a resilient communication channel. Traditional tracking algorithms struggle to track fast moving blinking markers due to their intermittent appearance in the camera frames. AMT addresses this by using weighted polynomial regression to predict the future appearance of active blinking markers while accounting for uncertainty in the prediction. In outdoor experiments, the AMT approach outperformed state-of-the-art methods in tracking density, accuracy, and complexity. The experimental validation of this novel tracking approach for relative localization involved testing motion patterns motivated by our research on agile multi-robot deployment.
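The core predictive step of AMT, forecasting where a blinking marker will reappear, can be sketched as a weighted polynomial fit over recent detections. The exponential recency weighting, polynomial degree, and single-coordinate setup below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def predict_marker(times, positions, t_next, degree=2):
    """Weighted polynomial regression over past (t, position) detections.

    Recent observations get larger weights; returns the predicted
    position at t_next for one image coordinate.
    """
    times = np.asarray(times, dtype=float)
    positions = np.asarray(positions, dtype=float)
    w = np.exp(-(times[-1] - times))  # exponentially down-weight old samples
    coeffs = np.polyfit(times, positions, deg=degree, w=w)
    return float(np.polyval(coeffs, t_next))

# Marker moving with constant velocity: 5 px per frame along x.
t = [0, 1, 2, 3, 4]
x = [10.0, 15.0, 20.0, 25.0, 30.0]
x_pred = predict_marker(t, x, t_next=5)  # linear motion is recovered exactly
```

In a full tracker the same fit would run per coordinate, and the residuals of the fit would feed the uncertainty estimate that gates the search window.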
zh
[CV-46] MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks
[Quick Read]: This paper targets the performance degradation caused by the small size of multimodal datasets and the high complexity of multimodal models. The key solution is the Modality-INformed knowledge Distillation (MIND) framework, which compresses models by transferring knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student. MIND employs multi-head joint fusion models, so unimodal encoders can be used for unimodal samples without imputing or masking absent modalities. Experiments show that MIND improves smaller multimodal networks on binary and multilabel clinical prediction tasks as well as on three non-medical multimodal multiclass datasets.
Link: https://arxiv.org/abs/2502.01158
Authors: Alejandro Guerra-Manzanares, Farah E. Shamout
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Transactions on Machine Learning Research (01/2025), this https URL
Click to view abstract
Abstract:Multimodal fusion leverages information across modalities to learn better feature representations with the goal of improving performance in fusion-based tasks. However, multimodal datasets, especially in medical settings, are typically smaller than their unimodal counterparts, which can impede the performance of multimodal models. Additionally, the increase in the number of modalities is often associated with an overall increase in the size of the multimodal network, which may be undesirable in medical use cases. Utilizing smaller unimodal encoders may lead to sub-optimal performance, particularly when dealing with high-dimensional clinical data. In this paper, we propose the Modality-INformed knowledge Distillation (MIND) framework, a multimodal model compression approach based on knowledge distillation that transfers knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student. The teacher models consist of unimodal networks, allowing the student to learn from diverse representations. MIND employs multi-head joint fusion models, as opposed to single-head models, enabling the use of unimodal encoders in the case of unimodal samples without requiring imputation or masking of absent modalities. As a result, MIND generates an optimized multimodal model, enhancing both multimodal and unimodal representations. It can also be leveraged to balance multimodal learning during training. We evaluate MIND on binary and multilabel clinical prediction tasks using time series data and chest X-ray images. Additionally, we assess the generalizability of the MIND framework on three non-medical multimodal multiclass datasets. Experimental results demonstrate that MIND enhances the performance of the smaller multimodal network across all five tasks, as well as various fusion methods and multimodal architectures, compared to state-of-the-art baselines.
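Distillation from an ensemble of unimodal teachers into a smaller student, the mechanism MIND builds on, reduces at its core to matching softened teacher and student distributions. A minimal temperature-scaled KL term is sketched below; the temperature, the plain averaging over teachers, and the logits are illustrative stand-ins, since MIND's actual losses and fusion heads are more involved.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits_list, T=2.0):
    """KL(teacher_avg || student) with temperature-softened distributions,
    teachers averaged as a simple ensemble."""
    p_t = np.mean([softmax(t, T) for t in teacher_logits_list], axis=0)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

# Two "unimodal teachers" that roughly agree on class 0.
teachers = [np.array([2.0, 0.5, -1.0]), np.array([1.8, 0.7, -0.9])]
aligned = distillation_loss(np.array([1.9, 0.6, -0.95]), teachers)  # student matches
off     = distillation_loss(np.array([-1.0, 0.5, 2.0]), teachers)   # student disagrees
```

The loss is near zero when the student reproduces the teachers' softened distribution and grows as it diverges, which is what drives the transfer.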
zh
[CV-47] Radiant Foam: Real-Time Differentiable Ray Tracing
[Quick Read]: This paper addresses the difficulty of implementing light-transport phenomena such as reflection and refraction once rasterization is used to speed up rendering. The key solution is a new scene representation, Radiant Foam, which leverages an efficient volumetric mesh ray-tracing algorithm to avoid the approximations introduced by rasterization while keeping rendering speed and quality comparable to Gaussian Splatting, without requiring any special hardware or APIs.
Link: https://arxiv.org/abs/2502.01157
Authors: Shrisudhan Govindarajan, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Research on differentiable scene representations is consistently moving towards more efficient, real-time models. Recently, this has led to the popularization of splatting methods, which eschew the traditional ray-based rendering of radiance fields in favor of rasterization. This has yielded a significant improvement in rendering speeds due to the efficiency of rasterization algorithms and hardware, but has come at a cost: the approximations that make rasterization efficient also make implementation of light transport phenomena like reflection and refraction much more difficult. We propose a novel scene representation which avoids these approximations, but keeps the efficiency and reconstruction quality of splatting by leveraging a decades-old efficient volumetric mesh ray tracing algorithm which has been largely overlooked in recent computer vision research. The resulting model, which we name Radiant Foam, achieves rendering speed and quality comparable to Gaussian Splatting, without the constraints of rasterization. Unlike ray traced Gaussian models that use hardware ray tracing acceleration, our method requires no special hardware or APIs beyond the standard features of a programmable GPU.
zh
[CV-48] Learning to Learn Weight Generation via Trajectory Diffusion
[Quick Read]: This paper tackles the limited cross-task transferability of diffusion-based weight generation and the fact that existing methods train only on optimal weights. The proposed Lt-Di integrates the diffusion algorithm with meta-learning to generate weights for unseen tasks, and extends the vanilla diffusion algorithm into a trajectory diffusion algorithm that exploits the other weights encountered along the optimization trajectory. The key is to decompose the full diffusion chain into multiple shorter sub-chains, improving training and inference efficiency; the paper also analyzes the convergence properties of the weight-generation paradigm and improves convergence efficiency without additional time overhead.
Link: https://arxiv.org/abs/2502.01117
Authors: Yunchuan Guan, Yu Liu, Ke Zhou, Zhiqi Shen, Serge Belongie, Jenq-Neng Hwang, Lei Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Diffusion-based algorithms have emerged as promising techniques for weight generation, particularly in scenarios like multi-task learning that require frequent weight updates. However, existing solutions suffer from limited cross-task transferability. In addition, they only utilize optimal weights as training samples, ignoring the value of other weights in the optimization process. To address these issues, we propose Lt-Di, which integrates the diffusion algorithm with meta-learning to generate weights for unseen tasks. Furthermore, we extend the vanilla diffusion algorithm into a trajectory diffusion algorithm to utilize other weights along the optimization trajectory. Trajectory diffusion decomposes the entire diffusion chain into multiple shorter ones, improving training and inference efficiency. We analyze the convergence properties of the weight generation paradigm and improve convergence efficiency without additional time overhead. Our experiments demonstrate Lt-Di’s higher accuracy while reducing computational overhead across various tasks, including zero-shot and few-shot learning, multi-domain generalization, and large-scale language model fine-tuning. Our code is released at this https URL.
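The decomposition that trajectory diffusion relies on, splitting one long diffusion chain into several contiguous shorter sub-chains, can be sketched at the step-index level. The chain length and sub-chain count here are arbitrary illustrative values.

```python
def split_chain(num_steps, num_subchains):
    """Partition diffusion steps 0..num_steps-1 into contiguous sub-chains,
    distributing any remainder over the earliest sub-chains."""
    base, extra = divmod(num_steps, num_subchains)
    chains, start = [], 0
    for i in range(num_subchains):
        size = base + (1 if i < extra else 0)
        chains.append(list(range(start, start + size)))
        start += size
    return chains

chains = split_chain(10, 3)  # -> [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each sub-chain can then be trained or sampled independently of the others, which is where the efficiency gain comes from.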
zh
[CV-49] LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
[Quick Read]: This paper addresses the challenge of generating cognitively aligned layered SVGs (Scalable Vector Graphics): existing methods either produce oversimplified single-layer outputs or suffer from optimization-induced shape redundancy. The key solution is LayerTracer, a diffusion-transformer-based framework that bridges this gap by learning how designers create layered SVGs from a novel dataset of sequential design operations. LayerTracer works in two stages: a text-conditioned DiT (Diffusion Transformer) first generates multi-phase rasterized construction blueprints that simulate human design workflows; layer-wise vectorization with path deduplication then produces clean, editable SVGs. For image vectorization, a conditional diffusion mechanism encodes reference images into latent tokens to guide hierarchical reconstruction while preserving structural integrity.
Link: https://arxiv.org/abs/2502.01105
Authors: Yiren Song, Danze Chen, Mike Zheng Shou
Affiliations: National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Click to view abstract
Abstract:Generating cognitive-aligned layered SVGs remains challenging due to existing methods’ tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a diffusion transformer based framework that bridges this gap by learning designers’ layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: First, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows. Second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments demonstrate LayerTracer’s superior performance against optimization-based and neural baselines in both generation quality and editability, effectively aligning AI-generated vectors with professional design cognition.
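The path-deduplication step of the vectorization stage can be illustrated by removing repeated path geometries while preserving layer order. SVG parsing is elided; paths are represented as plain path-data strings, and exact string equality stands in for whatever geometric comparison LayerTracer actually uses.

```python
def dedupe_paths(layers):
    """Keep only the first occurrence of each path geometry across layers.

    `layers` is a list of (layer_index, path_data) pairs in draw order.
    """
    seen = set()
    out = []
    for layer, d in layers:
        if d not in seen:
            seen.add(d)
            out.append((layer, d))
    return out

layers = [
    (0, "M0 0 L10 0 L10 10 Z"),
    (1, "M0 0 L10 0 L10 10 Z"),  # duplicate re-emitted by a later layer
    (1, "M2 2 L8 2 L8 8 Z"),
]
clean = dedupe_paths(layers)  # two unique paths remain, order preserved
```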
zh
[CV-50] VidSketch: Hand-drawn Sketch-Driven Video Generation with Diffusion Control
[Quick Read]: This paper addresses the generation of high-quality video animations from hand-drawn sketches; existing methods are limited to static image generation and cannot control video animation generation with sketches. The key solution, VidSketch, introduces a Level-Based Sketch Control Strategy that automatically adjusts the guidance strength of the sketches during generation, and a TempSpatial Attention mechanism that enhances the spatiotemporal consistency of the generated animations, markedly improving coherence across frames.
Link: https://arxiv.org/abs/2502.01101
Authors: Lifan Jiang, Shuang Chen, Boxi Wu, Xiaotong Guan, Jiahui Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 15 figures
Click to view abstract
Abstract:With the advancement of generative artificial intelligence, previous studies have achieved the task of generating aesthetic images from hand-drawn sketches, fulfilling the public’s needs for drawing. However, these methods are limited to static images and lack the ability to control video animation generation using hand-drawn sketches. To address this gap, we propose VidSketch, the first method capable of generating high-quality video animations directly from any number of hand-drawn sketches and simple text prompts, bridging the divide between ordinary users and professional artists. Specifically, our method introduces a Level-Based Sketch Control Strategy to automatically adjust the guidance strength of sketches during the generation process, accommodating users with varying drawing skills. Furthermore, a TempSpatial Attention mechanism is designed to enhance the spatiotemporal consistency of generated video animations, significantly improving the coherence across frames. You can find more detailed cases on our official website.
zh
[CV-51] SatFlow: Generative model based framework for producing High Resolution Gap Free Remote Sensing Imagery
[Quick Read]: This paper addresses the gap between the need for high-resolution, frequently updated remote sensing imagery in agricultural and environmental monitoring and the reality of lower-frequency observations contaminated by clouds. The key solution is SatFlow, a generative framework trained via Conditional Flow Matching that fuses low-resolution MODIS imagery with high-resolution Landsat observations to produce frequent, high-resolution, gap-free surface reflectance imagery. Cloud occlusion is handled as an image inpainting task, allowing the model to reliably fill cloud-covered regions and support downstream applications such as crop phenology tracking and environmental change detection.
Link: https://arxiv.org/abs/2502.01098
Authors: Bharath Irigireddy, Varaprasad Bandaru
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Frequent, high-resolution remote sensing imagery is crucial for agricultural and environmental monitoring. Satellites from the Landsat collection offer detailed imagery at 30m resolution but with lower temporal frequency, whereas missions like MODIS and VIIRS provide daily coverage at coarser resolutions. Clouds and cloud shadows contaminate about 55% of the optical remote sensing observations, posing additional challenges. To address these challenges, we present SatFlow, a generative model-based framework that fuses low-resolution MODIS imagery and Landsat observations to produce frequent, high-resolution, gap-free surface reflectance imagery. Our model, trained via Conditional Flow Matching, demonstrates better performance in generating imagery with preserved structural and spectral integrity. Cloud imputation is treated as an image inpainting task, where the model reconstructs cloud-contaminated pixels and fills gaps caused by scan lines during inference by leveraging the learned generative processes. Experimental results demonstrate the capability of our approach in reliably imputing cloud-covered regions. This capability is crucial for downstream applications such as crop phenology tracking and environmental change detection.
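Conditional Flow Matching, the training objective SatFlow uses, regresses a velocity field along interpolation paths between noise and data. A minimal version of the target construction for the common straight-line path is shown below; the network itself and the conditioning on MODIS/Landsat inputs are omitted, so this is only the data side of the objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_pair(x1):
    """Sample one (x_t, t, target velocity) training triple for flow matching
    with a straight-line probability path from noise x0 to data x1."""
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.uniform()                   # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight-line path
    v_target = x1 - x0                  # constant target velocity along the path
    return x_t, t, v_target

x1 = rng.standard_normal(8)             # stands in for a clean reflectance patch
x_t, t, v = cfm_pair(x1)

# Sanity check of the interpolation identity: following the target
# velocity from x_t for the remaining time (1 - t) lands exactly on x1.
recon = x_t + (1.0 - t) * v
```

During training, a network v_theta(x_t, t, conditioning) would be regressed onto v_target with a squared loss.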
zh
[CV-52] Enhancing Feature Tracking Reliability for Visual Navigation using Real-Time Safety Filter ICRA2025
[Quick Read]: This paper addresses the reliability of feature tracking and the accuracy of pose estimation during visual navigation, under the requirement of keeping a sufficient number of features visible. The key solution is a real-time safety filter based on quadratic programming: exploiting the invariance properties of visibility constraints within the robot's kinematic model, it deviates minimally from a reference velocity command while keeping the information score of the currently visible features above a user-specified threshold.
Link: https://arxiv.org/abs/2502.01092
Authors: Dabin Kim, Inkyu Jang, Youngsoo Han, Sunwoo Hwang, H. Jin Kim
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 7 pages, 6 figures. Accepted to the 2025 IEEE International Conference on Robotics and Automation (ICRA 2025)
Click to view abstract
Abstract:Vision sensors are extensively used for localizing a robot’s pose, particularly in environments where global localization tools such as GPS or motion capture systems are unavailable. In many visual navigation systems, localization is achieved by detecting and tracking visual features or landmarks, which provide information about the sensor’s relative pose. For reliable feature tracking and accurate pose estimation, it is crucial to maintain visibility of a sufficient number of features. This requirement can sometimes conflict with the robot’s overall task objective. In this paper, we approach it as a constrained control problem. By leveraging the invariance properties of visibility constraints within the robot’s kinematic model, we propose a real-time safety filter based on quadratic programming. This filter takes a reference velocity command as input and produces a modified velocity that minimally deviates from the reference while ensuring the information score from the currently visible features remains above a user-specified threshold. Numerical simulations demonstrate that the proposed safety filter preserves the invariance condition and ensures the visibility of more features than the required minimum. We also validated its real-world performance by integrating it into a visual simultaneous localization and mapping (SLAM) algorithm, where it maintained high estimation quality in challenging environments, outperforming a simple tracking controller.
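The safety filter solves, at each control step, a small QP: stay as close as possible to the reference velocity while a visibility condition holds. For a single linear constraint a·u >= b the QP has the closed-form projection below; the constraint vector and threshold are illustrative stand-ins for the paper's information-score condition, which is more elaborate.

```python
import numpy as np

def safety_filter(u_ref, a, b):
    """minimize ||u - u_ref||^2  subject to  a @ u >= b  (one halfspace).

    If the reference already satisfies the constraint it is returned
    unchanged; otherwise u_ref is projected onto the constraint boundary.
    """
    u_ref = np.asarray(u_ref, dtype=float)
    a = np.asarray(a, dtype=float)
    if a @ u_ref >= b:
        return u_ref
    return u_ref + (b - a @ u_ref) / (a @ a) * a

a = np.array([1.0, 0.0])                     # toy "visibility" direction
safe  = safety_filter([2.0, 1.0], a, b=1.0)  # already safe -> unchanged
fixed = safety_filter([0.2, 1.0], a, b=1.0)  # projected onto a @ u = 1
```

With several constraints a generic QP solver replaces the closed form, but the structure (minimal deviation subject to a visibility bound) is the same.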
zh
[CV-53] BC-GAN: A Generative Adversarial Network for Synthesizing a Batch of Collocated Clothing
[Quick Read]: This paper addresses the limitation that existing methods synthesize only one collocated clothing item at a time, which cannot meet users' diverse needs across occasions and personal preferences. The key solution is BC-GAN, a new batch clothing generation framework whose novel fashion-compatibility discriminator, designed from a contrastive learning perspective, fully exploits the collocation relationships among all clothing items. This enables the simultaneous synthesis of multiple visually collocated clothing images with improved diversity, visual authenticity, and fashion compatibility.
Link: https://arxiv.org/abs/2502.01080
Authors: Dongliang Zhou, Haijun Zhang, Jianghong Ma, Jianyang Shi
Affiliations: Department of Computer Science, Harbin Institute of Technology, Shenzhen, Xili University Town, Shenzhen 518055, P. R. China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: This paper was accepted by IEEE TCSVT
Click to view abstract
Abstract:Collocated clothing synthesis using generative networks has become an emerging topic in the field of fashion intelligence, as it has significant potential economic value to increase revenue in the fashion industry. In previous studies, several works have attempted to synthesize visually-collocated clothing based on a given clothing item using generative adversarial networks (GANs) with promising results. These works, however, can only accomplish the synthesis of one collocated clothing item each time. Nevertheless, users may require different clothing items to meet their multiple choices due to their personal tastes and different dressing scenarios. To address this limitation, we introduce a novel batch clothing generation framework, named BC-GAN, which is able to synthesize multiple visually-collocated clothing images simultaneously. In particular, to further improve the fashion compatibility of synthetic results, BC-GAN proposes a new fashion compatibility discriminator in a contrastive learning perspective by fully exploiting the collocation relationship among all clothing items. Our model was examined in a large-scale dataset with compatible outfits constructed by ourselves. Extensive experiment results confirmed the effectiveness of our proposed BC-GAN in comparison to state-of-the-art methods in terms of diversity, visual authenticity, and fashion compatibility.
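The contrastive view behind BC-GAN's fashion-compatibility discriminator, pulling embeddings of items from the same outfit together and pushing others apart, can be sketched with a cosine-similarity InfoNCE term. The embeddings are random stand-ins and the temperature is an assumption; the paper's actual discriminator architecture is not reproduced here.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: low when the anchor is closest to its
    collocated positive, high when a non-collocated item ranks first."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = logits / temperature
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(p[0]))

rng = np.random.default_rng(0)
anchor = rng.standard_normal(16)
positive = anchor + 0.05 * rng.standard_normal(16)  # compatible item: nearby
negatives = [rng.standard_normal(16) for _ in range(4)]

loss_good = info_nce(anchor, positive, negatives)
loss_bad = info_nce(anchor, negatives[0], [positive] + negatives[1:])
```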
zh
[CV-54] OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
[Quick Read]: This paper addresses the difficulty of scaling existing end-to-end human animation methods into large general-purpose video generation models, which limits their potential in real applications. The key solution is OmniHuman, a Diffusion-Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. The framework introduces two training principles, together with the corresponding model architecture and inference strategy, enabling it to fully exploit data-driven motion generation and achieve highly realistic human video generation. OmniHuman is also broadly applicable, supporting various portrait contents and motion modalities and handling complex scenes and poses.
Link: https://arxiv.org/abs/2502.01061
Authors: Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang
Affiliations: ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Click to view abstract
Abstract:End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the recent few years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (this https URL).
zh
[CV-55] Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
【速读】:该论文旨在解决扩散模型(Diffusion Models)在步长级别(step-level)偏好优化中的挑战,特别是如何更有效地与人类图像偏好对齐。传统方法通常依赖于视觉语言模型(Vision-Language Models, VLMs)作为像素级奖励模型来近似人类偏好,但这些方法在处理不同时间步(timesteps)的噪声图像时面临困难,并且需要复杂的像素空间转换。论文的关键解决方案在于提出了一种潜空间奖励模型(Latent Reward Model, LRM),该模型重新利用扩散模型的组件来预测不同时间步下的噪声潜图像(latent images)的偏好。基于LRM,作者进一步提出了潜空间偏好优化(Latent Preference Optimization, LPO),这是一种直接在潜空间进行步长级别偏好优化的方法。实验结果表明,LPO不仅显著提升了扩散模型与一般、美学及文本-图像对齐偏好的一致性,还实现了2.5到28倍的训练速度提升。
链接: https://arxiv.org/abs/2502.01051
作者: Tao Zhang,Cheng Da,Kun Ding,Kun Jin,Yan Li,Tingting Gao,Di Zhang,Shiming Xiang,Chunhong Pan
机构: MAIS, CASIA(模式识别国家重点实验室, 中科院自动化所); Kuaishou Technology(快手科技); School of Artificial Intelligence, UCAS(中国科学院大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 14 tables, 15 figures
点击查看摘要
Abstract:Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space, as they can naturally extract features from noisy latent images. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of diffusion models to predict preferences of latent images at various timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space. Experimental results indicate that LPO not only significantly enhances performance in aligning diffusion models with general, aesthetic, and text-image alignment preferences, but also achieves 2.5-28 \times training speedup compared to existing preference optimization methods. Our code will be available at this https URL.
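Step-level preference optimization of the kind LPO performs ultimately reduces to a pairwise preference loss over a reward margin between a preferred and a dispreferred sample at a given timestep. A generic Bradley-Terry-style loss is sketched below; the scalar rewards are plain numbers standing in for the LRM's latent-space predictions, and beta is an illustrative scale.

```python
import math

def preference_loss(r_preferred, r_dispreferred, beta=1.0):
    """-log sigmoid(beta * (r_w - r_l)): small when the preferred sample
    is scored clearly higher than the dispreferred one, large otherwise."""
    margin = beta * (r_preferred - r_dispreferred)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good_margin = preference_loss(2.0, 0.0)  # confident correct ranking -> low loss
bad_margin  = preference_loss(0.0, 2.0)  # reversed ranking -> high loss
```

At margin zero the loss equals log 2, i.e. the model is maximally uncertain which sample is preferred; the optimizer pushes rewards apart until the margin (and hence the gradient) shrinks.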
zh
[CV-56] Sparks of Explainability: Recent Advancements in Explaining Large Vision Models
[Quick Read]: This thesis aims to improve explainability in computer vision, chiefly by analyzing and modeling the features exploited by deep neural networks. Its key contributions are the introduction and evaluation of several attribution methods, including a metric based on algorithmic stability and an approach using Sobol indices that greatly reduces computation time, as well as the EVA method, which provides a first formulation of attribution with formal guarantees via verified perturbation analysis. Since experiments show these methods remain insufficient in complex scenarios, identifying only "where" the model focuses without clarifying "what" it perceives, two hypotheses are explored: aligning models with human reasoning, via a training routine that imitates human explanations and optimizes within the space of 1-Lipschitz functions, and adopting a concept-based explainability approach. For the latter, the CRAFT method automates the extraction of the concepts a model uses and assesses their importance, complemented by MACO for visualization. These works converge into a unified framework, illustrated by an interactive demonstration applied to the 1000 ImageNet classes of a ResNet model.
Link: https://arxiv.org/abs/2502.01048
Authors: Thomas Fel
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Doctoral thesis
Click to view abstract
Abstract:This thesis explores advanced approaches to improve explainability in computer vision by analyzing and modeling the features exploited by deep neural networks. Initially, it evaluates attribution methods, notably saliency maps, by introducing a metric based on algorithmic stability and an approach utilizing Sobol indices, which, through quasi-Monte Carlo sequences, allows a significant reduction in computation time. In addition, the EVA method offers a first formulation of attribution with formal guarantees via verified perturbation analysis. Experimental results indicate that in complex scenarios these methods do not provide sufficient understanding, particularly because they identify only “where” the model focuses without clarifying “what” it perceives. Two hypotheses are therefore examined: aligning models with human reasoning – through the introduction of a training routine that integrates the imitation of human explanations and optimization within the space of 1-Lipschitz functions – and adopting a conceptual explainability approach. The CRAFT method is proposed to automate the extraction of the concepts used by the model and to assess their importance, complemented by MACO, which enables their visualization. These works converge towards a unified framework, illustrated by an interactive demonstration applied to the 1000 ImageNet classes in a ResNet model.
zh
[CV-57] Emotional Face-to-Speech
[Quick Read]: This paper explores how to infer emotional speech solely from expressive facial cues, introducing a new task, emotional face-to-speech, whose goal is to synthesize emotional speech directly from an expressive face. The key solution is DEmoFace, a novel generative framework that couples a discrete diffusion transformer (DiT) with curriculum learning, built on a multi-level neural audio codec. DEmoFace uses multimodal DiT blocks to dynamically align text and speech while tailoring vocal styles to facial emotion and identity. A coarse-to-fine curriculum learning algorithm for multi-level token processing improves training efficiency and generation quality, and an enhanced predictor-free guidance handles diverse conditioning scenarios, enabling multi-conditional generation and effective disentanglement of complex attributes.
Link: https://arxiv.org/abs/2502.01046
Authors: Jiaxin Ye, Boyuan Cao, Hongming Shan
Affiliations: Unknown
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments:
Click to view abstract
Abstract:How much can we infer about an emotional voice solely from an expressive face? This intriguing question holds great potential for applications such as virtual character dubbing and aiding individuals with expressive language disorders. Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression. In this paper, we explore a new task, termed emotional face-to-speech, aiming to synthesize emotional speech directly from expressive facial cues. To that end, we introduce DEmoFace, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning, built upon a multi-level neural audio codec. Specifically, we propose multimodal DiT blocks to dynamically align text and speech while tailoring vocal styles based on facial emotion and identity. To enhance training efficiency and generation quality, we further introduce a coarse-to-fine curriculum learning algorithm for multi-level token processing. In addition, we develop an enhanced predictor-free guidance to handle diverse conditioning scenarios, enabling multi-conditional generation and disentangling complex attributes effectively. Extensive experimental results demonstrate that DEmoFace generates more natural and consistent speech compared to baselines, even surpassing speech-driven methods. Demos are shown at this https URL.
zh
[CV-58] WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction
[Quick Read]: This paper addresses the reconstruction of high-quality dynamic human avatars from monocular video, in particular reconstructing body parts that are never observed when viewpoints are limited. The key solution, WonderHuman, leverages 2D generative diffusion priors together with a Dual-Space Optimization technique that applies Score Distillation Sampling (SDS) in both the canonical and observation spaces to ensure visual consistency and enhance the realism of the dynamic reconstruction. A View Selection strategy and Pose Feature Injection further enforce consistency between SDS predictions and the observed data, yielding pose-dependent effects and higher reconstruction fidelity.
Link: https://arxiv.org/abs/2502.01045
Authors: Zilong Wang, Zhiyang Dou, Yuan Liu, Cheng Lin, Xiao Dong, Yunhui Guo, Chenxu Zhang, Xin Li, Wenping Wang, Xiaohu Guo
Affiliations: The University of Texas at Dallas; The University of Hong Kong; The Hong Kong University of Science and Technology; BNU-HKBU United International College; Texas A&M University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
Click to view abstract
Abstract:In this paper, we present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis. Previous dynamic human avatar reconstruction methods typically require the input video to have full coverage of the observed human body. However, in daily practice, one typically has access to limited viewpoints, such as monocular front-view videos, making it a cumbersome task for previous methods to reconstruct the unseen parts of the human avatar. To tackle the issue, we present WonderHuman, which leverages 2D generative diffusion model priors to achieve high-quality, photorealistic reconstructions of dynamic human avatars from monocular videos, including accurate rendering of unseen body parts. Our approach introduces a Dual-Space Optimization technique, applying Score Distillation Sampling (SDS) in both canonical and observation spaces to ensure visual consistency and enhance realism in dynamic human reconstruction. Additionally, we present a View Selection strategy and Pose Feature Injection to enforce the consistency between SDS predictions and observed data, ensuring pose-dependent effects and higher fidelity in the reconstructed avatar. In the experiments, our method achieves SOTA performance in producing photorealistic renderings from the given monocular video, particularly for those challenging unseen parts. The project page and source code can be found at this https URL.
zh
[CV-59] UASTHN: Uncertainty-Aware Deep Homography Estimation for UAV Satellite-Thermal Geo-localization ICRA2025
【速读】:该论文旨在解决热红外地理定位(Thermal Geo-localization, TG)方法在输出中缺乏不确定性测量的问题,这限制了系统在面对无纹理或损坏的热图像、自相似或过时的卫星地图、几何噪声以及超出卫星地图范围的热图像时的鲁棒性。论文的关键解决方案是提出了一种新颖的方法,即UASTHN,用于深度单应性估计(Deep Homography Estimation, DHE)任务中的不确定性估计(Uncertainty Estimation, UE)。具体而言,该方法引入了一种基于裁剪的测试时增强策略(Crop-based Test-Time Augmentation, CropTTA),通过利用裁剪图像视图的单应性一致性来有效测量数据不确定性。此外,该方法还采用了深度集成(Deep Ensembles, DE)来评估模型不确定性,从而提供了一种高效且可与任何DHE模型无缝集成的方案。
链接: https://arxiv.org/abs/2502.01035
作者: Jiuhong Xiao,Giuseppe Loianno
机构: New York University, Tandon School of Engineering (纽约大学,塔andon工程学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 6 figures, accepted at ICRA 2025
点击查看摘要
Abstract:Geo-localization is an essential component of Unmanned Aerial Vehicle (UAV) navigation systems to ensure precise absolute self-localization in outdoor environments. To address the challenges of GPS signal interruptions or low illumination, Thermal Geo-localization (TG) employs aerial thermal imagery to align with reference satellite maps to accurately determine the UAV’s location. However, existing TG methods lack uncertainty measurement in their outputs, compromising system robustness in the presence of textureless or corrupted thermal images, self-similar or outdated satellite maps, geometric noises, or thermal images exceeding satellite maps. To overcome these limitations, this paper presents \textitUASTHN, a novel approach for Uncertainty Estimation (UE) in Deep Homography Estimation (DHE) tasks for TG applications. Specifically, we introduce a novel Crop-based Test-Time Augmentation (CropTTA) strategy, which leverages the homography consensus of cropped image views to effectively measure data uncertainty. This approach is complemented by Deep Ensembles (DE) employed for model uncertainty, offering comparable performance with improved efficiency and seamless integration with any DHE model. Extensive experiments across multiple DHE models demonstrate the effectiveness and efficiency of CropTTA in TG applications. Analysis of detected failure cases underscores the improved reliability of CropTTA under challenging conditions. Finally, we demonstrate the capability of combining CropTTA and DE for a comprehensive assessment of both data and model uncertainty. Our research provides profound insights into the broader intersection of localization and uncertainty estimation. The code and data is publicly available.
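CropTTA 的核心是"裁剪视图间的单应性共识":对多个裁剪视图分别估计单应性,若各视图映射结果分歧大,则数据不确定性高。以下为示意实现(`estimate_homography` 为任意 DHE 模型的占位函数,角点选取与不确定性聚合方式均为假设):

```python
import numpy as np

def crop_tta_uncertainty(estimate_homography, image, crops):
    """对每个裁剪视图估计一次单应性,将图像四角映射到卫星图坐标,
    以跨视图角点位置的标准差均值作为数据不确定性(示意)。"""
    h, w = image.shape[:2]
    corners = np.array([[0, 0, 1], [w, 0, 1], [w, h, 1], [0, h, 1]], float).T
    mapped = []
    for crop in crops:
        H = estimate_homography(image, crop)      # 3x3 单应矩阵
        p = H @ corners
        mapped.append((p[:2] / p[2]).T)           # 齐次坐标归一化
    mapped = np.stack(mapped)                     # (n_crops, 4, 2)
    return float(np.mean(mapped.std(axis=0)))     # 共识越差 -> 不确定性越高

image = np.zeros((64, 64))
crops = [image[:48, :48], image[16:, 16:]]
identity = lambda img, crop: np.eye(3)
u = crop_tta_uncertainty(identity, image, crops)  # 所有视图一致 -> 不确定性为 0

shifts = iter([0.0, 2.0])                         # 两个视图的估计相差 2 像素平移
shifty = lambda img, crop: np.array([[1, 0, next(shifts)], [0, 1, 0], [0, 0, 1]], float)
u2 = crop_tta_uncertainty(shifty, image, crops)   # 分歧 -> 不确定性为正
```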
zh
[CV-60] Vessel segmentation for X-separation
【速读】:该论文旨在解决在使用 χ-分离(χ-separation)方法进行定量磁化率成像(QSM)时,血管引起的伪影干扰铁和髓鞘准确量化的问题。解决方案的关键在于提出了一种新的血管分割方法,该方法通过三步实现:1)从 R2* 图谱及 χpara 与 |χdia| 乘积图谱生成种子;2)基于血管几何引导的区域生长,创建血管掩膜;3)通过排除非血管结构来细化血管掩膜。此方法显著优于传统血管分割方法,并在神经网络重建方法 χ-sepnet-R2* 的定量评估及群体平均感兴趣区域分析中展现出改进效果。
链接: https://arxiv.org/abs/2502.01023
作者: Taechang Kim,Sooyeon Ji,Kyeongseon Min,Minjun Kim,Jonghyo Youn,Chungseok Oh,Jiye Kim,Jongho Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:
点击查看摘要
Abstract: χ-separation is an advanced quantitative susceptibility mapping (QSM) method that is designed to generate paramagnetic (χpara) and diamagnetic (|χdia|) susceptibility maps, reflecting the distribution of iron and myelin in the brain. However, vessels have shown artifacts, interfering with the accurate quantification of iron and myelin in applications. To address this challenge, a new vessel segmentation method for χ-separation is developed. The method comprises three steps: 1) Seed generation from R2* and the product of χpara and |χdia| maps; 2) Region growing, guided by vessel geometry, creating a vessel mask; 3) Refinement of the vessel mask by excluding non-vessel structures. The performance of the method was compared to conventional vessel segmentation methods both qualitatively and quantitatively. To demonstrate the utility of the method, it was tested in two applications: quantitative evaluation of a neural network-based χ-separation reconstruction method (χ-sepnet-R2*) and population-averaged region of interest (ROI) analysis. The proposed method demonstrates superior performance to the conventional vessel segmentation methods, effectively excluding non-vessel structures and achieving the highest Dice score coefficient. In both applications, applying vessel masks yields notable improvements in the quantitative evaluation of χ-sepnet-R2* and statistically significant differences in the population-averaged ROI analysis. These results suggest that excluding vessels when analyzing χ-separation maps provides more accurate evaluations. The proposed method has the potential to facilitate various applications, offering reliable analysis through the generation of a high-quality vessel mask.
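三步流程中第二步的区域生长可以用一个简化的强度阈值 BFS 来示意(4-邻域与阈值条件均为假设;论文实际由血管几何引导生长,并在第三步进一步剔除非血管结构):

```python
import numpy as np
from collections import deque

def region_grow(img, seeds, thresh):
    """从种子点出发向 4-邻域扩张,仅吸纳强度 >= thresh 的像素(示意)。"""
    mask = np.zeros(img.shape, bool)
    q = deque(seeds)
    for s in seeds:
        mask[s] = True
    while q:
        y, x = q.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < img.shape[0] and 0 <= nx < img.shape[1] \
               and not mask[ny, nx] and img[ny, nx] >= thresh:
                mask[ny, nx] = True
                q.append((ny, nx))
    return mask

img = np.array([[0, 9, 0],
                [0, 9, 0],
                [0, 9, 0]], float)     # 中间一列模拟高信号"血管"
vessel = region_grow(img, seeds=[(0, 1)], thresh=5)
```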
zh
[CV-61] ZeroBP: Learning Position-Aware Correspondence for Zero-shot 6D Pose Estimation in Bin-Picking ICRA2025
【速读】:该论文旨在解决二元拣选(Bin-picking)任务中零样本6D位姿估计(Zero-shot 6D pose estimation)的效率问题。现有方法依赖于特定对象的训练数据,导致在处理新工件时需要大量的数据收集和模型重新训练。论文的关键解决方案是提出了一种名为ZeroBP的框架,它通过学习场景实例与CAD模型之间的位置感知对应关系(Position-Aware Correspondence, PAC),结合局部特征和全局位置来解决因相似形状和外观引起的不匹配问题。实验结果表明,ZeroBP在ROBI数据集上的表现优于现有的零样本6D位姿估计方法,正确位姿的平均召回率提高了9.1%。
链接: https://arxiv.org/abs/2502.01004
作者: Jianqiu Chen,Zikun Zhou,Xin Li,Ye Zheng,Tianpeng Bao,Zhenyu He
机构: Harbin Institute of Technology (哈尔滨工业大学); Pengcheng Laboratory (鹏城实验室); JD.com, Inc. (京东集团); SenseTime Research (商汤研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025
点击查看摘要
Abstract:Bin-picking is a practical and challenging robotic manipulation task, where accurate 6D pose estimation plays a pivotal role. The workpieces in bin-picking are typically textureless and randomly stacked in a bin, which poses a significant challenge to 6D pose estimation. Existing solutions are typically learning-based methods, which require object-specific training. Their efficiency of practical deployment for novel workpieces is highly limited by data collection and model retraining. Zero-shot 6D pose estimation is a potential approach to address the issue of deployment efficiency. Nevertheless, existing zero-shot 6D pose estimation methods are designed to leverage feature matching to establish point-to-point correspondences for pose estimation, which is less effective for workpieces with textureless appearances and ambiguous local regions. In this paper, we propose ZeroBP, a zero-shot pose estimation framework designed specifically for the bin-picking task. ZeroBP learns Position-Aware Correspondence (PAC) between the scene instance and its CAD model, leveraging both local features and global positions to resolve the mismatch issue caused by ambiguous regions with similar shapes and appearances. Extensive experiments on the ROBI dataset demonstrate that ZeroBP outperforms state-of-the-art zero-shot pose estimation methods, achieving an improvement of 9.1% in average recall of correct poses.
zh
[CV-62] Multi-Resolution SAR and Optical Remote Sensing Image Registration Methods: A Review Datasets and Future Perspectives
【速读】:该论文旨在解决合成孔径雷达(SAR)与光学图像配准中的挑战,特别是在高分辨率下由于成像机制、几何失真和辐射属性差异导致的配准难题。论文的关键在于创建了MultiResSAR数据集,并系统性地评估了十六种最先进的算法。结果表明,没有一种算法能够实现100%的成功率,且随着分辨率的提高,性能显著下降。论文建议未来研究应着重于噪声抑制、三维几何融合、跨视角变换建模以及深度学习优化,以实现高分辨率SAR与光学图像的稳健配准。
链接: https://arxiv.org/abs/2502.01002
作者: Wenfei Zhang,Ruipeng Zhao,Yongxiang Yao,Yi Wan,Peihao Wu,Jiayuan Li,Yansheng Li,Yongjun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 48 pages, 10 figures
点击查看摘要
Abstract:Synthetic Aperture Radar (SAR) and optical image registration is essential for remote sensing data fusion, with applications in military reconnaissance, environmental monitoring, and disaster management. However, challenges arise from differences in imaging mechanisms, geometric distortions, and radiometric properties between SAR and optical images. As image resolution increases, fine SAR textures become more significant, leading to alignment issues and 3D spatial discrepancies. Two major gaps exist: the lack of a publicly available multi-resolution, multi-scene registration dataset and the absence of systematic analysis of current methods. To address this, the MultiResSAR dataset was created, containing over 10k pairs of multi-source, multi-resolution, and multi-scene SAR and optical images. Sixteen state-of-the-art algorithms were tested. Results show no algorithm achieves 100% success, and performance decreases as resolution increases, with most failing on sub-meter data. XoFTR performs best among deep learning methods (40.58%), while RIFT performs best among traditional methods (66.51%). Future research should focus on noise suppression, 3D geometric fusion, cross-view transformation modeling, and deep learning optimization for robust registration of high-resolution SAR and optical images. The dataset is available at this https URL.
zh
[CV-63] Adapting Foundation Models for Few-Shot Medical Image Segmentation: Actively and Sequentially
【速读】:该论文旨在解决在目标任务存在较大领域差距且标注样本有限的情况下,确保可靠和鲁棒的模型适应性问题。解决方案的关键在于提出了一种名为Active and Sequential domain AdaPtation (ASAP) 的框架,通过将小样本领域自适应(Few-Shot Domain Adaptation, FSDA)问题形式化为多臂老虎机问题,并设计了一个高效的奖励函数来动态选择与目标任务紧密相关的辅助数据集,从而实现单轮微调。实验验证表明,该方法在多种医学分割数据集上表现出色,显著优于现有的FSDA方法,在MRI数据集上的Dice评分平均提升了27.75%,CT数据集上提升了7.52%。
链接: https://arxiv.org/abs/2502.01000
作者: Jingyun Yang,Guoqing Zhang,Jingge Wang,Yang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in foundation models have brought promising results in computer vision, including medical image segmentation. Fine-tuning foundation models on specific low-resource medical tasks has become a standard practice. However, ensuring reliable and robust model adaptation when the target task has a large domain gap and few annotated samples remains a challenge. Previous few-shot domain adaptation (FSDA) methods seek to bridge the distribution gap between source and target domains by utilizing auxiliary data. The selection and scheduling of auxiliaries are often based on heuristics, which can easily cause negative transfer. In this work, we propose an Active and Sequential domain AdaPtation (ASAP) framework for dynamic auxiliary dataset selection in FSDA. We formulate FSDA as a multi-armed bandit problem and derive an efficient reward function to prioritize training on auxiliary datasets that align closely with the target task, through a single-round fine-tuning. Empirical validation on diverse medical segmentation datasets demonstrates that our method achieves favorable segmentation performance, significantly outperforming the state-of-the-art FSDA methods, achieving an average gain of 27.75% on MRI and 7.52% on CT datasets in Dice score. Code is available at the git repository: this https URL.
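把辅助数据集选择建模为多臂老虎机后,需要一个在"利用高奖励臂"与"探索未知臂"之间权衡的策略。论文设计了专用奖励函数,此处用经典的 UCB1 策略做示意(臂即候选辅助数据集,奖励为假设的伯努利信号):

```python
import math, random

def ucb1_select(counts, values, t):
    """UCB1:先保证每臂至少被选一次,之后选 均值 + 探索加成 最大的臂。"""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

random.seed(0)
true_reward = [0.2, 0.8]          # 臂 1:与目标任务更相关的辅助数据集
counts, values = [0, 0], [0.0, 0.0]
for t in range(1, 201):
    a = ucb1_select(counts, values, t)
    r = 1.0 if random.random() < true_reward[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # 增量式均值更新
```

经过足够多轮后,与目标任务更相关的辅助数据集会被优先用于微调,这正是避免负迁移的动机。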
zh
[CV-64] FCBoost-Net: A Generative Network for Synthesizing Multiple Collocated Outfits via Fashion Compatibility Boosting
【速读】:该论文旨在解决时尚穿搭生成领域中多样化选择不足的问题。现有的研究局限于基于给定物品生成唯一一套搭配,而无法为用户提供更多选择。为了解决这一问题,论文提出了一种新的框架FCBoost-Net,其关键是利用预训练的生成模型来生成多套协调且多样化的穿搭。通过引入一种新颖的时尚搭配增强器,FCBoost-Net能够在多轮迭代中逐步提升生成搭配的协调性和多样性。这种方法受到了提升算法的启发,能够有效改善随机生成的时尚物品的搭配性同时保持多样性。
链接: https://arxiv.org/abs/2502.00992
作者: Dongliang Zhou,Haijun Zhang,Jianghong Ma,Jicong Fan,Zhao Zhang
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: This paper has been accepted for presentation at ACM Multimedia 2023
点击查看摘要
Abstract:Outfit generation is a challenging task in the field of fashion technology, in which the aim is to create a collocated set of fashion items that complement a given set of items. Previous studies in this area have been limited to generating a unique set of fashion items based on a given set of items, without providing additional options to users. This lack of a diverse range of choices necessitates the development of a more versatile framework. However, when the task of generating collocated and diversified outfits is approached with multimodal image-to-image translation methods, it poses a challenging problem in terms of non-aligned image translation, which is hard to address with existing methods. In this research, we present FCBoost-Net, a new framework for outfit generation that leverages the power of pre-trained generative models to produce multiple collocated and diversified outfits. Initially, FCBoost-Net randomly synthesizes multiple sets of fashion items, and the compatibility of the synthesized sets is then improved in several rounds using a novel fashion compatibility booster. This approach was inspired by boosting algorithms and allows the performance to be gradually improved in multiple steps. Empirical evidence indicates that the proposed strategy can improve the fashion compatibility of randomly synthesized fashion items as well as maintain their diversity. Extensive experiments confirm the effectiveness of our proposed framework with respect to visual authenticity, diversity, and fashion compatibility.
zh
[CV-65] Pushing the Boundaries of State Space Models for Image and Video Generation
【速读】:该论文旨在探索状态空间模型(State-Space Models, SSM)在图像和视频生成任务中的能力边界。论文的关键解决方案在于构建迄今为止最大规模的扩散SSM-Transformer混合模型(50亿参数),基于次二次双向Hydra和自注意力机制,从而实现高达2K分辨率的图像和360p分辨率、8秒长(16帧/秒)的视频生成。实验结果表明,该模型能够生成与复杂文本提示一致且具有高动态范围且时间上一致的视频,这表明SSM在视觉生成任务中具有巨大潜力。
链接: https://arxiv.org/abs/2502.00972
作者: Yicong Hong,Long Mai,Yuan Yao,Feng Liu
机构: Adobe Research; University of Rochester
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages, paper under review
点击查看摘要
Abstract:While Transformers have become the dominant architecture for visual generation, linear attention models, such as the state-space models (SSM), are increasingly recognized for their efficiency in processing long visual sequences. However, the essential efficiency of these models comes from formulating a limited recurrent state, enforcing causality among tokens that are prone to inconsistent modeling of N-dimensional visual data, leaving questions on their capacity to generate long non-causal sequences. In this paper, we explore the boundary of SSM on image and video generation by building the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters) based on the sub-quadratic bi-directional Hydra and self-attention, and generate up to 2K images and 360p 8 seconds (16 FPS) videos. Our results demonstrate that the model can produce faithful results aligned with complex text prompts and temporal consistent videos with high dynamics, suggesting the great potential of using SSMs for visual generation tasks.
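状态空间模型之所以具有次二次复杂度,是因为它把序列建模写成线性递推 h_t = A h_{t-1} + B x_t、y_t = C h_t,用固定大小的状态扫描整个序列。以下为与论文的双向 Hydra 结构无关的最小因果扫描示意(参数形状为假设):

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """最朴素的 SSM 顺序扫描:h_t = A h_{t-1} + B x_t, y_t = C h_t。
    状态 h 的维度固定,与序列长度无关,这是其效率来源。"""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:               # x 为标量输入,B 为 (d,) 输入投影
        h = A @ h + B * x
        ys.append(C @ h)
    return np.array(ys)

# A = 0 时状态无记忆,输出退化为逐点线性映射 y_t = (C·B) x_t
A = np.zeros((2, 2))
B = np.array([1.0, 2.0])
C = np.array([0.5, 0.25])      # C·B = 1.0
ys = ssm_scan(A, B, C, [1.0, 2.0, 3.0])
```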
zh
[CV-66] CoDe: Blockwise Control for Denoising Diffusion Models
【速读】:该论文旨在解决在对扩散模型(Diffusion Models)进行下游任务适配时,通常需要微调新模型或在推理阶段使用基于梯度的引导方法以实现从奖励倾斜后验(reward-tilted posterior)中采样的问题。论文的关键解决方案是提出了一种名为可控去噪(Controlled Denoising, CoDe)的无梯度引导方法。这种方法是一种在去噪过程中分块采样的技术,能够在不依赖可微分引导函数和无需微调模型的情况下,实现与下游奖励的一致性。实验表明,尽管CoDe简单,但其在奖励适配、指令遵循和推理成本之间提供了有利的权衡,且性能可与最先进的基线相媲美。
链接: https://arxiv.org/abs/2502.00968
作者: Anuj Singh,Sayak Mukherjee,Ahmad Beirami,Hadi Jamali-Rad
机构: Delft University of Technology(代尔夫特理工大学); Shell Global Solutions International B.V.(壳牌全球解决方案国际有限公司); Massachusetts Institute of Technology(麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Aligning diffusion models to downstream tasks often requires finetuning new models or gradient-based guidance at inference time to enable sampling from the reward-tilted posterior. In this work, we explore a simple inference-time gradient-free guidance approach, called controlled denoising (CoDe), that circumvents the need for differentiable guidance functions and model finetuning. CoDe is a blockwise sampling method applied during intermediate denoising steps, allowing for alignment with downstream rewards. Our experiments demonstrate that, despite its simplicity, CoDe offers a favorable trade-off between reward alignment, prompt instruction following, and inference cost, achieving a competitive performance against the state-of-the-art baselines. Our code is available at: this https URL.
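CoDe 的分块采样思想可以抽象为:在每个中间去噪块采样若干候选延续,用(不可微的)奖励函数挑出最优者继续,全程无需梯度引导。示意如下(`step` 与 `reward` 均为占位函数,并非原文实现):

```python
def code_blockwise_sample(x0, n_blocks, n_cand, step, reward):
    """每个去噪块:采样 n_cand 个候选,保留奖励最高者(无梯度、免微调)。"""
    x = x0
    for b in range(n_blocks):
        cands = [step(x, b) for _ in range(n_cand)]
        x = max(cands, key=reward)   # 贪心对齐下游奖励,无需可微引导函数
    return x

# 演示:单个块的 3 个候选值为 1、3、2,奖励即数值本身,应选中 3
vals = iter([1.0, 3.0, 2.0])
pick = code_blockwise_sample(0.0, n_blocks=1, n_cand=3,
                             step=lambda x, b: x + next(vals),
                             reward=lambda v: v)
```

n_cand 越大,与奖励的对齐越好,但推理成本线性增加,这正是摘要所述的权衡。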
zh
[CV-67] CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling
【速读】:该论文旨在解决将Mixture-of-Experts (MoE)模型集成到多模态模型如CLIP中以提升性能的同时,面临的训练复杂度高和成本高的问题。关键解决方案在于提出了一种名为CLIP-Upcycling (CLIP-UP)的高效替代训练策略,通过将预训练的密集型CLIP模型转换为稀疏MoE架构,从而显著降低了训练复杂度和成本。实验结果表明,采用CLIP-UP训练的稀疏CLIP B/16模型在COCO和Flickr30k文本到图像Recall@1基准测试中分别比其密集型对应模型高出7.2%和6.6%,同时仅使用后者的30%推理浮点运算次数 (FLOPs),证明了该方法的有效性和可扩展性。
链接: https://arxiv.org/abs/2502.00965
作者: Xinze Wang,Chen Chen,Yinfei Yang,Hong-You Chen,Bowen Zhang,Aditya Pal,Xiangxin Zhu,Xianzhi Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.
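稀疏升级改造(sparse upcycling)的基本操作是:把已训练好的稠密 FFN 权重复制成若干专家,再新建一个路由器。若路由器零初始化、各专家相同,则 MoE 在初始化时的输出与原稠密模型完全一致,训练可从稠密模型的水平平滑起步。numpy 示意(单层线性 FFN 与 softmax 软路由均为简化假设):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def upcycle(W, n_experts):
    """把稠密权重 W 复制为 n_experts 份专家,路由器零初始化。"""
    experts = [W.copy() for _ in range(n_experts)]
    router = np.zeros((W.shape[1], n_experts))   # 输入维 -> 各专家打分
    return experts, router

def moe_forward(x, experts, router):
    probs = softmax(x @ router)                  # 各专家的路由权重
    return sum(p * (E @ x) for p, E in zip(probs, experts))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))                 # "预训练"稠密权重
x = rng.standard_normal(16)
experts, router = upcycle(W, n_experts=4)
dense_out = W @ x
moe_out = moe_forward(x, experts, router)        # 初始化时应与稠密输出相同
```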
zh
[CV-68] SAM-guided Pseudo Label Enhancement for Multi-modal 3D Semantic Segmentation ICRA2025
【速读】:该论文旨在解决多模态三维语义分割在跨域适应过程中可靠伪标签难以生成及稀疏性问题。论文的关键解决方案在于提出了一种图像引导的伪标签增强方法,通过利用来自Segment Anything Model (SAM)的互补2D先验知识,引入更多可靠的伪标签,从而提升跨域适应性能。具体而言,该方法首先使用多数投票确定每个SAM掩膜的类别标签,并采用多种约束过滤不可靠的掩膜标签;随后,通过几何感知渐进传播(Geometry-Aware Progressive Propagation, GAPP)技术,在避免由于2D-3D不一致导致的异常点的情况下,将掩膜标签传播至SAM掩膜内的所有3D点。
链接: https://arxiv.org/abs/2502.00960
作者: Mingyu Yang,Jitong Lu,Hun-Seok Kim
机构: Department of Electrical and Computer Engineering, University of Michigan (密歇根大学电气与计算机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025
点击查看摘要
Abstract:Multi-modal 3D semantic segmentation is vital for applications such as autonomous driving and virtual reality (VR). To effectively deploy these models in real-world scenarios, it is essential to employ cross-domain adaptation techniques that bridge the gap between training data and real-world data. Recently, self-training with pseudo-labels has emerged as a predominant method for cross-domain adaptation in multi-modal 3D semantic segmentation. However, generating reliable pseudo-labels necessitates stringent constraints, which often result in sparse pseudo-labels after pruning. This sparsity can potentially hinder performance improvement during the adaptation process. We propose an image-guided pseudo-label enhancement approach that leverages the complementary 2D prior knowledge from the Segment Anything Model (SAM) to introduce more reliable pseudo-labels, thereby boosting domain adaptation performance. Specifically, given a 3D point cloud and the SAM masks from its paired image data, we collect all 3D points covered by each SAM mask that potentially belong to the same object. Then our method refines the pseudo-labels within each SAM mask in two steps. First, we determine the class label for each mask using majority voting and employ various constraints to filter out unreliable mask labels. Next, we introduce Geometry-Aware Progressive Propagation (GAPP) which propagates the mask label to all 3D points within the SAM mask while avoiding outliers caused by 2D-3D misalignment. Experiments conducted across multiple datasets and domain adaptation scenarios demonstrate that our proposed method significantly increases the quantity of high-quality pseudo-labels and enhances the adaptation performance over baseline methods.
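细化伪标签的第一步"掩膜内多数投票并过滤不可靠标签"可直接用 Counter 示意(多数票占比阈值为假设,原文还叠加了其他约束):

```python
from collections import Counter

def mask_majority_label(point_labels, min_ratio=0.6):
    """对某个 SAM 掩膜覆盖的 3D 点伪标签做多数投票;
    多数票占比低于 min_ratio 时视为不可靠,返回 None(阈值为假设)。"""
    votes = Counter(point_labels)
    label, n = votes.most_common(1)[0]
    return label if n / len(point_labels) >= min_ratio else None

lab1 = mask_majority_label(["car", "car", "car", "road"])      # 占比 0.75,可靠
lab2 = mask_majority_label(["car", "road", "person", "bike"])  # 无明显多数,剔除
```

通过投票确定的掩膜标签随后才由 GAPP 传播到掩膜内的其余 3D 点。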
zh
[CV-69] Hypo3D: Exploring Hypothetical Reasoning in 3D
【速读】:该论文旨在解决现有3D推理基准假设实时场景可访问性的问题,这在实际应用中由于频繁场景更新的高成本而变得不切实际。为了解决这一问题,论文引入了假设性3D推理(Hypo3D)基准,其关键是让模型在没有实时场景数据的情况下,基于提供的变化描述想象场景状态,并在此基础上进行推理。Hypo3D作为一个3D视觉问答(VQA)基准,包含700个室内场景中的7,727个上下文变化,生成了14,885个问题-答案对,并通过锚点世界框架确保方向术语的一致引用。实验结果表明,当前最先进的基础模型在处理假设性变化场景时仍存在显著性能差距,尤其是在涉及运动变化和方向推理的场景中。
链接: https://arxiv.org/abs/2502.00954
作者: Ye Mao,Weixun Luo,Junpeng Jing,Anlan Qiu,Krystian Mikolajczyk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 15 figures, 9 tables
点击查看摘要
Abstract:The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduce Hypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models’ ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive experiments show that state-of-the-art foundation models struggle to reason in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the context change is irrelevant to the question, models often incorrectly adjust their answers.
zh
[CV-70] Fruit Fly Classification (Diptera: Tephritidae) in Images Applying Transfer Learning
【速读】:该论文旨在解决自动化分类实验室环境中两种水果蝇(Anastrepha fraterculus 和 Ceratitis capitata)的问题。当前的分类方法依赖于专家手动识别,受到人为因素的影响且面临时间挑战。论文的关键解决方案在于开发了一种迁移学习模型,并利用预训练的卷积神经网络(Convolutional Neural Networks, CNNs),特别是Inception-v3,在高精度图像处理和特征提取的基础上,实现了82%至93%的F1分数,验证了其在非受控环境中的可靠性和有效性。
链接: https://arxiv.org/abs/2502.00939
作者: Erick Andrew Bustamante Flores,Harley Vera Olivera,Ivan Cesar Medrano Valencia,Carlos Fernando Montoya Cubas
机构: Department of Computer Science, Universidad Nacional de San Antonio Abad del Cusco (国立圣安东尼奥阿巴德库斯科大学), Cusco, Perú
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages and 19 figures
点击查看摘要
Abstract:This study develops a transfer learning model for the automated classification of two species of fruit flies, Anastrepha fraterculus and Ceratitis capitata, in a controlled laboratory environment. The research addresses the need to optimize identification and classification, which are currently performed manually by experts, being affected by human factors and facing time challenges. The methodological process of this study includes the capture of high-quality images using a mobile phone camera and a stereo microscope, followed by segmentation to reduce size and focus on relevant morphological areas. The images were carefully labeled and preprocessed to ensure the quality and consistency of the dataset used to train the pre-trained convolutional neural network models VGG16, VGG19, and Inception-v3. The results were evaluated using the F1-score, achieving 82% for VGG16 and VGG19, while Inception-v3 reached an F1-score of 93%. Inception-v3’s reliability was verified through model testing in uncontrolled environments, with positive results, complemented by the Grad-CAM technique, demonstrating its ability to capture essential morphological features. These findings indicate that Inception-v3 is an effective and replicable approach for classifying Anastrepha fraterculus and Ceratitis capitata, with potential for implementation in automated monitoring systems.
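论文以 F1 分数(精确率与召回率的调和平均)评估分类效果。二分类 F1 的计算可示意如下(样本与标签仅为演示用假设数据):

```python
def f1_score(y_true, y_pred, positive):
    """二分类 F1 = 2PR/(P+R),P、R 由 TP/FP/FN 计数得到。"""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = ["Af", "Af", "Cc", "Cc"]   # Af: A. fraterculus, Cc: C. capitata
y_pred = ["Af", "Cc", "Cc", "Cc"]
f1 = f1_score(y_true, y_pred, positive="Af")   # P=1.0, R=0.5 -> F1=2/3
```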
zh
[CV-71] VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning
【速读】:该论文旨在解决移动机器人在未知环境中基于视觉和语言指令进行导航的问题。关键解决方案在于引入了一种名为启发式视觉-语言(Heuristic-Vision-Language, HVL)的空间推理方法,用于目标点选择。这种方法结合了像素级的视觉-语言特征和启发式探索,使机器人能够在不同环境和规模下高效、稳健地导航至人类指令指定的目标实例。
链接: https://arxiv.org/abs/2502.00931
作者: Yi Du,Taimeng Fu,Zhuoqun Chen,Bowen Li,Shaoshu Su,Zhipeng Zhao,Chen Wang
机构: Spatial AI & Robotics Lab, University at Buffalo (布法罗大学空间人工智能与机器人实验室); Robotics Institute, Carnegie Mellon University (卡内基梅隆大学机器人研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision-language navigation in unknown environments is crucial for mobile robots. In scenarios such as household assistance and rescue, mobile robots need to understand a human command, such as “find a person wearing black”. We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots. Unlike prior methods that rely on a single image-level feature similarity to guide a robot, we introduce the heuristic-vision-language (HVL) spatial reasoning for goal point selection. It combines pixel-wise vision-language features and heuristic exploration to enable efficient navigation to human-instructed instances in various environments robustly. We deploy VL-Nav on a four-wheel mobile robot and conduct comprehensive navigation tasks in various environments of different scales and semantic complexities, indoors and outdoors. Remarkably, VL-Nav operates at a real-time frequency of 30 Hz with a Jetson Orin NX, highlighting its ability to conduct efficient vision-language navigation. Experimental results show that VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.
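HVL 空间推理的目标点选择可抽象为:把像素级视觉-语言相似度图与启发式探索分数图加权合并后取最大值。示意如下(两张分数图与权重 λ 均为假设,仅说明合并与选点逻辑):

```python
import numpy as np

def select_goal(vl_sim, heuristic, lam=0.5):
    """分数 = 视觉-语言相似度 + lam * 启发式探索分;返回最高分像素坐标。"""
    score = vl_sim + lam * heuristic
    return np.unravel_index(np.argmax(score), score.shape)

vl_sim = np.array([[0.1, 0.9],
                   [0.2, 0.3]])      # 指令目标("穿黑衣的人")在右上角
heuristic = np.array([[0.0, 0.0],
                      [1.0, 0.0]])   # 左下角是未探索的前沿区域
goal = select_goal(vl_sim, heuristic, lam=0.5)   # 语言证据足够强时优先目标
```

当相似度图整体低分时,启发式项会主导选点,驱动机器人继续探索,这正是两者结合带来鲁棒性的原因。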
zh
[CV-72] LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation
【速读】:该论文旨在解决现有视觉提示(Visual Prompting)技术在参数高效调优中的局限性,特别是这些方法通常在图像周围添加提示参数,导致与原始图像的交互仅限于少量补丁,而忽视了不同补丁间共享信息的归纳偏置。论文的关键解决方案是引入了一种新颖的视觉提示设计——低秩矩阵乘法视觉提示(LoR-VP),它能够使图像像素行和列之间的共享信息和特定补丁信息得到充分利用。实验结果表明,与最先进的视觉提示方法相比,LoR-VP在七个网络架构和四个数据集上的表现显著提升,实现了最高可达6倍的训练速度加快,使用了少至1/18的视觉提示参数,并提升了3.1%的性能。
链接: https://arxiv.org/abs/2502.00896
作者: Can Jin,Ying Li,Mingyu Zhao,Shiyu Zhao,Zhenting Wang,Xiaoxiao He,Ligong Han,Tong Che,Dimitris N. Metaxas
机构: Rutgers University; Zhejiang University; Red Hat AI Innovation; MIT-IBM Watson AI Lab; NVIDIA Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Visual prompting has gained popularity as a method for adapting pre-trained models to specific tasks, particularly in the realm of parameter-efficient tuning. However, existing visual prompting techniques often pad the prompt parameters around the image, limiting the interaction between the visual prompts and the original image to a small set of patches while neglecting the inductive bias present in shared information across different patches. In this study, we conduct a thorough preliminary investigation to identify and address these limitations. We propose a novel visual prompt design, introducing Low-Rank matrix multiplication for Visual Prompting (LoR-VP), which enables shared and patch-specific information across rows and columns of image pixels. Extensive experiments across seven network architectures and four datasets demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods, achieving up to 6 times faster training times, utilizing 18 times fewer visual prompt parameters, and delivering a 3.1% improvement in performance. The code is available as this https URL.
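低秩视觉提示的核心是用两个小矩阵的乘积 P = B@A 生成整幅提示图,使像素行与列之间共享参数。最小示意(秩 r、与输入的叠加方式均为假设):

```python
import numpy as np

def lor_vp(B, A):
    """P = B @ A:H×r 与 r×W 两个低秩因子生成 H×W 的提示图。"""
    return B @ A

H, W, r = 224, 224, 4
rng = np.random.default_rng(0)
B = rng.standard_normal((H, r))
A = rng.standard_normal((r, W))
P = lor_vp(B, A)
img = np.zeros((3, H, W))
prompted = img + P                   # 提示图广播到所有通道后与图像相加
n_lowrank = B.size + A.size          # 低秩参数量:2 * 224 * 4 = 1792
n_full = H * W                       # 全分辨率提示需 224 * 224 = 50176 个参数
```

低秩分解使提示既覆盖全图(而非仅边缘补丁),参数量又远小于逐像素提示。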
zh
[CV-73] Paper Copilot: The Artificial Intelligence and Machine Learning Community Should Adopt a More Transparent and Regulated Peer Review Process
【速读】:该论文旨在探讨人工智能(AI)与机器学习(ML)会议从封闭评审平台向开放评审平台转变过程中所采用的不同模型,并分析其优势与局限性。论文特别关注透明同行评审日益增长的社区兴趣。通过分析Paper Copilot网站的数据,该论文强调了更加透明、开放且规范的同行评审机制的重要性,以促进更大范围的社区参与及领域的进步。关键在于推动一种更透明、开放且受规管的同行评审体系。
链接: https://arxiv.org/abs/2502.00874
作者: Jing Yang
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:The rapid growth of submissions to top-tier Artificial Intelligence (AI) and Machine Learning (ML) conferences has prompted many venues to transition from closed to open review platforms. Some have fully embraced open peer reviews, allowing public visibility throughout the process, while others adopt hybrid approaches, such as releasing reviews only after final decisions or keeping reviews private despite using open peer review systems. In this work, we analyze the strengths and limitations of these models, highlighting the growing community interest in transparent peer review. To support this discussion, we examine insights from Paper Copilot, a website launched two years ago to aggregate and analyze AI / ML conference data while engaging a global audience. The site has attracted over 200,000 early-career researchers, particularly those aged 18-34 from 177 countries, many of whom are actively engaged in the peer review process. Drawing on our findings, this position paper advocates for a more transparent, open, and well-regulated peer review aiming to foster greater community involvement and propel advancements in the field.
zh
[CV-74] STAF: Sinusoidal Trainable Activation Functions for Implicit Neural Representation
【速读】:该论文旨在解决由ReLU网络的频谱偏见导致的限制,这种偏见阻碍了模型捕捉目标信号中的精细细节。论文的关键解决方案是引入了正弦可训练激活函数(Sinusoidal Trainable Activation Functions, STAF),它通过使网络能够自适应地学习和表示复杂信号来直接应对这一挑战。STAF通过内在调制其频率分量,实现了自适应频谱学习,从而显著提高了收敛速度和表达能力。
链接: https://arxiv.org/abs/2502.00869
作者: Alireza Morsali,MohammadJavad Vaez,Hossein Soltani,Amirhossein Kazerouni,Babak Taati,Morteza Mohammad-Noori
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Implicit Neural Representations (INRs) have emerged as a powerful framework for modeling continuous signals. The spectral bias of ReLU-based networks is a well-established limitation, restricting their ability to capture fine-grained details in target signals. While previous works have attempted to mitigate this issue through frequency-based encodings or architectural modifications, these approaches often introduce additional complexity and do not fully address the underlying challenge of learning high-frequency components efficiently. We introduce Sinusoidal Trainable Activation Functions (STAF), designed to directly tackle this limitation by enabling networks to adaptively learn and represent complex signals with higher precision and efficiency. STAF inherently modulates its frequency components, allowing for self-adaptive spectral learning. This capability significantly improves convergence speed and expressivity, making STAF highly effective for both signal representations and inverse problems. Through extensive evaluations, we demonstrate that STAF outperforms state-of-the-art (SOTA) methods in accuracy and reconstruction fidelity with superior Peak Signal-to-Noise Ratio (PSNR). These results establish STAF as a robust solution for overcoming spectral bias and the capacity-convergence gap, making it valuable for computer graphics and related fields. Our codebase is publicly accessible on the this https URL.
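STAF 的可训练正弦激活可写成 f(x) = Σ_i a_i · sin(ω_i x + φ_i),其中幅值 a、频率 ω、相位 φ 均为可学习参数,网络借此自适应地调制自身的频谱成分。最小示意(参数个数与初始值为假设):

```python
import numpy as np

def staf(x, amp, freq, phase):
    """f(x) = sum_i amp_i * sin(freq_i * x + phase_i),逐元素作用。"""
    x = np.asarray(x, float)[..., None]            # 追加正弦分量维度
    return (amp * np.sin(freq * x + phase)).sum(axis=-1)

amp = np.array([1.0, 0.5])
freq = np.array([1.0, 2.0])
phase = np.array([0.0, 0.0])
y = staf([0.0, np.pi / 2], amp, freq, phase)       # f(0)=0, f(pi/2)=1
```

训练时对 amp、freq、phase 求梯度即可实现摘要所述的"自适应频谱学习"。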
zh
[CV-75] RealRAG : Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning
【速读】:该论文旨在解决现有文本到图像生成模型(如Stable Diffusion V3和Flux)在处理细粒度和未见过的新颖现实世界物体(例如特斯拉Cybertruck)时,因受限于固定参数和封闭数据集而导致的显著幻觉或失真问题。解决方案的关键在于提出首个基于真实物体的检索增强生成框架(RealRAG),通过学习和检索真实世界图像来弥补生成模型的知识缺口。具体而言,通过自反思对比学习训练反射检索器,将生成器的知识注入到自反思负样本中,确保检索到的增强图像能够补偿模型缺失的知识,从而集成缺失的记忆以生成未见过的新颖物体,并提升生成模型对细粒度视觉知识的整合能力,有效解决失真问题并提高细粒度对象生成的逼真度。
链接: https://arxiv.org/abs/2502.00848
作者: Yuanhuiyi Lyu,Xu Zheng,Lutao Jiang,Yibo Yan,Xin Zou,Huiyu Zhou,Linfeng Zhang,Xuming Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent text-to-image generative models, e.g., Stable Diffusion V3 and Flux, have achieved notable progress. However, these models are strongly restricted to their limited knowledge, a.k.a., their own fixed parameters, that are trained with closed datasets. This leads to significant hallucinations or distortions when facing fine-grained and unseen novel real-world objects, e.g., the appearance of the Tesla Cybertruck. To this end, we present the first real-object-based retrieval-augmented generation framework (RealRAG), which augments fine-grained and unseen novel object generation by learning and retrieving real-world images to overcome the knowledge gaps of generative models. Specifically, to integrate missing memory for unseen novel object generation, we train a reflective retriever by self-reflective contrastive learning, which injects the generator’s knowledge into the self-reflective negatives, ensuring that the retrieved augmented images compensate for the model’s missing knowledge. Furthermore, the real-object-based framework integrates fine-grained visual knowledge for the generative models, tackling the distortion problem and improving the realism for fine-grained object generation. Our RealRAG is superior in its modular application to all types of state-of-the-art text-to-image generative models and also delivers remarkable performance boosts with all of them, such as a gain of 16.18% FID score with the auto-regressive model on the Stanford Car benchmark.
zh
[CV-76] VLM-Assisted Continual learning for Visual Question Answering in Self-Driving
【速读】:该论文旨在解决在自动驾驶中视觉问答(VQA)任务面临的连续学习挑战,特别是在感知、预测和规划等不同任务中由于灾难性遗忘(Catastrophic Forgetting)导致的知识更新困难。解决方案的关键在于提出了一种结合视觉-语言模型(Vision-Language Models, VLMs)与选择性记忆回放(Selective Memory Replay)及知识蒸馏(Knowledge Distillation),并辅以任务特定投影层正则化(Task-Specific Projection Layer Regularization)的新型连续学习框架。其中,知识蒸馏机制通过“教师”模型引导后续任务的学习,减少遗忘现象;任务特定投影层则基于特征表示的差异计算损失,确保学习过程中的连续性和任务间转换的平稳性。
链接: https://arxiv.org/abs/2502.00843
作者: Yuxin Lin,Mengshi Qi,Liang Liu,Huadong Ma
机构: State Key Laboratory of Networking and Switching Technology (网络与交换技术国家重点实验室), Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we propose a novel approach for solving the Visual Question Answering (VQA) task in autonomous driving by integrating Vision-Language Models (VLMs) with continual learning. In autonomous driving, VQA plays a vital role in enabling the system to understand and reason about its surroundings. However, traditional models often struggle with catastrophic forgetting when sequentially exposed to new driving tasks, such as perception, prediction, and planning, each requiring different forms of knowledge. To address this challenge, we present a novel continual learning framework that combines VLMs with selective memory replay and knowledge distillation, reinforced by task-specific projection layer regularization. The knowledge distillation allows a previously trained model to act as a “teacher” to guide the model through subsequent tasks, minimizing forgetting. Meanwhile, task-specific projection layers calculate the loss based on the divergence of feature representations, ensuring continuity in learning and reducing the shift between tasks. Evaluated on the DriveLM dataset, our framework shows substantial performance improvements, with gains ranging from 21.40% to 32.28% across various metrics. These results highlight the effectiveness of combining continual learning with VLMs in enhancing the resilience and reliability of VQA systems in autonomous driving. We will release our source code.
zh
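该框架的两项核心损失(教师模型的温度蒸馏损失,以及任务特定投影层特征散度的正则项)可用如下草图示意。其中温度 T、正则权重 0.1 以及 KL 的 Hinton 式写法均为通用假设,未必与论文实现完全一致。

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Temperature-scaled KL(teacher || student), the common Hinton-style KD loss;
    # a generic sketch, not necessarily the paper's exact formulation.
    p_t = softmax(teacher_logits / T)
    log_p_s = np.log(softmax(student_logits / T) + 1e-12)
    log_p_t = np.log(p_t + 1e-12)
    return (T * T) * np.mean(np.sum(p_t * (log_p_t - log_p_s), axis=-1))

def projection_reg(feat_student, feat_teacher):
    # Penalize the divergence of task-specific projected feature representations.
    return np.mean((feat_student - feat_teacher) ** 2)

rng = np.random.default_rng(1)
s_logits = rng.normal(size=(4, 10))   # student outputs on replayed samples
t_logits = rng.normal(size=(4, 10))   # frozen "teacher" (previous task) outputs
total = kd_loss(s_logits, t_logits) + 0.1 * projection_reg(
    rng.normal(size=(4, 64)), rng.normal(size=(4, 64)))
```

蒸馏项约束学生在旧任务样本上贴近教师分布,投影正则项则抑制任务切换时的特征漂移,两者共同缓解灾难性遗忘。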
[CV-77] Cross multiscale vision transformer for deep fake detection
【速读】:该论文旨在解决深度伪造技术泛滥带来的数字媒体真实性挑战,提出通过评估多种深度学习模型来检测深度伪造内容。解决方案的关键在于利用传统深度学习方法与新架构相结合,训练一系列模型并通过准确率等指标严格评估其性能。
链接: https://arxiv.org/abs/2502.00833
作者: Akhshan P,Taneti Sanjay,Chandrakala S
机构: Shiv Nadar University Chennai(谢瓦那得大学钦奈校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The proliferation of deep fake technology poses significant challenges to digital media authenticity, necessitating robust detection mechanisms. This project evaluates deep fake detection using the SP Cup’s 2025 deep fake detection challenge dataset. We focused on exploring various deep learning models for detecting deep fake content, utilizing traditional deep learning techniques alongside newer architectures. Our approach involved training a series of models and rigorously assessing their performance using metrics such as accuracy.
zh
[CV-78] OOD Detection with immature Models
【速读】:该论文旨在解决深度生成模型(Deep Generative Models, DGMs)在分配较高的似然值(likelihood)给训练数据(in-distribution, ID)相较于未见过的数据(out-of-distribution, OOD)时缺乏性能保证的问题。尤其当ID输入比OOD数据点更为复杂时,这一反直觉的行为尤为显著。论文的关键解决方案在于利用数据点相对于DGM参数的梯度,提出了一种新的异常检测框架,通过估计给定数据点各层梯度范数的联合密度来实现,这种方法不依赖于特定模型,并且在多种基于似然的DGM和图像数据集组合中的表现优于典型性检验(Typicality Test)。此外,研究发现即使使用训练早期阶段的未成熟模型也能在下游任务中达到与成熟模型相当甚至更优的结果,从而强调了部分训练模型在这些任务中的潜力。
链接: https://arxiv.org/abs/2502.00820
作者: Behrooz Montazeran,Ullrich Köthe
机构: University of Heidelberg (海德堡大学); Interdisciplinary Center for Scientific Computing (科学计算跨学科中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 2 Tables, 9 Figures
点击查看摘要
Abstract:Likelihood-based deep generative models (DGMs) have gained significant attention for their ability to approximate the distributions of high-dimensional data. However, these models lack a performance guarantee in assigning higher likelihood values to in-distribution (ID) inputs, data the models are trained on, compared to out-of-distribution (OOD) inputs. This counter-intuitive behaviour is particularly pronounced when ID inputs are more complex than OOD data points. One potential approach to address this challenge involves leveraging the gradient of a data point with respect to the parameters of the DGMs. A recent OOD detection framework proposed estimating the joint density of layer-wise gradient norms for a given data point as a model-agnostic method, demonstrating superior performance compared to the Typicality Test across likelihood-based DGMs and image dataset pairs. In particular, most existing methods presuppose access to fully converged models, the training of which is both time-intensive and computationally demanding. In this work, we demonstrate that using immature models, stopped at early stages of training, can mostly achieve equivalent or even superior results on this downstream task compared to mature models capable of generating high-quality samples that closely resemble ID data. This novel finding enhances our understanding of how DGMs learn the distribution of ID data and highlights the potential of leveraging partially trained models for downstream tasks. Furthermore, we offer a possible explanation for this unexpected behaviour through the concept of support overlap.
zh
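该类方法所依赖的"逐层梯度范数"特征,可以用一个手写反向传播的两层小网络来示意:对每个样本计算损失对各层权重的梯度,取各层的 Frobenius 范数拼成特征向量。网络结构、维度与损失形式均为演示用假设;基于该特征的联合密度估计(即实际的 OOD 打分)在此省略。

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 5, 8, 3
W1 = rng.normal(scale=0.5, size=(d_h, d_in))
W2 = rng.normal(scale=0.5, size=(d_out, d_h))

def loss_and_layer_grads(x, y):
    # Toy 2-layer model with loss L = 0.5 * ||W2 tanh(W1 x) - y||^2,
    # with gradients derived by manual backprop.
    h = np.tanh(W1 @ x)
    e = W2 @ h - y
    loss = 0.5 * float(e @ e)
    gW2 = np.outer(e, h)                    # dL/dW2
    gh = W2.T @ e                           # dL/dh
    gW1 = np.outer(gh * (1 - h ** 2), x)    # dL/dW1 through tanh
    return loss, gW1, gW2

def grad_norm_feature(x, y):
    # Layer-wise gradient norms: the per-sample feature used for OOD scoring.
    _, gW1, gW2 = loss_and_layer_grads(x, y)
    return np.array([np.linalg.norm(gW1), np.linalg.norm(gW2)])

x = rng.normal(size=d_in)
y = rng.normal(size=d_out)
feat = grad_norm_feature(x, y)
```

论文的观察是:即便模型"未成熟"(训练早停),这一梯度范数特征已足以区分 ID 与 OOD 样本。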
[CV-79] Environment-Driven Online LiDAR-Camera Extrinsic Calibration
【速读】:该论文旨在解决现有激光雷达与相机外参标定方法缺乏灵活性,无法适应传感器数据和环境变化的问题。关键在于提出了一种名为EdO-LCEC的环境驱动在线标定方法,该方法通过引入泛化场景判别器来主动解析环境条件,并采用双路径对应匹配技术(Dual-Path Correspondence Matching, DPCM),利用结构和纹理一致性实现可靠的3D-2D对应关系,从而提高在不同视图和场景中的精度。
链接: https://arxiv.org/abs/2502.00801
作者: Zhiwei Huang,Jiaqi Li,Ping Zhong,Rui Fan
机构: Department of Control Science & Engineering, the College of Electronics & Information Engineering, Tongji University(同济大学); School of Computer Science and Engineering, Central South University(中南大学); National Key Laboratory of Science and Technology on Automatic Target Recognition, National University of Defense Technology(国防科技大学自动目标识别国家重点实验室); Department of Control Science & Engineering, the College of Electronics & Information Engineering, Shanghai Research Institute for Intelligent Autonomous Systems, the State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, Tongji University(同济大学); National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University(西安交通大学机电混合增强智能国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:LiDAR-camera extrinsic calibration (LCEC) is the core for data fusion in computer vision. Existing methods typically rely on customized calibration targets or fixed scene types, lacking the flexibility to handle variations in sensor data and environmental contexts. This paper introduces EdO-LCEC, the first environment-driven, online calibration approach that achieves human-like adaptability. Inspired by the human perceptual system, EdO-LCEC incorporates a generalizable scene discriminator to actively interpret environmental conditions, creating multiple virtual cameras that capture detailed spatial and textural information. To overcome cross-modal feature matching challenges between LiDAR and camera, we propose dual-path correspondence matching (DPCM), which leverages both structural and textural consistency to achieve reliable 3D-2D correspondences. Our approach formulates the calibration process as a spatial-temporal joint optimization problem, utilizing global constraints from multiple views and scenes to improve accuracy, particularly in sparse or partially overlapping sensor views. Extensive experiments on real-world datasets demonstrate that EdO-LCEC achieves state-of-the-art performance, providing reliable and precise calibration across diverse, challenging environments.
zh
[CV-80] Adversarial Semantic Augmentation for Training Generative Adversarial Networks under Limited Data
【速读】:该论文旨在解决生成对抗网络(GANs)在低数据量条件下合成图像性能显著下降的问题。为了解决这一问题,现有方法主要通过各种数据增强技术来扩充训练集。然而,这些增强技术可能导致数据分布泄露甚至改变。为此,论文提出了一种对抗语义增强(Adversarial Semantic Augmentation, ASA)技术,在语义层面上而非图像层面上扩充训练数据。关键在于通过估计真实图像和生成图像的语义特征协方差矩阵,找到有意义的变换方向,从而实现对原始特征的转换,例如改变人脸数据集中的背景或表情。这种方法通过优化预期对抗损失的上界来隐式实现语义增强,避免了冗余采样并引入了可忽略的计算开销,从而提高了计算效率。
链接: https://arxiv.org/abs/2502.00800
作者: Mengping Yang,Zhe Wang,Ziqiu Chi,Dongdong Li,Wenli Du
机构: East China University of Science and Technology (华东理工大学); Department of Computer Science & Engineering, East China University of Science & Technology (华东理工大学计算机科学与工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This work was completed in 2022 and submitted to an IEEE journal for potential publication
点击查看摘要
Abstract:Generative adversarial networks (GANs) have made remarkable achievements in synthesizing images in recent years. Typically, training GANs requires massive data, and the performance of GANs deteriorates significantly when training data is limited. To improve the synthesis performance of GANs in low-data regimes, existing approaches use various data augmentation techniques to enlarge the training sets. However, it is identified that these augmentation techniques may leak or even alter the data distribution. To remedy this, we propose an adversarial semantic augmentation (ASA) technique to enlarge the training data at the semantic level instead of the image level. Concretely, considering semantic features usually encode rich information of images, we estimate the covariance matrices of semantic features for both real and generated images to find meaningful transformation directions. Such directions translate original features to another semantic representation, e.g., changing the backgrounds or expressions of the human face dataset. Moreover, we derive an upper bound of the expected adversarial loss. By optimizing the upper bound, our semantic augmentation is implicitly achieved. Such design avoids redundant sampling of the augmented features and introduces negligible computation overhead, making our approach computationally efficient. Extensive experiments on both few-shot and large-scale datasets demonstrate that our method consistently improves the synthesis quality under various data regimes, and further visualization and analytic results suggest the satisfactory versatility of our proposed method.
zh
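语义层面增强的"显式采样"版本可示意如下:先估计语义特征的协方差矩阵,再沿其刻画的方向对特征做平移。注意论文实际是通过优化期望对抗损失的上界来"隐式"完成这一过程以避免冗余采样;此处为便于理解采用显式采样,特征维度与强度系数 lam 均为示例假设。

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for semantic features extracted from real images (dimension is illustrative).
feats = rng.normal(size=(200, 16))
cov = np.cov(feats, rowvar=False)        # (16, 16) covariance of semantic features

def semantic_augment(f, lam=0.5):
    # Translate a feature along directions sampled from the estimated covariance,
    # i.e. f' = f + delta with delta ~ N(0, lam * Sigma). Explicit sampling shown
    # for clarity; the paper achieves this implicitly via an upper-bound loss.
    delta = rng.multivariate_normal(np.zeros_like(f), lam * cov)
    return f + delta

aug = semantic_augment(feats[0])
```

直观上,协方差的主方向对应数据中"有意义"的语义变化(如背景、表情),沿这些方向平移即得到语义级增广样本。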
[CV-81] ask-Specific Adaptation with Restricted Model Access
【速读】:该论文旨在解决现有微调方法在实际应用中面临的挑战,包括管理多个模型副本或推理管道的复杂性、边缘设备优化的低效性,以及对专有权、隐私和不安全模型变体暴露的担忧。论文的关键解决方案是探索“灰盒”微调方法,这种方法隐藏模型架构和权重,仅允许梯度传播,并引入两个轻量级可学习模块以适应新任务。此外,提出了一种更少限制的变体,通过增加模型的入口点来平衡性能与模型暴露程度。
链接: https://arxiv.org/abs/2502.00796
作者: Matan Levy,Rami Ben-Ari,Dvir Samuel,Nir Darshan,Dani Lischinski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The emergence of foundational models has greatly improved performance across various downstream tasks, with fine-tuning often yielding even better results. However, existing fine-tuning approaches typically require access to model weights and layers, leading to challenges such as managing multiple model copies or inference pipelines, inefficiencies in edge device optimization, and concerns over proprietary rights, privacy, and exposure to unsafe model variants. In this paper, we address these challenges by exploring “Gray-box” fine-tuning approaches, where the model’s architecture and weights remain hidden, allowing only gradient propagation. We introduce a novel yet simple and effective framework that adapts to new tasks using two lightweight learnable modules at the model’s input and output. Additionally, we present a less restrictive variant that offers more entry points into the model, balancing performance with model exposure. We evaluate our approaches across several backbones on benchmarks such as text-image alignment, text-video alignment, and sketch-image alignment. Results show that our Gray-box approaches are competitive with full-access fine-tuning methods, despite having limited access to the model.
zh
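"灰盒"微调的思路可以用一个线性化的最小示例说明:中间的黑盒模型权重 F 冻结且对外不可见,只允许梯度穿过;可训练的只有模型输入端的适配器 A 与输出端的适配器 B。各维度、初始化与学习率均为演示假设,真实场景中 F 是任意的冻结基础模型。

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_f, d_out = 6, 12, 4

F = rng.normal(scale=0.3, size=(d_f, d_f))   # frozen "black-box" weights (hidden in practice)
A = rng.normal(scale=0.1, size=(d_f, d_in))  # learnable lightweight input module
B = rng.normal(scale=0.1, size=(d_out, d_f)) # learnable lightweight output module

X = rng.normal(size=(32, d_in))
Y = rng.normal(size=(32, d_out))

def forward(X):
    return (B @ (F @ (A @ X.T))).T

def mse(P, Y):
    return float(np.mean((P - Y) ** 2))

loss_init = mse(forward(X), Y)
lr = 0.1
for _ in range(500):
    Z = A @ X.T                 # (d_f, N) adapted inputs
    H = F @ Z                   # black-box output; we only propagate gradients through F
    P = B @ H                   # (d_out, N)
    E = 2 * (P - Y.T) / Y.size  # dL/dP for mean squared error
    gB = E @ H.T                # gradient for the output module
    gA = F.T @ (B.T @ E) @ X    # gradient flows through frozen F to the input module
    B -= lr * gB
    A -= lr * gA
loss_final = mse(forward(X), Y)
```

只更新 A、B 意味着模型本体既不被复制也不被暴露,这正是灰盒设定在性能与模型保密之间的折中。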
[CV-82] Estimating forest carbon stocks from high-resolution remote sensing imagery by reducing domain shift with style transfer
【速读】:该论文旨在提高基于地面监测样本数据与卫星遥感影像融合分析的森林碳储量监测和评估的准确性。关键解决方案在于使用风格迁移方法,并引入Swin Transformer模型通过注意力机制提取全局特征,将碳储量估算转化为图像翻译问题。这种方法旨在提升大尺度观测下的精度。
链接: https://arxiv.org/abs/2502.00784
作者: Zhenyu Yu,Jinnian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Forests function as crucial carbon reservoirs on land, and their carbon sinks can efficiently reduce atmospheric CO2 concentrations and mitigate climate change. Currently, the overall trend for monitoring and assessing forest carbon stocks is to integrate ground monitoring sample data with satellite remote sensing imagery. This style of analysis facilitates large-scale observation. However, these techniques require improvement in accuracy. We used GF-1 WFV and Landsat TM images to analyze Huize County, Qujing City, Yunnan Province in China. Using the style transfer method, we introduced Swin Transformer to extract global features through attention mechanisms, converting the carbon stock estimation into an image translation problem.
zh
[CV-83] A method for estimating forest carbon storage distribution density via artificial intelligence generated content model
【速读】:该论文旨在提高森林碳储量估算的精度与效率。研究的关键在于引入了知识蒸馏后的VGG-19模块(Knowledge Distillation-VGG, KD-VGG)进行初始特征提取,并提出了改进的隐式扩散模型(Improved Implicit Diffusion Model, IIDM)。通过这些方法,论文实现了减少模型参数数量的同时缩短推理时间,并提高了特征融合能力,从而提升了高分辨率图像在连续尺度上的恢复效果及整体估算准确性。最终,IIDM模型在碳储量估算中的均方根误差(RMSE)达到28.68,比回归模型提高了约31.45%。
链接: https://arxiv.org/abs/2502.00783
作者: Zhenyu Yu,Jinnian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Forest is the most significant land-based carbon storage mechanism. The forest carbon sink can effectively decrease the atmospheric CO2 concentration and mitigate climate change. Remote sensing estimation not only ensures high accuracy of data, but also enables large-scale area observation. Optical images provide the possibility for long-term monitoring, which is a potential issue in the future carbon storage estimation research. We chose Huize County, Qujing City, Yunnan Province, China as the study area, took GF-1 WFV satellite image as the data, introduced the KD-VGG module to extract the initial features, and proposed the improved implicit diffusion model (IIDM). The results showed that: (1) The VGG-19 module after knowledge distillation can realize the initial feature extraction, reduce the inference time and improve the accuracy in the case of reducing the number of model parameters. (2) The Attention + MLP module was added for feature fusion to obtain the relationship between global and local features and realized the restoration of high-fidelity images in the continuous scale range. (3) The IIDM model proposed in this paper had the highest estimation accuracy, with an RMSE of 28.68, which was 13.16 lower than that of the regression model, an improvement of about 31.45%. In the estimation of carbon storage, the generative model can extract deeper features, and its performance was significantly better than other models. It demonstrated the feasibility of artificial intelligence-generated content (AIGC) in the field of quantitative remote sensing and provided valuable insights for the study of carbon neutralization effect. By combining the actual characteristics of the forest, the regional carbon storage estimation with a resolution of 16 meters was utilized to provide a significant theoretical basis for the formulation of forest carbon sink regulation.
zh
[CV-84] Privacy Preserving Properties of Vision Classifiers
【速读】:该论文旨在评估不同视觉分类器架构在隐私保护方面的性能,并挑战了模型共享时隐含的隐私保护假设。论文的关键在于通过网络逆向重构技术,系统性地分析多层感知机(MLP)、卷积神经网络(CNN)和视觉变换器(ViT)等架构在隐私保护方面的差异,揭示它们在记忆和泄露训练数据方面的程度,并量化各模型逆向重构的难易程度。研究发现突显了输入表示、特征提取机制及权重结构等架构差异对隐私风险的影响,并识别出哪些架构更能抵御逆向攻击,同时探讨了模型性能与隐私保护之间的权衡。这一研究为设计安全且注重隐私的机器学习系统提供了可行的见解,强调了在处理专有或个人信息的应用中评估架构决策的重要性。
链接: https://arxiv.org/abs/2502.00760
作者: Pirzada Suhail,Amit Sethi
机构: IIT Bombay (印度理工学院孟买分校)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision classifiers are often trained on proprietary datasets containing sensitive information, yet the models themselves are frequently shared openly under the privacy-preserving assumption. Although these models are assumed to protect sensitive information in their training data, the extent to which this assumption holds for different architectures remains unexplored. This assumption is challenged by inversion attacks which attempt to reconstruct training data from model weights, exposing significant privacy vulnerabilities. In this study, we systematically evaluate the privacy-preserving properties of vision classifiers across diverse architectures, including Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Vision Transformers (ViTs). Using network inversion-based reconstruction techniques, we assess the extent to which these architectures memorize and reveal training data, quantifying the relative ease of reconstruction across models. Our analysis highlights how architectural differences, such as input representation, feature extraction mechanisms, and weight structures, influence privacy risks. By comparing these architectures, we identify which are more resilient to inversion attacks and examine the trade-offs between model performance and privacy preservation, contributing to the development of secure and privacy-respecting machine learning models for sensitive applications. Our findings provide actionable insights into the design of secure and privacy-aware machine learning systems, emphasizing the importance of evaluating architectural decisions in sensitive applications involving proprietary or personal data.
zh
[CV-85] Continuity-Preserving Convolutional Autoencoders for Learning Continuous Latent Dynamical Models from Images
【速读】:该论文旨在解决从离散图像帧中学习连续动态系统的问题。传统方法直接应用卷积自编码器会导致潜在状态在时间上的不连续性。为了解决这一问题,论文提出了一种保持连续性的卷积自编码器(Continuity-preserving Convolutional Autoencoders, CpAEs),其关键是通过促进卷积滤波器的连续性来保持潜在状态的连续性,从而实现更准确的潜在动态模型。
链接: https://arxiv.org/abs/2502.00754
作者: Aiqing Zhu,Yuting Pan,Qianxiao Li
机构: National University of Singapore
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Continuous dynamical systems are cornerstones of many scientific and engineering disciplines. While machine learning offers powerful tools to model these systems from trajectory data, challenges arise when these trajectories are captured as images, resulting in pixel-level observations that are discrete in nature. Consequently, a naive application of a convolutional autoencoder can result in latent coordinates that are discontinuous in time. To resolve this, we propose continuity-preserving convolutional autoencoders (CpAEs) to learn continuous latent states and their corresponding continuous latent dynamical models from discrete image frames. We present a mathematical formulation for learning dynamics from image frames, which illustrates issues with previous approaches and motivates our methodology based on promoting the continuity of convolution filters, thereby preserving the continuity of the latent states. This approach enables CpAEs to produce latent states that evolve continuously with the underlying dynamics, leading to more accurate latent dynamical models. Extensive experiments across various scenarios demonstrate the effectiveness of CpAEs.
zh
[CV-86] An Event-Based Perception Pipeline for a Table Tennis Robot
【速读】:该论文旨在解决乒乓球机器人在快速运动球体检测中的精度与实时性问题。关键在于采用事件驱动相机(Event-based camera)替代传统的帧驱动(frame-based)相机,从而实现一个仅使用事件驱动相机的实时感知管道。这种方法能够提供比帧驱动相机高一个数量级的更新率,显著降低球体位置、速度和旋转估计的均值误差和不确定性,进而提升机器人的控制性能。
链接: https://arxiv.org/abs/2502.00749
作者: Andreas Ziegler,Thomas Gossard,Arren Glover,Andreas Zell
机构: University of Tübingen(图宾根大学); Istituto Italiano di Tecnologia(意大利技术研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:Table tennis robots gained traction over the last years and have become a popular research challenge for control and perception algorithms. Fast and accurate ball detection is crucial for enabling a robotic arm to rally the ball back successfully. So far, most table tennis robots use conventional, frame-based cameras for the perception pipeline. However, frame-based cameras suffer from motion blur if the frame rate is not high enough for fast-moving objects. Event-based cameras, on the other hand, do not have this drawback since pixels report changes in intensity asynchronously and independently, leading to an event stream with a temporal resolution on the order of microseconds. To the best of our knowledge, we present the first real-time perception pipeline for a table tennis robot that uses only event-based cameras. We show that compared to a frame-based pipeline, event-based perception pipelines have an update rate which is an order of magnitude higher. This is beneficial for the estimation and prediction of the ball’s position, velocity, and spin, resulting in lower mean errors and uncertainties. These improvements are an advantage for the robot control, which has to be fast, given the short time a table tennis ball is flying until the robot has to hit back.
zh
[CV-87] Spatio-Temporal Progressive Attention Model for EEG Classification in Rapid Serial Visual Presentation Task
【速读】:该论文旨在解决快速串行视觉呈现(RSVP)任务中脑电图(EEG)信号的空间和时间依赖性分析问题。解决方案的关键在于提出了一种新颖的空间-时间渐进注意力模型(STPAM),通过三个独立的空间专家逐步学习脑区的空间拓扑信息,并利用这些信息减少无关脑区的干扰。随后,基于获得的空间特征序列,再通过三个时间专家逐步关注关键的EEG切片来捕捉时间依赖性。这种空间-时间注意力机制显著提升了EEG分类性能。
链接: https://arxiv.org/abs/2502.00730
作者: Yang Li,Wei Liu,Tianzhi Feng,Fu Li,Chennan Wu,Boxun Fu,Zhifu Zhao,Xiaotian Wang,Guangming Shi
机构: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education(智能感知与图像理解重点实验室), the School of Artificial Intelligence(人工智能学院), Xidian University(西安电子科技大学), Xi’an, 710071, China(中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As a type of multi-dimensional sequential data, the spatial and temporal dependencies of electroencephalogram (EEG) signals should be further investigated. Thus, in this paper, we propose a novel spatial-temporal progressive attention model (STPAM) to improve EEG classification in rapid serial visual presentation (RSVP) tasks. STPAM first adopts three distinct spatial experts to learn the spatial topological information of brain regions progressively, which is used to minimize the interference of irrelevant brain regions. Concretely, the former expert filters out EEG electrodes in the relative brain regions to be used as prior knowledge for the next expert, ensuring that the subsequent experts gradually focus their attention on information from significant EEG electrodes. This process strengthens the effect of the important brain regions. Then, based on the above-obtained feature sequence with spatial information, three temporal experts are adopted to capture the temporal dependence by progressively assigning attention to the crucial EEG slices. In addition to the above EEG classification method, we build a novel Infrared RSVP EEG Dataset (IRED), the first of its kind based on dim infrared images with small targets, and conduct extensive experiments on it. The results show that our STPAM can achieve better performance than all the compared methods.
zh
[CV-88] Vision and Language Reference Prompt into SAM for Few-shot Segmentation
【速读】:该论文旨在解决Few-shot分割模型中存在的参考信息有限导致精度受限的问题。关键在于提出了一种名为Vision and Language reference Prompt into SAM (VLP-SAM)的新模型,通过输入图像和文本标签作为参考信息,结合视觉和语言模态来增强提示嵌入(prompt embeddings),从而显著提升了在PASCAL-5i和COCO-20i数据集上的Few-shot分割任务性能,相比之前最先进的方法分别提高了6.3%和9.5%的平均交并比(mIoU)。
链接: https://arxiv.org/abs/2502.00719
作者: Kosuke Sakurai,Ryotaro Shimizu,Masayuki Goto
机构: Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures
点击查看摘要
Abstract:Segment Anything Model (SAM) represents a large-scale segmentation model that enables powerful zero-shot capabilities with flexible prompts. While SAM can segment any object in zero-shot, it requires user-provided prompts for each target image and does not attach any label information to masks. Few-shot segmentation models addressed these issues by inputting annotated reference images as prompts to SAM and can segment specific objects in target images without user-provided prompts. Previous SAM-based few-shot segmentation models only use annotated reference images as prompts, resulting in limited accuracy due to a lack of reference information. In this paper, we propose a novel few-shot segmentation model, Vision and Language reference Prompt into SAM (VLP-SAM), that utilizes the visual information of the reference images and the semantic information of the text labels by inputting not only images but also language as reference information. In particular, VLP-SAM is a simple and scalable structure with minimal learnable parameters, which inputs prompt embeddings with vision-language information into SAM using a multimodal vision-language model. To demonstrate the effectiveness of VLP-SAM, we conducted experiments on the PASCAL-5i and COCO-20i datasets, and achieved high performance in the few-shot segmentation task, outperforming the previous state-of-the-art model by a large margin (6.3% and 9.5% in mIoU, respectively). Furthermore, VLP-SAM demonstrates its generality in unseen objects that are not included in the training data. Our code is available at this https URL.
zh
[CV-89] MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在高可靠性应用领域中存在的幻觉问题。论文的关键解决方案在于提出了一种名为MINT的新型无训练解码策略,通过减少不相关的图像标记的关注来增强局部感知能力,并使用对比解码以推动模型更加关注关键图像区域,从而引导模型在生成过程中更集中于关键视觉元素。
链接: https://arxiv.org/abs/2502.00717
作者: Chao Wang,Jianming Yang,Yang Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, 4 tables
点击查看摘要
Abstract:Hallucination has been a long-standing and inevitable problem that hinders the application of Large Vision-Language Models (LVLMs) in domains that require high reliability. Various methods focus on improvement depending on data annotations or training strategies, yet place less emphasis on LLM’s inherent problems. To fill this gap, we delve into the attention mechanism of the decoding process in the LVLM. Intriguingly, our investigation uncovers the prevalent attention redundancy within the hierarchical architecture of the LVLM, manifesting as overextended image processing in deep layers and an overabundance of non-essential image tokens. Stemming from the observation, we thus propose MINT, a novel training-free decoding strategy, MItigating hallucinations via tokeN reducTion. Specifically, we dynamically intensify the LVLM’s local perception capability by masking its attention to irrelevant image tokens. In addition, we use contrastive decoding that pushes the model to focus more on those key image regions. Our full method aims to guide the model in concentrating more on key visual elements during generation. Extensive experimental results on several popular public benchmarks show that our approach achieves a 4% improvement in mitigating hallucinations caused by distracted perception compared to original models. Meanwhile, our approach is demonstrated to make the model perceive 5% more visual points even though we reduce a suite of image tokens.
zh
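其中"对比解码"一步可示意如下:将注意力聚焦(屏蔽无关图像 token)后得到的 logits 与原始 logits 做对比放大,使解码更偏向由关键视觉区域支撑的词。系数形式 `(1+α)·聚焦 − α·原始` 采用常见的对比解码写法,未必是 MINT 的确切系数;示例中的词表大小与"聚焦增益"均为假设。

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def contrastive_decode(logits_focused, logits_plain, alpha=0.5):
    # Amplify what the attention-focused pass adds over the plain pass.
    # The (1 + alpha) / alpha form follows the common contrastive-decoding
    # recipe; MINT's exact coefficients may differ.
    return (1 + alpha) * logits_focused - alpha * logits_plain

V = 6                                           # toy vocabulary size
plain = np.array([0.5, 0.2, 0.1, 0.3, 0.0, 0.4])  # logits with full (distracted) attention
focused = plain.copy()
focused[2] += 1.0   # pretend masking distractor image tokens boosts the grounded word

p = softmax(contrastive_decode(focused, plain))
```

对比放大只增强"聚焦后才变高"的 logit,因而被关键视觉证据支撑的词概率进一步升高,有助于抑制由注意力分散引起的幻觉。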
[CV-90] VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework
【速读】:该论文旨在解决视觉推理任务中的两个主要问题:有限的推理可解释性以及问题文本中存在的欠规范现象。此外,细粒度视觉知识的缺乏限制了对主体行为的精确理解。为了解决这些问题,论文提出了一种名为VIKSER(基于视觉知识的自我强化推理框架)的方法。关键在于通过大型语言模型提取细粒度视觉知识,并利用视觉关系检测技术辅助这一过程。同时,VIKSER采用了一种称为证据链(Chain-of-Evidence, CoE)的新颖提示方法,以增强其推理能力的可解释性。此外,集成的自我反思技术使VIKSER能够从错误中学习和改进。
链接: https://arxiv.org/abs/2502.00711
作者: Chunbai Zhang,Chao Wang,Yang Zhou,Yan Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages,12 figures
点击查看摘要
Abstract:Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability, while hindering by the phenomenon of underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER utilizes fine-grained visual knowledge to paraphrase the question with underspecification. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of “evidence for reasoning” to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology empowers VIKSER with the ability to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results in relevant tasks.
zh
[CV-91] PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation
【速读】:该论文旨在解决在生成三维场景时遇到的几个关键挑战:1)确保复合场景布局符合物理定律;2)准确捕捉复杂场景描述中的资产及其关系;3)布局方法中自主资产生成能力受限。为了解决这些问题,论文提出了一种名为PhiP-G的新框架,其关键是将基于世界模型的生成技术与布局指导无缝集成,并利用基于大型语言模型(LLM)的代理分析复杂场景描述以生成场景图。此外,PhiP-G结合了多模态二维生成代理和三维高斯生成方法进行目标资产创建,并通过具有粘附能力的物理池和视觉监督代理来预测和规划布局,从而显著提升了生成质量和物理合理性。
链接: https://arxiv.org/abs/2502.00708
作者: Qixuan Li,Chao Wang,Zongjin He,Yan Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages.8 figures
点击查看摘要
Abstract:Text-to-3D asset generation has achieved significant optimization under the supervision of 2D diffusion priors. However, when dealing with compositional scenes, existing methods encounter several challenges: 1). failure to ensure that composite scene layouts comply with physical laws; 2). difficulty in accurately capturing the assets and relationships described in complex scene descriptions; 3). limited autonomous asset generation capabilities among layout approaches leveraging large language models (LLMs). To avoid these compromises, we propose a novel framework for compositional scene generation, PhiP-G, which seamlessly integrates generation techniques with layout guidance based on a world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene description to generate a scene graph, and integrating a multimodal 2D generation agent and a 3D Gaussian generation method for targeted assets creation. For the stage of layout, PhiP-G employs a physical pool with adhesion capabilities and a visual supervision agent, forming a world model for layout prediction and planning. Extensive experiments demonstrate that PhiP-G significantly enhances the generation quality and physical rationality of the compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA) performance in CLIP scores, achieves parity with the leading methods in generation quality as measured by T^3Bench, and improves efficiency by 24x.
zh
[CV-92] S2CFormer: Reorienting Learned Image Compression from Spatial Interaction to Channel Aggregation
【速读】:该论文旨在重新评估变换器在学习图像压缩(LIC)中的关键因素,并解决现有方法中空间操作复杂化导致解码延迟与率失真性能之间权衡不佳的问题。论文的关键在于强调通道聚合模块的重要性,通过将空间操作替换为恒等映射,发现仅依靠通道操作即可达到领先方法的率失真性能。基于这一洞见,论文提出了"S2CFormer"范式,重新聚焦于通道聚合而非空间交互。论文展示了两种S2CFormer实例:S2C-Conv和S2C-Attention,它们均实现了最先进的率失真性能和显著更快的解码速度。此外,还引入了结合不同S2CFormer实例优势的S2C-Hybrid模型,在多个数据集上超越现有方法,树立了高效高性能LIC的新标杆。
链接: https://arxiv.org/abs/2502.00700
作者: Yunuo Chen,Qian Li,Bing He,Donghui Feng,Ronghua Wu,Qi Wang,Li Song,Guo Lu,Wenjun Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Transformers have achieved significant success in learned image compression (LIC), with Swin Transformers emerging as the mainstream choice for nonlinear transforms. A common belief is that their sophisticated spatial operations contribute most to their efficacy. However, the crucial role of the feed-forward network (FFN) based Channel Aggregation module within the transformer architecture has been largely overlooked, and the over-design of spatial operations leads to a suboptimal trade-off between decoding latency and R-D performance. In this paper, we reevaluate the key factors behind the competence of transformers in LIC. By replacing spatial operations with identity mapping, we are surprised to find that channel operations alone can approach the R-D performance of the leading methods. This solid lower bound of performance emphasizes that the presence of channel aggregation is more essential for the LIC model to achieve competitive performance, while the previously complex spatial interactions are partly redundant. Based on this insight, we initiate the “S2CFormer” paradigm, a general architecture that reorients the focus of LIC from Spatial Interaction to Channel Aggregation. We present two instantiations of the S2CFormer: S2C-Conv, and S2C-Attention. Each one incorporates a simple operator for spatial interaction and serves as nonlinear transform blocks for our LIC models. Both models demonstrate state-of-the-art (SOTA) R-D performance and significantly faster decoding speed. These results also motivate further exploration of advanced FFN structures to enhance the R-D performance while maintaining model efficiency. With these foundations, we introduce S2C-Hybrid, an enhanced LIC model that combines the strengths of different S2CFormer instantiations. This model outperforms all the existing methods on several datasets, setting a new benchmark for efficient and high-performance LIC.
zh
[CV-93] MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
【速读】:该论文旨在解决人工智能研究领域缺乏系统性基准来量化多模态系统中的关键认知维度的问题。解决方案的关键在于提出了MM-IQ评估框架,该框架包含2,710个精心策划的测试项目,涵盖了8种不同的推理范式,从而能够更全面地评估多模态模型的认知能力。
链接: https://arxiv.org/abs/2502.00698
作者: Huanqia Cai,Yijun Yang,Winston Hu
机构: Tencent(腾讯)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (27.49% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.
zh
[CV-94] TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical and Radiomic Data Fusion
【速读】:该论文旨在解决慢性肝病预后评估中多模态数据融合的挑战。现有的多模态融合方法难以适应更丰富的医学模态,并且在捕捉模态间关系方面存在困难。为了解决这些问题,论文提出了一种名为Triple-Modal Interaction Chronic Liver Network (TMI-CLNet)的方法。关键在于开发了Intra-Modality Aggregation模块以消除模态内的冗余信息,并设计了Triple-Modal Cross-Attention Fusion模块来提取跨模态信息。此外,还引入了Triple-Modal Feature Fusion损失函数以对齐不同模态间的特征表示。这些创新显著提升了在肝脏预后数据集上的表现,超越了现有的一流单模态模型和其他多模态技术。
链接: https://arxiv.org/abs/2502.00695
作者: Linglong Wu,Xuhao Shan,Ruiquan Ge,Ruoyu Liang,Chi Zhang,Yonghong Li,Ahmed Elazab,Huoling Luo,Yunbi Liu,Changmiao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, accepted by IEEE ISBI 2025
点击查看摘要
Abstract:Chronic liver disease represents a significant health challenge worldwide and accurate prognostic evaluations are essential for personalized treatment plans. Recent evidence suggests that integrating multimodal data, such as computed tomography imaging, radiomic features, and clinical information, can provide more comprehensive prognostic information. However, modalities have an inherent heterogeneity, and incorporating additional modalities may exacerbate the challenges of heterogeneous data fusion. Moreover, existing multimodal fusion methods often struggle to adapt to richer medical modalities, making it difficult to capture inter-modal relationships. To overcome these limitations, We present the Triple-Modal Interaction Chronic Liver Network (TMI-CLNet). Specifically, we develop an Intra-Modality Aggregation module and a Triple-Modal Cross-Attention Fusion module, which are designed to eliminate intra-modality redundancy and extract cross-modal information, respectively. Furthermore, we design a Triple-Modal Feature Fusion loss function to align feature representations across modalities. Extensive experiments on the liver prognosis dataset demonstrate that our approach significantly outperforms existing state-of-the-art unimodal models and other multi-modal techniques. Our code is available at this https URL.
zh
[CV-95] High-Order Matching for One-Step Shortcut Diffusion Models
【速读】:该论文旨在解决一阶轨迹监督在一步快捷扩散模型(One-step shortcut diffusion models)中的局限性。这些局限性包括无法捕捉内在流形几何、导致轨迹不稳定以及在高曲率区域表现不佳等问题。论文的关键解决方案是引入HOMO(高阶匹配框架),通过利用高阶监督来改进分布传输。HOMO不仅解决了上述问题,还实现了前所未有的平滑性、稳定性和几何精确度。
链接: https://arxiv.org/abs/2502.00688
作者: Bo Chen,Chengyue Gong,Xiaoyu Li,Yingyu Liang,Zhizhou Sha,Zhenmei Shi,Zhao Song,Mingda Wan
机构: Middle Tennessee State University; The University of Texas at Austin; University of New South Wales; The University of Hong Kong; University of Wisconsin-Madison; Tsinghua University; University of Wisconsin-Madison; The Simons Institute for the Theory of Computing at UC Berkeley; Anhui University.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:One-step shortcut diffusion models [Frans, Hafner, Levine and Abbeel, ICLR 2025] have shown potential in vision generation, but their reliance on first-order trajectory supervision is fundamentally limited. The Shortcut model’s simplistic velocity-only approach fails to capture intrinsic manifold geometry, leading to erratic trajectories, poor geometric alignment, and instability-especially in high-curvature regions. These shortcomings stem from its inability to model mid-horizon dependencies or complex distributional features, leaving it ill-equipped for robust generative modeling. In this work, we introduce HOMO (High-Order Matching for One-Step Shortcut Diffusion), a game-changing framework that leverages high-order supervision to revolutionize distribution transportation. By incorporating acceleration, jerk, and beyond, HOMO not only fixes the flaws of the Shortcut model but also achieves unprecedented smoothness, stability, and geometric precision. Theoretically, we prove that HOMO’s high-order supervision ensures superior approximation accuracy, outperforming first-order methods. Empirically, HOMO dominates in complex settings, particularly in high-curvature regions where the Shortcut model struggles. Our experiments show that HOMO delivers smoother trajectories and better distributional alignment, setting a new standard for one-step generative models.
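论文所述"高阶监督"的基本思想可以用一个极简示意来说明:用有限差分从轨迹中提取速度(一阶)与加速度(二阶),并将两者的误差加权求和作为匹配损失。以下为假设性草图(权重 w_vel、w_acc、时间步长与轨迹均为虚构示例,并非 HOMO 的原始实现):

```python
import numpy as np

def finite_diff(x, dt):
    # 一阶差分近似导数
    return (x[1:] - x[:-1]) / dt

def high_order_matching_loss(pred_traj, target_traj, dt=0.1, w_vel=1.0, w_acc=0.5):
    # 一阶(速度)与二阶(加速度)监督的加权和; 权重为示意性超参数
    v_p, v_t = finite_diff(pred_traj, dt), finite_diff(target_traj, dt)
    a_p, a_t = finite_diff(v_p, dt), finite_diff(v_t, dt)
    loss_v = np.mean((v_p - v_t) ** 2)
    loss_a = np.mean((a_p - a_t) ** 2)
    return w_vel * loss_v + w_acc * loss_a

t = np.linspace(0, 1, 50)
# 一条高曲率的半圆弧轨迹, 对应摘要中 Shortcut 模型表现不佳的场景
target = np.stack([np.cos(np.pi * t), np.sin(np.pi * t)], axis=1)
print(high_order_matching_loss(target, target))  # 与自身匹配时损失为 0.0
```

直线轨迹与该弧线的一阶误差可能较小,但二阶(加速度)项会显著放大几何上的偏差,这正是高阶监督能改善高曲率区域拟合的直观原因。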
zh
[CV-96] Cross-Modal Synergies: Unveiling the Potential of Motion-Aware Fusion Networks in Handling Dynamic and Static ReID Scenarios
【速读】:该论文旨在解决在不同监控场景中,尤其是在存在遮挡的情况下,人物重识别(Person Re-Identification, ReID)的复杂性问题。关键解决方案在于引入了一种名为Motion-Aware Fusion (MOTAR-FUSE)网络,该网络利用从静态图像中提取的运动线索显著增强ReID能力。MOTAR-FUSE网络通过双输入视觉适配器处理图像和视频,实现更有效的特征提取,并且集成了一个运动一致性任务,使运动感知变换器能够有效捕捉人体运动的动态性。这种方法在遮挡普遍存在的场景中显著提升了特征识别能力,从而推进了ReID过程。
链接: https://arxiv.org/abs/2502.00665
作者: Fuxi Ling,Hongye Liu,Guoqiang Huang,Jing Li,Hong Wu,Zhihao Tang
机构: Hangzhou Dianzi University (杭州电子科技大学); China Jiliang University (中国计量大学); Zhejiang Gongshang University (浙江工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Navigating the complexities of person re-identification (ReID) in varied surveillance scenarios, particularly when occlusions occur, poses significant challenges. We introduce an innovative Motion-Aware Fusion (MOTAR-FUSE) network that utilizes motion cues derived from static imagery to significantly enhance ReID capabilities. This network incorporates a dual-input visual adapter capable of processing both images and videos, thereby facilitating more effective feature extraction. A unique aspect of our approach is the integration of a motion consistency task, which empowers the motion-aware transformer to adeptly capture the dynamics of human motion. This technique substantially improves the recognition of features in scenarios where occlusions are prevalent, thereby advancing the ReID process. Our comprehensive evaluations across multiple ReID benchmarks, including holistic, occluded, and video-based scenarios, demonstrate that our MOTAR-FUSE network achieves superior performance compared to existing approaches.
zh
[CV-97] Enhanced Convolutional Neural Networks for Improved Image Classification
【速读】:该论文旨在解决在具有挑战性的数据集如CIFAR-10上应用卷积神经网络(Convolutional Neural Networks, CNNs)时常见的过拟合和次优特征表示问题。解决方案的关键在于提出了一种增强型CNN架构,通过集成更深的卷积块、批量归一化(Batch Normalization)和随机失活正则化(Dropout Regularization),从而实现更优性能。该模型在测试集上的准确率达到84.95%,显著优于基础CNN架构。
链接: https://arxiv.org/abs/2502.00663
作者: Xiaoran Yang,Shuhan Yu,Wenxi Xu
机构: Communication University of China; Hainan International College, Communication University of China; Hefei University of Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Image classification is a fundamental task in computer vision with diverse applications, ranging from autonomous systems to medical imaging. The CIFAR-10 dataset is a widely used benchmark to evaluate the performance of classification models on small-scale, multi-class datasets. Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art results; however, they often suffer from overfitting and suboptimal feature representation when applied to challenging datasets like CIFAR-10. In this paper, we propose an enhanced CNN architecture that integrates deeper convolutional blocks, batch normalization, and dropout regularization to achieve superior performance. The proposed model achieves a test accuracy of 84.95%, outperforming baseline CNN architectures. Through detailed ablation studies, we demonstrate the effectiveness of the enhancements and analyze the hierarchical feature representations. This work highlights the potential of refined CNN architectures for tackling small-scale image classification problems effectively.
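作为补充,下面用 NumPy 给出批量归一化与 Dropout 前向计算的最小示意(假设性实现,非论文原始代码;其中的数据和超参数均为虚构示例):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # 按批次对每个特征维度归一化: gamma * (x - mean) / sqrt(var + eps) + beta
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def dropout(x, p=0.5, rng=None, training=True):
    # 训练时以概率 p 随机置零, 并按 1/(1-p) 放大保留项 (inverted dropout)
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.random.default_rng(42).normal(2.0, 3.0, size=(128, 16))
y = batch_norm(x)
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))  # 归一化后各维均值约为 0
```

推理阶段 Dropout 应关闭(training=False),而批量归一化需改用训练期统计的滑动均值与方差,此处从略。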
zh
[CV-98] EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis
【速读】:该论文旨在解决3D高斯散射(3D Gaussian Splatting)驱动的说话人脸合成在情感表达多样性方面的不足。为了解决这一问题,论文提出了一种基于唇部对齐的情感面部生成器,并利用其训练了一个以连续情感值(即效价valence与唤醒度arousal)为条件的面部表情操控模型——EmoTalkingGaussian。此外,为了实现自然场景下音频的精确唇部同步,引入了一种自监督学习方法,该方法结合了文本转语音网络和视听同步网络。关键在于通过引入情感面部生成器和改进唇部同步机制来提升情感表达的多样性和真实性。
链接: https://arxiv.org/abs/2502.00654
作者: Junuk Cha,Seongro Yoon,Valeriya Strizhkova,Francois Bremond,Seungryul Baek
机构: UNIST; Inria
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages
点击查看摘要
Abstract:3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images with real-time inference speed. However, since it is typically trained on only a short video that lacks the diversity in facial emotions, the resultant talking heads struggle to represent a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model. It is able to manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal); while retaining synchronization of lip movements with input audio. Additionally, to achieve the accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We experiment our EmoTalkingGaussian on publicly available videos and have obtained better results than state-of-the-arts in terms of image quality (measured in PSNR, SSIM, LPIPS), emotion expression (measured in V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy), and lip synchronization (measured in LMD, Sync-E, Sync-C), respectively.
zh
[CV-99] Zeroth-order Informed Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
【速读】:该论文旨在解决高效对概率扩散模型(Probabilistic Diffusion Model, DM)进行微调以满足下游应用需求的问题。现有方法如基于强化学习(Reinforcement Learning, RL)或截断反向传播(Truncated Backpropagation, BP)存在样本效率低及梯度估计偏差等问题。论文的关键解决方案是提出递归似然比优化器(Recursive Likelihood Ratio, RLR),这是一种基于零阶信息的微调范式。RLR通过重新排列递归扩散链中的计算图,实现了无偏且方差更低的梯度估计,从而克服了现有方法的局限性。
链接: https://arxiv.org/abs/2502.00639
作者: Tao Ren,Zishi Zhang,Zehao Li,Jingyang Jiang,Shentao Qin,Guanghao Li,Yan Li,Yi Zheng,Xinping Li,Min Zhan,Yijie Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:The probabilistic diffusion model (DM), generating content by inferencing through a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on enormous unlabeled data, the model needs to be properly aligned to meet requirements for downstream applications. How to efficiently align the foundation DM is a crucial task. Contemporary methods are either based on Reinforcement Learning (RL) or truncated Backpropagation (BP). However, RL and truncated BP suffer from low sample efficiency and biased gradient estimation respectively, resulting in limited improvement or, even worse, complete training failure. To overcome the challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a zeroth-order informed fine-tuning paradigm for DM. The zeroth-order gradient estimator enables the computation graph rearrangement within the recursive diffusive chain, making the RLR’s gradient estimator an unbiased one with the lower variance than other methods. We provide theoretical guarantees for the performance of the RLR. Extensive experiments are conducted on image and video generation tasks to validate the superiority of the RLR. Furthermore, we propose a novel prompt technique that is natural for the RLR to achieve a synergistic effect.
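摘要中"零阶梯度估计"可用经典的两点差分估计器直观说明:仅靠函数值差分、不做反向传播即可逼近真实梯度。以下为假设性最小示意(与论文的 RLR 具体实现无关,扰动幅度与采样数均为示例参数):

```python
import numpy as np

def zeroth_order_grad(f, x, mu=1e-3, n_samples=256, rng=None):
    # 两点零阶梯度估计: g ≈ E_u[(f(x+mu*u) - f(x-mu*u)) / (2*mu) * u]
    rng = np.random.default_rng(0) if rng is None else rng
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

f = lambda x: np.sum(x ** 2)   # 解析梯度为 2x, 便于对照
x = np.array([1.0, -2.0, 3.0])
print(zeroth_order_grad(f, x))
```

对二次函数该估计是无偏的,误差仅来自采样方差;增大 n_samples 即可逼近解析梯度 [2, -4, 6]。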
zh
[CV-100] MedConv: Convolutions Beat Transformers on Long-Tailed Bone Density Prediction
【速读】:该论文旨在解决骨密度预测中两个主要问题:一是基于Transformer架构的方法计算复杂度高,限制了其在便携式和临床环境中的应用;二是实际医院数据分布不平衡且呈长尾分布,导致预测偏差。为了解决这些问题,论文的关键方案是引入MedConv模型,这是一种卷积模型,相比Transformer模型具有更低的计算需求,并且能够提高预测准确性。此外,论文还采用了Bal-CE损失函数和事后Logit调整(post-hoc logit adjustment)来改善类别平衡。实验结果表明,这种方法在AustinSpine数据集上实现了高达21%的准确率提升和20%的ROC AUC提升。
链接: https://arxiv.org/abs/2502.00631
作者: Xuyin Qi,Zeyu Zhang,Huazhan Zheng,Mingxi Chen,Numan Kutaiba,Ruth Lim,Cherie Chiang,Zi En Tham,Xuan Ren,Wenxin Zhang,Lei Zhang,Hao Zhang,Wenbing Lv,Guangzhen Yao,Renda Han,Kangsheng Wang,Mingyuan Li,Hongtao Mao,Yu Li,Zhibin Liao,Yang Zhao,Minh-Son To
机构: Flinders University; The University of Adelaide; The Australian National University; Zhejiang University of Technology; Guangdong Technion – Israel Institute of Technology; Austin Health; The University of Melbourne; La Trobe University; University of Chinese Academy of Sciences; Yunnan University; Northeast Normal University; Hainan University; Univeristy of Science and Technology Beijing; Hebei University of Technology; Central China Normal University; Hubei University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Bone density prediction via CT scans to estimate T-scores is crucial, providing a more precise assessment of bone health compared to traditional methods like X-ray bone density tests, which lack spatial resolution and the ability to detect localized changes. However, CT-based prediction faces two major challenges: the high computational complexity of transformer-based architectures, which limits their deployment in portable and clinical settings, and the imbalanced, long-tailed distribution of real-world hospital data that skews predictions. To address these issues, we introduce MedConv, a convolutional model for bone density prediction that outperforms transformer models with lower computational demands. We also adapt Bal-CE loss and post-hoc logit adjustment to improve class balance. Extensive experiments on our AustinSpine dataset show that our approach achieves up to 21% improvement in accuracy and 20% in ROC AUC over previous state-of-the-art methods.
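文中采用的事后 logit 调整(post-hoc logit adjustment)思路可概括为:从 logits 中减去 tau·log(类别先验),以抵消长尾分布带来的头部类偏置。下面是一个假设性数值示意(先验概率与 logits 均为虚构示例):

```python
import numpy as np

def logit_adjust(logits, class_priors, tau=1.0):
    # 事后 logit 调整: 减去 tau * log(先验), 纠正长尾偏置
    return logits - tau * np.log(np.asarray(class_priors))

# 假设头部类先验为 0.9, 尾部类为 0.1
logits = np.array([2.0, 1.8])            # 原始预测略偏向头部类
adjusted = logit_adjust(logits, [0.9, 0.1])
print(int(np.argmax(logits)), int(np.argmax(adjusted)))  # → 0 1
```

调整后尾部类获得了与其稀有程度相称的补偿,预测从头部类翻转到尾部类。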
zh
[CV-101] Self-Prompt SAM: Medical Image Segmentation via Automatic Prompt SAM Adaptation
【速读】:该论文旨在解决通用分割基础模型在医学图像分割中的应用局限性,特别是针对Segment Anything Model (SAM) 在处理自然图像与医学图像差异时所表现出的性能不确定性。论文的关键解决方案在于提出了一种名为Self-Prompt-SAM的自提示适应框架,通过设计一个多尺度提示生成器结合SAM中的图像编码器生成辅助掩膜,并利用这些辅助掩膜生成边界框提示和距离变换选取中心点提示。此外,论文还设计了一个三维深度融合适配器(DfusedAdapter),将其注入到图像编码器和掩膜解码器的每个Transformer中,以使预训练的二维SAM模型能够提取三维信息并适应三维医学图像。
链接: https://arxiv.org/abs/2502.00630
作者: Bin Xie,Hao Tang,Dawen Cai,Yan Yan,Gady Agam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Segment Anything Model (SAM) has demonstrated impressive zero-shot performance and brought a range of unexplored capabilities to natural image segmentation tasks. However, as a very important branch of image segmentation, the performance of SAM remains uncertain when applied to medical image segmentation due to the significant differences between natural images and medical images. Meanwhile, it is difficult to meet SAM’s requirement for extra prompts, such as points or boxes, to specify medical regions. In this paper, we propose a novel self-prompt SAM adaptation framework for medical image segmentation, named Self-Prompt-SAM. We design a multi-scale prompt generator combined with the image encoder in SAM to generate auxiliary masks. Then, we use the auxiliary masks to generate bounding boxes as box prompts and use Distance Transform to select the most central points as point prompts. Meanwhile, we design a 3D depth-fused adapter (DfusedAdapter) and inject the DFusedAdapter into each transformer in the image encoder and mask decoder to enable pre-trained 2D SAM models to extract 3D information and adapt to 3D medical images. Extensive experiments demonstrate that our method achieves state-of-the-art performance and outperforms nnUNet by 2.3% on AMOS2022, 1.6% on ACDC and 0.5% on Synapse datasets.
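摘要中"用距离变换选取掩膜最中心点作为点提示"的步骤,可用一个纯 Python 的多源 BFS 距离变换草图来说明(假设性实现,采用 4-邻接的近似距离,并非论文原始代码):

```python
from collections import deque

def center_point(mask):
    # 从所有背景像素出发做多源 BFS, 得到每个像素到背景的距离,
    # 取掩膜内距离最大的像素作为最居中的点提示
    h, w = len(mask), len(mask[0])
    dist = [[None] * w for _ in range(h)]
    q = deque()
    for i in range(h):
        for j in range(w):
            if mask[i][j] == 0:
                dist[i][j] = 0
                q.append((i, j))
    while q:
        i, j = q.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and dist[ni][nj] is None:
                dist[ni][nj] = dist[i][j] + 1
                q.append((ni, nj))
    return max(((i, j) for i in range(h) for j in range(w) if mask[i][j]),
               key=lambda p: dist[p[0]][p[1]])

# 7x7 网格中央的 5x5 方形掩膜, 其几何中心应被选中
mask = [[1 if 1 <= i <= 5 and 1 <= j <= 5 else 0 for j in range(7)] for i in range(7)]
print(center_point(mask))  # → (3, 3)
```

实际实现通常用欧氏距离变换(如 scipy.ndimage.distance_transform_edt),此处用 BFS 只为保持自包含。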
zh
[CV-102] Strengthening Generative Robot Policies through Predictive World Modeling
【速读】:该论文旨在解决复杂物理交互控制任务中的预测控制问题。论文的关键在于引入生成式预测控制(GPC)框架,通过条件视频扩散(conditional video diffusion)学习接近物理准确的视觉世界模型,并实现稳健的视觉预测。GPC框架包括三个主要部分:从专家演示中克隆生成式扩散策略,训练一个动作条件的世界模型,以及使用该模型进行前瞻规划以优化行动提案。这种方法使得GPC在基于状态和基于视觉的任务中,无论是仿真还是真实环境中,均优于行为克隆方法。
链接: https://arxiv.org/abs/2502.00622
作者: Han Qi,Haocheng Yin,Yilun Du,Heng Yang
机构: School of Engineering and Applied Sciences, Harvard University (工程与应用科学学院, 哈佛大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Website: this https URL
点击查看摘要
Abstract:We present generative predictive control (GPC), a learning control framework that (i) clones a generative diffusion-based policy from expert demonstrations, (ii) trains a predictive action-conditioned world model from both expert demonstrations and random explorations, and (iii) synthesizes an online planner that ranks and optimizes the action proposals from (i) by looking ahead into the future using the world model from (ii). Crucially, we show that conditional video diffusion allows learning (near) physics-accurate visual world models and enable robust visual foresight. Focusing on planar pushing with rich contact and collision, we show GPC dominates behavior cloning across state-based and vision-based, simulated and real-world experiments.
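GPC 在线规划器的核心思想——用世界模型对动作候选向前滚动并按回报排序——可用如下假设性草图说明(其中一维"推块"动力学与奖励函数均为虚构的简化示例,仅演示排序-择优这一步):

```python
import numpy as np

def plan_best_action(state, proposals, world_model, reward_fn, horizon=5):
    # 对每个候选动作用世界模型向前滚动 horizon 步, 取累计回报最高者
    def rollout_return(action):
        s, total = state, 0.0
        for _ in range(horizon):
            s = world_model(s, action)
            total += reward_fn(s)
        return total
    returns = [rollout_return(a) for a in proposals]
    return proposals[int(np.argmax(returns))]

# 假设性的一维任务: 状态为位置, 目标是靠近原点
world_model = lambda s, a: s + a      # 简化动力学: 每步位移 a
reward_fn = lambda s: -abs(s)         # 离原点越近回报越高
proposals = [-0.5, 0.1, 0.3]          # 来自生成式策略的动作候选
print(plan_best_action(2.0, proposals, world_model, reward_fn))  # → -0.5
```

论文中候选动作由扩散策略采样、世界模型是动作条件的视频预测模型,但"采样—滚动—排序"的流程与此一致。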
zh
[CV-103] DesCLIP: Robust Continual Adaptation via General Attribute Descriptions for Pretrained Vision-Language Models
【速读】:该论文旨在解决连续适应视觉-语言模型(Vision-Language Models, VLMs)过程中存在的知识遗忘问题,特别是在处理不断扩展的下游任务和数据集时。现有研究通常关注于将视觉特征与特定类别的文本描述相匹配,而忽视了通用知识与特定知识之间的潜在关联。论文的关键发现是,强制模型优化不适当的视觉-文本匹配会加剧VLM的知识遗忘现象。为了解决这一问题,论文提出DesCLIP方法,通过利用一般属性(General Attribute, GA)描述来指导特定类别对象的理解,从而帮助VLM建立稳健的“视觉-GA-类别”三边关联,而不是仅仅依赖于“视觉-类别”连接。具体而言,该方法引入了一个语言助手生成具体的GA描述候选,并设计了一种基于锚点的嵌入过滤器来获取高度相关的GA描述嵌入,这些嵌入作为配对文本嵌入进行视觉-文本实例匹配,进而调整视觉编码器。同时,类别文本嵌入逐渐校准以与共享的GA描述嵌入对齐。实验结果验证了该方法的有效性和先进性,表明其在性能上优于现有的预训练和基于VLM的连续学习方法。
链接: https://arxiv.org/abs/2502.00618
作者: Chiyuan He,Zihuan Qiu,Fanman Meng,Linfeng Xu,Qingbo Wu,Hongliang Li
机构: University of Electronic Science and Technology of China(电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Continual adaptation of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt for expanding downstream tasks and datasets, while tackling the challenge of knowledge forgetting. Existing research often focuses on connecting visual features with specific class text in downstream tasks, overlooking the latent relationships between general and specialized knowledge. Our findings reveal that forcing models to optimize inappropriate visual-text matches exacerbates forgetting of VLMs. To tackle this issue, we propose DesCLIP, which leverages general attribute (GA) descriptions to guide the understanding of specific class objects, enabling VLMs to establish robust vision-GA-class trilateral associations rather than relying solely on vision-class connections. Specifically, we introduce a language assistant to generate concrete GA description candidates via proper request prompts. Then, an anchor-based embedding filter is designed to obtain highly relevant GA description embeddings, which are leveraged as the paired text embeddings for visual-textual instance matching, thereby tuning the visual encoder. Correspondingly, the class text embeddings are gradually calibrated to align with these shared GA description embeddings. Extensive experiments demonstrate the advancements and efficacy of our proposed method, with comprehensive empirical evaluations highlighting its superior performance compared to existing pretrained and VLM-based continual learning methods.
zh
[CV-104] Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing FAST
【速读】:该论文旨在解决基于状态空间模型(State Space Models, SSMs)的视觉模型在处理高分辨率图像时的计算效率问题。论文的关键解决方案是提出Fast Vision Mamba (FastVim),通过在Vision Mamba模型中进一步减少递归步骤的数量,并通过在多个Mamba块之间交替池化令牌来实现,从而将SSM块中的并行步数减少一半。这种方法实现了高达72.5%的推理速度提升,同时保持了模型性能,展示了在诸如图像分类、细胞扰动预测、分割和目标检测等任务中的卓越性能。
链接: https://arxiv.org/abs/2502.00594
作者: Saarthak Kapse,Robin Betz,Srinivasan Sivanandan
机构: Insitro; Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 15 figures, this https URL
点击查看摘要
Abstract:State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Mamba, unlike Vision Transformers, achieves linear complexity for token interactions through a recurrent hidden state process. This sequential processing is enhanced by a parallel scan algorithm, which reduces the computational time of recurrent steps from L sequential steps to log(L) parallel steps with respect to the number of input tokens (L). In this work, we propose Fast Vision Mamba (FastVim), that further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while still retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a 2× reduction in the number of parallel steps in SSM block. Our model offers up to 72.5% speedup in inference speed compared to baseline Vision Mamba models on high resolution (2048×2048) images. Our experiments demonstrate state-of-the-art performance with dramatically improved throughput in a range of tasks such as image classification, cell perturbation prediction, segmentation, and object detection. Code is made available at this https URL
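沿图像维度池化 token 以减少并行扫描步数的效果,可用如下假设性示意验证:序列长度减半后,log2(L) 级别的并行扫描步数恰好减少一步(此处的成对平均池化仅为示例,并非 FastVim 的原始池化方案):

```python
import numpy as np

def parallel_scan_steps(seq_len):
    # 并行前缀扫描所需步数约为 ceil(log2(L))
    return int(np.ceil(np.log2(seq_len)))

def pool_tokens(tokens, axis=0):
    # 沿图像的某一维度对相邻 token 做平均池化, 该维长度减半
    h = tokens.shape[axis] // 2
    a = np.take(tokens, range(0, 2 * h, 2), axis=axis)
    b = np.take(tokens, range(1, 2 * h, 2), axis=axis)
    return (a + b) / 2

grid = np.random.default_rng(0).normal(size=(16, 16, 8))  # H x W x D 的 token 网格
pooled = pool_tokens(grid, axis=0)                        # 沿 H 池化 -> 8 x 16 x 8
print(parallel_scan_steps(grid.shape[0] * grid.shape[1]),
      parallel_scan_steps(pooled.shape[0] * pooled.shape[1]))  # → 8 7
```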
zh
[CV-105] Contrastive Forward-Forward: A Training Algorithm of Vision Transformer
【速读】:该论文旨在解决现有前馈神经网络训练算法(如反向传播)在性能上的局限性,并寻找更接近大脑工作方式的训练方法。关键在于提出了一种名为对比前馈(Contrastive Forward-Forward)的改进算法,通过在视觉变换器(Vision Transformer)上应用该算法,实现了高达10%的准确率提升和收敛速度提高5至20倍的效果。对比于原始的前馈算法(Forward-Forward),这种改进显著缩小了与反向传播算法(Backpropagation)之间的性能差距,并在某些条件下甚至超越后者。
链接: https://arxiv.org/abs/2502.00571
作者: Hossein Aghagolzadeh,Mehdi Ezoji
机构: Babol Noshirvani University of Technology ( Babol Noshirvani理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 22 pages, 8 figures, under review
点击查看摘要
Abstract:Although backpropagation is widely accepted as a training algorithm for artificial neural networks, researchers are always looking for inspiration from the brain to find ways with potentially better performance. Forward-Forward is a new training algorithm that is more similar to what occurs in the brain, although there is a significant performance gap compared to backpropagation. In the Forward-Forward algorithm, the loss functions are placed after each layer, and the updating of a layer is done using two local forward passes and one local backward pass. Forward-Forward is in its early stages and has been designed and evaluated on simple multi-layer perceptron networks to solve image classification tasks. In this work, we have extended the use of this algorithm to a more complex and modern network, namely the Vision Transformer. Inspired by insights from contrastive learning, we have attempted to revise this algorithm, leading to the introduction of Contrastive Forward-Forward. Experimental results show that our proposed algorithm performs significantly better than the baseline Forward-Forward leading to an increase of up to 10% in accuracy and boosting the convergence speed by 5 to 20 times on Vision Transformer. Furthermore, if we take Cross Entropy as the baseline loss function in backpropagation, it will be demonstrated that the proposed modifications to the baseline Forward-Forward reduce its performance gap compared to backpropagation on Vision Transformer, and even outperforms it in certain conditions, such as inaccurate supervision.
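Forward-Forward 算法中逐层局部的 "goodness" 目标可用如下假设性示意说明:以激活平方和作为 goodness,并用 sigmoid(goodness − θ) 区分正负样本(激活值与阈值均为虚构示例,仅演示该局部目标的形式):

```python
import numpy as np

def goodness(activations):
    # Forward-Forward 中的 "goodness": 该层激活的平方和
    return np.sum(activations ** 2, axis=-1)

def layer_prob_positive(activations, theta=2.0):
    # 以 sigmoid(goodness - theta) 作为样本为 "正" 的概率
    return 1.0 / (1.0 + np.exp(-(goodness(activations) - theta)))

pos = np.array([[1.0, 1.5, 0.5]])   # 假设的正样本激活 (幅值较大)
neg = np.array([[0.1, 0.2, 0.1]])   # 假设的负样本激活 (幅值较小)
print(layer_prob_positive(pos) > layer_prob_positive(neg))  # → [ True]
```

每层用两次局部前向(正样本提升 goodness、负样本压低 goodness)各自更新参数,因而无需全局反向传播;论文在此基础上引入对比学习的思想来构造正负样本。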
zh
[CV-106] Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions
【速读】:该论文旨在解决在实际临床环境中,基于组织病理学的癌症分级和分子层面的生存风险预测难以直接融合进行联合决策的问题。论文的关键解决方案在于提出了一种基于扩散机制的跨模态生成式人工智能模型PathoGen,该模型能够从数字病理图像中合成基因表达,并在此基础上实现高精度(达到当前最先进水平)、高置信度(通过一致性覆盖保证)和可解释性(通过分布式注意力图)的癌症分级和患者生存风险预测。
链接: https://arxiv.org/abs/2502.00568
作者: Samiran Dey,Christopher R.S. Banerji,Partha Basuchowdhuri,Sanjoy K. Saha,Deepak Parashar,Tapabrata Chakraborti
机构: School of Mathematical & Computational Sciences, Indian Association for the Cultivation of Science(印度科学促进会数学与计算科学学院), Kolkata, India; The Alan Turing Institute(艾伦图灵研究所), London, UK; Comprehensive Cancer Center, King’s College London(伦敦国王学院综合癌症中心), London, UK; Department of Computer Science and Engineering, Jadavpur University(贾达普尔大学计算机科学与工程系), Kolkata, India; Warwick Medical School, University of Warwick(华威大学医学院), Coventry, UK; UCL Cancer Institute, University College London(伦敦大学学院癌症研究所), London, UK; Department of Medical Physics and Biomedical Engineering, University College London(伦敦大学学院医学物理与生物医学工程系), London, UK
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Emerging research has highlighted that artificial intelligence based multimodal fusion of digital pathology and transcriptomic features can improve cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction. However, such direct fusion for joint decision is impractical in real clinical settings, where histopathology is still the gold standard for diagnosis and transcriptomic tests are rarely requested, at least in the public healthcare system. With our novel diffusion based crossmodal generative AI model PathoGen, we show that genomic expressions synthesized from digital histopathology jointly predicts cancer grading and patient survival risk with high accuracy (state-of-the-art performance), certainty (through conformal coverage guarantee) and interpretability (through distributed attention maps). PathoGen code is available for open use by the research community through GitHub at this https URL.
zh
[CV-107] Complex Wavelet Mutual Information Loss: A Multi-Scale Loss Function for Semantic Segmentation
【速读】:该论文旨在解决深度神经网络在语义分割任务中面临的类别不平衡和实例不平衡问题,特别是小对象和细边界容易被忽略的问题。为应对多尺度目标的分割挑战,论文提出了一种新颖的复数小波互信息(Complex Wavelet Mutual Information, CWMI)损失函数。该方法利用复数可导向金字塔分解出的子带图像中的互信息,并结合其在多个方向上捕捉特征的能力以及在不同尺度上保持结构相似性的优势,从而有效提升了像素级精度和拓扑度量的性能,同时引入了最小的计算开销。
链接: https://arxiv.org/abs/2502.00563
作者: Renhao Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 11 pages, 6 figures
点击查看摘要
Abstract:Recent advancements in deep neural networks have significantly enhanced the performance of semantic segmentation. However, class imbalance and instance imbalance remain persistent challenges, where smaller instances and thin boundaries are often overshadowed by larger structures. To address the multiscale nature of segmented objects, various models have incorporated mechanisms such as spatial attention and feature pyramid networks. Despite these advancements, most loss functions are still primarily pixel-wise, while regional and boundary-focused loss functions often incur high computational costs or are restricted to small-scale regions. To address this limitation, we propose complex wavelet mutual information (CWMI) loss, a novel loss function that leverages mutual information from subband images decomposed by a complex steerable pyramid. The complex steerable pyramid captures features across multiple orientations and preserves structural similarity across scales. Meanwhile, mutual information is well-suited for capturing high-dimensional directional features and exhibits greater noise robustness. Extensive experiments on diverse segmentation datasets demonstrate that CWMI loss achieves significant improvements in both pixel-wise accuracy and topological metrics compared to state-of-the-art methods, while introducing minimal computational overhead. The code is available at this https URL
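CWMI 损失中的核心量——互信息——可用基于直方图的经典估计直观演示(以下为假设性草图,未包含复数可导向金字塔分解,仅说明互信息对强相关信号与独立噪声的区分能力):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    # 基于二维直方图的互信息估计 (单位: nats)
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

rng = np.random.default_rng(0)
a = rng.normal(size=5000)
noise = rng.normal(size=5000)
# 与自身的含噪副本互信息高, 与独立噪声互信息接近 0
print(mutual_information(a, a + 0.1 * noise) > mutual_information(a, noise))  # → True
```

论文中互信息是在复数可导向金字塔分解出的各方向子带上逐一计算的,这里的一维示例只展示该度量本身的行为。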
zh
[CV-108] Milmer: a Framework for Multiple Instance Learning based Multimodal Emotion Recognition
【Quick Read】: This paper targets the challenge of emotion recognition in human-computer interaction (HCI) by integrating facial expression analysis with electroencephalogram (EEG) signals in a novel multimodal framework, Milmer. The key to the solution is a transformer-based fusion approach that effectively integrates the visual and physiological modalities. The framework also innovatively adopts multiple instance learning (MIL) to extract meaningful information from sequences of facial expression images over time, capturing critical temporal dynamics often overlooked in prior studies. Together, these strategies substantially improve emotion recognition performance.
Link: https://arxiv.org/abs/2502.00547
Authors: Zaitian Wang,Jian He,Yu Liang,Xiyuan Hu,Tianhao Peng,Kaixin Wang,Jiakai Wang,Chenlong Zhang,Weili Zhang,Shuang Niu,Xiaoyang Xie
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Emotions play a crucial role in human behavior and decision-making, making emotion recognition a key area of interest in human-computer interaction (HCI). This study addresses the challenges of emotion recognition by integrating facial expression analysis with electroencephalogram (EEG) signals, introducing a novel multimodal framework-Milmer. The proposed framework employs a transformer-based fusion approach to effectively integrate visual and physiological modalities. It consists of an EEG preprocessing module, a facial feature extraction and balancing module, and a cross-modal fusion module. To enhance visual feature extraction, we fine-tune a pre-trained Swin Transformer on emotion-related datasets. Additionally, a cross-attention mechanism is introduced to balance token representation across modalities, ensuring effective feature integration. A key innovation of this work is the adoption of a multiple instance learning (MIL) approach, which extracts meaningful information from multiple facial expression images over time, capturing critical temporal dynamics often overlooked in previous studies. Extensive experiments conducted on the DEAP dataset demonstrate the superiority of the proposed framework, achieving a classification accuracy of 96.72% in the four-class emotion recognition task. Ablation studies further validate the contributions of each module, highlighting the significance of advanced feature extraction and fusion strategies in enhancing emotion recognition performance. Our code are available at this https URL.
[CV-109] Integrating Frequency Guidance into Multi-source Domain Generalization for Bearing Fault Diagnosis
【Quick Read】: This paper addresses the problem that, under unseen working conditions, a growing number of unknown domains causes domain-invariant features to contain instance-level spurious correlations, hurting generalization. The key to the solution is the Fourier-based Augmentation Reconstruction Network (FARNet), which separates a signal's phase and amplitude components to perform domain augmentation, applying a multi-source domain data-augmentation strategy in the frequency domain. A Frequency-Spatial Interaction Module (FSIM) handles global information and local spatial features, promoting representation learning between the two sub-networks, and a proposed manifold triplet loss further refines the decision boundary. Experiments on the CWRU and SJTU datasets show that FARNet performs effectively and outperforms existing cross-domain methods.
Link: https://arxiv.org/abs/2502.00545
Authors: Xiaotong Tu,Chenyu Ma,Qingyao Wu,Yinhao Liu,Hongyang Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent generalizable fault diagnosis researches have effectively tackled the distributional shift between unseen working conditions. Most of them mainly focus on learning domain-invariant representation through feature-level methods. However, the increasing numbers of unseen domains may lead to domain-invariant features containing instance-level spurious correlations, which impact the previous models' generalizable ability. To address the limitations, we propose the Fourier-based Augmentation Reconstruction Network, namely FARNet. The methods are motivated by the observation that the Fourier phase component and amplitude component preserve different semantic information of the signals, which can be employed in domain augmentation techniques. The network comprises an amplitude spectrum sub-network and a phase spectrum sub-network, sequentially reducing the discrepancy between the source and target domains. To construct a more robust generalized model, we employ a multi-source domain data augmentation strategy in the frequency domain. Specifically, a Frequency-Spatial Interaction Module (FSIM) is introduced to handle global information and local spatial features, promoting representation learning between the two sub-networks. To refine the decision boundary of our model output compared to conventional triplet loss, we propose a manifold triplet loss to contribute to generalization. Through extensive experiments on the CWRU and SJTU datasets, FARNet demonstrates effective performance and achieves superior results compared to current cross-domain approaches on the benchmarks.
[CV-110] VertiFormer: A Data-Efficient Multi-Task Transformer for Off-Road Robot Mobility
【Quick Read】: This paper tackles the application of Transformer architectures to robot mobility on extremely rugged off-road terrain, where real-world mobility data are hard to acquire and existing training techniques do not fully transfer. It proposes VertiFormer, a novel data-efficient multi-task Transformer. The key is a new learnable masked-modeling and next-token-prediction paradigm that, with only one hour of data, simultaneously predicts the robot's next pose, action, and terrain patch, supporting a variety of off-road mobility tasks. The non-autoregressive design mitigates computational bottlenecks and error propagation, while a unified modality representation enhances learning of diverse temporal mappings and state representations, further improving generalization.
Link: https://arxiv.org/abs/2502.00543
Authors: Mohammad Nazeri,Anuj Pokhrel,Alexandyr Card,Aniket Datar,Garrett Warnell,Xuesu Xiao
Affiliations: Department of Computer Science, George Mason University; DEVCOM Army Research Laboratory; Department of Computer Science, The University of Texas at Austin
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 figures, url: this https URL
Abstract:Sophisticated learning architectures, e.g., Transformers, present a unique opportunity for robots to understand complex vehicle-terrain kinodynamic interactions for off-road mobility. While internet-scale data are available for Natural Language Processing (NLP) and Computer Vision (CV) tasks to train Transformers, real-world mobility data are difficult to acquire with physical robots navigating off-road terrain. Furthermore, training techniques specifically designed to process text and image data in NLP and CV may not apply to robot mobility. In this paper, we propose VertiFormer, a novel data-efficient multi-task Transformer model trained with only one hour of data to address such challenges of applying Transformer architectures for robot mobility on extremely rugged, vertically challenging, off-road terrain. Specifically, VertiFormer employs a new learnable masked modeling and next token prediction paradigm to predict the next pose, action, and terrain patch to enable a variety of off-road mobility tasks simultaneously, e.g., forward and inverse kinodynamics modeling. The non-autoregressive design mitigates computational bottlenecks and error propagation associated with autoregressive models. VertiFormer’s unified modality representation also enhances learning of diverse temporal mappings and state representations, which, combined with multiple objective functions, further improves model generalization. Our experiments offer insights into effectively utilizing Transformers for off-road robot mobility with limited data and demonstrate our efficiently trained Transformer can facilitate multiple off-road mobility tasks onboard a physical mobile robot.
[CV-111] CAD: Confidence-Aware Adaptive Displacement for Semi-Supervised Medical Image Segmentation
【Quick Read】: This paper addresses the challenge of maintaining high-quality consistency learning in semi-supervised medical image segmentation: in regions with uncertain predictions, excessive perturbation degrades alignment and hinders precise decision boundaries. The key to the solution is the Confidence-Aware Adaptive Displacement (CAD) framework, which selectively identifies low-confidence regions and replaces them with high-confidence ones, dynamically adjusting both the maximum allowable replacement size and the confidence threshold during training, so that segmentation quality improves progressively without overwhelming the learning process.
Link: https://arxiv.org/abs/2502.00536
Authors: Wenbo Xiao,Zhihao Xu,Guiping Liang,Yangjun Deng,Yi Xiao
Affiliations: College of Information and Intelligence, Hunan Agricultural University (湖南农业大学信息与智能学院), Changsha, China (中国长沙)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 4 tables
Abstract:Semi-supervised medical image segmentation aims to leverage minimal expert annotations, yet remains confronted by challenges in maintaining high-quality consistency learning. Excessive perturbations can degrade alignment and hinder precise decision boundaries, especially in regions with uncertain predictions. In this paper, we introduce Confidence-Aware Adaptive Displacement (CAD), a framework that selectively identifies and replaces the largest low-confidence regions with high-confidence patches. By dynamically adjusting both the maximum allowable replacement size and the confidence threshold throughout training, CAD progressively refines the segmentation quality without overwhelming the learning process. Experimental results on public medical datasets demonstrate that CAD effectively enhances segmentation quality, establishing new state-of-the-art accuracy in this field. The source code will be released after the paper is published.
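The core displace-and-replace idea can be sketched on a 2D confidence map. This toy version is our own simplification: it swaps the single lowest-mean-confidence square patch with the highest-confidence one when the former falls below a threshold, whereas the actual CAD schedules both the patch size and the threshold over training and operates inside a consistency-learning pipeline; the function name and parameters are hypothetical.

```python
import numpy as np

def cad_displace(conf, feats, patch=4, thresh=0.6):
    """Toy sketch of confidence-aware displacement: replace the
    lowest-confidence `patch`x`patch` region of `feats` with the
    highest-confidence region, if the low region is below `thresh`."""
    h, w = conf.shape
    best_lo, best_hi, lo_val, hi_val = None, None, np.inf, -np.inf
    for i in range(0, h - patch + 1):          # scan all patch positions
        for j in range(0, w - patch + 1):
            m = conf[i:i + patch, j:j + patch].mean()
            if m < lo_val:
                lo_val, best_lo = m, (i, j)
            if m > hi_val:
                hi_val, best_hi = m, (i, j)
    out = feats.copy()
    if lo_val < thresh:                        # only displace uncertain regions
        (li, lj), (hi_, hj) = best_lo, best_hi
        out[li:li + patch, lj:lj + patch] = feats[hi_:hi_ + patch, hj:hj + patch]
    return out
```

In the real framework the allowable `patch` size and `thresh` would both be annealed across training iterations rather than fixed.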
[CV-112] Work-Efficient Parallel Non-Maximum Suppression Kernels
【Quick Read】: This paper addresses the large number of overlapping candidate windows produced by sliding-window classifiers and single-shot convolutional neural network (CNN) meta-architectures in object detection. The key to the solution is a highly scalable Non-Maximum Suppression (NMS) algorithm designed from scratch for embedded GPU architectures, capable of handling thousands of simultaneous detections per image, achieving 14x-40x speedups over state-of-the-art NMS methods across the NVIDIA Tegra family of GPUs.
Link: https://arxiv.org/abs/2502.00535
Authors: David Oro,Carles Fernández,Xavier Martorell,Javier Hernando
Affiliations: Universitat Politècnica de Catalunya (巴塞罗那加泰罗尼亚理工大学); Herta Security (赫尔塔安全)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Code: this https URL
Abstract:In the context of object detection, sliding-window classifiers and single-shot Convolutional Neural Network (CNN) meta-architectures typically yield multiple overlapping candidate windows with similar high scores around the true location of a particular object. Non-Maximum Suppression (NMS) is the process of selecting a single representative candidate within this cluster of detections, so as to obtain a unique detection per object appearing on a given picture. In this paper, we present a highly scalable NMS algorithm for embedded GPU architectures that is designed from scratch to handle workloads featuring thousands of simultaneous detections on a given picture. Our kernels are directly applicable to other sequential NMS algorithms such as FeatureNMS, Soft-NMS or AdaptiveNMS that share the inner workings of the classic greedy NMS method. The obtained performance results show that our parallel NMS algorithm is capable of clustering 1024 simultaneous detected objects per frame in roughly 1 ms on both NVIDIA Tegra X1 and NVIDIA Tegra X2 on-die GPUs, while taking 2 ms on NVIDIA Tegra K1. Furthermore, our proposed parallel greedy NMS algorithm yields a 14x-40x speed up when compared to state-of-the-art NMS methods that require learning a CNN from annotated data.
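The paper's parallel kernels share the inner workings of the classic greedy NMS method, which serves as the sequential baseline. A minimal NumPy rendering of that baseline (boxes in `[x1, y1, x2, y2]` format; this is the textbook algorithm, not the paper's GPU implementation):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Classic sequential greedy NMS: keep the highest-scoring box,
    suppress everything overlapping it above `iou_thresh`, repeat."""
    order = np.argsort(scores)[::-1]   # indices by descending score
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```

The inherently serial "keep then suppress" loop is exactly what makes a work-efficient parallel formulation non-trivial on GPUs.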
[CV-113] Video Latent Flow Matching: Optimal Polynomial Projections for Video Interpolation and Extrapolation
【Quick Read】: This paper targets efficient generation of time-dependent video frames in video modeling. The key is the Video Latent Flow Matching (VLFM) method, which builds on strong pre-trained image generation models and models a caption-guided flow of latent patches that can be decoded into time-dependent video frames, enabling interpolation and extrapolation at arbitrary frame rates.
Link: https://arxiv.org/abs/2502.00500
Authors: Yang Cao,Zhao Song,Chiwun Yang
Affiliations: Simons Institute for the Theory of Computing, University of California, Berkeley (西蒙斯计算理论研究所,加州大学伯克利分校); Sun Yat-sen University (中山大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:This paper considers an efficient video modeling process called Video Latent Flow Matching (VLFM). Unlike prior works, which randomly sampled latent patches for video generation, our method relies on current strong pre-trained image generation models, modeling a certain caption-guided flow of latent patches that can be decoded to time-dependent video frames. We first speculate multiple images of a video are differentiable with respect to time in some latent space. Based on this conjecture, we introduce the HiPPO framework to approximate the optimal projection for polynomials to generate the probability path. Our approach gains the theoretical benefits of the bounded universal approximation error and timescale robustness. Moreover, VLFM processes the interpolation and extrapolation abilities for video generation with arbitrary frame rates. We conduct experiments on several text-to-video datasets to showcase the effectiveness of our method.
[CV-114] A framework for river connectivity classification using temporal image processing and attention based neural networks
【Quick Read】: This paper addresses the high cost of traditional stream-flow gauges, which limits monitoring to large rivers, by proposing a low-cost, easily deployed alternative based on automatic image classification. The key is an automated system with three parts: image preprocessing, image augmentation, and machine-learning classification. In particular, it adopts a vision transformer architecture and temporal image enhancement, combined with generative augmentation using diffusion models, raising baseline accuracy on images from a new, unseen site from 75% to 90% and markedly improving automatic classification of river connectivity.
Link: https://arxiv.org/abs/2502.00474
Authors: Timothy James Becker,Derin Gezgin,Jun Yi He Wu,Mary Becker
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 15 pages, 8 figures
Abstract:Measuring the connectivity of water in rivers and streams is essential for effective water resource management. Increased extreme weather events associated with climate change can result in alterations to river and stream connectivity. While traditional stream flow gauges are costly to deploy and limited to large river bodies, trail camera methods are a low-cost and easily deployed alternative to collect hourly data. Image capturing, however requires stream ecologists to manually curate (select and label) tens of thousands of images per year. To improve this workflow, we developed an automated instream trail camera image classification system consisting of three parts: (1) image processing, (2) image augmentation and (3) machine learning. The image preprocessing consists of seven image quality filters, foliage-based luma variance reduction, resizing and bottom-center cropping. Images are balanced using variable amount of generative augmentation using diffusion models and then passed to a machine learning classification model in labeled form. By using the vision transformer architecture and temporal image enhancement in our framework, we are able to increase the 75% base accuracy to 90% for a new unseen site image. We make use of a dataset captured and labeled by staff from the Connecticut Department of Energy and Environmental Protection between 2018-2020. Our results indicate that a combination of temporal image processing and attention-based models are effective at classifying unseen river connectivity images.
[CV-115] Weak-to-Strong Diffusion with Reflection
【Quick Read】: This paper addresses the inevitable gap between ideal outputs and real data that diffusion generative models incur due to limitations in training data quality, modeling strategy, and architectural design. The key is Weak-to-Strong Diffusion (W2SD), a framework that uses the estimated difference between existing weak and strong models to approximate the gap between an ideal model and a strong model. By alternating denoising and inversion through a reflective operation on the weak-to-strong difference, W2SD steers latent variables along the sampling trajectory toward regions of the real data distribution, significantly improving the quality and diversity of generated results.
Link: https://arxiv.org/abs/2502.00473
Authors: Lichen Bai,Masashi Sugiyama,Zeke Xie
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 19 figures, 14 tables
Abstract:The goal of diffusion generative models is to align the learned distribution with the real data distribution through gradient score matching. However, inherent limitations in training data quality, modeling strategies, and architectural design lead to inevitable gap between generated outputs and real data. To reduce this gap, we propose Weak-to-Strong Diffusion (W2SD), a novel framework that utilizes the estimated difference between existing weak and strong models (i.e., weak-to-strong difference) to approximate the gap between an ideal model and a strong model. By employing a reflective operation that alternates between denoising and inversion with weak-to-strong difference, we theoretically understand that W2SD steers latent variables along sampling trajectories toward regions of the real data distribution. W2SD is highly flexible and broadly applicable, enabling diverse improvements through the strategic selection of weak-to-strong model pairs (e.g., DreamShaper vs. SD1.5, good experts vs. bad experts in MoE). Extensive experiments demonstrate that W2SD significantly improves human preference, aesthetic quality, and prompt adherence, achieving SOTA performance across various modalities (e.g., image, video), architectures (e.g., UNet-based, DiT-based, MoE), and benchmarks. For example, Juggernaut-XL with W2SD can improve with the HPSv2 winning rate up to 90% over the original results. Moreover, the performance gains achieved by W2SD markedly outweigh its additional computational overhead, while the cumulative improvements from different weak-to-strong difference further solidify its practical utility and deployability.
[CV-116] Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions
【Quick Read】: This paper addresses automatic speech recognition from lip movements (continuous lipreading) in Spanish. The key to the solution is an end-to-end system based on a hybrid CTC/Attention architecture, together with an extensive ablation study analyzing how each component of the system affects recognition quality. The paper also conducts a rigorous error analysis of the factors that may affect learning in the automatic system and consolidates a new Spanish lipreading benchmark.
Link: https://arxiv.org/abs/2502.00464
Authors: David Gimeno-Gómez,Carlos-D. Martínez-Hinarejos
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in the "Language Resources and Evaluation" journal, Springer Nature
Abstract:Visual speech recognition remains an open research problem where different challenges must be considered by dispensing with the auditory sense, such as visual ambiguities, the inter-personal variability among speakers, and the complex modeling of silence. Nonetheless, recent remarkable results have been achieved in the field thanks to the availability of large-scale databases and the use of powerful attention mechanisms. Besides, multiple languages apart from English are nowadays a focus of interest. This paper presents noticeable advances in automatic continuous lipreading for Spanish. First, an end-to-end system based on the hybrid CTC/Attention architecture is presented. Experiments are conducted on two corpora of disparate nature, reaching state-of-the-art results that significantly improve the best performance obtained to date for both databases. In addition, a thorough ablation study is carried out, where it is studied how the different components that form the architecture influence the quality of speech recognition. Then, a rigorous error analysis is carried out to investigate the different factors that could affect the learning of the automatic system. Finally, a new Spanish lipreading benchmark is consolidated. Code and trained models are available at this https URL.
[CV-117] MambaGlue: Fast and Robust Local Feature Matching With Mamba ICRA
【Quick Read】: This paper addresses the persistent demand in computer vision for feature matching that is both robust and fast. The key is MambaGlue, a local feature matching method built on the Mamba architecture, which is noted for excellent training and inference speed and promising performance relative to Transformer architectures. MambaGlue introduces two modules: (a) a MambaAttention mixer, which simultaneously and selectively understands local and global context through a Mamba-based self-attention structure, and (b) a deep confidence score regressor, a multi-layer perceptron (MLP) architecture that estimates how confidently matching predictions correspond to ground-truth correspondences. These innovations let MambaGlue balance robustness and efficiency in real-world applications.
Link: https://arxiv.org/abs/2502.00462
Authors: Kihwan Ryoo,Hyungtae Lim,Hyun Myung
Affiliations: KAIST (韩国高等科技学院); LIDS (Laboratory for Information & Decision Systems), Massachusetts Institute of Technology (麻省理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Proc. IEEE Int'l Conf. Robotics and Automation (ICRA) 2025
Abstract:In recent years, robust matching methods using deep learning-based approaches have been actively studied and improved in computer vision tasks. However, there remains a persistent demand for both robust and fast matching techniques. To address this, we propose a novel Mamba-based local feature matching approach, called MambaGlue, where Mamba is an emerging state-of-the-art architecture rapidly gaining recognition for its superior speed in both training and inference, and promising performance compared with Transformer architectures. In particular, we propose two modules: a) MambaAttention mixer to simultaneously and selectively understand the local and global context through the Mamba-based self-attention structure and b) deep confidence score regressor, which is a multi-layer perceptron (MLP)-based architecture that evaluates a score indicating how confidently matching predictions correspond to the ground-truth correspondences. Consequently, our MambaGlue achieves a balance between robustness and efficiency in real-world applications. As verified on various public datasets, we demonstrate that our MambaGlue yields a substantial performance improvement over baseline approaches while maintaining fast inference speed. Our code will be available on this https URL
[CV-118] Explorations of the Softmax Space: Knowing When the Neural Network Doesnt Know…
【Quick Read】: This paper addresses the problem of assessing the reliability of machine learning predictions in automated decision-making. The key is a clustering-based distance metric: the confidence of a prediction is measured by the distance between a trained network's output and class centroids. A safety threshold is defined per class as the smallest distance from an incorrect prediction to that class's centroid, serving as the criterion for whether an automated prediction is acceptable. Experiments on MNIST and CIFAR-10 validate the method's effectiveness and consistency.
Link: https://arxiv.org/abs/2502.00456
Authors: Daniel Sikar,Artur d'Avila Garcez,Tillman Weyde
Affiliations: City, University of London (伦敦城市大学)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, 1 table. arXiv admin note: substantial text overlap with arXiv:2407.07821
Abstract:Ensuring the reliability and safety of automated decision-making is crucial. This paper proposes a new approach for measuring the reliability of predictions in machine learning models. We analyze how the outputs of a trained neural network change using clustering to measure distances between outputs and class centroids. We propose this distance as a metric to evaluate the confidence of predictions. We assign each prediction to a cluster with centroid representing the mean softmax output for all correct predictions of a given class. We then define a safety threshold for a class as the smallest distance from an incorrect prediction to the given class centroid. We evaluate the approach on the MNIST and CIFAR-10 datasets using a Convolutional Neural Network and a Vision Transformer, respectively. The results show that our approach is consistent across these data sets and network models, and indicate that the proposed metric can offer an efficient way of determining when automated predictions are acceptable and when they should be deferred to human operators.
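The centroid-and-threshold procedure described in the abstract can be sketched directly. The version below assumes Euclidean distance in softmax space (the paper's exact distance and clustering details may differ) and uses our own function names:

```python
import numpy as np

def class_centroids(softmax_out, labels, n_classes):
    """Centroid of each class: mean softmax output over the samples of
    that class that the model predicted correctly."""
    preds = softmax_out.argmax(1)
    cent = np.zeros((n_classes, softmax_out.shape[1]))
    for c in range(n_classes):
        sel = (labels == c) & (preds == c)
        cent[c] = softmax_out[sel].mean(0)
    return cent

def safety_threshold(softmax_out, labels, centroids):
    """Per-class safety threshold: the smallest distance from an
    INCORRECT prediction to the centroid of its predicted class."""
    preds = softmax_out.argmax(1)
    thr = np.full(len(centroids), np.inf)
    for x, p, y in zip(softmax_out, preds, labels):
        if p != y:
            thr[p] = min(thr[p], np.linalg.norm(x - centroids[p]))
    return thr
```

At inference time, a prediction whose distance to its class centroid exceeds that class's threshold would be deferred to a human operator.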
[CV-119] SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models
【Quick Read】: This paper addresses the quadratic computational scaling existing foundation models face when handling multispectral, multitemporal, and multisensor data. The key is the new pretraining framework SatMamba, which combines masked autoencoders with a State Space Model to achieve linear computational scaling, improving performance on long-sequence data.
Link: https://arxiv.org/abs/2502.00435
Authors: Chuc Man Duc,Hiromichi Fukui
Affiliations: Department of Computer Science, Faculty of Information Technology, University of Engineering and Technology, Vietnam National University (越南国立大学工程与技术大学信息技术学院计算机科学系); International Digital Earth Applied Science Research Center, Chubu University (中部大学国际数字地球应用科学研究中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Foundation models refer to deep learning models pretrained on large unlabeled datasets through self-supervised algorithms. In the Earth science and remote sensing communities, there is growing interest in transforming the use of Earth observation data, including satellite and aerial imagery, through foundation models. Various foundation models have been developed for remote sensing, such as those for multispectral, high-resolution, and hyperspectral images, and have demonstrated superior performance on various downstream tasks compared to traditional supervised models. These models are evolving rapidly, with capabilities to handle multispectral, multitemporal, and multisensor data. Most studies use masked autoencoders in combination with Vision Transformers (ViTs) as the backbone for pretraining. While the models showed promising performance, ViTs face challenges, such as quadratic computational scaling with input length, which may limit performance on multiband and multitemporal data with long sequences. This research aims to address these challenges by proposing SatMamba, a new pretraining framework that combines masked autoencoders with State Space Model, offering linear computational scaling. Experiments on high-resolution imagery across various downstream tasks show promising results, paving the way for more efficient foundation models and unlocking the full potential of Earth observation data. The source code is available in this https URL.
[CV-120] CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models
【Quick Read】: This paper addresses the high computational demand of the iterative denoising process in diffusion models for text-to-image synthesis. The key is to combine token-level pruning with caching: significant token changes across denoising iterations are identified via relative noise magnitude, and token selection is further enhanced with spatial clustering and distribution balancing, cutting computational cost by 50%-60% while preserving model performance.
Link: https://arxiv.org/abs/2502.00433
Authors: Xinle Cheng,Zhuoming Chen,Zhihao Jia
Affiliations: Peking University (北京大学); Carnegie Mellon University (卡内基梅隆大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis; however, their iterative denoising process demands substantial computational resources. In this paper, we present a novel acceleration strategy that integrates token-level pruning with caching techniques to tackle this computational challenge. By employing noise relative magnitude, we identify significant token changes across denoising iterations. Additionally, we enhance token selection by incorporating spatial clustering and ensuring distributional balance. Our experiments demonstrate reveal a 50%-60% reduction in computational costs while preserving the performance of the model, thereby markedly increasing the efficiency of diffusion models. The code is available at this https URL
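The selection signal behind the pruning step can be sketched in a few lines: score each token by the relative magnitude of its predicted-noise change between two denoising iterations, then recompute only the top fraction and reuse cached results for the rest. This is a simplified sketch of that one ingredient; the actual method adds spatial clustering and distribution balancing, and the function name and `keep_ratio` are our own choices.

```python
import numpy as np

def select_tokens(noise_prev, noise_curr, keep_ratio=0.5):
    """Rank tokens by relative change in predicted noise between two
    denoising steps and return the indices of the most-changed fraction
    (the ones worth recomputing; the rest can be served from cache)."""
    delta = np.linalg.norm(noise_curr - noise_prev, axis=-1)   # per-token change
    scale = np.linalg.norm(noise_prev, axis=-1) + 1e-8         # avoid divide-by-zero
    rel = delta / scale                                        # relative magnitude
    k = max(1, int(keep_ratio * rel.shape[0]))
    return np.argsort(rel)[::-1][:k]
```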
[CV-121] ST-V: TEst-time Support-set Tuning for Zero-shot Video Classification
【Quick Read】: This paper addresses two main challenges in zero-shot video classification: the cross-modal semantic gap (modality gap) and the fact that a fixed support set cannot be tuned. The key is a new framework, TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V), which combines Multi-prompting Support-set Dilation (MSD) and Temporal-aware Support-set Erosion (TSE). MSD expands the support samples of each class using multiple prompts queried from LLMs, enriching the support set's diversity, while TSE tunes the support set with learnable weights in a self-supervised manner according to temporal prediction consistency, mining pivotal supporting cues for each class.
Link: https://arxiv.org/abs/2502.00426
Authors: Rui Yan,Jin Wang,Hongyu Qu,Xiaoyu Du,Dong Zhang,Jinhui Tang,Tieniu Tan
Affiliations: Nanjing University (南京大学); Nanjing University of Science and Technology (南京理工大学); Hong Kong University of Science and Technology (香港科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embedding with a few prompts (Test-time Prompt Tuning, TPT) or replacing class names with generated visual samples (support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities while the support-set cannot be tuned. To this end, we draw on each other’s strengths and propose a novel framework namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts enquired from LLMs to enrich the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to the temporal prediction consistency in a self-supervised manner to dig pivotal supporting cues for each class. \textbfTEST-V achieves state-of-the-art results across four benchmarks and has good interpretability for the support-set dilation and erosion.
[CV-122] MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization
【Quick Read】: This paper addresses the large parameter sizes and high computational demands that hinder real-world deployment of multimodal large language models (MLLMs). The key is MQuant, a post-training quantization framework that introduces Modality-Specific Static Quantization (MSQ), Attention-Invariant Flexible Switching (AIFS), and Rotation Magnitude Suppression (RMS) to tackle, respectively, the inference latency caused by large numbers of visual tokens, the distributional disparity between visual and textual tokens, and the extreme outliers introduced by Hadamard transformations, achieving near-floating-point accuracy while cutting inference latency by up to 30%.
Link: https://arxiv.org/abs/2502.00425
Authors: JiangYong Yu,Sifan Zhou,Dawei Yang,Shuo Wang,Shuoyu Li,Xing Hu,Chen Xu,Zukang Xu,Changyong Shu,Zhihang Yuan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: First quantization solution for Multimodal large language models applicable to 5 mainstream MLLMs
Abstract:Multimodal large language models (MLLMs) have garnered widespread attention due to their ability to understand multimodal input. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application. While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of multimodal large language models (MLLMs). Conventional quantization often struggles with MLLMs because of (a) high inference latency from large visual token counts, (b) distributional disparities between visual and textual tokens, and (c) extreme outliers introduced by Hadamard-based transformations. To address these issues, MQuant introduces: Modality-Specific Static Quantization (MSQ), assigning distinct static scales for visual vs. textual tokens; Attention-Invariant Flexible Switching (AIFS), reordering tokens to preserve causal attention while eliminating expensive token-wise scale computations; Rotation Magnitude Suppression (RMS), mitigating weight outliers arising from online Hadamard rotations. On five mainstream MLLMs (including Qwen-VL, MiniCPM-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30%, significantly outperforming existing PTQ baselines. Our MQuant effectively bridges the gap for efficient and accurate MLLMs inference in resource-constrained devices. Code will be released.
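The MSQ idea — distinct static quantization scales for visual versus textual tokens — can be illustrated with symmetric int8 fake-quantization. This is a sketch under stated assumptions only: the real MQuant additionally reorders tokens (AIFS) and suppresses rotation outliers (RMS), which are omitted here, and all names and signatures below are our own.

```python
import numpy as np

def static_scale(calib, n_bits=8):
    """Symmetric per-tensor scale derived offline from calibration data."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.abs(calib).max() / qmax

def quantize_tokens(tokens, is_visual, s_vis, s_txt, n_bits=8):
    """Sketch of modality-specific static quantization: visual and
    textual tokens get DISTINCT precomputed scales, so no per-token
    dynamic scale has to be computed at inference time."""
    qmax = 2 ** (n_bits - 1) - 1
    scales = np.where(is_visual[:, None], s_vis, s_txt)  # per-row scale
    q = np.clip(np.round(tokens / scales), -qmax - 1, qmax)
    return q * scales  # fake-quantized (dequantized) tokens
```

Using one shared scale instead would let the wide-range visual tokens crush the resolution available to the narrow-range textual tokens, which is the disparity MSQ is meant to avoid.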
[CV-123] Parameter Efficient Fine-Tuning of Segment Anything Model
【Quick Read】: This paper addresses the limited generalization of biomedical image segmentation to new conditions and the high cost of data annotation. The key is the application of parameter-efficient fine-tuning (PEFT): the paper evaluates nine PEFT methods on diverse datasets and contributes a resource-efficient fine-tuning strategy, including a QLoRA implementation for vision transformers and a new approach for efficiently fine-tuning SAM.
Link: https://arxiv.org/abs/2502.00418
Authors: Carolin Teuber,Anwai Archit,Constantin Pape
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Segmentation is an important analysis task for biomedical images, enabling the study of individual organelles, cells or organs. Deep learning has massively improved segmentation methods, but challenges remain in generalization to new conditions, requiring costly data annotation. Vision foundation models, such as Segment Anything Model (SAM), address this issue through broad segmentation capabilities. However, these models still require finetuning on annotated data, although with less annotations, to achieve optimal results for new conditions. As a downside, they require more computational resources. This makes parameter-efficient finetuning (PEFT) relevant for their application. We contribute the first comprehensive study of PEFT for SAM applied to biomedical segmentation by evaluating 9 PEFT methods on diverse datasets. We also provide an implementation of QLoRA for vision transformers and a new approach for resource-efficient finetuning of SAM. Our code is publicly available at this https URL.
[CV-124] ROI: Cross-Subject Pretraining with Sparse Voxel Selection for Enhanced fMRI Visual Decoding ICASSP2025
【Quick Read】: This paper addresses two problems in fMRI visual decoding: manually labeled ROIs introduce redundant information and noise, and the lack of automated ROI labeling limits the practicality of the technique for cross-subject tasks. The key is TROI (Trainable Region of Interest), a novel two-stage, data-driven ROI labeling method. TROI pretrains an image-decoding backbone on a cross-subject dataset so the input-layer dimensions for a new subject can be generated quickly, then applies a learning-rate rewinding strategy to fine-tune the input layer for the new subject, enabling efficient cross-subject decoding.
Link: https://arxiv.org/abs/2502.00412
Authors: Ziyu Wang,Tengyu Pan,Zhenyu Li,Wu Ji,Li Xiuxing,Jianyong Wang
Affiliations: Tsinghua University (清华大学); Beijing Institute of Technology (北京理工大学); Beihang University (北京航空航天大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICASSP 2025
Abstract:fMRI (functional Magnetic Resonance Imaging) visual decoding involves decoding the original image from brain signals elicited by visual stimuli. This often relies on manually labeled ROIs (Regions of Interest) to select brain voxels. However, these ROIs can contain redundant information and noise, reducing decoding performance. Additionally, the lack of automated ROI labeling methods hinders the practical application of fMRI visual decoding technology, especially for new subjects. This work presents TROI (Trainable Region of Interest), a novel two-stage, data-driven ROI labeling method for cross-subject fMRI decoding tasks, particularly when subject samples are limited. TROI leverages labeled ROIs in the dataset to pretrain an image decoding backbone on a cross-subject dataset, enabling efficient optimization of the input layer for new subjects without retraining the entire model from scratch. In the first stage, we introduce a voxel selection method that combines sparse mask training and low-pass filtering to quickly generate the voxel mask and determine input layer dimensions. In the second stage, we apply a learning rate rewinding strategy to fine-tune the input layer for downstream tasks. Experimental results on the same small sample dataset as the baseline method for brain visual retrieval and reconstruction tasks show that our voxel selection method surpasses the state-of-the-art method MindEye2 with an annotated ROI mask.
zh
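上述第一阶段的"稀疏掩码训练 + 低通滤波"体素筛选思路,可以用如下最小示意来理解(非论文官方实现,掩码得分、滤波核与阈值方式均为示意假设):

```python
import numpy as np

def select_voxels(mask_weights, k, smooth=5):
    """Stage-1 sketch: low-pass filter the learned sparse-mask magnitudes
    (moving average over the voxel axis), then keep the top-k voxels."""
    scores = np.abs(mask_weights)
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(scores, kernel, mode="same")  # crude low-pass filter
    return np.sort(np.argsort(smoothed)[-k:])

# Synthetic "mask weights": mostly noise, plus one informative contiguous region.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=1000)
weights[100:120] += 2.0
selected = select_voxels(weights, k=20)
```

示例中 20 个信息量高的相邻体素被嵌入随机噪声中,经低通平滑后按得分取 top-k 即可稳定选出该区域,同时也确定了新受试者输入层的维度(即 k)。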
[CV-125] Exploring Linear Attention Alternative for Single Image Super-Resolution
【速读】:该论文旨在解决基于深度学习的单图像超分辨率(Single-Image Super-Resolution, SISR)技术在计算复杂性和图像质量方面的挑战,特别是在遥感图像处理中的应用。关键解决方案在于提出Omni-Scale RWKV超分辨率(OmniRWKVSR)模型,该模型结合了Receptance Weighted Key Value (RWKV)架构与特征提取技术,如Visual RWKV空间混合(VRSM)和Visual RWKV通道混合(VRCM),以克服现有方法的局限并实现卓越的SISR性能。
链接: https://arxiv.org/abs/2502.00404
作者: Rongchang Lu,Changyu Li,Donghang Li,Guojing Zhang,Jianqiang Huang,Xilai Li
机构: Qinghai University (青海大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This paper has been published to IEEE International Joint Conference on Neural Networks. Feel free to contact on nomodeset@qq.com
点击查看摘要
Abstract:Deep learning-based single-image super-resolution (SISR) technology focuses on enhancing low-resolution (LR) images into high-resolution (HR) ones. Although significant progress has been made, challenges remain in computational complexity and quality, particularly in remote sensing image processing. To address these issues, we propose our Omni-Scale RWKV Super-Resolution (OmniRWKVSR) model which presents a novel approach that combines the Receptance Weighted Key Value (RWKV) architecture with feature extraction techniques such as Visual RWKV Spatial Mixing (VRSM) and Visual RWKV Channel Mixing (VRCM), aiming to overcome the limitations of existing methods and achieve superior SISR performance. This work has proved able to provide effective solutions for high-quality image reconstruction. Under the 4x Super-Resolution tasks, compared to the MambaIR model, we achieved an average improvement of 0.26% in PSNR and 0.16% in SSIM.
zh
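摘要未给出 Visual RWKV 通道混合(VRCM)的具体结构,下面仅给出一个通用 RWKV 风格通道混合(token shift + 门控 MLP)的最小示意,权重与超参均为随机假设,与论文实际模块可能不同:

```python
import numpy as np

def token_shift(x):
    """Shift the token sequence one step back so each position can mix its
    own features with its predecessor's (RWKV-style token shift)."""
    shifted = np.zeros_like(x)
    shifted[1:] = x[:-1]
    return shifted

def channel_mix(x, mu, W_k, W_v, W_r):
    """Minimal RWKV-style channel mixing: interpolate current and shifted
    tokens, then pass through a sigmoid-gated squared-ReLU MLP."""
    xi = mu * x + (1.0 - mu) * token_shift(x)
    k = np.maximum(xi @ W_k, 0.0) ** 2          # squared ReLU
    r = 1.0 / (1.0 + np.exp(-(xi @ W_r)))       # sigmoid gate
    return r * (k @ W_v)

rng = np.random.default_rng(1)
T, C = 8, 4                                     # toy sequence length / channels
x = rng.normal(size=(T, C))
out = channel_mix(x, 0.5, rng.normal(size=(C, C)),
                  rng.normal(size=(C, C)), rng.normal(size=(C, C)))
```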
[CV-126] Enhancing Highway Safety: Accident Detection on the A9 Test Stretch Using Roadside Sensors
【速读】:该论文旨在解决道路交通事故导致的高死亡率问题,特别是由人为错误(如超速、酒驾和分心驾驶)引起的事故。论文的关键解决方案在于提出了一种结合基于规则的方法与基于学习的方法的事故检测框架。该框架通过利用包含高速碰撞序列的真实世界高速公路事故数据集进行训练和验证,数据集中包含了大量标注的二维和三维边界框以及跟踪ID,从而提高了事故检测的可靠性。
链接: https://arxiv.org/abs/2502.00402
作者: Walter Zimmer,Ross Greer,Xingcheng Zhou,Rui Song,Marc Pavel,Daniel Lehmberg,Ahmed Ghita,Akshay Gopalkrishnan,Mohan Trivedi,Alois Knoll
机构: Technical University of Munich (TUM); Laboratory for Intelligent and Safe Automobiles (LISA) at the Uni. of California San Diego (UCSD); University of California Merced (UCM); Fraunhofer Institute for Transportation and Infrastructure Systems (IVI); SETLabs Research GmbH
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Road traffic injuries are the leading cause of death for people aged 5-29, resulting in about 1.19 million deaths each year. To reduce these fatalities, it is essential to address human errors like speeding, drunk driving, and distractions. Additionally, faster accident detection and quicker medical response can help save lives. We propose an accident detection framework that combines a rule-based approach with a learning-based one. We introduce a dataset of real-world highway accidents featuring high-speed crash sequences. It includes 294,924 labeled 2D boxes, 93,012 labeled 3D boxes, and track IDs across 48,144 frames captured at 10 Hz using four roadside cameras and LiDAR sensors. The dataset covers ten object classes and is released in the OpenLABEL format. Our experiments and analysis demonstrate the reliability of our method.
zh
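论文将基于规则的方法与基于学习的方法结合,但摘要未给出具体规则。下面是一个示意性的"急停"规则(阈值、采样间隔与规则形式均为假设),演示如何从路侧感知得到的轨迹速度序列中提取规则式事故线索:

```python
def hard_stop_alert(speeds_mps, dt=0.1, decel_thresh=7.0, stop_thresh=0.5):
    """Return the first frame index at which a track decelerates harder than
    decel_thresh (m/s^2) and ends up almost stationary -- one example of the
    kind of rule a rule-based accident detector might use."""
    for i in range(1, len(speeds_mps)):
        decel = (speeds_mps[i - 1] - speeds_mps[i]) / dt
        if decel > decel_thresh and speeds_mps[i] < stop_thresh:
            return i
    return None

# A vehicle at ~25 m/s (90 km/h) that stops within one 0.1 s step vs. one
# that merely slows down smoothly.
crash_track = [25.0, 24.8, 24.9, 0.2, 0.0, 0.0]
normal_track = [25.0, 24.0, 23.0, 22.5, 22.0]
```

实际系统中这类规则输出可与学习式检测器的结果融合,以提升事故检测的可靠性。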
[CV-127] Minimalistic Video Saliency Prediction via Efficient Decoder Spatio Temporal Action Cues ICASSP2025
【速读】:该论文旨在解决视频显著性检测(Video Saliency Detection)中的模型大小与性能之间的权衡问题。解决方案的关键在于提出了一种基于ViNet架构的轻量级模型ViNet-S(36MB),它采用了U-Net设计,并具有一个轻量级解码器,从而显著减少了模型大小和参数数量,同时保持了高性能。此外,论文还引入了ViNet-A(148MB),它集成了时空动作定位(Spatio-Temporal Action Localization, STAL)特性。通过将ViNet-S和ViNet-A的预测显著性图进行平均,该方法在多种视觉和视听显著性数据集上实现了当前最先进(state-of-the-art)的性能,同时在参数效率和实时性能方面优于基于Transformer的模型。
链接: https://arxiv.org/abs/2502.00397
作者: Rohit Girmaji,Siddharth Jain,Bhav Beri,Sarthak Bansal,Vineet Gandhi
机构: CVIT, IIIT Hyderabad (IIIT 海得拉巴); India
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
点击查看摘要
Abstract:This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameters without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use action classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, by averaging predicted saliency maps, achieves state-of-the-art performance on three visual-only and six audio-visual saliency datasets, outperforming transformer-based models in both parameter efficiency and real-time performance, with ViNet-S reaching over 1000fps.
zh
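摘要中的集成方式即对 ViNet-S 与 ViNet-A 的预测显著性图取平均。下面用随机数组代替两个模型的输出,演示这一平均融合步骤(min-max 归一化方式为示意假设):

```python
import numpy as np

def ensemble_saliency(maps):
    """Average several predicted saliency maps and min-max renormalize to
    [0, 1], mirroring the simple map-averaging ensemble described above."""
    avg = np.mean(np.stack(maps, axis=0), axis=0)
    lo, hi = avg.min(), avg.max()
    return (avg - lo) / (hi - lo) if hi > lo else np.zeros_like(avg)

rng = np.random.default_rng(2)
pred_a = rng.random((32, 32))   # stand-in for a ViNet-S saliency map
pred_b = rng.random((32, 32))   # stand-in for a ViNet-A saliency map
fused = ensemble_saliency([pred_a, pred_b])
```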
[CV-128] FlexCloud: Direct Modular Georeferencing and Drift-Correction of Point Cloud Maps
【速读】:该论文旨在解决基于同时定位与建图(SLAM)生成的点云地图缺乏全局位置数据的问题,导致内部扭曲和缺失地理参照信息,从而影响地图辅助定位方法的应用。论文的关键解决方案是提出FlexCloud系统,通过利用全球导航卫星系统(GNSS)位置信息和三维橡胶片变换(3D rubber-sheet transformation),实现无需额外控制点的自动地理参照,纠正由长期漂移引起的地图扭曲,进而生成一致且全局参照的点云地图。
链接: https://arxiv.org/abs/2502.00395
作者: Maximilian Leitenstern,Marko Alten,Christian Bolea-Schaser,Dominik Kulmer,Marcel Weinmann,Markus Lienkamp
机构: Institute of Automotive Technology, Technical University of Munich (汽车技术研究所, 慕尼黑工业大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at VEHITS 2025, Proceedings of the 11th International Conference on Vehicle Technology and Intelligent Transport Systems - VEHITS; 2025
点击查看摘要
Abstract:Current software stacks for real-world applications of autonomous driving leverage map information to ensure reliable localization, path planning, and motion prediction. An important field of research is the generation of point cloud maps, referring to the topic of simultaneous localization and mapping (SLAM). As most recent developments do not include global position data, the resulting point cloud maps suffer from internal distortion and missing georeferencing, preventing their use for map-based localization approaches. Therefore, we propose FlexCloud for an automatic georeferencing of point cloud maps created from SLAM. Our approach is designed to work modularly with different SLAM methods, utilizing only the generated local point cloud map and its odometry. Using the corresponding GNSS positions enables direct georeferencing without additional control points. By leveraging a 3D rubber-sheet transformation, we can correct distortions within the map caused by long-term drift while maintaining its structure. Our approach enables the creation of consistent, globally referenced point cloud maps from data collected by a mobile mapping system (MMS). The source code of our work is available at this https URL.
zh
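FlexCloud 的 3D 橡胶片变换是非刚性的,摘要未给出公式。作为示意,下面只演示其前置的"利用对应 GNSS 位置直接地理参照"这一步的刚性版本(Kabsch 最小二乘对齐,轨迹数据为人工构造):

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation + translation mapping src points onto dst
    (Kabsch algorithm). The paper's 3D rubber-sheet step is non-rigid;
    this sketch only shows the initial rigid georeferencing idea."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

rng = np.random.default_rng(3)
local = rng.random((50, 3)) * 10          # SLAM-frame trajectory (toy data)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
gnss = local @ R_true.T + np.array([500.0, -200.0, 30.0])  # "GNSS" positions
R, t = rigid_align(local, gnss)
err = np.abs(local @ R.T + t - gnss).max()
```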
[CV-129] RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes
【速读】:该论文旨在解决无人机场景下基于自然语言表达的指代表达理解(REC)挑战,特别是在多尺度目标检测、多目标及无目标样本处理以及复杂环境中丰富的上下文表达方面。论文的关键解决方案包括引入RefDrone数据集以及开发RDAgent半自动化标注工具以高效构建数据集,并提出Number GroundingDINO (NGDINO) 方法,该方法能够显式学习并利用表达中提及的目标数量,从而有效应对多目标和无目标情况。
链接: https://arxiv.org/abs/2502.00392
作者: Zhichao Sun,Yepeng Liu,Huachao Zhu,Yuliang Gu,Yuda Zou,Zelong Liu,Gui-Song Xia,Bo Du,Yongchao Xu
机构: School of Computer Science, Wuhan University (计算机科学学院, 武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Drones have become prevalent robotic platforms with diverse applications, showing significant potential in Embodied Artificial Intelligence (Embodied AI). Referring Expression Comprehension (REC) enables drones to locate objects based on natural language expressions, a crucial capability for Embodied AI. Despite advances in REC for ground-level scenes, aerial views introduce unique challenges including varying viewpoints, occlusions and scale variations. To address this gap, we introduce RefDrone, a REC benchmark for drone scenes. RefDrone reveals three key challenges in REC: 1) multi-scale and small-scale target detection; 2) multi-target and no-target samples; 3) complex environment with rich contextual expressions. To efficiently construct this dataset, we develop RDAgent (referring drone annotation framework with multi-agent system), a semi-automated annotation tool for REC tasks. RDAgent ensures high-quality contextual expressions and reduces annotation cost. Furthermore, we propose Number GroundingDINO (NGDINO), a novel method designed to handle multi-target and no-target cases. NGDINO explicitly learns and utilizes the number of objects referred to in the expression. Comprehensive experiments with state-of-the-art REC methods demonstrate that NGDINO achieves superior performance on both the proposed RefDrone and the existing gRefCOCO datasets. The dataset and code will be publicly at this https URL.
zh
[CV-130] Efficient Adaptive Label Refinement for Label Noise Learning
【速读】:该论文旨在解决深度神经网络在处理带有噪声标签的数据时容易过拟合的问题,从而导致性能下降。论文的关键解决方案是提出了一种名为自适应标签精炼(Adaptive Label Refinement, ALR)的方法。ALR通过将避免拟合错误标签和充分学习干净样本的任务解耦,并采用软标签更新和熵损失引导的方式,逐步提高高置信度标签的硬度,以更好地从干净样本中学习,而无需任何先验噪声知识或辅助数据集。这种方法简单且高效,验证了其在人工噪声(如CIFAR-10/100)和真实噪声数据集(如ANIMAL-10N, Clothing1M, WebVision)上的有效性,表明ALR在性能上超越了现有最先进方法。
链接: https://arxiv.org/abs/2502.00386
作者: Wenzhen Zhang,Debo Cheng,Guangquan Lu,Bo Zhou,Jiaye Li,Shichao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep neural networks are highly susceptible to overfitting noisy labels, which leads to degraded performance. Existing methods address this issue by employing manually defined criteria, aiming to achieve optimal partitioning in each iteration to avoid fitting noisy labels while thoroughly learning clean samples. However, this often results in overly complex and difficult-to-train models. To address this issue, we decouple the tasks of avoiding fitting incorrect labels and thoroughly learning clean samples and propose a simple yet highly applicable method called Adaptive Label Refinement (ALR). First, inspired by label refurbishment techniques, we update the original hard labels to soft labels using the model’s predictions to reduce the risk of fitting incorrect labels. Then, by introducing the entropy loss, we gradually `harden’ the high-confidence soft labels, guiding the model to better learn from clean samples. This approach is simple and efficient, requiring no prior knowledge of noise or auxiliary datasets, making it more accessible compared to existing methods. We validate ALR’s effectiveness through experiments on benchmark datasets with artificial label noise (CIFAR-10/100) and real-world datasets with inherent noise (ANIMAL-10N, Clothing1M, WebVision). The results show that ALR outperforms state-of-the-art methods.
zh
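ALR 的两个要点——用模型预测软化原始(可能含噪的)标签、再用熵损失逐步"硬化"高置信度标签——可以用如下玩具示例说明(动量系数与迭代次数均为示意假设):

```python
import numpy as np

def refine_labels(soft_labels, probs, momentum=0.9):
    """Move the (possibly noisy) soft labels toward the model's current
    predictions -- the 'label refurbishment' step of the ALR idea."""
    return momentum * soft_labels + (1.0 - momentum) * probs

def entropy(p, eps=1e-12):
    """Mean prediction entropy; minimizing it 'hardens' confident labels."""
    return float(-(p * np.log(p + eps)).sum(axis=1).mean())

onehot = np.array([[1.0, 0.0], [0.0, 1.0]])   # sample 0 clean, sample 1 mislabeled
probs  = np.array([[0.9, 0.1], [0.8, 0.2]])   # model believes class 0 for both
labels = onehot.copy()
for _ in range(50):                            # repeated refinement steps
    labels = refine_labels(labels, probs)
```

迭代后第二个样本的软标签从错误的 [0, 1] 逐渐转向模型预测的类别 0,而熵损失会进一步奖励更"尖锐"的预测分布。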
[CV-131] Masked Generative Nested Transformers with Decode Time Scaling
【速读】:该论文旨在解决视觉生成过程中推理计算效率瓶颈的问题。现有方法通常需要多次通过变压器模型(Transformer model)来生成标记或去噪输入,这导致计算成本高昂。为了解决这一问题,论文提出了两个关键方案:(a) 不同阶段的生成过程所需的计算资源不同,并设计了解码时模型缩放调度以有效利用计算资源;(b) 可以缓存和重用部分计算结果。结合这些方法,使较小模型可以处理更多标记,而较大模型处理较少标记,同时保持参数规模不变。实验结果显示,与基线相比,该方法在几乎减少3倍计算量的情况下仍能获得具有竞争力的性能。
链接: https://arxiv.org/abs/2502.00382
作者: Sahil Goyal,Debapriya Tula,Gagan Jain,Pradeep Shenoy,Prateek Jain,Sujoy Paul
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent advances in visual generation have made significant strides in producing content of exceptional quality. However, most methods suffer from a fundamental problem - a bottleneck of inference computational efficiency. Most of these algorithms involve multiple passes over a transformer model to generate tokens or denoise inputs. However, the model size is kept consistent throughout all iterations, which makes it computationally expensive. In this work, we aim to address this issue primarily through two key ideas - (a) not all parts of the generation process need equal compute, and we design a decode time model scaling schedule to utilize compute effectively, and (b) we can cache and reuse some of the computation. Combining these two ideas leads to using smaller models to process more tokens while large models process fewer tokens. These different-sized models do not increase the parameter size, as they share parameters. We rigorously experiment with ImageNet 256×256, UCF101, and Kinetics600 to showcase the efficacy of the proposed method for image/video generation and frame prediction. Our experiments show that with almost 3× less compute than baseline, our model obtains competitive performance.
zh
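解码阶段"大小模型分工"带来的计算节省可以用一个粗略的预算模型直观感受(两档模型的相对成本与标记分配比例均为假设数字,非论文实测;论文中大小模型共享参数,故参数量不变):

```python
def decode_schedule(total_tokens, small_cost=1.0, large_cost=3.0,
                    large_fraction=0.25):
    """Toy decode-time scaling schedule: the large model handles a fraction
    of the tokens, the smaller (parameter-shared) model handles the rest.
    Returns per-phase token counts and total relative compute."""
    large_tokens = int(total_tokens * large_fraction)
    small_tokens = total_tokens - large_tokens
    cost = large_tokens * large_cost + small_tokens * small_cost
    return {"large": large_tokens, "small": small_tokens, "cost": cost}

baseline = decode_schedule(256, large_fraction=1.0)   # large model everywhere
scaled   = decode_schedule(256, large_fraction=0.25)  # mixed schedule
saving   = baseline["cost"] / scaled["cost"]
```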
[CV-132] Latent Action Learning Requires Supervision in the Presence of Distractors
【速读】:该论文旨在解决在包含干扰因素(distractors)的真实世界视频数据中,基于潜在动作学习(Latent Action Policies, LAPO)方法的有效性问题。研究发现,现有的LAPO方法在处理含有与动作相关的干扰因素的数据时表现不佳。为此,论文提出了一种改进方法,即LAOM(Latent Action Objectives Modification),通过在潜在动作学习过程中引入少量的地面真值动作监督(约2.5%的数据集),显著提升了潜在动作的质量,从而在下游任务中的性能提高了4.2倍。关键在于,在训练潜在动作模型(Latent Action Models, LAM)时集成监督信号,以克服干扰因素带来的负面影响。
链接: https://arxiv.org/abs/2502.00379
作者: Alexander Nikulin,Ilya Zisman,Denis Tarasov,Nikita Lyubaykin,Andrei Polubarov,Igor Kiselev,Vladislav Kurenkov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. In review
点击查看摘要
Abstract:Recently, latent action learning, pioneered by Latent Action Policies (LAPO), have shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using Distracting Control Suite (DCS) we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggle in such scenario. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Models (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning LAM and only then decoding from latent to ground-truth actions.
zh
[CV-133] Scalable Framework for Classifying AI-Generated Content Across Modalities AAAI2025
【速读】:该论文旨在解决有效区分人类生成与AI生成内容及分类不同生成模型输出的问题。解决方案的关键在于提出了一种集成感知哈希(Perceptual Hashing)、相似性测量(Similarity Measurement)和伪标签(Pseudo-labeling)的可扩展框架。此方法能够无需重新训练即可整合新的生成模型,从而确保在动态场景中的适应性和鲁棒性。
链接: https://arxiv.org/abs/2502.00375
作者: Anh-Kiet Duong,Petra Gomez-Krämer
机构: L3i Laboratory, La Rochelle University (拉罗谢尔大学 L3i 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, Defactify4 @ AAAI 2025
点击查看摘要
Abstract:The rapid growth of generative AI technologies has heightened the importance of effectively distinguishing between human and AI-generated content, as well as classifying outputs from diverse generative models. This paper presents a scalable framework that integrates perceptual hashing, similarity measurement, and pseudo-labeling to address these challenges. Our method enables the incorporation of new generative models without retraining, ensuring adaptability and robustness in dynamic scenarios. Comprehensive evaluations on the Defactify4 dataset demonstrate competitive performance in text and image classification tasks, achieving high accuracy across both distinguishing human and AI-generated content and classifying among generative methods. These results highlight the framework’s potential for real-world applications as generative AI continues to evolve. Source codes are publicly available at this https URL.
zh
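框架中"感知哈希 + 相似性测量"的组合,可用经典的均值哈希(aHash)加汉明距离来示意。论文实际使用的哈希算法摘要未说明,此处仅为常见替代:近似副本与参考图的哈希距离应明显小于无关图像:

```python
import numpy as np

def average_hash(img, hash_size=8):
    """Classic aHash: downsample by block-averaging, then threshold each
    block at the global mean to get a bit vector."""
    h, w = img.shape
    bh, bw = h // hash_size, w // hash_size
    small = img[:bh * hash_size, :bw * hash_size] \
        .reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (small > small.mean()).astype(np.uint8).ravel()

def hamming(a, b):
    return int((a != b).sum())

rng = np.random.default_rng(4)
real = rng.random((64, 64))                                    # reference image
near_copy = np.clip(real + rng.normal(0, 0.02, real.shape), 0, 1)
unrelated = rng.random((64, 64))
d_copy = hamming(average_hash(real), average_hash(near_copy))
d_other = hamming(average_hash(real), average_hash(unrelated))
```

在伪标签阶段,可将距离最近的参考哈希所属的生成模型作为新样本的伪标签,从而无需重新训练即可纳入新模型。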
[CV-134] NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
【速读】:该论文旨在解决视觉定位(Visual Grounding, VG)任务中复杂推理需求的挑战,特别是在需要详细查询解释的复杂推理任务中。当前方法主要分为端到端和组合式方法,而组合式方法虽然更为灵活,但在处理基于语言逻辑表示的复杂推理时仍存在局限性。论文的关键解决方案是提出NAVER,这是一种集成显式概率逻辑推理的组合式视觉定位方法,并嵌入有限状态自动机中,配备自校正机制。这种设计通过显式的逻辑推理提高了推理过程中的鲁棒性和可解释性,从而实现了最先进的性能。
链接: https://arxiv.org/abs/2502.00372
作者: Zhixi Cai,Fucai Ke,Simindokht Jahangard,Maria Garcia de la Banda,Reza Haffari,Peter J. Stuckey,Hamid Rezatofighi
机构: Monash University (蒙纳士大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Visual Grounding (VG) tasks, such as referring expression detection and segmentation tasks are important for linking visual entities to context, especially in complex reasoning tasks that require detailed query interpretation. This paper explores VG beyond basic perception, highlighting challenges for methods that require reasoning like human cognition. Recent advances in large language models (LLMs) and vision-language models (VLMs) have improved abilities for visual comprehension, contextual understanding, and reasoning. These methods are mainly split into end-to-end and compositional methods, with the latter offering more flexibility. Compositional approaches that integrate LLMs and foundation models show promising performance but still struggle with complex reasoning with language-based logical representations. To address these limitations, we propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning within a finite-state automaton, equipped with a self-correcting mechanism. This design improves robustness and interpretability in inference through explicit logic reasoning. Our results show that NAVER achieves SoTA performance comparing to recent end-to-end and compositional baselines. The code is available at this https URL.
zh
[CV-135] Shape from Semantics: 3D Shape Generation from Multi-View Semantics
【速读】:该论文旨在解决从语义信息创建与给定语义相匹配的三维模型的问题。传统方法通常依赖于视觉输入(如RGB图像或深度图)来重建几何形状,这限制了创造性探索。论文的关键解决方案在于采用语义作为输入,并利用多语义评分蒸馏采样(Multi-Semantics Score Distillation Sampling, SDS)从二维扩散模型中提取三维几何和外观信息,确保初始形状与语义输入一致。此外,通过图像恢复和视频生成模型添加细节,并引入神经符号距离场(Neural Signed Distance Field, SDF)表示法以实现详细的形状重建。这一系列方法显著扩展了设计空间,使得能够创建具有复杂细节、良好结构、连贯纹理以及平滑过渡的三维模型。
链接: https://arxiv.org/abs/2502.00360
作者: Liangchen Li,Caoliwen Wang,Yuqi Zhou,Bailin Deng,Juyong Zhang
机构: University of Science and Technology of China (中国科学技术大学); Cardiff University (卡迪夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL
点击查看摘要
Abstract:We propose "Shape from Semantics", which is able to create 3D models whose geometry and appearance match given semantics when observed from different views. Traditional "Shape from X" tasks usually use visual input (e.g., RGB images or depth maps) to reconstruct geometry, imposing strict constraints that limit creative explorations. As applications, works like Shadow Art and Wire Art often struggle to grasp the embedded semantics of their design through direct observation and rely heavily on specific setups for proper display. To address these limitations, our framework uses semantics as input, greatly expanding the design space to create objects that integrate multiple semantic elements and are easily discernible by observers. Considering that this task requires a rich imagination, we adopt various generative models and structure-to-detail pipelines. Specifically, we adopt multi-semantics Score Distillation Sampling (SDS) to distill 3D geometry and appearance from 2D diffusion models, ensuring that the initial shape is consistent with the semantic input. We then use image restoration and video generation models to add more details as supervision. Finally, we introduce neural signed distance field (SDF) representation to achieve detailed shape reconstruction. Our framework generates meshes with complex details, well-structured geometry, coherent textures, and smooth transitions, resulting in visually appealing and eye-catching designs. Project page: this https URL
zh
[CV-136] Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering
【速读】:该论文旨在解决3D场景问答(3D SQA)领域内统一分析与比较的挑战,特别是在快速发展的大型多模态建模背景下。论文的关键在于系统性地综述现有的3D SQA数据集、方法论以及评估指标,并强调在数据集标准化、多模态融合及任务设计方面的关键挑战与未来机遇。
链接: https://arxiv.org/abs/2502.00342
作者: Zechuan Li,Hongshan Yu,Yihao Ding,Yan Li,Yong He,Naveed Akhtar
机构: Hunan University (湖南大学); The University of Melbourne (墨尔本大学); The University of Sydney (悉尼大学); Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress
点击查看摘要
Abstract:3D Scene Question Answering (3D SQA) represents an interdisciplinary task that integrates 3D visual perception and natural language processing, empowering intelligent agents to comprehend and interact with complex 3D environments. Recent advances in large multimodal modelling have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA. However, this rapid progress introduces challenges, particularly in achieving unified analysis and comparison across datasets and baselines. This paper presents the first comprehensive survey of 3D SQA, systematically reviewing datasets, methodologies, and evaluation metrics while highlighting critical challenges and future opportunities in dataset standardization, multimodal fusion, and task design.
zh
[CV-137] BiMaCoSR: Binary One-Step Diffusion Model Leveraging Flexible Matrix Compression for Real Super-Resolution
【速读】:该论文旨在解决基于扩散模型(Diffusion Models, DM)的超分辨率方法在资源受限的边缘设备上部署困难的问题。关键解决方案在于提出BiMaCoSR方法,结合了二值化(binarization)和单步蒸馏(one-step distillation),以实现极致的压缩和加速。为了防止二值化导致的模型性能崩溃,文中引入了稀疏矩阵分支(Sparse Matrix Branch, SMB)和低秩矩阵分支(Low Rank Matrix Branch, LRM),这两个辅助分支传递全精度信息但方式不同,从而确保了压缩和加速的同时保持了模型性能。
链接: https://arxiv.org/abs/2502.00333
作者: Kai Liu,Kaicheng Yang,Zheng Chen,Zhiteng Li,Yong Guo,Wenbo Li,Linghe Kong,Yulun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures. The code and models will be available at this https URL
点击查看摘要
Abstract:While super-resolution (SR) methods based on diffusion models (DM) have demonstrated inspiring performance, their deployment is impeded due to the heavy request of memory and computation. Recent researchers apply two kinds of methods to compress or fasten the DM. One is to compress the DM into 1-bit, aka binarization, alleviating the storage and computation pressure. The other distills the multi-step DM into only one step, significantly speeding up inference process. Nonetheless, it remains impossible to deploy DM to resource-limited edge devices. To address this problem, we propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration. To prevent the catastrophic collapse of the model caused by binarization, we proposed sparse matrix branch (SMB) and low rank matrix branch (LRMB). Both auxiliary branches pass the full-precision (FP) information but in different ways. SMB absorbs the extreme values and its output is high rank, carrying abundant FP information. Whereas, the design of LRMB is inspired by LoRA and is initialized with the top r SVD components, outputting low rank representation. The computation and storage overhead of our proposed branches can be safely ignored. Comprehensive comparison experiments are conducted to exhibit BiMaCoSR outperforms current state-of-the-art binarization methods and gains competitive performance compared with FP one-step model. BiMaCoSR achieves a 23.8x compression ratio and a 27.4x speedup ratio compared to FP counterpart. Our code and model are available at this https URL.
zh
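"1-bit 主干 + 低秩辅助分支"的思想可以在一个随机权重矩阵上直接验证:对二值化残差取 top-r SVD 作为低秩分支(LoRA 式初始化),重构误差应严格小于纯二值化。以下为示意实现(缩放系数取法为假设),未包含论文中的稀疏矩阵分支:

```python
import numpy as np

def binarize_with_branches(W, rank=4):
    """Approximate a full-precision weight matrix by a 1-bit core plus a
    low-rank correction (top-r SVD of the residual, LoRA-style init) --
    a toy version of the binarized core + auxiliary-branch idea."""
    alpha = np.abs(W).mean()                  # simple per-matrix scale
    W_bin = alpha * np.sign(W)                # 1-bit core: alpha * {-1, +1}
    U, S, Vt = np.linalg.svd(W - W_bin, full_matrices=False)
    low_rank = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank]
    return W_bin, low_rank

rng = np.random.default_rng(5)
W = rng.normal(size=(32, 32))
W_bin, low_rank = binarize_with_branches(W, rank=8)
err_bin = np.linalg.norm(W - W_bin)                  # pure binarization error
err_branch = np.linalg.norm(W - (W_bin + low_rank))  # with low-rank branch
```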
[CV-138] MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model
【速读】:该论文旨在解决单目3D目标检测模型在深度估计不准确及依赖多阶段检测流程方面的问题。解决方案的关键在于采用基于Vision Transformer (ViT) 的基础模型作为主干网络,并结合Detection Transformer (DETR) 架构实现端到端的深度估计与目标检测。通过引入层次特征融合模块增强特征提取能力,并利用大规模数据训练的相对深度估计模型进行迁移学习以进一步提升深度估计精度。此外,解码器中的查询机制考虑参考点和二维边界框尺寸,从而提高识别性能。
链接: https://arxiv.org/abs/2502.00315
作者: Jihyeok Kim,Seongwoo Moon,Sungwon Nah,David Hyunchul Shim
机构: School of Electrical Engineering, Korea Advanced Institute of Science and Technology(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures
点击查看摘要
Abstract:This paper proposes novel methods to enhance the performance of monocular 3D object detection models by leveraging the generalized feature extraction capabilities of a vision foundation model. Unlike traditional CNN-based approaches, which often suffer from inaccurate depth estimation and rely on multi-stage object detection pipelines, this study employs a Vision Transformer (ViT)-based foundation model as the backbone, which excels at capturing global features for depth estimation. It integrates a detection transformer (DETR) architecture to improve both depth estimation and object detection performance in a one-stage manner. Specifically, a hierarchical feature fusion block is introduced to extract richer visual features from the foundation model, further enhancing feature extraction capabilities. Depth estimation accuracy is further improved by incorporating a relative depth estimation model trained on large-scale data and fine-tuning it through transfer learning. Additionally, the use of queries in the transformer’s decoder, which consider reference points and the dimensions of 2D bounding boxes, enhances recognition performance. The proposed model outperforms recent state-of-the-art methods, as demonstrated through quantitative and qualitative evaluations on the KITTI 3D benchmark and a custom dataset collected from high-elevation racing environments. Code is available at this https URL.
zh
[CV-139] A Diffusion Model Translator for Efficient Image-to-Image Translation
【速读】:该论文旨在解决应用扩散模型(Diffusion Models)进行图像到图像翻译(Image-to-Image Translation, I2I)时的时间消耗问题。现有方法在每个去噪步骤中注入源图像信息以实现迭代优化,导致实施过程耗时。论文提出的关键解决方案是引入一个轻量级翻译器——扩散模型翻译器(DMT),仅在某些中间步骤转移分布至另一域,从而高效完成I2I任务。此外,作者提出了一种自动选择合适时间步长的实用策略,进一步提升性能。
链接: https://arxiv.org/abs/2502.00307
作者: Mengfei Xia,Yu Zhou,Ran Yi,Yong-Jin Liu,Wenping Wang
机构: MOE-Key Laboratory of Pervasive Computing, Department of Computer Science and Technology, Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); Department of Computer Science and Computer Engineering at Texas A&M University (德克萨斯A&M大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Applying diffusion models to image-to-image translation (I2I) has recently received increasing attention due to its practical applications. Previous attempts inject information from the source image into each denoising step for an iterative refinement, thus resulting in a time-consuming implementation. We propose an efficient method that equips a diffusion model with a lightweight translator, dubbed a Diffusion Model Translator (DMT), to accomplish I2I. Specifically, we first offer theoretical justification that in employing the pioneering DDPM work for the I2I task, it is both feasible and sufficient to transfer the distribution from one domain to another only at some intermediate step. We further observe that the translation performance highly depends on the chosen timestep for domain transfer, and therefore propose a practical strategy to automatically select an appropriate timestep for a given task. We evaluate our approach on a range of I2I applications, including image stylization, image colorization, segmentation to image, and sketch to image, to validate its efficacy and general utility. The comparisons show that our DMT surpasses existing methods in both quality and efficiency. Code will be made publicly available.
zh
[CV-140] K Nearest Neighbor-Guided Trajectory Similarity Learning
【速读】:该论文旨在解决轨迹相似性度量在时空数据挖掘应用中的准确性挑战,特别是在深度学习模型中由于轨迹粒度建模困难及训练数据中相似性信号利用不足所导致的问题。解决方案的关键在于提出了TSMini模型,该模型包含子视图建模机制以学习多粒度轨迹模式,并采用基于k近邻的损失函数指导模型不仅学习轨迹间的绝对相似值,还学习它们之间的相对相似排名。这些创新共同实现了高度准确的轨迹相似性近似。
链接: https://arxiv.org/abs/2502.00285
作者: Yanchuan Chang,Xu Cai,Christian S. Jensen,Jianzhong Qi
机构: The University of Melbourne (墨尔本大学); National University of Singapore (新加坡国立大学); Aalborg University (奥尔堡大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注:
点击查看摘要
Abstract:Trajectory similarity is fundamental to many spatio-temporal data mining applications. Recent studies propose deep learning models to approximate conventional trajectory similarity measures, exploiting their fast inference time once trained. Although efficient inference has been reported, challenges remain in similarity approximation accuracy due to difficulties in trajectory granularity modeling and in exploiting similarity signals in the training data. To fill this gap, we propose TSMini, a highly effective trajectory similarity model with a sub-view modeling mechanism capable of learning multi-granularity trajectory patterns and a k nearest neighbor-based loss that guides TSMini to learn not only absolute similarity values between trajectories but also their relative similarity ranks. Together, these two innovations enable highly accurate trajectory similarity approximation. Experiments show that TSMini can outperform the state-of-the-art models by 22% in accuracy on average when learning trajectory similarity measures.
zh
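TSMini 的 k 近邻损失同时约束相似度的绝对值与相对排序。下面给出一个玩具版本(hinge 排序项与两项的组合方式均为示意假设):当预测相似度矩阵与真值完全一致时损失为 0,加噪后损失上升:

```python
import numpy as np

def knn_rank_loss(pred_sim, true_sim, k=3, margin=0.0):
    """Toy kNN-guided objective: MSE on similarity values plus a hinge
    penalty whenever the predicted ordering of a query's k nearest
    neighbors (under the true measure) disagrees with the true ordering."""
    mse = float(((pred_sim - true_sim) ** 2).mean())
    rank_pen, n = 0.0, len(true_sim)
    for q in range(n):
        nbrs = np.argsort(-true_sim[q])[1:k + 1]   # k nearest, skipping self
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                i, j = nbrs[a], nbrs[b]            # true_sim[q,i] >= true_sim[q,j]
                rank_pen += max(0.0, margin + pred_sim[q, j] - pred_sim[q, i])
    return mse + rank_pen / n

rng = np.random.default_rng(6)
true = rng.random((5, 5)); true = (true + true.T) / 2   # toy similarity matrix
np.fill_diagonal(true, 1.0)
perfect = knn_rank_loss(true, true)
noisy = knn_rank_loss(np.clip(true + rng.normal(0, 0.3, true.shape), 0, 1), true)
```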
[CV-141] Simultaneous Estimation of Manipulation Skill and Hand Grasp Force from Forearm Ultrasound Images
【速读】:该论文旨在解决精确估计人体手部配置及施加力的问题,以提升遥操作和技能转移在机器人操作中的有效性。关键解决方案在于使用前臂超声数据同时估计操作技能和手部施力,通过深度学习模型实现了94.87%±10.16%的分类准确率和0.51±0.19牛顿的均方根误差(RMSE),从而证明了前臂超声技术在增强人机交互和复杂操作任务中的潜力。
链接: https://arxiv.org/abs/2502.00275
作者: Keshav Bimbraw,Srikar Nekkanti,Daniel B. Tiller II,Mihir Deshmukh,Berk Calli,Robert D. Howe,Haichong K. Zhang
机构: Inova Medical Group(英维奥医疗集团); Robotics Engineering, Worcester Polytechnic Institute (伍斯特理工学院机器人工程系), Worcester, MA, USA; Harvard Paulson School of Engineering and Applied Sciences(哈佛保罗森工程与应用科学学院), Cambridge, MA, USA
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注: 30 pages, 52 references, 10 figures, 8 tables and 2 supplementary videos. Currently under review
点击查看摘要
Abstract:Accurate estimation of human hand configuration and the forces they exert is critical for effective teleoperation and skill transfer in robotic manipulation. A deeper understanding of human interactions with objects can further enhance teleoperation performance. To address this need, researchers have explored methods to capture and translate human manipulation skills and applied forces to robotic systems. Among these, biosignal-based approaches, particularly those using forearm ultrasound data, have shown significant potential for estimating hand movements and finger forces. In this study, we present a method for simultaneously estimating manipulation skills and applied hand force using forearm ultrasound data. Data collected from seven participants were used to train deep learning models for classifying manipulation skills and estimating grasp force. Our models achieved an average classification accuracy of 94.87 percent plus or minus 10.16 percent for manipulation skills and an average root mean square error (RMSE) of 0.51 plus or minus 0.19 Newtons for force estimation, as evaluated using five-fold cross-validation. These results highlight the effectiveness of forearm ultrasound in advancing human-machine interfacing and robotic teleoperation for complex manipulation tasks. This work enables new and effective possibilities for human-robot skill transfer and tele-manipulation, bridging the gap between human dexterity and robotic control.
zh
[CV-142] MCM: Multi-layer Concept Map for Efficient Concept Learning from Masked Images
【速读】:该论文旨在解决在视觉任务中概念学习依赖全图方法而未充分探索掩码策略的问题。论文的关键在于提出了一种基于掩码图像的有效概念学习方法——多层概念图(Multi-layer Concept Map, MCM)。通过建立编码器和解码器层之间的关联,并利用重构任务的后向梯度更新概念标记(concept tokens),MCM 方法能够在不同粒度级别学习概念标记,从而实现掩码图像块的填补或引导重构结果以反映特定概念。这种方法显著减少了计算成本,并提升了概念预测性能。
链接: https://arxiv.org/abs/2502.00266
作者: Yuwei Sun,Lu Mi,Ippei Fujisawa,Ryota Kanai
机构: Araya Research; RIKEN AIP; Georgia Institute of Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Masking strategies commonly employed in natural language processing are still underexplored in vision tasks such as concept learning, where conventional methods typically rely on full images. However, using masked images diversifies perceptual inputs, potentially offering significant advantages in concept learning with large-scale Transformer models. To this end, we propose Multi-layer Concept Map (MCM), the first work to devise an efficient concept learning method based on masked images. In particular, we introduce an asymmetric concept learning architecture by establishing correlations between different encoder and decoder layers, updating concept tokens using backward gradients from reconstruction tasks. The learned concept tokens at various levels of granularity help either reconstruct the masked image patches by filling in gaps or guide the reconstruction results in a direction that reflects specific concepts. Moreover, we present both quantitative and qualitative results across a wide range of metrics, demonstrating that MCM significantly reduces computational costs by training on fewer than 75% of the total image patches while enhancing concept prediction performance. Additionally, editing specific concept tokens in the latent space enables targeted image generation from masked images, aligning both the visible contextual patches and the provided concepts. By further adjusting the testing time mask ratio, we could produce a range of reconstructions that blend the visible patches with the provided concepts, proportional to the chosen ratios.
zh
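MCM 在不足 75% 的图像块上训练,并允许在测试时调节掩码比例。下述草图仅示意"按比例随机选取掩码块索引"这一输入构造步骤(块数与比例为演示用假设,并非论文实现):

```python
import numpy as np

def mask_patches(n_patches, mask_ratio, rng):
    """Randomly choose which patch indices to mask; the rest stay visible."""
    n_mask = int(n_patches * mask_ratio)
    perm = rng.permutation(n_patches)
    return np.sort(perm[n_mask:]), perm[:n_mask]   # (visible_idx, masked_idx)

rng = np.random.default_rng(0)
visible, masked = mask_patches(16, 0.75, rng)
print(len(visible), len(masked))  # 4 visible patches, 12 masked
```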
[CV-143] Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion
【速读】:该论文旨在解决深度神经网络参数空间对称性在Transformer模型融合中的局限性问题。论文的关键在于引入旋转对称性(Rotation Symmetry),这是一种新的参数空间对称形式,通过在自注意力层中旋转参数矩阵来推广置换对称性(Permutation Symmetry)。不同于离散的置换对称性,旋转对称性在连续域中操作,显著扩展了Transformer模型的等效集。基于此特性,论文提出了一种理论上最优的参数匹配算法,作为插件模块以增强模型融合效果。实验结果表明,基于旋转对称性的匹配算法显著提升了模型融合性能。
链接: https://arxiv.org/abs/2502.00264
作者: Binchi Zhang,Zaiyi Zheng,Zhengzhang Chen,Jundong Li
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry by rotating parameter matrices in self-attention layers. Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers. Based on this property, we propose a theoretically optimal parameter matching algorithm as a plug-and-play module to enhance model fusion. We evaluate our approach using pre-trained transformers across diverse natural language and vision tasks. Experimental results demonstrate that our rotation symmetry-based matching algorithm substantially improves model fusion, highlighting the potential of parameter space symmetry to facilitate model fusion. Our code is available on this https URL.
zh
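旋转对称性的核心事实可以用几行 NumPy 直接验证:在自注意力中,将查询与键投影矩阵同时右乘同一正交矩阵 R,注意力得分保持不变。以下仅演示该不变性本身,并非论文的参数匹配算法:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))      # token embeddings
W_q = rng.normal(size=(d, d))    # query projection
W_k = rng.normal(size=(d, d))    # key projection

# Random orthogonal matrix R via QR decomposition of a Gaussian matrix
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
scores_rot = (X @ W_q @ R) @ (X @ W_k @ R).T / np.sqrt(d)

# R @ R.T = I cancels inside the product, so attention scores are identical
assert np.allclose(scores, scores_rot)
```

由于 R 可在连续域中任取,该等价集远大于离散的置换对称所刻画的集合。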
[CV-144] INSIGHT: Integration of Semantic and Visual Inputs for Generalized Hazard Tracking
【速读】:该论文旨在解决自动驾驶系统在处理不可预测的边缘情况(edge-case scenarios)时所面临的挑战,如对抗性行人行为、危险车辆操作和突发环境变化。当前端到端驾驶模型难以泛化到这些罕见事件,主要是由于传统检测和预测方法的局限性。为了解决这一问题,论文提出了一种名为INSIGHT(语义和视觉输入集成用于泛化风险追踪)的分层视觉-语言模型(VLM)框架。其关键是通过多模态数据融合整合语义和视觉表征,从而实现精确的情景解读和潜在危险的准确预测,并通过基于注意力机制和坐标回归技术优化空间风险定位,最终显著提升了危险预测的简便性和准确性。
链接: https://arxiv.org/abs/2502.00262
作者: Dianwei Chen,Zifan Zhang,Yuchen Liu,Xianfeng Terry Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Autonomous driving systems face significant challenges in handling unpredictable edge-case scenarios, such as adversarial pedestrian movements, dangerous vehicle maneuvers, and sudden environmental changes. Current end-to-end driving models struggle with generalization to these rare events due to limitations in traditional detection and prediction approaches. To address this, we propose INSIGHT (Integration of Semantic and Visual Inputs for Generalized Hazard Tracking), a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation. By using multimodal data fusion, our approach integrates semantic and visual representations, enabling precise interpretation of driving scenarios and accurate forecasting of potential dangers. Through supervised fine-tuning of VLMs, we optimize spatial hazard localization using attention-based mechanisms and coordinate regression techniques. Experimental results on the BDD100K dataset demonstrate a substantial improvement in hazard prediction straightforwardness and accuracy over existing models, achieving a notable increase in generalization performance. This advancement enhances the robustness and safety of autonomous driving systems, ensuring improved situational awareness and potential decision-making in complex real-world scenarios.
zh
[CV-145] Transformer-Based Vector Font Classification Using Different Font Formats: TrueType versus PostScript IJCNN2025
【速读】:该论文旨在研究不同字体表示格式对基于Transformer的矢量字体分类任务的影响。关键发现是:基于PostScript轮廓的字体表示在矢量字体分类任务中优于基于TrueType轮廓的表示。论文指出,信息聚合在基于Transformer的矢量图形深度学习中至关重要,这为未来选择合适的轮廓格式提供了有价值的指导。
链接: https://arxiv.org/abs/2502.00250
作者: Takumu Fujioka(1),Gouhei Tanaka(1 and 2) ((1) Nagoya Institute of Technology, (2) The University of Tokyo)
机构: Department of Computer Science, Nagoya Institute of Technology (名古屋工业大学); International Research Center for Neurointelligence, The University of Tokyo (东京大学神经智能国际研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures, 4 tables, Submitted to IJCNN 2025. Code available at this https URL
点击查看摘要
Abstract:Modern fonts adopt vector-based formats, which ensure scalability without loss of quality. While many deep learning studies on fonts focus on bitmap formats, deep learning for vector fonts remains underexplored. In studies involving deep learning for vector fonts, the choice of font representation has often been made conventionally. However, the font representation format is one of the factors that can influence the computational performance of machine learning models in font-related tasks. Here we show that font representations based on PostScript outlines outperform those based on TrueType outlines in Transformer-based vector font classification. TrueType outlines represent character shapes as sequences of points and their associated flags, whereas PostScript outlines represent them as sequences of commands. In previous research, PostScript outlines have been predominantly used when fonts are treated as part of vector graphics, while TrueType outlines are mainly employed when focusing on fonts alone. Whether to use PostScript or TrueType outlines has been mainly determined by file format specifications and precedent settings in previous studies, rather than performance considerations. To date, few studies have compared which outline format provides better embedding representations. Our findings suggest that information aggregation is crucial in Transformer-based deep learning for vector graphics, as in tokenization in language models and patch division in bitmap-based image recognition models. This insight provides valuable guidance for selecting outline formats in future research on vector graphics.
zh
[CV-146] Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms
【速读】:该论文旨在解决离散扩散模型在高维状态空间中的高效推理问题。现有方法主要分为精确模拟和近似方法(如τ-跳跃 (τ-leaping))。精确方法面临不可预测的推理时间和冗余函数评估的问题,而τ-跳跃方法仅具有一阶精度。论文的关键解决方案是提出了一种高阶数值推理方案的扩展,特别是θ-梯形法 (θ-trapezoidal),以实现更大的步长并减少误差,该方法在KL散度下证明具有二阶精度。实验结果表明,在同等计算约束下,所提方法在GPT-2级别的文本生成和ImageNet级别的图像生成任务中实现了更高质量的样本。
链接: https://arxiv.org/abs/2502.00234
作者: Yinuo Ren,Haoxuan Chen,Yuchen Zhu,Wei Guo,Yongxin Chen,Grant M. Rotskoff,Molei Tao,Lexing Ying
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
备注: 38 pages, 7 figures
点击查看摘要
Abstract:Discrete diffusion models have emerged as a powerful generative modeling framework for discrete data with successful applications spanning from text generation to image synthesis. However, their deployment faces challenges due to the high dimensionality of the state space, necessitating the development of efficient inference algorithms. Current inference approaches mainly fall into two categories: exact simulation and approximate methods such as \tau -leaping. While exact methods suffer from unpredictable inference time and redundant function evaluations, \tau -leaping is limited by its first-order accuracy. In this work, we advance the latter category by tailoring the first extension of high-order numerical inference schemes to discrete diffusion models, enabling larger step sizes while reducing error. We rigorously analyze the proposed schemes and establish the second-order accuracy of the \theta -trapezoidal method in KL divergence. Empirical evaluations on GPT-2 level text and ImageNet-level image generation tasks demonstrate that our method achieves superior sample quality compared to existing approaches under equivalent computational constraints.
zh
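作为直观类比,下面在一个简单 ODE 上比较一阶 Euler(对应 τ-跳跃的一阶精度)与二阶梯形(Heun)格式的误差。这只是连续 ODE 上的教科书式演示,并非论文针对离散扩散模型的 θ-梯形采样器:

```python
import numpy as np

def euler(f, y0, h, n_steps):
    """First-order explicit Euler scheme."""
    y = y0
    for _ in range(n_steps):
        y = y + h * f(y)
    return y

def trapezoidal(f, y0, h, n_steps):
    """Second-order trapezoidal (Heun) scheme: Euler predictor + average-slope corrector."""
    y = y0
    for _ in range(n_steps):
        y_pred = y + h * f(y)
        y = y + 0.5 * h * (f(y) + f(y_pred))
    return y

f = lambda y: -y                  # dy/dt = -y, exact solution e^{-t}
exact = np.exp(-1.0)
err_euler = abs(euler(f, 1.0, 0.1, 10) - exact)
err_trap = abs(trapezoidal(f, 1.0, 0.1, 10) - exact)
assert err_trap < err_euler       # higher order => smaller error at the same step size
```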
[CV-147] A Hybrid Random Forest and CNN Framework for Tile-Wise Oil-Water Classification in Hyperspectral Images
【速读】:该论文旨在解决油水分类在高光谱图像(HSI)中的空间上下文保持难题。解决方案的关键在于提出了一种新颖的随机森林(Random Forest)与卷积神经网络(CNN)混合框架。首先,通过将图像划分为较小的非重叠瓦片来保留空间信息,并将其用于训练、验证和测试。尽管随机森林在逐像素分类中表现出色,但它无法充分利用空间关系。因此,进一步利用CNN处理随机森林生成的概率图,以增强其空间特征学习能力,从而提高整体性能。
链接: https://arxiv.org/abs/2502.00232
作者: Mehdi Nickzamir,Seyed Mohammad Sheikh Ahamdi Gandab
机构: Politecnico di Torino(都灵理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:A novel hybrid Random Forest and Convolutional Neural Network (CNN) framework is presented for oil-water classification in hyperspectral images (HSI). To address the challenge of preserving spatial context, the images were divided into smaller, non-overlapping tiles, which served as the basis for training, validation, and testing. Random Forest demonstrated strong performance in pixel-wise classification, outperforming models such as XGBoost, Attention-Based U-Net, and HybridSN. However, Random Forest loses spatial context, limiting its ability to fully exploit the spatial relationships in hyperspectral data. To improve performance, a CNN was trained on the probability maps generated by the Random Forest, leveraging the CNN’s capacity to incorporate spatial context. The hybrid approach achieved 7.6% improvement in recall (to 0.85), 2.4% improvement in F1 score (to 0.84), and 0.54% improvement in AUC (to 0.99) compared to the baseline. These results highlight the effectiveness of combining probabilistic outputs with spatial feature learning for context-aware analysis of hyperspectral images.
zh
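该流水线的第一步是把高光谱图像切分为不重叠瓦片以保留空间信息。一个最小的 NumPy 示意如下(图像尺寸、波段数与瓦片大小均为假设值):

```python
import numpy as np

def tile_image(img, tile):
    """Split an (H, W, C) hyperspectral cube into non-overlapping tiles,
    cropping any remainder so H and W are multiples of the tile size."""
    H, W, C = img.shape
    H, W = H - H % tile, W - W % tile
    img = img[:H, :W]
    tiles = img.reshape(H // tile, tile, W // tile, tile, C)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, tile, tile, C)

cube = np.zeros((130, 98, 25))   # hypothetical HSI cube with 25 spectral bands
tiles = tile_image(cube, 32)
print(tiles.shape)               # (12, 32, 32, 25)
```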
[CV-148] Fantastic Multi-Task Gradient Updates and How to Find Them In a Cone
【速读】:该论文旨在解决多任务学习(Multi-Task Learning, MTL)中竞争目标之间的平衡问题,主要挑战源于各个任务之间冲突的梯度。论文的关键解决方案是提出了一种名为ConicGrad的方法,该方法将MTL问题构造成一个带有角度约束的优化问题。通过动态调节梯度更新方向,使其限制在一个以总体目标参考梯度为中心的圆锥内,从而有效解决任务间梯度冲突,同时保持计算效率和高维参数空间的可扩展性。
链接: https://arxiv.org/abs/2502.00217
作者: Negar Hassanpour,Muhammad Kamran Janjua,Kunlin Zhang,Sepehr Lavasani,Xiaowen Zhang,Chunhua Zhou,Chao Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 7 figures, 5 tables
点击查看摘要
Abstract:Balancing competing objectives remains a fundamental challenge in multi-task learning (MTL), primarily due to conflicting gradients across individual tasks. A common solution relies on computing a dynamic gradient update vector that balances competing tasks as optimization progresses. Building on this idea, we propose ConicGrad, a principled, scalable, and robust MTL approach formulated as a constrained optimization problem. Our method introduces an angular constraint to dynamically regulate gradient update directions, confining them within a cone centered on the reference gradient of the overall objective. By balancing task-specific gradients without over-constraining their direction or magnitude, ConicGrad effectively resolves inter-task gradient conflicts. Moreover, our framework ensures computational efficiency and scalability to high-dimensional parameter spaces. We conduct extensive experiments on standard supervised learning and reinforcement learning MTL benchmarks, and demonstrate that ConicGrad achieves state-of-the-art performance across diverse tasks.
zh
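"将梯度更新方向限制在以参考梯度为中心的圆锥内"可以有多种实现。下面给出一种直观草图:把更新向量分解为沿参考方向与正交方向两个分量,当夹角超出圆锥半角时压缩正交分量(仅为示意,并非论文的约束优化求解器):

```python
import numpy as np

def confine_to_cone(g, g_ref, max_angle_deg=45.0):
    """Shrink the component of g orthogonal to g_ref until the update lies
    inside a cone of the given half-angle around g_ref.
    (An illustrative construction, not the paper's exact solver.)"""
    u = g_ref / np.linalg.norm(g_ref)
    dot = float(g @ u)
    if dot <= 0:                      # pointing away from the reference
        return np.linalg.norm(g) * u  # fall back to the reference direction
    par = dot * u                     # parallel component
    orth = g - par                    # orthogonal component
    limit = dot * np.tan(np.radians(max_angle_deg))
    n = np.linalg.norm(orth)
    if n > limit:                     # outside the cone: clip the angle
        orth *= limit / n
    return par + orth

# Outside the 45-degree cone: the orthogonal part is shrunk onto the cone boundary
print(confine_to_cone(np.array([1.0, 2.0]), np.array([1.0, 0.0])))  # [1. 1.]
```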
[CV-149] EcoWeedNet: A Lightweight and Automated Weed Detection Method for Sustainable Next-Generation Agricultural Consumer Electronics
【速读】:该论文旨在解决可持续精准农业中的杂草检测问题,传统方法如化学除草剂和人工除草存在环境损害和健康风险。论文的关键解决方案是提出了一种名为EcoWeedNet的新模型,该模型在不显著增加计算复杂度的前提下提升了杂草检测性能,并且具有轻量级特性,适合部署在地面农业消费电子设备和机器人上。实验结果表明,EcoWeedNet在保持高性能的同时,参数量仅为YOLOv4的大约4.21%,浮点运算次数(GFLOPs)仅为6.59%。
链接: https://arxiv.org/abs/2502.00205
作者: Omar H. Khater,Abdul Jabbar Siddiqui,M. Shamim Hossain
机构: King Fahd University of Petroleum and Minerals (KFUPM); SDAIA-KFUPM Joint Research Center on Artificial Intelligence; IRC for Intelligent Secure Systems; King Saud University (KSU)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Sustainable agriculture plays a crucial role in ensuring world food security for consumers. A critical challenge faced by sustainable precision agriculture is weed growth, as weeds share essential resources with the crops, such as water, soil nutrients, and sunlight, which notably affect crop yields. The traditional methods employed to combat weeds include the usage of chemical herbicides and manual weed removal methods. However, these could damage the environment and pose health hazards. The adoption of automated computer vision technologies and ground agricultural consumer electronic vehicles in precision agriculture offers sustainable, low-carbon solutions. However, prior works suffer from issues such as low accuracy and precision and high computational expense. This work proposes EcoWeedNet, a novel model with enhanced weed detection performance without adding significant computational complexity, aligning with the goals of low-carbon agricultural practices. Additionally, our model is lightweight and optimal for deployment on ground-based consumer electronic agricultural vehicles and robots. The effectiveness of the proposed model is demonstrated through comprehensive experiments on the CottonWeedDet12 benchmark dataset reflecting real-world scenarios. EcoWeedNet achieves performance close to that of large models yet with much fewer parameters. (approximately 4.21% of the parameters and 6.59% of the GFLOPs of YOLOv4). This work contributes effectively to the development of automated weed detection methods for next-generation agricultural consumer electronics featuring lower energy consumption and lower carbon footprint. This work paves the way forward for sustainable agricultural consumer technologies.
zh
[CV-150] Evaluating Deep Human-in-the-Loop Optimization for Retinal Implants Using Sighted Participants
【速读】:该论文旨在评估 Human-in-the-loop optimization (HILO) 方法在个性化视觉假体中的应用效果,特别是在真实条件下的优化刺激策略能力。解决方案的关键在于通过迭代反馈机制,利用受试者的选择来逐步优化深层刺激编码器(Deep Stimulus Encoder, DSE),从而生成更优的刺激参数,以提升视觉假体的性能。实验结果显示,HILO生成的刺激在所有测试条件下均优于基准方法,证明了该方法的有效性。
链接: https://arxiv.org/abs/2502.00177
作者: Eirini Schoinas,Adyah Rastogi,Anissa Carter,Jacob Granley,Michael Beyeler
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Human-in-the-loop optimization (HILO) is a promising approach for personalizing visual prostheses by iteratively refining stimulus parameters based on user feedback. Previous work demonstrated HILO’s efficacy in simulation, but its performance with human participants remains untested. Here we evaluate HILO using sighted participants viewing simulated prosthetic vision to assess its ability to optimize stimulation strategies under realistic conditions. Participants selected between phosphenes generated by competing encoders to iteratively refine a deep stimulus encoder (DSE). We tested HILO in three conditions: standard optimization, threshold misspecifications, and out-of-distribution parameter sampling. Participants consistently preferred HILO-generated stimuli over both a naïve encoder and the DSE alone, with log odds favoring HILO across all conditions. We also observed key differences between human and simulated decision-making, highlighting the importance of validating optimization strategies with human participants. These findings support HILO as a viable approach for adapting visual prostheses to individuals.
zh
[CV-151] Lifting by Gaussians: A Simple Fast and Flexible Method for 3D Instance Segmentation WACV2025
【速读】:该论文旨在解决开放世界下3D高斯辐射场(3D Gaussian Splatted Radiance Fields, 3DGS)的实例分割问题。解决方案的关键在于提出了一种名为Lifting By Gaussians (LBG) 的新方法,该方法直接将2D分割掩模从SAM(或FastSAM等)以及CLIP和DINOv2特征融合到3DGS或其他类似的高斯辐射场中,无需针对每个场景进行训练,从而实现高效且灵活的3D语义分割。
链接: https://arxiv.org/abs/2502.00173
作者: Rohan Chacko,Nicolai Haeni,Eldar Khaliullin,Lin Sun,Douglas Lee
机构: Magic Leap Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV 2025
点击查看摘要
Abstract:We introduce Lifting By Gaussians (LBG), a novel approach for open-world instance segmentation of 3D Gaussian Splatted Radiance Fields (3DGS). Recently, 3DGS Fields have emerged as a highly efficient and explicit alternative to Neural Field-based methods for high-quality Novel View Synthesis. Our 3D instance segmentation method directly lifts 2D segmentation masks from SAM (alternately FastSAM, etc.), together with features from CLIP and DINOv2, directly fusing them onto 3DGS (or similar Gaussian radiance fields such as 2DGS). Unlike previous approaches, LBG requires no per-scene training, allowing it to operate seamlessly on any existing 3DGS reconstruction. Our approach is not only an order of magnitude faster and simpler than existing approaches; it is also highly modular, enabling 3D semantic segmentation of existing 3DGS fields without requiring a specific parametrization of the 3D Gaussians. Furthermore, our technique achieves superior semantic segmentation for 2D semantic novel view synthesis and 3D asset extraction results while maintaining flexibility and efficiency. We further introduce a novel approach to evaluate individually segmented 3D assets from 3D radiance field segmentation methods.
zh
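"将2D分割掩模提升到3D高斯"最朴素的做法,是把每个高斯投影中心所落掩模像素的标签赋给该高斯。真实流水线通常还会按每像素贡献加权,这里仅给出概念性草图(坐标与掩模均为构造的示例数据):

```python
import numpy as np

def lift_mask_to_gaussians(centers_2d, mask):
    """Assign each splatted Gaussian the label of the 2D segmentation mask
    pixel its projected center lands on; -1 marks off-image Gaussians."""
    H, W = mask.shape
    xy = np.round(centers_2d).astype(int)
    valid = (xy[:, 0] >= 0) & (xy[:, 0] < W) & (xy[:, 1] >= 0) & (xy[:, 1] < H)
    labels = np.full(len(centers_2d), -1)
    labels[valid] = mask[xy[valid, 1], xy[valid, 0]]   # mask indexed as [row=y, col=x]
    return labels

mask = np.zeros((4, 4), dtype=int)
mask[2:, 2:] = 1                                       # instance "1" in the corner
centers = np.array([[3.0, 3.0], [0.0, 0.0], [10.0, 10.0]])
print(lift_mask_to_gaussians(centers, mask))           # [ 1  0 -1]
```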
[CV-152] ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition ICLR2025
【速读】:该论文旨在解决动作识别模型中的背景偏见(Background Bias)和前景偏见(Foreground Bias),这些问题可能导致不公平的决策结果。论文的关键解决方案是提出了一种名为ALBAR的新型对抗训练方法,该方法无需专门的偏差属性知识即可缓解这两种偏差。ALBAR通过应用对抗交叉熵损失和熵最大化损失来使静态片段的类别概率均匀分布,并引入梯度惩罚损失以正则化去偏差过程。这种方法在HMDB51数据集上显著提升了综合去偏差性能,超过了现有方法超过12%。此外,论文还发现了UCF101协议中存在的背景泄漏问题,并提出了更精细的演员分割边界以改进偏差评估。
链接: https://arxiv.org/abs/2502.00156
作者: Joseph Fioresi,Ishan Rajendrakumar Dave,Mubarak Shah
机构: Center for Research in Computer Vision (计算机视觉研究中心), University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted to ICLR 2025
点击查看摘要
Abstract:Bias in machine learning models can lead to unfair decision making, and while it has been well-studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance), which can be detrimental to real-life applications such as autonomous vehicles or assisted living monitoring. While prior approaches have mainly focused on mitigating background bias using specialized augmentations, we thoroughly study both biases. We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. Our framework applies an adversarial cross-entropy loss to the sampled static clip (where all the frames are the same) and aims to make its class probabilities uniform using a proposed entropy maximization loss. Additionally, we introduce a gradient penalty loss for regularization against the debiasing process. We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% on HMDB51. Furthermore, we identify an issue of background leakage in the existing UCF101 protocol for bias evaluation which provides a shortcut to predict actions and does not provide an accurate measure of the debiasing capability of a model. We address this issue by proposing more fine-grained segmentation boundaries for the actor, where our method also outperforms existing approaches. Project Page: this https URL
zh
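ALBAR 对静态片段施加的熵最大化损失,可写成"最小化 softmax 分布的负熵":分布越接近均匀,损失越小。下面是一个自包含的 NumPy 草图(与论文代码无关,仅示意该损失项的形状):

```python
import numpy as np

def entropy_max_loss(logits):
    """Negative entropy of softmax(logits). Minimizing it pushes the class
    distribution toward uniform, discouraging predictions from static cues."""
    z = logits - logits.max()          # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return float((p * np.log(p + 1e-12)).sum())

uniform_loss = entropy_max_loss(np.zeros(4))               # uniform -> lowest loss
peaked_loss = entropy_max_loss(np.array([10.0, 0.0, 0.0, 0.0]))
assert uniform_loss < peaked_loss
```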
[CV-153] Exploring Transfer Learning for Deep Learning Polyp Detection in Colonoscopy Images Using YOLOv8
【速读】:该论文旨在解决深度学习模型在有限训练数据下学习特定领域应用的挑战。解决方案的关键在于通过迁移学习技术,利用相关数据集上的预训练知识,加快并优化新任务(息肉检测)的学习过程。研究发现,在与息肉特征相关的数据集上预训练的模型显著优于从零开始训练的模型,强调了在具有共享领域特定特征的数据集上进行预训练的重要性。
链接: https://arxiv.org/abs/2502.00133
作者: Fabian Vazquez,Jose Angel Nuñez,Xiaoyan Fu,Pengfei Gu,Bin Fu
机构: University of Texas Rio Grande Valley(德克萨斯大学里奥格兰德河谷分校); The Second Affiliated Hospital of Fujian University of Traditional Chinese Medicine(福建中医药大学第二附属医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 6 tables, SPIE conference
点击查看摘要
Abstract:Deep learning methods have demonstrated strong performance in object detection tasks; however, their ability to learn domain-specific applications with limited training data remains a significant challenge. Transfer learning techniques address this issue by leveraging knowledge from pre-training on related datasets, enabling faster and more efficient learning for new tasks. Finding the right dataset for pre-training can play a critical role in determining the success of transfer learning and overall model performance. In this paper, we investigate the impact of pre-training a YOLOv8n model on seven distinct datasets, evaluating their effectiveness when transferred to the task of polyp detection. We compare whether large, general-purpose datasets with diverse objects outperform niche datasets with characteristics similar to polyps. In addition, we assess the influence of the size of the dataset on the efficacy of transfer learning. Experiments on the polyp datasets show that models pre-trained on relevant datasets consistently outperform those trained from scratch, highlighting the benefit of pre-training on datasets with shared domain-specific features.
zh
[CV-154] ProtoSnap: Prototype Alignment for Cuneiform Signs ICLR2025
【速读】:该论文旨在解决通过自动化技术精确解析楔形文字内部复杂结构的问题。此前的方法大多将楔形文字类型视为类别标签,未能显式建模其高度变化的内部配置。论文的关键在于提出了一种无监督方法ProtoSnap,利用强大的生成模型和原型字体图像的外观与结构作为先验知识,通过深度图像特征来估计楔形文字的各种配置,并将基于骨架的模板拟合到拍摄的楔形文字图像上。这种方法能够实现结构一致性,显著提升了楔形文字识别的性能,特别是在罕见字符的识别方面。
链接: https://arxiv.org/abs/2502.00129
作者: Rachel Mikulinsky,Morris Alper,Shai Gordin,Enrique Jiménez,Yoram Cohen,Hadar Averbuch-Elor
机构: Tel Aviv University(特拉维夫大学); Ariel University(阿里尔大学); LMU(路德维希-马克西米利安大学); Cornell University(康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICLR 2025. Project page: this https URL
点击查看摘要
Abstract:The cuneiform writing system served as the medium for transmitting knowledge in the ancient Near East for a period of over three thousand years. Cuneiform signs have a complex internal structure which is the subject of expert paleographic analysis, as variations in sign shapes bear witness to historical developments and transmission of writing and culture over time. However, prior automated techniques mostly treat sign types as categorical and do not explicitly model their highly varied internal configurations. In this work, we present an unsupervised approach for recovering the fine-grained internal configuration of cuneiform signs by leveraging powerful generative models and the appearance and structure of prototype font images as priors. Our approach, ProtoSnap, enforces structural consistency on matches found with deep image features to estimate the diverse configurations of cuneiform characters, snapping a skeleton-based template to photographed cuneiform signs. We provide a new benchmark of expert annotations and evaluate our method on this task. Our evaluation shows that our approach succeeds in aligning prototype skeletons to a wide variety of cuneiform signs. Moreover, we show that conditioning on structures produced by our method allows for generating synthetic data with correct structural configurations, significantly boosting the performance of cuneiform sign recognition beyond existing techniques, in particular over rare signs. Our code, data, and trained models are available at the project page: this https URL
zh
[CV-155] A Direct Semi-Exhaustive Search Method for Robust Partial-to-Full Point Cloud Registration IROS2024
【速读】:该论文旨在解决点云配准问题,即寻找将两个给定点云对齐的刚体变换。论文的关键在于提出了一种直接优化点云配准问题的方法,无需对应关系。具体而言,作者提出了直接半穷举搜索(Direct Semi-Exhaustive Search, DSES)算法,通过迭代潜在的旋转矩阵,并高效计算与每个旋转相关的最大内点集平移。此方法利用现代GPU的并行性,从而在ModelNet40基准测试和实际机器人位姿估计任务中表现出色。
链接: https://arxiv.org/abs/2502.00115
作者: Richard Cheng,Chavdar Papozov,Dan Helmick,Mark Tjersland
机构: Toyota Research Institute (丰田研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: IROS 2024
点击查看摘要
Abstract:Point cloud registration refers to the problem of finding the rigid transformation that aligns two given point clouds, and is crucial for many applications in robotics and computer vision. The main insight of this paper is that we can directly optimize the point cloud registration problem without correspondences by utilizing an algorithmically simple, yet computationally complex, semi-exhaustive search approach that is very well-suited for parallelization on modern GPUs. Our proposed algorithm, Direct Semi-Exhaustive Search (DSES), iterates over potential rotation matrices and efficiently computes the inlier-maximizing translation associated with each rotation. It then computes the optimal rigid transformation based on any desired distance metric by directly computing the error associated with each transformation candidate \{R, t\}. By leveraging the parallelism of modern GPUs, DSES outperforms state-of-the-art methods for partial-to-full point cloud registration on the simulated ModelNet40 benchmark and demonstrates high performance and robustness for pose estimation on a real-world robotics problem (this https URL).
zh
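DSES"对每个候选旋转计算内点最大化的平移"的思路,可用一个朴素的 2D 暴力版本示意(真实方法面向 GPU 并行、效率远高于此;点数、角度网格与内点阈值均为演示用假设):

```python
import numpy as np

def dses_2d(src, dst, angles, eps=1e-3):
    """Brute-force 2D sketch: for each candidate rotation, every src->dst
    pairing proposes a translation; keep the (R, t) with the most inliers."""
    best_inliers, best_R, best_t = -1, None, None
    for a in angles:
        c, s = np.cos(a), np.sin(a)
        R = np.array([[c, -s], [s, c]])
        rotated = src @ R.T
        for t in (dst[:, None, :] - rotated[None, :, :]).reshape(-1, 2):
            d = np.linalg.norm((rotated + t)[:, None] - dst[None], axis=-1).min(axis=1)
            inliers = int((d < eps).sum())
            if inliers > best_inliers:
                best_inliers, best_R, best_t = inliers, R, t
    return best_inliers, best_R, best_t

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 2))
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
dst = src @ R_true.T + np.array([0.5, -0.2])        # known rigid transform
inliers, R_hat, t_hat = dses_2d(src, dst, np.radians([0.0, 15.0, 30.0, 45.0]))
assert inliers == 5                                  # all points recovered as inliers
```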
[CV-156] Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach
【速读】:该论文旨在解决手绘地图在机器人导航中因比例失真和地标缺失等不准确性所引发的挑战。解决方案的关键在于引入了一种新颖的手绘地图导航(HAM-Nav)架构,该架构利用预训练的视觉语言模型(VLMs)进行跨多样化环境、手绘风格及机器人形态的导航。HAM-Nav集成了选择性视觉关联提示(Selective Visual Association Prompting)方法以基于拓扑地图的位置估计和导航规划,并采用预测导航计划解析器(Predictive Navigation Plan Parser)来推断缺失地标,从而有效应对地图中的不准确性。
链接: https://arxiv.org/abs/2502.00114
作者: Aaron Hao Tan,Angus Fung,Haitong Wang,Goldie Nejat
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures
点击查看摘要
Abstract:Hand-drawn maps can be used to convey navigation instructions between humans and robots in a natural and efficient manner. However, these maps can often contain inaccuracies such as scale distortions and missing landmarks which present challenges for mobile robot navigation. This paper introduces a novel Hand-drawn Map Navigation (HAM-Nav) architecture that leverages pre-trained vision language models (VLMs) for robot navigation across diverse environments, hand-drawing styles, and robot embodiments, even in the presence of map inaccuracies. HAM-Nav integrates a unique Selective Visual Association Prompting approach for topological map-based position estimation and navigation planning as well as a Predictive Navigation Plan Parser to infer missing landmarks. Extensive experiments were conducted in photorealistic simulated environments, using both wheeled and legged robots, demonstrating the effectiveness of HAM-Nav in terms of navigation success rates and Success weighted by Path Length. Furthermore, a user study in real-world environments highlighted the practical utility of hand-drawn maps for robot navigation as well as successful navigation outcomes.
zh
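摘要中的 Success weighted by Path Length (SPL) 是具身导航的标准指标:SPL = (1/N)·Σ Sᵢ·lᵢ/max(pᵢ, lᵢ),其中 lᵢ 为最短路径长度、pᵢ 为实际路径长度、Sᵢ 为是否成功。其实现非常直接:

```python
def spl(successes, shortest_lengths, taken_lengths):
    """Success weighted by Path Length: mean over episodes of
    success_i * l_i / max(p_i, l_i)."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_lengths, taken_lengths)]
    return sum(terms) / len(terms)

# One successful episode on the optimal path, one failure:
print(spl([1, 0], [10.0, 10.0], [10.0, 5.0]))  # 0.5
```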
[CV-157] CerraData-4MM: A multimodal benchmark dataset on Cerrado for land use and land cover classification
【速读】:该论文旨在解决塞拉多地区(Cerrado)面临的土地利用和土地覆盖(LULC)映射挑战,特别是在类别不平衡和视觉上相似的类别方面的难题。解决方案的关键在于提出了CerraData-4MM数据集,该数据集结合了Sentinel-1合成孔径雷达(SAR)和Sentinel-2多光谱成像(MSI),具有10米的空间分辨率,并包含两个层次分类,分别有7类和14类。通过评估标准U-Net和更复杂的Vision Transformer (ViT)模型,论文展示了ViT在多模态场景中的优越性能,最高宏F1得分为57.60%,平均交并比(mIoU)为49.05%。
链接: https://arxiv.org/abs/2502.00083
作者: Mateus de Souza Miranda,Ronny Hänsch,Valdivino Alexandre de Santiago Júnior,Thales Sehn Körting,Erison Carlos dos Santos Monteiro
机构: Instituto Nacional de Pesquisas Espaciais (INPE)(国家空间研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 9 pages, 13 Figures, 3 tables
点击查看摘要
Abstract:The Cerrado faces increasing environmental pressures, necessitating accurate land use and land cover (LULC) mapping despite challenges such as class imbalance and visually similar categories. To address this, we present CerraData-4MM, a multimodal dataset combining Sentinel-1 Synthetic Aperture Radar (SAR) and Sentinel-2 MultiSpectral Imagery (MSI) with 10m spatial resolution. The dataset includes two hierarchical classification levels with 7 and 14 classes, respectively, focusing on the diverse Bico do Papagaio ecoregion. We highlight CerraData-4MM’s capacity to benchmark advanced semantic segmentation techniques by evaluating a standard U-Net and a more sophisticated Vision Transformer (ViT) model. The ViT achieves superior performance in multimodal scenarios, with the highest macro F1-score of 57.60% and a mean Intersection over Union (mIoU) of 49.05% at the first hierarchical level. Both models struggle with minority classes, particularly at the second hierarchical level, where U-Net’s performance drops to an F1-score of 18.16%. Class balancing improves representation for underrepresented classes but reduces overall accuracy, underscoring the trade-off in weighted training. CerraData-4MM offers a challenging benchmark for advancing deep learning models to handle class imbalance and multimodal data fusion. Code, trained models, and data are publicly available at this https URL.
zh
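摘要中报告的宏 F1(macro F1)对每一类分别计算 F1 后取等权平均,因此在类别不均衡时比总体准确率更能反映少数类表现。基于混淆矩阵的实现草图如下:

```python
import numpy as np

def macro_f1(conf):
    """Macro F1 from a confusion matrix (rows = ground truth, cols = prediction):
    compute per-class precision/recall/F1, then average with equal class weight."""
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)
    recall = tp / np.maximum(conf.sum(axis=1), 1)
    denom = np.where(precision + recall > 0, precision + recall, 1.0)
    f1 = 2 * precision * recall / denom
    return float(f1.mean())

print(macro_f1(np.diag([3, 4])))  # 1.0 for a perfect classifier
```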
[CV-158] Influence of color correction on pathology detection in Capsule Endoscopy
【速读】:该论文旨在评估色彩校正对无线胶囊内镜 (Wireless Capsule Endoscopy, WCE) 病理检测的影响。研究使用两个显著的目标检测模型(Retinanet 和 YOLOv5)在原始数据集及其两种不同色彩校正版本上进行实验。关键在于通过比较这些模型在原始数据与色彩校正数据上的表现,揭示色彩校正如何改变边界框大小及交并比,并导致某些病理类型的误报增加,但这些变化并未一致地改善性能指标如 F1 分数、交并比 (IoU) 和平均精度 (AP50)。
链接: https://arxiv.org/abs/2502.00076
作者: Bidossessi Emmanuel Agossou,Marius Pedersen,Kiran Raja,Anuja Vats,Pål Anders Floor
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Pathology detection in Wireless Capsule Endoscopy (WCE) using deep learning has been explored in the recent past. However, deep learning models can be influenced by the color quality of the dataset used to train them, impacting detection, segmentation and classification tasks. In this work, we evaluate the impact of color correction on pathology detection using two prominent object detection models: Retinanet and YOLOv5. We first generate two color corrected versions of a popular WCE dataset (i.e., SEE-AI dataset) using two different color correction functions. We then evaluate the performance of the Retinanet and YOLOv5 on the original and color corrected versions of the dataset. The results reveal that color correction makes the models generate larger bounding boxes and larger intersection areas with the ground truth annotations. Furthermore, color correction leads to an increased number of false positives for certain pathologies. However, these effects do not translate into a consistent improvement in performance metrics such as F1-scores, IoU, and AP50. The code is available at this https URL. Keywords: Wireless Capsule Endoscopy, Color correction, Retinanet, YOLOv5, Detection
zh
[CV-159] SpikingRTNH: Spiking Neural Network for 4D Radar Object Detection
【速读】:该论文旨在解决在自动驾驶系统中使用4D雷达进行3D物体检测时,处理高密度点云数据所导致的高能耗问题。解决方案的关键在于提出了一种名为SpikingRTNH的新型脉冲神经网络(SNN),通过采用泄漏积分与发射(LIF)脉冲神经元替代传统的ReLU激活函数,显著提高了能效。此外,引入了受人类认知过程启发的生物自上而下推理(BTI)机制,该机制从高密度到低密度顺序处理点云数据,从而有效利用噪声较低且更为重要的点进行检测。这些创新使得SpikingRTNH不仅实现了显著的能耗降低(相比传统人工神经网络ANN降低了78%),同时保持了可比的检测性能(AP 3D为51.1%,AP BEV为57.0%)。
链接: https://arxiv.org/abs/2502.00074
作者: Dong-Hee Paek,Seung-Hyun Kong
机构: CCS Graduate School of Mobility, KAIST (KAIST移动系); Daejeon 34051, Republic of Korea (韩国大田市)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: arxiv preprint
点击查看摘要
Abstract:Recently, 4D Radar has emerged as a crucial sensor for 3D object detection in autonomous vehicles, offering both stable perception in adverse weather and high-density point clouds for object shape recognition. However, processing such high-density data demands substantial computational resources and energy consumption. We propose SpikingRTNH, the first spiking neural network (SNN) for 3D object detection using 4D Radar data. By replacing conventional ReLU activation functions with leaky integrate-and-fire (LIF) spiking neurons, SpikingRTNH achieves significant energy efficiency gains. Furthermore, inspired by human cognitive processes, we introduce biological top-down inference (BTI), which processes point clouds sequentially from higher to lower densities. This approach effectively utilizes points with lower noise and higher importance for detection. Experiments on K-Radar dataset demonstrate that SpikingRTNH with BTI significantly reduces energy consumption by 78% while achieving comparable detection performance to its ANN counterpart (51.1% AP 3D, 57.0% AP BEV). These results establish the viability of SNNs for energy-efficient 4D Radar-based object detection in autonomous driving systems. All codes are available at this https URL.
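下面给出一个用 NumPy 实现的泄漏积分与发射(LIF)神经元的最小示意,说明用 LIF 替代 ReLU 后的脉冲机制(假设性实现,并非 SpikingRTNH 的官方代码;阈值与泄漏系数均为示例取值):

```python
import numpy as np

def lif_forward(inputs, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire neuron over discrete timesteps:
    leaky accumulation of membrane potential, a binary spike when
    the threshold is crossed, then a hard reset."""
    v = np.zeros_like(inputs[0], dtype=float)
    spikes = []
    for x in inputs:              # one input array per timestep
        v = leak * v + x          # leaky integration (replaces ReLU activation)
        s = (v >= threshold).astype(float)
        spikes.append(s)
        v = v * (1.0 - s)         # reset membrane where a spike fired
    return np.stack(spikes)

# A constant sub-threshold input of 0.6 accumulates and fires every other step
out = lif_forward([np.array([0.6])] * 4)
```

能耗优势来自脉冲输出的稀疏二值性:只有发放脉冲的神经元才会触发下游计算。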
zh
[CV-160] A two-stage dual-task learning strategy for early prediction of pathological complete response to neoadjuvant chemotherapy for breast cancer using dynamic contrast-enhanced magnetic resonance images
【速读】:该论文旨在解决早期预测乳腺癌患者病理完全缓解(Pathological Complete Response, pCR)的问题。为提高新辅助化疗早期阶段的预测准确性,研究提出了一种两阶段双任务学习策略。解决方案的关键在于利用动态对比增强磁共振成像(Dynamic Contrast-Enhanced Magnetic Resonance Imaging, DCE-MRI)在治疗前(T0)、治疗3周后(T1)以及治疗12周后(T2)的图像数据,通过训练卷积长短期记忆网络(Convolutional Long Short-Term Memory Network, ConvLSTM)提取T2阶段的潜在空间图像特征,并进一步采用双任务网络同时预测pCR及T2阶段的图像特征,从而实现基于T0和T1阶段图像的早期预测,而无需使用T2阶段的图像数据。
链接: https://arxiv.org/abs/2502.00051
作者: Bowen Jing(1),Jing Wang(1) ((1) Department of Radiation Oncology, University of Texas Southwestern Medical Center)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
点击查看摘要
Abstract:Rationale and Objectives: Early prediction of pathological complete response (pCR) can facilitate personalized treatment for breast cancer patients. To improve prediction accuracy at the early time point of neoadjuvant chemotherapy, we proposed a two-stage dual-task learning strategy to train a deep neural network for early prediction of pCR using early-treatment magnetic resonance images. Methods: We developed and validated the two-stage dual-task learning strategy using the dataset from the national-wide, multi-institutional I-SPY2 clinical trial, which included dynamic contrast-enhanced magnetic resonance images acquired at three time points: pretreatment (T0), after 3 weeks (T1), and after 12 weeks of treatment (T2). First, we trained a convolutional long short-term memory network to predict pCR and extract the latent space image features at T2. At the second stage, we trained a dual-task network to simultaneously predict pCR and the image features at T2 using images from T0 and T1. This allowed us to predict pCR earlier without using images from T2. Results: The conventional single-stage single-task strategy gave an area under the receiver operating characteristic curve (AUROC) of 0.799 for pCR prediction using all the data at time points T0 and T1. By using the proposed two-stage dual-task learning strategy, the AUROC was improved to 0.820. Conclusions: The proposed two-stage dual-task learning strategy can improve model performance significantly (p=0.0025) for predicting pCR at the early stage (3rd week) of neoadjuvant chemotherapy. The early prediction model can potentially help physicians to intervene early and develop personalized plans at the early stage of chemotherapy.
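第二阶段的"双任务"目标可以理解为:在预测 pCR 的交叉熵损失之外,再加上一个与第一阶段 T2 潜在特征对齐的回归项。以下为该思路的示意性损失函数(假设性实现,权重 alpha 与特征维度均为示例):

```python
import numpy as np

def dual_task_loss(pcr_logit, pcr_label, feat_pred, feat_target, alpha=0.5):
    """Binary cross-entropy for pCR prediction plus an L2 term that
    matches the stage-1 latent features extracted at T2."""
    p = 1.0 / (1.0 + np.exp(-pcr_logit))            # sigmoid
    bce = -(pcr_label * np.log(p + 1e-12)
            + (1 - pcr_label) * np.log(1 - p + 1e-12))
    l2 = np.mean((feat_pred - feat_target) ** 2)    # feature-matching auxiliary task
    return float(bce + alpha * l2)

# With a neutral logit and perfectly matched features, only the BCE term remains
loss = dual_task_loss(0.0, 1, np.zeros(8), np.zeros(8))
```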
zh
[CV-161] mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
【速读】:该论文旨在解决多语言音频-视觉语音识别(AVSR)数据集规模有限及模型训练难度大的问题。关键在于提出了一种名为mWhisper-Flamingo的模型,它结合了预训练的音频模型(Whisper)和视频模型(AV-HuBERT)。为了实现更好的多模态整合并提升噪声环境下的多语言性能,引入了解码器模态dropout技术,使得模型能够在配对的音频-视觉输入以及单独的音频或视觉输入上进行训练。
链接: https://arxiv.org/abs/2502.01547
作者: Andrew Rouditchenko,Saurabhchand Bhati,Samuel Thomas,Hilde Kuehne,Rogerio Feris,James Glass
机构: MIT(麻省理工学院), USA; MIT-IBM Watson AI Lab(麻省理工学院-IBM Watson人工智能实验室), USA; University of Tuebingen(图宾根大学), DE
类目: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:
点击查看摘要
Abstract:Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
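文中的"解码器模态 dropout"可以概括为:训练时随机只保留音频、只保留视频、或保留成对输入。以下为该采样逻辑的示意(假设性实现,各分支概率为示例取值):

```python
import random

def modality_dropout(audio, video, p_audio=0.25, p_video=0.25, rng=random):
    """Randomly drop one modality during training so the decoder learns
    from paired audio-visual inputs as well as single-modality inputs."""
    r = rng.random()
    if r < p_audio:
        return audio, None            # audio-only branch
    if r < p_audio + p_video:
        return None, video            # visual-only branch
    return audio, video               # paired audio-visual branch

rng = random.Random(0)
draws = [modality_dropout("a", "v", rng=rng) for _ in range(200)]
```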
zh
[CV-162] Assessing the use of Diffusion models for motion artifact correction in brain MRI
【速读】:该论文旨在解决磁共振成像(MRI)中因患者运动导致的运动伪影问题。这些伪影会降低图像的诊断价值。论文的关键解决方案是评估扩散模型在修正2D脑部MRI扫描中的运动伪影方面的应用。通过与基于U-Net的监督学习方法进行对比,研究发现扩散模型能够产生准确预测或有害幻觉,这取决于数据异质性和输入的采集平面。
链接: https://arxiv.org/abs/2502.01418
作者: Paolo Angella,Vito Paolo Pastore,Matteo Santacesaria
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注: Accepted at IEEE International Symposium for Biomedical Imaging (ISBI) 2025
点击查看摘要
Abstract:Magnetic Resonance Imaging generally requires long exposure times, while being sensitive to patient motion, resulting in artifacts in the acquired images, which may hinder their diagnostic relevance. Despite research efforts to decrease the acquisition time, and designing efficient acquisition sequences, motion artifacts are still a persistent problem, pushing toward the need for the development of automatic motion artifact correction techniques. Recently, diffusion models have been proposed as a solution for the task at hand. While diffusion models can produce high-quality reconstructions, they are also susceptible to hallucination, which poses risks in diagnostic applications. In this study, we critically evaluate the use of diffusion models for correcting motion artifacts in 2D brain MRI scans. Using a popular benchmark dataset, we compare a diffusion model-based approach with state-of-the-art methods consisting of Unets trained in a supervised fashion on motion-affected images to reconstruct ground truth motion-free images. Our findings reveal mixed results: diffusion models can produce accurate predictions or generate harmful hallucinations in this context, depending on data heterogeneity and the acquisition planes considered as input.
zh
[CV-163] Diffusion at Absolute Zero: Langevin Sampling Using Successive Moreau Envelopes
【速读】:本文旨在解决从形式为 \pi(x)\propto\exp(-U(x)) 的吉布斯分布(Gibbs distribution)中采样的问题,其中 U(x) 为势函数(potential)。论文提出的关键解决方案是考虑目标密度的一列近似 (\pi^t_k)_k:当 k 较小时,\pi^t_k 近似于 \pi;当 k 较大时,\pi^t_k 具有更好的采样性质。该序列通过用势函数 U 的莫罗包络(Moreau envelopes)替换其部分项获得。采样采用类似退火朗之万动力学(annealed Langevin dynamics)的程序,即按 k 递减的顺序依次从 \pi^t_k 中采样,从而有效地将样本从一个简单的初始密度引导至复杂的目标密度。理论分析和实验结果均表明,该方法提高了收敛速度,并适用于多峰密度 \pi。
链接: https://arxiv.org/abs/2502.01358
作者: Andreas Habring,Alexander Falk,Thomas Pock
机构: Graz University of Technology(格拉茨技术大学)
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:
点击查看摘要
Abstract:In this article we propose a novel method for sampling from Gibbs distributions of the form \pi(x)\propto\exp(-U(x)) with a potential U(x) . In particular, inspired by diffusion models we propose to consider a sequence (\pi^t_k)_k of approximations of the target density, for which \pi^t_k\approx \pi for k small and, on the other hand, \pi^t_k exhibits favorable properties for sampling for k large. This sequence is obtained by replacing parts of the potential U by its Moreau envelopes. Sampling is performed in an Annealed Langevin type procedure, that is, sequentially sampling from \pi^t_k for decreasing k , effectively guiding the samples from a simple starting density to the more complex target. In addition to a theoretical analysis we show experimental results supporting the efficacy of the method in terms of increased convergence speed and applicability to multi-modal densities \pi .
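以一维势函数 U(x)=|x| 为例:其 Moreau 包络的梯度可写为 (x - prox_{tU}(x))/t(prox 为软阈值算子),随后在逐渐减小的平滑参数上做朗之万采样。下面是该退火式采样思路的简化示意(假设性实现,步长与退火计划均为示例取值,并非论文算法本身):

```python
import numpy as np

def moreau_grad_abs(x, t):
    """Gradient of the Moreau envelope of U(x) = |x| with parameter t:
    (x - prox_{tU}(x)) / t, where prox is soft-thresholding."""
    prox = np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
    return (x - prox) / t

def annealed_langevin(n_steps=500, step=1e-2, ts=(4.0, 2.0, 1.0, 0.5), seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=5000) * 3.0            # simple, wide starting density
    for t in ts:                               # anneal the smoothing downward
        for _ in range(n_steps):
            x = (x - step * moreau_grad_abs(x, t)
                 + np.sqrt(2 * step) * rng.normal(size=x.shape))
    return x

samples = annealed_langevin()   # approximate samples from pi(x) ~ exp(-|x|), smoothed
```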
zh
[CV-164] Deep generative computed perfusion-deficit mapping of ischaemic stroke
【速读】:该论文旨在解决利用急性缺血性卒中患者的计算机断层血管造影(CTA)灌注图来预测神经功能缺损的问题。关键在于采用深度生成推理方法,无需已知病变区域即可定位神经功能缺损的解剖基础,并揭示新的神经依赖关系。研究表明,这种基于急性CTA灌注图的方法在描述缺血性卒中的功能解剖关系方面具有高精度,且可能在临床和科学研究中发挥重要作用。
链接: https://arxiv.org/abs/2502.01334
作者: Chayanin Tangwiriyasakul,Pedro Borges,Guilherme Pombo,Stefano Moriconi,Michael S. Elmalem,Paul Wright,Yee-Haur Mah,Jane Rondina,Robert Gray,Sebastien Ourselin,Parashkev Nachev,M. Jorge Cardoso
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:
点击查看摘要
Abstract:Focal deficits in ischaemic stroke result from impaired perfusion downstream of a critical vascular occlusion. While parenchymal lesions are traditionally used to predict clinical deficits, the underlying pattern of disrupted perfusion provides information upstream of the lesion, potentially yielding earlier predictive and localizing signals. Such perfusion maps can be derived from routine CT angiography (CTA) widely deployed in clinical practice. Analysing computed perfusion maps from 1,393 CTA-imaged-patients with acute ischaemic stroke, we use deep generative inference to localise neural substrates of NIHSS sub-scores. We show that our approach replicates known lesion-deficit relations without knowledge of the lesion itself and reveals novel neural dependents. The high achieved anatomical fidelity suggests acute CTA-derived computed perfusion maps may be of substantial clinical-and-scientific value in rich phenotyping of acute stroke. Using only hyperacute imaging, deep generative inference could power highly expressive models of functional anatomical relations in ischaemic stroke within the pre-interventional window.
zh
[CV-165] Compressed Image Generation with Denoising Diffusion Codebook Models
【速读】:本文旨在解决高质量图像生成与高效压缩之间的平衡问题。关键在于提出了一种基于去噪扩散模型(Denoising Diffusion Models, DDMs)的新方法——去噪扩散码本模型(Denoising Diffusion Codebook Model, DDCM)。该方法通过从预定义的固定独立同分布高斯向量(iid Gaussian vectors)码本中选择噪声样本,替代标准DDM中的高斯噪声采样,在保持样本质量和多样性的同时实现了无损压缩比特流表示,从而在保证生成图像质量的前提下,显著提升了图像压缩效果。
链接: https://arxiv.org/abs/2502.01189
作者: Guy Ohayon,Hila Manor,Tomer Michaeli,Michael Elad
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Signal Processing (eess.SP)
备注: Code and demo are available at this https URL
点击查看摘要
Abstract:We present a novel generative approach based on Denoising Diffusion Models (DDMs), which produces high-quality image samples along with their losslessly compressed bit-stream representations. This is obtained by replacing the standard Gaussian noise sampling in the reverse diffusion with a selection of noise samples from pre-defined codebooks of fixed iid Gaussian vectors. Surprisingly, we find that our method, termed Denoising Diffusion Codebook Model (DDCM), retains sample quality and diversity of standard DDMs, even for extremely small codebooks. We leverage DDCM and pick the noises from the codebooks that best match a given image, converting our generative model into a highly effective lossy image codec achieving state-of-the-art perceptual image compression results. More generally, by setting other noise selections rules, we extend our compression method to any conditional image generation task (e.g., image restoration), where the generated images are produced jointly with their condensed bit-stream representations. Our work is accompanied by a mathematical interpretation of the proposed compressed conditional generation schemes, establishing a connection with score-based approximations of posterior samplers for the tasks considered.
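DDCM 的核心替换步骤可以示意如下:反向扩散的每一步不再采样新的高斯噪声,而是从固定的 iid 高斯码本中选出与期望残差方向最匹配的一条,仅需存储其索引即可构成比特流(假设性实现,码本大小与匹配准则均为示例):

```python
import numpy as np

def ddcm_select(residual, codebook):
    """Pick the codebook noise vector best aligned (by cosine similarity)
    with the desired residual; the chosen index, not the vector itself,
    goes into the compressed bit-stream."""
    norms = np.linalg.norm(codebook, axis=1) * np.linalg.norm(residual) + 1e-12
    scores = (codebook @ residual) / norms
    idx = int(np.argmax(scores))
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))   # fixed iid Gaussian codebook
idx, z = ddcm_select(codebook[7].copy(), codebook)
```

在该玩具设置下,每一步只需 log2(64)=6 比特即可编码所选噪声。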
zh
[CV-166] owards Robust and Generalizable Lensless Imaging with Modular Learned Reconstruction
【速读】:该论文旨在解决现有无透镜成像(lensless imaging)技术因简化建模假设而带来的校准与计算局限,并探究学习式方法对新掩膜类型的泛化能力。论文的关键解决方案在于引入了一种模块化的学习重构方法,其中包含一个图像恢复前的预处理器组件。理论分析证明了预处理器对于标准图像恢复技术(如维纳滤波和迭代算法)的必要性,并通过大量实验验证了其对多种无透镜成像方法及不同掩膜类型数据集(振幅掩膜和相位掩膜)的有效性。此外,论文还进行了首次跨掩膜类型的泛化基准测试,评估在一个系统上训练的重构模型对其他系统的泛化性能。这种模块化重构方法使得在新系统上使用预训练组件和迁移学习成为可能,从而大幅减少繁琐的测量和训练时间。
链接: https://arxiv.org/abs/2502.01102
作者: Eric Bezzam,Yohann Perron,Martin Vetterli
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages
点击查看摘要
Abstract:Lensless cameras disregard the conventional design that imaging should mimic the human eye. This is done by replacing the lens with a thin mask, and moving image formation to the digital post-processing. State-of-the-art lensless imaging techniques use learned approaches that combine physical modeling and neural networks. However, these approaches make simplifying modeling assumptions for ease of calibration and computation. Moreover, the generalizability of learned approaches to lensless measurements of new masks has not been studied. To this end, we utilize a modular learned reconstruction in which a key component is a pre-processor prior to image recovery. We theoretically demonstrate the pre-processor’s necessity for standard image recovery techniques (Wiener filtering and iterative algorithms), and through extensive experiments show its effectiveness for multiple lensless imaging approaches and across datasets of different mask types (amplitude and phase). We also perform the first generalization benchmark across mask types to evaluate how well reconstructions trained with one system generalize to others. Our modular reconstruction enables us to use pre-trained components and transfer learning on new systems to cut down weeks of tedious measurements and training. As part of our work, we open-source four datasets, and software for measuring datasets and for training our modular reconstruction.
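文中提到的标准图像恢复技术之一是维纳滤波。下面给出频域维纳反卷积的最小示意,并用一次循环卷积的模拟测量验证其还原能力(假设性实现,PSF 与信噪比参数均为示例):

```python
import numpy as np

def wiener_deconvolve(measurement, psf, snr=100.0):
    """Frequency-domain Wiener filter: a standard recovery step once the
    point spread function (PSF) of the lensless system is known."""
    H = np.fft.fft2(psf, s=measurement.shape)
    G = np.conj(H) / (np.abs(H) ** 2 + 1.0 / snr)
    return np.real(np.fft.ifft2(np.fft.fft2(measurement) * G))

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16))                   # toy scene
h1 = np.array([0.6, 0.3, 0.1])
psf = np.outer(h1, h1)                          # toy separable PSF
H = np.fft.fft2(psf, s=x.shape)
y = np.real(np.fft.ifft2(np.fft.fft2(x) * H))   # simulated circular-convolution measurement
rec = wiener_deconvolve(y, psf, snr=1e6)
```

在真实系统中,置于维纳滤波之前的预处理器(如去噪网络)正是为了缓解噪声与模型失配在这类解卷积中被放大的问题。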
zh
[CV-167] Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images
【速读】:该论文旨在解决前列腺癌早期检测中MRI-TRUS融合活检复杂且耗时的问题,并减少手动标注带来的潜在错误。解决方案的关键在于提出了一种全自动的基于MRI-TRUS融合的分割方法,该方法通过配准-分割框架整合MRI和TRUS模态的空间信息,实现直接在经直肠超声(TRUS)图像中识别前列腺肿瘤,无需手动标注,从而提高了分割精度并降低了对人工操作的依赖。
链接: https://arxiv.org/abs/2502.00712
作者: Shengtian Sang,Hassan Jahanandish,Cynthia Xinran Li,Indrani Bhattachary,Jeong Hoon Lee,Lichun Zhang,Sulaiman Vesal,Pejman Ghanouni,Richard Fan,Geoffrey A. Sonn,Mirabela Rusu
机构: Stanford University (斯坦福大学); University of Miami (迈阿密大学); Dartmouth College (达特茅斯学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Prostate cancer is a major cause of cancer-related deaths in men, where early detection greatly improves survival rates. Although MRI-TRUS fusion biopsy offers superior accuracy by combining MRI’s detailed visualization with TRUS’s real-time guidance, it is a complex and time-intensive procedure that relies heavily on manual annotations, leading to potential errors. To address these challenges, we propose a fully automatic MRI-TRUS fusion-based segmentation method that identifies prostate tumors directly in TRUS images without requiring manual annotations. Unlike traditional multimodal fusion approaches that rely on naive data concatenation, our method integrates a registration-segmentation framework to align and leverage spatial information between MRI and TRUS modalities. This alignment enhances segmentation accuracy and reduces reliance on manual effort. Our approach was validated on a dataset of 1,747 patients from Stanford Hospital, achieving an average Dice coefficient of 0.212, outperforming TRUS-only (0.117) and naive MRI-TRUS fusion (0.132) methods, with significant improvements (p < 0.01). This framework demonstrates the potential for reducing the complexity of prostate cancer diagnosis and provides a flexible architecture applicable to other multimodal medical imaging tasks.
zh
[CV-168] Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective
【速读】:该论文旨在解决医疗图像分割中的公平性问题,特别是由于人口属性(如年龄、性别、种族)和临床因素(如疾病严重程度)导致的数据采集不平衡所引发的偏见。论文的关键解决方案是提出了一种基于最优控制理论的分布感知混合专家模型(Distribution-aware Mixture of Experts, dMoE)。此模型通过适应异构数据分布,在多个网络架构中展现了广泛的适用性,并在两个二维基准数据集和一个三维自建数据集上实现了最先进的性能,从而有效缓解了因数据分布不均带来的偏见。
链接: https://arxiv.org/abs/2502.00619
作者: Yujin Oh,Pengfei Jin,Sangjoon Park,Sekeun Kim,Siyeop Yoon,Kyungsang Kim,Jin Sung Kim,Xiang Li,Quanzheng Li
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 3 figures, 9 tables
点击查看摘要
Abstract:Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE’s role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code will be made available.
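dMoE 的门控思想可以示意为:由人口学/临床属性驱动一个 softmax 门控,对各专家输出加权求和(假设性实现,专家与门控权重均为玩具示例,并非论文的网络结构):

```python
import numpy as np

def dmoe_forward(x, attr, experts, gate_w):
    """Distribution-aware MoE sketch: a gate conditioned on demographic /
    clinical attributes yields softmax weights over expert outputs."""
    logits = attr @ gate_w                      # (n_experts,)
    w = np.exp(logits - logits.max())
    w = w / w.sum()                             # softmax gating weights
    outs = np.stack([f(x) for f in experts])    # (n_experts, *x.shape)
    return np.tensordot(w, outs, axes=1)        # attribute-weighted mixture

experts = [lambda x: np.zeros_like(x), lambda x: x]
attr = np.array([1.0])                          # toy attribute vector
gate_w = np.array([[0.0, 10.0]])                # strongly prefers expert 1
out = dmoe_forward(np.ones(3), attr, experts, gate_w)
```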
zh
[CV-169] Segment Anything for Histopathology
【速读】:该论文旨在解决数字病理学中自动分割方法难以泛化到分布不同的新数据的问题。解决方案的关键在于基于多样化数据集训练Segment Anything Model (SAM),得到一种面向核分割的视觉基础模型(VFM),命名为PathoSAM。大量实验表明,PathoSAM在组织病理学图像的自动与交互式核实例分割上成为新的最先进模型,并可适配其他分割任务(包括语义核分割);在语义核分割任务上,其结果优于常用方法,但尚未超越当前最先进的CellViT。
链接: https://arxiv.org/abs/2502.00408
作者: Titus Griebel,Anwai Archit,Constantin Pape
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Nucleus segmentation is an important analysis task in digital pathology. However, methods for automatic segmentation often struggle with new data from a different distribution, requiring users to manually annotate nuclei and retrain data-specific models. Vision foundation models (VFMs), such as the Segment Anything Model (SAM), offer a more robust alternative for automatic and interactive segmentation. Despite their success in natural images, a foundation model for nucleus segmentation in histopathology is still missing. Initial efforts to adapt SAM have shown some success, but did not yet introduce a comprehensive model for diverse segmentation tasks. To close this gap, we introduce PathoSAM, a VFM for nucleus segmentation, based on training SAM on a diverse dataset. Our extensive experiments show that it is the new state-of-the-art model for automatic and interactive nucleus instance segmentation in histopathology. We also demonstrate how it can be adapted for other segmentation tasks, including semantic nucleus segmentation. For this task, we show that it yields results better than popular methods, while not yet beating the state-of-the-art, CellViT. Our models are open-source and compatible with popular tools for data annotation. We also provide scripts for whole-slide image segmentation. Our code and models are publicly available at this https URL.
zh
[CV-170] Prostate-Specific Foundation Models for Enhanced Detection of Clinically Significant
【速读】:该论文旨在解决前列腺癌诊断准确性低及潜在延迟的问题。现有方法依赖于MRI影像,但放射科医生的特异性和观察者间变异性较低,导致不必要的活检以及可能遗漏临床显著性癌症的风险。解决方案的关键在于提出了一种名为ProViCNet的前列腺器官特定视觉基础模型,该模型通过多机构的4,401名患者数据进行训练和验证,并采用基于活检确认标注的病灶级对比学习方法。ProViCNet在多种内部和外部验证队列中表现出色,其ROC曲线下面积在0.875至0.966之间,显著优于放射科医生的表现(0.907 vs. 0.805, p<0.001),尤其是在mpMRI检测中。此外,将ProViCNet与标准PSA测试结合,可以提高检测临床显著性癌症的特异性,从15%提升到38%,从而大幅减少不必要的活检。
链接: https://arxiv.org/abs/2502.00366
作者: Jeong Hoon Lee,Cynthia Xinran Li,Hassan Jahanandish,Indrani Bhattacharya,Sulaiman Vesal,Lichun Zhang,Shengtian Sang,Moon Hyung Choi,Simon John Christoph Soerensen,Steve Ran Zhou,Elijah Richard Sommer,Richard Fan,Pejman Ghanouni,Yuze Song,Tyler M. Seibert,Geoffrey A. Sonn,Mirabela Rusu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 44pages
点击查看摘要
Abstract:Accurate prostate cancer diagnosis remains challenging. Even when using MRI, radiologists exhibit low specificity and significant inter-observer variability, leading to potential delays or inaccuracies in identifying clinically significant cancers. This leads to numerous unnecessary biopsies and risks of missing clinically significant cancers. Here we present prostate vision contrastive network (ProViCNet), prostate organ-specific vision foundation models for Magnetic Resonance Imaging (MRI) and Trans-Rectal Ultrasound imaging (TRUS) for comprehensive cancer detection. ProViCNet was trained and validated using 4,401 patients across six institutions, as a prostate cancer detection model on radiology images relying on patch-level contrastive learning guided by biopsy confirmed radiologist annotations. ProViCNet demonstrated consistent performance across multiple internal and external validation cohorts with area under the receiver operating curve values ranging from 0.875 to 0.966, significantly outperforming radiologists in the reader study (0.907 versus 0.805, p < 0.001) for mpMRI, while achieving 0.670 to 0.740 for TRUS. We also integrated ProViCNet with standard PSA to develop a virtual screening test, and we showed that we can maintain the high sensitivity for detecting clinically significant cancers while more than doubling specificity from 15% to 38% (p < 0.001), thereby substantially reducing unnecessary biopsies. These findings highlight ProViCNet’s potential for enhancing prostate cancer diagnosis accuracy and reducing unnecessary biopsies, thereby optimizing diagnostic pathways.
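摘要中"虚拟筛查测试"的组合逻辑可以示意为一个简单的双阈值规则:只有当 PSA 与影像模型评分同时升高时才建议活检,从而在维持敏感性的同时提升特异性(假设性示意,阈值为示例取值,并非论文给出的临床界值或实际组合方式):

```python
def virtual_screen(psa, model_score, psa_cut=4.0, model_cut=0.5):
    """Two-test screening rule sketch: flag for biopsy only when both
    the PSA level and the imaging-model score exceed their cutoffs."""
    return psa >= psa_cut and model_score >= model_cut

decisions = [virtual_screen(6.0, 0.8),   # both elevated: recommend biopsy
             virtual_screen(6.0, 0.2),   # PSA only: spare the biopsy
             virtual_screen(2.0, 0.8)]   # score only: spare the biopsy
```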
zh
[CV-171] A Study on the Performance of U-Net Modifications in Retroperitoneal Tumor Segmentation
【速读】:该论文旨在解决腹膜后区域肿瘤自动分割面临的挑战,特别是由于其不规则形状导致的体积估算困难以及手动分割耗时的问题。论文的关键解决方案在于引入并评估了多种架构,包括基于CNN、Vision Transformer (ViT)、Mamba State Space Model (Mamba SSM) 和 Extended Long-Short Term Memory (xLSTM) 的U-Net改进模型,其中提出的ViLU-Net模型通过集成Vi-blocks来提升分割效果。特别地,实验结果突显了xLSTM在U-Net框架中的效率,能够以较低的资源消耗处理长距离依赖关系。
链接: https://arxiv.org/abs/2502.00314
作者: Moein Heidari,Ehsan Khodapanah Aghdam,Alexander Manzella,Daniel Hsu,Rebecca Scalabrino,Wenjin Chen,David J. Foran,Ilker Hacihaliloglu
机构: School of Biomedical Engineering, University of British Columbia (英属哥伦比亚大学); Independent Researcher (独立研究员); Rutgers Robert Wood Johnson Medical School (罗格斯罗伯特伍德约翰逊医学院); Beth Israel Deaconess Medical Center (贝丝以色列女执事医疗中心); Harvard Medical School (哈佛医学院); Weill Cornell Medical School (威尔康奈尔医学学院); Memorial Sloan Kettering Cancer Center (纪念斯隆凯特琳癌症中心); Center for Biomedical Imaging and Informatics, Rutgers Cancer Institute (罗格斯癌症研究所生物医学成像与信息中心); Department of Medicine, University of British Columbia (英属哥伦比亚大学医学系); Department of Radiology, University of British Columbia (英属哥伦比亚大学放射学系)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at the 2025 SPIE Medical Imaging Conference
点击查看摘要
Abstract:The retroperitoneum hosts a variety of tumors, including rare benign and malignant types, which pose diagnostic and treatment challenges due to their infrequency and proximity to vital structures. Estimating tumor volume is difficult due to their irregular shapes, and manual segmentation is time-consuming. Automatic segmentation using U-Net and its variants, incorporating Vision Transformer (ViT) elements, has shown promising results but struggles with high computational demands. To address this, architectures like the Mamba State Space Model (SSM) and Extended Long-Short Term Memory (xLSTM) offer efficient solutions by handling long-range dependencies with lower resource consumption. This study evaluates U-Net enhancements, including CNN, ViT, Mamba, and xLSTM, on a new in-house CT dataset and a public organ segmentation dataset. The proposed ViLU-Net model integrates Vi-blocks for improved segmentation. Results highlight xLSTM’s efficiency in the U-Net framework. The code is publicly accessible on GitHub.
zh
[CV-172] Patch Triplet Similarity Purification for Guided Real-World Low-Dose CT Image Denoising
【速读】:该论文旨在解决低剂量计算机断层扫描(Low-dose Computed Tomography, LDCT)图像去噪问题,以提高临床诊断中的图像质量,同时减少辐射暴露。论文的关键解决方案在于引入无对比剂CT(Non-Contrast CT, NCCT)图像作为清洁指导信息,并设计了一种新的Patch Triplet Similarity Purification (PTSP)策略来选择高度相似的LDCT、正常剂量CT(Normal-Dose CT, NDCT)和NCCT图像块三元组用于网络训练。此外,通过将标准自注意力机制替换为交叉注意力机制,对SwinIR和HAT两种图像去噪变换器进行了修改,以适应NCCT图像指导。这些改进显著提升了实际LDCT图像去噪性能。
链接: https://arxiv.org/abs/2502.00253
作者: Junhao Long,Fengwei Yang,Juncheng Yan,Baoping Zhang,Chao Jin,Jian Yang,Changliang Zou,Jun Xu
机构: School of Statistics and Data Science, Nankai University (南开大学统计与数据科学学院), Tianjin, China; Department of Radiology, The First Affiliated Hospital of Xi’an Jiaotong University (西安交通大学第一附属医院放射科), Xi’an, China; Shanxi Engineering Research Center of Computational Imaging and Medical Intelligence (陕西计算成像与医学智能工程研究中心), Xi’an, China; Xi’an Key Laboratory of Medical Computational Imaging (西安医学计算成像重点实验室), Xi’an, China
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Image denoising of low-dose computed tomography (LDCT) is an important problem for clinical diagnosis with reduced radiation exposure. Previous methods are mostly trained with pairs of synthetic or misaligned LDCT and normal-dose CT (NDCT) images. However, trained with synthetic noise or misaligned LDCT/NDCT image pairs, the denoising networks would suffer from blurry structure or motion artifacts. Since non-contrast CT (NCCT) images share the content characteristics to the corresponding NDCT images in a three-phase scan, they can potentially provide useful information for real-world LDCT image denoising. To exploit this aspect, in this paper, we propose to incorporate clean NCCT images as useful guidance for the learning of real-world LDCT image denoising networks. To alleviate the issue of spatial misalignment in training data, we design a new Patch Triplet Similarity Purification (PTSP) strategy to select highly similar patch (instead of image) triplets of LDCT, NDCT, and NCCT images for network training. Furthermore, we modify two image denoising transformers of SwinIR and HAT to accommodate the NCCT image guidance, by replacing vanilla self-attention with cross-attention. On our collected clinical dataset, the modified transformers trained with the data selected by our PTSP strategy show better performance than 15 comparison methods on real-world LDCT image denoising. Ablation studies validate the effectiveness of our NCCT image guidance and PTSP strategy. We will publicly release our data and code.
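PTSP 策略的核心是以图像块(而非整幅图)为单位,仅保留相互高度相似的三元组。下面用归一化互相关作为相似度度量给出示意(假设性实现,块大小、阈值与相似度准则均为示例,并非论文的原始定义):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two patches."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_patch_triplets(ldct, ndct, ncct, patch=8, stride=8, thresh=0.9):
    """Keep only (LDCT, NDCT, NCCT) patch triplets whose pairwise
    similarity is high, discarding misaligned regions before training."""
    triplets = []
    H, W = ldct.shape
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            p = (slice(i, i + patch), slice(j, j + patch))
            if ncc(ldct[p], ndct[p]) > thresh and ncc(ndct[p], ncct[p]) > thresh:
                triplets.append((ldct[p], ndct[p], ncct[p]))
    return triplets

rng = np.random.default_rng(0)
ndct = rng.normal(size=(16, 16))
ldct = ndct + 0.2 * rng.normal(size=(16, 16))         # noisy but aligned
aligned_ncct = ndct + 0.05 * rng.normal(size=(16, 16))
good = select_patch_triplets(ldct, ndct, aligned_ncct)
bad = select_patch_triplets(ldct, ndct, rng.normal(size=(16, 16)))  # misaligned guide
```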
zh
[CV-173] Improving Quality Control Of MRI Images Using Synthetic Motion Data
【速读】:该论文旨在解决MRI质量控制(QC)过程中由于数据集不平衡和有限以及主观评分所导致的挑战,这些问题阻碍了可靠自动化QC系统的开发。论文的关键解决方案在于通过在合成生成的运动伪影上预训练模型,然后应用迁移学习进行QC分类,从而不仅提高了识别低质量扫描的准确性,还减少了训练时间和资源需求。这种方法利用合成数据提供了更为稳健且资源高效的MRI质量控制自动化方案。
链接: https://arxiv.org/abs/2502.00160
作者: Charles Bricout,Sylvain Bouix,Samira Ebrahimi Kahou,Kang Ik K. Cho,Michael Harms,Ofer Pasternak,Carrie E. Bearden,Patrick D. McGorry,Rene S. Kahn,John Kane,Barnaby Nelson,Scott W. Woods,Martha E. Shenton
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ISBI 2025
点击查看摘要
Abstract:MRI quality control (QC) is challenging due to unbalanced and limited datasets, as well as subjective scoring, which hinder the development of reliable automated QC systems. To address these issues, we introduce an approach that pretrains a model on synthetically generated motion artifacts before applying transfer learning for QC classification. This method not only improves the accuracy in identifying poor-quality scans but also reduces training time and resource requirements compared to training from scratch. By leveraging synthetic data, we provide a more robust and resource-efficient solution for QC automation in MRI, paving the way for broader adoption in diverse research settings.
zh
[CV-174] Multimodal MRI-Ultrasound AI for Prostate Cancer Detection Outperforms Radiologist MRI Interpretation: A Multi-Center Study
【速读】:该论文旨在解决前列腺活检过程中通过磁共振成像(MRI)检测到的临床显著前列腺癌(CsPCa)病灶在转换至经直肠超声(TRUS)图像时容易遗漏的问题。研究的关键在于提出了一种基于多模态人工智能(AI)框架,该框架整合了MRI和TRUS图像序列,以增强CsPCa的识别能力。具体而言,该框架采用了基于3D UNet架构的方法,并在三个机构的1700个测试病例中进行了评估,结果显示其敏感性(80%)和病灶Dice系数(42%)均优于仅使用MRI或TRUS的单模态AI模型。此外,该多模态AI模型在另一组110例患者中的表现也超过了放射科医生,显示出更高的特异性(88%)和病灶Dice系数(38%),同时保持了等效的敏感性(79%)。
链接: https://arxiv.org/abs/2502.00146
作者: Hassan Jahanandish,Shengtian Sang,Cynthia Xinran Li,Sulaiman Vesal,Indrani Bhattacharya,Jeong Hoon Lee,Richard Fan,Geoffrey A. Sonna,Mirabela Rusu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pre-biopsy magnetic resonance imaging (MRI) is increasingly used to target suspicious prostate lesions. This has led to artificial intelligence (AI) applications improving MRI-based detection of clinically significant prostate cancer (CsPCa). However, MRI-detected lesions must still be mapped to transrectal ultrasound (TRUS) images during biopsy, which results in missing CsPCa. This study systematically evaluates a multimodal AI framework integrating MRI and TRUS image sequences to enhance CsPCa identification. The study included 3110 patients from three cohorts across two institutions who underwent prostate biopsy. The proposed framework, based on the 3D UNet architecture, was evaluated on 1700 test cases, comparing performance to unimodal AI models that use either MRI or TRUS alone. Additionally, the proposed model was compared to radiologists in a cohort of 110 patients. The multimodal AI approach achieved superior sensitivity (80%) and Lesion Dice (42%) compared to unimodal MRI (73%, 30%) and TRUS models (49%, 27%). Compared to radiologists, the multimodal model showed higher specificity (88% vs. 78%) and Lesion Dice (38% vs. 33%), with equivalent sensitivity (79%). Our findings demonstrate the potential of multimodal AI to improve CsPCa lesion targeting during biopsy and treatment planning, surpassing current unimodal models and radiologists; ultimately improving outcomes for prostate cancer patients.
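本条及前后多条摘要都以病灶 Dice 系数(Lesion Dice)作为评价指标,下面给出该通用指标定义的最小实现:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice overlap between two binary masks: 2*|A ∩ B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float((2.0 * inter + eps) / (pred.sum() + target.sum() + eps))

# Half-overlapping masks give a Dice of 0.5
d = dice_coefficient(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
```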
zh
[CV-175] Advanced Assessment of Stroke in Retinal Fundus Imaging with Deep Multi-view Learning
【速读】:该论文旨在解决通过视网膜影像准确识别和区分脑卒中(Stroke)和短暂性缺血发作(TIA)的问题。解决方案的关键在于提出了一种多视角脑卒中网络(MVS-Net),该网络采用端到端的深度学习方法,整合来自左右眼的视网膜影像多视角输入,定义并区分视网膜图像中的代表特征及黄斑中心和视神经头中心视角下的关联关系,从而实现对脑卒中和TIA的检测。实验结果表明,所提出的框架在检测脑卒中和TIA方面达到了0.84的AUC评分。
链接: https://arxiv.org/abs/2502.00079
作者: Aysen Degerli,Mika Hilvo,Juha Pajula,Petri Huhtinen,Pekka Jäkälä
机构: VTT Technical Research Centre of Finland (芬兰技术研究中心); Optomed Oyj (Optomed Oyj); Kuopio University Hospital (库奥皮奥大学医院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Stroke is globally a major cause of mortality and morbidity, and hence accurate and rapid diagnosis of stroke is valuable. Retinal fundus imaging reveals the known markers of elevated stroke risk in the eyes, which are retinal venular widening, arteriolar narrowing, and increased tortuosity. In contrast to other imaging techniques used for stroke diagnosis, the acquisition of fundus images is easy, non-invasive, fast, and inexpensive. Therefore, in this study, we propose a multi-view stroke network (MVS-Net) to detect stroke and transient ischemic attack (TIA) using retinal fundus images. Contrary to existing studies, our study proposes for the first time a solution to discriminate stroke and TIA with deep multi-view learning by proposing an end-to-end deep network, consisting of multi-view inputs of fundus images captured from both right and left eyes. Accordingly, the proposed MVS-Net defines representative features from fundus images of both eyes and determines the relation within their macula-centered and optic nerve head-centered views. Experiments performed on a dataset collected from stroke and TIA patients, in addition to healthy controls, show that the proposed framework achieves an AUC score of 0.84 for stroke and TIA detection.
zh
[CV-176] Deep Ensembling with Multimodal Image Fusion for Efficient Classification of Lung Cancer
【速读】:该论文旨在解决多模态肺部图像中癌变与健康组织切片的分类问题。针对有限样本量的挑战,论文提出的关键解决方案是开发了一种基于深度集成的多模态融合网络(Deep Ensembled Multimodal Fusion, DEMF),通过主成分分析(Principal Component Analysis, PCA)和自动编码器(Autoencoder)融合正电子发射断层成像(Positron Emission Tomography, PET)和计算机断层扫描(Computed Tomography, CT)图像,并采用多数投票策略进行分类。此外,使用梯度加权类激活映射(Gradient-weighted Class Activation Mapping, Grad-CAM)来可视化分类结果,同时在训练阶段采用了随机图像增强策略以应对样本不足的问题。
链接: https://arxiv.org/abs/2502.00078
作者: Surochita Pal,Sushmita Mitra
机构: Indian Statistical Institute (印度统计学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This study focuses on the classification of cancerous and healthy slices from multimodal lung images. The data used in the research comprises Computed Tomography (CT) and Positron Emission Tomography (PET) images. The proposed strategy achieves the fusion of PET and CT images by utilizing Principal Component Analysis (PCA) and an Autoencoder. Subsequently, a new ensemble-based classifier, Deep Ensembled Multimodal Fusion (DEMF), is developed, employing majority voting to classify the sample images under examination. Gradient-weighted Class Activation Mapping (Grad-CAM) is employed to visualize the classification accuracy of cancer-affected images. Given the limited sample size, a random image augmentation strategy is employed during the training phase. The DEMF network helps mitigate the challenges of scarce data in computer-aided medical image analysis. The proposed network is compared with state-of-the-art networks across three publicly available datasets. The network outperforms the others based on the metrics Accuracy, F1-Score, Precision, and Recall. The investigation results highlight the effectiveness of the proposed network.
zh
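The DEMF classifier described above aggregates its ensemble members by majority voting. A minimal sketch of that voting step (an illustrative reconstruction, not the authors' code; the class labels are invented):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model class predictions for one sample by majority vote.

    predictions: list of class labels, one from each ensemble member.
    Ties are broken in favor of the label seen first.
    """
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(per_model_preds):
    """per_model_preds: one prediction list per model, all of equal
    length (one label per sample); returns the voted label per sample."""
    n_samples = len(per_model_preds[0])
    return [majority_vote([preds[i] for preds in per_model_preds])
            for i in range(n_samples)]

# Hypothetical outputs of three ensemble members on three slices
preds = [["cancer", "healthy", "cancer"],
         ["cancer", "cancer", "cancer"],
         ["healthy", "cancer", "cancer"]]
print(ensemble_predict(preds))  # ['cancer', 'cancer', 'cancer']
```

In the paper each ensemble member is a deep network scoring fused PET-CT slices; here plain label lists stand in for their outputs.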
[CV-177] LSU-Net: Lightweight Automatic Organs Segmentation Network For Medical Images ICASSP2025
【速读】:该论文旨在解决现有UNet及其变体在医学图像分割应用中的高参数量和计算复杂性问题,限制其在临床环境中有限计算资源下的实用性。解决方案的关键在于提出了一种新型的轻量化位移U-Net(LSU-Net),通过集成轻量化卷积块(Light Conv Block)和标记化位移块(Tokenized Shift Block),结合动态权重多损失设计,实现高效特征提取与动态权重分配。轻量化卷积块通过标准卷积与深度可分离卷积相结合的方式,以低参数量有效捕捉特征;标记化位移块则通过空间位移块与深度可分离卷积的组合优化特征表示,通过深度特征的位移和捕捉提升性能。动态调整各层的损失权重能够逼近最优解并增强训练稳定性。
链接: https://arxiv.org/abs/2502.00042
作者: Yujie Ding,Shenghua Teng,Zuoyong Li,Xiao Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures, 4 tables. Accepted at ICASSP 2025
点击查看摘要
Abstract:UNet and its variants have widespread applications in medical image segmentation. However, the substantial number of parameters and computational complexity of these models make them less suitable for use in clinical settings with limited computational resources. To address this limitation, we propose a novel Lightweight Shift U-Net (LSU-Net). We integrate the Light Conv Block and the Tokenized Shift Block in a lightweight manner, combining them with a dynamic weight multi-loss design for efficient dynamic weight allocation. The Light Conv Block effectively captures features with a low parameter count by combining standard convolutions with depthwise separable convolutions. The Tokenized Shift Block optimizes feature representation by shifting and capturing deep features through a combination of the Spatial Shift Block and depthwise separable convolutions. Dynamic adjustment of the loss weights at each layer approaches the optimal solution and enhances training stability. We validated LSU-Net on the UWMGI and MSD Colon datasets, and experimental results demonstrate that LSU-Net outperforms most state-of-the-art segmentation architectures.
zh
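The parameter savings that make LSU-Net's Light Conv Block lightweight come from swapping standard convolutions for depthwise separable ones. The arithmetic behind that claim (ignoring biases) can be checked directly:

```python
def conv_params(k, c_in, c_out):
    # Standard 2D convolution: one k x k x c_in kernel per output channel
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    # Depthwise: one k x k kernel per input channel,
    # followed by a 1 x 1 pointwise convolution mixing channels
    return k * k * c_in + c_in * c_out

std = conv_params(3, 64, 128)           # 73728 parameters
dws = dw_separable_params(3, 64, 128)   # 576 + 8192 = 8768 parameters
print(std, dws, round(std / dws, 1))    # roughly an 8x reduction
```

The layer sizes here are illustrative, not taken from the LSU-Net architecture.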
人工智能
[AI-0] The AI Agent Index
链接: https://arxiv.org/abs/2502.01635
作者: Stephen Casper,Luke Bailey,Rosco Hunter,Carson Ezell,Emma Cabalé,Michael Gerovitch,Stewart Slocum,Kevin Wei,Nikola Jurkovic,Ariba Khan,Phillip J.K. Christoffersen,A. Pinar Ozisik,Rakshit Trivedi,Dylan Hadfield-Menell,Noam Kolt
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accompanying website: this https URL
点击查看摘要
Abstract:Leading AI developers and startups are increasingly deploying agentic AI systems that can plan and execute complex tasks with limited human involvement. However, there is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems. To fill this gap, we introduce the AI Agent Index, the first public database to document information about currently deployed agentic AI systems. For each system that meets the criteria for inclusion in the index, we document the system’s components (e.g., base model, reasoning implementation, tool use), application domains (e.g., computer use, software engineering), and risk management practices (e.g., evaluation results, guardrails), based on publicly available information and correspondence with developers. We find that while developers generally provide ample information regarding the capabilities and applications of agentic systems, they currently provide limited information regarding safety and risk management practices. The AI Agent Index is available online at this https URL
[AI-1] Online Gradient Boosting Decision Tree: In-Place Updates for Efficient Adding/Deleting Data
链接: https://arxiv.org/abs/2502.01634
作者: Huawei Lin,Jun Woo Chung,Yingjie Lao,Weijie Zhao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: 25 pages, 11 figures, 16 tables. Keywords: Decremental Learning, Incremental Learning, Machine Unlearning, Online Learning, Gradient Boosting Decision Trees, GBDTs
点击查看摘要
Abstract:Gradient Boosting Decision Tree (GBDT) is one of the most popular machine learning models in various applications. However, in the traditional settings, all data should be simultaneously accessed in the training procedure: it does not allow to add or delete any data instances after training. In this paper, we propose an efficient online learning framework for GBDT supporting both incremental and decremental learning. To the best of our knowledge, this is the first work that considers an in-place unified incremental and decremental learning on GBDT. To reduce the learning cost, we present a collection of optimizations for our framework, so that it can add or delete a small fraction of data on the fly. We theoretically show the relationship between the hyper-parameters of the proposed optimizations, which enables trading off accuracy and cost on incremental and decremental learning. The backdoor attack results show that our framework can successfully inject and remove backdoor in a well-trained model using incremental and decremental learning, and the empirical results on public datasets confirm the effectiveness and efficiency of our proposed online learning framework and optimizations.
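The in-place add/delete idea rests on tree statistics being additive. A toy sketch under strong simplifying assumptions (a single regression leaf holding a mean; the actual framework handles full boosted trees, splits, and gradients):

```python
class Leaf:
    """Leaf value = mean of assigned targets, kept as additive
    sufficient statistics so instances can be added or removed in place."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, y):          # incremental learning step
        self.total += y
        self.count += 1

    def delete(self, y):       # decremental learning (unlearning) step
        self.total -= y
        self.count -= 1

    @property
    def value(self):
        return self.total / self.count if self.count else 0.0

leaf = Leaf()
for y in [1.0, 2.0, 3.0]:
    leaf.add(y)
leaf.delete(2.0)   # remove one instance without retraining from scratch
print(leaf.value)  # 2.0
```

The real system must also decide when a deletion invalidates a split and rebuild affected subtrees; this sketch only shows why leaf-level updates can be O(1).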
[AI-2] Adversarial Reasoning at Jailbreaking Time
链接: https://arxiv.org/abs/2502.01633
作者: Mahdi Sabbaghi,Paul Kassianik,George Pappas,Yaron Singer,Amin Karbasi,Hamed Hassani
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation that achieves SOTA attack success rates (ASR) against many aligned LLMs, even the ones that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
[AI-3] TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues
链接: https://arxiv.org/abs/2502.01630
作者: Yubin Ge,Salvatore Romeo,Jason Cai,Raphael Shu,Monica Sunkara,Yassine Benajiba,Yi Zhang
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Temporal reasoning in multi-session dialogues presents a significant challenge which has been under-studied in previous temporal reasoning benchmarks. To bridge this gap, we propose a new evaluation task for temporal reasoning in multi-session dialogues and introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multi-choice QAs. Furthermore, we present TReMu, a new framework aimed at enhancing the temporal reasoning capabilities of LLM-agents in this context. Specifically, the framework employs time-aware memorization through timeline summarization, generating retrievable memory by summarizing events in each dialogue session with their inferred dates. Additionally, we integrate neuro-symbolic temporal reasoning, where LLMs generate Python code to perform temporal calculations and select answers. Experimental evaluations on popular LLMs demonstrate that our benchmark is challenging, and the proposed framework significantly improves temporal reasoning performance compared to baseline methods, raising accuracy from 29.83 on GPT-4o via standard prompting to 77.67 via our approach and highlighting its effectiveness in addressing temporal reasoning in multi-session dialogues.
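The neuro-symbolic step has the LLM emit Python that performs the temporal calculation. Code of the kind the model might generate could look like this (a hand-written illustration, not actual model output; the events and dates are invented):

```python
from datetime import date

# Inferred event dates from the timeline summary (hypothetical example)
events = {
    "adopted a puppy": date(2023, 5, 14),
    "moved to Boston": date(2023, 11, 2),
}

# Question: how many full weeks passed between the two events?
delta_days = (events["moved to Boston"] - events["adopted a puppy"]).days
weeks = delta_days // 7
print(weeks)  # 24
```

Delegating the arithmetic to executed code avoids the date-calculation slips LLMs make when reasoning purely in text, which is the motivation the abstract gives for the neuro-symbolic design.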
[AI-4] A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
链接: https://arxiv.org/abs/2502.01618
作者: Isha Puri,Shivchander Sudalairaj,Guangxuan Xu,Kai Xu,Akash Srivastava
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate over our deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts. Our work not only presents an effective method to inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work. Code and further information is available at this https URL.
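The core move borrowed from particle-based Monte Carlo is to weight candidate continuations by an approximate likelihood (e.g., a reward-model score) and resample in proportion to that weight, rather than keeping only the top-scoring candidate. A toy sketch (hypothetical; the softmax weighting, temperature, and scores are assumptions, not the paper's exact scheme):

```python
import math
import random

def resample(particles, scores, temperature=1.0, rng=None):
    """Resample candidate partial solutions in proportion to
    softmax(score / temperature) -- a stand-in for the approximate
    likelihood supplied by a reward model."""
    rng = rng or random.Random(0)
    weights = [math.exp(s / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(particles, weights=probs, k=len(particles))

particles = ["step A", "step B", "step C", "step D"]
scores = [0.1, 2.5, 0.3, 2.4]   # hypothetical reward-model scores
survivors = resample(particles, scores)
# High-scoring particles dominate after resampling, but low-scoring
# ones are not deterministically eliminated -- the "typical set"
# exploration the abstract contrasts with mode-seeking search.
```

This sampling view is what makes the approach less brittle to reward hacking than greedy search over reward-model scores.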
[AI-5] Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
链接: https://arxiv.org/abs/2502.01612
作者: Nayoung Lee,Ziyang Cai,Avi Schwarzschild,Kangwook Lee,Dimitris Papailiopoulos
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining a standard transformer architecture. Across diverse tasks including arithmetic, string manipulation, and maze solving, self-improving enables models to solve problems far beyond their initial training distribution-for instance, generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that in some cases filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds. Additionally, starting from pretrained models significantly accelerates this self-improvement process for several tasks. Our results demonstrate how controlled weak-to-strong curricula can systematically teach a model logical extrapolation without any changes to the positional embeddings, or the model architecture.
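The filtering the authors describe, keeping only self-generated solutions that verify as correct, can be sketched as a simple loop (toy stand-ins: the "model" and the verifier below are invented, whereas the paper trains a transformer):

```python
def self_improvement_round(model_solve, problems, verify):
    """One round of self-improvement: the model attempts problems, and
    only verified-correct (problem, solution) pairs are kept as new
    training data for the next round."""
    new_data = []
    for p in problems:
        sol = model_solve(p)
        if verify(p, sol):
            new_data.append((p, sol))
    return new_data

# Toy stand-ins: a "model" that errs on one input, and an exact verifier.
flaky = lambda p: p[0] + p[1] + (1 if p == (9, 9) else 0)
verify = lambda p, s: s == p[0] + p[1]
kept = self_improvement_round(flaky, [(1, 2), (9, 9), (3, 4)], verify)
print(kept)  # [((1, 2), 3), ((3, 4), 7)] -- the wrong answer is filtered out
```

The paper's key observation is that repeating this loop on progressively harder problems lets correctness filtering compound into large out-of-distribution gains.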
[AI-6] Reinforcement Learning for Long-Horizon Interactive LLM Agents
链接: https://arxiv.org/abs/2502.01600
作者: Kevin Chen,Marco Cusumano-Towner,Brody Huval,Aleksei Petrenko,Jackson Hamburger,Vladlen Koltun,Philipp Krähenbühl
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests. While IDAs powered by instruction-tuned large language models (LLMs) can react to feedback from interface invocations in multi-step exchanges, they have not been trained in their respective digital environments. Prior methods accomplish less than half of tasks in sophisticated benchmarks such as AppWorld. We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments. We formalize this training as a partially observable Markov decision process and derive M-PPO, a data- and memory-efficient variant of proximal policy optimization. M-PPO uses no value network and maintains exactly one copy of the underlying LLM in memory, making its implementation straightforward and as memory-efficient as fine-tuning a single LLM. A 32-billion-parameter agent trained with M-PPO in the AppWorld environment outperforms the much larger OpenAI o1 agent by 9 percentage points (15% relative). To our knowledge, this is the first reported application of RL to IDAs that interact with a stateful, multi-domain, multi-app environment via direct API calls. Our analysis sheds light on the effectiveness of RL in this area, showing that the agent learns to consult the API documentation, avoid unwarranted assumptions, minimize confabulation, and recover from setbacks.
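M-PPO is presented as a variant of proximal policy optimization; for reference, the textbook PPO clipped surrogate that such variants build on looks like this (a sketch of standard PPO, not the paper's exact M-PPO objective, which additionally drops the value network):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: ratio is pi_new(a|s) / pi_old(a|s),
    advantage is the (estimated) advantage of the taken action."""
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: gains are capped once the ratio exceeds 1 + eps
print(ppo_clip_objective(1.5, 1.0))   # 1.2
# Negative advantage: the penalty is not clipped away as the ratio grows
print(ppo_clip_objective(1.5, -1.0))  # -1.5
```

The clipping keeps each policy update close to the data-collecting policy, which matters for stability when the "policy" is a fine-tuned LLM.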
[AI-7] Improving Transformer World Models for Data-Efficient RL
链接: https://arxiv.org/abs/2502.01591
作者: Antoine Dedieu,Joseph Ortiz,Xinghua Lou,Carter Wendelken,Wolfgang Lehrach,J Swaroop Guntupalli,Miguel Lazaro-Gredilla,Kevin Patrick Murphy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We present an approach to model-based RL that achieves a new state of the art performance on the challenging Craftax-classic benchmark, an open-world 2D survival game that requires agents to exhibit a wide range of general abilities – such as strong generalization, deep exploration, and long-term reasoning. With a series of careful design choices aimed at improving sample efficiency, our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and, for the first time, exceeds human performance of 65.0%. Our method starts by constructing a SOTA model-free baseline, using a novel policy architecture that combines CNNs and RNNs. We then add three improvements to the standard MBRL setup: (a) “Dyna with warmup”, which trains the policy on real and imaginary data, (b) “nearest neighbor tokenizer” on image patches, which improves the scheme to create the transformer world model (TWM) inputs, and (c) “block teacher forcing”, which allows the TWM to reason jointly about the future tokens of the next timestep.
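Improvement (b), the nearest-neighbor tokenizer, maps each image patch to the index of its closest codebook entry before the token sequence is fed to the transformer world model. A simplified stand-in (toy 2-D vectors instead of real pixel patches; the codebook here is invented):

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def tokenize(patches, codebook):
    """Map each flattened image patch to the index of its nearest
    codebook entry; the resulting token sequence is the TWM's input."""
    return [min(range(len(codebook)), key=lambda i: sq_dist(p, codebook[i]))
            for p in patches]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # toy "patch" prototypes
patches = [[0.1, 0.2], [0.9, 0.8], [0.1, 0.9]]
print(tokenize(patches, codebook))  # [0, 1, 2]
```

Unlike a learned VQ-VAE codebook, a nearest-neighbor scheme of this shape can populate its codebook directly from observed patches, which is one plausible reading of why it helps sample efficiency.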
[AI-8] Verbalized Bayesian Persuasion
链接: https://arxiv.org/abs/2502.01587
作者: Wenhao Li,Yue Lin,Xiangfeng Wang,Bo Jin,Hongyuan Zha,Baoxiang Wang
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 63 pages, 21 figures
点击查看摘要
Abstract:Information design (ID) explores how a sender influences the optimal behavior of receivers to achieve specific objectives. While ID originates from everyday human communication, existing game-theoretic and machine learning methods often model information structures as numbers, which limits many applications to toy games. This work leverages LLMs and proposes a verbalized framework in Bayesian persuasion (BP), which extends classic BP to real-world games involving human dialogues for the first time. Specifically, we map the BP to a verbalized mediator-augmented extensive-form game, where LLMs instantiate the sender and receiver. To efficiently solve the verbalized game, we propose a generalized equilibrium-finding algorithm combining an LLM and a game solver. The algorithm is reinforced with techniques including verbalized commitment assumptions, verbalized obedience constraints, and information obfuscation. Numerical experiments in dialogue scenarios, such as recommendation letters, courtroom interactions, and law enforcement, validate that our framework can both reproduce theoretical results in classic BP and discover effective persuasion strategies in more complex natural language and multi-stage scenarios.
[AI-9] PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
链接: https://arxiv.org/abs/2502.01584
作者: Carolyn Jane Anderson,Joydeep Biswas,Aleksander Boruch-Gruszecki,Federico Cassano,Molly Q Feldman,Arjun Guha,Francesca Lucchetti,Zixuan Wu
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Existing benchmarks for frontier models often test specialized, “PhD-level” knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models’ mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with “I give up” before providing an answer that it knows is wrong. R1 can also be remarkably “uncertain” in its output and in rare cases, it does not “finish thinking,” which suggests the need for an inference-time technique to “wrap up” before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
[AI-10] Next Steps in LLM-Supported Java Verification ICSE
链接: https://arxiv.org/abs/2502.01573
作者: Samuel Teuber,Bernhard Beckert
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: Accepted to NSE 2025, 1st International Workshop on Neuro-Symbolic Software Engineering (ICSE Workshop), 6 pages, 3 figures
点击查看摘要
Abstract:Recent work has shown that Large Language Models (LLMs) are not only a suitable tool for code generation but also capable of generating annotation-based code specifications. Scaling these methodologies may allow us to deduce provable correctness guarantees for large-scale software systems. In comparison to other LLM tasks, the application field of deductive verification has the notable advantage of providing a rigorous toolset to check LLM-generated solutions. This short paper provides early results on how this rigorous toolset can be used to reliably elicit correct specification annotations from an unreliable LLM oracle.
[AI-11] MeetMap: Real-Time Collaborative Dialogue Mapping with LLMs in Online Meetings
链接: https://arxiv.org/abs/2502.01564
作者: Xinyue Chen,Nathan Yap,Xinyi Lu,Aylin Gunal,Xu Wang
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: CSCW2025 Accepted
点击查看摘要
Abstract:Video meeting platforms display conversations linearly through transcripts or summaries. However, ideas during a meeting do not emerge linearly. We leverage LLMs to create dialogue maps in real time to help people visually structure and connect ideas. Balancing the need to reduce the cognitive load on users during the conversation while giving them sufficient control when using AI, we explore two system variants that encompass different levels of AI assistance. In Human-Map, AI generates summaries of conversations as nodes, and users create dialogue maps with the nodes. In AI-Map, AI produces dialogue maps where users can make edits. We ran a within-subject experiment with ten pairs of users, comparing the two MeetMap variants and a baseline. Users preferred MeetMap over traditional methods for taking notes, which aligned better with their mental models of conversations. Users liked the ease of use for AI-Map due to the low effort demands and appreciated the hands-on opportunity in Human-Map for sense-making.
[AI-12] Search-Based Adversarial Estimates for Improving Sample Efficiency in Off-Policy Reinforcement Learning
链接: https://arxiv.org/abs/2502.01558
作者: Federico Malato,Ville Hautamaki
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to International Conference on Machine Learning 2025. Currently under peer-review
点击查看摘要
Abstract:Sample inefficiency is a long-standing challenge in deep reinforcement learning (DRL). Although dramatic improvements have been made, the problem is far from solved and is especially challenging in environments with sparse or delayed rewards. In our work, we propose to use Adversarial Estimates as a new, simple and efficient approach to mitigate this problem for a class of feedback-based DRL algorithms. Our approach leverages latent similarity search from a small set of human-collected trajectories to boost learning, using only five minutes of human-recorded experience. The results of our study show that algorithms trained with Adversarial Estimates converge faster than their original versions. Moreover, we discuss how our approach could enable learning in feedback-based algorithms in extreme scenarios with very sparse rewards.
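The latent similarity search over a small set of human trajectories can be sketched as a nearest-neighbor lookup (assuming cosine similarity in latent space, which is an assumption here; the paper's exact similarity measure and how the lookup shapes the estimate may differ):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_human_state(agent_latent, human_latents):
    """Return the index of the most similar state in the human-collected
    trajectories -- the lookup used to shape the learning signal."""
    return max(range(len(human_latents)),
               key=lambda i: cosine(agent_latent, human_latents[i]))

# Toy latent vectors standing in for encoded trajectory states
human = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(nearest_human_state([0.9, 0.1, 0.0], human))  # 0
```

Because the human set is tiny (five minutes of experience), a brute-force scan like this is already practical; larger sets would call for an approximate index.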
[AI-13] Query Brand Entity Linking in E-Commerce Search
链接: https://arxiv.org/abs/2502.01555
作者: Dong Liu,Sreyashi Nag
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this work, we address the brand entity linking problem for e-commerce search queries. The entity linking task is done by either i) a two-stage process consisting of entity mention detection followed by entity disambiguation, or ii) an end-to-end linking approach that directly fetches the target entity given the input text. The task presents unique challenges: queries are extremely short (averaging 2.4 words), lack natural language structure, and must handle a massive space of unique brands. We present a two-stage approach combining named-entity recognition with matching, and a novel end-to-end solution using extreme multi-class classification. We validate our solutions by both offline benchmarks and the impact of an online A/B test.
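The two-stage route (mention detection, then matching against the brand catalog) can be illustrated on a toy query, with stdlib difflib standing in for the learned NER and matching models (the brand list, span heuristic, and threshold are all invented):

```python
import difflib

BRANDS = ["nike", "adidas", "new balance"]   # toy brand catalog

def detect_mention(query):
    """Naive mention detection: return the 1- or 2-token span of the
    query that best fuzzy-matches any known brand (a stand-in for the
    NER stage)."""
    tokens = query.lower().split()
    spans = tokens + [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return max(spans, key=lambda s: max(
        difflib.SequenceMatcher(None, s, b).ratio() for b in BRANDS))

def link(query):
    """Disambiguation stage: match the detected mention to a catalog
    brand, tolerating typos common in short queries."""
    mention = detect_mention(query)
    match = difflib.get_close_matches(mention, BRANDS, n=1, cutoff=0.6)
    return match[0] if match else None

print(link("nke running shoes"))  # 'nike', despite the typo
```

The end-to-end alternative the abstract mentions would instead classify the whole query directly into the brand space, skipping the explicit mention span.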
[AI-14] Transformers trained on proteins can learn to attend to Euclidean distance
链接: https://arxiv.org/abs/2502.01533
作者