arXiv Papers Today | 2025-03-05

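This post lists the latest papers retrieved from arXiv.org on 2025-03-05. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 every morning.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.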

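[Quick Read]: This paper analyzes the impact of Large Language Models (LLMs) on Wikipedia and assesses the potential risks. It combines an analysis of page views and article content with simulations to study how LLMs affect Wikipedia-related natural language processing tasks such as machine translation and Retrieval-Augmented Generation (RAG). The key contribution is to quantify how much LLMs have influenced Wikipedia articles and the associated tasks, and to expose risks such as inflated benchmark scores and knowledge-base pollution, providing a basis for addressing them in the future.

Link: https://arxiv.org/abs/2503.02879
Authors: Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen
Affiliations: Huazhong University of Science and Technology; International School for Advanced Studies (SISSA)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: We release all the experimental dataset and source code at: this https URL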

Abstract:In this paper, we present a thorough analysis of the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing page views and article content to study Wikipedia’s recent changes and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been influenced by LLMs, with an impact of approximately 1%-2% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models might shift as well. Moreover, the effectiveness of RAG might decrease if the knowledge base becomes polluted by LLM-generated content. While LLMs have not yet fully changed Wikipedia’s language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks.

[NLP-1] Language Models can Self-Improve at State-Value Estimation for Better Search

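[Quick Read]: This paper addresses the high cost and time required to collect ground-truth rewards or human demonstrations for multi-step reasoning tasks, especially in interactive domains such as web tasks. It proposes self-taught lookahead, a self-supervised method whose key idea is to use state-transition dynamics to train a value model that can effectively guide language-model-controlled search. A moderately sized (8 billion parameter) open-weight value model improved this way can match a frontier LLM such as gpt-4o used as the value model. Compared with previous LLM-based tree search, the method improves performance by 20% and reduces cost 37x, without relying on ground-truth rewards.

Link: https://arxiv.org/abs/2503.02878
Authors: Ethan Mendes, Alan Ritter
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: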

Abstract:Collecting ground truth task completion rewards or human demonstrations for multi-step reasoning tasks is often cost-prohibitive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead, a self-supervised method that leverages state-transition dynamics to train a value model capable of effectively guiding language model-controlled search. We find that moderately sized (8 billion parameters) open-weight value models improved with self-taught lookahead can match the performance of using a frontier LLM such as gpt-4o as the value model. Furthermore, we find that self-taught lookahead improves performance by 20% while reducing costs 37x compared to previous LLM-based tree search, without relying on ground truth rewards.
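
As a rough illustration of how a one-step lookahead can produce training targets for a value model without ground-truth rewards, the sketch below backs up the best successor value to the current state. The function names, the toy transition table, and the choice of a discounted max backup are assumptions made for illustration, not the authors' implementation.

```python
def lookahead_target(state, successors, value_fn, discount=1.0):
    """Return a training target for `state` from a one-step lookahead."""
    children = successors(state)
    if not children:
        # No successors: fall back to the current estimate (assumption).
        return value_fn(state)
    # Back up the best successor value (assumed max backup).
    return discount * max(value_fn(child) for child in children)


# Toy transition table and current value estimates (hypothetical).
transitions = {"home": ["search", "cart"], "search": [], "cart": []}
estimates = {"home": 0.2, "search": 0.7, "cart": 0.4}

target = lookahead_target("home", transitions.get, estimates.get, discount=0.9)
print(target)
```

In this toy setup the value model would be fine-tuned toward `target` for the state "home", replacing its earlier estimate with one informed by the successor states.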

[NLP-2] The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models

[Quick Read]: This paper addresses the fact that improving the reasoning ability of Large Language Models (LLMs) usually depends on labeled data or computationally expensive sampling. It proposes Unsupervised Prefix Fine-Tuning (UPFT). UPFT builds on the observation of Prefix Self-Consistency: diverse solution trajectories share their initial reasoning steps. By training only on initial prefix substrings (as few as 8 tokens), UPFT improves LLM reasoning efficiency without labeled data or exhaustive sampling. Experiments show that UPFT matches supervised methods on reasoning benchmarks while reducing training time by 75% and sampling cost by 99%.

Link: https://arxiv.org/abs/2503.02875
Authors: Ke Ji, Jiahao Xu, Tian Liang, Qiuzhi Liu, Zhiwei He, Xingyu Chen, Xiaoyuan Liu, Zhijie Wang, Junying Chen, Benyou Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Affiliations: Tencent AI Lab; Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Improving the reasoning capabilities of large language models (LLMs) typically requires supervised fine-tuning with labeled data or computationally expensive sampling. We introduce Unsupervised Prefix Fine-Tuning (UPFT), which leverages the observation of Prefix Self-Consistency – the shared initial reasoning steps across diverse solution trajectories – to enhance LLM reasoning efficiency. By training exclusively on the initial prefix substrings (as few as 8 tokens), UPFT removes the need for labeled data or exhaustive sampling. Experiments on reasoning benchmarks show that UPFT matches the performance of supervised methods such as Rejection Sampling Fine-Tuning, while reducing training time by 75% and sampling cost by 99%. Further analysis reveals that errors tend to appear in later stages of the reasoning process and that prefix-based training preserves the model’s structural knowledge. This work demonstrates how minimal unsupervised fine-tuning can unlock substantial reasoning gains in LLMs, offering a scalable and resource-efficient alternative to conventional approaches.
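
To make the idea of training only on short prefixes concrete, here is a minimal sketch of the data preparation step: each sampled response is truncated to its first few tokens before being used as a fine-tuning target. The helper name, the whitespace tokenization, and the example prompt are assumptions for illustration; the paper's actual pipeline may differ.

```python
def build_prefix_examples(prompt_to_sample, prefix_len=8):
    """Keep only the first `prefix_len` tokens of one sampled response per prompt."""
    examples = []
    for prompt, sampled_tokens in prompt_to_sample.items():
        examples.append({"prompt": prompt, "target": sampled_tokens[:prefix_len]})
    return examples


# One unlabeled prompt with a single sampled response (hypothetical data).
samples = {
    "What is 12 * 7?": "First , split 12 into 10 and 2 , then multiply each by 7 .".split(),
}

prefix_data = build_prefix_examples(samples, prefix_len=8)
print(prefix_data[0]["target"])
```

No answer label is needed at any point: the target is just the opening of the model's own sample, which is what makes the setup unsupervised.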

[NLP-3] FairSense-AI: Responsible AI Meets Sustainability

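[Quick Read]: This paper addresses the detection and mitigation of bias in text and image content in AI systems. The key is the FairSense-AI framework, which combines Large Language Models (LLMs) and Vision-Language Models (VLMs) to uncover subtle prejudice or stereotyping in content and provides bias scores, explanatory highlights, and automated recommendations for improving fairness. The framework also integrates an AI risk assessment component aligned with the MIT AI Risk Repository and the NIST AI Risk Management Framework, enabling structured identification of ethical and safety concerns. Beyond the social dimension of fairness, FairSense-AI is optimized for energy efficiency through techniques such as model pruning and mixed-precision computation, which reduces its environmental footprint.

Link: https://arxiv.org/abs/2503.02865
Authors: Shaina Raza, Mukund Sayeeganesh Chettiar, Matin Yousefabadi, Tahniat Khan, Marcelo Lotif
Affiliations: Vector Institute
Subjects: Computation and Language (cs.CL)
Comments: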

Abstract:In this paper, we introduce FairSense-AI: a multimodal framework designed to detect and mitigate bias in both text and images. By leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs), FairSense-AI uncovers subtle forms of prejudice or stereotyping that can appear in content, providing users with bias scores, explanatory highlights, and automated recommendations for fairness enhancements. In addition, FairSense-AI integrates an AI risk assessment component that aligns with frameworks like the MIT AI Risk Repository and NIST AI Risk Management Framework, enabling structured identification of ethical and safety concerns. The platform is optimized for energy efficiency via techniques such as model pruning and mixed-precision computation, thereby reducing its environmental footprint. Through a series of case studies and applications, we demonstrate how FairSense-AI promotes responsible AI use by addressing both the social dimension of fairness and the pressing need for sustainability in large-scale AI deployments. this https URL, this https URL

[NLP-4] Calibrating LLM Confidence with Semantic Steering: A Multi-Prompt Aggregation Framework

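[Quick Read]: This paper addresses misaligned confidence scores in Large Language Models (LLMs), which usually overestimate the reliability of their predictions. Prior work is divided on whether confidence scores can be systematically steered through prompting, and some recent studies argue that the effect is negligible, suggesting that LLM confidence calibration is rigid to linguistic interventions. In response, the paper first probes three models (including GPT3.5, LLAMA3-70B, GPT4) on 7 benchmarks and confirms that explicit instructions can inflate or deflate confidence scores in a regulated manner, so directional confidence shifts do exist.

The key to the solution is a new framework, SteeringConf, with three components: confidence steering, steered confidence aggregation, and steered answers selection. It uses a confidence manipulation mechanism to steer LLM confidence scores in several desired directions, then a summarization module aggregates the steered confidence scores to produce the final prediction. On 7 benchmarks, SteeringConf consistently outperforms the baselines on calibration metrics in confidence calibration and failure detection tasks.

Link: https://arxiv.org/abs/2503.02863
Authors: Ziang Zhou, Tianyuan Jin, Jieming Shi, Qing Li
Affiliations: The Hong Kong Polytechnic University; National University of Singapore
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: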

Abstract:Large Language Models (LLMs) often exhibit misaligned confidence scores, usually overestimating the reliability of their predictions. While verbalized confidence in Large Language Models (LLMs) has gained attention, prior work remains divided on whether confidence scores can be systematically steered through prompting. Recent studies even argue that such prompt-induced confidence shifts are negligible, suggesting LLMs’ confidence calibration is rigid to linguistic interventions. Contrary to these claims, we first rigorously confirm the existence of directional confidence shifts by probing three models (including GPT3.5, LLAMA3-70b, GPT4) across 7 benchmarks, demonstrating that explicit instructions can inflate or deflate confidence scores in a regulated manner. Based on this observation, we propose a novel framework containing three components: confidence steering, steered confidence aggregation and steered answers selection, named SteeringConf. Our method, SteeringConf, leverages a confidence manipulation mechanism to steer the confidence scores of LLMs in several desired directions, followed by a summarization module that aggregates the steered confidence scores to produce a final prediction. We evaluate our method on 7 benchmarks and it consistently outperforms the baselines in terms of calibration metrics in task of confidence calibration and failure detection.

[NLP-5] (How) Do Language Models Track State?

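[Quick Read]: This paper asks how Transformer language models track state in tasks, such as storytelling or code generation, that appear to require tracking the unobserved state of an evolving world. It studies one specific task, permutation composition: computing the order of a set of objects after a sequence of swaps. Many other tasks (for example, simulating finite automata and evaluating Boolean expressions) can be reduced to permutation composition, which makes it a natural model for state tracking in general.

The key finding is that language models consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature, permutation parity, to partially prune the output space and then refines the result with an associative scan. The two mechanisms have markedly different robustness properties, and intermediate training tasks can steer a model toward one or the other. The results show that Transformer language models, whether pretrained or fine-tuned, can learn efficient and interpretable state tracking mechanisms, and that the emergence of these mechanisms can be predicted and controlled.

Link: https://arxiv.org/abs/2503.02854
Authors: Belinda Z. Li, Zifan Carl Guo, Jacob Andreas
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 17 figures, 1 table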

Abstract:Transformer language models (LMs) exhibit behaviors – from storytelling to code generation – that appear to require tracking the unobserved state of an evolving world. How do they do so? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the “associative scan” construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, then refines this with an associative scan. The two mechanisms exhibit markedly different robustness properties, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pretrained or fine-tuned, can learn to implement efficient and interpretable state tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
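
The permutation composition task and the parity feature mentioned above are easy to state in code. The sketch below applies a sequence of swaps to a list of objects and computes the parity of the resulting permutation in two ways: from the number of swaps and from the inversion count of the final state. The function names and the example swaps are illustrative choices; this shows the task itself, not the mechanisms found inside the models.

```python
def apply_swaps(n, swaps):
    """Return the order of n objects after applying the swaps in sequence."""
    state = list(range(n))
    for i, j in swaps:
        state[i], state[j] = state[j], state[i]
    return state


def parity_from_swaps(swaps):
    """Each swap of two distinct positions flips the permutation parity."""
    return sum(1 for i, j in swaps if i != j) % 2


def parity_from_state(state):
    """Parity computed independently as the inversion count modulo 2."""
    n = len(state)
    inversions = sum(1 for a in range(n) for b in range(a + 1, n) if state[a] > state[b])
    return inversions % 2


swaps = [(0, 1), (1, 2), (0, 3)]
final_state = apply_swaps(4, swaps)
print(final_state, parity_from_swaps(swaps), parity_from_state(final_state))
```

Because parity only requires counting swaps, it is cheap to compute and rules out half of all candidate final states, which is why it can serve as a pruning heuristic before the exact state is resolved.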

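[NLP-6] Shakespearean Sparks: The Dance of Hallucination and Creativity in LLMs' Decoding Layers

[Quick Read]: This paper quantitatively studies the relationship between hallucination and creativity in Large Language Models (LLMs). Earlier work explored the connection mainly through theoretical or qualitative lenses. This paper proposes a narrow definition of creativity tailored to LLMs and an evaluation framework, HCL, that quantifies hallucination and creativity across different layers during decoding. Using HCL, it finds a tradeoff between hallucination and creativity that is consistent across layer depth, model type, and model size, and identifies a specific layer at each model size that optimally balances the tradeoff. In larger models this optimal layer tends to appear among the early layers, and model confidence is significantly higher at that layer. The code and data for the experiments are available at the provided link.

Link: https://arxiv.org/abs/2503.02851
Authors: Zicong He, Boxuan Zhang, Lu Cheng
Affiliations: Georgia Institute of Technology; Wuhan University; University of Illinois Chicago
Subjects: Computation and Language (cs.CL)
Comments: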

Abstract:Large language models (LLMs) are known to hallucinate, a phenomenon often linked to creativity. While previous research has primarily explored this connection through theoretical or qualitative lenses, our work takes a quantitative approach to systematically examine the relationship between hallucination and creativity in LLMs. Given the complex nature of creativity, we propose a narrow definition tailored to LLMs and introduce an evaluation framework, HCL, which quantifies Hallucination and Creativity across different Layers of LLMs during decoding. Our empirical analysis reveals a tradeoff between hallucination and creativity that is consistent across layer depth, model type, and model size. Notably, across different model architectures, we identify a specific layer at each model size that optimally balances this tradeoff. Additionally, the optimal layer tends to appear in the early layers of larger models, and the confidence of the model is also significantly higher at this layer. These findings provide a quantitative perspective that offers new insights into the interplay between LLM creativity and hallucination. The code and data for our experiments are available at this https URL.

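[NLP-7] Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs (ICLR 2025)

[Quick Read]: This paper addresses hallucinations (unfaithful or nonsensical information) produced by Large Language Models (LLMs) serving as AI assistants. Because an LLM response usually mixes truthful and hallucinated content, earlier factuality alignment methods based on response-level preference learning inevitably introduce noise during training. The paper proposes Mask-DPO, a fine-grained factuality alignment method based on Direct Preference Optimization (DPO). Its key idea is to use sentence-level factuality as a mask signal: the model learns only from factually correct sentences in preferred samples and is not penalized for factual content in non-preferred samples, which removes the ambiguity in preference learning. Experiments show that the method significantly improves LLM factuality on both in-domain and out-of-domain datasets.

Link: https://arxiv.org/abs/2503.02846
Authors: Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; MMLab, The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ICLR 2025. Code is available at this https URL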

Abstract:Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or nonsensical information) when serving as AI assistants in various domains. Since hallucinations always come with truthful content in the LLM responses, previous factuality alignment methods that conduct response-level preference learning inevitably introduced noises during training. Therefore, this paper proposes a fine-grained factuality alignment method based on Direct Preference Optimization (DPO), called Mask-DPO. Incorporating sentence-level factuality as mask signals, Mask-DPO only learns from factually correct sentences in the preferred samples and prevents the penalty on factual contents in the not preferred samples, which resolves the ambiguity in the preference learning. Extensive experimental results demonstrate that Mask-DPO can significantly improve the factuality of LLMs responses to questions from both in-domain and out-of-domain datasets, although these questions and their corresponding topics are unseen during training. Only trained on the ANAH train set, the score of Llama3.1-8B-Instruct on the ANAH test set is improved from 49.19% to 77.53%, even surpassing the score of Llama3.1-70B-Instruct (53.44%), while its FactScore on the out-of-domain Biography dataset is also improved from 30.29% to 39.39%. We further study the generalization property of Mask-DPO using different training sample scaling strategies and find that scaling the number of topics in the dataset is more effective than the number of questions. We provide a hypothesis of what factual alignment is doing with LLMs, on the implication of this phenomenon, and conduct proof-of-concept experiments to verify it. We hope the method and the findings pave the way for future research on scaling factuality alignment.
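
The masking idea can be sketched on top of the standard DPO loss. In the sketch below, per-token log-probabilities under the policy and the reference model are combined only where a 0/1 mask is set: factual tokens in the preferred sample and non-factual tokens in the non-preferred sample. The function names, the token-level mask layout, and all numbers are assumptions for illustration; the paper's exact formulation may differ.

```python
import math


def masked_log_ratio(policy_logps, ref_logps, mask):
    """Sum of (policy - reference) token log-probs over tokens where mask == 1."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def mask_dpo_loss(chosen, rejected, beta=0.1):
    """DPO-style loss where each sample is a (policy_logps, ref_logps, mask) triple."""
    margin = beta * (masked_log_ratio(*chosen) - masked_log_ratio(*rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Preferred sample: tokens 0 and 2 belong to factually correct sentences (hypothetical).
chosen = ([-1.0, -2.0, -0.5], [-1.5, -2.0, -1.0], [1, 0, 1])
# Non-preferred sample: token 0 is factual and is masked out, token 1 is hallucinated.
rejected = ([-1.0, -3.0], [-1.0, -2.0], [0, 1])

loss = mask_dpo_loss(chosen, rejected, beta=0.1)
print(loss)
```

Setting every mask entry to 1 recovers the usual response-level DPO loss, so the mask is the only change relative to the baseline in this sketch.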

[NLP-8] AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation

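[Quick Read]: This paper addresses the neglect of token-level rewards in existing alignment methods for Large Language Models (LLMs). Most existing methods optimize every token with a sparse, response-level reward or preference annotation, which may wrongly punish high-quality tokens or encourage low-quality ones, hurting performance and slowing convergence. The paper proposes AlignDistil, a distillation method equivalent to reinforcement learning from human feedback (RLHF) that optimizes token-level rewards. The key step is to introduce the reward learned by direct preference optimization (DPO) into the RLHF objective and to prove that this objective is equivalent to a token-level distillation process whose teacher distribution linearly combines the logits of the DPO model and a reference model. To narrow the accuracy gap between the DPO-derived reward and a pure reward model, the authors build a contrastive DPO reward from a normal and a reverse DPO model. To avoid under- and over-optimization on different tokens, they design a token-adaptive logit extrapolation mechanism that constructs an appropriate teacher distribution for each token. Experiments show that AlignDistil outperforms existing methods and converges quickly thanks to its token-level distributional reward optimization.

Link: https://arxiv.org/abs/2503.02832
Authors: Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Affiliations: Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, China; Tencent Inc, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 2 figures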

Abstract:In modern large language models (LLMs), LLM alignment is of crucial importance and is typically achieved through methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, in most existing methods for LLM alignment, all tokens in the response are optimized using a sparse, response-level reward or preference annotation. The ignorance of token-level rewards may erroneously punish high-quality tokens or encourage low-quality tokens, resulting in suboptimal performance and slow convergence speed. To address this issue, we propose AlignDistil, an RLHF-equivalent distillation method for token-level reward optimization. Specifically, we introduce the reward learned by DPO into the RLHF objective and theoretically prove the equivalence between this objective and a token-level distillation process, where the teacher distribution linearly combines the logits from the DPO model and a reference model. On this basis, we further bridge the accuracy gap between the reward from the DPO model and the pure reward model, by building a contrastive DPO reward with a normal and a reverse DPO model. Moreover, to avoid under- and over-optimization on different tokens, we design a token adaptive logit extrapolation mechanism to construct an appropriate teacher distribution for each token. Experimental results demonstrate the superiority of our AlignDistil over existing methods and showcase fast convergence due to its token-level distributional reward optimization.
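
A teacher distribution built from a linear combination of two sets of logits can be sketched as follows. The specific parameterization used here (reference logits plus a weight `alpha` times the difference between DPO and reference logits, with `alpha` above 1 acting as extrapolation) is an assumption chosen for illustration, as are the function names and numbers; the paper derives its own coefficients and a per-token adaptive weight.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_distribution(dpo_logits, ref_logits, alpha):
    """Softmax over ref + alpha * (dpo - ref), an assumed linear combination."""
    combined = [r + alpha * (d - r) for d, r in zip(dpo_logits, ref_logits)]
    return softmax(combined)


# Next-token logits over a 3-word vocabulary (hypothetical values).
dpo = [2.0, 0.5, -1.0]
ref = [1.0, 1.0, 0.0]

teacher = teacher_distribution(dpo, ref, alpha=1.5)
print(teacher)
```

With `alpha = 0` the teacher equals the reference model, with `alpha = 1` it equals the DPO model, and larger values push further in the direction the DPO model moved away from the reference.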

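[NLP-9] Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

[Quick Read]: This paper addresses the Key-Value (KV) Cache becoming a significant memory bottleneck as model sizes and context lengths grow, which calls for methods that limit its size during generation. The key solution is Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs using a single context-agnostic projection. It relies on surprising properties of Query and Key vectors that allow attention scores to be approximated efficiently without computing attention maps. Unlike many alternatives, Q-Filters is compatible with FlashAttention because it does not require direct access to attention weights. In long-context settings, Q-Filters is competitive with attention-based compression methods such as SnapKV on retrieval tasks and consistently outperforms efficient compression schemes such as Streaming-LLM in generation setups.

Link: https://arxiv.org/abs/2503.02812
Authors: Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, Éric de la Clergerie, Benoît Sagot
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: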

Abstract:Autoregressive language models rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making it faster. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Contrarily to many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV in retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves a 99% accuracy in the needle-in-a-haystack task with a x32 compression level while reducing the generation perplexity drop by up to 65% in text generation compared to Streaming-LLM.
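
The general shape of projection-based KV filtering can be sketched in a few lines: score every cached key by its dot product with one fixed direction, then keep only the highest-scoring entries. How that direction is obtained, and which end of the ranking should be kept, are not specified here; the function name, the direction vector, and the keep-the-largest rule are assumptions for illustration only.

```python
def compress_kv(keys, values, direction, keep):
    """Keep the `keep` KV pairs whose keys score highest along `direction`."""
    scores = [sum(k_i * d_i for k_i, d_i in zip(key, direction)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    ranked.sort()  # restore the original cache order
    return [keys[i] for i in ranked], [values[i] for i in ranked]


# Tiny 2-dimensional cache with placeholder values (hypothetical data).
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = ["a", "b", "c", "d"]
direction = [1.0, 0.0]

kept_keys, kept_values = compress_kv(keys, values, direction, keep=2)
print(kept_values)
```

Note that the scoring uses only the keys and a fixed vector, never the attention weights, which is the property that keeps this style of filtering compatible with fused attention kernels.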

[NLP-10] IterPref: Focal Preference Learning for Code Generation via Iterative Debugging

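[Quick Read]: This paper addresses two problems with existing preference learning methods for code large language models (Code LLMs). First, preference pairs are built only from test-case pass rates, which does not pinpoint the specific errors in the code and prevents the model from learning more informative error-correction patterns. Second, aligning failing code as a whole lacks the granularity needed to capture meaningful error-resolution relationships. The paper proposes IterPref, a new preference alignment framework that mimics human iterative debugging: a tailored Direct Preference Optimization (DPO) algorithm explicitly locates error regions and aligns the corresponding tokens. To generate informative pairs, the authors introduce the CodeFlow dataset, in which samples are iteratively refined until they pass the tests, so that the modifications capture error corrections. Experiments show that IterPref yields significant gains across code generation tasks and produces fewer errors.

Link: https://arxiv.org/abs/2503.02783
Authors: Jie Wu, Haoling Li, Xin Zhang, Jianwen Luo, Yangyu Huang, Ruihang Chu, Yujiu Yang, Scarlett Li
Affiliations: Tsinghua University; Microsoft Research; CASIA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The code and data will be released soon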

Abstract:Preference learning enhances Code LLMs beyond supervised fine-tuning by leveraging relative quality comparisons. Existing methods construct preference pairs from candidates based on test case success, treating the higher pass rate sample as positive and the lower as negative. However, this approach does not pinpoint specific errors in the code, which prevents the model from learning more informative error correction patterns, as aligning failing code as a whole lacks the granularity needed to capture meaningful error-resolution relationships. To address these issues, we propose IterPref, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. IterPref explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To generate informative pairs, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with IterPref achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that IterPref yields fewer errors. Our code and data will be made publicly available.
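
One simple way to locate the region that changed between a failing version and its corrected version is a line-level diff. The sketch below uses Python's standard `difflib` module for this; the helper name and the toy code snippets are illustrative assumptions, and the paper's own error-localization procedure may work differently.

```python
import difflib


def changed_line_spans(failing_code, passing_code):
    """Return the non-equal diff opcodes between two code versions, by line."""
    a = failing_code.splitlines()
    b = passing_code.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# A failing draft and its corrected version (hypothetical example).
failing = "def mean(xs):\n    total = sum(xs)\n    return total / len(xs) + 1\n"
passing = "def mean(xs):\n    total = sum(xs)\n    return total / len(xs)\n"

spans = changed_line_spans(failing, passing)
print(spans)
```

Each returned tuple has the form `(tag, i1, i2, j1, j2)`, where lines `i1:i2` of the failing version correspond to lines `j1:j2` of the passing version. A preference objective could then concentrate on the tokens inside those spans instead of the whole program.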

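[NLP-11] Implicit Bias in LLMs: A Survey

[Quick Read]: This survey addresses the study and evaluation of implicit bias in Large Language Models (LLMs). It offers a systematic way to identify, categorize, and evaluate implicit bias in LLMs and discusses potential mitigation strategies. Drawing on concepts and frameworks from psychology, such as the Implicit Association Test (IAT), it groups detection methods into three main approaches: word association, task-oriented text generation, and decision-making. Evaluation metrics are divided into single-value-based and comparison-value-based metrics, and datasets into sentences with masked tokens and complete sentences, covering the broad range of LLM applications. Although research on mitigating implicit bias in LLMs is still limited, the survey summarizes existing efforts and outlines future challenges, aiming to give researchers a clear guide and to inspire new ideas.

Link: https://arxiv.org/abs/2503.02776
Authors: Xinru Lin, Luyang Li
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: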

Abstract:Due to the implement of guardrails by developers, Large language models (LLMs) have demonstrated exceptional performance in explicit bias tests. However, bias in LLMs may occur not only explicitly, but also implicitly, much like humans who consciously strive for impartiality yet still harbor implicit bias. The unconscious and automatic nature of implicit bias makes it particularly challenging to study. This paper provides a comprehensive review of the existing literature on implicit bias in LLMs. We begin by introducing key concepts, theories and methods related to implicit bias in psychology, extending them from humans to LLMs. Drawing on the Implicit Association Test (IAT) and other psychological frameworks, we categorize detection methods into three primary approaches: word association, task-oriented text generation and decision-making. We divide our taxonomy of evaluation metrics for implicit bias into two categories: single-value-based metrics and comparison-value-based metrics. We classify datasets into two types: sentences with masked tokens and complete sentences, incorporating datasets from various domains to reflect the broad application of LLMs. Although research on mitigating implicit bias in LLMs is still limited, we summarize existing efforts and offer insights on future challenges. We aim for this work to serve as a clear guide for researchers and inspire innovative ideas to advance exploration in this task.

[NLP-12] InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training

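[Quick Read]: This paper addresses the weak performance of current speech large language models (SpeechLLMs) in following speech instructions, in particular that model intelligence drops significantly with speech-form input compared with text-form input. It proposes InSerter (Interleaved Speech-Text Representation Pre-training), a simple and scalable training method that pre-trains on large-scale unsupervised interleaved speech-text sequences, where the speech is synthesized via text-to-speech from randomly selected segments of a large text corpus. The key point is that, without the meticulous design of data pairs in post-training, the model learns to generate text continuations for the provided speech segments, which mitigates the semantic inconsistency between speech and text representations. The paper also introduces SpeechInstructBench, a benchmark for systematically evaluating speech instruction following. InSerter achieves SOTA performance on it and superior or competitive results across diverse speech processing tasks.

Link: https://arxiv.org/abs/2503.02769
Authors: Dingdong Wang, Jin Xu, Ruihang Chu, Zhifang Guo, Xiong Wang, Jincenzi Wu, Dongchao Yang, Shengpeng Ji, Junyang Lin
Affiliations: The Chinese University of Hong Kong; Alibaba Group
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: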

Abstract:Recent advancements in speech large language models (SpeechLLMs) have attracted considerable attention. Nonetheless, current methods exhibit suboptimal performance in adhering to speech instructions. Notably, the intelligence of models significantly diminishes when processing speech-form input as compared to direct text-form input. Prior work has attempted to mitigate this semantic inconsistency between speech and text representations through techniques such as representation and behavior alignment, which involve the meticulous design of data pairs during the post-training phase. In this paper, we introduce a simple and scalable training method called InSerter, which stands for Interleaved Speech-Text Representation Pre-training. InSerter is designed to pre-train large-scale unsupervised speech-text sequences, where the speech is synthesized from randomly selected segments of an extensive text corpus using text-to-speech conversion. Consequently, the model acquires the ability to generate textual continuations corresponding to the provided speech segments, obviating the need for intensive data design endeavors. To systematically evaluate speech instruction-following capabilities, we introduce SpeechInstructBench, the first comprehensive benchmark specifically designed for speech-oriented instruction-following tasks. Our proposed InSerter achieves SOTA performance in SpeechInstructBench and demonstrates superior or competitive results across diverse speech processing tasks.

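[NLP-13] From Metaphor to Mechanism: How LLMs Decode Traditional Chinese Medicine Symbolic Language for Modern Clinical Relevance

[Quick Read]: This paper addresses the difficulty of automatically mapping the many metaphorical expressions in Traditional Chinese Medicine (TCM) to anatomically driven Western medical (WM) concepts, a challenge for both automated language processing and clinical practice. To bridge this gap, it proposes a multi-agent and Chain-of-Thought (CoT) framework for interpreting TCM metaphors accurately and mapping them to WM pathophysiology. The key is to combine domain-specialized agents (TCM Expert, WM Expert) with a Coordinator Agent and to use stepwise chain-of-thought prompts to keep the reasoning transparent and resolve conflicts. The paper lays out the theoretical basis for interpreting metaphors across medical paradigms, presents a system design that integrates multi-agent collaboration with CoT reasoning, and discusses the potential benefits and limitations of the approach.

Link: https://arxiv.org/abs/2503.02760
Authors: Jiacheng Tang, Nankai Wu, Fan Gao, Chengxiao Dai, Mengyao Zhao, Xinjie Zhao
Affiliations: School of Information Science and Engineering, Shandong Normal University; School of Pharmacy, China Pharmaceutical University; Department of Civil Engineering, The University of Tokyo; Faculty of Information and Communication Technology, Universiti Tunku Abdul Rahman
Subjects: Computation and Language (cs.CL)
Comments: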

Abstract:Metaphorical expressions are abundant in Traditional Chinese Medicine (TCM), conveying complex disease mechanisms and holistic health concepts through culturally rich and often abstract terminology. Bridging these metaphors to anatomically driven Western medical (WM) concepts poses significant challenges for both automated language processing and real-world clinical practice. To address this gap, we propose a novel multi-agent and chain-of-thought (CoT) framework designed to interpret TCM metaphors accurately and map them to WM pathophysiology. Specifically, our approach combines domain-specialized agents (TCM Expert, WM Expert) with a Coordinator Agent, leveraging stepwise chain-of-thought prompts to ensure transparent reasoning and conflict resolution. We detail a methodology for building a metaphor-rich TCM dataset, discuss strategies for effectively integrating multi-agent collaboration and CoT reasoning, and articulate the theoretical underpinnings that guide metaphor interpretation across distinct medical paradigms. We present a comprehensive system design and highlight both the potential benefits and limitations of our approach, while leaving placeholders for future experimental validation. Our work aims to support clinical decision-making, cross-system educational initiatives, and integrated healthcare research, ultimately offering a robust scaffold for reconciling TCM’s symbolic language with the mechanistic focus of Western medicine.

[NLP-14] BatchGEMBA: Token-Efficient Machine Translation Evaluation with Batched Prompting and Prompt Compression

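[Quick Read]: This paper addresses the significant token overhead and computational inefficiency of single-example prompting in Large Language Model (LLM)-based evaluation of natural language generation. It proposes BatchGEMBA-MQM, a framework that combines batched prompting with the GEMBA-MQM metric for machine translation evaluation. Aggregating multiple translation examples into a single prompt reduces token usage by 2-4 times relative to single-example prompting, depending on the batch size. The paper also proposes a batching-aware prompt compression model that reduces token usage by a further 13-15% on average and can help mitigate the quality degradation caused by batching. The core of the solution is to combine batching with prompt compression to make evaluation more efficient while limiting quality loss.

Link: https://arxiv.org/abs/2503.02756
Authors: Daniil Larionov, Steffen Eger
Affiliations: NLLG; University of Mannheim; University of Technology Nuremberg
Subjects: Computation and Language (cs.CL)
Comments: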

Abstract:Recent advancements in Large Language Model (LLM)-based Natural Language Generation evaluation have largely focused on single-example prompting, resulting in significant token overhead and computational inefficiencies. In this work, we introduce BatchGEMBA-MQM, a framework that integrates batched prompting with the GEMBA-MQM metric for machine translation evaluation. Our approach aggregates multiple translation examples into a single prompt, reducing token usage by 2-4 times (depending on the batch size) relative to single-example prompting. Furthermore, we propose a batching-aware prompt compression model that achieves an additional token reduction of 13-15% on average while also showing ability to help mitigate batching-induced quality degradation. Evaluations across several LLMs (GPT-4o, GPT-4o-mini, Mistral Small, Phi4, and CommandR7B) and varying batch sizes reveal that while batching generally negatively affects quality (but sometimes not substantially), prompt compression does not degrade further, and in some cases, recovers quality loss. For instance, GPT-4o retains over 90% of its baseline performance at a batch size of 4 when compression is applied, compared to a 44.6% drop without compression. We plan to release our code and trained models at this https URL to support future research in this domain.
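
The token saving from batching comes from sending the shared instruction once instead of once per example. The sketch below builds a single-example prompt and a batched prompt from the same pieces and compares their sizes. The prompt wording, the helper names, and the whitespace token count are assumptions for illustration and do not reproduce the actual GEMBA-MQM prompt or a real tokenizer.

```python
def build_batched_prompt(instruction, pairs):
    """Put one shared instruction in front of several numbered examples."""
    blocks = [instruction]
    for idx, (source, translation) in enumerate(pairs, start=1):
        blocks.append(f"Example {idx}\nSource: {source}\nTranslation: {translation}")
    return "\n\n".join(blocks)


def count_tokens(text):
    # Whitespace tokens as a crude stand-in for a real tokenizer (assumption).
    return len(text.split())


instruction = "Identify translation errors in each example and label their severity."
pairs = [
    ("Guten Morgen", "Good morning"),
    ("Wie geht es dir?", "How are you?"),
    ("Ich habe Hunger", "I am hungry"),
    ("Das ist ein Buch", "This is a book"),
]

# Four separate prompts versus one prompt holding all four examples.
single_total = sum(count_tokens(build_batched_prompt(instruction, [p])) for p in pairs)
batched_total = count_tokens(build_batched_prompt(instruction, pairs))
print(single_total, batched_total)
```

In this toy setting the saving equals the instruction length times the number of avoided repetitions, so the longer the shared instruction is relative to each example, the larger the benefit of batching.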

[NLP-15] Large Language Models for Multilingual Previously Fact-Checked Claim Detection

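[Quick Read]: This paper addresses the automatic detection of previously fact-checked claims in multilingual settings. False information crosses language boundaries, and human fact-checkers struggle to avoid duplicating work. The paper evaluates Large Language Models (LLMs) on detecting previously fact-checked claims across 20 languages in both monolingual and cross-lingual settings. The key findings are that LLMs perform well for high-resource languages but struggle with low-resource languages, and that translating the original texts into English is beneficial for low-resource languages. These findings highlight the potential of LLMs for this multilingual task and provide a foundation for further research.

Link: https://arxiv.org/abs/2503.02737
Authors: Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Tatiana Anikina, Michal Gregor, Marián Šimko
Affiliations: Faculty of Information Technology, Brno University of Technology, Czech Republic; Kempelen Institute of Intelligent Technologies (KInIT), Slovakia; German Research Center for Artificial Intelligence (DFKI), Germany; Centre for European Research in Trusted AI (CERTAIN)
Subjects: Computation and Language (cs.CL)
Comments: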

Abstract:In our era of widespread false information, human fact-checkers often face the challenge of duplicating efforts when verifying claims that may have already been addressed in other countries or languages. As false information transcends linguistic boundaries, the ability to automatically detect previously fact-checked claims across languages has become an increasingly important task. This paper presents the first comprehensive evaluation of large language models (LLMs) for multilingual previously fact-checked claim detection. We assess seven LLMs across 20 languages in both monolingual and cross-lingual settings. Our results show that while LLMs perform well for high-resource languages, they struggle with low-resource languages. Moreover, translating original texts into English proved to be beneficial for low-resource languages. These findings highlight the potential of LLMs for multilingual previously fact-checked claim detection and provide a foundation for further research on this promising application of LLMs.

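[NLP-16] Evaluating Knowledge Generation and Self-Refinement Strategies for LLM-based Column Type Annotation

[Quick read]: This paper investigates how knowledge generation and self-refinement strategies can improve column type annotation (CTA) based on Large Language Models (LLMs). The key is a comparison of the effectiveness and efficiency of several strategies: using LLMs to generate term definitions, error-based refinement of definitions, self-correction, and fine-tuning with examples and term definitions. The best strategy depends on the model/dataset combination. Using training data to generate label definitions outperforms using the same data as in-context demonstrations on two out of three datasets with OpenAI models, and LLM self-refinement of the definitions raises F1 by 3.9% on average in 10 out of 12 setups. Combining fine-tuned models with self-refined term definitions gives the best overall performance, with F1 at least 3% higher than zero-shot prompting of fine-tuned models. The cost analysis shows that self-refinement via prompting is more cost-efficient when few tables need to be annotated, while fine-tuning is more efficient for large numbers of tables.

Link: https://arxiv.org/abs/2503.02718
Authors: Keti Korini,Christian Bizer
Affiliations: University of Mannheim
Categories: Computation and Language (cs.CL)
Comments: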

Abstract:Understanding the semantics of columns in relational tables is an important pre-processing step for indexing data lakes in order to provide rich data search. An approach to establishing such understanding is column type annotation (CTA) where the goal is to annotate table columns with terms from a given vocabulary. This paper experimentally compares different knowledge generation and self-refinement strategies for LLM-based column type annotation. The strategies include using LLMs to generate term definitions, error-based refinement of term definitions, self-correction, and fine-tuning using examples and term definitions. We evaluate these strategies along two dimensions: effectiveness measured as F1 performance and efficiency measured in terms of token usage and cost. Our experiments show that the best performing strategy depends on the model/dataset combination. We find that using training data to generate label definitions outperforms using the same data as demonstrations for in-context learning for two out of three datasets using OpenAI models. The experiments further show that using the LLMs to refine label definitions brings an average increase of 3.9% F1 in 10 out of 12 setups compared to the performance of the non-refined definitions. Combining fine-tuned models with self-refined term definitions results in the overall highest performance, outperforming zero-shot prompting fine-tuned models by at least 3% in F1 score. The costs analysis shows that while reaching similar F1 score, self-refinement via prompting is more cost efficient for use cases requiring smaller amounts of tables to be annotated while fine-tuning is more efficient for large amounts of tables.

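[NLP-17] Multilingualism, Transnationality, and K-pop in the Online #StopAsianHate Movement

[Quick read]: This paper addresses two open questions: how far the #StopAsianHate (SAH) movement spread beyond English-speaking participants, and whether and how it was sustained over the long term. It analyzes 6.5 million "#StopAsianHate" tweets from 2.2 million users around the globe, spanning 60 languages, in the first study of the non-English and transnational component of the online SAH movement. Combining topic modeling, user modeling, and hand annotation, the paper identifies and characterizes the dominant discussions and their participants, and compares English and non-English topics and users. Spikes in English tweets are driven by violent crimes in the US, whereas spikes in non-English tweets are driven by transnational incidents of anti-Asian sentiment toward symbolic representatives of Asian nations. The paper also finds that global K-pop fans were quick to adopt the movement and sustained it for longer than any other user group. These methods and findings contribute to understanding the transnationality and evolution of the SAH movement.

Link: https://arxiv.org/abs/2503.02707
Authors: Tessa Masis,Zhangqi Duan,Weiai Wayne Xu,Ethan Zuckerman,Jane Yeahin Pyo,Brendan O’Connor
Affiliations: University of Massachusetts Amherst
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: WebSci’25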

Abstract:The #StopAsianHate (SAH) movement is a broad social movement against violence targeting Asians and Asian Americans, beginning in 2021 in response to racial discrimination related to COVID-19 and sparking worldwide conversation about anti-Asian hate. However, research on the online SAH movement has focused on English-speaking participants so the spread of the movement outside of the United States is largely unknown. In addition, there have been no long-term studies of SAH so the extent to which it has been successfully sustained over time is not well understood. We present an analysis of 6.5 million “#StopAsianHate” tweets from 2.2 million users all over the globe and spanning 60 different languages, constituting the first study of the non-English and transnational component of the online SAH movement. Using a combination of topic modeling, user modeling, and hand annotation, we identify and characterize the dominant discussions and users participating in the movement and draw comparisons of English versus non-English topics and users. We discover clear differences in events driving topics, where spikes in English tweets are driven by violent crimes in the US but spikes in non-English tweets are driven by transnational incidents of anti-Asian sentiment towards symbolic representatives of Asian nations. We also find that global K-pop fans were quick to adopt the SAH movement and, in fact, sustained it for longer than any other user group. Our work contributes to understanding the transnationality and evolution of the SAH movement, and more generally to exploring upward scale shift and public attention in large-scale multilingual online activism.

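[NLP-18] MPO: Boosting LLM Agents with Meta Plan Optimization

[Quick read]: This paper addresses two problems of agents driven by Large Language Models (LLMs) in interactive planning tasks: they are prone to planning hallucinations, and each new agent requires retraining. The paper proposes the Meta Plan Optimization (MPO) framework. Its key idea is to give the agent explicit guidance through high-level meta plans and to optimize those meta plans continuously using feedback from the agent's task execution. This improves the agent's planning and generalization abilities, avoids reliance on complex knowledge whose quality is hard to guarantee, and yields a plug-and-play solution.

Link: https://arxiv.org/abs/2503.02682
Authors: Weimin Xiong,Yifan Song,Qingxiu Dong,Bingchan Zhao,Feifan Song,Xun Wang,Sujian Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: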

Abstract:Recent advancements in large language models (LLMs) have enabled LLM-based agents to successfully tackle interactive planning tasks. However, despite their successes, existing approaches often suffer from planning hallucinations and require retraining for each new agent. To address these challenges, we propose the Meta Plan Optimization (MPO) framework, which enhances agent planning capabilities by directly incorporating explicit guidance. Unlike previous methods that rely on complex knowledge, which either require significant human effort or lack quality assurance, MPO leverages high-level general guidance through meta plans to assist agent planning and enables continuous optimization of the meta plans based on feedback from the agent’s task execution. Our experiments conducted on two representative tasks demonstrate that MPO significantly outperforms existing baselines. Moreover, our analysis indicates that MPO provides a plug-and-play solution that enhances both task completion efficiency and generalization capabilities in previous unseen scenarios.

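[NLP-19] Are some books better than others?

[Quick read]: This paper quantifies how strongly book reviews are determined by a book's actual contents versus the idiosyncratic tendencies of the reader. Analyzing 624,320 numerical and textual book reviews, it examines the relationship between the contents of professionally published books and ordinary readers' enjoyment, and estimates how much of a review reflects the book rather than the reviewer. For popular fiction and non-fiction, online reviews carry up to ten times more information about the reviewer than about the book, and different readers struggle to converge in their assessments of the same book. Evaluations generalize better across experienced review writers than across casual readers, and the specific issues raised in one review poorly predict those raised in another review of the same book. The authors conclude that extreme perspectivism is a justifiable position in literary evaluation.

Link: https://arxiv.org/abs/2503.02671
Authors: Hannes Rosenbusch,Luke Korthals
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: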

Abstract:Scholars, awards committees, and laypeople frequently discuss the merit of written works. Literary professionals and journalists differ in how much perspectivism they concede in their book reviews. Here, we quantify how strongly book reviews are determined by the actual book contents vs. idiosyncratic reader tendencies. In our analysis of 624,320 numerical and textual book reviews, we find that the contents of professionally published books are not predictive of a random reader’s reading enjoyment. Online reviews of popular fiction and non-fiction books carry up to ten times more information about the reviewer than about the book. For books of a preferred genre, readers might be less likely to give low ratings, but still struggle to converge in their relative assessments. We find that book evaluations generalize more across experienced review writers than casual readers. When discussing specific issues with a book, one review text had poor predictability of issues brought up in another review of the same book. We conclude that extreme perspectivism is a justifiable position when researching literary quality, bestowing literary awards, and designing recommendation systems.

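[NLP-20] Multidimensional Consistency Improves Reasoning in Language Models

[Quick read]: This paper addresses the poor answer consistency of Large Language Models (LLMs) on complex reasoning tasks, which stems from their sensitivity to input variation. It proposes a framework called Multidimensional Reasoning Consistency, which systematically pushes models to diversify their solution paths under different input variations and thereby tests answer consistency across conditions. The framework induces variation in the order of shots in the prompt, the problem phrasing, and the language used. Experiments show that aggregating consistency across these dimensions consistently improves mathematical reasoning performance on the monolingual GSM8K dataset and the multilingual MGSM dataset, especially for smaller models.

Link: https://arxiv.org/abs/2503.02670
Authors: Huiyuan Lai,Xiao Zhang,Malvina Nissim
Affiliations: CLCG, University of Groningen, The Netherlands
Categories: Computation and Language (cs.CL)
Comments: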

Abstract:While Large language models (LLMs) have proved able to address some complex reasoning tasks, we also know that they are highly sensitive to input variation, which can lead to different solution paths and final answers. Answer consistency across input variations can thus be taken as a sign of stronger confidence. Leveraging this insight, we introduce a framework, Multidimensional Reasoning Consistency where, focusing on math problems, models are systematically pushed to diversify solution paths towards a final answer, thereby testing them for answer consistency across multiple input variations. We induce variations in (i) order of shots in prompt, (ii) problem phrasing, and (iii) languages used. Extensive experiments on a large range of open-source state-of-the-art LLMs of various sizes show that reasoning consistency differs by variation dimension, and that by aggregating consistency across dimensions, our framework consistently enhances mathematical reasoning performance on both monolingual dataset GSM8K and multilingual dataset MGSM, especially for smaller models.
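
The aggregation step described above can be illustrated with a small sketch. It assumes a plain majority vote over the final answers pooled from all variation dimensions; the abstract does not state the exact aggregation rule, and the function name, dimension keys, and sample answers below are hypothetical.

```python
from collections import Counter


def aggregate_answers(answers_by_dimension):
    """Majority vote over final answers pooled from every variation dimension.

    Assumption: simple majority voting; the paper may weight dimensions differently.
    """
    pooled = [a for answers in answers_by_dimension.values() for a in answers]
    if not pooled:
        return None
    # most_common(1) returns a one-element list of (answer, count)
    return Counter(pooled).most_common(1)[0][0]


# Hypothetical final answers to one math problem under three kinds of variation
answers = {
    "shot_order": ["18", "18", "20"],
    "phrasing": ["18", "17", "18"],
    "language": ["18", "20", "18"],
}
print(aggregate_answers(answers))
```

Pooling across dimensions means an answer that survives reordered shots, rephrasing, and translation collects more votes than one that appears under a single kind of prompt.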

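[NLP-21] LoRA-Null: Low-Rank Adaptation via Null Space for Large Language Models

[Quick read]: This paper addresses the catastrophic forgetting of pre-trained world knowledge that occurs when Large Language Models (LLMs) are fine-tuned with Low-Rank Adaptation (LoRA). The proposed solution is LoRA-Null, a low-rank adaptation method based on the null space. Concretely, a few data samples are collected at random and their activations after passing through an LLM layer are captured; Singular Value Decomposition of the input activations yields their null space, and the projection of the pre-trained weights onto this null space initializes the adapters. Freezing the down-projection matrices during fine-tuning preserves pre-trained world knowledge even better. Experiments on the LLaMA series (LLaMA2, LLaMA3, LLaMA3.1, and LLaMA3.2) across code generation, mathematical reasoning, and instruction-following tasks show that the method preserves the original world knowledge while maintaining strong fine-tuning performance, and the paper also provides a theoretical guarantee.

Link: https://arxiv.org/abs/2503.02659
Authors: Pengwei Tang,Yong Liu,Dongjie Zhang,Xing Wu,Debing Zhang
Affiliations: Renmin University of China; Xiaohongshu Inc; Institute of Information Engineering, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: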

Abstract:Low-Rank Adaptation (LoRA) is the leading parameter-efficient fine-tuning method for Large Language Models (LLMs). However, the fine-tuned LLMs encounter the issue of catastrophic forgetting of the pre-trained world knowledge. To address this issue, inspired by theoretical insights of null space, we propose LoRA-Null, i.e., Low-Rank Adaptation via null space, which builds adapters initialized from the null space of the pre-trained knowledge activation. Concretely, we randomly collect a few data samples and capture their activations after passing through the LLM layer. We perform Singular Value Decomposition on the input activations to obtain their null space. We use the projection of the pre-trained weights onto the null space as the initialization for adapters. Experimental results demonstrate that this initialization approach can effectively preserve the original pre-trained world knowledge of the LLMs during fine-tuning. Additionally, if we freeze the values of the down-projection matrices during fine-tuning, it achieves even better preservation of the pre-trained world knowledge. LoRA-Null effectively preserves pre-trained world knowledge while maintaining strong fine-tuning performance, as validated by extensive experiments on LLaMA series (LLaMA2, LLaMA3, LLaMA3.1, and LLaMA3.2) across Code, Math, and Instruction Following tasks. We also provide a theoretical guarantee for the capacity of LoRA-Null to retain pre-trained knowledge. Code is in this https URL.
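
The null-space initialization in the abstract can be written out as a short derivation. This is one possible reading, not the paper's notation: $W$ is a pre-trained weight matrix, $X$ stacks the collected input activations as columns, $U_{\text{null}}$ holds the left singular vectors of $X$ with the smallest singular values, and the split into $B_0$ and $A_0$ is an illustrative choice.

```latex
\begin{align}
X &= U \Sigma V^\top, \qquad U = [\, U_{\text{range}} \;\; U_{\text{null}} \,], \qquad U_{\text{null}}^\top X \approx 0 \\
B_0 A_0 &= W U_{\text{null}} U_{\text{null}}^\top, \qquad \text{e.g. } B_0 = W U_{\text{null}},\; A_0 = U_{\text{null}}^\top \\
W_{\text{res}} &= W - B_0 A_0 \quad \text{(kept frozen)} \\
B_0 A_0 X &= W U_{\text{null}} \left( U_{\text{null}}^\top X \right) \approx 0
\end{align}
```

Under this reading the adapter contributes almost nothing on the sampled pre-training activations at initialization. If the down-projection $A = U_{\text{null}}^\top$ stays frozen, then $B A X \approx 0$ for any learned $B$, which is consistent with the abstract's remark that freezing the down-projection matrices preserves knowledge better.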

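[NLP-22] Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks

[Quick read]: This paper addresses the fact that decoder-based Transformers have not fully displaced encoder architectures in natural language processing: encoder-only models still dominate tasks such as classification, regression, and ranking. The limitation stems from the inherent structure of decoder-based models, which restricts their direct application to these non-generative tasks. The paper introduces Gemma Encoder, which adapts the Gemma decoder model to an encoder architecture and thereby unlocks its potential for a wider range of non-generative applications. To this end, the authors systematically analyze pooling strategies, attention mechanisms, and hyperparameters (e.g., dropout rate), and benchmark Gemma Encoder on GLUE and the MS MARCO ranking benchmark to demonstrate its effectiveness and versatility.

Link: https://arxiv.org/abs/2503.02656
Authors: Paul Suganthan,Fedor Moiseev,Le Yan,Junru Wu,Jianmo Ni,Jay Han,Imed Zitouni,Enrique Alfonseca,Xuanhui Wang,Zhe Dong
Affiliations: Google
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: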

Abstract:Decoder-based transformers, while revolutionizing language modeling and scaling to immense sizes, have not completely overtaken encoder-heavy architectures in natural language processing. Specifically, encoder-only models remain dominant in tasks like classification, regression, and ranking. This is primarily due to the inherent structure of decoder-based models, which limits their direct applicability to these tasks. In this paper, we introduce Gemma Encoder, adapting the powerful Gemma decoder model to an encoder architecture, thereby unlocking its potential for a wider range of non-generative applications. To optimize the adaptation from decoder to encoder, we systematically analyze various pooling strategies, attention mechanisms, and hyperparameters (e.g., dropout rate). Furthermore, we benchmark Gemma Encoder against established approaches on the GLUE benchmarks, and MS MARCO ranking benchmark, demonstrating its effectiveness and versatility.
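
To make "pooling strategies" concrete, here is a minimal sketch of two common ways to turn per-token embeddings into one sequence vector. The abstract does not say which strategies were compared, so mean pooling and last-token pooling are assumptions, and the function names and toy data are hypothetical.

```python
def mean_pool(token_embeddings, mask):
    """Average the embeddings of non-padding tokens (mask value 1)."""
    kept = [vec for vec, keep in zip(token_embeddings, mask) if keep]
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def last_token_pool(token_embeddings, mask):
    """Use the embedding of the last non-padding token."""
    last = max(i for i, keep in enumerate(mask) if keep)
    return token_embeddings[last]


# Toy sequence: two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
print(last_token_pool(tokens, mask))
```

The pooled vector would then feed a small task head for classification, regression, or ranking.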

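[NLP-23] The Effectiveness of Large Language Models in Transforming Unstructured Text to Standardized Formats

[Quick read]: This paper addresses the fundamental challenge that the rapid growth of unstructured text data poses for modern data management and information retrieval. Its goal is to explore how well Large Language Models (LLMs) can transform domain-specific unstructured text into standardized, structured formats, a capability that could reshape data processing workflows across industries. The paper systematically evaluates four LLMs (GPT-4o, GPT-4o-mini, Llama3.1:70b, and Llama3.1:8b) on converting unstructured recipe text into the structured Cooklang format, and introduces an evaluation approach that combines traditional metrics (WER, ROUGE-L, TER) with specialized metrics for semantic element identification. GPT-4o with few-shot prompting performs best (ROUGE-L: 0.9722, WER: 0.0730), showing that LLMs can reliably perform this kind of transformation without extensive training. The study also finds that smaller models such as Llama3.1:8b show potential for optimization through targeted fine-tuning, opening a path to automated structured data generation in domains ranging from medical records to technical documentation.

Link: https://arxiv.org/abs/2503.02650
Authors: William Brach,Kristián Košťál,Michal Ries
Affiliations: Slovak Technical University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: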

Abstract:The exponential growth of unstructured text data presents a fundamental challenge in modern data management and information retrieval. While Large Language Models (LLMs) have shown remarkable capabilities in natural language processing, their potential to transform unstructured text into standardized, structured formats remains largely unexplored - a capability that could revolutionize data processing workflows across industries. This study breaks new ground by systematically evaluating LLMs’ ability to convert unstructured recipe text into the structured Cooklang format. Through comprehensive testing of four models (GPT-4o, GPT-4o-mini, Llama3.1:70b, and Llama3.1:8b), an innovative evaluation approach is introduced that combines traditional metrics (WER, ROUGE-L, TER) with specialized metrics for semantic element identification. Our experiments reveal that GPT-4o with few-shot prompting achieves breakthrough performance (ROUGE-L: 0.9722, WER: 0.0730), demonstrating for the first time that LLMs can reliably transform domain-specific unstructured text into structured formats without extensive training. Although model performance generally scales with size, we uncover surprising potential in smaller models like Llama3.1:8b for optimization through targeted fine-tuning. These findings open new possibilities for automated structured data generation across various domains, from medical records to technical documentation, potentially transforming the way organizations process and utilize unstructured information.

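[NLP-24] Towards Event Extraction with Massive Types: LLM-based Collaborative Annotation and Partitioning Extraction

[Quick read]: This paper addresses two core challenges in building a general-purpose event extraction system: the absence of an efficient and effective annotation method, and the absence of a powerful extraction method that can handle massive numbers of event types. For the first challenge, it proposes a collaborative annotation method based on Large Language Models (LLMs). Multiple LLMs collaborate to refine the trigger-word annotations obtained from distant supervision and then annotate arguments, and a voting phase consolidates the annotation preferences of the different LLMs. The result is the EEMT dataset, the largest event extraction dataset to date, with over 200,000 samples, 3,465 event types, and 6,297 role types. For the second challenge, the paper proposes an LLM-based partitioning extraction method, LLM-PEE. To overcome the limited context length of LLMs, LLM-PEE first recalls candidate event types and then splits them into multiple partitions from which LLMs extract events. In the supervised setting, LLM-PEE outperforms state-of-the-art methods by 5.4 in event detection and 6.1 in argument extraction; in the zero-shot setting, it improves on mainstream LLMs by up to 12.9, demonstrating strong generalization.

Link: https://arxiv.org/abs/2503.02628
Authors: Wenxuan Liu,Zixuan Li,Long Bai,Yuxin Zuo,Daozhu Xu,Xiaolong Jin,Jiafeng Guo,Xueqi Cheng
Affiliations: School of Computer Science and Technology, University of Chinese Academy of Sciences; Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences; State Key Laboratory of Geo-Information Engineering, Xi’an, Shaanxi, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress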

Abstract:Developing a general-purpose extraction system that can extract events with massive types is a long-standing target in Event Extraction (EE). In doing so, the challenge comes from two aspects: 1) The absence of an efficient and effective annotation method. 2) The absence of a powerful extraction method can handle massive types. For the first challenge, we propose a collaborative annotation method based on Large Language Models (LLMs). Through collaboration among multiple LLMs, it first refines annotations of trigger words from distant supervision and then carries out argument annotation. Next, a voting phase consolidates the annotation preferences across different LLMs. Finally, we create the EEMT dataset, the largest EE dataset to date, featuring over 200,000 samples, 3,465 event types, and 6,297 role types. For the second challenge, we propose an LLM-based Partitioning EE method called LLM-PEE. To overcome the limited context length of LLMs, LLM-PEE first recalls candidate event types and then splits them into multiple partitions for LLMs to extract events. The results in the supervised setting show that LLM-PEE outperforms the state-of-the-art methods by 5.4 in event detection and 6.1 in argument extraction. In the zero-shot setting, LLM-PEE achieves up to 12.9 improvement compared to mainstream LLMs, demonstrating its strong generalization capabilities.
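
The partitioning idea can be sketched in a few lines. This assumes fixed-size partitions of already-recalled candidate types, each of which would go into a separate LLM prompt; how LLM-PEE actually recalls candidates or sizes its partitions is not given in the abstract, and the function name and event types below are hypothetical.

```python
def partition_event_types(candidate_types, partition_size):
    """Split recalled candidate event types into fixed-size partitions.

    Assumption: each partition is small enough to fit one prompt's context window.
    """
    if partition_size < 1:
        raise ValueError("partition_size must be at least 1")
    return [
        candidate_types[i:i + partition_size]
        for i in range(0, len(candidate_types), partition_size)
    ]


# Hypothetical candidate types recalled for one input sentence
candidates = ["Attack", "Meeting", "Election", "Arrest", "Transfer"]
for partition in partition_event_types(candidates, 2):
    print(partition)
```

Each partition would be queried independently and the extracted events merged afterwards, so no single prompt has to list thousands of event types.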

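[NLP-25] ttta: Tools for Temporal Text Analysis

[Quick read]: This paper addresses the fact that conventional natural language processing (NLP) techniques ignore how the semantics and context of text data change over time. Such methods usually treat a corpus as homogeneous in time, a simplification that can bias results and that fails to capture how words, phrases, and topics evolve. The key contribution is the "ttta" package, which collects a set of tools for analyzing text data over time in a consistent and reproducible way, making up for existing tools being scattered across packages and hard to apply systematically.

Link: https://arxiv.org/abs/2503.02625
Authors: Kai-Robin Lange,Niklas Benner,Lars Grönberg,Aymane Hachcham,Imene Kolli,Jonas Rieger,Carsten Jentsch
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 4 pages, 2 figures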

Abstract:Text data is inherently temporal. The meaning of words and phrases changes over time, and the context in which they are used is constantly evolving. This is not just true for social media data, where the language used is rapidly influenced by current events, memes and trends, but also for journalistic, economic or political text data. Most NLP techniques however consider the corpus at hand to be homogenous in regard to time. This is a simplification that can lead to biased results, as the meaning of words and phrases can change over time. For instance, running a classic Latent Dirichlet Allocation on a corpus that spans several years is not enough to capture changes in the topics over time, but only portraits an “average” topic distribution over the whole time span. Researchers have developed a number of tools for analyzing text data over time. However, these tools are often scattered across different packages and libraries, making it difficult for researchers to use them in a consistent and reproducible way. The ttta package is supposed to serve as a collection of tools for analyzing text data over time.

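[NLP-26] Rewarding Doubt: A Reinforcement Learning Approach to Confidence Calibration of Large Language Models

[Quick read]: This paper addresses the inaccurate expression of confidence by Large Language Models (LLMs) in their answers. It proposes a Reinforcement Learning (RL) approach that fine-tunes LLMs to give well-calibrated confidence estimates, particularly when answering factual questions. The key is to model the problem as a betting game in which the model predicts a confidence score together with every answer, and to design a reward function that penalizes both over-confidence and under-confidence. The authors prove that under this reward design the optimal policy yields perfectly calibrated confidence estimates. Experiments show significantly improved calibration and generalization to new tasks without re-training, suggesting that the approach teaches a general confidence awareness.

Link: https://arxiv.org/abs/2503.02623
Authors: Paul Stangel,David Bani-Harouni,Chantal Pellegrini,Ege Özsoy,Kamilia Zaripova,Matthias Keicher,Nassir Navab
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: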

Abstract:A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We introduce a novel Reinforcement Learning (RL) approach for LLM calibration that fine-tunes LLMs to elicit calibrated confidence estimations in their answers to factual questions. We model the problem as a betting game where the model predicts a confidence score together with every answer, and design a reward function that penalizes both over and under-confidence. We prove that under our reward design an optimal policy would result in a perfectly calibrated confidence estimation. Our experiments demonstrate significantly improved confidence calibration and generalization to new tasks without re-training, indicating that our approach teaches a general confidence awareness. This approach enables the training of inherently calibrated LLMs.
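
A reward that penalizes both over-confidence and under-confidence can be illustrated with a logarithmic scoring rule. This is an assumption chosen because it is a standard proper scoring rule; the abstract does not give the paper's exact reward function, and the names below are hypothetical.

```python
import math


def reward(confidence, correct):
    """Log-scoring reward (assumed): log(c) if the answer is correct, log(1 - c) otherwise."""
    return math.log(confidence) if correct else math.log(1.0 - confidence)


def expected_reward(confidence, p_correct):
    """Expected reward when the answer is actually correct with probability p_correct."""
    return (p_correct * reward(confidence, True)
            + (1.0 - p_correct) * reward(confidence, False))


# Search a grid of stated confidences for a model that is right 70% of the time
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda c: expected_reward(c, 0.7))
print(best)
```

With this rule the expected reward peaks when the stated confidence equals the true probability of being correct, so betting too high or too low both cost reward. That is the calibration property the abstract attributes to its reward design.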

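[NLP-27] OkraLong: A Flexible Retrieval-Augmented Framework for Long-Text Query Processing

[Quick read]: This paper addresses the efficiency and cost challenges that Large Language Models (LLMs) face when processing long-text queries, as in enterprise document analysis and financial report comprehension. Conventional approaches such as long-context processing or Retrieval-Augmented Generation (RAG) suffer from prohibitive input costs or incomplete information, while recent context compression and dynamic retrieval loops still lose critical details or incur high iterative costs. The paper proposes OkraLong, a framework that optimizes the entire processing workflow at a fine granularity through three cooperating components: an analyzer that characterizes the task state, an organizer that uses this to schedule the workflow dynamically, and an executor that carries out the execution and produces the final answer. Experiments show that OkraLong improves answer accuracy while remaining cost-effective across a variety of datasets.

Link: https://arxiv.org/abs/2503.02603
Authors: Yulong Hui,Yihao Liu,Yao Lu,Huanchen Zhang
Affiliations: Tsinghua University; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: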

Abstract:Large Language Models (LLMs) encounter challenges in efficiently processing long-text queries, as seen in applications like enterprise document analysis and financial report comprehension. While conventional solutions employ long-context processing or Retrieval-Augmented Generation (RAG), they suffer from prohibitive input expenses or incomplete information. Recent advancements adopt context compression and dynamic retrieval loops, but still sacrifice critical details or incur iterative costs. To address these limitations, we propose OkraLong, a novel framework that flexibly optimizes the entire processing workflow. Unlike prior static or coarse-grained adaptive strategies, OkraLong adopts fine-grained orchestration through three synergistic components: analyzer, organizer and executor. The analyzer characterizes the task states, which guide the organizer in dynamically scheduling the workflow. The executor carries out the execution and generates the final answer. Experimental results demonstrate that OkraLong not only enhances answer accuracy but also achieves cost-effectiveness across a variety of datasets.

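[NLP-28] MCiteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs

[Quick read]: This paper addresses the frequent hallucinations of Multimodal Large Language Models (MLLMs) when generating text. Existing work improves transparency and verifiability by generating text with citations, but focuses on text-only content and overlooks the challenges and opportunities of multimodal contexts. To fill this gap, the paper introduces MCiteBench, the first benchmark designed to evaluate and analyze the multimodal citation text generation ability of MLLMs. Its data come from academic papers and review-rebuttal interactions and include diverse information sources and multimodal content. A multi-dimensional evaluation (citation quality, source reliability, and answer accuracy) shows that MLLMs struggle with multimodal citation text generation. Further analysis indicates that the bottleneck lies in attributing the correct sources rather than in understanding the multimodal content, so the key to progress is better source attribution over multimodal information.

Link: https://arxiv.org/abs/2503.02589
Authors: Caiyu Hu,Yikai Zhang,Tinghui Zhu,Yiwei Ye,Yanghua Xiao
Affiliations: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; School of Computer Engineering and Science, Shanghai University
Categories: Computation and Language (cs.CL)
Comments: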

Abstract:Multimodal Large Language Models (MLLMs) have advanced in integrating diverse modalities but frequently suffer from hallucination. A promising solution to mitigate this issue is to generate text with citations, providing a transparent chain for verification. However, existing work primarily focuses on generating citations for text-only content, overlooking the challenges and opportunities of multimodal contexts. To address this gap, we introduce MCiteBench, the first benchmark designed to evaluate and analyze the multimodal citation text generation ability of MLLMs. Our benchmark comprises data derived from academic papers and review-rebuttal interactions, featuring diverse information sources and multimodal content. We comprehensively evaluate models from multiple dimensions, including citation quality, source reliability, and answer accuracy. Through extensive experiments, we observe that MLLMs struggle with multimodal citation text generation. We also conduct deep analyses of models’ performance, revealing that the bottleneck lies in attributing the correct sources rather than understanding the multimodal content.

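[NLP-29] Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent

[Quick read]: This paper addresses the "one-pass" issue that Large Language Model (LLM) agents face in step-by-step reasoning frameworks: every generated intermediate thought enters the trajectory whether or not it is correct, which can cause irreversible error propagation. The paper proposes a framework called Generator-Assistant Stepwise Rollback (GA-Rollback). Its key idea is to have a generator interact with the environment while an assistant examines each action the generator produces; when the assistant detects an incorrect action, it triggers a rollback operation. The paper also introduces two additional strategies tailored to the rollback scenario to further improve effectiveness. Experiments show that GA-Rollback significantly outperforms several strong baselines on three widely used benchmarks and works as a robust plug-and-play module that integrates seamlessly with other methods.

Link: https://arxiv.org/abs/2503.02519
Authors: Xingzuo Li,Kehai Chen,Yunfei Long,Xuefeng Bai,Yong Xu,Min Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: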

Abstract:Large language model (LLM) agents typically adopt a step-by-step reasoning framework, in which they interleave the processes of thinking and acting to accomplish the given task. However, this paradigm faces a deep-rooted one-pass issue whereby each generated intermediate thought is plugged into the trajectory regardless of its correctness, which can cause irreversible error propagation. To address the issue, this paper proposes a novel framework called Generator-Assistant Stepwise Rollback (GA-Rollback) to induce better decision-making for LLM agents. Particularly, GA-Rollback utilizes a generator to interact with the environment and an assistant to examine each action produced by the generator, where the assistant triggers a rollback operation upon detection of incorrect actions. Moreover, we introduce two additional strategies tailored for the rollback scenario to further improve its effectiveness. Extensive experiments show that GA-Rollback achieves significant improvements over several strong baselines on three widely used benchmarks. Our analysis further reveals that GA-Rollback can function as a robust plug-and-play module, integrating seamlessly with other methods.
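
The generator/assistant loop described above can be sketched with toy stand-ins. Everything here is a hypothetical illustration: the callables, the feedback string, and the fixed plan are placeholders for LLM calls and an environment, and the paper's actual rollback policy and its two additional strategies are not specified in the abstract.

```python
def run_with_rollback(generate, review, max_steps, max_rollbacks=5):
    """Build a trajectory step by step, truncating it whenever the reviewer flags a step."""
    trajectory, feedback, rollbacks = [], None, 0
    while len(trajectory) < max_steps:
        trajectory.append(generate(trajectory, feedback))
        bad_index, feedback = review(trajectory)
        if bad_index is not None:
            del trajectory[bad_index:]  # roll back to just before the flagged step
            rollbacks += 1
            if rollbacks > max_rollbacks:
                break  # give up rather than loop forever
    return trajectory


PLAN = ["open drawer", "take key", "unlock door"]  # toy ground truth


def toy_generate(trajectory, feedback):
    step = len(trajectory)
    # The toy generator makes a mistake at step 1 unless it has received feedback
    if step == 1 and feedback is None:
        return "take spoon"
    return PLAN[step]


def toy_review(trajectory):
    # The toy assistant flags the first step that deviates from the plan
    for i, action in enumerate(trajectory):
        if action != PLAN[i]:
            return i, f"step {i} should not be '{action}'"
    return None, None


print(run_with_rollback(toy_generate, toy_review, max_steps=3))
```

Because the flagged action is removed before the next generation, the mistake never conditions later steps, which is the error-propagation problem the one-pass setup suffers from.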

[NLP-30] LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs ACL

[Quick read]: This paper addresses the open problem of measuring the quality of long-context training data. It proposes a Long-context data selection framework with Attention-based Dependency Measurement (LADM). The key is to use the retrieval capabilities of the attention mechanism to capture contextual dependencies, which gives a comprehensive quality measurement of long-context data and allows high-quality long-context data to be identified efficiently in a large-scale, multi-domain pre-training corpus.

Link: https://arxiv.org/abs/2503.02502
Authors: Jianghao Chen,Junhong Wu,Yangyifan Xu,Jiajun Zhang
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy, Beijing, China; Wuhan AI Research; Shanghai Artificial Intelligence Laboratory, Shanghai, China
Categories: Computation and Language (cs.CL)
Comments: Submitted to ACL ARR 2024 December

Abstract:Long-context modeling has drawn more and more attention in the area of Large Language Models (LLMs). Continual training with long-context data becomes the de-facto method to equip LLMs with the ability to process long inputs. However, it still remains an open challenge to measure the quality of long-context training data. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.

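[NLP-31] Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

[Quick read]: This paper addresses two shortcomings of existing Mixture-of-Experts (MoE) models: experts work in isolation without high-quality interaction, and MoE has not been extended effectively to attention blocks. It proposes Union-of-Experts (UoE). The key is an equivalent expert decomposition of both multilayer perceptron (MLP) blocks and attention blocks, based on the matrix partitioning used in tensor parallelism, combined with dynamic routing over input data and experts through two routing paradigms: patch-wise data selection and expert selection. The paper also designs the UoE model architecture, which includes Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME), and optimizes the parallel implementation of routing and computation for hardware efficiency. Experiments show that models using UoE outperform full attention, state-of-the-art MoEs, and efficient transformers on image and natural language tasks.

Link: https://arxiv.org/abs/2503.02495
Authors: Yujiao Yang,Jing Lian,Linhui Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 6 figures, 5 tables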

Abstract:Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, expert in exist MoE paradigm works as an individual, thereby lacking high-quality expert interactions. Moreover, they have not been effectively extended to attention block, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes transformer into an equitant group of experts, and then implement dynamic routing on input data and experts. Our approach advances MoE design with three key innovations: (1) We conducted equitant expert decomposition on both MLP blocks and attention blocks based on matrix partition in tensor parallelism. (2) We developed two routing paradigms: patch wise data selection and expert selection, to apply routing across different levels. (3) We design the architecture of UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop parallel implementation of UoE’s routing and computation operation, and optimize efficiency based on the hardware processing analysis. The experiments demonstrate that the model employed with UoE surpass Full Attention, state-of-art MoEs and efficient transformers in several tasks across image and natural language domains. The source codes are available at this https URL.

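[NLP-32] It Helps to Take a Second Opinion: Teaching Smaller LLMs to Deliberate Mutually via Selective Rationale Optimisation ICLR2025

Quick read: This paper addresses the difficulty small language models (SLMs) have in generating and evaluating diverse rationales to improve end-task performance without support from very large models. Knowledge distillation from large models is constrained by API costs, copyright, and legal and ethical policies, and earlier attempts to improve SLMs through self-deliberation have had limited success.

The key to the solution is COALITION, a framework in which two behaviorally different variants of the same SLM interact and are trained with Selective Rationale Optimization (SRO) to generate and refine rationale candidates that are highly relevant to the end task. The variants behave differently during generation and refinement, which diversifies the candidate rationales explored. At inference time, a controller selects the suitable variant for generating and refining the rationales. Experiments show that cross-communication between the two variants performs better than self-refinement by a single model, and that the framework applies to LMs of different scales (4B to 14B parameters) and families.

Link: https://arxiv.org/abs/2503.02463
Authors: Sohan Patnaik,Milan Aggarwal,Sumit Bhatia,Balaji Krishnamurthy
Affiliations: Media and Data Science Research Lab, Adobe
Categories: Computation and Language (cs.CL)
Comments: Accepted at ICLR 2025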

Abstract:Very large language models (LLMs) such as GPT-4 have shown the ability to handle complex tasks by generating and self-refining step-by-step rationales. Smaller language models (SLMs), typically with 13B parameters, have been improved by using the data generated from very-large LMs through knowledge distillation. However, various practical constraints such as API costs, copyright, legal and ethical policies restrict using large (often opaque) models to train smaller models for commercial use. Limited success has been achieved at improving the ability of an SLM to explore the space of possible rationales and evaluate them by itself through self-deliberation. To address this, we propose COALITION, a trainable framework that facilitates interaction between two variants of the same SLM and trains them to generate and refine rationales optimized for the end-task. The variants exhibit different behaviors to produce a set of diverse candidate rationales during the generation and refinement steps. The model is then trained via Selective Rationale Optimization (SRO) to prefer generating rationale candidates that maximize the likelihood of producing the ground-truth answer. During inference, COALITION employs a controller to select the suitable variant for generating and refining the rationales. On five different datasets covering mathematical problems, commonsense reasoning, and natural language inference, COALITION outperforms several baselines by up to 5%. Our ablation studies reveal that cross-communication between the two variants performs better than using the single model to self-refine the rationales. We also demonstrate the applicability of COALITION for LMs of varying scales (4B to 14B parameters) and model families (Mistral, Llama, Qwen, Phi). We release the code for this work at this https URL.

[NLP-33] Measuring What Makes You Unique: Difference-Aware User Modeling for Enhancing LLM Personalization

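Quick read: This paper targets a limitation of current approaches to personalizing large language models (LLMs): when extracting a user's preference information, they overlook comparative analysis across users, which is essential for identifying the inter-user differences that actually shape preferences. The key contribution is Difference-aware Personalization Learning (DPL), which strategically selects representative users for comparison and establishes a structured standard for extracting meaningful, task-relevant inter-user differences, thereby improving LLM personalization. Experiments on real-world datasets show that DPL significantly improves personalization.

Link: https://arxiv.org/abs/2503.02450
Authors: Yilun Qiu,Xiaoyan Zhao,Yang Zhang,Yimeng Bai,Wenjie Wang,Hong Cheng,Fuli Feng,Tat-Seng Chua
Affiliations: National University of Singapore; The Chinese University of Hong Kong; University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: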

Abstract:Personalizing Large Language Models (LLMs) has become a critical step in facilitating their widespread application to enhance individual life experiences. In pursuit of personalization, distilling key preference information from an individual’s historical data as instructional preference context to customize LLM generation has emerged as a promising direction. However, these methods face a fundamental limitation by overlooking the inter-user comparative analysis, which is essential for identifying the inter-user differences that truly shape preferences. To address this limitation, we propose Difference-aware Personalization Learning (DPL), a novel approach that emphasizes extracting inter-user differences to enhance LLM personalization. DPL strategically selects representative users for comparison and establishes a structured standard to extract meaningful, task-relevant differences for customizing LLM generation. Extensive experiments on real-world datasets demonstrate that DPL significantly enhances LLM personalization. We release our code at this https URL.

[NLP-34] AILS-NTUA at SemEval-2025 Task 4: Parameter-Efficient Unlearning for Large Language Models using Data Chunking

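Quick read: This paper addresses the removal of targeted data points from large language models (the Unlearning Sensitive Content task) while affecting the model's general knowledge as little as possible. The key to the solution is parameter-efficient, gradient-based unlearning, implemented with Low-Rank Adaptation (LoRA) and layer-focused fine-tuning. To further improve unlearning, the authors use a task-agnostic data chunking strategy: the forget data is split into disjoint partitions, and each partition is merged with cyclically sampled retain samples at a predefined ratio. The method achieves a strong forget-retain balance, ranking first on the leaderboards and significantly outperforming the baselines and competing systems.

Link: https://arxiv.org/abs/2503.02443
Authors: Iraklis Premptis,Maria Lymperaiou,Giorgos Filandrianos,Orfeas Menis Mastromichalakis,Athanasios Voulodimos,Giorgos Stamou
Affiliations: National Technical University of Athens
Categories: Computation and Language (cs.CL)
Comments: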

Abstract:The Unlearning Sensitive Content from Large Language Models task aims to remove targeted datapoints from trained models while minimally affecting their general knowledge. In our work, we leverage parameter-efficient, gradient-based unlearning using low-rank (LoRA) adaptation and layer-focused fine-tuning. To further enhance unlearning effectiveness, we employ data chunking, splitting forget data into disjoint partitions and merging them with cyclically sampled retain samples at a pre-defined ratio. Our task-agnostic method achieves an outstanding forget-retain balance, ranking first on leaderboards and significantly outperforming baselines and competing systems.
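
The data chunking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the `num_chunks` and `retain_per_forget` parameters, and the dictionary layout are assumptions made for the example, and the retain set is assumed to be non-empty.

```python
from itertools import cycle


def chunk_forget_with_retain(forget, retain, num_chunks, retain_per_forget):
    """Split `forget` into disjoint chunks and pair each with retain samples.

    Hypothetical helper: names and parameters are illustrative only.
    """
    if not forget:
        return []
    # Disjoint, near-equal partitions of the forget set (ceiling division).
    size = -(-len(forget) // num_chunks)
    partitions = [forget[i:i + size] for i in range(0, len(forget), size)]
    # Cyclic sampling: the retain stream wraps around when it is exhausted.
    retain_stream = cycle(retain)
    chunks = []
    for part in partitions:
        n_retain = len(part) * retain_per_forget  # fixed retain:forget ratio
        chunks.append({
            "forget": part,
            "retain": [next(retain_stream) for _ in range(n_retain)],
        })
    return chunks
```

Each returned chunk could then drive one unlearning round, so that every forget example is seen once while the retain examples are reused in rotation.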

[NLP-35] AILS-NTUA at SemEval-2025 Task 3: Leveraging Large Language Models and Translation Strategies for Multilingual Hallucination Detection

Quick read: This paper tackles multilingual hallucination detection, an underexplored challenge that is the focus of the Mu-SHROOM shared task. The key to the solution is an efficient, training-free LLM prompting strategy that improves detection by translating multilingual text spans into English. The approach achieves competitive rankings across multiple languages, including two first places in low-resource languages, which supports the effectiveness of the translation strategy and its applicability regardless of source language.

Link: https://arxiv.org/abs/2503.02442
Authors: Dimitra Karkani,Maria Lymperaiou,Giorgos Filandrianos,Nikolaos Spanos,Athanasios Voulodimos,Giorgos Stamou
Affiliations: School of Electrical and Computer Engineering, AILS Laboratory, National Technical University of Athens
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multilingual hallucination detection stands as an underexplored challenge, which the Mu-SHROOM shared task seeks to address. In this work, we propose an efficient, training-free LLM prompting strategy that enhances detection by translating multilingual text spans into English. Our approach achieves competitive rankings across multiple languages, securing two first positions in low-resource languages. The consistency of our results highlights the effectiveness of our translation strategy for hallucination detection, demonstrating its applicability regardless of the source language.

[NLP-36] Hierarchical Re-ranker Retriever (HRR)

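Quick read: This paper addresses the perennial challenge in information retrieval of retrieving the right level of context for a query: chunks that are too large dilute semantic specificity, while chunks that are too small lack broader context. The key to the solution is the Hierarchical Re-ranker Retriever (HRR) framework. Documents are split into sentence-level and intermediate-level (512-token) chunks to maximize vector-search quality, and a reranker operating on the 512-token chunks keeps the scored unit neither too coarse nor too fine, giving robust relevance scoring. Finally, the top-ranked intermediate chunks are mapped to their parent chunks (2048 tokens) to give the large language model (LLM) a sufficiently large context.

Link: https://arxiv.org/abs/2503.02401
Authors: Ashish Singh,Priti Mohapatra
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 14 pages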

Abstract:Retrieving the right level of context for a given query is a perennial challenge in information retrieval - too large a chunk dilutes semantic specificity, while chunks that are too small lack broader context. This paper introduces the Hierarchical Re-ranker Retriever (HRR), a framework designed to achieve both fine-grained and high-level context retrieval for large language model (LLM) applications. In HRR, documents are split into sentence-level and intermediate-level (512 tokens) chunks to maximize vector-search quality for both short and broad queries. We then employ a reranker that operates on these 512-token chunks, ensuring an optimal balance neither too coarse nor too fine for robust relevance scoring. Finally, top-ranked intermediate chunks are mapped to parent chunks (2048 tokens) to provide an LLM with sufficiently large context.
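
The child-to-parent mapping at the end of the pipeline can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the reranker is replaced by a ranked list of child indices supplied by the caller, sentence-level chunks are omitted, and `parent_size` is assumed to be a multiple of `child_size`.

```python
def build_hierarchy(tokens, child_size=512, parent_size=2048):
    """Cut a token list into child chunks, each tagged with its parent index."""
    children = []
    for start in range(0, len(tokens), child_size):
        children.append({
            "tokens": tokens[start:start + child_size],
            "parent_id": start // parent_size,  # parent that contains this child
        })
    return children


def parents_for_ranked_children(tokens, children, ranked_child_ids,
                                parent_size=2048, top_k=2):
    """Map ranked child chunks to deduplicated parent chunks, in rank order."""
    seen, parents = set(), []
    for cid in ranked_child_ids:
        pid = children[cid]["parent_id"]
        if pid not in seen:  # several children may share one parent
            seen.add(pid)
            parents.append(tokens[pid * parent_size:(pid + 1) * parent_size])
        if len(parents) == top_k:
            break
    return parents
```

Deduplicating parents matters because several highly ranked 512-token chunks often fall inside the same 2048-token parent, and passing that parent to the LLM twice would waste context.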

[NLP-37] An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning

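Quick read: This paper addresses a challenge in improving the mathematical reasoning of large language models (LLMs): existing methods for constructing training data for process-supervised reward models (PRMs) are either costly or produce poor-quality labels. It proposes the EpicPRM framework, whose key idea is to annotate each intermediate reasoning step according to its quantified contribution and to use an adaptive binary search algorithm to improve both annotation precision and efficiency. With this approach the authors construct Epic50k, a high-quality process-supervision training dataset of 50k annotated intermediate steps. A PRM trained on Epic50k performs significantly better than PRMs trained on other publicly available datasets.

Link: https://arxiv.org/abs/2503.02382
Authors: Wei Sun,Qianlong Du,Fuwei Cui,Jiajun Zhang
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Wuhan AI Search
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: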

Abstract:Enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) is of great scientific and practical significance. Researchers typically employ process-supervised reward models (PRMs) to guide the reasoning process, effectively improving the models’ reasoning abilities. However, existing methods for constructing process supervision training data, such as manual annotation and per-step Monte Carlo estimation, are often costly or suffer from poor quality. To address these challenges, this paper introduces a framework called EpicPRM, which annotates each intermediate reasoning step based on its quantified contribution and uses an adaptive binary search algorithm to enhance both annotation precision and efficiency. Using this approach, we efficiently construct a high-quality process supervision training dataset named Epic50k, consisting of 50k annotated intermediate steps. Compared to other publicly available datasets, the PRM trained on Epic50k demonstrates significantly superior performance. Getting Epic50k at this https URL.
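
The abstract does not spell out the adaptive binary search, so the following is only a generic sketch of why binary search saves annotation effort compared with checking every step. It assumes a hypothetical oracle `prefix_is_recoverable(k)` that reports whether the first `k` steps can still lead to a correct answer, and it assumes this property is monotone (once a prefix is broken, longer prefixes stay broken). In practice such an oracle would be a noisy estimate, for example from sampled completions.

```python
def first_bad_step(num_steps, prefix_is_recoverable):
    """Return the 0-based index of the first step that breaks the solution.

    Returns num_steps if every prefix is recoverable. Illustrative helper,
    not the EpicPRM algorithm itself.
    """
    lo, hi = 0, num_steps  # the answer always lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_is_recoverable(mid + 1):  # prefix made of steps 0..mid
            lo = mid + 1                    # first bad step is after mid
        else:
            hi = mid                        # first bad step is at or before mid
    return lo
```

With this scheme a solution of n steps needs roughly log2(n) oracle calls instead of n, which is where the efficiency gain over per-step estimation would come from.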

[NLP-38] MedEthicEval: Evaluating Large Language Models Based on Chinese Medical Ethics

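Quick read: This paper addresses the underexplored question of how well large language models (LLMs) handle medical ethics challenges. It proposes MedEthicEval, a benchmark designed to systematically evaluate LLMs in the domain of medical ethics. The framework has two core components: a knowledge component that assesses the model's grasp of medical ethics principles, and an application component that focuses on its ability to apply those principles across scenarios. To support the benchmark, the authors consulted medical ethics researchers and developed three datasets addressing distinct ethical challenges.

Link: https://arxiv.org/abs/2503.02374
Authors: Haoan Jin,Jiacheng Shi,Hanhui Xu,Kenny Q. Zhu,Mengyue Wu
Affiliations: SJTU (Shanghai Jiao Tong University); Ant Group; FDU (Fudan University); UTA (University of Texas at Arlington)
Categories: Computation and Language (cs.CL)
Comments: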

Abstract:Large language models (LLMs) demonstrate significant potential in advancing medical applications, yet their capabilities in addressing medical ethics challenges remain underexplored. This paper introduces MedEthicEval, a novel benchmark designed to systematically evaluate LLMs in the domain of medical ethics. Our framework encompasses two key components: knowledge, assessing the models’ grasp of medical ethics principles, and application, focusing on their ability to apply these principles across diverse scenarios. To support this benchmark, we consulted with medical ethics researchers and developed three datasets addressing distinct ethical challenges: blatant violations of medical ethics, priority dilemmas with clear inclinations, and equilibrium dilemmas without obvious resolutions. MedEthicEval serves as a critical tool for understanding LLMs’ ethical reasoning in healthcare, paving the way for their responsible and effective use in medical contexts.

[NLP-39] Iterative Value Function Optimization for Guided Decoding

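Quick read: This paper addresses the high computational cost and training instability of Reinforcement Learning from Human Feedback (RLHF) for controlling language model outputs. Guided decoding, especially value-guided methods, offers a more cost-effective alternative, but existing methods struggle to estimate the optimal value function accurately, which weakens control. The paper proposes Iterative Value Function Optimization, a framework with two key components: Monte Carlo Value Estimation, which reduces estimation variance by exploring diverse trajectories, and Iterative On-Policy Optimization, which progressively improves value estimates by collecting trajectories from value-guided policies. Experiments on text summarization, multi-turn dialogue, and instruction following show that the framework significantly improves value-guided decoding while substantially reducing computational cost.

Link: https://arxiv.org/abs/2503.02368
Authors: Zhenhua Liu,Lijun Li,Ruizhe Chen,Yuxian Jiang,Tong Zhu,Wenliang Chen,Jing Shao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 10 figures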

Abstract:While Reinforcement Learning from Human Feedback (RLHF) has become the predominant method for controlling language model outputs, it suffers from high computational costs and training instability. Guided decoding, especially value-guided methods, offers a cost-effective alternative by controlling outputs without re-training models. However, the accuracy of the value function is crucial for value-guided decoding, as inaccuracies can lead to suboptimal decision-making and degraded performance. Existing methods struggle with accurately estimating the optimal value function, leading to less effective control. We propose Iterative Value Function Optimization, a novel framework that addresses these limitations through two key components: Monte Carlo Value Estimation, which reduces estimation variance by exploring diverse trajectories, and Iterative On-Policy Optimization, which progressively improves value estimation through collecting trajectories from value-guided policies. Extensive experiments on text summarization, multi-turn dialogue, and instruction following demonstrate the effectiveness of value-guided decoding approaches in aligning language models. These approaches not only achieve alignment but also significantly reduce computational costs by leveraging principled value function optimization for efficient and effective control.
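
To make the idea of value-guided decoding concrete, here is a small sketch of one common form: the base model's next-token probabilities are reweighted by an exponentiated value estimate, and the value itself is estimated as the mean reward of sampled continuations. The exact rule used in the paper is not given in the abstract, so the weighting formula, the `beta` parameter, and the function names are assumptions.

```python
import math


def monte_carlo_value(rollout_rewards):
    """Estimate a value as the mean reward over sampled continuations."""
    return sum(rollout_rewards) / len(rollout_rewards)


def value_guided_distribution(base_probs, values, beta=1.0):
    """Reweight next-token probabilities: p'(t) is proportional to p(t) * exp(beta * V(t)).

    `base_probs` and `values` are dicts keyed by token; beta = 0 recovers
    the base distribution.
    """
    weights = {tok: p * math.exp(beta * values[tok])
               for tok, p in base_probs.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}
```

Averaging over more rollouts lowers the variance of `monte_carlo_value`, which is why inaccurate or noisy value estimates translate directly into a distorted decoding distribution.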

[NLP-40] EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports NEURIPS

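Quick read: This paper aims to improve clinical question-answering systems in cardiology by building a new QA dataset from echocardiogram reports, containing 771,244 QA pairs that cover a wide range of cardiac abnormalities and their severity. The authors evaluate large language models (LLMs) on it and show that fine-tuning improves performance across multiple QA metrics, validating the dataset's value. Clinicians also qualitatively evaluate the best-performing model, and fine-grained fairness audits across social determinants of health assess the bias-performance trade-off. The goal is to establish a benchmark for LLM-based AI agents that support clinicians with cardiac differential diagnoses, reducing the documentation burden that contributes to clinician burnout and letting healthcare professionals focus more on patient care.

Link: https://arxiv.org/abs/2503.02365
Authors: Lama Moukheiber,Mira Moukheiber,Dana Moukheiiber,Hyung-Chul Lee
Affiliations: Massachusetts Institute of Technology; Seoul National University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: NeurIPS SafeGenAI 2024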

Abstract:We introduce a novel question-answering (QA) dataset using echocardiogram reports sourced from the Medical Information Mart for Intensive Care database. This dataset is specifically designed to enhance QA systems in cardiology, consisting of 771,244 QA pairs addressing a wide array of cardiac abnormalities and their severity. We compare large language models (LLMs), including open-source and biomedical-specific models for zero-shot evaluation, and closed-source models for zero-shot and three-shot evaluation. Our results show that fine-tuning LLMs improves performance across various QA metrics, validating the value of our dataset. Clinicians also qualitatively evaluate the best-performing model to assess the LLM responses for correctness. Further, we conduct fine-grained fairness audits to assess the bias-performance trade-off of LLMs across various social determinants of health. Our objective is to propel the field forward by establishing a benchmark for LLM AI agents aimed at supporting clinicians with cardiac differential diagnoses, thereby reducing the documentation burden that contributes to clinician burnout and enabling healthcare professionals to focus more on patient care.

[NLP-41] Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm

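Quick read: This paper addresses the challenge of selecting high-quality, diverse training samples from large datasets. Existing methods focus mainly on individual sample quality, fall short in assessing the overall value of the selected data, and struggle to balance diversity against the cost of traversing data points. The key to the solution is a choice-based sample selection framework that shifts the focus from scoring individual samples to comparing the contribution each candidate would make when added to the subset, using the language understanding of large language models (LLMs) to evaluate the options. A greedy sampling process adds samples to the subset incrementally, which avoids exhaustive traversal of the whole dataset. The method improves performance while requiring fewer selections, and it is also validated on a larger medical dataset.

Link: https://arxiv.org/abs/2503.02359
Authors: Zhuo Li,Yuhao Du,Xiaoqi Jiao,Yiwen Guo,Yuege Feng,Xiang Wan,Anningzhe Gao,Jinpeng Hu
Affiliations: Shenzhen Research Institute of Big Data; The Chinese University of Hong Kong, Shenzhen; LIGHTSPEED STUDIOS; Independent Researcher; Birmingham City University; Hefei University of Technology
Categories: Computation and Language (cs.CL)
Comments: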

Abstract:Selecting high-quality and diverse training samples from extensive datasets plays a crucial role in reducing training overhead and enhancing the performance of Large Language Models (LLMs). However, existing studies fall short in assessing the overall value of selected data, focusing primarily on individual quality, and struggle to strike an effective balance between ensuring diversity and minimizing data point traversals. Therefore, this paper introduces a novel choice-based sample selection framework that shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples when incorporated into the subset. Thanks to the advanced language understanding capabilities of LLMs, we utilize LLMs to evaluate the value of each option during the selection process. Furthermore, we design a greedy sampling process where samples are incrementally added to the subset, thereby improving efficiency by eliminating the need for exhaustive traversal of the entire dataset with the limited budget. Extensive experiments demonstrate that selected data from our method not only surpass the performance of the full dataset but also achieves competitive results with state-of-the-art (SOTA) studies, while requiring fewer selections. Moreover, we validate our approach on a larger medical dataset, highlighting its practical applicability in real-world applications.
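
The incremental, choice-based selection loop can be sketched as below. This is an illustration under assumptions, not the paper's procedure: `choose_best` stands in for the LLM that compares candidates, the candidate window (`candidates_per_round`) is a made-up parameter, and the toy chooser `most_novel` simply prefers the candidate that adds the most unseen words.

```python
def greedy_select(pool, budget, choose_best, candidates_per_round=3):
    """Grow a subset one sample at a time.

    Each round offers a few candidates and lets `choose_best` (a stand-in
    for an LLM judgment) pick the one that adds the most to the subset.
    """
    subset, remaining = [], list(pool)
    while remaining and len(subset) < budget:
        candidates = remaining[:candidates_per_round]
        pick = choose_best(subset, candidates)
        subset.append(pick)
        remaining.remove(pick)
    return subset


def most_novel(subset, candidates):
    """Toy chooser: prefer the candidate with the most words not yet covered."""
    covered = {word for sample in subset for word in sample.split()}
    return max(candidates, key=lambda c: len(set(c.split()) - covered))
```

Because each round compares candidates against the current subset rather than scoring them in isolation, a duplicate of an already selected sample loses to a sample that adds something new.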

[NLP-42] Are Large Vision Language Models Good Game Players? ICLR2025

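Quick read: This paper addresses limitations of existing evaluation methods for Large Vision Language Models (LVLMs), including inadequate assessment of detailed visual perception, data contamination, and a lack of focus on multi-turn reasoning. The key to the solution is a game-based evaluation framework (\method) that assesses the cognitive and reasoning abilities of LVLMs in structured environments through four core tasks: Perceiving, Question Answering, Rule Following, and End-to-End Playing. The approach aims to capture the full scope of LVLM capabilities and to expose current limitations, such as handling long structured outputs and perceiving detailed, dense elements.

Link: https://arxiv.org/abs/2503.02358
Authors: Xinyu Wang,Bohan Zhuang,Qi Wu
Affiliations: The University of Adelaide; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICLR2025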

Abstract:Large Vision Language Models (LVLMs) have demonstrated remarkable abilities in understanding and reasoning about both visual and textual information. However, existing evaluation methods for LVLMs, primarily based on benchmarks like Visual Question Answering and image captioning, often fail to capture the full scope of LVLMs’ capabilities. These benchmarks are limited by issues such as inadequate assessment of detailed visual perception, data contamination, and a lack of focus on multi-turn reasoning. To address these challenges, we propose \method, a game-based evaluation framework designed to provide a comprehensive assessment of LVLMs’ cognitive and reasoning skills in structured environments. \method uses a set of games to evaluate LVLMs on four core tasks: Perceiving, Question Answering, Rule Following, and End-to-End Playing, with each target task designed to assess specific abilities, including visual perception, reasoning, decision-making, etc. Based on this framework, we conduct extensive experiments that explore the limitations of current LVLMs, such as handling long structured outputs and perceiving detailed and dense elements. Code and data are publicly available at this https URL.

[NLP-43] DeLTa: A Decoding Strategy based on Logit Trajectory Prediction Improves Factuality and Reasoning Ability

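Quick read: This paper addresses the unreliability of content generated by large language models (LLMs) in real-world applications, where output often deviates from factual correctness or shows flawed logical reasoning. It proposes a decoding strategy named Decoding by Logit Trajectory-based approach (DeLTa). The key idea is to analyze the trajectory of logits from lower to higher Transformer layers and apply linear regression to adjust the next-token probabilities, improving factuality and reasoning without modifying the model architecture or pre-trained parameters. Experiments show improvements over the baseline of up to 4.9% on TruthfulQA, and of up to 8.1% on StrategyQA and 7.3% on GSM8K, both of which demand strong reasoning.

Link: https://arxiv.org/abs/2503.02343
Authors: Yunzhen He,Yusuke Takase,Yoichi Ishibashi,Hidetoshi Shimodaira
Affiliations: Kyoto University; NEC; RIKEN AIP
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Source code is available at this https URL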

Abstract:Large Language Models (LLMs) are increasingly being used in real-world applications. However, concerns about the reliability of the content they generate persist, as it frequently deviates from factual correctness or exhibits deficiencies in logical reasoning. This paper proposes a novel decoding strategy aimed at enhancing both factual accuracy and inferential reasoning without requiring any modifications to the architecture or pre-trained parameters of LLMs. Our approach adjusts next-token probabilities by analyzing the trajectory of logits from lower to higher layers in Transformers and applying linear regression. We find that this Decoding by Logit Trajectory-based approach (DeLTa) effectively reinforces factuality and reasoning while mitigating incorrect generation. Experiments on TruthfulQA demonstrate that DeLTa attains up to a 4.9% improvement over the baseline. Furthermore, it enhances performance by up to 8.1% on StrategyQA and 7.3% on GSM8K, both of which demand strong reasoning capabilities.
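
The regression step can be illustrated with a toy example. The abstract only says that the logit trajectory across layers is analyzed with linear regression, so the details here are assumptions: each token's logit is fitted as a straight line over the layer index, the line is extrapolated to a hypothetical `target_layer` beyond the last one, and a softmax turns the predicted logits into probabilities. Function names are illustrative, and at least two layers are required.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolated_probs(layer_logits, target_layer):
    """Extrapolate each token's logit to `target_layer`, then apply softmax.

    layer_logits[l][t] is the logit of token t read out at layer l.
    """
    layers = list(range(len(layer_logits)))
    vocab_size = len(layer_logits[0])
    preds = []
    for t in range(vocab_size):
        slope, intercept = fit_line(layers, [row[t] for row in layer_logits])
        preds.append(slope * target_layer + intercept)
    peak = max(preds)  # subtract the max for numerical stability
    exps = [math.exp(p - peak) for p in preds]
    total = sum(exps)
    return [e / total for e in exps]
```

A token whose logit rises steadily through the layers gains probability under this extrapolation, while a token whose logit is flat does not, which is the intuition behind using the trajectory rather than the final layer alone.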

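[NLP-44] Unlocking a New Rust Programming Experience: Fast and Slow Thinking with LLMs to Conquer Undefined Behaviors

Quick read: This paper addresses the memory-safety problems and Undefined Behaviors (UBs) that the unsafe tag introduces into Rust projects. These UBs undermine Rust's safety guarantees and raise the risk of faulty code, and traditional approaches require in-depth code analysis that is laborious and depends on manually designed knowledge. The key to the solution is RustBrain, an LLM-based framework that draws on the dual-process theory of fast and slow thinking to minimize UBs in Rust projects automatically and flexibly. Fast thinking extracts features and generates solutions, while slow thinking decomposes, verifies, and generalizes them. A feedback mechanism integrates the two so that verification and generalization results feed back into solution generation, enabling dynamic adjustment and precise outputs. Experiments on the Miri dataset show a 94.3% pass rate and an 80.4% execution rate.

Link: https://arxiv.org/abs/2503.02335
Authors: Renshuang Jiang,Pan Dong,Zhenling Duan,Yu Shi,Xiaoxiang Fang,Yan Ding,Jun Ma,Shuai Zhao,Zhe Jiang
Affiliations: National University of Defense Technology; Sun Yat-sen University; Southeast University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: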

Abstract:To provide flexibility and low-level interaction capabilities, the unsafe tag in Rust is essential in many projects, but undermines memory safety and introduces Undefined Behaviors (UBs) that reduce safety. Eliminating these UBs requires a deep understanding of Rust’s safety rules and strong typing. Traditional methods require depth analysis of code, which is laborious and depends on knowledge design. The powerful semantic understanding capabilities of LLM offer new opportunities to solve this problem. Although existing large model debugging frameworks excel in semantic tasks, limited by fixed processes and lack adaptive and dynamic adjustment capabilities. Inspired by the dual process theory of decision-making (Fast and Slow Thinking), we present a LLM-based framework called RustBrain that automatically and flexibly minimizes UBs in Rust projects. Fast thinking extracts features to generate solutions, while slow thinking decomposes, verifies, and generalizes them abstractly. To apply verification and generalization results to solution generation, enabling dynamic adjustments and precise outputs, RustBrain integrates two thinking through a feedback mechanism. Experimental results on Miri dataset show a 94.3% pass rate and 80.4% execution rate, improving flexibility and Rust projects safety.

[NLP-45] Examining the Mental Health Impact of Misinformation on Social Media Using a Hybrid Transformer-Based Approach

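Quick read: This paper addresses the spread of misinformation on social media and its effect on mental health. It proposes a hybrid transformer-based approach whose key component is a RoBERTa-LSTM classifier that detects misinformation, assesses its impact on mental health, and classifies disorders linked to misinformation exposure. The reported accuracies are 98.4%, 87.8%, and 77.3% for misinformation detection, mental health impact assessment, and disorder classification, respectively, and Pearson's Chi-Squared Test for Independence (p-value = 0.003871) is used to validate the association between misinformation and deteriorating mental well-being. The study stresses the need for better misinformation management strategies to mitigate these psychological effects and suggests that future work use broader datasets incorporating linguistic, demographic, and cultural variables.

Link: https://arxiv.org/abs/2503.02333
Authors: Sarvesh Arora,Sarthak Arora,Deepika Kumar,Vallari Agrawal,Vedika Gupta,Dipit Vasdev
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages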

Abstract:Social media has significantly reshaped interpersonal communication, fostering connectivity while also enabling the proliferation of misinformation. The unchecked spread of false narratives has profound effects on mental health, contributing to increased stress, anxiety, and misinformation-driven paranoia. This study presents a hybrid transformer-based approach using a RoBERTa-LSTM classifier to detect misinformation, assess its impact on mental health, and classify disorders linked to misinformation exposure. The proposed models demonstrate accuracy rates of 98.4, 87.8, and 77.3 in detecting misinformation, mental health implications, and disorder classification, respectively. Furthermore, Pearson’s Chi-Squared Test for Independence (p-value = 0.003871) validates the direct correlation between misinformation and deteriorating mental well-being. This study underscores the urgent need for better misinformation management strategies to mitigate its psychological repercussions. Future research could explore broader datasets incorporating linguistic, demographic, and cultural variables to deepen the understanding of misinformation-induced mental health distress.

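[NLP-46] Limited Effectiveness of LLM-based Data Augmentation for COVID-19 Misinformation Stance Detection

Quick read: This paper addresses the serious societal threat posed by misinformation about emerging outbreaks such as COVID-19 on social media. Stance detection (SD), which identifies whether social media posts support or oppose misleading claims, is a promising countermeasure. The paper tests controllable misinformation generation (CMG) with large language models (LLMs) as a data augmentation method for expanding SD training sets. The experiments show that CMG's gains over traditional augmentation methods are often minimal and inconsistent, primarily because the safeguards built into LLMs limit the variety of misinformation they will generate. The core contribution is an assessment of the potential and limits of CMG for SD, along with released code and datasets for further research on misinformation detection and generation.

Link: https://arxiv.org/abs/2503.02328
Authors: Eun Cheol Choi,Ashwin Balasubramanian,Jinhu Qi,Emilio Ferrara
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: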

Abstract:Misinformation surrounding emerging outbreaks poses a serious societal threat, making robust countermeasures essential. One promising approach is stance detection (SD), which identifies whether social media posts support or oppose misleading claims. In this work, we finetune classifiers on COVID-19 misinformation SD datasets consisting of claims and corresponding tweets. Specifically, we test controllable misinformation generation (CMG) using large language models (LLMs) as a method for data augmentation. While CMG demonstrates the potential for expanding training datasets, our experiments reveal that performance gains over traditional augmentation methods are often minimal and inconsistent, primarily due to built-in safeguards within LLMs. We release our code and datasets to facilitate further research on misinformation detection and generation.

[NLP-47] PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models

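Quick read: This paper addresses a bottleneck in improving how large language models solve complex mathematical problems: the scarcity of sufficiently challenging data, particularly at the Olympiad level. It proposes PromptCoT, which synthesizes high-quality Olympiad-level problems from mathematical concepts and the rationale behind problem construction, emulating the thought process of experienced problem designers. A theoretical analysis shows that an optimal rationale should maximize both the likelihood of generating the rationale given the associated concepts and the likelihood of generating the problem conditioned on the rationale and the concepts. PromptCoT outperforms existing problem generation methods on standard benchmarks (GSM8K, MATH-500, and AIME2024) and shows superior data scalability, maintaining high performance as the dataset grows.

Link: https://arxiv.org/abs/2503.02324
Authors: Xueliang Zhao,Wei Wu,Jian Guan,Lingpeng Kong
Affiliations: The University of Hong Kong; Ant Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint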

Abstract:The ability of large language models to solve complex mathematical problems has progressed significantly, particularly for tasks requiring advanced reasoning. However, the scarcity of sufficiently challenging problems, particularly at the Olympiad level, hinders further advancements. In this work, we introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction, emulating the thought processes of experienced problem designers. We provide a theoretical analysis demonstrating that an optimal rationale should maximize both the likelihood of rationale generation given the associated concepts and the likelihood of problem generation conditioned on both the rationale and the concepts. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods. Furthermore, we demonstrate that PromptCoT exhibits superior data scalability, consistently maintaining high performance as the dataset size increases, outperforming the baselines. The implementation is available at this https URL.
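
The optimality criterion stated in the abstract can be written compactly. The notation is an assumption introduced here for illustration: c denotes the set of concepts, z the rationale, and x the generated problem. Combining the two likelihoods as a product is one natural reading of "maximize both" and is not taken from the abstract itself.

```latex
z^{*} \;=\; \arg\max_{z} \; p(z \mid c)\, p(x \mid z, c)
```

Under this reading, a good rationale is one that is both plausible given the concepts and informative enough to make the resulting problem likely.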

[NLP-48] Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models

Quick read: This paper addresses the neglect of the audio modality in multimodal reasoning and proposes Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. The key is CoTA, a high-quality reasoning dataset of 1.2 million reasoning-rich samples, combined with structured Chain-of-Thought (CoT) training. The dataset is built from a carefully curated multi-task audio collection, on which closed-source models perform secondary labeling, QA generation, and a structured CoT process. Training on CoTA gives Audio-Reasoner strong logical capabilities on audio reasoning tasks.

Link: https://arxiv.org/abs/2503.02318
Authors: Zhifei Xie,Mingbao Lin,Zihang Liu,Pengcheng Wu,Shuicheng Yan,Chunyan Miao
Affiliations: Nanyang Technological University; Skywork AI; Beijing Institute of Technology; National University of Singapore
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Technical report, in process

Abstract:Recent advancements in multimodal reasoning have largely overlooked the audio modality. We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We meticulously curated a large-scale and diverse multi-task audio dataset with simple annotations. Then, we leverage closed-source models to conduct secondary labeling, QA generation, along with structured COT process. These datasets together form a high-quality reasoning dataset with 1.2 million reasoning-rich samples, which we name CoTA. Following inference scaling principles, we train Audio-Reasoner on CoTA, enabling it to achieve great logical capabilities in audio reasoning. Experiments show state-of-the-art performance across key benchmarks, including MMAU-mini (+25.42%), AIR-Bench chat/foundation(+14.57%/+10.13%), and MELD (+8.01%). Our findings stress the core of structured CoT training in advancing audio reasoning.

[NLP-49] AxBERT: An Interpretable Chinese Spelling Correction Method Driven by Associative Knowledge Network

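Quick read: This paper addresses how the uninterpretability of deep learning models restricts their use in applications that require feature explanations, such as text correction. The key to the solution is AxBERT, an interpretable deep learning model for Chinese spelling correction that is aligned with an Associative Knowledge Network (AKN). The AKN is built from co-occurrence relations among Chinese characters and provides an interpretable statistical logic, in contrast to the uninterpretable logic of BERT. A translator matrix between BERT and the AKN aligns and regulates BERT's attention component, and a weight regulator adjusts the attention distributions in BERT so that sentence semantics are modeled appropriately.

Link: https://arxiv.org/abs/2503.02255
Authors: Fanyu Wang,Hangyu Zhu,Zhenping Xie
Affiliations: School of Artificial Intelligence and Computer Science and the Jiangsu Key Laboratory of Media Design and Software Technology, Jiangnan University, Wuxi 214122, Jiangsu, China
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: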

点击查看摘要

Abstract:Deep learning has shown promising performance on various machine learning tasks. Nevertheless, the uninterpretability of deep learning models severely restricts the usage domains that require feature explanations, such as text correction. Therefore, a novel interpretable deep learning model (named AxBERT) is proposed for Chinese spelling correction by aligning with an associative knowledge network (AKN). Wherein AKN is constructed based on the co-occurrence relations among Chinese characters, which denotes the interpretable statistic logic contrasted with uninterpretable BERT logic. And a translator matrix between BERT and AKN is introduced for the alignment and regulation of the attention component in BERT. In addition, a weight regulator is designed to adjust the attention distributions in BERT to appropriately model the sentence semantics. Experimental results on SIGHAN datasets demonstrate that AxBERT can achieve extraordinary performance, especially upon model precision compared to baselines. Our interpretable analysis, together with qualitative reasoning, can effectively illustrate the interpretability of AxBERT.

[NLP-50] OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale

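[Quick Read]: This paper targets two main problems with existing text-to-SQL methods in real-world use. Prompting-based methods depend on expensive closed-source language models, which raise privacy concerns and lack customization, while fine-tuning-based methods generalize poorly because publicly available training data has limited coverage. To overcome these challenges, the paper proposes a novel and scalable text-to-SQL data synthesis framework that automatically generates large-scale, high-quality, and diverse datasets without extensive human intervention. The key is that this framework is used to build the SynSQL-2.5M dataset, on which the OmniSQL model is trained, yielding strong performance while remaining open source.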

Link: https://arxiv.org/abs/2503.02240
Authors: Haoyang Li,Shang Wu,Xiaokang Zhang,Xinmei Huang,Jing Zhang,Fuxin Jiang,Shuai Wang,Tieying Zhang,Jianjun Chen,Rui Shi,Hong Chen,Cuiping Li
Affiliations: Engineering Research Center of Database and Business Intelligence, MOE, China; School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China; ByteDance Inc.
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments:

Abstract:Text-to-SQL, the task of translating natural language questions into SQL queries, plays a crucial role in enabling non-experts to interact with databases. While recent advancements in large language models (LLMs) have significantly enhanced text-to-SQL performance, existing approaches face notable limitations in real-world text-to-SQL applications. Prompting-based methods often depend on closed-source LLMs, which are expensive, raise privacy concerns, and lack customization. Fine-tuning-based methods, on the other hand, suffer from poor generalizability due to the limited coverage of publicly available training data. To overcome these challenges, we propose a novel and scalable text-to-SQL data synthesis framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention. Using this framework, we introduce SynSQL-2.5M, the first million-scale text-to-SQL dataset, containing 2.5 million samples spanning over 16,000 synthetic databases. Each sample includes a database, SQL query, natural language question, and chain-of-thought (CoT) solution. Leveraging SynSQL-2.5M, we develop OmniSQL, a powerful open-source text-to-SQL model available in three sizes: 7B, 14B, and 32B. Extensive evaluations across nine datasets demonstrate that OmniSQL achieves state-of-the-art performance, matching or surpassing leading closed-source and open-source LLMs, including GPT-4o and DeepSeek-V3, despite its smaller size. We release all code, datasets, and models to support further research.
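
The abstract describes each SynSQL-2.5M sample as a database, a SQL query, a natural language question, and a CoT solution. The sketch below shows what such a record might look like and how one could check that its SQL actually executes, using Python's built-in sqlite3 module. The field names, the toy schema, and the `execute_sample` helper are illustrative assumptions, not the released data format or the authors' pipeline.

```python
import sqlite3

# Hypothetical sample layout; field names are assumptions, not the SynSQL-2.5M schema.
sample = {
    "schema": "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);",
    "rows": [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 7.5)],
    "question": "What is the total amount spent by each customer?",
    "sql": "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer;",
    "cot": "Group orders by customer, then sum the total column per group.",
}


def execute_sample(s):
    """Build the sample's database in memory and run its SQL query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(s["schema"])
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", s["rows"])
        return conn.execute(s["sql"]).fetchall()
    finally:
        conn.close()


result = execute_sample(sample)
print(result)
```

An execution check of this kind is one plausible way to filter out synthetic samples whose SQL does not run against their own database.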

[NLP-51] Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions

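[Quick Read]: This paper addresses the tendency of existing benchmarks for Large Language Model (LLM) agents to overemphasize single-task performance while paying too little attention to multitask planning and execution efficiency in real-world scenarios. It proposes Recipe2Plan, a new benchmark framework based on real cooking scenarios, which introduces strict temporal constraints: agents must optimize cooking time while keeping the time dependencies between steps intact. The key is balancing the concurrency of global multitask operations against strict time constraints, so that overly aggressive local parallelization does not cause the whole task to fail. The results point to the need for better temporal awareness and global multitask planning in LLMs. The benchmark and code are open-sourced to support further research.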

Link: https://arxiv.org/abs/2503.02238
Authors: Zirui Wu,Xiao Liu,Jiayi Li,Lingpeng Kong,Yansong Feng
Affiliations: Peking University; The University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While Large Language Model-based agents have demonstrated substantial progress in task completion, existing evaluation benchmarks tend to overemphasize single-task performance, with insufficient attention given to the crucial aspects of multitask planning and execution efficiency required in real-world scenarios. To bridge this gap, we present Recipe2Plan, a novel benchmark framework based on real-world cooking scenarios. Unlike conventional benchmarks, Recipe2Plan challenges agents to optimize cooking time through parallel task execution while respecting temporal constraints i.e. specific actions need to be performed within a particular time intervals following the preceding steps. Overly aggressive local parallelization may disrupt this constraint, potentially compromising the entire cooking process. This strict time constraint between actions raises a unique challenge for agents to balance between maximizing concurrent operations and adhering to critical timing constraints. Extensive experiments with state-of-the-art models reveal challenges in maintaining this balance between efficiency and feasibility. The results highlight the need for improved temporal awareness and global multitasking capabilities in large language models. We open-source our benchmark and code at this https URL.
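
The temporal constraint described in the abstract (an action must happen within a given interval after the preceding step) can be illustrated with a small feasibility check. This is a minimal sketch under assumed conventions: the `violations` function, the `(before, after, max_gap)` constraint format, and the recipe steps are hypothetical and are not taken from the Recipe2Plan benchmark.

```python
def violations(schedule, constraints):
    """Return the constraints that a schedule breaks.

    schedule: step name -> (start, end) in minutes.
    constraints: (before, after, max_gap) triples, meaning `after` must start
    no earlier than `before` ends and at most `max_gap` minutes afterwards.
    """
    broken = []
    for before, after, max_gap in constraints:
        gap = schedule[after][0] - schedule[before][1]
        if gap < 0 or gap > max_gap:
            broken.append((before, after, gap))
    return broken


# Hypothetical recipe: pasta must be drained within 1 minute of boiling.
constraints = [("boil pasta", "drain pasta", 1), ("drain pasta", "add sauce", 5)]
feasible = {"boil pasta": (0, 10), "drain pasta": (10, 11), "add sauce": (12, 14)}
too_greedy = {"boil pasta": (0, 10), "drain pasta": (15, 16), "add sauce": (17, 19)}

print(violations(feasible, constraints))
print(violations(too_greedy, constraints))
```

In the second schedule the agent was busy with something else when the pasta finished boiling, which is the kind of failure that overly aggressive local parallelization can cause.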

[NLP-52] Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling

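[Quick Read]: This paper addresses the tendency of large language models (LLMs) to produce erroneous outputs, or hallucinations, when queries fall outside their knowledge boundaries. Existing methods mitigate this through uncertainty estimation or query rejection, but they are either computationally inefficient or less helpful. The paper proposes the Explicit Knowledge Boundary Modeling (EKBM) framework, whose key idea is to combine fast and slow reasoning systems to balance reliability and usability. A fast-thinking model generates responses with confidence labels, so high-confidence outputs can be used immediately; for uncertain predictions, a slow refinement model performs targeted reasoning to improve accuracy. A hybrid training pipeline improves the model's self-awareness without degrading task performance. Experiments on dialogue state tracking show that EKBM is more reliable than uncertainty-based baselines, and that refinement substantially improves accuracy at low computational overhead.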

Link: https://arxiv.org/abs/2503.02233
Authors: Hang Zheng,Hongshen Xu,Yuncong Liu,Lu Chen,Pascale Fung,Kai Yu
Affiliations: X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; Center for Artificial Intelligence Research (CAiRE), Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) frequently hallucinate due to misaligned self-awareness, generating erroneous outputs when addressing queries beyond their knowledge boundaries. While existing approaches mitigate hallucinations via uncertainty estimation or query rejection, they suffer from computational inefficiency or sacrificed helpfulness. To address these issues, we propose the Explicit Knowledge Boundary Modeling (EKBM) framework, integrating fast and slow reasoning systems to harmonize reliability and usability. The framework first employs a fast-thinking model to generate confidence-labeled responses, enabling immediate use of high-confidence outputs. For uncertain predictions, a slow refinement model conducts targeted reasoning to improve accuracy. To align model behavior with our proposed object, we propose a hybrid training pipeline, enhancing self-awareness without degrading task performance. Evaluations on dialogue state tracking tasks demonstrate that EKBM achieves superior model reliability over uncertainty-based baselines. Further analysis reveals that refinement substantially boosts accuracy while maintaining low computational overhead. Our work establishes a scalable paradigm for advancing LLM reliability and balancing accuracy and practical utility in error-sensitive applications.
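
The fast/slow routing described in the abstract can be sketched as a simple control flow: use the fast model's answer when it is labeled confident, and hand uncertain answers to a refinement model. The function names, the "sure"/"unsure" labels, and the toy models below are assumptions made for illustration; the paper's actual label set and models may differ.

```python
def ekbm_answer(query, fast_model, slow_model):
    """Route a query through a fast model, refining only low-confidence drafts."""
    draft, confidence = fast_model(query)
    if confidence == "sure":
        return draft, "fast"
    return slow_model(query, draft), "refined"


# Toy stand-ins for the two models (hypothetical, for illustration only).
KNOWN = {"capital of France": "Paris"}


def toy_fast_model(query):
    """Answer from a lookup table and attach a confidence label."""
    if query in KNOWN:
        return KNOWN[query], "sure"
    return "unknown", "unsure"


def toy_slow_model(query, draft):
    """Pretend to reason further about an uncertain draft."""
    return f"[refined] {draft}"


print(ekbm_answer("capital of France", toy_fast_model, toy_slow_model))
print(ekbm_answer("capital of Atlantis", toy_fast_model, toy_slow_model))
```

Only the second query pays the cost of the slow path, which is how a design like this could keep computational overhead low.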

[NLP-53] Words or Vision: Do Vision-Language Models Have Blind Faith in Text? CVPR2025

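[Quick Read]: This paper addresses the over-reliance of Vision-Language Models (VLMs) on text when modalities disagree, a phenomenon the authors call "blind faith in text". When visual and textual inputs contradict each other, VLMs trust the text disproportionately, which causes significant performance drops and raises safety concerns. By introducing textual variations into four vision-centric tasks and evaluating ten VLMs, the paper demonstrates the problem and analyzes the factors behind it, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty.

The key to the solution is supervised fine-tuning with text augmentation, which reduces text bias. A theoretical analysis further suggests that blind faith in text may stem from an imbalance between pure text data and multi-modal data during training. The paper therefore stresses the need for balanced training and careful consideration of modality interactions to make VLMs more robust and reliable when multi-modal inputs are inconsistent.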

Link: https://arxiv.org/abs/2503.02199
Authors: Ailin Deng,Tri Cao,Zhirui Chen,Bryan Hooi
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted to CVPR 2025

Abstract:Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs’ modality preferences when faced with visual data and varied textual inputs in vision-centered settings. By introducing textual variations to four vision-centric tasks and evaluating ten Vision-Language Models (VLMs), we discover a “blind faith in text” phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. While certain factors, such as scaling up the language model size, slightly mitigate text bias, others like token order can exacerbate it due to positional biases inherited from language models. To address this issue, we explore supervised fine-tuning with text augmentation and demonstrate its effectiveness in reducing text bias. Additionally, we provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance of pure text and multi-modal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to enhance their robustness and reliability in handling multi-modal data inconsistencies.

[NLP-54] ATLaS: Agent Tuning via Learning Critical Steps

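[Quick Read]: This paper addresses the problem that behavior-cloning entire expert trajectories when tuning large language model (LLM) agents for multi-domain tasks introduces expert bias and weakens generalization to states the expert data does not cover. It also observes that critical steps such as planning, complex reasoning, and strategic decision-making are essential to agent success. The key solution is ATLaS, which identifies the critical steps in expert trajectories and fine-tunes only on those steps, reducing training cost and the risk of overfitting whole trajectories while improving generalization across environments and tasks. Experiments show that an LLM fine-tuned on only the 30% of steps that ATLaS selects as critical outperforms both an LLM fine-tuned on all steps and recent open-source LLM agents, while maintaining and improving the base LLM's skills as a generalist agent.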

Link: https://arxiv.org/abs/2503.02197
Authors: Zhixun Chen,Ming Li,Yuxuan Huang,Yali Du,Meng Fang,Tianyi Zhou
Affiliations: University of Technology Sydney; University of Maryland; University of Liverpool; King’s College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Model (LLM) agents have demonstrated remarkable generalization capabilities across multi-domain tasks. Existing agent tuning approaches typically employ supervised finetuning on entire expert trajectories. However, behavior-cloning of full trajectories can introduce expert bias and weaken generalization to states not covered by the expert data. Additionally, critical steps, such as planning, complex reasoning for intermediate subtasks, and strategic decision-making, are essential to success in agent tasks, so learning these steps is the key to improving LLM agents. For more effective and efficient agent tuning, we propose ATLaS that identifies the critical steps in expert trajectories and finetunes LLMs solely on these steps with reduced costs. By steering the training’s focus to a few critical steps, our method mitigates the risk of overfitting entire trajectories and promotes generalization across different environments and tasks. In extensive experiments, an LLM finetuned on only 30% critical steps selected by ATLaS outperforms the LLM finetuned on all steps and recent open-source LLM agents. ATLaS maintains and improves base LLM skills as generalist agents interacting with diverse environments.

[NLP-55] Adversarial Tokenization

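[Quick Read]: This paper asks whether large language models (LLMs), although trained and run with only one tokenization per string, still retain semantic understanding of alternative tokenizations, and whether this poses a threat to LLM safety. Specifically, it investigates whether adversarial tokenization can be used to evade safety and alignment restrictions.
The key contribution is showing that adversarial tokenization is an effective and previously neglected axis of attack, and that it is competitive with existing state-of-the-art adversarial approaches without changing the text of the harmful request. The paper validates the exploit empirically across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.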

Link: https://arxiv.org/abs/2503.02174
Authors: Renato Lui Geh,Zilei Shao,Guy Van den Broeck
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Current LLM pipelines account for only one possible tokenization for a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the standard Llama3 tokenization of penguin is [p,enguin], yet [peng,uin] is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about their implications in LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions? We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request. We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.
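
The abstract's point that one string admits many valid tokenizations can be made concrete by enumerating every segmentation of a word over a vocabulary. This is a minimal sketch: the `tokenizations` function and the six-entry toy vocabulary are assumptions for illustration, and a real subword vocabulary (such as Llama3's) is far larger, so the number of alternatives grows exponentially with string length.

```python
def tokenizations(text, vocab):
    """Enumerate every way to split `text` into tokens drawn from `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Recurse on the remainder and prepend the matched token.
            for rest in tokenizations(text[end:], vocab):
                results.append([head] + rest)
    return results


# Toy vocabulary (hypothetical); it includes the two splits quoted in the abstract.
vocab = {"p", "enguin", "peng", "uin", "pen", "guin"}
all_splits = tokenizations("penguin", vocab)
for split in all_splits:
    print(split)
```

A standard tokenizer deterministically returns just one of these segmentations; the attack surface described in the paper comes from feeding the model one of the others.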

[NLP-56] Tabby: Tabular Data Synthesis with Language Models

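[Quick Read]: This paper addresses the relatively low quality of synthetic tabular data, in contrast to the significant recent progress of Large Language Models (LLMs) in generating text data. The key solution is Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture that enables its use for tabular dataset synthesis. Tabby represents differences across columns with Gated Mixture-of-Experts, giving each column its own set of parameters. Paired with Plain, a novel LLM table training technique, Tabby improves quality by up to 44% over previous methods. It also extends to more general structured data such as nested JSON datasets, where the generated data reaches parity with real data.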

Link: https://arxiv.org/abs/2503.02152
Authors: Sonia Cromp,Satya Sai Srinath Namburi GNVV,Mohammed Alkhudhayri,Catherine Cao,Samuel Guo,Nicholas Roberts,Frederic Sala
Affiliations: University of Wisconsin-Madison; GE HealthCare
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 21 pages, 8 figures

Abstract:While advances in large language models (LLMs) have greatly improved the quality of synthetic text data in recent years, synthesizing tabular data has received relatively less attention. We address this disparity with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby enables the representation of differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. By pairing our novel LLM table training technique, Plain, with Tabby, we observe up to a 44% improvement in quality over previous methods. We also show that Tabby extends beyond tables to more general structured data, reaching parity with real data on a nested JSON dataset as well.

[NLP-57] Malware Classification from Memory Dumps Using Machine Learning, Transformers, and Large Language Models

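[Quick Read]: This paper studies how different classification models perform on a malware classification task, focusing on the effect of feature sets and data configurations. It evaluates six traditional machine learning models (Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forest, Extreme Gradient Boosting) and two deep learning models (Recurrent Neural Networks, Transformers), along with Gemini zero-shot and few-shot learning. The key is a comparison across four feature sets: All Features, Literature Review Features, the Top 45 Features selected by Random Forest, and Down-Sampled with Top 45 Features. On the Top 45 Features, Extreme Gradient Boosting (XGB) reaches the highest accuracy at 87.42%, followed closely by Random Forest (RF) at 87.23%. The deep learning models underperform, and down-sampling reduces performance for all models. The results indicate that feature selection matters both for model performance and for reducing computational complexity, and that traditional models such as XGB and RF outperform deep learning and few-shot methods on this task. The paper underscores the effectiveness of traditional machine learning on structured datasets and lays a foundation for future work on hybrid approaches and larger datasets.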

Link: https://arxiv.org/abs/2503.02144
Authors: Areej Dweib,Montaser Tanina,Shehab Alawi,Mohammad Dyab,Huthaifa I. Ashqar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:This study investigates the performance of various classification models for a malware classification task using different feature sets and data configurations. Six models-Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Trees, Random Forest (RF), and Extreme Gradient Boosting (XGB)-were evaluated alongside two deep learning models, Recurrent Neural Networks (RNN) and Transformers, as well as the Gemini zero-shot and few-shot learning methods. Four feature sets were tested including All Features, Literature Review Features, the Top 45 Features from RF, and Down-Sampled with Top 45 Features. XGB achieved the highest accuracy of 87.42% using the Top 45 Features, outperforming all other models. RF followed closely with 87.23% accuracy on the same feature set. In contrast, deep learning models underperformed, with RNN achieving 66.71% accuracy and Transformers reaching 71.59%. Down-sampling reduced performance across all models, with XGB dropping to 81.31%. Gemini zero-shot and few-shot learning approaches showed the lowest performance, with accuracies of 40.65% and 48.65%, respectively. The results highlight the importance of feature selection in improving model performance while reducing computational complexity. Traditional models like XGB and RF demonstrated superior performance, while deep learning and few-shot methods struggled to match their accuracy. This study underscores the effectiveness of traditional machine learning models for structured datasets and provides a foundation for future research into hybrid approaches and larger datasets.

[NLP-58] Measuring Intrinsic Dimension of Token Embeddings

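[Quick Read]: This paper addresses how to quantify the redundancy of representations in language models and how to understand the intrinsic dimensionality of their embedding spaces. The key approach is to estimate the Intrinsic Dimension (ID) of the manifold spanned by the embeddings and compare it with the model's extrinsic dimensionality. Specifically, the study estimates the ID of token embeddings in both small-scale and large language models and finds that embedding spaces typically lie on manifolds of lower dimension than their extrinsic dimensionality. Redundancy rates increase as model scale grows, and the ID drops rapidly in the early stages of training. When LoRA is applied to the embedding layers, perplexity drops sharply around the estimated ID, suggesting that the ID can serve as a useful guideline for applying LoRA.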

Link: https://arxiv.org/abs/2503.02142
Authors: Takuya Kataiwa,Cho Hakaze,Tetsushi Ohki
Affiliations: Shizuoka University; Japan Advanced Institute of Science and Technology; RIKEN AIP
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 4 pages, 4 figures

Abstract:In this study, we measure the Intrinsic Dimension (ID) of token embedding to estimate the intrinsic dimensions of the manifolds spanned by the representations, so as to evaluate their redundancy quantitatively compared to their extrinsic dimensionality. In detail, (1) we estimate the ID of token embeddings in small-scale language models and also modern large language models, finding that the embedding spaces often reside on lower-dimensional manifolds compared to their extrinsic dimensionality; (2) we measure the ID across various model sizes and observe an increase in redundancy rates as the model scale grows; (3) we measure the dynamics of IDs during the training process, and find a rapid ID drop in the early stages of training. Moreover, (4) when LoRA is applied to the embedding layers, we observe a sudden drop in perplexity around the estimated IDs, suggesting that the ID can serve as a useful guideline for LoRA application.

[NLP-59] Network Traffic Classification Using Machine Learning, Transformer, and Large Language Models

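[Quick Read]: This paper addresses network traffic classification, using a range of models to categorize traffic into web, browsing, IPSec, backup, and email. The key is a comprehensive dataset of 30,959 observations and 19 features, on which the authors evaluate Naive Bayes, Decision Tree, Random Forest, Gradient Boosting, XGBoost, Deep Neural Networks (DNN), Transformer, and two Large Language Models (LLMs), GPT-4o and Gemini, under zero-shot and few-shot learning. Transformer and XGBoost perform best, with accuracies of 98.95% and 97.56% respectively. GPT-4o and Gemini show promising results with few-shot learning but still misclassify more complex categories such as IPSec and backup. The paper highlights the importance of model selection, fine-tuning, and the balance between training data size and model complexity for reliable classification.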

Link: https://arxiv.org/abs/2503.02141
Authors: Ahmad Antari,Yazan Abo-Aisheh,Jehad Shamasneh,Huthaifa I. Ashqar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:This study uses various models to address network traffic classification, categorizing traffic into web, browsing, IPSec, backup, and email. We collected a comprehensive dataset from Arbor Edge Defender (AED) devices, comprising of 30,959 observations and 19 features. Multiple models were evaluated, including Naive Bayes, Decision Tree, Random Forest, Gradient Boosting, XGBoost, Deep Neural Networks (DNN), Transformer, and two Large Language Models (LLMs) including GPT-4o and Gemini with zero- and few-shot learning. Transformer and XGBoost showed the best performance, achieving the highest accuracy of 98.95 and 97.56%, respectively. GPT-4o and Gemini showed promising results with few-shot learning, improving accuracy significantly from initial zero-shot performance. While Gemini Few-Shot and GPT-4o Few-Shot performed well in categories like Web and Email, misclassifications occurred in more complex categories like IPSec and Backup. The study highlights the importance of model selection, fine-tuning, and the balance between training data size and model complexity for achieving reliable classification results.

[NLP-60] Forgetting Transformer: Softmax Attention with a Forget Gate ICLR2025

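[Quick Read]: This paper targets performance bottlenecks in long-context language modeling and downstream tasks, and explores how to bring a forget mechanism like that of recurrent sequence models into the Transformer architecture. The key innovation is Forgetting Attention, which down-weights the unnormalized attention scores in a data-dependent way and thereby plays the role of a recurrent model's forget gate. Integrating this mechanism into the Transformer yields the Forgetting Transformer (FoX), which improves long-context language modeling and length extrapolation while matching the standard Transformer on long-context downstream tasks. FoX needs no positional embeddings and is compatible with the FlashAttention algorithm, which adds to its practicality. The paper also introduces a "Pro" block design that significantly improves the performance of both FoX and the standard Transformer.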

Link: https://arxiv.org/abs/2503.02130
Authors: Zhixuan Lin,Evgenii Nikishin,Xu Owen He,Aaron Courville
Affiliations: Mila; Université de Montréal; MakerMaker AI; Google DeepMind
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Published as a conference paper at ICLR 2025

Abstract:An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism the Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer’s superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a “Pro” block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at this https URL.
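
One way to read "down-weighting the unnormalized attention scores in a data-dependent way" is to add a cumulative log-forget-gate bias to the causal attention logits before the softmax. The pure-Python sketch below follows that reading. The exact form of the bias (the sum of log f_l for l from j+1 to i) and the assumption that the gate values `f` are given rather than computed from the input are simplifications made here, so check the paper and its code for the actual formulation.

```python
import math


def forgetting_attention(q, k, v, f):
    """Causal softmax attention with a cumulative forget-gate bias.

    q, k, v: lists of equal-length float vectors, one per position.
    f: forget-gate value in (0, 1] per position (assumed given here).
    The logit for query i and key j is q_i . k_j + sum_{l=j+1..i} log f_l.
    """
    n = len(q)
    # Prefix sums of log f, so the bias for (i, j) is cum[i] - cum[j].
    cum, running = [], 0.0
    for gate in f:
        running += math.log(gate)
        cum.append(running)

    outputs = []
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(q[i], k[j])) + (cum[i] - cum[j])
            for j in range(i + 1)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        outputs.append([
            sum(w * v[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(v[0]))
        ])
    return outputs


# With zero queries and keys, only the forget gate shapes the attention weights.
q = k = [[0.0], [0.0]]
v = [[1.0], [0.0]]
print(forgetting_attention(q, k, v, f=[1.0, 1.0]))  # no forgetting
print(forgetting_attention(q, k, v, f=[1.0, 0.5]))  # position 1 halves the weight on position 0
```

With all gates equal to 1 the bias vanishes and this reduces to ordinary causal softmax attention, which is consistent with the abstract's framing of the gate as an addition to the Transformer.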

[NLP-61] Superficial Self-Improved Reasoners Benefit from Model Merging

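[Quick Read]: This paper addresses two core problems that arise when large language models (LLMs) self-improve. The first is model collapse, where model outputs become increasingly deterministic. The second is the superficial self-improved reasoners phenomenon: although in-domain (ID) reasoning accuracy improves, generalization to out-of-domain (OOD) tasks suffers because the model memorizes rather than genuinely understands. The analysis finds that during self-improvement, weight updates concentrate in layers that are less critical for reasoning, which leads to superficial learning. To address this, the paper proposes Iterative Model Merging (IMM), which strategically combines the weights of the original and self-improved models to preserve generalization while incorporating genuine reasoning improvements. The approach mitigates both model collapse and superficial learning, moving towards more stable self-improving systems.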

Link: https://arxiv.org/abs/2503.02103
Authors: Xiangchi Yuan,Chunhui Zhang,Zheyuan Liu,Dachuan Shi,Soroush Vosoughi,Wenke Lee
Affiliations: Georgia Institute of Technology; Dartmouth College; University of Notre Dame
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As scaled language models (LMs) approach human-level reasoning capabilities, self-improvement emerges as a solution to synthesizing high-quality data corpus. While previous research has identified model collapse as a risk in self-improvement, where model outputs become increasingly deterministic, we discover a more fundamental challenge: the superficial self-improved reasoners phenomenon. In particular, our analysis reveals that even when LMs show improved in-domain (ID) reasoning accuracy, they actually compromise their generalized reasoning capabilities on out-of-domain (OOD) tasks due to memorization rather than genuine. Through a systematic investigation of LM architecture, we discover that during self-improvement, LM weight updates are concentrated in less reasoning-critical layers, leading to superficial learning. To address this, we propose Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization while incorporating genuine reasoning improvements. Our approach effectively mitigates both LM collapse and superficial learning, moving towards more stable self-improving systems.
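
The abstract says IMM "strategically combines weights from original and self-improved models" but does not spell out the merge rule. The sketch below assumes the simplest possible variant, a linear interpolation with a fixed coefficient `alpha`, repeated after each self-improvement round and merged against the previous round's model. Those choices, the `merge` and `iterative_merge` names, and the toy `self_improve` step are all assumptions for illustration, not the authors' method.

```python
def merge(original, improved, alpha=0.5):
    """Linearly interpolate two weight dictionaries (name -> list of floats)."""
    return {
        name: [(1 - alpha) * o + alpha * i
               for o, i in zip(original[name], improved[name])]
        for name in original
    }


def iterative_merge(base, self_improve, rounds, alpha=0.5):
    """Alternate a self-improvement step with merging back into the prior model."""
    current = base
    for _ in range(rounds):
        improved = self_improve(current)
        current = merge(current, improved, alpha)
    return current


def toy_self_improve(weights):
    """Hypothetical stand-in for fine-tuning on self-generated data."""
    return {name: [w + 1.0 for w in values] for name, values in weights.items()}


merged = iterative_merge({"w": [0.0, 2.0]}, toy_self_improve, rounds=2)
print(merged)
```

Each round moves the weights only part of the way towards the self-improved model, which is one plausible way to keep the merged model close to the original's generalization behavior.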

[NLP-62] Provable Benefits of Task-Specific Prompts for In-context Learning AISTATS

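[Quick Read]: This paper studies how task-specific information can be embedded in modern language models when the global task distribution can be partitioned into a union of conditional task distributions. The key contribution is to use task-specific prompts and prediction heads to learn the prior information associated with each conditional task distribution, analyzed with a one-layer attention model. The loss landscape analysis shows that task-specific prompts enable a covariance-mean decoupling: prompt-tuning explains the conditional mean of the distribution, while the variance is learned implicitly through in-context learning. Adding task-specific prediction heads fully decouples the estimation of the mean and variance components. The same covariance-mean perspective explains why jointly training the prompts and attention weights provably helps over fine-tuning after pretraining.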

Link: https://arxiv.org/abs/2503.02102
Authors: Xiangyu Chang,Yingcong Li,Muti Kara,Samet Oymak,Amit K. Roy-Chowdhury
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025

Abstract:The in-context learning capabilities of modern language models have motivated a deeper mathematical understanding of sequence models. A line of recent work has shown that linear attention models can emulate projected gradient descent iterations to implicitly learn the task vector from the data provided in the context window. In this work, we consider a novel setting where the global task distribution can be partitioned into a union of conditional task distributions. We then examine the use of task-specific prompts and prediction heads for learning the prior information associated with the conditional task distribution using a one-layer attention model. Our results on loss landscape show that task-specific prompts facilitate a covariance-mean decoupling where prompt-tuning explains the conditional mean of the distribution whereas the variance is learned/explained through in-context learning. Incorporating task-specific head further aids this process by entirely decoupling estimation of mean and variance components. This covariance-mean perspective similarly explains how jointly training prompt and attention weights can provably help over fine-tuning after pretraining.

[NLP-63] Twenty Years of Personality Computing: Threats, Challenges and Future Directions

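[Quick Read]: This paper surveys the state of Personality Computing, its key methodologies, and the challenges and threats it faces, and outlines directions for responsible development and deployment. The core problem it addresses is the ethical concerns raised by the field's rapid growth, including data privacy, algorithmic bias, and the potential for manipulation by personality-aware Artificial Intelligence. The field's central technique is understanding and predicting human personality traits by analyzing digital footprints (text, images, social media, etc.); the paper stresses building ethical considerations into the design and deployment of these technologies so that their use is consistent with social values and ethical norms.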

Link: https://arxiv.org/abs/2503.02082
Authors: Fabio Celli,Aleksandar Kartelj,Miljan Đorđević,Derwin Suhartono,Vladimir Filipović,Veljko Milutinović,Georgios Spathoulas,Alessandro Vinciarelli,Michal Kosinski,Bruno Lepri
Affiliations: Maggioli SpA (Santarcangelo di Romagna, Italy); Faculty of Mathematics, University of Belgrade (Belgrade, Serbia); School of Electrical Engineering, University of Belgrade (Belgrade, Serbia); School of Computer Science, Bina Nusantara University (Jakarta, Indonesia); Dept. of Information Security, Norwegian University of Science and Technology (Gjøvik, Norway); School of Computing Science, University of Glasgow (Glasgow, UK); Graduate School of Business, Stanford University (Stanford, CA, USA); Center for Augmented Intelligence, Fondazione Bruno Kessler (Trento, Italy)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Personality Computing is a field at the intersection of Personality Psychology and Computer Science. Started in 2005, research in the field utilizes computational methods to understand and predict human personality traits. The expansion of the field has been very rapid and, by analyzing digital footprints (text, images, social media, etc.), it helped to develop systems that recognize and even replicate human personality. While offering promising applications in talent recruiting, marketing and healthcare, the ethical implications of Personality Computing are significant. Concerns include data privacy, algorithmic bias, and the potential for manipulation by personality-aware Artificial Intelligence. This paper provides an overview of the field, explores key methodologies, discusses the challenges and threats, and outlines potential future directions for responsible development and deployment of Personality Computing technologies.

[NLP-64] Linear Representations of Political Perspective Emerge in Large Language Models ICLR2025

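[Quick Read]: This paper asks how large language models (LLMs) are able to generate text that reflects different subjective political viewpoints, and whether those implicit viewpoints can be identified, monitored, and steered. The key approach is to analyze linear representations in the LLMs' activation space with linear probes. The probes predict the DW-NOMINATE scores of U.S. lawmakers, and the same probes also predict measures of news outlets' political slant. The study further shows that applying linear interventions to the relevant attention heads steers model outputs toward a more liberal or conservative stance. The main contribution is to use recent advances in mechanistic interpretability to reveal a high-level linear representation of American political ideology in LLMs, and to provide a way to visualize, interpret, and steer the subjective perspective underlying generated text.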

Link: https://arxiv.org/abs/2503.02080
Authors: Junsol Kim,James Evans,Aaron Schein
Affiliations: University of Chicago; Google
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Published as a conference paper at ICLR 2025

Abstract:Large language models (LLMs) have demonstrated the ability to generate text that realistically reflects a range of different subjective human perspectives. This paper studies how LLMs are seemingly able to reflect more liberal versus more conservative viewpoints among other political perspectives in American politics. We show that LLMs possess linear representations of political perspectives within activation space, wherein more similar perspectives are represented closer together. To do so, we probe the attention heads across the layers of three open transformer-based LLMs (\textttLlama-2-7b-chat, \textttMistral-7b-instruct, \textttVicuna-7b). We first prompt models to generate text from the perspectives of different U.S.~lawmakers. We then identify sets of attention heads whose activations linearly predict those lawmakers’ DW-NOMINATE scores, a widely-used and validated measure of political ideology. We find that highly predictive heads are primarily located in the middle layers, often speculated to encode high-level concepts and tasks. Using probes only trained to predict lawmakers’ ideology, we then show that the same probes can predict measures of news outlets’ slant from the activations of models prompted to simulate text from those news outlets. These linear probes allow us to visualize, interpret, and monitor ideological stances implicitly adopted by an LLM as it generates open-ended responses. Finally, we demonstrate that by applying linear interventions to these attention heads, we can steer the model outputs toward a more liberal or conservative stance. Overall, our research suggests that LLMs possess a high-level linear representation of American political ideology and that by leveraging recent advances in mechanistic interpretability, we can identify, monitor, and steer the subjective perspective underlying generated text.

[NLP-65] Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation

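[Quick read]: This paper addresses the open challenge of understanding and interpreting the internal representations of large language models (LLMs). The key solution is Superscopes, a technique that systematically amplifies superposed features in multilayer perceptron (MLP) outputs and hidden states and then patches them into new contexts, allowing a deeper reading of the model's internal representations. Inspired by the "features as directions" perspective and by Classifier-Free Guidance (CFG) from diffusion models, the method amplifies weak but meaningful features and explains internal representations that earlier methods could not, without any additional training. It offers new insight into how LLMs build context and represent complex concepts, further advancing mechanistic interpretability.

Link: https://arxiv.org/abs/2503.02078
Authors: Jonathan Jacobi,Gal Niv
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: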

Abstract:Understanding and interpreting the internal representations of large language models (LLMs) remains an open challenge. Patchscopes introduced a method for probing internal activations by patching them into new prompts, prompting models to self-explain their hidden representations. We introduce Superscopes, a technique that systematically amplifies superposed features in multilayer perceptron (MLP) outputs and hidden states before patching them into new contexts. Inspired by the “features as directions” perspective and the Classifier-Free Guidance (CFG) approach from diffusion models, Superscopes amplifies weak but meaningful features, enabling the interpretation of internal representations that previous methods failed to explain, all without requiring additional training. This approach provides new insights into how LLMs build context and represent complex concepts, further advancing mechanistic interpretability.

[NLP-66] Hebbian learning the local structure of language

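[Quick read]: This paper explores the microscopic origins of human language learning and seeks to understand the neural mechanisms of language and creativity. The proposed approach builds an effective model of human language from the local, unsupervised (Hebbian) nature of learning in the brain. The model has two key parts: (1) a hierarchy of neurons that learns to tokenize words from text; and (2) additional neurons that bind the tokenizer's meaningless patterns into a meaningful embedding. The key is a mechanism that permits continuous parallel learning without forgetting and that exploits redundancy through a renormalization-group procedure, so that generated tokens are always decomposable into a basis set (such as an alphabet) and can mix features learned from multiple languages. This design lets the model learn natural-language morphology without data and predict a distribution of word-forming patterns consistent with that observed in real languages, which further shows why human language is broken into words at the microscopic level.

Link: https://arxiv.org/abs/2503.02057
Authors: P. Myles Eugenio
Affiliations: Department of Physics, Indiana University, Bloomington, Indiana 47405, USA; Department of Physics, University of Connecticut, Storrs, Connecticut 06269, USA; Department of Physics, Harvard University, Cambridge, Massachusetts 02138, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 10 figures, 14 pages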

Abstract:Learning in the brain is local and unsupervised (Hebbian). We derive the foundations of an effective human language model inspired by these microscopic constraints. It has two parts: (1) a hierarchy of neurons which learns to tokenize words from text (whichiswhatyoudowhenyoureadthis); and (2) additional neurons which bind the learned semanticless patterns of the tokenizer into a semanticful token (an embedding). The model permits continuous parallel learning without forgetting; and is a powerful tokenizer which performs renormalization group. This allows it to exploit redundancy, such that it generates tokens which are always decomposable into a basis set (e.g., an alphabet), and can mix features learned from multiple languages. We find that the structure of this model allows it to learn a natural language morphology WITHOUT data. The language data generated by this model predicts the correct distribution of word-forming patterns observed in real languages, and further demonstrates why microscopically human speech is broken up into words. This model provides the basis for understanding the microscopic origins of language and human creativity.

[NLP-67] CareerBERT: Matching Resumes to ESCO Jobs in a Shared Embedding Space for Generic Job Recommendations

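[Quick read]: This paper addresses the challenges that traditional job matching and consultation services face in a rapidly changing labor market, in particular imprecise job recommendations caused by technological progress and economic shifts. In response, it proposes an advanced support tool for career counselors and job seekers based on CareerBERT. The key to the solution is to use unstructured text such as resumes and to build a corpus that combines the European Skills, Competences, and Occupations (ESCO) taxonomy with European Employment Services (EURES) job advertisements, giving a more complete and up-to-date representation of job titles. A two-step evaluation, an application-grounded evaluation on EURES advertisements and a human-grounded evaluation with real resumes and feedback from HR experts, shows that CareerBERT outperforms both traditional and state-of-the-art embedding approaches at generating relevant resume-based job recommendations and performs robustly in the human evaluation. These results indicate that CareerBERT can make career consultations more efficient and broaden job seekers' perspectives.

Link: https://arxiv.org/abs/2503.02056
Authors: Julian Rosenberger,Lukas Wolfrum,Sven Weinzierl,Mathias Kraus,Patrick Zschech
Affiliations: University of Regensburg; Friedrich-Alexander-Universität Erlangen-Nürnberg; Technical University of Dresden
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at Expert Systems with Applications. In Press, see this https URL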

Abstract:The rapidly evolving labor market, driven by technological advancements and economic shifts, presents significant challenges for traditional job matching and consultation services. In response, we introduce an advanced support tool for career counselors and job seekers based on CareerBERT, a novel approach that leverages the power of unstructured textual data sources, such as resumes, to provide more accurate and comprehensive job recommendations. In contrast to previous approaches that primarily focus on job recommendations based on a fixed set of concrete job advertisements, our approach involves the creation of a corpus that combines data from the European Skills, Competences, and Occupations (ESCO) taxonomy and EURopean Employment Services (EURES) job advertisements, ensuring an up-to-date and well-defined representation of general job titles in the labor market. Our two-step evaluation approach, consisting of an application-grounded evaluation using EURES job advertisements and a human-grounded evaluation using real-world resumes and Human Resources (HR) expert feedback, provides a comprehensive assessment of CareerBERT’s performance. Our experimental results demonstrate that CareerBERT outperforms both traditional and state-of-the-art embedding approaches while showing robust effectiveness in human expert evaluations. These results confirm the effectiveness of CareerBERT in supporting career consultants by generating relevant job recommendations based on resumes, ultimately enhancing the efficiency of job consultations and expanding the perspectives of job seekers. This research contributes to the field of NLP and job recommendation systems, offering valuable insights for both researchers and practitioners in the domain of career consulting and job matching.

[NLP-68] EPEE: Towards Efficient and Effective Foundation Models in Biomedicine

[Quick read]: This paper addresses the high inference latency and "overthinking" of foundation models on biomedical tasks, which limit their use in real-time clinical settings. To tackle these challenges, it proposes EPEE (Entropy- and Patience-based Early Exiting), a novel hybrid strategy. The key is to combine the strengths of entropy-based and patience-based early exiting while overcoming the weaknesses of each, which substantially reduces inference time while maintaining or improving accuracy and balances efficiency with effectiveness. By adapting to diverse datasets and tasks, the method shows potential for practical medical deployment and offers a practical route to real-time clinical decision-making with foundation models.

Link: https://arxiv.org/abs/2503.02053
Authors: Zaifu Zhan,Shuang Zhou,Huixue Zhou,Zirui Liu,Rui Zhang
Affiliations: Department of Electrical and Computer Engineering, University of Minnesota; Division of Computational Health Sciences, Department of Surgery, University of Minnesota; Institute for Health Informatics, University of Minnesota; Department of Computer Science and Engineering, University of Minnesota; Division of Computational Health Sciences, Department of Surgery, University of Minnesota
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to npj Digital Medicine

Abstract:Foundation models, including language models, e.g., GPT, and vision models, e.g., CLIP, have significantly advanced numerous biomedical tasks. Despite these advancements, the high inference latency and the “overthinking” issues in model inference impair the efficiency and effectiveness of foundation models, thus limiting their application in real-time clinical settings. To address these challenges, we proposed EPEE (Entropy- and Patience-based Early Exiting), a novel hybrid strategy designed to improve the inference efficiency of foundation models. The core idea was to leverage the strengths of entropy-based and patience-based early exiting methods to overcome their respective weaknesses. To evaluate EPEE, we conducted experiments on three core biomedical tasks (classification, relation extraction, and event extraction) using four foundation models (BERT, ALBERT, GPT-2, and ViT) across twelve datasets, including clinical notes and medical images. The results showed that EPEE significantly reduced inference time while maintaining or improving accuracy, demonstrating its adaptability to diverse datasets and tasks. EPEE addressed critical barriers to deploying foundation models in healthcare by balancing efficiency and effectiveness. It potentially provided a practical solution for real-time clinical decision-making with foundation models, supporting reliable and efficient workflows.
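
The abstract does not spell out the exit rule, so the following is only a minimal sketch of one way an entropy criterion and a patience criterion could be combined. It assumes every layer has its own classifier that outputs class probabilities. The function name `early_exit`, the default thresholds, and the choice to exit when either criterion fires are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Optional, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit(layer_probs: List[Sequence[float]],
               entropy_threshold: float = 0.3,
               patience: int = 2) -> Tuple[int, Optional[int]]:
    """Return (exit_layer_index, predicted_class).

    Hypothetical rule: stop at the first layer whose prediction is confident
    (entropy below the threshold) or whose argmax has been identical for
    `patience` consecutive layers. Falls back to the last layer otherwise.
    """
    streak = 0
    previous: Optional[int] = None
    for index, probs in enumerate(layer_probs):
        prediction = max(range(len(probs)), key=probs.__getitem__)
        # Count how many consecutive layers agree on the same class.
        streak = streak + 1 if prediction == previous else 1
        previous = prediction
        if entropy(probs) < entropy_threshold or streak >= patience:
            return index, prediction
    return len(layer_probs) - 1, previous
```

With `patience=2`, two consecutive layers that agree on a class trigger an exit even when neither is individually confident; a large `patience` makes the entropy test the only practical exit condition.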

[NLP-69] Persuasion at Play: Understanding Misinformation Dynamics in Demographic-Aware Human-LLM Interactions

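[Quick read]: This paper studies the bidirectional persuasion dynamics between large language models (LLMs) and humans in the spread of misinformation, in particular how susceptibility and influence differ across demographic groups exposed to misleading content. It examines how LLMs amplify existing biases by generating persuasive content and how humans and LLMs influence each other. The key to the approach is to combine several human-LLM interaction datasets to analyze human-to-LLM and LLM-to-human influence, and to introduce a demographic-aware multi-agent LLM framework that simulates and evaluates how misinformation spreads under persuasion. The findings show that LLM susceptibility to misinformation is shaped by demographic factors in ways that resemble human groups, and that multi-agent LLMs exhibit echo chamber behavior. These findings offer insights for future interventions aimed at different demographic groups.

Link: https://arxiv.org/abs/2503.02038
Authors: Angana Borah,Rada Mihalcea,Verónica Pérez-Rosas
Affiliations: University of Michigan - Ann Arbor; Texas State University
Categories: Computation and Language (cs.CL)
Comments: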

Abstract:Existing challenges in misinformation exposure and susceptibility vary across demographic groups, as some populations are more vulnerable to misinformation than others. Large language models (LLMs) introduce new dimensions to these challenges through their ability to generate persuasive content at scale and reinforce existing biases. This study investigates the bidirectional persuasion dynamics between LLMs and humans when exposed to misinformative content. We analyze human-to-LLM influence using human-stance datasets and assess LLM-to-human influence by generating LLM-based persuasive arguments. Additionally, we use a multi-agent LLM framework to analyze the spread of misinformation under persuasion among demographic-oriented LLM agents. Our findings show that demographic factors influence susceptibility to misinformation in LLMs, closely reflecting the demographic-based patterns seen in human susceptibility. We also find that, similar to human demographic groups, multi-agent LLMs exhibit echo chamber behavior. This research explores the interplay between humans and LLMs, highlighting demographic differences in the context of misinformation and offering insights for future interventions.

[NLP-70] Comparative Analysis of OpenAI GPT-4o and DeepSeek R1 for Scientific Text Categorization Using Prompt Engineering

[Quick read]: This paper evaluates large language models on scientific text categorization, in particular how DeepSeek R1 performs at classifying relationships in sentences from scientific papers. The key contribution is a new evaluation method and a dataset of scientific papers from diverse domains, used to compare the effectiveness and consistency of GPT-4o and DeepSeek R1 on the classification task.

Link: https://arxiv.org/abs/2503.02032
Authors: Aniruddha Maiti,Samuel Adewumi,Temesgen Alemayehu Tikure,Zichun Wang,Niladri Sengupta,Anastasiia Sukhanova,Ananya Jana
Affiliations: Department of Mathematics, Engineering, and Computer Science, West Virginia State University, Institute, WV, USA; Fractal Analytics Inc, USA; Department of Computer Sciences and Electrical Engineering, Marshall University, Huntington, WV, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ASEE North Central Section 2025

Abstract:This study examines how large language models categorize sentences from scientific papers using prompt engineering. We use two advanced web-based models, GPT-4o (by OpenAI) and DeepSeek R1, to classify sentences into predefined relationship categories. DeepSeek R1 has been tested on benchmark datasets in its technical report. However, its performance in scientific text categorization remains unexplored. To address this gap, we introduce a new evaluation method designed specifically for this task. We also compile a dataset of cleaned scientific papers from diverse domains. This dataset provides a platform for comparing the two models. Using this dataset, we analyze their effectiveness and consistency in categorization.

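[NLP-71] Mind the (Belief) Gap: Group Identity in the World of LLMs

[Quick read]: This paper addresses how social biases and belief-driven behaviors affect the task decisions of large language models (LLMs) in multi-agent systems. Specifically, it studies how LLMs behave when simulating group psychological traits such as belief congruence, and explores the negative effects on misinformation dissemination and on learning. The key is a multi-agent framework grounded in social psychology theory, together with mitigation strategies inspired by the contact hypothesis, accuracy nudges, and the global citizenship framework, which reduce the harm caused by excessive belief congruence in LLMs. Experiments show that the best strategies reduce misinformation dissemination by up to 37% and improve learning by 11%. The work offers useful insights for using LLMs in real-world interactions while addressing belief-driven biases.

Link: https://arxiv.org/abs/2503.02016
Authors: Angana Borah,Marwa Houalla,Rada Mihalcea
Affiliations: University of Michigan - Ann Arbor
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: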

Abstract:Social biases and belief-driven behaviors can significantly impact the decisions of Large Language Models (LLMs) on several tasks. As LLMs are increasingly used in multi-agent systems for societal simulations, their ability to model fundamental group psychological characteristics remains critical yet under-explored. In this study, we present a multi-agent framework that simulates belief congruence, a classical group psychology theory that plays a crucial role in shaping societal interactions and preferences. Our findings reveal that LLMs exhibit amplified belief congruence compared to humans, across diverse contexts. We further investigate the implications of this behavior on two downstream tasks: (1) misinformation dissemination and (2) LLM learning, finding that belief congruence in LLMs increases misinformation dissemination and impedes learning. To mitigate these negative impacts, we propose strategies inspired by: (1) contact hypothesis, (2) accuracy nudges, and (3) global citizenship framework. Our results show that the best strategies reduce misinformation dissemination by up to 37% and enhance learning by 11%. Bridging social psychology and AI, our work provides insights to navigate real-world interactions using LLMs while addressing belief-driven biases.

[NLP-72] HoT: Highlighted Chain of Thought for Referencing Supportive Facts from Inputs

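[Quick read]: This paper addresses the tendency of large language models (LLMs) to generate non-factual statements, which yields responses that mix factual and non-factual content and are hard for humans to verify and base decisions on. It proposes Highlighted Chain-of-Thought Prompting (HoT), whose key idea is to use XML tags to link the generated answer to facts given in the input query: the LLM first re-formats the question to highlight key facts and then highlights the facts it references while generating the answer. In few-shot settings, HoT outperforms vanilla chain-of-thought prompting (CoT) on 17 tasks spanning arithmetic, reading comprehension, and logical reasoning. When humans verify LLM responses, the highlights help time-limited participants recognize more accurately and efficiently when the LLM is correct; surprisingly, however, when the LLM is wrong, HoT tends to make users believe the answer is correct.

Link: https://arxiv.org/abs/2503.02003
Authors: Tin Nguyen,Logan Bolton,Mohammad Reza Taesiri,Anh Totti Nguyen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: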

Abstract:An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) on a wide range of 17 tasks, from arithmetic and reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to make users believe that an answer is correct.
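
To make the tagging idea concrete, here is a small sketch of how highlighted spans might be extracted and cross-checked. The abstract only says that XML tags are used; the tag names `fact1`, `fact2`, ... and both helper functions are assumptions made for illustration, not the paper's format or code.

```python
import re
from typing import Dict, List

# Assumed tag scheme: <fact1>...</fact1>, <fact2>...</fact2>, and so on.
TAG_PATTERN = re.compile(r"<(fact\d+)>(.*?)</\1>", re.DOTALL)


def extract_highlights(text: str) -> Dict[str, List[str]]:
    """Map each tag name (e.g. 'fact1') to the spans it wraps in `text`."""
    highlights: Dict[str, List[str]] = {}
    for tag, span in TAG_PATTERN.findall(text):
        highlights.setdefault(tag, []).append(span.strip())
    return highlights


def ungrounded_tags(question: str, answer: str) -> List[str]:
    """Return tags used in the answer that never appear in the question."""
    question_tags = extract_highlights(question)
    answer_tags = extract_highlights(answer)
    return sorted(tag for tag in answer_tags if tag not in question_tags)
```

A check like `ungrounded_tags` only confirms that an answer's highlights point back to something in the question; it says nothing about whether the reasoning built on those facts is right, which is consistent with the abstract's caution about wrong answers looking credible.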

[NLP-73] One ruler to measure them all: Benchmarking multilingual long-context language models

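[Quick read]: This paper addresses the evaluation of multilingual long-context language models and the performance gaps between languages with different resource levels. The key is ONERULER, a multilingual benchmark that adapts the English-only RULER benchmark with seven synthetic tasks testing retrieval and aggregation, including new variants of the "needle-in-a-haystack" task in which the needle may not exist. ONERULER is built in two steps: task instructions are first written in English and then translated by native speakers into 25 additional languages. As context length grows from 8K to 128K tokens, the performance gap between low- and high-resource languages widens for both open-weight and closed LLMs. Surprisingly, English is not the best-performing language on long-context tasks; Polish ranks first. Many LLMs also incorrectly predict that no answer exists, even in high-resource languages. Finally, in cross-lingual scenarios where the instruction and context languages differ, performance can fluctuate by up to 20%. ONERULER is intended to support future research on multilingual and cross-lingual long-context training pipelines.

Link: https://arxiv.org/abs/2503.01996
Authors: Yekyung Kim,Jenna Russell,Marzena Karpinska,Mohit Iyyer
Affiliations: University of Maryland, College Park; Microsoft; UMass Amherst
Categories: Computation and Language (cs.CL)
Comments: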

Abstract:We present ONERULER, a multilingual benchmark designed to evaluate long-context language models across 26 languages. ONERULER adapts the English-only RULER benchmark (Hsieh et al., 2024) by including seven synthetic tasks that test both retrieval and aggregation, including new variations of the “needle-in-a-haystack” task that allow for the possibility of a nonexistent needle. We create ONERULER through a two-step process, first writing English instructions for each task and then collaborating with native speakers to translate them into 25 additional languages. Experiments with both open-weight and closed LLMs reveal a widening performance gap between low- and high-resource languages as context length increases from 8K to 128K tokens. Surprisingly, English is not the top-performing language on long-context tasks (ranked 6th out of 26), with Polish emerging as the top language. Our experiments also show that many LLMs (particularly OpenAI’s o3-mini-high) incorrectly predict the absence of an answer, even in high-resource languages. Finally, in cross-lingual scenarios where instructions and context appear in different languages, performance can fluctuate by up to 20% depending on the instruction language. We hope the release of ONERULER will facilitate future research into improving multilingual and cross-lingual long-context training pipelines.
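
As a rough illustration of a needle-in-a-haystack item that allows for a nonexistent needle, the sketch below builds one prompt and its expected answer. The needle wording, the `'none'` convention, and the function `build_example` are hypothetical and are not taken from the benchmark itself.

```python
import random
from typing import List, Optional, Tuple


def build_example(filler: List[str], value: Optional[str],
                  question: str, seed: int = 0) -> Tuple[str, str]:
    """Return (prompt, expected_answer) for one synthetic retrieval item.

    If `value` is None, no needle is inserted and the expected answer is the
    literal string 'none' (an assumed convention for this sketch).
    """
    rng = random.Random(seed)  # seeded so the item is reproducible
    sentences = list(filler)
    if value is not None:
        needle = f"The special magic number is {value}."
        # randint is inclusive on both ends, so the needle may also be appended.
        sentences.insert(rng.randint(0, len(sentences)), needle)
    prompt = (" ".join(sentences) + "\n\n" + question
              + " If the answer is not in the text, reply 'none'.")
    return prompt, ("none" if value is None else value)
```

Scoring a model on the `value is None` items separately would surface the failure the abstract describes, where a model wrongly reports that the answer is absent or invents one.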

[NLP-74] Adaptively evaluating models with task elicitation

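[Quick read]: This paper addresses the problem that manually curated evaluation datasets cannot keep pace with the rapidly growing capabilities and deployment scenarios of language models. To enable scalable model evaluation, it introduces and validates a framework called Adaptive Evaluations. The key is to use scaffolded language models (evaluator agents) that search through a target model's behavior on a domain dataset and generate difficult tasks that expose and probe the model's failure modes. The approach reveals a lack of consistency in frontier models across diverse datasets and tasks, such as legal reasoning, forecasting, and online harassment. The generated questions pass human validity checks and often transfer to other models with different capability profiles, so the framework can also be used to create difficult domain-specific datasets.

Link: https://arxiv.org/abs/2503.01986
Authors: Davis Brown,Prithvi Balehannina,Helen Jin,Shreya Havaldar,Hamed Hassani,Eric Wong
Affiliations: Department of Computer and Information Science, University of Pennsylvania
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: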

Abstract:Manual curation of evaluation datasets is struggling to keep up with the rapidly expanding capabilities and deployment scenarios of language models. Towards scalable model profiling, we introduce and validate a framework for evaluating LLMs, called Adaptive Evaluations. Adaptive evaluations use scaffolded language models (evaluator agents) to search through a target model’s behavior on a domain dataset and create difficult questions (tasks) that can discover and probe the model’s failure modes. We find that frontier models lack consistency when adaptively probed with our framework on a diverse suite of datasets and tasks, including but not limited to legal reasoning, forecasting, and online harassment. Generated questions pass human validity checks and often transfer to other models with different capability profiles, demonstrating that adaptive evaluations can also be used to create difficult domain-specific datasets.

[NLP-75] Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval CVPR2025

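[Quick read]: This paper addresses cross-modal retrieval with multimodal queries (an image plus text) over collections of multimodal documents in which images and text are interleaved. The key contribution is ReT, a model that uses multi-level representations extracted from different layers of the visual and textual backbones and applies a novel Transformer-based recurrent cell on both the query and document sides. The cell merges textual and visual features through sigmoidal gates inspired by the classical LSTM design, enabling multi-level, cross-modal understanding and feature extraction. Experiments show that ReT achieves state-of-the-art performance on the M2KR and M-BEIR benchmarks.

Link: https://arxiv.org/abs/2503.01980
Authors: Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Affiliations: University of Modena and Reggio Emilia; IIT-CNR, Pisa
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: CVPR 2025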

Abstract:Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and a text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at this https URL.

[NLP-76] Analyzing the Safety of Japanese Large Language Models in Stereotype-Triggering Prompts

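[Quick read]: This paper addresses unsafe behavior that large language models (LLMs) may exhibit in Japanese because of inherent stereotypes. Existing work relies mainly on indirect evaluation, which has limitations, while the direct evaluation methods that have emerged focus on English-centric models, leaving non-English models, especially Japanese ones, understudied. The authors therefore construct 3,612 direct-evaluation prompts by combining 301 social group terms with 12 stereotype-inducing templates and analyze three foundational models trained on Japanese, English, and Chinese respectively. The key findings are that LLM-jp, a Japanese native model, has the lowest refusal rate and is more likely to generate toxic and negative responses, that prompt format significantly influences the output of all models, and that the responses include exaggerated reactions toward specific social groups. The paper highlights the insufficient ethical safety mechanisms in Japanese LLMs and calls for better safety mechanisms and bias mitigation strategies, contributing to discussions on AI ethics across languages.

Link: https://arxiv.org/abs/2503.01947
Authors: Akito Nakanishi,Yukie Sano,Geng Liu,Francesco Pierri
Affiliations: Graduate School of Science and Technology, University of Tsukuba; Institute of Systems and Information Engineering, University of Tsukuba; Department of Electronics, Information and Bioengineering, Politecnico di Milano
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: This paper has been submitted to IEEE Transactions on Artificial Intelligence for possible publication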

Abstract:In recent years, Large Language Models (LLMs) have attracted growing interest for their significant potential, though concerns have rapidly emerged regarding unsafe behaviors stemming from inherent stereotypes and [...]. [...] research on stereotypes in LLMs has primarily relied on indirect evaluation setups, in which models are prompted to select between pairs of sentences associated with particular social groups. Recently, direct evaluation methods have emerged, examining open-ended model responses to overcome limitations of previous approaches, such as annotator [...]. [...] existing studies have focused on English-centric LLMs, whereas research on non-English models, particularly Japanese, remains sparse, despite the growing development and adoption of these [...]. [...] study examines the safety of Japanese LLMs when responding to stereotype-triggering prompts in direct [...]. [...] constructed 3,612 prompts by combining 301 social group terms, categorized by age, gender, and other attributes, with 12 stereotype-inducing templates in [...]. [...] were analyzed from three foundational models trained respectively on Japanese, English, and Chinese [...]. [...] findings reveal that LLM-jp, a Japanese native model, exhibits the lowest refusal rate and is more likely to generate toxic and negative responses compared to other [...]. [...], prompt format significantly influences the output of all models, and the generated responses include exaggerated reactions toward specific social groups, varying across [...]. [...] findings underscore the insufficient ethical safety mechanisms in Japanese LLMs and demonstrate that even high-accuracy models can produce biased outputs when processing Japanese-language [...]. [...] advocate for improving safety mechanisms and bias mitigation strategies in Japanese LLMs, contributing to ongoing discussions on AI ethics beyond linguistic boundaries.

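[NLP-77] AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification

[Quick read]: This paper addresses two key limitations of existing interactive clarification methods for user queries: reliance on manually constructed datasets and the lack of error-correction mechanisms during multi-turn clarification. It proposes AskToAct, a framework built on the structural mapping between queries and their tool invocation solutions. The key innovations are to construct high-quality training data automatically by systematically removing key parameters from queries while keeping them as ground truth, and to fine-tune on error-correction augmented data with a selective masking mechanism so that the model can detect and correct errors dynamically during clarification. Experiments show that AskToAct recovers critical unspecified intents with above 79% accuracy and improves clarification efficiency by an average of 48.34% while keeping tool invocation accuracy high. It also generalizes to entirely unseen APIs without additional training, with performance comparable to GPT-4 at substantially lower computational cost.

Link: https://arxiv.org/abs/2503.01940
Authors: Xuan Zhang,Yongliang Shen,Zhe Zheng,Linjuan Wu,Wenqi Zhang,Yuchen Yan,Qiuying Peng,Jun Wang,Weiming Lu
Affiliations: Zhejiang University; OPPO Research Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: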

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in tool learning. In real-world scenarios, user queries are often ambiguous and incomplete, requiring effective clarification. However, existing interactive clarification approaches face two critical limitations: reliance on manually constructed datasets and lack of error correction mechanisms during multi-turn clarification. We present AskToAct, which addresses these challenges by exploiting the structural mapping between queries and their tool invocation solutions. Our key insight is that tool parameters naturally represent explicit user intents. By systematically removing key parameters from queries while retaining them as ground truth, we enable automated construction of high-quality training data. We further enhance model robustness by fine-tuning on error-correction augmented data using a selective masking mechanism, enabling dynamic error detection during clarification interactions. Comprehensive experiments demonstrate that AskToAct significantly outperforms existing approaches, achieving above 79% accuracy in recovering critical unspecified intents and enhancing clarification efficiency by an average of 48.34% while maintaining high accuracy in tool invocation. Our framework exhibits robust performance across varying complexity levels and successfully generalizes to entirely unseen APIs without additional training, achieving performance comparable to GPT-4 with substantially fewer computational resources.
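
The data-construction idea (remove a parameter from the query, keep it as ground truth) can be sketched in a few lines. The representation below, a base sentence plus one phrase per tool parameter, is a simplifying assumption for illustration; the function `make_underspecified` and the example fields are hypothetical and not the authors' pipeline.

```python
from typing import Dict, List, Tuple


def make_underspecified(base: str,
                        phrases: Dict[str, str],
                        arguments: Dict[str, str],
                        drop: List[str]) -> Tuple[str, Dict[str, str]]:
    """Build an ambiguous query and the intents a model must ask about.

    `phrases` maps each tool parameter to the text that expresses it in the
    full query; `arguments` holds the corresponding tool-call values. The
    parameters listed in `drop` are removed from the query and returned as
    the ground truth to recover through clarification.
    """
    kept = [phrase for name, phrase in phrases.items() if name not in drop]
    query = " ".join([base] + kept)
    missing = {name: arguments[name] for name in drop}
    return query, missing


query, missing = make_underspecified(
    base="Book a flight",
    phrases={"destination": "to Paris", "date": "on 5 March"},
    arguments={"destination": "Paris", "date": "2025-03-05"},
    drop=["date"],
)
```

Here `query` no longer mentions the date, while `missing` records the value a clarification dialogue should recover before the tool is called.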

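[NLP-78] MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

[Quick read]: This paper addresses the inability of existing benchmarks to evaluate large language models (LLMs) in multi-agent collaboration and competition. It proposes MultiAgentBench, a comprehensive benchmark for evaluating LLM-based multi-agent systems across diverse interactive scenarios. The key is a set of novel milestone-based key performance indicators (KPIs) that measure not only task completion but also the quality of collaboration and competition, together with an evaluation of coordination protocols (star, chain, tree, and graph topologies) and strategies such as group discussion and cognitive planning. In the experiments, gpt-4o-mini achieves the highest average task score, the graph structure coordinates best in the research scenario, and cognitive planning improves milestone achievement rates by 3%.

Link: https://arxiv.org/abs/2503.01935
Authors: Kunlun Zhu,Hongyi Du,Zhaochen Hong,Xiaocheng Yang,Shuyi Guo,Zhe Wang,Zhenhailong Wang,Cheng Qian,Xiangru Tang,Heng Ji,Jiaxuan You
Affiliations: University of Illinois Urbana-Champaign
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: this https URL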

Abstract:Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at this https URL.
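
The four coordination topologies named in the abstract can be pictured as communication graphs over the agents. The sketch below is one plausible reading, assuming a single hub for the star, a binary layout for the tree, and a fully connected mesh for the graph; the benchmark's own definitions may differ, and `topology_edges` is a hypothetical helper.

```python
from itertools import combinations
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def topology_edges(agents: List[str], kind: str) -> Set[Edge]:
    """Return the communication links between agents for a given topology."""
    if kind == "star":
        hub = agents[0]  # assumed: the first agent acts as the coordinator
        return {(hub, agent) for agent in agents[1:]}
    if kind == "chain":
        return set(zip(agents, agents[1:]))
    if kind == "tree":
        # Assumed binary tree in array order: parent of node i is (i - 1) // 2.
        return {(agents[(i - 1) // 2], agents[i]) for i in range(1, len(agents))}
    if kind == "graph":
        # Assumed fully connected mesh: every pair of agents can talk.
        return set(combinations(agents, 2))
    raise ValueError(f"unknown topology: {kind}")
```

For n agents the star, chain, and tree each use n - 1 links while the mesh uses n(n - 1)/2, which is one way to see the trade-off between coordination richness and message volume.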

[NLP-79] Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective

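[Quick read]: This paper addresses the inherent challenges of deploying large-scale language models on edge devices: high computational demands, high energy consumption, and potential data privacy risks. The key solution is the Shakti Small Language Models (SLMs) series, comprising Shakti-100M, Shakti-250M, and Shakti-500M. By combining efficient architectures, quantization techniques, and responsible AI principles, these models enable on-device intelligence for smartphones, smart appliances, IoT systems, and similar settings. The central claim is that compact models, when carefully engineered and fine-tuned, can meet and often exceed expectations in real-world edge-AI scenarios.

Link: https://arxiv.org/abs/2503.01933
Authors: Rakshit Aralimatti,Syed Abdul Gaffar Shakhadri,Kruthika KR,Kartik Basavaraj Angadi
Affiliations: SandLogic Technologies Pvt Ltd
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: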

Abstract:Deploying large-scale language models on edge devices faces inherent challenges such as high computational demands, energy consumption, and potential data privacy risks. This paper introduces the Shakti Small Language Models (SLMs), Shakti-100M, Shakti-250M, and Shakti-500M, which target these constraints head-on. By combining efficient architectures, quantization techniques, and responsible AI principles, the Shakti series enables on-device intelligence for smartphones, smart appliances, IoT systems, and beyond. We provide comprehensive insights into their design philosophy, training pipelines, and benchmark performance on both general tasks (e.g., MMLU, Hellaswag) and specialized domains (healthcare, finance, and legal). Our findings illustrate that compact models, when carefully engineered and fine-tuned, can meet and often exceed expectations in real-world edge-AI scenarios.

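[NLP-80] Unnatural Languages Are Not Bugs but Features for LLMs

[Quick read]: This paper challenges the view that the ability of large language models (LLMs) to process non-human-readable text sequences, such as jailbreak prompts, is a bug. It shows that unnatural languages, strings that look meaningless to humans but carry semantic meaning for models, contain latent features that models can use. The key results are that these latent features generalize across models and tasks, and that models fine-tuned on unnatural versions of instruction datasets perform on par with models trained on natural language, reaching an average win rate of 49.71% on Length-controlled AlpacaEval 2.0. The analysis further shows that LLMs process unnatural languages by filtering noise and inferring contextual meaning from the filtered words.

Link: https://arxiv.org/abs/2503.01926
Authors: Keyu Duan,Yiran Zhao,Zhili Feng,Jinjie Ni,Tianyu Pang,Qian Liu,Tianle Cai,Longxu Dou,Kenji Kawaguchi,Anirudh Goyal,J. Zico Kolter,Michael Qizhe Shieh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: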

Abstract:Large Language Models (LLMs) have been observed to process non-human-readable text sequences, such as jailbreak prompts, often viewed as a bug for aligned LLMs. In this work, we present a systematic investigation challenging this perception, demonstrating that unnatural languages (strings that appear incomprehensible to humans but maintain semantic meanings for LLMs) contain latent features usable by models. Notably, unnatural languages possess latent features that can be generalized across different models and tasks during inference. Furthermore, models fine-tuned on unnatural versions of instruction datasets perform on par with those trained on natural language, achieving an average win rate of 49.71 on Length-controlled AlpacaEval 2.0 across various base models. In addition, through comprehensive analysis, we demonstrate that LLMs process unnatural languages by filtering noise and inferring contextual meaning from filtered words.

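[NLP-81] Output Length Effect on DeepSeek-R1's Safety in Forced Thinking

[Quick read]: This paper addresses the safety of large language models (LLMs) under adversarial conditions, in particular how output length affects the robustness of DeepSeek-R1 in Forced Thinking scenarios. It finds that longer outputs can improve safety through self-correction, but that certain attack types exploit extended generations. In response, the paper proposes reinforcement learning-based policy adjustments and adaptive token length regulation, so that output length is controlled dynamically to balance reasoning effectiveness and security.

Link: https://arxiv.org/abs/2503.01923
Authors: Xuying Li,Zhuo Li,Yuji Kosuga,Victor Bian
Affiliations: HydroX AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: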

Abstract:Large Language Models (LLMs) have demonstrated strong reasoning capabilities, but their safety under adversarial conditions remains a challenge. This study examines the impact of output length on the robustness of DeepSeek-R1, particularly in Forced Thinking scenarios. We analyze responses across various adversarial prompts and find that while longer outputs can improve safety through self-correction, certain attack types exploit extended generations. Our findings suggest that output length should be dynamically controlled to balance reasoning effectiveness and security. We propose reinforcement learning-based policy adjustments and adaptive token length regulation to enhance LLM safety.

[NLP-82] NCL-UoR at SemEval-2025 Task 3: Detecting Multilingual Hallucination and Related Observable Overgeneration Text Spans with Modified RefChecker and Modified SelfCheckGPT

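Quick read: This paper addresses the detection of hallucinations in content generated by multilingual Large Language Models (LLMs), which requires not only identifying that a hallucination is present but also locating exactly where it occurs. The key contribution is two improved methods: a modified RefChecker and a modified SelfCheckGPT. The modified RefChecker integrates prompt-based factual verification into the references, structuring them as claim-based tests rather than as a single external knowledge source, while the modified SelfCheckGPT brings in external knowledge to overcome its reliance on internal knowledge. The original prompt designs of both methods are also refined to better identify hallucinated words in LLM-generated text. Experiments show that the approach performs well on the multilingual test set, with an average IoU (Intersection over Union) of 0.5310 and an average COR of 0.5669.

Link: https://arxiv.org/abs/2503.01921
Authors: Jiaying Hong,Thanet Markchom,Jianfei Xu,Tong Wu,Huizhi Liang
Affiliations: School of Computing, Newcastle University, Newcastle upon Tyne, UK; Department of Computer Science, University of Reading, Reading, UK
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)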

Abstract:SemEval-2025 Task 3 (Mu-SHROOM) focuses on detecting hallucinations in content generated by various large language models (LLMs) across multiple languages. This task involves not only identifying the presence of hallucinations but also pinpointing their specific occurrences. To tackle this challenge, this study introduces two methods: modified RefChecker and modified SelfCheckGPT. The modified RefChecker integrates prompt-based factual verification into References, structuring them as claim-based tests rather than single external knowledge sources. The modified SelfCheckGPT incorporates external knowledge to overcome its reliance on internal knowledge. In addition, both methods’ original prompt designs are enhanced to identify hallucinated words within LLM-generated texts. Experimental results demonstrate the effectiveness of the approach, achieving a high ranking on the test dataset in detecting hallucinations across various languages, with an average IoU of 0.5310 and an average COR of 0.5669.
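
To make the span-level IoU figure concrete, the sketch below scores predicted hallucination spans against gold spans at the character level. It is a minimal illustration, not the official Mu-SHROOM scorer: the function name `span_iou`, the half-open `(start, end)` offset convention, and the choice to score two empty predictions as 1.0 are all assumptions.

```python
def span_iou(pred_spans, gold_spans):
    """Character-level IoU between two lists of half-open (start, end) spans."""
    # Expand each span into the set of character positions it covers.
    pred = {i for start, end in pred_spans for i in range(start, end)}
    gold = {i for start, end in gold_spans for i in range(start, end)}
    if not pred and not gold:
        # Assumption: agreeing that nothing is hallucinated counts as a perfect match.
        return 1.0
    return len(pred & gold) / len(pred | gold)


# Hypothetical offsets: 4 shared characters out of 15 covered in total.
example_score = span_iou([(0, 5), (10, 15)], [(3, 12)])
```

A set-based formulation keeps overlapping predicted spans from being counted twice, which a naive sum of span lengths would do.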

[NLP-83] How to Steer LLM Latents for Hallucination Detection? ICLR

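Quick read: This paper addresses the safety concern posed by hallucinations of Large Language Models (LLMs) in real-world applications. Existing methods detect hallucinations using the LLM's latent space, but because these embeddings are optimized for linguistic coherence rather than factual accuracy, they often fail to separate truthful from hallucinated content clearly. The key solution is the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to widen the separation between truthful and hallucinated outputs without changing model parameters. The two-stage framework first trains the TSV on a small set of labeled exemplars to form compact, well-separated clusters, then augments the exemplar set with unlabeled LLM generations, using optimal transport-based pseudo-labeling combined with confidence-based filtering. Extensive experiments show that TSV reaches state-of-the-art performance with minimal labeled data and generalizes well across datasets, offering a practical solution for real-world LLM applications.

Link: https://arxiv.org/abs/2503.01917
Authors: Seongheon Park,Xuefeng Du,Min-Hsuan Yeh,Haobo Wang,Yixuan Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: ICLR Workshop on Quantify Uncertainty and Hallucination in Foundation Models (QUESTION), 2025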

Abstract:Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To this end, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM’s representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
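
The idea of steering representations at inference time without touching the weights can be sketched in a few lines. The toy below adds a fixed vector to a hidden state and then labels the result by its nearest class centroid. The 2-D values, the function names, and the nearest-centroid rule are assumptions made for illustration; the abstract does not specify how the TSV is applied or how the clusters are used for detection.

```python
def steer(hidden, vector, strength=1.0):
    """Add a steering vector to one hidden state; model weights are untouched."""
    return [h + strength * v for h, v in zip(hidden, vector)]


def nearest_centroid(point, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(point, centroids[label]))


# Hypothetical 2-D values; a real steering vector would be learned from exemplars.
tsv = [0.5, -0.5]
centroids = {"truthful": [1.5, 0.0], "hallucinated": [-0.5, 0.0]}
steered = steer([1.0, 0.4], tsv)
label = nearest_centroid(steered, centroids)
```

Because the vector is added to activations rather than folded into parameters, it can be switched off by setting `strength` to zero.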

[NLP-84] Conceptual Contrastive Edits in Textual and Vision-Language Retrieval

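Quick read: This paper addresses how to achieve model-agnostic interpretability as deep learning models grow more complex. The key solution is a post-hoc conceptual contrastive edit method that systematically designs optimal and controllable contrastive interventions targeting different parts of speech, in order to expose the notable patterns and biases embedded in the representations of retrieval models. The paper also introduces a new metric that measures the per-word impact of contrastive interventions, giving a comprehensive assessment of each intervention's effectiveness. The method can explain the behavior of both language models and vision-language pre-trained models in a black-box setting.

Link: https://arxiv.org/abs/2503.01914
Authors: Maria Lymperaiou,Giorgos Stamou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)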

Abstract:As deep learning models grow in complexity, achieving model-agnostic interpretability becomes increasingly vital. In this work, we employ post-hoc conceptual contrastive edits to expose noteworthy patterns and biases imprinted in representations of retrieval models. We systematically design optimal and controllable contrastive interventions targeting various parts of speech, and effectively apply them to explain both linguistic and visiolinguistic pre-trained models in a black-box manner. Additionally, we introduce a novel metric to assess the per-word impact of contrastive interventions on model outcomes, providing a comprehensive evaluation of each intervention’s effectiveness.

[NLP-85] PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice

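Quick read: This paper addresses the lack of a systematic way to assess the usefulness of Large Language Models (LLMs) in real psychiatric clinical settings, a gap that has held back the development of LLMs specialized for psychiatry. The key solution is PsychBench, a benchmarking system that combines the clinical demands of psychiatry with clinical data to evaluate the practical performance of LLMs in psychiatric clinical scenarios. Using PsychBench, the authors carry out a quantitative evaluation and examine how prompt design, chain-of-thought reasoning, input text length, and domain-specific fine-tuning affect performance, identifying the strengths and limitations of existing models and suggesting improvements. A clinical reader study further examines the value of LLMs as assistive tools and how well they support psychiatrists of different seniority, providing data and an evaluation framework to advance research in this area.

Link: https://arxiv.org/abs/2503.01903
Authors: Ruoxi Wang,Shuyu Liu,Ling Zhang,Xuequan Zhu,Rui Yang,Xinzhu Zhou,Fei Wu,Zhi Yang,Cheng Jin,Gang Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)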

Abstract:The advent of Large Language Models (LLMs) offers potential solutions to address problems such as shortage of medical resources and low diagnostic consistency in psychiatric clinical practice. Despite this potential, a robust and comprehensive benchmarking framework to assess the efficacy of LLMs in authentic psychiatric clinical environments is absent. This has impeded the advancement of specialized LLMs tailored to psychiatric applications. In response to this gap, by incorporating clinical demands in psychiatry and clinical data, we propose a benchmarking system, PsychBench, to evaluate the practical performance of LLMs in psychiatric clinical settings. We conducted a comprehensive quantitative evaluation of 16 LLMs using PsychBench, and investigated the impact of prompt design, chain-of-thought reasoning, input text length, and domain-specific knowledge fine-tuning on model performance. Through detailed error analysis, we identified strengths and potential limitations of the existing models and suggested directions for improvement. Subsequently, a clinical reader study involving 60 psychiatrists of varying seniority was conducted to further explore the practical benefits of existing LLMs as supportive tools. Through the quantitative and reader evaluation, we show that while existing models demonstrate significant potential, they are not yet adequate as decision-making tools in psychiatric clinical practice. The reader study further indicates that, as auxiliary tools, LLMs could provide particularly notable support for junior psychiatrists, effectively enhancing their work efficiency and overall clinical quality. To promote research in this area, we will make the dataset and evaluation framework publicly available, with the hope of advancing the application of LLMs in psychiatric clinical settings.

[NLP-86] An Empirical Analysis of LLMs for Countering Misinformation

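Quick read: This paper studies how effective Large Language Models (LLMs) are at countering political misinformation, asking how generative AI might strengthen fact-checking. The key method is a two-step, chain-of-thought prompting approach in which the model first identifies credible sources for a given claim and then generates a persuasive response. The experiments show, however, that the models struggle to cite real news sources and tend to cite left-leaning sources, and that response diversity varies from model to model. This points to the limits of fact-checking through prompt engineering alone and underlines the need for more robust guardrails. The results matter for both researchers and non-technical users.

Link: https://arxiv.org/abs/2503.01902
Authors: Adiba Mahbub Proma,Neeley Pate,James Druckman,Gourab Ghoshal,Hangfeng He,Ehsan Hoque
Affiliations: University of Rochester
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Adiba and Neeley contributed equally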

Abstract:While Large Language Models (LLMs) can amplify online misinformation, they also show promise in tackling misinformation. In this paper, we empirically study the capabilities of three LLMs – ChatGPT, Gemini, and Claude – in countering political misinformation. We implement a two-step, chain-of-thought prompting approach, where models first identify credible sources for a given claim and then generate persuasive responses. Our findings suggest that models struggle to ground their responses in real news sources, and tend to prefer citing left-leaning sources. We also observe varying degrees of response diversity among models. Our findings highlight concerns about using LLMs for fact-checking through only prompt-engineering, emphasizing the need for more robust guardrails. Our results have implications for both researchers and non-technical users.
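
The two-step prompting flow described above can be sketched as a pair of chained calls. This is only a schematic: the prompt wording is invented for illustration, and `ask` is a hypothetical callable standing in for whatever LLM client is used, so that the example runs without any external service.

```python
def two_step_response(claim, ask):
    """Run both prompting steps in order and return (sources, response).

    `ask` is a hypothetical callable that sends one prompt to an LLM and returns text.
    """
    # Step 1: ask for credible sources before any argument is written.
    source_prompt = f"List credible news sources that address this claim: {claim}"
    sources = ask(source_prompt)
    # Step 2: ground the persuasive response in the sources from step 1.
    response_prompt = (
        f"Claim: {claim}\n"
        f"Sources: {sources}\n"
        "Using only these sources, write a persuasive response to the claim."
    )
    return sources, ask(response_prompt)


# Stand-in model for demonstration; it only records the prompts it receives.
seen_prompts = []


def fake_ask(prompt):
    seen_prompts.append(prompt)
    return f"reply-{len(seen_prompts)}"


sources, response = two_step_response("Example claim", fake_ask)
```

Separating the steps makes it possible to check the sources returned in step 1 against real outlets before step 2 runs, which is where the paper reports the models falling short.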

[NLP-87] MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems

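Quick read: This paper addresses the shortage of scientific reasoning evaluation in multimodal settings, in particular how large language models (LLMs) and vision-language models (LVLMs) perform on mathematical and physical reasoning tasks. To fill the gap, it introduces the MMSciBench benchmark, which contains tasks in text-only and text-image formats together with human-annotated difficulty levels, solutions with detailed explanations, and taxonomic mappings. By pairing challenging scientific reasoning tasks with comprehensive evaluation criteria, the benchmark exposes significant limitations of current state-of-the-art models in complex reasoning and visual-textual integration, and offers a rigorous standard for measuring progress in multimodal scientific understanding.

Link: https://arxiv.org/abs/2503.01891
Authors: Xinwu Ye,Chengfan Li,Siming Chen,Xiangru Tang,Wei Wei
Affiliations: Department of Computer Science, Yale University; Department of Computer Science, Brown University; School of Data Science, Fudan University; Datawiz LLC
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)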

Abstract:Recent advances in large language models (LLMs) and vision-language models (LVLMs) have shown promise across many tasks, yet their scientific reasoning capabilities remain untested, particularly in multimodal settings. We present MMSciBench, a benchmark for evaluating mathematical and physical reasoning through text-only and text-image formats, with human-annotated difficulty levels, solutions with detailed explanations, and taxonomic mappings. Evaluation of state-of-the-art models reveals significant limitations, with even the best model achieving only 63.77% accuracy and particularly struggling with visual reasoning tasks. Our analysis exposes critical gaps in complex reasoning and visual-textual integration, establishing MMSciBench as a rigorous standard for measuring progress in multimodal scientific understanding. The code for MMSciBench is open-sourced at GitHub, and the dataset is available at Hugging Face.

[NLP-88] Advanced Deep Learning Techniques for Analyzing Earnings Call Transcripts: Methodologies and Applications

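Quick read: This paper asks how Natural Language Processing (NLP) can be used to extract sentiment from large-scale earnings call transcripts in support of better-informed investment decisions and risk management strategies. It compares deep learning methods such as BERT, FinBERT, and ULMFiT on financial sentiment analysis, assessing their data preprocessing requirements, computational efficiency, and model optimization, and evaluates them experimentally using accuracy, precision, recall, and F1-score. It also discusses possible enhancements to make these models more effective for real-world financial text analysis.

Link: https://arxiv.org/abs/2503.01886
Authors: Umair Zakir,Evan Daykin,Amssatou Diagne,Jacob Faile
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Risk Management (q-fin.RM)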

Abstract:This study presents a comparative analysis of deep learning methodologies such as BERT, FinBERT and ULMFiT for sentiment analysis of earnings call transcripts. The objective is to investigate how Natural Language Processing (NLP) can be leveraged to extract sentiment from large-scale financial transcripts, thereby aiding in more informed investment decisions and risk management strategies. We examine the strengths and limitations of each model in the context of financial sentiment analysis, focusing on data preprocessing requirements, computational efficiency, and model optimization. Through rigorous experimentation, we evaluate their performance using key metrics, including accuracy, precision, recall, and F1-score. Furthermore, we discuss potential enhancements to improve the effectiveness of these models in financial text analysis, providing insights into their applicability for real-world financial decision-making.
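
For readers less familiar with the four metrics named above, the sketch below computes them for a binary sentiment task from scratch. The function name, the label encoding (1 for positive sentiment), and the convention of returning 0.0 when a denominator is zero are assumptions; the abstract does not say how the authors computed or averaged their scores.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for five transcript segments.
scores = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Reporting precision and recall alongside accuracy matters for sentiment data, where one class often dominates and accuracy alone can look deceptively high.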

[NLP-89] BEYONDWORDS is All You Need: Agentic Generative AI based Social Media Themes Extractor

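Quick read: This paper addresses the difficulty traditional methods have in capturing the complexity and nuance of large-scale, unstructured social media text. It proposes a new thematic analysis method that combines tweet embeddings from pre-trained language models, dimensionality reduction via matrix factorization, and generative AI to identify and refine latent themes. The method clusters the compressed tweet representations and uses generative AI with Chain of Thought (CoT) prompting to extract and articulate themes, with a second large language model (LLM) providing quality assurance. Its core contribution is the combination of machine learning and generative AI to improve the depth and accuracy of theme identification in online communities.

Link: https://arxiv.org/abs/2503.01880
Authors: Mohammed-Khalil Ghali,Abdelrahman Farrag,Sarah Lam,Daehan Won
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)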

Abstract:Thematic analysis of social media posts provides important insight into public discourse, yet traditional methods often struggle to capture the complexity and nuance of unstructured, large-scale text data. This study introduces a novel methodology for thematic analysis that integrates tweet embeddings from pre-trained language models, dimensionality reduction using matrix factorization, and generative AI to identify and refine latent themes. Our approach clusters compressed tweet representations and employs generative AI to extract and articulate themes through agentic Chain of Thought (CoT) prompting, with a secondary LLM for quality assurance. This methodology is applied to tweets from the autistic community, a group that increasingly uses social media to discuss their experiences and challenges. By automating the thematic extraction process, the aim is to uncover key insights while maintaining the richness of the original discourse. This autism case study demonstrates the utility of the proposed approach in improving thematic analysis of social media data, offering a scalable and adaptable framework that can be applied to diverse contexts. The results highlight the potential of combining machine learning and Generative AI to enhance the depth and accuracy of theme identification in online communities.

[NLP-90] Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement

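Quick read: This paper addresses the narrow task scope of existing time series methods and datasets, which concentrate on single tasks such as forecasting or anomaly detection and overlook broader use cases. To fill the gap, it proposes Time Series Multi-Task Question Answering (Time-MQA), a unified framework that supports multiple time series tasks through natural language queries, covering both numerical analytical tasks and open-ended question answering that requires reasoning. At the core of Time-MQA is the TSQA dataset, a large-scale collection of about 200k question-answer pairs drawn from diverse time series in domains such as environment and traffic, designed to cover series of different lengths and to support robust model development. The key step is continued pre-training of large language models (Mistral 7B, Llama-3 8B, and Qwen-2.5 7B) on TSQA, which improves their time series reasoning beyond simple numerical tasks and enables more advanced and intuitive interaction with temporal data. The TSQA dataset, models, executable code, user study questionnaires, and results have all been open-sourced.

Link: https://arxiv.org/abs/2503.01875
Authors: Yaxuan Kong,Yiyuan Yang,Yoontae Hwang,Wenjie Du,Stefan Zohren,Zhangyang Wang,Ming Jin,Qingsong Wen
Affiliations: University of Oxford; PyPOTS Research; University of Texas at Austin; Griffith University; Squirrel Ai Learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)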

Abstract:Time series data are foundational in finance, healthcare, and energy domains. However, most existing methods and datasets remain focused on a narrow spectrum of tasks, such as forecasting or anomaly detection. To bridge this gap, we introduce Time Series Multi-Task Question Answering (Time-MQA), a unified framework that enables natural language queries across multiple time series tasks - numerical analytical tasks and open-ended question answering with reasoning. Central to Time-MQA is the TSQA dataset, a large-scale dataset containing approximately 200k question-answer pairs derived from diverse time series spanning environment, traffic, etc. This comprehensive resource covers various time series lengths and promotes robust model development. We further demonstrate how continually pre-training large language models (Mistral 7B, Llama-3 8B, and Qwen-2.5 7B) on the TSQA dataset enhances time series reasoning capabilities, moving beyond mere numeric tasks and enabling more advanced and intuitive interactions with temporal data. The complete TSQA dataset, models, executable code, user study questionnaires for evaluation, and results have all been open-sourced.

[NLP-91] Can Large Language Models Extract Customer Needs as well as Professional Analysts?

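Quick read: This paper asks whether customer needs (CNs) can be extracted automatically from textual data such as interview transcripts and online reviews, replacing the current practice in which keyword search and machine learning assist analysts but human judgment is still needed to formulate the needs. It examines whether Large Language Models (LLMs) can extract clear, specific, and well-sourced CNs without human intervention. The study compares a foundational LLM using prompt engineering only (Base LLM), an LLM fine-tuned on professionally identified CNs (SFT LLM), and professional analysts. The SFT LLM matches or exceeds the analysts: the CNs it extracts are well formulated, sufficiently specific, and justified by the source content (no hallucinations), and it is more efficient and provides more complete coverage. The key to the solution is therefore an LLM fine-tuned on professional data, which enables automated extraction and precise articulation of customer needs.

Link: https://arxiv.org/abs/2503.01870
Authors: Artem Timoshenko,Chengfeng Mao,John R. Hauser
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)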

Abstract:Identifying customer needs (CNs) is important for product management, product development, and marketing. Applications rely on professional analysts interpreting textual data (e.g., interview transcripts, online reviews) to understand the nuances of customer experience and concisely formulate “jobs to be done.” The task is cognitively complex and time-consuming. Current practice facilitates the process with keyword search and machine learning but relies on human judgment to formulate CNs. We examine whether Large Language Models (LLMs) can automatically extract CNs. Because evaluating CNs requires professional judgment, we partnered with a marketing consulting firm to conduct a blind study of CNs extracted by: (1) a foundational LLM with prompt engineering only (Base LLM), (2) an LLM fine-tuned with professionally identified CNs (SFT LLM), and (3) professional analysts. The SFT LLM performs as well as or better than professional analysts when extracting CNs. The extracted CNs are well-formulated, sufficiently specific to identify opportunities, and justified by source content (no hallucinations). The SFT LLM is efficient and provides more complete coverage of CNs. The Base LLM was not sufficiently accurate or specific. Organizations can rely on SFT LLMs to reduce manual effort, enhance the precision of CN articulation, and provide improved insight for innovation and marketing strategy.

[NLP-92] From Small to Large Language Models: Revisiting the Federalist Papers

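Quick read: This paper revisits the long-debated authorship of the Federalist Papers through the lens of modern language models. It explores whether general-purpose embeddings, without fine-tuning, are useful for stylometry and authorship attribution, and compares how different embedding methods perform on the task. The study finds that dimension expansion with word embeddings is not always more helpful for attribution than dimension reduction with topic embeddings. Default LLM embeddings, even after manual fine-tuning, may not consistently improve attribution accuracy, whereas Bayesian analysis with topic embeddings trained on "function words" yields better classification performance. This suggests that traditional small statistical language models, with their interpretability and solid theoretical foundation, retain significant advantages for authorship attribution.

Link: https://arxiv.org/abs/2503.01869
Authors: So Won Jeong,Veronika Rockova
Affiliations: Booth School of Business of the University of Chicago
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)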

Abstract:For a long time, the authorship of the Federalist Papers had been a subject of inquiry and debate, not only by linguists and historians but also by statisticians. In what was arguably the first Bayesian case study, Mosteller and Wallace (1963) provided the first statistical evidence for attributing all disputed papers to Madison. Our paper revisits this historical dataset but through the lens of modern language models, both small and large. We review some of the more popular Large Language Model (LLM) tools and examine them from a statistical point of view in the context of text classification. We investigate whether, without any attempt to fine-tune, the general embedding constructs can be useful for stylometry and attribution. We explain differences between various word/phrase embeddings and discuss how to aggregate them in a document. Contrary to our expectations, we show that dimension expansion with word embeddings may not always be beneficial for attribution relative to dimension reduction with topic embeddings. Our experiments demonstrate that default LLM embeddings (even after manual fine-tuning) may not consistently improve authorship attribution accuracy. Instead, Bayesian analysis with topic embeddings trained on "function words" yields superior out-of-sample classification performance. This suggests that traditional (small) statistical language models, with their interpretability and solid theoretical foundation, can offer significant advantages in authorship attribution tasks. The code used in this analysis is available at this http URL

[NLP-93] Larger or Smaller Reward Margins to Select Preferences for Alignment?

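Quick read: This paper addresses a limitation of the data quality metrics currently used in preference learning: they assess data by either explicit or implicit reward margins and often give contradictory evaluations of the same data. To resolve this, the paper proposes the Alignment Potential (AP) metric, which quantifies the gap between the model's current implicit reward margin and the target explicit reward margin and so estimates how much the model can still gain from aligning with a given preference pair. The AP metric guides data selection more consistently and effectively, both when selecting from existing preference datasets and within self-play data generation frameworks. It improves alignment performance and surpasses existing state-of-the-art methods across different base models, optimization objectives, and training settings. The method also identifies high-quality data within self-generated content, and alignment performance keeps improving as dataset size and the number of training iterations grow.

Link: https://arxiv.org/abs/2503.01864
Authors: Kexin Huang,Junkang Wu,Ziqian Chen,Xue Wang,Jinyang Gao,Bolin Ding,Jiancan Wu,Xiangnan He,Xiang Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)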

Abstract:Preference learning is critical for aligning large language models (LLMs) with human values, with the quality of preference datasets playing a crucial role in this process. While existing metrics primarily assess data quality based on either explicit or implicit reward margins, they often provide contradictory evaluations for the same data. To address this issue, we introduce the alignment potential metric, which quantifies the gap from the model’s current implicit reward margin to the target explicit reward margin, thereby estimating the model’s potential to align with the preference data. Empirical results demonstrate that training on data selected by this metric consistently enhances alignment performance, surpassing existing metrics across different base models and optimization objectives. Furthermore, our method extends to self-play data generation frameworks, where the metric is used to identify high-quality data within the self-generated content by LLMs. Under this data generation scenario, our method surpasses current state-of-the-art (SOTA) results across various training settings and demonstrates continuous improvements in alignment performance as dataset size and training iterations increase.
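
The abstract describes the metric only in words, so the following is a rough sketch of one way such a gap could be computed and used for selection. It assumes a DPO-style implicit reward (a scaled log-probability ratio against a reference model) and defines the gap as explicit margin minus implicit margin; the function names, the `beta` default, and this exact combination are assumptions and may differ from the paper's definition.

```python
def implicit_margin(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style implicit reward margin between the chosen and rejected responses."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return beta * (chosen_ratio - rejected_ratio)


def alignment_gap(explicit_margin, implicit):
    """Gap from the current implicit margin to the target explicit margin (assumed form)."""
    return explicit_margin - implicit


def select_top_k(pairs, k):
    """Keep the k preference pairs with the largest gap."""
    return sorted(pairs, key=lambda pair: pair["gap"], reverse=True)[:k]


# Hypothetical log-probabilities and reward-model margin for one preference pair.
margin = implicit_margin(-1.0, -2.0, -1.5, -1.5, beta=0.5)
gap = alignment_gap(2.0, margin)
```

Under this reading, a pair scores highly when the reward model separates the two responses clearly but the policy does not yet do so, which is where further training has the most room to help.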

[NLP-94] Vision Language Models in Medicine

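Quick read: This survey reviews recent progress and open challenges for Medical Vision-Language Models (Med-VLMs), asking how these models can integrate visual and textual data to improve healthcare. It covers the foundational technology behind Med-VLMs and how general models are adapted to complex medical tasks, and highlights their transformative impact on clinical practice, education, and patient care. It also identifies challenges including data scarcity, narrow task generalization, interpretability issues, and ethical concerns such as fairness, accountability, and privacy, all of which are made worse by uneven dataset distribution, heavy computational demands, and regulatory hurdles. To address them, the survey calls for rigorous evaluation methods and robust regulatory frameworks so that Med-VLMs can be integrated safely into healthcare workflows. Future directions include using large-scale, diverse datasets, improving cross-modal generalization, and enhancing interpretability, along with federated learning, lightweight architectures, and Electronic Health Record integration as paths to broader access and greater clinical relevance.

Link: https://arxiv.org/abs/2503.01863
Authors: Beria Chingnabe Kalpelbe,Angel Gabriel Adaambiik,Wei Peng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Image and Video Processing (eess.IV)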

Abstract:With the advent of Vision-Language Models (VLMs), medical artificial intelligence (AI) has experienced significant technological progress and paradigm shifts. This survey provides an extensive review of recent advancements in Medical Vision-Language Models (Med-VLMs), which integrate visual and textual data to enhance healthcare outcomes. We discuss the foundational technology behind Med-VLMs, illustrating how general models are adapted for complex medical tasks, and examine their applications in healthcare. The transformative impact of Med-VLMs on clinical practice, education, and patient care is highlighted, alongside challenges such as data scarcity, narrow task generalization, interpretability issues, and ethical concerns like fairness, accountability, and privacy. These limitations are exacerbated by uneven dataset distribution, computational demands, and regulatory hurdles. Rigorous evaluation methods and robust regulatory frameworks are essential for safe integration into healthcare workflows. Future directions include leveraging large-scale, diverse datasets, improving cross-modal generalization, and enhancing interpretability. Innovations like federated learning, lightweight architectures, and Electronic Health Record (EHR) integration are explored as pathways to democratize access and improve clinical relevance. This review aims to provide a comprehensive understanding of Med-VLMs’ strengths and limitations, fostering their ethical and balanced adoption in healthcare.

[NLP-95] Optimizing Retrieval-Augmented Generation of Medical Content for Spaced Repetition Learning

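Quick read: This paper addresses the tension in medical education between scaling high-quality, personalized learning resources and serving non-English-speaking users. It proposes a pipeline built on a modified Retrieval-Augmented Generation (RAG) system that generates comments for Poland's State Specialization Examination (PES) and combines them with a spaced repetition learning algorithm to improve knowledge retention while reducing cognitive load. The key to the solution is a refined retrieval system, a query rephraser, and an advanced reranker, which together favor the accuracy of generated content over raw efficiency and markedly improve document relevance, credibility, and logical coherence. Experiments show that the method outperforms conventional approaches on several metrics, demonstrating its potential to deliver high-quality, personalized educational resources.

Link: https://arxiv.org/abs/2503.01859
Authors: Jeremi I. Kaczmarek,Jakub Pokrywka,Krzysztof Biedalak,Grzegorz Kurzyp,Łukasz Grzybowski
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)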

Abstract:Advances in Large Language Models have revolutionized medical education by enabling scalable and efficient learning solutions. This paper presents a pipeline employing a Retrieval-Augmented Generation (RAG) system to generate comments for Poland’s State Specialization Examination (PES) based on verified resources. The system integrates these generated comments and source documents with a spaced repetition learning algorithm to enhance knowledge retention while minimizing cognitive overload. By employing a refined retrieval system, a query rephraser, and an advanced reranker, our modified RAG solution prioritizes accuracy over efficiency. Rigorous evaluation by medical annotators demonstrates improvements in key metrics such as document relevance, credibility, and logical coherence of generated content, as shown by a series of experiments presented in the paper. This study highlights the potential of RAG systems to provide scalable, high-quality, and individualized educational resources, addressing the needs of non-English-speaking users.

[NLP-96] A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models

Quick read: This survey addresses how to remove the influence of undesirable data (such as sensitive or illegal information) from Large Language Models (LLMs) without full retraining while preserving the model's overall utility. Its contribution is a systematic organization of existing machine unlearning research and a distillation of key insights: it defines the paradigms of LLM unlearning, builds a comprehensive taxonomy, summarizes the strengths and limitations of current methods, and reviews existing evaluation metrics and benchmarks, in order to guide future research.

Link: https://arxiv.org/abs/2503.01854
Authors: Jiahui Geng,Qing Li,Herbert Woisetschlaeger,Zongxiong Chen,Yuxia Wang,Preslav Nakov,Hans-Arno Jacobsen,Fakhri Karray
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Technical University of Munich; Fraunhofer FOKUS; University of Toronto
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:This study investigates machine unlearning techniques within the context of large language models (LLMs), referred to as LLM unlearning. LLM unlearning offers a principled approach to removing the influence of undesirable data (e.g., sensitive or illegal information) from LLMs, while preserving their overall utility without requiring full retraining. Despite growing research interest, there is no comprehensive survey that systematically organizes existing work and distills key insights; here, we aim to bridge this gap. We begin by introducing the definition and the paradigms of LLM unlearning, followed by a comprehensive taxonomy of existing unlearning studies. Next, we categorize current unlearning approaches, summarizing their strengths and limitations. Additionally, we review evaluation metrics and benchmarks, providing a structured overview of current assessment methodologies. Finally, we outline promising directions for future research, highlighting key challenges and opportunities in the field.

[NLP-97] Euskarazko lehen C1 ebaluatzaile automatikoa

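Quick read: This paper sets out to build an automatic evaluator that determines whether Basque-language compositions reach the C1 level of the Common European Framework of Reference for Languages (CEFR). To train the system, the team obtained 10,000 transcribed compositions through an agreement between HABE and HiTZ, and applied several techniques to counter data scarcity and overfitting: EDA, SCL, and regularization. The study also tests different language models to analyze their behavior and further measures model calibration and the impact of artifacts. The key lies in combining these techniques to make the evaluator accurate and robust.

Link: https://arxiv.org/abs/2503.01851
Authors: Ekhi Azurmendi,Oier Lopez de Lacalle
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: 62 pages, in Basque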

Abstract:Throughout this project, we have attempted to develop an automatic evaluator that determines whether Basque language compositions meet the C1 level. To achieve our goal, we obtained 10,000 transcribed compositions through an agreement between HABE and HiTZ to train our system. We have developed different techniques to avoid data scarcity and system overfitting: EDA, SCL and regularization. We have also conducted tests with different Language Models to analyze their behavior. Finally, we have also performed analyses of different system behaviors to measure model calibration and the impact of artifacts. – Proiektu honetan zehar euskarazko idazlanek C1 maila duten edo ez zehazten duen ebaluatzaile automatiko bat garatzen saiatu gara. Gure helburua betetzeko HABE eta HiTZ arteko hitzarmenaren bitartez 10.000 transkribatutako idazlan eskuratu ditugu gure sistema entrenatzeko. Datu eskasia eta sistemaren gaindoitzea ekiditeko teknika ezberdinak landu ditugu: EDA, SCL eta erregulazioa; Hizkuntza Eredu ezberdinekin ere probak egin ditugu duten portaera aztertzeko. Azkenik, sistema ezberdinen portaeren analisiak ere egin ditugu, ereduen kalibrazioa eta artefaktuen eragina neurtzeko.

[NLP-98] Seeded Poisson Factorization: Leveraging domain knowledge to fit topic models

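Quick read: This paper addresses the difficulty traditional unsupervised topic models have in aligning with predefined conceptual domains. It proposes Seeded Poisson Factorization (SPF), a new method that extends the Poisson Factorization framework with seed words. SPF injects domain knowledge by modifying the prior distribution of topic-specific term intensities, assigning higher initial rates to predefined seed words, which leads to more interpretable and structured topic discovery. The model is estimated with variational inference and stochastic gradient optimization, so it scales to large datasets. Applied to an Amazon customer feedback dataset, SPF outperforms other guided topic models in computational efficiency and predictive performance, and it adaptively balances domain knowledge against data-driven topic discovery even when the seed words are imperfectly chosen. These results establish SPF as a powerful and scalable way to integrate expert knowledge into topic modeling, improving both interpretability and efficiency in real-world applications.

Link: https://arxiv.org/abs/2503.02741
Authors: Bernd Prostmaier,Jan Vávra,Bettina Grün,Paul Hofmarcher
Affiliations: Unknown
Categories: Methodology (stat.ME); Computation and Language (cs.CL); Machine Learning (cs.LG); General Economics (econ.GN)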

Abstract: Topic models are widely used for discovering latent thematic structures in large text corpora, yet traditional unsupervised methods often struggle to align with predefined conceptual domains. This paper introduces Seeded Poisson Factorization (SPF), a novel approach that extends the Poisson Factorization framework by incorporating domain knowledge through seed words. SPF enables a more interpretable and structured topic discovery by modifying the prior distribution of topic-specific term intensities, assigning higher initial rates to predefined seed words. The model is estimated using variational inference with stochastic gradient optimization, ensuring scalability to large datasets. We apply SPF to an Amazon customer feedback dataset, leveraging predefined product categories as guiding structures. Our evaluation demonstrates that SPF achieves superior classification performance compared to alternative guided topic models, particularly in terms of computational efficiency and predictive performance. Furthermore, robustness checks highlight SPF’s ability to adaptively balance domain knowledge and data-driven topic discovery, even in cases of imperfect seed word selection. These results establish SPF as a powerful and scalable alternative for integrating expert knowledge into topic modeling, enhancing both interpretability and efficiency in real-world applications.
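
The seeding idea in this abstract (higher prior rates for seed words) can be illustrated with a small sketch. The abstract does not give the exact parameterization, so everything here is an assumption for illustration: the function name `seeded_prior_means`, the `base_mean` and `seed_boost` values, and the additive form of the boost are hypothetical and not taken from the paper.

```python
def seeded_prior_means(vocab, seed_words, base_mean=0.1, seed_boost=1.0):
    """Build a per-topic table of prior mean intensities over the vocabulary.

    Hypothetical helper: every term starts at base_mean, and each seed word
    of a topic gets seed_boost added to its prior mean for that topic.
    """
    index = {term: j for j, term in enumerate(vocab)}
    means = {}
    for topic in sorted(seed_words):
        row = [base_mean] * len(vocab)
        for term in seed_words[topic]:
            if term in index:  # seeds missing from the vocabulary are ignored
                row[index[term]] += seed_boost
        means[topic] = row
    return means


vocab = ["battery", "screen", "novel", "author", "price"]
seeds = {"electronics": ["battery", "screen"], "books": ["novel", "author"]}
prior = seeded_prior_means(vocab, seeds)
```

In a full model these means would parameterize the prior that variational inference then updates from data, which is how imperfect seeds could be overridden; that step is outside this sketch.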

Computer Vision

[CV-0] ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models

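【Quick Read】: This paper addresses a central challenge for autoregressive (AR) image generation models: modeling the complex distribution of high-dimensional tokens. Earlier methods are either too simplistic to fit the distribution or too slow at generation time. The paper proposes ARINAR (AR-in-AR), a bi-level AR model that generates each token feature by feature. The outer AR layer takes previously generated tokens as input and predicts a condition vector z for the next token; the inner AR layer, conditioned on z, generates that token's features autoregressively. The inner layer therefore only has to model the distribution of a single feature, for example with a simple Gaussian Mixture Model. On ImageNet 256x256 generation, ARINAR-B (213M parameters) reaches an FID of 2.75, comparable to the state-of-the-art MAR-B (FID=2.31), while running five times faster.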

Link: https://arxiv.org/abs/2503.02883
Authors: Qinyu Zhao, Stephen Gould, Liang Zheng
Affiliations: Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report. Our code is available at this https URL

Abstract: Existing autoregressive (AR) image generative models use a token-by-token generation schema. That is, they predict a per-token probability distribution and sample the next token from that distribution. The main challenge is how to model the complex distribution of high-dimensional tokens. Previous methods either are too simplistic to fit the distribution or result in slow generation speed. Instead of fitting the distribution of whole tokens, we explore using an AR model to generate each token in a feature-by-feature way, i.e., taking the generated features as input and generating the next feature. Based on that, we propose ARINAR (AR-in-AR), a bi-level AR model. The outer AR layer takes previous tokens as input and predicts a condition vector z for the next token. The inner layer, conditional on z, generates features of the next token autoregressively. In this way, the inner layer only needs to model the distribution of a single feature, for example, using a simple Gaussian Mixture Model. On the ImageNet 256x256 image generation task, ARINAR-B with 213M parameters achieves an FID of 2.75, which is comparable to the state-of-the-art MAR-B model (FID=2.31), while being five times faster than the latter.
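
The bi-level sampling loop described in this abstract can be sketched as a toy, assuming stand-in functions in place of the real networks. `outer_condition` and `inner_mixture_params` are hypothetical placeholders (a mean over previous tokens and a fixed two-component mixture), not the paper's architecture; only the control flow (outer step yields z, inner step emits one feature at a time from a 1-D Gaussian mixture) follows the abstract.

```python
import random


def outer_condition(prev_tokens, dim):
    """Stand-in for the outer AR layer: summarize previous tokens into z."""
    if not prev_tokens:
        return [0.0] * dim
    count = len(prev_tokens)
    return [sum(tok[d] for tok in prev_tokens) / count for d in range(dim)]


def inner_mixture_params(z, features_so_far):
    """Stand-in for the inner AR layer: parameters of a 1-D Gaussian mixture
    for the next feature, given z and the features generated so far."""
    d = len(features_so_far)
    context = z[d] + (features_so_far[-1] if features_so_far else 0.0)
    weights = [0.5, 0.5]
    means = [context - 1.0, context + 1.0]
    stds = [0.5, 0.5]
    return weights, means, stds


def sample_token(prev_tokens, dim, rng):
    """Generate one token feature by feature, conditioned on z."""
    z = outer_condition(prev_tokens, dim)
    features = []
    for _ in range(dim):
        weights, means, stds = inner_mixture_params(z, features)
        k = rng.choices(range(len(weights)), weights=weights)[0]
        features.append(rng.gauss(means[k], stds[k]))
    return features


rng = random.Random(0)
tokens = []
for _ in range(4):
    tokens.append(sample_token(tokens, dim=3, rng=rng))
```

The point of the structure is that each sampling step draws from a one-dimensional distribution, which is why a simple mixture can suffice in the inner loop.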

[CV-1] Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

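【Quick Read】: This paper addresses how to detect deepfakes reliably in an era of highly realistic generative AI, in order to counter fraud and disinformation. It argues that existing academic benchmarks are outdated and unrepresentative of recent deepfakes, and introduces Deepfake-Eval-2024, a benchmark of in-the-wild deepfakes collected in 2024 from social media and from users of a deepfake detection platform. The dataset spans video, audio, and images, covers the latest manipulation technologies, and is multilingual and multi-source. Evaluated on this benchmark, open-source state-of-the-art detectors degrade sharply; commercial models and models finetuned on Deepfake-Eval-2024 outperform off-the-shelf open-source models but still fall short of human deepfake forensic analysts.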

Link: https://arxiv.org/abs/2503.02857
Authors: Nuria Alina Chandra, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Hannah Lee, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Sejin Paik, Changyeon Lee, Jongwook Choi, Aerin Kim, Oren Etzioni
Affiliations: TrueMedia.org; University of Washington, Seattle; Georgetown University, Washington D.C.; Miraflow AI; Yonsei University, Seoul; Chung-Ang University, Seoul
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of recent deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 44 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but they do not yet reach the accuracy of human deepfake forensic analysts. The dataset is available at this https URL.

[CV-2] CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors

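【Quick Read】: This paper addresses the monitoring and prediction of in-class student activities, with the aim of improving teaching effectiveness and the learning experience. Reliable predictive systems have been held back by the lack of large labeled datasets in education. The paper presents a new dataset for in-class activity detection built with low-cost inertial measurement unit (IMU) sensors. It covers 19 diverse activities performed by 12 participants in typical classroom scenarios and combines accelerometer, gyroscope, and rotation vector data with synchronized stereo images, making it a resource for multimodal algorithms. The key is collecting multimodal data with non-intrusive devices, such as the IMUs embedded in smartwatches, and pairing it with diverse labeled samples.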

Link: https://arxiv.org/abs/2503.02853
Authors: Luis Marquez-Carpintero, Sergio Suescun-Ferrandiz, Monica Pina-Navarro, Miguel Cazorla, Francisco Gomez-Donoso
Affiliations: Institute for Computer Research, P.O. Box 99. 03080, Alicante, Spain.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The monitoring and prediction of in-class student activities is of paramount importance for the comprehension of engagement and the enhancement of pedagogical efficacy. The accurate detection of these activities enables educators to modify their lessons in real time, thereby reducing negative emotional states and enhancing the overall learning experience. To this end, the use of non-intrusive devices, such as inertial measurement units (IMUs) embedded in smartwatches, represents a viable solution. The development of reliable predictive systems has been limited by the lack of large, labeled datasets in education. To bridge this gap, we present a novel dataset for in-class activity detection using affordable IMU sensors. The dataset comprises 19 diverse activities, both instantaneous and continuous, performed by 12 participants in typical classroom scenarios. It includes accelerometer, gyroscope, rotation vector data, and synchronized stereo images, offering a comprehensive resource for developing multimodal algorithms using sensor and visual data. This dataset represents a key step toward scalable solutions for activity recognition in educational settings.

[CV-3] Multimodal Deep Learning for Subtype Classification in Breast Cancer Using Histopathological Images and Gene Expression Data

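【Quick Read】: This paper addresses the limited accuracy of breast cancer molecular subtype classification when methods rely on histopathological images or gene expression profiles alone. It proposes a deep multimodal learning framework that integrates both: a ResNet-50 model extracts image features, fully connected layers process gene expression data, and a cross-attention fusion mechanism strengthens the interaction between modalities. In five-fold cross-validation the multimodal model outperforms unimodal approaches in classification accuracy, precision-recall AUC, and F1-score.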

Link: https://arxiv.org/abs/2503.02849
Authors: Amin Honarmandi Shandiz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 figures

Abstract:Molecular subtyping of breast cancer is crucial for personalized treatment and prognosis. Traditional classification approaches rely on either histopathological images or gene expression profiling, limiting their predictive power. In this study, we propose a deep multimodal learning framework that integrates histopathological images and gene expression data to classify breast cancer into this http URL and this http URL / Her2 subtypes. Our approach employs a ResNet-50 model for image feature extraction and fully connected layers for gene expression processing, with a cross-attention fusion mechanism to enhance modality interaction. We conduct extensive experiments using five-fold cross-validation, demonstrating that our multimodal integration outperforms unimodal approaches in terms of classification accuracy, precision-recall AUC, and F1-score. Our findings highlight the potential of deep learning for robust and interpretable breast cancer subtype classification, paving the way for improved clinical decision-making.
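
To make the cross-attention fusion step concrete, here is a minimal single-head sketch in plain Python. It assumes the gene expression embedding acts as the query and image patch features act as keys and values; that pairing, the absence of learned projections, and the variable names are illustrative assumptions, since the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query vector over key/value rows."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    fused = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return fused, weights


# Hypothetical toy inputs: one gene-expression query, three image patches.
gene_query = [1.0, 0.0]
patch_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
fused, weights = cross_attention(gene_query, patch_feats, patch_feats)
```

The patch most aligned with the query receives the largest weight, so the fused vector is an image summary steered by the other modality.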

[CV-4] Boltzmann Attention Sampling for Image Analysis with Small Objects

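【Quick Read】: This paper addresses the detection and segmentation of small objects, such as lung nodules and tumor lesions, in image analysis. These objects often occupy less than 0.1% of an image, so standard transformer architectures waste attention computation on irrelevant regions and lose accuracy. Existing sparse attention mechanisms rely on rigid hierarchical structures that cope poorly with small, variable, and uncertain object locations. The proposed BoltzFormer uses dynamic sparse attention: it models uncertainty with a Boltzmann distribution under an annealing schedule, sampling broadly at a higher temperature in early layers and focusing attention as the temperature falls in later layers. The Boltzmann attention sampling mechanism is modular and plugs into existing transformer architectures. Evaluations show significantly better small-object segmentation with an order of magnitude less attention computation.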

Link: https://arxiv.org/abs/2503.02841
Authors: Theodore Zhao, Sid Kiblawi, Naoto Usuyama, Ho Hin Lee, Sam Preston, Hoifung Poon, Mu Wei
Affiliations: Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Detecting and segmenting small objects, such as lung nodules and tumor lesions, remains a critical challenge in image analysis. These objects often occupy less than 0.1% of an image, making traditional transformer architectures inefficient and prone to performance degradation due to redundant attention computations on irrelevant regions. Existing sparse attention mechanisms rely on rigid hierarchical structures, which are poorly suited for detecting small, variable, and uncertain object locations. In this paper, we propose BoltzFormer, a novel transformer-based architecture designed to address these challenges through dynamic sparse attention. BoltzFormer identifies and focuses attention on relevant areas by modeling uncertainty using a Boltzmann distribution with an annealing schedule. Initially, a higher temperature allows broader area sampling in early layers, when object location uncertainty is greatest. As the temperature decreases in later layers, attention becomes more focused, enhancing efficiency and accuracy. BoltzFormer seamlessly integrates into existing transformer architectures via a modular Boltzmann attention sampling mechanism. Comprehensive evaluations on benchmark datasets demonstrate that BoltzFormer significantly improves segmentation performance for small objects while reducing attention computation by an order of magnitude compared to previous state-of-the-art methods.
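
The temperature-controlled sampling described here can be sketched as follows. This is a generic Boltzmann (softmax with temperature) sampler with a geometric annealing schedule; the schedule shape, the function names, and the toy relevance scores are assumptions for illustration and are not details confirmed by the abstract.

```python
import math
import random


def boltzmann_probs(scores, temperature):
    """Boltzmann distribution over patches: p_i proportional to exp(s_i / T)."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_patches(scores, temperature, n_samples, rng):
    """Sample patch indices to attend to; returns the distinct indices drawn."""
    probs = boltzmann_probs(scores, temperature)
    picks = rng.choices(range(len(scores)), weights=probs, k=n_samples)
    return sorted(set(picks))


def annealed_temperatures(t_start, t_end, n_layers):
    """Geometric schedule from t_start down to t_end (an assumed shape)."""
    if n_layers == 1:
        return [t_start]
    ratio = (t_end / t_start) ** (1.0 / (n_layers - 1))
    return [t_start * ratio ** i for i in range(n_layers)]


rng = random.Random(0)
scores = [0.1, 0.2, 3.0, 0.1]  # hypothetical relevance scores for 4 patches
temps = annealed_temperatures(4.0, 0.25, 5)
selected_per_layer = [sample_patches(scores, t, 8, rng) for t in temps]
```

At high temperature the probabilities are nearly uniform, so early layers explore; as the temperature drops the mass concentrates on the highest-scoring patch.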

[CV-5] In-Depth Analysis of Automated Acne Disease Recognition and Classification

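【Quick Read】: This paper addresses the difficulty of classifying facial acne: traditional visual inspection or expert assessment is time-consuming and struggles to distinguish acne types. It presents an automated, machine-learning-based expert system that recognizes and classifies six types of acne to support dermatologists' diagnoses. The pipeline combines image pre-processing (contrast improvement, a smoothing filter, and RGB to Lab color conversion), k-means clustering to segment the affected regions, feature extraction that combines the gray-level co-occurrence matrix (GLCM) with statistical features, and five machine learning classifiers. Random Forest achieves the highest accuracy, 98.50%, which the authors describe as promising compared with state-of-the-art methods.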

Link: https://arxiv.org/abs/2503.02835
Authors: Afsana Ahsan Jeny, Masum Shah Junayed, Md Robel Mia, Md Baharul Islam
Affiliations: Department of Computer Engineering, Bahcesehir University, Istanbul, Turkey; Department of CSE, University of Connecticut, Storrs, CT 06269, USA; Department of CSE, Daffodil International University, Dhaka, Bangladesh; Dept. of Computing & Software Engineering, Florida Gulf Coast University, Fort Myers, FL 33965, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Facial acne is a common disease, especially among adolescents, that affects people both physically and psychologically. Classifying acne is vital to providing the appropriate treatment. Traditional visual inspection or expert scanning is time-consuming, and differentiating acne types this way is difficult. This paper introduces an automated expert system for acne recognition and classification. The proposed method employs a machine learning-based technique to classify and evaluate six types of acne diseases to facilitate the diagnosis of dermatologists. The pre-processing phase includes contrast improvement, a smoothing filter, and RGB to Lab color conversion to eliminate noise and improve the classification accuracy. Then, a clustering-based segmentation method, k-means clustering, is applied to segment the disease-affected regions, which then pass through the feature extraction step. Characteristics of these disease-affected regions are extracted based on a combination of gray-level co-occurrence matrix (GLCM) and statistical features. Finally, five different machine learning classifiers are employed to classify acne diseases. Experimental results show that the Random Forest (RF) achieves the highest accuracy of 98.50%, which is promising compared to the state-of-the-art methods.
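
As an illustration of the GLCM feature step, the sketch below builds a normalized, non-symmetric gray-level co-occurrence matrix for one pixel offset and derives two common texture features (contrast and energy). The offset, the number of gray levels, and the toy image are assumptions; the abstract does not say which GLCM settings or features the authors used.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for pixel pairs at offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Sum of p[i][j] * (i - j)^2: large when neighbors differ in gray level."""
    return sum(p[i][j] * (i - j) ** 2 for i in range(len(p)) for j in range(len(p)))


def energy(p):
    """Sum of squared probabilities: large for uniform textures."""
    return sum(v * v for row in p for v in row)


# Hypothetical 3x3 segmented region quantized to 2 gray levels.
region = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
p = glcm(region, levels=2)
features = [contrast(p), energy(p)]
```

A feature vector like this, concatenated with statistical features of the region, is what a classifier such as Random Forest would consume.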

[CV-6] Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional Imaging

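【Quick Read】: This paper addresses the limited generalizability and robustness of existing AI-driven PET/CT analyses, which rely on task-specific models or limited datasets. It proposes a foundation model for multimodal PET/CT imaging built on the Cross-Fraternal Twin Masked Autoencoder (FratMAE). FratMAE uses separate Vision Transformer (ViT) encoders for PET and CT scans together with cross-attention decoders that let the modalities interact during masked autoencoder training, integrating whole-body anatomical information with functional or molecular information. It also uses textual metadata to improve PET representation learning. Pre-trained on PET/CT datasets, FratMAE captures cross-modal relationships and global uptake patterns and performs strongly on downstream tasks.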

Link: https://arxiv.org/abs/2503.02824
Authors: Yujin Oh, Robert Seifert, Yihan Cao, Christoph Clement, Justin Ferdinandus, Constantin Lapa, Alessandro Liebich, Michelle Amon, Johanna Enke, Sifan Song, Runqi Meng, Fang Zeng, Ning Guo, Xiang Li, Pedram Heidari, Axel Rominger, Kuangyu Shi, Quanzheng Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 3 tables

Abstract:In oncology, Positron Emission Tomography-Computed Tomography (PET/CT) is widely used in cancer diagnosis, staging, and treatment monitoring, as it combines anatomical details from CT with functional metabolic activity and molecular marker expression information from PET. However, existing artificial intelligence-driven PET/CT analyses rely predominantly on task-specific models trained from scratch or on limited datasets, limiting their generalizability and robustness. To address this, we propose a foundation model approach specifically designed for multimodal PET/CT imaging. We introduce the Cross-Fraternal Twin Masked Autoencoder (FratMAE), a novel framework that effectively integrates whole-body anatomical and functional or molecular information. FratMAE employs separate Vision Transformer (ViT) encoders for PET and CT scans, along with cross-attention decoders that enable synergistic interactions between modalities during masked autoencoder training. Additionally, it incorporates textual metadata to enhance PET representation learning. By pre-training on PET/CT datasets, FratMAE captures intricate cross-modal relationships and global uptake patterns, achieving superior performance on downstream tasks and demonstrating its potential as a generalizable foundation model.

[CV-7] MX-Font++: Mixture of Heterogeneous Aggregation Experts for Few-shot Font Generation ICASSP2025

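【Quick Read】: This paper addresses the challenge in Few-shot Font Generation (FFG) of transferring to unseen characters in low-resource languages, especially when glyphs vary considerably across training sets and existing methods fail to decouple content and style. It proposes Heterogeneous Aggregation Experts (HAE), a feature extraction expert that aggregates information along the channel and spatial dimensions to decouple content from style more effectively, and adds a content-style homogeneity loss to strengthen the disentanglement. Experiments show that the resulting MX-Font++ produces superior visual results in FFG and outperforms state-of-the-art methods.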

Link: https://arxiv.org/abs/2503.02799
Authors: Weihang Wang, Duolin Sun, Jielei Zhang, Longwen Gao
Affiliations: BILIBILI Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 4 figures, accepted by ICASSP 2025

Abstract: Few-shot Font Generation (FFG) aims to create new font libraries using limited reference glyphs, with crucial applications in digital accessibility and equity for low-resource languages, especially in multilingual artificial intelligence systems. Although existing methods have shown promising performance, transitioning to unseen characters in low-resource languages remains a significant challenge, especially when font glyphs vary considerably across training sets. MX-Font considers the content of a character from the perspective of a local component, employing a Mixture of Experts (MoE) approach to adaptively extract the component for better transition. However, the lack of a robust feature extractor prevents it from adequately decoupling content and style, leading to sub-optimal generation results. To alleviate these problems, we propose Heterogeneous Aggregation Experts (HAE), a powerful feature extraction expert that helps decouple content and style by aggregating information along the channel and spatial dimensions. Additionally, we propose a novel content-style homogeneity loss to enhance the untangling. Extensive experiments on several datasets demonstrate that our MX-Font++ yields superior visual results in FFG and effectively outperforms state-of-the-art methods. Code and data are available at this https URL.

[CV-8] A Causal Framework for Aligning Image Quality Metrics and Deep Neural Network Robustness

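【Quick Read】: This paper addresses how to quantify and understand the underlying quality distribution of large-scale image datasets in order to better characterize the performance and robustness of Deep Neural Networks (DNNs). Conventional Image Quality Assessment (IQA) aims for agreement with human perceptual judgments, and the paper shows that such metrics are weak predictors of DNN performance in classification. The key contribution is to reframe IQA from a causal perspective and to propose a new image quality metric that is sensitive to imaging conditions and well aligned with DNN sensitivities. This gives a direct way to estimate the quality distribution of large-scale image datasets and to relate dataset composition to DNN performance.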

Link: https://arxiv.org/abs/2503.02797
Authors: Nathan Drenkow, Mathias Unberath
Affiliations: The Johns Hopkins University Applied Physics Laboratory; The Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Image quality plays an important role in the performance of deep neural networks (DNNs) and DNNs have been widely shown to exhibit sensitivity to changes in imaging conditions. Large-scale datasets often contain images under a wide range of conditions prompting a need to quantify and understand their underlying quality distribution in order to better characterize DNN performance and robustness. Aligning the sensitivities of image quality metrics and DNNs ensures that estimates of quality can act as proxies for image/dataset difficulty independent of the task models trained/evaluated on the data. Conventional image quality assessment (IQA) seeks to measure and align quality relative to human perceptual judgments, but here we seek a quality measure that is not only sensitive to imaging conditions but also well-aligned with DNN sensitivities. We first ask whether conventional IQA metrics are also informative of DNN performance. In order to answer this question, we reframe IQA from a causal perspective and examine conditions under which quality metrics are predictive of DNN performance. We show theoretically and empirically that current IQA metrics are weak predictors of DNN performance in the context of classification. We then use our causal framework to provide an alternative formulation and a new image quality metric that is more strongly correlated with DNN performance and can act as a prior on performance without training new task models. Our approach provides a means to directly estimate the quality distribution of large-scale image datasets towards characterizing the relationship between dataset composition and DNN performance.

[CV-9] Deep Learning-Enhanced Visual Monitoring in Hazardous Underwater Environments with a Swarm of Micro-Robots ICRA2025

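【Quick Read】: This paper addresses the cost, labor, and hazard of long-term monitoring and exploration in extreme environments such as underwater storage facilities, which the authors propose to automate with low-cost collaborative robots. The approach combines data simulation, a multi-modal deep learning network for coordinate prediction, and image reassembly. It improves alignment precision in noisy environments by integrating visual information from snapshots, global positional context from masks, and noisy coordinates, compensating for the drift and rotation that environmental disturbances cause in the robots' positions and orientations. Experiments on synthetic data show very high coordinate prediction accuracy and plausible image assembly. The assembled images give clear, coherent views of the underwater environment for monitoring and inspection. Code is publicly available.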

Link: https://arxiv.org/abs/2503.02752
Authors: Shuang Chen, Yifeng He, Barry Lennox, Farshad Arvin, Amir Atapour-Abarghouei
Affiliations: Department of Computer Science, Durham University, UK; Department of Electrical & Electronic Engineering, The University of Manchester, UK
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA 2025

Abstract:Long-term monitoring and exploration of extreme environments, such as underwater storage facilities, is costly, labor-intensive, and hazardous. Automating this process with low-cost, collaborative robots can greatly improve efficiency. These robots capture images from different positions, which must be processed simultaneously to create a spatio-temporal model of the facility. In this paper, we propose a novel approach that integrates data simulation, a multi-modal deep learning network for coordinate prediction, and image reassembly to address the challenges posed by environmental disturbances causing drift and rotation in the robots’ positions and orientations. Our approach enhances the precision of alignment in noisy environments by integrating visual information from snapshots, global positional context from masks, and noisy coordinates. We validate our method through extensive experiments using synthetic data that simulate real-world robotic operations in underwater settings. The results demonstrate very high coordinate prediction accuracy and plausible image assembly, indicating the real-world applicability of our approach. The assembled images provide clear and coherent views of the underwater environment for effective monitoring and inspection, showcasing the potential for broader use in extreme settings, further contributing to improved safety, efficiency, and cost reduction in hazardous field monitoring. Code is available on this https URL.

[CV-10] ArcPro: Architectural Programs for Structured 3D Abstraction of Sparse Points CVPR2025

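【Quick Read】: This paper addresses the recovery of structured 3D abstractions from highly sparse, low-quality point clouds. It proposes ArcPro, a learning framework built on architectural programs, together with a domain-specific language (DSL) that represents building structures hierarchically as a program that converts efficiently into a mesh. Training data are synthesized with a feedforward procedural process, which bridges feedforward and inverse procedural modeling and lets the network make reverse predictions. A 3D convolutional encoder extracts point cloud features, and a transformer decoder autoregressively predicts the program in tokenized form, mapping unstructured point clouds to architectural programs. Inference is efficient and yields plausible, faithful 3D abstractions. Experiments show that ArcPro outperforms both traditional architectural proxy reconstruction and learning-based abstraction methods, and the authors also explore multi-view image and natural language inputs.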

Link: https://arxiv.org/abs/2503.02745
Authors: Qirui Huang, Runze Zhang, Kangjun Liu, Minglun Gong, Hao Zhang, Hui Huang
Affiliations: CSSE, Shenzhen University; Pengcheng Laboratory; University of Guelph; Simon Fraser University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: CVPR 2025 (Patent Protected); Project page: this https URL

Abstract:We introduce ArcPro, a novel learning framework built on architectural programs to recover structured 3D abstractions from highly sparse and low-quality point clouds. Specifically, we design a domain-specific language (DSL) to hierarchically represent building structures as a program, which can be efficiently converted into a mesh. We bridge feedforward and inverse procedural modeling by using a feedforward process for training data synthesis, allowing the network to make reverse predictions. We train an encoder-decoder on the points-program pairs to establish a mapping from unstructured point clouds to architectural programs, where a 3D convolutional encoder extracts point cloud features and a transformer decoder autoregressively predicts the programs in a tokenized form. Inference by our method is highly efficient and produces plausible and faithful 3D abstractions. Comprehensive experiments demonstrate that ArcPro outperforms both traditional architectural proxy reconstruction and learning-based abstraction methods. We further explore its potential to work with multi-view image and natural language inputs.

[CV-11] UAR-NVC: A Unified AutoRegressive Framework for Memory-Efficient Neural Video Compression

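【Quick Read】: This paper addresses the rapid growth in memory consumption that video compression based on Implicit Neural Representations (INRs) incurs on long videos, which is a problem in resource-constrained settings. It proposes the Unified AutoRegressive Framework for memory-efficient Neural Video Compression (UAR-NVC), which unifies timeline-based and INR-based neural video compression under one autoregressive paradigm. The framework partitions a video into clips and processes each clip with a separate INR model instance, combining the advantages of both kinds of framework. Two modules optimize the initialization, training, and compression of the model parameters to reduce temporal redundancy between clips. Varying the clip length adjusts latency and lets UAR-NVC adapt to resource-constrained environments while significantly improving on baseline models.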

Link: https://arxiv.org/abs/2503.02733
Authors: Jia Wang, Xinfeng Zhang, Gai Zhang, Jun Zhu, Lv Tang, Li Zhang
Affiliations: School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China; Bytedance Inc., San Diego, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Implicit Neural Representations (INRs) have demonstrated significant potential in video compression by representing videos as neural networks. However, as the number of frames increases, the memory consumption for training and inference increases substantially, posing challenges in resource-constrained scenarios. Inspired by the success of traditional video compression frameworks, which process video frame by frame and can efficiently compress long videos, we adopt this modeling strategy for INRs to decrease memory consumption, while aiming to unify the frameworks from the perspective of timeline-based autoregressive modeling. In this work, we present a novel understanding of INR models from an autoregressive (AR) perspective and introduce a Unified AutoRegressive Framework for memory-efficient Neural Video Compression (UAR-NVC). UAR-NVC integrates timeline-based and INR-based neural video compression under a unified autoregressive paradigm. It partitions videos into several clips and processes each clip using a different INR model instance, leveraging the advantages of both compression frameworks while allowing seamless adaptation to either in form. To further reduce temporal redundancy between clips, we design two modules to optimize the initialization, training, and compression of these model parameters. UAR-NVC supports adjustable latencies by varying the clip length. Extensive experimental results demonstrate that UAR-NVC, with its flexible video clip setting, can adapt to resource-constrained environments and significantly improve performance compared to different baseline models.
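
The clip-wise processing described in this abstract can be sketched as a simple loop: split the frame sequence into clips, fit one model per clip, and pass the previous clip's parameters forward as the initialization for the next. The `fit_clip` callback and the `mean_fit` stand-in (which just averages scalar "frames") are hypothetical; a real system would train an INR per clip.

```python
def partition_into_clips(num_frames, clip_length):
    """Split frame indices 0..num_frames-1 into consecutive clips."""
    if clip_length <= 0:
        raise ValueError("clip_length must be positive")
    return [
        range(start, min(start + clip_length, num_frames))
        for start in range(0, num_frames, clip_length)
    ]


def compress_video(frames, clip_length, fit_clip):
    """Fit one model per clip; each fit sees the previous clip's parameters."""
    params_per_clip = []
    previous = None
    for clip in partition_into_clips(len(frames), clip_length):
        clip_frames = [frames[i] for i in clip]
        previous = fit_clip(clip_frames, init=previous)
        params_per_clip.append(previous)
    return params_per_clip


def mean_fit(clip_frames, init=None):
    """Toy stand-in for INR training: the 'model' is the clip mean.
    It ignores init; a real fit would start optimization from it."""
    return sum(clip_frames) / len(clip_frames)


frames = [float(i) for i in range(10)]  # 10 scalar stand-in frames
params = compress_video(frames, clip_length=4, fit_clip=mean_fit)
```

Only one clip's frames are held at a time, so peak memory scales with `clip_length` rather than with the full video length, and a shorter clip also means lower latency.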

[CV-12] Creating Sorted Grid Layouts with Gradient-based Optimization

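【Quick Read】: This paper addresses sorting high-dimensional vectors on a two-dimensional grid so that spatial proximity reflects similarity. The combinatorial complexity is extreme: even an 8-by-8 grid has 64! arrangements, on the order of 1.3 × 10^89, so brute-force search is impractical. Various methods have been proposed, but none has explored gradient-based optimization. The paper introduces the first gradient-based method for grid sorting, with a new loss function that balances two opposing goals: producing a valid permutation matrix, and arranging the grid so that it reflects the similarity between vectors. Experiments show better sorting quality than existing techniques.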

Link: https://arxiv.org/abs/2503.02730
Authors: Kai Uwe Barthel, Florian Tim Barthel, Peter Eisert, Nico Hezel, Konstantin Schall
Affiliations: Visual Computing Group; HTW Berlin; Fraunhofer HHI; HU Berlin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Visually sorted grid layouts provide an efficient method for organizing high-dimensional vectors in two-dimensional space by aligning spatial proximity with similarity relationships. This approach facilitates the effective sorting of diverse elements ranging from data points to images, and enables the simultaneous visualization of a significant number of elements. However, sorting data on two-dimensional grids is a challenge due to its high complexity. Even for a small 8-by-8 grid with 64 elements, the number of possible arrangements exceeds 1.3 × 10^89 (more than the number of atoms in the universe), making brute-force solutions impractical. Although various methods have been proposed to address the challenge of determining sorted grid layouts, none have investigated the potential of gradient-based optimization. In this paper, we present a novel method for grid-based sorting that exploits gradient optimization for the first time. We introduce a novel loss function that balances two opposing goals: ensuring the generation of a “valid” permutation matrix, and optimizing the arrangement on the grid to reflect the similarity between vectors, inspired by metrics that assess the quality of sorted grids. While learning-based approaches are inherently computationally complex, our method shows promising results in generating sorted grid layouts with superior sorting quality compared to existing techniques. Journal reference: ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval, Pages 1199-1206. Related DOI: https://doi.org/10.1145/3652583.3657585
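
Two points in this abstract lend themselves to a short check. First, the arrangement count for 64 cells is 64!, which is about 1.27 × 10^89 and rounds to the quoted 1.3 × 10^89. Second, a differentiable "validity" term for a relaxed permutation matrix can be written as a penalty on row sums, column sums, and non-binary entries. The penalty below is a generic construction offered as an assumption; it is not claimed to be the loss used in the paper.

```python
import math

# Number of ways to place 64 distinct items on an 8-by-8 grid.
num_arrangements = math.factorial(64)


def permutation_penalty(matrix):
    """Zero exactly when `matrix` is a permutation matrix (entries in [0, 1]).

    Penalizes rows and columns that do not sum to 1 and entries that are
    neither 0 nor 1. A generic relaxation, assumed for illustration.
    """
    n = len(matrix)
    row_term = sum((sum(row) - 1.0) ** 2 for row in matrix)
    col_term = sum(
        (sum(matrix[i][j] for i in range(n)) - 1.0) ** 2 for j in range(n)
    )
    binary_term = sum(v * (1.0 - v) for row in matrix for v in row)
    return row_term + col_term + binary_term


identity = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
penalty_valid = permutation_penalty(identity)
penalty_soft = permutation_penalty(uniform)
```

In a gradient-based sorter, a term like this would be weighed against a similarity-based layout term, which is the balance of opposing goals the abstract describes.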

[CV-13] A Joint Visual Compression and Perception Framework for Neuralmorphic Spiking Camera

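【Quick Read】: This paper addresses the heavy storage and processing demands of the binary spike data produced by neuromorphic spike cameras, and asks how such data can be compressed efficiently while serving intelligent applications. It introduces the notion of Spike Coding for Intelligence (SCI) and a dual-pathway architecture that processes spatial semantics and motion information separately and then merges them into features for compression. A temporal regression approach integrates various motion dynamics, building on advances in warping and deformation techniques. The scheme achieves state-of-the-art results: an average 17.25% BD-rate reduction compared with the best existing codecs, an 88.26% complexity reduction and a 42.41% inference time saving on the encoding side, and a 4.3% accuracy improvement over SpiReco for spike-based classification.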

Link: https://arxiv.org/abs/2503.02725
Authors: Kexiang Feng, Chuanmin Jia, Siwei Ma, Wen Gao
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; National Engineering Research Center of Visual Technology, Peking University; Wangxuan Institute of Computer Technology, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract: The advent of neuralmorphic spike cameras has garnered significant attention for their ability to capture continuous motion with unparalleled temporal resolution. However, this imaging attribute necessitates considerable resources for binary spike data storage and processing. In light of compression and spike-driven intelligent applications, we present the notion of Spike Coding for Intelligence (SCI), wherein spike sequences are compressed and optimized for both bit-rate and task performance. Drawing inspiration from the mammalian vision system, we propose a dual-pathway architecture for separate processing of spatial semantics and motion information, which is then merged to produce features for compression. A refinement scheme is also introduced to ensure consistency between decoded features and motion […]. We further propose a temporal regression approach that integrates various motion dynamics, capitalizing on the advancements in warping and deformation techniques. Experiments demonstrate our scheme achieves state-of-the-art (SOTA) performance for spike compression and […]. We achieve an average 17.25% BD-rate reduction compared to SOTA codecs and a 4.3% accuracy improvement over SpiReco for spike-based classification, with 88.26% complexity reduction and 42.41% inference time saving on the encoding side.

[CV-14] Catheter Detection and Segmentation in X-ray Images via Multi-task Learning

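[Quick read]: This paper addresses the automatic detection and segmentation of surgical devices such as catheters in X-ray fluoroscopic images during minimally invasive heart surgery, with the aim of improving image guidance. It proposes a convolutional neural network that combines a ResNet architecture with multiple prediction heads to localize electrodes precisely and segment catheters within an end-to-end deep learning framework. The key to the solution is a multi-task learning strategy coupled with a novel multi-level dynamic resource prioritization method, which adjusts sample and task weights dynamically during training to prioritize the more challenging tasks. The approach achieves a good trade-off between accuracy and efficiency and outperforms existing state-of-the-art single-task and multi-task methods.

Link: https://arxiv.org/abs/2503.02717
Authors: Lin Xi,Yingliang Ma,Ethan Koland,Sandra Howell,Aldo Rinaldi,Kawal S. Rhode
Affiliations: University of East Anglia (UEA); King’s College London (KCL)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: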

Abstract:Automated detection and segmentation of surgical devices, such as catheters or wires, in X-ray fluoroscopic images have the potential to enhance image guidance in minimally invasive heart surgeries. In this paper, we present a convolutional neural network model that integrates a ResNet architecture with multiple prediction heads to achieve real-time, accurate localization of electrodes on catheters and catheter segmentation in an end-to-end deep learning framework. We also propose a multi-task learning strategy in which our model is trained to perform both accurate electrode detection and catheter segmentation simultaneously. A key challenge with this approach is achieving optimal performance for both tasks. To address this, we introduce a novel multi-level dynamic resource prioritization method. This method dynamically adjusts sample and task weights during training to effectively prioritize more challenging tasks, where task difficulty is inversely proportional to performance and evolves throughout the training process. Experiments on both public and private datasets have demonstrated that the accuracy of our method surpasses the existing state-of-the-art methods on both the single segmentation task and the joint detection-and-segmentation multi-task. Our approach achieves a good trade-off between accuracy and efficiency, making it well-suited for real-time surgical guidance applications.
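
The abstract states that task difficulty is inversely proportional to performance and that task weights follow difficulty. A minimal sketch of that idea is given here, assuming each task reports a score in (0, 1]. The helper names `dynamic_task_weights` and `weighted_loss` and the sum-to-one normalization are illustrative assumptions; the paper's multi-level (sample plus task) scheme is not reproduced.

```python
def dynamic_task_weights(performance, eps=1e-8):
    """Return task weights proportional to difficulty = 1 / performance.

    `performance` maps a task name to a score in (0, 1]; higher is better.
    Weights are normalized to sum to 1. Hypothetical helper, not the paper's code.
    """
    difficulty = {task: 1.0 / (score + eps) for task, score in performance.items()}
    total = sum(difficulty.values())
    return {task: d / total for task, d in difficulty.items()}


def weighted_loss(losses, weights):
    """Combine per-task losses using the current task weights."""
    return sum(weights[task] * losses[task] for task in losses)


# Segmentation currently scores lower than detection, so it gets the larger weight.
weights = dynamic_task_weights({"detection": 0.9, "segmentation": 0.6})
total = weighted_loss({"detection": 0.2, "segmentation": 0.5}, weights)
```

Recomputing the weights after every evaluation round lets the emphasis shift as the tasks improve at different rates.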

[CV-15] Memory Efficient Continual Learning for Edge-Based Visual Anomaly Detection

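[Quick read]: This paper addresses the challenge of deploying Visual Anomaly Detection (VAD) models on edge devices, in particular enabling Continual Learning under tight resource constraints. Edge devices typically have limited compute and memory, while the dynamic data distributions of real applications require models to keep adapting to new data. The key contribution is a new line of study, Continual Learning for Visual Anomaly Detection (CLAD) on edge devices, together with an evaluation of the STFPM approach combined with the Replay strategy. The paper also studies PaSTe, a method designed for the edge but not previously examined in a continual learning setting. Thanks to its structure, PaSTe has a lower memory footprint than conventional STFPM and better anomaly detection performance: f1 pixel performance improves by 10%, and Compressed Replay techniques reduce memory overhead by up to 91.5%. This demonstrates the feasibility of deploying VAD models that adapt and learn incrementally on resource-constrained edge devices.

Link: https://arxiv.org/abs/2503.02691
Authors: Manuel Barusco,Lorenzo D’Antoni,Davide Dalle Pezze,Francesco Borsatti,Gian Antonio Susto
Affiliations: University of Padova, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: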

Abstract:Visual Anomaly Detection (VAD) is a critical task in computer vision with numerous real-world applications. However, deploying these models on edge devices presents significant challenges, such as constrained computational and memory resources. Additionally, dynamic data distributions in real-world settings necessitate continuous model adaptation, further complicating deployment under limited resources. To address these challenges, we present a novel investigation into the problem of Continual Learning for Visual Anomaly Detection (CLAD) on edge devices. We evaluate the STFPM approach, given its low memory footprint on edge devices, which demonstrates good performance when combined with the Replay approach. Furthermore, we propose to study the behavior of a recently proposed approach, PaSTe, specifically designed for the edge but not yet explored in the Continual Learning context. Our results show that PaSTe is not only a lighter version of STFPM, but it also achieves superior anomaly detection performance, improving the f1 pixel performance by 10% with the Replay technique. In particular, the structure of PaSTe allows us to test it using a series of Compressed Replay techniques, reducing memory overhead by a maximum of 91.5% compared to the traditional Replay for STFPM. Our study proves the feasibility of deploying VAD models that adapt and learn incrementally on CLAD scenarios on resource-constrained edge devices.

[CV-16] STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks CVPR2025

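[Quick read]: This paper addresses the performance gap between Spiking Neural Networks (SNNs) and conventional Artificial Neural Networks (ANNs), which limits the wider adoption of SNNs. It proposes STAA-SNN (Spatial-Temporal Attention Aggregator Spiking Neural Network), a framework whose key idea is to capture spatial and temporal dependencies dynamically. Specifically, STAA-SNN introduces a spike-driven self-attention mechanism designed for SNNs and incorporates position encoding to integrate latent temporal relationships into the input features. It uses step attention to aggregate spatial-temporal information and a time-step random dropout strategy to avoid local optima. The resulting framework captures spatial and temporal dependencies efficiently, generalizes well, and achieves leading performance on multiple datasets.

Link: https://arxiv.org/abs/2503.02689
Authors: Tianqing Zhang,Kairong Yu,Xian Zhong,Hongwei Wang,Qi Xu,Qiang Zhang
Affiliations: Zhejiang University; Dalian University of Technology; Wuhan University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025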

Abstract:Spiking Neural Networks (SNNs) have gained significant attention due to their biological plausibility and energy efficiency, making them promising alternatives to Artificial Neural Networks (ANNs). However, the performance gap between SNNs and ANNs remains a substantial challenge hindering the widespread adoption of SNNs. In this paper, we propose a Spatial-Temporal Attention Aggregator SNN (STAA-SNN) framework, which dynamically focuses on and captures both spatial and temporal dependencies. First, we introduce a spike-driven self-attention mechanism specifically designed for SNNs. Additionally, we pioneeringly incorporate position encoding to integrate latent temporal relationships into the incoming features. For spatial-temporal information aggregation, we employ step attention to selectively amplify relevant features at different steps. Finally, we implement a time-step random dropout strategy to avoid local optima. As a result, STAA-SNN effectively captures both spatial and temporal dependencies, enabling the model to analyze complex patterns and make accurate predictions. The framework demonstrates exceptional performance across diverse datasets and exhibits strong generalization capabilities. Notably, STAA-SNN achieves state-of-the-art results on neuromorphic datasets CIFAR10-DVS, with remarkable performances of 97.14%, 82.05% and 70.40% on the static datasets CIFAR-10, CIFAR-100 and ImageNet, respectively. Furthermore, our model exhibits improved performance ranging from 0.33% to 2.80% with fewer time steps. The code for the model is available on GitHub.

[CV-17] Class-Aware PillarMix: Can Mixed Sample Data Augmentation Enhance 3D Object Detection with Radar Point Clouds? IROS2025

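[Quick read]: This paper examines whether existing Mixed Sample Data Augmentation (MSDA) methods are feasible for, and can be adapted to, radar point clouds. Although MSDA is widely used for LiDAR point clouds, extending it to radar faces three challenges: the irregular angular distribution of radar point clouds, deviations from a single-sensor polar layout in multi-radar setups, and point sparsity. To overcome these, the paper proposes Class-Aware PillarMix (CAPMix), a new MSDA method. Its key idea is to apply MixUp at the pillar level of the 3D point cloud, guided by class labels, and to assign each pillar an independent mix ratio, which increases sample diversity. CAPMix also uses class-specific distributions to account for the point density of different classes: for dense objects it favors points from the other sample, while for sparse objects it keeps more points from the original sample, preserving critical details while enriching the training data. Experiments show that CAPMix significantly improves performance and outperforms existing MSDA methods on two datasets.

Link: https://arxiv.org/abs/2503.02687
Authors: Miao Zhang,Sherif Abdulatif,Benedikt Loesch,Marco Altmann,Bin Yang
Affiliations: Robert Bosch GmbH; University of Stuttgart
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 6 figures, 4 tables, submitted to 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)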

Abstract:Due to the significant effort required for data collection and annotation in 3D perception tasks, mixed sample data augmentation (MSDA) has been widely studied to generate diverse training samples by mixing existing data. Recently, many MSDA techniques have been developed for point clouds, but they mainly target LiDAR data, leaving their application to radar point clouds largely unexplored. In this paper, we examine the feasibility of applying existing MSDA methods to radar point clouds and identify several challenges in adapting these techniques. These obstacles stem from the radar’s irregular angular distribution, deviations from a single-sensor polar layout in multi-radar setups, and point sparsity. To address these issues, we propose Class-Aware PillarMix (CAPMix), a novel MSDA approach that applies MixUp at the pillar level in 3D point clouds, guided by class labels. Unlike methods that apply a single mix ratio to the entire sample, CAPMix assigns an independent ratio to each pillar, boosting sample diversity. To account for the density of different classes, we use class-specific distributions: for dense objects (e.g., large vehicles), we skew ratios to favor points from another sample, while for sparse objects (e.g., pedestrians), we sample more points from the original. This class-aware mixing retains critical details and enriches each sample with new information, ultimately generating more diverse training data. Experimental results demonstrate that our method not only significantly boosts performance but also outperforms existing MSDA approaches across two datasets (Bosch Street and K-Radar). We believe that this straightforward yet effective approach will spark further investigation into MSDA techniques for radar data.
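
The per-pillar, class-aware mixing described in the abstract can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the Beta parameters in `CLASS_BETA`, the dictionary layout of a sample, and the rule that each point is kept or dropped independently. The sketch only shows the two ideas named in the abstract, an independent ratio per pillar and a class-dependent ratio distribution.

```python
import random

# Hypothetical Beta(a, b) parameters per class. The mean a / (a + b) is the
# expected fraction of points kept from the original sample.
CLASS_BETA = {
    "large_vehicle": (2.0, 5.0),  # dense objects: favor points from the other sample
    "pedestrian": (5.0, 2.0),     # sparse objects: keep more original points
    "background": (1.0, 1.0),     # uniform ratio
}


def mix_pillars(sample_a, sample_b, pillar_class, rng):
    """Mix two samples pillar by pillar.

    sample_a / sample_b: dict mapping a pillar index to a list of points.
    pillar_class: dict mapping a pillar index to its class name in sample_a.
    Each pillar draws its own ratio; a point of A is kept with probability
    `ratio`, a point of B with probability 1 - ratio.
    """
    mixed = {}
    for pillar in sorted(set(sample_a) | set(sample_b)):
        a, b = CLASS_BETA[pillar_class.get(pillar, "background")]
        ratio = rng.betavariate(a, b)
        kept_a = [p for p in sample_a.get(pillar, []) if rng.random() < ratio]
        kept_b = [p for p in sample_b.get(pillar, []) if rng.random() >= ratio]
        mixed[pillar] = kept_a + kept_b
    return mixed
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.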

[CV-18] State of play and future directions in industrial computer vision AI standards

[Quick read]: This paper addresses the requirements that industrial-scale Computer Vision (CV) Artificial Intelligence (AI) models must meet in terms of reliability, transparency, trustworthiness, security, safety, and robustness, and explores how efficient, comprehensive, and widely adopted industrial standards can be developed. Its key contribution is a systematic analysis of published and in-development computer vision standards from international standardization bodies (such as ISO/IEC, IEEE, and DIN), focusing on critical aspects such as model interpretability, data quality, and regulatory compliance, along with a discussion of current challenges and future directions.

Link: https://arxiv.org/abs/2503.02675
Authors: Artemis Stefanidou,Panagiotis Radoglou-Grammatikis,Vasileios Argyriou,Panagiotis Sarigiannidis,Iraklis Varlamis,Georgios Th. Papadopoulos
Affiliations: Dept. of Informatics and Telematics, Harokopio University of Athens; K3Y Ltd.; Dept. of Networks and Digital Media, Kingston University; Dept. of Electrical and Computer Engineering, University of Western Macedonia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The recent tremendous advancements in the areas of Artificial Intelligence (AI) and Deep Learning (DL) have also resulted in corresponding remarkable progress in the field of Computer Vision (CV), showcasing robust technological solutions in a wide range of application sectors of high industrial interest (e.g., healthcare, autonomous driving, automation, etc.). Despite the outstanding performance of CV systems in specific domains, their development and exploitation at industrial scale necessitates, among others, addressing requirements related to the reliability, transparency, trustworthiness, security, safety, and robustness of the developed AI models. This raises the imperative need for the development of efficient, comprehensive, and widely adopted industrial standards. In this context, this study investigates the current state of play regarding the development of industrial computer vision AI standards, emphasizing critical aspects such as model interpretability, data quality, and regulatory compliance. In particular, a systematic analysis of launched and currently developing CV standards, proposed by the main international standardization bodies (e.g., ISO/IEC, IEEE, DIN, etc.), is performed. This is complemented by a comprehensive discussion of the current challenges and future directions observed in this regularization endeavor.

[CV-19] 10K is Enough: An Ultra-Lightweight Binarized Network for Infrared Small-Target Detection

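[Quick read]: This paper addresses the tension between the need to deploy infrared small-target detection (IRSTD) on edge devices and the precision loss of binary neural networks (BNNs). Its key contribution is the Binarized Infrared Small-Target Detection Network (BiisNet), which keeps the core operations of binarized convolutions while integrating full-precision features into the network's information flow, meeting the high precision that small infrared targets demand. Specifically, BiisNet introduces Dot-Binary Convolution to retain fine-grained semantic information in feature maps, and a smooth, adaptive Dynamic Softsign function that provides more comprehensive and progressively finer gradients, improving model stability and weight updates. Experiments show that BiisNet not only significantly outperforms other binary architectures but is also strongly competitive with full-precision models.

Link: https://arxiv.org/abs/2503.02662
Authors: Biqiao Xin,Qianchen Mao,Bingshu Wang,Jiangbin Zheng,Yong Zhao,C.L. Philip Chen
Affiliations: School of Software, Northwestern Polytechnical University; Shenzhen Graduate School, Peking University; South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: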

Abstract:The widespread deployment of InfRared Small-Target Detection (IRSTD) algorithms on edge devices necessitates the exploration of model compression techniques. Binary neural networks (BNNs) are distinguished by their exceptional efficiency in model compression. However, the small size of infrared targets introduces stringent precision requirements for the IRSTD task, while the inherent precision loss during binarization presents a significant challenge. To address this, we propose the Binarized Infrared Small-Target Detection Network (BiisNet), which preserves the core operations of binarized convolutions while integrating full-precision features into the network’s information flow. Specifically, we propose the Dot-Binary Convolution, which retains fine-grained semantic information in feature maps while still leveraging the binarized convolution operations. In addition, we introduce a smooth and adaptive Dynamic Softsign function, which provides more comprehensive and progressively finer gradients during back-propagation, enhancing model stability and promoting an optimal weight […]. Experimental results demonstrate that BiisNet not only significantly outperforms other binary architectures but also demonstrates strong competitiveness among state-of-the-art full-precision models.
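
Binarized layers use the sign function in the forward pass, whose derivative is zero almost everywhere, so a smooth surrogate is commonly used for the backward pass. The sketch below shows the standard softsign function with a `scale` parameter standing in for the "dynamic" part. That parameter and how it would be adapted are assumptions for illustration; the paper's exact Dynamic Softsign definition is not given in the abstract.

```python
def sign(x):
    """Binarize a real value to -1.0 or +1.0 (forward pass of a binarized layer)."""
    return 1.0 if x >= 0 else -1.0


def softsign(x, scale=1.0):
    """Smooth approximation of sign: z / (1 + |z|) with z = scale * x."""
    z = scale * x
    return z / (1.0 + abs(z))


def softsign_grad(x, scale=1.0):
    """Derivative of softsign with respect to x: scale / (1 + |scale * x|)^2."""
    return scale / (1.0 + abs(scale * x)) ** 2
```

A larger `scale` makes softsign hug the sign function more tightly, while the gradient becomes more concentrated around zero, which is one way to obtain progressively finer gradients over training.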

[CV-20] A dataset-free approach for self-supervised learning of 3D reflectional symmetries

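[Quick read]: This paper addresses symmetry detection for a single object. Conventional methods depend on large annotated datasets, whereas the proposed method needs no annotated data and learns from the input object alone. It rests on two key points: first, the hypothesis that an object's symmetry is determined by its intrinsic features, which removes the need for large datasets during training; second, a self-supervised learning strategy that removes the dependence on ground-truth labels. In addition, features extracted from a foundational image model are used to compute a visual descriptor for each point on the object, equipping the point cloud with visual features that help optimize the self-supervised model. Experiments show that, without a large dataset, the method surpasses existing state-of-the-art models while being more efficient and consuming fewer resources.

Link: https://arxiv.org/abs/2503.02660
Authors: Issac Aguirre,Ivan Sipiran,Gabriel Montañana
Affiliations: Department of Computer Science, University of Chile
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: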

Abstract:In this paper, we explore a self-supervised model that learns to detect the symmetry of a single object without requiring a dataset, relying solely on the input object itself. We hypothesize that the symmetry of an object can be determined by its intrinsic features, eliminating the need for large datasets during training. Additionally, we design a self-supervised learning strategy that removes the necessity of ground truth labels. These two key elements make our approach both effective and efficient, addressing the prohibitive costs associated with constructing large, labeled datasets for this task. The novelty of our method lies in computing features for each point on the object based on the idea that symmetric points should exhibit similar visual appearances. To achieve this, we leverage features extracted from a foundational image model to compute a visual descriptor for the points. This approach equips the point cloud with visual features that facilitate the optimization of our self-supervised model. Experimental results demonstrate that our method surpasses the state-of-the-art models trained on large datasets. Furthermore, our model is more efficient, effective, and operates with minimal computational and data resources.
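
For readers unfamiliar with reflectional symmetry, the underlying geometry can be sketched as follows: a point is mirrored across a candidate plane, and a plane is a good symmetry candidate when every mirrored point lands close to some original point. This is only the geometric definition with a nearest-neighbor error, written with hypothetical helper names. The paper's self-supervised objective, which compares visual descriptors of symmetric points, is not reproduced.

```python
import math


def reflect(point, normal, offset=0.0):
    """Reflect a 3D point across the plane {x : n . x = offset}.

    `normal` is normalized inside the function, so `offset` is measured
    along the unit normal.
    """
    norm = math.sqrt(sum(c * c for c in normal))
    n = [c / norm for c in normal]
    dist = sum(p * c for p, c in zip(point, n)) - offset
    return tuple(p - 2.0 * dist * c for p, c in zip(point, n))


def symmetry_error(points, normal, offset=0.0):
    """Mean distance from each reflected point to its nearest original point."""
    total = 0.0
    for p in points:
        q = reflect(p, normal, offset)
        total += min(math.dist(q, r) for r in points)
    return total / len(points)
```

The brute-force nearest-neighbor search is quadratic in the number of points and is meant only for small illustrative inputs.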

[CV-21] XFMamba: Cross-Fusion Mamba for Multi-View Medical Image Classification

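[Quick read]: This paper addresses two main problems of existing methods for multi-view medical image classification: they fail to model cross-view correlations adequately, leading to suboptimal classification performance, and they are limited by the receptive field of Convolutional Neural Networks (CNNs) or by the quadratic computational complexity of Transformer models. To address these challenges, the paper proposes XFMamba, a pure Mamba-based cross-fusion architecture. Its key element is a novel two-stage fusion strategy that learns single-view features and their cross-view disparity, capturing spatially long-range dependencies in each view and enabling seamless information transfer between views. Experiments on three public datasets, MURA, CheXpert, and DDSM, confirm the effectiveness of the method.

Link: https://arxiv.org/abs/2503.02619
Authors: Xiaoyu Zheng,Xu Chen,Shaogang Gong,Xavier Griffin,Greg Slabaugh
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: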

Abstract:Compared to single view medical image classification, using multiple views can significantly enhance predictive accuracy as it can account for the complementarity of each view while leveraging correlations between views. Existing multi-view approaches typically employ separate convolutional or transformer branches combined with simplistic feature fusion strategies. However, these approaches inadvertently disregard essential cross-view correlations, leading to suboptimal classification performance, and suffer from challenges with limited receptive field (CNNs) or quadratic computational complexity (transformers). Inspired by state space sequence models, we propose XFMamba, a pure Mamba-based cross-fusion architecture to address the challenge of multi-view medical image classification. XFMamba introduces a novel two-stage fusion strategy, facilitating the learning of single-view features and their cross-view disparity. This mechanism captures spatially long-range dependencies in each view while enhancing seamless information transfer between views. Results on three public datasets, MURA, CheXpert and DDSM, illustrate the effectiveness of our approach across diverse multi-view medical image classification tasks, showing that it outperforms existing convolution-based and transformer-based multi-view methods. Code is available at this https URL.

[CV-22] Smoothing the Shift: Towards Stable Test-Time Adaptation under Complex Multimodal Noises ICLR2025

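[Quick read]: This paper addresses a new challenge for multimodal data in Test-Time Adaptation (TTA), termed multimodal wild TTA, in which complex noise patterns (such as simultaneous corruption of multiple modalities and missing modalities) and mixed corruptions from different distribution shifts make existing methods fail to adapt. The paper proposes SuMi, whose key ideas are to mitigate abrupt distribution shifts through interquartile range smoothing, to use unimodal features to select low-entropy samples with rich multimodal information for optimization, and to introduce mutual information sharing to align information across modalities, reduce discrepancies, and improve information utilization. Experiments confirm the effectiveness and superiority of the method under complex noise patterns.

Link: https://arxiv.org/abs/2503.02616
Authors: Zirun Guo,Tao Jin
Affiliations: Zhejiang University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICLR 2025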

Abstract:Test-Time Adaptation (TTA) aims to tackle distribution shifts using unlabeled test data without access to the source data. In the context of multimodal data, there are more complex noise patterns than in unimodal data, such as simultaneous corruptions for multiple modalities and missing modalities. Besides, in real-world applications, corruptions from different distribution shifts are always mixed. Existing TTA methods always fail in such multimodal scenarios because the abrupt distribution shifts will destroy the prior knowledge from the source model, thus leading to performance degradation. To this end, we reveal a new challenge named multimodal wild TTA. To address this challenging problem, we propose two novel strategies: sample identification with interquartile range Smoothing and unimodal assistance, and Mutual information sharing (SuMi). SuMi smooths the adaptation process by interquartile range which avoids the abrupt distribution shifts. Then, SuMi fully utilizes the unimodal features to select low-entropy samples with rich multimodal information for optimization. Furthermore, mutual information sharing is introduced to align the information, reduce the discrepancies and enhance the information utilization across different modalities. Extensive experiments on two public datasets show the effectiveness and superiority over existing methods under the complex noise patterns in multimodal data. Code is available at this https URL.
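
The two statistical ingredients named in the abstract, prediction entropy and the interquartile range, can be illustrated with a generic outlier filter. This is a sketch under stated assumptions: the function names, the fence factor `k = 1.5`, and the rule of discarding samples outside the fences are conventional choices, not SuMi's actual selection criterion.

```python
import math
import statistics


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def iqr_select(scores, k=1.5):
    """Return the indices whose score lies inside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, s in enumerate(scores) if low <= s <= high]


# Usage: score each test sample by the entropy of its predicted distribution,
# then keep only the samples that are not outliers within the batch.
batch = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.85, 0.15]]
kept = iqr_select([entropy(p) for p in batch])
```

Filtering on batch statistics rather than a fixed threshold lets the cut-off move with the data, which is the general motivation for IQR-based rules.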

[CV-23] ARC-Flow: Articulated Resolution-Agnostic Correspondence-Free Matching and Interpolation of 3D Shapes Under Flow Fields

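[Quick read]: This paper addresses two core problems: the unsupervised prediction of physically plausible interpolations between two 3D articulated shapes, and the automatic estimation of dense correspondence between them. It proposes a unified framework for both. The key to the solution is to model interpolation with Neural Ordinary Differential Equations (ODEs): a smooth, time-varying flow field generates a diffeomorphic transformation that ensures topological consistency and non-intersecting trajectories, while supporting hard constraints (such as volume preservation) and soft constraints (such as physical priors). Dense correspondence is recovered with an efficient Varifold formulation that performs well on high-fidelity surfaces with differing parameterisations. By providing only a simple skeleton for the source shape, the paper imposes physically motivated constraints on the deformation field and resolves symmetric ambiguities, without relying on skinning weights or any prior knowledge of the skeleton's target pose configuration. Experiments show competitive or superior performance on shape correspondence and interpolation tasks across standard datasets.

Link: https://arxiv.org/abs/2503.02606
Authors: Adam Hartshorne,Allen Paul,Tony Shardlow,Neill D.F. Campbell
Affiliations: University of Bath
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures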

Abstract:This work presents a unified framework for the unsupervised prediction of physically plausible interpolations between two 3D articulated shapes and the automatic estimation of dense correspondence between them. Interpolation is modelled as a diffeomorphic transformation using a smooth, time-varying flow field governed by Neural Ordinary Differential Equations (ODEs). This ensures topological consistency and non-intersecting trajectories while accommodating hard constraints, such as volume preservation, and soft constraints, e.g., physical priors. Correspondence is recovered using an efficient Varifold formulation that is effective on high-fidelity surfaces with differing parameterisations. By providing a simple skeleton for the source shape only, we impose physically motivated constraints on the deformation field and resolve symmetric ambiguities. This is achieved without relying on skinning weights or any prior knowledge of the skeleton’s target pose configuration. Qualitative and quantitative results demonstrate competitive or superior performance over existing state-of-the-art approaches in both shape correspondence and interpolation tasks across standard datasets.
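
To make the flow-field view of interpolation concrete, the sketch below advects points through a velocity field by numerical integration; intermediate states along the way play the role of interpolated shapes. The analytic rotation field in `velocity` is a stand-in chosen for illustration (in the paper the field is a neural network), and the midpoint integrator and all names are assumptions.

```python
import math


def velocity(point, t):
    """Hypothetical smooth flow field: rigid rotation about the z-axis.

    Its divergence is zero, so this particular field preserves volume.
    """
    x, y, z = point
    return (-y, x, 0.0)


def integrate(points, t0=0.0, t1=1.0, steps=1000):
    """Advect points through the flow field using midpoint (RK2) steps."""
    dt = (t1 - t0) / steps
    out = [tuple(p) for p in points]
    t = t0
    for _ in range(steps):
        moved = []
        for p in out:
            v1 = velocity(p, t)
            mid = tuple(c + 0.5 * dt * v for c, v in zip(p, v1))
            v2 = velocity(mid, t + 0.5 * dt)
            moved.append(tuple(c + dt * v for c, v in zip(p, v2)))
        out = moved
        t += dt
    return out
```

Because every point follows the same smooth field, exact trajectories of distinct points cannot cross, which is the property the abstract refers to as non-intersecting trajectories.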

[CV-24] Resource-Efficient Affordance Grounding with Complementary Depth and Semantic Prompts

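[Quick read]: This paper addresses the difficulty that existing multimodal affordance methods have in extracting useful information. The difficulty stems mainly from simple structural designs, basic modality fusion methods, and overly large model parameters, which make it hard to meet the performance requirements of practical deployment. The paper proposes BiT-Align, an image-depth-text affordance mapping framework. Its key components are a Bypass Prompt Module (BPM) and a Text Feature Guidance (TFG) attention selection mechanism. BPM embeds the auxiliary-modality depth image directly as a prompt into the primary-modality RGB image, with no additional encoder, which reduces model parameters and improves functional region localization accuracy. TFG uses text features to guide the selection and enhancement of attention heads in the image encoder, improving the understanding of affordance characteristics. Experiments show significant performance gains on the AGD20K and HICO-IIF datasets, including a 6.0% improvement in the KLD metric over the current state-of-the-art method with 88.8% fewer model parameters, demonstrating practical value.

Link: https://arxiv.org/abs/2503.02600
Authors: Yizhou Huang,Fan Yang,Guoliang Zhu,Gen Li,Hao Shi,Yukun Zuo,Wenrui Chen,Zhiyong Li,Kailun Yang
Affiliations: School of Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, China; School of Informatics, University of Edinburgh, UK; State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: The source code will be made publicly available at this https URL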

Abstract:Affordance refers to the functional properties that an agent perceives and utilizes from its environment, and is key perceptual information required for robots to perform actions. This information is rich and multimodal in nature. Existing multimodal affordance methods face limitations in extracting useful information, mainly due to simple structural designs, basic fusion methods, and large model parameters, making it difficult to meet the performance requirements for practical deployment. To address these issues, this paper proposes the BiT-Align image-depth-text affordance mapping framework. The framework includes a Bypass Prompt Module (BPM) and a Text Feature Guidance (TFG) attention selection mechanism. BPM integrates the auxiliary modality depth image directly as a prompt to the primary modality RGB image, embedding it into the primary modality encoder without introducing additional encoders. This reduces the model’s parameter count and effectively improves functional region localization accuracy. The TFG mechanism guides the selection and enhancement of attention heads in the image encoder using textual features, improving the understanding of affordance characteristics. Experimental results demonstrate that the proposed method achieves significant performance improvements on the public AGD20K and HICO-IIF datasets. On the AGD20K dataset, compared with the current state-of-the-art method, we achieve a 6.0% improvement in the KLD metric, while reducing model parameters by 88.8%, demonstrating its practical application value. The source code will be made publicly available at this https URL.
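
The KLD metric mentioned above compares a predicted heatmap with a ground-truth heatmap as probability distributions; lower is better. A minimal sketch on flattened heatmaps follows, assuming the common form in which both maps are normalized to sum to one. The helper names and the `eps` smoothing constants are assumptions, and benchmark implementations may differ in detail.

```python
import math


def normalize(heatmap, eps=1e-12):
    """Scale non-negative heatmap values so that they sum to (almost exactly) 1."""
    total = sum(heatmap) + eps
    return [v / total for v in heatmap]


def kld(pred, gt, eps=1e-12):
    """Kullback-Leibler divergence of the prediction from the ground truth."""
    p = normalize(pred)
    q = normalize(gt)
    return sum(qi * math.log(eps + qi / (pi + eps)) for pi, qi in zip(p, q))
```

A prediction identical to the ground truth scores approximately zero, and the score grows as probability mass is placed where the ground truth has none.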

[CV-25] Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

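[Quick read]: This paper addresses vision-language misalignment in Multimodal Large Language Models (MLLMs), where the generated textual responses are not factually aligned with the given text-image inputs. It revisits the design of MLLMs from the perspective of the core architecture and proposes a new method named AKI. The key innovation is to turn the causal attention mechanism into modality-mutual attention (MMA), which allows image tokens to attend to text tokens. This design improves performance on 12 multimodal understanding benchmarks (+7.2% on average) without adding parameters or significantly lengthening training. MMA is generic and scalable, making it applicable to various modality combinations and complex scenarios.

Link: https://arxiv.org/abs/2503.02597
Authors: Wei-Yao Wang,Zhao Wang,Helen Suzuki,Yoshiyuki Kobayashi
Affiliations: Sony Group Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint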

Abstract:Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, which limits the ability of earlier modalities (e.g., images) to incorporate information from later modalities (e.g., text). To address this problem, we propose AKI, a novel MLLM that unlocks causal attention into modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows AKI to achieve superior performance in 12 multimodal understanding benchmarks (+7.2% on average) without introducing additional parameters and increasing training time. Our MMA design is intended to be generic, allowing for application across various modalities, and scalable to accommodate diverse multimodal scenarios. The code is publicly available at this https URL, and we will release our AKI-4B model to encourage further advancements in MLLMs across various directions.
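
The difference between a causal mask and a mask that lets image tokens see text tokens can be shown with a small boolean matrix. This is a sketch under stated assumptions: the function names are hypothetical, and the exact set of positions that AKI unlocks is not specified in the abstract, so the rule used here (image tokens may additionally attend to every later text token, text tokens stay causal) is an illustrative reading.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def modality_mutual_mask(is_image):
    """Causal mask in which image tokens may also attend to later text tokens.

    `is_image[i]` is True for image tokens. Text rows are left causal.
    """
    n = len(is_image)
    mask = causal_mask(n)
    for i in range(n):
        if is_image[i]:
            for j in range(i + 1, n):
                if not is_image[j]:
                    mask[i][j] = True
    return mask


# Two image tokens followed by two text tokens.
mask = modality_mutual_mask([True, True, False, False])
```

Only mask entries change, which is consistent with the abstract's statement that the design introduces no additional parameters.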

[CV-26] StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts

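[Quick read]: This paper addresses artistic stage scene generation, combining large language models with layout-controlled diffusion models to simulate the workflow of seasoned artists and generate immersive 3D stage scenes. The key to the solution is the three main modules of the StageDesigner framework: Script Analysis, which extracts thematic and spatial cues from the input script; Foreground Generation, which constructs and arranges essential 3D objects; and Background Generation, which creates a background that matches the narrative atmosphere and maintains spatial coherence by managing occlusions between foreground and background elements. The paper also introduces the StagePro-V1 dataset to support this task. Evaluations with both standard and newly proposed metrics, together with extensive user studies, demonstrate the effectiveness of StageDesigner.

Link: https://arxiv.org/abs/2503.02595
Authors: Zhaoxing Gan,Mengtian Li,Ruhua Chen,Zhongxia Ji,Sichen Guo,Huanling Hu,Guangnan Ye,Zuo Hu
Affiliations: Fudan University; Shanghai University; Shanghai Theatre Academy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: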

Abstract:In this work, we introduce StageDesigner, the first comprehensive framework for artistic stage generation using large language models combined with layout-controlled diffusion models. Given the professional requirements of stage scenography, StageDesigner simulates the workflows of seasoned artists to generate immersive 3D stage scenes. Specifically, our approach is divided into three primary modules: Script Analysis, which extracts thematic and spatial cues from input scripts; Foreground Generation, which constructs and arranges essential 3D objects; and Background Generation, which produces a harmonious background aligned with the narrative atmosphere and maintains spatial coherence by managing occlusions between foreground and background elements. Furthermore, we introduce the StagePro-V1 dataset, a dedicated dataset with 276 unique stage scenes spanning different historical styles and annotated with scripts, images, and detailed 3D layouts, specifically tailored for this task. Finally, evaluations using both standard and newly proposed metrics, along with extensive user studies, demonstrate the effectiveness of StageDesigner. Project can be found at: this https URL

[CV-27] CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework

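[Quick read]: This paper addresses point cloud localization from linguistic descriptions, that is, identifying a 3D position in a large urban environment from a textual description. The task has potential applications in scenarios such as vehicle pickup or goods delivery. In practice, however, a passenger waiting for pickup usually describes only the most significant nearby surroundings rather than the whole environment, so the semantic relation between a textual description and its 3D location is only partially relevant. To address this, the paper proposes CMMLoc, a framework built on an uncertainty-aware Cauchy-Mixture-Model (CMM) for text-to-point-cloud localization. The key ideas are to model the uncertain semantic relations between text and point cloud with CMM constraints, integrated as a prior during the interaction between the two modalities; to design a spatial consolidation scheme that adaptively aggregates 3D objects with varying receptive fields; and to add a cardinal direction integration module and a modality pre-alignment strategy, which capture the spatial relationships among objects and bring the 3D objects closer to the text modality. Experiments show that CMMLoc achieves state-of-the-art performance on the KITTI360Pose dataset.

Link: https://arxiv.org/abs/2503.02593
Authors: Yanlong Xu,Haoxuan Qu,Jun Liu,Wenxiao Zhang,Xun Yang
Affiliations: University of Science and Technology of China; Lancaster University; Hohai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: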

Abstract:The goal of point cloud localization based on linguistic description is to identify a 3D position using textual description in large urban environments, which has potential applications in various fields, such as determining the location for vehicle pickup or goods delivery. Ideally, for a textual description and its corresponding 3D location, the objects around the 3D location should be fully described in the text description. However, in practical scenarios, e.g., vehicle pickup, passengers usually describe only the part of the most significant and nearby surroundings instead of the entire environment. In response to this partially relevant challenge, we propose CMMLoc, an uncertainty-aware Cauchy-Mixture-Model (CMM) based framework for text-to-point-cloud Localization. To model the uncertain semantic relations between text and point cloud, we integrate CMM constraints as a prior during the interaction between the two modalities. We further design a spatial consolidation scheme to enable adaptive aggregation of different 3D objects with varying receptive fields. To achieve precise localization, we propose a cardinal direction integration module alongside a modality pre-alignment strategy, helping capture the spatial relationships among objects and bringing the 3D objects closer to the text modality. Comprehensive experiments validate that CMMLoc outperforms existing methods, achieving state-of-the-art results on the KITTI360Pose dataset. Codes are available in this GitHub repository this https URL.
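
As background on the distribution family named above, the density of a one-dimensional Cauchy mixture can be written in a few lines. The sketch is illustrative only: the function names are hypothetical, the example is one-dimensional, and how CMMLoc applies CMM constraints to text and point cloud features is not reproduced.

```python
import math


def cauchy_pdf(x, loc, scale):
    """Density of a 1D Cauchy distribution with location `loc` and scale `scale` > 0."""
    z = (x - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


def cauchy_mixture_pdf(x, weights, locs, scales):
    """Density of a mixture of Cauchy components; `weights` should sum to 1."""
    return sum(w * cauchy_pdf(x, m, s) for w, m, s in zip(weights, locs, scales))
```

Compared with a Gaussian, the Cauchy density has much heavier tails, so points far from a component's center still receive non-negligible density. That tolerance to outliers is a general property of the distribution and a plausible reason to use it when only part of the evidence is relevant.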

[CV-28] A Hypernetwork-Based Approach to KAN Representation of Audio Signals

【速读】：该论文试图解决隐式神经表示（Implicit Neural Representations, INR）在音频信号应用中的局限性问题。为了解决这一问题，论文提出了Kolmogorov-Arnold网络（KAN），这是一种采用可学习激活函数的新型架构，作为高效的INR模型用于音频表示。KAN的关键创新在于其通过引入特定的网络结构和可学习激活函数，显著提升了音频表示的感知性能，在1.5秒音频任务中实现了最低的对数频谱距离（Log-Spectral Distance, LSD = 1.29）和最高的语音感知质量评分（Perceptual Evaluation of Speech Quality, PESQ = 3.57）。此外，为了进一步提升KAN的适用性和效率，论文还提出FewSound，这是一种基于超网络（Hypernetwork）的架构，通过优化INR参数更新过程，使MSE性能较现有最佳方法HyperSound提升了33.3%，SI-SNR提升了60.87%。这些改进表明KAN不仅具有鲁棒性，而且适配性强，具备扩展性和与其他超网络框架集成的潜力。

Link: https://arxiv.org/abs/2503.02585
Authors: Patryk Marszałek, Maciej Rut, Piotr Kawa, Piotr Syga
Affiliations: Unknown
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments:

Abstract: Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-Spectral Distance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN’s utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN as a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at this https URL.
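
The abstract reports gains in MSE and SI-SNR. As a rough illustration of what those two metrics measure, here is a minimal NumPy sketch. The function names and the `eps` stabilizer are assumptions made for this example and are not taken from the paper's code.

```python
import numpy as np

def mse(estimate: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between two waveforms (lower is better)."""
    return float(np.mean((estimate - target) ** 2))

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    # Remove the DC offset so that only the waveform shape is compared.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; this cancels any gain mismatch.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return float(10.0 * np.log10(ratio))
```

Because of the projection step, rescaling the reconstructed signal leaves SI-SNR unchanged, whereas MSE would penalize it.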

[CV-29] Unveiling the Potential of Segment Anything Model 2 for RGB-Thermal Semantic Segmentation with Language Guidance

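[Quick read]: This paper tackles the poor performance of existing Segment Anything Model 2 (SAM2) based methods on RGB-T (RGB-Thermal) tasks. Although SAM2 is trained on large-scale datasets and shows strong perception potential, its inherent training paradigm limits its suitability for multimodal tasks. To address this, the paper proposes SHIFNet, a framework that unlocks SAM2's potential through a language-guided hybrid interaction paradigm for efficient RGB-T perception.

SHIFNet has two core components: (1) a Semantic-Aware Cross-modal Fusion (SACF) module that dynamically balances modality contributions through text-guided affinity learning, overcoming SAM2's inherent RGB bias; and (2) a Heterogeneous Prompting Decoder (HPD) that strengthens global semantic information through a semantic enhancement module and combines it with category embeddings to improve cross-modal semantic consistency. With only 32.27M trainable parameters, SHIFNet achieves state-of-the-art segmentation performance on public benchmarks, reaching 89.8% on PST900 and 67.8% on FMB. The framework mitigates the high cost of data collection while giving robotic systems comprehensive perception capabilities.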

Link: https://arxiv.org/abs/2503.02581
Authors: Jiayi Zhao, Fei Teng, Kai Luo, Guoqiang Zhao, Zhiyong Li, Xu Zheng, Kailun Yang
Affiliations: School of Robotics, Hunan University, China; National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, China; AI Thrust, Hong Kong University of Science and Technology (Guangzhou), China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: The source code will be made publicly available at this https URL

Abstract:The perception capability of robotic systems relies on the richness of the dataset. Although Segment Anything Model 2 (SAM2), trained on large datasets, demonstrates strong perception potential in perception tasks, its inherent training paradigm prevents it from being suitable for RGB-T tasks. To address these challenges, we propose SHIFNet, a novel SAM2-driven Hybrid Interaction Paradigm that unlocks the potential of SAM2 with linguistic guidance for efficient RGB-Thermal perception. Our framework consists of two key components: (1) Semantic-Aware Cross-modal Fusion (SACF) module that dynamically balances modality contributions through text-guided affinity learning, overcoming SAM2’s inherent RGB bias; (2) Heterogeneous Prompting Decoder (HPD) that enhances global semantic information through a semantic enhancement module and then combined with category embeddings to amplify cross-modal semantic consistency. With 32.27M trainable parameters, SHIFNet achieves state-of-the-art segmentation performance on public benchmarks, reaching 89.8% on PST900 and 67.8% on FMB, respectively. The framework facilitates the adaptation of pre-trained large models to RGB-T segmentation tasks, effectively mitigating the high costs associated with data collection while endowing robotic systems with comprehensive perception capabilities. The source code will be made publicly available at this https URL.

[CV-30] MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments

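[Quick read]: This paper addresses the limitations of existing operating room (OR) datasets, which are small in scale, lack realism, and fail to capture the multimodal nature of OR scenes, all of which hold back progress in OR modeling. Its key contribution is MM-OR, a large-scale, realistic multimodal spatiotemporal OR dataset and the first to enable multimodal scene graph generation. The paper also introduces MM2SG, the first multimodal large vision-language model for scene graph generation, and shows through extensive experiments that it makes effective use of multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for holistic OR understanding and open the path toward multimodal scene analysis in complex, high-stakes environments. The code and data are available through the provided link.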

Link: https://arxiv.org/abs/2503.02579
Authors: Ege Özsoy, Chantal Pellegrini, Tobias Czempiel, Felix Tristram, Kun Yuan, David Bani-Harouni, Ulrich Eck, Benjamin Busam, Matthias Keicher, Nassir Navab
Affiliations: Technical University of Munich; Munich Center for Machine Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale, realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments, demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for holistic OR understanding, and open the path towards multimodal scene analysis in complex, high-stakes environments. Our code, and data is available at this https URL.

[CV-31] TS-CGNet: Temporal-Spatial Fusion Meets Centerline-Guided Diffusion for BEV Mapping

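[Quick read]: This paper addresses the lack of depth-aware reasoning in existing Bird's Eye View (BEV) map generation research, in particular the degraded perception under occlusions, in complex environments, and in adverse weather or low-light conditions. It proposes TS-CGNet, a framework that combines Temporal-Spatial fusion with Centerline-Guided diffusion. The framework is decoupled into three parts: a local mapping system that generates semantic maps from purely visual information; a temporal-spatial alignment module that integrates historical information by applying transformation matrices; and a centerline-guided diffusion model that provides prediction based on the diffusion model and injects centerline information through spatial attention to enhance semantic segmentation reconstruction. This design improves BEV HD mapping and semantic segmentation performance on public datasets such as nuScenes, and improves detection accuracy under varying weather conditions and sensor interference.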

Link: https://arxiv.org/abs/2503.02578
Authors: Xinying Hong, Siyu Li, Kang Zeng, Hao Shi, Bomin Peng, Kailun Yang, Zhiyong Li
Affiliations: College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China; School of Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, Changsha 410082, China; State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University, Hangzhou 310027, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: The source code will be publicly available at this https URL

Abstract:Bird’s Eye View (BEV) perception technology is crucial for autonomous driving, as it generates top-down 2D maps for environment perception, navigation, and decision-making. Nevertheless, the majority of current BEV map generation studies focusing on visual map generation lack depth-aware reasoning capabilities. They exhibit limited efficacy in managing occlusions and handling complex environments, with a notable decline in perceptual performance under adverse weather conditions or low-light scenarios. Therefore, this paper proposes TS-CGNet, which leverages Temporal-Spatial fusion with Centerline-Guided diffusion. This visual framework, grounded in prior knowledge, is designed for integration into any existing network for building BEV maps. Specifically, this framework is decoupled into three parts: Local mapping system involves the initial generation of semantic maps using purely visual information; The Temporal-Spatial Aligner Module (TSAM) integrates historical information into mapping generation by applying transformation matrices; The Centerline-Guided Diffusion Model (CGDM) is a prediction module based on the diffusion model. CGDM incorporates centerline information through spatial-attention mechanisms to enhance semantic segmentation reconstruction. We construct BEV semantic segmentation maps by our methods on the public nuScenes and the robustness benchmarks under various corruptions. Our method improves 1.90%, 1.73%, and 2.87% for perceived ranges of 60x30m, 120x60m, and 240x60m in the task of BEV HD mapping. TS-CGNet attains an improvement of 1.92% for perceived ranges of 100x100m in the task of BEV semantic mapping. Moreover, TS-CGNet achieves an average improvement of 2.92% in detection accuracy under varying weather conditions and sensor interferences in the perception range of 240x60m. The source code will be publicly available at this https URL.

[CV-32] SPG: Improving Motion Diffusion by Smooth Perturbation Guidance

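[Quick read]: This paper aims to improve the output quality of human motion diffusion models without additional training. The key is a test-time guidance method, Smooth Perturbation Guidance (SPG). SPG builds a weak model by temporally smoothing the motion in the denoising steps and uses it for negative guidance. Compared with model-agnostic methods borrowed from image generation, SPG mitigates the out-of-distribution issues that arise when perturbing motion diffusion models while keeping the nature of the motion structure intact. The method is simple, needs no additional training, and consistently improves motion fidelity across different model architectures and tasks.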

Link: https://arxiv.org/abs/2503.02577
Authors: Boseong Jeon
Affiliations: Samsung Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents a test-time guidance method to improve the output quality of the human motion diffusion models without requiring additional training. To have negative guidance, Smooth Perturbation Guidance (SPG) builds a weak model by temporally smoothing the motion in the denoising steps. Compared to model-agnostic methods originating from the image generation field, SPG effectively mitigates out-of-distribution issues when perturbing motion diffusion models. In SPG guidance, the nature of motion structure remains intact. This work conducts a comprehensive analysis across distinct model architectures and tasks. Despite its extremely simple implementation and no need for additional training requirements, SPG consistently enhances motion fidelity. Project page can be found at this https URL
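
The idea of guiding a denoiser away from a temporally smoothed "weak" prediction can be sketched as follows. This is only an illustration under stated assumptions: the moving-average filter, the choice to smooth the denoiser input, the `guidance_scale` default, and the `denoiser(x_t, t)` callable are all hypothetical, and the abstract does not specify how the paper implements these steps.

```python
import numpy as np

def temporal_smooth(motion: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Moving average along the time axis; motion has shape (frames, features).

    kernel_size is assumed to be odd so the output keeps the same length.
    """
    pad = kernel_size // 2
    padded = np.pad(motion, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(kernel_size) / kernel_size
    columns = [np.convolve(padded[:, j], kernel, mode="valid")
               for j in range(motion.shape[1])]
    return np.stack(columns, axis=1)

def spg_step(denoiser, x_t: np.ndarray, t: int,
             guidance_scale: float = 1.5, kernel_size: int = 5) -> np.ndarray:
    """One guided prediction: push the output away from the weak prediction."""
    strong = denoiser(x_t, t)                               # regular prediction
    weak = denoiser(temporal_smooth(x_t, kernel_size), t)   # assumed weak model
    # Negative guidance: extrapolate from the weak toward the strong output.
    return strong + guidance_scale * (strong - weak)
```

With `guidance_scale = 0` the function reduces to the unguided denoiser, which makes it easy to compare guided and unguided sampling with the same model.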

[CV-33] Tracking-Aware Deformation Field Estimation for Non-rigid 3D Reconstruction in Robotic Surgeries

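[Quick read]: This paper addresses a safety challenge in robotic laparoscopic surgery: even the slightest tissue deformation during instrument-tissue interaction must be perceived accurately, especially in 3D space. Existing methods rely on NeRF to render 2D videos from different viewpoints and remove occlusions, but most fail to robustly predict accurate 3D shapes and the associated deformation estimates. The paper proposes Tracking-Aware Deformation Field (TADF), a framework that reconstructs the 3D mesh and the 3D tissue deformation simultaneously. It first tracks key points of soft tissue with a foundation vision model to obtain an accurate 2D deformation field, then smoothly incorporates that 2D deformation field into a neural implicit reconstruction network to obtain tissue deformation in 3D space. Experiments on two public datasets show more accurate deformation estimation than other 3D neural reconstruction methods.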

Link: https://arxiv.org/abs/2503.02558
Authors: Zeqing Wang, Han Fang, Yihong Xu, Yutong Ban
Affiliations: UM-SJTU Joint Institute, Shanghai Jiao Tong University; Valeo.ai, Paris, France
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Minimally invasive procedures have been advanced rapidly by the robotic laparoscopic surgery. The latter greatly assists surgeons in sophisticated and precise operations with reduced invasiveness. Nevertheless, it is still safety critical to be aware of even the least tissue deformation during instrument-tissue interactions, especially in 3D space. To address this, recent works rely on NeRF to render 2D videos from different perspectives and eliminate occlusions. However, most of the methods fail to predict the accurate 3D shapes and associated deformation estimates robustly. Differently, we propose Tracking-Aware Deformation Field (TADF), a novel framework which reconstructs the 3D mesh along with the 3D tissue deformation simultaneously. It first tracks the key points of soft tissue by a foundation vision model, providing an accurate 2D deformation field. Then, the 2D deformation field is smoothly incorporated with a neural implicit reconstruction network to obtain tissue deformation in the 3D space. Finally, we experimentally demonstrate that the proposed method provides more accurate deformation estimation compared with other 3D neural reconstruction methods in two public datasets.

[CV-34] Federated nnU-Net for Privacy-Preserving Medical Image Segmentation

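[Quick read]: This paper addresses the privacy problems of training the nnU-Net framework centrally for medical image segmentation, such as leakage of sensitive patient information and violation of patient privacy. It proposes FednnU-Net, a federated learning extension of nnU-Net, and introduces two federated learning methods, Federated Fingerprint Extraction (FFE) and Asymmetric Federated Averaging (AsymFedAvg), for training in a decentralized setting. These methods keep patient data local, and experiments on breast, cardiac, and fetal segmentation using 6 datasets from 18 institutions show consistent performance. To promote research on and deployment of decentralized training in privacy-constrained institutions, the authors release the plug-and-play framework as open source.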

Link: https://arxiv.org/abs/2503.02549
Authors: Grzegorz Skorupko, Fotios Avgoustidis, Carlos Martín-Isla, Lidia Garrucho, Dimitri A. Kessler, Esmeralda Ruiz Pujadas, Oliver Díaz, Maciej Bobowicz, Katarzyna Gwoździewicz, Xavier Bargalló, Paulius Jaruševičius, Kaisar Kushibar, Karim Lekadir
Affiliations: Artificial Intelligence in Medicine Laboratory (BCN-AIM), Departament de Matemàtiques i Informàtica, Universitat de Barcelona; Medical University of Gdańsk (GUMed); Hospital Clínic de Barcelona (HCB); Lithuanian University of Health Sciences; Institució Catalana de Recerca i Estudis Avançats (ICREA)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: In review

Abstract:The nnU-Net framework has played a crucial role in medical image segmentation and has become the gold standard in multitudes of applications targeting different diseases, organs, and modalities. However, so far it has been used primarily in a centralized approach where the data collected from hospitals are stored in one center and used to train the nnU-Net. This centralized approach has various limitations, such as leakage of sensitive patient information and violation of patient privacy. Federated learning is one of the approaches to train a segmentation model in a decentralized manner that helps preserve patient privacy. In this paper, we propose FednnU-Net, a federated learning extension of nnU-Net. We introduce two novel federated learning methods to the nnU-Net framework - Federated Fingerprint Extraction (FFE) and Asymmetric Federated Averaging (AsymFedAvg) - and experimentally show their consistent performance for breast, cardiac and fetal segmentation using 6 datasets representing samples from 18 institutions. Additionally, to further promote research and deployment of decentralized training in privacy constrained institutions, we make our plug-n-play framework public. The source-code is available at this https URL .
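
For readers unfamiliar with federated averaging, the sketch below shows the standard size-weighted FedAvg aggregation step that such methods build on. It is not the paper's AsymFedAvg, whose details are not given in the abstract; the dictionary layout and parameter names are assumptions for illustration.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Plain FedAvg: average client parameters, weighted by local dataset size.

    client_weights: list of dicts mapping parameter name -> np.ndarray
    client_sizes:   list of local training-set sizes, one per client
    """
    total = float(sum(client_sizes))
    averaged = {}
    for name in client_weights[0]:
        # Each client contributes in proportion to how much data it holds.
        averaged[name] = sum(
            weights[name] * (size / total)
            for weights, size in zip(client_weights, client_sizes)
        )
    return averaged
```

Only the parameter arrays leave each site in this scheme; the images and labels used to compute them stay local.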

[CV-35] PVTree: Realistic and Controllable Palm Vein Generation for Recognition Tasks

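[Quick read]: This paper addresses the shortage of training samples for deep-learning-based palm vein recognition, caused by high data collection costs and privacy protection constraints, and proposes a palm vein generation framework named PVTree. Existing methods often produce unrealistic palm vein patterns or struggle to control identity and style attributes. PVTree rests on two ideas: first, palm vein identity is defined by a complex and authentic 3D palm vascular tree built with an improved Constrained Constructive Optimization (CCO) algorithm; second, the same 3D vascular tree is projected into 2D images from different views and converted into realistic images with a generative model, which satisfies both identity consistency and intra-class diversity. Experiments on several public datasets show that PVTree outperforms existing methods and, to the authors' knowledge, is the first case in which a recognition model trained on synthetic palm vein data exceeds one trained on real data.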

Link: https://arxiv.org/abs/2503.02547
Authors: Sheng Shang, Chenglong Zhao, Ruixin Zhang, Jianlong Jin, Jingyun Zhang, Rizen Guo, Shouhong Ding, Yunsheng Wu, Yang Zhao, Wei Jia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Palm vein recognition is an emerging biometric technology that offers enhanced security and privacy. However, acquiring sufficient palm vein data for training deep learning-based recognition models is challenging due to the high costs of data collection and privacy protection constraints. This has led to a growing interest in generating pseudo-palm vein data using generative models. Existing methods, however, often produce unrealistic palm vein patterns or struggle with controlling identity and style attributes. To address these issues, we propose a novel palm vein generation framework named PVTree. First, the palm vein identity is defined by a complex and authentic 3D palm vascular tree, created using an improved Constrained Constructive Optimization (CCO) algorithm. Second, palm vein patterns of the same identity are generated by projecting the same 3D vascular tree into 2D images from different views and converting them into realistic images using a generative model. As a result, PVTree satisfies the need for both identity consistency and intra-class diversity. Extensive experiments conducted on several publicly available datasets demonstrate that our proposed palm vein generation method surpasses existing methods and achieves a higher TAR@FAR=1e-4 under the 1:1 Open-set protocol. To the best of our knowledge, this is the first time that the performance of a recognition model trained on synthetic palm vein data exceeds that of the recognition model trained on real data, which indicates that palm vein image generation research has a promising future.
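
The TAR@FAR=1e-4 figure mentioned in the abstract is a verification metric: the true accept rate at the threshold where at most one in ten thousand impostor pairs is accepted. A minimal sketch of one common way to compute it is shown below; the function name and the tie-handling convention (strictly greater than the threshold) are assumptions, and evaluation protocols differ in such details.

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far: float = 1e-4) -> float:
    """True accept rate at a fixed false accept rate (higher scores = more similar)."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.sort(np.asarray(impostor_scores, dtype=float))[::-1]  # descending
    # Number of impostor pairs that may be wrongly accepted at this FAR.
    allowed = int(np.floor(far * impostor.size + 1e-9))
    # Accept only scores strictly above the (allowed + 1)-th highest impostor score.
    threshold = impostor[allowed] if allowed < impostor.size else -np.inf
    return float(np.mean(genuine > threshold))
```

Note that a FAR of 1e-4 is only meaningful with at least ten thousand impostor comparisons; with fewer, the threshold falls back to the highest impostor score.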

[CV-36] RectifiedHR: Enable Efficient High-Resolution Image Generation via Energy Rectification

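[Quick read]: This paper addresses the marked drop in diffusion model performance when generating images at resolutions higher than those used in training. Existing high-resolution methods are either inefficient or hindered by complex operations. The paper proposes RectifiedHR, an efficient, training-free solution for high-resolution image generation. Its key component is a noise refresh strategy, which in theory needs only a few lines of code to unlock the model's high-resolution generation ability and improve efficiency. The paper also reports the first observation of an energy decay phenomenon that may cause image blurriness during high-resolution generation, and proposes an Energy Rectification strategy that improves generation by adjusting the hyperparameters of classifier-free guidance. RectifiedHR is entirely training-free and simple to implement, and extensive comparisons show better effectiveness and efficiency than baseline methods.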

Link: https://arxiv.org/abs/2503.02537
Authors: Zhen Yang, Guibao Shen, Liang Hou, Mushui Liu, Luozhou Wang, Xin Tao, Pengfei Wan, Di Zhang, Ying-Cong Chen
Affiliations: HKUST(GZ); Kuaishou Technology; HKUST; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL

Abstract:Diffusion models have achieved remarkable advances in various image generation tasks. However, their performance notably declines when generating images at resolutions higher than those used during the training period. Despite the existence of numerous methods for producing high-resolution images, they either suffer from inefficiency or are hindered by complex operations. In this paper, we propose RectifiedHR, an efficient and straightforward solution for training-free high-resolution image generation. Specifically, we introduce the noise refresh strategy, which theoretically only requires a few lines of code to unlock the model’s high-resolution generation ability and improve efficiency. Additionally, we first observe the phenomenon of energy decay that may cause image blurriness during the high-resolution image generation process. To address this issue, we propose an Energy Rectification strategy, where modifying the hyperparameters of the classifier-free guidance effectively improves the generation performance. Our method is entirely training-free and boasts a simple implementation logic. Through extensive comparisons with numerous baseline methods, our RectifiedHR demonstrates superior effectiveness and efficiency.
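
The hyperparameter that the Energy Rectification strategy adjusts belongs to classifier-free guidance, whose standard combination rule is sketched below. This shows only the generic guidance formula; how RectifiedHR chooses or schedules the scale is not stated in the abstract, and the function and argument names are assumptions.

```python
import numpy as np

def classifier_free_guidance(eps_uncond: np.ndarray,
                             eps_cond: np.ndarray,
                             scale: float) -> np.ndarray:
    """Standard classifier-free guidance on two noise predictions.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates further in the conditional direction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Since the guided prediction is a linear extrapolation, raising the scale increases the magnitude of the conditional update.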

[CV-37] TeTRA-VPR: A Ternary Transformer Approach for Compact Visual Place Recognition

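[Quick read]: This paper addresses the tension between efficiency and accuracy for Visual Place Recognition (VPR) on resource-constrained platforms. Vision Transformer (ViT) methods are accurate, but their large models often exceed the memory and compute budgets of platforms such as drones and mobile robots. The paper proposes TeTRA, a ternary transformer approach that progressively quantizes the ViT backbone to 2-bit precision and binarizes its final embedding layer, substantially reducing model size and inference latency. A carefully designed progressive distillation strategy preserves the representational power of a full-precision teacher, so TeTRA retains or even surpasses the accuracy of uncompressed convolutional models while using fewer resources. Experiments show up to 69% lower memory consumption than efficient baselines and 35% lower inference latency, with no loss or a slight gain in recall@1, making TeTRA suitable for high-accuracy VPR on power- and memory-limited robotic platforms.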

Link: https://arxiv.org/abs/2503.02511
Authors: Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan
Affiliations: University of Southampton; Queensland University of Technology, Brisbane, QLD 4000, Australia; University of Essex
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual Place Recognition (VPR) localizes a query image by matching it against a database of geo-tagged reference images, making it essential for navigation and mapping in robotics. Although Vision Transformer (ViT) solutions deliver high accuracy, their large models often exceed the memory and compute budgets of resource-constrained platforms such as drones and mobile robots. To address this issue, we propose TeTRA, a ternary transformer approach that progressively quantizes the ViT backbone to 2-bit precision and binarizes its final embedding layer, offering substantial reductions in model size and latency. A carefully designed progressive distillation strategy preserves the representational power of a full-precision teacher, allowing TeTRA to retain or even surpass the accuracy of uncompressed convolutional counterparts, despite using fewer resources. Experiments on standard VPR benchmarks demonstrate that TeTRA reduces memory consumption by up to 69% compared to efficient baselines, while lowering inference latency by 35%, with either no loss or a slight improvement in recall@1. These gains enable high-accuracy VPR on power-constrained, memory-limited robotic platforms, making TeTRA an appealing solution for real-world deployment.
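
To make "ternary weights" and "binarized embeddings" concrete, here is a minimal sketch. It uses the threshold heuristic from the Ternary Weight Networks literature (0.7 times the mean absolute weight) as a stand-in; TeTRA's progressive quantization and distillation schedule are not described in the abstract, so the threshold rule and all function names here are assumptions.

```python
import numpy as np

def ternarize(weights: np.ndarray, threshold_ratio: float = 0.7):
    """Map weights to {-1, 0, +1} plus one floating-point scale."""
    delta = threshold_ratio * np.mean(np.abs(weights))
    ternary = np.where(weights > delta, 1,
                       np.where(weights < -delta, -1, 0)).astype(np.int8)
    mask = ternary != 0
    # The scale is the mean magnitude of the weights that were kept.
    scale = float(np.abs(weights[mask]).mean()) if mask.any() else 0.0
    return ternary, scale

def binarize(embedding: np.ndarray) -> np.ndarray:
    """Sign-binarize an embedding into a 0/1 code."""
    return (embedding > 0).astype(np.uint8)

def hamming_distance(code_a: np.ndarray, code_b: np.ndarray) -> int:
    """Number of differing bits; a cheap stand-in for descriptor distance."""
    return int(np.count_nonzero(code_a != code_b))
```

Three weight states fit in 2 bits, which is where the 2-bit backbone figure comes from, and binary descriptors let database matching use bit comparisons instead of floating-point dot products.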

[CV-38] Remote Sensing Image Classification Using Convolutional Neural Network (CNN) and Transfer Learning Techniques

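[Quick read]: This paper addresses aerial image classification across four scene types: transmission towers, forests, farmland, and mountains. The approach combines transfer learning with Convolutional Neural Network (CNN) architectures, in particular MobileNetV2. Features are extracted from the input images and classified with Softmax. Training uses the Adam optimizer with a learning rate of 0.001 and a batch size of 90, for ten epochs on a dataset of 10,400 images. The results show that MobileNetV2 strikes a good balance between precision and efficiency: transfer learning from it reaches an overall accuracy of 96% with a test loss of 0.119, compared with 90% for VGG16 and 87% for the authors' own CNN.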

Link: https://arxiv.org/abs/2503.02510
Authors: Mustafa Majeed Abd Zaid, Ahmed Abed Mohammed, Putra Sumari
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Journal of Computer Science, Volume 21, No. 3, 2025, pages 635-645

Abstract:This study investigates the classification of aerial images depicting transmission towers, forests, farmland, and mountains. To complete the classification job, features are extracted from input photos using a Convolutional Neural Network (CNN) architecture. Then, the images are classified using Softmax. To test the model, we ran it for ten epochs using a batch size of 90, the Adam optimizer, and a learning rate of 0.001. Both training and assessment are conducted using a dataset that blends self-collected pictures from Google satellite imagery with the MLRNet dataset. The comprehensive dataset comprises 10,400 images. Our study shows that transfer learning models and MobileNetV2 in particular, work well for landscape categorization. These models are good options for practical use because they strike a good mix between precision and efficiency; our approach achieves results with an overall accuracy of 87% on the built CNN model. Furthermore, we reach even higher accuracies by utilizing the pretrained VGG16 and MobileNetV2 models as a starting point for transfer learning. Specifically, VGG16 achieves an accuracy of 90% and a test loss of 0.298, while MobileNetV2 outperforms both models with an accuracy of 96% and a test loss of 0.119; the results demonstrate the effectiveness of employing transfer learning with MobileNetV2 for classifying transmission towers, forests, farmland, and mountains.
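
The final classification step described in the abstract, a Softmax over four scene classes, can be illustrated in a few lines of NumPy. The class order and the `predict` helper are assumptions for this example; the feature extractor (CNN, VGG16, or MobileNetV2) that would produce the logits is left out.

```python
import numpy as np

# Assumed class order, for illustration only.
CLASSES = ["transmission tower", "forest", "farmland", "mountain"]

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predict(logits) -> list:
    """Return the most probable class name for each row of logits."""
    probabilities = softmax(np.asarray(logits, dtype=float))
    return [CLASSES[i] for i in np.argmax(probabilities, axis=-1)]
```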

[CV-39] QC: When Quantization Meets Cache in Efficient Image Generation

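[Quick read]: This paper addresses the challenges of combining quantization and cache mechanisms for efficient Diffusion Transformers (DiTs): (i) cache operations significantly reduce the sample efficacy of calibration datasets in Post-Training Quantization (PTQ); and (ii) combining the two mechanisms introduces more severe exposure bias, which amplifies error accumulation during image generation. The proposed solution has two parts: Temporal-Aware Parallel Clustering (TAP), which dynamically improves sample selection for PTQ calibration across diffusion steps, and a Variance Compensation (VC) strategy, which mitigates exposure bias by generating an adaptive correction factor. Experiments show a 12.7x speedup of DiTs while preserving competitive generation capability.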

Link: https://arxiv.org/abs/2503.02508
Authors: Xin Ding, Xin Li, Haotong Qin, Zhibo Chen
Affiliations: University of Science and Technology of China; ETH Zürich, Switzerland
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 11 pages

Abstract:Quantization and cache mechanisms are typically applied individually for efficient Diffusion Transformers (DiTs), each demonstrating notable potential for acceleration. However, the promoting effect of combining the two mechanisms on efficient generation remains under-explored. Through empirical investigation, we find that the combination of quantization and cache mechanisms for DiT is not straightforward, and two key challenges lead to severe catastrophic performance degradation: (i) the sample efficacy of calibration datasets in post-training quantization (PTQ) is significantly eliminated by cache operation; (ii) the combination of the above mechanisms introduces more severe exposure bias within sampling distribution, resulting in amplified error accumulation in the image generation process. In this work, we take advantage of these two acceleration mechanisms and propose a hybrid acceleration method by tackling the above challenges, aiming to further improve the efficiency of DiTs while maintaining excellent generation capability. Concretely, a temporal-aware parallel clustering (TAP) is designed to dynamically improve the sample selection efficacy for the calibration within PTQ for different diffusion steps. A variance compensation (VC) strategy is derived to correct the sampling distribution. It mitigates exposure bias through an adaptive correction factor generation. Extensive experiments have shown that our method has accelerated DiTs by 12.7x while preserving competitive generation capability. The code will be available at this https URL.

[CV-40] ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment

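[Quick read]: This paper asks how users can specify goals in a semantically clear, spatially sensitive, and intuitive way to guide agent interactions in embodied environments. It proposes a cross-view goal alignment framework that lets users specify target objects with segmentation masks from their own camera views rather than from the agent's observations. A key finding is that behavior cloning alone fails to align the agent's behavior with human intent when the human and agent camera views differ significantly. To address this, the paper adds two auxiliary objectives, a cross-view consistency loss and a target visibility loss, which explicitly enhance the agent's spatial reasoning. The resulting agent, ROCKET-2, is trained in Minecraft, improves inference efficiency by 3x to 6x, and for the first time interprets goals directly from human camera views, paving the way for better human-agent interaction.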

Link: https://arxiv.org/abs/2503.02505
Authors: Shaofei Cai, Zhancun Mu, Anji Liu, Yitao Liang
Affiliations: Peking University; University of California, Los Angeles
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We aim to develop a goal specification method that is semantically clear, spatially sensitive, and intuitive for human users to guide agent interactions in embodied environments. Specifically, we propose a novel cross-view goal alignment framework that allows users to specify target objects using segmentation masks from their own camera views rather than the agent’s observations. We highlight that behavior cloning alone fails to align the agent’s behavior with human intent when the human and agent camera views differ significantly. To address this, we introduce two auxiliary objectives: cross-view consistency loss and target visibility loss, which explicitly enhance the agent’s spatial reasoning ability. According to this, we develop ROCKET-2, a state-of-the-art agent trained in Minecraft, achieving an improvement in the efficiency of inference 3x to 6x. We show ROCKET-2 can directly interpret goals from human camera views for the first time, paving the way for better human-agent interaction.

[CV-41] Deepfake Detection via Knowledge Injection

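[Quick read]: This paper addresses the limited generalization of existing deepfake detection methods to unseen real and fake data, which stems largely from overlooking the role of real-data knowledge. It proposes Knowledge Injection based deepfake Detection (KID), a multi-task-learning-based knowledge injection framework that plugs easily into existing ViT (Vision Transformer) backbone models. Specifically, a knowledge injection module learns and injects the necessary knowledge to model the distributions of real and fake data more accurately; a coarse-grained forgery localization branch learns forgery locations in a multi-task manner to enrich the forgery knowledge available to the injection module; and two layer-wise losses, a suppression loss and a contrast loss, emphasize real-data knowledge in the injection module to balance the proportions of real and fake knowledge. Experiments show that KID is compatible with ViT backbones of different scales, achieves state-of-the-art generalization performance, and speeds up training convergence.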

Link: https://arxiv.org/abs/2503.02503
Authors: Tonghui Li, Yuanfang Guo, Zeming Liu, Heqi Peng, Yunhong Wang
Affiliations: School of Computer Science and Engineering, Beihang University, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Deepfake detection technologies have become vital because current generative AI models can generate realistic deepfakes, which may be used for malicious purposes. Existing deepfake detection methods either rely on developing classification methods to better fit the distributions of the training data, or exploit forgery synthesis mechanisms to learn a more comprehensive forgery distribution. Unfortunately, these methods tend to overlook the essential role of real data knowledge, which limits their generalization ability in processing the unseen real and fake data. To tackle these challenges, in this paper, we propose a simple and novel approach, named Knowledge Injection based deepfake Detection (KID), by constructing a multi-task learning based knowledge injection framework, which can be easily plugged into existing ViT-based backbone models, including foundation models. Specifically, a knowledge injection module is proposed to learn and inject necessary knowledge into the backbone model, to achieve a more accurate modeling of the distributions of real and fake data. A coarse-grained forgery localization branch is constructed to learn the forgery locations in a multi-task learning manner, to enrich the learned forgery knowledge for the knowledge injection module. Two layer-wise suppression and contrast losses are proposed to emphasize the knowledge of real data in the knowledge injection module, to further balance the portions of the real and fake knowledge. Extensive experiments have demonstrated that our KID possesses excellent compatibility with different scales of ViT-based backbone models, and achieves state-of-the-art generalization performance while enhancing the training convergence speed.

[CV-42] Joint Out-of-Distribution Filtering and Data Discovery Active Learning

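[Quick read]: This paper addresses a gap in real-world active learning (AL): combining AL with both out-of-distribution (OOD) data handling and category discovery. Prior work treats OOD data or category discovery separately, never both at once. The paper proposes Joint Out-of-distribution filtering and data Discovery Active learning (Joda), which filters out OOD data before selecting candidates for labeling and builds a common feature space that aligns known and novel categories while separating OOD samples. Unlike previous methods, Joda needs no auxiliary models and no training access to the unlabeled pool for filtering or selection, and it achieves the highest accuracy with the best balance between class discovery and OOD filtering.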

Link: https://arxiv.org/abs/2503.02491
Authors: Sebastian Schmidt, Leonard Schenk, Leo Schwinn, Stephan Günnemann
Affiliations: Technical University of Munich; BMW Group; SPRIND
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: As the data demand for deep learning models increases, active learning (AL) becomes essential to strategically select samples for labeling, which maximizes data efficiency and reduces training costs. Real-world scenarios necessitate the consideration of incomplete data knowledge within AL. Prior works address handling out-of-distribution (OOD) data, while another research direction has focused on category discovery. However, a combined analysis of real-world considerations combining AL with out-of-distribution data and category discovery remains unexplored. To address this gap, we propose Joint Out-of-distribution filtering and data Discovery Active learning (Joda), which addresses both challenges simultaneously by filtering out OOD data before selecting candidates for labeling. In contrast to previous methods, we deeply entangle the training procedure with filter and selection to construct a common feature space that aligns known and novel categories while separating OOD samples. Unlike previous works, Joda is highly efficient and completely omits auxiliary models and training access to the unlabeled pool for filtering or selection. In extensive experiments on 18 configurations and 3 metrics, Joda consistently achieves the highest accuracy with the best class discovery to OOD filtering balance compared to state-of-the-art competitor approaches.

[CV-43] Deep Robust Reversible Watermarking

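[Quick read]: This paper addresses the practical limitations of existing Robust Reversible Watermarking (RRW) methods, which suffer from complex designs, high computational costs, and poor robustness. It proposes Deep Robust Reversible Watermarking (DRRW), a deep-learning-based scheme. DRRW uses an Integer Invertible Watermark Network (iIWN) to map integer data distributions invertibly, and an encoder-noise layer-decoder framework trained end to end for adaptive robustness. An overflow penalty loss reduces pixel overflow, and an adaptive weight adjustment strategy improves training stability and performance. Experiments show that DRRW improves robustness while substantially reducing the complexity of embedding, extraction, and recovery and shortening the auxiliary bitstream; reversible embedding succeeds on 16,762 PASCAL VOC 2012 images, and DRRW exceeds irreversible robust watermarking in robustness and quality while remaining reversible.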

Link: https://arxiv.org/abs/2503.02490
Authors: Jiale Chen, Wei Wang, Chongyang Shi, Li Dong, Yuanman Li, Xiping Hu
Affiliations: School of Computer Science, Beijing Institute of Technology; Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Artificial Intelligence Research Institute, Shenzhen MSU-BIT University; College of Electronics and Information Engineering, Shenzhen University; Department of Computer Science, Faculty of Electrical Engineering and Computer Science, Ningbo University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robust Reversible Watermarking (RRW) enables perfect recovery of cover images and watermarks in lossless channels while ensuring robust watermark extraction in lossy channels. Existing RRW methods, mostly non-deep learning-based, face complex designs, high computational costs, and poor robustness, limiting their practical use. This paper proposes Deep Robust Reversible Watermarking (DRRW), a deep learning-based RRW scheme. DRRW uses an Integer Invertible Watermark Network (iIWN) to map integer data distributions invertibly, addressing conventional RRW limitations. Unlike traditional RRW, which needs distortion-specific designs, DRRW employs an encoder-noise layer-decoder framework for adaptive robustness via end-to-end training. In inference, cover image and watermark map to an overflowed stego image and latent variables, compressed by arithmetic coding into a bitstream embedded via reversible data hiding for lossless recovery. We introduce an overflow penalty loss to reduce pixel overflow, shortening the auxiliary bitstream while enhancing robustness and stego image quality. An adaptive weight adjustment strategy avoids manual watermark loss weighting, improving training stability and performance. Experiments show DRRW outperforms state-of-the-art RRW methods, boosting robustness and cutting embedding, extraction, and recovery complexities by 55.14×, 5.95×, and 3.57×, respectively. The auxiliary bitstream shrinks by 43.86×, with reversible embedding succeeding on 16,762 PASCAL VOC 2012 images, advancing practical RRW. DRRW exceeds irreversible robust watermarking in robustness and quality while maintaining reversibility.
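
The abstract does not give the exact form of the overflow penalty loss. A minimal sketch of one plausible form, assuming 8-bit images and a hinge-style penalty on values outside [0, 255] (both the range and the function name are assumptions made here), might look like this:

```python
def overflow_penalty(pixels, low=0.0, high=255.0):
    """Mean distance by which pixel values fall outside [low, high].

    This hinge-style form is an assumption for illustration; it is zero
    whenever every pixel already lies inside the valid range.
    """
    total = 0.0
    for p in pixels:
        if p > high:
            total += p - high   # overflow above the valid range
        elif p < low:
            total += low - p    # underflow below the valid range
    return total / len(pixels)


stego = [12.0, 260.0, 255.0, -5.0]
penalty = overflow_penalty(stego)
```

Fewer out-of-range pixels mean less overflow information to record, which is consistent with the abstract's claim that the penalty shortens the auxiliary bitstream.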

[CV-44] ERetinex: Event Camera Meets Retinex Theory for Low-Light Image Enhancement ICRA2025

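Quick read: This paper addresses low-light image enhancement, that is, restoring under-exposed images of dark scenes in which traditional frame-based cameras fail to capture structure and color because of exposure-time limits. The key is to combine the high dynamic range of event cameras with Retinex theory in a new Retinex-based low-light restoration framework named ERetinex. It has two core components: an approach that uses high-temporal-resolution event data together with conventional image information to estimate scene illumination more accurately, and a fusion strategy that combines the high-dynamic-range event data with the color information of conventional images to produce more detailed images that preserve visual integrity. Experiments show a PSNR gain of 1.0613 dB over the previous state of the art with 84.28% fewer FLOPs.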

Link: https://arxiv.org/abs/2503.02484
Authors: Xuejian Guo, Zhiqiang Tian, Yuehang Wang, Siqi Li, Yu Jiang, Shaoyi Du, Yue Gao
Affiliations: School of Software Engineering, Xi’an Jiaotong University; College of Computer Science and Technology, Jilin University; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; BNRist, THUIBCS, KLISS, BLBCI, School of Software, Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICRA 2025

Abstract:Low-light image enhancement aims to restore the under-exposure image captured in dark scenarios. Under such scenarios, traditional frame-based cameras may fail to capture the structure and color information due to the exposure time limitation. Event cameras are bio-inspired vision sensors that respond to pixel-wise brightness changes asynchronously. Event cameras’ high dynamic range is pivotal for visual perception in extreme low-light scenarios, surpassing traditional cameras and enabling applications in challenging dark environments. In this paper, inspired by the success of the retinex theory for traditional frame-based low-light image restoration, we introduce the first methods that combine the retinex theory with event cameras and propose a novel retinex-based low-light image restoration framework named ERetinex. Among our contributions, the first is developing a new approach that leverages the high temporal resolution data from event cameras with traditional image information to estimate scene illumination accurately. This method outperforms traditional image-only techniques, especially in low-light environments, by providing more precise lighting information. Additionally, we propose an effective fusion strategy that combines the high dynamic range data from event cameras with the color information of traditional images to enhance image quality. Through this fusion, we can generate clearer and more detail-rich images, maintaining the integrity of visual information even under extreme lighting conditions. The experimental results indicate that our proposed method outperforms state-of-the-art (SOTA) methods, achieving a gain of 1.0613 dB in PSNR while reducing FLOPS by 84.28%.
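
Retinex theory models an image as the product of reflectance and illumination. The sketch below shows that decomposition on a single RGB pixel, using the per-pixel channel maximum as a crude illumination estimate. That estimate is a simple image-only stand-in chosen for illustration; ERetinex estimates illumination from event data together with the image, which is not reproduced here.

```python
def retinex_decompose(pixel, eps=1e-6):
    """Split an RGB pixel (values in [0, 1]) into reflectance and illumination.

    Illumination is approximated by the largest channel value, an assumption
    made for this sketch only.
    """
    illumination = max(max(pixel), eps)          # eps avoids division by zero
    reflectance = tuple(c / illumination for c in pixel)
    return reflectance, illumination


def relight(pixel, target_illumination):
    """Keep the reflectance and swap in a brighter illumination value."""
    reflectance, _ = retinex_decompose(pixel)
    return tuple(min(1.0, r * target_illumination) for r in reflectance)


dark = (0.25, 0.125, 0.0625)
reflectance, illumination = retinex_decompose(dark)
bright = relight(dark, target_illumination=0.8)
```

The channel ratios, and so the color, are unchanged after relighting; only the overall brightness changes.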

[CV-45] A Novel Streamline-based diffusion MRI Tractography Registration Method with Probabilistic Keypoint Detection

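Quick read: This paper addresses streamline-based registration of diffusion MRI (dMRI) tractography, in particular the fact that existing methods overlook point connectivity patterns within each streamline when identifying anatomical correspondences across datasets. The key is an unsupervised deep learning approach that models tractography as point clouds, exploits graph connectivity along streamlines, and introduces a streamline keypoint detection method framed as a probabilistic classification task. The detected keypoints give anatomically consistent correspondences across unstructured streamline sets, which are then used for streamline-based spatial registration.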

Link: https://arxiv.org/abs/2503.02481
Authors: Junyi Wang, Mubai Du, Ye Wu, Yijie Li, William M. Wells III, Lauren J. O’Donnell, Fan Zhang
Affiliations: University of Electronic Science and Technology of China; Nanjing University of Science and Technology; Brigham and Women’s Hospital; Harvard Medical School
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Registration of diffusion MRI tractography is an essential step for analyzing group similarities and variations in the brain’s white matter (WM). Streamline-based registration approaches can leverage the 3D geometric information of fiber pathways to enable spatial alignment after registration. Existing methods usually rely on the optimization of the spatial distances to identify the optimal transformation. However, such methods overlook point connectivity patterns within the streamline itself, limiting their ability to identify anatomical correspondences across tractography datasets. In this work, we propose a novel unsupervised approach using deep learning to perform streamline-based dMRI tractography registration. The overall idea is to identify corresponding keypoint pairs across subjects for spatial alignment of tractography datasets. We model tractography as point clouds to leverage the graph connectivity along streamlines. We propose a novel keypoint detection method for streamlines, framed as a probabilistic classification task to identify anatomically consistent correspondences across unstructured streamline sets. In the experiments, we compare several existing methods and show highly effective and efficient tractography registration performance.

[CV-46] BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA

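Quick read: This paper addresses the weak multimodal semantic alignment of existing biomedical visual question answering (VQA) models on complex tasks. Current methods perform multimodal interaction only at the model level inside large language models (LLMs), which leaves semantic consistency insufficient. The paper proposes BioD2C, a framework whose key idea is a dual-level semantic consistency constraint at both the model level and the feature level. At the feature level, an image-text fusion mechanism lets textual features interact with visual features to produce text-conditioned visual features; a text-queue-based cross-modal soft semantic loss then further aligns image semantics with question semantics. The paper also builds a new dataset, BioVGQ, which mitigates biases in prior datasets by filtering manually altered images and aligning question-answer pairs with their multimodal context, and trains the model on it. Experiments show that BioD2C reaches state-of-the-art performance on multiple downstream datasets.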

Link: https://arxiv.org/abs/2503.02476
Authors: Zhengyang Ji, Shang Gao, Li Liu, Yifan Jia, Yutao Yue
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Biomedical visual question answering (VQA) has been widely studied and has demonstrated significant application value and potential in fields such as assistive medical diagnosis. Despite their success, current biomedical VQA models perform multimodal information interaction only at the model level within large language models (LLMs), leading to suboptimal multimodal semantic alignment when dealing with complex tasks. To address this issue, we propose BioD2C: a novel Dual-level Semantic Consistency Constraint Framework for Biomedical VQA, which achieves dual-level semantic interaction alignment at both the model and feature levels, enabling the model to adaptively learn visual features based on the question. Specifically, we firstly integrate textual features into visual features via an image-text fusion mechanism as feature-level semantic interaction, obtaining visual features conditioned on the given text; and then introduce a text-queue-based cross-modal soft semantic loss function to further align the image semantics with the question semantics. Specifically, in this work, we establish a new dataset, BioVGQ, to address inherent biases in prior datasets by filtering manually-altered images and aligning question-answer pairs with multimodal context, and train our model on this dataset. Extensive experimental results demonstrate that BioD2C achieves state-of-the-art (SOTA) performance across multiple downstream datasets, showcasing its robustness, generalizability, and potential to advance biomedical VQA research.

[CV-47] Exploring Token-Level Augmentation in Vision Transformer for Semi-Supervised Semantic Segmentation

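Quick read: This paper addresses the limitations of applying existing CNN-based semi-supervised semantic segmentation methods directly to Vision Transformers (ViT). Its key contribution is TokenMix, a data augmentation technique designed for ViTs that mixes images at the token level, which matches the global attention mechanism and strengthens learning of contextual information among patches. The method also combines image augmentation and feature augmentation to increase augmentation diversity, and strengthens consistency regularization with a dual-branch framework in which each branch applies both image and feature augmentation to the input image. Experiments on Pascal VOC 2012, Cityscapes, and COCO show that the method outperforms state-of-the-art algorithms, with the largest accuracy gains when fine annotations are limited.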

Link: https://arxiv.org/abs/2503.02459
Authors: Dengke Zhang, Quan Tang, Fagui Liu, C. L. Philip Chen, Haiqing Mei
Affiliations: School of Computer Science and Engineering, South China University of Technology; Department of New Network, Pengcheng Laboratory; State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semi-supervised semantic segmentation has witnessed remarkable advancements in recent years. However, existing algorithms are based on convolutional neural networks and directly applying them to Vision Transformers poses certain limitations due to conceptual disparities. To this end, we propose TokenMix, a data augmentation technique specifically designed for semi-supervised semantic segmentation with Vision Transformers. TokenMix aligns well with the global attention mechanism by mixing images at the token level, enhancing learning capability for contextual information among image patches. We further incorporate image augmentation and feature augmentation to promote the diversity of augmentation. Moreover, to enhance consistency regularization, we propose a dual-branch framework where each branch applies both image augmentation and feature augmentation to the input image. We conduct extensive experiments across multiple benchmark datasets, including Pascal VOC 2012, Cityscapes, and COCO. Results suggest that the proposed method outperforms state-of-the-art algorithms with notably observed accuracy improvement, especially under the circumstance of limited fine annotations.
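
Mixing at the token level can be pictured as swapping a random subset of patch tokens between two images. The sketch below is a minimal illustration under assumptions made here: tokens are stand-in strings, the subset is sampled uniformly, and `mix_ratio` is a hypothetical parameter. The paper's actual sampling strategy is not described in the abstract.

```python
import random


def token_mix(tokens_a, tokens_b, mix_ratio, seed=0):
    """Replace a random subset of A's tokens with B's tokens at the same positions.

    Returns the mixed token list and the binary mask (1 = token taken from B).
    Uniform sampling of positions is an assumption of this sketch.
    """
    assert len(tokens_a) == len(tokens_b)
    rng = random.Random(seed)
    n = len(tokens_a)
    num_from_b = round(n * mix_ratio)
    from_b = set(rng.sample(range(n), num_from_b))
    mask = [1 if i in from_b else 0 for i in range(n)]
    mixed = [tokens_b[i] if mask[i] else tokens_a[i] for i in range(n)]
    return mixed, mask


# A 4x4 patch grid flattened to 16 tokens per image.
a = [f"a{i}" for i in range(16)]
b = [f"b{i}" for i in range(16)]
mixed, mask = token_mix(a, b, mix_ratio=0.25)
```

In a segmentation setting the same mask would presumably also be applied to the corresponding labels or pseudo-labels so that supervision stays aligned with the mixed input.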

[CV-48] 2DGS-Avatar: Animatable High-fidelity Clothed Avatar via 2D Gaussian Splatting

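Quick read: This paper addresses real-time rendering of high-fidelity, animatable avatars from monocular video. Neural Radiance Fields (NeRF) have improved rendering quality but run slowly because volumetric rendering is inefficient; methods based on 3D Gaussian Splatting (3DGS) train quickly and render in real time but still show artifacts caused by inaccurate geometry. The key solution is 2DGS-Avatar, a method based on 2D Gaussian Splatting (2DGS) for modeling animatable clothed avatars with high fidelity and fast training. Given monocular RGB video, it produces a pose-driven avatar that renders in real time, keeping the fast training and rendering of Gaussian splatting while capturing detailed, dynamic, photo-realistic appearance. Experiments on AvatarRex and THuman4.0 show strong qualitative and quantitative results.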

Link: https://arxiv.org/abs/2503.02452
Authors: Qipeng Yan, Mingyang Sun, Lihua Zhang
Affiliations: Academy for Engineering and Technology, Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: ICVRV 2024

Abstract:Real-time rendering of high-fidelity and animatable avatars from monocular videos remains a challenging problem in computer vision and graphics. Over the past few years, the Neural Radiance Field (NeRF) has made significant progress in rendering quality but behaves poorly in run-time performance due to the low efficiency of volumetric rendering. Recently, methods based on 3D Gaussian Splatting (3DGS) have shown great potential in fast training and real-time rendering. However, they still suffer from artifacts caused by inaccurate geometry. To address these problems, we propose 2DGS-Avatar, a novel approach based on 2D Gaussian Splatting (2DGS) for modeling animatable clothed avatars with high-fidelity and fast training performance. Given monocular RGB videos as input, our method generates an avatar that can be driven by poses and rendered in real-time. Compared to 3DGS-based methods, our 2DGS-Avatar retains the advantages of fast training and rendering while also capturing detailed, dynamic, and photo-realistic appearances. We conduct abundant experiments on popular datasets such as AvatarRex and THuman4.0, demonstrating impressive performance in both qualitative and quantitative metrics.

[CV-49] Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection CVPR2025

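Quick read: This paper addresses anomaly detection (AD) for industrial inspection. Existing methods typically compare a test image with normal references from the training set, but variations in appearance and position make those references hard to align with the test image, which limits accuracy. The paper observes that most anomalies are local, so even an anomalous image contains useful normal information, and that this information may align better with the anomalies because both come from the same image. It therefore proposes INP-Former, which extracts Intrinsic Normal Prototypes (INPs) directly from the test image instead of relying on external normality from the training set. The key components are an INP Extractor that represents INPs as linear combinations of normal tokens and an INP Coherence Loss that makes the INPs represent the normality of the test image faithfully. The INPs guide an INP-Guided Decoder to reconstruct only normal tokens, and the reconstruction error serves as the anomaly score. A Soft Mining Loss prioritizes hard-to-optimize samples during training. INP-Former reaches state-of-the-art performance on single-class, multi-class, and few-shot AD on MVTec-AD, VisA, and Real-IAD, and also shows some zero-shot AD capability.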

Link: https://arxiv.org/abs/2503.02424
Authors: Wei Luo, Yunkang Cao, Haiming Yao, Xiaotian Zhang, Jianan Lou, Yuqi Cheng, Weiming Shen, Wenyong Yu
Affiliations: Department of Precision Instrument, Tsinghua University; School of Mechanical Science & Engineering, Huazhong University of Science & Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:Anomaly detection (AD) is essential for industrial inspection, yet existing methods typically rely on “comparing” test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even within anomalous images, valuable normal information remains. We argue that this information is useful and may be more aligned with the anomalies since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. Specifically, we introduce the INP Extractor, which linearly combines normal tokens to represent INPs. We further propose an INP Coherence Loss to ensure INPs can faithfully represent normality for the testing image. These INPs then guide the INP-Guided Decoder to reconstruct only normal tokens, with reconstruction errors serving as anomaly scores. Additionally, we propose a Soft Mining Loss to prioritize hard-to-optimize samples during training. INP-Former achieves state-of-the-art performance in single-class, multi-class, and few-shot AD tasks across MVTec-AD, VisA, and Real-IAD, positioning it as a versatile and universal solution for AD. Remarkably, INP-Former also demonstrates some zero-shot AD capability. Code is available at: this https URL.
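
The idea of building prototypes from the test image's own tokens can be sketched in a few lines. Two simplifications are assumed here and are not the paper's method: the combination weights are fixed by hand rather than learned by an extractor, and the anomaly score is the distance to the nearest prototype rather than a decoder's reconstruction error.

```python
import math


def combine(tokens, weights):
    """Linear combination of token vectors (weights are assumed to sum to 1)."""
    dim = len(tokens[0])
    return [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]


def anomaly_scores(tokens, prototypes):
    """Score each token by its Euclidean distance to the nearest prototype.

    Nearest-prototype distance is a stand-in for reconstruction error.
    """
    return [min(math.dist(t, p) for p in prototypes) for t in tokens]


# Four "normal" tokens forming two clusters, plus one outlier token.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [5.0, 5.0]]
proto_1 = combine(tokens, [0.5, 0.5, 0.0, 0.0, 0.0])
proto_2 = combine(tokens, [0.0, 0.0, 0.5, 0.5, 0.0])
scores = anomaly_scores(tokens, [proto_1, proto_2])
```

Tokens that the prototypes explain well score near zero, while the outlier token receives the largest score.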

[CV-50] Exploring Model Quantization in GenAI-based Image Inpainting and Detection of Arable Plants

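Quick read: This paper addresses the degraded real-world performance of deep learning-based weed control systems caused by limited training-data diversity and constrained on-board computation. It proposes augmenting the training data progressively with Stable Diffusion-based inpainting, in 10% increments up to an additional 200% of the original data, to increase both the number and the diversity of samples. The key is to pair generative inpainting for data augmentation with quantization (FP16 and INT8) of both the inpainting model and the detection models, trading off inference speed against accuracy. Deployment on the Jetson Orin Nano supports the practicality of the framework in resource-constrained environments, and the framework improves detection accuracy and computational efficiency in intelligent weed management systems.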

Link: https://arxiv.org/abs/2503.02420
Authors: Sourav Modak, Ahmet Oğuz Saltık, Anthony Stein
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning-based weed control systems often suffer from limited training data diversity and constrained on-board computation, impacting their real-world performance. To overcome these challenges, we propose a framework that leverages Stable Diffusion-based inpainting to augment training data progressively in 10% increments – up to an additional 200%, thus enhancing both the volume and diversity of samples. Our approach is evaluated on two state-of-the-art object detection models, YOLO11(l) and RT-DETR(l), using the mAP50 metric to assess detection performance. We explore quantization strategies (FP16 and INT8) for both the generative inpainting and detection models to strike a balance between inference speed and accuracy. Deployment of the downstream models on the Jetson Orin Nano demonstrates the practical viability of our framework in resource-constrained environments, ultimately improving detection accuracy and computational efficiency in intelligent weed management systems.
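
To make the INT8 trade-off between speed and accuracy concrete, the sketch below shows generic symmetric per-tensor quantization. It is not the paper's deployment pipeline; real toolchains add calibration and per-channel scales, and the function names here are illustrative.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integers back to approximate float values."""
    return [x * scale for x in q]


weights = [-1.27, 0.0, 0.5, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per value is bounded by half a quantization step (`scale / 2`), which is the accuracy given up in exchange for smaller, faster integer arithmetic.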

[CV-51] InfoGNN: End-to-end deep learning on mesh via graph neural networks

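Quick read: This paper addresses the difficulty of applying deep learning to irregular mesh data, in particular handling complex surface information and unordered structure with graph neural networks (GNNs). Traditional methods depend on restrictive conditions such as the manifold assumption and so cannot exploit mesh models fully. The key solution is InfoGNN, an end-to-end framework that treats the mesh as a graph and introduces InfoConv and InfoMP modules, which use point positions, static features such as face normals and dihedral angles, and dynamic global feature information. InfoGNN also simplifies the network design for efficiency. Experiments show strong performance on mesh classification and segmentation.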

Link: https://arxiv.org/abs/2503.02414
Authors: Ling Gao, Zhenyu Shu, Shiqing Xin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:3D models are widely used in various industries, and mesh data has become an indispensable part of 3D modeling because of its unique advantages. Mesh data can provide an intuitive and practical expression of rich 3D information. However, its disordered, irregular data structure and complex surface information make it challenging to apply with deep learning models directly. Traditional mesh data processing methods often rely on mesh models with many limitations, such as manifold, which restrict their application scopes in reality and do not fully utilize the advantages of mesh models. This paper proposes a novel end-to-end framework for addressing the challenges associated with deep learning in mesh models centered around graph neural networks (GNN) and is titled InfoGNN. InfoGNN treats the mesh model as a graph, which enables it to handle irregular mesh data efficiently. Moreover, we propose InfoConv and InfoMP modules, which utilize the position information of the points and fully use the static information such as face normals, dihedral angles, and dynamic global feature information to fully use all kinds of data. In addition, InfoGNN is an end-to-end framework, and we simplify the network design to make it more efficient, paving the way for efficient deep learning of complex 3D models. We conducted experiments on several publicly available datasets, and the results show that InfoGNN achieves excellent performance in mesh classification and segmentation tasks.
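
The static features named in the abstract, face normals and dihedral angles, are standard mesh geometry. The sketch below computes them for two triangles that share an edge. It uses the angle between adjacent face normals as the dihedral feature, which is one common convention and an assumption here; how InfoConv and InfoMP consume these features is not shown.

```python
import math


def sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def face_normal(v0, v1, v2):
    """Unit normal of a triangle with counter-clockwise vertex order."""
    n = cross(sub(v1, v0), sub(v2, v0))
    length = math.sqrt(dot(n, n))
    return [c / length for c in n]


def normal_angle(n1, n2):
    """Angle in radians between two unit face normals."""
    return math.acos(max(-1.0, min(1.0, dot(n1, n2))))


# Two triangles sharing the edge from (0, 0, 0) to (1, 0, 0).
normal_a = face_normal([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
normal_b = face_normal([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
angle = normal_angle(normal_a, normal_b)
```

The first triangle lies in the xy-plane and the second in the xz-plane, so the two faces meet at a right angle.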

[CV-52] VisAgent: Narrative-Preserving Story Visualization Framework ICASSP2025

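Quick read: This paper addresses the fact that existing story visualization methods focus on visual contextual coherence and neglect the narrative essence of a story, so generated images often miss the intended meaning and nuances, which limits practical use. It proposes VisAgent, a training-free multi-agent framework. The key is an agentic workflow that accounts for story distillation, semantic consistency, and contextual coherence: multiple specialized agents collaborate to refine layered prompts and to integrate the generated elements into the final image, so that the pivotal scenes of a story are understood and visualized.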

Link: https://arxiv.org/abs/2503.02399
Authors: Seungkwon Kim, GyuTae Park, Sangyeon Kim, Seung-Hun Nam
Affiliations: NAVER WEBTOON AI; Seoul National University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICASSP 2025. Equal contribution from first two authors

Abstract:Story visualization is the transformation of narrative elements into image sequences. While existing research has primarily focused on visual contextual coherence, the deeper narrative essence of stories often remains overlooked. This limitation hinders the practical application of these approaches, as generated images frequently fail to capture the intended meaning and nuances of the narrative fully. To address these challenges, we propose VisAgent, a training-free multi-agent framework designed to comprehend and visualize pivotal scenes within a given story. By considering story distillation, semantic consistency, and contextual coherence, VisAgent employs an agentic workflow. In this workflow, multiple specialized agents collaborate to: (i) refine layered prompts based on the narrative structure and (ii) seamlessly integrate generated elements, including refined prompts, scene elements, and subject placement, into the final image. The empirically validated effectiveness confirms the framework’s suitability for practical story visualization applications.

[CV-53] BHViT: Binarized Hybrid Vision Transformer CVPR2025

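Quick read: This paper addresses the real-time and energy-efficiency challenges of deploying Vision Transformers (ViTs) on edge devices. Because CNN and Transformer architectures differ structurally, applying binary CNN strategies directly to ViTs causes a significant performance drop. The paper proposes BHViT, a binarization-friendly hybrid ViT architecture and its fully binarized model, designed around three observations. First, local information interaction and coarse-to-fine hierarchical feature aggregation reduce the redundant computation caused by excessive tokens. Second, a new module based on shift operations improves the binary multilayer perceptron (MLP) module without significant extra computation. Third, an attention matrix binarization method based on quantization decomposition evaluates token importance in the binarized attention matrix. Finally, a regularization loss addresses the inadequate optimization caused by the incompatibility between weight oscillation in the binary layers and the Adam optimizer. Experiments show state-of-the-art performance among binary ViT methods.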

Link: https://arxiv.org/abs/2503.02394
Authors: Tian Gao, Yu Zhang, Zhiyuan Zhang, Huajun Liu, Kaijie Yin, Chengzhong Xu, Hui Kong
Affiliations: Nanjing University of Science and Technology; University of Macau; Shanghai Jiaotong University; Singapore Management University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNN), offering a potential solution to the deployment challenges faced by Vision Transformers (ViTs) on edge devices. However, due to the structural differences between CNN and Transformer architectures, simply applying binary CNN strategies to the ViT models will lead to a significant performance drop. To tackle this challenge, we propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations. Initially, BHViT utilizes the local information interaction and hierarchical feature aggregation technique from coarse to fine levels to address redundant computations stemming from excessive tokens. Then, a novel module based on shift operations is proposed to enhance the performance of the binary Multilayer Perceptron (MLP) module without significantly increasing computational overhead. In addition, an innovative attention matrix binarization method based on quantization decomposition is proposed to evaluate the token’s importance in the binarized attention matrix. Finally, we propose a regularization loss to address the inadequate optimization caused by the incompatibility between the weight oscillation in the binary layers and the Adam Optimizer. Extensive experimental results demonstrate that our proposed algorithm achieves SOTA performance among binary ViT methods.
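
For readers unfamiliar with model binarization, the sketch below shows the generic weight binarization used in binary CNNs: each weight is replaced by its sign, and a single scaling factor (the mean absolute weight) restores the magnitude. This is background for the abstract, not BHViT's specific modules, and the function names are illustrative.

```python
def binarize(weights):
    """Approximate weights by alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, alpha


def binary_dot(signs, alpha, inputs):
    """Dot product with binarized weights: only additions and one multiply."""
    return alpha * sum(s * x for s, x in zip(signs, inputs))


w = [0.5, -1.5, 1.0, -1.0]
signs, alpha = binarize(w)
y = binary_dot(signs, alpha, [1.0, 2.0, 3.0, 4.0])
```

The full-precision dot product for the same inputs is -3.5, so the binarized result of -2.0 illustrates the approximation error that binarization methods try to keep small.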

[CV-54] Vision-Language Model IP Protection via Prompt-based Learning

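Quick read: This paper addresses intellectual property (IP) protection for vision-language models (VLMs), in particular how to restrict a model fine-tuned for specific target domains from being deployed on unauthorized data domains. Current IP protection methods usually rely on the visual backbone alone, which may lack semantic richness. The paper proposes IP-CLIP, a lightweight IP protection strategy for CLIP based on prompt learning. The key is to use the frozen CLIP visual backbone to extract image style and content information and incorporate it into the learning of an IP prompt, forming a barrier that prevents the unauthorized transfer of features from authorized to unauthorized domains. A style-enhancement branch builds feature banks for authorized and unauthorized domains and integrates self-enhanced and cross-domain features to further strengthen this blocking. Three new metrics are designed to better balance the performance degradation on authorized and unauthorized domains. Experiments in various scenarios show promising potential for VLM IP protection.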

Link: https://arxiv.org/abs/2503.02393
Authors: Lianyu Wang, Meng Wang, Huazhu Fu, Daoqiang Zhang
Affiliations: The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education; Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore; Institute of High Performance Computing, Agency for Science, Technology and Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models (VLMs) like CLIP (Contrastive Language-Image Pre-Training) have seen remarkable success in visual recognition, highlighting the increasing need to safeguard the intellectual property (IP) of well-trained models. Effective IP protection extends beyond ensuring authorized usage; it also necessitates restricting model deployment to authorized data domains, particularly when the model is fine-tuned for specific target domains. However, current IP protection methods often rely solely on the visual backbone, which may lack sufficient semantic richness. To bridge this gap, we introduce IP-CLIP, a lightweight IP protection strategy tailored to CLIP, employing a prompt-based learning approach. By leveraging the frozen visual backbone of CLIP, we extract both image style and content information, incorporating them into the learning of the IP prompt. This strategy acts as a robust barrier, effectively preventing the unauthorized transfer of features from authorized domains to unauthorized ones. Additionally, we propose a style-enhancement branch that constructs feature banks for both authorized and unauthorized domains. This branch integrates self-enhanced and cross-domain features, further strengthening IP-CLIP’s capability to block features from unauthorized domains. Finally, we present three new metrics designed to better balance the performance degradation of authorized and unauthorized domains. Comprehensive experiments in various scenarios demonstrate its promising potential for application in IP protection tasks for VLMs.

[CV-55] PIDLoc: Cross-View Pose Optimization Network Inspired by PID Controllers CVPR-25

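Quick read: This paper addresses the difficulty of accurate vehicle localization with GNSS-based methods in challenging environments such as urban canyons. Existing cross-view pose optimization methods rely only on cross-view features at a given pose and neglect fine-grained local context for precision and global context for robustness. The paper proposes PIDLoc, inspired by the proportional-integral-derivative (PID) controller. Using RGB images and LiDAR, PIDLoc models cross-view feature relationships with PID branches and estimates the vehicle pose from them with a spatially aware pose estimator (SPE). The branches use feature differences for local context (P), aggregated feature differences for global context (I), and gradients of feature differences for precise pose adjustment (D), which improves accuracy under large initial pose errors. The SPE captures spatial relationships within the PID-branch features for consistent localization. On the KITTI dataset, PIDLoc reaches state-of-the-art cross-view pose estimation and reduces position error by 37.8% compared with the previous best method.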

Link: https://arxiv.org/abs/2503.02388
Authors: Wooju Lee, Juhye Park, Dasol Hong, Changki Sung, Youngwoo Seo, Dongwan Kang, Hyun Myung
Affiliations: Urban Robotics Lab, School of Electrical Engineering, KAIST; Hanwha Aerospace
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR-25

Abstract:Accurate localization is essential for autonomous driving, but GNSS-based methods struggle in challenging environments such as urban canyons. Cross-view pose optimization offers an effective solution by directly estimating vehicle pose using satellite-view images. However, existing methods primarily rely on cross-view features at a given pose, neglecting fine-grained contexts for precision and global contexts for robustness against large initial pose errors. To overcome these limitations, we propose PIDLoc, a novel cross-view pose optimization approach inspired by the proportional-integral-derivative (PID) controller. Using RGB images and LiDAR, the PIDLoc comprises the PID branches to model cross-view feature relationships and the spatially aware pose estimator (SPE) to estimate the pose from these relationships. The PID branches leverage feature differences for local context (P), aggregated feature differences for global context (I), and gradients of feature differences for precise pose adjustment (D) to enhance localization accuracy under large initial pose errors. Integrated with the PID branches, the SPE captures spatial relationships within the PID-branch features for consistent localization. Experimental results demonstrate that the PIDLoc achieves state-of-the-art performance in cross-view pose estimation for the KITTI dataset, reducing position error by 37.8% compared with the previous state-of-the-art.
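
The P, I, and D roles in the abstract borrow from the classical PID controller, which is sketched below on a scalar toy problem. This is the textbook controller only; PIDLoc applies the analogy to cross-view feature differences inside a network, which is not reproduced here. The gains and the toy "position" are arbitrary illustrative values.

```python
class PIDController:
    """Textbook discrete PID controller on a scalar error signal."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt=1.0):
        # I: accumulate the error over time (global, long-term context).
        self.integral += error * dt
        # D: rate of change of the error (used for fine adjustment).
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        # P: react to the current error (local context).
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid = PIDController(kp=0.5, ki=0.1, kd=0.05)
position, target = 0.0, 10.0
for _ in range(50):
    position += pid.step(target - position)
```

Starting far from the target mirrors the large initial pose error discussed in the abstract: the three terms together drive the estimate to the target over repeated updates.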

[CV-56] Teaching Metric Distance to Autoregressive Multimodal Foundational Models

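Quick read: This paper addresses the fact that, as large language models expand to domains such as mathematics, multimodal understanding, and embodied agents, output tokens increasingly reflect metric relationships rather than purely linguistic meaning. It proposes DIST2Loss, a distance-aware framework for training autoregressive discrete models. The key is to transform continuous exponential family distributions derived from an inherent distance metric into discrete, categorical optimization targets compatible with the model architecture, so that the model learns and preserves meaningful distance relationships during token generation while remaining compatible with existing architectures. Evaluations show consistent gains in visual grounding, robotic manipulation, generative reward modeling, and image generation with vector-quantized features, most pronounced when training data is limited.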

Link: https://arxiv.org/abs/2503.02379
Authors: Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu
Affiliations: Yonsei University; LG AI Research
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models’ architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are pronounced in cases of limited training data, highlighting DIST2Loss’s effectiveness in resource-constrained settings.
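
One way to picture a distance-aware categorical target is to replace the one-hot label with a softmax over negative distances to the correct token. The sketch below assumes tokens that represent ordered integers, a squared-difference distance (which gives a discretized Gaussian, one member of the exponential family), and a hypothetical `temperature` parameter. The paper's exact construction may differ.

```python
import math


def distance_aware_target(target_idx, num_tokens, temperature=1.0):
    """Soft target with p(i) proportional to exp(-(i - target)^2 / temperature)."""
    logits = [-((i - target_idx) ** 2) / temperature for i in range(num_tokens)]
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


target = distance_aware_target(target_idx=3, num_tokens=8)

# Two wrong predictions: one puts most mass next to the target, one far away.
near_miss = [0.01] * 8
near_miss[4] = 0.93
far_miss = [0.01] * 8
far_miss[7] = 0.93

loss_near = cross_entropy(target, near_miss)
loss_far = cross_entropy(target, far_miss)
```

With a one-hot target both wrong predictions would incur the same loss; with the distance-aware target the near miss is penalized less than the far miss, which is the behavior the abstract describes.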

[CV-57] mmDEAR: mmWave Point Cloud Density Enhancement for Accurate Human Body Reconstruction

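Quick read: This paper addresses the limited accuracy of human body reconstruction caused by the sparsity of millimeter-wave (mmWave) radar point clouds. It proposes a two-stage deep learning framework. A mmWave point cloud enhancement module densifies the raw point cloud using temporal features and a multi-stage completion network, and a 2D-3D fusion module then extracts 2D and 3D motion features to refine SMPL parameters. Image-based supervision is used only during training; inference relies solely on sparse point clouds to preserve privacy. Experiments show that the method outperforms existing approaches and that the enhanced point clouds further improve existing models when integrated into them.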

Link: https://arxiv.org/abs/2503.02375
Authors: Jiarui Yang, Songpengcheng Xia, Zengyuan Lai, Lan Sun, Qi Wu, Wenxian Yu, Ling Pei
Affiliations: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Millimeter-wave (mmWave) radar offers robust sensing capabilities in diverse environments, making it a highly promising solution for human body reconstruction due to its privacy-friendly and non-intrusive nature. However, the significant sparsity of mmWave point clouds limits the estimation accuracy. To overcome this challenge, we propose a two-stage deep learning framework that enhances mmWave point clouds and improves human body reconstruction accuracy. Our method includes a mmWave point cloud enhancement module that densifies the raw data by leveraging temporal features and a multi-stage completion network, followed by a 2D-3D fusion module that extracts both 2D and 3D motion features to refine SMPL parameters. The mmWave point cloud enhancement module learns the detailed shape and posture information from 2D human masks in single-view images. However, image-based supervision is involved only during the training phase, and the inference relies solely on sparse point clouds to maintain privacy. Experiments on multiple datasets demonstrate that our approach outperforms state-of-the-art methods, with the enhanced point clouds further improving performance when integrated into existing models.

[CV-58] Label-Efficient LiDAR Panoptic Segmentation

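[Quick read]: This paper addresses the weak performance of learning-based LiDAR panoptic segmentation when very few labeled samples are available. Conventional methods depend on large annotated datasets to perform semantic and instance segmentation, and the complexity of high-dimensional LiDAR data makes this harder. The paper proposes Limited-Label LiDAR Panoptic Segmentation (L3PS), which needs only a minimal amount of labeled data. First, a label-efficient 2D network generates panoptic pseudo-labels from a small set of annotated images, and these are projected onto the point clouds. Second, a 3D refinement module exploits geometric properties through clustering, sequential scan accumulation, and ground point separation, raising pseudo-label quality by up to +10.6 Panoptic Quality (PQ) and +7.9 mean Intersection over Union (mIoU). The refined pseudo-labels can then train off-the-shelf LiDAR segmentation networks while greatly reducing the annotation burden.

Link: https://arxiv.org/abs/2503.02372
Authors: Ahmet Selim Çanakçı, Niclas Vödisch, Kürsat Petek, Wolfram Burgard, Abhinav Valada
Affiliations: Department of Computer Science, University of Freiburg; Department of Eng., University of Technology Nuremberg; German Research Foundation (DFG)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: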

Abstract:A main bottleneck of learning-based robotic scene understanding methods is the heavy reliance on extensive annotated training data, which often limits their generalization ability. In LiDAR panoptic segmentation, this challenge becomes even more pronounced due to the need to simultaneously address both semantic and instance segmentation from complex, high-dimensional point cloud data. In this work, we address the challenge of LiDAR panoptic segmentation with very few labeled samples by leveraging recent advances in label-efficient vision panoptic segmentation. To this end, we propose a novel method, Limited-Label LiDAR Panoptic Segmentation (L3PS), which requires only a minimal amount of labeled data. Our approach first utilizes a label-efficient 2D network to generate panoptic pseudo-labels from a small set of annotated images, which are subsequently projected onto point clouds. We then introduce a novel 3D refinement module that capitalizes on the geometric properties of point clouds. By incorporating clustering techniques, sequential scan accumulation, and ground point separation, this module significantly enhances the accuracy of the pseudo-labels, improving segmentation quality by up to +10.6 PQ and +7.9 mIoU. We demonstrate that these refined pseudo-labels can be used to effectively train off-the-shelf LiDAR segmentation networks. Through extensive experiments, we show that L3PS not only outperforms existing methods but also substantially reduces the annotation burden. We release the code of our work at this https URL.
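
For readers unfamiliar with the PQ metric quoted above, the standard Panoptic Quality definition can be sketched as below. The function name and input layout are assumptions made for illustration. The sketch covers a single class and expects the IoU of each overlapping predicted/ground-truth segment pair; pairs with IoU above 0.5 count as true positives.

```python
def panoptic_quality(pair_ious, num_pred, num_gt, iou_threshold=0.5):
    """PQ = sum of true-positive IoUs / (TP + 0.5 * FP + 0.5 * FN) for one class.

    pair_ious: IoU of each overlapping (predicted, ground-truth) segment pair.
    num_pred / num_gt: total number of predicted / ground-truth segments.
    """
    tp_ious = [iou for iou in pair_ious if iou > iou_threshold]
    tp = len(tp_ious)
    fp = num_pred - tp  # predicted segments without a match
    fn = num_gt - tp    # ground-truth segments without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(tp_ious) / denom if denom > 0 else 0.0

# Three predicted segments, four ground-truth segments, two valid matches.
pq = panoptic_quality([0.9, 0.6, 0.3], num_pred=3, num_gt=4)
```

Benchmark PQ values are usually averaged over classes and reported on a 0-100 scale, so a gain of +10.6 PQ corresponds to 0.106 in the units used here.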

[CV-59] BdSLW401: Transformer-Based Word-Level Bangla Sign Language Recognition Using Relative Quantization Encoding (RQE)

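[Quick read]: This paper addresses sign language recognition (SLR) for low-resource languages, specifically Bangla Sign Language (BdSL), where signer variability, viewpoint variation, and limited annotated data are obstacles. It presents BdSLW401, a large-scale, multi-view, word-level BdSL dataset, and introduces Relative Quantization Encoding (RQE), a structured embedding approach that anchors landmarks to physiological reference points and quantizes motion trajectories to improve transformer-based SLR. By reducing spatial variability, RQE improves attention allocation and lowers word error rate (WER) on several benchmarks, for example by 44.3% on WLASL100 and 21.0% on SignBD-200. An extended variant, RQE-SF, stabilizes shoulder landmarks at a small cost in lateral-view recognition. The approach also improves interpretability, with attention focusing on the main articulatory features and the more distinctive frames rather than on global pose changes.

Link: https://arxiv.org/abs/2503.02360
Authors: Husne Ara Rubaiyeat, Njayou Youssouf, Md Kamrul Hasan, Hasan Mahmud
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: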

Abstract:Sign language recognition (SLR) for low-resource languages like Bangla suffers from signer variability, viewpoint variations, and limited annotated datasets. In this paper, we present BdSLW401, a large-scale, multi-view, word-level Bangla Sign Language (BdSL) dataset with 401 signs and 102,176 video samples from 18 signers in front and lateral views. To improve transformer-based SLR, we introduce Relative Quantization Encoding (RQE), a structured embedding approach that anchors landmarks to physiological reference points and quantizes motion trajectories. RQE improves attention allocation by decreasing spatial variability, resulting in a 44.3% WER reduction on WLASL100, 21.0% on SignBD-200, and significant gains on BdSLW60 and SignBD-90. However, fixed quantization becomes insufficient on large-scale datasets (e.g., WLASL2000), indicating the need for adaptive encoding strategies. Further, RQE-SF, an extended variant that stabilizes shoulder landmarks, achieves improvements in pose consistency at the cost of small trade-offs in lateral view recognition. The attention graphs show that RQE improves model interpretability by focusing on the major articulatory features (fingers, wrists) and the more distinctive frames instead of global pose changes. By introducing BdSLW401 and demonstrating the effectiveness of RQE-enhanced structured embeddings, this work advances transformer-based SLR for low-resource languages and sets a benchmark for future research in this area.
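
The two steps named in the abstract, anchoring landmarks to a reference point and quantizing the result, can be illustrated with a small sketch. Everything specific here is an assumption for illustration rather than the paper's encoding: the function name, the choice of anchor and scale, the number of bins, and the clipping range are all hypothetical.

```python
def relative_quantize(landmarks, anchor, scale, num_bins=16, clip=2.0):
    """Express (x, y) landmarks relative to an anchor and map them to integer bins.

    Assumptions (not from the paper): `anchor` is a reference landmark such as a
    shoulder midpoint, `scale` is a body-size normalizer, and offsets are clipped
    to [-clip, clip] before uniform quantization into `num_bins` bins.
    """
    encoded = []
    for x, y in landmarks:
        bins = []
        for value, origin in ((x, anchor[0]), (y, anchor[1])):
            rel = (value - origin) / scale            # anchor-relative, scale-normalized
            rel = max(-clip, min(clip, rel))          # clip outliers
            idx = int((rel + clip) / (2 * clip) * num_bins)
            bins.append(min(idx, num_bins - 1))       # keep the upper edge in range
        encoded.append(tuple(bins))
    return encoded

# A wrist at the anchor position and a fingertip one scale unit right and up.
codes = relative_quantize([(0.5, 0.5), (0.75, 0.25)], anchor=(0.5, 0.5), scale=0.25)
```

Because the codes depend only on offsets from the anchor, a signer standing elsewhere in the frame produces the same codes, which is one way to reduce spatial variability. A fixed bin count like this is also the kind of choice the abstract reports as insufficient on larger vocabularies.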

[CV-60] Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content CVPR2025

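[Quick read]: This paper targets the two core aspects of evaluating text-to-vision content: visual quality and alignment. Objective evaluation models have progressed, but their performance depends heavily on large-scale, high-quality human annotations. The paper therefore builds Q-EVAL-100K, a dataset covering text-to-image and text-to-video models with 960K human annotations, collected as Mean Opinion Scores (MOS), for 100K instances (60K images and 40K videos) on visual quality and alignment. Using this dataset with context prompts, it proposes Q-Eval-Score, a unified model that evaluates both visual quality and alignment, with specific improvements for long-text prompt alignment. Experiments show strong results on both aspects and good generalization to other benchmarks, which supports the value of Q-EVAL-100K.

Link: https://arxiv.org/abs/2503.02357
Authors: Zicheng Zhang, Tengchuan Kou, Shushi Wang, Chunyi Li, Wei Sun, Wei Wang, Xiaoyu Li, Zongyu Wang, Xuezhi Cao, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; Meituan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025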

Abstract:Evaluating text-to-vision content hinges on two crucial aspects: visual quality and alignment. While significant progress has been made in developing objective models to assess these dimensions, the performance of such models heavily relies on the scale and quality of human annotations. According to scaling laws, increasing the number of human-labeled instances follows a predictable pattern that enhances the performance of evaluation models. Therefore, we introduce a comprehensive dataset designed to Evaluate Visual quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring the largest collection of human-labeled Mean Opinion Scores (MOS) for these two aspects. The Q-EVAL-100K dataset encompasses both text-to-image and text-to-video models, with 960K human annotations specifically focused on visual quality and alignment for 100K instances (60K images and 40K videos). Leveraging this dataset with context prompts, we propose Q-Eval-Score, a unified model capable of evaluating both visual quality and alignment, with special improvements for handling long-text prompt alignment. Experimental results indicate that the proposed Q-Eval-Score achieves superior performance on both visual quality and alignment, with strong generalization capabilities across other benchmarks. These findings highlight the significant value of the Q-EVAL-100K dataset. Data and code will be available at this https URL.

[CV-61] YOLO-PRO: Enhancing Instance-Specific Object Detection with Full-Channel Global Self-Attention

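[Quick read]: This paper targets two inherent limitations of object detection frameworks: conventional bottleneck structures, whose overemphasis on batch statistics reduces instance discriminability, and decoupled heads, which are computationally redundant. It proposes two modules: the Instance-Specific Bottleneck with full-channel global self-attention (ISB) and the Instance-Specific Asymmetric Decoupled Head (ISADH). ISB reconstructs feature maps to build an efficient full-channel global attention mechanism by fusing batch-statistical and instance-specific features. ISADH uses an asymmetric decoupled architecture that integrates multi-dimensional features hierarchically through dual-stream batch-instance representation fusion. On MS-COCO, YOLO-PRO with ISB and ISADH reports state-of-the-art performance across all computational scales: 1.0-1.6% AP over YOLOv8 at the N/S/M/L/X scales and 0.1-0.5% AP over YOLO11 in the M/L/X groups, while maintaining competitive computational efficiency.

Link: https://arxiv.org/abs/2503.02348
Authors: Lin Huang, Yujuan Tan, Weisheng Li, Shitai Shan, Linlin Shen, Jing Yu
Affiliations: cqu.edu.cn (Chongqing University); cqupt.edu.cn (Chongqing University of Posts and Telecommunications); inspur.com (Inspur Group); szu.edu.cn (Shenzhen University); cqu.edu.cn (Chongqing University)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: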

Abstract:This paper addresses the inherent limitations of conventional bottleneck structures (diminished instance discriminability due to overemphasis on batch statistics) and decoupled heads (computational redundancy) in object detection frameworks by proposing two novel modules: the Instance-Specific Bottleneck with full-channel global self-attention (ISB) and the Instance-Specific Asymmetric Decoupled Head (ISADH). The ISB module innovatively reconstructs feature maps to establish an efficient full-channel global attention mechanism through synergistic fusion of batch-statistical and instance-specific features. Complementing this, the ISADH module pioneers an asymmetric decoupled architecture enabling hierarchical multi-dimensional feature integration via dual-stream batch-instance representation fusion. Extensive experiments on the MS-COCO benchmark demonstrate that the coordinated deployment of ISB and ISADH in the YOLO-PRO framework achieves state-of-the-art performance across all computational scales. Specifically, YOLO-PRO surpasses YOLOv8 by 1.0-1.6% AP (N/S/M/L/X scales) and outperforms YOLO11 by 0.1-0.5% AP in critical M/L/X groups, while maintaining competitive computational efficiency. This work provides practical insights for developing high-precision detectors deployable on edge devices.

[CV-62] GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning

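[Quick read]: This paper addresses the lack of effective, explainable automated evaluation for video generation models. Existing automated metrics lack high-level semantic understanding and reasoning, so they cannot assess generated videos accurately. The paper proposes GRADEO. It curates GRADEO-Instruct, a dataset of 3.3k videos from more than 10 existing video generation models with multi-step reasoning assessments converted from 16k human annotations, and it trains the GRADEO model to grade generated videos with explainable scores through multi-step reasoning. Experiments show better alignment with human evaluation than existing methods and reveal weaknesses of current video generation models in complex scenarios.

Link: https://arxiv.org/abs/2503.02341
Authors: Zhun Mou, Bin Xia, Zhengchao Huang, Wenming Yang, Jiaya Jia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: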

Abstract:Recent great advances in video generation models have demonstrated their potential to produce high-quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted by 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and codes will be released soon.

[CV-63] BiasICL: In-Context Learning and Demographic Biases of Vision Language Models

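[Quick read]: This paper investigates how vision language models (VLMs) that use in-context learning (ICL) for medical diagnosis perform across demographic subgroups. It analyzes how the demographic composition of the demonstration examples affects performance on two medical imaging tasks: skin lesion malignancy prediction and pneumothorax detection from chest radiographs. The key finding is that ICL influences predictions through multiple mechanisms: it lets VLMs learn subgroup-specific disease base rates from the prompt, and it leads to predictions that perform differently across demographic groups even after controlling for those base rates. Based on this, the paper recommends best practices for prompting current VLMs, namely examining subgroup performance and matching label base rates to the target distribution both overall and within subgroups, and it points to next steps for improving theoretical understanding.

Link: https://arxiv.org/abs/2503.02334
Authors: Sonnet Xu, Joseph Janizek, Yixing Jiang, Roxana Daneshjou
Affiliations: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: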

Abstract:Vision language models (VLMs) show promise in medical diagnosis, but their performance across demographic subgroups when using in-context learning (ICL) remains poorly understood. We examine how the demographic composition of demonstration examples affects VLM performance in two medical imaging tasks: skin lesion malignancy prediction and pneumothorax detection from chest radiographs. Our analysis reveals that ICL influences model predictions through multiple mechanisms: (1) ICL allows VLMs to learn subgroup-specific disease base rates from prompts and (2) ICL leads VLMs to make predictions that perform differently across demographic groups, even after controlling for subgroup-specific disease base rates. Our empirical results inform best-practices for prompting current VLMs (specifically examining demographic subgroup performance, and matching base rates of labels to target distribution at a bulk level and within subgroups), while also suggesting next steps for improving our theoretical understanding of these models.
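
The recommendation to match label base rates overall and within subgroups presupposes that those rates are measured for a demonstration set. A minimal sketch of that check follows; the data layout (a list of `(subgroup, label)` pairs) and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def base_rates(demos):
    """Return the overall and per-subgroup fraction of positive labels.

    demos: list of (subgroup, label) pairs with label 1 = disease present.
    """
    counts = defaultdict(lambda: [0, 0])  # subgroup -> [positives, total]
    for subgroup, label in demos:
        counts[subgroup][0] += label
        counts[subgroup][1] += 1
    positives = sum(p for p, _ in counts.values())
    total = sum(n for _, n in counts.values())
    per_group = {g: p / n for g, (p, n) in counts.items()}
    return positives / total, per_group

# Hypothetical demonstration set: group B is shown with twice the positive rate of A.
demos = [("A", 1), ("A", 0), ("A", 0), ("A", 0),
         ("B", 1), ("B", 1), ("B", 0), ("B", 0)]
overall, per_group = base_rates(demos)
```

In this toy set the overall rate (0.375) hides a gap between subgroups (0.25 versus 0.5), which is the kind of prompt-level imbalance the abstract says a VLM can pick up.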

[CV-64] Exploring Simple Siamese Network for High-Resolution Video Quality Assessment ICASSP2025

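[Quick read]: This paper addresses the difficulty that the technical branch of existing two-branch video quality assessment (VQA) networks has in perceiving semantics in high-resolution videos. Although the technical and aesthetic perspectives are complementary, the technical branch is usually trained on local mini-patches sampled from videos, which limits its semantic understanding of high-resolution content; good results on low-resolution videos can hide this problem. The paper proposes SiamVQA, a Siamese network for high-resolution VQA. It shares weights between the technical and aesthetic branches to strengthen the semantic perception of the technical branch and so improve technical-quality representation learning, and it adds a dual cross-attention layer to fuse technical and aesthetic features. SiamVQA reaches state-of-the-art accuracy on high-resolution benchmarks and competitive results on lower-resolution benchmarks.

Link: https://arxiv.org/abs/2503.02330
Authors: Guotao Shen, Ziheng Yan, Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-Hee Hahm
Affiliations: Samsung Electronics (China) R&D Centre; Samsung Electronics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025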

Abstract:In the research of video quality assessment (VQA), the two-branch network has emerged as a promising solution. It decouples VQA with separate technical and aesthetic branches to measure the perception of low-level distortions and high-level semantics respectively. However, we argue that while technical and aesthetic perspectives are complementary, the technical perspective itself should be measured in a semantic-aware manner. We hypothesize that the existing technical branch struggles to perceive the semantics of high-resolution videos, as it is trained on local mini-patches sampled from videos. This issue can be hidden by apparently good results on low-resolution videos, but indeed becomes critical for high-resolution VQA. This work introduces SiamVQA, a simple but effective Siamese network for high-resolution VQA. SiamVQA shares weights between technical and aesthetic branches, enhancing the semantic perception ability of the technical branch to facilitate technical-quality representation learning. Furthermore, it integrates a dual cross-attention layer for fusing technical and aesthetic features. SiamVQA achieves state-of-the-art accuracy on high-resolution benchmarks, and competitive results on lower-resolution benchmarks. Code will be available at: this https URL

[CV-65] Unified Arbitrary-Time Video Frame Interpolation and Prediction ICASSP2025

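[Quick read]: This paper addresses video frame interpolation and prediction, two tasks traditionally handled with different architectures or with the same architecture but separately trained weights. In addition, while arbitrary-time interpolation has been studied extensively, the value of arbitrary-time prediction has been largely overlooked. The paper proposes uniVIP (unified arbitrary-time Video Interpolation and Prediction). It first extends an interpolation-only network to handle both arbitrary-time interpolation and prediction, with a special input channel that encodes the task (interpolation or prediction), and then shows how to train the unified model on common triplet-frame datasets. The unified model gives competitive interpolation results and outperforms the existing state of the art on prediction.

Link: https://arxiv.org/abs/2503.02316
Authors: Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-Hee Hahm
Affiliations: Samsung Electronics (China) R&D Centre; Samsung Electronics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025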

Abstract:Video frame interpolation and prediction aim to synthesize frames in-between and subsequent to existing frames, respectively. Despite being closely related, these two tasks are traditionally studied with different model architectures, or with the same architecture but individually trained weights. Furthermore, while arbitrary-time interpolation has been extensively studied, the value of arbitrary-time prediction has been largely overlooked. In this work, we present uniVIP (unified arbitrary-time Video Interpolation and Prediction). Technically, we first extend an interpolation-only network for arbitrary-time interpolation and prediction, with a special input channel for task (interpolation or prediction) encoding. Then, we show how to train a unified model on common triplet frames. Our uniVIP provides competitive results for video interpolation, and outperforms the existing state of the art for video prediction. Code will be available at: this https URL

[CV-66] A Token-level Text Image Foundation Model for Document Understanding

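[Quick read]: This paper addresses the fundamental prediction errors that general visual foundation models make on downstream text-image tasks (perception, understanding, and reasoning over images with small, dense text) because they lack semantically fine-grained supervision. It develops TokenOCR, described as the first token-level visual foundation model tailored to text-image tasks, and builds a high-quality data production pipeline that yields TokenIT, a dataset of 20 million images and 1.8 billion token-mask pairs, for pretraining. Building on TokenOCR's image-as-text capability, the paper also constructs TokenVL, a document-level multi-modal large language model for VQA-based document understanding. Experiments validate both TokenOCR and TokenVL.

Link: https://arxiv.org/abs/2503.02304
Authors: Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages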

Abstract:In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding and reasoning with images containing small and dense texts. To bridge this gap, we develop TokenOCR, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be available at this https URL.

[CV-67] On the Relationship Between Double Descent of CNNs and Shape/Texture Bias Under Learning Process

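[Quick read]: This paper starts from the observation that the traditional bias-variance trade-off cannot fully explain the mechanism of double descent, and it examines how shape bias and texture bias relate to test error while convolutional neural networks (CNNs) learn image recognition. It tests whether the shape/texture bias of a CNN during training changes in step with epoch-wise double descent of the test error, and it confirms the correlation between test error and bias values through quantitative evaluation. The experiments also show that double descent/ascent of shape/texture bias can appear even without label noise. These findings contribute to understanding the mechanism behind double descent and the learning process of CNNs in image recognition.

Link: https://arxiv.org/abs/2503.02302
Authors: Shun Iwase, Shuya Takahashi, Nakamasa Inoue, Rio Yokota, Ryo Nakamura, Hirokatsu Kataoka
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: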

Abstract:The double descent phenomenon, which deviates from the traditional bias-variance trade-off theory, attracts considerable research attention; however, the mechanism of its occurrence is not fully understood. On the other hand, in the study of convolutional neural networks (CNNs) for image recognition, methods are proposed to quantify the bias on shape features versus texture features in images, determining which features the CNN focuses on more. In this work, we hypothesize that there is a relationship between the shape/texture bias in the learning process of CNNs and epoch-wise double descent, and we conduct verification. As a result, we discover double descent/ascent of shape/texture bias synchronized with double descent of test error under conditions where epoch-wise double descent is observed. Quantitative evaluations confirm this correlation between the test errors and the bias values from the initial decrease to the full increase in test error. Interestingly, double descent/ascent of shape/texture bias is observed in some cases even in conditions without label noise, where double descent is thought not to occur. These experimental results are considered to contribute to the understanding of the mechanisms behind the double descent phenomenon and the learning process of CNNs in image recognition.
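
One widely used way to quantify shape versus texture bias is the cue-conflict ratio: on images whose shape and texture come from different classes, count how often the model picks the shape class among all trials where it picks either the shape or the texture class. The sketch below assumes that definition; the abstract does not state which measure the paper uses, and the function name and tuple layout are hypothetical.

```python
def shape_bias(predictions):
    """Fraction of shape decisions among shape-or-texture decisions.

    predictions: list of (predicted, shape_label, texture_label) for cue-conflict
    images, where shape_label != texture_label. Predictions matching neither
    label are ignored.
    """
    shape_hits = sum(1 for pred, shape, _ in predictions if pred == shape)
    texture_hits = sum(1 for pred, _, texture in predictions if pred == texture)
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Four cue-conflict trials: two shape decisions, one texture decision, one miss.
trials = [
    ("cat", "cat", "elephant"),
    ("elephant", "cat", "elephant"),
    ("dog", "cat", "elephant"),
    ("cat", "cat", "clock"),
]
bias = shape_bias(trials)
```

Evaluating a quantity like this at every epoch, next to the test error, gives the pair of curves whose synchronized rise and fall the abstract describes.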

[CV-68] Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup

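[Quick read]: This paper addresses the cost of annotating videos for action recognition and explores how combining visual and audio modalities can improve semi-supervised learning (SSL) when only a few labels are available. Previous work mostly used the visual modality alone and ignored the multi-modal nature of video. The paper proposes an audio-visual SSL framework and an audio source localization-guided mixup method that exploits the relations between the video and audio modalities to improve performance further.

Link: https://arxiv.org/abs/2503.02284
Authors: Seokun Kang, Taehwan Kim
Affiliations: Artificial Intelligence Graduate School, Ulsan National Institute of Science & Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: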

Abstract:Video action recognition is a challenging but important task for understanding what happens in a video. However, acquiring annotations for a video is costly, and semi-supervised learning (SSL) has been studied to improve performance even with a small amount of labeled data in the task. Prior studies for semi-supervised video action recognition have mostly focused on using a single modality, visuals, but video is multi-modal, so utilizing both visuals and audio would be desirable and improve performance further, which has not been explored well. Therefore, we propose audio-visual SSL for video action recognition, which uses both visual and audio together, even with only a few labeled samples, which is challenging. In addition, to maximize the information of audio and video, we propose a novel audio source localization-guided mixup method that considers inter-modal relations between video and audio modalities. In experiments on UCF-51, Kinetics-400, and VGGSound datasets, our model shows the superior performance of the proposed semi-supervised audio-visual action recognition framework and audio source localization-guided mixup.
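
As background for the mixup component, the sketch below shows plain mixup, which blends two samples and their labels with one scalar weight. It is not the localization-guided variant proposed in the paper, whose details the abstract does not give; the function name and the fixed weight are assumptions for illustration.

```python
def mixup(x_a, y_a, x_b, y_b, lam):
    """Standard mixup: convex combination of two feature vectors and their labels.

    lam is the mixing weight in [0, 1]; in practice it is often sampled from a
    Beta distribution, but it is fixed here to keep the example deterministic.
    """
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

# Two toy samples with one-hot labels, mixed 75% / 25%.
mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.75)
```

A guided variant would presumably replace the single scalar with weights that depend on where the sound source is located in the frame, but that step is left out because the abstract does not specify it.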

[CV-69] SSNet: Saliency Prior and State Space Model-based Network for Salient Object Detection in RGB-D Images

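[Quick read]: This paper addresses three main challenges in salient object detection (SOD) for RGB-D images: existing methods struggle to capture global dependencies across modalities, lack comprehensive saliency priors from RGB and depth data, and handle low-quality depth maps poorly. It proposes SSNet, a network based on saliency priors and a state space model (SSM). Its main component is an SSM-based multi-modal multi-scale decoder module that captures intra- and inter-modal global dependencies with linear complexity. Specifically, a cross-modal selective scan SSM (CM-S6) mechanism captures global dependencies between modalities, and a saliency enhancement module (SEM) combines three saliency priors with deep features to refine feature representation and improve the localization of salient objects. An adaptive contrast enhancement technique dynamically refines low-quality depth maps to make them more suitable for SOD. Experiments on seven benchmark datasets show that SSNet outperforms state-of-the-art methods.

Link: https://arxiv.org/abs/2503.02270
Authors: Gargi Panda, Soumitra Kundu, Saumik Bhattacharya, Aurobinda Routray
Affiliations: Department of EE, IIT Kharagpur, India; Rekhi Centre of Excellence for the Science of Happiness, IIT Kharagpur, India; Department of E&ECE, IIT Kharagpur, India
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: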

Abstract:Salient object detection (SOD) in RGB-D images is an essential task in computer vision, enabling applications in scene understanding, robotics, and augmented reality. However, existing methods struggle to capture global dependency across modalities, lack comprehensive saliency priors from both RGB and depth data, and are ineffective in handling low-quality depth maps. To address these challenges, we propose SSNet, a saliency-prior and state space model (SSM)-based network for the RGB-D SOD task. Unlike existing convolution- or transformer-based approaches, SSNet introduces an SSM-based multi-modal multi-scale decoder module to efficiently capture both intra- and inter-modal global dependency with linear complexity. Specifically, we propose a cross-modal selective scan SSM (CM-S6) mechanism, which effectively captures global dependency between different modalities. Furthermore, we introduce a saliency enhancement module (SEM) that integrates three saliency priors with deep features to refine feature representation and improve the localization of salient objects. To further address the issue of low-quality depth maps, we propose an adaptive contrast enhancement technique that dynamically refines depth maps, making them more suitable for the RGB-D SOD task. Extensive quantitative and qualitative experiments on seven benchmark datasets demonstrate that SSNet outperforms state-of-the-art methods.

[CV-70] Making Better Mistakes in CLIP-Based Zero-Shot Classification with Hierarchy-Aware Language Prompts

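[Quick read]: This paper asks how CLIP-based zero-shot image classification can be made to "make better mistakes", meaning that the severity of its errors follows the label hierarchy of the downstream task. The key idea is to exploit the hierarchical semantic relationships between classes that CLIP's image encoder captures implicitly, and to query a language model for textual representations of the given classes, which then serve as CLIP's zero-shot classifiers. The approach is validated on five datasets of varying scale and label hierarchy height, and the paper presents it as a new research direction for CLIP-based zero-shot classification.

Link: https://arxiv.org/abs/2503.02248
Authors: Tong Liang, Jim Davis
Affiliations: Ohio State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages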

Abstract:Recent studies are leveraging advancements in large language models (LLMs) trained on extensive internet-crawled text data to generate textual descriptions of downstream classes in CLIP-based zero-shot image classification. While most of these approaches aim at improving accuracy, our work focuses on "making better mistakes", of which the mistakes’ severities are derived from the given label hierarchy of downstream tasks. Since CLIP’s image encoder is trained with language supervising signals, it implicitly captures the hierarchical semantic relationships between different classes. This motivates our goal of making better mistakes in zero-shot classification, a task for which CLIP is naturally well-suited. Our approach (HAPrompts) queries the language model to produce textual representations for given classes as zero-shot classifiers of CLIP to perform image classification on downstream tasks. To our knowledge, this is the first work to introduce making better mistakes in CLIP-based zero-shot classification. Our approach outperforms the related methods in a holistic comparison across five datasets of varying scales with label hierarchies of different heights in our experiments. Our code and LLM-generated image prompts: this https URL.
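
A common way to derive mistake severity from a label hierarchy is to measure how far up the tree one must climb from the true class to reach an ancestor shared with the predicted class. The sketch below assumes that measure and a toy hierarchy stored as a child-to-parent dictionary; the class names and function names are hypothetical, and the paper's exact severity definition may differ.

```python
def ancestors(node, parent):
    """Return the chain from `node` up to the root, starting with `node` itself."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def mistake_severity(true_cls, pred_cls, parent):
    """Edges from the true class up to its lowest common ancestor with the prediction."""
    pred_ancestors = set(ancestors(pred_cls, parent))
    for height, node in enumerate(ancestors(true_cls, parent)):
        if node in pred_ancestors:
            return height
    raise ValueError("classes do not share a root")

# Toy hierarchy (child -> parent), invented for illustration.
parent = {
    "tabby": "cat", "siamese": "cat", "cat": "animal",
    "beagle": "dog", "dog": "animal", "animal": "entity",
    "truck": "vehicle", "vehicle": "entity",
}
severities = [mistake_severity("tabby", p, parent)
              for p in ("tabby", "siamese", "beagle", "truck")]
```

Under this measure, confusing a tabby with a siamese cat (severity 1) is a better mistake than confusing it with a truck (severity 3), even though both count the same for top-1 accuracy.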

[CV-71] WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

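[Quick read]: This paper addresses Object Goal Navigation, the task of guiding an agent to locate a specific object in an unseen environment. Agents based on a Vision-Language Model (VLM) have shown promising perception and decision-making through prompting, but none has built a fully modular world model that predicts future states of the environment to reduce risky and costly interactions. The paper proposes WMNav, a world model-based navigation framework powered by VLMs. It predicts the possible outcomes of decisions and builds memories that feed back to the policy module, and it maintains a Curiosity Value Map online to retain the predicted state of the environment and to configure the navigation policy dynamically. By decomposing the task along a human-like thinking process, WMNav reduces the impact of model hallucination by deciding on the feedback difference between the world model plan and the observation, and a two-stage action proposer (broad exploration followed by precise localization) improves efficiency. On HM3D and MP3D, WMNav surpasses existing zero-shot benchmarks in both Success Rate (SR) and Success weighted by Path Length (SPL).

Link: https://arxiv.org/abs/2503.02247
Authors: Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, Long Chen
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Computer Science, Wuhan University; School of Computer Science, University of Technology Sydney; IAIR, Xi’an Jiaotong University; Waytous
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 5 figures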

Abstract:Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent progress in Vision-Language Model (VLM)-based agents has demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by Vision-Language Models (VLMs). It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav proposes the online maintained Curiosity Value Map as part of the world model memory to provide dynamic configuration for navigation policy. By decomposing according to a human-like thinking process, WMNav effectively alleviates the impact of model hallucination by making decisions based on the feedback difference between the world model plan and observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D validates WMNav surpasses existing zero-shot benchmarks in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: this https URL.
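
The two metrics reported above have standard definitions in embodied navigation: SR is the fraction of successful episodes, and SPL weights each success by the ratio of the shortest path length to the path the agent actually took. A minimal sketch, with a hypothetical episode tuple layout:

```python
def success_rate_and_spl(episodes):
    """Compute SR and SPL over a list of episodes.

    episodes: list of (success, shortest_path_length, agent_path_length).
    SPL = mean over episodes of success * shortest / max(agent, shortest).
    """
    n = len(episodes)
    sr = sum(1 for success, _, _ in episodes if success) / n
    spl = sum(shortest / max(taken, shortest)
              for success, shortest, taken in episodes if success) / n
    return sr, spl

# Four invented episodes: one optimal success, two detours, one failure.
episodes = [(True, 5.0, 5.0), (True, 4.0, 8.0), (False, 6.0, 10.0), (True, 3.0, 4.0)]
sr, spl = success_rate_and_spl(episodes)
```

SPL can never exceed SR, because every successful episode contributes at most 1 and failures contribute 0. This is why the abstract reports the two numbers side by side as success and efficiency.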

[CV-72] Φ-GAN: Physics-Inspired GAN for Generating SAR Images Under Limited Data

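[Quick read]: This paper addresses the limited performance of generative adversarial networks (GANs) trained on too few synthetic aperture radar (SAR) images. Existing few-sample methods were designed for natural images and do not account for the electromagnetic scattering properties of SAR. The paper proposes Φ-GAN, a physics-inspired regularization method that combines the ideal point scattering center (PSC) model with two physical consistency losses. A physics-inspired neural module estimates the physical parameters of SAR targets efficiently, which allows end-to-end training while retaining the interpretability of the physical model. One loss guides the generator toward SAR images whose physical parameters are consistent with real ones, and the other makes the discriminator more robust by basing its decisions on PSC attributes. Experiments across several conditional GAN (cGAN) models show leading performance in data-scarce scenarios.

Link: https://arxiv.org/abs/2503.02242
Authors: Xidan Zhang, Yihan Zhuang, Qian Guo, Haodong Yang, Xuelin Qian, Gong Cheng, Junwei Han, Zhongling Huang
Affiliations: School of Automation, Northwestern Polytechnical University; College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: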

Abstract:Approaches for improving generative adversarial networks (GANs) training under a few samples have been explored for natural images. However, these methods have limited effectiveness for synthetic aperture radar (SAR) images, as they do not account for the unique electromagnetic scattering properties of SAR. To remedy this, we propose a physics-inspired regularization method dubbed Φ-GAN, which incorporates the ideal point scattering center (PSC) model of SAR with two physical consistency losses. The PSC model approximates SAR targets using physical parameters, ensuring that Φ-GAN generates SAR images consistent with real physical properties while preventing discriminator overfitting by focusing on PSC-based decision cues. To embed the PSC model into GANs for end-to-end training, we introduce a physics-inspired neural module capable of estimating the physical parameters of SAR targets efficiently. This module retains the interpretability of the physical model and can be trained with limited data. We propose two physical loss functions: one for the generator, guiding it to produce SAR images with physical parameters consistent with real ones, and one for the discriminator, enhancing its robustness by basing decisions on PSC attributes. We evaluate Φ-GAN across several conditional GAN (cGAN) models, demonstrating state-of-the-art performance in data-scarce scenarios on three SAR image datasets.

[CV-73] Unsupervised Waste Classification By Dual-Encoder Contrastive Learning and Multi-Clustering Voting (DECMCV)

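[Quick read]: This paper addresses the degraded performance and limited generalization of automated waste classification caused by scarce labeled data and by category and style biases. It proposes an unsupervised method, Dual-Encoder Contrastive Learning with Multi-Clustering Voting (DECMCV), which uses a pre-trained ConvNeXt model for image encoding, a VisionTransformer to generate positive samples, and a multi-clustering voting mechanism to deal with data labeling and domain shift, so that classification works without large labeled datasets. Experiments show that DECMCV matches or outperforms supervised methods on several datasets. On a real-world dataset, a small number of labeled samples is enough to raise classification accuracy substantially, while style differences are mitigated and generalization improves.

Link: https://arxiv.org/abs/2503.02241
Authors: Kui Huang, Mengke Song, Shuo Ba, Ling An, Huajie Liang, Huanxi Deng, Yang Liu, Zhenyu Zhang, Chichun Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: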

Abstract:Waste classification is crucial for improving processing efficiency and reducing environmental pollution. Supervised deep learning methods are commonly used for automated waste classification, but they rely heavily on large labeled datasets, which are costly and inefficient to obtain. Real-world waste data often exhibit category and style biases, such as variations in camera angles, lighting conditions, and types of waste, which can impact the model’s performance and generalization ability. Therefore, constructing a bias-free dataset is essential. Manual labeling is not only costly but also inefficient. While self-supervised learning helps address data scarcity, it still depends on some labeled data and generally results in lower accuracy compared to supervised methods. Unsupervised methods show potential in certain cases but typically do not perform as well as supervised models, highlighting the need for an efficient and cost-effective unsupervised approach. This study presents a novel unsupervised method, Dual-Encoder Contrastive Learning with Multi-Clustering Voting (DECMCV). The approach involves using a pre-trained ConvNeXt model for image encoding, leveraging VisionTransformer to generate positive samples, and applying a multi-clustering voting mechanism to address data labeling and domain shift issues. Experimental results demonstrate that DECMCV achieves classification accuracies of 93.78% and 98.29% on the TrashNet and Huawei Cloud datasets, respectively, outperforming or matching supervised models. On a real-world dataset of 4,169 waste images, only 50 labeled samples were needed to accurately label thousands, improving classification accuracy by 29.85% compared to supervised models. This method effectively addresses style differences, enhances model generalization, and contributes to the advancement of automated waste classification.
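
The voting step can be pictured as a majority vote over the labels that several clustering runs assign to each sample, keeping a pseudo-label only when enough runs agree. The sketch below is an illustration under stated assumptions, not the paper's mechanism: it assumes the cluster ids from different runs have already been mapped to a shared set of class ids, and the `min_agreement` threshold is a hypothetical parameter.

```python
from collections import Counter

def vote(labels_per_run, min_agreement=2):
    """Majority vote across clustering runs; None marks samples without agreement.

    labels_per_run: one list of labels per clustering run, all of equal length
    and (by assumption) already aligned to shared class ids.
    """
    pseudo_labels = []
    for votes in zip(*labels_per_run):
        label, count = Counter(votes).most_common(1)[0]
        pseudo_labels.append(label if count >= min_agreement else None)
    return pseudo_labels

# Three hypothetical clustering runs over four images.
runs = [[0, 1, 1, 2],
        [0, 1, 2, 0],
        [0, 1, 1, 1]]
labels = vote(runs)
strict = vote(runs, min_agreement=3)
```

Samples left as `None` are the ambiguous ones. A pipeline of this kind could route them to a small manual labeling budget, which is consistent with the abstract's report that 50 labeled samples sufficed on the real-world dataset.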
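The abstract above mentions a multi-clustering voting mechanism that spreads a handful of labels over many unlabeled images, but it does not spell out the voting rule. The sketch below is one plausible reading, not the authors' implementation: each clustering maps its clusters to the majority label of the few labeled samples it contains, and every sample keeps a label only if enough clusterings agree. The function names and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter


def cluster_to_label(assignments, labeled):
    """Map each cluster id to the majority label of the labeled samples in it.

    assignments: list of cluster ids, one per sample.
    labeled: dict {sample index: known label} for the few labeled samples.
    """
    votes = {}
    for idx, label in labeled.items():
        votes.setdefault(assignments[idx], Counter())[label] += 1
    return {cluster: cnt.most_common(1)[0][0] for cluster, cnt in votes.items()}


def vote_labels(clusterings, labeled, min_agreement=2):
    """Assign a label to every sample by voting across several clusterings.

    A sample stays unlabeled (None) when fewer than `min_agreement`
    clusterings agree on its label (assumed rule, for illustration only).
    """
    mappings = [cluster_to_label(a, labeled) for a in clusterings]
    result = []
    for i in range(len(clusterings[0])):
        ballots = Counter(
            m[a[i]] for a, m in zip(clusterings, mappings) if a[i] in m
        )
        if not ballots:
            result.append(None)
            continue
        label, count = ballots.most_common(1)[0]
        result.append(label if count >= min_agreement else None)
    return result


# Three clusterings of six images, with only two labeled samples.
clusterings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [2, 2, 2, 5, 5, 2],
]
labeled = {0: "glass", 4: "paper"}
pseudo_labels = vote_labels(clusterings, labeled)
```

Raising `min_agreement` trades coverage for precision: samples on which the clusterings disagree are left unlabeled rather than given a noisy pseudo-label.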

[CV-74] Anomaly detection in non-stationary videos using time-recursive differencing network based prediction

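[Quick Read]: This paper addresses non-stationarity (time-varying feature statistics), a long-standing challenge in video anomaly detection. Although many sophisticated reconstruction and prediction models exist for the task, few handle non-stationarity explicitly. The key idea is a time-recursive differencing network that handles the non-stationary nature of video data during anomaly detection, followed by autoregressive moving average (ARMA) estimation for prediction. The approach is demonstrated on a simple optical-flow-based feature, with qualitative and quantitative results on three aerial video datasets and two standard anomaly detection video datasets. Comparisons with existing methods based on equal error rate (EER), area under the curve (AUC), and receiver operating characteristic (ROC) curves show the advantage of the proposed approach.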

Link: https://arxiv.org/abs/2503.02234
Authors: Gargi V. Pillai, Debashis Sen
Affiliations: Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, India
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Copyright 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract: Most videos, including those captured through aerial remote sensing, are usually non-stationary in nature, having time-varying feature statistics. Although sophisticated reconstruction and prediction models exist for video anomaly detection, effective handling of non-stationarity has seldom been considered explicitly. In this paper, we propose to perform prediction using a time-recursive differencing network followed by autoregressive moving average estimation for video anomaly detection. The differencing network is employed to effectively handle non-stationarity in video data during anomaly detection. Focusing on the prediction process, the effectiveness of the proposed approach is demonstrated considering a simple optical flow based video feature, and by generating qualitative and quantitative results on three aerial video datasets and two standard anomaly detection video datasets. EER, AUC and ROC curve based comparisons with several existing methods, including the state-of-the-art, reveal the superiority of the proposed approach.
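To make the role of differencing concrete, here is a minimal sketch that swaps in classical tools for the learned components described above: fixed first-order differencing instead of the time-recursive differencing network, and a least-squares AR(1) predictor instead of full ARMA estimation. It is an illustration of the general idea only; the function names are hypothetical and the feature is a plain 1-D sequence rather than optical flow.

```python
def difference(series, order=1):
    """Apply `order` rounds of first differencing to remove slow trends."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def fit_ar1(x):
    """Least-squares AR(1) coefficient phi for the model x[t] ~ phi * x[t-1]."""
    num = sum(a * b for a, b in zip(x, x[1:]))
    den = sum(a * a for a in x[:-1])
    return num / den if den else 0.0


def prediction_errors(series):
    """Absolute one-step prediction errors on the differenced series.

    Large errors flag time steps that the stationary model cannot explain.
    """
    d = difference(series)
    phi = fit_ar1(d)
    return [abs(b - phi * a) for a, b in zip(d, d[1:])]


# A steadily increasing feature with one abrupt jump between index 3 and 4.
feature = [0, 1, 2, 3, 10, 11, 12]
errors = prediction_errors(feature)
most_anomalous = errors.index(max(errors))
```

Without differencing, the upward trend itself would dominate the prediction error; after differencing, only the abrupt jump stands out.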

[CV-75] CGMatch: A Different Perspective of Semi-supervised Learning

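[Quick Read]: This paper addresses the performance drop of semi-supervised learning (SSL) when labeled data are scarce. Existing methods rely mainly on model confidence to judge the quality of unlabeled samples, but with limited labels this makes it hard to assess the model's state and to identify unlabeled samples that help training, especially in the early stages. The key contribution is a new metric, Count-Gap (CG), which is combined with confidence in a fine-grained dynamic selection (FDS) strategy. FDS dynamically divides the unlabeled set into easy-to-learn, ambiguous, and hard-to-learn subsets and applies a corresponding regularization to each. In this way CGMatch reduces the negative impact of incorrect pseudo-labels on optimization and generalization, improving SSL performance when labeled data are limited.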

Link: https://arxiv.org/abs/2503.02231
Authors: Bo Cheng, Jueqing Lu, Yuan Tian, Haifeng Zhao, Yi Chang, Lan Du
Affiliations: School of Artificial Intelligence, Jilin University, China; International Center of Future Science, Jilin University, China; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China; Faculty of Information Technology, Monash University, Australia; Department of Computer Science, Jinling Institute of Technology, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semi-supervised learning (SSL) has garnered significant attention due to its ability to leverage limited labeled data and a large amount of unlabeled data to improve model generalization performance. Recent approaches achieve impressive successes by combining ideas from both consistency regularization and pseudo-labeling. However, these methods tend to underperform in the more realistic situations with relatively scarce labeled data. We argue that this issue arises because existing methods rely solely on the model’s confidence, making them challenging to accurately assess the model’s state and identify unlabeled examples contributing to the training phase when supervision information is limited, especially during the early stages of model training. In this paper, we propose a novel SSL model called CGMatch, which, for the first time, incorporates a new metric known as Count-Gap (CG). We demonstrate that CG is effective in discovering unlabeled examples beneficial for model training. Along with confidence, a commonly used metric in SSL, we propose a fine-grained dynamic selection (FDS) strategy. This strategy dynamically divides the unlabeled dataset into three subsets with different characteristics: easy-to-learn set, ambiguous set, and hard-to-learn set. By selective filtering subsets, and applying corresponding regularization with selected subsets, we mitigate the negative impact of incorrect pseudo-labels on model optimization and generalization. Extensive experimental results on several common SSL benchmarks indicate the effectiveness of CGMatch especially when the labeled data are particularly limited. Source code is available at this https URL.

[CV-76] Empowering Sparse-Input Neural Radiance Fields with Dual-Level Semantic Guidance from Dense Novel Views

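[Quick Read]: This paper addresses the sharp drop in rendering quality of Neural Radiance Fields (NeRF) under sparse inputs. The key idea is to use semantics rendered from dense novel views as a more robust form of augmented data than rendered RGB. The semantic guidance operates at two levels. At the supervision level, a bi-directional verification module decides the validity of each rendered semantic label. At the feature level, a learnable codebook encodes semantic-aware information, which each point queries through an attention mechanism to obtain semantic-relevant predictions. The guidance is embedded in a self-improved pipeline, and the paper also introduces a more challenging sparse-input indoor benchmark, on which the method outperforms existing approaches.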

Link: https://arxiv.org/abs/2503.02230
Authors: Yingji Zhong, Kaichen Zhou, Zhihao Li, Lanqing Hong, Zhenguo Li, Dan Xu
Affiliations: The Hong Kong University of Science and Technology; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Neural Radiance Fields (NeRF) have shown remarkable capabilities for photorealistic novel view synthesis. One major deficiency of NeRF is that dense inputs are typically required, and the rendering quality will drop drastically given sparse inputs. In this paper, we highlight the effectiveness of rendered semantics from dense novel views, and show that rendered semantics can be treated as a more robust form of augmented data than rendered RGB. Our method enhances NeRF’s performance by incorporating guidance derived from the rendered semantics. The rendered semantic guidance encompasses two levels: the supervision level and the feature level. The supervision-level guidance incorporates a bi-directional verification module that decides the validity of each rendered semantic label, while the feature-level guidance integrates a learnable codebook that encodes semantic-aware information, which is queried by each point via the attention mechanism to obtain semantic-relevant predictions. The overall semantic guidance is embedded into a self-improved pipeline. We also introduce a more challenging sparse-input indoor benchmark, where the number of inputs is limited to as few as 6. Experiments demonstrate the effectiveness of our method and it exhibits superior performance compared to existing approaches.

[CV-77] One Patient’s Annotation is Another One’s Initialization: Towards Zero-Shot Surgical Video Segmentation with Cross-Patient Initialization

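[Quick Read]: This paper addresses a limitation of video object segmentation for real-time surgical video segmentation: it requires manual intervention to select the tracked object, which is impractical in the operating room. The proposed solution is to use previously annotated frames from other patients as the tracking frames. The key finding is that cross-patient frames can match or even surpass the performance of a patient's own tracking frames, enabling more autonomous and efficient AI-assisted surgical workflows. The paper also analyzes the benefits and limitations of the approach, including its potential to improve segmentation accuracy while reducing manual input, and identifies key factors influencing performance as a basis for future work on optimizing cross-patient frame selection.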

Link: https://arxiv.org/abs/2503.02228
Authors: Seyed Amir Mousavi, Utku Ozbulak, Francesca Tozzi, Nikdokht Rashidian, Wouter Willaert, Joris Vankerschaver, Wesley De Neve
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Video object segmentation is an emerging technology that is well-suited for real-time surgical video segmentation, offering valuable clinical assistance in the operating room by ensuring consistent frame tracking. However, its adoption is limited by the need for manual intervention to select the tracked object, making it impractical in surgical settings. In this work, we tackle this challenge with an innovative solution: using previously annotated frames from other patients as the tracking frames. We find that this unconventional approach can match or even surpass the performance of using patients’ own tracking frames, enabling more autonomous and efficient AI-assisted surgical workflows. Furthermore, we analyze the benefits and limitations of this approach, highlighting its potential to enhance segmentation accuracy while reducing the need for manual input. Our findings provide insights into key factors influencing performance, offering a foundation for future research on optimizing cross-patient frame selection for real-time surgical video analysis.

[CV-78] DQO-MAP: Dual Quadrics Multi-Object mapping with Gaussian Splatting

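[Quick Read]: This paper addresses the accurate object perception needed for robotic tasks such as object navigation, and in particular how to achieve precise object pose estimation and reconstruction at the same time. It proposes DQO-MAP, an object-SLAM system that integrates object pose estimation and reconstruction: 3D Gaussian Splatting provides high-fidelity object reconstruction, while quadrics provide precise object pose estimation. Data management runs on the CPU and optimization on the GPU, which significantly improves efficiency. By associating objects with unique IDs, the system can also quickly extract target objects from the scene. Experiments show strong precision, reconstruction quality, and computational efficiency.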

Link: https://arxiv.org/abs/2503.02223
Authors: Haoyuan Li, Ziqin Ye, Yue Hao, Weiyang Lin, Chao Ye
Affiliations: Research Institute of Intelligent Control and Systems, Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Accurate object perception is essential for robotic applications such as object navigation. In this paper, we propose DQO-MAP, a novel object-SLAM system that seamlessly integrates object pose estimation and reconstruction. We employ 3D Gaussian Splatting for high-fidelity object reconstruction and leverage quadrics for precise object pose estimation. Management of both is handled on the CPU, while optimization is performed on the GPU, significantly improving system efficiency. By associating objects with unique IDs, our system enables rapid object extraction from the scene. Extensive experimental results on object reconstruction and pose estimation demonstrate that DQO-MAP achieves outstanding performance in terms of precision, reconstruction quality, and computational efficiency. The code and dataset are available at: this https URL.

[CV-79] Low-Level Matters: An Efficient Hybrid Architecture for Robust Multi-frame Infrared Small Target Detection

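[Quick Read]: This paper addresses key challenges in multi-frame infrared small target detection (IRSTD), which matters for low-altitude and maritime surveillance. Hybrid architectures combining CNNs and Transformers are promising, but the standard linear patch embedding in Vision Transformers fails to capture the scale-sensitive local features that are critical for infrared small targets. The proposed LVNet makes two contributions: (1) a multi-scale CNN frontend that explicitly models local features using the local spatial bias of convolution, and (2) a U-shaped video Transformer for multi-frame spatiotemporal context modeling that captures target motion. Compared with the current best method, LMAFormer, LVNet improves nIoU by 5.63% / 18.36% while using only 1/221 of the parameters and 1/92 / 1/21 of the computational cost, confirming the importance of low-level feature learning in hybrid architectures.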

Link: https://arxiv.org/abs/2503.02220
Authors: Zhihua Shen, Siyang Chen, Han Wang, Tongsu Zhang, Xiaohu Zhang, Xiangpeng Xu, Xia Yang
Affiliations: School of Aeronautics and Astronautics, Sun Yat-sen University, Guangzhou 510275, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-frame infrared small target detection (IRSTD) plays a crucial role in low-altitude and maritime surveillance. The hybrid architecture combining CNNs and Transformers shows great promise for enhancing multi-frame IRSTD performance. In this paper, we propose LVNet, a simple yet powerful hybrid architecture that redefines low-level feature learning in hybrid frameworks for multi-frame IRSTD. Our key insight is that the standard linear patch embeddings in Vision Transformers are insufficient for capturing the scale-sensitive local features critical to infrared small targets. To address this limitation, we introduce a multi-scale CNN frontend that explicitly models local features by leveraging the local spatial bias of convolution. Additionally, we design a U-shaped video Transformer for multi-frame spatiotemporal context modeling, effectively capturing the motion characteristics of targets. Experiments on the publicly available datasets IRDST and NUDT-MIRSDT demonstrate that LVNet outperforms existing state-of-the-art methods. Notably, compared to the current best-performing method, LMAFormer, LVNet achieves an improvement of 5.63% / 18.36% in nIoU, while using only 1/221 of the parameters and 1/92 / 1/21 of the computational cost. Ablation studies further validate the importance of low-level representation learning in hybrid architectures. Our code and trained models are available at this https URL.

[CV-80] Time-Varying Coronary Artery Deformation: A Dynamic Skinning Framework for Surgical Training

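[Quick Read]: This paper addresses how to control vessel deformation precisely in coronary artery surgical simulation while maintaining real-time performance. It proposes an anatomically driven dynamic modeling framework whose key step is computing skeletal skinning weights by biharmonic energy minimization, with volumetric discretization through tetrahedral mesh generation. Temporal sampling and interpolation yield continuous vessel deformation throughout the cardiac cycle, with mechanical constraints and volume conservation enforced. Validation on clinical datasets reports results for geometric accuracy, branch completeness, and branch continuity, and the framework offers a more flexible technical foundation for virtual surgical training systems.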

Link: https://arxiv.org/abs/2503.02218
Authors: Shuo Wang, Tong Ren, Nan Cheng, Rong Wang, Li Zhang
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 24 pages, 8 figures, submitted to International Journal of Computer Assisted Radiology and Surgery

Abstract:Purpose: This study proposes a novel anatomically-driven dynamic modeling framework for coronary arteries using skeletal skinning weights computation, aiming to achieve precise control over vessel deformation while maintaining real-time performance for surgical simulation applications. Methods: We developed a computational framework based on biharmonic energy minimization for skinning weight calculation, incorporating volumetric discretization through tetrahedral mesh generation. The method implements temporal sampling and interpolation for continuous vessel deformation throughout the cardiac cycle, with mechanical constraints and volume conservation enforcement. The framework was validated using clinical datasets from 5 patients, comparing interpolated deformation results against ground truth data obtained from frame-by-frame segmentation across cardiac phases. Results: The proposed framework effectively handled interactive vessel manipulation. Geometric accuracy evaluation showed mean Hausdorff distance of 4.96 ± 1.78 mm and mean surface distance of 1.78 ± 0.75 mm between interpolated meshes and ground truth models. The Branch Completeness Ratio achieved 1.82 ± 0.46, while Branch Continuity Score maintained 0.84 ± 0.06 (scale 0-1) across all datasets. The system demonstrated capability in supporting real-time guidewire-vessel collision detection and contrast medium flow simulation throughout the complete coronary tree structure. Conclusion: Our skinning weight-based methodology enhances model interactivity and applicability while maintaining geometric accuracy. The framework provides a more flexible technical foundation for virtual surgical training systems, demonstrating promising potential for both clinical practice and medical education applications. The code is available at this https URL.
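The geometric accuracy figures above are stated as a Hausdorff distance and a mean surface distance between interpolated meshes and ground truth. As a rough illustration of what these two metrics measure, the sketch below computes them on small point sets. It assumes the meshes are represented by sampled vertices and uses the symmetric average of nearest-neighbour distances for the mean surface distance; the paper's exact definitions may differ.

```python
import math


def _nearest(p, points):
    """Distance from point p to its nearest neighbour in `points`."""
    return min(math.dist(p, q) for q in points)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worst-case nearest-neighbour gap."""
    return max(
        max(_nearest(p, b) for p in a),
        max(_nearest(q, a) for q in b),
    )


def mean_surface_distance(a, b):
    """Average nearest-neighbour distance, taken in both directions."""
    d_ab = [_nearest(p, b) for p in a]
    d_ba = [_nearest(q, a) for q in b]
    return (sum(d_ab) + sum(d_ba)) / (len(d_ab) + len(d_ba))


# Two toy vertex sets (in mm); the second has one outlying vertex.
interpolated = [(0, 0, 0), (1, 0, 0)]
ground_truth = [(0, 0, 0), (1, 0, 0), (1, 2, 0)]
hd = hausdorff(interpolated, ground_truth)
msd = mean_surface_distance(interpolated, ground_truth)
```

The example shows why the two numbers can differ widely: a single outlying vertex sets the Hausdorff distance, while the mean surface distance averages it away.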

[CV-81] Language-Guided Visual Perception Disentanglement for Image Quality Assessment and Conditional Image Generation

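[Quick Read]: Contrastive vision-language models such as CLIP show strong zero-shot capability on semantic recognition tasks, but they perform poorly on tasks that need fine control over perceptual and semantic features, such as image quality assessment (IQA) and conditional image generation (CIG). This paper proposes a multimodal disentangled representation learning framework that uses disentangled text to guide image disentanglement. It first builds an I2T dataset containing a perceptual description and a semantic description for each image, then uses these disentangled descriptions as supervisory signals to separate pure perceptual representations from CLIP's original "coarse" feature space, yielding DeCLIP. The decoupled features are applied to IQA (technical and aesthetic quality) and CIG, and experiments show advantages on both tasks.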

Link: https://arxiv.org/abs/2503.02206
Authors: Zhichao Yang, Leida Li, Pengfei Chen, Jinjian Wu, Giuseppe Valenzise
Affiliations: Xidian University; Chongqing Three Gorges University; Paris-Saclay University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Contrastive vision-language models, such as CLIP, have demonstrated excellent zero-shot capability across semantic recognition tasks, mainly attributed to the training on a large-scale I1T (one Image with one Text) dataset. This kind of multimodal representations often blend semantic and perceptual elements, placing a particular emphasis on semantics. However, this could be problematic for popular tasks like image quality assessment (IQA) and conditional image generation (CIG), which typically need to have fine control on perceptual and semantic features. Motivated by the above facts, this paper presents a new multimodal disentangled representation learning framework, which leverages disentangled text to guide image disentanglement. To this end, we first build an I2T (one Image with a perceptual Text and a semantic Text) dataset, which consists of disentangled perceptual and semantic text descriptions for an image. Then, the disentangled text descriptions are utilized as supervisory signals to disentangle pure perceptual representations from CLIP’s original ‘coarse’ feature space, dubbed DeCLIP. Finally, the decoupled feature representations are used for both image quality assessment (technical quality and aesthetic quality) and conditional image generation. Extensive experiments and comparisons have demonstrated the advantages of the proposed method on the two popular tasks. The dataset, code, and model will be available.

[CV-82] MonoLite3D: Lightweight 3D Object Properties Estimation

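[Quick Read]: This paper addresses efficient and accurate estimation of 3D object dimensions, spatial location, and orientation from monocular images on resource-constrained hardware. It proposes MonoLite3D, a lightweight deep learning network designed to run in real time on embedded devices with limited compute while maintaining high accuracy. On the orientation benchmark of the KITTI dataset, the method scores 82.27% on the moderate class and 69.81% on the hard class, supporting its suitability for resource-constrained environments.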

Link: https://arxiv.org/abs/2503.02201
Authors: Ahmed El-Dawy, Amr El-Zawawi, Mohamed El-Habrouk
Affiliations: Electrical Power Department, Faculty of Engineering, Alexandria University, Alexandria, Egypt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Reliable perception of the environment plays a crucial role in enabling efficient self-driving vehicles. Therefore, the perception system necessitates the acquisition of comprehensive 3D data regarding the surrounding objects within a specific time constraint, including their dimensions, spatial location and orientation. Deep learning has gained significant popularity in perception systems, enabling the conversion of image features captured by a camera into meaningful semantic information. This research paper introduces MonoLite3D network, an embedded-device friendly lightweight deep learning methodology designed for hardware environments with limited resources. MonoLite3D network is a cutting-edge technique that focuses on estimating multiple properties of 3D objects, encompassing their dimensions and spatial orientation, solely from monocular images. This approach is specifically designed to meet the requirements of resource-constrained environments, making it highly suitable for deployment on devices with limited computational capabilities. The experimental results validate the accuracy and efficiency of the proposed approach on the orientation benchmark of the KITTI dataset. It achieves an impressive score of 82.27% on the moderate class and 69.81% on the hard class, while still meeting the real-time requirements.

[CV-83] HyperGCT: A Dynamic Hyper-GNN-Learned Geometric Constraint for 3D Registration

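[Quick Read]: This paper addresses how to model geometric constraints between feature matches in 3D point cloud registration. Existing approaches typically represent unordered matches as a consistency graph and sample consistent matches to generate hypotheses, but explicit graph construction introduces noise, making it hard for handcrafted geometric constraints to establish consistency among matches. The paper proposes HyperGCT (Hyper-GNN-Learned Geometric Constraint), a flexible dynamic hypergraph-learned constraint that exploits high-order consistency among 3D correspondences. By dynamically optimizing the hypergraph through vertex and edge feature aggregation, HyperGCT captures correlations among correspondences and generates accurate hypotheses. Experiments on multiple datasets show state-of-the-art performance, robustness to graph noise, and a significant advantage in generalization.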

Link: https://arxiv.org/abs/2503.02195
Authors: Xiyu Zhang, Jiayi Ma, Jianwei Guo, Wei Hu, Zhaoshuai Qi, Fei Hui, Jiaqi Yang, Yanning Zhang
Affiliations: School of Computer Science, Northwestern Polytechnical University; Electronic Information School, Wuhan University; Institute of Automation, Chinese Academy of Sciences; Wangxuan Institute of Computer Technology, Peking University; Electronics and Control Engineering, Chang’an University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Geometric constraints between feature matches are critical in 3D point cloud registration problems. Existing approaches typically model unordered matches as a consistency graph and sample consistent matches to generate hypotheses. However, explicit graph construction introduces noise, posing great challenges for handcrafted geometric constraints to render consistency among matches. To overcome this, we propose HyperGCT, a flexible dynamic Hyper-GNN-learned geometric constraint that leverages high-order consistency among 3D correspondences. To our knowledge, HyperGCT is the first method that mines robust geometric constraints from dynamic hypergraphs for 3D registration. By dynamically optimizing the hypergraph through vertex and edge feature aggregation, HyperGCT effectively captures the correlations among correspondences, leading to accurate hypothesis generation. Extensive experiments on 3DMatch, 3DLoMatch, KITTI-LC, and ETH show that HyperGCT achieves state-of-the-art performance. Furthermore, our method is robust to graph noise, demonstrating a significant advantage in terms of generalization. The code will be released.

[CV-84] DarkDeblur: Learning single-shot image deblurring in low-light condition

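[Quick Read]: This paper addresses single-shot image deblurring in low-light conditions, a highly challenging image translation task. It takes a learning-based approach and proposes a deep network named DarkDeblurNet, which combines a dense-attention block and a contextual gating mechanism in a feature pyramid structure to gain content awareness. A multi-term objective function is used to preserve perceptual image quality while deblurring in low light. To demonstrate practicality, the authors integrate the model into several computer vision applications and introduce a benchmark dataset collected with actual hardware for real-world evaluation. Experiments show that the method outperforms state-of-the-art single-shot deblurring methods on both synthesized and real-world data, even under challenging lighting.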

Link: https://arxiv.org/abs/2503.02194
Authors: S M A Sharif, Rizwan Ali Naqvi, Farman Alic, Mithun Biswas
Affiliations: Department of Unmanned Vehicle Engineering, Sejong University, South Korea; Department of Software, Sejong University, South Korea; Rigel-IT, Banasree, Dhaka-1219, Bangladesh
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract: Single-shot image deblurring in a low-light condition is known to be a profoundly challenging image translation task. This study tackles the limitations of low-light image deblurring with a learning-based approach and proposes a novel deep network named DarkDeblurNet. The proposed DarkDeblurNet comprises a dense-attention block and a contextual gating mechanism in a feature pyramid structure to leverage content awareness. The model additionally incorporates a multi-term objective function to perceive a plausible perceptual image quality while performing image deblurring in low-light settings. The practicability of the proposed model has been verified by fusing it in numerous computer vision applications. Apart from that, this study introduces a benchmark dataset collected with actual hardware to assess low-light image deblurring methods in a real-world setup. The experimental results illustrate that the proposed method can outperform the state-of-the-art methods in both synthesized and real-world data for single-shot image deblurring, even in challenging lighting environments.

[CV-85] h-Edit: Effective and Flexible Diffusion-Based Editing via Doob’s h-Transform CVPR2025

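[Quick Read]: This paper addresses the balance between flexibility and effectiveness in diffusion-based image editing. Existing methods often depend on a specific training procedure or a single editing objective, which makes it hard to combine several editing requirements or handle complex edits. The paper formulates editing as a reverse-time bridge modeling problem: the backward process of a pretrained diffusion model is modified to build a bridge that converges to an implicit distribution associated with the editing target at time 0. Building on this framework, h-Edit uses Doob’s h-transform and Langevin Monte Carlo to decompose the update of an intermediate edited sample into a "reconstruction" term and an "editing" term. The reconstruction term can be computed with existing inversion techniques, and multiple editing terms can be combined for complex tasks. This enables training-free, simultaneous text-guided and reward-model-based editing, and experiments show gains in editing effectiveness and faithfulness over state-of-the-art baselines.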

Link: https://arxiv.org/abs/2503.02187
Authors: Toan Nguyen, Kien Do, Duc Kieu, Thin Nguyen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR 2025

Abstract:We introduce a theoretical framework for diffusion-based image editing by formulating it as a reverse-time bridge modeling problem. This approach modifies the backward process of a pretrained diffusion model to construct a bridge that converges to an implicit distribution associated with the editing target at time 0. Building on this framework, we propose h-Edit, a novel editing method that utilizes Doob’s h-transform and Langevin Monte Carlo to decompose the update of an intermediate edited sample into two components: a “reconstruction” term and an “editing” term. This decomposition provides flexibility, allowing the reconstruction term to be computed via existing inversion techniques and enabling the combination of multiple editing terms to handle complex editing tasks. To our knowledge, h-Edit is the first training-free method capable of performing simultaneous text-guided and reward-model-based editing. Extensive experiments, both quantitative and qualitative, show that h-Edit outperforms state-of-the-art baselines in terms of editing effectiveness and faithfulness. Our source code is available at this https URL.

[CV-86] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

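[Quick Read]: This paper addresses the high inference complexity and latency of large multimodal models (LMMs) caused by the many visual tokens added to the input. Existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics, which leaves redundancy among the retained tokens. The key idea is to reformulate token pruning as a Max-Min Diversity Problem (MMDP): select a subset of tokens such that the diversity among the selected tokens is maximized. By solving this problem, DivPrune reduces redundancy so that the selected tokens better represent the original set, preserving performance at high pruning ratios without fine-tuning. Experiments report state-of-the-art accuracy across 16 image- and video-language datasets, together with lower end-to-end latency and GPU memory usage.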

Link: https://arxiv.org/abs/2503.02175
Authors: Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang
Affiliations: Huawei Technologies Canada Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, are proposed. The existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics, which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as a Max-Min Diversity Problem (MMDP) where the goal is to select a subset such that the diversity among the selected tokens is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is available at this https URL.
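The Max-Min Diversity Problem described above can be illustrated with a simple greedy heuristic: repeatedly add the token whose distance to the already selected set is largest. This is a minimal sketch under stated assumptions, not the paper's solver: the abstract does not say how the MMDP is solved, and the cosine distance, the choice of token 0 as the seed, and the function names are all illustrative.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distinct."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def greedy_max_min_select(tokens, k, dist=cosine_distance):
    """Pick k token indices so that the minimum pairwise distance stays large.

    Greedy farthest-point heuristic: start from token 0 (arbitrary seed),
    then repeatedly add the token farthest from the current selection.
    """
    n = len(tokens)
    if k <= 0:
        return []
    if k >= n:
        return list(range(n))
    selected = [0]
    # min_dist[i] = distance from token i to its closest selected token.
    min_dist = [dist(tokens[0], t) for t in tokens]
    while len(selected) < k:
        candidates = (i for i in range(n) if i not in selected)
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], dist(tokens[nxt], tokens[i]))
    return sorted(selected)


# Five toy "visual tokens": two near-duplicate pairs and one outlier.
tokens = [[1, 0], [0.99, 0.1], [0, 1], [0.1, 0.99], [-1, 0]]
kept = greedy_max_min_select(tokens, k=3)
```

In this toy case the near-duplicate tokens are pruned and one representative per direction is kept, which is the redundancy-reduction behaviour the abstract attributes to diversity-based selection.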

[CV-87] Adaptive Camera Sensor for Vision Models ICLR2025

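[Quick Read]: This paper addresses the performance degradation caused by domain shift in deep-learning-based computer vision. Conventional remedies require extensive model modifications or large labeled datasets. Inspired by how human vision adjusts input quality with corrective lenses rather than over-training the brain, the paper proposes Lens, a camera sensor control method. Lens controls the sensor from the model's perspective instead of the traditional human-centric one, adapting sensor parameters to specific models and scenes in real time. Its core component, VisiT, is a training-free, model-specific quality indicator that evaluates unlabeled samples at test time using confidence scores, with no additional adaptation cost. Experiments show that Lens significantly improves accuracy across various baseline schemes, compensates for large differences in model size, and keeps image capture latency low.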

Link: https://arxiv.org/abs/2503.02170
Authors: Eunsu Baek, Sunghwan Han, Taesik Gong, Hyung-Sin Kim
Affiliations: Graduate School of Data Science, Seoul National University; Department of Computer Science & Engineering, Seogang University; Department of Computer Science & Engineering, Ulsan National Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The International Conference on Learning Representations (ICLR 2025)

Abstract:Domain shift remains a persistent challenge in deep-learning-based computer vision, often requiring extensive model modifications or large labeled datasets to address. Inspired by human visual perception, which adjusts input quality through corrective lenses rather than over-training the brain, we propose Lens, a novel camera sensor control method that enhances model performance by capturing high-quality images from the model’s perspective rather than relying on traditional human-centric sensor control. Lens is lightweight and adapts sensor parameters to specific models and scenes in real-time. At its core, Lens utilizes VisiT, a training-free, model-specific quality indicator that evaluates individual unlabeled samples at test time using confidence scores without additional adaptation costs. To validate Lens, we introduce ImageNet-ES Diverse, a new benchmark dataset capturing natural perturbations from varying sensor and lighting conditions. Extensive experiments on both ImageNet-ES and our new ImageNet-ES Diverse show that Lens significantly improves model accuracy across various baseline schemes for sensor control and model modification while maintaining low latency in image captures. Lens effectively compensates for large model size differences and integrates synergistically with model improvement techniques. Our code and dataset are available at this http URL.
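The idea of choosing sensor parameters from the model's point of view can be sketched as follows: capture the scene under a few candidate settings, score each capture by the model's own confidence, and keep the best one. This is a hedged illustration only. The abstract says VisiT uses confidence scores but does not define them, so the maximum softmax probability used here, the setting names, and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits):
    """Assumed quality score: the model's maximum class probability."""
    return max(softmax(logits))


def pick_sensor_setting(candidates):
    """Return the sensor setting whose capture the model is most confident on.

    candidates: dict {setting name: logits the model produced for that capture}.
    """
    return max(candidates, key=lambda name: confidence(candidates[name]))


# Hypothetical logits for one scene captured under three sensor settings.
candidates = {
    "iso100_1/60": [2.0, 1.9, 1.8],
    "iso400_1/250": [5.0, 0.5, 0.1],
    "iso1600_1/1000": [1.0, 1.0, 1.0],
}
best = pick_sensor_setting(candidates)
```

No labels or retraining are involved: the score depends only on the model's outputs at test time, which matches the training-free, model-specific character described in the abstract.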

[CV-88] X2CT-CLIP: Enable Multi-Abnormality Detection in Computed Tomography from Chest Radiography via Tri-Modal Contrastive Learning

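[Quick Read]: Chest computed tomography (CT) is limited for large-scale screening by high radiation exposure and long turnaround times, while existing chest radiography (CXR) foundation models mainly detect diseases that are readily visible on CXR. Prior work that trains disease classifiers on simulated CXRs remains limited to a single disease type, and applying CT-derived labels to CXR in a general way has remained elusive. The paper proposes X2CT-CLIP, a tri-modal knowledge transfer learning framework. Its key component is a carefully designed tri-modal alignment mechanism in latent space that transfers knowledge from 3D CT volumes and their associated radiology reports to a CXR encoder, enabling multi-abnormality classification in CT from CXR while reducing the computational burden of training. The method outperforms state-of-the-art baselines in cross-modal retrieval, few-shot adaptation, and external validation.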

Link: https://arxiv.org/abs/2503.02162
Authors: Jianzhong You, Yuan Gao, Sangwook Kim, Chris Mcintosh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 1 figure, 5 tables

Abstract: Computed tomography (CT) is a key imaging modality for diagnosis, yet its clinical utility is marred by high radiation exposure and long turnaround times, restricting its use for larger-scale screening. Although chest radiography (CXR) is more accessible and safer, existing CXR foundation models focus primarily on detecting diseases that are readily visible on the CXR. Recently, works have explored training disease classification models on simulated CXRs, but they remain limited to recognizing a single disease type from CT. CT foundation models have also emerged with significantly improved detection of pathologies in CT. However, the generalized application of CT-derived labels on CXR has remained elusive. In this study, we propose X2CT-CLIP, a tri-modal knowledge transfer learning framework that bridges the modality gap between CT and CXR while reducing the computational burden of model training. Our approach is the first work to enable multi-abnormality classification in CT, using CXR, by transferring knowledge from 3D CT volumes and associated radiology reports to a CXR encoder via a carefully designed tri-modal alignment mechanism in latent space. Extensive evaluations on three multi-label CT datasets demonstrate that our method outperforms state-of-the-art baselines in cross-modal retrieval, few-shot adaptation, and external validation. These results highlight the potential of CXR, enriched with knowledge derived from CT, as a viable efficient alternative for disease detection in resource-limited settings.
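The tri-modal alignment described above can be pictured with a CLIP-style contrastive objective that pulls each CXR embedding towards the CT embedding and the report embedding of the same study. The sketch below is an assumption-laden illustration: the abstract does not give the loss, so the one-directional InfoNCE form, the temperature value, and the function names are hypothetical, and the embeddings are toy vectors rather than encoder outputs.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss where queries[i] should match keys[i]."""
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / len(q)


def tri_modal_loss(cxr, ct, report, temperature=0.1):
    """Align CXR embeddings with both CT and report embeddings (assumed form)."""
    return info_nce(cxr, ct, temperature) + info_nce(cxr, report, temperature)


# Toy embeddings for two studies; index i of each list is the same patient.
cxr = [[1.0, 0.0], [0.0, 1.0]]
ct = [[0.9, 0.1], [0.1, 0.9]]
report = [[0.8, 0.2], [0.2, 0.8]]
aligned = tri_modal_loss(cxr, ct, report)
mismatched = tri_modal_loss(cxr, ct[::-1], report[::-1])
```

Because only the CXR side appears as the query, a setup like this could train the CXR encoder against fixed CT and report embeddings, which is one way the reduced training cost mentioned in the abstract might arise.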

[CV-89] MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models

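[Quick read]: This paper addresses the frequent hallucinations produced by medical large vision-language models (Med-LVLMs), which stem mainly from limited domain expertise and the complexity of medical applications. Existing benchmarks neither evaluate hallucinations by their underlying causes nor comprehensively assess mitigation strategies. The key contribution is MedHEval, a new benchmark that systematically evaluates hallucinations and mitigation strategies in Med-LVLMs by grouping hallucinations into three underlying causes: visual misinterpretation, knowledge deficiency, and context misalignment. MedHEval builds a set of closed- and open-ended medical visual question answering (VQA) datasets with multiple evaluation metrics, and runs extensive experiments on 11 popular (Med-)LVLMs and 7 state-of-the-art hallucination mitigation techniques. The results show that Med-LVLMs struggle with hallucinations from all three causes and that existing mitigation methods have limited effect, especially on knowledge- and context-based errors. These findings point to the need for improved alignment training and specialized mitigation strategies. MedHEval provides a standardized framework for evaluating and mitigating medical hallucinations, supporting the development of more trustworthy Med-LVLMs.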

Link: https://arxiv.org/abs/2503.02157
Authors: Aofei Chang, Le Huang, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao
Affiliations: The Pennsylvania State University; GE Healthcare
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint, under review

Abstract:Large Vision Language Models (LVLMs) are becoming increasingly important in the medical domain, yet Medical LVLMs (Med-LVLMs) frequently generate hallucinations due to limited expertise and the complexity of medical applications. Existing benchmarks fail to effectively evaluate hallucinations based on their underlying causes and lack assessments of mitigation strategies. To address this gap, we introduce MedHEval, a novel benchmark that systematically evaluates hallucinations and mitigation strategies in Med-LVLMs by categorizing them into three underlying causes: visual misinterpretation, knowledge deficiency, and context misalignment. We construct a diverse set of close- and open-ended medical VQA datasets with comprehensive evaluation metrics to assess these hallucination types. We conduct extensive experiments across 11 popular (Med)-LVLMs and evaluate 7 state-of-the-art hallucination mitigation techniques. Results reveal that Med-LVLMs struggle with hallucinations arising from different causes while existing mitigation methods show limited effectiveness, especially for knowledge- and context-based errors. These findings underscore the need for improved alignment training and specialized mitigation strategies to enhance Med-LVLMs’ reliability. MedHEval establishes a standardized framework for evaluating and mitigating medical hallucinations, guiding the development of more trustworthy Med-LVLMs.

[CV-90] Video-DPRP: A Differentially Private Approach for Visual Privacy-Preserving Video Human Activity Recognition

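[Quick read]: This paper addresses two key problems in privacy-preserving video human activity recognition (Video HAR): (1) developing a model-free approach to visual privacy in videos that leverages the properties of differential privacy (DP), and (2) evaluating the proposed technique on activity recognition tasks using both differential privacy and visual privacy assessments. For goal (1), the paper introduces Video-DPRP (Video-sample-wise Differentially Private Random Projection), a framework that uses random projections, noise matrices, and right singular vectors obtained from the singular value decomposition of videos to reconstruct differentially private videos under privacy parameters (\epsilon, \delta) while still supporting visual privacy assessment. For goal (2), the paper uses the UCF101 and HMDB51 datasets to compare Video-DPRP with traditional DP methods and state-of-the-art visual privacy-preserving techniques on activity recognition, and uses the PA-HMDB and VISPR datasets to assess how well it protects privacy-related attributes such as facial features, gender, and skin color. The key idea is to combine the strengths of differential privacy and visual privacy so that video data is protected from both perspectives.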

Link: https://arxiv.org/abs/2503.02132
Authors: Allassan Tchangmena A Nken, Susan Mckeever, Peter Corcoran, Ihsan Ullah
Affiliations: University of Galway, Ireland; Technological University Dublin, Ireland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Considerable effort has been made in privacy-preserving video human activity recognition (HAR). Two primary approaches to ensure privacy preservation in Video HAR are differential privacy (DP) and visual privacy. Techniques enforcing DP during training provide strong theoretical privacy guarantees but offer limited capabilities for visual privacy assessment. Conversely, methods such as low-resolution transformations, data obfuscation, and adversarial networks emphasize visual privacy but lack clear theoretical privacy assurances. In this work, we focus on two main objectives: (1) leveraging DP properties to develop a model-free approach for visual privacy in videos and (2) evaluating our proposed technique using both differential privacy and visual privacy assessments on HAR tasks. To achieve goal (1), we introduce Video-DPRP: a Video-sample-wise Differentially Private Random Projection framework for privacy-preserved video reconstruction for HAR. By using random projections, noise matrices and right singular vectors derived from the singular value decomposition of videos, Video-DPRP reconstructs DP videos using privacy parameters (\epsilon, \delta) while enabling visual privacy assessment. For goal (2), using UCF101 and HMDB51 datasets, we compare Video-DPRP’s performance on activity recognition with traditional DP methods and state-of-the-art (SOTA) visual privacy-preserving techniques. Additionally, we assess its effectiveness in preserving privacy-related attributes such as facial features, gender, and skin color, using the PA-HMDB and VISPR datasets. Video-DPRP combines privacy preservation from both a DP and a visual privacy perspective, unlike SOTA methods that typically address only one of these aspects.
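
The abstract does not say how Video-DPRP calibrates its noise matrices, so the sketch below only illustrates what the privacy parameters (\epsilon, \delta) control in the classical Gaussian mechanism. That mechanism is an assumption standing in for the paper's construction, and the function names and the caller-supplied L2 sensitivity are hypothetical.

```python
import math
import random


def gaussian_sigma(epsilon, delta, l2_sensitivity):
    """Noise scale of the classical Gaussian mechanism.

    Choosing sigma >= sqrt(2 ln(1.25/delta)) * sensitivity / epsilon
    gives (epsilon, delta)-DP for 0 < epsilon < 1.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("classical bound requires 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon


def gaussian_mechanism(values, epsilon, delta, l2_sensitivity, rng=None):
    """Add i.i.d. Gaussian noise to each entry of a vector.

    Hypothetical stand-in: in a video pipeline the input would be some
    projected video representation; here it is just a list of floats.
    """
    rng = rng or random.Random(0)
    sigma = gaussian_sigma(epsilon, delta, l2_sensitivity)
    return [v + rng.gauss(0.0, sigma) for v in values]
```

A smaller \epsilon or \delta yields a larger noise scale, which is the privacy-utility trade-off that any (\epsilon, \delta) method has to manage.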

[CV-91] Aerial Infrared Health Monitoring of Solar Photovoltaic Farms at Scale

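[Quick read]: This paper addresses the fact that the true operational efficiency of large-scale solar photovoltaic (PV) farms is often unknown. It develops a comprehensive, data-driven framework for airborne infrared inspection of solar installations across North America. The key is to use high-resolution thermal imagery to build and curate a geographically diverse dataset covering thousands of PV sites, which supports machine-learning-based detection and localization of defects that are not detectable in the visible spectrum. The pipeline integrates advanced image processing, georeferencing, and airborne thermal infrared anomaly detection to provide rigorous estimates of performance losses. The paper also discusses practical considerations in aerial data collection, annotation methodology, and model deployment across a wide range of environmental and operational conditions. The work offers new insight into the reliability of large-scale solar assets and lays a foundation for research on performance trends, predictive maintenance, and scalable analytics in the renewable energy sector.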

Link: https://arxiv.org/abs/2503.02128
Authors: Isaac Corley, Conor Wallace, Sourav Agrawal, Burton Putrah, Jonathan Lwowski
Affiliations: Zeitview
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Solar photovoltaic (PV) farms represent a major source of global renewable energy generation, yet their true operational efficiency often remains unknown at scale. In this paper, we present a comprehensive, data-driven framework for large-scale airborne infrared inspection of North American solar installations. Leveraging high-resolution thermal imagery, we construct and curate a geographically diverse dataset encompassing thousands of PV sites, enabling machine learning-based detection and localization of defects that are not detectable in the visible spectrum. Our pipeline integrates advanced image processing, georeferencing, and airborne thermal infrared anomaly detection to provide rigorous estimates of performance losses. We highlight practical considerations in aerial data collection, annotation methodologies, and model deployment across a wide range of environmental and operational conditions. Our work delivers new insights into the reliability of large-scale solar assets and serves as a foundation for ongoing research on performance trends, predictive maintenance, and scalable analytics in the renewable energy sector.

[CV-92] HanDrawer: Leveraging Spatial Information to Render Realistic Hands Using a Conditional Diffusion Model in Single Stage

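[Quick read]: This paper addresses the difficulty diffusion models have in generating accurate hands in text-to-image generation, where errors such as an incorrect number of fingers or unnatural gestures cause severe artifacts. It proposes HanDrawer, a module that conditions the hand generation process. HanDrawer applies graph convolutional layers to extract the endogenous spatial structure and physical constraints implicit in MANO hand mesh vertices, then aligns and fuses these spatial features with other modalities via cross-attention. To improve the accuracy of spatial feature fusion, the paper proposes a Position-Preserving Zero Padding (PPZP) fusion strategy, which ensures that the features extracted by HanDrawer are fused into the region of interest in the relevant layers of the diffusion model. An additional hand reconstruction loss, combined with the denoising loss, lets HanDrawer learn features of the whole image while paying special attention to the hand region. Together these components improve the quality and accuracy of generated hands.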

Link: https://arxiv.org/abs/2503.02127
Authors: Qifan Fu, Xu Chen, Muhammad Asad, Shanxin Yuan, Changjae Oh, Gregory Slabaugh
Affiliations: Digital Environment Research Institute, Queen Mary University of London; School of Electronic Engineering and Computer Science, Queen Mary University of London; Department of Medicine, University of Cambridge; Queen Mary University of London; University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract:Although diffusion methods excel in text-to-image generation, generating accurate hand gestures remains a major challenge, resulting in severe artifacts, such as incorrect number of fingers or unnatural gestures. To enable the diffusion model to learn spatial information to improve the quality of the hands generated, we propose HanDrawer, a module to condition the hand generation process. Specifically, we apply graph convolutional layers to extract the endogenous spatial structure and physical constraints implicit in MANO hand mesh vertices. We then align and fuse these spatial features with other modalities via cross-attention. The spatially fused features are used to guide a single stage diffusion model denoising process for high quality generation of the hand region. To improve the accuracy of spatial feature fusion, we propose a Position-Preserving Zero Padding (PPZP) fusion strategy, which ensures that the features extracted by HanDrawer are fused into the region of interest in the relevant layers of the diffusion model. HanDrawer learns the entire image features while paying special attention to the hand region thanks to an additional hand reconstruction loss combined with the denoising loss. To accurately train and evaluate our approach, we perform careful cleansing and relabeling of the widely used HaGRID hand gesture dataset and obtain high quality multimodal data. Quantitative and qualitative analyses demonstrate the state-of-the-art performance of our method on the HaGRID dataset through multiple evaluation metrics. Source code and our enhanced dataset will be released publicly if the paper is accepted.

[CV-93] Parabolic Continual Learning

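[Quick read]: This paper addresses the difficulty of anticipating how continual learning algorithms behave under new realizations of data, focusing on a form of regularization that allows the error incurred through forgetting and the error induced through generalization to be analyzed. The key idea is to impose the properties of a parabolic partial differential equation (PDE) to regularize the expected behavior of the loss over time, with a memory buffer serving as the boundary on which boundary conditions are imposed. Using the memory buffer as a boundary enforces long-term dependencies by bounding the expected error by the boundary loss. The method is evaluated empirically on a series of continual learning tasks.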

Link: https://arxiv.org/abs/2503.02117
Authors: Haoming Yang, Ali Hasan, Vahid Tarokh
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Regularizing continual learning techniques is important for anticipating algorithmic behavior under new realizations of data. We introduce a new approach to continual learning by imposing the properties of a parabolic partial differential equation (PDE) to regularize the expected behavior of the loss over time. This class of parabolic PDEs has a number of favorable properties that allow us to analyze the error incurred through forgetting and the error induced through generalization. Specifically, we do this through imposing boundary conditions where the boundary is given by a memory buffer. By using the memory buffer as a boundary, we can enforce long term dependencies by bounding the expected error by the boundary loss. Finally, we illustrate the empirical performance of the method on a series of continual learning tasks.
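
To see why a parabolic PDE can let a boundary bound the interior, recall the weak maximum principle for the heat equation, a standard textbook result. The notation below (u, U, T, \Gamma_T) is generic and not taken from the paper; reading u as the expected loss and \Gamma_T as the memory buffer is only an analogy with the abstract, not the paper's actual statement.

```latex
% Weak maximum principle for the heat equation (generic notation).
% U \subset \mathbb{R}^n is open and bounded, U_T = U \times (0, T],
% and \Gamma_T = \overline{U_T} \setminus U_T is the parabolic boundary.
% Assume u \in C^{2}_{1}(U_T) \cap C(\overline{U_T}).
\[
  \partial_t u - \Delta u \le 0 \ \text{ in } U_T
  \quad\Longrightarrow\quad
  \max_{\overline{U_T}} u \;=\; \max_{\Gamma_T} u .
\]
```

In words, a subsolution attains its maximum on the parabolic boundary, so controlling the values on the boundary controls them everywhere in the domain.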

[CV-94] Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection CVPR2025

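[Quick read]: This paper tackles domain generalization (DG) for object detection, which aims to improve detector performance in unseen scenarios and is difficult because of complex domain variations in real-world applications. The key innovation is to use diffusion models to extract domain-invariant features. Rather than generating images, the method captures multi-step intermediate features during the diffusion process to obtain feature representations suited to generalized detection. The paper also designs an efficient knowledge transfer framework that lets detectors inherit the generalization capabilities of diffusion models through feature-level and object-level alignment, without increasing inference time. Experiments on six challenging DG benchmarks show an improvement of 14.0% mAP over existing DG approaches, and the method outperforms most domain adaptation methods even though it does not access any target domain data.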

Link: https://arxiv.org/abs/2503.02101
Authors: Boyong He, Yuxiang Ji, Qianwen Ye, Zhuoyue Tan, Liaoni Wu
Affiliations: Institute of Artificial Intelligence, Xiamen University; School of Aerospace Engineering, Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:Domain generalization (DG) for object detection aims to enhance detectors’ performance in unseen scenarios. This task remains challenging due to complex variations in real-world applications. Recently, diffusion models have demonstrated remarkable capabilities in diverse scene generation, which inspires us to explore their potential for improving DG tasks. Instead of generating images, our method extracts multi-step intermediate features during the diffusion process to obtain domain-invariant features for generalized detection. Furthermore, we propose an efficient knowledge transfer framework that enables detectors to inherit the generalization capabilities of diffusion models through feature and object-level alignment, without increasing inference time. We conduct extensive experiments on six challenging DG benchmarks. The results demonstrate that our method achieves substantial improvements of 14.0% mAP over existing DG approaches across different domains and corruption types. Notably, our method even outperforms most domain adaptation methods without accessing any target domain data. Moreover, the diffusion-guided detectors show consistent improvements of 15.9% mAP on average compared to the baseline. Our work aims to present an effective approach for domain-generalized detection and provide potential insights for robust visual recognition in real-world scenarios. The code is available at this https URL (Generalized Diffusion Detector).

[CV-95] Data Augmentation for NeRFs in the Low Data Limit ICRA2025

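[Quick read]: This paper addresses the poor performance of methods based on Neural Radiance Fields (NeRF) in the low data limit, particularly when training on incomplete scene data. Prior work augments training data only in next-best-view applications, which leads to hallucinations and model collapse when data is sparse. The key solution is to add a set of views during training by rejection sampling from a posterior uncertainty distribution, generated by combining a volumetric uncertainty estimator with spatial coverage. On partially observed scenes, the method performs 39.9% better on average than state-of-the-art baselines, with 87.5% less variability. The paper further shows that augmenting the training set by sampling from any distribution leads to better, more consistent scene reconstruction in sparse environments. The work is foundational for robotic tasks in resource-constrained, a priori unknown environments, where augmenting a dataset with informative data is critical.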

Link: https://arxiv.org/abs/2503.02092
Authors: Ayush Gaggar, Todd D. Murphey
Affiliations: Department of Mechanical Engineering at Northwestern University; National Science Foundation
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: To be published in 2025 IEEE International Conference on Robotics and Automation (ICRA 2025)

Abstract:Current methods based on Neural Radiance Fields fail in the low data limit, particularly when training on incomplete scene data. Prior works augment training data only in next-best-view applications, which leads to hallucinations and model collapse with sparse data. In contrast, we propose adding a set of views during training by rejection sampling from a posterior uncertainty distribution, generated by combining a volumetric uncertainty estimator with spatial coverage. We validate our results on partially observed scenes; on average, our method performs 39.9% better with 87.5% less variability across established scene reconstruction benchmarks, as compared to state-of-the-art baselines. We further demonstrate that augmenting the training set by sampling from any distribution leads to better, more consistent scene reconstruction in sparse environments. This work is foundational for robotic tasks where augmenting a dataset with informative data is critical in resource-constrained, a priori unknown environments. Videos and source code are available at this https URL.
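
Rejection sampling itself is easy to sketch. The example below is a minimal illustration, not the authors' implementation: the candidate views, the `view_score` combination of uncertainty and coverage, and all parameter names are hypothetical, since the abstract does not specify how the two terms are combined.

```python
import random


def view_score(view):
    """Hypothetical unnormalized density: prefer uncertain, poorly covered views."""
    return view["uncertainty"] * (1.0 - view["coverage"])


def rejection_sample_views(candidates, score, num_views, max_score,
                           rng=None, max_tries=10_000):
    """Draw views (with replacement) with probability proportional to score(view).

    Proposals are uniform over `candidates`; a proposal is accepted with
    probability score(view) / max_score, so max_score must upper-bound score.
    Fewer than num_views items are returned if max_tries is exhausted.
    """
    rng = rng or random.Random(0)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == num_views:
            break
        view = rng.choice(candidates)
        # Strict inequality: views with zero score are never accepted.
        if rng.random() * max_score < score(view):
            accepted.append(view)
    return accepted
```

The uniform proposal keeps the sketch short; a tighter proposal distribution would waste fewer draws when most candidate views have low scores.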

[CV-96] V2Dial: Unification of Video and Visual Dialog via Multimodal Experts CVPR2025

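[Quick read]: This paper observes that current multimodal models focus mainly on simpler tasks (such as visual question answering (VQA), video question answering (VideoQA), and video-text retrieval) and neglect the more challenging conversational counterparts, namely video dialog and visual dialog. It proposes V^2Dial, a novel expert-based model that, for the first time, jointly learns the spatial and temporal features of images and videos by routing them through dedicated experts, and aligns them using matching and contrastive learning techniques. The paper also systematically studies the domain shift between the two tasks, investigating whether and to what extent they can benefit from each other's training data. On the AVSD and VisDial datasets, the model achieves new state-of-the-art results across four benchmarks in both zero-shot and fine-tuning settings.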

Link: https://arxiv.org/abs/2503.02063
Authors: Adnen Abdessaied, Anna Rohrbach, Marcus Rohrbach, Andreas Bulling
Affiliations: University of Stuttgart, Germany; TU Darmstadt, Germany; hessian.AI, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:We present V^2Dial, a novel expert-based model specifically geared towards simultaneously handling image and video input data for multimodal conversational tasks. Current multimodal models primarily focus on simpler tasks (e.g., VQA, VideoQA, video-text retrieval) and often neglect the more challenging conversational counterparts, such as video and visual/image dialog. Moreover, works on both conversational tasks evolved separately from each other despite their apparent similarities, limiting their applicability potential. To this end, we propose to unify both tasks using a single model that for the first time jointly learns the spatial and temporal features of images and videos by routing them through dedicated experts and aligns them using matching and contrastive learning techniques. Furthermore, we systematically study the domain shift between the two tasks by investigating whether and to what extent these seemingly related tasks can mutually benefit from their respective training data. Extensive evaluations on the widely used video and visual dialog datasets of AVSD and VisDial show that our model achieves new state-of-the-art results across four benchmarks both in zero-shot and fine-tuning settings.

[CV-97] Robustness to Geographic Distribution Shift using Location Encoders ICLR2025

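[Quick read]: This paper addresses geographic distribution shift, which arises when the distribution of locations in the training data differs from the one seen at test time. Conventional approaches treat regions delimited by administrative boundaries (such as countries or continents) as separate domains and apply standard domain adaptation methods, ignoring the geographic coordinates that are often available as metadata. The key solution is to introduce location encoders: either simple sine-cosine encoders or pre-trained location encoders are used to improve standard domain adaptation methods so that they cope better with geographic distribution shift. The proposed methods achieve state-of-the-art results on geo-tagged imagery datasets from the WILDS benchmark.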

Link: https://arxiv.org/abs/2503.02036
Authors: Ruth Crasto
Affiliations: Microsoft
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025 Machine Learning for Remote Sensing (ML4RS) Workshop

Abstract:Geographic distribution shift arises when the distribution of locations on Earth in a training dataset is different from what is seen at test time. The most common approaches to tackling geographic distribution shift treat regions delimited by administrative boundaries such as countries or continents as separate domains and apply standard domain adaptation methods, ignoring geographic coordinates that are often available as metadata. This paper proposes the use of location encoders for training models that are more robust to geographic distribution shift. We show how both simple sine-cosine encoders and pre-trained location encoders can be used to improve standard domain adaptation methods for the special case of geographic distribution shift. Our proposed methods achieve state-of-the-art results on geo-tagged imagery datasets from the WILDS benchmark.
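
A simple sine-cosine location encoder can be sketched in a few lines. This is one common construction, offered as an assumption: the abstract does not give the paper's exact encoder, and the function name, the powers-of-two frequency schedule, and `num_scales` are all hypothetical.

```python
import math


def sincos_location_encoding(lat_deg, lon_deg, num_scales=4):
    """Encode a latitude/longitude pair as multi-scale sine-cosine features.

    Hypothetical helper: the scale schedule (powers of two) is an
    illustrative choice, not the paper's configuration.
    """
    # Map both coordinates to [-pi, pi] radians.
    lat = math.radians(lat_deg) * 2.0  # latitude spans [-90, 90] degrees
    lon = math.radians(lon_deg)        # longitude spans [-180, 180] degrees
    features = []
    for k in range(num_scales):
        freq = 2.0 ** k
        for angle in (lat, lon):
            features.append(math.sin(freq * angle))
            features.append(math.cos(freq * angle))
    return features
```

Because the features are periodic, longitudes of 180 and -180 degrees map to (numerically) the same vector, which avoids an artificial seam at the antimeridian. The resulting vector would typically be concatenated with, or used to condition, image features.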

[CV-98] Abn-BLIP: Abnormality-aligned Bootstrapping Language-Image Pre-training for Pulmonary Embolism Diagnosis and Report Generation from CTPA

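[Quick read]: This paper addresses the complexity of interpreting computed tomography pulmonary angiography (CTPA) scans and the difficulty of producing accurate, comprehensive radiology reports. It proposes Abn-BLIP (Abnormality-aligned Bootstrapping Language-Image Pretraining), a diagnosis model whose key is to use learnable queries and cross-modal attention mechanisms to align and detect abnormal findings while generating structured radiology reports. Experiments show that Abn-BLIP outperforms state-of-the-art medical vision-language models and 3D report generation methods in detecting abnormalities, reducing missed findings, and improving the accuracy and clinical relevance of reports.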

Link: https://arxiv.org/abs/2503.02034
Authors: Zhusi Zhong, Yuli Wang, Lulu Bi, Zhuoqi Ma, Sun Ho Ahn, Christopher J. Mullin, Colin F. Greineder, Michael K. Atalay, Scott Collins, Grayson L. Baird, Cheng Ting Lin, Webster Stayman, Todd M. Kolb, Ihab Kamel, Harrison X. Bai, Zhicheng Jiao
Affiliations: Department of Diagnostic Imaging, Brown University Health; Warren Alpert Medical School of Brown University; Department of Biomedical Engineering, Johns Hopkins University School of Medicine; Department of Radiology and Radiological Sciences, Johns Hopkins University School of Medicine; Department of Emergency Medicine and Department of Pharmacology, University of Michigan; Johns Hopkins University Division of Pulmonary and Critical Care Medicine; Department of Radiology, University of Colorado School of Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Medical imaging plays a pivotal role in modern healthcare, with computed tomography pulmonary angiography (CTPA) being a critical tool for diagnosing pulmonary embolism and other thoracic conditions. However, the complexity of interpreting CTPA scans and generating accurate radiology reports remains a significant challenge. This paper introduces Abn-BLIP (Abnormality-aligned Bootstrapping Language-Image Pretraining), an advanced diagnosis model designed to align abnormal findings to generate the accuracy and comprehensiveness of radiology reports. By leveraging learnable queries and cross-modal attention mechanisms, our model demonstrates superior performance in detecting abnormalities, reducing missed findings, and generating structured reports compared to existing methods. Our experiments show that Abn-BLIP outperforms state-of-the-art medical vision-language models and 3D report generation methods in both accuracy and clinical relevance. These results highlight the potential of integrating multimodal learning strategies for improving radiology reporting. The source code is available at this https URL.

[CV-99] Morpheus: Text-Driven 3D Gaussian Splat Shape and Color Stylization

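[Quick read]: This paper addresses a shortcoming of current novel-view synthesis stylization techniques: they cannot convincingly change geometry, because any geometry change requires increased style strength, which is often capped to keep stylization stable and consistent. The paper proposes a new autoregressive 3D Gaussian Splatting stylization method. Its key contribution is a new RGBD diffusion model that allows strength control over appearance and shape stylization. Consistency across stylized frames is ensured by a combination of depth-guided cross attention, feature injection, and a Warp ControlNet conditioned on composite frames.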

Link: https://arxiv.org/abs/2503.02009
Authors: Jamie Wynn, Zawar Qureshi, Jakub Powierza, Jamie Watson, Mohamed Sayed
Affiliations: Niantic Labs; Niantic; UCL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Exploring real-world spaces using novel-view synthesis is fun, and reimagining those worlds in a different style adds another layer of excitement. Stylized worlds can also be used for downstream tasks where there is limited training data and a need to expand a model’s training distribution. Most current novel-view synthesis stylization techniques lack the ability to convincingly change geometry. This is because any geometry change requires increased style strength which is often capped for stylization stability and consistency. In this work, we propose a new autoregressive 3D Gaussian Splatting stylization method. As part of this method, we contribute a new RGBD diffusion model that allows for strength control over appearance and shape stylization. To ensure consistency across stylized frames, we use a combination of novel depth-guided cross attention, feature injection, and a Warp ControlNet conditioned on composite frames for guiding the stylization of new frames. We validate our method via extensive qualitative results, quantitative experiments, and a user study. Code will be released online.

[CV-100] Road Boundary Detection Using 4D mmWave Radar for Autonomous Driving

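[Quick read]: This paper addresses the limitations of traditional road boundary detection for autonomous driving, which relies on cameras and LiDAR that are either vulnerable to poor lighting conditions or prohibitively expensive. It proposes 4DRadarRBD, a road boundary detection method based on 4D mmWave radar. The key idea is that road boundaries reflect millimeter waves and thus produce radar point cloud data; noisy points are first reduced via physical constraints, and a distance-based loss is then used to segment road boundary points by penalizing false detections of points far from the actual road boundaries. The method also captures the temporal dynamics of point cloud sequences by combining each point's deviation from the previous frame's vehicle-motion-compensated detection result with the spatial distribution of the point cloud for point-wise segmentation. In real-world driving tests, the method achieves a segmentation accuracy of 93%, with a median distance error of up to 0.023 m and an error reduction of 92.6% compared with the baseline model.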

Link: https://arxiv.org/abs/2503.01930
Authors: Yuyan Wu, Hae Young Noh
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Detecting road boundaries, the static physical edges of the available driving area, is important for safe navigation and effective path planning in autonomous driving and advanced driver-assistance systems (ADAS). Traditionally, road boundary detection in autonomous driving relies on cameras and LiDAR. However, they are vulnerable to poor lighting conditions, such as nighttime and direct sunlight glare, or prohibitively expensive for low-end vehicles. To this end, this paper introduces 4DRadarRBD, the first road boundary detection method based on 4D mmWave radar, which is cost-effective and robust in complex driving scenarios. The main idea is that road boundaries (e.g., fences, bushes, roadblocks) reflect millimeter waves, thus generating point cloud data for the radar. To overcome the challenge that the 4D mmWave radar point clouds contain many noisy points, we initially reduce noisy points via physical constraints for road boundaries and then segment the road boundary points from the noisy points by incorporating a distance-based loss which penalizes falsely detecting points far away from the actual road boundaries. In addition, we capture the temporal dynamics of point cloud sequences by utilizing each point’s deviation from the vehicle motion-compensated road boundary detection result obtained from the previous frame, along with the spatial distribution of the point cloud for point-wise road boundary segmentation. We evaluated 4DRadarRBD through real-world driving tests and achieved a road boundary point segmentation accuracy of 93%, with a median distance error of up to 0.023 m and an error reduction of 92.6% compared to the baseline model.
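
One way to read the distance-based loss is as a standard segmentation loss plus a term that grows with how far a falsely detected point lies from the true boundary. The sketch below follows that reading under stated assumptions: the abstract gives no formula, so the binary cross-entropy base, the weighting `lam`, and the function name are all hypothetical.

```python
import math


def distance_weighted_loss(probs, labels, dists, lam=1.0, eps=1e-7):
    """Binary cross-entropy plus a distance-weighted false-positive penalty.

    probs:  predicted probability that each point is a road boundary point
    labels: 1 for true boundary points, 0 otherwise
    dists:  distance (in meters) of each point to the nearest true boundary
    lam:    hypothetical weight of the distance penalty
    """
    bce = 0.0
    penalty = 0.0
    for p, y, d in zip(probs, labels, dists):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        # Only non-boundary points are penalized, in proportion to distance.
        penalty += (1 - y) * p * d
    return (bce + lam * penalty) / len(probs)
```

With this form, a confident false positive 5 m from the boundary costs more than the same prediction 0.1 m away, which matches the intent described in the abstract.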

[CV-101] Volume-Wise Task fMRI Decoding with Deep Learning: Enhancing Temporal Resolution and Cognitive Function Analysis

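[Quick read]: This paper addresses the low temporal resolution of conventional task functional Magnetic Resonance Imaging (tfMRI) decoding, which results from assuming that neural activity is temporally stationary. Conventional methods mostly rely on block-wise analysis, whose temporal resolution is on the order of tens of seconds, limiting how finely cognitive functions can be decoded. The paper proposes a deep neural network that identifies task states in tfMRI data volume by volume, overcoming the constraints of conventional methods. On the motor and gambling tfMRI datasets from the Human Connectome Project (HCP), the model achieves mean accuracies of 94.0% and 79.6%, respectively, substantially improving temporal resolution. The study also uses visualization algorithms to investigate dynamic brain mappings during different tasks, offering new methods and tools for deep learning-based frame-level tfMRI decoding.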

Link: https://arxiv.org/abs/2503.01925
Authors: Yueyang Wu, Sinan Yang, Yanming Wang, Jiajie He, Muhammad Mohsin Pathan, Bensheng Qiu, Xiaoxiao Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 11 figures

Abstract:In recent years, the application of deep learning in task functional Magnetic Resonance Imaging (tfMRI) decoding has led to significant advances. However, most studies remain constrained by the assumption of temporal stationarity in neural activity, resulting in predominantly block-wise analysis with limited temporal resolution on the order of tens of seconds. This limitation restricts the ability to decode cognitive functions in detail. To address these limitations, this study proposes a deep neural network designed for volume-wise identification of task states within tfMRI data, thereby overcoming the constraints of conventional methods. On the Human Connectome Project (HCP) motor and gambling tfMRI datasets, the model achieved impressive mean accuracy rates of 94.0% and 79.6%, respectively. These results demonstrate a substantial enhancement in temporal resolution, enabling more detailed exploration of cognitive functions. This study further employs visualization algorithms to investigate dynamic brain mappings during different tasks, marking a significant step forward in deep learning-based frame-level tfMRI decoding. This approach offers new methodologies and tools for examining dynamic changes in brain activities and understanding the underlying cognitive mechanisms.

[CV-102] Technical Report for ReID-SAM on SkiTB Visual Tracking Challenge 2025 WACV2025

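[Quick read]: This paper addresses the complexity of tracking skier appearance in skiing scenarios, aiming to improve tracking accuracy under challenging conditions such as multiple targets, occlusion, and fast motion. The key is to combine the SAMURAI tracker with an OSNet-based person re-identification (Re-ID) module, supplemented by advanced post-processing techniques. Equipment is tracked precisely using YOLOv11 with Kalman filtering or STARK-based object detection. On the SkiTB dataset, ReID-SAM achieves a state-of-the-art F1-score of 0.870.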

Link: https://arxiv.org/abs/2503.01907
Authors: Kunjun Li, Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang
Affiliations: Information Processing Lab, University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Technical report for 2nd solution of SkiTB Visual Tracking Challenge (WACV 2025)

Abstract:This report introduces ReID-SAM, a novel model developed for the SkiTB Challenge that addresses the complexities of tracking skier appearance. Our approach integrates the SAMURAI tracker with a person re-identification (Re-ID) module and advanced post-processing techniques to enhance accuracy in challenging skiing scenarios. We employ an OSNet-based Re-ID model to minimize identity switches and utilize YOLOv11 with Kalman filtering or STARK-based object detection for precise equipment tracking. When evaluated on the SkiTB dataset, ReID-SAM achieved a state-of-the-art F1-score of 0.870, surpassing existing methods across alpine, ski jumping, and freestyle skiing disciplines. These results demonstrate significant advancements in skier tracking accuracy and provide valuable insights for computer vision applications in winter sports.

[CV-103] Learning to Chain Operations by Routing Information Through a Global Workspace

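[Quick read]: This paper studies how a deep learning model can perform a sequential reasoning task in a way that mimics human System-2 reasoning. It proposes a model inspired by the Global Workspace Theory in which a controller uses a gating mechanism to selectively route information between specialized modules through a workspace, chaining operations by iteratively broadcasting information between specialized domains. On an addition task, for example, the two addends are processed by routing information sequentially through an Input module, an Increment module (multiple times), and finally an Output module. The paper also shows that the Global Workspace model, despite having fewer parameters, outperforms LSTMs and Transformers on unseen addition operations (both interpolations and extrapolations), indicating the potential of such architectures to enhance the reasoning capabilities of deep learning.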

Link: https://arxiv.org/abs/2503.01906
Authors: Hugo Chateau-Laurent, Rufin VanRullen
Affiliations: CerCo, CNRS UMR 5549, Université de Toulouse; CerCo, CNRS UMR 5549, Université de Toulouse and ANITI, Artificial and Natural Intelligence Toulouse Institute
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: 12 pages, 14 figures, submitted to CCN

Abstract:We present a model inspired by the Global Workspace Theory that integrates specialized modules to perform a sequential reasoning task. A controller selectively routes information between modules through the workspace using a gating mechanism. This approach allows the model to chain operations by iteratively broadcasting information between specialized domains, mimicking System-2 reasoning. We evaluate the model’s performance on a simple addition task, where two addends must be summed. The task can be solved by routing information sequentially through an Input module, an Increment module (multiple times), and finally an Output module. We consider two implementations of this system with increasing complexity. First, using hand-designed modules operating on one-hot digit representations, the controller (an LSTM recurrent network) learns to select the appropriate modules (input, increment, output) in the appropriate sequence. Second, we replace the hand-designed modules with learned representation modules for MNIST images and an increment module trained on the task objectives; here again, the controller learns the appropriate sequential module selection to solve the task. Finally, we show that the Global Workspace model, while having fewer parameters, outperforms LSTMs and Transformers when tested on unseen addition operations (both interpolations and extrapolations of addition operations seen during training). Our results highlight the potential of architectures inspired by the Global Workspace Theory to enhance deep learning’s reasoning capabilities.
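
The routing sequence for the hand-designed variant can be illustrated with a toy script. This is only a sketch of the Input, Increment, Output chain on one-hot digits: the controller here is a fixed, hand-written schedule standing in for the learned LSTM, the wrap-around at 10 is an assumption, and every function name is hypothetical.

```python
def one_hot(digit, size=10):
    """One-hot list encoding of a digit in [0, size)."""
    return [1 if i == digit else 0 for i in range(size)]


def input_module(digit):
    """Write the first addend into the workspace."""
    return one_hot(digit)


def increment_module(state):
    """Shift the one-hot state by one position (wraps around modulo its size)."""
    return [state[-1]] + state[:-1]


def output_module(state):
    """Read the digit currently held in the workspace."""
    return state.index(1)


def scripted_controller(b):
    """Stand-in for the learned controller: the module sequence for adding b."""
    return ["input"] + ["increment"] * b + ["output"]


def run_workspace(a, b):
    """Compute (a + b) mod 10 by routing through the modules in sequence."""
    workspace = None
    for name in scripted_controller(b):
        if name == "input":
            workspace = input_module(a)
        elif name == "increment":
            workspace = increment_module(workspace)
        else:
            return output_module(workspace)
```

For instance, `run_workspace(3, 4)` routes through Input once and Increment four times before Output reads 7. In the paper the interesting part is that the controller learns this schedule rather than having it scripted.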

[CV-104] What are You Looking at? Modality Contribution in Multimodal Medical Deep Learning Methods

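[Quick read]: This paper asks how deep neural networks process the individual source modalities when handling high-dimensional multimodal data. Many fusion methods have been developed, but how multimodal models actually use the information in each modality remains underexplored. The key solution is an occlusion-based method that is both model- and performance-agnostic and that quantitatively measures how important each modality in the dataset is for the model to fulfill its task. Applying the method to three different multimodal medical problems, the study finds that some networks have modality preferences that tend toward unimodal collapse, while some datasets are imbalanced from the ground up, and it identifies a link between the metric and the performance of networks trained on a single modality. This information can help improve the development of multimodal models and the creation of datasets, and supports the integration of multimodal AI into clinical practice.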

Link: https://arxiv.org/abs/2503.01904
Authors: Christian Gapp, Elias Tappeiner, Martin Welk, Karl Fritscher, Elke Ruth Gizewski, Rainer Schubert
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Contribution to Conference for Computer Assisted Radiology and Surgery (CARS 2025)

Abstract:Purpose: High-dimensional, multimodal data can nowadays be analyzed by huge deep neural networks with little effort. Several fusion methods for bringing together different modalities have been developed. Particularly in the field of medicine, with its abundance of high-dimensional multimodal patient data, multimodal models characterize the next step. However, what is yet very underexplored is how these models process the source information in detail. Methods: To this end, we implemented an occlusion-based modality contribution method that is both model- and performance-agnostic and that quantitatively measures the importance of each modality in the dataset for the model to fulfill its task. We applied our method to three different multimodal medical problems for experimental purposes. Results: Herein we found that some networks have modality preferences that tend to unimodal collapses, while some datasets are imbalanced from the ground up. Moreover, we could determine a link between our metric and the performance of single modality trained nets. Conclusion: The information gain through our metric holds remarkable potential to improve the development of multimodal models and the creation of datasets in the future. With our method we make a crucial contribution to the field of interpretability in deep learning based multimodal research and thereby notably push the integrability of multimodal AI into clinical practice. Our code is publicly available at this https URL.
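
The occlusion idea can be sketched as follows. This is an illustrative reading, not the authors' code: occluding a modality by replacing it with its dataset mean, measuring the absolute change in a scalar model output, and normalizing the scores to sum to one are all assumptions, and the names `modality_contributions` and `toy_model` are hypothetical.

```python
def modality_contributions(model, dataset, modalities):
    """Estimate each modality's share of influence on the model output.

    model:      callable mapping a sample (dict: modality -> list of floats)
                to a scalar output; no labels are needed
    dataset:    list of samples
    modalities: names of the modalities to occlude one at a time
    """
    # Per-modality mean vector, used as the occlusion value.
    means = {
        m: [sum(s[m][j] for s in dataset) / len(dataset)
            for j in range(len(dataset[0][m]))]
        for m in modalities
    }
    raw = {}
    for m in modalities:
        total = 0.0
        for sample in dataset:
            occluded = dict(sample)
            occluded[m] = means[m]  # hide this modality's information
            total += abs(model(sample) - model(occluded))
        raw[m] = total / len(dataset)
    norm = sum(raw.values())
    if norm == 0.0:
        return {m: 0.0 for m in modalities}
    return {m: raw[m] / norm for m in modalities}


def toy_model(sample):
    """Toy stand-in model that weights the image modality three times more."""
    return 3.0 * sample["image"][0] + 1.0 * sample["tabular"][0]
```

On a toy dataset where both modalities take the values 0 to 3, the scores come out as 0.75 for `image` and 0.25 for `tabular`, mirroring the 3:1 weights. A score near 1.0 for a single modality would be the kind of unimodal collapse the abstract describes.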

[CV-105] FASTer: Focal Token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection

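【Quick read】: This paper addresses two main problems in LiDAR-based temporal 3D detectors: indiscriminate sampling and fusion overlook the varying contributions of individual points and cause complexity to grow exponentially with the number of input frames, and arbitrary result-level concatenation limits global information extraction. The proposed FASTer (Focal Token Acquiring-and-Scaling Transformer) dynamically selects focal tokens and condenses token sequences in an adaptive, lightweight manner. To emphasize the contribution of individual tokens, a simple but effective Adaptive Scaling mechanism captures geometric context while sifting out focal points. Adaptively storing and processing only the focal points of historical frames greatly reduces overall complexity. A new Grouped Hierarchical Fusion strategy progressively performs sequence scaling and Intra-Group Fusion to promote the exchange of global spatial and temporal information. Experiments show that FASTer significantly outperforms other state-of-the-art detectors in both performance and efficiency, with better flexibility and robustness. The code is available at the provided link.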

Link: https://arxiv.org/abs/2503.01899
Authors: Chenxu Dang, Zaipeng Duan, Pei An, Xinmin Zhang, Xuzhong Hu, Jie Ma
Affiliations: Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures

Abstract: Recent top-performing temporal 3D detectors based on LiDAR have increasingly adopted region-based paradigms. They first generate coarse proposals, followed by encoding and fusing regional features. However, indiscriminate sampling and fusion often overlook the varying contributions of individual points and lead to exponentially increased complexity as the number of input frames grows. Moreover, arbitrary result-level concatenation limits the global information extraction. In this paper, we propose a Focal Token Acquiring-and-Scaling Transformer (FASTer), which dynamically selects focal tokens and condenses token sequences in an adaptive and lightweight manner. Emphasizing the contribution of individual tokens, we propose a simple but effective Adaptive Scaling mechanism to capture geometric contexts while sifting out focal points. Adaptively storing and processing only focal points in historical frames dramatically reduces the overall complexity. Furthermore, a novel Grouped Hierarchical Fusion strategy is proposed, progressively performing sequence scaling and Intra-Group Fusion operations to facilitate the exchange of global spatial and temporal information. Experiments on the Waymo Open Dataset demonstrate that our FASTer significantly outperforms other state-of-the-art detectors in both performance and efficiency while also exhibiting improved flexibility and robustness. The code is available at this https URL.

[CV-106] LIVS: A Pluralistic Alignment Dataset for Inclusive Public Spaces

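【Quick read】: This paper addresses multi-criteria alignment of text-to-image (T2I) models for inclusive urban planning, in particular how to improve generative AI models by incorporating community preferences. The key contribution is the Local Intersectional Visual Spaces (LIVS) dataset, built through a two-year participatory process with 30 community organizations. It contains 37,710 pairwise comparisons and encodes six core criteria: Accessibility, Safety, Comfort, Invitingness, Inclusivity, and Diversity. Fine-tuning Stable Diffusion XL with Direct Preference Optimization (DPO) produced a measurable increase in alignment with community preferences, although a large share of neutral ratings highlights the complexity of modeling intersectional needs. As annotation volume increases, accuracy shifts further toward the DPO-tuned model, suggesting that larger-scale preference data makes fine-tuning more effective. The paper's core message is that context-specific, stakeholder-driven criteria should be integrated into generative modeling to improve alignment across diverse socio-spatial contexts.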

Link: https://arxiv.org/abs/2503.01894
Authors: Rashid Mushkani, Shravan Nayak, Hugo Berard, Allison Cohen, Shin Koseki, Hadrien Bertrand
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 30 pages, 19 figures

Abstract:We introduce the Local Intersectional Visual Spaces (LIVS) dataset, a benchmark for multi-criteria alignment of text-to-image (T2I) models in inclusive urban planning. Developed through a two-year participatory process with 30 community organizations, LIVS encodes diverse spatial preferences across 634 initial concepts, consolidated into six core criteria: Accessibility, Safety, Comfort, Invitingness, Inclusivity, and Diversity, through 37,710 pairwise comparisons. Using Direct Preference Optimization (DPO) to fine-tune Stable Diffusion XL, we observed a measurable increase in alignment with community preferences, though a significant proportion of neutral ratings highlights the complexity of modeling intersectional needs. Additionally, as annotation volume increases, accuracy shifts further toward the DPO-tuned model, suggesting that larger-scale preference data enhances fine-tuning effectiveness. LIVS underscores the necessity of integrating context-specific, stakeholder-driven criteria into generative modeling and provides a resource for evaluating AI alignment methodologies across diverse socio-spatial contexts.
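For readers unfamiliar with DPO, the sketch below shows the standard DPO objective for a single preference pair, written in terms of generic log-probabilities under the tuned model and a frozen reference model. It is an illustration only: the function name and the `beta` default are assumptions, and adaptations of DPO to diffusion models typically substitute a different surrogate for the log-probabilities, which this sketch does not attempt to model.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*:     log-probabilities under the model being tuned.
    ref_logp_*: log-probabilities under the frozen reference model.
    beta:       strength of the implicit KL constraint (assumed value).
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The tuned model favors the preferred sample more than the reference does.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
# The tuned model favors the rejected sample instead.
worsened = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

When the tuned model and the reference agree, the loss equals log 2; it falls below that value as the tuned model shifts probability toward the preferred sample.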

[CV-107] Recognition of Dysarthria in Amyotrophic Lateral Sclerosis patients using Hypernetworks

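【Quick read】: This paper addresses the recognition of dysarthria in patients with Amyotrophic Lateral Sclerosis (ALS). Existing studies recognize dysarthria by predicting the clinical standard ALS Functional Rating Scale Revised (ALSFRS-R), and rely on feature extraction strategies and customized convolutional neural networks (CNNs) followed by dense layers. Recent work, however, shows that networks built on input-conditional computation offer faster training, better performance, and more flexibility. To overcome the limitations of existing methods, the paper presents the first study that uses hypernetworks for dysarthria recognition. Audio files are converted into a log-Mel spectrogram with its delta and delta-delta, the resulting image is passed through a pretrained, modified AlexNet, and a hypernetwork then generates the weights of a target network. On VOC-ALS, a newly collected and publicly available dataset, the approach reaches an accuracy of up to 82.66%, outperforming strong baselines including multimodal fusion methods. An ablation study supports the effectiveness of the method, and the authors report advantages in generalization ability, parameter efficiency, and robustness.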

Link: https://arxiv.org/abs/2503.01892
Authors: Loukas Ilias, Dimitris Askounis
Affiliations: DSS Lab, School of ECE, NTUA
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Abstract:Amyotrophic Lateral Sclerosis (ALS) constitutes a progressive neurodegenerative disease with varying symptoms, including decline in speech intelligibility. Existing studies, which recognize dysarthria in ALS patients by predicting the clinical standard ALSFRS-R, rely on feature extraction strategies and the design of customized convolutional neural networks followed by dense layers. However, recent studies have shown that neural networks adopting the logic of input-conditional computations enjoy a series of benefits, including faster training, better performance, and flexibility. To resolve these issues, we present the first study incorporating hypernetworks for recognizing dysarthria. Specifically, we use audio files, convert them into log-Mel spectrogram, delta, and delta-delta, and pass the resulting image through a pretrained modified AlexNet model. Finally, we use a hypernetwork, which generates weights for a target network. Experiments are conducted on a newly collected publicly available dataset, namely VOC-ALS. Results showed that the proposed approach reaches Accuracy up to 82.66% outperforming strong baselines, including multimodal fusion methods, while findings from an ablation study demonstrated the effectiveness of the introduced methodology. Overall, our approach incorporating hypernetworks obtains valuable advantages over state-of-the-art results in terms of generalization ability, parameter efficiency, and robustness.
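The core idea of a hypernetwork, one network producing the weights of another, can be sketched in a few lines. The example below is a toy illustration under stated assumptions: `TinyHyperNetwork` and `target_forward` are hypothetical names, the hypernetwork is a single linear map, and the conditioning vector is simply the input feature vector (a stand-in for the AlexNet features mentioned in the abstract). The paper's actual architecture and conditioning signal are not specified here.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinyHyperNetwork:
    """Linear hypernetwork that emits the parameters of a linear target layer."""

    def __init__(self, cond_dim, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.in_dim = in_dim
        self.out_dim = out_dim
        n_params = out_dim * in_dim + out_dim  # target weights plus biases
        self.generator = [
            [rng.uniform(-0.1, 0.1) for _ in range(cond_dim)]
            for _ in range(n_params)
        ]

    def generate(self, cond):
        """Map a conditioning vector to (weights, bias) of the target layer."""
        flat = matvec(self.generator, cond)
        split = self.out_dim * self.in_dim
        weights = [
            flat[row * self.in_dim:(row + 1) * self.in_dim]
            for row in range(self.out_dim)
        ]
        return weights, flat[split:]


def target_forward(weights, bias, features):
    """Apply the generated linear layer to the input features."""
    return [z + b for z, b in zip(matvec(weights, features), bias)]


# Hypothetical 3-dimensional feature vector standing in for CNN features.
features = [0.5, -1.0, 2.0]
hyper = TinyHyperNetwork(cond_dim=3, in_dim=3, out_dim=2)
weights, bias = hyper.generate(features)
logits = target_forward(weights, bias, features)
```

Because the target layer's parameters depend on the conditioning vector, two different inputs are processed by two different sets of weights, which is what "input-conditional computation" means in this context.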

[CV-108] Nexus-O: An Omni-Perceptive And -Interactive Model for Language Audio And Vision

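【Quick read】: This paper addresses how to build a model that efficiently processes multimodal data (audio, image, video, and text) and achieves cross-modal alignment, understanding, and reasoning. It proposes Nexus-O, an industry-level omni-perceptive and -interactive model, and tackles the core challenges in three ways. First, the model is designed and pre-trained on top of a vision-language model rather than a language model, which gives it tri-modal perception. Second, a new audio testbed, Nexus-O-audio, containing diverse speech samples from real-world scenarios, is introduced to evaluate the robustness of tri-modal models. Third, a speech data synthesis pipeline is designed to obtain high-quality training data covering real-world scenarios. According to the authors, these strategies together give the model its advantages on downstream tasks.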

Link: https://arxiv.org/abs/2503.01879
Authors: Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Haohan Li, Yu Lu, Shilin Zhou, Yue Lu, Ziliang Gan, Ziao Wang, Junwei Liao, Haipang Wu, Ji Liu, André Freitas, Qifan Wang, Zenglin Xu, Rongjuncheng Zhang, Yong Dai
Affiliations: HiThink Research, China; Imperial College London, UK; University of Manchester, UK; Zhejiang University, China; Fudan University, China; Soochow University, China; Baptist University, HK; Microsoft, USA; Meta AI, USA; Idiap Research Institute, Switzerland
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract: Human beings perceive the real world through a spectrum of sensory modalities, encompassing auditory, visual, and linguistic faculties. The journey towards achieving Artificial General Intelligence (AGI) necessitates the development of models that can emulate these multifaceted perceptual capabilities and comprehensively understand these diversified data. To this end, we introduce Nexus-O, an industry-level omni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data in any combination and outputting audio/text in an end-to-end way. We systematically investigate Nexus-O by addressing three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities? Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios? Third, what strategies can be employed to curate and obtain high-quality, real-life scenario speech datasets? For the first question, we design and pre-train Nexus-O based on the vision-language model, rather than the language model. By pre-training the model over high-quality synthetic audio data, our model is capable of tri-modal perception and interaction. For the second question, we introduce a new audio testbed, Nexus-O-audio, comprising diverse Automatic Speech Recognition (ASR) samples, spanning various real-world scenarios, such as corporate meetings and live streams. For the third question, we design the speech data synthesis pipeline to obtain high-quality speech training datasets, covering various real-world scenarios. Comprehensive experimentation and an in-depth analysis of tri-modal alignment over latent space demonstrate the advantages of our model on downstream tasks.

[CV-109] FairGen: Controlling Sensitive Attributes for Fair Generations in Diffusion Models via Adaptive Latent Guidance

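【Quick read】: This paper addresses generation bias toward specific demographic groups in text-to-image diffusion models, for example producing more males than females when asked to generate images of engineers, which raises ethical concerns and limits adoption. It proposes FairGen, an adaptive latent guidance mechanism that controls the generation distribution during inference. In FairGen, a latent guidance module dynamically adjusts the diffusion process to enforce specific attributes, while a memory module tracks generation statistics and steers latent guidance so that the attribute values follow the targeted fair distribution. To assess bias in diffusion models more comprehensively, the paper also introduces HBE, a holistic bias evaluation benchmark. Extensive evaluations show that FairGen outperforms existing bias mitigation approaches, for example achieving a 68.5% gender bias reduction on Stable Diffusion 2. Ablation studies further show that FairGen can control the generation distribution flexibly and precisely at any user-specified granularity, enabling adaptive and targeted bias mitigation.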

Link: https://arxiv.org/abs/2503.01872
Authors: Mintong Kang, Vinayshekhar Bannihatti Kumar, Shamik Roy, Abhishek Kumar, Sopan Khosla, Balakrishnan Murali Narayanaswamy, Rashmi Gangadharaiah
Affiliations: UIUC; AWS AI Labs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under submission

Abstract:Text-to-image diffusion models often exhibit biases toward specific demographic groups, such as generating more males than females when prompted to generate images of engineers, raising ethical concerns and limiting their adoption. In this paper, we tackle the challenge of mitigating generation bias towards any target attribute value (e.g., “male” for “gender”) in diffusion models while preserving generation quality. We propose FairGen, an adaptive latent guidance mechanism which controls the generation distribution during inference. In FairGen, a latent guidance module dynamically adjusts the diffusion process to enforce specific attributes, while a memory module tracks the generation statistics and steers latent guidance to align with the targeted fair distribution of the attribute values. Further, given the limitations of existing datasets in comprehensively assessing bias in diffusion models, we introduce a holistic bias evaluation benchmark HBE, covering diverse domains and incorporating complex prompts across various applications. Extensive evaluations on HBE and Stable Bias datasets demonstrate that FairGen outperforms existing bias mitigation approaches, achieving substantial bias reduction (e.g., 68.5% gender bias reduction on Stable Diffusion 2). Ablation studies highlight FairGen’s ability to flexibly and precisely control generation distribution at any user-specified granularity, ensuring adaptive and targeted bias mitigation.
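The role of a memory module that tracks generation statistics can be illustrated with a small bookkeeping sketch. Everything here is an assumption made for illustration: `GenerationMemory` is a hypothetical class, and choosing the most under-represented attribute value as the next guidance target is just one simple policy. How FairGen actually steers the latent guidance is not reproduced.

```python
from collections import Counter


class GenerationMemory:
    """Track generated attribute values and suggest the next guidance target."""

    def __init__(self, target):
        # target: attribute value -> desired share, e.g. {"female": 0.5, ...}
        self.target = dict(target)
        self.counts = Counter()

    def record(self, value):
        """Register the attribute value observed in a generated image."""
        self.counts[value] += 1

    def next_target(self):
        """Return the value whose observed share is furthest below its target."""
        total = sum(self.counts.values())

        def deficit(value):
            share = self.counts[value] / total if total else 0.0
            return self.target[value] - share

        return max(self.target, key=deficit)


memory = GenerationMemory({"female": 0.5, "male": 0.5})
for observed in ["male", "male", "female"]:
    memory.record(observed)
first_choice = memory.next_target()

# Simulate 100 further generations that always follow the suggestion.
for _ in range(100):
    memory.record(memory.next_target())
```

Under this policy the running counts stay within one sample of the 50/50 target, which shows how tracking statistics over time can pull the overall distribution toward a chosen target even when each single generation enforces only one attribute value.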

[CV-110] MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention

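【Quick read】: This paper addresses the loss of modality-specific structure that occurs when aligning histopathology images with transcriptomics data, two modalities with pronounced heterogeneity. Conventional methods focus mainly on modality alignment and pay too little attention to retaining the structure specific to each modality. The proposed MIRROR method uses dedicated encoders to extract comprehensive features for each modality, a modality alignment module to integrate phenotype patterns with molecular profiles, a modality retention module to safeguard the unique attributes of each modality, and a style clustering module that reduces redundancy and enhances disease-relevant information by modeling and aligning consistent pathological signatures within a clustering space. Together these components aim to achieve effective multimodal fusion while preserving modality specificity.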

Link: https://arxiv.org/abs/2503.00374
Authors: Tianyi Wang, Jianan Fan, Dingxin Zhang, Dongnan Liu, Yong Xia, Heng Huang, Weidong Cai
Affiliations: School of Computer Science, The University of Sydney, Sydney, NSW, 2006, Australia; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, 710072, China, with Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518057, China, and also with the Ningbo Institute of Northwestern Polytechnical University, Ningbo 315048, China; Department of Computer Science, University of Maryland, College Park, MD 20742, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 10 pages, 5 figures, 3 tables

Abstract:Histopathology and transcriptomics are fundamental modalities in oncology, encapsulating the morphological and molecular aspects of the disease. Multi-modal self-supervised learning has demonstrated remarkable potential in learning pathological representations by integrating diverse data sources. Conventional multi-modal integration methods primarily emphasize modality alignment, while paying insufficient attention to retaining the modality-specific structures. However, unlike conventional scenarios where multi-modal inputs share highly overlapping features, histopathology and transcriptomics exhibit pronounced heterogeneity, offering orthogonal yet complementary insights. Histopathology provides morphological and spatial context, elucidating tissue architecture and cellular topology, whereas transcriptomics delineates molecular signatures through gene expression patterns. This inherent disparity introduces a major challenge in aligning them while maintaining modality-specific fidelity. To address these challenges, we present MIRROR, a novel multi-modal representation learning method designed to foster both modality alignment and retention. MIRROR employs dedicated encoders to extract comprehensive features for each modality, which is further complemented by a modality alignment module to achieve seamless integration between phenotype patterns and molecular profiles. Furthermore, a modality retention module safeguards unique attributes from each modality, while a style clustering module mitigates redundancy and enhances disease-relevant information by modeling and aligning consistent pathological signatures within a clustering space. Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis highlight MIRROR’s superior performance, demonstrating its effectiveness in constructing comprehensive oncological feature representations and benefiting the cancer diagnosis.

[CV-111] Efficient Diffusion as Low Light Enhancer

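【Quick read】: This paper addresses the heavy computational burden of the iterative sampling process in diffusion-based Low-Light Image Enhancement (LLIE). Current acceleration methods, whether training-based or training-free, often cause significant performance degradation, which highlights the trade-off between performance and efficiency. The paper identifies two main causes of the degradation: fitting errors and the inference gap. Its key insight is that fitting errors can be mitigated by linearly extrapolating the incorrect score functions, and that the inference gap can be reduced by shifting the Gaussian flow to a reflectance-aware residual space. Building on this, the paper designs the Reflectance-Aware Trajectory Refinement (RATR) module, which refines the teacher trajectory using the reflectance component of images, and then introduces Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT), an efficient and flexible distillation framework tailored for LLIE. In just 2 steps the framework matches the performance of previous diffusion-based methods that use redundant steps, and it sets new state-of-the-art (SOTA) results with 8 or 4 steps.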

Link: https://arxiv.org/abs/2410.12346
Authors: Guanzhou Lan, Qianli Ma, Yuqi Yang, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
Affiliations: Northwestern Polytechnical University; Shanghai Jiao Tong University; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract: The computational burden of the iterative sampling process remains a major challenge in diffusion-based Low-Light Image Enhancement (LLIE). Current acceleration methods, whether training-based or training-free, often lead to significant performance degradation, highlighting the trade-off between performance and efficiency. In this paper, we identify two primary factors contributing to performance degradation: fitting errors and the inference gap. Our key insight is that fitting errors can be mitigated by linearly extrapolating the incorrect score functions, while the inference gap can be reduced by shifting the Gaussian flow to a reflectance-aware residual space. Based on the above insights, we design the Reflectance-Aware Trajectory Refinement (RATR) module, a simple yet effective module to refine the teacher trajectory using the reflectance component of images. Following this, we introduce Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT), an efficient and flexible distillation framework tailored for LLIE. Our framework achieves comparable performance to previous diffusion-based methods with redundant steps in just 2 steps while establishing new state-of-the-art (SOTA) results with 8 or 4 steps. Comprehensive experimental evaluations on 10 benchmark datasets validate the effectiveness of our method, consistently outperforming existing SOTA methods.

[CV-112] SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models

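【Quick read】: This paper addresses the limitations of the datasets available for AI in computational pathology: existing public datasets are often limited in organ diversity, class coverage, or annotation quality. To fill this gap, it introduces SPIDER (Supervised Pathology Image-DEscription Repository), described as the largest publicly available patch-level dataset covering multiple organ types (including Skin, Colorectal, and Thorax), with comprehensive class coverage for each organ. SPIDER provides high-quality annotations verified by expert pathologists and includes surrounding context patches, which improve classification performance by supplying spatial context. The paper also presents baseline models that combine the Hibou-L foundation model as a feature extractor with an attention-based classification head; these reach state-of-the-art performance across multiple tissue categories and serve as strong benchmarks for future digital pathology research. Beyond patch classification, the model enables rapid identification of significant areas and quantitative tissue metrics, and lays a foundation for multimodal approaches.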

Link: https://arxiv.org/abs/2503.02876
Authors: Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
Affiliations: HistAI
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Advancing AI in computational pathology requires large, high-quality, and diverse datasets, yet existing public datasets are often limited in organ diversity, class coverage, or annotation quality. To bridge this gap, we introduce SPIDER (Supervised Pathology Image-DEscription Repository), the largest publicly available patch-level dataset covering multiple organ types, including Skin, Colorectal, and Thorax, with comprehensive class coverage for each organ. SPIDER provides high-quality annotations verified by expert pathologists and includes surrounding context patches, which enhance classification performance by providing spatial context. Alongside the dataset, we present baseline models trained on SPIDER using the Hibou-L foundation model as a feature extractor combined with an attention-based classification head. The models achieve state-of-the-art performance across multiple tissue categories and serve as strong benchmarks for future digital pathology research. Beyond patch classification, the model enables rapid identification of significant areas, quantitative tissue metrics, and establishes a foundation for multimodal approaches. Both the dataset and trained models are publicly available to advance research, reproducibility, and AI-driven pathology development. Access them at: this https URL

[CV-113] Undertrained Image Reconstruction for Realistic Degradation in Blind Image Super-Resolution

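【Quick read】: This paper addresses the poor performance of existing super-resolution (SR) models on real-world low-resolution (LR) images. The problem arises because the degradation in synthetic datasets differs from the complex degradation in real-world LR images, which is caused by factors such as the imaging process and JPEG compression. The key idea is a dataset generation method based on undertrained image reconstruction models, which reconstruct low-quality images with diverse degradation from their inputs. Using this property, the authors generate LR images with diverse degradation from high-resolution (HR) images to build datasets. Fine-tuning pre-trained SR models on these datasets improves noise removal and blur reduction, and thereby performance on real-world LR images. An analysis of the datasets indicates that degradation diversity contributes to the improvement, whereas color differences between HR and LR images may degrade performance.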

Link: https://arxiv.org/abs/2503.02767
Authors: Ru Ito, Supatta Viriyavisuthisakul, Kazuhiko Kawamoto, Hiroshi Kera
Affiliations: Chiba University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 11 figures, 2 tables

Abstract: Most super-resolution (SR) models struggle with real-world low-resolution (LR) images. This issue arises because the degradation characteristics in the synthetic datasets differ from those in real-world LR images. Since SR models are trained on pairs of high-resolution (HR) and LR images generated by downsampling, they are optimized for simple degradation. However, real-world LR images contain complex degradation caused by factors such as the imaging process and JPEG compression. Due to these differences in degradation characteristics, most SR models perform poorly on real-world LR images. This study proposes a dataset generation method using undertrained image reconstruction models. These models have the property of reconstructing low-quality images with diverse degradation from input images. By leveraging this property, this study generates LR images with diverse degradation from HR images to construct the datasets. Fine-tuning pre-trained SR models on our generated datasets improves noise removal and blur reduction, enhancing performance on real-world LR images. Furthermore, an analysis of the datasets reveals that degradation diversity contributes to performance improvements, whereas color differences between HR and LR images may degrade performance.

[CV-114] TReND: Transformer derived features and Regularized NMF for neonatal functional network Delineation

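【Quick read】: This paper addresses the precise parcellation of functional networks (FNs) in the early developing human brain, which is the basis for identifying biomarkers of developmental disorders and understanding functional development. Because network maturation is incomplete in neonates, adult FN parcellations cannot be applied to them directly, and no standardized neonatal functional atlas is currently available. The paper proposes TReND, a fully automated, self-supervised transformer-autoencoder framework that integrates regularized nonnegative matrix factorization (RNMF) to reveal neonatal FNs. TReND integrates confidence-adaptive masks into the transformer self-attention layers to mitigate the influence of noise, uses a self-supervised decoder as a regulator to refine the encoder's latent embeddings, and combines brain-surface-based geodesic distances as spatial encodings with functional connectivity derived from temporal features to promote spatial coherence and functional homogeneity. Validation on simulated, dHCP, and HCP-YA datasets shows that TReND delineates neonatal FNs with significantly better spatial contiguity and functional homogeneity than comparable traditional feature extraction and clustering techniques.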

Link: https://arxiv.org/abs/2503.02685
Authors: Sovesh Mohapatra, Minhui Ouyang, Shufang Tan, Jianlin Guo, Lianglong Sun, Yong He, Hao Huang
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Quantitative Methods (q-bio.QM)
Comments: 10 pages, 5 figures

Abstract:Precise parcellation of functional networks (FNs) of early developing human brain is the fundamental basis for identifying biomarker of developmental disorders and understanding functional development. Resting-state fMRI (rs-fMRI) enables in vivo exploration of functional changes, but adult FN parcellations cannot be directly applied to the neonates due to incomplete network maturation. No standardized neonatal functional atlas is currently available. To solve this fundamental issue, we propose TReND, a novel and fully automated self-supervised transformer-autoencoder framework that integrates regularized nonnegative matrix factorization (RNMF) to unveil the FNs in neonates. TReND effectively disentangles spatiotemporal features in voxel-wise rs-fMRI data. The framework integrates confidence-adaptive masks into transformer self-attention layers to mitigate noise influence. A self supervised decoder acts as a regulator to refine the encoder’s latent embeddings, which serve as reliable temporal features. For spatial coherence, we incorporate brain surface-based geodesic distances as spatial encodings along with functional connectivity from temporal features. The TReND clustering approach processes these features under sparsity and smoothness constraints, producing robust and biologically plausible parcellations. We extensively validated our TReND framework on three different rs-fMRI datasets: simulated, dHCP and HCP-YA against comparable traditional feature extraction and clustering techniques. Our results demonstrated the superiority of the TReND framework in the delineation of neonate FNs with significantly better spatial contiguity and functional homogeneity. Collectively, we established TReND, a novel and robust framework, for neonatal FN delineation. TReND-derived neonatal FNs could serve as a neonatal functional atlas for perinatal populations in health and disease.

[CV-115] ZAPBench: A Benchmark for Whole-Brain Activity Prediction in Zebrafish

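【Quick read】: This paper addresses the problem of predicting cellular-resolution neural activity throughout an entire vertebrate brain. It introduces the Zebrafish Activity Prediction Benchmark (ZAPBench), which is based on a dataset of 4D light-sheet microscopy recordings of more than 70,000 neurons in a larval zebrafish brain, together with motion-stabilized and voxel-level cell segmentations of these data that support the development of a variety of forecasting methods. Initial results show that time series and volumetric video modeling approaches outperform naive baselines but leave room for improvement. The specific brain used in the activity recording is also undergoing synaptic-level anatomical mapping, which will allow detailed structural information to be integrated into forecasting methods in the future.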

Link: https://arxiv.org/abs/2503.02618
Authors: Jan-Matthis Lueckmann, Alexander Immer, Alex Bo-Yuan Chen, Peter H. Li, Mariela D. Petkova, Nirmala A. Iyer, Luuk Willem Hesselink, Aparna Dev, Gudrun Ihrke, Woohyun Park, Alyson Petruncio, Aubrey Weigel, Wyatt Korff, Florian Engert, Jeff W. Lichtman, Misha B. Ahrens, Michał Januszewski, Viren Jain
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Data-driven benchmarks have led to significant progress in key scientific modeling domains including weather and structural biology. Here, we introduce the Zebrafish Activity Prediction Benchmark (ZAPBench) to measure progress on the problem of predicting cellular-resolution neural activity throughout an entire vertebrate brain. The benchmark is based on a novel dataset containing 4d light-sheet microscopy recordings of over 70,000 neurons in a larval zebrafish brain, along with motion stabilized and voxel-level cell segmentations of these data that facilitate development of a variety of forecasting methods. Initial results from a selection of time series and volumetric video modeling approaches achieve better performance than naive baseline methods, but also show room for further improvement. The specific brain used in the activity recording is also undergoing synaptic-level anatomical mapping, which will enable future integration of detailed structural information into forecasting methods.
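To make "naive baseline methods" concrete, the sketch below implements two common forecasting baselines and scores them with mean absolute error. This is an assumption-laden illustration: the abstract does not say which baselines or which metric the benchmark uses, and the function names and the tiny two-neuron example are hypothetical.

```python
def persistence_forecast(history, horizon):
    """Repeat each neuron's last observed value for `horizon` steps."""
    last = history[-1]
    return [list(last) for _ in range(horizon)]


def mean_forecast(history, horizon):
    """Predict each neuron's average over the history window."""
    n_steps = len(history)
    means = [sum(col) / n_steps for col in zip(*history)]
    return [list(means) for _ in range(horizon)]


def mean_absolute_error(pred, truth):
    """Average absolute error over all time steps and neurons."""
    errors = [
        abs(p - t)
        for pred_step, true_step in zip(pred, truth)
        for p, t in zip(pred_step, true_step)
    ]
    return sum(errors) / len(errors)


# Rows are time steps, columns are neurons (toy activity traces).
history = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
future = [[3.0, 1.0], [4.0, 1.0]]

mae_persistence = mean_absolute_error(persistence_forecast(history, 2), future)
mae_mean = mean_absolute_error(mean_forecast(history, 2), future)
```

A learned forecasting model is only interesting on such a benchmark if it beats simple references of this kind, which is the comparison the abstract reports.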

[CV-116] Towards a robust R2D2 paradigm for radio-interferometric imaging: revisiting DNN training and architecture

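【Quick read】: This paper aims to improve the R2D2 Deep Neural Network (DNN) series for radio-interferometric imaging, in particular its image reconstruction quality, data fidelity, and epistemic uncertainty. It does so by revisiting the training methodology, the convergence criterion, and the network architecture. First, generalizability is improved by randomizing Fourier sampling integration times, incorporating multi-scan multi-noise configurations, and varying imaging settings such as pixel resolution and visibility-weighting scheme. Second, a new convergence criterion stops the reconstruction process once the data residual is compatible with noise, which reduces computational cost and refines training. Third, R2D2's earlier U-Net is replaced with a new architecture (U-WDSR) that combines U-Net and WDSR and uses wide activation, dense connections, weight normalization, and low-rank convolution to improve feature reuse and reconstruction precision. With these changes, the new R2D2 model consistently outperforms its earlier version in image reconstruction quality, data fidelity, and epistemic uncertainty.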

Link: https://arxiv.org/abs/2503.02554
Authors: Amir Aghabiglou, Chung San Chu, Chao Tang, Arwa Dabbech, Yves Wiaux
Affiliations: Institute of Sensors, Signals and Systems, Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom; EPCC, University of Edinburgh, Potterrow, Edinburgh EH8 9BT, United Kingdom
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 6 figures

Abstract: The R2D2 Deep Neural Network (DNN) series was recently introduced for image formation in radio interferometry. It can be understood as a learned version of CLEAN, whose minor cycles are substituted with DNNs. We revisit R2D2 on the grounds of series convergence, training methodology, and DNN architecture, improving its robustness in terms of generalisability beyond training conditions, capability to deliver high data fidelity, and epistemic uncertainty. Firstly, while still focusing on telescope-specific training, we enhance the learning process by randomising Fourier sampling integration times, incorporating multi-scan multi-noise configurations, and varying imaging settings, including pixel resolution and visibility-weighting scheme. Secondly, we introduce a convergence criterion whereby the reconstruction process stops when the data residual is compatible with noise, rather than simply using all available DNNs. This not only increases the reconstruction efficiency by reducing its computational cost, but also refines training by pruning out the data/image pairs for which optimal data fidelity is reached before training the next DNN. Thirdly, we substitute R2D2’s early U-Net DNN with a novel architecture (U-WDSR) combining U-Net and WDSR, which leverages wide activation, dense connections, weight normalisation, and low-rank convolution to improve feature reuse and reconstruction precision. As previously, R2D2 was trained for monochromatic intensity imaging with the Very Large Array (VLA) at fixed 512 × 512 image size. Simulations on a wide range of inverse problems and a case study on real data reveal that the new R2D2 model consistently outperforms its earlier version in image reconstruction quality, data fidelity, and epistemic uncertainty.
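The convergence criterion described in the abstract, stopping once the data residual is compatible with noise instead of running every available DNN, can be sketched as an early-exit loop. The example below makes several simplifying assumptions that are not from the paper: the measurement operator is the identity (so the residual is just data minus estimate), "compatible with noise" is taken to mean a residual norm no larger than `noise_sigma * sqrt(n)`, and each trained DNN is replaced by a hypothetical `half_step` function that recovers half of the remaining residual.

```python
import math


def l2_norm(values):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in values))


def reconstruct(data, stages, noise_sigma):
    """Apply update stages in series, stopping once the residual is at noise level.

    Assumes an identity measurement operator, so residual = data - estimate.
    Returns the estimate and the number of stages actually used.
    """
    estimate = [0.0] * len(data)
    noise_level = noise_sigma * math.sqrt(len(data))
    used = 0
    for stage in stages:
        residual = [d - e for d, e in zip(data, estimate)]
        if l2_norm(residual) <= noise_level:
            break  # residual is compatible with noise: skip remaining stages
        update = stage(residual)
        estimate = [e + u for e, u in zip(estimate, update)]
        used += 1
    return estimate, used


# Hypothetical stand-in for a trained DNN: recovers half of the residual.
def half_step(residual):
    return [0.5 * r for r in residual]


data = [4.0, -2.0, 1.0, 3.0]
estimate, used = reconstruct(data, [half_step] * 10, noise_sigma=0.1)
```

In this toy run only 5 of the 10 available stages are applied before the residual drops to the assumed noise level, which illustrates the computational saving the abstract mentions.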

[CV-117] Scene-based nonuniformity correction with homography transformation

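【Quick read】: This paper addresses the calibration requirement of long-wave infrared (LWIR) cameras based on uncooled microbolometer focal plane arrays (FPAs), which suffer from gain and offset drift in the outdoor conditions typical of agricultural remote sensing. The key idea is a computational scheme that jointly estimates an object's thermographic values, the gain, and the offset from a sequence of shifted images, where the relative shifts are caused by realistic drone hovering and modeled by a homography transformation. The solution uses a minimum likelihood estimator found by alternating minimization, with registration done by a generalized Lucas-Kanade method, so that thermographic values can be recovered without continuous calibration. In simulations, the mean Pearson correlation between ground truth and restoration exceeds 0.9999998, which under ideal assumptions corresponds to a mean restoration error of less than 0.01 degrees Celsius.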

Link: https://arxiv.org/abs/2503.02487
Authors: Peretz Yafin, Nir Sochen, Iftach Klapp
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: Imaging, Inverse problems, Functional analysis, Blind deconvolution

Abstract: Due to their affordability, low mass, and small dimensions, uncooled microbolometer-based thermal focal plane arrays (UC-FPAs) are useful for long-wave infrared (LWIR) imaging applications. However, in outdoor conditions typical in agricultural remote sensing, cameras based on UC-FPAs may suffer from drift in offset and gain. To tackle the persistent drift, the system requires continuous calibration. Our goal in this study was to eliminate this requirement via a computational schema. In a former study, we estimated unknown gain and offset values and thermographic images of an object from a sequence of pairs of successive images taken at two different blur this http URL. In the current work, we took on a similar problem using a sequence of shifted images, with relative shifts caused by realistic drone hovering modeled by homography transformation. This places our work in the realm of scene-based nonuniformity correction problems. We show that an object’s thermographic values, as well as gain and offset, can be jointly estimated by relying on a few sets of shifted images. We use a minimum likelihood estimator, which is found using alternating minimization. Registration is done using a generalized Lucas-Kanade method. Simulations show promising accuracy with mean Pearson correlation of more than 0.9999998 between ground truth and restoration. Under ideal assumptions, this is equivalent to a mean restoration error of less than 0.01 degrees Celsius.

[CV-118] Building 3D In-Context Learning Universal Model in Neuroimaging

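【Quick read】: This paper addresses a limitation of existing in-context learning (ICL) models in neuroimaging: because they take only 2D images as input, they cannot fully exploit 3D anatomical structure, which leads to a lack of global awareness and suboptimal performance. The paper proposes Neuroverse3D, an ICL model that performs multiple neuroimaging tasks (e.g., segmentation, denoising, inpainting) in 3D. Its key innovations are adaptive parallel-sequential context processing and a U-shape fusion strategy, which overcome the large memory consumption caused by 3D inputs, and an optimized loss that balances multi-task training and strengthens the focus on anatomical structures. Experiments show that Neuroverse3D significantly outperforms existing ICL models and closely matches the performance of task-specific models.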

Link: https://arxiv.org/abs/2503.02410
Authors: Jiesi Hu, Hanyang Peng, Yanwu Yang, Xutao Guo, Yang Shang, Pengcheng Shi, Chenfei Ye, Ting Ma
Affiliations: Harbin Institute of Technology (Shenzhen); Peng Cheng Laboratory; Tubingen Center for Mental Health, Germany
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In-context learning (ICL), a type of universal model, demonstrates exceptional generalization across a wide range of tasks without retraining by leveraging task-specific guidance from context, making it particularly effective for the complex demands of neuroimaging. However, existing ICL models, which take 2D images as input, struggle to fully leverage the 3D anatomical structures in neuroimages, leading to a lack of global awareness and suboptimal performance. In this regard, we introduce Neuroverse3D, an ICL model capable of performing multiple neuroimaging tasks (e.g., segmentation, denoising, inpainting) in 3D. Neuroverse3D overcomes the large memory consumption due to 3D inputs through adaptive parallel-sequential context processing and a U-shape fusion strategy, allowing it to handle an unlimited number of context images. Additionally, we propose an optimized loss to balance multi-task training and enhance the focus on anatomical structures. Our study incorporates 43,674 3D scans from 19 neuroimaging datasets and evaluates Neuroverse3D on 14 diverse tasks using held-out test sets. The results demonstrate that Neuroverse3D significantly outperforms existing ICL models and closely matches the performance of task-specific models. The code and model weights are publicly released at: this https URL.

[CV-119] CQ CNN: A Hybrid Classical Quantum Convolutional Neural Network for Alzheimers Disease Detection Using Diffusion Generated and U Net Segmented 3D MRI

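Quick read: This paper addresses the detection of Alzheimer's disease (AD) from clinical MRI data. Traditional methods have limitations, and quantum machine learning (QML), while promising, is still at an early stage. The paper therefore proposes an end-to-end hybrid classical-quantum convolutional neural network (CQ CNN). The key elements of the solution are a framework that makes 3D MRI data usable for machine learning, a brain tissue segmentation model (Skull Net), and a diffusion model that generates synthetic images for the minority class. The final model reaches higher accuracy in fewer training epochs (97.50% accuracy, surpassing current state-of-the-art models) while using only 13K parameters (0.48 MB), which substantially reduces computational requirements; the generated data also conform to clinical structural standards.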

Link: https://arxiv.org/abs/2503.02345
Authors: Mominul Islam, Mohammad Junayed Hasan, M.R.C. Mahdy
Affiliations: Department of Electrical and Computer Engineering, North South University; Department of Computer Science, Johns Hopkins University; Mahdy Research Academy
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Application of hybrid quantum-classical machine learning for (early stage) disease detection

Abstract:The detection of Alzheimer's disease (AD) from clinical MRI data is an active area of research in medical imaging. Recent advances in quantum computing, particularly the integration of parameterized quantum circuits (PQCs) with classical machine learning architectures, offer new opportunities to develop models that may outperform traditional methods. However, quantum machine learning (QML) remains in its early stages and requires further experimental analysis to better understand its behavior and limitations. In this paper, we propose an end-to-end hybrid classical-quantum convolutional neural network (CQ CNN) for AD detection using clinically formatted 3D MRI data. Our approach involves developing a framework to make 3D MRI data usable for machine learning, designing and training a brain tissue segmentation model (Skull Net), and training a diffusion model to generate synthetic images for the minority class. Our converged models exhibit potential quantum advantages, achieving higher accuracy in fewer epochs than classical models. The proposed beta8 3-qubit model achieves an accuracy of 97.50%, surpassing state-of-the-art (SOTA) models while requiring significantly fewer computational resources. In particular, the architecture employs only 13K parameters (0.48 MB), reducing the parameter count by more than 99.99% compared to current SOTA models. Furthermore, the diffusion-generated data used to train our quantum models, in conjunction with real samples, preserve clinical structural standards, representing a notable first in the field of QML. We conclude that CQ CNN-like architectures, with further improvements in gradient optimization techniques, could become a viable option and even a potential alternative to classical models for AD detection, especially in data-limited and resource-constrained clinical settings.
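
As a quick reading aid for the parameter claim in this abstract, the sketch below works out what a reduction of more than 99.99% implies for the size of the comparison models. It uses only the two figures quoted in the abstract (13K parameters, 99.99%); the function name is hypothetical, and the actual SOTA parameter counts are not given in this listing.

```python
from fractions import Fraction

def min_baseline_params(model_params, reduction):
    """Baseline size at which `model_params` is exactly a `reduction` cut.

    reduction = 1 - model_params / baseline
    =>  baseline = model_params / (1 - reduction)
    """
    return Fraction(model_params) / (1 - reduction)

# Figures quoted in the abstract: 13K parameters, reduction above 99.99%.
threshold = min_baseline_params(13_000, Fraction(9999, 10_000))
print(f"baseline must exceed {int(threshold):,} parameters")
```

Exact fractions are used so that the threshold is not blurred by floating-point rounding; under the quoted figures, the comparison models would need more than 130 million parameters.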

[CV-120] COMMA: Coordinate-aware Modulated Mamba Network for 3D Dispersed Vessel Segmentation

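Quick read: This paper addresses the spatial uncertainty that arises in 3D vascular segmentation because vessels are dispersed. Existing methods usually rely on patch-based training, which tends to lose global spatial context. The key solution is the Coordinate-aware Modulated Mamba Network (COMMA), which uses both entire images and cropped patches through global and local branches to ensure robust and efficient spatial location awareness. Specifically, COMMA uses a channel-compressed Mamba (ccMamba) block to encode entire image data, capturing long-range dependencies while keeping computational cost in check, and introduces a coordinate-aware modulated (CaM) block to strengthen the interaction between the global and local branches so that the local branch perceives spatial information better. The paper also contributes a manually labeled dataset of 570 cases, the largest publicly available 3D vessel dataset to date. Experiments show that COMMA outperforms state-of-the-art methods on multiple datasets, especially when segmenting small vessels.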

Link: https://arxiv.org/abs/2503.02332
Authors: Gen Shi, Hui Zhang, Jie Tian
Affiliations: Beihang University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate segmentation of 3D vascular structures is essential for various medical imaging applications. The dispersed nature of vascular structures leads to inherent spatial uncertainty and necessitates location awareness, yet most current 3D medical segmentation models rely on the patch-wise training strategy that usually loses this spatial context. In this study, we introduce the Coordinate-aware Modulated Mamba Network (COMMA) and contribute a manually labeled dataset of 570 cases, the largest publicly available 3D vessel dataset to date. COMMA leverages both entire and cropped patch data through global and local branches, ensuring robust and efficient spatial location awareness. Specifically, COMMA employs a channel-compressed Mamba (ccMamba) block to encode entire image data, capturing long-range dependencies while optimizing computational costs. Additionally, we propose a coordinate-aware modulated (CaM) block to enhance interactions between the global and local branches, allowing the local branch to better perceive spatial information. We evaluate COMMA on six datasets, covering two imaging modalities and five types of vascular tissues. The results demonstrate COMMA’s superior performance compared to state-of-the-art methods with computational efficiency, especially in segmenting small vessels. Ablation studies further highlight the importance of our proposed modules and spatial information. The code and data will be open source at this https URL.

[CV-121] Generative Model-Assisted Demosaicing for Cross-multispectral Cameras

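Quick read: This paper addresses three main challenges in spectral demosaicing for spectral filter array (SFA)-based multispectral imaging: (1) labels corresponding to real data are hard to capture and the practical spectral imaging process is hard to simulate; (2) spectral discrepancies between cameras make it hard to transfer pre-trained models to new cameras; (3) existing networks tend to introduce visual artifacts in complex scenes. The key solution is a hybrid supervised training method assisted by a self-supervised generative model, combined with a frequency-domain hard patch selection mechanism that identifies artifact-prone regions for targeted fine-tuning, which significantly improves performance on real data. The paper also builds UniSpecTest, a real-world multispectral mosaic image dataset for testing.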

Link: https://arxiv.org/abs/2503.02322
Authors: Jiahui Luo, Kai Feng, Haijin Zeng, Yongyong Chen
Affiliations: School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); School of Automation, Northwestern Polytechnical University; Image Processing and Interpretation, IMEC Research Group, Ghent University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As a crucial part of the spectral filter array (SFA)-based multispectral imaging process, spectral demosaicing has exploded with the proliferation of deep learning techniques. However, (1) owing to the difficulty of capturing corresponding labels for real data or simulating the practical spectral imaging process, end-to-end networks trained in a supervised manner using simulated data often perform poorly on real data; (2) cross-camera spectral discrepancies make it difficult to apply pre-trained models to new cameras; and (3) existing demosaicing networks are prone to introducing visual artifacts on hard cases due to the interpolation of unknown values. To address these issues, we propose a hybrid supervised training method with the assistance of the self-supervised generative model, which performs well on real data across different spectral cameras. Specifically, our approach consists of three steps: (1) Pre-Training step: training the end-to-end neural network on a large amount of simulated data; (2) Pseudo-Pairing step: generating pseudo-labels of real target data using the self-supervised generative model; (3) Fine-Tuning step: fine-tuning the pre-trained model on the pseudo data pairs obtained in (2). To alleviate artifacts, we propose a frequency-domain hard patch selection method that identifies artifact-prone regions by analyzing spectral discrepancies using Fourier transform and filtering techniques, allowing targeted fine-tuning to enhance demosaicing performance. Finally, we propose UniSpecTest, a real-world multispectral mosaic image dataset for testing. Ablation experiments have demonstrated the effectiveness of each training step, and extensive experiments on both synthetic and real datasets show that our method achieves significant performance gains compared to state-of-the-art techniques.

[CV-122] Semantic Prior Distillation with Vision Foundation Model for Enhanced Rapid Bone Scintigraphy Image Restoration

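Quick read: This paper addresses the degraded image quality of rapid bone scintigraphy in pediatric patients caused by faster scanning, which can affect diagnostic accuracy for skeletal diseases and tumor metastasis. To address this, it proposes a method based on semantic priors from SAM (Segment Anything Model) to enhance rapid bone scintigraphy images. The key is a two-stage cascaded network with three modules: a Semantic Prior Integration (SPI) module, a Semantic Knowledge Distillation (SKD) module, and a Semantic Consistency Module (SCM). The SPI and SKD modules use a fine-tuned SAM to inject domain-specific semantic information, while the SCM keeps the semantic feature representation consistent throughout the cascaded network. The paper also introduces RBS (Rapid Bone Scintigraphy), the first dataset dedicated to rapid bone scintigraphy image restoration in pediatric patients.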

Link: https://arxiv.org/abs/2503.02321
Authors: Pengchen Liang, Leijun Shi, Huiping Yao, Bin Pu, Jianguo Chen, Lei Zhao, Haishan Huang, Zhuangzhuang Chen, Zhaozhao Xu, Lite Xu, Qing Chang, Yiwei Li
Affiliations: Department of Nuclear Medicine, Shanghai Children’s Hospital, School of Medicine, Shanghai Jiao Tong University; School of Microelectronics, Shanghai University; Department of Ophthalmology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine; Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology; School of Software Engineering, Sun Yat-sen University; College of Computer Science and Electronic Engineering, Hunan University; School of Computer Science and Technology, Henan Polytechnic University; Department Shanghai Key Laboratory of Gastric Neoplasms, Department of Surgery, Shanghai Institute of Digestive Surgery, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 9 figures, 8 tables

Abstract:Rapid bone scintigraphy is an essential tool for diagnosing skeletal diseases and tumor metastasis in pediatric patients, as it reduces scan time and minimizes patient discomfort. However, rapid scans often result in poor image quality, potentially affecting diagnosis due to reduced resolution and detail, which make it challenging to identify and evaluate finer anatomical structures. To address this issue, we propose the first application of SAM-based semantic priors for medical image restoration, leveraging the Segment Anything Model (SAM) to enhance rapid bone scintigraphy images in pediatric populations. Our method comprises two cascaded networks, f^IR1 and f^IR2, augmented by three key modules: a Semantic Prior Integration (SPI) module, a Semantic Knowledge Distillation (SKD) module, and a Semantic Consistency Module (SCM). The SPI and SKD modules incorporate domain-specific semantic information from a fine-tuned SAM, while the SCM maintains consistent semantic feature representation throughout the cascaded networks. In addition, we will release a novel Rapid Bone Scintigraphy dataset called RBS, the first dataset dedicated to rapid bone scintigraphy image restoration in pediatric patients. RBS consists of 137 pediatric patients aged between 0.5 and 16 years who underwent both standard and rapid bone scans. The dataset includes scans performed at 20 cm/min (standard) and 40 cm/min (rapid), representing a 2× acceleration. We conducted extensive experiments on both the publicly available endoscopic dataset and RBS. The results demonstrate that our method outperforms all existing methods across various metrics, including PSNR, SSIM, FID, and LPIPS.

[CV-123] Volume Tells: Dual Cycle-Consistent Diffusion for 3D Fluorescence Microscopy De-noising and Super-Resolution CVPR2025

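Quick read: This paper addresses spatially varying noise and anisotropic resolution in 3D fluorescence microscopy, in a setting where laser power is kept low to maintain cell viability, so paired low-noise, high-resolution ground truth (GT) is unavailable. It proposes Volume Tells via Dual Cycle-consistent Diffusion (VTCD), an unsupervised method that mines imaging priors within 3D cell volumes to achieve denoising and super-resolution (SR) simultaneously.

The solution has two key parts. First, a spatially iso-distributed denoiser exploits the consistency of the noise distribution between adjacent low-noise and high-noise regions within the 3D cell volume to suppress spatially varying noise. Second, based on the structural consistency of the cell volume, a cross-plane global-propagation SR module propagates high-resolution details from the XY plane into adjacent regions of the XZ and YZ planes, progressively improving resolution across the whole 3D cell volume. Experiments on 10 in vivo cellular datasets show marked improvements in both denoising and super-resolution, with axial resolution improved from about 430 nm to about 90 nm.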

Link: https://arxiv.org/abs/2503.02261
Authors: Zelin Li, Chenwei Wang, Zhaoke Huang, Yiming MA, Cunmin Zhao, Zhongying Zhao, Hong Yan
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted on CVPR 2025

Abstract:3D fluorescence microscopy is essential for understanding fundamental life processes through long-term live-cell imaging. However, due to inherent issues in imaging principles, it faces significant challenges including spatially varying noise and anisotropic resolution, where the axial resolution lags behind the lateral resolution by up to 4.5 times. Meanwhile, laser power is kept low to maintain cell viability, leading to inaccessible low-noise and high-resolution paired ground truth (GT). To tackle these limitations, a dual Cycle-consistent Diffusion is proposed to effectively mine intra-volume imaging priors within 3D cell volumes in an unsupervised manner, i.e., Volume Tells (VTCD), achieving de-noising and super-resolution (SR) simultaneously. Specifically, a spatially iso-distributed denoiser is designed to exploit the noise distribution consistency between adjacent low-noise and high-noise regions within the 3D cell volume, suppressing the spatially varying noise. Then, in light of the structural consistency of the cell volume, a cross-plane global-propagation SR module propagates high-resolution details from the XY plane into adjacent regions in the XZ and YZ planes, progressively enhancing resolution across the entire 3D cell volume. Experimental results on 10 in vivo cellular datasets demonstrate high improvements in both denoising and super-resolution, with axial resolution enhanced from ~430 nm to ~90 nm.

[CV-124] CrossFusion: A Multi-Scale Cross-Attention Convolutional Fusion Model for Cancer Survival Prediction

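Quick read: This paper addresses cancer survival prediction from whole slide images (WSIs), a challenging task in computational pathology. Because WSIs are large, irregularly shaped, and highly granular, it is hard to capture the full range of key patterns, from subtle cellular abnormalities to complex tissue interactions, which limits the accuracy of survival prediction. To address this, the paper proposes CrossFusion, a novel multi-scale feature integration framework. Its key idea is to extract and fuse information from patches at different magnification levels, modeling both scale-specific patterns and their interactions to produce a rich feature set that improves survival prediction accuracy. Experiments show that CrossFusion significantly outperforms existing state-of-the-art methods on six cancer types, and that pairing it with domain-specific feature extraction backbones yields further gains in prognostic performance over general-purpose backbones.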

Link: https://arxiv.org/abs/2503.02064
Authors: Rustin Soraki, Huayu Wang, Joann G. Elmore, Linda Shapiro
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cancer survival prediction from whole slide images (WSIs) is a challenging task in computational pathology due to the large size, irregular shape, and high granularity of the WSIs. These characteristics make it difficult to capture the full spectrum of patterns, from subtle cellular abnormalities to complex tissue interactions, which are crucial for accurate prognosis. To address this, we propose CrossFusion, a novel multi-scale feature integration framework that extracts and fuses information from patches across different magnification levels. By effectively modeling both scale-specific patterns and their interactions, CrossFusion generates a rich feature set that enhances survival prediction accuracy. We validate our approach across six cancer types from public datasets, demonstrating significant improvements over existing state-of-the-art methods. Moreover, when coupled with domain-specific feature extraction backbones, our method shows further gains in prognostic performance compared to general-purpose backbones. The source code is available at: this https URL

[CV-125] A Lightweight Deep Exclusion Unfolding Network for Single Image Reflection Removal

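Quick read: This paper addresses single image reflection removal (SIRR), a canonical blind source separation problem: separating a reflection-contaminated image into a transmission image and a reflection image. Existing deep learning methods either neglect the importance of feature interactions or rely on heuristically designed architectures. The paper proposes the Deep Exclusion unfolding Network (DExNet), a lightweight, interpretable, and effective architecture for SIRR. DExNet is built by unfolding and parameterizing a simple iterative Sparse and Auxiliary Feature Update (i-SAFU) algorithm, which is designed to solve a new model-based SIRR optimization formulation that incorporates a general exclusion prior. This prior lets the unfolded SAFU module inherently identify and penalize commonalities between transmission and reflection features, giving a more accurate separation. The principled design improves both interpretability and performance. Comprehensive experiments on four benchmark datasets show that DExNet achieves state-of-the-art visual and quantitative results with only about 8% of the parameters of leading methods.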

Link: https://arxiv.org/abs/2503.01938
Authors: Jun-Jie Huang, Tianrui Liu, Zihan Chen, Xinwang Liu, Meng Wang, Pier Luigi Dragotti
Affiliations: College of Computer Science and Technology, National University of Defense Technology, Changsha, China; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Department of Electrical and Electronic Engineering, Imperial College London, London, UK
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Single Image Reflection Removal (SIRR) is a canonical blind source separation problem and refers to the issue of separating a reflection-contaminated image into a transmission and a reflection image. The core challenge lies in minimizing the commonalities among different sources. Existing deep learning approaches either neglect the significance of feature interactions or rely on heuristically designed architectures. In this paper, we propose a novel Deep Exclusion unfolding Network (DExNet), a lightweight, interpretable, and effective network architecture for SIRR. DExNet is principally constructed by unfolding and parameterizing a simple iterative Sparse and Auxiliary Feature Update (i-SAFU) algorithm, which is specifically designed to solve a new model-based SIRR optimization formulation incorporating a general exclusion prior. This general exclusion prior enables the unfolded SAFU module to inherently identify and penalize commonalities between the transmission and reflection features, ensuring more accurate separation. The principled design of DExNet not only enhances its interpretability but also significantly improves its performance. Comprehensive experiments on four benchmark datasets demonstrate that DExNet achieves state-of-the-art visual and quantitative results while utilizing only approximately 8% of the parameters required by leading methods.

[CV-126] QDCNN: Quantum Deep Learning for Enhancing Safety and Reliability in Autonomous Transportation Systems

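Quick read: This paper addresses the safety and reliability of real-time decision-making in transportation networks, in particular challenges such as high computational complexity when handling ambiguous inputs (such as shadows) in complex environments. It proposes a Quantum Deep Convolutional Neural Network (QDCNN) that uses quantum algorithms to improve the safety and reliability of transportation cyber-physical systems (CPS). At the core of QDCNN is the UU† method, which trains the centroid value through a propagation algorithm with preprocessing and postprocessing operations in order to classify shadow regions in images accurately and improve shadow detection. QDCNN is evaluated on three datasets under normal conditions and on one rain-affected road dataset. Its reported shadow detection time is 0.0049352 seconds, compared with 0.03 seconds for intensity-based thresholding, 1.47 seconds for chromaticity-based shadow detection, and 2.05 seconds for local binary pattern techniques, that is, roughly 6 to 415 times faster. The paper also reports higher accuracy and better noise resilience, which matter for real-time safe navigation in autonomous driving. The study illustrates the potential of quantum-enhanced models to overcome limitations of classical methods and to support more reliable and robust autonomous transportation systems.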

Link: https://arxiv.org/abs/2503.01916
Authors: Ashtakala Meghanath, Subham Das, Bikash K. Behera, Muhammad Attique Khan, Saif Al-Kuwari, Ahmed Farouk
Affiliations: Department of Physics, Indian Institute of Science Education and Research, Thiruvananthapuram, India; Bikash’s Quantum (OPC) Pvt. Ltd., Mohanpur, WB, 741246 India; Department of AI, College of Computer Engineering and Science, Prince Mohammad Bin Fahd University, Al Khobar, Saudi Arabia; Qatar Center for Quantum Computing, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar; Department of Computer Science, Faculty of Computers and Artificial Intelligence, Hurghada University, Hurghada, Egypt
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 11 Pages, 7 Figures, 4 Tables

Abstract:In transportation cyber-physical systems (CPS), ensuring safety and reliability in real-time decision-making is essential for successfully deploying autonomous vehicles and intelligent transportation networks. However, these systems face significant challenges, such as computational complexity and the ability to handle ambiguous inputs like shadows in complex environments. This paper introduces a Quantum Deep Convolutional Neural Network (QDCNN) designed to enhance the safety and reliability of CPS in transportation by leveraging quantum algorithms. At the core of QDCNN is the UU† method, which is utilized to improve shadow detection through a propagation algorithm that trains the centroid value with preprocessing and postprocessing operations to classify shadow regions in images accurately. The proposed QDCNN is evaluated on three datasets under normal conditions and on one road affected by rain to test its robustness. It outperforms existing methods in terms of computational efficiency, achieving a shadow detection time of just 0.0049352 seconds, faster than classical algorithms like intensity-based thresholding (0.03 seconds), chromaticity-based shadow detection (1.47 seconds), and local binary pattern techniques (2.05 seconds). This remarkable speed, superior accuracy, and noise resilience demonstrate the key factors for safe navigation in autonomous transportation in real-time. This research demonstrates the potential of quantum-enhanced models in addressing critical limitations of classical methods, contributing to more dependable and robust autonomous transportation systems within the CPS framework.
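
To put the timing comparison in this abstract on a common scale, the following sketch converts the four quoted detection times into speedup factors. The numbers are copied from the abstract; the dictionary keys and variable names are just illustrative labels, and nothing here says how the timings were measured.

```python
# Detection times in seconds, as quoted in the abstract.
qdcnn_seconds = 0.0049352
baseline_seconds = {
    "intensity-based thresholding": 0.03,
    "chromaticity-based detection": 1.47,
    "local binary patterns": 2.05,
}

# Speedup = baseline time / QDCNN time (larger means QDCNN is faster).
speedups = {name: t / qdcnn_seconds for name, t in baseline_seconds.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x slower than QDCNN")
```

The factors span roughly one to two and a half orders of magnitude, so the size of the gap depends heavily on which classical baseline is chosen.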

Artificial Intelligence

[AI-0] Bringing Comparative Cognition To Computers

Link: https://arxiv.org/abs/2503.02882
Authors: Konstantinos Voudouris, Lucy G. Cheke, Eric Schulz
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Researchers are increasingly subjecting artificial intelligence systems to psychological testing. But to rigorously compare their cognitive capacities with humans and other animals, we must avoid both over- and under-stating our similarities and differences. By embracing a comparative approach, we can integrate AI cognition research into the broader cognitive sciences.

[AI-1] Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation

Link: https://arxiv.org/abs/2503.02881
Authors: Han Xue, Jieji Ren, Wendi Chen, Gu Zhang, Yuan Fang, Guoying Gu, Huazhe Xu, Cewu Lu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Humans can accomplish complex contact-rich tasks using vision and touch, with highly reactive capabilities such as quick adjustments to environmental changes and adaptive control of contact forces; however, this remains challenging for robots. Existing visual imitation learning (IL) approaches rely on action chunking to model complex behaviors, which lacks the ability to respond instantly to real-time tactile feedback during the chunk execution. Furthermore, most teleoperation systems struggle to provide fine-grained tactile / force feedback, which limits the range of tasks that can be performed. To address these challenges, we introduce TactAR, a low-cost teleoperation system that provides real-time tactile feedback through Augmented Reality (AR), along with Reactive Diffusion Policy (RDP), a novel slow-fast visual-tactile imitation learning algorithm for learning contact-rich manipulation skills. RDP employs a two-level hierarchy: (1) a slow latent diffusion policy for predicting high-level action chunks in latent space at low frequency, (2) a fast asymmetric tokenizer for closed-loop tactile feedback control at high frequency. This design enables both complex trajectory modeling and quick reactive behavior within a unified framework. Through extensive evaluation across three challenging contact-rich tasks, RDP significantly improves performance compared to state-of-the-art visual IL baselines through rapid response to tactile / force feedback. Furthermore, experiments show that RDP is applicable across different tactile / force sensors. Code and videos are available on this https URL.

[AI-2] Evaluation of Architectural Synthesis Using Generative AI

Link: https://arxiv.org/abs/2503.02861
Authors: Jingfei Huang, Alexandros Haridis
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 7 figures

Abstract:Recent advancements in multimodal Generative AI have the potential to democratize specialized architectural tasks, such as interpreting technical drawings and creating 3D CAD models, which traditionally require expert knowledge. This paper presents a comparative evaluation of two systems: GPT-4o and Claude 3.5, in the task of architectural 3D synthesis. We conduct a case study on two buildings from Palladio’s Four Books of Architecture (1965): Villa Rotonda and Palazzo Porto. High-level architectural models and drawings of these buildings were prepared, inspired by Palladio’s original texts and drawings. Through sequential text and image prompting, we assess the systems’ abilities in (1) interpreting 2D and 3D representations of buildings from drawings, (2) encoding the buildings into a CAD software script, and (3) self-improving based on outputs. While both systems successfully generate individual parts, they struggle to accurately assemble these parts into the desired spatial relationships, with Claude 3.5 demonstrating better performance, particularly in self-correcting its output. This study contributes to ongoing research on benchmarking the strengths and weaknesses of off-the-shelf AI systems in performing intelligent human tasks that require discipline-specific knowledge. The findings highlight the potential of language-enabled AI systems to act as collaborative technical assistants in the architectural design process.

[AI-3] SeqFusion: Sequential Fusion of Pre-Trained Models for Zero-Shot Time-Series Forecasting

Link: https://arxiv.org/abs/2503.02836
Authors: Ting-Ji Huang, Xu-Yang Chen, Han-Jia Ye
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unlike traditional time-series forecasting methods that require extensive in-task data for training, zero-shot forecasting can directly predict future values given a target time series without additional training data. Current zero-shot approaches primarily rely on pre-trained generalized models, with their performance often depending on the variety and relevance of the pre-training data, which can raise privacy concerns. Instead of collecting diverse pre-training data, we introduce SeqFusion in this work, a novel framework that collects and fuses diverse pre-trained models (PTMs) sequentially for zero-shot forecasting. Based on the specific temporal characteristics of the target time series, SeqFusion selects the most suitable PTMs from a batch of pre-collected PTMs, performs sequential predictions, and fuses all the predictions while using minimal data to protect privacy. Each of these PTMs specializes in different temporal patterns and forecasting tasks, allowing SeqFusion to select by measuring distances in a shared representation space of the target time series with each PTM. Experiments demonstrate that SeqFusion achieves competitive accuracy in zero-shot forecasting compared to state-of-the-art methods.
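
The selection-by-distance idea in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: it assumes each pre-trained model (PTM) and the target series already have vectors in a shared representation space, uses Euclidean distance, and fuses forecasts by a plain element-wise mean. The model names, vectors, and both helper functions are hypothetical.

```python
import math

def select_models(target_vec, model_vecs, k):
    """Return the names of the k models closest to the target representation.

    Assumption: Euclidean distance in the shared space; the abstract does not
    say which distance SeqFusion uses.
    """
    ranked = sorted(model_vecs, key=lambda name: math.dist(target_vec, model_vecs[name]))
    return ranked[:k]

def fuse(predictions):
    """Fuse several equal-length forecasts by element-wise mean (an assumption)."""
    return [sum(values) / len(values) for values in zip(*predictions)]

# Hypothetical 2-D representations of three pre-trained models.
model_vecs = {
    "trend": (1.0, 0.0),
    "seasonal": (0.0, 1.0),
    "noise": (5.0, 5.0),
}
target_vec = (0.9, 0.2)  # hypothetical representation of the target series

chosen = select_models(target_vec, model_vecs, k=2)
fused = fuse([[1.0, 2.0], [3.0, 4.0]])  # stand-in forecasts from the chosen models
print(chosen, fused)
```

Ranking by distance keeps the selection step independent of the forecasting step, so the two chosen models can then be run one after another and their outputs combined.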

[AI-4] A Multimodal Symphony: Integrating Taste and Sound through Generative AI

Link: https://arxiv.org/abs/2503.02823
Authors: Matteo Spanio, Massimiliano Zampini, Antonio Rodà, Franco Pierucci
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 17 pages, 6 figures (2 + 2 figures with 2 subfigures each)

Abstract:In recent decades, neuroscientific and psychological research has traced direct relationships between taste and auditory perceptions. This article explores multimodal generative models capable of converting taste information into music, building on this foundational research. We provide a brief review of the state of the art in this field, highlighting key findings and methodologies. We present an experiment in which a fine-tuned version of a generative music model (MusicGEN) is used to generate music based on detailed taste descriptions provided for each musical piece. The results are promising: according to the participants’ (n=111) evaluation, the fine-tuned model produces music that more coherently reflects the input taste descriptions compared to the non-fine-tuned model. This study represents a significant step towards understanding and developing embodied interactions between AI, sound, and taste, opening new possibilities in the field of generative AI. We release our dataset, code and pre-trained model at: this https URL.

[AI-5] Do Not Trust Licenses You See – Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing

Link: https://arxiv.org/abs/2503.02784
Authors: Jaekyeom Kim, Sungryull Sohn, Gerrard Jeongwon Jo, Jihoon Choi, Kyunghoon Bae, Hwayoung Lee, Yongmin Park, Honglak Lee
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper argues that a dataset’s legal risk cannot be accurately assessed by its license terms alone; instead, tracking dataset redistribution and its full lifecycle is essential. However, this process is too complex for legal experts to handle manually at scale. Tracking dataset provenance, verifying redistribution rights, and assessing evolving legal risks across multiple stages require a level of precision and efficiency that exceeds human capabilities. Addressing this challenge effectively demands AI agents that can systematically trace dataset redistribution, analyze compliance, and identify legal risks. We develop an automated data compliance system called NEXUS and show that AI can perform these tasks with higher accuracy, efficiency, and cost-effectiveness than human experts. Our massive legal analysis of 17,429 unique entities and 8,072 license terms using this approach reveals the discrepancies in legal rights between the original datasets before redistribution and their redistributed subsets, underscoring the necessity of the data lifecycle-aware compliance. For instance, we find that out of 2,852 datasets with commercially viable individual license terms, only 605 (21%) are legally permissible for commercialization. This work sets a new standard for AI data governance, advocating for a framework that systematically examines the entire lifecycle of dataset redistribution to ensure transparent, legal, and responsible dataset management.
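
The headline figure in this abstract (605 of 2,852 datasets, about 21%) can be checked with a few lines of arithmetic. The counts are taken from the abstract; the variable names and the helper are hypothetical, and the sketch says nothing about how the paper's NEXUS system classifies datasets.

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole

viable_by_license = 2852         # datasets whose own license terms look commercially viable
permissible_after_tracing = 605  # still permissible for commercialization after lifecycle tracing

pct_permissible = share(permissible_after_tracing, viable_by_license)
not_cleared = viable_by_license - permissible_after_tracing

print(f"{pct_permissible:.1f}% permissible; {not_cleared} datasets not cleared")
```

Stated the other way round, roughly four out of five datasets that look commercially usable from their own license terms do not pass once redistribution history is taken into account.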

[AI-6] Prime Convolutional Model: Breaking the Ground for Theoretical Explainability

Link: https://arxiv.org/abs/2503.02773
Authors: Francesco Panelli, Doaa Almhaithawi, Tania Cerquitelli, Alessandro Bellini
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we propose a new theoretical approach to Explainable AI. Following the Scientific Method, this approach consists in formulating, on the basis of empirical evidence, a mathematical model to explain and predict the behaviors of Neural Networks. We apply the method to a case study created in a controlled environment, which we call Prime Convolutional Model (p-Conv for short). p-Conv operates on a dataset consisting of the first one million natural numbers and is trained to identify the congruence classes modulo a given integer m. Its architecture uses a convolutional-type neural network that contextually processes a sequence of B consecutive numbers to each input. We take an empirical approach and exploit p-Conv to identify the congruence classes of numbers in a validation set using different values for m and B. The results show that the different behaviors of p-Conv (i.e., whether it can perform the task or not) can be modeled mathematically in terms of m and B. The inferred mathematical model reveals interesting patterns able to explain when and why p-Conv succeeds in performing the task and, if not, which error pattern it follows.
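
The task setup in this abstract (windows of B consecutive numbers, labeled by congruence class modulo m) can be illustrated with a small data generator. The abstract does not say how windows are aligned or whether counting starts at 0 or 1, so this sketch assumes windows that start at n = 1 and are labeled with n mod m; the function name is hypothetical and no network is trained here.

```python
def congruence_windows(limit, m, B):
    """Yield (window, label) pairs over the integers 1..limit.

    Assumptions (not stated in the abstract): each window holds B consecutive
    integers starting at n, and the label is the congruence class n % m.
    """
    for start in range(1, limit - B + 2):
        window = tuple(range(start, start + B))
        yield window, start % m

# Tiny instance: integers 1..10, classes modulo 4, windows of length 3.
samples = list(congruence_windows(limit=10, m=4, B=3))
for window, label in samples[:3]:
    print(window, "->", label)
```

With the paper's setting the limit would be one million; varying m and B in such a generator is the kind of controlled sweep the abstract describes.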

[AI-7] Vibration-Assisted Hysteresis Mitigation for Achieving High Compensation Efficiency

Link: https://arxiv.org/abs/2503.02720
Authors: Myeongbo Park, Chunggil An, Junhyun Park, Jonghyun Kang, Minho Hwang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures, and 2 tables

Abstract:Tendon-sheath mechanisms (TSMs) are widely used in minimally invasive surgical (MIS) applications, but their inherent hysteresis, caused by friction, backlash, and tendon elongation, leads to significant tracking errors. Conventional modeling and compensation methods struggle with these nonlinearities and require extensive parameter tuning. To address this, we propose a vibration-assisted hysteresis compensation approach, where controlled vibrational motion is applied along the tendon’s movement direction to mitigate friction and reduce dead zones. Experimental results demonstrate that the exerted vibration consistently reduces hysteresis across all tested frequencies, decreasing RMSE by up to 23.41% (from 2.2345 mm to 1.7113 mm) and improving correlation, leading to more accurate trajectory tracking. When combined with a Temporal Convolutional Network (TCN)-based compensation model, vibration further enhances performance, achieving an 85.2% reduction in MAE (from 1.334 mm to 0.1969 mm). Without vibration, the TCN-based approach still reduces MAE by 72.3% (from 1.334 mm to 0.370 mm) under the same parameter settings. These findings confirm that vibration effectively mitigates hysteresis, improving trajectory accuracy and enabling more efficient compensation models with fewer trainable parameters. This approach provides a scalable and practical solution for TSM-based robotic applications, particularly in MIS.
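The reductions quoted in the abstract are consistent with the standard relative-reduction formula, 100 × (before − after) / before. The short check below recomputes them from the millimetre values given above; the helper name `percent_reduction` is ours and not from the paper.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Error values (in mm) taken from the abstract.
rmse_vibration = percent_reduction(2.2345, 1.7113)
mae_tcn_with_vibration = percent_reduction(1.334, 0.1969)
mae_tcn_only = percent_reduction(1.334, 0.370)

print(f"RMSE reduction with vibration: {rmse_vibration:.2f}%")
print(f"MAE reduction, TCN + vibration: {mae_tcn_with_vibration:.1f}%")
print(f"MAE reduction, TCN only: {mae_tcn_only:.1f}%")
```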

[AI-8] Generative Tools for Graphical Assets: Empirical Guidelines based on Game Designers' and Developers' Preferences

Link: https://arxiv.org/abs/2503.02703
Authors: Kaisei Fukaya, Damon Daylamani-Zad, Harry Agius
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Graphical assets play an important role in the design and development of games. There is potential in the use of generative tools to aid in creating graphical assets, thus improving game design and development pipelines. However, there is little research addressing how generative methods can fit into the wider pipeline. We conducted a user study with 16 game designers and developers to examine their preferences regarding generative tools for graphical assets. The findings highlight that the early design stage is preferred by all participants (mean values above 0.67 and p < .001 for early stages). Designers and developers prefer to use such tools for creating large amounts of variations at the cost of quality, as they can improve the quality of the artefacts once they generate a suitable asset (mean value 0.17 where 1 is high quality, p < .001). They also strongly (mean value .78, p < .001) raised the need for better integration of such tools in existing design and development environments and the need for the outputs to be in common data formats, to be manipulatable, and to integrate smoothly into existing environments (mean 3.5 out of 5, p = .004). The study also highlights the requirement for further emphasis on the needs of the users to incorporate these tools effectively in existing pipelines. Informed by these results, we provide a set of guidelines for creating tools that meet the expectations and needs of game designers and developers.

[AI-9] MindBridge: Scalable and Cross-Model Knowledge Editing via Memory-Augmented Modality

Link: https://arxiv.org/abs/2503.02701
Authors: Shuaike Li, Kai Zhang, Qi Liu, Enhong Chen
Subjects: Artificial Intelligence (cs.AI)

Abstract:Knowledge editing is a technique for efficiently and accurately updating the knowledge of large language models (LLMs) to alleviate obsolescence and correct errors. However, most existing methods overfit to specific models, causing edited knowledge to be discarded during each LLM update and requiring frequent re-editing, which is particularly burdensome in today’s rapidly evolving open-source community. To address this issue, we propose the problem of cross-model knowledge editing and introduce MindBridge, a scalable solution inspired by the low coupling between modality processing and LLMs in multi-modal models. MindBridge introduces the novel concept of memory modality, which encodes edited knowledge as an independent modality. It first performs LLM-agnostic pre-training of the memory modality and then integrates it with various LLMs. Extensive experiments on multiple LLMs and popular knowledge editing datasets demonstrate that MindBridge achieves superior performance even in editing tens of thousands of knowledge entries and can flexibly adapt to different LLMs. Our code is available at this https URL.

[AI-10] Seeding for Success: Skill and Stochasticity in Tabletop Games

Link: https://arxiv.org/abs/2503.02686
Authors: James Goodman, Diego Perez-Liebana, Simon Lucas
Subjects: Artificial Intelligence (cs.AI)
Comments: Published in IEEE Transactions on Games, 2025

Abstract:Games often incorporate random elements in the form of dice or shuffled card decks. This randomness is a key contributor to the player experience and the variety of game situations encountered. There is a tension between a level of randomness that makes the game interesting and contributes to the player enjoyment of a game, and a level at which the outcome itself is effectively random and the game becomes dull. The optimal level for a game will depend on the design goals and target audience. We introduce a new technique to quantify the level of randomness in game outcome and use it to compare 15 tabletop games and disentangle the different contributions to the overall randomness from specific parts of some games. We further explore the interaction between game randomness and player skill, and how this innate randomness can affect error analysis in common game experiments.

[AI-11] Reflection on Data Storytelling Tools in the Generative AI Era from the Human-AI Collaboration Perspective

Link: https://arxiv.org/abs/2503.02631
Authors: Haotian Li, Yun Wang, Huamin Qu
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: This paper is a sequel to the CHI 24 paper "Where Are We So Far? Understanding Data Storytelling Tools from the Perspective of Human-AI Collaboration" ( this https URL ), aiming to refresh our understanding with the latest advancements

Abstract:Human-AI collaborative tools attract attention from the data storytelling community because they lower the barrier of expertise and streamline the workflow. Recent advances in large-scale generative AI techniques, e.g., large language models (LLMs) and text-to-image models, have the potential to enhance data storytelling with their power in visual and narration generation. Two years after these techniques became publicly available, it is important to reflect on our progress in applying them and to look ahead to future opportunities. To achieve this goal, we compare the collaboration patterns of the latest tools with those of earlier ones using a dedicated framework for understanding human-AI collaboration in data storytelling. Through comparison, we identify persistent collaboration patterns, e.g., human-creator + AI-assistant, and emerging ones, e.g., AI-creator + human-reviewer. The benefits of these AI techniques and other implications for human-AI collaboration are also revealed. We further propose future directions to hopefully ignite innovations.

[AI-12] Reinforcement Learning-based Threat Assessment

Link: https://arxiv.org/abs/2503.02612
Authors: Wuzhou Sun, Siyi Li, Qingxiang Zou, Zixing Liao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 9 figures

Abstract:In some game scenarios, due to the uncertainty of the number of enemy units and the priority of various attributes, the evaluation of the threat level of enemy units as well as the screening has been a challenging research topic, and the core difficulty lies in how to reasonably set the priority of different attributes in order to achieve quantitative evaluation of the threat. In this paper, we innovatively transform the problem of threat assessment into a reinforcement learning problem, and through systematic reinforcement learning training, we successfully construct an efficient neural network evaluator. The evaluator can not only comprehensively integrate the multidimensional attribute features of the enemy, but also effectively combine our state information, thus realizing a more accurate and scientific threat assessment.

[AI-13] Playing games with Large language models: Randomness and strategy

Link: https://arxiv.org/abs/2503.02582
Authors: Alicia Vidler, Toby Walsh
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 9 pages

Abstract:Playing games has a long history of describing intricate interactions in simplified forms. In this paper we explore whether large language models (LLMs) can play games, investigating their capabilities for randomisation and strategic adaptation through both simultaneous and sequential game interactions. We focus on GPT-4o-Mini-2024-08-17 and test two games between LLMs: Rock Paper Scissors (RPS) and games of strategy (Prisoner's Dilemma, PD). LLMs are often described as stochastic parrots, and while they may indeed be parrots, our results suggest that they are not very stochastic in the sense that their outputs, when prompted to be random, are often very biased. Our research reveals that LLMs appear to develop loss aversion strategies in repeated games, with RPS converging to stalemate conditions while PD shows systematic shifts between cooperative and competitive outcomes based on prompt design. We detail programmatic tools for independent agent interactions and the Agentic AI challenges faced in implementation. We show that LLMs can indeed play games, just not very well. These results have implications for the use of LLMs in multi-agent LLM systems and showcase limitations in current approaches to model output for strategic decision-making.
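One common way to quantify the kind of bias described above is a chi-square goodness-of-fit test against a uniform distribution over the three moves. The sketch below is an illustration of that general technique, not the paper's method; the move sample is hypothetical, and 5.991 is the standard critical value for 2 degrees of freedom at the 0.05 level.

```python
from collections import Counter
from typing import Sequence

MOVES = ("rock", "paper", "scissors")
# Chi-square critical value for 2 degrees of freedom at alpha = 0.05.
CHI2_CRIT_DF2_P05 = 5.991


def chi_square_uniform(moves: Sequence[str]) -> float:
    """Chi-square statistic of observed move counts against a uniform distribution."""
    counts = Counter(moves)
    expected = len(moves) / len(MOVES)
    return sum((counts[m] - expected) ** 2 / expected for m in MOVES)


# Hypothetical sample of 100 moves; NOT data from the paper.
sample = ["rock"] * 60 + ["paper"] * 25 + ["scissors"] * 15
stat = chi_square_uniform(sample)
print(f"chi-square = {stat:.2f}, biased at 5% level: {stat > CHI2_CRIT_DF2_P05}")
```

A statistic above the critical value rejects the hypothesis that the moves were drawn uniformly at random, which is the sense in which a player's output can be called "not very stochastic".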

[AI-14] LLM-Safety Evaluations Lack Robustness

Link: https://arxiv.org/abs/2503.02574
Authors: Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field’s ability to generate easily comparable results and make measurable progress.

[AI-15] RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour (IROS 2025)

Link: https://arxiv.org/abs/2503.02572
Authors: Valerii Serpiva, Artem Lykov, Artyom Myshlyaev, Muhammad Haris Khan, Ali Alridha Abdulkarim, Oleg Sautenkov, Dzmitry Tsetserukou
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures. Submitted to IROS 2025

Abstract:RaceVLA presents an innovative approach for autonomous racing drone navigation by leveraging Visual-Language-Action (VLA) to emulate human-like behavior. This research explores the integration of advanced algorithms that enable drones to adapt their navigation strategies based on real-time environmental feedback, mimicking the decision-making processes of human pilots. The model, fine-tuned on a collected racing drone dataset, demonstrates strong generalization despite the complexity of drone racing environments. RaceVLA outperforms OpenVLA in motion (75.0 vs 60.0) and semantic generalization (45.5 vs 36.3), benefiting from the dynamic camera and simplified motion tasks. However, visual (79.6 vs 87.0) and physical (50.0 vs 76.7) generalization were slightly reduced due to the challenges of maneuvering in dynamic environments with varying object sizes. RaceVLA also outperforms RT-2 across all axes - visual (79.6 vs 52.0), motion (75.0 vs 55.0), physical (50.0 vs 26.7), and semantic (45.5 vs 38.8), demonstrating its robustness for real-time adjustments in complex environments. Experiments revealed an average velocity of 1.04 m/s, with a maximum speed of 2.02 m/s, and consistent maneuverability, demonstrating RaceVLA’s ability to handle high-speed scenarios effectively. These findings highlight the potential of RaceVLA for high-performance navigation in competitive racing contexts. The RaceVLA codebase, pretrained weights, and dataset are available at this http URL: this https URL

[AI-16] World Models for Anomaly Detection during Model-Based Reinforcement Learning Inference

Link: https://arxiv.org/abs/2503.02552
Authors: Fabian Domberg, Georg Schildbach
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Learning-based controllers are often purposefully kept out of real-world applications due to concerns about their safety and reliability. We explore how state-of-the-art world models in Model-Based Reinforcement Learning can be utilized beyond the training phase to ensure a deployed policy only operates within regions of the state-space it is sufficiently familiar with. This is achieved by continuously monitoring discrepancies between a world model’s predictions and observed system behavior during inference. It allows for triggering appropriate measures, such as an emergency stop, once an error threshold is surpassed. This does not require any task-specific knowledge and is thus universally applicable. Simulated experiments on established robot control tasks show the effectiveness of this method, recognizing changes in local robot geometry and global gravitational magnitude. Real-world experiments using an agile quadcopter further demonstrate the benefits of this approach by detecting unexpected forces acting on the vehicle. These results indicate how even in new and adverse conditions, safe and reliable operation of otherwise unpredictable learning-based controllers can be achieved.
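The monitoring loop described above can be sketched as follows: compare the world model's predicted state with the observed state at every step, smooth the error over a short window, and flag an anomaly once it exceeds a threshold. The class name `DiscrepancyMonitor`, the Euclidean error metric, the moving-average smoothing, and all numbers are assumptions for illustration; the abstract does not specify these details.

```python
import math
from collections import deque
from typing import Sequence


class DiscrepancyMonitor:
    """Flags an anomaly when the moving-average prediction error exceeds a threshold."""

    def __init__(self, threshold: float, window: int = 3) -> None:
        self.threshold = threshold
        self.errors = deque(maxlen=window)  # keeps only the most recent errors

    def update(self, predicted: Sequence[float], observed: Sequence[float]) -> bool:
        # Assumption: Euclidean distance between predicted and observed state.
        self.errors.append(math.dist(predicted, observed))
        return sum(self.errors) / len(self.errors) > self.threshold


monitor = DiscrepancyMonitor(threshold=0.5)
# Hypothetical (predicted, observed) pairs; the last two simulate a disturbance.
steps = [
    ([0.0, 0.0], [0.0, 0.1]),
    ([1.0, 0.0], [1.0, 0.1]),
    ([2.0, 0.0], [2.0, 1.5]),
    ([3.0, 0.0], [3.0, 2.0]),
]
flags = [monitor.update(p, o) for p, o in steps]
print(flags)  # a True entry is where a measure such as an emergency stop would fire
```

Averaging over a window trades detection latency for robustness to single-step noise; a window of 1 reduces this to a plain per-step threshold.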

[AI-17] LTL Verification of Memoryful Neural Agents (AAMAS 2025)

Link: https://arxiv.org/abs/2503.02512
Authors: Mehran Hosseini, Alessio Lomuscio, Nicola Paoletti
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Symbolic Computation (cs.SC)
Comments: 11 pages, 2 figures, accepted at AAMAS 2025 conference

Abstract:We present a framework for verifying Memoryful Neural Multi-Agent Systems (MN-MAS) against full Linear Temporal Logic (LTL) specifications. In MN-MAS, agents interact with a non-deterministic, partially observable environment. Examples of MN-MAS include multi-agent systems based on feed-forward and recurrent neural networks or state-space models. Different from previous approaches, we support the verification of both bounded and unbounded LTL specifications. We leverage well-established bounded model checking techniques, including lasso search and invariant synthesis, to reduce the verification problem to that of constraint solving. To solve these constraints, we develop efficient methods based on bound propagation, mixed-integer linear programming, and adaptive splitting. We evaluate the effectiveness of our algorithms in single and multi-agent environments from the Gymnasium and PettingZoo libraries, verifying unbounded specifications for the first time and improving the verification time for bounded specifications by an order of magnitude compared to the SoA.

[AI-18] PennyLang: Pioneering LLM-Based Quantum Code Generation with a Novel PennyLane-Centric Dataset (IJCNN 2025)

Link: https://arxiv.org/abs/2503.02497
Authors: Haider Asif, Abdul Basit, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures, 6 tables, submitted for review under IJCNN 2025

Abstract:Large Language Models (LLMs) offer remarkable capabilities in code generation, natural language processing, and domain-specific reasoning. Their potential in aiding quantum software development remains underexplored, particularly for the PennyLane framework, a leading platform for hybrid quantum-classical computing. To address this gap, we introduce a novel, high-quality dataset comprising 3,347 PennyLane-specific code samples of quantum circuits and their contextual descriptions, specifically curated to train/fine-tune LLM-based quantum code assistance. Our key contributions are threefold: (1) the automatic creation and open-source release of a comprehensive PennyLane dataset leveraging quantum computing textbooks, official documentation, and open-source repositories; (2) the development of a systematic methodology for data refinement, annotation, and formatting to optimize LLM training efficiency; and (3) a thorough evaluation, based on a Retrieval-Augmented Generation (RAG) framework, demonstrating the effectiveness of our dataset in streamlining PennyLane code generation and improving quantum development workflows. Compared to existing efforts that predominantly focus on Qiskit, our dataset significantly broadens the spectrum of quantum frameworks covered in AI-driven code assistance. By bridging this gap and providing reproducible dataset-creation methodologies, we aim to advance the field of AI-assisted quantum programming, making quantum computing more accessible to both newcomers and experienced developers.

[AI-19] Don't Get Too Excited – Eliciting Emotions in LLMs

Link: https://arxiv.org/abs/2503.02457
Authors: Gino Franco Fazzi, Julie Skoven Hinge, Stefan Heinrich, Paolo Burelli
Subjects: Artificial Intelligence (cs.AI)

Abstract:This paper investigates the challenges of affect control in large language models (LLMs), focusing on their ability to express appropriate emotional states during extended dialogues. We evaluated state-of-the-art open-weight LLMs to assess their affective expressive range in terms of arousal and valence. Our study employs a novel methodology combining LLM-based sentiment analysis with multiturn dialogue simulations between LLMs. We quantify the models’ capacity to express a wide spectrum of emotions and how they fluctuate during interactions. Our findings reveal significant variations among LLMs in their ability to maintain consistent affect, with some models demonstrating more stable emotional trajectories than others. Furthermore, we identify key challenges in affect control, including difficulties in producing and maintaining extreme emotional states and limitations in adapting affect to changing conversational contexts. These findings have important implications for the development of more emotionally intelligent AI systems and highlight the need for improved affect modelling in LLMs.

[AI-20] Sparse Meets Dense: Unified Generative Recommendations with Cascaded Sparse-Dense Representations

Link: https://arxiv.org/abs/2503.02453
Authors: Yuhao Yang, Zhi Ji, Zhaopeng Li, Yi Li, Zhonglin Mo, Yue Ding, Kai Chen, Zijian Zhang, Jie Li, Shuanglong Li, Lin Liu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Generative models have recently gained attention in recommendation systems by directly predicting item identifiers from user interaction sequences. However, existing methods suffer from significant information loss due to the separation of stages such as quantization and sequence modeling, hindering their ability to achieve the modeling precision and accuracy of sequential dense retrieval techniques. Integrating generative and dense retrieval methods remains a critical challenge. To address this, we introduce the Cascaded Organized Bi-Represented generAtive retrieval (COBRA) framework, which innovatively integrates sparse semantic IDs and dense vectors through a cascading process. Our method alternates between generating these representations by first generating sparse IDs, which serve as conditions to aid in the generation of dense vectors. End-to-end training enables dynamic refinement of dense representations, capturing both semantic insights and collaborative signals from user-item interactions. During inference, COBRA employs a coarse-to-fine strategy, starting with sparse ID generation and refining them into dense vectors via the generative model. We further propose BeamFusion, an innovative approach combining beam search with nearest neighbor scores to enhance inference flexibility and recommendation diversity. Extensive experiments on public datasets and offline tests validate our method’s robustness. Online A/B tests on a real-world advertising platform with over 200 million daily users demonstrate substantial improvements in key metrics, highlighting COBRA’s practical advantages.

[AI-21] AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents

Link: https://arxiv.org/abs/2503.02403
Authors: Jiahui Sun, Zhichao Hua, Yubin Xia
Subjects: Artificial Intelligence (cs.AI)

Abstract:Accurate and systematic evaluation of mobile agents can significantly advance their development and real-world applicability. However, existing benchmarks for mobile agents lack practicality and scalability due to the extensive manual effort required to define task reward signals and implement corresponding evaluation codes. To this end, we propose AutoEval, an autonomous agent evaluation framework that tests a mobile agent without any manual effort. First, we design a Structured Substate Representation to describe the UI state changes during agent execution, such that task reward signals can be automatically generated. Second, we utilize a Judge System that can autonomously evaluate agents’ performance given the automatically generated task reward signals. By providing only a task description, our framework evaluates agents with fine-grained performance feedback on that task without any extra manual effort. We implement a prototype of our framework and validate the automatically generated task reward signals, finding over 93% coverage of human-annotated reward signals. Moreover, to prove the effectiveness of our autonomous Judge System, we manually verify its judgments and demonstrate that it achieves 94% accuracy. Finally, we evaluate the state-of-the-art mobile agents using our framework, providing detailed insights into their performance characteristics and limitations.

[AI-22] PersonaX: A Recommendation Agent Oriented User Modeling Framework for Long Behavior Sequence

Link: https://arxiv.org/abs/2503.02398
Authors: Yunxiao Shi, Wujiang Xu, Zeqi Zhang, Xing Zi, Qiang Wu, Min Xu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: draft paper

Abstract:Recommendation agents leverage large language models for user modeling (LLM-UM) to construct textual personas guiding alignment with real users. However, existing LLM-UM methods struggle with long user-generated content (UGC) due to context limitations and performance degradation. To address this, sampling strategies that prioritize relevance or recency are often applied, yet they inevitably neglect the diverse user interests embedded within the discarded behaviors, resulting in incomplete modeling and degraded profiling quality. Furthermore, relevance-based sampling requires real-time retrieval, forcing the user modeling process to operate online, which introduces significant latency overhead. In this paper, we propose PersonaX, an agent-agnostic LLM-UM framework that tackles these challenges through sub-behavior sequence (SBS) selection and offline multi-persona construction. PersonaX extracts compact SBS segments offline to capture diverse user interests, generating fine-grained textual personas that are cached for efficient online retrieval. This approach ensures that the user persona used for prompting remains highly relevant to the current context while eliminating the need for online user modeling. For SBS selection, we ensure both efficiency (length less than five) and high representational quality by balancing prototypicality and diversity within the sampled data. Extensive experiments validate the effectiveness and versatility of PersonaX in high-quality user profiling. Utilizing only 30 to 50 percent of the behavioral data with a sequence length of 480, integrating PersonaX with AgentCF yields an absolute performance improvement of 3 to 11 percent, while integration with Agent4Rec results in a gain of 10 to 50 percent. PersonaX, as an agent-agnostic framework, sets a new benchmark for scalable user modeling, paving the way for more accurate and efficient LLM-driven recommendation agents.

[AI-23] A Binary Classification Social Network Dataset for Graph Machine Learning

Link: https://arxiv.org/abs/2503.02397
Authors: Adnan Ali, Jinglong Li, Huanhuan Chen, AlMotasem Bellah Al Ajlouni
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Social networks have a vast range of applications with graphs. The available benchmark datasets are citation, co-occurrence, e-commerce networks, etc., with classes ranging from 3 to 15. However, there is no benchmark classification social network dataset for graph machine learning. This paper fills the gap and presents the Binary Classification Social Network Dataset (BiSND), designed for graph machine learning applications to predict binary classes. We present the BiSND in tabular and graph formats to verify its robustness across classical and advanced machine learning. We employ a diverse set of classifiers, including four traditional machine learning algorithms (Decision Trees, K-Nearest Neighbour, Random Forest, XGBoost), one Deep Neural Network (multi-layer perceptrons), one Graph Neural Network (Graph Convolutional Network), and three state-of-the-art Graph Contrastive Learning methods (BGRL, GRACE, DAENS). Our findings reveal that BiSND is suitable for classification tasks, with F1-scores ranging from 67.66 to 70.15, indicating promising avenues for future enhancements.

[AI-24] JPDS-NN: Reinforcement Learning-Based Dynamic Task Allocation for Agricultural Vehicle Routing Optimization (IROS 2025)

Link: https://arxiv.org/abs/2503.02369
Authors: Yixuan Fan, Haotian Xu, Mengqiao Liu, Qing Zhuo, Tao Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures, submitted to IROS 2025

Abstract:The Entrance Dependent Vehicle Routing Problem (EDVRP) is a variant of the Vehicle Routing Problem (VRP) where the scale of cities influences routing outcomes, necessitating consideration of their entrances. This paper addresses EDVRP in agriculture, focusing on multi-parameter vehicle planning for irregularly shaped fields. To address the limitations of traditional methods, such as heuristic approaches, which often overlook field geometry and entrance constraints, we propose a Joint Probability Distribution Sampling Neural Network (JPDS-NN) to effectively solve the EDVRP. The network uses an encoder-decoder architecture with graph transformers and attention mechanisms to model routing as a Markov Decision Process, and is trained via reinforcement learning for efficient and rapid end-to-end planning. Experimental results indicate that JPDS-NN reduces travel distances by 48.4-65.4%, lowers fuel consumption by 14.0-17.6%, and computes two orders of magnitude faster than baseline methods, while demonstrating 15-25% superior performance in dynamic arrangement scenarios. Ablation studies validate the necessity of cross-attention and pre-training. The framework enables scalable, intelligent routing for large-scale farming under dynamic constraints.

[AI-25] CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory (ASPLOS '25)

Link: https://arxiv.org/abs/2503.02354
Authors: Jiashun Suo, Xiaojian Liao, Limin Xiao, Li Ruan, Jinquan Wang, Xiao Su, Zhisheng Huo
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: Accepted to ASPLOS '25

Abstract:Large language models like GPT-4 are resource-intensive, but recent advancements suggest that smaller, specialized experts can outperform the monolithic models on specific tasks. The Collaboration-of-Experts (CoE) approach integrates multiple expert models, improving the accuracy of generated results and offering great potential for precision-critical applications, such as automatic circuit board quality inspection. However, deploying CoE serving systems presents challenges to memory capacity due to the large number of experts required, which can lead to significant performance overhead from frequent expert switching across different memory and storage tiers. We propose CoServe, an efficient CoE model serving system on heterogeneous CPU and GPU with limited memory. CoServe reduces unnecessary expert switching by leveraging expert dependency, a key property of CoE inference. CoServe introduces a dependency-aware request scheduler and dependency-aware expert management for efficient inference. It also introduces an offline profiler to automatically find optimal resource allocation on various processors and devices. In real-world intelligent manufacturing workloads, CoServe achieves 4.5× to 12× higher throughput compared to state-of-the-art systems.

Journal reference: Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 2025. Related DOI: https://doi.org/10.1145/3676641.3715986

[AI-26] Enhancing the Product Quality of the Injection Process Using eXplainable Artificial Intelligence

Link: https://arxiv.org/abs/2503.02338
Authors: Jisoo Hong, Yongmin Hong, Jung-Woo Baek, Sung-Woo Kang
Subjects: Artificial Intelligence (cs.AI)

Abstract:The injection molding process is a traditional technique for making products in various industries such as electronics and automobiles via solidifying liquid resin into certain molds. Although the process is not related to creating the main part of engines or semiconductors, this manufacturing methodology sets the final form of the products. Recently, research has continued to reduce the defect rate of the injection molding process. This study proposes an optimal injection molding process control system to reduce the defect rate of injection molding products with XAI (eXplainable Artificial Intelligence) approaches. Boosting algorithms (XGBoost and LightGBM) are used as tree-based classifiers for predicting whether each product is normal or defective. The main features to control the process for improving the product are extracted by SHapley Additive exPlanations, while the individual conditional expectation analyzes the optimal control range of these extracted features. To validate the methodology presented in this work, the actual injection molding AI manufacturing dataset provided by KAMP (Korea AI Manufacturing Platform) is employed for the case study. The results reveal that the defect rate decreases from 1.00% (Original defect rate) to 0.21% with XGBoost and 0.13% with LightGBM, respectively.

[AI-27] Target Return Optimizer for Multi-Game Decision Transformer

Link: https://arxiv.org/abs/2503.02311
Authors: Kensuke Tatematsu,Akifumi Wachi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 10 pages

Abstract:Achieving autonomous agents with robust generalization capabilities across diverse games and tasks remains one of the ultimate goals in AI research. Recent advancements in transformer-based offline reinforcement learning, exemplified by the Multi-Game Decision Transformer [Lee et al., 2022], have shown remarkable performance across various games or tasks. However, these approaches depend heavily on human expertise, presenting substantial challenges for practical deployment, particularly in scenarios with limited prior game-specific knowledge. In this paper, we propose an algorithm called Multi-Game Target Return Optimizer (MTRO) to autonomously determine game-specific target returns within the Multi-Game Decision Transformer framework using solely offline datasets. MTRO addresses the existing limitations by automating the target return configuration process, leveraging environmental reward information extracted from offline datasets. Notably, MTRO does not require additional training, enabling seamless integration into existing Multi-Game Decision Transformer architectures. Our experimental evaluations on Atari games demonstrate that MTRO enhances the performance of RL policies across a wide array of games, underscoring its potential to advance the field of autonomous agent development.

[AI-28] Flexible Prefrontal Control over Hippocampal Episodic Memory for Goal-Directed Generalization

Link: https://arxiv.org/abs/2503.02303
Authors: Yicong Zheng,Nora Wolf,Charan Ranganath,Randall C. O’Reilly,Kevin L. McKee
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Many tasks require flexibly modifying perception and behavior based on current goals. Humans can retrieve episodic memories from days to years ago, using them to contextualize and generalize behaviors across novel but structurally related situations. The brain’s ability to control episodic memories based on task demands is often attributed to interactions between the prefrontal cortex (PFC) and hippocampus (HPC). We propose a reinforcement learning model that incorporates a PFC-HPC interaction mechanism for goal-directed generalization. In our model, the PFC learns to generate query-key representations to encode and retrieve goal-relevant episodic memories, modulating HPC memories top-down based on current task demands. Moreover, the PFC adapts its encoding and retrieval strategies dynamically when faced with multiple goals presented in a blocked, rather than interleaved, manner. Our results show that: (1) combining working memory with selectively retrieved episodic memory allows transfer of decisions among similar environments or situations, (2) top-down control from PFC over HPC improves learning of arbitrary structural associations between events for generalization to novel environments compared to a bottom-up sensory-driven approach, and (3) the PFC encodes generalizable representations during both encoding and retrieval of goal-relevant memories, whereas the HPC exhibits event-specific representations. Together, these findings highlight the importance of goal-directed prefrontal control over hippocampal episodic memory for decision-making in novel situations and suggest a computational mechanism by which PFC-HPC interactions enable flexible behavior.

[AI-29] Memorize or Generalize? Evaluating LLM Code Generation with Evolved Questions

Link: https://arxiv.org/abs/2503.02296
Authors: Wentao Chen,Lizhe Zhang,Li Zhong,Letian Peng,Zilong Wang,Jingbo Shang
Subjects: Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) are known to exhibit a memorization phenomenon in code generation: instead of truly understanding the underlying principles of a programming problem, they tend to memorize the original prompt and its solution together in the training. Consequently, when facing variants of the original problem, their answers very likely resemble the memorized solutions and fail to generalize. In this paper, we investigate this phenomenon by designing three evolution strategies to create variants: mutation, paraphrasing, and code-rewriting. By comparing the performance and AST similarity of the LLM-generated codes before and after these three evolutions, we develop a memorization score that positively correlates with the level of memorization. As expected, as supervised fine-tuning goes on, the memorization score rises before overfitting, suggesting more severe memorization. We demonstrate that common mitigation approaches, such as prompt translation and using evolved variants as data augmentation in supervised learning and reinforcement learning, either compromise the performance or fail to alleviate the memorization issue. Therefore, memorization remains a significant challenge in LLM code generation, highlighting the need for a more effective solution.
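
The abstract compares the AST similarity of generated code before and after a problem is evolved, but it does not say how that similarity is computed. The following is a minimal sketch of one possible measure, assuming Python solutions: it flattens each syntax tree into a sequence of node-type names and compares the sequences with `difflib`. The function names `ast_node_types` and `ast_similarity` are hypothetical, and this is not the paper's memorization score.

```python
import ast
import difflib


def ast_node_types(source):
    """Return the node-type names of the parsed source, in ast.walk order."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def ast_similarity(code_a, code_b):
    """Similarity in [0, 1] between two programs, based only on AST node types.

    Identifier names and literal values are ignored, so renaming variables
    does not lower the score. This is an illustrative stand-in, not the
    metric used in the paper.
    """
    matcher = difflib.SequenceMatcher(
        None, ast_node_types(code_a), ast_node_types(code_b)
    )
    return matcher.ratio()


original = "def f(x):\n    return x + 1\n"
renamed = "def g(y):\n    return y + 1\n"
different = "for i in range(3):\n    print(i)\n"

same_structure = ast_similarity(original, renamed)
other_structure = ast_similarity(original, different)
```

Under this measure, an answer to an evolved problem that stays structurally close to the answer for the original problem would be a hint of memorization, which is the intuition the abstract describes.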

[AI-30] Experience Replay with Random Reshuffling

Link: https://arxiv.org/abs/2503.02269
Authors: Yasuhiro Fujita
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Experience replay is a key component in reinforcement learning for stabilizing learning and improving sample efficiency. Its typical implementation samples transitions with replacement from a replay buffer. In contrast, in supervised learning with a fixed dataset, it is a common practice to shuffle the dataset every epoch and consume data sequentially, which is called random reshuffling (RR). RR enjoys theoretically better convergence properties and has been shown to outperform with-replacement sampling empirically. To leverage the benefits of RR in reinforcement learning, we propose sampling methods that extend RR to experience replay, both in uniform and prioritized settings. We evaluate our sampling methods on Atari benchmarks, demonstrating their effectiveness in deep reinforcement learning.
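
The contrast between with-replacement sampling and random reshuffling can be sketched as follows. This is a simplified illustration for the uniform case only, not the paper's method: the class name `ReshufflingReplayBuffer` and its interface are hypothetical, and the handling of transitions added in the middle of an epoch (they are only picked up at the next reshuffle) is an assumption made for brevity.

```python
import random


class ReshufflingReplayBuffer:
    """Fixed-capacity buffer that samples without replacement within an epoch."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = []    # stored transitions
        self.next_slot = 0   # slot to overwrite once the buffer is full
        self.pending = []    # shuffled indices not yet consumed this epoch
        self.rng = random.Random(seed)

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_slot] = transition
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self, batch_size):
        if not self.storage:
            raise ValueError("cannot sample from an empty buffer")
        batch = []
        while len(batch) < batch_size:
            if not self.pending:
                # New epoch: reshuffle the indices of everything stored so far.
                self.pending = list(range(len(self.storage)))
                self.rng.shuffle(self.pending)
            batch.append(self.storage[self.pending.pop()])
        return batch


buf = ReshufflingReplayBuffer(capacity=8, seed=0)
for t in range(8):
    buf.add(t)
epoch = buf.sample(8)                        # one full pass over the buffer
first_half, second_half = buf.sample(4), buf.sample(4)
```

A with-replacement sampler would instead call `rng.choice` independently for every element, so some transitions could appear several times in a pass while others are skipped.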

[AI-31] AppAgent X: Evolving GUI Agents as Proficient Smartphone Users

Link: https://arxiv.org/abs/2503.02268
Authors: Wenjia Jiang,Yangyang Zhuang,Chenxi Song,Xu Yang,Chi Zhang
Subjects: Artificial Intelligence (cs.AI)

Abstract:Recent advancements in Large Language Models (LLMs) have led to the development of intelligent LLM-based agents capable of interacting with graphical user interfaces (GUIs). These agents demonstrate strong reasoning and adaptability, enabling them to perform complex tasks that traditionally required predefined rules. However, the reliance on step-by-step reasoning in LLM-based agents often results in inefficiencies, particularly for routine tasks. In contrast, traditional rule-based systems excel in efficiency but lack the intelligence and flexibility to adapt to novel scenarios. To address this challenge, we propose a novel evolutionary framework for GUI agents that enhances operational efficiency while retaining intelligence and flexibility. Our approach incorporates a memory mechanism that records the agent’s task execution history. By analyzing this history, the agent identifies repetitive action sequences and evolves high-level actions that act as shortcuts, replacing these low-level operations and improving efficiency. This allows the agent to focus on tasks requiring more complex reasoning, while simplifying routine actions. Experimental results on multiple benchmark tasks demonstrate that our approach significantly outperforms existing methods in both efficiency and accuracy. The code will be open-sourced to support further research.

[AI-32] REAct: Rational Exponential Activation for Better Learning and Generalization in PINNs ICASSP2025

Link: https://arxiv.org/abs/2503.02267
Authors: Sourav Mishra,Shreya Hallikeri,Suresh Sundaram
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 5 tables, 1 figure; Accepted at ICASSP 2025

Abstract:Physics-Informed Neural Networks (PINNs) offer a promising approach to simulating physical systems. Still, their application is limited by optimization challenges, mainly due to the lack of activation functions that generalize well across several physical systems. Existing activation functions often lack such flexibility and generalization power. To address this issue, we introduce Rational Exponential Activation (REAct), a generalized form of tanh consisting of four learnable shape parameters. Experiments show that REAct outperforms many standard and benchmark activations, achieving an MSE three orders of magnitude lower than tanh on heat problems and generalizing well to finer grids and points beyond the training domain. It also excels at function approximation tasks and improves noise rejection in inverse problems, leading to more accurate parameter estimates across varying noise levels.
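
The abstract describes REAct only as a generalized form of tanh with four learnable shape parameters and does not state the formula. As a minimal sketch, assume the form (1 - exp(a*x + b)) / (1 + exp(c*x + d)). Both this form and the parameter names `a`, `b`, `c`, `d` are assumptions for illustration, chosen because the expression reduces exactly to tanh when a = c = -2 and b = d = 0.

```python
import math


def react(x, a, b, c, d):
    """Assumed rational-exponential activation with four shape parameters.

    Not taken from the abstract: the functional form is an illustrative guess.
    No overflow protection is applied, so keep |a*x + b| and |c*x + d| moderate.
    """
    return (1.0 - math.exp(a * x + b)) / (1.0 + math.exp(c * x + d))


def react_as_tanh(x):
    """Special case a = c = -2, b = d = 0, which equals tanh(x)."""
    return react(x, -2.0, 0.0, -2.0, 0.0)


points = [-1.5, -0.3, 0.0, 0.7, 2.0]
max_gap = max(abs(react_as_tanh(x) - math.tanh(x)) for x in points)
```

In a PINN the four parameters would be trained together with the network weights; here they are plain floats so the tanh special case can be checked directly.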

[AI-33] Large Language Models as Natural Selector for Embodied Soft Robot Design

Link: https://arxiv.org/abs/2503.02249
Authors: Changhe Chen,Xiaohao Xu,Xiangdong Wang,Xiaonan Huang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Designing soft robots is a complex and iterative process that demands cross-disciplinary expertise in materials science, mechanics, and control, often relying on intuition and extensive experimentation. While Large Language Models (LLMs) have demonstrated impressive reasoning abilities, their capacity to learn and apply embodied design principles–crucial for creating functional robotic systems–remains largely unexplored. This paper introduces RoboCrafter-QA, a novel benchmark to evaluate whether LLMs can learn representations of soft robot designs that effectively bridge the gap between high-level task descriptions and low-level morphological and material choices. RoboCrafter-QA leverages the EvoGym simulator to generate a diverse set of soft robot design challenges, spanning robotic locomotion, manipulation, and balancing tasks. Our experiments with state-of-the-art multi-modal LLMs reveal that while these models exhibit promising capabilities in learning design representations, they struggle with fine-grained distinctions between designs with subtle performance differences. We further demonstrate the practical utility of LLMs for robot design initialization. Our code and benchmark will be available to encourage the community to foster this exciting research direction.

[AI-34] V2X-LLM: Enhancing V2X Integration and Understanding in Connected Vehicle Corridors

Link: https://arxiv.org/abs/2503.02239
Authors: Keshu Wu,Pei Li,Yang Zhou,Rui Gan,Junwei You,Yang Cheng,Jingwen Zhu,Steven T. Parker,Bin Ran,David A. Noyce,Zhengzhong Tu
Subjects: Artificial Intelligence (cs.AI)

Abstract:The advancement of Connected and Automated Vehicles (CAVs) and Vehicle-to-Everything (V2X) offers significant potential for enhancing transportation safety, mobility, and sustainability. However, the integration and analysis of the diverse and voluminous V2X data, including Basic Safety Messages (BSMs) and Signal Phase and Timing (SPaT) data, present substantial challenges, especially on Connected Vehicle Corridors. These challenges include managing large data volumes, ensuring real-time data integration, and understanding complex traffic scenarios. Although these projects have developed an advanced CAV data pipeline that enables real-time communication between vehicles, infrastructure, and other road users for managing connected vehicle and roadside unit (RSU) data, significant hurdles in data comprehension and real-time scenario analysis and reasoning persist. To address these issues, we introduce the V2X-LLM framework, a novel enhancement to the existing CV data pipeline. V2X-LLM leverages Large Language Models (LLMs) to improve the understanding and real-time analysis of V2X data. The framework includes four key tasks: Scenario Explanation, offering detailed narratives of traffic conditions; V2X Data Description, detailing vehicle and infrastructure statuses; State Prediction, forecasting future traffic states; and Navigation Advisory, providing optimized routing instructions. By integrating LLM-driven reasoning with V2X data within the data pipeline, the V2X-LLM framework offers real-time feedback and decision support for traffic management. This integration enhances the accuracy of traffic analysis, safety, and traffic optimization. Demonstrations in a real-world urban corridor highlight the framework’s potential to advance intelligent transportation systems.

[AI-35] Deficient Excitation in Parameter Learning

Link: https://arxiv.org/abs/2503.02235
Authors: Ganghui Cao,Shimin Wang,Martin Guay,Jinzhi Wang,Zhisheng Duan,Marios M. Polycarpou
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Optimization and Control (math.OC)
Comments: 16 pages, 9 figures

Abstract:This paper investigates parameter learning problems under deficient excitation (DE). The DE condition is a rank-deficient, and therefore more general, form of the well-known persistent excitation condition. Under the DE condition, a proposed online algorithm is able to calculate the identifiable and non-identifiable subspaces, and finally give an optimal parameter estimate in the sense of least squares. In particular, the learning error within the identifiable subspace exponentially converges to zero in the noise-free case, even without persistent excitation. The DE condition also provides a new perspective for solving distributed parameter learning problems, where the challenge is posed by local regressors that are often insufficiently excited. To improve knowledge of the unknown parameters, a cooperative learning protocol is proposed for a group of estimators that collect measured information under complementary DE conditions. This protocol allows each local estimator to operate locally in its identifiable subspace, and reach a consensus with neighbours in its non-identifiable subspace. As a result, the task of estimating unknown parameters can be achieved in a distributed way using cooperative local estimators. Application examples in system identification are given to demonstrate the effectiveness of the theoretical results developed in this paper.

[AI-36] Attention Bootstrapping for Multi-Modal Test-Time Adaptation

Link: https://arxiv.org/abs/2503.02221
Authors: Yusheng Zhao,Junyu Luo,Xiao Luo,Jinsheng Huang,Jingyang Yuan,Zhiping Xiao,Ming Zhang
Subjects: Artificial Intelligence (cs.AI)

Abstract:Test-time adaptation aims to adapt a well-trained model to potential distribution shifts at test time using only unlabeled test data, without access to the original training data. While previous efforts mainly focus on a single modality, test-time distribution shift in the multi-modal setting is more complex and calls for new solutions. This paper tackles the problem of multi-modal test-time adaptation by proposing a novel method named Attention Bootstrapping with Principal Entropy Minimization (ABPEM). We observe that test-time distribution shift causes misalignment across modalities, leading to a large gap between intra-modality discrepancies (measured by self-attention) and inter-modality discrepancies (measured by cross-attention). We name this the attention gap. This attention gap widens with more severe distribution shifts, hindering effective modality fusion. To mitigate this attention gap and encourage better modality fusion, we propose attention bootstrapping that promotes cross-attention with the guidance of self-attention. Moreover, to reduce the gradient noise in the commonly-used entropy minimization, we adopt principal entropy minimization, a refinement of entropy minimization that reduces gradient noise by focusing on the principal parts of entropy, excluding less reliable gradient information. Extensive experiments on the benchmarks validate the effectiveness of the proposed ABPEM in comparison with competing baselines.

[AI-37] Discrete Differential Evolution Particle Swarm Optimization Algorithm for Energy Saving Flexible Job Shop Scheduling Problem Considering Machine Multi States

Link: https://arxiv.org/abs/2503.02180
Authors: Da Wang,Yu Zhang,Kai Zhang,Junqing Li,Dengwang Li
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)

Abstract:With the continuous deepening of low-carbon emission reduction policies, manufacturing industries urgently need sensible energy-saving scheduling schemes to balance improving production efficiency against reducing energy consumption. In energy-saving scheduling, reasonable machine state switching is key to achieving the expected goals, i.e., deciding whether the machines need to switch speed between different operations and whether they need extra setup time between different jobs. To this end, this work proposes a novel machine multi-state-based energy-saving flexible job scheduling problem (EFJSP-M), which simultaneously takes into account multiple machine speeds and setup time. To address the proposed EFJSP-M, a discrete differential evolution particle swarm optimization algorithm (D-DEPSO) is designed. Specifically, D-DEPSO includes a hybrid initialization strategy to improve the initial population performance, an updating mechanism embedded with differential evolution operators to enhance population diversity, and a critical path variable neighborhood search strategy to expand the solution space. Finally, based on datasets DPs and MKs, experimental comparisons with five state-of-the-art algorithms demonstrate the feasibility of EFJSP-M and the superiority of D-DEPSO.

[AI-38] KGCompiler: Deep Learning Compilation Optimization for Knowledge Graph Complex Logical Query Answering

Link: https://arxiv.org/abs/2503.02172
Authors: Hongyu Lin,Haoran Luo,Hanghang Cao,Yang Liu,Shihao Gao,Kaichun Yao,Libo Zhang,Mingjie Xing,Yanjun Wu
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Complex Logical Query Answering (CLQA) involves intricate multi-hop logical reasoning over large-scale and potentially incomplete Knowledge Graphs (KGs). Although existing CLQA algorithms achieve high accuracy in answering such queries, their reasoning time and memory usage scale significantly with the number of First-Order Logic (FOL) operators involved, creating serious challenges for practical deployment. In addition, current research primarily focuses on algorithm-level optimizations for CLQA tasks, often overlooking compiler-level optimizations, which can offer greater generality and scalability. To address these limitations, we introduce a Knowledge Graph Compiler, namely KGCompiler, the first deep learning compiler specifically designed for CLQA tasks. By incorporating KG-specific optimizations proposed in this paper, KGCompiler enhances the reasoning performance of CLQA algorithms without requiring additional manual modifications to their implementations. At the same time, it significantly reduces memory usage. Extensive experiments demonstrate that KGCompiler accelerates CLQA algorithms by factors ranging from 1.04x to 8.26x, with an average speedup of 3.71x. We also provide an interface to enable hands-on experience with KGCompiler.

[AI-39] AugFL: Augmenting Federated Learning with Pretrained Models

Link: https://arxiv.org/abs/2503.02154
Authors: Sheng Yue,Zerui Qin,Yongheng Deng,Ju Ren,Yaoxue Zhang,Junshan Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: to be published in Transactions on Networking

Abstract:Federated Learning (FL) has garnered widespread interest in recent years. However, owing to strict privacy policies or limited storage capacities of training participants such as IoT devices, its effective deployment is often impeded by the scarcity of training data in practical decentralized learning environments. In this paper, we study enhancing FL with the aid of (large) pre-trained models (PMs), which encapsulate rich general/domain-agnostic knowledge, to alleviate the data requirement in conducting FL from scratch. Specifically, we consider a networked FL system formed by a central server and distributed clients. First, we formulate the PM-aided personalized FL as a regularization-based federated meta-learning problem, where clients join forces to learn a meta-model with knowledge transferred from a private PM stored at the server. Then, we develop an inexact-ADMM-based algorithm, AugFL, to optimize the problem with no need to expose the PM or incur additional computational costs to local clients. Further, we establish theoretical guarantees for AugFL in terms of communication complexity, adaptation performance, and the benefit of knowledge transfer in general non-convex cases. Extensive experiments corroborate the efficacy and superiority of AugFL over existing baselines.

[AI-40] Elliptic Loss Regularization ICLR2025

Link: https://arxiv.org/abs/2503.02138
Authors: Ali Hasan,Haoming Yang,Yuting Ng,Vahid Tarokh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: ICLR 2025

Abstract:Regularizing neural networks is important for anticipating model behavior in regions of the data space that are not well represented. In this work, we propose a regularization technique for enforcing a level of smoothness in the mapping between the data input space and the loss value. We specify the level of regularity by requiring that the loss of the network satisfies an elliptic operator over the data domain. To do this, we modify the usual empirical risk minimization objective such that we instead minimize a new objective that satisfies an elliptic operator over points within the domain. This allows us to use existing theory on elliptic operators to anticipate the behavior of the error for points outside the training set. We propose a tractable computational method that approximates the behavior of the elliptic operator while being computationally efficient. Finally, we analyze the properties of the proposed regularization to understand the performance on common problems of distribution shift and group imbalance. Numerical experiments confirm the utility of the proposed regularization technique.

[AI-41] A Near Complete Nonasymptotic Generalization Theory For Multilayer Neural Networks: Beyond the Bias-Variance Tradeoff

Link: https://arxiv.org/abs/2503.02129
Authors: Hao Yu,Xiangyang Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)

Abstract:We propose a first near-complete (in a sense made explicit in the main text) nonasymptotic generalization theory for multilayer neural networks with arbitrary Lipschitz activations and general Lipschitz loss functions (with some very mild conditions). In particular, it does not require the boundedness of the loss function, as commonly assumed in the literature. Our theory goes beyond the bias-variance tradeoff, consistent with phenomena typically encountered in deep learning. It is therefore sharply different from other existing nonasymptotic generalization error bounds for neural networks. More explicitly, we propose an explicit generalization error upper bound for multilayer neural networks with arbitrary Lipschitz activations σ with σ(0)=0 and broad enough Lipschitz loss functions, without requiring the width, depth or other hyperparameters of the neural network to approach infinity, a specific neural network architecture (e.g. sparsity, boundedness of some norms), a particular activation function, a particular optimization algorithm or boundedness of the loss function, while taking the approximation error into consideration. General Lipschitz activations can also be accommodated in our framework. A feature of our theory is that it also considers approximation errors. Furthermore, we show the near minimax optimality of our theory for multilayer ReLU networks for regression problems. Notably, our upper bound exhibits the famous double descent phenomenon for such networks, which is the most distinguishing characteristic compared with other existing results. This work emphasizes the view that many classical results should be improved to embrace the unintuitive characteristics of deep learning in order to understand it better.

[AI-42] TMIQ: Quantifying Test and Measurement Domain Intelligence in Large Language Models

Link: https://arxiv.org/abs/2503.02123
Authors: Emmanuel A. Olowe,Danial Chitnis
Subjects: Artificial Intelligence (cs.AI)
Comments: accepted in IEEE I2MTC 2025

Abstract:The Test and Measurement domain, known for its strict requirements for accuracy and efficiency, is increasingly adopting Generative AI technologies to enhance the performance of data analysis, automation, and decision-making processes. Among these, Large Language Models (LLMs) show significant promise for advancing automation and precision in testing. However, the evaluation of LLMs in this specialized area remains insufficiently explored. To address this gap, we introduce the Test and Measurement Intelligence Quotient (TMIQ), a benchmark designed to quantitatively assess LLMs across a wide range of electronic engineering tasks. TMIQ offers a comprehensive set of scenarios and metrics for detailed evaluation, including SCPI command matching accuracy, ranked response evaluation, Chain-of-Thought Reasoning (CoT), and the impact of output formatting variations required by LLMs on performance. In testing various LLMs, our findings indicate varying levels of proficiency, with exact SCPI command match accuracy ranging from around 56% to 73%, and ranked matching first-position scores achieving around 33% for the best-performing model. We also assess token usage, cost-efficiency, and response times, identifying trade-offs between accuracy and operational efficiency. Additionally, we present a command-line interface (CLI) tool that enables users to generate datasets using the same methodology, allowing for tailored assessments of LLMs. TMIQ and the CLI tool provide a rigorous, reproducible means of evaluating LLMs for production environments, facilitating continuous monitoring and identifying strengths and areas for improvement, and driving innovation in their selections for applications within the Test and Measurement industry.

[AI-43] LLMs as Educational Analysts: Transforming Multimodal Data Traces into Actionable Reading Assessment Reports

Link: https://arxiv.org/abs/2503.02099
Authors: Eduardo Davalos,Yike Zhang,Namrata Srivastava,Jorge Alberto Salas,Sara McFadden,Sun-Joo Cho,Gautam Biswas,Amanda Goodwin
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 15 pages, 5 figures, 3 tables

Abstract:Reading assessments are essential for enhancing students’ comprehension, yet many EdTech applications focus mainly on outcome-based metrics, providing limited insights into student behavior and cognition. This study investigates the use of multimodal data sources – including eye-tracking data, learning outcomes, assessment content, and teaching standards – to derive meaningful reading insights. We employ unsupervised learning techniques to identify distinct reading behavior patterns, and then a large language model (LLM) synthesizes the derived information into actionable reports for educators, streamlining the interpretation process. LLM experts and human educators evaluate these reports for clarity, accuracy, relevance, and pedagogical usefulness. Our findings indicate that LLMs can effectively function as educational analysts, turning diverse data into teacher-friendly insights that are well-received by educators. While promising for automating insight generation, human oversight remains crucial to ensure reliability and fairness. This research advances human-centered AI in education, connecting data-driven analytics with practical classroom applications.

[AI-44] Correlation to Causation: A Causal Deep Learning Framework for Arctic Sea Ice Prediction

Link: https://arxiv.org/abs/2503.02093
Authors: Emam Hossain,Muhammad Hasan Ferdous,Jianwu Wang,Aneesh Subramanian,Md Osman Gani
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for Publication in Causal AI for Robust Decision Making (CARD) Workshop in the International Conference on Pervasive Computing and Communications (PerCom 2025)

Abstract:Traditional machine learning and deep learning techniques rely on correlation-based learning, often failing to distinguish spurious associations from true causal relationships, which limits robustness, interpretability, and generalizability. To address these challenges, we propose a causality-driven deep learning framework that integrates Multivariate Granger Causality (MVGC) and PCMCI+ causal discovery algorithms with a hybrid deep learning architecture. Using 43 years (1979-2021) of daily and monthly Arctic Sea Ice Extent (SIE) and ocean-atmospheric datasets, our approach identifies causally significant factors, prioritizes features with direct influence, reduces feature overhead, and improves computational efficiency. Experiments demonstrate that integrating causal features enhances the deep learning model’s predictive accuracy and interpretability across multiple lead times. Beyond SIE prediction, the proposed framework offers a scalable solution for dynamic, high-dimensional systems, advancing both theoretical understanding and practical applications in predictive modeling.

[AI-45] AI persuading AI vs AI persuading Humans: LLMs' Differential Effectiveness in Promoting Pro-Environmental Behavior

Link: https://arxiv.org/abs/2503.02067
Authors: Alexander Doudkin,Pat Pataranutaporn,Pattie Maes
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 17 pages, 13 figures, 3 tables

Abstract:Pro-environmental behavior (PEB) is vital to combat climate change, yet turning awareness into intention and action remains elusive. We explore large language models (LLMs) as tools to promote PEB, comparing their impact across 3,200 participants: real humans (n=1,200), simulated humans based on actual participant data (n=1,200), and fully synthetic personas (n=1,200). All three participant groups faced personalized or standard chatbots, or static statements, employing four persuasion strategies (moral foundations, future self-continuity, action orientation, or “freestyle” chosen by the LLM). Results reveal a “synthetic persuasion paradox”: synthetic and simulated agents significantly affect their post-intervention PEB stance, while human responses barely shift. Simulated participants better approximate human trends but still overestimate effects. This disconnect underscores LLM’s potential for pre-evaluating PEB interventions but warns of its limits in predicting real-world behavior. We call for refined synthetic modeling and sustained and extended human trials to align conversational AI’s promise with tangible sustainability outcomes.

[AI-46] Survey Perspective: The Role of Explainable AI in Threat Intelligence SIGIR

Link: https://arxiv.org/abs/2503.02065
Authors: Nidhi Rastogi,Devang Dhanuka,Amulya Saxena,Pranjal Mairal,Le Nguyen
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 5 pages, SIGIR Symposium on IR in Practice (SIRIP), 2025

Abstract:The increasing reliance on AI-based security tools in Security Operations Centers (SOCs) has transformed threat detection and response, yet analysts frequently struggle with alert overload, false positives, and lack of contextual relevance. The inability to effectively analyze AI-generated security alerts leads to inefficiencies in incident response and reduces trust in automated decision-making. In this paper, we present the results and analysis of our investigation into how SOC analysts navigate AI-based alerts, their challenges with current security tools, and how explainability (XAI) integrated into their security workflows has the potential to become an effective decision support. To this end, we conducted an industry survey. Using the survey responses, we analyze how security analysts process, retrieve, and prioritize alerts. Our findings indicate that most analysts have not yet adopted XAI-integrated tools, but they express high interest in attack attribution, confidence scores, and feature contribution explanations to improve interpretability and triage efficiency. Based on our findings, we also propose practical design recommendations for XAI-enhanced security alert systems, enabling AI-based cybersecurity solutions to be more transparent, interpretable, and actionable.

[AI-47] FRMD: Fast Robot Motion Diffusion with Consistency-Distilled Movement Primitives for Smooth Action Generation

Link: https://arxiv.org/abs/2503.02048
Authors: Xirui Shi, Jun Jin
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2406.01586 by other authors

Abstract:We consider the problem of using diffusion models to generate fast, smooth, and temporally consistent robot motions. Although diffusion models have demonstrated superior performance in robot learning due to their task scalability and multi-modal flexibility, they suffer from two fundamental limitations: (1) they often produce non-smooth, jerky motions due to their inability to capture temporally consistent movement dynamics, and (2) their iterative sampling process incurs prohibitive latency for many robotic tasks. Inspired by classic robot motion generation methods such as DMPs and ProMPs, which capture the temporally and spatially consistent dynamics of trajectories using low-dimensional vectors, and by recent advances in diffusion-based image generation that use consistency models with probability flow ODEs to accelerate the denoising process, we propose Fast Robot Motion Diffusion (FRMD). FRMD uniquely integrates Movement Primitives (MPs) with Consistency Models to enable efficient, single-step trajectory generation. By leveraging probability flow ODEs and consistency distillation, our method models trajectory distributions while learning a compact, time-continuous motion representation within an encoder-decoder architecture. This unified approach eliminates the slow, multi-step denoising process of conventional diffusion models, enabling efficient one-step inference and smooth robot motion generation. We extensively evaluated FRMD on the well-recognized Meta-World and ManiSkills benchmarks, ranging from simple to more complex manipulation tasks, comparing its performance against state-of-the-art baselines. Our results show that FRMD generates significantly faster, smoother trajectories while achieving higher success rates.

[AI-48] Dynamic Search for Inference-Time Alignment in Diffusion Models

Link: https://arxiv.org/abs/2503.02039
Authors: Xiner Li, Masatoshi Uehara, Xingyu Su, Gabriele Scalia, Tommaso Biancalani, Aviv Regev, Sergey Levine, Shuiwang Ji
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Diffusion models have shown promising generative capabilities across diverse domains, yet aligning their outputs with desired reward functions remains a challenge, particularly in cases where reward functions are non-differentiable. Some gradient-free guidance methods have been developed, but they often struggle to achieve optimal inference-time alignment. In this work, we newly frame inference-time alignment in diffusion as a search problem and propose Dynamic Search for Diffusion (DSearch), which subsamples from denoising processes and approximates intermediate node rewards. It also dynamically adjusts beam width and tree expansion to efficiently explore high-reward generations. To refine intermediate decisions, DSearch incorporates adaptive scheduling based on noise levels and a lookahead heuristic function. We validate DSearch across multiple domains, including biological sequence design, molecular optimization, and image generation, demonstrating superior reward optimization compared to existing approaches.

[AI-49] Pretrained Embeddings as a Behavior Specification Mechanism

Link: https://arxiv.org/abs/2503.02012
Authors: Parv Kapoor, Abigail Hammer, Ashish Kapoor, Karen Leung, Eunsuk Kang
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO); Software Engineering (cs.SE)
Comments: 18 pages, 6 figures

Abstract:We propose an approach to formally specifying the behavioral properties of systems that rely on a perception model for interactions with the physical world. The key idea is to introduce embeddings – mathematical representations of a real-world concept – as a first-class construct in a specification language, where properties are expressed in terms of distances between a pair of ideal and observed embeddings. To realize this approach, we propose a new type of temporal logic called Embedding Temporal Logic (ETL), and describe how it can be used to express a wider range of properties about AI-enabled systems than previously possible. We demonstrate the applicability of ETL through a preliminary evaluation involving planning tasks in robots that are driven by foundation models; the results are promising, showing that embedding-based specifications can be used to steer a system towards desirable behaviors.
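
As a rough illustration of the key idea (properties expressed as distances between an ideal and an observed embedding), the sketch below checks two temporal properties over a trace of observed embeddings. It is not the paper's ETL semantics: the choice of cosine distance, the threshold `eps`, and the function names `always_close` and `eventually_close` are all assumptions made here for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def always_close(trace, ideal, eps):
    """Hypothetical 'always' operator: every observation stays within eps of the ideal."""
    return all(cosine_distance(obs, ideal) <= eps for obs in trace)


def eventually_close(trace, ideal, eps):
    """Hypothetical 'eventually' operator: some observation comes within eps of the ideal."""
    return any(cosine_distance(obs, ideal) <= eps for obs in trace)


# Toy 2-D embeddings standing in for the output of a perception model.
ideal = [1.0, 0.0]
trace = [[1.0, 0.0], [0.9, 0.1]]
print(always_close(trace, ideal, eps=0.05))
print(eventually_close([[0.0, 1.0], [1.0, 0.0]], ideal, eps=0.05))
```

In a real system the embeddings would come from a pretrained model and the threshold would need calibration; both are outside the scope of this sketch.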

[AI-50] TactStyle: Generating Tactile Textures with Generative AI for Digital Fabrication

Link: https://arxiv.org/abs/2503.02007
Authors: Faraz Faruqi, Maxine Perroni-Scharf, Jaskaran Singh Walia, Yunyi Zhu, Shuyue Feng, Donald Degraen, Stefanie Mueller
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Recent work in Generative AI enables the stylization of 3D models based on image prompts. However, these methods do not incorporate tactile information, leading to designs that lack the expected tactile properties. We present TactStyle, a system that allows creators to stylize 3D models with images while incorporating the expected tactile properties. TactStyle accomplishes this using a modified image-generation model fine-tuned to generate heightfields for given surface textures. By optimizing 3D model surfaces to embody a generated texture, TactStyle creates models that match the desired style and replicate the tactile experience. We utilize a large-scale dataset of textures to train our texture generation model. In a psychophysical experiment, we evaluate the tactile qualities of a set of 3D-printed original textures and TactStyle’s generated textures. Our results show that TactStyle successfully generates a wide range of tactile features from a single image input, enabling a novel approach to haptic design.

[AI-51] Proportionality in Thumbs Up and Down Voting

Link: https://arxiv.org/abs/2503.01985
Authors: Sonja Kraiczy, Georgios Papasotiropoulos, Grzegorz Pierczyński, Piotr Skowron
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)

Abstract:Consider the decision-making setting where agents elect a panel by expressing both positive and negative preferences. Prominently, in constitutional AI, citizens democratically select a slate of ethical preferences on which a foundation model is to be trained. There, in practice, agents may both approve and disapprove of different ethical principles. Proportionality has been well-studied in computational social choice for approval ballots, but its meaning remains unclear when negative sentiments are also considered. In this work, we propose two conceptually distinct approaches to interpret proportionality in the presence of up and down votes. The first approach treats the satisfaction from electing candidates and the impact of vetoing them as comparable, leading to combined proportionality guarantees. The second approach considers veto power separately, introducing guarantees distinct from traditional proportionality. We formalize axioms for each perspective and examine their satisfiability by suitable adaptations of Phragmén’s rule, Proportional Approval Voting rule and the Method of Equal Shares.
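
For readers unfamiliar with the baseline rules, here is a minimal sketch of classic Proportional Approval Voting for approval-only ballots, which is the starting point the abstract says is adapted. It does not implement the paper's handling of down votes. The brute-force search and the toy ballots are illustrative choices made here.

```python
from itertools import combinations


def pav_score(committee, ballots):
    """Classic PAV score: each voter contributes 1 + 1/2 + ... + 1/h,
    where h is the number of committee members that voter approves."""
    score = 0.0
    for approved in ballots:
        hits = len(approved & committee)
        score += sum(1.0 / j for j in range(1, hits + 1))
    return score


def pav_winner(candidates, ballots, k):
    """Brute-force search over all size-k committees (only viable for tiny instances)."""
    committees = (frozenset(c) for c in combinations(sorted(candidates), k))
    return max(committees, key=lambda c: pav_score(c, ballots))


# Toy election: four voters approve {a, b}, three voters approve {c}.
ballots = [frozenset("ab")] * 4 + [frozenset("c")] * 3
print(sorted(pav_winner("abc", ballots, k=2)))
```

With two seats, the harmonic weights make a committee that includes `c` score higher than `{a, b}`, so the smaller group is represented. That is the proportionality intuition the paper extends to ballots with negative votes.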

[AI-52] Task Scheduling & Forgetting in Multi-Task Reinforcement Learning

Link: https://arxiv.org/abs/2503.01941
Authors: Marc Speckmann, Theresa Eimer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Presented at RLDM 2025

Abstract:Reinforcement learning (RL) agents can forget tasks they have previously been trained on. There is a rich body of work on such forgetting effects in humans. Therefore we look for commonalities in the forgetting behavior of humans and RL agents across tasks and test the viability of forgetting prevention measures from learning theory in RL. We find that in many cases, RL agents exhibit forgetting curves similar to those of humans. Methods like Leitner or SuperMemo have been shown to be effective at counteracting human forgetting, but we demonstrate they do not transfer as well to RL. We identify a likely cause: asymmetrical learning and retention patterns between tasks that cannot be captured by retention-based or performance-based curriculum strategies.
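
To make the Leitner reference concrete, the sketch below shows the box mechanics of a Leitner-style spaced-repetition schedule applied to a set of tasks. It is not the paper's experimental setup: treating each RL task as a "card", the doubling review interval, and the names `leitner_update` and `due_tasks` are assumptions made here for illustration.

```python
def leitner_update(box, recalled, n_boxes=5):
    """Promote a task one box on success; send it back to box 0 on failure."""
    return min(box + 1, n_boxes - 1) if recalled else 0


def due_tasks(boxes, step):
    """Assumed schedule: a task in box k is revisited every 2**k steps."""
    return [task for task, box in boxes.items() if step % (2 ** box) == 0]


# Hypothetical task names; "recalled" would come from a performance threshold.
boxes = {"reach": 0, "push": 1, "pick": 2}
print(due_tasks(boxes, step=2))
boxes["reach"] = leitner_update(boxes["reach"], recalled=True)
boxes["pick"] = leitner_update(boxes["pick"], recalled=False)
print(boxes)
```

A schedule like this keys only on retention or performance per task, which is the kind of curriculum signal the abstract argues cannot capture asymmetric learning and retention patterns between tasks.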

[AI-53] Synthetic Tabular Data Detection In the Wild

Link: https://arxiv.org/abs/2503.01937
Authors: G. Charbel N. Kindji (IRISA, LACODAM), Elisa Fromont (IRISA, LACODAM), Lina Maria Rojas-Barahona, Tanguy Urvoy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments: International Symposium on Intelligent Data Analysis, May 2025, Konstanz, Germany

Abstract:Detecting synthetic tabular data is essential to prevent the distribution of false or manipulated datasets that could compromise data-driven decision-making. This study explores whether synthetic tabular data can be reliably identified across different tables. This challenge is unique to tabular data, where structures (such as number of columns, data types, and formats) can vary widely from one table to another. We propose four table-agnostic detectors combined with simple preprocessing schemes that we evaluate on six evaluation protocols, with different levels of “wildness”. Our results show that cross-table learning on a restricted set of tables is possible even with naive preprocessing schemes. They confirm, however, that cross-table transfer (i.e., deployment on a table that has not been seen before) is challenging. This suggests that sophisticated encoding schemes are required to handle this problem.

[AI-54] Decision-Focused Fine-Tuning of Time Series Foundation Models for Dispatchable Feeder Optimization

Link: https://arxiv.org/abs/2503.01936
Authors: Maximilian Beichter, Nils Friederich, Janik Pinter, Dorina Werling, Kaleb Phipps, Sebastian Beichter, Oliver Neumann, Ralf Mikut, Veit Hagenmeyer, Benedikt Heidrich
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Time series foundation models provide a universal solution for generating forecasts to support optimization problems in energy systems. Those foundation models are typically trained in a prediction-focused manner to maximize forecast quality. In contrast, decision-focused learning directly improves the resulting value of the forecast in downstream optimization rather than merely maximizing forecasting quality. The practical integration of forecast values into forecasting models is challenging, particularly when addressing complex applications with diverse instances, such as buildings. This becomes even more complicated when instances possess specific characteristics that require instance-specific, tailored predictions to increase the forecast value. To tackle this challenge, we use decision-focused fine-tuning within time series foundation models to offer a scalable and efficient solution for decision-focused learning applied to the dispatchable feeder optimization problem. To obtain more robust predictions for scarce building data, we use Moirai as a state-of-the-art foundation model, which offers robust and generalized results with few-shot parameter-efficient fine-tuning. Comparing the decision-focused fine-tuned Moirai with a state-of-the-art classical prediction-focused fine-tuned Moirai, we observe an improvement of 9.45% in average total daily costs.

[AI-55] Adversarial Generative Flow Network for Solving Vehicle Routing Problems ICLR2025

Link: https://arxiv.org/abs/2503.01931
Authors: Ni Zhang, Jingfeng Yang, Zhiguang Cao, Xu Chi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2025

Abstract:Recent research into solving vehicle routing problems (VRPs) has gained significant traction, particularly through the application of deep (reinforcement) learning for end-to-end solution construction. However, many current construction-based neural solvers predominantly utilize Transformer architectures, which can face scalability challenges and struggle to produce diverse solutions. To address these limitations, we introduce a novel framework beyond Transformer-based approaches, i.e., Adversarial Generative Flow Networks (AGFN). This framework integrates the generative flow network (GFlowNet), a probabilistic model inherently adept at generating diverse solutions (routes), with a complementary model for discriminating (or evaluating) the solutions. These models are trained alternately in an adversarial manner to improve the overall solution quality, followed by a proposed hybrid decoding method to construct the solution. We apply the AGFN framework to solve the capacitated vehicle routing problem (CVRP) and travelling salesman problem (TSP), and our experimental results demonstrate that AGFN surpasses the popular construction-based neural solvers, showcasing strong generalization capabilities on synthetic and real-world benchmark instances.

[AI-56] TAET: Two-Stage Adversarial Equalization Training on Long-Tailed Distributions CVPR2025

Link: https://arxiv.org/abs/2503.01924
Authors: Wang YuHang, Junkang Guo, Aolei Liu, Kaihao Wang, Zaitong Wu, Zhenyu Liu, Wenfei Yin, Jian Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Text: 8 pages of main content, 5 pages of appendices; accepted by CVPR 2025

Abstract:Adversarial robustness is a critical challenge in deploying deep neural networks for real-world applications. While adversarial training is a widely recognized defense strategy, most existing studies focus on balanced datasets, overlooking the prevalence of long-tailed distributions in real-world data, which significantly complicates robustness. This paper provides a comprehensive analysis of adversarial training under long-tailed distributions and identifies limitations in the current state-of-the-art method, AT-BSL, in achieving robust performance under such conditions. To address these challenges, we propose a novel training framework, TAET, which integrates an initial stabilization phase followed by a stratified equalization adversarial training phase. Additionally, prior work on long-tailed robustness has largely ignored the crucial evaluation metric of balanced accuracy. To bridge this gap, we introduce the concept of balanced robustness, a comprehensive metric tailored for assessing robustness under long-tailed distributions. Extensive experiments demonstrate that our method surpasses existing advanced defenses, achieving significant improvements in both memory and computational efficiency. This work represents a substantial advancement in addressing robustness challenges in real-world applications. Our code is available at: this https URL.
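
The abstract builds on balanced accuracy, which is the mean of per-class recall. The sketch below computes that standard metric on a toy long-tailed label set. Reading "balanced robustness" as the same average computed on predictions for adversarially perturbed inputs is an assumption made here, not a definition taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so tail classes weigh as much as head classes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy long-tailed data: 8 head-class samples (label 0), 2 tail-class samples (label 1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two tail samples is misclassified

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain)                              # plain accuracy, dominated by the head class
print(balanced_accuracy(y_true, y_pred))  # per-class average exposes the tail error
```

Plain accuracy here is 0.9 while balanced accuracy is 0.75, which shows why a class-balanced metric matters on long-tailed data.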

[AI-57] Reinforcement learning with combinatorial actions for coupled restless bandits ICLR2025

Link: https://arxiv.org/abs/2503.01919
Authors: Lily Xu, Bryan Wilder, Elias B. Khalil, Milind Tambe
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear at ICLR 2025. Code at this https URL

Abstract:Reinforcement learning (RL) has increasingly been applied to solve real-world planning problems, with progress in handling large state spaces and time horizons. However, a key bottleneck in many domains is that RL methods cannot accommodate large, combinatorially structured action spaces. In such settings, even representing the set of feasible actions at a single step may require a complex discrete optimization formulation. We leverage recent advances in embedding trained neural networks into optimization problems to propose SEQUOIA, an RL algorithm that directly optimizes for long-term reward over the feasible action space. Our approach embeds a Q-network into a mixed-integer program to select a combinatorial action in each timestep. Here, we focus on planning over restless bandits, a class of planning problems which capture many real-world examples of sequential decision making. We introduce coRMAB, a broader class of restless bandits with combinatorial actions that cannot be decoupled across the arms of the restless bandit, requiring direct solving over the joint, exponentially large action space. We empirically validate SEQUOIA on four novel restless bandit problems with combinatorial constraints: multiple interventions, path constraints, bipartite matching, and capacity constraints. Our approach significantly outperforms existing methods – which cannot address sequential planning and combinatorial selection simultaneously – by an average of 26.4% on these difficult instances.

[AI-58] Attend or Perish: Benchmarking Attention in Algorithmic Reasoning

Link: https://arxiv.org/abs/2503.01909
Authors: Michal Spiegel, Michal Štefánik, Marek Kadlčík, Josef Kuchař
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Can transformers learn to perform algorithmic tasks reliably across previously unseen input/output domains? While pre-trained language models show solid accuracy on benchmarks incorporating algorithmic reasoning, assessing the reliability of these results necessitates an ability to cleanse models’ functional capabilities from memorization. In this paper, we propose an algorithmic benchmark comprising six tasks of infinite input domains where we can also disentangle and trace the correct, robust algorithm necessary for the task. This allows us to assess (i) models’ ability to extrapolate to unseen types of inputs, including new lengths, value ranges or input domains, and (ii) the robustness of the functional mechanism in recent models through the lens of their attention maps. We make the implementation of all our tasks and interpretability methods publicly available at this https URL.

[AI-59] UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning

Link: https://arxiv.org/abs/2503.01908
Authors: Jiawei Zhang, Shuang Yang, Bo Li
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for handling complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements also amplify the risks of adversarial attacks, particularly when LLM agents can access sensitive external functionalities. Moreover, because LLM agents engage in extensive reasoning or planning before executing final actions, manipulating them into performing targeted malicious actions or invoking specific tools remains a significant challenge. Consequently, directly embedding adversarial strings in malicious instructions or injecting malicious prompts into tool interactions has become less effective against modern LLM agents. In this work, we present UDora, a unified red teaming framework designed for LLM Agents that dynamically leverages the agent’s own reasoning processes to compel it toward malicious behavior. Specifically, UDora first samples the model’s reasoning for the given task, then automatically identifies multiple optimal positions within these reasoning traces to insert targeted perturbations. Subsequently, it uses the modified reasoning as the objective to optimize the adversarial strings. By iteratively applying this process, the LLM agent will then be induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets.

[AI-60] PaCA: Partial Connection Adaptation for Efficient Fine-Tuning

Link: https://arxiv.org/abs/2503.01905
Authors: Sunghyeon Woo, Sol Namkung, Sunwoo Lee, Inho Jeong, Beomseok Kim, Dongsuk Jeon
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Prior parameter-efficient fine-tuning (PEFT) algorithms reduce memory usage and computational costs of fine-tuning large neural network models by training only a few additional adapter parameters, rather than the entire model. However, the reduction in computational costs due to PEFT does not necessarily translate to a reduction in training time; although the computational costs of the adapter layers are much smaller than the pretrained layers, it is well known that those two types of layers are processed sequentially on GPUs, resulting in significant latency overhead. LoRA and its variants merge low-rank adapter matrices with pretrained weights during inference to avoid latency overhead, but during training, the pretrained weights remain frozen while the adapter matrices are continuously updated, preventing such merging. To mitigate this issue, we propose Partial Connection Adaptation (PaCA), which fine-tunes randomly selected partial connections within the pretrained weights instead of introducing adapter layers in the model. PaCA not only enhances training speed by eliminating the time overhead due to the sequential processing of the adapter and pretrained layers but also reduces activation memory since only partial activations, rather than full activations, need to be stored for gradient computation. Compared to LoRA, PaCA reduces training time by 22% and total memory usage by 16%, while maintaining comparable accuracy across various fine-tuning scenarios, such as fine-tuning on the MMLU dataset and instruction tuning on the Oasst1 dataset. PaCA can also be combined with quantization, enabling the fine-tuning of large models such as LLaMA3.1-70B. In addition, PaCA enables training with 23% longer sequences and improves throughput by 16% on both NVIDIA A100 GPU and Intel Gaudi2 HPU compared to LoRA. The code is available at this https URL.
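
The core idea, updating a random subset of connections inside the pretrained weight matrix rather than training a separate adapter, can be sketched in a few lines. This is a toy illustration, not the authors' implementation: selecting whole input columns, plain SGD, and the helper names `select_columns` and `partial_sgd_step` are assumptions made here.

```python
import random


def select_columns(n_cols, k, seed=0):
    """Pick k distinct column indices at random, once, before fine-tuning starts."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_cols), k))


def partial_sgd_step(weight, grad, cols, lr):
    """Update only the selected columns in place; all other entries stay frozen."""
    for row_w, row_g in zip(weight, grad):
        for j in cols:
            row_w[j] -= lr * row_g[j]
    return weight


# Toy 2x4 "pretrained" weight (rows = outputs, columns = inputs) and a dummy gradient.
weight = [[1.0] * 4 for _ in range(2)]
grad = [[1.0] * 4 for _ in range(2)]
cols = select_columns(n_cols=4, k=2)
partial_sgd_step(weight, grad, cols, lr=0.5)
print(cols, weight)
```

For a linear layer y = Wx, the gradient of column j of W depends only on input feature x_j, which is consistent with the abstract's point that only partial activations need to be stored. Because the trained entries live inside W itself, there is no separate adapter path to execute during training.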

[AI-61] Identifying Sensitive Weights via Post-quantization Integral

Link: https://arxiv.org/abs/2503.01901
Authors: Yuezhou Hu, Weiyu Huang, Zichen Liang, Chang Chen, Jintao Zhang, Jun Zhu, Jianfei Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Serving Large Language Models (LLMs) is costly. However, post-training weight quantization can address this problem by both compressing their sizes for limited memory and saving bandwidth for acceleration. As not all weight dimensions are equally important, those methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on the loss function and is used to preprocess original weights for better quantization. In this work, we conduct an empirical study on the accuracy of the sensitivity metric, and find that existing gradient and Hessian based metrics are very inaccurate: they underestimate quantization’s impact on the loss function by orders of magnitude, mainly due to the small convergence radius of the local 2nd order approximation, i.e., the gradient and Hessian terms in Taylor’s formula. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric to estimate posterior sensitivity in a fine-grained manner. To leverage this accurate metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weights detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced improvement of 2.66 perplexity gain on Llama 3.2 1B with QTIP.
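
For reference, the local second-order approximation mentioned in the abstract is the standard Taylor expansion of the loss around the original weights. The notation below is chosen here for illustration and is not taken from the paper: $w$ denotes the original weights, $\tilde{w}$ the quantized weights, and $g$ and $H$ the gradient and Hessian of the loss $L$ at $w$.

```latex
\Delta w = \tilde{w} - w, \qquad
L(\tilde{w}) - L(w) \;\approx\; g^{\top} \Delta w + \tfrac{1}{2}\, \Delta w^{\top} H \, \Delta w
```

The expansion is only trustworthy for small $\Delta w$. The abstract attributes the inaccuracy of gradient- and Hessian-based sensitivity metrics to quantization perturbations falling outside that small convergence radius.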

[AI-62] LLM-Empowered Class Imbalanced Graph Prompt Learning for Online Drug Trafficking Detection

Link: https://arxiv.org/abs/2503.01900
Authors: Tianyi Ma, Yiyue Qian, Zehong Wang, Zheyuan Zhang, Chuxu Zhang, Yanfang Ye
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:As the market for illicit drugs remains extremely profitable, major online platforms have become direct-to-consumer intermediaries for illicit drug trafficking participants. These online activities raise significant social concerns that require immediate actions. Existing approaches to combating this challenge are generally impractical, due to the imbalance of classes and scarcity of labeled samples in real-world applications. To this end, we propose a novel Large Language Model-empowered Heterogeneous Graph Prompt Learning framework for illicit Drug Trafficking detection, called LLM-HetGDT, that leverages LLM to facilitate heterogeneous graph neural networks (HGNNs) to effectively identify drug trafficking activities in the class-imbalanced scenarios. Specifically, we first pre-train HGNN over a contrastive pretext task to capture the inherent node and structure information over the unlabeled drug trafficking heterogeneous graph (HG). Afterward, we employ LLM to augment the HG by generating high-quality synthetic user nodes in minority classes. Then, we fine-tune the soft prompts on the augmented HG to capture the important information in the minority classes for the downstream drug trafficking detection task. To comprehensively study online illicit drug trafficking activities, we collect a new HG dataset over Twitter, called Twitter-HetDrug. Extensive experiments on this dataset demonstrate the effectiveness, efficiency, and applicability of LLM-HetGDT.

[AI-63] Continual Learning-Aided Super-Resolution Scheme for Channel Reconstruction and Generalization in OFDM Systems

Link: https://arxiv.org/abs/2503.01897
Authors: Jianqiao Chen, Nan Ma, Wenkai Liu, Xiaodong Xu, Ping Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Channel reconstruction and generalization capability are of equal importance for developing channel estimation schemes within a deep learning (DL) framework. In this paper, we exploit a novel DL-based scheme for efficient OFDM channel estimation where the neural networks for channel reconstruction and generalization are designed separately. For the former, we propose a dual-attention-aided super-resolution neural network (DA-SRNN) to map the channels at pilot positions to the whole time-frequency channels. Specifically, the channel-spatial attention mechanism is first introduced to sequentially infer attention maps along two separate dimensions corresponding to two types of underlying channel correlations, and then the lightweight SR module is developed for efficient channel reconstruction. For the latter, we introduce continual learning (CL)-aided training strategies to make the neural network adapt to different channel distributions. Specifically, elastic weight consolidation (EWC) is introduced as a regularization term in the loss function of channel reconstruction, which can constrain the direction and space of updating the important weights of neural networks among different channel distributions. Meanwhile, the corresponding training process is provided in detail. By evaluating under 3rd Generation Partnership Project (3GPP) channel models, numerical results verify the superiority of the proposed channel estimation scheme with significantly improved channel reconstruction and generalization performance over counterparts.
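
The EWC regularizer mentioned in the abstract has a well-known generic form: a quadratic penalty that anchors each parameter to its value after the previous task, weighted by a Fisher-information estimate of how important that parameter was. The sketch below shows that generic form only. How the paper combines it with the channel reconstruction loss, and the value of the weight `lam`, are not stated in the abstract and are assumptions here.

```python
def ewc_loss(task_loss, params, anchor_params, fisher, lam):
    """Generic EWC objective: task loss plus (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current parameters (flat list of floats)
    anchor_params -- parameters learned on the previous channel distribution
    fisher        -- per-parameter importance (diagonal Fisher estimate)
    lam           -- regularization strength (hypothetical value chosen by the user)
    """
    penalty = sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )
    return task_loss + 0.5 * lam * penalty


# Toy numbers: the first parameter is important (F = 4) and has drifted by 1.0.
print(ewc_loss(task_loss=1.0, params=[1.0, 2.0], anchor_params=[0.0, 2.0],
               fisher=[4.0, 1.0], lam=0.5))
```

Parameters with large Fisher values are pulled back strongly toward their old values, while unimportant ones remain free to adapt to the new channel distribution.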

[AI-64] Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification

Link: https://arxiv.org/abs/2503.01896
Authors: Vishnu Kabir Chhabra, Ding Zhu, Mohammad Mahdi Khalili
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Previous research has shown that fine-tuning language models on general tasks enhances their underlying mechanisms. However, the impact of fine-tuning on poisoned data and the resulting changes in these mechanisms are poorly understood. This study investigates the changes in a model’s mechanisms during toxic fine-tuning and identifies the primary corruption mechanisms. We also analyze the changes after retraining a corrupted model on the original dataset and observe neuroplasticity behaviors, where the model relearns original mechanisms after fine-tuning the corrupted model. Our findings indicate that: (i) underlying mechanisms are amplified across task-specific fine-tuning, which can be generalized to longer epochs, (ii) model corruption via toxic fine-tuning is localized to specific circuit components, and (iii) models exhibit neuroplasticity when corrupted models are retrained on a clean dataset, reforming the original model mechanisms.

[AI-65] Evaluating System 1 vs. 2 Reasoning Approaches for Zero-Shot Time-Series Forecasting: A Benchmark and Insights

Link: https://arxiv.org/abs/2503.01895
Authors: Haoxin Liu, Zhiyuan Zhao, Shiduo Li, B. Aditya Prakash
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reasoning ability is crucial for solving challenging tasks. With the advancement of foundation models, such as the emergence of large language models (LLMs), a wide range of reasoning strategies has been proposed, including test-time enhancements, such as Chain-of-Thought, and post-training optimizations, as used in DeepSeek-R1. While these reasoning strategies have demonstrated effectiveness across various challenging language or vision tasks, their applicability and impact on time-series forecasting (TSF), particularly the challenging zero-shot TSF, remain largely unexplored. In particular, it is unclear whether zero-shot TSF benefits from reasoning and, if so, what types of reasoning strategies are most effective. To bridge this gap, we propose ReC4TS, the first benchmark that systematically evaluates the effectiveness of popular reasoning strategies when applied to zero-shot TSF tasks. ReC4TS conducts comprehensive evaluations across datasets spanning eight domains, covering both unimodal and multimodal settings with short-term and long-term forecasting tasks. More importantly, ReC4TS provides key insights: (1) Self-consistency emerges as the most effective test-time reasoning strategy; (2) Group-relative policy optimization emerges as a more suitable approach for incentivizing reasoning ability during post-training; (3) Multimodal TSF benefits more from reasoning strategies compared to unimodal TSF. Beyond these insights, ReC4TS establishes two pioneering starting blocks to support future zero-shot TSF reasoning research: (1) A novel dataset, TimeThinking, containing forecasting samples annotated with reasoning trajectories from multiple advanced LLMs, and (2) A new and simple test-time scaling law validated on foundational TSF models enabled by the self-consistency reasoning strategy. All data and code are publicly accessible at: this https URL
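
Self-consistency, named in the abstract as the most effective test-time strategy, means sampling several independent reasoning runs and aggregating their answers. The sketch below adapts that idea to numeric forecasts. The per-step median as the aggregation rule and the stub sampler standing in for an LLM call are assumptions made here; the benchmark's actual aggregation may differ.

```python
from statistics import median


def self_consistent_forecast(sample_fn, n_samples):
    """Draw n_samples candidate forecasts and aggregate them step by step.

    sample_fn(i) is a placeholder for one stochastic model call that returns
    a list of predicted values, one per forecast step.
    """
    samples = [sample_fn(i) for i in range(n_samples)]
    horizon = len(samples[0])
    return [median(s[t] for s in samples) for t in range(horizon)]


# Stub sampler: three fixed "model outputs", one of which is an outlier.
candidates = [[1.0, 10.0], [2.0, 20.0], [100.0, 30.0]]
print(self_consistent_forecast(lambda i: candidates[i], n_samples=3))
```

The median discards the outlying first-step value of 100.0, which illustrates why aggregating several samples can be more robust than trusting a single reasoning trajectory.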

[AI-66] Enhancing Transformer with GNN Structural Knowledge via Distillation: A Novel Approach

Link: https://arxiv.org/abs/2503.01888
Authors: Zhihua Duan, Jialin Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Integrating the structural inductive biases of Graph Neural Networks (GNNs) with the global contextual modeling capabilities of Transformers represents a pivotal challenge in graph representation learning. While GNNs excel at capturing localized topological patterns through message-passing mechanisms, their inherent limitations in modeling long-range dependencies and parallelizability hinder their deployment in large-scale scenarios. Conversely, Transformers leverage self-attention mechanisms to achieve global receptive fields but struggle to inherit the intrinsic graph structural priors of GNNs. This paper proposes a novel knowledge distillation framework that systematically transfers multiscale structural knowledge from GNN teacher models to Transformer student models, offering a new perspective on addressing the critical challenges in cross-architectural distillation. The framework effectively bridges the architectural gap between GNNs and Transformers through micro-macro distillation losses and multiscale feature alignment. This work establishes a new paradigm for inheriting graph structural biases in Transformer architectures, with broad application prospects.

[AI-67] When Continue Learning Meets Multimodal Large Language Model: A Survey

Link: https://arxiv.org/abs/2503.01887
Authors: Yukang Huo,Hao Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 42 pages, 6 figures, 37 tables

Abstract: Recent advancements in Artificial Intelligence have led to the development of Multimodal Large Language Models (MLLMs). However, adapting these pre-trained models to dynamic data distributions and various tasks efficiently remains a challenge. Fine-tuning MLLMs for specific tasks often causes performance degradation in the model’s prior knowledge domain, a problem known as ‘Catastrophic Forgetting’. While this issue has been well-studied in the Continual Learning (CL) community, it presents new challenges for MLLMs. This review paper, the first of its kind in MLLM continual learning, presents an overview and analysis of 440 research papers in this field. The review is structured into four sections. First, it discusses the latest research on MLLMs, covering model innovations, benchmarks, and applications in various fields. Second, it categorizes and overviews the latest studies on continual learning, divided into three parts: non-large language models unimodal continual learning (Non-LLM Unimodal CL), non-large language models multimodal continual learning (Non-LLM Multimodal CL), and continual learning in large language models (CL in LLM). The third section provides a detailed analysis of the current state of MLLM continual learning research, including benchmark evaluations, architectural innovations, and a summary of theoretical and empirical studies. Finally, the paper discusses the challenges and future directions of continual learning in MLLMs, aiming to inspire future research and development in the field. This review connects the foundational concepts, theoretical insights, method innovations, and practical applications of continual learning for multimodal large models, providing a comprehensive understanding of the research progress and challenges in this field and promoting the advancement of related technologies.

[AI-68] Learning Policy Committees for Effective Personalization in MDPs with Diverse Tasks

Link: https://arxiv.org/abs/2503.01885
Authors: Luise Ge,Michael Lanier,Anindya Sarkar,Bengisu Guresti,Yevgeniy Vorobeychik,Chongjie Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Many dynamic decision problems, such as robotic control, involve a series of tasks, many of which are unknown at training time. Typical approaches for these problems, such as multi-task and meta reinforcement learning, do not generalize well when the tasks are diverse. On the other hand, approaches that aim to tackle task diversity, such as using task embedding as policy context and task clustering, typically lack performance guarantees and require a large number of training tasks. To address these challenges, we propose a novel approach for learning a policy committee that includes at least one near-optimal policy with high probability for tasks encountered during execution. While we show that this problem is in general inapproximable, we present two practical algorithmic solutions. The first yields provable approximation and task sample complexity guarantees when tasks are low-dimensional (the best we can do due to inapproximability), whereas the second is a general and practical gradient-based approach. In addition, we provide a provable sample complexity bound for few-shot learning. Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-, and task clustering baselines in training, generalization, and few-shot learning, often by a large margin.

[AI-69] Contextual Quantum Neural Networks for Stock Price Prediction

Link: https://arxiv.org/abs/2503.01884
Authors: Sharan Mourya,Hannes Leipold,Bibhas Adhikari
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: In this paper, we apply quantum machine learning (QML) to predict the stock prices of multiple assets using a contextual quantum neural network. Our approach captures recent trends to predict future stock price distributions, moving beyond traditional models that focus on entire historical data, enhancing adaptability and precision. Utilizing the principles of quantum superposition, we introduce a new training technique called the quantum batch gradient update (QBGU), which accelerates the standard stochastic gradient descent (SGD) in quantum applications and improves convergence. Consequently, we propose a quantum multi-task learning (QMTL) architecture, specifically, the share-and-specify ansatz, that integrates task-specific operators controlled by quantum labels, enabling the simultaneous and efficient training of multiple assets on the same quantum circuit as well as enabling efficient portfolio representation with logarithmic overhead in the number of qubits. This architecture represents the first of its kind in quantum finance, offering superior predictive power and computational efficiency for multi-asset stock price forecasting. Through extensive experimentation on S&P 500 data for Apple, Google, Microsoft, and Amazon stocks, we demonstrate that our approach not only outperforms quantum single-task learning (QSTL) models but also effectively captures inter-asset correlations, leading to enhanced prediction accuracy. Our findings highlight the transformative potential of QML in financial applications, paving the way for more advanced, resource-efficient quantum algorithms in stock price prediction and other complex financial modeling tasks.

[AI-70] Learning Surrogates for Offline Black-Box Optimization via Gradient Matching ICML2024

Link: https://arxiv.org/abs/2503.01883
Authors: Minh Hoang,Azza Fadhel,Aryan Deshwal,Janardhan Rao Doppa,Trong Nghia Hoang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2024

Abstract: The offline design optimization problem arises in numerous science and engineering applications, including material and chemical design, where expensive online experimentation necessitates the use of in silico surrogate functions to predict and maximize the target objective over candidate designs. Although these surrogates can be learned from offline data, their predictions are often inaccurate outside the offline data regime. This challenge raises a fundamental question about the impact of an imperfect surrogate model on the performance gap between its optima and the true optima, and to what extent the performance loss can be mitigated. Although prior work developed methods to improve the robustness of surrogate models and their associated optimization processes, a provably quantifiable relationship between an imperfect surrogate and the corresponding performance gap, as well as whether prior methods directly address it, remain elusive. To shed light on this important question, we present a theoretical framework to understand offline black-box optimization, by explicitly bounding the optimization quality based on how well the surrogate matches the latent gradient field that underlies the offline data. Inspired by our theoretical analysis, we propose a principled black-box gradient matching algorithm to create effective surrogate models for offline optimization, improving over prior approaches on various real-world benchmarks.

[AI-71] Mapping representations in Reinforcement Learning via Semantic Alignment for Zero-Shot Stitching

Link: https://arxiv.org/abs/2503.01881
Authors: Antonio Pio Ricciardi,Valentino Maiorca,Luca Moschella,Riccardo Marin,Emanuele Rodolà
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures, 2 tables

Abstract:Deep Reinforcement Learning (RL) models often fail to generalize when even small changes occur in the environment’s observations or task requirements. Addressing these shifts typically requires costly retraining, limiting the reusability of learned policies. In this paper, we build on recent work in semantic alignment to propose a zero-shot method for mapping between latent spaces across different agents trained on different visual and task variations. Specifically, we learn a transformation that maps embeddings from one agent’s encoder to another agent’s encoder without further fine-tuning. Our approach relies on a small set of “anchor” observations that are semantically aligned, which we use to estimate an affine or orthogonal transform. Once the transformation is found, an existing controller trained for one domain can interpret embeddings from a different (existing) encoder in a zero-shot fashion, skipping additional trainings. We empirically demonstrate that our framework preserves high performance under visual and task domain shifts. We empirically demonstrate zero-shot stitching performance on the CarRacing environment with changing background and task. By allowing modular re-assembly of existing policies, it paves the way for more robust, compositional RL in dynamically changing environments.
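
The abstract above mentions estimating an orthogonal transform from a small set of semantically aligned anchor observations. A minimal sketch of one standard way to do this, orthogonal Procrustes via SVD, is given here. It is an illustration under assumptions, not the authors' implementation: the function name `fit_orthogonal_map`, the embedding dimension, and the synthetic anchors are all hypothetical.

```python
import numpy as np

def fit_orthogonal_map(anchors_src, anchors_dst):
    """Return the orthogonal matrix R minimizing ||anchors_src @ R - anchors_dst||_F.

    This is the classical orthogonal Procrustes solution: take the SVD of the
    cross-covariance between the two anchor sets and drop the singular values.
    """
    u, _, vt = np.linalg.svd(anchors_src.T @ anchors_dst)
    return u @ vt

# Synthetic stand-in for two encoders whose latent spaces differ by a rotation.
rng = np.random.default_rng(0)
dim = 8
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

anchors_a = rng.normal(size=(32, dim))   # anchor embeddings from encoder A
anchors_b = anchors_a @ true_rotation    # the same anchors seen by encoder B

R = fit_orthogonal_map(anchors_a, anchors_b)

# New embeddings from encoder A can now be mapped into encoder B's space
# without any further training.
new_obs_a = rng.normal(size=(5, dim))
mapped = new_obs_a @ R
```

In this toy setup the recovered `R` matches `true_rotation` because the anchors are noise-free and span the space; with real encoders the fit would only be approximate.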

[AI-72] District Vitality Index Using Machine Learning Methods for Urban Planners

Link: https://arxiv.org/abs/2503.01878
Authors: Sylvain Marcoux,Jean-Sébastien Dessureault
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:City leaders face critical decisions regarding budget allocation and investment priorities. How can they identify which city districts require revitalization? To address this challenge, a Current Vitality Index and a Long-Term Vitality Index are proposed. These indexes are based on a carefully curated set of indicators. Missing data is handled using K-Nearest Neighbors imputation, while Random Forest is employed to identify the most reliable and significant features. Additionally, k-means clustering is utilized to generate meaningful data groupings for enhanced monitoring of Long-Term Vitality. Current vitality is visualized through an interactive map, while Long-Term Vitality is tracked over 15 years with predictions made using Multilayer Perceptron or Linear Regression. The results, approved by urban planners, are already promising and helpful, with the potential for further improvement as more data becomes available. This paper proposes leveraging machine learning methods to optimize urban planning and enhance citizens’ quality of life.

[AI-73] Starjob: Dataset for LLM-Driven Job Shop Scheduling

Link: https://arxiv.org/abs/2503.01877
Authors: Henrik Abgaryan,Tristan Cazenave,Ararat Harutyunyan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2408.06993

Abstract:Large Language Models (LLMs) have shown remarkable capabilities across various domains, but their potential for solving combinatorial optimization problems remains largely unexplored. In this paper, we investigate the applicability of LLMs to the Job Shop Scheduling Problem (JSSP), a classic challenge in combinatorial optimization that requires efficient job allocation to machines to minimize makespan. To this end, we introduce Starjob, the first supervised dataset for JSSP, comprising 130k instances specifically designed for training LLMs. Leveraging this dataset, we fine-tune the LLaMA 8B 4-bit quantized model with the LoRA method to develop an end-to-end scheduling approach. Our evaluation on standard benchmarks demonstrates that the proposed LLM-based method not only surpasses traditional Priority Dispatching Rules (PDRs) but also achieves notable improvements over state-of-the-art neural approaches like L2D, with an average improvement of 15.36% on DMU and 7.85% on Taillard benchmarks. These results highlight the untapped potential of LLMs in tackling combinatorial optimization problems, paving the way for future advancements in this area.

[AI-74] CABS: Conflict-Aware and Balanced Sparsification for Enhancing Model Merging

Link: https://arxiv.org/abs/2503.01874
Authors: Zongzhen Yang,Binhang Qi,Hailong Sun,Wenrui Long,Ruobing Zhao,Xiang Gao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Model merging based on task vectors, i.e., the parameter differences between fine-tuned models and a shared base model, provides an efficient way to integrate multiple task-specific models into a multitask model without retraining. Recent works have endeavored to address the conflicts between task vectors, one of the significant challenges faced by model merging, through sparsification; however, two issues significantly limit their performance: high parameter overlap and unbalanced weight distribution. To address these issues, we propose a simple, yet effective framework called CABS (Conflict-Aware and Balanced Sparsification), consisting of Conflict-Aware Sparsification (CA) and Balanced Sparsification (BS). CA can reduce parameter overlap by applying masks during sequential pruning, ensuring that each task vector retains distinct, non-overlapping parameters. BS leverages n : m pruning to preserve critical weights while maintaining an even distribution across layers. Our comprehensive experiments demonstrate that CABS outperforms state-of-the-art methods across diverse tasks and model sizes.
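
The n:m pruning mentioned in the abstract above keeps the n largest-magnitude weights in every group of m consecutive weights, which spreads the retained weights evenly. A minimal sketch of that general idea on a flat task vector is given here. The function name `nm_prune` and the example values are hypothetical, and the paper's actual CABS procedure (including its conflict-aware masking) is not reproduced.

```python
import numpy as np

def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m consecutive weights.

    Assumes the total number of weights is divisible by m.
    """
    groups = np.asarray(weights, dtype=float).reshape(-1, m)
    # Indices of the (m - n) smallest magnitudes in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(np.shape(weights))

# A toy task vector (fine-tuned weights minus base weights).
task_vector = np.array([1.0, -5.0, 2.0, 0.5, 3.0, 0.1, -0.2, 4.0])
sparse_vector = nm_prune(task_vector, n=2, m=4)
print(sparse_vector)
```

Every group of four keeps exactly two nonzero entries, so no region of the vector is pruned more aggressively than another.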

[AI-75] Online Pseudo-average Shifting Attention (PASA) for Robust Low-precision LLM Inference: Algorithms and Numerical Analysis

Link: https://arxiv.org/abs/2503.01873
Authors: Long Cheng,Qichen Liao,Fan Wu,Junlin Mu,Tengfei Han,Zhe Qiu,Lianqiang Li,Tianyi Liu,Fangzheng Miao,Keming Gao,Liang Wang,Zhen Zhang,Qiande Yin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Numerical Analysis (math.NA)
Comments: 21 Pages, 14 figures, conference paper

Abstract: Attention calculation is extremely time-consuming for long-sequence inference tasks, such as text or image/video generation, in large models. To accelerate this process, we developed a low-precision, mathematically-equivalent algorithm called PASA, based on Flash Attention. PASA introduces two novel techniques: online pseudo-average shifting and global recovering. These techniques enable the use of half-precision computation throughout the Flash Attention process without incurring overflow instability or unacceptable numerical accuracy loss. This algorithm enhances performance on memory-restricted AI hardware architectures, such as the Ascend Neural-network Processing Unit (NPU), by reducing data movement and increasing computational FLOPs. The algorithm is validated using both designed random benchmarks and real large models. We find that the large bias and amplitude of attention input data are critical factors contributing to numerical overflow (65504 for half precision) in two different categories of large models (Qwen2-7B language models and Stable-Video-Diffusion multi-modal models). Specifically, overflow arises due to the large bias in the sequence dimension and the resonance mechanism between the query and key in the head dimension of the Stable-Video-Diffusion models. The resonance mechanism is defined as phase coincidence or a 180-degree phase shift between the query and key matrices, which markedly amplifies the element values of the attention score matrix. This issue also applies to the Qwen models. Additionally, numerical accuracy is assessed through root mean square error (RMSE) and by comparing the final generated texts and videos to those produced using high-precision attention.
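
The half-precision overflow limit of 65504 mentioned above can be illustrated with a short NumPy sketch. It shows only the general principle that softmax is invariant to subtracting a constant from its inputs, so shifting the scores can keep exponentials inside the float16 range. This is not the PASA algorithm itself; the function `softmax_fp16` and the example scores are hypothetical.

```python
import numpy as np

def softmax_fp16(scores, shift):
    """Softmax evaluated in float16 after subtracting a constant shift."""
    x = scores.astype(np.float16) - np.float16(shift)
    # Overflow to inf and inf/inf are expected in the unshifted case.
    with np.errstate(over="ignore", invalid="ignore"):
        e = np.exp(x)
        return e / e.sum()

scores = np.array([12.0, 11.0, 1.0])

# exp(12) is about 162755, which exceeds the float16 maximum of 65504.
naive = softmax_fp16(scores, shift=0.0)

# Subtracting a constant leaves the exact softmax unchanged but keeps
# every exponential at or below 1.
shifted = softmax_fp16(scores, shift=scores.max())

# Float64 reference for comparison.
reference = np.exp(scores - scores.max())
reference = reference / reference.sum()
```

The unshifted result contains non-finite values, while the shifted one agrees with the float64 reference to within float16 precision.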

[AI-76] Data Augmentation for Instruction Following Policies via Trajectory Segmentation

Link: https://arxiv.org/abs/2503.01871
Authors: Niklas Höpner,Ilaria Tiddi,Herke van Hoof
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:The scalability of instructable agents in robotics or gaming is often hindered by limited data that pairs instructions with agent trajectories. However, large datasets of unannotated trajectories containing sequences of various agent behaviour (play trajectories) are often available. In a semi-supervised setup, we explore methods to extract labelled segments from play trajectories. The goal is to augment a small annotated dataset of instruction-trajectory pairs to improve the performance of an instruction-following policy trained downstream via imitation learning. Assuming little variation in segment length, recent video segmentation methods can effectively extract labelled segments. To address the constraint of segment length, we propose Play Segmentation (PS), a probabilistic model that finds maximum likely segmentations of extended subsegments, while only being trained on individual instruction segments. Our results in a game environment and a simulated robotic gripper setting underscore the importance of segmentation; randomly sampled segments diminish performance, while incorporating labelled segments from PS improves policy performance to the level of a policy trained on twice the amount of labelled data.

[AI-77] Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale

Link: https://arxiv.org/abs/2503.01868
Authors: Jerome Ku,Eric Nguyen,David W. Romero,Garyk Brixi,Brandon Yang,Anton Vorontsov,Ali Taghibakhshi,Amy X. Lu,Dave P. Burke,Greg Brockman,Stefano Massaroli,Christopher Ré,Patrick D. Hsu,Brian L. Hie,Stefano Ermon,Michael Poli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations. First, operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression, with input-dependent convolutions and attention offering complementary performance. Second, co-designing convolution operators and hardware-aware algorithms enables efficiency gains in regimes where previous alternative architectures struggle to surpass Transformers. At the 40 billion parameter scale, we train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids. On H100 GPUs and model width 4096, individual operators in the proposed multi-hybrid StripedHyena 2 architecture achieve two-fold throughput improvement over linear attention and state-space models. Multi-hybrids excel at sequence modeling over byte-tokenized data, as demonstrated by the Evo 2 line of models. We discuss the foundations that enable these results, including architecture design, overlap-add blocked kernels for tensor cores, and dedicated all-to-all and point-to-point context parallelism strategies.

[AI-78] Neural Manifolds and Cognitive Consistency: A New Approach to Memory Consolidation in Artificial Systems

Link: https://arxiv.org/abs/2503.01867
Authors: Phuong-Nam Nguyen
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:We introduce a novel mathematical framework that unifies neural population dynamics, hippocampal sharp wave-ripple (SpWR) generation, and cognitive consistency constraints inspired by Heider’s theory. Our model leverages low-dimensional manifold representations to capture structured neural drift and incorporates a balance energy function to enforce coherent synaptic interactions, effectively simulating the memory consolidation processes observed in biological systems. Simulation results demonstrate that our approach not only reproduces key features of SpWR events but also enhances network interpretability. This work paves the way for scalable neuromorphic architectures that bridge neuroscience and artificial intelligence, offering more robust and adaptive learning mechanisms for future intelligent systems.

[AI-79] Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints

Link: https://arxiv.org/abs/2503.01865
Authors: Junxiao Yang,Zhexin Zhang,Shiyao Cui,Hongning Wang,Minlie Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract: Jailbreaking attacks can effectively induce unsafe behaviors in Large Language Models (LLMs); however, the transferability of these attacks across different models remains limited. This study aims to understand and enhance the transferability of gradient-based jailbreaking methods, which are among the standard approaches for attacking white-box models. Through a detailed analysis of the optimization process, we introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints (specifically, the response pattern constraint and the token tail constraint) as significant barriers to improved transferability. Removing these unnecessary constraints substantially enhances the transferability and controllability of gradient-based attacks. Evaluated on Llama-3-8B-Instruct as the source model, our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%, while also improving the stability and controllability of jailbreak behaviors on both source and target models.

[AI-80] Towards Enterprise-Ready Computer Using Generalist Agent

Link: https://arxiv.org/abs/2503.01861
Authors: Sami Marreed,Alon Oved,Avi Yaeli,Segev Shlomov,Ido Levy,Aviad Sela,Asaf Adi,Nir Mashkif
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:This paper presents our ongoing work toward developing an enterprise-ready Computer Using Generalist Agent (CUGA) system. Our research highlights the evolutionary nature of building agentic systems suitable for enterprise environments. By integrating state-of-the-art agentic AI techniques with a systematic approach to iterative evaluation, analysis, and refinement, we have achieved rapid and cost-effective performance gains, notably reaching a new state-of-the-art performance on the WebArena benchmark. We detail our development roadmap, the methodology and tools that facilitated rapid learning from failures and continuous system refinement, and discuss key lessons learned and future challenges for enterprise adoption.

[AI-81] A Review of Artificial Intelligence Impacting Statistical Process Monitoring and Future Directions

Link: https://arxiv.org/abs/2503.01858
Authors: Shing I Chang,Parviz Ghafariasl
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 44 pages, 5 figures

Abstract:It has been 100 years since statistical process control (SPC) or statistical process monitoring (SPM) was first introduced for production processes and later applied to service, healthcare, and other industries. The techniques applied to SPM applications are mostly statistically oriented. Recent advances in Artificial Intelligence (AI) have reinvigorated the imagination of adopting AI for SPM applications. This manuscript begins with a concise review of the historical development of the statistically based SPM methods. Next, this manuscript explores AI and Machine Learning (ML) algorithms and methods applied in various SPM applications, addressing quality characteristics of univariate, multivariate, profile, and image. These AI methods can be classified into the following categories: classification, pattern recognition, time series applications, and generative AI. Specifically, different kinds of neural networks, such as artificial neural networks (ANN), convolutional neural networks (CNN), recurrent neural networks (RNN), and generative adversarial networks (GAN), are among the most implemented AI methods impacting SPM. Finally, this manuscript outlines a couple of future directions that harness the potential of the Large Multimodal Model (LMM) for advancing SPM research and applications in complex systems. The ultimate objective is to transform statistical process monitoring (SPM) into smart process control (SMPC), where corrective actions are autonomously implemented to either prevent quality issues or restore process performance.

[AI-82] Improving Oil Slick Trajectory Simulations with Bayesian Optimization

Link: https://arxiv.org/abs/2503.02749
Authors: Gabriele Accarino,Marco M. De Carlo,Igor Atake,Donatello Elia,Anusha L. Dissanayake,Antonio Augusto Sepp Neves,Juan Peña Ibañez,Italo Epicoco,Paola Nassisi,Sandro Fiore,Giovanni Coppini
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments: 29 pages, 10 figures, 3 tables, research paper

Abstract:Accurate simulations of oil spill trajectories are essential for supporting practitioners’ response and mitigating environmental and socioeconomic impacts. Numerical models, such as MEDSLIK-II, simulate advection, dispersion, and transformation processes of oil particles. However, simulations heavily rely on accurate parameter tuning, still based on expert knowledge and manual calibration. To overcome these limitations, we integrate the MEDSLIK-II numerical oil spill model with a Bayesian optimization framework to iteratively estimate the best physical parameter configuration that yields simulation closer to satellite observations of the slick. We focus on key parameters, such as horizontal diffusivity and drift factor, maximizing the Fraction Skill Score (FSS) as a measure of spatio-temporal overlap between simulated and observed oil distributions. We validate the framework for the Baniyas oil incident that occurred in Syria between August 23 and September 4, 2021, which released over 12,000 m^3 of oil. We show that, on average, the proposed approach systematically improves the FSS from 5.82% to 11.07% compared to control simulations initialized with default parameters. The optimization results in consistent improvement across multiple time steps, particularly during periods of increased drift variability, demonstrating the robustness of our method in dynamic environmental conditions.
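
The Fraction Skill Score used as the optimization target above compares neighborhood fractions of two binary fields rather than exact pixel overlap. A minimal sketch of the standard definition, FSS = 1 - MSE(Pf, Po) / (mean(Pf^2) + mean(Po^2)), is given here. The grid size, window size, and slick masks are hypothetical, and the paper's exact preprocessing of model output and satellite observations is not reproduced.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def neighborhood_fraction(mask, window):
    """Fraction of occupied cells in a window x window box around each cell (window must be odd)."""
    pad = window // 2
    padded = np.pad(mask.astype(float), pad, mode="constant")
    return sliding_window_view(padded, (window, window)).mean(axis=(-2, -1))

def fraction_skill_score(simulated, observed, window=3):
    """FSS between two binary masks: 1 is a perfect match, 0 is no skill."""
    pf = neighborhood_fraction(simulated, window)
    po = neighborhood_fraction(observed, window)
    mse = np.mean((pf - po) ** 2)
    reference = np.mean(pf ** 2) + np.mean(po ** 2)
    return 1.0 - mse / reference if reference > 0 else float("nan")

# Hypothetical observed slick and two simulated slicks on a 10 x 10 grid.
observed = np.zeros((10, 10), dtype=bool)
observed[2:5, 2:5] = True

slightly_off = np.roll(observed, 1, axis=1)   # displaced by one cell
far_away = np.zeros((10, 10), dtype=bool)
far_away[7:9, 7:9] = True                     # no overlap at all

score_same = fraction_skill_score(observed, observed)
score_near = fraction_skill_score(slightly_off, observed)
score_far = fraction_skill_score(far_away, observed, window=1)
```

A larger window tolerates larger displacement errors, which is why the score rewards simulations that place the slick nearly, though not exactly, in the observed location.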

[AI-83] YARE-GAN: Yet Another Resting State EEG-GAN

Link: https://arxiv.org/abs/2503.02636
Authors: Yeganeh Farahzadi,Morteza Ansarinia,Zoltan Kekecs
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract: Generative Adversarial Networks (GANs) have shown promise in synthesising realistic neural data, yet their potential for unsupervised representation learning in resting-state EEG remains underexplored. In this study, we implement a Wasserstein GAN with Gradient Penalty (WGAN-GP) to generate multi-channel resting-state EEG data and assess the quality of the synthesised signals through both visual and feature-based evaluations. Our results indicate that the model effectively captures the statistical and spectral characteristics of real EEG data, although challenges remain in replicating high-frequency oscillations in the frontal region. Additionally, we demonstrate that the Critic’s learned representations can be fine-tuned for age group classification, achieving out-of-sample accuracy significantly better than a shuffled-label baseline. These findings suggest that generative models can serve not only as EEG data generators but also as unsupervised feature extractors, reducing the need for manual feature engineering. This study highlights the potential of GAN-based unsupervised learning for EEG analysis, suggesting avenues for more data-efficient deep learning applications in neuroscience.

[AI-84] MindSimulator: Exploring Brain Concept Localization via Synthetic FMRI ICLR2025

Link: https://arxiv.org/abs/2503.02351
Authors: Guangyin Bao,Qi Zhang,Zixuan Gong,Zhuojia Wu,Duoqian Miao
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 23 pages, ICLR 2025

Abstract:Concept-selective regions within the human cerebral cortex exhibit significant activation in response to specific visual stimuli associated with particular concepts. Precisely localizing these regions stands as a crucial long-term goal in neuroscience to grasp essential brain functions and mechanisms. Conventional experiment-driven approaches hinge on manually constructed visual stimulus collections and corresponding brain activity recordings, constraining the support and coverage of concept localization. Additionally, these stimuli often consist of concept objects in unnatural contexts and are potentially biased by subjective preferences, thus prompting concerns about the validity and generalizability of the identified regions. To address these limitations, we propose a data-driven exploration approach. By synthesizing extensive brain activity recordings, we statistically localize various concept-selective regions. Our proposed MindSimulator leverages advanced generative technologies to learn the probability distribution of brain activity conditioned on concept-oriented visual stimuli. This enables the creation of simulated brain recordings that reflect real neural response patterns. Using the synthetic recordings, we successfully localize several well-studied concept-selective regions and validate them against empirical findings, achieving promising prediction accuracy. The feasibility opens avenues for exploring novel concept-selective regions and provides prior hypotheses for future neuroscience research.

[AI-85] MobRFFI: Non-cooperative Device Re-identification for Mobility Intelligence

Link: https://arxiv.org/abs/2503.02156
Authors: Stepan Mazokha,Fanchen Bao,George Sklivanitis,Jason O. Hallstrom
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 10 pages, 9 figures, 3 tables

Abstract: WiFi-based mobility monitoring in urban environments can provide valuable insights into pedestrian and vehicle movements. However, MAC address randomization introduces a significant obstacle in accurately estimating congestion levels and path trajectories. To this end, we consider radio frequency fingerprinting and re-identification for attributing WiFi traffic to emitting devices without the use of MAC addresses. We present MobRFFI, an AI-based device fingerprinting and re-identification framework for WiFi networks that leverages an encoder deep learning model to extract unique features based on WiFi chipset hardware impairments. It is entirely independent of frame type. When evaluated on the WiFi fingerprinting dataset WiSig, our approach achieves 94% and 100% device accuracy in multi-day and single-day re-identification scenarios, respectively. We also collect a novel dataset, MobRFFI, for granular multi-receiver WiFi device fingerprinting evaluation. Using the dataset, we demonstrate that the combination of fingerprints from multiple receivers boosts re-identification performance from 81% to 100% on a single-day scenario and from 41% to 100% on a multi-day scenario.

[AI-86] Mathematical Foundation of Interpretable Equivariant Surrogate Models

Link: https://arxiv.org/abs/2503.01942
Authors: Jacopo Joy Colombini,Filippo Bonchi,Francesco Giannini,Fosca Giannotti,Roberto Pellungrini,Patrizio Frosini
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: This paper introduces a rigorous mathematical framework for neural network explainability, and more broadly for the explainability of equivariant operators called Group Equivariant Operators (GEOs), based on Group Equivariant Non-Expansive Operators (GENEOs) transformations. The central concept involves quantifying the distance between GEOs by measuring the non-commutativity of specific diagrams. Additionally, the paper proposes a definition of interpretability of GEOs according to a complexity measure that can be defined according to each user's preferences. Moreover, we explore the formal properties of this framework and show how it can be applied in classical machine learning scenarios, like image classification with convolutional neural networks.

[AI-87] QCS-ADME: Quantum Circuit Search for Drug Property Prediction with Imbalanced Data and Regression Adaptation

Link: https://arxiv.org/abs/2503.01927
Authors: Kangyu Zheng, Tianfan Fu, Zhiding Liang
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The biomedical field is beginning to explore the use of quantum machine learning (QML) for tasks traditionally handled by classical machine learning, especially in predicting ADME (absorption, distribution, metabolism, and excretion) properties, which are essential in drug evaluation. However, ADME tasks pose unique challenges for existing quantum computing systems (QCS) frameworks, as they involve both classification with imbalanced datasets and regression problems. These dual requirements make it necessary to adapt and refine current QCS frameworks to effectively address the complexities of ADME predictions. We propose a novel training-free scoring mechanism to evaluate QML circuit performance on imbalanced classification and regression tasks. Our mechanism demonstrates significant correlation between scoring metrics and test performance on imbalanced classification tasks. Additionally, we develop methods to quantify continuous similarity relationships between quantum states, enabling performance prediction for regression tasks. This represents the first comprehensive approach to searching and evaluating QCS circuits specifically for regression applications. Validation on representative ADME tasks (one imbalanced classification and one regression) demonstrates moderate positive correlation between our scoring metrics and circuit performance, significantly outperforming baseline scoring methods that show negligible correlation.

[AI-88] dyAb: Flow Matching for Flexible Antibody Design with AlphaFold-driven Pre-binding Antigen AAAI2025

Link: https://arxiv.org/abs/2503.01910
Authors: Cheng Tan, Yijie Zhang, Zhangyang Gao, Yufei Huang, Haitao Lin, Lirong Wu, Fandi Wu, Mathieu Blanchette, Stan Z. Li
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: AAAI 2025 Oral

Abstract:The development of therapeutic antibodies heavily relies on accurate predictions of how antigens will interact with antibodies. Existing computational methods in antibody design often overlook crucial conformational changes that antigens undergo during the binding process, significantly impacting the reliability of the resulting antibodies. To bridge this gap, we introduce dyAb, a flexible framework that incorporates AlphaFold2-driven predictions to model pre-binding antigen structures and specifically addresses the dynamic nature of antigen conformation changes. Our dyAb model leverages a unique combination of coarse-grained interface alignment and fine-grained flow matching techniques to simulate the interaction dynamics and structural evolution of the antigen-antibody complex, providing a realistic representation of the binding process. Extensive experiments show that dyAb significantly outperforms existing models in antibody design involving changing antigen conformations. These results highlight dyAb’s potential to streamline the design process for therapeutic antibodies, promising more efficient development cycles and improved outcomes in clinical applications.

Machine Learning

[LG-0] Weak-to-Strong Generalization Even in Random Feature Networks Provably

Link: https://arxiv.org/abs/2503.02877
Authors: Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A “weak” teacher, with a small number of units (i.e. random features), is trained on the population, and a “strong” student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
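
The setup described in this abstract can be sketched in a few lines. The following is only an illustrative toy, not the authors' code: the target function, the sizes, the ReLU activation, the step counts, and every function name are assumptions, and the teacher is fitted on a finite sample rather than on the population. It shows a narrow random feature teacher fitted to true labels and a wide random feature student fitted, with few gradient steps (early stopping), to the teacher's labels only.

```python
import math
import random

def make_bottom_layer(num_units, dim, rng):
    """Draw a random bottom layer; it stays fixed and is never trained."""
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)] for _ in range(num_units)]

def features(x, bottom):
    """ReLU random features of one input vector x (ReLU is an assumption)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in bottom]

def predict(top, phi):
    return sum(a * p for a, p in zip(top, phi))

def half_mse(top, Phi, y):
    """Half mean squared error of the top layer on feature matrix Phi."""
    return sum((predict(top, phi) - t) ** 2 for phi, t in zip(Phi, y)) / (2 * len(y))

def train_top_layer(Phi, y, steps):
    """Gradient descent on the top layer only, started from zero.

    The step size is 1 / trace of the Hessian, which is at most
    1 / (largest eigenvalue), so the loss cannot increase.
    """
    n, m = len(Phi), len(Phi[0])
    trace = sum(p * p for phi in Phi for p in phi) / n
    lr = 1.0 / max(trace, 1e-12)
    top = [0.0] * m
    for _ in range(steps):
        residuals = [predict(top, phi) - t for phi, t in zip(Phi, y)]
        grad = [sum(r * phi[j] for r, phi in zip(residuals, Phi)) / n for j in range(m)]
        top = [a - lr * g for a, g in zip(top, grad)]
    return top

def weak_to_strong_setup(seed=0, dim=5, n=200, weak_units=4, strong_units=64,
                         student_steps=20):
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    target = [math.tanh(x[0]) for x in X]  # hypothetical ground-truth task

    # Weak teacher: few random features, trained on the true labels.
    weak_bottom = make_bottom_layer(weak_units, dim, rng)
    Phi_weak = [features(x, weak_bottom) for x in X]
    teacher = train_top_layer(Phi_weak, target, steps=200)
    teacher_labels = [predict(teacher, phi) for phi in Phi_weak]

    # Strong student: many random features, trained ONLY on teacher labels,
    # with a small number of steps standing in for early stopping.
    strong_bottom = make_bottom_layer(strong_units, dim, rng)
    Phi_strong = [features(x, strong_bottom) for x in X]
    student = train_top_layer(Phi_strong, teacher_labels, steps=student_steps)

    return {
        "teacher_error": half_mse(teacher, Phi_weak, target),
        "student_error": half_mse(student, Phi_strong, target),
    }
```

Calling `weak_to_strong_setup()` returns both errors against the true labels; whether the student actually beats the teacher in this toy depends on the assumed task and hyperparameters, so the sketch should be read as a description of the setting rather than a reproduction of the result.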

[LG-1] Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

Link: https://arxiv.org/abs/2503.02844
Authors: Paul Janson, Vaibhav Singh, Paria Mehrbod, Adam Ibrahim, Irina Rish, Eugene Belilovsky, Benjamin Thérien
Subjects: Machine Learning (cs.LG)

Abstract:The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.
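
To make the comparison concrete, here is a rough sketch of the two kinds of schedule. The abstract does not give formulas, so the shapes below are assumptions: a cosine schedule restarted for each new task (which forces a re-warming phase), and an "infinite" schedule read here as linear warmup, a cosine cooldown, and then a constant rate held for as long as training continues, with no fixed iteration budget. Any final annealing step before taking a checkpoint is omitted, and all names and parameters are hypothetical.

```python
import math

def cosine_lr(step, total_steps, max_lr, min_lr, warmup_steps):
    """Linear warmup to max_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def repeated_cosine_lr(step, steps_per_task, max_lr, min_lr, warmup_steps):
    """Restart the cosine schedule for every task, re-warming each time."""
    return cosine_lr(step % steps_per_task, steps_per_task, max_lr, min_lr, warmup_steps)

def infinite_lr(step, max_lr, const_lr, warmup_steps, cooldown_steps):
    """Warmup, cosine cooldown to const_lr, then hold const_lr indefinitely."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + cooldown_steps:
        progress = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (max_lr - const_lr) * (1.0 + math.cos(math.pi * progress))
    return const_lr
```

With `repeated_cosine_lr`, the rate jumps back up at every task boundary; with `infinite_lr`, new data can be appended at any step because the rate stays at `const_lr` after the cooldown.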

[LG-2] Meta-Learning to Explore via Memory Density Feedback

Link: https://arxiv.org/abs/2503.02831
Authors: Kevin L. McKee
Subjects: Machine Learning (cs.LG)
Comments: 15 pages, 6 figures

Abstract:Exploration algorithms for reinforcement learning typically replace or augment the reward function with an additional “intrinsic” reward that trains the agent to seek previously unseen states of the environment. Here, we consider an exploration algorithm that exploits meta-learning, or learning to learn, such that the agent learns to maximize its exploration progress within a single episode, even between epochs of training. The agent learns a policy that aims to minimize the probability density of new observations with respect to all of its memories. In addition, it receives as feedback evaluations of the current observation density and retains that feedback in a recurrent network. By remembering trajectories of density, the agent learns to navigate a complex and growing landscape of familiarity in real-time, allowing it to maximize its exploration progress even in completely novel states of the environment for which its policy has not been trained.

[LG-3] On Separation Between Best-Iterate, Random-Iterate, and Last-Iterate Convergence of Learning in Games

Link: https://arxiv.org/abs/2503.02825
Authors: Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo, Weiqiang Zheng
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
Comments: 33 pages

Abstract:Non-ergodic convergence of learning dynamics in games has been widely studied recently because of its importance in both theory and practice. Recent work (Cai et al., 2024) showed that a broad class of learning dynamics, including Optimistic Multiplicative Weights Update (OMWU), can exhibit arbitrarily slow last-iterate convergence even in simple $2 \times 2$ matrix games, despite many of these dynamics being known to converge asymptotically in the last iterate. It remains unclear, however, whether these algorithms achieve fast non-ergodic convergence under weaker criteria, such as best-iterate convergence. We show that for $2 \times 2$ matrix games, OMWU achieves an $O(T^{-1/6})$ best-iterate convergence rate, in stark contrast to its slow last-iterate convergence in the same class of games. Furthermore, we establish a lower bound showing that OMWU does not achieve any polynomial random-iterate convergence rate, measured by the expected duality gaps across all iterates. This result challenges the conventional wisdom that random-iterate convergence is essentially equivalent to best-iterate convergence, with the former often used as a proxy for establishing the latter. Our analysis uncovers a new connection to dynamic regret and presents a novel two-phase approach to best-iterate convergence, which could be of independent interest.
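
The three convergence notions compared in this abstract can be illustrated with a small simulation. The sketch below uses the standard OMWU update, x_{t+1}(i) ∝ x_t(i) · exp(−η(2ℓ_t(i) − ℓ_{t−1}(i))), on a zero-sum matrix game in which the row player minimizes x^T A y. The game, the step size, the starting points, and the function names are illustrative assumptions and are not taken from the paper.

```python
import math

def omwu_step(x, loss, prev_loss, eta):
    """One OMWU update: multiplicative weights on the optimistic loss 2*l_t - l_{t-1}."""
    weights = [xi * math.exp(-eta * (2.0 * l - pl)) for xi, l, pl in zip(x, loss, prev_loss)]
    total = sum(weights)
    return [w / total for w in weights]

def duality_gap(A, x, y):
    """Gap of (x, y) when the row player minimizes and the column player maximizes x^T A y."""
    Ay = [sum(a * yj for a, yj in zip(row, y)) for row in A]
    xA = [sum(A[i][j] * x[i] for i in range(len(x))) for j in range(len(y))]
    return max(xA) - min(Ay)

def run_omwu(A, x, y, eta, rounds):
    """Run OMWU for both players and record the duality gap of every iterate."""
    n, m = len(x), len(y)
    prev_lx, prev_ly = [0.0] * n, [0.0] * m
    gaps = []
    for _ in range(rounds):
        gaps.append(duality_gap(A, x, y))
        lx = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]   # row player's loss
        ly = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]  # column player's loss
        x, y = omwu_step(x, lx, prev_lx, eta), omwu_step(y, ly, prev_ly, eta)
        prev_lx, prev_ly = lx, ly
    return gaps

A = [[1.0, -1.0], [-1.0, 1.0]]  # matching pennies, an assumed example game
gaps = run_omwu(A, [0.9, 0.1], [0.2, 0.8], eta=0.1, rounds=200)
best_iterate = min(gaps)                 # smallest gap over all iterates
random_iterate = sum(gaps) / len(gaps)   # expected gap of a uniformly random iterate
last_iterate = gaps[-1]                  # gap of the final iterate
```

By construction the best-iterate gap is never larger than the random-iterate gap; the abstract's point is that the two can be separated by more than a polynomial factor for OMWU.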

[LG-4] Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts

Link: https://arxiv.org/abs/2503.02819
Authors: Marta Skreta, Tara Akhound-Sadegh, Viktor Ohanesian, Roberto Bondesan, Alán Aspuru-Guzik, Arnaud Doucet, Rob Brekelmans, Alexander Tong, Kirill Neklyudov
Subjects: Machine Learning (cs.LG)

Abstract:While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional ‘corrector’ steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at this https URL.

[LG-5] A Minimalist Example of Edge-of-Stability and Progressive Sharpening

Link: https://arxiv.org/abs/2503.02809
Authors: Liming Liu, Zixuan Zhang, Simon Du, Tuo Zhao
Subjects: Machine Learning (cs.LG)
Comments: 39 pages, 15 figures

Abstract:Recent advances in deep learning optimization have unveiled two intriguing phenomena under large learning rates: Edge of Stability (EoS) and Progressive Sharpening (PS), challenging classical Gradient Descent (GD) analyses. Current research approaches, using either generalist frameworks or minimalist examples, face significant limitations in explaining these phenomena. This paper advances the minimalist approach by introducing a two-layer network with a two-dimensional input, where one dimension is relevant to the response and the other is irrelevant. Through this model, we rigorously prove the existence of progressive sharpening and self-stabilization under large learning rates, and establish non-asymptotic analysis of the training dynamics and sharpness along the entire GD trajectory. Besides, we connect our minimalist example to existing works by reconciling the existence of a well-behaved “stable set” between minimalist and generalist analyses, and extending the analysis of Gradient Flow Solution sharpness to our two-dimensional input scenario. These findings provide new insights into the EoS phenomenon from both parameter and input data distribution perspectives, potentially informing more effective optimization strategies in deep learning practice.

[LG-6] Inductive randomness predictors

Link: https://arxiv.org/abs/2503.02803
Authors: Vladimir Vovk
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 10 pages, 1 table

Abstract:This paper introduces inductive randomness predictors, which form a superset of inductive conformal predictors. Its focus is on a very simple special case, binary inductive randomness predictors. It is interesting that binary inductive randomness predictors have an advantage over inductive conformal predictors, although they also have a serious disadvantage. This advantage will allow us to reach the surprising conclusion that non-trivial inductive conformal predictors are inadmissible in the sense of statistical decision theory.

[LG-7] RAAD-LLM: Adaptive Anomaly Detection Using LLMs and RAG Integration

Link: https://arxiv.org/abs/2503.02800
Authors: Alicia Russell-Gilbert, Sudip Mittal, Shahram Rahimi, Maria Seale, Joseph Jabour, Thomas Arnold, Joshua Church
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: arXiv admin note: substantial text overlap with arXiv:2411.00914

Abstract:Anomaly detection in complex industrial environments poses unique challenges, particularly in contexts characterized by data sparsity and evolving operational conditions. Predictive maintenance (PdM) in such settings demands methodologies that are adaptive, transferable, and capable of integrating domain-specific knowledge. In this paper, we present RAAD-LLM, a novel framework for adaptive anomaly detection, leveraging large language models (LLMs) integrated with Retrieval-Augmented Generation (RAG). This approach addresses the aforementioned PdM challenges. By effectively utilizing domain-specific knowledge, RAAD-LLM enhances the detection of anomalies in time series data without requiring fine-tuning on specific datasets. The framework’s adaptability mechanism enables it to adjust its understanding of normal operating conditions dynamically, thus increasing detection accuracy. We validate this methodology through a real-world application for a plastics manufacturing plant and the Skoltech Anomaly Benchmark (SKAB). Results show significant improvements over our previous model, with an accuracy increase from 70.7% to 89.1% on the real-world dataset. By allowing for the enriching of input series data with semantics, RAAD-LLM incorporates multimodal capabilities that facilitate more collaborative decision-making between the model and plant operators. Overall, our findings support RAAD-LLM’s ability to revolutionize anomaly detection methodologies in PdM, potentially leading to a paradigm shift in how anomaly detection is implemented across various industries.

[LG-8] Quantitative Resilience Modeling for Autonomous Cyber Defense

Link: https://arxiv.org/abs/2503.02780
Authors: Xavier Cadet, Simona Boboila, Edward Koh, Peter Chin, Alina Oprea
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Cyber resilience is the ability of a system to recover from an attack with minimal impact on system operations. However, characterizing a network’s resilience under a cyber attack is challenging, as there are no formal definitions of resilience applicable to diverse network topologies and attack patterns. In this work, we propose a quantifiable formulation of resilience that considers multiple defender operational goals, the criticality of various network resources for daily operations, and provides interpretability to security operators about their system’s resilience under attack. We evaluate our approach within the CybORG environment, a reinforcement learning (RL) framework for autonomous cyber defense, analyzing trade-offs between resilience, costs, and prioritization of operational goals. Furthermore, we introduce methods to aggregate resilience metrics across time-variable attack patterns and multiple network topologies, comprehensively characterizing system resilience. Using insights gained from our resilience metrics, we design RL autonomous defensive agents and compare them against several heuristic baselines, showing that proactive network hardening techniques and prompt recovery of compromised machines are critical for effective cyber defenses.

[LG-9] Efficient and Optimal No-Regret Caching under Partial Observation

Link: https://arxiv.org/abs/2503.02758
Authors: Younes Ben Mazziane, Francescomaria Faticanti, Sara Alouf, Giovanni Neglia
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:Online learning algorithms have been successfully used to design caching policies with sublinear regret in the total number of requests, with no statistical assumption about the request sequence. Most existing algorithms involve computationally expensive operations and require knowledge of all past requests. However, this may not be feasible in practical scenarios like caching at a cellular base station. Therefore, we study the caching problem in a more restrictive setting where only a fraction of past requests are observed, and we propose a randomized caching policy with sublinear regret based on the classic online learning algorithm Follow-the-Perturbed-Leader (FPL). Our caching policy is the first to attain the asymptotically optimal regret bound while ensuring asymptotically constant amortized time complexity in the partial observability setting of requests. The experimental evaluation compares the proposed solution against classic caching policies and validates the proposed approach under synthetic and real-world request traces.
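
As a rough illustration of the idea, the sketch below applies a generic Follow-the-Perturbed-Leader rule to caching: keep a request count per item, add random noise to the counts, and cache the items with the largest perturbed counts. It is not the paper's algorithm. The Gaussian noise, its scale `eta`, the fresh noise draw at every request, and the 1/p reweighting used to handle requests observed only with probability `observe_prob` are all assumptions made for this example, as are the function names.

```python
import heapq
import random

def fpl_cache(counts, capacity, eta, rng):
    """Pick the `capacity` items with the largest noise-perturbed counts."""
    perturbed = {item: c + eta * rng.gauss(0.0, 1.0) for item, c in counts.items()}
    return set(heapq.nlargest(capacity, perturbed, key=perturbed.get))

def simulate(requests, catalog, capacity, eta, observe_prob, seed=0):
    """Serve a request sequence and return the number of cache hits.

    Each request is observed only with probability observe_prob; observed
    requests are weighted by 1 / observe_prob so the counts stay unbiased.
    """
    rng = random.Random(seed)
    counts = {item: 0.0 for item in catalog}
    hits = 0
    for item in requests:
        cache = fpl_cache(counts, capacity, eta, rng)
        hits += item in cache
        if rng.random() < observe_prob:
            counts[item] += 1.0 / observe_prob
    return hits
```

For example, `simulate(["c"] * 10, ["a", "b", "c"], capacity=1, eta=0.0, observe_prob=1.0)` misses only the first request, since with no noise and full observation the policy reduces to caching the most requested item so far.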

[LG-10] Clustered KL-barycenter design for policy evaluation

Link: https://arxiv.org/abs/2503.02735
Authors: Simon Weissmann, Till Freihaut, Claire Vernade, Giorgia Ramponi, Leif Döring
Subjects: Machine Learning (cs.LG)

Abstract:In the context of stochastic bandit models, this article examines how to design sample-efficient behavior policies for the importance sampling evaluation of multiple target policies. From importance sampling theory, it is well established that sample efficiency is highly sensitive to the KL divergence between the target and importance sampling distributions. We first analyze a single behavior policy defined as the KL-barycenter of the target policies. Then, we refine this approach by clustering the target policies into groups with small KL divergences and assigning each cluster its own KL-barycenter as a behavior policy. This clustered KL-based policy evaluation (CKL-PE) algorithm provides a novel perspective on optimal policy selection. We prove upper bounds on the sample complexity of our method and demonstrate its effectiveness with numerical validation.
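
For discrete policies over a finite set of arms, a KL-barycenter has a closed form, which the following sketch computes. The abstract does not say in which argument of the KL divergence the barycenter is taken, so both standard cases are shown with uniform weights: minimizing the average of KL(p_i ‖ q) over q gives the arithmetic mixture of the policies, while minimizing the average of KL(q ‖ p_i) gives their normalized geometric mean. The function names and the example policies are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mixture_barycenter(policies):
    """argmin_q of the average KL(p_i || q): the arithmetic mean of the policies."""
    k = len(policies)
    return [sum(p[a] for p in policies) / k for a in range(len(policies[0]))]

def geometric_barycenter(policies):
    """argmin_q of the average KL(q || p_i): the normalized geometric mean.

    Requires every policy to put positive probability on every arm.
    """
    k = len(policies)
    unnorm = [math.exp(sum(math.log(p[a]) for p in policies) / k)
              for a in range(len(policies[0]))]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def average_kl_to(policies, q):
    """Average of KL(p_i || q), the objective minimized by mixture_barycenter."""
    return sum(kl(p, q) for p in policies) / len(policies)

targets = [[0.9, 0.1], [0.1, 0.9]]          # two assumed target policies on two arms
behavior = mixture_barycenter(targets)       # a single behavior policy for both targets
```

Using one target policy as the behavior policy for the other costs more average KL than using the barycenter, which is the motivation for sampling from a barycenter; clustering the targets first, as the abstract describes, keeps each cluster's divergences small.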

[LG-11] S4D-Bio Audio Monitoring of Bone Cement Disintegration in Pulsating Fluid Jet Surgery under Laboratory Conditions

Link: https://arxiv.org/abs/2503.02714
Authors: Melanie Schaller, Sergej Hloch, Akash Nag, Dagmar Klichova, Nick Janssen, Frank Pude, Michal Zelenak, Bodo Rosenhahn
Subjects: Machine Learning (cs.LG)
Comments: submitted to Computers in Biology and Medicine Journal

Abstract:This study investigates a pulsating fluid jet as a novel precise, minimally invasive and cold technique for bone cement removal. We utilize the pulsating fluid jet device to remove bone cement from samples designed to mimic clinical conditions. The effectiveness of long nozzles was tested to enable minimally invasive procedures. Audio signal monitoring, complemented by the State Space Model (SSM) S4D-Bio, was employed to optimize the fluid jet parameters dynamically, addressing challenges like visibility obstruction from splashing. Within our experiments, we generate a comprehensive dataset correlating various process parameters and their equivalent audio signals to material erosion. The use of SSMs yields precise control over the predictive erosion process, achieving 98.93% accuracy. On the one hand, the study demonstrates that the pulsating fluid jet device, coupled with advanced audio monitoring techniques, is a highly effective tool for precise bone cement removal. On the other hand, this study presents the first application of SSMs in biomedical surgery technology, marking a significant advancement in the application. This research significantly advances biomedical engineering by integrating machine learning combined with pulsating fluid jet as surgical technology, offering a novel, minimally invasive, cold and adaptive approach for bone cement removal in orthopedic applications.

[LG-12] RedChronos: A Large Language Model-Based Log Analysis System for Insider Threat Detection in Enterprises

Link: https://arxiv.org/abs/2503.02702
Authors: Chenyu Li, Zhengjia Zhu, Jiyan He, Xiu Zhang
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Internal threat detection aims to address security threats within organizations or enterprises by identifying potential or already occurring malicious threats within vast amounts of logs. Although organizations or enterprises have dedicated personnel responsible for reviewing these logs, it is impossible to manually examine all logs entirely. In response to the vast number of logs, we propose a system called RedChronos, which is a Large Language Model-Based Log Analysis System. This system incorporates innovative improvements over previous research by employing Query-Aware Weighted Voting and a Semantic Expansion-based Genetic Algorithm with LLM-driven Mutations. On the public datasets CERT 4.2 and 5.2, RedChronos outperforms or matches existing approaches in terms of accuracy, precision, and detection rate. Moreover, RedChronos reduces the need for manual intervention in security log reviews by 90% in the Xiaohongshu SOC. Therefore, our RedChronos system demonstrates exceptional performance in handling Internal Threat Detection (IDT) tasks, providing innovative solutions for these challenges. We believe that future research can continue to enhance the system’s performance in IDT tasks while also reducing the response time to internal risk events.

[LG-13] Federated Learning for Privacy-Preserving Feedforward Control in Multi-Agent Systems IJCNN2025

Link: https://arxiv.org/abs/2503.02693
Authors: Jakob Weber, Markus Gurtner, Benedikt Alt, Adrian Trachte, Andreas Kugi
Subjects: Machine Learning (cs.LG)
Comments: Submitted to IJCNN 2025

Abstract:Feedforward control (FF) is often combined with feedback control (FB) in many control systems, improving tracking performance, efficiency, and stability. However, designing effective data-driven FF controllers in multi-agent systems requires significant data collection, including transferring private or proprietary data, which raises privacy concerns and incurs high communication costs. Therefore, we propose a novel approach integrating Federated Learning (FL) into FF control to address these challenges. This approach enables privacy-preserving, communication-efficient, and decentralized continuous improvement of FF controllers across multiple agents without sharing personal or proprietary data. By leveraging FL, each agent learns a local, neural FF controller using its data and contributes only model updates to a global aggregation process, ensuring data privacy and scalability. We demonstrate the effectiveness of our method in an autonomous driving use case. Therein, vehicles equipped with a trajectory-tracking feedback controller are enhanced by FL-based neural FF control. Simulations highlight significant improvements in tracking performance compared to pure FB control, analogous to model-based FF control. We achieve comparable tracking performance without exchanging private vehicle-specific data compared to a centralized neural FF control. Our results underscore the potential of FL-based neural FF control to enable privacy-preserving learning in multi-agent control systems, paving the way for scalable and efficient autonomous systems applications.

[LG-14] Generative Modeling of Microweather Wind Velocities for Urban Air Mobility

Link: https://arxiv.org/abs/2503.02690
Authors: Tristan A. Shah, Michael C. Stanley, James E. Warner
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 17 pages, 13 figures, published in 2025 IEEE Aerospace Conference proceedings

Abstract:Motivated by the pursuit of safe, reliable, and weather-tolerant urban air mobility (UAM) solutions, this work proposes a generative modeling approach for characterizing microweather wind velocities. Microweather, or the weather conditions in highly localized areas, is particularly complex in urban environments owing to the chaotic and turbulent nature of wind flows. Furthermore, traditional means of assessing local wind fields are not generally viable solutions for UAM applications: 1) field measurements that would rely on permanent wind profiling systems in operational air space are not practical, 2) physics-based models that simulate fluid dynamics at a sufficiently high resolution are not computationally tractable, and 3) data-driven modeling approaches that are largely deterministic ignore the inherent variability in turbulent flows that dictates UAM reliability. Thus, advancements in predictive capabilities are needed to help mitigate the unique operational safety risks that microweather winds pose for smaller, lighter weight UAM aircraft. This work aims to model microweather wind velocities in a manner that is computationally-efficient, captures random variability, and would only require a temporary, rather than permanent, field measurement campaign. Inspired by recent breakthroughs in conditional generative AI such as text-to-image generation, the proposed approach learns a probabilistic macro-to-microweather mapping between regional weather forecasts and measured local wind velocities using generative modeling (denoising diffusion probabilistic models, flow matching, and Gaussian mixture models). A simple proof of concept was implemented using a dataset comprised of local (micro) measurements from a Sonic Detection and Ranging (SoDAR) wind profiler along with (macro) forecast data from a nearby weather station over the same time period. 

[LG-15] Quantum Geometry insights in Deep Learning

Link: https://arxiv.org/abs/2503.02655
Authors: Noémie C. Combe
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Differential Geometry (math.DG)

Abstract:In this paper, we explore the fundamental role of the Monge-Ampère equation in deep learning, particularly in the context of Boltzmann machines and energy-based models. We first review the structure of Boltzmann learning and its relation to free energy minimization. We then establish a connection between optimal transport theory and deep learning, demonstrating how the Monge-Ampère equation governs probability transformations in generative models. Additionally, we provide insights from quantum geometry, showing that the space of covariance matrices arising in the learning process coincides with the Connes-Araki-Haagerup (CAH) cone in von Neumann algebra theory. Furthermore, we introduce an alternative approach based on renormalization group (RG) flow, which, while distinct from the optimal transport perspective, reveals another manifestation of the Monge-Ampère domain in learning dynamics. This dual perspective offers a deeper mathematical understanding of hierarchical feature learning, bridging concepts from statistical mechanics, quantum geometry, and deep learning theory.

[LG-16] Cellular Automaton With CNN

Link: https://arxiv.org/abs/2503.02652
Authors: Valery Ashu, Zhisong Liu, Heikki Haario, Andreas Rupp
Subjects: Machine Learning (cs.LG)

Abstract:Cellular automata (CA) models are widely used to simulate complex systems with emergent behaviors, but identifying hidden parameters that govern their dynamics remains a significant challenge. This study explores the use of Convolutional Neural Networks (CNN) to identify jump parameters in a two-dimensional CA model. We propose a custom CNN architecture trained on CA-generated data to classify jump parameters, which dictate the neighborhood size and movement rules of cells within the CA. Experiments were conducted across varying domain sizes (25 x 25 to 150 x 150) and CA iterations (0 to 50), demonstrating that the accuracy improves with larger domain sizes, as they provide more spatial information for parameter estimation. Interestingly, while initial CA iterations enhance the performance, increasing the number of iterations beyond a certain threshold does not significantly improve accuracy, suggesting that only specific temporal information is relevant for parameter identification. The proposed CNN achieves competitive accuracy (89.31%) compared to established architectures like LeNet-5 and AlexNet, while offering significantly faster inference times, making it suitable for real-time applications. This study highlights the potential of CNNs as a powerful tool for fast and accurate parameter estimation in CA models, paving the way for their use in more complex systems and higher-dimensional domains. Future work will explore the identification of multiple hidden parameters and extend the approach to three-dimensional CA models.

[LG-17] A Generalized Theory of Mixup for Structure-Preserving Synthetic Data

Link: https://arxiv.org/abs/2503.02645
Authors: Chungpa Lee, Jongho Im, Joseph H.T. Kim
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML); Other Statistics (stat.OT)

Abstract:Mixup is a widely adopted data augmentation technique known for enhancing the generalization of machine learning models by interpolating between data points. Despite its success and popularity, limited attention has been given to understanding the statistical properties of the synthetic data it generates. In this paper, we delve into the theoretical underpinnings of mixup, specifically its effects on the statistical structure of synthesized data. We demonstrate that while mixup improves model performance, it can distort key statistical properties such as variance, potentially leading to unintended consequences in data synthesis. To address this, we propose a novel mixup method that incorporates a generalized and flexible weighting scheme, better preserving the original data’s structure. Through theoretical developments, we provide conditions under which our proposed method maintains the (co)variance and distributional properties of the original dataset. Numerical experiments confirm that the new approach not only preserves the statistical characteristics of the original data but also sustains model performance across repeated synthesis, alleviating concerns of model collapse identified in previous research.
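
The variance distortion mentioned in this abstract is easy to see in the simplest case. The derivation below is a sketch under assumptions that are not taken from the paper: standard mixup of two independent draws $x_i, x_j$ from the same distribution with mean $\mu$ and variance $\sigma^2$, with a mixing weight $\lambda \in [0, 1]$ drawn independently of the data.

```latex
\tilde{x} = \lambda x_i + (1 - \lambda) x_j,
\qquad
\mathbb{E}[\tilde{x} \mid \lambda] = \mu,
\qquad
\operatorname{Var}(\tilde{x} \mid \lambda) = \bigl(\lambda^2 + (1 - \lambda)^2\bigr)\sigma^2 .

% The conditional mean does not depend on \lambda, so by the law of total variance
\operatorname{Var}(\tilde{x})
  = \mathbb{E}\bigl[\lambda^2 + (1 - \lambda)^2\bigr]\sigma^2
  \le \sigma^2 ,
% with equality only if \lambda \in \{0, 1\} almost surely.
```

For a fixed $\lambda = 1/2$ the synthetic data has half the original variance, so repeated synthesis under these assumptions shrinks the spread of the data each round. This is the kind of distortion a variance-preserving weighting scheme has to correct.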

[LG-18] Development of a Deep Learning Model for the Prediction of Ventilator Weaning

Link: https://arxiv.org/abs/2503.02643
Authors: Hernando Gonzalez, Carlos Julio Arizmendi, Beatriz F. Giraldo
Subjects: Machine Learning (cs.LG)

Abstract:The issue of failed weaning is a critical concern in the intensive care unit (ICU) setting. This scenario occurs when a patient experiences difficulty maintaining spontaneous breathing and ensuring a patent airway within the first 48 hours after the withdrawal of mechanical ventilation. Approximately 20% of ICU patients experience this phenomenon, which has severe repercussions on their health. It also has a substantial impact on clinical evolution and mortality, which can increase by 25% to 50%. To address this issue, we propose a medical support system that uses a convolutional neural network (CNN) to assess a patient's suitability for disconnection from a mechanical ventilator after a spontaneous breathing test (SBT). During SBT, respiratory flow and electrocardiographic activity were recorded and then processed using time-frequency analysis (TFA) techniques. Two CNN architectures were evaluated in this study: one based on ResNet50, with parameters tuned using a Bayesian optimization algorithm, and another CNN designed from scratch, with its structure also adapted using a Bayesian optimization algorithm. The WEANDB database was used to train and evaluate both models. The results showed remarkable performance, with an average accuracy of 98% when using the CNN designed from scratch. This model has significant implications for the ICU because it provides a reliable tool to enhance patient care by assisting clinicians in making timely and accurate decisions regarding weaning. This can potentially reduce the adverse outcomes associated with failed weaning events.

[LG-19] Leveraging Self-Supervised Learning Methods for Remote Screening of Subjects with Paroxysmal Atrial Fibrillation

Link: https://arxiv.org/abs/2503.02621
Authors: Adrian Atienza, Gouthamaan Manimaran, Sadasivan Puthusserypady, Helena Dominguez, Peter K. Jacobsen, Jakob E. Bardram
Categories: Machine Learning (cs.LG)
Comments: Under revision

Abstract:The integration of Artificial Intelligence (AI) into clinical research has great potential to reveal patterns that are difficult for humans to detect, creating impactful connections between inputs and clinical outcomes. However, these methods often require large amounts of labeled data, which can be difficult to obtain in healthcare due to strict privacy laws and the need for experts to annotate data. This requirement creates a bottleneck when investigating unexplored clinical questions. This study explores the application of Self-Supervised Learning (SSL) as a way to obtain preliminary results from clinical studies with limited-size cohorts. To assess our approach, we focus on an underexplored clinical task: screening subjects for Paroxysmal Atrial Fibrillation (P-AF) using remote-monitoring, single-lead ECG signals captured during normal sinus rhythm. We evaluate state-of-the-art SSL methods alongside supervised learning approaches and find that SSL outperforms supervised learning on this task. More importantly, it prevents misleading conclusions that may arise from the poor performance of supervised learning in limited-cohort settings.

[LG-20] Lightweight Channel-wise Dynamic Fusion Model: Non-stationary Time Series Forecasting via Entropy Analysis

Link: https://arxiv.org/abs/2503.02609
Authors: Tianyu Jia, Zongxia Xie, Yanru Sun, Dilfira Kudrat, Qinghua Hu
Categories: Machine Learning (cs.LG)

Abstract:Non-stationarity is an intrinsic property of real-world time series and plays a crucial role in time series forecasting. Previous studies primarily adopt instance normalization to attenuate the non-stationarity of original series for better predictability. However, instance normalization that directly removes the inherent non-stationarity can lead to three issues: (1) disrupting global temporal dependencies, (2) ignoring channel-specific differences, and (3) producing over-smoothed predictions. To address these issues, we theoretically demonstrate that variance can be a valid and interpretable proxy for quantifying non-stationarity of time series. Based on the analysis, we propose a novel lightweight Channel-wise Dynamic Fusion Model (CDFM), which selectively and dynamically recovers intrinsic non-stationarity of the original series, while keeping the predictability of normalized series. First, we design a Dual-Predictor Module, which involves two branches: a Time Stationary Predictor for capturing stable patterns and a Time Non-stationary Predictor for modeling global dynamics patterns. Second, we propose a Fusion Weight Learner to dynamically characterize the intrinsic non-stationary information across different samples based on variance. Finally, we introduce a Channel Selector to selectively recover non-stationary information from specific channels by evaluating their non-stationarity, similarity, and distribution consistency, enabling the model to capture relevant dynamic features and avoid overfitting. Comprehensive experiments on seven time series datasets demonstrate the superiority and generalization capabilities of CDFM.
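
To make the starting point of this abstract concrete, here is a minimal sketch of instance normalization on a single window of one channel. It is not CDFM itself: the helper names, the eps constant, and the two toy windows are assumptions for illustration. After normalization both windows look alike, and the per-window variance that the abstract treats as a proxy for non-stationarity survives only in the stored statistics.

```python
import statistics


def instance_normalize(window, eps=1e-8):
    """Normalize one window to zero mean and (almost exactly) unit variance."""
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    normed = [(x - mean) / (std + eps) for x in window]
    return normed, mean, std


def denormalize(normed, mean, std, eps=1e-8):
    """Invert instance_normalize using the stored statistics."""
    return [x * (std + eps) + mean for x in normed]


# Two hypothetical windows from the same channel: one calm, one volatile.
calm = [10.0, 10.5, 9.5, 10.2, 9.8]
volatile = [10.0, 30.0, -5.0, 42.0, 1.0]

for name, window in (("calm", calm), ("volatile", volatile)):
    normed, mean, std = instance_normalize(window)
    print(name, "raw variance:", round(std ** 2, 3),
          "normalized variance:", round(statistics.pvariance(normed), 3))
```

A model that only ever sees the normalized windows cannot tell the calm window from the volatile one, which is the information the abstract proposes to recover selectively.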

[LG-21] To Vaccinate or not to Vaccinate? Analyzing $\mathbb{X}$ Power over the Pandemic

Link: https://arxiv.org/abs/2503.02563
Authors: Tanveer Khan, Fahad Sohrab, Antonis Michalas, Moncef Gabbouj
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:The COVID-19 pandemic has profoundly affected the normal course of life, from lock-downs and virtual meetings to the unprecedentedly swift creation of vaccines. To halt the COVID-19 pandemic, the world has started preparing for the global vaccine roll-out. In an effort to navigate the immense volume of information about COVID-19, the public has turned to social networks. Among them, $\mathbb{X}$ (formerly Twitter) has played a key role in distributing related information. Most people are not trained to interpret medical research and remain skeptical about the efficacy of new vaccines. Measuring their reactions and perceptions is gaining significance in the fight against COVID-19. To assess the public perception regarding the COVID-19 vaccine, our work applies a sentiment analysis approach, using natural language processing of $\mathbb{X}$ data. We show how to use textual analytics and textual data visualization to discover early insights (for example, by analyzing the most frequently used keywords and hashtags). Furthermore, we look at how people’s sentiments vary across countries. Our results indicate that although the overall reaction to the vaccine is positive, there are also negative sentiments associated with the tweets, especially when examined at the country level. Additionally, from the extracted tweets, we manually labeled 100 tweets as positive and 100 tweets as negative and trained various One-Class Classifiers (OCCs). The experimental results indicate that the S-SVDD classifiers outperform other OCCs.

[LG-22] Disentangled Knowledge Tracing for Alleviating Cognitive Bias

Link: https://arxiv.org/abs/2503.02539
Authors: Yiyun Zhou, Zheqi Lv, Shengyu Zhang, Jingyuan Chen
Categories: Machine Learning (cs.LG)

Abstract:In the realm of Intelligent Tutoring Systems (ITS), the accurate assessment of students’ knowledge states through Knowledge Tracing (KT) is crucial for personalized learning. However, due to data bias, i.e., the unbalanced distribution of question groups (e.g., concepts), conventional KT models are plagued by cognitive bias, which tends to result in cognitive underload for overperformers and cognitive overload for underperformers. More seriously, this bias is amplified by the exercise recommendations of the ITS. After delving into the causal relations in the KT models, we identify the main cause as the confounder effect of students’ historical correct rate distribution over question groups on the student representation and prediction score. To this end, we propose a Disentangled Knowledge Tracing (DisKT) model, which separately models students’ familiar and unfamiliar abilities based on causal effects and eliminates the impact of the confounder in student representation within the model. Additionally, to shield the contradictory psychology (e.g., guessing and mistaking) in the students’ biased data, DisKT introduces a contradiction attention mechanism. Furthermore, DisKT enhances the interpretability of the model predictions by integrating a variant of Item Response Theory. Experimental results on 11 benchmarks and 3 synthesized datasets with different bias strengths demonstrate that DisKT significantly alleviates cognitive bias and outperforms 16 baselines in evaluation accuracy.

[LG-23] SAGE-Amine: Generative Amine Design with Multi-Property Optimization for Efficient CO2 Capture

Link: https://arxiv.org/abs/2503.02534
Authors: Hocheol Lim, Hyein Cho, Jeonghoon Kim
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: 33 pages, 5 figures

Abstract:Efficient CO2 capture is vital for mitigating climate change, with amine-based solvents being widely used due to their strong reactivity with CO2. However, optimizing key properties such as basicity, viscosity, and absorption capacity remains challenging, as traditional methods rely on labor-intensive experimentation and predefined chemical databases, limiting the exploration of novel solutions. Here, SAGE-Amine is introduced: a generative modeling approach that integrates Scoring-Assisted Generative Exploration (SAGE) with quantitative structure-property relationship models to design new amines tailored for CO2 capture. Unlike conventional virtual screening restricted to existing compounds, SAGE-Amine generates novel amines by leveraging autoregressive natural language processing models trained on amine datasets. SAGE-Amine identified known amines for CO2 capture from scratch and successfully performed single-property optimization, increasing basicity or reducing viscosity or vapor pressure. Furthermore, it facilitated multi-property optimization, simultaneously achieving high basicity with low viscosity and vapor pressure. The 10 top-ranked amines were suggested using SAGE-Amine, and their thermodynamic properties were further assessed using COSMO-RS simulations, confirming their potential for CO2 capture. These results highlight the potential of generative modeling in accelerating the discovery of amine solvents and expanding the possibilities for industrial CO2 capture applications.

[LG-24] A Theory of Initialisation’s Impact on Specialisation

Link: https://arxiv.org/abs/2503.02526
Authors: Devon Jarvis, Sebastian Lee, Clémentine Carla Juliette Dominé, Andrew M Saxe, Stefano Sarao Mannelli
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 7 figures

Abstract:Prior work has demonstrated a consistent tendency in neural networks engaged in continual learning tasks, wherein intermediate task similarity results in the highest levels of catastrophic interference. This phenomenon is attributed to the network’s tendency to reuse learned features across tasks. However, this explanation heavily relies on the premise that neuron specialisation occurs, i.e. the emergence of localised representations. Our investigation challenges the validity of this assumption. Using theoretical frameworks for the analysis of neural networks, we show a strong dependence of specialisation on the initial condition. More precisely, we show that weight imbalance and high weight entropy can favour specialised solutions. We then apply these insights in the context of continual learning, first showing the emergence of a monotonic relation between task-similarity and forgetting in non-specialised networks. Finally, we show that specialization by weight imbalance is beneficial on the commonly employed elastic weight consolidation regularisation technique.

[LG-25] A Systematic Literature Review on Safety of the Intended Functionality for Automated Driving Systems

Link: https://arxiv.org/abs/2503.02498
Authors: Milin Patel, Rolf Jung, Marzana Khatun
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: Scheduled to be published in SAE journal as technical paper as a part of non-technical event and will be available as open access in 2025

Abstract:In the automobile industry, ensuring the safety of automated vehicles equipped with the Automated Driving System (ADS) is becoming a significant focus due to the increasing development and deployment of automated driving. Automated driving depends on sensing both the external and internal environments of a vehicle, utilizing perception sensors and algorithms, and Electrical/Electronic (E/E) systems for situational awareness and response. ISO 21448 is the standard for Safety of the Intended Functionality (SOTIF) that aims to ensure that the ADS operate safely within their intended functionality. SOTIF focuses on preventing or mitigating potential hazards that may arise from the limitations or failures of the ADS, including hazards due to insufficiencies of specification, or performance insufficiencies, as well as foreseeable misuse of the intended functionality. However, the challenge lies in ensuring the safety of vehicles despite the limited availability of extensive and systematic literature on SOTIF. To address this challenge, a Systematic Literature Review (SLR) on SOTIF for the ADS is performed following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The objective is to methodically gather and analyze the existing literature on SOTIF. The major contributions of this paper are: (i) presenting a summary of the literature by synthesizing and organizing the collective findings, methodologies, and insights into distinct thematic groups, and (ii) summarizing and categorizing the acknowledged limitations based on data extracted from an SLR of 51 research papers published between 2018 and 2023. Furthermore, research gaps are determined, and future research directions are proposed.

[LG-26] Joint Tensor and Inter-View Low-Rank Recovery for Incomplete Multiview Clustering

Link: https://arxiv.org/abs/2503.02449
Authors: Jianyu Wang, Zhengqiao Zhao, Nicolas Dobigeon, Jingdong Chen
Categories: Machine Learning (cs.LG)
Comments: The paper is under review at IEEE Transactions on Knowledge and Data Engineering

Abstract:Incomplete multiview clustering (IMVC) has gained significant attention for its effectiveness in handling missing sample challenges across various views in real-world multiview clustering applications. Most IMVC approaches tackle this problem by either learning consensus representations from available views or reconstructing missing samples using the underlying manifold structure. However, the reconstruction of the learned similarity graph tensor in prior studies only exploits the low-tubal-rank information, neglecting inter-view correlations. This paper proposes a novel joint tensor and inter-view low-rank recovery (JTIV-LRR) method, framing IMVC as a joint optimization problem that integrates incomplete similarity graph learning and tensor representation recovery. By leveraging both intra-view and inter-view low-rank information, the method achieves robust estimation of the complete similarity graph tensor through sparse noise removal and low-tubal-rank constraints along different modes. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed approach, achieving significant improvements in clustering accuracy and robustness compared to state-of-the-art methods.

[LG-27] NodeNAS: Node-Specific Graph Neural Architecture Search for Out-of-Distribution Generalization DASFAA2025

Link: https://arxiv.org/abs/2503.02448
Authors: Qiyi Wang, Yinning Shao, Yunlong Ma, Min Liu
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Accepted by DASFAA2025

Abstract:Graph neural architecture search (GraphNAS) has demonstrated advantages in mitigating performance degradation of graph neural networks (GNNs) due to distribution shifts. Recent approaches introduce weight sharing across tailored architectures, generating unique GNN architectures for each graph end-to-end. However, existing GraphNAS methods do not account for distribution patterns across different graphs and heavily rely on extensive training data. With sparse or single training graphs, these methods struggle to discover optimal mappings between graphs and architectures, failing to generalize to out-of-distribution (OOD) data. In this paper, we propose node-specific graph neural architecture search (NodeNAS), which aims to tailor distinct aggregation methods for different nodes by disentangling node topology and graph distribution with limited datasets. We further propose an adaptive-aggregation-attention-based multi-dim NodeNAS method (MNNAS), which learns a node-specific architecture customizer with good generalizability. Specifically, we extend the vertical depth of the search space, supporting simultaneous node-specific architecture customization across multiple dimensions. Moreover, we model the power-law distribution of node degrees under varying assortativity, encoding structure-invariant information to guide architecture customization across each dimension. Extensive experiments across supervised and unsupervised tasks demonstrate that MNNAS surpasses state-of-the-art algorithms and achieves excellent OOD generalization.

[LG-28] BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modelling

Link: https://arxiv.org/abs/2503.02445
Authors: Hao Li, Yu-Hao Huang, Chang Xu, Viktor Schlegel, Ren-He Jiang, Riza Batista-Navarro, Goran Nenadic, Jiang Bian
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Preprint. Work in progress

Abstract:Time-series Generation (TSG) is a prominent research area with broad applications in simulations, data augmentation, and counterfactual analysis. While existing methods have shown promise in unconditional single-domain TSG, real-world applications demand cross-domain approaches capable of controlled generation tailored to domain-specific constraints and instance-level requirements. In this paper, we argue that text can provide semantic insights, domain information and instance-specific temporal patterns to guide and improve TSG. We introduce "Text-Controlled TSG", a task focused on generating realistic time series by incorporating textual descriptions. To address data scarcity in this setting, we propose a novel LLM-based Multi-Agent framework that synthesizes diverse, realistic text-to-TS datasets. Furthermore, we introduce BRIDGE, a hybrid text-controlled TSG framework that integrates semantic prototypes with text descriptions to support domain-level guidance. This approach achieves state-of-the-art generation fidelity on 11 of 12 datasets, and improves controllability by 12.52% on MSE and 6.34% on MAE compared to generation without text input, highlighting its potential for generating tailored time-series data.

[LG-29] Artificial Intelligence in Reactor Physics: Current Status and Future Prospects

Link: https://arxiv.org/abs/2503.02440
Authors: Ruizhi Zhang, Shengfeng Zhu, Kan Wang, Ding She, Jean-Philippe Argaud, Bertrand Bouriquet, Qing Li, Helin Gong
Categories: Machine Learning (cs.LG)
Comments: 33 pages, 6 figures

Abstract:Reactor physics is the study of neutron properties, focusing on using models to examine the interactions between neutrons and materials in nuclear reactors. Artificial intelligence (AI) has made significant contributions to reactor physics, e.g., in operational simulations, safety design, real-time monitoring, core management and maintenance. This paper presents a comprehensive review of AI approaches in reactor physics, especially considering the category of Machine Learning (ML), with the aim of describing the application scenarios, frontier topics, unsolved challenges and future research directions. From equation solving and state parameter prediction to nuclear industry applications, this paper provides a step-by-step overview of ML methods applied to steady-state, transient and combustion problems. Most literature works achieve industry-demanded models by enhancing the efficiency of deterministic methods or correcting uncertainty methods, which leads to successful applications. However, research on ML methods in reactor physics is somewhat fragmented, and the ability to generalize models needs to be strengthened. Progress is still possible, especially in addressing theoretical challenges and enhancing industrial applications such as building surrogate models and digital twins.

[LG-30] Tight Gap-Dependent Memory-Regret Trade-Off for Single-Pass Streaming Stochastic Multi-Armed Bandits

Link: https://arxiv.org/abs/2503.02428
Authors: Zichun Ye, Chihao Zhang, Jiahao Zhao
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)

Abstract:We study the problem of minimizing gap-dependent regret for single-pass streaming stochastic multi-armed bandits (MAB). In this problem, the $n$ arms are present in a stream, and at most $m<n$ arms and their statistics can be stored in the memory. We establish tight non-asymptotic regret bounds regarding all relevant parameters, including the number of arms $n$, the memory size $m$, the number of rounds $T$ and $(\Delta_i)_{i\in [n]}$, where $\Delta_i$ is the reward mean gap between the best arm and the $i$-th arm. These gaps are not known in advance by the player. Specifically, for any constant $\alpha \ge 1$, we present two algorithms: one applicable for $m\ge \frac{2}{3}n$ with regret at most $O_\alpha\Big(\frac{(n-m)T^{\frac{1}{\alpha+1}}}{n^{1+\frac{1}{\alpha+1}}}\sum_{i:\Delta_i>0}\Delta_i^{1-2\alpha}\Big)$ and another applicable for $m<\frac{2}{3}n$ with regret at most $O_\alpha\Big(\frac{T^{\frac{1}{\alpha+1}}}{m^{\frac{1}{\alpha+1}}}\sum_{i:\Delta_i>0}\Delta_i^{1-2\alpha}\Big)$. We also prove matching lower bounds for both cases by showing that for any constant $\alpha\ge 1$ and any $m\leq k<n$, there exists a set of hard instances on which the regret of any algorithm is $\Omega_\alpha\Big(\frac{(k-m+1)T^{\frac{1}{\alpha+1}}}{k^{1+\frac{1}{\alpha+1}}}\sum_{i:\Delta_i>0}\Delta_i^{1-2\alpha}\Big)$. This is the first tight gap-dependent regret bound for streaming MAB. Prior to our work, an $O\Big(\sum_{i:\Delta_i>0}\frac{\sqrt{T}\log T}{\Delta_i}\Big)$ upper bound for the special case of $\alpha=1$ and $m=O(1)$ was established by Agarwal, Khanna and Patil (COLT’22). In contrast, our results provide the correct order of regret as $\Theta\Big(\frac{1}{\sqrt{m}}\sum_{i:\Delta_i>0}\frac{\sqrt{T}}{\Delta_i}\Big)$.
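
As a quick consistency check on the bounds quoted in this abstract, the worked substitution below sets alpha = 1 in the upper bound for the small-memory regime (m < 2n/3) and recovers the final Theta expression. It is only a reading aid derived from the abstract's own formulas, not an additional result.

```latex
% Upper bound for m < (2/3) n and a general constant alpha >= 1:
\[
  O_\alpha\!\Big(\frac{T^{\frac{1}{\alpha+1}}}{m^{\frac{1}{\alpha+1}}}
      \sum_{i:\Delta_i>0}\Delta_i^{1-2\alpha}\Big)
\]
% With alpha = 1 the exponents become 1/(alpha+1) = 1/2 and 1 - 2*alpha = -1:
\[
  O\!\Big(\frac{T^{1/2}}{m^{1/2}}\sum_{i:\Delta_i>0}\Delta_i^{-1}\Big)
  \;=\;
  O\!\Big(\frac{1}{\sqrt{m}}\sum_{i:\Delta_i>0}\frac{\sqrt{T}}{\Delta_i}\Big)
\]
```

Compared with the earlier bound for alpha = 1 and m = O(1), the log T factor disappears and a 1/sqrt(m) dependence on memory appears.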

[LG-31] Aggregation Strategies for Efficient Annotation of Bioacoustic Sound Events Using Active Learning

Link: https://arxiv.org/abs/2503.02422
Authors: Richard Lindholm, Oscar Marklund, Olof Mogren, John Martinsson
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:The vast amounts of audio data collected in Sound Event Detection (SED) applications require efficient annotation strategies to enable supervised learning. Manual labeling is expensive and time-consuming, making Active Learning (AL) a promising approach for reducing annotation effort. We introduce Top K Entropy, a novel uncertainty aggregation strategy for AL that prioritizes the most uncertain segments within an audio recording, instead of averaging uncertainty across all segments. This approach enables the selection of entire recordings for annotation, improving efficiency in sparse data scenarios. We compare Top K Entropy to random sampling and Mean Entropy, and show that fewer labels can lead to the same model performance, particularly in datasets with sparse sound events. Evaluations are conducted on audio mixtures of sound recordings from parks with meerkat, dog, and baby crying sound events, representing real-world bioacoustic monitoring scenarios. Using Top K Entropy for active learning, we can achieve comparable performance to training on the fully labeled dataset with only 8% of the labels. Top K Entropy outperforms Mean Entropy, suggesting that it is best to let the most uncertain segments represent the uncertainty of an audio file. The findings highlight the potential of AL for scalable annotation in audio and time-series applications, including bioacoustics.
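
The contrast between the two aggregation strategies can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes each recording is split into segments with a predicted event probability, uses binary entropy as the per-segment uncertainty, and takes the mean of the K largest entropies as the Top K score. The function names and the toy probabilities are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def mean_entropy(segment_probs):
    """Average uncertainty over all segments of a recording."""
    entropies = [binary_entropy(p) for p in segment_probs]
    return sum(entropies) / len(entropies)


def top_k_entropy(segment_probs, k):
    """Average uncertainty over the k most uncertain segments only."""
    entropies = sorted((binary_entropy(p) for p in segment_probs), reverse=True)
    top = entropies[:k]
    return sum(top) / len(top)


# Recording with sparse events: mostly confident, two very uncertain segments.
sparse = [0.01] * 18 + [0.5, 0.45]
# Recording that is mildly uncertain everywhere.
diffuse = [0.2] * 20

print("mean :", mean_entropy(sparse), mean_entropy(diffuse))
print("top-2:", top_k_entropy(sparse, 2), top_k_entropy(diffuse, 2))
```

Under mean aggregation the diffuse recording would be queried first; under top-K aggregation the recording containing the two highly uncertain segments is, which matches the abstract's motivation for sparse sound events.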

[LG-32] A Transformer-Based Framework for Greek Sign Language Production using Extended Skeletal Motion Representations

Link: https://arxiv.org/abs/2503.02421
Authors: Chrysa Pratikaki, Panagiotis Filntisis, Athanasios Katsamanis, Anastasios Roussos, Petros Maragos
Categories: Machine Learning (cs.LG)

Abstract:Sign Languages are the primary form of communication for Deaf communities across the world. To break the communication barriers between the Deaf and Hard-of-Hearing and the hearing communities, it is imperative to build systems capable of translating the spoken language into sign language and vice versa. Building on insights from previous research, we propose a deep learning model for Sign Language Production (SLP), which to our knowledge is the first attempt at Greek SLP. We tackle this task by utilizing a transformer-based architecture that enables the translation from text input to human pose keypoints, and vice versa. We evaluate the effectiveness of the proposed pipeline on the Greek SL dataset Elementary23, through a series of comparative analyses and ablation studies. Our pipeline’s components, which include data-driven gloss generation, training through video-to-text translation, and a scheduling algorithm for teacher forcing and auto-regressive decoding, seem to actively enhance the quality of produced SL videos.

[LG-33] Robust detection of overlapping bioacoustic sound events

Link: https://arxiv.org/abs/2503.02389
Authors: Louis Mahon, Benjamin Hoffman, Logan S James, Maddie Cusimano, Masato Hagiwara, Sarah C Woolley, Olivier Pietquin
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:We propose a method for accurately detecting bioacoustic sound events that is robust to overlapping events, a common issue in domains such as ethology, ecology and conservation. While standard methods employ a frame-based, multi-label approach, we introduce an onset-based detection method which we name Voxaboxen. It takes inspiration from object detection methods in computer vision, but simultaneously takes advantage of recent advances in self-supervised audio encoders. For each time window, Voxaboxen predicts whether it contains the start of a vocalization and how long the vocalization is. It also does the same in reverse, predicting whether each window contains the end of a vocalization, and how long ago it started. The two resulting sets of bounding boxes are then fused using a graph-matching algorithm. We also release a new dataset designed to measure performance on detecting overlapping vocalizations. This consists of recordings of zebra finches annotated with temporally-strong labels and showing frequent overlaps. We test Voxaboxen on seven existing data sets and on our new data set. We compare Voxaboxen to natural baselines and existing sound event detection methods and demonstrate SotA results. Further experiments show that improvements are robust to frequent vocalization overlap.

[LG-34] An Accelerated Alternating Partial Bregman Algorithm for ReLU-based Matrix Decomposition

Link: https://arxiv.org/abs/2503.02386
Authors: Qingsong Wang, Yunfei Qu, Chunfeng Cui, Deren Han
Categories: Machine Learning (cs.LG)

Abstract:Despite the remarkable success of low-rank estimation in data mining, its effectiveness diminishes when applied to data that inherently lacks low-rank structure. To address this limitation, in this paper, we focus on non-negative sparse matrices and aim to investigate the intrinsic low-rank characteristics of the rectified linear unit (ReLU) activation function. We first propose a novel nonlinear matrix decomposition framework incorporating a comprehensive regularization term designed to simultaneously promote useful structures in clustering and compression tasks, such as low-rankness, sparsity, and non-negativity in the resulting factors. This formulation presents significant computational challenges due to its multi-block structure, non-convexity, non-smoothness, and the absence of global gradient Lipschitz continuity. To address these challenges, we develop an accelerated alternating partial Bregman proximal gradient method (AAPB), whose distinctive feature lies in its capability to enable simultaneous updates of multiple variables. Under mild and theoretically justified assumptions, we establish both sublinear and global convergence properties of the proposed algorithm. Through careful selection of kernel generating distances tailored to various regularization terms, we derive corresponding closed-form solutions while ensuring that the $L$-smooth adaptable property holds for any $L\ge 1$. Numerical experiments on graph regularized clustering and sparse NMF basis compression confirm the effectiveness of our model and algorithm.

[LG-35] Truthfulness of Decision-Theoretic Calibration Measures

Link: https://arxiv.org/abs/2503.02384
Authors: Mingda Qiao, Eric Zhao
Categories: Machine Learning (cs.LG)

Abstract:Calibration measures quantify how much a forecaster’s predictions violate calibration, which requires that forecasts are unbiased conditioning on the forecasted probabilities. Two important desiderata for a calibration measure are its decision-theoretic implications (i.e., downstream decision-makers that best-respond to the forecasts are always no-regret) and its truthfulness (i.e., a forecaster approximately minimizes error by always reporting the true probabilities). Existing measures satisfy at most one of the properties, but not both. We introduce a new calibration measure termed subsampled step calibration, $\mathsf{StepCE}^{\textsf{sub}}$, that is both decision-theoretic and truthful. In particular, on any product distribution, $\mathsf{StepCE}^{\textsf{sub}}$ is truthful up to an $O(1)$ factor whereas prior decision-theoretic calibration measures suffer from an $e^{-\Omega(T)}$-$\Omega(\sqrt{T})$ truthfulness gap. Moreover, in any smoothed setting where the conditional probability of each event is perturbed by a noise of magnitude $c > 0$, $\mathsf{StepCE}^{\textsf{sub}}$ is truthful up to an $O(\sqrt{\log(1/c)})$ factor, while prior decision-theoretic measures have an $e^{-\Omega(T)}$-$\Omega(T^{1/3})$ truthfulness gap. We also prove a general impossibility result for truthful decision-theoretic forecasting: any complete and decision-theoretic calibration measure must be discontinuous and non-truthful in the non-smoothed setting.

[LG-36] Controllable Motion Generation via Diffusion Modal Coupling

Link: https://arxiv.org/abs/2503.02353
Authors: Luobin Wang, Hongzhan Yu, Chenning Yu, Sicun Gao, Henrik Christensen
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Diffusion models have recently gained significant attention in robotics due to their ability to generate multi-modal distributions of system states and behaviors. However, a key challenge remains: ensuring precise control over the generated outcomes without compromising realism. This is crucial for applications such as motion planning or trajectory forecasting, where adherence to physical constraints and task-specific objectives is essential. We propose a novel framework that enhances controllability in diffusion models by leveraging multi-modal prior distributions and enforcing strong modal coupling. This allows us to initiate the denoising process directly from distinct prior modes that correspond to different possible system behaviors, ensuring that sampling aligns with the training distribution. We evaluate our approach on motion prediction using the Waymo dataset and multi-task control in Maze2D environments. Experimental results show that our framework outperforms both guidance-based techniques and conditioned models with unimodal priors, achieving superior fidelity, diversity, and controllability, even in the absence of explicit conditioning. Overall, our approach provides a more reliable and scalable solution for controllable motion generation in robotics.

[LG-37] Confidence HNC: A Network Flow Technique for Binary Classification with Noisy Labels

Link: https://arxiv.org/abs/2503.02352
Authors: Dorit Hochbaum, Torpong Nitayanont
Categories: Machine Learning (cs.LG)

Abstract:We consider here a classification method that balances two objectives: large similarity within the samples in the cluster, and large dissimilarity between the cluster and its complement. The method, referred to as HNC or SNC, requires seed nodes, or labeled samples, at least one of which is in the cluster and at least one in the complement. Other than that, the method relies only on the relationship between the samples. The contribution here is the new method in the presence of noisy labels, based on HNC, called Confidence HNC, in which we introduce confidence weights that allow the given labels of labeled samples to be violated, with a penalty that reflects the perceived correctness of each given label. If a label is violated then it is interpreted that the label was noisy. The method involves a representation of the problem as a graph problem with hyperparameters that is solved very efficiently by the network flow technique of parametric cut. We compare the performance of the new method with leading algorithms on both real and synthetic data with noisy labels and demonstrate that it delivers improved performance in terms of classification accuracy as well as noise detection capability.

[LG-38] Incorporating graph neural network into route choice model

Link: https://arxiv.org/abs/2503.02315
Authors: Yuxun Ma,Toru Seo
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Route choice models are one of the most important foundations for transportation research. Traditionally, theory-based models, such as logit models and Recursive logit models, have been utilized for their great interpretability. More recently, machine learning approaches have gained attention for their better prediction accuracy. In this study, we propose novel hybrid models that integrate the Recursive logit model with Graph Neural Networks (GNNs) to enhance both predictive performance and model interpretability. To the authors’ knowledge, GNNs have not been utilized for route choice modeling, despite their proven effectiveness in capturing road network features and their widespread use in other transportation research areas. We mathematically show that our use of GNNs is beneficial not only for enhancing prediction performance, but also for relaxing the Independence of Irrelevant Alternatives property without relying on strong assumptions. This is because a specific type of GNN can efficiently capture multiple cross-effect patterns on networks from data. By applying the proposed models to one-day travel trajectory data in Tokyo, we confirmed their higher prediction accuracy compared to the existing models.

[LG-39] Go Beyond Your Means: Unlearning with Per-Sample Gradient Orthogonalization

Link: https://arxiv.org/abs/2503.02312
Authors: Aviv Shamsian,Eitan Shaar,Aviv Navon,Gal Chechik,Ethan Fetaya
Categories: Machine Learning (cs.LG)
*Comments: Under Review

Abstract:Machine unlearning aims to remove the influence of problematic training data after a model has been trained. The primary challenge in machine unlearning is ensuring that the process effectively removes specified data without compromising the model’s overall performance on the remaining dataset. Many existing machine unlearning methods address this challenge by carefully balancing gradient ascent on the unlearn data with the gradient descent on a retain set representing the training data. Here, we propose OrthoGrad, a novel approach that mitigates interference between the unlearn set and the retain set rather than competing ascent and descent processes. Our method projects the gradient of the unlearn set onto the subspace orthogonal to all gradients in the retain batch, effectively avoiding any gradient interference. We demonstrate the effectiveness of OrthoGrad on multiple machine unlearning benchmarks, including automatic speech recognition, outperforming competing methods.
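
The projection step described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes gradients are flattened into plain Python lists, uses Gram-Schmidt to build the retain subspace, and all names (`orthonormal_basis`, `project_orthogonal`) and the toy gradient values are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors: List[Vector], tol: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis of the span of the given vectors."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_orthogonal(unlearn_grad: Vector, retain_grads: List[Vector]) -> Vector:
    """Remove from unlearn_grad every component lying in span(retain_grads)."""
    g = list(unlearn_grad)
    for q in orthonormal_basis(retain_grads):
        c = dot(g, q)
        g = [gi - c * qi for gi, qi in zip(g, q)]
    return g


# Hypothetical toy gradients (assumed values, for illustration only).
retain_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
unlearn_grad = [2.0, 3.0, 4.0]
projected = project_orthogonal(unlearn_grad, retain_grads)
```

In this toy case the projected gradient has zero inner product with each retain gradient, so a step along it does not change the retain-batch loss to first order.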

[LG-40] A Kolmogorov-Arnold Network for Explainable Detection of Cyberattacks on EV Chargers

Link: https://arxiv.org/abs/2503.02281
Authors: Ahmad Mohammad Saber,Max Mauro Dias Santos,Mohammad Al Janaideh,Amr Youssef,Deepa Kundur
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments: Accepted for the 2025 IEEE Power Energy Society General Meeting (PESGM), 27-31 July 2025 Austin, TX, USA

Abstract:The increasing adoption of Electric Vehicles (EVs), the expansion of charging infrastructure, and their reliance on communication expose Electric Vehicle Supply Equipment (EVSE) to cyberattacks. This paper presents a novel Kolmogorov-Arnold Network (KAN)-based framework for detecting cyberattacks on EV chargers using only power consumption measurements. Leveraging the KAN’s capability to model nonlinear, high-dimensional functions and its inherently interpretable architecture, the framework effectively differentiates between normal and malicious charging scenarios. The model is trained offline on a comprehensive dataset containing over 100,000 cyberattack cases generated through an experimental setup. Once trained, the KAN model can be deployed within individual chargers for real-time detection of abnormal charging behaviors indicative of cyberattacks. Our results demonstrate that the proposed KAN-based approach can accurately detect cyberattacks on EV chargers with Precision and F1-score of 99% and 92%, respectively, outperforming existing detection methods. Additionally, the proposed KAN enables the extraction of mathematical formulas representing the KAN’s detection decisions, addressing interpretability, a key challenge in deep learning-based cybersecurity frameworks. This work marks a significant step toward building secure and explainable EV charging infrastructure.

[LG-41] DreamerV3 for Traffic Signal Control: Hyperparameter Tuning and Performance

Link: https://arxiv.org/abs/2503.02279
Authors: Qiang Li,Yinhan Lin,Qin Luo,Lina Yu
Categories: Machine Learning (cs.LG)
*Comments: 14 pages, 9 figures

Abstract:Reinforcement learning (RL) has evolved into a widely investigated technology for the development of smart traffic signal control (TSC) strategies. However, current RL algorithms necessitate excessive interaction with the environment to learn effective policies, making them impractical for large-scale tasks. The DreamerV3 algorithm presents compelling properties for policy learning. It summarizes general dynamics knowledge about the environment and enables the prediction of future outcomes of potential actions from past experience, reducing the interaction with the environment through imagination training. In this paper, a corridor TSC model is trained using the DreamerV3 algorithm to explore the benefits of world models for TSC strategy learning. In RL environment design, to manage congestion levels effectively, both the state and reward functions are defined based on queue length, and the action is designed to manage queue length efficiently. Using the SUMO simulation platform, the two hyperparameters (training ratio and model size) of the DreamerV3 algorithm were tuned and analyzed across different origin-destination (OD) matrix scenarios. We discovered that choosing a smaller model size and initially attempting several medium training ratios can significantly reduce the time spent on hyperparameter tuning. Additionally, we found that the approach is generally applicable, as it can solve two TSC task scenarios with the same hyperparameters. Regarding the claimed data-efficiency of the DreamerV3 algorithm, due to the significant fluctuation of the episode reward curve in the early stages of training, it can only be confirmed that larger model sizes exhibit modest data-efficiency, and no evidence was found that increasing the training ratio accelerates convergence.

[LG-42] Nonlinear energy-preserving model reduction with lifting transformations that quadratize the energy

Link: https://arxiv.org/abs/2503.02273
Authors: Harsh Sharma,Juan Diego Draxl Giannoni,Boris Kramer
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*Comments:

Abstract:Existing model reduction techniques for high-dimensional models of conservative partial differential equations (PDEs) encounter computational bottlenecks when dealing with systems featuring non-polynomial nonlinearities. This work presents a nonlinear model reduction method that employs lifting variable transformations to derive structure-preserving quadratic reduced-order models for conservative PDEs with general nonlinearities. We present an energy-quadratization strategy that defines the auxiliary variable in terms of the nonlinear term in the energy expression to derive an equivalent quadratic lifted system with quadratic system energy. The proposed strategy combined with proper orthogonal decomposition model reduction yields quadratic reduced-order models that conserve the quadratized lifted energy exactly in high dimensions. We demonstrate the proposed model reduction approach on four nonlinear conservative PDEs: the one-dimensional wave equation with exponential nonlinearity, the two-dimensional sine-Gordon equation, the two-dimensional Klein-Gordon equation with parametric dependence, and the two-dimensional Klein-Gordon-Zakharov equations. The numerical results show that the proposed lifting approach is competitive with the state-of-the-art structure-preserving hyper-reduction method in terms of both accuracy and computational efficiency in the online stage while providing significant computational gains in the offline stage.

[LG-43] HiGP: A high-performance Python package for Gaussian Process

Link: https://arxiv.org/abs/2503.02259
Authors: Hua Huang,Tianshi Xu,Yuanzhe Xi,Edmond Chow
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Gaussian Processes (GPs) are flexible, nonparametric Bayesian models widely used for regression and classification tasks due to their ability to capture complex data patterns and provide uncertainty quantification (UQ). Traditional GP implementations often face challenges in scalability and computational efficiency, especially with large datasets. To address these challenges, HiGP, a high-performance Python package, is designed for efficient Gaussian Process regression (GPR) and classification (GPC) across datasets of varying sizes. HiGP combines multiple new iterative methods to enhance the performance and efficiency of GP computations. It implements various effective matrix-vector (MatVec) and matrix-matrix (MatMul) multiplication strategies specifically tailored for kernel matrices. To improve the convergence of iterative methods, HiGP also integrates the recently developed Adaptive Factorized Nystrom (AFN) preconditioner and employs precise formulas for computing the gradients. With a user-friendly Python interface, HiGP seamlessly integrates with PyTorch and other Python packages, allowing easy incorporation into existing machine learning and data analysis workflows.

[LG-44] CrystalFramer: Rethinking the Role of Frames for SE(3)-Invariant Crystal Structure Modeling ICLR2025

Link: https://arxiv.org/abs/2503.02209
Authors: Yusei Ito,Tatsunori Taniai,Ryo Igarashi,Yoshitaka Ushiku,Kanta Ono
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
*Comments: 12 main pages, 3 main figures, and 4 main tables. Published as a conference paper at ICLR 2025. This version moves some appendices into the main text. For more information, see this https URL

Abstract:Crystal structure modeling with graph neural networks is essential for various applications in materials informatics, and capturing SE(3)-invariant geometric features is a fundamental requirement for these networks. A straightforward approach is to model with orientation-standardized structures through structure-aligned coordinate systems, or "frames." However, unlike molecules, determining frames for crystal structures is challenging due to their infinite and highly symmetric nature. In particular, existing methods rely on a statically fixed frame for each structure, determined solely by its structural information, regardless of the task under consideration. Here, we rethink the role of frames, questioning whether such simplistic alignment with the structure is sufficient, and propose the concept of dynamic frames. While accommodating the infinite and symmetric nature of crystals, these frames provide each atom with a dynamic view of its local environment, focusing on actively interacting atoms. We demonstrate this concept by utilizing the attention mechanism in a recent transformer-based crystal encoder, resulting in a new architecture called CrystalFramer. Extensive experiments show that CrystalFramer outperforms conventional frames and existing crystal encoders in various crystal property prediction tasks.

[LG-45] Volume-Sorted Prediction Set: Efficient Conformal Prediction for Multi-Target Regression

Link: https://arxiv.org/abs/2503.02205
Authors: Rui Luo,Zhixin Zhou
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:We introduce Volume-Sorted Prediction Set (VSPS), a novel method for uncertainty quantification in multi-target regression that uses conditional normalizing flows with conformal calibration. This approach constructs flexible, non-convex predictive regions with guaranteed coverage probabilities, overcoming limitations of traditional methods. By learning a transformation where the conditional distribution of responses follows a known form, VSPS identifies dense regions in the original space using the Jacobian determinant. This enables the creation of prediction regions that adapt to the true underlying distribution, focusing on areas of high probability density. Experimental results demonstrate that VSPS produces smaller, more informative prediction regions while maintaining robust coverage guarantees, enhancing uncertainty modeling in complex, high-dimensional settings.

[LG-46] From Data to Uncertainty Sets: a Machine Learning Approach

Link: https://arxiv.org/abs/2503.02173
Authors: Dimitris Bertsimas,Benjamin Boucher
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments: 25 pages, 4 figures

Abstract:Existing approaches of prescriptive analytics – where inputs of an optimization model can be predicted by leveraging covariates in a machine learning model – often attempt to optimize the mean value of an uncertain objective. However, when applied to uncertain constraints, these methods rarely work because satisfying a crucial constraint in expectation may result in a high probability of violation. To remedy this, we leverage robust optimization to protect a constraint against the uncertainty of a machine learning model’s output. To do so, we design an uncertainty set based on the model’s loss function. Intuitively, this approach attempts to minimize the uncertainty around a prediction. Extending guarantees from the robust optimization literature, we derive strong guarantees on the probability of violation. On synthetic computational experiments, our method requires uncertainty sets with radii up to one order of magnitude smaller than those of other approaches.

[LG-47] Is Bellman Equation Enough for Learning Control?

Link: https://arxiv.org/abs/2503.02171
Authors: Haoxiang You,Lekan Molu,Ian Abraham
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:The Bellman equation and its continuous-time counterpart, the Hamilton-Jacobi-Bellman (HJB) equation, serve as necessary conditions for optimality in reinforcement learning and optimal control. While the value function is known to be the unique solution to the Bellman equation in tabular settings, we demonstrate that this uniqueness fails to hold in continuous state spaces. Specifically, for linear dynamical systems, we prove the Bellman equation admits at least $\binom{2n}{n}$ solutions, where $n$ is the state dimension. Crucially, only one of these solutions yields both an optimal policy and a stable closed-loop system. We then demonstrate a common failure mode in value-based methods: convergence to unstable solutions due to the exponential imbalance between admissible and inadmissible solutions. Finally, we introduce a positive-definite neural architecture that guarantees convergence to the stable solution by construction to address this issue.
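
The non-uniqueness claim can be checked by hand in the simplest case. The sketch below is not taken from the paper: it assumes a scalar continuous-time linear-quadratic problem, dx/dt = a*x + b*u with running cost q*x^2 + r*u^2, and a quadratic value function V(x) = p*x^2. Under these assumptions the HJB equation reduces to the quadratic q + 2*a*p - (b^2/r)*p^2 = 0, which has two roots for state dimension 1. The system and cost values are hypothetical.

```python
import math


def scalar_riccati_solutions(a: float, b: float, q: float, r: float) -> list:
    """Both roots p of q + 2*a*p - (b**2 / r) * p**2 = 0, largest first."""
    s = math.sqrt(a * a + b * b * q / r)
    return [r * (a + s) / b**2, r * (a - s) / b**2]


def closed_loop_pole(a: float, b: float, r: float, p: float) -> float:
    """Pole of dx/dt = (a - b**2 * p / r) * x under the policy u = -(b * p / r) * x."""
    return a - b * b * p / r


# Hypothetical system and cost weights (assumed for illustration).
a, b, q, r = 1.0, 1.0, 3.0, 1.0
solutions = scalar_riccati_solutions(a, b, q, r)
poles = [closed_loop_pole(a, b, r, p) for p in solutions]

n = 1  # state dimension of this toy problem
count = math.comb(2 * n, n)  # number of solutions stated in the abstract for n = 1
```

Both roots satisfy the equation, but only the positive root gives a negative closed-loop pole, that is, a stable system; the other root is a valid solution of the equation that destabilizes the loop.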

[LG-48] DDAD: A Two-pronged Adversarial Defense Based on Distributional Discrepancy

Link: https://arxiv.org/abs/2503.02169
Authors: Jiacheng Zhang,Benjamin I. P. Rubinstein,Jingfeng Zhang,Feng Liu
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:

Abstract:Statistical adversarial data detection (SADD) detects whether an upcoming batch contains adversarial examples (AEs) by measuring the distributional discrepancies between clean examples (CEs) and AEs. In this paper, we reveal the potential strength of SADD-based methods by theoretically showing that minimizing distributional discrepancy can help reduce the expected loss on AEs. Nevertheless, despite these advantages, SADD-based methods have a potential limitation: they discard inputs that are detected as AEs, leading to the loss of clean information within those inputs. To address this limitation, we propose a two-pronged adversarial defense method, named Distributional-Discrepancy-based Adversarial Defense (DDAD). In the training phase, DDAD first optimizes the test power of the maximum mean discrepancy (MMD) to derive MMD-OPT, and then trains a denoiser by minimizing the MMD-OPT between CEs and AEs. In the inference phase, DDAD first leverages MMD-OPT to differentiate CEs and AEs, and then applies a two-pronged process: (1) directly feeding the detected CEs into the classifier, and (2) removing noise from the detected AEs by the distributional-discrepancy-based denoiser. Extensive experiments show that DDAD outperforms current state-of-the-art (SOTA) defense methods by notably improving clean and robust accuracy on CIFAR-10 and ImageNet-1K against adaptive white-box attacks.
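
For readers unfamiliar with the maximum mean discrepancy, the following is a minimal sketch of the standard biased estimate of squared MMD with a fixed Gaussian kernel. It is not the paper's MMD-OPT, which additionally optimizes test power; the function names, the bandwidth, and the toy samples are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float], bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth**2))


def mmd_squared(xs: List[Sequence[float]], ys: List[Sequence[float]],
                bandwidth: float = 1.0) -> float:
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical two-dimensional samples standing in for clean and shifted batches.
clean = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
shifted = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

same_batch = mmd_squared(clean, clean)
different_batch = mmd_squared(clean, shifted)
```

The estimate is zero when a batch is compared with itself and grows as the two batches separate, which is the property a discrepancy-based detector relies on.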

[LG-49] LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation

Link: https://arxiv.org/abs/2503.02161
Authors: Yunbo Long,Liming Xu,Alexandra Brintrup
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Synthetic tabular data have widespread applications in industrial domains such as healthcare, finance, and supply chains, owing to their potential to protect privacy and mitigate data scarcity. However, generating realistic synthetic tabular data while preserving inter-column logical relationships remains a significant challenge for existing generative models. To address this challenge, we propose LLM-TabFlow, a novel approach that leverages Large Language Model (LLM) reasoning to capture complex inter-column relationships and compress tabular data, while using Score-based Diffusion to model the distribution of the compressed data in latent space. Additionally, we introduce an evaluation framework, which is absent in the literature, to fairly assess the performance of synthetic tabular data generation methods in real-world contexts. Using this framework, we conduct extensive experiments on two real-world industrial datasets, evaluating LLM-TabFlow against five other baseline methods, including SMOTE (an interpolation-based approach) and other state-of-the-art generative models. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy. This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation, offering new insights for developing more realistic and reliable tabular data generation methods.

[LG-50] Frankenstein Optimizer: Harnessing the Potential by Revisiting Optimization Tricks

Link: https://arxiv.org/abs/2503.02147
Authors: Chia-Wei Hsu,Nien-Ti Tsou,Yu-Cheng Chen,Yang Jeong Park,Ju Li
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Gradient-based optimization drives the unprecedented performance of modern deep neural network models across diverse applications. Adaptive algorithms have accelerated neural network training due to their rapid convergence rates; however, they struggle to find "flat minima" reliably, resulting in suboptimal generalization compared to stochastic gradient descent (SGD). By revisiting various adaptive algorithms’ mechanisms, we propose the Frankenstein optimizer, which combines their advantages. The proposed Frankenstein dynamically adjusts first- and second-momentum coefficients according to the optimizer’s current state to directly maintain consistent learning dynamics and immediately reflect sudden gradient changes. Extensive experiments across several research domains such as computer vision, natural language processing, few-shot learning, and scientific simulations show that Frankenstein surpasses existing adaptive algorithms and SGD empirically regarding convergence speed and generalization performance. Furthermore, this research deepens our understanding of adaptive algorithms through centered kernel alignment analysis and loss landscape visualization during the learning process.

[LG-51] Four Principles for Physically Interpretable World Models

Link: https://arxiv.org/abs/2503.02143
Authors: Jordan Peper,Zhenjiang Mao,Yuang Geng,Siyuan Pan,Ivan Ruchkin
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: Equal contribution by the first two authors

Abstract:As autonomous systems are increasingly deployed in open and uncertain settings, there is a growing need for trustworthy world models that can reliably predict future high-dimensional observations. The learned latent representations in world models lack direct mapping to meaningful physical quantities and dynamics, limiting their utility and interpretability in downstream planning, control, and safety verification. In this paper, we argue for a fundamental shift from physically informed to physically interpretable world models - and crystallize four principles that leverage symbolic knowledge to achieve these ends: (1) structuring latent spaces according to the physical intent of variables, (2) learning aligned invariant and equivariant representations of the physical world, (3) adapting training to the varied granularity of supervision signals, and (4) partitioning generative outputs to support scalability and verifiability. We experimentally demonstrate the value of each principle on two benchmarks. This paper opens several intriguing research directions to achieve and capitalize on full physical interpretability in world models.

[LG-52] AVG-DICE: Stationary Distribution Correction by Regression

Link: https://arxiv.org/abs/2503.02125
Authors: Fengdi Che,Bryan Chan,Chen Ma,A. Rupam Mahmood
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Off-policy policy evaluation (OPE), an essential component of reinforcement learning, has long suffered from stationary state distribution mismatch, undermining both stability and accuracy of OPE estimates. While existing methods correct distribution shifts by estimating density ratios, they often rely on expensive optimization or backward Bellman-based updates and struggle to outperform simpler baselines. We introduce AVG-DICE, a computationally simple Monte Carlo estimator for the density ratio that averages discounted importance sampling ratios, providing an unbiased and consistent correction. AVG-DICE extends naturally to nonlinear function approximation using regression, which we roughly tune and test on OPE tasks based on MuJoCo Gym environments and compare with state-of-the-art density-ratio estimators using their reported hyperparameters. In our experiments, AVG-DICE is at least as accurate as state-of-the-art estimators and sometimes offers orders-of-magnitude improvements. However, a sensitivity analysis shows that the best-performing hyperparameters may vary substantially across different discount factors, so re-tuning is suggested.

[LG-53] A Hybrid CNN-Transformer Model for Heart Disease Prediction Using Life History Data

Link: https://arxiv.org/abs/2503.02124
Authors: Ran Hao,Yanlin Xiang,Junliang Du,Qingyuan He,Jiacheng Hu,Ting Xu
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:This study proposed a hybrid model of a convolutional neural network (CNN) and a Transformer to predict and diagnose heart disease. Based on CNN’s strength in detecting local features and the Transformer’s high capacity in sensing global relations, the model is able to successfully detect risk factors of heart disease from high-dimensional life history data. Experimental results show that the proposed model outperforms traditional benchmark models like support vector machine (SVM), convolutional neural network (CNN), and long short-term memory network (LSTM) on several measures like accuracy, precision, and recall. This demonstrates its strong ability to deal with multi-dimensional and unstructured data. In order to verify the effectiveness of the model, experiments removing certain parts were carried out, and the results of the experiments showed that it is important to use both CNN and Transformer modules in enhancing the model. This paper also discusses the incorporation of additional features and approaches in future studies to enhance the model’s performance and enable it to operate effectively in diverse conditions. This study presents novel insights and methods for predicting heart disease using machine learning, with numerous potential applications especially in personalized medicine and health management.

[LG-54] An Efficient Plugin Method for Metric Optimization of Black-Box Models

Link: https://arxiv.org/abs/2503.02119
Authors: Siddartha Devic,Nurendra Choudhary,Anirudh Srinivasan,Sahika Genc,Branislav Kveton,Gaurush Hiranandani
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Many machine learning algorithms and classifiers are available only via API queries as a "black-box", that is, the downstream user has no ability to change, re-train, or fine-tune the model on a particular target distribution. Indeed, the downstream user may not even have knowledge of the original training distribution or performance metric used to construct and optimize the black-box model. We propose a simple and efficient method, Plugin, which post-processes arbitrary multiclass predictions from any black-box classifier in order to simultaneously (1) adapt these predictions to a target distribution; and (2) optimize a particular metric of the confusion matrix. Importantly, Plugin is a completely post-hoc method which does not rely on feature information, only requires a small amount of probabilistic predictions along with their corresponding true label, and optimizes metrics by querying. We empirically demonstrate that Plugin is both broadly applicable and has performance competitive with related methods on a variety of tabular and language tasks.

[LG-55] Fairness and/or Privacy on Social Graphs

Link: https://arxiv.org/abs/2503.02114
Authors: Bartlomiej Surma,Michael Backes,Yang Zhang
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
*Comments:

Abstract:Graph Neural Networks (GNNs) have shown remarkable success in various graph-based learning tasks. However, recent studies have raised concerns about fairness and privacy issues in GNNs, highlighting the potential for biased or discriminatory outcomes and the vulnerability of sensitive information. This paper presents a comprehensive investigation of fairness and privacy in GNNs, exploring the impact of various fairness-preserving measures on model performance. We conduct experiments across diverse datasets and evaluate the effectiveness of different fairness interventions. Our analysis considers the trade-offs between fairness, privacy, and accuracy, providing insights into the challenges and opportunities in achieving both fair and private graph learning. The results highlight the importance of carefully selecting and combining fairness-preserving measures based on the specific characteristics of the data and the desired fairness objectives. This study contributes to a deeper understanding of the complex interplay between fairness, privacy, and accuracy in GNNs, paving the way for the development of more robust and ethical graph learning models.

[LG-56] Deep Learning is Not So Mysterious or Different

Link: https://arxiv.org/abs/2503.02113
Authors: Andrew Gordon Wilson
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization. We argue that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behaviour can be intuitively understood, and rigorously characterized using long-standing generalization frameworks such as PAC-Bayes and countable hypothesis bounds. We present soft inductive biases as a key unifying principle in explaining these phenomena: rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem. However, we also highlight how deep learning is relatively distinct in other ways, such as its ability for representation learning, phenomena such as mode connectivity, and its relative universality.

[LG-57] Building Machine Learning Challenges for Anomaly Detection in Science

Link: https://arxiv.org/abs/2503.02112
Authors: Elizabeth G. Campolongo,Yuan-Tang Chou,Ekaterina Govorkova,Wahid Bhimji,Wei-Lun Chao,Chris Harris,Shih-Chieh Hsu,Hilmar Lapp,Mark S. Neubauer,Josephine Namayanja,Aneesh Subramanian,Philip Harris,Advaith Anand,David E. Carlyn,Subhankar Ghosh,Christopher Lawrence,Eric Moreno,Ryan Raikman,Jiaman Wu,Ziheng Zhang,Bayu Adhi,Mohammad Ahmadi Gharehtoragh,Saúl Alonso Monsalve,Marta Babicz,Furqan Baig,Namrata Banerji,William Bardon,Tyler Barna,Tanya Berger-Wolf,Adji Bousso Dieng,Micah Brachman,Quentin Buat,David C.Y. Hui,Phuong Cao,Franco Cerino,Yi-Chun Chang,Shivaji Chaulagain,An-Kai Chen,Deming Chen,Eric Chen,Chia-Jui Chou,Zih-Chen Ciou,Miles Cochran-Branson,Artur Cordeiro Oudot Choi,Michael Coughlin,Matteo Cremonesi,Maria Dadarlat,Peter Darch,Malina Desai,Daniel Diaz,Steven Dillmann,Javier Duarte,Isla Duporge,Urbas Ekka,Saba Entezari Heravi,Hao Fang,Rian Flynn,Geoffrey Fox,Emily Freed,Hang Gao,Jing Gao,Julia Gonski,Matthew Graham,Abolfazl Hashemi,Scott Hauck,James Hazelden,Joshua Henry Peterson,Duc Hoang,Wei Hu,Mirco Huennefeld,David Hyde,Vandana Janeja,Nattapon Jaroenchai,Haoyi Jia,Yunfan Kang,Maksim Kholiavchenko,Elham E. Khoda,Sangin Kim,Aditya Kumar,Bo-Cheng Lai,Trung Le,Chi-Wei Lee,JangHyeon Lee,Shaocheng Lee,Suzan van der Lee,Charles Lewis,Haitong Li,Haoyang Li,Henry Liao,Mia Liu,Xiaolin Liu,Xiulong Liu,Vladimir Loncar,Fangzheng Lyu,Ilya Makarov,Abhishikth Mallampalli,Chen-Yu Mao,Alexander Michels,Alexander Migala,Farouk Mokhtar,Mathieu Morlighem
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
*Comments: 18 pages, 6 figures; to be submitted to Nature Communications

Abstract:Scientific discoveries are often made by finding a pattern or object that was not predicted by the known rules of science. Oftentimes, these anomalous events or objects that do not conform to the norms are an indication that the rules of science governing the data are incomplete, and something new needs to be present to explain these unexpected outliers. The challenge of finding anomalies can be confounding since it requires codifying a complete knowledge of the known scientific behaviors and then projecting these known behaviors on the data to look for deviations. When utilizing machine learning, this presents a particular challenge since we require that the model not only understands scientific data perfectly but also recognizes when the data is inconsistent and out of the scope of its trained behavior. In this paper, we present three datasets aimed at developing machine learning-based anomaly detection for disparate scientific domains covering astrophysics, genomics, and polar science. We present the different datasets along with a scheme to make machine learning challenges around the three datasets findable, accessible, interoperable, and reusable (FAIR). Furthermore, we present an approach that generalizes to future machine learning challenges, enabling the possibility of large, more compute-intensive challenges that can ultimately lead to scientific discovery.

[LG-58] Correcting Mode Proportion Bias in Generalized Bayesian Inference via a Weighted Kernel Stein Discrepancy

Link: https://arxiv.org/abs/2503.02108
Authors: Elham Afzali,Saman Muthukumarana,Liqun Wang
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*Comments: 20 pages, 3 figures. Submitted to Bayesian Analysis for review

Abstract:Generalized Bayesian Inference (GBI) provides a flexible framework for updating prior distributions using various loss functions instead of the traditional likelihoods, thereby enhancing robustness to model misspecification. However, GBI often suffers from problems associated with intractable likelihoods. Kernelized Stein Discrepancy (KSD), as utilized in a recent study, addresses this challenge by relying only on the gradient of the log-likelihood. Despite this innovation, KSD-Bayes suffers from critical pathologies, including insensitivity to well-separated modes in multimodal posteriors. To address this limitation, we propose a weighted KSD method that retains computational efficiency while effectively capturing multimodal structures. Our method improves the GBI framework for handling intractable multimodal posteriors while maintaining key theoretical properties such as posterior consistency and asymptotic normality. Experimental results demonstrate that our method substantially improves mode sensitivity compared to standard KSD-Bayes, while retaining robust performance in unimodal settings and in the presence of outliers.

[LG-59] Biomedical Foundation Model: A Survey

Link: https://arxiv.org/abs/2503.02104
Authors: Xiangrui Liu, Yuanyuan Zhang, Yingzhou Lu, Changchang Yin, Xiaoling Hu, Xiaoou Liu, Lulu Chen, Sheng Wang, Alexander Rodriguez, Huaxiu Yao, Yezhou Yang, Ping Zhang, Jintai Chen, Tianfan Fu, Xiao Wang
Subjects: Machine Learning (cs.LG)

Abstract:Foundation models, first introduced in 2021, are large-scale pre-trained models (e.g., large language models (LLMs) and vision-language models (VLMs)) that learn from extensive unlabeled datasets through unsupervised methods, enabling them to excel in diverse downstream tasks. These models, like GPT, can be adapted to various applications such as question answering and visual understanding, outperforming task-specific AI models and earning their name due to broad applicability across fields. The development of biomedical foundation models marks a significant milestone in leveraging artificial intelligence (AI) to understand complex biological phenomena and advance medical research and practice. This survey explores the potential of foundation models across diverse domains within biomedical fields, including computational biology, drug discovery and development, clinical informatics, medical imaging, and public health. The purpose of this survey is to inspire ongoing research in the application of foundation models to health science.

[LG-60] Uncertainty Representation in a SOTIF-Related Use Case with Dempster-Shafer Theory for LiDAR Sensor-Based Object Detection

Link: https://arxiv.org/abs/2503.02087
Authors: Milin Patel, Rolf Jung
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Submitted as an extended paper of the Vehicle Technology and Intelligent Transport Systems (VEHITS) 2024 conference; to be published by Springer in a CCIS Series book later in 2025

Abstract:Uncertainty in LiDAR sensor-based object detection arises from environmental variability and sensor performance limitations. Representing these uncertainties is essential for ensuring the Safety of the Intended Functionality (SOTIF), which focuses on preventing hazards in automated driving scenarios. This paper presents a systematic approach to identifying, classifying, and representing uncertainties in LiDAR-based object detection within a SOTIF-related scenario. Dempster-Shafer Theory (DST) is employed to construct a Frame of Discernment (FoD) to represent detection outcomes. Conditional Basic Probability Assignments (BPAs) are applied based on dependencies among identified uncertainty sources. Yager’s Rule of Combination is used to resolve conflicting evidence from multiple sources, providing a structured framework to evaluate uncertainties’ effects on detection accuracy. The study applies variance-based sensitivity analysis (VBSA) to quantify and prioritize uncertainties, detailing their specific impact on detection performance.
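
To make the combination step more concrete, here is a minimal sketch of Yager's rule for two basic probability assignments. It is not the authors' implementation: the two-element frame, the source names, and the mass values are hypothetical, and the conditional BPAs described in the abstract are not modeled. The key difference from Dempster's rule is that conflicting mass is moved to the whole frame (ignorance) instead of being normalized away.

```python
from itertools import product


def yager_combine(m1, m2, frame):
    """Combine two mass functions with Yager's rule.

    m1, m2: dicts mapping frozenset focal elements to masses that sum to 1.
    frame:  frozenset holding every hypothesis (the frame of discernment).
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # The two pieces of evidence contradict each other.
            conflict += mass_a * mass_b
    if conflict > 0.0:
        # Yager: conflicting mass is assigned to the full frame, not renormalized.
        combined[frame] = combined.get(frame, 0.0) + conflict
    return combined


# Hypothetical frame and evidence sources, for illustration only.
FRAME = frozenset({"detected", "missed"})
weather_evidence = {frozenset({"detected"}): 0.7, FRAME: 0.3}
range_evidence = {frozenset({"missed"}): 0.6, FRAME: 0.4}

fused = yager_combine(weather_evidence, range_evidence, FRAME)
for focal, mass in fused.items():
    print(sorted(focal), round(mass, 3))
```

In this toy example the two sources disagree strongly, so most of the fused mass ends up on the full frame, which is how Yager's rule expresses "we do not know" rather than forcing a decision.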

[LG-61] M^3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality

Link: https://arxiv.org/abs/2503.02077
Authors: Ziyan Wang, Zhicheng Zhang, Fei Fang, Yali Du
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: Seventeen pages, four figures

Abstract:Designing effective reward functions in multi-agent reinforcement learning (MARL) is a significant challenge, often leading to suboptimal or misaligned behaviors in complex, coordinated environments. We introduce Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality (\text{M}^3\text{HF}), a novel framework that integrates multi-phase human feedback of mixed quality into the MARL training process. By involving humans with diverse expertise levels to provide iterative guidance, \text{M}^3\text{HF} leverages both expert and non-expert feedback to continuously refine agents’ policies. During training, we strategically pause agent learning for human evaluation, parse feedback using large language models to assign it appropriately and update reward functions through predefined templates and adaptive weight by using weight decay and performance-based adjustments. Our approach enables the integration of nuanced human insights across various levels of quality, enhancing the interpretability and robustness of multi-agent cooperation. Empirical results in challenging environments demonstrate that \text{M}^3\text{HF} significantly outperforms state-of-the-art methods, effectively addressing the complexities of reward design in MARL and enabling broader human participation in the training process.

[LG-62] Active Alignments of Lens Systems with Reinforcement Learning

Link: https://arxiv.org/abs/2503.02075
Authors: Matthias Burkhardt, Tobias Schmähling, Michael Layh, Tobias Windisch
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Aligning a lens system relative to an imager is a critical challenge in camera manufacturing. While optimal alignment can be mathematically computed under ideal conditions, real-world deviations caused by manufacturing tolerances often render this approach impractical. Measuring these tolerances can be costly or even infeasible, and neglecting them may result in suboptimal alignments. We propose a reinforcement learning (RL) approach that learns exclusively in the pixel space of the sensor output, eliminating the need to develop expert-designed alignment concepts. We conduct an extensive benchmark study and show that our approach surpasses other methods in speed, precision, and robustness. We further introduce relign, a realistic, freely explorable, open-source simulation utilizing physically based rendering that models optical systems with non-deterministic manufacturing tolerances and noise in robotic alignment movement. It provides an interface to popular machine learning frameworks, enabling seamless experimentation and development. Our work highlights the potential of RL in a manufacturing environment to enhance efficiency of optical alignments while minimizing the need for manual intervention.

[LG-63] Constrained Linear Thompson Sampling

Link: https://arxiv.org/abs/2503.02043
Authors: Aditya Gangrade, Venkatesh Saligrama
Subjects: Machine Learning (cs.LG)

Abstract:We study the safe linear bandit problem, where an agent sequentially selects actions from a convex domain to maximize an unknown objective while ensuring unknown linear constraints are satisfied on a per-round basis. Existing approaches primarily rely on optimism-based methods with frequentist confidence bounds, often leading to computationally expensive action selection routines. We propose COnstrained Linear Thompson Sampling (COLTS), a sampling-based framework that efficiently balances regret minimization and constraint satisfaction by selecting actions on the basis of noisy perturbations of the estimates of the unknown objective vector and constraint matrix. We introduce three variants of COLTS, distinguished by the learner’s available side information:
- S-COLTS assumes access to a known safe action and ensures strict constraint enforcement by combining the COLTS approach with a rescaling towards the safe action. For d-dimensional actions, this yields \tilde{O}(\sqrt{d^3 T}) regret and zero constraint violations (or risk).
- E-COLTS enforces constraints softly under Slater’s condition, and attains regret and risk of \tilde{O}(\sqrt{d^3 T}) by combining COLTS with uniform exploration.
- R-COLTS requires no side information, and ensures instance-independent regret and risk of \tilde{O}(\sqrt{d^3 T}) by leveraging repeated resampling.
A key technical innovation is a coupled noise design, which maintains optimism while preserving computational efficiency, which is combined with a scaling based analysis technique to address the variation of the per-round feasible region induced by sampled constraint matrices. Our methods match the regret bounds of prior approaches, while significantly reducing computational costs compared to them, thus yielding a scalable and practical approach for constrained bandit linear optimization.

[LG-64] Accelerating Multi-Task Temporal Difference Learning under Low-Rank Representation

Link: https://arxiv.org/abs/2503.02030
Authors: Yitao Bai, Sihan Zeng, Justin Romberg, Thinh T. Doan
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 13 pages, 3 figures

Abstract:We study policy evaluation problems in multi-task reinforcement learning (RL) under a low-rank representation setting. In this setting, we are given N learning tasks where the corresponding value functions of these tasks lie in an r-dimensional subspace, with r < N. One can apply the classic temporal-difference (TD) learning method to solve these problems, where this method learns the value function of each task independently. In this paper, we are interested in understanding whether one can exploit the low-rank structure of the multi-task setting to accelerate the performance of TD learning. To answer this question, we propose a new variant of the TD learning method, where we integrate the so-called truncated singular value decomposition step into the update of TD learning. This additional step enables TD learning to exploit the dominant directions due to the low-rank structure to update the iterates, thereby improving its performance. Our empirical results show that the proposed method significantly outperforms classic TD learning, where the performance gap increases as the rank r decreases. From the theoretical point of view, introducing the truncated singular value decomposition step into TD learning might cause instability in the updates. We provide a theoretical result showing that the instability does not happen. Specifically, we prove that the proposed method converges at a rate \mathcal{O}(\frac{\ln(t)}{t}), where t is the number of iterations. This rate matches that of standard TD learning.

[LG-65] A Lightweight and Secure Deep Learning Model for Privacy-Preserving Federated Learning in Intelligent Enterprises

Link: https://arxiv.org/abs/2503.02017
Authors: Reza Fotohi, Fereidoon Shams Aliee, Bahar Farahani
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, IEEE Internet of Things Journal (2024)

Abstract:The ever-growing Internet of Things (IoT) connections drive a new type of organization, the Intelligent Enterprise. In intelligent enterprises, machine learning based models are adopted to extract insights from data. Due to the efficiency and privacy challenges of these traditional models, a new federated learning (FL) paradigm has emerged. In FL, multiple enterprises can jointly train a model to update a final model. However, FL faces three problems. First, FL-trained models usually perform worse than centralized models, especially when the enterprises’ training data is non-IID (Independent and Identically Distributed). Second, due to the centrality of FL and the untrustworthiness of local enterprises, traditional FL solutions are vulnerable to poisoning and inference attacks and violate privacy. Third, the continuous transfer of parameters between enterprises and servers increases communication costs. To this end, the FedAnil+ model is proposed, a novel, lightweight, and secure Federated Deep Learning Model that includes three main phases. In the first phase, the goal is to solve the data type distribution skew challenge. Addressing privacy concerns against poisoning and inference attacks is covered in the second phase. Finally, to alleviate the communication overhead, a novel compression approach is proposed that significantly reduces the size of the updates. The experiment results validate that FedAnil+ is secure against inference and poisoning attacks with better accuracy. In addition, it shows improvements over existing approaches in terms of model accuracy (13%, 16%, and 26%), communication cost (17%, 21%, and 25%), and computation cost (7%, 9%, and 11%).

[LG-66] Interval Regression: A Comparative Study with Proposed Models

Link: https://arxiv.org/abs/2503.02011
Authors: Tung L Nguyen, Toby Dylan Hocking
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 13 pages, 4 figures

Abstract:Regression models are essential for a wide range of real-world applications. However, in practice, target values are not always precisely known; instead, they may be represented as intervals of acceptable values. This challenge has led to the development of Interval Regression models. In this study, we provide a comprehensive review of existing Interval Regression models and introduce alternative models for comparative analysis. Experiments are conducted on both real-world and synthetic datasets to offer a broad perspective on model performance. The results demonstrate that no single model is universally optimal, highlighting the importance of selecting the most suitable model for each specific scenario.
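
For readers new to the setting, the sketch below shows one simple way to score a prediction against an interval target: the loss is zero anywhere inside the interval and grows quadratically outside it. This is an illustrative loss chosen for this digest, not necessarily one of the models compared in the paper; the function names and the sample data are hypothetical. Open-ended (censored) targets are expressed with infinite bounds.

```python
import math


def interval_loss(pred, lower, upper):
    """Squared distance from pred to the interval [lower, upper].

    Returns 0.0 when lower <= pred <= upper. Use -math.inf or math.inf
    for a bound that is unknown (left- or right-censored target).
    """
    below = max(0.0, lower - pred)  # how far pred falls under the interval
    above = max(0.0, pred - upper)  # how far pred overshoots the interval
    return below ** 2 + above ** 2


def mean_interval_loss(preds, intervals):
    """Average interval loss over a list of (lower, upper) targets."""
    return sum(interval_loss(p, lo, hi) for p, (lo, hi) in zip(preds, intervals)) / len(preds)


# Hypothetical predictions and interval targets.
predictions = [5.0, 2.0, 9.0]
targets = [(3.0, 7.0), (3.0, 7.0), (3.0, math.inf)]
print(mean_interval_loss(predictions, targets))
```

Only the second prediction falls outside its interval, so it is the only one that contributes to the average.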

[LG-67] A Deep Autoregressive Model for Dynamic Combinatorial Complexes

Link: https://arxiv.org/abs/2503.01999
Authors: Ata Tuna
Subjects: Machine Learning (cs.LG)
Comments: 66 pages, 12 figures. Submitted in partial fulfillment of the requirements for the MRes degree in Artificial Intelligence and Machine Learning of Imperial College London

Abstract:We introduce DAMCC (Deep Autoregressive Model for Dynamic Combinatorial Complexes), the first deep learning model designed to generate dynamic combinatorial complexes (CCs). Unlike traditional graph-based models, CCs capture higher-order interactions, making them ideal for representing social networks, biological systems, and evolving infrastructures. While existing models primarily focus on static graphs, DAMCC addresses the challenge of modeling temporal dynamics and higher-order structures in dynamic networks. DAMCC employs an autoregressive framework to predict the evolution of CCs over time. Through comprehensive experiments on real-world and synthetic datasets, we demonstrate its ability to capture both temporal and higher-order dependencies. As the first model of its kind, DAMCC lays the foundation for future advancements in dynamic combinatorial complex modeling, with opportunities for improved scalability and efficiency on larger networks.

[LG-68] Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization

Link: https://arxiv.org/abs/2503.01922
Authors: Leonid Berlyand, Theo Bourdais, Houman Owhadi, Yitzchak Shmalo
Subjects: Machine Learning (cs.LG)

Abstract:Deep neural networks (DNNs) have brought significant advancements in various applications in recent years, such as image recognition, speech recognition, and natural language processing. In particular, Vision Transformers (ViTs) have emerged as a powerful class of models in the field of deep learning for image classification. In this work, we propose a novel Random Matrix Theory (RMT)-based method for pruning pre-trained DNNs, based on the sparsification of weights and singular vectors, and apply it to ViTs. RMT provides a robust framework to analyze the statistical properties of large matrices, which has been shown to be crucial for understanding and optimizing the performance of DNNs. We demonstrate that our RMT-based pruning can be used to reduce the number of parameters of ViT models (trained on ImageNet) by 30-50% with less than 1% loss in accuracy. To our knowledge, this represents the state-of-the-art in pruning for these ViT models. Furthermore, we provide a rigorous mathematical underpinning of the above numerical studies, namely we proved a theorem for fully connected DNNs, and other more general DNN structures, describing how the randomness in the weight matrices of a DNN decreases as the weights approach a local or global minimum (during training). We verify this theorem through numerical experiments on fully connected DNNs, providing empirical support for our theoretical findings. Moreover, we prove a theorem that describes how DNN loss decreases as we remove randomness in the weight layers, and show a monotone dependence of the decrease in loss with the amount of randomness that we remove. Our results also provide significant RMT-based insights into the role of regularization during training and pruning.

[LG-69] Multi-models with averaging in feature domain for non-invasive blood glucose estimation

Link: https://arxiv.org/abs/2503.01918
Authors: Yiting Wei, Bingo Wing-Kuen Ling, Qing Liu, Jiaxin Liu
Subjects: Machine Learning (cs.LG)
Comments: This version corrects two typos

Abstract:Diabetes is a serious chronic metabolic disease. In recent years, more and more consumer technology enterprises focusing on human health have committed to implementing accurate, non-invasive blood glucose estimation algorithms in their products. However, due to interference from the external environment, these wearable non-invasive methods yield low estimation accuracy. To address this issue, this paper employs different models for different ranges of blood glucose values. First, the photoplethysmograms (PPGs) are acquired and denoised via the bit plane singular spectrum analysis (SSA) method. Second, the features are extracted. The training set is then processed in five steps. First, the features are averaged across the measurements in the feature domain via an optimization approach. Second, a random forest is employed to rank the importance of each feature. Third, the training set is divided into three subsets according to the reference blood glucose values. Fourth, the feature vectors and the corresponding blood glucose values in the same subset are used to build an individual model. Fifth, for each feature, the average of the feature values over all the measurements in the same subset is computed. For the test set, the sum of the weighted distances between the test feature values and the subset averages obtained above is first computed for each model, where the weights are defined by the feature importance ranked by the random forest. The model with the smallest sum is then selected. Finally, the blood glucose value is estimated using the selected model. Compared to state-of-the-art methods, the proposed method can effectively improve the estimation accuracy.
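
The test-time model selection described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the per-feature distance is taken to be an absolute difference (the abstract does not specify it), and the feature values, subset means, and importance weights are made up.

```python
def select_model(test_features, subset_means, weights):
    """Pick the subset model whose mean feature vector is closest to the test sample.

    test_features: feature values of one test measurement.
    subset_means:  one mean feature vector per glucose-range subset.
    weights:       per-feature importance weights (e.g. from a random forest).
    Returns the index of the subset with the smallest weighted distance sum.
    """
    def weighted_distance(means):
        # Assumption: absolute difference per feature, scaled by its importance.
        return sum(w * abs(x - m) for w, x, m in zip(weights, test_features, means))

    distances = [weighted_distance(means) for means in subset_means]
    return min(range(len(distances)), key=distances.__getitem__)


# Hypothetical values: three subsets (low, normal, high), two features.
subset_means = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
importance = [0.8, 0.2]
chosen = select_model([2.1, 26.0], subset_means, importance)
print("selected subset index:", chosen)
```

The second feature alone would point to the third subset, but the first feature carries more weight, so the middle subset is selected; the regression model trained on that subset would then produce the glucose estimate.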

[LG-70] VAEs and GANs: Implicitly Approximating Complex Distributions with Simple Base Distributions and Deep Neural Networks – Principles, Necessity, and Limitations

Link: https://arxiv.org/abs/2503.01898
Authors: Yuan-Hao Wei
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This tutorial focuses on the fundamental architectures of Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN), disregarding their numerous variations, to highlight their core principles. Both VAE and GAN utilize simple distributions, such as Gaussians, as a basis and leverage the powerful nonlinear transformation capabilities of neural networks to approximate arbitrarily complex distributions. The theoretical basis lies in that a linear combination of multiple Gaussians can almost approximate any probability distribution, while neural networks enable further refinement through nonlinear transformations. Both methods approximate complex data distributions implicitly. This implicit approximation is crucial because directly modeling high-dimensional distributions explicitly is often intractable. However, the choice of a simple latent prior, while computationally convenient, introduces limitations. In VAEs, the fixed Gaussian prior forces the posterior distribution to align with it, potentially leading to loss of information and reduced expressiveness. This restriction affects both the interpretability of the model and the quality of generated samples.

[LG-71] BiHRNN – Bi-Directional Hierarchical Recurrent Neural Network for Inflation Forecasting

Link: https://arxiv.org/abs/2503.01893
Authors: Maya Vilenko
Subjects: Machine Learning (cs.LG); General Economics (econ.GN)
Comments: Master’s thesis. Under the supervision of Dr. Noam Koeningstein. 40 pages

Abstract:Inflation prediction guides decisions on interest rates, investments, and wages, playing a key role in economic stability. Yet accurate forecasting is challenging due to dynamic factors and the layered structure of the Consumer Price Index, which organizes goods and services into multiple categories. We propose the Bi-directional Hierarchical Recurrent Neural Network (BiHRNN) model to address these challenges by leveraging the hierarchical structure to enable bidirectional information flow between levels. Informative constraints on the RNN parameters enhance predictive accuracy at all levels without the inefficiencies of a unified model. We validated BiHRNN on inflation datasets from the United States, Canada, and Norway by training, tuning hyperparameters, and experimenting with various loss functions. Our results demonstrate that BiHRNN significantly outperforms traditional RNN models, with its bidirectional architecture playing a pivotal role in achieving improved forecasting accuracy.

[LG-72] AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs

Link: https://arxiv.org/abs/2503.01890
Authors: Zihao Zeng, Chubo Liu, Xin He, Juan Hu, Yong Jiang, Fei Huang, Kenli Li, Wei Yang Bryan Lim
Subjects: Machine Learning (cs.LG)

Abstract:Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation, with improvements scaling proportionally with model size. However, the limitations of GPU memory have restricted LLM training accessibility for many researchers. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. In this work, we propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments. AutoHete dynamically adjusts activation checkpointing, parameter offloading, and optimizer offloading based on the specific hardware configuration and LLM training needs. Additionally, we design a priority-based scheduling mechanism that maximizes the overlap between operations across training iterations, enhancing throughput. Compared to state-of-the-art heterogeneous training systems, AutoHete delivers a 1.32x~1.91x throughput improvement across various model sizes and training configurations.

[LG-73] Constructing balanced datasets for predicting failure modes in structural systems under seismic hazards

Link: https://arxiv.org/abs/2503.01882
Authors: Jungho Kim, Taeyong Kim
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Abstract:Accurate prediction of structural failure modes under seismic excitations is essential for seismic risk and resilience assessment. Traditional simulation-based approaches often result in imbalanced datasets dominated by non-failure or frequently observed failure scenarios, limiting the effectiveness in machine learning-based prediction. To address this challenge, this study proposes a framework for constructing balanced datasets that include distinct failure modes. The framework consists of three key steps. First, critical ground motion features (GMFs) are identified to effectively represent ground motion time histories. Second, an adaptive algorithm is employed to estimate the probability densities of various failure domains in the space of critical GMFs and structural parameters. Third, samples generated from these probability densities are transformed into ground motion time histories by using a scaling factor optimization process. A balanced dataset is constructed by performing nonlinear response history analyses on structural systems with parameters matching the generated samples, subjected to corresponding transformed ground motion time histories. Deep neural network models are trained on balanced and imbalanced datasets to highlight the importance of dataset balancing. To further evaluate the framework’s applicability, numerical investigations are conducted using two different structural models subjected to recorded and synthetic ground motions. The results demonstrate the framework’s robustness and effectiveness in addressing dataset imbalance and improving machine learning performance in seismic failure mode prediction.

[LG-74] Uncertainty Comes for Free: Human-in-the-Loop Policies with Diffusion Models

Link: https://arxiv.org/abs/2503.01876
Authors: Zhanpeng He, Yifeng Cao, Matei Ciocarlie
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Human-in-the-loop (HitL) robot deployment has gained significant attention in both academia and industry as a semi-autonomous paradigm that enables human operators to intervene and adjust robot behaviors at deployment time, improving success rates. However, continuous human monitoring and intervention can be highly labor-intensive and impractical when deploying a large number of robots. To address this limitation, we propose a method that allows diffusion policies to actively seek human assistance only when necessary, reducing reliance on constant human oversight. To achieve this, we leverage the generative process of diffusion policies to compute an uncertainty-based metric based on which the autonomous agent can decide to request operator assistance at deployment time, without requiring any operator interaction during training. Additionally, we show that the same method can be used for efficient data collection for fine-tuning diffusion policies in order to improve their autonomous performance. Experimental results from simulated and real-world environments demonstrate that our approach enhances policy performance during deployment for a variety of scenarios.

[LG-75] A New ∼5σ Tension at Characteristic Redshift from DESI DR1 and DES-SN5YR observations

Link: https://arxiv.org/abs/2503.02880
Authors: Purba Mukherjee, Anjan A Sen
Subjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc); High Energy Physics - Theory (hep-th)
Comments: 4 pages, 1 table, 1 figure. Comments are welcome

Abstract:We perform a model-independent reconstruction of the angular diameter distance (D_A) using the Multi-Task Gaussian Process (MTGP) framework with DESI-DR1 BAO and DES-SN5YR datasets. We calibrate the comoving sound horizon at the baryon drag epoch r_d to the Planck best-fit value, ensuring consistency with early-universe physics. With the reconstructed D_A at two key redshifts, z \sim 1.63 (where D_A^\prime = 0) and z \sim 0.512 (where D_A^\prime = D_A), we derive the expansion rate of the Universe H(z) at these redshifts. Our findings reveal that at z \sim 1.63, the H(z) is fully consistent with the Planck-2018 \Lambda CDM prediction, confirming no new physics at that redshift. However, at z \sim 0.512, the derived H(z) shows a more than 5\sigma discrepancy with the Planck-2018 \Lambda CDM prediction, suggesting a possible breakdown of the \Lambda CDM model as constrained by Planck-2018 at this lower redshift. This emerging \sim 5\sigma tension at z \sim 0.512, distinct from the existing "Hubble Tension", may signal the first strong evidence for new physics at low redshifts.
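
A short derivation suggests why these two redshifts are convenient. It assumes a spatially flat FLRW universe, which is an assumption made here for illustration and may differ from the exact treatment in the paper. With D_M the comoving distance and primes denoting derivatives with respect to z, the expansion rate follows from D_A and its derivative alone, and at the two characteristic redshifts the derivative drops out entirely:

```latex
\begin{align}
  D_A(z) &= \frac{D_M(z)}{1+z}, \qquad D_M'(z) = \frac{c}{H(z)}
  && \text{(flat FLRW)} \\
  D_A'(z) &= \frac{c}{(1+z)\,H(z)} - \frac{D_A(z)}{1+z} \\
  H(z) &= \frac{c}{(1+z)\,D_A'(z) + D_A(z)} \\
  D_A' = 0 \;&\Rightarrow\; H(z) = \frac{c}{D_A(z)} \\
  D_A' = D_A \;&\Rightarrow\; H(z) = \frac{c}{(2+z)\,D_A(z)}
\end{align}
```

At these points H(z) depends only on the reconstructed value of D_A, not on a numerically estimated derivative.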

[LG-76] Multiaccuracy and Multicalibration via Proxy Groups

Link: https://arxiv.org/abs/2503.02870
Authors: Beepul Bharti, Mary Versa Clemens-Sewall, Paul H. Yi, Jeremias Sulam
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:As the use of predictive machine learning algorithms increases in high-stakes decision-making, it is imperative that these algorithms are fair across sensitive groups. Unfortunately, measuring and enforcing fairness in real-world applications can be challenging due to missing or incomplete sensitive group data. Proxy-sensitive attributes have been proposed as a practical and effective solution in these settings, but only for parity-based fairness notions. Knowing how to evaluate and control for fairness with missing sensitive group data for newer and more flexible frameworks, such as multiaccuracy and multicalibration, remains unexplored. In this work, we address this gap by demonstrating that in the absence of sensitive group data, proxy-sensitive attributes can provably be used to derive actionable upper bounds on the true multiaccuracy and multicalibration, providing insights into a model’s potential worst-case fairness violations. Additionally, we show that adjusting models to satisfy multiaccuracy and multicalibration across proxy-sensitive attributes can significantly mitigate these violations for the true, but unknown, sensitive groups. Through several experiments on real-world datasets, we illustrate that approximate multiaccuracy and multicalibration can be achieved even when sensitive group information is incomplete or unavailable.
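
As background, the sketch below computes a worst-group multiaccuracy violation under one common definition: the absolute value of the population average of the residual f(x) - y restricted to each group. It does not reproduce the paper's proxy-based upper bounds; in the missing-data setting one would pass proxy membership flags in place of the true groups. The function name, group names, and data are hypothetical.

```python
def multiaccuracy_violation(preds, labels, groups):
    """Return (group_name, violation) for the worst-off group.

    preds, labels: per-sample predictions and outcomes.
    groups:        dict mapping a group name to per-sample membership flags.
    The violation of a group is |(1/n) * sum of residuals over its members|,
    i.e. the residual averaged over the whole population, masked by the group.
    """
    n = len(preds)
    residuals = [p - y for p, y in zip(preds, labels)]
    worst_group, worst_value = None, 0.0
    for name, members in groups.items():
        masked_sum = sum(r for r, inside in zip(residuals, members) if inside)
        value = abs(masked_sum) / n
        if worst_group is None or value > worst_value:
            worst_group, worst_value = name, value
    return worst_group, worst_value


# Hypothetical predictions, outcomes, and two disjoint groups.
preds = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 1]
groups = {
    "a": [True, True, False, False],
    "b": [False, False, True, True],
}
worst, violation = multiaccuracy_violation(preds, labels, groups)
print(worst, round(violation, 3))
```

A model is approximately multiaccurate when this quantity stays below a chosen tolerance for every group in the collection.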

[LG-77] Unsupervised Attributed Dynamic Network Embedding with Stability Guarantees

Link: https://arxiv.org/abs/2503.02859
Authors: Emma Ceccherini, Ian Gallagher, Andrew Jones, Daniel Lawson
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 27 pages, 5 figures

Abstract:Stability for dynamic network embeddings ensures that nodes behaving the same at different times receive the same embedding, allowing comparison of nodes in the network across time. We present attributed unfolded adjacency spectral embedding (AUASE), a stable unsupervised representation learning framework for dynamic networks in which nodes are attributed with time-varying covariate information. To establish stability, we prove uniform convergence to an associated latent position model. We quantify the benefits of our dynamic embedding by comparing with state-of-the-art network representation learning methods on three real attributed networks. To the best of our knowledge, AUASE is the only attributed dynamic embedding that satisfies stability guarantees without the need for ground truth labels, which we demonstrate provides significant improvements for link prediction and node classification.

[LG-78] Spike-and-Slab Posterior Sampling in High Dimensions

Link: https://arxiv.org/abs/2503.02798
Authors: Syamantak Kumar, Purnamrita Sarkar, Kevin Tian, Yusong Zhu
Categories: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 53 pages

Abstract:Posterior sampling with the spike-and-slab prior [MB88], a popular multimodal distribution used to model uncertainty in variable selection, is considered the theoretical gold standard method for Bayesian sparse linear regression [CPS09, Roc18]. However, designing provable algorithms for performing this sampling task is notoriously challenging. Existing posterior samplers for Bayesian sparse variable selection tasks either require strong assumptions about the signal-to-noise ratio (SNR) [YWJ16], only work when the measurement count grows at least linearly in the dimension [MW24], or rely on heuristic approximations to the posterior. We give the first provable algorithms for spike-and-slab posterior sampling that apply for any SNR, and use a measurement count sublinear in the problem dimension. Concretely, assume we are given a measurement matrix \mathbf{X} \in \mathbb{R}^{n \times d} and noisy observations \mathbf{y} = \mathbf{X}\mathbf{\theta}^\star + \mathbf{\xi} of a signal \mathbf{\theta}^\star drawn from a spike-and-slab prior \pi with a Gaussian diffuse density and expected sparsity k, where \mathbf{\xi} \sim \mathcal{N}(\mathbb{0}_n, \sigma^2 \mathbf{I}_n). We give a polynomial-time high-accuracy sampler for the posterior \pi(\cdot \mid \mathbf{X}, \mathbf{y}), for any SNR \sigma^{-1} > 0, as long as n \geq k^3 \cdot \text{polylog}(d) and \mathbf{X} is drawn from a matrix ensemble satisfying the restricted isometry property. We further give a sampler that runs in near-linear time \approx nd in the same setting, as long as n \geq k^5 \cdot \text{polylog}(d). To demonstrate the flexibility of our framework, we extend our result to spike-and-slab posterior sampling with Laplace diffuse densities, achieving similar guarantees when \sigma = O(\frac{1}{k}) is bounded.

[LG-79] VWAP Execution with Signature-Enhanced Transformers: A Multi-Asset Learning Approach

Link: https://arxiv.org/abs/2503.02680
Authors: Remi Genet
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)

Abstract:In this paper I propose a novel approach to Volume Weighted Average Price (VWAP) execution that addresses two key practical challenges: the need for asset-specific model training and the capture of complex temporal dependencies. Building upon my recent work in dynamic VWAP execution arXiv:2502.18177, I demonstrate that a single neural network trained across multiple assets can achieve performance comparable to or better than traditional asset-specific models. The proposed architecture combines a transformer-based design inspired by arXiv:2406.02486 with path signatures for capturing geometric features of price-volume trajectories, as in arXiv:2406.17890. The empirical analysis, conducted on hourly cryptocurrency trading data from 80 trading pairs, shows that the globally-fitted model with signature features (GFT-Sig) achieves superior performance in both absolute and quadratic VWAP loss metrics compared to asset-specific approaches. Notably, these improvements persist for out-of-sample assets, demonstrating the model’s ability to generalize across different market conditions. The results suggest that combining global parameter sharing with signature-based feature extraction provides a scalable and robust approach to VWAP execution, offering significant practical advantages over traditional asset-specific implementations.
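
To make the two evaluation metrics concrete, here is a minimal sketch of VWAP and of absolute and quadratic tracking losses. The function names, the toy prices and volumes, and the exact loss definitions (raw price difference, no normalization) are illustrative assumptions, not taken from the paper.

```python
def vwap(prices, volumes):
    """Volume-weighted average price: sum(p_i * v_i) / sum(v_i)."""
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total volume must be positive")
    return sum(p * v for p, v in zip(prices, volumes)) / total


def vwap_losses(prices, market_volumes, allocation):
    """Absolute and quadratic deviation of the achieved price from market VWAP.

    `allocation` holds the share of the order executed in each period
    (hypothetical convention; the paper's loss may be scaled differently).
    """
    market = vwap(prices, market_volumes)
    achieved = vwap(prices, allocation)
    diff = achieved - market
    return abs(diff), diff * diff


# Toy data: four hourly bars and a naive uniform execution schedule.
prices = [100.0, 101.0, 99.0, 102.0]
market_volumes = [10.0, 30.0, 40.0, 20.0]
uniform = [0.25, 0.25, 0.25, 0.25]
abs_loss, quad_loss = vwap_losses(prices, market_volumes, uniform)
```

An allocation proportional to the realized market volumes tracks VWAP exactly in this setup. The practical difficulty is that those volumes are unknown when the schedule is chosen, so they must be predicted.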

[LG-80] Weakly-Constrained 4D Var for Downscaling with Uncertainty using Data-Driven Surrogate Models

Link: https://arxiv.org/abs/2503.02665
Authors: Philip Dinenis, Vishwas Rao, Mihai Anitescu
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Dynamic downscaling typically involves using numerical weather prediction (NWP) solvers to refine coarse data to higher spatial resolutions. Data-driven models such as FourCastNet have emerged as a promising alternative to the traditional NWP models for forecasting. Once these models are trained, they are capable of delivering forecasts in a few seconds, thousands of times faster compared to classical NWP models. However, as the lead times, and, therefore, their forecast window, increase, these models show instability in that they tend to diverge from reality. In this paper, we propose to use data assimilation approaches to stabilize them when used for downscaling tasks. Data assimilation uses information from three different sources, namely an imperfect computational model based on partial differential equations (PDE), from noisy observations, and from an uncertainty-reflecting prior. In this work, when carrying out dynamic downscaling, we replace the computationally expensive PDE-based NWP models with FourCastNet in a "weak-constrained 4DVar framework" that accounts for the implied model errors. We demonstrate the efficacy of this approach for a hurricane-tracking problem; moreover, the 4DVar framework naturally allows the expression and quantification of uncertainty. We demonstrate, using ERA5 data, that our approach performs better than the ensemble Kalman filter (EnKF) and the unstabilized FourCastNet model, both in terms of forecast accuracy and forecast uncertainty.

[LG-81] A Tight Regret Analysis of Non-Parametric Repeated Contextual Brokerage AISTATS2025

Link: https://arxiv.org/abs/2503.02646
Authors: François Bachoc, Tommaso Cesari, Roberto Colomboni
Categories: Machine Learning (stat.ML); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: AISTATS 2025

Abstract:We study a contextual version of the repeated brokerage problem. In each interaction, two traders with private valuations for an item seek to buy or sell based on the price proposed by the learner (a broker), which is informed by some contextual information. The broker’s goal is to maximize the traders’ net utility (also known as the gain from trade) by minimizing regret compared to an oracle with perfect knowledge of traders’ valuation distributions. We assume that traders’ valuations are zero-mean perturbations of the unknown item’s current market value (which can change arbitrarily from one interaction to the next) and that similar contexts will correspond to similar market prices. We analyze two feedback settings: full-feedback, where after each interaction the traders’ valuations are revealed to the broker, and limited-feedback, where only transaction attempts are revealed. For both feedback types, we propose algorithms achieving tight regret bounds. We further strengthen our performance guarantees by providing a tight 1/2-approximation result showing that the oracle that knows the traders’ valuation distributions achieves at least 1/2 of the gain from trade of the omniscient oracle that knows in advance the actual realized traders’ valuations.

[LG-82] Weight transport through spike timing for robust local gradients

Link: https://arxiv.org/abs/2503.02642
Authors: Timo Gierlich, Andreas Baumbach, Akos F. Kungl, Kevin Max, Mihai A. Petrovici
Categories: Neurons and Cognition (q-bio.NC); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 19 pages, 9 figures

Abstract:In both machine learning and in computational neuroscience, plasticity in functional neural networks is frequently expressed as gradient descent on a cost. Often, this imposes symmetry constraints that are difficult to reconcile with local computation, as is required for biological networks or neuromorphic hardware. For example, wake-sleep learning in networks characterized by Boltzmann distributions builds on the assumption of symmetric connectivity. Similarly, the error backpropagation algorithm is notoriously plagued by the weight transport problem between the representation and the error stream. Existing solutions such as feedback alignment tend to circumvent the problem by deferring to the robustness of these algorithms to weight asymmetry. However, they are known to scale poorly with network size and depth. We introduce spike-based alignment learning (SAL), a complementary learning rule for spiking neural networks, which uses spike timing statistics to extract and correct the asymmetry between effective reciprocal connections. Apart from being spike-based and fully local, our proposed mechanism takes advantage of noise. Based on an interplay between Hebbian and anti-Hebbian plasticity, synapses can thereby recover the true local gradient. This also alleviates discrepancies that arise from neuron and synapse variability – an omnipresent property of physical neuronal networks. We demonstrate the efficacy of our mechanism using different spiking network models. First, we show how SAL can significantly improve convergence to the target distribution in probabilistic spiking networks as compared to Hebbian plasticity alone. Second, in neuronal hierarchies based on cortical microcircuits, we show how our proposed mechanism effectively enables the alignment of feedback weights to the forward pathway, thus allowing the backpropagation of correct feedback errors.

[LG-83] Weighted Euclidean Distance Matrices over Mixed Continuous and Categorical Inputs for Gaussian Process Models

Link: https://arxiv.org/abs/2503.02630
Authors: Mingyu Pu, Songhao Wang, Haowei Wang, Szu Hui Ng
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Gaussian Process (GP) models are widely utilized as surrogate models in scientific and engineering fields. However, standard GP models are limited to continuous variables due to the difficulties in establishing correlation structures for categorical variables. To overcome this limitation, we introduce WEighted Euclidean distance matrices Gaussian Process (WEGP). WEGP constructs the kernel function for each categorical input by estimating the Euclidean distance matrix (EDM) among all categorical choices of this input. The EDM is represented as a linear combination of several predefined base EDMs, each scaled by a positive weight. The weights, along with other kernel hyperparameters, are inferred using a fully Bayesian framework. We analyze the predictive performance of WEGP theoretically. Numerical experiments validate the accuracy of our GP model, and by integrating WEGP into Bayesian Optimization (BO), we achieve superior performance on both synthetic and real-world optimization problems.

[LG-84] A generalized approach to label shift: the Conditional Probability Shift Model

Link: https://arxiv.org/abs/2503.02583
Authors: Paweł Teisseyre, Jan Mielniczuk
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:In many practical applications of machine learning, a discrepancy often arises between a source distribution from which labeled training examples are drawn and a target distribution for which only unlabeled data is observed. Traditionally, two main scenarios have been considered to address this issue: covariate shift (CS), where only the marginal distribution of features changes, and label shift (LS), which involves a change in the class variable’s prior distribution. However, these frameworks do not encompass all forms of distributional shift. This paper introduces a new setting, Conditional Probability Shift (CPS), which captures the case when the conditional distribution of the class variable given some specific features changes while the distribution of remaining features given the specific features and the class is preserved. For this scenario we present the Conditional Probability Shift Model (CPSM) based on modeling the class variable’s conditional probabilities using multinomial regression. Since the class variable is not observed for the target data, the parameters of the multinomial model for its distribution are estimated using the Expectation-Maximization algorithm. The proposed method is generic and can be combined with any probabilistic classifier. The effectiveness of CPSM is demonstrated through experiments on synthetic datasets and a case study using the MIMIC medical database, revealing its superior balanced classification accuracy on the target data compared to existing methods, particularly in situations of conditional distribution shift and no a priori distribution shift, which are not detected by LS-based methods.

[LG-85] The Distributionally Robust Optimization Model of Sparse Principal Component Analysis

Link: https://arxiv.org/abs/2503.02494
Authors: Lei Wang, Xin Liu, Xiaojun Chen
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We consider sparse principal component analysis (PCA) under a stochastic setting where the underlying probability distribution of the random parameter is uncertain. This problem is formulated as a distributionally robust optimization (DRO) model based on a constructive approach to capturing uncertainty in the covariance matrix, which constitutes a nonsmooth constrained min-max optimization problem. We further prove that the inner maximization problem admits a closed-form solution, reformulating the original DRO model into an equivalent minimization problem on the Stiefel manifold. This transformation leads to a Riemannian optimization problem with intricate nonsmooth terms, a challenging formulation beyond the reach of existing algorithms. To address this issue, we devise an efficient smoothing manifold proximal gradient algorithm. We prove the Riemannian gradient consistency and global convergence of our algorithm to a stationary point of the nonsmooth minimization problem. Moreover, we establish the iteration complexity of our algorithm. Finally, numerical experiments are conducted to validate the effectiveness and scalability of our algorithm, as well as to highlight the necessity and rationality of adopting the DRO model for sparse PCA.

[LG-86] Decentralized Reinforcement Learning for Multi-Agent Multi-Resource Allocation via Dynamic Cluster Agreements

Link: https://arxiv.org/abs/2503.02437
Authors: Antonio Marino (UR, RAINBOW), Esteban Restrepo (RAINBOW, CNRS), Claudio Pacchierotti (RAINBOW, CNRS), Paolo Robuffo Giordano (RAINBOW, CNRS)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:This paper addresses the challenge of allocating heterogeneous resources among multiple agents in a decentralized manner. Our proposed method, LGTC-IPPO, builds upon Independent Proximal Policy Optimization (IPPO) by integrating dynamic cluster consensus, a mechanism that allows agents to form and adapt local sub-teams based on resource demands. This decentralized coordination strategy reduces reliance on global information and enhances scalability. We evaluate LGTC-IPPO against standard multi-agent reinforcement learning baselines and a centralized expert solution across a range of team sizes and resource distributions. Experimental results demonstrate that LGTC-IPPO achieves more stable rewards, better coordination, and robust performance even as the number of agents or resource types increases. Additionally, we illustrate how dynamic clustering enables agents to reallocate resources efficiently also for scenarios with discharging resources.

[LG-87] Wyckoff Transformer: Generation of Symmetric Crystals

Link: https://arxiv.org/abs/2503.02407
Authors: Nikita Kazeev, Wei Nong, Ignat Romanov, Ruiming Zhu, Andrey Ustyuzhanin, Shuya Yamazaki, Kedar Hippalgaonkar
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: this https URL

Abstract:Symmetry rules that atoms obey when they bond together to form an ordered crystal play a fundamental role in determining their physical, chemical, and electronic properties such as electrical and thermal conductivity, optical and polarization behavior, and mechanical strength. Almost all known crystalline materials have internal symmetry. Consistently generating stable crystal structures is still an open challenge, specifically because such symmetry rules are not accounted for. To address this issue, we propose WyFormer, a generative model for materials conditioned on space group symmetry. We use Wyckoff positions as the basis for an elegant, compressed, and discrete structure representation. To model the distribution, we develop a permutation-invariant autoregressive model based on the Transformer and an absence of positional encoding. WyFormer has a unique and powerful synergy of attributes, proven by extensive experimentation: best-in-class symmetry-conditioned generation, physics-motivated inductive bias, competitive stability of the generated structures, competitive material property prediction quality, and unparalleled inference speed.

[LG-88] Sharpness-Aware Minimization: General Analysis and Improved Rates ICLR2025

Link: https://arxiv.org/abs/2503.02225
Authors: Dimitris Oikonomou, Nicolas Loizou
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 13th International Conference on Learning Representations (ICLR 2025)

Abstract:Sharpness-Aware Minimization (SAM) has emerged as a powerful method for improving generalization in machine learning models by minimizing the sharpness of the loss landscape. However, despite its success, several important questions regarding the convergence properties of SAM in non-convex settings are still open, including the benefits of using normalization in the update rule, the dependence of the analysis on the restrictive bounded variance assumption, and the convergence guarantees under different sampling strategies. To address these questions, in this paper, we provide a unified analysis of SAM and its unnormalized variant (USAM) under one single flexible update rule (Unified SAM), and we present convergence results of the new algorithm under a relaxed and more natural assumption on the stochastic noise. Our analysis provides convergence guarantees for SAM under different step size selections for non-convex problems and functions that satisfy the Polyak-Lojasiewicz (PL) condition (a non-convex generalization of strongly convex functions). The proposed theory holds under the arbitrary sampling paradigm, which includes importance sampling as a special case, allowing us to analyze variants of SAM that were never explicitly considered in the literature. Experiments validate the theoretical findings and further demonstrate the practical effectiveness of Unified SAM in training deep neural networks for image classification tasks.
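
The difference between SAM and its unnormalized variant comes down to how the ascent perturbation is scaled. The sketch below shows the two textbook updates behind a single `normalize` flag. It is not the paper's Unified SAM rule, and the toy quadratic, step size, and radius `rho` are assumptions chosen only for illustration.

```python
import math


def sam_step(x, grad_fn, lr, rho, normalize=True):
    """One SAM (normalize=True) or USAM (normalize=False) step on a list of floats."""
    g = grad_fn(x)
    if normalize:
        norm = math.sqrt(sum(gi * gi for gi in g))
        # SAM: perturb by rho along the unit gradient direction.
        scale = rho / norm if norm > 0.0 else 0.0
    else:
        # USAM: perturb by rho times the raw gradient.
        scale = rho
    # Ascent step to the perturbed point, then descend using its gradient.
    x_adv = [xi + scale * gi for xi, gi in zip(x, g)]
    g_adv = grad_fn(x_adv)
    return [xi - lr * gi for xi, gi in zip(x, g_adv)]


def quad_grad(x):
    # Gradient of the toy objective f(x) = 0.5 * (x0^2 + 10 * x1^2).
    return [x[0], 10.0 * x[1]]


def run(normalize, steps=200):
    x = [3.0, -2.0]
    for _ in range(steps):
        x = sam_step(x, quad_grad, lr=0.05, rho=0.05, normalize=normalize)
    return x
```

On this toy problem the unnormalized variant contracts all the way to the minimizer. The normalized variant keeps a perturbation of fixed length `rho`, so with a constant step size it settles into a small neighborhood of the minimizer rather than at it.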

[LG-89] Towards Heisenberg limit without critical slowing down via quantum reinforcement learning

Link: https://arxiv.org/abs/2503.02210
Authors: Hang Xu, Tailong Xiao, Jingzheng Huang, Ming He, Jianping Fan, Guihua Zeng
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Critical ground states of quantum many-body systems have emerged as vital resources for quantum-enhanced sensing. Traditional methods to prepare these states often rely on adiabatic evolution, which may diminish the quantum sensing advantage. In this work, we propose a quantum reinforcement learning (QRL)-enhanced critical sensing protocol for quantum many-body systems with exotic phase diagrams. Starting from product states and utilizing QRL-discovered gate sequences, we explore sensing accuracy in the presence of unknown external magnetic fields, covering both local and global regimes. Our results demonstrate that QRL-learned sequences reach the finite quantum speed limit and generalize effectively across systems of arbitrary size, ensuring accuracy regardless of preparation time. This method can robustly achieve Heisenberg and super-Heisenberg limits, even in noisy environments with practical Pauli measurements. Our study highlights the efficacy of QRL in enabling precise quantum state preparation, thereby advancing scalable, high-accuracy quantum critical sensing.

[LG-90] Online Inference for Quantiles by Constant Learning-Rate Stochastic Gradient Descent

Link: https://arxiv.org/abs/2503.02178
Authors: Ziyang Wei, Jiaqi Li, Likai Chen, Wei Biao Wu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 27 pages, 3 figures

Abstract:This paper proposes an online inference method for stochastic gradient descent (SGD) with a constant learning rate for quantile loss functions, with theoretical guarantees. Since the quantile loss function is neither smooth nor strongly convex, we view such SGD iterates as an irreducible and positive recurrent Markov chain. By leveraging this interpretation, we show the existence of a unique asymptotic stationary distribution, regardless of the arbitrarily fixed initialization. To characterize the exact form of this limiting distribution, we derive bounds for its moment generating function and tail probabilities, controlling the first and second moments of SGD iterates. By these techniques, we prove that the stationary distribution converges to a Gaussian distribution as the constant learning rate \eta \rightarrow 0. Our findings provide the first central limit theorem (CLT)-type theoretical guarantees for the last iterate of constant learning-rate SGD in non-smooth and non-strongly convex settings. We further propose a recursive algorithm to construct confidence intervals of SGD iterates in an online manner. Numerical studies demonstrate strong finite-sample performance of our proposed quantile estimator and inference method. The theoretical tools in this study are of independent interest to investigate general transition kernels in Markov chains.
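
The iteration under study is short enough to write out. Below is a minimal sketch of constant learning-rate SGD on the pinball (quantile) loss for a one-dimensional stream. The function name, the Gaussian toy stream, and the running average (included only for comparison) are assumptions; the paper's confidence-interval construction is not reproduced here.

```python
import random


def quantile_sgd(stream, tau, eta, theta0=0.0):
    """Constant learning-rate SGD on the pinball (quantile) loss.

    Returns the last iterate and the running average of the iterates.
    """
    theta = theta0
    avg = 0.0
    for t, x in enumerate(stream, start=1):
        # Subgradient of the pinball loss in theta: 1{x <= theta} - tau.
        grad = (1.0 if x <= theta else 0.0) - tau
        theta -= eta * grad
        avg += (theta - avg) / t
    return theta, avg


# Toy stream: standard normal draws, whose median (tau = 0.5) is 0.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
last, averaged = quantile_sgd(samples, tau=0.5, eta=0.01)
```

Each step moves the iterate by either `eta * tau` or `eta * (1 - tau)`, so with a fixed `eta` the last iterate keeps fluctuating around the target quantile instead of converging to it. That persistent fluctuation is the stationary distribution the abstract refers to.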

[LG-91] Integrated Computation and Communication with Fiber-optic Transmissions

Link: https://arxiv.org/abs/2503.02165
Authors: Jiahao Zhang, Lu Zhang, Xiaodan Pang, Oskars Ozolins, Qun Zhang, Xianbin Yu
Categories: Optics (physics.optics); Machine Learning (cs.LG)

Abstract:Fiber-optic transmission systems are leveraged not only as high-speed communication channels but also as nonlinear kernel functions for machine learning computations, enabling the seamless integration of computational intelligence and communication.

[LG-92] Gradient-free stochastic optimization for additive models

Link: https://arxiv.org/abs/2503.02131
Authors: Arya Akhavan, Alexandre B. Tsybakov
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We address the problem of zero-order optimization from noisy observations for an objective function satisfying the Polyak-Łojasiewicz or the strong convexity condition. Additionally, we assume that the objective function has an additive structure and satisfies a higher-order smoothness property, characterized by the Hölder family of functions. The additive model for Hölder classes of functions is well-studied in the literature on nonparametric function estimation, where it is shown that such a model benefits from a substantial improvement of the estimation accuracy compared to the Hölder model without additive structure. We study this established framework in the context of gradient-free optimization. We propose a randomized gradient estimator that, when plugged into a gradient descent algorithm, allows one to achieve minimax optimal optimization error of the order dT^{-(\beta-1)/\beta}, where d is the dimension of the problem, T is the number of queries and \beta \ge 2 is the Hölder degree of smoothness. We conclude that, in contrast to nonparametric estimation problems, no substantial gain of accuracy can be achieved when using additive models in gradient-free optimization.
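
For readers unfamiliar with gradient-free methods, the following sketch shows a generic two-point randomized gradient estimator plugged into gradient descent. It is the standard sphere-sampling estimator rather than the estimator proposed in the paper, the function evaluations are noiseless for simplicity, and the toy additive objective and all parameter values are assumptions.

```python
import math
import random


def random_unit_vector(d, rng):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def two_point_gradient(f, x, h, rng):
    """Randomized two-point estimate: (d / 2h) * (f(x + h u) - f(x - h u)) * u."""
    d = len(x)
    u = random_unit_vector(d, rng)
    plus = f([xi + h * ui for xi, ui in zip(x, u)])
    minus = f([xi - h * ui for xi, ui in zip(x, u)])
    coeff = d * (plus - minus) / (2.0 * h)
    return [coeff * ui for ui in u]


def additive_objective(x):
    # Toy additive function: a sum of one-dimensional terms, minimized at all ones.
    return sum((xi - 1.0) ** 2 for xi in x)


def zero_order_descent(f, x0, steps, lr, h, seed=0):
    """Gradient descent that only queries function values, never gradients."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = two_point_gradient(f, x, h, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


x_final = zero_order_descent(additive_objective, [0.0] * 5, steps=2000, lr=0.02, h=1e-3)
```

Each iteration costs two function queries. The factor `d` in the estimator compensates for probing a single random direction per step, which is where the linear dependence on dimension in such error bounds comes from.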

[LG-93] Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification

Link: https://arxiv.org/abs/2503.02110
Authors: Xiaohan Zhu, Nathan Srebro
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We provide a complete characterization of the entire regularization curve of a modified two-part-code Minimum Description Length (MDL) learning rule for binary classification, based on an arbitrary prior or description language. \citet{GL} previously established the lack of asymptotic consistency, from an agnostic PAC (frequentist worst case) perspective, of the MDL rule with a penalty parameter of \lambda = 1, suggesting that it under-regularizes. Driven by interest in understanding how benign or catastrophic under-regularization and overfitting might be, we obtain a precise quantitative description of the worst case limiting error as a function of the regularization parameter \lambda and noise level (or approximation error), significantly tightening the analysis of \citeauthor{GL} for \lambda = 1 and extending it to all other choices of \lambda.

[LG-94] RiboGen: RNA Sequence and Structure Co-Generation with Equivariant MultiFlow

Link: https://arxiv.org/abs/2503.02058
Authors: Dana Rubin, Allan dos Santos Costa, Manvitha Ponnapati, Joseph Jacobson
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 5 pages

Abstract:Ribonucleic acid (RNA) plays fundamental roles in biological systems, from carrying genetic information to performing enzymatic function. Understanding and designing RNA can enable novel therapeutic application and biotechnological innovation. To enhance RNA design, in this paper we introduce RiboGen, the first deep learning model to simultaneously generate RNA sequence and all-atom 3D structure. RiboGen leverages the standard Flow Matching with Discrete Flow Matching in a multimodal data representation. RiboGen is based on Euclidean Equivariant neural networks for efficiently processing and learning three-dimensional geometry. Our experiments show that RiboGen can efficiently generate chemically plausible and self-consistent RNA samples. Our results suggest that co-generation of sequence and structure is a competitive approach for modeling RNA.

[LG-95] A General Neural Network Potential for Energetic Materials with C, H, N, and O elements

Link: https://arxiv.org/abs/2503.01932
Authors: Mingjie Wen, Jiahe Han, Wenjuan Li, Xiaoya Chang, Qingzhao Chu, Dongping Chen
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 41 pages, 16 figures

Abstract:The discovery and optimization of high-energy materials (HEMs) are constrained by the prohibitive computational expense and prolonged development cycles inherent in conventional approaches. In this work, we develop a general neural network potential (NNP) that efficiently predicts the structural, mechanical, and decomposition properties of HEMs composed of C, H, N, and O. Our framework leverages pre-trained NNP models, fine-tuned using transfer learning on energy and force data derived from density functional theory (DFT) calculations. This strategy enables rapid adaptation across 20 different HEM systems while maintaining DFT-level accuracy, significantly reducing computational costs. A key aspect of this work is the ability of the NNP model to capture the chemical activity space of HEMs and to accurately describe the key atomic interactions and reaction mechanisms during thermal decomposition. The general NNP model has been applied in molecular dynamics (MD) simulations and validated with experimental data for various HEM structures. Results show that the NNP model accurately predicts the structural, mechanical, and decomposition properties of HEMs by effectively describing their chemical activity space. Compared to traditional force fields, it offers superior DFT-level accuracy and generalization across both microscopic and macroscopic properties, reducing the computational and experimental costs. This work provides an efficient strategy for the design and development of HEMs and proposes a promising framework for integrating DFT, machine learning, and experimental methods in materials research. (To facilitate further research and practical applications, we open-source our NNP model on GitHub: this https URL.)

[LG-96] Towards Environment-Sensitive Molecular Inference via Mixed Integer Linear Programming

Link: https://arxiv.org/abs/2503.01849
Authors: Jianshen Zhu, Mao Takekida, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)

Abstract:Traditional QSAR/QSPR and inverse QSAR/QSPR methods often assume that chemical properties are dictated by single molecules, overlooking the influence of molecular interactions and environmental factors. In this paper, we introduce a novel QSAR/QSPR framework that can capture the combined effects of multiple molecules (e.g., small molecules or polymers) and experimental conditions on property values. We design a feature function to integrate the information of multiple molecules and the environment. Specifically, for the Flory-Huggins \chi-parameter, which characterizes the thermodynamic properties between the solute and the solvent and varies with temperature, we demonstrate through computational experiments that our approach can achieve a competitively high learning performance compared to existing works on predicting \chi-parameter values, while inferring the solute polymers with up to 50 non-hydrogen atoms in their monomer forms in a relatively short time. A comparison study with the simulation software J-OCTA demonstrates that the polymers inferred by our methods are of high quality.

Information Retrieval

[IR-0] Zero-Shot Complex Question-Answering on Long Scientific Documents AAAI2025

Link: https://arxiv.org/abs/2503.02695
Authors: Wanting Wang
Categories: Information Retrieval (cs.IR)
Comments: AAAI 2025 Workshop on Document Understanding and Intelligence

Abstract:With the rapid development in Transformer-based language models, the reading comprehension tasks on short documents and simple questions have been largely addressed. Long documents, specifically the scientific documents that are densely packed with knowledge discovered and developed by humans, remain relatively unexplored. These documents often come with a set of complex and more realistic questions, adding to their complexity. We present a zero-shot pipeline framework that enables social science researchers to perform question-answering tasks that are complex yet of predetermined question formats on full-length research papers without requiring machine learning expertise. Our approach integrates pre-trained language models to handle challenging scenarios including multi-span extraction, multi-hop reasoning, and long-answer generation. Evaluating on MLPsych, a novel dataset of social psychology papers with annotated complex questions, we demonstrate that our framework achieves strong performance through combination of extractive and generative models. This work advances document understanding capabilities for social sciences while providing practical tools for researchers.

[IR-1] Towards Robust Expert Finding in Community Question Answering Platforms

Link: https://arxiv.org/abs/2503.02674
Authors: Maddalena Amendola,Andrea Passarella,Raffaele Perego
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:This paper introduces TUEF, a topic-oriented user-interaction model for fair Expert Finding in Community Question Answering (CQA) platforms. The Expert Finding task in CQA platforms involves identifying proficient users capable of providing accurate answers to questions from the community. To this end, TUEF improves the robustness and credibility of the CQA platform through a more precise Expert Finding component. The key idea of TUEF is to exploit diverse types of information, specifically content and social information, to identify experts more precisely, thus improving the robustness of the task. We assess TUEF through reproducible experiments conducted on a large-scale dataset from StackOverflow. The results consistently demonstrate that TUEF outperforms state-of-the-art competitors while promoting transparent expert identification.

[IR-2] Personalized Generation In Large Model Era: A Survey

Link: https://arxiv.org/abs/2503.02614
Authors: Yiyan Xu,Jinghao Zhang,Alireza Salemi,Xinting Hu,Wenjie Wang,Fuli Feng,Hamed Zamani,Xiangnan He,Tat-Seng Chua
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:In the era of large models, content generation is gradually shifting to Personalized Generation (PGen), tailoring content to individual preferences and needs. This paper presents the first comprehensive survey on PGen, investigating existing research in this rapidly growing field. We conceptualize PGen from a unified perspective, systematically formalizing its key components, core objectives, and abstract workflows. Based on this unified perspective, we propose a multi-level taxonomy, offering an in-depth review of technical advancements, commonly used datasets, and evaluation metrics across multiple modalities, personalized contexts, and tasks. Moreover, we envision the potential applications of PGen and highlight open challenges and promising directions for future exploration. By bridging PGen research across multiple modalities, this survey serves as a valuable resource for fostering knowledge sharing and interdisciplinary collaboration, ultimately contributing to a more personalized digital landscape.

[IR-3] Efficient Long Sequential Low-rank Adaptive Attention for Click-through rate Prediction

Link: https://arxiv.org/abs/2503.02542
Authors: Xin Song,Xiaochen Li,Jinxin Hu,Hong Wen,Zulong Chen,Yu Zhang,Xiaoyi Zeng,Zhang Jing
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:In the context of burgeoning user historical behavior data, accurate click-through rate (CTR) prediction requires effective modeling of lengthy user behavior sequences. As the volume of such data keeps swelling, the focus of research has shifted towards developing effective long-term behavior modeling methods to capture latent user interests. Nevertheless, the complexity introduced by large-scale data brings about computational hurdles. There is a pressing need to strike a balance between achieving high model performance and meeting the strict response time requirements of online services. While existing retrieval-based methods (e.g., similarity filtering or attention approximation) achieve practical runtime efficiency, they inherently compromise information fidelity through aggressive sequence truncation or attention sparsification. This paper presents a novel attention mechanism that overcomes the shortcomings of existing methods while ensuring computational efficiency. The mechanism learns a compressed representation of a sequence of length $L$ via low-rank projection matrices (rank $r \ll L$), reducing attention complexity from $O(L)$ to $O(r)$. It also integrates a uniquely designed loss function to preserve the nonlinearity of attention. In the inference stage, the mechanism adopts matrix absorption and prestorage strategies, which enable it to effectively satisfy online constraints. Comprehensive offline and online experiments demonstrate that the proposed method outperforms current state-of-the-art solutions.
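
The compression idea in this abstract can be illustrated with a minimal sketch. It is a generic Linformer-style low-rank attention, not the authors' implementation: the random projection stands in for learned matrices, and the custom loss, matrix absorption, and prestorage steps are not shown. All function names and parameter values here are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress(sequence, projection):
    """Project an L x d sequence down to r x d using an r x L matrix."""
    return matmul(projection, sequence)


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def demo(L=200, r=8, d=4, seed=0):
    """Compress L behavior embeddings to r rows, then attend over the r rows."""
    rng = random.Random(seed)
    behaviors = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
    # Assumption: a random r x L matrix stands in for a learned projection.
    projection = [[rng.gauss(0, 1) / math.sqrt(L) for _ in range(L)]
                  for _ in range(r)]
    compressed = compress(behaviors, projection)  # r x d, can be computed ahead of time
    query = [rng.gauss(0, 1) for _ in range(d)]   # e.g. a candidate item embedding
    return compressed, attend(query, compressed, compressed)
```

In this sketch the per-query attention step scores `r` compressed rows instead of `L` raw behaviors, which is where the reduction from $O(L)$ to $O(r)$ would come from if the compressed rows are stored in advance.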

[IR-4] Towards Explainable Doctor Recommendation with Large Language Models ALT

Link: https://arxiv.org/abs/2503.02298
Authors: Ziyang Zeng,Dongyuan Li,Yuqing Yang
Categories: Information Retrieval (cs.IR)
Comments: 12 pages, 6 figures, Journal of Biomedical and Health Informatics (JBHI) under review

Abstract:The advent of internet medicine provides patients with unprecedented convenience in searching for and communicating with doctors relevant to their diseases and desired treatments online. However, current doctor recommendation systems fail to fully ensure the professionalism and interpretability of the recommended results. In this work, we formulate doctor recommendation as a ranking task and develop a large language model (LLM)-based pointwise ranking framework. Our framework ranks doctors according to their relevance to specific disease-treatment pairs in a zero-shot setting. The advantage of our framework lies in its ability to generate precise and explainable doctor ranking results. Additionally, we construct DrRank, a new expertise-driven doctor ranking dataset comprising over 38 disease-treatment pairs. Experimental results on the DrRank dataset demonstrate that our framework significantly outperforms the strongest cross-encoder baseline, achieving a notable gain of +5.45 in the NDCG@10 score while maintaining affordable latency. Furthermore, we comprehensively present the fairness analysis results of our framework from three perspectives: disease, patient gender, and geographical region. Meanwhile, the interpretability of our framework is rigorously verified by three human experts, providing further evidence of the reliability of our proposed framework for doctor recommendation.
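
For readers unfamiliar with the NDCG@10 metric cited in this abstract, here is a minimal sketch of one common definition. It assumes linear gain and a log2 rank discount, and it builds the ideal ranking from the same candidate list; the abstract does not say which variant the paper uses, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items, in ranked order."""
    # Rank 1 gets discount log2(2) = 1, rank 2 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` holds the graded relevance labels of the ranked doctors, best-ranked first. A ranking that already lists the most relevant doctors first scores 1.0, and placing relevant doctors lower reduces the score.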

[IR-5] Tailoring Table Retrieval from a Field-aware Hybrid Matching Perspective

Link: https://arxiv.org/abs/2503.02251
Authors: Da Li,Keping Bi,Jiafeng Guo,Xueqi Cheng
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Table retrieval, essential for accessing information through tabular data, is less explored compared to text retrieval. The row/column structure and distinct fields of tables (including titles, headers, and cells) present unique challenges. For example, different table fields have varying matching preferences: cells may favor finer-grained (word/phrase level) matching over broader (sentence/passage level) matching due to their fragmented and detailed nature, unlike titles. This necessitates a table-specific retriever to accommodate the various matching needs of each table field. Therefore, we introduce a Table-tailored HYbrid Matching rEtriever (THYME), which approaches table retrieval from a field-aware hybrid matching perspective. Empirical results on two table retrieval benchmarks, NQ-TABLES and OTT-QA, show that THYME significantly outperforms state-of-the-art baselines. Comprehensive analyses confirm the differing matching preferences across table fields and validate the design of THYME.

Attachment Download

Click to download today's full paper list
